Session 2 of the four-session plan. Lands two operational pieces: the Claude Code slash command that makes AtoCore reachable from inside any Claude Code session, and the full backup/restore procedure doc that turns the backup endpoint code into a real operational drill.

Slash command (.claude/commands/atocore-context.md)
---------------------------------------------------
- Project-level slash command following the standard frontmatter format (description + argument-hint)
- Parses the user prompt and an optional trailing project id, with case-insensitive matching against the registered project ids (atocore, p04-gigabit, p05-interferometer, p06-polisher and their aliases)
- Calls POST /context/build on the live AtoCore service, defaulting to http://dalidou:8100 (overridable via the ATOCORE_API_BASE env var)
- Renders the formatted context pack inline so the user can see exactly what AtoCore would feed an LLM, plus a stats banner and a per-chunk source list
- Includes graceful failure handling for network errors, 4xx, 5xx, and the empty-result case
- Defines a future capture path that POSTs to /interactions for the Phase 9 reflection loop. The current command leaves capture as manual / opt-in pending a clean post-turn hook design

.gitignore changes
------------------
- Replaced the wholesale .claude/ ignore with .claude/* plus exceptions for .claude/commands/ so project slash commands can be tracked
- Other .claude/* paths (worktrees, settings, local state) remain ignored

Backup-restore procedure (docs/backup-restore-procedure.md)
-----------------------------------------------------------
- Defines what gets backed up (SQLite + registry always, Chroma optional under the ingestion lock) and what doesn't (sources, code, logs, cache, tmp)
- Documents the snapshot directory layout and the timestamp format
- Three trigger paths in priority order:
  - via POST /admin/backup with {include_chroma: true|false}
  - via the standalone src/atocore/ops/backup.py module
  - via cold filesystem copy with brief downtime as last resort
- Listing and validation procedure with the /admin/backup and /admin/backup/{stamp}/validate endpoints
- Full step-by-step restore procedure with a mandatory pre-flight safety snapshot, ownership/permission requirements, and post-restore verification checks
- Rollback path using the pre-restore safety copy
- Retention policy (last 7 daily / 4 weekly / 6 monthly) and explicit acknowledgment that the cleanup job is not yet implemented
- Drill schedule: quarterly full restore drill, post-migration drill, post-incident validation
- Common failure mode table with diagnoses
- Quickstart cheat sheet at the end for daily reference
- Open follow-ups: cleanup script, off-Dalidou target, encryption, automatic post-backup validation, incremental Chroma snapshots

The procedure has not yet been exercised against the live Dalidou instance — that is the next step the user runs themselves once the slash command is in place.
AtoCore Backup and Restore Procedure
====================================

Scope
-----
This document defines the operational procedure for backing up and
restoring AtoCore's machine state on the Dalidou deployment. It is
the practical companion to docs/backup-strategy.md (which defines
the strategy) and src/atocore/ops/backup.py (which implements the
mechanics).
The intent is that this procedure can be followed by anyone with SSH access to Dalidou and the AtoCore admin endpoints.
What gets backed up
-------------------
A create_runtime_backup snapshot contains, in order of importance:

| Artifact | Source path on Dalidou | Backup destination | Always included |
|---|---|---|---|
| SQLite database | /srv/storage/atocore/data/db/atocore.db | <backup_root>/db/atocore.db | yes |
| Project registry JSON | /srv/storage/atocore/config/project-registry.json | <backup_root>/config/project-registry.json | yes (if file exists) |
| Backup metadata | (generated) | <backup_root>/backup-metadata.json | yes |
| Chroma vector store | /srv/storage/atocore/data/chroma/ | <backup_root>/chroma/ | only when include_chroma=true |
The SQLite snapshot uses the online conn.backup() API and is safe
to take while the database is in use. The Chroma snapshot is a cold
directory copy and is only safe when no ingestion is running;
the API endpoint enforces this by acquiring the ingestion lock for
the duration of the copy.
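The safety of the hot SQLite snapshot comes from the online backup API. The sqlite3 CLI exposes the same API through its .backup dot-command, which makes the property easy to see in isolation — a sketch against throwaway files under /tmp, not AtoCore's real paths:

```shell
# Illustrative only: the sqlite3 CLI's .backup command uses the same
# online backup API as conn.backup(), so it is safe against a live DB.
sqlite3 /tmp/live.db "CREATE TABLE IF NOT EXISTS t(x); INSERT INTO t VALUES (1);"
sqlite3 /tmp/live.db ".backup /tmp/snapshot.db"

# The snapshot is a consistent database in its own right:
sqlite3 /tmp/snapshot.db "PRAGMA integrity_check;"   # prints "ok" on a healthy snapshot
```

The Chroma cold copy has no equivalent of this API, which is exactly why it needs the ingestion lock.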
What is not in the backup:
- Source documents under /srv/storage/atocore/sources/vault/ and /srv/storage/atocore/sources/drive/. These are read-only inputs and live in the user's PKM/Drive, which are backed up separately by their own systems.
- Application code. The container image is the source of truth for code; recovery means rebuilding the image, not restoring code from a backup.
- Logs under /srv/storage/atocore/logs/.
- Embeddings cache under /srv/storage/atocore/data/cache/.
- Temp files under /srv/storage/atocore/data/tmp/.
Backup root layout
------------------
Each backup snapshot lives in its own timestamped directory:

```
/srv/storage/atocore/backups/snapshots/
├── 20260407T060000Z/
│   ├── backup-metadata.json
│   ├── db/
│   │   └── atocore.db
│   ├── config/
│   │   └── project-registry.json
│   └── chroma/            # only if include_chroma=true
│       └── ...
├── 20260408T060000Z/
│   └── ...
└── ...
```

The timestamp is UTC, format YYYYMMDDTHHMMSSZ.
Triggering a backup
-------------------

Option A — via the admin endpoint (preferred)

```shell
# DB + registry only (fast, safe at any time)
curl -fsS -X POST http://dalidou:8100/admin/backup \
  -H "Content-Type: application/json" \
  -d '{"include_chroma": false}'

# DB + registry + Chroma (acquires ingestion lock)
curl -fsS -X POST http://dalidou:8100/admin/backup \
  -H "Content-Type: application/json" \
  -d '{"include_chroma": true}'
The response is the backup metadata JSON. Save the backup_root
field — that's the directory the snapshot was written to.
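For scripted use, that field can be pulled straight out of the metadata JSON. A minimal sketch, shown against a canned response so it runs standalone; in real use RESPONSE would come from the curl above:

```shell
# In real use, replace the canned RESPONSE with:
#   RESPONSE=$(curl -fsS -X POST http://dalidou:8100/admin/backup \
#     -H "Content-Type: application/json" -d '{"include_chroma": false}')
RESPONSE='{"backup_root": "/srv/storage/atocore/backups/snapshots/20260407T060000Z"}'
BACKUP_ROOT=$(printf '%s' "$RESPONSE" | jq -r '.backup_root')
echo "snapshot written to: $BACKUP_ROOT"
```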
Option B — via the standalone script (when the API is down)

```shell
docker exec atocore python -m atocore.ops.backup
```

This runs create_runtime_backup() directly, without going through the API or the ingestion lock. Use it only when the AtoCore service itself is unhealthy and you can't hit the admin endpoint.
Option C — manual file copy (last resort)

If both the API and the standalone script are unusable:

```shell
sudo systemctl stop atocore   # or: docker compose stop atocore
STAMP=$(date -u +%Y%m%dT%H%M%SZ)   # capture once so both files share a stamp
sudo cp /srv/storage/atocore/data/db/atocore.db \
  /srv/storage/atocore/backups/manual-$STAMP.db
sudo cp /srv/storage/atocore/config/project-registry.json \
  /srv/storage/atocore/backups/manual-$STAMP.registry.json
sudo systemctl start atocore
```

This is a cold backup and requires brief downtime.
Listing backups
---------------

```shell
curl -fsS http://dalidou:8100/admin/backup
```

Returns the configured backup_dir and a list of all snapshots under it, with their full metadata if available.

Or, on the host directly:

```shell
ls -la /srv/storage/atocore/backups/snapshots/
```
Validating a backup
-------------------
Before relying on a backup for restore, validate it:

```shell
curl -fsS http://dalidou:8100/admin/backup/20260407T060000Z/validate
```

The validator:
- confirms the snapshot directory exists
- opens the SQLite snapshot and runs PRAGMA integrity_check
- parses the registry JSON
- confirms the Chroma directory exists (if it was included)

A valid backup returns "valid": true and an empty errors array. A failing validation returns "valid": false with one or more specific error strings (e.g. db_integrity_check_failed, registry_invalid_json, chroma_snapshot_missing).

Validate every backup at creation time. A backup that has never been validated is not actually a backup — it's just a hopeful copy of bytes.
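That rule is easy to automate. A minimal sketch of a validation gate, with the pass/fail logic factored into a function so it can be demonstrated against canned validator output (assert_valid is an invented name, not part of AtoCore):

```shell
# Illustration: fail loudly unless the validator reported "valid": true.
# In real use, feed this the output of:
#   curl -fsS http://dalidou:8100/admin/backup/$STAMP/validate
assert_valid() {
  if [ "$(jq -r '.valid')" = "true" ]; then
    echo "backup OK"
  else
    echo "backup INVALID" >&2
    return 1
  fi
}

echo '{"valid": true, "errors": []}' | assert_valid   # → backup OK
```

Wiring this into the daily backup path means a bad snapshot fails the cron job instead of sitting undetected until restore day.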
Restore procedure
-----------------

Pre-flight (always)

- Identify which snapshot you want to restore. List available snapshots and pick by timestamp:

  ```shell
  curl -fsS http://dalidou:8100/admin/backup | jq '.backups[].stamp'
  ```

- Validate it. Refuse to restore an invalid backup:

  ```shell
  STAMP=20260407T060000Z
  curl -fsS http://dalidou:8100/admin/backup/$STAMP/validate | jq .
  ```

- Stop AtoCore. SQLite cannot be hot-restored under a running process, and Chroma will not pick up new files until the process restarts.

  ```shell
  docker compose stop atocore   # or: sudo systemctl stop atocore
  ```

- Take a safety snapshot of the current state before overwriting it. This is your "if the restore makes things worse, here's the undo" backup.

  ```shell
  PRESERVE_STAMP=$(date -u +%Y%m%dT%H%M%SZ)
  sudo cp /srv/storage/atocore/data/db/atocore.db \
    /srv/storage/atocore/backups/pre-restore-$PRESERVE_STAMP.db
  sudo cp /srv/storage/atocore/config/project-registry.json \
    /srv/storage/atocore/backups/pre-restore-$PRESERVE_STAMP.registry.json 2>/dev/null || true
  ```
Restore the SQLite database

```shell
SNAPSHOT_DIR=/srv/storage/atocore/backups/snapshots/$STAMP
sudo cp $SNAPSHOT_DIR/db/atocore.db \
  /srv/storage/atocore/data/db/atocore.db
sudo chown 1000:1000 /srv/storage/atocore/data/db/atocore.db
sudo chmod 600 /srv/storage/atocore/data/db/atocore.db
```

The chown should match the gitea/atocore container user. Verify by checking the existing perms before overwriting:

```shell
stat -c '%U:%G %a' /srv/storage/atocore/data/db/atocore.db
```
Restore the project registry

```shell
if [ -f $SNAPSHOT_DIR/config/project-registry.json ]; then
  sudo cp $SNAPSHOT_DIR/config/project-registry.json \
    /srv/storage/atocore/config/project-registry.json
  sudo chown 1000:1000 /srv/storage/atocore/config/project-registry.json
  sudo chmod 644 /srv/storage/atocore/config/project-registry.json
fi
```

If the snapshot does not contain a registry, the current registry is preserved. The pre-flight safety copy still gives you a recovery path if you need to roll back.
Restore the Chroma vector store (if it was in the snapshot)

```shell
if [ -d $SNAPSHOT_DIR/chroma ]; then
  # Move the current chroma dir aside as a safety copy
  sudo mv /srv/storage/atocore/data/chroma \
    /srv/storage/atocore/data/chroma.pre-restore-$PRESERVE_STAMP
  # Copy the snapshot in
  sudo cp -a $SNAPSHOT_DIR/chroma /srv/storage/atocore/data/chroma
  sudo chown -R 1000:1000 /srv/storage/atocore/data/chroma
fi
```

If the snapshot does NOT contain a Chroma dir but the SQLite restore would leave the vector store and the SQL store inconsistent (e.g. SQL has chunks the vector store doesn't), you have two options:

- Option 1: rebuild the vector store from source documents. Run ingestion fresh after the SQL restore. This regenerates embeddings from the actual source files. Slow but produces a perfectly consistent state.
- Option 2: accept the inconsistency and live with stale-vector filtering. The retriever already drops vector results whose SQL row no longer exists (the _existing_chunk_ids filter), so the inconsistency surfaces as missing results, not bad ones.

For an unplanned restore, Option 2 is the right immediate move. Then schedule a fresh ingestion pass to rebuild the vector store properly.
Restart AtoCore

```shell
docker compose up -d atocore
# or: sudo systemctl start atocore
```
Post-restore verification

```shell
# 1. Service is healthy
curl -fsS http://dalidou:8100/health | jq .

# 2. Stats look right
curl -fsS http://dalidou:8100/stats | jq .

# 3. Project registry loads
curl -fsS http://dalidou:8100/projects | jq '.projects | length'

# 4. A known-good context query returns non-empty results
curl -fsS -X POST http://dalidou:8100/context/build \
  -H "Content-Type: application/json" \
  -d '{"prompt": "what is p05 about", "project": "p05-interferometer"}' | jq '.chunks_used'
```
If any of these are wrong, the restore is bad. Roll back using the pre-restore safety copy:

```shell
docker compose stop atocore
sudo cp /srv/storage/atocore/backups/pre-restore-$PRESERVE_STAMP.db \
  /srv/storage/atocore/data/db/atocore.db
sudo cp /srv/storage/atocore/backups/pre-restore-$PRESERVE_STAMP.registry.json \
  /srv/storage/atocore/config/project-registry.json 2>/dev/null || true
# If you also restored chroma:
sudo rm -rf /srv/storage/atocore/data/chroma
sudo mv /srv/storage/atocore/data/chroma.pre-restore-$PRESERVE_STAMP \
  /srv/storage/atocore/data/chroma
docker compose up -d atocore
```
Retention policy
----------------
- Last 7 daily backups: kept verbatim
- Last 4 weekly backups (Sunday): kept verbatim
- Last 6 monthly backups (1st of month): kept verbatim
- Anything older: deleted

The retention job is not yet implemented and is tracked as a follow-up. Until then, the snapshots directory grows monotonically. A simple cron-based cleanup script is the next step:

```
0 4 * * * /srv/storage/atocore/scripts/cleanup-old-backups.sh
```
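Until that script exists, the selection logic the policy implies can be sketched as follows. This is a hypothetical sketch, not the shipped cleanup-old-backups.sh: select_kept is an invented helper, and it assumes GNU date for the Sunday lookup.

```shell
# Hypothetical retention selection. Reads snapshot stamps
# (YYYYMMDDTHHMMSSZ, one per line) on stdin and prints the stamps to
# KEEP; anything not printed is a deletion candidate.
select_kept() {
  stamps=$(sort -u)
  # Last 7 snapshots, regardless of weekday
  daily=$(printf '%s\n' "$stamps" | tail -n 7)
  # Last 4 Sunday snapshots (GNU date: %u prints 7 for Sunday)
  sundays=$(printf '%s\n' "$stamps" | while read -r s; do
    [ "$(date -u -d "${s%%T*}" +%u)" = 7 ] && printf '%s\n' "$s"
  done | tail -n 4)
  # Last 6 first-of-month snapshots (day field == 01)
  firsts=$(printf '%s\n' "$stamps" | grep -E '^[0-9]{6}01T' | tail -n 6)
  printf '%s\n%s\n%s\n' "$daily" "$sundays" "$firsts" | grep -v '^$' | sort -u
}
```

A deletion pass would then remove every snapshot directory whose stamp is not in the kept set (e.g. comm -23 between the full listing and the kept list), ideally only after the newest snapshot validates clean.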
Drill schedule
--------------
A backup that has never been restored is theoretical. The schedule:
- At least once per quarter, perform a full restore drill on a staging environment (or a temporary container with a separate data dir) and verify the post-restore checks pass.
- After every breaking schema migration, perform a restore drill to confirm the migration is reversible.
- After any incident that touched the storage layer (the EXDEV bug from April 2026 is a good example), confirm the next backup validates clean.
Common failure modes and what to do about them
----------------------------------------------

| Symptom | Likely cause | Action |
|---|---|---|
| db_integrity_check_failed on validation | SQLite snapshot copied while a write was in progress, or disk corruption | Take a fresh backup and validate again. If it fails twice, suspect the underlying disk. |
| registry_invalid_json | Registry was being edited at backup time | Take a fresh backup. The registry is small, so this is cheap. |
| chroma_snapshot_missing after a restore | Snapshot was DB-only and the restore didn't move the existing chroma dir | Either rebuild via fresh ingestion or restore an older snapshot that includes Chroma. |
| Service won't start after restore | Permissions wrong on the restored files | Re-run chown 1000:1000 (or whatever the gitea/atocore container user is) on the data dir. |
| /stats returns 0 documents after restore | The SQL store was restored but the source paths in source_documents don't match the current Dalidou paths | The backup came from a different deployment. Don't trust this restore — it's pulling from the wrong layout. |
Open follow-ups (not yet implemented)
-------------------------------------
- Retention cleanup script: see the cron entry above.
- Off-Dalidou backup target: currently snapshots live on the same disk as the live data. A real disaster-recovery story needs at least one snapshot on a different physical machine. The simplest first step is a periodic rsync to the user's laptop or to another server.
- Backup encryption: snapshots contain raw SQLite and JSON. Consider age/gpg encryption if backups will be shipped off-site.
- Automatic post-backup validation: today the validator must be invoked manually. The create_runtime_backup function should call validate_backup on its own output and refuse to declare success if validation fails.
- Chroma backup is currently a full directory copy every time. For large vector stores this gets expensive. A future improvement would be incremental snapshots via filesystem-level snapshotting (LVM, btrfs, ZFS).
Quickstart cheat sheet
----------------------

```shell
# Daily backup (DB + registry only — fast)
curl -fsS -X POST http://dalidou:8100/admin/backup \
  -H "Content-Type: application/json" -d '{}'

# Weekly backup (DB + registry + Chroma — slower, holds ingestion lock)
curl -fsS -X POST http://dalidou:8100/admin/backup \
  -H "Content-Type: application/json" -d '{"include_chroma": true}'

# List backups
curl -fsS http://dalidou:8100/admin/backup | jq '.backups[].stamp'

# Validate the most recent backup
LATEST=$(curl -fsS http://dalidou:8100/admin/backup | jq -r '.backups[-1].stamp')
curl -fsS http://dalidou:8100/admin/backup/$LATEST/validate | jq .

# Full restore — see the "Restore procedure" section above
```