# AtoCore Backup and Restore Procedure

## Scope

This document defines the operational procedure for backing up and restoring AtoCore's machine state on the Dalidou deployment. It is the practical companion to `docs/backup-strategy.md` (which defines the strategy) and `src/atocore/ops/backup.py` (which implements the mechanics). The intent is that this procedure can be followed by anyone with SSH access to Dalidou and the AtoCore admin endpoints.

## What gets backed up

A `create_runtime_backup` snapshot contains, in order of importance:

| Artifact | Source path on Dalidou | Backup destination | Always included |
|---|---|---|---|
| SQLite database | `/srv/storage/atocore/data/db/atocore.db` | `/db/atocore.db` | yes |
| Project registry JSON | `/srv/storage/atocore/config/project-registry.json` | `/config/project-registry.json` | yes (if file exists) |
| Backup metadata | (generated) | `/backup-metadata.json` | yes |
| Chroma vector store | `/srv/storage/atocore/data/chroma/` | `/chroma/` | only when `include_chroma=true` |

The SQLite snapshot uses the online `conn.backup()` API and is safe to take while the database is in use. The Chroma snapshot is a cold directory copy and is **only safe when no ingestion is running**; the API endpoint enforces this by acquiring the ingestion lock for the duration of the copy.

What is **not** in the backup:

- Source documents under `/srv/storage/atocore/sources/vault/` and `/srv/storage/atocore/sources/drive/`. These are read-only inputs that live in the user's PKM/Drive and are backed up separately by those systems.
- Application code. The container image is the source of truth for code; recovery means rebuilding the image, not restoring code from a backup.
- Logs under `/srv/storage/atocore/logs/`.
- Embeddings cache under `/srv/storage/atocore/data/cache/`.
- Temp files under `/srv/storage/atocore/data/tmp/`.
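The online SQLite snapshot step can be sketched with the stdlib `sqlite3` backup API. This is a minimal illustration of why the copy is safe under concurrent writes, not the actual `create_runtime_backup` code; the function name and paths here are hypothetical:

```python
import sqlite3
from pathlib import Path


def snapshot_sqlite(src_db: Path, dest_db: Path) -> None:
    """Copy a live SQLite database using the online backup API.

    Unlike a plain file copy, Connection.backup() is safe while writers
    are active: it copies pages page-by-page and restarts as needed to
    produce a consistent snapshot.
    """
    dest_db.parent.mkdir(parents=True, exist_ok=True)
    src = sqlite3.connect(src_db)
    dest = sqlite3.connect(dest_db)
    try:
        src.backup(dest)  # blocks until a consistent copy exists
    finally:
        dest.close()
        src.close()
```

This is the property that lets Options A and B run without stopping the service, whereas the Option C file copy below requires the service to be down.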
## Backup root layout

Each backup snapshot lives in its own timestamped directory:

```
/srv/storage/atocore/backups/snapshots/
├── 20260407T060000Z/
│   ├── backup-metadata.json
│   ├── db/
│   │   └── atocore.db
│   ├── config/
│   │   └── project-registry.json
│   └── chroma/              # only if include_chroma=true
│       └── ...
├── 20260408T060000Z/
│   └── ...
└── ...
```

The timestamp is UTC, format `YYYYMMDDTHHMMSSZ`.

## Triggering a backup

### Option A — via the admin endpoint (preferred)

```bash
# DB + registry only (fast, safe at any time)
curl -fsS -X POST http://dalidou:8100/admin/backup \
  -H "Content-Type: application/json" \
  -d '{"include_chroma": false}'

# DB + registry + Chroma (acquires ingestion lock)
curl -fsS -X POST http://dalidou:8100/admin/backup \
  -H "Content-Type: application/json" \
  -d '{"include_chroma": true}'
```

The response is the backup metadata JSON. Save the `backup_root` field — that's the directory the snapshot was written to.

### Option B — via the standalone script (when the API is down)

```bash
docker exec atocore python -m atocore.ops.backup
```

This runs `create_runtime_backup()` directly, without going through the API or the ingestion lock. Use it only when the AtoCore service itself is unhealthy and you can't hit the admin endpoint.

### Option C — manual file copy (last resort)

If both the API and the standalone script are unusable:

```bash
sudo systemctl stop atocore   # or: docker compose stop atocore
sudo cp /srv/storage/atocore/data/db/atocore.db \
  /srv/storage/atocore/backups/manual-$(date -u +%Y%m%dT%H%M%SZ).db
sudo cp /srv/storage/atocore/config/project-registry.json \
  /srv/storage/atocore/backups/manual-$(date -u +%Y%m%dT%H%M%SZ).registry.json
sudo systemctl start atocore
```

This is a cold backup and requires brief downtime.

## Listing backups

```bash
curl -fsS http://dalidou:8100/admin/backup
```

Returns the configured `backup_dir` and a list of all snapshots under it, with their full metadata where available.
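When scripting against the snapshots directory, the stamp format is convenient: it is fixed-width and most-significant-first, so lexicographic order is chronological, and it parses directly with the stdlib. A small sketch (the helper names are illustrative, not part of AtoCore):

```python
from datetime import datetime, timezone

# Snapshot directory name format: UTC, e.g. 20260407T060000Z
STAMP_FORMAT = "%Y%m%dT%H%M%SZ"


def parse_stamp(stamp: str) -> datetime:
    """Parse a snapshot directory name into an aware UTC datetime."""
    return datetime.strptime(stamp, STAMP_FORMAT).replace(tzinfo=timezone.utc)


def make_stamp(now: datetime) -> str:
    """Format a UTC datetime as a snapshot directory name."""
    return now.astimezone(timezone.utc).strftime(STAMP_FORMAT)
```

Because of the fixed-width encoding, `sorted(stamps)` orders snapshots oldest-to-newest with no parsing needed.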
Or, on the host directly:

```bash
ls -la /srv/storage/atocore/backups/snapshots/
```

## Validating a backup

Before relying on a backup for restore, validate it:

```bash
curl -fsS http://dalidou:8100/admin/backup/20260407T060000Z/validate
```

The validator:

- confirms the snapshot directory exists
- opens the SQLite snapshot and runs `PRAGMA integrity_check`
- parses the registry JSON
- confirms the Chroma directory exists (if it was included)

A valid backup returns `"valid": true` and an empty `errors` array. A failing validation returns `"valid": false` with one or more specific error strings (e.g. `db_integrity_check_failed`, `registry_invalid_json`, `chroma_snapshot_missing`).

**Validate every backup at creation time.** A backup that has never been validated is not actually a backup — it's just a hopeful copy of bytes.

## Restore procedure

### Pre-flight (always)

1. Identify which snapshot you want to restore. List available snapshots and pick by timestamp:

   ```bash
   curl -fsS http://dalidou:8100/admin/backup | jq '.backups[].stamp'
   ```

2. Validate it. Refuse to restore an invalid backup:

   ```bash
   STAMP=20260407T060000Z
   curl -fsS http://dalidou:8100/admin/backup/$STAMP/validate | jq .
   ```

3. **Stop AtoCore.** SQLite cannot be hot-restored under a running process, and Chroma will not pick up new files until the process restarts.

   ```bash
   docker compose stop atocore   # or: sudo systemctl stop atocore
   ```

4. **Take a safety snapshot of the current state** before overwriting it. This is your "if the restore makes things worse, here's the undo" backup.
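The validator's checks can be sketched in a few lines of Python. This is an illustration of the checks listed above, not the actual `validate_backup` implementation; `snapshot_dir_missing` and `db_snapshot_missing` are illustrative error names, while the other strings match those the endpoint reports:

```python
import json
import sqlite3
from pathlib import Path


def validate_snapshot(snapshot_dir: Path, expect_chroma: bool = False) -> dict:
    """Sketch of the validation checks: existence, DB integrity, registry JSON, Chroma dir."""
    errors: list[str] = []
    if not snapshot_dir.is_dir():
        return {"valid": False, "errors": ["snapshot_dir_missing"]}

    db_path = snapshot_dir / "db" / "atocore.db"
    if not db_path.is_file():
        errors.append("db_snapshot_missing")
    else:
        conn = sqlite3.connect(db_path)
        try:
            # PRAGMA integrity_check returns the single row ("ok",) on a healthy DB
            if conn.execute("PRAGMA integrity_check").fetchone()[0] != "ok":
                errors.append("db_integrity_check_failed")
        except sqlite3.DatabaseError:
            errors.append("db_integrity_check_failed")
        finally:
            conn.close()

    registry = snapshot_dir / "config" / "project-registry.json"
    if registry.is_file():  # registry is optional in a snapshot
        try:
            json.loads(registry.read_text())
        except json.JSONDecodeError:
            errors.append("registry_invalid_json")

    if expect_chroma and not (snapshot_dir / "chroma").is_dir():
        errors.append("chroma_snapshot_missing")

    return {"valid": not errors, "errors": errors}
```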
```bash
PRESERVE_STAMP=$(date -u +%Y%m%dT%H%M%SZ)
sudo cp /srv/storage/atocore/data/db/atocore.db \
  /srv/storage/atocore/backups/pre-restore-$PRESERVE_STAMP.db
sudo cp /srv/storage/atocore/config/project-registry.json \
  /srv/storage/atocore/backups/pre-restore-$PRESERVE_STAMP.registry.json 2>/dev/null || true
```

### Restore the SQLite database

```bash
SNAPSHOT_DIR=/srv/storage/atocore/backups/snapshots/$STAMP
sudo cp $SNAPSHOT_DIR/db/atocore.db \
  /srv/storage/atocore/data/db/atocore.db
sudo chown 1000:1000 /srv/storage/atocore/data/db/atocore.db
sudo chmod 600 /srv/storage/atocore/data/db/atocore.db
```

The chown should match the gitea/atocore container user. Verify by checking the existing permissions before overwriting:

```bash
stat -c '%U:%G %a' /srv/storage/atocore/data/db/atocore.db
```

### Restore the project registry

```bash
if [ -f $SNAPSHOT_DIR/config/project-registry.json ]; then
  sudo cp $SNAPSHOT_DIR/config/project-registry.json \
    /srv/storage/atocore/config/project-registry.json
  sudo chown 1000:1000 /srv/storage/atocore/config/project-registry.json
  sudo chmod 644 /srv/storage/atocore/config/project-registry.json
fi
```

If the snapshot does not contain a registry, the current registry is preserved. The pre-flight safety copy still gives you a recovery path if you need to roll back.

### Restore the Chroma vector store (if it was in the snapshot)

```bash
if [ -d $SNAPSHOT_DIR/chroma ]; then
  # Move the current chroma dir aside as a safety copy
  sudo mv /srv/storage/atocore/data/chroma \
    /srv/storage/atocore/data/chroma.pre-restore-$PRESERVE_STAMP
  # Copy the snapshot in
  sudo cp -a $SNAPSHOT_DIR/chroma /srv/storage/atocore/data/chroma
  sudo chown -R 1000:1000 /srv/storage/atocore/data/chroma
fi
```

If the snapshot does NOT contain a Chroma dir but the SQLite restore would leave the vector store and the SQL store inconsistent (e.g. SQL has chunks the vector store doesn't), you have two options:

- **Option 1: rebuild the vector store from source documents.** Run ingestion fresh after the SQL restore. This regenerates embeddings from the actual source files. Slow, but it produces a perfectly consistent state.
- **Option 2: accept the inconsistency and live with stale-vector filtering.** The retriever already drops vector results whose SQL row no longer exists (the `_existing_chunk_ids` filter), so the inconsistency surfaces as missing results, not bad ones.

For an unplanned restore, Option 2 is the right immediate move. Then schedule a fresh ingestion pass to rebuild the vector store properly.

### Restart AtoCore

```bash
docker compose up -d atocore   # or: sudo systemctl start atocore
```

### Post-restore verification

```bash
# 1. Service is healthy
curl -fsS http://dalidou:8100/health | jq .

# 2. Stats look right
curl -fsS http://dalidou:8100/stats | jq .

# 3. Project registry loads
curl -fsS http://dalidou:8100/projects | jq '.projects | length'

# 4. A known-good context query returns non-empty results
curl -fsS -X POST http://dalidou:8100/context/build \
  -H "Content-Type: application/json" \
  -d '{"prompt": "what is p05 about", "project": "p05-interferometer"}' \
  | jq '.chunks_used'
```

If any of these checks fail, the restore is bad.
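Option 2 rests on the retriever's existence filter. The idea can be sketched as follows; this is illustrative only, and the function name, table name (`chunks`), and hit shape are assumptions rather than AtoCore's actual retriever code, which uses an `_existing_chunk_ids` check:

```python
import sqlite3


def filter_stale_vectors(conn: sqlite3.Connection, hits: list[dict]) -> list[dict]:
    """Drop vector-store hits whose chunk id no longer exists in the SQL store.

    After a DB-only restore, the vector store may reference chunks the
    restored SQL database never had; those hits are silently discarded,
    so the inconsistency shows up as missing results rather than bad ones.
    """
    if not hits:
        return []
    ids = [hit["chunk_id"] for hit in hits]
    placeholders = ",".join("?" * len(ids))
    rows = conn.execute(
        f"SELECT id FROM chunks WHERE id IN ({placeholders})", ids
    ).fetchall()
    existing = {row[0] for row in rows}
    return [hit for hit in hits if hit["chunk_id"] in existing]
```

This is why the failure mode is quiet: queries still succeed, they just return fewer chunks until ingestion rebuilds the vector store.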
Roll back using the pre-restore safety copy:

```bash
docker compose stop atocore
sudo cp /srv/storage/atocore/backups/pre-restore-$PRESERVE_STAMP.db \
  /srv/storage/atocore/data/db/atocore.db
sudo cp /srv/storage/atocore/backups/pre-restore-$PRESERVE_STAMP.registry.json \
  /srv/storage/atocore/config/project-registry.json 2>/dev/null || true

# If you also restored chroma:
sudo rm -rf /srv/storage/atocore/data/chroma
sudo mv /srv/storage/atocore/data/chroma.pre-restore-$PRESERVE_STAMP \
  /srv/storage/atocore/data/chroma

docker compose up -d atocore
```

## Retention policy

- **Last 7 daily backups**: kept verbatim
- **Last 4 weekly backups** (Sunday): kept verbatim
- **Last 6 monthly backups** (1st of month): kept verbatim
- **Anything older**: deleted

The retention job is **not yet implemented** and is tracked as a follow-up. Until then, the snapshots directory grows monotonically. A simple cron-based cleanup script is the next step:

```cron
0 4 * * * /srv/storage/atocore/scripts/cleanup-old-backups.sh
```

## Drill schedule

A backup that has never been restored is theoretical. The schedule:

- **At least once per quarter**, perform a full restore drill on a staging environment (or a temporary container with a separate data dir) and verify the post-restore checks pass.
- **After every breaking schema migration**, perform a restore drill to confirm the migration is reversible.
- **After any incident** that touched the storage layer (the EXDEV bug from April 2026 is a good example), confirm the next backup validates clean.

## Common failure modes and what to do about them

| Symptom | Likely cause | Action |
|---|---|---|
| `db_integrity_check_failed` on validation | SQLite snapshot copied while a write was in progress, or disk corruption | Take a fresh backup and validate again. If it fails twice, suspect the underlying disk. |
| `registry_invalid_json` | Registry was being edited at backup time | Take a fresh backup. The registry is small, so this is cheap. |
| `chroma_snapshot_missing` after a restore | Snapshot was DB-only and the restore didn't move the existing chroma dir | Either rebuild via fresh ingestion or restore an older snapshot that includes Chroma. |
| Service won't start after restore | Permissions are wrong on the restored files | Re-run `chown 1000:1000` (or whatever the gitea/atocore container user is) on the data dir. |
| `/stats` returns 0 documents after restore | The SQL store was restored but the source paths in `source_documents` don't match the current Dalidou paths | The backup came from a different deployment. Don't trust this restore — it's pulling from the wrong layout. |

## Open follow-ups (not yet implemented)

1. **Retention cleanup script**: see the cron entry above.
2. **Off-Dalidou backup target**: currently snapshots live on the same disk as the live data. A real disaster-recovery story needs at least one snapshot on a different physical machine. The simplest first step is a periodic `rsync` to the user's laptop or to another server.
3. **Backup encryption**: snapshots contain raw SQLite and JSON. Consider age/gpg encryption if backups will be shipped off-site.
4. **Automatic post-backup validation**: today the validator must be invoked manually. `create_runtime_backup` should call `validate_backup` on its own output and refuse to declare success if validation fails.
5. **Incremental Chroma backups**: the Chroma backup is currently a full directory copy every time. For large vector stores this gets expensive. A future improvement would be incremental snapshots via filesystem-level snapshotting (LVM, btrfs, ZFS).
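Since the retention job is not yet implemented, here is a minimal sketch of the selection logic a cleanup script would need. The bucket sizes come from the retention policy above; everything else (function name, stamp handling) is an assumption, not existing AtoCore code:

```python
from datetime import datetime


def select_keep(stamps: list[str]) -> set[str]:
    """Return the snapshot stamps to keep under the 7-daily / 4-weekly / 6-monthly policy.

    Stamps are UTC YYYYMMDDTHHMMSSZ strings; lexicographic order is
    chronological, so we sort newest-first and pick from the top of
    each bucket. Everything not returned is eligible for deletion.
    """
    def parse(stamp: str) -> datetime:
        return datetime.strptime(stamp, "%Y%m%dT%H%M%SZ")

    newest_first = sorted(stamps, reverse=True)
    daily = newest_first[:7]
    weekly = [s for s in newest_first if parse(s).weekday() == 6][:4]   # Sunday
    monthly = [s for s in newest_first if parse(s).day == 1][:6]        # 1st of month
    return set(daily) | set(weekly) | set(monthly)
```

A snapshot can qualify for more than one bucket (a Sunday among the last seven days is both daily and weekly), which is why the result is a union rather than three disjoint lists.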
## Quickstart cheat sheet

```bash
# Daily backup (DB + registry only — fast)
curl -fsS -X POST http://dalidou:8100/admin/backup \
  -H "Content-Type: application/json" -d '{}'

# Weekly backup (DB + registry + Chroma — slower, holds ingestion lock)
curl -fsS -X POST http://dalidou:8100/admin/backup \
  -H "Content-Type: application/json" -d '{"include_chroma": true}'

# List backups
curl -fsS http://dalidou:8100/admin/backup | jq '.backups[].stamp'

# Validate the most recent backup
LATEST=$(curl -fsS http://dalidou:8100/admin/backup | jq -r '.backups[-1].stamp')
curl -fsS http://dalidou:8100/admin/backup/$LATEST/validate | jq .

# Full restore — see the "Restore procedure" section above
```