# AtoCore Backup and Restore Procedure

## Scope

This document defines the operational procedure for backing up and restoring AtoCore's machine state on the Dalidou deployment. It is the practical companion to `docs/backup-strategy.md` (which defines the strategy) and `src/atocore/ops/backup.py` (which implements the mechanics). The intent is that this procedure can be followed by anyone with SSH access to Dalidou and the AtoCore admin endpoints.

## What gets backed up

A `create_runtime_backup` snapshot contains, in order of importance:

| Artifact | Source path on Dalidou | Backup destination | Always included |
|---|---|---|---|
| SQLite database | `/srv/storage/atocore/data/db/atocore.db` | `/db/atocore.db` | yes |
| Project registry JSON | `/srv/storage/atocore/config/project-registry.json` | `/config/project-registry.json` | yes (if file exists) |
| Backup metadata | (generated) | `/backup-metadata.json` | yes |
| Chroma vector store | `/srv/storage/atocore/data/chroma/` | `/chroma/` | only when `include_chroma=true` |

The SQLite snapshot uses the online `conn.backup()` API and is safe to take while the database is in use. The Chroma snapshot is a cold directory copy and is **only safe when no ingestion is running**; the API endpoint enforces this by acquiring the ingestion lock for the duration of the copy.

What is **not** in the backup:

- Source documents under `/srv/storage/atocore/sources/vault/` and `/srv/storage/atocore/sources/drive/`. These are read-only inputs and live in the user's PKM/Drive, which is backed up separately by their own systems.
- Application code. The container image is the source of truth for code; recovery means rebuilding the image, not restoring code from a backup.
- Logs under `/srv/storage/atocore/logs/`.
- Embeddings cache under `/srv/storage/atocore/data/cache/`.
- Temp files under `/srv/storage/atocore/data/tmp/`.
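The "safe while the database is in use" claim rests on SQLite's online backup API (`Connection.backup()` in Python's stdlib `sqlite3`). A minimal sketch of that mechanic — the function name and its use here are illustrative, not `create_runtime_backup`'s actual shape:

```python
import sqlite3

def snapshot_sqlite(src_path: str, dest_path: str) -> None:
    """Copy a live SQLite database using the online backup API.

    Readers and writers can keep using src_path while this runs:
    the backup API copies pages under SQLite's own locking and
    re-copies pages that change mid-backup, so the snapshot is
    never a torn file the way a raw `cp` of a hot db can be.
    """
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        src.backup(dest)  # stdlib online backup; available since Python 3.7
    finally:
        dest.close()
        src.close()
```

This is exactly why Option C below (raw `cp`) requires stopping the service first, while the API-driven backup does not.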
## Backup root layout

Each backup snapshot lives in its own timestamped directory:

```
/srv/storage/atocore/backups/snapshots/
├── 20260407T060000Z/
│   ├── backup-metadata.json
│   ├── db/
│   │   └── atocore.db
│   ├── config/
│   │   └── project-registry.json
│   └── chroma/            # only if include_chroma=true
│       └── ...
├── 20260408T060000Z/
│   └── ...
└── ...
```

The timestamp is UTC, format `YYYYMMDDTHHMMSSZ`.

## Triggering a backup

### Option A — via the admin endpoint (preferred)

```bash
# DB + registry only (fast, safe at any time)
curl -fsS -X POST http://dalidou:8100/admin/backup \
  -H "Content-Type: application/json" \
  -d '{"include_chroma": false}'

# DB + registry + Chroma (acquires ingestion lock)
curl -fsS -X POST http://dalidou:8100/admin/backup \
  -H "Content-Type: application/json" \
  -d '{"include_chroma": true}'
```

The response is the backup metadata JSON. Save the `backup_root` field — that's the directory the snapshot was written to.

### Option B — via the standalone script (when the API is down)

```bash
docker exec atocore python -m atocore.ops.backup
```

This runs `create_runtime_backup()` directly, without going through the API or the ingestion lock. Use it only when the AtoCore service itself is unhealthy and you can't hit the admin endpoint.

### Option C — manual file copy (last resort)

If both the API and the standalone script are unusable:

```bash
sudo systemctl stop atocore   # or: docker compose stop atocore
sudo cp /srv/storage/atocore/data/db/atocore.db \
  /srv/storage/atocore/backups/manual-$(date -u +%Y%m%dT%H%M%SZ).db
sudo cp /srv/storage/atocore/config/project-registry.json \
  /srv/storage/atocore/backups/manual-$(date -u +%Y%m%dT%H%M%SZ).registry.json
sudo systemctl start atocore
```

This is a cold backup and requires brief downtime.

## Listing backups

```bash
curl -fsS http://dalidou:8100/admin/backup
```

Returns the configured `backup_dir` and a list of all snapshots under it, with their full metadata if available.
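The stamp format and the `backup_root` field are easy to handle in a script without shelling out to `awk`. A small sketch — the helper names here are hypothetical, not part of the module's API:

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath
from typing import Optional

STAMP_FORMAT = "%Y%m%dT%H%M%SZ"  # e.g. 20260407T060000Z, always UTC

def make_stamp(now: Optional[datetime] = None) -> str:
    """Name for a new snapshot directory, matching the documented format."""
    return (now or datetime.now(timezone.utc)).strftime(STAMP_FORMAT)

def stamp_from_backup_root(backup_root: str) -> str:
    """Pull the stamp out of the backup_root path returned by the API."""
    stamp = PurePosixPath(backup_root).name
    datetime.strptime(stamp, STAMP_FORMAT)  # raises ValueError if malformed
    return stamp
```

A useful side effect of this format: lexicographic order of stamps equals chronological order, so `sorted()` on directory names is enough to find the newest snapshot.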
Or, on the host directly:

```bash
ls -la /srv/storage/atocore/backups/snapshots/
```

## Validating a backup

Before relying on a backup for restore, validate it:

```bash
curl -fsS http://dalidou:8100/admin/backup/20260407T060000Z/validate
```

The validator:

- confirms the snapshot directory exists
- opens the SQLite snapshot and runs `PRAGMA integrity_check`
- parses the registry JSON
- confirms the Chroma directory exists (if it was included)

A valid backup returns `"valid": true` and an empty `errors` array. A failing validation returns `"valid": false` with one or more specific error strings (e.g. `db_integrity_check_failed`, `registry_invalid_json`, `chroma_snapshot_missing`).

**Validate every backup at creation time.** A backup that has never been validated is not actually a backup — it's just a hopeful copy of bytes.

## Restore procedure

Since 2026-04-09 the restore is implemented as a proper module function plus CLI entry point: `restore_runtime_backup()` in `src/atocore/ops/backup.py`, invoked as `python -m atocore.ops.backup restore --confirm-service-stopped`. It automatically takes a pre-restore safety snapshot (your rollback anchor), handles SQLite WAL/SHM cleanly, restores the registry, and runs `PRAGMA integrity_check` on the restored db. This replaces the earlier manual `sudo cp` sequence.

The function refuses to run without `--confirm-service-stopped`. This is deliberate: hot-restoring into a running service corrupts SQLite state.

### Pre-flight (always)

1. Identify which snapshot you want to restore. List available snapshots and pick by timestamp:

   ```bash
   curl -fsS http://127.0.0.1:8100/admin/backup | jq '.backups[].stamp'
   ```

2. Validate it. Refuse to restore an invalid backup:

   ```bash
   STAMP=20260409T060000Z
   curl -fsS http://127.0.0.1:8100/admin/backup/$STAMP/validate | jq .
   ```

3. **Stop AtoCore.** SQLite cannot be hot-restored under a running process and Chroma will not pick up new files until the process restarts.
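The four validator checks are simple enough to mirror in a standalone sketch, which is also handy for checking snapshots offline on a machine that can't reach the endpoint. This is not the actual `validate_backup` implementation — the structure follows the documented checks, and the `db_snapshot_missing` error string is an invention of this sketch:

```python
import json
import sqlite3
from pathlib import Path

def validate_backup(backup_root: str) -> dict:
    """Run the documented checks against one snapshot directory."""
    root = Path(backup_root)
    if not root.is_dir():
        return {"valid": False, "errors": ["snapshot_dir_missing"]}
    errors = []

    # 1+2. SQLite snapshot exists and passes PRAGMA integrity_check.
    db = root / "db" / "atocore.db"
    if not db.exists():
        errors.append("db_snapshot_missing")  # error name assumed, see lead-in
    else:
        try:
            conn = sqlite3.connect(str(db))
            row = conn.execute("PRAGMA integrity_check").fetchone()
            conn.close()
            if row[0] != "ok":
                errors.append("db_integrity_check_failed")
        except sqlite3.Error:
            errors.append("db_integrity_check_failed")

    # 3. Registry JSON parses (only captured if the file existed).
    registry = root / "config" / "project-registry.json"
    if registry.exists():
        try:
            json.loads(registry.read_text())
        except ValueError:
            errors.append("registry_invalid_json")

    # 4. Chroma dir exists if the metadata says it was included.
    meta_path = root / "backup-metadata.json"
    if meta_path.exists():
        meta = json.loads(meta_path.read_text())
        if meta.get("include_chroma") and not (root / "chroma").is_dir():
            errors.append("chroma_snapshot_missing")

    return {"valid": not errors, "errors": errors}
```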
   ```bash
   cd /srv/storage/atocore/app/deploy/dalidou
   docker compose down
   docker compose ps   # atocore should be Exited/gone
   ```

### Run the restore

Use a one-shot container that reuses the live service's volume mounts so every path (`db_path`, `chroma_path`, backup dir) resolves to the same place the main service would see:

```bash
cd /srv/storage/atocore/app/deploy/dalidou
docker compose run --rm --entrypoint python atocore \
  -m atocore.ops.backup restore \
  $STAMP \
  --confirm-service-stopped
```

Output is a JSON document. The critical fields:

- `pre_restore_snapshot`: stamp of the safety snapshot of live state taken right before the restore. **Write this down.** If the restore was the wrong call, this is how you roll it back.
- `db_restored`: should be `true`
- `registry_restored`: `true` if the backup captured a registry
- `chroma_restored`: `true` if the backup captured a chroma tree and `include_chroma` resolved to true (default)
- `restored_integrity_ok`: **must be `true`** — if this is false, STOP and do not start the service; investigate the integrity error first. The restored file is still on disk but untrusted.

### Controlling the restore

The CLI supports a few flags for finer control:

- `--no-pre-snapshot` skips the pre-restore safety snapshot. Use this only when you know you have another rollback path.
- `--no-chroma` restores only SQLite + registry, leaving the current Chroma dir alone. Useful if Chroma is consistent but SQLite needs a rollback.
- `--chroma` forces Chroma restoration even if the metadata doesn't clearly indicate the snapshot has it (rare).

### Chroma restore and bind-mounted volumes

The Chroma dir on Dalidou is a bind-mounted Docker volume. The restore cannot `rmtree` the destination (you can't unlink a mount point — it raises `OSError [Errno 16] Device or resource busy`), so the function clears the dir's **contents** and uses `copytree(dirs_exist_ok=True)` to copy the snapshot back in.
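The clear-contents-then-copy approach can be sketched as follows (function name hypothetical; the invariant is that the destination directory inode is never unlinked):

```python
import shutil
from pathlib import Path

def restore_dir_in_place(snapshot_dir: str, dest_dir: str) -> None:
    """Restore a directory tree without unlinking the destination itself.

    The destination may be a bind mount: rmtree() on it would try to
    unlink the mount point and fail with OSError [Errno 16] Device or
    resource busy. So we delete only the destination's contents, keeping
    the directory (and its inode) alive, then copy the snapshot back in.
    """
    dest = Path(dest_dir)
    for entry in dest.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
    # dirs_exist_ok=True lets copytree write into the surviving directory.
    shutil.copytree(snapshot_dir, dest, dirs_exist_ok=True)
```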
The regression test `test_restore_chroma_does_not_unlink_destination_directory` in `tests/test_backup.py` captures the destination inode before and after restore and asserts it's stable — the same invariant that protects the bind mount. This was discovered during the first real Dalidou restore drill on 2026-04-09. If you see a new restore failure with `Device or resource busy`, something has regressed this fix.

### Restart AtoCore

```bash
cd /srv/storage/atocore/app/deploy/dalidou
docker compose up -d

# Wait for /health to come up
for i in 1 2 3 4 5 6 7 8 9 10; do
  curl -fsS http://127.0.0.1:8100/health && break \
    || { echo "not ready ($i/10)"; sleep 3; }
done
```

**Note on `build_sha` after restore:** The one-shot `docker compose run` container does not carry the build provenance env vars that `deploy.sh` exports at deploy time. After a restore, `/health` will report `build_sha: "unknown"` until you re-run `deploy.sh` or manually re-deploy. This is cosmetic — the data is correctly restored — but if you need `build_sha` to be accurate, run a redeploy after the restore:

```bash
cd /srv/storage/atocore/app
bash deploy/dalidou/deploy.sh
```

### Post-restore verification

```bash
# 1. Service is healthy
curl -fsS http://127.0.0.1:8100/health | jq .

# 2. Stats look right
curl -fsS http://127.0.0.1:8100/stats | jq .

# 3. Project registry loads
curl -fsS http://127.0.0.1:8100/projects | jq '.projects | length'

# 4. A known-good context query returns non-empty results
curl -fsS -X POST http://127.0.0.1:8100/context/build \
  -H "Content-Type: application/json" \
  -d '{"prompt": "what is p05 about", "project": "p05-interferometer"}' \
  | jq '.chunks_used'
```

If any of these are wrong, the restore is bad. Roll back using the pre-restore safety snapshot whose stamp you recorded from the restore output.
The rollback is the same procedure — stop the service and restore that stamp:

```bash
docker compose down
docker compose run --rm --entrypoint python atocore \
  -m atocore.ops.backup restore \
  $PRE_RESTORE_SNAPSHOT_STAMP \
  --confirm-service-stopped \
  --no-pre-snapshot
docker compose up -d
```

(`--no-pre-snapshot` because the rollback itself doesn't need one; you already have the original snapshot as a fallback if everything goes sideways.)

### Restore drill

The restore is exercised at three levels:

1. **Unit tests.** `tests/test_backup.py` has seven restore tests (refuse-without-confirm, invalid backup, full round-trip, Chroma round-trip, inode-stability regression, WAL sidecar cleanup, skip-pre-snapshot). These run in CI on every commit.
2. **Module-level round-trip.** `test_restore_round_trip_reverses_post_backup_mutations` is the canonical drill in code form: seed baseline, snapshot, mutate, restore, then assert the mutation was reversed, the baseline survived, and the pre-restore snapshot captured the mutation.
3. **Live drill on Dalidou.** Periodically run the full procedure against the real service with a disposable drill-marker memory (created via `POST /memory` with `memory_type=episodic` and `project=drill`), following the sequence above and then verifying the marker is gone afterward via `GET /memory?project=drill`. The first such drill on 2026-04-09 surfaced the bind-mount bug; future runs primarily exist to verify the fix stays fixed.

Run the live drill:

- **Before** enabling any new write-path automation (auto-capture, automated ingestion, reinforcement sweeps).
- **After** any change to `src/atocore/ops/backup.py` or to schema migrations in `src/atocore/models/database.py`.
- **After** a Dalidou OS upgrade or docker version bump.
- **At least once per quarter** as a standing operational check.
- **After any incident** that touched the storage layer.
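The shape of the module-level round-trip is easy to see in miniature. This toy version replaces the real backup/restore machinery with a single-file copy; only the assertion structure mirrors the actual test:

```python
import shutil
import tempfile
from pathlib import Path

def test_round_trip_reverses_post_backup_mutations():
    work = Path(tempfile.mkdtemp())
    live = work / "state.txt"

    # 1. Seed baseline state and take a snapshot.
    live.write_text("baseline\n")
    snapshot = work / "snapshot.txt"
    shutil.copy2(live, snapshot)

    # 2. Mutate live state after the snapshot.
    live.write_text("baseline\nmutation\n")

    # 3. Restore: first capture a pre-restore safety snapshot of the
    #    live (mutated) state, then overwrite live from the snapshot.
    pre_restore = work / "pre-restore.txt"
    shutil.copy2(live, pre_restore)
    shutil.copy2(snapshot, live)

    # 4. The three invariants of the canonical drill:
    assert "baseline" in live.read_text()         # baseline survived
    assert "mutation" not in live.read_text()     # mutation reversed
    assert "mutation" in pre_restore.read_text()  # safety net captured it
```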
Record each drill run (stamp, pre-restore snapshot stamp, pass/fail, any surprises) somewhere durable — a line in the project journal or a git commit message is enough. A drill you ran once and never again is barely more than a drill you never ran.

## Retention policy

- **Last 7 daily backups**: kept verbatim
- **Last 4 weekly backups** (Sunday): kept verbatim
- **Last 6 monthly backups** (1st of month): kept verbatim
- **Anything older**: deleted

The retention job is **not yet implemented** and is tracked as a follow-up. Until then, the snapshots directory grows monotonically. A simple cron-based cleanup script is the next step:

```cron
0 4 * * * /srv/storage/atocore/scripts/cleanup-old-backups.sh
```

## Common failure modes and what to do about them

| Symptom | Likely cause | Action |
|---|---|---|
| `db_integrity_check_failed` on validation | SQLite snapshot copied while a write was in progress, or disk corruption | Take a fresh backup and validate again. If it fails twice, suspect the underlying disk. |
| `registry_invalid_json` | Registry was being edited at backup time | Take a fresh backup. The registry is small so this is cheap. |
| Restore: `restored_integrity_ok: false` | Source snapshot was itself corrupt (validation should have caught it — file a bug) or copy was interrupted mid-write | Do NOT start the service. Validate the snapshot directly with `python -m atocore.ops.backup validate`, try a different older snapshot, or roll back to the pre-restore safety snapshot. |
| Restore: `OSError [Errno 16] Device or resource busy` on Chroma | Old code tried to `rmtree` the Chroma mount point. Fixed on 2026-04-09, guarded by `test_restore_chroma_does_not_unlink_destination_directory` | Ensure you're running a build from 2026-04-09 or later; if you need to work around an older build, use `--no-chroma` and restore Chroma contents manually. |
| `chroma_snapshot_missing` after a restore | Snapshot was DB-only | Either rebuild via fresh ingestion or restore an older snapshot that includes Chroma. |
| Service won't start after restore | Permissions wrong on the restored files | Re-run `chown 1000:1000` (or whatever the atocore container user is) on the data dir. |
| `/stats` returns 0 documents after restore | The SQL store was restored but the source paths in `source_documents` don't match the current Dalidou paths | This means the backup came from a different deployment. Don't trust this restore — it's pulling from the wrong layout. |
| Drill marker still present after restore | Wrong stamp, service still writing during `docker compose down`, or the restore JSON didn't report `db_restored: true` | Roll back via the pre-restore safety snapshot and retry with the correct source snapshot. |

## Open follow-ups (not yet implemented)

Tracked separately in `docs/next-steps.md` — the list below is the backup-specific subset.

1. **Retention cleanup script**: see the cron entry above. The snapshots directory grows monotonically until this exists.
2. **Off-Dalidou backup target**: currently snapshots live on the same disk as the live data. A real disaster-recovery story needs at least one snapshot on a different physical machine. The simplest first step is a periodic `rsync` to the user's laptop or to another server.
3. **Backup encryption**: snapshots contain raw SQLite and JSON. Consider age/gpg encryption if backups will be shipped off-site.
4. **Automatic post-backup validation**: today the validator must be invoked manually. The `create_runtime_backup` function should call `validate_backup` on its own output and refuse to declare success if validation fails.
5. **Chroma backup is currently a full directory copy** every time. For large vector stores this gets expensive. A future improvement would be incremental snapshots via filesystem-level snapshotting (LVM, btrfs, ZFS).
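Since the retention job is unimplemented, the selection logic is worth pinning down before anyone writes the cron script. One possible sketch of the keep-set computation, assuming at most one snapshot per day and stamps in the documented `YYYYMMDDTHHMMSSZ` format (the function name is hypothetical):

```python
from datetime import datetime

STAMP_FORMAT = "%Y%m%dT%H%M%SZ"

def stamps_to_keep(stamps):
    """Apply the documented policy to a list of snapshot stamps:
    keep the last 7 dailies, the last 4 Sunday weeklies, and the
    last 6 first-of-month monthlies; everything else is a deletion
    candidate."""
    # Lexicographic order of stamps equals chronological order.
    ordered = sorted(stamps, reverse=True)  # newest first
    parsed = [(s, datetime.strptime(s, STAMP_FORMAT)) for s in ordered]
    sundays = [s for s, d in parsed if d.weekday() == 6]  # Monday == 0
    firsts = [s for s, d in parsed if d.day == 1]
    return set(ordered[:7]) | set(sundays[:4]) | set(firsts[:6])
```

The cleanup script would then delete every snapshot directory whose name is not in the returned set — ideally only after validating that at least one kept snapshot passes `validate`.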
**Done** (kept for historical reference):

- ~~Implement `restore_runtime_backup()` as a proper module function so the restore isn't a manual `sudo cp` dance~~ — landed 2026-04-09 in commit 3362080, followed by the Chroma bind-mount fix from the first real drill.

## Quickstart cheat sheet

```bash
# Daily backup (DB + registry only — fast)
curl -fsS -X POST http://127.0.0.1:8100/admin/backup \
  -H "Content-Type: application/json" -d '{}'

# Weekly backup (DB + registry + Chroma — slower, holds ingestion lock)
curl -fsS -X POST http://127.0.0.1:8100/admin/backup \
  -H "Content-Type: application/json" -d '{"include_chroma": true}'

# List backups
curl -fsS http://127.0.0.1:8100/admin/backup | jq '.backups[].stamp'

# Validate the most recent backup
LATEST=$(curl -fsS http://127.0.0.1:8100/admin/backup | jq -r '.backups[-1].stamp')
curl -fsS http://127.0.0.1:8100/admin/backup/$LATEST/validate | jq .

# Full restore (service must be stopped first)
cd /srv/storage/atocore/app/deploy/dalidou
docker compose down
docker compose run --rm --entrypoint python atocore \
  -m atocore.ops.backup restore $STAMP --confirm-service-stopped
docker compose up -d

# Live drill: exercise the full create -> mutate -> restore flow
# against the running service. The marker memory uses
# memory_type=episodic (valid types: identity, preference, project,
# episodic, knowledge, adaptation) and project=drill so it's easy
# to find via GET /memory?project=drill before and after.
#
# See the "Restore drill" section above for the full sequence.
STAMP=$(curl -fsS -X POST http://127.0.0.1:8100/admin/backup \
  -H 'Content-Type: application/json' \
  -d '{"include_chroma": true}' | jq -r '.backup_root' | awk -F/ '{print $NF}')
curl -fsS -X POST http://127.0.0.1:8100/memory \
  -H 'Content-Type: application/json' \
  -d '{"memory_type":"episodic","content":"DRILL-MARKER","project":"drill","confidence":1.0}'
cd /srv/storage/atocore/app/deploy/dalidou
docker compose down
docker compose run --rm --entrypoint python atocore \
  -m atocore.ops.backup restore $STAMP --confirm-service-stopped
docker compose up -d

# Marker should be gone:
curl -fsS 'http://127.0.0.1:8100/memory?project=drill' | jq .
```
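The final marker check can be scripted instead of eyeballed. A sketch of the predicate — the response shape assumed here (`memories` list with `content` fields) is an illustration, not the documented payload; adjust the key names to what `GET /memory` actually returns:

```python
def drill_marker_present(memory_response: dict, marker: str = "DRILL-MARKER") -> bool:
    """Check a GET /memory?project=drill response body for the drill marker.

    Returns True if any returned memory still contains the marker string,
    i.e. the restore did NOT roll the drill mutation back.
    """
    return any(
        marker in (m.get("content") or "")
        for m in memory_response.get("memories", [])
    )
```

A drill passes when this returns `False` on the post-restore response and `True` on the pre-restore one.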