# Backup / Restore Drill

## Purpose

Before turning on any automation that writes to AtoCore continuously (auto-capture of Claude Code sessions, automated source ingestion, reinforcement sweeps), we need to know — with certainty — that a backup can actually be restored. A backup you've never restored is not a backup; it's a file that happens to be named that way.

This runbook walks through the canonical drill: take a snapshot, mutate live state, stop the service, restore from the snapshot, start the service, and verify the mutation is reversed. When the drill passes, the runtime store has a trustworthy rollback.

## What gets backed up

`src/atocore/ops/backup.py::create_runtime_backup()` writes the following into `$ATOCORE_BACKUP_DIR/snapshots/<STAMP>/`:

| Component | How | Hot/Cold | Notes |
|---|---|---|---|
| SQLite (`atocore.db`) | `conn.backup()` online API | **hot** | Safe with the service running; self-contained main file, no WAL sidecar. |
| Project registry JSON | file copy | cold | Only if the file exists. |
| Chroma vector store | `shutil.copytree` | **cold** | Only when `include_chroma=True`. The caller must hold `exclusive_ingestion()` so nothing writes during the copy — the `POST /admin/backup?include_chroma=true` route does this automatically. |
| `backup-metadata.json` | JSON blob | — | Records paths, sizes, and whether Chroma was included. Restore reads this to know what to pull back. |

Things that are **not** in the backup and must be handled separately:

- The `.env` file under `deploy/dalidou/` — secrets live outside git and outside the backup on purpose. The operator must re-place it on any fresh host.
- The source content under `sources/vault` and `sources/drive` — these are read-only inputs by convention, owned by AtoVault / AtoDrive, and backed up there.
- Any transient runtime state (in-flight HTTP requests, ingestion queues). Stop the service cleanly if you care about those.
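The hot SQLite row in the table relies on sqlite's online backup API rather than a raw file copy, which is what makes it safe while the service still holds the db open. A minimal self-contained sketch of that technique (the function name and paths are ours, not AtoCore's):

```python
import sqlite3
from pathlib import Path

def snapshot_sqlite(live_db: Path, dest: Path) -> None:
    """Copy a live SQLite database using the online backup API.

    Unlike `cp`, Connection.backup() copies pages consistently even while
    other connections are writing, and the destination comes out as a
    self-contained main file with no -wal/-shm sidecars to carry along.
    """
    src = sqlite3.connect(live_db)
    dst = sqlite3.connect(dest)
    try:
        src.backup(dst)  # page-by-page consistent copy, WAL contents included
    finally:
        dst.close()
        src.close()
```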
## What restore does

`restore_runtime_backup(stamp, confirm_service_stopped=True)`:

1. **Validates** the backup first via `validate_backup()` — refuses to run on any error (missing metadata, corrupt snapshot db, etc.).
2. **Takes a pre-restore safety snapshot** of the current state (SQLite only, not Chroma — to keep it fast) and returns its stamp. This is the reversibility guarantee: if the restore was the wrong call, you can roll it back by restoring the pre-restore snapshot.
3. **Forces a WAL checkpoint** on the current db (`PRAGMA wal_checkpoint(TRUNCATE)`) to flush any lingering writes and release OS file handles on `-wal`/`-shm`, so the copy step won't race a half-open sqlite connection.
4. **Removes stale WAL/SHM sidecars** next to the target db. The snapshot `.db` is a self-contained main-file image with no WAL of its own; leftover `-wal` from the old live process would desync against the restored main file.
5. **Copies the snapshot db** over the live db path.
6. **Restores the registry JSON** if the snapshot captured one.
7. **Restores the Chroma tree** if the snapshot captured one and `include_chroma` resolves to true (defaults to whether the snapshot has Chroma).
8. **Runs `PRAGMA integrity_check`** on the restored db and reports the result alongside a summary of what was touched.

If `confirm_service_stopped` is not passed, the function refuses — this is deliberate. Hot-restoring into a running service is not supported and would corrupt state.

## The drill

Run this from a Dalidou host with the AtoCore container already deployed and healthy. The whole drill takes under two minutes. It does not touch source content or disturb any `.env` secrets.

### Step 1. Capture a snapshot via the HTTP API

The running service holds the db; use the admin route so the Chroma snapshot is taken under `exclusive_ingestion()`.
The endpoint takes a JSON body (not a query string):

```bash
curl -fsS -X POST 'http://127.0.0.1:8100/admin/backup' \
  -H 'Content-Type: application/json' \
  -d '{"include_chroma": true}' \
  | python3 -m json.tool
```

Record the `backup_root` and note the stamp (the last path segment, e.g. `20260409T012345Z`). That stamp is the input to the restore step.

### Step 2. Record a known piece of live state

Pick something small and unambiguous to use as a marker. The simplest is the current health snapshot plus a memory count:

```bash
curl -fsS 'http://127.0.0.1:8100/health' | python3 -m json.tool
```

Note the `memory_count`, `interaction_count`, and `build_sha`. These are your pre-drill baseline.

### Step 3. Mutate live state AFTER the backup

Write something the restore should reverse. Any write endpoint is fine — a throwaway test memory is the cleanest. The request body must include `memory_type` (the AtoCore memory schema requires it):

```bash
curl -fsS -X POST 'http://127.0.0.1:8100/memory' \
  -H 'Content-Type: application/json' \
  -d '{
    "memory_type": "note",
    "content": "DRILL-MARKER: this memory should not survive the restore",
    "project": "drill",
    "confidence": 1.0
  }' \
  | python3 -m json.tool
```

Record the returned `id`. Confirm it's there:

```bash
curl -fsS 'http://127.0.0.1:8100/health' | python3 -m json.tool
# memory_count should be baseline + 1

# And you can list the drill-project memories directly:
curl -fsS 'http://127.0.0.1:8100/memory?project=drill' | python3 -m json.tool
# should return the DRILL-MARKER memory
```

### Step 4. Stop the service

```bash
cd /srv/storage/atocore/app/deploy/dalidou
docker compose down
```

Wait for the container to actually exit:

```bash
docker compose ps
# atocore should be gone or Exited
```

### Step 5. Restore from the snapshot

Run the restore inside a one-shot container that reuses the same volumes as the live service. This guarantees the paths resolve identically to the running container's view.
```bash
cd /srv/storage/atocore/app/deploy/dalidou
docker compose run --rm --entrypoint python atocore \
  -m atocore.ops.backup restore \
  <STAMP> \
  --confirm-service-stopped
```

The output is JSON; the important fields are:

- `pre_restore_snapshot`: stamp of the safety snapshot of live state at the moment of restore. **Write this down.** If the restore turns out to be the wrong call, this is how you roll it back.
- `db_restored`: `true`
- `registry_restored`: `true` if the backup had a registry
- `chroma_restored`: `true` if the backup had a chroma snapshot
- `restored_integrity_ok`: **must be `true`** — if this is false, STOP and do not start the service; investigate the integrity error first.

If restoration fails at any step, the function raises a clean `RuntimeError` and nothing partial is committed past the main file swap. The pre-restore safety snapshot is your rollback anchor.

### Step 6. Start the service back up

```bash
cd /srv/storage/atocore/app/deploy/dalidou
docker compose up -d
```

Wait for `/health` to respond:

```bash
for i in 1 2 3 4 5 6 7 8 9 10; do
  curl -fsS 'http://127.0.0.1:8100/health' && break \
    || { echo "not ready ($i/10)"; sleep 3; }
done
```

### Step 7. Verify the drill marker is gone

```bash
curl -fsS 'http://127.0.0.1:8100/health' | python3 -m json.tool
# memory_count should equal the Step 2 baseline, NOT baseline + 1
```

You can also list the drill-project memories directly:

```bash
curl -fsS 'http://127.0.0.1:8100/memory?project=drill' | python3 -m json.tool
# should return an empty list — the DRILL-MARKER memory was rolled back
```

For a semantic-retrieval cross-check, issue a query (the `/query` endpoint takes `prompt`, not `query`):

```bash
curl -fsS -X POST 'http://127.0.0.1:8100/query' \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "DRILL-MARKER drill marker", "top_k": 5}' \
  | python3 -m json.tool
# should not return the DRILL-MARKER memory in the hits
```

If the marker is gone and `memory_count` matches the baseline, the drill **passed**. The runtime store has a trustworthy rollback.

### Step 8. (Optional) Clean up the safety snapshot

If everything went smoothly, you can leave the pre-restore safety snapshot on disk for a few days as a paranoia buffer. There's no automatic cleanup yet — `list_runtime_backups()` will show it, and you can remove it by hand once you're confident:

```bash
rm -rf /srv/storage/atocore/backups/snapshots/<STAMP>
```

## Failure modes and recovery

### Restore reports `restored_integrity_ok: false`

The copied db failed `PRAGMA integrity_check`. Do **not** start the service. This usually means either the source snapshot was itself corrupt (and `validate_backup` should have caught it — file a bug if it didn't), or the copy was interrupted. Options:

1. Validate the source snapshot directly: `python -m atocore.ops.backup validate <STAMP>`
2. Pick a different, older snapshot and retry the restore.
3. Roll the db back to your pre-restore safety snapshot.
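The `restored_integrity_ok` field boils down to the same pragma the restore runs as its final step, and you can point it at any snapshot `.db` by hand. A standalone sketch (the helper name is ours, not AtoCore's):

```python
import sqlite3

def integrity_ok(db_path: str) -> bool:
    """Return True when PRAGMA integrity_check reports a single 'ok' row.

    On a corrupt database the pragma instead returns one row per problem
    found, so anything other than [('ok',)] means: do not start the service.
    """
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    finally:
        conn.close()
    return rows == [("ok",)]
```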
### The live container won't start after restore

Check the container logs:

```bash
cd /srv/storage/atocore/app/deploy/dalidou
docker compose logs --tail=100 atocore
```

Common causes:

- Schema drift between the snapshot and the current code version. `_apply_migrations` in `src/atocore/models/database.py` is idempotent and should absorb most forward migrations, but a backward restore (running new code against an older snapshot) may hit unexpected state. The migration only ADDs columns, so the opposite direction is usually safe, but verify.
- Chroma and SQLite disagreeing about what chunks exist. The backup captures them together to minimize this, but if you restore SQLite without Chroma (`--no-chroma`), retrieval may return stale vectors. Re-ingest if this happens.

### The drill marker is still present after restore

Something went wrong. Possible causes:

- You restored a snapshot taken AFTER the drill marker was written (wrong stamp).
- The service was writing during the drill and committed the marker before `docker compose down`. Double-check the order.
- The restore silently skipped the db step. Check the restore output for `db_restored: true` and `restored_integrity_ok: true`.

Roll back to the pre-restore safety snapshot and retry with the correct source snapshot.

## When to run this drill

- **Before** enabling any new write-path automation (auto-capture, automated ingestion, reinforcement sweeps, scheduled extraction).
- **After** any change to `src/atocore/ops/backup.py` or the schema migrations in `src/atocore/models/database.py`.
- **After** a Dalidou OS upgrade or docker version bump.
- **Monthly** as a standing operational check.

Record each drill run (pass/fail) somewhere durable — even a line in the project journal is enough. A drill you ran once and never again is barely more than a drill you never ran.
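That durable record can be as small as an append-only JSONL line per drill. A minimal sketch (the journal path and field names here are hypothetical, not part of AtoCore):

```python
import datetime
import json

def log_drill(journal_path: str, snapshot_stamp: str, passed: bool) -> None:
    """Append one pass/fail drill record as a JSON line."""
    entry = {
        "when": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "snapshot": snapshot_stamp,
        "result": "pass" if passed else "fail",
    }
    with open(journal_path, "a") as f:
        f.write(json.dumps(entry) + "\n")
```

A line like this per run is enough to answer, months later, "when did we last prove a restore works?"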