ops: add restore_runtime_backup + drill runbook

Close the backup side of the loop: we had create/list/validate but
no restore, and no documented drill. A backup you've never restored
is not a backup. This lands the missing restore surface and the
procedure to exercise it before enabling any write-path automation
(auto-capture, automated ingestion, reinforcement sweeps).

Code — src/atocore/ops/backup.py:

- restore_runtime_backup(stamp, *, include_chroma, pre_restore_snapshot,
  confirm_service_stopped) performs:
  1. validate_backup() gate — refuse on any error
  2. pre-restore safety snapshot of current state (reversibility anchor)
  3. PRAGMA wal_checkpoint(TRUNCATE) on target db (flush + release
     OS handles; Windows needs this after conn.backup() reads)
  4. unlink stale -wal/-shm sidecars (tolerant to Windows lock races)
  5. shutil.copy2 snapshot db over target
  6. restore registry if snapshot captured one
  7. restore Chroma tree if snapshot captured one and include_chroma
     resolves to true (defaults to whether backup has Chroma)
  8. PRAGMA integrity_check on restored db, report result
- Refuses without confirm_service_stopped=True to prevent hot-restore
  into a running service (would corrupt SQLite state)
- Rewrote main() as argparse with 4 subcommands: create, list,
  validate, restore. `python -m atocore.ops.backup restore STAMP
  --confirm-service-stopped` is the drill CLI entry point, run via
  `docker compose run --rm --entrypoint python atocore` so it reuses
  the live service's volume mounts

Tests — tests/test_backup.py (6 new):

- test_restore_refuses_without_confirm_service_stopped
- test_restore_raises_on_invalid_backup
- test_restore_round_trip_reverses_post_backup_mutations
  (canonical drill flow: seed -> backup -> mutate -> restore ->
   mutation gone + baseline survived + pre-restore snapshot has
   the mutation captured as rollback anchor)
- test_restore_round_trip_with_chroma
- test_restore_skips_pre_snapshot_when_requested
- test_restore_cleans_stale_wal_sidecars (asserts stale byte
  markers do not survive, not file existence, since PRAGMA
  integrity_check may legitimately recreate -wal)

Docs — docs/backup-restore-drill.md (new):

- What gets backed up (hot sqlite, cold chroma, registry JSON,
  metadata.json) and what doesn't (.env, source content)
- What restore does, step by step, and why confirm_service_stopped
  is a hard gate
- 8-step drill procedure: capture -> baseline -> mutate -> stop ->
  restore -> start -> verify marker gone -> optional cleanup
- Correct endpoint bodies verified against routes.py:
    POST /admin/backup with JSON body {"include_chroma": true}
    POST /memory with memory_type/content/project/confidence
    GET /memory?project=drill to list drill markers
    POST /query with {"prompt": ..., "top_k": ...} (not "query")
- Failure modes: integrity_check fail, container won't start,
  marker still present after restore, with remediation for each
- When to run: before new write-path automation, after backup.py
  or schema changes, after infra bumps, monthly as standing check

225/225 tests passing (219 existing + 6 new restore).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 21:17:48 -04:00
parent 03822389a1
commit 336208004c
3 changed files with 782 additions and 2 deletions

# Backup / Restore Drill
## Purpose
Before turning on any automation that writes to AtoCore continuously
(auto-capture of Claude Code sessions, automated source ingestion,
reinforcement sweeps), we need to know — with certainty — that a
backup can actually be restored. A backup you've never restored is
not a backup; it's a file that happens to be named that way.
This runbook walks through the canonical drill: take a snapshot,
mutate live state, stop the service, restore from the snapshot,
start the service, and verify the mutation is reversed. When the
drill passes, the runtime store has a trustworthy rollback.
## What gets backed up
`src/atocore/ops/backup.py::create_runtime_backup()` writes the
following into `$ATOCORE_BACKUP_DIR/snapshots/<stamp>/`:

| Component | How | Hot/Cold | Notes |
|---|---|---|---|
| SQLite (`atocore.db`) | `conn.backup()` online API | **hot** | Safe with service running; self-contained main file, no WAL sidecar. |
| Project registry JSON | file copy | cold | Only if the file exists. |
| Chroma vector store | `shutil.copytree` | **cold** | Only when `include_chroma=True`. Caller must hold `exclusive_ingestion()` so nothing writes during the copy — the `POST /admin/backup?include_chroma=true` route does this automatically. |
| `backup-metadata.json` | JSON blob | — | Records paths, sizes, and whether Chroma was included. Restore reads this to know what to pull back. |

Things that are **not** in the backup and must be handled separately:
- The `.env` file under `deploy/dalidou/` — secrets live out of git
and out of the backup on purpose. The operator must re-place it
on any fresh host.
- The source content under `sources/vault` and `sources/drive`:
these are read-only inputs by convention, owned by AtoVault /
AtoDrive, and backed up there.
- Any running transient state (in-flight HTTP requests, ingestion
queues). Stop the service cleanly if you care about those.
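The hot SQLite row in the table relies on sqlite3's online backup API. A minimal sketch of that copy step, with illustrative paths rather than the real `create_runtime_backup()` signature:

```python
import sqlite3
from pathlib import Path


def snapshot_sqlite(live_db: Path, snapshot_dir: Path) -> Path:
    """Copy a live SQLite db using the online backup API.

    Unlike a plain file copy, conn.backup() is safe while other
    connections are writing, and it emits a self-contained
    main-file image with no -wal/-shm sidecars of its own.
    """
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    dest = snapshot_dir / live_db.name
    src = sqlite3.connect(live_db)
    dst = sqlite3.connect(dest)
    try:
        src.backup(dst)
    finally:
        dst.close()
        src.close()
    return dest
```

That self-contained property is what lets the restore side treat the snapshot `.db` as a clean main-file image with no WAL of its own.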
## What restore does
`restore_runtime_backup(stamp, confirm_service_stopped=True)`:
1. **Validates** the backup first via `validate_backup()` and
refuses to run on any error (missing metadata, corrupt snapshot
db, etc.).
2. **Takes a pre-restore safety snapshot** of the current state
(SQLite only, not Chroma — to keep it fast) and returns its
stamp. This is the reversibility guarantee: if the restore was
the wrong call, you can roll it back by restoring the
pre-restore snapshot.
3. **Forces a WAL checkpoint** on the current db
(`PRAGMA wal_checkpoint(TRUNCATE)`) to flush any lingering
writes and release OS file handles on `-wal`/`-shm`, so the
copy step won't race a half-open sqlite connection.
4. **Removes stale WAL/SHM sidecars** next to the target db.
The snapshot `.db` is a self-contained main-file image with no
WAL of its own; leftover `-wal` from the old live process
would desync against the restored main file.
5. **Copies the snapshot db** over the live db path.
6. **Restores the registry JSON** if the snapshot captured one.
7. **Restores the Chroma tree** if the snapshot captured one and
`include_chroma` resolves to true (defaults to whether the
snapshot has Chroma).
8. **Runs `PRAGMA integrity_check`** on the restored db and
reports the result alongside a summary of what was touched.
If `confirm_service_stopped` is not passed, the function refuses —
this is deliberate. Hot-restoring into a running service is not
supported and would corrupt state.
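Steps 3 through 5 plus the final integrity gate condense to a few lines. This is an illustrative sketch of that core sequence only, not the actual `restore_runtime_backup()` body (which also handles validation, the safety snapshot, the registry, and Chroma):

```python
import shutil
import sqlite3
from pathlib import Path


def swap_in_snapshot(snapshot_db: Path, live_db: Path) -> bool:
    """Checkpoint, clear sidecars, copy the snapshot in, verify it."""
    # Step 3: flush lingering writes and let SQLite release the
    # -wal/-shm handles before we touch files underneath it.
    conn = sqlite3.connect(live_db)
    conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
    conn.close()
    # Step 4: a stale -wal from the old live process would desync
    # against the restored main file, so drop sidecars first.
    for suffix in ("-wal", "-shm"):
        live_db.with_name(live_db.name + suffix).unlink(missing_ok=True)
    # Step 5: the snapshot is a self-contained main-file image.
    shutil.copy2(snapshot_db, live_db)
    # Step 8: never declare success without checking the result.
    check = sqlite3.connect(live_db)
    (result,) = check.execute("PRAGMA integrity_check").fetchone()
    check.close()
    return result == "ok"
```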
## The drill
Run this from a Dalidou host with the AtoCore container already
deployed and healthy. The whole drill takes under two minutes. It
does not touch source content or disturb any `.env` secrets.
### Step 1. Capture a snapshot via the HTTP API
The running service holds the db; use the admin route so the
Chroma snapshot is taken under `exclusive_ingestion()`. The
endpoint takes a JSON body (not a query string):
```bash
curl -fsS -X POST 'http://127.0.0.1:8100/admin/backup' \
-H 'Content-Type: application/json' \
-d '{"include_chroma": true}' \
| python3 -m json.tool
```
Record the `backup_root` and note the stamp (the last path segment,
e.g. `20260409T012345Z`). That stamp is the input to the restore
step.
### Step 2. Record a known piece of live state
Pick something small and unambiguous to use as a marker. The
simplest is the current health snapshot plus a memory count:
```bash
curl -fsS 'http://127.0.0.1:8100/health' | python3 -m json.tool
```
Note the `memory_count`, `interaction_count`, and `build_sha`. These
are your pre-drill baseline.
### Step 3. Mutate live state AFTER the backup
Write something the restore should reverse. Any write endpoint is
fine — a throwaway test memory is the cleanest. The request body
must include `memory_type` (the AtoCore memory schema requires it):
```bash
curl -fsS -X POST 'http://127.0.0.1:8100/memory' \
-H 'Content-Type: application/json' \
-d '{
"memory_type": "note",
"content": "DRILL-MARKER: this memory should not survive the restore",
"project": "drill",
"confidence": 1.0
}' \
| python3 -m json.tool
```
Record the returned `id`. Confirm it's there:
```bash
curl -fsS 'http://127.0.0.1:8100/health' | python3 -m json.tool
# memory_count should be baseline + 1
# And you can list the drill-project memories directly:
curl -fsS 'http://127.0.0.1:8100/memory?project=drill' | python3 -m json.tool
# should return the DRILL-MARKER memory
```
### Step 4. Stop the service
```bash
cd /srv/storage/atocore/app/deploy/dalidou
docker compose down
```
Wait for the container to actually exit:
```bash
docker compose ps
# atocore should be gone or Exited
```
### Step 5. Restore from the snapshot
Run the restore inside a one-shot container that reuses the same
volumes as the live service. This guarantees the paths resolve
identically to the running container's view.
```bash
cd /srv/storage/atocore/app/deploy/dalidou
docker compose run --rm --entrypoint python atocore \
-m atocore.ops.backup restore \
<YOUR_STAMP_FROM_STEP_1> \
--confirm-service-stopped
```
The output is JSON; the important fields are:
- `pre_restore_snapshot`: stamp of the safety snapshot of live
state at the moment of restore. **Write this down.** If the
restore turns out to be the wrong call, this is how you roll
it back.
- `db_restored`: `true`
- `registry_restored`: `true` if the backup had a registry
- `chroma_restored`: `true` if the backup had a chroma snapshot
- `restored_integrity_ok`: **must be `true`** — if this is false,
STOP and do not start the service; investigate the integrity
error first.
If restoration fails at any step, the function raises a clean
`RuntimeError` and nothing partial is committed past the main file
swap. The pre-restore safety snapshot is your rollback anchor.
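If you script the drill, those fields can be gated mechanically. A small checker sketch (the field names match the list above; the helper itself is hypothetical, not part of the backup module):

```python
import json


def check_restore_output(raw: str) -> str:
    """Parse the restore JSON and return the rollback stamp.

    Raises on a failed hard gate, mirroring the manual rule: if
    restored_integrity_ok is false, STOP before starting the service.
    """
    report = json.loads(raw)
    for field in ("db_restored", "restored_integrity_ok"):
        if report.get(field) is not True:
            raise RuntimeError(
                f"restore gate failed: {field}={report.get(field)!r}"
            )
    return report["pre_restore_snapshot"]
```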
### Step 6. Start the service back up
```bash
cd /srv/storage/atocore/app/deploy/dalidou
docker compose up -d
```
Wait for `/health` to respond:
```bash
for i in 1 2 3 4 5 6 7 8 9 10; do
curl -fsS 'http://127.0.0.1:8100/health' \
&& break || { echo "not ready ($i/10)"; sleep 3; }
done
```
### Step 7. Verify the drill marker is gone
```bash
curl -fsS 'http://127.0.0.1:8100/health' | python3 -m json.tool
# memory_count should equal the Step 2 baseline, NOT baseline + 1
```
You can also list the drill-project memories directly:
```bash
curl -fsS 'http://127.0.0.1:8100/memory?project=drill' | python3 -m json.tool
# should return an empty list — the DRILL-MARKER memory was rolled back
```
For a semantic-retrieval cross-check, issue a query (the `/query`
endpoint takes `prompt`, not `query`):
```bash
curl -fsS -X POST 'http://127.0.0.1:8100/query' \
-H 'Content-Type: application/json' \
-d '{"prompt": "DRILL-MARKER drill marker", "top_k": 5}' \
| python3 -m json.tool
# should not return the DRILL-MARKER memory in the hits
```
If the marker is gone and `memory_count` matches the baseline, the
drill **passed**. The runtime store has a trustworthy rollback.
### Step 8. (Optional) Clean up the safety snapshot
If everything went smoothly you can leave the pre-restore safety
snapshot on disk for a few days as a paranoia buffer. There's no
automatic cleanup yet — `list_runtime_backups()` will show it, and
you can remove it by hand once you're confident:
```bash
rm -rf /srv/storage/atocore/backups/snapshots/<pre_restore_stamp>
```
## Failure modes and recovery
### Restore reports `restored_integrity_ok: false`
The copied db failed `PRAGMA integrity_check`. Do **not** start
the service. This usually means either the source snapshot was
itself corrupt (and `validate_backup` should have caught it — file
a bug if it didn't), or the copy was interrupted. Options:
1. Validate the source snapshot directly:
`python -m atocore.ops.backup validate <STAMP>`
2. Pick a different, older snapshot and retry the restore.
3. Roll the db back to your pre-restore safety snapshot.
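All three options bottom out in SQLite's own integrity check, which you can also run against any snapshot file directly, without going through the backup CLI (a minimal read-only sketch, stdlib only):

```python
import sqlite3
from pathlib import Path


def integrity_ok(db_path: Path) -> bool:
    """True iff PRAGMA integrity_check reports exactly 'ok'.

    Opens read-only (URI mode) so the check itself cannot
    mutate the file under inspection.
    """
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    finally:
        conn.close()
    return rows == [("ok",)]
```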
### The live container won't start after restore
Check the container logs:
```bash
cd /srv/storage/atocore/app/deploy/dalidou
docker compose logs --tail=100 atocore
```
Common causes:
- Schema drift between the snapshot and the current code version.
`_apply_migrations` in `src/atocore/models/database.py` is
idempotent and should absorb most forward migrations, but a
backward restore (running new code against an older snapshot)
may hit unexpected state. The migration only ADDs columns, so
the opposite direction is usually safe, but verify.
- Chroma and SQLite disagreeing about what chunks exist. The
backup captures them together to minimize this, but if you
restore SQLite without Chroma (`--no-chroma`), retrieval may
return stale vectors. Re-ingest if this happens.
### The drill marker is still present after restore
Something went wrong. Possible causes:
- You restored a snapshot taken AFTER the drill marker was
written (wrong stamp).
- The service was writing during the drill and committed the
marker before `docker compose down`. Double-check the order.
- The restore silently skipped the db step. Check the restore
output for `db_restored: true` and `restored_integrity_ok: true`.
Roll back to the pre-restore safety snapshot and retry with the
correct source snapshot.
## When to run this drill
- **Before** enabling any new write-path automation (auto-capture,
automated ingestion, reinforcement sweeps, scheduled extraction).
- **After** any change to `src/atocore/ops/backup.py` or the
schema migrations in `src/atocore/models/database.py`.
- **After** a Dalidou OS upgrade or docker version bump.
- **Monthly** as a standing operational check.
Record each drill run (pass/fail) somewhere durable — even a line
in the project journal is enough. A drill you ran once and never
again is barely more than a drill you never ran.
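One low-ceremony way to make that record durable is an append-only JSON-lines log kept next to the snapshots. The path and field names below are suggestions, not an existing AtoCore convention:

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def record_drill(log_path: Path, stamp: str, passed: bool,
                 note: str = "") -> None:
    """Append one drill result as a JSON line; never rewrites history."""
    entry = {
        "ran_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "snapshot_stamp": stamp,
        "passed": passed,
        "note": note,
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
```

Append-only means a failed drill can never be papered over by a later passing one; both lines survive.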