ops: add restore_runtime_backup + drill runbook

Close the backup side of the loop: we had create/list/validate but
no restore, and no documented drill. A backup you've never restored
is not a backup. This lands the missing restore surface and the
procedure to exercise it before enabling any write-path automation
(auto-capture, automated ingestion, reinforcement sweeps).

Code — src/atocore/ops/backup.py:

- restore_runtime_backup(stamp, *, include_chroma, pre_restore_snapshot,
  confirm_service_stopped) performs:
  1. validate_backup() gate — refuse on any error
  2. pre-restore safety snapshot of current state (reversibility anchor)
  3. PRAGMA wal_checkpoint(TRUNCATE) on target db (flush + release
     OS handles; Windows needs this after conn.backup() reads)
  4. unlink stale -wal/-shm sidecars (tolerant to Windows lock races)
  5. shutil.copy2 snapshot db over target
  6. restore registry if snapshot captured one
  7. restore Chroma tree if snapshot captured one and include_chroma
     resolves to true (defaults to whether backup has Chroma)
  8. PRAGMA integrity_check on restored db, report result
- Refuses without confirm_service_stopped=True to prevent hot-restore
  into a running service (would corrupt SQLite state)
- Rewrote main() as argparse with 4 subcommands: create, list,
  validate, restore. `python -m atocore.ops.backup restore STAMP
  --confirm-service-stopped` is the drill CLI entry point, run via
  `docker compose run --rm --entrypoint python atocore` so it reuses
  the live service's volume mounts

Tests — tests/test_backup.py (6 new):

- test_restore_refuses_without_confirm_service_stopped
- test_restore_raises_on_invalid_backup
- test_restore_round_trip_reverses_post_backup_mutations
  (canonical drill flow: seed -> backup -> mutate -> restore ->
   mutation gone + baseline survived + pre-restore snapshot has
   the mutation captured as rollback anchor)
- test_restore_round_trip_with_chroma
- test_restore_skips_pre_snapshot_when_requested
- test_restore_cleans_stale_wal_sidecars (asserts stale byte
  markers do not survive, not file existence, since PRAGMA
  integrity_check may legitimately recreate -wal)

Docs — docs/backup-restore-drill.md (new):

- What gets backed up (hot sqlite, cold chroma, registry JSON,
  metadata.json) and what doesn't (.env, source content)
- What restore does, step by step, and why confirm_service_stopped
  is a hard gate
- 8-step drill procedure: capture -> baseline -> mutate -> stop ->
  restore -> start -> verify marker gone -> optional cleanup
- Correct endpoint bodies verified against routes.py:
    POST /admin/backup with JSON body {"include_chroma": true}
    POST /memory with memory_type/content/project/confidence
    GET /memory?project=drill to list drill markers
    POST /query with {"prompt": ..., "top_k": ...} (not "query")
- Failure modes: integrity_check fail, container won't start,
  marker still present after restore, with remediation for each
- When to run: before new write-path automation, after backup.py
  or schema changes, after infra bumps, monthly as standing check

225/225 tests passing (219 existing + 6 new restore).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 21:17:48 -04:00
parent 03822389a1
commit 336208004c
3 changed files with 782 additions and 2 deletions

# Backup / Restore Drill
## Purpose
Before turning on any automation that writes to AtoCore continuously
(auto-capture of Claude Code sessions, automated source ingestion,
reinforcement sweeps), we need to know — with certainty — that a
backup can actually be restored. A backup you've never restored is
not a backup; it's a file that happens to be named that way.
This runbook walks through the canonical drill: take a snapshot,
mutate live state, stop the service, restore from the snapshot,
start the service, and verify the mutation is reversed. When the
drill passes, the runtime store has a trustworthy rollback.
## What gets backed up
`src/atocore/ops/backup.py::create_runtime_backup()` writes the
following into `$ATOCORE_BACKUP_DIR/snapshots/<stamp>/`:

| Component | How | Hot/Cold | Notes |
|---|---|---|---|
| SQLite (`atocore.db`) | `conn.backup()` online API | **hot** | Safe with service running; self-contained main file, no WAL sidecar. |
| Project registry JSON | file copy | cold | Only if the file exists. |
| Chroma vector store | `shutil.copytree` | **cold** | Only when `include_chroma=True`. Caller must hold `exclusive_ingestion()` so nothing writes during the copy — the `POST /admin/backup?include_chroma=true` route does this automatically. |
| `backup-metadata.json` | JSON blob | — | Records paths, sizes, and whether Chroma was included. Restore reads this to know what to pull back. |

Things that are **not** in the backup and must be handled separately:
- The `.env` file under `deploy/dalidou/` — secrets live out of git
and out of the backup on purpose. The operator must re-place it
on any fresh host.
- The source content under `sources/vault` and `sources/drive`:
these are read-only inputs by convention, owned by AtoVault /
AtoDrive, and backed up there.
- Any running transient state (in-flight HTTP requests, ingestion
queues). Stop the service cleanly if you care about those.
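The hot SQLite row in the table relies on sqlite3's online backup API. A minimal sketch of that copy step, with illustrative paths rather than the real `create_runtime_backup()` signature:

```python
import sqlite3
from pathlib import Path


def snapshot_sqlite(live_db: Path, snapshot_dir: Path) -> Path:
    """Copy a live SQLite db using the online backup API.

    Unlike a plain file copy, conn.backup() is safe while other
    connections are writing, and it emits a self-contained
    main-file image with no -wal/-shm sidecars of its own.
    """
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    dest = snapshot_dir / live_db.name
    src = sqlite3.connect(live_db)
    dst = sqlite3.connect(dest)
    try:
        src.backup(dst)
    finally:
        dst.close()
        src.close()
    return dest
```

That self-contained property is what lets the restore side treat the snapshot `.db` as a clean main-file image with no WAL of its own.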
## What restore does
`restore_runtime_backup(stamp, confirm_service_stopped=True)`:
1. **Validates** the backup first via `validate_backup()` and
refuses to run on any error (missing metadata, corrupt snapshot
db, etc.).
2. **Takes a pre-restore safety snapshot** of the current state
(SQLite only, not Chroma — to keep it fast) and returns its
stamp. This is the reversibility guarantee: if the restore was
the wrong call, you can roll it back by restoring the
pre-restore snapshot.
3. **Forces a WAL checkpoint** on the current db
(`PRAGMA wal_checkpoint(TRUNCATE)`) to flush any lingering
writes and release OS file handles on `-wal`/`-shm`, so the
copy step won't race a half-open sqlite connection.
4. **Removes stale WAL/SHM sidecars** next to the target db.
The snapshot `.db` is a self-contained main-file image with no
WAL of its own; leftover `-wal` from the old live process
would desync against the restored main file.
5. **Copies the snapshot db** over the live db path.
6. **Restores the registry JSON** if the snapshot captured one.
7. **Restores the Chroma tree** if the snapshot captured one and
`include_chroma` resolves to true (defaults to whether the
snapshot has Chroma).
8. **Runs `PRAGMA integrity_check`** on the restored db and
reports the result alongside a summary of what was touched.
If `confirm_service_stopped` is not passed, the function refuses —
this is deliberate. Hot-restoring into a running service is not
supported and would corrupt state.
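Steps 3 through 5 plus the final integrity gate condense to a few lines. This is an illustrative sketch of that core sequence only, not the actual `restore_runtime_backup()` body (which also handles validation, the safety snapshot, the registry, and Chroma):

```python
import shutil
import sqlite3
from pathlib import Path


def swap_in_snapshot(snapshot_db: Path, live_db: Path) -> bool:
    """Checkpoint, clear sidecars, copy the snapshot in, verify it."""
    # Step 3: flush lingering writes and let SQLite release the
    # -wal/-shm handles before we touch files underneath it.
    conn = sqlite3.connect(live_db)
    conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
    conn.close()
    # Step 4: a stale -wal from the old live process would desync
    # against the restored main file, so drop sidecars first.
    for suffix in ("-wal", "-shm"):
        live_db.with_name(live_db.name + suffix).unlink(missing_ok=True)
    # Step 5: the snapshot is a self-contained main-file image.
    shutil.copy2(snapshot_db, live_db)
    # Step 8: never declare success without checking the result.
    check = sqlite3.connect(live_db)
    (result,) = check.execute("PRAGMA integrity_check").fetchone()
    check.close()
    return result == "ok"
```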
## The drill
Run this from a Dalidou host with the AtoCore container already
deployed and healthy. The whole drill takes under two minutes. It
does not touch source content or disturb any `.env` secrets.
### Step 1. Capture a snapshot via the HTTP API
The running service holds the db; use the admin route so the
Chroma snapshot is taken under `exclusive_ingestion()`. The
endpoint takes a JSON body (not a query string):
```bash
curl -fsS -X POST 'http://127.0.0.1:8100/admin/backup' \
-H 'Content-Type: application/json' \
-d '{"include_chroma": true}' \
| python3 -m json.tool
```
Record the `backup_root` and note the stamp (the last path segment,
e.g. `20260409T012345Z`). That stamp is the input to the restore
step.
### Step 2. Record a known piece of live state
Pick something small and unambiguous to use as a marker. The
simplest is the current health snapshot plus a memory count:
```bash
curl -fsS 'http://127.0.0.1:8100/health' | python3 -m json.tool
```
Note the `memory_count`, `interaction_count`, and `build_sha`. These
are your pre-drill baseline.
### Step 3. Mutate live state AFTER the backup
Write something the restore should reverse. Any write endpoint is
fine — a throwaway test memory is the cleanest. The request body
must include `memory_type` (the AtoCore memory schema requires it):
```bash
curl -fsS -X POST 'http://127.0.0.1:8100/memory' \
-H 'Content-Type: application/json' \
-d '{
"memory_type": "note",
"content": "DRILL-MARKER: this memory should not survive the restore",
"project": "drill",
"confidence": 1.0
}' \
| python3 -m json.tool
```
Record the returned `id`. Confirm it's there:
```bash
curl -fsS 'http://127.0.0.1:8100/health' | python3 -m json.tool
# memory_count should be baseline + 1
# And you can list the drill-project memories directly:
curl -fsS 'http://127.0.0.1:8100/memory?project=drill' | python3 -m json.tool
# should return the DRILL-MARKER memory
```
### Step 4. Stop the service
```bash
cd /srv/storage/atocore/app/deploy/dalidou
docker compose down
```
Wait for the container to actually exit:
```bash
docker compose ps
# atocore should be gone or Exited
```
### Step 5. Restore from the snapshot
Run the restore inside a one-shot container that reuses the same
volumes as the live service. This guarantees the paths resolve
identically to the running container's view.
```bash
cd /srv/storage/atocore/app/deploy/dalidou
docker compose run --rm --entrypoint python atocore \
-m atocore.ops.backup restore \
<YOUR_STAMP_FROM_STEP_1> \
--confirm-service-stopped
```
The output is JSON; the important fields are:
- `pre_restore_snapshot`: stamp of the safety snapshot of live
state at the moment of restore. **Write this down.** If the
restore turns out to be the wrong call, this is how you roll
it back.
- `db_restored`: `true`
- `registry_restored`: `true` if the backup had a registry
- `chroma_restored`: `true` if the backup had a chroma snapshot
- `restored_integrity_ok`: **must be `true`** — if this is false,
STOP and do not start the service; investigate the integrity
error first.
If restoration fails at any step, the function raises a clean
`RuntimeError` and nothing partial is committed past the main file
swap. The pre-restore safety snapshot is your rollback anchor.
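If you script the drill, those fields can be gated mechanically. A small checker sketch (the field names match the list above; the helper itself is hypothetical, not part of the backup module):

```python
import json


def check_restore_output(raw: str) -> str:
    """Parse the restore JSON and return the rollback stamp.

    Raises on a failed hard gate, mirroring the manual rule: if
    restored_integrity_ok is false, STOP before starting the service.
    """
    report = json.loads(raw)
    for field in ("db_restored", "restored_integrity_ok"):
        if report.get(field) is not True:
            raise RuntimeError(
                f"restore gate failed: {field}={report.get(field)!r}"
            )
    return report["pre_restore_snapshot"]
```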
### Step 6. Start the service back up
```bash
cd /srv/storage/atocore/app/deploy/dalidou
docker compose up -d
```
Wait for `/health` to respond:
```bash
for i in 1 2 3 4 5 6 7 8 9 10; do
curl -fsS 'http://127.0.0.1:8100/health' \
&& break || { echo "not ready ($i/10)"; sleep 3; }
done
```
### Step 7. Verify the drill marker is gone
```bash
curl -fsS 'http://127.0.0.1:8100/health' | python3 -m json.tool
# memory_count should equal the Step 2 baseline, NOT baseline + 1
```
You can also list the drill-project memories directly:
```bash
curl -fsS 'http://127.0.0.1:8100/memory?project=drill' | python3 -m json.tool
# should return an empty list — the DRILL-MARKER memory was rolled back
```
For a semantic-retrieval cross-check, issue a query (the `/query`
endpoint takes `prompt`, not `query`):
```bash
curl -fsS -X POST 'http://127.0.0.1:8100/query' \
-H 'Content-Type: application/json' \
-d '{"prompt": "DRILL-MARKER drill marker", "top_k": 5}' \
| python3 -m json.tool
# should not return the DRILL-MARKER memory in the hits
```
If the marker is gone and `memory_count` matches the baseline, the
drill **passed**. The runtime store has a trustworthy rollback.
### Step 8. (Optional) Clean up the safety snapshot
If everything went smoothly you can leave the pre-restore safety
snapshot on disk for a few days as a paranoia buffer. There's no
automatic cleanup yet — `list_runtime_backups()` will show it, and
you can remove it by hand once you're confident:
```bash
rm -rf /srv/storage/atocore/backups/snapshots/<pre_restore_stamp>
```
## Failure modes and recovery
### Restore reports `restored_integrity_ok: false`
The copied db failed `PRAGMA integrity_check`. Do **not** start
the service. This usually means either the source snapshot was
itself corrupt (and `validate_backup` should have caught it — file
a bug if it didn't), or the copy was interrupted. Options:
1. Validate the source snapshot directly:
`python -m atocore.ops.backup validate <STAMP>`
2. Pick a different, older snapshot and retry the restore.
3. Roll the db back to your pre-restore safety snapshot.
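All three options bottom out in SQLite's own integrity check, which you can also run against any snapshot file directly, without going through the backup CLI (a minimal read-only sketch, stdlib only):

```python
import sqlite3
from pathlib import Path


def integrity_ok(db_path: Path) -> bool:
    """True iff PRAGMA integrity_check reports exactly 'ok'.

    Opens read-only (URI mode) so the check itself cannot
    mutate the file under inspection.
    """
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    finally:
        conn.close()
    return rows == [("ok",)]
```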
### The live container won't start after restore
Check the container logs:
```bash
cd /srv/storage/atocore/app/deploy/dalidou
docker compose logs --tail=100 atocore
```
Common causes:
- Schema drift between the snapshot and the current code version.
`_apply_migrations` in `src/atocore/models/database.py` is
idempotent and should absorb most forward migrations, but a
backward restore (running new code against an older snapshot)
may hit unexpected state. The migration only ADDs columns, so
the opposite direction is usually safe, but verify.
- Chroma and SQLite disagreeing about what chunks exist. The
backup captures them together to minimize this, but if you
restore SQLite without Chroma (`--no-chroma`), retrieval may
return stale vectors. Re-ingest if this happens.
### The drill marker is still present after restore
Something went wrong. Possible causes:
- You restored a snapshot taken AFTER the drill marker was
written (wrong stamp).
- The service was writing during the drill and committed the
marker before `docker compose down`. Double-check the order.
- The restore silently skipped the db step. Check the restore
output for `db_restored: true` and `restored_integrity_ok: true`.
Roll back to the pre-restore safety snapshot and retry with the
correct source snapshot.
## When to run this drill
- **Before** enabling any new write-path automation (auto-capture,
automated ingestion, reinforcement sweeps, scheduled extraction).
- **After** any change to `src/atocore/ops/backup.py` or the
schema migrations in `src/atocore/models/database.py`.
- **After** a Dalidou OS upgrade or docker version bump.
- **Monthly** as a standing operational check.
Record each drill run (pass/fail) somewhere durable — even a line
in the project journal is enough. A drill you ran once and never
again is barely more than a drill you never ran.
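One low-ceremony way to make that record durable is an append-only JSON-lines log kept next to the snapshots. The path and field names below are suggestions, not an existing AtoCore convention:

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def record_drill(log_path: Path, stamp: str, passed: bool,
                 note: str = "") -> None:
    """Append one drill result as a JSON line; never rewrites history."""
    entry = {
        "ran_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "snapshot_stamp": stamp,
        "passed": passed,
        "note": note,
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
```

Append-only means a failed drill can never be papered over by a later passing one; both lines survive.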