Close the backup side of the loop: we had create/list/validate but
no restore, and no documented drill. A backup you've never restored
is not a backup. This lands the missing restore surface and the
procedure to exercise it before enabling any write-path automation
(auto-capture, automated ingestion, reinforcement sweeps).
Code — src/atocore/ops/backup.py:
- restore_runtime_backup(stamp, *, include_chroma, pre_restore_snapshot,
confirm_service_stopped) performs:
1. validate_backup() gate — refuse on any error
2. pre-restore safety snapshot of current state (reversibility anchor)
3. PRAGMA wal_checkpoint(TRUNCATE) on target db (flush + release
OS handles; Windows needs this after conn.backup() reads)
4. unlink stale -wal/-shm sidecars (tolerant to Windows lock races)
5. shutil.copy2 snapshot db over target
6. restore registry if snapshot captured one
7. restore Chroma tree if snapshot captured one and include_chroma
resolves to true (defaults to whether backup has Chroma)
8. PRAGMA integrity_check on restored db, report result
- Refuses without confirm_service_stopped=True to prevent hot-restore
into a running service (would corrupt SQLite state)
- Rewrote main() as argparse with 4 subcommands: create, list,
validate, restore. `python -m atocore.ops.backup restore STAMP
--confirm-service-stopped` is the drill CLI entry point, run via
`docker compose run --rm --entrypoint python atocore` so it reuses
the live service's volume mounts
Tests — tests/test_backup.py (6 new):
- test_restore_refuses_without_confirm_service_stopped
- test_restore_raises_on_invalid_backup
- test_restore_round_trip_reverses_post_backup_mutations
(canonical drill flow: seed -> backup -> mutate -> restore ->
mutation gone + baseline survived + pre-restore snapshot has
the mutation captured as rollback anchor)
- test_restore_round_trip_with_chroma
- test_restore_skips_pre_snapshot_when_requested
- test_restore_cleans_stale_wal_sidecars (asserts stale byte
markers do not survive, not file existence, since PRAGMA
integrity_check may legitimately recreate -wal)
Docs — docs/backup-restore-drill.md (new):
- What gets backed up (hot sqlite, cold chroma, registry JSON,
metadata.json) and what doesn't (.env, source content)
- What restore does, step by step, and why confirm_service_stopped
is a hard gate
- 8-step drill procedure: capture -> baseline -> mutate -> stop ->
restore -> start -> verify marker gone -> optional cleanup
- Correct endpoint bodies verified against routes.py:
POST /admin/backup with JSON body {"include_chroma": true}
POST /memory with memory_type/content/project/confidence
GET /memory?project=drill to list drill markers
POST /query with {"prompt": ..., "top_k": ...} (not "query")
- Failure modes: integrity_check fail, container won't start,
marker still present after restore, with remediation for each
- When to run: before new write-path automation, after backup.py
or schema changes, after infra bumps, monthly as standing check
225/225 tests passing (219 existing + 6 new restore).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Backup / Restore Drill

## Purpose
Before turning on any automation that writes to AtoCore continuously (auto-capture of Claude Code sessions, automated source ingestion, reinforcement sweeps), we need to know — with certainty — that a backup can actually be restored. A backup you've never restored is not a backup; it's a file that happens to be named that way.
This runbook walks through the canonical drill: take a snapshot, mutate live state, stop the service, restore from the snapshot, start the service, and verify the mutation is reversed. When the drill passes, the runtime store has a trustworthy rollback.
## What gets backed up
`src/atocore/ops/backup.py::create_runtime_backup()` writes the
following into `$ATOCORE_BACKUP_DIR/snapshots/<stamp>/`:

| Component | How | Hot/Cold | Notes |
|---|---|---|---|
| SQLite (`atocore.db`) | `conn.backup()` online API | hot | Safe with service running; self-contained main file, no WAL sidecar. |
| Project registry JSON | file copy | cold | Only if the file exists. |
| Chroma vector store | `shutil.copytree` | cold | Only when `include_chroma=True`. Caller must hold `exclusive_ingestion()` so nothing writes during the copy; `POST /admin/backup` with body `{"include_chroma": true}` does this automatically. |
| `backup-metadata.json` | JSON blob | — | Records paths, sizes, and whether Chroma was included. Restore reads this to know what to pull back. |
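For the hot SQLite row, the online backup API is the key mechanism: it copies a live database page by page without blocking writers. A minimal sketch of that step, with a hypothetical function name (the real logic lives in `create_runtime_backup()` and does more):

```python
import sqlite3
from pathlib import Path

def snapshot_sqlite(live_db: Path, snapshot_db: Path) -> None:
    """Copy a (possibly hot) SQLite db into a standalone snapshot file."""
    src = sqlite3.connect(live_db)
    dst = sqlite3.connect(snapshot_db)
    try:
        # The backup API tolerates concurrent writers; the resulting
        # snapshot is a self-contained main file with no -wal/-shm sidecars.
        src.backup(dst)
    finally:
        dst.close()
        src.close()
```

Unlike a raw file copy, this is safe while the service is running, which is why the snapshot can be taken over HTTP without stopping the container.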
Things that are not in the backup and must be handled separately:

- The `.env` file under `deploy/dalidou/`: secrets live out of git and out of the backup on purpose. The operator must re-place it on any fresh host.
- The source content under `sources/vault` and `sources/drive`: these are read-only inputs by convention, owned by AtoVault / AtoDrive, and backed up there.
- Any running transient state (in-flight HTTP requests, ingestion queues). Stop the service cleanly if you care about those.
## What restore does

`restore_runtime_backup(stamp, confirm_service_stopped=True)`:

- Validates the backup first via `validate_backup()`; refuses to run on any error (missing metadata, corrupt snapshot db, etc.).
- Takes a pre-restore safety snapshot of the current state (SQLite only, not Chroma, to keep it fast) and returns its stamp. This is the reversibility guarantee: if the restore was the wrong call, you can roll it back by restoring the pre-restore snapshot.
- Forces a WAL checkpoint on the current db (`PRAGMA wal_checkpoint(TRUNCATE)`) to flush any lingering writes and release OS file handles on `-wal`/`-shm`, so the copy step won't race a half-open sqlite connection.
- Removes stale WAL/SHM sidecars next to the target db. The snapshot `.db` is a self-contained main-file image with no WAL of its own; a leftover `-wal` from the old live process would desync against the restored main file.
- Copies the snapshot db over the live db path.
- Restores the registry JSON if the snapshot captured one.
- Restores the Chroma tree if the snapshot captured one and `include_chroma` resolves to true (defaults to whether the snapshot has Chroma).
- Runs `PRAGMA integrity_check` on the restored db and reports the result alongside a summary of what was touched.

If `confirm_service_stopped` is not passed, the function refuses; this is deliberate. Hot-restoring into a running service is not supported and would corrupt state.
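The file-swap portion of the sequence (checkpoint, sidecar cleanup, copy, integrity check) can be sketched as follows. This is a simplified illustration with a hypothetical function name, not the real `restore_runtime_backup()`, which also handles the registry, Chroma, and the pre-restore snapshot:

```python
import shutil
import sqlite3
from pathlib import Path

def restore_sqlite(snapshot_db: Path, live_db: Path) -> bool:
    """Swap a snapshot over the live db path; return the integrity verdict."""
    if live_db.exists():
        # Flush pending WAL frames into the main file and release OS handles.
        conn = sqlite3.connect(live_db)
        conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
        conn.close()
    # Stale sidecars from the old process would desync against the
    # restored main file, so drop them before the swap.
    for suffix in ("-wal", "-shm"):
        Path(str(live_db) + suffix).unlink(missing_ok=True)
    shutil.copy2(snapshot_db, live_db)
    # Verify the copy landed intact before declaring success.
    check = sqlite3.connect(live_db)
    (verdict,) = check.execute("PRAGMA integrity_check").fetchone()
    check.close()
    return verdict == "ok"
```

Note the ordering: the checkpoint must happen before the sidecar unlink, and the integrity check must run on the file at its final path, since that is what the service will open.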
## The drill

Run this from a Dalidou host with the AtoCore container already
deployed and healthy. The whole drill takes under two minutes. It
does not touch source content or disturb any `.env` secrets.
### Step 1. Capture a snapshot via the HTTP API

The running service holds the db; use the admin route so the
Chroma snapshot is taken under `exclusive_ingestion()`. The
endpoint takes a JSON body (not a query string):
curl -fsS -X POST 'http://127.0.0.1:8100/admin/backup' \
-H 'Content-Type: application/json' \
-d '{"include_chroma": true}' \
| python3 -m json.tool
Record the backup_root and note the stamp (the last path segment,
e.g. 20260409T012345Z). That stamp is the input to the restore
step.
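If you script the drill, the stamp can be derived from the response. A tiny hypothetical helper, assuming only the response shape described above (a `backup_root` field whose last path segment is the stamp):

```python
import json
from pathlib import PurePosixPath

def stamp_from_response(body: str) -> str:
    """Return the snapshot stamp: the last path segment of backup_root."""
    backup_root = json.loads(body)["backup_root"]
    return PurePosixPath(backup_root).name
```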
### Step 2. Record a known piece of live state
Pick something small and unambiguous to use as a marker. The simplest is the current health snapshot plus a memory count:
curl -fsS 'http://127.0.0.1:8100/health' | python3 -m json.tool
Note the memory_count, interaction_count, and build_sha. These
are your pre-drill baseline.
### Step 3. Mutate live state AFTER the backup
Write something the restore should reverse. Any write endpoint is
fine — a throwaway test memory is the cleanest. The request body
must include memory_type (the AtoCore memory schema requires it):
curl -fsS -X POST 'http://127.0.0.1:8100/memory' \
-H 'Content-Type: application/json' \
-d '{
"memory_type": "note",
"content": "DRILL-MARKER: this memory should not survive the restore",
"project": "drill",
"confidence": 1.0
}' \
| python3 -m json.tool
Record the returned id. Confirm it's there:
curl -fsS 'http://127.0.0.1:8100/health' | python3 -m json.tool
# memory_count should be baseline + 1
# And you can list the drill-project memories directly:
curl -fsS 'http://127.0.0.1:8100/memory?project=drill' | python3 -m json.tool
# should return the DRILL-MARKER memory
### Step 4. Stop the service
cd /srv/storage/atocore/app/deploy/dalidou
docker compose down
Wait for the container to actually exit:
docker compose ps
# atocore should be gone or Exited
### Step 5. Restore from the snapshot
Run the restore inside a one-shot container that reuses the same volumes as the live service. This guarantees the paths resolve identically to the running container's view.
cd /srv/storage/atocore/app/deploy/dalidou
docker compose run --rm --entrypoint python atocore \
-m atocore.ops.backup restore \
<YOUR_STAMP_FROM_STEP_1> \
--confirm-service-stopped
The output is JSON; the important fields are:

- `pre_restore_snapshot`: stamp of the safety snapshot of live state at the moment of restore. Write this down. If the restore turns out to be the wrong call, this is how you roll it back.
- `db_restored`: `true`
- `registry_restored`: `true` if the backup had a registry
- `chroma_restored`: `true` if the backup had a chroma snapshot
- `restored_integrity_ok`: must be `true`. If this is false, STOP and do not start the service; investigate the integrity error first.
If restoration fails at any step, the function raises a clean
RuntimeError and nothing partial is committed past the main file
swap. The pre-restore safety snapshot is your rollback anchor.
### Step 6. Start the service back up
cd /srv/storage/atocore/app/deploy/dalidou
docker compose up -d
Wait for /health to respond:
for i in 1 2 3 4 5 6 7 8 9 10; do
curl -fsS 'http://127.0.0.1:8100/health' \
&& break || { echo "not ready ($i/10)"; sleep 3; }
done
### Step 7. Verify the drill marker is gone
curl -fsS 'http://127.0.0.1:8100/health' | python3 -m json.tool
# memory_count should equal the Step 2 baseline, NOT baseline + 1
You can also list the drill-project memories directly:
curl -fsS 'http://127.0.0.1:8100/memory?project=drill' | python3 -m json.tool
# should return an empty list — the DRILL-MARKER memory was rolled back
For a semantic-retrieval cross-check, issue a query (the `/query`
endpoint takes `prompt`, not `query`):
curl -fsS -X POST 'http://127.0.0.1:8100/query' \
-H 'Content-Type: application/json' \
-d '{"prompt": "DRILL-MARKER drill marker", "top_k": 5}' \
| python3 -m json.tool
# should not return the DRILL-MARKER memory in the hits
If the marker is gone and memory_count matches the baseline, the
drill passed. The runtime store has a trustworthy rollback.
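For a scripted drill, the pass condition can be stated as a small comparison of the two `/health` snapshots. This is a hypothetical helper; the field names match the health payload described in Step 2:

```python
def drill_passed(baseline: dict, after_restore: dict) -> bool:
    """True when the restore reversed the drill mutation: counts match the
    pre-drill baseline instead of baseline + 1."""
    return (
        after_restore["memory_count"] == baseline["memory_count"]
        and after_restore["interaction_count"] == baseline["interaction_count"]
    )
```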
### Step 8. (Optional) Clean up the safety snapshot
If everything went smoothly you can leave the pre-restore safety
snapshot on disk for a few days as a paranoia buffer. There's no
automatic cleanup yet — list_runtime_backups() will show it, and
you can remove it by hand once you're confident:
rm -rf /srv/storage/atocore/backups/snapshots/<pre_restore_stamp>
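Since there is no automatic cleanup yet, here is one hedged sketch of what a pruning helper could look like. The function name and retention policy are hypothetical, not part of `backup.py`; it relies only on the fact that the UTC stamps shown above sort lexicographically in chronological order:

```python
import shutil
from pathlib import Path

def prune_snapshots(snapshots_dir: Path, keep: int = 5) -> list[str]:
    """Delete all but the newest `keep` snapshot dirs; return what was removed."""
    # Stamps like 20260409T012345Z sort lexicographically == chronologically.
    stamps = sorted(p.name for p in snapshots_dir.iterdir() if p.is_dir())
    doomed = stamps[:-keep] if keep > 0 else stamps
    for stamp in doomed:
        shutil.rmtree(snapshots_dir / stamp)
    return doomed
```

Run it (or anything like it) only after a drill has passed, so the most recent known-good snapshot is never among the deleted ones.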
## Failure modes and recovery

### Restore reports `restored_integrity_ok: false`
The copied db failed PRAGMA integrity_check. Do not start
the service. This usually means either the source snapshot was
itself corrupt (and validate_backup should have caught it — file
a bug if it didn't), or the copy was interrupted. Options:
- Validate the source snapshot directly:
  `python -m atocore.ops.backup validate <STAMP>`
- Pick a different, older snapshot and retry the restore.
- Roll the db back to your pre-restore safety snapshot.
### The live container won't start after restore
Check the container logs:
cd /srv/storage/atocore/app/deploy/dalidou
docker compose logs --tail=100 atocore
Common causes:

- Schema drift between the snapshot and the current code version. `_apply_migrations` in `src/atocore/models/database.py` is idempotent and should absorb most forward migrations, but a backward restore (running new code against an older snapshot) may hit unexpected state. The migration only ADDs columns, so the opposite direction is usually safe, but verify.
- Chroma and SQLite disagreeing about what chunks exist. The backup captures them together to minimize this, but if you restore SQLite without Chroma (`--no-chroma`), retrieval may return stale vectors. Re-ingest if this happens.
### The drill marker is still present after restore
Something went wrong. Possible causes:

- You restored a snapshot taken AFTER the drill marker was written (wrong stamp).
- The service was writing during the drill and committed the marker before `docker compose down`. Double-check the order.
- The restore silently skipped the db step. Check the restore output for `db_restored: true` and `restored_integrity_ok: true`.
Roll back to the pre-restore safety snapshot and retry with the correct source snapshot.
## When to run this drill
- Before enabling any new write-path automation (auto-capture, automated ingestion, reinforcement sweeps, scheduled extraction).
- After any change to `src/atocore/ops/backup.py` or the schema migrations in `src/atocore/models/database.py`.
- After a Dalidou OS upgrade or docker version bump.
- Monthly as a standing operational check.
- Monthly as a standing operational check.
Record each drill run (pass/fail) somewhere durable — even a line in the project journal is enough. A drill you ran once and never again is barely more than a drill you never ran.
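One minimal, hypothetical way to keep that record: append a single line per drill to a journal file. Nothing here is part of AtoCore; it is just a sketch of "a line in the project journal":

```python
from datetime import datetime, timezone
from pathlib import Path

def log_drill(journal: Path, stamp: str, passed: bool) -> str:
    """Append one line per drill run to a durable journal file."""
    line = (
        f"{datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')} "
        f"backup-restore-drill stamp={stamp} "
        f"result={'PASS' if passed else 'FAIL'}"
    )
    with journal.open("a", encoding="utf-8") as fh:
        fh.write(line + "\n")
    return line
```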