Close the backup side of the loop: we had create/list/validate but
no restore, and no documented drill. A backup you've never restored
is not a backup. This lands the missing restore surface and the
procedure to exercise it before enabling any write-path automation
(auto-capture, automated ingestion, reinforcement sweeps).
Code — src/atocore/ops/backup.py:
- restore_runtime_backup(stamp, *, include_chroma, pre_restore_snapshot,
confirm_service_stopped) performs:
1. validate_backup() gate — refuse on any error
2. pre-restore safety snapshot of current state (reversibility anchor)
3. PRAGMA wal_checkpoint(TRUNCATE) on target db (flush + release
OS handles; Windows needs this after conn.backup() reads)
4. unlink stale -wal/-shm sidecars (tolerant to Windows lock races)
5. shutil.copy2 snapshot db over target
6. restore registry if snapshot captured one
7. restore Chroma tree if snapshot captured one and include_chroma
resolves to true (defaults to whether backup has Chroma)
8. PRAGMA integrity_check on restored db, report result
- Refuses without confirm_service_stopped=True to prevent hot-restore
into a running service (would corrupt SQLite state)
- Rewrote main() as argparse with 4 subcommands: create, list,
validate, restore. `python -m atocore.ops.backup restore STAMP
--confirm-service-stopped` is the drill CLI entry point, run via
`docker compose run --rm --entrypoint python atocore` so it reuses
the live service's volume mounts
Tests — tests/test_backup.py (6 new):
- test_restore_refuses_without_confirm_service_stopped
- test_restore_raises_on_invalid_backup
- test_restore_round_trip_reverses_post_backup_mutations
(canonical drill flow: seed -> backup -> mutate -> restore ->
mutation gone + baseline survived + pre-restore snapshot has
the mutation captured as rollback anchor)
- test_restore_round_trip_with_chroma
- test_restore_skips_pre_snapshot_when_requested
- test_restore_cleans_stale_wal_sidecars (asserts stale byte
markers do not survive, not file existence, since PRAGMA
integrity_check may legitimately recreate -wal)
Docs — docs/backup-restore-drill.md (new):
- What gets backed up (hot sqlite, cold chroma, registry JSON,
metadata.json) and what doesn't (.env, source content)
- What restore does, step by step, and why confirm_service_stopped
is a hard gate
- 8-step drill procedure: capture -> baseline -> mutate -> stop ->
restore -> start -> verify marker gone -> optional cleanup
- Correct endpoint bodies verified against routes.py:
POST /admin/backup with JSON body {"include_chroma": true}
POST /memory with memory_type/content/project/confidence
GET /memory?project=drill to list drill markers
POST /query with {"prompt": ..., "top_k": ...} (not "query")
- Failure modes: integrity_check fail, container won't start,
marker still present after restore, with remediation for each
- When to run: before new write-path automation, after backup.py
or schema changes, after infra bumps, monthly as standing check
225/225 tests passing (219 existing + 6 new restore).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Backup / Restore Drill

## Purpose
Before turning on any automation that writes to AtoCore continuously (auto-capture of Claude Code sessions, automated source ingestion, reinforcement sweeps), we need to know — with certainty — that a backup can actually be restored. A backup you've never restored is not a backup; it's a file that happens to be named that way.
This runbook walks through the canonical drill: take a snapshot, mutate live state, stop the service, restore from the snapshot, start the service, and verify the mutation is reversed. When the drill passes, the runtime store has a trustworthy rollback.
## What gets backed up
`src/atocore/ops/backup.py::create_runtime_backup()` writes the
following into `$ATOCORE_BACKUP_DIR/snapshots/<stamp>/`:

| Component | How | Hot/Cold | Notes |
|---|---|---|---|
| SQLite (`atocore.db`) | `conn.backup()` online API | hot | Safe with service running; self-contained main file, no WAL sidecar. |
| Project registry JSON | file copy | cold | Only if the file exists. |
| Chroma vector store | `shutil.copytree` | cold | Only when `include_chroma=True`. Caller must hold `exclusive_ingestion()` so nothing writes during the copy; `POST /admin/backup` with body `{"include_chroma": true}` does this automatically. |
| `backup-metadata.json` | JSON blob | — | Records paths, sizes, and whether Chroma was included. Restore reads this to know what to pull back. |
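For the hot SQLite row, the online backup API is the key mechanism: it copies a live database page by page without blocking writers. A minimal sketch of that step, with a hypothetical function name (the real logic lives in `create_runtime_backup()` and does more):

```python
import sqlite3
from pathlib import Path

def snapshot_sqlite(live_db: Path, snapshot_db: Path) -> None:
    """Copy a (possibly hot) SQLite db into a standalone snapshot file."""
    src = sqlite3.connect(live_db)
    dst = sqlite3.connect(snapshot_db)
    try:
        # The backup API tolerates concurrent writers; the resulting
        # snapshot is a self-contained main file with no -wal/-shm sidecars.
        src.backup(dst)
    finally:
        dst.close()
        src.close()
```

Unlike a raw file copy, this is safe while the service is running, which is why the snapshot can be taken over HTTP without stopping the container.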
Things that are not in the backup and must be handled separately:

- The `.env` file under `deploy/dalidou/`: secrets live out of git and out of the backup on purpose. The operator must re-place it on any fresh host.
- The source content under `sources/vault` and `sources/drive`: these are read-only inputs by convention, owned by AtoVault / AtoDrive, and backed up there.
- Any running transient state (in-flight HTTP requests, ingestion queues). Stop the service cleanly if you care about those.
## What restore does

`restore_runtime_backup(stamp, confirm_service_stopped=True)`:

- Validates the backup first via `validate_backup()`; refuses to run on any error (missing metadata, corrupt snapshot db, etc.).
- Takes a pre-restore safety snapshot of the current state (SQLite only, not Chroma, to keep it fast) and returns its stamp. This is the reversibility guarantee: if the restore was the wrong call, you can roll it back by restoring the pre-restore snapshot.
- Forces a WAL checkpoint on the current db (`PRAGMA wal_checkpoint(TRUNCATE)`) to flush any lingering writes and release OS file handles on `-wal`/`-shm`, so the copy step won't race a half-open sqlite connection.
- Removes stale WAL/SHM sidecars next to the target db. The snapshot `.db` is a self-contained main-file image with no WAL of its own; a leftover `-wal` from the old live process would desync against the restored main file.
- Copies the snapshot db over the live db path.
- Restores the registry JSON if the snapshot captured one.
- Restores the Chroma tree if the snapshot captured one and `include_chroma` resolves to true (defaults to whether the snapshot has Chroma).
- Runs `PRAGMA integrity_check` on the restored db and reports the result alongside a summary of what was touched.

If `confirm_service_stopped` is not passed, the function refuses; this is deliberate. Hot-restoring into a running service is not supported and would corrupt state.
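The file-swap portion of the sequence (checkpoint, sidecar cleanup, copy, integrity check) can be sketched as follows. This is a simplified illustration with a hypothetical function name, not the real `restore_runtime_backup()`, which also handles the registry, Chroma, and the pre-restore snapshot:

```python
import shutil
import sqlite3
from pathlib import Path

def restore_sqlite(snapshot_db: Path, live_db: Path) -> bool:
    """Swap a snapshot over the live db path; return the integrity verdict."""
    if live_db.exists():
        # Flush pending WAL frames into the main file and release OS handles.
        conn = sqlite3.connect(live_db)
        conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
        conn.close()
    # Stale sidecars from the old process would desync against the
    # restored main file, so drop them before the swap.
    for suffix in ("-wal", "-shm"):
        Path(str(live_db) + suffix).unlink(missing_ok=True)
    shutil.copy2(snapshot_db, live_db)
    # Verify the copy landed intact before declaring success.
    check = sqlite3.connect(live_db)
    (verdict,) = check.execute("PRAGMA integrity_check").fetchone()
    check.close()
    return verdict == "ok"
```

Note the ordering: the checkpoint must happen before the sidecar unlink, and the integrity check must run on the file at its final path, since that is what the service will open.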
## The drill

Run this from a Dalidou host with the AtoCore container already
deployed and healthy. The whole drill takes under two minutes. It
does not touch source content or disturb any `.env` secrets.
### Step 1. Capture a snapshot via the HTTP API

The running service holds the db; use the admin route so the
Chroma snapshot is taken under `exclusive_ingestion()`. The
endpoint takes a JSON body (not a query string):
curl -fsS -X POST 'http://127.0.0.1:8100/admin/backup' \
-H 'Content-Type: application/json' \
-d '{"include_chroma": true}' \
| python3 -m json.tool
Record the backup_root and note the stamp (the last path segment,
e.g. 20260409T012345Z). That stamp is the input to the restore
step.
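If you script the drill, the stamp can be derived from the response. A tiny hypothetical helper, assuming only the response shape described above (a `backup_root` field whose last path segment is the stamp):

```python
import json
from pathlib import PurePosixPath

def stamp_from_response(body: str) -> str:
    """Return the snapshot stamp: the last path segment of backup_root."""
    backup_root = json.loads(body)["backup_root"]
    return PurePosixPath(backup_root).name
```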
### Step 2. Record a known piece of live state
Pick something small and unambiguous to use as a marker. The simplest is the current health snapshot plus a memory count:
curl -fsS 'http://127.0.0.1:8100/health' | python3 -m json.tool
Note the memory_count, interaction_count, and build_sha. These
are your pre-drill baseline.
### Step 3. Mutate live state AFTER the backup
Write something the restore should reverse. Any write endpoint is
fine — a throwaway test memory is the cleanest. The request body
must include memory_type (the AtoCore memory schema requires it):
curl -fsS -X POST 'http://127.0.0.1:8100/memory' \
-H 'Content-Type: application/json' \
-d '{
"memory_type": "note",
"content": "DRILL-MARKER: this memory should not survive the restore",
"project": "drill",
"confidence": 1.0
}' \
| python3 -m json.tool
Record the returned id. Confirm it's there:
curl -fsS 'http://127.0.0.1:8100/health' | python3 -m json.tool
# memory_count should be baseline + 1
# And you can list the drill-project memories directly:
curl -fsS 'http://127.0.0.1:8100/memory?project=drill' | python3 -m json.tool
# should return the DRILL-MARKER memory
### Step 4. Stop the service
cd /srv/storage/atocore/app/deploy/dalidou
docker compose down
Wait for the container to actually exit:
docker compose ps
# atocore should be gone or Exited
### Step 5. Restore from the snapshot
Run the restore inside a one-shot container that reuses the same volumes as the live service. This guarantees the paths resolve identically to the running container's view.
cd /srv/storage/atocore/app/deploy/dalidou
docker compose run --rm --entrypoint python atocore \
-m atocore.ops.backup restore \
<YOUR_STAMP_FROM_STEP_1> \
--confirm-service-stopped
The output is JSON; the important fields are:

- `pre_restore_snapshot`: stamp of the safety snapshot of live state at the moment of restore. Write this down. If the restore turns out to be the wrong call, this is how you roll it back.
- `db_restored`: `true`
- `registry_restored`: `true` if the backup had a registry
- `chroma_restored`: `true` if the backup had a chroma snapshot
- `restored_integrity_ok`: must be `true`. If this is false, STOP and do not start the service; investigate the integrity error first.
If restoration fails at any step, the function raises a clean
RuntimeError and nothing partial is committed past the main file
swap. The pre-restore safety snapshot is your rollback anchor.
### Step 6. Start the service back up
cd /srv/storage/atocore/app/deploy/dalidou
docker compose up -d
Wait for /health to respond:
for i in 1 2 3 4 5 6 7 8 9 10; do
curl -fsS 'http://127.0.0.1:8100/health' \
&& break || { echo "not ready ($i/10)"; sleep 3; }
done
### Step 7. Verify the drill marker is gone
curl -fsS 'http://127.0.0.1:8100/health' | python3 -m json.tool
# memory_count should equal the Step 2 baseline, NOT baseline + 1
You can also list the drill-project memories directly:
curl -fsS 'http://127.0.0.1:8100/memory?project=drill' | python3 -m json.tool
# should return an empty list — the DRILL-MARKER memory was rolled back
For a semantic-retrieval cross-check, issue a query (the `/query`
endpoint takes `prompt`, not `query`):
curl -fsS -X POST 'http://127.0.0.1:8100/query' \
-H 'Content-Type: application/json' \
-d '{"prompt": "DRILL-MARKER drill marker", "top_k": 5}' \
| python3 -m json.tool
# should not return the DRILL-MARKER memory in the hits
If the marker is gone and memory_count matches the baseline, the
drill passed. The runtime store has a trustworthy rollback.
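For a scripted drill, the pass condition can be stated as a small comparison of the two `/health` snapshots. This is a hypothetical helper; the field names match the health payload described in Step 2:

```python
def drill_passed(baseline: dict, after_restore: dict) -> bool:
    """True when the restore reversed the drill mutation: counts match the
    pre-drill baseline instead of baseline + 1."""
    return (
        after_restore["memory_count"] == baseline["memory_count"]
        and after_restore["interaction_count"] == baseline["interaction_count"]
    )
```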
### Step 8. (Optional) Clean up the safety snapshot
If everything went smoothly you can leave the pre-restore safety
snapshot on disk for a few days as a paranoia buffer. There's no
automatic cleanup yet — list_runtime_backups() will show it, and
you can remove it by hand once you're confident:
rm -rf /srv/storage/atocore/backups/snapshots/<pre_restore_stamp>
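Since there is no automatic cleanup yet, here is one hedged sketch of what a pruning helper could look like. The function name and retention policy are hypothetical, not part of `backup.py`; it relies only on the fact that the UTC stamps shown above sort lexicographically in chronological order:

```python
import shutil
from pathlib import Path

def prune_snapshots(snapshots_dir: Path, keep: int = 5) -> list[str]:
    """Delete all but the newest `keep` snapshot dirs; return what was removed."""
    # Stamps like 20260409T012345Z sort lexicographically == chronologically.
    stamps = sorted(p.name for p in snapshots_dir.iterdir() if p.is_dir())
    doomed = stamps[:-keep] if keep > 0 else stamps
    for stamp in doomed:
        shutil.rmtree(snapshots_dir / stamp)
    return doomed
```

Run it (or anything like it) only after a drill has passed, so the most recent known-good snapshot is never among the deleted ones.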
## Failure modes and recovery

### Restore reports `restored_integrity_ok: false`
The copied db failed PRAGMA integrity_check. Do not start
the service. This usually means either the source snapshot was
itself corrupt (and validate_backup should have caught it — file
a bug if it didn't), or the copy was interrupted. Options:
- Validate the source snapshot directly:
  `python -m atocore.ops.backup validate <STAMP>`
- Pick a different, older snapshot and retry the restore.
- Roll the db back to your pre-restore safety snapshot.
### The live container won't start after restore
Check the container logs:
cd /srv/storage/atocore/app/deploy/dalidou
docker compose logs --tail=100 atocore
Common causes:

- Schema drift between the snapshot and the current code version. `_apply_migrations` in `src/atocore/models/database.py` is idempotent and should absorb most forward migrations, but a backward restore (running new code against an older snapshot) may hit unexpected state. The migration only ADDs columns, so the opposite direction is usually safe, but verify.
- Chroma and SQLite disagreeing about what chunks exist. The backup captures them together to minimize this, but if you restore SQLite without Chroma (`--no-chroma`), retrieval may return stale vectors. Re-ingest if this happens.
### The drill marker is still present after restore
Something went wrong. Possible causes:

- You restored a snapshot taken AFTER the drill marker was written (wrong stamp).
- The service was writing during the drill and committed the marker before `docker compose down`. Double-check the order.
- The restore silently skipped the db step. Check the restore output for `db_restored: true` and `restored_integrity_ok: true`.
Roll back to the pre-restore safety snapshot and retry with the correct source snapshot.
## When to run this drill
- Before enabling any new write-path automation (auto-capture, automated ingestion, reinforcement sweeps, scheduled extraction).
- After any change to `src/atocore/ops/backup.py` or the schema migrations in `src/atocore/models/database.py`.
- After a Dalidou OS upgrade or docker version bump.
- Monthly as a standing operational check.
- Monthly as a standing operational check.
Record each drill run (pass/fail) somewhere durable — even a line in the project journal is enough. A drill you ran once and never again is barely more than a drill you never ran.
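One minimal, hypothetical way to keep that record: append a single line per drill to a journal file. Nothing here is part of AtoCore; it is just a sketch of "a line in the project journal":

```python
from datetime import datetime, timezone
from pathlib import Path

def log_drill(journal: Path, stamp: str, passed: bool) -> str:
    """Append one line per drill run to a durable journal file."""
    line = (
        f"{datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')} "
        f"backup-restore-drill stamp={stamp} "
        f"result={'PASS' if passed else 'FAIL'}"
    )
    with journal.open("a", encoding="utf-8") as fh:
        fh.write(line + "\n")
    return line
```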