ops: add restore_runtime_backup + drill runbook

Close the backup side of the loop: we had create/list/validate but
no restore, and no documented drill. A backup you've never restored
is not a backup. This lands the missing restore surface and the
procedure to exercise it before enabling any write-path automation
(auto-capture, automated ingestion, reinforcement sweeps).

Code (src/atocore/ops/backup.py):
- restore_runtime_backup(stamp, *, include_chroma, pre_restore_snapshot,
  confirm_service_stopped) performs:
  1. validate_backup() gate: refuse on any error
  2. pre-restore safety snapshot of current state (reversibility anchor)
  3. PRAGMA wal_checkpoint(TRUNCATE) on target db (flush + release
     OS handles; Windows needs this after conn.backup() reads)
  4. unlink stale -wal/-shm sidecars (tolerant to Windows lock races)
  5. shutil.copy2 snapshot db over target
  6. restore registry if snapshot captured one
  7. restore Chroma tree if snapshot captured one and include_chroma
     resolves to true (defaults to whether backup has Chroma)
  8. PRAGMA integrity_check on restored db, report result
- Refuses without confirm_service_stopped=True to prevent hot-restore
  into a running service (would corrupt SQLite state)
- Rewrote main() as argparse with 4 subcommands: create, list,
  validate, restore. `python -m atocore.ops.backup restore STAMP
  --confirm-service-stopped` is the drill CLI entry point, run via
  `docker compose run --rm --entrypoint python atocore` so it reuses
  the live service's volume mounts

Tests (tests/test_backup.py, 6 new):
- test_restore_refuses_without_confirm_service_stopped
- test_restore_raises_on_invalid_backup
- test_restore_round_trip_reverses_post_backup_mutations
  (canonical drill flow: seed -> backup -> mutate -> restore ->
  mutation gone + baseline survived + pre-restore snapshot has
  the mutation captured as rollback anchor)
- test_restore_round_trip_with_chroma
- test_restore_skips_pre_snapshot_when_requested
- test_restore_cleans_stale_wal_sidecars (asserts stale byte
  markers do not survive, not file existence, since PRAGMA
  integrity_check may legitimately recreate -wal)

Docs (docs/backup-restore-drill.md, new):
- What gets backed up (hot sqlite, cold chroma, registry JSON,
  metadata.json) and what doesn't (.env, source content)
- What restore does, step by step, and why confirm_service_stopped
  is a hard gate
- 8-step drill procedure: capture -> baseline -> mutate -> stop ->
  restore -> start -> verify marker gone -> optional cleanup
- Correct endpoint bodies verified against routes.py:
    POST /admin/backup with JSON body {"include_chroma": true}
    POST /memory with memory_type/content/project/confidence
    GET /memory?project=drill to list drill markers
    POST /query with {"prompt": ..., "top_k": ...} (not "query")
- Failure modes: integrity_check fail, container won't start,
  marker still present after restore, with remediation for each
- When to run: before new write-path automation, after backup.py
  or schema changes, after infra bumps, monthly as standing check

225/225 tests passing (219 existing + 6 new restore).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
296
docs/backup-restore-drill.md
Normal file
@@ -0,0 +1,296 @@
# Backup / Restore Drill

## Purpose

Before turning on any automation that writes to AtoCore continuously
(auto-capture of Claude Code sessions, automated source ingestion,
reinforcement sweeps), we need to know, with certainty, that a
backup can actually be restored. A backup you've never restored is
not a backup; it's a file that happens to be named that way.

This runbook walks through the canonical drill: take a snapshot,
mutate live state, stop the service, restore from the snapshot,
start the service, and verify the mutation is reversed. When the
drill passes, the runtime store has a trustworthy rollback.

## What gets backed up

`src/atocore/ops/backup.py::create_runtime_backup()` writes the
following into `$ATOCORE_BACKUP_DIR/snapshots/<stamp>/`:

| Component | How | Hot/Cold | Notes |
|---|---|---|---|
| SQLite (`atocore.db`) | `conn.backup()` online API | **hot** | Safe with service running; self-contained main file, no WAL sidecar. |
| Project registry JSON | file copy | cold | Only if the file exists. |
| Chroma vector store | `shutil.copytree` | **cold** | Only when `include_chroma=True`. Caller must hold `exclusive_ingestion()` so nothing writes during the copy; the `POST /admin/backup` route does this automatically when its JSON body sets `"include_chroma": true`. |
| `backup-metadata.json` | JSON blob | n/a | Records paths, sizes, and whether Chroma was included. Restore reads this to know what to pull back. |

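The "hot" SQLite row relies on Python's `sqlite3.Connection.backup()`. As a minimal sketch of that technique (not the code in `backup.py`; the `hot_backup` helper, the `memories` table, and the file names are invented for illustration), a snapshot can be taken while a writer connection is still open:

```python
import sqlite3
import tempfile
from pathlib import Path


def hot_backup(live_db: Path, snapshot_db: Path) -> None:
    """Copy a live SQLite database into a standalone snapshot file."""
    src = sqlite3.connect(live_db)
    dst = sqlite3.connect(snapshot_db)
    try:
        # Connection.backup() copies the database page by page and is
        # safe to run while other connections use the source database.
        src.backup(dst)
    finally:
        dst.close()
        src.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        live = Path(tmp) / "live.db"
        snap = Path(tmp) / "snapshot.db"

        writer = sqlite3.connect(live)
        writer.execute("PRAGMA journal_mode=WAL")
        writer.execute("CREATE TABLE memories (content TEXT)")
        writer.execute("INSERT INTO memories VALUES ('baseline')")
        writer.commit()

        hot_backup(live, snap)  # the writer connection is still open here
        writer.close()

        check = sqlite3.connect(snap)
        print(check.execute("SELECT content FROM memories").fetchall())
        check.close()
```

The Chroma copy has no equivalent online API in this sketch, which is why the table marks it cold and requires the ingestion lock.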
Things that are **not** in the backup and must be handled separately:

- The `.env` file under `deploy/dalidou/`: secrets live out of git
  and out of the backup on purpose. The operator must re-place it
  on any fresh host.
- The source content under `sources/vault` and `sources/drive`:
  these are read-only inputs by convention, owned by AtoVault /
  AtoDrive, and backed up there.
- Any running transient state (in-flight HTTP requests, ingestion
  queues). Stop the service cleanly if you care about those.

## What restore does

`restore_runtime_backup(stamp, confirm_service_stopped=True)`:

1. **Validates** the backup first via `validate_backup()` and
   refuses to run on any error (missing metadata, corrupt snapshot
   db, etc.).
2. **Takes a pre-restore safety snapshot** of the current state
   (SQLite only, not Chroma, to keep it fast) and returns its
   stamp. This is the reversibility guarantee: if the restore was
   the wrong call, you can roll it back by restoring the
   pre-restore snapshot.
3. **Forces a WAL checkpoint** on the current db
   (`PRAGMA wal_checkpoint(TRUNCATE)`) to flush any lingering
   writes and release OS file handles on `-wal`/`-shm`, so the
   copy step won't race a half-open sqlite connection.
4. **Removes stale WAL/SHM sidecars** next to the target db.
   The snapshot `.db` is a self-contained main-file image with no
   WAL of its own; leftover `-wal` from the old live process
   would desync against the restored main file.
5. **Copies the snapshot db** over the live db path.
6. **Restores the registry JSON** if the snapshot captured one.
7. **Restores the Chroma tree** if the snapshot captured one and
   `include_chroma` resolves to true (defaults to whether the
   snapshot has Chroma).
8. **Runs `PRAGMA integrity_check`** on the restored db and
   reports the result alongside a summary of what was touched.

If `confirm_service_stopped` is not passed, the function refuses.
This is deliberate: hot-restoring into a running service is not
supported and would corrupt state.

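Steps 3, 4, 5 and 8 are plain `sqlite3` and filesystem operations. The sketch below shows one way they could fit together. It is an illustration under assumptions, not the actual `restore_runtime_backup()`: the `restore_db_sketch` name is hypothetical, and validation, the safety snapshot, the registry, and Chroma are all left out.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path


def restore_db_sketch(snapshot_db: Path, target_db: Path) -> bool:
    """Overwrite target_db with snapshot_db; True if integrity_check passes.

    Assumes nothing else has target_db open (the service is stopped).
    """
    if target_db.exists():
        # Step 3: flush the WAL into the main file and release handles.
        conn = sqlite3.connect(target_db)
        try:
            conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
        finally:
            conn.close()

    # Step 4: drop leftover sidecars so they cannot be replayed against
    # the restored main file. Only a missing file is tolerated here.
    for suffix in ("-wal", "-shm"):
        sidecar = target_db.with_name(target_db.name + suffix)
        try:
            sidecar.unlink()
        except FileNotFoundError:
            pass

    # Step 5: copy the self-contained snapshot over the live path.
    shutil.copy2(snapshot_db, target_db)

    # Step 8: verify the restored file.
    conn = sqlite3.connect(target_db)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    finally:
        conn.close()
    return rows == [("ok",)]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        live, snap = Path(tmp) / "live.db", Path(tmp) / "snap.db"
        conn = sqlite3.connect(live)
        conn.execute("CREATE TABLE memories (content TEXT)")
        conn.execute("INSERT INTO memories VALUES ('baseline')")
        conn.commit()
        conn.close()
        shutil.copy2(live, snap)  # stand-in for a real snapshot

        conn = sqlite3.connect(live)
        conn.execute("INSERT INTO memories VALUES ('DRILL-MARKER')")
        conn.commit()
        conn.close()

        print(restore_db_sketch(snap, live))
```

Running the checkpoint before unlinking matters: it moves any committed writes into the main file first, so the sidecars being deleted hold nothing the safety snapshot would miss.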
## The drill

Run this from a Dalidou host with the AtoCore container already
deployed and healthy. The whole drill takes under two minutes. It
does not touch source content or disturb any `.env` secrets.

### Step 1. Capture a snapshot via the HTTP API

The running service holds the db; use the admin route so the
Chroma snapshot is taken under `exclusive_ingestion()`. The
endpoint takes a JSON body (not a query string):

```bash
curl -fsS -X POST 'http://127.0.0.1:8100/admin/backup' \
  -H 'Content-Type: application/json' \
  -d '{"include_chroma": true}' \
  | python3 -m json.tool
```

Record the `backup_root` and note the stamp (the last path segment,
e.g. `20260409T012345Z`). That stamp is the input to the restore
step.

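Pulling the stamp out of the response by hand is easy to get wrong, so here is a small sketch that does it programmatically. The `backup_root` field comes from the text above; the `stamp_from_response` helper and the stamp pattern are assumptions, with the pattern inferred only from the example stamp.

```python
import json
import re
from pathlib import PurePosixPath

# Assumed format, inferred from the example stamp 20260409T012345Z.
STAMP_PATTERN = re.compile(r"^\d{8}T\d{6}Z$")


def stamp_from_response(body: str) -> str:
    """Return the snapshot stamp (last path segment of backup_root)."""
    payload = json.loads(body)
    stamp = PurePosixPath(payload["backup_root"]).name
    if not STAMP_PATTERN.match(stamp):
        raise ValueError(f"unexpected stamp format: {stamp!r}")
    return stamp


if __name__ == "__main__":
    # Hypothetical response body; a real one has more fields.
    sample = json.dumps({"backup_root": "/backups/snapshots/20260409T012345Z"})
    print(stamp_from_response(sample))
```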
### Step 2. Record a known piece of live state

Pick something small and unambiguous to use as a marker. The
simplest is the current health snapshot plus a memory count:

```bash
curl -fsS 'http://127.0.0.1:8100/health' | python3 -m json.tool
```

Note the `memory_count`, `interaction_count`, and `build_sha`. These
are your pre-drill baseline.

### Step 3. Mutate live state AFTER the backup

Write something the restore should reverse. Any write endpoint is
fine; a throwaway test memory is the cleanest. The request body
must include `memory_type` (the AtoCore memory schema requires it):

```bash
curl -fsS -X POST 'http://127.0.0.1:8100/memory' \
  -H 'Content-Type: application/json' \
  -d '{
    "memory_type": "note",
    "content": "DRILL-MARKER: this memory should not survive the restore",
    "project": "drill",
    "confidence": 1.0
  }' \
  | python3 -m json.tool
```

Record the returned `id`. Confirm it's there:

```bash
curl -fsS 'http://127.0.0.1:8100/health' | python3 -m json.tool
# memory_count should be baseline + 1

# And you can list the drill-project memories directly:
curl -fsS 'http://127.0.0.1:8100/memory?project=drill' | python3 -m json.tool
# should return the DRILL-MARKER memory
```

### Step 4. Stop the service

```bash
cd /srv/storage/atocore/app/deploy/dalidou
docker compose down
```

Wait for the container to actually exit:

```bash
docker compose ps
# atocore should be gone or Exited
```

### Step 5. Restore from the snapshot

Run the restore inside a one-shot container that reuses the same
volumes as the live service. This guarantees the paths resolve
identically to the running container's view.

```bash
cd /srv/storage/atocore/app/deploy/dalidou
docker compose run --rm --entrypoint python atocore \
  -m atocore.ops.backup restore \
  <YOUR_STAMP_FROM_STEP_1> \
  --confirm-service-stopped
```

The output is JSON; the important fields are:

- `pre_restore_snapshot`: stamp of the safety snapshot of live
  state at the moment of restore. **Write this down.** If the
  restore turns out to be the wrong call, this is how you roll
  it back.
- `db_restored`: `true`
- `registry_restored`: `true` if the backup had a registry
- `chroma_restored`: `true` if the backup had a chroma snapshot
- `restored_integrity_ok`: **must be `true`**. If this is false,
  STOP and do not start the service; investigate the integrity
  error first.

If restoration fails at any step, the function raises a clean
`RuntimeError` and nothing partial is committed past the main file
swap. The pre-restore safety snapshot is your rollback anchor.

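When the drill is scripted, the go/no-go decision on the restore output can be made mechanical. The sketch below checks the fields listed above; the `check_restore_output` helper is hypothetical, and it assumes the CLI prints a single JSON object containing exactly those field names.

```python
import json
from typing import Optional


def check_restore_output(raw: str) -> Optional[str]:
    """Return the pre-restore stamp, or raise if the restore is not trustworthy."""
    result = json.loads(raw)
    if result.get("db_restored") is not True:
        raise RuntimeError("db_restored is not true: the db step did not run")
    if result.get("restored_integrity_ok") is not True:
        raise RuntimeError("integrity check failed: do not start the service")
    # May be None if the safety snapshot was skipped on request.
    return result.get("pre_restore_snapshot")


if __name__ == "__main__":
    # Hypothetical output with the fields named in this runbook.
    sample = json.dumps({
        "pre_restore_snapshot": "20260409T020000Z",
        "db_restored": True,
        "registry_restored": True,
        "chroma_restored": True,
        "restored_integrity_ok": True,
    })
    print(check_restore_output(sample))
```

Treating anything other than a literal `true` as failure is deliberate: a missing field should stop the drill just as firmly as an explicit `false`.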
### Step 6. Start the service back up

```bash
cd /srv/storage/atocore/app/deploy/dalidou
docker compose up -d
```

Wait for `/health` to respond:

```bash
for i in 1 2 3 4 5 6 7 8 9 10; do
  curl -fsS 'http://127.0.0.1:8100/health' \
    && break || { echo "not ready ($i/10)"; sleep 3; }
done
```

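The shell loop above ends quietly whether or not the service ever came up. For a scripted drill, a sketch of the same polling logic that returns an explicit result may be more useful. The `probe_health` and `wait_for_health` names are illustrative, and treating HTTP 200 as "healthy" is an assumption about the `/health` route.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

HEALTH_URL = "http://127.0.0.1:8100/health"


def probe_health(url: str = HEALTH_URL, timeout: float = 2.0) -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_for_health(
    probe: Callable[[], bool] = probe_health,
    attempts: int = 10,
    delay: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the probe succeeds; False if every attempt fails."""
    for attempt in range(1, attempts + 1):
        if probe():
            return True
        print(f"not ready ({attempt}/{attempts})")
        if attempt < attempts:
            sleep(delay)
    return False


if __name__ == "__main__":
    raise SystemExit(0 if wait_for_health() else 1)
```

The probe and sleep functions are parameters so the loop can be exercised without a running service.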
### Step 7. Verify the drill marker is gone

```bash
curl -fsS 'http://127.0.0.1:8100/health' | python3 -m json.tool
# memory_count should equal the Step 2 baseline, NOT baseline + 1
```

You can also list the drill-project memories directly:

```bash
curl -fsS 'http://127.0.0.1:8100/memory?project=drill' | python3 -m json.tool
# should return an empty list: the DRILL-MARKER memory was rolled back
```

For a semantic-retrieval cross-check, issue a query (the `/query`
endpoint takes `prompt`, not `query`):

```bash
curl -fsS -X POST 'http://127.0.0.1:8100/query' \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "DRILL-MARKER drill marker", "top_k": 5}' \
  | python3 -m json.tool
# should not return the DRILL-MARKER memory in the hits
```

If the marker is gone and `memory_count` matches the baseline, the
drill **passed**. The runtime store has a trustworthy rollback.

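The pass condition combines two checks, which can be written down as a single predicate. This is a sketch only: `drill_passed` is a hypothetical helper, and it assumes that `GET /memory?project=drill` returns a list of objects that each carry a `content` field, which this runbook does not confirm.

```python
from typing import Any, Mapping, Sequence

MARKER = "DRILL-MARKER"


def drill_passed(
    baseline_health: Mapping[str, Any],
    restored_health: Mapping[str, Any],
    drill_memories: Sequence[Mapping[str, Any]],
) -> bool:
    """True if memory_count is back at baseline and no marker memory remains."""
    counts_match = (
        restored_health["memory_count"] == baseline_health["memory_count"]
    )
    marker_present = any(
        MARKER in str(memory.get("content", "")) for memory in drill_memories
    )
    return counts_match and not marker_present


if __name__ == "__main__":
    baseline = {"memory_count": 41}   # /health from Step 2 (made-up value)
    restored = {"memory_count": 41}   # /health after the restore
    print(drill_passed(baseline, restored, drill_memories=[]))
```

Checking both conditions guards against the case where the count happens to match for an unrelated reason while the marker is still present.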
### Step 8. (Optional) Clean up the safety snapshot

If everything went smoothly you can leave the pre-restore safety
snapshot on disk for a few days as a paranoia buffer. There's no
automatic cleanup yet; `list_runtime_backups()` will show it, and
you can remove it by hand once you're confident:

```bash
rm -rf /srv/storage/atocore/backups/snapshots/<pre_restore_stamp>
```

## Failure modes and recovery

### Restore reports `restored_integrity_ok: false`

The copied db failed `PRAGMA integrity_check`. Do **not** start
the service. This usually means either the source snapshot was
itself corrupt (and `validate_backup` should have caught it; file
a bug if it didn't), or the copy was interrupted. Options:

1. Validate the source snapshot directly:
   `python -m atocore.ops.backup validate <STAMP>`
2. Pick a different, older snapshot and retry the restore.
3. Roll the db back to your pre-restore safety snapshot.

### The live container won't start after restore

Check the container logs:

```bash
cd /srv/storage/atocore/app/deploy/dalidou
docker compose logs --tail=100 atocore
```

Common causes:

- Schema drift between the snapshot and the current code version.
  `_apply_migrations` in `src/atocore/models/database.py` is
  idempotent and should absorb most forward migrations, but
  restoring an older snapshot under newer code may still hit
  unexpected state. Because the migration only adds columns, the
  opposite direction (older code against a newer snapshot) is
  usually safe, but verify.
- Chroma and SQLite disagreeing about what chunks exist. The
  backup captures them together to minimize this, but if you
  restore SQLite without Chroma (`--no-chroma`), retrieval may
  return stale vectors. Re-ingest if this happens.

### The drill marker is still present after restore

Something went wrong. Possible causes:

- You restored a snapshot taken AFTER the drill marker was
  written (wrong stamp).
- The service was writing during the drill and committed the
  marker before `docker compose down`. Double-check the order.
- The restore silently skipped the db step. Check the restore
  output for `db_restored: true` and `restored_integrity_ok: true`.

Roll back to the pre-restore safety snapshot and retry with the
correct source snapshot.

## When to run this drill

- **Before** enabling any new write-path automation (auto-capture,
  automated ingestion, reinforcement sweeps, scheduled extraction).
- **After** any change to `src/atocore/ops/backup.py` or the
  schema migrations in `src/atocore/models/database.py`.
- **After** a Dalidou OS upgrade or docker version bump.
- **Monthly** as a standing operational check.

Record each drill run (pass/fail) somewhere durable; even a line
in the project journal is enough. A drill you ran once and never
again is barely more than a drill you never ran.

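If no journal convention exists yet, one low-effort option is an append-only JSON Lines file. The sketch below is only a suggestion: the `record_drill` helper, the field names, and the file location are all assumptions, not something AtoCore provides.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def record_drill(journal: Path, stamp: str, passed: bool, note: str = "") -> dict:
    """Append one drill result to a JSON Lines journal and return the entry."""
    entry = {
        "ran_at": datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ"),
        "snapshot": stamp,
        "passed": passed,
        "note": note,
    }
    # Append mode keeps earlier runs intact; one JSON object per line.
    with journal.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        journal = Path(tmp) / "drill-journal.jsonl"  # hypothetical location
        record_drill(journal, "20260409T012345Z", passed=True)
        print(journal.read_text(encoding="utf-8"), end="")
```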