fix: chroma restore bind-mount bug + consolidate docs

Two fixes from the 2026-04-09 first real restore drill on Dalidou,
plus the long-overdue doc consolidation I should have done when I
added the drill runbook instead of creating a duplicate.

## Chroma restore bind-mount bug (drill finding)

src/atocore/ops/backup.py: restore_runtime_backup() used to call
shutil.rmtree(dst_chroma) before copying the snapshot back. In the
Dockerized Dalidou deployment the chroma dir is a bind-mounted
volume. You can't unlink a mount point, so rmtree raises
OSError [Errno 16] Device or resource busy and the restore silently
fails to touch Chroma. This bit the first real drill; the operator
worked around it with --no-chroma plus a manual cp -a.

Fix: clear the destination's CONTENTS (iterdir + rmtree/unlink per
child) and use copytree(dirs_exist_ok=True) so the mount point
itself is never touched. Equivalent semantics, bind-mount-safe.

Regression test:
tests/test_backup.py::test_restore_chroma_does_not_unlink_destination_directory
captures Path.stat().st_ino of the dest dir before and after
restore and asserts they match. That's the same invariant a
bind-mounted chroma dir enforces: if the inode changed, the mount
would have failed.

11/11 backup tests now pass.

## Doc consolidation

docs/backup-restore-drill.md existed as a duplicate of the
authoritative docs/backup-restore-procedure.md. When I added the
drill runbook in commit 3362080 I wrote it from scratch instead of
updating the existing procedure. Bad doc hygiene on a project
that's literally about being a context engine.

- Deleted docs/backup-restore-drill.md
- Folded its contents into docs/backup-restore-procedure.md:
  - Replaced the manual sudo cp restore sequence with the new
    `python -m atocore.ops.backup restore <STAMP> --confirm-service-stopped`
    CLI
  - Added the one-shot docker compose run pattern for running
    restore inside a container that reuses the live volume mounts
  - Documented the --no-pre-snapshot / --no-chroma / --chroma flags
  - New "Chroma restore and bind-mounted volumes" subsection
    explaining the bug and the regression test that protects the fix
  - New "Restore drill" subsection with three levels (unit tests,
    module round-trip, live Dalidou drill) and the cadence list
  - Failure-mode table gained four entries: restored_integrity_ok,
    Device-or-resource-busy, drill marker still present,
    chroma_snapshot_missing
  - "Open follow-ups" struck the restore_runtime_backup item (done)
    and added a "Done (historical)" note referencing 2026-04-09
  - Quickstart cheat sheet now has a full drill one-liner using
    memory_type=episodic (the 2026-04-09 drill found the runbook's
    memory_type=note was invalid; the valid set is identity,
    preference, project, episodic, knowledge, adaptation)

## Status doc sync

Long overdue: I've been landing code without updating the project's
narrative state docs.

docs/current-state.md:
- "Reliability Baseline" now reflects: restore_runtime_backup is
  real with CLI, pre-restore safety snapshot, WAL cleanup, integrity
  check; live drill on 2026-04-09 surfaced and fixed Chroma
  bind-mount bug; deploy provenance via /health build_sha; deploy.sh
  self-update re-exec guard
- "Immediate Next Focus" reshuffled: drill re-run (priority 1) and
  auto-capture (priority 2) are now ahead of retrieval quality work,
  reflecting the updated unblock sequence

docs/next-steps.md:
- New item 1: re-run the drill with chroma working end-to-end
- New item 2: auto-capture conservative mode (Stop hook)
- Old item 7 rewritten as item 9 listing what's DONE
  (create/list/validate/restore, admin/backup endpoint with
  include_chroma, /health provenance, self-update guard, procedure
  doc with failure modes) and what's still pending (retention
  cleanup, off-Dalidou target, auto-validation)

## Test count

226 passing (was 225 + 1 new inode-stability regression test).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 09:13:21 -04:00
parent 336208004c
commit 1a8fdf4225
6 changed files with 331 additions and 431 deletions
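
Reviewer note: the bind-mount-safe approach described in the commit message (clear the destination's contents, then `copytree(dirs_exist_ok=True)`) can be sketched roughly as follows. This is a minimal illustration, not the code in `src/atocore/ops/backup.py`; the helper names `clear_directory_contents` and `restore_tree_in_place` are hypothetical, and it assumes Python 3.8+ for `dirs_exist_ok`.

```python
import shutil
import tempfile
from pathlib import Path


def clear_directory_contents(dst: Path) -> None:
    """Delete everything inside dst while leaving dst itself in place.

    Hypothetical helper: the directory entry for dst is never unlinked,
    which is what a bind-mounted destination requires.
    """
    for child in dst.iterdir():
        if child.is_dir() and not child.is_symlink():
            shutil.rmtree(child)
        else:
            # Regular files and symlinks (including symlinks to dirs).
            child.unlink()


def restore_tree_in_place(src: Path, dst: Path) -> None:
    """Replace the contents of dst with a copy of src without recreating dst."""
    dst.mkdir(parents=True, exist_ok=True)
    clear_directory_contents(dst)
    # dirs_exist_ok=True lets copytree write into the existing dst.
    shutil.copytree(src, dst, dirs_exist_ok=True)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        snapshot = Path(tmp) / "snapshot"
        live = Path(tmp) / "live"
        snapshot.mkdir()
        (snapshot / "index.bin").write_text("from snapshot")
        live.mkdir()
        (live / "stale.bin").write_text("stale")

        inode_before = live.stat().st_ino
        restore_tree_in_place(snapshot, live)

        # Same check the regression test is described as making:
        # the destination directory keeps its inode across the restore.
        print("inode stable:", live.stat().st_ino == inode_before)
        print("contents:", sorted(p.name for p in live.iterdir()))
```

A plain temporary directory is not a mount point, so this does not reproduce the `Device or resource busy` error itself; it only shows that the destination directory keeps its identity, which is the property the inode check stands in for.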
--- a/docs/backup-restore-procedure.md
+++ b/docs/backup-restore-procedure.md
@@ -146,141 +146,181 @@ of bytes.

 ## Restore procedure

+Since 2026-04-09 the restore is implemented as a proper module
+function plus CLI entry point: `restore_runtime_backup()` in
+`src/atocore/ops/backup.py`, invoked as
+`python -m atocore.ops.backup restore <STAMP> --confirm-service-stopped`.
+It automatically takes a pre-restore safety snapshot (your rollback
+anchor), handles SQLite WAL/SHM cleanly, restores the registry, and
+runs `PRAGMA integrity_check` on the restored db. This replaces the
+earlier manual `sudo cp` sequence.
+
+The function refuses to run without `--confirm-service-stopped`.
+This is deliberate: hot-restoring into a running service corrupts
+SQLite state.
+
 ### Pre-flight (always)

 1. Identify which snapshot you want to restore. List available
   snapshots and pick by timestamp:
   ```bash
-   curl -fsS http://dalidou:8100/admin/backup | jq '.backups[].stamp'
+   curl -fsS http://127.0.0.1:8100/admin/backup | jq '.backups[].stamp'
   ```
 2. Validate it. Refuse to restore an invalid backup:
   ```bash
-   STAMP=20260407T060000Z
-   curl -fsS http://dalidou:8100/admin/backup/$STAMP/validate | jq .
+   STAMP=20260409T060000Z
+   curl -fsS http://127.0.0.1:8100/admin/backup/$STAMP/validate | jq .
   ```
 3. **Stop AtoCore.** SQLite cannot be hot-restored under a running
   process and Chroma will not pick up new files until the process
   restarts.
   ```bash
-   docker compose stop atocore
-   # or: sudo systemctl stop atocore
-   ```
-4. **Take a safety snapshot of the current state** before overwriting
-   it. This is your "if the restore makes things worse, here's the
-   undo" backup.
-   ```bash
-   PRESERVE_STAMP=$(date -u +%Y%m%dT%H%M%SZ)
-   sudo cp /srv/storage/atocore/data/db/atocore.db \
-           /srv/storage/atocore/backups/pre-restore-$PRESERVE_STAMP.db
-   sudo cp /srv/storage/atocore/config/project-registry.json \
-           /srv/storage/atocore/backups/pre-restore-$PRESERVE_STAMP.registry.json 2>/dev/null || true
+   cd /srv/storage/atocore/app/deploy/dalidou
+   docker compose down
+   docker compose ps   # atocore should be Exited/gone
   ```

-### Restore the SQLite database
+### Run the restore
+
+Use a one-shot container that reuses the live service's volume
+mounts so every path (`db_path`, `chroma_path`, backup dir) resolves
+to the same place the main service would see:

 ```bash
-SNAPSHOT_DIR=/srv/storage/atocore/backups/snapshots/$STAMP
-sudo cp $SNAPSHOT_DIR/db/atocore.db \
-        /srv/storage/atocore/data/db/atocore.db
-sudo chown 1000:1000 /srv/storage/atocore/data/db/atocore.db
-sudo chmod 600 /srv/storage/atocore/data/db/atocore.db
+cd /srv/storage/atocore/app/deploy/dalidou
+docker compose run --rm --entrypoint python atocore \
+    -m atocore.ops.backup restore \
+        $STAMP \
+        --confirm-service-stopped
 ```

-The chown should match the gitea/atocore container user. Verify
-by checking the existing perms before overwriting:
+Output is a JSON document. The critical fields:

-```bash
-stat -c '%U:%G %a' /srv/storage/atocore/data/db/atocore.db
-```
+- `pre_restore_snapshot`: stamp of the safety snapshot of live
+  state taken right before the restore. **Write this down.** If
+  the restore was the wrong call, this is how you roll it back.
+- `db_restored`: should be `true`
+- `registry_restored`: `true` if the backup captured a registry
+- `chroma_restored`: `true` if the backup captured a chroma tree
+  and `include_chroma` resolved to true (the default)
+- `restored_integrity_ok`: **must be `true`** — if this is false,
+  STOP and do not start the service; investigate the integrity
+  error first. The restored file is still on disk but untrusted.

-### Restore the project registry
+### Controlling the restore

-```bash
-if [ -f $SNAPSHOT_DIR/config/project-registry.json ]; then
-  sudo cp $SNAPSHOT_DIR/config/project-registry.json \
-          /srv/storage/atocore/config/project-registry.json
-  sudo chown 1000:1000 /srv/storage/atocore/config/project-registry.json
-  sudo chmod 644 /srv/storage/atocore/config/project-registry.json
-fi
-```
+The CLI supports a few flags for finer control:

-If the snapshot does not contain a registry, the current registry is
-preserved. The pre-flight safety copy still gives you a recovery path
-if you need to roll back.
+- `--no-pre-snapshot` skips the pre-restore safety snapshot. Use
+  this only when you know you have another rollback path.
+- `--no-chroma` restores only SQLite + registry, leaving the
+  current Chroma dir alone. Useful if Chroma is consistent but
+  SQLite needs a rollback.
+- `--chroma` forces Chroma restoration even if the metadata
+  doesn't clearly indicate the snapshot has it (rare).

-### Restore the Chroma vector store (if it was in the snapshot)
+### Chroma restore and bind-mounted volumes

-```bash
-if [ -d $SNAPSHOT_DIR/chroma ]; then
-  # Move the current chroma dir aside as a safety copy
-  sudo mv /srv/storage/atocore/data/chroma \
-          /srv/storage/atocore/data/chroma.pre-restore-$PRESERVE_STAMP
+The Chroma dir on Dalidou is a bind-mounted Docker volume. The
+restore cannot `rmtree` the destination (you can't unlink a mount
+point — it raises `OSError [Errno 16] Device or resource busy`),
+so the function clears the dir's CONTENTS and uses
+`copytree(dirs_exist_ok=True)` to copy the snapshot back in. The
+regression test `test_restore_chroma_does_not_unlink_destination_directory`
+in `tests/test_backup.py` captures the destination inode before
+and after restore and asserts it's stable — the same invariant
+that protects the bind mount.

-  # Copy the snapshot in
-  sudo cp -a $SNAPSHOT_DIR/chroma /srv/storage/atocore/data/chroma
-  sudo chown -R 1000:1000 /srv/storage/atocore/data/chroma
-fi
-```
-
-If the snapshot does NOT contain a Chroma dir but the SQLite
-restore would leave the vector store and the SQL store inconsistent
-(e.g. SQL has chunks the vector store doesn't), you have two
-options:
-
- **Option 1: rebuild the vector store from source documents.** Run
-  ingestion fresh after the SQL restore. This regenerates embeddings
-  from the actual source files. Slow but produces a perfectly
-  consistent state.
- **Option 2: accept the inconsistency and live with stale-vector
-  filtering.** The retriever already drops vector results whose
-  SQL row no longer exists (`_existing_chunk_ids` filter), so the
-  inconsistency surfaces as missing results, not bad ones.
-
-For an unplanned restore, Option 2 is the right immediate move.
-Then schedule a fresh ingestion pass to rebuild the vector store
-properly.
+This was discovered during the first real Dalidou restore drill
+on 2026-04-09. If you see a new restore failure with
+`Device or resource busy`, the fix has regressed.

 ### Restart AtoCore

 ```bash
-docker compose up -d atocore
-# or: sudo systemctl start atocore
+cd /srv/storage/atocore/app/deploy/dalidou
+docker compose up -d
+# Wait for /health to come up
+for i in 1 2 3 4 5 6 7 8 9 10; do
+    curl -fsS http://127.0.0.1:8100/health \
+        && break || { echo "not ready ($i/10)"; sleep 3; }
+done
 ```

 ### Post-restore verification

 ```bash
 # 1. Service is healthy
-curl -fsS http://dalidou:8100/health | jq .
+curl -fsS http://127.0.0.1:8100/health | jq .

 # 2. Stats look right
-curl -fsS http://dalidou:8100/stats | jq .
+curl -fsS http://127.0.0.1:8100/stats | jq .

 # 3. Project registry loads
-curl -fsS http://dalidou:8100/projects | jq '.projects | length'
+curl -fsS http://127.0.0.1:8100/projects | jq '.projects | length'

 # 4. A known-good context query returns non-empty results
-curl -fsS -X POST http://dalidou:8100/context/build \
+curl -fsS -X POST http://127.0.0.1:8100/context/build \
  -H "Content-Type: application/json" \
  -d '{"prompt": "what is p05 about", "project": "p05-interferometer"}' | jq '.chunks_used'
 ```

 If any of these are wrong, the restore is bad. Roll back using the
-pre-restore safety copy:
+pre-restore safety snapshot whose stamp you recorded from the
+restore output. The rollback is the same procedure — stop the
+service and restore that stamp:

 ```bash
-docker compose stop atocore
-sudo cp /srv/storage/atocore/backups/pre-restore-$PRESERVE_STAMP.db \
-        /srv/storage/atocore/data/db/atocore.db
-sudo cp /srv/storage/atocore/backups/pre-restore-$PRESERVE_STAMP.registry.json \
-        /srv/storage/atocore/config/project-registry.json 2>/dev/null || true
-# If you also restored chroma:
-sudo rm -rf /srv/storage/atocore/data/chroma
-sudo mv /srv/storage/atocore/data/chroma.pre-restore-$PRESERVE_STAMP \
-        /srv/storage/atocore/data/chroma
-docker compose up -d atocore
+docker compose down
+docker compose run --rm --entrypoint python atocore \
+    -m atocore.ops.backup restore \
+        $PRE_RESTORE_SNAPSHOT_STAMP \
+        --confirm-service-stopped \
+        --no-pre-snapshot
+docker compose up -d
 ```

+(`--no-pre-snapshot` because the rollback itself doesn't need one;
+you already have the original snapshot as a fallback if everything
+goes sideways.)
+
+### Restore drill
+
+The restore is exercised at three levels:
+
+1. **Unit tests.** `tests/test_backup.py` has seven restore tests
+   (refuse-without-confirm, invalid backup, full round-trip,
+   Chroma round-trip, inode-stability regression, WAL sidecar
+   cleanup, skip-pre-snapshot). These run in CI on every commit.
+2. **Module-level round-trip.**
+   `test_restore_round_trip_reverses_post_backup_mutations` is
+   the canonical drill in code form: seed baseline, snapshot,
+   mutate, restore, assert mutation reversed + baseline survived
+   + pre-restore snapshot captured the mutation.
+3. **Live drill on Dalidou.** Periodically run the full procedure
+   against the real service with a disposable drill-marker
+   memory (created via `POST /memory` with `memory_type=episodic`
+   and `project=drill`), following the sequence above and then
+   verifying the marker is gone afterward via
+   `GET /memory?project=drill`. The first such drill on
+   2026-04-09 surfaced the bind-mount bug; future runs
+   primarily exist to verify the fix stays fixed.
+
+Run the live drill:
+
+- **Before** enabling any new write-path automation (auto-capture,
+  automated ingestion, reinforcement sweeps).
+- **After** any change to `src/atocore/ops/backup.py` or to
+  schema migrations in `src/atocore/models/database.py`.
+- **After** a Dalidou OS upgrade or docker version bump.
+- **At least once per quarter** as a standing operational check.
+- **After any incident** that touched the storage layer.
+
+Record each drill run (stamp, pre-restore snapshot stamp, pass/fail,
+any surprises) somewhere durable — a line in the project journal
+or a git commit message is enough. A drill you ran once and never
+again is barely better than a drill you never ran.
+
 ## Retention policy

 - **Last 7 daily backups**: kept verbatim
@@ -296,32 +336,26 @@ A simple cron-based cleanup script is the next step:
 0 4 * * * /srv/storage/atocore/scripts/cleanup-old-backups.sh
 ```

-## Drill schedule
-
-A backup that has never been restored is theoretical. The schedule:
-
- **At least once per quarter**, perform a full restore drill on a
-  staging environment (or a temporary container with a separate
-  data dir) and verify the post-restore checks pass.
- **After every breaking schema migration**, perform a restore drill
-  to confirm the migration is reversible.
- **After any incident** that touched the storage layer (the EXDEV
-  bug from April 2026 is a good example), confirm the next backup
-  validates clean.
-
 ## Common failure modes and what to do about them

 | Symptom | Likely cause | Action |
 |---|---|---|
 | `db_integrity_check_failed` on validation | SQLite snapshot copied while a write was in progress, or disk corruption | Take a fresh backup and validate again. If it fails twice, suspect the underlying disk. |
 | `registry_invalid_json` | Registry was being edited at backup time | Take a fresh backup. The registry is small so this is cheap. |
-| `chroma_snapshot_missing` after a restore | Snapshot was DB-only and the restore didn't move the existing chroma dir | Either rebuild via fresh ingestion or restore an older snapshot that includes Chroma. |
+| Restore: `restored_integrity_ok: false` | Source snapshot was itself corrupt (validation should have caught it — file a bug) or copy was interrupted mid-write | Do NOT start the service. Validate the snapshot directly with `python -m atocore.ops.backup validate <STAMP>`, try a different older snapshot, or roll back to the pre-restore safety snapshot. |
+| Restore: `OSError [Errno 16] Device or resource busy` on Chroma | Old code tried to `rmtree` the Chroma mount point. Fixed on 2026-04-09; the regression test `test_restore_chroma_does_not_unlink_destination_directory` guards the fix | Ensure you're running a build that includes the 2026-04-09 fix; if you need to work around an older build, use `--no-chroma` and restore Chroma contents manually. |
+| `chroma_snapshot_missing` after a restore | Snapshot was DB-only | Either rebuild via fresh ingestion or restore an older snapshot that includes Chroma. |
 | Service won't start after restore | Permissions wrong on the restored files | Re-run `chown 1000:1000` (or whatever the gitea/atocore container user is) on the data dir. |
 | `/stats` returns 0 documents after restore | The SQL store was restored but the source paths in `source_documents` don't match the current Dalidou paths | This means the backup came from a different deployment. Don't trust this restore — it's pulling from the wrong layout. |
+| Drill marker still present after restore | Wrong stamp, service still writing during `docker compose down`, or the restore JSON didn't report `db_restored: true` | Roll back via the pre-restore safety snapshot and retry with the correct source snapshot. |

 ## Open follow-ups (not yet implemented)

-1. **Retention cleanup script**: see the cron entry above.
+Tracked separately in `docs/next-steps.md` — the list below is the
+backup-specific subset.
+
+1. **Retention cleanup script**: see the cron entry above. The
+   snapshots directory grows monotonically until this exists.
 2. **Off-Dalidou backup target**: currently snapshots live on the
   same disk as the live data. A real disaster-recovery story
   needs at least one snapshot on a different physical machine.
@@ -338,23 +372,59 @@ A backup that has never been restored is theoretical. The schedule:
   improvement would be incremental snapshots via filesystem-level
   snapshotting (LVM, btrfs, ZFS).

+**Done** (kept for historical reference):
+
+- ~~Implement `restore_runtime_backup()` as a proper module
+  function so the restore isn't a manual `sudo cp` dance~~ —
+  landed 2026-04-09 in commit 3362080, followed by the
+  Chroma bind-mount fix from the first real drill.
+
 ## Quickstart cheat sheet

 ```bash
 # Daily backup (DB + registry only — fast)
-curl -fsS -X POST http://dalidou:8100/admin/backup \
+curl -fsS -X POST http://127.0.0.1:8100/admin/backup \
  -H "Content-Type: application/json" -d '{}'

 # Weekly backup (DB + registry + Chroma — slower, holds ingestion lock)
-curl -fsS -X POST http://dalidou:8100/admin/backup \
+curl -fsS -X POST http://127.0.0.1:8100/admin/backup \
  -H "Content-Type: application/json" -d '{"include_chroma": true}'

 # List backups
-curl -fsS http://dalidou:8100/admin/backup | jq '.backups[].stamp'
+curl -fsS http://127.0.0.1:8100/admin/backup | jq '.backups[].stamp'

 # Validate the most recent backup
-LATEST=$(curl -fsS http://dalidou:8100/admin/backup | jq -r '.backups[-1].stamp')
-curl -fsS http://dalidou:8100/admin/backup/$LATEST/validate | jq .
+LATEST=$(curl -fsS http://127.0.0.1:8100/admin/backup | jq -r '.backups[-1].stamp')
+curl -fsS http://127.0.0.1:8100/admin/backup/$LATEST/validate | jq .

-# Full restore — see the "Restore procedure" section above
+# Full restore (service must be stopped first)
+cd /srv/storage/atocore/app/deploy/dalidou
+docker compose down
+docker compose run --rm --entrypoint python atocore \
+    -m atocore.ops.backup restore $STAMP --confirm-service-stopped
+docker compose up -d
+
+# Live drill: exercise the full create -> mutate -> restore flow
+# against the running service. The marker memory uses
+# memory_type=episodic (valid types: identity, preference, project,
+# episodic, knowledge, adaptation) and project=drill so it's easy
+# to find via GET /memory?project=drill before and after.
+#
+# See the "Restore drill" section above for the full sequence.
+STAMP=$(curl -fsS -X POST http://127.0.0.1:8100/admin/backup \
+    -H 'Content-Type: application/json' \
+    -d '{"include_chroma": true}' | jq -r '.backup_root' | awk -F/ '{print $NF}')
+
+curl -fsS -X POST http://127.0.0.1:8100/memory \
+    -H 'Content-Type: application/json' \
+    -d '{"memory_type":"episodic","content":"DRILL-MARKER","project":"drill","confidence":1.0}'
+
+cd /srv/storage/atocore/app/deploy/dalidou
+docker compose down
+docker compose run --rm --entrypoint python atocore \
+    -m atocore.ops.backup restore $STAMP --confirm-service-stopped
+docker compose up -d
+
+# Marker should be gone:
+curl -fsS 'http://127.0.0.1:8100/memory?project=drill' | jq .
 ```