fix: chroma restore bind-mount bug + consolidate docs

Two fixes from the 2026-04-09 first real restore drill on Dalidou, plus the long-overdue doc consolidation I should have done when I added the drill runbook instead of creating a duplicate. ## Chroma restore bind-mount bug (drill finding) src/atocore/ops/backup.py: restore_runtime_backup() used to call shutil.rmtree(dst_chroma) before copying the snapshot back. In the Dockerized Dalidou deployment the chroma dir is a bind-mounted volume — you can't unlink a mount point, rmtree raises OSError [Errno 16] Device or resource busy and the restore silently fails to touch Chroma. This bit the first real drill; the operator worked around it with --no-chroma plus a manual cp -a. Fix: clear the destination's CONTENTS (iterdir + rmtree/unlink per child) and use copytree(dirs_exist_ok=True) so the mount point itself is never touched. Equivalent semantics, bind-mount-safe. Regression test: tests/test_backup.py::test_restore_chroma_does_not_unlink_destination_directory captures Path.stat().st_ino of the dest dir before and after restore and asserts they match. That's the same invariant a bind-mounted chroma dir enforces — if the inode changed, the mount would have failed. 11/11 backup tests now pass. ## Doc consolidation docs/backup-restore-drill.md existed as a duplicate of the authoritative docs/backup-restore-procedure.md. When I added the drill runbook in commit 3362080 I wrote it from scratch instead of updating the existing procedure — bad doc hygiene on a project that's literally about being a context engine. - Deleted docs/backup-restore-drill.md - Folded its contents into docs/backup-restore-procedure.md: - Replaced the manual sudo cp restore sequence with the new `python -m atocore.ops.backup restore <STAMP> --confirm-service-stopped` CLI - Added the one-shot docker compose run pattern for running restore inside a container that reuses the live volume mounts - Documented the --no-pre-snapshot / --no-chroma / --chroma flags - New "Chroma restore and bind-mounted volumes" subsection explaining the bug and the regression test that protects the fix - New "Restore drill" subsection with three levels (unit tests, module round-trip, live Dalidou drill) and the cadence list - Failure-mode table gained four entries: restored_integrity_ok, Device-or-resource-busy, drill marker still present, chroma_snapshot_missing - "Open follow-ups" struck the restore_runtime_backup item (done) and added a "Done (historical)" note referencing 2026-04-09 - Quickstart cheat sheet now has a full drill one-liner using memory_type=episodic (the 2026-04-09 drill found the runbook's memory_type=note was invalid — the valid set is identity, preference, project, episodic, knowledge, adaptation) ## Status doc sync Long overdue — I've been landing code without updating the project's narrative state docs. docs/current-state.md: - "Reliability Baseline" now reflects: restore_runtime_backup is real with CLI, pre-restore safety snapshot, WAL cleanup, integrity check; live drill on 2026-04-09 surfaced and fixed Chroma bind-mount bug; deploy provenance via /health build_sha; deploy.sh self-update re-exec guard - "Immediate Next Focus" reshuffled: drill re-run (priority 1) and auto-capture (priority 2) are now ahead of retrieval quality work, reflecting the updated unblock sequence docs/next-steps.md: - New item 1: re-run the drill with chroma working end-to-end - New item 2: auto-capture conservative mode (Stop hook) - Old item 7 rewritten as item 9 listing what's DONE (create/list/validate/restore, admin/backup endpoint with include_chroma, /health provenance, self-update guard, procedure doc with failure modes) and what's still pending (retention cleanup, off-Dalidou target, auto-validation) ## Test count 226 passing (was 225 + 1 new inode-stability regression test). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 09:13:21 -04:00
parent 336208004c
commit 1a8fdf4225
6 changed files with 331 additions and 431 deletions
--- a/docs/next-steps.md
+++ b/docs/next-steps.md
@@ -20,45 +20,75 @@ This working list should be read alongside:

 ## Immediate Next Steps

-1. Use the T420 `atocore-context` skill and the new organic routing layer in
+1. Re-run the backup/restore drill on Dalidou with the Chroma
+   bind-mount fix in place
+   - the 2026-04-09 drill was a PARTIAL PASS: db restore + marker
+     reversal worked cleanly, but the Chroma step failed with
+     `OSError [Errno 16] Device or resource busy` because
+     `shutil.rmtree` cannot unlink a Docker bind-mounted volume
+   - fix landed immediately after: `restore_runtime_backup()` now
+     clears the destination's CONTENTS and uses
+     `copytree(dirs_exist_ok=True)`, and the regression test
+     `test_restore_chroma_does_not_unlink_destination_directory`
+     asserts the destination inode is stable
+   - need a green end-to-end run with `--chroma` actually
+     working in-container before enabling write-path automation
+2. Turn on auto-capture of Claude Code sessions once the drill
+   re-run is clean
+   - conservative mode: Stop hook posts to `/interactions`,
+     no auto-extraction into review queue without review cadence
+     in place
+3. Use the T420 `atocore-context` skill and the new organic routing layer in
   real OpenClaw workflows
   - confirm `auto-context` feels natural
   - confirm project inference is good enough in practice
   - confirm the fail-open behavior remains acceptable in practice
-2. Review retrieval quality after the first real project ingestion batch
+4. Review retrieval quality after the first real project ingestion batch
   - check whether the top hits are useful
   - check whether trusted project state remains dominant
   - reduce cross-project competition and prompt ambiguity where needed
   - use `debug-context` to inspect the exact last AtoCore supplement
-3. Treat the active-project full markdown/text wave as complete
+5. Treat the active-project full markdown/text wave as complete
   - `p04-gigabit`
   - `p05-interferometer`
   - `p06-polisher`
-4. Define a cleaner source refresh model
+6. Define a cleaner source refresh model
   - make the difference between source truth, staged inputs, and machine store
     explicit
   - move toward a project source registry and refresh workflow
   - foundation now exists via project registry + per-project refresh API
   - registration policy + template + proposal + approved registration are now
     the normal path for new projects
-5. Move to Wave 2 trusted-operational ingestion
+7. Move to Wave 2 trusted-operational ingestion
   - curated dashboards
   - decision logs
   - milestone/current-status views
   - operational truth, not just raw project notes
-6. Integrate the new engineering architecture docs into active planning, not immediate schema code
+8. Integrate the new engineering architecture docs into active planning, not immediate schema code
   - keep `docs/architecture/engineering-knowledge-hybrid-architecture.md` as the target layer model
   - keep `docs/architecture/engineering-ontology-v1.md` as the V1 structured-domain target
   - do not start entity/relationship persistence until the ingestion, retrieval, registry, and backup baseline feels boring and stable
-7. Define backup and export procedures for Dalidou
-   - exercise the new SQLite + registry snapshot path on Dalidou
-   - Chroma backup or rebuild policy
-   - retention and restore validation
-   - admin backup endpoint now supports `include_chroma` cold snapshot
-     under the ingestion lock and `validate` confirms each snapshot is
-     openable; remaining work is the operational retention policy
-8. Keep deeper automatic runtime integration modest until the organic read-only
-   model has proven value
+9. Finish the boring operations baseline around backup
+   - retention policy cleanup script (snapshots dir grows
+     monotonically today)
+   - off-Dalidou backup target (at minimum an rsync to laptop or
+     another host so a single-disk failure isn't terminal)
+   - automatic post-backup validation (have `create_runtime_backup`
+     call `validate_backup` on its own output and refuse to
+     declare success if validation fails)
+   - DONE in commits be40994 / 0382238 / 3362080 / this one:
+     - `create_runtime_backup` + `list_runtime_backups` +
+       `validate_backup` + `restore_runtime_backup` with CLI
+     - `POST /admin/backup` with `include_chroma=true` under
+       the ingestion lock
+     - `/health` build_sha / build_time / build_branch provenance
+     - `deploy.sh` self-update re-exec guard + build_sha drift
+       verification
+     - live drill procedure in `docs/backup-restore-procedure.md`
+       with failure-mode table and the memory_type=episodic
+       marker pattern from the 2026-04-09 drill
+10. Keep deeper automatic runtime integration modest until the organic read-only
+    model has proven value

 ## Trusted State Status