fix: chroma restore bind-mount bug + consolidate docs

Two fixes from the 2026-04-09 first real restore drill on Dalidou,
plus the long-overdue doc consolidation I should have done when I
added the drill runbook instead of creating a duplicate.

## Chroma restore bind-mount bug (drill finding)

src/atocore/ops/backup.py: restore_runtime_backup() used to call
shutil.rmtree(dst_chroma) before copying the snapshot back. In the
Dockerized Dalidou deployment the chroma dir is a bind-mounted
volume — you can't unlink a mount point, rmtree raises
  OSError [Errno 16] Device or resource busy
and the restore silently fails to touch Chroma. This bit the first
real drill; the operator worked around it with --no-chroma plus a
manual cp -a.

Fix: clear the destination's CONTENTS (iterdir + rmtree/unlink per
child) and use copytree(dirs_exist_ok=True) so the mount point
itself is never touched. Equivalent semantics, bind-mount-safe.

Regression test:
tests/test_backup.py::test_restore_chroma_does_not_unlink_destination_directory
captures Path.stat().st_ino of the dest dir before and after
restore and asserts they match. That's the same invariant a
bind-mounted chroma dir enforces — if the inode changed, the
mount would have failed. 11/11 backup tests now pass.

## Doc consolidation

docs/backup-restore-drill.md existed as a duplicate of the
authoritative docs/backup-restore-procedure.md. When I added the
drill runbook in commit 3362080 I wrote it from scratch instead of
updating the existing procedure — bad doc hygiene on a project
that's literally about being a context engine.

- Deleted docs/backup-restore-drill.md
- Folded its contents into docs/backup-restore-procedure.md:
  - Replaced the manual sudo cp restore sequence with the new
    `python -m atocore.ops.backup restore <STAMP>
    --confirm-service-stopped` CLI
  - Added the one-shot docker compose run pattern for running
    restore inside a container that reuses the live volume mounts
  - Documented the --no-pre-snapshot / --no-chroma / --chroma flags
  - New "Chroma restore and bind-mounted volumes" subsection
    explaining the bug and the regression test that protects the fix
  - New "Restore drill" subsection with three levels (unit tests,
    module round-trip, live Dalidou drill) and the cadence list
  - Failure-mode table gained four entries: restored_integrity_ok,
    Device-or-resource-busy, drill marker still present,
    chroma_snapshot_missing
  - "Open follow-ups" struck the restore_runtime_backup item (done)
    and added a "Done (historical)" note referencing 2026-04-09
  - Quickstart cheat sheet now has a full drill one-liner using
    memory_type=episodic (the 2026-04-09 drill found the runbook's
    memory_type=note was invalid — the valid set is identity,
    preference, project, episodic, knowledge, adaptation)

## Status doc sync

Long overdue — I've been landing code without updating the
project's narrative state docs.

docs/current-state.md:
- "Reliability Baseline" now reflects: restore_runtime_backup is
  real with CLI, pre-restore safety snapshot, WAL cleanup,
  integrity check; live drill on 2026-04-09 surfaced and fixed
  Chroma bind-mount bug; deploy provenance via /health build_sha;
  deploy.sh self-update re-exec guard
- "Immediate Next Focus" reshuffled: drill re-run (priority 1) and
  auto-capture (priority 2) are now ahead of retrieval quality work,
  reflecting the updated unblock sequence

docs/next-steps.md:
- New item 1: re-run the drill with chroma working end-to-end
- New item 2: auto-capture conservative mode (Stop hook)
- Old item 7 rewritten as item 9 listing what's DONE
  (create/list/validate/restore, admin/backup endpoint with
  include_chroma, /health provenance, self-update guard,
  procedure doc with failure modes) and what's still pending
  (retention cleanup, off-Dalidou target, auto-validation)

## Test count

226 passing (was 225 + 1 new inode-stability regression test).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-04-09 09:13:21 -04:00
parent 336208004c
commit 1a8fdf4225
6 changed files with 331 additions and 431 deletions

View File

@@ -20,45 +20,75 @@ This working list should be read alongside:
## Immediate Next Steps
1. Use the T420 `atocore-context` skill and the new organic routing layer in
1. Re-run the backup/restore drill on Dalidou with the Chroma
bind-mount fix in place
- the 2026-04-09 drill was a PARTIAL PASS: db restore + marker
reversal worked cleanly, but the Chroma step failed with
`OSError [Errno 16] Device or resource busy` because
`shutil.rmtree` cannot unlink a Docker bind-mounted volume
- fix landed immediately after: `restore_runtime_backup()` now
clears the destination's CONTENTS and uses
`copytree(dirs_exist_ok=True)`, and the regression test
`test_restore_chroma_does_not_unlink_destination_directory`
asserts the destination inode is stable
- need a green end-to-end run with `--chroma` actually
working in-container before enabling write-path automation
2. Turn on auto-capture of Claude Code sessions once the drill
re-run is clean
- conservative mode: Stop hook posts to `/interactions`,
no auto-extraction into review queue without review cadence
in place
3. Use the T420 `atocore-context` skill and the new organic routing layer in
real OpenClaw workflows
- confirm `auto-context` feels natural
- confirm project inference is good enough in practice
- confirm the fail-open behavior remains acceptable in practice
2. Review retrieval quality after the first real project ingestion batch
4. Review retrieval quality after the first real project ingestion batch
- check whether the top hits are useful
- check whether trusted project state remains dominant
- reduce cross-project competition and prompt ambiguity where needed
- use `debug-context` to inspect the exact last AtoCore supplement
3. Treat the active-project full markdown/text wave as complete
5. Treat the active-project full markdown/text wave as complete
- `p04-gigabit`
- `p05-interferometer`
- `p06-polisher`
4. Define a cleaner source refresh model
6. Define a cleaner source refresh model
- make the difference between source truth, staged inputs, and machine store
explicit
- move toward a project source registry and refresh workflow
- foundation now exists via project registry + per-project refresh API
- registration policy + template + proposal + approved registration are now
the normal path for new projects
5. Move to Wave 2 trusted-operational ingestion
7. Move to Wave 2 trusted-operational ingestion
- curated dashboards
- decision logs
- milestone/current-status views
- operational truth, not just raw project notes
6. Integrate the new engineering architecture docs into active planning, not immediate schema code
8. Integrate the new engineering architecture docs into active planning, not immediate schema code
- keep `docs/architecture/engineering-knowledge-hybrid-architecture.md` as the target layer model
- keep `docs/architecture/engineering-ontology-v1.md` as the V1 structured-domain target
- do not start entity/relationship persistence until the ingestion, retrieval, registry, and backup baseline feels boring and stable
7. Define backup and export procedures for Dalidou
- exercise the new SQLite + registry snapshot path on Dalidou
- Chroma backup or rebuild policy
- retention and restore validation
- admin backup endpoint now supports `include_chroma` cold snapshot
under the ingestion lock and `validate` confirms each snapshot is
openable; remaining work is the operational retention policy
8. Keep deeper automatic runtime integration modest until the organic read-only
model has proven value
9. Finish the boring operations baseline around backup
- retention policy cleanup script (snapshots dir grows
monotonically today)
- off-Dalidou backup target (at minimum an rsync to laptop or
another host so a single-disk failure isn't terminal)
- automatic post-backup validation (have `create_runtime_backup`
call `validate_backup` on its own output and refuse to
declare success if validation fails)
- DONE in commits be40994 / 0382238 / 3362080 / this one:
- `create_runtime_backup` + `list_runtime_backups` +
`validate_backup` + `restore_runtime_backup` with CLI
- `POST /admin/backup` with `include_chroma=true` under
the ingestion lock
- `/health` build_sha / build_time / build_branch provenance
- `deploy.sh` self-update re-exec guard + build_sha drift
verification
- live drill procedure in `docs/backup-restore-procedure.md`
with failure-mode table and the memory_type=episodic
marker pattern from the 2026-04-09 drill
10. Keep deeper automatic runtime integration modest until the organic read-only
model has proven value
## Trusted State Status