fix: chroma restore bind-mount bug + consolidate docs

Two fixes from the 2026-04-09 first real restore drill on Dalidou,
plus the long-overdue doc consolidation I should have done when I
added the drill runbook instead of creating a duplicate.

## Chroma restore bind-mount bug (drill finding)

src/atocore/ops/backup.py: restore_runtime_backup() used to call
shutil.rmtree(dst_chroma) before copying the snapshot back. In the
Dockerized Dalidou deployment the chroma dir is a bind-mounted
volume — you can't unlink a mount point, rmtree raises
  OSError [Errno 16] Device or resource busy
and the restore silently fails to touch Chroma. This bit the first
real drill; the operator worked around it with --no-chroma plus a
manual cp -a.

Fix: clear the destination's CONTENTS (iterdir + rmtree/unlink per
child) and use copytree(dirs_exist_ok=True) so the mount point
itself is never touched. Equivalent semantics, bind-mount-safe.

Regression test:
tests/test_backup.py::test_restore_chroma_does_not_unlink_destination_directory
captures Path.stat().st_ino of the dest dir before and after
restore and asserts they match. That's the same invariant a
bind-mounted chroma dir enforces — if the inode changed, the
mount would have failed. 11/11 backup tests now pass.

## Doc consolidation

docs/backup-restore-drill.md existed as a duplicate of the
authoritative docs/backup-restore-procedure.md. When I added the
drill runbook in commit 3362080 I wrote it from scratch instead of
updating the existing procedure — bad doc hygiene on a project
that's literally about being a context engine.

- Deleted docs/backup-restore-drill.md
- Folded its contents into docs/backup-restore-procedure.md:
  - Replaced the manual sudo cp restore sequence with the new
    `python -m atocore.ops.backup restore <STAMP>
    --confirm-service-stopped` CLI
  - Added the one-shot docker compose run pattern for running
    restore inside a container that reuses the live volume mounts
  - Documented the --no-pre-snapshot / --no-chroma / --chroma flags
  - New "Chroma restore and bind-mounted volumes" subsection
    explaining the bug and the regression test that protects the fix
  - New "Restore drill" subsection with three levels (unit tests,
    module round-trip, live Dalidou drill) and the cadence list
  - Failure-mode table gained four entries: restored_integrity_ok,
    Device-or-resource-busy, drill marker still present,
    chroma_snapshot_missing
  - "Open follow-ups" struck the restore_runtime_backup item (done)
    and added a "Done (historical)" note referencing 2026-04-09
  - Quickstart cheat sheet now has a full drill one-liner using
    memory_type=episodic (the 2026-04-09 drill found the runbook's
    memory_type=note was invalid — the valid set is identity,
    preference, project, episodic, knowledge, adaptation)

## Status doc sync

Long overdue — I've been landing code without updating the
project's narrative state docs.

docs/current-state.md:
- "Reliability Baseline" now reflects: restore_runtime_backup is
  real with CLI, pre-restore safety snapshot, WAL cleanup,
  integrity check; live drill on 2026-04-09 surfaced and fixed
  Chroma bind-mount bug; deploy provenance via /health build_sha;
  deploy.sh self-update re-exec guard
- "Immediate Next Focus" reshuffled: drill re-run (priority 1) and
  auto-capture (priority 2) are now ahead of retrieval quality work,
  reflecting the updated unblock sequence

docs/next-steps.md:
- New item 1: re-run the drill with chroma working end-to-end
- New item 2: auto-capture conservative mode (Stop hook)
- Old item 7 rewritten as item 9 listing what's DONE
  (create/list/validate/restore, admin/backup endpoint with
  include_chroma, /health provenance, self-update guard,
  procedure doc with failure modes) and what's still pending
  (retention cleanup, off-Dalidou target, auto-validation)

## Test count

226 passing (was 225 + 1 new inode-stability regression test).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-04-09 09:13:21 -04:00
parent 336208004c
commit 1a8fdf4225
6 changed files with 331 additions and 431 deletions

View File

@@ -326,6 +326,65 @@ def test_restore_round_trip_with_chroma(tmp_path, monkeypatch):
config.settings = original_settings
def test_restore_chroma_does_not_unlink_destination_directory(tmp_path, monkeypatch):
"""Regression: restore must not rmtree the chroma dir itself.
In a Dockerized deployment the chroma dir is a bind-mounted
volume. Calling shutil.rmtree on a mount point raises
``OSError [Errno 16] Device or resource busy``, which broke the
first real Dalidou drill on 2026-04-09. The fix clears the
directory's CONTENTS and copytree(dirs_exist_ok=True) into it,
keeping the directory inode (and any bind mount) intact.
This test captures the inode of the destination directory before
and after restore and asserts they match — that's what a
bind-mounted chroma dir would also see.
"""
monkeypatch.setenv("ATOCORE_DATA_DIR", str(tmp_path / "data"))
monkeypatch.setenv("ATOCORE_BACKUP_DIR", str(tmp_path / "backups"))
monkeypatch.setenv(
"ATOCORE_PROJECT_REGISTRY_PATH", str(tmp_path / "config" / "project-registry.json")
)
original_settings = config.settings
try:
config.settings = config.Settings()
init_db()
chroma_dir = config.settings.chroma_path
(chroma_dir / "coll-a").mkdir(parents=True, exist_ok=True)
(chroma_dir / "coll-a" / "baseline.bin").write_bytes(b"baseline")
create_runtime_backup(
datetime(2026, 4, 9, 15, 0, 0, tzinfo=UTC), include_chroma=True
)
# Capture the destination directory's stat signature before restore.
chroma_stat_before = chroma_dir.stat()
# Add a file post-backup so restore has work to do.
(chroma_dir / "coll-a" / "post_backup.bin").write_bytes(b"post")
restore_runtime_backup(
"20260409T150000Z", confirm_service_stopped=True
)
# Directory still exists (would have failed on mount point) and
# its st_ino matches — the mount itself wasn't unlinked.
assert chroma_dir.exists()
chroma_stat_after = chroma_dir.stat()
assert chroma_stat_before.st_ino == chroma_stat_after.st_ino, (
"chroma directory inode changed — restore recreated the "
"directory instead of clearing its contents; this would "
"fail on a Docker bind-mounted volume"
)
# And the contents did actually get restored.
assert (chroma_dir / "coll-a" / "baseline.bin").exists()
assert not (chroma_dir / "coll-a" / "post_backup.bin").exists()
finally:
config.settings = original_settings
def test_restore_skips_pre_snapshot_when_requested(tmp_path, monkeypatch):
monkeypatch.setenv("ATOCORE_DATA_DIR", str(tmp_path / "data"))
monkeypatch.setenv("ATOCORE_BACKUP_DIR", str(tmp_path / "backups"))