ATOCore

Author	SHA1	Message	Date
Anto01	1a8fdf4225	fix: chroma restore bind-mount bug + consolidate docs Two fixes from the 2026-04-09 first real restore drill on Dalidou, plus the long-overdue doc consolidation I should have done when I added the drill runbook instead of creating a duplicate. ## Chroma restore bind-mount bug (drill finding) src/atocore/ops/backup.py: restore_runtime_backup() used to call shutil.rmtree(dst_chroma) before copying the snapshot back. In the Dockerized Dalidou deployment the chroma dir is a bind-mounted volume — you can't unlink a mount point, rmtree raises OSError [Errno 16] Device or resource busy and the restore silently fails to touch Chroma. This bit the first real drill; the operator worked around it with --no-chroma plus a manual cp -a. Fix: clear the destination's CONTENTS (iterdir + rmtree/unlink per child) and use copytree(dirs_exist_ok=True) so the mount point itself is never touched. Equivalent semantics, bind-mount-safe. Regression test: tests/test_backup.py::test_restore_chroma_does_not_unlink_destination_directory captures Path.stat().st_ino of the dest dir before and after restore and asserts they match. That's the same invariant a bind-mounted chroma dir enforces — if the inode changed, the mount would have failed. 11/11 backup tests now pass. ## Doc consolidation docs/backup-restore-drill.md existed as a duplicate of the authoritative docs/backup-restore-procedure.md. When I added the drill runbook in commit `3362080` I wrote it from scratch instead of updating the existing procedure — bad doc hygiene on a project that's literally about being a context engine. - Deleted docs/backup-restore-drill.md - Folded its contents into docs/backup-restore-procedure.md: - Replaced the manual sudo cp restore sequence with the new `python -m atocore.ops.backup restore <STAMP> --confirm-service-stopped` CLI - Added the one-shot docker compose run pattern for running restore inside a container that reuses the live volume mounts - Documented the --no-pre-snapshot / --no-chroma / --chroma flags - New "Chroma restore and bind-mounted volumes" subsection explaining the bug and the regression test that protects the fix - New "Restore drill" subsection with three levels (unit tests, module round-trip, live Dalidou drill) and the cadence list - Failure-mode table gained four entries: restored_integrity_ok, Device-or-resource-busy, drill marker still present, chroma_snapshot_missing - "Open follow-ups" struck the restore_runtime_backup item (done) and added a "Done (historical)" note referencing 2026-04-09 - Quickstart cheat sheet now has a full drill one-liner using memory_type=episodic (the 2026-04-09 drill found the runbook's memory_type=note was invalid — the valid set is identity, preference, project, episodic, knowledge, adaptation) ## Status doc sync Long overdue — I've been landing code without updating the project's narrative state docs. docs/current-state.md: - "Reliability Baseline" now reflects: restore_runtime_backup is real with CLI, pre-restore safety snapshot, WAL cleanup, integrity check; live drill on 2026-04-09 surfaced and fixed Chroma bind-mount bug; deploy provenance via /health build_sha; deploy.sh self-update re-exec guard - "Immediate Next Focus" reshuffled: drill re-run (priority 1) and auto-capture (priority 2) are now ahead of retrieval quality work, reflecting the updated unblock sequence docs/next-steps.md: - New item 1: re-run the drill with chroma working end-to-end - New item 2: auto-capture conservative mode (Stop hook) - Old item 7 rewritten as item 9 listing what's DONE (create/list/validate/restore, admin/backup endpoint with include_chroma, /health provenance, self-update guard, procedure doc with failure modes) and what's still pending (retention cleanup, off-Dalidou target, auto-validation) ## Test count 226 passing (was 225 + 1 new inode-stability regression test). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 09:13:21 -04:00
Anto01	336208004c	ops: add restore_runtime_backup + drill runbook Close the backup side of the loop: we had create/list/validate but no restore, and no documented drill. A backup you've never restored is not a backup. This lands the missing restore surface and the procedure to exercise it before enabling any write-path automation (auto-capture, automated ingestion, reinforcement sweeps). Code — src/atocore/ops/backup.py: - restore_runtime_backup(stamp, *, include_chroma, pre_restore_snapshot, confirm_service_stopped) performs: 1. validate_backup() gate — refuse on any error 2. pre-restore safety snapshot of current state (reversibility anchor) 3. PRAGMA wal_checkpoint(TRUNCATE) on target db (flush + release OS handles; Windows needs this after conn.backup() reads) 4. unlink stale -wal/-shm sidecars (tolerant to Windows lock races) 5. shutil.copy2 snapshot db over target 6. restore registry if snapshot captured one 7. restore Chroma tree if snapshot captured one and include_chroma resolves to true (defaults to whether backup has Chroma) 8. PRAGMA integrity_check on restored db, report result - Refuses without confirm_service_stopped=True to prevent hot-restore into a running service (would corrupt SQLite state) - Rewrote main() as argparse with 4 subcommands: create, list, validate, restore. `python -m atocore.ops.backup restore STAMP --confirm-service-stopped` is the drill CLI entry point, run via `docker compose run --rm --entrypoint python atocore` so it reuses the live service's volume mounts Tests — tests/test_backup.py (6 new): - test_restore_refuses_without_confirm_service_stopped - test_restore_raises_on_invalid_backup - test_restore_round_trip_reverses_post_backup_mutations (canonical drill flow: seed -> backup -> mutate -> restore -> mutation gone + baseline survived + pre-restore snapshot has the mutation captured as rollback anchor) - test_restore_round_trip_with_chroma - test_restore_skips_pre_snapshot_when_requested - test_restore_cleans_stale_wal_sidecars (asserts stale byte markers do not survive, not file existence, since PRAGMA integrity_check may legitimately recreate -wal) Docs — docs/backup-restore-drill.md (new): - What gets backed up (hot sqlite, cold chroma, registry JSON, metadata.json) and what doesn't (.env, source content) - What restore does, step by step, and why confirm_service_stopped is a hard gate - 8-step drill procedure: capture -> baseline -> mutate -> stop -> restore -> start -> verify marker gone -> optional cleanup - Correct endpoint bodies verified against routes.py: POST /admin/backup with JSON body {"include_chroma": true} POST /memory with memory_type/content/project/confidence GET /memory?project=drill to list drill markers POST /query with {"prompt": ..., "top_k": ...} (not "query") - Failure modes: integrity_check fail, container won't start, marker still present after restore, with remediation for each - When to run: before new write-path automation, after backup.py or schema changes, after infra bumps, monthly as standing check 225/225 tests passing (219 existing + 6 new restore). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 21:17:48 -04:00
Anto01	c9b9eede25	feat: tunable ranking, refresh status, chroma backup + admin endpoints Three small improvements that move the operational baseline forward without changing the existing trust model. 1. Tunable retrieval ranking weights - rank_project_match_boost, rank_query_token_step, rank_query_token_cap, rank_path_high_signal_boost, rank_path_low_signal_penalty are now Settings fields - all overridable via ATOCORE_* env vars - retriever no longer hard-codes 2.0 / 1.18 / 0.72 / 0.08 / 1.32 - lets ranking be tuned per environment as Wave 1 is exercised without code changes 2. /projects/{name}/refresh status - refresh_registered_project now returns an overall status field ("ingested", "partial", "nothing_to_ingest") plus roots_ingested and roots_skipped counters - ProjectRefreshResponse advertises the new fields so callers can rely on them - covers the case where every configured root is missing on disk 3. Chroma cold snapshot + admin backup endpoints - create_runtime_backup now accepts include_chroma and writes a cold directory copy of the chroma persistence path - new list_runtime_backups() and validate_backup() helpers - new endpoints: - POST /admin/backup create snapshot (optional chroma) - GET /admin/backup list snapshots - GET /admin/backup/{stamp}/validate structural validation - chroma snapshots are taken under exclusive_ingestion() so a refresh or ingest cannot race with the cold copy - backup metadata records what was actually included and how big Tests: - 8 new tests covering tunable weights, refresh status branches (ingested / partial / nothing_to_ingest), chroma snapshot, list, validate, and the API endpoints (including the lock-acquisition path) - existing fake refresh stubs in test_api_storage.py updated for the expanded ProjectRefreshResponse model - full suite: 105 passing (was 97) next-steps doc updated to reflect that the chroma snapshot + restore validation gap from current-state.md is now closed in code; only the operational retention policy remains.	2026-04-06 18:42:19 -04:00
Anto01	c9757e313a	Harden runtime and add backup foundation	2026-04-06 10:15:00 -04:00

4 Commits