Commit Graph

3 Commits

Author SHA1 Message Date
336208004c ops: add restore_runtime_backup + drill runbook
Close the backup side of the loop: we had create/list/validate but
no restore, and no documented drill. A backup you've never restored
is not a backup. This lands the missing restore surface and the
procedure to exercise it before enabling any write-path automation
(auto-capture, automated ingestion, reinforcement sweeps).

Code — src/atocore/ops/backup.py:

- restore_runtime_backup(stamp, *, include_chroma, pre_restore_snapshot,
  confirm_service_stopped) performs:
  1. validate_backup() gate — refuse on any error
  2. pre-restore safety snapshot of current state (reversibility anchor)
  3. PRAGMA wal_checkpoint(TRUNCATE) on target db (flush + release
     OS handles; Windows needs this after conn.backup() reads)
  4. unlink stale -wal/-shm sidecars (tolerant to Windows lock races)
  5. shutil.copy2 snapshot db over target
  6. restore registry if snapshot captured one
  7. restore Chroma tree if snapshot captured one and include_chroma
     resolves to true (defaults to whether backup has Chroma)
  8. PRAGMA integrity_check on restored db, report result
- Refuses without confirm_service_stopped=True to prevent hot-restore
  into a running service (would corrupt SQLite state)
- Rewrote main() as argparse with 4 subcommands: create, list,
  validate, restore. `python -m atocore.ops.backup restore STAMP
  --confirm-service-stopped` is the drill CLI entry point, run via
  `docker compose run --rm --entrypoint python atocore` so it reuses
  the live service's volume mounts

Tests — tests/test_backup.py (6 new):

- test_restore_refuses_without_confirm_service_stopped
- test_restore_raises_on_invalid_backup
- test_restore_round_trip_reverses_post_backup_mutations
  (canonical drill flow: seed -> backup -> mutate -> restore ->
   mutation gone + baseline survived + pre-restore snapshot has
   the mutation captured as rollback anchor)
- test_restore_round_trip_with_chroma
- test_restore_skips_pre_snapshot_when_requested
- test_restore_cleans_stale_wal_sidecars (asserts stale byte
  markers do not survive, not file existence, since PRAGMA
  integrity_check may legitimately recreate -wal)

Docs — docs/backup-restore-drill.md (new):

- What gets backed up (hot sqlite, cold chroma, registry JSON,
  metadata.json) and what doesn't (.env, source content)
- What restore does, step by step, and why confirm_service_stopped
  is a hard gate
- 8-step drill procedure: capture -> baseline -> mutate -> stop ->
  restore -> start -> verify marker gone -> optional cleanup
- Correct endpoint bodies verified against routes.py:
    POST /admin/backup with JSON body {"include_chroma": true}
    POST /memory with memory_type/content/project/confidence
    GET /memory?project=drill to list drill markers
    POST /query with {"prompt": ..., "top_k": ...} (not "query")
- Failure modes: integrity_check fail, container won't start,
  marker still present after restore, with remediation for each
- When to run: before new write-path automation, after backup.py
  or schema changes, after infra bumps, monthly as standing check

225/225 tests passing (219 existing + 6 new restore).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 21:17:48 -04:00
c9b9eede25 feat: tunable ranking, refresh status, chroma backup + admin endpoints
Three small improvements that move the operational baseline forward
without changing the existing trust model.

1. Tunable retrieval ranking weights
   - rank_project_match_boost, rank_query_token_step,
     rank_query_token_cap, rank_path_high_signal_boost,
     rank_path_low_signal_penalty are now Settings fields
   - all overridable via ATOCORE_* env vars
   - retriever no longer hard-codes 2.0 / 1.18 / 0.72 / 0.08 / 1.32
   - lets ranking be tuned per environment as Wave 1 is exercised
     without code changes

2. /projects/{name}/refresh status
   - refresh_registered_project now returns an overall status field
     ("ingested", "partial", "nothing_to_ingest") plus roots_ingested
     and roots_skipped counters
   - ProjectRefreshResponse advertises the new fields so callers can
     rely on them
   - covers the case where every configured root is missing on disk

3. Chroma cold snapshot + admin backup endpoints
   - create_runtime_backup now accepts include_chroma and writes a
     cold directory copy of the chroma persistence path
   - new list_runtime_backups() and validate_backup() helpers
   - new endpoints:
     - POST /admin/backup            create snapshot (optional chroma)
     - GET  /admin/backup            list snapshots
     - GET  /admin/backup/{stamp}/validate  structural validation
   - chroma snapshots are taken under exclusive_ingestion() so a refresh
     or ingest cannot race with the cold copy
   - backup metadata records what was actually included and how big

Tests:
- 8 new tests covering tunable weights, refresh status branches
  (ingested / partial / nothing_to_ingest), chroma snapshot, list,
  validate, and the API endpoints (including the lock-acquisition path)
- existing fake refresh stubs in test_api_storage.py updated for the
  expanded ProjectRefreshResponse model
- full suite: 105 passing (was 97)

next-steps doc updated to reflect that the chroma snapshot + restore
validation gap from current-state.md is now closed in code; only the
operational retention policy remains.
2026-04-06 18:42:19 -04:00
c9757e313a Harden runtime and add backup foundation 2026-04-06 10:15:00 -04:00