ops: add restore_runtime_backup + drill runbook
Close the backup side of the loop: we had create/list/validate but
no restore, and no documented drill. A backup you've never restored
is not a backup. This lands the missing restore surface and the
procedure to exercise it before enabling any write-path automation
(auto-capture, automated ingestion, reinforcement sweeps).
Code — src/atocore/ops/backup.py:
- restore_runtime_backup(stamp, *, include_chroma, pre_restore_snapshot,
confirm_service_stopped) performs:
1. validate_backup() gate — refuse on any error
2. pre-restore safety snapshot of current state (reversibility anchor)
3. PRAGMA wal_checkpoint(TRUNCATE) on target db (flush + release
OS handles; Windows needs this after conn.backup() reads)
4. unlink stale -wal/-shm sidecars (tolerant to Windows lock races)
5. shutil.copy2 snapshot db over target
6. restore registry if snapshot captured one
7. restore Chroma tree if snapshot captured one and include_chroma
resolves to true (defaults to whether backup has Chroma)
8. PRAGMA integrity_check on restored db, report result
- Refuses without confirm_service_stopped=True to prevent hot-restore
into a running service (would corrupt SQLite state)
- Rewrote main() as an argparse CLI with 4 subcommands: create, list,
  validate, restore. `python -m atocore.ops.backup restore STAMP
  --confirm-service-stopped` is the drill CLI entry point, run via
  `docker compose run --rm --entrypoint python atocore` so it reuses
  the live service's volume mounts
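The heart of the restore sequence (steps 3-5 and 8) can be sketched as a
standalone function. This is a minimal sketch over plain file paths;
`restore_db` and its signature are illustrative, not the actual
`restore_runtime_backup` API:

```python
import shutil
import sqlite3
from pathlib import Path


def restore_db(snapshot: Path, target: Path) -> bool:
    """Illustrative restore core: checkpoint, clear sidecars, copy, verify."""
    # Step 3: flush the target's WAL and let the OS release file handles.
    conn = sqlite3.connect(str(target))
    try:
        conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
    finally:
        conn.close()
    # Step 4: unlink stale -wal/-shm sidecars, tolerating lock races.
    for suffix in ("-wal", "-shm"):
        try:
            target.with_name(target.name + suffix).unlink(missing_ok=True)
        except OSError:
            pass  # tolerated, mirroring the Windows lock-race note above
    # Step 5: copy the snapshot over the live db, preserving metadata.
    shutil.copy2(snapshot, target)
    # Step 8: verify the restored db before reporting success.
    conn = sqlite3.connect(str(target))
    try:
        (status,) = conn.execute("PRAGMA integrity_check").fetchone()
    finally:
        conn.close()
    return status == "ok"
```

The real function additionally gates on validate_backup(), takes a
pre-restore snapshot, and handles the registry and Chroma tree.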
Tests — tests/test_backup.py (6 new):
- test_restore_refuses_without_confirm_service_stopped
- test_restore_raises_on_invalid_backup
- test_restore_round_trip_reverses_post_backup_mutations
(canonical drill flow: seed -> backup -> mutate -> restore ->
mutation gone + baseline survived + pre-restore snapshot has
the mutation captured as rollback anchor)
- test_restore_round_trip_with_chroma
- test_restore_skips_pre_snapshot_when_requested
- test_restore_cleans_stale_wal_sidecars (asserts stale byte
markers do not survive, not file existence, since PRAGMA
integrity_check may legitimately recreate -wal)
Docs — docs/backup-restore-drill.md (new):
- What gets backed up (hot sqlite, cold chroma, registry JSON,
metadata.json) and what doesn't (.env, source content)
- What restore does, step by step, and why confirm_service_stopped
is a hard gate
- 8-step drill procedure: capture -> baseline -> mutate -> stop ->
restore -> start -> verify marker gone -> optional cleanup
- Correct endpoint bodies verified against routes.py:
POST /admin/backup with JSON body {"include_chroma": true}
POST /memory with memory_type/content/project/confidence
GET /memory?project=drill to list drill markers
POST /query with {"prompt": ..., "top_k": ...} (not "query")
- Failure modes: integrity_check fail, container won't start,
marker still present after restore, with remediation for each
- When to run: before new write-path automation, after backup.py
or schema changes, after infra bumps, monthly as standing check
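The drill's endpoint calls can be sketched in Python. Only the paths and
field names come from the verified bodies above; the base URL, the
memory_type and confidence values, and the post_json helper are
assumptions for illustration:

```python
import json
import urllib.request

BASE = "http://localhost:8000"  # assumed bind address; adjust to your compose setup


def post_json(path: str, body: dict) -> dict:
    """POST a JSON body and decode the JSON response."""
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Bodies mirror the shapes verified against routes.py:
backup_body = {"include_chroma": True}
marker_body = {
    "memory_type": "note",  # value illustrative; field names verified
    "content": "drill marker",
    "project": "drill",
    "confidence": 0.9,  # value illustrative
}
query_body = {"prompt": "drill marker", "top_k": 5}  # "prompt", not "query"

# With the service running, the drill steps become (commented here):
# post_json("/admin/backup", backup_body)
# post_json("/memory", marker_body)
# urllib.request.urlopen(BASE + "/memory?project=drill").read()
# post_json("/query", query_body)
```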
225/225 tests passing (219 existing + 6 new restore).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -1,14 +1,17 @@
-"""Tests for runtime backup creation."""
+"""Tests for runtime backup creation and restore."""
 
+import json
+import sqlite3
 from datetime import UTC, datetime
 
 import pytest
 
 import atocore.config as config
 from atocore.models.database import init_db
 from atocore.ops.backup import (
     create_runtime_backup,
     list_runtime_backups,
+    restore_runtime_backup,
     validate_backup,
 )
 
@@ -156,3 +159,242 @@ def test_create_runtime_backup_handles_missing_registry(tmp_path, monkeypatch):
         config.settings = original_settings
 
     assert result["registry_snapshot_path"] == ""
+
+
+def test_restore_refuses_without_confirm_service_stopped(tmp_path, monkeypatch):
+    monkeypatch.setenv("ATOCORE_DATA_DIR", str(tmp_path / "data"))
+    monkeypatch.setenv("ATOCORE_BACKUP_DIR", str(tmp_path / "backups"))
+    monkeypatch.setenv(
+        "ATOCORE_PROJECT_REGISTRY_PATH", str(tmp_path / "config" / "project-registry.json")
+    )
+
+    original_settings = config.settings
+    try:
+        config.settings = config.Settings()
+        init_db()
+        create_runtime_backup(datetime(2026, 4, 9, 10, 0, 0, tzinfo=UTC))
+
+        with pytest.raises(RuntimeError, match="confirm_service_stopped"):
+            restore_runtime_backup("20260409T100000Z")
+    finally:
+        config.settings = original_settings
+
+
+def test_restore_raises_on_invalid_backup(tmp_path, monkeypatch):
+    monkeypatch.setenv("ATOCORE_DATA_DIR", str(tmp_path / "data"))
+    monkeypatch.setenv("ATOCORE_BACKUP_DIR", str(tmp_path / "backups"))
+    monkeypatch.setenv(
+        "ATOCORE_PROJECT_REGISTRY_PATH", str(tmp_path / "config" / "project-registry.json")
+    )
+
+    original_settings = config.settings
+    try:
+        config.settings = config.Settings()
+        init_db()
+        with pytest.raises(RuntimeError, match="failed validation"):
+            restore_runtime_backup("20250101T000000Z", confirm_service_stopped=True)
+    finally:
+        config.settings = original_settings
+
+
+def test_restore_round_trip_reverses_post_backup_mutations(tmp_path, monkeypatch):
+    """Canonical drill: snapshot -> mutate -> restore -> mutation gone."""
+    monkeypatch.setenv("ATOCORE_DATA_DIR", str(tmp_path / "data"))
+    monkeypatch.setenv("ATOCORE_BACKUP_DIR", str(tmp_path / "backups"))
+    monkeypatch.setenv(
+        "ATOCORE_PROJECT_REGISTRY_PATH", str(tmp_path / "config" / "project-registry.json")
+    )
+
+    registry_path = tmp_path / "config" / "project-registry.json"
+    registry_path.parent.mkdir(parents=True)
+    registry_path.write_text(
+        '{"projects":[{"id":"p01-example","aliases":[],'
+        '"ingest_roots":[{"source":"vault","subpath":"incoming/projects/p01-example"}]}]}\n',
+        encoding="utf-8",
+    )
+
+    original_settings = config.settings
+    try:
+        config.settings = config.Settings()
+        init_db()
+
+        # 1. Seed baseline state that should SURVIVE the restore.
+        with sqlite3.connect(str(config.settings.db_path)) as conn:
+            conn.execute(
+                "INSERT INTO projects (id, name) VALUES (?, ?)",
+                ("p01", "Baseline Project"),
+            )
+            conn.commit()
+
+        # 2. Create the backup we're going to restore to.
+        create_runtime_backup(datetime(2026, 4, 9, 11, 0, 0, tzinfo=UTC))
+        stamp = "20260409T110000Z"
+
+        # 3. Mutate live state AFTER the backup — this is what the
+        # restore should reverse.
+        with sqlite3.connect(str(config.settings.db_path)) as conn:
+            conn.execute(
+                "INSERT INTO projects (id, name) VALUES (?, ?)",
+                ("p99", "Post Backup Mutation"),
+            )
+            conn.commit()
+
+        # Confirm the mutation is present before restore.
+        with sqlite3.connect(str(config.settings.db_path)) as conn:
+            row = conn.execute(
+                "SELECT name FROM projects WHERE id = ?", ("p99",)
+            ).fetchone()
+        assert row is not None and row[0] == "Post Backup Mutation"
+
+        # 4. Restore — the drill procedure. Explicit confirm_service_stopped.
+        result = restore_runtime_backup(stamp, confirm_service_stopped=True)
+
+        # 5. Verify restore report
+        assert result["stamp"] == stamp
+        assert result["db_restored"] is True
+        assert result["registry_restored"] is True
+        assert result["restored_integrity_ok"] is True
+        assert result["pre_restore_snapshot"] is not None
+
+        # 6. Verify live state reflects the restore: baseline survived,
+        # post-backup mutation is gone.
+        with sqlite3.connect(str(config.settings.db_path)) as conn:
+            baseline = conn.execute(
+                "SELECT name FROM projects WHERE id = ?", ("p01",)
+            ).fetchone()
+            mutation = conn.execute(
+                "SELECT name FROM projects WHERE id = ?", ("p99",)
+            ).fetchone()
+        assert baseline is not None and baseline[0] == "Baseline Project"
+        assert mutation is None
+
+        # 7. Pre-restore safety snapshot DOES contain the mutation —
+        # it captured current state before overwriting. This is the
+        # reversibility guarantee: the operator can restore back to
+        # it if the restore itself was a mistake.
+        pre_stamp = result["pre_restore_snapshot"]
+        pre_validation = validate_backup(pre_stamp)
+        assert pre_validation["valid"] is True
+        pre_db_path = pre_validation["metadata"]["db_snapshot_path"]
+        with sqlite3.connect(pre_db_path) as conn:
+            pre_mutation = conn.execute(
+                "SELECT name FROM projects WHERE id = ?", ("p99",)
+            ).fetchone()
+        assert pre_mutation is not None and pre_mutation[0] == "Post Backup Mutation"
+    finally:
+        config.settings = original_settings
+
+
+def test_restore_round_trip_with_chroma(tmp_path, monkeypatch):
+    monkeypatch.setenv("ATOCORE_DATA_DIR", str(tmp_path / "data"))
+    monkeypatch.setenv("ATOCORE_BACKUP_DIR", str(tmp_path / "backups"))
+    monkeypatch.setenv(
+        "ATOCORE_PROJECT_REGISTRY_PATH", str(tmp_path / "config" / "project-registry.json")
+    )
+
+    original_settings = config.settings
+    try:
+        config.settings = config.Settings()
+        init_db()
+
+        # Seed baseline chroma state that should survive restore.
+        chroma_dir = config.settings.chroma_path
+        (chroma_dir / "coll-a").mkdir(parents=True, exist_ok=True)
+        (chroma_dir / "coll-a" / "baseline.bin").write_bytes(b"baseline")
+
+        create_runtime_backup(
+            datetime(2026, 4, 9, 12, 0, 0, tzinfo=UTC), include_chroma=True
+        )
+        stamp = "20260409T120000Z"
+
+        # Mutate chroma after backup: add a file + remove baseline.
+        (chroma_dir / "coll-a" / "post_backup.bin").write_bytes(b"post")
+        (chroma_dir / "coll-a" / "baseline.bin").unlink()
+
+        result = restore_runtime_backup(stamp, confirm_service_stopped=True)
+
+        assert result["chroma_restored"] is True
+        assert (chroma_dir / "coll-a" / "baseline.bin").exists()
+        assert not (chroma_dir / "coll-a" / "post_backup.bin").exists()
+    finally:
+        config.settings = original_settings
+
+
+def test_restore_skips_pre_snapshot_when_requested(tmp_path, monkeypatch):
+    monkeypatch.setenv("ATOCORE_DATA_DIR", str(tmp_path / "data"))
+    monkeypatch.setenv("ATOCORE_BACKUP_DIR", str(tmp_path / "backups"))
+    monkeypatch.setenv(
+        "ATOCORE_PROJECT_REGISTRY_PATH", str(tmp_path / "config" / "project-registry.json")
+    )
+
+    original_settings = config.settings
+    try:
+        config.settings = config.Settings()
+        init_db()
+        create_runtime_backup(datetime(2026, 4, 9, 13, 0, 0, tzinfo=UTC))
+
+        before_count = len(list_runtime_backups())
+
+        result = restore_runtime_backup(
+            "20260409T130000Z",
+            confirm_service_stopped=True,
+            pre_restore_snapshot=False,
+        )
+
+        after_count = len(list_runtime_backups())
+        assert result["pre_restore_snapshot"] is None
+        assert after_count == before_count
+    finally:
+        config.settings = original_settings
+
+
+def test_restore_cleans_stale_wal_sidecars(tmp_path, monkeypatch):
+    """Stale WAL/SHM sidecars must not carry bytes past the restore.
+
+    Note: after restore runs, PRAGMA integrity_check reopens the
+    restored db which may legitimately recreate a fresh -wal. So we
+    assert that the STALE byte marker no longer appears in either
+    sidecar, not that the files are absent.
+    """
+    monkeypatch.setenv("ATOCORE_DATA_DIR", str(tmp_path / "data"))
+    monkeypatch.setenv("ATOCORE_BACKUP_DIR", str(tmp_path / "backups"))
+    monkeypatch.setenv(
+        "ATOCORE_PROJECT_REGISTRY_PATH", str(tmp_path / "config" / "project-registry.json")
+    )
+
+    original_settings = config.settings
+    try:
+        config.settings = config.Settings()
+        init_db()
+        create_runtime_backup(datetime(2026, 4, 9, 14, 0, 0, tzinfo=UTC))
+
+        # Write fake stale WAL/SHM next to the live db with an
+        # unmistakable marker.
+        target_db = config.settings.db_path
+        wal = target_db.with_name(target_db.name + "-wal")
+        shm = target_db.with_name(target_db.name + "-shm")
+        stale_marker = b"STALE-SIDECAR-MARKER-DO-NOT-SURVIVE"
+        wal.write_bytes(stale_marker)
+        shm.write_bytes(stale_marker)
+        assert wal.exists() and shm.exists()
+
+        restore_runtime_backup("20260409T140000Z", confirm_service_stopped=True)
+
+        # The restored db must pass integrity check (tested elsewhere);
+        # here we just confirm that no file next to it still contains
+        # the stale marker from the old live process.
+        for sidecar in (wal, shm):
+            if sidecar.exists():
+                assert stale_marker not in sidecar.read_bytes(), (
+                    f"{sidecar.name} still carries stale marker"
+                )
+    finally:
+        config.settings = original_settings