ops: add restore_runtime_backup + drill runbook

Close the backup side of the loop: we had create/list/validate but
no restore, and no documented drill. A backup you've never restored
is not a backup. This lands the missing restore surface and the
procedure to exercise it before enabling any write-path automation
(auto-capture, automated ingestion, reinforcement sweeps).

Code — src/atocore/ops/backup.py:

- restore_runtime_backup(stamp, *, include_chroma, pre_restore_snapshot,
  confirm_service_stopped) performs:
  1. validate_backup() gate — refuse on any error
  2. pre-restore safety snapshot of current state (reversibility anchor)
  3. PRAGMA wal_checkpoint(TRUNCATE) on target db (flush + release
     OS handles; Windows needs this after conn.backup() reads)
  4. unlink stale -wal/-shm sidecars (tolerant to Windows lock races)
  5. shutil.copy2 snapshot db over target
  6. restore registry if snapshot captured one
  7. restore Chroma tree if snapshot captured one and include_chroma
     resolves to true (defaults to whether backup has Chroma)
  8. PRAGMA integrity_check on restored db, report result
- Refuses without confirm_service_stopped=True to prevent hot-restore
  into a running service (would corrupt SQLite state)
- Rewrote main() as argparse with 4 subcommands: create, list,
  validate, restore. `python -m atocore.ops.backup restore STAMP
  --confirm-service-stopped` is the drill CLI entry point, run via
  `docker compose run --rm --entrypoint python atocore` so it reuses
  the live service's volume mounts

Tests — tests/test_backup.py (6 new):

- test_restore_refuses_without_confirm_service_stopped
- test_restore_raises_on_invalid_backup
- test_restore_round_trip_reverses_post_backup_mutations
  (canonical drill flow: seed -> backup -> mutate -> restore ->
   mutation gone + baseline survived + pre-restore snapshot has
   the mutation captured as rollback anchor)
- test_restore_round_trip_with_chroma
- test_restore_skips_pre_snapshot_when_requested
- test_restore_cleans_stale_wal_sidecars (asserts stale byte
  markers do not survive, not file existence, since PRAGMA
  integrity_check may legitimately recreate -wal)
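
The canonical round-trip flow the third test exercises can be sketched against a throwaway SQLite file (names here are illustrative, not the real test code):

```python
import os
import shutil
import sqlite3
import tempfile

def hot_snapshot(db_path: str, dest_path: str) -> None:
    # Hot copy via sqlite's online backup API, as backup.py does.
    src = sqlite3.connect(db_path)
    dst = sqlite3.connect(dest_path)
    src.backup(dst)
    dst.close()
    src.close()

tmp = tempfile.mkdtemp()
db = os.path.join(tmp, "runtime.db")
snap = os.path.join(tmp, "snapshot.db")

# Seed a baseline row.
conn = sqlite3.connect(db)
conn.execute("CREATE TABLE memory (content TEXT)")
conn.execute("INSERT INTO memory VALUES ('baseline')")
conn.commit()

hot_snapshot(db, snap)                                      # backup
conn.execute("INSERT INTO memory VALUES ('drill-marker')")  # mutate
conn.commit()
conn.close()                                                # "stop the service"

shutil.copy2(snap, db)                                      # restore
rows = [r[0] for r in sqlite3.connect(db).execute("SELECT content FROM memory")]
assert rows == ["baseline"]  # mutation gone, baseline survived
```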

Docs — docs/backup-restore-drill.md (new):

- What gets backed up (hot sqlite, cold chroma, registry JSON,
  metadata.json) and what doesn't (.env, source content)
- What restore does, step by step, and why confirm_service_stopped
  is a hard gate
- 8-step drill procedure: capture -> baseline -> mutate -> stop ->
  restore -> start -> verify marker gone -> optional cleanup
- Correct endpoint bodies verified against routes.py:
    POST /admin/backup with JSON body {"include_chroma": true}
    POST /memory with memory_type/content/project/confidence
    GET /memory?project=drill to list drill markers
    POST /query with {"prompt": ..., "top_k": ...} (not "query")
- Failure modes: integrity_check fail, container won't start,
  marker still present after restore, with remediation for each
- When to run: before new write-path automation, after backup.py
  or schema changes, after infra bumps, monthly as standing check
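
For the drill doc's endpoint bodies, the request *keys* below are the ones verified against routes.py; the values and the memory_type are illustrative placeholders:

```python
import json

backup_body = {"include_chroma": True}
marker_body = {
    "memory_type": "fact",       # value is an assumption, key is documented
    "content": "drill marker",
    "project": "drill",
    "confidence": 0.9,
}
# The /query body keys matter: "prompt", not "query".
query_body = {"prompt": "is the drill marker present?", "top_k": 3}

print(json.dumps(query_body))
```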

225/225 tests passing (219 existing + 6 new restore).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 21:17:48 -04:00
parent 03822389a1
commit 336208004c
3 changed files with 782 additions and 2 deletions


@@ -216,6 +216,166 @@ def validate_backup(stamp: str) -> dict:
    return result


def restore_runtime_backup(
    stamp: str,
    *,
    include_chroma: bool | None = None,
    pre_restore_snapshot: bool = True,
    confirm_service_stopped: bool = False,
) -> dict:
    """Restore a previously captured runtime backup.

    CRITICAL: the AtoCore service MUST be stopped before calling this.
    Overwriting a live SQLite database corrupts state and can break
    the running container's open connections. The caller must pass
    ``confirm_service_stopped=True`` as an explicit acknowledgment —
    otherwise this function refuses to run.

    The restore procedure:

    1. Validate the backup via ``validate_backup``; refuse on any error.
    2. (default) Create a pre-restore safety snapshot of the CURRENT
       state so the restore itself is reversible. The snapshot stamp
       is returned in the result for the operator to record.
    3. Remove stale SQLite WAL/SHM sidecar files next to the target db
       before copying — the snapshot is a self-contained main-file
       image from ``conn.backup()``, and leftover WAL/SHM from the old
       live db would desync against the restored main file.
    4. Copy the snapshot db over the target db path.
    5. Restore the project registry file if the snapshot captured one.
    6. Restore the Chroma directory if ``include_chroma`` resolves to
       true. When ``include_chroma is None`` the function defers to
       whether the snapshot captured Chroma (the common case).
    7. Run ``PRAGMA integrity_check`` on the restored db and report
       the result.

    Returns a dict describing what was restored. On refused restore
    (service still running, validation failed) raises ``RuntimeError``.
    """
    if not confirm_service_stopped:
        raise RuntimeError(
            "restore_runtime_backup refuses to run without "
            "confirm_service_stopped=True — stop the AtoCore container "
            "first (e.g. `docker compose down` from deploy/dalidou) "
            "before calling this function"
        )

    validation = validate_backup(stamp)
    if not validation.get("valid"):
        raise RuntimeError(
            f"backup {stamp} failed validation: {validation.get('errors')}"
        )
    metadata = validation.get("metadata") or {}

    pre_snapshot_stamp: str | None = None
    if pre_restore_snapshot:
        pre = create_runtime_backup(include_chroma=False)
        pre_snapshot_stamp = Path(pre["backup_root"]).name

    target_db = _config.settings.db_path
    source_db = Path(metadata.get("db_snapshot_path", ""))
    if not source_db.exists():
        raise RuntimeError(
            f"db snapshot not found at {source_db} — backup "
            f"metadata may be stale"
        )

    # Force sqlite to flush any lingering WAL into the main file and
    # release OS-level file handles on -wal/-shm before we swap the
    # main file. Passing through conn.backup() in the pre-restore
    # snapshot can leave sidecars momentarily locked on Windows;
    # an explicit checkpoint(TRUNCATE) is the reliable way to flush
    # and release. Best-effort: if the target db can't be opened
    # (missing, corrupt), fall through and trust the copy step.
    if target_db.exists():
        try:
            with sqlite3.connect(str(target_db)) as checkpoint_conn:
                checkpoint_conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
        except sqlite3.DatabaseError as exc:
            log.warning(
                "restore_pre_checkpoint_failed",
                target_db=str(target_db),
                error=str(exc),
            )

    # Remove stale WAL/SHM sidecars from the old live db so SQLite
    # can't read inconsistent state on next open. Tolerant to
    # Windows file-lock races — the subsequent copy replaces the
    # main file anyway, and the integrity check afterward is the
    # actual correctness signal.
    wal_path = target_db.with_name(target_db.name + "-wal")
    shm_path = target_db.with_name(target_db.name + "-shm")
    for stale in (wal_path, shm_path):
        if stale.exists():
            try:
                stale.unlink()
            except OSError as exc:
                log.warning(
                    "restore_sidecar_unlink_failed",
                    path=str(stale),
                    error=str(exc),
                )

    target_db.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source_db, target_db)

    registry_restored = False
    registry_snapshot_path = metadata.get("registry_snapshot_path", "")
    if registry_snapshot_path:
        src_reg = Path(registry_snapshot_path)
        if src_reg.exists():
            dst_reg = _config.settings.resolved_project_registry_path
            dst_reg.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src_reg, dst_reg)
            registry_restored = True

    chroma_snapshot_path = metadata.get("chroma_snapshot_path", "")
    if include_chroma is None:
        include_chroma = bool(chroma_snapshot_path)
    chroma_restored = False
    if include_chroma and chroma_snapshot_path:
        src_chroma = Path(chroma_snapshot_path)
        if src_chroma.exists() and src_chroma.is_dir():
            dst_chroma = _config.settings.chroma_path
            if dst_chroma.exists():
                shutil.rmtree(dst_chroma)
            shutil.copytree(src_chroma, dst_chroma)
            chroma_restored = True

    restored_integrity_ok = False
    integrity_error: str | None = None
    try:
        with sqlite3.connect(str(target_db)) as conn:
            row = conn.execute("PRAGMA integrity_check").fetchone()
            restored_integrity_ok = bool(row and row[0] == "ok")
            if not restored_integrity_ok:
                integrity_error = row[0] if row else "no_row"
    except sqlite3.DatabaseError as exc:
        integrity_error = f"db_open_failed: {exc}"

    result: dict = {
        "stamp": stamp,
        "pre_restore_snapshot": pre_snapshot_stamp,
        "target_db": str(target_db),
        "db_restored": True,
        "registry_restored": registry_restored,
        "chroma_restored": chroma_restored,
        "restored_integrity_ok": restored_integrity_ok,
    }
    if integrity_error:
        result["integrity_error"] = integrity_error
    log.info(
        "runtime_backup_restored",
        stamp=stamp,
        pre_restore_snapshot=pre_snapshot_stamp,
        registry_restored=registry_restored,
        chroma_restored=chroma_restored,
        integrity_ok=restored_integrity_ok,
    )
    return result


def _backup_sqlite_db(source_path: Path, dest_path: Path) -> None:
    source_conn = sqlite3.connect(str(source_path))
    dest_conn = sqlite3.connect(str(dest_path))
@@ -242,7 +402,89 @@ def _copy_directory_tree(source: Path, dest: Path) -> tuple[int, int]:


def main() -> None:
    """CLI entry point for the backup module.

    Supports four subcommands:

    - ``create``    run ``create_runtime_backup`` (default if none given)
    - ``list``      list all runtime backup snapshots
    - ``validate``  validate a specific snapshot by stamp
    - ``restore``   restore a specific snapshot by stamp

    The restore subcommand is the one used by the backup/restore drill
    and MUST be run only when the AtoCore service is stopped. It takes
    ``--confirm-service-stopped`` as an explicit acknowledgment.
    """
    import argparse

    parser = argparse.ArgumentParser(
        prog="python -m atocore.ops.backup",
        description="AtoCore runtime backup create/list/validate/restore",
    )
    sub = parser.add_subparsers(dest="command")

    p_create = sub.add_parser("create", help="create a new runtime backup")
    p_create.add_argument(
        "--chroma",
        action="store_true",
        help="also snapshot the Chroma vector store (cold copy)",
    )

    sub.add_parser("list", help="list runtime backup snapshots")

    p_validate = sub.add_parser("validate", help="validate a snapshot by stamp")
    p_validate.add_argument("stamp", help="snapshot stamp (e.g. 20260409T010203Z)")

    p_restore = sub.add_parser(
        "restore",
        help="restore a snapshot by stamp (service must be stopped)",
    )
    p_restore.add_argument("stamp", help="snapshot stamp to restore")
    p_restore.add_argument(
        "--confirm-service-stopped",
        action="store_true",
        help="explicit acknowledgment that the AtoCore container is stopped",
    )
    p_restore.add_argument(
        "--no-pre-snapshot",
        action="store_true",
        help="skip the pre-restore safety snapshot of current state",
    )
    chroma_group = p_restore.add_mutually_exclusive_group()
    chroma_group.add_argument(
        "--chroma",
        dest="include_chroma",
        action="store_true",
        default=None,
        help="force-restore the Chroma snapshot",
    )
    chroma_group.add_argument(
        "--no-chroma",
        dest="include_chroma",
        action="store_false",
        help="skip the Chroma snapshot even if it was captured",
    )

    args = parser.parse_args()
    command = args.command or "create"
    if command == "create":
        include_chroma = getattr(args, "chroma", False)
        result = create_runtime_backup(include_chroma=include_chroma)
    elif command == "list":
        result = {"backups": list_runtime_backups()}
    elif command == "validate":
        result = validate_backup(args.stamp)
    elif command == "restore":
        result = restore_runtime_backup(
            args.stamp,
            include_chroma=args.include_chroma,
            pre_restore_snapshot=not args.no_pre_snapshot,
            confirm_service_stopped=args.confirm_service_stopped,
        )
    else:  # pragma: no cover — argparse guards this
        parser.error(f"unknown command: {command}")
    print(json.dumps(result, indent=2, ensure_ascii=True))