ops: add restore_runtime_backup + drill runbook

Close the backup side of the loop: we had create/list/validate but
no restore, and no documented drill. A backup you've never restored
is not a backup. This lands the missing restore surface and the
procedure to exercise it before enabling any write-path automation
(auto-capture, automated ingestion, reinforcement sweeps).

Code — src/atocore/ops/backup.py:

- restore_runtime_backup(stamp, *, include_chroma, pre_restore_snapshot,
  confirm_service_stopped) performs:
  1. validate_backup() gate — refuse on any error
  2. pre-restore safety snapshot of current state (reversibility anchor)
  3. PRAGMA wal_checkpoint(TRUNCATE) on target db (flush + release
     OS handles; Windows needs this after conn.backup() reads)
  4. unlink stale -wal/-shm sidecars (tolerant to Windows lock races)
  5. shutil.copy2 snapshot db over target
  6. restore registry if snapshot captured one
  7. restore Chroma tree if snapshot captured one and include_chroma
     resolves to true (defaults to whether backup has Chroma)
  8. PRAGMA integrity_check on restored db, report result
- Refuses without confirm_service_stopped=True to prevent hot-restore
  into a running service (would corrupt SQLite state)
- Rewrote main() as argparse with 4 subcommands: create, list,
  validate, restore. `python -m atocore.ops.backup restore STAMP
  --confirm-service-stopped` is the drill CLI entry point, run via
  `docker compose run --rm --entrypoint python atocore` so it reuses
  the live service's volume mounts

Tests — tests/test_backup.py (6 new):

- test_restore_refuses_without_confirm_service_stopped
- test_restore_raises_on_invalid_backup
- test_restore_round_trip_reverses_post_backup_mutations
  (canonical drill flow: seed -> backup -> mutate -> restore ->
   mutation gone + baseline survived + pre-restore snapshot has
   the mutation captured as rollback anchor)
- test_restore_round_trip_with_chroma
- test_restore_skips_pre_snapshot_when_requested
- test_restore_cleans_stale_wal_sidecars (asserts stale byte
  markers do not survive, not file existence, since PRAGMA
  integrity_check may legitimately recreate -wal)
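
The canonical round-trip flow the third test exercises can be sketched against a throwaway SQLite file (names here are illustrative, not the real test code):

```python
import os
import shutil
import sqlite3
import tempfile

def hot_snapshot(db_path: str, dest_path: str) -> None:
    # Hot copy via sqlite's online backup API, as backup.py does.
    src = sqlite3.connect(db_path)
    dst = sqlite3.connect(dest_path)
    src.backup(dst)
    dst.close()
    src.close()

tmp = tempfile.mkdtemp()
db = os.path.join(tmp, "runtime.db")
snap = os.path.join(tmp, "snapshot.db")

# Seed a baseline row.
conn = sqlite3.connect(db)
conn.execute("CREATE TABLE memory (content TEXT)")
conn.execute("INSERT INTO memory VALUES ('baseline')")
conn.commit()

hot_snapshot(db, snap)                                      # backup
conn.execute("INSERT INTO memory VALUES ('drill-marker')")  # mutate
conn.commit()
conn.close()                                                # "stop the service"

shutil.copy2(snap, db)                                      # restore
rows = [r[0] for r in sqlite3.connect(db).execute("SELECT content FROM memory")]
assert rows == ["baseline"]  # mutation gone, baseline survived
```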

Docs — docs/backup-restore-drill.md (new):

- What gets backed up (hot sqlite, cold chroma, registry JSON,
  metadata.json) and what doesn't (.env, source content)
- What restore does, step by step, and why confirm_service_stopped
  is a hard gate
- 8-step drill procedure: capture -> baseline -> mutate -> stop ->
  restore -> start -> verify marker gone -> optional cleanup
- Correct endpoint bodies verified against routes.py:
    POST /admin/backup with JSON body {"include_chroma": true}
    POST /memory with memory_type/content/project/confidence
    GET /memory?project=drill to list drill markers
    POST /query with {"prompt": ..., "top_k": ...} (not "query")
- Failure modes: integrity_check fail, container won't start,
  marker still present after restore, with remediation for each
- When to run: before new write-path automation, after backup.py
  or schema changes, after infra bumps, monthly as standing check
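
For the drill doc's endpoint bodies, the request *keys* below are the ones verified against routes.py; the values and the memory_type are illustrative placeholders:

```python
import json

backup_body = {"include_chroma": True}
marker_body = {
    "memory_type": "fact",       # value is an assumption, key is documented
    "content": "drill marker",
    "project": "drill",
    "confidence": 0.9,
}
# The /query body keys matter: "prompt", not "query".
query_body = {"prompt": "is the drill marker present?", "top_k": 3}

print(json.dumps(query_body))
```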

225/225 tests passing (219 existing + 6 new restore).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 21:17:48 -04:00
parent 03822389a1
commit 336208004c
3 changed files with 782 additions and 2 deletions


@@ -216,6 +216,166 @@ def validate_backup(stamp: str) -> dict:
    return result


def restore_runtime_backup(
    stamp: str,
    *,
    include_chroma: bool | None = None,
    pre_restore_snapshot: bool = True,
    confirm_service_stopped: bool = False,
) -> dict:
    """Restore a previously captured runtime backup.

    CRITICAL: the AtoCore service MUST be stopped before calling this.
    Overwriting a live SQLite database corrupts state and can break
    the running container's open connections. The caller must pass
    ``confirm_service_stopped=True`` as an explicit acknowledgment —
    otherwise this function refuses to run.

    The restore procedure:

    1. Validate the backup via ``validate_backup``; refuse on any error.
    2. (default) Create a pre-restore safety snapshot of the CURRENT
       state so the restore itself is reversible. The snapshot stamp
       is returned in the result for the operator to record.
    3. Remove stale SQLite WAL/SHM sidecar files next to the target db
       before copying — the snapshot is a self-contained main-file
       image from ``conn.backup()``, and leftover WAL/SHM from the old
       live db would desync against the restored main file.
    4. Copy the snapshot db over the target db path.
    5. Restore the project registry file if the snapshot captured one.
    6. Restore the Chroma directory if ``include_chroma`` resolves to
       true. When ``include_chroma is None`` the function defers to
       whether the snapshot captured Chroma (the common case).
    7. Run ``PRAGMA integrity_check`` on the restored db and report
       the result.

    Returns a dict describing what was restored. On refused restore
    (service still running, validation failed) raises ``RuntimeError``.
    """
    if not confirm_service_stopped:
        raise RuntimeError(
            "restore_runtime_backup refuses to run without "
            "confirm_service_stopped=True — stop the AtoCore container "
            "first (e.g. `docker compose down` from deploy/dalidou) "
            "before calling this function"
        )

    validation = validate_backup(stamp)
    if not validation.get("valid"):
        raise RuntimeError(
            f"backup {stamp} failed validation: {validation.get('errors')}"
        )
    metadata = validation.get("metadata") or {}

    pre_snapshot_stamp: str | None = None
    if pre_restore_snapshot:
        pre = create_runtime_backup(include_chroma=False)
        pre_snapshot_stamp = Path(pre["backup_root"]).name

    target_db = _config.settings.db_path
    source_db = Path(metadata.get("db_snapshot_path", ""))
    if not source_db.exists():
        raise RuntimeError(
            f"db snapshot not found at {source_db} — backup "
            f"metadata may be stale"
        )

    # Force sqlite to flush any lingering WAL into the main file and
    # release OS-level file handles on -wal/-shm before we swap the
    # main file. Passing through conn.backup() in the pre-restore
    # snapshot can leave sidecars momentarily locked on Windows;
    # an explicit checkpoint(TRUNCATE) is the reliable way to flush
    # and release. Best-effort: if the target db can't be opened
    # (missing, corrupt), fall through and trust the copy step.
    if target_db.exists():
        try:
            with sqlite3.connect(str(target_db)) as checkpoint_conn:
                checkpoint_conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
        except sqlite3.DatabaseError as exc:
            log.warning(
                "restore_pre_checkpoint_failed",
                target_db=str(target_db),
                error=str(exc),
            )

    # Remove stale WAL/SHM sidecars from the old live db so SQLite
    # can't read inconsistent state on next open. Tolerant to
    # Windows file-lock races — the subsequent copy replaces the
    # main file anyway, and the integrity check afterward is the
    # actual correctness signal.
    wal_path = target_db.with_name(target_db.name + "-wal")
    shm_path = target_db.with_name(target_db.name + "-shm")
    for stale in (wal_path, shm_path):
        if stale.exists():
            try:
                stale.unlink()
            except OSError as exc:
                log.warning(
                    "restore_sidecar_unlink_failed",
                    path=str(stale),
                    error=str(exc),
                )

    target_db.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source_db, target_db)

    registry_restored = False
    registry_snapshot_path = metadata.get("registry_snapshot_path", "")
    if registry_snapshot_path:
        src_reg = Path(registry_snapshot_path)
        if src_reg.exists():
            dst_reg = _config.settings.resolved_project_registry_path
            dst_reg.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src_reg, dst_reg)
            registry_restored = True

    chroma_snapshot_path = metadata.get("chroma_snapshot_path", "")
    if include_chroma is None:
        include_chroma = bool(chroma_snapshot_path)
    chroma_restored = False
    if include_chroma and chroma_snapshot_path:
        src_chroma = Path(chroma_snapshot_path)
        if src_chroma.exists() and src_chroma.is_dir():
            dst_chroma = _config.settings.chroma_path
            if dst_chroma.exists():
                shutil.rmtree(dst_chroma)
            shutil.copytree(src_chroma, dst_chroma)
            chroma_restored = True

    restored_integrity_ok = False
    integrity_error: str | None = None
    try:
        with sqlite3.connect(str(target_db)) as conn:
            row = conn.execute("PRAGMA integrity_check").fetchone()
            restored_integrity_ok = bool(row and row[0] == "ok")
            if not restored_integrity_ok:
                integrity_error = row[0] if row else "no_row"
    except sqlite3.DatabaseError as exc:
        integrity_error = f"db_open_failed: {exc}"

    result: dict = {
        "stamp": stamp,
        "pre_restore_snapshot": pre_snapshot_stamp,
        "target_db": str(target_db),
        "db_restored": True,
        "registry_restored": registry_restored,
        "chroma_restored": chroma_restored,
        "restored_integrity_ok": restored_integrity_ok,
    }
    if integrity_error:
        result["integrity_error"] = integrity_error
    log.info(
        "runtime_backup_restored",
        stamp=stamp,
        pre_restore_snapshot=pre_snapshot_stamp,
        registry_restored=registry_restored,
        chroma_restored=chroma_restored,
        integrity_ok=restored_integrity_ok,
    )
    return result


def _backup_sqlite_db(source_path: Path, dest_path: Path) -> None:
    source_conn = sqlite3.connect(str(source_path))
    dest_conn = sqlite3.connect(str(dest_path))
@@ -242,7 +402,89 @@ def _copy_directory_tree(source: Path, dest: Path) -> tuple[int, int]:


def main() -> None:
    """CLI entry point for the backup module.

    Supports four subcommands:

    - ``create``    run ``create_runtime_backup`` (default if none given)
    - ``list``      list all runtime backup snapshots
    - ``validate``  validate a specific snapshot by stamp
    - ``restore``   restore a specific snapshot by stamp

    The restore subcommand is the one used by the backup/restore drill
    and MUST be run only when the AtoCore service is stopped. It takes
    ``--confirm-service-stopped`` as an explicit acknowledgment.
    """
    import argparse

    parser = argparse.ArgumentParser(
        prog="python -m atocore.ops.backup",
        description="AtoCore runtime backup create/list/validate/restore",
    )
    sub = parser.add_subparsers(dest="command")

    p_create = sub.add_parser("create", help="create a new runtime backup")
    p_create.add_argument(
        "--chroma",
        action="store_true",
        help="also snapshot the Chroma vector store (cold copy)",
    )

    sub.add_parser("list", help="list runtime backup snapshots")

    p_validate = sub.add_parser("validate", help="validate a snapshot by stamp")
    p_validate.add_argument("stamp", help="snapshot stamp (e.g. 20260409T010203Z)")

    p_restore = sub.add_parser(
        "restore",
        help="restore a snapshot by stamp (service must be stopped)",
    )
    p_restore.add_argument("stamp", help="snapshot stamp to restore")
    p_restore.add_argument(
        "--confirm-service-stopped",
        action="store_true",
        help="explicit acknowledgment that the AtoCore container is stopped",
    )
    p_restore.add_argument(
        "--no-pre-snapshot",
        action="store_true",
        help="skip the pre-restore safety snapshot of current state",
    )
    chroma_group = p_restore.add_mutually_exclusive_group()
    chroma_group.add_argument(
        "--chroma",
        dest="include_chroma",
        action="store_true",
        default=None,
        help="force-restore the Chroma snapshot",
    )
    chroma_group.add_argument(
        "--no-chroma",
        dest="include_chroma",
        action="store_false",
        help="skip the Chroma snapshot even if it was captured",
    )

    args = parser.parse_args()
    command = args.command or "create"
    if command == "create":
        include_chroma = getattr(args, "chroma", False)
        result = create_runtime_backup(include_chroma=include_chroma)
    elif command == "list":
        result = {"backups": list_runtime_backups()}
    elif command == "validate":
        result = validate_backup(args.stamp)
    elif command == "restore":
        result = restore_runtime_backup(
            args.stamp,
            include_chroma=args.include_chroma,
            pre_restore_snapshot=not args.no_pre_snapshot,
            confirm_service_stopped=args.confirm_service_stopped,
        )
    else:  # pragma: no cover — argparse guards this
        parser.error(f"unknown command: {command}")
    print(json.dumps(result, indent=2, ensure_ascii=True))