ops: add restore_runtime_backup + drill runbook
Close the backup side of the loop: we had create/list/validate but
no restore, and no documented drill. A backup you've never restored
is not a backup. This lands the missing restore surface and the
procedure to exercise it before enabling any write-path automation
(auto-capture, automated ingestion, reinforcement sweeps).
Code — src/atocore/ops/backup.py:
- restore_runtime_backup(stamp, *, include_chroma, pre_restore_snapshot,
confirm_service_stopped) performs:
1. validate_backup() gate — refuse on any error
2. pre-restore safety snapshot of current state (reversibility anchor)
3. PRAGMA wal_checkpoint(TRUNCATE) on target db (flush + release
OS handles; Windows needs this after conn.backup() reads)
4. unlink stale -wal/-shm sidecars (tolerant to Windows lock races)
5. shutil.copy2 snapshot db over target
6. restore registry if snapshot captured one
7. restore Chroma tree if snapshot captured one and include_chroma
resolves to true (defaults to whether backup has Chroma)
8. PRAGMA integrity_check on restored db, report result
- Refuses without confirm_service_stopped=True to prevent hot-restore
into a running service (would corrupt SQLite state)
- Rewrote main() as argparse with 4 subcommands: create, list,
validate, restore. `python -m atocore.ops.backup restore STAMP
--confirm-service-stopped` is the drill CLI entry point, run via
`docker compose run --rm --entrypoint python atocore` so it reuses
the live service's volume mounts
Tests — tests/test_backup.py (6 new):
- test_restore_refuses_without_confirm_service_stopped
- test_restore_raises_on_invalid_backup
- test_restore_round_trip_reverses_post_backup_mutations
(canonical drill flow: seed -> backup -> mutate -> restore ->
mutation gone + baseline survived + pre-restore snapshot has
the mutation captured as rollback anchor)
- test_restore_round_trip_with_chroma
- test_restore_skips_pre_snapshot_when_requested
- test_restore_cleans_stale_wal_sidecars (asserts stale byte
markers do not survive, not file existence, since PRAGMA
integrity_check may legitimately recreate -wal)
Docs — docs/backup-restore-drill.md (new):
- What gets backed up (hot sqlite, cold chroma, registry JSON,
metadata.json) and what doesn't (.env, source content)
- What restore does, step by step, and why confirm_service_stopped
is a hard gate
- 8-step drill procedure: capture -> baseline -> mutate -> stop ->
restore -> start -> verify marker gone -> optional cleanup
- Correct endpoint bodies verified against routes.py:
POST /admin/backup with JSON body {"include_chroma": true}
POST /memory with memory_type/content/project/confidence
GET /memory?project=drill to list drill markers
POST /query with {"prompt": ..., "top_k": ...} (not "query")
- Failure modes: integrity_check fail, container won't start,
marker still present after restore, with remediation for each
- When to run: before new write-path automation, after backup.py
or schema changes, after infra bumps, monthly as standing check
225/225 tests passing (219 existing + 6 new restore).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
docs/backup-restore-drill.md (new file, 296 lines)

@@ -0,0 +1,296 @@

# Backup / Restore Drill

## Purpose

Before turning on any automation that writes to AtoCore continuously
(auto-capture of Claude Code sessions, automated source ingestion,
reinforcement sweeps), we need to know — with certainty — that a
backup can actually be restored. A backup you've never restored is
not a backup; it's a file that happens to be named that way.

This runbook walks through the canonical drill: take a snapshot,
mutate live state, stop the service, restore from the snapshot,
start the service, and verify the mutation is reversed. When the
drill passes, the runtime store has a trustworthy rollback.

## What gets backed up

`src/atocore/ops/backup.py::create_runtime_backup()` writes the
following into `$ATOCORE_BACKUP_DIR/snapshots/<stamp>/`:

| Component | How | Hot/Cold | Notes |
|---|---|---|---|
| SQLite (`atocore.db`) | `conn.backup()` online API | **hot** | Safe with service running; self-contained main file, no WAL sidecar. |
| Project registry JSON | file copy | cold | Only if the file exists. |
| Chroma vector store | `shutil.copytree` | **cold** | Only when `include_chroma=True`. Caller must hold `exclusive_ingestion()` so nothing writes during the copy — the `POST /admin/backup` route with body `{"include_chroma": true}` does this automatically. |
| `backup-metadata.json` | JSON blob | — | Records paths, sizes, and whether Chroma was included. Restore reads this to know what to pull back. |

Things that are **not** in the backup and must be handled separately:

- The `.env` file under `deploy/dalidou/` — secrets live out of git
  and out of the backup on purpose. The operator must re-place it
  on any fresh host.
- The source content under `sources/vault` and `sources/drive` —
  these are read-only inputs by convention, owned by AtoVault /
  AtoDrive, and backed up there.
- Any transient runtime state (in-flight HTTP requests, ingestion
  queues). Stop the service cleanly if you care about those.

## What restore does

`restore_runtime_backup(stamp, confirm_service_stopped=True)`:

1. **Validates** the backup first via `validate_backup()` —
   refuses to run on any error (missing metadata, corrupt snapshot
   db, etc.).
2. **Takes a pre-restore safety snapshot** of the current state
   (SQLite only, not Chroma — to keep it fast) and returns its
   stamp. This is the reversibility guarantee: if the restore was
   the wrong call, you can roll it back by restoring the
   pre-restore snapshot.
3. **Forces a WAL checkpoint** on the current db
   (`PRAGMA wal_checkpoint(TRUNCATE)`) to flush any lingering
   writes and release OS file handles on `-wal`/`-shm`, so the
   copy step won't race a half-open sqlite connection.
4. **Removes stale WAL/SHM sidecars** next to the target db.
   The snapshot `.db` is a self-contained main-file image with no
   WAL of its own; leftover `-wal` from the old live process
   would desync against the restored main file.
5. **Copies the snapshot db** over the live db path.
6. **Restores the registry JSON** if the snapshot captured one.
7. **Restores the Chroma tree** if the snapshot captured one and
   `include_chroma` resolves to true (defaults to whether the
   snapshot has Chroma).
8. **Runs `PRAGMA integrity_check`** on the restored db and
   reports the result alongside a summary of what was touched.

If `confirm_service_stopped` is not passed, the function refuses —
this is deliberate. Hot-restoring into a running service is not
supported and would corrupt state.

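Steps 3 and 4 are the fiddly part and can be exercised in isolation. A self-contained sketch of the checkpoint-then-unlink sequence, using a throwaway demo db rather than the real target path:

```python
import sqlite3
import tempfile
from pathlib import Path

tmp = Path(tempfile.mkdtemp())
db = tmp / "demo.db"

# Create a WAL-mode db so -wal/-shm sidecars can appear.
conn = sqlite3.connect(db)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.execute("INSERT INTO t VALUES (1)")
conn.commit()

# Step 3: checkpoint(TRUNCATE) folds the WAL into the main file and
# truncates the -wal; busy == 0 means nothing blocked the checkpoint.
busy, _, _ = conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()
conn.close()

# Step 4: unlink any now-stale sidecars, tolerating the case where a
# clean close already removed them.
for suffix in ("-wal", "-shm"):
    sidecar = db.with_name(db.name + suffix)
    try:
        sidecar.unlink()
    except FileNotFoundError:
        pass
```

After this, the main `.db` file is self-contained and safe to copy with `shutil.copy2`, which is exactly the state the restore relies on.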
## The drill

Run this from a Dalidou host with the AtoCore container already
deployed and healthy. The whole drill takes under two minutes. It
does not touch source content or disturb any `.env` secrets.

### Step 1. Capture a snapshot via the HTTP API

The running service holds the db; use the admin route so the
Chroma snapshot is taken under `exclusive_ingestion()`. The
endpoint takes a JSON body (not a query string):

```bash
curl -fsS -X POST 'http://127.0.0.1:8100/admin/backup' \
  -H 'Content-Type: application/json' \
  -d '{"include_chroma": true}' \
  | python3 -m json.tool
```

Record the `backup_root` and note the stamp (the last path segment,
e.g. `20260409T012345Z`). That stamp is the input to the restore
step.

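If you script this step, you can pull the stamp out of `backup_root` and sanity-check its shape instead of eyeballing the JSON. A small helper sketch — `stamp_from_backup_root` is illustrative, not part of backup.py:

```python
from datetime import datetime
from pathlib import PurePosixPath

def stamp_from_backup_root(backup_root: str) -> str:
    """Extract the snapshot stamp (last path segment) and verify its shape."""
    stamp = PurePosixPath(backup_root).name
    # Stamps look like 20260409T012345Z: UTC, second resolution.
    datetime.strptime(stamp, "%Y%m%dT%H%M%SZ")  # raises ValueError if malformed
    return stamp
```
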
### Step 2. Record a known piece of live state

Pick something small and unambiguous to use as a marker. The
simplest is the current health snapshot plus a memory count:

```bash
curl -fsS 'http://127.0.0.1:8100/health' | python3 -m json.tool
```

Note the `memory_count`, `interaction_count`, and `build_sha`. These
are your pre-drill baseline.

### Step 3. Mutate live state AFTER the backup

Write something the restore should reverse. Any write endpoint is
fine — a throwaway test memory is the cleanest. The request body
must include `memory_type` (the AtoCore memory schema requires it):

```bash
curl -fsS -X POST 'http://127.0.0.1:8100/memory' \
  -H 'Content-Type: application/json' \
  -d '{
        "memory_type": "note",
        "content": "DRILL-MARKER: this memory should not survive the restore",
        "project": "drill",
        "confidence": 1.0
      }' \
  | python3 -m json.tool
```

Record the returned `id`. Confirm it's there:

```bash
curl -fsS 'http://127.0.0.1:8100/health' | python3 -m json.tool
# memory_count should be baseline + 1

# And you can list the drill-project memories directly:
curl -fsS 'http://127.0.0.1:8100/memory?project=drill' | python3 -m json.tool
# should return the DRILL-MARKER memory
```

### Step 4. Stop the service

```bash
cd /srv/storage/atocore/app/deploy/dalidou
docker compose down
```

Wait for the container to actually exit:

```bash
docker compose ps
# atocore should be gone or Exited
```

### Step 5. Restore from the snapshot

Run the restore inside a one-shot container that reuses the same
volumes as the live service. This guarantees the paths resolve
identically to the running container's view.

```bash
cd /srv/storage/atocore/app/deploy/dalidou
docker compose run --rm --entrypoint python atocore \
  -m atocore.ops.backup restore \
  <YOUR_STAMP_FROM_STEP_1> \
  --confirm-service-stopped
```

The output is JSON; the important fields are:

- `pre_restore_snapshot`: stamp of the safety snapshot of live
  state at the moment of restore. **Write this down.** If the
  restore turns out to be the wrong call, this is how you roll
  it back.
- `db_restored`: `true`
- `registry_restored`: `true` if the backup had a registry
- `chroma_restored`: `true` if the backup had a chroma snapshot
- `restored_integrity_ok`: **must be `true`** — if this is false,
  STOP and do not start the service; investigate the integrity
  error first.

If restoration fails at any step, the function raises a clean
`RuntimeError` and nothing partial is committed past the main-file
swap. The pre-restore safety snapshot is your rollback anchor.

### Step 6. Start the service back up

```bash
cd /srv/storage/atocore/app/deploy/dalidou
docker compose up -d
```

Wait for `/health` to respond:

```bash
for i in 1 2 3 4 5 6 7 8 9 10; do
  curl -fsS 'http://127.0.0.1:8100/health' \
    && break || { echo "not ready ($i/10)"; sleep 3; }
done
```

### Step 7. Verify the drill marker is gone

```bash
curl -fsS 'http://127.0.0.1:8100/health' | python3 -m json.tool
# memory_count should equal the Step 2 baseline, NOT baseline + 1
```

You can also list the drill-project memories directly:

```bash
curl -fsS 'http://127.0.0.1:8100/memory?project=drill' | python3 -m json.tool
# should return an empty list — the DRILL-MARKER memory was rolled back
```

For a semantic-retrieval cross-check, issue a query (the `/query`
endpoint takes `prompt`, not `query`):

```bash
curl -fsS -X POST 'http://127.0.0.1:8100/query' \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "DRILL-MARKER drill marker", "top_k": 5}' \
  | python3 -m json.tool
# should not return the DRILL-MARKER memory in the hits
```

If the marker is gone and `memory_count` matches the baseline, the
drill **passed**. The runtime store has a trustworthy rollback.

### Step 8. (Optional) Clean up the safety snapshot

If everything went smoothly you can leave the pre-restore safety
snapshot on disk for a few days as a paranoia buffer. There's no
automatic cleanup yet — `list_runtime_backups()` will show it, and
you can remove it by hand once you're confident:

```bash
rm -rf /srv/storage/atocore/backups/snapshots/<pre_restore_stamp>
```

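Since there is no automatic cleanup, a retention sweep is easy to sketch from the stamp format alone. Hypothetical helper — `prune_candidates` is not part of backup.py, and it only lists directories, deleting nothing:

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path

def prune_candidates(snapshots_dir: Path, keep_days: int = 7) -> list[Path]:
    """List snapshot dirs older than keep_days, judged by their stamp name."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=keep_days)
    out = []
    for d in sorted(snapshots_dir.iterdir()):
        try:
            ts = datetime.strptime(d.name, "%Y%m%dT%H%M%SZ").replace(
                tzinfo=timezone.utc
            )
        except ValueError:
            continue  # not a stamp-named dir; leave it alone
        if ts < cutoff:
            out.append(d)
    return out
```

Wire the output into `rm -rf` (or `shutil.rmtree`) only after a human has eyeballed the list.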
## Failure modes and recovery

### Restore reports `restored_integrity_ok: false`

The copied db failed `PRAGMA integrity_check`. Do **not** start
the service. This usually means either the source snapshot was
itself corrupt (and `validate_backup` should have caught it — file
a bug if it didn't), or the copy was interrupted. Options:

1. Validate the source snapshot directly:
   `python -m atocore.ops.backup validate <STAMP>`
2. Pick a different, older snapshot and retry the restore.
3. Roll the db back to your pre-restore safety snapshot.

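If the CLI itself is unavailable (say, the image won't run), the same check needs nothing but stdlib `sqlite3`. A standalone sketch — `integrity_ok` is illustrative, not a backup.py function:

```python
import sqlite3
import sys

def integrity_ok(db_path: str) -> bool:
    """Run PRAGMA integrity_check and report whether it returned 'ok'."""
    try:
        conn = sqlite3.connect(db_path)
        try:
            row = conn.execute("PRAGMA integrity_check").fetchone()
        finally:
            conn.close()
    except sqlite3.DatabaseError as exc:
        # Covers a truncated or non-sqlite file: connect() is lazy, so the
        # failure surfaces on the first statement.
        print(f"could not open {db_path}: {exc}", file=sys.stderr)
        return False
    return bool(row and row[0] == "ok")
```

Point it at the snapshot's `atocore.db` to decide between options 1 and 2 above.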
### The live container won't start after restore

Check the container logs:

```bash
cd /srv/storage/atocore/app/deploy/dalidou
docker compose logs --tail=100 atocore
```

Common causes:

- Schema drift between the snapshot and the current code version.
  `_apply_migrations` in `src/atocore/models/database.py` is
  idempotent and only ADDs columns, so a backward restore (running
  new code against an older snapshot) is re-migrated on startup and
  should just work. The opposite direction (older code against a
  newer snapshot) is usually safe too, since extra columns are
  ignored, but verify before trusting the result.
- Chroma and SQLite disagreeing about which chunks exist. The
  backup captures them together to minimize this, but if you
  restore SQLite without Chroma (`--no-chroma`), retrieval may
  return stale vectors. Re-ingest if this happens.

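Why ADD-only migrations tolerate restores can be seen in miniature. This is an illustrative sketch, not the real `_apply_migrations`:

```python
import sqlite3

def apply_migrations(conn: sqlite3.Connection) -> None:
    """Idempotent, ADD-only migration sketch (toy schema, for illustration)."""
    cols = {row[1] for row in conn.execute("PRAGMA table_info(projects)")}
    if "name" not in cols:
        conn.execute("ALTER TABLE projects ADD COLUMN name TEXT")

conn = sqlite3.connect(":memory:")
# An "older snapshot" schema that predates the name column.
conn.execute("CREATE TABLE projects (id TEXT PRIMARY KEY)")
apply_migrations(conn)  # brings the restored schema forward
apply_migrations(conn)  # running it again is a no-op, not an error
```

Because the guard checks `PRAGMA table_info` before altering, restoring an old snapshot just re-runs the same additions on next startup.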
### The drill marker is still present after restore

Something went wrong. Possible causes:

- You restored a snapshot taken AFTER the drill marker was
  written (wrong stamp).
- The service was writing during the drill and committed the
  marker before `docker compose down`. Double-check the order.
- The restore silently skipped the db step. Check the restore
  output for `db_restored: true` and `restored_integrity_ok: true`.

Roll back to the pre-restore safety snapshot and retry with the
correct source snapshot.

## When to run this drill

- **Before** enabling any new write-path automation (auto-capture,
  automated ingestion, reinforcement sweeps, scheduled extraction).
- **After** any change to `src/atocore/ops/backup.py` or the
  schema migrations in `src/atocore/models/database.py`.
- **After** a Dalidou OS upgrade or docker version bump.
- **Monthly** as a standing operational check.

Record each drill run (pass/fail) somewhere durable — even a line
in the project journal is enough. A drill you ran once and never
again is barely more than a drill you never ran.

src/atocore/ops/backup.py:

@@ -216,6 +216,166 @@ def validate_backup(stamp: str) -> dict:
    return result


def restore_runtime_backup(
    stamp: str,
    *,
    include_chroma: bool | None = None,
    pre_restore_snapshot: bool = True,
    confirm_service_stopped: bool = False,
) -> dict:
    """Restore a previously captured runtime backup.

    CRITICAL: the AtoCore service MUST be stopped before calling this.
    Overwriting a live SQLite database corrupts state and can break
    the running container's open connections. The caller must pass
    ``confirm_service_stopped=True`` as an explicit acknowledgment —
    otherwise this function refuses to run.

    The restore procedure:

    1. Validate the backup via ``validate_backup``; refuse on any error.
    2. (default) Create a pre-restore safety snapshot of the CURRENT
       state so the restore itself is reversible. The snapshot stamp
       is returned in the result for the operator to record.
    3. Remove stale SQLite WAL/SHM sidecar files next to the target db
       before copying — the snapshot is a self-contained main-file
       image from ``conn.backup()``, and leftover WAL/SHM from the old
       live db would desync against the restored main file.
    4. Copy the snapshot db over the target db path.
    5. Restore the project registry file if the snapshot captured one.
    6. Restore the Chroma directory if ``include_chroma`` resolves to
       true. When ``include_chroma is None`` the function defers to
       whether the snapshot captured Chroma (the common case).
    7. Run ``PRAGMA integrity_check`` on the restored db and report
       the result.

    Returns a dict describing what was restored. On refused restore
    (service still running, validation failed) raises ``RuntimeError``.
    """
    if not confirm_service_stopped:
        raise RuntimeError(
            "restore_runtime_backup refuses to run without "
            "confirm_service_stopped=True — stop the AtoCore container "
            "first (e.g. `docker compose down` from deploy/dalidou) "
            "before calling this function"
        )

    validation = validate_backup(stamp)
    if not validation.get("valid"):
        raise RuntimeError(
            f"backup {stamp} failed validation: {validation.get('errors')}"
        )
    metadata = validation.get("metadata") or {}

    pre_snapshot_stamp: str | None = None
    if pre_restore_snapshot:
        pre = create_runtime_backup(include_chroma=False)
        pre_snapshot_stamp = Path(pre["backup_root"]).name

    target_db = _config.settings.db_path
    source_db = Path(metadata.get("db_snapshot_path", ""))
    if not source_db.exists():
        raise RuntimeError(
            f"db snapshot not found at {source_db} — backup "
            f"metadata may be stale"
        )

    # Force sqlite to flush any lingering WAL into the main file and
    # release OS-level file handles on -wal/-shm before we swap the
    # main file. Passing through conn.backup() in the pre-restore
    # snapshot can leave sidecars momentarily locked on Windows;
    # an explicit checkpoint(TRUNCATE) is the reliable way to flush
    # and release. Best-effort: if the target db can't be opened
    # (missing, corrupt), fall through and trust the copy step.
    if target_db.exists():
        try:
            with sqlite3.connect(str(target_db)) as checkpoint_conn:
                checkpoint_conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
        except sqlite3.DatabaseError as exc:
            log.warning(
                "restore_pre_checkpoint_failed",
                target_db=str(target_db),
                error=str(exc),
            )

    # Remove stale WAL/SHM sidecars from the old live db so SQLite
    # can't read inconsistent state on next open. Tolerant to
    # Windows file-lock races — the subsequent copy replaces the
    # main file anyway, and the integrity check afterward is the
    # actual correctness signal.
    wal_path = target_db.with_name(target_db.name + "-wal")
    shm_path = target_db.with_name(target_db.name + "-shm")
    for stale in (wal_path, shm_path):
        if stale.exists():
            try:
                stale.unlink()
            except OSError as exc:
                log.warning(
                    "restore_sidecar_unlink_failed",
                    path=str(stale),
                    error=str(exc),
                )

    target_db.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source_db, target_db)

    registry_restored = False
    registry_snapshot_path = metadata.get("registry_snapshot_path", "")
    if registry_snapshot_path:
        src_reg = Path(registry_snapshot_path)
        if src_reg.exists():
            dst_reg = _config.settings.resolved_project_registry_path
            dst_reg.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src_reg, dst_reg)
            registry_restored = True

    chroma_snapshot_path = metadata.get("chroma_snapshot_path", "")
    if include_chroma is None:
        include_chroma = bool(chroma_snapshot_path)
    chroma_restored = False
    if include_chroma and chroma_snapshot_path:
        src_chroma = Path(chroma_snapshot_path)
        if src_chroma.exists() and src_chroma.is_dir():
            dst_chroma = _config.settings.chroma_path
            if dst_chroma.exists():
                shutil.rmtree(dst_chroma)
            shutil.copytree(src_chroma, dst_chroma)
            chroma_restored = True

    restored_integrity_ok = False
    integrity_error: str | None = None
    try:
        with sqlite3.connect(str(target_db)) as conn:
            row = conn.execute("PRAGMA integrity_check").fetchone()
            restored_integrity_ok = bool(row and row[0] == "ok")
            if not restored_integrity_ok:
                integrity_error = row[0] if row else "no_row"
    except sqlite3.DatabaseError as exc:
        integrity_error = f"db_open_failed: {exc}"

    result: dict = {
        "stamp": stamp,
        "pre_restore_snapshot": pre_snapshot_stamp,
        "target_db": str(target_db),
        "db_restored": True,
        "registry_restored": registry_restored,
        "chroma_restored": chroma_restored,
        "restored_integrity_ok": restored_integrity_ok,
    }
    if integrity_error:
        result["integrity_error"] = integrity_error

    log.info(
        "runtime_backup_restored",
        stamp=stamp,
        pre_restore_snapshot=pre_snapshot_stamp,
        registry_restored=registry_restored,
        chroma_restored=chroma_restored,
        integrity_ok=restored_integrity_ok,
    )
    return result


def _backup_sqlite_db(source_path: Path, dest_path: Path) -> None:
    source_conn = sqlite3.connect(str(source_path))
    dest_conn = sqlite3.connect(str(dest_path))

@@ -242,7 +402,89 @@ def _copy_directory_tree(source: Path, dest: Path) -> tuple[int, int]:


def main() -> None:
    """CLI entry point for the backup module.

    Supports four subcommands:

    - ``create``    run ``create_runtime_backup`` (default if none given)
    - ``list``      list all runtime backup snapshots
    - ``validate``  validate a specific snapshot by stamp
    - ``restore``   restore a specific snapshot by stamp

    The restore subcommand is the one used by the backup/restore drill
    and MUST be run only when the AtoCore service is stopped. It takes
    ``--confirm-service-stopped`` as an explicit acknowledgment.
    """
    import argparse

    parser = argparse.ArgumentParser(
        prog="python -m atocore.ops.backup",
        description="AtoCore runtime backup create/list/validate/restore",
    )
    sub = parser.add_subparsers(dest="command")

    p_create = sub.add_parser("create", help="create a new runtime backup")
    p_create.add_argument(
        "--chroma",
        action="store_true",
        help="also snapshot the Chroma vector store (cold copy)",
    )

    sub.add_parser("list", help="list runtime backup snapshots")

    p_validate = sub.add_parser("validate", help="validate a snapshot by stamp")
    p_validate.add_argument("stamp", help="snapshot stamp (e.g. 20260409T010203Z)")

    p_restore = sub.add_parser(
        "restore",
        help="restore a snapshot by stamp (service must be stopped)",
    )
    p_restore.add_argument("stamp", help="snapshot stamp to restore")
    p_restore.add_argument(
        "--confirm-service-stopped",
        action="store_true",
        help="explicit acknowledgment that the AtoCore container is stopped",
    )
    p_restore.add_argument(
        "--no-pre-snapshot",
        action="store_true",
        help="skip the pre-restore safety snapshot of current state",
    )
    chroma_group = p_restore.add_mutually_exclusive_group()
    chroma_group.add_argument(
        "--chroma",
        dest="include_chroma",
        action="store_true",
        default=None,
        help="force-restore the Chroma snapshot",
    )
    chroma_group.add_argument(
        "--no-chroma",
        dest="include_chroma",
        action="store_false",
        help="skip the Chroma snapshot even if it was captured",
    )

    args = parser.parse_args()
    command = args.command or "create"

    if command == "create":
        include_chroma = getattr(args, "chroma", False)
        result = create_runtime_backup(include_chroma=include_chroma)
    elif command == "list":
        result = {"backups": list_runtime_backups()}
    elif command == "validate":
        result = validate_backup(args.stamp)
    elif command == "restore":
        result = restore_runtime_backup(
            args.stamp,
            include_chroma=args.include_chroma,
            pre_restore_snapshot=not args.no_pre_snapshot,
            confirm_service_stopped=args.confirm_service_stopped,
        )
    else:  # pragma: no cover — argparse guards this
        parser.error(f"unknown command: {command}")

    print(json.dumps(result, indent=2, ensure_ascii=True))

tests/test_backup.py:

@@ -1,14 +1,17 @@
"""Tests for runtime backup creation and restore."""

import json
import sqlite3
from datetime import UTC, datetime

import pytest

import atocore.config as config
from atocore.models.database import init_db
from atocore.ops.backup import (
    create_runtime_backup,
    list_runtime_backups,
    restore_runtime_backup,
    validate_backup,
)

@@ -156,3 +159,242 @@ def test_create_runtime_backup_handles_missing_registry(tmp_path, monkeypatch):
        config.settings = original_settings

    assert result["registry_snapshot_path"] == ""


def test_restore_refuses_without_confirm_service_stopped(tmp_path, monkeypatch):
    monkeypatch.setenv("ATOCORE_DATA_DIR", str(tmp_path / "data"))
    monkeypatch.setenv("ATOCORE_BACKUP_DIR", str(tmp_path / "backups"))
    monkeypatch.setenv(
        "ATOCORE_PROJECT_REGISTRY_PATH", str(tmp_path / "config" / "project-registry.json")
    )

    original_settings = config.settings
    try:
        config.settings = config.Settings()
        init_db()
        create_runtime_backup(datetime(2026, 4, 9, 10, 0, 0, tzinfo=UTC))

        with pytest.raises(RuntimeError, match="confirm_service_stopped"):
            restore_runtime_backup("20260409T100000Z")
    finally:
        config.settings = original_settings


def test_restore_raises_on_invalid_backup(tmp_path, monkeypatch):
    monkeypatch.setenv("ATOCORE_DATA_DIR", str(tmp_path / "data"))
    monkeypatch.setenv("ATOCORE_BACKUP_DIR", str(tmp_path / "backups"))
    monkeypatch.setenv(
        "ATOCORE_PROJECT_REGISTRY_PATH", str(tmp_path / "config" / "project-registry.json")
    )

    original_settings = config.settings
    try:
        config.settings = config.Settings()
        init_db()
        with pytest.raises(RuntimeError, match="failed validation"):
            restore_runtime_backup(
                "20250101T000000Z", confirm_service_stopped=True
            )
    finally:
        config.settings = original_settings


def test_restore_round_trip_reverses_post_backup_mutations(tmp_path, monkeypatch):
    """Canonical drill: snapshot -> mutate -> restore -> mutation gone."""
    monkeypatch.setenv("ATOCORE_DATA_DIR", str(tmp_path / "data"))
    monkeypatch.setenv("ATOCORE_BACKUP_DIR", str(tmp_path / "backups"))
    monkeypatch.setenv(
        "ATOCORE_PROJECT_REGISTRY_PATH", str(tmp_path / "config" / "project-registry.json")
    )

    registry_path = tmp_path / "config" / "project-registry.json"
    registry_path.parent.mkdir(parents=True)
    registry_path.write_text(
        '{"projects":[{"id":"p01-example","aliases":[],'
        '"ingest_roots":[{"source":"vault","subpath":"incoming/projects/p01-example"}]}]}\n',
        encoding="utf-8",
    )

    original_settings = config.settings
    try:
        config.settings = config.Settings()
        init_db()

        # 1. Seed baseline state that should SURVIVE the restore.
        with sqlite3.connect(str(config.settings.db_path)) as conn:
            conn.execute(
                "INSERT INTO projects (id, name) VALUES (?, ?)",
                ("p01", "Baseline Project"),
            )
            conn.commit()

        # 2. Create the backup we're going to restore to.
        create_runtime_backup(datetime(2026, 4, 9, 11, 0, 0, tzinfo=UTC))
        stamp = "20260409T110000Z"

        # 3. Mutate live state AFTER the backup — this is what the
        #    restore should reverse.
        with sqlite3.connect(str(config.settings.db_path)) as conn:
            conn.execute(
                "INSERT INTO projects (id, name) VALUES (?, ?)",
                ("p99", "Post Backup Mutation"),
            )
            conn.commit()

        # Confirm the mutation is present before restore.
        with sqlite3.connect(str(config.settings.db_path)) as conn:
            row = conn.execute(
                "SELECT name FROM projects WHERE id = ?", ("p99",)
            ).fetchone()
        assert row is not None and row[0] == "Post Backup Mutation"

        # 4. Restore — the drill procedure. Explicit confirm_service_stopped.
        result = restore_runtime_backup(
            stamp, confirm_service_stopped=True
        )

        # 5. Verify restore report
        assert result["stamp"] == stamp
        assert result["db_restored"] is True
        assert result["registry_restored"] is True
        assert result["restored_integrity_ok"] is True
        assert result["pre_restore_snapshot"] is not None

        # 6. Verify live state reflects the restore: baseline survived,
        #    post-backup mutation is gone.
        with sqlite3.connect(str(config.settings.db_path)) as conn:
            baseline = conn.execute(
                "SELECT name FROM projects WHERE id = ?", ("p01",)
            ).fetchone()
            mutation = conn.execute(
                "SELECT name FROM projects WHERE id = ?", ("p99",)
            ).fetchone()
        assert baseline is not None and baseline[0] == "Baseline Project"
        assert mutation is None

        # 7. Pre-restore safety snapshot DOES contain the mutation —
        #    it captured current state before overwriting. This is the
        #    reversibility guarantee: the operator can restore back to
        #    it if the restore itself was a mistake.
        pre_stamp = result["pre_restore_snapshot"]
        pre_validation = validate_backup(pre_stamp)
        assert pre_validation["valid"] is True
        pre_db_path = pre_validation["metadata"]["db_snapshot_path"]
        with sqlite3.connect(pre_db_path) as conn:
            pre_mutation = conn.execute(
                "SELECT name FROM projects WHERE id = ?", ("p99",)
            ).fetchone()
        assert pre_mutation is not None and pre_mutation[0] == "Post Backup Mutation"
    finally:
        config.settings = original_settings


def test_restore_round_trip_with_chroma(tmp_path, monkeypatch):
    monkeypatch.setenv("ATOCORE_DATA_DIR", str(tmp_path / "data"))
    monkeypatch.setenv("ATOCORE_BACKUP_DIR", str(tmp_path / "backups"))
    monkeypatch.setenv(
        "ATOCORE_PROJECT_REGISTRY_PATH", str(tmp_path / "config" / "project-registry.json")
    )

    original_settings = config.settings
    try:
        config.settings = config.Settings()
        init_db()

        # Seed baseline chroma state that should survive restore.
        chroma_dir = config.settings.chroma_path
        (chroma_dir / "coll-a").mkdir(parents=True, exist_ok=True)
        (chroma_dir / "coll-a" / "baseline.bin").write_bytes(b"baseline")

        create_runtime_backup(
            datetime(2026, 4, 9, 12, 0, 0, tzinfo=UTC), include_chroma=True
        )
        stamp = "20260409T120000Z"

        # Mutate chroma after backup: add a file + remove baseline.
        (chroma_dir / "coll-a" / "post_backup.bin").write_bytes(b"post")
        (chroma_dir / "coll-a" / "baseline.bin").unlink()

        result = restore_runtime_backup(stamp, confirm_service_stopped=True)

        assert result["chroma_restored"] is True
        assert (chroma_dir / "coll-a" / "baseline.bin").exists()
        assert not (chroma_dir / "coll-a" / "post_backup.bin").exists()
    finally:
        config.settings = original_settings


def test_restore_skips_pre_snapshot_when_requested(tmp_path, monkeypatch):
    monkeypatch.setenv("ATOCORE_DATA_DIR", str(tmp_path / "data"))
    monkeypatch.setenv("ATOCORE_BACKUP_DIR", str(tmp_path / "backups"))
    monkeypatch.setenv(
        "ATOCORE_PROJECT_REGISTRY_PATH", str(tmp_path / "config" / "project-registry.json")
    )

    original_settings = config.settings
    try:
        config.settings = config.Settings()
        init_db()
        create_runtime_backup(datetime(2026, 4, 9, 13, 0, 0, tzinfo=UTC))

        before_count = len(list_runtime_backups())

        result = restore_runtime_backup(
            "20260409T130000Z",
            confirm_service_stopped=True,
            pre_restore_snapshot=False,
        )

        after_count = len(list_runtime_backups())
        assert result["pre_restore_snapshot"] is None
        assert after_count == before_count
    finally:
        config.settings = original_settings


def test_restore_cleans_stale_wal_sidecars(tmp_path, monkeypatch):
    """Stale WAL/SHM sidecars must not carry bytes past the restore.

    Note: after restore runs, PRAGMA integrity_check reopens the
    restored db, which may legitimately recreate a fresh -wal. So we
    assert that the STALE byte marker no longer appears in either
    sidecar, not that the files are absent.
    """
    monkeypatch.setenv("ATOCORE_DATA_DIR", str(tmp_path / "data"))
    monkeypatch.setenv("ATOCORE_BACKUP_DIR", str(tmp_path / "backups"))
    monkeypatch.setenv(
        "ATOCORE_PROJECT_REGISTRY_PATH", str(tmp_path / "config" / "project-registry.json")
    )

    original_settings = config.settings
    try:
        config.settings = config.Settings()
        init_db()
        create_runtime_backup(datetime(2026, 4, 9, 14, 0, 0, tzinfo=UTC))

        # Write fake stale WAL/SHM next to the live db with an
        # unmistakable marker.
        target_db = config.settings.db_path
        wal = target_db.with_name(target_db.name + "-wal")
        shm = target_db.with_name(target_db.name + "-shm")
        stale_marker = b"STALE-SIDECAR-MARKER-DO-NOT-SURVIVE"
        wal.write_bytes(stale_marker)
        shm.write_bytes(stale_marker)
        assert wal.exists() and shm.exists()

        restore_runtime_backup("20260409T140000Z", confirm_service_stopped=True)

        # The restored db must pass integrity check (tested elsewhere);
        # here we just confirm that no file next to it still contains
        # the stale marker from the old live process.
        for sidecar in (wal, shm):
            if sidecar.exists():
                assert stale_marker not in sidecar.read_bytes(), (
                    f"{sidecar.name} still carries stale marker"
                )
    finally:
        config.settings = original_settings
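The checkpoint → unlink → copy → integrity-check core that these tests exercise can be sketched stdlib-only. `restore_db` below is an illustrative stand-in, not the atocore implementation: it mirrors steps 3-5 and 8 of `restore_runtime_backup` against throwaway databases.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path


def restore_db(snapshot: Path, target: Path) -> None:
    """Illustrative restore core: checkpoint, unlink sidecars, copy, verify."""
    # 1. Flush and truncate the WAL so the OS releases file handles
    #    (Windows needs this before the sidecars can be unlinked).
    conn = sqlite3.connect(target)
    conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
    conn.close()
    # 2. Drop stale sidecars; tolerate files that are already gone.
    for suffix in ("-wal", "-shm"):
        target.with_name(target.name + suffix).unlink(missing_ok=True)
    # 3. Overwrite the live db with the snapshot, preserving metadata.
    shutil.copy2(snapshot, target)
    # 4. Sanity-check the restored file.
    conn = sqlite3.connect(target)
    ok = conn.execute("PRAGMA integrity_check").fetchone()[0]
    conn.close()
    assert ok == "ok"


tmp = Path(tempfile.mkdtemp())
snap, live = tmp / "snap.db", tmp / "live.db"
for path, name in ((snap, "baseline"), (live, "mutated")):
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE projects (name TEXT)")
    conn.execute("INSERT INTO projects VALUES (?)", (name,))
    conn.commit()
    conn.close()

# Plant a stale sidecar next to the live db, as the sidecar test does.
live.with_name(live.name + "-wal").write_bytes(b"STALE")

restore_db(snap, live)

conn = sqlite3.connect(live)
restored = conn.execute("SELECT name FROM projects").fetchone()[0]
conn.close()
print(restored)  # prints "baseline"
```

The same flow underlies the drill: the stale sidecar is removed before the copy, and the post-backup mutation ("mutated") is replaced by the snapshot state ("baseline").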