diff --git a/docs/backup-restore-drill.md b/docs/backup-restore-drill.md
new file mode 100644
index 0000000..4a88ab6
--- /dev/null
+++ b/docs/backup-restore-drill.md
@@ -0,0 +1,296 @@
+# Backup / Restore Drill
+
+## Purpose
+
+Before turning on any automation that writes to AtoCore continuously
+(auto-capture of Claude Code sessions, automated source ingestion,
+reinforcement sweeps), we need to know — with certainty — that a
+backup can actually be restored. A backup you've never restored is
+not a backup; it's a file that happens to be named that way.
+
+This runbook walks through the canonical drill: take a snapshot,
+mutate live state, stop the service, restore from the snapshot,
+start the service, and verify the mutation is reversed. When the
+drill passes, the runtime store has a trustworthy rollback.
+
+## What gets backed up
+
+`src/atocore/ops/backup.py::create_runtime_backup()` writes the
+following into `$ATOCORE_BACKUP_DIR/snapshots/<stamp>/`:
+
+| Component | How | Hot/Cold | Notes |
+|---|---|---|---|
+| SQLite (`atocore.db`) | `conn.backup()` online API | **hot** | Safe with service running; self-contained main file, no WAL sidecar. |
+| Project registry JSON | file copy | cold | Only if the file exists. |
+| Chroma vector store | `shutil.copytree` | **cold** | Only when `include_chroma=True`. Caller must hold `exclusive_ingestion()` so nothing writes during the copy — the `POST /admin/backup?include_chroma=true` route does this automatically. |
+| `backup-metadata.json` | JSON blob | — | Records paths, sizes, and whether Chroma was included. Restore reads this to know what to pull back. |
+
+Things that are **not** in the backup and must be handled separately:
+
+- The `.env` file under `deploy/dalidou/` — secrets live out of git
+  and out of the backup on purpose. The operator must re-place it
+  on any fresh host.
+- The source content under `sources/vault` and `sources/drive` — + these are read-only inputs by convention, owned by AtoVault / + AtoDrive, and backed up there. +- Any running transient state (in-flight HTTP requests, ingestion + queues). Stop the service cleanly if you care about those. + +## What restore does + +`restore_runtime_backup(stamp, confirm_service_stopped=True)`: + +1. **Validates** the backup first via `validate_backup()` — + refuses to run on any error (missing metadata, corrupt snapshot + db, etc.). +2. **Takes a pre-restore safety snapshot** of the current state + (SQLite only, not Chroma — to keep it fast) and returns its + stamp. This is the reversibility guarantee: if the restore was + the wrong call, you can roll it back by restoring the + pre-restore snapshot. +3. **Forces a WAL checkpoint** on the current db + (`PRAGMA wal_checkpoint(TRUNCATE)`) to flush any lingering + writes and release OS file handles on `-wal`/`-shm`, so the + copy step won't race a half-open sqlite connection. +4. **Removes stale WAL/SHM sidecars** next to the target db. + The snapshot `.db` is a self-contained main-file image with no + WAL of its own; leftover `-wal` from the old live process + would desync against the restored main file. +5. **Copies the snapshot db** over the live db path. +6. **Restores the registry JSON** if the snapshot captured one. +7. **Restores the Chroma tree** if the snapshot captured one and + `include_chroma` resolves to true (defaults to whether the + snapshot has Chroma). +8. **Runs `PRAGMA integrity_check`** on the restored db and + reports the result alongside a summary of what was touched. + +If `confirm_service_stopped` is not passed, the function refuses — +this is deliberate. Hot-restoring into a running service is not +supported and would corrupt state. + +## The drill + +Run this from a Dalidou host with the AtoCore container already +deployed and healthy. The whole drill takes under two minutes. 
It +does not touch source content or disturb any `.env` secrets. + +### Step 1. Capture a snapshot via the HTTP API + +The running service holds the db; use the admin route so the +Chroma snapshot is taken under `exclusive_ingestion()`. The +endpoint takes a JSON body (not a query string): + +```bash +curl -fsS -X POST 'http://127.0.0.1:8100/admin/backup' \ + -H 'Content-Type: application/json' \ + -d '{"include_chroma": true}' \ + | python3 -m json.tool +``` + +Record the `backup_root` and note the stamp (the last path segment, +e.g. `20260409T012345Z`). That stamp is the input to the restore +step. + +### Step 2. Record a known piece of live state + +Pick something small and unambiguous to use as a marker. The +simplest is the current health snapshot plus a memory count: + +```bash +curl -fsS 'http://127.0.0.1:8100/health' | python3 -m json.tool +``` + +Note the `memory_count`, `interaction_count`, and `build_sha`. These +are your pre-drill baseline. + +### Step 3. Mutate live state AFTER the backup + +Write something the restore should reverse. Any write endpoint is +fine — a throwaway test memory is the cleanest. The request body +must include `memory_type` (the AtoCore memory schema requires it): + +```bash +curl -fsS -X POST 'http://127.0.0.1:8100/memory' \ + -H 'Content-Type: application/json' \ + -d '{ + "memory_type": "note", + "content": "DRILL-MARKER: this memory should not survive the restore", + "project": "drill", + "confidence": 1.0 + }' \ + | python3 -m json.tool +``` + +Record the returned `id`. Confirm it's there: + +```bash +curl -fsS 'http://127.0.0.1:8100/health' | python3 -m json.tool +# memory_count should be baseline + 1 + +# And you can list the drill-project memories directly: +curl -fsS 'http://127.0.0.1:8100/memory?project=drill' | python3 -m json.tool +# should return the DRILL-MARKER memory +``` + +### Step 4. 
Stop the service
+
+```bash
+cd /srv/storage/atocore/app/deploy/dalidou
+docker compose down
+```
+
+Wait for the container to actually exit:
+
+```bash
+docker compose ps
+# atocore should be gone or Exited
+```
+
+### Step 5. Restore from the snapshot
+
+Run the restore inside a one-shot container that reuses the same
+volumes as the live service. This guarantees the paths resolve
+identically to the running container's view.
+
+```bash
+cd /srv/storage/atocore/app/deploy/dalidou
+docker compose run --rm --entrypoint python atocore \
+  -m atocore.ops.backup restore \
+  <stamp> \
+  --confirm-service-stopped
+```
+
+The output is JSON; the important fields are:
+
+- `pre_restore_snapshot`: stamp of the safety snapshot of live
+  state at the moment of restore. **Write this down.** If the
+  restore turns out to be the wrong call, this is how you roll
+  it back.
+- `db_restored`: `true`
+- `registry_restored`: `true` if the backup had a registry
+- `chroma_restored`: `true` if the backup had a chroma snapshot
+- `restored_integrity_ok`: **must be `true`** — if this is false,
+  STOP and do not start the service; investigate the integrity
+  error first.
+
+If restoration fails at any step, the function raises a clean
+`RuntimeError` and nothing partial is committed past the main file
+swap. The pre-restore safety snapshot is your rollback anchor.
+
+### Step 6. Start the service back up
+
+```bash
+cd /srv/storage/atocore/app/deploy/dalidou
+docker compose up -d
+```
+
+Wait for `/health` to respond:
+
+```bash
+for i in 1 2 3 4 5 6 7 8 9 10; do
+  curl -fsS 'http://127.0.0.1:8100/health' \
+    && break || { echo "not ready ($i/10)"; sleep 3; }
+done
+```
+
+### Step 7.
Verify the drill marker is gone
+
+```bash
+curl -fsS 'http://127.0.0.1:8100/health' | python3 -m json.tool
+# memory_count should equal the Step 2 baseline, NOT baseline + 1
+```
+
+You can also list the drill-project memories directly:
+
+```bash
+curl -fsS 'http://127.0.0.1:8100/memory?project=drill' | python3 -m json.tool
+# should return an empty list — the DRILL-MARKER memory was rolled back
+```
+
+For a semantic-retrieval cross-check, issue a query (the `/query`
+endpoint takes `prompt`, not `query`):
+
+```bash
+curl -fsS -X POST 'http://127.0.0.1:8100/query' \
+  -H 'Content-Type: application/json' \
+  -d '{"prompt": "DRILL-MARKER drill marker", "top_k": 5}' \
+  | python3 -m json.tool
+# should not return the DRILL-MARKER memory in the hits
+```
+
+If the marker is gone and `memory_count` matches the baseline, the
+drill **passed**. The runtime store has a trustworthy rollback.
+
+### Step 8. (Optional) Clean up the safety snapshot
+
+If everything went smoothly you can leave the pre-restore safety
+snapshot on disk for a few days as a paranoia buffer. There's no
+automatic cleanup yet — `list_runtime_backups()` will show it, and
+you can remove it by hand once you're confident:
+
+```bash
+rm -rf /srv/storage/atocore/backups/snapshots/<stamp>
+```
+
+## Failure modes and recovery
+
+### Restore reports `restored_integrity_ok: false`
+
+The copied db failed `PRAGMA integrity_check`. Do **not** start
+the service. This usually means either the source snapshot was
+itself corrupt (and `validate_backup` should have caught it — file
+a bug if it didn't), or the copy was interrupted. Options:
+
+1. Validate the source snapshot directly:
+   `python -m atocore.ops.backup validate <stamp>`
+2. Pick a different, older snapshot and retry the restore.
+3. Roll the db back to your pre-restore safety snapshot.
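Option 2 above ("pick a different, older snapshot") reduces to choosing the newest stamp that still validates. A minimal sketch, assuming only that validation returns a dict with a `valid` key the way `validate_backup` does; the stamps and reports here are made up:

```python
def newest_valid_stamp(stamps, validate):
    """Return the newest stamp whose snapshot validates, or None.

    UTC stamps like 20260409T012345Z sort lexicographically in
    chronological order, so a reverse sort walks newest-first.
    """
    for stamp in sorted(stamps, reverse=True):
        if validate(stamp).get("valid"):
            return stamp
    return None


# Fake validation reports: the newest snapshot is corrupt, so the
# next-newest valid stamp wins.
reports = {
    "20260409T120000Z": {"valid": False, "errors": ["snapshot db corrupt"]},
    "20260409T110000Z": {"valid": True},
    "20260408T090000Z": {"valid": True},
}
print(newest_valid_stamp(reports, reports.get))  # 20260409T110000Z
```

In real use this would be driven by `list_runtime_backups()` and `validate_backup`; the exact shape of the listing is not shown in this diff, so treat the wiring as an exercise.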
### The live container won't start after restore
+
+Check the container logs:
+
+```bash
+cd /srv/storage/atocore/app/deploy/dalidou
+docker compose logs --tail=100 atocore
+```
+
+Common causes:
+
+- Schema drift between the snapshot and the current code version.
+  `_apply_migrations` in `src/atocore/models/database.py` is
+  idempotent and only ADDs columns, so running newer code against
+  an older restored snapshot is usually absorbed by the forward
+  migrations. Running older code against a newer snapshot is the
+  direction more likely to hit unexpected state; verify before
+  trusting it.
+- Chroma and SQLite disagreeing about what chunks exist. The
+  backup captures them together to minimize this, but if you
+  restore SQLite without Chroma (`--no-chroma`), retrieval may
+  return stale vectors. Re-ingest if this happens.
+
+### The drill marker is still present after restore
+
+Something went wrong. Possible causes:
+
+- You restored a snapshot taken AFTER the drill marker was
+  written (wrong stamp).
+- The service was writing during the drill and committed the
+  marker before `docker compose down`. Double-check the order.
+- The restore silently skipped the db step. Check the restore
+  output for `db_restored: true` and `restored_integrity_ok: true`.
+
+Roll back to the pre-restore safety snapshot and retry with the
+correct source snapshot.
+
+## When to run this drill
+
+- **Before** enabling any new write-path automation (auto-capture,
+  automated ingestion, reinforcement sweeps, scheduled extraction).
+- **After** any change to `src/atocore/ops/backup.py` or the
+  schema migrations in `src/atocore/models/database.py`.
+- **After** a Dalidou OS upgrade or docker version bump.
+- **Monthly** as a standing operational check.
+
+Record each drill run (pass/fail) somewhere durable — even a line
+in the project journal is enough. A drill you ran once and never
+again is barely more than a drill you never ran.
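For the durable record, the pass/fail judgment over a restore report can be reduced to a tiny checker. This is a hypothetical helper, not part of `atocore.ops.backup`, keyed to the output fields documented in Step 5:

```python
def assess_restore_report(report: dict) -> list[str]:
    """Return reasons NOT to start the service after a restore.

    An empty list means the report looks safe to proceed on. Field
    names match the restore output described in Step 5.
    """
    problems = []
    if not report.get("db_restored"):
        problems.append("db_restored is not true")
    if not report.get("restored_integrity_ok"):
        problems.append("restored_integrity_ok is not true; do not start the service")
    if report.get("pre_restore_snapshot") is None:
        problems.append("no pre-restore safety snapshot recorded (no rollback anchor)")
    return problems


# A healthy report raises no objections:
ok = {
    "db_restored": True,
    "registry_restored": True,
    "chroma_restored": True,
    "restored_integrity_ok": True,
    "pre_restore_snapshot": "20260409T012345Z",
}
print(assess_restore_report(ok))  # []

# An integrity failure is flagged:
print(assess_restore_report(dict(ok, restored_integrity_ok=False)))
```

Note that `pre_restore_snapshot` is legitimately `None` when the restore ran with `--no-pre-snapshot`, so a real checker might downgrade that case to a warning rather than a blocker.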
diff --git a/src/atocore/ops/backup.py b/src/atocore/ops/backup.py index df25d4e..825ae1c 100644 --- a/src/atocore/ops/backup.py +++ b/src/atocore/ops/backup.py @@ -216,6 +216,166 @@ def validate_backup(stamp: str) -> dict: return result +def restore_runtime_backup( + stamp: str, + *, + include_chroma: bool | None = None, + pre_restore_snapshot: bool = True, + confirm_service_stopped: bool = False, +) -> dict: + """Restore a previously captured runtime backup. + + CRITICAL: the AtoCore service MUST be stopped before calling this. + Overwriting a live SQLite database corrupts state and can break + the running container's open connections. The caller must pass + ``confirm_service_stopped=True`` as an explicit acknowledgment — + otherwise this function refuses to run. + + The restore procedure: + + 1. Validate the backup via ``validate_backup``; refuse on any error. + 2. (default) Create a pre-restore safety snapshot of the CURRENT + state so the restore itself is reversible. The snapshot stamp + is returned in the result for the operator to record. + 3. Remove stale SQLite WAL/SHM sidecar files next to the target db + before copying — the snapshot is a self-contained main-file + image from ``conn.backup()``, and leftover WAL/SHM from the old + live db would desync against the restored main file. + 4. Copy the snapshot db over the target db path. + 5. Restore the project registry file if the snapshot captured one. + 6. Restore the Chroma directory if ``include_chroma`` resolves to + true. When ``include_chroma is None`` the function defers to + whether the snapshot captured Chroma (the common case). + 7. Run ``PRAGMA integrity_check`` on the restored db and report + the result. + + Returns a dict describing what was restored. On refused restore + (service still running, validation failed) raises ``RuntimeError``. 
+ """ + if not confirm_service_stopped: + raise RuntimeError( + "restore_runtime_backup refuses to run without " + "confirm_service_stopped=True — stop the AtoCore container " + "first (e.g. `docker compose down` from deploy/dalidou) " + "before calling this function" + ) + + validation = validate_backup(stamp) + if not validation.get("valid"): + raise RuntimeError( + f"backup {stamp} failed validation: {validation.get('errors')}" + ) + metadata = validation.get("metadata") or {} + + pre_snapshot_stamp: str | None = None + if pre_restore_snapshot: + pre = create_runtime_backup(include_chroma=False) + pre_snapshot_stamp = Path(pre["backup_root"]).name + + target_db = _config.settings.db_path + source_db = Path(metadata.get("db_snapshot_path", "")) + if not source_db.exists(): + raise RuntimeError( + f"db snapshot not found at {source_db} — backup " + f"metadata may be stale" + ) + + # Force sqlite to flush any lingering WAL into the main file and + # release OS-level file handles on -wal/-shm before we swap the + # main file. Passing through conn.backup() in the pre-restore + # snapshot can leave sidecars momentarily locked on Windows; + # an explicit checkpoint(TRUNCATE) is the reliable way to flush + # and release. Best-effort: if the target db can't be opened + # (missing, corrupt), fall through and trust the copy step. + if target_db.exists(): + try: + with sqlite3.connect(str(target_db)) as checkpoint_conn: + checkpoint_conn.execute("PRAGMA wal_checkpoint(TRUNCATE)") + except sqlite3.DatabaseError as exc: + log.warning( + "restore_pre_checkpoint_failed", + target_db=str(target_db), + error=str(exc), + ) + + # Remove stale WAL/SHM sidecars from the old live db so SQLite + # can't read inconsistent state on next open. Tolerant to + # Windows file-lock races — the subsequent copy replaces the + # main file anyway, and the integrity check afterward is the + # actual correctness signal. 
+ wal_path = target_db.with_name(target_db.name + "-wal") + shm_path = target_db.with_name(target_db.name + "-shm") + for stale in (wal_path, shm_path): + if stale.exists(): + try: + stale.unlink() + except OSError as exc: + log.warning( + "restore_sidecar_unlink_failed", + path=str(stale), + error=str(exc), + ) + + target_db.parent.mkdir(parents=True, exist_ok=True) + shutil.copy2(source_db, target_db) + + registry_restored = False + registry_snapshot_path = metadata.get("registry_snapshot_path", "") + if registry_snapshot_path: + src_reg = Path(registry_snapshot_path) + if src_reg.exists(): + dst_reg = _config.settings.resolved_project_registry_path + dst_reg.parent.mkdir(parents=True, exist_ok=True) + shutil.copy2(src_reg, dst_reg) + registry_restored = True + + chroma_snapshot_path = metadata.get("chroma_snapshot_path", "") + if include_chroma is None: + include_chroma = bool(chroma_snapshot_path) + chroma_restored = False + if include_chroma and chroma_snapshot_path: + src_chroma = Path(chroma_snapshot_path) + if src_chroma.exists() and src_chroma.is_dir(): + dst_chroma = _config.settings.chroma_path + if dst_chroma.exists(): + shutil.rmtree(dst_chroma) + shutil.copytree(src_chroma, dst_chroma) + chroma_restored = True + + restored_integrity_ok = False + integrity_error: str | None = None + try: + with sqlite3.connect(str(target_db)) as conn: + row = conn.execute("PRAGMA integrity_check").fetchone() + restored_integrity_ok = bool(row and row[0] == "ok") + if not restored_integrity_ok: + integrity_error = row[0] if row else "no_row" + except sqlite3.DatabaseError as exc: + integrity_error = f"db_open_failed: {exc}" + + result: dict = { + "stamp": stamp, + "pre_restore_snapshot": pre_snapshot_stamp, + "target_db": str(target_db), + "db_restored": True, + "registry_restored": registry_restored, + "chroma_restored": chroma_restored, + "restored_integrity_ok": restored_integrity_ok, + } + if integrity_error: + result["integrity_error"] = integrity_error + + 
log.info( + "runtime_backup_restored", + stamp=stamp, + pre_restore_snapshot=pre_snapshot_stamp, + registry_restored=registry_restored, + chroma_restored=chroma_restored, + integrity_ok=restored_integrity_ok, + ) + return result + + def _backup_sqlite_db(source_path: Path, dest_path: Path) -> None: source_conn = sqlite3.connect(str(source_path)) dest_conn = sqlite3.connect(str(dest_path)) @@ -242,7 +402,89 @@ def _copy_directory_tree(source: Path, dest: Path) -> tuple[int, int]: def main() -> None: - result = create_runtime_backup() + """CLI entry point for the backup module. + + Supports four subcommands: + + - ``create`` run ``create_runtime_backup`` (default if none given) + - ``list`` list all runtime backup snapshots + - ``validate`` validate a specific snapshot by stamp + - ``restore`` restore a specific snapshot by stamp + + The restore subcommand is the one used by the backup/restore drill + and MUST be run only when the AtoCore service is stopped. It takes + ``--confirm-service-stopped`` as an explicit acknowledgment. + """ + import argparse + + parser = argparse.ArgumentParser( + prog="python -m atocore.ops.backup", + description="AtoCore runtime backup create/list/validate/restore", + ) + sub = parser.add_subparsers(dest="command") + + p_create = sub.add_parser("create", help="create a new runtime backup") + p_create.add_argument( + "--chroma", + action="store_true", + help="also snapshot the Chroma vector store (cold copy)", + ) + + sub.add_parser("list", help="list runtime backup snapshots") + + p_validate = sub.add_parser("validate", help="validate a snapshot by stamp") + p_validate.add_argument("stamp", help="snapshot stamp (e.g. 
20260409T010203Z)") + + p_restore = sub.add_parser( + "restore", + help="restore a snapshot by stamp (service must be stopped)", + ) + p_restore.add_argument("stamp", help="snapshot stamp to restore") + p_restore.add_argument( + "--confirm-service-stopped", + action="store_true", + help="explicit acknowledgment that the AtoCore container is stopped", + ) + p_restore.add_argument( + "--no-pre-snapshot", + action="store_true", + help="skip the pre-restore safety snapshot of current state", + ) + chroma_group = p_restore.add_mutually_exclusive_group() + chroma_group.add_argument( + "--chroma", + dest="include_chroma", + action="store_true", + default=None, + help="force-restore the Chroma snapshot", + ) + chroma_group.add_argument( + "--no-chroma", + dest="include_chroma", + action="store_false", + help="skip the Chroma snapshot even if it was captured", + ) + + args = parser.parse_args() + command = args.command or "create" + + if command == "create": + include_chroma = getattr(args, "chroma", False) + result = create_runtime_backup(include_chroma=include_chroma) + elif command == "list": + result = {"backups": list_runtime_backups()} + elif command == "validate": + result = validate_backup(args.stamp) + elif command == "restore": + result = restore_runtime_backup( + args.stamp, + include_chroma=args.include_chroma, + pre_restore_snapshot=not args.no_pre_snapshot, + confirm_service_stopped=args.confirm_service_stopped, + ) + else: # pragma: no cover — argparse guards this + parser.error(f"unknown command: {command}") + print(json.dumps(result, indent=2, ensure_ascii=True)) diff --git a/tests/test_backup.py b/tests/test_backup.py index ee601f3..7b0ef0f 100644 --- a/tests/test_backup.py +++ b/tests/test_backup.py @@ -1,14 +1,17 @@ -"""Tests for runtime backup creation.""" +"""Tests for runtime backup creation and restore.""" import json import sqlite3 from datetime import UTC, datetime +import pytest + import atocore.config as config from atocore.models.database import 
init_db from atocore.ops.backup import ( create_runtime_backup, list_runtime_backups, + restore_runtime_backup, validate_backup, ) @@ -156,3 +159,242 @@ def test_create_runtime_backup_handles_missing_registry(tmp_path, monkeypatch): config.settings = original_settings assert result["registry_snapshot_path"] == "" + + +def test_restore_refuses_without_confirm_service_stopped(tmp_path, monkeypatch): + monkeypatch.setenv("ATOCORE_DATA_DIR", str(tmp_path / "data")) + monkeypatch.setenv("ATOCORE_BACKUP_DIR", str(tmp_path / "backups")) + monkeypatch.setenv( + "ATOCORE_PROJECT_REGISTRY_PATH", str(tmp_path / "config" / "project-registry.json") + ) + + original_settings = config.settings + try: + config.settings = config.Settings() + init_db() + create_runtime_backup(datetime(2026, 4, 9, 10, 0, 0, tzinfo=UTC)) + + with pytest.raises(RuntimeError, match="confirm_service_stopped"): + restore_runtime_backup("20260409T100000Z") + finally: + config.settings = original_settings + + +def test_restore_raises_on_invalid_backup(tmp_path, monkeypatch): + monkeypatch.setenv("ATOCORE_DATA_DIR", str(tmp_path / "data")) + monkeypatch.setenv("ATOCORE_BACKUP_DIR", str(tmp_path / "backups")) + monkeypatch.setenv( + "ATOCORE_PROJECT_REGISTRY_PATH", str(tmp_path / "config" / "project-registry.json") + ) + + original_settings = config.settings + try: + config.settings = config.Settings() + init_db() + with pytest.raises(RuntimeError, match="failed validation"): + restore_runtime_backup( + "20250101T000000Z", confirm_service_stopped=True + ) + finally: + config.settings = original_settings + + +def test_restore_round_trip_reverses_post_backup_mutations(tmp_path, monkeypatch): + """Canonical drill: snapshot -> mutate -> restore -> mutation gone.""" + monkeypatch.setenv("ATOCORE_DATA_DIR", str(tmp_path / "data")) + monkeypatch.setenv("ATOCORE_BACKUP_DIR", str(tmp_path / "backups")) + monkeypatch.setenv( + "ATOCORE_PROJECT_REGISTRY_PATH", str(tmp_path / "config" / "project-registry.json") + ) + + 
registry_path = tmp_path / "config" / "project-registry.json" + registry_path.parent.mkdir(parents=True) + registry_path.write_text( + '{"projects":[{"id":"p01-example","aliases":[],' + '"ingest_roots":[{"source":"vault","subpath":"incoming/projects/p01-example"}]}]}\n', + encoding="utf-8", + ) + + original_settings = config.settings + try: + config.settings = config.Settings() + init_db() + + # 1. Seed baseline state that should SURVIVE the restore. + with sqlite3.connect(str(config.settings.db_path)) as conn: + conn.execute( + "INSERT INTO projects (id, name) VALUES (?, ?)", + ("p01", "Baseline Project"), + ) + conn.commit() + + # 2. Create the backup we're going to restore to. + create_runtime_backup(datetime(2026, 4, 9, 11, 0, 0, tzinfo=UTC)) + stamp = "20260409T110000Z" + + # 3. Mutate live state AFTER the backup — this is what the + # restore should reverse. + with sqlite3.connect(str(config.settings.db_path)) as conn: + conn.execute( + "INSERT INTO projects (id, name) VALUES (?, ?)", + ("p99", "Post Backup Mutation"), + ) + conn.commit() + + # Confirm the mutation is present before restore. + with sqlite3.connect(str(config.settings.db_path)) as conn: + row = conn.execute( + "SELECT name FROM projects WHERE id = ?", ("p99",) + ).fetchone() + assert row is not None and row[0] == "Post Backup Mutation" + + # 4. Restore — the drill procedure. Explicit confirm_service_stopped. + result = restore_runtime_backup( + stamp, confirm_service_stopped=True + ) + + # 5. Verify restore report + assert result["stamp"] == stamp + assert result["db_restored"] is True + assert result["registry_restored"] is True + assert result["restored_integrity_ok"] is True + assert result["pre_restore_snapshot"] is not None + + # 6. Verify live state reflects the restore: baseline survived, + # post-backup mutation is gone. 
+ with sqlite3.connect(str(config.settings.db_path)) as conn: + baseline = conn.execute( + "SELECT name FROM projects WHERE id = ?", ("p01",) + ).fetchone() + mutation = conn.execute( + "SELECT name FROM projects WHERE id = ?", ("p99",) + ).fetchone() + assert baseline is not None and baseline[0] == "Baseline Project" + assert mutation is None + + # 7. Pre-restore safety snapshot DOES contain the mutation — + # it captured current state before overwriting. This is the + # reversibility guarantee: the operator can restore back to + # it if the restore itself was a mistake. + pre_stamp = result["pre_restore_snapshot"] + pre_validation = validate_backup(pre_stamp) + assert pre_validation["valid"] is True + pre_db_path = pre_validation["metadata"]["db_snapshot_path"] + with sqlite3.connect(pre_db_path) as conn: + pre_mutation = conn.execute( + "SELECT name FROM projects WHERE id = ?", ("p99",) + ).fetchone() + assert pre_mutation is not None and pre_mutation[0] == "Post Backup Mutation" + finally: + config.settings = original_settings + + +def test_restore_round_trip_with_chroma(tmp_path, monkeypatch): + monkeypatch.setenv("ATOCORE_DATA_DIR", str(tmp_path / "data")) + monkeypatch.setenv("ATOCORE_BACKUP_DIR", str(tmp_path / "backups")) + monkeypatch.setenv( + "ATOCORE_PROJECT_REGISTRY_PATH", str(tmp_path / "config" / "project-registry.json") + ) + + original_settings = config.settings + try: + config.settings = config.Settings() + init_db() + + # Seed baseline chroma state that should survive restore. + chroma_dir = config.settings.chroma_path + (chroma_dir / "coll-a").mkdir(parents=True, exist_ok=True) + (chroma_dir / "coll-a" / "baseline.bin").write_bytes(b"baseline") + + create_runtime_backup( + datetime(2026, 4, 9, 12, 0, 0, tzinfo=UTC), include_chroma=True + ) + stamp = "20260409T120000Z" + + # Mutate chroma after backup: add a file + remove baseline. 
+ (chroma_dir / "coll-a" / "post_backup.bin").write_bytes(b"post") + (chroma_dir / "coll-a" / "baseline.bin").unlink() + + result = restore_runtime_backup( + stamp, confirm_service_stopped=True + ) + + assert result["chroma_restored"] is True + assert (chroma_dir / "coll-a" / "baseline.bin").exists() + assert not (chroma_dir / "coll-a" / "post_backup.bin").exists() + finally: + config.settings = original_settings + + +def test_restore_skips_pre_snapshot_when_requested(tmp_path, monkeypatch): + monkeypatch.setenv("ATOCORE_DATA_DIR", str(tmp_path / "data")) + monkeypatch.setenv("ATOCORE_BACKUP_DIR", str(tmp_path / "backups")) + monkeypatch.setenv( + "ATOCORE_PROJECT_REGISTRY_PATH", str(tmp_path / "config" / "project-registry.json") + ) + + original_settings = config.settings + try: + config.settings = config.Settings() + init_db() + create_runtime_backup(datetime(2026, 4, 9, 13, 0, 0, tzinfo=UTC)) + + before_count = len(list_runtime_backups()) + + result = restore_runtime_backup( + "20260409T130000Z", + confirm_service_stopped=True, + pre_restore_snapshot=False, + ) + + after_count = len(list_runtime_backups()) + assert result["pre_restore_snapshot"] is None + assert after_count == before_count + finally: + config.settings = original_settings + + +def test_restore_cleans_stale_wal_sidecars(tmp_path, monkeypatch): + """Stale WAL/SHM sidecars must not carry bytes past the restore. + + Note: after restore runs, PRAGMA integrity_check reopens the + restored db which may legitimately recreate a fresh -wal. So we + assert that the STALE byte marker no longer appears in either + sidecar, not that the files are absent. 
+ """ + monkeypatch.setenv("ATOCORE_DATA_DIR", str(tmp_path / "data")) + monkeypatch.setenv("ATOCORE_BACKUP_DIR", str(tmp_path / "backups")) + monkeypatch.setenv( + "ATOCORE_PROJECT_REGISTRY_PATH", str(tmp_path / "config" / "project-registry.json") + ) + + original_settings = config.settings + try: + config.settings = config.Settings() + init_db() + create_runtime_backup(datetime(2026, 4, 9, 14, 0, 0, tzinfo=UTC)) + + # Write fake stale WAL/SHM next to the live db with an + # unmistakable marker. + target_db = config.settings.db_path + wal = target_db.with_name(target_db.name + "-wal") + shm = target_db.with_name(target_db.name + "-shm") + stale_marker = b"STALE-SIDECAR-MARKER-DO-NOT-SURVIVE" + wal.write_bytes(stale_marker) + shm.write_bytes(stale_marker) + assert wal.exists() and shm.exists() + + restore_runtime_backup( + "20260409T140000Z", confirm_service_stopped=True + ) + + # The restored db must pass integrity check (tested elsewhere); + # here we just confirm that no file next to it still contains + # the stale marker from the old live process. + for sidecar in (wal, shm): + if sidecar.exists(): + assert stale_marker not in sidecar.read_bytes(), ( + f"{sidecar.name} still carries stale marker" + ) + finally: + config.settings = original_settings