ops: add restore_runtime_backup + drill runbook

Close the backup side of the loop: we had create/list/validate but
no restore, and no documented drill. A backup you've never restored
is not a backup. This lands the missing restore surface and the
procedure to exercise it before enabling any write-path automation
(auto-capture, automated ingestion, reinforcement sweeps).

Code — src/atocore/ops/backup.py:

- restore_runtime_backup(stamp, *, include_chroma, pre_restore_snapshot,
  confirm_service_stopped) performs:
  1. validate_backup() gate — refuse on any error
  2. pre-restore safety snapshot of current state (reversibility anchor)
  3. PRAGMA wal_checkpoint(TRUNCATE) on target db (flush + release
     OS handles; Windows needs this after conn.backup() reads)
  4. unlink stale -wal/-shm sidecars (tolerant to Windows lock races)
  5. shutil.copy2 snapshot db over target
  6. restore registry if snapshot captured one
  7. restore Chroma tree if snapshot captured one and include_chroma
     resolves to true (defaults to whether backup has Chroma)
  8. PRAGMA integrity_check on restored db, report result
- Refuses without confirm_service_stopped=True to prevent hot-restore
  into a running service (would corrupt SQLite state)
- Rewrote main() as argparse with 4 subcommands: create, list,
  validate, restore. `python -m atocore.ops.backup restore STAMP
  --confirm-service-stopped` is the drill CLI entry point, run via
  `docker compose run --rm --entrypoint python atocore` so it reuses
  the live service's volume mounts

Tests — tests/test_backup.py (6 new):

- test_restore_refuses_without_confirm_service_stopped
- test_restore_raises_on_invalid_backup
- test_restore_round_trip_reverses_post_backup_mutations
  (canonical drill flow: seed -> backup -> mutate -> restore ->
   mutation gone + baseline survived + pre-restore snapshot has
   the mutation captured as rollback anchor)
- test_restore_round_trip_with_chroma
- test_restore_skips_pre_snapshot_when_requested
- test_restore_cleans_stale_wal_sidecars (asserts stale byte
  markers do not survive, not file existence, since PRAGMA
  integrity_check may legitimately recreate -wal)

Docs — docs/backup-restore-drill.md (new):

- What gets backed up (hot sqlite, cold chroma, registry JSON,
  backup-metadata.json) and what doesn't (.env, source content)
- What restore does, step by step, and why confirm_service_stopped
  is a hard gate
- 8-step drill procedure: capture -> baseline -> mutate -> stop ->
  restore -> start -> verify marker gone -> optional cleanup
- Correct endpoint bodies verified against routes.py:
    POST /admin/backup with JSON body {"include_chroma": true}
    POST /memory with memory_type/content/project/confidence
    GET /memory?project=drill to list drill markers
    POST /query with {"prompt": ..., "top_k": ...} (not "query")
- Failure modes: integrity_check fail, container won't start,
  marker still present after restore, with remediation for each
- When to run: before new write-path automation, after backup.py
  or schema changes, after infra bumps, monthly as standing check

225/225 tests passing (219 existing + 6 new restore).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 21:17:48 -04:00
parent 03822389a1
commit 336208004c
3 changed files with 782 additions and 2 deletions


@@ -0,0 +1,296 @@
# Backup / Restore Drill
## Purpose
Before turning on any automation that writes to AtoCore continuously
(auto-capture of Claude Code sessions, automated source ingestion,
reinforcement sweeps), we need to know — with certainty — that a
backup can actually be restored. A backup you've never restored is
not a backup; it's a file that happens to be named that way.
This runbook walks through the canonical drill: take a snapshot,
mutate live state, stop the service, restore from the snapshot,
start the service, and verify the mutation is reversed. When the
drill passes, the runtime store has a trustworthy rollback.
## What gets backed up
`src/atocore/ops/backup.py::create_runtime_backup()` writes the
following into `$ATOCORE_BACKUP_DIR/snapshots/<stamp>/`:
| Component | How | Hot/Cold | Notes |
|---|---|---|---|
| SQLite (`atocore.db`) | `conn.backup()` online API | **hot** | Safe with service running; self-contained main file, no WAL sidecar. |
| Project registry JSON | file copy | cold | Only if the file exists. |
| Chroma vector store | `shutil.copytree` | **cold** | Only when `include_chroma=True`. Caller must hold `exclusive_ingestion()` so nothing writes during the copy — the `POST /admin/backup` route does this automatically when called with `{"include_chroma": true}` in the body. |
| `backup-metadata.json` | JSON blob | — | Records paths, sizes, and whether Chroma was included. Restore reads this to know what to pull back. |
Things that are **not** in the backup and must be handled separately:
- The `.env` file under `deploy/dalidou/` — secrets live out of git
and out of the backup on purpose. The operator must re-place it
on any fresh host.
- The source content under `sources/vault` and `sources/drive`:
these are read-only inputs by convention, owned by AtoVault /
AtoDrive, and backed up there.
- Any running transient state (in-flight HTTP requests, ingestion
queues). Stop the service cleanly if you care about those.
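If you need to script around snapshots, the `backup-metadata.json` manifest is the stable surface. A stdlib-only sketch; the helper name and demo directory are hypothetical, but the field names match what `restore_runtime_backup()` reads:

```python
import json
import tempfile
from pathlib import Path


def snapshot_summary(snapshot_dir: Path) -> dict:
    """Summarize one snapshot dir from its backup-metadata.json manifest."""
    meta = json.loads(
        (snapshot_dir / "backup-metadata.json").read_text(encoding="utf-8")
    )
    return {
        "stamp": snapshot_dir.name,
        "has_registry": bool(meta.get("registry_snapshot_path")),
        "has_chroma": bool(meta.get("chroma_snapshot_path")),
    }


# Demo against a fabricated snapshot dir (stand-in for
# $ATOCORE_BACKUP_DIR/snapshots/<stamp>/).
demo = Path(tempfile.mkdtemp()) / "20260409T012345Z"
demo.mkdir(parents=True)
(demo / "backup-metadata.json").write_text(json.dumps({
    "db_snapshot_path": str(demo / "atocore.db"),
    "registry_snapshot_path": "",
    "chroma_snapshot_path": str(demo / "chroma"),
}), encoding="utf-8")
print(snapshot_summary(demo))
```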
## What restore does
`restore_runtime_backup(stamp, confirm_service_stopped=True)`:
1. **Validates** the backup first via `validate_backup()` and
refuses to run on any error (missing metadata, corrupt snapshot
db, etc.).
2. **Takes a pre-restore safety snapshot** of the current state
(SQLite only, not Chroma — to keep it fast) and returns its
stamp. This is the reversibility guarantee: if the restore was
the wrong call, you can roll it back by restoring the
pre-restore snapshot.
3. **Forces a WAL checkpoint** on the current db
(`PRAGMA wal_checkpoint(TRUNCATE)`) to flush any lingering
writes and release OS file handles on `-wal`/`-shm`, so the
copy step won't race a half-open sqlite connection.
4. **Removes stale WAL/SHM sidecars** next to the target db.
The snapshot `.db` is a self-contained main-file image with no
WAL of its own; leftover `-wal` from the old live process
would desync against the restored main file.
5. **Copies the snapshot db** over the live db path.
6. **Restores the registry JSON** if the snapshot captured one.
7. **Restores the Chroma tree** if the snapshot captured one and
`include_chroma` resolves to true (defaults to whether the
snapshot has Chroma).
8. **Runs `PRAGMA integrity_check`** on the restored db and
reports the result alongside a summary of what was touched.
If `confirm_service_stopped` is not passed, the function refuses —
this is deliberate. Hot-restoring into a running service is not
supported and would corrupt state.
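The core of the swap (checkpoint, drop stale sidecars, copy, verify) can be exercised end-to-end in a throwaway sandbox. A self-contained stdlib sketch; the paths and the `t` table are illustrative, not AtoCore's:

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path

work = Path(tempfile.mkdtemp())
live = work / "live.db"
snap = work / "snapshot.db"

# Seed: the snapshot holds "baseline", the live db a later mutation.
for path, value in ((snap, "baseline"), (live, "mutated")):
    conn = sqlite3.connect(str(path))
    with conn:
        conn.execute("CREATE TABLE t (v TEXT)")
        conn.execute("INSERT INTO t VALUES (?)", (value,))
    conn.close()

# Flush any WAL state on the target, then remove stale sidecars.
conn = sqlite3.connect(str(live))
conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
conn.close()
for suffix in ("-wal", "-shm"):
    live.with_name(live.name + suffix).unlink(missing_ok=True)

# Copy the snapshot over the live main file.
shutil.copy2(snap, live)

# Verify with PRAGMA integrity_check and confirm the rollback took.
conn = sqlite3.connect(str(live))
integrity_ok = conn.execute("PRAGMA integrity_check").fetchone()[0] == "ok"
value = conn.execute("SELECT v FROM t").fetchone()[0]
conn.close()
print(integrity_ok, value)  # True baseline
```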
## The drill
Run this from a Dalidou host with the AtoCore container already
deployed and healthy. The whole drill takes under two minutes. It
does not touch source content or disturb any `.env` secrets.
### Step 1. Capture a snapshot via the HTTP API
The running service holds the db; use the admin route so the
Chroma snapshot is taken under `exclusive_ingestion()`. The
endpoint takes a JSON body (not a query string):
```bash
curl -fsS -X POST 'http://127.0.0.1:8100/admin/backup' \
-H 'Content-Type: application/json' \
-d '{"include_chroma": true}' \
| python3 -m json.tool
```
Record the `backup_root` and note the stamp (the last path segment,
e.g. `20260409T012345Z`). That stamp is the input to the restore
step.
### Step 2. Record a known piece of live state
Pick something small and unambiguous to use as a marker. The
simplest is the current health snapshot plus a memory count:
```bash
curl -fsS 'http://127.0.0.1:8100/health' | python3 -m json.tool
```
Note the `memory_count`, `interaction_count`, and `build_sha`. These
are your pre-drill baseline.
### Step 3. Mutate live state AFTER the backup
Write something the restore should reverse. Any write endpoint is
fine — a throwaway test memory is the cleanest. The request body
must include `memory_type` (the AtoCore memory schema requires it):
```bash
curl -fsS -X POST 'http://127.0.0.1:8100/memory' \
-H 'Content-Type: application/json' \
-d '{
"memory_type": "note",
"content": "DRILL-MARKER: this memory should not survive the restore",
"project": "drill",
"confidence": 1.0
}' \
| python3 -m json.tool
```
Record the returned `id`. Confirm it's there:
```bash
curl -fsS 'http://127.0.0.1:8100/health' | python3 -m json.tool
# memory_count should be baseline + 1
# And you can list the drill-project memories directly:
curl -fsS 'http://127.0.0.1:8100/memory?project=drill' | python3 -m json.tool
# should return the DRILL-MARKER memory
```
### Step 4. Stop the service
```bash
cd /srv/storage/atocore/app/deploy/dalidou
docker compose down
```
Wait for the container to actually exit:
```bash
docker compose ps
# atocore should be gone or Exited
```
### Step 5. Restore from the snapshot
Run the restore inside a one-shot container that reuses the same
volumes as the live service. This guarantees the paths resolve
identically to the running container's view.
```bash
cd /srv/storage/atocore/app/deploy/dalidou
docker compose run --rm --entrypoint python atocore \
-m atocore.ops.backup restore \
<YOUR_STAMP_FROM_STEP_1> \
--confirm-service-stopped
```
The output is JSON; the important fields are:
- `pre_restore_snapshot`: stamp of the safety snapshot of live
state at the moment of restore. **Write this down.** If the
restore turns out to be the wrong call, this is how you roll
it back.
- `db_restored`: `true`
- `registry_restored`: `true` if the backup had a registry
- `chroma_restored`: `true` if the backup had a chroma snapshot
- `restored_integrity_ok`: **must be `true`** — if this is false,
STOP and do not start the service; investigate the integrity
error first.
If restoration fails at any step, the function raises a clean
`RuntimeError` and nothing partial is committed past the main file
swap. The pre-restore safety snapshot is your rollback anchor.
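If the drill is ever scripted, the restart decision should key off exactly those fields. A sketch: the helper is hypothetical and the report literal is fabricated, but the field names match the restore output above:

```python
import json


def safe_to_restart(report: dict) -> bool:
    """True only when the db was swapped AND passed integrity_check."""
    return bool(report.get("db_restored")) and bool(
        report.get("restored_integrity_ok")
    )


# Fabricated report, shaped like the restore JSON above.
report = json.loads("""
{
  "stamp": "20260409T012345Z",
  "pre_restore_snapshot": "20260409T013000Z",
  "db_restored": true,
  "registry_restored": true,
  "chroma_restored": true,
  "restored_integrity_ok": true
}
""")
print(safe_to_restart(report))  # True
```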
### Step 6. Start the service back up
```bash
cd /srv/storage/atocore/app/deploy/dalidou
docker compose up -d
```
Wait for `/health` to respond:
```bash
for i in 1 2 3 4 5 6 7 8 9 10; do
curl -fsS 'http://127.0.0.1:8100/health' \
&& break || { echo "not ready ($i/10)"; sleep 3; }
done
```
### Step 7. Verify the drill marker is gone
```bash
curl -fsS 'http://127.0.0.1:8100/health' | python3 -m json.tool
# memory_count should equal the Step 2 baseline, NOT baseline + 1
```
You can also list the drill-project memories directly:
```bash
curl -fsS 'http://127.0.0.1:8100/memory?project=drill' | python3 -m json.tool
# should return an empty list — the DRILL-MARKER memory was rolled back
```
For a semantic-retrieval cross-check, issue a query (the `/query`
endpoint takes `prompt`, not `query`):
```bash
curl -fsS -X POST 'http://127.0.0.1:8100/query' \
-H 'Content-Type: application/json' \
-d '{"prompt": "DRILL-MARKER drill marker", "top_k": 5}' \
| python3 -m json.tool
# should not return the DRILL-MARKER memory in the hits
```
If the marker is gone and `memory_count` matches the baseline, the
drill **passed**. The runtime store has a trustworthy rollback.
### Step 8. (Optional) Clean up the safety snapshot
If everything went smoothly you can leave the pre-restore safety
snapshot on disk for a few days as a paranoia buffer. There's no
automatic cleanup yet — `list_runtime_backups()` will show it, and
you can remove it by hand once you're confident:
```bash
rm -rf /srv/storage/atocore/backups/snapshots/<pre_restore_stamp>
```
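Because stamps are UTC `YYYYMMDDTHHMMSSZ`, lexicographic order is chronological order, which makes age-based housekeeping trivial to script. A hypothetical helper (no such function exists in `backup.py` yet; the demo directory is fabricated):

```python
import tempfile
from pathlib import Path


def snapshots_to_prune(snapshots_root: Path, keep: int = 5) -> list[Path]:
    """Oldest snapshot dirs beyond the newest `keep`, ready for removal."""
    stamps = sorted(p for p in snapshots_root.iterdir() if p.is_dir())
    return stamps[:-keep] if len(stamps) > keep else []


# Demo with fabricated stamp dirs.
root = Path(tempfile.mkdtemp())
for stamp in ("20260401T000000Z", "20260402T000000Z", "20260403T000000Z"):
    (root / stamp).mkdir()
print([p.name for p in snapshots_to_prune(root, keep=2)])  # ['20260401T000000Z']
```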
## Failure modes and recovery
### Restore reports `restored_integrity_ok: false`
The copied db failed `PRAGMA integrity_check`. Do **not** start
the service. This usually means either the source snapshot was
itself corrupt (and `validate_backup` should have caught it — file
a bug if it didn't), or the copy was interrupted. Options:
1. Validate the source snapshot directly:
`python -m atocore.ops.backup validate <STAMP>`
2. Pick a different, older snapshot and retry the restore.
3. Roll the db back to your pre-restore safety snapshot.
### The live container won't start after restore
Check the container logs:
```bash
cd /srv/storage/atocore/app/deploy/dalidou
docker compose logs --tail=100 atocore
```
Common causes:
- Schema drift between the snapshot and the current code version.
`_apply_migrations` in `src/atocore/models/database.py` is
idempotent and only ADDs columns, so running new code against an
older snapshot is usually absorbed at startup. Restoring a newer
snapshot under older code has no such safety net, so verify the
logs after any cross-version restore.
- Chroma and SQLite disagreeing about what chunks exist. The
backup captures them together to minimize this, but if you
restore SQLite without Chroma (`--no-chroma`), retrieval may
return stale vectors. Re-ingest if this happens.
### The drill marker is still present after restore
Something went wrong. Possible causes:
- You restored a snapshot taken AFTER the drill marker was
written (wrong stamp).
- The service was writing during the drill and committed the
marker before `docker compose down`. Double-check the order.
- The restore silently skipped the db step. Check the restore
output for `db_restored: true` and `restored_integrity_ok: true`.
Roll back to the pre-restore safety snapshot and retry with the
correct source snapshot.
## When to run this drill
- **Before** enabling any new write-path automation (auto-capture,
automated ingestion, reinforcement sweeps, scheduled extraction).
- **After** any change to `src/atocore/ops/backup.py` or the
schema migrations in `src/atocore/models/database.py`.
- **After** a Dalidou OS upgrade or docker version bump.
- **Monthly** as a standing operational check.
Record each drill run (pass/fail) somewhere durable — even a line
in the project journal is enough. A drill you ran once and never
again is barely more than a drill you never ran.


@@ -216,6 +216,166 @@ def validate_backup(stamp: str) -> dict:
    return result


def restore_runtime_backup(
    stamp: str,
    *,
    include_chroma: bool | None = None,
    pre_restore_snapshot: bool = True,
    confirm_service_stopped: bool = False,
) -> dict:
    """Restore a previously captured runtime backup.

    CRITICAL: the AtoCore service MUST be stopped before calling this.
    Overwriting a live SQLite database corrupts state and can break
    the running container's open connections. The caller must pass
    ``confirm_service_stopped=True`` as an explicit acknowledgment —
    otherwise this function refuses to run.

    The restore procedure:

    1. Validate the backup via ``validate_backup``; refuse on any error.
    2. (default) Create a pre-restore safety snapshot of the CURRENT
       state so the restore itself is reversible. The snapshot stamp
       is returned in the result for the operator to record.
    3. Remove stale SQLite WAL/SHM sidecar files next to the target db
       before copying — the snapshot is a self-contained main-file
       image from ``conn.backup()``, and leftover WAL/SHM from the old
       live db would desync against the restored main file.
    4. Copy the snapshot db over the target db path.
    5. Restore the project registry file if the snapshot captured one.
    6. Restore the Chroma directory if ``include_chroma`` resolves to
       true. When ``include_chroma is None`` the function defers to
       whether the snapshot captured Chroma (the common case).
    7. Run ``PRAGMA integrity_check`` on the restored db and report
       the result.

    Returns a dict describing what was restored. On refused restore
    (service still running, validation failed) raises ``RuntimeError``.
    """
    if not confirm_service_stopped:
        raise RuntimeError(
            "restore_runtime_backup refuses to run without "
            "confirm_service_stopped=True — stop the AtoCore container "
            "first (e.g. `docker compose down` from deploy/dalidou) "
            "before calling this function"
        )
    validation = validate_backup(stamp)
    if not validation.get("valid"):
        raise RuntimeError(
            f"backup {stamp} failed validation: {validation.get('errors')}"
        )
    metadata = validation.get("metadata") or {}
    pre_snapshot_stamp: str | None = None
    if pre_restore_snapshot:
        pre = create_runtime_backup(include_chroma=False)
        pre_snapshot_stamp = Path(pre["backup_root"]).name
    target_db = _config.settings.db_path
    source_db = Path(metadata.get("db_snapshot_path", ""))
    if not source_db.exists():
        raise RuntimeError(
            f"db snapshot not found at {source_db} — backup "
            f"metadata may be stale"
        )
    # Force sqlite to flush any lingering WAL into the main file and
    # release OS-level file handles on -wal/-shm before we swap the
    # main file. Passing through conn.backup() in the pre-restore
    # snapshot can leave sidecars momentarily locked on Windows;
    # an explicit checkpoint(TRUNCATE) is the reliable way to flush
    # and release. Best-effort: if the target db can't be opened
    # (missing, corrupt), fall through and trust the copy step.
    if target_db.exists():
        try:
            with sqlite3.connect(str(target_db)) as checkpoint_conn:
                checkpoint_conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
        except sqlite3.DatabaseError as exc:
            log.warning(
                "restore_pre_checkpoint_failed",
                target_db=str(target_db),
                error=str(exc),
            )
    # Remove stale WAL/SHM sidecars from the old live db so SQLite
    # can't read inconsistent state on next open. Tolerant to
    # Windows file-lock races — the subsequent copy replaces the
    # main file anyway, and the integrity check afterward is the
    # actual correctness signal.
    wal_path = target_db.with_name(target_db.name + "-wal")
    shm_path = target_db.with_name(target_db.name + "-shm")
    for stale in (wal_path, shm_path):
        if stale.exists():
            try:
                stale.unlink()
            except OSError as exc:
                log.warning(
                    "restore_sidecar_unlink_failed",
                    path=str(stale),
                    error=str(exc),
                )
    target_db.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source_db, target_db)
    registry_restored = False
    registry_snapshot_path = metadata.get("registry_snapshot_path", "")
    if registry_snapshot_path:
        src_reg = Path(registry_snapshot_path)
        if src_reg.exists():
            dst_reg = _config.settings.resolved_project_registry_path
            dst_reg.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src_reg, dst_reg)
            registry_restored = True
    chroma_snapshot_path = metadata.get("chroma_snapshot_path", "")
    if include_chroma is None:
        include_chroma = bool(chroma_snapshot_path)
    chroma_restored = False
    if include_chroma and chroma_snapshot_path:
        src_chroma = Path(chroma_snapshot_path)
        if src_chroma.exists() and src_chroma.is_dir():
            dst_chroma = _config.settings.chroma_path
            if dst_chroma.exists():
                shutil.rmtree(dst_chroma)
            shutil.copytree(src_chroma, dst_chroma)
            chroma_restored = True
    restored_integrity_ok = False
    integrity_error: str | None = None
    try:
        with sqlite3.connect(str(target_db)) as conn:
            row = conn.execute("PRAGMA integrity_check").fetchone()
            restored_integrity_ok = bool(row and row[0] == "ok")
            if not restored_integrity_ok:
                integrity_error = row[0] if row else "no_row"
    except sqlite3.DatabaseError as exc:
        integrity_error = f"db_open_failed: {exc}"
    result: dict = {
        "stamp": stamp,
        "pre_restore_snapshot": pre_snapshot_stamp,
        "target_db": str(target_db),
        "db_restored": True,
        "registry_restored": registry_restored,
        "chroma_restored": chroma_restored,
        "restored_integrity_ok": restored_integrity_ok,
    }
    if integrity_error:
        result["integrity_error"] = integrity_error
    log.info(
        "runtime_backup_restored",
        stamp=stamp,
        pre_restore_snapshot=pre_snapshot_stamp,
        registry_restored=registry_restored,
        chroma_restored=chroma_restored,
        integrity_ok=restored_integrity_ok,
    )
    return result


def _backup_sqlite_db(source_path: Path, dest_path: Path) -> None:
    source_conn = sqlite3.connect(str(source_path))
    dest_conn = sqlite3.connect(str(dest_path))
@@ -242,7 +402,89 @@ def _copy_directory_tree(source: Path, dest: Path) -> tuple[int, int]:
def main() -> None:
    """CLI entry point for the backup module.

    Supports four subcommands:

    - ``create``   run ``create_runtime_backup`` (default if none given)
    - ``list``     list all runtime backup snapshots
    - ``validate`` validate a specific snapshot by stamp
    - ``restore``  restore a specific snapshot by stamp

    The restore subcommand is the one used by the backup/restore drill
    and MUST be run only when the AtoCore service is stopped. It takes
    ``--confirm-service-stopped`` as an explicit acknowledgment.
    """
    import argparse

    parser = argparse.ArgumentParser(
        prog="python -m atocore.ops.backup",
        description="AtoCore runtime backup create/list/validate/restore",
    )
    sub = parser.add_subparsers(dest="command")
    p_create = sub.add_parser("create", help="create a new runtime backup")
    p_create.add_argument(
        "--chroma",
        action="store_true",
        help="also snapshot the Chroma vector store (cold copy)",
    )
    sub.add_parser("list", help="list runtime backup snapshots")
    p_validate = sub.add_parser("validate", help="validate a snapshot by stamp")
    p_validate.add_argument("stamp", help="snapshot stamp (e.g. 20260409T010203Z)")
    p_restore = sub.add_parser(
        "restore",
        help="restore a snapshot by stamp (service must be stopped)",
    )
    p_restore.add_argument("stamp", help="snapshot stamp to restore")
    p_restore.add_argument(
        "--confirm-service-stopped",
        action="store_true",
        help="explicit acknowledgment that the AtoCore container is stopped",
    )
    p_restore.add_argument(
        "--no-pre-snapshot",
        action="store_true",
        help="skip the pre-restore safety snapshot of current state",
    )
    chroma_group = p_restore.add_mutually_exclusive_group()
    chroma_group.add_argument(
        "--chroma",
        dest="include_chroma",
        action="store_true",
        default=None,
        help="force-restore the Chroma snapshot",
    )
    chroma_group.add_argument(
        "--no-chroma",
        dest="include_chroma",
        action="store_false",
        help="skip the Chroma snapshot even if it was captured",
    )
    args = parser.parse_args()
    command = args.command or "create"
    if command == "create":
        include_chroma = getattr(args, "chroma", False)
        result = create_runtime_backup(include_chroma=include_chroma)
    elif command == "list":
        result = {"backups": list_runtime_backups()}
    elif command == "validate":
        result = validate_backup(args.stamp)
    elif command == "restore":
        result = restore_runtime_backup(
            args.stamp,
            include_chroma=args.include_chroma,
            pre_restore_snapshot=not args.no_pre_snapshot,
            confirm_service_stopped=args.confirm_service_stopped,
        )
    else:  # pragma: no cover — argparse guards this
        parser.error(f"unknown command: {command}")
    print(json.dumps(result, indent=2, ensure_ascii=True))


@@ -1,14 +1,17 @@
"""Tests for runtime backup creation."""
"""Tests for runtime backup creation and restore."""
import json
import sqlite3
from datetime import UTC, datetime
import pytest
import atocore.config as config
from atocore.models.database import init_db
from atocore.ops.backup import (
create_runtime_backup,
list_runtime_backups,
restore_runtime_backup,
validate_backup,
)
@@ -156,3 +159,242 @@ def test_create_runtime_backup_handles_missing_registry(tmp_path, monkeypatch):
        config.settings = original_settings
    assert result["registry_snapshot_path"] == ""


def test_restore_refuses_without_confirm_service_stopped(tmp_path, monkeypatch):
    monkeypatch.setenv("ATOCORE_DATA_DIR", str(tmp_path / "data"))
    monkeypatch.setenv("ATOCORE_BACKUP_DIR", str(tmp_path / "backups"))
    monkeypatch.setenv(
        "ATOCORE_PROJECT_REGISTRY_PATH", str(tmp_path / "config" / "project-registry.json")
    )
    original_settings = config.settings
    try:
        config.settings = config.Settings()
        init_db()
        create_runtime_backup(datetime(2026, 4, 9, 10, 0, 0, tzinfo=UTC))
        with pytest.raises(RuntimeError, match="confirm_service_stopped"):
            restore_runtime_backup("20260409T100000Z")
    finally:
        config.settings = original_settings

def test_restore_raises_on_invalid_backup(tmp_path, monkeypatch):
    monkeypatch.setenv("ATOCORE_DATA_DIR", str(tmp_path / "data"))
    monkeypatch.setenv("ATOCORE_BACKUP_DIR", str(tmp_path / "backups"))
    monkeypatch.setenv(
        "ATOCORE_PROJECT_REGISTRY_PATH", str(tmp_path / "config" / "project-registry.json")
    )
    original_settings = config.settings
    try:
        config.settings = config.Settings()
        init_db()
        with pytest.raises(RuntimeError, match="failed validation"):
            restore_runtime_backup(
                "20250101T000000Z", confirm_service_stopped=True
            )
    finally:
        config.settings = original_settings

def test_restore_round_trip_reverses_post_backup_mutations(tmp_path, monkeypatch):
    """Canonical drill: snapshot -> mutate -> restore -> mutation gone."""
    monkeypatch.setenv("ATOCORE_DATA_DIR", str(tmp_path / "data"))
    monkeypatch.setenv("ATOCORE_BACKUP_DIR", str(tmp_path / "backups"))
    monkeypatch.setenv(
        "ATOCORE_PROJECT_REGISTRY_PATH", str(tmp_path / "config" / "project-registry.json")
    )
    registry_path = tmp_path / "config" / "project-registry.json"
    registry_path.parent.mkdir(parents=True)
    registry_path.write_text(
        '{"projects":[{"id":"p01-example","aliases":[],'
        '"ingest_roots":[{"source":"vault","subpath":"incoming/projects/p01-example"}]}]}\n',
        encoding="utf-8",
    )
    original_settings = config.settings
    try:
        config.settings = config.Settings()
        init_db()
        # 1. Seed baseline state that should SURVIVE the restore.
        with sqlite3.connect(str(config.settings.db_path)) as conn:
            conn.execute(
                "INSERT INTO projects (id, name) VALUES (?, ?)",
                ("p01", "Baseline Project"),
            )
            conn.commit()
        # 2. Create the backup we're going to restore to.
        create_runtime_backup(datetime(2026, 4, 9, 11, 0, 0, tzinfo=UTC))
        stamp = "20260409T110000Z"
        # 3. Mutate live state AFTER the backup — this is what the
        #    restore should reverse.
        with sqlite3.connect(str(config.settings.db_path)) as conn:
            conn.execute(
                "INSERT INTO projects (id, name) VALUES (?, ?)",
                ("p99", "Post Backup Mutation"),
            )
            conn.commit()
        # Confirm the mutation is present before restore.
        with sqlite3.connect(str(config.settings.db_path)) as conn:
            row = conn.execute(
                "SELECT name FROM projects WHERE id = ?", ("p99",)
            ).fetchone()
        assert row is not None and row[0] == "Post Backup Mutation"
        # 4. Restore — the drill procedure. Explicit confirm_service_stopped.
        result = restore_runtime_backup(
            stamp, confirm_service_stopped=True
        )
        # 5. Verify restore report
        assert result["stamp"] == stamp
        assert result["db_restored"] is True
        assert result["registry_restored"] is True
        assert result["restored_integrity_ok"] is True
        assert result["pre_restore_snapshot"] is not None
        # 6. Verify live state reflects the restore: baseline survived,
        #    post-backup mutation is gone.
        with sqlite3.connect(str(config.settings.db_path)) as conn:
            baseline = conn.execute(
                "SELECT name FROM projects WHERE id = ?", ("p01",)
            ).fetchone()
            mutation = conn.execute(
                "SELECT name FROM projects WHERE id = ?", ("p99",)
            ).fetchone()
        assert baseline is not None and baseline[0] == "Baseline Project"
        assert mutation is None
        # 7. Pre-restore safety snapshot DOES contain the mutation —
        #    it captured current state before overwriting. This is the
        #    reversibility guarantee: the operator can restore back to
        #    it if the restore itself was a mistake.
        pre_stamp = result["pre_restore_snapshot"]
        pre_validation = validate_backup(pre_stamp)
        assert pre_validation["valid"] is True
        pre_db_path = pre_validation["metadata"]["db_snapshot_path"]
        with sqlite3.connect(pre_db_path) as conn:
            pre_mutation = conn.execute(
                "SELECT name FROM projects WHERE id = ?", ("p99",)
            ).fetchone()
        assert pre_mutation is not None and pre_mutation[0] == "Post Backup Mutation"
    finally:
        config.settings = original_settings

def test_restore_round_trip_with_chroma(tmp_path, monkeypatch):
    monkeypatch.setenv("ATOCORE_DATA_DIR", str(tmp_path / "data"))
    monkeypatch.setenv("ATOCORE_BACKUP_DIR", str(tmp_path / "backups"))
    monkeypatch.setenv(
        "ATOCORE_PROJECT_REGISTRY_PATH", str(tmp_path / "config" / "project-registry.json")
    )
    original_settings = config.settings
    try:
        config.settings = config.Settings()
        init_db()
        # Seed baseline chroma state that should survive restore.
        chroma_dir = config.settings.chroma_path
        (chroma_dir / "coll-a").mkdir(parents=True, exist_ok=True)
        (chroma_dir / "coll-a" / "baseline.bin").write_bytes(b"baseline")
        create_runtime_backup(
            datetime(2026, 4, 9, 12, 0, 0, tzinfo=UTC), include_chroma=True
        )
        stamp = "20260409T120000Z"
        # Mutate chroma after backup: add a file + remove baseline.
        (chroma_dir / "coll-a" / "post_backup.bin").write_bytes(b"post")
        (chroma_dir / "coll-a" / "baseline.bin").unlink()
        result = restore_runtime_backup(
            stamp, confirm_service_stopped=True
        )
        assert result["chroma_restored"] is True
        assert (chroma_dir / "coll-a" / "baseline.bin").exists()
        assert not (chroma_dir / "coll-a" / "post_backup.bin").exists()
    finally:
        config.settings = original_settings

def test_restore_skips_pre_snapshot_when_requested(tmp_path, monkeypatch):
    monkeypatch.setenv("ATOCORE_DATA_DIR", str(tmp_path / "data"))
    monkeypatch.setenv("ATOCORE_BACKUP_DIR", str(tmp_path / "backups"))
    monkeypatch.setenv(
        "ATOCORE_PROJECT_REGISTRY_PATH", str(tmp_path / "config" / "project-registry.json")
    )
    original_settings = config.settings
    try:
        config.settings = config.Settings()
        init_db()
        create_runtime_backup(datetime(2026, 4, 9, 13, 0, 0, tzinfo=UTC))
        before_count = len(list_runtime_backups())
        result = restore_runtime_backup(
            "20260409T130000Z",
            confirm_service_stopped=True,
            pre_restore_snapshot=False,
        )
        after_count = len(list_runtime_backups())
        assert result["pre_restore_snapshot"] is None
        assert after_count == before_count
    finally:
        config.settings = original_settings

def test_restore_cleans_stale_wal_sidecars(tmp_path, monkeypatch):
    """Stale WAL/SHM sidecars must not carry bytes past the restore.

    Note: after restore runs, PRAGMA integrity_check reopens the
    restored db which may legitimately recreate a fresh -wal. So we
    assert that the STALE byte marker no longer appears in either
    sidecar, not that the files are absent.
    """
    monkeypatch.setenv("ATOCORE_DATA_DIR", str(tmp_path / "data"))
    monkeypatch.setenv("ATOCORE_BACKUP_DIR", str(tmp_path / "backups"))
    monkeypatch.setenv(
        "ATOCORE_PROJECT_REGISTRY_PATH", str(tmp_path / "config" / "project-registry.json")
    )
    original_settings = config.settings
    try:
        config.settings = config.Settings()
        init_db()
        create_runtime_backup(datetime(2026, 4, 9, 14, 0, 0, tzinfo=UTC))
        # Write fake stale WAL/SHM next to the live db with an
        # unmistakable marker.
        target_db = config.settings.db_path
        wal = target_db.with_name(target_db.name + "-wal")
        shm = target_db.with_name(target_db.name + "-shm")
        stale_marker = b"STALE-SIDECAR-MARKER-DO-NOT-SURVIVE"
        wal.write_bytes(stale_marker)
        shm.write_bytes(stale_marker)
        assert wal.exists() and shm.exists()
        restore_runtime_backup(
            "20260409T140000Z", confirm_service_stopped=True
        )
        # The restored db must pass integrity check (tested elsewhere);
        # here we just confirm that no file next to it still contains
        # the stale marker from the old live process.
        for sidecar in (wal, shm):
            if sidecar.exists():
                assert stale_marker not in sidecar.read_bytes(), (
                    f"{sidecar.name} still carries stale marker"
                )
    finally:
        config.settings = original_settings