ATOCore/tests/test_backup.py
Anto01 1a8fdf4225 fix: chroma restore bind-mount bug + consolidate docs
Two fixes from the 2026-04-09 first real restore drill on Dalidou,
plus the long-overdue doc consolidation I should have done when I
added the drill runbook instead of creating a duplicate.

## Chroma restore bind-mount bug (drill finding)

src/atocore/ops/backup.py: restore_runtime_backup() used to call
shutil.rmtree(dst_chroma) before copying the snapshot back. In the
Dockerized Dalidou deployment the chroma dir is a bind-mounted
volume. A mount point cannot be unlinked, so rmtree raises
  OSError [Errno 16] Device or resource busy
and the restore silently fails to touch Chroma. This bit the first
real drill; the operator worked around it with --no-chroma plus a
manual cp -a.

Fix: clear the destination's CONTENTS (iterdir + rmtree/unlink per
child) and use copytree(dirs_exist_ok=True) so the mount point
itself is never touched. Equivalent semantics, bind-mount-safe.
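
A minimal sketch of the contents-clearing approach described above. This
is an illustration, not the code in src/atocore/ops/backup.py: the helper
name `replace_dir_contents` and its arguments are hypothetical.

```python
import shutil
from pathlib import Path


def replace_dir_contents(src: Path, dst: Path) -> None:
    """Make dst's contents match src without removing dst itself.

    Hypothetical helper: dst may be a bind-mount point, so only its
    children are deleted and the directory itself is left in place.
    """
    dst.mkdir(parents=True, exist_ok=True)

    # Remove every child of dst; dst itself is never unlinked.
    for child in dst.iterdir():
        if child.is_dir() and not child.is_symlink():
            shutil.rmtree(child)
        else:
            # Regular files and symlinks (including symlinks to dirs).
            child.unlink()

    # dirs_exist_ok=True (Python 3.8+) lets copytree write into the
    # already-existing destination directory.
    shutil.copytree(src, dst, dirs_exist_ok=True)
```

Because `dst` is never removed, its inode is unchanged after the call,
which is the property the regression test checks.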

Regression test:
tests/test_backup.py::test_restore_chroma_does_not_unlink_destination_directory
captures Path.stat().st_ino of the dest dir before and after
restore and asserts they match. That's the same invariant a
bind-mounted chroma dir enforces — if the inode changed, the
mount would have failed. 11/11 backup tests now pass.

## Doc consolidation

docs/backup-restore-drill.md existed as a duplicate of the
authoritative docs/backup-restore-procedure.md. When I added the
drill runbook in commit 3362080 I wrote it from scratch instead of
updating the existing procedure — bad doc hygiene on a project
that's literally about being a context engine.

- Deleted docs/backup-restore-drill.md
- Folded its contents into docs/backup-restore-procedure.md:
  - Replaced the manual sudo cp restore sequence with the new
    `python -m atocore.ops.backup restore <STAMP>
    --confirm-service-stopped` CLI
  - Added the one-shot docker compose run pattern for running
    restore inside a container that reuses the live volume mounts
  - Documented the --no-pre-snapshot / --no-chroma / --chroma flags
  - New "Chroma restore and bind-mounted volumes" subsection
    explaining the bug and the regression test that protects the fix
  - New "Restore drill" subsection with three levels (unit tests,
    module round-trip, live Dalidou drill) and the cadence list
  - Failure-mode table gained four entries: restored_integrity_ok,
    Device-or-resource-busy, drill marker still present,
    chroma_snapshot_missing
  - "Open follow-ups" struck the restore_runtime_backup item (done)
    and added a "Done (historical)" note referencing 2026-04-09
  - Quickstart cheat sheet now has a full drill one-liner using
    memory_type=episodic (the 2026-04-09 drill found the runbook's
    memory_type=note was invalid — the valid set is identity,
    preference, project, episodic, knowledge, adaptation)

## Status doc sync

Long overdue — I've been landing code without updating the
project's narrative state docs.

docs/current-state.md:
- "Reliability Baseline" now reflects: restore_runtime_backup is
  real with CLI, pre-restore safety snapshot, WAL cleanup,
  integrity check; live drill on 2026-04-09 surfaced and fixed
  Chroma bind-mount bug; deploy provenance via /health build_sha;
  deploy.sh self-update re-exec guard
- "Immediate Next Focus" reshuffled: drill re-run (priority 1) and
  auto-capture (priority 2) are now ahead of retrieval quality work,
  reflecting the updated unblock sequence
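
The WAL cleanup and integrity check listed under "Reliability Baseline"
can be sketched roughly as follows. This is an illustration under
assumptions, not the code in restore_runtime_backup(): the helper name
`restore_sqlite_file` is hypothetical, and it assumes the service is
stopped so no process holds the database open.

```python
import shutil
import sqlite3
from pathlib import Path


def restore_sqlite_file(snapshot_db: Path, target_db: Path) -> bool:
    """Copy snapshot_db over target_db, dropping stale sidecars first.

    Hypothetical helper. Returns True when PRAGMA integrity_check
    reports "ok" for the restored file.
    """
    # Stale -wal/-shm files belong to the old live database and must
    # not be applied to the restored file.
    for suffix in ("-wal", "-shm"):
        sidecar = target_db.with_name(target_db.name + suffix)
        sidecar.unlink(missing_ok=True)  # missing_ok needs Python 3.8+

    target_db.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(snapshot_db, target_db)

    # Reopen the restored database and run SQLite's built-in check.
    conn = sqlite3.connect(str(target_db))
    try:
        row = conn.execute("PRAGMA integrity_check").fetchone()
    finally:
        conn.close()
    return row is not None and row[0] == "ok"
```

Removing the sidecars before the copy matters because SQLite would
otherwise try to interpret a leftover -wal file against the restored
database the next time it is opened.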

docs/next-steps.md:
- New item 1: re-run the drill with chroma working end-to-end
- New item 2: auto-capture conservative mode (Stop hook)
- Old item 7 rewritten as item 9 listing what's DONE
  (create/list/validate/restore, admin/backup endpoint with
  include_chroma, /health provenance, self-update guard,
  procedure doc with failure modes) and what's still pending
  (retention cleanup, off-Dalidou target, auto-validation)

## Test count

226 passing (225 previously, plus 1 new inode-stability regression test).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 09:13:21 -04:00


"""Tests for runtime backup creation and restore."""

import json
import sqlite3
from datetime import UTC, datetime

import pytest

import atocore.config as config
from atocore.models.database import init_db
from atocore.ops.backup import (
    create_runtime_backup,
    list_runtime_backups,
    restore_runtime_backup,
    validate_backup,
)


def test_create_runtime_backup_copies_db_and_registry(tmp_path, monkeypatch):
    monkeypatch.setenv("ATOCORE_DATA_DIR", str(tmp_path / "data"))
    monkeypatch.setenv("ATOCORE_BACKUP_DIR", str(tmp_path / "backups"))
    monkeypatch.setenv(
        "ATOCORE_PROJECT_REGISTRY_PATH", str(tmp_path / "config" / "project-registry.json")
    )

    registry_path = tmp_path / "config" / "project-registry.json"
    registry_path.parent.mkdir(parents=True)
    registry_path.write_text('{"projects":[{"id":"p01-example","aliases":[],"ingest_roots":[{"source":"vault","subpath":"incoming/projects/p01-example"}]}]}\n', encoding="utf-8")

    original_settings = config.settings
    try:
        config.settings = config.Settings()
        init_db()
        with sqlite3.connect(str(config.settings.db_path)) as conn:
            conn.execute("INSERT INTO projects (id, name) VALUES (?, ?)", ("p01", "P01 Example"))
            conn.commit()

        result = create_runtime_backup(datetime(2026, 4, 6, 18, 0, 0, tzinfo=UTC))
    finally:
        config.settings = original_settings

    db_snapshot = tmp_path / "backups" / "snapshots" / "20260406T180000Z" / "db" / "atocore.db"
    registry_snapshot = (
        tmp_path / "backups" / "snapshots" / "20260406T180000Z" / "config" / "project-registry.json"
    )
    metadata_path = (
        tmp_path / "backups" / "snapshots" / "20260406T180000Z" / "backup-metadata.json"
    )

    assert result["db_snapshot_path"] == str(db_snapshot)
    assert db_snapshot.exists()
    assert registry_snapshot.exists()
    assert metadata_path.exists()

    with sqlite3.connect(str(db_snapshot)) as conn:
        row = conn.execute("SELECT name FROM projects WHERE id = ?", ("p01",)).fetchone()
    assert row[0] == "P01 Example"

    metadata = json.loads(metadata_path.read_text(encoding="utf-8"))
    assert metadata["registry_snapshot_path"] == str(registry_snapshot)


def test_create_runtime_backup_includes_chroma_when_requested(tmp_path, monkeypatch):
    monkeypatch.setenv("ATOCORE_DATA_DIR", str(tmp_path / "data"))
    monkeypatch.setenv("ATOCORE_BACKUP_DIR", str(tmp_path / "backups"))
    monkeypatch.setenv(
        "ATOCORE_PROJECT_REGISTRY_PATH", str(tmp_path / "config" / "project-registry.json")
    )

    original_settings = config.settings
    try:
        config.settings = config.Settings()
        init_db()

        # Create a fake chroma directory tree with a couple of files.
        chroma_dir = config.settings.chroma_path
        (chroma_dir / "collection-a").mkdir(parents=True, exist_ok=True)
        (chroma_dir / "collection-a" / "data.bin").write_bytes(b"\x00\x01\x02\x03")
        (chroma_dir / "metadata.json").write_text('{"ok":true}', encoding="utf-8")

        result = create_runtime_backup(
            datetime(2026, 4, 6, 20, 0, 0, tzinfo=UTC),
            include_chroma=True,
        )
    finally:
        config.settings = original_settings

    chroma_snapshot_root = (
        tmp_path / "backups" / "snapshots" / "20260406T200000Z" / "chroma"
    )
    assert result["chroma_snapshot_included"] is True
    assert result["chroma_snapshot_path"] == str(chroma_snapshot_root)
    assert result["chroma_snapshot_files"] >= 2
    assert result["chroma_snapshot_bytes"] > 0
    assert (chroma_snapshot_root / "collection-a" / "data.bin").exists()
    assert (chroma_snapshot_root / "metadata.json").exists()


def test_list_and_validate_runtime_backups(tmp_path, monkeypatch):
    monkeypatch.setenv("ATOCORE_DATA_DIR", str(tmp_path / "data"))
    monkeypatch.setenv("ATOCORE_BACKUP_DIR", str(tmp_path / "backups"))
    monkeypatch.setenv(
        "ATOCORE_PROJECT_REGISTRY_PATH", str(tmp_path / "config" / "project-registry.json")
    )

    original_settings = config.settings
    try:
        config.settings = config.Settings()
        init_db()
        first = create_runtime_backup(datetime(2026, 4, 6, 21, 0, 0, tzinfo=UTC))
        second = create_runtime_backup(datetime(2026, 4, 6, 22, 0, 0, tzinfo=UTC))

        listing = list_runtime_backups()
        first_validation = validate_backup("20260406T210000Z")
        second_validation = validate_backup("20260406T220000Z")
        missing_validation = validate_backup("20260101T000000Z")
    finally:
        config.settings = original_settings

    assert len(listing) == 2
    assert {entry["stamp"] for entry in listing} == {
        "20260406T210000Z",
        "20260406T220000Z",
    }
    for entry in listing:
        assert entry["has_metadata"] is True
        assert entry["metadata"]["db_snapshot_path"]

    assert first_validation["valid"] is True
    assert first_validation["db_ok"] is True
    assert first_validation["errors"] == []

    assert second_validation["valid"] is True

    assert missing_validation["exists"] is False
    assert "snapshot_directory_missing" in missing_validation["errors"]

    # both metadata paths are reachable on disk
    assert json.loads(
        (tmp_path / "backups" / "snapshots" / "20260406T210000Z" / "backup-metadata.json")
        .read_text(encoding="utf-8")
    )["db_snapshot_path"] == first["db_snapshot_path"]
    assert second["db_snapshot_path"].endswith("atocore.db")


def test_create_runtime_backup_handles_missing_registry(tmp_path, monkeypatch):
    monkeypatch.setenv("ATOCORE_DATA_DIR", str(tmp_path / "data"))
    monkeypatch.setenv("ATOCORE_BACKUP_DIR", str(tmp_path / "backups"))
    monkeypatch.setenv(
        "ATOCORE_PROJECT_REGISTRY_PATH", str(tmp_path / "config" / "project-registry.json")
    )

    original_settings = config.settings
    try:
        config.settings = config.Settings()
        init_db()
        result = create_runtime_backup(datetime(2026, 4, 6, 19, 0, 0, tzinfo=UTC))
    finally:
        config.settings = original_settings

    assert result["registry_snapshot_path"] == ""


def test_restore_refuses_without_confirm_service_stopped(tmp_path, monkeypatch):
    monkeypatch.setenv("ATOCORE_DATA_DIR", str(tmp_path / "data"))
    monkeypatch.setenv("ATOCORE_BACKUP_DIR", str(tmp_path / "backups"))
    monkeypatch.setenv(
        "ATOCORE_PROJECT_REGISTRY_PATH", str(tmp_path / "config" / "project-registry.json")
    )

    original_settings = config.settings
    try:
        config.settings = config.Settings()
        init_db()
        create_runtime_backup(datetime(2026, 4, 9, 10, 0, 0, tzinfo=UTC))

        with pytest.raises(RuntimeError, match="confirm_service_stopped"):
            restore_runtime_backup("20260409T100000Z")
    finally:
        config.settings = original_settings


def test_restore_raises_on_invalid_backup(tmp_path, monkeypatch):
    monkeypatch.setenv("ATOCORE_DATA_DIR", str(tmp_path / "data"))
    monkeypatch.setenv("ATOCORE_BACKUP_DIR", str(tmp_path / "backups"))
    monkeypatch.setenv(
        "ATOCORE_PROJECT_REGISTRY_PATH", str(tmp_path / "config" / "project-registry.json")
    )

    original_settings = config.settings
    try:
        config.settings = config.Settings()
        init_db()

        with pytest.raises(RuntimeError, match="failed validation"):
            restore_runtime_backup(
                "20250101T000000Z", confirm_service_stopped=True
            )
    finally:
        config.settings = original_settings


def test_restore_round_trip_reverses_post_backup_mutations(tmp_path, monkeypatch):
    """Canonical drill: snapshot -> mutate -> restore -> mutation gone."""
    monkeypatch.setenv("ATOCORE_DATA_DIR", str(tmp_path / "data"))
    monkeypatch.setenv("ATOCORE_BACKUP_DIR", str(tmp_path / "backups"))
    monkeypatch.setenv(
        "ATOCORE_PROJECT_REGISTRY_PATH", str(tmp_path / "config" / "project-registry.json")
    )

    registry_path = tmp_path / "config" / "project-registry.json"
    registry_path.parent.mkdir(parents=True)
    registry_path.write_text(
        '{"projects":[{"id":"p01-example","aliases":[],'
        '"ingest_roots":[{"source":"vault","subpath":"incoming/projects/p01-example"}]}]}\n',
        encoding="utf-8",
    )

    original_settings = config.settings
    try:
        config.settings = config.Settings()
        init_db()

        # 1. Seed baseline state that should SURVIVE the restore.
        with sqlite3.connect(str(config.settings.db_path)) as conn:
            conn.execute(
                "INSERT INTO projects (id, name) VALUES (?, ?)",
                ("p01", "Baseline Project"),
            )
            conn.commit()

        # 2. Create the backup we're going to restore to.
        create_runtime_backup(datetime(2026, 4, 9, 11, 0, 0, tzinfo=UTC))
        stamp = "20260409T110000Z"

        # 3. Mutate live state AFTER the backup; this is what the
        #    restore should reverse.
        with sqlite3.connect(str(config.settings.db_path)) as conn:
            conn.execute(
                "INSERT INTO projects (id, name) VALUES (?, ?)",
                ("p99", "Post Backup Mutation"),
            )
            conn.commit()

        # Confirm the mutation is present before restore.
        with sqlite3.connect(str(config.settings.db_path)) as conn:
            row = conn.execute(
                "SELECT name FROM projects WHERE id = ?", ("p99",)
            ).fetchone()
        assert row is not None and row[0] == "Post Backup Mutation"

        # 4. Restore: the drill procedure. Explicit confirm_service_stopped.
        result = restore_runtime_backup(
            stamp, confirm_service_stopped=True
        )

        # 5. Verify restore report
        assert result["stamp"] == stamp
        assert result["db_restored"] is True
        assert result["registry_restored"] is True
        assert result["restored_integrity_ok"] is True
        assert result["pre_restore_snapshot"] is not None

        # 6. Verify live state reflects the restore: baseline survived,
        #    post-backup mutation is gone.
        with sqlite3.connect(str(config.settings.db_path)) as conn:
            baseline = conn.execute(
                "SELECT name FROM projects WHERE id = ?", ("p01",)
            ).fetchone()
            mutation = conn.execute(
                "SELECT name FROM projects WHERE id = ?", ("p99",)
            ).fetchone()
        assert baseline is not None and baseline[0] == "Baseline Project"
        assert mutation is None

        # 7. Pre-restore safety snapshot DOES contain the mutation:
        #    it captured current state before overwriting. This is the
        #    reversibility guarantee: the operator can restore back to
        #    it if the restore itself was a mistake.
        pre_stamp = result["pre_restore_snapshot"]
        pre_validation = validate_backup(pre_stamp)
        assert pre_validation["valid"] is True
        pre_db_path = pre_validation["metadata"]["db_snapshot_path"]
        with sqlite3.connect(pre_db_path) as conn:
            pre_mutation = conn.execute(
                "SELECT name FROM projects WHERE id = ?", ("p99",)
            ).fetchone()
        assert pre_mutation is not None and pre_mutation[0] == "Post Backup Mutation"
    finally:
        config.settings = original_settings


def test_restore_round_trip_with_chroma(tmp_path, monkeypatch):
    monkeypatch.setenv("ATOCORE_DATA_DIR", str(tmp_path / "data"))
    monkeypatch.setenv("ATOCORE_BACKUP_DIR", str(tmp_path / "backups"))
    monkeypatch.setenv(
        "ATOCORE_PROJECT_REGISTRY_PATH", str(tmp_path / "config" / "project-registry.json")
    )

    original_settings = config.settings
    try:
        config.settings = config.Settings()
        init_db()

        # Seed baseline chroma state that should survive restore.
        chroma_dir = config.settings.chroma_path
        (chroma_dir / "coll-a").mkdir(parents=True, exist_ok=True)
        (chroma_dir / "coll-a" / "baseline.bin").write_bytes(b"baseline")

        create_runtime_backup(
            datetime(2026, 4, 9, 12, 0, 0, tzinfo=UTC), include_chroma=True
        )
        stamp = "20260409T120000Z"

        # Mutate chroma after backup: add a file + remove baseline.
        (chroma_dir / "coll-a" / "post_backup.bin").write_bytes(b"post")
        (chroma_dir / "coll-a" / "baseline.bin").unlink()

        result = restore_runtime_backup(
            stamp, confirm_service_stopped=True
        )

        assert result["chroma_restored"] is True
        assert (chroma_dir / "coll-a" / "baseline.bin").exists()
        assert not (chroma_dir / "coll-a" / "post_backup.bin").exists()
    finally:
        config.settings = original_settings


def test_restore_chroma_does_not_unlink_destination_directory(tmp_path, monkeypatch):
    """Regression: restore must not rmtree the chroma dir itself.

    In a Dockerized deployment the chroma dir is a bind-mounted
    volume. Calling shutil.rmtree on a mount point raises
    ``OSError [Errno 16] Device or resource busy``, which broke the
    first real Dalidou drill on 2026-04-09. The fix clears the
    directory's CONTENTS and copytree(dirs_exist_ok=True) into it,
    keeping the directory inode (and any bind mount) intact.

    This test captures the inode of the destination directory before
    and after restore and asserts they match, which is what a
    bind-mounted chroma dir would also see.
    """
    monkeypatch.setenv("ATOCORE_DATA_DIR", str(tmp_path / "data"))
    monkeypatch.setenv("ATOCORE_BACKUP_DIR", str(tmp_path / "backups"))
    monkeypatch.setenv(
        "ATOCORE_PROJECT_REGISTRY_PATH", str(tmp_path / "config" / "project-registry.json")
    )

    original_settings = config.settings
    try:
        config.settings = config.Settings()
        init_db()

        chroma_dir = config.settings.chroma_path
        (chroma_dir / "coll-a").mkdir(parents=True, exist_ok=True)
        (chroma_dir / "coll-a" / "baseline.bin").write_bytes(b"baseline")

        create_runtime_backup(
            datetime(2026, 4, 9, 15, 0, 0, tzinfo=UTC), include_chroma=True
        )

        # Capture the destination directory's stat signature before restore.
        chroma_stat_before = chroma_dir.stat()

        # Add a file post-backup so restore has work to do.
        (chroma_dir / "coll-a" / "post_backup.bin").write_bytes(b"post")

        restore_runtime_backup(
            "20260409T150000Z", confirm_service_stopped=True
        )

        # Directory still exists (would have failed on mount point) and
        # its st_ino matches: the mount itself wasn't unlinked.
        assert chroma_dir.exists()
        chroma_stat_after = chroma_dir.stat()
        assert chroma_stat_before.st_ino == chroma_stat_after.st_ino, (
            "chroma directory inode changed: restore recreated the "
            "directory instead of clearing its contents; this would "
            "fail on a Docker bind-mounted volume"
        )
        # And the contents did actually get restored.
        assert (chroma_dir / "coll-a" / "baseline.bin").exists()
        assert not (chroma_dir / "coll-a" / "post_backup.bin").exists()
    finally:
        config.settings = original_settings


def test_restore_skips_pre_snapshot_when_requested(tmp_path, monkeypatch):
    monkeypatch.setenv("ATOCORE_DATA_DIR", str(tmp_path / "data"))
    monkeypatch.setenv("ATOCORE_BACKUP_DIR", str(tmp_path / "backups"))
    monkeypatch.setenv(
        "ATOCORE_PROJECT_REGISTRY_PATH", str(tmp_path / "config" / "project-registry.json")
    )

    original_settings = config.settings
    try:
        config.settings = config.Settings()
        init_db()
        create_runtime_backup(datetime(2026, 4, 9, 13, 0, 0, tzinfo=UTC))

        before_count = len(list_runtime_backups())

        result = restore_runtime_backup(
            "20260409T130000Z",
            confirm_service_stopped=True,
            pre_restore_snapshot=False,
        )

        after_count = len(list_runtime_backups())
        assert result["pre_restore_snapshot"] is None
        assert after_count == before_count
    finally:
        config.settings = original_settings


def test_restore_cleans_stale_wal_sidecars(tmp_path, monkeypatch):
    """Stale WAL/SHM sidecars must not carry bytes past the restore.

    Note: after restore runs, PRAGMA integrity_check reopens the
    restored db which may legitimately recreate a fresh -wal. So we
    assert that the STALE byte marker no longer appears in either
    sidecar, not that the files are absent.
    """
    monkeypatch.setenv("ATOCORE_DATA_DIR", str(tmp_path / "data"))
    monkeypatch.setenv("ATOCORE_BACKUP_DIR", str(tmp_path / "backups"))
    monkeypatch.setenv(
        "ATOCORE_PROJECT_REGISTRY_PATH", str(tmp_path / "config" / "project-registry.json")
    )

    original_settings = config.settings
    try:
        config.settings = config.Settings()
        init_db()
        create_runtime_backup(datetime(2026, 4, 9, 14, 0, 0, tzinfo=UTC))

        # Write fake stale WAL/SHM next to the live db with an
        # unmistakable marker.
        target_db = config.settings.db_path
        wal = target_db.with_name(target_db.name + "-wal")
        shm = target_db.with_name(target_db.name + "-shm")
        stale_marker = b"STALE-SIDECAR-MARKER-DO-NOT-SURVIVE"
        wal.write_bytes(stale_marker)
        shm.write_bytes(stale_marker)
        assert wal.exists() and shm.exists()

        restore_runtime_backup(
            "20260409T140000Z", confirm_service_stopped=True
        )

        # The restored db must pass integrity check (tested elsewhere);
        # here we just confirm that no file next to it still contains
        # the stale marker from the old live process.
        for sidecar in (wal, shm):
            if sidecar.exists():
                assert stale_marker not in sidecar.read_bytes(), (
                    f"{sidecar.name} still carries stale marker"
                )
    finally:
        config.settings = original_settings