# Backup / Restore Drill

## Purpose

Before turning on any automation that writes to AtoCore continuously (auto-capture of Claude Code sessions, automated source ingestion, reinforcement sweeps), we need to know — with certainty — that a backup can actually be restored. A backup you've never restored is not a backup; it's a file that happens to be named that way.

This runbook walks through the canonical drill: take a snapshot, mutate live state, stop the service, restore from the snapshot, start the service, and verify the mutation is reversed. When the drill passes, the runtime store has a trustworthy rollback.

## What gets backed up

`src/atocore/ops/backup.py::create_runtime_backup()` writes the following into `$ATOCORE_BACKUP_DIR/snapshots/<STAMP>/`:

| Component | How | Hot/Cold | Notes |
|---|---|---|---|
| SQLite (`atocore.db`) | `conn.backup()` online API | **hot** | Safe with the service running; self-contained main file, no WAL sidecar. |
| Project registry JSON | file copy | cold | Only if the file exists. |
| Chroma vector store | `shutil.copytree` | **cold** | Only when `include_chroma=True`. The caller must hold `exclusive_ingestion()` so nothing writes during the copy — the `POST /admin/backup?include_chroma=true` route does this automatically. |
| `backup-metadata.json` | JSON blob | — | Records paths, sizes, and whether Chroma was included. Restore reads this to know what to pull back. |

Things that are **not** in the backup and must be handled separately:

- The `.env` file under `deploy/dalidou/` — secrets live outside git and outside the backup on purpose. The operator must re-place it on any fresh host.
- The source content under `sources/vault` and `sources/drive` — these are read-only inputs by convention, owned by AtoVault / AtoDrive, and backed up there.
- Any transient runtime state (in-flight HTTP requests, ingestion queues). Stop the service cleanly if you care about those.
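The hot SQLite row in the table relies on sqlite's online backup API rather than a raw file copy, which is what makes it safe while the service still holds the db open. A minimal self-contained sketch of that technique (the function name and paths are ours, not AtoCore's):

```python
import sqlite3
from pathlib import Path

def snapshot_sqlite(live_db: Path, dest: Path) -> None:
    """Copy a live SQLite database using the online backup API.

    Unlike `cp`, Connection.backup() copies pages consistently even while
    other connections are writing, and the destination comes out as a
    self-contained main file with no -wal/-shm sidecars to carry along.
    """
    src = sqlite3.connect(live_db)
    dst = sqlite3.connect(dest)
    try:
        src.backup(dst)  # page-by-page consistent copy, WAL contents included
    finally:
        dst.close()
        src.close()
```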
## What restore does

`restore_runtime_backup(stamp, confirm_service_stopped=True)`:

1. **Validates** the backup first via `validate_backup()` — refuses to run on any error (missing metadata, corrupt snapshot db, etc.).
2. **Takes a pre-restore safety snapshot** of the current state (SQLite only, not Chroma — to keep it fast) and returns its stamp. This is the reversibility guarantee: if the restore was the wrong call, you can roll it back by restoring the pre-restore snapshot.
3. **Forces a WAL checkpoint** on the current db (`PRAGMA wal_checkpoint(TRUNCATE)`) to flush any lingering writes and release OS file handles on `-wal`/`-shm`, so the copy step won't race a half-open sqlite connection.
4. **Removes stale WAL/SHM sidecars** next to the target db. The snapshot `.db` is a self-contained main-file image with no WAL of its own; leftover `-wal` from the old live process would desync against the restored main file.
5. **Copies the snapshot db** over the live db path.
6. **Restores the registry JSON** if the snapshot captured one.
7. **Restores the Chroma tree** if the snapshot captured one and `include_chroma` resolves to true (defaults to whether the snapshot has Chroma).
8. **Runs `PRAGMA integrity_check`** on the restored db and reports the result alongside a summary of what was touched.

If `confirm_service_stopped` is not passed, the function refuses — this is deliberate. Hot-restoring into a running service is not supported and would corrupt state.

## The drill

Run this from a Dalidou host with the AtoCore container already deployed and healthy. The whole drill takes under two minutes. It does not touch source content or disturb any `.env` secrets.

### Step 1. Capture a snapshot via the HTTP API

The running service holds the db; use the admin route so the Chroma snapshot is taken under `exclusive_ingestion()`.
The endpoint takes a JSON body (not a query string):

```bash
curl -fsS -X POST 'http://127.0.0.1:8100/admin/backup' \
  -H 'Content-Type: application/json' \
  -d '{"include_chroma": true}' \
  | python3 -m json.tool
```

Record the `backup_root` and note the stamp (the last path segment, e.g. `20260409T012345Z`). That stamp is the input to the restore step.

### Step 2. Record a known piece of live state

Pick something small and unambiguous to use as a marker. The simplest is the current health snapshot plus a memory count:

```bash
curl -fsS 'http://127.0.0.1:8100/health' | python3 -m json.tool
```

Note the `memory_count`, `interaction_count`, and `build_sha`. These are your pre-drill baseline.

### Step 3. Mutate live state AFTER the backup

Write something the restore should reverse. Any write endpoint is fine — a throwaway test memory is the cleanest. The request body must include `memory_type` (the AtoCore memory schema requires it):

```bash
curl -fsS -X POST 'http://127.0.0.1:8100/memory' \
  -H 'Content-Type: application/json' \
  -d '{
    "memory_type": "note",
    "content": "DRILL-MARKER: this memory should not survive the restore",
    "project": "drill",
    "confidence": 1.0
  }' \
  | python3 -m json.tool
```

Record the returned `id`. Confirm it's there:

```bash
curl -fsS 'http://127.0.0.1:8100/health' | python3 -m json.tool
# memory_count should be baseline + 1

# And you can list the drill-project memories directly:
curl -fsS 'http://127.0.0.1:8100/memory?project=drill' | python3 -m json.tool
# should return the DRILL-MARKER memory
```

### Step 4. Stop the service

```bash
cd /srv/storage/atocore/app/deploy/dalidou
docker compose down
```

Wait for the container to actually exit:

```bash
docker compose ps
# atocore should be gone or Exited
```

### Step 5. Restore from the snapshot

Run the restore inside a one-shot container that reuses the same volumes as the live service. This guarantees the paths resolve identically to the running container's view.
```bash
cd /srv/storage/atocore/app/deploy/dalidou
docker compose run --rm --entrypoint python atocore \
  -m atocore.ops.backup restore \
  <STAMP> \
  --confirm-service-stopped
```

The output is JSON; the important fields are:

- `pre_restore_snapshot`: stamp of the safety snapshot of live state at the moment of restore. **Write this down.** If the restore turns out to be the wrong call, this is how you roll it back.
- `db_restored`: `true`
- `registry_restored`: `true` if the backup had a registry
- `chroma_restored`: `true` if the backup had a chroma snapshot
- `restored_integrity_ok`: **must be `true`** — if this is false, STOP and do not start the service; investigate the integrity error first.

If restoration fails at any step, the function raises a clean `RuntimeError` and nothing partial is committed past the main file swap. The pre-restore safety snapshot is your rollback anchor.

### Step 6. Start the service back up

```bash
cd /srv/storage/atocore/app/deploy/dalidou
docker compose up -d
```

Wait for `/health` to respond:

```bash
for i in 1 2 3 4 5 6 7 8 9 10; do
  curl -fsS 'http://127.0.0.1:8100/health' && break \
    || { echo "not ready ($i/10)"; sleep 3; }
done
```

### Step 7. Verify the drill marker is gone

```bash
curl -fsS 'http://127.0.0.1:8100/health' | python3 -m json.tool
# memory_count should equal the Step 2 baseline, NOT baseline + 1
```

You can also list the drill-project memories directly:

```bash
curl -fsS 'http://127.0.0.1:8100/memory?project=drill' | python3 -m json.tool
# should return an empty list — the DRILL-MARKER memory was rolled back
```

For a semantic-retrieval cross-check, issue a query (the `/query` endpoint takes `prompt`, not `query`):

```bash
curl -fsS -X POST 'http://127.0.0.1:8100/query' \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "DRILL-MARKER drill marker", "top_k": 5}' \
  | python3 -m json.tool
# should not return the DRILL-MARKER memory in the hits
```

If the marker is gone and `memory_count` matches the baseline, the drill **passed**. The runtime store has a trustworthy rollback.

### Step 8. (Optional) Clean up the safety snapshot

If everything went smoothly, you can leave the pre-restore safety snapshot on disk for a few days as a paranoia buffer. There's no automatic cleanup yet — `list_runtime_backups()` will show it, and you can remove it by hand once you're confident:

```bash
rm -rf /srv/storage/atocore/backups/snapshots/<STAMP>
```

## Failure modes and recovery

### Restore reports `restored_integrity_ok: false`

The copied db failed `PRAGMA integrity_check`. Do **not** start the service. This usually means either the source snapshot was itself corrupt (and `validate_backup` should have caught it — file a bug if it didn't), or the copy was interrupted. Options:

1. Validate the source snapshot directly: `python -m atocore.ops.backup validate <STAMP>`
2. Pick a different, older snapshot and retry the restore.
3. Roll the db back to your pre-restore safety snapshot.
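The `restored_integrity_ok` field boils down to the same pragma the restore runs as its final step, and you can point it at any snapshot `.db` by hand. A standalone sketch (the helper name is ours, not AtoCore's):

```python
import sqlite3

def integrity_ok(db_path: str) -> bool:
    """Return True when PRAGMA integrity_check reports a single 'ok' row.

    On a corrupt database the pragma instead returns one row per problem
    found, so anything other than [('ok',)] means: do not start the service.
    """
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    finally:
        conn.close()
    return rows == [("ok",)]
```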
### The live container won't start after restore

Check the container logs:

```bash
cd /srv/storage/atocore/app/deploy/dalidou
docker compose logs --tail=100 atocore
```

Common causes:

- Schema drift between the snapshot and the current code version. `_apply_migrations` in `src/atocore/models/database.py` is idempotent and should absorb most forward migrations, but a backward restore (running new code against an older snapshot) may hit unexpected state. The migration only ADDs columns, so the opposite direction is usually safe, but verify.
- Chroma and SQLite disagreeing about what chunks exist. The backup captures them together to minimize this, but if you restore SQLite without Chroma (`--no-chroma`), retrieval may return stale vectors. Re-ingest if this happens.

### The drill marker is still present after restore

Something went wrong. Possible causes:

- You restored a snapshot taken AFTER the drill marker was written (wrong stamp).
- The service was writing during the drill and committed the marker before `docker compose down`. Double-check the order.
- The restore silently skipped the db step. Check the restore output for `db_restored: true` and `restored_integrity_ok: true`.

Roll back to the pre-restore safety snapshot and retry with the correct source snapshot.

## When to run this drill

- **Before** enabling any new write-path automation (auto-capture, automated ingestion, reinforcement sweeps, scheduled extraction).
- **After** any change to `src/atocore/ops/backup.py` or the schema migrations in `src/atocore/models/database.py`.
- **After** a Dalidou OS upgrade or docker version bump.
- **Monthly** as a standing operational check.

Record each drill run (pass/fail) somewhere durable — even a line in the project journal is enough. A drill you ran once and never again is barely more than a drill you never ran.
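That durable record can be as small as an append-only JSONL line per drill. A minimal sketch (the journal path and field names here are hypothetical, not part of AtoCore):

```python
import datetime
import json

def log_drill(journal_path: str, snapshot_stamp: str, passed: bool) -> None:
    """Append one pass/fail drill record as a JSON line."""
    entry = {
        "when": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "snapshot": snapshot_stamp,
        "result": "pass" if passed else "fail",
    }
    with open(journal_path, "a") as f:
        f.write(json.dumps(entry) + "\n")
```

A line like this per run is enough to answer, months later, "when did we last prove a restore works?"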