
AtoCore Backup and Restore Procedure

Scope

This document defines the operational procedure for backing up and restoring AtoCore's machine state on the Dalidou deployment. It is the practical companion to docs/backup-strategy.md (which defines the strategy) and src/atocore/ops/backup.py (which implements the mechanics).

The intent is that this procedure can be followed by anyone with SSH access to Dalidou and the AtoCore admin endpoints.

What gets backed up

A create_runtime_backup snapshot contains, in order of importance:

| Artifact | Source path on Dalidou | Backup destination | Always included |
|---|---|---|---|
| SQLite database | /srv/storage/atocore/data/db/atocore.db | <backup_root>/db/atocore.db | yes |
| Project registry JSON | /srv/storage/atocore/config/project-registry.json | <backup_root>/config/project-registry.json | yes (if file exists) |
| Backup metadata | (generated) | <backup_root>/backup-metadata.json | yes |
| Chroma vector store | /srv/storage/atocore/data/chroma/ | <backup_root>/chroma/ | only when include_chroma=true |

The SQLite snapshot uses the online conn.backup() API and is safe to take while the database is in use. The Chroma snapshot is a cold directory copy and is only safe when no ingestion is running; the API endpoint enforces this by acquiring the ingestion lock for the duration of the copy.
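The online SQLite copy relies on the standard-library backup API. A minimal sketch of the pattern (illustrative only, not the module's actual code):

```python
import sqlite3

def snapshot_sqlite(src_path: str, dest_path: str) -> None:
    """Copy a live SQLite database using the online backup API.

    Safe while other connections are writing: the backup API
    re-copies pages that change mid-copy instead of producing
    a torn file the way a plain cp would.
    """
    src = sqlite3.connect(src_path)
    try:
        dest = sqlite3.connect(dest_path)
        try:
            src.backup(dest)  # page-by-page online copy
        finally:
            dest.close()
    finally:
        src.close()
```

This is why Option A and B are safe without downtime, while Option C (plain cp) requires stopping the service first.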

What is not in the backup:

  • Source documents under /srv/storage/atocore/sources/vault/ and /srv/storage/atocore/sources/drive/. These are read-only inputs and live in the user's PKM/Drive, which is backed up separately by their own systems.
  • Application code. The container image is the source of truth for code; recovery means rebuilding the image, not restoring code from a backup.
  • Logs under /srv/storage/atocore/logs/.
  • Embeddings cache under /srv/storage/atocore/data/cache/.
  • Temp files under /srv/storage/atocore/data/tmp/.

Backup root layout

Each backup snapshot lives in its own timestamped directory:

/srv/storage/atocore/backups/snapshots/
  ├── 20260407T060000Z/
  │   ├── backup-metadata.json
  │   ├── db/
  │   │   └── atocore.db
  │   ├── config/
  │   │   └── project-registry.json
  │   └── chroma/                    # only if include_chroma=true
  │       └── ...
  ├── 20260408T060000Z/
  │   └── ...
  └── ...

The timestamp is UTC, format YYYYMMDDTHHMMSSZ.
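Because the stamp is UTC and fixed-width, lexicographic order equals chronological order, so "newest snapshot" is just the max of the directory names. A sketch of generating and comparing stamps (not the module's code):

```python
from datetime import datetime, timezone

def backup_stamp(now=None):
    """Return a UTC timestamp in the snapshot-directory format."""
    now = now or datetime.now(timezone.utc)
    return now.strftime("%Y%m%dT%H%M%SZ")

# Fixed-width UTC stamps sort lexicographically == chronologically,
# so max() of the directory names is the most recent snapshot.
stamps = ["20260408T060000Z", "20260407T060000Z"]
newest = max(stamps)
```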

Triggering a backup

Option A — via the admin endpoint (preferred)

# DB + registry only (fast, safe at any time)
curl -fsS -X POST http://dalidou:8100/admin/backup \
  -H "Content-Type: application/json" \
  -d '{"include_chroma": false}'

# DB + registry + Chroma (acquires ingestion lock)
curl -fsS -X POST http://dalidou:8100/admin/backup \
  -H "Content-Type: application/json" \
  -d '{"include_chroma": true}'

The response is the backup metadata JSON. Save the backup_root field — that's the directory the snapshot was written to.

Option B — via the standalone script (when the API is down)

docker exec atocore python -m atocore.ops.backup

This runs create_runtime_backup() directly, without going through the API or the ingestion lock. Use it only when the AtoCore service itself is unhealthy and you can't hit the admin endpoint.

Option C — manual file copy (last resort)

If both the API and the standalone script are unusable:

sudo systemctl stop atocore   # or: docker compose stop atocore
sudo cp /srv/storage/atocore/data/db/atocore.db \
        /srv/storage/atocore/backups/manual-$(date -u +%Y%m%dT%H%M%SZ).db
sudo cp /srv/storage/atocore/config/project-registry.json \
        /srv/storage/atocore/backups/manual-$(date -u +%Y%m%dT%H%M%SZ).registry.json
sudo systemctl start atocore

This is a cold backup and requires brief downtime.

Listing backups

curl -fsS http://dalidou:8100/admin/backup

Returns the configured backup_dir and a list of all snapshots under it, with their full metadata if available.

Or, on the host directly:

ls -la /srv/storage/atocore/backups/snapshots/

Validating a backup

Before relying on a backup for restore, validate it:

curl -fsS http://dalidou:8100/admin/backup/20260407T060000Z/validate

The validator:

  • confirms the snapshot directory exists
  • opens the SQLite snapshot and runs PRAGMA integrity_check
  • parses the registry JSON
  • confirms the Chroma directory exists (if it was included)

A valid backup returns "valid": true and an empty errors array. A failing validation returns "valid": false with one or more specific error strings (e.g. db_integrity_check_failed, registry_invalid_json, chroma_snapshot_missing).

Validate every backup at creation time. A backup that has never been validated is not actually a backup — it's just a hopeful copy of bytes.
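The checks above amount to roughly the following. This is a simplified sketch of what such a validator does; the real validate_backup lives in src/atocore/ops/backup.py and its exact behavior and error strings may differ:

```python
import json
import sqlite3
from pathlib import Path

def validate_snapshot(root: Path) -> dict:
    """Run the basic structural checks against one snapshot directory."""
    errors = []
    if not root.is_dir():
        return {"valid": False, "errors": ["snapshot_missing"]}

    # 1. SQLite snapshot opens and passes integrity_check
    db = root / "db" / "atocore.db"
    if not db.exists():
        errors.append("db_snapshot_missing")
    else:
        try:
            row = sqlite3.connect(str(db)).execute(
                "PRAGMA integrity_check").fetchone()
            if row[0] != "ok":
                errors.append("db_integrity_check_failed")
        except sqlite3.Error:
            errors.append("db_integrity_check_failed")

    # 2. Registry JSON parses (if present)
    registry = root / "config" / "project-registry.json"
    if registry.exists():
        try:
            json.loads(registry.read_text())
        except json.JSONDecodeError:
            errors.append("registry_invalid_json")

    # 3. Chroma dir exists if the metadata says it was included
    meta_path = root / "backup-metadata.json"
    if meta_path.exists():
        meta = json.loads(meta_path.read_text())
        if meta.get("include_chroma") and not (root / "chroma").is_dir():
            errors.append("chroma_snapshot_missing")

    return {"valid": not errors, "errors": errors}
```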

Restore procedure

Since 2026-04-09 the restore is implemented as a proper module function plus CLI entry point: restore_runtime_backup() in src/atocore/ops/backup.py, invoked as python -m atocore.ops.backup restore <STAMP> --confirm-service-stopped. It automatically takes a pre-restore safety snapshot (your rollback anchor), handles SQLite WAL/SHM cleanly, restores the registry, and runs PRAGMA integrity_check on the restored db. This replaces the earlier manual sudo cp sequence.

The function refuses to run without --confirm-service-stopped. This is deliberate: hot-restoring into a running service corrupts SQLite state.

Pre-flight (always)

  1. Identify which snapshot you want to restore. List available snapshots and pick by timestamp:
    curl -fsS http://127.0.0.1:8100/admin/backup | jq '.backups[].stamp'
    
  2. Validate it. Refuse to restore an invalid backup:
    STAMP=20260409T060000Z
    curl -fsS http://127.0.0.1:8100/admin/backup/$STAMP/validate | jq .
    
  3. Stop AtoCore. SQLite cannot be hot-restored under a running process and Chroma will not pick up new files until the process restarts.
    cd /srv/storage/atocore/app/deploy/dalidou
    docker compose down
    docker compose ps   # atocore should be Exited/gone
    

Run the restore

Use a one-shot container that reuses the live service's volume mounts so every path (db_path, chroma_path, backup dir) resolves to the same place the main service would see:

cd /srv/storage/atocore/app/deploy/dalidou
docker compose run --rm --entrypoint python atocore \
    -m atocore.ops.backup restore \
        $STAMP \
        --confirm-service-stopped

Output is a JSON document. The critical fields:

  • pre_restore_snapshot: stamp of the safety snapshot of live state taken right before the restore. Write this down. If the restore was the wrong call, this is how you roll it back.
  • db_restored: should be true
  • registry_restored: true if the backup captured a registry
  • chroma_restored: true if the backup captured a chroma tree and include_chroma resolved to true (default)
  • restored_integrity_ok: must be true — if this is false, STOP and do not start the service; investigate the integrity error first. The restored file is still on disk but untrusted.
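Rather than eyeballing the JSON, these checks can be enforced mechanically. A sketch (field names match the restore output described above; the wrapper itself is hypothetical):

```python
import json

def check_restore_output(raw: str) -> str:
    """Parse restore output, fail hard on a bad restore, and return
    the pre-restore snapshot stamp to record for rollback."""
    out = json.loads(raw)
    if not out.get("restored_integrity_ok"):
        raise SystemExit(
            "restored DB failed integrity check - do NOT start the service")
    if not out.get("db_restored"):
        raise SystemExit("restore did not report db_restored=true")
    # The rollback anchor: write this down before starting the service.
    return out["pre_restore_snapshot"]
```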

Controlling the restore

The CLI supports a few flags for finer control:

  • --no-pre-snapshot skips the pre-restore safety snapshot. Use this only when you know you have another rollback path.
  • --no-chroma restores only SQLite + registry, leaving the current Chroma dir alone. Useful if Chroma is consistent but SQLite needs a rollback.
  • --chroma forces Chroma restoration even if the metadata doesn't clearly indicate the snapshot has it (rare).

Chroma restore and bind-mounted volumes

The Chroma dir on Dalidou is a bind-mounted Docker volume. The restore cannot rmtree the destination (you can't unlink a mount point — it raises OSError [Errno 16] Device or resource busy), so the function clears the dir's CONTENTS and uses copytree(dirs_exist_ok=True) to copy the snapshot back in. The regression test test_restore_chroma_does_not_unlink_destination_directory in tests/test_backup.py captures the destination inode before and after restore and asserts it's stable — the same invariant that protects the bind mount.

This was discovered during the first real Dalidou restore drill on 2026-04-09. If you see a new restore failure with Device or resource busy, something has regressed this fix.
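The bind-mount-safe pattern is: empty the directory's contents, keep the directory inode, copy in place. A minimal sketch of that pattern, independent of the real module:

```python
import shutil
from pathlib import Path

def restore_tree_in_place(snapshot: Path, dest: Path) -> None:
    """Replace dest's CONTENTS with snapshot's contents without ever
    unlinking dest itself - safe when dest is a bind mount."""
    dest.mkdir(parents=True, exist_ok=True)
    for child in dest.iterdir():          # clear contents, not the dir
        if child.is_dir() and not child.is_symlink():
            shutil.rmtree(child)
        else:
            child.unlink()
    # dirs_exist_ok lets copytree write into the surviving directory
    # instead of demanding a fresh destination.
    shutil.copytree(snapshot, dest, dirs_exist_ok=True)
```

The regression test's invariant is exactly the one this pattern preserves: the destination inode is identical before and after the restore.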

Restart AtoCore

cd /srv/storage/atocore/app/deploy/dalidou
docker compose up -d
# Wait for /health to come up
for i in 1 2 3 4 5 6 7 8 9 10; do
    curl -fsS http://127.0.0.1:8100/health \
        && break || { echo "not ready ($i/10)"; sleep 3; }
done

Note on build_sha after restore: The one-shot docker compose run container does not carry the build provenance env vars that deploy.sh exports at deploy time. After a restore, /health will report build_sha: "unknown" until you re-run deploy.sh or manually re-deploy. This is cosmetic — the data is correctly restored — but if you need build_sha to be accurate, run a redeploy after the restore:

cd /srv/storage/atocore/app
bash deploy/dalidou/deploy.sh

Post-restore verification

# 1. Service is healthy
curl -fsS http://127.0.0.1:8100/health | jq .

# 2. Stats look right
curl -fsS http://127.0.0.1:8100/stats | jq .

# 3. Project registry loads
curl -fsS http://127.0.0.1:8100/projects | jq '.projects | length'

# 4. A known-good context query returns non-empty results
curl -fsS -X POST http://127.0.0.1:8100/context/build \
  -H "Content-Type: application/json" \
  -d '{"prompt": "what is p05 about", "project": "p05-interferometer"}' | jq '.chunks_used'

If any of these are wrong, the restore is bad. Roll back using the pre-restore safety snapshot whose stamp you recorded from the restore output. The rollback is the same procedure — stop the service and restore that stamp:

docker compose down
docker compose run --rm --entrypoint python atocore \
    -m atocore.ops.backup restore \
        $PRE_RESTORE_SNAPSHOT_STAMP \
        --confirm-service-stopped \
        --no-pre-snapshot
docker compose up -d

(--no-pre-snapshot because the rollback itself doesn't need one; you already have the original snapshot as a fallback if everything goes sideways.)

Restore drill

The restore is exercised at three levels:

  1. Unit tests. tests/test_backup.py has seven restore tests (refuse-without-confirm, invalid backup, full round-trip, Chroma round-trip, inode-stability regression, WAL sidecar cleanup, skip-pre-snapshot). These run in CI on every commit.
  2. Module-level round-trip. test_restore_round_trip_reverses_post_backup_mutations is the canonical drill in code form: seed a baseline, snapshot, mutate, restore, then assert that the mutation was reversed, the baseline survived, and the pre-restore snapshot captured the mutation.
  3. Live drill on Dalidou. Periodically run the full procedure against the real service with a disposable drill-marker memory (created via POST /memory with memory_type=episodic and project=drill), following the sequence above and then verifying the marker is gone afterward via GET /memory?project=drill. The first such drill on 2026-04-09 surfaced the bind-mount bug; future runs primarily exist to verify the fix stays fixed.

Run the live drill:

  • Before enabling any new write-path automation (auto-capture, automated ingestion, reinforcement sweeps).
  • After any change to src/atocore/ops/backup.py or to schema migrations in src/atocore/models/database.py.
  • After a Dalidou OS upgrade or docker version bump.
  • At least once per quarter as a standing operational check.
  • After any incident that touched the storage layer.

Record each drill run (stamp, pre-restore snapshot stamp, pass/fail, any surprises) somewhere durable — a line in the project journal or a git commit message is enough. A drill you ran once and never again is barely more than a drill you never ran.

Retention policy

  • Last 7 daily backups: kept verbatim
  • Last 4 weekly backups (Sunday): kept verbatim
  • Last 6 monthly backups (1st of month): kept verbatim
  • Anything older: deleted

The retention job is not yet implemented and is tracked as a follow-up. Until then, the snapshots directory grows monotonically. A simple cron-based cleanup script is the next step:

0 4 * * * /srv/storage/atocore/scripts/cleanup-old-backups.sh
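The selection logic that cleanup script needs is small. A sketch of the keep-set computation under the policy above (the thresholds are this document's policy; the function and its name are hypothetical):

```python
from datetime import datetime

def stamps_to_keep(stamps):
    """Apply the 7-daily / 4-weekly / 6-monthly retention policy.

    Stamps are UTC YYYYMMDDTHHMMSSZ directory names; lexicographic
    sort is chronological, newest last. Everything not returned
    is eligible for deletion.
    """
    ordered = sorted(stamps)
    keep = set(ordered[-7:])                      # last 7 daily

    def parsed(s):
        return datetime.strptime(s, "%Y%m%dT%H%M%SZ")

    sundays = [s for s in ordered if parsed(s).weekday() == 6]
    keep.update(sundays[-4:])                     # last 4 Sunday backups

    firsts = [s for s in ordered if parsed(s).day == 1]
    keep.update(firsts[-6:])                      # last 6 first-of-month
    return keep
```

A stamp can satisfy several rules at once (a Sunday the 1st counts as daily, weekly, and monthly), which the set union handles for free.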

Common failure modes and what to do about them

| Symptom | Likely cause | Action |
|---|---|---|
| db_integrity_check_failed on validation | SQLite snapshot copied while a write was in progress, or disk corruption | Take a fresh backup and validate again. If it fails twice, suspect the underlying disk. |
| registry_invalid_json | Registry was being edited at backup time | Take a fresh backup. The registry is small, so this is cheap. |
| Restore: restored_integrity_ok: false | Source snapshot was itself corrupt (validation should have caught it — file a bug), or the copy was interrupted mid-write | Do NOT start the service. Validate the snapshot directly with python -m atocore.ops.backup validate <STAMP>, try an older snapshot, or roll back to the pre-restore safety snapshot. |
| Restore: OSError [Errno 16] Device or resource busy on Chroma | Old code tried to rmtree the Chroma mount point; fixed on 2026-04-09 and guarded by test_restore_chroma_does_not_unlink_destination_directory | Ensure you're running a build from 2026-04-09 or later; to work around an older build, use --no-chroma and restore the Chroma contents manually. |
| chroma_snapshot_missing after a restore | Snapshot was DB-only | Either rebuild via fresh ingestion or restore an older snapshot that includes Chroma. |
| Service won't start after restore | Permissions wrong on the restored files | Re-run chown 1000:1000 (or whatever user the atocore container runs as) on the data dir. |
| /stats returns 0 documents after restore | The SQL store was restored but the source paths in source_documents don't match the current Dalidou paths | The backup came from a different deployment. Don't trust this restore — it's pulling from the wrong layout. |
| Drill marker still present after restore | Wrong stamp, the service was still writing during docker compose down, or the restore JSON didn't report db_restored: true | Roll back via the pre-restore safety snapshot and retry with the correct source snapshot. |

Open follow-ups (not yet implemented)

Tracked separately in docs/next-steps.md — the list below is the backup-specific subset.

  1. Retention cleanup script: see the cron entry above. The snapshots directory grows monotonically until this exists.
  2. Off-Dalidou backup target: currently snapshots live on the same disk as the live data. A real disaster-recovery story needs at least one snapshot on a different physical machine. The simplest first step is a periodic rsync to the user's laptop or to another server.
  3. Backup encryption: snapshots contain raw SQLite and JSON. Consider age/gpg encryption if backups will be shipped off-site.
  4. Automatic post-backup validation: today the validator must be invoked manually. The create_runtime_backup function should call validate_backup on its own output and refuse to declare success if validation fails.
  5. Chroma backup is currently full directory copy every time. For large vector stores this gets expensive. A future improvement would be incremental snapshots via filesystem-level snapshotting (LVM, btrfs, ZFS).

Done (kept for historical reference):

  • Implement restore_runtime_backup() as a proper module function so the restore isn't a manual sudo cp dance — landed 2026-04-09 in commit 3362080, followed by the Chroma bind-mount fix from the first real drill.

Quickstart cheat sheet

# Daily backup (DB + registry only — fast)
curl -fsS -X POST http://127.0.0.1:8100/admin/backup \
  -H "Content-Type: application/json" -d '{}'

# Weekly backup (DB + registry + Chroma — slower, holds ingestion lock)
curl -fsS -X POST http://127.0.0.1:8100/admin/backup \
  -H "Content-Type: application/json" -d '{"include_chroma": true}'

# List backups
curl -fsS http://127.0.0.1:8100/admin/backup | jq '.backups[].stamp'

# Validate the most recent backup
LATEST=$(curl -fsS http://127.0.0.1:8100/admin/backup | jq -r '.backups[-1].stamp')
curl -fsS http://127.0.0.1:8100/admin/backup/$LATEST/validate | jq .

# Full restore (service must be stopped first)
cd /srv/storage/atocore/app/deploy/dalidou
docker compose down
docker compose run --rm --entrypoint python atocore \
    -m atocore.ops.backup restore $STAMP --confirm-service-stopped
docker compose up -d

# Live drill: exercise the full create -> mutate -> restore flow
# against the running service. The marker memory uses
# memory_type=episodic (valid types: identity, preference, project,
# episodic, knowledge, adaptation) and project=drill so it's easy
# to find via GET /memory?project=drill before and after.
#
# See the "Restore drill" section above for the full sequence.
STAMP=$(curl -fsS -X POST http://127.0.0.1:8100/admin/backup \
    -H 'Content-Type: application/json' \
    -d '{"include_chroma": true}' | jq -r '.backup_root' | awk -F/ '{print $NF}')

curl -fsS -X POST http://127.0.0.1:8100/memory \
    -H 'Content-Type: application/json' \
    -d '{"memory_type":"episodic","content":"DRILL-MARKER","project":"drill","confidence":1.0}'

cd /srv/storage/atocore/app/deploy/dalidou
docker compose down
docker compose run --rm --entrypoint python atocore \
    -m atocore.ops.backup restore $STAMP --confirm-service-stopped
docker compose up -d

# Marker should be gone:
curl -fsS 'http://127.0.0.1:8100/memory?project=drill' | jq .