# AtoCore Backup and Restore Procedure

## Scope

This document defines the operational procedure for backing up and restoring AtoCore's machine state on the Dalidou deployment. It is the practical companion to `docs/backup-strategy.md` (which defines the strategy) and `src/atocore/ops/backup.py` (which implements the mechanics).

The intent is that this procedure can be followed by anyone with SSH access to Dalidou and the AtoCore admin endpoints.
## What gets backed up

A `create_runtime_backup` snapshot contains, in order of importance:

| Artifact | Source path on Dalidou | Backup destination | Always included |
|---|---|---|---|
| SQLite database | `/srv/storage/atocore/data/db/atocore.db` | `<backup_root>/db/atocore.db` | yes |
| Project registry JSON | `/srv/storage/atocore/config/project-registry.json` | `<backup_root>/config/project-registry.json` | yes (if file exists) |
| Backup metadata | (generated) | `<backup_root>/backup-metadata.json` | yes |
| Chroma vector store | `/srv/storage/atocore/data/chroma/` | `<backup_root>/chroma/` | only when `include_chroma=true` |
The SQLite snapshot uses the online `conn.backup()` API and is safe to take while the database is in use. The Chroma snapshot is a cold directory copy and is only safe when no ingestion is running; the API endpoint enforces this by acquiring the ingestion lock for the duration of the copy.
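The online-backup mechanism can be sketched in a few lines. This is a minimal illustration of the `conn.backup()` pattern, not the actual `src/atocore/ops/backup.py` code; the function name, paths, and error handling here are assumptions.

```python
import sqlite3
from pathlib import Path


def snapshot_sqlite(src: Path, dest: Path) -> None:
    """Copy a live SQLite database using the online backup API.

    Safe while other connections are writing: sqlite3's backup()
    re-copies pages that change mid-copy instead of producing a
    torn file, which is why no downtime is needed for the DB part.
    """
    dest.parent.mkdir(parents=True, exist_ok=True)
    with sqlite3.connect(src) as source, sqlite3.connect(dest) as target:
        # pages=-1 (the default) copies the whole database in one pass.
        source.backup(target)
```

This is also why the DB snapshot and the Chroma snapshot have different safety profiles: SQLite gives us a consistent online copy for free, while a plain directory copy of Chroma does not.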
What is not in the backup:

- Source documents under `/srv/storage/atocore/sources/vault/` and `/srv/storage/atocore/sources/drive/`. These are read-only inputs and live in the user's PKM/Drive, which is backed up separately by their own systems.
- Application code. The container image is the source of truth for code; recovery means rebuilding the image, not restoring code from a backup.
- Logs under `/srv/storage/atocore/logs/`.
- Embeddings cache under `/srv/storage/atocore/data/cache/`.
- Temp files under `/srv/storage/atocore/data/tmp/`.
## Backup root layout

Each backup snapshot lives in its own timestamped directory:

```text
/srv/storage/atocore/backups/snapshots/
├── 20260407T060000Z/
│   ├── backup-metadata.json
│   ├── db/
│   │   └── atocore.db
│   ├── config/
│   │   └── project-registry.json
│   └── chroma/            # only if include_chroma=true
│       └── ...
├── 20260408T060000Z/
│   └── ...
└── ...
```

The timestamp is UTC, format `YYYYMMDDTHHMMSSZ`.
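For reference, that stamp corresponds to this `strftime` pattern (a sketch; how the real code builds the stamp is an assumption, only the format itself comes from this document):

```python
from datetime import datetime, timezone


def backup_stamp(now=None):
    """Render a UTC timestamp in the snapshot-directory format YYYYMMDDTHHMMSSZ."""
    now = now or datetime.now(timezone.utc)
    return now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
```

Because the format sorts lexicographically in time order, `ls` and `jq '.backups[-1]'` both naturally surface the newest snapshot last.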
## Triggering a backup

### Option A — via the admin endpoint (preferred)

```bash
# DB + registry only (fast, safe at any time)
curl -fsS -X POST http://dalidou:8100/admin/backup \
  -H "Content-Type: application/json" \
  -d '{"include_chroma": false}'

# DB + registry + Chroma (acquires ingestion lock)
curl -fsS -X POST http://dalidou:8100/admin/backup \
  -H "Content-Type: application/json" \
  -d '{"include_chroma": true}'
```

The response is the backup metadata JSON. Save the `backup_root` field — that's the directory the snapshot was written to.
### Option B — via the standalone script (when the API is down)

```bash
docker exec atocore python -m atocore.ops.backup
```

This runs `create_runtime_backup()` directly, without going through the API or the ingestion lock. Use it only when the AtoCore service itself is unhealthy and you can't hit the admin endpoint.
### Option C — manual file copy (last resort)

If both the API and the standalone script are unusable:

```bash
sudo systemctl stop atocore   # or: docker compose stop atocore
sudo cp /srv/storage/atocore/data/db/atocore.db \
  /srv/storage/atocore/backups/manual-$(date -u +%Y%m%dT%H%M%SZ).db
sudo cp /srv/storage/atocore/config/project-registry.json \
  /srv/storage/atocore/backups/manual-$(date -u +%Y%m%dT%H%M%SZ).registry.json
sudo systemctl start atocore
```

This is a cold backup and requires brief downtime.
## Listing backups

```bash
curl -fsS http://dalidou:8100/admin/backup
```

Returns the configured `backup_dir` and a list of all snapshots under it, with their full metadata if available.

Or, on the host directly:

```bash
ls -la /srv/storage/atocore/backups/snapshots/
```
## Validating a backup

Before relying on a backup for restore, validate it:

```bash
curl -fsS http://dalidou:8100/admin/backup/20260407T060000Z/validate
```

The validator:

- confirms the snapshot directory exists
- opens the SQLite snapshot and runs `PRAGMA integrity_check`
- parses the registry JSON
- confirms the Chroma directory exists (if it was included)

A valid backup returns `"valid": true` and an empty `errors` array. A failing validation returns `"valid": false` with one or more specific error strings (e.g. `db_integrity_check_failed`, `registry_invalid_json`, `chroma_snapshot_missing`).

Validate every backup at creation time. A backup that has never been validated is not actually a backup — it's just a hopeful copy of bytes.
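The checks above can be sketched as a small standalone validator. This is illustrative only: of the error strings below, `db_integrity_check_failed`, `registry_invalid_json`, and `chroma_snapshot_missing` come from this runbook, while `snapshot_missing` and `db_snapshot_missing` are invented here for completeness, and `validate_snapshot` is a hypothetical helper.

```python
import json
import sqlite3
from pathlib import Path


def validate_snapshot(root: Path, expect_chroma: bool = False) -> dict:
    """Re-run the validation checks listed above against one snapshot dir."""
    if not root.is_dir():
        return {"valid": False, "errors": ["snapshot_missing"]}
    errors = []
    db = root / "db" / "atocore.db"
    if not db.is_file():
        errors.append("db_snapshot_missing")
    else:
        with sqlite3.connect(db) as conn:
            # integrity_check returns a single row ("ok",) on a healthy db
            if conn.execute("PRAGMA integrity_check").fetchone()[0] != "ok":
                errors.append("db_integrity_check_failed")
    registry = root / "config" / "project-registry.json"
    if registry.is_file():
        try:
            json.loads(registry.read_text())
        except json.JSONDecodeError:
            errors.append("registry_invalid_json")
    if expect_chroma and not (root / "chroma").is_dir():
        errors.append("chroma_snapshot_missing")
    return {"valid": not errors, "errors": errors}
```

The same shape — a boolean plus a list of specific error strings — is what makes the failure-mode table later in this document actionable.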
## Restore procedure

Since 2026-04-09 the restore is implemented as a proper module function plus CLI entry point: `restore_runtime_backup()` in `src/atocore/ops/backup.py`, invoked as:

```bash
python -m atocore.ops.backup restore <STAMP> --confirm-service-stopped
```

It automatically takes a pre-restore safety snapshot (your rollback anchor), handles SQLite WAL/SHM cleanly, restores the registry, and runs `PRAGMA integrity_check` on the restored db. This replaces the earlier manual `sudo cp` sequence.

The function refuses to run without `--confirm-service-stopped`. This is deliberate: hot-restoring into a running service corrupts SQLite state.
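The refusal behaviour can be sketched with `argparse`. The subcommand and flag names come from this runbook; the wiring and exit codes here are assumptions about how such a guard might look, not the actual entry point.

```python
import argparse
import sys


def main(argv=None) -> int:
    parser = argparse.ArgumentParser(prog="atocore.ops.backup")
    sub = parser.add_subparsers(dest="command", required=True)
    restore = sub.add_parser("restore")
    restore.add_argument("stamp")
    restore.add_argument("--confirm-service-stopped", action="store_true")
    restore.add_argument("--no-pre-snapshot", action="store_true")
    args = parser.parse_args(argv)
    if args.command == "restore" and not args.confirm_service_stopped:
        # Hot-restoring into a running service corrupts SQLite state,
        # so refuse unless the operator attests the service is down.
        print("refusing: pass --confirm-service-stopped", file=sys.stderr)
        return 2
    # ... call restore_runtime_backup(args.stamp, ...) here ...
    return 0
```

The point of the design is that the attestation lives in the command line itself, so it survives into shell history and drill notes.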
### Pre-flight (always)

1. Identify which snapshot you want to restore. List available snapshots and pick by timestamp:

   ```bash
   curl -fsS http://127.0.0.1:8100/admin/backup | jq '.backups[].stamp'
   ```

2. Validate it. Refuse to restore an invalid backup:

   ```bash
   STAMP=20260409T060000Z
   curl -fsS http://127.0.0.1:8100/admin/backup/$STAMP/validate | jq .
   ```

3. Stop AtoCore. SQLite cannot be hot-restored under a running process and Chroma will not pick up new files until the process restarts.

   ```bash
   cd /srv/storage/atocore/app/deploy/dalidou
   docker compose down
   docker compose ps   # atocore should be Exited/gone
   ```
### Run the restore

Use a one-shot container that reuses the live service's volume mounts so every path (db_path, chroma_path, backup dir) resolves to the same place the main service would see:

```bash
cd /srv/storage/atocore/app/deploy/dalidou
docker compose run --rm --entrypoint python atocore \
  -m atocore.ops.backup restore \
  $STAMP \
  --confirm-service-stopped
```

Output is a JSON document. The critical fields:

- `pre_restore_snapshot`: stamp of the safety snapshot of live state taken right before the restore. Write this down. If the restore was the wrong call, this is how you roll it back.
- `db_restored`: should be `true`
- `registry_restored`: `true` if the backup captured a registry
- `chroma_restored`: `true` if the backup captured a chroma tree and `include_chroma` resolved to true (default)
- `restored_integrity_ok`: must be `true` — if this is false, STOP and do not start the service; investigate the integrity error first. The restored file is still on disk but untrusted.
### Controlling the restore

The CLI supports a few flags for finer control:

- `--no-pre-snapshot` skips the pre-restore safety snapshot. Use this only when you know you have another rollback path.
- `--no-chroma` restores only SQLite + registry, leaving the current Chroma dir alone. Useful if Chroma is consistent but SQLite needs a rollback.
- `--chroma` forces Chroma restoration even if the metadata doesn't clearly indicate the snapshot has it (rare).
### Chroma restore and bind-mounted volumes

The Chroma dir on Dalidou is a bind-mounted Docker volume. The restore cannot `rmtree` the destination (you can't unlink a mount point — it raises `OSError [Errno 16] Device or resource busy`), so the function clears the dir's contents and uses `copytree(dirs_exist_ok=True)` to copy the snapshot back in. The regression test `test_restore_chroma_does_not_unlink_destination_directory` in `tests/test_backup.py` captures the destination inode before and after restore and asserts it's stable — the same invariant that protects the bind mount.

This was discovered during the first real Dalidou restore drill on 2026-04-09. If you see a new restore failure with `Device or resource busy`, something has regressed this fix.
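The contents-clearing pattern can be sketched as follows. This is a minimal illustration with a hypothetical `restore_into_mount` helper; the real function lives in `src/atocore/ops/backup.py`.

```python
import shutil
from pathlib import Path


def restore_into_mount(snapshot: Path, dest: Path) -> None:
    """Restore a directory snapshot into dest without unlinking dest itself.

    dest may be a bind-mount point: shutil.rmtree(dest) would try to
    unlink it and fail with OSError [Errno 16] Device or resource busy.
    So we delete only the CONTENTS, keeping the directory inode alive.
    """
    for child in dest.iterdir():
        if child.is_dir():
            shutil.rmtree(child)
        else:
            child.unlink()
    # dirs_exist_ok=True lets copytree write into the existing (now
    # empty) mount point instead of demanding a fresh directory.
    shutil.copytree(snapshot, dest, dirs_exist_ok=True)
```

The inode-stability assertion in the regression test is exactly the property this preserves: the directory object the mount points at never goes away, only what's inside it.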
### Restart AtoCore

```bash
cd /srv/storage/atocore/app/deploy/dalidou
docker compose up -d

# Wait for /health to come up
for i in 1 2 3 4 5 6 7 8 9 10; do
  curl -fsS http://127.0.0.1:8100/health \
    && break || { echo "not ready ($i/10)"; sleep 3; }
done
```
Note on `build_sha` after restore: the one-shot `docker compose run` container does not carry the build provenance env vars that `deploy.sh` exports at deploy time. After a restore, `/health` will report `build_sha: "unknown"` until you re-run `deploy.sh` or manually re-deploy. This is cosmetic — the data is correctly restored — but if you need `build_sha` to be accurate, run a redeploy after the restore:

```bash
cd /srv/storage/atocore/app
bash deploy/dalidou/deploy.sh
```
### Post-restore verification

```bash
# 1. Service is healthy
curl -fsS http://127.0.0.1:8100/health | jq .

# 2. Stats look right
curl -fsS http://127.0.0.1:8100/stats | jq .

# 3. Project registry loads
curl -fsS http://127.0.0.1:8100/projects | jq '.projects | length'

# 4. A known-good context query returns non-empty results
curl -fsS -X POST http://127.0.0.1:8100/context/build \
  -H "Content-Type: application/json" \
  -d '{"prompt": "what is p05 about", "project": "p05-interferometer"}' | jq '.chunks_used'
```
If any of these are wrong, the restore is bad. Roll back using the pre-restore safety snapshot whose stamp you recorded from the restore output. The rollback is the same procedure — stop the service and restore that stamp:

```bash
docker compose down
docker compose run --rm --entrypoint python atocore \
  -m atocore.ops.backup restore \
  $PRE_RESTORE_SNAPSHOT_STAMP \
  --confirm-service-stopped \
  --no-pre-snapshot
docker compose up -d
```

(`--no-pre-snapshot` because the rollback itself doesn't need one; you already have the original snapshot as a fallback if everything goes sideways.)
## Restore drill

The restore is exercised at three levels:

- Unit tests. `tests/test_backup.py` has seven restore tests (refuse-without-confirm, invalid backup, full round-trip, Chroma round-trip, inode-stability regression, WAL sidecar cleanup, skip-pre-snapshot). These run in CI on every commit.
- Module-level round-trip. `test_restore_round_trip_reverses_post_backup_mutations` is the canonical drill in code form: seed baseline, snapshot, mutate, restore, then assert that the mutation was reversed, the baseline survived, and the pre-restore snapshot captured the mutation.
- Live drill on Dalidou. Periodically run the full procedure against the real service with a disposable drill-marker memory (created via `POST /memory` with `memory_type=episodic` and `project=drill`), following the sequence above and then verifying the marker is gone afterward via `GET /memory?project=drill`. The first such drill on 2026-04-09 surfaced the bind-mount bug; future runs primarily exist to verify the fix stays fixed.

Run the live drill:

- Before enabling any new write-path automation (auto-capture, automated ingestion, reinforcement sweeps).
- After any change to `src/atocore/ops/backup.py` or to schema migrations in `src/atocore/models/database.py`.
- After a Dalidou OS upgrade or docker version bump.
- At least once per quarter as a standing operational check.
- After any incident that touched the storage layer.

Record each drill run (stamp, pre-restore snapshot stamp, pass/fail, any surprises) somewhere durable — a line in the project journal or a git commit message is enough. A drill you ran once and never again is barely more than a drill you never ran.
## Retention policy

- Last 7 daily backups: kept verbatim
- Last 4 weekly backups (Sunday): kept verbatim
- Last 6 monthly backups (1st of month): kept verbatim
- Anything older: deleted

The retention job is not yet implemented and is tracked as a follow-up. Until then, the snapshots directory grows monotonically. A simple cron-based cleanup script is the next step:

```
0 4 * * * /srv/storage/atocore/scripts/cleanup-old-backups.sh
```
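Since the cleanup job doesn't exist yet, here is a sketch of the keep-set computation it would need. `stamps_to_keep` is a hypothetical helper, and "last 7 daily backups" is interpreted here as the 7 most recent snapshots; the real script may slice the policy differently.

```python
from datetime import datetime


def stamps_to_keep(stamps):
    """Apply the 7-daily / 4-weekly(Sunday) / 6-monthly(1st) policy.

    Takes snapshot stamps in YYYYMMDDTHHMMSSZ format and returns the
    set of stamps to keep; everything else is eligible for deletion.
    """
    parsed = sorted((datetime.strptime(s, "%Y%m%dT%H%M%SZ"), s) for s in stamps)
    keep = set()
    keep.update(s for _, s in parsed[-7:])          # 7 most recent
    sundays = [s for d, s in parsed if d.weekday() == 6]
    keep.update(sundays[-4:])                       # last 4 Sunday snapshots
    firsts = [s for d, s in parsed if d.day == 1]
    keep.update(firsts[-6:])                        # last 6 first-of-month snapshots
    return keep
```

A cleanup script would then delete every snapshot directory whose stamp is not in the returned set, ideally refusing to run if the keep set is empty (a sign the listing failed).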
## Common failure modes and what to do about them

| Symptom | Likely cause | Action |
|---|---|---|
| `db_integrity_check_failed` on validation | SQLite snapshot copied while a write was in progress, or disk corruption | Take a fresh backup and validate again. If it fails twice, suspect the underlying disk. |
| `registry_invalid_json` | Registry was being edited at backup time | Take a fresh backup. The registry is small so this is cheap. |
| Restore: `restored_integrity_ok: false` | Source snapshot was itself corrupt (validation should have caught it — file a bug) or copy was interrupted mid-write | Do NOT start the service. Validate the snapshot directly with `python -m atocore.ops.backup validate <STAMP>`, try a different older snapshot, or roll back to the pre-restore safety snapshot. |
| Restore: `OSError [Errno 16] Device or resource busy` on Chroma | Old code tried to `rmtree` the Chroma mount point. Fixed on 2026-04-09, guarded by `test_restore_chroma_does_not_unlink_destination_directory` | Ensure you're running the 2026-04-09 commit or later; if you need to work around an older build, use `--no-chroma` and restore Chroma contents manually. |
| `chroma_snapshot_missing` after a restore | Snapshot was DB-only | Either rebuild via fresh ingestion or restore an older snapshot that includes Chroma. |
| Service won't start after restore | Permissions wrong on the restored files | Re-run `chown 1000:1000` (or whatever the atocore container user is) on the data dir. |
| `/stats` returns 0 documents after restore | The SQL store was restored but the source paths in `source_documents` don't match the current Dalidou paths | This means the backup came from a different deployment. Don't trust this restore — it's pulling from the wrong layout. |
| Drill marker still present after restore | Wrong stamp, service still writing during `docker compose down`, or the restore JSON didn't report `db_restored: true` | Roll back via the pre-restore safety snapshot and retry with the correct source snapshot. |
## Open follow-ups (not yet implemented)

Tracked separately in `docs/next-steps.md` — the list below is the backup-specific subset.

- Retention cleanup script: see the cron entry above. The snapshots directory grows monotonically until this exists.
- Off-Dalidou backup target: currently snapshots live on the same disk as the live data. A real disaster-recovery story needs at least one snapshot on a different physical machine. The simplest first step is a periodic `rsync` to the user's laptop or to another server.
- Backup encryption: snapshots contain raw SQLite and JSON. Consider age/gpg encryption if backups will be shipped off-site.
- Automatic post-backup validation: today the validator must be invoked manually. The `create_runtime_backup` function should call `validate_backup` on its own output and refuse to declare success if validation fails.
- Chroma backup is currently a full directory copy every time. For large vector stores this gets expensive. A future improvement would be incremental snapshots via filesystem-level snapshotting (LVM, btrfs, ZFS).

Done (kept for historical reference):

- Implement `restore_runtime_backup()` as a proper module function so the restore isn't a manual `sudo cp` dance — landed 2026-04-09 in commit `3362080`, followed by the Chroma bind-mount fix from the first real drill.
## Quickstart cheat sheet

```bash
# Daily backup (DB + registry only — fast)
curl -fsS -X POST http://127.0.0.1:8100/admin/backup \
  -H "Content-Type: application/json" -d '{}'

# Weekly backup (DB + registry + Chroma — slower, holds ingestion lock)
curl -fsS -X POST http://127.0.0.1:8100/admin/backup \
  -H "Content-Type: application/json" -d '{"include_chroma": true}'

# List backups
curl -fsS http://127.0.0.1:8100/admin/backup | jq '.backups[].stamp'

# Validate the most recent backup
LATEST=$(curl -fsS http://127.0.0.1:8100/admin/backup | jq -r '.backups[-1].stamp')
curl -fsS http://127.0.0.1:8100/admin/backup/$LATEST/validate | jq .

# Full restore (service must be stopped first)
cd /srv/storage/atocore/app/deploy/dalidou
docker compose down
docker compose run --rm --entrypoint python atocore \
  -m atocore.ops.backup restore $STAMP --confirm-service-stopped
docker compose up -d

# Live drill: exercise the full create -> mutate -> restore flow
# against the running service. The marker memory uses
# memory_type=episodic (valid types: identity, preference, project,
# episodic, knowledge, adaptation) and project=drill so it's easy
# to find via GET /memory?project=drill before and after.
#
# See the "Restore drill" section above for the full sequence.
STAMP=$(curl -fsS -X POST http://127.0.0.1:8100/admin/backup \
  -H 'Content-Type: application/json' \
  -d '{"include_chroma": true}' | jq -r '.backup_root' | awk -F/ '{print $NF}')
curl -fsS -X POST http://127.0.0.1:8100/memory \
  -H 'Content-Type: application/json' \
  -d '{"memory_type":"episodic","content":"DRILL-MARKER","project":"drill","confidence":1.0}'
cd /srv/storage/atocore/app/deploy/dalidou
docker compose down
docker compose run --rm --entrypoint python atocore \
  -m atocore.ops.backup restore $STAMP --confirm-service-stopped
docker compose up -d

# Marker should be gone:
curl -fsS 'http://127.0.0.1:8100/memory?project=drill' | jq .
```