# AtoCore Backup and Restore Procedure

## Scope

This document defines the operational procedure for backing up and
restoring AtoCore's machine state on the Dalidou deployment. It is
the practical companion to `docs/backup-strategy.md` (which defines
the strategy) and `src/atocore/ops/backup.py` (which implements the
mechanics).

The intent is that this procedure can be followed by anyone with
SSH access to Dalidou and the AtoCore admin endpoints.

## What gets backed up

A `create_runtime_backup` snapshot contains, in order of importance:

| Artifact | Source path on Dalidou | Backup destination | Always included |
|---|---|---|---|
| SQLite database | `/srv/storage/atocore/data/db/atocore.db` | `<backup_root>/db/atocore.db` | yes |
| Project registry JSON | `/srv/storage/atocore/config/project-registry.json` | `<backup_root>/config/project-registry.json` | yes (if file exists) |
| Backup metadata | (generated) | `<backup_root>/backup-metadata.json` | yes |
| Chroma vector store | `/srv/storage/atocore/data/chroma/` | `<backup_root>/chroma/` | only when `include_chroma=true` |

The SQLite snapshot uses the online `conn.backup()` API and is safe
to take while the database is in use. The Chroma snapshot is a cold
directory copy and is **only safe when no ingestion is running**;
the API endpoint enforces this by acquiring the ingestion lock for
the duration of the copy.
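
The real snapshot logic lives in `src/atocore/ops/backup.py`; as a
minimal sketch of the online-backup call it relies on (the function
name here is illustrative, not the module's actual API):

```python
import sqlite3
from pathlib import Path

def snapshot_sqlite(src_path: str, dest_path: str) -> None:
    """Copy a live SQLite database using the online backup API.

    Safe while writers are active: Connection.backup() takes a
    consistent, page-by-page snapshot without requiring the source
    database to be quiesced.
    """
    Path(dest_path).parent.mkdir(parents=True, exist_ok=True)
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        src.backup(dest)  # online copy; target gets a consistent image
    finally:
        dest.close()
        src.close()
```

This is why the DB-only backup is safe at any time, while the Chroma
directory copy needs the ingestion lock: Chroma has no equivalent
online-snapshot API.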

What is **not** in the backup:

- Source documents under `/srv/storage/atocore/sources/vault/` and
  `/srv/storage/atocore/sources/drive/`. These are read-only
  inputs and live in the user's PKM/Drive, which is backed up
  separately by their own systems.
- Application code. The container image is the source of truth for
  code; recovery means rebuilding the image, not restoring code from
  a backup.
- Logs under `/srv/storage/atocore/logs/`.
- Embeddings cache under `/srv/storage/atocore/data/cache/`.
- Temp files under `/srv/storage/atocore/data/tmp/`.

## Backup root layout

Each backup snapshot lives in its own timestamped directory:

```
/srv/storage/atocore/backups/snapshots/
├── 20260407T060000Z/
│   ├── backup-metadata.json
│   ├── db/
│   │   └── atocore.db
│   ├── config/
│   │   └── project-registry.json
│   └── chroma/          # only if include_chroma=true
│       └── ...
├── 20260408T060000Z/
│   └── ...
└── ...
```

The timestamp is UTC, format `YYYYMMDDTHHMMSSZ`.

## Triggering a backup

### Option A — via the admin endpoint (preferred)

```bash
# DB + registry only (fast, safe at any time)
curl -fsS -X POST http://dalidou:8100/admin/backup \
  -H "Content-Type: application/json" \
  -d '{"include_chroma": false}'

# DB + registry + Chroma (acquires ingestion lock)
curl -fsS -X POST http://dalidou:8100/admin/backup \
  -H "Content-Type: application/json" \
  -d '{"include_chroma": true}'
```

The response is the backup metadata JSON. Save the `backup_root`
field — that's the directory the snapshot was written to.

### Option B — via the standalone script (when the API is down)

```bash
docker exec atocore python -m atocore.ops.backup
```

This runs `create_runtime_backup()` directly, without going through
the API or the ingestion lock. Use it only when the AtoCore service
itself is unhealthy and you can't hit the admin endpoint.

### Option C — manual file copy (last resort)

If both the API and the standalone script are unusable:

```bash
sudo systemctl stop atocore   # or: docker compose stop atocore
sudo cp /srv/storage/atocore/data/db/atocore.db \
  /srv/storage/atocore/backups/manual-$(date -u +%Y%m%dT%H%M%SZ).db
sudo cp /srv/storage/atocore/config/project-registry.json \
  /srv/storage/atocore/backups/manual-$(date -u +%Y%m%dT%H%M%SZ).registry.json
sudo systemctl start atocore
```

This is a cold backup and requires brief downtime.

## Listing backups

```bash
curl -fsS http://dalidou:8100/admin/backup
```

Returns the configured `backup_dir` and a list of all snapshots
under it, with their full metadata if available.

Or, on the host directly:

```bash
ls -la /srv/storage/atocore/backups/snapshots/
```
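
When the service is down and you still want the metadata the API
would have returned, the same listing can be reconstructed from the
snapshot layout directly. A sketch, assuming only the directory
structure described above (the function is illustrative, not part
of the module):

```python
import json
from pathlib import Path

SNAPSHOT_ROOT = Path("/srv/storage/atocore/backups/snapshots")

def list_snapshots(root: Path = SNAPSHOT_ROOT) -> list[dict]:
    """One entry per snapshot dir, oldest first, with metadata if present.

    Lexicographic sort of the stamp dirs is chronological because the
    stamp format is fixed-width.
    """
    entries = []
    for d in sorted(p for p in root.iterdir() if p.is_dir()):
        meta_path = d / "backup-metadata.json"
        meta = json.loads(meta_path.read_text()) if meta_path.exists() else None
        entries.append({"stamp": d.name, "metadata": meta})
    return entries
```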

## Validating a backup

Before relying on a backup for restore, validate it:

```bash
curl -fsS http://dalidou:8100/admin/backup/20260407T060000Z/validate
```

The validator:

- confirms the snapshot directory exists
- opens the SQLite snapshot and runs `PRAGMA integrity_check`
- parses the registry JSON
- confirms the Chroma directory exists (if it was included)

A valid backup returns `"valid": true` and an empty `errors` array.
A failing validation returns `"valid": false` with one or more
specific error strings (e.g. `db_integrity_check_failed`,
`registry_invalid_json`, `chroma_snapshot_missing`).
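
The checks above are simple enough to re-run by hand against a
snapshot directory. A sketch that mirrors them (the real validator
lives in `src/atocore/ops/backup.py` and may differ in detail):

```python
import json
import sqlite3
from pathlib import Path

def validate_snapshot(root: Path, expect_chroma: bool = False) -> dict:
    """Re-run the runbook's validation checks against one snapshot dir."""
    errors = []
    if not root.is_dir():
        return {"valid": False, "errors": ["snapshot_missing"]}
    db = root / "db" / "atocore.db"
    if not db.exists():
        errors.append("db_snapshot_missing")
    else:
        # integrity_check returns a single row ("ok",) on a healthy db
        row = sqlite3.connect(db).execute("PRAGMA integrity_check").fetchone()
        if row[0] != "ok":
            errors.append("db_integrity_check_failed")
    reg = root / "config" / "project-registry.json"
    if reg.exists():
        try:
            json.loads(reg.read_text())
        except ValueError:
            errors.append("registry_invalid_json")
    if expect_chroma and not (root / "chroma").is_dir():
        errors.append("chroma_snapshot_missing")
    return {"valid": not errors, "errors": errors}
```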

**Validate every backup at creation time.** A backup that has never
been validated is not actually a backup — it's just a hopeful copy
of bytes.

## Restore procedure

Since 2026-04-09 the restore is implemented as a proper module
function plus CLI entry point: `restore_runtime_backup()` in
`src/atocore/ops/backup.py`, invoked as
`python -m atocore.ops.backup restore <STAMP> --confirm-service-stopped`.
It automatically takes a pre-restore safety snapshot (your rollback
anchor), handles SQLite WAL/SHM sidecars cleanly, restores the
registry, and runs `PRAGMA integrity_check` on the restored db.
This replaces the earlier manual `sudo cp` sequence.

The function refuses to run without `--confirm-service-stopped`.
This is deliberate: hot-restoring into a running service corrupts
SQLite state.
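
"Handles WAL/SHM cleanly" matters more than it sounds: stale `-wal`
and `-shm` sidecars left over from the old database must not survive
the restore, or SQLite would try to replay the old write-ahead log
against the restored file. A sketch of that step, under the
assumption (stated, not verified against the module) that the
restore copies the snapshot file over the live path:

```python
import shutil
from pathlib import Path

def restore_db_file(snapshot_db: Path, live_db: Path) -> None:
    """Replace the live DB file with the snapshot copy.

    Removes stale WAL/SHM sidecars first so SQLite cannot replay
    pre-restore writes against the restored database.
    """
    for suffix in ("-wal", "-shm"):
        sidecar = live_db.with_name(live_db.name + suffix)
        sidecar.unlink(missing_ok=True)
    shutil.copy2(snapshot_db, live_db)
```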

### Pre-flight (always)

1. Identify which snapshot you want to restore. List available
   snapshots and pick by timestamp:

   ```bash
   curl -fsS http://127.0.0.1:8100/admin/backup | jq '.backups[].stamp'
   ```

2. Validate it. Refuse to restore an invalid backup:

   ```bash
   STAMP=20260409T060000Z
   curl -fsS http://127.0.0.1:8100/admin/backup/$STAMP/validate | jq .
   ```

3. **Stop AtoCore.** SQLite cannot be hot-restored under a running
   process, and Chroma will not pick up new files until the process
   restarts.

   ```bash
   cd /srv/storage/atocore/app/deploy/dalidou
   docker compose down
   docker compose ps   # atocore should be Exited/gone
   ```

### Run the restore

Use a one-shot container that reuses the live service's volume
mounts so every path (`db_path`, `chroma_path`, backup dir) resolves
to the same place the main service would see:

```bash
cd /srv/storage/atocore/app/deploy/dalidou
docker compose run --rm --entrypoint python atocore \
  -m atocore.ops.backup restore \
  $STAMP \
  --confirm-service-stopped
```

Output is a JSON document. The critical fields:

- `pre_restore_snapshot`: stamp of the safety snapshot of live
  state taken right before the restore. **Write this down.** If
  the restore was the wrong call, this is how you roll it back.
- `db_restored`: should be `true`
- `registry_restored`: `true` if the backup captured a registry
- `chroma_restored`: `true` if the backup captured a Chroma tree
  and `include_chroma` resolved to true (the default)
- `restored_integrity_ok`: **must be `true`** — if this is false,
  STOP and do not start the service; investigate the integrity
  error first. The restored file is still on disk but untrusted.
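
If you script the restore, the go/no-go decision on those fields is
worth encoding rather than eyeballing. A sketch (the field names
come from the list above; the helper itself is illustrative):

```python
import json

REQUIRED_TRUE = ("db_restored", "restored_integrity_ok")

def check_restore_output(raw: str) -> str:
    """Parse the restore CLI's JSON output and decide go/no-go.

    Returns the pre-restore snapshot stamp (the rollback anchor)
    when the restore looks safe; raises if any must-be-true field
    isn't, in which case the service must NOT be started.
    """
    out = json.loads(raw)
    for field in REQUIRED_TRUE:
        if out.get(field) is not True:
            raise RuntimeError(f"restore not safe: {field}={out.get(field)!r}")
    return out["pre_restore_snapshot"]
```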

### Controlling the restore

The CLI supports a few flags for finer control:

- `--no-pre-snapshot` skips the pre-restore safety snapshot. Use
  this only when you know you have another rollback path.
- `--no-chroma` restores only SQLite + registry, leaving the
  current Chroma dir alone. Useful if Chroma is consistent but
  SQLite needs a rollback.
- `--chroma` forces Chroma restoration even if the metadata
  doesn't clearly indicate the snapshot has it (rare).

### Chroma restore and bind-mounted volumes

The Chroma dir on Dalidou is a bind-mounted Docker volume. The
restore cannot `rmtree` the destination (you can't unlink a mount
point — it raises `OSError: [Errno 16] Device or resource busy`),
so the function clears the dir's contents and uses
`copytree(dirs_exist_ok=True)` to copy the snapshot back in. The
regression test `test_restore_chroma_does_not_unlink_destination_directory`
in `tests/test_backup.py` captures the destination inode before
and after restore and asserts it's stable — the same invariant
that protects the bind mount.
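
The clear-contents-then-copy pattern looks like this (a sketch of
the technique, not the module's exact code — the invariant is that
the destination directory itself is never unlinked, so its inode,
and therefore the mount, survives):

```python
import shutil
from pathlib import Path

def restore_dir_in_place(snapshot: Path, dest: Path) -> None:
    """Restore a directory tree without unlinking dest itself.

    dest may be a bind-mounted volume, so rmtree(dest) would fail
    with EBUSY; instead, delete its contents and copy the snapshot
    back into the same (still-existing) directory.
    """
    for child in dest.iterdir():
        if child.is_dir() and not child.is_symlink():
            shutil.rmtree(child)
        else:
            child.unlink()
    shutil.copytree(snapshot, dest, dirs_exist_ok=True)
```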

This was discovered during the first real Dalidou restore drill
on 2026-04-09. If you see a new restore failure with
`Device or resource busy`, something has regressed this fix.

### Restart AtoCore

```bash
cd /srv/storage/atocore/app/deploy/dalidou
docker compose up -d
# Wait for /health to come up
for i in 1 2 3 4 5 6 7 8 9 10; do
  curl -fsS http://127.0.0.1:8100/health \
    && break || { echo "not ready ($i/10)"; sleep 3; }
done
```

**Note on `build_sha` after restore:** The one-shot `docker compose run`
container does not carry the build provenance env vars that `deploy.sh`
exports at deploy time. After a restore, `/health` will report
`build_sha: "unknown"` until you re-run `deploy.sh` or manually
re-deploy. This is cosmetic — the data is correctly restored — but if
you need `build_sha` to be accurate, run a redeploy after the restore:

```bash
cd /srv/storage/atocore/app
bash deploy/dalidou/deploy.sh
```

### Post-restore verification

```bash
# 1. Service is healthy
curl -fsS http://127.0.0.1:8100/health | jq .

# 2. Stats look right
curl -fsS http://127.0.0.1:8100/stats | jq .

# 3. Project registry loads
curl -fsS http://127.0.0.1:8100/projects | jq '.projects | length'

# 4. A known-good context query returns non-empty results
curl -fsS -X POST http://127.0.0.1:8100/context/build \
  -H "Content-Type: application/json" \
  -d '{"prompt": "what is p05 about", "project": "p05-interferometer"}' | jq '.chunks_used'
```

If any of these are wrong, the restore is bad. Roll back using the
pre-restore safety snapshot whose stamp you recorded from the
restore output. The rollback is the same procedure — stop the
service and restore that stamp:

```bash
docker compose down
docker compose run --rm --entrypoint python atocore \
  -m atocore.ops.backup restore \
  $PRE_RESTORE_SNAPSHOT_STAMP \
  --confirm-service-stopped \
  --no-pre-snapshot
docker compose up -d
```

(`--no-pre-snapshot` because the rollback itself doesn't need one;
you already have the original snapshot as a fallback if everything
goes sideways.)

### Restore drill

The restore is exercised at three levels:

1. **Unit tests.** `tests/test_backup.py` has seven restore tests
   (refuse-without-confirm, invalid backup, full round-trip,
   Chroma round-trip, inode-stability regression, WAL sidecar
   cleanup, skip-pre-snapshot). These run in CI on every commit.
2. **Module-level round-trip.**
   `test_restore_round_trip_reverses_post_backup_mutations` is
   the canonical drill in code form: seed baseline, snapshot,
   mutate, restore, then assert the mutation was reversed, the
   baseline survived, and the pre-restore snapshot captured the
   mutation.
3. **Live drill on Dalidou.** Periodically run the full procedure
   against the real service with a disposable drill-marker
   memory (created via `POST /memory` with `memory_type=episodic`
   and `project=drill`), following the sequence above and then
   verifying the marker is gone afterward via
   `GET /memory?project=drill`. The first such drill on
   2026-04-09 surfaced the bind-mount bug; future runs
   primarily exist to verify the fix stays fixed.

Run the live drill:

- **Before** enabling any new write-path automation (auto-capture,
  automated ingestion, reinforcement sweeps).
- **After** any change to `src/atocore/ops/backup.py` or to
  schema migrations in `src/atocore/models/database.py`.
- **After** a Dalidou OS upgrade or Docker version bump.
- **At least once per quarter** as a standing operational check.
- **After any incident** that touched the storage layer.

Record each drill run (stamp, pre-restore snapshot stamp, pass/fail,
any surprises) somewhere durable — a line in the project journal
or a git commit message is enough. A drill you ran once and never
ran again is barely more than a drill you never ran.

## Retention policy

- **Last 7 daily backups**: kept verbatim
- **Last 4 weekly backups** (Sunday): kept verbatim
- **Last 6 monthly backups** (1st of month): kept verbatim
- **Anything older**: deleted

The retention job is **not yet implemented** and is tracked as a
follow-up. Until then, the snapshots directory grows monotonically.
A simple cron-based cleanup script is the next step:

```cron
0 4 * * * /srv/storage/atocore/scripts/cleanup-old-backups.sh
```
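
Since the cleanup script does not exist yet, here is the keep-set
logic the policy above implies, as a sketch (function name and shape
are illustrative; stamps use the `YYYYMMDDTHHMMSSZ` format defined
earlier):

```python
from datetime import datetime

def keep_stamps(stamps: list[str]) -> set[str]:
    """Return the snapshot stamps the retention policy keeps.

    Last 7 snapshots, last 4 Sunday snapshots, last 6 first-of-month
    snapshots; everything else is eligible for deletion.
    """
    dated = sorted((datetime.strptime(s, "%Y%m%dT%H%M%SZ"), s) for s in stamps)
    daily = [s for _, s in dated][-7:]
    weekly = [s for d, s in dated if d.weekday() == 6][-4:]   # Sunday
    monthly = [s for d, s in dated if d.day == 1][-6:]
    return set(daily) | set(weekly) | set(monthly)
```

A cleanup script would delete every snapshot directory whose stamp
is not in the returned set — ideally only after validating that at
least one kept snapshot passes validation.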

## Common failure modes and what to do about them

| Symptom | Likely cause | Action |
|---|---|---|
| `db_integrity_check_failed` on validation | SQLite snapshot copied while a write was in progress, or disk corruption | Take a fresh backup and validate again. If it fails twice, suspect the underlying disk. |
| `registry_invalid_json` | Registry was being edited at backup time | Take a fresh backup. The registry is small, so this is cheap. |
| Restore: `restored_integrity_ok: false` | Source snapshot was itself corrupt (validation should have caught it — file a bug), or the copy was interrupted mid-write | Do NOT start the service. Validate the snapshot directly with `python -m atocore.ops.backup validate <STAMP>`, try an older snapshot, or roll back to the pre-restore safety snapshot. |
| Restore: `OSError: [Errno 16] Device or resource busy` on Chroma | Old code tried to `rmtree` the Chroma mount point; fixed on 2026-04-09 and guarded by `test_restore_chroma_does_not_unlink_destination_directory` | Ensure you're running a build from 2026-04-09 or later; to work around an older build, use `--no-chroma` and restore the Chroma contents manually. |
| `chroma_snapshot_missing` after a restore | Snapshot was DB-only | Either rebuild via fresh ingestion or restore an older snapshot that includes Chroma. |
| Service won't start after restore | Permissions wrong on the restored files | Re-run `chown 1000:1000` (or whatever the atocore container user is) on the data dir. |
| `/stats` returns 0 documents after restore | The SQL store was restored but the source paths in `source_documents` don't match the current Dalidou paths | The backup came from a different deployment. Don't trust this restore — it's pulling from the wrong layout. |
| Drill marker still present after restore | Wrong stamp, the service was still writing during `docker compose down`, or the restore JSON didn't report `db_restored: true` | Roll back via the pre-restore safety snapshot and retry with the correct source snapshot. |
## Open follow-ups (not yet implemented)

Tracked separately in `docs/next-steps.md` — the list below is the
backup-specific subset.

1. **Retention cleanup script**: see the cron entry above. The
   snapshots directory grows monotonically until this exists.
2. **Off-Dalidou backup target**: currently snapshots live on the
   same disk as the live data. A real disaster-recovery story
   needs at least one snapshot on a different physical machine.
   The simplest first step is a periodic `rsync` to the user's
   laptop or to another server.
3. **Backup encryption**: snapshots contain raw SQLite and JSON.
   Consider age/gpg encryption if backups will be shipped off-site.
4. **Automatic post-backup validation**: today the validator must
   be invoked manually. The `create_runtime_backup` function
   should call `validate_backup` on its own output and refuse to
   declare success if validation fails.
5. **Chroma backup is currently a full directory copy** every time.
   For large vector stores this gets expensive. A future
   improvement would be incremental snapshots via filesystem-level
   snapshotting (LVM, btrfs, ZFS).

**Done** (kept for historical reference):

- ~~Implement `restore_runtime_backup()` as a proper module
  function so the restore isn't a manual `sudo cp` dance~~ —
  landed 2026-04-09 in commit 3362080, followed by the
  Chroma bind-mount fix from the first real drill.

## Quickstart cheat sheet

```bash
# Daily backup (DB + registry only — fast)
curl -fsS -X POST http://127.0.0.1:8100/admin/backup \
  -H "Content-Type: application/json" -d '{}'

# Weekly backup (DB + registry + Chroma — slower, holds ingestion lock)
curl -fsS -X POST http://127.0.0.1:8100/admin/backup \
  -H "Content-Type: application/json" -d '{"include_chroma": true}'

# List backups
curl -fsS http://127.0.0.1:8100/admin/backup | jq '.backups[].stamp'

# Validate the most recent backup
LATEST=$(curl -fsS http://127.0.0.1:8100/admin/backup | jq -r '.backups[-1].stamp')
curl -fsS http://127.0.0.1:8100/admin/backup/$LATEST/validate | jq .

# Full restore (service must be stopped first)
cd /srv/storage/atocore/app/deploy/dalidou
docker compose down
docker compose run --rm --entrypoint python atocore \
  -m atocore.ops.backup restore $STAMP --confirm-service-stopped
docker compose up -d

# Live drill: exercise the full create -> mutate -> restore flow
# against the running service. The marker memory uses
# memory_type=episodic (valid types: identity, preference, project,
# episodic, knowledge, adaptation) and project=drill so it's easy
# to find via GET /memory?project=drill before and after.
#
# See the "Restore drill" section above for the full sequence.
STAMP=$(curl -fsS -X POST http://127.0.0.1:8100/admin/backup \
  -H 'Content-Type: application/json' \
  -d '{"include_chroma": true}' | jq -r '.backup_root' | awk -F/ '{print $NF}')

curl -fsS -X POST http://127.0.0.1:8100/memory \
  -H 'Content-Type: application/json' \
  -d '{"memory_type":"episodic","content":"DRILL-MARKER","project":"drill","confidence":1.0}'

cd /srv/storage/atocore/app/deploy/dalidou
docker compose down
docker compose run --rm --entrypoint python atocore \
  -m atocore.ops.backup restore $STAMP --confirm-service-stopped
docker compose up -d

# Marker should be gone:
curl -fsS 'http://127.0.0.1:8100/memory?project=drill' | jq .
```