ATOCore/docs/backup-restore-procedure.md
Anto01 1a8fdf4225 fix: chroma restore bind-mount bug + consolidate docs
Two fixes from the 2026-04-09 first real restore drill on Dalidou,
plus the long-overdue doc consolidation I should have done when I
added the drill runbook instead of creating a duplicate.

## Chroma restore bind-mount bug (drill finding)

src/atocore/ops/backup.py: restore_runtime_backup() used to call
shutil.rmtree(dst_chroma) before copying the snapshot back. In the
Dockerized Dalidou deployment the chroma dir is a bind-mounted
volume — you can't unlink a mount point, rmtree raises
  OSError [Errno 16] Device or resource busy
and the restore silently fails to touch Chroma. This bit the first
real drill; the operator worked around it with --no-chroma plus a
manual cp -a.

Fix: clear the destination's CONTENTS (iterdir + rmtree/unlink per
child) and use copytree(dirs_exist_ok=True) so the mount point
itself is never touched. Equivalent semantics, bind-mount-safe.

Regression test:
tests/test_backup.py::test_restore_chroma_does_not_unlink_destination_directory
captures Path.stat().st_ino of the dest dir before and after
restore and asserts they match. That's the same invariant a
bind-mounted chroma dir enforces — if the inode changed, the
mount would have failed. 11/11 backup tests now pass.

## Doc consolidation

docs/backup-restore-drill.md existed as a duplicate of the
authoritative docs/backup-restore-procedure.md. When I added the
drill runbook in commit 3362080 I wrote it from scratch instead of
updating the existing procedure — bad doc hygiene on a project
that's literally about being a context engine.

- Deleted docs/backup-restore-drill.md
- Folded its contents into docs/backup-restore-procedure.md:
  - Replaced the manual sudo cp restore sequence with the new
    `python -m atocore.ops.backup restore <STAMP>
    --confirm-service-stopped` CLI
  - Added the one-shot docker compose run pattern for running
    restore inside a container that reuses the live volume mounts
  - Documented the --no-pre-snapshot / --no-chroma / --chroma flags
  - New "Chroma restore and bind-mounted volumes" subsection
    explaining the bug and the regression test that protects the fix
  - New "Restore drill" subsection with three levels (unit tests,
    module round-trip, live Dalidou drill) and the cadence list
  - Failure-mode table gained four entries: restored_integrity_ok,
    Device-or-resource-busy, drill marker still present,
    chroma_snapshot_missing
  - "Open follow-ups" struck the restore_runtime_backup item (done)
    and added a "Done (historical)" note referencing 2026-04-09
  - Quickstart cheat sheet now has a full drill one-liner using
    memory_type=episodic (the 2026-04-09 drill found the runbook's
    memory_type=note was invalid — the valid set is identity,
    preference, project, episodic, knowledge, adaptation)

## Status doc sync

Long overdue — I've been landing code without updating the
project's narrative state docs.

docs/current-state.md:
- "Reliability Baseline" now reflects: restore_runtime_backup is
  real with CLI, pre-restore safety snapshot, WAL cleanup,
  integrity check; live drill on 2026-04-09 surfaced and fixed
  Chroma bind-mount bug; deploy provenance via /health build_sha;
  deploy.sh self-update re-exec guard
- "Immediate Next Focus" reshuffled: drill re-run (priority 1) and
  auto-capture (priority 2) are now ahead of retrieval quality work,
  reflecting the updated unblock sequence

docs/next-steps.md:
- New item 1: re-run the drill with chroma working end-to-end
- New item 2: auto-capture conservative mode (Stop hook)
- Old item 7 rewritten as item 9 listing what's DONE
  (create/list/validate/restore, admin/backup endpoint with
  include_chroma, /health provenance, self-update guard,
  procedure doc with failure modes) and what's still pending
  (retention cleanup, off-Dalidou target, auto-validation)

## Test count

226 passing (was 225 + 1 new inode-stability regression test).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 09:13:21 -04:00


AtoCore Backup and Restore Procedure

Scope

This document defines the operational procedure for backing up and restoring AtoCore's machine state on the Dalidou deployment. It is the practical companion to docs/backup-strategy.md (which defines the strategy) and src/atocore/ops/backup.py (which implements the mechanics).

The intent is that this procedure can be followed by anyone with SSH access to Dalidou and the AtoCore admin endpoints.

What gets backed up

A create_runtime_backup snapshot contains, in order of importance:

| Artifact | Source path on Dalidou | Backup destination | Always included |
| --- | --- | --- | --- |
| SQLite database | /srv/storage/atocore/data/db/atocore.db | <backup_root>/db/atocore.db | yes |
| Project registry JSON | /srv/storage/atocore/config/project-registry.json | <backup_root>/config/project-registry.json | yes (if file exists) |
| Backup metadata | (generated) | <backup_root>/backup-metadata.json | yes |
| Chroma vector store | /srv/storage/atocore/data/chroma/ | <backup_root>/chroma/ | only when include_chroma=true |

The SQLite snapshot uses the online conn.backup() API and is safe to take while the database is in use. The Chroma snapshot is a cold directory copy and is only safe when no ingestion is running; the API endpoint enforces this by acquiring the ingestion lock for the duration of the copy.
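The online copy relies on the standard library's `sqlite3` backup API. A minimal, self-contained sketch of the idea (the helper name and paths are illustrative, not the actual create_runtime_backup code):

```python
import sqlite3
import tempfile
from pathlib import Path

def snapshot_sqlite(src_db: Path, dst_db: Path) -> None:
    """Copy a live SQLite database using the online backup API."""
    dst_db.parent.mkdir(parents=True, exist_ok=True)
    src = sqlite3.connect(src_db)
    dst = sqlite3.connect(dst_db)
    try:
        src.backup(dst)  # page-level copy; safe while src is in use
    finally:
        dst.close()
        src.close()

# Round-trip check against a throwaway database.
with tempfile.TemporaryDirectory() as tmp:
    live = Path(tmp) / "atocore.db"
    snap = Path(tmp) / "snapshots" / "db" / "atocore.db"
    conn = sqlite3.connect(live)
    conn.execute("CREATE TABLE memory (content TEXT)")
    conn.execute("INSERT INTO memory VALUES ('hello')")
    conn.commit()
    snapshot_sqlite(live, snap)
    conn.close()
    snap_conn = sqlite3.connect(snap)
    rows = snap_conn.execute("SELECT content FROM memory").fetchall()
    snap_conn.close()

print(rows)  # [('hello',)]
```

Unlike a plain file copy, `Connection.backup()` coordinates with the WAL, which is why the snapshot can be taken while the service is live.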

What is not in the backup:

  • Source documents under /srv/storage/atocore/sources/vault/ and /srv/storage/atocore/sources/drive/. These are read-only inputs and live in the user's PKM/Drive, which is backed up separately by their own systems.
  • Application code. The container image is the source of truth for code; recovery means rebuilding the image, not restoring code from a backup.
  • Logs under /srv/storage/atocore/logs/.
  • Embeddings cache under /srv/storage/atocore/data/cache/.
  • Temp files under /srv/storage/atocore/data/tmp/.

Backup root layout

Each backup snapshot lives in its own timestamped directory:

/srv/storage/atocore/backups/snapshots/
  ├── 20260407T060000Z/
  │   ├── backup-metadata.json
  │   ├── db/
  │   │   └── atocore.db
  │   ├── config/
  │   │   └── project-registry.json
  │   └── chroma/                    # only if include_chroma=true
  │       └── ...
  ├── 20260408T060000Z/
  │   └── ...
  └── ...

The timestamp is UTC, format YYYYMMDDTHHMMSSZ.

Triggering a backup

Option A — via the admin endpoint (preferred)

# DB + registry only (fast, safe at any time)
curl -fsS -X POST http://dalidou:8100/admin/backup \
  -H "Content-Type: application/json" \
  -d '{"include_chroma": false}'

# DB + registry + Chroma (acquires ingestion lock)
curl -fsS -X POST http://dalidou:8100/admin/backup \
  -H "Content-Type: application/json" \
  -d '{"include_chroma": true}'

The response is the backup metadata JSON. Save the backup_root field — that's the directory the snapshot was written to.

Option B — via the standalone script (when the API is down)

docker exec atocore python -m atocore.ops.backup

This runs create_runtime_backup() directly, without going through the API or the ingestion lock. Use it only when the AtoCore service itself is unhealthy and you can't hit the admin endpoint.

Option C — manual file copy (last resort)

If both the API and the standalone script are unusable:

sudo systemctl stop atocore   # or: docker compose stop atocore
sudo cp /srv/storage/atocore/data/db/atocore.db \
        /srv/storage/atocore/backups/manual-$(date -u +%Y%m%dT%H%M%SZ).db
sudo cp /srv/storage/atocore/config/project-registry.json \
        /srv/storage/atocore/backups/manual-$(date -u +%Y%m%dT%H%M%SZ).registry.json
sudo systemctl start atocore

This is a cold backup and requires brief downtime.

Listing backups

curl -fsS http://dalidou:8100/admin/backup

Returns the configured backup_dir and a list of all snapshots under it, with their full metadata if available.

Or, on the host directly:

ls -la /srv/storage/atocore/backups/snapshots/

Validating a backup

Before relying on a backup for restore, validate it:

curl -fsS http://dalidou:8100/admin/backup/20260407T060000Z/validate

The validator:

  • confirms the snapshot directory exists
  • opens the SQLite snapshot and runs PRAGMA integrity_check
  • parses the registry JSON
  • confirms the Chroma directory exists (if it was included)

A valid backup returns "valid": true and an empty errors array. A failing validation returns "valid": false with one or more specific error strings (e.g. db_integrity_check_failed, registry_invalid_json, chroma_snapshot_missing).

Validate every backup at creation time. A backup that has never been validated is not actually a backup — it's just a hopeful copy of bytes.
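The checks are simple enough to sketch in a few lines. This is an illustrative approximation, not the real validate_backup: the include_chroma metadata key and the snapshot_missing error string are assumptions, and error strings beyond the documented ones are omitted.

```python
import json
import sqlite3
import tempfile
from pathlib import Path

def validate_snapshot(root: Path) -> dict:
    """Illustrative version of the validation checks listed above."""
    errors = []
    if not root.is_dir():
        return {"valid": False, "errors": ["snapshot_missing"]}
    db = root / "db" / "atocore.db"
    if db.exists():
        conn = sqlite3.connect(db)
        if conn.execute("PRAGMA integrity_check").fetchone()[0] != "ok":
            errors.append("db_integrity_check_failed")
        conn.close()
    registry = root / "config" / "project-registry.json"
    if registry.exists():
        try:
            json.loads(registry.read_text())
        except json.JSONDecodeError:
            errors.append("registry_invalid_json")
    meta = root / "backup-metadata.json"
    if meta.exists() and json.loads(meta.read_text()).get("include_chroma"):
        if not (root / "chroma").is_dir():
            errors.append("chroma_snapshot_missing")
    return {"valid": not errors, "errors": errors}

# Build a tiny valid snapshot and validate it.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "db").mkdir()
    conn = sqlite3.connect(root / "db" / "atocore.db")
    conn.execute("CREATE TABLE t (x TEXT)")
    conn.commit()
    conn.close()
    (root / "config").mkdir()
    (root / "config" / "project-registry.json").write_text("{}")
    report = validate_snapshot(root)

print(report)  # {'valid': True, 'errors': []}
```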

Restore procedure

Since 2026-04-09 the restore is implemented as a proper module function plus CLI entry point: restore_runtime_backup() in src/atocore/ops/backup.py, invoked as python -m atocore.ops.backup restore <STAMP> --confirm-service-stopped. It automatically takes a pre-restore safety snapshot (your rollback anchor), handles SQLite WAL/SHM cleanly, restores the registry, and runs PRAGMA integrity_check on the restored db. This replaces the earlier manual sudo cp sequence.

The function refuses to run without --confirm-service-stopped. This is deliberate: hot-restoring into a running service corrupts SQLite state.

Pre-flight (always)

  1. Identify which snapshot you want to restore. List available snapshots and pick by timestamp:
    curl -fsS http://127.0.0.1:8100/admin/backup | jq '.backups[].stamp'
    
  2. Validate it. Refuse to restore an invalid backup:
    STAMP=20260409T060000Z
    curl -fsS http://127.0.0.1:8100/admin/backup/$STAMP/validate | jq .
    
  3. Stop AtoCore. SQLite cannot be hot-restored under a running process and Chroma will not pick up new files until the process restarts.
    cd /srv/storage/atocore/app/deploy/dalidou
    docker compose down
    docker compose ps   # atocore should be Exited/gone
    

Run the restore

Use a one-shot container that reuses the live service's volume mounts so every path (db_path, chroma_path, backup dir) resolves to the same place the main service would see:

cd /srv/storage/atocore/app/deploy/dalidou
docker compose run --rm --entrypoint python atocore \
    -m atocore.ops.backup restore \
        $STAMP \
        --confirm-service-stopped

Output is a JSON document. The critical fields:

  • pre_restore_snapshot: stamp of the safety snapshot of live state taken right before the restore. Write this down. If the restore was the wrong call, this is how you roll it back.
  • db_restored: should be true
  • registry_restored: true if the backup captured a registry
  • chroma_restored: true if the backup captured a chroma tree and include_chroma resolved to true (default)
  • restored_integrity_ok: must be true — if this is false, STOP and do not start the service; investigate the integrity error first. The restored file is still on disk but untrusted.
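A small guard along these lines makes the check mechanical instead of eyeball-driven. Sketch only; it assumes exactly the field names listed above:

```python
import json

def check_restore_output(raw: str) -> str:
    """Fail fast on a bad restore; return the rollback stamp on success."""
    result = json.loads(raw)
    if not result.get("db_restored"):
        raise RuntimeError("db_restored is not true; treat the restore as failed")
    if not result.get("restored_integrity_ok"):
        raise RuntimeError("restored_integrity_ok is false: do NOT start the service")
    return result["pre_restore_snapshot"]  # your rollback anchor

# Stand-in payload shaped like the fields documented above.
sample = json.dumps({
    "pre_restore_snapshot": "20260409T120000Z",
    "db_restored": True,
    "registry_restored": True,
    "chroma_restored": True,
    "restored_integrity_ok": True,
})
print(check_restore_output(sample))  # 20260409T120000Z
```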

Controlling the restore

The CLI supports a few flags for finer control:

  • --no-pre-snapshot skips the pre-restore safety snapshot. Use this only when you know you have another rollback path.
  • --no-chroma restores only SQLite + registry, leaving the current Chroma dir alone. Useful if Chroma is consistent but SQLite needs a rollback.
  • --chroma forces Chroma restoration even if the metadata doesn't clearly indicate the snapshot has it (rare).

Chroma restore and bind-mounted volumes

The Chroma dir on Dalidou is a bind-mounted Docker volume. The restore cannot rmtree the destination (you can't unlink a mount point — it raises OSError [Errno 16] Device or resource busy), so the function clears the dir's CONTENTS and uses copytree(dirs_exist_ok=True) to copy the snapshot back in. The regression test test_restore_chroma_does_not_unlink_destination_directory in tests/test_backup.py captures the destination inode before and after restore and asserts it's stable — the same invariant that protects the bind mount.

This was discovered during the first real Dalidou restore drill on 2026-04-09. If you see a new restore failure with Device or resource busy, something has regressed this fix.
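The contents-clearing pattern and the inode invariant are easy to see in miniature. This is a sketch of the technique, not the actual backup.py code:

```python
import shutil
import tempfile
from pathlib import Path

def restore_into(dst: Path, snapshot: Path) -> None:
    """Bind-mount-safe restore: clear dst's contents, never unlink dst itself."""
    for child in dst.iterdir():
        if child.is_dir():
            shutil.rmtree(child)
        else:
            child.unlink()
    shutil.copytree(snapshot, dst, dirs_exist_ok=True)

with tempfile.TemporaryDirectory() as tmp:
    snapshot = Path(tmp) / "snap"
    snapshot.mkdir()
    (snapshot / "index.bin").write_text("restored")
    dst = Path(tmp) / "chroma"
    dst.mkdir()
    (dst / "stale.bin").write_text("stale")

    inode_before = dst.stat().st_ino
    restore_into(dst, snapshot)
    inode_after = dst.stat().st_ino
    survived = (dst / "index.bin").read_text()

print(inode_before == inode_after, survived)  # True restored
```

Because `dst` is never unlinked, the same code works whether the destination is an ordinary directory or a mount point.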

Restart AtoCore

cd /srv/storage/atocore/app/deploy/dalidou
docker compose up -d
# Wait for /health to come up
for i in 1 2 3 4 5 6 7 8 9 10; do
    curl -fsS http://127.0.0.1:8100/health \
        && break || { echo "not ready ($i/10)"; sleep 3; }
done

Post-restore verification

# 1. Service is healthy
curl -fsS http://127.0.0.1:8100/health | jq .

# 2. Stats look right
curl -fsS http://127.0.0.1:8100/stats | jq .

# 3. Project registry loads
curl -fsS http://127.0.0.1:8100/projects | jq '.projects | length'

# 4. A known-good context query returns non-empty results
curl -fsS -X POST http://127.0.0.1:8100/context/build \
  -H "Content-Type: application/json" \
  -d '{"prompt": "what is p05 about", "project": "p05-interferometer"}' | jq '.chunks_used'

If any of these are wrong, the restore is bad. Roll back using the pre-restore safety snapshot whose stamp you recorded from the restore output. The rollback is the same procedure — stop the service and restore that stamp:

docker compose down
docker compose run --rm --entrypoint python atocore \
    -m atocore.ops.backup restore \
        $PRE_RESTORE_SNAPSHOT_STAMP \
        --confirm-service-stopped \
        --no-pre-snapshot
docker compose up -d

(--no-pre-snapshot because the rollback itself doesn't need one; you already have the original snapshot as a fallback if everything goes sideways.)

Restore drill

The restore is exercised at three levels:

  1. Unit tests. tests/test_backup.py has seven restore tests (refuse-without-confirm, invalid backup, full round-trip, Chroma round-trip, inode-stability regression, WAL sidecar cleanup, skip-pre-snapshot). These run in CI on every commit.
  2. Module-level round-trip. test_restore_round_trip_reverses_post_backup_mutations is the canonical drill in code form: seed a baseline, snapshot, mutate, restore, then assert that the mutation was reversed, the baseline survived, and the pre-restore snapshot captured the mutation.
  3. Live drill on Dalidou. Periodically run the full procedure against the real service with a disposable drill-marker memory (created via POST /memory with memory_type=episodic and project=drill), following the sequence above and then verifying the marker is gone afterward via GET /memory?project=drill. The first such drill on 2026-04-09 surfaced the bind-mount bug; future runs primarily exist to verify the fix stays fixed.

Run the live drill:

  • Before enabling any new write-path automation (auto-capture, automated ingestion, reinforcement sweeps).
  • After any change to src/atocore/ops/backup.py or to schema migrations in src/atocore/models/database.py.
  • After a Dalidou OS upgrade or docker version bump.
  • At least once per quarter as a standing operational check.
  • After any incident that touched the storage layer.

Record each drill run (stamp, pre-restore snapshot stamp, pass/fail, any surprises) somewhere durable — a line in the project journal or a git commit message is enough. A drill you ran once and never again is barely more than a drill you never ran.

Retention policy

  • Last 7 daily backups: kept verbatim
  • Last 4 weekly backups (Sunday): kept verbatim
  • Last 6 monthly backups (1st of month): kept verbatim
  • Anything older: deleted

The retention job is not yet implemented and is tracked as a follow-up. Until then, the snapshots directory grows monotonically. A simple cron-based cleanup script is the next step:

0 4 * * * /srv/storage/atocore/scripts/cleanup-old-backups.sh
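Until that script exists, the keep-set the policy implies can be sketched as follows (the helper name is hypothetical, and it treats the newest seven snapshots as the dailies):

```python
from datetime import datetime

def stamps_to_keep(stamps: list[str]) -> set[str]:
    """Apply the retention policy above to a list of snapshot stamps."""
    dated = sorted((datetime.strptime(s, "%Y%m%dT%H%M%SZ"), s) for s in stamps)
    keep: set[str] = set()
    keep.update(s for _, s in dated[-7:])                        # last 7 dailies
    keep.update([s for d, s in dated if d.weekday() == 6][-4:])  # last 4 Sundays
    keep.update([s for d, s in dated if d.day == 1][-6:])        # last 6 month-firsts
    return keep

# Twelve daily snapshots, 2026-04-01 .. 2026-04-12.
stamps = [f"202604{day:02d}T060000Z" for day in range(1, 13)]
kept = sorted(stamps_to_keep(stamps))
# Keeps Apr 6-12 (dailies), Apr 5 and 12 (Sundays), Apr 1 (month first);
# everything else is eligible for deletion.
```

Anything not in the returned set would be deleted by the cleanup job.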

Common failure modes and what to do about them

| Symptom | Likely cause | Action |
| --- | --- | --- |
| db_integrity_check_failed on validation | SQLite snapshot copied while a write was in progress, or disk corruption | Take a fresh backup and validate again. If it fails twice, suspect the underlying disk. |
| registry_invalid_json | Registry was being edited at backup time | Take a fresh backup. The registry is small, so this is cheap. |
| Restore: restored_integrity_ok: false | Source snapshot was itself corrupt (validation should have caught it; file a bug) or the copy was interrupted mid-write | Do NOT start the service. Validate the snapshot directly with python -m atocore.ops.backup validate <STAMP>, try a different older snapshot, or roll back to the pre-restore safety snapshot. |
| Restore: OSError [Errno 16] Device or resource busy on Chroma | Old code tried to rmtree the Chroma mount point; fixed on 2026-04-09 and guarded by test_restore_chroma_does_not_unlink_destination_directory | Ensure you are running a build from 2026-04-09 or later; to work around an older build, use --no-chroma and restore the Chroma contents manually. |
| chroma_snapshot_missing after a restore | Snapshot was DB-only | Either rebuild via fresh ingestion or restore an older snapshot that includes Chroma. |
| Service won't start after restore | Permissions wrong on the restored files | Re-run chown 1000:1000 (or whatever the atocore container user is) on the data dir. |
| /stats returns 0 documents after restore | The SQL store was restored but the source paths in source_documents don't match the current Dalidou paths | The backup came from a different deployment. Don't trust this restore; it's pulling from the wrong layout. |
| Drill marker still present after restore | Wrong stamp, service still writing during docker compose down, or the restore JSON didn't report db_restored: true | Roll back via the pre-restore safety snapshot and retry with the correct source snapshot. |

Open follow-ups (not yet implemented)

Tracked separately in docs/next-steps.md — the list below is the backup-specific subset.

  1. Retention cleanup script: see the cron entry above. The snapshots directory grows monotonically until this exists.
  2. Off-Dalidou backup target: currently snapshots live on the same disk as the live data. A real disaster-recovery story needs at least one snapshot on a different physical machine. The simplest first step is a periodic rsync to the user's laptop or to another server.
  3. Backup encryption: snapshots contain raw SQLite and JSON. Consider age/gpg encryption if backups will be shipped off-site.
  4. Automatic post-backup validation: today the validator must be invoked manually. The create_runtime_backup function should call validate_backup on its own output and refuse to declare success if validation fails.
  5. Chroma backup is currently a full directory copy every time. For large vector stores this gets expensive. A future improvement would be incremental snapshots via filesystem-level snapshotting (LVM, btrfs, ZFS).

Done (kept for historical reference):

  • Implement restore_runtime_backup() as a proper module function so the restore isn't a manual sudo cp dance — landed 2026-04-09 in commit 3362080, followed by the Chroma bind-mount fix from the first real drill.

Quickstart cheat sheet

# Daily backup (DB + registry only — fast)
curl -fsS -X POST http://127.0.0.1:8100/admin/backup \
  -H "Content-Type: application/json" -d '{}'

# Weekly backup (DB + registry + Chroma — slower, holds ingestion lock)
curl -fsS -X POST http://127.0.0.1:8100/admin/backup \
  -H "Content-Type: application/json" -d '{"include_chroma": true}'

# List backups
curl -fsS http://127.0.0.1:8100/admin/backup | jq '.backups[].stamp'

# Validate the most recent backup
LATEST=$(curl -fsS http://127.0.0.1:8100/admin/backup | jq -r '.backups[-1].stamp')
curl -fsS http://127.0.0.1:8100/admin/backup/$LATEST/validate | jq .

# Full restore (service must be stopped first)
cd /srv/storage/atocore/app/deploy/dalidou
docker compose down
docker compose run --rm --entrypoint python atocore \
    -m atocore.ops.backup restore $STAMP --confirm-service-stopped
docker compose up -d

# Live drill: exercise the full create -> mutate -> restore flow
# against the running service. The marker memory uses
# memory_type=episodic (valid types: identity, preference, project,
# episodic, knowledge, adaptation) and project=drill so it's easy
# to find via GET /memory?project=drill before and after.
#
# See the "Restore drill" section above for the full sequence.
STAMP=$(curl -fsS -X POST http://127.0.0.1:8100/admin/backup \
    -H 'Content-Type: application/json' \
    -d '{"include_chroma": true}' | jq -r '.backup_root' | awk -F/ '{print $NF}')

curl -fsS -X POST http://127.0.0.1:8100/memory \
    -H 'Content-Type: application/json' \
    -d '{"memory_type":"episodic","content":"DRILL-MARKER","project":"drill","confidence":1.0}'

cd /srv/storage/atocore/app/deploy/dalidou
docker compose down
docker compose run --rm --entrypoint python atocore \
    -m atocore.ops.backup restore $STAMP --confirm-service-stopped
docker compose up -d

# Marker should be gone:
curl -fsS 'http://127.0.0.1:8100/memory?project=drill' | jq .