Session 2 of the four-session plan. Lands two operational pieces: the Claude Code slash command that makes AtoCore reachable from inside any Claude Code session, and the full backup/restore procedure doc that turns the backup endpoint code into a real operational drill.

Slash command (.claude/commands/atocore-context.md)
---------------------------------------------------
- Project-level slash command following the standard frontmatter format (description + argument-hint)
- Parses the user prompt and an optional trailing project id, with case-insensitive matching against the registered project ids (atocore, p04-gigabit, p05-interferometer, p06-polisher and their aliases)
- Calls POST /context/build on the live AtoCore service, defaulting to http://dalidou:8100 (overridable via the ATOCORE_API_BASE env var)
- Renders the formatted context pack inline so the user can see exactly what AtoCore would feed an LLM, plus a stats banner and a per-chunk source list
- Includes graceful failure handling for network errors, 4xx, 5xx, and the empty-result case
- Defines a future capture path that POSTs to /interactions for the Phase 9 reflection loop. The current command leaves capture as manual / opt-in pending a clean post-turn hook design

.gitignore changes
------------------
- Replaced the wholesale .claude/ ignore with .claude/* plus exceptions for .claude/commands/ so project slash commands can be tracked
- Other .claude/* paths (worktrees, settings, local state) remain ignored

Backup-restore procedure (docs/backup-restore-procedure.md)
-----------------------------------------------------------
- Defines what gets backed up (SQLite + registry always, Chroma optional under the ingestion lock) and what doesn't (sources, code, logs, cache, tmp)
- Documents the snapshot directory layout and the timestamp format
- Three trigger paths in priority order:
  - via POST /admin/backup with {include_chroma: true|false}
  - via the standalone src/atocore/ops/backup.py module
  - via cold filesystem copy with brief downtime as last resort
- Listing and validation procedure with the /admin/backup and /admin/backup/{stamp}/validate endpoints
- Full step-by-step restore procedure with a mandatory pre-flight safety snapshot, ownership/permission requirements, and post-restore verification checks
- Rollback path using the pre-restore safety copy
- Retention policy (last 7 daily / 4 weekly / 6 monthly) and explicit acknowledgment that the cleanup job is not yet implemented
- Drill schedule: quarterly full restore drill, post-migration drill, post-incident validation
- Common failure mode table with diagnoses
- Quickstart cheat sheet at the end for daily reference
- Open follow-ups: cleanup script, off-Dalidou target, encryption, automatic post-backup validation, incremental Chroma snapshots

The procedure has not yet been exercised against the live Dalidou instance — that is the next step the user runs themselves once the slash command is in place.
AtoCore Backup and Restore Procedure
====================================

Scope
-----
This document defines the operational procedure for backing up and
restoring AtoCore's machine state on the Dalidou deployment. It is
the practical companion to docs/backup-strategy.md (which defines
the strategy) and src/atocore/ops/backup.py (which implements the
mechanics).
The intent is that this procedure can be followed by anyone with SSH access to Dalidou and the AtoCore admin endpoints.
What gets backed up
-------------------
A create_runtime_backup snapshot contains, in order of importance:

| Artifact | Source path on Dalidou | Backup destination | Always included |
|---|---|---|---|
| SQLite database | /srv/storage/atocore/data/db/atocore.db | <backup_root>/db/atocore.db | yes |
| Project registry JSON | /srv/storage/atocore/config/project-registry.json | <backup_root>/config/project-registry.json | yes (if file exists) |
| Backup metadata | (generated) | <backup_root>/backup-metadata.json | yes |
| Chroma vector store | /srv/storage/atocore/data/chroma/ | <backup_root>/chroma/ | only when include_chroma=true |
The SQLite snapshot uses the online conn.backup() API and is safe
to take while the database is in use. The Chroma snapshot is a cold
directory copy and is only safe when no ingestion is running;
the API endpoint enforces this by acquiring the ingestion lock for
the duration of the copy.
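The safety of the hot SQLite snapshot comes from the online backup API. The sqlite3 CLI exposes the same API through its .backup dot-command, which makes the property easy to see in isolation — a sketch against throwaway files under /tmp, not AtoCore's real paths:

```shell
# Illustrative only: the sqlite3 CLI's .backup command uses the same
# online backup API as conn.backup(), so it is safe against a live DB.
sqlite3 /tmp/live.db "CREATE TABLE IF NOT EXISTS t(x); INSERT INTO t VALUES (1);"
sqlite3 /tmp/live.db ".backup /tmp/snapshot.db"

# The snapshot is a consistent database in its own right:
sqlite3 /tmp/snapshot.db "PRAGMA integrity_check;"   # prints "ok" on a healthy snapshot
```

The Chroma cold copy has no equivalent of this API, which is exactly why it needs the ingestion lock.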
What is not in the backup:
- Source documents under /srv/storage/atocore/sources/vault/ and /srv/storage/atocore/sources/drive/. These are read-only inputs and live in the user's PKM/Drive, which are backed up separately by their own systems.
- Application code. The container image is the source of truth for code; recovery means rebuilding the image, not restoring code from a backup.
- Logs under /srv/storage/atocore/logs/.
- Embeddings cache under /srv/storage/atocore/data/cache/.
- Temp files under /srv/storage/atocore/data/tmp/.
Backup root layout
------------------
Each backup snapshot lives in its own timestamped directory:

```
/srv/storage/atocore/backups/snapshots/
├── 20260407T060000Z/
│   ├── backup-metadata.json
│   ├── db/
│   │   └── atocore.db
│   ├── config/
│   │   └── project-registry.json
│   └── chroma/            # only if include_chroma=true
│       └── ...
├── 20260408T060000Z/
│   └── ...
└── ...
```

The timestamp is UTC, format YYYYMMDDTHHMMSSZ.
Triggering a backup
-------------------

Option A — via the admin endpoint (preferred)

```shell
# DB + registry only (fast, safe at any time)
curl -fsS -X POST http://dalidou:8100/admin/backup \
  -H "Content-Type: application/json" \
  -d '{"include_chroma": false}'

# DB + registry + Chroma (acquires ingestion lock)
curl -fsS -X POST http://dalidou:8100/admin/backup \
  -H "Content-Type: application/json" \
  -d '{"include_chroma": true}'
The response is the backup metadata JSON. Save the backup_root
field — that's the directory the snapshot was written to.
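For scripted use, that field can be pulled straight out of the metadata JSON. A minimal sketch, shown against a canned response so it runs standalone; in real use RESPONSE would come from the curl above:

```shell
# In real use, replace the canned RESPONSE with:
#   RESPONSE=$(curl -fsS -X POST http://dalidou:8100/admin/backup \
#     -H "Content-Type: application/json" -d '{"include_chroma": false}')
RESPONSE='{"backup_root": "/srv/storage/atocore/backups/snapshots/20260407T060000Z"}'
BACKUP_ROOT=$(printf '%s' "$RESPONSE" | jq -r '.backup_root')
echo "snapshot written to: $BACKUP_ROOT"
```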
Option B — via the standalone script (when the API is down)

```shell
docker exec atocore python -m atocore.ops.backup
```

This runs create_runtime_backup() directly, without going through the API or the ingestion lock. Use it only when the AtoCore service itself is unhealthy and you can't hit the admin endpoint.
Option C — manual file copy (last resort)

If both the API and the standalone script are unusable:

```shell
sudo systemctl stop atocore   # or: docker compose stop atocore
STAMP=$(date -u +%Y%m%dT%H%M%SZ)   # capture once so both files share a stamp
sudo cp /srv/storage/atocore/data/db/atocore.db \
  /srv/storage/atocore/backups/manual-$STAMP.db
sudo cp /srv/storage/atocore/config/project-registry.json \
  /srv/storage/atocore/backups/manual-$STAMP.registry.json
sudo systemctl start atocore
```

This is a cold backup and requires brief downtime.
Listing backups
---------------

```shell
curl -fsS http://dalidou:8100/admin/backup
```

Returns the configured backup_dir and a list of all snapshots under it, with their full metadata if available.

Or, on the host directly:

```shell
ls -la /srv/storage/atocore/backups/snapshots/
```
Validating a backup
-------------------
Before relying on a backup for restore, validate it:

```shell
curl -fsS http://dalidou:8100/admin/backup/20260407T060000Z/validate
```

The validator:
- confirms the snapshot directory exists
- opens the SQLite snapshot and runs PRAGMA integrity_check
- parses the registry JSON
- confirms the Chroma directory exists (if it was included)

A valid backup returns "valid": true and an empty errors array. A failing validation returns "valid": false with one or more specific error strings (e.g. db_integrity_check_failed, registry_invalid_json, chroma_snapshot_missing).

Validate every backup at creation time. A backup that has never been validated is not actually a backup — it's just a hopeful copy of bytes.
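That rule is easy to automate. A minimal sketch of a validation gate, with the pass/fail logic factored into a function so it can be demonstrated against canned validator output (assert_valid is an invented name, not part of AtoCore):

```shell
# Illustration: fail loudly unless the validator reported "valid": true.
# In real use, feed this the output of:
#   curl -fsS http://dalidou:8100/admin/backup/$STAMP/validate
assert_valid() {
  if [ "$(jq -r '.valid')" = "true" ]; then
    echo "backup OK"
  else
    echo "backup INVALID" >&2
    return 1
  fi
}

echo '{"valid": true, "errors": []}' | assert_valid   # → backup OK
```

Wiring this into the daily backup path means a bad snapshot fails the cron job instead of sitting undetected until restore day.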
Restore procedure
-----------------

Pre-flight (always)

- Identify which snapshot you want to restore. List available snapshots and pick by timestamp:

  ```shell
  curl -fsS http://dalidou:8100/admin/backup | jq '.backups[].stamp'
  ```

- Validate it. Refuse to restore an invalid backup:

  ```shell
  STAMP=20260407T060000Z
  curl -fsS http://dalidou:8100/admin/backup/$STAMP/validate | jq .
  ```

- Stop AtoCore. SQLite cannot be hot-restored under a running process, and Chroma will not pick up new files until the process restarts.

  ```shell
  docker compose stop atocore   # or: sudo systemctl stop atocore
  ```

- Take a safety snapshot of the current state before overwriting it. This is your "if the restore makes things worse, here's the undo" backup.

  ```shell
  PRESERVE_STAMP=$(date -u +%Y%m%dT%H%M%SZ)
  sudo cp /srv/storage/atocore/data/db/atocore.db \
    /srv/storage/atocore/backups/pre-restore-$PRESERVE_STAMP.db
  sudo cp /srv/storage/atocore/config/project-registry.json \
    /srv/storage/atocore/backups/pre-restore-$PRESERVE_STAMP.registry.json 2>/dev/null || true
  ```
Restore the SQLite database

```shell
SNAPSHOT_DIR=/srv/storage/atocore/backups/snapshots/$STAMP
sudo cp $SNAPSHOT_DIR/db/atocore.db \
  /srv/storage/atocore/data/db/atocore.db
sudo chown 1000:1000 /srv/storage/atocore/data/db/atocore.db
sudo chmod 600 /srv/storage/atocore/data/db/atocore.db
```

The chown should match the gitea/atocore container user. Verify by checking the existing perms before overwriting:

```shell
stat -c '%U:%G %a' /srv/storage/atocore/data/db/atocore.db
```
Restore the project registry

```shell
if [ -f $SNAPSHOT_DIR/config/project-registry.json ]; then
  sudo cp $SNAPSHOT_DIR/config/project-registry.json \
    /srv/storage/atocore/config/project-registry.json
  sudo chown 1000:1000 /srv/storage/atocore/config/project-registry.json
  sudo chmod 644 /srv/storage/atocore/config/project-registry.json
fi
```

If the snapshot does not contain a registry, the current registry is preserved. The pre-flight safety copy still gives you a recovery path if you need to roll back.
Restore the Chroma vector store (if it was in the snapshot)

```shell
if [ -d $SNAPSHOT_DIR/chroma ]; then
  # Move the current chroma dir aside as a safety copy
  sudo mv /srv/storage/atocore/data/chroma \
    /srv/storage/atocore/data/chroma.pre-restore-$PRESERVE_STAMP
  # Copy the snapshot in
  sudo cp -a $SNAPSHOT_DIR/chroma /srv/storage/atocore/data/chroma
  sudo chown -R 1000:1000 /srv/storage/atocore/data/chroma
fi
```

If the snapshot does NOT contain a Chroma dir but the SQLite restore would leave the vector store and the SQL store inconsistent (e.g. SQL has chunks the vector store doesn't), you have two options:

- Option 1: rebuild the vector store from source documents. Run ingestion fresh after the SQL restore. This regenerates embeddings from the actual source files. Slow but produces a perfectly consistent state.
- Option 2: accept the inconsistency and live with stale-vector filtering. The retriever already drops vector results whose SQL row no longer exists (the _existing_chunk_ids filter), so the inconsistency surfaces as missing results, not bad ones.

For an unplanned restore, Option 2 is the right immediate move. Then schedule a fresh ingestion pass to rebuild the vector store properly.
Restart AtoCore

```shell
docker compose up -d atocore
# or: sudo systemctl start atocore
```
Post-restore verification

```shell
# 1. Service is healthy
curl -fsS http://dalidou:8100/health | jq .

# 2. Stats look right
curl -fsS http://dalidou:8100/stats | jq .

# 3. Project registry loads
curl -fsS http://dalidou:8100/projects | jq '.projects | length'

# 4. A known-good context query returns non-empty results
curl -fsS -X POST http://dalidou:8100/context/build \
  -H "Content-Type: application/json" \
  -d '{"prompt": "what is p05 about", "project": "p05-interferometer"}' | jq '.chunks_used'
```
If any of these are wrong, the restore is bad. Roll back using the pre-restore safety copy:

```shell
docker compose stop atocore
sudo cp /srv/storage/atocore/backups/pre-restore-$PRESERVE_STAMP.db \
  /srv/storage/atocore/data/db/atocore.db
sudo cp /srv/storage/atocore/backups/pre-restore-$PRESERVE_STAMP.registry.json \
  /srv/storage/atocore/config/project-registry.json 2>/dev/null || true
# If you also restored chroma:
sudo rm -rf /srv/storage/atocore/data/chroma
sudo mv /srv/storage/atocore/data/chroma.pre-restore-$PRESERVE_STAMP \
  /srv/storage/atocore/data/chroma
docker compose up -d atocore
```
Retention policy
----------------
- Last 7 daily backups: kept verbatim
- Last 4 weekly backups (Sunday): kept verbatim
- Last 6 monthly backups (1st of month): kept verbatim
- Anything older: deleted

The retention job is not yet implemented and is tracked as a follow-up. Until then, the snapshots directory grows monotonically. A simple cron-based cleanup script is the next step:

```
0 4 * * * /srv/storage/atocore/scripts/cleanup-old-backups.sh
```
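Until that script exists, the selection logic the policy implies can be sketched as follows. This is a hypothetical sketch, not the shipped cleanup-old-backups.sh: select_kept is an invented helper, and it assumes GNU date for the Sunday lookup.

```shell
# Hypothetical retention selection. Reads snapshot stamps
# (YYYYMMDDTHHMMSSZ, one per line) on stdin and prints the stamps to
# KEEP; anything not printed is a deletion candidate.
select_kept() {
  stamps=$(sort -u)
  # Last 7 snapshots, regardless of weekday
  daily=$(printf '%s\n' "$stamps" | tail -n 7)
  # Last 4 Sunday snapshots (GNU date: %u prints 7 for Sunday)
  sundays=$(printf '%s\n' "$stamps" | while read -r s; do
    [ "$(date -u -d "${s%%T*}" +%u)" = 7 ] && printf '%s\n' "$s"
  done | tail -n 4)
  # Last 6 first-of-month snapshots (day field == 01)
  firsts=$(printf '%s\n' "$stamps" | grep -E '^[0-9]{6}01T' | tail -n 6)
  printf '%s\n%s\n%s\n' "$daily" "$sundays" "$firsts" | grep -v '^$' | sort -u
}
```

A deletion pass would then remove every snapshot directory whose stamp is not in the kept set (e.g. comm -23 between the full listing and the kept list), ideally only after the newest snapshot validates clean.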
Drill schedule
--------------
A backup that has never been restored is theoretical. The schedule:
- At least once per quarter, perform a full restore drill on a staging environment (or a temporary container with a separate data dir) and verify the post-restore checks pass.
- After every breaking schema migration, perform a restore drill to confirm the migration is reversible.
- After any incident that touched the storage layer (the EXDEV bug from April 2026 is a good example), confirm the next backup validates clean.
Common failure modes and what to do about them
----------------------------------------------

| Symptom | Likely cause | Action |
|---|---|---|
| db_integrity_check_failed on validation | SQLite snapshot copied while a write was in progress, or disk corruption | Take a fresh backup and validate again. If it fails twice, suspect the underlying disk. |
| registry_invalid_json | Registry was being edited at backup time | Take a fresh backup. The registry is small, so this is cheap. |
| chroma_snapshot_missing after a restore | Snapshot was DB-only and the restore didn't move the existing chroma dir | Either rebuild via fresh ingestion or restore an older snapshot that includes Chroma. |
| Service won't start after restore | Permissions wrong on the restored files | Re-run chown 1000:1000 (or whatever the gitea/atocore container user is) on the data dir. |
| /stats returns 0 documents after restore | The SQL store was restored but the source paths in source_documents don't match the current Dalidou paths | The backup came from a different deployment. Don't trust this restore — it's pulling from the wrong layout. |
Open follow-ups (not yet implemented)
-------------------------------------
- Retention cleanup script: see the cron entry above.
- Off-Dalidou backup target: currently snapshots live on the same disk as the live data. A real disaster-recovery story needs at least one snapshot on a different physical machine. The simplest first step is a periodic rsync to the user's laptop or to another server.
- Backup encryption: snapshots contain raw SQLite and JSON. Consider age/gpg encryption if backups will be shipped off-site.
- Automatic post-backup validation: today the validator must be invoked manually. The create_runtime_backup function should call validate_backup on its own output and refuse to declare success if validation fails.
- Chroma backup is currently a full directory copy every time. For large vector stores this gets expensive. A future improvement would be incremental snapshots via filesystem-level snapshotting (LVM, btrfs, ZFS).
Quickstart cheat sheet
----------------------

```shell
# Daily backup (DB + registry only — fast)
curl -fsS -X POST http://dalidou:8100/admin/backup \
  -H "Content-Type: application/json" -d '{}'

# Weekly backup (DB + registry + Chroma — slower, holds ingestion lock)
curl -fsS -X POST http://dalidou:8100/admin/backup \
  -H "Content-Type: application/json" -d '{"include_chroma": true}'

# List backups
curl -fsS http://dalidou:8100/admin/backup | jq '.backups[].stamp'

# Validate the most recent backup
LATEST=$(curl -fsS http://dalidou:8100/admin/backup | jq -r '.backups[-1].stamp')
curl -fsS http://dalidou:8100/admin/backup/$LATEST/validate | jq .

# Full restore — see the "Restore procedure" section above
```