Files
ATOCore/docs/backup-restore-procedure.md
Anto01 a637017900 slash command for daily AtoCore use + backup-restore procedure
Session 2 of the four-session plan. Lands two operational pieces:
the Claude Code slash command that makes AtoCore reachable from
inside any Claude Code session, and the full backup/restore
procedure doc that turns the backup endpoint code into a real
operational drill.

Slash command (.claude/commands/atocore-context.md)
---------------------------------------------------
- Project-level slash command following the standard frontmatter
  format (description + argument-hint)
- Parses the user prompt and an optional trailing project id, with
  case-insensitive matching against the registered project ids
  (atocore, p04-gigabit, p05-interferometer, p06-polisher and
  their aliases)
- Calls POST /context/build on the live AtoCore service, defaulting
  to http://dalidou:8100 (overridable via ATOCORE_API_BASE env var)
- Renders the formatted context pack inline so the user can see
  exactly what AtoCore would feed an LLM, plus a stats banner and a
  per-chunk source list
- Includes graceful failure handling for network errors, 4xx, 5xx,
  and the empty-result case
- Defines a future capture path that POSTs to /interactions for the
  Phase 9 reflection loop. The current command leaves capture as
  manual / opt-in pending a clean post-turn hook design

.gitignore changes
------------------
- Replaced wholesale .claude/ ignore with .claude/* + exceptions
  for .claude/commands/ so project slash commands can be tracked
- Other .claude/* paths (worktrees, settings, local state) remain
  ignored

Backup-restore procedure (docs/backup-restore-procedure.md)
-----------------------------------------------------------
- Defines what gets backed up (SQLite + registry always, Chroma
  optional under ingestion lock) and what doesn't (sources, code,
  logs, cache, tmp)
- Documents the snapshot directory layout and the timestamp format
- Three trigger paths in priority order:
  - via POST /admin/backup with {include_chroma: true|false}
  - via the standalone src/atocore/ops/backup.py module
  - via cold filesystem copy with brief downtime as last resort
- Listing and validation procedure with the /admin/backup and
  /admin/backup/{stamp}/validate endpoints
- Full step-by-step restore procedure with mandatory pre-flight
  safety snapshot, ownership/permission requirements, and the
  post-restore verification checks
- Rollback path using the pre-restore safety copy
- Retention policy (last 7 daily / 4 weekly / 6 monthly) and
  explicit acknowledgment that the cleanup job is not yet
  implemented
- Drill schedule: quarterly full restore drill, post-migration
  drill, post-incident validation
- Common failure mode table with diagnoses
- Quickstart cheat sheet at the end for daily reference
- Open follow-ups: cleanup script, off-Dalidou target,
  encryption, automatic post-backup validation, incremental
  Chroma snapshots

The procedure has not yet been exercised against the live Dalidou
instance — that is the next step the user runs themselves once
the slash command is in place.
2026-04-07 06:46:50 -04:00

361 lines
13 KiB
Markdown

# AtoCore Backup and Restore Procedure
## Scope
This document defines the operational procedure for backing up and
restoring AtoCore's machine state on the Dalidou deployment. It is
the practical companion to `docs/backup-strategy.md` (which defines
the strategy) and `src/atocore/ops/backup.py` (which implements the
mechanics).
The intent is that this procedure can be followed by anyone with
SSH access to Dalidou and the AtoCore admin endpoints.
## What gets backed up
A `create_runtime_backup` snapshot contains, in order of importance:
| Artifact | Source path on Dalidou | Backup destination | Always included |
|---|---|---|---|
| SQLite database | `/srv/storage/atocore/data/db/atocore.db` | `<backup_root>/db/atocore.db` | yes |
| Project registry JSON | `/srv/storage/atocore/config/project-registry.json` | `<backup_root>/config/project-registry.json` | yes (if file exists) |
| Backup metadata | (generated) | `<backup_root>/backup-metadata.json` | yes |
| Chroma vector store | `/srv/storage/atocore/data/chroma/` | `<backup_root>/chroma/` | only when `include_chroma=true` |
The SQLite snapshot uses the online `conn.backup()` API and is safe
to take while the database is in use. The Chroma snapshot is a cold
directory copy and is **only safe when no ingestion is running**;
the API endpoint enforces this by acquiring the ingestion lock for
the duration of the copy.
What is **not** in the backup:
- Source documents under `/srv/storage/atocore/sources/vault/` and
`/srv/storage/atocore/sources/drive/`. These are read-only
inputs and live in the user's PKM/Drive, which is backed up
separately by their own systems.
- Application code. The container image is the source of truth for
code; recovery means rebuilding the image, not restoring code from
a backup.
- Logs under `/srv/storage/atocore/logs/`.
- Embeddings cache under `/srv/storage/atocore/data/cache/`.
- Temp files under `/srv/storage/atocore/data/tmp/`.
## Backup root layout
Each backup snapshot lives in its own timestamped directory:
```
/srv/storage/atocore/backups/snapshots/
├── 20260407T060000Z/
│ ├── backup-metadata.json
│ ├── db/
│ │ └── atocore.db
│ ├── config/
│ │ └── project-registry.json
│ └── chroma/ # only if include_chroma=true
│ └── ...
├── 20260408T060000Z/
│ └── ...
└── ...
```
The timestamp is UTC, format `YYYYMMDDTHHMMSSZ`.
## Triggering a backup
### Option A — via the admin endpoint (preferred)
```bash
# DB + registry only (fast, safe at any time)
curl -fsS -X POST http://dalidou:8100/admin/backup \
-H "Content-Type: application/json" \
-d '{"include_chroma": false}'
# DB + registry + Chroma (acquires ingestion lock)
curl -fsS -X POST http://dalidou:8100/admin/backup \
-H "Content-Type: application/json" \
-d '{"include_chroma": true}'
```
The response is the backup metadata JSON. Save the `backup_root`
field — that's the directory the snapshot was written to.
### Option B — via the standalone script (when the API is down)
```bash
docker exec atocore python -m atocore.ops.backup
```
This runs `create_runtime_backup()` directly, without going through
the API or the ingestion lock. Use it only when the AtoCore service
itself is unhealthy and you can't hit the admin endpoint.
### Option C — manual file copy (last resort)
If both the API and the standalone script are unusable:
```bash
sudo systemctl stop atocore # or: docker compose stop atocore
sudo cp /srv/storage/atocore/data/db/atocore.db \
/srv/storage/atocore/backups/manual-$(date -u +%Y%m%dT%H%M%SZ).db
sudo cp /srv/storage/atocore/config/project-registry.json \
/srv/storage/atocore/backups/manual-$(date -u +%Y%m%dT%H%M%SZ).registry.json
sudo systemctl start atocore
```
This is a cold backup and requires brief downtime.
## Listing backups
```bash
curl -fsS http://dalidou:8100/admin/backup
```
Returns the configured `backup_dir` and a list of all snapshots
under it, with their full metadata if available.
Or, on the host directly:
```bash
ls -la /srv/storage/atocore/backups/snapshots/
```
## Validating a backup
Before relying on a backup for restore, validate it:
```bash
curl -fsS http://dalidou:8100/admin/backup/20260407T060000Z/validate
```
The validator:
- confirms the snapshot directory exists
- opens the SQLite snapshot and runs `PRAGMA integrity_check`
- parses the registry JSON
- confirms the Chroma directory exists (if it was included)
A valid backup returns `"valid": true` and an empty `errors` array.
A failing validation returns `"valid": false` with one or more
specific error strings (e.g. `db_integrity_check_failed`,
`registry_invalid_json`, `chroma_snapshot_missing`).
**Validate every backup at creation time.** A backup that has never
been validated is not actually a backup — it's just a hopeful copy
of bytes.
## Restore procedure
### Pre-flight (always)
1. Identify which snapshot you want to restore. List available
snapshots and pick by timestamp:
```bash
curl -fsS http://dalidou:8100/admin/backup | jq '.backups[].stamp'
```
2. Validate it. Refuse to restore an invalid backup:
```bash
STAMP=20260407T060000Z
curl -fsS http://dalidou:8100/admin/backup/$STAMP/validate | jq .
```
3. **Stop AtoCore.** SQLite cannot be hot-restored under a running
process and Chroma will not pick up new files until the process
restarts.
```bash
docker compose stop atocore
# or: sudo systemctl stop atocore
```
4. **Take a safety snapshot of the current state** before overwriting
it. This is your "if the restore makes things worse, here's the
undo" backup.
```bash
PRESERVE_STAMP=$(date -u +%Y%m%dT%H%M%SZ)
sudo cp /srv/storage/atocore/data/db/atocore.db \
/srv/storage/atocore/backups/pre-restore-$PRESERVE_STAMP.db
sudo cp /srv/storage/atocore/config/project-registry.json \
/srv/storage/atocore/backups/pre-restore-$PRESERVE_STAMP.registry.json 2>/dev/null || true
```
### Restore the SQLite database
```bash
SNAPSHOT_DIR=/srv/storage/atocore/backups/snapshots/$STAMP
sudo cp $SNAPSHOT_DIR/db/atocore.db \
/srv/storage/atocore/data/db/atocore.db
sudo chown 1000:1000 /srv/storage/atocore/data/db/atocore.db
sudo chmod 600 /srv/storage/atocore/data/db/atocore.db
```
The chown should match the gitea/atocore container user. Verify
by checking the existing perms before overwriting:
```bash
stat -c '%U:%G %a' /srv/storage/atocore/data/db/atocore.db
```
### Restore the project registry
```bash
if [ -f $SNAPSHOT_DIR/config/project-registry.json ]; then
sudo cp $SNAPSHOT_DIR/config/project-registry.json \
/srv/storage/atocore/config/project-registry.json
sudo chown 1000:1000 /srv/storage/atocore/config/project-registry.json
sudo chmod 644 /srv/storage/atocore/config/project-registry.json
fi
```
If the snapshot does not contain a registry, the current registry is
preserved. The pre-flight safety copy still gives you a recovery path
if you need to roll back.
### Restore the Chroma vector store (if it was in the snapshot)
```bash
if [ -d $SNAPSHOT_DIR/chroma ]; then
# Move the current chroma dir aside as a safety copy
sudo mv /srv/storage/atocore/data/chroma \
/srv/storage/atocore/data/chroma.pre-restore-$PRESERVE_STAMP
# Copy the snapshot in
sudo cp -a $SNAPSHOT_DIR/chroma /srv/storage/atocore/data/chroma
sudo chown -R 1000:1000 /srv/storage/atocore/data/chroma
fi
```
If the snapshot does NOT contain a Chroma dir but the SQLite
restore would leave the vector store and the SQL store inconsistent
(e.g. SQL has chunks the vector store doesn't), you have two
options:
- **Option 1: rebuild the vector store from source documents.** Run
ingestion fresh after the SQL restore. This regenerates embeddings
from the actual source files. Slow but produces a perfectly
consistent state.
- **Option 2: accept the inconsistency and live with stale-vector
filtering.** The retriever already drops vector results whose
SQL row no longer exists (`_existing_chunk_ids` filter), so the
inconsistency surfaces as missing results, not bad ones.
For an unplanned restore, Option 2 is the right immediate move.
Then schedule a fresh ingestion pass to rebuild the vector store
properly.
### Restart AtoCore
```bash
docker compose up -d atocore
# or: sudo systemctl start atocore
```
### Post-restore verification
```bash
# 1. Service is healthy
curl -fsS http://dalidou:8100/health | jq .
# 2. Stats look right
curl -fsS http://dalidou:8100/stats | jq .
# 3. Project registry loads
curl -fsS http://dalidou:8100/projects | jq '.projects | length'
# 4. A known-good context query returns non-empty results
curl -fsS -X POST http://dalidou:8100/context/build \
-H "Content-Type: application/json" \
-d '{"prompt": "what is p05 about", "project": "p05-interferometer"}' | jq '.chunks_used'
```
If any of these are wrong, the restore is bad. Roll back using the
pre-restore safety copy:
```bash
docker compose stop atocore
sudo cp /srv/storage/atocore/backups/pre-restore-$PRESERVE_STAMP.db \
/srv/storage/atocore/data/db/atocore.db
sudo cp /srv/storage/atocore/backups/pre-restore-$PRESERVE_STAMP.registry.json \
/srv/storage/atocore/config/project-registry.json 2>/dev/null || true
# If you also restored chroma:
sudo rm -rf /srv/storage/atocore/data/chroma
sudo mv /srv/storage/atocore/data/chroma.pre-restore-$PRESERVE_STAMP \
/srv/storage/atocore/data/chroma
docker compose up -d atocore
```
## Retention policy
- **Last 7 daily backups**: kept verbatim
- **Last 4 weekly backups** (Sunday): kept verbatim
- **Last 6 monthly backups** (1st of month): kept verbatim
- **Anything older**: deleted
The retention job is **not yet implemented** and is tracked as a
follow-up. Until then, the snapshots directory grows monotonically.
A simple cron-based cleanup script is the next step:
```cron
0 4 * * * /srv/storage/atocore/scripts/cleanup-old-backups.sh
```
## Drill schedule
A backup that has never been restored is theoretical. The schedule:
- **At least once per quarter**, perform a full restore drill on a
staging environment (or a temporary container with a separate
data dir) and verify the post-restore checks pass.
- **After every breaking schema migration**, perform a restore drill
to confirm the migration is reversible.
- **After any incident** that touched the storage layer (the EXDEV
bug from April 2026 is a good example), confirm the next backup
validates clean.
## Common failure modes and what to do about them
| Symptom | Likely cause | Action |
|---|---|---|
| `db_integrity_check_failed` on validation | SQLite snapshot copied while a write was in progress, or disk corruption | Take a fresh backup and validate again. If it fails twice, suspect the underlying disk. |
| `registry_invalid_json` | Registry was being edited at backup time | Take a fresh backup. The registry is small so this is cheap. |
| `chroma_snapshot_missing` after a restore | Snapshot was DB-only and the restore didn't move the existing chroma dir | Either rebuild via fresh ingestion or restore an older snapshot that includes Chroma. |
| Service won't start after restore | Permissions wrong on the restored files | Re-run `chown 1000:1000` (or whatever the gitea/atocore container user is) on the data dir. |
| `/stats` returns 0 documents after restore | The SQL store was restored but the source paths in `source_documents` don't match the current Dalidou paths | This means the backup came from a different deployment. Don't trust this restore — it's pulling from the wrong layout. |
## Open follow-ups (not yet implemented)
1. **Retention cleanup script**: see the cron entry above.
2. **Off-Dalidou backup target**: currently snapshots live on the
same disk as the live data. A real disaster-recovery story
needs at least one snapshot on a different physical machine.
The simplest first step is a periodic `rsync` to the user's
laptop or to another server.
3. **Backup encryption**: snapshots contain raw SQLite and JSON.
Consider age/gpg encryption if backups will be shipped off-site.
4. **Automatic post-backup validation**: today the validator must
be invoked manually. The `create_runtime_backup` function
should call `validate_backup` on its own output and refuse to
declare success if validation fails.
5. **Chroma backup is currently full directory copy** every time.
For large vector stores this gets expensive. A future
improvement would be incremental snapshots via filesystem-level
snapshotting (LVM, btrfs, ZFS).
## Quickstart cheat sheet
```bash
# Daily backup (DB + registry only — fast)
curl -fsS -X POST http://dalidou:8100/admin/backup \
-H "Content-Type: application/json" -d '{}'
# Weekly backup (DB + registry + Chroma — slower, holds ingestion lock)
curl -fsS -X POST http://dalidou:8100/admin/backup \
-H "Content-Type: application/json" -d '{"include_chroma": true}'
# List backups
curl -fsS http://dalidou:8100/admin/backup | jq '.backups[].stamp'
# Validate the most recent backup
LATEST=$(curl -fsS http://dalidou:8100/admin/backup | jq -r '.backups[-1].stamp')
curl -fsS http://dalidou:8100/admin/backup/$LATEST/validate | jq .
# Full restore — see the "Restore procedure" section above
```