diff --git a/.claude/commands/atocore-context.md b/.claude/commands/atocore-context.md
new file mode 100644
index 0000000..64b3baa
--- /dev/null
+++ b/.claude/commands/atocore-context.md
@@ -0,0 +1,123 @@
+---
+description: Pull a context pack from the live AtoCore service for the current prompt
+argument-hint: [project-id]
+---
+
+You are about to enrich a user prompt with context from the live AtoCore
+service. This is the daily-use entry point for AtoCore from inside Claude
+Code.
+
+## Step 1 — parse the arguments
+
+The user invoked `/atocore-context` with the following arguments:
+
+```
+$ARGUMENTS
+```
+
+Treat the **entire argument string** as the prompt text by default. If the
+last whitespace-separated token looks like a registered project id (matches
+one of `atocore`, `p04-gigabit`, `p04`, `p05-interferometer`, `p05`,
+`p06-polisher`, `p06`, or any case-insensitive variant), treat it as the
+project hint and use the rest as the prompt text. Otherwise, leave the
+project hint empty.
+
+## Step 2 — call the AtoCore /context/build endpoint
+
+Use the Bash tool to call AtoCore. The default endpoint is the live
+Dalidou instance. Read `ATOCORE_API_BASE` from the environment if set,
+otherwise default to `http://dalidou:8100`, the AtoCore service port
+from `pyproject.toml` and `config.py` (not `http://dalidou:3000`, which
+is the gitea host).
+
+Build the JSON body with `jq -n` so quoting is safe.
+Run something like:
+
+```bash
+ATOCORE_API_BASE="${ATOCORE_API_BASE:-http://dalidou:8100}"
+PROMPT_TEXT='<prompt text from Step 1>'
+PROJECT_HINT='<project hint from Step 1, or empty>'
+
+if [ -n "$PROJECT_HINT" ]; then
+  BODY=$(jq -n --arg p "$PROMPT_TEXT" --arg proj "$PROJECT_HINT" \
+    '{prompt:$p, project:$proj}')
+else
+  BODY=$(jq -n --arg p "$PROMPT_TEXT" '{prompt:$p}')
+fi
+
+curl -fsS -X POST "$ATOCORE_API_BASE/context/build" \
+  -H "Content-Type: application/json" \
+  -d "$BODY"
+```
+
+If `jq` is not available on the host, fall back to a Python one-liner:
+
+```bash
+python -c "import json,sys; print(json.dumps({'prompt': sys.argv[1], 'project': sys.argv[2]} if sys.argv[2] else {'prompt': sys.argv[1]}))" "$PROMPT_TEXT" "$PROJECT_HINT"
+```
+
+## Step 3 — present the context pack to the user
+
+The response is JSON with at least these fields:
+`formatted_context`, `chunks_used`, `total_chars`, `budget`,
+`budget_remaining`, `duration_ms`, and a `chunks` array.
+
+Print the response in a readable summary:
+
+1. Print a one-line stats banner: `chunks=N, chars=X/budget, duration=Yms`
+2. Print the `formatted_context` block verbatim inside a fenced text
+   code block so the user can read what AtoCore would feed an LLM
+3. Print the `chunks` array as a small bulleted list with `source_file`,
+   `heading_path`, and `score` per chunk
+
+If the response is empty (`chunks_used=0`, no project state, no
+memories), tell the user explicitly: "AtoCore returned no context for
+this prompt — either the corpus does not have relevant information or
+the project hint is wrong. Try `/atocore-context <prompt> <project-id>`."
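The Step 1 parsing rule can be sketched in Python. This is an illustrative sketch, not code from the repository — the function name, the one-token behavior (a lone project id is treated as the whole prompt), and the whitespace normalization are assumptions:

```python
# Sketch of the Step 1 rule: the last whitespace-separated token is taken
# as a project hint only if it matches a registered id (case-insensitive);
# otherwise the whole argument string is the prompt text.
REGISTERED_IDS = {
    "atocore", "p04-gigabit", "p04",
    "p05-interferometer", "p05", "p06-polisher", "p06",
}

def split_prompt_and_hint(arguments: str) -> tuple[str, str]:
    """Return (prompt_text, project_hint); project_hint is '' if absent."""
    tokens = arguments.split()
    # Assumption: a single-token invocation is all prompt, never all hint.
    if len(tokens) >= 2 and tokens[-1].lower() in REGISTERED_IDS:
        return " ".join(tokens[:-1]), tokens[-1].lower()
    return arguments.strip(), ""
```

For example, `split_prompt_and_hint("align the cavity p05")` yields `("align the cavity", "p05")`, while a prompt whose last word is not a registered id passes through unchanged with an empty hint.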
+
+If the curl call fails:
+- Network error → tell the user the AtoCore service may be down at
+  `$ATOCORE_API_BASE` and suggest checking `curl $ATOCORE_API_BASE/health`
+- 4xx → print the error body verbatim, the API error message is usually
+  enough
+- 5xx → print the error body and suggest checking the service logs
+
+## Step 4 — capture the interaction (optional, opt-in)
+
+If the user has previously asked the assistant to capture interactions
+into AtoCore (or if the slash command was invoked with the trailing
+literal `--capture` token), also POST the captured exchange to
+`/interactions` so the Phase 9 reflection loop sees it. Skip this step
+silently otherwise. The capture body is:
+
+```json
+{
+  "prompt": "<prompt text>",
+  "response": "",
+  "response_summary": "",
+  "project": "<project hint>",
+  "client": "claude-code-slash",
+  "session_id": "<session id if known>",
+  "memories_used": ["<memory ids from the context pack>"],
+  "chunks_used": ["<chunk ids from the context pack>"],
+  "context_pack": {"chunks_used": <N>, "total_chars": <X>}
+}
+```
+
+Note that the response field stays empty here — the LLM hasn't actually
+answered yet at the moment the slash command runs. A separate post-turn
+hook (not part of this command) would update the same interaction with
+the response, OR a follow-up `/atocore-record-response <interaction-id>`
+command would do it. For now, leave that as future work.
+
+## Notes for the assistant
+
+- DO NOT invent project ids that aren't in the registry. If the user
+  passed something that doesn't match, treat it as part of the prompt.
+- DO NOT silently fall back to a different endpoint. If `ATOCORE_API_BASE`
+  is wrong, surface the network error and let the user fix the env var.
+- DO NOT hide the formatted context pack from the user. The whole point
+  of this command is to show what AtoCore would feed an LLM, so the user
+  can decide if it's relevant.
+- The output goes into the user's working context as background — they
+  may follow up with their actual question, and the AtoCore context pack
+  acts as informal injected knowledge.
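The Step 4 capture body can be assembled like this. A hedged sketch: the helper name and its parameters are illustrative, not part of AtoCore's API — only the field names come from the template in Step 4:

```python
import json

# Illustrative builder for the Step 4 capture body. The response field
# stays empty because the LLM has not answered yet when the slash
# command runs; a post-turn hook would fill it in later.
def build_capture_body(prompt, project, session_id, chunks_used, total_chars,
                       memory_ids=(), chunk_ids=()):
    return {
        "prompt": prompt,
        "response": "",              # updated later, not at command time
        "response_summary": "",
        "project": project,
        "client": "claude-code-slash",
        "session_id": session_id,
        "memories_used": list(memory_ids),
        "chunks_used": list(chunk_ids),
        "context_pack": {"chunks_used": chunks_used, "total_chars": total_chars},
    }

body = build_capture_body("what is p05 about", "p05-interferometer",
                          "sess-1", chunks_used=3, total_chars=1200)
payload = json.dumps(body)   # ready to POST to /interactions
```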
diff --git a/.gitignore b/.gitignore
index 07957c6..1069933 100644
--- a/.gitignore
+++ b/.gitignore
@@ -10,4 +10,6 @@ htmlcov/
 .coverage
 venv/
 .venv/
-.claude/
+.claude/*
+!.claude/commands/
+!.claude/commands/**
diff --git a/docs/backup-restore-procedure.md b/docs/backup-restore-procedure.md
new file mode 100644
index 0000000..a6e427b
--- /dev/null
+++ b/docs/backup-restore-procedure.md
@@ -0,0 +1,360 @@
+# AtoCore Backup and Restore Procedure
+
+## Scope
+
+This document defines the operational procedure for backing up and
+restoring AtoCore's machine state on the Dalidou deployment. It is
+the practical companion to `docs/backup-strategy.md` (which defines
+the strategy) and `src/atocore/ops/backup.py` (which implements the
+mechanics).
+
+The intent is that this procedure can be followed by anyone with
+SSH access to Dalidou and the AtoCore admin endpoints.
+
+## What gets backed up
+
+A `create_runtime_backup` snapshot contains, in order of importance:
+
+| Artifact | Source path on Dalidou | Backup destination | Always included |
+|---|---|---|---|
+| SQLite database | `/srv/storage/atocore/data/db/atocore.db` | `<backup_root>/db/atocore.db` | yes |
+| Project registry JSON | `/srv/storage/atocore/config/project-registry.json` | `<backup_root>/config/project-registry.json` | yes (if file exists) |
+| Backup metadata | (generated) | `<backup_root>/backup-metadata.json` | yes |
+| Chroma vector store | `/srv/storage/atocore/data/chroma/` | `<backup_root>/chroma/` | only when `include_chroma=true` |
+
+The SQLite snapshot uses the online `conn.backup()` API and is safe
+to take while the database is in use. The Chroma snapshot is a cold
+directory copy and is **only safe when no ingestion is running**;
+the API endpoint enforces this by acquiring the ingestion lock for
+the duration of the copy.
+
+What is **not** in the backup:
+
+- Source documents under `/srv/storage/atocore/sources/vault/` and
+  `/srv/storage/atocore/sources/drive/`.
+  These are read-only inputs and live in the user's PKM/Drive, which
+  is backed up separately by their own systems.
+- Application code. The container image is the source of truth for
+  code; recovery means rebuilding the image, not restoring code from
+  a backup.
+- Logs under `/srv/storage/atocore/logs/`.
+- Embeddings cache under `/srv/storage/atocore/data/cache/`.
+- Temp files under `/srv/storage/atocore/data/tmp/`.
+
+## Backup root layout
+
+Each backup snapshot lives in its own timestamped directory:
+
+```
+/srv/storage/atocore/backups/snapshots/
+  ├── 20260407T060000Z/
+  │   ├── backup-metadata.json
+  │   ├── db/
+  │   │   └── atocore.db
+  │   ├── config/
+  │   │   └── project-registry.json
+  │   └── chroma/            # only if include_chroma=true
+  │       └── ...
+  ├── 20260408T060000Z/
+  │   └── ...
+  └── ...
+```
+
+The timestamp is UTC, format `YYYYMMDDTHHMMSSZ`.
+
+## Triggering a backup
+
+### Option A — via the admin endpoint (preferred)
+
+```bash
+# DB + registry only (fast, safe at any time)
+curl -fsS -X POST http://dalidou:8100/admin/backup \
+  -H "Content-Type: application/json" \
+  -d '{"include_chroma": false}'
+
+# DB + registry + Chroma (acquires ingestion lock)
+curl -fsS -X POST http://dalidou:8100/admin/backup \
+  -H "Content-Type: application/json" \
+  -d '{"include_chroma": true}'
+```
+
+The response is the backup metadata JSON. Save the `backup_root`
+field — that's the directory the snapshot was written to.
+
+### Option B — via the standalone script (when the API is down)
+
+```bash
+docker exec atocore python -m atocore.ops.backup
+```
+
+This runs `create_runtime_backup()` directly, without going through
+the API or the ingestion lock. Use it only when the AtoCore service
+itself is unhealthy and you can't hit the admin endpoint.
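The snapshot stamp format defined under "Backup root layout" can be generated and parsed with a couple of small helpers. A sketch only — the helper names are illustrative and not part of `atocore.ops.backup`:

```python
from datetime import datetime, timezone

STAMP_FMT = "%Y%m%dT%H%M%SZ"   # matches the YYYYMMDDTHHMMSSZ layout above

def make_stamp(now=None):
    """UTC snapshot-directory name, e.g. '20260407T060000Z'."""
    now = now or datetime.now(timezone.utc)
    return now.strftime(STAMP_FMT)

def parse_stamp(stamp):
    """Inverse of make_stamp; raises ValueError on a malformed name."""
    return datetime.strptime(stamp, STAMP_FMT).replace(tzinfo=timezone.utc)
```

Because the stamps are fixed-width and most-significant-first, plain lexicographic sorting of snapshot directory names is also chronological sorting, which the listing and retention tooling can rely on.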
+
+### Option C — manual file copy (last resort)
+
+If both the API and the standalone script are unusable:
+
+```bash
+sudo systemctl stop atocore   # or: docker compose stop atocore
+STAMP=$(date -u +%Y%m%dT%H%M%SZ)
+sudo cp /srv/storage/atocore/data/db/atocore.db \
+  /srv/storage/atocore/backups/manual-$STAMP.db
+sudo cp /srv/storage/atocore/config/project-registry.json \
+  /srv/storage/atocore/backups/manual-$STAMP.registry.json
+sudo systemctl start atocore
+```
+
+This is a cold backup and requires brief downtime. (Capture the
+stamp once, as above, so both files share the same timestamp.)
+
+## Listing backups
+
+```bash
+curl -fsS http://dalidou:8100/admin/backup
+```
+
+Returns the configured `backup_dir` and a list of all snapshots
+under it, with their full metadata if available.
+
+Or, on the host directly:
+
+```bash
+ls -la /srv/storage/atocore/backups/snapshots/
+```
+
+## Validating a backup
+
+Before relying on a backup for restore, validate it:
+
+```bash
+curl -fsS http://dalidou:8100/admin/backup/20260407T060000Z/validate
+```
+
+The validator:
+- confirms the snapshot directory exists
+- opens the SQLite snapshot and runs `PRAGMA integrity_check`
+- parses the registry JSON
+- confirms the Chroma directory exists (if it was included)
+
+A valid backup returns `"valid": true` and an empty `errors` array.
+A failing validation returns `"valid": false` with one or more
+specific error strings (e.g. `db_integrity_check_failed`,
+`registry_invalid_json`, `chroma_snapshot_missing`).
+
+**Validate every backup at creation time.** A backup that has never
+been validated is not actually a backup — it's just a hopeful copy
+of bytes.
+
+## Restore procedure
+
+### Pre-flight (always)
+
+1. Identify which snapshot you want to restore. List available
+   snapshots and pick by timestamp:
+   ```bash
+   curl -fsS http://dalidou:8100/admin/backup | jq '.backups[].stamp'
+   ```
+2. Validate it. Refuse to restore an invalid backup:
+   ```bash
+   STAMP=20260407T060000Z
+   curl -fsS http://dalidou:8100/admin/backup/$STAMP/validate | jq .
+   ```
+3. **Stop AtoCore.** SQLite cannot be hot-restored under a running
+   process and Chroma will not pick up new files until the process
+   restarts.
+   ```bash
+   docker compose stop atocore
+   # or: sudo systemctl stop atocore
+   ```
+4. **Take a safety snapshot of the current state** before overwriting
+   it. This is your "if the restore makes things worse, here's the
+   undo" backup.
+   ```bash
+   PRESERVE_STAMP=$(date -u +%Y%m%dT%H%M%SZ)
+   sudo cp /srv/storage/atocore/data/db/atocore.db \
+     /srv/storage/atocore/backups/pre-restore-$PRESERVE_STAMP.db
+   sudo cp /srv/storage/atocore/config/project-registry.json \
+     /srv/storage/atocore/backups/pre-restore-$PRESERVE_STAMP.registry.json 2>/dev/null || true
+   ```
+
+### Restore the SQLite database
+
+```bash
+SNAPSHOT_DIR=/srv/storage/atocore/backups/snapshots/$STAMP
+sudo cp $SNAPSHOT_DIR/db/atocore.db \
+  /srv/storage/atocore/data/db/atocore.db
+sudo chown 1000:1000 /srv/storage/atocore/data/db/atocore.db
+sudo chmod 600 /srv/storage/atocore/data/db/atocore.db
+```
+
+The chown should match the gitea/atocore container user. Verify
+by checking the existing perms before overwriting:
+
+```bash
+stat -c '%U:%G %a' /srv/storage/atocore/data/db/atocore.db
+```
+
+### Restore the project registry
+
+```bash
+if [ -f $SNAPSHOT_DIR/config/project-registry.json ]; then
+  sudo cp $SNAPSHOT_DIR/config/project-registry.json \
+    /srv/storage/atocore/config/project-registry.json
+  sudo chown 1000:1000 /srv/storage/atocore/config/project-registry.json
+  sudo chmod 644 /srv/storage/atocore/config/project-registry.json
+fi
+```
+
+If the snapshot does not contain a registry, the current registry is
+preserved. The pre-flight safety copy still gives you a recovery path
+if you need to roll back.
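After copying the database file back, you can run the same `PRAGMA integrity_check` the validator uses, directly against the restored file, before restarting the service. A minimal sketch (the function name is illustrative):

```python
import sqlite3

def sqlite_integrity_ok(db_path):
    """Run PRAGMA integrity_check and report whether it returns 'ok'."""
    conn = sqlite3.connect(db_path)
    try:
        # A healthy database yields a single row containing the string "ok";
        # a corrupt one yields one row per problem found.
        (result,) = conn.execute("PRAGMA integrity_check").fetchone()
    finally:
        conn.close()
    return result == "ok"
```

This catches a truncated or miscopied snapshot before AtoCore ever opens it.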
+
+### Restore the Chroma vector store (if it was in the snapshot)
+
+```bash
+if [ -d $SNAPSHOT_DIR/chroma ]; then
+  # Move the current chroma dir aside as a safety copy
+  sudo mv /srv/storage/atocore/data/chroma \
+    /srv/storage/atocore/data/chroma.pre-restore-$PRESERVE_STAMP
+
+  # Copy the snapshot in
+  sudo cp -a $SNAPSHOT_DIR/chroma /srv/storage/atocore/data/chroma
+  sudo chown -R 1000:1000 /srv/storage/atocore/data/chroma
+fi
+```
+
+If the snapshot does NOT contain a Chroma dir but the SQLite
+restore would leave the vector store and the SQL store inconsistent
+(e.g. SQL has chunks the vector store doesn't), you have two
+options:
+
+- **Option 1: rebuild the vector store from source documents.** Run
+  ingestion fresh after the SQL restore. This regenerates embeddings
+  from the actual source files. Slow but produces a perfectly
+  consistent state.
+- **Option 2: accept the inconsistency and live with stale-vector
+  filtering.** The retriever already drops vector results whose
+  SQL row no longer exists (`_existing_chunk_ids` filter), so the
+  inconsistency surfaces as missing results, not bad ones.
+
+For an unplanned restore, Option 2 is the right immediate move.
+Then schedule a fresh ingestion pass to rebuild the vector store
+properly.
+
+### Restart AtoCore
+
+```bash
+docker compose up -d atocore
+# or: sudo systemctl start atocore
+```
+
+### Post-restore verification
+
+```bash
+# 1. Service is healthy
+curl -fsS http://dalidou:8100/health | jq .
+
+# 2. Stats look right
+curl -fsS http://dalidou:8100/stats | jq .
+
+# 3. Project registry loads
+curl -fsS http://dalidou:8100/projects | jq '.projects | length'
+
+# 4. A known-good context query returns non-empty results
+curl -fsS -X POST http://dalidou:8100/context/build \
+  -H "Content-Type: application/json" \
+  -d '{"prompt": "what is p05 about", "project": "p05-interferometer"}' | jq '.chunks_used'
+```
+
+If any of these are wrong, the restore is bad.
+Roll back using the
+pre-restore safety copy:
+
+```bash
+docker compose stop atocore
+sudo cp /srv/storage/atocore/backups/pre-restore-$PRESERVE_STAMP.db \
+  /srv/storage/atocore/data/db/atocore.db
+sudo cp /srv/storage/atocore/backups/pre-restore-$PRESERVE_STAMP.registry.json \
+  /srv/storage/atocore/config/project-registry.json 2>/dev/null || true
+# If you also restored chroma:
+sudo rm -rf /srv/storage/atocore/data/chroma
+sudo mv /srv/storage/atocore/data/chroma.pre-restore-$PRESERVE_STAMP \
+  /srv/storage/atocore/data/chroma
+docker compose up -d atocore
+```
+
+## Retention policy
+
+- **Last 7 daily backups**: kept verbatim
+- **Last 4 weekly backups** (Sunday): kept verbatim
+- **Last 6 monthly backups** (1st of month): kept verbatim
+- **Anything older**: deleted
+
+The retention job is **not yet implemented** and is tracked as a
+follow-up. Until then, the snapshots directory grows monotonically.
+A simple cron-based cleanup script is the next step:
+
+```cron
+0 4 * * * /srv/storage/atocore/scripts/cleanup-old-backups.sh
+```
+
+## Drill schedule
+
+A backup that has never been restored is theoretical. The schedule:
+
+- **At least once per quarter**, perform a full restore drill on a
+  staging environment (or a temporary container with a separate
+  data dir) and verify the post-restore checks pass.
+- **After every breaking schema migration**, perform a restore drill
+  to confirm the migration is reversible.
+- **After any incident** that touched the storage layer (the EXDEV
+  bug from April 2026 is a good example), confirm the next backup
+  validates clean.
+
+## Common failure modes and what to do about them
+
+| Symptom | Likely cause | Action |
+|---|---|---|
+| `db_integrity_check_failed` on validation | SQLite snapshot copied while a write was in progress, or disk corruption | Take a fresh backup and validate again. If it fails twice, suspect the underlying disk. |
+| `registry_invalid_json` | Registry was being edited at backup time | Take a fresh backup. The registry is small so this is cheap. |
+| `chroma_snapshot_missing` after a restore | Snapshot was DB-only and the restore didn't move the existing chroma dir | Either rebuild via fresh ingestion or restore an older snapshot that includes Chroma. |
+| Service won't start after restore | Permissions wrong on the restored files | Re-run `chown 1000:1000` (or whatever the gitea/atocore container user is) on the data dir. |
+| `/stats` returns 0 documents after restore | The SQL store was restored but the source paths in `source_documents` don't match the current Dalidou paths | This means the backup came from a different deployment. Don't trust this restore — it's pulling from the wrong layout. |
+
+## Open follow-ups (not yet implemented)
+
+1. **Retention cleanup script**: see the cron entry above.
+2. **Off-Dalidou backup target**: currently snapshots live on the
+   same disk as the live data. A real disaster-recovery story
+   needs at least one snapshot on a different physical machine.
+   The simplest first step is a periodic `rsync` to the user's
+   laptop or to another server.
+3. **Backup encryption**: snapshots contain raw SQLite and JSON.
+   Consider age/gpg encryption if backups will be shipped off-site.
+4. **Automatic post-backup validation**: today the validator must
+   be invoked manually. The `create_runtime_backup` function
+   should call `validate_backup` on its own output and refuse to
+   declare success if validation fails.
+5. **Chroma backup is currently a full directory copy** every time.
+   For large vector stores this gets expensive. A future
+   improvement would be incremental snapshots via filesystem-level
+   snapshotting (LVM, btrfs, ZFS).
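The retention rules from the policy section (last 7 daily, last 4 Sunday, last 6 first-of-month) can be sketched as a pure keep-set computation over snapshot stamps. A sketch only, since the retention job is not yet implemented — the function name is illustrative and it assumes stamps in the `YYYYMMDDTHHMMSSZ` format:

```python
from datetime import datetime

def stamps_to_keep(stamps):
    """Apply the retention policy to a list of snapshot stamps.

    Keeps the last 7 stamps overall, the last 4 that fall on a Sunday,
    and the last 6 that fall on the 1st of a month; everything else is
    eligible for deletion.
    """
    parsed = sorted((datetime.strptime(s, "%Y%m%dT%H%M%SZ"), s) for s in stamps)
    daily   = [s for _, s in parsed][-7:]
    weekly  = [s for d, s in parsed if d.weekday() == 6][-4:]   # Sunday
    monthly = [s for d, s in parsed if d.day == 1][-6:]
    return set(daily) | set(weekly) | set(monthly)
```

A cleanup script would then delete every snapshot directory whose stamp is not in the returned set; a stamp that satisfies several rules (say, a Sunday that is also the 1st) is naturally kept only once.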
+
+## Quickstart cheat sheet
+
+```bash
+# Daily backup (DB + registry only — fast)
+curl -fsS -X POST http://dalidou:8100/admin/backup \
+  -H "Content-Type: application/json" -d '{}'
+
+# Weekly backup (DB + registry + Chroma — slower, holds ingestion lock)
+curl -fsS -X POST http://dalidou:8100/admin/backup \
+  -H "Content-Type: application/json" -d '{"include_chroma": true}'
+
+# List backups
+curl -fsS http://dalidou:8100/admin/backup | jq '.backups[].stamp'
+
+# Validate the most recent backup
+LATEST=$(curl -fsS http://dalidou:8100/admin/backup | jq -r '.backups[-1].stamp')
+curl -fsS http://dalidou:8100/admin/backup/$LATEST/validate | jq .
+
+# Full restore — see the "Restore procedure" section above
+```