slash command for daily AtoCore use + backup-restore procedure

Session 2 of the four-session plan. Lands two operational pieces: the
Claude Code slash command that makes AtoCore reachable from inside any
Claude Code session, and the full backup/restore procedure doc that
turns the backup endpoint code into a real operational drill.

Slash command (.claude/commands/atocore-context.md)
---------------------------------------------------
- Project-level slash command following the standard frontmatter
  format (description + argument-hint)
- Parses the user prompt and an optional trailing project id, with
  case-insensitive matching against the registered project ids
  (atocore, p04-gigabit, p05-interferometer, p06-polisher and their
  aliases)
- Calls POST /context/build on the live AtoCore service, defaulting to
  http://dalidou:8100 (overridable via ATOCORE_API_BASE env var)
- Renders the formatted context pack inline so the user can see exactly
  what AtoCore would feed an LLM, plus a stats banner and a per-chunk
  source list
- Includes graceful failure handling for network errors, 4xx, 5xx, and
  the empty-result case
- Defines a future capture path that POSTs to /interactions for the
  Phase 9 reflection loop. The current command leaves capture as
  manual / opt-in pending a clean post-turn hook design

.gitignore changes
------------------
- Replaced wholesale .claude/ ignore with .claude/* + exceptions for
  .claude/commands/ so project slash commands can be tracked
- Other .claude/* paths (worktrees, settings, local state) remain
  ignored

Backup-restore procedure (docs/backup-restore-procedure.md)
-----------------------------------------------------------
- Defines what gets backed up (SQLite + registry always, Chroma
  optional under ingestion lock) and what doesn't (sources, code, logs,
  cache, tmp)
- Documents the snapshot directory layout and the timestamp format
- Three trigger paths in priority order:
  - via POST /admin/backup with {include_chroma: true|false}
  - via the standalone src/atocore/ops/backup.py module
  - via cold filesystem copy with brief downtime as last resort
- Listing and validation procedure with the /admin/backup and
  /admin/backup/{stamp}/validate endpoints
- Full step-by-step restore procedure with mandatory pre-flight safety
  snapshot, ownership/permission requirements, and the post-restore
  verification checks
- Rollback path using the pre-restore safety copy
- Retention policy (last 7 daily / 4 weekly / 6 monthly) and explicit
  acknowledgment that the cleanup job is not yet implemented
- Drill schedule: quarterly full restore drill, post-migration drill,
  post-incident validation
- Common failure mode table with diagnoses
- Quickstart cheat sheet at the end for daily reference
- Open follow-ups: cleanup script, off-Dalidou target, encryption,
  automatic post-backup validation, incremental Chroma snapshots

The procedure has not yet been exercised against the live Dalidou
instance. That is the next step the user runs themselves once the
slash command is in place.
2026-04-07 06:46:50 -04:00
parent d0ff8b5738
commit a637017900
3 changed files with 486 additions and 1 deletions
--- a/.claude/commands/atocore-context.md
+++ b/.claude/commands/atocore-context.md
@@ -0,0 +1,123 @@
+---
+description: Pull a context pack from the live AtoCore service for the current prompt
+argument-hint: <prompt text> [project-id]
+---
+
+You are about to enrich a user prompt with context from the live AtoCore
+service. This is the daily-use entry point for AtoCore from inside Claude
+Code.
+
+## Step 1 — parse the arguments
+
+The user invoked `/atocore-context` with the following arguments:
+
+```
+$ARGUMENTS
+```
+
+Treat the **entire argument string** as the prompt text by default. If the
+last whitespace-separated token looks like a registered project id (matches
+one of `atocore`, `p04-gigabit`, `p04`, `p05-interferometer`, `p05`,
+`p06-polisher`, `p06`, or any case-insensitive variant), treat it as the
+project hint and use the rest as the prompt text. Otherwise, leave the
+project hint empty.
+
+## Step 2 — call the AtoCore /context/build endpoint
+
+Use the Bash tool to call AtoCore. The default endpoint is the live
+Dalidou instance. Read `ATOCORE_API_BASE` from the environment if set;
+otherwise default to `http://dalidou:8100`, the AtoCore service port
+from `pyproject.toml` and `config.py`. (Port 3000 on the same host is
+Gitea, not AtoCore.)
+
+Build the JSON body with `jq -n` so quoting is safe. Run something like:
+
+```bash
+ATOCORE_API_BASE="${ATOCORE_API_BASE:-http://dalidou:8100}"
+PROMPT_TEXT='<the prompt text from step 1>'
+PROJECT_HINT='<the project hint or empty string>'
+
+if [ -n "$PROJECT_HINT" ]; then
+  BODY=$(jq -n --arg p "$PROMPT_TEXT" --arg proj "$PROJECT_HINT" \
+    '{prompt:$p, project:$proj}')
+else
+  BODY=$(jq -n --arg p "$PROMPT_TEXT" '{prompt:$p}')
+fi
+
+curl -fsS -X POST "$ATOCORE_API_BASE/context/build" \
+  -H "Content-Type: application/json" \
+  -d "$BODY"
+```
+
+If `jq` is not available on the host, fall back to a Python one-liner:
+
+```bash
+BODY=$(python -c "import json,sys; print(json.dumps({'prompt': sys.argv[1], 'project': sys.argv[2]} if sys.argv[2] else {'prompt': sys.argv[1]}))" "$PROMPT_TEXT" "$PROJECT_HINT")
+```
+
+## Step 3 — present the context pack to the user
+
+The response is JSON with at least these fields:
+`formatted_context`, `chunks_used`, `total_chars`, `budget`,
+`budget_remaining`, `duration_ms`, and a `chunks` array.
+
+Print the response in a readable summary:
+
+1. Print a one-line stats banner: `chunks=N, chars=X/budget, duration=Yms`
+2. Print the `formatted_context` block verbatim inside a fenced text
+   code block so the user can read what AtoCore would feed an LLM
+3. Print the `chunks` array as a small bulleted list with `source_file`,
+   `heading_path`, and `score` per chunk
+
+If the response is empty (`chunks_used=0`, no project state, no
+memories), tell the user explicitly: "AtoCore returned no context for
+this prompt — either the corpus does not have relevant information or
+the project hint is wrong. Try `/atocore-context <prompt> <project-id>`."
+
+If the curl call fails:
+- Network error → tell the user the AtoCore service may be down at
+  `$ATOCORE_API_BASE` and suggest checking `curl $ATOCORE_API_BASE/health`
+- 4xx → print the error body verbatim; the API error message is usually
+  enough (`curl -f` discards the body, so re-run without `-f` to see it)
+- 5xx → print the error body the same way and suggest checking the service logs
+
+## Step 4 — capture the interaction (optional, opt-in)
+
+If the user has previously asked the assistant to capture interactions
+into AtoCore (or if the slash command was invoked with the trailing
+literal `--capture` token), also POST the captured exchange to
+`/interactions` so the Phase 9 reflection loop sees it. Skip this step
+silently otherwise. The capture body is:
+
+```json
+{
+  "prompt": "<user prompt>",
+  "response": "",
+  "response_summary": "",
+  "project": "<project hint or empty>",
+  "client": "claude-code-slash",
+  "session_id": "<a stable id for this Claude Code session>",
+  "memories_used": ["<from chunks array if available>"],
+  "chunks_used": ["<chunk_id from chunks array>"],
+  "context_pack": {"chunks_used": <N>, "total_chars": <X>}
+}
+```
+
+Note that the response field stays empty here — the LLM hasn't actually
+answered yet at the moment the slash command runs. A separate post-turn
+hook (not part of this command) would update the same interaction with
+the response, OR a follow-up `/atocore-record-response <interaction-id>`
+command would do it. For now, leave that as future work.
+
+## Notes for the assistant
+
+- DO NOT invent project ids that aren't in the registry. If the user
+  passed something that doesn't match, treat it as part of the prompt.
+- DO NOT silently fall back to a different endpoint. If `ATOCORE_API_BASE`
+  is wrong, surface the network error and let the user fix the env var.
+- DO NOT hide the formatted context pack from the user. The whole point
+  of this command is to show what AtoCore would feed an LLM, so the user
+  can decide if it's relevant.
+- The output goes into the user's working context as background — they
+  may follow up with their actual question, and the AtoCore context pack
+  acts as informal injected knowledge.
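Review note on Step 1 of `.claude/commands/atocore-context.md`: the trailing-token rule is easy to get subtly wrong, so here is a minimal Python sketch of one possible reading of it. The names `split_prompt_and_project` and `KNOWN_PROJECT_IDS` are hypothetical (the command file leaves the parsing to the assistant, and the real registry lives in the AtoCore service). The sketch also assumes that a lone token is always prompt text, and that aliases are passed through lowercased rather than resolved to a canonical id.

```python
# Hypothetical helper illustrating the Step 1 parsing rule.
# The id list is copied from the command file; the real source of truth
# is the project registry served by AtoCore.
KNOWN_PROJECT_IDS = {
    "atocore",
    "p04-gigabit", "p04",
    "p05-interferometer", "p05",
    "p06-polisher", "p06",
}


def split_prompt_and_project(arguments: str) -> tuple[str, str]:
    """Return (prompt_text, project_hint); the hint is "" when absent."""
    stripped = arguments.strip()
    # Split off only the last whitespace-separated token so that the
    # internal spacing of the prompt is left untouched.
    parts = stripped.rsplit(None, 1)
    if len(parts) == 2 and parts[1].lower() in KNOWN_PROJECT_IDS:
        return parts[0], parts[1].lower()
    # No match (or a single token): everything is prompt text.
    return stripped, ""


if __name__ == "__main__":
    print(split_prompt_and_project("what is the beam size P05"))
    print(split_prompt_and_project("summarize the polisher status"))
```

Under these assumptions an unregistered trailing token stays part of the prompt, which matches the "DO NOT invent project ids" note in the command file.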
--- a/.gitignore
+++ b/.gitignore
@@ -10,4 +10,6 @@ htmlcov/
 .coverage
 venv/
 .venv/
-.claude/
+.claude/*
+!.claude/commands/
+!.claude/commands/**
--- a/docs/backup-restore-procedure.md
+++ b/docs/backup-restore-procedure.md
@@ -0,0 +1,360 @@
+# AtoCore Backup and Restore Procedure
+
+## Scope
+
+This document defines the operational procedure for backing up and
+restoring AtoCore's machine state on the Dalidou deployment. It is
+the practical companion to `docs/backup-strategy.md` (which defines
+the strategy) and `src/atocore/ops/backup.py` (which implements the
+mechanics).
+
+The intent is that this procedure can be followed by anyone with
+SSH access to Dalidou and the AtoCore admin endpoints.
+
+## What gets backed up
+
+A `create_runtime_backup` snapshot contains, in order of importance:
+
+| Artifact | Source path on Dalidou | Backup destination | Always included |
+|---|---|---|---|
+| SQLite database | `/srv/storage/atocore/data/db/atocore.db` | `<backup_root>/db/atocore.db` | yes |
+| Project registry JSON | `/srv/storage/atocore/config/project-registry.json` | `<backup_root>/config/project-registry.json` | yes (if file exists) |
+| Backup metadata | (generated) | `<backup_root>/backup-metadata.json` | yes |
+| Chroma vector store | `/srv/storage/atocore/data/chroma/` | `<backup_root>/chroma/` | only when `include_chroma=true` |
+
+The SQLite snapshot uses the online `conn.backup()` API and is safe
+to take while the database is in use. The Chroma snapshot is a cold
+directory copy and is **only safe when no ingestion is running**;
+the API endpoint enforces this by acquiring the ingestion lock for
+the duration of the copy.
+
+What is **not** in the backup:
+
+- Source documents under `/srv/storage/atocore/sources/vault/` and
+  `/srv/storage/atocore/sources/drive/`. These are read-only
+  inputs and live in the user's PKM/Drive, which is backed up
+  separately by their own systems.
+- Application code. The container image is the source of truth for
+  code; recovery means rebuilding the image, not restoring code from
+  a backup.
+- Logs under `/srv/storage/atocore/logs/`.
+- Embeddings cache under `/srv/storage/atocore/data/cache/`.
+- Temp files under `/srv/storage/atocore/data/tmp/`.
+
+## Backup root layout
+
+Each backup snapshot lives in its own timestamped directory:
+
+```
+/srv/storage/atocore/backups/snapshots/
+  ├── 20260407T060000Z/
+  │   ├── backup-metadata.json
+  │   ├── db/
+  │   │   └── atocore.db
+  │   ├── config/
+  │   │   └── project-registry.json
+  │   └── chroma/                    # only if include_chroma=true
+  │       └── ...
+  ├── 20260408T060000Z/
+  │   └── ...
+  └── ...
+```
+
+The timestamp is UTC, format `YYYYMMDDTHHMMSSZ`.
+
+## Triggering a backup
+
+### Option A — via the admin endpoint (preferred)
+
+```bash
+# DB + registry only (fast, safe at any time)
+curl -fsS -X POST http://dalidou:8100/admin/backup \
+  -H "Content-Type: application/json" \
+  -d '{"include_chroma": false}'
+
+# DB + registry + Chroma (acquires ingestion lock)
+curl -fsS -X POST http://dalidou:8100/admin/backup \
+  -H "Content-Type: application/json" \
+  -d '{"include_chroma": true}'
+```
+
+The response is the backup metadata JSON. Save the `backup_root`
+field — that's the directory the snapshot was written to.
+
+### Option B — via the standalone script (when the API is down)
+
+```bash
+docker exec atocore python -m atocore.ops.backup
+```
+
+This runs `create_runtime_backup()` directly, without going through
+the API or the ingestion lock. Use it only when the AtoCore service
+itself is unhealthy and you can't hit the admin endpoint.
+
+### Option C — manual file copy (last resort)
+
+If both the API and the standalone script are unusable:
+
+```bash
+sudo systemctl stop atocore   # or: docker compose stop atocore
+MANUAL_STAMP=$(date -u +%Y%m%dT%H%M%SZ)   # compute once so both copies share one stamp
+sudo cp /srv/storage/atocore/data/db/atocore.db /srv/storage/atocore/backups/manual-$MANUAL_STAMP.db
+sudo cp /srv/storage/atocore/config/project-registry.json \
+        /srv/storage/atocore/backups/manual-$MANUAL_STAMP.registry.json
+sudo systemctl start atocore  # or: docker compose up -d atocore
+```
+
+This is a cold backup and requires brief downtime.
+
+## Listing backups
+
+```bash
+curl -fsS http://dalidou:8100/admin/backup
+```
+
+Returns the configured `backup_dir` and a list of all snapshots
+under it, with their full metadata if available.
+
+Or, on the host directly:
+
+```bash
+ls -la /srv/storage/atocore/backups/snapshots/
+```
+
+## Validating a backup
+
+Before relying on a backup for restore, validate it:
+
+```bash
+curl -fsS http://dalidou:8100/admin/backup/20260407T060000Z/validate
+```
+
+The validator:
+- confirms the snapshot directory exists
+- opens the SQLite snapshot and runs `PRAGMA integrity_check`
+- parses the registry JSON
+- confirms the Chroma directory exists (if it was included)
+
+A valid backup returns `"valid": true` and an empty `errors` array.
+A failing validation returns `"valid": false` with one or more
+specific error strings (e.g. `db_integrity_check_failed`,
+`registry_invalid_json`, `chroma_snapshot_missing`).
+
+**Validate every backup at creation time.** A backup that has never
+been validated is not actually a backup — it's just a hopeful copy
+of bytes.
+
+## Restore procedure
+
+### Pre-flight (always)
+
+1. Identify which snapshot you want to restore. List available
+   snapshots and pick by timestamp:
+   ```bash
+   curl -fsS http://dalidou:8100/admin/backup | jq '.backups[].stamp'
+   ```
+2. Validate it. Refuse to restore an invalid backup:
+   ```bash
+   STAMP=20260407T060000Z
+   curl -fsS http://dalidou:8100/admin/backup/$STAMP/validate | jq .
+   ```
+3. **Stop AtoCore.** SQLite cannot be hot-restored under a running
+   process and Chroma will not pick up new files until the process
+   restarts.
+   ```bash
+   docker compose stop atocore
+   # or: sudo systemctl stop atocore
+   ```
+4. **Take a safety snapshot of the current state** before overwriting
+   it. This is your "if the restore makes things worse, here's the
+   undo" backup.
+   ```bash
+   PRESERVE_STAMP=$(date -u +%Y%m%dT%H%M%SZ)
+   sudo cp /srv/storage/atocore/data/db/atocore.db \
+           /srv/storage/atocore/backups/pre-restore-$PRESERVE_STAMP.db
+   sudo cp /srv/storage/atocore/config/project-registry.json \
+           /srv/storage/atocore/backups/pre-restore-$PRESERVE_STAMP.registry.json 2>/dev/null || true
+   ```
+
+### Restore the SQLite database
+
+```bash
+SNAPSHOT_DIR=/srv/storage/atocore/backups/snapshots/$STAMP
+sudo cp $SNAPSHOT_DIR/db/atocore.db \
+        /srv/storage/atocore/data/db/atocore.db
+sudo chown 1000:1000 /srv/storage/atocore/data/db/atocore.db
+sudo chmod 600 /srv/storage/atocore/data/db/atocore.db
+```
+
+The owner must match the gitea/atocore container user. Record the existing
+owner and mode with the command below before running the `cp` above:
+
+```bash
+stat -c '%U:%G %a' /srv/storage/atocore/data/db/atocore.db
+```
+
+### Restore the project registry
+
+```bash
+if [ -f $SNAPSHOT_DIR/config/project-registry.json ]; then
+  sudo cp $SNAPSHOT_DIR/config/project-registry.json \
+          /srv/storage/atocore/config/project-registry.json
+  sudo chown 1000:1000 /srv/storage/atocore/config/project-registry.json
+  sudo chmod 644 /srv/storage/atocore/config/project-registry.json
+fi
+```
+
+If the snapshot does not contain a registry, the current registry is
+preserved. The pre-flight safety copy still gives you a recovery path
+if you need to roll back.
+
+### Restore the Chroma vector store (if it was in the snapshot)
+
+```bash
+if [ -d $SNAPSHOT_DIR/chroma ]; then
+  # Move the current chroma dir aside as a safety copy
+  sudo mv /srv/storage/atocore/data/chroma \
+          /srv/storage/atocore/data/chroma.pre-restore-$PRESERVE_STAMP
+
+  # Copy the snapshot in
+  sudo cp -a $SNAPSHOT_DIR/chroma /srv/storage/atocore/data/chroma
+  sudo chown -R 1000:1000 /srv/storage/atocore/data/chroma
+fi
+```
+
+If the snapshot does NOT contain a Chroma dir but the SQLite
+restore would leave the vector store and the SQL store inconsistent
+(e.g. SQL has chunks the vector store doesn't), you have two
+options:
+
+- **Option 1: rebuild the vector store from source documents.** Run
+  ingestion fresh after the SQL restore. This regenerates embeddings
+  from the actual source files. Slow but produces a perfectly
+  consistent state.
+- **Option 2: accept the inconsistency and live with stale-vector
+  filtering.** The retriever already drops vector results whose
+  SQL row no longer exists (`_existing_chunk_ids` filter), so the
+  inconsistency surfaces as missing results, not bad ones.
+
+For an unplanned restore, Option 2 is the right immediate move.
+Then schedule a fresh ingestion pass to rebuild the vector store
+properly.
+
+### Restart AtoCore
+
+```bash
+docker compose up -d atocore
+# or: sudo systemctl start atocore
+```
+
+### Post-restore verification
+
+```bash
+# 1. Service is healthy
+curl -fsS http://dalidou:8100/health | jq .
+
+# 2. Stats look right
+curl -fsS http://dalidou:8100/stats | jq .
+
+# 3. Project registry loads
+curl -fsS http://dalidou:8100/projects | jq '.projects | length'
+
+# 4. A known-good context query returns non-empty results
+curl -fsS -X POST http://dalidou:8100/context/build \
+  -H "Content-Type: application/json" \
+  -d '{"prompt": "what is p05 about", "project": "p05-interferometer"}' | jq '.chunks_used'
+```
+
+If any of these are wrong, the restore is bad. Roll back using the
+pre-restore safety copy:
+
+```bash
+docker compose stop atocore
+sudo cp /srv/storage/atocore/backups/pre-restore-$PRESERVE_STAMP.db \
+        /srv/storage/atocore/data/db/atocore.db
+sudo cp /srv/storage/atocore/backups/pre-restore-$PRESERVE_STAMP.registry.json \
+        /srv/storage/atocore/config/project-registry.json 2>/dev/null || true
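+# Only if you also restored chroma (the guard skips this when no safety copy exists):
+[ -d /srv/storage/atocore/data/chroma.pre-restore-$PRESERVE_STAMP ] && \
+  sudo rm -rf /srv/storage/atocore/data/chroma && \
+  sudo mv /srv/storage/atocore/data/chroma.pre-restore-$PRESERVE_STAMP /srv/storage/atocore/data/chroma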
+docker compose up -d atocore
+```
+
+## Retention policy
+
+- **Last 7 daily backups**: kept verbatim
+- **Last 4 weekly backups** (Sunday): kept verbatim
+- **Last 6 monthly backups** (1st of month): kept verbatim
+- **Anything older**: deleted
+
+The retention job is **not yet implemented** and is tracked as a
+follow-up. Until then, the snapshots directory grows monotonically.
+A simple cron-based cleanup script is the next step:
+
+```cron
+0 4 * * * /srv/storage/atocore/scripts/cleanup-old-backups.sh
+```
+
+## Drill schedule
+
+A backup that has never been restored is theoretical. The schedule:
+
+- **At least once per quarter**, perform a full restore drill on a
+  staging environment (or a temporary container with a separate
+  data dir) and verify the post-restore checks pass.
+- **After every breaking schema migration**, perform a restore drill
+  to confirm the migration is reversible.
+- **After any incident** that touched the storage layer (the EXDEV
+  bug from April 2026 is a good example), confirm the next backup
+  validates clean.
+
+## Common failure modes and what to do about them
+
+| Symptom | Likely cause | Action |
+|---|---|---|
+| `db_integrity_check_failed` on validation | SQLite snapshot copied while a write was in progress, or disk corruption | Take a fresh backup and validate again. If it fails twice, suspect the underlying disk. |
+| `registry_invalid_json` | Registry was being edited at backup time | Take a fresh backup. The registry is small so this is cheap. |
+| `chroma_snapshot_missing` after a restore | Snapshot was DB-only and the restore didn't move the existing chroma dir | Either rebuild via fresh ingestion or restore an older snapshot that includes Chroma. |
+| Service won't start after restore | Permissions wrong on the restored files | Re-run `chown 1000:1000` (or whatever the gitea/atocore container user is) on the data dir. |
+| `/stats` returns 0 documents after restore | The SQL store was restored but the source paths in `source_documents` don't match the current Dalidou paths | This means the backup came from a different deployment. Don't trust this restore — it's pulling from the wrong layout. |
+
+## Open follow-ups (not yet implemented)
+
+1. **Retention cleanup script**: see the cron entry above.
+2. **Off-Dalidou backup target**: currently snapshots live on the
+   same disk as the live data. A real disaster-recovery story
+   needs at least one snapshot on a different physical machine.
+   The simplest first step is a periodic `rsync` to the user's
+   laptop or to another server.
+3. **Backup encryption**: snapshots contain raw SQLite and JSON.
+   Consider age/gpg encryption if backups will be shipped off-site.
+4. **Automatic post-backup validation**: today the validator must
+   be invoked manually. The `create_runtime_backup` function
+   should call `validate_backup` on its own output and refuse to
+   declare success if validation fails.
+5. **Chroma backup is currently full directory copy** every time.
+   For large vector stores this gets expensive. A future
+   improvement would be incremental snapshots via filesystem-level
+   snapshotting (LVM, btrfs, ZFS).
+
+## Quickstart cheat sheet
+
+```bash
+# Daily backup (DB + registry only — fast)
+curl -fsS -X POST http://dalidou:8100/admin/backup \
+  -H "Content-Type: application/json" -d '{}'
+
+# Weekly backup (DB + registry + Chroma — slower, holds ingestion lock)
+curl -fsS -X POST http://dalidou:8100/admin/backup \
+  -H "Content-Type: application/json" -d '{"include_chroma": true}'
+
+# List backups
+curl -fsS http://dalidou:8100/admin/backup | jq '.backups[].stamp'
+
+# Validate the most recent backup
+LATEST=$(curl -fsS http://dalidou:8100/admin/backup | jq -r '.backups[-1].stamp')
+curl -fsS http://dalidou:8100/admin/backup/$LATEST/validate | jq .
+
+# Full restore — see the "Restore procedure" section above
+```
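Review note on the "Retention policy" section of `docs/backup-restore-procedure.md`: since the cleanup job is not yet implemented, the following is a minimal Python sketch of how the 7 daily / 4 weekly / 6 monthly selection could be computed from snapshot directory names. The names `parse_stamp` and `stamps_to_keep` are hypothetical and are not part of `src/atocore/ops/backup.py`. The sketch assumes that "daily" means the newest snapshot per UTC calendar day, and that weekly and monthly snapshots are picked from Sundays and the 1st of the month, as the policy states. It only decides what to keep and deletes nothing.

```python
from datetime import datetime, timezone

# Snapshot directory names use the UTC format documented above.
STAMP_FORMAT = "%Y%m%dT%H%M%SZ"


def parse_stamp(stamp: str) -> datetime:
    """Parse a snapshot name such as 20260407T060000Z into an aware datetime."""
    return datetime.strptime(stamp, STAMP_FORMAT).replace(tzinfo=timezone.utc)


def stamps_to_keep(stamps, daily=7, weekly=4, monthly=6):
    """Return the set of snapshot names the retention policy would keep."""
    # Keep only the newest snapshot of each calendar day; iterating in
    # ascending time order lets later snapshots overwrite earlier ones.
    newest_per_day = {}
    for stamp in sorted(stamps, key=parse_stamp):
        newest_per_day[parse_stamp(stamp).date()] = stamp

    days = sorted(newest_per_day, reverse=True)          # newest day first
    sundays = [d for d in days if d.weekday() == 6]      # Monday is 0, Sunday is 6
    firsts = [d for d in days if d.day == 1]             # 1st of each month

    selected = days[:daily] + sundays[:weekly] + firsts[:monthly]
    return {newest_per_day[d] for d in selected}


if __name__ == "__main__":
    names = [f"202604{d:02d}T060000Z" for d in range(1, 8)]
    names += ["20260329T060000Z", "20260328T060000Z", "20260301T060000Z"]
    for name in sorted(stamps_to_keep(names)):
        print("keep", name)
```

A cleanup script built on this would list the directories under `/srv/storage/atocore/backups/snapshots/`, pass their names to `stamps_to_keep`, and remove only the names that are not in the returned set. Ideally it would validate the snapshots it keeps first.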