slash command for daily AtoCore use + backup-restore procedure

Session 2 of the four-session plan. Lands two operational pieces:
the Claude Code slash command that makes AtoCore reachable from
inside any Claude Code session, and the full backup/restore
procedure doc that turns the backup endpoint code into a real
operational drill.

Slash command (.claude/commands/atocore-context.md)
---------------------------------------------------
- Project-level slash command following the standard frontmatter
  format (description + argument-hint)
- Parses the user prompt and an optional trailing project id, with
  case-insensitive matching against the registered project ids
  (atocore, p04-gigabit, p05-interferometer, p06-polisher and
  their aliases)
- Calls POST /context/build on the live AtoCore service, defaulting
  to http://dalidou:8100 (overridable via ATOCORE_API_BASE env var)
- Renders the formatted context pack inline so the user can see
  exactly what AtoCore would feed an LLM, plus a stats banner and a
  per-chunk source list
- Includes graceful failure handling for network errors, 4xx, 5xx,
  and the empty-result case
- Defines a future capture path that POSTs to /interactions for the
  Phase 9 reflection loop. The current command leaves capture as
  manual / opt-in pending a clean post-turn hook design

.gitignore changes
------------------
- Replaced wholesale .claude/ ignore with .claude/* + exceptions
  for .claude/commands/ so project slash commands can be tracked
- Other .claude/* paths (worktrees, settings, local state) remain
  ignored

Backup-restore procedure (docs/backup-restore-procedure.md)
-----------------------------------------------------------
- Defines what gets backed up (SQLite + registry always, Chroma
  optional under ingestion lock) and what doesn't (sources, code,
  logs, cache, tmp)
- Documents the snapshot directory layout and the timestamp format
- Three trigger paths in priority order:
  - via POST /admin/backup with {include_chroma: true|false}
  - via the standalone src/atocore/ops/backup.py module
  - via cold filesystem copy with brief downtime as last resort
- Listing and validation procedure with the /admin/backup and
  /admin/backup/{stamp}/validate endpoints
- Full step-by-step restore procedure with mandatory pre-flight
  safety snapshot, ownership/permission requirements, and the
  post-restore verification checks
- Rollback path using the pre-restore safety copy
- Retention policy (last 7 daily / 4 weekly / 6 monthly) and
  explicit acknowledgment that the cleanup job is not yet
  implemented
- Drill schedule: quarterly full restore drill, post-migration
  drill, post-incident validation
- Common failure mode table with diagnoses
- Quickstart cheat sheet at the end for daily reference
- Open follow-ups: cleanup script, off-Dalidou target,
  encryption, automatic post-backup validation, incremental
  Chroma snapshots

The procedure has not yet been exercised against the live Dalidou
instance — that is the next step the user runs themselves once
the slash command is in place.
This commit is contained in:
2026-04-07 06:46:50 -04:00
parent d0ff8b5738
commit a637017900
3 changed files with 486 additions and 1 deletions

View File

@@ -0,0 +1,123 @@
---
description: Pull a context pack from the live AtoCore service for the current prompt
argument-hint: <prompt text> [project-id]
---
You are about to enrich a user prompt with context from the live AtoCore
service. This is the daily-use entry point for AtoCore from inside Claude
Code.
## Step 1 — parse the arguments
The user invoked `/atocore-context` with the following arguments:
```
$ARGUMENTS
```
Treat the **entire argument string** as the prompt text by default. If the
last whitespace-separated token looks like a registered project id (matches
one of `atocore`, `p04-gigabit`, `p04`, `p05-interferometer`, `p05`,
`p06-polisher`, `p06`, or any case-insensitive variant), treat it as the
project hint and use the rest as the prompt text. Otherwise, leave the
project hint empty.
## Step 2 — call the AtoCore /context/build endpoint
Use the Bash tool to call AtoCore. The default endpoint is the live
Dalidou instance. Read `ATOCORE_API_BASE` from the environment if set,
otherwise default to `http://dalidou:3000` (the gitea host) — wait,
no, AtoCore lives on a different port. Default to `http://dalidou:8100`
which is the AtoCore service port from `pyproject.toml` and `config.py`.
Build the JSON body with `jq -n` so quoting is safe. Run something like:
```bash
ATOCORE_API_BASE="${ATOCORE_API_BASE:-http://dalidou:8100}"
PROMPT_TEXT='<the prompt text from step 1>'
PROJECT_HINT='<the project hint or empty string>'
if [ -n "$PROJECT_HINT" ]; then
BODY=$(jq -n --arg p "$PROMPT_TEXT" --arg proj "$PROJECT_HINT" \
'{prompt:$p, project:$proj}')
else
BODY=$(jq -n --arg p "$PROMPT_TEXT" '{prompt:$p}')
fi
curl -fsS -X POST "$ATOCORE_API_BASE/context/build" \
-H "Content-Type: application/json" \
-d "$BODY"
```
If `jq` is not available on the host, fall back to a Python one-liner:
```bash
python -c "import json,sys; print(json.dumps({'prompt': sys.argv[1], 'project': sys.argv[2]} if sys.argv[2] else {'prompt': sys.argv[1]}))" "$PROMPT_TEXT" "$PROJECT_HINT"
```
## Step 3 — present the context pack to the user
The response is JSON with at least these fields:
`formatted_context`, `chunks_used`, `total_chars`, `budget`,
`budget_remaining`, `duration_ms`, and a `chunks` array.
Print the response in a readable summary:
1. Print a one-line stats banner: `chunks=N, chars=X/budget, duration=Yms`
2. Print the `formatted_context` block verbatim inside a fenced text
code block so the user can read what AtoCore would feed an LLM
3. Print the `chunks` array as a small bulleted list with `source_file`,
`heading_path`, and `score` per chunk
If the response is empty (`chunks_used=0`, no project state, no
memories), tell the user explicitly: "AtoCore returned no context for
this prompt — either the corpus does not have relevant information or
the project hint is wrong. Try `/atocore-context <prompt> <project-id>`."
If the curl call fails:
- Network error → tell the user the AtoCore service may be down at
`$ATOCORE_API_BASE` and suggest checking `curl $ATOCORE_API_BASE/health`
- 4xx → print the error body verbatim, the API error message is usually
enough
- 5xx → print the error body and suggest checking the service logs
## Step 4 — capture the interaction (optional, opt-in)
If the user has previously asked the assistant to capture interactions
into AtoCore (or if the slash command was invoked with the trailing
literal `--capture` token), also POST the captured exchange to
`/interactions` so the Phase 9 reflection loop sees it. Skip this step
silently otherwise. The capture body is:
```json
{
"prompt": "<user prompt>",
"response": "",
"response_summary": "",
"project": "<project hint or empty>",
"client": "claude-code-slash",
"session_id": "<a stable id for this Claude Code session>",
"memories_used": ["<from chunks array if available>"],
"chunks_used": ["<chunk_id from chunks array>"],
"context_pack": {"chunks_used": <N>, "total_chars": <X>}
}
```
Note that the response field stays empty here — the LLM hasn't actually
answered yet at the moment the slash command runs. A separate post-turn
hook (not part of this command) would update the same interaction with
the response, OR a follow-up `/atocore-record-response <interaction-id>`
command would do it. For now, leave that as future work.
## Notes for the assistant
- DO NOT invent project ids that aren't in the registry. If the user
passed something that doesn't match, treat it as part of the prompt.
- DO NOT silently fall back to a different endpoint. If `ATOCORE_API_BASE`
is wrong, surface the network error and let the user fix the env var.
- DO NOT hide the formatted context pack from the user. The whole point
of this command is to show what AtoCore would feed an LLM, so the user
can decide if it's relevant.
- The output goes into the user's working context as background — they
may follow up with their actual question, and the AtoCore context pack
acts as informal injected knowledge.

4
.gitignore vendored
View File

@@ -10,4 +10,6 @@ htmlcov/
.coverage
venv/
.venv/
.claude/
.claude/*
!.claude/commands/
!.claude/commands/**

View File

@@ -0,0 +1,360 @@
# AtoCore Backup and Restore Procedure
## Scope
This document defines the operational procedure for backing up and
restoring AtoCore's machine state on the Dalidou deployment. It is
the practical companion to `docs/backup-strategy.md` (which defines
the strategy) and `src/atocore/ops/backup.py` (which implements the
mechanics).
The intent is that this procedure can be followed by anyone with
SSH access to Dalidou and the AtoCore admin endpoints.
## What gets backed up
A `create_runtime_backup` snapshot contains, in order of importance:
| Artifact | Source path on Dalidou | Backup destination | Always included |
|---|---|---|---|
| SQLite database | `/srv/storage/atocore/data/db/atocore.db` | `<backup_root>/db/atocore.db` | yes |
| Project registry JSON | `/srv/storage/atocore/config/project-registry.json` | `<backup_root>/config/project-registry.json` | yes (if file exists) |
| Backup metadata | (generated) | `<backup_root>/backup-metadata.json` | yes |
| Chroma vector store | `/srv/storage/atocore/data/chroma/` | `<backup_root>/chroma/` | only when `include_chroma=true` |
The SQLite snapshot uses the online `conn.backup()` API and is safe
to take while the database is in use. The Chroma snapshot is a cold
directory copy and is **only safe when no ingestion is running**;
the API endpoint enforces this by acquiring the ingestion lock for
the duration of the copy.
What is **not** in the backup:
- Source documents under `/srv/storage/atocore/sources/vault/` and
`/srv/storage/atocore/sources/drive/`. These are read-only
inputs and live in the user's PKM/Drive, which is backed up
separately by their own systems.
- Application code. The container image is the source of truth for
code; recovery means rebuilding the image, not restoring code from
a backup.
- Logs under `/srv/storage/atocore/logs/`.
- Embeddings cache under `/srv/storage/atocore/data/cache/`.
- Temp files under `/srv/storage/atocore/data/tmp/`.
## Backup root layout
Each backup snapshot lives in its own timestamped directory:
```
/srv/storage/atocore/backups/snapshots/
├── 20260407T060000Z/
│ ├── backup-metadata.json
│ ├── db/
│ │ └── atocore.db
│ ├── config/
│ │ └── project-registry.json
│ └── chroma/ # only if include_chroma=true
│ └── ...
├── 20260408T060000Z/
│ └── ...
└── ...
```
The timestamp is UTC, format `YYYYMMDDTHHMMSSZ`.
## Triggering a backup
### Option A — via the admin endpoint (preferred)
```bash
# DB + registry only (fast, safe at any time)
curl -fsS -X POST http://dalidou:8100/admin/backup \
-H "Content-Type: application/json" \
-d '{"include_chroma": false}'
# DB + registry + Chroma (acquires ingestion lock)
curl -fsS -X POST http://dalidou:8100/admin/backup \
-H "Content-Type: application/json" \
-d '{"include_chroma": true}'
```
The response is the backup metadata JSON. Save the `backup_root`
field — that's the directory the snapshot was written to.
### Option B — via the standalone script (when the API is down)
```bash
docker exec atocore python -m atocore.ops.backup
```
This runs `create_runtime_backup()` directly, without going through
the API or the ingestion lock. Use it only when the AtoCore service
itself is unhealthy and you can't hit the admin endpoint.
### Option C — manual file copy (last resort)
If both the API and the standalone script are unusable:
```bash
sudo systemctl stop atocore # or: docker compose stop atocore
sudo cp /srv/storage/atocore/data/db/atocore.db \
/srv/storage/atocore/backups/manual-$(date -u +%Y%m%dT%H%M%SZ).db
sudo cp /srv/storage/atocore/config/project-registry.json \
/srv/storage/atocore/backups/manual-$(date -u +%Y%m%dT%H%M%SZ).registry.json
sudo systemctl start atocore
```
This is a cold backup and requires brief downtime.
## Listing backups
```bash
curl -fsS http://dalidou:8100/admin/backup
```
Returns the configured `backup_dir` and a list of all snapshots
under it, with their full metadata if available.
Or, on the host directly:
```bash
ls -la /srv/storage/atocore/backups/snapshots/
```
## Validating a backup
Before relying on a backup for restore, validate it:
```bash
curl -fsS http://dalidou:8100/admin/backup/20260407T060000Z/validate
```
The validator:
- confirms the snapshot directory exists
- opens the SQLite snapshot and runs `PRAGMA integrity_check`
- parses the registry JSON
- confirms the Chroma directory exists (if it was included)
A valid backup returns `"valid": true` and an empty `errors` array.
A failing validation returns `"valid": false` with one or more
specific error strings (e.g. `db_integrity_check_failed`,
`registry_invalid_json`, `chroma_snapshot_missing`).
**Validate every backup at creation time.** A backup that has never
been validated is not actually a backup — it's just a hopeful copy
of bytes.
## Restore procedure
### Pre-flight (always)
1. Identify which snapshot you want to restore. List available
snapshots and pick by timestamp:
```bash
curl -fsS http://dalidou:8100/admin/backup | jq '.backups[].stamp'
```
2. Validate it. Refuse to restore an invalid backup:
```bash
STAMP=20260407T060000Z
curl -fsS http://dalidou:8100/admin/backup/$STAMP/validate | jq .
```
3. **Stop AtoCore.** SQLite cannot be hot-restored under a running
process and Chroma will not pick up new files until the process
restarts.
```bash
docker compose stop atocore
# or: sudo systemctl stop atocore
```
4. **Take a safety snapshot of the current state** before overwriting
it. This is your "if the restore makes things worse, here's the
undo" backup.
```bash
PRESERVE_STAMP=$(date -u +%Y%m%dT%H%M%SZ)
sudo cp /srv/storage/atocore/data/db/atocore.db \
/srv/storage/atocore/backups/pre-restore-$PRESERVE_STAMP.db
sudo cp /srv/storage/atocore/config/project-registry.json \
/srv/storage/atocore/backups/pre-restore-$PRESERVE_STAMP.registry.json 2>/dev/null || true
```
### Restore the SQLite database
```bash
SNAPSHOT_DIR=/srv/storage/atocore/backups/snapshots/$STAMP
sudo cp $SNAPSHOT_DIR/db/atocore.db \
/srv/storage/atocore/data/db/atocore.db
sudo chown 1000:1000 /srv/storage/atocore/data/db/atocore.db
sudo chmod 600 /srv/storage/atocore/data/db/atocore.db
```
The chown should match the gitea/atocore container user. Verify
by checking the existing perms before overwriting:
```bash
stat -c '%U:%G %a' /srv/storage/atocore/data/db/atocore.db
```
### Restore the project registry
```bash
if [ -f $SNAPSHOT_DIR/config/project-registry.json ]; then
sudo cp $SNAPSHOT_DIR/config/project-registry.json \
/srv/storage/atocore/config/project-registry.json
sudo chown 1000:1000 /srv/storage/atocore/config/project-registry.json
sudo chmod 644 /srv/storage/atocore/config/project-registry.json
fi
```
If the snapshot does not contain a registry, the current registry is
preserved. The pre-flight safety copy still gives you a recovery path
if you need to roll back.
### Restore the Chroma vector store (if it was in the snapshot)
```bash
if [ -d $SNAPSHOT_DIR/chroma ]; then
# Move the current chroma dir aside as a safety copy
sudo mv /srv/storage/atocore/data/chroma \
/srv/storage/atocore/data/chroma.pre-restore-$PRESERVE_STAMP
# Copy the snapshot in
sudo cp -a $SNAPSHOT_DIR/chroma /srv/storage/atocore/data/chroma
sudo chown -R 1000:1000 /srv/storage/atocore/data/chroma
fi
```
If the snapshot does NOT contain a Chroma dir but the SQLite
restore would leave the vector store and the SQL store inconsistent
(e.g. SQL has chunks the vector store doesn't), you have two
options:
- **Option 1: rebuild the vector store from source documents.** Run
ingestion fresh after the SQL restore. This regenerates embeddings
from the actual source files. Slow but produces a perfectly
consistent state.
- **Option 2: accept the inconsistency and live with stale-vector
filtering.** The retriever already drops vector results whose
SQL row no longer exists (`_existing_chunk_ids` filter), so the
inconsistency surfaces as missing results, not bad ones.
For an unplanned restore, Option 2 is the right immediate move.
Then schedule a fresh ingestion pass to rebuild the vector store
properly.
### Restart AtoCore
```bash
docker compose up -d atocore
# or: sudo systemctl start atocore
```
### Post-restore verification
```bash
# 1. Service is healthy
curl -fsS http://dalidou:8100/health | jq .
# 2. Stats look right
curl -fsS http://dalidou:8100/stats | jq .
# 3. Project registry loads
curl -fsS http://dalidou:8100/projects | jq '.projects | length'
# 4. A known-good context query returns non-empty results
curl -fsS -X POST http://dalidou:8100/context/build \
-H "Content-Type: application/json" \
-d '{"prompt": "what is p05 about", "project": "p05-interferometer"}' | jq '.chunks_used'
```
If any of these are wrong, the restore is bad. Roll back using the
pre-restore safety copy:
```bash
docker compose stop atocore
sudo cp /srv/storage/atocore/backups/pre-restore-$PRESERVE_STAMP.db \
/srv/storage/atocore/data/db/atocore.db
sudo cp /srv/storage/atocore/backups/pre-restore-$PRESERVE_STAMP.registry.json \
/srv/storage/atocore/config/project-registry.json 2>/dev/null || true
# If you also restored chroma:
sudo rm -rf /srv/storage/atocore/data/chroma
sudo mv /srv/storage/atocore/data/chroma.pre-restore-$PRESERVE_STAMP \
/srv/storage/atocore/data/chroma
docker compose up -d atocore
```
## Retention policy
- **Last 7 daily backups**: kept verbatim
- **Last 4 weekly backups** (Sunday): kept verbatim
- **Last 6 monthly backups** (1st of month): kept verbatim
- **Anything older**: deleted
The retention job is **not yet implemented** and is tracked as a
follow-up. Until then, the snapshots directory grows monotonically.
A simple cron-based cleanup script is the next step:
```cron
0 4 * * * /srv/storage/atocore/scripts/cleanup-old-backups.sh
```
## Drill schedule
A backup that has never been restored is theoretical. The schedule:
- **At least once per quarter**, perform a full restore drill on a
staging environment (or a temporary container with a separate
data dir) and verify the post-restore checks pass.
- **After every breaking schema migration**, perform a restore drill
to confirm the migration is reversible.
- **After any incident** that touched the storage layer (the EXDEV
bug from April 2026 is a good example), confirm the next backup
validates clean.
## Common failure modes and what to do about them
| Symptom | Likely cause | Action |
|---|---|---|
| `db_integrity_check_failed` on validation | SQLite snapshot copied while a write was in progress, or disk corruption | Take a fresh backup and validate again. If it fails twice, suspect the underlying disk. |
| `registry_invalid_json` | Registry was being edited at backup time | Take a fresh backup. The registry is small so this is cheap. |
| `chroma_snapshot_missing` after a restore | Snapshot was DB-only and the restore didn't move the existing chroma dir | Either rebuild via fresh ingestion or restore an older snapshot that includes Chroma. |
| Service won't start after restore | Permissions wrong on the restored files | Re-run `chown 1000:1000` (or whatever the gitea/atocore container user is) on the data dir. |
| `/stats` returns 0 documents after restore | The SQL store was restored but the source paths in `source_documents` don't match the current Dalidou paths | This means the backup came from a different deployment. Don't trust this restore — it's pulling from the wrong layout. |
## Open follow-ups (not yet implemented)
1. **Retention cleanup script**: see the cron entry above.
2. **Off-Dalidou backup target**: currently snapshots live on the
same disk as the live data. A real disaster-recovery story
needs at least one snapshot on a different physical machine.
The simplest first step is a periodic `rsync` to the user's
laptop or to another server.
3. **Backup encryption**: snapshots contain raw SQLite and JSON.
Consider age/gpg encryption if backups will be shipped off-site.
4. **Automatic post-backup validation**: today the validator must
be invoked manually. The `create_runtime_backup` function
should call `validate_backup` on its own output and refuse to
declare success if validation fails.
5. **Chroma backup is currently full directory copy** every time.
For large vector stores this gets expensive. A future
improvement would be incremental snapshots via filesystem-level
snapshotting (LVM, btrfs, ZFS).
## Quickstart cheat sheet
```bash
# Daily backup (DB + registry only — fast)
curl -fsS -X POST http://dalidou:8100/admin/backup \
-H "Content-Type: application/json" -d '{}'
# Weekly backup (DB + registry + Chroma — slower, holds ingestion lock)
curl -fsS -X POST http://dalidou:8100/admin/backup \
-H "Content-Type: application/json" -d '{"include_chroma": true}'
# List backups
curl -fsS http://dalidou:8100/admin/backup | jq '.backups[].stamp'
# Validate the most recent backup
LATEST=$(curl -fsS http://dalidou:8100/admin/backup | jq -r '.backups[-1].stamp')
curl -fsS http://dalidou:8100/admin/backup/$LATEST/validate | jq .
# Full restore — see the "Restore procedure" section above
```