slash command for daily AtoCore use + backup-restore procedure

Session 2 of the four-session plan. Lands two operational pieces: the Claude Code slash command that makes AtoCore reachable from inside any Claude Code session, and the full backup/restore procedure doc that turns the backup endpoint code into a real operational drill.

Slash command (.claude/commands/atocore-context.md)
---------------------------------------------------

- Project-level slash command following the standard frontmatter format (description + argument-hint)
- Parses the user prompt and an optional trailing project id, with case-insensitive matching against the registered project ids (atocore, p04-gigabit, p05-interferometer, p06-polisher and their aliases)
- Calls POST /context/build on the live AtoCore service, defaulting to http://dalidou:8100 (overridable via the ATOCORE_API_BASE env var)
- Renders the formatted context pack inline so the user can see exactly what AtoCore would feed an LLM, plus a stats banner and a per-chunk source list
- Includes graceful failure handling for network errors, 4xx, 5xx, and the empty-result case
- Defines a future capture path that POSTs to /interactions for the Phase 9 reflection loop. The current command leaves capture as manual / opt-in pending a clean post-turn hook design

.gitignore changes
------------------

- Replaced the wholesale .claude/ ignore with .claude/* plus exceptions for .claude/commands/ so project slash commands can be tracked
- Other .claude/* paths (worktrees, settings, local state) remain ignored

Backup-restore procedure (docs/backup-restore-procedure.md)
-----------------------------------------------------------

- Defines what gets backed up (SQLite + registry always, Chroma optional under the ingestion lock) and what doesn't (sources, code, logs, cache, tmp)
- Documents the snapshot directory layout and the timestamp format
- Three trigger paths in priority order:
  - via POST /admin/backup with {include_chroma: true|false}
  - via the standalone src/atocore/ops/backup.py module
  - via cold filesystem copy with brief downtime as a last resort
- Listing and validation procedure with the /admin/backup and /admin/backup/{stamp}/validate endpoints
- Full step-by-step restore procedure with a mandatory pre-flight safety snapshot, ownership/permission requirements, and the post-restore verification checks
- Rollback path using the pre-restore safety copy
- Retention policy (last 7 daily / 4 weekly / 6 monthly) and an explicit acknowledgment that the cleanup job is not yet implemented
- Drill schedule: quarterly full restore drill, post-migration drill, post-incident validation
- Common failure mode table with diagnoses
- Quickstart cheat sheet at the end for daily reference
- Open follow-ups: cleanup script, off-Dalidou target, encryption, automatic post-backup validation, incremental Chroma snapshots

The procedure has not yet been exercised against the live Dalidou instance — that is the next step the user runs themselves once the slash command is in place.
New file: .claude/commands/atocore-context.md (123 lines)
---
description: Pull a context pack from the live AtoCore service for the current prompt
argument-hint: <prompt text> [project-id]
---

You are about to enrich a user prompt with context from the live AtoCore service. This is the daily-use entry point for AtoCore from inside Claude Code.

## Step 1 — parse the arguments

The user invoked `/atocore-context` with the following arguments:

```
$ARGUMENTS
```

Treat the **entire argument string** as the prompt text by default. If the last whitespace-separated token looks like a registered project id (matches one of `atocore`, `p04-gigabit`, `p04`, `p05-interferometer`, `p05`, `p06-polisher`, `p06`, or any case-insensitive variant), treat it as the project hint and use the rest as the prompt text. Otherwise, leave the project hint empty.
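If it helps to do this parsing in shell rather than by inspection, here is an illustrative sketch (not part of the command itself; the id list is the one above):

```shell
# Illustrative sketch of the step-1 rule: if the last token is a registered
# project id (case-insensitive), split it off; otherwise the whole string
# is the prompt.
parse_atocore_args() {
  args=$1
  last=${args##* }
  case "$(printf '%s' "$last" | tr '[:upper:]' '[:lower:]')" in
    atocore|p04-gigabit|p04|p05-interferometer|p05|p06-polisher|p06)
      PROJECT_HINT=$last
      PROMPT_TEXT=${args% *}
      ;;
    *)
      PROJECT_HINT=""
      PROMPT_TEXT=$args
      ;;
  esac
}

parse_atocore_args "how noisy is the cavity P05"
echo "prompt='$PROMPT_TEXT' project='$PROJECT_HINT'"
```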
## Step 2 — call the AtoCore /context/build endpoint

Use the Bash tool to call AtoCore. The default endpoint is the live Dalidou instance. Read `ATOCORE_API_BASE` from the environment if set, otherwise default to `http://dalidou:8100`, the AtoCore service port from `pyproject.toml` and `config.py`.
Build the JSON body with `jq -n` so quoting is safe. Run something like:

```bash
ATOCORE_API_BASE="${ATOCORE_API_BASE:-http://dalidou:8100}"
PROMPT_TEXT='<the prompt text from step 1>'
PROJECT_HINT='<the project hint or empty string>'

if [ -n "$PROJECT_HINT" ]; then
  BODY=$(jq -n --arg p "$PROMPT_TEXT" --arg proj "$PROJECT_HINT" \
    '{prompt:$p, project:$proj}')
else
  BODY=$(jq -n --arg p "$PROMPT_TEXT" '{prompt:$p}')
fi

curl -fsS -X POST "$ATOCORE_API_BASE/context/build" \
  -H "Content-Type: application/json" \
  -d "$BODY"
```

If `jq` is not available on the host, fall back to a Python one-liner:

```bash
python3 -c "import json,sys; print(json.dumps({'prompt': sys.argv[1], 'project': sys.argv[2]} if sys.argv[2] else {'prompt': sys.argv[1]}))" "$PROMPT_TEXT" "$PROJECT_HINT"
```

## Step 3 — present the context pack to the user

The response is JSON with at least these fields: `formatted_context`, `chunks_used`, `total_chars`, `budget`, `budget_remaining`, `duration_ms`, and a `chunks` array.

Print the response in a readable summary:

1. Print a one-line stats banner: `chunks=N, chars=X/budget, duration=Yms`
2. Print the `formatted_context` block verbatim inside a fenced text code block so the user can read what AtoCore would feed an LLM
3. Print the `chunks` array as a small bulleted list with `source_file`, `heading_path`, and `score` per chunk
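If scripting the display helps, the banner and source list can be produced with `jq` along these lines (the sample response below is invented for illustration; the field names follow the schema above):

```shell
# Render helpers for the step-3 summary. SAMPLE is a made-up response.
render_stats() {
  jq -r '"chunks=\(.chunks_used), chars=\(.total_chars)/\(.budget), duration=\(.duration_ms)ms"'
}
render_sources() {
  jq -r '.chunks[] | "- \(.source_file) :: \(.heading_path) (score \(.score))"'
}

SAMPLE='{"chunks_used":1,"total_chars":900,"budget":4000,"duration_ms":120,
         "chunks":[{"source_file":"notes/p05.md","heading_path":"P05 > Design","score":0.82}]}'
printf '%s' "$SAMPLE" | render_stats
printf '%s' "$SAMPLE" | render_sources
```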
If the response is empty (`chunks_used=0`, no project state, no memories), tell the user explicitly: "AtoCore returned no context for this prompt — either the corpus does not have relevant information or the project hint is wrong. Try `/atocore-context <prompt> <project-id>`."

If the curl call fails:

- Network error → tell the user the AtoCore service may be down at `$ATOCORE_API_BASE` and suggest checking `curl $ATOCORE_API_BASE/health`
- 4xx → print the error body verbatim; the API error message is usually enough
- 5xx → print the error body and suggest checking the service logs

## Step 4 — capture the interaction (optional, opt-in)

If the user has previously asked the assistant to capture interactions into AtoCore (or if the slash command was invoked with the trailing literal `--capture` token), also POST the captured exchange to `/interactions` so the Phase 9 reflection loop sees it. Skip this step silently otherwise. The capture body is:

```json
{
  "prompt": "<user prompt>",
  "response": "",
  "response_summary": "",
  "project": "<project hint or empty>",
  "client": "claude-code-slash",
  "session_id": "<a stable id for this Claude Code session>",
  "memories_used": ["<from chunks array if available>"],
  "chunks_used": ["<chunk_id from chunks array>"],
  "context_pack": {"chunks_used": <N>, "total_chars": <X>}
}
```

Note that the response field stays empty here — the LLM hasn't actually answered yet at the moment the slash command runs. A separate post-turn hook (not part of this command) would update the same interaction with the response, or a follow-up `/atocore-record-response <interaction-id>` command would do it. For now, leave that as future work.

## Notes for the assistant

- DO NOT invent project ids that aren't in the registry. If the user passed something that doesn't match, treat it as part of the prompt.
- DO NOT silently fall back to a different endpoint. If `ATOCORE_API_BASE` is wrong, surface the network error and let the user fix the env var.
- DO NOT hide the formatted context pack from the user. The whole point of this command is to show what AtoCore would feed an LLM, so the user can decide if it's relevant.
- The output goes into the user's working context as background — they may follow up with their actual question, and the AtoCore context pack acts as informal injected knowledge.
.gitignore (vendored, 4 changes)

@@ -10,4 +10,6 @@ htmlcov/
 .coverage
 venv/
 .venv/
-.claude/
+.claude/*
+!.claude/commands/
+!.claude/commands/**
New file: docs/backup-restore-procedure.md (360 lines)
# AtoCore Backup and Restore Procedure

## Scope

This document defines the operational procedure for backing up and restoring AtoCore's machine state on the Dalidou deployment. It is the practical companion to `docs/backup-strategy.md` (which defines the strategy) and `src/atocore/ops/backup.py` (which implements the mechanics).

The intent is that this procedure can be followed by anyone with SSH access to Dalidou and the AtoCore admin endpoints.

## What gets backed up

A `create_runtime_backup` snapshot contains, in order of importance:

| Artifact | Source path on Dalidou | Backup destination | Always included |
|---|---|---|---|
| SQLite database | `/srv/storage/atocore/data/db/atocore.db` | `<backup_root>/db/atocore.db` | yes |
| Project registry JSON | `/srv/storage/atocore/config/project-registry.json` | `<backup_root>/config/project-registry.json` | yes (if file exists) |
| Backup metadata | (generated) | `<backup_root>/backup-metadata.json` | yes |
| Chroma vector store | `/srv/storage/atocore/data/chroma/` | `<backup_root>/chroma/` | only when `include_chroma=true` |

The SQLite snapshot uses the online `conn.backup()` API and is safe to take while the database is in use. The Chroma snapshot is a cold directory copy and is **only safe when no ingestion is running**; the API endpoint enforces this by acquiring the ingestion lock for the duration of the copy.
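For reference, the same online-snapshot mechanism is available from the `sqlite3` CLI. A self-contained demo (throwaway temp files, not the Dalidou layout):

```shell
# Demo of SQLite's online backup, the mechanism conn.backup() uses.
# Paths here are disposable demo files, not the real atocore.db.
DEMO_DB=$(mktemp /tmp/atocore-demo-XXXXXX)
SNAP_DB=$DEMO_DB.snapshot

sqlite3 "$DEMO_DB" "CREATE TABLE chunks(id INTEGER PRIMARY KEY, body TEXT); INSERT INTO chunks(body) VALUES('hello');"
sqlite3 "$DEMO_DB" ".backup '$SNAP_DB'"        # safe while other writers are active
sqlite3 "$SNAP_DB" "PRAGMA integrity_check;"   # prints: ok
```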
What is **not** in the backup:

- Source documents under `/srv/storage/atocore/sources/vault/` and `/srv/storage/atocore/sources/drive/`. These are read-only inputs and live in the user's PKM/Drive, which is backed up separately by their own systems.
- Application code. The container image is the source of truth for code; recovery means rebuilding the image, not restoring code from a backup.
- Logs under `/srv/storage/atocore/logs/`.
- Embeddings cache under `/srv/storage/atocore/data/cache/`.
- Temp files under `/srv/storage/atocore/data/tmp/`.

## Backup root layout

Each backup snapshot lives in its own timestamped directory:

```
/srv/storage/atocore/backups/snapshots/
├── 20260407T060000Z/
│   ├── backup-metadata.json
│   ├── db/
│   │   └── atocore.db
│   ├── config/
│   │   └── project-registry.json
│   └── chroma/            # only if include_chroma=true
│       └── ...
├── 20260408T060000Z/
│   └── ...
└── ...
```

The timestamp is UTC, format `YYYYMMDDTHHMMSSZ`.

## Triggering a backup

### Option A — via the admin endpoint (preferred)

```bash
# DB + registry only (fast, safe at any time)
curl -fsS -X POST http://dalidou:8100/admin/backup \
  -H "Content-Type: application/json" \
  -d '{"include_chroma": false}'

# DB + registry + Chroma (acquires ingestion lock)
curl -fsS -X POST http://dalidou:8100/admin/backup \
  -H "Content-Type: application/json" \
  -d '{"include_chroma": true}'
```

The response is the backup metadata JSON. Save the `backup_root` field — that's the directory the snapshot was written to.

### Option B — via the standalone script (when the API is down)

```bash
docker exec atocore python -m atocore.ops.backup
```

This runs `create_runtime_backup()` directly, without going through the API or the ingestion lock. Use it only when the AtoCore service itself is unhealthy and you can't hit the admin endpoint.

### Option C — manual file copy (last resort)

If both the API and the standalone script are unusable:

```bash
sudo systemctl stop atocore   # or: docker compose stop atocore
sudo cp /srv/storage/atocore/data/db/atocore.db \
  /srv/storage/atocore/backups/manual-$(date -u +%Y%m%dT%H%M%SZ).db
sudo cp /srv/storage/atocore/config/project-registry.json \
  /srv/storage/atocore/backups/manual-$(date -u +%Y%m%dT%H%M%SZ).registry.json
sudo systemctl start atocore
```

This is a cold backup and requires brief downtime.

## Listing backups

```bash
curl -fsS http://dalidou:8100/admin/backup
```

Returns the configured `backup_dir` and a list of all snapshots under it, with their full metadata if available.

Or, on the host directly:

```bash
ls -la /srv/storage/atocore/backups/snapshots/
```

## Validating a backup

Before relying on a backup for restore, validate it:

```bash
curl -fsS http://dalidou:8100/admin/backup/20260407T060000Z/validate
```

The validator:

- confirms the snapshot directory exists
- opens the SQLite snapshot and runs `PRAGMA integrity_check`
- parses the registry JSON
- confirms the Chroma directory exists (if it was included)

A valid backup returns `"valid": true` and an empty `errors` array. A failing validation returns `"valid": false` with one or more specific error strings (e.g. `db_integrity_check_failed`, `registry_invalid_json`, `chroma_snapshot_missing`).

**Validate every backup at creation time.** A backup that has never been validated is not actually a backup — it's just a hopeful copy of bytes.
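If the API is down, roughly the same checks can be run by hand on the host. A sketch (assumes `sqlite3` and `python3` are on the host; the error strings mirror the validator's, but this script is not the validator itself):

```shell
# Hand-rolled spot check mirroring the validator's core steps.
check_snapshot() {
  snap=$1; ok=1
  [ -d "$snap" ] || { echo missing_snapshot_dir; ok=0; }
  sqlite3 "$snap/db/atocore.db" "PRAGMA integrity_check;" 2>/dev/null | grep -qx ok \
    || { echo db_integrity_check_failed; ok=0; }
  python3 -m json.tool "$snap/config/project-registry.json" >/dev/null 2>&1 \
    || { echo registry_invalid_json; ok=0; }
  [ "$ok" -eq 1 ] && echo valid
}

# On a host without this snapshot, the failing checks are printed instead.
check_snapshot /srv/storage/atocore/backups/snapshots/20260407T060000Z || true
```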
## Restore procedure

### Pre-flight (always)

1. Identify which snapshot you want to restore. List available snapshots and pick by timestamp:

   ```bash
   curl -fsS http://dalidou:8100/admin/backup | jq '.backups[].stamp'
   ```

2. Validate it. Refuse to restore an invalid backup:

   ```bash
   STAMP=20260407T060000Z
   curl -fsS http://dalidou:8100/admin/backup/$STAMP/validate | jq .
   ```

3. **Stop AtoCore.** SQLite cannot be hot-restored under a running process and Chroma will not pick up new files until the process restarts.

   ```bash
   docker compose stop atocore
   # or: sudo systemctl stop atocore
   ```

4. **Take a safety snapshot of the current state** before overwriting it. This is your "if the restore makes things worse, here's the undo" backup.

   ```bash
   PRESERVE_STAMP=$(date -u +%Y%m%dT%H%M%SZ)
   sudo cp /srv/storage/atocore/data/db/atocore.db \
     /srv/storage/atocore/backups/pre-restore-$PRESERVE_STAMP.db
   sudo cp /srv/storage/atocore/config/project-registry.json \
     /srv/storage/atocore/backups/pre-restore-$PRESERVE_STAMP.registry.json 2>/dev/null || true
   ```

### Restore the SQLite database

```bash
SNAPSHOT_DIR=/srv/storage/atocore/backups/snapshots/$STAMP
sudo cp $SNAPSHOT_DIR/db/atocore.db \
  /srv/storage/atocore/data/db/atocore.db
sudo chown 1000:1000 /srv/storage/atocore/data/db/atocore.db
sudo chmod 600 /srv/storage/atocore/data/db/atocore.db
```

The chown should match the atocore container user. Verify by checking the existing perms before overwriting:

```bash
stat -c '%U:%G %a' /srv/storage/atocore/data/db/atocore.db
```

### Restore the project registry

```bash
if [ -f $SNAPSHOT_DIR/config/project-registry.json ]; then
  sudo cp $SNAPSHOT_DIR/config/project-registry.json \
    /srv/storage/atocore/config/project-registry.json
  sudo chown 1000:1000 /srv/storage/atocore/config/project-registry.json
  sudo chmod 644 /srv/storage/atocore/config/project-registry.json
fi
```

If the snapshot does not contain a registry, the current registry is preserved. The pre-flight safety copy still gives you a recovery path if you need to roll back.

### Restore the Chroma vector store (if it was in the snapshot)

```bash
if [ -d $SNAPSHOT_DIR/chroma ]; then
  # Move the current chroma dir aside as a safety copy
  sudo mv /srv/storage/atocore/data/chroma \
    /srv/storage/atocore/data/chroma.pre-restore-$PRESERVE_STAMP

  # Copy the snapshot in
  sudo cp -a $SNAPSHOT_DIR/chroma /srv/storage/atocore/data/chroma
  sudo chown -R 1000:1000 /srv/storage/atocore/data/chroma
fi
```

If the snapshot does NOT contain a Chroma dir but the SQLite restore would leave the vector store and the SQL store inconsistent (e.g. SQL has chunks the vector store doesn't), you have two options:

- **Option 1: rebuild the vector store from source documents.** Run ingestion fresh after the SQL restore. This regenerates embeddings from the actual source files. Slow but produces a perfectly consistent state.
- **Option 2: accept the inconsistency and live with stale-vector filtering.** The retriever already drops vector results whose SQL row no longer exists (`_existing_chunk_ids` filter), so the inconsistency surfaces as missing results, not bad ones.

For an unplanned restore, Option 2 is the right immediate move. Then schedule a fresh ingestion pass to rebuild the vector store properly.

### Restart AtoCore

```bash
docker compose up -d atocore
# or: sudo systemctl start atocore
```

### Post-restore verification

```bash
# 1. Service is healthy
curl -fsS http://dalidou:8100/health | jq .

# 2. Stats look right
curl -fsS http://dalidou:8100/stats | jq .

# 3. Project registry loads
curl -fsS http://dalidou:8100/projects | jq '.projects | length'

# 4. A known-good context query returns non-empty results
curl -fsS -X POST http://dalidou:8100/context/build \
  -H "Content-Type: application/json" \
  -d '{"prompt": "what is p05 about", "project": "p05-interferometer"}' | jq '.chunks_used'
```

If any of these are wrong, the restore is bad. Roll back using the pre-restore safety copy:

```bash
docker compose stop atocore
sudo cp /srv/storage/atocore/backups/pre-restore-$PRESERVE_STAMP.db \
  /srv/storage/atocore/data/db/atocore.db
sudo cp /srv/storage/atocore/backups/pre-restore-$PRESERVE_STAMP.registry.json \
  /srv/storage/atocore/config/project-registry.json 2>/dev/null || true
# If you also restored chroma:
sudo rm -rf /srv/storage/atocore/data/chroma
sudo mv /srv/storage/atocore/data/chroma.pre-restore-$PRESERVE_STAMP \
  /srv/storage/atocore/data/chroma
docker compose up -d atocore
```

## Retention policy

- **Last 7 daily backups**: kept verbatim
- **Last 4 weekly backups** (Sunday): kept verbatim
- **Last 6 monthly backups** (1st of month): kept verbatim
- **Anything older**: deleted

The retention job is **not yet implemented** and is tracked as a follow-up. Until then, the snapshots directory grows monotonically. A simple cron-based cleanup script is the next step:

```cron
0 4 * * * /srv/storage/atocore/scripts/cleanup-old-backups.sh
```
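A minimal sketch of what that script could look like (dry-run only, implements just the keep-newest-N rule, and relies on GNU `head -n -N`; the weekly/monthly tiers still need real date logic):

```shell
# Dry-run sketch of cleanup-old-backups.sh: keep the newest N snapshot
# dirs (the UTC stamps sort chronologically) and print what would go.
prune_snapshots() {
  snap_dir=$1; keep=$2
  ls -1d "$snap_dir"/*/ 2>/dev/null | sort | head -n -"$keep" |
    while read -r old; do
      echo "would delete: $old"   # swap echo for rm -rf once trusted
    done
}

prune_snapshots /srv/storage/atocore/backups/snapshots 7
```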
## Drill schedule

A backup that has never been restored is theoretical. The schedule:

- **At least once per quarter**, perform a full restore drill on a staging environment (or a temporary container with a separate data dir) and verify the post-restore checks pass.
- **After every breaking schema migration**, perform a restore drill to confirm the migration is reversible.
- **After any incident** that touched the storage layer (the EXDEV bug from April 2026 is a good example), confirm the next backup validates clean.

## Common failure modes and what to do about them

| Symptom | Likely cause | Action |
|---|---|---|
| `db_integrity_check_failed` on validation | SQLite snapshot copied while a write was in progress, or disk corruption | Take a fresh backup and validate again. If it fails twice, suspect the underlying disk. |
| `registry_invalid_json` | Registry was being edited at backup time | Take a fresh backup. The registry is small, so this is cheap. |
| `chroma_snapshot_missing` after a restore | Snapshot was DB-only and the restore didn't move the existing chroma dir | Either rebuild via fresh ingestion or restore an older snapshot that includes Chroma. |
| Service won't start after restore | Permissions wrong on the restored files | Re-run `chown 1000:1000` (or whatever the atocore container user is) on the data dir. |
| `/stats` returns 0 documents after restore | The SQL store was restored but the source paths in `source_documents` don't match the current Dalidou paths | The backup came from a different deployment. Don't trust this restore — it's pulling from the wrong layout. |

## Open follow-ups (not yet implemented)

1. **Retention cleanup script**: see the cron entry above.
2. **Off-Dalidou backup target**: currently snapshots live on the same disk as the live data. A real disaster-recovery story needs at least one snapshot on a different physical machine. The simplest first step is a periodic `rsync` to the user's laptop or to another server.
3. **Backup encryption**: snapshots contain raw SQLite and JSON. Consider age/gpg encryption if backups will be shipped off-site.
4. **Automatic post-backup validation**: today the validator must be invoked manually. The `create_runtime_backup` function should call `validate_backup` on its own output and refuse to declare success if validation fails.
5. **Chroma backup is currently a full directory copy** every time. For large vector stores this gets expensive. A future improvement would be incremental snapshots via filesystem-level snapshotting (LVM, btrfs, ZFS).
## Quickstart cheat sheet

```bash
# Daily backup (DB + registry only — fast)
curl -fsS -X POST http://dalidou:8100/admin/backup \
  -H "Content-Type: application/json" -d '{}'

# Weekly backup (DB + registry + Chroma — slower, holds ingestion lock)
curl -fsS -X POST http://dalidou:8100/admin/backup \
  -H "Content-Type: application/json" -d '{"include_chroma": true}'

# List backups
curl -fsS http://dalidou:8100/admin/backup | jq '.backups[].stamp'

# Validate the most recent backup
LATEST=$(curl -fsS http://dalidou:8100/admin/backup | jq -r '.backups[-1].stamp')
curl -fsS http://dalidou:8100/admin/backup/$LATEST/validate | jq .

# Full restore — see the "Restore procedure" section above
```