# AtoCore Backup and Restore Procedure

## Scope

This document defines the operational procedure for backing up and
restoring AtoCore's machine state on the Dalidou deployment. It is
the practical companion to `docs/backup-strategy.md` (which defines
the strategy) and `src/atocore/ops/backup.py` (which implements the
mechanics).

The intent is that this procedure can be followed by anyone with
SSH access to Dalidou and the AtoCore admin endpoints.

## What gets backed up

A `create_runtime_backup` snapshot contains, in order of importance:

| Artifact | Source path on Dalidou | Backup destination | Always included |
|---|---|---|---|
| SQLite database | `/srv/storage/atocore/data/db/atocore.db` | `<backup_root>/db/atocore.db` | yes |
| Project registry JSON | `/srv/storage/atocore/config/project-registry.json` | `<backup_root>/config/project-registry.json` | yes (if file exists) |
| Backup metadata | (generated) | `<backup_root>/backup-metadata.json` | yes |
| Chroma vector store | `/srv/storage/atocore/data/chroma/` | `<backup_root>/chroma/` | only when `include_chroma=true` |

The SQLite snapshot uses the online `conn.backup()` API and is safe
to take while the database is in use. The Chroma snapshot is a cold
directory copy and is **only safe when no ingestion is running**;
the API endpoint enforces this by acquiring the ingestion lock for
the duration of the copy.
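
The online snapshot mechanism is Python's stdlib
`sqlite3.Connection.backup()`. A minimal sketch of the idea (the
function name and paths here are illustrative, not the actual
`backup.py` code):

```python
import sqlite3
from pathlib import Path

def snapshot_sqlite(live_db: Path, dest: Path) -> None:
    """Copy a live SQLite database using the online backup API.

    Unlike a plain file copy, conn.backup() takes a consistent
    page-level copy even while other connections are writing.
    """
    dest.parent.mkdir(parents=True, exist_ok=True)
    src = sqlite3.connect(live_db)
    dst = sqlite3.connect(dest)
    try:
        src.backup(dst)  # copies all pages in one consistent pass
    finally:
        dst.close()
        src.close()
```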

What is **not** in the backup:

- Source documents under `/srv/storage/atocore/sources/vault/` and
  `/srv/storage/atocore/sources/drive/`. These are read-only inputs
  and live in the user's PKM/Drive, which is backed up separately by
  their own systems.
- Application code. The container image is the source of truth for
  code; recovery means rebuilding the image, not restoring code from
  a backup.
- Logs under `/srv/storage/atocore/logs/`.
- Embeddings cache under `/srv/storage/atocore/data/cache/`.
- Temp files under `/srv/storage/atocore/data/tmp/`.

## Backup root layout

Each backup snapshot lives in its own timestamped directory:

```
/srv/storage/atocore/backups/snapshots/
├── 20260407T060000Z/
│   ├── backup-metadata.json
│   ├── db/
│   │   └── atocore.db
│   ├── config/
│   │   └── project-registry.json
│   └── chroma/          # only if include_chroma=true
│       └── ...
├── 20260408T060000Z/
│   └── ...
└── ...
```

The timestamp is UTC, format `YYYYMMDDTHHMMSSZ`.

## Triggering a backup

### Option A — via the admin endpoint (preferred)

```bash
# DB + registry only (fast, safe at any time)
curl -fsS -X POST http://dalidou:8100/admin/backup \
  -H "Content-Type: application/json" \
  -d '{"include_chroma": false}'

# DB + registry + Chroma (acquires ingestion lock)
curl -fsS -X POST http://dalidou:8100/admin/backup \
  -H "Content-Type: application/json" \
  -d '{"include_chroma": true}'
```

The response is the backup metadata JSON. Save the `backup_root`
field — that's the directory the snapshot was written to.

### Option B — via the standalone script (when the API is down)

```bash
docker exec atocore python -m atocore.ops.backup
```

This runs `create_runtime_backup()` directly, without going through
the API or the ingestion lock. Use it only when the AtoCore service
itself is unhealthy and you can't hit the admin endpoint.

### Option C — manual file copy (last resort)

If both the API and the standalone script are unusable:

```bash
STAMP=$(date -u +%Y%m%dT%H%M%SZ)   # capture once so both copies share a stamp
sudo systemctl stop atocore        # or: docker compose stop atocore
sudo cp /srv/storage/atocore/data/db/atocore.db \
  /srv/storage/atocore/backups/manual-$STAMP.db
sudo cp /srv/storage/atocore/config/project-registry.json \
  /srv/storage/atocore/backups/manual-$STAMP.registry.json
sudo systemctl start atocore
```

This is a cold backup and requires brief downtime.

## Listing backups

```bash
curl -fsS http://dalidou:8100/admin/backup
```

Returns the configured `backup_dir` and a list of all snapshots
under it, with their full metadata if available.

Or, on the host directly:

```bash
ls -la /srv/storage/atocore/backups/snapshots/
```
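
A host-side equivalent in Python, should you need to script against
the snapshots directory (a sketch; the `stamp`/`metadata` keys
mirror this document, not a guaranteed API):

```python
import json
from pathlib import Path

def list_snapshots(snapshots_dir: Path) -> list[dict]:
    """One entry per snapshot dir, oldest first, with metadata if readable."""
    out = []
    for d in sorted(p for p in snapshots_dir.iterdir() if p.is_dir()):
        meta_path = d / "backup-metadata.json"
        try:
            meta = json.loads(meta_path.read_text())
        except (OSError, ValueError):
            meta = None  # metadata missing or unparseable
        out.append({"stamp": d.name, "metadata": meta})
    return out
```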

## Validating a backup

Before relying on a backup for restore, validate it:

```bash
curl -fsS http://dalidou:8100/admin/backup/20260407T060000Z/validate
```

The validator:

- confirms the snapshot directory exists
- opens the SQLite snapshot and runs `PRAGMA integrity_check`
- parses the registry JSON
- confirms the Chroma directory exists (if it was included)

A valid backup returns `"valid": true` and an empty `errors` array.
A failing validation returns `"valid": false` with one or more
specific error strings (e.g. `db_integrity_check_failed`,
`registry_invalid_json`, `chroma_snapshot_missing`).

**Validate every backup at creation time.** A backup that has never
been validated is not actually a backup — it's just a hopeful copy
of bytes.
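
The same checks can be re-run by hand when the API is down. A sketch
of the logic (the real validator lives in
`src/atocore/ops/backup.py`; the error strings match this document,
except `snapshot_missing`, which is a placeholder):

```python
import json
import sqlite3
from pathlib import Path

def validate_snapshot(snapshot_dir: Path, expect_chroma: bool) -> dict:
    """Offline re-implementation of the validator's checks."""
    if not snapshot_dir.is_dir():
        return {"valid": False, "errors": ["snapshot_missing"]}
    errors = []

    db = snapshot_dir / "db" / "atocore.db"
    if not db.is_file():
        errors.append("db_integrity_check_failed")
    else:
        conn = sqlite3.connect(db)
        try:
            row = conn.execute("PRAGMA integrity_check").fetchone()
            if row != ("ok",):
                errors.append("db_integrity_check_failed")
        except sqlite3.DatabaseError:
            errors.append("db_integrity_check_failed")
        finally:
            conn.close()

    registry = snapshot_dir / "config" / "project-registry.json"
    if registry.is_file():
        try:
            json.loads(registry.read_text())
        except ValueError:
            errors.append("registry_invalid_json")

    if expect_chroma and not (snapshot_dir / "chroma").is_dir():
        errors.append("chroma_snapshot_missing")

    return {"valid": not errors, "errors": errors}
```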

## Restore procedure

### Pre-flight (always)

1. Identify which snapshot you want to restore. List available
   snapshots and pick by timestamp:

   ```bash
   curl -fsS http://dalidou:8100/admin/backup | jq '.backups[].stamp'
   ```

2. Validate it. Refuse to restore an invalid backup:

   ```bash
   STAMP=20260407T060000Z
   curl -fsS http://dalidou:8100/admin/backup/$STAMP/validate | jq .
   ```

3. **Stop AtoCore.** SQLite cannot be hot-restored under a running
   process, and Chroma will not pick up new files until the process
   restarts.

   ```bash
   docker compose stop atocore
   # or: sudo systemctl stop atocore
   ```

4. **Take a safety snapshot of the current state** before overwriting
   it. This is your "if the restore makes things worse, here's the
   undo" backup.

   ```bash
   PRESERVE_STAMP=$(date -u +%Y%m%dT%H%M%SZ)
   sudo cp /srv/storage/atocore/data/db/atocore.db \
     /srv/storage/atocore/backups/pre-restore-$PRESERVE_STAMP.db
   sudo cp /srv/storage/atocore/config/project-registry.json \
     /srv/storage/atocore/backups/pre-restore-$PRESERVE_STAMP.registry.json 2>/dev/null || true
   ```

### Restore the SQLite database

```bash
SNAPSHOT_DIR=/srv/storage/atocore/backups/snapshots/$STAMP
sudo cp $SNAPSHOT_DIR/db/atocore.db \
  /srv/storage/atocore/data/db/atocore.db
sudo chown 1000:1000 /srv/storage/atocore/data/db/atocore.db
sudo chmod 600 /srv/storage/atocore/data/db/atocore.db
```

The chown should match the gitea/atocore container user. Verify by
checking the existing perms before overwriting:

```bash
stat -c '%U:%G %a' /srv/storage/atocore/data/db/atocore.db
```

### Restore the project registry

```bash
if [ -f $SNAPSHOT_DIR/config/project-registry.json ]; then
  sudo cp $SNAPSHOT_DIR/config/project-registry.json \
    /srv/storage/atocore/config/project-registry.json
  sudo chown 1000:1000 /srv/storage/atocore/config/project-registry.json
  sudo chmod 644 /srv/storage/atocore/config/project-registry.json
fi
```

If the snapshot does not contain a registry, the current registry is
preserved. The pre-flight safety copy still gives you a recovery
path if you need to roll back.

### Restore the Chroma vector store (if it was in the snapshot)

```bash
if [ -d $SNAPSHOT_DIR/chroma ]; then
  # Move the current chroma dir aside as a safety copy
  sudo mv /srv/storage/atocore/data/chroma \
    /srv/storage/atocore/data/chroma.pre-restore-$PRESERVE_STAMP

  # Copy the snapshot in
  sudo cp -a $SNAPSHOT_DIR/chroma /srv/storage/atocore/data/chroma
  sudo chown -R 1000:1000 /srv/storage/atocore/data/chroma
fi
```

If the snapshot does not contain a Chroma dir, and the SQLite
restore would therefore leave the vector store and the SQL store
inconsistent (e.g. SQL has chunks the vector store doesn't), you
have two options:

- **Option 1: rebuild the vector store from source documents.** Run
  ingestion fresh after the SQL restore. This regenerates embeddings
  from the actual source files. Slow, but produces a perfectly
  consistent state.
- **Option 2: accept the inconsistency and live with stale-vector
  filtering.** The retriever already drops vector results whose SQL
  row no longer exists (the `_existing_chunk_ids` filter), so the
  inconsistency surfaces as missing results, not bad ones.

For an unplanned restore, Option 2 is the right immediate move. Then
schedule a fresh ingestion pass to rebuild the vector store
properly.
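
In code terms, Option 2 relies on a filter of this shape (a sketch
with hypothetical names; the actual `_existing_chunk_ids` logic
lives in the AtoCore retriever):

```python
def filter_stale_hits(vector_hits: list[dict],
                      existing_chunk_ids: set[str]) -> list[dict]:
    """Drop vector-store hits whose chunk no longer exists in SQL.

    After a DB-only restore the vector store may reference chunks
    the restored SQL database doesn't have; those hits are silently
    dropped, so the inconsistency shows up as missing results, not
    wrong ones.
    """
    return [h for h in vector_hits if h["chunk_id"] in existing_chunk_ids]
```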

### Restart AtoCore

```bash
docker compose up -d atocore
# or: sudo systemctl start atocore
```

### Post-restore verification

```bash
# 1. Service is healthy
curl -fsS http://dalidou:8100/health | jq .

# 2. Stats look right
curl -fsS http://dalidou:8100/stats | jq .

# 3. Project registry loads
curl -fsS http://dalidou:8100/projects | jq '.projects | length'

# 4. A known-good context query returns non-empty results
curl -fsS -X POST http://dalidou:8100/context/build \
  -H "Content-Type: application/json" \
  -d '{"prompt": "what is p05 about", "project": "p05-interferometer"}' | jq '.chunks_used'
```

If any of these checks fail, the restore is bad. Roll back using the
pre-restore safety copy:

```bash
docker compose stop atocore
sudo cp /srv/storage/atocore/backups/pre-restore-$PRESERVE_STAMP.db \
  /srv/storage/atocore/data/db/atocore.db
sudo cp /srv/storage/atocore/backups/pre-restore-$PRESERVE_STAMP.registry.json \
  /srv/storage/atocore/config/project-registry.json 2>/dev/null || true
# If you also restored chroma:
sudo rm -rf /srv/storage/atocore/data/chroma
sudo mv /srv/storage/atocore/data/chroma.pre-restore-$PRESERVE_STAMP \
  /srv/storage/atocore/data/chroma
docker compose up -d atocore
```
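
The four checks fold into one pass/fail helper (a sketch; it takes
the already-fetched JSON responses rather than doing HTTP itself,
and while `projects` and `chunks_used` come from the commands above,
the `status` field and the problem strings are assumptions):

```python
def post_restore_problems(health: dict, stats: dict,
                          projects: dict, context: dict) -> list[str]:
    """Return a list of problems; an empty list means the restore looks good."""
    problems = []
    if health.get("status") != "ok":          # /health response
        problems.append("service_unhealthy")
    if not stats:                             # /stats response
        problems.append("stats_empty")
    if len(projects.get("projects", [])) == 0:  # /projects response
        problems.append("no_projects")
    if context.get("chunks_used", 0) == 0:    # /context/build response
        problems.append("context_query_empty")
    return problems
```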

## Retention policy

- **Last 7 daily backups**: kept verbatim
- **Last 4 weekly backups** (Sunday): kept verbatim
- **Last 6 monthly backups** (1st of month): kept verbatim
- **Anything older**: deleted

The retention job is **not yet implemented** and is tracked as a
follow-up. Until then, the snapshots directory grows monotonically.
A simple cron-based cleanup script is the next step:

```cron
0 4 * * * /srv/storage/atocore/scripts/cleanup-old-backups.sh
```
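
The retention rules above reduce to a small selection function; a
sketch of the decision the cleanup script would need to make
(`stamps_to_keep` is a hypothetical helper, not yet implemented):

```python
from datetime import datetime

def stamps_to_keep(stamps: list[str]) -> set[str]:
    """Apply the retention policy to a list of YYYYMMDDTHHMMSSZ stamps."""
    parsed = sorted((datetime.strptime(s, "%Y%m%dT%H%M%SZ"), s) for s in stamps)
    keep = {s for _, s in parsed[-7:]}             # last 7 daily
    sundays = [s for d, s in parsed if d.weekday() == 6]
    keep.update(sundays[-4:])                      # last 4 weekly (Sunday)
    firsts = [s for d, s in parsed if d.day == 1]
    keep.update(firsts[-6:])                       # last 6 monthly (1st of month)
    return keep                                    # everything else is deletable
```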

## Drill schedule

A backup that has never been restored is theoretical. The schedule:

- **At least once per quarter**, perform a full restore drill on a
  staging environment (or a temporary container with a separate data
  dir) and verify the post-restore checks pass.
- **After every breaking schema migration**, perform a restore drill
  to confirm the migration is reversible.
- **After any incident** that touched the storage layer (the EXDEV
  bug from April 2026 is a good example), confirm the next backup
  validates clean.

## Common failure modes and what to do about them

| Symptom | Likely cause | Action |
|---|---|---|
| `db_integrity_check_failed` on validation | SQLite snapshot copied while a write was in progress, or disk corruption | Take a fresh backup and validate again. If it fails twice, suspect the underlying disk. |
| `registry_invalid_json` | Registry was being edited at backup time | Take a fresh backup. The registry is small, so this is cheap. |
| `chroma_snapshot_missing` after a restore | Snapshot was DB-only and the restore didn't move the existing chroma dir | Either rebuild via fresh ingestion or restore an older snapshot that includes Chroma. |
| Service won't start after restore | Permissions wrong on the restored files | Re-run `chown 1000:1000` (or whatever the gitea/atocore container user is) on the data dir. |
| `/stats` returns 0 documents after restore | The SQL store was restored but the source paths in `source_documents` don't match the current Dalidou paths | This means the backup came from a different deployment. Don't trust this restore — it's pulling from the wrong layout. |

## Open follow-ups (not yet implemented)

1. **Retention cleanup script**: see the cron entry above.
2. **Off-Dalidou backup target**: currently snapshots live on the
   same disk as the live data. A real disaster-recovery story needs
   at least one snapshot on a different physical machine. The
   simplest first step is a periodic `rsync` to the user's laptop or
   to another server.
3. **Backup encryption**: snapshots contain raw SQLite and JSON.
   Consider age/gpg encryption if backups will be shipped off-site.
4. **Automatic post-backup validation**: today the validator must be
   invoked manually. The `create_runtime_backup` function should
   call `validate_backup` on its own output and refuse to declare
   success if validation fails.
5. **Chroma backup is currently a full directory copy** every time.
   For large vector stores this gets expensive. A future improvement
   would be incremental snapshots via filesystem-level snapshotting
   (LVM, btrfs, ZFS).

## Quickstart cheat sheet

```bash
# Daily backup (DB + registry only — fast)
curl -fsS -X POST http://dalidou:8100/admin/backup \
  -H "Content-Type: application/json" -d '{}'

# Weekly backup (DB + registry + Chroma — slower, holds ingestion lock)
curl -fsS -X POST http://dalidou:8100/admin/backup \
  -H "Content-Type: application/json" -d '{"include_chroma": true}'

# List backups
curl -fsS http://dalidou:8100/admin/backup | jq '.backups[].stamp'

# Validate the most recent backup
LATEST=$(curl -fsS http://dalidou:8100/admin/backup | jq -r '.backups[-1].stamp')
curl -fsS http://dalidou:8100/admin/backup/$LATEST/validate | jq .

# Full restore — see the "Restore procedure" section above
```