Harden runtime and add backup foundation

2026-04-06 10:15:00 -04:00
parent 9715fe3143
commit c9757e313a
11 changed files with 331 additions and 10 deletions
--- a/docs/backup-strategy.md
+++ b/docs/backup-strategy.md
@@ -0,0 +1,80 @@
+# AtoCore Backup Strategy
+
+## Purpose
+
+This document describes the current backup baseline for the Dalidou-hosted
+AtoCore machine store.
+
+The immediate goal is not full disaster-proof automation yet. The goal is to
+have one safe, repeatable way to snapshot the most important writable state.
+
+## Current Backup Baseline
+
+Today, the safest hot-backup target is:
+
+- SQLite machine database
+- project registry JSON
+- backup metadata describing what was captured
+
+This is now supported by:
+
+- `python -m atocore.ops.backup`
+
+## What The Script Captures
+
+The backup command creates a timestamped snapshot under:
+
+- `ATOCORE_BACKUP_DIR/snapshots/<timestamp>/`
+
+It currently writes:
+
+- `db/atocore.db`
+  - created with SQLite's backup API
+- `config/project-registry.json`
+  - copied if it exists
+- `backup-metadata.json`
+  - timestamp, paths, and backup notes
+
+## What It Does Not Yet Capture
+
+The current script does not hot-backup Chroma.
+
+That is intentional.
+
+For now, Chroma should be treated as one of:
+
+- rebuildable derived state
+- or something that needs a deliberate cold snapshot/export workflow
+
+Until that workflow exists, do not rely on ad hoc live file copies of the
+vector store while the service is actively writing.
+
+## Dalidou Use
+
+On Dalidou, the canonical machine paths are:
+
+- DB:
+  - `/srv/storage/atocore/data/db/atocore.db`
+- registry:
+  - `/srv/storage/atocore/config/project-registry.json`
+- backups:
+  - `/srv/storage/atocore/backups`
+
+So a normal backup run should happen on Dalidou itself, not from another
+machine.
+
+## Next Backup Improvements
+
+1. decide Chroma policy clearly
+   - rebuild vs cold snapshot vs export
+2. add a simple scheduled backup routine on Dalidou
+3. add retention policy for old snapshots
+4. optionally add a restore validation check
+
+## Healthy Rule
+
+Do not design around syncing the live machine DB/vector store between machines.
+
+Back up the canonical Dalidou state.
+Restore from Dalidou state.
+Keep OpenClaw as a client of AtoCore, not a storage peer.
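
A minimal sketch of the snapshot layout described in `docs/backup-strategy.md`, included for illustration only. It is not the contents of `atocore.ops.backup`, which this diff excerpt does not show. The function name `create_snapshot`, the timestamp format, and the metadata field names are assumptions. The only library behavior relied on is Python's `sqlite3.Connection.backup`, which wraps SQLite's online backup API.

```python
import json
import shutil
import sqlite3
from datetime import datetime, timezone
from pathlib import Path


def create_snapshot(db_path: Path, registry_path: Path, backup_dir: Path) -> Path:
    """Write one timestamped snapshot and return its directory (hypothetical helper)."""
    # Assumed timestamp format; the real script may use a different one.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    snapshot = backup_dir / "snapshots" / stamp
    (snapshot / "db").mkdir(parents=True)
    (snapshot / "config").mkdir()

    # Connection.backup() uses SQLite's online backup API, so the copy is
    # consistent even if other connections write to the source meanwhile.
    src = sqlite3.connect(db_path)
    dst = sqlite3.connect(snapshot / "db" / "atocore.db")
    try:
        src.backup(dst)
    finally:
        dst.close()
        src.close()

    # The registry is copied only if it exists, as the document states.
    registry_copied = registry_path.exists()
    if registry_copied:
        shutil.copy2(registry_path, snapshot / "config" / "project-registry.json")

    # Assumed metadata fields covering "timestamp, paths, and backup notes".
    metadata = {
        "timestamp": stamp,
        "db_source": str(db_path),
        "registry_source": str(registry_path),
        "registry_copied": registry_copied,
        "notes": "Chroma is not included in this snapshot.",
    }
    (snapshot / "backup-metadata.json").write_text(
        json.dumps(metadata, indent=2), encoding="utf-8"
    )
    return snapshot
```

On Dalidou this would be called with the canonical paths listed in the document, for example `create_snapshot(Path("/srv/storage/atocore/data/db/atocore.db"), Path("/srv/storage/atocore/config/project-registry.json"), Path("/srv/storage/atocore/backups"))`.
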
--- a/docs/current-state.md
+++ b/docs/current-state.md
@@ -39,6 +39,11 @@ now includes a first curated ingestion batch for the active projects.
 - context builder
 - API routes for query, context, health, and source status
 - project registry and per-project refresh foundation
+- project registration lifecycle:
+  - template
+  - proposal preview
+  - approved registration
+  - refresh
 - env-driven storage and deployment paths
 - Dalidou Docker deployment foundation
 - initial AtoCore self-knowledge corpus ingested on Dalidou
@@ -64,6 +69,11 @@ The service and storage foundation are live on Dalidou.

 The machine-data host is real and canonical.

+The project registry is now also persisted in a canonical mounted config path on
+Dalidou:
+
+- `/srv/storage/atocore/config/project-registry.json`
+
 The content corpus is partially populated now.

 The Dalidou instance already contains:
@@ -88,9 +98,9 @@ The Dalidou instance already contains:
 Current live stats after the latest documentation sync and active-project ingest
 passes:

-- `source_documents`: 34
-- `source_chunks`: 551
-- `vectors`: 551
+- `source_documents`: 35
+- `source_chunks`: 560
+- `vectors`: 560

 The broader long-term corpus is still not fully populated yet. Wider project and
 vault ingestion remains a deliberate next step rather than something already
@@ -149,8 +159,28 @@ The source refresh model now has a concrete foundation in code:

 - a project registry file defines known project ids, aliases, and ingest roots
 - the API can list registered projects
+- the API can return a registration template
+- the API can preview a registration without mutating state
+- the API can persist an approved registration
 - the API can refresh one registered project at a time

+This lifecycle is now coherent end to end for normal use.
+
+## Reliability Baseline
+
+The runtime has now been hardened in a few practical ways:
+
+- SQLite connections use a configurable busy timeout
+- SQLite uses WAL mode to reduce transient lock contention under normal concurrent use
+- project registry writes are atomic file replacements rather than in-place rewrites
+- a first runtime backup path now exists for:
+  - SQLite
+  - project registry
+  - backup metadata
+
+This does not eliminate every concurrency edge, but it materially improves the
+current operational baseline.
+
 In `Trusted Project State`:

 - each active seeded project now has a conservative trusted-state set
@@ -167,7 +197,7 @@ This separation is healthy:

 ## Immediate Next Focus

-1. Use the new T420-side AtoCore skill in real OpenClaw workflows
+1. Use the new T420-side AtoCore skill and registration flow in real OpenClaw workflows
 2. Tighten retrieval quality for the newly seeded active projects
 3. Define the first broader AtoVault/AtoDrive ingestion batches
 4. Add backup/export strategy for Dalidou machine state
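
The Reliability Baseline items added to `docs/current-state.md` (busy timeout, WAL mode, atomic registry writes) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code changed in this commit. The helper names `connect` and `write_json_atomic` and the 5000 ms default are hypothetical.

```python
import json
import os
import sqlite3
import tempfile
from pathlib import Path


def connect(db_path: Path, busy_timeout_ms: int = 5000) -> sqlite3.Connection:
    """Open SQLite with a busy timeout and WAL journaling (hypothetical helper)."""
    # The sqlite3 `timeout` argument is in seconds and controls how long a
    # connection waits on a locked database before raising an error.
    conn = sqlite3.connect(db_path, timeout=busy_timeout_ms / 1000)
    # WAL lets readers proceed while a single writer is active.
    conn.execute("PRAGMA journal_mode=WAL")
    return conn


def write_json_atomic(path: Path, payload: dict) -> None:
    """Replace `path` in one step so readers never see a partial file."""
    # The temp file lives in the target directory so os.replace() stays on
    # one filesystem, which is what makes the rename atomic.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the temp file if anything failed before the replace.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Writing to a temporary file and then renaming means a crash mid-write leaves the previous registry intact, whereas an in-place rewrite could leave a truncated file behind.
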
--- a/docs/next-steps.md
+++ b/docs/next-steps.md
@@ -31,10 +31,12 @@ AtoCore now has:
     explicit
   - move toward a project source registry and refresh workflow
   - foundation now exists via project registry + per-project refresh API
-   - registration policy + template are now the next normal path for new projects
+   - registration policy + template + proposal + approved registration are now
+     the normal path for new projects
 5. Define backup and export procedures for Dalidou
-   - SQLite snapshot/backup strategy
+   - exercise the new SQLite + registry snapshot path on Dalidou
   - Chroma backup or rebuild policy
+   - retention and restore validation
 6. Keep deeper automatic runtime integration deferred until the read-only model
   has proven value

@@ -101,6 +103,7 @@ P06:
 The next batch is successful if:

 - OpenClaw can use AtoCore naturally when context is needed
+- OpenClaw can also register a new project cleanly before refreshing it
 - AtoCore answers correctly for the active project set
 - retrieval surfaces the seeded project docs instead of mostly AtoCore meta-docs
 - trusted project state remains concise and high confidence