
Anto01 1a8fdf4225 fix: chroma restore bind-mount bug + consolidate docs

Two fixes from the 2026-04-09 first real restore drill on Dalidou,
plus the long-overdue doc consolidation I should have done when I
added the drill runbook instead of creating a duplicate.

## Chroma restore bind-mount bug (drill finding)

src/atocore/ops/backup.py: restore_runtime_backup() used to call
shutil.rmtree(dst_chroma) before copying the snapshot back. In the
Dockerized Dalidou deployment the chroma dir is a bind-mounted
volume. A mount point cannot be unlinked, so rmtree raises
  OSError [Errno 16] Device or resource busy
and the restore silently fails to touch Chroma. This bit the first
real drill; the operator worked around it with --no-chroma plus a
manual cp -a.

Fix: clear the destination's CONTENTS (iterdir + rmtree/unlink per
child) and use copytree(dirs_exist_ok=True) so the mount point
itself is never touched. Equivalent semantics, bind-mount-safe.
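
A minimal sketch of the contents-only replacement described above. It
assumes nothing about the real restore_runtime_backup() beyond that
description; replace_dir_contents is a hypothetical helper name, not
the function in backup.py.

```python
import shutil
from pathlib import Path


def replace_dir_contents(src: Path, dst: Path) -> None:
    """Make dst mirror src without ever unlinking dst itself.

    Hypothetical helper for illustration only.
    """
    dst.mkdir(parents=True, exist_ok=True)
    # Remove each child individually; dst (possibly a mount point) stays put.
    for child in dst.iterdir():
        if child.is_dir() and not child.is_symlink():
            shutil.rmtree(child)
        else:
            child.unlink()
    # dirs_exist_ok=True (Python 3.8+) lets copytree write into the existing dst.
    shutil.copytree(src, dst, dirs_exist_ok=True)
```

Because only the children are removed, the directory that Docker
bind-mounted is never the target of an unlink or rmdir call.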

Regression test:
tests/test_backup.py::test_restore_chroma_does_not_unlink_destination_directory
captures Path.stat().st_ino of the dest dir before and after
restore and asserts they match. That's the same invariant a
bind-mounted chroma dir enforces — if the inode changed, the
mount would have failed. 11/11 backup tests now pass.
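
The inode check can be sketched roughly as below. This is not the
actual test body from tests/test_backup.py; inode_is_stable and its
restore parameter are hypothetical names used only to show the shape
of the assertion.

```python
from pathlib import Path
from typing import Callable


def inode_is_stable(
    restore: Callable[[Path, Path], None], src: Path, dst: Path
) -> bool:
    """Return True if dst is still the same directory inode after restore."""
    before = dst.stat().st_ino
    restore(src, dst)  # any callable that restores src into dst
    return dst.stat().st_ino == before
```

A test would call this with the restore routine under test and assert
that the result is True.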

## Doc consolidation

docs/backup-restore-drill.md existed as a duplicate of the
authoritative docs/backup-restore-procedure.md. When I added the
drill runbook in commit 3362080 I wrote it from scratch instead of
updating the existing procedure — bad doc hygiene on a project
that's literally about being a context engine.

- Deleted docs/backup-restore-drill.md
- Folded its contents into docs/backup-restore-procedure.md:
  - Replaced the manual sudo cp restore sequence with the new
    `python -m atocore.ops.backup restore <STAMP>
    --confirm-service-stopped` CLI
  - Added the one-shot docker compose run pattern for running
    restore inside a container that reuses the live volume mounts
  - Documented the --no-pre-snapshot / --no-chroma / --chroma flags
  - New "Chroma restore and bind-mounted volumes" subsection
    explaining the bug and the regression test that protects the fix
  - New "Restore drill" subsection with three levels (unit tests,
    module round-trip, live Dalidou drill) and the cadence list
  - Failure-mode table gained four entries: restored_integrity_ok,
    Device-or-resource-busy, drill marker still present,
    chroma_snapshot_missing
  - "Open follow-ups" struck the restore_runtime_backup item (done)
    and added a "Done (historical)" note referencing 2026-04-09
  - Quickstart cheat sheet now has a full drill one-liner using
    memory_type=episodic (the 2026-04-09 drill found the runbook's
    memory_type=note was invalid — the valid set is identity,
    preference, project, episodic, knowledge, adaptation)
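
The memory_type mistake the drill caught could be guarded with a
check like the one below. This is a sketch only: the six values come
from the list above, but VALID_MEMORY_TYPES and check_memory_type are
hypothetical names, and where AtoCore actually validates this field
is not shown here.

```python
VALID_MEMORY_TYPES = frozenset(
    {"identity", "preference", "project", "episodic", "knowledge", "adaptation"}
)


def check_memory_type(value: str) -> str:
    """Return value unchanged if it is a valid memory type, else raise."""
    if value not in VALID_MEMORY_TYPES:
        allowed = ", ".join(sorted(VALID_MEMORY_TYPES))
        raise ValueError(f"invalid memory_type {value!r}; expected one of: {allowed}")
    return value
```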

## Status doc sync

Long overdue — I've been landing code without updating the
project's narrative state docs.

docs/current-state.md:
- "Reliability Baseline" now reflects: restore_runtime_backup is
  real with CLI, pre-restore safety snapshot, WAL cleanup,
  integrity check; live drill on 2026-04-09 surfaced and fixed
  Chroma bind-mount bug; deploy provenance via /health build_sha;
  deploy.sh self-update re-exec guard
- "Immediate Next Focus" reshuffled: drill re-run (priority 1) and
  auto-capture (priority 2) are now ahead of retrieval quality work,
  reflecting the updated unblock sequence

docs/next-steps.md:
- New item 1: re-run the drill with chroma working end-to-end
- New item 2: auto-capture conservative mode (Stop hook)
- Old item 7 rewritten as item 9 listing what's DONE
  (create/list/validate/restore, admin/backup endpoint with
  include_chroma, /health provenance, self-update guard,
  procedure doc with failure modes) and what's still pending
  (retention cleanup, off-Dalidou target, auto-validation)

## Test count

226 passing (was 225; +1 new inode-stability regression test).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-09 09:13:21 -04:00


AtoCore Next Steps

Current Position

AtoCore now has:

canonical runtime and machine storage on Dalidou
separated source and machine-data boundaries
initial self-knowledge ingested into the live instance
trusted project-state entries for AtoCore itself
a first read-only OpenClaw integration path on the T420
a first real active-project corpus batch for:
- p04-gigabit
- p05-interferometer
- p06-polisher

This working list should be read alongside:

master-plan-status.md

Immediate Next Steps

1. Re-run the backup/restore drill on Dalidou with the Chroma bind-mount fix in place
   - the 2026-04-09 drill was a PARTIAL PASS: db restore + marker reversal worked cleanly, but the Chroma step failed with OSError [Errno 16] Device or resource busy because shutil.rmtree cannot unlink a Docker bind-mounted volume
   - fix landed immediately after: restore_runtime_backup() now clears the destination's CONTENTS and uses copytree(dirs_exist_ok=True), and the regression test test_restore_chroma_does_not_unlink_destination_directory asserts the destination inode is stable
   - need a green end-to-end run with --chroma actually working in-container before enabling write-path automation
2. Turn on auto-capture of Claude Code sessions once the drill re-run is clean
   - conservative mode: Stop hook posts to /interactions, no auto-extraction into review queue without review cadence in place
3. Use the T420 atocore-context skill and the new organic routing layer in real OpenClaw workflows
   - confirm auto-context feels natural
   - confirm project inference is good enough in practice
   - confirm the fail-open behavior remains acceptable in practice
4. Review retrieval quality after the first real project ingestion batch
   - check whether the top hits are useful
   - check whether trusted project state remains dominant
   - reduce cross-project competition and prompt ambiguity where needed
   - use debug-context to inspect the exact last AtoCore supplement
5. Treat the active-project full markdown/text wave as complete
   - p04-gigabit
   - p05-interferometer
   - p06-polisher
6. Define a cleaner source refresh model
   - make the difference between source truth, staged inputs, and machine store explicit
   - move toward a project source registry and refresh workflow
   - foundation now exists via project registry + per-project refresh API
   - registration policy + template + proposal + approved registration are now the normal path for new projects
7. Move to Wave 2 trusted-operational ingestion
   - curated dashboards
   - decision logs
   - milestone/current-status views
   - operational truth, not just raw project notes
8. Integrate the new engineering architecture docs into active planning, not immediate schema code
   - keep docs/architecture/engineering-knowledge-hybrid-architecture.md as the target layer model
   - keep docs/architecture/engineering-ontology-v1.md as the V1 structured-domain target
   - do not start entity/relationship persistence until the ingestion, retrieval, registry, and backup baseline feels boring and stable
9. Finish the boring operations baseline around backup
   - retention policy cleanup script (snapshots dir grows monotonically today)
   - off-Dalidou backup target (at minimum an rsync to laptop or another host so a single-disk failure isn't terminal)
   - automatic post-backup validation (have create_runtime_backup call validate_backup on its own output and refuse to declare success if validation fails)
   - DONE in commits be40994 / 0382238 / 3362080 / this one:
     - create_runtime_backup + list_runtime_backups + validate_backup + restore_runtime_backup with CLI
     - POST /admin/backup with include_chroma=true under the ingestion lock
     - /health build_sha / build_time / build_branch provenance
     - deploy.sh self-update re-exec guard + build_sha drift verification
     - live drill procedure in docs/backup-restore-procedure.md with failure-mode table and the memory_type=episodic marker pattern from the 2026-04-09 drill
10. Keep deeper automatic runtime integration modest until the organic read-only model has proven value
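
The pending "automatic post-backup validation" item in the backup baseline above could take roughly the following shape. This is a sketch under stated assumptions: the real signatures of create_runtime_backup and validate_backup are not shown in this document, so they are passed in as plain callables, and create_validated_backup and BackupValidationError are hypothetical names.

```python
from pathlib import Path
from typing import Callable


class BackupValidationError(RuntimeError):
    """Raised when a freshly created backup does not pass validation."""


def create_validated_backup(
    create: Callable[[], Path],
    validate: Callable[[Path], bool],
) -> Path:
    """Create a backup, validate it, and only then report success."""
    snapshot = create()
    # Refuse to declare success if the new snapshot fails its own check.
    if not validate(snapshot):
        raise BackupValidationError(f"backup at {snapshot} failed validation")
    return snapshot
```

Injecting the two steps as callables keeps the sketch independent of the real module layout; in practice the check would presumably live inside the backup routine itself.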

Trusted State Status

The first conservative trusted-state promotion pass is now complete for:

p04-gigabit
p05-interferometer
p06-polisher

Each project now has a small set of stable entries covering:

summary
architecture or boundary decision
key constraints
current next focus

This materially improves context/build quality for project-hinted prompts.

Recommended Near-Term Project Work

The active-project full markdown/text wave is now in.

The near-term work is now:

strengthen retrieval quality
promote or refine trusted operational truth where the broad corpus is now too noisy
keep trusted project state concise and high-confidence
widen only through named ingestion waves

Recommended Next Wave Inputs

Wave 2 should emphasize trusted operational truth, not bulk historical notes.

P04:

current status dashboard
current selected design path
current frame interface truth
current next-step milestone view

P05:

selected vendor path
current error-budget baseline
current architecture freeze or open decisions
current procurement / next-action view

P06:

current system map
current shared contracts baseline
current calibration procedure truth
current July / proving roadmap view

Deferred On Purpose

automatic write-back from OpenClaw into AtoCore
automatic memory promotion
reflection loop integration
replacing OpenClaw's own memory system
syncing the live machine DB between machines

Success Criteria For The Next Batch

The next batch is successful if:

OpenClaw can use AtoCore naturally when context is needed
OpenClaw can infer registered projects and call AtoCore organically for project-knowledge questions
the active-project full corpus wave can be inspected and used concretely through auto-context, context-build, and debug-context
OpenClaw can also register a new project cleanly before refreshing it
existing project registrations can be refined safely before refresh when the staged source set evolves
AtoCore answers correctly for the active project set
retrieval surfaces the seeded project docs instead of mostly AtoCore meta-docs
trusted project state remains concise and high confidence
project ingestion remains controlled rather than noisy
the canonical Dalidou instance stays stable

Long-Run Goal

The long-run target is:

continue working normally inside PKM project stacks and Gitea repos
let OpenClaw keep its own memory and runtime behavior
let AtoCore supplement LLM work with stronger trusted context, retrieval, and context assembly

That means AtoCore should behave like a durable external context engine and machine-memory layer, not a replacement for normal repo work or OpenClaw memory.
