Two fixes from the 2026-04-09 first real restore drill on Dalidou,
plus the long-overdue doc consolidation I should have done when I
added the drill runbook instead of creating a duplicate.
## Chroma restore bind-mount bug (drill finding)
src/atocore/ops/backup.py: restore_runtime_backup() used to call
shutil.rmtree(dst_chroma) before copying the snapshot back. In the
Dockerized Dalidou deployment the chroma dir is a bind-mounted
volume, and a mount point can't be unlinked, so rmtree raises
    OSError [Errno 16] Device or resource busy
and the restore silently fails to touch Chroma. This bit the first
real drill; the operator worked around it with --no-chroma plus a
manual cp -a.
Fix: clear the destination's CONTENTS (iterdir + rmtree/unlink per
child) and use copytree(dirs_exist_ok=True) so the mount point
itself is never touched. Equivalent semantics, bind-mount-safe.
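
A minimal sketch of that pattern, for reference. The helper name `replace_dir_contents` is hypothetical and this is not the actual body of restore_runtime_backup(); it only illustrates clearing children and copying into the existing directory:

```python
import shutil
from pathlib import Path


def replace_dir_contents(src: Path, dst: Path) -> None:
    """Make dst's contents match src without ever removing dst itself."""
    dst.mkdir(parents=True, exist_ok=True)
    # Delete each child rather than the directory, so a mount point survives.
    for child in dst.iterdir():
        if child.is_dir() and not child.is_symlink():
            shutil.rmtree(child)
        else:
            child.unlink()
    # dirs_exist_ok=True (Python 3.8+) lets copytree write into the existing dir.
    shutil.copytree(src, dst, dirs_exist_ok=True)
```

Symlinked children are unlinked rather than followed, so the sketch never deletes anything outside the destination.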
Regression test:
tests/test_backup.py::test_restore_chroma_does_not_unlink_destination_directory
captures Path.stat().st_ino of the dest dir before and after
restore and asserts they match. That's the same invariant a
bind-mounted chroma dir enforces — if the inode changed, the
mount would have failed. 11/11 backup tests now pass.
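
The inode check itself can be sketched roughly as below. `inode_unchanged` is a hypothetical helper, not the code in tests/test_backup.py, and the restore call is passed in as an opaque callable:

```python
from pathlib import Path
from typing import Callable


def inode_unchanged(directory: Path, action: Callable[[], None]) -> bool:
    """Return True if `directory` keeps its inode number across `action`."""
    before = directory.stat().st_ino
    action()  # e.g. a closure that runs the restore against `directory`
    return directory.stat().st_ino == before
```

A matching inode is a proxy rather than proof, since a filesystem may hand the same inode number to a recreated directory.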
## Doc consolidation
docs/backup-restore-drill.md existed as a duplicate of the
authoritative docs/backup-restore-procedure.md. When I added the
drill runbook in commit 3362080 I wrote it from scratch instead of
updating the existing procedure — bad doc hygiene on a project
that's literally about being a context engine.
- Deleted docs/backup-restore-drill.md
- Folded its contents into docs/backup-restore-procedure.md:
  - Replaced the manual sudo cp restore sequence with the new CLI:
    `python -m atocore.ops.backup restore <STAMP> --confirm-service-stopped`
  - Added the one-shot docker compose run pattern for running
    restore inside a container that reuses the live volume mounts
  - Documented the --no-pre-snapshot / --no-chroma / --chroma flags
  - New "Chroma restore and bind-mounted volumes" subsection
    explaining the bug and the regression test that protects the fix
  - New "Restore drill" subsection with three levels (unit tests,
    module round-trip, live Dalidou drill) and the cadence list
  - Failure-mode table gained four entries: restored_integrity_ok,
    Device-or-resource-busy, drill marker still present,
    chroma_snapshot_missing
  - "Open follow-ups" struck the restore_runtime_backup item (done)
    and added a "Done (historical)" note referencing 2026-04-09
  - Quickstart cheat sheet now has a full drill one-liner using
    memory_type=episodic (the 2026-04-09 drill found the runbook's
    memory_type=note was invalid; the valid set is identity,
    preference, project, episodic, knowledge, adaptation)
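
For illustration only, a guard like the following would have caught the invalid marker type before the drill. The six values come from the list above; the constant and function names are hypothetical and not AtoCore's actual validation code:

```python
VALID_MEMORY_TYPES = frozenset(
    {"identity", "preference", "project", "episodic", "knowledge", "adaptation"}
)


def check_memory_type(value: str) -> str:
    """Return `value` unchanged if it is a valid memory type, else raise."""
    if value not in VALID_MEMORY_TYPES:
        allowed = ", ".join(sorted(VALID_MEMORY_TYPES))
        raise ValueError(f"invalid memory_type {value!r}; expected one of: {allowed}")
    return value
```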
## Status doc sync
Long overdue — I've been landing code without updating the
project's narrative state docs.
docs/current-state.md:
- "Reliability Baseline" now reflects: restore_runtime_backup is
real with CLI, pre-restore safety snapshot, WAL cleanup,
integrity check; live drill on 2026-04-09 surfaced and fixed
Chroma bind-mount bug; deploy provenance via /health build_sha;
deploy.sh self-update re-exec guard
- "Immediate Next Focus" reshuffled: drill re-run (priority 1) and
auto-capture (priority 2) are now ahead of retrieval quality work,
reflecting the updated unblock sequence
docs/next-steps.md:
- New item 1: re-run the drill with chroma working end-to-end
- New item 2: auto-capture conservative mode (Stop hook)
- Old item 7 rewritten as item 9 listing what's DONE
(create/list/validate/restore, admin/backup endpoint with
include_chroma, /health provenance, self-update guard,
procedure doc with failure modes) and what's still pending
(retention cleanup, off-Dalidou target, auto-validation)
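
A rough sketch of what a build_sha drift check could look like, assuming the /health payload has already been parsed into a dict with a `build_sha` key. The function name and the prefix matching of abbreviated SHAs are my assumptions, not what deploy.sh actually does:

```python
from typing import Any, Mapping


def build_sha_matches(health: Mapping[str, Any], expected_sha: str) -> bool:
    """Compare the SHA reported by /health with the SHA we meant to deploy."""
    reported = str(health.get("build_sha") or "").strip().lower()
    expected = expected_sha.strip().lower()
    if not reported or not expected:
        return False  # missing provenance counts as drift
    # Assumption: one side may be abbreviated, so accept a prefix match.
    return reported.startswith(expected) or expected.startswith(reported)
```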
## Test count
226 passing (was 225 + 1 new inode-stability regression test).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
AtoCore Next Steps
Current Position
AtoCore now has:
- canonical runtime and machine storage on Dalidou
- separated source and machine-data boundaries
- initial self-knowledge ingested into the live instance
- trusted project-state entries for AtoCore itself
- a first read-only OpenClaw integration path on the T420
- a first real active-project corpus batch for:
  - p04-gigabit
  - p05-interferometer
  - p06-polisher
This working list should be read alongside:
Immediate Next Steps
- Re-run the backup/restore drill on Dalidou with the Chroma bind-mount fix in place
  - the 2026-04-09 drill was a PARTIAL PASS: db restore + marker reversal worked cleanly, but the Chroma step failed with `OSError [Errno 16] Device or resource busy` because `shutil.rmtree` cannot unlink a Docker bind-mounted volume
  - fix landed immediately after: `restore_runtime_backup()` now clears the destination's CONTENTS and uses `copytree(dirs_exist_ok=True)`, and the regression test `test_restore_chroma_does_not_unlink_destination_directory` asserts the destination inode is stable
  - need a green end-to-end run with `--chroma` actually working in-container before enabling write-path automation
- Turn on auto-capture of Claude Code sessions once the drill re-run is clean
  - conservative mode: Stop hook posts to `/interactions`, no auto-extraction into review queue without review cadence in place
- Use the T420 `atocore-context` skill and the new organic routing layer in real OpenClaw workflows
  - confirm `auto-context` feels natural
  - confirm project inference is good enough in practice
  - confirm the fail-open behavior remains acceptable in practice
- Review retrieval quality after the first real project ingestion batch
  - check whether the top hits are useful
  - check whether trusted project state remains dominant
  - reduce cross-project competition and prompt ambiguity where needed
  - use `debug-context` to inspect the exact last AtoCore supplement
- Treat the active-project full markdown/text wave as complete
  - p04-gigabit
  - p05-interferometer
  - p06-polisher
- Define a cleaner source refresh model
  - make the difference between source truth, staged inputs, and machine store explicit
  - move toward a project source registry and refresh workflow
  - foundation now exists via project registry + per-project refresh API
  - registration policy + template + proposal + approved registration are now the normal path for new projects
- Move to Wave 2 trusted-operational ingestion
  - curated dashboards
  - decision logs
  - milestone/current-status views
  - operational truth, not just raw project notes
- Integrate the new engineering architecture docs into active planning, not immediate schema code
  - keep `docs/architecture/engineering-knowledge-hybrid-architecture.md` as the target layer model
  - keep `docs/architecture/engineering-ontology-v1.md` as the V1 structured-domain target
  - do not start entity/relationship persistence until the ingestion, retrieval, registry, and backup baseline feels boring and stable
- Finish the boring operations baseline around backup
  - retention policy cleanup script (snapshots dir grows monotonically today)
  - off-Dalidou backup target (at minimum an rsync to laptop or another host so a single-disk failure isn't terminal)
  - automatic post-backup validation (have `create_runtime_backup` call `validate_backup` on its own output and refuse to declare success if validation fails)
  - DONE in commits be40994 / 0382238 / 3362080 / this one:
    - `create_runtime_backup` + `list_runtime_backups` + `validate_backup` + `restore_runtime_backup` with CLI
    - `POST /admin/backup` with `include_chroma=true` under the ingestion lock
    - `/health` build_sha / build_time / build_branch provenance
    - `deploy.sh` self-update re-exec guard + build_sha drift verification
    - live drill procedure in `docs/backup-restore-procedure.md` with failure-mode table and the memory_type=episodic marker pattern from the 2026-04-09 drill
- Keep deeper automatic runtime integration modest until the organic read-only model has proven value
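
The pending retention cleanup from the backup baseline item above could start from something like the sketch below. It assumes each snapshot is a directory directly under the snapshots dir and that stamp names sort chronologically; neither the function name nor those assumptions are confirmed by the current code:

```python
import shutil
from pathlib import Path


def prune_snapshots(snapshots_dir: Path, keep: int, dry_run: bool = True) -> list[Path]:
    """Return the snapshot dirs beyond the newest `keep`; delete them unless dry_run."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    # Assumption: directory names are timestamps that sort oldest-first.
    stamps = sorted((p for p in snapshots_dir.iterdir() if p.is_dir()), key=lambda p: p.name)
    doomed = stamps[:-keep]
    if not dry_run:
        for path in doomed:
            shutil.rmtree(path)
    return doomed
```

Defaulting to `dry_run=True` keeps the first runs inspectable before anything is deleted.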
Trusted State Status
The first conservative trusted-state promotion pass is now complete for:
- p04-gigabit
- p05-interferometer
- p06-polisher
Each project now has a small set of stable entries covering:
- summary
- architecture or boundary decision
- key constraints
- current next focus
This materially improves context/build quality for project-hinted prompts.
Recommended Near-Term Project Work
The active-project full markdown/text wave is now in.
The near-term work is now:
- strengthen retrieval quality
- promote or refine trusted operational truth where the broad corpus is now too noisy
- keep trusted project state concise and high-confidence
- widen only through named ingestion waves
Recommended Next Wave Inputs
Wave 2 should emphasize trusted operational truth, not bulk historical notes.
P04:
- current status dashboard
- current selected design path
- current frame interface truth
- current next-step milestone view
P05:
- selected vendor path
- current error-budget baseline
- current architecture freeze or open decisions
- current procurement / next-action view
P06:
- current system map
- current shared contracts baseline
- current calibration procedure truth
- current July / proving roadmap view
Deferred On Purpose
- automatic write-back from OpenClaw into AtoCore
- automatic memory promotion
- reflection loop integration
- replacing OpenClaw's own memory system
- syncing the live machine DB between machines
Success Criteria For The Next Batch
The next batch is successful if:
- OpenClaw can use AtoCore naturally when context is needed
- OpenClaw can infer registered projects and call AtoCore organically for project-knowledge questions
- the active-project full corpus wave can be inspected and used concretely through `auto-context`, `context-build`, and `debug-context`
- OpenClaw can also register a new project cleanly before refreshing it
- existing project registrations can be refined safely before refresh when the staged source set evolves
- AtoCore answers correctly for the active project set
- retrieval surfaces the seeded project docs instead of mostly AtoCore meta-docs
- trusted project state remains concise and high confidence
- project ingestion remains controlled rather than noisy
- the canonical Dalidou instance stays stable
Long-Run Goal
The long-run target is:
- continue working normally inside PKM project stacks and Gitea repos
- let OpenClaw keep its own memory and runtime behavior
- let AtoCore supplement LLM work with stronger trusted context, retrieval, and context assembly
That means AtoCore should behave like a durable external context engine and machine-memory layer, not a replacement for normal repo work or OpenClaw memory.