ATOCore/docs/next-steps.md
Anto01 1a8fdf4225 fix: chroma restore bind-mount bug + consolidate docs
Two fixes from the 2026-04-09 first real restore drill on Dalidou,
plus the long-overdue doc consolidation I should have done when I
added the drill runbook instead of creating a duplicate.

## Chroma restore bind-mount bug (drill finding)

src/atocore/ops/backup.py: restore_runtime_backup() used to call
shutil.rmtree(dst_chroma) before copying the snapshot back. In the
Dockerized Dalidou deployment the chroma dir is a bind-mounted
volume. A mount point can't be unlinked, so rmtree raises
  OSError [Errno 16] Device or resource busy
and the restore silently fails to touch Chroma. This bit the first
real drill; the operator worked around it with --no-chroma plus a
manual cp -a.

Fix: clear the destination's CONTENTS (iterdir + rmtree/unlink per
child) and use copytree(dirs_exist_ok=True) so the mount point
itself is never touched. Equivalent semantics, bind-mount-safe.
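
A minimal sketch of that approach, for illustration only. The helper
name replace_dir_contents is hypothetical; the real logic lives inside
restore_runtime_backup() and may differ in detail.

```python
import shutil
from pathlib import Path


def replace_dir_contents(src: Path, dst: Path) -> None:
    """Make dst's contents match src without ever removing dst itself."""
    dst.mkdir(parents=True, exist_ok=True)
    # Remove each child of dst rather than dst, so a mount point survives.
    for child in dst.iterdir():
        if child.is_dir() and not child.is_symlink():
            shutil.rmtree(child)
        else:
            child.unlink()
    # dirs_exist_ok=True (Python 3.8+) lets copytree write into the existing dir.
    shutil.copytree(src, dst, dirs_exist_ok=True)
```

Symlinks are unlinked rather than followed, so a link inside the
destination never causes deletion outside it.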

Regression test:
tests/test_backup.py::test_restore_chroma_does_not_unlink_destination_directory
captures Path.stat().st_ino of the dest dir before and after
restore and asserts they match. That's the same invariant a
bind-mounted chroma dir enforces — if the inode changed, the
mount would have failed. 11/11 backup tests now pass.
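
The shape of that check, as a self-contained sketch. restore_dir is a
hypothetical stand-in for the Chroma step of restore_runtime_backup(),
and the file names are made up; the actual test in tests/test_backup.py
exercises the real function.

```python
import shutil
from pathlib import Path


def restore_dir(snapshot: Path, dst: Path) -> None:
    # Hypothetical stand-in: clear dst's children, then copy the snapshot in.
    for child in dst.iterdir():
        if child.is_dir() and not child.is_symlink():
            shutil.rmtree(child)
        else:
            child.unlink()
    shutil.copytree(snapshot, dst, dirs_exist_ok=True)


def test_restore_does_not_unlink_destination_directory(tmp_path: Path) -> None:
    snapshot = tmp_path / "snapshot"
    dst = tmp_path / "chroma"
    snapshot.mkdir()
    dst.mkdir()
    (snapshot / "index.bin").write_bytes(b"restored")
    (dst / "stale.bin").write_bytes(b"stale")

    inode_before = dst.stat().st_ino
    restore_dir(snapshot, dst)
    inode_after = dst.stat().st_ino

    # The directory object itself must survive the restore.
    assert inode_after == inode_before
    # The contents must still be fully replaced.
    assert (dst / "index.bin").read_bytes() == b"restored"
    assert not (dst / "stale.bin").exists()
```

One caveat: some filesystems can hand a freed inode number back out on
re-create, so a matching inode is a necessary signal rather than a
proof that the directory was never removed.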

## Doc consolidation

docs/backup-restore-drill.md existed as a duplicate of the
authoritative docs/backup-restore-procedure.md. When I added the
drill runbook in commit 3362080 I wrote it from scratch instead of
updating the existing procedure — bad doc hygiene on a project
that's literally about being a context engine.

- Deleted docs/backup-restore-drill.md
- Folded its contents into docs/backup-restore-procedure.md:
  - Replaced the manual sudo cp restore sequence with the new
    `python -m atocore.ops.backup restore <STAMP>
    --confirm-service-stopped` CLI
  - Added the one-shot docker compose run pattern for running
    restore inside a container that reuses the live volume mounts
  - Documented the --no-pre-snapshot / --no-chroma / --chroma flags
  - New "Chroma restore and bind-mounted volumes" subsection
    explaining the bug and the regression test that protects the fix
  - New "Restore drill" subsection with three levels (unit tests,
    module round-trip, live Dalidou drill) and the cadence list
  - Failure-mode table gained four entries: restored_integrity_ok,
    Device-or-resource-busy, drill marker still present,
    chroma_snapshot_missing
  - "Open follow-ups" struck the restore_runtime_backup item (done)
    and added a "Done (historical)" note referencing 2026-04-09
  - Quickstart cheat sheet now has a full drill one-liner using
    memory_type=episodic (the 2026-04-09 drill found the runbook's
    memory_type=note was invalid — the valid set is identity,
    preference, project, episodic, knowledge, adaptation)

## Status doc sync

Long overdue — I've been landing code without updating the
project's narrative state docs.

docs/current-state.md:
- "Reliability Baseline" now reflects: restore_runtime_backup is
  real with CLI, pre-restore safety snapshot, WAL cleanup,
  integrity check; live drill on 2026-04-09 surfaced and fixed
  Chroma bind-mount bug; deploy provenance via /health build_sha;
  deploy.sh self-update re-exec guard
- "Immediate Next Focus" reshuffled: drill re-run (priority 1) and
  auto-capture (priority 2) are now ahead of retrieval quality work,
  reflecting the updated unblock sequence

docs/next-steps.md:
- New item 1: re-run the drill with chroma working end-to-end
- New item 2: auto-capture conservative mode (Stop hook)
- Old item 7 rewritten as item 9 listing what's DONE
  (create/list/validate/restore, admin/backup endpoint with
  include_chroma, /health provenance, self-update guard,
  procedure doc with failure modes) and what's still pending
  (retention cleanup, off-Dalidou target, auto-validation)

## Test count

226 passing (225 previously, plus 1 new inode-stability regression test).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 09:13:21 -04:00


AtoCore Next Steps

Current Position

AtoCore now has:

  • canonical runtime and machine storage on Dalidou
  • separated source and machine-data boundaries
  • initial self-knowledge ingested into the live instance
  • trusted project-state entries for AtoCore itself
  • a first read-only OpenClaw integration path on the T420
  • a first real active-project corpus batch for:
    • p04-gigabit
    • p05-interferometer
    • p06-polisher

This working list should be read alongside:

Immediate Next Steps

  1. Re-run the backup/restore drill on Dalidou with the Chroma bind-mount fix in place
    • the 2026-04-09 drill was a PARTIAL PASS: DB restore and marker reversal worked cleanly, but the Chroma step failed with OSError [Errno 16] Device or resource busy because shutil.rmtree cannot remove the mount point of a Docker bind-mounted volume
    • fix landed immediately after: restore_runtime_backup() now clears the destination's CONTENTS and uses copytree(dirs_exist_ok=True), and the regression test test_restore_chroma_does_not_unlink_destination_directory asserts the destination inode is stable
    • need a green end-to-end run with --chroma actually working in-container before enabling write-path automation
  2. Turn on auto-capture of Claude Code sessions once the drill re-run is clean
    • conservative mode: the Stop hook posts to /interactions; no auto-extraction into the review queue until a review cadence is in place
  3. Use the T420 atocore-context skill and the new organic routing layer in real OpenClaw workflows
    • confirm auto-context feels natural
    • confirm project inference is good enough in practice
    • confirm the fail-open behavior remains acceptable in practice
  4. Review retrieval quality after the first real project ingestion batch
    • check whether the top hits are useful
    • check whether trusted project state remains dominant
    • reduce cross-project competition and prompt ambiguity where needed
    • use debug-context to inspect the exact last AtoCore supplement
  5. Treat the active-project full markdown/text wave as complete
    • p04-gigabit
    • p05-interferometer
    • p06-polisher
  6. Define a cleaner source refresh model
    • make the difference between source truth, staged inputs, and machine store explicit
    • move toward a project source registry and refresh workflow
    • foundation now exists via project registry + per-project refresh API
    • registration policy + template + proposal + approved registration are now the normal path for new projects
  7. Move to Wave 2 trusted-operational ingestion
    • curated dashboards
    • decision logs
    • milestone/current-status views
    • operational truth, not just raw project notes
  8. Integrate the new engineering architecture docs into active planning, not immediate schema code
    • keep docs/architecture/engineering-knowledge-hybrid-architecture.md as the target layer model
    • keep docs/architecture/engineering-ontology-v1.md as the V1 structured-domain target
    • do not start entity/relationship persistence until the ingestion, retrieval, registry, and backup baseline feels boring and stable
  9. Finish the boring operations baseline around backup
    • retention policy cleanup script (snapshots dir grows monotonically today)
    • off-Dalidou backup target (at minimum an rsync to laptop or another host so a single-disk failure isn't terminal)
    • automatic post-backup validation (have create_runtime_backup call validate_backup on its own output and refuse to declare success if validation fails)
    • DONE in commits be40994 / 0382238 / 3362080 / this one:
      • create_runtime_backup + list_runtime_backups + validate_backup + restore_runtime_backup with CLI
      • POST /admin/backup with include_chroma=true under the ingestion lock
      • /health build_sha / build_time / build_branch provenance
      • deploy.sh self-update re-exec guard + build_sha drift verification
      • live drill procedure in docs/backup-restore-procedure.md with failure-mode table and the memory_type=episodic marker pattern from the 2026-04-09 drill
  10. Keep deeper automatic runtime integration modest until the organic read-only model has proven value
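
For the retention cleanup under item 9, a minimal sketch might look like the following. It assumes each snapshot is a directory directly under the snapshots dir and that directory names sort chronologically (for example, timestamp stamps); neither is confirmed here. The function name prune_snapshots and its keep / dry_run parameters are hypothetical, not part of the existing atocore.ops.backup module.

```python
from __future__ import annotations

import shutil
from pathlib import Path


def prune_snapshots(snapshots_dir: Path, keep: int, dry_run: bool = True) -> list[Path]:
    """Return, and optionally delete, all but the `keep` newest snapshot dirs."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    # Assumption: snapshot directory names sort oldest-first lexicographically.
    snapshots = sorted(p for p in snapshots_dir.iterdir() if p.is_dir())
    doomed = snapshots[:-keep]
    if not dry_run:
        for path in doomed:
            shutil.rmtree(path)
    return doomed
```

Defaulting to dry_run=True means a first invocation only reports what would be removed, which fits the conservative posture of the rest of this list. A real script would probably also want to skip any snapshot that fails validate_backup before counting it toward the kept set.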

Trusted State Status

The first conservative trusted-state promotion pass is now complete for:

  • p04-gigabit
  • p05-interferometer
  • p06-polisher

Each project now has a small set of stable entries covering:

  • summary
  • architecture or boundary decision
  • key constraints
  • current next focus

This materially improves context-build quality for project-hinted prompts.

The active-project full markdown/text wave is now in.

The near-term work is now:

  1. strengthen retrieval quality
  2. promote or refine trusted operational truth where the broad corpus is now too noisy
  3. keep trusted project state concise and high-confidence
  4. widen only through named ingestion waves

Wave 2 should emphasize trusted operational truth, not bulk historical notes.

P04:

  • current status dashboard
  • current selected design path
  • current frame interface truth
  • current next-step milestone view

P05:

  • selected vendor path
  • current error-budget baseline
  • current architecture freeze or open decisions
  • current procurement / next-action view

P06:

  • current system map
  • current shared contracts baseline
  • current calibration procedure truth
  • current July / proving roadmap view

Deferred On Purpose

  • automatic write-back from OpenClaw into AtoCore
  • automatic memory promotion
  • reflection loop integration
  • replacing OpenClaw's own memory system
  • syncing the live machine DB between machines

Success Criteria For The Next Batch

The next batch is successful if:

  • OpenClaw can use AtoCore naturally when context is needed
  • OpenClaw can infer registered projects and call AtoCore organically for project-knowledge questions
  • the active-project full corpus wave can be inspected and used concretely through auto-context, context-build, and debug-context
  • OpenClaw can also register a new project cleanly before refreshing it
  • existing project registrations can be refined safely before refresh when the staged source set evolves
  • AtoCore answers correctly for the active project set
  • retrieval surfaces the seeded project docs instead of mostly AtoCore meta-docs
  • trusted project state remains concise and high confidence
  • project ingestion remains controlled rather than noisy
  • the canonical Dalidou instance stays stable

Long-Run Goal

The long-run target is:

  • continue working normally inside PKM project stacks and Gitea repos
  • let OpenClaw keep its own memory and runtime behavior
  • let AtoCore supplement LLM work with stronger trusted context, retrieval, and context assembly

That means AtoCore should behave like a durable external context engine and machine-memory layer, not a replacement for normal repo work or OpenClaw memory.