ATOCore/docs/next-steps.md
Anto01 4da81c9e4e feat: retrieval eval harness + doc sync
scripts/retrieval_eval.py walks a fixture file of project-hinted
questions, runs each against POST /context/build, and scores the
returned formatted_context against per-fixture expect_present and
expect_absent substring checklists. Exit 0 on all-pass, 1 on any
miss. Human-readable by default, --json for automation.
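The core scoring loop can be sketched as follows (the `expect_present`/`expect_absent` fixture fields and exit-code contract are from the description above; the function names, fixture layout, and the injected `build_context` callable are illustrative assumptions, not the actual `scripts/retrieval_eval.py`):

```python
import json
from typing import Callable, Iterable


def score_fixture(fixture: dict, build_context: Callable[[str, str], str]) -> dict:
    """Run one project-hinted question and check its substring checklists."""
    text = build_context(fixture["project"], fixture["question"])
    missing = [s for s in fixture.get("expect_present", []) if s not in text]
    leaked = [s for s in fixture.get("expect_absent", []) if s in text]
    return {"id": fixture["id"], "missing": missing, "leaked": leaked,
            "passed": not missing and not leaked}


def run_suite(fixtures: Iterable[dict], build_context, as_json: bool = False) -> int:
    results = [score_fixture(f, build_context) for f in fixtures]
    if as_json:
        print(json.dumps(results, indent=2))  # --json mode for automation
    else:
        for r in results:
            status = "PASS" if r["passed"] else "FAIL"
            print(f"{status} {r['id']} missing={r['missing']} leaked={r['leaked']}")
    # Exit 0 only when every fixture passes, 1 on any miss.
    return 0 if all(r["passed"] for r in results) else 1
```

In the real harness `build_context` would wrap a `POST /context/build` call and return `formatted_context`.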

First live run against Dalidou at SHA 1161645: 4/6 pass. The two
failures are real findings, not harness bugs:

- p05-configuration FAIL: "GigaBIT M1" appears in the p05 pack.
  Cross-project bleed from a shared p05 doc that legitimately
  mentions the p04 mirror under test. Fixture kept strict so
  future ranker tuning can close the gap.
- p05-vendor-signal FAIL: "Zygo" missing. The vendor memory exists
  with confidence 0.9 but get_memories_for_context walks memories
  in fixed order (effectively by updated_at / confidence), so lower-
  ranked memories get pushed out of the per-project budget slice by
  higher-confidence ones even when the query is specifically about
  the lower-ranked content. Query-relevance ordering of memories is
  the natural next fix.
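The query-relevance ordering fix named above could take roughly this shape (the `Memory` dataclass and the token-overlap scorer are illustrative assumptions, not the actual `get_memories_for_context` implementation):

```python
from dataclasses import dataclass


@dataclass
class Memory:
    content: str
    confidence: float


def select_for_context(memories: list[Memory], query: str, budget: int) -> list[Memory]:
    """Rank by query relevance first, confidence second, then fill the budget."""
    q_terms = set(query.lower().split())

    def relevance(m: Memory) -> float:
        # Crude token overlap stands in for whatever scorer is actually used.
        return len(q_terms & set(m.content.lower().split())) / (len(q_terms) or 1)

    ranked = sorted(memories, key=lambda m: (relevance(m), m.confidence), reverse=True)
    picked, used = [], 0
    for m in ranked:
        if used + len(m.content) <= budget:
            picked.append(m)
            used += len(m.content)
    return picked
```

With this ordering, a lower-confidence memory that directly matches the query beats a higher-confidence one that does not, instead of being pushed out of the budget slice.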

Docs sync:

- master-plan-status.md: Phase 9 reflection entry now notes that
  capture→reinforce runs automatically and project memories reach
  the context pack, while extract remains batch/manual. First batch-
  extract pass surfaced 1 candidate from 42 interactions — extractor
  rule tuning is a known follow-up.
- next-steps.md: the 2026-04-11 retrieval quality review entry now
  shows the project-memory-band work as DONE, and a new
  "Reflection Loop Live Check" subsection records the extractor-
  coverage finding from the first batch run.
- Both files now agree with the code; follow-up reviewers
  (Codex, future Claude) should no longer see narrative drift.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 12:39:03 -04:00


AtoCore Next Steps

Current Position

AtoCore now has:

  • canonical runtime and machine storage on Dalidou
  • separated source and machine-data boundaries
  • initial self-knowledge ingested into the live instance
  • trusted project-state entries for AtoCore itself
  • a first read-only OpenClaw integration path on the T420
  • a first real active-project corpus batch for:
    • p04-gigabit
    • p05-interferometer
    • p06-polisher

This working list should be read alongside:

Immediate Next Steps

  1. Re-run the backup/restore drill — DONE 2026-04-11, full pass
  2. Turn on auto-capture of Claude Code sessions — DONE 2026-04-11, Stop hook via deploy/hooks/capture_stop.py → POST /interactions with reinforce=false; kill switch: ATOCORE_CAPTURE_DISABLED=1
  2a. Run a short real-use pilot with auto-capture on
    • verify interactions are landing in Dalidou
    • check prompt/response quality and truncation
    • confirm fail-open: no user-visible impact when Dalidou is down
  3. Use the T420 atocore-context skill and the new organic routing layer in real OpenClaw workflows
    • confirm auto-context feels natural
    • confirm project inference is good enough in practice
    • confirm the fail-open behavior remains acceptable in practice
  4. Review retrieval quality after the first real project ingestion batch
    • check whether the top hits are useful
    • check whether trusted project state remains dominant
    • reduce cross-project competition and prompt ambiguity where needed
    • use debug-context to inspect the exact last AtoCore supplement
  5. Treat the active-project full markdown/text wave as complete
    • p04-gigabit
    • p05-interferometer
    • p06-polisher
  6. Define a cleaner source refresh model
    • make the difference between source truth, staged inputs, and machine store explicit
    • move toward a project source registry and refresh workflow
    • foundation now exists via project registry + per-project refresh API
    • registration policy + template + proposal + approved registration are now the normal path for new projects
  7. Move to Wave 2 trusted-operational ingestion
    • curated dashboards
    • decision logs
    • milestone/current-status views
    • operational truth, not just raw project notes
  8. Integrate the new engineering architecture docs into active planning, not immediate schema code
    • keep docs/architecture/engineering-knowledge-hybrid-architecture.md as the target layer model
    • keep docs/architecture/engineering-ontology-v1.md as the V1 structured-domain target
    • do not start entity/relationship persistence until the ingestion, retrieval, registry, and backup baseline feels boring and stable
  9. Finish the boring operations baseline around backup
    • retention policy cleanup script (snapshots dir grows monotonically today)
    • off-Dalidou backup target (at minimum an rsync to laptop or another host so a single-disk failure isn't terminal)
    • automatic post-backup validation (have create_runtime_backup call validate_backup on its own output and refuse to declare success if validation fails)
    • DONE in commits be40994 / 0382238 / 3362080 / this one:
      • create_runtime_backup + list_runtime_backups + validate_backup + restore_runtime_backup with CLI
      • POST /admin/backup with include_chroma=true under the ingestion lock
      • /health build_sha / build_time / build_branch provenance
      • deploy.sh self-update re-exec guard + build_sha drift verification
      • live drill procedure in docs/backup-restore-procedure.md with failure-mode table and the memory_type=episodic marker pattern from the 2026-04-09 drill
  10. Keep deeper automatic runtime integration modest until the organic read-only model has proven value
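The fail-open behavior that step 2 relies on and step 2a verifies can be sketched as a capture hook that never raises into the session (the endpoint path, payload shape, and timeout here are illustrative assumptions around the documented ATOCORE_CAPTURE_DISABLED kill switch, not the actual capture_stop.py):

```python
import json
import os
import urllib.request


def capture_interaction(prompt: str, response: str,
                        base_url: str = "http://dalidou:8000") -> bool:
    """POST the interaction to AtoCore; swallow every failure (fail open)."""
    if os.environ.get("ATOCORE_CAPTURE_DISABLED") == "1":
        return False  # kill switch: capture is off, session proceeds untouched
    payload = json.dumps({"prompt": prompt, "response": response,
                          "reinforce": False}).encode()
    req = urllib.request.Request(f"{base_url}/interactions", data=payload,
                                 headers={"Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(req, timeout=2):
            return True
    except Exception:
        # Fail open: Dalidou being down must never surface to the user.
        return False
```

The return value only tells the hook whether the interaction landed; nothing propagates back into the Claude Code session either way.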

Trusted State Status

The first conservative trusted-state promotion pass is now complete for:

  • p04-gigabit
  • p05-interferometer
  • p06-polisher

Each project now has a small set of stable entries covering:

  • summary
  • architecture or boundary decision
  • key constraints
  • current next focus

This materially improves context/build quality for project-hinted prompts.

The active-project full markdown/text wave is now in.

The near-term work is now:

  1. strengthen retrieval quality
  2. promote or refine trusted operational truth where the broad corpus is now too noisy
  3. keep trusted project state concise and high-confidence
  4. widen only through named ingestion waves

Wave 2 should emphasize trusted operational truth, not bulk historical notes.

P04:

  • current status dashboard
  • current selected design path
  • current frame interface truth
  • current next-step milestone view

P05:

  • selected vendor path
  • current error-budget baseline
  • current architecture freeze or open decisions
  • current procurement / next-action view

P06:

  • current system map
  • current shared contracts baseline
  • current calibration procedure truth
  • current July / proving roadmap view

Deferred On Purpose

  • automatic write-back from OpenClaw into AtoCore
  • automatic memory promotion
  • reflection loop integration — baseline now landed (2026-04-11): Stop hook runs reinforce automatically, project memories are folded into the context pack, and batch-extract and triage CLIs exist. Still deferred: scheduled/automatic batch extraction and extractor rule tuning (the rule-based extractor produced 1 candidate from 42 real captures — it needs new cues for conversational LLM content).
  • replacing OpenClaw's own memory system
  • syncing the live machine DB between machines

Success Criteria For The Next Batch

The next batch is successful if:

  • OpenClaw can use AtoCore naturally when context is needed
  • OpenClaw can infer registered projects and call AtoCore organically for project-knowledge questions
  • the active-project full corpus wave can be inspected and used concretely through auto-context, context-build, and debug-context
  • OpenClaw can also register a new project cleanly before refreshing it
  • existing project registrations can be refined safely before refresh when the staged source set evolves
  • AtoCore answers correctly for the active project set
  • retrieval surfaces the seeded project docs instead of mostly AtoCore meta-docs
  • trusted project state remains concise and high confidence
  • project ingestion remains controlled rather than noisy
  • the canonical Dalidou instance stays stable

Retrieval Quality Review — 2026-04-11

First sweep with real project-hinted queries on Dalidou. Used POST /context/build against p04, p05, p06 with representative questions and inspected formatted_context.

Findings:

  • Trusted Project State is surfacing correctly. The DECISION and REQUIREMENT categories appear at the top of the pack and include the expected key facts (e.g. p04 "Option B conical-back mirror architecture"). This is the strongest signal in the pack today.
  • Chunk retrieval is relevant on-topic but broad. Top chunks for the p04 architecture query are PDR intro, CAD assembly overview, and the index — all on the right project but none of them directly answer the "why was Option B chosen" question. The authoritative answer sits in Project State, not in the chunks.
  • Active memories are NOT reaching the pack. The context builder surfaces Trusted Project State and retrieved chunks but does not include the 21 active project/knowledge memories. Reinforcement (Phase 9 Commit B) bumps memory confidence without the memory ever being read back into a prompt — the reflection loop has no outlet on the retrieval side. This is a design gap, not a bug: needs a decision on whether memories should feed into context assembly, and if so at what trust level (below project_state, above chunks).
  • Cross-project bleed is low. The p04 query did pull one p05 chunk (CGH_Design_Input_for_AOM) as the bottom hit but the top-4 were all p04.

Proposed follow-ups (not yet scheduled):

  1. Decide whether memories should be folded into formatted_context and under what section header. DONE 2026-04-11 (commits 8ea53f4, 5913da5, 1161645). A --- Project Memories --- band now sits between identity/preference and retrieved chunks, gated on a canonical project hint to prevent cross-project bleed. Budget ratio 0.25 (tuned empirically — paragraph memories are ~400 chars and earlier 0.15 ratio starved the first entry by one char). Verified live: p04 architecture query surfaces the Option B memory.
  2. Re-run the same three queries after any builder change and compare formatted_context diffs — still open, and is the natural entry point for the retrieval eval harness on the roadmap.

Reflection Loop Live Check — 2026-04-11

First real run of batch-extract across 42 captured Claude Code interactions on Dalidou produced exactly 1 candidate, and that candidate was a synthetic test capture from earlier in the session (rejected). Findings:

  • The rule-based extractor in src/atocore/memory/extractor.py keys on explicit structural cues (decision headings like ## Decision: ..., preference sentences, etc.). Real Claude Code responses are conversational and almost never contain those cues.
  • This means the capture → extract half of the reflection loop is effectively inert against organic LLM sessions until either the rules are broadened (new cue families: "we chose X because...", "the selected approach is...", etc.) or an LLM-assisted extraction path is added alongside the rule-based one.
  • Capture → reinforce is working correctly on live data (length-aware matcher verified on live paraphrase of a p04 memory).

Follow-up candidates (not yet scheduled):

  1. Extractor rule expansion — add conversational-form rules so real session text has a chance of surfacing candidates.
  2. LLM-assisted extractor as a separate rule family, guarded by confidence and always landing in status=candidate (never active).
  3. Retrieval eval harness — diffable scorecard of formatted_context across a fixed question set per active project.
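Candidate 1's conversational-form cue families could be a few extra regexes alongside the structural ones (the patterns and candidate shape below are illustrative assumptions about src/atocore/memory/extractor.py, not its actual rules):

```python
import re

# One structural cue the current extractor keys on (per the finding above),
# plus hypothetical conversational cue families for organic LLM session text.
CUE_PATTERNS = [
    ("decision_heading", re.compile(r"^##\s*Decision:\s*(?P<claim>.+)$", re.M)),
    ("chose_because", re.compile(r"\bwe chose (?P<claim>.+?because.+?)[.\n]", re.I)),
    ("selected_approach", re.compile(r"\bthe selected approach is (?P<claim>.+?)[.\n]", re.I)),
]


def extract_candidates(text: str) -> list[dict]:
    """Return status=candidate entries; promotion to active stays manual."""
    out = []
    for rule, pattern in CUE_PATTERNS:
        for m in pattern.finditer(text):
            out.append({"rule": rule, "claim": m.group("claim").strip(),
                        "status": "candidate"})
    return out
```

An LLM-assisted extractor (candidate 2) would slot in as one more rule family here, still always landing in status=candidate.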

Long-Run Goal

The long-run target is:

  • continue working normally inside PKM project stacks and Gitea repos
  • let OpenClaw keep its own memory and runtime behavior
  • let AtoCore supplement LLM work with stronger trusted context, retrieval, and context assembly

That means AtoCore should behave like a durable external context engine and machine-memory layer, not a replacement for normal repo work or OpenClaw memory.