Anto01 4da81c9e4e feat: retrieval eval harness + doc sync

scripts/retrieval_eval.py walks a fixture file of project-hinted
questions, runs each against POST /context/build, and scores the
returned formatted_context against per-fixture expect_present and
expect_absent substring checklists. Exit 0 on all-pass, 1 on any
miss. Human-readable by default, --json for automation.

First live run against Dalidou at SHA 1161645: 4/6 pass. The two
failures are real findings, not harness bugs:

- p05-configuration FAIL: "GigaBIT M1" appears in the p05 pack.
  Cross-project bleed from a shared p05 doc that legitimately
  mentions the p04 mirror under test. Fixture kept strict so
  future ranker tuning can close the gap.
- p05-vendor-signal FAIL: "Zygo" missing. The vendor memory exists
  with confidence 0.9 but get_memories_for_context walks memories
  in fixed order (effectively by updated_at / confidence), so lower-
  ranked memories get pushed out of the per-project budget slice by
  higher-confidence ones even when the query is specifically about
  the lower-ranked content. Query-relevance ordering of memories is
  the natural next fix.

Docs sync:

- master-plan-status.md: Phase 9 reflection entry now notes that
  capture→reinforce runs automatically and project memories reach
  the context pack, while extract remains batch/manual. First batch-
  extract pass surfaced 1 candidate from 42 interactions — extractor
  rule tuning is a known follow-up.
- next-steps.md: the 2026-04-11 retrieval quality review entry now
  shows the project-memory-band work as DONE, and a new
  "Reflection Loop Live Check" subsection records the extractor-
  coverage finding from the first batch run.
- Both files now agree with the code; follow-up reviewers
  (Codex, future Claude) should no longer see narrative drift.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-11 12:39:03 -04:00


AtoCore Next Steps

Current Position

AtoCore now has:

canonical runtime and machine storage on Dalidou
separated source and machine-data boundaries
initial self-knowledge ingested into the live instance
trusted project-state entries for AtoCore itself
a first read-only OpenClaw integration path on the T420
a first real active-project corpus batch for:
- p04-gigabit
- p05-interferometer
- p06-polisher

This working list should be read alongside:

master-plan-status.md

Immediate Next Steps

~~Re-run the backup/restore drill~~ DONE 2026-04-11, full pass
~~Turn on auto-capture of Claude Code sessions~~ DONE 2026-04-11, Stop hook via deploy/hooks/capture_stop.py → POST /interactions with reinforce=false; kill switch: ATOCORE_CAPTURE_DISABLED=1
2a. Run a short real-use pilot with auto-capture on
- verify interactions are landing in Dalidou
- check prompt/response quality and truncation
- confirm fail-open: no user-visible impact when Dalidou is down
Use the T420 atocore-context skill and the new organic routing layer in real OpenClaw workflows
- confirm auto-context feels natural
- confirm project inference is good enough in practice
- confirm the fail-open behavior remains acceptable in practice
Review retrieval quality after the first real project ingestion batch
- check whether the top hits are useful
- check whether trusted project state remains dominant
- reduce cross-project competition and prompt ambiguity where needed
- use debug-context to inspect the exact last AtoCore supplement
Treat the active-project full markdown/text wave as complete
- p04-gigabit
- p05-interferometer
- p06-polisher
Define a cleaner source refresh model
- make the difference between source truth, staged inputs, and machine store explicit
- move toward a project source registry and refresh workflow
- foundation now exists via project registry + per-project refresh API
- registration policy + template + proposal + approved registration are now the normal path for new projects
Move to Wave 2 trusted-operational ingestion
- curated dashboards
- decision logs
- milestone/current-status views
- operational truth, not just raw project notes
Integrate the new engineering architecture docs into active planning, not immediate schema code
- keep docs/architecture/engineering-knowledge-hybrid-architecture.md as the target layer model
- keep docs/architecture/engineering-ontology-v1.md as the V1 structured-domain target
- do not start entity/relationship persistence until the ingestion, retrieval, registry, and backup baseline feels boring and stable
Finish the boring operations baseline around backup
- retention policy cleanup script (snapshots dir grows monotonically today)
- off-Dalidou backup target (at minimum an rsync to laptop or another host so a single-disk failure isn't terminal)
- automatic post-backup validation (have create_runtime_backup call validate_backup on its own output and refuse to declare success if validation fails)
- DONE in commits be40994 / 0382238 / 3362080 / this one:
  - create_runtime_backup + list_runtime_backups + validate_backup + restore_runtime_backup with CLI
  - POST /admin/backup with include_chroma=true under the ingestion lock
  - /health build_sha / build_time / build_branch provenance
  - deploy.sh self-update re-exec guard + build_sha drift verification
  - live drill procedure in docs/backup-restore-procedure.md with failure-mode table and the memory_type=episodic marker pattern from the 2026-04-09 drill
Keep deeper automatic runtime integration modest until the organic read-only model has proven value
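
The automatic post-backup validation item in the backup baseline above can be sketched as a thin wrapper. This is a minimal sketch under assumptions: the real signatures of create_runtime_backup and validate_backup are not shown in this document, so both are passed in as plain callables, and `create_validated_backup` and `BackupValidationError` are hypothetical names.

```python
from typing import Callable


class BackupValidationError(RuntimeError):
    """Raised when a freshly created backup fails its own validation."""


def create_validated_backup(
    create_backup: Callable[[], str],
    validate: Callable[[str], bool],
) -> str:
    """Create a backup, validate it, and only then report success.

    Both callables are stand-ins (assumption): create_backup returns an
    identifier for the new snapshot, validate returns True when that
    snapshot checks out.
    """
    stamp = create_backup()
    if not validate(stamp):
        # Refuse to declare success if validation fails.
        raise BackupValidationError(f"backup {stamp!r} failed validation")
    return stamp


# Illustrative call with fake callables; a real caller would pass the
# actual backup and validation functions here.
stamp = create_validated_backup(lambda: "snapshot-example", lambda s: True)
```

Raising instead of returning a status flag keeps a failed validation from being silently ignored by a caller that only checks for a return value.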

Trusted State Status

The first conservative trusted-state promotion pass is now complete for:

p04-gigabit
p05-interferometer
p06-polisher

Each project now has a small set of stable entries covering:

summary
architecture or boundary decision
key constraints
current next focus

This materially improves context/build quality for project-hinted prompts.

Recommended Near-Term Project Work

The active-project full markdown/text wave is now in.

The near-term work is:

strengthen retrieval quality
promote or refine trusted operational truth where the broad corpus is now too noisy
keep trusted project state concise and high-confidence
widen only through named ingestion waves

Recommended Next Wave Inputs

Wave 2 should emphasize trusted operational truth, not bulk historical notes.

P04:

current status dashboard
current selected design path
current frame interface truth
current next-step milestone view

P05:

selected vendor path
current error-budget baseline
current architecture freeze or open decisions
current procurement / next-action view

P06:

current system map
current shared contracts baseline
current calibration procedure truth
current July / proving roadmap view

Deferred On Purpose

automatic write-back from OpenClaw into AtoCore
automatic memory promotion
~~reflection loop integration~~ — baseline now landed (2026-04-11): Stop hook runs reinforce automatically, project memories are folded into the context pack, batch-extract and triage CLIs exist. What remains deferred: scheduled/automatic batch extraction and extractor rule tuning (rule-based extractor produced 1 candidate from 42 real captures — needs new cues for conversational LLM content).
replacing OpenClaw's own memory system
syncing the live machine DB between machines

Success Criteria For The Next Batch

The next batch is successful if:

OpenClaw can use AtoCore naturally when context is needed
OpenClaw can infer registered projects and call AtoCore organically for project-knowledge questions
the active-project full corpus wave can be inspected and used concretely through auto-context, context-build, and debug-context
OpenClaw can also register a new project cleanly before refreshing it
existing project registrations can be refined safely before refresh when the staged source set evolves
AtoCore answers correctly for the active project set
retrieval surfaces the seeded project docs instead of mostly AtoCore meta-docs
trusted project state remains concise and high confidence
project ingestion remains controlled rather than noisy
the canonical Dalidou instance stays stable

Retrieval Quality Review — 2026-04-11

First sweep with real project-hinted queries on Dalidou. Used POST /context/build against p04, p05, p06 with representative questions and inspected formatted_context.

Findings:

Trusted Project State is surfacing correctly. The DECISION and REQUIREMENT categories appear at the top of the pack and include the expected key facts (e.g. p04 "Option B conical-back mirror architecture"). This is the strongest signal in the pack today.
Chunk retrieval is on-topic but broad. Top chunks for the p04 architecture query are the PDR intro, the CAD assembly overview, and the index. All are from the right project, but none directly answers the "why was Option B chosen" question. The authoritative answer sits in Project State, not in the chunks.
Active memories are NOT reaching the pack. The context builder surfaces Trusted Project State and retrieved chunks but does not include the 21 active project/knowledge memories. Reinforcement (Phase 9 Commit B) bumps memory confidence without the memory ever being read back into a prompt, so the reflection loop has no outlet on the retrieval side. This is a design gap, not a bug: it needs a decision on whether memories should feed into context assembly, and if so at what trust level (below project_state, above chunks).
Cross-project bleed is low. The p04 query did pull one p05 chunk (CGH_Design_Input_for_AOM) as the bottom hit but the top-4 were all p04.

Proposed follow-ups (not yet scheduled):

~~Decide whether memories should be folded into formatted_context and under what section header.~~ DONE 2026-04-11 (commits 8ea53f4, 5913da5, 1161645). A --- Project Memories --- band now sits between identity/preference and retrieved chunks, gated on a canonical project hint to prevent cross-project bleed. The budget ratio is 0.25, tuned empirically: paragraph memories are ~400 chars, and the earlier 0.15 ratio starved the first entry by one char. Verified live: the p04 architecture query surfaces the Option B memory.
Re-run the same three queries after any builder change and compare formatted_context diffs — still open, and is the natural entry point for the retrieval eval harness on the roadmap.
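
To make the budget-ratio behavior concrete, here is a minimal sketch of filling a memory band under a character budget while ranking memories by query relevance rather than by a fixed order. Everything in it is an assumption for illustration: `select_memories`, the token-overlap score, the dict shape of a memory, and the total budget value are hypothetical, and this is not how get_memories_for_context is implemented.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens; a crude stand-in for real relevance scoring."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_memories(
    query: str,
    memories: list[dict],
    total_budget_chars: int,
    ratio: float = 0.25,
) -> list[dict]:
    """Pick memories for the band, most query-relevant first, within the budget slice.

    Each memory is assumed to be a dict with "content" (str) and
    "confidence" (float). Ties on relevance fall back to confidence.
    """
    budget = int(total_budget_chars * ratio)
    query_tokens = _tokens(query)
    ranked = sorted(
        memories,
        key=lambda m: (len(query_tokens & _tokens(m["content"])), m["confidence"]),
        reverse=True,
    )
    selected, used = [], 0
    for memory in ranked:
        cost = len(memory["content"])
        if used + cost > budget:
            continue  # does not fit in what is left of the slice
        selected.append(memory)
        used += cost
    return selected


# Placeholder memories, invented for this example.
example_memories = [
    {"content": "placeholder note about mirror architecture", "confidence": 0.95},
    {"content": "placeholder note about the interferometer vendor", "confidence": 0.90},
]
# Budget slice sized so that only one of the two memories fits.
slice_chars = max(len(m["content"]) for m in example_memories)
picked = select_memories(
    "which vendor for the interferometer",
    example_memories,
    total_budget_chars=slice_chars * 4,
)
```

With a fixed confidence-first order the higher-confidence architecture note would take the only slot; with relevance ranking the vendor note is picked for the vendor query instead.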

Reflection Loop Live Check — 2026-04-11

First real run of batch-extract across 42 captured Claude Code interactions on Dalidou produced exactly 1 candidate, and that candidate was a synthetic test capture from earlier in the session (rejected). Finding:

The rule-based extractor in src/atocore/memory/extractor.py keys on explicit structural cues (decision headings like ## Decision: ..., preference sentences, etc.). Real Claude Code responses are conversational and almost never contain those cues.
This means the capture → extract half of the reflection loop is effectively inert against organic LLM sessions until either the rules are broadened (new cue families: "we chose X because...", "the selected approach is...", etc.) or an LLM-assisted extraction path is added alongside the rule-based one.
Capture → reinforce is working correctly on live data (length-aware matcher verified on a live paraphrase of a p04 memory).
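
The conversational cue families named above ("we chose X because...", "the selected approach is...") could be prototyped with a couple of regexes. This is a hedged sketch, not the contents of src/atocore/memory/extractor.py: the pattern list, `extract_candidates`, and the output dict shape are hypothetical, and it assumes rule-based hits should land as status=candidate.

```python
import re

# Hypothetical cue patterns for conversational decision statements.
CONVERSATIONAL_CUES = [
    re.compile(r"\bwe chose (?P<what>.+?) because (?P<why>.+?)(?:\.|$)", re.IGNORECASE),
    re.compile(r"\bthe selected approach is (?P<what>.+?)(?:\.|$)", re.IGNORECASE),
]


def extract_candidates(text: str) -> list[dict]:
    """Return one candidate per sentence that matches any conversational cue."""
    candidates = []
    # Naive sentence split on terminal punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        for pattern in CONVERSATIONAL_CUES:
            match = pattern.search(sentence)
            if match:
                candidates.append(
                    {
                        "content": sentence.strip(),
                        "subject": match.group("what").strip(),
                        "status": "candidate",  # assumption: never promoted to active here
                    }
                )
                break  # one candidate per sentence is enough
    return candidates


# Invented response text for illustration only.
sample = "Here is the summary. We chose Option B because it is simpler. Nothing else changed."
found = extract_candidates(sample)
```

Patterns this loose will produce false positives on real sessions, which is one reason to keep every hit in the candidate state for triage.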

Follow-up candidates (not yet scheduled):

Extractor rule expansion — add conversational-form rules so real session text has a chance of surfacing candidates.
LLM-assisted extractor as a separate rule family, guarded by confidence and always landing in status=candidate (never active).
Retrieval eval harness — diffable scorecard of formatted_context across a fixed question set per active project.
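
As a rough illustration of the harness scoring step, the sketch below checks one formatted_context string against expect_present and expect_absent substring checklists and derives the process exit code. It is a sketch under assumptions, not the code in scripts/retrieval_eval.py: `Fixture`, `score_fixture`, and `exit_code` are hypothetical names, and the HTTP call to POST /context/build is left out so the example runs offline.

```python
from dataclasses import dataclass, field


@dataclass
class Fixture:
    """One eval case; the field names mirror the checklists, the class is hypothetical."""

    name: str
    expect_present: list[str] = field(default_factory=list)
    expect_absent: list[str] = field(default_factory=list)


def score_fixture(fixture: Fixture, formatted_context: str) -> dict:
    """Score one fixture with plain substring checks."""
    missing = [s for s in fixture.expect_present if s not in formatted_context]
    leaked = [s for s in fixture.expect_absent if s in formatted_context]
    return {
        "name": fixture.name,
        "passed": not missing and not leaked,
        "missing": missing,
        "leaked": leaked,
    }


def exit_code(results: list[dict]) -> int:
    """0 when every fixture passed, 1 on any miss."""
    return 0 if all(r["passed"] for r in results) else 1


# Invented fixture and context text for illustration only.
fixture = Fixture(
    name="example-fixture",
    expect_present=["interferometer"],
    expect_absent=["GigaBIT M1"],
)
result = score_fixture(fixture, "notes on the interferometer and the GigaBIT M1 mirror")
```

Because each result is a plain dict, the list serializes directly with json.dumps, and two runs can be diffed as text.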

Long-Run Goal

The long-run target is:

continue working normally inside PKM project stacks and Gitea repos
let OpenClaw keep its own memory and runtime behavior
let AtoCore supplement LLM work with stronger trusted context, retrieval, and context assembly

That means AtoCore should behave like a durable external context engine and machine-memory layer, not a replacement for normal repo work or OpenClaw memory.
