feat: retrieval eval harness + doc sync

scripts/retrieval_eval.py walks a fixture file of project-hinted
questions, runs each against POST /context/build, and scores the
returned formatted_context against per-fixture expect_present and
expect_absent substring checklists. Exit 0 on all-pass, 1 on any
miss. Human-readable by default, --json for automation.

First live run against Dalidou at SHA 1161645: 4/6 pass. The two
failures are real findings, not harness bugs:

- p05-configuration FAIL: "GigaBIT M1" appears in the p05 pack.
  Cross-project bleed from a shared p05 doc that legitimately
  mentions the p04 mirror under test. Fixture kept strict so future
  ranker tuning can close the gap.
- p05-vendor-signal FAIL: "Zygo" missing. The vendor memory exists
  with confidence 0.9 but get_memories_for_context walks memories in
  fixed order (effectively by updated_at / confidence), so
  lower-ranked memories get pushed out of the per-project budget
  slice by higher-confidence ones even when the query is
  specifically about the lower-ranked content. Query-relevance
  ordering of memories is the natural next fix.

Docs sync:

- master-plan-status.md: Phase 9 reflection entry now notes that
  capture→reinforce runs automatically and project memories reach
  the context pack, while extract remains batch/manual. First
  batch-extract pass surfaced 1 candidate from 42 interactions;
  extractor rule tuning is a known follow-up.
- next-steps.md: the 2026-04-11 retrieval quality review entry now
  shows the project-memory-band work as DONE, and a new "Reflection
  Loop Live Check" subsection records the extractor-coverage finding
  from the first batch run.
- Both files now agree with the code; follow-up reviewers (Codex,
  future Claude) should no longer see narrative drift.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
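
The scoring rule described in the commit message can be sketched as below. This is a minimal illustration, not the contents of scripts/retrieval_eval.py: the `Fixture` shape, the `score_fixture` and `exit_code` names, and the case-sensitive matching are all assumptions, and the context strings are made up. The HTTP call to POST /context/build is left out so the sketch stays self-contained.

```python
from dataclasses import dataclass, field


@dataclass
class Fixture:
    # Hypothetical fixture shape; the real fixture file format is not
    # shown in this commit.
    name: str
    expect_present: list[str] = field(default_factory=list)
    expect_absent: list[str] = field(default_factory=list)


def score_fixture(fixture: Fixture, formatted_context: str) -> dict:
    """Check one context pack against a fixture's substring checklists.

    Matching is plain, case-sensitive substring containment (an assumption).
    """
    missing = [s for s in fixture.expect_present if s not in formatted_context]
    leaked = [s for s in fixture.expect_absent if s in formatted_context]
    return {
        "name": fixture.name,
        "passed": not missing and not leaked,
        "missing": missing,
        "leaked": leaked,
    }


def exit_code(results: list[dict]) -> int:
    """Return 0 when every fixture passed, 1 on any miss."""
    return 0 if all(r["passed"] for r in results) else 1


if __name__ == "__main__":
    fixture = Fixture(
        name="p05-vendor-signal",
        expect_present=["Zygo"],
        expect_absent=["GigaBIT M1"],
    )
    # Illustrative context pack text, not real harness output.
    result = score_fixture(fixture, "--- Project Memories ---\n(no vendor memory)")
    print(result)
    raise SystemExit(exit_code([result]))
```

Reporting `missing` and `leaked` separately keeps the two failure modes apart: an absent expected string points at ranking or budget problems, while a leaked string points at cross-project bleed.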
2026-04-11 12:39:03 -04:00
parent 7bf83bf46a
commit 4da81c9e4e
4 changed files with 338 additions and 8 deletions
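
The p05-vendor-signal finding in the commit message (a fixed memory order pushing query-relevant memories out of the budget slice) can be illustrated with a small sketch. It assumes a greedy character-budget fill and uses token overlap as a stand-in relevance score; `Memory`, `select_memories`, the memory texts, and the budget value are hypothetical and do not reflect how get_memories_for_context is actually implemented.

```python
import re
from dataclasses import dataclass


@dataclass
class Memory:
    # Hypothetical record; real memories carry more fields.
    content: str
    confidence: float


def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens, used for a crude overlap score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_memories(
    memories: list[Memory],
    query: str,
    budget_chars: int,
    by_relevance: bool = True,
) -> list[Memory]:
    """Greedily fill a character budget with memories.

    With by_relevance=False the order depends on confidence alone,
    which mimics a fixed ordering. With by_relevance=True, token
    overlap with the query ranks first and confidence breaks ties.
    """
    query_tokens = _tokens(query)

    def sort_key(memory: Memory) -> tuple[int, float]:
        overlap = len(query_tokens & _tokens(memory.content)) if by_relevance else 0
        return (overlap, memory.confidence)

    picked: list[Memory] = []
    used = 0
    for memory in sorted(memories, key=sort_key, reverse=True):
        if used + len(memory.content) > budget_chars:
            continue  # does not fit in the remaining budget
        picked.append(memory)
        used += len(memory.content)
    return picked


if __name__ == "__main__":
    # Made-up memory texts for illustration only.
    memories = [
        Memory("Polishing schedule approved for the primary mirror", 0.95),
        Memory("Vendor signal: Zygo quoted the interferometer", 0.9),
    ]
    query = "which vendor quoted the interferometer"
    print(select_memories(memories, query, budget_chars=60, by_relevance=False))
    print(select_memories(memories, query, budget_chars=60, by_relevance=True))
```

With a budget that fits only one memory, the fixed ordering keeps the higher-confidence entry and drops the vendor memory, while the relevance ordering keeps the vendor memory instead. A real fix would likely use a better relevance signal than token overlap.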
--- a/docs/next-steps.md
+++ b/docs/next-steps.md
@@ -137,7 +137,12 @@ P06:

 - automatic write-back from OpenClaw into AtoCore
 - automatic memory promotion
-- reflection loop integration
+- ~~reflection loop integration~~ — baseline now landed (2026-04-11):
+  Stop hook runs reinforce automatically, project memories are folded
+  into the context pack, batch-extract and triage CLIs exist. What
+  remains deferred: scheduled/automatic batch extraction and extractor
+  rule tuning (rule-based extractor produced 1 candidate from 42 real
+  captures — needs new cues for conversational LLM content).
 - replacing OpenClaw's own memory system
 - syncing the live machine DB between machines

@@ -190,12 +195,45 @@ Findings:

 Proposed follow-ups (not yet scheduled):

-1. Decide whether memories should be folded into `formatted_context`
-   and under what section header. Candidate: a "--- Project Memories ---"
-   band between Trusted Project State and Retrieved Context, filtered
-   to active memories for the target project plus identity/preference.
+1. ~~Decide whether memories should be folded into `formatted_context`
+   and under what section header.~~ DONE 2026-04-11 (commits 8ea53f4,
+   5913da5, 1161645). A `--- Project Memories ---` band now sits
+   between identity/preference and retrieved chunks, gated on a
+   canonical project hint to prevent cross-project bleed. Budget
+   ratio 0.25 (tuned empirically — paragraph memories are ~400 chars
+   and earlier 0.15 ratio starved the first entry by one char).
+   Verified live: p04 architecture query surfaces the Option B memory.
 2. Re-run the same three queries after any builder change and compare
-   `formatted_context` diffs.
+   `formatted_context` diffs — still open, and is the natural entry
+   point for the retrieval eval harness on the roadmap.
+
+## Reflection Loop Live Check — 2026-04-11
+
+First real run of `batch-extract` across 42 captured Claude Code
+interactions on Dalidou produced exactly **1 candidate**, and that
+candidate was a synthetic test capture from earlier in the session
+(rejected). Finding:
+
+- The rule-based extractor in `src/atocore/memory/extractor.py` keys
+  on explicit structural cues (decision headings like
+  `## Decision: ...`, preference sentences, etc.). Real Claude Code
+  responses are conversational and almost never contain those cues.
+- This means the capture → extract half of the reflection loop is
+  effectively inert against organic LLM sessions until either the
+  rules are broadened (new cue families: "we chose X because...",
+  "the selected approach is...", etc.) or an LLM-assisted extraction
+  path is added alongside the rule-based one.
+- Capture → reinforce is working correctly on live data (length-aware
+  matcher verified on live paraphrase of a p04 memory).
+
+Follow-up candidates (not yet scheduled):
+
+1. Extractor rule expansion — add conversational-form rules so real
+   session text has a chance of surfacing candidates.
+2. LLM-assisted extractor as a separate rule family, guarded by
+   confidence and always landing in `status=candidate` (never active).
+3. Retrieval eval harness — diffable scorecard of
+   `formatted_context` across a fixed question set per active project.

 ## Long-Run Goal