feat: retrieval eval harness + doc sync

scripts/retrieval_eval.py walks a fixture file of project-hinted questions, runs each against POST /context/build, and scores the returned formatted_context against per-fixture expect_present and expect_absent substring checklists. Exit 0 on all-pass, 1 on any miss. Human-readable by default, --json for automation.

First live run against Dalidou at SHA 1161645: 4/6 pass. The two failures are real findings, not harness bugs:

- p05-configuration FAIL: "GigaBIT M1" appears in the p05 pack. Cross-project bleed from a shared p05 doc that legitimately mentions the p04 mirror under test. Fixture kept strict so future ranker tuning can close the gap.
- p05-vendor-signal FAIL: "Zygo" missing. The vendor memory exists with confidence 0.9 but get_memories_for_context walks memories in fixed order (effectively by updated_at / confidence), so lower-ranked memories get pushed out of the per-project budget slice by higher-confidence ones even when the query is specifically about the lower-ranked content. Query-relevance ordering of memories is the natural next fix.

Docs sync:

- master-plan-status.md: Phase 9 reflection entry now notes that capture→reinforce runs automatically and project memories reach the context pack, while extract remains batch/manual. First batch-extract pass surfaced 1 candidate from 42 interactions; extractor rule tuning is a known follow-up.
- next-steps.md: the 2026-04-11 retrieval quality review entry now shows the project-memory-band work as DONE, and a new "Reflection Loop Live Check" subsection records the extractor-coverage finding from the first batch run.
- Both files now agree with the code; follow-up reviewers (Codex, future Claude) should no longer see narrative drift.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
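
The pass/fail rule described in the commit message can be sketched roughly as follows: a fixture passes only when every expect_present substring occurs in formatted_context and no expect_absent substring does, and the process exit code is 0 only when all fixtures pass. This is an illustration, not the code in scripts/retrieval_eval.py. The `name` field, the function and class names, the case-sensitive matching, and the demo fixture and context string are all assumptions, and the HTTP call to POST /context/build is left out.

```python
from dataclasses import dataclass, field


@dataclass
class FixtureResult:
    """Outcome of checking one fixture against one context pack."""
    name: str
    missing: list = field(default_factory=list)     # expect_present entries not found
    unexpected: list = field(default_factory=list)  # expect_absent entries that were found

    @property
    def passed(self) -> bool:
        return not self.missing and not self.unexpected


def score_fixture(fixture: dict, formatted_context: str) -> FixtureResult:
    """Apply plain substring checks (case-sensitive here, by assumption)."""
    missing = [s for s in fixture.get("expect_present", [])
               if s not in formatted_context]
    unexpected = [s for s in fixture.get("expect_absent", [])
                  if s in formatted_context]
    return FixtureResult(fixture["name"], missing, unexpected)


def exit_code(results) -> int:
    """0 when every fixture passed, 1 on any miss."""
    return 0 if all(r.passed for r in results) else 1


if __name__ == "__main__":
    # Hypothetical fixture and context text, invented for this demo only.
    demo_fixture = {
        "name": "demo-fixture",
        "expect_present": ["interferometer"],
        "expect_absent": ["GigaBIT M1"],
    }
    demo_context = "interferometer setup notes that mention the GigaBIT M1 mirror"

    result = score_fixture(demo_fixture, demo_context)
    status = "PASS" if result.passed else "FAIL"
    print(f"{result.name}: {status} "
          f"missing={result.missing} unexpected={result.unexpected}")
    print("exit code:", exit_code([result]))
```

Keeping the missing and unexpected lists separate lets a report say which kind of failure occurred, which is the distinction the two findings above rely on: one is a forbidden string that leaked in, the other an expected string that never arrived.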
2026-04-11 12:39:03 -04:00
parent 7bf83bf46a
commit 4da81c9e4e
4 changed files with 338 additions and 8 deletions
--- a/docs/master-plan-status.md
+++ b/docs/master-plan-status.md
@@ -32,7 +32,18 @@ read-only additive mode.
 ### Baseline Complete

 - Phase 9 - Reflection (all three foundation commits landed:
-  A capture, B reinforcement, C candidate extraction + review queue)
+  A capture, B reinforcement, C candidate extraction + review queue).
+  As of 2026-04-11 the capture → reinforce half runs automatically on
+  every Stop-hook capture (length-aware token-overlap matcher handles
+  paragraph-length memories), and project-scoped memories now reach
+  the context pack via a dedicated `--- Project Memories ---` band
+  between identity/preference and retrieved chunks. The extract half
+  is still a manual / batch flow by design (`scripts/atocore_client.py
+  batch-extract` + `triage`). First live batch-extract run over 42
+  captured interactions produced 1 candidate (rule extractor is
+  conservative and keys on structural cues like `## Decision:`
+  headings that rarely appear in conversational LLM responses) —
+  extractor tuning is a known follow-up.

 ### Not Yet Complete In The Intended Sense

@@ -167,7 +178,9 @@ These remain intentionally deferred.

 - automatic write-back from OpenClaw into AtoCore
 - automatic memory promotion
- reflection loop integration
+- ~~reflection loop integration~~ — baseline now in (capture→reinforce
+  auto, extract batch/manual). Extractor tuning and scheduled batch
+  extraction still open.
 - replacing OpenClaw's own memory system
 - live machine-DB sync between machines
 - full ontology / graph expansion before the current baseline is stable