feat: retrieval eval harness + doc sync

scripts/retrieval_eval.py walks a fixture file of project-hinted
questions, runs each against POST /context/build, and scores the
returned formatted_context against per-fixture expect_present and
expect_absent substring checklists. Exit 0 on all-pass, 1 on any
miss. Human-readable by default, --json for automation.

First live run against Dalidou at SHA 1161645: 4/6 pass. The two
failures are real findings, not harness bugs:

- p05-configuration FAIL: "GigaBIT M1" appears in the p05 pack.
  Cross-project bleed from a shared p05 doc that legitimately
  mentions the p04 mirror under test. Fixture kept strict so
  future ranker tuning can close the gap.
- p05-vendor-signal FAIL: "Zygo" missing. The vendor memory exists
  with confidence 0.9 but get_memories_for_context walks memories
  in fixed order (effectively by updated_at / confidence), so lower-
  ranked memories get pushed out of the per-project budget slice by
  higher-confidence ones even when the query is specifically about
  the lower-ranked content. Query-relevance ordering of memories is
  the natural next fix.

Docs sync:

- master-plan-status.md: Phase 9 reflection entry now notes that
  capture→reinforce runs automatically and project memories reach
  the context pack, while extract remains batch/manual. First batch-
  extract pass surfaced 1 candidate from 42 interactions — extractor
  rule tuning is a known follow-up.
- next-steps.md: the 2026-04-11 retrieval quality review entry now
  shows the project-memory-band work as DONE, and a new
  "Reflection Loop Live Check" subsection records the extractor-
  coverage finding from the first batch run.
- Both files now agree with the code; follow-up reviewers
  (Codex, future Claude) should no longer see narrative drift.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-04-11 12:39:03 -04:00
parent 7bf83bf46a
commit 4da81c9e4e
4 changed files with 338 additions and 8 deletions

View File

@@ -32,7 +32,18 @@ read-only additive mode.
### Baseline Complete
- Phase 9 - Reflection (all three foundation commits landed:
A capture, B reinforcement, C candidate extraction + review queue)
A capture, B reinforcement, C candidate extraction + review queue).
As of 2026-04-11 the capture → reinforce half runs automatically on
every Stop-hook capture (length-aware token-overlap matcher handles
paragraph-length memories), and project-scoped memories now reach
the context pack via a dedicated `--- Project Memories ---` band
between identity/preference and retrieved chunks. The extract half
is still a manual / batch flow by design (`scripts/atocore_client.py
batch-extract` + `triage`). First live batch-extract run over 42
captured interactions produced 1 candidate (rule extractor is
conservative and keys on structural cues like `## Decision:`
headings that rarely appear in conversational LLM responses) —
extractor tuning is a known follow-up.
### Not Yet Complete In The Intended Sense
@@ -167,7 +178,9 @@ These remain intentionally deferred.
- automatic write-back from OpenClaw into AtoCore
- automatic memory promotion
- reflection loop integration
- ~~reflection loop integration~~ — baseline now in (capture→reinforce
auto, extract batch/manual). Extractor tuning and scheduled batch
extraction still open.
- replacing OpenClaw's own memory system
- live machine-DB sync between machines
- full ontology / graph expansion before the current baseline is stable