ATOCore/scripts/retrieval_eval_fixtures.json
Anto01 4da81c9e4e feat: retrieval eval harness + doc sync
scripts/retrieval_eval.py walks a fixture file of project-hinted
questions, runs each against POST /context/build, and scores the
returned formatted_context against per-fixture expect_present and
expect_absent substring checklists. Exits 0 when every fixture passes,
1 if any check fails. Human-readable by default, --json for automation.
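
The scoring step described above can be sketched as follows. This is a minimal illustration, not the code in scripts/retrieval_eval.py: the helper names score_fixture and exit_code are hypothetical, the substring checks are assumed to be case-sensitive, and the call to POST /context/build is left out so the check runs on a plain string standing in for formatted_context.

```python
from typing import Dict, List


def score_fixture(fixture: Dict, formatted_context: str) -> Dict:
    """Score one fixture against a context pack using plain substring checks.

    Hypothetical helper: assumes case-sensitive matching.
    """
    # Strings that should have been in the pack but were not found.
    missing = [s for s in fixture.get("expect_present", []) if s not in formatted_context]
    # Strings that should not be in the pack but were found.
    unexpected = [s for s in fixture.get("expect_absent", []) if s in formatted_context]
    return {
        "name": fixture["name"],
        "ok": not missing and not unexpected,
        "missing": missing,
        "unexpected": unexpected,
    }


def exit_code(results: List[Dict]) -> int:
    """Return 0 when every fixture passed, 1 otherwise."""
    return 0 if all(r["ok"] for r in results) else 1


fixture = {
    "name": "p05-vendor-signal",
    "expect_present": ["4D", "Zygo"],
    "expect_absent": ["polisher"],
}
# Made-up context text, for illustration only.
sample_context = "--- Project Memories ---\n4D is the strongest technical candidate"

result = score_fixture(fixture, sample_context)
print(result)
print("exit code:", exit_code([result]))
```

With this sample text the fixture fails because "Zygo" is not present, which mirrors the shape of the p05-vendor-signal failure reported below.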

First live run against Dalidou at SHA 1161645: 4/6 pass. The two
failures are real findings, not harness bugs:

- p05-configuration FAIL: "GigaBIT M1" appears in the p05 pack.
  Cross-project bleed from a shared p05 doc that legitimately
  mentions the p04 mirror under test. Fixture kept strict so
  future ranker tuning can close the gap.
- p05-vendor-signal FAIL: "Zygo" missing. The vendor memory exists
  with confidence 0.9, but get_memories_for_context walks memories
  in a fixed order (effectively by updated_at / confidence), so
  memories that rank lower in that order get pushed out of the
  per-project budget slice by higher-ranked ones even when the
  query is specifically about the lower-ranked content. Ordering
  memories by query relevance is the natural next fix.
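
One way the proposed query-relevance ordering could work is sketched below. Everything here is an assumption made for illustration: the memory records, the character budget, the token-overlap score, and the helper names are hypothetical and do not describe how get_memories_for_context is actually written. Sorting by confidence stands in for the fixed order described above.

```python
import re
from typing import Dict, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_by_query(query: str, memories: List[Dict]) -> List[Dict]:
    """Order memories by token overlap with the query, then by confidence."""
    q = tokenize(query)
    return sorted(
        memories,
        key=lambda m: (-len(q & tokenize(m["content"])), -m["confidence"]),
    )


def take_within_budget(memories: List[Dict], budget_chars: int) -> List[Dict]:
    """Greedily keep memories in the given order, skipping any that do not fit."""
    picked, used = [], 0
    for m in memories:
        cost = len(m["content"])
        if used + cost > budget_chars:
            continue  # does not fit in what is left of the budget
        picked.append(m)
        used += cost
    return picked


# Illustrative records only; not real stored memories.
memories = [
    {"content": "Interferometer uses a folded-beam layout with a CGH null",
     "confidence": 0.95},
    {"content": "Vendor signal: 4D strongest technical candidate, "
                "Zygo Verifire SV value path",
     "confidence": 0.9},
]
query = "what is the current vendor signal for the interferometer procurement"
BUDGET = 100  # hypothetical per-project character budget

fixed_order = sorted(memories, key=lambda m: -m["confidence"])
fixed_pick = take_within_budget(fixed_order, BUDGET)
ranked_pick = take_within_budget(rank_by_query(query, memories), BUDGET)

print("fixed order keeps Zygo:", any("Zygo" in m["content"] for m in fixed_pick))
print("query order keeps Zygo:", any("Zygo" in m["content"] for m in ranked_pick))
```

Token overlap is a deliberately crude relevance signal; it is used here only to show that reordering before the budget cut changes which memory survives.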

Docs sync:

- master-plan-status.md: Phase 9 reflection entry now notes that
  capture→reinforce runs automatically and project memories reach
  the context pack, while extract remains batch/manual. First batch-
  extract pass surfaced 1 candidate from 42 interactions — extractor
  rule tuning is a known follow-up.
- next-steps.md: the 2026-04-11 retrieval quality review entry now
  shows the project-memory-band work as DONE, and a new
  "Reflection Loop Live Check" subsection records the extractor-
  coverage finding from the first batch run.
- Both files now agree with the code; follow-up reviewers
  (Codex, future Claude) should no longer see narrative drift.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 12:39:03 -04:00

86 lines
2.3 KiB
JSON

[
  {
    "name": "p04-architecture-decision",
    "project": "p04-gigabit",
    "prompt": "what mirror architecture was selected for GigaBIT M1 and why",
    "expect_present": [
      "--- Trusted Project State ---",
      "Option B",
      "conical",
      "--- Project Memories ---"
    ],
    "expect_absent": [
      "p06-polisher",
      "folded-beam"
    ],
    "notes": "Canonical p04 decision — should surface both Trusted Project State (selected_mirror_architecture) and the project-memory band with the Option B memory"
  },
  {
    "name": "p04-constraints",
    "project": "p04-gigabit",
    "prompt": "what are the key GigaBIT M1 program constraints",
    "expect_present": [
      "--- Trusted Project State ---",
      "Zerodur",
      "1.2"
    ],
    "expect_absent": [
      "polisher suite"
    ],
    "notes": "Key constraints are in Trusted Project State (key_constraints) and in the mission-framing memory"
  },
  {
    "name": "p05-configuration",
    "project": "p05-interferometer",
    "prompt": "what is the selected interferometer configuration",
    "expect_present": [
      "folded-beam",
      "CGH"
    ],
    "expect_absent": [
      "p04-gigabit",
      "GigaBIT M1"
    ],
    "notes": "P05 architecture memory covers folded-beam + CGH; should not bleed p04"
  },
  {
    "name": "p05-vendor-signal",
    "project": "p05-interferometer",
    "prompt": "what is the current vendor signal for the interferometer procurement",
    "expect_present": [
      "4D",
      "Zygo"
    ],
    "expect_absent": [
      "polisher"
    ],
    "notes": "Vendor memory mentions 4D as strongest technical candidate and Zygo Verifire SV as value path"
  },
  {
    "name": "p06-suite-split",
    "project": "p06-polisher",
    "prompt": "how is the polisher software suite split across layers",
    "expect_present": [
      "polisher-sim",
      "polisher-post",
      "polisher-control"
    ],
    "expect_absent": [
      "GigaBIT"
    ],
    "notes": "The three-layer split is in multiple p06 memories; check all three names surface together"
  },
  {
    "name": "p06-control-rule",
    "project": "p06-polisher",
    "prompt": "what is the polisher control design rule",
    "expect_present": [
      "interlocks"
    ],
    "expect_absent": [
      "interferometer"
    ],
    "notes": "Control design rule memory mentions interlocks and state transitions"
  }
]
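
A fixture file of this shape can be sanity-checked before a run. The sketch below is an assumption about what a useful check might look like and is not part of scripts/retrieval_eval.py; validate_fixtures is a hypothetical helper, and the rules it enforces (non-empty string fields, string lists, unique names, no string listed as both expected and forbidden) are inferred from the entries above rather than from any documented schema.

```python
import json
from typing import Dict, List

REQUIRED_STR = ("name", "project", "prompt")
REQUIRED_LIST = ("expect_present", "expect_absent")


def validate_fixtures(fixtures: List[Dict]) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems: List[str] = []
    seen_names = set()
    for i, fx in enumerate(fixtures):
        label = fx.get("name", f"#{i}")
        for key in REQUIRED_STR:
            if not isinstance(fx.get(key), str) or not fx[key]:
                problems.append(f"{label}: '{key}' must be a non-empty string")
        for key in REQUIRED_LIST:
            value = fx.get(key)
            if not isinstance(value, list) or not all(isinstance(s, str) for s in value):
                problems.append(f"{label}: '{key}' must be a list of strings")
        # A substring cannot be both required and forbidden.
        present = fx.get("expect_present")
        absent = fx.get("expect_absent")
        if isinstance(present, list) and isinstance(absent, list):
            for s in present:
                if s in absent:
                    problems.append(f"{label}: '{s}' is both expected and forbidden")
        if label in seen_names:
            problems.append(f"{label}: duplicate fixture name")
        seen_names.add(label)
    return problems


sample = json.loads("""
[
  {
    "name": "p06-control-rule",
    "project": "p06-polisher",
    "prompt": "what is the polisher control design rule",
    "expect_present": ["interlocks"],
    "expect_absent": ["interferometer"],
    "notes": "Control design rule memory mentions interlocks and state transitions"
  }
]
""")
print(validate_fixtures(sample))
```

The notes field is treated as optional free text here, since the harness description only mentions the prompt and the two checklists.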