feat: retrieval eval harness + doc sync

scripts/retrieval_eval.py walks a fixture file of project-hinted
questions, runs each against POST /context/build, and scores the
returned formatted_context against per-fixture expect_present and
expect_absent substring checklists. Exit 0 on all-pass, 1 on any
miss. Human-readable by default, --json for automation.

First live run against Dalidou at SHA 1161645: 4/6 pass. The two
failures are real findings, not harness bugs:

- p05-configuration FAIL: "GigaBIT M1" appears in the p05 pack.
  Cross-project bleed from a shared p05 doc that legitimately
  mentions the p04 mirror under test. Fixture kept strict so future
  ranker tuning can close the gap.
- p05-vendor-signal FAIL: "Zygo" missing. The vendor memory exists
  with confidence 0.9 but get_memories_for_context walks memories in
  fixed order (effectively by updated_at / confidence), so
  lower-ranked memories get pushed out of the per-project budget
  slice by higher-confidence ones even when the query is specifically
  about the lower-ranked content. Query-relevance ordering of
  memories is the natural next fix.

Docs sync:

- master-plan-status.md: Phase 9 reflection entry now notes that
  capture→reinforce runs automatically and project memories reach the
  context pack, while extract remains batch/manual. First
  batch-extract pass surfaced 1 candidate from 42 interactions;
  extractor rule tuning is a known follow-up.
- next-steps.md: the 2026-04-11 retrieval quality review entry now
  shows the project-memory-band work as DONE, and a new "Reflection
  Loop Live Check" subsection records the extractor-coverage finding
  from the first batch run.
- Both files now agree with the code; follow-up reviewers (Codex,
  future Claude) should no longer see narrative drift.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
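
The p05-vendor-signal finding points at query-relevance ordering as the next fix. A minimal sketch of what that ordering could look like, assuming a plain token-overlap score with confidence as the tie-breaker. The `Memory` shape, `rank_by_query_overlap`, and the sample memory texts (paraphrased from the fixture notes) are hypothetical and are not the actual `get_memories_for_context` implementation; only the 0.9 confidence comes from this commit.

```python
import re
from dataclasses import dataclass


@dataclass
class Memory:
    # Hypothetical record shape; the real AtoCore memory row may differ.
    content: str
    confidence: float


def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_by_query_overlap(query: str, memories: list[Memory]) -> list[Memory]:
    """Order memories by shared-token count with the query, then confidence."""
    query_tokens = _tokens(query)

    def sort_key(memory: Memory) -> tuple[int, float]:
        overlap = len(query_tokens & _tokens(memory.content))
        return (overlap, memory.confidence)

    # Highest overlap first; confidence only breaks ties.
    return sorted(memories, key=sort_key, reverse=True)


# Illustrative data only: 0.95 is an assumed confidence for the example.
memories = [
    Memory("Folded-beam configuration with CGH selected", 0.95),
    Memory("Vendor signal: 4D strongest candidate, Zygo Verifire SV value path", 0.9),
]
ranked = rank_by_query_overlap(
    "what is the current vendor signal for the interferometer procurement",
    memories,
)
```

With this ordering the vendor memory would be considered first for the budget slice even though its confidence is lower, which is the behavior the failing fixture is asking for. A real fix would likely want stop-word handling or the length-aware matcher already used for reinforcement.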
2026-04-11 12:39:03 -04:00
parent 7bf83bf46a
commit 4da81c9e4e
4 changed files with 338 additions and 8 deletions
--- a/docs/master-plan-status.md
+++ b/docs/master-plan-status.md
@@ -32,7 +32,18 @@ read-only additive mode.
 ### Baseline Complete

 - Phase 9 - Reflection (all three foundation commits landed:
-  A capture, B reinforcement, C candidate extraction + review queue)
+  A capture, B reinforcement, C candidate extraction + review queue).
+  As of 2026-04-11 the capture → reinforce half runs automatically on
+  every Stop-hook capture (length-aware token-overlap matcher handles
+  paragraph-length memories), and project-scoped memories now reach
+  the context pack via a dedicated `--- Project Memories ---` band
+  between identity/preference and retrieved chunks. The extract half
+  is still a manual / batch flow by design (`scripts/atocore_client.py
+  batch-extract` + `triage`). First live batch-extract run over 42
+  captured interactions produced 1 candidate (rule extractor is
+  conservative and keys on structural cues like `## Decision:`
+  headings that rarely appear in conversational LLM responses) —
+  extractor tuning is a known follow-up.

 ### Not Yet Complete In The Intended Sense

@@ -167,7 +178,9 @@ These remain intentionally deferred.

 - automatic write-back from OpenClaw into AtoCore
 - automatic memory promotion
- reflection loop integration
+- ~~reflection loop integration~~ — baseline now in (capture→reinforce
+  auto, extract batch/manual). Extractor tuning and scheduled batch
+  extraction still open.
 - replacing OpenClaw's own memory system
 - live machine-DB sync between machines
 - full ontology / graph expansion before the current baseline is stable
--- a/docs/next-steps.md
+++ b/docs/next-steps.md
@@ -137,7 +137,12 @@ P06:

 - automatic write-back from OpenClaw into AtoCore
 - automatic memory promotion
- reflection loop integration
+- ~~reflection loop integration~~ — baseline now landed (2026-04-11):
+  Stop hook runs reinforce automatically, project memories are folded
+  into the context pack, batch-extract and triage CLIs exist. What
+  remains deferred: scheduled/automatic batch extraction and extractor
+  rule tuning (rule-based extractor produced 1 candidate from 42 real
+  captures — needs new cues for conversational LLM content).
 - replacing OpenClaw's own memory system
 - syncing the live machine DB between machines

@@ -190,12 +195,45 @@ Findings:

 Proposed follow-ups (not yet scheduled):

-1. Decide whether memories should be folded into `formatted_context`
-   and under what section header. Candidate: a "--- Project Memories ---"
-   band between Trusted Project State and Retrieved Context, filtered
-   to active memories for the target project plus identity/preference.
+1. ~~Decide whether memories should be folded into `formatted_context`
+   and under what section header.~~ DONE 2026-04-11 (commits 8ea53f4,
+   5913da5, 1161645). A `--- Project Memories ---` band now sits
+   between identity/preference and retrieved chunks, gated on a
+   canonical project hint to prevent cross-project bleed. Budget
+   ratio 0.25 (tuned empirically — paragraph memories are ~400 chars
+   and earlier 0.15 ratio starved the first entry by one char).
+   Verified live: p04 architecture query surfaces the Option B memory.
 2. Re-run the same three queries after any builder change and compare
-   `formatted_context` diffs.
+   `formatted_context` diffs — still open, and is the natural entry
+   point for the retrieval eval harness on the roadmap.
+
+## Reflection Loop Live Check — 2026-04-11
+
+First real run of `batch-extract` across 42 captured Claude Code
+interactions on Dalidou produced exactly **1 candidate**, and that
+candidate was a synthetic test capture from earlier in the session
+(rejected). Finding:
+
+- The rule-based extractor in `src/atocore/memory/extractor.py` keys
+  on explicit structural cues (decision headings like
+  `## Decision: ...`, preference sentences, etc.). Real Claude Code
+  responses are conversational and almost never contain those cues.
+- This means the capture → extract half of the reflection loop is
+  effectively inert against organic LLM sessions until either the
+  rules are broadened (new cue families: "we chose X because...",
+  "the selected approach is...", etc.) or an LLM-assisted extraction
+  path is added alongside the rule-based one.
+- Capture → reinforce is working correctly on live data (length-aware
+  matcher verified on live paraphrase of a p04 memory).
+
+Follow-up candidates (not yet scheduled):
+
+1. Extractor rule expansion — add conversational-form rules so real
+   session text has a chance of surfacing candidates.
+2. LLM-assisted extractor as a separate rule family, guarded by
+   confidence and always landing in `status=candidate` (never active).
+3. Retrieval eval harness — diffable scorecard of
+   `formatted_context` across a fixed question set per active project.

 ## Long-Run Goal

--- a/scripts/retrieval_eval.py
+++ b/scripts/retrieval_eval.py
@@ -0,0 +1,194 @@
+"""Retrieval quality eval harness.
+
+Runs a fixed set of project-hinted questions against
+``POST /context/build`` on a live AtoCore instance and scores the
+resulting ``formatted_context`` against per-question expectations.
+The goal is a diffable scorecard that tells you, run-to-run,
+whether a retrieval / builder / ingestion change moved the needle.
+
+Design notes
+------------
+- Fixtures live in ``scripts/retrieval_eval_fixtures.json`` so new
+  questions can be added without touching Python. Each fixture
+  names the project, the prompt, and a checklist of substrings that
+  MUST appear in ``formatted_context`` (``expect_present``) and
+  substrings that MUST NOT appear (``expect_absent``). The absent
+  list catches cross-project bleed and stale content.
+- The checklist is deliberately substring-based (not regex, not
+  embedding-similarity) so a failure is always a trivially
+  reproducible "this string is not in that string". Richer scoring
+  can come later once we know the harness is useful.
+- The harness is external to the app runtime and talks to AtoCore
+  over HTTP, so it works against dev, staging, or prod. It follows
+  the same environment-variable contract as ``atocore_client.py``
+  (``ATOCORE_BASE_URL``, ``ATOCORE_TIMEOUT_SECONDS``).
+- Exit code 0 on all-pass, 1 on any fixture failure. Intended for
+  manual runs today; a future cron / CI hook can consume the
+  JSON output via ``--json``.
+
+Usage
+-----
+
+    python scripts/retrieval_eval.py            # human-readable report
+    python scripts/retrieval_eval.py --json     # machine-readable
+    python scripts/retrieval_eval.py --fixtures path/to/custom.json
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import os
+import sys
+import urllib.error
+import urllib.parse
+import urllib.request
+from dataclasses import dataclass, field
+from pathlib import Path
+
+DEFAULT_BASE_URL = os.environ.get("ATOCORE_BASE_URL", "http://dalidou:8100")
+DEFAULT_TIMEOUT = int(os.environ.get("ATOCORE_TIMEOUT_SECONDS", "30"))
+DEFAULT_BUDGET = 3000
+DEFAULT_FIXTURES = Path(__file__).parent / "retrieval_eval_fixtures.json"
+
+
+@dataclass
+class Fixture:
+    name: str
+    project: str
+    prompt: str
+    budget: int = DEFAULT_BUDGET
+    expect_present: list[str] = field(default_factory=list)
+    expect_absent: list[str] = field(default_factory=list)
+    notes: str = ""
+
+
+@dataclass
+class FixtureResult:
+    fixture: Fixture
+    ok: bool
+    missing_present: list[str]
+    unexpected_absent: list[str]
+    total_chars: int
+    error: str = ""
+
+
+def load_fixtures(path: Path) -> list[Fixture]:
+    data = json.loads(path.read_text(encoding="utf-8"))
+    if not isinstance(data, list):
+        raise ValueError(f"{path} must contain a JSON array of fixtures")
+    fixtures: list[Fixture] = []
+    for i, raw in enumerate(data):
+        if not isinstance(raw, dict):
+            raise ValueError(f"fixture {i} is not an object")
+        fixtures.append(
+            Fixture(
+                name=raw["name"],
+                project=raw.get("project", ""),
+                prompt=raw["prompt"],
+                budget=int(raw.get("budget", DEFAULT_BUDGET)),
+                expect_present=list(raw.get("expect_present", [])),
+                expect_absent=list(raw.get("expect_absent", [])),
+                notes=raw.get("notes", ""),
+            )
+        )
+    return fixtures
+
+
+def run_fixture(fixture: Fixture, base_url: str, timeout: int) -> FixtureResult:
+    payload = {
+        "prompt": fixture.prompt,
+        "project": fixture.project or None,
+        "budget": fixture.budget,
+    }
+    req = urllib.request.Request(
+        url=f"{base_url}/context/build",
+        method="POST",
+        headers={"Content-Type": "application/json"},
+        data=json.dumps(payload).encode("utf-8"),
+    )
+    try:
+        with urllib.request.urlopen(req, timeout=timeout) as resp:
+            body = json.loads(resp.read().decode("utf-8"))
+    except (OSError, ValueError) as exc:  # URLError, read timeouts, non-JSON bodies
+        return FixtureResult(
+            fixture=fixture,
+            ok=False,
+            missing_present=list(fixture.expect_present),
+            unexpected_absent=[],
+            total_chars=0,
+            error=f"http_error: {exc}",
+        )
+
+    formatted = body.get("formatted_context") or ""
+    missing = [s for s in fixture.expect_present if s not in formatted]
+    unexpected = [s for s in fixture.expect_absent if s in formatted]
+    return FixtureResult(
+        fixture=fixture,
+        ok=not missing and not unexpected,
+        missing_present=missing,
+        unexpected_absent=unexpected,
+        total_chars=len(formatted),
+    )
+
+
+def print_human_report(results: list[FixtureResult]) -> None:
+    total = len(results)
+    passed = sum(1 for r in results if r.ok)
+    print(f"Retrieval eval: {passed}/{total} fixtures passed")
+    print()
+    for r in results:
+        marker = "PASS" if r.ok else "FAIL"
+        print(f"[{marker}] {r.fixture.name}  project={r.fixture.project}  chars={r.total_chars}")
+        if r.error:
+            print(f"       error: {r.error}")
+        for miss in r.missing_present:
+            print(f"       missing expected: {miss!r}")
+        for bleed in r.unexpected_absent:
+            print(f"       unexpected present: {bleed!r}")
+        if r.fixture.notes and not r.ok:
+            print(f"       notes: {r.fixture.notes}")
+
+
+def print_json_report(results: list[FixtureResult]) -> None:
+    payload = {
+        "total": len(results),
+        "passed": sum(1 for r in results if r.ok),
+        "fixtures": [
+            {
+                "name": r.fixture.name,
+                "project": r.fixture.project,
+                "ok": r.ok,
+                "total_chars": r.total_chars,
+                "missing_present": r.missing_present,
+                "unexpected_absent": r.unexpected_absent,
+                "error": r.error,
+            }
+            for r in results
+        ],
+    }
+    json.dump(payload, sys.stdout, indent=2)
+    sys.stdout.write("\n")
+
+
+def main() -> int:
+    parser = argparse.ArgumentParser(description="AtoCore retrieval quality eval harness")
+    parser.add_argument("--base-url", default=DEFAULT_BASE_URL)
+    parser.add_argument("--timeout", type=int, default=DEFAULT_TIMEOUT)
+    parser.add_argument("--fixtures", type=Path, default=DEFAULT_FIXTURES)
+    parser.add_argument("--json", action="store_true", help="emit machine-readable JSON")
+    args = parser.parse_args()
+
+    fixtures = load_fixtures(args.fixtures)
+    results = [run_fixture(f, args.base_url, args.timeout) for f in fixtures]
+
+    if args.json:
+        print_json_report(results)
+    else:
+        print_human_report(results)
+
+    return 0 if all(r.ok for r in results) else 1
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
--- a/scripts/retrieval_eval_fixtures.json
+++ b/scripts/retrieval_eval_fixtures.json
@@ -0,0 +1,85 @@
+[
+  {
+    "name": "p04-architecture-decision",
+    "project": "p04-gigabit",
+    "prompt": "what mirror architecture was selected for GigaBIT M1 and why",
+    "expect_present": [
+      "--- Trusted Project State ---",
+      "Option B",
+      "conical",
+      "--- Project Memories ---"
+    ],
+    "expect_absent": [
+      "p06-polisher",
+      "folded-beam"
+    ],
+    "notes": "Canonical p04 decision — should surface both Trusted Project State (selected_mirror_architecture) and the project-memory band with the Option B memory"
+  },
+  {
+    "name": "p04-constraints",
+    "project": "p04-gigabit",
+    "prompt": "what are the key GigaBIT M1 program constraints",
+    "expect_present": [
+      "--- Trusted Project State ---",
+      "Zerodur",
+      "1.2"
+    ],
+    "expect_absent": [
+      "polisher suite"
+    ],
+    "notes": "Key constraints are in Trusted Project State (key_constraints) and in the mission-framing memory"
+  },
+  {
+    "name": "p05-configuration",
+    "project": "p05-interferometer",
+    "prompt": "what is the selected interferometer configuration",
+    "expect_present": [
+      "folded-beam",
+      "CGH"
+    ],
+    "expect_absent": [
+      "p04-gigabit",
+      "GigaBIT M1"
+    ],
+    "notes": "P05 architecture memory covers folded-beam + CGH; should not bleed p04"
+  },
+  {
+    "name": "p05-vendor-signal",
+    "project": "p05-interferometer",
+    "prompt": "what is the current vendor signal for the interferometer procurement",
+    "expect_present": [
+      "4D",
+      "Zygo"
+    ],
+    "expect_absent": [
+      "polisher"
+    ],
+    "notes": "Vendor memory mentions 4D as strongest technical candidate and Zygo Verifire SV as value path"
+  },
+  {
+    "name": "p06-suite-split",
+    "project": "p06-polisher",
+    "prompt": "how is the polisher software suite split across layers",
+    "expect_present": [
+      "polisher-sim",
+      "polisher-post",
+      "polisher-control"
+    ],
+    "expect_absent": [
+      "GigaBIT"
+    ],
+    "notes": "The three-layer split is in multiple p06 memories; check all three names surface together"
+  },
+  {
+    "name": "p06-control-rule",
+    "project": "p06-polisher",
+    "prompt": "what is the polisher control design rule",
+    "expect_present": [
+      "interlocks"
+    ],
+    "expect_absent": [
+      "interferometer"
+    ],
+    "notes": "Control design rule memory mentions interlocks and state transitions"
+  }
+]
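
The harness is described as a diffable scorecard, with `--json` intended for automation. A minimal sketch of how two saved reports could be compared run-to-run, assuming only the `fixtures[].name` and `fixtures[].ok` fields that `print_json_report` emits. `diff_reports` is a hypothetical helper and is not part of this commit.

```python
def diff_reports(old: dict, new: dict) -> dict:
    """Compare two retrieval_eval --json reports by fixture name."""
    old_ok = {f["name"]: f["ok"] for f in old["fixtures"]}
    new_ok = {f["name"]: f["ok"] for f in new["fixtures"]}
    shared = old_ok.keys() & new_ok.keys()
    return {
        # Passed before, fails now.
        "regressed": sorted(n for n in shared if old_ok[n] and not new_ok[n]),
        # Failed before, passes now.
        "fixed": sorted(n for n in shared if not old_ok[n] and new_ok[n]),
        # Fixtures present in only one of the two reports.
        "added": sorted(new_ok.keys() - old_ok.keys()),
        "removed": sorted(old_ok.keys() - new_ok.keys()),
    }


# Illustrative reports trimmed to the fields the helper reads.
before = {
    "fixtures": [
        {"name": "p04-constraints", "ok": True},
        {"name": "p05-vendor-signal", "ok": False},
    ]
}
after = {
    "fixtures": [
        {"name": "p04-constraints", "ok": False},
        {"name": "p05-vendor-signal", "ok": True},
        {"name": "p06-control-rule", "ok": True},
    ]
}
summary = diff_reports(before, after)
```

Keying on fixture name keeps the comparison stable when fixtures are reordered or added, and a non-empty `regressed` list is a natural condition for a future cron or CI hook to fail on.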