feat: retrieval eval harness + doc sync

scripts/retrieval_eval.py walks a fixture file of project-hinted
questions, runs each against POST /context/build, and scores the
returned formatted_context against per-fixture expect_present and
expect_absent substring checklists. Exit 0 on all-pass, 1 on any
miss. Human-readable by default, --json for automation.

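A minimal sketch of how the `--json` output could feed automation: comparing two saved reports to see which fixtures changed status between runs. The report shape (`total`, `passed`, `fixtures[].name`, `fixtures[].ok`) follows `print_json_report` in this commit; the `diff_reports` helper and the sample reports are hypothetical and not part of the harness.

```python
from __future__ import annotations

import json


def diff_reports(before: dict, after: dict) -> dict[str, list[str]]:
    """Compare two retrieval_eval --json reports by fixture name."""
    old = {f["name"]: f["ok"] for f in before["fixtures"]}
    new = {f["name"]: f["ok"] for f in after["fixtures"]}
    shared = sorted(old.keys() & new.keys())
    return {
        "regressed": [n for n in shared if old[n] and not new[n]],
        "fixed": [n for n in shared if not old[n] and new[n]],
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
    }


# Hypothetical sample reports, trimmed to the fields the helper reads.
before = {
    "total": 2,
    "passed": 1,
    "fixtures": [
        {"name": "p04-constraints", "ok": True},
        {"name": "p05-vendor-signal", "ok": False},
    ],
}
after = {
    "total": 3,
    "passed": 3,
    "fixtures": [
        {"name": "p04-constraints", "ok": True},
        {"name": "p05-vendor-signal", "ok": True},
        {"name": "p06-control-rule", "ok": True},
    ],
}

changes = diff_reports(before, after)
print(json.dumps(changes, indent=2))
```

In practice the two dicts would come from `json.load` on files captured from separate runs; keying on the fixture name keeps the comparison stable when fixtures are reordered.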
First live run against Dalidou at SHA 1161645: 4/6 pass. The two
failures are real findings, not harness bugs:

- p05-configuration FAIL: "GigaBIT M1" appears in the p05 pack.
  Cross-project bleed from a shared p05 doc that legitimately
  mentions the p04 mirror under test. Fixture kept strict so
  future ranker tuning can close the gap.
- p05-vendor-signal FAIL: "Zygo" missing. The vendor memory exists
  with confidence 0.9 but get_memories_for_context walks memories
  in fixed order (effectively by updated_at / confidence), so
  lower-ranked memories get pushed out of the per-project budget
  slice by higher-confidence ones even when the query is
  specifically about the lower-ranked content. Query-relevance
  ordering of memories is the natural next fix.

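A minimal sketch of what query-relevance ordering could look like, assuming a simple token-overlap score with confidence as the tie-breaker. The `Memory` class, `select_memories`, and the sample memory texts are hypothetical stand-ins; the real `get_memories_for_context` signature and storage model are not shown in this commit.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class Memory:
    """Hypothetical stand-in for a stored project memory."""
    content: str
    confidence: float


def _tokens(text: str) -> set[str]:
    # Lowercased alphanumeric tokens; deliberately crude.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_memories(query: str, memories: list[Memory], budget_chars: int) -> list[Memory]:
    """Rank by query token overlap first, confidence second, then fill the budget."""
    q = _tokens(query)
    ranked = sorted(
        memories,
        key=lambda m: (len(q & _tokens(m.content)), m.confidence),
        reverse=True,
    )
    picked: list[Memory] = []
    used = 0
    for m in ranked:
        if used + len(m.content) > budget_chars:
            continue  # does not fit; a shorter, lower-ranked entry may still fit
        picked.append(m)
        used += len(m.content)
    return picked


# Made-up sample memories, not the actual p05 content.
memories = [
    Memory("Selected configuration is folded-beam with CGH null.", 0.95),
    Memory("Vendor signal: 4D is the strongest technical candidate, Zygo is the value path.", 0.9),
]
picked = select_memories(
    "what is the current vendor signal for the interferometer procurement",
    memories,
    budget_chars=100,
)
print([m.content for m in picked])
```

With a budget that only fits one entry, the vendor memory wins despite its lower confidence because it shares more tokens with the query. Confidence still decides between memories with equal overlap.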
Docs sync:

- master-plan-status.md: Phase 9 reflection entry now notes that
  capture→reinforce runs automatically and project memories reach
  the context pack, while extract remains batch/manual. First
  batch-extract pass surfaced 1 candidate from 42 interactions;
  extractor rule tuning is a known follow-up.
- next-steps.md: the 2026-04-11 retrieval quality review entry now
  shows the project-memory-band work as DONE, and a new
  "Reflection Loop Live Check" subsection records the
  extractor-coverage finding from the first batch run.
- Both files now agree with the code; follow-up reviewers
  (Codex, future Claude) should no longer see narrative drift.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
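
The extractor follow-up recorded in next-steps.md names two conversational cue families ("we chose X because...", "the selected approach is..."). A rough sketch of what such rules might look like is below. The patterns, `CUE_PATTERNS`, and `extract_candidates` are assumptions for illustration only; the actual rule format in `src/atocore/memory/extractor.py` is not shown in this commit.

```python
from __future__ import annotations

import re

# Hypothetical cue patterns for the two conversational forms named in the docs.
CUE_PATTERNS = [
    ("decision", re.compile(r"\bwe chose\s+.+?\s+because\s+[^.]+", re.IGNORECASE)),
    ("decision", re.compile(r"\bthe selected approach is\s+[^.]+", re.IGNORECASE)),
]


def extract_candidates(text: str) -> list[dict[str, str]]:
    """Return one candidate per cue match; nothing is promoted to active here."""
    candidates = []
    for kind, pattern in CUE_PATTERNS:
        for match in pattern.finditer(text):
            candidates.append(
                {
                    "type": kind,
                    "content": match.group(0).strip(),
                    "status": "candidate",
                }
            )
    return candidates


sample = (
    "After comparing both layouts, we chose Option B because it keeps "
    "the mount simpler. Tests are next."
)
found = extract_candidates(sample)
print(found)
```

Every match lands with `status` set to `candidate`, mirroring the review-queue flow described in the docs. Sentence-level patterns like these are likely to be noisy, so they would need tuning against real captures.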
@@ -32,7 +32,18 @@ read-only additive mode.
 ### Baseline Complete
 
 - Phase 9 - Reflection (all three foundation commits landed:
-A capture, B reinforcement, C candidate extraction + review queue)
+A capture, B reinforcement, C candidate extraction + review queue).
+As of 2026-04-11 the capture → reinforce half runs automatically on
+every Stop-hook capture (length-aware token-overlap matcher handles
+paragraph-length memories), and project-scoped memories now reach
+the context pack via a dedicated `--- Project Memories ---` band
+between identity/preference and retrieved chunks. The extract half
+is still a manual / batch flow by design (`scripts/atocore_client.py
+batch-extract` + `triage`). First live batch-extract run over 42
+captured interactions produced 1 candidate (rule extractor is
+conservative and keys on structural cues like `## Decision:`
+headings that rarely appear in conversational LLM responses) —
+extractor tuning is a known follow-up.
 
 ### Not Yet Complete In The Intended Sense
 
@@ -167,7 +178,9 @@ These remain intentionally deferred.
 
 - automatic write-back from OpenClaw into AtoCore
 - automatic memory promotion
-- reflection loop integration
+- ~~reflection loop integration~~ — baseline now in (capture→reinforce
+auto, extract batch/manual). Extractor tuning and scheduled batch
+extraction still open.
 - replacing OpenClaw's own memory system
 - live machine-DB sync between machines
 - full ontology / graph expansion before the current baseline is stable
@@ -137,7 +137,12 @@ P06:
 
 - automatic write-back from OpenClaw into AtoCore
 - automatic memory promotion
-- reflection loop integration
+- ~~reflection loop integration~~ — baseline now landed (2026-04-11):
+Stop hook runs reinforce automatically, project memories are folded
+into the context pack, batch-extract and triage CLIs exist. What
+remains deferred: scheduled/automatic batch extraction and extractor
+rule tuning (rule-based extractor produced 1 candidate from 42 real
+captures — needs new cues for conversational LLM content).
 - replacing OpenClaw's own memory system
 - syncing the live machine DB between machines
 
@@ -190,12 +195,45 @@ Findings:
 
 Proposed follow-ups (not yet scheduled):
 
-1. Decide whether memories should be folded into `formatted_context`
-and under what section header. Candidate: a "--- Project Memories ---"
-band between Trusted Project State and Retrieved Context, filtered
-to active memories for the target project plus identity/preference.
+1. ~~Decide whether memories should be folded into `formatted_context`
+and under what section header.~~ DONE 2026-04-11 (commits 8ea53f4,
+5913da5, 1161645). A `--- Project Memories ---` band now sits
+between identity/preference and retrieved chunks, gated on a
+canonical project hint to prevent cross-project bleed. Budget
+ratio 0.25 (tuned empirically — paragraph memories are ~400 chars
+and earlier 0.15 ratio starved the first entry by one char).
+Verified live: p04 architecture query surfaces the Option B memory.
 2. Re-run the same three queries after any builder change and compare
-`formatted_context` diffs.
+`formatted_context` diffs — still open, and is the natural entry
+point for the retrieval eval harness on the roadmap.
 
+## Reflection Loop Live Check — 2026-04-11
+
+First real run of `batch-extract` across 42 captured Claude Code
+interactions on Dalidou produced exactly **1 candidate**, and that
+candidate was a synthetic test capture from earlier in the session
+(rejected). Finding:
+
+- The rule-based extractor in `src/atocore/memory/extractor.py` keys
+on explicit structural cues (decision headings like
+`## Decision: ...`, preference sentences, etc.). Real Claude Code
+responses are conversational and almost never contain those cues.
+- This means the capture → extract half of the reflection loop is
+effectively inert against organic LLM sessions until either the
+rules are broadened (new cue families: "we chose X because...",
+"the selected approach is...", etc.) or an LLM-assisted extraction
+path is added alongside the rule-based one.
+- Capture → reinforce is working correctly on live data (length-aware
+matcher verified on live paraphrase of a p04 memory).
+
+Follow-up candidates (not yet scheduled):
+
+1. Extractor rule expansion — add conversational-form rules so real
+session text has a chance of surfacing candidates.
+2. LLM-assisted extractor as a separate rule family, guarded by
+confidence and always landing in `status=candidate` (never active).
+3. Retrieval eval harness — diffable scorecard of
+`formatted_context` across a fixed question set per active project.
+
 ## Long-Run Goal
 
194
scripts/retrieval_eval.py
Normal file
@@ -0,0 +1,194 @@
+"""Retrieval quality eval harness.
+
+Runs a fixed set of project-hinted questions against
+``POST /context/build`` on a live AtoCore instance and scores the
+resulting ``formatted_context`` against per-question expectations.
+The goal is a diffable scorecard that tells you, run-to-run,
+whether a retrieval / builder / ingestion change moved the needle.
+
+Design notes
+------------
+- Fixtures live in ``scripts/retrieval_eval_fixtures.json`` so new
+  questions can be added without touching Python. Each fixture
+  names the project, the prompt, and a checklist of substrings that
+  MUST appear in ``formatted_context`` (``expect_present``) and
+  substrings that MUST NOT appear (``expect_absent``). The absent
+  list catches cross-project bleed and stale content.
+- The checklist is deliberately substring-based (not regex, not
+  embedding-similarity) so a failure is always a trivially
+  reproducible "this string is not in that string". Richer scoring
+  can come later once we know the harness is useful.
+- The harness is external to the app runtime and talks to AtoCore
+  over HTTP, so it works against dev, staging, or prod. It follows
+  the same environment-variable contract as ``atocore_client.py``
+  (``ATOCORE_BASE_URL``, ``ATOCORE_TIMEOUT_SECONDS``).
+- Exit code 0 on all-pass, 1 on any fixture failure. Intended for
+  manual runs today; a future cron / CI hook can consume the
+  JSON output via ``--json``.
+
+Usage
+-----
+
+    python scripts/retrieval_eval.py  # human-readable report
+    python scripts/retrieval_eval.py --json  # machine-readable
+    python scripts/retrieval_eval.py --fixtures path/to/custom.json
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import os
+import sys
+import urllib.error
+import urllib.parse
+import urllib.request
+from dataclasses import dataclass, field
+from pathlib import Path
+
+DEFAULT_BASE_URL = os.environ.get("ATOCORE_BASE_URL", "http://dalidou:8100")
+DEFAULT_TIMEOUT = int(os.environ.get("ATOCORE_TIMEOUT_SECONDS", "30"))
+DEFAULT_BUDGET = 3000
+DEFAULT_FIXTURES = Path(__file__).parent / "retrieval_eval_fixtures.json"
+
+
+@dataclass
+class Fixture:
+    name: str
+    project: str
+    prompt: str
+    budget: int = DEFAULT_BUDGET
+    expect_present: list[str] = field(default_factory=list)
+    expect_absent: list[str] = field(default_factory=list)
+    notes: str = ""
+
+
+@dataclass
+class FixtureResult:
+    fixture: Fixture
+    ok: bool
+    missing_present: list[str]
+    unexpected_absent: list[str]
+    total_chars: int
+    error: str = ""
+
+
+def load_fixtures(path: Path) -> list[Fixture]:
+    data = json.loads(path.read_text(encoding="utf-8"))
+    if not isinstance(data, list):
+        raise ValueError(f"{path} must contain a JSON array of fixtures")
+    fixtures: list[Fixture] = []
+    for i, raw in enumerate(data):
+        if not isinstance(raw, dict):
+            raise ValueError(f"fixture {i} is not an object")
+        fixtures.append(
+            Fixture(
+                name=raw["name"],
+                project=raw.get("project", ""),
+                prompt=raw["prompt"],
+                budget=int(raw.get("budget", DEFAULT_BUDGET)),
+                expect_present=list(raw.get("expect_present", [])),
+                expect_absent=list(raw.get("expect_absent", [])),
+                notes=raw.get("notes", ""),
+            )
+        )
+    return fixtures
+
+
+def run_fixture(fixture: Fixture, base_url: str, timeout: int) -> FixtureResult:
+    payload = {
+        "prompt": fixture.prompt,
+        "project": fixture.project or None,
+        "budget": fixture.budget,
+    }
+    req = urllib.request.Request(
+        url=f"{base_url}/context/build",
+        method="POST",
+        headers={"Content-Type": "application/json"},
+        data=json.dumps(payload).encode("utf-8"),
+    )
+    try:
+        with urllib.request.urlopen(req, timeout=timeout) as resp:
+            body = json.loads(resp.read().decode("utf-8"))
+    except urllib.error.URLError as exc:
+        return FixtureResult(
+            fixture=fixture,
+            ok=False,
+            missing_present=list(fixture.expect_present),
+            unexpected_absent=[],
+            total_chars=0,
+            error=f"http_error: {exc}",
+        )
+
+    formatted = body.get("formatted_context") or ""
+    missing = [s for s in fixture.expect_present if s not in formatted]
+    unexpected = [s for s in fixture.expect_absent if s in formatted]
+    return FixtureResult(
+        fixture=fixture,
+        ok=not missing and not unexpected,
+        missing_present=missing,
+        unexpected_absent=unexpected,
+        total_chars=len(formatted),
+    )
+
+
+def print_human_report(results: list[FixtureResult]) -> None:
+    total = len(results)
+    passed = sum(1 for r in results if r.ok)
+    print(f"Retrieval eval: {passed}/{total} fixtures passed")
+    print()
+    for r in results:
+        marker = "PASS" if r.ok else "FAIL"
+        print(f"[{marker}] {r.fixture.name} project={r.fixture.project} chars={r.total_chars}")
+        if r.error:
+            print(f"  error: {r.error}")
+        for miss in r.missing_present:
+            print(f"  missing expected: {miss!r}")
+        for bleed in r.unexpected_absent:
+            print(f"  unexpected present: {bleed!r}")
+        if r.fixture.notes and not r.ok:
+            print(f"  notes: {r.fixture.notes}")
+
+
+def print_json_report(results: list[FixtureResult]) -> None:
+    payload = {
+        "total": len(results),
+        "passed": sum(1 for r in results if r.ok),
+        "fixtures": [
+            {
+                "name": r.fixture.name,
+                "project": r.fixture.project,
+                "ok": r.ok,
+                "total_chars": r.total_chars,
+                "missing_present": r.missing_present,
+                "unexpected_absent": r.unexpected_absent,
+                "error": r.error,
+            }
+            for r in results
+        ],
+    }
+    json.dump(payload, sys.stdout, indent=2)
+    sys.stdout.write("\n")
+
+
+def main() -> int:
+    parser = argparse.ArgumentParser(description="AtoCore retrieval quality eval harness")
+    parser.add_argument("--base-url", default=DEFAULT_BASE_URL)
+    parser.add_argument("--timeout", type=int, default=DEFAULT_TIMEOUT)
+    parser.add_argument("--fixtures", type=Path, default=DEFAULT_FIXTURES)
+    parser.add_argument("--json", action="store_true", help="emit machine-readable JSON")
+    args = parser.parse_args()
+
+    fixtures = load_fixtures(args.fixtures)
+    results = [run_fixture(f, args.base_url, args.timeout) for f in fixtures]
+
+    if args.json:
+        print_json_report(results)
+    else:
+        print_human_report(results)
+
+    return 0 if all(r.ok for r in results) else 1
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
85
scripts/retrieval_eval_fixtures.json
Normal file
@@ -0,0 +1,85 @@
+[
+  {
+    "name": "p04-architecture-decision",
+    "project": "p04-gigabit",
+    "prompt": "what mirror architecture was selected for GigaBIT M1 and why",
+    "expect_present": [
+      "--- Trusted Project State ---",
+      "Option B",
+      "conical",
+      "--- Project Memories ---"
+    ],
+    "expect_absent": [
+      "p06-polisher",
+      "folded-beam"
+    ],
+    "notes": "Canonical p04 decision — should surface both Trusted Project State (selected_mirror_architecture) and the project-memory band with the Option B memory"
+  },
+  {
+    "name": "p04-constraints",
+    "project": "p04-gigabit",
+    "prompt": "what are the key GigaBIT M1 program constraints",
+    "expect_present": [
+      "--- Trusted Project State ---",
+      "Zerodur",
+      "1.2"
+    ],
+    "expect_absent": [
+      "polisher suite"
+    ],
+    "notes": "Key constraints are in Trusted Project State (key_constraints) and in the mission-framing memory"
+  },
+  {
+    "name": "p05-configuration",
+    "project": "p05-interferometer",
+    "prompt": "what is the selected interferometer configuration",
+    "expect_present": [
+      "folded-beam",
+      "CGH"
+    ],
+    "expect_absent": [
+      "p04-gigabit",
+      "GigaBIT M1"
+    ],
+    "notes": "P05 architecture memory covers folded-beam + CGH; should not bleed p04"
+  },
+  {
+    "name": "p05-vendor-signal",
+    "project": "p05-interferometer",
+    "prompt": "what is the current vendor signal for the interferometer procurement",
+    "expect_present": [
+      "4D",
+      "Zygo"
+    ],
+    "expect_absent": [
+      "polisher"
+    ],
+    "notes": "Vendor memory mentions 4D as strongest technical candidate and Zygo Verifire SV as value path"
+  },
+  {
+    "name": "p06-suite-split",
+    "project": "p06-polisher",
+    "prompt": "how is the polisher software suite split across layers",
+    "expect_present": [
+      "polisher-sim",
+      "polisher-post",
+      "polisher-control"
+    ],
+    "expect_absent": [
+      "GigaBIT"
+    ],
+    "notes": "The three-layer split is in multiple p06 memories; check all three names surface together"
+  },
+  {
+    "name": "p06-control-rule",
+    "project": "p06-polisher",
+    "prompt": "what is the polisher control design rule",
+    "expect_present": [
+      "interlocks"
+    ],
+    "expect_absent": [
+      "interferometer"
+    ],
+    "notes": "Control design rule memory mentions interlocks and state transitions"
+  }
+]
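
Because the fixtures are hand-edited JSON and scored purely by substring checks, a small sanity pass before a run could catch entries that can never pass. The sketch below is one possible shape for that check; `lint_fixtures` and the sample entries are hypothetical and not part of this commit.

```python
from __future__ import annotations


def lint_fixtures(fixtures: list[dict]) -> list[str]:
    """Return human-readable problems found in a list of fixture dicts."""
    problems: list[str] = []
    seen: set[str] = set()
    for i, fx in enumerate(fixtures):
        name = fx.get("name", f"<fixture {i}>")
        if name in seen:
            problems.append(f"{name}: duplicate fixture name")
        seen.add(name)
        present = fx.get("expect_present", [])
        absent = fx.get("expect_absent", [])
        if not present and not absent:
            problems.append(f"{name}: no expectations, passes whenever the request succeeds")
        for a in absent:
            for p in present:
                # If a forbidden string is contained in a required string,
                # satisfying the requirement always triggers the prohibition.
                if a in p:
                    problems.append(f"{name}: absent {a!r} is inside present {p!r}")
    return problems


# Hypothetical fixtures: the first is consistent, the second contradicts itself.
sample = [
    {"name": "ok-case", "expect_present": ["polisher-sim"], "expect_absent": ["GigaBIT"]},
    {"name": "bad-case", "expect_present": ["polisher-sim"], "expect_absent": ["polisher"]},
]
problems = lint_fixtures(sample)
for line in problems:
    print(line)
```

The containment check matters because the harness uses plain `in` tests: a forbidden substring that sits inside a required one makes the fixture unsatisfiable regardless of what the server returns.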