diff --git a/docs/master-plan-status.md b/docs/master-plan-status.md
index c51e84e..c41e308 100644
--- a/docs/master-plan-status.md
+++ b/docs/master-plan-status.md
@@ -32,7 +32,18 @@ read-only additive mode.
 ### Baseline Complete
 
 - Phase 9 - Reflection (all three foundation commits landed:
-  A capture, B reinforcement, C candidate extraction + review queue)
+  A capture, B reinforcement, C candidate extraction + review queue).
+  As of 2026-04-11 the capture → reinforce half runs automatically on
+  every Stop-hook capture (length-aware token-overlap matcher handles
+  paragraph-length memories), and project-scoped memories now reach
+  the context pack via a dedicated `--- Project Memories ---` band
+  between identity/preference and retrieved chunks. The extract half
+  is still a manual / batch flow by design (`scripts/atocore_client.py
+  batch-extract` + `triage`). First live batch-extract run over 42
+  captured interactions produced 1 candidate (rule extractor is
+  conservative and keys on structural cues like `## Decision:`
+  headings that rarely appear in conversational LLM responses) —
+  extractor tuning is a known follow-up.
 
 ### Not Yet Complete In The Intended Sense
 
@@ -167,7 +178,9 @@ These remain intentionally deferred.
 
 - automatic write-back from OpenClaw into AtoCore
 - automatic memory promotion
-- reflection loop integration
+- ~~reflection loop integration~~ — baseline now in (capture→reinforce
+  auto, extract batch/manual). Extractor tuning and scheduled batch
+  extraction still open.
 - replacing OpenClaw's own memory system
 - live machine-DB sync between machines
 - full ontology / graph expansion before the current baseline is stable
diff --git a/docs/next-steps.md b/docs/next-steps.md
index 784941d..e83fff7 100644
--- a/docs/next-steps.md
+++ b/docs/next-steps.md
@@ -137,7 +137,12 @@ P06:
 
 - automatic write-back from OpenClaw into AtoCore
 - automatic memory promotion
-- reflection loop integration
+- ~~reflection loop integration~~ — baseline now landed (2026-04-11):
+  Stop hook runs reinforce automatically, project memories are folded
+  into the context pack, batch-extract and triage CLIs exist. What
+  remains deferred: scheduled/automatic batch extraction and extractor
+  rule tuning (rule-based extractor produced 1 candidate from 42 real
+  captures — needs new cues for conversational LLM content).
 - replacing OpenClaw's own memory system
 - syncing the live machine DB between machines
 
@@ -190,12 +195,45 @@ Findings:
 
 Proposed follow-ups (not yet scheduled):
 
-1. Decide whether memories should be folded into `formatted_context`
-   and under what section header. Candidate: a "--- Project Memories ---"
-   band between Trusted Project State and Retrieved Context, filtered
-   to active memories for the target project plus identity/preference.
+1. ~~Decide whether memories should be folded into `formatted_context`
+   and under what section header.~~ DONE 2026-04-11 (commits 8ea53f4,
+   5913da5, 1161645). A `--- Project Memories ---` band now sits
+   between identity/preference and retrieved chunks, gated on a
+   canonical project hint to prevent cross-project bleed. Budget
+   ratio 0.25 (tuned empirically — paragraph memories are ~400 chars
+   and earlier 0.15 ratio starved the first entry by one char).
+   Verified live: p04 architecture query surfaces the Option B memory.
 2. Re-run the same three queries after any builder change and compare
-   `formatted_context` diffs.
+   `formatted_context` diffs — still open, and is the natural entry
+   point for the retrieval eval harness on the roadmap.
+
+## Reflection Loop Live Check — 2026-04-11
+
+First real run of `batch-extract` across 42 captured Claude Code
+interactions on Dalidou produced exactly **1 candidate**, and that
+candidate was a synthetic test capture from earlier in the session
+(rejected). Finding:
+
+- The rule-based extractor in `src/atocore/memory/extractor.py` keys
+  on explicit structural cues (decision headings like
+  `## Decision: ...`, preference sentences, etc.). Real Claude Code
+  responses are conversational and almost never contain those cues.
+- This means the capture → extract half of the reflection loop is
+  effectively inert against organic LLM sessions until either the
+  rules are broadened (new cue families: "we chose X because...",
+  "the selected approach is...", etc.) or an LLM-assisted extraction
+  path is added alongside the rule-based one.
+- Capture → reinforce is working correctly on live data (length-aware
+  matcher verified on live paraphrase of a p04 memory).
+
+Follow-up candidates (not yet scheduled):
+
+1. Extractor rule expansion — add conversational-form rules so real
+   session text has a chance of surfacing candidates.
+2. LLM-assisted extractor as a separate rule family, guarded by
+   confidence and always landing in `status=candidate` (never active).
+3. Retrieval eval harness — diffable scorecard of
+   `formatted_context` across a fixed question set per active project.
 
 ## Long-Run Goal
 
diff --git a/scripts/retrieval_eval.py b/scripts/retrieval_eval.py
new file mode 100644
index 0000000..b067da8
--- /dev/null
+++ b/scripts/retrieval_eval.py
@@ -0,0 +1,194 @@
+"""Retrieval quality eval harness.
+
+Runs a fixed set of project-hinted questions against
+``POST /context/build`` on a live AtoCore instance and scores the
+resulting ``formatted_context`` against per-question expectations.
+The goal is a diffable scorecard that tells you, run-to-run,
+whether a retrieval / builder / ingestion change moved the needle.
+
+Design notes
+------------
+- Fixtures live in ``scripts/retrieval_eval_fixtures.json`` so new
+  questions can be added without touching Python. Each fixture
+  names the project, the prompt, and a checklist of substrings that
+  MUST appear in ``formatted_context`` (``expect_present``) and
+  substrings that MUST NOT appear (``expect_absent``). The absent
+  list catches cross-project bleed and stale content.
+- The checklist is deliberately substring-based (not regex, not
+  embedding-similarity) so a failure is always a trivially
+  reproducible "this string is not in that string". Richer scoring
+  can come later once we know the harness is useful.
+- The harness is external to the app runtime and talks to AtoCore
+  over HTTP, so it works against dev, staging, or prod. It follows
+  the same environment-variable contract as ``atocore_client.py``
+  (``ATOCORE_BASE_URL``, ``ATOCORE_TIMEOUT_SECONDS``).
+- Exit code 0 on all-pass, 1 on any fixture failure. Intended for
+  manual runs today; a future cron / CI hook can consume the
+  JSON output via ``--json``.
+
+Usage
+-----
+
+    python scripts/retrieval_eval.py                  # human-readable report
+    python scripts/retrieval_eval.py --json           # machine-readable
+    python scripts/retrieval_eval.py --fixtures path/to/custom.json
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import os
+import sys
+import urllib.error
+import urllib.parse
+import urllib.request
+from dataclasses import dataclass, field
+from pathlib import Path
+
+DEFAULT_BASE_URL = os.environ.get("ATOCORE_BASE_URL", "http://dalidou:8100")
+DEFAULT_TIMEOUT = int(os.environ.get("ATOCORE_TIMEOUT_SECONDS", "30"))
+DEFAULT_BUDGET = 3000
+DEFAULT_FIXTURES = Path(__file__).parent / "retrieval_eval_fixtures.json"
+
+
+@dataclass
+class Fixture:
+    name: str
+    project: str
+    prompt: str
+    budget: int = DEFAULT_BUDGET
+    expect_present: list[str] = field(default_factory=list)
+    expect_absent: list[str] = field(default_factory=list)
+    notes: str = ""
+
+
+@dataclass
+class FixtureResult:
+    fixture: Fixture
+    ok: bool
+    missing_present: list[str]
+    unexpected_absent: list[str]
+    total_chars: int
+    error: str = ""
+
+
+def load_fixtures(path: Path) -> list[Fixture]:
+    data = json.loads(path.read_text(encoding="utf-8"))
+    if not isinstance(data, list):
+        raise ValueError(f"{path} must contain a JSON array of fixtures")
+    fixtures: list[Fixture] = []
+    for i, raw in enumerate(data):
+        if not isinstance(raw, dict):
+            raise ValueError(f"fixture {i} is not an object")
+        fixtures.append(
+            Fixture(
+                name=raw["name"],
+                project=raw.get("project", ""),
+                prompt=raw["prompt"],
+                budget=int(raw.get("budget", DEFAULT_BUDGET)),
+                expect_present=list(raw.get("expect_present", [])),
+                expect_absent=list(raw.get("expect_absent", [])),
+                notes=raw.get("notes", ""),
+            )
+        )
+    return fixtures
+
+
+def run_fixture(fixture: Fixture, base_url: str, timeout: int) -> FixtureResult:
+    payload = {
+        "prompt": fixture.prompt,
+        "project": fixture.project or None,
+        "budget": fixture.budget,
+    }
+    req = urllib.request.Request(
+        url=f"{base_url}/context/build",
+        method="POST",
+        headers={"Content-Type": "application/json"},
+        data=json.dumps(payload).encode("utf-8"),
+    )
+    try:
+        with urllib.request.urlopen(req, timeout=timeout) as resp:
+            body = json.loads(resp.read().decode("utf-8"))
+    except urllib.error.URLError as exc:
+        return FixtureResult(
+            fixture=fixture,
+            ok=False,
+            missing_present=list(fixture.expect_present),
+            unexpected_absent=[],
+            total_chars=0,
+            error=f"http_error: {exc}",
+        )
+
+    formatted = body.get("formatted_context") or ""
+    missing = [s for s in fixture.expect_present if s not in formatted]
+    unexpected = [s for s in fixture.expect_absent if s in formatted]
+    return FixtureResult(
+        fixture=fixture,
+        ok=not missing and not unexpected,
+        missing_present=missing,
+        unexpected_absent=unexpected,
+        total_chars=len(formatted),
+    )
+
+
+def print_human_report(results: list[FixtureResult]) -> None:
+    total = len(results)
+    passed = sum(1 for r in results if r.ok)
+    print(f"Retrieval eval: {passed}/{total} fixtures passed")
+    print()
+    for r in results:
+        marker = "PASS" if r.ok else "FAIL"
+        print(f"[{marker}] {r.fixture.name} project={r.fixture.project} chars={r.total_chars}")
+        if r.error:
+            print(f"    error: {r.error}")
+        for miss in r.missing_present:
+            print(f"    missing expected: {miss!r}")
+        for bleed in r.unexpected_absent:
+            print(f"    unexpected present: {bleed!r}")
+        if r.fixture.notes and not r.ok:
+            print(f"    notes: {r.fixture.notes}")
+
+
+def print_json_report(results: list[FixtureResult]) -> None:
+    payload = {
+        "total": len(results),
+        "passed": sum(1 for r in results if r.ok),
+        "fixtures": [
+            {
+                "name": r.fixture.name,
+                "project": r.fixture.project,
+                "ok": r.ok,
+                "total_chars": r.total_chars,
+                "missing_present": r.missing_present,
+                "unexpected_absent": r.unexpected_absent,
+                "error": r.error,
+            }
+            for r in results
+        ],
+    }
+    json.dump(payload, sys.stdout, indent=2)
+    sys.stdout.write("\n")
+
+
+def main() -> int:
+    parser = argparse.ArgumentParser(description="AtoCore retrieval quality eval harness")
+    parser.add_argument("--base-url", default=DEFAULT_BASE_URL)
+    parser.add_argument("--timeout", type=int, default=DEFAULT_TIMEOUT)
+    parser.add_argument("--fixtures", type=Path, default=DEFAULT_FIXTURES)
+    parser.add_argument("--json", action="store_true", help="emit machine-readable JSON")
+    args = parser.parse_args()
+
+    fixtures = load_fixtures(args.fixtures)
+    results = [run_fixture(f, args.base_url, args.timeout) for f in fixtures]
+
+    if args.json:
+        print_json_report(results)
+    else:
+        print_human_report(results)
+
+    return 0 if all(r.ok for r in results) else 1
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
diff --git a/scripts/retrieval_eval_fixtures.json b/scripts/retrieval_eval_fixtures.json
new file mode 100644
index 0000000..424a847
--- /dev/null
+++ b/scripts/retrieval_eval_fixtures.json
@@ -0,0 +1,85 @@
+[
+  {
+    "name": "p04-architecture-decision",
+    "project": "p04-gigabit",
+    "prompt": "what mirror architecture was selected for GigaBIT M1 and why",
+    "expect_present": [
+      "--- Trusted Project State ---",
+      "Option B",
+      "conical",
+      "--- Project Memories ---"
+    ],
+    "expect_absent": [
+      "p06-polisher",
+      "folded-beam"
+    ],
+    "notes": "Canonical p04 decision — should surface both Trusted Project State (selected_mirror_architecture) and the project-memory band with the Option B memory"
+  },
+  {
+    "name": "p04-constraints",
+    "project": "p04-gigabit",
+    "prompt": "what are the key GigaBIT M1 program constraints",
+    "expect_present": [
+      "--- Trusted Project State ---",
+      "Zerodur",
+      "1.2"
+    ],
+    "expect_absent": [
+      "polisher suite"
+    ],
+    "notes": "Key constraints are in Trusted Project State (key_constraints) and in the mission-framing memory"
+  },
+  {
+    "name": "p05-configuration",
+    "project": "p05-interferometer",
+    "prompt": "what is the selected interferometer configuration",
+    "expect_present": [
+      "folded-beam",
+      "CGH"
+    ],
+    "expect_absent": [
+      "p04-gigabit",
+      "GigaBIT M1"
+    ],
+    "notes": "P05 architecture memory covers folded-beam + CGH; should not bleed p04"
+  },
+  {
+    "name": "p05-vendor-signal",
+    "project": "p05-interferometer",
+    "prompt": "what is the current vendor signal for the interferometer procurement",
+    "expect_present": [
+      "4D",
+      "Zygo"
+    ],
+    "expect_absent": [
+      "polisher"
+    ],
+    "notes": "Vendor memory mentions 4D as strongest technical candidate and Zygo Verifire SV as value path"
+  },
+  {
+    "name": "p06-suite-split",
+    "project": "p06-polisher",
+    "prompt": "how is the polisher software suite split across layers",
+    "expect_present": [
+      "polisher-sim",
+      "polisher-post",
+      "polisher-control"
+    ],
+    "expect_absent": [
+      "GigaBIT"
+    ],
+    "notes": "The three-layer split is in multiple p06 memories; check all three names surface together"
+  },
+  {
+    "name": "p06-control-rule",
+    "project": "p06-polisher",
+    "prompt": "what is the polisher control design rule",
+    "expect_present": [
+      "interlocks"
+    ],
+    "expect_absent": [
+      "interferometer"
+    ],
+    "notes": "Control design rule memory mentions interlocks and state transitions"
+  }
+]
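
Review note: the substring scoring inside `run_fixture` could be exercised without a live AtoCore instance. The sketch below is not part of this diff. `Score` and `score_context` are hypothetical names introduced only for illustration, and the sample context string is invented. It assumes the scoring stays purely substring-based, as the harness docstring describes.

```python
from dataclasses import dataclass


@dataclass
class Score:
    # Hypothetical result type; mirrors the scoring fields of FixtureResult.
    ok: bool
    missing_present: list[str]
    unexpected_absent: list[str]


def score_context(
    formatted: str, expect_present: list[str], expect_absent: list[str]
) -> Score:
    # Same checks as run_fixture: every expected substring must occur,
    # and no forbidden substring may occur.
    missing = [s for s in expect_present if s not in formatted]
    unexpected = [s for s in expect_absent if s in formatted]
    return Score(
        ok=not missing and not unexpected,
        missing_present=missing,
        unexpected_absent=unexpected,
    )


# Invented sample text standing in for a real formatted_context value.
sample = (
    "--- Trusted Project State ---\n"
    "Option B (conical)\n"
    "--- Project Memories ---\n"
)
result = score_context(
    sample, ["Option B", "conical"], ["p06-polisher", "folded-beam"]
)
print(result)
```

Splitting the scoring out this way would let the pass/fail logic be unit-tested offline, while the HTTP call stays in `run_fixture`.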