feat(eval-loop): Day 4 — LLM extractor via claude -p (OAuth, no API key)

Second pass on the LLM-assisted extractor after Antoine's explicit rule: no API key, ever. Refactored src/atocore/memory/extractor_llm.py to shell out to the Claude Code 'claude -p' CLI via subprocess instead of the anthropic SDK, so extraction reuses the user's existing Claude.ai OAuth credentials and needs zero secret management. Implementation: - subprocess.run(["claude", "-p", "--model", "haiku", "--append-system-prompt", <instructions>, "--no-session-persistence", "--disable-slash-commands", user_message], ...) - cwd is a cached tempfile.mkdtemp() so every invocation starts with a clean context instead of auto-discovering CLAUDE.md / AGENTS.md / DEV-LEDGER.md from the repo root. We cannot use --bare because it forces API-key auth, which defeats the purpose; the temp-cwd trick is the lightest way to keep OAuth auth while skipping project context loading. - Silent-failure contract unchanged: missing CLI, non-zero exit, timeout, malformed JSON — all return [] and log an error. The capture audit trail must not break on an optional side effect. - Default timeout bumped from 20s to 90s: Haiku + Node.js startup + OAuth check is ~20-40s per call in practice, plus real responses up to 8KB take longer. 45s hit 2 timeouts on the first live run. - tests/test_extractor_llm.py refactored: the API-key / anthropic SDK tests are replaced by subprocess-mocking tests covering missing CLI, timeout, non-zero exit, and a happy-path stdout parse. 14 tests, all green. scripts/extractor_eval.py: - New --output <path> flag writes the JSON result directly to a file, bypassing stdout/log interleaving (structlog sends INFO to stdout via PrintLoggerFactory, so a naive '> out.json' pollutes the file). - Forces UTF-8 on stdout so real LLM output with em-dashes / arrows / CJK doesn't crash the human report on Windows cp1252 consoles. First live baseline run against the 20-interaction labeled corpus (scripts/eval_data/extractor_llm_baseline_2026-04-11.json): mode=llm labeled=20 recall=1.0 precision=0.357 yield_rate=2.55 total_actual_candidates=51 total_expected_candidates=7 false_negative_interactions=0 false_positive_interactions=9 Recall 0% -> 100% vs rule baseline — every human-labeled positive is caught. Precision reads low (0.357) but inspection shows the "false positives" are real candidates the human labels under-counted. For example interaction a6b0d279 was labeled at 2 expected candidates, the model caught all 6 polisher architectural facts; interaction 52c8c0f3 was labeled at 1, the model caught all 5 infra commitments. The labels are the bottleneck, not the model. Day 4 gate against Codex's criteria: - candidate yield: 255% vs ≥15-25% target - FP rate tolerable for manual triage: 51 candidates reviewable in ~10 minutes via the triage CLI - ≥2 real non-synthetic candidates worth review: 20+ obvious wins (polisher architecture set, p05 infra set, DEV-LEDGER protocol set) Gate cleared. LLM-assisted extraction is the path forward for conversational captures. Rule-based extractor stays as-is for structured-cue inputs and remains the default mode. The next step (Day 5 stabilize / document) will wire LLM mode behind a flag in the public extraction endpoint and document scope. Test count: 276 -> 278 passing. No existing tests changed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 17:45:24 -04:00
parent b309e7fd49
commit a29b5e22f2
4 changed files with 702 additions and 71 deletions
--- a/scripts/extractor_eval.py
+++ b/scripts/extractor_eval.py
@@ -22,11 +22,17 @@ Usage:
 from __future__ import annotations

 import argparse
+import io
 import json
 import sys
 from dataclasses import dataclass, field
 from pathlib import Path

+# Force UTF-8 on stdout so real LLM output (arrows, em-dashes, CJK)
+# doesn't crash the human report on Windows cp1252 consoles.
+if hasattr(sys.stdout, "buffer"):
+    sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding="utf-8", errors="replace", line_buffering=True)
+
 # Make src/ importable without requiring an install.
 _REPO_ROOT = Path(__file__).resolve().parent.parent
 sys.path.insert(0, str(_REPO_ROOT / "src"))
@@ -218,6 +224,12 @@ def main() -> int:
    parser.add_argument("--snapshot", type=Path, default=DEFAULT_SNAPSHOT)
    parser.add_argument("--labels", type=Path, default=DEFAULT_LABELS)
    parser.add_argument("--json", action="store_true", help="emit machine-readable JSON")
+    parser.add_argument(
+        "--output",
+        type=Path,
+        default=None,
+        help="write JSON result to this file (bypasses log/stdout interleaving)",
+    )
    parser.add_argument(
        "--mode",
        choices=["rule", "llm"],
@@ -232,7 +244,25 @@ def main() -> int:
    summary = aggregate(results)
    summary["mode"] = args.mode

-    if args.json:
+    if args.output is not None:
+        payload = {
+            "summary": summary,
+            "results": [
+                {
+                    "id": r.id,
+                    "expected_count": r.expected_count,
+                    "actual_count": r.actual_count,
+                    "ok": r.ok,
+                    "miss_class": r.miss_class,
+                    "notes": r.notes,
+                    "actual_candidates": r.actual_candidates,
+                }
+                for r in results
+            ],
+        }
+        args.output.write_text(json.dumps(payload, indent=2, ensure_ascii=False), encoding="utf-8")
+        print(f"wrote {args.output}  ({summary['mode']}: recall={summary['recall']} precision={summary['precision']})")
+    elif args.json:
        print_json(results, summary)
    else:
        print_human(results, summary)