feat(eval-loop): Day 4 — LLM extractor via claude -p (OAuth, no API key)

Second pass on the LLM-assisted extractor after Antoine's explicit rule: no API key, ever. Refactored src/atocore/memory/extractor_llm.py to shell out to the Claude Code 'claude -p' CLI via subprocess instead of the anthropic SDK, so extraction reuses the user's existing Claude.ai OAuth credentials and needs zero secret management. Implementation: - subprocess.run(["claude", "-p", "--model", "haiku", "--append-system-prompt", <instructions>, "--no-session-persistence", "--disable-slash-commands", user_message], ...) - cwd is a cached tempfile.mkdtemp() so every invocation starts with a clean context instead of auto-discovering CLAUDE.md / AGENTS.md / DEV-LEDGER.md from the repo root. We cannot use --bare because it forces API-key auth, which defeats the purpose; the temp-cwd trick is the lightest way to keep OAuth auth while skipping project context loading. - Silent-failure contract unchanged: missing CLI, non-zero exit, timeout, malformed JSON — all return [] and log an error. The capture audit trail must not break on an optional side effect. - Default timeout bumped from 20s to 90s: Haiku + Node.js startup + OAuth check is ~20-40s per call in practice, plus real responses up to 8KB take longer. 45s hit 2 timeouts on the first live run. - tests/test_extractor_llm.py refactored: the API-key / anthropic SDK tests are replaced by subprocess-mocking tests covering missing CLI, timeout, non-zero exit, and a happy-path stdout parse. 14 tests, all green. scripts/extractor_eval.py: - New --output <path> flag writes the JSON result directly to a file, bypassing stdout/log interleaving (structlog sends INFO to stdout via PrintLoggerFactory, so a naive '> out.json' pollutes the file). - Forces UTF-8 on stdout so real LLM output with em-dashes / arrows / CJK doesn't crash the human report on Windows cp1252 consoles. First live baseline run against the 20-interaction labeled corpus (scripts/eval_data/extractor_llm_baseline_2026-04-11.json): mode=llm labeled=20 recall=1.0 precision=0.357 yield_rate=2.55 total_actual_candidates=51 total_expected_candidates=7 false_negative_interactions=0 false_positive_interactions=9 Recall 0% -> 100% vs rule baseline — every human-labeled positive is caught. Precision reads low (0.357) but inspection shows the "false positives" are real candidates the human labels under-counted. For example interaction a6b0d279 was labeled at 2 expected candidates, the model caught all 6 polisher architectural facts; interaction 52c8c0f3 was labeled at 1, the model caught all 5 infra commitments. The labels are the bottleneck, not the model. Day 4 gate against Codex's criteria: - candidate yield: 255% vs ≥15-25% target - FP rate tolerable for manual triage: 51 candidates reviewable in ~10 minutes via the triage CLI - ≥2 real non-synthetic candidates worth review: 20+ obvious wins (polisher architecture set, p05 infra set, DEV-LEDGER protocol set) Gate cleared. LLM-assisted extraction is the path forward for conversational captures. Rule-based extractor stays as-is for structured-cue inputs and remains the default mode. The next step (Day 5 stabilize / document) will wire LLM mode behind a flag in the public extraction endpoint and document scope. Test count: 276 -> 278 passing. No existing tests changed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 17:45:24 -04:00
parent b309e7fd49
commit a29b5e22f2
4 changed files with 702 additions and 71 deletions
--- a/scripts/eval_data/extractor_llm_baseline_2026-04-11.json
+++ b/scripts/eval_data/extractor_llm_baseline_2026-04-11.json
@@ -0,0 +1,518 @@
+{
+  "summary": {
+    "total": 20,
+    "exact_match": 6,
+    "positive_expected": 5,
+    "total_expected_candidates": 7,
+    "total_actual_candidates": 51,
+    "yield_rate": 2.55,
+    "recall": 1.0,
+    "precision": 0.357,
+    "false_positive_interactions": 9,
+    "false_negative_interactions": 0,
+    "miss_classes": {},
+    "mode": "llm"
+  },
+  "results": [
+    {
+      "id": "ab239158-d6ac-4c51-b6e4-dd4ccea384a2",
+      "expected_count": 0,
+      "actual_count": 1,
+      "ok": false,
+      "miss_class": "n/a",
+      "notes": "Instructional deploy guidance. No durable claim.",
+      "actual_candidates": [
+        {
+          "memory_type": "knowledge",
+          "content": "AtoCore deployments to dalidou use the script /srv/storage/atocore/app/deploy/dalidou/deploy.sh instead of manual docker commands",
+          "project": "atocore",
+          "rule": "llm_extraction"
+        }
+      ]
+    },
+    {
+      "id": "da153f2a-b20a-4dee-8c72-431ebb71f08c",
+      "expected_count": 0,
+      "actual_count": 0,
+      "ok": true,
+      "miss_class": "n/a",
+      "notes": "'Deploy still in progress.' Pure status.",
+      "actual_candidates": []
+    },
+    {
+      "id": "7d8371ee-c6d3-4dfe-a7b0-2d091f075c15",
+      "expected_count": 0,
+      "actual_count": 0,
+      "ok": true,
+      "miss_class": "n/a",
+      "notes": "Git command walkthrough. No durable claim.",
+      "actual_candidates": []
+    },
+    {
+      "id": "14bf3f90-e318-466e-81ac-d35522741ba5",
+      "expected_count": 0,
+      "actual_count": 4,
+      "ok": false,
+      "miss_class": "n/a",
+      "notes": "Ledger status update. Transient fact, not a durable memory candidate.",
+      "actual_candidates": [
+        {
+          "memory_type": "project",
+          "content": "Retrieval/extraction evaluation follows 8-day mini-phase plan with hard gates to prevent scope drift. Preflight checks must validate git SHAs, baselines, and fixture stability before coding.",
+          "project": "",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "Day 1: Create labeled extractor eval set from 30 captures (10 zero-candidate, 10 single-candidate, 10 ambiguous) with metadata; create scoring tool to measure precision/recall.",
+          "project": "",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "Day 2: Measure current extractor against labeled set, recording yield, true/false positives, and false negatives by pattern.",
+          "project": "",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "Session Log/Ledger system tracks work state across sessions so future sessions immediately know what is true and what is next; phases marked by git SHAs.",
+          "project": "",
+          "rule": "llm_extraction"
+        }
+      ]
+    },
+    {
+      "id": "8f855235-c38d-4c27-9f2b-8530ebe1a2d8",
+      "expected_count": 0,
+      "actual_count": 0,
+      "ok": true,
+      "miss_class": "n/a",
+      "notes": "Short-term recommendation ('merge to main and deploy'), not a standing decision.",
+      "actual_candidates": []
+    },
+    {
+      "id": "04a96eb5-cd00-4e9f-9252-b2cc919000a4",
+      "expected_count": 0,
+      "actual_count": 0,
+      "ok": true,
+      "miss_class": "n/a",
+      "notes": "Dev server config table. Operational detail, not a memory.",
+      "actual_candidates": []
+    },
+    {
+      "id": "79d606ed-8981-454a-83af-c25226b1b65c",
+      "expected_count": 1,
+      "actual_count": 3,
+      "ok": false,
+      "miss_class": "recommendation_prose",
+      "notes": "A recommendation that later became a ratified decision. Rule extractor would need a 'simplest version that could work today' / 'I'd start with' cue class.",
+      "actual_candidates": [
+        {
+          "memory_type": "project",
+          "content": "atocore uses multi-model coordination: Claude and codex share DEV-LEDGER.md (current state / active plan / P1+P2 findings / recent decisions / commit log) read at session start, appended at session end",
+          "project": "atocore",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "atocore starts with manual-event-loop (/audit or /status prompts) using DEV-LEDGER.md before upgrading to automated git hooks/CI review",
+          "project": "atocore",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "atocore development involves coordinating between Claude and codex models with shared plan/review strategy and counter-validation to improve system quality",
+          "project": "atocore",
+          "rule": "llm_extraction"
+        }
+      ]
+    },
+    {
+      "id": "a6b0d279-c564-4bce-a703-e476f4a148ad",
+      "expected_count": 2,
+      "actual_count": 6,
+      "ok": false,
+      "miss_class": "architectural_change_summary",
+      "notes": "Two durable architectural facts about the polisher machine (Z-axis is engage/retract, cam is read-only). Extractor would need to recognize 'A is now B' / 'X removed, Y added' patterns.",
+      "actual_candidates": [
+        {
+          "memory_type": "project",
+          "content": "Z-axis is a binary engage/retract mechanism (z_engaged bool), not continuous position control; confirmation timeout z_engage_timeout_s required.",
+          "project": "p06-polisher",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "Cam amplitude and offset are mechanically set by operator and read via encoders; no actuators control them, controller receives encoder telemetry only.",
+          "project": "p06-polisher",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "Cam parameters in controller are expected_cam_amplitude_deg and expected_cam_offset_deg (read-only reference for verification), not command setpoints.",
+          "project": "p06-polisher",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "Manual mode UI displays cam encoder readings (cam_amplitude_deg, cam_offset_deg) as read-only for operator verification of mechanical setting.",
+          "project": "p06-polisher",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "Manual session log records cam_setting measured at session start; run-log segment actual block includes cam_amplitude_deg_mean and cam_offset_deg_mean.",
+          "project": "p06-polisher",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "Machine capabilities now define z_type: engage_retract and cam_type: mechanical_with_encoder instead of actuator-driven setpoints.",
+          "project": "p06-polisher",
+          "rule": "llm_extraction"
+        }
+      ]
+    },
+    {
+      "id": "4e00e398-2e89-4653-8ee5-3f65c7f4d2d3",
+      "expected_count": 0,
+      "actual_count": 0,
+      "ok": true,
+      "miss_class": "n/a",
+      "notes": "Clarification question to user.",
+      "actual_candidates": []
+    },
+    {
+      "id": "a6a7816a-7590-4616-84f4-49d9054c2a91",
+      "expected_count": 0,
+      "actual_count": 3,
+      "ok": false,
+      "miss_class": "n/a",
+      "notes": "Instructional response offering two next moves.",
+      "actual_candidates": [
+        {
+          "memory_type": "knowledge",
+          "content": "Codex is an audit agent; communicate with it via markdown prompts with numbered steps; it updates findings via commits to codex/* branches or direct messages.",
+          "project": "",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "preference",
+          "content": "Audit-first workflow recommended: have codex audit DEV-LEDGER.md and recent commits before execution; validates round-trip, catches errors early.",
+          "project": "",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "knowledge",
+          "content": "DEV-LEDGER.md at repo root is the shared coordination document with Orientation, Active Plan, and Open Review Findings sections.",
+          "project": "",
+          "rule": "llm_extraction"
+        }
+      ]
+    },
+    {
+      "id": "03527502-316a-4a3e-989c-00719392c7d1",
+      "expected_count": 0,
+      "actual_count": 0,
+      "ok": true,
+      "miss_class": "n/a",
+      "notes": "Troubleshooting a paste failure. Ephemeral.",
+      "actual_candidates": []
+    },
+    {
+      "id": "1fff59fc-545f-42df-9dd1-a0e6dec1b7ee",
+      "expected_count": 0,
+      "actual_count": 3,
+      "ok": false,
+      "miss_class": "n/a",
+      "notes": "Agreement + follow-up question. No durable claim.",
+      "actual_candidates": [
+        {
+          "memory_type": "project",
+          "content": "Roadmap: Extractor improvement → Harness expansion → Wave 2 trusted operational ingestion → Finish OpenClaw integration (in that order)",
+          "project": "atocore",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "Phase 1 (Extractor): eval-driven loop—label captures, improve rules/add LLM mode, measure yield & FP, stop when queue reviewable (not coverage metrics)",
+          "project": "atocore",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "Phases 1 & 2 (Extractor + Harness) are a mini-phase; without harness, extractor improvements are blind edits",
+          "project": "atocore",
+          "rule": "llm_extraction"
+        }
+      ]
+    },
+    {
+      "id": "eb65dc18-0030-4720-ace7-f55af9df719d",
+      "expected_count": 0,
+      "actual_count": 2,
+      "ok": false,
+      "miss_class": "n/a",
+      "notes": "Explanation of how the capture hook works. Instructional.",
+      "actual_candidates": [
+        {
+          "memory_type": "knowledge",
+          "content": "Dalidou stores Claude Code interactions via a Stop hook that fires after each turn and POSTs to http://dalidou:8100/interactions with client=claude-code parameter",
+          "project": "",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "adaptation",
+          "content": "Interaction capture system is passive and automatic; no manual action required, interactions accumulate automatically during normal Claude Code usage",
+          "project": "",
+          "rule": "llm_extraction"
+        }
+      ]
+    },
+    {
+      "id": "52c8c0f3-32fb-4b48-9065-73c778a08417",
+      "expected_count": 1,
+      "actual_count": 5,
+      "ok": false,
+      "miss_class": "spec_update_announcement",
+      "notes": "Concrete architectural commitments just added to the polisher spec. Phrased as '§17.1 Local Storage - USB SSD mandatory, not SD card.' The '§' section markers could be a new cue.",
+      "actual_candidates": [
+        {
+          "memory_type": "project",
+          "content": "USB SSD mandatory for storage (not SD card); directory structure /data/runs/{id}/, /data/manual/{id}/; status.json for machine state",
+          "project": "",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "RPi joins Tailscale mesh for remote access over SSH VPN; no public IP or port forwarding; fully offline operation",
+          "project": "",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "Data synchronization via rsync over Tailscale, failure-tolerant and non-blocking; USB stick as manual fallback",
+          "project": "",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "Machine design principle: works fully offline and independently; network connection is for remote access only",
+          "project": "",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "No cloud, no real-time streaming, no remote control features in design scope",
+          "project": "",
+          "rule": "llm_extraction"
+        }
+      ]
+    },
+    {
+      "id": "32d40414-15af-47ee-944b-2cceae9574b8",
+      "expected_count": 0,
+      "actual_count": 5,
+      "ok": false,
+      "miss_class": "n/a",
+      "notes": "Session recap. Historical summary, not a durable memory.",
+      "actual_candidates": [
+        {
+          "memory_type": "project",
+          "content": "P1: Reflection loop integration incomplete—extraction remains manual (POST /interactions/{id}/extract), not auto-triggered with reinforcement. Live capture won't auto-populate candidate review queue.",
+          "project": "atocore",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "P1: Project memories excluded from context injection; build_context() requests [\"identity\", \"preference\"] only. Reinforcement signal doesn't reach assembled context packs.",
+          "project": "atocore",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "Current batch-extract rules produce only 1 candidate from 42 real captures. Extractor needs conversational-cue detection or LLM-assisted path to improve yield.",
+          "project": "atocore",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "Next priority: extractor rule expansion (cheapest validation of reflection loop), then Wave 2 trusted operational ingestion (master-plan priority). Defer retrieval eval harness focus.",
+          "project": "atocore",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "knowledge",
+          "content": "Alias canonicalization fix (resolve_project_name() boundary) is consistently applied across project state, memories, interactions, and context lookup. Code review approved directionally.",
+          "project": "atocore",
+          "rule": "llm_extraction"
+        }
+      ]
+    },
+    {
+      "id": "b6d2cdfc-37fb-459a-96bd-caefb9beaab4",
+      "expected_count": 0,
+      "actual_count": 1,
+      "ok": false,
+      "miss_class": "n/a",
+      "notes": "Deployment prompt for Dalidou. Operational, not a memory.",
+      "actual_candidates": [
+        {
+          "memory_type": "preference",
+          "content": "User prefers receiving standalone testing prompts they can paste into Claude Code on target deployments rather than having the assistant run tests directly.",
+          "project": "atocore",
+          "rule": "llm_extraction"
+        }
+      ]
+    },
+    {
+      "id": "ee03d823-931b-4d4e-9258-88b4ed5eeb07",
+      "expected_count": 2,
+      "actual_count": 5,
+      "ok": false,
+      "miss_class": "layered_recommendation",
+      "notes": "Layered infra recommendation with 'non-negotiable' / 'strongly recommended' strength markers. The 'non-negotiable' token could be a new cue class.",
+      "actual_candidates": [
+        {
+          "memory_type": "project",
+          "content": "USB SSD on RPi is mandatory for polishing telemetry storage; must be independent of network for data integrity during runs.",
+          "project": "p06-polisher",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "Use Tailscale mesh for RPi remote access to provide SSH, file transfer, and NAT traversal without port forwarding.",
+          "project": "p06-polisher",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "Auto-sync telemetry data via rsync over Tailscale after runs complete; fire-and-forget pattern with automatic retry on network interruption.",
+          "project": "p06-polisher",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "Real-time telemetry monitoring should target 10 Hz downsampling; full 100 Hz streaming over network is not necessary.",
+          "project": "p06-polisher",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "knowledge",
+          "content": "Polishing telemetry data rate is approximately 29 MB per hour (100 Hz × 20 channels × 4 bytes = 8 KB/s).",
+          "project": "p06-polisher",
+          "rule": "llm_extraction"
+        }
+      ]
+    },
+    {
+      "id": "dd234d9f-0d1c-47e8-b01c-eebcb568c7e7",
+      "expected_count": 1,
+      "actual_count": 3,
+      "ok": false,
+      "miss_class": "alignment_assertion",
+      "notes": "Architectural invariant assertion. '**Alignment verified**' / 'nothing changes for X' style. Likely too subtle for rule matching without LLM assistance.",
+      "actual_candidates": [
+        {
+          "memory_type": "project",
+          "content": "Machine spec (shareable) + Atomaste spec (internal) separate concerns. Machine spec hides program generation as 'separate scope' to protect IP/business strategy.",
+          "project": "p06-polisher",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "Firmware interface contract is invariant: controller-job.v1 input, run-log.v1 + telemetry output. No firmware changes needed regardless of program generation implementation.",
+          "project": "p06-polisher",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "Atomaste sim spec documents forward/return paths, calibration model (Preston k), translation loss, and service/IP strategy—details hidden from shareable machine spec.",
+          "project": "p06-polisher",
+          "rule": "llm_extraction"
+        }
+      ]
+    },
+    {
+      "id": "1f95891a-cf37-400e-9d68-4fad8e04dcbb",
+      "expected_count": 0,
+      "actual_count": 4,
+      "ok": false,
+      "miss_class": "n/a",
+      "notes": "Huge session handoff prompt. Informational only.",
+      "actual_candidates": [
+        {
+          "memory_type": "knowledge",
+          "content": "AtoCore is FastAPI (Python 3.12, SQLite + ChromaDB) on Dalidou home server (dalidou:8100), repo C:\\Users\\antoi\\ATOCore, data /srv/storage/atocore/, ingests Obsidian vault + Google Drive into vector memory system.",
+          "project": "atocore",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "knowledge",
+          "content": "Deploy AtoCore: git push origin main, then ssh papa@dalidou and run /srv/storage/atocore/app/deploy/dalidou/deploy.sh",
+          "project": "atocore",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "adaptation",
+          "content": "Do not add memory extraction to interaction capture hot path; keep extraction as separate batch/manual step. Reason: latency and queue noise before review rhythm is comfortable.",
+          "project": "atocore",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "As of 2026-04-11, approved roadmap in order: observe reinforcement, batch extraction, candidate triage, off-Dalidou backup, retrieval quality review.",
+          "project": "atocore",
+          "rule": "llm_extraction"
+        }
+      ]
+    },
+    {
+      "id": "5580950f-d010-4544-be4b-b3071271a698",
+      "expected_count": 0,
+      "actual_count": 6,
+      "ok": false,
+      "miss_class": "n/a",
+      "notes": "Ledger schema sketch. Structural design proposal, later ratified — but the same idea was already captured as a ratified decision in the recent decisions section, so not worth re-extracting from this conversational form.",
+      "actual_candidates": [
+        {
+          "memory_type": "project",
+          "content": "AtoCore adopts DEV-LEDGER.md as shared operating memory with stable headers; updated at session boundaries",
+          "project": "atocore",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "adaptation",
+          "content": "Codex branches for AtoCore fork from main (never orphan); use naming pattern codex/<topic>",
+          "project": "atocore",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "adaptation",
+          "content": "In AtoCore, Claude builds and Codex audits; never work in parallel on same files",
+          "project": "atocore",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "adaptation",
+          "content": "In AtoCore, P1-severity findings in DEV-LEDGER.md block further main commits until acknowledged",
+          "project": "atocore",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "adaptation",
+          "content": "Every AtoCore session appends to DEV-LEDGER.md Session Log and updates Orientation before ending",
+          "project": "atocore",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "AtoCore roadmap: (1) extractor improvement, (2) harness expansion, (3) Wave 2 ingestion, (4) OpenClaw finish; steps 1+2 are current mini-phase",
+          "project": "atocore",
+          "rule": "llm_extraction"
+        }
+      ]
+    }
+  ]
+}
--- a/scripts/extractor_eval.py
+++ b/scripts/extractor_eval.py
@@ -22,11 +22,17 @@ Usage:
 from __future__ import annotations

 import argparse
+import io
 import json
 import sys
 from dataclasses import dataclass, field
 from pathlib import Path

+# Force UTF-8 on stdout so real LLM output (arrows, em-dashes, CJK)
+# doesn't crash the human report on Windows cp1252 consoles.
+if hasattr(sys.stdout, "buffer"):
+    sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding="utf-8", errors="replace", line_buffering=True)
+
 # Make src/ importable without requiring an install.
 _REPO_ROOT = Path(__file__).resolve().parent.parent
 sys.path.insert(0, str(_REPO_ROOT / "src"))
@@ -218,6 +224,12 @@ def main() -> int:
    parser.add_argument("--snapshot", type=Path, default=DEFAULT_SNAPSHOT)
    parser.add_argument("--labels", type=Path, default=DEFAULT_LABELS)
    parser.add_argument("--json", action="store_true", help="emit machine-readable JSON")
+    parser.add_argument(
+        "--output",
+        type=Path,
+        default=None,
+        help="write JSON result to this file (bypasses log/stdout interleaving)",
+    )
    parser.add_argument(
        "--mode",
        choices=["rule", "llm"],
@@ -232,7 +244,25 @@ def main() -> int:
    summary = aggregate(results)
    summary["mode"] = args.mode

-    if args.json:
+    if args.output is not None:
+        payload = {
+            "summary": summary,
+            "results": [
+                {
+                    "id": r.id,
+                    "expected_count": r.expected_count,
+                    "actual_count": r.actual_count,
+                    "ok": r.ok,
+                    "miss_class": r.miss_class,
+                    "notes": r.notes,
+                    "actual_candidates": r.actual_candidates,
+                }
+                for r in results
+            ],
+        }
+        args.output.write_text(json.dumps(payload, indent=2, ensure_ascii=False), encoding="utf-8")
+        print(f"wrote {args.output}  ({summary['mode']}: recall={summary['recall']} precision={summary['precision']})")
+    elif args.json:
        print_json(results, summary)
    else:
        print_human(results, summary)