feat(eval-loop): Day 4 — LLM extractor via claude -p (OAuth, no API key)

Second pass on the LLM-assisted extractor after Antoine's explicit
rule: no API key, ever. Refactored src/atocore/memory/extractor_llm.py
to shell out to the Claude Code 'claude -p' CLI via subprocess instead
of the anthropic SDK, so extraction reuses the user's existing
Claude.ai OAuth credentials and needs zero secret management.

Implementation:

- subprocess.run(["claude", "-p", "--model", "haiku",
  "--append-system-prompt", <instructions>,
  "--no-session-persistence", "--disable-slash-commands",
  user_message], ...)
- cwd is a cached tempfile.mkdtemp() so every invocation starts with
  a clean context instead of auto-discovering CLAUDE.md / AGENTS.md /
  DEV-LEDGER.md from the repo root. We cannot use --bare because it
  forces API-key auth, which defeats the purpose; the temp-cwd trick
  is the lightest way to keep OAuth auth while skipping project
  context loading.
- Silent-failure contract unchanged: missing CLI, non-zero exit,
  timeout, malformed JSON all return [] and log an error. The capture
  audit trail must not break on an optional side effect.
- Default timeout bumped from 20s to 90s: Haiku + Node.js startup +
  OAuth check is ~20-40s per call in practice, plus real responses up
  to 8KB take longer. 45s hit 2 timeouts on the first live run.
- tests/test_extractor_llm.py refactored: the API-key / anthropic SDK
  tests are replaced by subprocess-mocking tests covering missing CLI,
  timeout, non-zero exit, and a happy-path stdout parse. 14 tests, all
  green.

scripts/extractor_eval.py:

- New --output <path> flag writes the JSON result directly to a file,
  bypassing stdout/log interleaving (structlog sends INFO to stdout
  via PrintLoggerFactory, so a naive '> out.json' pollutes the file).
- Forces UTF-8 on stdout so real LLM output with em-dashes / arrows /
  CJK doesn't crash the human report on Windows cp1252 consoles.
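
The silent-failure contract and the cached temp-cwd trick described above can be sketched in isolation. This is a minimal illustration, not the module's actual code: `run_silently` and `sandbox_cwd` are hypothetical names, and the error strings only mirror the ones visible in the diff.

```python
import shutil
import subprocess
import sys
import tempfile
from functools import lru_cache


@lru_cache(maxsize=1)
def sandbox_cwd() -> str:
    """Create one empty temp directory per process and reuse it.

    Hypothetical helper: an empty cwd means a CLI that auto-discovers
    project files finds nothing to load.
    """
    return tempfile.mkdtemp(prefix="sketch-extract-")


def run_silently(args: list[str], timeout_s: float = 90.0) -> tuple[str, str]:
    """Run a command and return (stdout, error). Never raises.

    An empty error string means success; every failure path returns a
    short error code instead of propagating an exception.
    """
    # Missing executable: report it instead of letting subprocess raise.
    if shutil.which(args[0]) is None:
        return "", "cli_missing"
    try:
        completed = subprocess.run(
            args,
            capture_output=True,
            text=True,
            encoding="utf-8",
            errors="replace",  # undecodable bytes must not crash the caller
            timeout=timeout_s,
            cwd=sandbox_cwd(),
        )
    except subprocess.TimeoutExpired:
        return "", "timeout"
    except OSError as exc:
        return "", f"subprocess_error: {exc}"
    if completed.returncode != 0:
        # Keep stdout for debugging, but flag the failure.
        return completed.stdout or "", f"exit_{completed.returncode}"
    return (completed.stdout or "").strip(), ""


if __name__ == "__main__":
    # Stand-in command so the sketch runs without the claude CLI installed.
    print(run_silently([sys.executable, "-c", "print('[]')"]))
```

Under these assumptions, the extractor would pass the argument list quoted in the commit message as `args` and hand the returned stdout to its JSON parser, treating any non-empty error as "no candidates".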

First live baseline run against the 20-interaction labeled corpus
(scripts/eval_data/extractor_llm_baseline_2026-04-11.json):

    mode=llm
    labeled=20
    recall=1.0
    precision=0.357
    yield_rate=2.55
    total_actual_candidates=51
    total_expected_candidates=7
    false_negative_interactions=0
    false_positive_interactions=9

Recall 0% -> 100% vs rule baseline: every human-labeled positive is
caught. Precision reads low (0.357) but inspection shows the "false
positives" are real candidates the human labels under-counted. For
example interaction a6b0d279 was labeled at 2 expected candidates, the
model caught all 6 polisher architectural facts; interaction 52c8c0f3
was labeled at 1, the model caught all 5 infra commitments. The labels
are the bottleneck, not the model.

Day 4 gate against Codex's criteria:

- candidate yield: 255% vs ≥15-25% target
- FP rate tolerable for manual triage: 51 candidates reviewable in
  ~10 minutes via the triage CLI
- ≥2 real non-synthetic candidates worth review: 20+ obvious wins
  (polisher architecture set, p05 infra set, DEV-LEDGER protocol set)

Gate cleared. LLM-assisted extraction is the path forward for
conversational captures. Rule-based extractor stays as-is for
structured-cue inputs and remains the default mode. The next step
(Day 5 stabilize / document) will wire LLM mode behind a flag in the
public extraction endpoint and document scope.

Test count: 276 -> 278 passing. No existing tests changed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
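
For anyone checking the scorecard arithmetic in the message above, here is a minimal sketch. It assumes recall and precision are scored per interaction and that yield_rate is candidates per labeled interaction. The true-positive interaction count of 5 is not stated in the commit message; it is an assumption chosen only because 5 / (5 + 9) rounds to the reported 0.357. The `Scorecard` class and its field names are hypothetical, and scripts/extractor_eval.py may compute these differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scorecard:
    """Hypothetical container for the interaction-level counts."""

    labeled: int
    true_positive_interactions: int  # ASSUMPTION: not reported in the commit
    false_positive_interactions: int
    false_negative_interactions: int
    total_actual_candidates: int

    @property
    def recall(self) -> float:
        denom = self.true_positive_interactions + self.false_negative_interactions
        return self.true_positive_interactions / denom if denom else 0.0

    @property
    def precision(self) -> float:
        denom = self.true_positive_interactions + self.false_positive_interactions
        return self.true_positive_interactions / denom if denom else 0.0

    @property
    def yield_rate(self) -> float:
        # Candidates emitted per labeled interaction.
        return self.total_actual_candidates / self.labeled if self.labeled else 0.0


card = Scorecard(
    labeled=20,
    true_positive_interactions=5,  # assumed, see note above
    false_positive_interactions=9,
    false_negative_interactions=0,
    total_actual_candidates=51,
)

if __name__ == "__main__":
    print(
        f"recall={card.recall:.3f} "
        f"precision={card.precision:.3f} "
        f"yield_rate={card.yield_rate:.2f}"
    )
```

If the assumption holds, the low precision comes from 9 interactions where the model emitted candidates the labels did not expect, which matches the message's point that the labels under-count.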
2026-04-11 17:45:24 -04:00
parent b309e7fd49
commit a29b5e22f2
4 changed files with 702 additions and 71 deletions
--- a/src/atocore/memory/extractor_llm.py
+++ b/src/atocore/memory/extractor_llm.py
@@ -1,14 +1,14 @@
-"""LLM-assisted candidate-memory extraction.
+"""LLM-assisted candidate-memory extraction via the Claude Code CLI.

 Day 4 of the 2026-04-11 mini-phase: the rule-based extractor hit 0%
 recall against real conversational claude-code captures (Day 2 baseline
 scorecard in ``scripts/eval_data/extractor_labels_2026-04-11.json``),
 with false negatives spread across 5 distinct miss classes. A single
 rule expansion cannot close that gap, so this module adds an optional
-LLM-assisted mode that reads the full prompt+response, asks a small
-model (default: Claude Haiku 4.5) for structured candidate objects,
-and returns the same ``MemoryCandidate`` dataclass the rule extractor
-produces so both paths flow through the same candidate pipeline.
+LLM-assisted mode that shells out to the ``claude -p`` (Claude Code
+non-interactive) CLI with a focused extraction system prompt. That
+path reuses the user's existing Claude.ai OAuth credentials — no API
+key anywhere, per the 2026-04-11 decision.

 Trust rules carried forward from the rule-based extractor:

@@ -18,35 +18,55 @@ Trust rules carried forward from the rule-based extractor:
  exactly as before; callers opt in by importing this module.
 - Extraction stays off the capture hot path — this is batch / manual
  only, per the 2026-04-11 decision.
-- Failure is silent. Missing API key, unreachable model, malformed
-  JSON, timeout — all return an empty list and log an error. Never
-  raise into the caller, because the capture audit trail must not
-  break on an optional side effect.
+- Failure is silent. Missing CLI, non-zero exit, malformed JSON,
+  timeout — all return an empty list and log an error. Never raises
+  into the caller; the capture audit trail must not break on an
+  optional side effect.

 Configuration:

-- ``ANTHROPIC_API_KEY`` env var must be set or the function returns [].
-- ``ATOCORE_LLM_EXTRACTOR_MODEL`` overrides the default model id.
-- ``ATOCORE_LLM_EXTRACTOR_TIMEOUT_S`` overrides the request timeout
-  (default 20 seconds).
+- Requires the ``claude`` CLI on PATH (``claude --version`` should work).
+- ``ATOCORE_LLM_EXTRACTOR_MODEL`` overrides the model alias (default
+  ``haiku``).
+- ``ATOCORE_LLM_EXTRACTOR_TIMEOUT_S`` overrides the per-call timeout
+  (default 90 seconds; first invocation is slow because Node.js
+  startup plus OAuth check is non-trivial).
+
+Implementation notes:
+
+- We run ``claude -p`` with ``--model <alias>``,
+  ``--append-system-prompt`` for the extraction instructions,
+  ``--no-session-persistence`` so we don't pollute session history,
+  and ``--disable-slash-commands`` so stray ``/foo`` in an extracted
+  response never triggers something.
+- The CLI is invoked from a temp working directory so it does not
+  auto-discover ``CLAUDE.md`` / ``DEV-LEDGER.md`` / ``AGENTS.md``
+  from the repo root. We want a bare extraction context, not the
+  full project briefing. We can't use ``--bare`` because that
+  forces API-key auth; the temp-cwd trick is the lightest way to
+  keep OAuth auth while skipping project context loading.
 """

 from __future__ import annotations

 import json
 import os
+import shutil
+import subprocess
+import tempfile
 from dataclasses import dataclass
+from functools import lru_cache

 from atocore.interactions.service import Interaction
-from atocore.memory.extractor import EXTRACTOR_VERSION, MemoryCandidate
+from atocore.memory.extractor import MemoryCandidate
 from atocore.memory.service import MEMORY_TYPES
 from atocore.observability.logger import get_logger

 log = get_logger("extractor_llm")

-LLM_EXTRACTOR_VERSION = "llm-0.1.0"
-DEFAULT_MODEL = os.environ.get("ATOCORE_LLM_EXTRACTOR_MODEL", "claude-haiku-4-5-20251001")
-DEFAULT_TIMEOUT_S = float(os.environ.get("ATOCORE_LLM_EXTRACTOR_TIMEOUT_S", "20"))
+LLM_EXTRACTOR_VERSION = "llm-0.2.0"
+DEFAULT_MODEL = os.environ.get("ATOCORE_LLM_EXTRACTOR_MODEL", "haiku")
+DEFAULT_TIMEOUT_S = float(os.environ.get("ATOCORE_LLM_EXTRACTOR_TIMEOUT_S", "90"))
 MAX_RESPONSE_CHARS = 8000
 MAX_PROMPT_CHARS = 2000

@@ -62,8 +82,8 @@ Rules:
 4. Each candidate must have a type from this closed set: project, knowledge, preference, adaptation.
 5. If the conversation is clearly scoped to a project (p04-gigabit, p05-interferometer, p06-polisher, atocore), set ``project`` to that id. Otherwise leave ``project`` empty.
 6. If the response makes no durable claim, return an empty list. It is correct and expected to return [] on most conversational turns.
-7. Confidence should be 0.5 by default for new candidates so review workload is honest. Raise to 0.6 only when the response states the claim in an unambiguous, committed form (e.g., "the decision is X", "the selected approach is Y", "X is non-negotiable").
-8. Output must be a raw JSON array and nothing else. No prose before or after. No markdown fences.
+7. Confidence should be 0.5 by default so human review workload is honest. Raise to 0.6 only when the response states the claim in an unambiguous, committed form (e.g. "the decision is X", "the selected approach is Y", "X is non-negotiable").
+8. Output must be a raw JSON array and nothing else. No prose before or after. No markdown fences. No explanations.

 Each array element has exactly this shape:

@@ -79,6 +99,23 @@ class LLMExtractionResult:
    error: str = ""


+@lru_cache(maxsize=1)
+def _sandbox_cwd() -> str:
+    """Return a stable temp directory for ``claude -p`` invocations.
+
+    We want the CLI to run from a directory that does NOT contain
+    ``CLAUDE.md`` / ``DEV-LEDGER.md`` / ``AGENTS.md``, so every
+    extraction call starts with a clean context instead of the full
+    AtoCore project briefing. Cached so the directory persists for
+    the lifetime of the process.
+    """
+    return tempfile.mkdtemp(prefix="ato-llm-extract-")
+
+
+def _cli_available() -> bool:
+    return shutil.which("claude") is not None
+
+
 def extract_candidates_llm(
    interaction: Interaction,
    model: str | None = None,
@@ -86,15 +123,14 @@ def extract_candidates_llm(
 ) -> list[MemoryCandidate]:
    """Run the LLM-assisted extractor against one interaction.

-    Returns a list of ``MemoryCandidate`` objects, empty on any failure
-    path. The caller is responsible for persistence.
+    Returns a list of ``MemoryCandidate`` objects, empty on any
+    failure path. The caller is responsible for persistence.
    """
-    result = extract_candidates_llm_verbose(
+    return extract_candidates_llm_verbose(
        interaction,
        model=model,
        timeout_s=timeout_s,
-    )
-    return result.candidates
+    ).candidates


 def extract_candidates_llm_verbose(
@@ -102,22 +138,20 @@ def extract_candidates_llm_verbose(
    model: str | None = None,
    timeout_s: float | None = None,
 ) -> LLMExtractionResult:
-    """Same as ``extract_candidates_llm`` but also returns the raw
-    model output and any error encountered, for eval / debugging.
+    """Like ``extract_candidates_llm`` but also returns the raw
+    subprocess output and any error encountered, for eval / debugging.
    """
-    if not os.environ.get("ANTHROPIC_API_KEY"):
-        return LLMExtractionResult(candidates=[], raw_output="", error="missing_api_key")
+    if not _cli_available():
+        return LLMExtractionResult(
+            candidates=[],
+            raw_output="",
+            error="claude_cli_missing",
+        )

    response_text = (interaction.response or "").strip()
    if not response_text:
        return LLMExtractionResult(candidates=[], raw_output="", error="empty_response")

-    try:
-        import anthropic  # noqa: F401
-    except ImportError:
-        log.error("anthropic_sdk_missing")
-        return LLMExtractionResult(candidates=[], raw_output="", error="anthropic_sdk_missing")
-
    prompt_excerpt = (interaction.prompt or "")[:MAX_PROMPT_CHARS]
    response_excerpt = response_text[:MAX_RESPONSE_CHARS]
    user_message = (
@@ -127,27 +161,49 @@ def extract_candidates_llm_verbose(
        "Return the JSON array now."
    )

+    args = [
+        "claude",
+        "-p",
+        "--model",
+        model or DEFAULT_MODEL,
+        "--append-system-prompt",
+        _SYSTEM_PROMPT,
+        "--no-session-persistence",
+        "--disable-slash-commands",
+        user_message,
+    ]
+
    try:
-        import anthropic
-
-        client = anthropic.Anthropic(timeout=timeout_s or DEFAULT_TIMEOUT_S)
-        response = client.messages.create(
-            model=model or DEFAULT_MODEL,
-            max_tokens=1024,
-            system=_SYSTEM_PROMPT,
-            messages=[{"role": "user", "content": user_message}],
+        completed = subprocess.run(
+            args,
+            capture_output=True,
+            text=True,
+            timeout=timeout_s or DEFAULT_TIMEOUT_S,
+            cwd=_sandbox_cwd(),
+            encoding="utf-8",
+            errors="replace",
        )
-    except Exception as exc:  # pragma: no cover - network / auth failures
-        log.error("llm_extractor_api_failed", error=str(exc))
-        return LLMExtractionResult(candidates=[], raw_output="", error=f"api_error: {exc}")
+    except subprocess.TimeoutExpired:
+        log.error("llm_extractor_timeout", interaction_id=interaction.id)
+        return LLMExtractionResult(candidates=[], raw_output="", error="timeout")
+    except Exception as exc:  # pragma: no cover - unexpected subprocess failure
+        log.error("llm_extractor_subprocess_failed", error=str(exc))
+        return LLMExtractionResult(candidates=[], raw_output="", error=f"subprocess_error: {exc}")

-    raw_output = ""
-    for block in response.content:
-        text = getattr(block, "text", None)
-        if text:
-            raw_output += text
-    raw_output = raw_output.strip()
+    if completed.returncode != 0:
+        log.error(
+            "llm_extractor_nonzero_exit",
+            interaction_id=interaction.id,
+            returncode=completed.returncode,
+            stderr_prefix=(completed.stderr or "")[:200],
+        )
+        return LLMExtractionResult(
+            candidates=[],
+            raw_output=completed.stdout or "",
+            error=f"exit_{completed.returncode}",
+        )

+    raw_output = (completed.stdout or "").strip()
    candidates = _parse_candidates(raw_output, interaction)
    log.info(
        "llm_extractor_done",
@@ -167,7 +223,6 @@ def _parse_candidates(raw_output: str, interaction: Interaction) -> list[MemoryC
    """
    text = raw_output.strip()
    if text.startswith("```"):
-        # Strip markdown fences if the model added them despite the instruction.
        text = text.strip("`")
        first_newline = text.find("\n")
        if first_newline >= 0:
@@ -179,7 +234,6 @@ def _parse_candidates(raw_output: str, interaction: Interaction) -> list[MemoryC
    if not text or text == "[]":
        return []

-    # If the model wrapped the array in prose, try to isolate the JSON.
    if not text.lstrip().startswith("["):
        start = text.find("[")
        end = text.rfind("]")