feat(eval-loop): Day 4 — LLM extractor via claude -p (OAuth, no API key)

Second pass on the LLM-assisted extractor after Antoine's explicit
rule: no API key, ever. Refactored src/atocore/memory/extractor_llm.py
to shell out to the Claude Code 'claude -p' CLI via subprocess instead
of the anthropic SDK, so extraction reuses the user's existing Claude.ai
OAuth credentials and needs zero secret management.

Implementation:

- subprocess.run(["claude", "-p", "--model", "haiku",
    "--append-system-prompt", <instructions>,
    "--no-session-persistence", "--disable-slash-commands",
    user_message], ...)
- cwd is a cached tempfile.mkdtemp() so every invocation starts with
  a clean context instead of auto-discovering CLAUDE.md / AGENTS.md /
  DEV-LEDGER.md from the repo root. We cannot use --bare because it
  forces API-key auth, which defeats the purpose; the temp-cwd trick
  is the lightest way to keep OAuth auth while skipping project
  context loading.
- Silent-failure contract unchanged: missing CLI, non-zero exit,
  timeout, malformed JSON — all return [] and log an error. The
  capture audit trail must not break on an optional side effect.
- Default timeout bumped from 20s to 90s: Haiku + Node.js startup
  + OAuth check is ~20-40s per call in practice, and real responses
  up to 8KB take longer. A 45s limit hit 2 timeouts on the first
  live run.
- tests/test_extractor_llm.py refactored: the API-key / anthropic SDK
  tests are replaced by subprocess-mocking tests covering missing
  CLI, timeout, non-zero exit, and a happy-path stdout parse. 14
  tests, all green.
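
A minimal sketch of the silent-failure contract described above, for
illustration only. The names run_extractor_cli and sandbox_cwd are
hypothetical and this is not the code in extractor_llm.py; the demo at
the bottom uses the current Python interpreter as a stand-in command so
the sketch runs without the claude CLI installed.

```python
import json
import shutil
import subprocess
import sys
import tempfile
from functools import lru_cache


@lru_cache(maxsize=1)
def sandbox_cwd() -> str:
    # Hypothetical helper: one empty temp directory per process, so the
    # child process starts in a directory with no project files in it.
    return tempfile.mkdtemp(prefix="llm-extract-sketch-")


def run_extractor_cli(args: list[str], timeout_s: float = 90.0) -> list:
    """Return the JSON array printed by the command, or [] on any failure."""
    if shutil.which(args[0]) is None:
        return []  # missing CLI
    try:
        completed = subprocess.run(
            args,
            capture_output=True,
            text=True,
            encoding="utf-8",
            errors="replace",
            timeout=timeout_s,
            cwd=sandbox_cwd(),
        )
    except subprocess.TimeoutExpired:
        return []  # timeout
    except OSError:
        return []  # the process could not be started
    if completed.returncode != 0:
        return []  # non-zero exit
    try:
        parsed = json.loads(completed.stdout or "")
    except json.JSONDecodeError:
        return []  # malformed JSON
    # Anything other than a JSON array is treated as a failure too.
    return parsed if isinstance(parsed, list) else []


if __name__ == "__main__":
    # Stand-in command (an assumption for the demo, not the real CLI call).
    print(run_extractor_cli([sys.executable, "-c", "print('[]')"]))
```

Every failure path collapses to the same empty list, so a caller on the
capture side never has to handle an exception from this optional step.
A real implementation would also log each failure, as the commit does.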

scripts/extractor_eval.py:

- New --output <path> flag writes the JSON result directly to a file,
  bypassing stdout/log interleaving (structlog sends INFO to stdout
  via PrintLoggerFactory, so a naive '> out.json' pollutes the file).
- Forces UTF-8 on stdout so real LLM output with em-dashes / arrows /
  CJK doesn't crash the human report on Windows cp1252 consoles.
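
The two changes above could look roughly like the following sketch. It
is an assumption about the shape of the script, not its actual code:
build_parser, force_utf8_stdout and emit_result are hypothetical names,
and only the --output flag itself comes from this commit.

```python
from __future__ import annotations

import argparse
import json
import sys
from pathlib import Path


def force_utf8_stdout() -> None:
    # sys.stdout is normally a TextIOWrapper, which has reconfigure() on
    # Python 3.7+. A replaced stream may not have it, so check first.
    reconfigure = getattr(sys.stdout, "reconfigure", None)
    if callable(reconfigure):
        reconfigure(encoding="utf-8", errors="replace")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="extractor eval (sketch)")
    parser.add_argument(
        "--output",
        type=Path,
        default=None,
        help="write the JSON result to this file instead of stdout",
    )
    return parser


def emit_result(result: dict, output: Path | None) -> None:
    # ensure_ascii=False keeps em-dashes, arrows and CJK readable in the file.
    payload = json.dumps(result, indent=2, ensure_ascii=False)
    if output is not None:
        # Writing the file directly keeps log lines on stdout out of the JSON.
        output.write_text(payload, encoding="utf-8")
    else:
        print(payload)
```

Writing through the file path rather than a shell redirect means the
JSON stays parseable regardless of what the logger prints to stdout.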

First live baseline run against the 20-interaction labeled corpus
(scripts/eval_data/extractor_llm_baseline_2026-04-11.json):

    mode=llm  labeled=20  recall=1.0  precision=0.357  yield_rate=2.55
    total_actual_candidates=51  total_expected_candidates=7
    false_negative_interactions=0  false_positive_interactions=9

Recall 0% -> 100% vs rule baseline — every human-labeled positive is
caught. Precision reads low (0.357) but inspection shows the "false
positives" are real candidates the human labels under-counted. For
example, interaction a6b0d279 was labeled with 2 expected candidates
but the model caught all 6 polisher architectural facts; interaction
52c8c0f3 was labeled with 1 but the model caught all 5 infra
commitments.
The labels are the bottleneck, not the model.
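
As a sanity check on the scorecard numbers, here is one reading that
reproduces them. This is an inference, not something the eval script
is confirmed to do: it assumes precision and recall are computed per
interaction, and the count of 5 true-positive interactions is an
assumption chosen because it matches the reported 0.357.

```python
# Values reported in the baseline run.
labeled = 20
total_actual_candidates = 51
false_positive_interactions = 9
false_negative_interactions = 0

# ASSUMPTION: not stated in the commit message; 5 is the count that
# reproduces precision=0.357 under interaction-level scoring.
true_positive_interactions = 5

# Candidates emitted per labeled interaction.
yield_rate = total_actual_candidates / labeled

precision = true_positive_interactions / (
    true_positive_interactions + false_positive_interactions
)
recall = true_positive_interactions / (
    true_positive_interactions + false_negative_interactions
)

print(f"yield_rate={yield_rate:.2f} precision={precision:.3f} recall={recall:.1f}")
```

Under this reading, precision counts an interaction as a false positive
whenever the model emits candidates where the labels expected none, so
under-counted labels lower it directly.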

Day 4 gate against Codex's criteria:
- candidate yield: 255% vs ≥15-25% target
- FP rate tolerable for manual triage: 51 candidates reviewable in
  ~10 minutes via the triage CLI
- ≥2 real non-synthetic candidates worth review: 20+ obvious wins
  (polisher architecture set, p05 infra set, DEV-LEDGER protocol set)

Gate cleared. LLM-assisted extraction is the path forward for
conversational captures. Rule-based extractor stays as-is for
structured-cue inputs and remains the default mode. The next step
(Day 5 stabilize / document) will wire LLM mode behind a flag in
the public extraction endpoint and document scope.

Test count: 276 -> 278 passing. No existing tests changed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 17:45:24 -04:00
parent b309e7fd49
commit a29b5e22f2
4 changed files with 702 additions and 71 deletions


@@ -1,14 +1,14 @@
"""LLM-assisted candidate-memory extraction.
"""LLM-assisted candidate-memory extraction via the Claude Code CLI.
Day 4 of the 2026-04-11 mini-phase: the rule-based extractor hit 0%
recall against real conversational claude-code captures (Day 2 baseline
scorecard in ``scripts/eval_data/extractor_labels_2026-04-11.json``),
with false negatives spread across 5 distinct miss classes. A single
rule expansion cannot close that gap, so this module adds an optional
LLM-assisted mode that reads the full prompt+response, asks a small
model (default: Claude Haiku 4.5) for structured candidate objects,
and returns the same ``MemoryCandidate`` dataclass the rule extractor
produces so both paths flow through the same candidate pipeline.
LLM-assisted mode that shells out to the ``claude -p`` (Claude Code
non-interactive) CLI with a focused extraction system prompt. That
path reuses the user's existing Claude.ai OAuth credentials — no API
key anywhere, per the 2026-04-11 decision.
Trust rules carried forward from the rule-based extractor:
@@ -18,35 +18,55 @@ Trust rules carried forward from the rule-based extractor:
exactly as before; callers opt in by importing this module.
- Extraction stays off the capture hot path — this is batch / manual
only, per the 2026-04-11 decision.
- Failure is silent. Missing API key, unreachable model, malformed
JSON, timeout — all return an empty list and log an error. Never
raise into the caller, because the capture audit trail must not
break on an optional side effect.
- Failure is silent. Missing CLI, non-zero exit, malformed JSON,
timeout — all return an empty list and log an error. Never raises
into the caller; the capture audit trail must not break on an
optional side effect.
Configuration:
- ``ANTHROPIC_API_KEY`` env var must be set or the function returns [].
- ``ATOCORE_LLM_EXTRACTOR_MODEL`` overrides the default model id.
- ``ATOCORE_LLM_EXTRACTOR_TIMEOUT_S`` overrides the request timeout
(default 20 seconds).
- Requires the ``claude`` CLI on PATH (``claude --version`` should work).
- ``ATOCORE_LLM_EXTRACTOR_MODEL`` overrides the model alias (default
``haiku``).
- ``ATOCORE_LLM_EXTRACTOR_TIMEOUT_S`` overrides the per-call timeout
(default 90 seconds; the first invocation is slow because Node.js
startup plus the OAuth check is non-trivial).
Implementation notes:
- We run ``claude -p`` with ``--model <alias>``,
``--append-system-prompt`` for the extraction instructions,
``--no-session-persistence`` so we don't pollute session history,
and ``--disable-slash-commands`` so stray ``/foo`` in an extracted
response never triggers something.
- The CLI is invoked from a temp working directory so it does not
auto-discover ``CLAUDE.md`` / ``DEV-LEDGER.md`` / ``AGENTS.md``
from the repo root. We want a bare extraction context, not the
full project briefing. We can't use ``--bare`` because that
forces API-key auth; the temp-cwd trick is the lightest way to
keep OAuth auth while skipping project context loading.
"""
from __future__ import annotations
import json
import os
import shutil
import subprocess
import tempfile
from dataclasses import dataclass
from functools import lru_cache
from atocore.interactions.service import Interaction
from atocore.memory.extractor import EXTRACTOR_VERSION, MemoryCandidate
from atocore.memory.extractor import MemoryCandidate
from atocore.memory.service import MEMORY_TYPES
from atocore.observability.logger import get_logger
log = get_logger("extractor_llm")
LLM_EXTRACTOR_VERSION = "llm-0.1.0"
DEFAULT_MODEL = os.environ.get("ATOCORE_LLM_EXTRACTOR_MODEL", "claude-haiku-4-5-20251001")
DEFAULT_TIMEOUT_S = float(os.environ.get("ATOCORE_LLM_EXTRACTOR_TIMEOUT_S", "20"))
LLM_EXTRACTOR_VERSION = "llm-0.2.0"
DEFAULT_MODEL = os.environ.get("ATOCORE_LLM_EXTRACTOR_MODEL", "haiku")
DEFAULT_TIMEOUT_S = float(os.environ.get("ATOCORE_LLM_EXTRACTOR_TIMEOUT_S", "90"))
MAX_RESPONSE_CHARS = 8000
MAX_PROMPT_CHARS = 2000
@@ -62,8 +82,8 @@ Rules:
4. Each candidate must have a type from this closed set: project, knowledge, preference, adaptation.
5. If the conversation is clearly scoped to a project (p04-gigabit, p05-interferometer, p06-polisher, atocore), set ``project`` to that id. Otherwise leave ``project`` empty.
6. If the response makes no durable claim, return an empty list. It is correct and expected to return [] on most conversational turns.
7. Confidence should be 0.5 by default for new candidates so review workload is honest. Raise to 0.6 only when the response states the claim in an unambiguous, committed form (e.g., "the decision is X", "the selected approach is Y", "X is non-negotiable").
8. Output must be a raw JSON array and nothing else. No prose before or after. No markdown fences.
7. Confidence should be 0.5 by default so human review workload is honest. Raise to 0.6 only when the response states the claim in an unambiguous, committed form (e.g. "the decision is X", "the selected approach is Y", "X is non-negotiable").
8. Output must be a raw JSON array and nothing else. No prose before or after. No markdown fences. No explanations.
Each array element has exactly this shape:
@@ -79,6 +99,23 @@ class LLMExtractionResult:
error: str = ""
@lru_cache(maxsize=1)
def _sandbox_cwd() -> str:
"""Return a stable temp directory for ``claude -p`` invocations.
We want the CLI to run from a directory that does NOT contain
``CLAUDE.md`` / ``DEV-LEDGER.md`` / ``AGENTS.md``, so every
extraction call starts with a clean context instead of the full
AtoCore project briefing. Cached so the directory persists for
the lifetime of the process.
"""
return tempfile.mkdtemp(prefix="ato-llm-extract-")
def _cli_available() -> bool:
return shutil.which("claude") is not None
def extract_candidates_llm(
interaction: Interaction,
model: str | None = None,
@@ -86,15 +123,14 @@ def extract_candidates_llm(
) -> list[MemoryCandidate]:
"""Run the LLM-assisted extractor against one interaction.
Returns a list of ``MemoryCandidate`` objects, empty on any failure
path. The caller is responsible for persistence.
Returns a list of ``MemoryCandidate`` objects, empty on any
failure path. The caller is responsible for persistence.
"""
result = extract_candidates_llm_verbose(
return extract_candidates_llm_verbose(
interaction,
model=model,
timeout_s=timeout_s,
)
return result.candidates
).candidates
def extract_candidates_llm_verbose(
@@ -102,22 +138,20 @@ def extract_candidates_llm_verbose(
model: str | None = None,
timeout_s: float | None = None,
) -> LLMExtractionResult:
"""Same as ``extract_candidates_llm`` but also returns the raw
model output and any error encountered, for eval / debugging.
"""Like ``extract_candidates_llm`` but also returns the raw
subprocess output and any error encountered, for eval / debugging.
"""
if not os.environ.get("ANTHROPIC_API_KEY"):
return LLMExtractionResult(candidates=[], raw_output="", error="missing_api_key")
if not _cli_available():
return LLMExtractionResult(
candidates=[],
raw_output="",
error="claude_cli_missing",
)
response_text = (interaction.response or "").strip()
if not response_text:
return LLMExtractionResult(candidates=[], raw_output="", error="empty_response")
try:
import anthropic # noqa: F401
except ImportError:
log.error("anthropic_sdk_missing")
return LLMExtractionResult(candidates=[], raw_output="", error="anthropic_sdk_missing")
prompt_excerpt = (interaction.prompt or "")[:MAX_PROMPT_CHARS]
response_excerpt = response_text[:MAX_RESPONSE_CHARS]
user_message = (
@@ -127,27 +161,49 @@ def extract_candidates_llm_verbose(
"Return the JSON array now."
)
args = [
"claude",
"-p",
"--model",
model or DEFAULT_MODEL,
"--append-system-prompt",
_SYSTEM_PROMPT,
"--no-session-persistence",
"--disable-slash-commands",
user_message,
]
try:
import anthropic
client = anthropic.Anthropic(timeout=timeout_s or DEFAULT_TIMEOUT_S)
response = client.messages.create(
model=model or DEFAULT_MODEL,
max_tokens=1024,
system=_SYSTEM_PROMPT,
messages=[{"role": "user", "content": user_message}],
completed = subprocess.run(
args,
capture_output=True,
text=True,
timeout=timeout_s or DEFAULT_TIMEOUT_S,
cwd=_sandbox_cwd(),
encoding="utf-8",
errors="replace",
)
except Exception as exc: # pragma: no cover - network / auth failures
log.error("llm_extractor_api_failed", error=str(exc))
return LLMExtractionResult(candidates=[], raw_output="", error=f"api_error: {exc}")
except subprocess.TimeoutExpired:
log.error("llm_extractor_timeout", interaction_id=interaction.id)
return LLMExtractionResult(candidates=[], raw_output="", error="timeout")
except Exception as exc: # pragma: no cover - unexpected subprocess failure
log.error("llm_extractor_subprocess_failed", error=str(exc))
return LLMExtractionResult(candidates=[], raw_output="", error=f"subprocess_error: {exc}")
raw_output = ""
for block in response.content:
text = getattr(block, "text", None)
if text:
raw_output += text
raw_output = raw_output.strip()
if completed.returncode != 0:
log.error(
"llm_extractor_nonzero_exit",
interaction_id=interaction.id,
returncode=completed.returncode,
stderr_prefix=(completed.stderr or "")[:200],
)
return LLMExtractionResult(
candidates=[],
raw_output=completed.stdout or "",
error=f"exit_{completed.returncode}",
)
raw_output = (completed.stdout or "").strip()
candidates = _parse_candidates(raw_output, interaction)
log.info(
"llm_extractor_done",
@@ -167,7 +223,6 @@ def _parse_candidates(raw_output: str, interaction: Interaction) -> list[MemoryC
"""
text = raw_output.strip()
if text.startswith("```"):
# Strip markdown fences if the model added them despite the instruction.
text = text.strip("`")
first_newline = text.find("\n")
if first_newline >= 0:
@@ -179,7 +234,6 @@ def _parse_candidates(raw_output: str, interaction: Interaction) -> list[MemoryC
if not text or text == "[]":
return []
# If the model wrapped the array in prose, try to isolate the JSON.
if not text.lstrip().startswith("["):
start = text.find("[")
end = text.rfind("]")