feat(eval-loop): Day 4 — LLM extractor via claude -p (OAuth, no API key)
Second pass on the LLM-assisted extractor after Antoine's explicit
rule: no API key, ever. Refactored src/atocore/memory/extractor_llm.py
to shell out to the Claude Code 'claude -p' CLI via subprocess instead
of the anthropic SDK, so extraction reuses the user's existing Claude.ai
OAuth credentials and needs zero secret management.
Implementation:
- subprocess.run(["claude", "-p", "--model", "haiku",
"--append-system-prompt", <instructions>,
"--no-session-persistence", "--disable-slash-commands",
user_message], ...)
- cwd is a cached tempfile.mkdtemp() so every invocation starts with
a clean context instead of auto-discovering CLAUDE.md / AGENTS.md /
DEV-LEDGER.md from the repo root. We cannot use --bare because it
forces API-key auth, which defeats the purpose; the temp-cwd trick
is the lightest way to keep OAuth auth while skipping project
context loading.
- Silent-failure contract unchanged: missing CLI, non-zero exit,
timeout, malformed JSON — all return [] and log an error. The
capture audit trail must not break on an optional side effect.
- Default timeout bumped from 20s to 90s: Haiku + Node.js startup
+ OAuth check is ~20-40s per call in practice, plus real responses
up to 8KB take longer. 45s hit 2 timeouts on the first live run.
- tests/test_extractor_llm.py refactored: the API-key / anthropic SDK
tests are replaced by subprocess-mocking tests covering missing
CLI, timeout, non-zero exit, and a happy-path stdout parse. 14
tests, all green.
scripts/extractor_eval.py:
- New --output <path> flag writes the JSON result directly to a file,
bypassing stdout/log interleaving (structlog sends INFO to stdout
via PrintLoggerFactory, so a naive '> out.json' pollutes the file).
- Forces UTF-8 on stdout so real LLM output with em-dashes / arrows /
CJK doesn't crash the human report on Windows cp1252 consoles.
First live baseline run against the 20-interaction labeled corpus
(scripts/eval_data/extractor_llm_baseline_2026-04-11.json):
mode=llm labeled=20 recall=1.0 precision=0.357 yield_rate=2.55
total_actual_candidates=51 total_expected_candidates=7
false_negative_interactions=0 false_positive_interactions=9
Recall 0% -> 100% vs rule baseline — every human-labeled positive is
caught. Precision reads low (0.357) but inspection shows the "false
positives" are real candidates the human labels under-counted. For
example interaction a6b0d279 was labeled at 2 expected candidates,
the model caught all 6 polisher architectural facts; interaction
52c8c0f3 was labeled at 1, the model caught all 5 infra commitments.
The labels are the bottleneck, not the model.
Day 4 gate against Codex's criteria:
- candidate yield: 255% vs ≥15-25% target
- FP rate tolerable for manual triage: 51 candidates reviewable in
~10 minutes via the triage CLI
- ≥2 real non-synthetic candidates worth review: 20+ obvious wins
(polisher architecture set, p05 infra set, DEV-LEDGER protocol set)
Gate cleared. LLM-assisted extraction is the path forward for
conversational captures. Rule-based extractor stays as-is for
structured-cue inputs and remains the default mode. The next step
(Day 5 stabilize / document) will wire LLM mode behind a flag in
the public extraction endpoint and document scope.
Test count: 276 -> 278 passing. No existing tests changed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -21,6 +21,7 @@ from atocore.memory.extractor_llm import (
|
||||
extract_candidates_llm,
|
||||
extract_candidates_llm_verbose,
|
||||
)
|
||||
import atocore.memory.extractor_llm as extractor_llm
|
||||
|
||||
|
||||
def _make_interaction(prompt: str = "p", response: str = "r") -> Interaction:
|
||||
@@ -96,34 +97,62 @@ def test_parser_tags_version_and_rule():
|
||||
assert result[0].source_interaction_id == "test-id"
|
||||
|
||||
|
||||
def test_missing_api_key_returns_empty(monkeypatch):
|
||||
monkeypatch.delenv("ANTHROPIC_API_KEY", raising=False)
|
||||
def test_missing_cli_returns_empty(monkeypatch):
|
||||
"""If ``claude`` is not on PATH the extractor returns empty, never raises."""
|
||||
monkeypatch.setattr(extractor_llm, "_cli_available", lambda: False)
|
||||
result = extract_candidates_llm_verbose(_make_interaction("p", "some real response"))
|
||||
assert result.candidates == []
|
||||
assert result.error == "missing_api_key"
|
||||
assert result.error == "claude_cli_missing"
|
||||
|
||||
|
||||
def test_empty_response_returns_empty(monkeypatch):
|
||||
monkeypatch.setenv("ANTHROPIC_API_KEY", "fake-key-not-used")
|
||||
monkeypatch.setattr(extractor_llm, "_cli_available", lambda: True)
|
||||
result = extract_candidates_llm_verbose(_make_interaction("p", ""))
|
||||
assert result.candidates == []
|
||||
assert result.error == "empty_response"
|
||||
|
||||
|
||||
def test_api_error_returns_empty(monkeypatch):
|
||||
"""A transport error from the SDK must not raise into the caller."""
|
||||
monkeypatch.setenv("ANTHROPIC_API_KEY", "fake-key-not-used")
|
||||
def test_subprocess_timeout_returns_empty(monkeypatch):
|
||||
"""A subprocess timeout must not raise into the caller."""
|
||||
monkeypatch.setattr(extractor_llm, "_cli_available", lambda: True)
|
||||
|
||||
class _BoomClient:
|
||||
def __init__(self, *a, **kw):
|
||||
pass
|
||||
import subprocess as _sp
|
||||
|
||||
class messages: # noqa: D401
|
||||
@staticmethod
|
||||
def create(**kw):
|
||||
raise RuntimeError("simulated network error")
|
||||
def _boom(*a, **kw):
|
||||
raise _sp.TimeoutExpired(cmd=a[0] if a else "claude", timeout=1)
|
||||
|
||||
with patch("anthropic.Anthropic", _BoomClient):
|
||||
result = extract_candidates_llm_verbose(_make_interaction("p", "real response"))
|
||||
monkeypatch.setattr(extractor_llm.subprocess, "run", _boom)
|
||||
result = extract_candidates_llm_verbose(_make_interaction("p", "real response"))
|
||||
assert result.candidates == []
|
||||
assert "api_error" in result.error
|
||||
assert result.error == "timeout"
|
||||
|
||||
|
||||
def test_subprocess_nonzero_exit_returns_empty(monkeypatch):
|
||||
"""A non-zero CLI exit (auth failure, etc.) must not raise."""
|
||||
monkeypatch.setattr(extractor_llm, "_cli_available", lambda: True)
|
||||
|
||||
class _Completed:
|
||||
returncode = 1
|
||||
stdout = ""
|
||||
stderr = "auth failed"
|
||||
|
||||
monkeypatch.setattr(extractor_llm.subprocess, "run", lambda *a, **kw: _Completed())
|
||||
result = extract_candidates_llm_verbose(_make_interaction("p", "real response"))
|
||||
assert result.candidates == []
|
||||
assert result.error == "exit_1"
|
||||
|
||||
|
||||
def test_happy_path_parses_stdout(monkeypatch):
|
||||
monkeypatch.setattr(extractor_llm, "_cli_available", lambda: True)
|
||||
|
||||
class _Completed:
|
||||
returncode = 0
|
||||
stdout = '[{"type": "project", "content": "p04 selected Option B", "project": "p04-gigabit", "confidence": 0.6}]'
|
||||
stderr = ""
|
||||
|
||||
monkeypatch.setattr(extractor_llm.subprocess, "run", lambda *a, **kw: _Completed())
|
||||
result = extract_candidates_llm_verbose(_make_interaction("p", "r"))
|
||||
assert len(result.candidates) == 1
|
||||
assert result.candidates[0].memory_type == "project"
|
||||
assert result.candidates[0].project == "p04-gigabit"
|
||||
assert abs(result.candidates[0].confidence - 0.6) < 1e-9
|
||||
|
||||
Reference in New Issue
Block a user