feat(eval-loop): Day 4 — LLM-assisted extractor path (additive, flagged)

Day 2 baseline showed 0% recall for the rule-based extractor across
5 distinct miss classes. Day 4 decision gate: prototype an
LLM-assisted mode behind a flag. Option A ratified by Antoine.

New module src/atocore/memory/extractor_llm.py:

- extract_candidates_llm(interaction) returns the same MemoryCandidate
  dataclass the rule extractor produces, so both paths flow through
  the existing triage / candidate pipeline unchanged.
- extract_candidates_llm_verbose() also returns the raw model output
  and any error string, for eval and debugging.
- Uses Claude Haiku 4.5 by default; model overridable via
  ATOCORE_LLM_EXTRACTOR_MODEL env. Timeout via
  ATOCORE_LLM_EXTRACTOR_TIMEOUT_S (default 20s).
- Silent-failure contract: missing API key, unreachable model, and
  malformed JSON all return [] and log an error. Never raises into
  the caller. The capture audit trail must not break on an optional
  side effect.
- Parser tolerates markdown fences, surrounding prose, and invalid
  memory types; it clamps confidence to [0,1] and drops empty content.
- System prompt explicitly tells the model to return [] for most
  conversational turns (durable-fact bar, not "extract everything").
- Trust rules unchanged: candidates are never auto-promoted,
  extraction stays off the capture hot path, human triages via the
  existing CLI.

scripts/extractor_eval.py: new --mode {rule,llm} flag so the same
labeled corpus can be scored against both extractors. Default remains
rule so existing invocations are unchanged.

tests/test_extractor_llm.py: 12 new unit tests covering the parser
(empty array, malformed JSON, markdown fences, surrounding prose,
invalid types, empty content, confidence clamping, version tagging),
plus contract tests for missing API key, empty response, and a mocked
api_error path so failure modes never raise.

Test count: 264 -> 276 passing. No existing tests changed.

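For readers without the module open, the parser tolerances described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in extractor_llm.py: `Candidate`, `VALID_TYPES`, and `parse_candidates_sketch` are hypothetical names, the set of valid memory types is assumed from the two types that appear in the tests, and the 0.5 default confidence is taken from the test assertions.

```python
import json
from dataclasses import dataclass
from typing import List

# Assumption: only "knowledge" and "project" appear in this commit's tests;
# the real list of valid memory types lives elsewhere in atocore.
VALID_TYPES = {"knowledge", "project"}
DEFAULT_CONFIDENCE = 0.5  # default asserted by the tests in this commit


@dataclass
class Candidate:  # hypothetical stand-in for MemoryCandidate
    memory_type: str
    content: str
    project: str
    confidence: float


def parse_candidates_sketch(raw: str) -> List[Candidate]:
    """Best-effort parse of model output; returns [] instead of raising."""
    # Slicing from the first "[" to the last "]" discards markdown fences
    # and any prose the model wrapped around the JSON array.
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        items = json.loads(raw[start : end + 1])
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []

    out: List[Candidate] = []
    for item in items:
        if not isinstance(item, dict):
            continue
        mem_type = str(item.get("type", "")).strip().lower()
        content = str(item.get("content", "")).strip()
        if mem_type not in VALID_TYPES or not content:
            continue  # drop invalid memory types and empty content
        try:
            conf = float(item.get("confidence", DEFAULT_CONFIDENCE))
        except (TypeError, ValueError):
            conf = DEFAULT_CONFIDENCE
        conf = max(0.0, min(1.0, conf))  # clamp to [0, 1]
        project = str(item.get("project") or "")
        out.append(Candidate(mem_type, content, project, conf))
    return out
```

Every failure branch returns an empty list, so a caller on the capture path never has to handle a parse exception.
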
Next step: run `python scripts/extractor_eval.py --mode llm` against
the labeled set with ANTHROPIC_API_KEY in env, record the delta, and
decide whether to wire LLM mode into the API endpoint and CLI or keep
it script-only for now.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 15:18:30 -04:00
parent 7d8d599030
commit b309e7fd49
3 changed files with 370 additions and 3 deletions
--- a/tests/test_extractor_llm.py
+++ b/tests/test_extractor_llm.py
@@ -0,0 +1,129 @@
+"""Tests for the LLM-assisted extractor path.
+
+Focused on the parser and failure-mode contracts — the actual network
+call is exercised out of band by running
+``python scripts/extractor_eval.py --mode llm`` against the frozen
+labeled corpus with ``ANTHROPIC_API_KEY`` set. These tests only
+exercise the pieces that don't need network.
+"""
+
+from __future__ import annotations
+
+import os
+from unittest.mock import patch
+
+import pytest
+
+from atocore.interactions.service import Interaction
+from atocore.memory.extractor_llm import (
+    LLM_EXTRACTOR_VERSION,
+    _parse_candidates,
+    extract_candidates_llm,
+    extract_candidates_llm_verbose,
+)
+
+
+def _make_interaction(prompt: str = "p", response: str = "r") -> Interaction:
+    return Interaction(
+        id="test-id",
+        prompt=prompt,
+        response=response,
+        response_summary="",
+        project="",
+        client="test",
+        session_id="",
+    )
+
+
+def test_parser_handles_empty_array():
+    result = _parse_candidates("[]", _make_interaction())
+    assert result == []
+
+
+def test_parser_handles_malformed_json():
+    result = _parse_candidates("{ not valid json", _make_interaction())
+    assert result == []
+
+
+def test_parser_strips_markdown_fences():
+    raw = "```json\n[{\"type\": \"knowledge\", \"content\": \"x is y\", \"project\": \"\", \"confidence\": 0.5}]\n```"
+    result = _parse_candidates(raw, _make_interaction())
+    assert len(result) == 1
+    assert result[0].memory_type == "knowledge"
+    assert result[0].content == "x is y"
+
+
+def test_parser_strips_surrounding_prose():
+    raw = "Here are the candidates:\n[{\"type\": \"project\", \"content\": \"foo\", \"project\": \"p04\", \"confidence\": 0.6}]\nThat's it."
+    result = _parse_candidates(raw, _make_interaction())
+    assert len(result) == 1
+    assert result[0].memory_type == "project"
+    assert result[0].project == "p04"
+
+
+def test_parser_drops_invalid_memory_types():
+    raw = '[{"type": "nonsense", "content": "x"}, {"type": "project", "content": "y"}]'
+    result = _parse_candidates(raw, _make_interaction())
+    assert len(result) == 1
+    assert result[0].memory_type == "project"
+
+
+def test_parser_drops_empty_content():
+    raw = '[{"type": "knowledge", "content": "   "}, {"type": "knowledge", "content": "real"}]'
+    result = _parse_candidates(raw, _make_interaction())
+    assert len(result) == 1
+    assert result[0].content == "real"
+
+
+def test_parser_clamps_confidence_to_unit_interval():
+    raw = '[{"type": "knowledge", "content": "c1", "confidence": 2.5}, {"type": "knowledge", "content": "c2", "confidence": -0.4}]'
+    result = _parse_candidates(raw, _make_interaction())
+    assert result[0].confidence == 1.0
+    assert result[1].confidence == 0.0
+
+
+def test_parser_defaults_confidence_on_missing_field():
+    raw = '[{"type": "knowledge", "content": "c1"}]'
+    result = _parse_candidates(raw, _make_interaction())
+    assert result[0].confidence == 0.5
+
+
+def test_parser_tags_version_and_rule():
+    raw = '[{"type": "project", "content": "c1"}]'
+    result = _parse_candidates(raw, _make_interaction())
+    assert result[0].rule == "llm_extraction"
+    assert result[0].extractor_version == LLM_EXTRACTOR_VERSION
+    assert result[0].source_interaction_id == "test-id"
+
+
+def test_missing_api_key_returns_empty(monkeypatch):
+    monkeypatch.delenv("ANTHROPIC_API_KEY", raising=False)
+    result = extract_candidates_llm_verbose(_make_interaction("p", "some real response"))
+    assert result.candidates == []
+    assert result.error == "missing_api_key"
+
+
+def test_empty_response_returns_empty(monkeypatch):
+    monkeypatch.setenv("ANTHROPIC_API_KEY", "fake-key-not-used")
+    result = extract_candidates_llm_verbose(_make_interaction("p", ""))
+    assert result.candidates == []
+    assert result.error == "empty_response"
+
+
+def test_api_error_returns_empty(monkeypatch):
+    """A transport error from the SDK must not raise into the caller."""
+    monkeypatch.setenv("ANTHROPIC_API_KEY", "fake-key-not-used")
+
+    class _BoomClient:
+        def __init__(self, *a, **kw):
+            pass
+
+        class messages:  # noqa: D401
+            @staticmethod
+            def create(**kw):
+                raise RuntimeError("simulated network error")
+
+    with patch("anthropic.Anthropic", _BoomClient):
+        result = extract_candidates_llm_verbose(_make_interaction("p", "real response"))
+    assert result.candidates == []
+    assert "api_error" in result.error