diff --git a/docs/next-steps.md b/docs/next-steps.md
index e83fff7..eb9ac79 100644
--- a/docs/next-steps.md
+++ b/docs/next-steps.md
@@ -226,14 +226,53 @@ candidate was a synthetic test capture from earlier in the session
 - Capture → reinforce is working correctly on live data (length-aware
   matcher verified on live paraphrase of a p04 memory).
 
-Follow-up candidates (not yet scheduled):
+Follow-up candidates:
 
-1. Extractor rule expansion — add conversational-form rules so real
-   session text has a chance of surfacing candidates.
-2. LLM-assisted extractor as a separate rule family, guarded by
-   confidence and always landing in `status=candidate` (never active).
-3. Retrieval eval harness — diffable scorecard of
-   `formatted_context` across a fixed question set per active project.
+1. ~~Extractor rule expansion~~ — Day 2 baseline showed 0% recall
+   across 5 distinct miss classes; rule expansion cannot close a
+   5-way miss. Deprioritized.
+2. ~~LLM-assisted extractor~~ — DONE 2026-04-12. `extractor_llm.py`
+   shells out to `claude -p` (Haiku, OAuth, no API key). First live
+   run: 100% recall, 2.55 yield/interaction on a 20-interaction
+   labeled set. First triage: 51 candidates → 16 promoted, 35
+   rejected (31% accept rate). Active memories 20 → 36.
+3. ~~Retrieval eval harness~~ — DONE 2026-04-11 (scripts/retrieval_eval.py,
+   6/6 passing). Expansion to 15-20 fixtures is mini-phase Day 6.
+
+## Extractor Scope — 2026-04-12
+
+What the LLM-assisted extractor (`src/atocore/memory/extractor_llm.py`)
+extracts from conversational Claude Code captures:
+
+**In scope:**
+
+- Architectural commitments (e.g. "Z-axis is engage/retract, not
+  continuous position")
+- Ratified decisions with project scope (e.g. "USB SSD mandatory on
+  RPi for telemetry storage")
+- Durable engineering facts (e.g. "telemetry data rate ~29 MB/hour")
+- Working rules and adaptation patterns (e.g. "extraction stays off
+  the capture hot path")
+- Interface invariants (e.g.
+  "controller-job.v1 in, run-log.v1 out;
+  no firmware change needed")
+
+**Out of scope (intentionally rejected by triage):**
+
+- Transient roadmap / plan steps that will be stale in a week
+- Operational instructions ("run this command to deploy")
+- Process rules that live in DEV-LEDGER.md / AGENTS.md, not in memory
+- Implementation details that are too granular (individual field names
+  when the parent concept is already captured)
+- Already-fixed review findings (P1/P2 that no longer apply)
+- Duplicates of existing active memories with wrong project tags
+
+**Trust model:**
+
+- Extraction stays off the capture hot path (batch / manual only)
+- All candidates land as `status=candidate`, never auto-promoted
+- Human or auto-triage reviews before promotion to active
+- Future direction: multi-model extraction + triage (Codex/Gemini as
+  second-pass reviewers for robustness against single-model bias)
 
 ## Long-Run Goal
diff --git a/src/atocore/memory/extractor_llm.py b/src/atocore/memory/extractor_llm.py
index 06d2739..9653e40 100644
--- a/src/atocore/memory/extractor_llm.py
+++ b/src/atocore/memory/extractor_llm.py
@@ -29,7 +29,7 @@ Configuration:
 - ``ATOCORE_LLM_EXTRACTOR_MODEL`` overrides the model alias (default
   ``haiku``).
 - ``ATOCORE_LLM_EXTRACTOR_TIMEOUT_S`` overrides the per-call timeout
-  (default 45 seconds — first invocation is slow because Node.js
+  (default 90 seconds — first invocation is slow because Node.js
   startup plus OAuth check is non-trivial).
 
 Implementation notes:
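For reviewers, the configuration and trust-model rules touched by this diff can be sketched as a minimal shell-out. This is a sketch under assumptions, not the actual `extractor_llm.py` implementation: `extract_candidates` and `tag_as_candidates` are hypothetical names, the prompt and JSON contract are illustrative, and the `claude -p ... --model ...` invocation is assumed to match the CLI described in the module docstring.

```python
import json
import os
import subprocess

DEFAULT_MODEL = "haiku"    # overridable via ATOCORE_LLM_EXTRACTOR_MODEL
DEFAULT_TIMEOUT_S = 90.0   # first call pays Node.js startup + OAuth check


def tag_as_candidates(items: list[dict]) -> list[dict]:
    """Force status=candidate on every item; nothing is auto-promoted."""
    return [{**item, "status": "candidate"} for item in items]


def extract_candidates(capture_text: str) -> list[dict]:
    """Batch/manual extraction only; never runs on the capture hot path."""
    model = os.environ.get("ATOCORE_LLM_EXTRACTOR_MODEL", DEFAULT_MODEL)
    timeout_s = float(
        os.environ.get("ATOCORE_LLM_EXTRACTOR_TIMEOUT_S", DEFAULT_TIMEOUT_S)
    )
    # Hypothetical prompt/JSON contract, for illustration only.
    prompt = (
        "Extract durable engineering memories from this capture as a "
        "JSON list of {text, project} objects:\n\n" + capture_text
    )
    result = subprocess.run(
        ["claude", "-p", prompt, "--model", model],
        capture_output=True, text=True, timeout=timeout_s, check=True,
    )
    return tag_as_candidates(json.loads(result.stdout))
```

The point of the split is that `tag_as_candidates` is the trust boundary: whatever the model returns, it lands as `status=candidate` and must pass human or auto-triage before promotion.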