diff --git a/DEV-LEDGER.md b/DEV-LEDGER.md index 4f43963..4310cfb 100644 --- a/DEV-LEDGER.md +++ b/DEV-LEDGER.md @@ -6,12 +6,12 @@ ## Orientation -- **live_sha** (Dalidou `/health` build_sha): `95daa5c` -- **last_updated**: 2026-04-12 by Claude (Day 4 complete, first triage done, 36 active memories) -- **main_tip**: `06792d8` +- **live_sha** (Dalidou `/health` build_sha): `5c69f77` +- **last_updated**: 2026-04-12 by Claude (mini-phase Day 8 close) +- **main_tip**: `5c69f77` - **test_count**: 278 passing -- **harness**: `6/6 PASS` (`python scripts/retrieval_eval.py` against live Dalidou) -- **active_memories**: 36 (was 20 before triage; p06-polisher 2->16, atocore 0->5) +- **harness**: `15/18 PASS` (expanded from 6 to 18 fixtures; 3 remaining failures are budget-contention on p06 memory band, not ranking bugs) +- **active_memories**: 36 (was 20 before mini-phase; p06-polisher 2->16, atocore 0->5) - **off_host_backup**: `papa@192.168.86.39:/home/papa/atocore-backups/` via cron env `ATOCORE_BACKUP_RSYNC`, verified ## Active Plan @@ -141,6 +141,7 @@ One branch `codex/extractor-eval-loop` for Day 1-5, a second `codex/retrieval-ha ## Session Log +- **2026-04-12 Claude** `06792d8..5c69f77` Day 5-8 close. Documented extractor scope (5 in-scope, 6 out-of-scope categories). Expanded harness from 6 to 18 fixtures (p04 +1, p05 +1, p06 +7, adversarial +2). Per-entry memory cap at 250 chars fixed 1 of 4 budget-contention failures. Final harness: 15/18 PASS. Mini-phase complete. Before/after: rule extractor 0% recall -> LLM 100%; harness 6/6 -> 15/18; active memories 20 -> 36. - **2026-04-12 Claude** `330ecfb..06792d8` (merged eval-loop branch + triage). Day 1-4 of the mini-phase completed in one session. Day 2 baseline: rule extractor 0% recall, 5 distinct miss classes. Day 4 gate cleared: LLM extractor (claude -p haiku, OAuth) hit 100% recall, 2.55 yield/interaction. Refactored from anthropic SDK to subprocess after "no API key" rule. First live triage: 51 candidates -> 16 promoted, 35 rejected. Active memories 20->36. p06-polisher went from 2 to 16 memories (firmware/telemetry architecture set). POST /memory now accepts status field. Test count 264->278. - **2026-04-11 Claude** `claude/extractor-eval-loop @ 7d8d599` — Day 1+2 of the mini-phase. Froze a 64-interaction snapshot (`scripts/eval_data/interactions_snapshot_2026-04-11.json`) and labeled 20 by length-stratified random sample (5 positive, 15 zero; 7 total expected candidates). Built `scripts/extractor_eval.py` as a file-based eval runner. **Day 2 baseline: rule extractor hit 0% yield / 0% recall / 0% precision on the labeled set; 5 false negatives across 5 distinct miss classes (recommendation_prose, architectural_change_summary, spec_update_announcement, layered_recommendation, alignment_assertion).** This is the Day 4 hard-stop signal arriving two days early — a single rule expansion cannot close a 5-way miss, and widening rules blindly will collapse precision. The Day 4 decision gate is escalated to Antoine for ratification before Day 3 touches any extractor code. No extractor code on main has changed. - **2026-04-11 Codex (ledger audit)** fixed stale `main_tip`, retargeted R1 from the API surface to the live Claude Stop hook, and formalized the review write protocol so Claude can consume findings without rewriting them.