chore: Day 8 — close mini-phase with before/after metrics

Mini-phase complete. Before/after deltas: Metric Before After ───────────────────────────────────────── Rule extractor recall 0% 0% (unchanged, deprioritized) LLM extractor recall n/a 100% (new, claude -p haiku) LLM candidate yield n/a 2.55/interaction First triage accept rate n/a 31% (16/51) Active memories 20 36 (+16) p06-polisher memories 2 16 (+14) atocore memories 0 5 (+5) Retrieval harness 6/6 15/18 (expanded to 18 fixtures) Test count 264 278 (+14) 3 remaining harness failures are budget-contention on the p06 memory band: the specific memory a fixture targets ranks 4th+ and the 25% budget only holds 2-3 entries. Not a ranking bug — the per-entry 250-char cap was the one justified tweak; a second budget change risks regressing other fixtures per Codex's Day 7 hard gate. Ledger updated: Orientation, Session Log, main_tip, harness line. Next on the roadmap (from DEV-LEDGER Active Plan / docs/next-steps): - Wave 2 trusted operational ingestion (p04/p05/p06 dashboards) - Finish OpenClaw integration (Phase 8) - Auto-triage (multi-model second pass to reduce human review) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 06:41:42 -04:00
parent 5c69f77b45
commit 146f2e4a5e
1 changed files with 6 additions and 5 deletions
--- a/DEV-LEDGER.md
+++ b/DEV-LEDGER.md
@@ -6,12 +6,12 @@
 ## Orientation
- **live_sha** (Dalidou `/health` build_sha): `95daa5c`
+- **live_sha** (Dalidou `/health` build_sha): `5c69f77`
- **last_updated**: 2026-04-12 by Claude (Day 4 complete, first triage done, 36 active memories)
+- **last_updated**: 2026-04-12 by Claude (mini-phase Day 8 close)
- **main_tip**: `06792d8`
+- **main_tip**: `5c69f77`
 - **test_count**: 278 passing
- **harness**: `6/6 PASS` (`python scripts/retrieval_eval.py` against live Dalidou)
+- **harness**: `15/18 PASS` (expanded from 6 to 18 fixtures; 3 remaining failures are budget-contention on p06 memory band, not ranking bugs)
- **active_memories**: 36 (was 20 before triage; p06-polisher 2->16, atocore 0->5)
+- **active_memories**: 36 (was 20 before mini-phase; p06-polisher 2->16, atocore 0->5)
 - **off_host_backup**: `papa@192.168.86.39:/home/papa/atocore-backups/` via cron env `ATOCORE_BACKUP_RSYNC`, verified
 ## Active Plan
@@ -141,6 +141,7 @@ One branch `codex/extractor-eval-loop` for Day 1-5, a second `codex/retrieval-ha
 ## Session Log
 - **2026-04-12 Claude** `06792d8..5c69f77` Day 5-8 close. Documented extractor scope (5 in-scope, 6 out-of-scope categories). Expanded harness from 6 to 18 fixtures (p04 +1, p05 +1, p06 +7, adversarial +2). Per-entry memory cap at 250 chars fixed 1 of 4 budget-contention failures. Final harness: 15/18 PASS. Mini-phase complete. Before/after: rule extractor 0% recall -> LLM 100%; harness 6/6 -> 15/18; active memories 20 -> 36.
 - **2026-04-12 Claude** `330ecfb..06792d8` (merged eval-loop branch + triage). Day 1-4 of the mini-phase completed in one session. Day 2 baseline: rule extractor 0% recall, 5 distinct miss classes. Day 4 gate cleared: LLM extractor (claude -p haiku, OAuth) hit 100% recall, 2.55 yield/interaction. Refactored from anthropic SDK to subprocess after "no API key" rule. First live triage: 51 candidates -> 16 promoted, 35 rejected. Active memories 20->36. p06-polisher went from 2 to 16 memories (firmware/telemetry architecture set). POST /memory now accepts status field. Test count 264->278.
 - **2026-04-11 Claude** `claude/extractor-eval-loop @ 7d8d599` — Day 1+2 of the mini-phase. Froze a 64-interaction snapshot (`scripts/eval_data/interactions_snapshot_2026-04-11.json`) and labeled 20 by length-stratified random sample (5 positive, 15 zero; 7 total expected candidates). Built `scripts/extractor_eval.py` as a file-based eval runner. **Day 2 baseline: rule extractor hit 0% yield / 0% recall / 0% precision on the labeled set; 5 false negatives across 5 distinct miss classes (recommendation_prose, architectural_change_summary, spec_update_announcement, layered_recommendation, alignment_assertion).** This is the Day 4 hard-stop signal arriving two days early — a single rule expansion cannot close a 5-way miss, and widening rules blindly will collapse precision. The Day 4 decision gate is escalated to Antoine for ratification before Day 3 touches any extractor code. No extractor code on main has changed.
 - **2026-04-11 Codex (ledger audit)** fixed stale `main_tip`, retargeted R1 from the API surface to the live Claude Stop hook, and formalized the review write protocol so Claude can consume findings without rewriting them.