From 89c7964237d60ea1605f9e2ba6c6327d4877a93f Mon Sep 17 00:00:00 2001
From: Anto01
Date: Sun, 12 Apr 2026 11:31:32 +0000
Subject: [PATCH] audit: record 2026-04-12 review findings

---
 DEV-LEDGER.md | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/DEV-LEDGER.md b/DEV-LEDGER.md
index 4310cfb..b754ae3 100644
--- a/DEV-LEDGER.md
+++ b/DEV-LEDGER.md
@@ -7,10 +7,10 @@
 ## Orientation
 
 - **live_sha** (Dalidou `/health` build_sha): `5c69f77`
-- **last_updated**: 2026-04-12 by Claude (mini-phase Day 8 close)
-- **main_tip**: `5c69f77`
+- **last_updated**: 2026-04-12 by Codex (audit branch `codex/audit-2026-04-12`)
+- **main_tip**: `146f2e4`
 - **test_count**: 278 passing
-- **harness**: `15/18 PASS` (expanded from 6 to 18 fixtures; 3 remaining failures are budget-contention on p06 memory band, not ranking bugs)
+- **harness**: `15/18 PASS` (remaining failures are mixed: p06-firmware-interface exposes a lexical-ranking tie, p06-offline-design is a live triage scoping miss, p06-tailscale still has retrieved-chunk bleed)
 - **active_memories**: 36 (was 20 before mini-phase; p06-polisher 2->16, atocore 0->5)
 - **off_host_backup**: `papa@192.168.86.39:/home/papa/atocore-backups/` via cron env `ATOCORE_BACKUP_RSYNC`, verified
 
@@ -124,6 +124,10 @@ One branch `codex/extractor-eval-loop` for Day 1-5, a second `codex/retrieval-ha
 | R2 | Codex | P1 | src/atocore/context/builder.py | Project memories excluded from pack | fixed | Claude | 2026-04-11 | 8ea53f4 |
 | R3 | Claude | P2 | src/atocore/memory/extractor.py | Rule cues (`## Decision:`) never fire on conversational LLM text | open | Claude | 2026-04-11 | |
 | R4 | Codex | P2 | DEV-LEDGER.md:11 | Orientation `main_tip` was stale versus `HEAD` / `origin/main` | fixed | Codex | 2026-04-11 | 81307ce |
+| R5 | Codex | P1 | src/atocore/interactions/service.py:157-174 | The deployed extraction path still calls only the rule extractor; the new LLM extractor is eval/script-only, so Day 4 "gate cleared" is true as a benchmark result but not as an operational extraction path | open | Claude | 2026-04-12 | |
+| R6 | Codex | P1 | src/atocore/memory/extractor_llm.py:258-276 | LLM extraction accepts model-supplied `project` verbatim with no fallback to `interaction.project`; live triage promoted a clearly p06 memory (offline/network rule) as project=`""`, which explains the p06-offline-design harness miss and falsifies the current "all 3 failures are budget-contention" claim | open | Claude | 2026-04-12 | |
+| R7 | Codex | P2 | src/atocore/memory/service.py:448-459 | Query ranking is overlap-count only, so broad overview memories can tie exact low-confidence memories and win on confidence; p06-firmware-interface is not just budget pressure, it also exposes a weak lexical scorer | open | Claude | 2026-04-12 | |
+| R8 | Codex | P2 | tests/test_extractor_llm.py:1-7 | LLM extractor tests stop at parser/failure contracts; there is no automated coverage for the script-only persistence/review path that produced the 16 promoted memories, including project-scope preservation | open | Claude | 2026-04-12 | |
 
 ## Recent Decisions
 
@@ -141,6 +145,7 @@ One branch `codex/extractor-eval-loop` for Day 1-5, a second `codex/retrieval-ha
 
 ## Session Log
 
+- **2026-04-12 Codex (audit branch `codex/audit-2026-04-12`)** audited `c5bad99..146f2e4` against code, live Dalidou, and the 36 active memories. Confirmed: `claude -p` invocation is not shell-injection-prone (`subprocess.run(args)` with no shell), off-host backup wiring matches the ledger, and R1 remains unresolved in practice. Added R5-R8. Corrected Orientation `main_tip` (`146f2e4`, not `5c69f77`) and tightened the harness note: p06-firmware-interface is a ranking-tie issue, p06-offline-design comes from a project-scope miss in live triage, and p06-tailscale is retrieved-chunk bleed rather than memory-band budget contention.
 - **2026-04-12 Claude** `06792d8..5c69f77` Day 5-8 close. Documented extractor scope (5 in-scope, 6 out-of-scope categories). Expanded harness from 6 to 18 fixtures (p04 +1, p05 +1, p06 +7, adversarial +2). Per-entry memory cap at 250 chars fixed 1 of 4 budget-contention failures. Final harness: 15/18 PASS. Mini-phase complete. Before/after: rule extractor 0% recall -> LLM 100%; harness 6/6 -> 15/18; active memories 20 -> 36.
 - **2026-04-12 Claude** `330ecfb..06792d8` (merged eval-loop branch + triage). Day 1-4 of the mini-phase completed in one session. Day 2 baseline: rule extractor 0% recall, 5 distinct miss classes. Day 4 gate cleared: LLM extractor (claude -p haiku, OAuth) hit 100% recall, 2.55 yield/interaction. Refactored from anthropic SDK to subprocess after "no API key" rule. First live triage: 51 candidates -> 16 promoted, 35 rejected. Active memories 20->36. p06-polisher went from 2 to 16 memories (firmware/telemetry architecture set). POST /memory now accepts status field. Test count 264->278.
 - **2026-04-11 Claude** `claude/extractor-eval-loop @ 7d8d599` — Day 1+2 of the mini-phase. Froze a 64-interaction snapshot (`scripts/eval_data/interactions_snapshot_2026-04-11.json`) and labeled 20 by length-stratified random sample (5 positive, 15 zero; 7 total expected candidates). Built `scripts/extractor_eval.py` as a file-based eval runner. **Day 2 baseline: rule extractor hit 0% yield / 0% recall / 0% precision on the labeled set; 5 false negatives across 5 distinct miss classes (recommendation_prose, architectural_change_summary, spec_update_announcement, layered_recommendation, alignment_assertion).** This is the Day 4 hard-stop signal arriving two days early — a single rule expansion cannot close a 5-way miss, and widening rules blindly will collapse precision. The Day 4 decision gate is escalated to Antoine for ratification before Day 3 touches any extractor code. No extractor code on main has changed.
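
Reviewer note on R7: the ranking tie recorded in this patch can be reproduced in miniature. The sketch below is only an illustration of an overlap-count scorer with a confidence tie-break, as the R7 row describes it. It is not the code in `src/atocore/memory/service.py`, which this patch does not show. The `Memory` class, the whitespace tokenizer, the function names, and both sample memory strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Memory:
    # Hypothetical record; the real memory model is not shown in this patch.
    text: str
    confidence: float


def tokens(text: str) -> set[str]:
    """Lowercase whitespace tokenization, purely for illustration."""
    return set(text.lower().split())


def overlap_score(query: str, memory: Memory) -> int:
    """Count distinct query tokens that also appear in the memory text."""
    return len(tokens(query) & tokens(memory.text))


def rank(query: str, memories: list[Memory]) -> list[Memory]:
    """Sort by overlap count, breaking ties on confidence (both descending)."""
    return sorted(
        memories,
        key=lambda m: (overlap_score(query, m), m.confidence),
        reverse=True,
    )


# Invented sample data: a broad overview memory and a narrow, exact one.
overview = Memory("overview of firmware interface telemetry and backups", 0.9)
exact = Memory("firmware interface detail", 0.5)

query = "firmware interface"
ranked = rank(query, [exact, overview])
```

Both memories contain the two query tokens, so their overlap counts tie and the higher-confidence overview memory sorts first. A scorer that also weighed how much of the memory text the query covers would separate the two, though whether that suits the real ranking path is for the R7 owner to decide.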