Day 1 (labeled corpus):
- scripts/eval_data/interactions_snapshot_2026-04-11.json — frozen
snapshot of 64 real claude-code interactions pulled from live
Dalidou (test-client captures filtered out). This is the stable
corpus the whole mini-phase labels against, independent of future
captures.
- scripts/eval_data/extractor_labels_2026-04-11.json — 20 hand-labeled
interactions drawn by length-stratified random sample. Positives:
5/20 = ~25%, total expected candidates: 7. Plan deviation: Codex's
plan asked for 30 (10/10/10 buckets); the real corpus is heavily
skewed toward instructional/status content, so honest labeling of
20 already crosses the fail-early threshold of "at least 5 plausible
positives" without padding.
Day 2 (baseline measurement):
- scripts/extractor_eval.py — file-based eval runner that loads the
snapshot + labels, runs extract_candidates_from_interaction on each,
and reports yield / recall / precision / miss-class breakdown.
Returns exit 1 on any false positive or false negative.
Current rule extractor against the labeled set:
labeled=20 exact_match=15 positive_expected=5
yield=0.0 recall=0.0 precision=0.0
false_negatives=5 false_positives=0
miss_classes:
recommendation_prose
architectural_change_summary
spec_update_announcement
layered_recommendation
alignment_assertion
Interpretation: the rule-based extractor matches exactly zero of the
5 plausible positive interactions in the labeled set, and the misses
are spread across 5 distinct cue classes with no single dominant
pattern. This is the Day 4 hard-stop signal landing on Day 2 — a
single rule expansion cannot close a 5-way miss, and widening rules
blindly will collapse precision. The right move is to go straight to
the Day 4 decision gate and consider LLM-assisted extraction.
Escalating to DEV-LEDGER.md as R5 for human ratification before
continuing. Not skipping Day 3 silently.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
8.3 KiB
8.3 KiB