Files
ATOCore/scripts/eval_data/interactions_snapshot_2026-04-11.json

2 lines
153 KiB
JSON
Raw Normal View History

feat(eval-loop): Day 1+2 — labeled extractor corpus + baseline scorecard Day 1 (labeled corpus): - scripts/eval_data/interactions_snapshot_2026-04-11.json — frozen snapshot of 64 real claude-code interactions pulled from live Dalidou (test-client captures filtered out). This is the stable corpus the whole mini-phase labels against, independent of future captures. - scripts/eval_data/extractor_labels_2026-04-11.json — 20 hand-labeled interactions drawn by length-stratified random sample. Positives: 5/20 = ~25%, total expected candidates: 7. Plan deviation: Codex's plan asked for 30 (10/10/10 buckets); the real corpus is heavily skewed toward instructional/status content, so honest labeling of 20 already crosses the fail-early threshold of "at least 5 plausible positives" without padding. Day 2 (baseline measurement): - scripts/extractor_eval.py — file-based eval runner that loads the snapshot + labels, runs extract_candidates_from_interaction on each, and reports yield / recall / precision / miss-class breakdown. Returns exit 1 on any false positive or false negative. Current rule extractor against the labeled set: labeled=20 exact_match=15 positive_expected=5 yield=0.0 recall=0.0 precision=0.0 false_negatives=5 false_positives=0 miss_classes: recommendation_prose architectural_change_summary spec_update_announcement layered_recommendation alignment_assertion Interpretation: the rule-based extractor matches exactly zero of the 5 plausible positive interactions in the labeled set, and the misses are spread across 5 distinct cue classes with no single dominant pattern. This is the Day 4 hard-stop signal landing on Day 2 — a single rule expansion cannot close a 5-way miss, and widening rules blindly will collapse precision. The right move is to go straight to the Day 4 decision gate and consider LLM-assisted extraction. Escalating to DEV-LEDGER.md as R5 for human ratification before continuing. Not skipping Day 3 silently. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 15:11:33 -04:00
{"frozen_at": "2026-04-11", "live_sha": "38f6e52", "total": 64, "interactions": [{"id": "c65b10e0-ea59-40be-a262-58455b97fc86", "client": "claude-code", "project": "", "session_id": "d38284e5-29eb-46bf-bff3-e3329f0ee99a", "created_at": "2026-04-11 19:00:30", "prompt": "so codex recomended : yes i'd like to build a review read/write so that claude can pick it and use it? what do you recommend we work this out?\n2:57 PM\n\n\nYes. I\u2019d make it a very small protocol, not a custom system.\n\nUse DEV-LEDGER.md as the shared read/write surface, and standardize only three things:\n\n1. One canonical finding format\nKeep Open Review Findings as the inbox. Each finding should always include:\n\nid\nfinder\nseverity\nfile:line\nsummary\nstatus\nowner\nopened_at\nresolved_by\nI\u2019d slightly expand the table to:\n\n| id | finder | severity | file:line | summary | status | owner | opened_at | resolved_by |\nStatuses should be fixed vocabulary:\n\nopen\nacknowledged\nin_progress\nfixed\ndeclined\n2. A tight handoff rule\nUse this operating rule:\n\nCodex writes findings only.\nClaude never deletes a finding.\nClaude only changes:\nstatus\nowner\nresolved_by\noptionally appends a short \u201cresolution note\u201d in Session Log\nIf Claude disagrees, mark declined with a short reason instead of editing the original summary.\nThat keeps audit history stable and prevents \u201csilent rewrites.\u201d\n\n3. A read order for every Claude session\nClaude should read in this order before coding:\n\nOrientation\nActive Plan\nOpen Review Findings\nnewest Session Log entries\nAnd Claude should apply this rule:\n\nP1 open or P1 acknowledged findings block adjacent claim-making and should be addressed or explicitly deferred before more feature work in that area.\nP2 findings can be batched unless they undermine the current task.\nWhat I\u2019d add next to make it actually usable:\n\nA. Add a tiny \u201cReview Protocol\u201d section\nSomething like:\n\n## Review Protocol\n\n- Codex records findings in Open Review Findings.\n- Claude must read Open Review Findings at session start.\n- Claude may update only status/owner/resolved_by, not rewrite finding text.\n- Any fix for a finding must reference the finding id in commit message or Session Log.\n- P1 findings block further commits in the affected area until acknowledged.\nB. Require finding IDs in commits or session log\nExample:\n\ncommit: fix(R1): retarget live capture claim to hook boundary\nor Session Log: Addressed R1 by ... in <sha>\nThat gives you traceability without extra tooling.\n\nC. Split \u201cclaim-level\u201d vs \u201ccode-level\u201d findings\nYour second finding is exactly why this matters. Some findings are about shipped behavior, some about narrative accuracy. I\u2019d add a prefix in summary or a new column:\n\nkind: code\nkind: claim\nkind: ops\nIf you want to stay minimal, just put it in the summary.\n\nMy recommendation for the immediate next change:\n\nUpdate the findings table schema with owner, opened_at, resolved_by.\nAdd a short Review Protocol section.\nRetarget R1 from routes.py to deploy/hooks/capture_stop.py.\nFix stale main_tip.\nThat\u2019s enough for Claude to reliably pick up and act on reviews without inventing process each time.\n\nIf you want, I can draft the exact markdown patch for the ledger format and protocol text. he's writing to ledger right now, we", "response": "Codex's proposal is sound \u00e2\u20ac\u201d I'd accept all of it except the `kind` column, which (as he notes) can live as a prefix in the summary to stay minimal. The one missing piece is a concurrency rule, since he and I can now both write to the ledger and that's the exact \"parallel work\" failure mode we said we'd avoid.\n\nRecommendation: **treat the ledger like any other shared file \u00e2\u20ac\u201d rely on git's conflict mechanism, not a custom protocol.** Rule: \"before writing the ledger, `git pull --rebase origin main`; if you see conflicts, reconcile before pushing. If you see the other agent listed as owner on a section in progress, wait.\" That's it. No locks, no coord