docs: formalize DEV-LEDGER review protocol

2026-04-11 15:03:33 -04:00
parent 81307cec47
commit d9dc55f841
1 changed files with 44 additions and 28 deletions
--- a/DEV-LEDGER.md
+++ b/DEV-LEDGER.md
@@ -2,13 +2,13 @@

 > Shared operating memory between humans, Claude, and Codex.
 > **Every session MUST read this file at start and append a Session Log entry before ending.**
-> Section headers are stable — do not rename them. Trim Session Log and Recent Decisions to the last 20 entries at session end; older history lives in `git log` and `docs/`.
+> Section headers are stable - do not rename them. Trim Session Log and Recent Decisions to the last 20 entries at session end; older history lives in `git log` and `docs/`.

 ## Orientation

 - **live_sha** (Dalidou `/health` build_sha): `38f6e52`
- **last_updated**: 2026-04-11 by Claude (ledger wired)
- **main_tip**: `59331e5`
+- **last_updated**: 2026-04-11 by Codex (review protocol formalized)
+- **main_tip**: `81307ce`
 - **test_count**: 264 passing
 - **harness**: `6/6 PASS` (`python scripts/retrieval_eval.py` against live Dalidou)
 - **off_host_backup**: `papa@192.168.86.39:/home/papa/atocore-backups/` via cron env `ATOCORE_BACKUP_RSYNC`, verified
@@ -31,98 +31,114 @@ Stop if any of these fail:

 Success: baseline eval output saved, baseline extract output saved, working branch created from `origin/main`.

-### Day 1 — Labeled extractor eval set
+### Day 1 - Labeled extractor eval set

 Pick 30 real captures: 10 that should produce 0 candidates, 10 that should plausibly produce 1, 10 ambiguous/hard. Store as a stable artifact (interaction id, expected count, expected type, notes). Add a runner that scores extractor output against labels.

 Success: 30 labeled interactions in a stable artifact, one-command precision/recall output.
 Fail-early: if labeling 30 takes more than a day because the concept is unclear, tighten the extraction target before touching code.

-### Day 2 — Measure current extractor
+### Day 2 - Measure current extractor

 Run the rule-based extractor on all 30. Record yield, TP, FP, FN. Bucket misses by class (conversational preference, decision summary, status/constraint, meta chatter).

 Success: short scorecard with counts by miss type, top 2 miss classes obvious.
-Fail-early: if the labeled set shows fewer than 5 plausible positives total, the corpus is too weak — relabel before tuning.
+Fail-early: if the labeled set shows fewer than 5 plausible positives total, the corpus is too weak - relabel before tuning.

-### Day 3 — Smallest rule expansion for top miss class
+### Day 3 - Smallest rule expansion for top miss class

 Add 1-2 narrow, explainable rules for the worst miss class. Add unit tests from real paraphrase examples in the labeled set. Then rerun eval.

 Success: recall up on the labeled set, false positives do not materially rise, new tests cover the new cue class.
 Fail-early: if one rule expansion raises FP above ~20% of extracted candidates, revert or narrow before adding more.

-### Day 4 — Decision gate: more rules or LLM-assisted prototype
+### Day 4 - Decision gate: more rules or LLM-assisted prototype

 If rule expansion reaches a **meaningfully reviewable queue**, keep going with rules. Otherwise prototype an LLM-assisted extraction mode behind a flag.

 "Meaningfully reviewable queue":
- ≥ 15-25% candidate yield on the 30 labeled captures
+- >= 15-25% candidate yield on the 30 labeled captures
 - FP rate low enough that manual triage feels tolerable
- ≥ 2 real non-synthetic candidates worth review
+- >= 2 real non-synthetic candidates worth review

 Hard stop: if candidate yield is still under 10% after this point, stop rule tinkering and switch to architecture review (LLM-assisted OR narrower extraction scope).

-### Day 5 — Stabilize and document
+### Day 5 - Stabilize and document

 Add remaining focused rules or the flagged LLM-assisted path. Write down in-scope and out-of-scope utterance kinds.

-Success: labeled eval green against target threshold, extractor scope explainable in ≤ 5 bullets.
+Success: labeled eval green against target threshold, extractor scope explainable in <= 5 bullets.

-### Day 6 — Retrieval harness expansion (6 → 15-20 fixtures)
+### Day 6 - Retrieval harness expansion (6 -> 15-20 fixtures)

 Grow across p04/p05/p06. Include short ambiguous prompts, cross-project collision cases, expected project-state wins, expected project-memory wins, and 1-2 "should fail open / low confidence" cases.

-Success: ≥ 15 fixtures, each active project has easy + medium + hard cases.
+Success: >= 15 fixtures, each active project has easy + medium + hard cases.
 Fail-early: if fixtures are mostly obvious wins, add harder adversarial cases before claiming coverage.

-### Day 7 — Regression pass and calibration
+### Day 7 - Regression pass and calibration

 Run harness on current code vs live Dalidou. Inspect failures (ranking, ingestion gap, project bleed, budget). Make at most ONE ranking/budget tweak if the harness clearly justifies it. Do not mix harness expansion and ranking changes in a single commit unless tightly coupled.

 Success: harness still passes or improves after extractor work; any ranking tweak is justified by a concrete fixture delta.
 Fail-early: if > 20-25% of harness fixtures regress after extractor changes, separate concerns before merging.

-### Day 8 — Merge and close
+### Day 8 - Merge and close

 Clean commit sequence. Save before/after metrics (extractor scorecard, harness results). Update docs only with claims the metrics support.

-Merge order: labeled corpus + runner → extractor improvements + tests → harness expansion → any justified ranking tweak → docs sync last.
+Merge order: labeled corpus + runner -> extractor improvements + tests -> harness expansion -> any justified ranking tweak -> docs sync last.

 Success: point to a before/after delta for both extraction and retrieval; docs do not overclaim.

 ### Hard Gates (stop/rethink points)

- Extractor yield < 10% after 30 labeled interactions → stop, reconsider rule-only extraction
- FP rate > 20% on labeled set → narrow rules before adding more
- Harness expansion finds < 3 genuinely hard cases → harness still too soft
- Ranking change improves one project but regresses another → do not merge without explicit tradeoff note
+- Extractor yield < 10% after 30 labeled interactions -> stop, reconsider rule-only extraction
+- FP rate > 20% on labeled set -> narrow rules before adding more
+- Harness expansion finds < 3 genuinely hard cases -> harness still too soft
+- Ranking change improves one project but regresses another -> do not merge without explicit tradeoff note

 ### Branching

 One branch `codex/extractor-eval-loop` for Day 1-5, a second `codex/retrieval-harness-expansion` for Day 6-7. Keeps extraction and retrieval judgments auditable.

+## Review Protocol
+
+- Codex records review findings in **Open Review Findings**.
+- Claude must read **Open Review Findings** at session start before coding.
+- Codex owns finding text. Claude may update operational fields only:
+  - `status`
+  - `owner`
+  - `resolved_by`
+- If Claude disagrees with a finding, do not rewrite it. Mark it `declined` and explain why in the **Session Log**.
+- Any commit or session that addresses a finding should reference the finding id in the commit message or **Session Log**.
+- `P1` findings block further commits in the affected area until they are at least acknowledged and explicitly tracked.
+- Findings may be code-level, claim-level, or ops-level. If the implementation boundary changes, retarget the finding instead of silently closing it.
+
 ## Open Review Findings

-| id  | finder | severity | file:line                                 | summary                                                    | status   |
-|-----|--------|----------|-------------------------------------------|------------------------------------------------------------|----------|
-| R1  | Codex  | P1       | src/atocore/api/routes.py                 | Capture→extract still manual; "loop closed both sides" was overstated | acknowledged — addressed in Active Plan Day 1-5 |
-| R2  | Codex  | P1       | src/atocore/context/builder.py            | Project memories excluded from pack                        | fixed @ 8ea53f4 (codex read was stale, now caught up) |
-| R3  | Claude | P2       | src/atocore/memory/extractor.py           | Rule cues (`## Decision:`) never fire on conversational LLM text | open — Active Plan root cause |
+| id  | finder | severity | file:line                          | summary                                                                 | status       | owner  | opened_at  | resolved_by |
+|-----|--------|----------|------------------------------------|-------------------------------------------------------------------------|--------------|--------|------------|-------------|
+| R1  | Codex  | P1       | deploy/hooks/capture_stop.py:76-85 | Live Claude capture still omits `extract`, so "loop closed both sides" remains overstated in practice even though the API supports it | acknowledged | Claude | 2026-04-11 |             |
+| R2  | Codex  | P1       | src/atocore/context/builder.py     | Project memories excluded from pack                                     | fixed        | Claude | 2026-04-11 | 8ea53f4     |
+| R3  | Claude | P2       | src/atocore/memory/extractor.py    | Rule cues (`## Decision:`) never fire on conversational LLM text        | open         | Claude | 2026-04-11 |             |
+| R4  | Codex  | P2       | DEV-LEDGER.md:11                   | Orientation `main_tip` was stale versus `HEAD` / `origin/main`          | fixed        | Codex  | 2026-04-11 | 81307ce     |

 ## Recent Decisions

 - **2026-04-11** Adopt this ledger as shared operating memory between Claude and Codex. *Proposed by:* Antoine. *Ratified by:* Antoine.
 - **2026-04-11** Accept Codex's 8-day mini-phase plan verbatim as Active Plan. *Proposed by:* Codex. *Ratified by:* Antoine.
+- **2026-04-11** Review findings live in `DEV-LEDGER.md` with Codex owning finding text and Claude updating status fields only. *Proposed by:* Codex. *Ratified by:* Antoine.
 - **2026-04-11** Project memories land in the pack under `--- Project Memories ---` at 25% budget ratio, gated on canonical project hint. *Proposed by:* Claude.
 - **2026-04-11** Extraction stays off the capture hot path. Batch / manual only. *Proposed by:* Antoine.
- **2026-04-11** 4-step roadmap: extractor → harness expansion → Wave 2 ingestion → OpenClaw finish. Steps 1+2 as one mini-phase. *Ratified by:* Antoine.
+- **2026-04-11** 4-step roadmap: extractor -> harness expansion -> Wave 2 ingestion -> OpenClaw finish. Steps 1+2 as one mini-phase. *Ratified by:* Antoine.
 - **2026-04-11** Codex branches must fork from `main`, not be orphan commits. *Proposed by:* Claude. *Agreed by:* Codex.

 ## Session Log

+- **2026-04-11 Codex (ledger audit)** fixed stale `main_tip`, retargeted R1 from the API surface to the live Claude Stop hook, and formalized the review write protocol so Claude can consume findings without rewriting them.
 - **2026-04-11 Claude** `b3253f3..59331e5` (1 commit). Wired the DEV-LEDGER, added session protocol to AGENTS.md, created project-local CLAUDE.md, deleted stale `codex/port-atocore-ops-client` remote branch. No code changes, no redeploy needed.
- **2026-04-11 Claude** `c5bad99..b3253f3` (11 commits + 1 merge). Length-aware reinforcement, project memories in pack, query-relevance memory ranking, hyphenated-identifier tokenizer, retrieval eval harness seeded, off-host backup wired end-to-end, docs synced, codex integration-pass branch merged. Harness went 0→6/6 on live Dalidou.
+- **2026-04-11 Claude** `c5bad99..b3253f3` (11 commits + 1 merge). Length-aware reinforcement, project memories in pack, query-relevance memory ranking, hyphenated-identifier tokenizer, retrieval eval harness seeded, off-host backup wired end-to-end, docs synced, codex integration-pass branch merged. Harness went 0->6/6 on live Dalidou.
 - **2026-04-11 Codex (async review)** identified 2 P1s against a stale checkout. R1 was fair (extraction not automated), R2 was outdated (project memories already landed on main). Delivered the 8-day execution plan now in Active Plan.
 - **2026-04-06 Antoine** created `codex/atocore-integration-pass` with the `t420-openclaw/` workspace (merged 2026-04-11).