Files

Anto01 330ecfb6a6 chore(ledger): Day 2 baseline escalated to Day 4 gate early

Day 2 extractor eval baseline on a 20-interaction labeled set shows
0% yield / 0% recall / 0% precision. The 5 false negatives span
5 distinct miss classes, matching the pattern Codex's Day 4 hard
gate was designed to catch but arriving two days early.

No extractor code change on main. Day 1+2 artifacts committed on
working branch 'claude/extractor-eval-loop' at 7d8d599. Day 4
decision (keep rule-expanding vs prototype LLM-assisted mode) is
escalated to Antoine for ratification before Day 3 work touches
any extractor.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-11 15:12:58 -04:00

12 KiB

Raw Blame History

AtoCore Dev Ledger

Shared operating memory between humans, Claude, and Codex. Every session MUST read this file at start and append a Session Log entry before ending. Section headers are stable - do not rename them. Trim Session Log and Recent Decisions to the last 20 entries at session end; older history lives in git log and docs/.

Orientation

live_sha (Dalidou /health build_sha): 38f6e52
last_updated: 2026-04-11 by Claude (Day 1+2 eval on working branch, Day 4 gate escalated)
main_tip: d9dc55f (unchanged; Day 1+2 artifacts live on claude/extractor-eval-loop @ 7d8d599)
test_count: 264 passing
harness: 6/6 PASS (python scripts/retrieval_eval.py against live Dalidou)
off_host_backup: papa@192.168.86.39:/home/papa/atocore-backups/ via cron env ATOCORE_BACKUP_RSYNC, verified

Active Plan

Mini-phase: Extractor improvement (eval-driven) + retrieval harness expansion. Duration: 8 days, hard gates at each day boundary. Plan author: Codex (2026-04-11). Executor: Claude. Audit: Codex.

Preflight (before Day 1)

Stop if any of these fail:

git rev-parse HEAD on main matches the expected branching tip
Live /health on Dalidou reports the SHA you think is deployed
python scripts/retrieval_eval.py --json still passes at the current baseline
batch-extract over the known 42-capture slice reproduces the current low-yield baseline
A frozen sample set exists for extractor labeling so the target does not move mid-phase

Success: baseline eval output saved, baseline extract output saved, working branch created from origin/main.

Day 1 - Labeled extractor eval set

Pick 30 real captures: 10 that should produce 0 candidates, 10 that should plausibly produce 1, 10 ambiguous/hard. Store as a stable artifact (interaction id, expected count, expected type, notes). Add a runner that scores extractor output against labels.

Success: 30 labeled interactions in a stable artifact, one-command precision/recall output. Fail-early: if labeling 30 takes more than a day because the concept is unclear, tighten the extraction target before touching code.

Day 2 - Measure current extractor

Run the rule-based extractor on all 30. Record yield, TP, FP, FN. Bucket misses by class (conversational preference, decision summary, status/constraint, meta chatter).

Success: short scorecard with counts by miss type, top 2 miss classes obvious. Fail-early: if the labeled set shows fewer than 5 plausible positives total, the corpus is too weak - relabel before tuning.

Day 3 - Smallest rule expansion for top miss class

Add 1-2 narrow, explainable rules for the worst miss class. Add unit tests from real paraphrase examples in the labeled set. Then rerun eval.

Success: recall up on the labeled set, false positives do not materially rise, new tests cover the new cue class. Fail-early: if one rule expansion raises FP above ~20% of extracted candidates, revert or narrow before adding more.

Day 4 - Decision gate: more rules or LLM-assisted prototype

If rule expansion reaches a meaningfully reviewable queue, keep going with rules. Otherwise prototype an LLM-assisted extraction mode behind a flag.

"Meaningfully reviewable queue":

= 15-25% candidate yield on the 30 labeled captures
FP rate low enough that manual triage feels tolerable
= 2 real non-synthetic candidates worth review

Hard stop: if candidate yield is still under 10% after this point, stop rule tinkering and switch to architecture review (LLM-assisted OR narrower extraction scope).

Day 5 - Stabilize and document

Add remaining focused rules or the flagged LLM-assisted path. Write down in-scope and out-of-scope utterance kinds.

Success: labeled eval green against target threshold, extractor scope explainable in <= 5 bullets.

Day 6 - Retrieval harness expansion (6 -> 15-20 fixtures)

Grow across p04/p05/p06. Include short ambiguous prompts, cross-project collision cases, expected project-state wins, expected project-memory wins, and 1-2 "should fail open / low confidence" cases.

Success: >= 15 fixtures, each active project has easy + medium + hard cases. Fail-early: if fixtures are mostly obvious wins, add harder adversarial cases before claiming coverage.

Day 7 - Regression pass and calibration

Run harness on current code vs live Dalidou. Inspect failures (ranking, ingestion gap, project bleed, budget). Make at most ONE ranking/budget tweak if the harness clearly justifies it. Do not mix harness expansion and ranking changes in a single commit unless tightly coupled.

Success: harness still passes or improves after extractor work; any ranking tweak is justified by a concrete fixture delta. Fail-early: if > 20-25% of harness fixtures regress after extractor changes, separate concerns before merging.

Day 8 - Merge and close

Clean commit sequence. Save before/after metrics (extractor scorecard, harness results). Update docs only with claims the metrics support.

Merge order: labeled corpus + runner -> extractor improvements + tests -> harness expansion -> any justified ranking tweak -> docs sync last.

Success: point to a before/after delta for both extraction and retrieval; docs do not overclaim.

Hard Gates (stop/rethink points)

Extractor yield < 10% after 30 labeled interactions -> stop, reconsider rule-only extraction
FP rate > 20% on labeled set -> narrow rules before adding more
Harness expansion finds < 3 genuinely hard cases -> harness still too soft
Ranking change improves one project but regresses another -> do not merge without explicit tradeoff note

Branching

One branch codex/extractor-eval-loop for Day 1-5, a second codex/retrieval-harness-expansion for Day 6-7. Keeps extraction and retrieval judgments auditable.

Review Protocol

Codex records review findings in Open Review Findings.
Claude must read Open Review Findings at session start before coding.
Codex owns finding text. Claude may update operational fields only:
- status
- owner
- resolved_by
If Claude disagrees with a finding, do not rewrite it. Mark it declined and explain why in the Session Log.
Any commit or session that addresses a finding should reference the finding id in the commit message or Session Log.
P1 findings block further commits in the affected area until they are at least acknowledged and explicitly tracked.
Findings may be code-level, claim-level, or ops-level. If the implementation boundary changes, retarget the finding instead of silently closing it.

Open Review Findings

id	finder	severity	file:line	summary	status	owner	opened_at	resolved_by
R1	Codex	P1	deploy/hooks/capture_stop.py:76-85	Live Claude capture still omits `extract`, so "loop closed both sides" remains overstated in practice even though the API supports it	acknowledged	Claude	2026-04-11
R2	Codex	P1	src/atocore/context/builder.py	Project memories excluded from pack	fixed	Claude	2026-04-11	`8ea53f4`
R3	Claude	P2	src/atocore/memory/extractor.py	Rule cues (`## Decision:`) never fire on conversational LLM text	open	Claude	2026-04-11
R4	Codex	P2	DEV-LEDGER.md:11	Orientation `main_tip` was stale versus `HEAD` / `origin/main`	fixed	Codex	2026-04-11	`81307ce`

Recent Decisions

2026-04-11 Adopt this ledger as shared operating memory between Claude and Codex. Proposed by: Antoine. Ratified by: Antoine.
2026-04-11 Accept Codex's 8-day mini-phase plan verbatim as Active Plan. Proposed by: Codex. Ratified by: Antoine.
2026-04-11 Review findings live in DEV-LEDGER.md with Codex owning finding text and Claude updating status fields only. Proposed by: Codex. Ratified by: Antoine.
2026-04-11 Project memories land in the pack under --- Project Memories --- at 25% budget ratio, gated on canonical project hint. Proposed by: Claude.
2026-04-11 Extraction stays off the capture hot path. Batch / manual only. Proposed by: Antoine.
2026-04-11 4-step roadmap: extractor -> harness expansion -> Wave 2 ingestion -> OpenClaw finish. Steps 1+2 as one mini-phase. Ratified by: Antoine.
2026-04-11 Codex branches must fork from main, not be orphan commits. Proposed by: Claude. Agreed by: Codex.

Session Log

2026-04-11 Claude claude/extractor-eval-loop @ 7d8d599 — Day 1+2 of the mini-phase. Froze a 64-interaction snapshot (scripts/eval_data/interactions_snapshot_2026-04-11.json) and labeled 20 by length-stratified random sample (5 positive, 15 zero; 7 total expected candidates). Built scripts/extractor_eval.py as a file-based eval runner. Day 2 baseline: rule extractor hit 0% yield / 0% recall / 0% precision on the labeled set; 5 false negatives across 5 distinct miss classes (recommendation_prose, architectural_change_summary, spec_update_announcement, layered_recommendation, alignment_assertion). This is the Day 4 hard-stop signal arriving two days early — a single rule expansion cannot close a 5-way miss, and widening rules blindly will collapse precision. The Day 4 decision gate is escalated to Antoine for ratification before Day 3 touches any extractor code. No extractor code on main has changed.
2026-04-11 Codex (ledger audit) fixed stale main_tip, retargeted R1 from the API surface to the live Claude Stop hook, and formalized the review write protocol so Claude can consume findings without rewriting them.
2026-04-11 Claude b3253f3..59331e5 (1 commit). Wired the DEV-LEDGER, added session protocol to AGENTS.md, created project-local CLAUDE.md, deleted stale codex/port-atocore-ops-client remote branch. No code changes, no redeploy needed.
2026-04-11 Claude c5bad99..b3253f3 (11 commits + 1 merge). Length-aware reinforcement, project memories in pack, query-relevance memory ranking, hyphenated-identifier tokenizer, retrieval eval harness seeded, off-host backup wired end-to-end, docs synced, codex integration-pass branch merged. Harness went 0->6/6 on live Dalidou.
2026-04-11 Codex (async review) identified 2 P1s against a stale checkout. R1 was fair (extraction not automated), R2 was outdated (project memories already landed on main). Delivered the 8-day execution plan now in Active Plan.
2026-04-06 Antoine created codex/atocore-integration-pass with the t420-openclaw/ workspace (merged 2026-04-11).

Working Rules

Claude builds; Codex audits. No parallel work on the same files.
Codex branches fork from main: git fetch origin && git checkout -b codex/<topic> origin/main.
P1 findings block further main commits until acknowledged in Open Review Findings.
Every session appends at least one Session Log line and bumps Orientation.
Trim Session Log and Recent Decisions to the last 20 at session end.
Docs in docs/ may overclaim stale status; the ledger is the one-file source of truth for "what is true right now."

Quick Commands

# Check live state
ssh papa@dalidou "curl -s http://localhost:8100/health"

# Run the retrieval harness
python scripts/retrieval_eval.py            # human-readable
python scripts/retrieval_eval.py --json     # machine-readable

# Deploy a new main tip
git push origin main && ssh papa@dalidou "bash /srv/storage/atocore/app/deploy/dalidou/deploy.sh"

# Reflection-loop ops
python scripts/atocore_client.py batch-extract '' '' 200 false   # preview
python scripts/atocore_client.py batch-extract '' '' 200 true    # persist
python scripts/atocore_client.py triage

12 KiB Raw Blame History