# AtoCore Dev Ledger

> Shared operating memory between humans, Claude, and Codex.
> **Every session MUST read this file at start and append a Session Log entry before ending.**
> Section headers are stable - do not rename them. Trim Session Log and Recent Decisions to the last 20 entries at session end; older history lives in `git log` and `docs/`.

## Orientation

- **live_sha** (Dalidou `/health` build_sha): `95daa5c`
- **last_updated**: 2026-04-12 by Claude (Day 4 complete, first triage done, 36 active memories)
- **main_tip**: `06792d8`
- **test_count**: 278 passing
- **harness**: `6/6 PASS` (`python scripts/retrieval_eval.py` against live Dalidou)
- **active_memories**: 36 (was 20 before triage; p06-polisher 2->16, atocore 0->5)
- **off_host_backup**: `papa@192.168.86.39:/home/papa/atocore-backups/` via cron env `ATOCORE_BACKUP_RSYNC`, verified

## Active Plan

**Mini-phase**: Extractor improvement (eval-driven) + retrieval harness expansion.
**Duration**: 8 days, hard gates at each day boundary.
**Plan author**: Codex (2026-04-11). **Executor**: Claude. **Audit**: Codex.

### Preflight (before Day 1)

Stop if any of these fail:

- `git rev-parse HEAD` on `main` matches the expected branching tip
- Live `/health` on Dalidou reports the SHA you think is deployed
- `python scripts/retrieval_eval.py --json` still passes at the current baseline
- `batch-extract` over the known 42-capture slice reproduces the current low-yield baseline
- A frozen sample set exists for extractor labeling so the target does not move mid-phase

Success: baseline eval output saved, baseline extract output saved, working branch created from `origin/main`.

### Day 1 - Labeled extractor eval set

Pick 30 real captures: 10 that should produce 0 candidates, 10 that should plausibly produce 1, 10 ambiguous/hard. Store as a stable artifact (interaction id, expected count, expected type, notes). Add a runner that scores extractor output against labels.
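The label artifact and scoring runner described above might be sketched as follows. This is an illustrative shape only: the field names (`interaction_id`, `expected_count`, `expected_type`, `notes`) come from the parenthetical above, but the real artifact format and runner are whatever `scripts/` actually implements.

```python
# Illustrative sketch of the Day 1 label artifact and scoring runner.
# Field names are assumptions taken from the prose; the real format is
# defined by the actual eval runner in scripts/, not by this sketch.

LABELS = [
    {"interaction_id": "i-001", "expected_count": 0, "expected_type": None,
     "notes": "pure meta chatter, should yield nothing"},
    {"interaction_id": "i-002", "expected_count": 1, "expected_type": "decision",
     "notes": "explicit decision summary, should yield one candidate"},
]

def score(labels, extracted):
    """Compare extractor output (interaction_id -> candidate count)
    against the labels; return (precision, recall) over candidates."""
    tp = fp = fn = 0
    for row in labels:
        want = row["expected_count"]
        got = extracted.get(row["interaction_id"], 0)
        tp += min(want, got)       # expected candidates actually produced
        fp += max(got - want, 0)   # extra candidates (false positives)
        fn += max(want - got, 0)   # missed candidates (false negatives)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

On this toy set, `score(LABELS, {"i-002": 1})` returns perfect precision and recall, while an extractor that fires only on the meta-chatter interaction scores zero on both.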
Success: 30 labeled interactions in a stable artifact, one-command precision/recall output.
Fail-early: if labeling 30 takes more than a day because the concept is unclear, tighten the extraction target before touching code.

### Day 2 - Measure current extractor

Run the rule-based extractor on all 30. Record yield, TP, FP, FN. Bucket misses by class (conversational preference, decision summary, status/constraint, meta chatter).

Success: short scorecard with counts by miss type, top 2 miss classes obvious.
Fail-early: if the labeled set shows fewer than 5 plausible positives total, the corpus is too weak - relabel before tuning.

### Day 3 - Smallest rule expansion for top miss class

Add 1-2 narrow, explainable rules for the worst miss class. Add unit tests from real paraphrase examples in the labeled set. Then rerun eval.

Success: recall up on the labeled set, false positives do not materially rise, new tests cover the new cue class.
Fail-early: if one rule expansion raises FP above ~20% of extracted candidates, revert or narrow before adding more.

### Day 4 - Decision gate: more rules or LLM-assisted prototype

If rule expansion reaches a **meaningfully reviewable queue**, keep going with rules. Otherwise prototype an LLM-assisted extraction mode behind a flag.

"Meaningfully reviewable queue":

- >= 15-25% candidate yield on the 30 labeled captures
- FP rate low enough that manual triage feels tolerable
- >= 2 real non-synthetic candidates worth review

Hard stop: if candidate yield is still under 10% after this point, stop rule tinkering and switch to architecture review (LLM-assisted OR narrower extraction scope).

### Day 5 - Stabilize and document

Add remaining focused rules or the flagged LLM-assisted path. Write down in-scope and out-of-scope utterance kinds.

Success: labeled eval green against target threshold, extractor scope explainable in <= 5 bullets.

### Day 6 - Retrieval harness expansion (6 -> 15-20 fixtures)

Grow across p04/p05/p06.
Include short ambiguous prompts, cross-project collision cases, expected project-state wins, expected project-memory wins, and 1-2 "should fail open / low confidence" cases.

Success: >= 15 fixtures, each active project has easy + medium + hard cases.
Fail-early: if fixtures are mostly obvious wins, add harder adversarial cases before claiming coverage.

### Day 7 - Regression pass and calibration

Run the harness on current code vs live Dalidou. Inspect failures (ranking, ingestion gap, project bleed, budget). Make at most ONE ranking/budget tweak if the harness clearly justifies it. Do not mix harness expansion and ranking changes in a single commit unless they are tightly coupled.

Success: harness still passes or improves after extractor work; any ranking tweak is justified by a concrete fixture delta.
Fail-early: if > 20-25% of harness fixtures regress after extractor changes, separate concerns before merging.

### Day 8 - Merge and close

Clean commit sequence. Save before/after metrics (extractor scorecard, harness results). Update docs only with claims the metrics support.

Merge order: labeled corpus + runner -> extractor improvements + tests -> harness expansion -> any justified ranking tweak -> docs sync last.

Success: point to a before/after delta for both extraction and retrieval; docs do not overclaim.

### Hard Gates (stop/rethink points)

- Extractor yield < 10% after 30 labeled interactions -> stop, reconsider rule-only extraction
- FP rate > 20% on labeled set -> narrow rules before adding more
- Harness expansion finds < 3 genuinely hard cases -> harness still too soft
- Ranking change improves one project but regresses another -> do not merge without an explicit tradeoff note

### Branching

One branch `codex/extractor-eval-loop` for Day 1-5, a second `codex/retrieval-harness-expansion` for Day 6-7. This keeps extraction and retrieval judgments separately auditable.

## Review Protocol

- Codex records review findings in **Open Review Findings**.
- Claude must read **Open Review Findings** at session start before coding.
- Codex owns finding text. Claude may update operational fields only:
  - `status`
  - `owner`
  - `resolved_by`
- If Claude disagrees with a finding, do not rewrite it. Mark it `declined` and explain why in the **Session Log**.
- Any commit or session that addresses a finding should reference the finding id in the commit message or **Session Log**.
- `P1` findings block further commits in the affected area until they are at least acknowledged and explicitly tracked.
- Findings may be code-level, claim-level, or ops-level. If the implementation boundary changes, retarget the finding instead of silently closing it.

## Open Review Findings

| id | finder | severity | file:line | summary | status | owner | opened_at | resolved_by |
|----|--------|----------|-----------|---------|--------|-------|-----------|-------------|
| R1 | Codex | P1 | deploy/hooks/capture_stop.py:76-85 | Live Claude capture still omits `extract`, so "loop closed both sides" remains overstated in practice even though the API supports it | acknowledged | Claude | 2026-04-11 | |
| R2 | Codex | P1 | src/atocore/context/builder.py | Project memories excluded from pack | fixed | Claude | 2026-04-11 | 8ea53f4 |
| R3 | Claude | P2 | src/atocore/memory/extractor.py | Rule cues (`## Decision:`) never fire on conversational LLM text | open | Claude | 2026-04-11 | |
| R4 | Codex | P2 | DEV-LEDGER.md:11 | Orientation `main_tip` was stale versus `HEAD` / `origin/main` | fixed | Codex | 2026-04-11 | 81307ce |

## Recent Decisions

- **2026-04-12** Day 4 gate cleared: LLM-assisted extraction via `claude -p` (OAuth, no API key) is the path forward. The rule extractor stays as the default for structural cues. *Proposed by:* Claude. *Ratified by:* Antoine.
- **2026-04-12** First live triage: 16 promoted, 35 rejected from 51 LLM-extracted candidates.
31% accept rate. Active memory count 20->36. *Executed by:* Claude. *Ratified by:* Antoine.
- **2026-04-12** No API keys allowed in AtoCore — LLM-assisted features use OAuth via `claude -p` or equivalent CLI-authenticated paths. *Proposed by:* Antoine.
- **2026-04-12** Multi-model extraction direction: extraction/triage should be model-agnostic, with Codex/Gemini/Ollama as second-pass reviewers for robustness. *Proposed by:* Antoine.
- **2026-04-11** Adopt this ledger as shared operating memory between Claude and Codex. *Proposed by:* Antoine. *Ratified by:* Antoine.
- **2026-04-11** Accept Codex's 8-day mini-phase plan verbatim as the Active Plan. *Proposed by:* Codex. *Ratified by:* Antoine.
- **2026-04-11** Review findings live in `DEV-LEDGER.md`, with Codex owning finding text and Claude updating status fields only. *Proposed by:* Codex. *Ratified by:* Antoine.
- **2026-04-11** Project memories land in the pack under `--- Project Memories ---` at a 25% budget ratio, gated on a canonical project hint. *Proposed by:* Claude.
- **2026-04-11** Extraction stays off the capture hot path. Batch / manual only. *Proposed by:* Antoine.
- **2026-04-11** 4-step roadmap: extractor -> harness expansion -> Wave 2 ingestion -> OpenClaw finish. Steps 1+2 as one mini-phase. *Ratified by:* Antoine.
- **2026-04-11** Codex branches must fork from `main`, not be orphan commits. *Proposed by:* Claude. *Agreed by:* Codex.

## Session Log

- **2026-04-12 Claude** `330ecfb..06792d8` (merged eval-loop branch + triage). Day 1-4 of the mini-phase completed in one session. Day 2 baseline: rule extractor 0% recall, 5 distinct miss classes. Day 4 gate cleared: LLM extractor (`claude -p` haiku, OAuth) hit 100% recall, 2.55 yield/interaction. Refactored from the anthropic SDK to a subprocess after the "no API key" rule. First live triage: 51 candidates -> 16 promoted, 35 rejected. Active memories 20->36. p06-polisher went from 2 to 16 memories (firmware/telemetry architecture set).
POST `/memory` now accepts a `status` field. Test count 264->278.
- **2026-04-11 Claude** `claude/extractor-eval-loop @ 7d8d599` — Day 1+2 of the mini-phase. Froze a 64-interaction snapshot (`scripts/eval_data/interactions_snapshot_2026-04-11.json`) and labeled 20 by length-stratified random sample (5 positive, 15 zero; 7 total expected candidates). Built `scripts/extractor_eval.py` as a file-based eval runner. **Day 2 baseline: the rule extractor hit 0% yield / 0% recall / 0% precision on the labeled set; 5 false negatives across 5 distinct miss classes (recommendation_prose, architectural_change_summary, spec_update_announcement, layered_recommendation, alignment_assertion).** This is the Day 4 hard-stop signal arriving two days early — a single rule expansion cannot close a 5-way miss, and widening rules blindly will collapse precision. The Day 4 decision gate is escalated to Antoine for ratification before Day 3 touches any extractor code. No extractor code on main has changed.
- **2026-04-11 Codex (ledger audit)** Fixed stale `main_tip`, retargeted R1 from the API surface to the live Claude Stop hook, and formalized the review write protocol so Claude can consume findings without rewriting them.
- **2026-04-11 Claude** `b3253f3..59331e5` (1 commit). Wired the DEV-LEDGER, added the session protocol to AGENTS.md, created a project-local CLAUDE.md, deleted the stale `codex/port-atocore-ops-client` remote branch. No code changes, no redeploy needed.
- **2026-04-11 Claude** `c5bad99..b3253f3` (11 commits + 1 merge). Length-aware reinforcement, project memories in the pack, query-relevance memory ranking, hyphenated-identifier tokenizer, retrieval eval harness seeded, off-host backup wired end-to-end, docs synced, codex integration-pass branch merged. Harness went 0->6/6 on live Dalidou.
- **2026-04-11 Codex (async review)** Identified 2 P1s against a stale checkout. R1 was fair (extraction not automated); R2 was outdated (project memories had already landed on main).
Delivered the 8-day execution plan now in Active Plan.
- **2026-04-06 Antoine** Created `codex/atocore-integration-pass` with the `t420-openclaw/` workspace (merged 2026-04-11).

## Working Rules

- Claude builds; Codex audits. No parallel work on the same files.
- Codex branches fork from `main`: `git fetch origin && git checkout -b codex/<branch> origin/main`.
- P1 findings block further main commits until acknowledged in Open Review Findings.
- Every session appends at least one Session Log line and bumps Orientation.
- Trim Session Log and Recent Decisions to the last 20 at session end.
- Docs in `docs/` may overclaim stale status; the ledger is the one-file source of truth for "what is true right now."

## Quick Commands

```bash
# Check live state
ssh papa@dalidou "curl -s http://localhost:8100/health"

# Run the retrieval harness
python scripts/retrieval_eval.py          # human-readable
python scripts/retrieval_eval.py --json   # machine-readable

# Deploy a new main tip
git push origin main && ssh papa@dalidou "bash /srv/storage/atocore/app/deploy/dalidou/deploy.sh"

# Reflection-loop ops
python scripts/atocore_client.py batch-extract '' '' 200 false   # preview
python scripts/atocore_client.py batch-extract '' '' 200 true    # persist
python scripts/atocore_client.py triage
```
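For the Day 6 harness expansion in the Active Plan, a fixture set might be shaped roughly like this. The schema below (`project`, `prompt`, `expect`, `difficulty` fields) is hypothetical; the real fixtures live in `scripts/retrieval_eval.py` and may be structured differently.

```python
# Hypothetical fixture shape for the Day 6 harness expansion. The field
# names and values here are illustrative, not the real schema.
FIXTURES = [
    {
        "project": "p06-polisher",
        "prompt": "what did we decide about telemetry batching?",
        "expect": {"source": "project-memory", "min_hits": 1},
        "difficulty": "medium",
    },
    {
        # Cross-project collision: the same identifier appears in two projects.
        "project": "p04",
        "prompt": "status of the extractor eval loop",
        "expect": {"source": "project-state", "min_hits": 1},
        "difficulty": "hard",
    },
    {
        # Should fail open: ambiguous prompt, no confident match expected.
        "project": None,
        "prompt": "fix it",
        "expect": {"source": None, "min_hits": 0},
        "difficulty": "fail-open",
    },
]

def coverage_ok(fixtures, min_total=15):
    """Day 6 success gate: at least min_total fixtures overall, and at
    least one genuinely hard or fail-open case so the set is not all
    obvious wins (the Day 6 fail-early condition)."""
    hard = [f for f in fixtures if f["difficulty"] in ("hard", "fail-open")]
    return len(fixtures) >= min_total and len(hard) >= 1
```

A gate like `coverage_ok` makes the "harness still too soft" hard gate mechanical rather than a judgment call: the three-fixture set above fails it until the set is grown to the 15-20 fixture target.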