AtoCore Dev Ledger
Shared operating memory between humans, Claude, and Codex. Every session MUST read this file at start and append a Session Log entry before ending. Section headers are stable - do not rename them. Trim Session Log and Recent Decisions to the last 20 entries at session end; older history lives in `git log` and `docs/`.
Orientation
- live_sha (Dalidou `/health` build_sha): `8951c62` (R9 fix at `e5e9a99` not yet deployed)
- last_updated: 2026-04-12 by Claude (Batch 3 R9 fix)
- main_tip: `e5e9a99`
- test_count: 290 passing (local dev shell)
- harness: `17/18 PASS` (only p06-tailscale still failing)
- active_memories: 41
- candidate_memories: 0
- project_state_entries: p04=5, p05=6, p06=6 (Wave 2 entries still present on live Dalidou; 17 total visible)
- off_host_backup: `papa@192.168.86.39:/home/papa/atocore-backups/` via cron env `ATOCORE_BACKUP_RSYNC`, verified
Active Plan
Mini-phase: Extractor improvement (eval-driven) + retrieval harness expansion. Duration: 8 days, hard gates at each day boundary. Plan author: Codex (2026-04-11). Executor: Claude. Audit: Codex.
Preflight (before Day 1)
Stop if any of these fail:
- `git rev-parse HEAD` on `main` matches the expected branching tip
- Live `/health` on Dalidou reports the SHA you think is deployed
- `python scripts/retrieval_eval.py --json` still passes at the current baseline
- `batch-extract` over the known 42-capture slice reproduces the current low-yield baseline
- A frozen sample set exists for extractor labeling so the target does not move mid-phase
Success: baseline eval output saved, baseline extract output saved, working branch created from origin/main.
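The first two preflight checks lend themselves to scripting. The following is a minimal sketch, assuming the `/health` response is JSON carrying a `build_sha` field (as the Orientation section suggests) and that a short SHA should match a full SHA by prefix. The function names are hypothetical and not part of the repo.

```python
import json
import subprocess
import urllib.request


def local_head_sha(repo_dir: str = ".") -> str:
    """Return the full SHA of HEAD in repo_dir via `git rev-parse HEAD`."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def live_build_sha(health_url: str) -> str:
    """Fetch the health endpoint and return build_sha (field name assumed)."""
    with urllib.request.urlopen(health_url, timeout=10) as resp:
        return json.load(resp)["build_sha"]


def shas_match(a: str, b: str) -> bool:
    """Treat two SHAs as equal if one is a prefix of the other (7+ chars)."""
    a, b = a.strip().lower(), b.strip().lower()
    if min(len(a), len(b)) < 7:
        return False  # too short to compare safely
    return a.startswith(b) or b.startswith(a)
```

A preflight script could then stop early when `shas_match(local_head_sha(), live_build_sha(url))` is false, which is exactly the branch-truth versus deploy-truth gap the ledger tracks.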
Day 1 - Labeled extractor eval set
Pick 30 real captures: 10 that should produce 0 candidates, 10 that should plausibly produce 1, 10 ambiguous/hard. Store as a stable artifact (interaction id, expected count, expected type, notes). Add a runner that scores extractor output against labels.
Success: 30 labeled interactions in a stable artifact, one-command precision/recall output. Fail-early: if labeling 30 takes more than a day because the concept is unclear, tighten the extraction target before touching code.
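As an illustration of the Day 1 runner, here is a minimal sketch that scores extractor output against the labeled artifact. It assumes count-level scoring only (expected candidate count versus extracted candidate count per interaction), which is a simplification; the `Label` fields mirror the artifact columns named above, and `score` is a hypothetical name rather than the actual runner.

```python
from dataclasses import dataclass


@dataclass
class Label:
    interaction_id: str
    expected_count: int
    expected_type: str = ""
    notes: str = ""


def score(labels: list[Label], extracted_counts: dict[str, int]) -> dict:
    """Count-level TP/FP/FN plus precision and recall over the labeled set."""
    tp = fp = fn = 0
    for label in labels:
        got = extracted_counts.get(label.interaction_id, 0)
        tp += min(got, label.expected_count)          # candidates we wanted and got
        fp += max(0, got - label.expected_count)      # extra candidates
        fn += max(0, label.expected_count - got)      # missed candidates
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall}
```

Count-level scoring cannot tell whether the extracted candidate is the right one, only whether the number is right; matching on `expected_type` would be the natural next refinement.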
Day 2 - Measure current extractor
Run the rule-based extractor on all 30. Record yield, TP, FP, FN. Bucket misses by class (conversational preference, decision summary, status/constraint, meta chatter).
Success: short scorecard with counts by miss type, top 2 miss classes obvious. Fail-early: if the labeled set shows fewer than 5 plausible positives total, the corpus is too weak - relabel before tuning.
Day 3 - Smallest rule expansion for top miss class
Add 1-2 narrow, explainable rules for the worst miss class. Add unit tests from real paraphrase examples in the labeled set. Then rerun eval.
Success: recall up on the labeled set, false positives do not materially rise, new tests cover the new cue class. Fail-early: if one rule expansion raises FP above ~20% of extracted candidates, revert or narrow before adding more.
Day 4 - Decision gate: more rules or LLM-assisted prototype
If rule expansion reaches a meaningfully reviewable queue, keep going with rules. Otherwise prototype an LLM-assisted extraction mode behind a flag.
"Meaningfully reviewable queue":
- >= 15-25% candidate yield on the 30 labeled captures
- FP rate low enough that manual triage feels tolerable
- >= 2 real non-synthetic candidates worth review
Hard stop: if candidate yield is still under 10% after this point, stop rule tinkering and switch to architecture review (LLM-assisted OR narrower extraction scope).
Day 5 - Stabilize and document
Add remaining focused rules or the flagged LLM-assisted path. Write down in-scope and out-of-scope utterance kinds.
Success: labeled eval green against target threshold, extractor scope explainable in <= 5 bullets.
Day 6 - Retrieval harness expansion (6 -> 15-20 fixtures)
Grow across p04/p05/p06. Include short ambiguous prompts, cross-project collision cases, expected project-state wins, expected project-memory wins, and 1-2 "should fail open / low confidence" cases.
Success: >= 15 fixtures, each active project has easy + medium + hard cases. Fail-early: if fixtures are mostly obvious wins, add harder adversarial cases before claiming coverage.
Day 7 - Regression pass and calibration
Run harness on current code vs live Dalidou. Inspect failures (ranking, ingestion gap, project bleed, budget). Make at most ONE ranking/budget tweak if the harness clearly justifies it. Do not mix harness expansion and ranking changes in a single commit unless tightly coupled.
Success: harness still passes or improves after extractor work; any ranking tweak is justified by a concrete fixture delta. Fail-early: if > 20-25% of harness fixtures regress after extractor changes, separate concerns before merging.
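The Day 7 fail-early check reduces to comparing two harness runs. A minimal sketch follows, assuming each run has already been reduced to a mapping of fixture name to pass/fail; the actual `--json` output shape of `scripts/retrieval_eval.py` is not shown in this ledger, so that reduction step is left out. The 20% threshold is the lower bound of the stated 20-25% range.

```python
def regressed_fixtures(before: dict[str, bool], after: dict[str, bool]) -> list[str]:
    """Fixtures that passed before and now fail or are missing."""
    return sorted(
        name for name, passed in before.items()
        if passed and not after.get(name, False)
    )


def regression_rate(before: dict[str, bool], after: dict[str, bool]) -> float:
    """Share of baseline fixtures that regressed."""
    if not before:
        return 0.0
    return len(regressed_fixtures(before, after)) / len(before)


def should_separate_concerns(before, after, threshold: float = 0.20) -> bool:
    """True when regressions exceed the fail-early threshold."""
    return regression_rate(before, after) > threshold
```

Listing the regressed fixture names, not just the rate, keeps any ranking tweak tied to a concrete fixture delta as the success criterion requires.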
Day 8 - Merge and close
Clean commit sequence. Save before/after metrics (extractor scorecard, harness results). Update docs only with claims the metrics support.
Merge order: labeled corpus + runner -> extractor improvements + tests -> harness expansion -> any justified ranking tweak -> docs sync last.
Success: point to a before/after delta for both extraction and retrieval; docs do not overclaim.
Hard Gates (stop/rethink points)
- Extractor yield < 10% after 30 labeled interactions -> stop, reconsider rule-only extraction
- FP rate > 20% on labeled set -> narrow rules before adding more
- Harness expansion finds < 3 genuinely hard cases -> harness still too soft
- Ranking change improves one project but regresses another -> do not merge without explicit tradeoff note
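The four gates above can be expressed as one check that returns every tripped gate. This is a sketch under stated assumptions: `candidate_yield` and `fp_rate` are fractions computed elsewhere, `project_deltas` is a hypothetical per-project change in passing fixtures after a ranking tweak, and none of these names exist in the codebase.

```python
from dataclasses import dataclass, field


@dataclass
class PhaseMetrics:
    candidate_yield: float      # fraction, measured after 30 labeled interactions
    fp_rate: float              # false positives / extracted candidates
    hard_fixture_count: int     # harness fixtures judged genuinely hard
    # Change in passing fixtures per project after a ranking tweak (assumed shape).
    project_deltas: dict[str, int] = field(default_factory=dict)
    tradeoff_note: str = ""


def tripped_gates(m: PhaseMetrics) -> list[str]:
    """Return a message for each hard gate the metrics trip."""
    tripped = []
    if m.candidate_yield < 0.10:
        tripped.append("yield < 10%: stop, reconsider rule-only extraction")
    if m.fp_rate > 0.20:
        tripped.append("FP rate > 20%: narrow rules before adding more")
    if m.hard_fixture_count < 3:
        tripped.append("fewer than 3 hard cases: harness still too soft")
    improved = any(d > 0 for d in m.project_deltas.values())
    regressed = any(d < 0 for d in m.project_deltas.values())
    if improved and regressed and not m.tradeoff_note.strip():
        tripped.append("ranking tradeoff: do not merge without explicit note")
    return tripped
```

Returning all tripped gates rather than failing on the first one makes the Session Log entry easier to write, since every stop/rethink reason is visible at once.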
Branching
One branch codex/extractor-eval-loop for Day 1-5, a second codex/retrieval-harness-expansion for Day 6-7. Keeps extraction and retrieval judgments auditable.
Review Protocol
- Codex records review findings in Open Review Findings.
- Claude must read Open Review Findings at session start before coding.
- Codex owns finding text. Claude may update operational fields only: `status`, `owner`, `resolved_by`.
- If Claude disagrees with a finding, do not rewrite it. Mark it `declined` and explain why in the Session Log.
- Any commit or session that addresses a finding should reference the finding id in the commit message or Session Log.
- `P1` findings block further commits in the affected area until they are at least acknowledged and explicitly tracked.
- Findings may be code-level, claim-level, or ops-level. If the implementation boundary changes, retarget the finding instead of silently closing it.
Open Review Findings
| id | finder | severity | file:line | summary | status | owner | opened_at | resolved_by |
|---|---|---|---|---|---|---|---|---|
| R1 | Codex | P1 | deploy/hooks/capture_stop.py:76-85 | Live Claude capture still omits extract, so "loop closed both sides" remains overstated in practice even though the API supports it | fixed | Claude | 2026-04-11 | c67bec0 |
| R2 | Codex | P1 | src/atocore/context/builder.py | Project memories excluded from pack | fixed | Claude | 2026-04-11 | 8ea53f4 |
| R3 | Claude | P2 | src/atocore/memory/extractor.py | Rule cues (## Decision:) never fire on conversational LLM text | open | Claude | 2026-04-11 | |
| R4 | Codex | P2 | DEV-LEDGER.md:11 | Orientation main_tip was stale versus HEAD / origin/main | fixed | Codex | 2026-04-11 | 81307ce |
| R5 | Codex | P1 | src/atocore/interactions/service.py:157-174 | The deployed extraction path still calls only the rule extractor; the new LLM extractor is eval/script-only, so Day 4 "gate cleared" is true as a benchmark result but not as an operational extraction path | fixed | Claude | 2026-04-12 | c67bec0 |
| R6 | Codex | P1 | src/atocore/memory/extractor_llm.py:258-276 | LLM extraction accepts model-supplied project verbatim with no fallback to interaction.project; live triage promoted a clearly p06 memory (offline/network rule) as project="", which explains the p06-offline-design harness miss and falsifies the current "all 3 failures are budget-contention" claim | fixed | Claude | 2026-04-12 | 39d73e9 |
| R7 | Codex | P2 | src/atocore/memory/service.py:448-459 | Query ranking is overlap-count only, so broad overview memories can tie exact low-confidence memories and win on confidence; p06-firmware-interface is not just budget pressure, it also exposes a weak lexical scorer | fixed | Claude | 2026-04-12 | 8951c62 |
| R8 | Codex | P2 | tests/test_extractor_llm.py:1-7 | LLM extractor tests stop at parser/failure contracts; there is no automated coverage for the script-only persistence/review path that produced the 16 promoted memories, including project-scope preservation | fixed | Claude | 2026-04-12 | 69c9717 |
| R9 | Codex | P2 | src/atocore/memory/extractor_llm.py:258-259 | The R6 fallback only repairs empty project output. A wrong non-empty model project still overrides the interaction's known scope, so project attribution is improved but not yet trust-preserving. | fixed | Claude | 2026-04-12 | e5e9a99 |
| R10 | Codex | P2 | docs/master-plan-status.md:31-33 | "Phase 8 - OpenClaw Integration" is fair as a baseline milestone, but not as a "primary" integration claim. t420-openclaw/atocore.py currently covers a narrow read-oriented subset (13 request shapes vs 32 API routes) plus fail-open health, while memory/interactions/admin write paths remain out of surface. | open | Claude | 2026-04-12 | |
| R11 | Codex | P2 | src/atocore/api/routes.py:773-845 | POST /admin/extract-batch still accepts mode="llm" inside the container and returns a successful 0-candidate result instead of surfacing that host-only LLM extraction is unavailable from this runtime. That is a misleading API contract for operators. | open | Claude | 2026-04-12 | |
| R12 | Codex | P2 | scripts/batch_llm_extract_live.py:39-190 | The host-side extractor duplicates the LLM system prompt and JSON parsing logic from src/atocore/memory/extractor_llm.py. It works today, but this is now a prompt/parser drift risk across the container and host implementations. | open | Claude | 2026-04-12 | |
| R13 | Codex | P2 | DEV-LEDGER.md:12 | The new 286 passing test-count claim is not reproducibly auditable from the current audit environments: neither Dalidou nor the clean worktree has pytest available. The claim may be true in Claude's dev shell, but it remains unverified in this audit. | open | Claude | 2026-04-12 | |
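Findings R6 and R9 together describe a trust hierarchy for project attribution: the interaction's own project wins whenever it is set, and the model-supplied project is only a fallback for unscoped captures, subject to a registered-project check. A minimal sketch of that rule follows. The function name and signature are hypothetical, not the code in `src/atocore/memory/extractor_llm.py`, and returning an empty project for an unregistered model value is an assumption.

```python
def resolve_project(interaction_project: str,
                    model_project: str,
                    registered_projects: set[str]) -> str:
    """Pick the project for an extracted memory using the trust hierarchy."""
    interaction_project = (interaction_project or "").strip()
    model_project = (model_project or "").strip()

    # 1. Known interaction scope is the strongest signal and always wins.
    if interaction_project:
        return interaction_project

    # 2. Unscoped capture: accept the model's project only if it is registered.
    if model_project in registered_projects:
        return model_project

    # 3. Assumption: otherwise leave the memory unscoped.
    return ""
```

For example, with both projects registered, `resolve_project("p06-polisher", "p04-gigabit", ...)` stays on `p06-polisher`, which is the case R9 was opened for. This rule does not address within-project semantic errors, where the model names the right project but the content is wrong.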
Recent Decisions
- 2026-04-12 Day 4 gate cleared: LLM-assisted extraction via `claude -p` (OAuth, no API key) is the path forward. Rule extractor stays as default for structural cues. Proposed by: Claude. Ratified by: Antoine.
- 2026-04-12 First live triage: 16 promoted, 35 rejected from 51 LLM-extracted candidates. 31% accept rate. Active memory count 20->36. Executed by: Claude. Ratified by: Antoine.
- 2026-04-12 No API keys allowed in AtoCore: LLM-assisted features use OAuth via `claude -p` or equivalent CLI-authenticated paths. Proposed by: Antoine.
- 2026-04-12 Multi-model extraction direction: extraction/triage should be model-agnostic, with Codex/Gemini/Ollama as second-pass reviewers for robustness. Proposed by: Antoine.
- 2026-04-11 Adopt this ledger as shared operating memory between Claude and Codex. Proposed by: Antoine. Ratified by: Antoine.
- 2026-04-11 Accept Codex's 8-day mini-phase plan verbatim as Active Plan. Proposed by: Codex. Ratified by: Antoine.
- 2026-04-11 Review findings live in `DEV-LEDGER.md` with Codex owning finding text and Claude updating status fields only. Proposed by: Codex. Ratified by: Antoine.
- 2026-04-11 Project memories land in the pack under `--- Project Memories ---` at 25% budget ratio, gated on canonical project hint. Proposed by: Claude.
- 2026-04-11 Extraction stays off the capture hot path. Batch / manual only. Proposed by: Antoine.
- 2026-04-11 4-step roadmap: extractor -> harness expansion -> Wave 2 ingestion -> OpenClaw finish. Steps 1+2 as one mini-phase. Ratified by: Antoine.
- 2026-04-11 Codex branches must fork from `main`, not be orphan commits. Proposed by: Claude. Agreed by: Codex.
Session Log
- 2026-04-12 Claude Batch 3 (R9 fix): `144dbbd..e5e9a99`. Trust hierarchy for project attribution: interaction scope always wins when set, model project only used for unscoped interactions + registered check. 7 case tests (A-G) cover every combination. Harness 17/18 (no regression). Tests 286->290. Before: wrong registered project could silently override interaction scope. After: interaction.project is the strongest signal; model project is only a fallback for unscoped captures. Not yet guaranteed: nothing prevents the same project's model output from being semantically wrong within that project. R9 marked fixed.
- 2026-04-12 Codex (audit branch `codex/audit-batch2`) audited `69c9717..origin/main` against the current branch tip and live Dalidou. Verified: live build is `8951c62`, retrieval harness improved to 17/18 PASS, candidate queue is now empty, active memories rose to 41, and `python3 scripts/auto_triage.py --dry-run --base-url http://127.0.0.1:8100` runs cleanly on Dalidou but only exercised the empty-queue path. Updated R7 to fixed (`8951c62`) and R8 to fixed (`69c9717`). Kept R9 open because project trust-preservation still allows a wrong non-empty registered project from the model to override the interaction scope. Added R13 because the new `286 passing` claim could not be independently reproduced in this audit: `pytest` is absent on both Dalidou and the clean audit worktree. Also corrected stale Orientation fields (live SHA, main tip, harness, active/candidate memory counts).
- 2026-04-12 Codex (audit branch `codex/audit-2026-04-12-extraction`) audited `54d84b5..ac7f77d` with live Dalidou verification. Confirmed the host-side LLM extraction pipeline is operational: nightly cron points at `deploy/dalidou/cron-backup.sh`, Step 4 calls `deploy/dalidou/batch-extract.sh`, the batch script exists/executable on Dalidou, and a manual host-side run produced candidates successfully. Updated R1 and R5 to fixed (`c67bec0`) because extraction now runs unattended off-container. Live state during audit: build `39d73e9`, active memories 36, candidate queue 29 (16 existing + 13 added by manual verification run), and `last_extract_batch_run` populated in AtoCore project state. Added R11-R12 for the misleading container `mode=llm` no-op and host/container prompt-parser duplication. Security note: CLI positional prompt/response text is visible in process args while `claude -p` runs; acceptable on a single-user home host, but worth remembering if Dalidou's trust boundary changes.
- 2026-04-12 Codex (audit branch `codex/audit-2026-04-12-final`) audited `c5bad99..e2895b5` against origin/main, live Dalidou, and the OpenClaw client script. Live state checked: build `39d73e9`, harness reproducible at 16/18 PASS, active memories 36, and `t420-openclaw/atocore.py health` fails open correctly with `fail_open=true`. Spot-checks of Wave 2 project-state entries matched their cited vault docs. Updated R5-R8 status reality (R6 fixed by `39d73e9`), added R9-R10, and corrected Orientation `main_tip` to `e2895b5` because the ledger had drifted behind origin/main. Note: live Dalidou is still on `39d73e9`, so branch-truth and deploy-truth are not the same yet.
- 2026-04-12 Claude Wave 2 trusted operational ingestion + codex audit response. Read 6 vault docs, created 8 new Trusted Project State entries (p04 +2, p05 +3, p06 +3). Fixed R6 (project fallback in LLM extractor) per codex audit. Fixed misscoped p06 offline memory on live Dalidou. Merged codex/audit-2026-04-12. Switched default LLM model from haiku to sonnet. Harness 15/18 -> 16/18. Tests 278 -> 280. main_tip `146f2e4` -> `39d73e9`.
- 2026-04-12 Codex (audit branch `codex/audit-2026-04-12`) audited `c5bad99..146f2e4` against code, live Dalidou, and the 36 active memories. Confirmed: `claude -p` invocation is not shell-injection-prone (`subprocess.run(args)` with no shell), off-host backup wiring matches the ledger, and R1 remains unresolved in practice. Added R5-R8. Corrected Orientation `main_tip` (`146f2e4`, not `5c69f77`) and tightened the harness note: p06-firmware-interface is a ranking-tie issue, p06-offline-design comes from a project-scope miss in live triage, and p06-tailscale is retrieved-chunk bleed rather than memory-band budget contention.
- 2026-04-12 Claude `06792d8..5c69f77` Day 5-8 close. Documented extractor scope (5 in-scope, 6 out-of-scope categories). Expanded harness from 6 to 18 fixtures (p04 +1, p05 +1, p06 +7, adversarial +2). Per-entry memory cap at 250 chars fixed 1 of 4 budget-contention failures. Final harness: 15/18 PASS. Mini-phase complete. Before/after: rule extractor 0% recall -> LLM 100%; harness 6/6 -> 15/18; active memories 20 -> 36.
- 2026-04-12 Claude `330ecfb..06792d8` (merged eval-loop branch + triage). Day 1-4 of the mini-phase completed in one session. Day 2 baseline: rule extractor 0% recall, 5 distinct miss classes. Day 4 gate cleared: LLM extractor (claude -p haiku, OAuth) hit 100% recall, 2.55 yield/interaction. Refactored from anthropic SDK to subprocess after "no API key" rule. First live triage: 51 candidates -> 16 promoted, 35 rejected. Active memories 20->36. p06-polisher went from 2 to 16 memories (firmware/telemetry architecture set). POST /memory now accepts status field. Test count 264->278.
- 2026-04-11 Claude `claude/extractor-eval-loop @ 7d8d599`: Day 1+2 of the mini-phase. Froze a 64-interaction snapshot (`scripts/eval_data/interactions_snapshot_2026-04-11.json`) and labeled 20 by length-stratified random sample (5 positive, 15 zero; 7 total expected candidates). Built `scripts/extractor_eval.py` as a file-based eval runner. Day 2 baseline: rule extractor hit 0% yield / 0% recall / 0% precision on the labeled set; 5 false negatives across 5 distinct miss classes (recommendation_prose, architectural_change_summary, spec_update_announcement, layered_recommendation, alignment_assertion). This is the Day 4 hard-stop signal arriving two days early: a single rule expansion cannot close a 5-way miss, and widening rules blindly will collapse precision. The Day 4 decision gate is escalated to Antoine for ratification before Day 3 touches any extractor code. No extractor code on main has changed.
- 2026-04-11 Codex (ledger audit) fixed stale `main_tip`, retargeted R1 from the API surface to the live Claude Stop hook, and formalized the review write protocol so Claude can consume findings without rewriting them.
- 2026-04-11 Claude `b3253f3..59331e5` (1 commit). Wired the DEV-LEDGER, added session protocol to AGENTS.md, created project-local CLAUDE.md, deleted stale `codex/port-atocore-ops-client` remote branch. No code changes, no redeploy needed.
- 2026-04-11 Claude `c5bad99..b3253f3` (11 commits + 1 merge). Length-aware reinforcement, project memories in pack, query-relevance memory ranking, hyphenated-identifier tokenizer, retrieval eval harness seeded, off-host backup wired end-to-end, docs synced, codex integration-pass branch merged. Harness went 0->6/6 on live Dalidou.
- 2026-04-11 Codex (async review) identified 2 P1s against a stale checkout. R1 was fair (extraction not automated), R2 was outdated (project memories already landed on main). Delivered the 8-day execution plan now in Active Plan.
- 2026-04-06 Antoine created `codex/atocore-integration-pass` with the `t420-openclaw/` workspace (merged 2026-04-11).
Working Rules
- Claude builds; Codex audits. No parallel work on the same files.
- Codex branches fork from `main`: `git fetch origin && git checkout -b codex/<topic> origin/main`.
- P1 findings block further main commits until acknowledged in Open Review Findings.
- Every session appends at least one Session Log line and bumps Orientation.
- Trim Session Log and Recent Decisions to the last 20 at session end.
- Docs in `docs/` may overclaim stale status; the ledger is the one-file source of truth for "what is true right now."
Quick Commands
# Check live state
ssh papa@dalidou "curl -s http://localhost:8100/health"
# Run the retrieval harness
python scripts/retrieval_eval.py # human-readable
python scripts/retrieval_eval.py --json # machine-readable
# Deploy a new main tip
git push origin main && ssh papa@dalidou "bash /srv/storage/atocore/app/deploy/dalidou/deploy.sh"
# Reflection-loop ops
python scripts/atocore_client.py batch-extract '' '' 200 false # preview
python scripts/atocore_client.py batch-extract '' '' 200 true # persist
python scripts/atocore_client.py triage