
Anto01 dbb8f915e2 chore(ledger): Batch 3 close — R9 fixed, before/after documented

Before: a model returning 'p04-gigabit' for a p06-polisher
interaction would silently override the known scope because the
project was registered. After: interaction.project always wins
when set. Model project is only a fallback for unscoped captures.

Not yet guaranteed: within-project semantic errors (model says
the right project but wrong content). That's a content-quality
concern, not a trust-hierarchy issue.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-12 15:38:19 -04:00


AtoCore Dev Ledger

Shared operating memory between humans, Claude, and Codex. Every session MUST read this file at start and append a Session Log entry before ending. Section headers are stable - do not rename them. Trim Session Log and Recent Decisions to the last 20 entries at session end; older history lives in git log and docs/.

Orientation

live_sha (Dalidou /health build_sha): 8951c62 (R9 fix at e5e9a99 not yet deployed)
last_updated: 2026-04-12 by Claude (Batch 3 R9 fix)
main_tip: e5e9a99
test_count: 290 passing (local dev shell)
harness: 17/18 PASS (only p06-tailscale still failing)
active_memories: 41
candidate_memories: 0
project_state_entries: p04=5, p05=6, p06=6 (Wave 2 entries still present on live Dalidou; 17 total visible)
off_host_backup: papa@192.168.86.39:/home/papa/atocore-backups/ via cron env ATOCORE_BACKUP_RSYNC, verified

Active Plan

Mini-phase: Extractor improvement (eval-driven) + retrieval harness expansion. Duration: 8 days, hard gates at each day boundary. Plan author: Codex (2026-04-11). Executor: Claude. Audit: Codex.

Preflight (before Day 1)

Stop if any of these fail:

git rev-parse HEAD on main matches the expected branching tip
Live /health on Dalidou reports the SHA you think is deployed
python scripts/retrieval_eval.py --json still passes at the current baseline
batch-extract over the known 42-capture slice reproduces the current low-yield baseline
A frozen sample set exists for extractor labeling so the target does not move mid-phase

Success: baseline eval output saved, baseline extract output saved, working branch created from origin/main.
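The first two preflight checks both come down to comparing SHAs that may be abbreviated to different lengths (Orientation records 7-character SHAs, while git rev-parse HEAD prints the full one). A minimal sketch of that comparison follows. It assumes the /health response is JSON with a build_sha field, as the Orientation label suggests; the helper names and the exact response shape are hypothetical and not taken from the repo.

```python
import json
import subprocess
import urllib.request


def sha_matches(a: str, b: str) -> bool:
    """True when two git SHAs agree, allowing one to abbreviate the other."""
    a, b = a.strip().lower(), b.strip().lower()
    if not a or not b:
        return False
    shorter, longer = sorted((a, b), key=len)
    return longer.startswith(shorter)


def local_head() -> str:
    """Full SHA of the current checkout, via git rev-parse HEAD."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def live_build_sha(health_url: str) -> str:
    """Read build_sha from a /health endpoint (response shape is an assumption)."""
    with urllib.request.urlopen(health_url, timeout=10) as resp:
        return json.load(resp)["build_sha"]
```

A preflight script could stop early when sha_matches(live_build_sha(url), expected_sha) is false, rather than relying on the SHA someone remembers deploying.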

Day 1 - Labeled extractor eval set

Pick 30 real captures: 10 that should produce 0 candidates, 10 that should plausibly produce 1, 10 ambiguous/hard. Store as a stable artifact (interaction id, expected count, expected type, notes). Add a runner that scores extractor output against labels.

Success: 30 labeled interactions in a stable artifact, one-command precision/recall output. Fail-early: if labeling 30 takes more than a day because the concept is unclear, tighten the extraction target before touching code.
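One way to score extractor output against the labeled artifact described above is to compare counts per interaction. This is a sketch under stated assumptions, not the repo's runner: the Label fields mirror the artifact columns named in the plan, and matching by count alone (ignoring candidate type and content) is a simplification chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Label:
    interaction_id: str
    expected_count: int
    expected_type: str = ""
    notes: str = ""


def score(labels, extracted_counts):
    """Count-based precision/recall over a labeled set.

    extracted_counts maps interaction id -> number of candidates the
    extractor produced; a missing id counts as zero candidates.
    """
    tp = fp = fn = 0
    for label in labels:
        got = extracted_counts.get(label.interaction_id, 0)
        tp += min(got, label.expected_count)
        fp += max(got - label.expected_count, 0)
        fn += max(label.expected_count - got, 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}
```

Count-based scoring cannot tell a correct candidate from a wrong one of the same count, so a stricter runner would also compare expected_type.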

Day 2 - Measure current extractor

Run the rule-based extractor on all 30. Record yield, TP, FP, FN. Bucket misses by class (conversational preference, decision summary, status/constraint, meta chatter).

Success: short scorecard with counts by miss type, top 2 miss classes obvious. Fail-early: if the labeled set shows fewer than 5 plausible positives total, the corpus is too weak - relabel before tuning.
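Bucketing misses by class is a small counting exercise. The sketch below assumes each false negative has been hand-tagged with a miss class during review; the (interaction id, miss class) tuple shape and the function name are hypothetical.

```python
from collections import Counter


def top_miss_classes(false_negatives, n=2):
    """Return the n most frequent miss classes as (class, count) pairs.

    false_negatives is an iterable of (interaction_id, miss_class) tuples.
    """
    counts = Counter(miss_class for _, miss_class in false_negatives)
    return counts.most_common(n)
```

If the top two classes do not clearly dominate the rest, that is itself a signal that a single narrow rule expansion on Day 3 is unlikely to move recall much.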

Day 3 - Smallest rule expansion for top miss class

Add 1-2 narrow, explainable rules for the worst miss class. Add unit tests from real paraphrase examples in the labeled set. Then rerun eval.

Success: recall up on the labeled set, false positives do not materially rise, new tests cover the new cue class. Fail-early: if one rule expansion raises FP above ~20% of extracted candidates, revert or narrow before adding more.

Day 4 - Decision gate: more rules or LLM-assisted prototype

If rule expansion reaches a meaningfully reviewable queue, keep going with rules. Otherwise prototype an LLM-assisted extraction mode behind a flag.

"Meaningfully reviewable queue":

>= 15-25% candidate yield on the 30 labeled captures
FP rate low enough that manual triage feels tolerable
>= 2 real non-synthetic candidates worth review

Hard stop: if candidate yield is still under 10% after this point, stop rule tinkering and switch to architecture review (LLM-assisted OR narrower extraction scope).

Day 5 - Stabilize and document

Add remaining focused rules or the flagged LLM-assisted path. Write down in-scope and out-of-scope utterance kinds.

Success: labeled eval green against target threshold, extractor scope explainable in <= 5 bullets.

Day 6 - Retrieval harness expansion (6 -> 15-20 fixtures)

Grow across p04/p05/p06. Include short ambiguous prompts, cross-project collision cases, expected project-state wins, expected project-memory wins, and 1-2 "should fail open / low confidence" cases.

Success: >= 15 fixtures, each active project has easy + medium + hard cases. Fail-early: if fixtures are mostly obvious wins, add harder adversarial cases before claiming coverage.

Day 7 - Regression pass and calibration

Run harness on current code vs live Dalidou. Inspect failures (ranking, ingestion gap, project bleed, budget). Make at most ONE ranking/budget tweak if the harness clearly justifies it. Do not mix harness expansion and ranking changes in a single commit unless tightly coupled.

Success: harness still passes or improves after extractor work; any ranking tweak is justified by a concrete fixture delta. Fail-early: if > 20-25% of harness fixtures regress after extractor changes, separate concerns before merging.

Day 8 - Merge and close

Clean commit sequence. Save before/after metrics (extractor scorecard, harness results). Update docs only with claims the metrics support.

Merge order: labeled corpus + runner -> extractor improvements + tests -> harness expansion -> any justified ranking tweak -> docs sync last.

Success: point to a before/after delta for both extraction and retrieval; docs do not overclaim.

Hard Gates (stop/rethink points)

Extractor yield < 10% after 30 labeled interactions -> stop, reconsider rule-only extraction
FP rate > 20% on labeled set -> narrow rules before adding more
Harness expansion finds < 3 genuinely hard cases -> harness still too soft
Ranking change improves one project but regresses another -> do not merge without explicit tradeoff note
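The gates above can be checked mechanically once the scorecard and harness numbers exist. A minimal sketch, assuming rates are expressed as fractions (0.10 for 10%) and that FP rate means false positives divided by extracted candidates, as in the Day 3 fail-early note; the function names and the per-project delta dictionary are hypothetical.

```python
def hard_gate_failures(yield_rate: float, fp_rate: float, hard_cases: int) -> list[str]:
    """Return the hard gates that are currently tripped (empty list = all clear)."""
    failures = []
    if yield_rate < 0.10:
        failures.append("extractor yield < 10%: stop, reconsider rule-only extraction")
    if fp_rate > 0.20:
        failures.append("FP rate > 20%: narrow rules before adding more")
    if hard_cases < 3:
        failures.append("fewer than 3 genuinely hard cases: harness still too soft")
    return failures


def ranking_needs_tradeoff_note(pass_deltas: dict[str, int]) -> bool:
    """True when a ranking change helps at least one project and hurts another.

    pass_deltas maps project id -> change in passing fixtures for that project.
    """
    values = list(pass_deltas.values())
    return any(v > 0 for v in values) and any(v < 0 for v in values)
```

Keeping the thresholds in one function makes it harder for a session to relax a gate quietly while tuning.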

Branching

One branch codex/extractor-eval-loop for Day 1-5, a second codex/retrieval-harness-expansion for Day 6-7. Keeps extraction and retrieval judgments auditable.

Review Protocol

Codex records review findings in Open Review Findings.
Claude must read Open Review Findings at session start before coding.
Codex owns finding text. Claude may update operational fields only:
- status
- owner
- resolved_by
If Claude disagrees with a finding, do not rewrite it. Mark it declined and explain why in the Session Log.
Any commit or session that addresses a finding should reference the finding id in the commit message or Session Log.
P1 findings block further commits in the affected area until they are at least acknowledged and explicitly tracked.
Findings may be code-level, claim-level, or ops-level. If the implementation boundary changes, retarget the finding instead of silently closing it.

Open Review Findings

id	finder	severity	file:line	summary	status	owner	opened_at	resolved_by
R1	Codex	P1	deploy/hooks/capture_stop.py:76-85	Live Claude capture still omits `extract`, so "loop closed both sides" remains overstated in practice even though the API supports it	fixed	Claude	2026-04-11	`c67bec0`
R2	Codex	P1	src/atocore/context/builder.py	Project memories excluded from pack	fixed	Claude	2026-04-11	`8ea53f4`
R3	Claude	P2	src/atocore/memory/extractor.py	Rule cues (`## Decision:`) never fire on conversational LLM text	open	Claude	2026-04-11
R4	Codex	P2	DEV-LEDGER.md:11	Orientation `main_tip` was stale versus `HEAD` / `origin/main`	fixed	Codex	2026-04-11	`81307ce`
R5	Codex	P1	src/atocore/interactions/service.py:157-174	The deployed extraction path still calls only the rule extractor; the new LLM extractor is eval/script-only, so Day 4 "gate cleared" is true as a benchmark result but not as an operational extraction path	fixed	Claude	2026-04-12	`c67bec0`
R6	Codex	P1	src/atocore/memory/extractor_llm.py:258-276	LLM extraction accepts model-supplied `project` verbatim with no fallback to `interaction.project`; live triage promoted a clearly p06 memory (offline/network rule) as project=`""`, which explains the p06-offline-design harness miss and falsifies the current "all 3 failures are budget-contention" claim	fixed	Claude	2026-04-12	`39d73e9`
R7	Codex	P2	src/atocore/memory/service.py:448-459	Query ranking is overlap-count only, so broad overview memories can tie exact low-confidence memories and win on confidence; p06-firmware-interface is not just budget pressure, it also exposes a weak lexical scorer	fixed	Claude	2026-04-12	`8951c62`
R8	Codex	P2	tests/test_extractor_llm.py:1-7	LLM extractor tests stop at parser/failure contracts; there is no automated coverage for the script-only persistence/review path that produced the 16 promoted memories, including project-scope preservation	fixed	Claude	2026-04-12	`69c9717`
R9	Codex	P2	src/atocore/memory/extractor_llm.py:258-259	The R6 fallback only repairs empty project output. A wrong non-empty model project still overrides the interaction's known scope, so project attribution is improved but not yet trust-preserving.	fixed	Claude	2026-04-12	`e5e9a99`
R10	Codex	P2	docs/master-plan-status.md:31-33	"Phase 8 - OpenClaw Integration" is fair as a baseline milestone, but not as a "primary" integration claim. `t420-openclaw/atocore.py` currently covers a narrow read-oriented subset (13 request shapes vs 32 API routes) plus fail-open health, while memory/interactions/admin write paths remain out of surface.	open	Claude	2026-04-12
R11	Codex	P2	src/atocore/api/routes.py:773-845	`POST /admin/extract-batch` still accepts `mode="llm"` inside the container and returns a successful 0-candidate result instead of surfacing that host-only LLM extraction is unavailable from this runtime. That is a misleading API contract for operators.	open	Claude	2026-04-12
R12	Codex	P2	scripts/batch_llm_extract_live.py:39-190	The host-side extractor duplicates the LLM system prompt and JSON parsing logic from `src/atocore/memory/extractor_llm.py`. It works today, but this is now a prompt/parser drift risk across the container and host implementations.	open	Claude	2026-04-12
R13	Codex	P2	DEV-LEDGER.md:12	The new `286 passing` test-count claim is not reproducibly auditable from the current audit environments: neither Dalidou nor the clean worktree has `pytest` available. The claim may be true in Claude's dev shell, but it remains unverified in this audit.	open	Claude	2026-04-12
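Findings R6 and R9 together describe a trust hierarchy for project attribution: the interaction's known project wins whenever it is set, and the model-supplied project is only a fallback for unscoped captures, subject to a registered-project check. The sketch below illustrates that ordering. It is not the code in extractor_llm.py; the function name, the registered-project set, and the choice to return an empty project when the model names an unregistered one are assumptions.

```python
def resolve_project(interaction_project, model_project, registered_projects):
    """Pick the project for an extracted memory.

    Order of trust (assumed from the ledger's description):
      1. interaction.project, whenever it is non-empty
      2. the model's project, only if the interaction is unscoped
         and the project is registered
      3. otherwise unscoped ("")
    """
    interaction_project = (interaction_project or "").strip()
    model_project = (model_project or "").strip()
    if interaction_project:
        return interaction_project
    if model_project in registered_projects:
        return model_project
    return ""
```

With this ordering, a model that answers 'p04-gigabit' for a p06-polisher interaction can no longer change the scope. It still cannot catch content that is wrong within the correct project.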

Recent Decisions

2026-04-12 Day 4 gate cleared: LLM-assisted extraction via claude -p (OAuth, no API key) is the path forward. Rule extractor stays as default for structural cues. Proposed by: Claude. Ratified by: Antoine.
2026-04-12 First live triage: 16 promoted, 35 rejected from 51 LLM-extracted candidates. 31% accept rate. Active memory count 20->36. Executed by: Claude. Ratified by: Antoine.
2026-04-12 No API keys allowed in AtoCore — LLM-assisted features use OAuth via claude -p or equivalent CLI-authenticated paths. Proposed by: Antoine.
2026-04-12 Multi-model extraction direction: extraction/triage should be model-agnostic, with Codex/Gemini/Ollama as second-pass reviewers for robustness. Proposed by: Antoine.
2026-04-11 Adopt this ledger as shared operating memory between Claude and Codex. Proposed by: Antoine. Ratified by: Antoine.
2026-04-11 Accept Codex's 8-day mini-phase plan verbatim as Active Plan. Proposed by: Codex. Ratified by: Antoine.
2026-04-11 Review findings live in DEV-LEDGER.md with Codex owning finding text and Claude updating status fields only. Proposed by: Codex. Ratified by: Antoine.
2026-04-11 Project memories land in the pack under --- Project Memories --- at 25% budget ratio, gated on canonical project hint. Proposed by: Claude.
2026-04-11 Extraction stays off the capture hot path. Batch / manual only. Proposed by: Antoine.
2026-04-11 4-step roadmap: extractor -> harness expansion -> Wave 2 ingestion -> OpenClaw finish. Steps 1+2 as one mini-phase. Ratified by: Antoine.
2026-04-11 Codex branches must fork from main, not be orphan commits. Proposed by: Claude. Agreed by: Codex.

Session Log

2026-04-12 Claude Batch 3 (R9 fix): 144dbbd..e5e9a99. Trust hierarchy for project attribution: interaction scope always wins when set; the model's project is used only for unscoped interactions and must pass the registered-project check. 7 case tests (A-G) cover every combination. Harness 17/18 (no regression). Tests 286->290. Before: a wrong but registered model project could silently override interaction scope. After: interaction.project is the strongest signal; model project is only a fallback for unscoped captures. Not yet guaranteed: within-project semantic correctness (the model can name the right project and still get the content wrong). R9 marked fixed.
2026-04-12 Codex (audit branch codex/audit-batch2) audited 69c9717..origin/main against the current branch tip and live Dalidou. Verified: live build is 8951c62, retrieval harness improved to 17/18 PASS, candidate queue is now empty, active memories rose to 41, and python3 scripts/auto_triage.py --dry-run --base-url http://127.0.0.1:8100 runs cleanly on Dalidou but only exercised the empty-queue path. Updated R7 to fixed (8951c62) and R8 to fixed (69c9717). Kept R9 open because project trust-preservation still allows a wrong non-empty registered project from the model to override the interaction scope. Added R13 because the new 286 passing claim could not be independently reproduced in this audit: pytest is absent on both Dalidou and the clean audit worktree. Also corrected stale Orientation fields (live SHA, main tip, harness, active/candidate memory counts).
2026-04-12 Codex (audit branch codex/audit-2026-04-12-extraction) audited 54d84b5..ac7f77d with live Dalidou verification. Confirmed the host-side LLM extraction pipeline is operational: nightly cron points at deploy/dalidou/cron-backup.sh, Step 4 calls deploy/dalidou/batch-extract.sh, the batch script exists/executable on Dalidou, and a manual host-side run produced candidates successfully. Updated R1 and R5 to fixed (c67bec0) because extraction now runs unattended off-container. Live state during audit: build 39d73e9, active memories 36, candidate queue 29 (16 existing + 13 added by manual verification run), and last_extract_batch_run populated in AtoCore project state. Added R11-R12 for the misleading container mode=llm no-op and host/container prompt-parser duplication. Security note: CLI positional prompt/response text is visible in process args while claude -p runs; acceptable on a single-user home host, but worth remembering if Dalidou's trust boundary changes.
2026-04-12 Codex (audit branch codex/audit-2026-04-12-final) audited c5bad99..e2895b5 against origin/main, live Dalidou, and the OpenClaw client script. Live state checked: build 39d73e9, harness reproducible at 16/18 PASS, active memories 36, and t420-openclaw/atocore.py health fails open correctly with fail_open=true. Spot-checks of Wave 2 project-state entries matched their cited vault docs. Updated R5-R8 status reality (R6 fixed by 39d73e9), added R9-R10, and corrected Orientation main_tip to e2895b5 because the ledger had drifted behind origin/main. Note: live Dalidou is still on 39d73e9, so branch-truth and deploy-truth are not the same yet.
2026-04-12 Claude Wave 2 trusted operational ingestion + codex audit response. Read 6 vault docs, created 8 new Trusted Project State entries (p04 +2, p05 +3, p06 +3). Fixed R6 (project fallback in LLM extractor) per codex audit. Fixed misscoped p06 offline memory on live Dalidou. Merged codex/audit-2026-04-12. Switched default LLM model from haiku to sonnet. Harness 15/18 -> 16/18. Tests 278 -> 280. main_tip 146f2e4 -> 39d73e9.
2026-04-12 Codex (audit branch codex/audit-2026-04-12) audited c5bad99..146f2e4 against code, live Dalidou, and the 36 active memories. Confirmed: claude -p invocation is not shell-injection-prone (subprocess.run(args) with no shell), off-host backup wiring matches the ledger, and R1 remains unresolved in practice. Added R5-R8. Corrected Orientation main_tip (146f2e4, not 5c69f77) and tightened the harness note: p06-firmware-interface is a ranking-tie issue, p06-offline-design comes from a project-scope miss in live triage, and p06-tailscale is retrieved-chunk bleed rather than memory-band budget contention.
2026-04-12 Claude 06792d8..5c69f77 Day 5-8 close. Documented extractor scope (5 in-scope, 6 out-of-scope categories). Expanded harness from 6 to 18 fixtures (p04 +1, p05 +1, p06 +7, adversarial +2). Per-entry memory cap at 250 chars fixed 1 of 4 budget-contention failures. Final harness: 15/18 PASS. Mini-phase complete. Before/after: rule extractor 0% recall -> LLM 100%; harness 6/6 -> 15/18; active memories 20 -> 36.
2026-04-12 Claude 330ecfb..06792d8 (merged eval-loop branch + triage). Day 1-4 of the mini-phase completed in one session. Day 2 baseline: rule extractor 0% recall, 5 distinct miss classes. Day 4 gate cleared: LLM extractor (claude -p haiku, OAuth) hit 100% recall, 2.55 yield/interaction. Refactored from anthropic SDK to subprocess after "no API key" rule. First live triage: 51 candidates -> 16 promoted, 35 rejected. Active memories 20->36. p06-polisher went from 2 to 16 memories (firmware/telemetry architecture set). POST /memory now accepts status field. Test count 264->278.
2026-04-11 Claude claude/extractor-eval-loop @ 7d8d599 — Day 1+2 of the mini-phase. Froze a 64-interaction snapshot (scripts/eval_data/interactions_snapshot_2026-04-11.json) and labeled 20 by length-stratified random sample (5 positive, 15 zero; 7 total expected candidates). Built scripts/extractor_eval.py as a file-based eval runner. Day 2 baseline: rule extractor hit 0% yield / 0% recall / 0% precision on the labeled set; 5 false negatives across 5 distinct miss classes (recommendation_prose, architectural_change_summary, spec_update_announcement, layered_recommendation, alignment_assertion). This is the Day 4 hard-stop signal arriving two days early — a single rule expansion cannot close a 5-way miss, and widening rules blindly will collapse precision. The Day 4 decision gate is escalated to Antoine for ratification before Day 3 touches any extractor code. No extractor code on main has changed.
2026-04-11 Codex (ledger audit) fixed stale main_tip, retargeted R1 from the API surface to the live Claude Stop hook, and formalized the review write protocol so Claude can consume findings without rewriting them.
2026-04-11 Claude b3253f3..59331e5 (1 commit). Wired the DEV-LEDGER, added session protocol to AGENTS.md, created project-local CLAUDE.md, deleted stale codex/port-atocore-ops-client remote branch. No code changes, no redeploy needed.
2026-04-11 Claude c5bad99..b3253f3 (11 commits + 1 merge). Length-aware reinforcement, project memories in pack, query-relevance memory ranking, hyphenated-identifier tokenizer, retrieval eval harness seeded, off-host backup wired end-to-end, docs synced, codex integration-pass branch merged. Harness went 0->6/6 on live Dalidou.
2026-04-11 Codex (async review) identified 2 P1s against a stale checkout. R1 was fair (extraction not automated), R2 was outdated (project memories already landed on main). Delivered the 8-day execution plan now in Active Plan.
2026-04-06 Antoine created codex/atocore-integration-pass with the t420-openclaw/ workspace (merged 2026-04-11).
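The audit entries above note that the claude -p invocation goes through subprocess.run(args) with no shell, which is why it is not shell-injection-prone. A small sketch of that property follows, using the Python interpreter as a stand-in child process rather than the real CLI; the function name is hypothetical.

```python
import subprocess
import sys


def echo_via_argv(text: str) -> str:
    """Pass text as a single argv element to a child process and return it.

    With a list of args and no shell=True, the text is never parsed by a
    shell, so metacharacters such as ';' or '$( )' arrive as literal bytes.
    """
    child_code = "import sys; sys.stdout.write(sys.argv[1])"
    args = [sys.executable, "-c", child_code, text]
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout
```

The same call shape also explains the security note in the extraction audit entry: anything passed positionally is visible in the process argument list for as long as the child runs.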

Working Rules

Claude builds; Codex audits. No parallel work on the same files.
Codex branches fork from main: git fetch origin && git checkout -b codex/<topic> origin/main.
P1 findings block further main commits until acknowledged in Open Review Findings.
Every session appends at least one Session Log line and bumps Orientation.
Trim Session Log and Recent Decisions to the last 20 at session end.
Docs in docs/ may overclaim stale status; the ledger is the one-file source of truth for "what is true right now."

Quick Commands

# Check live state
ssh papa@dalidou "curl -s http://localhost:8100/health"

# Run the retrieval harness
python scripts/retrieval_eval.py            # human-readable
python scripts/retrieval_eval.py --json     # machine-readable

# Deploy a new main tip
git push origin main && ssh papa@dalidou "bash /srv/storage/atocore/app/deploy/dalidou/deploy.sh"

# Reflection-loop ops
python scripts/atocore_client.py batch-extract '' '' 200 false   # preview
python scripts/atocore_client.py batch-extract '' '' 200 true    # persist
python scripts/atocore_client.py triage
