AtoCore Dev Ledger
Shared operating memory between humans, Claude, and Codex. Every session MUST read this file at start and append a Session Log entry before ending. Section headers are stable — do not rename them. Trim Session Log and Recent Decisions to the last 20 entries at session end; older history lives in
git log and docs/.
Orientation
- live_sha (Dalidou /health build_sha): 38f6e52
- last_updated: 2026-04-11 by Claude (ledger wired)
- main_tip: 59331e5
- test_count: 264 passing
- harness: 6/6 PASS (python scripts/retrieval_eval.py against live Dalidou)
- off_host_backup: papa@192.168.86.39:/home/papa/atocore-backups/ via cron env ATOCORE_BACKUP_RSYNC, verified
Active Plan
Mini-phase: Extractor improvement (eval-driven) + retrieval harness expansion. Duration: 8 days, hard gates at each day boundary. Plan author: Codex (2026-04-11). Executor: Claude. Audit: Codex.
Preflight (before Day 1)
Stop if any of these fail:
- git rev-parse HEAD on main matches the expected branching tip
- Live /health on Dalidou reports the SHA you think is deployed
- python scripts/retrieval_eval.py --json still passes at the current baseline
- batch-extract over the known 42-capture slice reproduces the current low-yield baseline
- A frozen sample set exists for extractor labeling so the target does not move mid-phase
Success: baseline eval output saved, baseline extract output saved, working branch created from origin/main.
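The two SHA checks in the preflight list can be scripted. A minimal sketch, assuming the /health response is JSON that exposes the deployed SHA as build_sha (as the Orientation entry suggests) and that either side may be an abbreviated SHA. The function names and the payload shape are hypothetical, not part of the repo.

```python
def sha_matches(expected: str, reported: str) -> bool:
    """True when one SHA is a prefix of the other (tolerates abbreviated SHAs)."""
    expected, reported = expected.strip().lower(), reported.strip().lower()
    if not expected or not reported:
        return False
    return expected.startswith(reported) or reported.startswith(expected)


def preflight_failures(local_head: str, expected_tip: str,
                       health: dict, expected_live: str) -> list[str]:
    """Return the failed SHA checks; an empty list means these two checks pass."""
    failures = []
    if not sha_matches(expected_tip, local_head):
        failures.append("main HEAD does not match the expected branching tip")
    # Assumption: the health payload carries the deployed SHA under "build_sha".
    if not sha_matches(expected_live, str(health.get("build_sha", ""))):
        failures.append("live /health SHA does not match the expected deploy")
    return failures
```

Feed it the output of git rev-parse HEAD and the parsed /health body; stop the phase if the returned list is non-empty.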
Day 1 — Labeled extractor eval set
Pick 30 real captures: 10 that should produce 0 candidates, 10 that should plausibly produce 1, 10 ambiguous/hard. Store as a stable artifact (interaction id, expected count, expected type, notes). Add a runner that scores extractor output against labels.
Success: 30 labeled interactions in a stable artifact, one-command precision/recall output. Fail-early: if labeling 30 takes more than a day because the concept is unclear, tighten the extraction target before touching code.
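One possible shape for the labeled artifact and its runner, as a sketch only. The Label fields mirror the ones listed above; the count-based scoring (TP = min of expected and extracted per interaction) is an assumption, since the plan does not say how a candidate is matched to a label.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Label:
    interaction_id: str
    expected_count: int
    expected_type: str  # empty when expected_count is 0
    notes: str = ""


def score(labels: list[Label], predicted_counts: dict[str, int]) -> dict:
    """Count-based precision/recall of extractor output against the labels.

    predicted_counts maps interaction id -> number of candidates extracted.
    Assumption: candidates are compared by count only, not by type or text.
    """
    tp = fp = fn = 0
    for label in labels:
        got = predicted_counts.get(label.interaction_id, 0)
        tp += min(got, label.expected_count)
        fp += max(got - label.expected_count, 0)
        fn += max(label.expected_count - got, 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}
```

A stricter runner would also compare expected_type; counting alone is enough to get the first one-command precision/recall number.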
Day 2 — Measure current extractor
Run the rule-based extractor on all 30. Record yield, TP, FP, FN. Bucket misses by class (conversational preference, decision summary, status/constraint, meta chatter).
Success: short scorecard with counts by miss type, top 2 miss classes obvious. Fail-early: if the labeled set shows fewer than 5 plausible positives total, the corpus is too weak — relabel before tuning.
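The miss bucketing can be as small as a Counter. A sketch, assuming each false negative has been hand-tagged with one of the four classes named above; the (interaction_id, miss_class) pair format is hypothetical.

```python
from collections import Counter

MISS_CLASSES = (
    "conversational preference",
    "decision summary",
    "status/constraint",
    "meta chatter",
)


def miss_scorecard(misses: list[tuple[str, str]]) -> list[tuple[str, int]]:
    """Count false negatives per miss class, most frequent first.

    misses holds (interaction_id, miss_class) pairs, one per missed candidate.
    """
    counts = Counter(miss_class for _, miss_class in misses)
    unknown = set(counts) - set(MISS_CLASSES)
    if unknown:
        # Refuse silently invented buckets so the scorecard stays comparable.
        raise ValueError(f"unknown miss classes: {sorted(unknown)}")
    return counts.most_common()
```

The first two entries of the returned list are the "top 2 miss classes" the success criterion asks for.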
Day 3 — Smallest rule expansion for top miss class
Add 1-2 narrow, explainable rules for the worst miss class. Add unit tests from real paraphrase examples in the labeled set. Then rerun eval.
Success: recall up on the labeled set, false positives do not materially rise, new tests cover the new cue class. Fail-early: if one rule expansion raises FP above ~20% of extracted candidates, revert or narrow before adding more.
Day 4 — Decision gate: more rules or LLM-assisted prototype
If rule expansion reaches a meaningfully reviewable queue, keep going with rules. Otherwise prototype an LLM-assisted extraction mode behind a flag.
"Meaningfully reviewable queue":
- ≥ 15-25% candidate yield on the 30 labeled captures
- FP rate low enough that manual triage feels tolerable
- ≥ 2 real non-synthetic candidates worth review
Hard stop: if candidate yield is still under 10% after this point, stop rule tinkering and switch to architecture review (LLM-assisted OR narrower extraction scope).
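The gate can be written down as a function so the Day 4 call is mechanical. This is a sketch under stated assumptions: yield is taken as candidates divided by captures, the lower bound of the 15-25% range (15%) is used as the pass mark, and "tolerable" FP rate is stood in for by the 20% figure from the Day 3 fail-early. None of these readings is confirmed by the plan, and the function name and return strings are hypothetical.

```python
def day4_decision(captures: int, candidates: int,
                  false_positives: int, real_candidates: int) -> str:
    """Return 'architecture-review', 'continue-rules', or 'prototype-llm'."""
    if captures <= 0:
        raise ValueError("captures must be positive")
    yield_rate = candidates / captures  # assumption: yield = candidates per capture
    fp_rate = false_positives / candidates if candidates else 0.0
    if yield_rate < 0.10:
        # Hard stop: rule tinkering is not getting a usable queue.
        return "architecture-review"
    if yield_rate >= 0.15 and fp_rate <= 0.20 and real_candidates >= 2:
        return "continue-rules"
    return "prototype-llm"
```

For example, 6 candidates from 30 captures with 1 false positive and 3 real candidates gives a 20% yield and stays on the rules path.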
Day 5 — Stabilize and document
Add remaining focused rules or the flagged LLM-assisted path. Write down in-scope and out-of-scope utterance kinds.
Success: labeled eval green against target threshold, extractor scope explainable in ≤ 5 bullets.
Day 6 — Retrieval harness expansion (6 → 15-20 fixtures)
Grow across p04/p05/p06. Include short ambiguous prompts, cross-project collision cases, expected project-state wins, expected project-memory wins, and 1-2 "should fail open / low confidence" cases.
Success: ≥ 15 fixtures, each active project has easy + medium + hard cases. Fail-early: if fixtures are mostly obvious wins, add harder adversarial cases before claiming coverage.
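The coverage criterion is easy to check automatically. A sketch, assuming each fixture is a dict with hypothetical project and difficulty keys; the real fixture format in scripts/retrieval_eval.py may differ.

```python
from collections import defaultdict

REQUIRED_DIFFICULTIES = {"easy", "medium", "hard"}


def coverage_gaps(fixtures: list[dict],
                  projects: tuple[str, ...] = ("p04", "p05", "p06"),
                  min_total: int = 15) -> list[str]:
    """List what is missing before the harness can claim coverage."""
    gaps = []
    if len(fixtures) < min_total:
        gaps.append(f"only {len(fixtures)} fixtures, need {min_total}")
    seen = defaultdict(set)
    for fixture in fixtures:
        # Assumption: fixtures carry "project" and "difficulty" fields.
        seen[fixture["project"]].add(fixture["difficulty"])
    for project in projects:
        missing = REQUIRED_DIFFICULTIES - seen[project]
        if missing:
            gaps.append(f"{project} missing {sorted(missing)}")
    return gaps
```

An empty list means the count and per-project spread are met; it says nothing about whether the hard cases are genuinely hard, which still needs a human read.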
Day 7 — Regression pass and calibration
Run harness on current code vs live Dalidou. Inspect failures (ranking, ingestion gap, project bleed, budget). Make at most ONE ranking/budget tweak if the harness clearly justifies it. Do not mix harness expansion and ranking changes in a single commit unless tightly coupled.
Success: harness still passes or improves after extractor work; any ranking tweak is justified by a concrete fixture delta. Fail-early: if > 20-25% of harness fixtures regress after extractor changes, separate concerns before merging.
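The fail-early threshold can be computed from two harness runs. A sketch, assuming each run is reduced to a mapping of fixture name to pass/fail; the mapping format is hypothetical, and the 20% default is the lower bound of the range above.

```python
def regression_report(before: dict[str, bool], after: dict[str, bool],
                      threshold: float = 0.20) -> dict:
    """Compare two harness runs keyed by fixture name.

    A fixture counts as regressed if it passed before and does not pass now
    (a fixture missing from the new run is treated as failing).
    """
    regressed = sorted(
        name for name, passed in before.items()
        if passed and not after.get(name, False)
    )
    rate = len(regressed) / len(before) if before else 0.0
    return {
        "regressed": regressed,
        "rate": rate,
        "separate_concerns": rate > threshold,
    }
```

The regressed list doubles as the "concrete fixture delta" needed to justify or reject a ranking tweak.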
Day 8 — Merge and close
Clean commit sequence. Save before/after metrics (extractor scorecard, harness results). Update docs only with claims the metrics support.
Merge order: labeled corpus + runner → extractor improvements + tests → harness expansion → any justified ranking tweak → docs sync last.
Success: point to a before/after delta for both extraction and retrieval; docs do not overclaim.
Hard Gates (stop/rethink points)
- Extractor yield < 10% after 30 labeled interactions → stop, reconsider rule-only extraction
- FP rate > 20% on labeled set → narrow rules before adding more
- Harness expansion finds < 3 genuinely hard cases → harness still too soft
- Ranking change improves one project but regresses another → do not merge without explicit tradeoff note
Branching
One branch codex/extractor-eval-loop for Day 1-5, a second codex/retrieval-harness-expansion for Day 6-7. Keeps extraction and retrieval judgments auditable.
Open Review Findings
| id | finder | severity | file:line | summary | status |
|---|---|---|---|---|---|
| R1 | Codex | P1 | src/atocore/api/routes.py | Capture→extract still manual; "loop closed both sides" was overstated | acknowledged — addressed in Active Plan Day 1-5 |
| R2 | Codex | P1 | src/atocore/context/builder.py | Project memories excluded from pack | fixed @ 8ea53f4 (codex read was stale, now caught up) |
| R3 | Claude | P2 | src/atocore/memory/extractor.py | Rule cues (## Decision:) never fire on conversational LLM text | open (Active Plan root cause) |
Recent Decisions
- 2026-04-11 Adopt this ledger as shared operating memory between Claude and Codex. Proposed by: Antoine. Ratified by: Antoine.
- 2026-04-11 Accept Codex's 8-day mini-phase plan verbatim as Active Plan. Proposed by: Codex. Ratified by: Antoine.
- 2026-04-11 Project memories land in the pack under --- Project Memories --- at 25% budget ratio, gated on canonical project hint. Proposed by: Claude.
- 2026-04-11 Extraction stays off the capture hot path. Batch / manual only. Proposed by: Antoine.
- 2026-04-11 4-step roadmap: extractor → harness expansion → Wave 2 ingestion → OpenClaw finish. Steps 1+2 as one mini-phase. Ratified by: Antoine.
- 2026-04-11 Codex branches must fork from main, not be orphan commits. Proposed by: Claude. Agreed by: Codex.
Session Log
- 2026-04-11 Claude b3253f3..59331e5 (1 commit). Wired the DEV-LEDGER, added session protocol to AGENTS.md, created project-local CLAUDE.md, deleted stale codex/port-atocore-ops-client remote branch. No code changes, no redeploy needed.
- 2026-04-11 Claude c5bad99..b3253f3 (11 commits + 1 merge). Length-aware reinforcement, project memories in pack, query-relevance memory ranking, hyphenated-identifier tokenizer, retrieval eval harness seeded, off-host backup wired end-to-end, docs synced, codex integration-pass branch merged. Harness went 0→6/6 on live Dalidou.
- 2026-04-11 Codex (async review) identified 2 P1s against a stale checkout. R1 was fair (extraction not automated), R2 was outdated (project memories already landed on main). Delivered the 8-day execution plan now in Active Plan.
- 2026-04-06 Antoine created codex/atocore-integration-pass with the t420-openclaw/ workspace (merged 2026-04-11).
Working Rules
- Claude builds; Codex audits. No parallel work on the same files.
- Codex branches fork from main: git fetch origin && git checkout -b codex/<topic> origin/main.
- P1 findings block further main commits until acknowledged in Open Review Findings.
- Every session appends at least one Session Log line and bumps Orientation.
- Trim Session Log and Recent Decisions to the last 20 at session end.
- Docs in docs/ may overclaim stale status; the ledger is the one-file source of truth for "what is true right now."
Quick Commands
# Check live state
ssh papa@dalidou "curl -s http://localhost:8100/health"
# Run the retrieval harness
python scripts/retrieval_eval.py # human-readable
python scripts/retrieval_eval.py --json # machine-readable
# Deploy a new main tip
git push origin main && ssh papa@dalidou "bash /srv/storage/atocore/app/deploy/dalidou/deploy.sh"
# Reflection-loop ops
python scripts/atocore_client.py batch-extract '' '' 200 false # preview
python scripts/atocore_client.py batch-extract '' '' 200 true # persist
python scripts/atocore_client.py triage