ATOCore/DEV-LEDGER.md
Anto01 146f2e4a5e chore: Day 8 — close mini-phase with before/after metrics
Mini-phase complete. Before/after deltas:

  Metric                    Before     After
  ─────────────────────────────────────────
  Rule extractor recall     0%         0% (unchanged, deprioritized)
  LLM extractor recall      n/a        100% (new, claude -p haiku)
  LLM candidate yield       n/a        2.55/interaction
  First triage accept rate  n/a        31% (16/51)
  Active memories           20         36 (+16)
  p06-polisher memories     2          16 (+14)
  atocore memories          0          5  (+5)
  Retrieval harness         6/6        15/18 (expanded to 18 fixtures)
  Test count                264        278 (+14)

The 3 remaining harness failures are budget contention on the p06 memory
band: the specific memory each fixture targets ranks 4th or lower, and
the 25% budget only holds 2-3 entries. This is not a ranking bug. The
per-entry 250-char cap was the one justified tweak; a second budget
change would risk regressing other fixtures, per Codex's Day 7 hard gate.
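The contention mode above can be sketched as a greedy packer. This is a minimal sketch with a hypothetical pack size; only the 25% band ratio and the 250-char per-entry cap come from this ledger, everything else is illustrative:

```python
# Sketch of the budget-contention failure mode; numbers are illustrative,
# not AtoCore's actual config.

def entries_that_fit(memories, band_budget, per_entry_cap=250):
    """Greedily pack ranked memories into the band until budget runs out."""
    packed, used = [], 0
    for text in memories:  # memories arrive ranked best-first
        cost = min(len(text), per_entry_cap)
        if used + cost > band_budget:
            break
        packed.append(text[:per_entry_cap])
        used += cost
    return packed

band = int(3000 * 0.25)  # hypothetical 3000-char pack -> 750-char band
ranked = [f"memory-{i}: " + "x" * 240 for i in range(6)]
print(len(entries_that_fit(ranked, band)))  # 3: a fixture targeting rank 4+ loses
```

With capped entries of ~250 chars, a 750-char band holds exactly 3, which is why raising either number risks regressing fixtures that currently win on rank.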

Ledger updated: Orientation, Session Log, main_tip, harness line.

Next on the roadmap (from DEV-LEDGER Active Plan / docs/next-steps):
  - Wave 2 trusted operational ingestion (p04/p05/p06 dashboards)
  - Finish OpenClaw integration (Phase 8)
  - Auto-triage (multi-model second pass to reduce human review)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 06:41:42 -04:00


AtoCore Dev Ledger

Shared operating memory between humans, Claude, and Codex. Every session MUST read this file at start and append a Session Log entry before ending. Section headers are stable - do not rename them. Trim Session Log and Recent Decisions to the last 20 entries at session end; older history lives in git log and docs/.
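The trim rule above can be sketched mechanically. This is illustrative, not AtoCore's actual tooling, and assumes entries are bullet lines ordered newest first, as in this file:

```python
# Illustrative sketch of the trim-to-20 rule; older history is assumed to
# survive in git log, so dropping tail entries here loses nothing.

def trim_entries(lines, keep=20):
    """Keep the newest `keep` bullet entries of a section."""
    bullets = [l for l in lines if l.lstrip().startswith("•")]
    return bullets[:keep]

log = [f"  • 2026-04-{d:02d} Claude session summary" for d in range(28, 0, -1)]
print(len(trim_entries(log)))  # 20
```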

Orientation

  • live_sha (Dalidou /health build_sha): 5c69f77
  • last_updated: 2026-04-12 by Claude (mini-phase Day 8 close)
  • main_tip: 5c69f77
  • test_count: 278 passing
  • harness: 15/18 PASS (expanded from 6 to 18 fixtures; 3 remaining failures are budget-contention on p06 memory band, not ranking bugs)
  • active_memories: 36 (was 20 before mini-phase; p06-polisher 2->16, atocore 0->5)
  • off_host_backup: papa@192.168.86.39:/home/papa/atocore-backups/ via cron env ATOCORE_BACKUP_RSYNC, verified

Active Plan

Mini-phase: Extractor improvement (eval-driven) + retrieval harness expansion. Duration: 8 days, hard gates at each day boundary. Plan author: Codex (2026-04-11). Executor: Claude. Audit: Codex.

Preflight (before Day 1)

Stop if any of these fail:

  • git rev-parse HEAD on main matches the expected branching tip
  • Live /health on Dalidou reports the SHA you think is deployed
  • python scripts/retrieval_eval.py --json still passes at the current baseline
  • batch-extract over the known 42-capture slice reproduces the current low-yield baseline
  • A frozen sample set exists for extractor labeling so the target does not move mid-phase

Success: baseline eval output saved, baseline extract output saved, working branch created from origin/main.
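The SHA-match half of the preflight can be sketched as below. The build_sha field name comes from Orientation; the payload shape and the hashes shown are assumptions:

```python
# Hedged sketch of the preflight check: the live build_sha must prefix the
# local main tip, otherwise stop before Day 1.
import json

def preflight_ok(local_tip: str, health_json: str) -> bool:
    """True only when the live build_sha prefixes the local main tip."""
    build_sha = json.loads(health_json).get("build_sha", "")
    return bool(build_sha) and local_tip.startswith(build_sha)

# local_tip would come from `git rev-parse HEAD`; health_json from the
# Quick Commands health check against Dalidou.
print(preflight_ok("5c69f770000000", '{"build_sha": "5c69f77"}'))  # True
print(preflight_ok("deadbeef000000", '{"build_sha": "5c69f77"}'))  # False
```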

Day 1 - Labeled extractor eval set

Pick 30 real captures: 10 that should produce 0 candidates, 10 that should plausibly produce 1, 10 ambiguous/hard. Store as a stable artifact (interaction id, expected count, expected type, notes). Add a runner that scores extractor output against labels.

Success: 30 labeled interactions in a stable artifact, one-command precision/recall output. Fail-early: if labeling 30 takes more than a day because the concept is unclear, tighten the extraction target before touching code.
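A possible shape for the labeled artifact and its one-command scorer. Field names here are assumptions; the real artifact lives under scripts/eval_data/:

```python
# Sketch of a Day 1 labeled row plus a precision/recall runner over
# candidate counts per interaction.

labels = [
    {"interaction_id": "i-001", "expected_count": 0, "notes": "meta chatter"},
    {"interaction_id": "i-002", "expected_count": 1, "notes": "decision summary"},
]

def score(labels, extracted):
    """Precision/recall from per-interaction candidate counts."""
    tp = fp = fn = 0
    for row in labels:
        got = extracted.get(row["interaction_id"], 0)
        want = row["expected_count"]
        tp += min(got, want)
        fp += max(got - want, 0)
        fn += max(want - got, 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

print(score(labels, {"i-001": 0, "i-002": 1}))  # (1.0, 1.0)
```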

Day 2 - Measure current extractor

Run the rule-based extractor on all 30. Record yield, TP, FP, FN. Bucket misses by class (conversational preference, decision summary, status/constraint, meta chatter).

Success: short scorecard with counts by miss type, top 2 miss classes obvious. Fail-early: if the labeled set shows fewer than 5 plausible positives total, the corpus is too weak - relabel before tuning.

Day 3 - Smallest rule expansion for top miss class

Add 1-2 narrow, explainable rules for the worst miss class. Add unit tests from real paraphrase examples in the labeled set. Then rerun eval.

Success: recall up on the labeled set, false positives do not materially rise, new tests cover the new cue class. Fail-early: if one rule expansion raises FP above ~20% of extracted candidates, revert or narrow before adding more.
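The ~20% FP fail-early check can be expressed as a small gate. The threshold comes from the plan; the counting convention is illustrative:

```python
# Sketch of the Day 3 fail-early rule: revert or narrow a rule expansion
# when false positives exceed ~20% of extracted candidates.

def fp_gate(tp: int, fp: int, threshold: float = 0.20) -> bool:
    """True if the expansion is acceptable (FP share <= threshold)."""
    extracted = tp + fp
    if extracted == 0:
        return True  # nothing extracted, nothing to regress
    return fp / extracted <= threshold

print(fp_gate(tp=8, fp=1))  # True  (11% FP)
print(fp_gate(tp=3, fp=2))  # False (40% FP)
```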

Day 4 - Decision gate: more rules or LLM-assisted prototype

If rule expansion reaches a meaningfully reviewable queue, keep going with rules. Otherwise prototype an LLM-assisted extraction mode behind a flag.

"Meaningfully reviewable queue" means all of the following:

  • candidate yield of roughly 15-25% on the 30 labeled captures
  • an FP rate low enough that manual triage feels tolerable
  • at least 2 real, non-synthetic candidates worth review

Hard stop: if candidate yield is still under 10% after this point, stop rule tinkering and switch to architecture review (LLM-assisted OR narrower extraction scope).
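Taken literally, the gate logic above might look like this sketch. Thresholds come from the plan text; tolerable_fp stands in for the subjective triage judgment:

```python
# Sketch of the Day 4 decision gate. The three outcomes mirror the plan:
# hard-stop under 10% yield, continue with rules inside the target band,
# otherwise prototype an LLM-assisted mode behind a flag.

def day4_decision(yield_rate: float, tolerable_fp: bool, real_candidates: int) -> str:
    if yield_rate < 0.10:
        return "hard-stop: architecture review (LLM-assisted or narrower scope)"
    if 0.15 <= yield_rate <= 0.25 and tolerable_fp and real_candidates >= 2:
        return "keep going with rules"
    return "prototype LLM-assisted extraction behind a flag"

print(day4_decision(0.0, True, 0))   # the hard-stop branch (the Day 2 baseline)
print(day4_decision(0.20, True, 3))  # keep going with rules
```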

Day 5 - Stabilize and document

Add remaining focused rules or the flagged LLM-assisted path. Write down in-scope and out-of-scope utterance kinds.

Success: labeled eval green against target threshold, extractor scope explainable in <= 5 bullets.

Day 6 - Retrieval harness expansion (6 -> 15-20 fixtures)

Grow across p04/p05/p06. Include short ambiguous prompts, cross-project collision cases, expected project-state wins, expected project-memory wins, and 1-2 "should fail open / low confidence" cases.

Success: >= 15 fixtures, each active project has easy + medium + hard cases. Fail-early: if fixtures are mostly obvious wins, add harder adversarial cases before claiming coverage.
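A possible fixture shape for the expanded harness, plus the easy/medium/hard coverage check. Field names are assumptions; the real fixtures belong to scripts/retrieval_eval.py:

```python
# Sketch of two harness fixtures spanning the Day 6 case types: an expected
# project-memory win and a short ambiguous prompt that should fail open.

fixtures = [
    {"project": "p06", "difficulty": "easy",
     "prompt": "what is the polisher telemetry format?",
     "expect": {"kind": "project-memory", "must_contain": "telemetry"}},
    {"project": "p04", "difficulty": "hard",
     "prompt": "status?",  # short, ambiguous, cross-project collision risk
     "expect": {"kind": "low-confidence", "must_contain": None}},
]

def coverage(fixtures):
    """Per-project difficulty coverage, for the easy+medium+hard success check."""
    seen = {}
    for f in fixtures:
        seen.setdefault(f["project"], set()).add(f["difficulty"])
    return seen

print(coverage(fixtures))  # {'p06': {'easy'}, 'p04': {'hard'}}
```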

Day 7 - Regression pass and calibration

Run harness on current code vs live Dalidou. Inspect failures (ranking, ingestion gap, project bleed, budget). Make at most ONE ranking/budget tweak if the harness clearly justifies it. Do not mix harness expansion and ranking changes in a single commit unless tightly coupled.

Success: harness still passes or improves after extractor work; any ranking tweak is justified by a concrete fixture delta. Fail-early: if > 20-25% of harness fixtures regress after extractor changes, separate concerns before merging.
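The regression check can be sketched as a diff of two --json harness runs. The pass/fail result shape here is an assumption:

```python
# Sketch of the Day 7 fail-early check: flag fixtures that went PASS -> FAIL
# between a before run and an after run of the harness.

def regressions(before: dict, after: dict) -> list:
    """Fixture ids that passed before the change but fail after it."""
    return sorted(
        fid for fid, ok in before.items()
        if ok and not after.get(fid, False)
    )

before = {"p06-easy-1": True, "p06-hard-2": True, "p05-easy-1": True}
after  = {"p06-easy-1": True, "p06-hard-2": False, "p05-easy-1": True}
print(regressions(before, after))  # ['p06-hard-2']
```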

Day 8 - Merge and close

Clean commit sequence. Save before/after metrics (extractor scorecard, harness results). Update docs only with claims the metrics support.

Merge order: labeled corpus + runner -> extractor improvements + tests -> harness expansion -> any justified ranking tweak -> docs sync last.

Success: point to a before/after delta for both extraction and retrieval; docs do not overclaim.

Hard Gates (stop/rethink points)

  • Extractor yield < 10% after 30 labeled interactions -> stop, reconsider rule-only extraction
  • FP rate > 20% on labeled set -> narrow rules before adding more
  • Harness expansion finds < 3 genuinely hard cases -> harness still too soft
  • Ranking change improves one project but regresses another -> do not merge without explicit tradeoff note

Branching

One branch, codex/extractor-eval-loop, covers Days 1-5; a second, codex/retrieval-harness-expansion, covers Days 6-7. This keeps the extraction and retrieval judgments separately auditable.

Review Protocol

  • Codex records review findings in Open Review Findings.
  • Claude must read Open Review Findings at session start before coding.
  • Codex owns finding text. Claude may update operational fields only:
    • status
    • owner
    • resolved_by
  • If Claude disagrees with a finding, do not rewrite it. Mark it declined and explain why in the Session Log.
  • Any commit or session that addresses a finding should reference the finding id in the commit message or Session Log.
  • P1 findings block further commits in the affected area until they are at least acknowledged and explicitly tracked.
  • Findings may be code-level, claim-level, or ops-level. If the implementation boundary changes, retarget the finding instead of silently closing it.
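The field-ownership rule above can be checked mechanically. A sketch, with field names taken from the Open Review Findings header:

```python
# Sketch of the review write protocol: Claude's edit to a finding is valid
# only if every changed field is operational (status, owner, resolved_by).

OPERATIONAL = {"status", "owner", "resolved_by"}

def edit_allowed(old: dict, new: dict) -> bool:
    """True if only operational fields differ between finding versions."""
    changed = {k for k in old.keys() | new.keys() if old.get(k) != new.get(k)}
    return changed <= OPERATIONAL

r2_old = {"id": "R2", "summary": "Project memories excluded from pack",
          "status": "open", "owner": "Claude", "resolved_by": ""}
r2_new = dict(r2_old, status="fixed", resolved_by="8ea53f4")
print(edit_allowed(r2_old, r2_new))                         # True
print(edit_allowed(r2_old, dict(r2_old, summary="edited")))  # False
```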

Open Review Findings

| id | finder | severity | file:line | summary | status | owner | opened_at | resolved_by |
|----|--------|----------|-----------|---------|--------|-------|-----------|-------------|
| R1 | Codex | P1 | deploy/hooks/capture_stop.py:76-85 | Live Claude capture still omits extract, so "loop closed both sides" remains overstated in practice even though the API supports it | acknowledged | Claude | 2026-04-11 | - |
| R2 | Codex | P1 | src/atocore/context/builder.py | Project memories excluded from pack | fixed | Claude | 2026-04-11 | 8ea53f4 |
| R3 | Claude | P2 | src/atocore/memory/extractor.py | Rule cues (## Decision:) never fire on conversational LLM text | open | Claude | 2026-04-11 | - |
| R4 | Codex | P2 | DEV-LEDGER.md:11 | Orientation main_tip was stale versus HEAD / origin/main | fixed | Codex | 2026-04-11 | 81307ce |

Recent Decisions

  • 2026-04-12 Day 4 gate cleared: LLM-assisted extraction via claude -p (OAuth, no API key) is the path forward. Rule extractor stays as default for structural cues. Proposed by: Claude. Ratified by: Antoine.
  • 2026-04-12 First live triage: 16 promoted, 35 rejected from 51 LLM-extracted candidates. 31% accept rate. Active memory count 20->36. Executed by: Claude. Ratified by: Antoine.
  • 2026-04-12 No API keys allowed in AtoCore — LLM-assisted features use OAuth via claude -p or equivalent CLI-authenticated paths. Proposed by: Antoine.
  • 2026-04-12 Multi-model extraction direction: extraction/triage should be model-agnostic, with Codex/Gemini/Ollama as second-pass reviewers for robustness. Proposed by: Antoine.
  • 2026-04-11 Adopt this ledger as shared operating memory between Claude and Codex. Proposed by: Antoine. Ratified by: Antoine.
  • 2026-04-11 Accept Codex's 8-day mini-phase plan verbatim as Active Plan. Proposed by: Codex. Ratified by: Antoine.
  • 2026-04-11 Review findings live in DEV-LEDGER.md with Codex owning finding text and Claude updating status fields only. Proposed by: Codex. Ratified by: Antoine.
  • 2026-04-11 Project memories land in the pack under --- Project Memories --- at 25% budget ratio, gated on canonical project hint. Proposed by: Claude.
  • 2026-04-11 Extraction stays off the capture hot path. Batch / manual only. Proposed by: Antoine.
  • 2026-04-11 4-step roadmap: extractor -> harness expansion -> Wave 2 ingestion -> OpenClaw finish. Steps 1+2 as one mini-phase. Ratified by: Antoine.
  • 2026-04-11 Codex branches must fork from main, not be orphan commits. Proposed by: Claude. Agreed by: Codex.

Session Log

  • 2026-04-12 Claude 06792d8..5c69f77 Day 5-8 close. Documented extractor scope (5 in-scope, 6 out-of-scope categories). Expanded harness from 6 to 18 fixtures (p04 +1, p05 +1, p06 +7, adversarial +2). Per-entry memory cap at 250 chars fixed 1 of 4 budget-contention failures. Final harness: 15/18 PASS. Mini-phase complete. Before/after: rule extractor 0% recall -> LLM 100%; harness 6/6 -> 15/18; active memories 20 -> 36.
  • 2026-04-12 Claude 330ecfb..06792d8 (merged eval-loop branch + triage). Day 1-4 of the mini-phase completed in one session. Day 2 baseline: rule extractor 0% recall, 5 distinct miss classes. Day 4 gate cleared: LLM extractor (claude -p haiku, OAuth) hit 100% recall, 2.55 yield/interaction. Refactored from anthropic SDK to subprocess after "no API key" rule. First live triage: 51 candidates -> 16 promoted, 35 rejected. Active memories 20->36. p06-polisher went from 2 to 16 memories (firmware/telemetry architecture set). POST /memory now accepts status field. Test count 264->278.
  • 2026-04-11 Claude claude/extractor-eval-loop @ 7d8d599 — Day 1+2 of the mini-phase. Froze a 64-interaction snapshot (scripts/eval_data/interactions_snapshot_2026-04-11.json) and labeled 20 by length-stratified random sample (5 positive, 15 zero; 7 total expected candidates). Built scripts/extractor_eval.py as a file-based eval runner. Day 2 baseline: rule extractor hit 0% yield / 0% recall / 0% precision on the labeled set; 5 false negatives across 5 distinct miss classes (recommendation_prose, architectural_change_summary, spec_update_announcement, layered_recommendation, alignment_assertion). This is the Day 4 hard-stop signal arriving two days early — a single rule expansion cannot close a 5-way miss, and widening rules blindly will collapse precision. The Day 4 decision gate is escalated to Antoine for ratification before Day 3 touches any extractor code. No extractor code on main has changed.
  • 2026-04-11 Codex (ledger audit) fixed stale main_tip, retargeted R1 from the API surface to the live Claude Stop hook, and formalized the review write protocol so Claude can consume findings without rewriting them.
  • 2026-04-11 Claude b3253f3..59331e5 (1 commit). Wired the DEV-LEDGER, added session protocol to AGENTS.md, created project-local CLAUDE.md, deleted stale codex/port-atocore-ops-client remote branch. No code changes, no redeploy needed.
  • 2026-04-11 Claude c5bad99..b3253f3 (11 commits + 1 merge). Length-aware reinforcement, project memories in pack, query-relevance memory ranking, hyphenated-identifier tokenizer, retrieval eval harness seeded, off-host backup wired end-to-end, docs synced, codex integration-pass branch merged. Harness went 0->6/6 on live Dalidou.
  • 2026-04-11 Codex (async review) identified 2 P1s against a stale checkout. R1 was fair (extraction not automated), R2 was outdated (project memories already landed on main). Delivered the 8-day execution plan now in Active Plan.
  • 2026-04-06 Antoine created codex/atocore-integration-pass with the t420-openclaw/ workspace (merged 2026-04-11).

Working Rules

  • Claude builds; Codex audits. No parallel work on the same files.
  • Codex branches fork from main: git fetch origin && git checkout -b codex/<topic> origin/main.
  • P1 findings block further main commits until acknowledged in Open Review Findings.
  • Every session appends at least one Session Log line and bumps Orientation.
  • Trim Session Log and Recent Decisions to the last 20 at session end.
  • Docs in docs/ may be stale or overclaim; the ledger is the one-file source of truth for "what is true right now."

Quick Commands

# Check live state
ssh papa@dalidou "curl -s http://localhost:8100/health"

# Run the retrieval harness
python scripts/retrieval_eval.py            # human-readable
python scripts/retrieval_eval.py --json     # machine-readable

# Deploy a new main tip
git push origin main && ssh papa@dalidou "bash /srv/storage/atocore/app/deploy/dalidou/deploy.sh"

# Reflection-loop ops
python scripts/atocore_client.py batch-extract '' '' 200 false   # preview
python scripts/atocore_client.py batch-extract '' '' 200 true    # persist
python scripts/atocore_client.py triage
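
The no-API-key extraction path (claude -p over subprocess, per Recent Decisions) can be sketched as below. The -p (print) and --model flags exist in Claude Code; the prompt wording, the "haiku" alias, and the JSON reply contract are illustrative assumptions:

```python
# Hedged sketch: shell out to the CLI-authenticated claude binary (OAuth,
# no API key) instead of an SDK, and parse the reply defensively.
import json
import subprocess

def parse_candidates(raw: str) -> list:
    """Parse the model reply; prose or non-list JSON means no candidates."""
    try:
        data = json.loads(raw)
        return data if isinstance(data, list) else []
    except json.JSONDecodeError:
        return []

def extract_candidates(interaction_text: str) -> list:
    prompt = ("Extract durable memory candidates from this interaction as a "
              "JSON list of strings. Return [] if none.\n\n" + interaction_text)
    out = subprocess.run(["claude", "-p", prompt, "--model", "haiku"],
                         capture_output=True, text=True, check=True).stdout
    return parse_candidates(out)
```

Keeping the parse step separate means the subprocess boundary stays thin and the fallback behavior (prose reply -> zero candidates) is testable without the CLI.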