
Anto01 59331e522d feat: DEV-LEDGER.md as shared operating memory + session protocol

The ledger is the one-file source of truth for "what is currently
true" across Claude/Codex/human sessions:

- Orientation (live SHA, main tip, test count, harness state)
- Active Plan (currently Codex's 8-day extractor + harness plan
  with hard gates and fail-early thresholds)
- Open Review Findings (P1/P2, status)
- Recent Decisions (bounded to last 20)
- Session Log (bounded to last 20)
- Working Rules (no parallel work, branching rule, P1 block)

Narrative docs under docs/ sometimes lag reality; the ledger does
not. Every session MUST read it at start and append a Session Log
line before ending.

AGENTS.md: added a new "Session protocol" section at the top that
points at the ledger. Applies to any agent (Claude, Codex, future).

CLAUDE.md (new, project-local): project instructions for Claude
Code in this repo. Points at DEV-LEDGER.md and AGENTS.md, spells
out the deploy workflow and the Claude/Codex working model.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-11 14:46:21 -04:00

9.1 KiB


AtoCore Dev Ledger

Shared operating memory between humans, Claude, and Codex. Every session MUST read this file at start and append a Session Log entry before ending. Section headers are stable — do not rename them. Trim Session Log and Recent Decisions to the last 20 entries at session end; older history lives in git log and docs/.

Orientation

live_sha (Dalidou /health build_sha): 38f6e52
main_tip: b3253f3
last_updated: 2026-04-11 by Claude
test_count: 264 passing
harness: 6/6 PASS (python scripts/retrieval_eval.py against live Dalidou)
off_host_backup: papa@192.168.86.39:/home/papa/atocore-backups/ via cron env ATOCORE_BACKUP_RSYNC, verified

Active Plan

Mini-phase: Extractor improvement (eval-driven) + retrieval harness expansion. Duration: 8 days, hard gates at each day boundary. Plan author: Codex (2026-04-11). Executor: Claude. Audit: Codex.

Preflight (before Day 1)

Stop if any of these fail:

git rev-parse HEAD on main matches the expected branching tip
Live /health on Dalidou reports the SHA you think is deployed
python scripts/retrieval_eval.py --json still passes at the current baseline
batch-extract over the known 42-capture slice reproduces the current low-yield baseline
A frozen sample set exists for extractor labeling so the target does not move mid-phase

Success: baseline eval output saved, baseline extract output saved, working branch created from origin/main.
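The first two preflight checks could be scripted roughly as follows. This is a minimal sketch, not tooling that exists in the repo: it assumes /health returns JSON with a `build_sha` field (Orientation names that field, but the response shape is an assumption), and `sha_matches`, `local_head`, `live_build_sha`, and `preflight` are hypothetical names.

```python
import json
import subprocess
import urllib.request


def sha_matches(expected: str, actual: str) -> bool:
    """Compare two git SHAs, allowing one to be an abbreviated prefix of the other."""
    expected, actual = expected.strip().lower(), actual.strip().lower()
    if not expected or not actual:
        return False
    return expected.startswith(actual) or actual.startswith(expected)


def local_head() -> str:
    """Return the full SHA of the current checkout via `git rev-parse HEAD`."""
    return subprocess.run(
        ["git", "rev-parse", "HEAD"], check=True, capture_output=True, text=True
    ).stdout.strip()


def live_build_sha(health_url: str) -> str:
    """Fetch build_sha from a /health endpoint (the JSON shape is an assumption)."""
    with urllib.request.urlopen(health_url, timeout=10) as resp:
        return json.load(resp)["build_sha"]


def preflight(expected_tip: str, expected_live: str, head: str, live: str) -> list[str]:
    """Return the failed checks; an empty list means both SHA checks passed."""
    failures = []
    if not sha_matches(expected_tip, head):
        failures.append(f"HEAD {head} != expected tip {expected_tip}")
    if not sha_matches(expected_live, live):
        failures.append(f"live {live} != expected deploy {expected_live}")
    return failures
```

In use, `head` would come from `local_head()` and `live` from `live_build_sha(...)`; any non-empty result means stop before Day 1.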

Day 1 — Labeled extractor eval set

Pick 30 real captures: 10 that should produce 0 candidates, 10 that should plausibly produce 1, 10 ambiguous/hard. Store as a stable artifact (interaction id, expected count, expected type, notes). Add a runner that scores extractor output against labels.

Success: 30 labeled interactions in a stable artifact, one-command precision/recall output. Fail-early: if labeling 30 takes more than a day because the concept is unclear, tighten the extraction target before touching code.
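One possible shape for the labeled artifact and its runner is sketched below. The field names follow the list above (interaction id, expected count, expected type, notes); the `Label` class, the `score` function, and the count-based matching (which ignores candidate type) are assumptions for illustration, not the runner the plan will actually ship.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Label:
    interaction_id: str
    expected_count: int   # 0 for captures that should produce no candidates
    expected_type: str    # hypothetical values such as "decision"; "" when count is 0
    notes: str = ""


def score(labels: list[Label], extracted: dict[str, int]) -> dict[str, float]:
    """Score per-interaction candidate counts against the labels.

    `extracted` maps interaction id to the number of candidates the extractor
    produced; interactions missing from it count as zero candidates.
    """
    tp = fp = fn = 0
    for label in labels:
        got = extracted.get(label.interaction_id, 0)
        tp += min(got, label.expected_count)
        fp += max(0, got - label.expected_count)
        fn += max(0, label.expected_count - got)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}
```

Counting only by number keeps the first runner simple; matching on `expected_type` as well would be a stricter variant once the labels settle.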

Day 2 — Measure current extractor

Run the rule-based extractor on all 30. Record yield, TP, FP, FN. Bucket misses by class (conversational preference, decision summary, status/constraint, meta chatter).

Success: short scorecard with counts by miss type, top 2 miss classes obvious. Fail-early: if the labeled set shows fewer than 5 plausible positives total, the corpus is too weak — relabel before tuning.
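The scorecard by miss type could be as small as the sketch below. The four class names come from the paragraph above; the `miss_scorecard` function and the `(interaction_id, miss_class)` input shape are hypothetical, and the class assignment itself is assumed to be done by a human reviewer.

```python
from collections import Counter

MISS_CLASSES = (
    "conversational preference",
    "decision summary",
    "status/constraint",
    "meta chatter",
)


def miss_scorecard(misses: list[tuple[str, str]], top_n: int = 2) -> dict:
    """Count false negatives by miss class and report the most frequent classes."""
    counts = Counter()
    for _interaction_id, miss_class in misses:
        if miss_class not in MISS_CLASSES:
            raise ValueError(f"unknown miss class: {miss_class!r}")
        counts[miss_class] += 1
    return {
        "counts": dict(counts),
        "top": [name for name, _ in counts.most_common(top_n)],
    }
```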

Day 3 — Smallest rule expansion for top miss class

Add 1-2 narrow, explainable rules for the worst miss class. Add unit tests from real paraphrase examples in the labeled set. Then rerun eval.

Success: recall up on the labeled set, false positives do not materially rise, new tests cover the new cue class. Fail-early: if one rule expansion raises FP above ~20% of extracted candidates, revert or narrow before adding more.

Day 4 — Decision gate: more rules or LLM-assisted prototype

If rule expansion reaches a meaningfully reviewable queue, keep going with rules. Otherwise prototype an LLM-assisted extraction mode behind a flag.

"Meaningfully reviewable queue":

≥ 15-25% candidate yield on the 30 labeled captures
FP rate low enough that manual triage feels tolerable
≥ 2 real non-synthetic candidates worth review

Hard stop: if candidate yield is still under 10% after this point, stop rule tinkering and switch to architecture review (LLM-assisted OR narrower extraction scope).
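The Day 4 gate can be written down as a small decision function. This is one reading of the criteria, under stated assumptions: yield is taken as candidates divided by labeled captures, the lower bound of the 15-25% band (15%) is used as the threshold, "tolerable" triage is a reviewer's yes/no judgment, and the under-10% hard stop is applied at the gate itself. The function and return values are hypothetical names.

```python
def day4_decision(
    candidates: int,
    labeled_total: int,
    real_candidates: int,
    triage_tolerable: bool,
    min_yield: float = 0.15,        # assumption: lower bound of the 15-25% band
    hard_stop_yield: float = 0.10,  # from the hard stop in the plan
) -> str:
    """Return 'continue_rules', 'llm_prototype', or 'architecture_review'."""
    if labeled_total <= 0:
        raise ValueError("labeled_total must be positive")
    yield_rate = candidates / labeled_total
    if yield_rate < hard_stop_yield:
        return "architecture_review"
    if yield_rate >= min_yield and triage_tolerable and real_candidates >= 2:
        return "continue_rules"
    return "llm_prototype"
```

For example, 6 candidates over the 30 labeled captures is a 20% yield, which keeps the work on rules only if triage is tolerable and at least 2 of those candidates are real.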

Day 5 — Stabilize and document

Add remaining focused rules or the flagged LLM-assisted path. Write down in-scope and out-of-scope utterance kinds.

Success: labeled eval green against target threshold, extractor scope explainable in ≤ 5 bullets.

Day 6 — Retrieval harness expansion (6 → 15-20 fixtures)

Grow across p04/p05/p06. Include short ambiguous prompts, cross-project collision cases, expected project-state wins, expected project-memory wins, and 1-2 "should fail open / low confidence" cases.

Success: ≥ 15 fixtures, each active project has easy + medium + hard cases. Fail-early: if fixtures are mostly obvious wins, add harder adversarial cases before claiming coverage.
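The Day 6 success bar could be checked mechanically with something like the sketch below. The fixture keys `project` and `difficulty` are assumptions, since the harness fixture format is not shown in this ledger, and `coverage_gaps` is a hypothetical helper.

```python
from collections import defaultdict

REQUIRED_DIFFICULTIES = {"easy", "medium", "hard"}


def coverage_gaps(fixtures: list[dict], projects: list[str], min_total: int = 15) -> list[str]:
    """List coverage problems; an empty list means the Day 6 success bar is met."""
    gaps = []
    if len(fixtures) < min_total:
        gaps.append(f"only {len(fixtures)} fixtures, need at least {min_total}")
    seen = defaultdict(set)
    for fixture in fixtures:
        seen[fixture["project"]].add(fixture["difficulty"])
    for project in projects:
        missing = REQUIRED_DIFFICULTIES - seen[project]
        if missing:
            gaps.append(f"{project} missing: {', '.join(sorted(missing))}")
    return gaps
```

A check like this only covers the counting part; whether the hard cases are genuinely adversarial still needs a human read.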

Day 7 — Regression pass and calibration

Run harness on current code vs live Dalidou. Inspect failures (ranking, ingestion gap, project bleed, budget). Make at most ONE ranking/budget tweak if the harness clearly justifies it. Do not mix harness expansion and ranking changes in a single commit unless tightly coupled.

Success: harness still passes or improves after extractor work; any ranking tweak is justified by a concrete fixture delta. Fail-early: if > 20-25% of harness fixtures regress after extractor changes, separate concerns before merging.
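The fail-early check for Day 7 amounts to comparing two harness runs. The sketch below assumes each run can be reduced to a mapping from fixture id to pass/fail (the actual `--json` output shape is not shown here), and it uses 20%, the lower end of the 20-25% band, as the default threshold. Both function names are hypothetical.

```python
def regressed_fixtures(before: dict[str, bool], after: dict[str, bool]) -> list[str]:
    """Fixture ids that passed before the change and fail (or are missing) after it."""
    return sorted(
        fid for fid, passed in before.items() if passed and not after.get(fid, False)
    )


def should_separate_concerns(
    before: dict[str, bool], after: dict[str, bool], threshold: float = 0.20
) -> bool:
    """True when the regressed share of baseline fixtures exceeds the threshold."""
    if not before:
        return False
    return len(regressed_fixtures(before, after)) / len(before) > threshold
```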

Day 8 — Merge and close

Clean commit sequence. Save before/after metrics (extractor scorecard, harness results). Update docs only with claims the metrics support.

Merge order: labeled corpus + runner → extractor improvements + tests → harness expansion → any justified ranking tweak → docs sync last.

Success: point to a before/after delta for both extraction and retrieval; docs do not overclaim.

Hard Gates (stop/rethink points)

Extractor yield < 10% after 30 labeled interactions → stop, reconsider rule-only extraction
FP rate > 20% on labeled set → narrow rules before adding more
Harness expansion finds < 3 genuinely hard cases → harness still too soft
Ranking change improves one project but regresses another → do not merge without explicit tradeoff note
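The last gate, a ranking change that helps one project while hurting another, can be detected from per-project pass counts. This is a sketch under the assumption that harness results can be grouped by project into integer pass counts; `ranking_tradeoff` and `needs_tradeoff_note` are hypothetical names.

```python
def ranking_tradeoff(before: dict[str, int], after: dict[str, int]) -> dict[str, list[str]]:
    """Compare per-project pass counts before and after a ranking change."""
    improved = sorted(p for p in before if after.get(p, 0) > before[p])
    regressed = sorted(p for p in before if after.get(p, 0) < before[p])
    return {"improved": improved, "regressed": regressed}


def needs_tradeoff_note(before: dict[str, int], after: dict[str, int]) -> bool:
    """True when at least one project improves while another regresses."""
    delta = ranking_tradeoff(before, after)
    return bool(delta["improved"]) and bool(delta["regressed"])
```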

Branching

One branch codex/extractor-eval-loop for Day 1-5, a second codex/retrieval-harness-expansion for Day 6-7. Keeps extraction and retrieval judgments auditable.

Open Review Findings

| id | finder | severity | file:line | summary | status |
|----|--------|----------|-----------|---------|--------|
| R1 | Codex | P1 | src/atocore/api/routes.py | Capture→extract still manual; "loop closed both sides" was overstated | acknowledged; addressed in Active Plan Day 1-5 |
| R2 | Codex | P1 | src/atocore/context/builder.py | Project memories excluded from pack | fixed @ `8ea53f4` (codex read was stale, now caught up) |
| R3 | Claude | P2 | src/atocore/memory/extractor.py | Rule cues (`## Decision:`) never fire on conversational LLM text | open; Active Plan root cause |

Recent Decisions

2026-04-11 Adopt this ledger as shared operating memory between Claude and Codex. Proposed by: Antoine. Ratified by: Antoine.
2026-04-11 Accept Codex's 8-day mini-phase plan verbatim as Active Plan. Proposed by: Codex. Ratified by: Antoine.
2026-04-11 Project memories land in the pack under --- Project Memories --- at 25% budget ratio, gated on canonical project hint. Proposed by: Claude.
2026-04-11 Extraction stays off the capture hot path. Batch / manual only. Proposed by: Antoine.
2026-04-11 4-step roadmap: extractor → harness expansion → Wave 2 ingestion → OpenClaw finish. Steps 1+2 as one mini-phase. Ratified by: Antoine.
2026-04-11 Codex branches must fork from main, not be orphan commits. Proposed by: Claude. Agreed by: Codex.

Session Log

2026-04-11 Claude c5bad99..b3253f3 (11 commits + 1 merge). Length-aware reinforcement, project memories in pack, query-relevance memory ranking, hyphenated-identifier tokenizer, retrieval eval harness seeded, off-host backup wired end-to-end, docs synced, codex integration-pass branch merged. Harness went 0→6/6 on live Dalidou.
2026-04-11 Codex (async review) identified 2 P1s against a stale checkout. R1 was fair (extraction not automated), R2 was outdated (project memories already landed on main). Delivered the 8-day execution plan now in Active Plan.
2026-04-06 Antoine created codex/atocore-integration-pass with the t420-openclaw/ workspace (merged 2026-04-11).

Working Rules

Claude builds; Codex audits. No parallel work on the same files.
Codex branches fork from main: git fetch origin && git checkout -b codex/<topic> origin/main.
P1 findings block further main commits until acknowledged in Open Review Findings.
Every session appends at least one Session Log line and bumps Orientation.
Trim Session Log and Recent Decisions to the last 20 at session end.
Docs in docs/ may overclaim or carry stale status; the ledger is the one-file source of truth for "what is true right now."
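The trim rule is simple enough to sketch. The example assumes entries are held as a list of lines and are ordered newest first, as the Session Log in this file appears to be; `trim_entries` is a hypothetical helper, not an existing script.

```python
def trim_entries(entries: list[str], keep: int = 20, newest_first: bool = True) -> list[str]:
    """Keep only the `keep` most recent entries, preserving their order."""
    if keep < 0:
        raise ValueError("keep must be non-negative")
    if newest_first:
        return entries[:keep]
    # Oldest-first lists keep their tail; guard keep == 0, since [-0:] is the whole list.
    return entries[-keep:] if keep else []
```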

Quick Commands

# Check live state
ssh papa@dalidou "curl -s http://localhost:8100/health"

# Run the retrieval harness
python scripts/retrieval_eval.py            # human-readable
python scripts/retrieval_eval.py --json     # machine-readable

# Deploy a new main tip
git push origin main && ssh papa@dalidou "bash /srv/storage/atocore/app/deploy/dalidou/deploy.sh"

# Reflection-loop ops
python scripts/atocore_client.py batch-extract '' '' 200 false   # preview
python scripts/atocore_client.py batch-extract '' '' 200 true    # persist
python scripts/atocore_client.py triage
