feat: DEV-LEDGER.md as shared operating memory + session protocol

The ledger is the one-file source of truth for "what is currently true" across Claude/Codex/human sessions: - Orientation (live SHA, main tip, test count, harness state) - Active Plan (currently Codex's 8-day extractor + harness plan with hard gates and fail-early thresholds) - Open Review Findings (P1/P2, status) - Recent Decisions (bounded to last 20) - Session Log (bounded to last 20) - Working Rules (no parallel work, branching rule, P1 block) Narrative docs under docs/ sometimes lag reality; the ledger does not. Every session MUST read it at start and append a Session Log line before ending. AGENTS.md: added a new "Session protocol" section at the top that points at the ledger. Applies to any agent (Claude, Codex, future). CLAUDE.md (new, project-local): project instructions for Claude Code in this repo. Points at DEV-LEDGER.md and AGENTS.md, spells out the deploy workflow and the Claude/Codex working model. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 14:46:21 -04:00
parent b3253f35ee
commit 59331e522d
3 changed files with 192 additions and 0 deletions
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -1,5 +1,13 @@
 # AGENTS.md

+## Session protocol (read first, every session)
+
+**Before doing anything else, read `DEV-LEDGER.md` at the repo root.** It is the one-file source of truth for "what is currently true" — live SHA, active plan, open review findings, recent decisions. The narrative docs under `docs/` may lag; the ledger does not.
+
+**Before ending a session, append a Session Log line to `DEV-LEDGER.md`** with what you did and which commit range it covers, and bump the Orientation section if anything there changed.
+
+This rule applies equally to Claude, Codex, and any future agent working in this repo.
+
 ## Project role
 This repository is AtoCore, the runtime and machine-memory layer of the Ato ecosystem.

--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -0,0 +1,30 @@
+# CLAUDE.md — project instructions for AtoCore
+
+## Session protocol
+
+Before doing anything else in this repo, read `DEV-LEDGER.md` at the repo root. It is the shared operating memory between Claude, Codex, and the human operator — live Dalidou SHA, active plan, open P1/P2 review findings, recent decisions, and session log. The narrative docs under `docs/` sometimes lag; the ledger does not.
+
+Before ending a session, append a Session Log line to `DEV-LEDGER.md` covering:
+
+- which commits you produced (sha range)
+- what changed at a high level
+- any harness / test count deltas
+- anything you overclaimed and later corrected
+
+Bump the **Orientation** section if `live_sha`, `main_tip`, `test_count`, or `harness` changed.
+
+`AGENTS.md` at the repo root carries the broader project principles (storage separation, deployment model, coding guidance). Read it when you need the "why" behind a constraint.
+
+## Deploy workflow
+
+```bash
+git push origin main && ssh papa@dalidou "bash /srv/storage/atocore/app/deploy/dalidou/deploy.sh"
+```
+
+The deploy script self-verifies via `/health` build_sha — if it exits non-zero, do not assume the change is live.
+
+## Working model
+
+- Claude builds; Codex audits. No parallel work on the same files.
+- P1 review findings block further `main` commits until acknowledged in the ledger's **Open Review Findings** table.
+- Codex branches must fork from `origin/main` (no orphan commits that require `--allow-unrelated-histories`).
--- a/DEV-LEDGER.md
+++ b/DEV-LEDGER.md
@@ -0,0 +1,154 @@
+# AtoCore Dev Ledger
+
+> Shared operating memory between humans, Claude, and Codex.
+> **Every session MUST read this file at start and append a Session Log entry before ending.**
+> Section headers are stable — do not rename them. Trim Session Log and Recent Decisions to the last 20 entries at session end; older history lives in `git log` and `docs/`.
+
+## Orientation
+
+- **live_sha** (Dalidou `/health` build_sha): `38f6e52`
+- **main_tip**: `b3253f3`
+- **last_updated**: 2026-04-11 by Claude
+- **test_count**: 264 passing
+- **harness**: `6/6 PASS` (`python scripts/retrieval_eval.py` against live Dalidou)
+- **off_host_backup**: `papa@192.168.86.39:/home/papa/atocore-backups/` via cron env `ATOCORE_BACKUP_RSYNC`, verified
+
+## Active Plan
+
+**Mini-phase**: Extractor improvement (eval-driven) + retrieval harness expansion.
+**Duration**: 8 days, hard gates at each day boundary.
+**Plan author**: Codex (2026-04-11). **Executor**: Claude. **Audit**: Codex.
+
+### Preflight (before Day 1)
+
+Stop if any of these fail:
+
+- `git rev-parse HEAD` on `main` matches the expected branching tip
+- Live `/health` on Dalidou reports the SHA you think is deployed
+- `python scripts/retrieval_eval.py --json` still passes at the current baseline
+- `batch-extract` over the known 42-capture slice reproduces the current low-yield baseline
+- A frozen sample set exists for extractor labeling so the target does not move mid-phase
+
+Success: baseline eval output saved, baseline extract output saved, working branch created from `origin/main`.
+
+### Day 1 — Labeled extractor eval set
+
+Pick 30 real captures: 10 that should produce 0 candidates, 10 that should plausibly produce 1, 10 ambiguous/hard. Store as a stable artifact (interaction id, expected count, expected type, notes). Add a runner that scores extractor output against labels.
+
+Success: 30 labeled interactions in a stable artifact, one-command precision/recall output.
+Fail-early: if labeling 30 takes more than a day because the concept is unclear, tighten the extraction target before touching code.
+
+### Day 2 — Measure current extractor
+
+Run the rule-based extractor on all 30. Record yield, TP, FP, FN. Bucket misses by class (conversational preference, decision summary, status/constraint, meta chatter).
+
+Success: short scorecard with counts by miss type, top 2 miss classes obvious.
+Fail-early: if the labeled set shows fewer than 5 plausible positives total, the corpus is too weak — relabel before tuning.
+
+### Day 3 — Smallest rule expansion for top miss class
+
+Add 1-2 narrow, explainable rules for the worst miss class. Add unit tests from real paraphrase examples in the labeled set. Then rerun eval.
+
+Success: recall up on the labeled set, false positives do not materially rise, new tests cover the new cue class.
+Fail-early: if one rule expansion raises FP above ~20% of extracted candidates, revert or narrow before adding more.
+
+### Day 4 — Decision gate: more rules or LLM-assisted prototype
+
+If rule expansion reaches a **meaningfully reviewable queue**, keep going with rules. Otherwise prototype an LLM-assisted extraction mode behind a flag.
+
+"Meaningfully reviewable queue":
+- ≥ 15-25% candidate yield on the 30 labeled captures
+- FP rate low enough that manual triage feels tolerable
+- ≥ 2 real non-synthetic candidates worth review
+
+Hard stop: if candidate yield is still under 10% after this point, stop rule tinkering and switch to architecture review (LLM-assisted OR narrower extraction scope).
+
+### Day 5 — Stabilize and document
+
+Add remaining focused rules or the flagged LLM-assisted path. Write down in-scope and out-of-scope utterance kinds.
+
+Success: labeled eval green against target threshold, extractor scope explainable in ≤ 5 bullets.
+
+### Day 6 — Retrieval harness expansion (6 → 15-20 fixtures)
+
+Grow across p04/p05/p06. Include short ambiguous prompts, cross-project collision cases, expected project-state wins, expected project-memory wins, and 1-2 "should fail open / low confidence" cases.
+
+Success: ≥ 15 fixtures, each active project has easy + medium + hard cases.
+Fail-early: if fixtures are mostly obvious wins, add harder adversarial cases before claiming coverage.
+
+### Day 7 — Regression pass and calibration
+
+Run harness on current code vs live Dalidou. Inspect failures (ranking, ingestion gap, project bleed, budget). Make at most ONE ranking/budget tweak if the harness clearly justifies it. Do not mix harness expansion and ranking changes in a single commit unless tightly coupled.
+
+Success: harness still passes or improves after extractor work; any ranking tweak is justified by a concrete fixture delta.
+Fail-early: if > 20-25% of harness fixtures regress after extractor changes, separate concerns before merging.
+
+### Day 8 — Merge and close
+
+Clean commit sequence. Save before/after metrics (extractor scorecard, harness results). Update docs only with claims the metrics support.
+
+Merge order: labeled corpus + runner → extractor improvements + tests → harness expansion → any justified ranking tweak → docs sync last.
+
+Success: point to a before/after delta for both extraction and retrieval; docs do not overclaim.
+
+### Hard Gates (stop/rethink points)
+
+- Extractor yield < 10% after 30 labeled interactions → stop, reconsider rule-only extraction
+- FP rate > 20% on labeled set → narrow rules before adding more
+- Harness expansion finds < 3 genuinely hard cases → harness still too soft
+- Ranking change improves one project but regresses another → do not merge without explicit tradeoff note
+
+### Branching
+
+One branch `codex/extractor-eval-loop` for Day 1-5, a second `codex/retrieval-harness-expansion` for Day 6-7. Keeps extraction and retrieval judgments auditable.
+
+## Open Review Findings
+
+| id  | finder | severity | file:line                                 | summary                                                    | status   |
+|-----|--------|----------|-------------------------------------------|------------------------------------------------------------|----------|
+| R1  | Codex  | P1       | src/atocore/api/routes.py                 | Capture→extract still manual; "loop closed both sides" was overstated | acknowledged — addressed in Active Plan Day 1-5 |
+| R2  | Codex  | P1       | src/atocore/context/builder.py            | Project memories excluded from pack                        | fixed @ 8ea53f4 (codex read was stale, now caught up) |
+| R3  | Claude | P2       | src/atocore/memory/extractor.py           | Rule cues (`## Decision:`) never fire on conversational LLM text | open — Active Plan root cause |
+
+## Recent Decisions
+
+- **2026-04-11** Adopt this ledger as shared operating memory between Claude and Codex. *Proposed by:* Antoine. *Ratified by:* Antoine.
+- **2026-04-11** Accept Codex's 8-day mini-phase plan verbatim as Active Plan. *Proposed by:* Codex. *Ratified by:* Antoine.
+- **2026-04-11** Project memories land in the pack under `--- Project Memories ---` at 25% budget ratio, gated on canonical project hint. *Proposed by:* Claude.
+- **2026-04-11** Extraction stays off the capture hot path. Batch / manual only. *Proposed by:* Antoine.
+- **2026-04-11** 4-step roadmap: extractor → harness expansion → Wave 2 ingestion → OpenClaw finish. Steps 1+2 as one mini-phase. *Ratified by:* Antoine.
+- **2026-04-11** Codex branches must fork from `main`, not be orphan commits. *Proposed by:* Claude. *Agreed by:* Codex.
+
+## Session Log
+
+- **2026-04-11 Claude** `c5bad99..b3253f3` (11 commits + 1 merge). Length-aware reinforcement, project memories in pack, query-relevance memory ranking, hyphenated-identifier tokenizer, retrieval eval harness seeded, off-host backup wired end-to-end, docs synced, codex integration-pass branch merged. Harness went 0→6/6 on live Dalidou.
+- **2026-04-11 Codex (async review)** identified 2 P1s against a stale checkout. R1 was fair (extraction not automated), R2 was outdated (project memories already landed on main). Delivered the 8-day execution plan now in Active Plan.
+- **2026-04-06 Antoine** created `codex/atocore-integration-pass` with the `t420-openclaw/` workspace (merged 2026-04-11).
+
+## Working Rules
+
+- Claude builds; Codex audits. No parallel work on the same files.
+- Codex branches fork from `main`: `git fetch origin && git checkout -b codex/<topic> origin/main`.
+- P1 findings block further main commits until acknowledged in Open Review Findings.
+- Every session appends at least one Session Log line and bumps Orientation.
+- Trim Session Log and Recent Decisions to the last 20 at session end.
+- Docs in `docs/` may overclaim stale status; the ledger is the one-file source of truth for "what is true right now."
+
+## Quick Commands
+
+```bash
+# Check live state
+ssh papa@dalidou "curl -s http://localhost:8100/health"
+
+# Run the retrieval harness
+python scripts/retrieval_eval.py            # human-readable
+python scripts/retrieval_eval.py --json     # machine-readable
+
+# Deploy a new main tip
+git push origin main && ssh papa@dalidou "bash /srv/storage/atocore/app/deploy/dalidou/deploy.sh"
+
+# Reflection-loop ops
+python scripts/atocore_client.py batch-extract '' '' 200 false   # preview
+python scripts/atocore_client.py batch-extract '' '' 200 true    # persist
+python scripts/atocore_client.py triage
+```