feat: DEV-LEDGER.md as shared operating memory + session protocol

The ledger is the one-file source of truth for "what is currently
true" across Claude/Codex/human sessions:

- Orientation (live SHA, main tip, test count, harness state)
- Active Plan (currently Codex's 8-day extractor + harness plan
  with hard gates and fail-early thresholds)
- Open Review Findings (P1/P2, status)
- Recent Decisions (bounded to last 20)
- Session Log (bounded to last 20)
- Working Rules (no parallel work, branching rule, P1 block)

Narrative docs under docs/ sometimes lag reality; the ledger does
not. Every session MUST read it at start and append a Session Log
line before ending.

AGENTS.md: added a new "Session protocol" section at the top that
points at the ledger. Applies to any agent (Claude, Codex, future).

CLAUDE.md (new, project-local): project instructions for Claude
Code in this repo. Points at DEV-LEDGER.md and AGENTS.md, spells
out the deploy workflow and the Claude/Codex working model.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-04-11 14:46:21 -04:00
parent b3253f35ee
commit 59331e522d
3 changed files with 192 additions and 0 deletions

View File

@@ -1,5 +1,13 @@
# AGENTS.md
## Session protocol (read first, every session)
**Before doing anything else, read `DEV-LEDGER.md` at the repo root.** It is the one-file source of truth for "what is currently true" — live SHA, active plan, open review findings, recent decisions. The narrative docs under `docs/` may lag; the ledger does not.
**Before ending a session, append a Session Log line to `DEV-LEDGER.md`** with what you did and which commit range it covers, and bump the Orientation section if anything there changed.
This rule applies equally to Claude, Codex, and any future agent working in this repo.
## Project role
This repository is AtoCore, the runtime and machine-memory layer of the Ato ecosystem.

30
CLAUDE.md Normal file
View File

@@ -0,0 +1,30 @@
# CLAUDE.md — project instructions for AtoCore
## Session protocol
Before doing anything else in this repo, read `DEV-LEDGER.md` at the repo root. It is the shared operating memory between Claude, Codex, and the human operator — live Dalidou SHA, active plan, open P1/P2 review findings, recent decisions, and session log. The narrative docs under `docs/` sometimes lag; the ledger does not.
Before ending a session, append a Session Log line to `DEV-LEDGER.md` covering:
- which commits you produced (sha range)
- what changed at a high level
- any harness / test count deltas
- anything you overclaimed and later corrected
Bump the **Orientation** section if `live_sha`, `main_tip`, `test_count`, or `harness` changed.
`AGENTS.md` at the repo root carries the broader project principles (storage separation, deployment model, coding guidance). Read it when you need the "why" behind a constraint.
## Deploy workflow
```bash
git push origin main && ssh papa@dalidou "bash /srv/storage/atocore/app/deploy/dalidou/deploy.sh"
```
The deploy script self-verifies via `/health` build_sha — if it exits non-zero, do not assume the change is live.
## Working model
- Claude builds; Codex audits. No parallel work on the same files.
- P1 review findings block further `main` commits until acknowledged in the ledger's **Open Review Findings** table.
- Codex branches must fork from `origin/main` (no orphan commits that require `--allow-unrelated-histories`).

154
DEV-LEDGER.md Normal file
View File

@@ -0,0 +1,154 @@
# AtoCore Dev Ledger
> Shared operating memory between humans, Claude, and Codex.
> **Every session MUST read this file at start and append a Session Log entry before ending.**
> Section headers are stable — do not rename them. Trim Session Log and Recent Decisions to the last 20 entries at session end; older history lives in `git log` and `docs/`.
## Orientation
- **live_sha** (Dalidou `/health` build_sha): `38f6e52`
- **main_tip**: `b3253f3`
- **last_updated**: 2026-04-11 by Claude
- **test_count**: 264 passing
- **harness**: `6/6 PASS` (`python scripts/retrieval_eval.py` against live Dalidou)
- **off_host_backup**: `papa@192.168.86.39:/home/papa/atocore-backups/` via cron env `ATOCORE_BACKUP_RSYNC`, verified
## Active Plan
**Mini-phase**: Extractor improvement (eval-driven) + retrieval harness expansion.
**Duration**: 8 days, hard gates at each day boundary.
**Plan author**: Codex (2026-04-11). **Executor**: Claude. **Audit**: Codex.
### Preflight (before Day 1)
Stop if any of these fail:
- `git rev-parse HEAD` on `main` matches the expected branching tip
- Live `/health` on Dalidou reports the SHA you think is deployed
- `python scripts/retrieval_eval.py --json` still passes at the current baseline
- `batch-extract` over the known 42-capture slice reproduces the current low-yield baseline
- A frozen sample set exists for extractor labeling so the target does not move mid-phase
Success: baseline eval output saved, baseline extract output saved, working branch created from `origin/main`.
### Day 1 — Labeled extractor eval set
Pick 30 real captures: 10 that should produce 0 candidates, 10 that should plausibly produce 1, 10 ambiguous/hard. Store as a stable artifact (interaction id, expected count, expected type, notes). Add a runner that scores extractor output against labels.
Success: 30 labeled interactions in a stable artifact, one-command precision/recall output.
Fail-early: if labeling 30 takes more than a day because the concept is unclear, tighten the extraction target before touching code.
### Day 2 — Measure current extractor
Run the rule-based extractor on all 30. Record yield, TP, FP, FN. Bucket misses by class (conversational preference, decision summary, status/constraint, meta chatter).
Success: short scorecard with counts by miss type, top 2 miss classes obvious.
Fail-early: if the labeled set shows fewer than 5 plausible positives total, the corpus is too weak — relabel before tuning.
### Day 3 — Smallest rule expansion for top miss class
Add 1-2 narrow, explainable rules for the worst miss class. Add unit tests from real paraphrase examples in the labeled set. Then rerun eval.
Success: recall up on the labeled set, false positives do not materially rise, new tests cover the new cue class.
Fail-early: if one rule expansion raises FP above ~20% of extracted candidates, revert or narrow before adding more.
### Day 4 — Decision gate: more rules or LLM-assisted prototype
If rule expansion reaches a **meaningfully reviewable queue**, keep going with rules. Otherwise prototype an LLM-assisted extraction mode behind a flag.
"Meaningfully reviewable queue":
- ≥ 15-25% candidate yield on the 30 labeled captures
- FP rate low enough that manual triage feels tolerable
- ≥ 2 real non-synthetic candidates worth review
Hard stop: if candidate yield is still under 10% after this point, stop rule tinkering and switch to architecture review (LLM-assisted OR narrower extraction scope).
### Day 5 — Stabilize and document
Add remaining focused rules or the flagged LLM-assisted path. Write down in-scope and out-of-scope utterance kinds.
Success: labeled eval green against target threshold, extractor scope explainable in ≤ 5 bullets.
### Day 6 — Retrieval harness expansion (6 → 15-20 fixtures)
Grow across p04/p05/p06. Include short ambiguous prompts, cross-project collision cases, expected project-state wins, expected project-memory wins, and 1-2 "should fail open / low confidence" cases.
Success: ≥ 15 fixtures, each active project has easy + medium + hard cases.
Fail-early: if fixtures are mostly obvious wins, add harder adversarial cases before claiming coverage.
### Day 7 — Regression pass and calibration
Run harness on current code vs live Dalidou. Inspect failures (ranking, ingestion gap, project bleed, budget). Make at most ONE ranking/budget tweak if the harness clearly justifies it. Do not mix harness expansion and ranking changes in a single commit unless tightly coupled.
Success: harness still passes or improves after extractor work; any ranking tweak is justified by a concrete fixture delta.
Fail-early: if > 20-25% of harness fixtures regress after extractor changes, separate concerns before merging.
### Day 8 — Merge and close
Clean commit sequence. Save before/after metrics (extractor scorecard, harness results). Update docs only with claims the metrics support.
Merge order: labeled corpus + runner → extractor improvements + tests → harness expansion → any justified ranking tweak → docs sync last.
Success: point to a before/after delta for both extraction and retrieval; docs do not overclaim.
### Hard Gates (stop/rethink points)
- Extractor yield < 10% after 30 labeled interactions → stop, reconsider rule-only extraction
- FP rate > 20% on labeled set → narrow rules before adding more
- Harness expansion finds < 3 genuinely hard cases → harness still too soft
- Ranking change improves one project but regresses another → do not merge without explicit tradeoff note
### Branching
One branch `codex/extractor-eval-loop` for Day 1-5, a second `codex/retrieval-harness-expansion` for Day 6-7. Keeps extraction and retrieval judgments auditable.
## Open Review Findings
| id | finder | severity | file:line | summary | status |
|-----|--------|----------|-------------------------------------------|------------------------------------------------------------|----------|
| R1 | Codex | P1 | src/atocore/api/routes.py | Capture→extract still manual; "loop closed both sides" was overstated | acknowledged — addressed in Active Plan Day 1-5 |
| R2 | Codex | P1 | src/atocore/context/builder.py | Project memories excluded from pack | fixed @ 8ea53f4 (codex read was stale, now caught up) |
| R3 | Claude | P2 | src/atocore/memory/extractor.py | Rule cues (`## Decision:`) never fire on conversational LLM text | open — Active Plan root cause |
## Recent Decisions
- **2026-04-11** Adopt this ledger as shared operating memory between Claude and Codex. *Proposed by:* Antoine. *Ratified by:* Antoine.
- **2026-04-11** Accept Codex's 8-day mini-phase plan verbatim as Active Plan. *Proposed by:* Codex. *Ratified by:* Antoine.
- **2026-04-11** Project memories land in the pack under `--- Project Memories ---` at 25% budget ratio, gated on canonical project hint. *Proposed by:* Claude.
- **2026-04-11** Extraction stays off the capture hot path. Batch / manual only. *Proposed by:* Antoine.
- **2026-04-11** 4-step roadmap: extractor → harness expansion → Wave 2 ingestion → OpenClaw finish. Steps 1+2 as one mini-phase. *Ratified by:* Antoine.
- **2026-04-11** Codex branches must fork from `main`, not be orphan commits. *Proposed by:* Claude. *Agreed by:* Codex.
## Session Log
- **2026-04-11 Claude** `c5bad99..b3253f3` (11 commits + 1 merge). Length-aware reinforcement, project memories in pack, query-relevance memory ranking, hyphenated-identifier tokenizer, retrieval eval harness seeded, off-host backup wired end-to-end, docs synced, codex integration-pass branch merged. Harness went 0→6/6 on live Dalidou.
- **2026-04-11 Codex (async review)** identified 2 P1s against a stale checkout. R1 was fair (extraction not automated), R2 was outdated (project memories already landed on main). Delivered the 8-day execution plan now in Active Plan.
- **2026-04-06 Antoine** created `codex/atocore-integration-pass` with the `t420-openclaw/` workspace (merged 2026-04-11).
## Working Rules
- Claude builds; Codex audits. No parallel work on the same files.
- Codex branches fork from `main`: `git fetch origin && git checkout -b codex/<topic> origin/main`.
- P1 findings block further main commits until acknowledged in Open Review Findings.
- Every session appends at least one Session Log line and bumps Orientation.
- Trim Session Log and Recent Decisions to the last 20 at session end.
- Docs in `docs/` may overclaim stale status; the ledger is the one-file source of truth for "what is true right now."
## Quick Commands
```bash
# Check live state
ssh papa@dalidou "curl -s http://localhost:8100/health"
# Run the retrieval harness
python scripts/retrieval_eval.py # human-readable
python scripts/retrieval_eval.py --json # machine-readable
# Deploy a new main tip
git push origin main && ssh papa@dalidou "bash /srv/storage/atocore/app/deploy/dalidou/deploy.sh"
# Reflection-loop ops
python scripts/atocore_client.py batch-extract '' '' 200 false # preview
python scripts/atocore_client.py batch-extract '' '' 200 true # persist
python scripts/atocore_client.py triage
```