
Anto01 4d4d5f437a test(harness): fix p06-tailscale false positive, 18/18 PASS

The fixture's expect_absent: "GigaBIT" was catching legitimate
semantic overlap, not retrieval bleed. The p06 ARCHITECTURE.md
Overview describes the Polisher Suite as built for the GigaBIT M1
mirror — it is what the polisher is for, so the word appears
correctly in p06 content. All retrieved sources for this prompt
were genuinely p06/shared paths; zero actual p04 chunks leaked.

Narrowed the assertion to expect_absent: "[Source: p04-gigabit/",
which tests the real invariant (no p04 source chunks retrieved
into p06 context) without the false positive.

No retrieval/ranking code change. Fixture-only fix.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-15 11:23:00 -04:00


AtoCore Dev Ledger

Shared operating memory between humans, Claude, and Codex. Every session MUST read this file at start and append a Session Log entry before ending. Section headers are stable - do not rename them. Trim Session Log and Recent Decisions to the last 20 entries at session end; older history lives in git log and docs/.

Orientation

live_sha (Dalidou /health build_sha): c2e7064 (verified 2026-04-15 via /health, build_time 2026-04-15T15:08:51Z)
last_updated: 2026-04-15 by Claude (deploy caught up; R10/R13 closed)
main_tip: c2e7064 (plus one pending doc/ledger commit for this session)
test_count: 299 collected via pytest --collect-only -q on a clean checkout, 2026-04-15 (reproduction recipe in Quick Commands)
harness: 18/18 PASS
vectors: 33,253
active_memories: 84 (31 project, 23 knowledge, 10 episodic, 8 adaptation, 7 preference, 5 identity)
candidate_memories: 2
registered_projects: atocore, p04-gigabit, p05-interferometer, p06-polisher, atomizer-v2, abb-space (aliased p08)
project_state_entries: 78 total (p04=9, p05=13, p06=13, atocore=43)
entities: 35 (engineering knowledge graph, Layer 2)
off_host_backup: papa@192.168.86.39:/home/papa/atocore-backups/ via cron, verified
nightly_pipeline: backup → cleanup → rsync → OpenClaw import (NEW) → vault refresh (NEW) → extract → auto-triage → weekly synth/lint Sundays
capture_clients: claude-code (Stop hook), openclaw (plugin + file importer)
wiki: http://dalidou:8100/wiki (browse), /wiki/projects/{id}, /wiki/entities/{id}, /wiki/search
dashboard: http://dalidou:8100/admin/dashboard

Active Plan

Mini-phase: Extractor improvement (eval-driven) + retrieval harness expansion. Duration: 8 days, hard gates at each day boundary. Plan author: Codex (2026-04-11). Executor: Claude. Audit: Codex.

Preflight (before Day 1)

Stop if any of these fail:

git rev-parse HEAD on main matches the expected branching tip
Live /health on Dalidou reports the SHA you think is deployed
python scripts/retrieval_eval.py --json still passes at the current baseline
batch-extract over the known 42-capture slice reproduces the current low-yield baseline
A frozen sample set exists for extractor labeling so the target does not move mid-phase

Success: baseline eval output saved, baseline extract output saved, working branch created from origin/main.
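The first two preflight checks can be sketched as a small helper. This is an illustration only, not a script in the repo: the function names are hypothetical, and it assumes /health returns JSON with a build_sha field (the field name is taken from the Orientation line above). Short and full SHAs are compared by prefix.

```python
import json
import subprocess
from urllib.request import urlopen


def sha_matches(expected: str, actual: str) -> bool:
    """True when one SHA is a prefix of the other (short vs full form)."""
    expected, actual = expected.strip().lower(), actual.strip().lower()
    if not expected or not actual:
        return False
    return expected.startswith(actual) or actual.startswith(expected)


def local_head() -> str:
    """Return the SHA of HEAD in the current git checkout."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def live_build_sha(health_url: str) -> str:
    """Fetch /health and return build_sha (assumed field name), or '' if absent."""
    with urlopen(health_url, timeout=10) as response:
        return json.load(response).get("build_sha", "")
```

Usage would look like sha_matches("c2e7064", local_head()) and sha_matches("c2e7064", live_build_sha("http://dalidou:8100/health")); the URL is an assumption based on the wiki and dashboard addresses in Orientation. Stop if either returns False.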

Day 1 - Labeled extractor eval set

Pick 30 real captures: 10 that should produce 0 candidates, 10 that should plausibly produce 1, 10 ambiguous/hard. Store as a stable artifact (interaction id, expected count, expected type, notes). Add a runner that scores extractor output against labels.

Success: 30 labeled interactions in a stable artifact, one-command precision/recall output. Fail-early: if labeling 30 takes more than a day because the concept is unclear, tighten the extraction target before touching code.
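A minimal sketch of the Day 1 runner, assuming labels carry the four fields listed above and that scoring is count-based per interaction (it does not check whether the right candidate was extracted, only how many). The Label class and score function are hypothetical names, not the contents of scripts/extractor_eval.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Label:
    interaction_id: str
    expected_count: int
    expected_type: str = ""
    notes: str = ""


def score(labels: list[Label], produced: dict[str, int]) -> dict[str, float]:
    """Count-based TP/FP/FN over the labeled set, plus precision and recall.

    `produced` maps interaction id -> number of candidates the extractor emitted.
    """
    tp = fp = fn = 0
    for label in labels:
        got = produced.get(label.interaction_id, 0)
        tp += min(got, label.expected_count)          # candidates that were expected
        fp += max(got - label.expected_count, 0)      # extra candidates
        fn += max(label.expected_count - got, 0)      # missing candidates
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}
```

Count-based scoring is the cheapest thing that gives a one-command number; a stricter runner would also compare expected_type against the extracted candidate's type.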

Day 2 - Measure current extractor

Run the rule-based extractor on all 30. Record yield, TP, FP, FN. Bucket misses by class (conversational preference, decision summary, status/constraint, meta chatter).

Success: short scorecard with counts by miss type, top 2 miss classes obvious. Fail-early: if the labeled set shows fewer than 5 plausible positives total, the corpus is too weak - relabel before tuning.

Day 3 - Smallest rule expansion for top miss class

Add 1-2 narrow, explainable rules for the worst miss class. Add unit tests from real paraphrase examples in the labeled set. Then rerun eval.

Success: recall up on the labeled set, false positives do not materially rise, new tests cover the new cue class. Fail-early: if one rule expansion raises FP above ~20% of extracted candidates, revert or narrow before adding more.

Day 4 - Decision gate: more rules or LLM-assisted prototype

If rule expansion reaches a meaningfully reviewable queue, keep going with rules. Otherwise prototype an LLM-assisted extraction mode behind a flag.

"Meaningfully reviewable queue":

>= 15-25% candidate yield on the 30 labeled captures
FP rate low enough that manual triage feels tolerable
>= 2 real non-synthetic candidates worth review

Hard stop: if candidate yield is still under 10% after this point, stop rule tinkering and switch to architecture review (LLM-assisted OR narrower extraction scope).

Day 5 - Stabilize and document

Add remaining focused rules or the flagged LLM-assisted path. Write down in-scope and out-of-scope utterance kinds.

Success: labeled eval green against target threshold, extractor scope explainable in <= 5 bullets.

Day 6 - Retrieval harness expansion (6 -> 15-20 fixtures)

Grow across p04/p05/p06. Include short ambiguous prompts, cross-project collision cases, expected project-state wins, expected project-memory wins, and 1-2 "should fail open / low confidence" cases.

Success: >= 15 fixtures, each active project has easy + medium + hard cases. Fail-early: if fixtures are mostly obvious wins, add harder adversarial cases before claiming coverage.

Day 7 - Regression pass and calibration

Run harness on current code vs live Dalidou. Inspect failures (ranking, ingestion gap, project bleed, budget). Make at most ONE ranking/budget tweak if the harness clearly justifies it. Do not mix harness expansion and ranking changes in a single commit unless tightly coupled.

Success: harness still passes or improves after extractor work; any ranking tweak is justified by a concrete fixture delta. Fail-early: if > 20-25% of harness fixtures regress after extractor changes, separate concerns before merging.
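The Day 7 fail-early check can be sketched as a comparison of two harness runs. This assumes each run is reduced to a fixture-name -> passed mapping, counts a regression as "passed before, fails or is missing after", and divides by the total number of fixtures in the earlier run. The function names and the 0.20 default (the conservative end of the 20-25% band) are illustrative choices.

```python
def regression_fraction(before: dict[str, bool], after: dict[str, bool]) -> float:
    """Fraction of fixtures that passed in `before` but do not pass in `after`."""
    if not before:
        return 0.0
    regressed = [
        name for name, passed in before.items()
        if passed and not after.get(name, False)
    ]
    return len(regressed) / len(before)


def needs_separation(
    before: dict[str, bool], after: dict[str, bool], threshold: float = 0.20
) -> bool:
    """True when regressions exceed the threshold and concerns should be split."""
    return regression_fraction(before, after) > threshold
```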

Day 8 - Merge and close

Clean commit sequence. Save before/after metrics (extractor scorecard, harness results). Update docs only with claims the metrics support.

Merge order: labeled corpus + runner -> extractor improvements + tests -> harness expansion -> any justified ranking tweak -> docs sync last.

Success: point to a before/after delta for both extraction and retrieval; docs do not overclaim.

Hard Gates (stop/rethink points)

Extractor yield < 10% after 30 labeled interactions -> stop, reconsider rule-only extraction
FP rate > 20% on labeled set -> narrow rules before adding more
Harness expansion finds < 3 genuinely hard cases -> harness still too soft
Ranking change improves one project but regresses another -> do not merge without explicit tradeoff note
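The four gates above reduce to simple threshold checks. A sketch, assuming the metrics have already been computed elsewhere and are passed in as fractions; PhaseMetrics and tripped_gates are hypothetical names, and fixture_delta_by_project is an assumed representation of "change in passing fixtures per project".

```python
from dataclasses import dataclass, field


@dataclass
class PhaseMetrics:
    labeled_interactions: int
    candidate_yield: float      # fraction in 0.0-1.0
    fp_rate: float              # fraction in 0.0-1.0, measured on the labeled set
    hard_fixture_count: int     # genuinely hard cases found by harness expansion
    fixture_delta_by_project: dict[str, int] = field(default_factory=dict)


def tripped_gates(m: PhaseMetrics) -> list[str]:
    """Return the action for every hard gate the metrics trip (empty if none)."""
    gates = []
    if m.labeled_interactions >= 30 and m.candidate_yield < 0.10:
        gates.append("yield: stop, reconsider rule-only extraction")
    if m.fp_rate > 0.20:
        gates.append("fp: narrow rules before adding more")
    if m.hard_fixture_count < 3:
        gates.append("harness: still too soft")
    deltas = m.fixture_delta_by_project.values()
    if any(d > 0 for d in deltas) and any(d < 0 for d in deltas):
        gates.append("ranking: do not merge without explicit tradeoff note")
    return gates
```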

Branching

One branch codex/extractor-eval-loop for Days 1-5, a second codex/retrieval-harness-expansion for Days 6-7. Keeping them apart keeps extraction and retrieval judgments auditable.

Review Protocol

Codex records review findings in Open Review Findings.
Claude must read Open Review Findings at session start before coding.
Codex owns finding text. Claude may update operational fields only:
- status
- owner
- resolved_by
If Claude disagrees with a finding, do not rewrite it. Mark it declined and explain why in the Session Log.
Any commit or session that addresses a finding should reference the finding id in the commit message or Session Log.
P1 findings block further commits in the affected area until they are at least acknowledged and explicitly tracked.
Findings may be code-level, claim-level, or ops-level. If the implementation boundary changes, retarget the finding instead of silently closing it.
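The ownership rule in this protocol could be enforced mechanically. A sketch, assuming each finding is a dict keyed by the Open Review Findings column names; the helper names are hypothetical and nothing like this is confirmed to exist in the repo.

```python
# Fields Claude may change, per the protocol; everything else is Codex-owned.
CLAUDE_EDITABLE = frozenset({"status", "owner", "resolved_by"})


def changed_fields(before: dict[str, str], after: dict[str, str]) -> set[str]:
    """Names of every field whose value differs between two versions of a finding."""
    keys = before.keys() | after.keys()
    return {key for key in keys if before.get(key) != after.get(key)}


def edit_allowed(editor: str, before: dict[str, str], after: dict[str, str]) -> bool:
    """Codex may edit anything; any other editor is limited to operational fields."""
    if editor == "Codex":
        return True
    return changed_fields(before, after) <= CLAUDE_EDITABLE
```

Declining a finding (status open -> declined) passes this check; rewording its summary does not, which matches the "do not rewrite it" rule.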

Open Review Findings

id	finder	severity	file:line	summary	status	owner	opened_at	resolved_by
R1	Codex	P1	deploy/hooks/capture_stop.py:76-85	Live Claude capture still omits `extract`, so "loop closed both sides" remains overstated in practice even though the API supports it	fixed	Claude	2026-04-11	`c67bec0`
R2	Codex	P1	src/atocore/context/builder.py	Project memories excluded from pack	fixed	Claude	2026-04-11	`8ea53f4`
R3	Claude	P2	src/atocore/memory/extractor.py	Rule cues (`## Decision:`) never fire on conversational LLM text	declined	Claude	2026-04-11	see 2026-04-14 session log
R4	Codex	P2	DEV-LEDGER.md:11	Orientation `main_tip` was stale versus `HEAD` / `origin/main`	fixed	Codex	2026-04-11	`81307ce`
R5	Codex	P1	src/atocore/interactions/service.py:157-174	The deployed extraction path still calls only the rule extractor; the new LLM extractor is eval/script-only, so Day 4 "gate cleared" is true as a benchmark result but not as an operational extraction path	fixed	Claude	2026-04-12	`c67bec0`
R6	Codex	P1	src/atocore/memory/extractor_llm.py:258-276	LLM extraction accepts model-supplied `project` verbatim with no fallback to `interaction.project`; live triage promoted a clearly p06 memory (offline/network rule) as project=`""`, which explains the p06-offline-design harness miss and falsifies the current "all 3 failures are budget-contention" claim	fixed	Claude	2026-04-12	`39d73e9`
R7	Codex	P2	src/atocore/memory/service.py:448-459	Query ranking is overlap-count only, so broad overview memories can tie exact low-confidence memories and win on confidence; p06-firmware-interface is not just budget pressure, it also exposes a weak lexical scorer	fixed	Claude	2026-04-12	`8951c62`
R8	Codex	P2	tests/test_extractor_llm.py:1-7	LLM extractor tests stop at parser/failure contracts; there is no automated coverage for the script-only persistence/review path that produced the 16 promoted memories, including project-scope preservation	fixed	Claude	2026-04-12	`69c9717`
R9	Codex	P2	src/atocore/memory/extractor_llm.py:258-259	The R6 fallback only repairs empty project output. A wrong non-empty model project still overrides the interaction's known scope, so project attribution is improved but not yet trust-preserving.	fixed	Claude	2026-04-12	`e5e9a99`
R10	Codex	P2	docs/master-plan-status.md:31-33	"Phase 8 - OpenClaw Integration" is fair as a baseline milestone, but not as a "primary" integration claim. `t420-openclaw/atocore.py` currently covers a narrow read-oriented subset (13 request shapes vs 32 API routes) plus fail-open health, while memory/interactions/admin write paths remain out of surface.	fixed	Claude	2026-04-12	(pending)
R11	Codex	P2	src/atocore/api/routes.py:773-845	`POST /admin/extract-batch` still accepts `mode="llm"` inside the container and returns a successful 0-candidate result instead of surfacing that host-only LLM extraction is unavailable from this runtime. That is a misleading API contract for operators.	fixed	Claude	2026-04-12	(pending)
R12	Codex	P2	scripts/batch_llm_extract_live.py:39-190	The host-side extractor duplicates the LLM system prompt and JSON parsing logic from `src/atocore/memory/extractor_llm.py`. It works today, but this is now a prompt/parser drift risk across the container and host implementations.	fixed	Claude	2026-04-12	(pending)
R13	Codex	P2	DEV-LEDGER.md:12	The new `286 passing` test-count claim is not reproducibly auditable from the current audit environments: neither Dalidou nor the clean worktree has `pytest` available. The claim may be true in Claude's dev shell, but it remains unverified in this audit.	fixed	Claude	2026-04-12	(pending)
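R6 and R9 both concern project attribution. The policy the Session Log records for the R9 fix (interaction scope wins when set; the model's project is only a fallback for unscoped captures, and only if registered) can be sketched as below. This illustrates the rule only, not the code in src/atocore/memory/extractor_llm.py; resolve_project is a hypothetical name, and returning an empty string for an unregistered model project is an assumption.

```python
def resolve_project(
    interaction_project: str, model_project: str, registered: set[str]
) -> str:
    """Pick the project for an extracted memory under the trust hierarchy."""
    if interaction_project:
        # The interaction's known scope is the strongest signal.
        return interaction_project
    if model_project in registered:
        # Unscoped capture: accept the model's answer only for known projects.
        return model_project
    # Unscoped capture and no trustworthy model answer: leave unscoped (assumed).
    return ""
```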

Recent Decisions

2026-04-12 Day 4 gate cleared: LLM-assisted extraction via claude -p (OAuth, no API key) is the path forward. Rule extractor stays as default for structural cues. Proposed by: Claude. Ratified by: Antoine.
2026-04-12 First live triage: 16 promoted, 35 rejected from 51 LLM-extracted candidates. 31% accept rate. Active memory count 20->36. Executed by: Claude. Ratified by: Antoine.
2026-04-12 No API keys allowed in AtoCore — LLM-assisted features use OAuth via claude -p or equivalent CLI-authenticated paths. Proposed by: Antoine.
2026-04-12 Multi-model extraction direction: extraction/triage should be model-agnostic, with Codex/Gemini/Ollama as second-pass reviewers for robustness. Proposed by: Antoine.
2026-04-11 Adopt this ledger as shared operating memory between Claude and Codex. Proposed by: Antoine. Ratified by: Antoine.
2026-04-11 Accept Codex's 8-day mini-phase plan verbatim as Active Plan. Proposed by: Codex. Ratified by: Antoine.
2026-04-11 Review findings live in DEV-LEDGER.md with Codex owning finding text and Claude updating status fields only. Proposed by: Codex. Ratified by: Antoine.
2026-04-11 Project memories land in the pack under --- Project Memories --- at 25% budget ratio, gated on canonical project hint. Proposed by: Claude.
2026-04-11 Extraction stays off the capture hot path. Batch / manual only. Proposed by: Antoine.
2026-04-11 4-step roadmap: extractor -> harness expansion -> Wave 2 ingestion -> OpenClaw finish. Steps 1+2 as one mini-phase. Ratified by: Antoine.
2026-04-11 Codex branches must fork from main, not be orphan commits. Proposed by: Claude. Agreed by: Codex.

Session Log

2026-04-15 Claude (pm) Closed the last harness failure honestly. p06-tailscale fixed: 18/18 PASS. Root-caused: not a retrieval bug — the p06 ARCHITECTURE.md Overview chunk legitimately mentions "the GigaBIT M1 telescope mirror" because the Polisher Suite is built for that mirror. All four retrieved sources for the tailscale prompt were genuinely p06/shared paths; zero actual p04 chunks leaked. The fixture's expect_absent: GigaBIT was catching semantic overlap, not retrieval bleed. Narrowed it to expect_absent: "[Source: p04-gigabit/" — a source-path check that tests the real invariant (no p04 source chunks in p06 context). Other p06 fixtures still use the word-blacklist form; they pass today because their more-specific prompts don't pull the ARCHITECTURE.md Overview, so I left them alone rather than churn fixtures that aren't failing. Did NOT change retrieval/ranking — no code change, fixture-only fix. Tests unchanged at 299.
2026-04-15 Claude Deploy + doc debt sweep. Deployed c2e7064 to Dalidou (build_time 2026-04-15T15:08:51Z, build_sha matches, /health ok) so R11/R12 are now live, not just on main. R11 verified on live: POST /admin/extract-batch {"mode":"llm"} against http://127.0.0.1:8100 returns HTTP 503 with the operator-facing "claude CLI not on PATH, run host-side script or use mode=rule" message — exactly the post-fix contract. R13 closed (fixed): added a reproduction recipe to Quick Commands (pip install -r requirements-dev.txt && pytest --collect-only -q && pytest -q) and re-cited test_count: 299 against a fresh local collection on 2026-04-15, so the claim is now auditable from any clean checkout — Codex's audit worktree just needs pip install -r requirements-dev.txt. R10 closed (fixed): rewrote the docs/master-plan-status.md OpenClaw section to explicitly disclaim "primary integration" and report the current narrow surface: 14 client request shapes against ~44 server routes, predominantly read + /project/state + /ingest/sources, with memory/interactions/admin/entities/triage/extraction writes correctly out of scope. Open findings now: none blocking. Next natural move: the last harness failure p06-tailscale (chunk bleed).
2026-04-14 Claude (pm) Closed R11+R12, declined R3. R11 (fixed): POST /admin/extract-batch with mode="llm" now returns 503 when the claude CLI is not on PATH, with a message pointing at the host-side script. Previously it silently returned a success-0 payload, masking host-vs-container truth. 2 new tests in test_extraction_pipeline.py cover the 503 path and the rule-mode-still-works path. R12 (fixed): extracted shared SYSTEM_PROMPT + parse_llm_json_array + normalize_candidate_item + build_user_message into stdlib-only src/atocore/memory/_llm_prompt.py. Both src/atocore/memory/extractor_llm.py (container) and scripts/batch_llm_extract_live.py (host) now import from it. The host script uses sys.path to reach the stdlib-only module without needing the full atocore package. Project-attribution policy stays path-specific (container uses registry-check; host defers to server). R3 (declined): rule cues not firing on conversational LLM text is by design now — the LLM extractor (llm-0.4.0) is the production path for conversational content as of the Day 4 gate (2026-04-12). Expanding rules to match conversational prose risks the FP blowup Day 2 already showed. Rule extractor stays narrow for structural PKM text. Tests 297 → 299. Live /health still 58ea21d; this session's changes need deploy.
2026-04-14 Claude MAJOR session: Engineering knowledge layer V1 (Layer 2) built — entity + relationship tables, 15 types, 12 relationship kinds, 35 bootstrapped entities across p04/p05/p06. Human Mirror (Layer 3) — GET /projects/{name}/mirror.html + navigable wiki at /wiki with search. Karpathy-inspired upgrades: contradiction detection in triage, weekly lint pass, weekly synthesis pass producing "current state" paragraphs at top of project pages. Auto-detection of new projects from extraction. Registry persistence fix (ATOCORE_PROJECT_REGISTRY_DIR env var). abb-space/p08 aliases added, atomizer-v2 ingested (568 docs, +12,472 vectors). Identity/preference seed (6 new), signal-aggressive extractor rewrite (llm-0.4.0), auto vault refresh in cron. OpenClaw one-way pull importer built per codex proposal — reads /home/papa/clawd SOUL.md, USER.md, MEMORY.md, MODEL-ROUTING.md, memory/*.md via SSH, hash-delta import, pipeline triages. First import: 10 candidates → 10 promoted with lenient triage rule. Active memories 47→84. State entries 61→78. Tests 290→297. Dashboard at /admin/dashboard. Wiki at /wiki.
2026-04-12 Claude 4f8bec7..4ac4e5c Session close. Merged OpenClaw capture plugin, ingested atomizer-v2 (568 docs, 12,472 new vectors → 33,253 total), seeded Phase 4 identity/preference memories (6 new, 47 total active), added deeper Wave 2 state entries (p05 +3, p06 +3), fixed R9 project trust hierarchy (7 case tests), built auto-triage pipeline, observability dashboard at /admin/dashboard. Updated master-plan-status.md and DEV-LEDGER.md to reflect full current state. 7/14 phases baseline complete. All P1s closed. Nightly pipeline runs unattended with both Claude Code and OpenClaw feeding the reflection loop.
2026-04-12 Codex (branch codex/openclaw-capture-plugin) added a minimal external OpenClaw plugin at openclaw-plugins/atocore-capture/ that mirrors Claude Code capture semantics: user-triggered assistant turns are POSTed to AtoCore /interactions with client="openclaw" and reinforce=true, fail-open, no extraction in-path. For live verification, temporarily added the local plugin load path to OpenClaw config and restarted the gateway so the plugin can load. Branch truth is ready; end-to-end verification still needs one fresh post-restart OpenClaw user turn to confirm new client=openclaw interactions appear on Dalidou.
2026-04-12 Claude Batch 3 (R9 fix): 144dbbd..e5e9a99. Trust hierarchy for project attribution — interaction scope always wins when set, model project only used for unscoped interactions + registered check. 7 case tests (A-G) cover every combination. Harness 17/18 (no regression). Tests 286->290. Before: wrong registered project could silently override interaction scope. After: interaction.project is the strongest signal; model project is only a fallback for unscoped captures. Not yet guaranteed: nothing prevents the same project's model output from being semantically wrong within that project. R9 marked fixed.
2026-04-12 Codex (audit branch codex/audit-batch2) audited 69c9717..origin/main against the current branch tip and live Dalidou. Verified: live build is 8951c62, retrieval harness improved to 17/18 PASS, candidate queue is now empty, active memories rose to 41, and python3 scripts/auto_triage.py --dry-run --base-url http://127.0.0.1:8100 runs cleanly on Dalidou but only exercised the empty-queue path. Updated R7 to fixed (8951c62) and R8 to fixed (69c9717). Kept R9 open because project trust-preservation still allows a wrong non-empty registered project from the model to override the interaction scope. Added R13 because the new 286 passing claim could not be independently reproduced in this audit: pytest is absent on both Dalidou and the clean audit worktree. Also corrected stale Orientation fields (live SHA, main tip, harness, active/candidate memory counts).
2026-04-12 Codex (audit branch codex/audit-2026-04-12-extraction) audited 54d84b5..ac7f77d with live Dalidou verification. Confirmed the host-side LLM extraction pipeline is operational: nightly cron points at deploy/dalidou/cron-backup.sh, Step 4 calls deploy/dalidou/batch-extract.sh, the batch script exists/executable on Dalidou, and a manual host-side run produced candidates successfully. Updated R1 and R5 to fixed (c67bec0) because extraction now runs unattended off-container. Live state during audit: build 39d73e9, active memories 36, candidate queue 29 (16 existing + 13 added by manual verification run), and last_extract_batch_run populated in AtoCore project state. Added R11-R12 for the misleading container mode=llm no-op and host/container prompt-parser duplication. Security note: CLI positional prompt/response text is visible in process args while claude -p runs; acceptable on a single-user home host, but worth remembering if Dalidou's trust boundary changes.
2026-04-12 Codex (audit branch codex/audit-2026-04-12-final) audited c5bad99..e2895b5 against origin/main, live Dalidou, and the OpenClaw client script. Live state checked: build 39d73e9, harness reproducible at 16/18 PASS, active memories 36, and t420-openclaw/atocore.py health fails open correctly with fail_open=true. Spot-checks of Wave 2 project-state entries matched their cited vault docs. Updated R5-R8 status reality (R6 fixed by 39d73e9), added R9-R10, and corrected Orientation main_tip to e2895b5 because the ledger had drifted behind origin/main. Note: live Dalidou is still on 39d73e9, so branch-truth and deploy-truth are not the same yet.
2026-04-12 Claude Wave 2 trusted operational ingestion + Codex audit response. Read 6 vault docs, created 8 new Trusted Project State entries (p04 +2, p05 +3, p06 +3). Fixed R6 (project fallback in LLM extractor) per Codex audit. Fixed misscoped p06 offline memory on live Dalidou. Merged codex/audit-2026-04-12. Switched default LLM model from haiku to sonnet. Harness 15/18 -> 16/18. Tests 278 -> 280. main_tip 146f2e4 -> 39d73e9.
2026-04-12 Codex (audit branch codex/audit-2026-04-12) audited c5bad99..146f2e4 against code, live Dalidou, and the 36 active memories. Confirmed: claude -p invocation is not shell-injection-prone (subprocess.run(args) with no shell), off-host backup wiring matches the ledger, and R1 remains unresolved in practice. Added R5-R8. Corrected Orientation main_tip (146f2e4, not 5c69f77) and tightened the harness note: p06-firmware-interface is a ranking-tie issue, p06-offline-design comes from a project-scope miss in live triage, and p06-tailscale is retrieved-chunk bleed rather than memory-band budget contention.
2026-04-12 Claude 06792d8..5c69f77 Day 5-8 close. Documented extractor scope (5 in-scope, 6 out-of-scope categories). Expanded harness from 6 to 18 fixtures (p04 +1, p05 +1, p06 +7, adversarial +2). Per-entry memory cap at 250 chars fixed 1 of 4 budget-contention failures. Final harness: 15/18 PASS. Mini-phase complete. Before/after: rule extractor 0% recall -> LLM 100%; harness 6/6 -> 15/18; active memories 20 -> 36.
2026-04-12 Claude 330ecfb..06792d8 (merged eval-loop branch + triage). Day 1-4 of the mini-phase completed in one session. Day 2 baseline: rule extractor 0% recall, 5 distinct miss classes. Day 4 gate cleared: LLM extractor (claude -p haiku, OAuth) hit 100% recall, 2.55 yield/interaction. Refactored from anthropic SDK to subprocess after "no API key" rule. First live triage: 51 candidates -> 16 promoted, 35 rejected. Active memories 20->36. p06-polisher went from 2 to 16 memories (firmware/telemetry architecture set). POST /memory now accepts status field. Test count 264->278.
2026-04-11 Claude claude/extractor-eval-loop @ 7d8d599 — Day 1+2 of the mini-phase. Froze a 64-interaction snapshot (scripts/eval_data/interactions_snapshot_2026-04-11.json) and labeled 20 by length-stratified random sample (5 positive, 15 zero; 7 total expected candidates). Built scripts/extractor_eval.py as a file-based eval runner. Day 2 baseline: rule extractor hit 0% yield / 0% recall / 0% precision on the labeled set; 5 false negatives across 5 distinct miss classes (recommendation_prose, architectural_change_summary, spec_update_announcement, layered_recommendation, alignment_assertion). This is the Day 4 hard-stop signal arriving two days early — a single rule expansion cannot close a 5-way miss, and widening rules blindly will collapse precision. The Day 4 decision gate is escalated to Antoine for ratification before Day 3 touches any extractor code. No extractor code on main has changed.
2026-04-11 Codex (ledger audit) fixed stale main_tip, retargeted R1 from the API surface to the live Claude Stop hook, and formalized the review write protocol so Claude can consume findings without rewriting them.
2026-04-11 Claude b3253f3..59331e5 (1 commit). Wired the DEV-LEDGER, added session protocol to AGENTS.md, created project-local CLAUDE.md, deleted stale codex/port-atocore-ops-client remote branch. No code changes, no redeploy needed.
2026-04-11 Claude c5bad99..b3253f3 (11 commits + 1 merge). Length-aware reinforcement, project memories in pack, query-relevance memory ranking, hyphenated-identifier tokenizer, retrieval eval harness seeded, off-host backup wired end-to-end, docs synced, codex integration-pass branch merged. Harness went 0->6/6 on live Dalidou.
2026-04-11 Codex (async review) identified 2 P1s against a stale checkout. R1 was fair (extraction not automated), R2 was outdated (project memories already landed on main). Delivered the 8-day execution plan now in Active Plan.
2026-04-06 Antoine created codex/atocore-integration-pass with the t420-openclaw/ workspace (merged 2026-04-11).
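The 2026-04-15 p06-tailscale entry turns on the difference between a word blacklist and a source-path check. A sketch of that difference, assuming the harness does a plain case-sensitive substring test over the rendered context (the real matching rule in scripts/retrieval_eval.py is not shown in this ledger). The sample context and its p06-polisher/ARCHITECTURE.md path are made up for illustration.

```python
def absent_violations(context: str, expect_absent: list[str]) -> list[str]:
    """Return every expect_absent string that appears in the retrieved context."""
    return [needle for needle in expect_absent if needle in context]


# Hypothetical p06 context: a genuine p06 chunk that legitimately names the mirror.
SAMPLE_CONTEXT = (
    "[Source: p06-polisher/ARCHITECTURE.md]\n"
    "The Polisher Suite is built for the GigaBIT M1 mirror.\n"
)

# Word blacklist: flags legitimate semantic overlap (the false positive).
word_check = absent_violations(SAMPLE_CONTEXT, ["GigaBIT"])

# Source-path check: flags only chunks that actually came from p04.
path_check = absent_violations(SAMPLE_CONTEXT, ["[Source: p04-gigabit/"])
```

Here word_check is non-empty while path_check is empty, which is the behavior the narrowed fixture relies on: the invariant is about where a chunk came from, not which words it contains.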

Working Rules

Claude builds; Codex audits. No parallel work on the same files.
Codex branches fork from main: git fetch origin && git checkout -b codex/<topic> origin/main.
P1 findings block further main commits until acknowledged in Open Review Findings.
Every session appends at least one Session Log line and bumps Orientation.
Trim Session Log and Recent Decisions to the last 20 at session end.
Docs in docs/ may overclaim or report stale status; the ledger is the single-file source of truth for "what is true right now."
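The trim rule can be sketched as follows, assuming every entry starts with a YYYY-MM-DD date as the Session Log and Recent Decisions entries do. trim_entries is a hypothetical helper; the sort is stable, so same-day entries keep their existing order.

```python
def trim_entries(entries: list[str], keep: int = 20) -> list[str]:
    """Return the `keep` most recent entries, newest first.

    Relies on the leading ISO date, which sorts correctly as plain text.
    """
    newest_first = sorted(entries, key=lambda entry: entry[:10], reverse=True)
    return newest_first[:keep]
```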

Quick Commands

# Check live state
ssh papa@dalidou "curl -s http://localhost:8100/health"

# Run the retrieval harness
python scripts/retrieval_eval.py            # human-readable
python scripts/retrieval_eval.py --json     # machine-readable

# Deploy a new main tip
git push origin main && ssh papa@dalidou "bash /srv/storage/atocore/app/deploy/dalidou/deploy.sh"

# Reflection-loop ops
python scripts/atocore_client.py batch-extract '' '' 200 false   # preview
python scripts/atocore_client.py batch-extract '' '' 200 true    # persist
python scripts/atocore_client.py triage

# Reproduce the ledger's test_count claim from a clean checkout
pip install -r requirements-dev.txt
pytest --collect-only -q | tail -1        # -> "N tests collected"
pytest -q                                 # -> "N passed"
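To compare the collected count against Orientation's test_count without reading it by eye, the summary line can be parsed. A sketch, assuming the plain "N tests collected in X.XXs" summary that pytest --collect-only -q prints; parse_collected is a hypothetical helper, and runs with deselected tests print a different summary that this does not handle.

```python
import re
from typing import Optional


def parse_collected(output: str) -> Optional[int]:
    """Extract N from the 'N tests collected' summary line, or None if absent."""
    for line in reversed(output.strip().splitlines()):
        match = re.search(r"(\d+) tests? collected", line)
        if match:
            return int(match.group(1))
    return None
```

Feed it the captured stdout of the collect-only command and compare the result with the ledger value, for example parse_collected(stdout) == 299.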
