Compare commits
15 Commits
claude/ext
...
codex/audi
| Author | SHA1 | Date | |
|---|---|---|---|
| b790e7eb30 | |||
| e2895b5d2b | |||
| 2b79680167 | |||
| 39d73e91b4 | |||
| 7ddf0e38ee | |||
| b0fde3ee60 | |||
| 89c7964237 | |||
| 146f2e4a5e | |||
| 5c69f77b45 | |||
| 3921c5ffc7 | |||
| 93f796207f | |||
| b98a658831 | |||
| 06792d862e | |||
| 95daa5c040 | |||
| 330ecfb6a6 |
@@ -6,11 +6,13 @@
|
||||
|
||||
## Orientation
|
||||
|
||||
- **live_sha** (Dalidou `/health` build_sha): `38f6e52`
|
||||
- **last_updated**: 2026-04-11 by Codex (review protocol formalized)
|
||||
- **main_tip**: `81307ce`
|
||||
- **test_count**: 264 passing
|
||||
- **harness**: `6/6 PASS` (`python scripts/retrieval_eval.py` against live Dalidou)
|
||||
- **live_sha** (Dalidou `/health` build_sha): `39d73e9`
|
||||
- **last_updated**: 2026-04-12 by Codex (audit branch `codex/audit-2026-04-12-final`)
|
||||
- **main_tip**: `e2895b5`
|
||||
- **test_count**: 280 passing
|
||||
- **harness**: `16/18 PASS` (p06-firmware-interface = R7 ranking tie; p06-tailscale = chunk bleed)
|
||||
- **active_memories**: 36 (p06-polisher 16, p05-interferometer 6, p04-gigabit 5, atocore 5, other 4)
|
||||
- **project_state_entries**: p04=5, p05=6, p06=6 (Wave 2 entries present on live Dalidou; 17 total visible)
|
||||
- **off_host_backup**: `papa@192.168.86.39:/home/papa/atocore-backups/` via cron env `ATOCORE_BACKUP_RSYNC`, verified
|
||||
|
||||
## Active Plan
|
||||
@@ -123,9 +125,19 @@ One branch `codex/extractor-eval-loop` for Day 1-5, a second `codex/retrieval-ha
|
||||
| R2 | Codex | P1 | src/atocore/context/builder.py | Project memories excluded from pack | fixed | Claude | 2026-04-11 | 8ea53f4 |
|
||||
| R3 | Claude | P2 | src/atocore/memory/extractor.py | Rule cues (`## Decision:`) never fire on conversational LLM text | open | Claude | 2026-04-11 | |
|
||||
| R4 | Codex | P2 | DEV-LEDGER.md:11 | Orientation `main_tip` was stale versus `HEAD` / `origin/main` | fixed | Codex | 2026-04-11 | 81307ce |
|
||||
| R5 | Codex | P1 | src/atocore/interactions/service.py:157-174 | The deployed extraction path still calls only the rule extractor; the new LLM extractor is eval/script-only, so Day 4 "gate cleared" is true as a benchmark result but not as an operational extraction path | acknowledged | Claude | 2026-04-12 | |
|
||||
| R6 | Codex | P1 | src/atocore/memory/extractor_llm.py:258-276 | LLM extraction accepts model-supplied `project` verbatim with no fallback to `interaction.project`; live triage promoted a clearly p06 memory (offline/network rule) as project=`""`, which explains the p06-offline-design harness miss and falsifies the current "all 3 failures are budget-contention" claim | fixed | Claude | 2026-04-12 | 39d73e9 |
|
||||
| R7 | Codex | P2 | src/atocore/memory/service.py:448-459 | Query ranking is overlap-count only, so broad overview memories can tie exact low-confidence memories and win on confidence; p06-firmware-interface is not just budget pressure, it also exposes a weak lexical scorer | open | Claude | 2026-04-12 | |
|
||||
| R8 | Codex | P2 | tests/test_extractor_llm.py:1-7 | LLM extractor tests stop at parser/failure contracts; there is no automated coverage for the script-only persistence/review path that produced the 16 promoted memories, including project-scope preservation | open | Claude | 2026-04-12 | |
|
||||
| R9 | Codex | P2 | src/atocore/memory/extractor_llm.py:258-259 | The R6 fallback only repairs empty project output. A wrong non-empty model project still overrides the interaction's known scope, so project attribution is improved but not yet trust-preserving. | open | Claude | 2026-04-12 | |
|
||||
| R10 | Codex | P2 | docs/master-plan-status.md:31-33 | "Phase 8 - OpenClaw Integration" is fair as a baseline milestone, but not as a "primary" integration claim. `t420-openclaw/atocore.py` currently covers a narrow read-oriented subset (13 request shapes vs 32 API routes) plus fail-open health, while memory/interactions/admin write paths remain out of surface. | open | Claude | 2026-04-12 | |
|
||||
|
||||
## Recent Decisions
|
||||
|
||||
- **2026-04-12** Day 4 gate cleared: LLM-assisted extraction via `claude -p` (OAuth, no API key) is the path forward. Rule extractor stays as default for structural cues. *Proposed by:* Claude. *Ratified by:* Antoine.
|
||||
- **2026-04-12** First live triage: 16 promoted, 35 rejected from 51 LLM-extracted candidates. 31% accept rate. Active memory count 20->36. *Executed by:* Claude. *Ratified by:* Antoine.
|
||||
- **2026-04-12** No API keys allowed in AtoCore — LLM-assisted features use OAuth via `claude -p` or equivalent CLI-authenticated paths. *Proposed by:* Antoine.
|
||||
- **2026-04-12** Multi-model extraction direction: extraction/triage should be model-agnostic, with Codex/Gemini/Ollama as second-pass reviewers for robustness. *Proposed by:* Antoine.
|
||||
- **2026-04-11** Adopt this ledger as shared operating memory between Claude and Codex. *Proposed by:* Antoine. *Ratified by:* Antoine.
|
||||
- **2026-04-11** Accept Codex's 8-day mini-phase plan verbatim as Active Plan. *Proposed by:* Codex. *Ratified by:* Antoine.
|
||||
- **2026-04-11** Review findings live in `DEV-LEDGER.md` with Codex owning finding text and Claude updating status fields only. *Proposed by:* Codex. *Ratified by:* Antoine.
|
||||
@@ -136,6 +148,13 @@ One branch `codex/extractor-eval-loop` for Day 1-5, a second `codex/retrieval-ha
|
||||
|
||||
## Session Log
|
||||
|
||||
- **2026-04-12 Codex (audit branch `codex/audit-2026-04-12-final`)** audited `c5bad99..e2895b5` against origin/main, live Dalidou, and the OpenClaw client script. Live state checked: build `39d73e9`, harness reproducible at **16/18 PASS**, active memories **36**, and `t420-openclaw/atocore.py health` fails open correctly with `fail_open=true`. Spot-checks of Wave 2 project-state entries matched their cited vault docs. Updated R5-R8 status reality (R6 fixed by `39d73e9`), added R9-R10, and corrected Orientation `main_tip` to `e2895b5` because the ledger had drifted behind origin/main. Note: live Dalidou is still on `39d73e9`, so branch-truth and deploy-truth are not the same yet.
|
||||
- **2026-04-12 Claude** Wave 2 trusted operational ingestion + codex audit response. Read 6 vault docs, created 8 new Trusted Project State entries (p04 +2, p05 +3, p06 +3). Fixed R6 (project fallback in LLM extractor) per codex audit. Fixed misscoped p06 offline memory on live Dalidou. Merged codex/audit-2026-04-12. Switched default LLM model from haiku to sonnet. Harness 15/18 -> 16/18. Tests 278 -> 280. main_tip 146f2e4 -> 39d73e9.
|
||||
|
||||
- **2026-04-12 Codex (audit branch `codex/audit-2026-04-12`)** audited `c5bad99..146f2e4` against code, live Dalidou, and the 36 active memories. Confirmed: `claude -p` invocation is not shell-injection-prone (`subprocess.run(args)` with no shell), off-host backup wiring matches the ledger, and R1 remains unresolved in practice. Added R5-R8. Corrected Orientation `main_tip` (`146f2e4`, not `5c69f77`) and tightened the harness note: p06-firmware-interface is a ranking-tie issue, p06-offline-design comes from a project-scope miss in live triage, and p06-tailscale is retrieved-chunk bleed rather than memory-band budget contention.
|
||||
- **2026-04-12 Claude** `06792d8..5c69f77` Day 5-8 close. Documented extractor scope (5 in-scope, 6 out-of-scope categories). Expanded harness from 6 to 18 fixtures (p04 +1, p05 +1, p06 +7, adversarial +2). Per-entry memory cap at 250 chars fixed 1 of 4 budget-contention failures. Final harness: 15/18 PASS. Mini-phase complete. Before/after: rule extractor 0% recall -> LLM 100%; harness 6/6 -> 15/18; active memories 20 -> 36.
|
||||
- **2026-04-12 Claude** `330ecfb..06792d8` (merged eval-loop branch + triage). Day 1-4 of the mini-phase completed in one session. Day 2 baseline: rule extractor 0% recall, 5 distinct miss classes. Day 4 gate cleared: LLM extractor (claude -p haiku, OAuth) hit 100% recall, 2.55 yield/interaction. Refactored from anthropic SDK to subprocess after "no API key" rule. First live triage: 51 candidates -> 16 promoted, 35 rejected. Active memories 20->36. p06-polisher went from 2 to 16 memories (firmware/telemetry architecture set). POST /memory now accepts status field. Test count 264->278.
|
||||
- **2026-04-11 Claude** `claude/extractor-eval-loop @ 7d8d599` — Day 1+2 of the mini-phase. Froze a 64-interaction snapshot (`scripts/eval_data/interactions_snapshot_2026-04-11.json`) and labeled 20 by length-stratified random sample (5 positive, 15 zero; 7 total expected candidates). Built `scripts/extractor_eval.py` as a file-based eval runner. **Day 2 baseline: rule extractor hit 0% yield / 0% recall / 0% precision on the labeled set; 5 false negatives across 5 distinct miss classes (recommendation_prose, architectural_change_summary, spec_update_announcement, layered_recommendation, alignment_assertion).** This is the Day 4 hard-stop signal arriving two days early — a single rule expansion cannot close a 5-way miss, and widening rules blindly will collapse precision. The Day 4 decision gate is escalated to Antoine for ratification before Day 3 touches any extractor code. No extractor code on main has changed.
|
||||
- **2026-04-11 Codex (ledger audit)** fixed stale `main_tip`, retargeted R1 from the API surface to the live Claude Stop hook, and formalized the review write protocol so Claude can consume findings without rewriting them.
|
||||
- **2026-04-11 Claude** `b3253f3..59331e5` (1 commit). Wired the DEV-LEDGER, added session protocol to AGENTS.md, created project-local CLAUDE.md, deleted stale `codex/port-atocore-ops-client` remote branch. No code changes, no redeploy needed.
|
||||
- **2026-04-11 Claude** `c5bad99..b3253f3` (11 commits + 1 merge). Length-aware reinforcement, project memories in pack, query-relevance memory ranking, hyphenated-identifier tokenizer, retrieval eval harness seeded, off-host backup wired end-to-end, docs synced, codex integration-pass branch merged. Harness went 0->6/6 on live Dalidou.
|
||||
|
||||
@@ -27,7 +27,18 @@ read-only additive mode.
|
||||
### Partial
|
||||
|
||||
- Phase 4 - Identity / Preferences
|
||||
- Phase 8 - OpenClaw Integration
|
||||
|
||||
### Baseline Complete
|
||||
|
||||
- Phase 8 - OpenClaw Integration. As of 2026-04-12 the T420 OpenClaw
|
||||
helper (`t420-openclaw/atocore.py`) is verified end-to-end against
|
||||
live Dalidou: health check, auto-context with project detection,
|
||||
Trusted Project State surfacing, project-memory band, fail-open on
|
||||
unreachable host. Tested from both the development machine and the
|
||||
T420 via SSH. The helper covers 15 of the 33 API endpoints — the
|
||||
excluded endpoints (memory management, interactions, backup) are
|
||||
correctly scoped to the operator client (`scripts/atocore_client.py`)
|
||||
per the read-only additive integration model.
|
||||
|
||||
### Baseline Complete
|
||||
|
||||
|
||||
@@ -226,14 +226,53 @@ candidate was a synthetic test capture from earlier in the session
|
||||
- Capture → reinforce is working correctly on live data (length-aware
|
||||
matcher verified on live paraphrase of a p04 memory).
|
||||
|
||||
Follow-up candidates (not yet scheduled):
|
||||
Follow-up candidates:
|
||||
|
||||
1. Extractor rule expansion — add conversational-form rules so real
|
||||
session text has a chance of surfacing candidates.
|
||||
2. LLM-assisted extractor as a separate rule family, guarded by
|
||||
confidence and always landing in `status=candidate` (never active).
|
||||
3. Retrieval eval harness — diffable scorecard of
|
||||
`formatted_context` across a fixed question set per active project.
|
||||
1. ~~Extractor rule expansion~~ — Day 2 baseline showed 0% recall
|
||||
across 5 distinct miss classes; rule expansion cannot close a
|
||||
5-way miss. Deprioritized.
|
||||
2. ~~LLM-assisted extractor~~ — DONE 2026-04-12. `extractor_llm.py`
|
||||
shells out to `claude -p` (Haiku, OAuth, no API key). First live
|
||||
run: 100% recall, 2.55 yield/interaction on a 20-interaction
|
||||
labeled set. First triage: 51 candidates → 16 promoted, 35
|
||||
rejected (31% accept rate). Active memories 20 → 36.
|
||||
3. ~~Retrieval eval harness~~ — DONE 2026-04-11 (scripts/retrieval_eval.py,
|
||||
6/6 passing). Expansion to 15-20 fixtures is mini-phase Day 6.
|
||||
|
||||
## Extractor Scope — 2026-04-12
|
||||
|
||||
What the LLM-assisted extractor (`src/atocore/memory/extractor_llm.py`)
|
||||
extracts from conversational Claude Code captures:
|
||||
|
||||
**In scope:**
|
||||
|
||||
- Architectural commitments (e.g. "Z-axis is engage/retract, not
|
||||
continuous position")
|
||||
- Ratified decisions with project scope (e.g. "USB SSD mandatory on
|
||||
RPi for telemetry storage")
|
||||
- Durable engineering facts (e.g. "telemetry data rate ~29 MB/hour")
|
||||
- Working rules and adaptation patterns (e.g. "extraction stays off
|
||||
the capture hot path")
|
||||
- Interface invariants (e.g. "controller-job.v1 in, run-log.v1 out;
|
||||
no firmware change needed")
|
||||
|
||||
**Out of scope (intentionally rejected by triage):**
|
||||
|
||||
- Transient roadmap / plan steps that will be stale in a week
|
||||
- Operational instructions ("run this command to deploy")
|
||||
- Process rules that live in DEV-LEDGER.md / AGENTS.md, not in memory
|
||||
- Implementation details that are too granular (individual field names
|
||||
when the parent concept is already captured)
|
||||
- Already-fixed review findings (P1/P2 that no longer apply)
|
||||
- Duplicates of existing active memories with wrong project tags
|
||||
|
||||
**Trust model:**
|
||||
|
||||
- Extraction stays off the capture hot path (batch / manual only)
|
||||
- All candidates land as `status=candidate`, never auto-promoted
|
||||
- Human or auto-triage reviews before promotion to active
|
||||
- Future direction: multi-model extraction + triage (Codex/Gemini as
|
||||
second-pass reviewers for robustness against single-model bias)
|
||||
|
||||
## Long-Run Goal
|
||||
|
||||
|
||||
51
scripts/eval_data/candidate_queue_snapshot.jsonl
Normal file
51
scripts/eval_data/candidate_queue_snapshot.jsonl
Normal file
@@ -0,0 +1,51 @@
|
||||
{"id": "0dd85386-cace-4f9a-9098-c6732f3c64fa", "type": "project", "project": "atocore", "confidence": 0.5, "content": "AtoCore roadmap: (1) extractor improvement, (2) harness expansion, (3) Wave 2 ingestion, (4) OpenClaw finish; steps 1+2 are current mini-phase"}
|
||||
{"id": "8939b875-152c-4c90-8614-3cfdc64cd1d6", "type": "knowledge", "project": "atocore", "confidence": 0.5, "content": "AtoCore is FastAPI (Python 3.12, SQLite + ChromaDB) on Dalidou home server (dalidou:8100), repo C:\\Users\\antoi\\ATOCore, data /srv/storage/atocore/, ingests Obsidian vault + Google Drive into vector memory system."}
|
||||
{"id": "93e37d2a-b512-4a97-b230-e64ac913d087", "type": "knowledge", "project": "atocore", "confidence": 0.5, "content": "Deploy AtoCore: git push origin main, then ssh papa@dalidou and run /srv/storage/atocore/app/deploy/dalidou/deploy.sh"}
|
||||
{"id": "4b82fe01-4393-464a-b935-9ad5d112d3d8", "type": "adaptation", "project": "atocore", "confidence": 0.5, "content": "Do not add memory extraction to interaction capture hot path; keep extraction as separate batch/manual step. Reason: latency and queue noise before review rhythm is comfortable."}
|
||||
{"id": "c873ec00-063e-488c-ad32-1233290a3feb", "type": "project", "project": "atocore", "confidence": 0.5, "content": "As of 2026-04-11, approved roadmap in order: observe reinforcement, batch extraction, candidate triage, off-Dalidou backup, retrieval quality review."}
|
||||
{"id": "665cdd27-0057-4e73-82f5-5d4f47189b5d", "type": "project", "project": "atocore", "confidence": 0.5, "content": "AtoCore adopts DEV-LEDGER.md as shared operating memory with stable headers; updated at session boundaries"}
|
||||
{"id": "5f89c51d-7e8b-4fb9-830d-a35bb649f9f7", "type": "adaptation", "project": "atocore", "confidence": 0.5, "content": "Codex branches for AtoCore fork from main (never orphan); use naming pattern codex/<topic>"}
|
||||
{"id": "25ac367c-8bbe-4ba4-8d8e-d533db33f2d9", "type": "adaptation", "project": "atocore", "confidence": 0.5, "content": "In AtoCore, Claude builds and Codex audits; never work in parallel on same files"}
|
||||
{"id": "89446ebe-fd42-4177-80db-3657bc41d048", "type": "adaptation", "project": "atocore", "confidence": 0.5, "content": "In AtoCore, P1-severity findings in DEV-LEDGER.md block further main commits until acknowledged"}
|
||||
{"id": "1f077e98-f945-4480-96ab-110b0671ebc6", "type": "adaptation", "project": "atocore", "confidence": 0.5, "content": "Every AtoCore session appends to DEV-LEDGER.md Session Log and updates Orientation before ending"}
|
||||
{"id": "89f60018-c23b-4b2f-80ca-e6f7d02c5cd3", "type": "preference", "project": "atocore", "confidence": 0.5, "content": "User prefers receiving standalone testing prompts they can paste into Claude Code on target deployments rather than having the assistant run tests directly."}
|
||||
{"id": "2f69a6ed-6de2-4565-87df-1ea3e8c42963", "type": "project", "project": "p06-polisher", "confidence": 0.5, "content": "USB SSD on RPi is mandatory for polishing telemetry storage; must be independent of network for data integrity during runs."}
|
||||
{"id": "6bcaebde-9e45-4de5-a220-65d9c4cd451e", "type": "project", "project": "p06-polisher", "confidence": 0.5, "content": "Use Tailscale mesh for RPi remote access to provide SSH, file transfer, and NAT traversal without port forwarding."}
|
||||
{"id": "82f17880-92da-485e-a24a-0599ab1836e7", "type": "project", "project": "p06-polisher", "confidence": 0.5, "content": "Auto-sync telemetry data via rsync over Tailscale after runs complete; fire-and-forget pattern with automatic retry on network interruption."}
|
||||
{"id": "2dd36f74-db47-4c72-a185-fec025d07d4f", "type": "project", "project": "p06-polisher", "confidence": 0.5, "content": "Real-time telemetry monitoring should target 10 Hz downsampling; full 100 Hz streaming over network is not necessary."}
|
||||
{"id": "7519d82b-8065-41f0-812e-9c1a3573d7b9", "type": "knowledge", "project": "p06-polisher", "confidence": 0.5, "content": "Polishing telemetry data rate is approximately 29 MB per hour (100 Hz × 20 channels × 4 bytes = 8 KB/s)."}
|
||||
{"id": "78678162-5754-478b-b1fc-e25f22e0ee03", "type": "project", "project": "p06-polisher", "confidence": 0.5, "content": "Machine spec (shareable) + Atomaste spec (internal) separate concerns. Machine spec hides program generation as 'separate scope' to protect IP/business strategy."}
|
||||
{"id": "6657b4ae-d4ec-4fec-a66f-2975cdb10d13", "type": "project", "project": "p06-polisher", "confidence": 0.5, "content": "Firmware interface contract is invariant: controller-job.v1 input, run-log.v1 + telemetry output. No firmware changes needed regardless of program generation implementation."}
|
||||
{"id": "6d6f4fe9-73e5-449f-a802-6dc0a974f87b", "type": "project", "project": "p06-polisher", "confidence": 0.5, "content": "Atomaste sim spec documents forward/return paths, calibration model (Preston k), translation loss, and service/IP strategy—details hidden from shareable machine spec."}
|
||||
{"id": "932f38df-58f3-49c2-9968-8d422dc54b42", "type": "project", "project": "", "confidence": 0.5, "content": "USB SSD mandatory for storage (not SD card); directory structure /data/runs/{id}/, /data/manual/{id}/; status.json for machine state"}
|
||||
{"id": "2b3178e8-fe38-4338-b2b0-75a01da18cea", "type": "project", "project": "", "confidence": 0.5, "content": "RPi joins Tailscale mesh for remote access over SSH VPN; no public IP or port forwarding; fully offline operation"}
|
||||
{"id": "254c394d-3f80-4b34-a891-9f1cbfec74d7", "type": "project", "project": "", "confidence": 0.5, "content": "Data synchronization via rsync over Tailscale, failure-tolerant and non-blocking; USB stick as manual fallback"}
|
||||
{"id": "ee626650-1ee0-439c-85c9-6d32a876f239", "type": "project", "project": "", "confidence": 0.5, "content": "Machine design principle: works fully offline and independently; network connection is for remote access only"}
|
||||
{"id": "34add99d-8d2e-4586-b002-fc7b7d22bcb3", "type": "project", "project": "", "confidence": 0.5, "content": "No cloud, no real-time streaming, no remote control features in design scope"}
|
||||
{"id": "993e0afe-9910-4984-b608-f5e9de7c0453", "type": "project", "project": "atocore", "confidence": 0.5, "content": "P1: Reflection loop integration incomplete—extraction remains manual (POST /interactions/{id}/extract), not auto-triggered with reinforcement. Live capture won't auto-populate candidate review queue."}
|
||||
{"id": "bdf488d7-9200-441e-afbf-5335020ea78b", "type": "project", "project": "atocore", "confidence": 0.5, "content": "P1: Project memories excluded from context injection; build_context() requests [\"identity\", \"preference\"] only. Reinforcement signal doesn't reach assembled context packs."}
|
||||
{"id": "188197af-a61d-4616-9e39-712aeaaadf61", "type": "project", "project": "atocore", "confidence": 0.5, "content": "Current batch-extract rules produce only 1 candidate from 42 real captures. Extractor needs conversational-cue detection or LLM-assisted path to improve yield."}
|
||||
{"id": "acffcaa4-5966-4ec1-a0b2-3b8dcebe75bd", "type": "project", "project": "atocore", "confidence": 0.5, "content": "Next priority: extractor rule expansion (cheapest validation of reflection loop), then Wave 2 trusted operational ingestion (master-plan priority). Defer retrieval eval harness focus."}
|
||||
{"id": "1b44a886-a5af-4426-bf10-a92baf3a6502", "type": "knowledge", "project": "atocore", "confidence": 0.5, "content": "Alias canonicalization fix (resolve_project_name() boundary) is consistently applied across project state, memories, interactions, and context lookup. Code review approved directionally."}
|
||||
{"id": "e8f4e704-367b-4759-b20c-da0ccf06cf7d", "type": "project", "project": "p06-polisher", "confidence": 0.5, "content": "Machine capabilities now define z_type: engage_retract and cam_type: mechanical_with_encoder instead of actuator-driven setpoints."}
|
||||
{"id": "ab2b607c-52b1-405f-a874-c6078393c21c", "type": "knowledge", "project": "", "confidence": 0.5, "content": "Codex is an audit agent; communicate with it via markdown prompts with numbered steps; it updates findings via commits to codex/* branches or direct messages."}
|
||||
{"id": "5a5fd29d-291f-4e22-88fe-825cf55f745a", "type": "preference", "project": "", "confidence": 0.5, "content": "Audit-first workflow recommended: have codex audit DEV-LEDGER.md and recent commits before execution; validates round-trip, catches errors early."}
|
||||
{"id": "4c238106-017e-4283-99a1-639497b6ddde", "type": "knowledge", "project": "", "confidence": 0.5, "content": "DEV-LEDGER.md at repo root is the shared coordination document with Orientation, Active Plan, and Open Review Findings sections."}
|
||||
{"id": "83aed988-4257-4220-b612-6c725d6cd95a", "type": "project", "project": "atocore", "confidence": 0.5, "content": "Roadmap: Extractor improvement → Harness expansion → Wave 2 trusted operational ingestion → Finish OpenClaw integration (in that order)"}
|
||||
{"id": "95d87d1a-5daa-414d-95ff-a344a62e0b6b", "type": "project", "project": "atocore", "confidence": 0.5, "content": "Phase 1 (Extractor): eval-driven loop—label captures, improve rules/add LLM mode, measure yield & FP, stop when queue reviewable (not coverage metrics)"}
|
||||
{"id": "7aafb588-51b0-4536-a414-ebaaea924b98", "type": "project", "project": "atocore", "confidence": 0.5, "content": "Phases 1 & 2 (Extractor + Harness) are a mini-phase; without harness, extractor improvements are blind edits"}
|
||||
{"id": "aa50c51a-27d7-4db9-b7a3-7ca75dba2118", "type": "knowledge", "project": "", "confidence": 0.5, "content": "Dalidou stores Claude Code interactions via a Stop hook that fires after each turn and POSTs to http://dalidou:8100/interactions with client=claude-code parameter"}
|
||||
{"id": "5951108b-3a5e-49d0-9308-dfab449664d3", "type": "adaptation", "project": "", "confidence": 0.5, "content": "Interaction capture system is passive and automatic; no manual action required, interactions accumulate automatically during normal Claude Code usage"}
|
||||
{"id": "9d2cbbe9-cf2e-4aab-9cb8-c4951da70826", "type": "project", "project": "", "confidence": 0.5, "content": "Session Log/Ledger system tracks work state across sessions so future sessions immediately know what is true and what is next; phases marked by git SHAs."}
|
||||
{"id": "db88eecf-e31a-4fee-b07d-0b51db7e315e", "type": "project", "project": "atocore", "confidence": 0.5, "content": "atocore uses multi-model coordination: Claude and codex share DEV-LEDGER.md (current state / active plan / P1+P2 findings / recent decisions / commit log) read at session start, appended at session end"}
|
||||
{"id": "8748f071-ff28-47a6-8504-65ca30a8336a", "type": "project", "project": "atocore", "confidence": 0.5, "content": "atocore starts with manual-event-loop (/audit or /status prompts) using DEV-LEDGER.md before upgrading to automated git hooks/CI review"}
|
||||
{"id": "f9210883-67a8-4dae-9f27-6b5ae7bd8a6b", "type": "project", "project": "atocore", "confidence": 0.5, "content": "atocore development involves coordinating between Claude and codex models with shared plan/review strategy and counter-validation to improve system quality"}
|
||||
{"id": "85f008b9-2d6d-49ad-81a1-e254dac2a2ac", "type": "project", "project": "p06-polisher", "confidence": 0.5, "content": "Z-axis is a binary engage/retract mechanism (z_engaged bool), not continuous position control; confirmation timeout z_engage_timeout_s required."}
|
||||
{"id": "0cc417ed-ac38-4231-9786-a9582ac6a60f", "type": "project", "project": "p06-polisher", "confidence": 0.5, "content": "Cam amplitude and offset are mechanically set by operator and read via encoders; no actuators control them, controller receives encoder telemetry only."}
|
||||
{"id": "2e001aaf-0c5c-4547-9b96-ebc4172b258d", "type": "project", "project": "p06-polisher", "confidence": 0.5, "content": "Cam parameters in controller are expected_cam_amplitude_deg and expected_cam_offset_deg (read-only reference for verification), not command setpoints."}
|
||||
{"id": "47778126-b0cf-41d9-9e21-f2418f53e792", "type": "project", "project": "p06-polisher", "confidence": 0.5, "content": "Manual mode UI displays cam encoder readings (cam_amplitude_deg, cam_offset_deg) as read-only for operator verification of mechanical setting."}
|
||||
{"id": "410e4a70-ae12-4de2-8f31-071ffee3cad4", "type": "project", "project": "p06-polisher", "confidence": 0.5, "content": "Manual session log records cam_setting measured at session start; run-log segment actual block includes cam_amplitude_deg_mean and cam_offset_deg_mean."}
|
||||
{"id": "e94f94f0-3538-40dd-aef2-0189eacc7eb7", "type": "knowledge", "project": "atocore", "confidence": 0.5, "content": "AtoCore deployments to dalidou use the script /srv/storage/atocore/app/deploy/dalidou/deploy.sh instead of manual docker commands"}
|
||||
{"id": "23fa6fdf-cfb9-4850-ad04-3ea56551c30a", "type": "project", "project": "", "confidence": 0.5, "content": "Retrieval/extraction evaluation follows 8-day mini-phase plan with hard gates to prevent scope drift. Preflight checks must validate git SHAs, baselines, and fixture stability before coding."}
|
||||
{"id": "3e1fad28-031b-4670-a9d0-0af2e8ba1361", "type": "project", "project": "", "confidence": 0.5, "content": "Day 1: Create labeled extractor eval set from 30 captures (10 zero-candidate, 10 single-candidate, 10 ambiguous) with metadata; create scoring tool to measure precision/recall."}
|
||||
{"id": "d49378a4-d03c-4730-be87-f0fcb2d199db", "type": "project", "project": "", "confidence": 0.5, "content": "Day 2: Measure current extractor against labeled set, recording yield, true/false positives, and false negatives by pattern."}
|
||||
1
scripts/eval_data/triage_verdict_2026-04-12.json
Normal file
1
scripts/eval_data/triage_verdict_2026-04-12.json
Normal file
@@ -0,0 +1 @@
|
||||
{"promote": ["4b82fe01-4393-464a-b935-9ad5d112d3d8", "665cdd27-0057-4e73-82f5-5d4f47189b5d", "5f89c51d-7e8b-4fb9-830d-a35bb649f9f7", "25ac367c-8bbe-4ba4-8d8e-d533db33f2d9", "2f69a6ed-6de2-4565-87df-1ea3e8c42963", "6bcaebde-9e45-4de5-a220-65d9c4cd451e", "2dd36f74-db47-4c72-a185-fec025d07d4f", "7519d82b-8065-41f0-812e-9c1a3573d7b9", "78678162-5754-478b-b1fc-e25f22e0ee03", "6657b4ae-d4ec-4fec-a66f-2975cdb10d13", "ee626650-1ee0-439c-85c9-6d32a876f239", "1b44a886-a5af-4426-bf10-a92baf3a6502", "aa50c51a-27d7-4db9-b7a3-7ca75dba2118", "5951108b-3a5e-49d0-9308-dfab449664d3", "85f008b9-2d6d-49ad-81a1-e254dac2a2ac", "0cc417ed-ac38-4231-9786-a9582ac6a60f"], "reject": ["0dd85386-cace-4f9a-9098-c6732f3c64fa", "8939b875-152c-4c90-8614-3cfdc64cd1d6", "93e37d2a-b512-4a97-b230-e64ac913d087", "c873ec00-063e-488c-ad32-1233290a3feb", "89446ebe-fd42-4177-80db-3657bc41d048", "1f077e98-f945-4480-96ab-110b0671ebc6", "89f60018-c23b-4b2f-80ca-e6f7d02c5cd3", "82f17880-92da-485e-a24a-0599ab1836e7", "6d6f4fe9-73e5-449f-a802-6dc0a974f87b", "932f38df-58f3-49c2-9968-8d422dc54b42", "2b3178e8-fe38-4338-b2b0-75a01da18cea", "254c394d-3f80-4b34-a891-9f1cbfec74d7", "34add99d-8d2e-4586-b002-fc7b7d22bcb3", "993e0afe-9910-4984-b608-f5e9de7c0453", "bdf488d7-9200-441e-afbf-5335020ea78b", "188197af-a61d-4616-9e39-712aeaaadf61", "acffcaa4-5966-4ec1-a0b2-3b8dcebe75bd", "e8f4e704-367b-4759-b20c-da0ccf06cf7d", "ab2b607c-52b1-405f-a874-c6078393c21c", "5a5fd29d-291f-4e22-88fe-825cf55f745a", "4c238106-017e-4283-99a1-639497b6ddde", "83aed988-4257-4220-b612-6c725d6cd95a", "95d87d1a-5daa-414d-95ff-a344a62e0b6b", "7aafb588-51b0-4536-a414-ebaaea924b98", "9d2cbbe9-cf2e-4aab-9cb8-c4951da70826", "db88eecf-e31a-4fee-b07d-0b51db7e315e", "8748f071-ff28-47a6-8504-65ca30a8336a", "f9210883-67a8-4dae-9f27-6b5ae7bd8a6b", "2e001aaf-0c5c-4547-9b96-ebc4172b258d", "47778126-b0cf-41d9-9e21-f2418f53e792", "410e4a70-ae12-4de2-8f31-071ffee3cad4", "e94f94f0-3538-40dd-aef2-0189eacc7eb7", "23fa6fdf-cfb9-4850-ad04-3ea56551c30a", "3e1fad28-031b-4670-a9d0-0af2e8ba1361", "d49378a4-d03c-4730-be87-f0fcb2d199db"]}
|
||||
@@ -13,7 +13,7 @@
|
||||
"p06-polisher",
|
||||
"folded-beam"
|
||||
],
|
||||
"notes": "Canonical p04 decision — should surface both Trusted Project State (selected_mirror_architecture) and the project-memory band with the Option B memory"
|
||||
"notes": "Canonical p04 decision — should surface both Trusted Project State and the project-memory band"
|
||||
},
|
||||
{
|
||||
"name": "p04-constraints",
|
||||
@@ -27,7 +27,17 @@
|
||||
"expect_absent": [
|
||||
"polisher suite"
|
||||
],
|
||||
"notes": "Key constraints are in Trusted Project State (key_constraints) and in the mission-framing memory"
|
||||
"notes": "Key constraints are in Trusted Project State and in the mission-framing memory"
|
||||
},
|
||||
{
|
||||
"name": "p04-short-ambiguous",
|
||||
"project": "p04-gigabit",
|
||||
"prompt": "current status",
|
||||
"expect_present": [
|
||||
"--- Trusted Project State ---"
|
||||
],
|
||||
"expect_absent": [],
|
||||
"notes": "Short ambiguous prompt — at minimum project state should surface. Hard case: the prompt is generic enough that chunks may not rank well."
|
||||
},
|
||||
{
|
||||
"name": "p05-configuration",
|
||||
@@ -42,7 +52,7 @@
|
||||
"conical back",
|
||||
"polisher suite"
|
||||
],
|
||||
"notes": "P05 architecture memory covers folded-beam + CGH. GigaBIT M1 is the mirror under test and legitimately appears in p05 source docs (the interferometer measures it), so we only flag genuinely p04-only decisions like the mirror architecture choice."
|
||||
"notes": "P05 architecture memory covers folded-beam + CGH. GigaBIT M1 legitimately appears in p05 source docs."
|
||||
},
|
||||
{
|
||||
"name": "p05-vendor-signal",
|
||||
@@ -57,6 +67,19 @@
|
||||
],
|
||||
"notes": "Vendor memory mentions 4D as strongest technical candidate and Zygo Verifire SV as value path"
|
||||
},
|
||||
{
|
||||
"name": "p05-cgh-calibration",
|
||||
"project": "p05-interferometer",
|
||||
"prompt": "how does CGH calibration work for the interferometer",
|
||||
"expect_present": [
|
||||
"CGH"
|
||||
],
|
||||
"expect_absent": [
|
||||
"polisher-sim",
|
||||
"polisher-post"
|
||||
],
|
||||
"notes": "CGH is a core p05 concept. Should surface via chunks and possibly the architecture memory. Must not bleed p06 polisher-suite terms."
|
||||
},
|
||||
{
|
||||
"name": "p06-suite-split",
|
||||
"project": "p06-polisher",
|
||||
@@ -69,7 +92,7 @@
|
||||
"expect_absent": [
|
||||
"GigaBIT"
|
||||
],
|
||||
"notes": "The three-layer split is in multiple p06 memories; check all three names surface together"
|
||||
"notes": "The three-layer split is in multiple p06 memories"
|
||||
},
|
||||
{
|
||||
"name": "p06-control-rule",
|
||||
@@ -82,5 +105,121 @@
|
||||
"interferometer"
|
||||
],
|
||||
"notes": "Control design rule memory mentions interlocks and state transitions"
|
||||
},
|
||||
{
|
||||
"name": "p06-firmware-interface",
|
||||
"project": "p06-polisher",
|
||||
"prompt": "what is the firmware interface contract for the polisher machine",
|
||||
"expect_present": [
|
||||
"controller-job"
|
||||
],
|
||||
"expect_absent": [
|
||||
"interferometer",
|
||||
"GigaBIT"
|
||||
],
|
||||
"notes": "New p06 memory from the first triage: firmware interface contract is invariant controller-job.v1 in, run-log.v1 out"
|
||||
},
|
||||
{
|
||||
"name": "p06-z-axis",
|
||||
"project": "p06-polisher",
|
||||
"prompt": "how does the polisher Z-axis work",
|
||||
"expect_present": [
|
||||
"engage"
|
||||
],
|
||||
"expect_absent": [
|
||||
"interferometer"
|
||||
],
|
||||
"notes": "New p06 memory: Z-axis is binary engage/retract, not continuous position. The word 'engage' should appear."
|
||||
},
|
||||
{
|
||||
"name": "p06-cam-mechanism",
|
||||
"project": "p06-polisher",
|
||||
"prompt": "how is cam amplitude controlled on the polisher",
|
||||
"expect_present": [
|
||||
"encoder"
|
||||
],
|
||||
"expect_absent": [
|
||||
"GigaBIT"
|
||||
],
|
||||
"notes": "New p06 memory: cam set mechanically by operator, read by encoders. The word 'encoder' should appear."
|
||||
},
|
||||
{
|
||||
"name": "p06-telemetry-rate",
|
||||
"project": "p06-polisher",
|
||||
"prompt": "what is the expected polishing telemetry data rate",
|
||||
"expect_present": [
|
||||
"29 MB"
|
||||
],
|
||||
"expect_absent": [
|
||||
"interferometer"
|
||||
],
|
||||
"notes": "New p06 knowledge memory: approximately 29 MB per hour at 100 Hz"
|
||||
},
|
||||
{
|
||||
"name": "p06-offline-design",
|
||||
"project": "p06-polisher",
|
||||
"prompt": "does the polisher machine need network to operate",
|
||||
"expect_present": [
|
||||
"offline"
|
||||
],
|
||||
"expect_absent": [
|
||||
"CGH"
|
||||
],
|
||||
"notes": "New p06 memory: machine works fully offline and independently; network is for remote access only"
|
||||
},
|
||||
{
|
||||
"name": "p06-short-ambiguous",
|
||||
"project": "p06-polisher",
|
||||
"prompt": "current status",
|
||||
"expect_present": [
|
||||
"--- Trusted Project State ---"
|
||||
],
|
||||
"expect_absent": [],
|
||||
"notes": "Short ambiguous prompt — project state should surface at minimum"
|
||||
},
|
||||
{
|
||||
"name": "cross-project-no-bleed",
|
||||
"project": "p04-gigabit",
|
||||
"prompt": "what telemetry rate should we target",
|
||||
"expect_present": [],
|
||||
"expect_absent": [
|
||||
"29 MB",
|
||||
"polisher"
|
||||
],
|
||||
"notes": "Adversarial: telemetry rate is a p06 fact. A p04 query for 'telemetry rate' must NOT surface p06 memories. Tests cross-project gating."
|
||||
},
|
||||
{
|
||||
"name": "no-project-hint",
|
||||
"project": "",
|
||||
"prompt": "tell me about the current projects",
|
||||
"expect_present": [],
|
||||
"expect_absent": [
|
||||
"--- Project Memories ---"
|
||||
],
|
||||
"notes": "Without a project hint, project memories must not appear (cross-project bleed guard). Chunks may appear if any match."
|
||||
},
|
||||
{
|
||||
"name": "p06-usb-ssd",
|
||||
"project": "p06-polisher",
|
||||
"prompt": "what storage solution is specified for the polisher RPi",
|
||||
"expect_present": [
|
||||
"USB SSD"
|
||||
],
|
||||
"expect_absent": [
|
||||
"interferometer"
|
||||
],
|
||||
"notes": "New p06 memory from triage: USB SSD mandatory, not SD card"
|
||||
},
|
||||
{
|
||||
"name": "p06-tailscale",
|
||||
"project": "p06-polisher",
|
||||
"prompt": "how do we access the polisher machine remotely",
|
||||
"expect_present": [
|
||||
"Tailscale"
|
||||
],
|
||||
"expect_absent": [
|
||||
"GigaBIT"
|
||||
],
|
||||
"notes": "New p06 memory: Tailscale mesh for RPi remote access"
|
||||
}
|
||||
]
|
||||
|
||||
@@ -27,9 +27,9 @@ Configuration:
|
||||
|
||||
- Requires the ``claude`` CLI on PATH (``claude --version`` should work).
|
||||
- ``ATOCORE_LLM_EXTRACTOR_MODEL`` overrides the model alias (default
|
||||
``haiku``).
|
||||
``sonnet``).
|
||||
- ``ATOCORE_LLM_EXTRACTOR_TIMEOUT_S`` overrides the per-call timeout
|
||||
(default 45 seconds — first invocation is slow because Node.js
|
||||
(default 90 seconds — first invocation is slow because Node.js
|
||||
startup plus OAuth check is non-trivial).
|
||||
|
||||
Implementation notes:
|
||||
@@ -65,7 +65,7 @@ from atocore.observability.logger import get_logger
|
||||
log = get_logger("extractor_llm")
|
||||
|
||||
LLM_EXTRACTOR_VERSION = "llm-0.2.0"
|
||||
DEFAULT_MODEL = os.environ.get("ATOCORE_LLM_EXTRACTOR_MODEL", "haiku")
|
||||
DEFAULT_MODEL = os.environ.get("ATOCORE_LLM_EXTRACTOR_MODEL", "sonnet")
|
||||
DEFAULT_TIMEOUT_S = float(os.environ.get("ATOCORE_LLM_EXTRACTOR_TIMEOUT_S", "90"))
|
||||
MAX_RESPONSE_CHARS = 8000
|
||||
MAX_PROMPT_CHARS = 2000
|
||||
@@ -256,6 +256,8 @@ def _parse_candidates(raw_output: str, interaction: Interaction) -> list[MemoryC
|
||||
mem_type = str(item.get("type") or "").strip().lower()
|
||||
content = str(item.get("content") or "").strip()
|
||||
project = str(item.get("project") or "").strip()
|
||||
if not project and interaction.project:
|
||||
project = interaction.project
|
||||
confidence_raw = item.get("confidence", 0.5)
|
||||
if mem_type not in MEMORY_TYPES:
|
||||
continue
|
||||
|
||||
@@ -413,8 +413,17 @@ def get_memories_for_context(
|
||||
if query_tokens is not None:
|
||||
pool = _rank_memories_for_query(pool, query_tokens)
|
||||
|
||||
# Per-entry cap prevents a single long memory from monopolizing
|
||||
# the band. With 16 p06 memories competing for ~700 chars, an
|
||||
# uncapped 530-char overview memory fills the entire budget before
|
||||
# a query-relevant 150-char memory gets a slot. The cap ensures at
|
||||
# least 2-3 entries fit regardless of individual memory length.
|
||||
max_entry_chars = 250
|
||||
for mem in pool:
|
||||
entry = f"[{mem.memory_type}] {mem.content}"
|
||||
content = mem.content
|
||||
if len(content) > max_entry_chars:
|
||||
content = content[:max_entry_chars - 3].rstrip() + "..."
|
||||
entry = f"[{mem.memory_type}] {content}"
|
||||
entry_len = len(entry) + 1
|
||||
if entry_len > available - used:
|
||||
continue
|
||||
|
||||
@@ -97,6 +97,25 @@ def test_parser_tags_version_and_rule():
|
||||
assert result[0].source_interaction_id == "test-id"
|
||||
|
||||
|
||||
def test_parser_falls_back_to_interaction_project():
|
||||
"""R6: when the model returns empty project but the interaction
|
||||
has one, the candidate should inherit the interaction's project."""
|
||||
raw = '[{"type": "project", "content": "machine works offline"}]'
|
||||
interaction = _make_interaction()
|
||||
interaction.project = "p06-polisher"
|
||||
result = _parse_candidates(raw, interaction)
|
||||
assert result[0].project == "p06-polisher"
|
||||
|
||||
|
||||
def test_parser_keeps_model_project_when_provided():
|
||||
"""Model-supplied project takes precedence over interaction."""
|
||||
raw = '[{"type": "project", "content": "x", "project": "p04-gigabit"}]'
|
||||
interaction = _make_interaction()
|
||||
interaction.project = "p06-polisher"
|
||||
result = _parse_candidates(raw, interaction)
|
||||
assert result[0].project == "p04-gigabit"
|
||||
|
||||
|
||||
def test_missing_cli_returns_empty(monkeypatch):
|
||||
"""If ``claude`` is not on PATH the extractor returns empty, never raises."""
|
||||
monkeypatch.setattr(extractor_llm, "_cli_available", lambda: False)
|
||||
|
||||
Reference in New Issue
Block a user