feat(eval-loop): Day 4 — LLM extractor via claude -p (OAuth, no API key)
Second pass on the LLM-assisted extractor after Antoine's explicit
rule: no API key, ever. Refactored src/atocore/memory/extractor_llm.py
to shell out to the Claude Code 'claude -p' CLI via subprocess instead
of the anthropic SDK, so extraction reuses the user's existing Claude.ai
OAuth credentials and needs zero secret management.
Implementation:
- subprocess.run(["claude", "-p", "--model", "haiku",
"--append-system-prompt", <instructions>,
"--no-session-persistence", "--disable-slash-commands",
user_message], ...)
- cwd is a cached tempfile.mkdtemp() directory, so every invocation
starts with a clean context instead of auto-discovering CLAUDE.md /
AGENTS.md / DEV-LEDGER.md from the repo root. We cannot use --bare
because it forces API-key auth, which defeats the purpose; the
temp-cwd trick is the lightest way to keep OAuth while skipping
project context loading.
- Silent-failure contract unchanged: missing CLI, non-zero exit,
timeout, malformed JSON — all return [] and log an error. The
capture audit trail must not break on an optional side effect.
- Default timeout bumped from 20s to 90s: Haiku + Node.js startup
+ OAuth check takes ~20-40s per call in practice, and real responses
of up to 8KB take longer. With a 45s limit, the first live run hit
2 timeouts.
- tests/test_extractor_llm.py refactored: the API-key / anthropic SDK
tests are replaced by subprocess-mocking tests covering missing
CLI, timeout, non-zero exit, and a happy-path stdout parse. 14
tests, all green.
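
A minimal sketch of the call shape and silent-failure contract described above, not the actual contents of extractor_llm.py. The CLI flags are the ones listed in this message. The function name `extract_candidates`, the injectable `run` parameter, the explicit UTF-8 decoding, and the assumption that the CLI prints a JSON array on stdout are all illustrative.

```python
import json
import logging
import subprocess
import tempfile
from functools import lru_cache

log = logging.getLogger(__name__)


@lru_cache(maxsize=1)
def _clean_cwd() -> str:
    """Create one empty temp directory and reuse it for every call."""
    return tempfile.mkdtemp(prefix="extractor-llm-")


def extract_candidates(user_message, instructions, timeout_s=90.0, run=subprocess.run):
    """Return a list of candidate dicts, or [] on any failure (hypothetical helper)."""
    cmd = [
        "claude", "-p", "--model", "haiku",
        "--append-system-prompt", instructions,
        "--no-session-persistence", "--disable-slash-commands",
        user_message,
    ]
    try:
        # Running from an empty directory keeps project context files out of the prompt.
        proc = run(
            cmd,
            cwd=_clean_cwd(),
            capture_output=True,
            encoding="utf-8",
            errors="replace",
            timeout=timeout_s,
        )
    except FileNotFoundError:
        log.error("claude CLI not found on PATH")
        return []
    except subprocess.TimeoutExpired:
        log.error("claude CLI timed out after %ss", timeout_s)
        return []
    if proc.returncode != 0:
        log.error("claude CLI exited with code %s", proc.returncode)
        return []
    try:
        parsed = json.loads(proc.stdout)
    except json.JSONDecodeError:
        log.error("claude CLI returned malformed JSON")
        return []
    # Assumption: a well-formed response is a JSON array of candidate objects.
    return parsed if isinstance(parsed, list) else []
```

Passing `run` as a parameter is only a convenience for this sketch: a test can hand in a fake callable instead of patching `subprocess.run`.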
scripts/extractor_eval.py:
- New --output <path> flag writes the JSON result directly to a file,
bypassing stdout/log interleaving (structlog sends INFO to stdout
via PrintLoggerFactory, so a naive '> out.json' pollutes the file).
- Forces UTF-8 on stdout so real LLM output with em-dashes / arrows /
CJK doesn't crash the human report on Windows cp1252 consoles.
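
A small illustration of the encoding problem, assuming the console behaves like a strict cp1252 text stream. The helper `encode_like_console` is hypothetical and uses an in-memory buffer in place of a real console.

```python
import io


def encode_like_console(text, encoding, errors="strict"):
    """Write text through a TextIOWrapper and return the bytes that came out."""
    buf = io.BytesIO()
    stream = io.TextIOWrapper(buf, encoding=encoding, errors=errors, line_buffering=True)
    stream.write(text)
    stream.flush()
    return buf.getvalue()


sample = "Extractor improvement → Harness expansion"

# UTF-8 can represent the arrow, so the text round-trips unchanged.
utf8_bytes = encode_like_console(sample, "utf-8")

# cp1252 has no arrow; errors="replace" substitutes "?" instead of raising.
replaced = encode_like_console(sample, "cp1252", errors="replace")

# With the default strict handler the same write raises UnicodeEncodeError.
try:
    encode_like_console(sample, "cp1252")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

Re-wrapping `sys.stdout.buffer` as UTF-8 with `errors="replace"` means the report can no longer raise on an unencodable character. How the glyphs are displayed still depends on the terminal.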
First live baseline run against the 20-interaction labeled corpus
(scripts/eval_data/extractor_llm_baseline_2026-04-11.json):
mode=llm labeled=20 recall=1.0 precision=0.357 yield_rate=2.55
total_actual_candidates=51 total_expected_candidates=7
false_negative_interactions=0 false_positive_interactions=9
Recall goes from 0% to 100% vs the rule baseline: every human-labeled
positive is caught. Precision reads low (0.357), but inspection shows
the "false positives" are real candidates that the human labels
under-counted. For example, interaction a6b0d279 was labeled at 2
expected candidates and the model caught all 6 polisher architectural
facts; interaction 52c8c0f3 was labeled at 1 and the model caught all
5 infra commitments. The labels are the bottleneck, not the model.
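
The summary figures can be cross-checked from the per-interaction counts in the baseline file. This is a sketch, not the real aggregate(). It assumes recall and precision are computed per interaction rather than per candidate. That assumption reproduces the reported numbers, but this commit does not confirm it. `summarize` and `BASELINE` are illustrative names.

```python
def summarize(pairs):
    """pairs: (expected_count, actual_count) for each labeled interaction."""
    total = len(pairs)
    positives = [(e, a) for e, a in pairs if e > 0]
    caught = sum(1 for _, a in positives if a > 0)
    false_pos = sum(1 for e, a in pairs if e == 0 and a > 0)
    flagged = caught + false_pos  # interactions that produced any candidate
    total_actual = sum(a for _, a in pairs)
    return {
        "total": total,
        "exact_match": sum(1 for e, a in pairs if e == a),
        "positive_expected": len(positives),
        "total_expected_candidates": sum(e for e, _ in pairs),
        "total_actual_candidates": total_actual,
        "yield_rate": round(total_actual / total, 3) if total else 0.0,
        "recall": round(caught / len(positives), 3) if positives else 0.0,
        "precision": round(caught / flagged, 3) if flagged else 0.0,
        "false_positive_interactions": false_pos,
        "false_negative_interactions": len(positives) - caught,
    }


# (expected_count, actual_count) in the order the results appear in the baseline file.
BASELINE = [
    (0, 1), (0, 0), (0, 0), (0, 4), (0, 0), (0, 0), (1, 3), (2, 6), (0, 0), (0, 3),
    (0, 0), (0, 3), (0, 2), (1, 5), (0, 5), (0, 1), (2, 5), (1, 3), (0, 4), (0, 6),
]

summary = summarize(BASELINE)
```

Under this reading, precision is 5 caught positives out of 14 interactions that yielded anything. The two example interactions above already count as true positives, so relabeling them would raise expected counts without moving the precision figure.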
Day 4 gate against Codex's criteria:
- candidate yield: 255% vs ≥15-25% target
- FP rate tolerable for manual triage: 51 candidates reviewable in
~10 minutes via the triage CLI
- ≥2 real non-synthetic candidates worth review: 20+ obvious wins
(polisher architecture set, p05 infra set, DEV-LEDGER protocol set)
Gate cleared. LLM-assisted extraction is the path forward for
conversational captures. Rule-based extractor stays as-is for
structured-cue inputs and remains the default mode. The next step
(Day 5 stabilize / document) will wire LLM mode behind a flag in
the public extraction endpoint and document scope.
Test count: 276 -> 278 passing. No existing tests changed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
518
scripts/eval_data/extractor_llm_baseline_2026-04-11.json
Normal file
@@ -0,0 +1,518 @@
+{
+  "summary": {
+    "total": 20,
+    "exact_match": 6,
+    "positive_expected": 5,
+    "total_expected_candidates": 7,
+    "total_actual_candidates": 51,
+    "yield_rate": 2.55,
+    "recall": 1.0,
+    "precision": 0.357,
+    "false_positive_interactions": 9,
+    "false_negative_interactions": 0,
+    "miss_classes": {},
+    "mode": "llm"
+  },
+  "results": [
+    {
+      "id": "ab239158-d6ac-4c51-b6e4-dd4ccea384a2",
+      "expected_count": 0,
+      "actual_count": 1,
+      "ok": false,
+      "miss_class": "n/a",
+      "notes": "Instructional deploy guidance. No durable claim.",
+      "actual_candidates": [
+        {
+          "memory_type": "knowledge",
+          "content": "AtoCore deployments to dalidou use the script /srv/storage/atocore/app/deploy/dalidou/deploy.sh instead of manual docker commands",
+          "project": "atocore",
+          "rule": "llm_extraction"
+        }
+      ]
+    },
+    {
+      "id": "da153f2a-b20a-4dee-8c72-431ebb71f08c",
+      "expected_count": 0,
+      "actual_count": 0,
+      "ok": true,
+      "miss_class": "n/a",
+      "notes": "'Deploy still in progress.' Pure status.",
+      "actual_candidates": []
+    },
+    {
+      "id": "7d8371ee-c6d3-4dfe-a7b0-2d091f075c15",
+      "expected_count": 0,
+      "actual_count": 0,
+      "ok": true,
+      "miss_class": "n/a",
+      "notes": "Git command walkthrough. No durable claim.",
+      "actual_candidates": []
+    },
+    {
+      "id": "14bf3f90-e318-466e-81ac-d35522741ba5",
+      "expected_count": 0,
+      "actual_count": 4,
+      "ok": false,
+      "miss_class": "n/a",
+      "notes": "Ledger status update. Transient fact, not a durable memory candidate.",
+      "actual_candidates": [
+        {
+          "memory_type": "project",
+          "content": "Retrieval/extraction evaluation follows 8-day mini-phase plan with hard gates to prevent scope drift. Preflight checks must validate git SHAs, baselines, and fixture stability before coding.",
+          "project": "",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "Day 1: Create labeled extractor eval set from 30 captures (10 zero-candidate, 10 single-candidate, 10 ambiguous) with metadata; create scoring tool to measure precision/recall.",
+          "project": "",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "Day 2: Measure current extractor against labeled set, recording yield, true/false positives, and false negatives by pattern.",
+          "project": "",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "Session Log/Ledger system tracks work state across sessions so future sessions immediately know what is true and what is next; phases marked by git SHAs.",
+          "project": "",
+          "rule": "llm_extraction"
+        }
+      ]
+    },
+    {
+      "id": "8f855235-c38d-4c27-9f2b-8530ebe1a2d8",
+      "expected_count": 0,
+      "actual_count": 0,
+      "ok": true,
+      "miss_class": "n/a",
+      "notes": "Short-term recommendation ('merge to main and deploy'), not a standing decision.",
+      "actual_candidates": []
+    },
+    {
+      "id": "04a96eb5-cd00-4e9f-9252-b2cc919000a4",
+      "expected_count": 0,
+      "actual_count": 0,
+      "ok": true,
+      "miss_class": "n/a",
+      "notes": "Dev server config table. Operational detail, not a memory.",
+      "actual_candidates": []
+    },
+    {
+      "id": "79d606ed-8981-454a-83af-c25226b1b65c",
+      "expected_count": 1,
+      "actual_count": 3,
+      "ok": false,
+      "miss_class": "recommendation_prose",
+      "notes": "A recommendation that later became a ratified decision. Rule extractor would need a 'simplest version that could work today' / 'I'd start with' cue class.",
+      "actual_candidates": [
+        {
+          "memory_type": "project",
+          "content": "atocore uses multi-model coordination: Claude and codex share DEV-LEDGER.md (current state / active plan / P1+P2 findings / recent decisions / commit log) read at session start, appended at session end",
+          "project": "atocore",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "atocore starts with manual-event-loop (/audit or /status prompts) using DEV-LEDGER.md before upgrading to automated git hooks/CI review",
+          "project": "atocore",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "atocore development involves coordinating between Claude and codex models with shared plan/review strategy and counter-validation to improve system quality",
+          "project": "atocore",
+          "rule": "llm_extraction"
+        }
+      ]
+    },
+    {
+      "id": "a6b0d279-c564-4bce-a703-e476f4a148ad",
+      "expected_count": 2,
+      "actual_count": 6,
+      "ok": false,
+      "miss_class": "architectural_change_summary",
+      "notes": "Two durable architectural facts about the polisher machine (Z-axis is engage/retract, cam is read-only). Extractor would need to recognize 'A is now B' / 'X removed, Y added' patterns.",
+      "actual_candidates": [
+        {
+          "memory_type": "project",
+          "content": "Z-axis is a binary engage/retract mechanism (z_engaged bool), not continuous position control; confirmation timeout z_engage_timeout_s required.",
+          "project": "p06-polisher",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "Cam amplitude and offset are mechanically set by operator and read via encoders; no actuators control them, controller receives encoder telemetry only.",
+          "project": "p06-polisher",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "Cam parameters in controller are expected_cam_amplitude_deg and expected_cam_offset_deg (read-only reference for verification), not command setpoints.",
+          "project": "p06-polisher",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "Manual mode UI displays cam encoder readings (cam_amplitude_deg, cam_offset_deg) as read-only for operator verification of mechanical setting.",
+          "project": "p06-polisher",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "Manual session log records cam_setting measured at session start; run-log segment actual block includes cam_amplitude_deg_mean and cam_offset_deg_mean.",
+          "project": "p06-polisher",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "Machine capabilities now define z_type: engage_retract and cam_type: mechanical_with_encoder instead of actuator-driven setpoints.",
+          "project": "p06-polisher",
+          "rule": "llm_extraction"
+        }
+      ]
+    },
+    {
+      "id": "4e00e398-2e89-4653-8ee5-3f65c7f4d2d3",
+      "expected_count": 0,
+      "actual_count": 0,
+      "ok": true,
+      "miss_class": "n/a",
+      "notes": "Clarification question to user.",
+      "actual_candidates": []
+    },
+    {
+      "id": "a6a7816a-7590-4616-84f4-49d9054c2a91",
+      "expected_count": 0,
+      "actual_count": 3,
+      "ok": false,
+      "miss_class": "n/a",
+      "notes": "Instructional response offering two next moves.",
+      "actual_candidates": [
+        {
+          "memory_type": "knowledge",
+          "content": "Codex is an audit agent; communicate with it via markdown prompts with numbered steps; it updates findings via commits to codex/* branches or direct messages.",
+          "project": "",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "preference",
+          "content": "Audit-first workflow recommended: have codex audit DEV-LEDGER.md and recent commits before execution; validates round-trip, catches errors early.",
+          "project": "",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "knowledge",
+          "content": "DEV-LEDGER.md at repo root is the shared coordination document with Orientation, Active Plan, and Open Review Findings sections.",
+          "project": "",
+          "rule": "llm_extraction"
+        }
+      ]
+    },
+    {
+      "id": "03527502-316a-4a3e-989c-00719392c7d1",
+      "expected_count": 0,
+      "actual_count": 0,
+      "ok": true,
+      "miss_class": "n/a",
+      "notes": "Troubleshooting a paste failure. Ephemeral.",
+      "actual_candidates": []
+    },
+    {
+      "id": "1fff59fc-545f-42df-9dd1-a0e6dec1b7ee",
+      "expected_count": 0,
+      "actual_count": 3,
+      "ok": false,
+      "miss_class": "n/a",
+      "notes": "Agreement + follow-up question. No durable claim.",
+      "actual_candidates": [
+        {
+          "memory_type": "project",
+          "content": "Roadmap: Extractor improvement → Harness expansion → Wave 2 trusted operational ingestion → Finish OpenClaw integration (in that order)",
+          "project": "atocore",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "Phase 1 (Extractor): eval-driven loop—label captures, improve rules/add LLM mode, measure yield & FP, stop when queue reviewable (not coverage metrics)",
+          "project": "atocore",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "Phases 1 & 2 (Extractor + Harness) are a mini-phase; without harness, extractor improvements are blind edits",
+          "project": "atocore",
+          "rule": "llm_extraction"
+        }
+      ]
+    },
+    {
+      "id": "eb65dc18-0030-4720-ace7-f55af9df719d",
+      "expected_count": 0,
+      "actual_count": 2,
+      "ok": false,
+      "miss_class": "n/a",
+      "notes": "Explanation of how the capture hook works. Instructional.",
+      "actual_candidates": [
+        {
+          "memory_type": "knowledge",
+          "content": "Dalidou stores Claude Code interactions via a Stop hook that fires after each turn and POSTs to http://dalidou:8100/interactions with client=claude-code parameter",
+          "project": "",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "adaptation",
+          "content": "Interaction capture system is passive and automatic; no manual action required, interactions accumulate automatically during normal Claude Code usage",
+          "project": "",
+          "rule": "llm_extraction"
+        }
+      ]
+    },
+    {
+      "id": "52c8c0f3-32fb-4b48-9065-73c778a08417",
+      "expected_count": 1,
+      "actual_count": 5,
+      "ok": false,
+      "miss_class": "spec_update_announcement",
+      "notes": "Concrete architectural commitments just added to the polisher spec. Phrased as '§17.1 Local Storage - USB SSD mandatory, not SD card.' The '§' section markers could be a new cue.",
+      "actual_candidates": [
+        {
+          "memory_type": "project",
+          "content": "USB SSD mandatory for storage (not SD card); directory structure /data/runs/{id}/, /data/manual/{id}/; status.json for machine state",
+          "project": "",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "RPi joins Tailscale mesh for remote access over SSH VPN; no public IP or port forwarding; fully offline operation",
+          "project": "",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "Data synchronization via rsync over Tailscale, failure-tolerant and non-blocking; USB stick as manual fallback",
+          "project": "",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "Machine design principle: works fully offline and independently; network connection is for remote access only",
+          "project": "",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "No cloud, no real-time streaming, no remote control features in design scope",
+          "project": "",
+          "rule": "llm_extraction"
+        }
+      ]
+    },
+    {
+      "id": "32d40414-15af-47ee-944b-2cceae9574b8",
+      "expected_count": 0,
+      "actual_count": 5,
+      "ok": false,
+      "miss_class": "n/a",
+      "notes": "Session recap. Historical summary, not a durable memory.",
+      "actual_candidates": [
+        {
+          "memory_type": "project",
+          "content": "P1: Reflection loop integration incomplete—extraction remains manual (POST /interactions/{id}/extract), not auto-triggered with reinforcement. Live capture won't auto-populate candidate review queue.",
+          "project": "atocore",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "P1: Project memories excluded from context injection; build_context() requests [\"identity\", \"preference\"] only. Reinforcement signal doesn't reach assembled context packs.",
+          "project": "atocore",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "Current batch-extract rules produce only 1 candidate from 42 real captures. Extractor needs conversational-cue detection or LLM-assisted path to improve yield.",
+          "project": "atocore",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "Next priority: extractor rule expansion (cheapest validation of reflection loop), then Wave 2 trusted operational ingestion (master-plan priority). Defer retrieval eval harness focus.",
+          "project": "atocore",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "knowledge",
+          "content": "Alias canonicalization fix (resolve_project_name() boundary) is consistently applied across project state, memories, interactions, and context lookup. Code review approved directionally.",
+          "project": "atocore",
+          "rule": "llm_extraction"
+        }
+      ]
+    },
+    {
+      "id": "b6d2cdfc-37fb-459a-96bd-caefb9beaab4",
+      "expected_count": 0,
+      "actual_count": 1,
+      "ok": false,
+      "miss_class": "n/a",
+      "notes": "Deployment prompt for Dalidou. Operational, not a memory.",
+      "actual_candidates": [
+        {
+          "memory_type": "preference",
+          "content": "User prefers receiving standalone testing prompts they can paste into Claude Code on target deployments rather than having the assistant run tests directly.",
+          "project": "atocore",
+          "rule": "llm_extraction"
+        }
+      ]
+    },
+    {
+      "id": "ee03d823-931b-4d4e-9258-88b4ed5eeb07",
+      "expected_count": 2,
+      "actual_count": 5,
+      "ok": false,
+      "miss_class": "layered_recommendation",
+      "notes": "Layered infra recommendation with 'non-negotiable' / 'strongly recommended' strength markers. The 'non-negotiable' token could be a new cue class.",
+      "actual_candidates": [
+        {
+          "memory_type": "project",
+          "content": "USB SSD on RPi is mandatory for polishing telemetry storage; must be independent of network for data integrity during runs.",
+          "project": "p06-polisher",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "Use Tailscale mesh for RPi remote access to provide SSH, file transfer, and NAT traversal without port forwarding.",
+          "project": "p06-polisher",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "Auto-sync telemetry data via rsync over Tailscale after runs complete; fire-and-forget pattern with automatic retry on network interruption.",
+          "project": "p06-polisher",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "Real-time telemetry monitoring should target 10 Hz downsampling; full 100 Hz streaming over network is not necessary.",
+          "project": "p06-polisher",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "knowledge",
+          "content": "Polishing telemetry data rate is approximately 29 MB per hour (100 Hz × 20 channels × 4 bytes = 8 KB/s).",
+          "project": "p06-polisher",
+          "rule": "llm_extraction"
+        }
+      ]
+    },
+    {
+      "id": "dd234d9f-0d1c-47e8-b01c-eebcb568c7e7",
+      "expected_count": 1,
+      "actual_count": 3,
+      "ok": false,
+      "miss_class": "alignment_assertion",
+      "notes": "Architectural invariant assertion. '**Alignment verified**' / 'nothing changes for X' style. Likely too subtle for rule matching without LLM assistance.",
+      "actual_candidates": [
+        {
+          "memory_type": "project",
+          "content": "Machine spec (shareable) + Atomaste spec (internal) separate concerns. Machine spec hides program generation as 'separate scope' to protect IP/business strategy.",
+          "project": "p06-polisher",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "Firmware interface contract is invariant: controller-job.v1 input, run-log.v1 + telemetry output. No firmware changes needed regardless of program generation implementation.",
+          "project": "p06-polisher",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "Atomaste sim spec documents forward/return paths, calibration model (Preston k), translation loss, and service/IP strategy—details hidden from shareable machine spec.",
+          "project": "p06-polisher",
+          "rule": "llm_extraction"
+        }
+      ]
+    },
+    {
+      "id": "1f95891a-cf37-400e-9d68-4fad8e04dcbb",
+      "expected_count": 0,
+      "actual_count": 4,
+      "ok": false,
+      "miss_class": "n/a",
+      "notes": "Huge session handoff prompt. Informational only.",
+      "actual_candidates": [
+        {
+          "memory_type": "knowledge",
+          "content": "AtoCore is FastAPI (Python 3.12, SQLite + ChromaDB) on Dalidou home server (dalidou:8100), repo C:\\Users\\antoi\\ATOCore, data /srv/storage/atocore/, ingests Obsidian vault + Google Drive into vector memory system.",
+          "project": "atocore",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "knowledge",
+          "content": "Deploy AtoCore: git push origin main, then ssh papa@dalidou and run /srv/storage/atocore/app/deploy/dalidou/deploy.sh",
+          "project": "atocore",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "adaptation",
+          "content": "Do not add memory extraction to interaction capture hot path; keep extraction as separate batch/manual step. Reason: latency and queue noise before review rhythm is comfortable.",
+          "project": "atocore",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "As of 2026-04-11, approved roadmap in order: observe reinforcement, batch extraction, candidate triage, off-Dalidou backup, retrieval quality review.",
+          "project": "atocore",
+          "rule": "llm_extraction"
+        }
+      ]
+    },
+    {
+      "id": "5580950f-d010-4544-be4b-b3071271a698",
+      "expected_count": 0,
+      "actual_count": 6,
+      "ok": false,
+      "miss_class": "n/a",
+      "notes": "Ledger schema sketch. Structural design proposal, later ratified — but the same idea was already captured as a ratified decision in the recent decisions section, so not worth re-extracting from this conversational form.",
+      "actual_candidates": [
+        {
+          "memory_type": "project",
+          "content": "AtoCore adopts DEV-LEDGER.md as shared operating memory with stable headers; updated at session boundaries",
+          "project": "atocore",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "adaptation",
+          "content": "Codex branches for AtoCore fork from main (never orphan); use naming pattern codex/<topic>",
+          "project": "atocore",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "adaptation",
+          "content": "In AtoCore, Claude builds and Codex audits; never work in parallel on same files",
+          "project": "atocore",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "adaptation",
+          "content": "In AtoCore, P1-severity findings in DEV-LEDGER.md block further main commits until acknowledged",
+          "project": "atocore",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "adaptation",
+          "content": "Every AtoCore session appends to DEV-LEDGER.md Session Log and updates Orientation before ending",
+          "project": "atocore",
+          "rule": "llm_extraction"
+        },
+        {
+          "memory_type": "project",
+          "content": "AtoCore roadmap: (1) extractor improvement, (2) harness expansion, (3) Wave 2 ingestion, (4) OpenClaw finish; steps 1+2 are current mini-phase",
+          "project": "atocore",
+          "rule": "llm_extraction"
+        }
+      ]
+    }
+  ]
+}
@@ -22,11 +22,17 @@ Usage:
 from __future__ import annotations
 
 import argparse
+import io
 import json
 import sys
 from dataclasses import dataclass, field
 from pathlib import Path
 
+# Force UTF-8 on stdout so real LLM output (arrows, em-dashes, CJK)
+# doesn't crash the human report on Windows cp1252 consoles.
+if hasattr(sys.stdout, "buffer"):
+    sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding="utf-8", errors="replace", line_buffering=True)
+
 # Make src/ importable without requiring an install.
 _REPO_ROOT = Path(__file__).resolve().parent.parent
 sys.path.insert(0, str(_REPO_ROOT / "src"))
@@ -218,6 +224,12 @@ def main() -> int:
     parser.add_argument("--snapshot", type=Path, default=DEFAULT_SNAPSHOT)
     parser.add_argument("--labels", type=Path, default=DEFAULT_LABELS)
     parser.add_argument("--json", action="store_true", help="emit machine-readable JSON")
+    parser.add_argument(
+        "--output",
+        type=Path,
+        default=None,
+        help="write JSON result to this file (bypasses log/stdout interleaving)",
+    )
     parser.add_argument(
         "--mode",
         choices=["rule", "llm"],
@@ -232,7 +244,25 @@ def main() -> int:
     summary = aggregate(results)
     summary["mode"] = args.mode
 
-    if args.json:
+    if args.output is not None:
+        payload = {
+            "summary": summary,
+            "results": [
+                {
+                    "id": r.id,
+                    "expected_count": r.expected_count,
+                    "actual_count": r.actual_count,
+                    "ok": r.ok,
+                    "miss_class": r.miss_class,
+                    "notes": r.notes,
+                    "actual_candidates": r.actual_candidates,
+                }
+                for r in results
+            ],
+        }
+        args.output.write_text(json.dumps(payload, indent=2, ensure_ascii=False), encoding="utf-8")
+        print(f"wrote {args.output} ({summary['mode']}: recall={summary['recall']} precision={summary['precision']})")
+    elif args.json:
         print_json(results, summary)
     else:
         print_human(results, summary)
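
One way to consume the file written by --output, sketched under the assumption that it has the `summary` / `results` layout shown in the hunk above. The helper `label_gaps` is hypothetical; it lists labeled positives where the model returned more candidates than the label expected, which are the interactions most likely to need relabeling. The demo writes a three-entry sample instead of reading the real baseline.

```python
import json
import tempfile
from pathlib import Path


def label_gaps(path):
    """Return (id, expected, actual) for positives where actual exceeds expected."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    gaps = [
        (r["id"], r["expected_count"], r["actual_count"])
        for r in data["results"]
        if r["expected_count"] > 0 and r["actual_count"] > r["expected_count"]
    ]
    # Largest gap between model output and label first.
    return sorted(gaps, key=lambda g: g[2] - g[1], reverse=True)


# Counts copied from three entries of the baseline; other fields omitted for brevity.
sample = {
    "summary": {"mode": "llm"},
    "results": [
        {"id": "da153f2a-b20a-4dee-8c72-431ebb71f08c", "expected_count": 0, "actual_count": 0},
        {"id": "79d606ed-8981-454a-83af-c25226b1b65c", "expected_count": 1, "actual_count": 3},
        {"id": "a6b0d279-c564-4bce-a703-e476f4a148ad", "expected_count": 2, "actual_count": 6},
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "baseline.json"
    # Same serialization settings as the --output branch.
    out.write_text(json.dumps(sample, indent=2, ensure_ascii=False), encoding="utf-8")
    gaps = label_gaps(out)
```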