Codex R6: the LLM extractor accepted the model's project field
verbatim. When the model returned empty string, clearly p06 memories
got promoted as project='', making them invisible to the p06
project-memory band and explaining the p06-offline-design harness
failure.
Fix: if model returns empty project but interaction.project is set,
inherit the interaction's project. Model-supplied project still takes
precedence when non-empty.
Two new tests lock the fallback and precedence behaviors.
R5 acknowledged (LLM extractor not yet wired into API — next task).
Test count: 278 -> 280. Harness re-run pending after deploy.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Codex correctly identified:
- R5 (P1): LLM extractor is script-only, not wired into the API
- R6 (P1): LLM extractor drops interaction.project when model
returns empty — caused the p06-offline-design harness failure
- R7 (P2): lexical scorer ties on overlap count, broad memories
win on confidence tiebreaker
- R8 (P2): no integration test for the persist/triage flow
Also corrected the harness-failure narrative: not all 3 are budget
contention. One is a ranking tie, one is a project-scope miss,
one is chunk bleed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Haiku was producing noisy candidates (31% accept rate on first
triage). Sonnet should give tighter extraction with fewer false
positives while still catching the same durable-fact patterns.
Override: ATOCORE_LLM_EXTRACTOR_MODEL=haiku to revert.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Mini-phase complete. Before/after deltas:
Metric Before After
─────────────────────────────────────────
Rule extractor recall 0% 0% (unchanged, deprioritized)
LLM extractor recall n/a 100% (new, claude -p haiku)
LLM candidate yield n/a 2.55/interaction
First triage accept rate n/a 31% (16/51)
Active memories 20 36 (+16)
p06-polisher memories 2 16 (+14)
atocore memories 0 5 (+5)
Retrieval harness 6/6 15/18 (expanded to 18 fixtures)
Test count 264 278 (+14)
3 remaining harness failures are budget-contention on the p06 memory
band: the specific memory a fixture targets ranks 4th+ and the 25%
budget only holds 2-3 entries. Not a ranking bug — the per-entry
250-char cap was the one justified tweak; a second budget change
risks regressing other fixtures per Codex's Day 7 hard gate.
Ledger updated: Orientation, Session Log, main_tip, harness line.
Next on the roadmap (from DEV-LEDGER Active Plan / docs/next-steps):
- Wave 2 trusted operational ingestion (p04/p05/p06 dashboards)
- Finish OpenClaw integration (Phase 8)
- Auto-triage (multi-model second pass to reduce human review)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
A 530-char program overview memory with confidence 0.96 was filling
the entire 25% project-memory budget at equal overlap score (3 tokens),
beating shorter query-relevant newly-promoted memories (confidence
0.5) on the confidence tiebreaker. The long memory legitimately
scored well, but its length starved every other memory from the band.
Fix: truncate each formatted entry to 250 chars with '...' so at
least 2-3 memories fit the ~700-char available budget. This doesn't
change ranking — the most relevant memory still goes first — but
it ensures the runner-up can also appear.
Harness fixture delta: Day 7 regression pass pending after deploy.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Added 12 new fixtures across all three active projects:
- p04: 1 short/ambiguous case ('current status')
- p05: 1 CGH calibration case with cross-project bleed guard
- p06: 7 new fixtures targeting triage-promoted memories
(firmware interface, z-axis, cam encoder, telemetry rate,
offline design, USB SSD, Tailscale)
- Adversarial: cross-project-no-bleed (p04 query must not surface
p06 telemetry rate), no-project-hint (project memories must not
appear without a hint)
First run: 14/18 passing.
4 failures (p06-firmware-interface, p06-z-axis, p06-offline-design,
p06-tailscale) share the same root cause: long pre-existing p06
memories (530+ chars, confidence 0.9+) fill the 25% project-memory
budget before the query-relevant newly-promoted memories (shorter,
confidence 0.5) get a slot. Budget contention at equal overlap
score tiebroken by confidence. Day 7 ranking tweak target.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Documents the LLM-assisted extractor's in-scope / out-of-scope
categories derived from the first live triage pass (16 promoted,
35 rejected). Five in-scope classes, six explicit out-of-scope
classes, trust model summary, multi-model future direction.
Cleaned up stale follow-up items in next-steps.md: rule expansion
marked deprioritized, LLM extractor marked done, retrieval harness
marked done with expansion pending.
Fixed docstring timeout (45s -> 90s) in extractor_llm.py.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
First end-to-end triage pass on 51 LLM-extracted candidates from
the Day 4 baseline run (extractor_llm via claude -p haiku against
a 20-interaction frozen snapshot).
Results:
- Promoted 16 memories (31% accept rate):
* p06-polisher: 9 (USB SSD, Tailscale, 10 Hz telemetry,
controller-job.v1 invariant, offline-first, z-axis engage/
retract, cam encoder read-only, spec separation)
* atocore: 7 (extraction off hot path, DEV-LEDGER adopted,
codex branching rule, Claude builds/Codex audits, alias
canonicalization, Stop hook capture, passive capture)
- Rejected 35 (stale roadmap items, duplicates with wrong project
tags, already-fixed P1 findings, process rules that live in
DEV-LEDGER/AGENTS.md not in memory, too-granular implementation
details, operational instructions)
Active memory count: 20 → 36. p06-polisher went from 2 to 16.
Candidate queue: 0.
The triage verdict is saved at
scripts/eval_data/triage_verdict_2026-04-12.json for audit.
persist_llm_candidates.py used to push candidates to Dalidou.
POST /memory now accepts a 'status' field (default 'active') so
external scripts can create candidate memories directly.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Mini-phase Day 1-4: frozen interaction snapshot, labeled extractor
eval corpus (20 labels), eval runner with --mode rule|llm, LLM-
assisted extractor via claude -p (OAuth, no API key), baseline
measurements (rule 0% recall → LLM 100% recall), status field
exposed on POST /memory, persist_llm_candidates.py script.
Day 4 gate cleared: LLM-assisted extraction is the recommended
path for conversational captures. Rule-based stays as default for
structural-cue content.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The API endpoint now passes the request's status field through to
create_memory() so external scripts can create candidate memories
directly without going through the extract endpoint. Default remains
'active' for backward compatibility.
persist_llm_candidates.py reads a saved extractor eval baseline
JSON (e.g. the Day 4 LLM run) and POSTs each candidate to Dalidou
with status=candidate. Safe to re-run — duplicate content returns
400 which the script counts as 'skipped'.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Second pass on the LLM-assisted extractor after Antoine's explicit
rule: no API key, ever. Refactored src/atocore/memory/extractor_llm.py
to shell out to the Claude Code 'claude -p' CLI via subprocess instead
of the anthropic SDK, so extraction reuses the user's existing Claude.ai
OAuth credentials and needs zero secret management.
Implementation:
- subprocess.run(["claude", "-p", "--model", "haiku",
"--append-system-prompt", <instructions>,
"--no-session-persistence", "--disable-slash-commands",
user_message], ...)
- cwd is a cached tempfile.mkdtemp() so every invocation starts with
a clean context instead of auto-discovering CLAUDE.md / AGENTS.md /
DEV-LEDGER.md from the repo root. We cannot use --bare because it
forces API-key auth, which defeats the purpose; the temp-cwd trick
is the lightest way to keep OAuth auth while skipping project
context loading.
- Silent-failure contract unchanged: missing CLI, non-zero exit,
timeout, malformed JSON — all return [] and log an error. The
capture audit trail must not break on an optional side effect.
- Default timeout bumped from 20s to 90s: Haiku + Node.js startup
+ OAuth check is ~20-40s per call in practice, plus real responses
up to 8KB take longer. 45s hit 2 timeouts on the first live run.
- tests/test_extractor_llm.py refactored: the API-key / anthropic SDK
tests are replaced by subprocess-mocking tests covering missing
CLI, timeout, non-zero exit, and a happy-path stdout parse. 14
tests, all green.
scripts/extractor_eval.py:
- New --output <path> flag writes the JSON result directly to a file,
bypassing stdout/log interleaving (structlog sends INFO to stdout
via PrintLoggerFactory, so a naive '> out.json' pollutes the file).
- Forces UTF-8 on stdout so real LLM output with em-dashes / arrows /
CJK doesn't crash the human report on Windows cp1252 consoles.
First live baseline run against the 20-interaction labeled corpus
(scripts/eval_data/extractor_llm_baseline_2026-04-11.json):
mode=llm labeled=20 recall=1.0 precision=0.357 yield_rate=2.55
total_actual_candidates=51 total_expected_candidates=7
false_negative_interactions=0 false_positive_interactions=9
Recall 0% -> 100% vs rule baseline — every human-labeled positive is
caught. Precision reads low (0.357) but inspection shows the "false
positives" are real candidates the human labels under-counted. For
example interaction a6b0d279 was labeled at 2 expected candidates,
the model caught all 6 polisher architectural facts; interaction
52c8c0f3 was labeled at 1, the model caught all 5 infra commitments.
The labels are the bottleneck, not the model.
Day 4 gate against Codex's criteria:
- candidate yield: 255% vs ≥15-25% target
- FP rate tolerable for manual triage: 51 candidates reviewable in
~10 minutes via the triage CLI
- ≥2 real non-synthetic candidates worth review: 20+ obvious wins
(polisher architecture set, p05 infra set, DEV-LEDGER protocol set)
Gate cleared. LLM-assisted extraction is the path forward for
conversational captures. Rule-based extractor stays as-is for
structured-cue inputs and remains the default mode. The next step
(Day 5 stabilize / document) will wire LLM mode behind a flag in
the public extraction endpoint and document scope.
Test count: 276 -> 278 passing. No existing tests changed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Day 2 baseline showed 0% recall for the rule-based extractor across
5 distinct miss classes. Day 4 decision gate: prototype an
LLM-assisted mode behind a flag. Option A ratified by Antoine.
New module src/atocore/memory/extractor_llm.py:
- extract_candidates_llm(interaction) returns the same MemoryCandidate
dataclass the rule extractor produces, so both paths flow through
the existing triage / candidate pipeline unchanged.
- extract_candidates_llm_verbose() also returns the raw model output
and any error string, for eval and debugging.
- Uses Claude Haiku 4.5 by default; model overridable via
ATOCORE_LLM_EXTRACTOR_MODEL env. Timeout via
ATOCORE_LLM_EXTRACTOR_TIMEOUT_S (default 20s).
- Silent-failure contract: missing API key, unreachable model,
malformed JSON — all return [] and log an error. Never raises
into the caller. The capture audit trail must not break on an
optional side effect.
- Parser tolerates markdown fences, surrounding prose, invalid
memory types, clamps confidence to [0,1], drops empty content.
- System prompt explicitly tells the model to return [] for most
conversational turns (durable-fact bar, not "extract everything").
- Trust rules unchanged: candidates are never auto-promoted,
extraction stays off the capture hot path, human triages via the
existing CLI.
scripts/extractor_eval.py: new --mode {rule,llm} flag so the same
labeled corpus can be scored against both extractors. Default
remains rule so existing invocations are unchanged.
tests/test_extractor_llm.py: 12 new unit tests covering the parser
(empty array, malformed JSON, markdown fences, surrounding prose,
invalid types, empty content, confidence clamping, version tagging),
plus contract tests for missing API key, empty response, and a
mocked api_error path so failure modes never raise.
Test count: 264 -> 276 passing. No existing tests changed.
Next step: run `python scripts/extractor_eval.py --mode llm` against
the labeled set with ANTHROPIC_API_KEY in env, record the delta,
decide whether to wire LLM mode into the API endpoint and CLI or
keep it script-only for now.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Day 2 extractor eval baseline on a 20-interaction labeled set shows
0% yield / 0% recall / 0% precision. The 5 false negatives span
5 distinct miss classes, matching the pattern Codex's Day 4 hard
gate was designed to catch but arriving two days early.
No extractor code change on main. Day 1+2 artifacts committed on
working branch 'claude/extractor-eval-loop' at 7d8d599. Day 4
decision (keep rule-expanding vs prototype LLM-assisted mode) is
escalated to Antoine for ratification before Day 3 work touches
any extractor.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Day 1 (labeled corpus):
- scripts/eval_data/interactions_snapshot_2026-04-11.json — frozen
snapshot of 64 real claude-code interactions pulled from live
Dalidou (test-client captures filtered out). This is the stable
corpus the whole mini-phase labels against, independent of future
captures.
- scripts/eval_data/extractor_labels_2026-04-11.json — 20 hand-labeled
interactions drawn by length-stratified random sample. Positives:
5/20 = ~25%, total expected candidates: 7. Plan deviation: Codex's
plan asked for 30 (10/10/10 buckets); the real corpus is heavily
skewed toward instructional/status content, so honest labeling of
20 already crosses the fail-early threshold of "at least 5 plausible
positives" without padding.
Day 2 (baseline measurement):
- scripts/extractor_eval.py — file-based eval runner that loads the
snapshot + labels, runs extract_candidates_from_interaction on each,
and reports yield / recall / precision / miss-class breakdown.
Returns exit 1 on any false positive or false negative.
Current rule extractor against the labeled set:
labeled=20 exact_match=15 positive_expected=5
yield=0.0 recall=0.0 precision=0.0
false_negatives=5 false_positives=0
miss_classes:
recommendation_prose
architectural_change_summary
spec_update_announcement
layered_recommendation
alignment_assertion
Interpretation: the rule-based extractor matches exactly zero of the
5 plausible positive interactions in the labeled set, and the misses
are spread across 5 distinct cue classes with no single dominant
pattern. This is the Day 4 hard-stop signal landing on Day 2 — a
single rule expansion cannot close a 5-way miss, and widening rules
blindly will collapse precision. The right move is to go straight to
the Day 4 decision gate and consider LLM-assisted extraction.
Escalating to DEV-LEDGER.md as R5 for human ratification before
continuing. Not skipping Day 3 silently.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The ledger is the one-file source of truth for "what is currently
true" across Claude/Codex/human sessions:
- Orientation (live SHA, main tip, test count, harness state)
- Active Plan (currently Codex's 8-day extractor + harness plan
with hard gates and fail-early thresholds)
- Open Review Findings (P1/P2, status)
- Recent Decisions (bounded to last 20)
- Session Log (bounded to last 20)
- Working Rules (no parallel work, branching rule, P1 block)
Narrative docs under docs/ sometimes lag reality; the ledger does
not. Every session MUST read it at start and append a Session Log
line before ending.
AGENTS.md: added a new "Session protocol" section at the top that
points at the ledger. Applies to any agent (Claude, Codex, future).
CLAUDE.md (new, project-local): project instructions for Claude
Code in this repo. Points at DEV-LEDGER.md and AGENTS.md, spells
out the deploy workflow and the Claude/Codex working model.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds the t420-openclaw/ workspace: the OpenClaw side of the
AtoCore integration surface — agent bootstrap docs, atocore-context
skill, tools manifest, operations guide, and a thin HTTP client
wrapper (atocore.py + atocore.sh) that shells out to the canonical
Dalidou endpoint.
Branch is a single orphan commit authored 2026-04-06 by Antoine;
merging with --allow-unrelated-histories since it has no common
ancestor with main. Paths are entirely new (t420-openclaw/) so
there is no file-level conflict.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The fixture asserted 'GigaBIT M1' must not appear in a p05 pack,
but GigaBIT M1 is the mirror the interferometer measures, so its
name legitimately shows up in p05 source docs (CGH test setup
diagrams, AOM design input, etc.). Flagging it as bleed was false
positive.
Replace the assertion with genuinely p04-only material: the
'Option B' / 'conical back' architecture decision and a p06 tag,
neither of which has any reason to appear in a p05 configuration
answer.
Harness now passes 6/6 against live Dalidou at 38f6e52 — the
first clean baseline. Subsequent retrieval/ranking/ingestion
changes can be measured against this run.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Hyphen- and slash-separated identifiers (polisher-control,
twyman-green, etc.) were single tokens in the reinforcement /
memory-ranking tokenizer, so queries had to match the exact
hyphenation to score. The harness caught this on p06-control-rule:
'polisher control design rule' scored 2 overlap on each of the
three polisher-*/design-rule memories and the tiebreaker picked
the wrong one.
Now hyphenated words contribute both the full form AND each
sub-token. Extracted _add_token helper to avoid duplicating the
stop-word / length gate at both insertion points.
Reinforcement matcher tests still pass (28) — the new sub-tokens
only widen the match set, they never narrow it, so memories that
previously reinforced continue to reinforce.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Per-type ranking was still starving later types: when a p05 query
matched a 'knowledge' memory best but 'project' came first in the
type order, the project-type candidates filled the budget before
the knowledge-type pool was even ranked.
Collect all candidates into a single pool, dedupe by id, then
rank the whole pool once against the query before walking the
flat budget. Python's stable sort preserves insertion order (which
still reflects the caller's memory_types order) as a natural
tiebreaker when scores are equal.
Regression surfaced by the retrieval eval harness:
p05-vendor-signal still missing 'Zygo' after 5aeeb1c — the vendor
memory was type=knowledge but never reached the ranker because
type=project consumed the budget first.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
get_memories_for_context now accepts an optional query string.
When provided, candidate memories are reranked by lexical overlap
with the query (stemmed token intersection, ties broken by
confidence) before the budget walk. Without a query the order is
unchanged — effectively "by confidence desc" as before — so
non-builder callers see no behaviour change.
The fetch limit is raised from 10 to 30 so there's a real pool to
rerank. Token overlap reuses _normalize/_tokenize from
reinforcement.py so ranking and reinforcement matching share the
same notion of distinctive terms.
build_context passes the user_prompt through to both the identity/
preference and project-memory calls. The retrieval harness
regression the fix is targeting:
- p05-vendor-signal FAIL @ 1161645: "Zygo" missing from the pack
even though an active vendor memory contained it. Root cause:
higher-confidence p05 memories filled the 25% budget slice
before the vendor memory ever got a chance. Query-aware ordering
puts the vendor memory first when the query is about vendors.
New regression test test_project_memories_query_relevance_ordering
locks the behaviour in with two p05 memories and a tight budget.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
scripts/retrieval_eval.py walks a fixture file of project-hinted
questions, runs each against POST /context/build, and scores the
returned formatted_context against per-fixture expect_present and
expect_absent substring checklists. Exit 0 on all-pass, 1 on any
miss. Human-readable by default, --json for automation.
First live run against Dalidou at SHA 1161645: 4/6 pass. The two
failures are real findings, not harness bugs:
- p05-configuration FAIL: "GigaBIT M1" appears in the p05 pack.
Cross-project bleed from a shared p05 doc that legitimately
mentions the p04 mirror under test. Fixture kept strict so
future ranker tuning can close the gap.
- p05-vendor-signal FAIL: "Zygo" missing. The vendor memory exists
with confidence 0.9 but get_memories_for_context walks memories
in fixed order (effectively by updated_at / confidence), so lower-
ranked memories get pushed out of the per-project budget slice by
higher-confidence ones even when the query is specifically about
the lower-ranked content. Query-relevance ordering of memories is
the natural next fix.
Docs sync:
- master-plan-status.md: Phase 9 reflection entry now notes that
capture→reinforce runs automatically and project memories reach
the context pack, while extract remains batch/manual. First batch-
extract pass surfaced 1 candidate from 42 interactions — extractor
rule tuning is a known follow-up.
- next-steps.md: the 2026-04-11 retrieval quality review entry now
shows the project-memory-band work as DONE, and a new
"Reflection Loop Live Check" subsection records the extractor-
coverage finding from the first batch run.
- Both files now agree with the code; follow-up reviewers
(Codex, future Claude) should no longer see narrative drift.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
deploy.sh sync-checkout was landing the file without an exec bit,
so the cron run hit 'Permission denied' until chmod +x was applied
manually on Dalidou. Persist the exec bit in the git index so
future deploys don't regress.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
At 0.15 the effective per-call allowance (450 - 55 wrapper) was 395
chars, which is just under the length of a real paragraph-length
project memory (~400 chars). Verified on live p04 probe: band was
still absent after the flat-budget fix because the first memory
entry was one character too long for the budget.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The per-type slicing (available // len(memory_types)) starved
paragraph-length memories: with 3 types and a 450-char budget,
each type got ~131 chars while real project memories are 300-500
chars each — every entry was skipped and the new Project Memories
band never appeared in the live pack.
Switch to a flat budget pool walked type-by-type in order. Short
identity/preference memories still get first pick when the budget
is tight, but long project memories can now compete for space.
Caught on the first post-deploy probe: 2 active p04 memories
existed but none landed in formatted_context.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The retrieval-quality review on 2026-04-11 found that active
project/knowledge/episodic memories never reached the pack: only
Trusted Project State and identity/preference memories were being
assembled. Reinforcement bumped confidence on memories that had
no retrieval outlet, so the reflection loop was half-open.
This change adds a third memory tier between identity/preference
and retrieved chunks:
- PROJECT_MEMORY_BUDGET_RATIO = 0.15
- Memory types: project, knowledge, episodic
- Only populated when a canonical project is in scope — without
a project hint, project memories stay out (cross-project bleed
would rot the signal)
- Rendered under a dedicated "--- Project Memories ---" header
so the LLM can distinguish it from the identity/preference band
- Trim order in _trim_context_to_budget: retrieval → project
memories → identity/preference → project state (most recently
added tier drops first when budget is tight)
get_memories_for_context gains header/footer kwargs so the two
memory blocks can be distinguished in a single pack without a
second helper.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Reinforcement matcher now handles paragraph-length memories via a
dual-mode threshold: short memories keep the 70% overlap rule,
long memories (>15 stems) require 12 absolute overlaps AND 35%
fraction so organic paraphrase can still reinforce. Diagnosis:
every active memory stayed at reference_count=0 because 40-token
project summaries never hit 70% overlap on real responses.
- scripts/atocore_client.py gains batch-extract (fan out
/interactions/{id}/extract over recent interactions) and triage
(interactive promote/reject walker for the candidate queue),
matching the Phase 9 reflection-loop review flow without pulling
extraction into the capture hot path.
- deploy/dalidou/cron-backup.sh adds an optional off-host rsync step
gated on ATOCORE_BACKUP_RSYNC, fail-open when the target is offline
so a laptop being off at 03:00 UTC never reds the local backup.
- docs/next-steps.md records the retrieval-quality sweep: project
state surfaces, chunks are on-topic but broad, active memories
never reach the pack (reflection loop has no retrieval outlet yet).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Stop hook now sends reinforce=true so the token-overlap matcher
runs on every captured interaction. Memory confidence will accumulate
signal from organic Claude Code use.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- POST /admin/backup/cleanup — retention cleanup via API (dry-run by default)
- record_interaction() accepts extract=True to auto-extract candidate
memories from response text using the Phase 9C rule-based extractor
- POST /interactions accepts extract field to enable extraction on capture
- deploy/dalidou/cron-backup.sh — daily backup + cleanup for cron
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- create_runtime_backup() now auto-validates its output and includes
validated/validation_errors fields in returned metadata
- New cleanup_old_backups() with retention policy: 7 daily, 4 weekly
(Sundays), 6 monthly (1st of month), dry-run by default
- CLI `cleanup` subcommand added to backup module
- 9 new tests (2 validation + 7 retention), 259 total passing
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the substring-based _memory_matches() with a token-overlap
matcher that tokenizes both memory content and response, applies
lightweight stemming (trailing s/ed/ing) and stop-word removal, then
checks whether >= 70% of the memory's tokens appear in the response.
This fixes the paraphrase blindness that prevented reinforcement from
ever firing on natural responses ("prefers" vs "prefer", "because
history" vs "because the history").
7 new tests (26 total reinforcement tests, all passing).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Claude Code Stop hook sends `last_assistant_message`, not
`assistant_message`. This was causing response_chars=0 on all
captured interactions. Also removes the temporary debug log block.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add deploy/hooks/capture_stop.py — a Claude Code Stop hook that reads
the transcript JSONL, extracts the last user prompt, and POSTs to the
AtoCore /interactions endpoint in conservative mode (reinforce=false).
Conservative mode means: capture only, no automatic reinforcement or
extraction into the review queue. Kill switch: ATOCORE_CAPTURE_DISABLED=1.
Also: note build_sha cosmetic issue after restore in runbook, update
project status docs to reflect drill pass and auto-capture wiring.
17 new tests (243 total, all passing).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two fixes from the 2026-04-09 first real restore drill on Dalidou,
plus the long-overdue doc consolidation I should have done when I
added the drill runbook instead of creating a duplicate.
## Chroma restore bind-mount bug (drill finding)
src/atocore/ops/backup.py: restore_runtime_backup() used to call
shutil.rmtree(dst_chroma) before copying the snapshot back. In the
Dockerized Dalidou deployment the chroma dir is a bind-mounted
volume — you can't unlink a mount point, rmtree raises
OSError [Errno 16] Device or resource busy
and the restore silently fails to touch Chroma. This bit the first
real drill; the operator worked around it with --no-chroma plus a
manual cp -a.
Fix: clear the destination's CONTENTS (iterdir + rmtree/unlink per
child) and use copytree(dirs_exist_ok=True) so the mount point
itself is never touched. Equivalent semantics, bind-mount-safe.
Regression test:
tests/test_backup.py::test_restore_chroma_does_not_unlink_destination_directory
captures Path.stat().st_ino of the dest dir before and after
restore and asserts they match. That's the same invariant a
bind-mounted chroma dir enforces — if the inode changed, the
mount would have failed. 11/11 backup tests now pass.
## Doc consolidation
docs/backup-restore-drill.md existed as a duplicate of the
authoritative docs/backup-restore-procedure.md. When I added the
drill runbook in commit 3362080 I wrote it from scratch instead of
updating the existing procedure — bad doc hygiene on a project
that's literally about being a context engine.
- Deleted docs/backup-restore-drill.md
- Folded its contents into docs/backup-restore-procedure.md:
- Replaced the manual sudo cp restore sequence with the new
`python -m atocore.ops.backup restore <STAMP>
--confirm-service-stopped` CLI
- Added the one-shot docker compose run pattern for running
restore inside a container that reuses the live volume mounts
- Documented the --no-pre-snapshot / --no-chroma / --chroma flags
- New "Chroma restore and bind-mounted volumes" subsection
explaining the bug and the regression test that protects the fix
- New "Restore drill" subsection with three levels (unit tests,
module round-trip, live Dalidou drill) and the cadence list
- Failure-mode table gained four entries: restored_integrity_ok,
Device-or-resource-busy, drill marker still present,
chroma_snapshot_missing
- "Open follow-ups" struck the restore_runtime_backup item (done)
and added a "Done (historical)" note referencing 2026-04-09
- Quickstart cheat sheet now has a full drill one-liner using
memory_type=episodic (the 2026-04-09 drill found the runbook's
memory_type=note was invalid — the valid set is identity,
preference, project, episodic, knowledge, adaptation)
## Status doc sync
Long overdue — I've been landing code without updating the
project's narrative state docs.
docs/current-state.md:
- "Reliability Baseline" now reflects: restore_runtime_backup is
real with CLI, pre-restore safety snapshot, WAL cleanup,
integrity check; live drill on 2026-04-09 surfaced and fixed
Chroma bind-mount bug; deploy provenance via /health build_sha;
deploy.sh self-update re-exec guard
- "Immediate Next Focus" reshuffled: drill re-run (priority 1) and
auto-capture (priority 2) are now ahead of retrieval quality work,
reflecting the updated unblock sequence
docs/next-steps.md:
- New item 1: re-run the drill with chroma working end-to-end
- New item 2: auto-capture conservative mode (Stop hook)
- Old item 7 rewritten as item 9 listing what's DONE
(create/list/validate/restore, admin/backup endpoint with
include_chroma, /health provenance, self-update guard,
procedure doc with failure modes) and what's still pending
(retention cleanup, off-Dalidou target, auto-validation)
## Test count
226 passing (was 225 + 1 new inode-stability regression test).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Close the backup side of the loop: we had create/list/validate but
no restore, and no documented drill. A backup you've never restored
is not a backup. This lands the missing restore surface and the
procedure to exercise it before enabling any write-path automation
(auto-capture, automated ingestion, reinforcement sweeps).
Code — src/atocore/ops/backup.py:
- restore_runtime_backup(stamp, *, include_chroma, pre_restore_snapshot,
confirm_service_stopped) performs:
1. validate_backup() gate — refuse on any error
2. pre-restore safety snapshot of current state (reversibility anchor)
3. PRAGMA wal_checkpoint(TRUNCATE) on target db (flush + release
OS handles; Windows needs this after conn.backup() reads)
4. unlink stale -wal/-shm sidecars (tolerant to Windows lock races)
5. shutil.copy2 snapshot db over target
6. restore registry if snapshot captured one
7. restore Chroma tree if snapshot captured one and include_chroma
resolves to true (defaults to whether backup has Chroma)
8. PRAGMA integrity_check on restored db, report result
- Refuses without confirm_service_stopped=True to prevent hot-restore
into a running service (would corrupt SQLite state)
- Rewrote main() as argparse with 4 subcommands: create, list,
validate, restore. `python -m atocore.ops.backup restore STAMP
--confirm-service-stopped` is the drill CLI entry point, run via
`docker compose run --rm --entrypoint python atocore` so it reuses
the live service's volume mounts
Tests — tests/test_backup.py (6 new):
- test_restore_refuses_without_confirm_service_stopped
- test_restore_raises_on_invalid_backup
- test_restore_round_trip_reverses_post_backup_mutations
(canonical drill flow: seed -> backup -> mutate -> restore ->
mutation gone + baseline survived + pre-restore snapshot has
the mutation captured as rollback anchor)
- test_restore_round_trip_with_chroma
- test_restore_skips_pre_snapshot_when_requested
- test_restore_cleans_stale_wal_sidecars (asserts stale byte
markers do not survive, not file existence, since PRAGMA
integrity_check may legitimately recreate -wal)
Docs — docs/backup-restore-drill.md (new):
- What gets backed up (hot sqlite, cold chroma, registry JSON,
metadata.json) and what doesn't (.env, source content)
- What restore does, step by step, and why confirm_service_stopped
is a hard gate
- 8-step drill procedure: capture -> baseline -> mutate -> stop ->
restore -> start -> verify marker gone -> optional cleanup
- Correct endpoint bodies verified against routes.py:
POST /admin/backup with JSON body {"include_chroma": true}
POST /memory with memory_type/content/project/confidence
GET /memory?project=drill to list drill markers
POST /query with {"prompt": ..., "top_k": ...} (not "query")
- Failure modes: integrity_check fail, container won't start,
marker still present after restore, with remediation for each
- When to run: before new write-path automation, after backup.py
or schema changes, after infra bumps, monthly as standing check
225/225 tests passing (219 existing + 6 new restore).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When deploy.sh itself changes in the commit being pulled, the bash
process is still running the OLD script from memory — git reset --hard
updated the file on disk but the in-memory instructions are stale.
This bit the 2026-04-09 Dalidou deploy: the old pre-build-sha Step 2
ran against fresh source, so the container started with
ATOCORE_BUILD_SHA="unknown" instead of the real commit. Manual
re-run fixed it, but the class of bug will re-emerge every time
deploy.sh itself changes.
Fix (Step 1.5):
- After git reset --hard, sha1 the running script ($0) and the
on-disk copy at $APP_DIR/deploy/dalidou/deploy.sh
- If they differ, export ATOCORE_DEPLOY_REEXECED=1 and exec into
the fresh copy so Step 2 onward runs under the new script
- The sentinel env var prevents recursion
- Skipped in dry-run mode, when $0 isn't readable, or when the
on-disk script doesn't exist yet
Docs (docs/dalidou-deployment.md):
- New "The deploy.sh self-update race" troubleshooting section
explaining the root cause, the Step 1.5 mechanism, what the log
output looks like, and how to opt out
Verified syntax and dry-run. 219/219 tests still passing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Make /health report the precise git SHA the container was built from,
so 'is the live service current?' can be answered without ambiguity.
0.2.0 was too coarse to trust as a 'live is current' signal — many
commits share the same __version__.
Three layers:
1. /health endpoint (src/atocore/api/routes.py)
- Reads ATOCORE_BUILD_SHA, ATOCORE_BUILD_TIME, ATOCORE_BUILD_BRANCH
from environment, defaults to 'unknown'
- Reports them alongside existing code_version field
2. docker-compose.yml
- Forwards the three env vars from the host into the container
- Defaults to 'unknown' so direct `docker compose up` runs (without
deploy.sh) cleanly signal missing build provenance
3. deploy.sh
- Step 2 captures git SHA + UTC timestamp + branch and exports them
as env vars before `docker compose up -d --build`
- Step 6 reads /health post-deploy and compares the reported
build_sha against the freshly-built one. Mismatch exits non-zero
(exit code 6) with a remediation hint covering cached image,
env propagation, and concurrent restart cases
Tests (tests/test_api_storage.py):
- test_health_endpoint_reports_code_version_from_module
- test_health_endpoint_reports_build_metadata_from_env
- test_health_endpoint_reports_unknown_when_build_env_unset
Docs (docs/dalidou-deployment.md):
- Three-level drift detection table (code_version coarse,
build_sha precise, build_time/branch forensic)
- Canonical drift check script using LIVE_SHA vs EXPECTED_SHA
- Note that running deploy.sh is itself the simplest drift check
219/219 tests passing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Dalidou Claude's second re-deploy (commit b492f5f) reported one
remaining friction point: the app dir was root-owned from the
previous manual-workaround deploy (when ALTER TABLE was run as
root to work around the schema init bug), so deploy.sh's git
fetch/reset hit a permission wall. They worked around it with
a one-shot docker run chown, but the script itself produced
cryptic git errors before that, so the fix wasn't obvious until
after the fact.
This commit adds a permission pre-flight check that runs BEFORE
any git operations and exits cleanly with an explicit remediation
message instead of letting git produce half-state on partial
failure.
The check:
1. Reads the current owner of the app dir via `stat -c '%U:%G'`
2. Reports the current user via `id -un` / `id -u:id -g`
3. Attempts to create a throwaway marker file in the app dir
4. If the marker write fails, prints three distinct remediation
commands covering the common environments:
a. sudo chown -R 1000:1000 $APP_DIR (if passwordless sudo)
b. sudo bash $0 (if running deploy.sh itself as root works)
c. docker run --rm -v $APP_DIR:/app alpine chown -R ...
(what Dalidou Claude actually did on 2026-04-08)
5. Exits with code 5 so CI / automation can distinguish "no
permission" from other deploy failures
Dry-run mode skips the check (nothing is mutated in dry-run).
A brief WARNING is also printed early if the app dir exists but
doesn't appear writable, before the fatal check — this gives
operators a heads-up even in the happy-path case.
Syntax check: bash -n passes.
Full suite: 216 passing (unchanged; no code changes to the app).
What this commit does NOT do
----------------------------
- Does NOT automatically fix permissions. chown needs root and
we don't want deploy.sh to escalate silently. The operator
runs one of the three remediation commands manually.
- Does NOT check permissions on nested files (like .git/config)
individually. The marker-file test on the app dir root is the
cheapest proxy that catches the common case (root-owned dir
tree after a previous sudo-based operation).
- Does NOT change behavior on first-time deploys where the app
dir doesn't exist yet. The check is gated on `-d $APP_DIR`.
Three issues Dalidou Claude surfaced during the first real deploy
of commit e877e5b to the live service (report from 2026-04-08).
Bug 1 was the critical one — a schema init ordering bug that would
have bitten every future upgrade from a pre-Phase-9 schema — and
the other two were usability traps around hostname resolution.
Bug 1 (CRITICAL): schema init ordering
--------------------------------------
src/atocore/models/database.py
SCHEMA_SQL contained CREATE INDEX statements that referenced
columns added later by _apply_migrations():
CREATE INDEX IF NOT EXISTS idx_memories_project ON memories(project);
CREATE INDEX IF NOT EXISTS idx_interactions_project_name ON interactions(project);
CREATE INDEX IF NOT EXISTS idx_interactions_session ON interactions(session_id);
On a FRESH install, CREATE TABLE IF NOT EXISTS creates the tables
with the Phase 9 shape (columns present), so the CREATE INDEX runs
cleanly and _apply_migrations is effectively a no-op.
On an UPGRADE from a pre-Phase-9 schema, CREATE TABLE IF NOT EXISTS
is a no-op (the tables already exist in the old shape), the columns
are NOT added yet, and the CREATE INDEX fails with
"OperationalError: no such column: project" before
_apply_migrations gets a chance to add the columns.
Dalidou Claude hit this exactly when redeploying from 0.1.0 to
0.2.0 — had to manually ALTER TABLE to add the Phase 9 columns
before the container could start.
The fix is to remove the Phase 9-column indexes from SCHEMA_SQL.
They already exist in _apply_migrations() AFTER the corresponding
ALTER TABLE, so they still get created on both fresh and upgrade
paths — just after the columns exist, not before.
Indexes still in SCHEMA_SQL (all safe — reference columns that
have existed since the first release):
- idx_chunks_document on source_chunks(document_id)
- idx_memories_type on memories(memory_type)
- idx_memories_status on memories(status)
- idx_interactions_project on interactions(project_id)
Indexes moved to _apply_migrations (already there — just no longer
duplicated in SCHEMA_SQL):
- idx_memories_project on memories(project)
- idx_interactions_project_name on interactions(project)
- idx_interactions_session on interactions(session_id)
- idx_interactions_created_at on interactions(created_at)
Regression test: tests/test_database.py
---------------------------------------
New test_init_db_upgrades_pre_phase9_schema_without_failing:
- Seeds the DB with the exact pre-Phase-9 shape (no project /
last_referenced_at / reference_count on memories; no project /
client / session_id / response / memories_used / chunks_used on
interactions)
- Calls init_db() — which used to raise OperationalError before
the fix
- Verifies all Phase 9 columns are present after the call
- Verifies the migration indexes exist
Before the fix this test would have failed with
"OperationalError: no such column: project" on the init_db call.
After the fix it passes. This locks the invariant "init_db is
safe on any legacy schema shape" so the bug can't silently come
back.
Full suite: 216 passing (was 215), 1 warning. The +1 is the new
regression test.
Bug 3 (usability): deploy.sh DNS default
----------------------------------------
deploy/dalidou/deploy.sh
ATOCORE_GIT_REMOTE defaulted to http://dalidou:3000/Antoine/ATOCore.git
which requires the "dalidou" hostname to resolve. On the Dalidou
host itself it didn't (no /etc/hosts entry for localhost alias),
so deploy.sh had to be run with the IP as a manual workaround.
Fix: default ATOCORE_GIT_REMOTE to http://127.0.0.1:3000/Antoine/ATOCore.git.
Loopback always works on the host running the script. Callers
from a remote host (e.g. running deploy.sh from a laptop against
the Dalidou LAN) set ATOCORE_GIT_REMOTE explicitly. The script
header's Environment Variables section documents this with an
explicit reference to the 2026-04-08 Dalidou deploy report so the
rationale isn't lost.
docs/dalidou-deployment.md gets a new "Troubleshooting hostname
resolution" subsection and a new example invocation showing how
to deploy from a remote host with an explicit ATOCORE_GIT_REMOTE
override.
Bug 2 (usability): atocore_client.py ATOCORE_BASE_URL documentation
-------------------------------------------------------------------
scripts/atocore_client.py
Same class of issue as bug 3. BASE_URL defaults to
http://dalidou:8100 which resolves fine from a remote caller
(laptop, T420/OpenClaw over Tailscale) but NOT from the Dalidou
host itself or from inside the atocore container. Dalidou Claude
saw the CLI return
{"status": "unavailable", "fail_open": true}
while direct curl to http://127.0.0.1:8100 worked.
The fix here is NOT to change the default (remote callers are
the common case and would break) but to DOCUMENT the override
clearly so the next operator knows what's happening:
- The script module docstring grew a new "Environment variables"
section covering ATOCORE_BASE_URL, ATOCORE_TIMEOUT_SECONDS,
ATOCORE_REFRESH_TIMEOUT_SECONDS, and ATOCORE_FAIL_OPEN, with
the explicit override example for on-host/in-container use
- It calls out the exact symptom (fail-open envelope when the
base URL doesn't resolve) so the diagnosis is obvious from
the error alone
- docs/dalidou-deployment.md troubleshooting section mirrors
this guidance so there's one place to look regardless of
whether the operator starts with the client help or the
deploy doc
What this commit does NOT do
----------------------------
- Does NOT change the default ATOCORE_BASE_URL. Doing that would
break the T420 OpenClaw helper and every remote caller who
currently relies on the hostname. Documentation is the right
fix for this case.
- Does NOT fix /etc/hosts on Dalidou. That's a host-level
configuration issue that the user can fix if they prefer
having the hostname resolve; the deploy.sh fix makes it
unnecessary regardless.
- Does NOT re-run the validation on Dalidou. The next step is
for the live service to pull this commit via deploy.sh (which
should now work without the IP workaround) and re-run the
Phase 9 loop test to confirm nothing regressed.
Dalidou Claude's validation run against the live service exposed a
structural gap: the deployment at /srv/storage/atocore/app has no
git connection, the running container was built from pre-Phase-9
source, and /health hardcoded 'version: 0.1.0' so drift is
invisible. Weeks of work have been shipping to Gitea but never
reaching the live service.
This commit fixes both the drift-invisibility problem and the
absence of an update workflow, so the next deploy to Dalidou can
go live cleanly and future drifts surface immediately.
Layer 1: deployment drift is now visible via /health
----------------------------------------------------
- src/atocore/__init__.py: __version__ bumped from 0.1.0 to 0.2.0
and documented as the source of truth for the deployed code
version, with a history block explaining when each bump happens
(API surface change, schema change, user-visible behavior change)
- src/atocore/main.py: FastAPI constructor now uses __version__
instead of the hardcoded '0.1.0' string, so the OpenAPI docs
reflect the actual code version
- src/atocore/api/routes.py: /health now reads from __version__
dynamically. Both the existing 'version' field and a new
'code_version' field report the same value for backwards compat.
A new docstring explains that comparing this to the main
branch's __version__ is the fastest way to detect drift.
- pyproject.toml: version bumped to 0.2.0 to stay in sync
The comparison is now:
curl /health -> "code_version": "0.2.0"
grep __version__ src/atocore/__init__.py -> "0.2.0"
If those differ, the deployment is stale. Concrete, unambiguous.
Layer 2: deploy.sh as the canonical update path
-----------------------------------------------
New file: deploy/dalidou/deploy.sh
One-shot bash script that handles both the first-time deploy
(where /srv/storage/atocore/app may not be a git repo yet) and
the ongoing update case. Steps:
1. If app dir is not a git checkout, back it up as
<dir>.pre-git-<utc-stamp> and re-clone from Gitea.
If it IS a checkout, fetch + reset --hard origin/<branch>.
2. Report the deployable commit SHA
3. Check that deploy/dalidou/.env exists (hard fail if missing
with a clear message pointing at .env.example)
4. docker compose up -d --build — rebuilds the image from
current source, restarts the container
5. Poll /health for up to 30 seconds; on failure, print the
last 50 lines of container logs and exit non-zero
6. Parse /health.code_version and compare to the __version__
in the freshly-pulled source. If they differ, exit non-zero
with a message suggesting docker compose down && up
7. On success, report commit + code_version + "health: ok"
Configurable via env vars:
- ATOCORE_APP_DIR (default /srv/storage/atocore/app)
- ATOCORE_GIT_REMOTE (default http://dalidou:3000/Antoine/ATOCore.git)
- ATOCORE_BRANCH (default main)
- ATOCORE_HEALTH_URL (default http://127.0.0.1:8100/health)
- ATOCORE_DEPLOY_DRY_RUN=1 for preview-only mode
Explicit non-goals documented in the script header:
- does not manage secrets (.env is the caller's responsibility)
- does not take a pre-deploy backup (call /admin/backup first
if you want one)
- does not roll back on failure (redeploy a known-good commit
to recover)
- does not touch the DB directly — schema migrations run at
service startup via the lifespan handler, and all existing
_apply_migrations ALTERs are idempotent ADD COLUMN operations
Layer 3: updated docs/dalidou-deployment.md
-------------------------------------------
- First-time deployment steps now explicitly say "git clone", not
"place the repository", so future first-time deploys don't end
up as static snapshots again
- New "Updating a running deployment" section covering deploy.sh
usage with all three modes (normal / branch override / dry-run)
- New "Deployment drift detection" section with the one-liner
comparison between /health code_version and the repo's
__version__
- New "Schema migrations on redeploy" section enumerating the
exact ALTER TABLE statements that run on a pre-0.2.0 -> 0.2.0
upgrade, confirming they are additive-only and safe, and
recommending a backup via /admin/backup before any redeploy
Full suite: 215 passing, 1 warning. No test was hardcoded to the
old version string, so the version bump was safe without test
changes.
What this commit does NOT do
----------------------------
- Does NOT execute the deploy on the live Dalidou instance. That
requires Dalidou access and is the next step. A ready-to-paste
prompt for Dalidou Claude will be provided separately.
- Does NOT add CI/CD, webhook-based auto-deploy, or reverse
proxy. Those remain in the 'deferred' section of the
deployment doc.
- Does NOT change the Dockerfile. The existing 'COPY source at
build time' pattern is what deploy.sh relies on — rebuilding
the image picks up new code.
- Does NOT modify the database schema. The Phase 9 migrations
that Dalidou's DB needs will be applied automatically on next
service startup via the existing _apply_migrations path.
Codex's sequence step 3: finish the Phase 9 operator surface in the
shared client. The previous client version (0.1.0) covered stable
operations (project lifecycle, retrieval, context build, trusted
state, audit-query) but explicitly deferred capture/extract/queue/
promote/reject pending "exercised workflow". That deferral ran
into a bootstrap problem: real Claude Code sessions can't exercise
the Phase 9 loop without a usable client surface to drive it. This
commit ships the 8 missing subcommands so the next step (real
validation on Dalidou) is unblocked.
Bumps CLIENT_VERSION from 0.1.0 to 0.2.0 per the semver rules in
llm-client-integration.md (new subcommands = minor bump).
New subcommands in scripts/atocore_client.py
--------------------------------------------
| Subcommand | Endpoint |
|-----------------------|-------------------------------------------|
| capture | POST /interactions |
| extract | POST /interactions/{id}/extract |
| reinforce-interaction | POST /interactions/{id}/reinforce |
| list-interactions | GET /interactions |
| get-interaction | GET /interactions/{id} |
| queue | GET /memory?status=candidate |
| promote | POST /memory/{id}/promote |
| reject | POST /memory/{id}/reject |
Each follows the existing client style: positional arguments with
empty-string defaults for optional filters, truthy-string arguments
for booleans (matching the existing refresh-project pattern), JSON
output via print_json(), fail-open behavior inherited from
request().
capture accepts prompt + response + project + client + session_id +
reinforce as positionals, defaulting the client field to
"atocore-client" when omitted so every capture from the shared
client is identifiable in the interactions audit trail.
extract defaults to preview mode (persist=false). Pass "true" as
the second positional to create candidate memories.
list-interactions and queue build URL query strings with
url-encoded values and always include the limit, matching how the
existing context-build subcommand handles its parameters.
Security fix: ID-field URL encoding
-----------------------------------
The initial draft used urllib.parse.quote() with the default safe
set, which does NOT encode "/" because it's a reserved path
character. That's a security footgun on ID fields: passing
"promote mem/evil/action" would build /memory/mem/evil/action/promote
and hit a completely different endpoint than intended.
Fixed by passing safe="" to urllib.parse.quote() on every ID field
(interaction_id and memory_id). The tests cover this explicitly via
test_extract_url_encodes_interaction_id and test_promote_url_encodes_memory_id,
both of which would have failed with the default behavior.
Project names keep the default quote behavior because a project
name with a slash would already be broken elsewhere in the system
(ingest root resolution, file paths, etc).
tests/test_atocore_client.py (new, 18 tests, all green)
-------------------------------------------------------
A dedicated test file for the shared client that mocks the
request() helper and verifies each subcommand:
- calls the correct HTTP method and path
- builds the correct JSON body (or query string)
- passes the right subset of CLI arguments through
- URL-encodes ID fields so path traversal isn't possible
Tests are structured as unit tests (not integration tests) because
the API surface on the server side already has its own route tests
in test_api_storage.py and the Phase 9 specific files. These tests
are the wiring contract between CLI args and HTTP calls.
Test file highlights:
- capture: default values, custom client, reinforce=false
- extract: preview by default, persist=true opt-in, URL encoding
- reinforce-interaction: correct path construction
- list-interactions: no filters, single filter, full filter set
(including ISO 8601 since parameter with T separator and Z)
- get-interaction: fetch by id
- queue: always filters status=candidate, accepts memory_type
and project, coerces limit to int
- promote / reject: correct path + URL encoding
- test_phase9_full_loop_via_client_shape: end-to-end sequence
that drives capture -> extract preview -> extract persist ->
queue list -> promote -> reject through the shared client and
verifies the exact sequence of HTTP calls that would be made
These tests run in ~0.2s because they mock request() — no DB, no
Chroma, no HTTP. The fast feedback loop matters because the
client surface is what every agent integration eventually depends
on.
docs/architecture/llm-client-integration.md updates
---------------------------------------------------
- New "Phase 9 reflection loop (shipped after migration safety
work)" section under "What's in scope for the shared client
today" with the full 8-subcommand table and a note explaining
the bootstrap-problem rationale
- Removed the "Memory review queue and reflection loop" section
from "What's intentionally NOT in scope today"; backup admin
and engineering-entity commands remain the only deferred
families
- Renumbered the deferred-commands list (was 3 items, now 2)
- Open follow-ups updated: memory-review-subcommand item replaced
with "real-usage validation of the Phase 9 loop" as the next
concrete dependency
- TL;DR updated to list the reflection-loop subcommands
- Versioning note records the v0.1.0 -> v0.2.0 bump with the
subcommands included
Full suite: 215 passing (was 197), 1 warning. The +18 is
tests/test_atocore_client.py. Runtime unchanged because the new
tests don't touch the DB.
What this commit does NOT do
----------------------------
- Does NOT change the server-side endpoints. All 8 subcommands
call existing API routes that were shipped in Phase 9 Commits
A/B/C. This is purely a client-side wiring commit.
- Does NOT run the reflection loop against the live Dalidou
instance. That's the next concrete step and is explicitly
called out in the open-follow-ups section of the updated doc.
- Does NOT modify the Claude Code slash command. It still pulls
context only; the capture/extract/queue/promote companion
commands (e.g. /atocore-record-response) are deferred until the
capture workflow has been exercised in real use at least once.
- Does NOT refactor the OpenClaw helper. That's a cross-repo
change and remains a queued follow-up, now unblocked by the
shared client having the reflection-loop subcommands.
Codex caught a real data-loss bug in the legacy alias migration
shipped in 7e60f5a. plan_state_migration filtered state rows to
status='active' only, then apply_plan deleted the shadow projects
row at the end. Because project_state.project_id has
ON DELETE CASCADE, any superseded or invalid state rows still
attached to the shadow project got silently cascade-deleted —
exactly the audit loss a cleanup migration must not cause.
This commit fixes the bug and adds regression tests that lock in
the invariant "shadow state of every status is accounted for".
Root cause
----------
scripts/migrate_legacy_aliases.py::plan_state_migration was:
"SELECT * FROM project_state WHERE project_id = ? AND status = 'active'"
which only found live rows. Any historical row (status in
'superseded' or 'invalid') was invisible to the plan, so the apply
step had nothing to rekey for it. Then the shadow project row was
deleted at the end, cascade-deleting every unplanned row.
The fix
-------
plan_state_migration now selects ALL state rows attached to the
shadow project regardless of status, and handles every row per a
per-status decision table:
| Shadow status | Canonical at same triple? | Values | Action |
|---------------|---------------------------|------------|--------------------------------|
| any | no | — | clean rekey |
| any | yes | same | shadow superseded in place |
| active | yes, active | different | COLLISION, apply refuses |
| active | yes, inactive | different | shadow wins, canonical deleted |
| inactive | yes, any | different | historical drop (logged) |
Four changes in the script:
1. SELECT drops the status filter so the plan walks every row.
2. New StateRekeyPlan.historical_drops list captures the shadow
rows that lose to a canonical row at the same triple because the
shadow is already inactive. These are the only unavoidable data
losses, and they happen because the UNIQUE(project_id, category,
key) constraint on project_state doesn't allow two rows per
triple regardless of status.
3. New apply action 'replace_inactive_canonical' for the
shadow-active-vs-canonical-inactive case. At apply time the
canonical inactive row is DELETEd first (SQLite's default
immediate constraint checking) and then the shadow is UPDATEd
into its place in two separate statements. Adds a new
state_rows_replaced_inactive_canonical counter.
4. New apply counter state_rows_historical_dropped for audit
transparency. The rows themselves are still cascade-deleted
when the shadow project row is dropped, but they're counted
and reported.
Five places render_plan_text and plan_to_json_dict updated:
- counts() gains state_historical_drops
- render_plan_text prints a 'historical drops' section with each
shadow-canonical pair and their statuses when there are any, so
the operator sees the audit loss BEFORE running --apply
- The new section explicitly tells the operator: "if any of these
values are worth keeping as separate audit records, manually copy
them out before running --apply"
- plan_to_json_dict carries historical_drops into the JSON report
- The state counts table in the human report now shows both
'state collisions (block)' and 'state historical drops' as
separate lines so the operator can distinguish
"apply will refuse" from "apply will drop historical rows"
Regression tests (3 new, all green)
-----------------------------------
tests/test_migrate_legacy_aliases.py:
- test_apply_preserves_superseded_shadow_state_when_no_collision:
the direct regression for the codex finding. Seeds a shadow with
a superseded state row on a triple the canonical doesn't have,
runs the migration, verifies via raw SQL that the row is now
attached to the canonical projects row and still has status
'superseded'. This is the test that would have failed before
the fix.
- test_apply_drops_shadow_inactive_row_when_canonical_holds_same_triple:
covers the unavoidable data-loss case. Seeds shadow superseded
+ canonical active at the same triple with different values,
verifies plan.counts() reports one historical_drop, runs apply,
verifies the canonical value is preserved and the shadow value
is gone.
- test_apply_replaces_inactive_canonical_with_active_shadow:
covers the cross-contamination case where shadow has live value
and canonical has a stale invalid row. Shadow wins by deleting
canonical and rekeying in its place. Verifies the counter and
the final state.
Plus _seed_state_row now accepts a status kwarg so the seeding
helper can create superseded/invalid rows directly.
test_dry_run_on_empty_registry_reports_empty_plan was updated to
include the new state_historical_drops key in the expected counts
dict (all zero for an empty plan, so the test shape is the same).
Full suite: 197 passing (was 194), 1 warning. The +3 is the three
new regression tests.
What this commit does NOT do
----------------------------
- Does NOT try to preserve historical shadow rows that collide
with a canonical row at the same triple. That would require a
schema change (adding (id) to the UNIQUE key, or a separate
history table) and isn't in scope for a cleanup migration.
The operator sees these as explicit 'historical drops' in the
plan output and can copy them out manually if any are worth
preserving.
- Does NOT change any behavior for rows that were already
reachable from the canonicalized read path. The fix only
affects legacy rows whose project_id points at a shadow row.
- Does NOT re-verify the earlier happy-path tests beyond the full
suite confirming them still green.
Closes the compatibility gap documented in
docs/architecture/project-identity-canonicalization.md. Before fb6298a,
writes to project_state, memories, and interactions stored the raw
project name. After fb6298a every service-layer entry point
canonicalizes through the registry, which silently made pre-fix
alias-keyed rows unreachable from the new read path. Now there's
a migration tool to find and fix them.
This commit is the tool and its tests. The tool is NOT run against
the live Dalidou DB in this commit — that's a separate supervised
manual step after reviewing the dry-run output.
scripts/migrate_legacy_aliases.py
---------------------------------
Standalone offline migration tool. Dry-run default, --apply explicit.
What it inspects:
- projects: rows whose name is a registered alias and differs from
the canonical project_id (shadow rows)
- project_state: rows whose project_id points at a shadow; plan
rekeys them to the canonical row's id. (category, key) collisions
against the canonical block the apply step until a human resolves
- memories: rows whose project column is a registered alias. Plain
string rekey. Dedup collisions (after rekey, same
(memory_type, content, project, status)) are handled by the
existing memory supersession model: newer row stays active, older
becomes superseded with updated_at as tiebreaker
- interactions: rows whose project column is a registered alias.
Plain string rekey, no collision handling
What it does NOT do:
- Never touches rows that are already canonical
- Never auto-resolves project_state collisions (refuses until the
human picks a winner via POST /project/state)
- Never creates data; only rekeys or supersedes
- Never runs outside a single SQLite transaction; any failure rolls
back the entire migration
Safety rails:
- Dry-run is default. --apply is explicit.
- Apply on empty plan refuses unless --allow-empty (prevents
accidental runs that look meaningful but did nothing)
- Apply refuses on any project_state collision
- Apply refuses on integrity errors (e.g. two case-variant rows
both matching the canonical lookup)
- Writes a JSON report to data/migrations/ on every run (dry-run
and apply alike) for audit
- Idempotent: running twice produces the same final state as
running once. The second run finds zero shadow rows and exits
clean.
CLI flags:
--registry PATH override ATOCORE_PROJECT_REGISTRY_PATH
--db PATH override the AtoCore SQLite DB path
--apply actually mutate (default is dry-run)
--allow-empty permit --apply on an empty plan
--report-dir PATH where to write the JSON report
--json emit the plan as JSON instead of human prose
Smoke test against the Phase 9 validation DB produces the expected
"Nothing to migrate. The database is clean." output with 4 known
canonical projects and 0 shadows.
tests/test_migrate_legacy_aliases.py
------------------------------------
19 new tests, all green:
Plan-building:
- test_dry_run_on_empty_registry_reports_empty_plan
- test_dry_run_on_clean_registered_db_reports_empty_plan
- test_dry_run_finds_shadow_project
- test_dry_run_plans_state_rekey_without_collisions
- test_dry_run_detects_state_collision
- test_dry_run_plans_memory_rekey_and_supersession
- test_dry_run_plans_interaction_rekey
Apply:
- test_apply_refuses_on_state_collision
- test_apply_migrates_clean_shadow_end_to_end (verifies get_state
can see the state via BOTH the alias AND the canonical after
migration)
- test_apply_drops_shadow_state_duplicate_without_collision
(same (category, key, value) on both sides - mark shadow
superseded, don't hit the UNIQUE constraint)
- test_apply_migrates_memories
- test_apply_migrates_interactions
- test_apply_is_idempotent
- test_apply_refuses_with_integrity_errors (uses case-variant
canonical rows to work around projects.name UNIQUE constraint;
verifies the case-insensitive duplicate detection works)
Reporting:
- test_plan_to_json_dict_is_serializable
- test_write_report_creates_file
- test_render_plan_text_on_empty_plan
- test_render_plan_text_on_collision
End-to-end gap closure (the most important test):
- test_legacy_alias_gap_is_closed_after_migration
- Seeds the exact same scenario as
test_legacy_alias_keyed_state_is_invisible_until_migrated
in test_project_state.py (which documents the pre-migration
gap)
- Confirms the row is invisible before migration
- Runs the migration
- Verifies the row is reachable via BOTH the canonical id AND
the alias afterward
- This test and the pre-migration gap test together lock in
"before migration: invisible, after migration: reachable"
as the documented invariant
Full suite: 194 passing (was 175), 1 warning. The +19 is the new
migration test file.
Next concrete step after this commit
------------------------------------
- Run the dry-run against the live Dalidou DB to find out the
actual blast radius. The script is the inspection SQL, codified.
- Review the dry-run output together
- If clean (zero shadows), no apply needed; close the doc gap as
"verified nothing to migrate on this deployment"
- If there are shadows, resolve any collisions via
POST /project/state, then run --apply under supervision
- After apply, the test_legacy_alias_keyed_state_is_invisible_until_migrated
test still passes (it simulates the gap directly, so it's
independent of the live DB state) and the gap-closed companion
test continues to guard forward
Codex caught a real documentation accuracy bug in the previous
canonicalization doc commit (f521aab). The doc claimed that rows
written under aliases before fb6298a "still work via the
unregistered-name fallback path" — that is wrong for REGISTERED
aliases, which is exactly the case that matters.
The unregistered-name fallback only saves you when the project was
never in the registry: a row stored under "orphan-project" is read
back via "orphan-project", both pass through resolve_project_name
unchanged, and the strings line up. For a registered alias like
"p05", the helper rewrites the read key to "p05-interferometer"
but does NOT rewrite the storage key, so the legacy row becomes
silently invisible.
This commit corrects the doc and locks the gap behavior in with
a regression test, so the issue cannot be lost again.
docs/architecture/project-identity-canonicalization.md
------------------------------------------------------
- Removed the misleading claim from the "What this rule does NOT
cover" section. Replaced with a pointer to the new gap section
and an explicit statement that the migration is required before
engineering V1 ships.
- New "Compatibility gap: legacy alias-keyed rows" section between
"Why this is the trust hierarchy in action" and "The rule for
new entry points". This is the natural insertion point because
the gap is exactly the trust hierarchy failing for legacy data.
The section covers:
* a worked T0/T1 timeline showing the exact failure mode
* what is at risk on the live Dalidou DB, ranked by trust tier:
projects table (shadow rows), project_state (highest risk
because Layer 3 is most-authoritative), memories, interactions
* inspection SQL queries for measuring the actual blast radius
on the live DB before running any migration
* the spec for the migration script: walk projects, find shadow
rows, merge dependent state via the conflict model when there
are collisions, dry-run mode, idempotent
* explicit statement that this is required pre-V1 because V1
will add new project-keyed tables and the killer correctness
queries from engineering-query-catalog.md would report wrong
results against any project that has shadow rows
- "Open follow-ups" item 1 promoted from "tracked optional" to
"REQUIRED before engineering V1 ships, NOT optional" with a
more honest cost estimate (~150 LOC migration + ~50 LOC tests
+ supervised live run, not the previous optimistic ~30 LOC)
- TL;DR rewritten to mention the gap explicitly and re-order
the open follow-ups so the migration is the top priority
tests/test_project_state.py
---------------------------
- New test_legacy_alias_keyed_state_is_invisible_until_migrated
- Inserts a "p05" project row + a project_state row pointing at
it via raw SQL (bypassing set_state which now canonicalizes),
simulating a pre-fix legacy row
- Verifies the canonicalized get_state path can NOT see the row
via either the alias or the canonical id — this is the bug
- Verifies the row is still in the database (just unreachable),
so the migration script has something to find
- The docstring explicitly says: "When the legacy alias migration
script lands, this test must be inverted." Future readers will
know exactly when and how to update it.
Full suite: 175 passing (was 174), 1 warning. The +1 is the new
gap regression test.
What this commit does NOT do
----------------------------
- The migration script itself is NOT in this commit. Codex's
finding was a doc accuracy issue, and the right scope is fix
the doc + lock the gap behavior in. Writing the migration is
the next concrete step but is bigger (~200 LOC + dry-run mode
+ collision handling via the conflict model + supervised run
on the live Dalidou DB), warrants its own commit, and probably
warrants a "draft + review the dry-run output before applying"
workflow rather than a single shot.
- Existing tests are unchanged. The new test stands alone as a
documented gap; the 12 canonicalization tests from fb6298a
still pass without modification.
Codifies the helper-at-every-service-boundary rule that fb6298a
implemented across the eight current callsites. The contract is
intentionally simple but easy to forget, so it lives in its own
doc that the engineering layer V1 implementation sprint can read
before adding new project-keyed entity surfaces.
docs/architecture/project-identity-canonicalization.md
------------------------------------------------------
- The contract: every read/write that takes a project name MUST
call resolve_project_name() before the value crosses a service
boundary; canonicalization happens once, at the first statement
after input validation, never later
- The helper API: resolve_project_name(name) returns the canonical
project_id for registered names, the input unchanged for empty
or unregistered names (the second case is the backwards-compat
path for hand-curated state predating the registry)
- Full table of the 8 current callsites: builder.build_context,
project_state.set_state/get_state/invalidate_state,
interactions.record_interaction/list_interactions,
memory.create_memory/get_memories
- Where the helper is intentionally NOT called and why: legacy
ensure_project lookup, retriever's own _project_match_boost
(which already calls get_registered_project), _rank_chunks
secondary substring boost (multiplicative not filter, can't
drop relevant chunks), update_memory (no project field update),
unregistered names (the rule applied to a name with no record)
- Why this is the trust hierarchy in action: Layer 3 trusted
state has to be findable to win the trust battle; an
un-canonicalized lookup silently makes Layer 3 invisible and
the system falls through to lower-trust retrieved chunks with
no signal to the human
- The 4-step rule for new entry points: identify project-keyed
reads/writes, place the call as the first statement after
validation, add a regression test using the project_registry
fixture, verify None/empty paths
- How the project_registry fixture works with a copy-pasteable
example
- What the rule does NOT cover: alias creation (registry's own
write path), registry hot-reloading (no in-process cache by
design), cross-project dedup (collision detection at
registration), time-bounded canonicalization (canonical id is
stable forever), legacy data migration (open follow-up)
- Engineering layer V1 implications: every new service entry
point in the entities/relationships/conflicts/mirror modules
must apply the helper at the first statement after validation;
treated as code review failure if missing
- Open follow-ups: legacy data migration script (~30 LOC),
registry file caching when projects scale beyond ~50, case
sensitivity audit when entity-side storage lands, _rank_chunks
cleanup, documentation discoverability (intentional redundancy
between this doc, the helper docstring, and per-callsite comments)
- Quick reference card: copy-pasteable template for new service
functions
master-plan-status.md updated
-----------------------------
- New doc added to the engineering-layer planning sprint listing
- Marked as required reading before V1 implementation begins
- Note that V1 must apply the contract at every new service-layer
entry point
Pure doc work, no code changes. Full suite stays at 174 passing
because no source changed.