# Phase 9 First Real Use Report

## What this is

The first empirical exercise of the Phase 9 reflection loop after Commits A, B, and C all landed. The goal is to find out where the extractor and the reinforcement matcher actually behave well versus where their behaviour drifts from the design intent.

The validation is reproducible. To re-run:

```bash
python scripts/phase9_first_real_use.py
```

This writes an isolated SQLite + Chroma store under `data/validation/phase9-first-use/` (gitignored), seeds three active memories, then runs eight sample interactions through the full capture → reinforce → extract pipeline.

## What we ran

Eight synthetic interactions, each paraphrased from a real working session about AtoCore itself or the active engineering projects:

| # | Label                                | Project              | Expected                  |
|---|--------------------------------------|----------------------|---------------------------|
| 1 | exdev-mount-merge-decision           | atocore              | 1 decision_heading        |
| 2 | ownership-was-the-real-fix           | atocore              | 1 fact_heading            |
| 3 | memory-vs-entity-canonical-home      | atocore              | 1 decision_heading (long) |
| 4 | auto-promotion-deferred              | atocore              | 1 decision_heading        |
| 5 | preference-rebase-workflow           | atocore              | 1 preference_sentence     |
| 6 | constraint-from-doc-cite             | p05-interferometer   | 1 constraint_heading      |
| 7 | prose-only-no-cues                   | atocore              | 0 candidates              |
| 8 | multiple-cues-in-one-interaction     | p06-polisher         | 3 distinct rules          |

Plus 3 seed memories were inserted before the run:

- `pref_rebase`: "prefers rebase-based workflows because history stays linear" (preference, 0.6)
- `pref_concise`: "writes commit messages focused on the why, not the what" (preference, 0.6)
- `identity_runs_atocore`: "mechanical engineer who runs AtoCore for context engineering" (identity, 0.9)

## What happened — extraction (the good news)

**Every extraction expectation was met exactly.** All eight samples produced the predicted candidate count and the predicted rule classifications:

| Sample                                | Expected | Got   | Pass |
|---------------------------------------|----------|-------|------|
| exdev-mount-merge-decision            | 1        | 1     | ✅   |
| ownership-was-the-real-fix            | 1        | 1     | ✅   |
| memory-vs-entity-canonical-home       | 1        | 1     | ✅   |
| auto-promotion-deferred               | 1        | 1     | ✅   |
| preference-rebase-workflow            | 1        | 1     | ✅   |
| constraint-from-doc-cite              | 1        | 1     | ✅   |
| prose-only-no-cues                    | **0**    | **0** | ✅   |
| multiple-cues-in-one-interaction      | 3        | 3     | ✅   |

**Total: 9 candidates from 8 interactions, 0 false positives, 0 misses on heading patterns or sentence patterns.** The extractor's strictness is well-tuned for the kinds of structural cues we actually use.

Things worth noting:

- **Sample 7 (`prose-only-no-cues`) produced zero candidates as designed.** This is the most important sanity check — it confirms the extractor won't fill the review queue with general prose when there's no structural intent.
- **Sample 3's long content was preserved without truncation.** The 280-char max wasn't hit, and the content kept its full meaning.
- **Sample 8 produced three distinct rules in one interaction** (decision_heading, constraint_heading, requirement_heading) without the dedup key collapsing them. The dedup key is `(memory_type, normalized_content, rule)`, and the three are all different on at least one axis, so they coexist as expected.
- **The prose around each heading was correctly ignored.** Sample 6 has a second sentence ("the error budget allocates 6 nm to the laser source...") that does NOT have a structural cue, and the extractor correctly didn't fire on it.
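To make that strictness concrete, here is a minimal, hypothetical sketch of the heading-cue idea. The rule names come from the run above, but the regex patterns and the `fired_rules` helper are illustrative assumptions, not the extractor's actual implementation:

```python
import re

# Hypothetical illustration of heading-cue rules (the real V0 rules live in
# the extractor module and their exact patterns may differ): each rule pairs
# a name with a regex that only fires on an explicit structural marker,
# never on plain prose.
RULES = [
    ("decision_heading",    re.compile(r"^#+\s*decision[:\s]", re.I | re.M)),
    ("constraint_heading",  re.compile(r"^#+\s*constraint[:\s]", re.I | re.M)),
    ("requirement_heading", re.compile(r"^#+\s*requirement[:\s]", re.I | re.M)),
]

def fired_rules(response: str) -> list[str]:
    """Return the names of rules whose structural cue appears in the response."""
    return [name for name, pattern in RULES if pattern.search(response)]

# A response with an explicit heading fires; plain prose does not.
print(fired_rules("## Decision: merge via copy when EXDEV is raised"))  # ['decision_heading']
print(fired_rules("We talked about mounts for a while and moved on."))  # []
```

The sample-7 behaviour falls out of this shape for free: with no structural marker, no rule can fire, so nothing reaches the review queue.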
## What happened — reinforcement (the empirical finding)

**Reinforcement matched zero seeded memories across all 8 samples, even when the response clearly echoed the seed.**

Sample 5's response was:

> *"I prefer rebase-based workflows because the history stays linear and reviewers have an easier time."*

The seeded `pref_rebase` memory was:

> *"prefers rebase-based workflows because history stays linear"*

A human reading both says these are the same fact. The reinforcement matcher disagrees. After all 8 interactions:

```
pref_rebase:            confidence=0.6000  refs=0  last=-
pref_concise:           confidence=0.6000  refs=0  last=-
identity_runs_atocore:  confidence=0.9000  refs=0  last=-
```

**Nothing moved.** This is the most important finding from this validation pass.

### Why the matcher missed it

The current `_memory_matches` rule (in `src/atocore/memory/reinforcement.py`) does a normalized substring match: it lowercases both sides, collapses whitespace, then asks "does the leading 80-char window of the memory content appear as a substring in the response?"

For the rebase example:

- needle (normalized): `prefers rebase-based workflows because history stays linear`
- haystack (normalized): `i prefer rebase-based workflows because the history stays linear and reviewers have an easier time.`

The needle starts with `prefers` (with the trailing `s`), and the haystack has `prefer` (without the `s`, because of the first-person voice). And the needle has `because history stays linear`, while the haystack has `because the history stays linear`. **Two small natural paraphrases, and the substring fails.**

This isn't a bug in the matcher's implementation — it's doing exactly what it was specified to do. It's a design limitation: the substring rule is too brittle for real prose, where the same fact gets re-stated with different verb forms, articles, and word order.

### Severity

**Medium-high.** Reinforcement is the entire point of Commit B.
A reinforcement matcher that never fires on natural paraphrases will leave seeded memories with stale confidence forever. The reflection loop runs, but it doesn't actually reinforce anything. That hollows out the value of having reinforcement at all.

It is not a critical bug because:

- Nothing breaks. The pipeline still runs cleanly.
- Reinforcement is supposed to be a *signal*, not the only path to high confidence — humans can still curate confidence directly.
- The candidate-extraction path (Commit C) is unaffected and works perfectly.

But it does need to be addressed before Phase 9 can be considered operationally complete.

## Recommended fix (deferred to a follow-up commit)

Replace the substring matcher with a token-overlap matcher. The specification:

1. Tokenize both memory content and response into lowercase words of length >= 3, dropping a small stop list (`the`, `a`, `an`, `and`, `or`, `of`, `to`, `is`, `was`, `that`, `this`, `with`, `for`, `from`, `into`).
2. Stem aggressively (or at minimum, fold trailing `s` and `ed` so `prefers`/`prefer`/`preferred` collapse to one token).
3. A match exists if **at least 70% of the memory's content tokens** appear in the response token set.
4. Memory content must still be at least `_MIN_MEMORY_CONTENT_LENGTH` characters to be considered.

This is more permissive than the substring rule but still tight enough to avoid spurious matches on generic words. It would have caught the rebase example because:

- memory tokens (after stop-list and stemming): `{prefer, rebase-bas, workflow, because, history, stay, linear}`
- response tokens: `{prefer, rebase-bas, workflow, because, history, stay, linear, reviewer, easi, time}`
- overlap: 7 / 7 memory tokens = 100%, above the 70% threshold → match

### Why not fix it in this report

Three reasons:

1. The validation report is supposed to be evidence, not a fix spec. A separate commit will introduce the new matcher with its own tests.
2. The token-overlap matcher needs its own design review for edge cases (very long memories, very short responses, technical abbreviations, code snippets in responses).
3. Mixing the report and the fix into one commit would muddle the audit trail. The report is the empirical evidence; the fix is the response.

The fix is queued as the next Phase 9 maintenance commit and is flagged in the next-steps section below.

## Other observations

### Extraction is conservative on purpose, and that's working

Sample 7 is the most important data point in the whole run. A natural prose response with no structural cues produced zero candidates. **This is exactly the design intent** — the extractor should be loud about explicit decisions/constraints/requirements and quiet about everything else. If the extractor were too loose, the review queue would fill up with low-value items and the human would stop reviewing.

After this run I have measurably more confidence that the V0 rule set is the right starting point. Future rules can be added one at a time as we see specific patterns the extractor misses, instead of guessing at what might be useful.

### Confidence on candidates

All extracted candidates landed at the default `confidence=0.5`, which is what the extractor is currently hardcoded to do. The `promotion-rules.md` doc proposes a per-rule prior with a structural-signal multiplier and freshness bonus. None of that is implemented yet. The validation didn't reveal any urgency around this — humans review the candidates either way — but it confirms that the priors-and-multipliers refinement is a reasonable next step rather than a critical one.

### Multiple cues in one interaction

Sample 8 confirmed an important property: **three structural cues in the same response do not collide in dedup**. The dedup key is `(memory_type, normalized_content, rule)`, and since each cue produced a distinct (type, content, rule) tuple, all three landed cleanly.
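A minimal sketch of that dedup behaviour, assuming a `normalize` helper and candidate tuples invented for illustration; only the key shape `(memory_type, normalized_content, rule)` comes from the actual code:

```python
# Sketch of the dedup key described above. The candidate tuples and the
# normalize() helper are illustrative assumptions; the real extractor builds
# the key as (memory_type, normalized_content, rule).
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace, mirroring the normalization step."""
    return " ".join(text.lower().split())

def dedup(candidates):
    """Keep one candidate per (memory_type, normalized_content, rule) key."""
    seen, kept = set(), []
    for memory_type, content, rule in candidates:
        key = (memory_type, normalize(content), rule)
        if key not in seen:
            seen.add(key)
            kept.append((memory_type, content, rule))
    return kept

# Three cues from one interaction: each differs on at least one axis,
# so none of them collapse into another.
sample8 = [
    ("decision",    "Ship the polisher fixture as-is",      "decision_heading"),
    ("constraint",  "Spindle runout must stay under 2 um",  "constraint_heading"),
    ("requirement", "Fixture must accept 50 mm blanks",     "requirement_heading"),
]
assert len(dedup(sample8)) == 3

# A literal re-statement of an existing candidate IS collapsed,
# because normalization makes the two contents identical.
restated = ("decision", "ship the  polisher fixture as-is", "decision_heading")
assert len(dedup(sample8 + [restated])) == 3
```

The design choice worth noting: because `rule` is part of the key, the same normalized content extracted by two different rules would survive as two candidates, which is what lets sample 8's three cues coexist.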
This matters because real working sessions naturally bundle multiple decisions/constraints/requirements into one summary. The extractor handles those bundles correctly.

### Project scoping

Each candidate carries the `project` from the source interaction into its own `project` field. Sample 6 (p05) and sample 8 (p06) both produced candidates with the right project. This is non-obvious because the extractor module never explicitly looks at project — it inherits from the interaction it's scanning. Worth keeping in mind when the entity extractor is built: the same pattern should apply.

## What this validates and what it doesn't

### Validates

- The Phase 9 Commit C extractor's rule set is well-tuned for hand-written structural cues
- The dedup logic does the right thing across multiple cues
- The "drop candidates that match an existing active memory" filter works (it would have been visible if any seeded memory had matched one of the heading texts — none did, but the code path is the same one covered in `tests/test_extractor.py`)
- The `prose-only-no-cues` no-fire case is solid
- Long content is preserved without truncation
- Project scoping flows through the pipeline

### Does NOT validate

- The reinforcement matcher (clearly, since it caught nothing)
- The behaviour against very long documents (each sample was under 700 chars; real interaction responses can be 10× that)
- The behaviour against responses that contain code blocks (the extractor's regex rules don't handle fenced code sections specially)
- Cross-interaction promotion-to-active flow (no candidate was promoted in this run; the lifecycle is covered by the unit tests but not by this empirical exercise)
- The behaviour at scale: 8 interactions is a one-shot. We need to see the queue after 50+ before judging reviewer ergonomics.

### Recommended next empirical exercises

1. **Real conversation capture**, using a slash command from a real Claude Code session against either a local or Dalidou AtoCore instance. The synthetic responses in this script are honest paraphrases, but they're still hand-curated.
2. **Bulk capture from existing PKM**, ingesting a few real project notes through the extractor as if they were interactions. This stresses the rules against documents that weren't written with the extractor in mind.
3. **Reinforcement matcher rerun** after the token-overlap matcher lands.

## Action items from this report

- [ ] **Fix reinforcement matcher** with the token-overlap rule described in the "Recommended fix" section above. Owner: next session. Severity: medium-high.
- [x] **Document the extractor's V0 strictness** as a working property, not a limitation. Sample 7 makes the case.
- [ ] **Build the slash command** so the next validation run can use real (not synthetic) interactions. Tracked in Session 2 of the current planning sprint.
- [ ] **Run a 50+ interaction batch** to evaluate reviewer ergonomics. Deferred until the slash command exists.

## Reproducibility

The script is deterministic. Re-running it will produce identical results because:

- the data dir is wiped on every run
- the sample interactions are constants
- memory UUID generation is non-deterministic, but the fields that determine the results (content, type, count, rule) are deterministic
- the `data/validation/phase9-first-use/` directory is gitignored, so no state leaks across runs

To reproduce this exact report:

```bash
python scripts/phase9_first_real_use.py
```

To get JSON output for downstream tooling:

```bash
python scripts/phase9_first_real_use.py --json
```