phase9 first-real-use validation + small hygiene wins

Session 1 of the four-session plan. Empirically exercises the Phase 9
loop (capture -> reinforce -> extract) for the first time and lands
three small hygiene fixes.

Validation script + report
--------------------------
scripts/phase9_first_real_use.py — reproducible script that:
- sets up an isolated SQLite + Chroma store under data/validation/phase9-first-use (gitignored)
- seeds 3 active memories
- runs 8 sample interactions through capture + reinforce + extract
- prints what each step produced and reinforcement state at the end
- supports --json output for downstream tooling

docs/phase9-first-real-use.md — narrative report of the run with:
- extraction results table (8/8 expectations met exactly)
- the empirical finding that REINFORCEMENT MATCHED ZERO seeds despite sample 5 clearly echoing the rebase preference memory
- root cause analysis: the substring matcher is too brittle for natural paraphrases (e.g. "prefers" vs "I prefer", "history" vs "the history")
- recommended fix: replace the substring matcher with a token-overlap matcher (>=70% of memory tokens present in the response, with light stemming and a small stop list)
- explicit note that the fix is queued as a follow-up commit, not bundled into the report — keeps the audit trail clean

Key extraction results from the run:
- all 7 heading/sentence rules fired correctly
- 0 false positives on the prose-only sample (the most important sanity check)
- long content preserved without truncation
- dedup correctly kept three distinct cues from one interaction
- project scoping flowed cleanly through the pipeline

Hygiene 1: FastAPI lifespan migration (src/atocore/main.py)
- Replaced @app.on_event("startup") with the modern @asynccontextmanager lifespan handler
- Same setup work (setup_logging, ensure_runtime_dirs, init_db, init_project_state_schema, startup_ready log)
- Removes the two on_event deprecation warnings from every test run
- Test suite now shows 1 warning instead of 3

Hygiene 2: EXTRACTOR_VERSION constant (src/atocore/memory/extractor.py)
- Added EXTRACTOR_VERSION = "0.1.0" with a versioned change-log comment
- MemoryCandidate dataclass carries extractor_version on every candidate
- POST /interactions/{id}/extract response now includes extractor_version both at the top level (current run) and on each candidate
- Implements the versioning requirement called out in docs/architecture/promotion-rules.md so old candidates can be identified and re-evaluated when the rule set evolves

Hygiene 3: ~/.git-credentials cleanup (out-of-tree, not committed)
- Removed the dead OAUTH_USER:<jwt> line for dalidou:3000 that was being silently rewritten by the system credential manager on every push attempt
- Configured credential.http://dalidou:3000.helper with the empty-string sentinel pattern so the URL-specific helper chain is exactly ["", store] instead of inheriting the system-level "manager" helper that ships with Git for Windows
- Same fix for the 100.80.199.40 (Tailscale) entry
- Verified end to end: a fresh push using only the cleaned credentials file (no embedded URL) authenticates as Antoine and lands cleanly

Full suite: 160 passing (no change from previous), 1 warning (was 3) thanks to the lifespan migration.

2026-04-07 06:16:35 -04:00
# Phase 9 First Real Use Report
## What this is
The first empirical exercise of the Phase 9 reflection loop after
Commits A, B, and C all landed. The goal is to find out where the
extractor and the reinforcement matcher actually behave well versus
where their behaviour drifts from the design intent.
The validation is reproducible. To re-run:
```bash
python scripts/phase9_first_real_use.py
```
This writes an isolated SQLite + Chroma store under
`data/validation/phase9-first-use/` (gitignored), seeds three active
memories, then runs eight sample interactions through the full
capture → reinforce → extract pipeline.
## What we ran
Eight synthetic interactions, each paraphrased from a real working
session about AtoCore itself or the active engineering projects:
| # | Label | Project | Expected |
|---|--------------------------------------|----------------------|---------------------------|
| 1 | exdev-mount-merge-decision | atocore | 1 decision_heading |
| 2 | ownership-was-the-real-fix | atocore | 1 fact_heading |
| 3 | memory-vs-entity-canonical-home | atocore | 1 decision_heading (long) |
| 4 | auto-promotion-deferred | atocore | 1 decision_heading |
| 5 | preference-rebase-workflow | atocore | 1 preference_sentence |
| 6 | constraint-from-doc-cite | p05-interferometer | 1 constraint_heading |
| 7 | prose-only-no-cues | atocore | 0 candidates |
| 8 | multiple-cues-in-one-interaction | p06-polisher | 3 distinct rules |
Three seed memories were also inserted before the run:
- `pref_rebase`: "prefers rebase-based workflows because history stays linear" (preference, 0.6)
- `pref_concise`: "writes commit messages focused on the why, not the what" (preference, 0.6)
- `identity_runs_atocore`: "mechanical engineer who runs AtoCore for context engineering" (identity, 0.9)
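
As a sketch, the seed records can be represented like this (the field names here are assumptions inferred from the report, not the actual script's schema):

```python
# Hypothetical shape of the three seeds; the real script may use a
# dataclass or ORM model with different field names.
SEED_MEMORIES = [
    {"key": "pref_rebase",
     "content": "prefers rebase-based workflows because history stays linear",
     "memory_type": "preference", "confidence": 0.6},
    {"key": "pref_concise",
     "content": "writes commit messages focused on the why, not the what",
     "memory_type": "preference", "confidence": 0.6},
    {"key": "identity_runs_atocore",
     "content": "mechanical engineer who runs AtoCore for context engineering",
     "memory_type": "identity", "confidence": 0.9},
]
```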
## What happened — extraction (the good news)
**Every extraction expectation was met exactly.** All eight samples
produced the predicted candidate count and the predicted rule
classifications:
| Sample | Expected | Got | Pass |
|---------------------------------------|----------|-----|------|
| exdev-mount-merge-decision | 1 | 1 | ✅ |
| ownership-was-the-real-fix | 1 | 1 | ✅ |
| memory-vs-entity-canonical-home | 1 | 1 | ✅ |
| auto-promotion-deferred | 1 | 1 | ✅ |
| preference-rebase-workflow | 1 | 1 | ✅ |
| constraint-from-doc-cite | 1 | 1 | ✅ |
| prose-only-no-cues | **0** | **0** | ✅ |
| multiple-cues-in-one-interaction | 3 | 3 | ✅ |
**Total: 9 candidates from 8 interactions, 0 false positives, 0 misses
on heading patterns or sentence patterns.**
The extractor's strictness is well-tuned for the kinds of structural
cues we actually use. Things worth noting:
- **Sample 7 (`prose-only-no-cues`) produced zero candidates as
designed.** This is the most important sanity check — it confirms
the extractor won't fill the review queue with general prose when
there's no structural intent.
- **Sample 3's long content was preserved without truncation.** The
280-char max wasn't hit, and the content kept its full meaning.
- **Sample 8 produced three distinct rules in one interaction**
(decision_heading, constraint_heading, requirement_heading) without
the dedup key collapsing them. The dedup key is
`(memory_type, normalized_content, rule)` and the three are all
different on at least one axis, so they coexist as expected.
- **The prose around each heading was correctly ignored.** Sample 6
has a second sentence ("the error budget allocates 6 nm to the
laser source...") that does NOT have a structural cue, and the
extractor correctly didn't fire on it.
## What happened — reinforcement (the empirical finding)
**Reinforcement matched zero seeded memories across all 8 samples,
even when the response clearly echoed the seed.**
Sample 5's response was:
> *"I prefer rebase-based workflows because the history stays linear
> and reviewers have an easier time."*
The seeded `pref_rebase` memory was:
> *"prefers rebase-based workflows because history stays linear"*
A human reading both says these are the same fact. The reinforcement
matcher disagrees. After all 8 interactions:
```
pref_rebase: confidence=0.6000 refs=0 last=-
pref_concise: confidence=0.6000 refs=0 last=-
identity_runs_atocore: confidence=0.9000 refs=0 last=-
```
**Nothing moved.** This is the most important finding from this
validation pass.
### Why the matcher missed it
The current `_memory_matches` rule (in
`src/atocore/memory/reinforcement.py`) does a normalized substring
match: it lowercases both sides, collapses whitespace, then asks
"does the leading 80-char window of the memory content appear as a
substring in the response?"
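
A minimal sketch of that rule, reconstructed from the description above (this is not the actual `_memory_matches` source; the real function also enforces the minimum-length guard, whose value isn't reproduced here):

```python
import re

def _normalize(text: str) -> str:
    # Lowercase and collapse whitespace, as the report describes.
    return re.sub(r"\s+", " ", text.lower()).strip()

def memory_matches_substring(memory_content: str, response: str) -> bool:
    # The leading 80-char window of the normalized memory content must
    # appear verbatim inside the normalized response.
    needle = _normalize(memory_content)[:80]
    return bool(needle) and needle in _normalize(response)
```

On the rebase example this returns `False`, which is exactly the failure walked through below.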
For the rebase example:
- needle (normalized): `prefers rebase-based workflows because history stays linear`
- haystack (normalized): `i prefer rebase-based workflows because the history stays linear and reviewers have an easier time.`
The needle starts with `prefers` (with the trailing `s`), and the
haystack has `prefer` (without the `s`, because of the first-person
voice). And the needle has `because history stays linear`, while the
haystack has `because the history stays linear`. **Two small natural
paraphrases, and the substring fails.**
This isn't a bug in the matcher's implementation — it's doing
exactly what it was specified to do. It's a design limitation: the
substring rule is too brittle for real prose, where the same fact
gets re-stated with different verb forms, articles, and word order.
### Severity
**Medium-high.** Reinforcement is the entire point of Commit B.
A reinforcement matcher that never fires on natural paraphrases
will leave seeded memories with stale confidence forever. The
reflection loop runs, but it doesn't actually reinforce anything.
That hollows out the value of having reinforcement at all.
It is not a critical bug because:
- Nothing breaks. The pipeline still runs cleanly.
- Reinforcement is supposed to be a *signal*, not the only path to
high confidence — humans can still curate confidence directly.
- The candidate-extraction path (Commit C) is unaffected and works
perfectly.
But it does need to be addressed before Phase 9 can be considered
operationally complete.
## Recommended fix (deferred to a follow-up commit)
Replace the substring matcher with a token-overlap matcher. The
specification:
1. Tokenize both memory content and response into lowercase words
of length >= 3, dropping a small stop list (`the`, `a`, `an`,
`and`, `or`, `of`, `to`, `is`, `was`, `that`, `this`, `with`,
`for`, `from`, `into`).
2. Stem aggressively (or at minimum, fold trailing `s` and `ed`
so `prefers`/`prefer`/`preferred` collapse to one token).
3. A match exists if **at least 70% of the memory's content
tokens** appear in the response token set.
4. Memory content must still be at least `_MIN_MEMORY_CONTENT_LENGTH`
characters to be considered.
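
A sketch of that specification (the function name, the tokenizing regex, and the exact `ed`/`s` folding rules are choices of this sketch, not settled design; the follow-up commit may differ):

```python
import re

_STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "is", "was",
               "that", "this", "with", "for", "from", "into"}

def _fold(token: str) -> str:
    # Minimal stemming: collapse prefers/prefer/preferred to one token.
    if token.endswith("ed") and len(token) > 4:
        return token[:-2]
    if token.endswith("s") and len(token) > 3:
        return token[:-1]
    return token

def _tokens(text: str) -> set[str]:
    # Lowercase words of length >= 3, hyphenated compounds kept whole.
    words = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    return {_fold(w) for w in words if len(w) >= 3 and w not in _STOP_WORDS}

def memory_matches_overlap(memory_content: str, response: str,
                           threshold: float = 0.7) -> bool:
    needle = _tokens(memory_content)
    if not needle:
        return False
    overlap = len(needle & _tokens(response)) / len(needle)
    return overlap >= threshold
```

On the rebase example, all seven memory tokens survive stop-listing and folding and all seven appear in the response's token set, so the overlap is 100% and the match fires.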
This is more permissive than the substring rule but still tight
enough to avoid spurious matches on generic words. It would have
caught the rebase example because:
- memory tokens (after stop-list and stemming):
`{prefer, rebase-bas, workflow, because, history, stay, linear}`
- response tokens:
`{prefer, rebase-bas, workflow, because, history, stay, linear,
reviewer, easi, time}`
- overlap: 7 / 7 memory tokens = 100% > 70% threshold → match
### Why not fix it in this report
Three reasons:
1. The validation report is supposed to be evidence, not a fix
spec. A separate commit will introduce the new matcher with
its own tests.
2. The token-overlap matcher needs its own design review for edge
cases (very long memories, very short responses, technical
abbreviations, code snippets in responses).
3. Mixing the report and the fix into one commit would muddle the
audit trail. The report is the empirical evidence; the fix is
the response.
The fix is queued as the next Phase 9 maintenance commit and is
flagged in the next-steps section below.
## Other observations
### Extraction is conservative on purpose, and that's working
Sample 7 is the most important data point in the whole run.
A natural prose response with no structural cues produced zero
candidates. **This is exactly the design intent** — the extractor
should be loud about explicit decisions/constraints/requirements
and quiet about everything else. If the extractor were too loose
the review queue would fill up with low-value items and the human
would stop reviewing.
After this run I have measurably more confidence that the V0 rule
set is the right starting point. Future rules can be added one at
a time as we see specific patterns the extractor misses, instead of
guessing at what might be useful.
### Confidence on candidates
All extracted candidates landed at the default `confidence=0.5`,
which is what the extractor is currently hardcoded to do. The
`promotion-rules.md` doc proposes a per-rule prior with a
structural-signal multiplier and freshness bonus. None of that is
implemented yet. The validation didn't reveal any urgency around
this — humans review the candidates either way — but it confirms
that the priors-and-multipliers refinement is a reasonable next
step rather than a critical one.
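
To make the proposal concrete, the shape it suggests is something like the following (entirely hypothetical: the name, signature, and clamping behaviour are this sketch's choices, and nothing like this exists in the code yet):

```python
def proposed_candidate_confidence(rule_prior: float,
                                  structural_multiplier: float = 1.0,
                                  freshness_bonus: float = 0.0) -> float:
    """Hypothetical per-rule prior scaled by a structural-signal
    multiplier, plus a freshness bonus, clamped to 1.0."""
    return min(1.0, rule_prior * structural_multiplier + freshness_bonus)
```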
### Multiple cues in one interaction
Sample 8 confirmed an important property: **three structural
cues in the same response do not collide in dedup**. The dedup
key is `(memory_type, normalized_content, rule)`, and since each
cue produced a distinct (type, content, rule) tuple, all three
landed cleanly.
This matters because real working sessions naturally bundle
multiple decisions/constraints/requirements into one summary.
The extractor handles those bundles correctly.
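
A sketch of a dedup pass with that key (illustrative only; the real logic lives in the extractor module and may differ in shape):

```python
def dedup_candidates(candidates: list[dict]) -> list[dict]:
    # Keep the first candidate for each (memory_type, normalized_content,
    # rule) tuple; candidates differing on any axis coexist.
    seen: set[tuple[str, str, str]] = set()
    kept = []
    for cand in candidates:
        key = (cand["memory_type"],
               " ".join(cand["content"].lower().split()),
               cand["rule"])
        if key not in seen:
            seen.add(key)
            kept.append(cand)
    return kept
```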
### Project scoping
Each candidate carries the `project` from the source interaction
into its own `project` field. Sample 6 (p05) and sample 8 (p06)
both produced candidates with the right project. This is
non-obvious because the extractor module never explicitly looks
at project — it inherits from the interaction it's scanning. Worth
keeping in mind when the entity extractor is built: same pattern
should apply.
## What this validates and what it doesn't
### Validates
- The Phase 9 Commit C extractor's rule set is well-tuned for
hand-written structural cues
- The dedup logic does the right thing across multiple cues
- The "drop candidates that match an existing active memory" filter
works (would have been visible if any seeded memory had matched
one of the heading texts — none did, but the code path is the
same one that's covered in `tests/test_extractor.py`)
- The `prose-only-no-cues` no-fire case is solid
- Long content is preserved without truncation
- Project scoping flows through the pipeline
### Does NOT validate
- The reinforcement matcher (clearly, since it caught nothing)
- The behaviour against very long documents (each sample was
under 700 chars; real interaction responses can be 10× that)
- The behaviour against responses that contain code blocks (the
extractor's regex rules don't handle code-block fenced sections
specially)
- Cross-interaction promotion-to-active flow (no candidate was
promoted in this run; the lifecycle is covered by the unit tests
but not by this empirical exercise)
- The behaviour at scale: 8 interactions is a one-shot. We need
to see the queue after 50+ before judging reviewer ergonomics.
### Recommended next empirical exercises
1. **Real conversation capture**, using a slash command from a
real Claude Code session against either a local or Dalidou
AtoCore instance. The synthetic responses in this script are
honest paraphrases but they're still hand-curated.
2. **Bulk capture from existing PKM**, ingesting a few real
project notes through the extractor as if they were
interactions. This stresses the rules against documents that
weren't written with the extractor in mind.
3. **Reinforcement matcher rerun** after the token-overlap
matcher lands.
## Action items from this report
- [ ] **Fix reinforcement matcher** with token-overlap rule
described in the "Recommended fix" section above. Owner:
next session. Severity: medium-high.
- [x] **Document the extractor's V0 strictness** as a working
property, not a limitation. Sample 7 makes the case.
- [ ] **Build the slash command** so the next validation run
can use real (not synthetic) interactions. Tracked in
Session 2 of the current planning sprint.
- [ ] **Run a 50+ interaction batch** to evaluate reviewer
ergonomics. Deferred until the slash command exists.
## Reproducibility
The script is deterministic in every field that matters. Re-running
it will produce identical results because:
- the data dir is wiped on every run
- the sample interactions are constants
- memory UUID generation is non-deterministic, but the fields the
  report depends on (content, type, count, rule) are fixed
- the `data/validation/phase9-first-use/` directory is gitignored,
so no state leaks across runs
To reproduce this exact report:
```bash
python scripts/phase9_first_real_use.py
```
To get JSON output for downstream tooling:
```bash
python scripts/phase9_first_real_use.py --json
```