# Phase 9 First Real Use Report

## What this is

This is the first empirical exercise of the Phase 9 reflection loop
after Commits A, B, and C landed. The goal is to find out where the
extractor and the reinforcement matcher actually behave well versus
where their behaviour drifts from the design intent.

The validation is reproducible. To re-run:

```bash
python scripts/phase9_first_real_use.py
```

This writes an isolated SQLite + Chroma store under
`data/validation/phase9-first-use/` (gitignored), seeds three active
memories, then runs eight sample interactions through the full
capture → reinforce → extract pipeline.

## What we ran

Eight synthetic interactions, each paraphrased from a real working
session about AtoCore itself or the active engineering projects:

| # | Label | Project | Expected |
|---|--------------------------------------|----------------------|---------------------------|
| 1 | exdev-mount-merge-decision | atocore | 1 decision_heading |
| 2 | ownership-was-the-real-fix | atocore | 1 fact_heading |
| 3 | memory-vs-entity-canonical-home | atocore | 1 decision_heading (long) |
| 4 | auto-promotion-deferred | atocore | 1 decision_heading |
| 5 | preference-rebase-workflow | atocore | 1 preference_sentence |
| 6 | constraint-from-doc-cite | p05-interferometer | 1 constraint_heading |
| 7 | prose-only-no-cues | atocore | 0 candidates |
| 8 | multiple-cues-in-one-interaction | p06-polisher | 3 distinct rules |

Three seed memories were also inserted before the run:

- `pref_rebase`: "prefers rebase-based workflows because history stays linear" (preference, 0.6)
- `pref_concise`: "writes commit messages focused on the why, not the what" (preference, 0.6)
- `identity_runs_atocore`: "mechanical engineer who runs AtoCore for context engineering" (identity, 0.9)

## What happened — extraction (the good news)

**Every extraction expectation was met exactly.** All eight samples
produced the predicted candidate count and the predicted rule
classifications:

| Sample | Expected | Got | Pass |
|---------------------------------------|----------|-----|------|
| exdev-mount-merge-decision | 1 | 1 | ✅ |
| ownership-was-the-real-fix | 1 | 1 | ✅ |
| memory-vs-entity-canonical-home | 1 | 1 | ✅ |
| auto-promotion-deferred | 1 | 1 | ✅ |
| preference-rebase-workflow | 1 | 1 | ✅ |
| constraint-from-doc-cite | 1 | 1 | ✅ |
| prose-only-no-cues | **0** | **0** | ✅ |
| multiple-cues-in-one-interaction | 3 | 3 | ✅ |

**Total: 9 candidates from 8 interactions, 0 false positives, 0 misses
on heading patterns or sentence patterns.**

The extractor's strictness is well-tuned for the kinds of structural
cues we actually use. Things worth noting:

- **Sample 7 (`prose-only-no-cues`) produced zero candidates as
  designed.** This is the most important sanity check — it confirms
  the extractor won't fill the review queue with general prose when
  there's no structural intent.
- **Sample 3's long content was preserved without truncation.** The
  280-char max wasn't hit, and the content kept its full meaning.
- **Sample 8 produced three distinct rules in one interaction**
  (decision_heading, constraint_heading, requirement_heading) without
  the dedup key collapsing them. The dedup key is
  `(memory_type, normalized_content, rule)`, and the three differ on
  at least one axis, so they coexist as expected.
- **The prose around each heading was correctly ignored.** Sample 6
  has a second sentence ("the error budget allocates 6 nm to the
  laser source...") that does NOT carry a structural cue, and the
  extractor correctly didn't fire on it.
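
The report doesn't reproduce the rule patterns themselves, so as a purely hypothetical illustration of the kind of structural cue a rule like `decision_heading` might key on (the pattern below is invented for this sketch, not the real one from the extractor module):

```python
import re

# Hypothetical illustration only: the real Phase 9 rule patterns are not
# shown in this report. A heading-style cue fires on an explicit marker
# such as "## Decision: ..." or "**Decision:** ...", never on plain prose.
DECISION_HEADING = re.compile(
    r"^(?:#{1,6}\s*|\*\*)Decision\b[:*\s]", re.IGNORECASE | re.MULTILINE
)

structured = "## Decision: merge across mount points must handle EXDEV"
prose = "We talked about merging and decided it was probably fine."

print(bool(DECISION_HEADING.search(structured)))  # True
print(bool(DECISION_HEADING.search(prose)))       # False
```

The point is the shape, not the exact regex: structural cues demand an explicit marker at the start of a line, which is why sample 7's unmarked prose produced nothing.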

## What happened — reinforcement (the empirical finding)

**Reinforcement matched zero seeded memories across all 8 samples,
even when the response clearly echoed the seed.**

Sample 5's response was:

> *"I prefer rebase-based workflows because the history stays linear
> and reviewers have an easier time."*

The seeded `pref_rebase` memory was:

> *"prefers rebase-based workflows because history stays linear"*

A human reading both says these are the same fact. The reinforcement
matcher disagrees. After all 8 interactions:

```
pref_rebase:            confidence=0.6000  refs=0  last=-
pref_concise:           confidence=0.6000  refs=0  last=-
identity_runs_atocore:  confidence=0.9000  refs=0  last=-
```

**Nothing moved.** This is the most important finding from this
validation pass.

### Why the matcher missed it

The current `_memory_matches` rule (in
`src/atocore/memory/reinforcement.py`) does a normalized substring
match: it lowercases both sides, collapses whitespace, then asks
whether the leading 80-char window of the memory content appears as
a literal substring in the response.

For the rebase example:

- needle (normalized): `prefers rebase-based workflows because history stays linear`
- haystack (normalized): `i prefer rebase-based workflows because the history stays linear and reviewers have an easier time.`

The needle starts with `prefers` (with the trailing `s`), while the
haystack has `prefer` (without the `s`, because of the first-person
voice). And the needle has `because history stays linear`, while the
haystack has `because the history stays linear`. **Two small natural
paraphrases, and the substring check fails.**
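
A minimal reconstruction of the rule as described above makes the miss concrete. The normalization and 80-char window are taken from the report; the function name is illustrative, and the real code lives in `src/atocore/memory/reinforcement.py`:

```python
def substring_match(memory_content: str, response: str, window: int = 80) -> bool:
    """Reconstruction of the substring rule: lowercase, collapse
    whitespace, then test the leading window of the memory content
    for a literal substring hit in the response."""
    def norm(s: str) -> str:
        return " ".join(s.lower().split())
    needle = norm(memory_content)[:window]
    return needle in norm(response)

memory = "prefers rebase-based workflows because history stays linear"
response = ("I prefer rebase-based workflows because the history stays "
            "linear and reviewers have an easier time.")

print(substring_match(memory, response))  # False: "prefers" vs "prefer",
                                          # "because history" vs "because the history"
```

Every content word of the memory survives in the response, but two inflection-level differences are enough to defeat the literal substring test.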

This isn't a bug in the matcher's implementation — it's doing
exactly what it was specified to do. It's a design limitation: the
substring rule is too brittle for real prose, where the same fact
gets re-stated with different verb forms, articles, and word order.

### Severity

**Medium-high.** Reinforcement is the entire point of Commit B.
A reinforcement matcher that never fires on natural paraphrases
will leave seeded memories with stale confidence forever. The
reflection loop runs, but it doesn't actually reinforce anything.
That hollows out the value of having reinforcement at all.

It is not a critical bug because:

- Nothing breaks. The pipeline still runs cleanly.
- Reinforcement is supposed to be a *signal*, not the only path to
  high confidence — humans can still curate confidence directly.
- The candidate-extraction path (Commit C) is unaffected and works
  perfectly.

But it does need to be addressed before Phase 9 can be considered
operationally complete.

## Recommended fix (deferred to a follow-up commit)

Replace the substring matcher with a token-overlap matcher. The
specification:

1. Tokenize both the memory content and the response into lowercase
   words of length >= 3, dropping a small stop list (`the`, `a`, `an`,
   `and`, `or`, `of`, `to`, `is`, `was`, `that`, `this`, `with`,
   `for`, `from`, `into`).
2. Stem aggressively (or at minimum, fold trailing `s` and `ed`
   so `prefers`/`prefer`/`preferred` collapse to one token).
3. A match exists if **at least 70% of the memory's content
   tokens** appear in the response token set.
4. Memory content must still be at least `_MIN_MEMORY_CONTENT_LENGTH`
   characters to be considered.

This is more permissive than the substring rule but still tight
enough to avoid spurious matches on generic words. It would have
caught the rebase example because:

- memory tokens (after stop-list and stemming):
  `{prefer, rebase-bas, workflow, because, history, stay, linear}`
- response tokens:
  `{prefer, rebase-bas, workflow, because, history, stay, linear,
  reviewer, easi, time}`
- overlap: 7 / 7 memory tokens = 100% > 70% threshold → match
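
The four steps above can be sketched as follows. The stop list and 70% threshold come from the specification; the minimal `s`/`ed` suffix folding stands in for a real stemmer (step 2 allows anything more aggressive), and `_MIN_MEMORY_CONTENT_LENGTH` is stubbed with a placeholder value rather than the real constant:

```python
import re

_STOP = {"the", "a", "an", "and", "or", "of", "to", "is", "was",
         "that", "this", "with", "for", "from", "into"}
_MIN_MEMORY_CONTENT_LENGTH = 20  # placeholder; use the real constant

def _tokens(text: str) -> set:
    """Lowercase words of length >= 3, minus the stop list, with
    trailing "ed"/"s" folded so prefers/prefer/preferred collapse."""
    out = set()
    for w in re.findall(r"[a-z0-9-]+", text.lower()):
        if len(w) < 3 or w in _STOP:
            continue
        if w.endswith("ed"):
            w = w[:-2]
        elif w.endswith("s"):
            w = w[:-1]
        out.add(w)
    return out

def token_overlap_match(memory_content: str, response: str,
                        threshold: float = 0.7) -> bool:
    if len(memory_content) < _MIN_MEMORY_CONTENT_LENGTH:
        return False
    needle = _tokens(memory_content)
    if not needle:
        return False
    hits = len(needle & _tokens(response))
    return hits / len(needle) >= threshold

memory = "prefers rebase-based workflows because history stays linear"
response = ("I prefer rebase-based workflows because the history stays "
            "linear and reviewers have an easier time.")
print(token_overlap_match(memory, response))  # True: 7/7 memory tokens hit
```

The asymmetry is deliberate: only the memory's tokens need to appear in the response, so a long response can still reinforce a short memory.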

### Why not fix it in this report

Three reasons:

1. The validation report is supposed to be evidence, not a fix
   spec. A separate commit will introduce the new matcher with
   its own tests.
2. The token-overlap matcher needs its own design review for edge
   cases (very long memories, very short responses, technical
   abbreviations, code snippets in responses).
3. Mixing the report and the fix into one commit would muddle the
   audit trail. The report is the empirical evidence; the fix is
   the response.

The fix is queued as the next Phase 9 maintenance commit and is
flagged in the action items below.

## Other observations

### Extraction is conservative on purpose, and that's working

Sample 7 is the most important data point in the whole run.
A natural prose response with no structural cues produced zero
candidates. **This is exactly the design intent** — the extractor
should be loud about explicit decisions/constraints/requirements
and quiet about everything else. If the extractor were too loose,
the review queue would fill up with low-value items and the human
would stop reviewing.

After this run I have measurably more confidence that the V0 rule
set is the right starting point. Future rules can be added one at
a time as we see specific patterns the extractor misses, instead of
guessing at what might be useful.

### Confidence on candidates

All extracted candidates landed at the default `confidence=0.5`,
which is what the extractor is currently hardcoded to do. The
`promotion-rules.md` doc proposes a per-rule prior with a
structural-signal multiplier and freshness bonus. None of that is
implemented yet. The validation didn't reveal any urgency here —
humans review the candidates either way — but it confirms that the
priors-and-multipliers refinement is a reasonable next step rather
than a critical one.

### Multiple cues in one interaction

Sample 8 confirmed an important property: **three structural
cues in the same response do not collide in dedup**. The dedup
key is `(memory_type, normalized_content, rule)`, and since each
cue produced a distinct (type, content, rule) tuple, all three
landed cleanly.
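
A toy sketch of that dedup behaviour, keyed on the tuple described above (the candidate contents below are invented placeholders, not the real sample 8 text, and the field names are simplified for illustration):

```python
def dedup(candidates):
    """Keep the first candidate for each (type, normalized content, rule) key."""
    seen, kept = set(), []
    for c in candidates:
        key = (c["memory_type"],
               " ".join(c["content"].lower().split()),  # normalize whitespace/case
               c["rule"])
        if key not in seen:
            seen.add(key)
            kept.append(c)
    return kept

# Three cues from one interaction: distinct rules, so all survive.
sample8 = [
    {"memory_type": "decision",    "content": "Ship the polisher rig first",  "rule": "decision_heading"},
    {"memory_type": "constraint",  "content": "Spindle runout under 2 um",    "rule": "constraint_heading"},
    {"memory_type": "requirement", "content": "Log every pass to CSV",        "rule": "requirement_heading"},
    # An exact repeat of the first cue (modulo whitespace) collapses into it:
    {"memory_type": "decision",    "content": "Ship the  polisher rig first", "rule": "decision_heading"},
]
print(len(dedup(sample8)))  # 3
```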

This matters because real working sessions naturally bundle
multiple decisions/constraints/requirements into one summary.
The extractor handles those bundles correctly.

### Project scoping

Each candidate carries the `project` from the source interaction
into its own `project` field. Sample 6 (p05) and sample 8 (p06)
both produced candidates with the right project. This is
non-obvious because the extractor module never explicitly looks
at project — it inherits the value from the interaction it's
scanning. Worth keeping in mind when the entity extractor is
built: the same pattern should apply.

## What this validates and what it doesn't

### Validates

- The Phase 9 Commit C extractor's rule set is well-tuned for
  hand-written structural cues
- The dedup logic does the right thing across multiple cues
- The "drop candidates that match an existing active memory" filter
  works (a failure would have been visible if any seeded memory had
  matched one of the heading texts — none did, but the code path is
  the same one covered in `tests/test_extractor.py`)
- The `prose-only-no-cues` no-fire case is solid
- Long content is preserved without truncation
- Project scoping flows through the pipeline

### Does NOT validate

- The reinforcement matcher (clearly, since it caught nothing)
- Behaviour against very long documents (each sample was
  under 700 chars; real interaction responses can be 10× that)
- Behaviour against responses that contain code blocks (the
  extractor's regex rules don't treat fenced code sections
  specially)
- The cross-interaction promotion-to-active flow (no candidate was
  promoted in this run; the lifecycle is covered by the unit tests
  but not by this empirical exercise)
- Behaviour at scale: 8 interactions is a one-shot. We need
  to see the queue after 50+ before judging reviewer ergonomics.

### Recommended next empirical exercises

1. **Real conversation capture**, using a slash command from a
   real Claude Code session against either a local or Dalidou
   AtoCore instance. The synthetic responses in this script are
   honest paraphrases, but they're still hand-curated.
2. **Bulk capture from existing PKM**, ingesting a few real
   project notes through the extractor as if they were
   interactions. This stresses the rules against documents that
   weren't written with the extractor in mind.
3. **Reinforcement matcher rerun** after the token-overlap
   matcher lands.

## Action items from this report

- [ ] **Fix the reinforcement matcher** with the token-overlap rule
      described in the "Recommended fix" section above. Owner:
      next session. Severity: medium-high.
- [x] **Document the extractor's V0 strictness** as a working
      property, not a limitation. Sample 7 makes the case.
- [ ] **Build the slash command** so the next validation run
      can use real (not synthetic) interactions. Tracked in
      Session 2 of the current planning sprint.
- [ ] **Run a 50+ interaction batch** to evaluate reviewer
      ergonomics. Deferred until the slash command exists.

## Reproducibility

The script is deterministic in every field that matters. Re-running
it will reproduce this report because:

- the data dir is wiped on every run
- the sample interactions are constants
- memory UUID generation is not deterministic, but the fields the
  report depends on (content, type, count, rule) are
- the `data/validation/phase9-first-use/` directory is gitignored,
  so no state leaks across runs

To reproduce this exact report:

```bash
python scripts/phase9_first_real_use.py
```

To get JSON output for downstream tooling:

```bash
python scripts/phase9_first_real_use.py --json
```