phase9 first-real-use validation + small hygiene wins
Session 1 of the four-session plan. Empirically exercises the Phase 9
loop (capture -> reinforce -> extract) for the first time and lands
three small hygiene fixes.
Validation script + report
--------------------------
scripts/phase9_first_real_use.py — reproducible script that:
- sets up an isolated SQLite + Chroma store under
data/validation/phase9-first-use (gitignored)
- seeds 3 active memories
- runs 8 sample interactions through capture + reinforce + extract
- prints what each step produced and reinforcement state at the end
- supports --json output for downstream tooling
docs/phase9-first-real-use.md — narrative report of the run with:
- extraction results table (8/8 expectations met exactly)
- the empirical finding that REINFORCEMENT MATCHED ZERO seeds
despite sample 5 clearly echoing the rebase preference memory
- root cause analysis: the substring matcher is too brittle for
natural paraphrases (e.g. "prefers" vs "I prefer", "history"
vs "the history")
- recommended fix: replace substring matcher with a token-overlap
matcher (>=70% of memory tokens present in response, with
light stemming and a small stop list)
- explicit note that the fix is queued as a follow-up commit, not
bundled into the report — keeps the audit trail clean
Key extraction results from the run:
- all 7 heading/sentence rules fired correctly
- 0 false positives on the prose-only sample (the most important
sanity check)
- long content preserved without truncation
- dedup correctly kept three distinct cues from one interaction
- project scoping flowed cleanly through the pipeline
Hygiene 1: FastAPI lifespan migration (src/atocore/main.py)
- Replaced @app.on_event("startup") with the modern @asynccontextmanager
lifespan handler
- Same setup work (setup_logging, ensure_runtime_dirs, init_db,
init_project_state_schema, startup_ready log)
- Removes the two on_event deprecation warnings from every test run
- Test suite now shows 1 warning instead of 3
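For reference, the shape of the lifespan handler, as a stdlib-only sketch (the real handler runs the setup work listed above and is registered via `FastAPI(lifespan=lifespan)`; the event list below is a stand-in, not the committed code):

```python
import asyncio
from contextlib import asynccontextmanager

events = []  # stand-in for the real startup work (setup_logging, init_db, ...)

@asynccontextmanager
async def lifespan(app):
    # everything before the yield replaces @app.on_event("startup")
    events.append("startup_ready")
    yield
    # anything after the yield would replace @app.on_event("shutdown")
    events.append("shutdown")

async def main():
    # FastAPI drives this context manager itself; here we drive it by hand
    async with lifespan(app=None):
        pass

asyncio.run(main())
```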
Hygiene 2: EXTRACTOR_VERSION constant (src/atocore/memory/extractor.py)
- Added EXTRACTOR_VERSION = "0.1.0" with a versioned change log comment
- MemoryCandidate dataclass carries extractor_version on every candidate
- POST /interactions/{id}/extract response now includes extractor_version
on both the top level (current run) and on each candidate
- Implements the versioning requirement called out in
docs/architecture/promotion-rules.md so old candidates can be
identified and re-evaluated when the rule set evolves
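A sketch of the shape of this change (fields other than extractor_version are illustrative, not the real MemoryCandidate definition):

```python
from dataclasses import dataclass

EXTRACTOR_VERSION = "0.1.0"
# 0.1.0: initial heading/sentence rule set

@dataclass
class MemoryCandidate:
    memory_type: str
    content: str
    rule: str
    confidence: float = 0.5
    # stamped on every candidate so old ones can be re-evaluated later
    extractor_version: str = EXTRACTOR_VERSION

cand = MemoryCandidate("decision", "defer auto-promotion", "decision_heading")
print(cand.extractor_version)  # → 0.1.0
```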
Hygiene 3: ~/.git-credentials cleanup (out-of-tree, not committed)
- Removed the dead OAUTH_USER:<jwt> line for dalidou:3000 that was
being silently rewritten by the system credential manager on every
push attempt
- Configured credential.http://dalidou:3000.helper with the empty-string
sentinel pattern so the URL-specific helper chain is exactly
["", store] instead of inheriting the system-level "manager" helper
that ships with Git for Windows
- Same fix for the 100.80.199.40 (Tailscale) entry
- Verified end to end: a fresh push using only the cleaned credentials
file (no embedded URL) authenticates as Antoine and lands cleanly
Full suite: 160 passing (no change from previous), 1 warning
(was 3) thanks to the lifespan migration.
docs/phase9-first-real-use.md (new file, 321 lines)
---------------------------------------------------
# Phase 9 First Real Use Report

## What this is

The first empirical exercise of the Phase 9 reflection loop after
Commits A, B, and C all landed. The goal is to find out where the
extractor and the reinforcement matcher actually behave well versus
where their behaviour drifts from the design intent.

The validation is reproducible. To re-run:

```bash
python scripts/phase9_first_real_use.py
```

This writes an isolated SQLite + Chroma store under
`data/validation/phase9-first-use/` (gitignored), seeds three active
memories, then runs eight sample interactions through the full
capture → reinforce → extract pipeline.

## What we ran

Eight synthetic interactions, each paraphrased from a real working
session about AtoCore itself or the active engineering projects:

| # | Label | Project | Expected |
|---|--------------------------------------|----------------------|---------------------------|
| 1 | exdev-mount-merge-decision | atocore | 1 decision_heading |
| 2 | ownership-was-the-real-fix | atocore | 1 fact_heading |
| 3 | memory-vs-entity-canonical-home | atocore | 1 decision_heading (long) |
| 4 | auto-promotion-deferred | atocore | 1 decision_heading |
| 5 | preference-rebase-workflow | atocore | 1 preference_sentence |
| 6 | constraint-from-doc-cite | p05-interferometer | 1 constraint_heading |
| 7 | prose-only-no-cues | atocore | 0 candidates |
| 8 | multiple-cues-in-one-interaction | p06-polisher | 3 distinct rules |

Three seed memories were inserted before the run:

- `pref_rebase`: "prefers rebase-based workflows because history stays linear" (preference, 0.6)
- `pref_concise`: "writes commit messages focused on the why, not the what" (preference, 0.6)
- `identity_runs_atocore`: "mechanical engineer who runs AtoCore for context engineering" (identity, 0.9)

## What happened — extraction (the good news)

**Every extraction expectation was met exactly.** All eight samples
produced the predicted candidate count and the predicted rule
classifications:

| Sample | Expected | Got | Pass |
|---------------------------------------|----------|-----|------|
| exdev-mount-merge-decision | 1 | 1 | ✅ |
| ownership-was-the-real-fix | 1 | 1 | ✅ |
| memory-vs-entity-canonical-home | 1 | 1 | ✅ |
| auto-promotion-deferred | 1 | 1 | ✅ |
| preference-rebase-workflow | 1 | 1 | ✅ |
| constraint-from-doc-cite | 1 | 1 | ✅ |
| prose-only-no-cues | **0** | **0** | ✅ |
| multiple-cues-in-one-interaction | 3 | 3 | ✅ |

**Total: 9 candidates from 8 interactions, 0 false positives, 0 misses
on heading patterns or sentence patterns.**

The extractor's strictness is well-tuned for the kinds of structural
cues we actually use. Things worth noting:

- **Sample 7 (`prose-only-no-cues`) produced zero candidates as
  designed.** This is the most important sanity check — it confirms
  the extractor won't fill the review queue with general prose when
  there's no structural intent.
- **Sample 3's long content was preserved without truncation.** The
  280-char max wasn't hit, and the content kept its full meaning.
- **Sample 8 produced three distinct rules in one interaction**
  (decision_heading, constraint_heading, requirement_heading) without
  the dedup key collapsing them. The dedup key is
  `(memory_type, normalized_content, rule)` and the three are all
  different on at least one axis, so they coexist as expected.
- **The prose around each heading was correctly ignored.** Sample 6
  has a second sentence ("the error budget allocates 6 nm to the
  laser source...") that does NOT have a structural cue, and the
  extractor correctly didn't fire on it.

## What happened — reinforcement (the empirical finding)

**Reinforcement matched zero seeded memories across all 8 samples,
even when the response clearly echoed the seed.**

Sample 5's response was:

> *"I prefer rebase-based workflows because the history stays linear
> and reviewers have an easier time."*

The seeded `pref_rebase` memory was:

> *"prefers rebase-based workflows because history stays linear"*

A human reading both says these are the same fact. The reinforcement
matcher disagrees. After all 8 interactions:

```
pref_rebase:            confidence=0.6000  refs=0  last=-
pref_concise:           confidence=0.6000  refs=0  last=-
identity_runs_atocore:  confidence=0.9000  refs=0  last=-
```

**Nothing moved.** This is the most important finding from this
validation pass.

### Why the matcher missed it

The current `_memory_matches` rule (in
`src/atocore/memory/reinforcement.py`) does a normalized substring
match: it lowercases both sides, collapses whitespace, then asks
"does the leading 80-char window of the memory content appear as a
substring in the response?"

For the rebase example:

- needle (normalized): `prefers rebase-based workflows because history stays linear`
- haystack (normalized): `i prefer rebase-based workflows because the history stays linear and reviewers have an easier time.`

The needle starts with `prefers` (with the trailing `s`), and the
haystack has `prefer` (without the `s`, because of the first-person
voice). And the needle has `because history stays linear`, while the
haystack has `because the history stays linear`. **Two small natural
paraphrases, and the substring fails.**

This isn't a bug in the matcher's implementation — it's doing
exactly what it was specified to do. It's a design limitation: the
substring rule is too brittle for real prose, where the same fact
gets re-stated with different verb forms, articles, and word order.

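The failure is easy to reproduce. A minimal reconstruction of the rule (the real `_memory_matches` lives in `src/atocore/memory/reinforcement.py`; this sketch only mirrors the normalize-then-substring behaviour described above):

```python
import re

def _normalize(text: str) -> str:
    # lowercase and collapse whitespace, as the current matcher does
    return re.sub(r"\s+", " ", text.lower()).strip()

def substring_matches(memory: str, response: str) -> bool:
    # the leading 80-char window of the memory must appear verbatim
    needle = _normalize(memory)[:80]
    return needle in _normalize(response)

memory = "prefers rebase-based workflows because history stays linear"
response = ("I prefer rebase-based workflows because the history stays "
            "linear and reviewers have an easier time.")
print(substring_matches(memory, response))  # → False: the paraphrase defeats the rule
```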
### Severity

**Medium-high.** Reinforcement is the entire point of Commit B.
A reinforcement matcher that never fires on natural paraphrases
will leave seeded memories with stale confidence forever. The
reflection loop runs but it doesn't actually reinforce anything.
That hollows out the value of having reinforcement at all.

It is not a critical bug because:

- Nothing breaks. The pipeline still runs cleanly.
- Reinforcement is supposed to be a *signal*, not the only path to
  high confidence — humans can still curate confidence directly.
- The candidate-extraction path (Commit C) is unaffected and works
  perfectly.

But it does need to be addressed before Phase 9 can be considered
operationally complete.

## Recommended fix (deferred to a follow-up commit)

Replace the substring matcher with a token-overlap matcher. The
specification:

1. Tokenize both memory content and response into lowercase words
   of length >= 3, dropping a small stop list (`the`, `a`, `an`,
   `and`, `or`, `of`, `to`, `is`, `was`, `that`, `this`, `with`,
   `for`, `from`, `into`).
2. Stem aggressively (or at minimum, fold trailing `s` and `ed`
   so `prefers`/`prefer`/`preferred` collapse to one token).
3. A match exists if **at least 70% of the memory's content
   tokens** appear in the response token set.
4. Memory content must still be at least `_MIN_MEMORY_CONTENT_LENGTH`
   characters to be considered.

This is more permissive than the substring rule but still tight
enough to avoid spurious matches on generic words. It would have
caught the rebase example because:

- memory tokens (after stop-list and stemming):
  `{prefer, rebase-bas, workflow, because, history, stay, linear}`
- response tokens:
  `{prefer, rebase-bas, workflow, because, history, stay, linear,
  reviewer, easi, time}`
- overlap: 7 / 7 memory tokens = 100% > 70% threshold → match

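The four steps above can be sketched directly (the stop list and the minimal s/ed folding come from the spec; the value of `_MIN_MEMORY_CONTENT_LENGTH` is assumed here, and the follow-up commit may choose a different tokenizer or stemmer):

```python
import re

_STOP = {"the", "a", "an", "and", "or", "of", "to", "is", "was",
         "that", "this", "with", "for", "from", "into"}
_MIN_MEMORY_CONTENT_LENGTH = 20  # assumed value; the real constant lives in reinforcement.py

def _tokens(text: str) -> set[str]:
    words = re.findall(r"[a-z0-9][a-z0-9-]*", text.lower())
    out = set()
    for w in words:
        if len(w) < 3 or w in _STOP:
            continue
        # minimal stemming: fold trailing "s"/"ed" so prefers/prefer/preferred collapse
        if w.endswith("ed"):
            w = w[:-2]
        elif w.endswith("s"):
            w = w[:-1]
        out.add(w)
    return out

def token_overlap_matches(memory: str, response: str, threshold: float = 0.7) -> bool:
    if len(memory) < _MIN_MEMORY_CONTENT_LENGTH:
        return False
    mem = _tokens(memory)
    if not mem:
        return False
    return len(mem & _tokens(response)) / len(mem) >= threshold

memory = "prefers rebase-based workflows because history stays linear"
response = ("I prefer rebase-based workflows because the history stays "
            "linear and reviewers have an easier time.")
print(token_overlap_matches(memory, response))  # → True: 7/7 memory tokens overlap
```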
### Why not fix it in this report

Three reasons:

1. The validation report is supposed to be evidence, not a fix
   spec. A separate commit will introduce the new matcher with
   its own tests.
2. The token-overlap matcher needs its own design review for edge
   cases (very long memories, very short responses, technical
   abbreviations, code snippets in responses).
3. Mixing the report and the fix into one commit would muddle the
   audit trail. The report is the empirical evidence; the fix is
   the response.

The fix is queued as the next Phase 9 maintenance commit and is
flagged in the next-steps section below.

## Other observations

### Extraction is conservative on purpose, and that's working

Sample 7 is the most important data point in the whole run.
A natural prose response with no structural cues produced zero
candidates. **This is exactly the design intent** — the extractor
should be loud about explicit decisions/constraints/requirements
and quiet about everything else. If the extractor were too loose,
the review queue would fill up with low-value items and the human
would stop reviewing.

After this run I have measurably more confidence that the V0 rule
set is the right starting point. Future rules can be added one at
a time as we see specific patterns the extractor misses, instead of
guessing at what might be useful.

### Confidence on candidates

All extracted candidates landed at the default `confidence=0.5`,
which is what the extractor is currently hardcoded to do. The
`promotion-rules.md` doc proposes a per-rule prior with a
structural-signal multiplier and freshness bonus. None of that is
implemented yet. The validation didn't reveal any urgency around
this — humans review the candidates either way — but it confirms
that the priors-and-multipliers refinement is a reasonable next
step rather than a critical one.

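For concreteness, the proposed scheme could look something like the sketch below. Nothing here is implemented; the rule priors, the multiplier, and the bonus values are all invented for illustration, not taken from `promotion-rules.md`:

```python
# Hypothetical per-rule priors — invented numbers, not the real spec.
RULE_PRIOR = {
    "decision_heading": 0.6,
    "constraint_heading": 0.55,
    "preference_sentence": 0.5,
}

def candidate_confidence(rule: str, has_structural_signal: bool, age_days: float) -> float:
    prior = RULE_PRIOR.get(rule, 0.5)
    if has_structural_signal:
        prior *= 1.1   # structural-signal multiplier (illustrative)
    if age_days < 7:
        prior += 0.05  # freshness bonus for recent interactions (illustrative)
    return min(prior, 1.0)
```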
### Multiple cues in one interaction

Sample 8 confirmed an important property: **three structural
cues in the same response do not collide in dedup**. The dedup
key is `(memory_type, normalized_content, rule)`, and since each
cue produced a distinct (type, content, rule) tuple, all three
landed cleanly.

This matters because real working sessions naturally bundle
multiple decisions/constraints/requirements into one summary.
The extractor handles those bundles correctly.

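The dedup key described above can be sketched as a set of tuples (the candidate dicts and their contents here are invented stand-ins; only the key shape mirrors the real logic):

```python
def dedup(candidates):
    # key: (memory_type, normalized_content, rule) — a candidate survives
    # if it differs from everything seen so far on at least one axis
    seen, kept = set(), []
    for c in candidates:
        key = (c["memory_type"], " ".join(c["content"].lower().split()), c["rule"])
        if key not in seen:
            seen.add(key)
            kept.append(c)
    return kept

sample8 = [
    {"memory_type": "decision", "content": "Ship fixture rev B", "rule": "decision_heading"},
    {"memory_type": "constraint", "content": "Runout under 2 um", "rule": "constraint_heading"},
    {"memory_type": "requirement", "content": "Log every pass", "rule": "requirement_heading"},
]
print(len(dedup(sample8)))  # → 3: distinct on every axis, none collapsed
```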
### Project scoping

Each candidate carries the `project` from the source interaction
into its own `project` field. Sample 6 (p05) and sample 8 (p06)
both produced candidates with the right project. This is
non-obvious because the extractor module never explicitly looks
at project — it inherits from the interaction it's scanning. Worth
keeping in mind when the entity extractor is built: the same
pattern should apply.

## What this validates and what it doesn't

### Validates

- The Phase 9 Commit C extractor's rule set is well-tuned for
  hand-written structural cues
- The dedup logic does the right thing across multiple cues
- The "drop candidates that match an existing active memory" filter
  works (would have been visible if any seeded memory had matched
  one of the heading texts — none did, but the code path is the
  same one that's covered in `tests/test_extractor.py`)
- The `prose-only-no-cues` no-fire case is solid
- Long content is preserved without truncation
- Project scoping flows through the pipeline

### Does NOT validate

- The reinforcement matcher (clearly, since it caught nothing)
- The behaviour against very long documents (each sample was
  under 700 chars; real interaction responses can be 10× that)
- The behaviour against responses that contain code blocks (the
  extractor's regex rules don't handle code-block fenced sections
  specially)
- Cross-interaction promotion-to-active flow (no candidate was
  promoted in this run; the lifecycle is covered by the unit tests
  but not by this empirical exercise)
- The behaviour at scale: 8 interactions is a one-shot. We need
  to see the queue after 50+ before judging reviewer ergonomics.

### Recommended next empirical exercises

1. **Real conversation capture**, using a slash command from a
   real Claude Code session against either a local or Dalidou
   AtoCore instance. The synthetic responses in this script are
   honest paraphrases but they're still hand-curated.
2. **Bulk capture from existing PKM**, ingesting a few real
   project notes through the extractor as if they were
   interactions. This stresses the rules against documents that
   weren't written with the extractor in mind.
3. **Reinforcement matcher rerun** after the token-overlap
   matcher lands.

## Action items from this report

- [ ] **Fix reinforcement matcher** with the token-overlap rule
      described in the "Recommended fix" section above. Owner:
      next session. Severity: medium-high.
- [x] **Document the extractor's V0 strictness** as a working
      property, not a limitation. Sample 7 makes the case.
- [ ] **Build the slash command** so the next validation run
      can use real (not synthetic) interactions. Tracked in
      Session 2 of the current planning sprint.
- [ ] **Run a 50+ interaction batch** to evaluate reviewer
      ergonomics. Deferred until the slash command exists.

## Reproducibility

The script is deterministic. Re-running it will produce
identical results because:

- the data dir is wiped on every run
- the sample interactions are constants
- memory UUID generation is non-deterministic, but the fields that
  matter (content, type, count, rule) are fully determined by the inputs
- the `data/validation/phase9-first-use/` directory is gitignored,
  so no state leaks across runs

To reproduce this exact report:

```bash
python scripts/phase9_first_real_use.py
```

To get JSON output for downstream tooling:

```bash
python scripts/phase9_first_real_use.py --json
```