phase9 first-real-use validation + small hygiene wins
Session 1 of the four-session plan. Empirically exercises the Phase 9
loop (capture -> reinforce -> extract) for the first time and lands
three small hygiene fixes.
Validation script + report
--------------------------
scripts/phase9_first_real_use.py — reproducible script that:
- sets up an isolated SQLite + Chroma store under
data/validation/phase9-first-use (gitignored)
- seeds 3 active memories
- runs 8 sample interactions through capture + reinforce + extract
- prints what each step produced and reinforcement state at the end
- supports --json output for downstream tooling
docs/phase9-first-real-use.md — narrative report of the run with:
- extraction results table (8/8 expectations met exactly)
- the empirical finding that REINFORCEMENT MATCHED ZERO seeds
despite sample 5 clearly echoing the rebase preference memory
- root cause analysis: the substring matcher is too brittle for
natural paraphrases (e.g. "prefers" vs "I prefer", "history"
vs "the history")
- recommended fix: replace substring matcher with a token-overlap
matcher (>=70% of memory tokens present in response, with
light stemming and a small stop list)
- explicit note that the fix is queued as a follow-up commit, not
bundled into the report — keeps the audit trail clean
Key extraction results from the run:
- all 7 heading/sentence rules fired correctly
- 0 false positives on the prose-only sample (the most important
sanity check)
- long content preserved without truncation
- dedup correctly kept three distinct cues from one interaction
- project scoping flowed cleanly through the pipeline
Hygiene 1: FastAPI lifespan migration (src/atocore/main.py)
- Replaced @app.on_event("startup") with the modern @asynccontextmanager
lifespan handler
- Same setup work (setup_logging, ensure_runtime_dirs, init_db,
init_project_state_schema, startup_ready log)
- Removes the two on_event deprecation warnings from every test run
- Test suite now shows 1 warning instead of 3
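For reference, the shape of the lifespan handler, as a stdlib-only sketch (the real handler runs the setup work listed above and is registered via `FastAPI(lifespan=lifespan)`; the event list below is a stand-in, not the committed code):

```python
import asyncio
from contextlib import asynccontextmanager

events = []  # stand-in for the real startup work (setup_logging, init_db, ...)

@asynccontextmanager
async def lifespan(app):
    # everything before the yield replaces @app.on_event("startup")
    events.append("startup_ready")
    yield
    # anything after the yield would replace @app.on_event("shutdown")
    events.append("shutdown")

async def main():
    # FastAPI drives this context manager itself; here we drive it by hand
    async with lifespan(app=None):
        pass

asyncio.run(main())
```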
Hygiene 2: EXTRACTOR_VERSION constant (src/atocore/memory/extractor.py)
- Added EXTRACTOR_VERSION = "0.1.0" with a versioned change log comment
- MemoryCandidate dataclass carries extractor_version on every candidate
- POST /interactions/{id}/extract response now includes extractor_version
on both the top level (current run) and on each candidate
- Implements the versioning requirement called out in
docs/architecture/promotion-rules.md so old candidates can be
identified and re-evaluated when the rule set evolves
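A sketch of the shape of this change (fields other than extractor_version are illustrative, not the real MemoryCandidate definition):

```python
from dataclasses import dataclass

EXTRACTOR_VERSION = "0.1.0"
# 0.1.0: initial heading/sentence rule set

@dataclass
class MemoryCandidate:
    memory_type: str
    content: str
    rule: str
    confidence: float = 0.5
    # stamped on every candidate so old ones can be re-evaluated later
    extractor_version: str = EXTRACTOR_VERSION

cand = MemoryCandidate("decision", "defer auto-promotion", "decision_heading")
print(cand.extractor_version)  # → 0.1.0
```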
Hygiene 3: ~/.git-credentials cleanup (out-of-tree, not committed)
- Removed the dead OAUTH_USER:<jwt> line for dalidou:3000 that was
being silently rewritten by the system credential manager on every
push attempt
- Configured credential.http://dalidou:3000.helper with the empty-string
sentinel pattern so the URL-specific helper chain is exactly
["", store] instead of inheriting the system-level "manager" helper
that ships with Git for Windows
- Same fix for the 100.80.199.40 (Tailscale) entry
- Verified end to end: a fresh push using only the cleaned credentials
file (no embedded URL) authenticates as Antoine and lands cleanly
Full suite: 160 passing (no change from previous), 1 warning
(was 3) thanks to the lifespan migration.
docs/phase9-first-real-use.md (new file, 321 lines)
---------------------------------------------------
# Phase 9 First Real Use Report

## What this is

The first empirical exercise of the Phase 9 reflection loop after
Commits A, B, and C all landed. The goal is to find out where the
extractor and the reinforcement matcher actually behave well versus
where their behaviour drifts from the design intent.

The validation is reproducible. To re-run:

```bash
python scripts/phase9_first_real_use.py
```

This writes an isolated SQLite + Chroma store under
`data/validation/phase9-first-use/` (gitignored), seeds three active
memories, then runs eight sample interactions through the full
capture → reinforce → extract pipeline.

## What we ran

Eight synthetic interactions, each paraphrased from a real working
session about AtoCore itself or the active engineering projects:

| # | Label | Project | Expected |
|---|--------------------------------------|----------------------|---------------------------|
| 1 | exdev-mount-merge-decision | atocore | 1 decision_heading |
| 2 | ownership-was-the-real-fix | atocore | 1 fact_heading |
| 3 | memory-vs-entity-canonical-home | atocore | 1 decision_heading (long) |
| 4 | auto-promotion-deferred | atocore | 1 decision_heading |
| 5 | preference-rebase-workflow | atocore | 1 preference_sentence |
| 6 | constraint-from-doc-cite | p05-interferometer | 1 constraint_heading |
| 7 | prose-only-no-cues | atocore | 0 candidates |
| 8 | multiple-cues-in-one-interaction | p06-polisher | 3 distinct rules |

Three seed memories were inserted before the run:

- `pref_rebase`: "prefers rebase-based workflows because history stays linear" (preference, 0.6)
- `pref_concise`: "writes commit messages focused on the why, not the what" (preference, 0.6)
- `identity_runs_atocore`: "mechanical engineer who runs AtoCore for context engineering" (identity, 0.9)

## What happened — extraction (the good news)

**Every extraction expectation was met exactly.** All eight samples
produced the predicted candidate count and the predicted rule
classifications:

| Sample | Expected | Got | Pass |
|---------------------------------------|----------|-----|------|
| exdev-mount-merge-decision | 1 | 1 | ✅ |
| ownership-was-the-real-fix | 1 | 1 | ✅ |
| memory-vs-entity-canonical-home | 1 | 1 | ✅ |
| auto-promotion-deferred | 1 | 1 | ✅ |
| preference-rebase-workflow | 1 | 1 | ✅ |
| constraint-from-doc-cite | 1 | 1 | ✅ |
| prose-only-no-cues | **0** | **0** | ✅ |
| multiple-cues-in-one-interaction | 3 | 3 | ✅ |

**Total: 9 candidates from 8 interactions, 0 false positives, 0 misses
on heading patterns or sentence patterns.**

The extractor's strictness is well-tuned for the kinds of structural
cues we actually use. Things worth noting:

- **Sample 7 (`prose-only-no-cues`) produced zero candidates as
  designed.** This is the most important sanity check — it confirms
  the extractor won't fill the review queue with general prose when
  there's no structural intent.
- **Sample 3's long content was preserved without truncation.** The
  280-char max wasn't hit, and the content kept its full meaning.
- **Sample 8 produced three distinct rules in one interaction**
  (decision_heading, constraint_heading, requirement_heading) without
  the dedup key collapsing them. The dedup key is
  `(memory_type, normalized_content, rule)` and the three are all
  different on at least one axis, so they coexist as expected.
- **The prose around each heading was correctly ignored.** Sample 6
  has a second sentence ("the error budget allocates 6 nm to the
  laser source...") that does NOT have a structural cue, and the
  extractor correctly didn't fire on it.

## What happened — reinforcement (the empirical finding)

**Reinforcement matched zero seeded memories across all 8 samples,
even when the response clearly echoed the seed.**

Sample 5's response was:

> *"I prefer rebase-based workflows because the history stays linear
> and reviewers have an easier time."*

The seeded `pref_rebase` memory was:

> *"prefers rebase-based workflows because history stays linear"*

A human reading both says these are the same fact. The reinforcement
matcher disagrees. After all 8 interactions:

```
pref_rebase:            confidence=0.6000  refs=0  last=-
pref_concise:           confidence=0.6000  refs=0  last=-
identity_runs_atocore:  confidence=0.9000  refs=0  last=-
```

**Nothing moved.** This is the most important finding from this
validation pass.

### Why the matcher missed it

The current `_memory_matches` rule (in
`src/atocore/memory/reinforcement.py`) does a normalized substring
match: it lowercases both sides, collapses whitespace, then asks
"does the leading 80-char window of the memory content appear as a
substring in the response?"

For the rebase example:

- needle (normalized): `prefers rebase-based workflows because history stays linear`
- haystack (normalized): `i prefer rebase-based workflows because the history stays linear and reviewers have an easier time.`

The needle starts with `prefers` (with the trailing `s`), and the
haystack has `prefer` (without the `s`, because of the first-person
voice). And the needle has `because history stays linear`, while the
haystack has `because the history stays linear`. **Two small natural
paraphrases, and the substring fails.**

This isn't a bug in the matcher's implementation — it's doing
exactly what it was specified to do. It's a design limitation: the
substring rule is too brittle for real prose, where the same fact
gets re-stated with different verb forms, articles, and word order.

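The failure is easy to reproduce. A minimal reconstruction of the rule (the real `_memory_matches` lives in `src/atocore/memory/reinforcement.py`; this sketch only mirrors the normalize-then-substring behaviour described above):

```python
import re

def _normalize(text: str) -> str:
    # lowercase and collapse whitespace, as the current matcher does
    return re.sub(r"\s+", " ", text.lower()).strip()

def substring_matches(memory: str, response: str) -> bool:
    # the leading 80-char window of the memory must appear verbatim
    needle = _normalize(memory)[:80]
    return needle in _normalize(response)

memory = "prefers rebase-based workflows because history stays linear"
response = ("I prefer rebase-based workflows because the history stays "
            "linear and reviewers have an easier time.")
print(substring_matches(memory, response))  # → False: the paraphrase defeats the rule
```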
### Severity

**Medium-high.** Reinforcement is the entire point of Commit B.
A reinforcement matcher that never fires on natural paraphrases
will leave seeded memories with stale confidence forever. The
reflection loop runs but it doesn't actually reinforce anything.
That hollows out the value of having reinforcement at all.

It is not a critical bug because:

- Nothing breaks. The pipeline still runs cleanly.
- Reinforcement is supposed to be a *signal*, not the only path to
  high confidence — humans can still curate confidence directly.
- The candidate-extraction path (Commit C) is unaffected and works
  perfectly.

But it does need to be addressed before Phase 9 can be considered
operationally complete.

## Recommended fix (deferred to a follow-up commit)

Replace the substring matcher with a token-overlap matcher. The
specification:

1. Tokenize both memory content and response into lowercase words
   of length >= 3, dropping a small stop list (`the`, `a`, `an`,
   `and`, `or`, `of`, `to`, `is`, `was`, `that`, `this`, `with`,
   `for`, `from`, `into`).
2. Stem aggressively (or at minimum, fold trailing `s` and `ed`
   so `prefers`/`prefer`/`preferred` collapse to one token).
3. A match exists if **at least 70% of the memory's content
   tokens** appear in the response token set.
4. Memory content must still be at least `_MIN_MEMORY_CONTENT_LENGTH`
   characters to be considered.

This is more permissive than the substring rule but still tight
enough to avoid spurious matches on generic words. It would have
caught the rebase example because:

- memory tokens (after stop-list and stemming):
  `{prefer, rebase-bas, workflow, because, history, stay, linear}`
- response tokens:
  `{prefer, rebase-bas, workflow, because, history, stay, linear,
  reviewer, easi, time}`
- overlap: 7 / 7 memory tokens = 100% > 70% threshold → match

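The four steps above can be sketched directly (the stop list and the minimal s/ed folding come from the spec; the value of `_MIN_MEMORY_CONTENT_LENGTH` is assumed here, and the follow-up commit may choose a different tokenizer or stemmer):

```python
import re

_STOP = {"the", "a", "an", "and", "or", "of", "to", "is", "was",
         "that", "this", "with", "for", "from", "into"}
_MIN_MEMORY_CONTENT_LENGTH = 20  # assumed value; the real constant lives in reinforcement.py

def _tokens(text: str) -> set[str]:
    words = re.findall(r"[a-z0-9][a-z0-9-]*", text.lower())
    out = set()
    for w in words:
        if len(w) < 3 or w in _STOP:
            continue
        # minimal stemming: fold trailing "s"/"ed" so prefers/prefer/preferred collapse
        if w.endswith("ed"):
            w = w[:-2]
        elif w.endswith("s"):
            w = w[:-1]
        out.add(w)
    return out

def token_overlap_matches(memory: str, response: str, threshold: float = 0.7) -> bool:
    if len(memory) < _MIN_MEMORY_CONTENT_LENGTH:
        return False
    mem = _tokens(memory)
    if not mem:
        return False
    return len(mem & _tokens(response)) / len(mem) >= threshold

memory = "prefers rebase-based workflows because history stays linear"
response = ("I prefer rebase-based workflows because the history stays "
            "linear and reviewers have an easier time.")
print(token_overlap_matches(memory, response))  # → True: 7/7 memory tokens overlap
```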
### Why not fix it in this report

Three reasons:

1. The validation report is supposed to be evidence, not a fix
   spec. A separate commit will introduce the new matcher with
   its own tests.
2. The token-overlap matcher needs its own design review for edge
   cases (very long memories, very short responses, technical
   abbreviations, code snippets in responses).
3. Mixing the report and the fix into one commit would muddle the
   audit trail. The report is the empirical evidence; the fix is
   the response.

The fix is queued as the next Phase 9 maintenance commit and is
flagged in the next-steps section below.

## Other observations

### Extraction is conservative on purpose, and that's working

Sample 7 is the most important data point in the whole run.
A natural prose response with no structural cues produced zero
candidates. **This is exactly the design intent** — the extractor
should be loud about explicit decisions/constraints/requirements
and quiet about everything else. If the extractor were too loose,
the review queue would fill up with low-value items and the human
would stop reviewing.

After this run I have measurably more confidence that the V0 rule
set is the right starting point. Future rules can be added one at
a time as we see specific patterns the extractor misses, instead of
guessing at what might be useful.

### Confidence on candidates

All extracted candidates landed at the default `confidence=0.5`,
which is what the extractor is currently hardcoded to do. The
`promotion-rules.md` doc proposes a per-rule prior with a
structural-signal multiplier and freshness bonus. None of that is
implemented yet. The validation didn't reveal any urgency around
this — humans review the candidates either way — but it confirms
that the priors-and-multipliers refinement is a reasonable next
step rather than a critical one.

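For concreteness, the proposed scheme could look something like the sketch below. Nothing here is implemented; the rule priors, the multiplier, and the bonus values are all invented for illustration, not taken from `promotion-rules.md`:

```python
# Hypothetical per-rule priors — invented numbers, not the real spec.
RULE_PRIOR = {
    "decision_heading": 0.6,
    "constraint_heading": 0.55,
    "preference_sentence": 0.5,
}

def candidate_confidence(rule: str, has_structural_signal: bool, age_days: float) -> float:
    prior = RULE_PRIOR.get(rule, 0.5)
    if has_structural_signal:
        prior *= 1.1   # structural-signal multiplier (illustrative)
    if age_days < 7:
        prior += 0.05  # freshness bonus for recent interactions (illustrative)
    return min(prior, 1.0)
```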
### Multiple cues in one interaction

Sample 8 confirmed an important property: **three structural
cues in the same response do not collide in dedup**. The dedup
key is `(memory_type, normalized_content, rule)`, and since each
cue produced a distinct (type, content, rule) tuple, all three
landed cleanly.

This matters because real working sessions naturally bundle
multiple decisions/constraints/requirements into one summary.
The extractor handles those bundles correctly.

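The dedup key described above can be sketched as a set of tuples (the candidate dicts and their contents here are invented stand-ins; only the key shape mirrors the real logic):

```python
def dedup(candidates):
    # key: (memory_type, normalized_content, rule) — a candidate survives
    # if it differs from everything seen so far on at least one axis
    seen, kept = set(), []
    for c in candidates:
        key = (c["memory_type"], " ".join(c["content"].lower().split()), c["rule"])
        if key not in seen:
            seen.add(key)
            kept.append(c)
    return kept

sample8 = [
    {"memory_type": "decision", "content": "Ship fixture rev B", "rule": "decision_heading"},
    {"memory_type": "constraint", "content": "Runout under 2 um", "rule": "constraint_heading"},
    {"memory_type": "requirement", "content": "Log every pass", "rule": "requirement_heading"},
]
print(len(dedup(sample8)))  # → 3: distinct on every axis, none collapsed
```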
### Project scoping

Each candidate carries the `project` from the source interaction
into its own `project` field. Sample 6 (p05) and sample 8 (p06)
both produced candidates with the right project. This is
non-obvious because the extractor module never explicitly looks
at project — it inherits from the interaction it's scanning. Worth
keeping in mind when the entity extractor is built: the same
pattern should apply.

## What this validates and what it doesn't

### Validates

- The Phase 9 Commit C extractor's rule set is well-tuned for
  hand-written structural cues
- The dedup logic does the right thing across multiple cues
- The "drop candidates that match an existing active memory" filter
  works (would have been visible if any seeded memory had matched
  one of the heading texts — none did, but the code path is the
  same one that's covered in `tests/test_extractor.py`)
- The `prose-only-no-cues` no-fire case is solid
- Long content is preserved without truncation
- Project scoping flows through the pipeline

### Does NOT validate

- The reinforcement matcher (clearly, since it caught nothing)
- The behaviour against very long documents (each sample was
  under 700 chars; real interaction responses can be 10× that)
- The behaviour against responses that contain code blocks (the
  extractor's regex rules don't handle code-block fenced sections
  specially)
- Cross-interaction promotion-to-active flow (no candidate was
  promoted in this run; the lifecycle is covered by the unit tests
  but not by this empirical exercise)
- The behaviour at scale: 8 interactions is a one-shot. We need
  to see the queue after 50+ before judging reviewer ergonomics.

### Recommended next empirical exercises

1. **Real conversation capture**, using a slash command from a
   real Claude Code session against either a local or Dalidou
   AtoCore instance. The synthetic responses in this script are
   honest paraphrases but they're still hand-curated.
2. **Bulk capture from existing PKM**, ingesting a few real
   project notes through the extractor as if they were
   interactions. This stresses the rules against documents that
   weren't written with the extractor in mind.
3. **Reinforcement matcher rerun** after the token-overlap
   matcher lands.

## Action items from this report

- [ ] **Fix reinforcement matcher** with the token-overlap rule
      described in the "Recommended fix" section above. Owner:
      next session. Severity: medium-high.
- [x] **Document the extractor's V0 strictness** as a working
      property, not a limitation. Sample 7 makes the case.
- [ ] **Build the slash command** so the next validation run
      can use real (not synthetic) interactions. Tracked in
      Session 2 of the current planning sprint.
- [ ] **Run a 50+ interaction batch** to evaluate reviewer
      ergonomics. Deferred until the slash command exists.

## Reproducibility

The script is deterministic. Re-running it will produce
identical results because:

- the data dir is wiped on every run
- the sample interactions are constants
- memory UUID generation is non-deterministic, but the fields that
  matter (content, type, count, rule) are fully determined by the inputs
- the `data/validation/phase9-first-use/` directory is gitignored,
  so no state leaks across runs

To reproduce this exact report:

```bash
python scripts/phase9_first_real_use.py
```

To get JSON output for downstream tooling:

```bash
python scripts/phase9_first_real_use.py --json
```