phase9 first-real-use validation + small hygiene wins
Session 1 of the four-session plan. Empirically exercises the Phase 9
loop (capture -> reinforce -> extract) for the first time and lands
three small hygiene fixes.
Validation script + report
--------------------------
scripts/phase9_first_real_use.py — reproducible script that:
- sets up an isolated SQLite + Chroma store under
data/validation/phase9-first-use (gitignored)
- seeds 3 active memories
- runs 8 sample interactions through capture + reinforce + extract
- prints what each step produced and reinforcement state at the end
- supports --json output for downstream tooling
docs/phase9-first-real-use.md — narrative report of the run with:
- extraction results table (8/8 expectations met exactly)
- the empirical finding that REINFORCEMENT MATCHED ZERO seeds
despite sample 5 clearly echoing the rebase preference memory
- root cause analysis: the substring matcher is too brittle for
natural paraphrases (e.g. "prefers" vs "I prefer", "history"
vs "the history")
- recommended fix: replace substring matcher with a token-overlap
matcher (>=70% of memory tokens present in response, with
light stemming and a small stop list)
- explicit note that the fix is queued as a follow-up commit, not
bundled into the report — keeps the audit trail clean
Key extraction results from the run:
- all 7 heading/sentence rules fired correctly
- 0 false positives on the prose-only sample (the most important
sanity check)
- long content preserved without truncation
- dedup correctly kept three distinct cues from one interaction
- project scoping flowed cleanly through the pipeline
Hygiene 1: FastAPI lifespan migration (src/atocore/main.py)
- Replaced @app.on_event("startup") with the modern @asynccontextmanager
lifespan handler
- Same setup work (setup_logging, ensure_runtime_dirs, init_db,
init_project_state_schema, startup_ready log)
- Removes the two on_event deprecation warnings from every test run
- Test suite now shows 1 warning instead of 3
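For reference, the shape of the new handler (a sketch only: the real lifespan runs the setup calls listed above, and `state_ready` here is a hypothetical stand-in):

```python
from contextlib import asynccontextmanager

@asynccontextmanager
async def lifespan(app):
    # startup work that used to live in @app.on_event("startup"):
    # setup_logging(); ensure_runtime_dirs(); init_db(); ...
    app.state_ready = True  # hypothetical stand-in for the real setup calls
    yield
    # shutdown work (none needed yet) would go after the yield

# wired up as: app = FastAPI(lifespan=lifespan)
```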
Hygiene 2: EXTRACTOR_VERSION constant (src/atocore/memory/extractor.py)
- Added EXTRACTOR_VERSION = "0.1.0" with a versioned change log comment
- MemoryCandidate dataclass carries extractor_version on every candidate
- POST /interactions/{id}/extract response now includes extractor_version
on both the top level (current run) and on each candidate
- Implements the versioning requirement called out in
docs/architecture/promotion-rules.md so old candidates can be
identified and re-evaluated when the rule set evolves
Hygiene 3: ~/.git-credentials cleanup (out-of-tree, not committed)
- Removed the dead OAUTH_USER:<jwt> line for dalidou:3000 that was
being silently rewritten by the system credential manager on every
push attempt
- Configured credential.http://dalidou:3000.helper with the empty-string
sentinel pattern so the URL-specific helper chain is exactly
["", store] instead of inheriting the system-level "manager" helper
that ships with Git for Windows
- Same fix for the 100.80.199.40 (Tailscale) entry
- Verified end to end: a fresh push using only the cleaned credentials
file (no embedded URL) authenticates as Antoine and lands cleanly
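The sentinel pattern, for reference (host shown as above; per the gitcredentials docs, an empty helper value resets the inherited helper list for that context):

```shell
# Reset the URL-specific helper chain with the empty-string sentinel,
# then add only the plain-text store helper for this host.
git config --global credential.http://dalidou:3000.helper ""
git config --global --add credential.http://dalidou:3000.helper store
```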
Full suite: 160 passing (no change from previous), 1 warning
(was 3) thanks to the lifespan migration.
docs/phase9-first-real-use.md (new file, +321 lines)
# Phase 9 First Real Use Report

## What this is

The first empirical exercise of the Phase 9 reflection loop after
Commits A, B, and C all landed. The goal is to find out where the
extractor and the reinforcement matcher actually behave well versus
where their behaviour drifts from the design intent.

The validation is reproducible. To re-run:

```bash
python scripts/phase9_first_real_use.py
```

This writes an isolated SQLite + Chroma store under
`data/validation/phase9-first-use/` (gitignored), seeds three active
memories, then runs eight sample interactions through the full
capture → reinforce → extract pipeline.

## What we ran

Eight synthetic interactions, each paraphrased from a real working
session about AtoCore itself or the active engineering projects:

| # | Label | Project | Expected |
|---|--------------------------------------|----------------------|---------------------------|
| 1 | exdev-mount-merge-decision | atocore | 1 decision_heading |
| 2 | ownership-was-the-real-fix | atocore | 1 fact_heading |
| 3 | memory-vs-entity-canonical-home | atocore | 1 decision_heading (long) |
| 4 | auto-promotion-deferred | atocore | 1 decision_heading |
| 5 | preference-rebase-workflow | atocore | 1 preference_sentence |
| 6 | constraint-from-doc-cite | p05-interferometer | 1 constraint_heading |
| 7 | prose-only-no-cues | atocore | 0 candidates |
| 8 | multiple-cues-in-one-interaction | p06-polisher | 3 distinct rules |

Three seed memories were inserted before the run:

- `pref_rebase`: "prefers rebase-based workflows because history stays linear" (preference, 0.6)
- `pref_concise`: "writes commit messages focused on the why, not the what" (preference, 0.6)
- `identity_runs_atocore`: "mechanical engineer who runs AtoCore for context engineering" (identity, 0.9)

## What happened — extraction (the good news)

**Every extraction expectation was met exactly.** All eight samples
produced the predicted candidate count and the predicted rule
classifications:

| Sample | Expected | Got | Pass |
|---------------------------------------|----------|-----|------|
| exdev-mount-merge-decision | 1 | 1 | ✅ |
| ownership-was-the-real-fix | 1 | 1 | ✅ |
| memory-vs-entity-canonical-home | 1 | 1 | ✅ |
| auto-promotion-deferred | 1 | 1 | ✅ |
| preference-rebase-workflow | 1 | 1 | ✅ |
| constraint-from-doc-cite | 1 | 1 | ✅ |
| prose-only-no-cues | **0** | **0** | ✅ |
| multiple-cues-in-one-interaction | 3 | 3 | ✅ |

**Total: 9 candidates from 8 interactions, 0 false positives, 0 misses
on heading patterns or sentence patterns.**

The extractor's strictness is well-tuned for the kinds of structural
cues we actually use. Things worth noting:

- **Sample 7 (`prose-only-no-cues`) produced zero candidates as
  designed.** This is the most important sanity check — it confirms
  the extractor won't fill the review queue with general prose when
  there's no structural intent.
- **Sample 3's long content was preserved without truncation.** The
  280-char max wasn't hit, and the content kept its full meaning.
- **Sample 8 produced three distinct rules in one interaction**
  (decision_heading, constraint_heading, requirement_heading) without
  the dedup key collapsing them. The dedup key is
  `(memory_type, normalized_content, rule)` and the three are all
  different on at least one axis, so they coexist as expected.
- **The prose around each heading was correctly ignored.** Sample 6
  has a second sentence ("the error budget allocates 6 nm to the
  laser source...") that does NOT have a structural cue, and the
  extractor correctly didn't fire on it.
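These heading cues are simple to express. A minimal sketch of what a rule like `decision_heading` might look like (illustrative name and pattern, not the extractor's actual code):

```python
import re

# Illustrative sketch of one heading-cue rule: fire only on an explicit
# "## Decision: ..." (or "### Decision: ...") heading line, never on prose.
DECISION_HEADING = re.compile(
    r"^#{2,3}\s*Decision:\s*(?P<content>.+?)\s*$", re.MULTILINE
)

def decision_candidates(response: str) -> list[str]:
    """Return the text of every decision heading in a response."""
    return [m.group("content") for m in DECISION_HEADING.finditer(response)]
```

On sample 1's response this yields the single heading text; on the prose-only sample it yields nothing, which is the no-fire behaviour the report highlights.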

## What happened — reinforcement (the empirical finding)

**Reinforcement matched zero seeded memories across all 8 samples,
even when the response clearly echoed the seed.**

Sample 5's response was:

> *"I prefer rebase-based workflows because the history stays linear
> and reviewers have an easier time."*

The seeded `pref_rebase` memory was:

> *"prefers rebase-based workflows because history stays linear"*

A human reading both says these are the same fact. The reinforcement
matcher disagrees. After all 8 interactions:

```
pref_rebase: confidence=0.6000 refs=0 last=-
pref_concise: confidence=0.6000 refs=0 last=-
identity_runs_atocore: confidence=0.9000 refs=0 last=-
```

**Nothing moved.** This is the most important finding from this
validation pass.

### Why the matcher missed it

The current `_memory_matches` rule (in
`src/atocore/memory/reinforcement.py`) does a normalized substring
match: it lowercases both sides, collapses whitespace, then asks
"does the leading 80-char window of the memory content appear as a
substring in the response?"

For the rebase example:

- needle (normalized): `prefers rebase-based workflows because history stays linear`
- haystack (normalized): `i prefer rebase-based workflows because the history stays linear and reviewers have an easier time.`

The needle starts with `prefers` (with the trailing `s`), and the
haystack has `prefer` (without the `s`, because of the first-person
voice). And the needle has `because history stays linear`, while the
haystack has `because the history stays linear`. **Two small natural
paraphrases, and the substring fails.**

This isn't a bug in the matcher's implementation — it's doing
exactly what it was specified to do. It's a design limitation: the
substring rule is too brittle for real prose, where the same fact
gets re-stated with different verb forms, articles, and word order.
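The failure is easy to reproduce in isolation. A minimal sketch of the substring rule as described above (names are illustrative, not the exact code in `reinforcement.py`):

```python
import re

def _normalize(text: str) -> str:
    # Lowercase and collapse runs of whitespace, as the matcher does.
    return re.sub(r"\s+", " ", text.lower()).strip()

def memory_matches_substring(memory_content: str, response: str) -> bool:
    # The leading 80-char window of the memory must appear verbatim.
    needle = _normalize(memory_content)[:80]
    return needle in _normalize(response)
```

On the rebase pair this returns `False`: the `prefers`/`prefer` and `history`/`the history` differences break the verbatim window, exactly as the run showed.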

### Severity

**Medium-high.** Reinforcement is the entire point of Commit B.
A reinforcement matcher that never fires on natural paraphrases
will leave seeded memories with stale confidence forever. The
reflection loop runs but it doesn't actually reinforce anything.
That hollows out the value of having reinforcement at all.

It is not a critical bug because:

- Nothing breaks. The pipeline still runs cleanly.
- Reinforcement is supposed to be a *signal*, not the only path to
  high confidence — humans can still curate confidence directly.
- The candidate-extraction path (Commit C) is unaffected and works
  perfectly.

But it does need to be addressed before Phase 9 can be considered
operationally complete.

## Recommended fix (deferred to a follow-up commit)

Replace the substring matcher with a token-overlap matcher. The
specification:

1. Tokenize both memory content and response into lowercase words
   of length >= 3, dropping a small stop list (`the`, `a`, `an`,
   `and`, `or`, `of`, `to`, `is`, `was`, `that`, `this`, `with`,
   `for`, `from`, `into`).
2. Stem aggressively (or at minimum, fold trailing `s` and `ed`
   so `prefers`/`prefer`/`preferred` collapse to one token).
3. A match exists if **at least 70% of the memory's content
   tokens** appear in the response token set.
4. Memory content must still be at least `_MIN_MEMORY_CONTENT_LENGTH`
   characters to be considered.

This is more permissive than the substring rule but still tight
enough to avoid spurious matches on generic words. It would have
caught the rebase example because:

- memory tokens (after stop-list and stemming):
  `{prefer, rebase-bas, workflow, because, history, stay, linear}`
- response tokens:
  `{prefer, rebase-bas, workflow, because, history, stay, linear,
  reviewer, easi, time}`
- overlap: 7 / 7 memory tokens = 100%, above the 70% threshold → match
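A minimal sketch of the proposed matcher, assuming naive suffix folding in place of a real stemmer (names and details are illustrative, pending the design review noted below):

```python
import re

_STOP = {"the", "a", "an", "and", "or", "of", "to", "is", "was",
         "that", "this", "with", "for", "from", "into"}

def _tokens(text: str) -> set[str]:
    """Lowercase words of length >= 3, stop list dropped, suffixes folded."""
    out = set()
    for w in re.findall(r"[a-z0-9-]+", text.lower()):
        if len(w) < 3 or w in _STOP:
            continue
        # Naive suffix folding so prefers/prefer and based/bas collapse.
        if w.endswith("ed"):
            w = w[:-2]
        elif w.endswith("s"):
            w = w[:-1]
        out.add(w)
    return out

def memory_matches_overlap(memory_content: str, response: str,
                           threshold: float = 0.70) -> bool:
    mem = _tokens(memory_content)
    if not mem:
        return False
    return len(mem & _tokens(response)) / len(mem) >= threshold
```

On the rebase pair above this returns `True` (all 7 memory tokens are found in the response), while unrelated prose stays well below the threshold.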

### Why not fix it in this report

Three reasons:

1. The validation report is supposed to be evidence, not a fix
   spec. A separate commit will introduce the new matcher with
   its own tests.
2. The token-overlap matcher needs its own design review for edge
   cases (very long memories, very short responses, technical
   abbreviations, code snippets in responses).
3. Mixing the report and the fix into one commit would muddle the
   audit trail. The report is the empirical evidence; the fix is
   the response.

The fix is queued as the next Phase 9 maintenance commit and is
flagged in the next-steps section below.

## Other observations

### Extraction is conservative on purpose, and that's working

Sample 7 is the most important data point in the whole run.
A natural prose response with no structural cues produced zero
candidates. **This is exactly the design intent** — the extractor
should be loud about explicit decisions/constraints/requirements
and quiet about everything else. If the extractor were too loose,
the review queue would fill up with low-value items and the human
would stop reviewing.

After this run I have measurably more confidence that the V0 rule
set is the right starting point. Future rules can be added one at
a time as we see specific patterns the extractor misses, instead of
guessing at what might be useful.

### Confidence on candidates

All extracted candidates landed at the default `confidence=0.5`,
which is what the extractor is currently hardcoded to do. The
`promotion-rules.md` doc proposes a per-rule prior with a
structural-signal multiplier and freshness bonus. None of that is
implemented yet. The validation didn't reveal any urgency around
this — humans review the candidates either way — but it confirms
that the priors-and-multipliers refinement is a reasonable next
step rather than a critical one.

### Multiple cues in one interaction

Sample 8 confirmed an important property: **three structural
cues in the same response do not collide in dedup**. The dedup
key is `(memory_type, normalized_content, rule)`, and since each
cue produced a distinct (type, content, rule) tuple, all three
landed cleanly.

This matters because real working sessions naturally bundle
multiple decisions/constraints/requirements into one summary.
The extractor handles those bundles correctly.
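The dedup behaviour can be sketched with the key tuple described above (a simplified illustration, not the extractor's actual code; the sample 8 candidates are from the run):

```python
def dedup(candidates):
    """Keep the first candidate per (memory_type, normalized_content, rule) key."""
    seen, kept = set(), []
    for c in candidates:
        key = (c["memory_type"], c["content"].strip().lower(), c["rule"])
        if key not in seen:
            seen.add(key)
            kept.append(c)
    return kept

sample8 = [
    {"memory_type": "adaptation",
     "content": "defer the laser interlock redesign to after the July milestone",
     "rule": "decision_heading"},
    {"memory_type": "project",
     "content": "the calibration routine must complete in under 90 seconds for production use",
     "rule": "constraint_heading"},
    {"memory_type": "project",
     "content": "the polisher must hold position to within 0.5 micron at 1 g loading",
     "rule": "requirement_heading"},
]
```

All three survive `dedup` because each tuple differs on at least one axis, while a literal repeat of any candidate would collapse to one entry.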

### Project scoping

Each candidate carries the `project` from the source interaction
into its own `project` field. Sample 6 (p05) and sample 8 (p06)
both produced candidates with the right project. This is
non-obvious because the extractor module never explicitly looks
at project — it inherits from the interaction it's scanning. Worth
keeping in mind when the entity extractor is built: the same
pattern should apply.

## What this validates and what it doesn't

### Validates

- The Phase 9 Commit C extractor's rule set is well-tuned for
  hand-written structural cues
- The dedup logic does the right thing across multiple cues
- The "drop candidates that match an existing active memory" filter
  works (would have been visible if any seeded memory had matched
  one of the heading texts — none did, but the code path is the
  same one that's covered in `tests/test_extractor.py`)
- The `prose-only-no-cues` no-fire case is solid
- Long content is preserved without truncation
- Project scoping flows through the pipeline

### Does NOT validate

- The reinforcement matcher (clearly, since it caught nothing)
- The behaviour against very long documents (each sample was
  under 700 chars; real interaction responses can be 10× that)
- The behaviour against responses that contain code blocks (the
  extractor's regex rules don't handle code-block fenced sections
  specially)
- Cross-interaction promotion-to-active flow (no candidate was
  promoted in this run; the lifecycle is covered by the unit tests
  but not by this empirical exercise)
- The behaviour at scale: 8 interactions is a one-shot. We need
  to see the queue after 50+ before judging reviewer ergonomics.

### Recommended next empirical exercises

1. **Real conversation capture**, using a slash command from a
   real Claude Code session against either a local or Dalidou
   AtoCore instance. The synthetic responses in this script are
   honest paraphrases but they're still hand-curated.
2. **Bulk capture from existing PKM**, ingesting a few real
   project notes through the extractor as if they were
   interactions. This stresses the rules against documents that
   weren't written with the extractor in mind.
3. **Reinforcement matcher rerun** after the token-overlap
   matcher lands.

## Action items from this report

- [ ] **Fix reinforcement matcher** with the token-overlap rule
      described in the "Recommended fix" section above. Owner:
      next session. Severity: medium-high.
- [x] **Document the extractor's V0 strictness** as a working
      property, not a limitation. Sample 7 makes the case.
- [ ] **Build the slash command** so the next validation run
      can use real (not synthetic) interactions. Tracked in
      Session 2 of the current planning sprint.
- [ ] **Run a 50+ interaction batch** to evaluate reviewer
      ergonomics. Deferred until the slash command exists.

## Reproducibility

The script is deterministic in every field that matters.
Re-running it will produce equivalent results because:

- the data dir is wiped on every run
- the sample interactions are constants
- memory UUID generation is non-deterministic, but the fields that
  matter (content, type, count, rule) are fully determined by the
  inputs
- the `data/validation/phase9-first-use/` directory is gitignored,
  so no state leaks across runs

To reproduce this exact report:

```bash
python scripts/phase9_first_real_use.py
```

To get JSON output for downstream tooling:

```bash
python scripts/phase9_first_real_use.py --json
```
scripts/phase9_first_real_use.py (new file, +393 lines)
"""Phase 9 first-real-use validation script.

Captures a small set of representative interactions drawn from a real
working session, runs the full Phase 9 loop (capture -> reinforce ->
extract) over them, and prints what each step produced. The intent is
to generate empirical evidence about the extractor's behaviour against
prose that wasn't written to make the test pass.

Usage:
    python scripts/phase9_first_real_use.py [--data-dir PATH]

The script writes a fresh isolated SQLite + Chroma store under the
given data dir (default: ./data/validation/phase9-first-use). The
data dir is gitignored so the script can be re-run cleanly.

Each interaction is printed with:
- the captured interaction id
- the reinforcement results (which seeded memories were echoed)
- the extraction results (which candidates were proposed and why)
- notes on what the extractor MISSED (manually annotated below)

The output is intentionally human-readable so the run can be saved as
the body of docs/phase9-first-real-use.md.
"""

from __future__ import annotations

import argparse
import json
import os
import shutil
import sys
from dataclasses import dataclass
from pathlib import Path

# Make the package importable when run directly from the repo root.
_REPO_ROOT = Path(__file__).resolve().parent.parent
sys.path.insert(0, str(_REPO_ROOT / "src"))


@dataclass
class SampleInteraction:
    label: str
    prompt: str
    response: str
    project: str
    notes_for_human: str


# ---------------------------------------------------------------------------
# Sample interactions, paraphrased from real working sessions on AtoCore
# itself. Each one is a real-feeling exchange with at least one structural
# cue the extractor should catch, plus prose the extractor should NOT
# accidentally catch.
# ---------------------------------------------------------------------------

SAMPLES: list[SampleInteraction] = [
    SampleInteraction(
        label="exdev-mount-merge-decision",
        project="atocore",
        prompt="The TMPDIR fix didn't work, the LOCAL_COPY_PATH change didn't work, what now?",
        response=(
            "We've been chasing this through three layers without success. "
            "Time to stop guessing and apply the simple fix Dalidou Claude "
            "proposed in the very first diagnosis.\n\n"
            "## Decision: merge the two bind mounts into one\n\n"
            "Both /srv/storage/repos and /srv/storage/gitea/data live on "
            "/dev/sdb1, so the move is a same-filesystem rename — instant, "
            "atomic, no data copy. After the move, the EXDEV class of bug "
            "becomes structurally impossible because there is only one "
            "mount inside the container."
        ),
        notes_for_human=(
            "Should fire decision_heading rule. Should NOT fire requirement "
            "or constraint rules. The 'becomes structurally impossible' line "
            "is prose, not a candidate."
        ),
    ),
    SampleInteraction(
        label="ownership-was-the-real-fix",
        project="atocore",
        prompt="Still failing after the mount merge. What's actually wrong?",
        response=(
            "Same rejection even after the mounts were merged, which kills "
            "the EXDEV theory. The error 'unable to migrate objects to "
            "permanent storage' is git's catch-all for any failure in "
            "tmp_objdir_migrate, and the real errno is being thrown away.\n\n"
            "## Fact: the actual root cause was ownership, not cross-device link\n\n"
            "After mv on the host, the moved directory had perms git could "
            "navigate for some subdirs but not others. Recursive chown "
            "git:git plus chmod u+rwX,g+rwX fixed it on the first try."
        ),
        notes_for_human=(
            "Should fire fact_heading rule, target type knowledge. The "
            "'unable to migrate' string is inside prose and should NOT "
            "trigger any rule."
        ),
    ),
    SampleInteraction(
        label="memory-vs-entity-canonical-home",
        project="atocore",
        prompt="Does a Decision live in memories or entities?",
        response=(
            "Both layers exist for different reasons. After thinking through "
            "the ontology question, I'm landing on a clean rule: one "
            "canonical home per concept, with explicit graduation when a "
            "memory eventually deserves structure.\n\n"
            "## Decision: project, knowledge, and adaptation memory types graduate to entities\n\n"
            "The identity, preference, and episodic types stay in the memory "
            "layer because they're personal and unstructured. The other "
            "three become entities once the engineering layer V1 ships. The "
            "original memory rows are NEVER deleted — they become frozen "
            "historical pointers to whichever entity they graduated into."
        ),
        notes_for_human=(
            "Should fire decision_heading rule. Long content (over 100 "
            "chars) — verifies the max-length cap of 280 doesn't truncate."
        ),
    ),
    SampleInteraction(
        label="auto-promotion-deferred",
        project="atocore",
        prompt="Should the engineering layer auto-promote candidates?",
        response=(
            "The trust hierarchy is the whole point of AtoCore. If we let "
            "the system auto-promote, we trade away the 'bad memory is "
            "worse than no memory' invariant for a marginal speed gain.\n\n"
            "## Decision: no auto-promotion in V1, ever\n\n"
            "The schema and API are designed so policies can be added "
            "later without migration, but the V1 stance is that every "
            "promotion needs a human signoff. The hard line: nothing ever "
            "moves into trusted project state automatically, even with the "
            "engineering layer running."
        ),
        notes_for_human=(
            "Should fire decision_heading rule. Has 'invariant' jargon — "
            "verify the extractor doesn't get distracted."
        ),
    ),
    SampleInteraction(
        label="preference-rebase-workflow",
        project="atocore",
        prompt="Should we use rebase or merge commits for this branch?",
        response=(
            "I prefer rebase-based workflows because the history stays "
            "linear and reviewers have an easier time. For the dalidou "
            "storage-foundation branch we'll rebase before merging into "
            "main so the timeline reads as one coherent narrative."
        ),
        notes_for_human=(
            "Should fire preference_sentence rule. Note: the rule looks "
            "for 'I prefer' which is exactly the wording used."
        ),
    ),
    SampleInteraction(
        label="constraint-from-doc-cite",
        project="p05-interferometer",
        prompt="What's the budget situation for p05 right now?",
        response=(
            "Looking at the latest design notes, the constraint is fairly "
            "tight on the optical path side.\n\n"
            "### Constraint: total optical path stability must stay within 10 nm RMS at 1 kHz\n\n"
            "The error budget allocates 6 nm to the laser source and 4 nm "
            "to the mechanical mounts and frame combined."
        ),
        notes_for_human=(
            "Should fire constraint_heading rule with type=project. The "
            "second sentence ('error budget allocates 6 nm') is descriptive "
            "and should NOT fire any rule because it doesn't have a "
            "structural cue."
        ),
    ),
    SampleInteraction(
        label="prose-only-no-cues",
        project="atocore",
        prompt="What should I work on next?",
        response=(
            "Looking at the current state of AtoCore, the next natural "
            "step is to validate Phase 9 against real interactions before "
            "starting the engineering layer implementation. Empirical "
            "evidence is missing and the planning docs would benefit from "
            "real signal."
        ),
        notes_for_human=(
            "Should produce ZERO candidates. Pure recommendation prose, no "
            "structural cues. If this fires anything the extractor is too "
            "loose."
        ),
    ),
    SampleInteraction(
        label="multiple-cues-in-one-interaction",
        project="p06-polisher",
        prompt="Summarize today's polisher session",
        response=(
            "We worked through three things in the polisher session today.\n\n"
            "## Decision: defer the laser interlock redesign to after the July milestone\n\n"
            "## Constraint: the calibration routine must complete in under 90 seconds for production use\n\n"
            "## Requirement: the polisher must hold position to within 0.5 micron at 1 g loading\n\n"
            "Action items captured for the next sync."
        ),
        notes_for_human=(
            "Three rules should fire on the same interaction: "
            "decision_heading -> adaptation, constraint_heading -> project, "
            "requirement_heading -> project. Verify dedup doesn't merge them."
        ),
    ),
]

def setup_environment(data_dir: Path) -> None:
    """Configure AtoCore to use an isolated data directory for this run."""
    if data_dir.exists():
        shutil.rmtree(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    os.environ["ATOCORE_DATA_DIR"] = str(data_dir)
    os.environ.setdefault("ATOCORE_DEBUG", "true")
    # Reset cached settings so the new env vars take effect
    import atocore.config as config

    config.settings = config.Settings()
    import atocore.retrieval.vector_store as vs

    vs._store = None


def seed_memories() -> dict[str, str]:
    """Insert a small set of seed active memories so reinforcement has
    something to match against."""
    from atocore.memory.service import create_memory

    seeded: dict[str, str] = {}
    seeded["pref_rebase"] = create_memory(
        memory_type="preference",
        content="prefers rebase-based workflows because history stays linear",
        confidence=0.6,
    ).id
    seeded["pref_concise"] = create_memory(
        memory_type="preference",
        content="writes commit messages focused on the why, not the what",
        confidence=0.6,
    ).id
    seeded["identity_runs_atocore"] = create_memory(
        memory_type="identity",
        content="mechanical engineer who runs AtoCore for context engineering",
        confidence=0.9,
    ).id
    return seeded

def run_sample(sample: SampleInteraction) -> dict:
    """Capture one sample, run extraction, return a result dict."""
    from atocore.interactions.service import record_interaction
    from atocore.memory.extractor import extract_candidates_from_interaction

    interaction = record_interaction(
        prompt=sample.prompt,
        response=sample.response,
        project=sample.project,
        client="phase9-first-real-use",
        session_id="first-real-use",
        reinforce=True,
    )
    candidates = extract_candidates_from_interaction(interaction)

    return {
        "label": sample.label,
        "project": sample.project,
        "interaction_id": interaction.id,
        "expected_notes": sample.notes_for_human,
        "candidate_count": len(candidates),
        "candidates": [
            {
                "memory_type": c.memory_type,
                "rule": c.rule,
                "content": c.content,
                "source_span": c.source_span[:120],
            }
            for c in candidates
        ],
    }


def report_seed_memory_state(seeded_ids: dict[str, str]) -> dict:
    from atocore.memory.service import get_memories

    state = {}
    for label, mid in seeded_ids.items():
        rows = [m for m in get_memories(limit=200) if m.id == mid]
        if not rows:
            state[label] = None
            continue
        m = rows[0]
        state[label] = {
            "id": m.id,
            "memory_type": m.memory_type,
            "content_preview": m.content[:80],
            "confidence": round(m.confidence, 4),
            "reference_count": m.reference_count,
            "last_referenced_at": m.last_referenced_at,
        }
    return state

def main() -> int:
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument(
|
||||
"--data-dir",
|
||||
default=str(_REPO_ROOT / "data" / "validation" / "phase9-first-use"),
|
||||
help="Isolated data directory to use for this validation run",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--json",
|
||||
action="store_true",
|
||||
help="Emit machine-readable JSON instead of human prose",
|
||||
)
|
||||
args = parser.parse_args()
|
||||
|
||||
data_dir = Path(args.data_dir).resolve()
|
||||
setup_environment(data_dir)
|
||||
|
||||
from atocore.models.database import init_db
|
||||
from atocore.context.project_state import init_project_state_schema
|
||||
|
||||
init_db()
|
||||
init_project_state_schema()
|
||||
|
||||
seeded = seed_memories()
|
||||
sample_results = [run_sample(s) for s in SAMPLES]
|
||||
final_seed_state = report_seed_memory_state(seeded)
|
||||
|
||||
if args.json:
|
||||
json.dump(
|
||||
{
|
||||
"data_dir": str(data_dir),
|
||||
"seeded_memories_initial": list(seeded.keys()),
|
||||
"samples": sample_results,
|
||||
"seed_memory_state_after_run": final_seed_state,
|
||||
},
|
||||
sys.stdout,
|
||||
indent=2,
|
||||
default=str,
|
||||
)
|
||||
return 0
|
||||
|
||||
print("=" * 78)
|
||||
print("Phase 9 first-real-use validation run")
|
||||
print("=" * 78)
|
||||
print(f"Isolated data dir: {data_dir}")
|
||||
print()
|
||||
print("Seeded the memory store with 3 active memories:")
|
||||
for label, mid in seeded.items():
|
||||
print(f" - {label} ({mid[:8]})")
|
||||
print()
|
||||
print("-" * 78)
|
||||
print(f"Running {len(SAMPLES)} sample interactions ...")
|
||||
print("-" * 78)
|
||||
|
||||
for result in sample_results:
|
||||
print()
|
||||
print(f"## {result['label']} [project={result['project']}]")
|
||||
print(f" interaction_id={result['interaction_id'][:8]}")
|
||||
print(f" expected: {result['expected_notes']}")
|
||||
print(f" candidates produced: {result['candidate_count']}")
|
||||
for i, cand in enumerate(result["candidates"], 1):
|
||||
print(
|
||||
f" [{i}] type={cand['memory_type']:11s} "
|
||||
f"rule={cand['rule']:21s} "
|
||||
f"content={cand['content']!r}"
|
||||
)
|
||||
|
||||
print()
|
||||
print("-" * 78)
|
||||
print("Reinforcement state on seeded memories AFTER all interactions:")
|
||||
print("-" * 78)
|
||||
for label, state in final_seed_state.items():
|
||||
if state is None:
|
||||
print(f" {label}: <missing>")
|
||||
continue
|
||||
print(
|
||||
f" {label}: confidence={state['confidence']:.4f} "
|
||||
f"refs={state['reference_count']} "
|
||||
f"last={state['last_referenced_at'] or '-'}"
|
||||
)
|
||||
|
||||
print()
|
||||
print("=" * 78)
|
||||
print("Run complete. Data written to:", data_dir)
|
||||
print("=" * 78)
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
raise SystemExit(main())
|
||||
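The report's recommended fix for the zero-match reinforcement finding can be sketched as a token-overlap matcher. This is a minimal illustration of the queued follow-up, not the landed implementation: the stop list, the suffix-stripping `_stem` helper, and the exact threshold handling are assumptions based on the report's description (>=70% of memory tokens present in the response, with light stemming and a small stop list).

```python
import re

# Placeholder stop list and naive suffix-stripping "stemmer"; the
# follow-up commit may tune both.
_STOP_WORDS = {"a", "an", "the", "i", "to", "of", "and", "is", "in"}


def _stem(token: str) -> str:
    # Light stemming so paraphrases line up, e.g. "prefers" -> "prefer",
    # "rebasing" and "rebase" both -> "rebas".
    for suffix in ("ing", "es", "s", "e"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def _tokens(text: str) -> set[str]:
    words = re.findall(r"[a-z']+", text.lower())
    return {_stem(w) for w in words if w not in _STOP_WORDS}


def memory_is_reinforced(memory_content: str, response_text: str,
                         threshold: float = 0.7) -> bool:
    """True when >= `threshold` of the memory's tokens appear in the response."""
    memory_tokens = _tokens(memory_content)
    if not memory_tokens:
        return False
    overlap = len(memory_tokens & _tokens(response_text))
    return overlap / len(memory_tokens) >= threshold
```

Unlike the substring matcher, this accepts natural paraphrases such as "User prefers rebasing over merging" against a seeded "I prefer rebase over merge" memory, while an unrelated response still scores zero overlap.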
@@ -31,6 +31,7 @@ from atocore.interactions.service import (
     record_interaction,
 )
 from atocore.memory.extractor import (
+    EXTRACTOR_VERSION,
     MemoryCandidate,
     extract_candidates_from_interaction,
 )
@@ -622,6 +623,7 @@ def api_extract_from_interaction(
         "candidate_count": len(candidates),
         "persisted": payload.persist,
         "persisted_ids": persisted_ids,
+        "extractor_version": EXTRACTOR_VERSION,
         "candidates": [
             {
                 "memory_type": c.memory_type,
@@ -630,6 +632,7 @@ def api_extract_from_interaction(
                 "confidence": c.confidence,
                 "rule": c.rule,
                 "source_span": c.source_span,
+                "extractor_version": c.extractor_version,
             }
             for c in candidates
         ],
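For illustration, a response from the extract endpoint after these two hunks could look like the following. Every field value here (rule name, content, spans) is invented for the example, not captured from the validation run; only the shape, with the version stamp at both levels, reflects the diff.

```python
# Hypothetical response shape; values are illustrative only.
response = {
    "candidate_count": 1,
    "persisted": False,
    "persisted_ids": [],
    "extractor_version": "0.1.0",  # version of the extractor that ran
    "candidates": [
        {
            "memory_type": "preference",
            "content": "prefers rebase over merge",
            "confidence": 0.5,
            "rule": "example_rule",
            "source_span": "I prefer rebase over merge",
            "extractor_version": "0.1.0",  # per-candidate stamp
        }
    ],
}

# Top-level and per-candidate stamps agree unless previously persisted
# candidates are replayed after a version bump.
assert all(c["extractor_version"] == response["extractor_version"]
           for c in response["candidates"])
```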
@@ -1,5 +1,7 @@
 """AtoCore — FastAPI application entry point."""

+from contextlib import asynccontextmanager
+
 from fastapi import FastAPI

 from atocore.api.routes import router
@@ -9,18 +11,19 @@ from atocore.ingestion.pipeline import get_source_status
 from atocore.models.database import init_db
 from atocore.observability.logger import get_logger, setup_logging

-app = FastAPI(
-    title="AtoCore",
-    description="Personal Context Engine for LLM interactions",
-    version="0.1.0",
-)
-
-app.include_router(router)
 log = get_logger("main")


-@app.on_event("startup")
-def startup():
+@asynccontextmanager
+async def lifespan(app: FastAPI):
+    """Run setup before the first request and teardown after shutdown.
+
+    Replaces the deprecated ``@app.on_event("startup")`` hook with the
+    modern ``lifespan`` context manager. Setup runs synchronously (the
+    underlying calls are blocking I/O) so no await is needed; the
+    function still must be async per the FastAPI contract.
+    """
     setup_logging()
     _config.ensure_runtime_dirs()
     init_db()
@@ -32,6 +35,19 @@ def startup():
         chroma_path=str(_config.settings.chroma_path),
         source_status=get_source_status(),
     )
+    yield
+    # No teardown work needed today; SQLite connections are short-lived
+    # and the Chroma client cleans itself up on process exit.
+
+
+app = FastAPI(
+    title="AtoCore",
+    description="Personal Context Engine for LLM interactions",
+    version="0.1.0",
+    lifespan=lifespan,
+)
+
+app.include_router(router)

 if __name__ == "__main__":
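Stripped of the FastAPI specifics, the migration above follows the standard asynccontextmanager shape: everything before the yield runs once at startup, everything after it runs at shutdown. A self-contained sketch of that ordering, with a stand-in for the app's lifetime, is:

```python
import asyncio
from contextlib import asynccontextmanager

events = []


@asynccontextmanager
async def lifespan(app):
    # Setup phase: runs once before the app starts serving. Blocking
    # calls (like init_db() in main.py) are fine here; nothing awaits.
    events.append("startup")
    yield
    # Teardown phase: runs after shutdown (empty in main.py today).
    events.append("shutdown")


async def serve():
    # The framework enters/exits the context manager around the app's
    # lifetime; simulated here with a plain `async with`.
    async with lifespan(app=None):
        events.append("serving")


asyncio.run(serve())
print(events)  # ['startup', 'serving', 'shutdown']
```

Because the framework owns entry and exit, teardown is guaranteed to run after the last request, which the old on_event("startup") hook never offered without a second on_event("shutdown") registration.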
@@ -46,6 +46,18 @@ from atocore.observability.logger import get_logger

 log = get_logger("extractor")

+
+# Bumped whenever the rule set, regex shapes, or post-processing
+# semantics change in a way that could affect candidate output. The
+# promotion-rules doc requires every candidate to record the version
+# of the extractor that produced it so old candidates can be re-evaluated
+# (or kept as-is) when the rules evolve.
+#
+# History:
+#   0.1.0 - initial Phase 9 Commit C rule set (Apr 6, 2026)
+EXTRACTOR_VERSION = "0.1.0"
+
+
 # Every candidate is attributed to the rule that fired so reviewers can
 # audit why it was proposed.
 @dataclass
@@ -57,6 +69,7 @@ class MemoryCandidate:
     project: str = ""
     confidence: float = 0.5  # default review-queue confidence
     source_interaction_id: str = ""
+    extractor_version: str = EXTRACTOR_VERSION


# ---------------------------------------------------------------------------
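The point of stamping each candidate is that downstream tooling can later decide which stored candidates were produced by an older rule set. A minimal sketch of that check, using a simplified stand-in for MemoryCandidate and an assumed future version bump (the diff above only lands "0.1.0"; `needs_reextraction` is a hypothetical helper, not part of the commit):

```python
from dataclasses import dataclass

EXTRACTOR_VERSION = "0.1.0"


@dataclass
class Candidate:
    # Simplified stand-in for MemoryCandidate: every instance is stamped
    # with the version of the rules that produced it.
    content: str
    extractor_version: str = EXTRACTOR_VERSION


def needs_reextraction(candidate: Candidate, current_version: str) -> bool:
    # Any mismatch flags the candidate for re-evaluation (or a deliberate
    # keep-as-is decision) against the newer rule set.
    return candidate.extractor_version != current_version


old = Candidate("prefers rebase over merge")  # stamped "0.1.0" by default
print(needs_reextraction(old, "0.1.0"))  # False: rules unchanged
print(needs_reextraction(old, "0.2.0"))  # True: hypothetical future bump
```

A plain inequality suffices while versions only ever move forward; semantic-version comparison would be overkill here.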