phase9 first-real-use validation + small hygiene wins
Session 1 of the four-session plan. Empirically exercises the Phase 9
loop (capture -> reinforce -> extract) for the first time and lands
three small hygiene fixes.
Validation script + report
--------------------------
scripts/phase9_first_real_use.py — reproducible script that:
- sets up an isolated SQLite + Chroma store under
data/validation/phase9-first-use (gitignored)
- seeds 3 active memories
- runs 8 sample interactions through capture + reinforce + extract
- prints what each step produced and reinforcement state at the end
- supports --json output for downstream tooling
docs/phase9-first-real-use.md — narrative report of the run with:
- extraction results table (8/8 expectations met exactly)
- the empirical finding that REINFORCEMENT MATCHED ZERO seeds
despite sample 5 clearly echoing the rebase preference memory
- root cause analysis: the substring matcher is too brittle for
natural paraphrases (e.g. "prefers" vs "I prefer", "history"
vs "the history")
- recommended fix: replace substring matcher with a token-overlap
matcher (>=70% of memory tokens present in response, with
light stemming and a small stop list)
- explicit note that the fix is queued as a follow-up commit, not
bundled into the report — keeps the audit trail clean
Key extraction results from the run:
- all 7 heading/sentence rules fired correctly
- 0 false positives on the prose-only sample (the most important
sanity check)
- long content preserved without truncation
- dedup correctly kept three distinct cues from one interaction
- project scoping flowed cleanly through the pipeline
Hygiene 1: FastAPI lifespan migration (src/atocore/main.py)
- Replaced @app.on_event("startup") with the modern @asynccontextmanager
lifespan handler
- Same setup work (setup_logging, ensure_runtime_dirs, init_db,
init_project_state_schema, startup_ready log)
- Removes the two on_event deprecation warnings from every test run
- Test suite now shows 1 warning instead of 3
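For reference, the shape of the new handler (a sketch only: the real lifespan runs the setup calls listed above, and `state_ready` here is a hypothetical stand-in):

```python
from contextlib import asynccontextmanager

@asynccontextmanager
async def lifespan(app):
    # startup work that used to live in @app.on_event("startup"):
    # setup_logging(); ensure_runtime_dirs(); init_db(); ...
    app.state_ready = True  # hypothetical stand-in for the real setup calls
    yield
    # shutdown work (none needed yet) would go after the yield

# wired up as: app = FastAPI(lifespan=lifespan)
```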
Hygiene 2: EXTRACTOR_VERSION constant (src/atocore/memory/extractor.py)
- Added EXTRACTOR_VERSION = "0.1.0" with a versioned change log comment
- MemoryCandidate dataclass carries extractor_version on every candidate
- POST /interactions/{id}/extract response now includes extractor_version
on both the top level (current run) and on each candidate
- Implements the versioning requirement called out in
docs/architecture/promotion-rules.md so old candidates can be
identified and re-evaluated when the rule set evolves
Hygiene 3: ~/.git-credentials cleanup (out-of-tree, not committed)
- Removed the dead OAUTH_USER:<jwt> line for dalidou:3000 that was
being silently rewritten by the system credential manager on every
push attempt
- Configured credential.http://dalidou:3000.helper with the empty-string
sentinel pattern so the URL-specific helper chain is exactly
["", store] instead of inheriting the system-level "manager" helper
that ships with Git for Windows
- Same fix for the 100.80.199.40 (Tailscale) entry
- Verified end to end: a fresh push using only the cleaned credentials
file (no embedded URL) authenticates as Antoine and lands cleanly
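The sentinel pattern, for reference (host shown as above; per the gitcredentials docs, an empty helper value resets the inherited helper list for that context):

```shell
# Reset the URL-specific helper chain with the empty-string sentinel,
# then add only the plain-text store helper for this host.
git config --global credential.http://dalidou:3000.helper ""
git config --global --add credential.http://dalidou:3000.helper store
```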
Full suite: 160 passing (no change from previous), 1 warning
(was 3) thanks to the lifespan migration.
docs/phase9-first-real-use.md (new file, +321 lines)
# Phase 9 First Real Use Report

## What this is

The first empirical exercise of the Phase 9 reflection loop after
Commits A, B, and C all landed. The goal is to find out where the
extractor and the reinforcement matcher actually behave well versus
where their behaviour drifts from the design intent.

The validation is reproducible. To re-run:

```bash
python scripts/phase9_first_real_use.py
```

This writes an isolated SQLite + Chroma store under
`data/validation/phase9-first-use/` (gitignored), seeds three active
memories, then runs eight sample interactions through the full
capture → reinforce → extract pipeline.

## What we ran

Eight synthetic interactions, each paraphrased from a real working
session about AtoCore itself or the active engineering projects:

| # | Label | Project | Expected |
|---|--------------------------------------|----------------------|---------------------------|
| 1 | exdev-mount-merge-decision | atocore | 1 decision_heading |
| 2 | ownership-was-the-real-fix | atocore | 1 fact_heading |
| 3 | memory-vs-entity-canonical-home | atocore | 1 decision_heading (long) |
| 4 | auto-promotion-deferred | atocore | 1 decision_heading |
| 5 | preference-rebase-workflow | atocore | 1 preference_sentence |
| 6 | constraint-from-doc-cite | p05-interferometer | 1 constraint_heading |
| 7 | prose-only-no-cues | atocore | 0 candidates |
| 8 | multiple-cues-in-one-interaction | p06-polisher | 3 distinct rules |

Three seed memories were inserted before the run:

- `pref_rebase`: "prefers rebase-based workflows because history stays linear" (preference, 0.6)
- `pref_concise`: "writes commit messages focused on the why, not the what" (preference, 0.6)
- `identity_runs_atocore`: "mechanical engineer who runs AtoCore for context engineering" (identity, 0.9)

## What happened — extraction (the good news)

**Every extraction expectation was met exactly.** All eight samples
produced the predicted candidate count and the predicted rule
classifications:

| Sample | Expected | Got | Pass |
|---------------------------------------|----------|-----|------|
| exdev-mount-merge-decision | 1 | 1 | ✅ |
| ownership-was-the-real-fix | 1 | 1 | ✅ |
| memory-vs-entity-canonical-home | 1 | 1 | ✅ |
| auto-promotion-deferred | 1 | 1 | ✅ |
| preference-rebase-workflow | 1 | 1 | ✅ |
| constraint-from-doc-cite | 1 | 1 | ✅ |
| prose-only-no-cues | **0** | **0** | ✅ |
| multiple-cues-in-one-interaction | 3 | 3 | ✅ |

**Total: 9 candidates from 8 interactions, 0 false positives, 0 misses
on heading patterns or sentence patterns.**

The extractor's strictness is well-tuned for the kinds of structural
cues we actually use. Things worth noting:

- **Sample 7 (`prose-only-no-cues`) produced zero candidates as
  designed.** This is the most important sanity check — it confirms
  the extractor won't fill the review queue with general prose when
  there's no structural intent.
- **Sample 3's long content was preserved without truncation.** The
  280-char max wasn't hit, and the content kept its full meaning.
- **Sample 8 produced three distinct rules in one interaction**
  (decision_heading, constraint_heading, requirement_heading) without
  the dedup key collapsing them. The dedup key is
  `(memory_type, normalized_content, rule)` and the three are all
  different on at least one axis, so they coexist as expected.
- **The prose around each heading was correctly ignored.** Sample 6
  has a second sentence ("the error budget allocates 6 nm to the
  laser source...") that does NOT have a structural cue, and the
  extractor correctly didn't fire on it.
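These heading cues are simple to express. A minimal sketch of what a rule like `decision_heading` might look like (illustrative name and pattern, not the extractor's actual code):

```python
import re

# Illustrative sketch of one heading-cue rule: fire only on an explicit
# "## Decision: ..." (or "### Decision: ...") heading line, never on prose.
DECISION_HEADING = re.compile(
    r"^#{2,3}\s*Decision:\s*(?P<content>.+?)\s*$", re.MULTILINE
)

def decision_candidates(response: str) -> list[str]:
    """Return the text of every decision heading in a response."""
    return [m.group("content") for m in DECISION_HEADING.finditer(response)]
```

On sample 1's response this yields the single heading text; on the prose-only sample it yields nothing, which is the no-fire behaviour the report highlights.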

## What happened — reinforcement (the empirical finding)

**Reinforcement matched zero seeded memories across all 8 samples,
even when the response clearly echoed the seed.**

Sample 5's response was:

> *"I prefer rebase-based workflows because the history stays linear
> and reviewers have an easier time."*

The seeded `pref_rebase` memory was:

> *"prefers rebase-based workflows because history stays linear"*

A human reading both says these are the same fact. The reinforcement
matcher disagrees. After all 8 interactions:

```
pref_rebase: confidence=0.6000 refs=0 last=-
pref_concise: confidence=0.6000 refs=0 last=-
identity_runs_atocore: confidence=0.9000 refs=0 last=-
```

**Nothing moved.** This is the most important finding from this
validation pass.

### Why the matcher missed it

The current `_memory_matches` rule (in
`src/atocore/memory/reinforcement.py`) does a normalized substring
match: it lowercases both sides, collapses whitespace, then asks
"does the leading 80-char window of the memory content appear as a
substring in the response?"

For the rebase example:

- needle (normalized): `prefers rebase-based workflows because history stays linear`
- haystack (normalized): `i prefer rebase-based workflows because the history stays linear and reviewers have an easier time.`

The needle starts with `prefers` (with the trailing `s`), and the
haystack has `prefer` (without the `s`, because of the first-person
voice). And the needle has `because history stays linear`, while the
haystack has `because the history stays linear`. **Two small natural
paraphrases, and the substring fails.**

This isn't a bug in the matcher's implementation — it's doing
exactly what it was specified to do. It's a design limitation: the
substring rule is too brittle for real prose, where the same fact
gets re-stated with different verb forms, articles, and word order.
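The failure is easy to reproduce in isolation. A minimal sketch of the substring rule as described above (names are illustrative, not the exact code in `reinforcement.py`):

```python
import re

def _normalize(text: str) -> str:
    # Lowercase and collapse runs of whitespace, as the matcher does.
    return re.sub(r"\s+", " ", text.lower()).strip()

def memory_matches_substring(memory_content: str, response: str) -> bool:
    # The leading 80-char window of the memory must appear verbatim.
    needle = _normalize(memory_content)[:80]
    return needle in _normalize(response)
```

On the rebase pair this returns `False`: the `prefers`/`prefer` and `history`/`the history` differences break the verbatim window, exactly as the run showed.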

### Severity

**Medium-high.** Reinforcement is the entire point of Commit B.
A reinforcement matcher that never fires on natural paraphrases
will leave seeded memories with stale confidence forever. The
reflection loop runs but it doesn't actually reinforce anything.
That hollows out the value of having reinforcement at all.

It is not a critical bug because:

- Nothing breaks. The pipeline still runs cleanly.
- Reinforcement is supposed to be a *signal*, not the only path to
  high confidence — humans can still curate confidence directly.
- The candidate-extraction path (Commit C) is unaffected and works
  perfectly.

But it does need to be addressed before Phase 9 can be considered
operationally complete.

## Recommended fix (deferred to a follow-up commit)

Replace the substring matcher with a token-overlap matcher. The
specification:

1. Tokenize both memory content and response into lowercase words
   of length >= 3, dropping a small stop list (`the`, `a`, `an`,
   `and`, `or`, `of`, `to`, `is`, `was`, `that`, `this`, `with`,
   `for`, `from`, `into`).
2. Stem aggressively (or at minimum, fold trailing `s` and `ed`
   so `prefers`/`prefer`/`preferred` collapse to one token).
3. A match exists if **at least 70% of the memory's content
   tokens** appear in the response token set.
4. Memory content must still be at least `_MIN_MEMORY_CONTENT_LENGTH`
   characters to be considered.

This is more permissive than the substring rule but still tight
enough to avoid spurious matches on generic words. It would have
caught the rebase example because:

- memory tokens (after stop-list and stemming):
  `{prefer, rebase-bas, workflow, because, history, stay, linear}`
- response tokens:
  `{prefer, rebase-bas, workflow, because, history, stay, linear,
  reviewer, easi, time}`
- overlap: 7 / 7 memory tokens = 100%, above the 70% threshold → match
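A minimal sketch of the proposed matcher, assuming naive suffix folding in place of a real stemmer (names and details are illustrative, pending the design review noted below):

```python
import re

_STOP = {"the", "a", "an", "and", "or", "of", "to", "is", "was",
         "that", "this", "with", "for", "from", "into"}

def _tokens(text: str) -> set[str]:
    """Lowercase words of length >= 3, stop list dropped, suffixes folded."""
    out = set()
    for w in re.findall(r"[a-z0-9-]+", text.lower()):
        if len(w) < 3 or w in _STOP:
            continue
        # Naive suffix folding so prefers/prefer and based/bas collapse.
        if w.endswith("ed"):
            w = w[:-2]
        elif w.endswith("s"):
            w = w[:-1]
        out.add(w)
    return out

def memory_matches_overlap(memory_content: str, response: str,
                           threshold: float = 0.70) -> bool:
    mem = _tokens(memory_content)
    if not mem:
        return False
    return len(mem & _tokens(response)) / len(mem) >= threshold
```

On the rebase pair above this returns `True` (all 7 memory tokens are found in the response), while unrelated prose stays well below the threshold.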

### Why not fix it in this report

Three reasons:

1. The validation report is supposed to be evidence, not a fix
   spec. A separate commit will introduce the new matcher with
   its own tests.
2. The token-overlap matcher needs its own design review for edge
   cases (very long memories, very short responses, technical
   abbreviations, code snippets in responses).
3. Mixing the report and the fix into one commit would muddle the
   audit trail. The report is the empirical evidence; the fix is
   the response.

The fix is queued as the next Phase 9 maintenance commit and is
flagged in the next-steps section below.

## Other observations

### Extraction is conservative on purpose, and that's working

Sample 7 is the most important data point in the whole run.
A natural prose response with no structural cues produced zero
candidates. **This is exactly the design intent** — the extractor
should be loud about explicit decisions/constraints/requirements
and quiet about everything else. If the extractor were too loose,
the review queue would fill up with low-value items and the human
would stop reviewing.

After this run I have measurably more confidence that the V0 rule
set is the right starting point. Future rules can be added one at
a time as we see specific patterns the extractor misses, instead of
guessing at what might be useful.

### Confidence on candidates

All extracted candidates landed at the default `confidence=0.5`,
which is what the extractor is currently hardcoded to do. The
`promotion-rules.md` doc proposes a per-rule prior with a
structural-signal multiplier and freshness bonus. None of that is
implemented yet. The validation didn't reveal any urgency around
this — humans review the candidates either way — but it confirms
that the priors-and-multipliers refinement is a reasonable next
step rather than a critical one.

### Multiple cues in one interaction

Sample 8 confirmed an important property: **three structural
cues in the same response do not collide in dedup**. The dedup
key is `(memory_type, normalized_content, rule)`, and since each
cue produced a distinct (type, content, rule) tuple, all three
landed cleanly.

This matters because real working sessions naturally bundle
multiple decisions/constraints/requirements into one summary.
The extractor handles those bundles correctly.
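The dedup behaviour can be sketched with the key tuple described above (a simplified illustration, not the extractor's actual code; the sample 8 candidates are from the run):

```python
def dedup(candidates):
    """Keep the first candidate per (memory_type, normalized_content, rule) key."""
    seen, kept = set(), []
    for c in candidates:
        key = (c["memory_type"], c["content"].strip().lower(), c["rule"])
        if key not in seen:
            seen.add(key)
            kept.append(c)
    return kept

sample8 = [
    {"memory_type": "adaptation",
     "content": "defer the laser interlock redesign to after the July milestone",
     "rule": "decision_heading"},
    {"memory_type": "project",
     "content": "the calibration routine must complete in under 90 seconds for production use",
     "rule": "constraint_heading"},
    {"memory_type": "project",
     "content": "the polisher must hold position to within 0.5 micron at 1 g loading",
     "rule": "requirement_heading"},
]
```

All three survive `dedup` because each tuple differs on at least one axis, while a literal repeat of any candidate would collapse to one entry.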

### Project scoping

Each candidate carries the `project` from the source interaction
into its own `project` field. Sample 6 (p05) and sample 8 (p06)
both produced candidates with the right project. This is
non-obvious because the extractor module never explicitly looks
at project — it inherits from the interaction it's scanning. Worth
keeping in mind when the entity extractor is built: the same
pattern should apply.

## What this validates and what it doesn't

### Validates

- The Phase 9 Commit C extractor's rule set is well-tuned for
  hand-written structural cues
- The dedup logic does the right thing across multiple cues
- The "drop candidates that match an existing active memory" filter
  works (would have been visible if any seeded memory had matched
  one of the heading texts — none did, but the code path is the
  same one that's covered in `tests/test_extractor.py`)
- The `prose-only-no-cues` no-fire case is solid
- Long content is preserved without truncation
- Project scoping flows through the pipeline

### Does NOT validate

- The reinforcement matcher (clearly, since it caught nothing)
- The behaviour against very long documents (each sample was
  under 700 chars; real interaction responses can be 10× that)
- The behaviour against responses that contain code blocks (the
  extractor's regex rules don't handle code-block fenced sections
  specially)
- Cross-interaction promotion-to-active flow (no candidate was
  promoted in this run; the lifecycle is covered by the unit tests
  but not by this empirical exercise)
- The behaviour at scale: 8 interactions is a one-shot. We need
  to see the queue after 50+ before judging reviewer ergonomics.

### Recommended next empirical exercises

1. **Real conversation capture**, using a slash command from a
   real Claude Code session against either a local or Dalidou
   AtoCore instance. The synthetic responses in this script are
   honest paraphrases but they're still hand-curated.
2. **Bulk capture from existing PKM**, ingesting a few real
   project notes through the extractor as if they were
   interactions. This stresses the rules against documents that
   weren't written with the extractor in mind.
3. **Reinforcement matcher rerun** after the token-overlap
   matcher lands.

## Action items from this report

- [ ] **Fix reinforcement matcher** with the token-overlap rule
      described in the "Recommended fix" section above. Owner:
      next session. Severity: medium-high.
- [x] **Document the extractor's V0 strictness** as a working
      property, not a limitation. Sample 7 makes the case.
- [ ] **Build the slash command** so the next validation run
      can use real (not synthetic) interactions. Tracked in
      Session 2 of the current planning sprint.
- [ ] **Run a 50+ interaction batch** to evaluate reviewer
      ergonomics. Deferred until the slash command exists.

## Reproducibility

The script is deterministic in every field that matters.
Re-running it will produce equivalent results because:

- the data dir is wiped on every run
- the sample interactions are constants
- memory UUID generation is non-deterministic, but the fields that
  matter (content, type, count, rule) are fully determined by the
  inputs
- the `data/validation/phase9-first-use/` directory is gitignored,
  so no state leaks across runs

To reproduce this exact report:

```bash
python scripts/phase9_first_real_use.py
```

To get JSON output for downstream tooling:

```bash
python scripts/phase9_first_real_use.py --json
```
scripts/phase9_first_real_use.py (new file, +393 lines)
"""Phase 9 first-real-use validation script.

Captures a small set of representative interactions drawn from a real
working session, runs the full Phase 9 loop (capture -> reinforce ->
extract) over them, and prints what each step produced. The intent is
to generate empirical evidence about the extractor's behaviour against
prose that wasn't written to make the test pass.

Usage:
    python scripts/phase9_first_real_use.py [--data-dir PATH]

The script writes a fresh isolated SQLite + Chroma store under the
given data dir (default: ./data/validation/phase9-first-use). The
data dir is gitignored so the script can be re-run cleanly.

Each interaction is printed with:
- the captured interaction id
- the reinforcement results (which seeded memories were echoed)
- the extraction results (which candidates were proposed and why)
- notes on what the extractor MISSED (manually annotated below)

The output is intentionally human-readable so the run can be saved as
the body of docs/phase9-first-real-use.md.
"""

from __future__ import annotations

import argparse
import json
import os
import shutil
import sys
from dataclasses import dataclass
from pathlib import Path

# Make the package importable when run directly from the repo root.
_REPO_ROOT = Path(__file__).resolve().parent.parent
sys.path.insert(0, str(_REPO_ROOT / "src"))


@dataclass
class SampleInteraction:
    label: str
    prompt: str
    response: str
    project: str
    notes_for_human: str


# ---------------------------------------------------------------------------
# Sample interactions, paraphrased from real working sessions on AtoCore
# itself. Each one is a real-feeling exchange with at least one structural
# cue the extractor should catch, plus prose the extractor should NOT
# accidentally catch.
# ---------------------------------------------------------------------------

SAMPLES: list[SampleInteraction] = [
    SampleInteraction(
        label="exdev-mount-merge-decision",
        project="atocore",
        prompt="The TMPDIR fix didn't work, the LOCAL_COPY_PATH change didn't work, what now?",
        response=(
            "We've been chasing this through three layers without success. "
            "Time to stop guessing and apply the simple fix Dalidou Claude "
            "proposed in the very first diagnosis.\n\n"
            "## Decision: merge the two bind mounts into one\n\n"
            "Both /srv/storage/repos and /srv/storage/gitea/data live on "
            "/dev/sdb1, so the move is a same-filesystem rename — instant, "
            "atomic, no data copy. After the move, the EXDEV class of bug "
            "becomes structurally impossible because there is only one "
            "mount inside the container."
        ),
        notes_for_human=(
            "Should fire decision_heading rule. Should NOT fire requirement "
            "or constraint rules. The 'becomes structurally impossible' line "
            "is prose, not a candidate."
        ),
    ),
    SampleInteraction(
        label="ownership-was-the-real-fix",
        project="atocore",
        prompt="Still failing after the mount merge. What's actually wrong?",
        response=(
            "Same rejection even after the mounts were merged, which kills "
            "the EXDEV theory. The error 'unable to migrate objects to "
            "permanent storage' is git's catch-all for any failure in "
            "tmp_objdir_migrate, and the real errno is being thrown away.\n\n"
            "## Fact: the actual root cause was ownership, not cross-device link\n\n"
            "After mv on the host, the moved directory had perms git could "
            "navigate for some subdirs but not others. Recursive chown "
            "git:git plus chmod u+rwX,g+rwX fixed it on the first try."
        ),
        notes_for_human=(
            "Should fire fact_heading rule, target type knowledge. The "
            "'unable to migrate' string is inside prose and should NOT "
            "trigger any rule."
        ),
    ),
    SampleInteraction(
        label="memory-vs-entity-canonical-home",
        project="atocore",
        prompt="Does a Decision live in memories or entities?",
        response=(
            "Both layers exist for different reasons. After thinking through "
            "the ontology question, I'm landing on a clean rule: one "
            "canonical home per concept, with explicit graduation when a "
            "memory eventually deserves structure.\n\n"
            "## Decision: project, knowledge, and adaptation memory types graduate to entities\n\n"
            "The identity, preference, and episodic types stay in the memory "
            "layer because they're personal and unstructured. The other "
            "three become entities once the engineering layer V1 ships. The "
            "original memory rows are NEVER deleted — they become frozen "
            "historical pointers to whichever entity they graduated into."
        ),
        notes_for_human=(
            "Should fire decision_heading rule. Long content (over 100 "
            "chars) — verifies the max-length cap of 280 doesn't truncate."
        ),
    ),
    SampleInteraction(
        label="auto-promotion-deferred",
        project="atocore",
        prompt="Should the engineering layer auto-promote candidates?",
        response=(
            "The trust hierarchy is the whole point of AtoCore. If we let "
            "the system auto-promote, we trade away the 'bad memory is "
            "worse than no memory' invariant for a marginal speed gain.\n\n"
            "## Decision: no auto-promotion in V1, ever\n\n"
            "The schema and API are designed so policies can be added "
            "later without migration, but the V1 stance is that every "
            "promotion needs a human signoff. The hard line: nothing ever "
            "moves into trusted project state automatically, even with the "
            "engineering layer running."
        ),
        notes_for_human=(
            "Should fire decision_heading rule. Has 'invariant' jargon — "
            "verify the extractor doesn't get distracted."
        ),
    ),
    SampleInteraction(
        label="preference-rebase-workflow",
        project="atocore",
        prompt="Should we use rebase or merge commits for this branch?",
        response=(
            "I prefer rebase-based workflows because the history stays "
            "linear and reviewers have an easier time. For the dalidou "
            "storage-foundation branch we'll rebase before merging into "
            "main so the timeline reads as one coherent narrative."
        ),
        notes_for_human=(
            "Should fire preference_sentence rule. Note: the rule looks "
            "for 'I prefer' which is exactly the wording used."
        ),
    ),
    SampleInteraction(
        label="constraint-from-doc-cite",
        project="p05-interferometer",
        prompt="What's the budget situation for p05 right now?",
        response=(
            "Looking at the latest design notes, the constraint is fairly "
            "tight on the optical path side.\n\n"
            "### Constraint: total optical path stability must stay within 10 nm RMS at 1 kHz\n\n"
            "The error budget allocates 6 nm to the laser source and 4 nm "
            "to the mechanical mounts and frame combined."
        ),
        notes_for_human=(
            "Should fire constraint_heading rule with type=project. The "
            "second sentence ('error budget allocates 6 nm') is descriptive "
            "and should NOT fire any rule because it doesn't have a "
            "structural cue."
        ),
    ),
    SampleInteraction(
        label="prose-only-no-cues",
        project="atocore",
        prompt="What should I work on next?",
        response=(
            "Looking at the current state of AtoCore, the next natural "
            "step is to validate Phase 9 against real interactions before "
            "starting the engineering layer implementation. Empirical "
            "evidence is missing and the planning docs would benefit from "
            "real signal."
        ),
        notes_for_human=(
            "Should produce ZERO candidates. Pure recommendation prose, no "
            "structural cues. If this fires anything the extractor is too "
            "loose."
        ),
    ),
    SampleInteraction(
        label="multiple-cues-in-one-interaction",
        project="p06-polisher",
        prompt="Summarize today's polisher session",
        response=(
            "We worked through three things in the polisher session today.\n\n"
            "## Decision: defer the laser interlock redesign to after the July milestone\n\n"
            "## Constraint: the calibration routine must complete in under 90 seconds for production use\n\n"
            "## Requirement: the polisher must hold position to within 0.5 micron at 1 g loading\n\n"
            "Action items captured for the next sync."
        ),
        notes_for_human=(
            "Three rules should fire on the same interaction: "
            "decision_heading -> adaptation, constraint_heading -> project, "
            "requirement_heading -> project. Verify dedup doesn't merge them."
        ),
    ),
]

def setup_environment(data_dir: Path) -> None:
    """Configure AtoCore to use an isolated data directory for this run."""
    if data_dir.exists():
        shutil.rmtree(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    os.environ["ATOCORE_DATA_DIR"] = str(data_dir)
    os.environ.setdefault("ATOCORE_DEBUG", "true")
    # Reset cached settings so the new env vars take effect
    import atocore.config as config

    config.settings = config.Settings()
    import atocore.retrieval.vector_store as vs

    vs._store = None


def seed_memories() -> dict[str, str]:
    """Insert a small set of seed active memories so reinforcement has
    something to match against."""
    from atocore.memory.service import create_memory

    seeded: dict[str, str] = {}
    seeded["pref_rebase"] = create_memory(
        memory_type="preference",
        content="prefers rebase-based workflows because history stays linear",
        confidence=0.6,
    ).id
    seeded["pref_concise"] = create_memory(
        memory_type="preference",
        content="writes commit messages focused on the why, not the what",
        confidence=0.6,
    ).id
    seeded["identity_runs_atocore"] = create_memory(
        memory_type="identity",
        content="mechanical engineer who runs AtoCore for context engineering",
        confidence=0.9,
    ).id
    return seeded

def run_sample(sample: SampleInteraction) -> dict:
    """Capture one sample, run extraction, return a result dict."""
    from atocore.interactions.service import record_interaction
    from atocore.memory.extractor import extract_candidates_from_interaction

    interaction = record_interaction(
        prompt=sample.prompt,
        response=sample.response,
        project=sample.project,
        client="phase9-first-real-use",
        session_id="first-real-use",
        reinforce=True,
    )
    candidates = extract_candidates_from_interaction(interaction)

    return {
        "label": sample.label,
        "project": sample.project,
        "interaction_id": interaction.id,
        "expected_notes": sample.notes_for_human,
        "candidate_count": len(candidates),
        "candidates": [
            {
                "memory_type": c.memory_type,
                "rule": c.rule,
                "content": c.content,
                "source_span": c.source_span[:120],
            }
            for c in candidates
        ],
    }


def report_seed_memory_state(seeded_ids: dict[str, str]) -> dict:
    from atocore.memory.service import get_memories

    state = {}
    for label, mid in seeded_ids.items():
        rows = [m for m in get_memories(limit=200) if m.id == mid]
        if not rows:
            state[label] = None
            continue
        m = rows[0]
        state[label] = {
            "id": m.id,
            "memory_type": m.memory_type,
            "content_preview": m.content[:80],
            "confidence": round(m.confidence, 4),
            "reference_count": m.reference_count,
            "last_referenced_at": m.last_referenced_at,
        }
    return state

def main() -> int:
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument(
|
||||
"--data-dir",
|
||||
default=str(_REPO_ROOT / "data" / "validation" / "phase9-first-use"),
|
||||
help="Isolated data directory to use for this validation run",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--json",
|
||||
action="store_true",
|
||||
help="Emit machine-readable JSON instead of human prose",
|
||||
)
|
||||
args = parser.parse_args()
|
||||
|
||||
data_dir = Path(args.data_dir).resolve()
|
||||
setup_environment(data_dir)
|
||||
|
||||
from atocore.models.database import init_db
|
||||
from atocore.context.project_state import init_project_state_schema
|
||||
|
||||
init_db()
|
||||
init_project_state_schema()
|
||||
|
||||
seeded = seed_memories()
|
||||
sample_results = [run_sample(s) for s in SAMPLES]
|
||||
final_seed_state = report_seed_memory_state(seeded)
|
||||
|
||||
if args.json:
|
||||
json.dump(
|
||||
{
|
||||
"data_dir": str(data_dir),
|
||||
"seeded_memories_initial": list(seeded.keys()),
|
||||
"samples": sample_results,
|
||||
"seed_memory_state_after_run": final_seed_state,
|
||||
},
|
||||
sys.stdout,
|
||||
indent=2,
|
||||
default=str,
|
||||
)
|
||||
return 0
|
||||
|
||||
print("=" * 78)
|
||||
print("Phase 9 first-real-use validation run")
|
||||
print("=" * 78)
|
||||
print(f"Isolated data dir: {data_dir}")
|
||||
print()
|
||||
print("Seeded the memory store with 3 active memories:")
|
||||
for label, mid in seeded.items():
|
||||
print(f" - {label} ({mid[:8]})")
|
||||
print()
|
||||
print("-" * 78)
|
||||
print(f"Running {len(SAMPLES)} sample interactions ...")
|
||||
print("-" * 78)
|
||||
|
||||
for result in sample_results:
|
||||
print()
|
||||
print(f"## {result['label']} [project={result['project']}]")
|
||||
print(f" interaction_id={result['interaction_id'][:8]}")
|
||||
print(f" expected: {result['expected_notes']}")
|
||||
print(f" candidates produced: {result['candidate_count']}")
|
||||
for i, cand in enumerate(result["candidates"], 1):
|
||||
print(
|
||||
f" [{i}] type={cand['memory_type']:11s} "
|
||||
f"rule={cand['rule']:21s} "
|
||||
f"content={cand['content']!r}"
|
||||
)
|
||||
|
||||
print()
|
||||
print("-" * 78)
|
||||
print("Reinforcement state on seeded memories AFTER all interactions:")
|
||||
print("-" * 78)
|
||||
for label, state in final_seed_state.items():
|
||||
if state is None:
|
||||
print(f" {label}: <missing>")
|
||||
continue
|
||||
print(
|
||||
f" {label}: confidence={state['confidence']:.4f} "
|
||||
f"refs={state['reference_count']} "
|
||||
f"last={state['last_referenced_at'] or '-'}"
|
||||
)
|
||||
|
||||
print()
|
||||
print("=" * 78)
|
||||
print("Run complete. Data written to:", data_dir)
|
||||
print("=" * 78)
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
raise SystemExit(main())
|
||||
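The report's recommended fix for the zero-match reinforcement finding can be sketched as a token-overlap matcher. This is a minimal illustration of the queued follow-up, not the landed implementation: the stop list, the suffix-stripping `_stem` helper, and the exact threshold handling are assumptions based on the report's description (>=70% of memory tokens present in the response, with light stemming and a small stop list).

```python
import re

# Placeholder stop list and naive suffix-stripping "stemmer"; the
# follow-up commit may tune both.
_STOP_WORDS = {"a", "an", "the", "i", "to", "of", "and", "is", "in"}


def _stem(token: str) -> str:
    # Light stemming so paraphrases line up, e.g. "prefers" -> "prefer",
    # "rebasing" and "rebase" both -> "rebas".
    for suffix in ("ing", "es", "s", "e"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def _tokens(text: str) -> set[str]:
    words = re.findall(r"[a-z']+", text.lower())
    return {_stem(w) for w in words if w not in _STOP_WORDS}


def memory_is_reinforced(memory_content: str, response_text: str,
                         threshold: float = 0.7) -> bool:
    """True when >= `threshold` of the memory's tokens appear in the response."""
    memory_tokens = _tokens(memory_content)
    if not memory_tokens:
        return False
    overlap = len(memory_tokens & _tokens(response_text))
    return overlap / len(memory_tokens) >= threshold
```

Unlike the substring matcher, this accepts natural paraphrases such as "User prefers rebasing over merging" against a seeded "I prefer rebase over merge" memory, while an unrelated response still scores zero overlap.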
@@ -31,6 +31,7 @@ from atocore.interactions.service import (
     record_interaction,
 )
 from atocore.memory.extractor import (
+    EXTRACTOR_VERSION,
     MemoryCandidate,
     extract_candidates_from_interaction,
 )
@@ -622,6 +623,7 @@ def api_extract_from_interaction(
         "candidate_count": len(candidates),
         "persisted": payload.persist,
         "persisted_ids": persisted_ids,
+        "extractor_version": EXTRACTOR_VERSION,
         "candidates": [
             {
                 "memory_type": c.memory_type,
@@ -630,6 +632,7 @@ def api_extract_from_interaction(
                 "confidence": c.confidence,
                 "rule": c.rule,
                 "source_span": c.source_span,
+                "extractor_version": c.extractor_version,
             }
             for c in candidates
         ],
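For illustration, a response from the extract endpoint after these two hunks could look like the following. Every field value here (rule name, content, spans) is invented for the example, not captured from the validation run; only the shape, with the version stamp at both levels, reflects the diff.

```python
# Hypothetical response shape; values are illustrative only.
response = {
    "candidate_count": 1,
    "persisted": False,
    "persisted_ids": [],
    "extractor_version": "0.1.0",  # version of the extractor that ran
    "candidates": [
        {
            "memory_type": "preference",
            "content": "prefers rebase over merge",
            "confidence": 0.5,
            "rule": "example_rule",
            "source_span": "I prefer rebase over merge",
            "extractor_version": "0.1.0",  # per-candidate stamp
        }
    ],
}

# Top-level and per-candidate stamps agree unless previously persisted
# candidates are replayed after a version bump.
assert all(c["extractor_version"] == response["extractor_version"]
           for c in response["candidates"])
```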
@@ -1,5 +1,7 @@
 """AtoCore — FastAPI application entry point."""

+from contextlib import asynccontextmanager
+
 from fastapi import FastAPI

 from atocore.api.routes import router
@@ -9,18 +11,19 @@ from atocore.ingestion.pipeline import get_source_status
 from atocore.models.database import init_db
 from atocore.observability.logger import get_logger, setup_logging

-app = FastAPI(
-    title="AtoCore",
-    description="Personal Context Engine for LLM interactions",
-    version="0.1.0",
-)
-
-app.include_router(router)
 log = get_logger("main")


-@app.on_event("startup")
-def startup():
+@asynccontextmanager
+async def lifespan(app: FastAPI):
+    """Run setup before the first request and teardown after shutdown.
+
+    Replaces the deprecated ``@app.on_event("startup")`` hook with the
+    modern ``lifespan`` context manager. Setup runs synchronously (the
+    underlying calls are blocking I/O) so no await is needed; the
+    function still must be async per the FastAPI contract.
+    """
     setup_logging()
     _config.ensure_runtime_dirs()
     init_db()
@@ -32,6 +35,19 @@ def startup():
         chroma_path=str(_config.settings.chroma_path),
         source_status=get_source_status(),
     )
+    yield
+    # No teardown work needed today; SQLite connections are short-lived
+    # and the Chroma client cleans itself up on process exit.
+
+
+app = FastAPI(
+    title="AtoCore",
+    description="Personal Context Engine for LLM interactions",
+    version="0.1.0",
+    lifespan=lifespan,
+)
+
+app.include_router(router)

 if __name__ == "__main__":
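Stripped of the FastAPI specifics, the migration above follows the standard asynccontextmanager shape: everything before the yield runs once at startup, everything after it runs at shutdown. A self-contained sketch of that ordering, with a stand-in for the app's lifetime, is:

```python
import asyncio
from contextlib import asynccontextmanager

events = []


@asynccontextmanager
async def lifespan(app):
    # Setup phase: runs once before the app starts serving. Blocking
    # calls (like init_db() in main.py) are fine here; nothing awaits.
    events.append("startup")
    yield
    # Teardown phase: runs after shutdown (empty in main.py today).
    events.append("shutdown")


async def serve():
    # The framework enters/exits the context manager around the app's
    # lifetime; simulated here with a plain `async with`.
    async with lifespan(app=None):
        events.append("serving")


asyncio.run(serve())
print(events)  # ['startup', 'serving', 'shutdown']
```

Because the framework owns entry and exit, teardown is guaranteed to run after the last request, which the old on_event("startup") hook never offered without a second on_event("shutdown") registration.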
@@ -46,6 +46,18 @@ from atocore.observability.logger import get_logger

 log = get_logger("extractor")

+
+# Bumped whenever the rule set, regex shapes, or post-processing
+# semantics change in a way that could affect candidate output. The
+# promotion-rules doc requires every candidate to record the version
+# of the extractor that produced it so old candidates can be re-evaluated
+# (or kept as-is) when the rules evolve.
+#
+# History:
+#   0.1.0 - initial Phase 9 Commit C rule set (Apr 6, 2026)
+EXTRACTOR_VERSION = "0.1.0"
+
+
 # Every candidate is attributed to the rule that fired so reviewers can
 # audit why it was proposed.
 @dataclass
@@ -57,6 +69,7 @@ class MemoryCandidate:
     project: str = ""
     confidence: float = 0.5  # default review-queue confidence
     source_interaction_id: str = ""
+    extractor_version: str = EXTRACTOR_VERSION


# ---------------------------------------------------------------------------
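The point of stamping each candidate is that downstream tooling can later decide which stored candidates were produced by an older rule set. A minimal sketch of that check, using a simplified stand-in for MemoryCandidate and an assumed future version bump (the diff above only lands "0.1.0"; `needs_reextraction` is a hypothetical helper, not part of the commit):

```python
from dataclasses import dataclass

EXTRACTOR_VERSION = "0.1.0"


@dataclass
class Candidate:
    # Simplified stand-in for MemoryCandidate: every instance is stamped
    # with the version of the rules that produced it.
    content: str
    extractor_version: str = EXTRACTOR_VERSION


def needs_reextraction(candidate: Candidate, current_version: str) -> bool:
    # Any mismatch flags the candidate for re-evaluation (or a deliberate
    # keep-as-is decision) against the newer rule set.
    return candidate.extractor_version != current_version


old = Candidate("prefers rebase over merge")  # stamped "0.1.0" by default
print(needs_reextraction(old, "0.1.0"))  # False: rules unchanged
print(needs_reextraction(old, "0.2.0"))  # True: hypothetical future bump
```

A plain inequality suffices while versions only ever move forward; semantic-version comparison would be overkill here.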