2026-04-06 21:30:35 -04:00
# Memory vs Entities (Engineering Layer V1 boundary)
## Why this document exists
The engineering layer introduces a new representation — typed
entities with explicit relationships — alongside AtoCore's existing
memory system and its six memory types. The question that blocks
every other engineering-layer planning doc is:
> When we extract a fact from an interaction or a document, does it
> become a memory, an entity, or both? And if both, which one is
> canonical?
Without an answer, the rest of the engineering layer cannot be
designed. This document is the answer.
## The short version
- **Memories stay.** They are still the canonical home for
*unstructured, attributed, personal, natural-language* facts.
- **Entities are new.** They are the canonical home for *structured,
typed, relational, engineering-domain* facts.
- **No concept lives in both at full fidelity.** Every concept has
exactly one canonical home. The other layer may hold a pointer or
a rendered view, never a second source of truth.
- **The two layers share one review queue.** Candidates from
extraction flow into the same `status=candidate` lifecycle
regardless of whether they are memory-bound or entity-bound.
- **Memories can "graduate" into entities** when enough structure has
accumulated, but the upgrade is an explicit, logged promotion, not
a silent rewrite.
## The split per memory type
The six memory types from the current Phase 2 implementation each
map to exactly one outcome in V1:
| Memory type | V1 destination | Rationale |
|---------------|-------------------------------|-------------------------------------------------------------------------------------------------------------|
| identity | **memory only** | Always about the human user. No engineering domain structure. Never gets entity-shaped. |
| preference | **memory only** | Always about the human user's working style. Same reasoning. |
| episodic | **memory only** | "What happened in this conversation / this day." Attribution and time are the point, not typed structure. |
| knowledge | **entity when possible**, memory otherwise | If the knowledge maps to a typed engineering object (material property, constant, tolerance), it becomes a Fact entity with provenance. If it's loose general knowledge, stays a memory. |
| project | **entity** | Anything that belonged in the "project" memory type is really a Requirement, Constraint, Decision, Subsystem attribute, etc. It belongs in the engineering layer once entities exist. |
| adaptation | **entity (Decision)** | "We decided to X" is literally a Decision entity in the ontology. This is the clearest migration. |
**Practical consequence:** when the engineering layer V1 ships, the
`project` and `adaptation` memory types are deprecated as a canonical
home for new facts, and `knowledge` remains a memory type only as a
fallback for loose facts that fit no entity type. Existing rows are
not deleted: they are backfilled as entities through the
promotion-rules flow (see `promotion-rules.md`), and the old memory
rows become frozen references pointing at their graduated entity.
The `identity`, `preference`, and `episodic` memory types continue
to exist exactly as they do today and do not interact with the
engineering layer at all.
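
The table above can be read as a small routing function. The sketch below is illustrative only: `Destination`, `v1_destination`, and the `maps_to_typed_object` flag are hypothetical names, not part of the existing AtoCore code.

```python
from enum import Enum


class Destination(Enum):
    MEMORY = "memory"
    ENTITY = "entity"


# Memory types that never leave the memory layer.
MEMORY_ONLY = frozenset({"identity", "preference", "episodic"})
# Memory types whose new facts always become entities once V1 ships.
ENTITY_ONLY = frozenset({"project", "adaptation"})


def v1_destination(memory_type: str, maps_to_typed_object: bool = False) -> Destination:
    """Return the canonical home for a newly extracted fact (hypothetical helper).

    `maps_to_typed_object` only matters for `knowledge`: it stands in for
    whatever check decides that the fact fits a typed engineering object.
    """
    if memory_type in MEMORY_ONLY:
        return Destination.MEMORY
    if memory_type in ENTITY_ONLY:
        return Destination.ENTITY
    if memory_type == "knowledge":
        return Destination.ENTITY if maps_to_typed_object else Destination.MEMORY
    raise ValueError(f"unknown memory type: {memory_type!r}")
```

Only `knowledge` needs a second input, which matches the table: it is the one type with a conditional destination.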
## What "canonical home" actually means
A concept's canonical home is the single place where:
- its *current active value* is stored
- its *status lifecycle* is managed (active/superseded/invalid)
- its *confidence* is tracked
- its *provenance chain* is rooted
- edits, supersessions, and invalidations are applied
- conflict resolution is arbitrated
Everything else is a derived view of that canonical row.
If a `Decision` entity is the canonical home for "we switched to
GF-PTFE pads", then:
- there is no `adaptation` memory row with the same content; the
extractor creates a `Decision` candidate directly
- the context builder, when asked to include relevant state, reaches
into the entity store via the engineering layer, not the memory
store
- if the user wants to see "recent decisions" they hit the entity
API, never the memory API
- if they want to invalidate the decision, they do so via the entity
API
The memory API remains the canonical home for `identity`,
`preference`, and `episodic` — same rules, just a different set of
types.
## Why not a unified table with a `kind` column?
It would be simpler to implement. It is rejected for three reasons:
1. **Different query shapes.** Memories are queried by type, project,
confidence, recency. Entities are queried by type, relationships,
graph traversal, coverage gaps ("orphan requirements"). Cramming
both into one table forces the schema to be the union of both
worlds and makes each query slower.
2. **Different lifecycles.** Memories have a simple four-state
lifecycle (candidate/active/superseded/invalid). Entities have
the same four states *plus* per-relationship supersession,
per-field versioning for the killer correctness queries, and
structured conflict flagging. The unified table would have to
carry all entity apparatus for every memory row.
3. **Different provenance semantics.** A preference memory is
provenanced by "the user told me" — one author, one time.
An entity like a `Requirement` is provenanced by "this source
chunk + this source document + these supporting Results" — a
graph. The tables want to be different because their provenance
models are different.
So: two tables, one review queue, one promotion flow, one trust
hierarchy.
## The shared review queue
Both the memory extractor (Phase 9 Commit C, already shipped) and
the future entity extractor write into the same conceptual queue:
everything lands at `status=candidate` in its own table, and the
human reviewer sees a unified list. The reviewer UI (future work)
shows candidates of all kinds side by side, grouped by source
interaction / source document, with the rule that fired.
From the data side this means:
- the memories table gets a `candidate` status (**already done in
Phase 9 Commit B/C**)
- the future entities table will get the same `candidate` status
- both tables get the same `promote` / `reject` API shape: one verb
per candidate, with an audit log entry
Implementation note: the API routes should evolve from
`POST /memory/{id}/promote` to `POST /candidates/{id}/promote` once
both tables exist, so the reviewer tooling can treat them
uniformly. The current memory-only route stays in place for
backward compatibility and is aliased by the unified route.
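
As a rough illustration of "two tables, one queue", the sketch below uses an in-memory SQLite database. The table and column names, the `(kind, id)` addressing, and the mapping of `reject` to the `invalid` status are all assumptions made for the example; the real schema and the unified route's id scheme are not defined in this document.

```python
import sqlite3

# Hypothetical mapping from candidate kind to its table.
TABLES = {"memory": "memories", "entity": "entities"}


def open_store() -> sqlite3.Connection:
    """Create a toy store with one table per layer plus an audit log."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE memories (id INTEGER PRIMARY KEY, content TEXT, status TEXT);
        CREATE TABLE entities (id INTEGER PRIMARY KEY, content TEXT, status TEXT);
        CREATE TABLE audit_log (kind TEXT, row_id INTEGER, action TEXT);
        """
    )
    return conn


def list_candidates(conn: sqlite3.Connection) -> list:
    """Return candidates from both tables as one list of (kind, id, content)."""
    return conn.execute(
        """
        SELECT 'memory' AS kind, id, content FROM memories WHERE status = 'candidate'
        UNION ALL
        SELECT 'entity' AS kind, id, content FROM entities WHERE status = 'candidate'
        ORDER BY kind, id
        """
    ).fetchall()


def review(conn: sqlite3.Connection, kind: str, row_id: int, action: str) -> None:
    """Apply one verb to one candidate and write an audit log entry."""
    # Assumption: rejected candidates move to 'invalid'.
    new_status = {"promote": "active", "reject": "invalid"}[action]
    table = TABLES[kind]  # fixed lookup, so the f-string below is not user-controlled
    cur = conn.execute(
        f"UPDATE {table} SET status = ? WHERE id = ? AND status = 'candidate'",
        (new_status, row_id),
    )
    if cur.rowcount != 1:
        raise LookupError(f"no {kind} candidate with id {row_id}")
    conn.execute("INSERT INTO audit_log VALUES (?, ?, ?)", (kind, row_id, action))
```

Because ids are only unique per table in this sketch, a candidate is addressed by `(kind, id)`. A real `POST /candidates/{id}/promote` route would need either globally unique ids or a kind prefix.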
## Memory-to-entity graduation
Even though the split is clean on paper, real usage will reveal
memories that deserve to be entities but started as plain text.
Four signals are good candidates for proposing graduation:
1. **Reference count crosses a threshold.** A memory that has been
reinforced 5+ times across multiple interactions is a strong
signal that it deserves structure.
2. **Memory content matches a known entity template.** If a
`knowledge` memory's content matches the shape "X = value [unit]"
it can be proposed as a `Fact` or `Parameter` entity.
3. **A user explicitly asks for promotion.** `POST /memory/{id}/graduate`
is the simplest explicit path — it returns a proposal for an
entity structured from the memory's content, which the user can
accept or reject.
4. **Extraction pass proposes an entity that happens to match an
existing memory.** The entity extractor, when scanning a new
interaction, sees the same content already exists as a memory
and proposes graduation as part of its candidate output.
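
The four signals could be collected by a single check along these lines. This is a minimal sketch: the `Memory` dataclass, its `reference_count` field, and the regular expression for "X = value [unit]" are assumptions for illustration, and only the threshold of 5 comes from the text above.

```python
import re
from dataclasses import dataclass

GRADUATION_REFERENCE_THRESHOLD = 5  # "reinforced 5+ times"
GRADUATING_TYPES = frozenset({"project", "knowledge", "adaptation"})

# Hypothetical pattern for the "X = value [unit]" shape.
FACT_PATTERN = re.compile(
    r"^\s*(?P<name>[^=]+?)\s*=\s*(?P<value>-?\d+(?:\.\d+)?)\s*(?P<unit>\S.*)?$"
)


@dataclass
class Memory:
    memory_type: str
    content: str
    reference_count: int = 0


def graduation_signals(
    memory: Memory,
    explicit_request: bool = False,
    matches_entity_candidate: bool = False,
) -> list[str]:
    """Return the names of the graduation signals that fire for this memory."""
    if memory.memory_type not in GRADUATING_TYPES:
        return []  # identity/preference/episodic never graduate
    signals = []
    if memory.reference_count >= GRADUATION_REFERENCE_THRESHOLD:
        signals.append("reference_count")
    if memory.memory_type == "knowledge" and FACT_PATTERN.match(memory.content):
        signals.append("entity_template")
    if explicit_request:
        signals.append("explicit_request")
    if matches_entity_candidate:
        signals.append("extractor_match")
    return signals
```
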
The graduation flow is:
```
memory row (active, confidence C)
|
| propose_graduation()
v
entity candidate row (candidate, confidence C)
+
memory row gets status="graduated" and a forward pointer to the
entity candidate
|
| human promotes the candidate entity
v
entity row (active)
+
memory row stays "graduated" permanently (historical record)
```
The memory is never deleted. It becomes a frozen historical
pointer to the entity it became. This keeps the audit trail intact
and lets the Human Mirror show "this decision started life as a
memory on April 2, was graduated to an entity on April 15, now has
2 supporting ValidationClaims".
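
The two transitions in the diagram can be sketched as plain state changes. `propose_graduation` is the name used in the diagram; the row classes, field names (`graduated_to`, `source_memory_id`), and `promote_entity` are hypothetical. What happens to the memory if the entity candidate is rejected is not specified here, so the sketch leaves that case out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MemoryRow:
    id: int
    content: str
    confidence: float
    status: str = "active"
    graduated_to: Optional[int] = None  # forward pointer to the entity


@dataclass
class EntityRow:
    id: int
    content: str
    confidence: float
    source_memory_id: int
    status: str = "candidate"


def propose_graduation(memory: MemoryRow, next_entity_id: int) -> EntityRow:
    """Create an entity candidate and freeze the memory as a pointer to it."""
    if memory.status != "active":
        raise ValueError("only active memories can graduate")
    entity = EntityRow(
        id=next_entity_id,
        content=memory.content,
        confidence=memory.confidence,  # the candidate inherits confidence C
        source_memory_id=memory.id,
    )
    memory.status = "graduated"
    memory.graduated_to = entity.id
    return entity


def promote_entity(entity: EntityRow) -> None:
    """Human review step: the candidate entity becomes active."""
    if entity.status != "candidate":
        raise ValueError("only candidates can be promoted")
    entity.status = "active"
```
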
The `graduated` status is a new memory status that will be added when
the graduation flow is implemented. It applies only to the three
graduating types (`project`, `knowledge`, `adaptation`); `identity`,
`preference`, and `episodic` memories never graduate. For now
(Phase 9), the graduating types stay in their current memory-only
state until the engineering layer ships.
## Context pack assembly after the split
The context builder today (`src/atocore/context/builder.py`) pulls:
1. Trusted Project State
2. Identity + Preference memories
3. Retrieved chunks
After the split, it pulls:
1. Trusted Project State (unchanged)
2. **Identity + Preference memories** (unchanged — these stay memories)
3. **Engineering-layer facts relevant to the prompt**, queried through
the entity API (new)
4. Retrieved chunks (unchanged, lowest trust)
Note the ordering: identity/preference memories stay above entities,
because personal style information is always more trusted than
extracted engineering facts. Entities sit below the personal layer
but above raw retrieval, because they have structured provenance
that raw chunks lack.
The budget allocation gains a new slot:
- trusted project state: 20% (unchanged, highest trust)
- identity memories: 5% (unchanged)
- preference memories: 5% (unchanged)
- **engineering entities: 15%** (new — pulls only V1-required
objects relevant to the prompt)
- retrieval: 55% (reduced from 70% to make room)
These are starting numbers. Once the engineering layer ships and
real usage shows how well retrieval performs, they will be revisited.
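
To make the split concrete, the percentages can be applied to a total token budget as follows. This is a sketch only: `allocate_budget` and the slot names are hypothetical, and giving the rounding remainder to retrieval is an arbitrary choice for the example, not something the current builder is known to do.

```python
# Budget shares in whole percent, taken from the list above.
BUDGET_PERCENT = {
    "trusted_project_state": 20,
    "identity_memories": 5,
    "preference_memories": 5,
    "engineering_entities": 15,
    "retrieval": 55,
}


def allocate_budget(total: int, percent: dict = BUDGET_PERCENT) -> dict:
    """Split `total` budget units across the context slots."""
    if sum(percent.values()) != 100:
        raise ValueError("budget shares must sum to 100%")
    # Integer arithmetic avoids floating-point rounding surprises.
    allocation = {slot: total * share // 100 for slot, share in percent.items()}
    # Assumption: any units lost to floor division go to retrieval.
    allocation["retrieval"] += total - sum(allocation.values())
    return allocation
```

With a budget of 8000 units this gives 1600 for trusted project state, 400 each for identity and preference memories, 1200 for engineering entities, and 4400 for retrieval.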
## What the shipped memory types still mean after the split
| Memory type | Still accepts new writes? | V1 destination for new extractions |
|-------------|---------------------------|------------------------------------|
| identity | **yes** | memory (no change) |
| preference | **yes** | memory (no change) |
| episodic | **yes** | memory (no change) |
| knowledge | yes, but only for loose facts | entity (Fact / Parameter) for structured things; memory is a fallback |
| project | **no new writes after engineering V1 ships** | entity (Requirement / Constraint / Subsystem attribute) |
| adaptation | **no new writes after engineering V1 ships** | entity (Decision) |
"No new writes" means the `create_memory` path will refuse to
create new `project` or `adaptation` memories once the engineering
layer V1 ships. Existing rows stay queryable and reinforceable but
new facts of those kinds must become entities. This keeps the
canonical-home rule clean going forward.
The deprecation is deferred: it does not happen until the engineering
layer V1 is demonstrably working against the active project set. Until
then, the existing memory types continue to accept writes so the
Phase 9 loop can be exercised without waiting on the engineering
layer.
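
The deferred refusal could be expressed as a guard that runs before a memory is created. The function name, the exception class, and the `engineering_v1_active` flag are assumptions for this sketch; the real `create_memory` signature is not shown in this document.

```python
# Types that stop accepting new memory rows once engineering V1 ships.
FROZEN_AFTER_ENGINEERING_V1 = frozenset({"project", "adaptation"})


class MemoryTypeFrozenError(ValueError):
    """Raised when a new fact must be written as an entity instead."""


def check_memory_write_allowed(memory_type: str, engineering_v1_active: bool) -> None:
    """Raise if a new memory of this type is no longer allowed (hypothetical guard)."""
    if engineering_v1_active and memory_type in FROZEN_AFTER_ENGINEERING_V1:
        raise MemoryTypeFrozenError(
            f"new {memory_type!r} facts must be created as entities"
        )
```

While the flag is off, every type passes the guard, which matches the deferred deprecation described above.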
## Consequences for Phase 9 (what we just built)
The capture loop, reinforcement, and extractor we shipped today
are *memory-facing*. They produce memory candidates, reinforce
memory confidence, and respect the memory status lifecycle. None
of that changes.
When the engineering layer V1 ships, the extractor in
`src/atocore/memory/extractor.py` gets a sibling in
`src/atocore/entities/extractor.py` that uses the same
interaction-scanning approach but produces entity candidates
instead. The `POST /interactions/{id}/extract` endpoint either:
- runs both extractors and returns a combined result, or
- gains a `?target=memory|entities|both` query parameter
and the decision between those two shapes can wait until the
entity extractor actually exists.
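
For illustration, the second shape (a `target` parameter) might dispatch as sketched below. The extractor callables and the result layout are hypothetical stand-ins; nothing here reflects the current endpoint's implementation.

```python
from typing import Callable

# Hypothetical extractor signature: interaction text in, candidate dicts out.
Extractor = Callable[[str], list]

VALID_TARGETS = ("memory", "entities", "both")


def extract(
    text: str,
    memory_extractor: Extractor,
    entity_extractor: Extractor,
    target: str = "both",
) -> dict:
    """Run the requested extractor(s) and return candidates keyed by layer."""
    if target not in VALID_TARGETS:
        raise ValueError(f"target must be one of {VALID_TARGETS}, got {target!r}")
    result = {}
    if target in ("memory", "both"):
        result["memory"] = memory_extractor(text)
    if target in ("entities", "both"):
        result["entities"] = entity_extractor(text)
    return result
```
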
Until the entity layer is real, the memory extractor also has to
cover some things that will eventually move to entities (decisions,
constraints, requirements). **That overlap is temporary and
intentional.** Rather than leave those cues unextracted for months
while the entity layer is being built, the memory extractor
surfaces them as memory candidates. Later, a migration pass will
propose graduation on every active memory created by
`decision_heading`, `constraint_heading`, and `requirement_heading`
rules once the entity types exist to receive them.
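
The selection step of that migration pass might look like the sketch below. The three rule names come from the text; representing memories as dicts with `status` and `rule` keys is an assumption made for the example.

```python
# Extractor rules whose memories are expected to move to entities later.
GRADUATING_RULES = frozenset(
    {"decision_heading", "constraint_heading", "requirement_heading"}
)


def memories_to_graduate(memories: list) -> list:
    """Select active memories that were created by one of the structural rules."""
    return [
        m
        for m in memories
        if m.get("status") == "active" and m.get("rule") in GRADUATING_RULES
    ]
```
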
So: **no rework in Phase 9, no wasted extraction, clean handoff
once the entity layer lands**.
## Open questions this document does NOT answer
These are deliberately deferred to later planning docs:
1. **When exactly does extraction fire?** (answered by
`promotion-rules.md`)
2. **How are conflicts between a memory and an entity handled
during graduation?** (answered by `conflict-model.md`)
3. **Does the context builder traverse the entity graph for
relationship-rich queries, or does it only surface direct facts?**
(answered by the context-builder spec in a future
`engineering-context-integration.md` doc)
4. **What is the exact API shape of the unified candidate review
queue?** (answered by a future `review-queue-api.md` doc when
the entity extractor exists and both tables need one UI)
## TL;DR
- memories = user-facing unstructured facts, still own identity/preference/episodic
- entities = engineering-facing typed facts, own project/adaptation and structured knowledge
- one canonical home per concept, never both
- one shared candidate-review queue, same promote/reject shape
- graduated memories stay as frozen historical pointers
- Phase 9 stays memory-only and ships today; entity V1 follows the
remaining architecture docs in this planning sprint
- no rework required when the entity layer lands; the current memory
extractor's structural cues get migrated forward via explicit
graduation