# Promotion Rules (Layer 0 → Layer 2 pipeline)

## Purpose

AtoCore ingests raw human-authored content (markdown, repo notes,
interaction transcripts) and eventually must turn some of it into
typed engineering entities that the V1 query catalog can answer.
The path from raw text to typed entity has to be:

- **explicit**: every step has a named operation, a trigger, and an
  audit log
- **reversible**: every promotion can be undone without data loss
- **conservative**: no automatic movement into trusted state; a human
  (or later, a very confident policy) always signs off
- **traceable**: every typed entity must carry a back-pointer to
  the raw source that produced it

This document defines that path.

## The four layers

Promotion is described in terms of four layers, all of which exist
simultaneously in the system once the engineering layer V1 ships:

| Layer | Name | Canonical storage | Trust | Who writes |
|-------|------------------|------------------------------------|---------|--------------------|
| L0 | Raw source | source_documents + source_chunks | low | ingestion pipeline |
| L1 | Memory candidate | memories (status="candidate") | low | extractor |
| L1' | Active memory | memories (status="active") | med | human promotion |
| L2 | Entity candidate | entities (status="candidate") | low | extractor + graduation |
| L2' | Active entity | entities (status="active") | high | human promotion |
| L3 | Trusted state | project_state | highest | human curation |

Layer 3 (trusted project state) is already implemented and stays
manually curated — automatic promotion into L3 is **never** allowed.

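The layer table can be read as a lookup from layer label to (canonical storage, status). A minimal sketch of that mapping (illustrative only — the real system stores status on the rows themselves, not in a constant):

```python
# Layer label -> (canonical storage table, status value).
# None means the table has no candidate/active status column.
LAYERS = {
    "L0":  ("source_chunks", None),
    "L1":  ("memories", "candidate"),
    "L1'": ("memories", "active"),
    "L2":  ("entities", "candidate"),
    "L2'": ("entities", "active"),
    "L3":  ("project_state", None),
}

def storage_for(layer: str) -> tuple:
    """Return (table, status) for a layer label like "L1'"."""
    return LAYERS[layer]
```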
## The promotion graph

```
[L0] source chunks
  |
  | extraction (memory extractor, Phase 9 Commit C)
  v
[L1] memory candidate
  |
  | promote_memory()
  v
[L1'] active memory
  |
  | (optional) propose_graduation()
  v
[L2] entity candidate
  |
  | promote_entity()
  v
[L2'] active entity
  |
  | (manual curation, NEVER automatic)
  v
[L3] trusted project state
```

Short path (direct entity extraction, once the entity extractor
exists):

```
[L0] source chunks
  |
  | entity extractor
  v
[L2] entity candidate
  |
  | promote_entity()
  v
[L2'] active entity
```

A single fact can travel either path depending on what the
extractor saw. The graduation path exists for facts that started
life as memories before the entity layer existed, and for the
memory extractor's structural cues (decisions, constraints,
requirements) which are eventually entity-shaped.

## Triggers (when does extraction fire?)

Phase 9 already shipped one trigger: **on explicit API request**
(`POST /interactions/{id}/extract`). The V1 engineering layer adds
two more:

1. **On interaction capture (automatic)**
   - Same event that runs reinforcement today
   - Controlled by an `extract` boolean flag on the record request
     (default: `false` for the memory extractor, `true` once an
     engineering extractor exists and has been validated)
   - Output goes to the candidate queue; nothing auto-promotes

2. **On ingestion (batched, per wave)**
   - After a wave of markdown ingestion finishes, a batch extractor
     pass sweeps all newly-added source chunks and produces
     candidates from them
   - Batched per wave (not per chunk) to keep the review queue
     digestible and to let the reviewer see all candidates from a
     single ingestion in one place
   - Output: a report artifact plus a review queue entry per
     candidate

3. **On explicit human request (existing)**
   - `POST /interactions/{id}/extract` for a single interaction
   - Future: `POST /ingestion/wave/{id}/extract` for a whole wave
   - Future: `POST /memory/{id}/graduate` to propose graduation
     of one specific memory into an entity

Batch size rule: **extraction passes never write more than N
candidates per human review cycle, where N = 50 by default**. If
a pass produces more, it ranks by (rule confidence × content
length × novelty) and only writes the top N. The remaining
candidates are logged, not persisted. This protects the reviewer
from getting buried.

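A minimal sketch of that capping rule. The field names are assumptions, and the document does not pin down how novelty is computed, so it is taken here as a precomputed 0..1 score:

```python
from dataclasses import dataclass

BATCH_CAP = 50  # N: max candidates persisted per review cycle

@dataclass
class Candidate:
    content: str
    rule_confidence: float  # prior from the rule-class table
    novelty: float          # assumed precomputed 0..1 score

def cap_batch(candidates: list[Candidate], cap: int = BATCH_CAP):
    """Rank by (rule confidence x content length x novelty);
    return (top `cap` to persist, remainder to log only)."""
    ranked = sorted(
        candidates,
        key=lambda c: c.rule_confidence * len(c.content) * c.novelty,
        reverse=True,
    )
    return ranked[:cap], ranked[cap:]
```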
## Confidence and ranking of candidates

Each extraction rule carries a *prior confidence* based on how
specific its pattern is:

| Rule class | Prior | Rationale |
|---------------------------|-------|-----------|
| Heading with explicit type (`## Decision:`) | 0.7 | Very specific structural cue, intentional author marker |
| Typed list item (`- [Decision] ...`) | 0.65 | Explicit but often embedded in looser prose |
| Sentence pattern (`I prefer X`) | 0.5 | Moderate structure, more false positives |
| Regex pattern matching a value+unit (`X = 4.8 kg`) | 0.6 | Structural but prone to coincidence |
| LLM-based (future) | variable | Depends on model's returned confidence |

The candidate's final confidence at write time is:

```
final = prior * structural_signal_multiplier * freshness_bonus
```

Where:

- `structural_signal_multiplier` is 1.1 if the source chunk path
  contains any of `_HIGH_SIGNAL_HINTS` from the retriever (status,
  decision, requirements, charter, ...) and 0.9 if it contains
  `_LOW_SIGNAL_HINTS` (`_archive`, `_history`, ...)
- `freshness_bonus` is 1.05 if the source chunk was updated in the
  last 30 days, else 1.0

This formula is tuned later; the numbers are starting values.

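The formula can be sketched as follows. The hint tuples stand in for the retriever's `_HIGH_SIGNAL_HINTS` / `_LOW_SIGNAL_HINTS` (their exact contents here are illustrative), and giving high-signal hints priority when a path matches both lists is an assumption:

```python
from datetime import datetime, timedelta, timezone

# Illustrative stand-ins for the retriever's hint lists.
HIGH_SIGNAL_HINTS = ("status", "decision", "requirements", "charter")
LOW_SIGNAL_HINTS = ("_archive", "_history")

def final_confidence(prior: float, chunk_path: str,
                     updated_at: datetime) -> float:
    """final = prior * structural_signal_multiplier * freshness_bonus"""
    path = chunk_path.lower()
    if any(h in path for h in HIGH_SIGNAL_HINTS):
        structural = 1.1          # high-signal path
    elif any(h in path for h in LOW_SIGNAL_HINTS):
        structural = 0.9          # low-signal path
    else:
        structural = 1.0
    fresh = datetime.now(timezone.utc) - updated_at <= timedelta(days=30)
    return prior * structural * (1.05 if fresh else 1.0)
```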
## Review queue mechanics

### Queue population

- Each candidate writes one row into its target table
  (memories or entities) with `status="candidate"`
- Each candidate carries: `rule`, `source_span`, `source_chunk_id`,
  `source_interaction_id`, `extractor_version`
- No two candidates ever share the same (type, normalized_content,
  project) — if a second extraction pass produces a duplicate, it
  is dropped before being written

### Queue surfacing

- `GET /memory?status=candidate` lists memory candidates
- `GET /entities?status=candidate` (future) lists entity candidates
- `GET /candidates` (future unified route) lists both

### Reviewer actions

For each candidate, exactly one of:

- **promote**: `POST /memory/{id}/promote` or
  `POST /entities/{id}/promote`
  - sets `status="active"`
  - preserves the audit trail (source_chunk_id, rule, source_span)
- **reject**: `POST /memory/{id}/reject` or
  `POST /entities/{id}/reject`
  - sets `status="invalid"`
  - preserves audit trail so repeat extractions don't re-propose
- **edit-then-promote**: `PUT /memory/{id}` to adjust content, then
  `POST /memory/{id}/promote`
  - every edit is logged, original content preserved in a
    `previous_content_log` column (schema addition deferred to
    the first implementation sprint)
- **defer**: no action; candidate stays in queue indefinitely
  (future: add a `pending_since` staleness indicator to the UI)

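The status transitions those actions imply can be sketched as a small guard. The transition set is inferred from this section together with the Reversibility table (reverting a promotion or rejection puts the row back to `candidate`); it is not the real API:

```python
# Allowed status transitions implied by the reviewer actions.
# Illustrative; the actual enforcement lives in the API layer.
ALLOWED = {
    ("candidate", "active"),   # promote
    ("candidate", "invalid"),  # reject
    ("active", "candidate"),   # revert a promotion
    ("invalid", "candidate"),  # revert a rejection
}

def transition(current: str, new: str) -> str:
    """Apply a status change, refusing anything outside ALLOWED."""
    if (current, new) not in ALLOWED:
        raise ValueError(f"illegal transition {current!r} -> {new!r}")
    return new
```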
### Reviewer authentication

In V1 the review queue is single-user by convention. There is no
per-reviewer authorization. Every promote/reject call is logged
with the same default identity. Multi-user review is a V2 concern.

## Auto-promotion policies (deferred, but designed for)

The current V1 stance is: **no auto-promotion, ever**. All
promotions require a human reviewer.

The schema and API are designed so that automatic policies can be
added later without schema changes. The anticipated policies:

1. **Reference-count threshold**
   - If a candidate accumulates N+ references across multiple
     interactions within M days AND the reviewer hasn't seen it yet
     (indicating the system sees it often but the human hasn't
     gotten to it), propose auto-promote
   - Starting thresholds: N=5, M=7 days. Never auto-promote
     entity candidates that affect validation claims or decisions
     without explicit human review — those are too consequential.

2. **Confidence threshold**
   - If `final_confidence >= 0.85` AND the rule is a heading
     rule (not a sentence rule), eligible for auto-promotion

3. **Identity/preference lane**
   - Identity and preference memories extracted from an
     interaction where the user explicitly says "I am X" or
     "I prefer X" with a first-person subject and high-signal
     verb could auto-promote. This is the safest lane because
     the user is the authoritative source for their own identity.

None of these run in V1. The APIs and data shape are designed so
they can be added as a separate policy module without disrupting
existing tests.

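The first two policies can be sketched as a single eligibility check. Field names are assumptions, "within M days" is read here as the candidate's age in the queue, and none of this runs in V1:

```python
def auto_promote_eligible(reference_count: int, days_in_queue: int,
                          final_confidence: float, rule_class: str,
                          reviewed: bool) -> bool:
    """Deferred policies 1 and 2 as a boolean check (sketch)."""
    # Policy 1: N=5 references within M=7 days, and only while the
    # reviewer has not looked at the candidate yet.
    by_references = (reference_count >= 5
                     and days_in_queue <= 7
                     and not reviewed)
    # Policy 2: confidence threshold, heading rules only.
    by_confidence = final_confidence >= 0.85 and rule_class == "heading"
    return by_references or by_confidence
```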
## Reversibility

Every promotion step must be undoable:

| Operation | How to undo |
|---------------------------|-------------------------------------------------------|
| memory candidate written | delete the candidate row (low-risk, it was never in context) |
| memory candidate promoted | `PUT /memory/{id}` status=candidate (reverts to queue) |
| memory candidate rejected | `PUT /memory/{id}` status=candidate |
| memory graduated | memory stays as a frozen pointer; delete the entity candidate to undo |
| entity candidate promoted | `PUT /entities/{id}` status=candidate |
| entity promoted to active | supersede with a new active, or `PUT` back to candidate |

The only irreversible operation is manual curation into L3
(trusted project state). That is by design — L3 is small, curated,
and human-authored end to end.

## Provenance (what every candidate must carry)

Every candidate row, memory or entity, MUST have:

- `source_chunk_id` — if extracted from ingested content, the chunk it came from
- `source_interaction_id` — if extracted from a captured interaction, the interaction it came from
- `rule` — the extractor rule id that fired
- `extractor_version` — a semver-ish string the extractor module carries
  so old candidates can be re-evaluated with a newer extractor

If both `source_chunk_id` and `source_interaction_id` are null, the
candidate was hand-authored (via `POST /memory` directly) and must
be flagged as such. Hand-authored candidates are allowed but
discouraged — the preference is to extract from real content, not
dictate candidates directly.

The active rows inherit all of these fields from their candidate
row at promotion time. They are never overwritten.

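The required fields, and the hand-authored rule, can be sketched as a row shape (id types are assumptions; this is not the actual schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Provenance:
    """Provenance fields every candidate row must carry (sketch)."""
    rule: str                 # extractor rule id that fired
    extractor_version: str    # e.g. "0.1.0"
    source_chunk_id: Optional[int] = None
    source_interaction_id: Optional[int] = None

    @property
    def hand_authored(self) -> bool:
        # Both source ids null => written via POST /memory directly,
        # and must be flagged as such.
        return (self.source_chunk_id is None
                and self.source_interaction_id is None)
```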
## Extractor versioning

The extractor is going to change — new rules added, old rules
refined, precision/recall tuned over time. The promotion flow
must survive extractor changes:

- every extractor module exposes an `EXTRACTOR_VERSION = "0.1.0"`
  constant
- every candidate row records this version
- when the extractor version changes, the change log explains
  what the new rules do
- old candidates are NOT automatically re-evaluated by the new
  extractor — that would lose the auditable history of why the
  old candidate was created
- future `POST /memory/{id}/re-extract` can optionally propose
  an updated candidate from the same source chunk with the new
  extractor, but it produces a *new* candidate alongside the old
  one, never a silent rewrite

## Ingestion-wave extraction semantics

When the batched extraction pass fires on an ingestion wave, it
produces a report artifact:

```
data/extraction-reports/<wave-id>/
├── report.json       # summary counts, rule distribution
├── candidates.ndjson # one JSON line per persisted candidate
├── dropped.ndjson    # one JSON line per candidate dropped
│                     # (over batch cap, duplicate, below
│                     # min content length, etc.)
└── errors.log        # any rule-level errors
```

The report artifact lives under the configured `data_dir` and is
retained per the backup retention policy. The ingestion-waves doc
(`docs/ingestion-waves.md`) is updated to include an "extract"
step after each wave, with the expectation that the human
reviews the candidates before the next wave fires.

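Writing that artifact might look like the sketch below. The layout follows the tree above; the exact fields in `report.json` (beyond "summary counts") are assumptions:

```python
import json
from pathlib import Path

def write_extraction_report(data_dir: Path, wave_id: str,
                            persisted: list[dict],
                            dropped: list[dict]) -> Path:
    """Write the per-wave report artifact under data_dir (sketch)."""
    out = data_dir / "extraction-reports" / wave_id
    out.mkdir(parents=True, exist_ok=True)
    # report.json: summary counts (assumed field names)
    (out / "report.json").write_text(json.dumps({
        "wave_id": wave_id,
        "persisted": len(persisted),
        "dropped": len(dropped),
    }, indent=2))
    # one JSON line per candidate, persisted and dropped
    (out / "candidates.ndjson").write_text(
        "".join(json.dumps(c) + "\n" for c in persisted))
    (out / "dropped.ndjson").write_text(
        "".join(json.dumps(c) + "\n" for c in dropped))
    (out / "errors.log").touch()  # rule-level errors appended here
    return out
```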
## Candidate-to-candidate deduplication across passes

Two extraction passes over the same chunk (or two different
chunks containing the same fact) should not produce two identical
candidate rows. The deduplication key is:

```
(memory_type_or_entity_type, normalized_content, project, status)
```

Normalization strips whitespace variants, lowercases, and drops
trailing punctuation (same rules as the extractor's `_clean_value`
function). If a second pass would produce a duplicate, it instead
increments a `re_extraction_count` column on the existing
candidate row and updates `last_re_extracted_at`. This gives the
reviewer a "saw this N times" signal without flooding the queue.

This column is a future schema addition — current candidates do
not track re-extraction. The promotion-rules implementation will
land the column as part of its first migration.

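The normalization and key can be sketched as follows, mirroring the `_clean_value` rules as described here (collapse whitespace, lowercase, drop trailing punctuation); the exact punctuation set is an assumption:

```python
import re

def normalize(content: str) -> str:
    """Whitespace variants collapsed, lowercased, trailing
    punctuation dropped — per the _clean_value description."""
    text = re.sub(r"\s+", " ", content).strip().lower()
    return text.rstrip(".,;:!?")

def dedup_key(type_: str, content: str, project: str,
              status: str) -> tuple:
    """The (type, normalized_content, project, status) dedup key."""
    return (type_, normalize(content), project, status)
```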
## The "never auto-promote into trusted state" invariant

Regardless of what auto-promotion policies might exist between
L0 → L2', **nothing ever moves into L3 (trusted project state)
without explicit human action via `POST /project/state`**. This
is the one hard line in the promotion graph and it is enforced
by having no API endpoint that takes a candidate id and writes
to `project_state`.

## Summary

- Four layers: L0 raw, L1 memory candidate/active, L2 entity
  candidate/active, L3 trusted state
- Three triggers for extraction: on capture, on ingestion wave, on
  explicit request
- Per-rule prior confidence, tuned by structural signals at write time
- Shared candidate review queue, promote/reject/edit/defer actions
- No auto-promotion in V1 (but the schema allows it later)
- Every candidate carries full provenance and extractor version
- Every promotion step is reversible except L3 curation
- L3 is never touched automatically