From 1953e559f97f02da14fc0fc44ceef5e656872a17 Mon Sep 17 00:00:00 2001
From: Anto01 <antoine.letarte@gmail.com>
Date: Tue, 7 Apr 2026 20:14:19 -0400
Subject: [PATCH] docs+test: clarify legacy alias compatibility gap, add gap
 regression test
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Codex caught a real documentation accuracy bug in the previous
canonicalization doc commit (f521aab). The doc claimed that rows
written under aliases before fb6298a "still work via the
unregistered-name fallback path" — that is wrong for REGISTERED
aliases, which is exactly the case that matters.

The unregistered-name fallback only saves you when the project was
never in the registry: a row stored under "orphan-project" is read
back via "orphan-project", both pass through resolve_project_name
unchanged, and the strings line up. For a registered alias like
"p05", the helper rewrites the read key to "p05-interferometer"
but does NOT rewrite the storage key, so the legacy row becomes
silently invisible.
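
A minimal, self-contained sketch of that mismatch, assuming a
dict-backed registry and a single in-memory SQLite table. The
ALIASES mapping, the resolve() helper, and the table layout are
illustrative stand-ins, not the real resolve_project_name or the
real schema:

```python
import sqlite3

# Hypothetical stand-in for the registry: alias -> canonical id.
ALIASES = {"p05": "p05-interferometer", "interferometer": "p05-interferometer"}


def resolve(name: str) -> str:
    """Return the canonical id for a registered alias, else the name unchanged."""
    return ALIASES.get(name, name)


def read_state(conn: sqlite3.Connection, name: str) -> list[str]:
    """Canonicalized read: the lookup key is rewritten, the stored key is not."""
    rows = conn.execute(
        "SELECT value FROM state WHERE project = ?", (resolve(name),)
    ).fetchall()
    return [value for (value,) in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE state (project TEXT, value TEXT)")
# Legacy writes made before canonicalization stored the raw name.
conn.execute("INSERT INTO state VALUES ('p05', 'legacy alias row')")
conn.execute("INSERT INTO state VALUES ('orphan-project', 'unregistered row')")

orphan_rows = read_state(conn, "orphan-project")         # key unchanged, row found
alias_rows = read_state(conn, "p05")                     # key rewritten, row missed
canonical_rows = read_state(conn, "p05-interferometer")  # nothing stored here yet
```

The unregistered name round-trips, while the alias-keyed row stays
in the table but is missed by both the alias read and the
canonical read.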

This commit corrects the doc and locks the gap behavior in with
a regression test, so the issue cannot be lost again.

docs/architecture/project-identity-canonicalization.md
------------------------------------------------------
- Removed the misleading claim from the "What this rule does NOT
  cover" section. Replaced with a pointer to the new gap section
  and an explicit statement that the migration is required before
  engineering V1 ships.
- New "Compatibility gap: legacy alias-keyed rows" section between
  "Why this is the trust hierarchy in action" and "The rule for
  new entry points". This is the natural insertion point because
  the gap is exactly the trust hierarchy failing for legacy data.
  The section covers:
  * a worked T0/T1 timeline showing the exact failure mode
  * what is at risk on the live Dalidou DB, ranked by trust tier:
    projects table (shadow rows), project_state (highest risk
    because Layer 3 is most-authoritative), memories, interactions
  * inspection SQL queries for measuring the actual blast radius
    on the live DB before running any migration
  * the spec for the migration script: walk projects, find shadow
    rows, merge dependent state via the conflict model when there
    are collisions, dry-run mode, idempotent
  * explicit statement that this is required pre-V1 because V1
    will add new project-keyed tables and the killer correctness
    queries from engineering-query-catalog.md would report wrong
    results against any project that has shadow rows
- "Open follow-ups" item 1 promoted from "tracked optional" to
  "REQUIRED before engineering V1 ships, NOT optional" with a
  more honest cost estimate (~150 LOC migration + ~50 LOC tests
  + supervised live run, not the previous optimistic ~30 LOC)
- TL;DR rewritten to mention the gap explicitly and re-order
  the open follow-ups so the migration is the top priority

tests/test_project_state.py
---------------------------
- New test_legacy_alias_keyed_state_is_invisible_until_migrated
- Inserts a "p05" project row + a project_state row pointing at
  it via raw SQL (bypassing set_state which now canonicalizes),
  simulating a pre-fix legacy row
- Verifies the canonicalized get_state path can NOT see the row
  via either the alias or the canonical id — this is the bug
- Verifies the row is still in the database (just unreachable),
  so the migration script has something to find
- The docstring explicitly says: "When the legacy alias migration
  script lands, this test must be inverted." Future readers will
  know exactly when and how to update it.

Full suite: 175 passing (was 174), 1 warning. The +1 is the new
gap regression test.

What this commit does NOT do
----------------------------
- The migration script itself is NOT in this commit. Codex's
  finding was a doc accuracy issue, so the right scope here is to
  fix the doc and lock the gap behavior in. Writing the migration
  is the next concrete step, but it is bigger (~200 LOC, plus
  dry-run mode, collision handling via the conflict model, and a
  supervised run on the live Dalidou DB). It warrants its own
  commit, and probably a "draft + review the dry-run output
  before applying" workflow rather than a single shot.
- Existing tests are unchanged. The new test stands alone as a
  documented gap; the 12 canonicalization tests from fb6298a
  still pass without modification.
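
For the follow-up commit, the collision check that has to run
before any project_state rekey might look like the sketch below.
It is an assumption-laden illustration: find_state_collisions is a
hypothetical name, the table is reduced to four columns, and
handing the result to the conflict model is left out entirely.

```python
import sqlite3


def find_state_collisions(conn, shadow_id, canonical_id):
    """Return (category, key) pairs that exist under both project ids."""
    return conn.execute(
        """
        SELECT s.category, s.key
        FROM project_state AS s
        JOIN project_state AS c
          ON c.category = s.category AND c.key = s.key
        WHERE s.project_id = ? AND c.project_id = ?
        ORDER BY s.category, s.key
        """,
        (shadow_id, canonical_id),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE project_state (project_id TEXT, category TEXT, key TEXT, value TEXT)"
)
conn.executemany(
    "INSERT INTO project_state VALUES (?, ?, ?, ?)",
    [
        ("shadow", "status", "focus", "Wave 1 ingestion"),   # collides
        ("shadow", "status", "owner", "legacy owner"),       # safe to rekey
        ("canonical", "status", "focus", "Wave 2 ingestion"),
    ],
)

collisions = find_state_collisions(conn, "shadow", "canonical")
```

Rows that do not appear in the result can be rekeyed directly;
rows that do appear need a human decision first.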
---
 .../project-identity-canonicalization.md      | 162 ++++++++++++++++--
 tests/test_project_state.py                   |  71 ++++++++
 2 files changed, 216 insertions(+), 17 deletions(-)
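
Reviewer note (not part of the commit): the string-column half of
the migration spec could be sketched roughly as follows. This
covers only `memories` and `interactions`, where the alias is a
plain string column. The ALIASES mapping, the rekey_aliases name,
and the two-column table layout are assumptions for illustration,
and the project_state merge is deliberately omitted.

```python
import sqlite3

ALIASES = {"p05": "p05-interferometer"}  # hypothetical alias -> canonical id
STRING_KEYED_TABLES = ("memories", "interactions")


def rekey_aliases(conn: sqlite3.Connection, dry_run: bool = True):
    """Re-key alias-named rows to the canonical id.

    Returns one (table, alias, canonical, row_count) entry per alias that
    still has rows. With dry_run=True nothing is written.
    """
    plan = []
    for table in STRING_KEYED_TABLES:  # fixed names, safe to interpolate
        for alias, canonical in ALIASES.items():
            (count,) = conn.execute(
                f"SELECT COUNT(*) FROM {table} WHERE project = ?", (alias,)
            ).fetchone()
            if count == 0:
                continue  # nothing left under this alias, so reruns are no-ops
            plan.append((table, alias, canonical, count))
            if not dry_run:
                conn.execute(
                    f"UPDATE {table} SET project = ? WHERE project = ?",
                    (canonical, alias),
                )
    if not dry_run:
        conn.commit()
    return plan


conn = sqlite3.connect(":memory:")
for table in STRING_KEYED_TABLES:
    conn.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, project TEXT)")
conn.executemany(
    "INSERT INTO memories (project) VALUES (?)",
    [("p05",), ("p05",), ("p05-interferometer",)],
)
conn.execute("INSERT INTO interactions (project) VALUES ('p05')")

preview = rekey_aliases(conn, dry_run=True)   # reports, writes nothing
applied = rekey_aliases(conn, dry_run=False)  # same plan, now executed
rerun = rekey_aliases(conn, dry_run=False)    # empty: already migrated
```

The dry run and the real run return the same plan, and a second
real run finds nothing to do, which is the idempotence the spec
asks for.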

diff --git a/docs/architecture/project-identity-canonicalization.md b/docs/architecture/project-identity-canonicalization.md
index 579a900..e3a7ae1 100644
--- a/docs/architecture/project-identity-canonicalization.md
+++ b/docs/architecture/project-identity-canonicalization.md
@@ -152,6 +152,116 @@ has to be findable. To be findable, the lookup key has to match
 how the row was stored. And the only way to guarantee that match
 across every entry point is to canonicalize at every boundary.
 
+## Compatibility gap: legacy alias-keyed rows
+
+The canonicalization rule fixes new writes going forward, but it
+does NOT fix rows that were already written under a registered
+alias before `fb6298a` landed. Those rows have a real, concrete
+gap that must be closed by a one-time migration before the
+engineering layer V1 ships.
+
+The exact failure mode:
+
+```
+        time T0 (before fb6298a):
+            POST /project/state {project: "p05", ...}
+            -> set_state("p05", ...)        # no canonicalization
+            -> ensure_project("p05")        # creates a "p05" row
+            -> writes state with project_id pointing at the "p05" row
+
+        time T1 (after fb6298a):
+            POST /project/state {project: "p05", ...}     (or any read)
+            -> set_state("p05", ...)
+            -> resolve_project_name("p05") -> "p05-interferometer"
+            -> ensure_project("p05-interferometer")        # creates a SECOND row
+            -> writes new state under the canonical row
+            -> the T0 state is still in the "p05" row, INVISIBLE to every
+               canonicalized read
+```
+
+The unregistered-name fallback path saves you when the project was
+never in the registry: a row stored under `"orphan-project"` is read
+back via `"orphan-project"`, both pass through `resolve_project_name`
+unchanged, and the strings line up. **It does not save you when the
+name is a registered alias** — the helper rewrites the read key but
+not the storage key, and the legacy row becomes invisible.
+
+What is at risk on the live Dalidou DB:
+
+1. **`projects` table**: any rows whose `name` column matches a
+   registered alias (one row per alias that was actually written
+   to before the fix landed). These shadow the canonical project
+   row and silently fragment the projects namespace.
+2. **`project_state` table**: any rows whose `project_id` points
+   at one of those shadow project rows. **This is the highest-risk
+   case** because it directly defeats the trust hierarchy: Layer 3
+   trusted state becomes invisible to every canonicalized lookup.
+3. **`memories` table**: any rows whose `project` column is a
+   registered alias. Reinforcement and extraction queries will
+   miss them.
+4. **`interactions` table**: any rows whose `project` column is a
+   registered alias. Listing and downstream reflection will miss
+   them.
+
+How to find out the actual blast radius on the live Dalidou DB:
+
+```sql
+-- inspect the projects table for alias-shadow rows
+SELECT id, name FROM projects;
+
+-- count alias-keyed memories per known alias
+SELECT project, COUNT(*) FROM memories
+  WHERE project IN ('p04','p05','p06','gigabit','interferometer','polisher','ato core')
+  GROUP BY project;
+
+-- count alias-keyed interactions
+SELECT project, COUNT(*) FROM interactions
+  WHERE project IN ('p04','p05','p06','gigabit','interferometer','polisher','ato core')
+  GROUP BY project;
+
+-- count alias-shadowed project_state rows by project name
+SELECT p.name, COUNT(*) FROM project_state ps JOIN projects p ON ps.project_id = p.id
+  WHERE p.name IN ('p04','p05','p06','gigabit','interferometer','polisher','ato core')
+  GROUP BY p.name;
+```
+
+The migration that closes the gap has to:
+
+1. For each registered project, find all `projects` rows whose
+   name matches one of the project's aliases AND is not the
+   canonical id itself. These are the "shadow" rows.
+2. For each shadow row, MERGE its dependent state into the
+   canonical project's row:
+   - rekey `project_state.project_id` from shadow → canonical
+   - if the merge would create a `(project_id, category, key)`
+     collision (a state row already exists under the canonical
+     id with the same category+key), the migration must surface
+     the conflict via the existing conflict model and pause
+     until the human resolves it
+   - delete the now-empty shadow `projects` row
+3. For `memories` and `interactions`, the fix is simpler because
+   the alias appears as a string column (not a foreign key):
+   `UPDATE memories SET project = canonical WHERE project = alias`,
+   then same for interactions.
+4. The migration must run in dry-run mode first, printing the
+   exact rows it would touch and the canonical destinations they
+   would be merged into.
+5. The migration must be idempotent — running it twice produces
+   the same final state as running it once.
+
+This work is **required before the engineering layer V1 ships**
+because V1 will add new `entities`, `relationships`, `conflicts`,
+and `mirror_regeneration_failures` tables that all key on the
+canonical project id. Any leaked alias-keyed rows in the existing
+tables would show up in V1 reads as silently missing data, and
+the killer-correctness queries from `engineering-query-catalog.md`
+(orphan requirements, decisions on flagged assumptions,
+unsupported claims) would report wrong results against any project
+that has shadow rows.
+
+The migration script does NOT exist yet. The open follow-ups
+section below tracks it as the next concrete step.
+
 ## The rule for new entry points
 
 When you add a new service-layer function that takes a project name,
@@ -222,11 +332,14 @@ features is the cheapest way to keep the contract intact.
    if the alias set evolves.
 5. **Migration of legacy data.** If the live Dalidou DB has rows
    that were written under aliases before the canonicalization
-   landed, those rows still work via the unregistered-name
-   fallback path. They are not automatically migrated to canonical
-   form. A future migration script could walk the DB and
-   re-key any rows whose `project` field matches a known alias to
-   the canonical id; tracked as an open follow-up below.
+   landed (e.g. a `memories` row with `project = "p05"` from
+   before `fb6298a`), those rows are **NOT** automatically
+   reachable from the canonicalized read path. The unregistered-
+   name fallback only helps for project names that were never
+   registered at all; it does **NOT** help for names that are
+   registered as aliases. See the "Compatibility gap" section
+   above for the exact failure mode and the migration path that
+   has to run before the engineering layer V1 ships.
 
 ## What this enables for the engineering layer V1
 
@@ -262,14 +375,22 @@ across many independent additions.
 These are things the canonicalization story still has open. None
 are blockers, but they're the rough edges to be aware of.
 
-1. **Legacy alias data migration.** If the live Dalidou DB has any
-   rows written under aliases before `fb6298a` landed, they
-   still work via the unregistered-name fallback path. A small
-   migration script could walk `memories`, `interactions`,
-   `project_state`, and `projects`, find any names that match a
-   registry alias, and re-key them to the canonical id. Worth
-   doing once before the engineering layer V1 lands. Estimated
-   cost: ~30 LOC + a dry-run mode + a one-time run.
+1. **Legacy alias data migration — REQUIRED before engineering V1
+   ships, NOT optional.** If the live Dalidou DB has any rows
+   written under aliases before `fb6298a` landed, they are
+   silently invisible to the canonicalized read path (see the
+   "Compatibility gap" section above for the exact failure mode).
+   This is a real correctness issue, not a theoretical one: any
+   trusted state, memory, or interaction stored under `p05`,
+   `gigabit`, `polisher`, etc. before the fix landed is currently
+   unreachable from any service-layer query. The migration script
+   has to walk `projects`, `project_state`, `memories`, and
+   `interactions`, merge shadow rows into their canonical
+   counterparts (with conflict-model handling for any collisions),
+   and run in dry-run mode first. Estimated cost: ~150 LOC for
+   the migration script + ~50 LOC of tests + a one-time supervised
+   run on the live Dalidou DB. **This migration is the next
+   concrete pre-V1 step.**
 2. **Registry file caching.** `load_project_registry()` reads the
    JSON file on every `resolve_project_name` call. With ~5
    projects this is fine; with 50+ it would warrant a versioned
@@ -321,8 +442,14 @@ def my_new_service_entry_point(
   entry point that takes a project name
 - Currently called in 8 places across builder, project_state,
   interactions, and memory; all 8 listed in this doc
-- Backwards-compat path returns unregistered names unchanged so
-  legacy data still works without a migration
+- Backwards-compat path returns **unregistered** names unchanged
+  (e.g. `"orphan-project"`); this does NOT cover **registered
+  alias** names that were used as storage keys before `fb6298a`
+- **Real compatibility gap**: any row whose `project` column is a
+  registered alias from before the canonicalization landed is
+  silently invisible to the new read path. A one-time migration
+  is required before engineering V1 ships. See the "Compatibility
+  gap" section.
 - The trust hierarchy depends on this helper being applied
   everywhere — Layer 3 trusted state has to be findable for it to
   win the trust battle
@@ -330,5 +457,6 @@ def my_new_service_entry_point(
   for any new service function that takes a project name
 - The engineering layer V1 implementation must follow the same
   pattern at every new service entry point
-- Open follow-ups: legacy data migration, registry caching,
-  redundant substring boost cleanup
+- Open follow-ups (in priority order): **legacy alias data
+  migration (required pre-V1)**, redundant substring boost
+  cleanup, registry caching when projects scale
diff --git a/tests/test_project_state.py b/tests/test_project_state.py
index ab5c54e..595d826 100644
--- a/tests/test_project_state.py
+++ b/tests/test_project_state.py
@@ -196,3 +196,74 @@ def test_unregistered_project_state_still_works(project_registry):
     entries = get_state("orphan-project")
     assert len(entries) == 1
     assert entries[0].value == "Standalone"
+
+
+def test_legacy_alias_keyed_state_is_invisible_until_migrated(project_registry):
+    """Documents the compatibility gap from project-identity-canonicalization.md.
+
+    Rows that were written under a registered alias BEFORE the
+    canonicalization landed in fb6298a are stored in the projects
+    table under the alias name (not the canonical id). Every read
+    path now canonicalizes to the canonical id, so those legacy
+    rows become invisible.
+
+    This test simulates the legacy state by inserting a shadow
+    project row and a state row that points at it via raw SQL,
+    bypassing set_state() which now canonicalizes. Then it
+    verifies the canonicalized get_state() does NOT find the
+    legacy row.
+
+    When the legacy alias migration script lands (see the open
+    follow-ups in docs/architecture/project-identity-canonicalization.md),
+    this test must be inverted: after running the migration the
+    legacy state should be reachable via the canonical project,
+    not invisible. The migration is required before engineering
+    V1 ships.
+    """
+    import uuid
+
+    from atocore.models.database import get_connection
+
+    project_registry(("p05-interferometer", ["p05", "interferometer"]))
+
+    # Simulate a pre-fix legacy row by writing directly under the
+    # alias name. This is what the OLD set_state would have done
+    # before fb6298a added canonicalization.
+    legacy_project_id = str(uuid.uuid4())
+    legacy_state_id = str(uuid.uuid4())
+    with get_connection() as conn:
+        conn.execute(
+            "INSERT INTO projects (id, name, description) VALUES (?, ?, ?)",
+            (legacy_project_id, "p05", "shadow row created before canonicalization"),
+        )
+        conn.execute(
+            "INSERT INTO project_state "
+            "(id, project_id, category, key, value, source, confidence) "
+            "VALUES (?, ?, ?, ?, ?, ?, ?)",
+            (
+                legacy_state_id,
+                legacy_project_id,
+                "status",
+                "legacy_focus",
+                "Wave 1 ingestion",
+                "pre-canonicalization",
+                1.0,
+            ),
+        )
+
+    # The canonicalized read path looks under "p05-interferometer"
+    # and cannot see the legacy row. THIS IS THE GAP.
+    via_alias = get_state("p05")
+    via_canonical = get_state("p05-interferometer")
+    assert all(entry.value != "Wave 1 ingestion" for entry in via_alias)
+    assert all(entry.value != "Wave 1 ingestion" for entry in via_canonical)
+
+    # The legacy row is still in the database — it's just unreachable
+    # from the canonicalized read path. The migration script (open
+    # follow-up) is what closes the gap.
+    with get_connection() as conn:
+        row = conn.execute(
+            "SELECT value FROM project_state WHERE id = ?", (legacy_state_id,)
+        ).fetchone()
+    assert row is not None
+    assert row["value"] == "Wave 1 ingestion"