ATOCore/docs/architecture/project-identity-canonicalization.md
Anto01 1953e559f9 docs+test: clarify legacy alias compatibility gap, add gap regression test
Codex caught a real documentation accuracy bug in the previous
canonicalization doc commit (f521aab). The doc claimed that rows
written under aliases before fb6298a "still work via the
unregistered-name fallback path" — that is wrong for REGISTERED
aliases, which is exactly the case that matters.

The unregistered-name fallback only saves you when the project was
never in the registry: a row stored under "orphan-project" is read
back via "orphan-project", both pass through resolve_project_name
unchanged, and the strings line up. For a registered alias like
"p05", the helper rewrites the read key to "p05-interferometer"
but does NOT rewrite the storage key, so the legacy row becomes
silently invisible.

This commit corrects the doc and locks the gap behavior in with
a regression test, so the issue cannot be lost again.

docs/architecture/project-identity-canonicalization.md
------------------------------------------------------
- Removed the misleading claim from the "What this rule does NOT
  cover" section. Replaced with a pointer to the new gap section
  and an explicit statement that the migration is required before
  engineering V1 ships.
- New "Compatibility gap: legacy alias-keyed rows" section between
  "Why this is the trust hierarchy in action" and "The rule for
  new entry points". This is the natural insertion point because
  the gap is exactly the trust hierarchy failing for legacy data.
  The section covers:
  * a worked T0/T1 timeline showing the exact failure mode
  * what is at risk on the live Dalidou DB, ranked by trust tier:
    projects table (shadow rows), project_state (highest risk
    because Layer 3 is most-authoritative), memories, interactions
  * inspection SQL queries for measuring the actual blast radius
    on the live DB before running any migration
  * the spec for the migration script: walk projects, find shadow
    rows, merge dependent state via the conflict model when there
    are collisions, dry-run mode, idempotent
  * explicit statement that this is required pre-V1 because V1
    will add new project-keyed tables and the killer correctness
    queries from engineering-query-catalog.md would report wrong
    results against any project that has shadow rows
- "Open follow-ups" item 1 promoted from "tracked optional" to
  "REQUIRED before engineering V1 ships, NOT optional" with a
  more honest cost estimate (~150 LOC migration + ~50 LOC tests
  + supervised live run, not the previous optimistic ~30 LOC)
- TL;DR rewritten to mention the gap explicitly and re-order
  the open follow-ups so the migration is the top priority

tests/test_project_state.py
---------------------------
- New test_legacy_alias_keyed_state_is_invisible_until_migrated
- Inserts a "p05" project row + a project_state row pointing at
  it via raw SQL (bypassing set_state which now canonicalizes),
  simulating a pre-fix legacy row
- Verifies the canonicalized get_state path can NOT see the row
  via either the alias or the canonical id — this is the bug
- Verifies the row is still in the database (just unreachable),
  so the migration script has something to find
- The docstring explicitly says: "When the legacy alias migration
  script lands, this test must be inverted." Future readers will
  know exactly when and how to update it.

Full suite: 175 passing (was 174), 1 warning. The +1 is the new
gap regression test.

What this commit does NOT do
----------------------------
- The migration script itself is NOT in this commit. Codex's
  finding was a doc accuracy issue, and the right scope is fix
  the doc + lock the gap behavior in. Writing the migration is
  the next concrete step but is bigger (~200 LOC + dry-run mode
  + collision handling via the conflict model + supervised run
  on the live Dalidou DB), warrants its own commit, and probably
  warrants a "draft + review the dry-run output before applying"
  workflow rather than a single shot.
- Existing tests are unchanged. The new test stands alone as a
  documented gap; the 12 canonicalization tests from fb6298a
  still pass without modification.
2026-04-07 20:14:19 -04:00


Project Identity Canonicalization

Why this document exists

AtoCore identifies projects by name in many places: trusted state rows, memories, captured interactions, query/context API parameters, extractor candidates, future engineering entities. Without an explicit rule, every callsite would have to remember to canonicalize project names through the registry — and the recent codex review caught exactly the bug class that follows when one of them forgets.

The fix landed in fb6298a and works correctly today. This document exists to make the rule explicit and discoverable so the engineering layer V1 implementation, future entity write paths, and any new agent integration don't reintroduce the same fragmentation when nobody is looking.

The contract

Every read/write that takes a project name MUST canonicalize it through resolve_project_name() before the value crosses a service boundary.

The boundary is wherever a project name becomes a database row, a query filter, an attribute on a stored object, or a key for any lookup. The canonicalization happens once, at that boundary, before the underlying storage primitive is called.

Symbolically:

    HTTP layer (raw user input)
        ↓
    service entry point
        ↓
    project_name = resolve_project_name(project_name)   ← ONLY canonical from this point
        ↓
    storage / queries / further service calls

The rule is intentionally simple. There's no per-call exception, no "trust me, the caller already canonicalized it" shortcut, no opt-out flag. Every service-layer entry point applies the helper the moment it receives a project name from outside the service.

The helper

# src/atocore/projects/registry.py

def resolve_project_name(name: str | None) -> str:
    """Canonicalize a project name through the registry.

    Returns the canonical project_id if the input matches any
    registered project's id or alias. Returns the input unchanged
    when it's empty or not in the registry — the second case keeps
    backwards compatibility with hand-curated state, memories, and
    interactions that predate the registry, or for projects that
    are intentionally not registered.
    """
    if not name:
        return name or ""
    project = get_registered_project(name)
    if project is not None:
        return project.project_id
    return name

Three behaviors worth keeping in mind:

  1. Empty / None input → empty string output. Callers don't have to pre-check; passing "" or None to a query filter still works as "no project scope".
  2. Registered alias → canonical project_id. The helper does the case-insensitive lookup and returns the project's id field (e.g. "p05" → "p05-interferometer").
  3. Unregistered name → input unchanged. This is the backwards-compatibility path. Hand-curated state, memories, or interactions created under a name that isn't in the registry keep working. The retrieval is then "best effort" — the raw string is used as the SQL key, which still finds the row that was stored under the same raw string. This path exists so the engineering layer V1 doesn't have to also be a data migration.
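The three behaviors can be exercised with a small self-contained sketch. The `RegisteredProject` dataclass and the in-memory `_REGISTRY` list are stand-ins invented for illustration; the real `get_registered_project` reads the registry file and may differ in detail.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RegisteredProject:
    """Hypothetical stand-in for one registry entry."""
    project_id: str
    aliases: tuple[str, ...] = ()


# Hypothetical in-memory registry (assumption: the real one lives in a JSON file).
_REGISTRY = [RegisteredProject("p05-interferometer", ("p05", "interferometer"))]


def get_registered_project(name: str) -> RegisteredProject | None:
    """Case-insensitive match against each project's id and aliases."""
    wanted = name.lower()
    for project in _REGISTRY:
        known = (project.project_id, *project.aliases)
        if wanted in (n.lower() for n in known):
            return project
    return None


def resolve_project_name(name: str | None) -> str:
    if not name:
        return ""
    project = get_registered_project(name)
    return project.project_id if project is not None else name


print(repr(resolve_project_name(None)))        # '' (behavior 1)
print(resolve_project_name("P05"))             # p05-interferometer (behavior 2)
print(resolve_project_name("orphan-project"))  # orphan-project (behavior 3)
```

Passing the canonical id itself also returns the canonical id, so the helper is safe to apply more than once along a call chain.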

Where the helper is currently called

As of fb6298a, the helper is invoked at exactly these eight service-layer entry points:

| Module | Function | What gets canonicalized |
| --- | --- | --- |
| src/atocore/context/builder.py | build_context | the project_hint parameter, before the trusted state lookup |
| src/atocore/context/project_state.py | set_state | project_name, before ensure_project() |
| src/atocore/context/project_state.py | get_state | project_name, before the SQL lookup |
| src/atocore/context/project_state.py | invalidate_state | project_name, before the SQL lookup |
| src/atocore/interactions/service.py | record_interaction | project, before insert |
| src/atocore/interactions/service.py | list_interactions | project filter parameter, before WHERE clause |
| src/atocore/memory/service.py | create_memory | project, before insert |
| src/atocore/memory/service.py | get_memories | project filter parameter, before WHERE clause |

Every one of those is the first thing the function does after input validation. There is no path through any of those eight functions where a project name reaches storage without passing through resolve_project_name.

Where the helper is NOT called (and why that's correct)

These places intentionally do not canonicalize:

  1. update_memory's project field. The API does not allow changing a memory's project after creation, so there's no project to canonicalize. The function only updates content, confidence, and status.
  2. The retriever's _project_match_boost substring matcher. It already calls get_registered_project internally to expand the hint into the candidate set (canonical id + all aliases + last path segments). It accepts the raw hint by design.
  3. _rank_chunks's secondary substring boost in builder.py. Still uses the raw hint. This is a multiplicative factor on top of correct retrieval, not a filter, so it cannot drop relevant chunks. Tracked as a future cleanup but not critical.
  4. Direct SQL queries for the projects table itself (e.g. ensure_project's lookup). These are intentional case-insensitive raw lookups against the column the canonical id is stored in. set_state already canonicalized before reaching ensure_project, so the value passed is the canonical id by definition.
  5. Hand-authored project names that aren't in the registry. The helper returns those unchanged. This is the backwards-compat path mentioned above; it is not a violation of the rule, it's the rule applied to a name with no registry record.

Why this is the trust hierarchy in action

The whole point of AtoCore is the trust hierarchy from the operating model:

  1. Trusted Project State (Layer 3) is the most authoritative layer
  2. Memories (active) are second
  3. Source chunks (raw retrieved content) are last

If a caller passes the alias p05 and Layer 3 was written under p05-interferometer, and the lookup fails to find the canonical row, the trust hierarchy collapses. The most-authoritative layer is silently invisible to the caller. The system would still return something — namely, lower-trust retrieved chunks — and the human would never know they got a degraded answer.

The canonicalization helper is what makes the trust hierarchy dependable. Layer 3 is supposed to win every time. To win it has to be findable. To be findable, the lookup key has to match how the row was stored. And the only way to guarantee that match across every entry point is to canonicalize at every boundary.

Compatibility gap: legacy alias-keyed rows

The canonicalization rule fixes new writes going forward, but it does NOT fix rows that were already written under a registered alias before fb6298a landed. Those rows have a real, concrete gap that must be closed by a one-time migration before the engineering layer V1 ships.

The exact failure mode:

        time T0 (before fb6298a):
            POST /project/state {project: "p05", ...}
            -> set_state("p05", ...)        # no canonicalization
            -> ensure_project("p05")        # creates a "p05" row
            -> writes state with project_id pointing at the "p05" row

        time T1 (after fb6298a):
            POST /project/state {project: "p05", ...}     (or any read)
            -> set_state("p05", ...)
            -> resolve_project_name("p05") -> "p05-interferometer"
            -> ensure_project("p05-interferometer")        # creates a SECOND row
            -> writes new state under the canonical row
            -> the T0 state is still in the "p05" row, INVISIBLE to every
               canonicalized read

The unregistered-name fallback path saves you when the project was never in the registry: a row stored under "orphan-project" is read back via "orphan-project", both pass through resolve_project_name unchanged, and the strings line up. It does not save you when the name is a registered alias — the helper rewrites the read key but not the storage key, and the legacy row becomes invisible.
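The T0/T1 timeline can be reproduced against an in-memory SQLite database. This is a sketch under assumptions: the two-table schema is a guess at the minimum needed, the `value` column is hypothetical, and `set_state` / `get_state` are simplified stand-ins whose `canonicalize` flag exists only to simulate the pre-fb6298a write path.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed minimal schema; the real tables have more columns.
conn.executescript("""
    CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE project_state (
        project_id INTEGER REFERENCES projects(id),
        category TEXT, key TEXT, value TEXT,
        UNIQUE (project_id, category, key)
    );
""")

ALIASES = {"p05": "p05-interferometer"}  # stand-in for the registry


def resolve(name):
    return ALIASES.get(name.lower(), name)


def ensure_project(name):
    row = conn.execute(
        "SELECT id FROM projects WHERE lower(name) = lower(?)", (name,)
    ).fetchone()
    if row:
        return row[0]
    return conn.execute("INSERT INTO projects (name) VALUES (?)", (name,)).lastrowid


def set_state(name, category, key, value, canonicalize=True):
    if canonicalize:
        name = resolve(name)
    project_id = ensure_project(name)
    conn.execute(
        "INSERT OR REPLACE INTO project_state VALUES (?, ?, ?, ?)",
        (project_id, category, key, value),
    )


def get_state(name):
    name = resolve(name)  # the read key is always canonicalized
    return conn.execute(
        "SELECT ps.category, ps.key, ps.value FROM project_state ps "
        "JOIN projects p ON p.id = ps.project_id WHERE p.name = ? ORDER BY ps.key",
        (name,),
    ).fetchall()


# T0: legacy write stored under the alias row.
set_state("p05", "status", "phase", "legacy value", canonicalize=False)
# T1: post-fix write creates and uses the canonical row.
set_state("p05", "status", "owner", "new value")

visible = get_state("p05")
total = conn.execute("SELECT COUNT(*) FROM project_state").fetchone()[0]
print(visible)  # only the T1 row is reachable
print(total)    # 2: the T0 row still exists, just unreachable
```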

What is at risk on the live Dalidou DB:

  1. projects table: any rows whose name column matches a registered alias (one row for each alias that was actually used as a write key before the fix landed). These shadow the canonical project row and silently fragment the projects namespace.
  2. project_state table: any rows whose project_id points at one of those shadow project rows. This is the highest-risk case because it directly defeats the trust hierarchy: Layer 3 trusted state becomes invisible to every canonicalized lookup.
  3. memories table: any rows whose project column is a registered alias. Reinforcement and extraction queries will miss them.
  4. interactions table: any rows whose project column is a registered alias. Listing and downstream reflection will miss them.

How to find out the actual blast radius on the live Dalidou DB:

-- inspect the projects table for alias-shadow rows
SELECT id, name FROM projects;

-- count alias-keyed memories per known alias
SELECT project, COUNT(*) FROM memories
  WHERE project IN ('p04','p05','p06','gigabit','interferometer','polisher','ato core')
  GROUP BY project;

-- count alias-keyed interactions
SELECT project, COUNT(*) FROM interactions
  WHERE project IN ('p04','p05','p06','gigabit','interferometer','polisher','ato core')
  GROUP BY project;

-- count alias-shadowed project_state rows by project name
SELECT p.name, COUNT(*) FROM project_state ps
  JOIN projects p ON ps.project_id = p.id
  WHERE p.name IN ('p04','p05','p06','gigabit','interferometer','polisher','ato core')
  GROUP BY p.name;
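The same counts can be gathered from Python with the alias list passed as bound parameters instead of being hard-coded. This is only a sketch: `alias_blast_radius` is a hypothetical helper, and the demo schema is an assumed minimal subset of the real tables.

```python
import sqlite3


def alias_blast_radius(conn, aliases):
    """Count alias-keyed rows per table for the given alias names."""
    aliases = list(aliases)
    if not aliases:
        return {}
    marks = ",".join("?" for _ in aliases)
    report = {}
    # memories and interactions store the project as a plain string column.
    for table in ("memories", "interactions"):
        rows = conn.execute(
            f"SELECT project, COUNT(*) FROM {table} "
            f"WHERE project IN ({marks}) GROUP BY project",
            aliases,
        ).fetchall()
        report[table] = dict(rows)
    # project_state reaches the name through the projects foreign key.
    rows = conn.execute(
        "SELECT p.name, COUNT(*) FROM project_state ps "
        "JOIN projects p ON ps.project_id = p.id "
        f"WHERE p.name IN ({marks}) GROUP BY p.name",
        aliases,
    ).fetchall()
    report["project_state"] = dict(rows)
    return report


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE project_state (project_id INTEGER, category TEXT, key TEXT);
    CREATE TABLE memories (id INTEGER PRIMARY KEY, project TEXT);
    CREATE TABLE interactions (id INTEGER PRIMARY KEY, project TEXT);
    INSERT INTO projects (id, name) VALUES (1, 'p05'), (2, 'p05-interferometer');
    INSERT INTO project_state VALUES (1, 'status', 'phase'), (2, 'status', 'owner');
    INSERT INTO memories (project) VALUES ('p05'), ('p05'), ('p05-interferometer');
""")

report = alias_blast_radius(conn, ["p05", "interferometer"])
print(report)
```

Aliases with no matching rows simply do not appear in the per-table dictionaries, so an empty report means there is nothing to migrate for those names.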

The migration that closes the gap has to:

  1. For each registered project, find all projects rows whose name matches one of the project's aliases AND is not the canonical id itself. These are the "shadow" rows.
  2. For each shadow row, MERGE its dependent state into the canonical project's row:
    • rekey project_state.project_id from shadow → canonical
    • if the merge would create a (project_id, category, key) collision (a state row already exists under the canonical id with the same category+key), the migration must surface the conflict via the existing conflict model and pause until the human resolves it
    • delete the now-empty shadow projects row
  3. For memories and interactions, the fix is simpler because the alias appears as a string column (not a foreign key): UPDATE memories SET project = canonical WHERE project = alias, then same for interactions.
  4. The migration must run in dry-run mode first, printing the exact rows it would touch and the canonical destinations they would be merged into.
  5. The migration must be idempotent — running it twice produces the same final state as running it once.
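The steps above could be sketched roughly as follows. This is not the migration script (which does not exist yet) but an illustration under assumptions: `migrate_aliases`, the `registry` dict shape, and the demo schema are all hypothetical, and a collision is only recorded in the returned plan rather than routed through the real conflict model.

```python
import sqlite3


def migrate_aliases(conn, registry, dry_run=True):
    """Return the planned actions; apply them only when dry_run is False.

    registry maps canonical id -> list of aliases (assumed shape).
    """
    plan = []

    def one(sql, args):
        return conn.execute(sql, args).fetchone()

    for canonical, aliases in registry.items():
        for alias in aliases:
            if alias == canonical:
                continue
            # Step 3: string-keyed tables only need a rekey.
            for table in ("memories", "interactions"):
                n = one(f"SELECT COUNT(*) FROM {table} WHERE project = ?", (alias,))[0]
                if n:
                    plan.append(("rekey", table, alias, canonical, n))
                    if not dry_run:
                        conn.execute(
                            f"UPDATE {table} SET project = ? WHERE project = ?",
                            (canonical, alias),
                        )
            # Steps 1-2: shadow project rows and their dependent state.
            shadow = one("SELECT id FROM projects WHERE name = ?", (alias,))
            if shadow is None:
                continue  # nothing left to do, which keeps reruns idempotent
            canon = one("SELECT id FROM projects WHERE name = ?", (canonical,))
            if canon is None:
                # No canonical row yet: renaming the shadow row is enough.
                plan.append(("rename_project", alias, canonical))
                if not dry_run:
                    conn.execute(
                        "UPDATE projects SET name = ? WHERE id = ?",
                        (canonical, shadow[0]),
                    )
                continue
            collisions = conn.execute(
                "SELECT s.category, s.key FROM project_state s "
                "JOIN project_state c ON c.category = s.category AND c.key = s.key "
                "WHERE s.project_id = ? AND c.project_id = ?",
                (shadow[0], canon[0]),
            ).fetchall()
            if collisions:
                # Placeholder for the conflict model: leave the rows untouched.
                plan.append(("conflict", alias, canonical, collisions))
                continue
            n = one("SELECT COUNT(*) FROM project_state WHERE project_id = ?", shadow)[0]
            plan.append(("rekey", "project_state", alias, canonical, n))
            plan.append(("delete_shadow", alias))
            if not dry_run:
                conn.execute(
                    "UPDATE project_state SET project_id = ? WHERE project_id = ?",
                    (canon[0], shadow[0]),
                )
                conn.execute("DELETE FROM projects WHERE id = ?", shadow)
    if not dry_run:
        conn.commit()
    return plan


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE project_state (project_id INTEGER, category TEXT, key TEXT,
                                value TEXT, UNIQUE (project_id, category, key));
    CREATE TABLE memories (id INTEGER PRIMARY KEY, project TEXT);
    CREATE TABLE interactions (id INTEGER PRIMARY KEY, project TEXT);
    INSERT INTO projects (id, name) VALUES (1, 'p05'), (2, 'p05-interferometer');
    INSERT INTO project_state VALUES (1, 'status', 'phase', 'legacy');
    INSERT INTO project_state VALUES (2, 'status', 'owner', 'new');
    INSERT INTO memories (project) VALUES ('p05');
""")
registry = {"p05-interferometer": ["p05", "interferometer"]}

dry = migrate_aliases(conn, registry, dry_run=True)     # prints nothing, changes nothing
applied = migrate_aliases(conn, registry, dry_run=False)
again = migrate_aliases(conn, registry, dry_run=False)  # second run finds no work
print(dry)
print(again)
```

Returning the plan as data, rather than printing it inside the function, makes the dry-run output easy to review and to compare against what the real run later applies.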

This work is required before the engineering layer V1 ships because V1 will add new entities, relationships, conflicts, and mirror_regeneration_failures tables that all key on the canonical project id. Any leaked alias-keyed rows in the existing tables would show up in V1 reads as silently missing data, and the killer-correctness queries from engineering-query-catalog.md (orphan requirements, decisions on flagged assumptions, unsupported claims) would report wrong results against any project that has shadow rows.

The migration script does NOT exist yet. The open follow-ups section below tracks it as the next concrete step.

The rule for new entry points

When you add a new service-layer function that takes a project name, follow this checklist:

  1. Does the function read or write a row keyed by project? If yes, you must call resolve_project_name. If no (e.g. it only takes project as a label for logging), you may skip the canonicalization but you should add a comment explaining why.
  2. Where does the canonicalization go? As the first statement after input validation. Not later, not "before storage", not "in the helper that does the actual write". As the first statement, so any subsequent service call inside the function sees the canonical value.
  3. Add a regression test that uses an alias. Use the project_registry fixture from tests/conftest.py to set up a temp registry with at least one project + aliases, then verify the new function works when called with the alias and when called with the canonical id.
  4. If the function can be called with None or empty string, verify that path too. The helper handles it correctly but the function-under-test might not.

How the project_registry test fixture works

tests/conftest.py::project_registry returns a callable that takes one or more (project_id, [aliases]) tuples (or just a bare project_id string), writes them into a temp registry file, points ATOCORE_PROJECT_REGISTRY_PATH at it, and reloads config.settings. Use it like:

def test_my_new_thing_canonicalizes(project_registry):
    project_registry(("p05-interferometer", ["p05", "interferometer"]))

    # ... call your service function with "p05" ...
    # ... assert it works the same as if you'd passed "p05-interferometer" ...

The fixture is reused by all 12 alias-canonicalization regression tests added in fb6298a. Following the same pattern for new features is the cheapest way to keep the contract intact.

What this rule does NOT cover

  1. Alias creation / management. This document is about reading and writing project-keyed data. Adding new projects or new aliases is the registry's own write path (POST /projects/register, PUT /projects/{name}), which already enforces collision detection and atomic file writes.
  2. Registry hot-reloading. The helper calls load_project_registry() on every invocation, which reads the JSON file each time. There is no in-process cache. If the registry file changes, the next call sees the new contents. Performance is fine for the current registry size but if it becomes a bottleneck, add a versioned cache here, not at every call site.
  3. Cross-project deduplication. If two different projects in the registry happen to share an alias, the registry's collision detection blocks the second one at registration time, so this case can't arise in practice. The helper does not handle it defensively.
  4. Time-bounded canonicalization. A project's canonical id is stable. Aliases can be added or removed via PUT /projects/{name}, but the canonical id field never changes after registration. So a row written today under the canonical id will always remain findable under that id, even if the alias set evolves.
  5. Migration of legacy data. If the live Dalidou DB has rows that were written under aliases before the canonicalization landed (e.g. a memories row with project = "p05" from before fb6298a), those rows are NOT automatically reachable from the canonicalized read path. The unregistered-name fallback only helps for project names that were never registered at all; it does NOT help for names that are registered as aliases. See the "Compatibility gap" section above for the exact failure mode and the migration path that has to run before the engineering layer V1 ships.

What this enables for the engineering layer V1

When the engineering layer ships per engineering-v1-acceptance.md, it adds at least these new project-keyed surfaces:

  • entities table with a project_id column
  • relationships table that joins entities, indirectly project-keyed
  • conflicts table with a project column
  • mirror_regeneration_failures table with a project column
  • new endpoints: POST /entities/..., POST /ingest/kb-cad/export, POST /ingest/kb-fem/export, GET /mirror/{project}/..., GET /conflicts?project=...

Every one of those write/read paths needs to call resolve_project_name at its service-layer entry point, following the same pattern as the eight existing call sites listed above. The implementation sprint should:

  1. Apply the helper at each new service entry point as the first statement after input validation
  2. Add a regression test using the project_registry fixture that exercises an alias against each new entry point
  3. Treat any new service function that takes a project name without calling resolve_project_name as a code review failure

The pattern is simple enough to follow without thinking, which is exactly the property we want for a contract that has to hold across many independent additions.

Open follow-ups

These are the items the canonicalization story still has open. Item 1 blocks engineering V1; the others are not blockers, but they are rough edges to be aware of.

  1. Legacy alias data migration — REQUIRED before engineering V1 ships, NOT optional. If the live Dalidou DB has any rows written under aliases before fb6298a landed, they are silently invisible to the canonicalized read path (see the "Compatibility gap" section above for the exact failure mode). This is a real correctness issue, not a theoretical one: any trusted state, memory, or interaction stored under p05, gigabit, polisher, etc. before the fix landed is currently unreachable from any service-layer query. The migration script has to walk projects, project_state, memories, and interactions, merge shadow rows into their canonical counterparts (with conflict-model handling for any collisions), and run in dry-run mode first. Estimated cost: ~150 LOC for the migration script + ~50 LOC of tests + a one-time supervised run on the live Dalidou DB. This migration is the next concrete pre-V1 step.
  2. Registry file caching. load_project_registry() reads the JSON file on every resolve_project_name call. With ~5 projects this is fine; with 50+ it would warrant a versioned cache (cache key = file mtime + size). Defer until measured.
  3. Case sensitivity audit. The helper uses get_registered_project which lowercases for comparison. The stored canonical id keeps its original casing. No known bug today (the full suite passes), but worth re-confirming when the engineering layer adds entity-side storage.
  4. _rank_chunks's secondary substring boost. Mentioned earlier; still uses the raw hint. Replace it with the same helper-driven approach the retriever uses, OR delete it as redundant once we confirm the retriever's primary boost is sufficient.
  5. Documentation discoverability. This doc lives under docs/architecture/. The contract is also restated in the docstring of resolve_project_name and referenced from each call site's comment. That redundancy is intentional — the contract is too easy to forget to live in only one place.
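For follow-up 2, a versioned cache keyed on file mtime and size might look like the sketch below. `load_registry_cached` is a hypothetical name, and the JSON shape used in the demo is an assumption, not the real registry format.

```python
import json
import os
import tempfile
from pathlib import Path

# path -> ((mtime_ns, size), parsed registry)
_cache = {}


def load_registry_cached(path):
    """Re-read the JSON file only when its (mtime_ns, size) pair changes."""
    st = os.stat(path)
    version = (st.st_mtime_ns, st.st_size)
    hit = _cache.get(str(path))
    if hit is not None and hit[0] == version:
        return hit[1]
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    _cache[str(path)] = (version, data)
    return data


with tempfile.TemporaryDirectory() as tmp:
    registry_path = Path(tmp) / "registry.json"
    registry_path.write_text(json.dumps({"projects": ["p05-interferometer"]}))
    first = load_registry_cached(registry_path)
    second = load_registry_cached(registry_path)  # served from the cache

    # The rewrite changes the file size, so the cached version no longer matches.
    registry_path.write_text(
        json.dumps({"projects": ["p05-interferometer", "demo-project"]})
    )
    third = load_registry_cached(registry_path)

print(first is second)
print(len(third["projects"]))
```

One `os.stat` call per lookup remains, which is the price of still noticing edits to the registry file without an explicit invalidation hook.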

Quick reference card

Copy-pasteable for new service functions:

from atocore.projects.registry import resolve_project_name


def my_new_service_entry_point(
    project_name: str,
    other_args: ...,
) -> ...:
    # Validate inputs first
    if not project_name:
        raise ValueError("project_name is required")

    # Canonicalize through the registry as the first thing after
    # validation. Every subsequent operation in this function uses
    # the canonical id, so storage and queries are guaranteed
    # consistent across alias and canonical-id callers.
    project_name = resolve_project_name(project_name)

    # ... rest of the function ...

TL;DR

  • One helper, one rule: resolve_project_name at every service-layer entry point that takes a project name
  • Currently called in 8 places across builder, project_state, interactions, and memory; all 8 listed in this doc
  • Backwards-compat path returns unregistered names unchanged (e.g. "orphan-project"); this does NOT cover registered alias names that were used as storage keys before fb6298a
  • Real compatibility gap: any row whose project column is a registered alias from before the canonicalization landed is silently invisible to the new read path. A one-time migration is required before engineering V1 ships. See the "Compatibility gap" section.
  • The trust hierarchy depends on this helper being applied everywhere — Layer 3 trusted state has to be findable for it to win the trust battle
  • Use the project_registry test fixture to add regression tests for any new service function that takes a project name
  • The engineering layer V1 implementation must follow the same pattern at every new service entry point
  • Open follow-ups (in priority order): legacy alias data migration (required pre-V1), redundant substring boost cleanup, registry caching when projects scale