diff --git a/docs/architecture/project-identity-canonicalization.md b/docs/architecture/project-identity-canonicalization.md new file mode 100644 index 0000000..579a900 --- /dev/null +++ b/docs/architecture/project-identity-canonicalization.md @@ -0,0 +1,334 @@ +# Project Identity Canonicalization + +## Why this document exists + +AtoCore identifies projects by name in many places: trusted state +rows, memories, captured interactions, query/context API parameters, +extractor candidates, future engineering entities. Without an +explicit rule, every callsite would have to remember to canonicalize +project names through the registry — and the recent codex review +caught exactly the bug class that follows when one of them forgets. + +The fix landed in `fb6298a` and works correctly today. This document +exists to make the rule **explicit and discoverable** so the +engineering layer V1 implementation, future entity write paths, and +any new agent integration don't reintroduce the same fragmentation +when nobody is looking. + +## The contract + +> **Every read/write that takes a project name MUST canonicalize it +> through `resolve_project_name()` before the value crosses a service +> boundary.** + +The boundary is wherever a project name becomes a database row, a +query filter, an attribute on a stored object, or a key for any +lookup. The canonicalization happens **once**, at that boundary, +before the underlying storage primitive is called. + +Symbolically: + +``` +HTTP layer (raw user input) + ↓ + service entry point + ↓ + project_name = resolve_project_name(project_name) ← ONLY canonical from this point + ↓ + storage / queries / further service calls +``` + +The rule is intentionally simple. There's no per-call exception, +no "trust me, the caller already canonicalized it" shortcut, no +opt-out flag. Every service-layer entry point applies the helper +the moment it receives a project name from outside the service. + +## The helper + +```python +# src/atocore/projects/registry.py + +def resolve_project_name(name: str | None) -> str: + """Canonicalize a project name through the registry. + + Returns the canonical project_id if the input matches any + registered project's id or alias. Returns the input unchanged + when it's empty or not in the registry — the second case keeps + backwards compatibility with hand-curated state, memories, and + interactions that predate the registry, or for projects that + are intentionally not registered. + """ + if not name: + return name or "" + project = get_registered_project(name) + if project is not None: + return project.project_id + return name +``` + +Three behaviors worth keeping in mind: + +1. **Empty / None input → empty string output.** Callers don't have + to pre-check; passing `""` or `None` to a query filter still + works as "no project scope". +2. **Registered alias → canonical project_id.** The helper does the + case-insensitive lookup and returns the project's `id` field + (e.g. `"p05" → "p05-interferometer"`). +3. **Unregistered name → input unchanged.** This is the + backwards-compatibility path. Hand-curated state, memories, or + interactions created under a name that isn't in the registry + keep working. The retrieval is then "best effort" — the raw + string is used as the SQL key, which still finds the row that + was stored under the same raw string. This path exists so the + engineering layer V1 doesn't have to also be a data migration. + +## Where the helper is currently called + +As of `fb6298a`, the helper is invoked at exactly these eight +service-layer entry points: + +| Module | Function | What gets canonicalized | +|---|---|---| +| `src/atocore/context/builder.py` | `build_context` | the `project_hint` parameter, before the trusted state lookup | +| `src/atocore/context/project_state.py` | `set_state` | `project_name`, before `ensure_project()` | +| `src/atocore/context/project_state.py` | `get_state` | `project_name`, before the SQL lookup | +| `src/atocore/context/project_state.py` | `invalidate_state` | `project_name`, before the SQL lookup | +| `src/atocore/interactions/service.py` | `record_interaction` | `project`, before insert | +| `src/atocore/interactions/service.py` | `list_interactions` | `project` filter parameter, before WHERE clause | +| `src/atocore/memory/service.py` | `create_memory` | `project`, before insert | +| `src/atocore/memory/service.py` | `get_memories` | `project` filter parameter, before WHERE clause | + +Every one of those is the **first** thing the function does after +input validation. There is no path through any of those eight +functions where a project name reaches storage without passing +through `resolve_project_name`. + +## Where the helper is NOT called (and why that's correct) + +These places intentionally do not canonicalize: + +1. **`update_memory`'s project field.** The API does not allow + changing a memory's project after creation, so there's no + project to canonicalize. The function only updates `content`, + `confidence`, and `status`. +2. **The retriever's `_project_match_boost` substring matcher.** It + already calls `get_registered_project` internally to expand the + hint into the candidate set (canonical id + all aliases + last + path segments). It accepts the raw hint by design. +3. **`_rank_chunks`'s secondary substring boost in + `builder.py`.** Still uses the raw hint. This is a multiplicative + factor on top of correct retrieval, not a filter, so it cannot + drop relevant chunks. Tracked as a future cleanup but not + critical. +4. **Direct SQL queries for the projects table itself** (e.g. + `ensure_project`'s lookup). These are intentional case-insensitive + raw lookups against the column the canonical id is stored in. + `set_state` already canonicalized before reaching `ensure_project`, + so the value passed is the canonical id by definition. +5. **Hand-authored project names that aren't in the registry.** + The helper returns those unchanged. This is the backwards-compat + path mentioned above; it is *not* a violation of the rule, it's + the rule applied to a name with no registry record. + +## Why this is the trust hierarchy in action + +The whole point of AtoCore is the trust hierarchy from the operating +model: + +1. Trusted Project State (Layer 3) is the most authoritative layer +2. Memories (active) are second +3. Source chunks (raw retrieved content) are last + +If a caller passes the alias `p05` and Layer 3 was written under +`p05-interferometer`, and the lookup fails to find the canonical +row, **the trust hierarchy collapses**. The most-authoritative +layer is silently invisible to the caller. The system would still +return *something* — namely, lower-trust retrieved chunks — and the +human would never know they got a degraded answer. + +The canonicalization helper is what makes the trust hierarchy +**dependable**. Layer 3 is supposed to win every time. To win it +has to be findable. To be findable, the lookup key has to match +how the row was stored. And the only way to guarantee that match +across every entry point is to canonicalize at every boundary. + +## The rule for new entry points + +When you add a new service-layer function that takes a project name, +follow this checklist: + +1. **Does the function read or write a row keyed by project?** If + yes, you must call `resolve_project_name`. If no (e.g. it only + takes `project` as a label for logging), you may skip the + canonicalization but you should add a comment explaining why. +2. **Where does the canonicalization go?** As the first statement + after input validation. Not later, not "before storage", not + "in the helper that does the actual write". As the first + statement, so any subsequent service call inside the function + sees the canonical value. +3. **Add a regression test that uses an alias.** Use the + `project_registry` fixture from `tests/conftest.py` to set up + a temp registry with at least one project + aliases, then + verify the new function works when called with the alias and + when called with the canonical id. +4. **If the function can be called with `None` or empty string, + verify that path too.** The helper handles it correctly but + the function-under-test might not. + +## How the `project_registry` test fixture works + +`tests/conftest.py::project_registry` returns a callable that +takes one or more `(project_id, [aliases])` tuples (or just a bare +`project_id` string), writes them into a temp registry file, +points `ATOCORE_PROJECT_REGISTRY_PATH` at it, and reloads +`config.settings`. Use it like: + +```python +def test_my_new_thing_canonicalizes(project_registry): + project_registry(("p05-interferometer", ["p05", "interferometer"])) + + # ... call your service function with "p05" ... + # ... assert it works the same as if you'd passed "p05-interferometer" ... +``` + +The fixture is reused by all 12 alias-canonicalization regression +tests added in `fb6298a`. Following the same pattern for new +features is the cheapest way to keep the contract intact. + +## What this rule does NOT cover + +1. **Alias creation / management.** This document is about reading + and writing project-keyed data. Adding new projects or new + aliases is the registry's own write path + (`POST /projects/register`, `PUT /projects/{name}`), which + already enforces collision detection and atomic file writes. +2. **Registry hot-reloading.** The helper calls + `load_project_registry()` on every invocation, which reads the + JSON file each time. There is no in-process cache. If the + registry file changes, the next call sees the new contents. + Performance is fine for the current registry size but if it + becomes a bottleneck, add a versioned cache here, not at every + call site. +3. **Cross-project deduplication.** If two different projects in + the registry happen to share an alias, the registry's collision + detection blocks the second one at registration time, so this + case can't arise in practice. The helper does not handle it + defensively. +4. **Time-bounded canonicalization.** A project's canonical id is + stable. Aliases can be added or removed via + `PUT /projects/{name}`, but the canonical `id` field never + changes after registration. So a row written today under the + canonical id will always remain findable under that id, even + if the alias set evolves. +5. **Migration of legacy data.** If the live Dalidou DB has rows + that were written under aliases before the canonicalization + landed, those rows still work via the unregistered-name + fallback path. They are not automatically migrated to canonical + form. A future migration script could walk the DB and + re-key any rows whose `project` field matches a known alias to + the canonical id; tracked as an open follow-up below. + +## What this enables for the engineering layer V1 + +When the engineering layer ships per `engineering-v1-acceptance.md`, +it adds at least these new project-keyed surfaces: + +- `entities` table with a `project_id` column +- `relationships` table that joins entities, indirectly project-keyed +- `conflicts` table with a `project` column +- `mirror_regeneration_failures` table with a `project` column +- new endpoints: `POST /entities/...`, `POST /ingest/kb-cad/export`, + `POST /ingest/kb-fem/export`, `GET /mirror/{project}/...`, + `GET /conflicts?project=...` + +**Every one of those write/read paths needs to call +`resolve_project_name` at its service-layer entry point**, following +the same pattern as the eight existing call sites listed above. The +implementation sprint should: + +1. Apply the helper at each new service entry point as the first + statement after input validation +2. Add a regression test using the `project_registry` fixture that + exercises an alias against each new entry point +3. Treat any new service function that takes a project name without + calling `resolve_project_name` as a code review failure + +The pattern is simple enough to follow without thinking, which is +exactly the property we want for a contract that has to hold +across many independent additions. + +## Open follow-ups + +These are things the canonicalization story still has open. None +are blockers, but they're the rough edges to be aware of. + +1. **Legacy alias data migration.** If the live Dalidou DB has any + rows written under aliases before `fb6298a` landed, they + still work via the unregistered-name fallback path. A small + migration script could walk `memories`, `interactions`, + `project_state`, and `projects`, find any names that match a + registry alias, and re-key them to the canonical id. Worth + doing once before the engineering layer V1 lands. Estimated + cost: ~30 LOC + a dry-run mode + a one-time run. +2. **Registry file caching.** `load_project_registry()` reads the + JSON file on every `resolve_project_name` call. With ~5 + projects this is fine; with 50+ it would warrant a versioned + cache (cache key = file mtime + size). Defer until measured. +3. **Case sensitivity audit.** The helper uses + `get_registered_project` which lowercases for comparison. The + stored canonical id keeps its original casing. No bug today + because every test passes, but worth re-confirming when the + engineering layer adds entity-side storage. +4. **`_rank_chunks`'s secondary substring boost.** Mentioned + earlier; still uses the raw hint. Replace it with the same + helper-driven approach the retriever uses, OR delete it as + redundant once we confirm the retriever's primary boost is + sufficient. +5. **Documentation discoverability.** This doc lives under + `docs/architecture/`. The contract is also restated in the + docstring of `resolve_project_name` and referenced from each + call site's comment. That redundancy is intentional — the + contract is too easy to forget to live in only one place. + +## Quick reference card + +Copy-pasteable for new service functions: + +```python +from atocore.projects.registry import resolve_project_name + + +def my_new_service_entry_point( + project_name: str, + other_args: ..., +) -> ...: + # Validate inputs first + if not project_name: + raise ValueError("project_name is required") + + # Canonicalize through the registry as the first thing after + # validation. Every subsequent operation in this function uses + # the canonical id, so storage and queries are guaranteed + # consistent across alias and canonical-id callers. + project_name = resolve_project_name(project_name) + + # ... rest of the function ... +``` + +## TL;DR + +- One helper, one rule: `resolve_project_name` at every service-layer + entry point that takes a project name +- Currently called in 8 places across builder, project_state, + interactions, and memory; all 8 listed in this doc +- Backwards-compat path returns unregistered names unchanged so + legacy data still works without a migration +- The trust hierarchy depends on this helper being applied + everywhere — Layer 3 trusted state has to be findable for it to + win the trust battle +- Use the `project_registry` test fixture to add regression tests + for any new service function that takes a project name +- The engineering layer V1 implementation must follow the same + pattern at every new service entry point +- Open follow-ups: legacy data migration, registry caching, + redundant substring boost cleanup diff --git a/docs/master-plan-status.md b/docs/master-plan-status.md index 74a924d..c51e84e 100644 --- a/docs/master-plan-status.md +++ b/docs/master-plan-status.md @@ -68,9 +68,16 @@ active project set. the 5-layer model (from the previous planning wave) - [engineering-ontology-v1.md](architecture/engineering-ontology-v1.md) — the initial V1 object and relationship inventory (previous wave) +- [project-identity-canonicalization.md](architecture/project-identity-canonicalization.md) — + the helper-at-every-service-boundary contract that keeps the + trust hierarchy dependable across alias and canonical-id callers; + required reading before adding new project-keyed entity surfaces + in the V1 implementation sprint The next concrete next step is the V1 implementation sprint, which -should follow engineering-v1-acceptance.md as its checklist. +should follow engineering-v1-acceptance.md as its checklist, and +must apply the project-identity-canonicalization contract at every +new service-layer entry point. ### LLM Client Integration