335 lines
15 KiB
Markdown
335 lines
15 KiB
Markdown
|
|
# Project Identity Canonicalization
|
||
|
|
|
||
|
|
## Why this document exists
|
||
|
|
|
||
|
|
AtoCore identifies projects by name in many places: trusted state
|
||
|
|
rows, memories, captured interactions, query/context API parameters,
|
||
|
|
extractor candidates, future engineering entities. Without an
|
||
|
|
explicit rule, every callsite would have to remember to canonicalize
|
||
|
|
project names through the registry — and the recent codex review
|
||
|
|
caught exactly the bug class that follows when one of them forgets.
|
||
|
|
|
||
|
|
The fix landed in `fb6298a` and works correctly today. This document
|
||
|
|
exists to make the rule **explicit and discoverable** so the
|
||
|
|
engineering layer V1 implementation, future entity write paths, and
|
||
|
|
any new agent integration don't reintroduce the same fragmentation
|
||
|
|
when nobody is looking.
|
||
|
|
|
||
|
|
## The contract
|
||
|
|
|
||
|
|
> **Every read/write that takes a project name MUST canonicalize it
|
||
|
|
> through `resolve_project_name()` before the value crosses a service
|
||
|
|
> boundary.**
|
||
|
|
|
||
|
|
The boundary is wherever a project name becomes a database row, a
|
||
|
|
query filter, an attribute on a stored object, or a key for any
|
||
|
|
lookup. The canonicalization happens **once**, at that boundary,
|
||
|
|
before the underlying storage primitive is called.
|
||
|
|
|
||
|
|
Symbolically:
|
||
|
|
|
||
|
|
```
|
||
|
|
HTTP layer (raw user input)
|
||
|
|
↓
|
||
|
|
service entry point
|
||
|
|
↓
|
||
|
|
project_name = resolve_project_name(project_name) ← ONLY canonical from this point
|
||
|
|
↓
|
||
|
|
storage / queries / further service calls
|
||
|
|
```
|
||
|
|
|
||
|
|
The rule is intentionally simple. There's no per-call exception,
|
||
|
|
no "trust me, the caller already canonicalized it" shortcut, no
|
||
|
|
opt-out flag. Every service-layer entry point applies the helper
|
||
|
|
the moment it receives a project name from outside the service.
|
||
|
|
|
||
|
|
## The helper
|
||
|
|
|
||
|
|
```python
|
||
|
|
# src/atocore/projects/registry.py
|
||
|
|
|
||
|
|
def resolve_project_name(name: str | None) -> str:
|
||
|
|
"""Canonicalize a project name through the registry.
|
||
|
|
|
||
|
|
Returns the canonical project_id if the input matches any
|
||
|
|
registered project's id or alias. Returns the input unchanged
|
||
|
|
when it's empty or not in the registry — the second case keeps
|
||
|
|
backwards compatibility with hand-curated state, memories, and
|
||
|
|
interactions that predate the registry, or for projects that
|
||
|
|
are intentionally not registered.
|
||
|
|
"""
|
||
|
|
if not name:
|
||
|
|
return name or ""
|
||
|
|
project = get_registered_project(name)
|
||
|
|
if project is not None:
|
||
|
|
return project.project_id
|
||
|
|
return name
|
||
|
|
```
|
||
|
|
|
||
|
|
Three behaviors worth keeping in mind:
|
||
|
|
|
||
|
|
1. **Empty / None input → empty string output.** Callers don't have
|
||
|
|
to pre-check; passing `""` or `None` to a query filter still
|
||
|
|
works as "no project scope".
|
||
|
|
2. **Registered alias → canonical project_id.** The helper does the
|
||
|
|
case-insensitive lookup and returns the project's `id` field
|
||
|
|
(e.g. `"p05" → "p05-interferometer"`).
|
||
|
|
3. **Unregistered name → input unchanged.** This is the
|
||
|
|
backwards-compatibility path. Hand-curated state, memories, or
|
||
|
|
interactions created under a name that isn't in the registry
|
||
|
|
keep working. The retrieval is then "best effort" — the raw
|
||
|
|
string is used as the SQL key, which still finds the row that
|
||
|
|
was stored under the same raw string. This path exists so the
|
||
|
|
engineering layer V1 doesn't have to also be a data migration.
|
||
|
|
|
||
|
|
## Where the helper is currently called
|
||
|
|
|
||
|
|
As of `fb6298a`, the helper is invoked at exactly these eight
|
||
|
|
service-layer entry points:
|
||
|
|
|
||
|
|
| Module | Function | What gets canonicalized |
|
||
|
|
|---|---|---|
|
||
|
|
| `src/atocore/context/builder.py` | `build_context` | the `project_hint` parameter, before the trusted state lookup |
|
||
|
|
| `src/atocore/context/project_state.py` | `set_state` | `project_name`, before `ensure_project()` |
|
||
|
|
| `src/atocore/context/project_state.py` | `get_state` | `project_name`, before the SQL lookup |
|
||
|
|
| `src/atocore/context/project_state.py` | `invalidate_state` | `project_name`, before the SQL lookup |
|
||
|
|
| `src/atocore/interactions/service.py` | `record_interaction` | `project`, before insert |
|
||
|
|
| `src/atocore/interactions/service.py` | `list_interactions` | `project` filter parameter, before WHERE clause |
|
||
|
|
| `src/atocore/memory/service.py` | `create_memory` | `project`, before insert |
|
||
|
|
| `src/atocore/memory/service.py` | `get_memories` | `project` filter parameter, before WHERE clause |
|
||
|
|
|
||
|
|
Every one of those is the **first** thing the function does after
|
||
|
|
input validation. There is no path through any of those eight
|
||
|
|
functions where a project name reaches storage without passing
|
||
|
|
through `resolve_project_name`.
|
||
|
|
|
||
|
|
## Where the helper is NOT called (and why that's correct)
|
||
|
|
|
||
|
|
These places intentionally do not canonicalize:
|
||
|
|
|
||
|
|
1. **`update_memory`'s project field.** The API does not allow
|
||
|
|
changing a memory's project after creation, so there's no
|
||
|
|
project to canonicalize. The function only updates `content`,
|
||
|
|
`confidence`, and `status`.
|
||
|
|
2. **The retriever's `_project_match_boost` substring matcher.** It
|
||
|
|
already calls `get_registered_project` internally to expand the
|
||
|
|
hint into the candidate set (canonical id + all aliases + last
|
||
|
|
path segments). It accepts the raw hint by design.
|
||
|
|
3. **`_rank_chunks`'s secondary substring boost in
|
||
|
|
`builder.py`.** Still uses the raw hint. This is a multiplicative
|
||
|
|
factor on top of correct retrieval, not a filter, so it cannot
|
||
|
|
drop relevant chunks. Tracked as a future cleanup but not
|
||
|
|
critical.
|
||
|
|
4. **Direct SQL queries for the projects table itself** (e.g.
|
||
|
|
`ensure_project`'s lookup). These are intentional case-insensitive
|
||
|
|
raw lookups against the column the canonical id is stored in.
|
||
|
|
`set_state` already canonicalized before reaching `ensure_project`,
|
||
|
|
so the value passed is the canonical id by definition.
|
||
|
|
5. **Hand-authored project names that aren't in the registry.**
|
||
|
|
The helper returns those unchanged. This is the backwards-compat
|
||
|
|
path mentioned above; it is *not* a violation of the rule, it's
|
||
|
|
the rule applied to a name with no registry record.
|
||
|
|
|
||
|
|
## Why this is the trust hierarchy in action
|
||
|
|
|
||
|
|
The whole point of AtoCore is the trust hierarchy from the operating
|
||
|
|
model:
|
||
|
|
|
||
|
|
1. Trusted Project State (Layer 3) is the most authoritative layer
|
||
|
|
2. Memories (active) are second
|
||
|
|
3. Source chunks (raw retrieved content) are last
|
||
|
|
|
||
|
|
If a caller passes the alias `p05` and Layer 3 was written under
|
||
|
|
`p05-interferometer`, and the lookup fails to find the canonical
|
||
|
|
row, **the trust hierarchy collapses**. The most-authoritative
|
||
|
|
layer is silently invisible to the caller. The system would still
|
||
|
|
return *something* — namely, lower-trust retrieved chunks — and the
|
||
|
|
human would never know they got a degraded answer.
|
||
|
|
|
||
|
|
The canonicalization helper is what makes the trust hierarchy
|
||
|
|
**dependable**. Layer 3 is supposed to win every time. To win it
|
||
|
|
has to be findable. To be findable, the lookup key has to match
|
||
|
|
how the row was stored. And the only way to guarantee that match
|
||
|
|
across every entry point is to canonicalize at every boundary.
|
||
|
|
|
||
|
|
## The rule for new entry points
|
||
|
|
|
||
|
|
When you add a new service-layer function that takes a project name,
|
||
|
|
follow this checklist:
|
||
|
|
|
||
|
|
1. **Does the function read or write a row keyed by project?** If
|
||
|
|
yes, you must call `resolve_project_name`. If no (e.g. it only
|
||
|
|
takes `project` as a label for logging), you may skip the
|
||
|
|
canonicalization but you should add a comment explaining why.
|
||
|
|
2. **Where does the canonicalization go?** As the first statement
|
||
|
|
after input validation. Not later, not "before storage", not
|
||
|
|
"in the helper that does the actual write". As the first
|
||
|
|
statement, so any subsequent service call inside the function
|
||
|
|
sees the canonical value.
|
||
|
|
3. **Add a regression test that uses an alias.** Use the
|
||
|
|
`project_registry` fixture from `tests/conftest.py` to set up
|
||
|
|
a temp registry with at least one project + aliases, then
|
||
|
|
verify the new function works when called with the alias and
|
||
|
|
when called with the canonical id.
|
||
|
|
4. **If the function can be called with `None` or empty string,
|
||
|
|
verify that path too.** The helper handles it correctly but
|
||
|
|
the function-under-test might not.
|
||
|
|
|
||
|
|
## How the `project_registry` test fixture works
|
||
|
|
|
||
|
|
`tests/conftest.py::project_registry` returns a callable that
|
||
|
|
takes one or more `(project_id, [aliases])` tuples (or just a bare
|
||
|
|
`project_id` string), writes them into a temp registry file,
|
||
|
|
points `ATOCORE_PROJECT_REGISTRY_PATH` at it, and reloads
|
||
|
|
`config.settings`. Use it like:
|
||
|
|
|
||
|
|
```python
|
||
|
|
def test_my_new_thing_canonicalizes(project_registry):
|
||
|
|
project_registry(("p05-interferometer", ["p05", "interferometer"]))
|
||
|
|
|
||
|
|
# ... call your service function with "p05" ...
|
||
|
|
# ... assert it works the same as if you'd passed "p05-interferometer" ...
|
||
|
|
```
|
||
|
|
|
||
|
|
The fixture is reused by all 12 alias-canonicalization regression
|
||
|
|
tests added in `fb6298a`. Following the same pattern for new
|
||
|
|
features is the cheapest way to keep the contract intact.
|
||
|
|
|
||
|
|
## What this rule does NOT cover
|
||
|
|
|
||
|
|
1. **Alias creation / management.** This document is about reading
|
||
|
|
and writing project-keyed data. Adding new projects or new
|
||
|
|
aliases is the registry's own write path
|
||
|
|
(`POST /projects/register`, `PUT /projects/{name}`), which
|
||
|
|
already enforces collision detection and atomic file writes.
|
||
|
|
2. **Registry hot-reloading.** The helper calls
|
||
|
|
`load_project_registry()` on every invocation, which reads the
|
||
|
|
JSON file each time. There is no in-process cache. If the
|
||
|
|
registry file changes, the next call sees the new contents.
|
||
|
|
Performance is fine for the current registry size but if it
|
||
|
|
becomes a bottleneck, add a versioned cache here, not at every
|
||
|
|
call site.
|
||
|
|
3. **Cross-project deduplication.** If two different projects in
|
||
|
|
the registry happen to share an alias, the registry's collision
|
||
|
|
detection blocks the second one at registration time, so this
|
||
|
|
case can't arise in practice. The helper does not handle it
|
||
|
|
defensively.
|
||
|
|
4. **Time-bounded canonicalization.** A project's canonical id is
|
||
|
|
stable. Aliases can be added or removed via
|
||
|
|
`PUT /projects/{name}`, but the canonical `id` field never
|
||
|
|
changes after registration. So a row written today under the
|
||
|
|
canonical id will always remain findable under that id, even
|
||
|
|
if the alias set evolves.
|
||
|
|
5. **Migration of legacy data.** If the live Dalidou DB has rows
|
||
|
|
that were written under aliases before the canonicalization
|
||
|
|
landed, those rows still work via the unregistered-name
|
||
|
|
fallback path. They are not automatically migrated to canonical
|
||
|
|
form. A future migration script could walk the DB and
|
||
|
|
re-key any rows whose `project` field matches a known alias to
|
||
|
|
the canonical id; tracked as an open follow-up below.
|
||
|
|
|
||
|
|
## What this enables for the engineering layer V1
|
||
|
|
|
||
|
|
When the engineering layer ships per `engineering-v1-acceptance.md`,
|
||
|
|
it adds at least these new project-keyed surfaces:
|
||
|
|
|
||
|
|
- `entities` table with a `project_id` column
|
||
|
|
- `relationships` table that joins entities, indirectly project-keyed
|
||
|
|
- `conflicts` table with a `project` column
|
||
|
|
- `mirror_regeneration_failures` table with a `project` column
|
||
|
|
- new endpoints: `POST /entities/...`, `POST /ingest/kb-cad/export`,
|
||
|
|
`POST /ingest/kb-fem/export`, `GET /mirror/{project}/...`,
|
||
|
|
`GET /conflicts?project=...`
|
||
|
|
|
||
|
|
**Every one of those write/read paths needs to call
|
||
|
|
`resolve_project_name` at its service-layer entry point**, following
|
||
|
|
the same pattern as the eight existing call sites listed above. The
|
||
|
|
implementation sprint should:
|
||
|
|
|
||
|
|
1. Apply the helper at each new service entry point as the first
|
||
|
|
statement after input validation
|
||
|
|
2. Add a regression test using the `project_registry` fixture that
|
||
|
|
exercises an alias against each new entry point
|
||
|
|
3. Treat any new service function that takes a project name without
|
||
|
|
calling `resolve_project_name` as a code review failure
|
||
|
|
|
||
|
|
The pattern is simple enough to follow without thinking, which is
|
||
|
|
exactly the property we want for a contract that has to hold
|
||
|
|
across many independent additions.
|
||
|
|
|
||
|
|
## Open follow-ups
|
||
|
|
|
||
|
|
These are things the canonicalization story still has open. None
|
||
|
|
are blockers, but they're the rough edges to be aware of.
|
||
|
|
|
||
|
|
1. **Legacy alias data migration.** If the live Dalidou DB has any
|
||
|
|
rows written under aliases before `fb6298a` landed, they
|
||
|
|
still work via the unregistered-name fallback path. A small
|
||
|
|
migration script could walk `memories`, `interactions`,
|
||
|
|
`project_state`, and `projects`, find any names that match a
|
||
|
|
registry alias, and re-key them to the canonical id. Worth
|
||
|
|
doing once before the engineering layer V1 lands. Estimated
|
||
|
|
cost: ~30 LOC + a dry-run mode + a one-time run.
|
||
|
|
2. **Registry file caching.** `load_project_registry()` reads the
|
||
|
|
JSON file on every `resolve_project_name` call. With ~5
|
||
|
|
projects this is fine; with 50+ it would warrant a versioned
|
||
|
|
cache (cache key = file mtime + size). Defer until measured.
|
||
|
|
3. **Case sensitivity audit.** The helper uses
|
||
|
|
`get_registered_project` which lowercases for comparison. The
|
||
|
|
stored canonical id keeps its original casing. No bug today
|
||
|
|
because every test passes, but worth re-confirming when the
|
||
|
|
engineering layer adds entity-side storage.
|
||
|
|
4. **`_rank_chunks`'s secondary substring boost.** Mentioned
|
||
|
|
earlier; still uses the raw hint. Replace it with the same
|
||
|
|
helper-driven approach the retriever uses, OR delete it as
|
||
|
|
redundant once we confirm the retriever's primary boost is
|
||
|
|
sufficient.
|
||
|
|
5. **Documentation discoverability.** This doc lives under
|
||
|
|
`docs/architecture/`. The contract is also restated in the
|
||
|
|
docstring of `resolve_project_name` and referenced from each
|
||
|
|
call site's comment. That redundancy is intentional — the
|
||
|
|
contract is too easy to forget to live in only one place.
|
||
|
|
|
||
|
|
## Quick reference card
|
||
|
|
|
||
|
|
Copy-pasteable for new service functions:
|
||
|
|
|
||
|
|
```python
|
||
|
|
from atocore.projects.registry import resolve_project_name
|
||
|
|
|
||
|
|
|
||
|
|
def my_new_service_entry_point(
|
||
|
|
project_name: str,
|
||
|
|
other_args: ...,
|
||
|
|
) -> ...:
|
||
|
|
# Validate inputs first
|
||
|
|
if not project_name:
|
||
|
|
raise ValueError("project_name is required")
|
||
|
|
|
||
|
|
# Canonicalize through the registry as the first thing after
|
||
|
|
# validation. Every subsequent operation in this function uses
|
||
|
|
# the canonical id, so storage and queries are guaranteed
|
||
|
|
# consistent across alias and canonical-id callers.
|
||
|
|
project_name = resolve_project_name(project_name)
|
||
|
|
|
||
|
|
# ... rest of the function ...
|
||
|
|
```
|
||
|
|
|
||
|
|
## TL;DR
|
||
|
|
|
||
|
|
- One helper, one rule: `resolve_project_name` at every service-layer
|
||
|
|
entry point that takes a project name
|
||
|
|
- Currently called in 8 places across builder, project_state,
|
||
|
|
interactions, and memory; all 8 listed in this doc
|
||
|
|
- Backwards-compat path returns unregistered names unchanged so
|
||
|
|
legacy data still works without a migration
|
||
|
|
- The trust hierarchy depends on this helper being applied
|
||
|
|
everywhere — Layer 3 trusted state has to be findable for it to
|
||
|
|
win the trust battle
|
||
|
|
- Use the `project_registry` test fixture to add regression tests
|
||
|
|
for any new service function that takes a project name
|
||
|
|
- The engineering layer V1 implementation must follow the same
|
||
|
|
pattern at every new service entry point
|
||
|
|
- Open follow-ups: legacy data migration, registry caching,
|
||
|
|
redundant substring boost cleanup
|