Commit Graph

2 Commits

Author SHA1 Message Date
261277fd51 fix(migration): preserve superseded/invalid shadow state during rekey
Codex caught a real data-loss bug in the legacy alias migration
shipped in 7e60f5a. plan_state_migration filtered state rows to
status='active' only, then apply_plan deleted the shadow projects
row at the end. Because project_state.project_id has
ON DELETE CASCADE, any superseded or invalid state rows still
attached to the shadow project got silently cascade-deleted —
exactly the audit loss a cleanup migration must not cause.

This commit fixes the bug and adds regression tests that lock in
the invariant "shadow state of every status is accounted for".

Root cause
----------
scripts/migrate_legacy_aliases.py::plan_state_migration was:

    "SELECT * FROM project_state WHERE project_id = ? AND status = 'active'"

which only found live rows. Any historical row (status in
'superseded' or 'invalid') was invisible to the plan, so the apply
step had nothing to rekey for it. Then the shadow project row was
deleted at the end, cascade-deleting every unplanned row.

The fix
-------
plan_state_migration now selects ALL state rows attached to the
shadow project regardless of status, and handles every row per a
per-status decision table:

| Shadow status | Canonical at same triple? | Values     | Action                         |
|---------------|---------------------------|------------|--------------------------------|
| any           | no                        | —          | clean rekey                    |
| any           | yes                       | same       | shadow superseded in place     |
| active        | yes, active               | different  | COLLISION, apply refuses       |
| active        | yes, inactive             | different  | shadow wins, canonical deleted |
| inactive      | yes, any                  | different  | historical drop (logged)       |

Four changes in the script:

1. SELECT drops the status filter so the plan walks every row.
2. New StateRekeyPlan.historical_drops list captures the shadow
   rows that lose to a canonical row at the same triple because the
   shadow is already inactive. These are the only unavoidable data
   losses, and they happen because the UNIQUE(project_id, category,
   key) constraint on project_state doesn't allow two rows per
   triple regardless of status.
3. New apply action 'replace_inactive_canonical' for the
   shadow-active-vs-canonical-inactive case. At apply time the
   canonical inactive row is DELETEd first (SQLite's default
   immediate constraint checking) and then the shadow is UPDATEd
   into its place in two separate statements. Adds a new
   state_rows_replaced_inactive_canonical counter.
4. New apply counter state_rows_historical_dropped for audit
   transparency. The rows themselves are still cascade-deleted
   when the shadow project row is dropped, but they're counted
   and reported.

Five places render_plan_text and plan_to_json_dict updated:

- counts() gains state_historical_drops
- render_plan_text prints a 'historical drops' section with each
  shadow-canonical pair and their statuses when there are any, so
  the operator sees the audit loss BEFORE running --apply
- The new section explicitly tells the operator: "if any of these
  values are worth keeping as separate audit records, manually copy
  them out before running --apply"
- plan_to_json_dict carries historical_drops into the JSON report
- The state counts table in the human report now shows both
  'state collisions (block)' and 'state historical drops' as
  separate lines so the operator can distinguish
  "apply will refuse" from "apply will drop historical rows"

Regression tests (3 new, all green)
-----------------------------------
tests/test_migrate_legacy_aliases.py:

- test_apply_preserves_superseded_shadow_state_when_no_collision:
  the direct regression for the codex finding. Seeds a shadow with
  a superseded state row on a triple the canonical doesn't have,
  runs the migration, verifies via raw SQL that the row is now
  attached to the canonical projects row and still has status
  'superseded'. This is the test that would have failed before
  the fix.
- test_apply_drops_shadow_inactive_row_when_canonical_holds_same_triple:
  covers the unavoidable data-loss case. Seeds shadow superseded
  + canonical active at the same triple with different values,
  verifies plan.counts() reports one historical_drop, runs apply,
  verifies the canonical value is preserved and the shadow value
  is gone.
- test_apply_replaces_inactive_canonical_with_active_shadow:
  covers the cross-contamination case where shadow has live value
  and canonical has a stale invalid row. Shadow wins by deleting
  canonical and rekeying in its place. Verifies the counter and
  the final state.

Plus _seed_state_row now accepts a status kwarg so the seeding
helper can create superseded/invalid rows directly.

test_dry_run_on_empty_registry_reports_empty_plan was updated to
include the new state_historical_drops key in the expected counts
dict (all zero for an empty plan, so the test shape is the same).

Full suite: 197 passing (was 194), 1 warning. The +3 is the three
new regression tests.

What this commit does NOT do
----------------------------
- Does NOT try to preserve historical shadow rows that collide
  with a canonical row at the same triple. That would require a
  schema change (adding (id) to the UNIQUE key, or a separate
  history table) and isn't in scope for a cleanup migration.
  The operator sees these as explicit 'historical drops' in the
  plan output and can copy them out manually if any are worth
  preserving.
- Does NOT change any behavior for rows that were already
  reachable from the canonicalized read path. The fix only
  affects legacy rows whose project_id points at a shadow row.
- Does NOT re-verify the earlier happy-path tests beyond the full
  suite confirming them still green.
2026-04-08 15:52:44 -04:00
7e60f5a0e6 feat(ops): legacy alias migration script with dry-run/apply modes
Closes the compatibility gap documented in
docs/architecture/project-identity-canonicalization.md. Before fb6298a,
writes to project_state, memories, and interactions stored the raw
project name. After fb6298a every service-layer entry point
canonicalizes through the registry, which silently made pre-fix
alias-keyed rows unreachable from the new read path. Now there's
a migration tool to find and fix them.

This commit is the tool and its tests. The tool is NOT run against
the live Dalidou DB in this commit — that's a separate supervised
manual step after reviewing the dry-run output.

scripts/migrate_legacy_aliases.py
---------------------------------
Standalone offline migration tool. Dry-run default, --apply explicit.

What it inspects:
- projects: rows whose name is a registered alias and differs from
  the canonical project_id (shadow rows)
- project_state: rows whose project_id points at a shadow; plan
  rekeys them to the canonical row's id. (category, key) collisions
  against the canonical block the apply step until a human resolves
- memories: rows whose project column is a registered alias. Plain
  string rekey. Dedup collisions (after rekey, same
  (memory_type, content, project, status)) are handled by the
  existing memory supersession model: newer row stays active, older
  becomes superseded with updated_at as tiebreaker
- interactions: rows whose project column is a registered alias.
  Plain string rekey, no collision handling

What it does NOT do:
- Never touches rows that are already canonical
- Never auto-resolves project_state collisions (refuses until the
  human picks a winner via POST /project/state)
- Never creates data; only rekeys or supersedes
- Never runs outside a single SQLite transaction; any failure rolls
  back the entire migration

Safety rails:
- Dry-run is default. --apply is explicit.
- Apply on empty plan refuses unless --allow-empty (prevents
  accidental runs that look meaningful but did nothing)
- Apply refuses on any project_state collision
- Apply refuses on integrity errors (e.g. two case-variant rows
  both matching the canonical lookup)
- Writes a JSON report to data/migrations/ on every run (dry-run
  and apply alike) for audit
- Idempotent: running twice produces the same final state as
  running once. The second run finds zero shadow rows and exits
  clean.

CLI flags:
  --registry PATH     override ATOCORE_PROJECT_REGISTRY_PATH
  --db PATH           override the AtoCore SQLite DB path
  --apply             actually mutate (default is dry-run)
  --allow-empty       permit --apply on an empty plan
  --report-dir PATH   where to write the JSON report
  --json              emit the plan as JSON instead of human prose

Smoke test against the Phase 9 validation DB produces the expected
"Nothing to migrate. The database is clean." output with 4 known
canonical projects and 0 shadows.

tests/test_migrate_legacy_aliases.py
------------------------------------
19 new tests, all green:

Plan-building:
- test_dry_run_on_empty_registry_reports_empty_plan
- test_dry_run_on_clean_registered_db_reports_empty_plan
- test_dry_run_finds_shadow_project
- test_dry_run_plans_state_rekey_without_collisions
- test_dry_run_detects_state_collision
- test_dry_run_plans_memory_rekey_and_supersession
- test_dry_run_plans_interaction_rekey

Apply:
- test_apply_refuses_on_state_collision
- test_apply_migrates_clean_shadow_end_to_end (verifies get_state
  can see the state via BOTH the alias AND the canonical after
  migration)
- test_apply_drops_shadow_state_duplicate_without_collision
  (same (category, key, value) on both sides - mark shadow
  superseded, don't hit the UNIQUE constraint)
- test_apply_migrates_memories
- test_apply_migrates_interactions
- test_apply_is_idempotent
- test_apply_refuses_with_integrity_errors (uses case-variant
  canonical rows to work around projects.name UNIQUE constraint;
  verifies the case-insensitive duplicate detection works)

Reporting:
- test_plan_to_json_dict_is_serializable
- test_write_report_creates_file
- test_render_plan_text_on_empty_plan
- test_render_plan_text_on_collision

End-to-end gap closure (the most important test):
- test_legacy_alias_gap_is_closed_after_migration
  - Seeds the exact same scenario as
    test_legacy_alias_keyed_state_is_invisible_until_migrated
    in test_project_state.py (which documents the pre-migration
    gap)
  - Confirms the row is invisible before migration
  - Runs the migration
  - Verifies the row is reachable via BOTH the canonical id AND
    the alias afterward
  - This test and the pre-migration gap test together lock in
    "before migration: invisible, after migration: reachable"
    as the documented invariant

Full suite: 194 passing (was 175), 1 warning. The +19 is the new
migration test file.

Next concrete step after this commit
------------------------------------
- Run the dry-run against the live Dalidou DB to find out the
  actual blast radius. The script is the inspection SQL, codified.
- Review the dry-run output together
- If clean (zero shadows), no apply needed; close the doc gap as
  "verified nothing to migrate on this deployment"
- If there are shadows, resolve any collisions via
  POST /project/state, then run --apply under supervision
- After apply, the test_legacy_alias_keyed_state_is_invisible_until_migrated
  test still passes (it simulates the gap directly, so it's
  independent of the live DB state) and the gap-closed companion
  test continues to guard forward
2026-04-08 15:08:16 -04:00