fix(migration): preserve superseded/invalid shadow state during rekey

Codex caught a real data-loss bug in the legacy alias migration
shipped in 7e60f5a. plan_state_migration filtered state rows to
status='active' only, then apply_plan deleted the shadow projects
row at the end. Because project_state.project_id has
ON DELETE CASCADE, any superseded or invalid state rows still
attached to the shadow project got silently cascade-deleted —
exactly the audit loss a cleanup migration must not cause.

This commit fixes the bug and adds regression tests that lock in
the invariant "shadow state of every status is accounted for".

Root cause
----------
scripts/migrate_legacy_aliases.py::plan_state_migration was:

    "SELECT * FROM project_state WHERE project_id = ? AND status = 'active'"

which only found live rows. Any historical row (status in
'superseded' or 'invalid') was invisible to the plan, so the apply
step had nothing to rekey for it. Then the shadow project row was
deleted at the end, cascade-deleting every unplanned row.

The fix
-------
plan_state_migration now selects ALL state rows attached to the
shadow project regardless of status, and handles every row per a
per-status decision table:

| Shadow status | Canonical at same triple? | Values     | Action                         |
|---------------|---------------------------|------------|--------------------------------|
| any           | no                        | —          | clean rekey                    |
| any           | yes                       | same       | shadow superseded in place     |
| active        | yes, active               | different  | COLLISION, apply refuses       |
| active        | yes, inactive             | different  | shadow wins, canonical deleted |
| inactive      | yes, any                  | different  | historical drop (logged)       |

Four changes in the script:

1. SELECT drops the status filter so the plan walks every row.
2. New StateRekeyPlan.historical_drops list captures the shadow
   rows that lose to a canonical row at the same triple because the
   shadow is already inactive. These are the only unavoidable data
   losses, and they happen because the UNIQUE(project_id, category,
   key) constraint on project_state doesn't allow two rows per
   triple regardless of status.
3. New apply action 'replace_inactive_canonical' for the
   shadow-active-vs-canonical-inactive case. At apply time the
   canonical inactive row is DELETEd first (SQLite's default
   immediate constraint checking) and then the shadow is UPDATEd
   into its place in two separate statements. Adds a new
   state_rows_replaced_inactive_canonical counter.
4. New apply counter state_rows_historical_dropped for audit
   transparency. The rows themselves are still cascade-deleted
   when the shadow project row is dropped, but they're counted
   and reported.

Five places render_plan_text and plan_to_json_dict updated:

- counts() gains state_historical_drops
- render_plan_text prints a 'historical drops' section with each
  shadow-canonical pair and their statuses when there are any, so
  the operator sees the audit loss BEFORE running --apply
- The new section explicitly tells the operator: "if any of these
  values are worth keeping as separate audit records, manually copy
  them out before running --apply"
- plan_to_json_dict carries historical_drops into the JSON report
- The state counts table in the human report now shows both
  'state collisions (block)' and 'state historical drops' as
  separate lines so the operator can distinguish
  "apply will refuse" from "apply will drop historical rows"

Regression tests (3 new, all green)
-----------------------------------
tests/test_migrate_legacy_aliases.py:

- test_apply_preserves_superseded_shadow_state_when_no_collision:
  the direct regression for the codex finding. Seeds a shadow with
  a superseded state row on a triple the canonical doesn't have,
  runs the migration, verifies via raw SQL that the row is now
  attached to the canonical projects row and still has status
  'superseded'. This is the test that would have failed before
  the fix.
- test_apply_drops_shadow_inactive_row_when_canonical_holds_same_triple:
  covers the unavoidable data-loss case. Seeds shadow superseded
  + canonical active at the same triple with different values,
  verifies plan.counts() reports one historical_drop, runs apply,
  verifies the canonical value is preserved and the shadow value
  is gone.
- test_apply_replaces_inactive_canonical_with_active_shadow:
  covers the cross-contamination case where shadow has live value
  and canonical has a stale invalid row. Shadow wins by deleting
  canonical and rekeying in its place. Verifies the counter and
  the final state.

Plus _seed_state_row now accepts a status kwarg so the seeding
helper can create superseded/invalid rows directly.

test_dry_run_on_empty_registry_reports_empty_plan was updated to
include the new state_historical_drops key in the expected counts
dict (all zero for an empty plan, so the test shape is the same).

Full suite: 197 passing (was 194), 1 warning. The +3 is the three
new regression tests.

What this commit does NOT do
----------------------------
- Does NOT try to preserve historical shadow rows that collide
  with a canonical row at the same triple. That would require a
  schema change (adding (id) to the UNIQUE key, or a separate
  history table) and isn't in scope for a cleanup migration.
  The operator sees these as explicit 'historical drops' in the
  plan output and can copy them out manually if any are worth
  preserving.
- Does NOT change any behavior for rows that were already
  reachable from the canonicalized read path. The fix only
  affects legacy rows whose project_id points at a shadow row.
- Does NOT re-verify the earlier happy-path tests beyond the full
  suite confirming them still green.
This commit is contained in:
2026-04-08 15:52:44 -04:00
parent 7e60f5a0e6
commit 261277fd51
2 changed files with 355 additions and 36 deletions

View File

@@ -73,13 +73,14 @@ def _seed_state_row(
category: str,
key: str,
value: str,
status: str = "active",
) -> str:
row_id = str(uuid.uuid4())
conn.execute(
"INSERT INTO project_state "
"(id, project_id, category, key, value, source, confidence) "
"VALUES (?, ?, ?, ?, ?, ?, ?)",
(row_id, project_id, category, key, value, "legacy-test", 1.0),
"(id, project_id, category, key, value, source, confidence, status) "
"VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
(row_id, project_id, category, key, value, "legacy-test", 1.0, status),
)
conn.commit()
return row_id
@@ -147,6 +148,7 @@ def test_dry_run_on_empty_registry_reports_empty_plan(tmp_data_dir):
"shadow_projects": 0,
"state_rekey_rows": 0,
"state_collisions": 0,
"state_historical_drops": 0,
"memory_rekey_rows": 0,
"memory_supersede_rows": 0,
"interaction_rekey_rows": 0,
@@ -368,6 +370,191 @@ def test_apply_drops_shadow_state_duplicate_without_collision(project_registry):
assert via_canonical[0].value == "Wave 1 ingestion"
def test_apply_preserves_superseded_shadow_state_when_no_collision(project_registry):
"""Regression test for the codex-flagged data-loss bug.
Before the fix, plan_state_migration only selected status='active'
rows. Any superseded or invalid row on the shadow project was
invisible to the plan and got silently cascade-deleted when the
shadow projects row was dropped at the end of apply. That's
exactly the kind of audit loss a cleanup migration must not cause.
This test seeds a shadow project with a superseded state row on
a triple the canonical project doesn't have, runs the migration,
and verifies the row survived and is now attached to the
canonical project (still with status='superseded').
"""
registry_path = project_registry(
("p05-interferometer", ["p05", "interferometer"])
)
conn = _open_db_connection()
try:
shadow_id = _seed_shadow_project(conn, "p05")
# Superseded row on a triple the canonical won't have
_seed_state_row(
conn,
shadow_id,
"status",
"historical_phase",
"Phase 0 legacy",
status="superseded",
)
plan = mig.build_plan(conn, registry_path)
assert not plan.has_collisions
summary = mig.apply_plan(conn, plan)
finally:
conn.close()
# The superseded row should have been rekeyed, not dropped
assert summary["state_rows_rekeyed"] == 1
assert summary["state_rows_historical_dropped"] == 0
# Verify via raw SQL that the row is now attached to the canonical
# projects row and still has status='superseded'
conn = _open_db_connection()
try:
row = conn.execute(
"SELECT ps.status, ps.value, p.name "
"FROM project_state ps JOIN projects p ON ps.project_id = p.id "
"WHERE ps.category = ? AND ps.key = ?",
("status", "historical_phase"),
).fetchone()
finally:
conn.close()
assert row is not None, "superseded shadow row was lost during migration"
assert row["status"] == "superseded"
assert row["value"] == "Phase 0 legacy"
assert row["name"] == "p05-interferometer"
def test_apply_drops_shadow_inactive_row_when_canonical_holds_same_triple(project_registry):
"""Shadow is inactive (superseded) and collides with an active canonical row.
The canonical wins by definition of the UPSERT schema. The shadow
row is recorded as a historical_drop in the plan so the operator
sees the audit loss, and the apply cascade-deletes it via the
shadow projects row. This is the unavoidable data-loss case
documented in the migration module docstring.
"""
registry_path = project_registry(
("p05-interferometer", ["p05", "interferometer"])
)
conn = _open_db_connection()
try:
shadow_id = _seed_shadow_project(conn, "p05")
canonical_id = _seed_shadow_project(conn, "p05-interferometer")
# Shadow has a superseded value on a triple where the canonical
# has a different active value. Can't preserve both: UNIQUE
# allows only one row per triple.
_seed_state_row(
conn,
shadow_id,
"status",
"next_focus",
"Old wave 1",
status="superseded",
)
_seed_state_row(
conn,
canonical_id,
"status",
"next_focus",
"Wave 2 trusted-operational",
status="active",
)
plan = mig.build_plan(conn, registry_path)
assert not plan.has_collisions # not an active-vs-active collision
assert plan.counts()["state_historical_drops"] == 1
summary = mig.apply_plan(conn, plan)
finally:
conn.close()
assert summary["state_rows_historical_dropped"] == 1
# The canonical's active row survives unchanged
via_canonical = get_state("p05-interferometer")
active_next_focus = [
e
for e in via_canonical
if e.category == "status" and e.key == "next_focus"
]
assert len(active_next_focus) == 1
assert active_next_focus[0].value == "Wave 2 trusted-operational"
def test_apply_replaces_inactive_canonical_with_active_shadow(project_registry):
"""Shadow is active, canonical has an inactive row at the same triple.
The shadow wins: canonical inactive row is deleted, shadow is
rekeyed into canonical's project_id. This covers the
cross-contamination case where the old alias path was used for
the live value while the canonical path had a stale row.
"""
registry_path = project_registry(
("p06-polisher", ["p06", "polisher"])
)
conn = _open_db_connection()
try:
shadow_id = _seed_shadow_project(conn, "p06")
canonical_id = _seed_shadow_project(conn, "p06-polisher")
# Canonical has a stale invalid row; shadow has the live value.
_seed_state_row(
conn,
canonical_id,
"decision",
"frame",
"Old frame (no longer current)",
status="invalid",
)
_seed_state_row(
conn,
shadow_id,
"decision",
"frame",
"kinematic mount frame",
status="active",
)
plan = mig.build_plan(conn, registry_path)
assert not plan.has_collisions
assert plan.counts()["state_historical_drops"] == 0
summary = mig.apply_plan(conn, plan)
finally:
conn.close()
assert summary["state_rows_replaced_inactive_canonical"] == 1
# The active shadow value now lives on the canonical row
via_canonical = get_state("p06-polisher")
frame_entries = [
e for e in via_canonical if e.category == "decision" and e.key == "frame"
]
assert len(frame_entries) == 1
assert frame_entries[0].value == "kinematic mount frame"
# Confirm via raw SQL that the previously-inactive canonical row
# no longer exists
conn = _open_db_connection()
try:
stale = conn.execute(
"SELECT COUNT(*) AS c FROM project_state WHERE value = ?",
("Old frame (no longer current)",),
).fetchone()
finally:
conn.close()
assert stale["c"] == 0
def test_apply_migrates_memories(project_registry):
registry_path = project_registry(
("p04-gigabit", ["p04", "gigabit"])