feat: Phase 7C tag canonicalization (autonomous, weekly)

An LLM proposes alias→canonical mappings for domain_tags. Proposals with
confidence >= 0.8 auto-apply; lower-confidence ones go to human triage.
Project identifiers (p04, p05, p06, atocore, apm, etc.) are never
canonicalized, since they form their own namespace rather than concepts.

Problem solved: tag drift fragments retrieval. "fw", "firmware", and
"firmware-control" all mean the same thing, but a cross-cutting query
that filters by tag hits only one variant. A weekly canonicalization
pass keeps the tag graph clean.
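
To make the fragmentation concrete, here is a minimal sketch. It assumes memories are plain dicts with a `domain_tags` list, which is a hypothetical shape and not the real atocore model; `filter_by_tag` and `tag_distribution` are illustrative helpers, not functions from this commit.

```python
from collections import Counter

# Hypothetical in-memory stand-in for stored memories (assumed shape).
MEMORIES = [
    {"id": "m1", "domain_tags": ["fw", "p04"]},
    {"id": "m2", "domain_tags": ["firmware"]},
    {"id": "m3", "domain_tags": ["firmware-control", "p05"]},
]


def filter_by_tag(memories, tag):
    """Return the ids of memories that carry exactly this tag string."""
    return [m["id"] for m in memories if tag in m["domain_tags"]]


def tag_distribution(memories):
    """Count how many memories carry each tag (one count per memory)."""
    counts = Counter()
    for m in memories:
        counts.update(set(m["domain_tags"]))
    return counts


# Three memories are about firmware, but an exact-match filter sees one.
hits = filter_by_tag(MEMORIES, "firmware")
dist = tag_distribution(MEMORIES)
print(hits)
print({t: dist[t] for t in ("fw", "firmware", "firmware-control")})
```

Each variant shows up once in the distribution, which is the kind of signal a detector could use to propose an alias.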

- Schema: tag_aliases table (status: pending | approved | rejected)
- atocore.memory._tag_canon_prompt (stdlib-only, protected project tokens)
- Service: get_tag_distribution; apply_tag_alias (atomic per memory,
  dedupes if both alias and canonical are present); create / approve /
  reject proposal lifecycle; per-memory audit rows with
  action="tag_canonicalized"
- scripts/canonicalize_tags.py: host-side detector, autonomous by default,
  with a --no-auto-approve kill switch
- 6 API endpoints under /admin/tags/* (distribution, list, propose,
  apply, approve/{id}, reject/{id})
- Step B4 in batch-extract.sh (Sundays only, for a weekly cadence)
- 26 new tests (prompt parser, normalizer protections, distribution
  counting, rewrite atomicity, dedup, audit, lifecycle). Test count:
  414 → 440.
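
The rewrite-with-dedup behaviour listed for apply_tag_alias could look roughly like the sketch below. This is not the service code: `rewrite_tags` and `apply_alias_to_memory` are hypothetical names, the dict-based memory and the audit fields other than `action` are assumptions, and the real implementation presumably does this inside a database transaction.

```python
def rewrite_tags(tags, alias, canonical):
    """Replace alias with canonical, keeping order and dropping duplicates."""
    out = []
    seen = set()
    for tag in tags:
        new = canonical if tag == alias else tag
        if new not in seen:  # dedupe when alias and canonical both occur
            seen.add(new)
            out.append(new)
    return out


def apply_alias_to_memory(memory, alias, canonical):
    """Rewrite one memory's tags; return an audit row, or None if untouched."""
    before = list(memory["domain_tags"])
    if alias not in before:
        return None
    memory["domain_tags"] = rewrite_tags(before, alias, canonical)
    return {
        "memory_id": memory["id"],       # assumed field name
        "action": "tag_canonicalized",   # action value from the commit message
        "before": before,                # assumed field name
        "after": memory["domain_tags"],  # assumed field name
    }


memory = {"id": "m1", "domain_tags": ["fw", "p04", "firmware"]}
audit = apply_alias_to_memory(memory, "fw", "firmware")
print(memory["domain_tags"])
print(audit)
```

Keeping the first occurrence preserves tag order, so a memory that already had both strings ends up with a single canonical tag.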

Design: project tokens are protected aggressively because a false
canonicalization (p04 → p04-gigabit, or vice versa) would scramble
cross-project filtering. Err toward preservation: an alias is applied
only if the model is very confident AND both strings appear in the
current distribution.
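
The gating rule described above can be sketched as a small decision function. The 0.8 threshold and the listed project tokens come from this message; `decide`, the exact-match protection check, and the choice to skip (rather than queue) a proposal whose strings are missing from the distribution are assumptions made for illustration.

```python
# Tokens named in the commit message; the real protected list may be longer.
PROTECTED_TOKENS = {"p04", "p05", "p06", "atocore", "apm"}
AUTO_APPLY_THRESHOLD = 0.8


def decide(alias, canonical, confidence, distribution):
    """Return 'skip', 'pending', or 'auto_apply' for one proposal."""
    alias, canonical = alias.strip().lower(), canonical.strip().lower()
    if not alias or not canonical or alias == canonical:
        return "skip"
    # A protected token on either side blocks the mapping, which covers
    # both p04 -> p04-gigabit and p04-gigabit -> p04.
    if alias in PROTECTED_TOKENS or canonical in PROTECTED_TOKENS:
        return "skip"
    # Both strings must exist in the current tag distribution (assumption:
    # proposals that fail this check are dropped, not queued).
    if alias not in distribution or canonical not in distribution:
        return "skip"
    if confidence >= AUTO_APPLY_THRESHOLD:
        return "auto_apply"
    return "pending"


dist = {"fw": 3, "firmware": 10, "p04": 7, "p04-gigabit": 2}
print(decide("fw", "firmware", 0.92, dist))
print(decide("fw", "firmware", 0.60, dist))
print(decide("p04-gigabit", "p04", 0.99, dist))
```

Checking protection before confidence means a very confident but wrong project mapping can never auto-apply.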
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -287,6 +287,33 @@ def _apply_migrations(conn: sqlite3.Connection) -> None:
         "CREATE INDEX IF NOT EXISTS idx_mmc_created_at ON memory_merge_candidates(created_at)"
     )
 
+    # Phase 7C (Memory Consolidation - tag canonicalization): alias → canonical
+    # map for domain_tags. A weekly LLM pass proposes rows here; high-confidence
+    # ones auto-apply (rewrite domain_tags across all memories), low-confidence
+    # ones stay pending for human approval. Immutable history: resolved rows
+    # keep status=approved/rejected; the same alias can re-appear with a new
+    # id if the tag reaches a different canonical later.
+    conn.execute(
+        """
+        CREATE TABLE IF NOT EXISTS tag_aliases (
+            id TEXT PRIMARY KEY,
+            alias TEXT NOT NULL,
+            canonical TEXT NOT NULL,
+            status TEXT DEFAULT 'pending',
+            confidence REAL DEFAULT 0.0,
+            alias_count INTEGER DEFAULT 0,
+            canonical_count INTEGER DEFAULT 0,
+            reason TEXT DEFAULT '',
+            applied_to_memories INTEGER DEFAULT 0,
+            created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
+            resolved_at DATETIME,
+            resolved_by TEXT
+        )
+        """
+    )
+    conn.execute("CREATE INDEX IF NOT EXISTS idx_tag_aliases_status ON tag_aliases(status)")
+    conn.execute("CREATE INDEX IF NOT EXISTS idx_tag_aliases_alias ON tag_aliases(alias)")
+

 def _column_exists(conn: sqlite3.Connection, table: str, column: str) -> bool:
     rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
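
The "immutable history" note in the migration comment can be exercised against an in-memory SQLite database. The sketch below reuses the table definition from the hunk; `create_proposal` and `resolve` are hypothetical helpers, and using `uuid4` strings for `id` is an assumption (the schema only says `id TEXT`).

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS tag_aliases (
        id TEXT PRIMARY KEY,
        alias TEXT NOT NULL,
        canonical TEXT NOT NULL,
        status TEXT DEFAULT 'pending',
        confidence REAL DEFAULT 0.0,
        alias_count INTEGER DEFAULT 0,
        canonical_count INTEGER DEFAULT 0,
        reason TEXT DEFAULT '',
        applied_to_memories INTEGER DEFAULT 0,
        created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
        resolved_at DATETIME,
        resolved_by TEXT
    )
    """
)


def create_proposal(conn, alias, canonical, confidence):
    """Insert a new pending proposal and return its id (assumed: uuid4)."""
    proposal_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO tag_aliases (id, alias, canonical, confidence) VALUES (?, ?, ?, ?)",
        (proposal_id, alias, canonical, confidence),
    )
    return proposal_id


def resolve(conn, proposal_id, status, actor):
    """Mark a pending proposal approved or rejected; resolved rows stay put."""
    conn.execute(
        "UPDATE tag_aliases SET status = ?, resolved_at = CURRENT_TIMESTAMP, "
        "resolved_by = ? WHERE id = ? AND status = 'pending'",
        (status, actor, proposal_id),
    )


first = create_proposal(conn, "fw", "firmware", 0.6)
resolve(conn, first, "rejected", "human")
# The same alias re-appears later under a new id with a different canonical.
second = create_proposal(conn, "fw", "firmware-control", 0.9)

rows = conn.execute(
    "SELECT alias, canonical, status FROM tag_aliases WHERE alias = ? ORDER BY rowid",
    ("fw",),
).fetchall()
print(rows)
```

Because `alias` carries no UNIQUE constraint, the rejected row and the newer pending row coexist, which matches the history-preserving intent stated in the comment.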