feat: Phase 7C — tag canonicalization (autonomous, weekly)

LLM proposes alias→canonical mappings for domain_tags; confidence >= 0.8
auto-apply, below goes to human triage. Protects project identifiers
(p04, p05, p06, atocore, apm, etc.) from ever being canonicalized
since they're their own namespace, not concepts.

Problem solved: tag drift fragments retrieval. "fw" vs "firmware" vs
"firmware-control" all mean the same thing, but cross-cutting queries
that filter by tag only hit one variant. Weekly canonicalization pass
keeps the tag graph clean.

- Schema: tag_aliases table (pending | approved | rejected)
- atocore.memory._tag_canon_prompt (stdlib-only, protected project tokens)
- service: get_tag_distribution, apply_tag_alias (atomic per-memory,
  dedupes if both alias + canonical present), create / approve / reject
  proposal lifecycle, per-memory audit rows with action="tag_canonicalized"
- scripts/canonicalize_tags.py: host-side detector, autonomous by default,
  --no-auto-approve kill switch
- 6 API endpoints under /admin/tags/* (distribution, list, propose,
  apply, approve/{id}, reject/{id})
- Step B4 in batch-extract.sh (Sundays only — weekly cadence)
- 26 new tests (prompt parser, normalizer protections, distribution
  counting, rewrite atomicity, dedup, audit, lifecycle). 414 → 440.

Design: aggressive protection of project tokens because a false
canonicalization (p04 → p04-gigabit, or vice versa) would scramble
cross-project filtering. Err toward preservation; the alias only
applies if the model is very confident AND both strings appear in
the current distribution.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-04-19 09:41:02 -04:00
parent e840ef4be3
commit 877b97ec78
7 changed files with 1085 additions and 0 deletions

View File

@@ -197,6 +197,19 @@ python3 "$APP_DIR/scripts/memory_dedup.py" \
log "WARN: memory dedup failed (non-blocking)"
}
# Step B4: Tag canonicalization (Phase 7C, weekly Sundays)
# Autonomous: LLM proposes alias→canonical maps, auto-applies confidence >= 0.8.
# Projects tokens are protected (skipped on both sides). Borderline proposals
# land in /admin/tags/aliases for human review.
if [[ "$(date -u +%u)" == "7" ]]; then
log "Step B4: tag canonicalization (Sunday)"
python3 "$APP_DIR/scripts/canonicalize_tags.py" \
--base-url "$ATOCORE_URL" \
2>&1 || {
log "WARN: tag canonicalization failed (non-blocking)"
}
fi
# Step G: Integrity check (Phase 4 V1)
log "Step G: integrity check"
python3 "$APP_DIR/scripts/integrity_check.py" \