feat: Phase 4 V1 — Robustness Hardening

Adds the observability + safety layer that turns AtoCore from
"works until something silently breaks" into "every mutation is
traceable, drift is detected, failures raise alerts."

1. Audit log (memory_audit table):
   - New table with id, memory_id, action, actor, before/after JSON,
     note, timestamp; 3 indexes for memory_id/timestamp/action
   - _audit_memory() helper called from every mutation:
     create_memory, update_memory, promote_memory,
     reject_candidate_memory, invalidate_memory, supersede_memory,
     reinforce_memory, auto_promote_reinforced, expire_stale_candidates
   - Action verb auto-selected: promoted/rejected/invalidated/
     superseded/updated based on state transition
   - "actor" threaded through: api-http, human-triage,
     phase10-auto-promote, candidate-expiry, reinforcement, etc.
   - Fail-open: audit write failure logs but never breaks the mutation
   - GET /memory/{id}/audit: full history for one memory
   - GET /admin/audit/recent: last 50 mutations across the system

2. Alerts framework (src/atocore/observability/alerts.py):
   - emit_alert(severity, title, message, context) fans out to:
     - structlog logger (always)
     - ~/atocore-logs/alerts.log append (configurable via
       ATOCORE_ALERT_LOG)
     - project_state atocore/alert/last_{severity} (dashboard surface)
     - ATOCORE_ALERT_WEBHOOK POST if set (auto-detects Discord webhook
       format for nice embeds; generic JSON otherwise)
   - Every sink fail-open; one failure doesn't prevent the others
   - Pipeline alert step in nightly cron: harness < 85% → warning;
     candidate queue > 200 → warning

3. Integrity checks (scripts/integrity_check.py):
   - Nightly scan for drift:
     - Memories → missing source_chunk_id references
     - Duplicate active memories (same type+content+project)
     - project_state → missing projects
     - Orphaned source_chunks (no parent document)
   - Results persisted to atocore/status/integrity_check_result
   - Any finding emits a warning alert
   - Added as Step G in deploy/dalidou/batch-extract.sh nightly cron

4. Dashboard surfaces it all:
   - integrity (findings + details)
   - alerts (last info/warning/critical per severity)
   - recent_audit (last 10 mutations with actor + action + preview)

Tests: 308 → 317 (9 new):
  - test_audit_create_logs_entry
  - test_audit_promote_logs_entry
  - test_audit_reject_logs_entry
  - test_audit_update_captures_before_after
  - test_audit_reinforce_logs_entry
  - test_recent_audit_returns_cross_memory_entries
  - test_emit_alert_writes_log_file
  - test_emit_alert_invalid_severity_falls_back_to_info
  - test_emit_alert_fails_open_on_log_write_error

Deferred: formal migration framework with rollback (current additive
pattern is fine for V1); memory detail wiki page with audit view
(quick follow-up).

To enable Discord alerts: set ATOCORE_ALERT_WEBHOOK to a Discord
webhook URL in Dalidou's environment. Default = log-only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 21:54:10 -04:00
parent bfa7dba4de
commit 88f2f7c4e1
8 changed files with 777 additions and 37 deletions
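The alerts module itself is not part of the diff shown here, so the fail-open fan-out described in the commit message can only be sketched under assumptions. In the sketch below, `emit_alert_sketch`, the `sinks` mapping, `make_file_sink`, and `broken_sink` are all hypothetical names chosen for illustration; the real `emit_alert` may be structured differently. It shows two behaviours the message and test names describe: an invalid severity falls back to `info`, and one failing sink does not stop the others.

```python
import json
import logging
import tempfile
from pathlib import Path

log = logging.getLogger("alerts_sketch")

VALID_SEVERITIES = ("info", "warning", "critical")


def emit_alert_sketch(severity, title, message, context=None, sinks=None):
    """Fan one alert out to every sink (hypothetical stand-in for emit_alert).

    Returns a dict of sink name -> delivered flag so callers can inspect
    which sinks failed. A sink that raises is logged and skipped.
    """
    if severity not in VALID_SEVERITIES:
        severity = "info"  # invalid severity falls back to info
    payload = {
        "severity": severity,
        "title": title,
        "message": message,
        "context": context or {},
    }
    delivered = {}
    for name, sink in (sinks or {}).items():
        try:
            sink(payload)
            delivered[name] = True
        except Exception as exc:  # fail-open: never propagate sink errors
            log.warning("alert sink %s failed: %s", name, exc)
            delivered[name] = False
    return delivered


def make_file_sink(path):
    """Build a sink that appends one JSON line per alert to ``path``."""
    def _sink(payload):
        with open(path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(payload) + "\n")
    return _sink


def broken_sink(payload):
    """Simulate a sink that always fails (e.g. an unreachable webhook)."""
    raise OSError("simulated sink failure")


# Demo: the broken sink runs first, yet the file sink still receives the alert.
log_path = Path(tempfile.mkdtemp()) / "alerts.log"
result = emit_alert_sketch(
    "bogus-severity",
    "Candidate queue high",
    "queue size exceeded threshold",
    context={"queue_size": 250},
    sinks={"broken": broken_sink, "file": make_file_sink(log_path)},
)
written = [json.loads(line) for line in log_path.read_text(encoding="utf-8").splitlines()]
```

After the demo runs, `result` records that the `broken` sink failed and the `file` sink succeeded, and `written` holds a single entry whose severity was normalised to `info`.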
--- a/src/atocore/memory/service.py
+++ b/src/atocore/memory/service.py
@@ -67,6 +67,106 @@ class Memory:
    valid_until: str = ""  # ISO UTC; empty = permanent


+def _audit_memory(
+    memory_id: str,
+    action: str,
+    actor: str = "api",
+    before: dict | None = None,
+    after: dict | None = None,
+    note: str = "",
+) -> None:
+    """Append an entry to memory_audit.
+
+    Phase 4 Robustness V1. Every memory mutation flows through this
+    helper so we can answer "how did this memory get to its current
+    state?" and "when did we learn X?".
+
+    ``action`` is a short verb: created, updated, promoted, rejected,
+    superseded, invalidated, reinforced, auto_promoted, expired.
+    ``actor`` identifies the caller: api (default), auto-triage,
+    human-triage, host-cron, reinforcement, phase10-auto-promote,
+    etc. ``before`` / ``after`` are field snapshots (JSON-serialized).
+    Fail-open: a logging failure never breaks the mutation itself.
+    """
+    import json as _json
+    try:
+        with get_connection() as conn:
+            conn.execute(
+                "INSERT INTO memory_audit (id, memory_id, action, actor, "
+                "before_json, after_json, note) VALUES (?, ?, ?, ?, ?, ?, ?)",
+                (
+                    str(uuid.uuid4()),
+                    memory_id,
+                    action,
+                    actor or "api",
+                    _json.dumps(before or {}),
+                    _json.dumps(after or {}),
+                    (note or "")[:500],
+                ),
+            )
+    except Exception as e:
+        log.warning("memory_audit_failed", memory_id=memory_id, action=action, error=str(e))
+
+
+def get_memory_audit(memory_id: str, limit: int = 100) -> list[dict]:
+    """Fetch audit entries for a memory, newest first."""
+    import json as _json
+    with get_connection() as conn:
+        rows = conn.execute(
+            "SELECT id, memory_id, action, actor, before_json, after_json, note, timestamp "
+            "FROM memory_audit WHERE memory_id = ? ORDER BY timestamp DESC LIMIT ?",
+            (memory_id, limit),
+        ).fetchall()
+    out = []
+    for r in rows:
+        try:
+            before = _json.loads(r["before_json"] or "{}")
+        except Exception:
+            before = {}
+        try:
+            after = _json.loads(r["after_json"] or "{}")
+        except Exception:
+            after = {}
+        out.append({
+            "id": r["id"],
+            "memory_id": r["memory_id"],
+            "action": r["action"],
+            "actor": r["actor"] or "api",
+            "before": before,
+            "after": after,
+            "note": r["note"] or "",
+            "timestamp": r["timestamp"],
+        })
+    return out
+
+
+def get_recent_audit(limit: int = 50) -> list[dict]:
+    """Fetch recent memory_audit entries across all memories, newest first."""
+    import json as _json
+    with get_connection() as conn:
+        rows = conn.execute(
+            "SELECT id, memory_id, action, actor, before_json, after_json, note, timestamp "
+            "FROM memory_audit ORDER BY timestamp DESC LIMIT ?",
+            (limit,),
+        ).fetchall()
+    out = []
+    for r in rows:
+        try:
+            after = _json.loads(r["after_json"] or "{}")
+        except Exception:
+            after = {}
+        out.append({
+            "id": r["id"],
+            "memory_id": r["memory_id"],
+            "action": r["action"],
+            "actor": r["actor"] or "api",
+            "note": r["note"] or "",
+            "timestamp": r["timestamp"],
+            "content_preview": (after.get("content") or "")[:120],
+        })
+    return out
+
+
 def _normalize_tags(tags) -> list[str]:
    """Coerce a tags value (list, JSON string, None) to a clean lowercase list."""
    import json as _json
@@ -98,6 +198,7 @@ def create_memory(
    status: str = "active",
    domain_tags: list[str] | None = None,
    valid_until: str = "",
+    actor: str = "api",
 ) -> Memory:
    """Create a new memory entry.

@@ -160,6 +261,21 @@ def create_memory(
        valid_until=valid_until or "",
    )

+    _audit_memory(
+        memory_id=memory_id,
+        action="created",
+        actor=actor,
+        after={
+            "memory_type": memory_type,
+            "content": content,
+            "project": project,
+            "status": status,
+            "confidence": confidence,
+            "domain_tags": tags,
+            "valid_until": valid_until or "",
+        },
+    )
+
    return Memory(
        id=memory_id,
        memory_type=memory_type,
@@ -235,6 +351,8 @@ def update_memory(
    memory_type: str | None = None,
    domain_tags: list[str] | None = None,
    valid_until: str | None = None,
+    actor: str = "api",
+    note: str = "",
 ) -> bool:
    """Update an existing memory."""
    import json as _json
@@ -258,31 +376,48 @@ def update_memory(
            if duplicate:
                raise ValueError("Update would create a duplicate active memory")

+        # Capture before-state for audit
+        before_snapshot = {
+            "content": existing["content"],
+            "status": existing["status"],
+            "confidence": existing["confidence"],
+            "memory_type": existing["memory_type"],
+        }
+        after_snapshot = dict(before_snapshot)
+
        updates = []
        params: list = []

        if content is not None:
            updates.append("content = ?")
            params.append(content)
+            after_snapshot["content"] = content
        if confidence is not None:
            updates.append("confidence = ?")
            params.append(confidence)
+            after_snapshot["confidence"] = confidence
        if status is not None:
            if status not in MEMORY_STATUSES:
                raise ValueError(f"Invalid status '{status}'. Must be one of: {MEMORY_STATUSES}")
            updates.append("status = ?")
            params.append(status)
+            after_snapshot["status"] = status
        if memory_type is not None:
            if memory_type not in MEMORY_TYPES:
                raise ValueError(f"Invalid memory type '{memory_type}'. Must be one of: {MEMORY_TYPES}")
            updates.append("memory_type = ?")
            params.append(memory_type)
+            after_snapshot["memory_type"] = memory_type
        if domain_tags is not None:
+            norm_tags = _normalize_tags(domain_tags)
            updates.append("domain_tags = ?")
-            params.append(_json.dumps(_normalize_tags(domain_tags)))
+            params.append(_json.dumps(norm_tags))
+            after_snapshot["domain_tags"] = norm_tags
        if valid_until is not None:
+            vu = valid_until.strip() or None
            updates.append("valid_until = ?")
-            params.append(valid_until.strip() or None)
+            params.append(vu)
+            after_snapshot["valid_until"] = vu or ""

        if not updates:
            return False
@@ -297,21 +432,40 @@ def update_memory(

    if result.rowcount > 0:
        log.info("memory_updated", memory_id=memory_id)
+        # Action verb is driven by status change when applicable; otherwise "updated"
+        if status == "active" and before_snapshot["status"] == "candidate":
+            action = "promoted"
+        elif status == "invalid" and before_snapshot["status"] == "candidate":
+            action = "rejected"
+        elif status == "invalid":
+            action = "invalidated"
+        elif status == "superseded":
+            action = "superseded"
+        else:
+            action = "updated"
+        _audit_memory(
+            memory_id=memory_id,
+            action=action,
+            actor=actor,
+            before=before_snapshot,
+            after=after_snapshot,
+            note=note,
+        )
        return True
    return False


-def invalidate_memory(memory_id: str) -> bool:
+def invalidate_memory(memory_id: str, actor: str = "api") -> bool:
    """Mark a memory as invalid (error correction)."""
-    return update_memory(memory_id, status="invalid")
+    return update_memory(memory_id, status="invalid", actor=actor)


-def supersede_memory(memory_id: str) -> bool:
+def supersede_memory(memory_id: str, actor: str = "api") -> bool:
    """Mark a memory as superseded (replaced by newer info)."""
-    return update_memory(memory_id, status="superseded")
+    return update_memory(memory_id, status="superseded", actor=actor)


-def promote_memory(memory_id: str) -> bool:
+def promote_memory(memory_id: str, actor: str = "api", note: str = "") -> bool:
    """Promote a candidate memory to active (Phase 9 Commit C review queue).

    Returns False if the memory does not exist or is not currently a
@@ -326,10 +480,10 @@ def promote_memory(memory_id: str) -> bool:
        return False
    if row["status"] != "candidate":
        return False
-    return update_memory(memory_id, status="active")
+    return update_memory(memory_id, status="active", actor=actor, note=note)


-def reject_candidate_memory(memory_id: str) -> bool:
+def reject_candidate_memory(memory_id: str, actor: str = "api", note: str = "") -> bool:
    """Reject a candidate memory (Phase 9 Commit C).

    Sets the candidate's status to ``invalid`` so it drops out of the
@@ -344,7 +498,7 @@ def reject_candidate_memory(memory_id: str) -> bool:
        return False
    if row["status"] != "candidate":
        return False
-    return update_memory(memory_id, status="invalid")
+    return update_memory(memory_id, status="invalid", actor=actor, note=note)


 def reinforce_memory(
@@ -385,6 +539,17 @@ def reinforce_memory(
        old_confidence=round(old_confidence, 4),
        new_confidence=round(new_confidence, 4),
    )
+    # Reinforcement writes an audit row per bump. Reinforcement fires often
+    # (every captured interaction); this lets you trace which interactions
+    # kept which memories alive. Could become chatty but is invaluable for
+    # decay/cold-memory analysis. If it becomes an issue, throttle here.
+    _audit_memory(
+        memory_id=memory_id,
+        action="reinforced",
+        actor="reinforcement",
+        before={"confidence": old_confidence},
+        after={"confidence": new_confidence},
+    )
    return True, old_confidence, new_confidence


@@ -420,7 +585,11 @@ def auto_promote_reinforced(

    for row in rows:
        mid = row["id"]
-        ok = promote_memory(mid)
+        ok = promote_memory(
+            mid,
+            actor="phase10-auto-promote",
+            note=f"ref_count={row['reference_count']} confidence={row['confidence']:.2f}",
+        )
        if ok:
            promoted.append(mid)
            log.info(
@@ -459,7 +628,11 @@ def expire_stale_candidates(

    for row in rows:
        mid = row["id"]
-        ok = reject_candidate_memory(mid)
+        ok = reject_candidate_memory(
+            mid,
+            actor="candidate-expiry",
+            note=f"unreinforced for {max_age_days}+ days",
+        )
        if ok:
            expired.append(mid)
            log.info("memory_expired", memory_id=mid)
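The audit flow in this diff can be tried in isolation with an in-memory SQLite database. The sketch below is not the project's code: the `memory_audit` schema is an assumption reconstructed from the column names in the INSERT statement (the migration is not shown), including a `timestamp` default, since the INSERT never supplies one. The index names, `pick_action`, and `audit` are hypothetical; `pick_action` mirrors the status-transition branches added to `update_memory`.

```python
import json
import sqlite3
import uuid

# Assumed schema: column names come from the INSERT in the diff; types,
# defaults, and index names are guesses for illustration only.
SCHEMA = """
CREATE TABLE memory_audit (
    id TEXT PRIMARY KEY,
    memory_id TEXT NOT NULL,
    action TEXT NOT NULL,
    actor TEXT DEFAULT 'api',
    before_json TEXT DEFAULT '{}',
    after_json TEXT DEFAULT '{}',
    note TEXT DEFAULT '',
    timestamp DATETIME DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_memory_audit_memory ON memory_audit(memory_id);
CREATE INDEX idx_memory_audit_timestamp ON memory_audit(timestamp);
CREATE INDEX idx_memory_audit_action ON memory_audit(action);
"""


def pick_action(old_status, new_status):
    """Choose the audit verb from a status transition (mirrors update_memory)."""
    if new_status == "active" and old_status == "candidate":
        return "promoted"
    if new_status == "invalid" and old_status == "candidate":
        return "rejected"
    if new_status == "invalid":
        return "invalidated"
    if new_status == "superseded":
        return "superseded"
    return "updated"


def audit(conn, memory_id, action, actor="api", before=None, after=None, note=""):
    """Append one audit row; return False instead of raising on failure."""
    try:
        conn.execute(
            "INSERT INTO memory_audit (id, memory_id, action, actor, "
            "before_json, after_json, note) VALUES (?, ?, ?, ?, ?, ?, ?)",
            (
                str(uuid.uuid4()),
                memory_id,
                action,
                actor or "api",
                json.dumps(before or {}),
                json.dumps(after or {}),
                (note or "")[:500],  # same 500-character cap as the diff
            ),
        )
        conn.commit()
        return True
    except Exception as exc:  # fail-open: the caller's mutation continues
        print(f"memory_audit_failed: {exc}")
        return False


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.executescript(SCHEMA)

audit(conn, "m1", "created", after={"content": "prefers tabs", "status": "candidate"})
audit(
    conn,
    "m1",
    pick_action("candidate", "active"),
    actor="human-triage",
    before={"status": "candidate"},
    after={"status": "active"},
)

# Ordered by rowid here because both rows may share the same second.
history = [
    row["action"]
    for row in conn.execute(
        "SELECT action FROM memory_audit WHERE memory_id = ? ORDER BY rowid",
        ("m1",),
    )
]

# A closed connection makes the write fail; audit() reports it and moves on.
dead = sqlite3.connect(":memory:")
dead.close()
failed_write = audit(dead, "m2", "created")
```

One thing the sketch surfaces: if the real `timestamp` column also relies on `CURRENT_TIMESTAMP`, which has one-second resolution, rows written within the same second tie under `ORDER BY timestamp DESC`, so their relative order in `get_memory_audit` and `get_recent_audit` would not be guaranteed. Whether that applies depends on the actual schema, which this diff does not show.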