feat: Phase 4 V1 — Robustness Hardening
Adds the observability + safety layer that turns AtoCore from
"works until something silently breaks" into "every mutation is
traceable, drift is detected, failures raise alerts."
1. Audit log (memory_audit table):
- New table with id, memory_id, action, actor, before/after JSON,
note, timestamp; three indexes on memory_id/timestamp/action
- _audit_memory() helper called from every mutation:
create_memory, update_memory, promote_memory,
reject_candidate_memory, invalidate_memory, supersede_memory,
reinforce_memory, auto_promote_reinforced, expire_stale_candidates
- Action verb auto-selected from the state transition:
  promoted/rejected/invalidated/superseded/updated
- "actor" threaded through: api-http, human-triage,
  phase10-auto-promote, candidate-expiry, reinforcement, etc.
- Fail-open: a failed audit write is logged but never breaks the
  mutation itself (minimal sketch after this list)
- GET /memory/{id}/audit: full history for one memory
- GET /admin/audit/recent: last 50 mutations across the system
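A minimal sketch of the fail-open helper, assuming sqlite3 and dict
snapshots; the memory_audit columns follow the description above, while
the connection handling, logger, and exact column names (before_json,
after_json) are illustrative, not the actual implementation:

    import json
    import logging
    import sqlite3
    from datetime import datetime, timezone

    log = logging.getLogger("atocore.audit")

    def _audit_memory(conn: sqlite3.Connection, memory_id: str, action: str,
                      actor: str, before: dict | None, after: dict | None,
                      note: str = "") -> None:
        """Record one mutation in memory_audit; never raise into the caller."""
        try:
            conn.execute(
                "INSERT INTO memory_audit"
                " (memory_id, action, actor, before_json, after_json, note, timestamp)"
                " VALUES (?, ?, ?, ?, ?, ?, ?)",
                (memory_id, action, actor,
                 json.dumps(before) if before is not None else None,
                 json.dumps(after) if after is not None else None,
                 note, datetime.now(timezone.utc).isoformat()),
            )
            conn.commit()
        except Exception:
            # Fail-open: a broken audit trail must not break the mutation itself.
            log.exception("audit write failed: %s %s", action, memory_id)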
2. Alerts framework (src/atocore/observability/alerts.py):
- emit_alert(severity, title, message, context) fans out to:
- structlog logger (always)
- ~/atocore-logs/alerts.log append (configurable via
ATOCORE_ALERT_LOG)
- project_state atocore/alert/last_{severity} (dashboard surface)
- ATOCORE_ALERT_WEBHOOK POST if set (auto-detects Discord webhook
format for nice embeds; generic JSON otherwise)
- Every sink is fail-open: one sink failing doesn't prevent the
  others (sketched after this list)
- Pipeline alert step in nightly cron: harness < 85% → warning;
candidate queue > 200 → warning
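A condensed sketch of emit_alert under these assumptions (structlog,
env var names as above); the project_state sink and Discord embed
formatting are elided, and each sink is wrapped independently so one
failing sink cannot block the rest:

    import json
    import os
    import urllib.request
    from datetime import datetime, timezone
    from pathlib import Path

    import structlog

    logger = structlog.get_logger("atocore.alerts")
    _SEVERITIES = ("info", "warning", "critical")

    def emit_alert(severity, title, message, context=None):
        if severity not in _SEVERITIES:
            severity = "info"  # unknown severity falls back instead of raising
        record = {"severity": severity, "title": title, "message": message,
                  "context": context or {},
                  "ts": datetime.now(timezone.utc).isoformat()}
        try:  # sink 1: structured log (always attempted)
            logger.info("alert", **record)
        except Exception:
            pass
        try:  # sink 2: append-only alert log file
            path = Path(os.environ.get("ATOCORE_ALERT_LOG")
                        or Path.home() / "atocore-logs" / "alerts.log")
            path.parent.mkdir(parents=True, exist_ok=True)
            with path.open("a") as fh:
                fh.write(json.dumps(record) + "\n")
        except Exception:
            pass
        url = os.environ.get("ATOCORE_ALERT_WEBHOOK")
        if url:
            try:  # sink 3: webhook POST (project_state sink elided here)
                req = urllib.request.Request(
                    url, data=json.dumps(record).encode(),
                    headers={"Content-Type": "application/json"})
                urllib.request.urlopen(req, timeout=10)
            except Exception:
                pass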
3. Integrity checks (scripts/integrity_check.py):
- Nightly scan for drift:
- Memories → missing source_chunk_id references
- Duplicate active memories (same type+content+project; one such
  check is sketched below)
- project_state → missing projects
- Orphaned source_chunks (no parent document)
- Results persisted to atocore/status/integrity_check_result
- Any finding emits a warning alert
- Added as Step G in deploy/dalidou/batch-extract.sh nightly cron
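As an example of one drift check, duplicate active memories could be
detected roughly like this; the field names (type, content, project,
state) are assumptions taken from the bullet above, and fetching the
memory list is left out:

    from collections import Counter

    def find_duplicate_active_memories(memories: list[dict]) -> list[dict]:
        # Count active memories sharing the same (type, content, project) key.
        counts = Counter(
            (m["type"], m["content"], m["project"])
            for m in memories if m.get("state") == "active"
        )
        return [
            {"check": "duplicate_active_memory",
             "type": t, "project": p, "count": n}
            for (t, content, p), n in counts.items() if n > 1
        ]

Any non-empty result is persisted to atocore/status/integrity_check_result
and surfaced as a warning alert, per the description above.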
4. Dashboard surfaces it all (illustrative payload shape below):
- integrity (findings + details)
- alerts (last info/warning/critical per severity)
- recent_audit (last 10 mutations with actor + action + preview)
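A rough sketch of what /admin/dashboard might return for these
surfaces; every key name beyond those mentioned above is a guess:

    # Illustrative shape only: the surfaces above plus the memory
    # counts Step H reads; exact key names are assumptions.
    dashboard = {
        "memories": {"candidates": 12},
        "integrity": {"findings": 1,
                      "details": ["orphaned source_chunk: no parent document"]},
        "alerts": {"last_info": None,
                   "last_warning": {"title": "Candidate queue not draining"},
                   "last_critical": None},
        "recent_audit": [
            {"actor": "api-http", "action": "promoted", "preview": "..."},
        ],
    }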
Tests: 308 → 317 (9 new):
- test_audit_create_logs_entry
- test_audit_promote_logs_entry
- test_audit_reject_logs_entry
- test_audit_update_captures_before_after
- test_audit_reinforce_logs_entry
- test_recent_audit_returns_cross_memory_entries
- test_emit_alert_writes_log_file
- test_emit_alert_invalid_severity_falls_back_to_info
- test_emit_alert_fails_open_on_log_write_error (sketched below)
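The fail-open test can be expressed by pointing the alert log at an
impossible path, roughly (pytest; assumes ATOCORE_ALERT_LOG is read
per call, as in the sketch above):

    def test_emit_alert_fails_open_on_log_write_error(monkeypatch, tmp_path):
        from atocore.observability.alerts import emit_alert
        # A plain file where the log's parent directory should be makes
        # the file sink raise internally; emit_alert must swallow it.
        blocker = tmp_path / "not-a-dir"
        blocker.write_text("occupied")
        monkeypatch.setenv("ATOCORE_ALERT_LOG", str(blocker / "alerts.log"))
        emit_alert("warning", "title", "message")  # must not raise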
Deferred: formal migration framework with rollback (current additive
pattern is fine for V1); memory detail wiki page with audit view
(quick follow-up).
To enable Discord alerts: set ATOCORE_ALERT_WEBHOOK to a Discord
webhook URL in Dalidou's environment. Default = log-only.
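The Discord auto-detection mentioned in section 2 could look roughly
like this; the URL substring and the embed fields follow Discord's
public webhook API, while the color mapping is purely illustrative:

    def _webhook_payload(url: str, severity: str, title: str, message: str) -> dict:
        # Discord webhook URLs live under .../api/webhooks/...
        if "discord.com/api/webhooks" in url or "discordapp.com/api/webhooks" in url:
            colors = {"info": 0x3498DB, "warning": 0xE67E22, "critical": 0xE74C3C}
            return {"embeds": [{
                "title": f"[{severity.upper()}] {title}",
                "description": message,
                "color": colors.get(severity, 0x95A5A6),
            }]}
        return {"severity": severity, "title": title, "message": message}  # generic JSON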
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
deploy/dalidou/batch-extract.sh
@@ -150,4 +150,65 @@ print(f'Pipeline summary persisted: {json.dumps(summary)}')
     log "WARN: pipeline summary persistence failed (non-blocking)"
 }
 
+# Step G: Integrity check (Phase 4 V1)
+log "Step G: integrity check"
+python3 "$APP_DIR/scripts/integrity_check.py" \
+    --base-url "$ATOCORE_URL" \
+    2>&1 || {
+    log "WARN: integrity check failed (non-blocking)"
+}
+
+# Step H: Pipeline-level alerts — detect conditions that warrant attention
+log "Step H: pipeline alerts"
+python3 -c "
+import json, os, sys, urllib.request
+sys.path.insert(0, '$APP_DIR/src')
+from atocore.observability.alerts import emit_alert
+
+base = '$ATOCORE_URL'
+
+def get_state(project='atocore'):
+    try:
+        req = urllib.request.Request(f'{base}/project/state/{project}')
+        resp = urllib.request.urlopen(req, timeout=10)
+        return json.loads(resp.read()).get('entries', [])
+    except Exception:
+        return []
+
+def get_dashboard():
+    try:
+        req = urllib.request.Request(f'{base}/admin/dashboard')
+        resp = urllib.request.urlopen(req, timeout=10)
+        return json.loads(resp.read())
+    except Exception:
+        return {}
+
+state = {(e['category'], e['key']): e['value'] for e in get_state()}
+dash = get_dashboard()
+
+# Harness regression check
+harness_raw = state.get(('status', 'retrieval_harness_result'))
+if harness_raw:
+    try:
+        h = json.loads(harness_raw)
+        passed, total = h.get('passed', 0), h.get('total', 0)
+        if total > 0:
+            rate = passed / total
+            if rate < 0.85:
+                emit_alert('warning', 'Retrieval harness below 85%',
+                    f'Only {passed}/{total} fixtures passing ({rate:.0%}). Failures: {h.get(\"failures\", [])[:5]}',
+                    context={'pass_rate': rate})
+    except Exception:
+        pass
+
+# Candidate queue pileup
+candidates = dash.get('memories', {}).get('candidates', 0)
+if candidates > 200:
+    emit_alert('warning', 'Candidate queue not draining',
+        f'{candidates} candidates pending. Auto-triage may be stuck or rate-limited.',
+        context={'candidates': candidates})
+
+print('pipeline alerts check complete')
+" 2>&1 || true
+
 log "=== AtoCore batch extraction + triage complete ==="