feat: Phase 4 V1 — Robustness Hardening
Adds the observability + safety layer that turns AtoCore from
"works until something silently breaks" into "every mutation is
traceable, drift is detected, failures raise alerts."
1. Audit log (memory_audit table):
- New table with id, memory_id, action, actor, before/after JSON,
note, timestamp; three indexes on memory_id/timestamp/action
- _audit_memory() helper called from every mutation:
create_memory, update_memory, promote_memory,
reject_candidate_memory, invalidate_memory, supersede_memory,
reinforce_memory, auto_promote_reinforced, expire_stale_candidates
- Action verb auto-selected from the state transition:
  promoted/rejected/invalidated/superseded/updated
- "actor" threaded through: api-http, human-triage,
  phase10-auto-promote, candidate-expiry, reinforcement, etc.
- Fail-open: a failed audit write is logged but never breaks the
  mutation itself (minimal sketch after this list)
- GET /memory/{id}/audit: full history for one memory
- GET /admin/audit/recent: last 50 mutations across the system
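A minimal sketch of the fail-open helper, assuming sqlite3 and dict
snapshots; the memory_audit columns follow the description above, while
the connection handling, logger, and exact column names (before_json,
after_json) are illustrative, not the actual implementation:

    import json
    import logging
    import sqlite3
    from datetime import datetime, timezone

    log = logging.getLogger("atocore.audit")

    def _audit_memory(conn: sqlite3.Connection, memory_id: str, action: str,
                      actor: str, before: dict | None, after: dict | None,
                      note: str = "") -> None:
        """Record one mutation in memory_audit; never raise into the caller."""
        try:
            conn.execute(
                "INSERT INTO memory_audit"
                " (memory_id, action, actor, before_json, after_json, note, timestamp)"
                " VALUES (?, ?, ?, ?, ?, ?, ?)",
                (memory_id, action, actor,
                 json.dumps(before) if before is not None else None,
                 json.dumps(after) if after is not None else None,
                 note, datetime.now(timezone.utc).isoformat()),
            )
            conn.commit()
        except Exception:
            # Fail-open: a broken audit trail must not break the mutation itself.
            log.exception("audit write failed: %s %s", action, memory_id)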
2. Alerts framework (src/atocore/observability/alerts.py):
- emit_alert(severity, title, message, context) fans out to:
- structlog logger (always)
- ~/atocore-logs/alerts.log append (configurable via
ATOCORE_ALERT_LOG)
- project_state atocore/alert/last_{severity} (dashboard surface)
- ATOCORE_ALERT_WEBHOOK POST if set (auto-detects Discord webhook
format for nice embeds; generic JSON otherwise)
- Every sink is fail-open: one sink failing doesn't prevent the
  others (sketched after this list)
- Pipeline alert step in nightly cron: harness < 85% → warning;
candidate queue > 200 → warning
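A condensed sketch of emit_alert under these assumptions (structlog,
env var names as above); the project_state sink and Discord embed
formatting are elided, and each sink is wrapped independently so one
failing sink cannot block the rest:

    import json
    import os
    import urllib.request
    from datetime import datetime, timezone
    from pathlib import Path

    import structlog

    logger = structlog.get_logger("atocore.alerts")
    _SEVERITIES = ("info", "warning", "critical")

    def emit_alert(severity, title, message, context=None):
        if severity not in _SEVERITIES:
            severity = "info"  # unknown severity falls back instead of raising
        record = {"severity": severity, "title": title, "message": message,
                  "context": context or {},
                  "ts": datetime.now(timezone.utc).isoformat()}
        try:  # sink 1: structured log (always attempted)
            logger.info("alert", **record)
        except Exception:
            pass
        try:  # sink 2: append-only alert log file
            path = Path(os.environ.get("ATOCORE_ALERT_LOG")
                        or Path.home() / "atocore-logs" / "alerts.log")
            path.parent.mkdir(parents=True, exist_ok=True)
            with path.open("a") as fh:
                fh.write(json.dumps(record) + "\n")
        except Exception:
            pass
        url = os.environ.get("ATOCORE_ALERT_WEBHOOK")
        if url:
            try:  # sink 3: webhook POST (project_state sink elided here)
                req = urllib.request.Request(
                    url, data=json.dumps(record).encode(),
                    headers={"Content-Type": "application/json"})
                urllib.request.urlopen(req, timeout=10)
            except Exception:
                pass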
3. Integrity checks (scripts/integrity_check.py):
- Nightly scan for drift:
- Memories → missing source_chunk_id references
- Duplicate active memories (same type+content+project; one such
  check is sketched below)
- project_state → missing projects
- Orphaned source_chunks (no parent document)
- Results persisted to atocore/status/integrity_check_result
- Any finding emits a warning alert
- Added as Step G in deploy/dalidou/batch-extract.sh nightly cron
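As an example of one drift check, duplicate active memories could be
detected roughly like this; the field names (type, content, project,
state) are assumptions taken from the bullet above, and fetching the
memory list is left out:

    from collections import Counter

    def find_duplicate_active_memories(memories: list[dict]) -> list[dict]:
        # Count active memories sharing the same (type, content, project) key.
        counts = Counter(
            (m["type"], m["content"], m["project"])
            for m in memories if m.get("state") == "active"
        )
        return [
            {"check": "duplicate_active_memory",
             "type": t, "project": p, "count": n}
            for (t, content, p), n in counts.items() if n > 1
        ]

Any non-empty result is persisted to atocore/status/integrity_check_result
and surfaced as a warning alert, per the description above.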
4. Dashboard surfaces it all (illustrative payload shape below):
- integrity (findings + details)
- alerts (last info/warning/critical per severity)
- recent_audit (last 10 mutations with actor + action + preview)
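A rough sketch of what /admin/dashboard might return for these
surfaces; every key name beyond those mentioned above is a guess:

    # Illustrative shape only: the surfaces above plus the memory
    # counts Step H reads; exact key names are assumptions.
    dashboard = {
        "memories": {"candidates": 12},
        "integrity": {"findings": 1,
                      "details": ["orphaned source_chunk: no parent document"]},
        "alerts": {"last_info": None,
                   "last_warning": {"title": "Candidate queue not draining"},
                   "last_critical": None},
        "recent_audit": [
            {"actor": "api-http", "action": "promoted", "preview": "..."},
        ],
    }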
Tests: 308 → 317 (9 new):
- test_audit_create_logs_entry
- test_audit_promote_logs_entry
- test_audit_reject_logs_entry
- test_audit_update_captures_before_after
- test_audit_reinforce_logs_entry
- test_recent_audit_returns_cross_memory_entries
- test_emit_alert_writes_log_file
- test_emit_alert_invalid_severity_falls_back_to_info
- test_emit_alert_fails_open_on_log_write_error (sketched below)
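The fail-open test can be expressed by pointing the alert log at an
impossible path, roughly (pytest; assumes ATOCORE_ALERT_LOG is read
per call, as in the sketch above):

    def test_emit_alert_fails_open_on_log_write_error(monkeypatch, tmp_path):
        from atocore.observability.alerts import emit_alert
        # A plain file where the log's parent directory should be makes
        # the file sink raise internally; emit_alert must swallow it.
        blocker = tmp_path / "not-a-dir"
        blocker.write_text("occupied")
        monkeypatch.setenv("ATOCORE_ALERT_LOG", str(blocker / "alerts.log"))
        emit_alert("warning", "title", "message")  # must not raise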
Deferred: formal migration framework with rollback (current additive
pattern is fine for V1); memory detail wiki page with audit view
(quick follow-up).
To enable Discord alerts: set ATOCORE_ALERT_WEBHOOK to a Discord
webhook URL in Dalidou's environment. Default = log-only.
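The Discord auto-detection mentioned in section 2 could look roughly
like this; the URL substring and the embed fields follow Discord's
public webhook API, while the color mapping is purely illustrative:

    def _webhook_payload(url: str, severity: str, title: str, message: str) -> dict:
        # Discord webhook URLs live under .../api/webhooks/...
        if "discord.com/api/webhooks" in url or "discordapp.com/api/webhooks" in url:
            colors = {"info": 0x3498DB, "warning": 0xE67E22, "critical": 0xE74C3C}
            return {"embeds": [{
                "title": f"[{severity.upper()}] {title}",
                "description": message,
                "color": colors.get(severity, 0x95A5A6),
            }]}
        return {"severity": severity, "title": title, "message": message}  # generic JSON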
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
deploy/dalidou/batch-extract.sh
@@ -150,4 +150,65 @@ print(f'Pipeline summary persisted: {json.dumps(summary)}')
     log "WARN: pipeline summary persistence failed (non-blocking)"
 }
 
+# Step G: Integrity check (Phase 4 V1)
+log "Step G: integrity check"
+python3 "$APP_DIR/scripts/integrity_check.py" \
+    --base-url "$ATOCORE_URL" \
+    2>&1 || {
+    log "WARN: integrity check failed (non-blocking)"
+}
+
+# Step H: Pipeline-level alerts — detect conditions that warrant attention
+log "Step H: pipeline alerts"
+python3 -c "
+import json, os, sys, urllib.request
+sys.path.insert(0, '$APP_DIR/src')
+from atocore.observability.alerts import emit_alert
+
+base = '$ATOCORE_URL'
+
+def get_state(project='atocore'):
+    try:
+        req = urllib.request.Request(f'{base}/project/state/{project}')
+        resp = urllib.request.urlopen(req, timeout=10)
+        return json.loads(resp.read()).get('entries', [])
+    except Exception:
+        return []
+
+def get_dashboard():
+    try:
+        req = urllib.request.Request(f'{base}/admin/dashboard')
+        resp = urllib.request.urlopen(req, timeout=10)
+        return json.loads(resp.read())
+    except Exception:
+        return {}
+
+state = {(e['category'], e['key']): e['value'] for e in get_state()}
+dash = get_dashboard()
+
+# Harness regression check
+harness_raw = state.get(('status', 'retrieval_harness_result'))
+if harness_raw:
+    try:
+        h = json.loads(harness_raw)
+        passed, total = h.get('passed', 0), h.get('total', 0)
+        if total > 0:
+            rate = passed / total
+            if rate < 0.85:
+                emit_alert('warning', 'Retrieval harness below 85%',
+                    f'Only {passed}/{total} fixtures passing ({rate:.0%}). Failures: {h.get(\"failures\", [])[:5]}',
+                    context={'pass_rate': rate})
+    except Exception:
+        pass
+
+# Candidate queue pileup
+candidates = dash.get('memories', {}).get('candidates', 0)
+if candidates > 200:
+    emit_alert('warning', 'Candidate queue not draining',
+        f'{candidates} candidates pending. Auto-triage may be stuck or rate-limited.',
+        context={'candidates': candidates})
+
+print('pipeline alerts check complete')
+" 2>&1 || true
+
 log "=== AtoCore batch extraction + triage complete ==="