docs: add HQ multi-agent framework documentation from PKM

- Project plan, agent roster, architecture, roadmap - Decision log, full system plan, Discord setup/migration guides - System implementation status (as-built) - Cluster pivot history - Orchestration engine plan (Phases 1-4) - Webster and Auditor reviews
2026-02-15 21:44:07 +00:00
parent 3289a76e19
commit cf82de4f06
15 changed files with 6933 additions and 0 deletions
--- a/docs/hq/reviews/REVIEW-Orchestration-Engine-Auditor-V2.md
+++ b/docs/hq/reviews/REVIEW-Orchestration-Engine-Auditor-V2.md
@@ -0,0 +1,118 @@
+# Review: Orchestration Engine (Plan 10) — V2
+
+> **Reviewer:** Auditor 🔍  
+> **Date:** 2026-02-14  
+> **Status:** **CONDITIONAL PASS** (implement required controls before production-critical use)  
+> **Subject:** `10-ORCHESTRATION-ENGINE-PLAN.md`
+
+---
+
+## Executive Verdict
+
+Mario’s architecture is directionally correct and much stronger than fire-and-forget delegation. The three-layer model (Core → Routing → Workflows), structured handoffs, and explicit validation loops are all solid decisions.
+
+However, for production reliability and auditability, this must ship with stricter **state integrity**, **idempotency**, **schema governance**, and **human approval gates** for high-impact actions.
+
+**Bottom line:** Proceed, but only with the must-fix items below integrated into Phase 1–2.
+
+---
+
+## Findings
+
+### 🔴 Critical (must fix)
+
+1. **No explicit idempotency contract for retries/timeouts**
+   - Current plan retries on timeout/malformed outputs, but does not define how to prevent duplicate side effects (double posts, repeated downstream actions).
+   - **Risk:** inconsistent workflow outcomes, duplicate client-facing messages, non-reproducible state.
+   - **Required fix:** Add `idempotency_key` per step attempt and enforce dedupe on handoff consumption + delivery.
+
+2. **Handoff schema is underspecified for machine validation**
+   - Fields shown are helpful, but no versioned JSON Schema or strict required/optional policy exists.
+   - **Risk:** malformed yet “accepted” outputs, brittle parsing, silent failure propagation.
+   - **Required fix:** versioned schema (`schemaVersion`), strict required fields, validator in `orchestrate.sh` + CI check for schema compatibility.
+
+3. **No hard gate for high-stakes workflow steps**
+   - Auditor checks are present, but there is no formal “approval required” interrupt before irreversible actions.
+   - **Risk:** automated progression with incorrect assumptions.
+   - **Required fix:** add `approval_gate: true` for designated steps (e.g., external deliverables, strategic recommendations).
+
+---
+
+### 🟡 Major (should fix)
+
+1. **State model is split across ad hoc files**
+   - File-based handoff is fine for MVP, but without a canonical workflow state object, long chains get fragile.
+   - **Recommendation:** add a per-run `state.json` blackboard (append-only event log + resolved materialized state).
+
+2. **Observability is not yet sufficient for root-cause analysis**
+   - Metrics are planned later; debugging multi-agent failures without end-to-end trace IDs will be painful.
+   - **Recommendation:** start now with `workflowRunId`, `stepId`, `attempt`, `agent`, `latencyMs`, `token/cost estimate`, and terminal status.
+
+3. **Channel-context ingestion lacks trust/sanitization policy**
+   - Discord history can include noisy or unsafe content.
+   - **Recommendation:** context sanitizer + source tagging + max token window + instruction stripping from untrusted text blocks.
+
+4. **Hierarchical delegation loop prevention is policy-level only**
+   - Good design intent, but no enforcement mechanism described.
+   - **Recommendation:** enforce delegation ACL matrix in orchestrator runtime (not only SOUL instructions).
+
+---
+
+### 🟢 Minor (nice to fix)
+
+1. Add `result_quality_score` (0–1) from validator for triage and dashboards.
+2. Add `artifacts_checksum` to handoff metadata for reproducibility.
+3. Add workflow dry-run mode to validate dependency graph and substitutions without execution.
+
+---
+
+## External Pattern Cross-Check (complementary ideas)
+
+Based on architecture patterns in common orchestration ecosystems (LangGraph, AutoGen, CrewAI, Temporal, Prefect, Step Functions):
+
+1. **Durable execution + resumability** (LangGraph/Temporal style)
+   - Keep execution history and allow resume from last successful step.
+
+2. **Guardrails with bounded retries** (CrewAI/Prefect style)
+   - You already started this; formalize per-step retry policy and failure classes.
+
+3. **State-machine semantics** (Step Functions style)
+   - Model each step state explicitly: `pending → running → validated → committed | failed`.
+
+4. **Human-in-the-loop interrupts**
+   - Introduce pause/approve/reject transitions for critical branches.
+
+5. **Exactly-once consumption where possible**
+   - At minimum, “at-least-once execution + idempotent effects” should be guaranteed.
+
+---
+
+## Recommended Minimal Patch Set (before scaling)
+
+1. **Schema + idempotency first**
+   - `handoff.schema.json` + `idempotency_key` required fields.
+
+2. **Canonical state file per workflow run**
+   - `handoffs/workflows/<runId>/state.json` as single source of truth.
+
+3. **Enforced ACL delegation matrix**
+   - Runtime check: who can delegate to whom, hard-block loops.
+
+4. **Approval gates for critical outputs**
+   - YAML: `requires_approval: manager|ceo`.
+
+5. **Trace-first logging**
+   - Correlated logs for every attempt and transition.
+
+---
+
+## Final Recommendation
+
+**CONDITIONAL PASS**  
+Implementation can proceed immediately, but production-critical use should wait until the 5-item minimal patch set is in place. The current plan is strong; these controls are what make it reliable under stress.
+
+---
+
+## Suggested Filename Convention
+
+`REVIEW-Orchestration-Engine-Auditor-V2.md`