# Review: Orchestration Engine (Plan 10) - V2

> **Reviewer:** Auditor 🔍
> **Date:** 2026-02-14
> **Status:** **CONDITIONAL PASS** (implement required controls before production-critical use)
> **Subject:** `10-ORCHESTRATION-ENGINE-PLAN.md`

---

## Executive Verdict

Mario’s architecture is directionally correct and much stronger than fire-and-forget delegation. The three-layer model (Core → Routing → Workflows), structured handoffs, and explicit validation loops are all solid decisions.

However, for production reliability and auditability, this must ship with stricter **state integrity**, **idempotency**, **schema governance**, and **human approval gates** for high-impact actions.

**Bottom line:** Proceed, but only with the must-fix items below integrated into Phase 1–2.

---

## Findings

### 🔴 Critical (must fix)

1. **No explicit idempotency contract for retries/timeouts**
   - The current plan retries on timeout/malformed outputs, but does not define how to prevent duplicate side effects (double posts, repeated downstream actions).
   - **Risk:** inconsistent workflow outcomes, duplicate client-facing messages, non-reproducible state.
   - **Required fix:** Add an `idempotency_key` to every step attempt (the key must stay the same across retries of the same step, otherwise dedupe cannot work) and enforce dedupe on handoff consumption + delivery.

2. **Handoff schema is underspecified for machine validation**
   - The fields shown are helpful, but no versioned JSON Schema or strict required/optional policy exists.
   - **Risk:** malformed yet “accepted” outputs, brittle parsing, silent failure propagation.
   - **Required fix:** versioned schema (`schemaVersion`), strict required fields, validator in `orchestrate.sh` + CI check for schema compatibility.

3. **No hard gate for high-stakes workflow steps**
   - Auditor checks are present, but there is no formal “approval required” interrupt before irreversible actions.
   - **Risk:** automated progression with incorrect assumptions.
   - **Required fix:** add `approval_gate: true` for designated steps (e.g., external deliverables, strategic recommendations).

---

### 🟡 Major (should fix)

1. **State model is split across ad hoc files**
   - File-based handoff is fine for MVP, but without a canonical workflow state object, long chains get fragile.
   - **Recommendation:** add a per-run `state.json` blackboard (append-only event log + resolved materialized state).

2. **Observability is not yet sufficient for root-cause analysis**
   - Metrics are planned later; debugging multi-agent failures without end-to-end trace IDs will be painful.
   - **Recommendation:** start now with `workflowRunId`, `stepId`, `attempt`, `agent`, `latencyMs`, `token/cost estimate`, and terminal status.

3. **Channel-context ingestion lacks trust/sanitization policy**
   - Discord history can include noisy or unsafe content.
   - **Recommendation:** context sanitizer + source tagging + max token window + instruction stripping from untrusted text blocks.

4. **Hierarchical delegation loop prevention is policy-level only**
   - Good design intent, but no enforcement mechanism described.
   - **Recommendation:** enforce delegation ACL matrix in orchestrator runtime (not only SOUL instructions).

---

### 🟢 Minor (nice to fix)

1. Add `result_quality_score` (0–1) from validator for triage and dashboards.
2. Add `artifacts_checksum` to handoff metadata for reproducibility.
3. Add workflow dry-run mode to validate dependency graph and substitutions without execution.

---

## External Pattern Cross-Check (complementary ideas)

Based on architecture patterns in common orchestration ecosystems (LangGraph, AutoGen, CrewAI, Temporal, Prefect, Step Functions):

1. **Durable execution + resumability** (LangGraph/Temporal style)
   - Keep execution history and allow resume from last successful step.

2. **Guardrails with bounded retries** (CrewAI/Prefect style)
   - You already started this; formalize per-step retry policy and failure classes.

3.
   **State-machine semantics** (Step Functions style)
   - Model each step state explicitly: `pending → running → validated → committed | failed`.

4. **Human-in-the-loop interrupts**
   - Introduce pause/approve/reject transitions for critical branches.

5. **Exactly-once consumption where possible**
   - At minimum, “at-least-once execution + idempotent effects” should be guaranteed.

---

## Recommended Minimal Patch Set (before scaling)

1. **Schema + idempotency first**
   - `handoff.schema.json` + `idempotency_key` required fields.

2. **Canonical state file per workflow run**
   - `handoffs/workflows//state.json` as single source of truth.

3. **Enforced ACL delegation matrix**
   - Runtime check: who can delegate to whom, hard-block loops.

4. **Approval gates for critical outputs**
   - YAML: `requires_approval: manager|ceo`. Critical finding 3 names this flag `approval_gate: true`; the plan should settle on one key name.

5. **Trace-first logging**
   - Correlated logs for every attempt and transition.

---

## Final Recommendation

**CONDITIONAL PASS**

Implementation can proceed immediately, but production-critical use should wait until the 5-item minimal patch set is in place. The current plan is strong; these controls are what make it reliable under stress.

---

## Suggested Filename Convention

`REVIEW-Orchestration-Engine-Auditor-V2.md`
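
---

## Appendix: Illustrative Sketch (non-normative)

To make the idempotency and state-machine recommendations above more concrete, here is a minimal Python sketch. It is not taken from the plan under review. The names `RunState`, `idempotency_key`, `transition`, and `deliver` are hypothetical, the in-memory object only stands in for a per-run `state.json`, and two behaviors are assumptions the review does not specify: retries stay inside `running`, and `failed` is reachable from `running` or `validated`.

```python
import hashlib
from dataclasses import dataclass, field

# Allowed transitions for `pending → running → validated → committed | failed`.
# Assumption: `failed` is reachable from `running` and `validated`, and both
# `committed` and `failed` are terminal. Retries stay inside `running`.
TRANSITIONS = {
    "pending": {"running"},
    "running": {"validated", "failed"},
    "validated": {"committed", "failed"},
    "committed": set(),
    "failed": set(),
}


def idempotency_key(workflow_run_id: str, step_id: str) -> str:
    """Derive a key that is identical for every retry attempt of one step."""
    raw = f"{workflow_run_id}:{step_id}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


@dataclass
class RunState:
    """In-memory stand-in for a per-run state file (hypothetical layout)."""
    workflow_run_id: str
    events: list = field(default_factory=list)       # append-only event log
    step_status: dict = field(default_factory=dict)  # materialized state
    delivered: set = field(default_factory=set)      # keys already delivered

    def transition(self, step_id: str, new_status: str, attempt: int = 1) -> None:
        """Record a state change, rejecting anything outside TRANSITIONS."""
        current = self.step_status.get(step_id, "pending")
        if new_status not in TRANSITIONS[current]:
            raise ValueError(f"illegal transition {current} -> {new_status}")
        self.events.append({
            "workflowRunId": self.workflow_run_id,
            "stepId": step_id,
            "attempt": attempt,
            "from": current,
            "to": new_status,
        })
        self.step_status[step_id] = new_status

    def deliver(self, step_id: str, send) -> bool:
        """Run the side effect once per step; return False for a duplicate."""
        key = idempotency_key(self.workflow_run_id, step_id)
        if key in self.delivered:
            return False
        send()  # if this raises, the key is not recorded and a retry may resend
        self.delivered.add(key)
        return True
```

A retried delivery for the same step is then a no-op:

```python
state = RunState("run-001")
outbox = []
state.transition("draft", "running")
state.transition("draft", "validated")
state.deliver("draft", lambda: outbox.append("report"))  # sends
state.deliver("draft", lambda: outbox.append("report"))  # duplicate, skipped
state.transition("draft", "committed")
```

Because the key is recorded only after `send()` returns, the sketch gives at-least-once execution with deduplicated effects, which matches the minimum guarantee the cross-check section asks for. A real implementation would persist `delivered` and `events` durably before acknowledging a step.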