Atomizer/docs/hq/reviews/REVIEW-Orchestration-Engine-Auditor-V2.md

# Review: Orchestration Engine (Plan 10) — V2

> **Reviewer:** Auditor 🔍
> **Date:** 2026-02-14
> **Status:** **CONDITIONAL PASS** (implement required controls before production-critical use)
> **Subject:** `10-ORCHESTRATION-ENGINE-PLAN.md`

---

## Executive Verdict

Mario’s architecture is directionally correct and much stronger than fire-and-forget delegation. The three-layer model (Core → Routing → Workflows), structured handoffs, and explicit validation loops are all solid decisions.

However, for production reliability and auditability, this must ship with stricter **state integrity**, **idempotency**, **schema governance**, and **human approval gates** for high-impact actions.

**Bottom line:** Proceed, but only with the must-fix items below integrated into Phase 1–2.

---

## Findings

### 🔴 Critical (must fix)

1. **No explicit idempotency contract for retries/timeouts**
   - Current plan retries on timeout/malformed outputs, but does not define how to prevent duplicate side effects (double posts, repeated downstream actions).
   - **Risk:** inconsistent workflow outcomes, duplicate client-facing messages, non-reproducible state.
   - **Required fix:** Add `idempotency_key` per step attempt and enforce dedupe on handoff consumption + delivery.

2. **Handoff schema is underspecified for machine validation**
   - Fields shown are helpful, but no versioned JSON Schema or strict required/optional policy exists.
   - **Risk:** malformed yet “accepted” outputs, brittle parsing, silent failure propagation.
   - **Required fix:** versioned schema (`schemaVersion`), strict required fields, validator in `orchestrate.sh` + CI check for schema compatibility.

3. **No hard gate for high-stakes workflow steps**
   - Auditor checks are present, but there is no formal “approval required” interrupt before irreversible actions.
   - **Risk:** automated progression with incorrect assumptions.
   - **Required fix:** add `approval_gate: true` for designated steps (e.g., external deliverables, strategic recommendations).

---

### 🟡 Major (should fix)

1. **State model is split across ad hoc files**
   - File-based handoff is fine for MVP, but without a canonical workflow state object, long chains get fragile.
   - **Recommendation:** add a per-run `state.json` blackboard (append-only event log + resolved materialized state).

2. **Observability is not yet sufficient for root-cause analysis**
   - Metrics are planned later; debugging multi-agent failures without end-to-end trace IDs will be painful.
   - **Recommendation:** start now with `workflowRunId`, `stepId`, `attempt`, `agent`, `latencyMs`, `token/cost estimate`, and terminal status.

3. **Channel-context ingestion lacks trust/sanitization policy**
   - Discord history can include noisy or unsafe content.
   - **Recommendation:** context sanitizer + source tagging + max token window + instruction stripping from untrusted text blocks.

4. **Hierarchical delegation loop prevention is policy-level only**
   - Good design intent, but no enforcement mechanism described.
   - **Recommendation:** enforce delegation ACL matrix in orchestrator runtime (not only SOUL instructions).

---

### 🟢 Minor (nice to fix)

1. Add `result_quality_score` (0–1) from validator for triage and dashboards.
2. Add `artifacts_checksum` to handoff metadata for reproducibility.
3. Add workflow dry-run mode to validate dependency graph and substitutions without execution.

---

## External Pattern Cross-Check (complementary ideas)

Based on architecture patterns in common orchestration ecosystems (LangGraph, AutoGen, CrewAI, Temporal, Prefect, Step Functions):

1. **Durable execution + resumability** (LangGraph/Temporal style)
   - Keep execution history and allow resume from last successful step.

2. **Guardrails with bounded retries** (CrewAI/Prefect style)
   - You already started this; formalize per-step retry policy and failure classes.

3. **State-machine semantics** (Step Functions style)
   - Model each step state explicitly: `pending → running → validated → committed | failed`.

4. **Human-in-the-loop interrupts**
   - Introduce pause/approve/reject transitions for critical branches.

5. **Exactly-once consumption where possible**
   - At minimum, “at-least-once execution + idempotent effects” should be guaranteed.

---

## Recommended Minimal Patch Set (before scaling)

1. **Schema + idempotency first**
   - `handoff.schema.json` + `idempotency_key` required fields.

2. **Canonical state file per workflow run**
   - `handoffs/workflows/<runId>/state.json` as single source of truth.

3. **Enforced ACL delegation matrix**
   - Runtime check: who can delegate to whom, hard-block loops.

4. **Approval gates for critical outputs**
   - YAML: `requires_approval: manager|ceo`.

5. **Trace-first logging**
   - Correlated logs for every attempt and transition.

---

## Final Recommendation

**CONDITIONAL PASS**
Implementation can proceed immediately, but production-critical use should wait until the 5-item minimal patch set is in place. The current plan is strong; these controls are what make it reliable under stress.

---

## Suggested Filename Convention

`REVIEW-Orchestration-Engine-Auditor-V2.md`