docs: add HQ multi-agent framework documentation from PKM

- Project plan, agent roster, architecture, roadmap
- Decision log, full system plan, Discord setup/migration guides
- System implementation status (as-built)
- Cluster pivot history
- Orchestration engine plan (Phases 1-4)
- Webster and Auditor reviews
2026-02-15 21:44:07 +00:00
parent 3289a76e19
commit cf82de4f06
15 changed files with 6933 additions and 0 deletions


@@ -0,0 +1,118 @@
# Review: Orchestration Engine (Plan 10) — V2
> **Reviewer:** Auditor 🔍
> **Date:** 2026-02-14
> **Status:** **CONDITIONAL PASS** (implement required controls before production-critical use)
> **Subject:** `10-ORCHESTRATION-ENGINE-PLAN.md`
---
## Executive Verdict
Mario's architecture is directionally correct and much stronger than fire-and-forget delegation. The three-layer model (Core → Routing → Workflows), structured handoffs, and explicit validation loops are all solid decisions.
However, for production reliability and auditability, this must ship with stricter **state integrity**, **idempotency**, **schema governance**, and **human approval gates** for high-impact actions.
**Bottom line:** Proceed, but only with the must-fix items below integrated into Phase 1–2.
---
## Findings
### 🔴 Critical (must fix)
1. **No explicit idempotency contract for retries/timeouts**
- Current plan retries on timeout/malformed outputs, but does not define how to prevent duplicate side effects (double posts, repeated downstream actions).
- **Risk:** inconsistent workflow outcomes, duplicate client-facing messages, non-reproducible state.
- **Required fix:** Add `idempotency_key` per step attempt and enforce dedupe on handoff consumption + delivery.
2. **Handoff schema is underspecified for machine validation**
- Fields shown are helpful, but no versioned JSON Schema or strict required/optional policy exists.
- **Risk:** malformed yet “accepted” outputs, brittle parsing, silent failure propagation.
- **Required fix:** versioned schema (`schemaVersion`), strict required fields, validator in `orchestrate.sh` + CI check for schema compatibility.
3. **No hard gate for high-stakes workflow steps**
- Auditor checks are present, but there is no formal “approval required” interrupt before irreversible actions.
- **Risk:** automated progression with incorrect assumptions.
- **Required fix:** add `approval_gate: true` for designated steps (e.g., external deliverables, strategic recommendations).
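
The idempotency fix in item 1 could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the plan's implementation: `make_idempotency_key`, `DedupeStore`, and the in-memory set are hypothetical, and the key is assumed to be derived from the run ID, step ID, and payload so that a retry of the same step carries the same key and its side effect is skipped.

```python
import hashlib
import json


def make_idempotency_key(run_id: str, step_id: str, payload: dict) -> str:
    """Derive a stable key: the same run, step, and payload always hash the same."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(f"{run_id}:{step_id}:{canonical}".encode("utf-8"))
    return digest.hexdigest()


class DedupeStore:
    """Hypothetical in-memory record of keys whose side effects already ran."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def deliver_once(self, key: str, deliver) -> bool:
        """Run deliver() only for unseen keys; return True if it ran."""
        if key in self._seen:
            return False
        deliver()
        # Record the key only after a successful delivery, so a crash in
        # deliver() leaves the step eligible for another attempt.
        self._seen.add(key)
        return True


# Usage: a retried step with an identical payload does not post twice.
store = DedupeStore()
posted: list[str] = []
key = make_idempotency_key("run-1", "step-a", {"summary": "draft"})
first = store.deliver_once(key, lambda: posted.append("message"))
retry = store.deliver_once(key, lambda: posted.append("message"))
```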
---
### 🟡 Major (should fix)
1. **State model is split across ad hoc files**
- File-based handoff is fine for MVP, but without a canonical workflow state object, long chains get fragile.
- **Recommendation:** add a per-run `state.json` blackboard (append-only event log + resolved materialized state).
2. **Observability is not yet sufficient for root-cause analysis**
- Metrics are planned later; debugging multi-agent failures without end-to-end trace IDs will be painful.
 - **Recommendation:** start now with `workflowRunId`, `stepId`, `attempt`, `agent`, `latencyMs`, a token/cost estimate, and terminal status on every log record.
3. **Channel-context ingestion lacks trust/sanitization policy**
- Discord history can include noisy or unsafe content.
- **Recommendation:** context sanitizer + source tagging + max token window + instruction stripping from untrusted text blocks.
4. **Hierarchical delegation loop prevention is policy-level only**
- Good design intent, but no enforcement mechanism described.
- **Recommendation:** enforce delegation ACL matrix in orchestrator runtime (not only SOUL instructions).
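
The blackboard recommended in item 1 might look like the sketch below: an append-only event log that is replayed into a materialized `state.json`. The `events.jsonl` file name, the helper names, and the example step and agent values are assumptions for illustration; the trace fields reuse the names suggested in item 2.

```python
import json
import tempfile
from pathlib import Path


def append_event(log_path: Path, event: dict) -> None:
    """Append one event as a JSON line; earlier lines are never rewritten."""
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(event, sort_keys=True) + "\n")


def materialize(log_path: Path) -> dict:
    """Replay the log in order; the latest event for each step wins."""
    state: dict = {}
    with log_path.open(encoding="utf-8") as handle:
        for line in handle:
            event = json.loads(line)
            state[event["stepId"]] = {
                "status": event["status"],
                "attempt": event["attempt"],
                "agent": event["agent"],
            }
    return state


# Usage with hypothetical step names: a failed first attempt, then a retry.
run_dir = Path(tempfile.mkdtemp())
log = run_dir / "events.jsonl"
base = {"workflowRunId": "run-1", "stepId": "research", "agent": "webster"}
append_event(log, {**base, "attempt": 1, "status": "failed"})
append_event(log, {**base, "attempt": 2, "status": "validated"})

state = materialize(log)
# The materialized view is derived data and can always be rebuilt from the log.
(run_dir / "state.json").write_text(json.dumps(state, indent=2), encoding="utf-8")
```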
---
### 🟢 Minor (nice to fix)
1. Add `result_quality_score` (0–1) from validator for triage and dashboards.
2. Add `artifacts_checksum` to handoff metadata for reproducibility.
3. Add workflow dry-run mode to validate dependency graph and substitutions without execution.
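
The dry-run mode in item 3 could start as a pure dependency-graph check, sketched here with Python's standard `graphlib`. The `dry_run` function and the workflow shape (a mapping from each step to the steps it depends on) are assumptions; substitution checking is left out because the plan's substitution syntax is not shown in this review.

```python
from graphlib import CycleError, TopologicalSorter


def dry_run(steps: dict[str, list[str]]) -> list[str]:
    """Return a valid execution order, or raise ValueError. Nothing is executed."""
    for step, deps in steps.items():
        missing = [dep for dep in deps if dep not in steps]
        if missing:
            raise ValueError(f"step {step!r} depends on unknown steps: {missing}")
    try:
        # static_order() yields every step after all of its dependencies.
        return list(TopologicalSorter(steps).static_order())
    except CycleError as exc:
        raise ValueError(f"dependency cycle: {exc.args[1]}") from exc


# Usage with hypothetical step names.
workflow = {"research": [], "draft": ["research"], "audit": ["draft"]}
order = dry_run(workflow)
```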
---
## External Pattern Cross-Check (complementary ideas)
Based on architecture patterns in common orchestration ecosystems (LangGraph, AutoGen, CrewAI, Temporal, Prefect, Step Functions):
1. **Durable execution + resumability** (LangGraph/Temporal style)
- Keep execution history and allow resume from last successful step.
2. **Guardrails with bounded retries** (CrewAI/Prefect style)
- You already started this; formalize per-step retry policy and failure classes.
3. **State-machine semantics** (Step Functions style)
- Model each step state explicitly: `pending → running → validated → committed | failed`.
4. **Human-in-the-loop interrupts**
- Introduce pause/approve/reject transitions for critical branches.
5. **Exactly-once consumption where possible**
- At minimum, “at-least-once execution + idempotent effects” should be guaranteed.
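
The state-machine semantics in item 3 can be made explicit with a transition table, as in this sketch. The review only gives the chain `pending → running → validated → committed | failed`; which states may move to `failed` is an assumption here (both `running` and `validated`), and `StepState`, `ALLOWED`, and `transition` are hypothetical names.

```python
from enum import Enum


class StepState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    VALIDATED = "validated"
    COMMITTED = "committed"
    FAILED = "failed"


# Assumption: a step can fail while running or after validation, and
# committed/failed are terminal. A retry would be a new attempt from PENDING.
ALLOWED = {
    StepState.PENDING: {StepState.RUNNING},
    StepState.RUNNING: {StepState.VALIDATED, StepState.FAILED},
    StepState.VALIDATED: {StepState.COMMITTED, StepState.FAILED},
    StepState.COMMITTED: set(),
    StepState.FAILED: set(),
}


def transition(current: StepState, target: StepState) -> StepState:
    """Return the new state, or raise ValueError for an illegal move."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


# Usage: walk the happy path one transition at a time.
state = StepState.PENDING
for nxt in (StepState.RUNNING, StepState.VALIDATED, StepState.COMMITTED):
    state = transition(state, nxt)
```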
---
## Recommended Minimal Patch Set (before scaling)
1. **Schema + idempotency first**
- `handoff.schema.json` + `idempotency_key` required fields.
2. **Canonical state file per workflow run**
- `handoffs/workflows/<runId>/state.json` as single source of truth.
3. **Enforced ACL delegation matrix**
- Runtime check: who can delegate to whom, hard-block loops.
4. **Approval gates for critical outputs**
- YAML: `requires_approval: manager|ceo`.
5. **Trace-first logging**
- Correlated logs for every attempt and transition.
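
As a starting point for item 1, a handoff check could reject missing or mistyped fields before any step consumes the file. The sketch below is a standard-library stand-in for the recommended `handoff.schema.json`, not a JSON Schema validator; the required-field set and the string version `"1"` are assumptions built from field names used elsewhere in this review.

```python
SUPPORTED_SCHEMA_VERSIONS = {"1"}  # assumption: versions are plain strings

# Assumption: these are the required fields; the real list belongs in
# handoff.schema.json once that file exists.
REQUIRED_FIELDS = {
    "schemaVersion": str,
    "idempotency_key": str,
    "workflowRunId": str,
    "stepId": str,
    "agent": str,
}


def validate_handoff(handoff: dict) -> list[str]:
    """Return a list of problems; an empty list means the handoff is accepted."""
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in handoff:
            errors.append(f"missing required field: {name}")
        elif not isinstance(handoff[name], expected):
            errors.append(f"{name} must be {expected.__name__}")
    version = handoff.get("schemaVersion")
    if isinstance(version, str) and version not in SUPPORTED_SCHEMA_VERSIONS:
        errors.append(f"unsupported schemaVersion: {version}")
    return errors


# Usage with hypothetical values: one valid handoff, one that must be rejected.
good = {
    "schemaVersion": "1",
    "idempotency_key": "abc123",
    "workflowRunId": "run-1",
    "stepId": "research",
    "agent": "webster",
}
bad = {"schemaVersion": "9", "workflowRunId": "run-1", "stepId": "research", "agent": "webster"}
good_errors = validate_handoff(good)
bad_errors = validate_handoff(bad)
```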
---
## Final Recommendation
**CONDITIONAL PASS**
Implementation can proceed immediately, but production-critical use should wait until the 5-item minimal patch set is in place. The current plan is strong; these controls are what make it reliable under stress.
---
## Suggested Filename Convention
`REVIEW-Orchestration-Engine-Auditor-V2.md`