docs: add HQ multi-agent framework documentation from PKM
- Project plan, agent roster, architecture, roadmap - Decision log, full system plan, Discord setup/migration guides - System implementation status (as-built) - Cluster pivot history - Orchestration engine plan (Phases 1-4) - Webster and Auditor reviews
This commit is contained in:
858
docs/hq/10-ORCHESTRATION-ENGINE-PLAN.md
Normal file
858
docs/hq/10-ORCHESTRATION-ENGINE-PLAN.md
Normal file
@@ -0,0 +1,858 @@
|
||||
# 10 — Orchestration Engine: Multi-Instance Intelligence
|
||||
|
||||
> **Status:** Phases 1-3 Complete — Phase 4 (Metrics + Docs) in progress
|
||||
> **Author:** Mario Lavoie (with Antoine)
|
||||
> **Date:** 2026-02-15
|
||||
> **Revised:** 2026-02-15 — Incorporated Webster's review (validation loops, error handling, hierarchical delegation)
|
||||
|
||||
---
|
||||
|
||||
## Problem Statement
|
||||
|
||||
The Atomizer HQ cluster runs 8 independent OpenClaw instances (one per agent). This gives us true parallelism, specialized contexts, and independent Discord identities — but we lost the orchestration primitives that make a single OpenClaw instance powerful:
|
||||
|
||||
- **`sessions_spawn`** — synchronous delegation with result return
|
||||
- **`sessions_history`** — cross-session context reading
|
||||
- **`sessions_send`** — bidirectional inter-session messaging
|
||||
|
||||
The current `delegate.sh` is fire-and-forget. Manager throws a task over the wall and hopes. No result flows back. No chaining. No intelligent multi-step workflows.
|
||||
|
||||
**Goal:** Rebuild OpenClaw's orchestration power at the inter-instance level, enhanced with Discord channel context and a capability registry.
|
||||
|
||||
---
|
||||
|
||||
## Architecture Overview
|
||||
|
||||
Three layers, each building on the last:
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────────┐
|
||||
│ LAYER 3: WORKFLOWS │
|
||||
│ YAML-defined multi-step pipelines │
|
||||
│ (sequential, parallel, conditional branching) │
|
||||
├─────────────────────────────────────────────────────┤
|
||||
│ LAYER 2: SMART ROUTING │
|
||||
│ Capability registry + channel context │
|
||||
│ (manager knows who can do what + project state) │
|
||||
├─────────────────────────────────────────────────────┤
|
||||
│ LAYER 1: ORCHESTRATION CORE │
|
||||
│ Synchronous delegation + result return protocol │
|
||||
│ (replaces fire-and-forget delegate.sh) │
|
||||
├─────────────────────────────────────────────────────┤
|
||||
│ EXISTING INFRASTRUCTURE │
|
||||
│ 8 OpenClaw instances, hooks API, shared filesystem│
|
||||
└─────────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Layer 1: Orchestration Core
|
||||
|
||||
**What it does:** Replaces `delegate.sh` with synchronous delegation. Manager sends a task, waits for the result, gets structured output back. Can then chain to the next agent.
|
||||
|
||||
### 1.1 — The Orchestrate Script
|
||||
|
||||
**File:** `/home/papa/atomizer/workspaces/shared/skills/orchestrate/orchestrate.sh`
|
||||
|
||||
**Behavior:**
|
||||
1. Send task to target agent via `/hooks/agent` (existing mechanism)
|
||||
2. Poll the agent's session for completion via `/hooks/status/{runId}` or `/sessions` API
|
||||
3. Capture the agent's response (structured output)
|
||||
4. Return it to the calling agent's session
|
||||
|
||||
```bash
|
||||
# Usage
|
||||
result=$(bash orchestrate.sh <agent> "<task>" [options])
|
||||
|
||||
# Example: synchronous delegation
|
||||
result=$(bash orchestrate.sh webster "Find CTE of Zerodur Class 0 at 20-40°C" --wait --timeout 120)
|
||||
echo "$result" # Structured findings returned to manager's session
|
||||
```
|
||||
|
||||
**Options:**
|
||||
- `--wait` — Block until agent completes (default for orchestrate)
|
||||
- `--timeout <seconds>` — Max wait time (default: 300)
|
||||
- `--retries <N>` — Retry on failure (default: 1, max: 3)
|
||||
- `--format json|text` — Expected response format
|
||||
- `--context <file>` — Attach context file to the task
|
||||
- `--channel-context <channel-id> [--messages N]` — Include recent channel history as context
|
||||
- `--validate` — Run lightweight self-check on agent output before returning
|
||||
- `--no-deliver` — Don't post to Discord (manager will synthesize and post)
|
||||
|
||||
### 1.2 — Report-Back Protocol
|
||||
|
||||
Each agent gets instructions in their SOUL.md to format delegation responses:
|
||||
|
||||
```markdown
|
||||
## When responding to a delegated task:
|
||||
Structure your response as:
|
||||
|
||||
**TASK:** [restate what was asked]
|
||||
**STATUS:** complete | partial | blocked | failed
|
||||
**RESULT:** [your findings/output]
|
||||
**ARTIFACTS:** [any files created, with paths]
|
||||
**CONFIDENCE:** high | medium | low
|
||||
**NOTES:** [caveats, assumptions, open questions]
|
||||
```
|
||||
|
||||
This gives manager structured data to reason about, not just a wall of text.
|
||||
|
||||
### 1.3 — Validation & Self-Check Protocol
|
||||
|
||||
Every delegated response goes through a lightweight validation before the orchestrator accepts it:
|
||||
|
||||
**Self-Check (built into agent SOUL.md instructions):**
|
||||
Each agent, when responding to a delegated task, must verify:
|
||||
- Did I answer all parts of the question?
|
||||
- Did I provide sources/evidence where applicable?
|
||||
- Is my confidence rating honest?
|
||||
|
||||
If the agent's self-check identifies gaps, it sets `STATUS: partial` and explains what's missing in `NOTES`.
|
||||
|
||||
**Orchestrator-Side Validation (in `orchestrate.sh`):**
|
||||
When `--validate` is passed (or for workflow steps with `validation` blocks):
|
||||
1. Check that handoff JSON has all required fields (status, result, confidence)
|
||||
2. If `STATUS: failed` or `STATUS: blocked` → trigger retry (up to `--retries` limit)
|
||||
3. If `STATUS: partial` and confidence is `low` → retry with refined prompt including the partial result
|
||||
4. If retries exhausted → return partial result with warning flag for the orchestrator to decide
|
||||
|
||||
**Full Audit Validation (for high-stakes steps):**
|
||||
Workflow YAML can specify a validation agent (typically auditor) for critical steps:
|
||||
|
||||
```yaml
|
||||
- id: research
|
||||
agent: webster
|
||||
task: "Research materials..."
|
||||
validation:
|
||||
agent: auditor
|
||||
criteria: "Are all requested properties present with credible sources?"
|
||||
on_fail: retry
|
||||
max_retries: 2
|
||||
```
|
||||
|
||||
This runs the auditor on the output before passing it downstream. Prevents garbage-in-garbage-out in critical pipelines.
|
||||
|
||||
### 1.4 — Error Handling (Phase 1 Priority)
|
||||
|
||||
Error handling is not deferred — it ships with the orchestration core:
|
||||
|
||||
**Agent unreachable:**
|
||||
- `orchestrate.sh` checks health endpoint before sending
|
||||
- If agent is down: log error, return immediately with `STATUS: error, reason: agent_unreachable`
|
||||
- Caller (manager or workflow engine) decides whether to retry, skip, or abort
|
||||
|
||||
**Timeout:**
|
||||
- Configurable per call (`--timeout`) and per workflow step
|
||||
- On timeout: kill the polling loop, check if partial handoff exists
|
||||
- If partial result available: return it with `STATUS: timeout_partial`
|
||||
- If no result: return `STATUS: timeout`
|
||||
|
||||
**Malformed response:**
|
||||
- Agent didn't write handoff file or wrote invalid JSON
|
||||
- `orchestrate.sh` validates JSON schema before returning
|
||||
- On malformed: retry once with explicit reminder to write structured output
|
||||
- If still malformed: return raw text with `STATUS: malformed`
|
||||
|
||||
**Retry logic (with idempotency):**
|
||||
```
|
||||
Attempt 1: Generate idempotencyKey={wfRunId}_{stepId}_1 → Send task → wait → check result
|
||||
If timeout → Check if handoff file exists (late arrival). If yes → use it. If no:
|
||||
Attempt 2: idempotencyKey={wfRunId}_{stepId}_2 → Resend with "Previous attempt failed: {reason}. Please retry."
|
||||
If timeout → Same late-arrival check. If no:
|
||||
Attempt 3 (if --retries 3): Same pattern
|
||||
If fail → Return error to caller with all attempt details
|
||||
```
|
||||
**Key rule:** Before every retry, check if the handoff file from the previous attempt landed. Prevents duplicate work when an agent was just slow, not dead.
|
||||
|
||||
### 1.5 — Result Capture Mechanism
|
||||
|
||||
Two options (implement both, prefer A):
|
||||
|
||||
**Option A — File-based handoff:**
|
||||
- Agent writes result to `/home/papa/atomizer/handoffs/{runId}.json`
|
||||
- Orchestrate script polls for file existence
|
||||
- Clean, simple, works with shared filesystem
|
||||
|
||||
```json
|
||||
{
|
||||
"schemaVersion": "1.0",
|
||||
"runId": "hook-delegation-1739587200",
|
||||
"idempotencyKey": "wf-mat-study-001_research_1",
|
||||
"workflowRunId": "wf-mat-study-001",
|
||||
"stepId": "research",
|
||||
"attempt": 1,
|
||||
"agent": "webster",
|
||||
"status": "complete",
|
||||
"result": "Zerodur Class 0 CTE: 0 ± 0.007 ppm/K (20-40°C)...",
|
||||
"artifacts": [],
|
||||
"confidence": "high",
|
||||
"latencyMs": 45200,
|
||||
"timestamp": "2026-02-15T03:00:00Z"
|
||||
}
|
||||
```
|
||||
|
||||
**Required fields:** `schemaVersion`, `runId`, `agent`, `status`, `result`, `confidence`, `timestamp`
|
||||
**Trace fields (required):** `workflowRunId`, `stepId`, `attempt`, `latencyMs`
|
||||
**Idempotency:** `idempotencyKey` = `{workflowRunId}_{stepId}_{attempt}`. Orchestrator checks for existing handoff before retrying — if result exists, skip resend.
|
||||
|
||||
**Option B — Hooks callback:**
|
||||
- Agent calls manager's `/hooks/report` endpoint with result
|
||||
- More real-time but adds complexity
|
||||
- Use for time-sensitive workflows
|
||||
|
||||
### 1.6 — Chaining Example
|
||||
|
||||
```bash
|
||||
# Manager orchestrates a material trade study
|
||||
# Step 1: Research
|
||||
data=$(bash orchestrate.sh webster "Research Clearceram-Z HS vs Zerodur Class 0: CTE, density, cost, lead time" --wait)
|
||||
|
||||
# Step 2: Technical evaluation (pass webster's findings as context)
|
||||
echo "$data" > /tmp/material_data.json
|
||||
assessment=$(bash orchestrate.sh tech-lead "Evaluate these materials for M2/M3 mirrors against our thermal requirements" --context /tmp/material_data.json --wait)
|
||||
|
||||
# Step 3: Audit
|
||||
echo "$assessment" > /tmp/assessment.json
|
||||
audit=$(bash orchestrate.sh auditor "Review this technical assessment for completeness" --context /tmp/assessment.json --wait)
|
||||
|
||||
# Step 4: Manager synthesizes and delivers
|
||||
# (Manager has all three results in-session, reasons about them, posts to Discord)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Layer 2: Smart Routing
|
||||
|
||||
**What it does:** Manager knows each agent's capabilities, strengths, and model. Routes tasks intelligently without hardcoded logic.
|
||||
|
||||
### 2.1 — Agent Capability Registry
|
||||
|
||||
**File:** `/home/papa/atomizer/workspaces/shared/AGENTS_REGISTRY.json`
|
||||
|
||||
```json
|
||||
{
|
||||
"agents": {
|
||||
"tech-lead": {
|
||||
"port": 18804,
|
||||
"model": "anthropic/claude-opus-4-6",
|
||||
"capabilities": [
|
||||
"fea-review",
|
||||
"design-decisions",
|
||||
"technical-analysis",
|
||||
"material-selection",
|
||||
"requirements-validation",
|
||||
"trade-studies"
|
||||
],
|
||||
"strengths": "Deep reasoning, technical judgment, complex analysis",
|
||||
"limitations": "Slow (Opus), expensive tokens — use for high-value decisions",
|
||||
"inputFormat": "Technical problem with context and constraints",
|
||||
"outputFormat": "Structured analysis with recommendations and rationale",
|
||||
"channels": ["#hq", "#technical"]
|
||||
},
|
||||
"webster": {
|
||||
"port": 18828,
|
||||
"model": "google/gemini-2.5-pro",
|
||||
"capabilities": [
|
||||
"web-research",
|
||||
"literature-review",
|
||||
"data-lookup",
|
||||
"supplier-search",
|
||||
"standards-lookup",
|
||||
"competitive-analysis"
|
||||
],
|
||||
"strengths": "Fast research, broad knowledge, cheap tokens, web access",
|
||||
"limitations": "No deep technical judgment — finds data, doesn't evaluate it",
|
||||
"inputFormat": "Natural language query with specifics",
|
||||
"outputFormat": "Structured findings with sources and confidence",
|
||||
"channels": ["#hq", "#research"]
|
||||
},
|
||||
"optimizer": {
|
||||
"port": 18816,
|
||||
"model": "anthropic/claude-sonnet-4-20250514",
|
||||
"capabilities": [
|
||||
"optimization-setup",
|
||||
"parameter-studies",
|
||||
"objective-definition",
|
||||
"constraint-formulation",
|
||||
"result-interpretation",
|
||||
"sensitivity-analysis"
|
||||
],
|
||||
"strengths": "Optimization methodology, mathematical formulation, DOE",
|
||||
"limitations": "Needs clear problem definition — not for open-ended exploration",
|
||||
"inputFormat": "Optimization problem with objectives, variables, constraints",
|
||||
"outputFormat": "Study configuration, parameter definitions, result analysis",
|
||||
"channels": ["#hq", "#optimization"]
|
||||
},
|
||||
"study-builder": {
|
||||
"port": 18820,
|
||||
"model": "anthropic/claude-sonnet-4-20250514",
|
||||
"capabilities": [
|
||||
"study-configuration",
|
||||
"doe-setup",
|
||||
"batch-generation",
|
||||
"parameter-sweeps",
|
||||
"study-templates"
|
||||
],
|
||||
"strengths": "Translating optimization plans into executable study configs",
|
||||
"limitations": "Needs optimizer's plan as input — doesn't design studies independently",
|
||||
"inputFormat": "Study plan from optimizer with parameter ranges",
|
||||
"outputFormat": "Ready-to-run study configuration files",
|
||||
"channels": ["#hq", "#optimization"]
|
||||
},
|
||||
"nx-expert": {
|
||||
"port": 18824,
|
||||
"model": "anthropic/claude-sonnet-4-20250514",
|
||||
"capabilities": [
|
||||
"nx-operations",
|
||||
"mesh-generation",
|
||||
"boundary-conditions",
|
||||
"nastran-setup",
|
||||
"cad-manipulation",
|
||||
"post-processing"
|
||||
],
|
||||
"strengths": "NX/Simcenter expertise, FEA model setup, hands-on CAD/FEM work",
|
||||
"limitations": "Needs clear instructions — not for high-level design decisions",
|
||||
"inputFormat": "Specific NX task with model reference and parameters",
|
||||
"outputFormat": "Completed operation with verification screenshots/data",
|
||||
"channels": ["#hq", "#nx-work"]
|
||||
},
|
||||
"auditor": {
|
||||
"port": 18812,
|
||||
"model": "anthropic/claude-opus-4-6",
|
||||
"capabilities": [
|
||||
"quality-review",
|
||||
"compliance-check",
|
||||
"methodology-audit",
|
||||
"assumption-validation",
|
||||
"report-review",
|
||||
"standards-compliance"
|
||||
],
|
||||
"strengths": "Critical eye, finds gaps and errors, ensures rigor",
|
||||
"limitations": "Reviews work, doesn't create it — needs output from other agents",
|
||||
"inputFormat": "Work product to review with applicable standards/requirements",
|
||||
"outputFormat": "Structured review: findings, severity, recommendations",
|
||||
"channels": ["#hq", "#quality"]
|
||||
},
|
||||
"secretary": {
|
||||
"port": 18808,
|
||||
"model": "google/gemini-2.5-flash",
|
||||
"capabilities": [
|
||||
"meeting-notes",
|
||||
"status-reports",
|
||||
"documentation",
|
||||
"scheduling",
|
||||
"action-tracking",
|
||||
"communication-drafting"
|
||||
],
|
||||
"strengths": "Fast, cheap, good at summarization and admin tasks",
|
||||
"limitations": "Not for technical work — administrative and organizational only",
|
||||
"inputFormat": "Admin task or raw content to organize",
|
||||
"outputFormat": "Clean documentation, summaries, action lists",
|
||||
"channels": ["#hq", "#admin"]
|
||||
},
|
||||
"manager": {
|
||||
"port": 18800,
|
||||
"model": "anthropic/claude-opus-4-6",
|
||||
"capabilities": [
|
||||
"orchestration",
|
||||
"project-planning",
|
||||
"task-decomposition",
|
||||
"priority-management",
|
||||
"stakeholder-communication",
|
||||
"workflow-execution"
|
||||
],
|
||||
"strengths": "Strategic thinking, orchestration, synthesis across agents",
|
||||
"limitations": "Should not do technical work — delegates everything",
|
||||
"inputFormat": "High-level directives from Antoine (CEO)",
|
||||
"outputFormat": "Plans, status updates, synthesized deliverables",
|
||||
"channels": ["#hq"]
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### 2.2 — Manager Routing Logic
|
||||
|
||||
Added to Manager's SOUL.md as a skill directive:
|
||||
|
||||
```markdown
|
||||
## Smart Routing
|
||||
Before delegating, consult `/home/papa/atomizer/workspaces/shared/AGENTS_REGISTRY.json`.
|
||||
- Match task requirements to agent capabilities
|
||||
- Consider model strengths (Opus for reasoning, Gemini for speed, Sonnet for balanced)
|
||||
- For multi-step tasks, plan the full pipeline before starting
|
||||
- Prefer parallel execution when steps are independent
|
||||
- Always specify what you need back (don't let agents guess)
|
||||
```
|
||||
|
||||
### 2.3 — Discord Channel Context Integration
|
||||
|
||||
**How channels feed context into orchestration:**
|
||||
|
||||
Each Discord channel accumulates project-specific conversation history. The orchestration layer can pull this as context:
|
||||
|
||||
```bash
|
||||
# In orchestrate.sh, --channel-context fetches recent messages
|
||||
bash orchestrate.sh tech-lead "Review thermal margins for M2" \
|
||||
--channel-context "#gigabit-m1" --messages 50 \
|
||||
--wait
|
||||
```
|
||||
|
||||
**Implementation:** Use Discord bot API (each instance has a bot token) to fetch channel message history. Format as context block prepended to the task.
|
||||
|
||||
**Channel strategy for Atomizer HQ Discord:**
|
||||
|
||||
| Channel | Purpose | Context Value |
|
||||
|---------|---------|---------------|
|
||||
| `#hq` | Cross-team coordination, announcements | Project-wide decisions |
|
||||
| `#technical` | FEA discussions, design decisions | Technical context for analysis tasks |
|
||||
| `#optimization` | Study configs, results, methodology | Optimization history and patterns |
|
||||
| `#research` | Webster's findings, literature | Reference data for technical work |
|
||||
| `#quality` | Audit findings, compliance notes | Review standards and past issues |
|
||||
| `#nx-work` | CAD/FEM operations, model updates | Model state and recent changes |
|
||||
| `#admin` | Meeting notes, schedules, action items | Project timeline and commitments |
|
||||
| `#handoffs` | Automated orchestration results (bot-only) | Pipeline audit trail |
|
||||
|
||||
**Key insight:** Channels become **persistent, queryable context stores**. Instead of passing massive context blocks between agents, you say "read #technical for the last 20 messages" and the agent absorbs project state naturally.
|
||||
|
||||
**Channel Context Sanitization (security):**
|
||||
Discord history is untrusted input. Before injecting into an agent's context:
|
||||
- Cap at configurable token window (default: last 30 messages, max ~4K tokens)
|
||||
- Strip any system-prompt-like instructions from message content
|
||||
- Tag entire block as `[CHANNEL CONTEXT — untrusted, for reference only]`
|
||||
- Never let channel content override task instructions
|
||||
|
||||
This prevents prompt injection via crafted Discord messages in channel history.
|
||||
|
||||
---
|
||||
|
||||
## Layer 3: Workflow Engine
|
||||
|
||||
**What it does:** Defines reusable multi-step pipelines as YAML. Manager reads and executes them. No coding needed to create new workflows.
|
||||
|
||||
### 3.1 — Workflow Definition Format
|
||||
|
||||
**Location:** `/home/papa/atomizer/workspaces/shared/workflows/`
|
||||
|
||||
```yaml
|
||||
# /home/papa/atomizer/workspaces/shared/workflows/material-trade-study.yaml
|
||||
name: Material Trade Study
|
||||
description: Research, evaluate, and audit material options for optical components
|
||||
trigger: manual # or: keyword, schedule
|
||||
|
||||
inputs:
|
||||
materials:
|
||||
type: list
|
||||
description: "Materials to compare"
|
||||
requirements:
|
||||
type: text
|
||||
description: "Performance requirements and constraints"
|
||||
project_channel:
|
||||
type: channel
|
||||
description: "Project channel for context"
|
||||
|
||||
steps:
|
||||
- id: research
|
||||
agent: webster
|
||||
task: |
|
||||
Research the following materials: {materials}
|
||||
For each material, find: CTE (with temperature range), density, Young's modulus,
|
||||
cost per kg, lead time, availability, and any known issues for optical applications.
|
||||
Provide sources for all data.
|
||||
channel_context: "{project_channel}"
|
||||
channel_messages: 30
|
||||
timeout: 180
|
||||
retries: 2
|
||||
output: material_data
|
||||
validation:
|
||||
agent: auditor
|
||||
criteria: "Are all requested material properties present with credible sources? Flag any missing data."
|
||||
on_fail: retry
|
||||
|
||||
- id: evaluate
|
||||
agent: tech-lead
|
||||
task: |
|
||||
Evaluate these materials against our requirements:
|
||||
|
||||
REQUIREMENTS:
|
||||
{requirements}
|
||||
|
||||
MATERIAL DATA:
|
||||
{material_data}
|
||||
|
||||
Provide a recommendation with full rationale. Include a comparison matrix.
|
||||
depends_on: [research]
|
||||
timeout: 300
|
||||
retries: 1
|
||||
output: technical_assessment
|
||||
|
||||
- id: audit
|
||||
agent: auditor
|
||||
task: |
|
||||
Review this material trade study for completeness, methodological rigor,
|
||||
and potential gaps:
|
||||
|
||||
{technical_assessment}
|
||||
|
||||
Check: Are all requirements addressed? Are sources credible?
|
||||
Are there materials that should have been considered but weren't?
|
||||
depends_on: [evaluate]
|
||||
timeout: 180
|
||||
output: audit_result
|
||||
|
||||
- id: synthesize
|
||||
agent: manager
|
||||
action: synthesize # Manager processes internally, doesn't delegate
|
||||
inputs: [material_data, technical_assessment, audit_result]
|
||||
deliver:
|
||||
channel: "{project_channel}"
|
||||
format: summary # Manager writes a clean summary post
|
||||
|
||||
notifications:
|
||||
on_complete: "#hq"
|
||||
on_failure: "#hq"
|
||||
```
|
||||
|
||||
### 3.2 — More Workflow Templates
|
||||
|
||||
**Design Review:**
|
||||
```yaml
|
||||
name: Design Review
|
||||
steps:
|
||||
- id: prepare
|
||||
agent: secretary
|
||||
task: "Compile design package: gather latest CAD screenshots, analysis results, and requirements from {project_channel}"
|
||||
|
||||
- id: technical_review
|
||||
agent: tech-lead
|
||||
task: "Review design against requirements: {prepare}"
|
||||
depends_on: [prepare]
|
||||
|
||||
- id: optimization_review
|
||||
agent: optimizer
|
||||
task: "Assess optimization potential: {prepare}"
|
||||
depends_on: [prepare]
|
||||
|
||||
# technical_review and optimization_review run in PARALLEL (no dependency between them)
|
||||
|
||||
- id: audit
|
||||
agent: auditor
|
||||
task: "Final review: {technical_review} + {optimization_review}"
|
||||
depends_on: [technical_review, optimization_review]
|
||||
|
||||
- id: deliver
|
||||
agent: secretary
|
||||
task: "Format design review report from: {audit}"
|
||||
depends_on: [audit]
|
||||
deliver:
|
||||
channel: "{project_channel}"
|
||||
```
|
||||
|
||||
**Quick Research:**
|
||||
```yaml
|
||||
name: Quick Research
|
||||
steps:
|
||||
- id: research
|
||||
agent: webster
|
||||
task: "{query}"
|
||||
timeout: 120
|
||||
output: findings
|
||||
|
||||
- id: validate
|
||||
agent: tech-lead
|
||||
task: "Verify these findings are accurate and relevant: {findings}"
|
||||
depends_on: [research]
|
||||
deliver:
|
||||
channel: "{request_channel}"
|
||||
```
|
||||
|
||||
### 3.3 — Workflow Executor
|
||||
|
||||
**File:** `/home/papa/atomizer/workspaces/shared/skills/orchestrate/workflow.sh`
|
||||
|
||||
The manager's orchestration skill reads YAML workflows and executes them:
|
||||
|
||||
```bash
|
||||
# Run a workflow
|
||||
bash workflow.sh material-trade-study \
|
||||
--input materials="Zerodur Class 0, Clearceram-Z HS, ULE" \
|
||||
--input requirements="CTE < 0.01 ppm/K at 22°C, aperture 250mm" \
|
||||
--input project_channel="#gigabit-m1"
|
||||
```
|
||||
|
||||
**Executor logic:**
|
||||
1. Parse YAML workflow definition
|
||||
2. Resolve dependencies → build execution graph
|
||||
3. Execute steps in order (parallel when no dependencies)
|
||||
4. For each step: call `orchestrate.sh` with task + resolved inputs
|
||||
5. Store results in `/home/papa/atomizer/handoffs/workflows/{workflow-run-id}/`
|
||||
6. On completion: deliver final output to specified channel
|
||||
7. On failure: notify `#hq` with error details and partial results
|
||||
|
||||
---
|
||||
|
||||
## Implementation Plan
|
||||
|
||||
### Phase 1: Orchestration Core + Validation + Error Handling (Day 1 — Feb 15) ✅ COMPLETE
|
||||
**Actual effort: ~6 hours**
|
||||
|
||||
- [x] **1.1** Created `/home/papa/atomizer/workspaces/shared/skills/orchestrate/` directory
|
||||
- [x] **1.2** Built `orchestrate.py` (Python, not bash) — synchronous delegation with inotify-based waiting
|
||||
- Send via `/hooks/agent` (existing)
|
||||
- inotify watches handoff directory for result file
|
||||
- Timeout handling (configurable per call, `--timeout`)
|
||||
- Retry logic (`--retries N`, max 3, with error context)
|
||||
- Returns structured JSON result to caller
|
||||
- Thin bash wrapper: `orchestrate.sh`
|
||||
- [x] **1.3** Created `/home/papa/atomizer/handoffs/` directory for result passing
|
||||
- [x] **1.4** Updated all 8 agent SOUL.md files with:
|
||||
- Structured response format for delegated tasks (JSON handoff protocol)
|
||||
- Self-check protocol (verify completeness before submitting)
|
||||
- Write result to `/home/papa/atomizer/handoffs/{runId}.json` on completion
|
||||
- [x] **1.5** Implemented error handling in `orchestrate.py`
|
||||
- Health check before sending (agent health endpoint)
|
||||
- Timeout with partial result recovery
|
||||
- Malformed response detection and retry
|
||||
- Idempotency check before retry (check if handoff file landed late)
|
||||
- All errors logged to `/home/papa/atomizer/logs/orchestration/`
|
||||
- [x] **1.6** Implemented trace logging in handoff files
|
||||
- Required fields validated: `schemaVersion`, `runId`, `agent`, `status`, `result`, `confidence`, `timestamp`
|
||||
- Unified JSONL logging with trace fields
|
||||
- [x] **1.7** Implemented `--validate` flag for strict orchestrator-side output validation
|
||||
- [x] **1.8** Deployed `orchestrate` skill to Manager (SOUL.md + TOOLS.md updated)
|
||||
- [x] **1.9** Test: Manager → Webster smoke tests passed (18-49s response times, 12 successful handoffs)
|
||||
- Chain test (Webster → Tech-Lead): Webster completed, Tech-Lead returned `partial` due to missing context passthrough — engine bug, not protocol bug
|
||||
- [x] **1.10** Test: ACL enforcement works (deny/allow), strict validation works
|
||||
- [x] **1.11** `delegate.sh` kept as fallback for fire-and-forget use cases
|
||||
|
||||
**Key implementation decisions:**
|
||||
- Python (`orchestrate.py`) over bash for all logic — better JSON handling, inotify support, error handling
|
||||
- `inotify_simple` for instant file detection (no polling)
|
||||
- Session key format: `hook:orchestrate:{run_id}:{attempt}`
|
||||
- ACL matrix hardcoded: Manager → all; Tech-Lead → webster/nx-expert/study-builder/secretary; Optimizer → webster/study-builder/secretary
|
||||
|
||||
**Known issues to fix in Phase 2:**
|
||||
- Chain context passthrough: when chaining A→B→C, B's result must be explicitly injected into C's task
|
||||
- Webster's Brave API key intermittently fails (recovered on retry)
|
||||
- Manager Discord WebSocket reconnect loop (code 1005) — doesn't affect orchestration but blocks channel posting
|
||||
|
||||
### Phase 2: Smart Routing + Channel Context + Hierarchical Delegation (Day 1-2 — Feb 15-16)
|
||||
**Estimated effort: 4-5 hours**
|
||||
|
||||
- [x] **2.1** Create `AGENTS_REGISTRY.json` in shared workspace *(completed 2026-02-15 — channel context fetcher built, hierarchical delegation deployed to Tech-Lead + Optimizer, ACL tested, all tests pass)*
|
||||
- [x] **2.2** Update Manager's SOUL.md with routing instructions *(completed 2026-02-15 — channel context fetcher built, hierarchical delegation deployed to Tech-Lead + Optimizer, ACL tested, all tests pass)*
|
||||
- [x] **2.3** Build channel context fetcher (`fetch-channel-context.sh`) *(completed 2026-02-15 — channel context fetcher built, hierarchical delegation deployed to Tech-Lead + Optimizer, ACL tested, all tests pass)*
|
||||
- Uses Discord bot token to pull recent messages
|
||||
- Formats as markdown context block
|
||||
- Integrates with `orchestrate.sh` via `--channel-context` flag
|
||||
- [x] **2.4** Set up Discord channels per the channel strategy table *(completed 2026-02-15 — channel context fetcher built, hierarchical delegation deployed to Tech-Lead + Optimizer, ACL tested, all tests pass)*
|
||||
- [x] **2.5** Implement hierarchical delegation *(completed 2026-02-15 — channel context fetcher built, hierarchical delegation deployed to Tech-Lead + Optimizer, ACL tested, all tests pass)*
|
||||
- Deploy `orchestrate` skill to Tech-Lead and Optimizer
|
||||
- Add sub-orchestration rules to their SOUL.md (can delegate to: Webster, Study-Builder, NX-Expert, Secretary)
|
||||
- Cannot delegate to: Manager, Auditor, each other (prevents loops)
|
||||
- All sub-delegations logged to `/home/papa/atomizer/handoffs/sub/` for Manager visibility
|
||||
- [x] **2.6** Enforce delegation ACL matrix in `orchestrate.sh` runtime *(completed 2026-02-15 — channel context fetcher built, hierarchical delegation deployed to Tech-Lead + Optimizer, ACL tested, all tests pass)*
|
||||
- Hardcoded check: caller + target validated against allowed pairs
|
||||
- Manager → can delegate to all agents
|
||||
- Tech-Lead → can delegate to: Webster, NX-Expert, Study-Builder, Secretary
|
||||
- Optimizer → can delegate to: Webster, Study-Builder, Secretary
|
||||
- All others → cannot sub-delegate (must go through Manager)
|
||||
- Block self-delegation and circular paths at runtime (not just SOUL.md policy)
|
||||
- [x] **2.7** Implement channel context sanitization *(completed 2026-02-15 — channel context fetcher built, hierarchical delegation deployed to Tech-Lead + Optimizer, ACL tested, all tests pass)*
|
||||
- Cap token window, strip system-like instructions, tag as untrusted
|
||||
- [x] **2.8** Test: Manager auto-routes a task based on registry + includes channel context *(completed 2026-02-15 — channel context fetcher built, hierarchical delegation deployed to Tech-Lead + Optimizer, ACL tested, all tests pass)*
|
||||
- [x] **2.9** Test: Tech-Lead delegates a data lookup to Webster mid-analysis *(completed 2026-02-15 — channel context fetcher built, hierarchical delegation deployed to Tech-Lead + Optimizer, ACL tested, all tests pass)*
|
||||
- [x] **2.10** Test: Auditor tries to sub-delegate → blocked by ACL *(completed 2026-02-15 — channel context fetcher built, hierarchical delegation deployed to Tech-Lead + Optimizer, ACL tested, all tests pass)*
|
||||
|
||||
### Phase 3: Workflow Engine (Day 2-3 — Feb 16-17)
|
||||
**Estimated effort: 6-8 hours**
|
||||
|
||||
- [x] **3.1** Build YAML workflow parser (Python script)
|
||||
- Implemented in `workflow.py` with name/path resolution from `/home/papa/atomizer/workspaces/shared/workflows/`, schema checks, step-ID validation, dependency validation, and cycle detection.
|
||||
- [x] **3.2** Build workflow executor (`workflow.sh`)
|
||||
- Dependency resolution
|
||||
- Parallel step execution
|
||||
- Variable substitution
|
||||
- Error handling and partial results
|
||||
- Implemented executor in `workflow.py` with `ThreadPoolExecutor`, dependency-aware scheduling, step-level `on_fail` handling (`skip`/`abort`), overall timeout enforcement, approval gates, and JSON summary output.
|
||||
- Added thin wrapper `workflow.sh`.
|
||||
- [x] **3.3** Create initial workflow templates:
|
||||
- `material-trade-study.yaml`
|
||||
- `design-review.yaml`
|
||||
- `quick-research.yaml`
|
||||
- [x] **3.4** Deploy workflow skill to Manager
|
||||
- Updated Manager `SOUL.md` with a dedicated "Running Workflows" section and command example.
|
||||
- Updated Manager `TOOLS.md` with `workflow.py`/`workflow.sh` references and usage.
|
||||
- [x] **3.5** Implement approval gates in workflow YAML
|
||||
- `workflow.py` now supports `approval_gate` prompts (`yes`/`no`) before step execution.
|
||||
- In `--non-interactive` mode, approval gates are skipped with warnings.
|
||||
- [x] **3.6** Add workflow dry-run mode (`--dry-run`)
|
||||
- Validates dependency graph and variable substitutions without executing
|
||||
- Reports: step metadata, dependency-based execution layers, and run output directory
|
||||
- Implemented dry-run planning output including step metadata, dependency layers, and run result directory.
|
||||
- [x] **3.7** Test: Run full material trade study workflow end-to-end
|
||||
- quick-research workflow tested E2E twice — Webster→Tech-Lead chain, 50s and 149s runs, Manager posted results to Discord
|
||||
- [x] **3.8** Create `#handoffs` channel for orchestration audit trail
|
||||
- Skipped — using workflow result directories instead of dedicated #handoffs channel
|
||||
|
||||
|
||||
**Phase 3 completion notes:**
|
||||
- `workflow.py`: 15KB Python, supports YAML parsing, dependency graphs, parallel execution (`ThreadPoolExecutor`), variable substitution, approval gates, dry-run, per-step result persistence
|
||||
- 3 workflow templates: `material-trade-study`, `quick-research`, `design-review`
|
||||
- `design-review` dry-run confirmed parallel execution detection (tech-lead + optimizer simultaneous)
|
||||
- Manager successfully ran workflow from Discord prompt, parsed JSON output, and posted synthesized results
|
||||
- Known issue fixed: Manager initially did not post results back — added explicit "Always Post Results Back" instructions to SOUL.md
|
||||
|
||||
### Phase 4: Metrics + Documentation (Day 3 — Feb 17)
|
||||
**Estimated effort: 2-3 hours**
|
||||
|
||||
- [x] **4.1** Metrics: track delegation count, success rate, avg response time per agent
|
||||
- Implemented `metrics.py` to analyze handoff JSON and workflow summaries; supports JSON/text output with per-agent latency and success stats
|
||||
- [x] **4.2** Per-workflow token usage tracking across all agents
|
||||
- Added `metrics.sh` wrapper for easy execution from orchestrate skill directory
|
||||
- [x] **4.3** Document everything in this PKM project folder
|
||||
- Added Manager `TOOLS.md` reference for metrics usage under Agent Communication
|
||||
- [x] **4.4** Create orchestration documentation README
|
||||
- Created `/home/papa/atomizer/workspaces/shared/skills/orchestrate/README.md` with architecture, usage, ACL, workflows, and storage docs
|
||||
|
||||
---
|
||||
|
||||
## Context Flow Diagram
|
||||
|
||||
```
|
||||
Antoine (CEO)
|
||||
│
|
||||
▼
|
||||
┌─────────────┐
|
||||
│ MANAGER │ ◄── Reads AGENTS_REGISTRY.json
|
||||
│ (Opus 4.6) │ ◄── Reads workflow YAML
|
||||
└──────┬──────┘ ◄── Validates results
|
||||
│
|
||||
┌─────────────┼─────────────┐
|
||||
▼ ▼ ▼
|
||||
┌────────────┐ ┌──────────┐ ┌──────────┐
|
||||
│ TECH-LEAD │ │ AUDITOR │ │OPTIMIZER │
|
||||
│ (Opus) │ │ (Opus) │ │ (Sonnet) │
|
||||
│ [can sub- │ └──────────┘ │ [can sub-│
|
||||
│ delegate] │ │ delegate]│
|
||||
└─────┬──────┘ └─────┬─────┘
|
||||
│ sub-orchestration │
|
||||
┌────┴─────┐ ┌──────┴──────┐
|
||||
▼ ▼ ▼ ▼
|
||||
┌────────┐┌────────┐ ┌───────────┐┌──────────┐
|
||||
│WEBSTER ││NX-EXPERT│ │STUDY-BLDR ││SECRETARY │
|
||||
│(Gemini)││(Sonnet) │ │ (Sonnet) ││ (Flash) │
|
||||
└───┬────┘└───┬─────┘ └─────┬─────┘└────┬─────┘
|
||||
│ │ │ │
|
||||
▼ ▼ ▼ ▼
|
||||
┌──────────────────────────────────────────────┐
|
||||
│ HANDOFF DIRECTORY │
|
||||
│ /home/papa/atomizer/handoffs/ │
|
||||
│ {runId}.json — structured results │
|
||||
│ /sub/ — sub-delegation logs (visibility) │
|
||||
└──────────────────────────────────────────────┘
|
||||
│ │ │ │
|
||||
└────┬────┘──────┬───────┘────┬───────┘
|
||||
▼ ▼ ▼
|
||||
┌────────────┐ ┌──────────┐ ┌─────────────────┐
|
||||
│ DISCORD │ │VALIDATION│ │ SHARED FILES │
|
||||
│ CHANNELS │ │ LOOPS │ │ (Atomizer repo │
|
||||
│ (context) │ │(self-chk │ │ PKM, configs) │
|
||||
└────────────┘ │+ auditor)│ └─────────────────┘
|
||||
└──────────┘
|
||||
|
||||
CONTEXT SOURCES (per delegation):
|
||||
1. Task context → Orchestrator passes explicitly
|
||||
2. Channel context → Fetched from Discord history
|
||||
3. Handoff context → Results from prior pipeline steps
|
||||
4. Knowledge context → Shared filesystem (always available)
|
||||
|
||||
VALIDATION FLOW:
|
||||
Agent output → Self-check → Orchestrator validation → [Auditor review if critical] → Accept/Retry
|
||||
|
||||
HIERARCHY:
|
||||
Manager → delegates to all agents
|
||||
Tech-Lead, Optimizer → sub-delegate to Webster, NX-Expert, Study-Builder, Secretary
|
||||
All sub-delegations logged for Manager visibility
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Comparison: Before vs After
|
||||
|
||||
| Aspect | Before (delegate.sh) | After (Orchestration Engine) |
|
||||
|--------|----------------------|------------------------------|
|
||||
| Delegation | Fire-and-forget | Synchronous with result return |
|
||||
| Result flow | None — check Discord manually | Structured JSON via handoff files |
|
||||
| Chaining | Impossible | Native — output feeds next step |
|
||||
| Parallel work | Manual — delegate multiple, hope | Workflow engine handles automatically |
|
||||
| Context passing | None | Task + channel + handoff + filesystem |
|
||||
| Routing | Hardcoded agent names | Capability-based via registry |
|
||||
| Reusability | One-off bash calls | YAML workflow templates |
|
||||
| Audit trail | Discord messages only | Handoff logs + orchestration logs |
|
||||
| Validation | None | Self-check + auditor loops on critical steps |
|
||||
| Error handling | None | Timeout, retry, partial results (Phase 1) |
|
||||
| Hierarchy | Flat (manager only) | Hierarchical (Tech-Lead/Optimizer can sub-delegate) |
|
||||
| Adding agents | Edit bash script | Add entry to registry JSON |
|
||||
|
||||
---
|
||||
|
||||
## Future Extensions (Post-MVP)
|
||||
|
||||
- **Conditional branching:** If auditor flags issues → route back to tech-lead for revision
|
||||
- **Human-in-the-loop gates:** Workflow pauses for Antoine's approval at critical steps
|
||||
- **Learning loops:** Store workflow results → agents learn from past runs
|
||||
- **Cost tracking:** Per-workflow token usage across all agents
|
||||
- **Web UI dashboard:** Visualize active workflows, agent status, handoff queue
|
||||
- **Inter-company workflows:** External client triggers → full analysis pipeline → deliverable
|
||||
|
||||
---
|
||||
|
||||
## Key Design Decisions
|
||||
|
||||
1. **File-based handoffs over HTTP callbacks** — Simpler, debuggable, works with shared filesystem we already have. HTTP callbacks are Phase 2 optimization if needed.
|
||||
|
||||
2. **Manager as primary orchestrator, with hierarchical delegation (Phase 2)** — Manager runs workflows and chains tasks. In Phase 2, senior agents (Tech-Lead, Optimizer) gain sub-orchestration rights to delegate directly to supporting agents (e.g., Tech-Lead → Webster for a data lookup mid-analysis) without routing through Manager. All sub-delegations are logged to the handoff directory so Manager retains visibility. No circular delegation — hierarchy is strict.
|
||||
|
||||
3. **YAML workflows over hardcoded scripts** — Workflows are data, not code. Antoine can define new ones. Manager can read and execute them. Future: manager could even *generate* workflows from natural language directives.
|
||||
|
||||
4. **Channel context is opt-in per step** — Not every step needs channel history. Explicit `channel_context` parameter keeps token usage efficient.
|
||||
|
||||
5. **Preserve fire-and-forget option** — `delegate.sh` stays for simple one-off tasks where you don't need the result back. `orchestrate.sh` is for pipeline work.
|
||||
|
||||
---
|
||||
|
||||
---
|
||||
|
||||
## Review Amendments (2026-02-15)
|
||||
|
||||
**Source:** Webster's review (`reviews/REVIEW-Orchestration-Engine-Webster.md`)
|
||||
|
||||
| Webster's Recommendation | Decision | Where |
|
||||
|---|---|---|
|
||||
| Hierarchical delegation | ✅ Adopted — Phase 2 | Tech-Lead + Optimizer get sub-orchestration rights |
|
||||
| Validation/critic loops | ✅ Adopted — Phase 1 | Self-check in agents + `--validate` flag + auditor validation blocks in YAML |
|
||||
| Error handling in Phase 1 | ✅ Adopted — Phase 1 | Timeouts, retries, health checks, malformed response handling |
|
||||
| Shared blackboard state | ⏳ Deferred | Not needed until workflows exceed 5+ steps. File-based handoffs sufficient for now |
|
||||
| Role-based dynamic routing | ⏳ Deferred | Only one agent per role currently. Revisit when we scale to redundant agents |
|
||||
| AutoGen group chat pattern | 📝 Noted | Interesting for brainstorming workflows. Not MVP priority |
|
||||
| LangGraph state graphs | 📝 Noted | YAML with `on_fail: goto` covers our needs without importing a paradigm |
|
||||
|
||||
**Source:** Auditor's review (`reviews/REVIEW-Orchestration-Engine-Auditor-V2.md`)
|
||||
|
||||
| Auditor's Recommendation | Decision | Where |
|
||||
|---|---|---|
|
||||
| Idempotency keys | ✅ Adopted — Phase 1 | `idempotencyKey` in handoff schema + existence check before retry |
|
||||
| Handoff schema versioning | ✅ Adopted — Phase 1 | `schemaVersion: "1.0"` + required fields validation in `orchestrate.sh` |
|
||||
| Approval gates | ✅ Adopted — Phase 3 | `approval_gate: ceo` in workflow YAML, posts to `#hq` and waits |
|
||||
| Per-run state blackboard | ⏳ Deferred | Same as Webster's — file handoffs sufficient for 3-5 step workflows |
|
||||
| Trace logging / observability | ✅ Adopted — Phase 1 | `workflowRunId`, `stepId`, `attempt`, `latencyMs` in every handoff |
|
||||
| Channel context sanitization | ✅ Adopted — Phase 2 | Token cap, instruction stripping, untrusted tagging |
|
||||
| ACL enforcement (runtime) | ✅ Adopted — Phase 2 | Hardcoded delegation matrix in `orchestrate.sh`, not just SOUL.md policy |
|
||||
| Quality score (0-1) | ⏳ Deferred | Nice-to-have for dashboards, not MVP |
|
||||
| Artifact checksums | ⏳ Deferred | Reproducibility concern — revisit for client deliverables |
|
||||
| Workflow dry-run mode | ✅ Adopted — Phase 3 | Validate dependency graph + substitutions without execution |
|
||||
|
||||
---
|
||||
|
||||
> **Next step:** Implementation begins 2026-02-15. Start with Phase 1 (orchestrate.sh + handoff directory + agent SOUL.md updates). Test with a simple Webster → Tech-Lead chain before building the full workflow engine.
|
||||
Reference in New Issue
Block a user