10 — Orchestration Engine: Multi-Instance Intelligence
Status: Phases 1-3 Complete — Phase 4 (Metrics + Docs) in progress Author: Mario Lavoie (with Antoine) Date: 2026-02-15 Revised: 2026-02-15 — Incorporated Webster's review (validation loops, error handling, hierarchical delegation)
Problem Statement
The Atomizer HQ cluster runs 8 independent OpenClaw instances (one per agent). This gives us true parallelism, specialized contexts, and independent Discord identities — but we lost the orchestration primitives that make a single OpenClaw instance powerful:
- `sessions_spawn` — synchronous delegation with result return
- `sessions_history` — cross-session context reading
- `sessions_send` — bidirectional inter-session messaging
The current delegate.sh is fire-and-forget. Manager throws a task over the wall and hopes. No result flows back. No chaining. No intelligent multi-step workflows.
Goal: Rebuild OpenClaw's orchestration power at the inter-instance level, enhanced with Discord channel context and a capability registry.
Architecture Overview
Three layers, each building on the last:
┌─────────────────────────────────────────────────────┐
│ LAYER 3: WORKFLOWS │
│ YAML-defined multi-step pipelines │
│ (sequential, parallel, conditional branching) │
├─────────────────────────────────────────────────────┤
│ LAYER 2: SMART ROUTING │
│ Capability registry + channel context │
│ (manager knows who can do what + project state) │
├─────────────────────────────────────────────────────┤
│ LAYER 1: ORCHESTRATION CORE │
│ Synchronous delegation + result return protocol │
│ (replaces fire-and-forget delegate.sh) │
├─────────────────────────────────────────────────────┤
│ EXISTING INFRASTRUCTURE │
│ 8 OpenClaw instances, hooks API, shared filesystem│
└─────────────────────────────────────────────────────┘
Layer 1: Orchestration Core
What it does: Replaces delegate.sh with synchronous delegation. Manager sends a task, waits for the result, gets structured output back. Can then chain to the next agent.
1.1 — The Orchestrate Script
File: /home/papa/atomizer/workspaces/shared/skills/orchestrate/orchestrate.sh
Behavior:
- Send task to target agent via `/hooks/agent` (existing mechanism)
- Poll the agent's session for completion via `/hooks/status/{runId}` or the `/sessions` API
- Capture the agent's response (structured output)
- Return it to the calling agent's session
# Usage
result=$(bash orchestrate.sh <agent> "<task>" [options])
# Example: synchronous delegation
result=$(bash orchestrate.sh webster "Find CTE of Zerodur Class 0 at 20-40°C" --wait --timeout 120)
echo "$result" # Structured findings returned to manager's session
Options:
- `--wait` — Block until the agent completes (default for orchestrate)
- `--timeout <seconds>` — Max wait time (default: 300)
- `--retries <N>` — Retry on failure (default: 1, max: 3)
- `--format json|text` — Expected response format
- `--context <file>` — Attach a context file to the task
- `--channel-context <channel-id> [--messages N]` — Include recent channel history as context
- `--validate` — Run a lightweight self-check on agent output before returning
- `--no-deliver` — Don't post to Discord (manager will synthesize and post)
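The wait-and-return behavior can be sketched in a few lines of Python. This is an illustration only: the shipped `orchestrate.py` uses inotify rather than polling, and task dispatch via `/hooks/agent` is out of scope here, so the handoff directory path is a stand-in.

```python
import json
import time
from pathlib import Path

HANDOFF_DIR = Path("/tmp/handoffs")  # stand-in for /home/papa/atomizer/handoffs/


def orchestrate(run_id: str, timeout: float = 300.0, poll_interval: float = 1.0) -> dict:
    """Wait for the agent's handoff file and return the parsed result.

    Sketch of the synchronous core: block until {runId}.json appears,
    or return a structured timeout status once the deadline passes.
    """
    handoff = HANDOFF_DIR / f"{run_id}.json"
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if handoff.exists():
            return json.loads(handoff.read_text())
        time.sleep(poll_interval)
    return {"runId": run_id, "status": "timeout", "result": None}
```

The key contrast with `delegate.sh`: the caller gets a dict back in-session instead of hoping the result shows up on Discord.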
1.2 — Report-Back Protocol
Each agent gets instructions in their SOUL.md to format delegation responses:
## When responding to a delegated task:
Structure your response as:
**TASK:** [restate what was asked]
**STATUS:** complete | partial | blocked | failed
**RESULT:** [your findings/output]
**ARTIFACTS:** [any files created, with paths]
**CONFIDENCE:** high | medium | low
**NOTES:** [caveats, assumptions, open questions]
This gives manager structured data to reason about, not just a wall of text.
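A hypothetical parser shows how the manager could turn that template into a dict. The field names come from the SOUL.md instructions above; the function name and regex approach are illustrative, not the shipped code.

```python
import re

FIELDS = ("TASK", "STATUS", "RESULT", "ARTIFACTS", "CONFIDENCE", "NOTES")


def parse_report(text: str) -> dict:
    """Split a delegated-task response into its labeled sections.

    Assumes the **FIELD:** convention from the report-back protocol;
    a field the agent omitted simply maps to an empty string.
    """
    pattern = re.compile(r"\*\*(%s):\*\*" % "|".join(FIELDS))
    parts = pattern.split(text)  # [preamble, field, body, field, body, ...]
    out = {f: "" for f in FIELDS}
    for field, body in zip(parts[1::2], parts[2::2]):
        out[field] = body.strip()
    return out
```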
1.3 — Validation & Self-Check Protocol
Every delegated response goes through a lightweight validation before the orchestrator accepts it:
Self-Check (built into agent SOUL.md instructions): Each agent, when responding to a delegated task, must verify:
- Did I answer all parts of the question?
- Did I provide sources/evidence where applicable?
- Is my confidence rating honest?
If the agent's self-check identifies gaps, it sets STATUS: partial and explains what's missing in NOTES.
Orchestrator-Side Validation (in orchestrate.sh):
When --validate is passed (or for workflow steps with validation blocks):
- Check that handoff JSON has all required fields (status, result, confidence)
- If `STATUS: failed` or `STATUS: blocked` → trigger retry (up to the `--retries` limit)
- If `STATUS: partial` and confidence is `low` → retry with a refined prompt that includes the partial result
- If retries are exhausted → return the partial result with a warning flag for the orchestrator to decide
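The accept/retry decision above can be expressed as a small pure function. This is a sketch of the rules as stated, not the shipped validator; field names follow the handoff JSON schema in section 1.5.

```python
REQUIRED = ("status", "result", "confidence")


def validate_handoff(handoff: dict, attempt: int, max_retries: int) -> str:
    """Return 'accept', 'retry', or 'accept_with_warning' for a handoff.

    Mirrors the orchestrator-side validation rules: missing required
    fields, failed/blocked status, or a low-confidence partial all
    trigger a retry until the retry budget is spent.
    """
    def retry_or_warn():
        return "retry" if attempt < max_retries else "accept_with_warning"

    if any(k not in handoff for k in REQUIRED):
        return retry_or_warn()
    if handoff["status"] in ("failed", "blocked"):
        return retry_or_warn()
    if handoff["status"] == "partial" and handoff["confidence"] == "low":
        return retry_or_warn()
    return "accept"
```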
Full Audit Validation (for high-stakes steps): Workflow YAML can specify a validation agent (typically auditor) for critical steps:
- id: research
agent: webster
task: "Research materials..."
validation:
agent: auditor
criteria: "Are all requested properties present with credible sources?"
on_fail: retry
max_retries: 2
This runs the auditor on the output before passing it downstream. Prevents garbage-in-garbage-out in critical pipelines.
1.4 — Error Handling (Phase 1 Priority)
Error handling is not deferred — it ships with the orchestration core:
Agent unreachable:
- `orchestrate.sh` checks the health endpoint before sending
- If the agent is down: log the error and return immediately with `STATUS: error`, reason `agent_unreachable`
- The caller (manager or workflow engine) decides whether to retry, skip, or abort
Timeout:
- Configurable per call (`--timeout`) and per workflow step
- On timeout: kill the polling loop and check whether a partial handoff exists
- If a partial result is available: return it with `STATUS: timeout_partial`
- If no result: return `STATUS: timeout`
Malformed response:
- The agent didn't write a handoff file, or wrote invalid JSON
- `orchestrate.sh` validates the JSON schema before returning
- On malformed output: retry once with an explicit reminder to write structured output
- If still malformed: return the raw text with `STATUS: malformed`
Retry logic (with idempotency):
Attempt 1: Generate idempotencyKey={wfRunId}_{stepId}_1 → Send task → wait → check result
If timeout → Check if handoff file exists (late arrival). If yes → use it. If no:
Attempt 2: idempotencyKey={wfRunId}_{stepId}_2 → Resend with "Previous attempt failed: {reason}. Please retry."
If timeout → Same late-arrival check. If no:
Attempt 3 (if --retries 3): Same pattern
If fail → Return error to caller with all attempt details
Key rule: Before every retry, check if the handoff file from the previous attempt landed. Prevents duplicate work when an agent was just slow, not dead.
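That retry pattern, including the late-arrival check, looks roughly like this in Python. `send` and `wait` are hypothetical stand-ins for the `/hooks/agent` dispatch and the inotify wait, and naming handoff files by idempotency key is an assumption made for the sketch.

```python
import json
from pathlib import Path


def run_with_retries(wf_run_id, step_id, send, wait, handoff_dir, max_attempts=3):
    """Retry loop with the late-arrival check.

    Before every resend, look for a handoff file from a previous attempt
    so a slow (not dead) agent's work is used instead of duplicated.
    `send(key, note)` dispatches the task; `wait(key)` returns the parsed
    result or None on timeout.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        key = f"{wf_run_id}_{step_id}_{attempt}"
        note = f"Previous attempt failed: {last_error}. Please retry." if last_error else ""
        # Late-arrival check: did an earlier attempt's handoff land after all?
        for prior in range(1, attempt):
            late = Path(handoff_dir) / f"{wf_run_id}_{step_id}_{prior}.json"
            if late.exists():
                return json.loads(late.read_text())
        send(key, note)
        result = wait(key)
        if result is not None:
            return result
        last_error = "timeout"
    return {"status": "timeout", "attempts": max_attempts}
```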
1.5 — Result Capture Mechanism
Two options (implement both, prefer A):
Option A — File-based handoff:
- The agent writes its result to `/home/papa/atomizer/handoffs/{runId}.json`
- The orchestrate script polls for file existence
- Clean, simple, and works with the shared filesystem
{
"schemaVersion": "1.0",
"runId": "hook-delegation-1739587200",
"idempotencyKey": "wf-mat-study-001_research_1",
"workflowRunId": "wf-mat-study-001",
"stepId": "research",
"attempt": 1,
"agent": "webster",
"status": "complete",
"result": "Zerodur Class 0 CTE: 0 ± 0.007 ppm/K (20-40°C)...",
"artifacts": [],
"confidence": "high",
"latencyMs": 45200,
"timestamp": "2026-02-15T03:00:00Z"
}
Required fields: schemaVersion, runId, agent, status, result, confidence, timestamp
Trace fields (required): workflowRunId, stepId, attempt, latencyMs
Idempotency: idempotencyKey = {workflowRunId}_{stepId}_{attempt}. Orchestrator checks for existing handoff before retrying — if result exists, skip resend.
Option B — Hooks callback:
- The agent calls the manager's `/hooks/report` endpoint with the result
- More real-time, but adds complexity
- Use for time-sensitive workflows
1.6 — Chaining Example
# Manager orchestrates a material trade study
# Step 1: Research
data=$(bash orchestrate.sh webster "Research Clearceram-Z HS vs Zerodur Class 0: CTE, density, cost, lead time" --wait)
# Step 2: Technical evaluation (pass webster's findings as context)
echo "$data" > /tmp/material_data.json
assessment=$(bash orchestrate.sh tech-lead "Evaluate these materials for M2/M3 mirrors against our thermal requirements" --context /tmp/material_data.json --wait)
# Step 3: Audit
echo "$assessment" > /tmp/assessment.json
audit=$(bash orchestrate.sh auditor "Review this technical assessment for completeness" --context /tmp/assessment.json --wait)
# Step 4: Manager synthesizes and delivers
# (Manager has all three results in-session, reasons about them, posts to Discord)
Layer 2: Smart Routing
What it does: Manager knows each agent's capabilities, strengths, and model. Routes tasks intelligently without hardcoded logic.
2.1 — Agent Capability Registry
File: /home/papa/atomizer/workspaces/shared/AGENTS_REGISTRY.json
{
"agents": {
"tech-lead": {
"port": 18804,
"model": "anthropic/claude-opus-4-6",
"capabilities": [
"fea-review",
"design-decisions",
"technical-analysis",
"material-selection",
"requirements-validation",
"trade-studies"
],
"strengths": "Deep reasoning, technical judgment, complex analysis",
"limitations": "Slow (Opus), expensive tokens — use for high-value decisions",
"inputFormat": "Technical problem with context and constraints",
"outputFormat": "Structured analysis with recommendations and rationale",
"channels": ["#hq", "#technical"]
},
"webster": {
"port": 18828,
"model": "google/gemini-2.5-pro",
"capabilities": [
"web-research",
"literature-review",
"data-lookup",
"supplier-search",
"standards-lookup",
"competitive-analysis"
],
"strengths": "Fast research, broad knowledge, cheap tokens, web access",
"limitations": "No deep technical judgment — finds data, doesn't evaluate it",
"inputFormat": "Natural language query with specifics",
"outputFormat": "Structured findings with sources and confidence",
"channels": ["#hq", "#research"]
},
"optimizer": {
"port": 18816,
"model": "anthropic/claude-sonnet-4-20250514",
"capabilities": [
"optimization-setup",
"parameter-studies",
"objective-definition",
"constraint-formulation",
"result-interpretation",
"sensitivity-analysis"
],
"strengths": "Optimization methodology, mathematical formulation, DOE",
"limitations": "Needs clear problem definition — not for open-ended exploration",
"inputFormat": "Optimization problem with objectives, variables, constraints",
"outputFormat": "Study configuration, parameter definitions, result analysis",
"channels": ["#hq", "#optimization"]
},
"study-builder": {
"port": 18820,
"model": "anthropic/claude-sonnet-4-20250514",
"capabilities": [
"study-configuration",
"doe-setup",
"batch-generation",
"parameter-sweeps",
"study-templates"
],
"strengths": "Translating optimization plans into executable study configs",
"limitations": "Needs optimizer's plan as input — doesn't design studies independently",
"inputFormat": "Study plan from optimizer with parameter ranges",
"outputFormat": "Ready-to-run study configuration files",
"channels": ["#hq", "#optimization"]
},
"nx-expert": {
"port": 18824,
"model": "anthropic/claude-sonnet-4-20250514",
"capabilities": [
"nx-operations",
"mesh-generation",
"boundary-conditions",
"nastran-setup",
"cad-manipulation",
"post-processing"
],
"strengths": "NX/Simcenter expertise, FEA model setup, hands-on CAD/FEM work",
"limitations": "Needs clear instructions — not for high-level design decisions",
"inputFormat": "Specific NX task with model reference and parameters",
"outputFormat": "Completed operation with verification screenshots/data",
"channels": ["#hq", "#nx-work"]
},
"auditor": {
"port": 18812,
"model": "anthropic/claude-opus-4-6",
"capabilities": [
"quality-review",
"compliance-check",
"methodology-audit",
"assumption-validation",
"report-review",
"standards-compliance"
],
"strengths": "Critical eye, finds gaps and errors, ensures rigor",
"limitations": "Reviews work, doesn't create it — needs output from other agents",
"inputFormat": "Work product to review with applicable standards/requirements",
"outputFormat": "Structured review: findings, severity, recommendations",
"channels": ["#hq", "#quality"]
},
"secretary": {
"port": 18808,
"model": "google/gemini-2.5-flash",
"capabilities": [
"meeting-notes",
"status-reports",
"documentation",
"scheduling",
"action-tracking",
"communication-drafting"
],
"strengths": "Fast, cheap, good at summarization and admin tasks",
"limitations": "Not for technical work — administrative and organizational only",
"inputFormat": "Admin task or raw content to organize",
"outputFormat": "Clean documentation, summaries, action lists",
"channels": ["#hq", "#admin"]
},
"manager": {
"port": 18800,
"model": "anthropic/claude-opus-4-6",
"capabilities": [
"orchestration",
"project-planning",
"task-decomposition",
"priority-management",
"stakeholder-communication",
"workflow-execution"
],
"strengths": "Strategic thinking, orchestration, synthesis across agents",
"limitations": "Should not do technical work — delegates everything",
"inputFormat": "High-level directives from Antoine (CEO)",
"outputFormat": "Plans, status updates, synthesized deliverables",
"channels": ["#hq"]
}
}
}
2.2 — Manager Routing Logic
Added to Manager's SOUL.md as a skill directive:
## Smart Routing
Before delegating, consult `/home/papa/atomizer/workspaces/shared/AGENTS_REGISTRY.json`.
- Match task requirements to agent capabilities
- Consider model strengths (Opus for reasoning, Gemini for speed, Sonnet for balanced)
- For multi-step tasks, plan the full pipeline before starting
- Prefer parallel execution when steps are independent
- Always specify what you need back (don't let agents guess)
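The registry makes capability matching mechanical. A minimal routing sketch — the Manager actually reasons over the full registry (strengths, model cost, channels), so this capability-overlap scoring is an illustrative simplification:

```python
def route(task_tags, registry):
    """Pick the agent whose capabilities cover the most task tags.

    `task_tags` are capability strings the Manager extracts from the task
    (e.g. "web-research", "trade-studies"). Ties break in registry order;
    returns None when no agent matches at all.
    """
    best, best_score = None, 0
    for name, spec in registry["agents"].items():
        score = len(set(task_tags) & set(spec["capabilities"]))
        if score > best_score:
            best, best_score = name, score
    return best
```

With the full `AGENTS_REGISTRY.json` loaded, a supplier-search task would score highest for Webster, while a trade-study task would land on Tech-Lead.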
2.3 — Discord Channel Context Integration
How channels feed context into orchestration:
Each Discord channel accumulates project-specific conversation history. The orchestration layer can pull this as context:
# In orchestrate.sh, --channel-context fetches recent messages
bash orchestrate.sh tech-lead "Review thermal margins for M2" \
--channel-context "#gigabit-m1" --messages 50 \
--wait
Implementation: Use Discord bot API (each instance has a bot token) to fetch channel message history. Format as context block prepended to the task.
Channel strategy for Atomizer HQ Discord:
| Channel | Purpose | Context Value |
|---|---|---|
| `#hq` | Cross-team coordination, announcements | Project-wide decisions |
| `#technical` | FEA discussions, design decisions | Technical context for analysis tasks |
| `#optimization` | Study configs, results, methodology | Optimization history and patterns |
| `#research` | Webster's findings, literature | Reference data for technical work |
| `#quality` | Audit findings, compliance notes | Review standards and past issues |
| `#nx-work` | CAD/FEM operations, model updates | Model state and recent changes |
| `#admin` | Meeting notes, schedules, action items | Project timeline and commitments |
| `#handoffs` | Automated orchestration results (bot-only) | Pipeline audit trail |
Key insight: Channels become persistent, queryable context stores. Instead of passing massive context blocks between agents, you say "read #technical for the last 20 messages" and the agent absorbs project state naturally.
Channel Context Sanitization (security): Discord history is untrusted input. Before injecting into an agent's context:
- Cap at configurable token window (default: last 30 messages, max ~4K tokens)
- Strip any system-prompt-like instructions from message content
- Tag the entire block as `[CHANNEL CONTEXT — untrusted, for reference only]`
- Never let channel content override task instructions
This prevents prompt injection via crafted Discord messages in channel history.
Layer 3: Workflow Engine
What it does: Defines reusable multi-step pipelines as YAML. Manager reads and executes them. No coding needed to create new workflows.
3.1 — Workflow Definition Format
Location: /home/papa/atomizer/workspaces/shared/workflows/
# /home/papa/atomizer/workspaces/shared/workflows/material-trade-study.yaml
name: Material Trade Study
description: Research, evaluate, and audit material options for optical components
trigger: manual # or: keyword, schedule
inputs:
materials:
type: list
description: "Materials to compare"
requirements:
type: text
description: "Performance requirements and constraints"
project_channel:
type: channel
description: "Project channel for context"
steps:
- id: research
agent: webster
task: |
Research the following materials: {materials}
For each material, find: CTE (with temperature range), density, Young's modulus,
cost per kg, lead time, availability, and any known issues for optical applications.
Provide sources for all data.
channel_context: "{project_channel}"
channel_messages: 30
timeout: 180
retries: 2
output: material_data
validation:
agent: auditor
criteria: "Are all requested material properties present with credible sources? Flag any missing data."
on_fail: retry
- id: evaluate
agent: tech-lead
task: |
Evaluate these materials against our requirements:
REQUIREMENTS:
{requirements}
MATERIAL DATA:
{material_data}
Provide a recommendation with full rationale. Include a comparison matrix.
depends_on: [research]
timeout: 300
retries: 1
output: technical_assessment
- id: audit
agent: auditor
task: |
Review this material trade study for completeness, methodological rigor,
and potential gaps:
{technical_assessment}
Check: Are all requirements addressed? Are sources credible?
Are there materials that should have been considered but weren't?
depends_on: [evaluate]
timeout: 180
output: audit_result
- id: synthesize
agent: manager
action: synthesize # Manager processes internally, doesn't delegate
inputs: [material_data, technical_assessment, audit_result]
deliver:
channel: "{project_channel}"
format: summary # Manager writes a clean summary post
notifications:
on_complete: "#hq"
on_failure: "#hq"
3.2 — More Workflow Templates
Design Review:
name: Design Review
steps:
- id: prepare
agent: secretary
task: "Compile design package: gather latest CAD screenshots, analysis results, and requirements from {project_channel}"
- id: technical_review
agent: tech-lead
task: "Review design against requirements: {prepare}"
depends_on: [prepare]
- id: optimization_review
agent: optimizer
task: "Assess optimization potential: {prepare}"
depends_on: [prepare]
# technical_review and optimization_review run in PARALLEL (no dependency between them)
- id: audit
agent: auditor
task: "Final review: {technical_review} + {optimization_review}"
depends_on: [technical_review, optimization_review]
- id: deliver
agent: secretary
task: "Format design review report from: {audit}"
depends_on: [audit]
deliver:
channel: "{project_channel}"
Quick Research:
name: Quick Research
steps:
- id: research
agent: webster
task: "{query}"
timeout: 120
output: findings
- id: validate
agent: tech-lead
task: "Verify these findings are accurate and relevant: {findings}"
depends_on: [research]
deliver:
channel: "{request_channel}"
3.3 — Workflow Executor
File: /home/papa/atomizer/workspaces/shared/skills/orchestrate/workflow.sh
The manager's orchestration skill reads YAML workflows and executes them:
# Run a workflow
bash workflow.sh material-trade-study \
--input materials="Zerodur Class 0, Clearceram-Z HS, ULE" \
--input requirements="CTE < 0.01 ppm/K at 22°C, aperture 250mm" \
--input project_channel="#gigabit-m1"
Executor logic:
- Parse the YAML workflow definition
- Resolve dependencies → build the execution graph
- Execute steps in order (parallel when there are no dependencies)
- For each step: call `orchestrate.sh` with the task + resolved inputs
- Store results in `/home/papa/atomizer/handoffs/workflows/{workflow-run-id}/`
- On completion: deliver the final output to the specified channel
- On failure: notify `#hq` with error details and partial results
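The dependency-resolution step can be sketched as layering: each layer contains steps whose dependencies are all satisfied, and everything in a layer can run in parallel. The function name is illustrative; `workflow.py` implements the same idea with `ThreadPoolExecutor`.

```python
def execution_layers(steps):
    """Group workflow steps into parallel-executable layers.

    `steps` mirrors the YAML shape: dicts with an `id` and an optional
    `depends_on` list. A step joins a layer once all its dependencies
    sit in earlier layers; a cycle leaves no ready steps and raises.
    """
    remaining = {s["id"]: set(s.get("depends_on", [])) for s in steps}
    done, layers = set(), []
    while remaining:
        ready = sorted(sid for sid, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError(f"dependency cycle among: {sorted(remaining)}")
        layers.append(ready)
        done.update(ready)
        for sid in ready:
            del remaining[sid]
    return layers
```

Run against the Design Review template, this yields `[[prepare], [optimization_review, technical_review], [audit], [deliver]]` — the two reviews land in the same layer, which is exactly the parallelism the comment in that template calls out.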
Implementation Plan
Phase 1: Orchestration Core + Validation + Error Handling (Day 1 — Feb 15) ✅ COMPLETE
Actual effort: ~6 hours
- 1.1 Created `/home/papa/atomizer/workspaces/shared/skills/orchestrate/` directory
- 1.2 Built `orchestrate.py` (Python, not bash) — synchronous delegation with inotify-based waiting
  - Send via `/hooks/agent` (existing)
  - inotify watches the handoff directory for the result file
  - Timeout handling (configurable per call, `--timeout`)
  - Retry logic (`--retries N`, max 3, with error context)
  - Returns structured JSON result to the caller
  - Thin bash wrapper: `orchestrate.sh`
- 1.3 Created `/home/papa/atomizer/handoffs/` directory for result passing
- 1.4 Updated all 8 agent SOUL.md files with:
  - Structured response format for delegated tasks (JSON handoff protocol)
  - Self-check protocol (verify completeness before submitting)
  - Write result to `/home/papa/atomizer/handoffs/{runId}.json` on completion
- 1.5 Implemented error handling in `orchestrate.py`
  - Health check before sending (agent health endpoint)
  - Timeout with partial result recovery
  - Malformed response detection and retry
  - Idempotency check before retry (check whether the handoff file landed late)
  - All errors logged to `/home/papa/atomizer/logs/orchestration/`
- 1.6 Implemented trace logging in handoff files
  - Required fields validated: `schemaVersion`, `runId`, `agent`, `status`, `result`, `confidence`, `timestamp`
  - Unified JSONL logging with trace fields
- 1.7 Implemented `--validate` flag for strict orchestrator-side output validation
- 1.8 Deployed `orchestrate` skill to Manager (SOUL.md + TOOLS.md updated)
- 1.9 Test: Manager → Webster smoke tests passed (18-49s response times, 12 successful handoffs)
  - Chain test (Webster → Tech-Lead): Webster completed, Tech-Lead returned `partial` due to missing context passthrough — engine bug, not protocol bug
- 1.10 Test: ACL enforcement works (deny/allow), strict validation works
- 1.11 `delegate.sh` kept as fallback for fire-and-forget use cases
Key implementation decisions:
- Python (`orchestrate.py`) over bash for all logic — better JSON handling, inotify support, error handling
- `inotify_simple` for instant file detection (no polling)
- Session key format: `hook:orchestrate:{run_id}:{attempt}`
- ACL matrix hardcoded: Manager → all; Tech-Lead → webster/nx-expert/study-builder/secretary; Optimizer → webster/study-builder/secretary
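The hardcoded ACL matrix reduces to a lookup table plus two rules. A hypothetical mirror of it — the real check lives inside `orchestrate.py`:

```python
# Hypothetical mirror of the hardcoded delegation matrix in orchestrate.py
ACL = {
    "manager": {"tech-lead", "auditor", "optimizer", "study-builder",
                "nx-expert", "webster", "secretary"},
    "tech-lead": {"webster", "nx-expert", "study-builder", "secretary"},
    "optimizer": {"webster", "study-builder", "secretary"},
}


def delegation_allowed(caller: str, target: str) -> bool:
    """Runtime ACL check.

    Self-delegation is always blocked, and any caller absent from the
    matrix (e.g. auditor, webster) cannot sub-delegate at all — it must
    route through Manager.
    """
    if caller == target:
        return False
    return target in ACL.get(caller, set())
```

Enforcing this at runtime, rather than only as SOUL.md policy, is what guarantees the hierarchy stays loop-free even if an agent's instructions drift.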
Known issues to fix in Phase 2:
- Chain context passthrough: when chaining A→B→C, B's result must be explicitly injected into C's task
- Webster's Brave API key intermittently fails (recovered on retry)
- Manager Discord WebSocket reconnect loop (code 1005) — doesn't affect orchestration but blocks channel posting
Phase 2: Smart Routing + Channel Context + Hierarchical Delegation (Day 1-2 — Feb 15-16)
Estimated effort: 4-5 hours
All Phase 2 items completed 2026-02-15 — channel context fetcher built, hierarchical delegation deployed to Tech-Lead + Optimizer, ACL tested, all tests pass.

- 2.1 Create `AGENTS_REGISTRY.json` in shared workspace
- 2.2 Update Manager's SOUL.md with routing instructions
- 2.3 Build channel context fetcher (`fetch-channel-context.sh`)
  - Uses Discord bot token to pull recent messages
  - Formats as a markdown context block
  - Integrates with `orchestrate.sh` via the `--channel-context` flag
- 2.4 Set up Discord channels per the channel strategy table
- 2.5 Implement hierarchical delegation
  - Deploy `orchestrate` skill to Tech-Lead and Optimizer
  - Add sub-orchestration rules to their SOUL.md (can delegate to: Webster, Study-Builder, NX-Expert, Secretary)
  - Cannot delegate to: Manager, Auditor, each other (prevents loops)
  - All sub-delegations logged to `/home/papa/atomizer/handoffs/sub/` for Manager visibility
- 2.6 Enforce delegation ACL matrix in `orchestrate.sh` runtime
  - Hardcoded check: caller + target validated against allowed pairs
  - Manager → can delegate to all agents
  - Tech-Lead → can delegate to: Webster, NX-Expert, Study-Builder, Secretary
  - Optimizer → can delegate to: Webster, Study-Builder, Secretary
  - All others → cannot sub-delegate (must go through Manager)
  - Block self-delegation and circular paths at runtime (not just SOUL.md policy)
- 2.7 Implement channel context sanitization
  - Cap token window, strip system-like instructions, tag as untrusted
- 2.8 Test: Manager auto-routes a task based on registry + includes channel context
- 2.9 Test: Tech-Lead delegates a data lookup to Webster mid-analysis
- 2.10 Test: Auditor tries to sub-delegate → blocked by ACL
Phase 3: Workflow Engine (Day 2-3 — Feb 16-17)
Estimated effort: 6-8 hours
- 3.1 Build YAML workflow parser (Python script)
  - Implemented in `workflow.py` with name/path resolution from `/home/papa/atomizer/workspaces/shared/workflows/`, schema checks, step-ID validation, dependency validation, and cycle detection.
- 3.2 Build workflow executor (`workflow.sh`)
  - Dependency resolution
  - Parallel step execution
  - Variable substitution
  - Error handling and partial results
  - Implemented the executor in `workflow.py` with `ThreadPoolExecutor`, dependency-aware scheduling, step-level `on_fail` handling (skip/abort), overall timeout enforcement, approval gates, and JSON summary output.
  - Added thin wrapper `workflow.sh`.
- 3.3 Create initial workflow templates:
  - `material-trade-study.yaml`
  - `design-review.yaml`
  - `quick-research.yaml`
- 3.4 Deploy workflow skill to Manager
  - Updated Manager `SOUL.md` with a dedicated "Running Workflows" section and command example.
  - Updated Manager `TOOLS.md` with `workflow.py`/`workflow.sh` references and usage.
- 3.5 Implement approval gates in workflow YAML
  - `workflow.py` now supports `approval_gate` prompts (yes/no) before step execution.
  - In `--non-interactive` mode, approval gates are skipped with warnings.
- 3.6 Add workflow dry-run mode (`--dry-run`)
  - Validates the dependency graph and variable substitutions without executing
  - Reports step metadata, dependency-based execution layers, and the run output directory
- 3.7 Test: Run full material trade study workflow end-to-end
  - quick-research workflow tested E2E twice — Webster → Tech-Lead chain, 50s and 149s runs, Manager posted results to Discord
- 3.8 Create `#handoffs` channel for orchestration audit trail
  - Skipped — using workflow result directories instead of a dedicated #handoffs channel
Phase 3 completion notes:
- `workflow.py`: 15KB Python; supports YAML parsing, dependency graphs, parallel execution (ThreadPoolExecutor), variable substitution, approval gates, dry-run, per-step result persistence
- 3 workflow templates: `material-trade-study`, `quick-research`, `design-review`
- `design-review` dry-run confirmed parallel execution detection (tech-lead + optimizer simultaneous)
- Known issue fixed: Manager initially did not post results back — added explicit "Always Post Results Back" instructions to SOUL.md
Phase 4: Metrics + Documentation (Day 3 — Feb 17)
Estimated effort: 2-3 hours
- 4.1 Metrics: track delegation count, success rate, avg response time per agent
  - Implemented `metrics.py` to analyze handoff JSON and workflow summaries; supports JSON/text output with per-agent latency and success stats
- 4.2 Per-workflow token usage tracking across all agents
  - Added `metrics.sh` wrapper for easy execution from the orchestrate skill directory
- 4.3 Document everything in this PKM project folder
  - Added Manager `TOOLS.md` reference for metrics usage under Agent Communication
- 4.4 Create orchestration documentation README
  - Created `/home/papa/atomizer/workspaces/shared/skills/orchestrate/README.md` with architecture, usage, ACL, workflows, and storage docs
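The per-agent stats reduce to a fold over the handoff files. A sketch in the spirit of `metrics.py` — the function name and exact output shape are assumptions, not the shipped tool's API:

```python
import json
from collections import defaultdict
from pathlib import Path


def agent_stats(handoff_dir):
    """Fold handoff JSON files into per-agent delegation metrics.

    Reads every *.json in the handoff directory and accumulates count,
    success rate (status == "complete"), and mean latency from the
    latencyMs trace field.
    """
    acc = defaultdict(lambda: {"count": 0, "ok": 0, "latency_ms": 0})
    for path in Path(handoff_dir).glob("*.json"):
        h = json.loads(path.read_text())
        a = acc[h["agent"]]
        a["count"] += 1
        a["ok"] += h["status"] == "complete"
        a["latency_ms"] += h.get("latencyMs", 0)
    return {
        agent: {
            "delegations": a["count"],
            "success_rate": a["ok"] / a["count"],
            "avg_latency_ms": a["latency_ms"] / a["count"],
        }
        for agent, a in acc.items()
    }
```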
Context Flow Diagram
Antoine (CEO)
│
▼
┌─────────────┐
│ MANAGER │ ◄── Reads AGENTS_REGISTRY.json
│ (Opus 4.6) │ ◄── Reads workflow YAML
└──────┬──────┘ ◄── Validates results
│
┌─────────────┼─────────────┐
▼ ▼ ▼
┌────────────┐ ┌──────────┐ ┌──────────┐
│ TECH-LEAD │ │ AUDITOR │ │OPTIMIZER │
│ (Opus) │ │ (Opus) │ │ (Sonnet) │
│ [can sub- │ └──────────┘ │ [can sub-│
│ delegate] │ │ delegate]│
└─────┬──────┘ └─────┬─────┘
│ sub-orchestration │
┌────┴─────┐ ┌──────┴──────┐
▼ ▼ ▼ ▼
┌────────┐┌────────┐ ┌───────────┐┌──────────┐
│WEBSTER ││NX-EXPERT│ │STUDY-BLDR ││SECRETARY │
│(Gemini)││(Sonnet) │ │ (Sonnet) ││ (Flash) │
└───┬────┘└───┬─────┘ └─────┬─────┘└────┬─────┘
│ │ │ │
▼ ▼ ▼ ▼
┌──────────────────────────────────────────────┐
│ HANDOFF DIRECTORY │
│ /home/papa/atomizer/handoffs/ │
│ {runId}.json — structured results │
│ /sub/ — sub-delegation logs (visibility) │
└──────────────────────────────────────────────┘
│ │ │ │
└────┬────┘──────┬───────┘────┬───────┘
▼ ▼ ▼
┌────────────┐ ┌──────────┐ ┌─────────────────┐
│ DISCORD │ │VALIDATION│ │ SHARED FILES │
│ CHANNELS │ │ LOOPS │ │ (Atomizer repo │
│ (context) │ │(self-chk │ │ PKM, configs) │
└────────────┘ │+ auditor)│ └─────────────────┘
└──────────┘
CONTEXT SOURCES (per delegation):
1. Task context → Orchestrator passes explicitly
2. Channel context → Fetched from Discord history
3. Handoff context → Results from prior pipeline steps
4. Knowledge context → Shared filesystem (always available)
VALIDATION FLOW:
Agent output → Self-check → Orchestrator validation → [Auditor review if critical] → Accept/Retry
HIERARCHY:
Manager → delegates to all agents
Tech-Lead, Optimizer → sub-delegate to Webster, NX-Expert, Study-Builder, Secretary
All sub-delegations logged for Manager visibility
Comparison: Before vs After
| Aspect | Before (delegate.sh) | After (Orchestration Engine) |
|---|---|---|
| Delegation | Fire-and-forget | Synchronous with result return |
| Result flow | None — check Discord manually | Structured JSON via handoff files |
| Chaining | Impossible | Native — output feeds next step |
| Parallel work | Manual — delegate multiple, hope | Workflow engine handles automatically |
| Context passing | None | Task + channel + handoff + filesystem |
| Routing | Hardcoded agent names | Capability-based via registry |
| Reusability | One-off bash calls | YAML workflow templates |
| Audit trail | Discord messages only | Handoff logs + orchestration logs |
| Validation | None | Self-check + auditor loops on critical steps |
| Error handling | None | Timeout, retry, partial results (Phase 1) |
| Hierarchy | Flat (manager only) | Hierarchical (Tech-Lead/Optimizer can sub-delegate) |
| Adding agents | Edit bash script | Add entry to registry JSON |
Future Extensions (Post-MVP)
- Conditional branching: If auditor flags issues → route back to tech-lead for revision
- Human-in-the-loop gates: Workflow pauses for Antoine's approval at critical steps
- Learning loops: Store workflow results → agents learn from past runs
- Cost tracking: Per-workflow token usage across all agents
- Web UI dashboard: Visualize active workflows, agent status, handoff queue
- Inter-company workflows: External client triggers → full analysis pipeline → deliverable
Key Design Decisions
- **File-based handoffs over HTTP callbacks** — Simpler, debuggable, and works with the shared filesystem we already have. HTTP callbacks are a Phase 2 optimization if needed.
- **Manager as primary orchestrator, with hierarchical delegation (Phase 2)** — Manager runs workflows and chains tasks. In Phase 2, senior agents (Tech-Lead, Optimizer) gain sub-orchestration rights to delegate directly to supporting agents (e.g., Tech-Lead → Webster for a data lookup mid-analysis) without routing through Manager. All sub-delegations are logged to the handoff directory so Manager retains visibility. No circular delegation — the hierarchy is strict.
- **YAML workflows over hardcoded scripts** — Workflows are data, not code. Antoine can define new ones. Manager can read and execute them. Future: Manager could even generate workflows from natural language directives.
- **Channel context is opt-in per step** — Not every step needs channel history. The explicit `channel_context` parameter keeps token usage efficient.
- **Preserve the fire-and-forget option** — `delegate.sh` stays for simple one-off tasks where you don't need the result back. `orchestrate.sh` is for pipeline work.
Review Amendments (2026-02-15)
Source: Webster's review (reviews/REVIEW-Orchestration-Engine-Webster.md)
| Webster's Recommendation | Decision | Where |
|---|---|---|
| Hierarchical delegation | ✅ Adopted — Phase 2 | Tech-Lead + Optimizer get sub-orchestration rights |
| Validation/critic loops | ✅ Adopted — Phase 1 | Self-check in agents + --validate flag + auditor validation blocks in YAML |
| Error handling in Phase 1 | ✅ Adopted — Phase 1 | Timeouts, retries, health checks, malformed response handling |
| Shared blackboard state | ⏳ Deferred | Not needed until workflows exceed 5+ steps. File-based handoffs sufficient for now |
| Role-based dynamic routing | ⏳ Deferred | Only one agent per role currently. Revisit when we scale to redundant agents |
| AutoGen group chat pattern | 📝 Noted | Interesting for brainstorming workflows. Not MVP priority |
| LangGraph state graphs | 📝 Noted | YAML with on_fail: goto covers our needs without importing a paradigm |
Source: Auditor's review (reviews/REVIEW-Orchestration-Engine-Auditor-V2.md)
| Auditor's Recommendation | Decision | Where |
|---|---|---|
| Idempotency keys | ✅ Adopted — Phase 1 | idempotencyKey in handoff schema + existence check before retry |
| Handoff schema versioning | ✅ Adopted — Phase 1 | schemaVersion: "1.0" + required fields validation in orchestrate.sh |
| Approval gates | ✅ Adopted — Phase 3 | approval_gate: ceo in workflow YAML, posts to #hq and waits |
| Per-run state blackboard | ⏳ Deferred | Same as Webster's — file handoffs sufficient for 3-5 step workflows |
| Trace logging / observability | ✅ Adopted — Phase 1 | workflowRunId, stepId, attempt, latencyMs in every handoff |
| Channel context sanitization | ✅ Adopted — Phase 2 | Token cap, instruction stripping, untrusted tagging |
| ACL enforcement (runtime) | ✅ Adopted — Phase 2 | Hardcoded delegation matrix in orchestrate.sh, not just SOUL.md policy |
| Quality score (0-1) | ⏳ Deferred | Nice-to-have for dashboards, not MVP |
| Artifact checksums | ⏳ Deferred | Reproducibility concern — revisit for client deliverables |
| Workflow dry-run mode | ✅ Adopted — Phase 3 | Validate dependency graph + substitutions without execution |
Next step: Implementation begins 2026-02-15. Start with Phase 1 (orchestrate.sh + handoff directory + agent SOUL.md updates). Test with a simple Webster → Tech-Lead chain before building the full workflow engine.