feat: add Atomizer HQ multi-agent cluster infrastructure

- 8-agent OpenClaw cluster (Manager, Tech-Lead, Secretary, Auditor,
  Optimizer, Study-Builder, NX-Expert, Webster)
- Orchestration engine: orchestrate.py (sync delegation + handoffs)
- Workflow engine: YAML-defined multi-step pipelines
- Agent workspaces: SOUL.md, AGENTS.md, MEMORY.md per agent
- Shared skills: delegate, orchestrate, atomizer-protocols
- Capability registry (AGENTS_REGISTRY.json)
- Cluster management: cluster.sh, systemd template
- All secrets replaced with env var references
2026-02-15 21:18:18 +00:00
parent d6a1d6eee1
commit 3289a76e19
170 changed files with 24949 additions and 0 deletions


@@ -0,0 +1,70 @@
{
  "schemaVersion": "1.0",
  "updated": "2026-02-15",
  "agents": {
    "tech-lead": {
      "port": 18804,
      "model": "anthropic/claude-opus-4-6",
      "capabilities": ["fea-review", "design-decisions", "technical-analysis", "material-selection", "requirements-validation", "trade-studies"],
      "strengths": "Deep reasoning, technical judgment, complex analysis",
      "limitations": "Slow (Opus), expensive — use for high-value decisions",
      "channels": ["#hq", "#technical"]
    },
    "webster": {
      "port": 18828,
      "model": "google/gemini-2.5-pro",
      "capabilities": ["web-research", "literature-review", "data-lookup", "supplier-search", "standards-lookup"],
      "strengths": "Fast research, broad knowledge, web access",
      "limitations": "No deep technical judgment — finds data, doesn't evaluate it",
      "channels": ["#hq", "#research"]
    },
    "optimizer": {
      "port": 18816,
      "model": "anthropic/claude-sonnet-4-20250514",
      "capabilities": ["optimization-setup", "parameter-studies", "objective-definition", "constraint-formulation", "sensitivity-analysis"],
      "strengths": "Optimization methodology, mathematical formulation, DOE",
      "limitations": "Needs clear problem definition",
      "channels": ["#hq", "#optimization"]
    },
    "study-builder": {
      "port": 18820,
      "model": "anthropic/claude-sonnet-4-20250514",
      "capabilities": ["study-configuration", "doe-setup", "batch-generation", "parameter-sweeps"],
      "strengths": "Translating optimization plans into executable configs",
      "limitations": "Needs optimizer's plan as input",
      "channels": ["#hq", "#optimization"]
    },
    "nx-expert": {
      "port": 18824,
      "model": "anthropic/claude-sonnet-4-20250514",
      "capabilities": ["nx-operations", "mesh-generation", "boundary-conditions", "nastran-setup", "post-processing"],
      "strengths": "NX/Simcenter expertise, FEA model setup",
      "limitations": "Needs clear instructions",
      "channels": ["#hq", "#nx-work"]
    },
    "auditor": {
      "port": 18812,
      "model": "anthropic/claude-opus-4-6",
      "capabilities": ["quality-review", "compliance-check", "methodology-audit", "assumption-validation", "report-review"],
      "strengths": "Critical eye, finds gaps and errors",
      "limitations": "Reviews work, doesn't create it",
      "channels": ["#hq", "#quality"]
    },
    "secretary": {
      "port": 18808,
      "model": "google/gemini-2.5-flash",
      "capabilities": ["meeting-notes", "status-reports", "documentation", "scheduling", "action-tracking"],
      "strengths": "Fast, cheap, good at summarization and admin",
      "limitations": "Not for technical work",
      "channels": ["#hq", "#admin"]
    },
    "manager": {
      "port": 18800,
      "model": "anthropic/claude-opus-4-6",
      "capabilities": ["orchestration", "project-planning", "task-decomposition", "workflow-execution"],
      "strengths": "Strategic thinking, orchestration, synthesis",
      "limitations": "Should not do technical work — delegates everything",
      "channels": ["#hq"]
    }
  }
}
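Consumers of this registry only need a few lines to resolve an agent. A minimal Python sketch (the inline sample mirrors two of the entries above; `resolve_agent` is an illustrative helper, not part of the repo):

```python
import json

# Inline sample mirroring two AGENTS_REGISTRY.json entries (abbreviated).
REGISTRY_JSON = """
{
  "schemaVersion": "1.0",
  "agents": {
    "tech-lead": {"port": 18804, "capabilities": ["fea-review"]},
    "webster": {"port": 18828, "capabilities": ["web-research"]}
  }
}
"""

def resolve_agent(registry: dict, name: str) -> dict:
    """Return the registry entry for an agent, or raise KeyError."""
    agents = registry.get("agents", {})
    if name not in agents:
        raise KeyError(f"Unknown agent: {name}")
    return agents[name]

registry = json.loads(REGISTRY_JSON)
entry = resolve_agent(registry, "tech-lead")
print(entry["port"])  # the port used for http://127.0.0.1:<port>/hooks/agent
```

In the real cluster the dict would come from `AGENTS_REGISTRY.json` on disk rather than an inline string.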


@@ -0,0 +1,82 @@
# Atomizer Agent Cluster
## Agent Directory
| Agent | ID | Port | Role |
|-------|-----|------|------|
| 🎯 Manager | manager | 18800 | Orchestration, delegation, strategy |
| 🔧 Tech Lead | technical-lead | 18804 | FEA, R&D, technical review |
| 📋 Secretary | secretary | 18808 | Admin, notes, reports, knowledge |
| 🔍 Auditor | auditor | 18812 | Quality gatekeeper, reviews |
| ⚡ Optimizer | optimizer | 18816 | Optimization algorithms & strategy |
| 🏗️ Study Builder | study-builder | 18820 | Study code engineering |
| 🖥️ NX Expert | nx-expert | 18824 | Siemens NX/CAD/CAE |
| 🔬 Webster | webster | 18828 | Research & literature |
## Inter-Agent Communication
Each agent runs as an independent OpenClaw gateway. To send a message to another agent:
```bash
curl -s -X POST http://127.0.0.1:PORT/hooks/agent \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 31422bb39bc9e7a4d34f789d8a7cbc582dece8dd170dadd1" \
-d '{"message": "your message", "agentId": "AGENT_ID"}'
```
### Examples
```bash
# Report to manager
curl -s -X POST http://127.0.0.1:18800/hooks/agent \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 31422bb39bc9e7a4d34f789d8a7cbc582dece8dd170dadd1" \
-d '{"message": "Status update: FEA analysis complete", "agentId": "manager"}'
# Delegate to tech-lead
curl -s -X POST http://127.0.0.1:18804/hooks/agent \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 31422bb39bc9e7a4d34f789d8a7cbc582dece8dd170dadd1" \
-d '{"message": "Please review the beam optimization study", "agentId": "technical-lead"}'
# Ask webster for research
curl -s -X POST http://127.0.0.1:18828/hooks/agent \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 31422bb39bc9e7a4d34f789d8a7cbc582dece8dd170dadd1" \
-d '{"message": "Find papers on topology optimization", "agentId": "webster"}'
```
## Discord Channel Ownership
- **Manager**: #ceo-office, #announcements, #daily-standup, #active-projects, #agent-logs, #inter-agent, #general, #hydrotech-beam
- **Tech Lead**: #technical, #code-review, #fea-analysis
- **Secretary**: #task-board, #meeting-notes, #reports, #knowledge-base, #lessons-learned, #it-ops
- **NX Expert**: #nx-cad
- **Webster**: #literature, #materials-data
- **Auditor, Optimizer, Study Builder**: DM + hooks (no dedicated channels)
## Slack (Manager only)
Manager also handles Slack channels: #all-atomizer-hq, #secretary, etc.
## Rules
1. Always respond to Discord messages — NEVER reply NO_REPLY
2. When delegating, be specific about what you need
3. Post results back in the originating Discord channel
4. Use hooks API for inter-agent communication
## Response Arbitration (Anti-Collision)
To prevent multiple agents replying at once in the same public channel:
1. **Single channel owner speaks by default.**
- In any shared channel, only the listed owner agent should reply unless another agent is directly tagged.
2. **Non-owners are mention-gated.**
- If a non-owner is not explicitly @mentioned, it should stay silent and route updates via hooks to the owner.
3. **Tagged specialist = scoped reply only.**
- When tagged, reply only to the tagged request (no broad channel takeover), then return to silent mode.
4. **Manager synthesis for multi-agent asks.**
- If a user asks multiple roles at once, specialists send inputs to Manager via hooks; Manager posts one consolidated reply.
5. **Duplicate suppression window (30s).**
- If an equivalent answer has just been posted by another agent, post only incremental/new info.
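The arbitration rules above reduce to a small decision function. A hedged sketch (the owner map and the mention set are illustrative assumptions, not shipped code):

```python
# Sketch of rules 1-3: the channel owner speaks by default;
# everyone else is mention-gated. Owner map is illustrative.
CHANNEL_OWNERS = {"#technical": "tech-lead", "#research": "webster", "#hq": "manager"}

def should_reply(agent: str, channel: str, mentioned: set[str]) -> bool:
    """Return True if `agent` may post in `channel` for this message."""
    owner = CHANNEL_OWNERS.get(channel)
    if agent == owner:
        return True          # rule 1: owner speaks by default
    return agent in mentioned  # rules 2-3: non-owners only when tagged

print(should_reply("tech-lead", "#technical", set()))       # True (owner)
print(should_reply("webster", "#technical", set()))         # False (silent)
print(should_reply("webster", "#technical", {"webster"}))   # True (tagged)
```

Rules 4 and 5 (Manager synthesis, duplicate suppression) would sit on top of this gate.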


@@ -0,0 +1,35 @@
# Hooks Protocol — Inter-Agent Communication
## When You Receive a Hook Message
Messages arriving via the Hooks API (delegated tasks from other agents) are **high-priority direct assignments**. They appear as regular messages but come from the delegation system.
### How to Recognize
Hook messages typically contain specific task instructions — e.g., "Find density of Ti-6Al-4V" or "Review the thermal analysis assumptions." They arrive outside of normal Discord conversation flow.
### How to Respond
1. **Treat as top priority** — process before other pending work
2. **Do the work** — execute the requested task fully
3. **Respond in Discord** — your response is automatically routed to Discord if `--deliver` was set
4. **Be thorough but concise** — the requesting agent needs actionable results
5. **If you can't complete the task**, explain why clearly so the requester can reassign or adjust
### Status Reporting
After completing a delegated task, **append a status line** to `/home/papa/atomizer/workspaces/shared/project_log.md`:
```
[YYYY-MM-DD HH:MM] <your-agent-name>: Completed — <brief description of what was done>
```
Only the **Manager** updates `PROJECT_STATUS.md`. Everyone else appends to the log.
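The status-line format above can be generated mechanically. A minimal sketch, writing to a temporary file here instead of the real `project_log.md` (`append_status` is an illustrative helper):

```python
from datetime import datetime, timezone
from pathlib import Path
import tempfile

def append_status(log_path: Path, agent: str, summary: str) -> str:
    """Append one status line in the documented format and return it."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M")
    line = f"[{stamp}] {agent}: Completed — {summary}"
    with open(log_path, "a") as fh:
        fh.write(line + "\n")
    return line

# Demo target; real agents append to .../workspaces/shared/project_log.md
log = Path(tempfile.gettempdir()) / "project_log_demo.md"
line = append_status(log, "webster", "Found CTE of Zerodur Class 0")
print(line)
```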
## Delegation Authority
| Agent | Can Delegate To |
|-------|----------------|
| Manager | All agents |
| Tech Lead | All agents except Manager |
| All others | Cannot delegate (request via Manager or Tech Lead) |


@@ -0,0 +1,13 @@
# Project Status Dashboard
Updated: 2026-02-15 10:25 AM
## Active Tasks
- **Material Research (Webster):**
- [x] Zerodur Class 0 CTE data acknowledged (2026-02-15 10:07)
- [x] Ohara Clearceram-Z HS density confirmed: 2.55 g/cm³ (2026-02-15 10:12)
- [x] Zerodur Young's Modulus logged: 90.3 GPa (2026-02-15 10:18)
## Recent Activity
- Webster logged Young's Modulus for Zerodur (90.3 GPa) via orchestration hook.
- Webster confirmed receipt of orchestration ping.
- Webster reported density for Ohara Clearceram-Z HS (2.55 g/cm³).


@@ -0,0 +1,6 @@
[2026-02-15 18:12] webster: Completed — Research on Ohara Clearceram-Z HS vs Schott Zerodur.
[2026-02-15 18:12] webster: Completed — Updated and refined the research summary for Clearceram-Z HS vs. Zerodur with more nuanced data.
[2026-02-15 18:12] webster: Completed — Received duplicate refined research summary (Clearceram-Z HS vs. Zerodur). No action taken as data is already in memory.
[2026-02-15 18:30] webster: Completed — Logged new material property (Invar 36 Young's modulus) to memory.
[2026-02-15 18:30] webster: Completed — Received duplicate material property for Invar 36. No action taken as data is already in memory.


@@ -0,0 +1,68 @@
# Delegate Task to Another Agent
Sends a task to another Atomizer agent via the OpenClaw Hooks API. The target agent processes the task in an isolated session and optionally delivers the response to Discord.
## When to Use
- You need another agent to perform a task (research, analysis, NX work, etc.)
- You want to assign work and get a response in a Discord channel
- Cross-agent orchestration
## Usage
```bash
bash /home/papa/atomizer/workspaces/shared/skills/delegate/delegate.sh <agent> "<instruction>" [options]
```
### Agents
| Agent | Specialty |
|-------|-----------|
| `manager` | Orchestration, project oversight |
| `tech-lead` | Technical decisions, FEA review |
| `secretary` | Meeting notes, admin, status updates |
| `auditor` | Quality checks, compliance review |
| `optimizer` | Optimization setup, parameter studies |
| `study-builder` | Study configuration, DOE |
| `nx-expert` | NX/Simcenter operations |
| `webster` | Web research, literature search |
### Options
- `--channel <discord-channel-id>` — Route response to a specific Discord channel
- `--deliver` / `--no-deliver` — Whether to post response to Discord (default: deliver)
### Examples
```bash
# Ask Webster to research something
bash /home/papa/atomizer/workspaces/shared/skills/delegate/delegate.sh webster "Find the CTE of Zerodur Class 0 between 20-40°C"
# Assign NX work with channel routing
bash /home/papa/atomizer/workspaces/shared/skills/delegate/delegate.sh nx-expert "Create mesh convergence study for M2 mirror" --channel C0AEJV13TEU
# Ask auditor to review without posting to Discord
bash /home/papa/atomizer/workspaces/shared/skills/delegate/delegate.sh auditor "Review the thermal analysis assumptions" --no-deliver
```
## How It Works
1. Looks up the target agent's port from the cluster port map
2. Checks if the target agent is running
3. Sends a `POST /hooks/agent` request to the target's OpenClaw instance
4. Target agent processes the task in an isolated session
5. Response is delivered to Discord if `--deliver` is set
## Response
The script outputs:
- ✅ confirmation with run ID on success
- ❌ error message with HTTP code on failure
The delegated task runs **asynchronously** — you won't get the result inline. The target agent will respond in Discord.
## Notes
- Tasks are fire-and-forget. Monitor the Discord channel for the response.
- The target agent sees the message as a hook trigger, not a Discord message.
- For complex multi-step workflows, delegate one step at a time.


@@ -0,0 +1,118 @@
#!/usr/bin/env bash
# delegate.sh — Send a task to another Atomizer agent via OpenClaw Hooks API
# Usage: delegate.sh <agent> <message> [--channel <discord-channel-id>] [--deliver] [--wait]
#
# Examples:
# delegate.sh webster "Find density of Ti-6Al-4V"
# delegate.sh nx-expert "Mesh the M2 mirror" --channel C0AEJV13TEU --deliver
# delegate.sh tech-lead "Review optimization results" --deliver
set -euo pipefail

# --- Port Map (from cluster config) ---
declare -A PORT_MAP=(
  [manager]=18800
  [tech-lead]=18804
  [secretary]=18808
  [auditor]=18812
  [optimizer]=18816
  [study-builder]=18820
  [nx-expert]=18824
  [webster]=18828
)

# --- Config ---
TOKEN="${GATEWAY_TOKEN:?GATEWAY_TOKEN env var is required}"
HOST="127.0.0.1"

# --- Parse args ---
if [[ $# -lt 2 ]]; then
  echo "Usage: delegate.sh <agent> <message> [--channel <id>] [--deliver] [--wait]"
  echo ""
  echo "Agents: ${!PORT_MAP[*]}"
  exit 1
fi
AGENT="$1"
MESSAGE="$2"
shift 2

CHANNEL=""
DELIVER="true"
WAIT=""  # accepted for forward compatibility; waiting is not implemented (tasks are fire-and-forget)
while [[ $# -gt 0 ]]; do
  case "$1" in
    --channel) CHANNEL="$2"; shift 2 ;;
    --deliver) DELIVER="true"; shift ;;
    --no-deliver) DELIVER="false"; shift ;;
    --wait) WAIT="true"; shift ;;
    *) echo "Unknown option: $1"; exit 1 ;;
  esac
done

# --- Validate agent ---
PORT="${PORT_MAP[$AGENT]:-}"
if [[ -z "$PORT" ]]; then
  echo "❌ Unknown agent: $AGENT"
  echo "Available agents: ${!PORT_MAP[*]}"
  exit 1
fi

# --- Don't delegate to yourself ---
SELF_PORT="${ATOMIZER_SELF_PORT:-}"
if [[ -n "$SELF_PORT" && "$PORT" == "$SELF_PORT" ]]; then
  echo "❌ Cannot delegate to yourself"
  exit 1
fi

# --- Check if target is running ---
if ! curl -sf "http://$HOST:$PORT/health" > /dev/null 2>&1; then
  # Try a simple connection check instead
  if ! timeout 2 bash -c "echo > /dev/tcp/$HOST/$PORT" 2>/dev/null; then
    echo "❌ Agent '$AGENT' is not running on port $PORT"
    exit 1
  fi
fi

# --- Build payload ---
PAYLOAD=$(cat <<EOF
{
  "message": $(printf '%s' "$MESSAGE" | python3 -c "import json,sys; print(json.dumps(sys.stdin.read()))"),
  "name": "delegation",
  "sessionKey": "hook:delegation:$(date +%s)",
  "deliver": $DELIVER,
  "channel": "discord"
}
EOF
)

# Add Discord channel routing if specified
if [[ -n "$CHANNEL" ]]; then
  PAYLOAD=$(echo "$PAYLOAD" | python3 -c "
import json, sys
d = json.load(sys.stdin)
d['to'] = 'channel:$CHANNEL'
print(json.dumps(d))
")
fi

# --- Send ---
RESPONSE=$(curl -s -w "\n%{http_code}" -X POST "http://$HOST:$PORT/hooks/agent" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD")
HTTP_CODE=$(echo "$RESPONSE" | tail -1)
BODY=$(echo "$RESPONSE" | head -n -1)

if [[ "$HTTP_CODE" == "202" ]]; then
  RUN_ID=$(echo "$BODY" | python3 -c "import json,sys; print(json.loads(sys.stdin.read()).get('runId','unknown'))" 2>/dev/null || echo "unknown")
  echo "✅ Task delegated to $AGENT (port $PORT)"
  echo "   Run ID: $RUN_ID"
  echo "   Deliver to Discord: $DELIVER"
else
  echo "❌ Delegation failed (HTTP $HTTP_CODE)"
  echo "   Response: $BODY"
  exit 1
fi


@@ -0,0 +1,116 @@
# Orchestration Engine — Atomizer HQ
> Multi-instance synchronous delegation, workflow pipelines, and inter-agent coordination.
## Overview
The Orchestration Engine enables structured communication between 8 independent OpenClaw agent instances running on Discord. It replaces fire-and-forget delegation with synchronous handoffs, chaining, validation, and reusable YAML workflows.
## Architecture
```
┌─────────────────────────────────────────────────┐
│ LAYER 3: WORKFLOWS │
│ YAML multi-step pipelines │
│ (workflow.py — parallel, sequential, gates) │
├─────────────────────────────────────────────────┤
│ LAYER 2: SMART ROUTING │
│ Capability registry + channel context │
│ (AGENTS_REGISTRY.json + fetch-channel-context) │
├─────────────────────────────────────────────────┤
│ LAYER 1: ORCHESTRATION CORE │
│ Synchronous delegation + result return │
│ (orchestrate.py — inotify + handoffs) │
├─────────────────────────────────────────────────┤
│ EXISTING INFRASTRUCTURE │
│ 8 OpenClaw instances, hooks API, shared fs │
└─────────────────────────────────────────────────┘
```
## Files
| File | Purpose |
|------|---------|
| `orchestrate.py` | Core delegation engine — sends tasks, waits for handoff files via inotify |
| `orchestrate.sh` | Thin bash wrapper for orchestrate.py |
| `workflow.py` | YAML workflow engine — parses, resolves deps, executes pipelines |
| `workflow.sh` | Thin bash wrapper for workflow.py |
| `fetch-channel-context.sh` | Fetches Discord channel history as formatted context |
| `metrics.py` | Analyzes handoff files and workflow runs for stats |
| `metrics.sh` | Thin bash wrapper for metrics.py |
## Usage
### Single delegation
```bash
# Synchronous — blocks until agent responds
python3 orchestrate.py webster "Find CTE of Zerodur" --caller manager --timeout 120
# With channel context
python3 orchestrate.py tech-lead "Review thermal margins" --caller manager --channel-context technical --channel-messages 20
# With validation
python3 orchestrate.py webster "Research ULE properties" --caller manager --validate --timeout 120
```
### Workflow execution
```bash
# Dry-run (validate without executing)
python3 workflow.py quick-research --input query="CTE of ULE" --caller manager --dry-run
# Live run
python3 workflow.py quick-research --input query="CTE of ULE" --caller manager --non-interactive
# Material trade study (3-step pipeline)
python3 workflow.py material-trade-study \
--input materials="Zerodur, Clearceram-Z HS, ULE" \
--input requirements="CTE < 0.01 ppm/K" \
--caller manager --non-interactive
```
### Metrics
```bash
python3 metrics.py text # Human-readable
python3 metrics.py json # JSON output
```
## Handoff Protocol
Agents write structured JSON to `/home/papa/atomizer/handoffs/{runId}.json`:
```json
{
"schemaVersion": "1.0",
"runId": "orch-...",
"agent": "webster",
"status": "complete|partial|blocked|failed",
"result": "...",
"artifacts": [],
"confidence": "high|medium|low",
"notes": "...",
"timestamp": "ISO-8601"
}
```
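Receivers can sanity-check a handoff before trusting it. A minimal validation sketch based on the schema above (`validate_handoff` is illustrative, not shipped code):

```python
# Field names and status values taken from the handoff schema above.
REQUIRED = ("schemaVersion", "runId", "agent", "status", "result",
            "confidence", "timestamp")
VALID_STATUS = {"complete", "partial", "blocked", "failed"}

def validate_handoff(doc: dict) -> list[str]:
    """Return a list of problems; an empty list means the handoff looks well-formed."""
    problems = [f"missing field: {f}" for f in REQUIRED if f not in doc]
    if "status" in doc and doc["status"] not in VALID_STATUS:
        problems.append(f"bad status: {doc['status']!r}")
    return problems

good = {
    "schemaVersion": "1.0", "runId": "orch-123", "agent": "webster",
    "status": "complete", "result": "CTE data found", "artifacts": [],
    "confidence": "high", "notes": "", "timestamp": "2026-02-15T18:12:00Z",
}
print(validate_handoff(good))              # []
print(validate_handoff({"status": "done"}))  # missing fields plus bad status
```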
## ACL Matrix
| Caller | Can delegate to |
|--------|----------------|
| manager | All agents |
| tech-lead | webster, nx-expert, study-builder, secretary |
| optimizer | webster, study-builder, secretary |
| Others | Cannot sub-delegate |
## Workflow Templates
- `quick-research.yaml` — 2 steps: Webster research → Tech-Lead validation
- `material-trade-study.yaml` — 3 steps: Webster research → Tech-Lead evaluation → Auditor review
- `design-review.yaml` — 3 steps: Tech-Lead + Optimizer (parallel) → Auditor consolidation
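For orientation, a hypothetical `quick-research.yaml` could look like the following; the exact keys are an assumption, since the real templates are not reproduced here:

```yaml
# Hypothetical sketch only — the shipped template keys may differ.
name: quick-research
inputs: [query]
steps:
  - id: research
    agent: webster
    task: "Research: {query}"
  - id: validate
    agent: tech-lead
    needs: [research]
    task: "Validate the findings from step 'research'"
```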
## Result Storage
- Individual handoffs: `/home/papa/atomizer/handoffs/orch-*.json`
- Sub-delegations: `/home/papa/atomizer/handoffs/sub/`
- Workflow runs: `/home/papa/atomizer/handoffs/workflows/{workflow-run-id}/`
- Per-step: `{step-id}.json`
- Summary: `summary.json`


@@ -0,0 +1,192 @@
#!/usr/bin/env bash
# Usage: fetch-channel-context.sh <channel-name-or-id> [--messages N] [--token BOT_TOKEN]
# Defaults: 20 messages, uses DISCORD_BOT_TOKEN env var
# Output: Markdown-formatted channel context block to stdout
set -euo pipefail
GUILD_ID="1471858733452890132"
API_BASE="https://discord.com/api/v10"
DEFAULT_MESSAGES=20
MAX_MESSAGES=30
MAX_OUTPUT_CHARS=4000
usage() {
  echo "Usage: $0 <channel-name-or-id> [--messages N] [--token BOT_TOKEN]" >&2
}

if [[ $# -lt 1 ]]; then
  usage
  exit 1
fi
CHANNEL_INPUT="$1"
shift

MESSAGES="$DEFAULT_MESSAGES"
TOKEN="${DISCORD_BOT_TOKEN:-}"
while [[ $# -gt 0 ]]; do
  case "$1" in
    --messages)
      [[ $# -ge 2 ]] || { echo "Missing value for --messages" >&2; exit 1; }
      MESSAGES="$2"
      shift 2
      ;;
    --token)
      [[ $# -ge 2 ]] || { echo "Missing value for --token" >&2; exit 1; }
      TOKEN="$2"
      shift 2
      ;;
    *)
      echo "Unknown option: $1" >&2
      usage
      exit 1
      ;;
  esac
done

if [[ -z "$TOKEN" ]]; then
  echo "Missing bot token. Use --token or set DISCORD_BOT_TOKEN." >&2
  exit 1
fi
if ! [[ "$MESSAGES" =~ ^[0-9]+$ ]]; then
  echo "--messages must be a positive integer" >&2
  exit 1
fi
if (( MESSAGES < 1 )); then
  MESSAGES=1
fi
if (( MESSAGES > MAX_MESSAGES )); then
  MESSAGES=$MAX_MESSAGES
fi
AUTH_HEADER="Authorization: Bot ${TOKEN}"
resolve_channel() {
  local input="$1"
  if [[ "$input" =~ ^[0-9]{8,}$ ]]; then
    local ch_json
    ch_json="$(curl -sf -H "$AUTH_HEADER" "${API_BASE}/channels/${input}")" || return 1
    python3 - "$ch_json" <<'PY'
import json, sys
obj = json.loads(sys.argv[1])
cid = obj.get("id", "")
name = obj.get("name", cid)
if not cid:
    sys.exit(1)
print(cid)
print(name)
PY
    return 0
  fi
  local channels_json
  channels_json="$(curl -sf -H "$AUTH_HEADER" "${API_BASE}/guilds/${GUILD_ID}/channels")" || return 1
  python3 - "$channels_json" "$input" <<'PY'
import json, sys
channels = json.loads(sys.argv[1])
needle = sys.argv[2].strip().lstrip('#').lower()
for ch in channels:
    if str(ch.get("type")) not in {"0", "5", "15"}:
        continue
    name = (ch.get("name") or "").lower()
    if name == needle:
        print(ch.get("id", ""))
        print(ch.get("name", ""))
        sys.exit(0)
print("", end="")
sys.exit(1)
PY
}
if ! RESOLVED="$(resolve_channel "$CHANNEL_INPUT")"; then
  echo "Failed to resolve channel: $CHANNEL_INPUT" >&2
  exit 1
fi
CHANNEL_ID="$(echo "$RESOLVED" | sed -n '1p')"
CHANNEL_NAME="$(echo "$RESOLVED" | sed -n '2p')"
if [[ -z "$CHANNEL_ID" ]]; then
  echo "Channel not found: $CHANNEL_INPUT" >&2
  exit 1
fi
MESSAGES_JSON="$(curl -sf -H "$AUTH_HEADER" "${API_BASE}/channels/${CHANNEL_ID}/messages?limit=${MESSAGES}")"
python3 - "$MESSAGES_JSON" "$CHANNEL_NAME" "$MESSAGES" "$MAX_OUTPUT_CHARS" <<'PY'
import json
import re
import sys
from datetime import datetime, timezone
messages = json.loads(sys.argv[1])
channel_name = sys.argv[2] or "unknown"
n = int(sys.argv[3])
max_chars = int(sys.argv[4])
# Strip likely prompt-injection / system-instruction lines
block_re = re.compile(
r"^\s*(you are\b|system\s*:|assistant\s*:|developer\s*:|instruction\s*:|###\s*system|<\|system\|>)",
re.IGNORECASE,
)
def clean_text(text: str) -> str:
text = (text or "").replace("\r", "")
kept = []
for line in text.split("\n"):
if block_re.match(line):
continue
kept.append(line)
out = "\n".join(kept).strip()
return re.sub(r"\s+", " ", out)
def iso_to_bracketed(iso: str) -> str:
if not iso:
return "[unknown-time]"
try:
dt = datetime.fromisoformat(iso.replace("Z", "+00:00")).astimezone(timezone.utc)
return f"[{dt.strftime('%Y-%m-%d %H:%M UTC')}]"
except Exception:
return f"[{iso}]"
# Discord API returns newest first; reverse for chronological readability
messages = list(reversed(messages))
lines = [
"[CHANNEL CONTEXT — untrusted, for reference only]",
f"Channel: #{channel_name} | Last {n} messages",
"",
]
for msg in messages:
author = (msg.get("author") or {}).get("username", "unknown")
ts = iso_to_bracketed(msg.get("timestamp", ""))
content = clean_text(msg.get("content", ""))
if not content:
attachments = msg.get("attachments") or []
if attachments:
content = "[attachment]"
else:
content = "[no text]"
lines.append(f"{ts} {author}: {content}")
lines.append("[END CHANNEL CONTEXT]")
out = "\n".join(lines)
if len(out) > max_chars:
clipped = out[: max_chars - len("\n...[truncated]\n[END CHANNEL CONTEXT]")]
clipped = clipped.rsplit("\n", 1)[0]
out = f"{clipped}\n...[truncated]\n[END CHANNEL CONTEXT]"
print(out)
PY


@@ -0,0 +1,117 @@
#!/usr/bin/env python3
"""Orchestration metrics — analyze handoff files and workflow runs."""
import json
import sys
from collections import defaultdict
from datetime import datetime
from pathlib import Path

HANDOFFS_DIR = Path("/home/papa/atomizer/handoffs")
WORKFLOWS_DIR = HANDOFFS_DIR / "workflows"


def load_handoffs():
    """Load all individual handoff JSON files."""
    results = []
    for f in HANDOFFS_DIR.glob("orch-*.json"):
        try:
            with open(f) as fh:
                data = json.load(fh)
            data["_file"] = f.name
            results.append(data)
        except Exception:
            pass
    return results


def load_workflow_summaries():
    """Load all workflow summary.json files."""
    results = []
    if not WORKFLOWS_DIR.is_dir():
        return results
    for d in WORKFLOWS_DIR.iterdir():
        if not d.is_dir():
            continue
        summary = d / "summary.json"
        if summary.exists():
            try:
                with open(summary) as fh:
                    data = json.load(fh)
                results.append(data)
            except Exception:
                pass
    return results


def compute_metrics():
    handoffs = load_handoffs()
    workflows = load_workflow_summaries()
    # Per-agent stats
    agent_stats = defaultdict(lambda: {
        "total": 0, "complete": 0, "failed": 0, "partial": 0,
        "blocked": 0, "avg_latency_ms": 0, "latencies": [],
    })
    for h in handoffs:
        agent = h.get("agent", "unknown")
        status = h.get("status", "unknown")
        agent_stats[agent]["total"] += 1
        if status in agent_stats[agent]:
            agent_stats[agent][status] += 1
        lat = h.get("latencyMs")
        if lat:
            agent_stats[agent]["latencies"].append(lat)
    # Compute averages
    for agent, stats in agent_stats.items():
        lats = stats.pop("latencies")
        if lats:
            stats["avg_latency_ms"] = int(sum(lats) / len(lats))
            stats["min_latency_ms"] = min(lats)
            stats["max_latency_ms"] = max(lats)
        stats["success_rate"] = f"{stats['complete']/stats['total']*100:.0f}%" if stats["total"] > 0 else "N/A"
    # Workflow stats
    wf_stats = {"total": len(workflows), "complete": 0, "failed": 0, "partial": 0, "avg_duration_s": 0, "durations": []}
    for w in workflows:
        status = w.get("status", "unknown")
        if status == "complete":
            wf_stats["complete"] += 1
        elif status in ("failed", "error"):
            wf_stats["failed"] += 1
        else:
            wf_stats["partial"] += 1
        dur = w.get("duration_s")
        if dur:
            wf_stats["durations"].append(dur)
    durs = wf_stats.pop("durations")
    if durs:
        wf_stats["avg_duration_s"] = round(sum(durs) / len(durs), 1)
        wf_stats["min_duration_s"] = round(min(durs), 1)
        wf_stats["max_duration_s"] = round(max(durs), 1)
    wf_stats["success_rate"] = f"{wf_stats['complete']/wf_stats['total']*100:.0f}%" if wf_stats["total"] > 0 else "N/A"
    return {
        "generated_at": datetime.utcnow().isoformat() + "Z",
        "total_handoffs": len(handoffs),
        "total_workflows": len(workflows),
        "agent_stats": dict(agent_stats),
        "workflow_stats": wf_stats,
    }


def main():
    fmt = sys.argv[1] if len(sys.argv) > 1 else "json"
    metrics = compute_metrics()
    if fmt == "text":
        print("=== Orchestration Metrics ===")
        print(f"Generated: {metrics['generated_at']}")
        print(f"Total handoffs: {metrics['total_handoffs']}")
        print(f"Total workflows: {metrics['total_workflows']}")
        print()
        print("--- Per-Agent Stats ---")
        for agent, stats in sorted(metrics["agent_stats"].items()):
            print(f"  {agent}: {stats['total']} tasks, {stats['success_rate']} success, avg {stats.get('avg_latency_ms', 'N/A')}ms")
        print()
        print("--- Workflow Stats ---")
        ws = metrics["workflow_stats"]
        print(f"  {ws['total']} runs, {ws['success_rate']} success, avg {ws.get('avg_duration_s', 'N/A')}s")
    else:
        print(json.dumps(metrics, indent=2))


if __name__ == "__main__":
    main()


@@ -0,0 +1,2 @@
#!/usr/bin/env bash
exec python3 "$(dirname "$0")/metrics.py" "$@"


@@ -0,0 +1,582 @@
#!/usr/bin/env python3
"""
Atomizer HQ Orchestration Engine — Phase 1b
Synchronous delegation with file-based handoffs, inotify, validation, retries, error handling.
Usage:
python3 orchestrate.py <agent> "<task>" [options]
Options:
--wait Block until agent completes (default: true)
--timeout <sec> Max wait time per attempt (default: 300)
--format json|text Expected response format (default: json)
--context <file> Attach context file to the task
--no-deliver Don't post to Discord
--run-id <id> Custom run ID (default: auto-generated)
--retries <N> Retry on failure (default: 1, max: 3)
--validate Validate required handoff fields strictly
--workflow-id <id> Workflow run ID (for tracing)
--step-id <id> Workflow step ID (for tracing)
--caller <agent> Calling agent (for ACL enforcement)
--channel-context <channel> Include recent Discord channel history as untrusted context
--channel-messages <N> Number of channel messages to fetch (default: 20, max: 30)
"""
import argparse
import json
import os
import subprocess
import sys
import time
import uuid
from pathlib import Path
# ── Constants ────────────────────────────────────────────────────────────────
HANDOFF_DIR = Path("/home/papa/atomizer/handoffs")
LOG_DIR = Path("/home/papa/atomizer/logs/orchestration")
REGISTRY_PATH = Path("/home/papa/atomizer/workspaces/shared/AGENTS_REGISTRY.json")
ORCHESTRATE_DIR = Path("/home/papa/atomizer/workspaces/shared/skills/orchestrate")
GATEWAY_TOKEN = "31422bb39bc9e7a4d34f789d8a7cbc582dece8dd170dadd1"
# Port map (fallback if registry unavailable)
AGENT_PORTS = {
"manager": 18800,
"tech-lead": 18804,
"secretary": 18808,
"auditor": 18812,
"optimizer": 18816,
"study-builder": 18820,
"nx-expert": 18824,
"webster": 18828,
}
# Delegation ACL — who can delegate to whom
DELEGATION_ACL = {
"manager": ["tech-lead", "auditor", "optimizer", "study-builder", "nx-expert", "webster", "secretary"],
"tech-lead": ["webster", "nx-expert", "study-builder", "secretary"],
"optimizer": ["webster", "study-builder", "secretary"],
# All others: no sub-delegation allowed
}
# Required handoff fields for strict validation
REQUIRED_FIELDS = ["status", "result"]
STRICT_FIELDS = ["schemaVersion", "status", "result", "confidence", "timestamp"]
# ── Helpers ──────────────────────────────────────────────────────────────────
def get_agent_port(agent: str) -> int:
"""Resolve agent name to port, checking registry first."""
if REGISTRY_PATH.exists():
try:
registry = json.loads(REGISTRY_PATH.read_text())
agent_info = registry.get("agents", {}).get(agent)
if agent_info and "port" in agent_info:
return agent_info["port"]
except (json.JSONDecodeError, KeyError):
pass
port = AGENT_PORTS.get(agent)
if port is None:
emit_error(f"Unknown agent '{agent}'")
sys.exit(1)
return port
def check_acl(caller: str | None, target: str) -> bool:
"""Check if caller is allowed to delegate to target."""
if caller is None:
return True # No caller specified = no ACL enforcement
if caller == target:
return False # No self-delegation
allowed = DELEGATION_ACL.get(caller)
if allowed is None:
return False # Agent not in ACL = cannot delegate
return target in allowed
def check_health(agent: str, port: int) -> bool:
"""Quick health check — can we reach the agent's gateway?"""
try:
result = subprocess.run(
["curl", "-sf", "-o", "/dev/null", "-w", "%{http_code}",
f"http://127.0.0.1:{port}/healthz"],
capture_output=True, text=True, timeout=5
)
return result.stdout.strip() in ("200", "204")
except (subprocess.TimeoutExpired, Exception):
return False
def send_task(agent: str, port: int, task: str, run_id: str,
attempt: int = 1, prev_error: str = None,
context: str = None, no_deliver: bool = False) -> bool:
"""Send a task to the agent via /hooks/agent endpoint."""
handoff_path = HANDOFF_DIR / f"{run_id}.json"
# Build retry context if this is a retry
retry_note = ""
if attempt > 1 and prev_error:
retry_note = f"\n⚠️ RETRY (attempt {attempt}): Previous attempt failed: {prev_error}\nPlease try again carefully.\n"
message = f"""[ORCHESTRATED TASK — run_id: {run_id}]
{retry_note}
IMPORTANT: Answer this task DIRECTLY. Do NOT spawn sub-agents, Codex, or background processes.
Use your own knowledge and tools (web_search, web_fetch) directly. Keep your response focused and concise.
{task}
---
IMPORTANT: When you complete this task, write your response as a JSON file to:
{handoff_path}
Use this exact format:
```json
{{
"schemaVersion": "1.0",
"runId": "{run_id}",
"agent": "{agent}",
"status": "complete",
"result": "<your findings/output here>",
"artifacts": [],
"confidence": "high|medium|low",
"notes": "<any caveats or open questions>",
"timestamp": "<ISO-8601 timestamp>"
}}
```
Status values: complete | partial | blocked | failed
Write the file BEFORE posting to Discord. The orchestrator is waiting for it."""
if context:
message = f"CONTEXT:\n{context}\n\n{message}"
payload = {
"message": message,
"name": f"orchestrate:{run_id}",
"sessionKey": f"hook:orchestrate:{run_id}:{attempt}",
"deliver": not no_deliver,
"wakeMode": "now",
"timeoutSeconds": 600,
}
try:
result = subprocess.run(
["curl", "-sf", "-X", "POST",
f"http://127.0.0.1:{port}/hooks/agent",
"-H", f"Authorization: Bearer {GATEWAY_TOKEN}",
"-H", "Content-Type: application/json",
"-d", json.dumps(payload)],
capture_output=True, text=True, timeout=15
)
return result.returncode == 0
    except Exception as e:  # covers subprocess.TimeoutExpired
        log_event(run_id, agent, "send_error", str(e), attempt=attempt)
        return False
def wait_for_handoff(run_id: str, timeout: int) -> dict | None:
"""Wait for the handoff file using inotify. Falls back to polling."""
handoff_path = HANDOFF_DIR / f"{run_id}.json"
# Check if already exists (agent was fast, or late arrival from prev attempt)
if handoff_path.exists():
return read_handoff(handoff_path)
try:
from inotify_simple import INotify, flags
inotify = INotify()
watch_flags = flags.CREATE | flags.MOVED_TO | flags.CLOSE_WRITE
wd = inotify.add_watch(str(HANDOFF_DIR), watch_flags)
deadline = time.time() + timeout
target_name = f"{run_id}.json"
while time.time() < deadline:
remaining = max(0.1, deadline - time.time())
events = inotify.read(timeout=int(remaining * 1000))
for event in events:
if event.name == target_name:
time.sleep(0.3) # Ensure file is fully written
inotify.close()
return read_handoff(handoff_path)
# Direct check in case we missed the inotify event
if handoff_path.exists():
inotify.close()
return read_handoff(handoff_path)
inotify.close()
return None
except ImportError:
return poll_for_handoff(handoff_path, timeout)
def poll_for_handoff(handoff_path: Path, timeout: int) -> dict | None:
"""Fallback polling if inotify unavailable."""
deadline = time.time() + timeout
while time.time() < deadline:
if handoff_path.exists():
time.sleep(0.3)
return read_handoff(handoff_path)
time.sleep(2)
return None
def read_handoff(path: Path) -> dict | None:
"""Read and parse a handoff file."""
try:
raw = path.read_text().strip()
data = json.loads(raw)
return data
    except json.JSONDecodeError:
        return {
            "status": "malformed",
            "result": raw[:2000],
            "notes": "Invalid JSON in handoff file",
            "_raw": True,
        }
except Exception as e:
return {
"status": "error",
"result": str(e),
"notes": f"Failed to read handoff file: {e}",
}
def validate_handoff(data: dict, strict: bool = False) -> tuple[bool, str]:
"""Validate handoff data. Returns (valid, error_message)."""
if data is None:
return False, "No handoff data"
fields = STRICT_FIELDS if strict else REQUIRED_FIELDS
missing = [f for f in fields if f not in data]
if missing:
return False, f"Missing fields: {', '.join(missing)}"
status = data.get("status", "")
if status not in ("complete", "partial", "blocked", "failed"):
return False, f"Invalid status: '{status}'"
if status == "failed":
return False, f"Agent reported failure: {data.get('notes', 'no details')}"
if status == "blocked":
return False, f"Agent blocked: {data.get('notes', 'no details')}"
return True, ""
def should_retry(result: dict | None, attempt: int, max_retries: int) -> tuple[bool, str]:
"""Decide whether to retry based on result and attempt count."""
if attempt >= max_retries:
return False, "Max retries reached"
if result is None:
return True, "timeout"
status = result.get("status", "")
if status == "malformed":
return True, "malformed response"
if status == "failed":
return True, f"agent failed: {result.get('notes', '')}"
if status == "partial" and result.get("confidence") == "low":
return True, "partial with low confidence"
if status == "error":
return True, f"error: {result.get('notes', '')}"
return False, ""
def clear_handoff(run_id: str):
"""Remove handoff file before retry."""
handoff_path = HANDOFF_DIR / f"{run_id}.json"
if handoff_path.exists():
# Rename to .prev instead of deleting (for debugging)
handoff_path.rename(handoff_path.with_suffix(".prev.json"))
def log_event(run_id: str, agent: str, event_type: str, detail: str = "",
attempt: int = 1, elapsed_ms: int = 0, **extra):
"""Unified logging."""
LOG_DIR.mkdir(parents=True, exist_ok=True)
log_file = LOG_DIR / f"{time.strftime('%Y-%m-%d')}.jsonl"
entry = {
"timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
"runId": run_id,
"agent": agent,
"event": event_type,
"detail": detail[:500],
"attempt": attempt,
"elapsedMs": elapsed_ms,
**extra,
}
with open(log_file, "a") as f:
f.write(json.dumps(entry) + "\n")
def emit_error(msg: str):
"""Print error to stderr."""
print(f"ERROR: {msg}", file=sys.stderr)
def get_discord_token_for_caller(caller: str) -> str | None:
"""Load caller bot token from instance config."""
cfg = Path(f"/home/papa/atomizer/instances/{caller}/openclaw.json")
if not cfg.exists():
return None
try:
data = json.loads(cfg.read_text())
return data.get("channels", {}).get("discord", {}).get("token")
except Exception:
return None
def fetch_channel_context(channel: str, messages: int, token: str) -> str | None:
"""Fetch formatted channel context via helper script."""
script = ORCHESTRATE_DIR / "fetch-channel-context.sh"
if not script.exists():
return None
try:
result = subprocess.run(
[str(script), channel, "--messages", str(messages), "--token", token],
capture_output=True,
text=True,
timeout=30,
check=False,
)
if result.returncode != 0:
emit_error(f"Channel context fetch failed: {result.stderr.strip()}")
return None
return result.stdout.strip()
except Exception as e:
emit_error(f"Channel context fetch error: {e}")
return None
# ── Main ─────────────────────────────────────────────────────────────────────
def main():
parser = argparse.ArgumentParser(description="Atomizer Orchestration Engine")
parser.add_argument("agent", help="Target agent name")
parser.add_argument("task", help="Task to delegate")
    parser.add_argument("--wait", action=argparse.BooleanOptionalAction, default=True,
                        help="Wait for handoff file (use --no-wait for fire-and-forget)")
parser.add_argument("--timeout", type=int, default=300,
help="Timeout per attempt in seconds (default: 300)")
parser.add_argument("--format", choices=["json", "text"], default="json")
parser.add_argument("--context", type=str, default=None,
help="Path to context file")
parser.add_argument("--no-deliver", action="store_true")
parser.add_argument("--run-id", type=str, default=None)
parser.add_argument("--retries", type=int, default=1,
help="Max attempts (default: 1, max: 3)")
parser.add_argument("--validate", action="store_true",
help="Strict validation of handoff fields")
parser.add_argument("--workflow-id", type=str, default=None,
help="Workflow run ID for tracing")
parser.add_argument("--step-id", type=str, default=None,
help="Workflow step ID for tracing")
parser.add_argument("--caller", type=str, default=None,
help="Calling agent for ACL enforcement")
parser.add_argument("--channel-context", type=str, default=None,
help="Discord channel name or ID to include as context")
parser.add_argument("--channel-messages", type=int, default=20,
help="Number of channel messages to fetch (default: 20, max: 30)")
args = parser.parse_args()
# Clamp retries
max_retries = min(max(args.retries, 1), 3)
# Generate run ID
run_id = args.run_id or f"orch-{int(time.time())}-{uuid.uuid4().hex[:8]}"
# Task text can be augmented (e.g., channel context prepend)
delegated_task = args.task
# ACL check
if not check_acl(args.caller, args.agent):
result = {
"status": "error",
"result": None,
"notes": f"ACL denied: '{args.caller}' cannot delegate to '{args.agent}'",
"agent": args.agent,
"runId": run_id,
}
print(json.dumps(result, indent=2))
log_event(run_id, args.agent, "acl_denied", f"caller={args.caller}")
sys.exit(1)
# Resolve agent port
port = get_agent_port(args.agent)
# Health check
if not check_health(args.agent, port):
result = {
"status": "error",
"result": None,
"notes": f"Agent '{args.agent}' unreachable at port {port}",
"agent": args.agent,
"runId": run_id,
}
print(json.dumps(result, indent=2))
log_event(run_id, args.agent, "health_failed", f"port={port}")
sys.exit(1)
# Load context
context = None
if args.context:
ctx_path = Path(args.context)
if ctx_path.exists():
context = ctx_path.read_text()
else:
emit_error(f"Context file not found: {args.context}")
# Optional channel context
if args.channel_context:
if not args.caller:
emit_error("--channel-context requires --caller so bot token can be resolved")
sys.exit(1)
token = get_discord_token_for_caller(args.caller)
if not token:
emit_error(f"Could not resolve Discord bot token for caller '{args.caller}'")
sys.exit(1)
channel_messages = min(max(args.channel_messages, 1), 30)
ch_ctx = fetch_channel_context(args.channel_context, channel_messages, token)
if not ch_ctx:
emit_error(f"Failed to fetch channel context for '{args.channel_context}'")
sys.exit(1)
delegated_task = f"{ch_ctx}\n\n{delegated_task}"
# ── Retry loop ───────────────────────────────────────────────────────
result = None
prev_error = None
for attempt in range(1, max_retries + 1):
attempt_start = time.time()
log_event(run_id, args.agent, "attempt_start", delegated_task[:200],
attempt=attempt)
# Idempotency check: if handoff file exists from a previous attempt, use it
handoff_path = HANDOFF_DIR / f"{run_id}.json"
if attempt > 1 and handoff_path.exists():
result = read_handoff(handoff_path)
if result and result.get("status") in ("complete", "partial"):
log_event(run_id, args.agent, "late_arrival",
"Handoff file arrived between retries",
attempt=attempt)
break
# Previous result was bad, clear it for retry
clear_handoff(run_id)
# Send task
sent = send_task(args.agent, port, delegated_task, run_id,
attempt=attempt, prev_error=prev_error,
context=context, no_deliver=args.no_deliver)
if not sent:
prev_error = "Failed to send task"
log_event(run_id, args.agent, "send_failed", prev_error,
attempt=attempt)
if attempt < max_retries:
time.sleep(5) # Brief pause before retry
continue
result = {
"status": "error",
"result": None,
"notes": f"Failed to send task after {attempt} attempts",
}
break
# Wait for result
if args.wait:
result = wait_for_handoff(run_id, args.timeout)
elapsed = time.time() - attempt_start
# Validate
if result is not None:
valid, error_msg = validate_handoff(result, strict=args.validate)
if not valid:
log_event(run_id, args.agent, "validation_failed",
error_msg, attempt=attempt,
elapsed_ms=int(elapsed * 1000))
do_retry, reason = should_retry(result, attempt, max_retries)
if do_retry:
prev_error = reason
clear_handoff(run_id)
time.sleep(3)
continue
# No retry — return what we have
break
else:
# Valid result
log_event(run_id, args.agent, "complete",
result.get("status", ""),
attempt=attempt,
elapsed_ms=int(elapsed * 1000),
confidence=result.get("confidence"))
break
else:
# Timeout
log_event(run_id, args.agent, "timeout", "",
attempt=attempt,
elapsed_ms=int(elapsed * 1000))
do_retry, reason = should_retry(result, attempt, max_retries)
if do_retry:
prev_error = "timeout"
continue
result = {
"status": "timeout",
"result": None,
"notes": f"Agent did not respond within {args.timeout}s "
f"(attempt {attempt}/{max_retries})",
}
break
else:
# Fire and forget
print(json.dumps({"status": "sent", "runId": run_id, "agent": args.agent}))
sys.exit(0)
# ── Output ───────────────────────────────────────────────────────────
if result is None:
result = {
"status": "error",
"result": None,
"notes": "No result after all attempts",
}
# Add metadata
    # attempt_start is always bound here: the fire-and-forget path exits earlier
    total_elapsed = time.time() - attempt_start  # elapsed time of the final attempt
result["runId"] = run_id
result["agent"] = args.agent
result["latencyMs"] = int(total_elapsed * 1000)
if args.workflow_id:
result["workflowRunId"] = args.workflow_id
if args.step_id:
result["stepId"] = args.step_id
if args.format == "json":
print(json.dumps(result, indent=2))
    else:
        print(result.get("result") or "")
status = result.get("status", "error")
sys.exit(0 if status in ("complete", "partial") else 1)
if __name__ == "__main__":
main()
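The handoff contract above can be exercised end-to-end without an agent in the loop; a minimal stdlib-only sketch (the `REQUIRED` tuple and helpers below are assumptions mirroring the documented format, not the real `REQUIRED_FIELDS` constant):

```python
import json
import tempfile
from pathlib import Path

# Field names assumed from the handoff format above; the actual
# REQUIRED_FIELDS constant lives earlier in orchestrate.py and may differ.
REQUIRED = ("runId", "agent", "status", "result")

def write_handoff(handoff_dir: Path, run_id: str, agent: str, result: str) -> Path:
    """Simulate an agent finishing a task: write the handoff JSON file."""
    payload = {
        "schemaVersion": "1.0",
        "runId": run_id,
        "agent": agent,
        "status": "complete",
        "result": result,
        "artifacts": [],
        "confidence": "high",
        "notes": "",
        "timestamp": "2026-02-15T00:00:00Z",
    }
    path = handoff_dir / f"{run_id}.json"
    path.write_text(json.dumps(payload, indent=2))
    return path

def validate(data: dict) -> tuple[bool, str]:
    """Same shape of checks as validate_handoff(): field presence + status whitelist."""
    missing = [f for f in REQUIRED if f not in data]
    if missing:
        return False, f"Missing fields: {', '.join(missing)}"
    if data["status"] not in ("complete", "partial", "blocked", "failed"):
        return False, f"Invalid status: '{data['status']}'"
    return True, ""

with tempfile.TemporaryDirectory() as d:
    path = write_handoff(Path(d), "orch-demo-1", "webster", "3 suppliers found")
    ok, err = validate(json.loads(path.read_text()))
print(ok, err)
```

A handoff that round-trips through this pair is exactly what `wait_for_handoff` plus `validate_handoff` expect to find on disk.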


@@ -0,0 +1,7 @@
#!/usr/bin/env bash
# Thin wrapper around orchestrate.py
# Usage: bash orchestrate.sh <agent> "<task>" [options]
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
exec python3 "$SCRIPT_DIR/orchestrate.py" "$@"
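orchestrate.py's retry policy is worth pinning down in isolation: retry on timeout, malformed output, reported failure, or a low-confidence partial, but never past the attempt cap. A stdlib sketch mirroring `should_retry()`:

```python
def should_retry(result, attempt, max_retries):
    """Mirror of orchestrate.py's retry policy."""
    if attempt >= max_retries:
        return False, "Max retries reached"
    if result is None:  # timeout: no handoff file ever appeared
        return True, "timeout"
    status = result.get("status", "")
    if status == "malformed":
        return True, "malformed response"
    if status == "failed":
        return True, f"agent failed: {result.get('notes', '')}"
    if status == "partial" and result.get("confidence") == "low":
        return True, "partial with low confidence"
    if status == "error":
        return True, f"error: {result.get('notes', '')}"
    return False, ""

# A complete result never retries; a timeout retries until attempts run out.
print(should_retry({"status": "complete"}, 1, 3))  # → (False, '')
print(should_retry(None, 1, 3))                    # → (True, 'timeout')
print(should_retry(None, 3, 3))                    # → (False, 'Max retries reached')
```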


@@ -0,0 +1,437 @@
#!/usr/bin/env python3
"""YAML workflow engine for Atomizer orchestration."""
from __future__ import annotations
import argparse
import json
import os
import re
import subprocess
import sys
import threading
import time
import uuid
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime, timezone
from pathlib import Path
from typing import Any
try:
import yaml
except ImportError:
print(json.dumps({"status": "error", "error": "PyYAML is required (pip install pyyaml)"}, indent=2))
sys.exit(1)
WORKFLOWS_DIR = Path("/home/papa/atomizer/workspaces/shared/workflows")
ORCHESTRATE_PY = Path("/home/papa/atomizer/workspaces/shared/skills/orchestrate/orchestrate.py")
HANDOFF_WORKFLOWS_DIR = Path("/home/papa/atomizer/handoffs/workflows")
def now_iso() -> str:
return datetime.now(timezone.utc).isoformat()
def parse_inputs(items: list[str]) -> dict[str, Any]:
parsed: dict[str, Any] = {}
for item in items:
if "=" not in item:
raise ValueError(f"Invalid --input '{item}', expected key=value")
k, v = item.split("=", 1)
parsed[k.strip()] = v.strip()
return parsed
def resolve_workflow_path(name_or_path: str) -> Path:
p = Path(name_or_path)
if p.exists():
return p
candidates = [WORKFLOWS_DIR / name_or_path, WORKFLOWS_DIR / f"{name_or_path}.yaml", WORKFLOWS_DIR / f"{name_or_path}.yml"]
for c in candidates:
if c.exists():
return c
raise FileNotFoundError(f"Workflow not found: {name_or_path}")
def load_workflow(path: Path) -> dict[str, Any]:
data = yaml.safe_load(path.read_text())
if not isinstance(data, dict):
raise ValueError("Workflow YAML must be an object")
if not isinstance(data.get("steps"), list) or not data["steps"]:
raise ValueError("Workflow must define non-empty 'steps'")
return data
def validate_graph(steps: list[dict[str, Any]]) -> tuple[dict[str, dict[str, Any]], dict[str, set[str]], dict[str, set[str]], list[list[str]]]:
step_map: dict[str, dict[str, Any]] = {}
deps: dict[str, set[str]] = {}
reverse: dict[str, set[str]] = {}
for step in steps:
sid = step.get("id")
if not sid or not isinstance(sid, str):
raise ValueError("Each step needs string 'id'")
if sid in step_map:
raise ValueError(f"Duplicate step id: {sid}")
step_map[sid] = step
deps[sid] = set(step.get("depends_on", []) or [])
reverse[sid] = set()
for sid, dset in deps.items():
for dep in dset:
if dep not in step_map:
raise ValueError(f"Step '{sid}' depends on unknown step '{dep}'")
reverse[dep].add(sid)
# topological layering + cycle check
indeg = {sid: len(dset) for sid, dset in deps.items()}
ready = sorted([sid for sid, d in indeg.items() if d == 0])
visited = 0
layers: list[list[str]] = []
while ready:
layer = list(ready)
layers.append(layer)
visited += len(layer)
next_ready: list[str] = []
for sid in layer:
for child in sorted(reverse[sid]):
indeg[child] -= 1
if indeg[child] == 0:
next_ready.append(child)
ready = sorted(next_ready)
if visited != len(step_map):
cycle_nodes = [sid for sid, d in indeg.items() if d > 0]
raise ValueError(f"Dependency cycle detected involving: {', '.join(cycle_nodes)}")
return step_map, deps, reverse, layers
_VAR_RE = re.compile(r"\{([^{}]+)\}")
def substitute(text: str, step_outputs: dict[str, Any], inputs: dict[str, Any]) -> str:
def repl(match: re.Match[str]) -> str:
key = match.group(1).strip()
if key.startswith("inputs."):
iv = key.split(".", 1)[1]
if iv not in inputs:
return match.group(0)
return str(inputs[iv])
if key in step_outputs:
val = step_outputs[key]
if isinstance(val, (dict, list)):
return json.dumps(val, ensure_ascii=False)
return str(val)
return match.group(0)
return _VAR_RE.sub(repl, text)
def approval_check(step: dict[str, Any], non_interactive: bool) -> bool:
gate = step.get("approval_gate")
if not gate:
return True
if non_interactive:
print(f"WARNING: non-interactive mode, skipping approval gate '{gate}' for step '{step['id']}'", file=sys.stderr)
return True
print(f"Approval gate required for step '{step['id']}' ({gate}). Approve? [yes/no]: ", end="", flush=True)
response = sys.stdin.readline().strip().lower()
return response in {"y", "yes"}
def run_orchestrate(agent: str, task: str, timeout_s: int, caller: str, workflow_run_id: str, step_id: str, retries: int) -> dict[str, Any]:
cmd = [
"python3", str(ORCHESTRATE_PY),
agent,
task,
"--timeout", str(timeout_s),
"--caller", caller,
"--workflow-id", workflow_run_id,
"--step-id", step_id,
"--retries", str(max(1, retries)),
"--format", "json",
]
    try:
        # Wall-clock guard: orchestrate.py enforces its own per-attempt timeout,
        # but protect the workflow engine against a hung child process too.
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_s * max(1, retries) + 60)
    except subprocess.TimeoutExpired:
        return {
            "status": "failed",
            "result": None,
            "notes": f"orchestrate.py did not return within the wall-clock budget",
            "exitCode": -1,
        }
    out = (proc.stdout or "").strip()
    if not out:
        return {
            "status": "failed",
            "result": None,
            "notes": f"No stdout from orchestrate.py; stderr: {(proc.stderr or '').strip()[:1000]}",
            "exitCode": proc.returncode,
        }
try:
data = json.loads(out)
except json.JSONDecodeError:
return {
"status": "failed",
"result": out,
"notes": f"Non-JSON response from orchestrate.py; stderr: {(proc.stderr or '').strip()[:1000]}",
"exitCode": proc.returncode,
}
data["exitCode"] = proc.returncode
if proc.stderr:
data["stderr"] = proc.stderr.strip()[:2000]
return data
def validation_passed(validation_result: dict[str, Any]) -> bool:
if validation_result.get("status") not in {"complete", "partial"}:
return False
body = str(validation_result.get("result", "")).strip()
# If validator returned JSON in result, try to parse decision.
try:
obj = json.loads(body)
decision = str(obj.get("decision", "")).lower()
if decision in {"accept", "approved", "pass", "passed"}:
return True
if decision in {"reject", "fail", "failed"}:
return False
except Exception:
pass
    lowered = body.lower()
    # Coarse plain-text fallback: substring matching ("failure modes" would trip
    # it), so validators should prefer the JSON decision format parsed above.
    if "reject" in lowered or "fail" in lowered:
        return False
return True
def execute_step(
step: dict[str, Any],
inputs: dict[str, Any],
step_outputs: dict[str, Any],
caller: str,
workflow_run_id: str,
remaining_timeout: int,
non_interactive: bool,
out_dir: Path,
lock: threading.Lock,
) -> dict[str, Any]:
sid = step["id"]
start = time.time()
if not approval_check(step, non_interactive):
result = {
"step_id": sid,
"status": "failed",
"error": "approval_denied",
"started_at": now_iso(),
"finished_at": now_iso(),
"duration_s": 0,
}
(out_dir / f"{sid}.json").write_text(json.dumps(result, indent=2))
return result
task = substitute(str(step.get("task", "")), step_outputs, inputs)
agent = step.get("agent")
if not agent:
result = {
"step_id": sid,
"status": "failed",
"error": "missing_agent",
"started_at": now_iso(),
"finished_at": now_iso(),
"duration_s": 0,
}
(out_dir / f"{sid}.json").write_text(json.dumps(result, indent=2))
return result
step_timeout = int(step.get("timeout", 300))
timeout_s = max(1, min(step_timeout, remaining_timeout))
retries = int(step.get("retries", 1))
run_res = run_orchestrate(agent, task, timeout_s, caller, workflow_run_id, sid, retries)
step_result: dict[str, Any] = {
"step_id": sid,
"agent": agent,
"status": run_res.get("status", "failed"),
"result": run_res.get("result"),
"notes": run_res.get("notes"),
"run": run_res,
"started_at": datetime.fromtimestamp(start, tz=timezone.utc).isoformat(),
"finished_at": now_iso(),
"duration_s": round(time.time() - start, 3),
}
validation_cfg = step.get("validation")
if validation_cfg and step_result["status"] in {"complete", "partial"}:
v_agent = validation_cfg.get("agent")
criteria = validation_cfg.get("criteria", "Validate this output for quality and correctness.")
if v_agent:
v_task = (
"Validate the following workflow step output. Return a decision in JSON like "
"{\"decision\":\"accept|reject\",\"reason\":\"...\"}.\n\n"
f"Step ID: {sid}\n"
f"Criteria: {criteria}\n\n"
f"Output to validate:\n{step_result.get('result')}"
)
v_timeout = int(validation_cfg.get("timeout", min(180, timeout_s)))
validation_res = run_orchestrate(v_agent, v_task, max(1, v_timeout), caller, workflow_run_id, f"{sid}__validation", 1)
step_result["validation"] = validation_res
if not validation_passed(validation_res):
step_result["status"] = "failed"
step_result["error"] = "validation_failed"
step_result["notes"] = f"Validation failed by {v_agent}: {validation_res.get('result') or validation_res.get('notes')}"
with lock:
(out_dir / f"{sid}.json").write_text(json.dumps(step_result, indent=2))
return step_result
def main() -> None:
parser = argparse.ArgumentParser(description="Run YAML workflows using orchestrate.py")
parser.add_argument("workflow")
parser.add_argument("--input", action="append", default=[], help="key=value (repeatable)")
parser.add_argument("--caller", default="manager")
parser.add_argument("--dry-run", action="store_true")
parser.add_argument("--non-interactive", action="store_true")
parser.add_argument("--timeout", type=int, default=1800, help="Overall workflow timeout seconds")
args = parser.parse_args()
wf_path = resolve_workflow_path(args.workflow)
wf = load_workflow(wf_path)
inputs = parse_inputs(args.input)
steps = wf["steps"]
step_map, deps, reverse, layers = validate_graph(steps)
workflow_run_id = f"wf-{int(time.time())}-{uuid.uuid4().hex[:8]}"
out_dir = HANDOFF_WORKFLOWS_DIR / workflow_run_id
out_dir.mkdir(parents=True, exist_ok=True)
if args.dry_run:
plan = {
"status": "dry_run",
"workflow": wf.get("name", wf_path.name),
"workflow_file": str(wf_path),
"workflow_run_id": workflow_run_id,
"inputs": inputs,
"steps": [
{
"id": s["id"],
"agent": s.get("agent"),
"depends_on": s.get("depends_on", []),
"timeout": s.get("timeout", 300),
"retries": s.get("retries", 1),
"approval_gate": s.get("approval_gate"),
"has_validation": bool(s.get("validation")),
}
for s in steps
],
"execution_layers": layers,
"result_dir": str(out_dir),
}
print(json.dumps(plan, indent=2))
return
started = time.time()
deadline = started + args.timeout
lock = threading.Lock()
state: dict[str, str] = {sid: "pending" for sid in step_map}
step_results: dict[str, dict[str, Any]] = {}
step_outputs: dict[str, Any] = {}
overall_status = "complete"
max_workers = max(1, min(len(step_map), (os.cpu_count() or 4)))
while True:
if time.time() >= deadline:
overall_status = "timeout"
break
pending = [sid for sid, st in state.items() if st == "pending"]
if not pending:
break
ready = []
for sid in pending:
if all(state[d] in {"complete", "skipped"} for d in deps[sid]):
ready.append(sid)
if not ready:
# deadlock due to upstream abort/fail on pending deps
if any(st == "aborted" for st in state.values()):
break
overall_status = "failed"
break
futures = {}
with ThreadPoolExecutor(max_workers=max_workers) as pool:
for sid in ready:
state[sid] = "running"
remaining_timeout = int(max(1, deadline - time.time()))
futures[pool.submit(
execute_step,
step_map[sid],
inputs,
step_outputs,
args.caller,
workflow_run_id,
remaining_timeout,
args.non_interactive,
out_dir,
lock,
)] = sid
for fut in as_completed(futures):
sid = futures[fut]
res = fut.result()
step_results[sid] = res
st = res.get("status", "failed")
if st in {"complete", "partial"}:
state[sid] = "complete"
step_outputs[sid] = res.get("result")
out_name = step_map[sid].get("output")
if out_name:
step_outputs[str(out_name)] = res.get("result")
else:
on_fail = str(step_map[sid].get("on_fail", "abort")).lower()
if on_fail == "skip":
state[sid] = "skipped"
overall_status = "partial"
else:
state[sid] = "failed"
overall_status = "failed"
# abort all pending steps
for psid in list(state):
if state[psid] == "pending":
state[psid] = "aborted"
finished = time.time()
if overall_status == "complete" and any(st == "skipped" for st in state.values()):
overall_status = "partial"
summary = {
"status": overall_status,
"workflow": wf.get("name", wf_path.name),
"workflow_file": str(wf_path),
"workflow_run_id": workflow_run_id,
"caller": args.caller,
"started_at": datetime.fromtimestamp(started, tz=timezone.utc).isoformat(),
"finished_at": datetime.fromtimestamp(finished, tz=timezone.utc).isoformat(),
"duration_s": round(finished - started, 3),
"timeout_s": args.timeout,
"inputs": inputs,
"state": state,
"results": step_results,
"result_dir": str(out_dir),
"notifications": wf.get("notifications", {}),
}
(out_dir / "summary.json").write_text(json.dumps(summary, indent=2))
print(json.dumps(summary, indent=2))
if overall_status in {"complete", "partial"}:
sys.exit(0)
sys.exit(1)
if __name__ == "__main__":
main()
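`validate_graph()` layers the step DAG with Kahn's algorithm: repeatedly emit every zero-in-degree step as one parallel layer, then decrement its children's in-degrees; anything left unvisited is on a cycle. A self-contained sketch of the same layering on a fan-out/fan-in graph:

```python
def layer_steps(deps):
    """Kahn-style layering: deps maps step id -> set of prerequisite ids."""
    reverse = {sid: set() for sid in deps}
    for sid, dset in deps.items():
        for dep in dset:
            reverse[dep].add(sid)
    indeg = {sid: len(dset) for sid, dset in deps.items()}
    ready = sorted(sid for sid, d in indeg.items() if d == 0)
    layers, visited = [], 0
    while ready:
        layers.append(list(ready))
        visited += len(ready)
        nxt = []
        for sid in ready:
            for child in sorted(reverse[sid]):
                indeg[child] -= 1
                if indeg[child] == 0:
                    nxt.append(child)
        ready = sorted(nxt)
    if visited != len(deps):
        raise ValueError("cycle detected")
    return layers

# research fans out to two parallel reviews, which join at audit
deps = {
    "research": set(),
    "tech_review": {"research"},
    "opt_review": {"research"},
    "audit": {"tech_review", "opt_review"},
}
print(layer_steps(deps))
# → [['research'], ['opt_review', 'tech_review'], ['audit']]
```

Steps in the same layer have no path between them, which is what lets the engine hand a whole layer to the thread pool at once.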


@@ -0,0 +1,2 @@
#!/usr/bin/env bash
exec python3 "$(dirname "$0")/workflow.py" "$@"
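Step tasks are templated with single-brace placeholders: `{inputs.key}` pulls a workflow input, `{step_id}` (or an `output:` alias) pulls an upstream step's result, and unknown keys pass through untouched rather than raising. A sketch of `substitute()`:

```python
import json
import re

_VAR_RE = re.compile(r"\{([^{}]+)\}")

def substitute(text, step_outputs, inputs):
    """Single-pass template expansion; unknown placeholders pass through."""
    def repl(m):
        key = m.group(1).strip()
        if key.startswith("inputs."):
            name = key.split(".", 1)[1]
            return str(inputs[name]) if name in inputs else m.group(0)
        if key in step_outputs:
            val = step_outputs[key]
            # Structured outputs are embedded as JSON text
            return json.dumps(val) if isinstance(val, (dict, list)) else str(val)
        return m.group(0)
    return _VAR_RE.sub(repl, text)

out = substitute(
    "Evaluate {inputs.materials} using {research}; {unknown} stays",
    {"research": "CTE table"},
    {"materials": "Invar, Zerodur"},
)
print(out)  # → Evaluate Invar, Zerodur using CTE table; {unknown} stays
```

Leaving unknown placeholders intact means a literal `{...}` in a task prompt survives, at the cost of silent typos in variable names.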


@@ -0,0 +1,57 @@
name: Design Review
description: Multi-agent design review pipeline
trigger: manual
inputs:
design_description:
type: text
description: "What is being reviewed"
requirements:
type: text
description: "Requirements to review against"
steps:
- id: technical_review
agent: tech-lead
task: |
Perform a technical review of the following design:
DESIGN: {inputs.design_description}
REQUIREMENTS: {inputs.requirements}
Assess: structural adequacy, thermal performance, manufacturability,
and compliance with requirements. Identify risks and gaps.
timeout: 300
- id: optimization_review
agent: optimizer
task: |
Assess optimization potential for the following design:
DESIGN: {inputs.design_description}
REQUIREMENTS: {inputs.requirements}
Identify: parameters that could be optimized, potential weight/cost savings,
and whether a formal optimization study is warranted.
timeout: 300
  # technical_review and optimization_review run in PARALLEL (no depends_on between them)
- id: audit
agent: auditor
task: |
Perform a final quality review combining both the technical and optimization assessments:
TECHNICAL REVIEW:
{technical_review}
OPTIMIZATION REVIEW:
{optimization_review}
Assess completeness, identify conflicts between reviewers, and provide
a consolidated recommendation.
depends_on: [technical_review, optimization_review]
timeout: 180
notifications:
on_complete: "Design review complete"
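Beyond `agent`/`task`/`depends_on`/`timeout`, workflow.py reads several optional per-step fields (`retries`, `output`, `on_fail`, `approval_gate`, `validation`). A fully annotated step, with field names taken from the engine above and values purely illustrative:

```yaml
- id: evaluate
  agent: tech-lead
  task: |
    Evaluate {research} against {inputs.requirements}
  depends_on: [research]
  timeout: 300            # per-step cap, clamped to the remaining workflow budget
  retries: 2              # passed through to orchestrate.py --retries
  output: assessment      # alias: later steps may reference {assessment}
  on_fail: skip           # "abort" (default) cancels pending steps; "skip" continues
  approval_gate: human    # prompts yes/no unless run --non-interactive
  validation:
    agent: auditor        # second agent accepts/rejects this step's result
    criteria: "Check sources and units"
    timeout: 120
```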


@@ -0,0 +1,58 @@
name: Material Trade Study
description: Research, evaluate, and audit material options for optical components
trigger: manual
inputs:
materials:
type: list
description: "Materials to compare"
requirements:
type: text
description: "Performance requirements and constraints"
steps:
- id: research
agent: webster
task: |
Research the following materials: {inputs.materials}
For each material, find: CTE (with temperature range), density, Young's modulus,
cost per kg, lead time, availability, and any known issues for optical applications.
Provide sources for all data.
timeout: 180
retries: 2
output: material_data
- id: evaluate
agent: tech-lead
task: |
Evaluate these materials against our requirements:
REQUIREMENTS:
{inputs.requirements}
MATERIAL DATA:
{research}
Provide a recommendation with full rationale. Include a comparison matrix.
depends_on: [research]
timeout: 300
retries: 1
output: technical_assessment
- id: audit
agent: auditor
task: |
Review this material trade study for completeness, methodological rigor,
and potential gaps:
{evaluate}
Check: Are all requirements addressed? Are sources credible?
Are there materials that should have been considered but weren't?
depends_on: [evaluate]
timeout: 180
output: audit_result
notifications:
on_complete: "Workflow complete"
on_failure: "Workflow failed"


@@ -0,0 +1,29 @@
name: Quick Research
description: Fast web research with technical validation
trigger: manual
inputs:
query:
type: text
description: "Research question"
steps:
- id: research
agent: webster
task: "{inputs.query}"
timeout: 120
retries: 1
- id: validate
agent: tech-lead
task: |
Verify these research findings are accurate and relevant for engineering use:
{research}
Flag any concerns about accuracy, missing context, or applicability.
depends_on: [research]
timeout: 180
notifications:
on_complete: "Research complete"
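Workflow inputs such as `query` above arrive on the command line as repeatable `--input key=value` flags; `parse_inputs()` splits on the first `=` only, so values may themselves contain `=`. A sketch mirroring that parser:

```python
def parse_inputs(items):
    """Mirror of workflow.py's parse_inputs: split each item on the first '=' only."""
    parsed = {}
    for item in items:
        if "=" not in item:
            raise ValueError(f"Invalid --input '{item}', expected key=value")
        k, v = item.split("=", 1)
        parsed[k.strip()] = v.strip()
    return parsed

print(parse_inputs(["query=CTE of Invar", "eq=a=b"]))
# → {'query': 'CTE of Invar', 'eq': 'a=b'}
```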