# 🔧 08 — System Implementation Status
> How the multi-agent system actually works right now, as built.
> Last updated: 2026-02-15
---
## 1. Architecture Overview
**Multi-Instance Cluster:** 8 independent OpenClaw gateway processes, one per agent. Each has its own systemd service, Discord bot token, port, and state directory.
```
┌──────────────────────────────────────────────────────────────────┐
│ T420 (clawdbot) │
│ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ OpenClaw Gateway — Mario (main instance) │ │
│ │ Port 18789 │ Slack: Antoine's personal workspace │ │
│ │ State: ~/.openclaw/ │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────── Atomizer Cluster ────────────────────────┐ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ Manager │ │ Tech Lead │ │ Secretary │ │ │
│ │ │ :18800 │ │ :18804 │ │ :18808 │ │ │
│ │ │ Opus 4.6 │ │ Opus 4.6 │ │ Gemini 2.5 │ │ │
│ │ └──────┬───────┘ └──────┬──────┘ └──────┬───────┘ │ │
│ │ │ │ │ │ │
│ │ ┌──────┴───────┐ ┌─────┴──────┐ ┌──────┴───────┐ │ │
│ │ │ Auditor │ │ Optimizer │ │ Study Builder│ │ │
│ │ │ :18812 │ │ :18816 │ │ :18820 │ │ │
│ │ │ Gemini 2.5 │ │ Sonnet 4.5 │ │ Gemini 2.5 │ │ │
│ │ └──────────────┘ └────────────┘ └──────────────┘ │ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ NX Expert │ │ Webster │ │ │
│ │ │ :18824 │ │ :18828 │ │ │
│ │ │ Sonnet 4.5 │ │ Gemini 2.5 │ │ │
│ │ └─────────────┘ └─────────────┘ │ │
│ │ │ │
│ │ Inter-agent: hooks API (curl between ports) │ │
│ │ Shared token: 31422bb39bc9e7a4d34f789d8a7cbc582dece8dd… │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────┐
│ Discord: Atomizer-HQ Server │
│ Guild: 1471858733452890132 │
│ │
│ 📋 COMMAND: #ceo-office, #announcements, #daily-standup │
│ 🔧 ENGINEERING: #technical, #code-review, #fea-analysis, #nx │
│ 📊 OPERATIONS: #task-board, #meeting-notes, #reports │
│ 🔬 RESEARCH: #literature, #materials-data │
│ 🏗️ PROJECTS: #active-projects │
│ 📚 KNOWLEDGE: #knowledge-base, #lessons-learned │
│ 🤖 SYSTEM: #agent-logs, #inter-agent, #it-ops │
│ │
│ Each agent = its own Discord bot with unique name & avatar │
└──────────────────────────────────────────────────────────────────┘
```
---
## 2. Why Multi-Instance (Not Single Gateway)
OpenClaw's native Discord provider (`@buape/carbon`) has a race condition bug when multiple bot tokens connect from one process. Since we need 8 separate bot accounts, we run 8 separate processes — each handles exactly one token, bypassing the bug entirely.
**Advantages over the previous bridge approach:**
- Native Discord streaming, threads, reactions, attachments
- Fault isolation — one agent crashing doesn't take down the others
- No middleware polling session files on disk
- Each agent appears as its own Discord user with independent presence
---
## 3. Port Map
| Agent | Port | Model | Notes |
|-------|------|-------|-------|
| Manager | 18800 | Opus 4.6 | Orchestrates, delegates. Heartbeat disabled (Discord delivery bug) |
| Tech Lead | 18804 | Opus 4.6 | Technical authority |
| Secretary | 18808 | Gemini 2.5 Pro | Task tracking, notes. Changed from Codex 2026-02-15 (OAuth expired) |
| Auditor | 18812 | Gemini 2.5 Pro | Quality review. Changed from Codex 2026-02-15 (OAuth expired) |
| Optimizer | 18816 | Sonnet 4.5 | Optimization work |
| Study Builder | 18820 | Gemini 2.5 Pro | Study setup. Changed from Codex 2026-02-15 (OAuth expired) |
| NX Expert | 18824 | Sonnet 4.5 | CAD/NX work |
| Webster | 18828 | Gemini 2.5 Pro | Research. Heartbeat disabled (Discord delivery bug) |
> **⚠️ Port spacing = 4.** OpenClaw uses port N AND N+3 (browser service). Never assign adjacent ports.
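The spacing rule can be checked mechanically before a new agent is added. The following is a minimal sketch, not part of the deployed tooling: `MIN_GAP`, `spacing_ok`, and `check_port_map` are hypothetical names, and the port list is copied from the table above.

```bash
#!/usr/bin/env bash
# Hypothetical check of the spacing rule: base ports must be >= MIN_GAP apart,
# so that no instance's N or N+3 lands on another instance's ports.

MIN_GAP=4
PORTS=(18800 18804 18808 18812 18816 18820 18824 18828)  # from the port map

# spacing_ok A B -> exit 0 if the two base ports are at least MIN_GAP apart.
spacing_ok() {
  local diff=$(( $1 - $2 ))
  [ "${diff#-}" -ge "$MIN_GAP" ]   # ${diff#-} strips the sign (absolute value)
}

# check_port_map PORT... -> prints every pair that is too close; exit 1 if any.
check_port_map() {
  local ports=("$@") i j status=0
  for ((i = 0; i < ${#ports[@]}; i++)); do
    for ((j = i + 1; j < ${#ports[@]}; j++)); do
      if ! spacing_ok "${ports[i]}" "${ports[j]}"; then
        echo "too close: ${ports[i]} and ${ports[j]}"
        status=1
      fi
    done
  done
  return "$status"
}

check_port_map "${PORTS[@]}" && echo "port map OK"
```

Checking a minimum gap rather than only the N/N+3 pair is deliberately conservative, matching the "never assign adjacent ports" guidance.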
---
## 4. Systemd Setup
### Template Service
File: `~/.config/systemd/user/openclaw-atomizer@.service`
```ini
[Unit]
# %i expands to the instance name, e.g. "manager" for openclaw-atomizer@manager.service
Description=OpenClaw Atomizer - %i
After=network.target
# Restart rate limit: at most 5 starts per 60 s. These keys belong in [Unit].
StartLimitIntervalSec=60
StartLimitBurst=5
[Service]
Type=simple
ExecStart=/usr/bin/node /home/papa/.local/lib/node_modules/openclaw/dist/index.js gateway
Environment=PATH=/home/papa/.local/bin:/usr/local/bin:/usr/bin:/bin
Environment=HOME=/home/papa
# Per-instance state and config, keyed by the instance name
Environment=OPENCLAW_STATE_DIR=/home/papa/atomizer/instances/%i
Environment=OPENCLAW_CONFIG_PATH=/home/papa/atomizer/instances/%i/openclaw.json
# Shared token used for inter-agent hooks calls
Environment=OPENCLAW_GATEWAY_TOKEN=31422bb39bc9e7a4d34f789d8a7cbc582dece8dd170dadd1
EnvironmentFile=/home/papa/atomizer/instances/%i/env
EnvironmentFile=/home/papa/atomizer/config/.discord-tokens.env
Restart=always
RestartSec=5
[Install]
WantedBy=default.target
```
### Cluster Management Script
File: `~/atomizer/cluster.sh`
```bash
# Start all: bash cluster.sh start
# Stop all: bash cluster.sh stop
# Restart all: bash cluster.sh restart
# Status: bash cluster.sh status
# Logs: bash cluster.sh logs [agent-name]
```
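For orientation, a wrapper with these subcommands could look roughly like the sketch below. This is an assumption about its shape, not the actual contents of `cluster.sh`: the function names and the `DRY_RUN` switch are hypothetical, the agent names follow the `instances/` directories, and the unit name follows the template service above.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a cluster.sh-style wrapper; the real script may differ.
# Set DRY_RUN=1 to print the systemctl/journalctl commands instead of running them.

AGENTS=(manager tech-lead secretary auditor optimizer study-builder nx-expert webster)

# unit_for AGENT -> systemd unit name instantiated from the template service.
unit_for() {
  printf 'openclaw-atomizer@%s.service\n' "$1"
}

# run CMD... -> execute the command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# cluster ACTION [AGENT] -> apply ACTION to every agent, or show one agent's logs.
cluster() {
  local action=$1 agent
  case "$action" in
    start|stop|restart)
      for agent in "${AGENTS[@]}"; do
        run systemctl --user "$action" "$(unit_for "$agent")"
      done
      ;;
    status)
      for agent in "${AGENTS[@]}"; do
        run systemctl --user is-active "$(unit_for "$agent")"
      done
      ;;
    logs)
      run journalctl --user -u "$(unit_for "${2:-manager}")" -n 50 --no-pager
      ;;
    *)
      echo "usage: cluster {start|stop|restart|status|logs [agent]}" >&2
      return 2
      ;;
  esac
}

# Preview what a restart would do without touching any service.
DRY_RUN=1 cluster restart
```

The dry-run switch makes it possible to review the exact commands before they touch the running cluster.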
---
## 5. File System Layout
```
~/atomizer/
├── cluster.sh ← Cluster management script
├── config/
│ ├── .discord-tokens.env ← All 8 bot tokens (env vars)
│ └── atomizer-discord.env ← Legacy (can remove)
├── instances/ ← Per-agent OpenClaw state
│ ├── manager/
│ │ ├── openclaw.json ← Agent config (1 agent per instance)
│ │ ├── env ← Instance-specific env vars
│ │ └── agents/main/sessions/ ← Session data (auto-created)
│ ├── tech-lead/
│ ├── secretary/
│ ├── auditor/
│ ├── optimizer/
│ ├── study-builder/
│ ├── nx-expert/
│ └── webster/
├── workspaces/ ← Agent workspaces (SOUL, AGENTS, memory)
│ ├── manager/
│ │ ├── SOUL.md
│ │ ├── AGENTS.md
│ │ ├── MEMORY.md
│ │ └── memory/
│ ├── secretary/
│ ├── technical-lead/
│ ├── auditor/
│ ├── optimizer/
│ ├── study-builder/
│ ├── nx-expert/
│ ├── webster/
│ └── shared/ ← Shared context (CLUSTER.md, protocols)
└── tools/
└── nxopen-mcp/ ← NX Open MCP server (for CAD)
```
**Key distinction:** `instances/` = OpenClaw runtime state (configs, sessions, SQLite). `workspaces/` = agent personality and memory (SOUL.md, AGENTS.md, etc.).
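A quick sanity check of the `instances/` half of this layout can be sketched as follows. The helper names and the `ATOMIZER_ROOT` override are hypothetical; the required files are the two shown in the tree (`openclaw.json` and `env`).

```bash
#!/usr/bin/env bash
# Hypothetical layout check: every instance directory should contain
# openclaw.json and env. ATOMIZER_ROOT is an assumed override for testing.

ATOMIZER_ROOT="${ATOMIZER_ROOT:-$HOME/atomizer}"
AGENTS=(manager tech-lead secretary auditor optimizer study-builder nx-expert webster)

# missing_files AGENT -> prints each required file that is absent for AGENT.
missing_files() {
  local dir="$ATOMIZER_ROOT/instances/$1" f
  for f in openclaw.json env; do
    [ -f "$dir/$f" ] || echo "$dir/$f"
  done
}

# check_layout -> lists all missing files; exit 1 if anything is missing.
check_layout() {
  local agent missing status=0
  for agent in "${AGENTS[@]}"; do
    missing=$(missing_files "$agent")
    if [ -n "$missing" ]; then
      echo "$missing"
      status=1
    fi
  done
  return "$status"
}

# Usage: check_layout || echo "layout incomplete under $ATOMIZER_ROOT" >&2
```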
---
## 6. Inter-Agent Communication
### Delegation Skill (Primary Method)
Manager and Tech Lead use the `delegate` skill to assign tasks to other agents. The skill wraps the OpenClaw Hooks API with port mapping, auth, error handling, and logging.
**Location:** `/home/papa/atomizer/workspaces/shared/skills/delegate/`
**Installed on:** Manager, Tech Lead (symlinked from shared)
```bash
# Usage
bash /home/papa/atomizer/workspaces/shared/skills/delegate/delegate.sh <agent> "<instruction>" [options]
# Examples
delegate.sh webster "Find CTE of Zerodur Class 0 between 20-40°C"
delegate.sh nx-expert "Mesh the M2 mirror" --channel C0AEJV13TEU --deliver
delegate.sh auditor "Review thermal analysis" --no-deliver
```
**How it works:**
1. Looks up the target agent's port from hardcoded port map
2. Checks if the target is running
3. POSTs to `http://127.0.0.1:PORT/hooks/agent` with auth token
4. Target agent processes the task asynchronously in an isolated session
5. Response delivered to Discord if `--deliver` is set
**Options:** `--channel <id>`, `--deliver` (default), `--no-deliver`
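The first three steps can be sketched as below. This is an illustration under stated assumptions, not the contents of `delegate.sh`: the function names are hypothetical, the token is assumed to be exported as `OPENCLAW_GATEWAY_TOKEN`, and the payload uses only the fields shown in the raw hooks call in this document. How `--channel` maps into the request is not covered here.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the delegate flow; the real delegate.sh may differ.

# port_for AGENT -> base port from the port map; exit 1 for unknown agents.
port_for() {
  case "$1" in
    manager)       echo 18800 ;;
    tech-lead)     echo 18804 ;;
    secretary)     echo 18808 ;;
    auditor)       echo 18812 ;;
    optimizer)     echo 18816 ;;
    study-builder) echo 18820 ;;
    nx-expert)     echo 18824 ;;
    webster)       echo 18828 ;;
    *)             return 1 ;;
  esac
}

# json_escape TEXT -> TEXT with backslashes and double quotes escaped.
# Newlines and control characters are not handled in this sketch.
json_escape() {
  local s=$1
  s=${s//\\/\\\\}
  s=${s//\"/\\\"}
  printf '%s' "$s"
}

# build_payload MESSAGE DELIVER -> JSON body (DELIVER is the literal true or false).
build_payload() {
  printf '{"message": "%s", "deliver": %s, "channel": "discord"}' \
    "$(json_escape "$1")" "$2"
}

# delegate AGENT MESSAGE [true|false] -> POST the task to the agent's hooks endpoint.
# With -f, curl exits non-zero on connection refusal or an HTTP error, which
# doubles as a crude "is the target running" check in this sketch.
delegate() {
  local port
  port=$(port_for "$1") || { echo "unknown agent: $1" >&2; return 1; }
  [ -n "${OPENCLAW_GATEWAY_TOKEN:-}" ] || { echo "OPENCLAW_GATEWAY_TOKEN not set" >&2; return 1; }
  curl -sf -X POST "http://127.0.0.1:${port}/hooks/agent" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer ${OPENCLAW_GATEWAY_TOKEN}" \
    -d "$(build_payload "$2" "${3:-true}")"
}
```

Reading the token from the environment keeps it out of the script and out of shell history.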
### Delegation Authority
| Agent | Can Delegate To |
|-------|----------------|
| Manager | All agents |
| Tech Lead | All agents except Manager |
| All others | Cannot delegate — request via Manager or Tech Lead |
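The authority table can be expressed as a small guard that a script runs before posting a task. This is a hypothetical sketch: `can_delegate` is an illustrative name, and the table does not say whether an agent may delegate to itself, so that case is not treated specially.

```bash
# can_delegate FROM TO -> exit 0 if FROM may delegate to TO per the table above.
can_delegate() {
  local from=$1 to=$2
  case "$from" in
    manager)   return 0 ;;                # Manager: all agents
    tech-lead) [ "$to" != "manager" ] ;;  # Tech Lead: everyone except Manager
    *)         return 1 ;;                # all others: must ask Manager or Tech Lead
  esac
}

# Example: Tech Lead may not delegate upward to the Manager.
if can_delegate tech-lead manager; then echo "allowed"; else echo "denied"; fi
```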
### Hooks Protocol
All agents follow `/home/papa/atomizer/workspaces/shared/HOOKS-PROTOCOL.md`:
- Hook messages = **high-priority assignments**, processed before other work
- After completing tasks, agents **append** status to `shared/project_log.md`
- Only the Manager updates `shared/PROJECT_STATUS.md` (gatekeeper pattern)
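The append-only rule for `project_log.md` could be implemented with a helper along these lines. It is a sketch only: the function name, the entry format, and the default log path (assumed to sit next to `HOOKS-PROTOCOL.md`) are assumptions, not part of the documented protocol.

```bash
#!/usr/bin/env bash
# Hypothetical helper for the append-only rule; entry format and path are assumed.

PROJECT_LOG="${PROJECT_LOG:-$HOME/atomizer/workspaces/shared/project_log.md}"

# log_status AGENT MESSAGE -> append one timestamped entry to the shared log.
# Uses >> so existing entries are never rewritten.
log_status() {
  local agent=$1 message=$2
  printf -- '- %s [%s] %s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$agent" "$message" >> "$PROJECT_LOG"
}

# Usage: log_status webster "CTE lookup finished"
```

Because every agent only appends and only the Manager rewrites `PROJECT_STATUS.md`, agents never edit each other's entries.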
### Raw Hooks API (Reference)
The delegate skill wraps this, but for reference:
```bash
curl -s -X POST http://127.0.0.1:PORT/hooks/agent \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 31422bb39bc9e7a4d34f789d8a7cbc582dece8dd170dadd1" \
-d '{"message": "your request here", "deliver": true, "channel": "discord"}'
```
### sessions_send / sessions_spawn
Agents configured with `agentToAgent.enabled: true` can use OpenClaw's built-in `sessions_send` and `sessions_spawn` tools to communicate within the same instance. Cross-instance communication requires the hooks API / delegate skill.
---
## 7. Current Status
### ✅ Working
- All 8 instances running as systemd services (auto-start on boot)
- Each agent has its own Discord bot identity (name, avatar, presence)
- Native Discord features: streaming, typing indicators, message chunking
- Agent workspaces with SOUL.md, AGENTS.md, MEMORY.md
- Hooks API enabled on all instances (Google Gemini + Anthropic auth configured)
- **Delegation skill deployed** — Manager and Tech Lead can delegate tasks to any agent via `delegate.sh`
- **Hooks protocol** — all agents know how to receive and prioritize delegated tasks
- **Gatekeeper pattern** — Manager owns PROJECT_STATUS.md; others append to project_log.md
- Cluster management via `cluster.sh`
- Estimated total RAM: ~4.2GB for 8 instances
### ❌ Known Issues
- ~~**DELEGATE syntax is fake**~~ → ✅ RESOLVED (2026-02-14): Replaced with `delegate.sh` skill using hooks API
- **Discord "Ambiguous recipient" bug** (2026-02-15): OpenClaw Discord plugin requires `user:` or `channel:` prefix for message targets. When heartbeat tries to reply to a session that originated from a Discord DM, it uses the bare user ID → delivery fails. **Workaround:** Heartbeat disabled on Manager + Webster. Other agents unaffected (their sessions don't originate from Discord DMs). Proper fix requires OpenClaw patch to auto-infer `user:` for known user IDs.
- **Codex OAuth expired** (2026-02-15): `refresh_token_reused` error — multiple instances racing to refresh the same shared Codex token. Secretary, Auditor, Study-Builder switched to Gemini 2.5 Pro. To restore Codex: Antoine must re-run `codex login` via SSH tunnel, then run `~/atomizer/scripts/sync-codex-tokens.sh`.
- **Orchestration is not fully automated:** the Manager still initiates delegation manually, though it now has tooling for it (orchestrate.sh, workflow engine)
- **5 agents not yet created:** Post-Processor, Reporter, Developer, Knowledge Base, IT (from the original 13-agent plan)
- **Windows execution bridge** (`atomizer_job_watcher.py`): exists but not connected end-to-end
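For the "Ambiguous recipient" issue above, the normalization a proper fix would need can be illustrated with a small helper that adds the `user:` or `channel:` prefix to a bare ID. This is not the OpenClaw patch, only a hypothetical sketch of the idea; `qualify_target` is an invented name.

```bash
# qualify_target KIND ID -> prefixed target ("user:<id>" or "channel:<id>").
# IDs that already carry a prefix are returned unchanged.
qualify_target() {
  local kind=$1 id=$2
  case "$id" in
    user:*|channel:*) printf '%s\n' "$id"; return 0 ;;
  esac
  case "$kind" in
    user|channel) printf '%s:%s\n' "$kind" "$id" ;;
    *) echo "unknown target kind: $kind" >&2; return 1 ;;
  esac
}

# Usage (placeholder ID): qualify_target user 123
```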
---
## 8. Evolution History
| Date | Phase | What Changed |
|------|-------|-------------|
| 2026-02-07 | Phase 0 | Vision doc created, 13-agent plan designed |
| 2026-02-08 | Phase 0 | Single gateway (port 18790) running on Slack |
| 2026-02-13 | Discord Migration | Discord server created, 8 bot tokens obtained |
| 2026-02-14 (AM) | Bridge Attempt | discord-bridge.js built — worked but fragile (no streaming, polled session files) |
| 2026-02-14 (PM) | **Multi-Instance Cluster** | Pivoted to 8 independent OpenClaw instances. Bridge killed. Native Discord restored. |
| 2026-02-14 (PM) | **Delegation System** | Built `delegate.sh` skill, hooks protocol, gatekeeper pattern. Fake DELEGATE syntax replaced with real hooks API calls. Google Gemini auth added to all instances. |
| 2026-02-15 | **Orchestration Engine** | Phases 1-3 complete: synchronous delegation (`orchestrate.py`), smart routing (capability registry), hierarchical delegation (Tech-Lead + Optimizer can sub-delegate), YAML workflow engine with parallel execution + approval gates. See `10-ORCHESTRATION-ENGINE-PLAN.md`. |
| 2026-02-15 | **Stability Fixes** | Discord heartbeat delivery bug identified (ambiguous recipient). Codex OAuth token expired (refresh_token_reused). Heartbeat disabled on Manager + Webster. Secretary/Auditor/Study-Builder switched from Codex to Gemini 2.5 Pro. HEARTBEAT.md created for all agents. |
---
*Created: 2026-02-14 by Mario*
*This is the "as-built" document — updated as implementation evolves.*