docs: add HQ multi-agent framework documentation from PKM

- Project plan, agent roster, architecture, roadmap
- Decision log, full system plan, Discord setup/migration guides
- System implementation status (as-built)
- Cluster pivot history
- Orchestration engine plan (Phases 1-4)
- Webster and Auditor reviews
2026-02-15 21:44:07 +00:00
parent 3289a76e19
commit cf82de4f06
15 changed files with 6933 additions and 0 deletions

# Review: Orchestration Engine (Plan 10) — V2
> **Reviewer:** Auditor 🔍
> **Date:** 2026-02-14
> **Status:** **CONDITIONAL PASS** (implement required controls before production-critical use)
> **Subject:** `10-ORCHESTRATION-ENGINE-PLAN.md`
---
## Executive Verdict
Mario's architecture is directionally correct and much stronger than fire-and-forget delegation. The three-layer model (Core → Routing → Workflows), structured handoffs, and explicit validation loops are all solid decisions.
However, for production reliability and auditability, this must ship with stricter **state integrity**, **idempotency**, **schema governance**, and **human approval gates** for high-impact actions.
**Bottom line:** Proceed, but only with the must-fix items below integrated into Phase 12.
---
## Findings
### 🔴 Critical (must fix)
1. **No explicit idempotency contract for retries/timeouts**
- Current plan retries on timeout/malformed outputs, but does not define how to prevent duplicate side effects (double posts, repeated downstream actions).
- **Risk:** inconsistent workflow outcomes, duplicate client-facing messages, non-reproducible state.
- **Required fix:** Add `idempotency_key` per step attempt and enforce dedupe on handoff consumption + delivery.
2. **Handoff schema is underspecified for machine validation**
- Fields shown are helpful, but no versioned JSON Schema or strict required/optional policy exists.
- **Risk:** malformed yet “accepted” outputs, brittle parsing, silent failure propagation.
- **Required fix:** versioned schema (`schemaVersion`), strict required fields, validator in `orchestrate.sh` + CI check for schema compatibility.
3. **No hard gate for high-stakes workflow steps**
- Auditor checks are present, but there is no formal “approval required” interrupt before irreversible actions.
- **Risk:** automated progression with incorrect assumptions.
- **Required fix:** add `approval_gate: true` for designated steps (e.g., external deliverables, strategic recommendations).
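In the workflow YAML, such a gate could look like the sketch below. `approval_gate` and `requires_approval` follow the wording used in this review; the step name and the `on_reject` field are illustrative assumptions:

```yaml
- id: final-recommendation
  agent: manager
  task: "Draft strategic recommendation for client"
  approval_gate: true      # hard interrupt before any irreversible action
  requires_approval: ceo   # who may release the gate (manager|ceo)
  on_reject: halt          # illustrative: never auto-progress past a rejection
```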
---
### 🟡 Major (should fix)
1. **State model is split across ad hoc files**
- File-based handoff is fine for MVP, but without a canonical workflow state object, long chains get fragile.
- **Recommendation:** add a per-run `state.json` blackboard (append-only event log + resolved materialized state).
2. **Observability is not yet sufficient for root-cause analysis**
- Metrics are planned later; debugging multi-agent failures without end-to-end trace IDs will be painful.
- **Recommendation:** start now with `workflowRunId`, `stepId`, `attempt`, `agent`, `latencyMs`, `token/cost estimate`, and terminal status.
3. **Channel-context ingestion lacks trust/sanitization policy**
- Discord history can include noisy or unsafe content.
- **Recommendation:** context sanitizer + source tagging + max token window + instruction stripping from untrusted text blocks.
4. **Hierarchical delegation loop prevention is policy-level only**
- Good design intent, but no enforcement mechanism described.
- **Recommendation:** enforce delegation ACL matrix in orchestrator runtime (not only SOUL instructions).
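A runtime enforcement sketch for the delegation ACL matrix (agent names follow the plan; the matrix contents and function shape are assumptions):

```typescript
// Sketch: runtime delegation ACL. The matrix defines who may delegate to whom;
// the chain check hard-blocks loops regardless of what SOUL instructions say.
const DELEGATION_ACL: Record<string, string[]> = {
  manager: ["technical-lead", "webster", "auditor", "secretary"],
  "technical-lead": ["webster", "nx-expert"], // sub-orchestration permission
  webster: [],                                // leaf agent: may not delegate
};

function canDelegate(
  from: string,
  to: string,
  chain: string[] // delegation chain so far, e.g. ["manager", "technical-lead"]
): { ok: boolean; reason?: string } {
  if (!(DELEGATION_ACL[from] ?? []).includes(to)) {
    return { ok: false, reason: `ACL: ${from} may not delegate to ${to}` };
  }
  if (chain.includes(to)) {
    return { ok: false, reason: `loop: ${to} already in chain ${chain.join(" → ")}` };
  }
  return { ok: true };
}
```

Because the check runs in the orchestrator (not in agent prompts), a misbehaving agent cannot talk its way around it.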
---
### 🟢 Minor (nice to fix)
1. Add `result_quality_score` (0–1) from validator for triage and dashboards.
2. Add `artifacts_checksum` to handoff metadata for reproducibility.
3. Add workflow dry-run mode to validate dependency graph and substitutions without execution.
---
## External Pattern Cross-Check (complementary ideas)
Based on architecture patterns in common orchestration ecosystems (LangGraph, AutoGen, CrewAI, Temporal, Prefect, Step Functions):
1. **Durable execution + resumability** (LangGraph/Temporal style)
- Keep execution history and allow resume from last successful step.
2. **Guardrails with bounded retries** (CrewAI/Prefect style)
- You already started this; formalize per-step retry policy and failure classes.
3. **State-machine semantics** (Step Functions style)
- Model each step state explicitly: `pending → running → validated → committed | failed`.
4. **Human-in-the-loop interrupts**
- Introduce pause/approve/reject transitions for critical branches.
5. **Exactly-once consumption where possible**
- At minimum, “at-least-once execution + idempotent effects” should be guaranteed.
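Item 5 can be made concrete with a small dedupe layer on handoff consumption, mirroring the `idempotency_key` fix from the critical findings (the step shape and store are assumptions):

```typescript
// Sketch: at-least-once execution + idempotent effects. A step attempt
// carries a deterministic idempotency key; the consumer records processed
// keys and skips duplicate deliveries instead of re-running side effects.
type StepAttempt = { runId: string; stepId: string; attempt: number; payload: string };

// Same run + step + attempt always maps to the same key, so a redelivered
// attempt is detected as a duplicate.
function idempotencyKey(a: StepAttempt): string {
  return `${a.runId}:${a.stepId}:${a.attempt}`;
}

class HandoffConsumer {
  private processed = new Set<string>();
  private effects: string[] = [];

  consume(a: StepAttempt): boolean {
    const key = idempotencyKey(a);
    if (this.processed.has(key)) return false; // duplicate: no side effect
    this.processed.add(key);
    this.effects.push(a.payload); // the one-and-only side effect
    return true;
  }

  effectCount(): number {
    return this.effects.length;
  }
}
```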
---
## Recommended Minimal Patch Set (before scaling)
1. **Schema + idempotency first**
- `handoff.schema.json` + `idempotency_key` required fields.
2. **Canonical state file per workflow run**
- `handoffs/workflows/<runId>/state.json` as single source of truth.
3. **Enforced ACL delegation matrix**
- Runtime check: who can delegate to whom, hard-block loops.
4. **Approval gates for critical outputs**
- YAML: `requires_approval: manager|ceo`.
5. **Trace-first logging**
- Correlated logs for every attempt and transition.
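For patch item 1, `handoff.schema.json` could take a shape like this (the two named required fields follow this review; the remaining fields are illustrative):

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Handoff",
  "type": "object",
  "required": ["schemaVersion", "workflowRunId", "stepId", "idempotency_key", "status", "result"],
  "additionalProperties": false,
  "properties": {
    "schemaVersion": { "const": "1.0" },
    "workflowRunId": { "type": "string" },
    "stepId": { "type": "string" },
    "idempotency_key": { "type": "string" },
    "status": { "enum": ["validated", "failed"] },
    "result": { "type": "object" }
  }
}
```

With `additionalProperties: false` and strict `required` fields, a malformed handoff fails fast at the validator instead of propagating silently.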
---
## Final Recommendation
**CONDITIONAL PASS**
Implementation can proceed immediately, but production-critical use should wait until the 5-item minimal patch set is in place. The current plan is strong; these controls are what make it reliable under stress.
---
## Suggested Filename Convention
`REVIEW-Orchestration-Engine-Auditor-V2.md`

# Review: Orchestration Engine (Plan 10)
> **Reviewer:** Webster (Research Specialist)
> **Date:** 2026-02-14
> **Status:** Endorsed with Enhancements
> **Subject:** Critique of `10-ORCHESTRATION-ENGINE-PLAN` (Mario Lavoie)
---
## Executive Summary
Mario's proposed "Orchestration Engine: Multi-Instance Intelligence" is a **strong foundational architecture**. It correctly identifies the critical missing piece in our current cluster setup: **synchronous delegation with a structured feedback loop**. Moving from "fire-and-forget" (`delegate.sh`) to a structured "chain-of-command" (`orchestrate.sh`) is the correct evolutionary step for the Atomizer cluster.
The 3-layer architecture (Core → Routing → Workflows) is scalable and robust. The use of file-based handoffs and YAML workflows aligns perfectly with our local-first philosophy.
However, to elevate this from a "good" system to a "world-class" agentic framework, I strongly recommend implementing **Hierarchical Delegation**, **Validation Loops**, and **Shared State Management** immediately, rather than deferring them to Phase 4 or later.
---
## Critical Analysis
### 1. The "Manager Bottleneck" Risk (High)
**Critique:** The plan centralizes *all* orchestration in the Manager ("Manager as sole orchestrator").
**Risk:** This creates a single point of failure and a significant bottleneck. If the Manager is waiting on a long-running research task from Webster, it cannot effectively coordinate other urgent streams (e.g., a Tech-Lead design review). It also risks context overload for the Manager on complex, multi-agent projects.
**Recommendation:** Implement **Hierarchical Delegation**.
- Allow high-level agents (like `Tech-Lead`) to have "sub-orchestration" permissions.
- **Example:** If `Tech-Lead` needs a specific material density check from `Webster` to complete a larger analysis, they should be able to delegate that sub-task directly via `orchestrate.sh` without routing back through the Manager. This mimics a real engineering team structure.
### 2. Lack of "Reflection" or "Critic" Loops (Critical)
**Critique:** The proposed workflows are strictly linear (Step A → Step B → Step C).
**Risk:** "Garbage in, garbage out." If a research step returns hallucinated or irrelevant data, the subsequent technical analysis step will proceed to process it, wasting tokens and time.
**Recommendation:** Add explicit **Validation Steps**.
- Introduce a `critique` phase or a lightweight "Auditor" pass *inside* the workflow definition before moving to the next major stage.
- **Pattern:** Execute Task → Critique Output → (Refine/Retry if score < Threshold) → Proceed.
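That pattern can be sketched as a small loop where the critique is fed back into the retry (the threshold, retry budget, and function shapes are assumptions):

```typescript
// Sketch: execute → critique → (retry with feedback if score < threshold).
// executor and critic are stand-ins for agent calls via orchestrate.sh.
type Critique = { score: number; feedback: string };

function runWithCritique(
  executor: (feedback?: string) => string,
  critic: (output: string) => Critique,
  threshold = 0.8,
  maxRetries = 2
): { output: string; attempts: number } {
  let feedback: string | undefined;
  for (let attempt = 1; attempt <= maxRetries + 1; attempt++) {
    const output = executor(feedback);
    const c = critic(output);
    if (c.score >= threshold) return { output, attempts: attempt };
    feedback = c.feedback; // the critique becomes input to the next attempt
  }
  throw new Error("validation failed after exhausting retries");
}
```

The key design point is that a failed critique is not discarded: it becomes context for the retry, so the loop converges instead of repeating the same mistake.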
### 3. State Management & Context Passing (Medium)
**Critique:** Context is passed explicitly between steps via file paths (`--context /tmp/file.json`).
**Risk:** Managing file paths becomes cumbersome in complex, multi-step workflows (e.g., 10+ steps). It limits the ability for a late-stage agent to easily reference early-stage context without explicit passing.
**Recommendation:** Implement a **Shared "Blackboard" (Workflow State Object)**.
- Create a shared JSON object for the entire workflow run.
- Agents read/write keys to this shared state (e.g., `state['material_costs']`, `state['fea_results']`).
- This decouples step execution from data passing.
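A minimal sketch of such a blackboard object, kept as an append-only event log with a materialized view so the whole run stays replayable (key names and shapes are illustrative):

```typescript
// Sketch: shared workflow state. Agents append events; the materialized
// state is always re-derivable by replaying the log, which keeps long
// chains auditable and resumable.
type StateEvent = { stepId: string; key: string; value: unknown; at: number };

class WorkflowState {
  private log: StateEvent[] = [];

  append(stepId: string, key: string, value: unknown): void {
    this.log.push({ stepId, key, value, at: Date.now() });
  }

  // Materialized view: last write wins per key.
  materialize(): Record<string, unknown> {
    const state: Record<string, unknown> = {};
    for (const e of this.log) state[e.key] = e.value;
    return state;
  }

  // Serialize log + view, e.g. to the run's state.json file.
  toJSON(): string {
    return JSON.stringify({ events: this.log, state: this.materialize() });
  }
}
```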
### 4. Dynamic "Team Construction" (Medium)
**Critique:** Workflow steps hardcode specific agents (e.g., `agent: webster`).
**Recommendation:** Use **Role-Based Execution**.
- Define steps by *role* or *capability* (e.g., `role: researcher`, `capability: web-research`) rather than specific agent IDs.
- The **Smart Router** (Layer 2) can then dynamically select the best available agent at runtime. This allows for load balancing and redundancy (e.g., routing to a backup researcher if Webster is overloaded).
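The router's selection logic could be as simple as this sketch (registry entries and the load metric are assumptions; a real router would read live health signals):

```typescript
// Sketch: role-based routing with load-aware fallback. Steps name a role;
// the router picks the least-loaded registered agent that can fill it.
type AgentInfo = { id: string; roles: string[]; load: number };

const REGISTRY: AgentInfo[] = [
  { id: "webster", roles: ["researcher"], load: 0.9 },           // overloaded
  { id: "backup-researcher", roles: ["researcher"], load: 0.1 }, // idle
  { id: "technical", roles: ["technical-lead"], load: 0.3 },
];

function route(role: string): string {
  const candidates = REGISTRY.filter((a) => a.roles.includes(role));
  if (candidates.length === 0) throw new Error(`no agent registered for role: ${role}`);
  candidates.sort((a, b) => a.load - b.load); // least loaded first
  return candidates[0].id;
}
```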
### 5. Error Handling & "Healing" (Medium)
**Critique:** Error handling is mentioned as a Phase 4 task.
**Recommendation:** **Make it a Phase 1 priority.**
- LLMs and external tools (web search) are non-deterministic and prone to occasional failure.
- Add `max_retries` and `fallback_strategy` fields to the YAML definition immediately.
---
## Proposed Enhancement: "Patched" Workflow Schema
Here is a proposed revision to the YAML workflow definition that incorporates these recommendations:
```yaml
# /home/papa/atomizer/workspaces/shared/workflows/material-trade-study-v2.yaml
name: Material Trade Study (Enhanced)
description: Research, evaluate, and audit material options with validation loops.

# Shared Blackboard for the workflow run
state:
  materials_list: []
  research_data: {}
  assessment: {}

steps:
  - id: research
    role: researcher # Dynamic: Router picks 'webster' (or backup)
    task: "Research CTE and cost for: {inputs.materials}"
    output_key: research_data # Writes to state['research_data']
    validation: # The "Critic" Loop
      agent: auditor
      criteria: "Are all material properties (CTE, density, cost) present and sourced?"
      on_fail: retry # Retry this step if validation fails
      max_retries: 2

  - id: evaluate
    role: technical-lead
    task: "Evaluate materials based on {state.research_data}"
    output_key: assessment
    timeout: 300
    on_timeout: # Error Handling
      fallback_role: manager
      alert: "#hq"

  # ... (rest of workflow)
```
## Complementary Industry Patterns
*(Based on review of AutoGen, LangGraph, and CrewAI architectures)*
1. **Group Chat Pattern (AutoGen):** For brainstorming or open-ended problem solving, consider a "Group Chat" workflow where agents (Manager, Webster, Tech-Lead) share a context window and take turns speaking until a consensus is reached, rather than a fixed linear chain.
2. **State Graph (LangGraph):** Model workflows as a graph where nodes are agents and edges are conditional jumps (e.g., `If Research is Ambiguous -> Go back to Research Step`). This allows for non-linear, adaptive workflows.
---
**Verdict:** Proceed with implementation, but prioritize the **Validation Loop** and **Error Handling** logic in Phase 1 to ensure reliability.

docs/hq/reviews/Review 2.md
---
tags:
- Agentic
- Project/Atomizer
up:
date: 2026-02-14
description:
summary: ""
status:
- ongoing
---
###### Gemini said
This is a fascinating architectural evolution. You've successfully pivoted from a monolithic "simulated" multi-agent system (Slack Bridge) to a **distributed micro-service architecture** (The "Cluster" on Discord).
This "Cluster" approach (8 independent systemd services on one machine) is the correct long-term play for stability and native features, but it introduces **fragmentation** that you haven't fully solved yet.
Here is my challenge to your current implementation, followed by concrete technical proposals to fix the "Missing Glue."
---
### 🚨 Critical Challenge: The "Fake Delegation" Problem
**The Issue:** In `08-SYSTEM-IMPLEMENTATION-STATUS.md`, you admit:
> _"Manager's `[DELEGATE:...]` syntax is fake... It's just text output. Fix needed: Replace with actual exec + curl to hooks API."_
Right now, you have 8 geniuses sitting in 8 separate rooms (ports), but the Manager is shouting instructions at a wall. If the Manager says `[DELEGATE: nx-expert "mesh this"]`, nothing happens because OpenClaw doesn't know what that syntax means.
**The Fix: The `delegate_task` Skill** You need to build a custom tool (skill) for the Manager (and Tech Lead) that wraps the `curl` command to the Hooks API.
**Proposed Implementation:** Create a skill `skills/orchestration/index.ts` (or Python equivalent) available to the Manager:
```ts
// Map agent names to their local ports (from your Doc 08)
const PORT_MAP: Record<string, number> = {
  "technical": 18804,
  "secretary": 18808,
  "auditor": 18812,
  "optimizer": 18816,
  "study-builder": 18820,
  "nx-expert": 18824,
  "webster": 18828
};

async function delegateTask(targetAgent: string, instruction: string, channelId: string) {
  const port = PORT_MAP[targetAgent];
  if (!port) throw new Error(`Unknown agent: ${targetAgent}`);

  // Call the OpenClaw Hooks API of the target agent.
  // This makes the target agent "hear" the instruction.
  const response = await fetch(`http://localhost:${port}/hooks/message`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      message: `🚨 **ASSIGNMENT FROM MANAGER:**\n${instruction}`,
      channel_id: channelId, // Respond in the same Discord thread/channel
      user: "Manager_Agent"  // Identify sender so they know context
    })
  });
  if (!response.ok) throw new Error(`Delegation to ${targetAgent} failed: HTTP ${response.status}`);

  return `Signal sent to ${targetAgent} on port ${port}. Monitor channel for response.`;
}
```
**Why this changes everything:**
1. The Manager actually _triggers_ the other process.
2. The target agent wakes up, sees the message in the correct Discord channel (via `channel_id`), and processes it.
3. The response appears natively in Discord.
---
### 🧠 Architectural Analysis: The T420 Bottleneck
**The Constraint:** You are running this on a **Lenovo T420**.
- **RAM:** You estimated 4.2GB for 8 idle instances.
- **CPU:** When 2-3 agents are active (e.g., Manager delegating to Tech Lead + Auditor), that dual-core i5 is going to scream. Node.js processes + Syncthing + potential Python execution will create latency.
**Risk:** If OpenClaw keeps the full context window loaded in memory for all 8 agents, you will OOM (Out of Memory) swap to disk, killing performance.
**Proposed Improvements:**
1. **Aggressive Context Pruning:** In your `clawdbot.json` (or equivalent OpenClaw config per instance), ensure `context_limit` is strict. The "Secretary" (Haiku) doesn't need 200k tokens.
2. **Lazy Loading:** Ensure agents that are rarely used (like "Auditor") don't keep heavy skills/models loaded until summoned.
3. **Process Niceness:** In your `cluster.sh`, start non-critical agents (Webster, Secretary) with a lower process priority (`nice -n 10`) so the Manager and Tech Lead always get CPU preference.
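For item 3, the `cluster.sh` change could be a small wrapper like this sketch (the agent binary, flags, and which agents get niced are assumptions):

```shell
# Sketch for cluster.sh: build the launch command for an agent, giving
# non-critical agents a lower scheduling priority via nice(1).
agent_cmd() {
  local name="$1" port="$2" priority="${3:-0}"
  printf 'nice -n %s node agent.js --name %s --port %s\n' "$priority" "$name" "$port"
}

# Manager and Tech Lead keep default priority; background agents yield CPU.
agent_cmd manager   18800 0
agent_cmd technical 18804 0
agent_cmd webster   18828 10
agent_cmd secretary 18808 10
```

Under CPU contention on the dual-core i5, the niced processes back off first, so the Manager's delegation loop stays responsive.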
---
### ⚡ The Windows Bridge (Speed vs. Reliability)
**The Issue:** You are relying on **Syncthing** for the Job Queue (Doc 05/08).
- _Pros:_ Bulletproof. If network drops, it syncs later.
- _Cons:_ **Latency.** 5-30 seconds delay. In a chat interface, 30 seconds of "thinking" feels like broken software.
**The Upgrade: Hybrid Triggering** Keep Syncthing for the heavy files (NX part files, results databases), but use **HTTP for the Trigger**.
1. **On Windows:** Run a tiny FastAPI server (listening on Tailscale IP).
2. **On T420 (Study Builder Agent):** When `run_optimization.py` is ready:
- Write files to Syncthing folder.
- _Wait 5 seconds._
- Send HTTP POST to Windows: `http://[Windows-Tailscale-IP]:8000/trigger-run`.
3. **On Windows:** The server receives the POST, checks if files match, and executes the script.
This cuts the "polling" loop out of the equation. The Windows machine reacts the moment the agent commands it.
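The T420-side trigger logic can be sketched like this (the Windows URL, payload shape, and injected transport are assumptions):

```typescript
// Sketch of the T420-side trigger. The transport is injected so the logic
// is testable and degrades gracefully: files are already in the Syncthing
// folder before this runs, so a failed HTTP trigger falls back to the
// slower polling path rather than losing work.
type Transport = (url: string, body: string) => boolean; // true on HTTP 2xx

function triggerWindowsRun(
  runId: string,
  post: Transport,
  windowsUrl = "http://windows-tailscale-ip:8000/trigger-run"
): "http" | "syncthing-fallback" {
  const ok = post(windowsUrl, JSON.stringify({ runId }));
  return ok ? "http" : "syncthing-fallback";
}
```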
---
### 📂 Shared State & Concurrency
**The Issue:** You moved away from the centralized bridge, which is good, but now you have distributed state. If the "Manager" writes to `PROJECT_STATUS.md` at the same time the "Secretary" tries to read it, you might get partial reads or file locks.
**Recommendation: The "Bulletin Board" Protocol** Since you are using a file system as a database (`/opt/atomizer/workspaces/shared_context/`), implement a strict protocol in the `atomizer-protocols` skill:
1. **Append-Only Logs:** For status updates, agents should append to `project_log.md` rather than overwriting a status file.
2. **The "Talking Stick":** Only the **Secretary** should have permission to _rewrite/summarize_ the `PROJECT_STATUS.md` based on the logs. Other agents just add logs.
- _Agent:_ "I finished task X" -> Appends to Log.
- _Secretary (Periodic):_ Reads Log -> Updates Status Board.
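The protocol can be sketched with an in-memory stand-in for `project_log.md` and `PROJECT_STATUS.md` (function names and the status format are assumptions):

```typescript
// Sketch of the bulletin-board protocol: every agent appends to the log;
// only the Secretary holds the "talking stick" to rewrite the status board.
const log: string[] = [];

function appendLog(agent: string, entry: string): void {
  log.push(`[${agent}] ${entry}`); // append-only: no overwrites, no file locks
}

function summarizeStatus(caller: string): string {
  if (caller !== "secretary") {
    throw new Error("only the Secretary may rewrite PROJECT_STATUS.md");
  }
  return `# PROJECT STATUS\n${log.map((l) => `- ${l}`).join("\n")}`;
}
```

Because writers never touch the same bytes (appends vs. a single rewriter), the partial-read and lock problems disappear without needing a database.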
---
### 📝 Immediate To-Do List (Next 24 Hours)
Based on your docs, here is the exact path to move from "deployed" to "functional":
1. **Implement the `delegate_task` tool:** Give the Manager the ability to `curl` the other ports. Without this, you have no company, just 8 freelancers ignoring each other.
2. **Fix the Discord permissions:** Ensure your `cluster.sh` sets the environment variables correctly so `18800` (Manager) behaves like a Manager and `18808` (Secretary) behaves like a Secretary.
3. **Define the "Handshake":** Update `atomizer-protocols` to explicitly tell agents: _"When you receive a message via hooks, treat it as a high-priority direct command."_
4. **Smoke Test Orchestration:**
- User (Discord): "Manager, ask Webster to find the latest density of Ti-6Al-4V."
- Manager (Internal): Calls `delegate_task("webster", "Find density of Ti-6Al-4V")`.
- Webster (Discord): Replies in the thread with the data.
This setup is impressive, Mario. You've effectively built a Kubernetes-style cluster using systemd and a laptop. Just add the networking glue (hooks), and it's alive.