docs: add HQ multi-agent framework documentation from PKM

- Project plan, agent roster, architecture, roadmap
- Decision log, full system plan, Discord setup/migration guides
- System implementation status (as-built)
- Cluster pivot history
- Orchestration engine plan (Phases 1-4)
- Webster and Auditor reviews
118
docs/hq/reviews/REVIEW-Orchestration-Engine-Auditor-V2.md
Normal file
@@ -0,0 +1,118 @@
# Review: Orchestration Engine (Plan 10) — V2

> **Reviewer:** Auditor 🔍
> **Date:** 2026-02-14
> **Status:** **CONDITIONAL PASS** (implement required controls before production-critical use)
> **Subject:** `10-ORCHESTRATION-ENGINE-PLAN.md`

---

## Executive Verdict

Mario’s architecture is directionally correct and much stronger than fire-and-forget delegation. The three-layer model (Core → Routing → Workflows), structured handoffs, and explicit validation loops are all solid decisions.

However, for production reliability and auditability, this must ship with stricter **state integrity**, **idempotency**, **schema governance**, and **human approval gates** for high-impact actions.

**Bottom line:** Proceed, but only with the must-fix items below integrated into Phase 1–2.

---

## Findings

### 🔴 Critical (must fix)

1. **No explicit idempotency contract for retries/timeouts**
   - Current plan retries on timeout/malformed outputs, but does not define how to prevent duplicate side effects (double posts, repeated downstream actions).
   - **Risk:** inconsistent workflow outcomes, duplicate client-facing messages, non-reproducible state.
   - **Required fix:** Add `idempotency_key` per step attempt and enforce dedupe on handoff consumption + delivery.

2. **Handoff schema is underspecified for machine validation**
   - Fields shown are helpful, but no versioned JSON Schema or strict required/optional policy exists.
   - **Risk:** malformed yet “accepted” outputs, brittle parsing, silent failure propagation.
   - **Required fix:** versioned schema (`schemaVersion`), strict required fields, validator in `orchestrate.sh` + CI check for schema compatibility.

3. **No hard gate for high-stakes workflow steps**
   - Auditor checks are present, but there is no formal “approval required” interrupt before irreversible actions.
   - **Risk:** automated progression with incorrect assumptions.
   - **Required fix:** add `approval_gate: true` for designated steps (e.g., external deliverables, strategic recommendations).
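
A minimal sketch of the dedupe asked for in finding 1. It assumes the key stays the same across retries of a step, which is what makes dedupe possible; if the plan intends a distinct key per attempt, the ledger would need to key on a separate step-level identifier. `StepAttempt`, `DeliveryLedger`, and `deliverOnce` are hypothetical names, and the in-memory record stands in for whatever durable store the orchestrator would actually use.

```typescript
// Hypothetical shape: only `idempotency_key` is named in the review.
interface StepAttempt {
  workflowRunId: string;
  stepId: string;
  payload: string;
}

// Assumption: the key depends on run and step only, so every retry of the
// same step produces the same key.
function idempotencyKey(attempt: StepAttempt): string {
  return JSON.stringify([attempt.workflowRunId, attempt.stepId]);
}

class DeliveryLedger {
  // In-memory stand-in for a durable store of keys already delivered.
  private delivered: Record<string, boolean> = {};

  // Runs `deliver` at most once per key. Returns true only if it ran.
  deliverOnce(attempt: StepAttempt, deliver: (payload: string) => void): boolean {
    const key = idempotencyKey(attempt);
    if (this.delivered[key] === true) return false; // duplicate retry: skip
    deliver(attempt.payload);
    this.delivered[key] = true; // recorded only after delivery succeeded
    return true;
  }
}
```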

---

### 🟡 Major (should fix)

1. **State model is split across ad hoc files**
   - File-based handoff is fine for MVP, but without a canonical workflow state object, long chains get fragile.
   - **Recommendation:** add a per-run `state.json` blackboard (append-only event log + resolved materialized state).

2. **Observability is not yet sufficient for root-cause analysis**
   - Metrics are planned later; debugging multi-agent failures without end-to-end trace IDs will be painful.
   - **Recommendation:** start now with `workflowRunId`, `stepId`, `attempt`, `agent`, `latencyMs`, `token/cost estimate`, and terminal status.

3. **Channel-context ingestion lacks trust/sanitization policy**
   - Discord history can include noisy or unsafe content.
   - **Recommendation:** context sanitizer + source tagging + max token window + instruction stripping from untrusted text blocks.

4. **Hierarchical delegation loop prevention is policy-level only**
   - Good design intent, but no enforcement mechanism described.
   - **Recommendation:** enforce delegation ACL matrix in orchestrator runtime (not only SOUL instructions).
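
One possible shape for the runtime check in finding 4, as a sketch rather than a prescription. The permissions in `DELEGATION_ACL` are illustrative and not taken from the plan, and `checkDelegation` is a hypothetical helper that the orchestrator would call before every hop.

```typescript
// Illustrative ACL: caller -> agents it may delegate to. The agent names
// appear in the reviews; the specific permissions are assumptions.
const DELEGATION_ACL: Record<string, string[]> = {
  manager: ["tech-lead", "webster", "auditor"],
  "tech-lead": ["webster", "auditor"],
  auditor: ["tech-lead"],
  webster: [],
};

// `chain` is the delegation path so far, e.g. ["manager", "tech-lead"].
// Returns null when the hop is allowed, or a reason string when blocked.
function checkDelegation(chain: string[], target: string): string | null {
  if (chain.length === 0) return "empty delegation chain";
  const caller = chain[chain.length - 1];
  const known = Object.prototype.hasOwnProperty.call(DELEGATION_ACL, caller);
  const allowed = known ? DELEGATION_ACL[caller] : [];
  if (allowed.indexOf(target) === -1) {
    return `${caller} may not delegate to ${target}`;
  }
  // Hard-block loops: an agent already in the chain cannot be re-entered.
  if (chain.indexOf(target) !== -1) {
    return `loop: ${target} is already in the chain`;
  }
  return null;
}
```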

---

### 🟢 Minor (nice to fix)

1. Add `result_quality_score` (0–1) from validator for triage and dashboards.
2. Add `artifacts_checksum` to handoff metadata for reproducibility.
3. Add workflow dry-run mode to validate dependency graph and substitutions without execution.

---

## External Pattern Cross-Check (complementary ideas)

Based on architecture patterns in common orchestration ecosystems (LangGraph, AutoGen, CrewAI, Temporal, Prefect, Step Functions):

1. **Durable execution + resumability** (LangGraph/Temporal style)
   - Keep execution history and allow resume from last successful step.

2. **Guardrails with bounded retries** (CrewAI/Prefect style)
   - You already started this; formalize per-step retry policy and failure classes.

3. **State-machine semantics** (Step Functions style)
   - Model each step state explicitly: `pending → running → validated → committed | failed`.

4. **Human-in-the-loop interrupts**
   - Introduce pause/approve/reject transitions for critical branches.

5. **Exactly-once consumption where possible**
   - At minimum, “at-least-once execution + idempotent effects” should be guaranteed.
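
The state-machine semantics in item 3 can be sketched as a transition table. This assumes `failed` is reachable from both `running` and `validated`, and that `committed` and `failed` are terminal; the plan does not settle either point. `NEXT_STATES` and `advanceStep` are hypothetical names.

```typescript
type StepState = "pending" | "running" | "validated" | "committed" | "failed";

// Allowed next states per current state (assumptions noted in the lead-in).
const NEXT_STATES: Record<StepState, StepState[]> = {
  pending: ["running"],
  running: ["validated", "failed"],
  validated: ["committed", "failed"],
  committed: [],
  failed: [],
};

// Returns the new state, or throws if the transition is not in the table.
function advanceStep(current: StepState, next: StepState): StepState {
  if (NEXT_STATES[current].indexOf(next) === -1) {
    throw new Error(`illegal transition: ${current} -> ${next}`);
  }
  return next;
}
```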

---

## Recommended Minimal Patch Set (before scaling)

1. **Schema + idempotency first**
   - `handoff.schema.json` + `idempotency_key` required fields.

2. **Canonical state file per workflow run**
   - `handoffs/workflows/<runId>/state.json` as single source of truth.

3. **Enforced ACL delegation matrix**
   - Runtime check: who can delegate to whom, hard-block loops.

4. **Approval gates for critical outputs**
   - YAML: `requires_approval: manager|ceo`.

5. **Trace-first logging**
   - Correlated logs for every attempt and transition.
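
A hedged sketch of the kind of check item 1 implies, written as a hand-rolled validator so it stays self-contained. Only `schemaVersion` and `idempotency_key` come from this review; the other required fields and the version number are placeholders. In practice this logic would more likely sit behind `handoff.schema.json` and a proper JSON Schema validator.

```typescript
// Placeholder values: the plan does not define the version or field list.
const SUPPORTED_SCHEMA_VERSION = 1;
const REQUIRED_STRING_FIELDS = ["idempotency_key", "agent", "status", "result"];

// Returns a list of problems; an empty list means the handoff is accepted.
function validateHandoff(raw: string): string[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch (err) {
    return ["not valid JSON"];
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return ["handoff must be a JSON object"];
  }
  const doc = parsed as Record<string, unknown>;
  const problems: string[] = [];
  if (doc["schemaVersion"] !== SUPPORTED_SCHEMA_VERSION) {
    problems.push("unsupported or missing schemaVersion");
  }
  for (const field of REQUIRED_STRING_FIELDS) {
    const value = doc[field];
    if (typeof value !== "string" || value.length === 0) {
      problems.push(`missing required field: ${field}`);
    }
  }
  return problems;
}
```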

---

## Final Recommendation

**CONDITIONAL PASS**
Implementation can proceed immediately, but production-critical use should wait until the 5-item minimal patch set is in place. The current plan is strong; these controls are what make it reliable under stress.

---

## Suggested Filename Convention

`REVIEW-Orchestration-Engine-Auditor-V2.md`
104
docs/hq/reviews/REVIEW-Orchestration-Engine-Webster.md
Normal file
@@ -0,0 +1,104 @@
# Review: Orchestration Engine (Plan 10)

> **Reviewer:** Webster (Research Specialist)
> **Date:** 2026-02-14
> **Status:** Endorsed with Enhancements
> **Subject:** Critique of `10-ORCHESTRATION-ENGINE-PLAN` (Mario Lavoie)

---

## Executive Summary

Mario's proposed "Orchestration Engine: Multi-Instance Intelligence" is a **strong foundational architecture**. It correctly identifies the critical missing piece in our current cluster setup: **synchronous delegation with a structured feedback loop**. Moving from "fire-and-forget" (`delegate.sh`) to a structured "chain-of-command" (`orchestrate.sh`) is the correct evolutionary step for the Atomizer cluster.

The 3-layer architecture (Core → Routing → Workflows) is scalable and robust. The use of file-based handoffs and YAML workflows aligns perfectly with our local-first philosophy.

However, to elevate this from a "good" system to a "world-class" agentic framework, I strongly recommend implementing **Hierarchical Delegation**, **Validation Loops**, and **Shared State Management** immediately, rather than deferring them to Phase 4 or later.

---

## Critical Analysis

### 1. The "Manager Bottleneck" Risk (High)
**Critique:** The plan centralizes *all* orchestration in the Manager ("Manager as sole orchestrator").
**Risk:** This creates a single point of failure and a significant bottleneck. If the Manager is waiting on a long-running research task from Webster, it cannot effectively coordinate other urgent streams (e.g., a Tech-Lead design review). It also risks context overload for the Manager on complex, multi-agent projects.
**Recommendation:** Implement **Hierarchical Delegation**.
- Allow high-level agents (like `Tech-Lead`) to have "sub-orchestration" permissions.
- **Example:** If `Tech-Lead` needs a specific material density check from `Webster` to complete a larger analysis, they should be able to delegate that sub-task directly via `orchestrate.sh` without routing back through the Manager. This mimics a real engineering team structure.

### 2. Lack of "Reflection" or "Critic" Loops (Critical)
**Critique:** The proposed workflows are strictly linear (Step A → Step B → Step C).
**Risk:** "Garbage in, garbage out." If a research step returns hallucinated or irrelevant data, the subsequent technical analysis step will proceed to process it, wasting tokens and time.
**Recommendation:** Add explicit **Validation Steps**.
- Introduce a `critique` phase or a lightweight "Auditor" pass *inside* the workflow definition before moving to the next major stage.
- **Pattern:** Execute Task → Critique Output → (Refine/Retry if score < Threshold) → Proceed.
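
The pattern above can be sketched as a small loop. `runTask` and `critique` stand in for the agent calls and are hypothetical; real calls would be asynchronous, and the 0 to 1 score scale and threshold are assumptions rather than anything the plan defines.

```typescript
interface CritiqueOutcome {
  output: string;
  score: number;     // score of the returned output, assumed to be 0 to 1
  attempts: number;  // how many times the task ran
  accepted: boolean; // false if the retry budget ran out below threshold
}

// Runs the task, scores it, and retries while the score is below threshold.
// Total runs are capped at 1 + maxRetries.
function runWithCritic(
  runTask: (attempt: number) => string,
  critique: (output: string) => number,
  threshold: number,
  maxRetries: number
): CritiqueOutcome {
  let attempts = 0;
  let output = "";
  let score = 0;
  while (attempts <= maxRetries) {
    attempts += 1;
    output = runTask(attempts);
    score = critique(output);
    if (score >= threshold) {
      return { output, score, attempts, accepted: true };
    }
  }
  // Budget exhausted: hand back the last output, flagged as not accepted.
  return { output, score, attempts, accepted: false };
}
```

Returning the rejected output with `accepted: false`, instead of throwing, leaves the escalation decision to the workflow layer.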

### 3. State Management & Context Passing (Medium)
**Critique:** Context is passed explicitly between steps via file paths (`--context /tmp/file.json`).
**Risk:** Managing file paths becomes cumbersome in complex, multi-step workflows (e.g., 10+ steps). It also makes it hard for a late-stage agent to reference early-stage context unless every path is passed along explicitly.
**Recommendation:** Implement a **Shared "Blackboard" (Workflow State Object)**.
- Create a shared JSON object for the entire workflow run.
- Agents read/write keys to this shared state (e.g., `state['material_costs']`, `state['fea_results']`).
- This decouples step execution from data passing.
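
A minimal in-memory sketch of such a blackboard, assuming one instance per workflow run. `WorkflowBlackboard` and its methods are hypothetical names; persistence to a JSON file and concurrent-write handling are deliberately left out.

```typescript
// One blackboard per workflow run. Every write records which step made it,
// so a late step can read early results and an audit can trace their origin.
class WorkflowBlackboard {
  private values: Record<string, unknown> = {};
  private writers: Record<string, string> = {};

  write(stepId: string, key: string, value: unknown): void {
    this.values[key] = value;
    this.writers[key] = stepId;
  }

  // Throws on a missing key so a step fails loudly instead of reading undefined.
  read(key: string): unknown {
    if (!Object.prototype.hasOwnProperty.call(this.values, key)) {
      throw new Error(`no value for key: ${key}`);
    }
    return this.values[key];
  }

  // Returns the id of the step that last wrote the key, or null if unset.
  writtenBy(key: string): string | null {
    const known = Object.prototype.hasOwnProperty.call(this.writers, key);
    return known ? this.writers[key] : null;
  }
}
```

A late step would call `read("research_data")` directly, with no file path threaded through the intermediate steps.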

### 4. Dynamic "Team Construction" (Medium)
**Critique:** Workflow steps hardcode specific agents (e.g., `agent: webster`).
**Recommendation:** Use **Role-Based Execution**.
- Define steps by *role* or *capability* (e.g., `role: researcher`, `capability: web-research`) rather than specific agent IDs.
- The **Smart Router** (Layer 2) can then dynamically select the best available agent at runtime. This allows for load balancing and redundancy (e.g., routing to a backup researcher if Webster is overloaded).
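
As a rough illustration of role-based selection, the sketch below picks the least-loaded available agent for a role. `AgentInfo`, `pickAgentForRole`, and the load metric are assumptions; the plan does not say how the Smart Router measures availability.

```typescript
interface AgentInfo {
  id: string;
  roles: string[];
  activeTasks: number; // current load, from some hypothetical health check
  available: boolean;
}

// Picks the least-loaded available agent that holds the role.
// Returns null when nobody qualifies, so the caller can escalate.
function pickAgentForRole(agents: AgentInfo[], role: string): string | null {
  let best: AgentInfo | null = null;
  for (const agent of agents) {
    if (!agent.available || agent.roles.indexOf(role) === -1) continue;
    if (best === null || agent.activeTasks < best.activeTasks) best = agent;
  }
  return best === null ? null : best.id;
}
```

With Webster busy and a second researcher idle, a `role: researcher` step would go to the idle agent without any change to the workflow file.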

### 5. Error Handling & "Healing" (Medium)
**Critique:** Error handling is mentioned as a Phase 4 task.
**Recommendation:** **Make it a Phase 1 priority.**
- LLMs and external tools (web search) are non-deterministic and prone to occasional failure.
- Add `max_retries` and `fallback_strategy` fields to the YAML definition immediately.
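
A sketch of how a runner might honor those two fields, under stated assumptions: the `fallback_strategy` values `"fail"` and `"use_fallback"` are invented for illustration, and `primary` and `fallback` are synchronous stand-ins for agent calls.

```typescript
interface RetryPolicy {
  max_retries: number;
  fallback_strategy: "fail" | "use_fallback"; // hypothetical values
}

interface PolicyResult {
  ok: boolean;
  value: string | null;
  attempts: number;         // runs of the primary step
  usedFallback: boolean;
  lastError: string | null; // message of the most recent primary failure
}

// Tries the primary step up to 1 + max_retries times, then applies the
// fallback strategy. An exception from the fallback itself propagates.
function runStepWithPolicy(
  policy: RetryPolicy,
  primary: () => string,
  fallback: () => string
): PolicyResult {
  let attempts = 0;
  let lastError: string | null = null;
  while (attempts <= policy.max_retries) {
    attempts += 1;
    try {
      return { ok: true, value: primary(), attempts, usedFallback: false, lastError };
    } catch (err) {
      lastError = err instanceof Error ? err.message : String(err);
    }
  }
  if (policy.fallback_strategy === "use_fallback") {
    return { ok: true, value: fallback(), attempts, usedFallback: true, lastError };
  }
  return { ok: false, value: null, attempts, usedFallback: false, lastError };
}
```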

---

## Proposed Enhancement: "Patched" Workflow Schema

Here is a proposed revision to the YAML workflow definition that incorporates these recommendations:

```yaml
# /home/papa/atomizer/workspaces/shared/workflows/material-trade-study-v2.yaml
name: Material Trade Study (Enhanced)
description: Research, evaluate, and audit material options with validation loops.

# Shared Blackboard for the workflow run
state:
  materials_list: []
  research_data: {}
  assessment: {}

steps:
  - id: research
    role: researcher          # Dynamic: Router picks 'webster' (or backup)
    task: "Research CTE and cost for: {inputs.materials}"
    output_key: research_data # Writes to state['research_data']
    validation:               # The "Critic" Loop
      agent: auditor
      criteria: "Are all material properties (CTE, density, cost) present and sourced?"
      on_fail: retry          # Retry this step if validation fails
      max_retries: 2

  - id: evaluate
    role: technical-lead
    task: "Evaluate materials based on {state.research_data}"
    output_key: assessment
    timeout: 300
    on_timeout:               # Error Handling
      fallback_role: manager
      alert: "#hq"

  # ... (rest of workflow)
```

## Complementary Industry Patterns
*(Based on review of AutoGen, LangGraph, and CrewAI architectures)*

1. **Group Chat Pattern (AutoGen):** For brainstorming or open-ended problem solving, consider a "Group Chat" workflow where agents (Manager, Webster, Tech-Lead) share a context window and take turns speaking until a consensus is reached, rather than a fixed linear chain.
2. **State Graph (LangGraph):** Model workflows as a graph where nodes are agents and edges are conditional jumps (e.g., `If Research is Ambiguous -> Go back to Research Step`). This allows for non-linear, adaptive workflows.

---

**Verdict:** Proceed with implementation, but prioritize the **Validation Loop** and **Error Handling** logic in Phase 1 to ensure reliability.
167
docs/hq/reviews/Review 2.md
Normal file
@@ -0,0 +1,167 @@
---
tags:
  - Agentic
  - Project/Atomizer
up:
date: 2026-02-14
description:
summary: ""
status:
  - ongoing
---

###### Gemini said

This is a fascinating architectural evolution. You’ve successfully pivoted from a monolithic "simulated" multi-agent system (Slack Bridge) to a **distributed micro-service architecture** (The "Cluster" on Discord).

This "Cluster" approach (8 independent systemd services on one machine) is the correct long-term play for stability and native features, but it introduces **fragmentation** that you haven't fully solved yet.

Here is my challenge to your current implementation, followed by concrete technical proposals to fix the "Missing Glue."

---

### 🚨 Critical Challenge: The "Fake Delegation" Problem

**The Issue:** In `08-SYSTEM-IMPLEMENTATION-STATUS.md`, you admit:

> _"Manager's `[DELEGATE:...]` syntax is fake... It's just text output. Fix needed: Replace with actual exec + curl to hooks API."_

Right now, you have 8 geniuses sitting in 8 separate rooms (ports), but the Manager is shouting instructions at a wall. If the Manager says `[DELEGATE: nx-expert "mesh this"]`, nothing happens because OpenClaw doesn't know what that syntax means.

**The Fix: The `delegate_task` Skill** You need to build a custom tool (skill) for the Manager (and Tech Lead) that wraps the `curl` command to the Hooks API.

**Proposed Implementation:** Create a skill `skills/orchestration/index.ts` (or Python equivalent) available to the Manager:

```typescript
// Map agent names to their local ports (from your Doc 08)
const PORT_MAP: Record<string, number> = {
  "technical": 18804,
  "secretary": 18808,
  "auditor": 18812,
  "optimizer": 18816,
  "study-builder": 18820,
  "nx-expert": 18824,
  "webster": 18828
};

async function delegateTask(targetAgent: string, instruction: string, channelId: string): Promise<string> {
  const port = PORT_MAP[targetAgent];
  if (!port) throw new Error(`Unknown agent: ${targetAgent}`);

  // Call the OpenClaw Hooks API of the target agent
  // This makes the target agent "hear" the instruction
  const response = await fetch(`http://localhost:${port}/hooks/message`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      message: `🚨 **ASSIGNMENT FROM MANAGER:**\n${instruction}`,
      channel_id: channelId, // Respond in the same Discord thread/channel
      user: "Manager_Agent"  // Identify sender so they know context
    })
  });

  // Surface delivery failures instead of always reporting success
  if (!response.ok) {
    throw new Error(`Delegation to ${targetAgent} failed: HTTP ${response.status}`);
  }

  return `Signal sent to ${targetAgent} on port ${port}. Monitor channel for response.`;
}
```

**Why this changes everything:**

1. The Manager actually _triggers_ the other process.

2. The target agent wakes up, sees the message in the correct Discord channel (via `channel_id`), and processes it.

3. The response appears natively in Discord.

---

### 🧠 Architectural Analysis: The T420 Bottleneck

**The Constraint:** You are running this on a **Lenovo T420**.

- **RAM:** You estimated 4.2GB for 8 idle instances.

- **CPU:** When 2-3 agents are active (e.g., Manager delegating to Tech Lead + Auditor), that dual-core i5 is going to scream. Node.js processes + Syncthing + potential Python execution will create latency.

**Risk:** If OpenClaw keeps the full context window loaded in memory for all 8 agents, you will run out of memory (OOM) and start swapping to disk, killing performance.

**Proposed Improvements:**

1. **Aggressive Context Pruning:** In your `clawdbot.json` (or equivalent OpenClaw config per instance), ensure `context_limit` is strict. The "Secretary" (Haiku) doesn't need 200k tokens.

2. **Lazy Loading:** Ensure agents that are rarely used (like "Auditor") don't keep heavy skills/models loaded until summoned.

3. **Process Niceness:** In your `cluster.sh`, start non-critical agents (Webster, Secretary) with a lower process priority (`nice -n 10`) so the Manager and Tech Lead always get CPU preference.

---

### ⚡ The Windows Bridge (Speed vs. Reliability)

**The Issue:** You are relying on **Syncthing** for the Job Queue (Doc 05/08).

- _Pros:_ Bulletproof. If the network drops, it syncs later.

- _Cons:_ **Latency.** Expect a 5-30 second delay. In a chat interface, 30 seconds of "thinking" feels like broken software.

**The Upgrade: Hybrid Triggering** Keep Syncthing for the heavy files (NX part files, results databases), but use **HTTP for the Trigger**.

1. **On Windows:** Run a tiny FastAPI server (listening on Tailscale IP).

2. **On T420 (Study Builder Agent):** When `run_optimization.py` is ready:

    - Write files to Syncthing folder.

    - _Wait 5 seconds._

    - Send HTTP POST to Windows: `http://[Windows-Tailscale-IP]:8000/trigger-run`.

3. **On Windows:** The server receives the POST, checks if files match, and executes the script.

This cuts the "polling" loop out of the equation. The Windows machine reacts the moment the agent commands it.
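
To make "checks if files match" concrete, here is a rough sketch of the comparison, assuming the agent sends a manifest of file hashes along with the POST. `FileManifest`, `staleFiles`, and `triggerDecision` are made-up names, hashing itself is left out, and the same logic would port directly to the Python side of the FastAPI server.

```typescript
// Maps a relative file path to a content hash. How the hash is computed is
// out of scope here; any stable digest works.
type FileManifest = Record<string, string>;

// Returns the paths that are missing or differ on the receiving side.
function staleFiles(expected: FileManifest, actual: FileManifest): string[] {
  return Object.keys(expected)
    .filter((path) => actual[path] !== expected[path])
    .sort();
}

// Hypothetical decision the Windows-side handler could make per POST:
// run only when every expected file has arrived with the expected hash.
function triggerDecision(expected: FileManifest, actual: FileManifest): "run" | "retry-later" {
  return staleFiles(expected, actual).length === 0 ? "run" : "retry-later";
}
```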

---

### 📂 Shared State & Concurrency

**The Issue:** You moved away from the centralized bridge, which is good, but now you have distributed state. If the "Manager" writes to `PROJECT_STATUS.md` at the same time the "Secretary" tries to read it, you might get partial reads or file locks.

**Recommendation: The "Bulletin Board" Protocol** Since you are using a file system as a database (`/opt/atomizer/workspaces/shared_context/`), implement a strict protocol in the `atomizer-protocols` skill:

1. **Append-Only Logs:** For status updates, agents should append to `project_log.md` rather than overwriting a status file.

2. **The "Talking Stick":** Only the **Secretary** should have permission to _rewrite/summarize_ the `PROJECT_STATUS.md` based on the logs. Other agents just add logs.

    - _Agent:_ "I finished task X" -> Appends to Log.

    - _Secretary (Periodic):_ Reads Log -> Updates Status Board.
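
Here is the protocol above sketched in memory, so the two roles are easy to see. The `LogEntry` fields and both function names are assumptions; on disk, `appendLogEntry` would correspond to an append to `project_log.md` and `summarizeLog` to the Secretary rewriting `PROJECT_STATUS.md`.

```typescript
interface LogEntry {
  agent: string;
  task: string;
  state: "started" | "finished" | "blocked"; // hypothetical status values
}

// Agents only ever append; existing entries are never rewritten.
function appendLogEntry(log: LogEntry[], entry: LogEntry): LogEntry[] {
  return log.concat([entry]);
}

// The Secretary's periodic pass: fold the log into the latest state per task.
// Later entries win, so the result reflects the most recent report.
function summarizeLog(log: LogEntry[]): Record<string, string> {
  const latest: Record<string, string> = {};
  for (const entry of log) {
    latest[entry.task] = `${entry.state} (${entry.agent})`;
  }
  return latest;
}
```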

---

### 📝 Immediate To-Do List (Next 24 Hours)

Based on your docs, here is the exact path to move from "deployed" to "functional":

1. **Implement the `delegate_task` tool:** Give the Manager the ability to `curl` the other ports. Without this, you have no company, just 8 freelancers ignoring each other.

2. **Fix the Discord permissions:** Ensure your `cluster.sh` sets the environment variables correctly so `18800` (Manager) behaves like a Manager and `18808` (Secretary) behaves like a Secretary.

3. **Define the "Handshake":** Update `atomizer-protocols` to explicitly tell agents: _"When you receive a message via hooks, treat it as a high-priority direct command."_

4. **Smoke Test Orchestration:**

    - User (Discord): "Manager, ask Webster to find the latest density of Ti-6Al-4V."

    - Manager (Internal): Calls `delegate_task("webster", "Find density of Ti-6Al-4V")`.

    - Webster (Discord): Replies in the thread with the data.

This setup is impressive, Mario. You've effectively built a Kubernetes-style cluster using systemd and a laptop. Just add the networking glue (hooks), and it's alive.