docs: add HQ multi-agent framework documentation from PKM

- Project plan, agent roster, architecture, roadmap
- Decision log, full system plan, Discord setup/migration guides
- System implementation status (as-built)
- Cluster pivot history
- Orchestration engine plan (Phases 1-4)
- Webster and Auditor reviews
2026-02-15 21:44:07 +00:00
parent 3289a76e19
commit cf82de4f06
15 changed files with 6933 additions and 0 deletions

# Review: Orchestration Engine (Plan 10) — V2
> **Reviewer:** Auditor 🔍
> **Date:** 2026-02-14
> **Status:** **CONDITIONAL PASS** (implement required controls before production-critical use)
> **Subject:** `10-ORCHESTRATION-ENGINE-PLAN.md`
---
## Executive Verdict
Mario's architecture is directionally correct and much stronger than fire-and-forget delegation. The three-layer model (Core → Routing → Workflows), structured handoffs, and explicit validation loops are all solid decisions.
However, for production reliability and auditability, this must ship with stricter **state integrity**, **idempotency**, **schema governance**, and **human approval gates** for high-impact actions.
**Bottom line:** Proceed, but only with the must-fix items below integrated into Phase 12.
---
## Findings
### 🔴 Critical (must fix)
1. **No explicit idempotency contract for retries/timeouts**
- Current plan retries on timeout/malformed outputs, but does not define how to prevent duplicate side effects (double posts, repeated downstream actions).
- **Risk:** inconsistent workflow outcomes, duplicate client-facing messages, non-reproducible state.
- **Required fix:** Add `idempotency_key` per step attempt and enforce dedupe on handoff consumption + delivery.
2. **Handoff schema is underspecified for machine validation**
- Fields shown are helpful, but no versioned JSON Schema or strict required/optional policy exists.
- **Risk:** malformed yet “accepted” outputs, brittle parsing, silent failure propagation.
- **Required fix:** versioned schema (`schemaVersion`), strict required fields, validator in `orchestrate.sh` + CI check for schema compatibility.
3. **No hard gate for high-stakes workflow steps**
- Auditor checks are present, but there is no formal “approval required” interrupt before irreversible actions.
- **Risk:** automated progression with incorrect assumptions.
- **Required fix:** add `approval_gate: true` for designated steps (e.g., external deliverables, strategic recommendations).
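In the workflow YAML, such a gate could look like the sketch below. `approval_gate` and `requires_approval` follow the wording used in this review; the step name and the `on_reject` field are illustrative assumptions:

```yaml
- id: final-recommendation
  agent: manager
  task: "Draft strategic recommendation for client"
  approval_gate: true      # hard interrupt before any irreversible action
  requires_approval: ceo   # who may release the gate (manager|ceo)
  on_reject: halt          # illustrative: never auto-progress past a rejection
```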
---
### 🟡 Major (should fix)
1. **State model is split across ad hoc files**
- File-based handoff is fine for MVP, but without a canonical workflow state object, long chains get fragile.
- **Recommendation:** add a per-run `state.json` blackboard (append-only event log + resolved materialized state).
2. **Observability is not yet sufficient for root-cause analysis**
- Metrics are planned later; debugging multi-agent failures without end-to-end trace IDs will be painful.
- **Recommendation:** start now with `workflowRunId`, `stepId`, `attempt`, `agent`, `latencyMs`, `token/cost estimate`, and terminal status.
3. **Channel-context ingestion lacks trust/sanitization policy**
- Discord history can include noisy or unsafe content.
- **Recommendation:** context sanitizer + source tagging + max token window + instruction stripping from untrusted text blocks.
4. **Hierarchical delegation loop prevention is policy-level only**
- Good design intent, but no enforcement mechanism described.
- **Recommendation:** enforce delegation ACL matrix in orchestrator runtime (not only SOUL instructions).
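A runtime enforcement sketch for the delegation ACL matrix (agent names follow the plan; the matrix contents and function shape are assumptions):

```typescript
// Sketch: runtime delegation ACL. The matrix defines who may delegate to whom;
// the chain check hard-blocks loops regardless of what SOUL instructions say.
const DELEGATION_ACL: Record<string, string[]> = {
  manager: ["technical-lead", "webster", "auditor", "secretary"],
  "technical-lead": ["webster", "nx-expert"], // sub-orchestration permission
  webster: [],                                // leaf agent: may not delegate
};

function canDelegate(
  from: string,
  to: string,
  chain: string[] // delegation chain so far, e.g. ["manager", "technical-lead"]
): { ok: boolean; reason?: string } {
  if (!(DELEGATION_ACL[from] ?? []).includes(to)) {
    return { ok: false, reason: `ACL: ${from} may not delegate to ${to}` };
  }
  if (chain.includes(to)) {
    return { ok: false, reason: `loop: ${to} already in chain ${chain.join(" → ")}` };
  }
  return { ok: true };
}
```

Because the check runs in the orchestrator (not in agent prompts), a misbehaving agent cannot talk its way around it.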
---
### 🟢 Minor (nice to fix)
1. Add `result_quality_score` (0–1) from validator for triage and dashboards.
2. Add `artifacts_checksum` to handoff metadata for reproducibility.
3. Add workflow dry-run mode to validate dependency graph and substitutions without execution.
---
## External Pattern Cross-Check (complementary ideas)
Based on architecture patterns in common orchestration ecosystems (LangGraph, AutoGen, CrewAI, Temporal, Prefect, Step Functions):
1. **Durable execution + resumability** (LangGraph/Temporal style)
- Keep execution history and allow resume from last successful step.
2. **Guardrails with bounded retries** (CrewAI/Prefect style)
- You already started this; formalize per-step retry policy and failure classes.
3. **State-machine semantics** (Step Functions style)
- Model each step state explicitly: `pending → running → validated → committed | failed`.
4. **Human-in-the-loop interrupts**
- Introduce pause/approve/reject transitions for critical branches.
5. **Exactly-once consumption where possible**
- At minimum, “at-least-once execution + idempotent effects” should be guaranteed.
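Item 5 can be made concrete with a small dedupe layer on handoff consumption, mirroring the `idempotency_key` fix from the critical findings (the step shape and store are assumptions):

```typescript
// Sketch: at-least-once execution + idempotent effects. A step attempt
// carries a deterministic idempotency key; the consumer records processed
// keys and skips duplicate deliveries instead of re-running side effects.
type StepAttempt = { runId: string; stepId: string; attempt: number; payload: string };

// Same run + step + attempt always maps to the same key, so a redelivered
// attempt is detected as a duplicate.
function idempotencyKey(a: StepAttempt): string {
  return `${a.runId}:${a.stepId}:${a.attempt}`;
}

class HandoffConsumer {
  private processed = new Set<string>();
  private effects: string[] = [];

  consume(a: StepAttempt): boolean {
    const key = idempotencyKey(a);
    if (this.processed.has(key)) return false; // duplicate: no side effect
    this.processed.add(key);
    this.effects.push(a.payload); // the one-and-only side effect
    return true;
  }

  effectCount(): number {
    return this.effects.length;
  }
}
```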
---
## Recommended Minimal Patch Set (before scaling)
1. **Schema + idempotency first**
- `handoff.schema.json` + `idempotency_key` required fields.
2. **Canonical state file per workflow run**
- `handoffs/workflows/<runId>/state.json` as single source of truth.
3. **Enforced ACL delegation matrix**
- Runtime check: who can delegate to whom, hard-block loops.
4. **Approval gates for critical outputs**
- YAML: `requires_approval: manager|ceo`.
5. **Trace-first logging**
- Correlated logs for every attempt and transition.
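For patch item 1, `handoff.schema.json` could take a shape like this (the two named required fields follow this review; the remaining fields are illustrative):

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Handoff",
  "type": "object",
  "required": ["schemaVersion", "workflowRunId", "stepId", "idempotency_key", "status", "result"],
  "additionalProperties": false,
  "properties": {
    "schemaVersion": { "const": "1.0" },
    "workflowRunId": { "type": "string" },
    "stepId": { "type": "string" },
    "idempotency_key": { "type": "string" },
    "status": { "enum": ["validated", "failed"] },
    "result": { "type": "object" }
  }
}
```

With `additionalProperties: false` and strict `required` fields, a malformed handoff fails fast at the validator instead of propagating silently.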
---
## Final Recommendation
**CONDITIONAL PASS**
Implementation can proceed immediately, but production-critical use should wait until the 5-item minimal patch set is in place. The current plan is strong; these controls are what make it reliable under stress.
---
## Suggested Filename Convention
`REVIEW-Orchestration-Engine-Auditor-V2.md`

# Review: Orchestration Engine (Plan 10)
> **Reviewer:** Webster (Research Specialist)
> **Date:** 2026-02-14
> **Status:** Endorsed with Enhancements
> **Subject:** Critique of `10-ORCHESTRATION-ENGINE-PLAN` (Mario Lavoie)
---
## Executive Summary
Mario's proposed "Orchestration Engine: Multi-Instance Intelligence" is a **strong foundational architecture**. It correctly identifies the critical missing piece in our current cluster setup: **synchronous delegation with a structured feedback loop**. Moving from "fire-and-forget" (`delegate.sh`) to a structured "chain-of-command" (`orchestrate.sh`) is the correct evolutionary step for the Atomizer cluster.
The 3-layer architecture (Core → Routing → Workflows) is scalable and robust. The use of file-based handoffs and YAML workflows aligns perfectly with our local-first philosophy.
However, to elevate this from a "good" system to a "world-class" agentic framework, I strongly recommend implementing **Hierarchical Delegation**, **Validation Loops**, and **Shared State Management** immediately, rather than deferring them to Phase 4 or later.
---
## Critical Analysis
### 1. The "Manager Bottleneck" Risk (High)
**Critique:** The plan centralizes *all* orchestration in the Manager ("Manager as sole orchestrator").
**Risk:** This creates a single point of failure and a significant bottleneck. If the Manager is waiting on a long-running research task from Webster, it cannot effectively coordinate other urgent streams (e.g., a Tech-Lead design review). It also risks context overload for the Manager on complex, multi-agent projects.
**Recommendation:** Implement **Hierarchical Delegation**.
- Allow high-level agents (like `Tech-Lead`) to have "sub-orchestration" permissions.
- **Example:** If `Tech-Lead` needs a specific material density check from `Webster` to complete a larger analysis, they should be able to delegate that sub-task directly via `orchestrate.sh` without routing back through the Manager. This mimics a real engineering team structure.
### 2. Lack of "Reflection" or "Critic" Loops (Critical)
**Critique:** The proposed workflows are strictly linear (Step A → Step B → Step C).
**Risk:** "Garbage in, garbage out." If a research step returns hallucinated or irrelevant data, the subsequent technical analysis step will proceed to process it, wasting tokens and time.
**Recommendation:** Add explicit **Validation Steps**.
- Introduce a `critique` phase or a lightweight "Auditor" pass *inside* the workflow definition before moving to the next major stage.
- **Pattern:** Execute Task → Critique Output → (Refine/Retry if score < Threshold) → Proceed.
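That pattern can be sketched as a small loop where the critique is fed back into the retry (the threshold, retry budget, and function shapes are assumptions):

```typescript
// Sketch: execute → critique → (retry with feedback if score < threshold).
// executor and critic are stand-ins for agent calls via orchestrate.sh.
type Critique = { score: number; feedback: string };

function runWithCritique(
  executor: (feedback?: string) => string,
  critic: (output: string) => Critique,
  threshold = 0.8,
  maxRetries = 2
): { output: string; attempts: number } {
  let feedback: string | undefined;
  for (let attempt = 1; attempt <= maxRetries + 1; attempt++) {
    const output = executor(feedback);
    const c = critic(output);
    if (c.score >= threshold) return { output, attempts: attempt };
    feedback = c.feedback; // the critique becomes input to the next attempt
  }
  throw new Error("validation failed after exhausting retries");
}
```

The key design point is that a failed critique is not discarded: it becomes context for the retry, so the loop converges instead of repeating the same mistake.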
### 3. State Management & Context Passing (Medium)
**Critique:** Context is passed explicitly between steps via file paths (`--context /tmp/file.json`).
**Risk:** Managing file paths becomes cumbersome in complex, multi-step workflows (e.g., 10+ steps). It limits the ability for a late-stage agent to easily reference early-stage context without explicit passing.
**Recommendation:** Implement a **Shared "Blackboard" (Workflow State Object)**.
- Create a shared JSON object for the entire workflow run.
- Agents read/write keys to this shared state (e.g., `state['material_costs']`, `state['fea_results']`).
- This decouples step execution from data passing.
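A minimal sketch of such a blackboard object, kept as an append-only event log with a materialized view so the whole run stays replayable (key names and shapes are illustrative):

```typescript
// Sketch: shared workflow state. Agents append events; the materialized
// state is always re-derivable by replaying the log, which keeps long
// chains auditable and resumable.
type StateEvent = { stepId: string; key: string; value: unknown; at: number };

class WorkflowState {
  private log: StateEvent[] = [];

  append(stepId: string, key: string, value: unknown): void {
    this.log.push({ stepId, key, value, at: Date.now() });
  }

  // Materialized view: last write wins per key.
  materialize(): Record<string, unknown> {
    const state: Record<string, unknown> = {};
    for (const e of this.log) state[e.key] = e.value;
    return state;
  }

  // Serialize log + view, e.g. to the run's state.json file.
  toJSON(): string {
    return JSON.stringify({ events: this.log, state: this.materialize() });
  }
}
```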
### 4. Dynamic "Team Construction" (Medium)
**Critique:** Workflow steps hardcode specific agents (e.g., `agent: webster`).
**Recommendation:** Use **Role-Based Execution**.
- Define steps by *role* or *capability* (e.g., `role: researcher`, `capability: web-research`) rather than specific agent IDs.
- The **Smart Router** (Layer 2) can then dynamically select the best available agent at runtime. This allows for load balancing and redundancy (e.g., routing to a backup researcher if Webster is overloaded).
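The router's selection logic could be as simple as this sketch (registry entries and the load metric are assumptions; a real router would read live health signals):

```typescript
// Sketch: role-based routing with load-aware fallback. Steps name a role;
// the router picks the least-loaded registered agent that can fill it.
type AgentInfo = { id: string; roles: string[]; load: number };

const REGISTRY: AgentInfo[] = [
  { id: "webster", roles: ["researcher"], load: 0.9 },           // overloaded
  { id: "backup-researcher", roles: ["researcher"], load: 0.1 }, // idle
  { id: "technical", roles: ["technical-lead"], load: 0.3 },
];

function route(role: string): string {
  const candidates = REGISTRY.filter((a) => a.roles.includes(role));
  if (candidates.length === 0) throw new Error(`no agent registered for role: ${role}`);
  candidates.sort((a, b) => a.load - b.load); // least loaded first
  return candidates[0].id;
}
```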
### 5. Error Handling & "Healing" (Medium)
**Critique:** Error handling is mentioned as a Phase 4 task.
**Recommendation:** **Make it a Phase 1 priority.**
- LLMs and external tools (web search) are non-deterministic and prone to occasional failure.
- Add `max_retries` and `fallback_strategy` fields to the YAML definition immediately.
---
## Proposed Enhancement: "Patched" Workflow Schema
Here is a proposed revision to the YAML workflow definition that incorporates these recommendations:
```yaml
# /home/papa/atomizer/workspaces/shared/workflows/material-trade-study-v2.yaml
name: Material Trade Study (Enhanced)
description: Research, evaluate, and audit material options with validation loops.

# Shared Blackboard for the workflow run
state:
  materials_list: []
  research_data: {}
  assessment: {}

steps:
  - id: research
    role: researcher # Dynamic: Router picks 'webster' (or backup)
    task: "Research CTE and cost for: {inputs.materials}"
    output_key: research_data # Writes to state['research_data']
    validation: # The "Critic" Loop
      agent: auditor
      criteria: "Are all material properties (CTE, density, cost) present and sourced?"
      on_fail: retry # Retry this step if validation fails
      max_retries: 2

  - id: evaluate
    role: technical-lead
    task: "Evaluate materials based on {state.research_data}"
    output_key: assessment
    timeout: 300
    on_timeout: # Error Handling
      fallback_role: manager
      alert: "#hq"

  # ... (rest of workflow)
```
## Complementary Industry Patterns
*(Based on review of AutoGen, LangGraph, and CrewAI architectures)*
1. **Group Chat Pattern (AutoGen):** For brainstorming or open-ended problem solving, consider a "Group Chat" workflow where agents (Manager, Webster, Tech-Lead) share a context window and take turns speaking until a consensus is reached, rather than a fixed linear chain.
2. **State Graph (LangGraph):** Model workflows as a graph where nodes are agents and edges are conditional jumps (e.g., `If Research is Ambiguous -> Go back to Research Step`). This allows for non-linear, adaptive workflows.
---
**Verdict:** Proceed with implementation, but prioritize the **Validation Loop** and **Error Handling** logic in Phase 1 to ensure reliability.

docs/hq/reviews/Review 2.md
---
tags:
- Agentic
- Project/Atomizer
up:
date: 2026-02-14
description:
summary: ""
status:
- ongoing
---
###### Gemini said
This is a fascinating architectural evolution. You've successfully pivoted from a monolithic "simulated" multi-agent system (Slack Bridge) to a **distributed micro-service architecture** (The "Cluster" on Discord).
This "Cluster" approach (8 independent systemd services on one machine) is the correct long-term play for stability and native features, but it introduces **fragmentation** that you haven't fully solved yet.
Here is my challenge to your current implementation, followed by concrete technical proposals to fix the "Missing Glue."
---
### 🚨 Critical Challenge: The "Fake Delegation" Problem
**The Issue:** In `08-SYSTEM-IMPLEMENTATION-STATUS.md`, you admit:
> _"Manager's `[DELEGATE:...]` syntax is fake... It's just text output. Fix needed: Replace with actual exec + curl to hooks API."_
Right now, you have 8 geniuses sitting in 8 separate rooms (ports), but the Manager is shouting instructions at a wall. If the Manager says `[DELEGATE: nx-expert "mesh this"]`, nothing happens because OpenClaw doesn't know what that syntax means.
**The Fix: The `delegate_task` Skill** You need to build a custom tool (skill) for the Manager (and Tech Lead) that wraps the `curl` command to the Hooks API.
**Proposed Implementation:** Create a skill `skills/orchestration/index.ts` (or Python equivalent) available to the Manager:
```ts
// Map agent names to their local ports (from your Doc 08)
const PORT_MAP: Record<string, number> = {
  "technical": 18804,
  "secretary": 18808,
  "auditor": 18812,
  "optimizer": 18816,
  "study-builder": 18820,
  "nx-expert": 18824,
  "webster": 18828
};

async function delegateTask(targetAgent: string, instruction: string, channelId: string) {
  const port = PORT_MAP[targetAgent];
  if (!port) throw new Error(`Unknown agent: ${targetAgent}`);

  // Call the OpenClaw Hooks API of the target agent.
  // This makes the target agent "hear" the instruction.
  const response = await fetch(`http://localhost:${port}/hooks/message`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      message: `🚨 **ASSIGNMENT FROM MANAGER:**\n${instruction}`,
      channel_id: channelId, // Respond in the same Discord thread/channel
      user: "Manager_Agent"  // Identify sender so they know context
    })
  });
  if (!response.ok) throw new Error(`Delegation to ${targetAgent} failed: HTTP ${response.status}`);

  return `Signal sent to ${targetAgent} on port ${port}. Monitor channel for response.`;
}
```
**Why this changes everything:**
1. The Manager actually _triggers_ the other process.
2. The target agent wakes up, sees the message in the correct Discord channel (via `channel_id`), and processes it.
3. The response appears natively in Discord.
---
### 🧠 Architectural Analysis: The T420 Bottleneck
**The Constraint:** You are running this on a **Lenovo T420**.
- **RAM:** You estimated 4.2GB for 8 idle instances.
- **CPU:** When 2-3 agents are active (e.g., Manager delegating to Tech Lead + Auditor), that dual-core i5 is going to scream. Node.js processes + Syncthing + potential Python execution will create latency.
**Risk:** If OpenClaw keeps the full context window loaded in memory for all 8 agents, you will OOM (Out of Memory) swap to disk, killing performance.
**Proposed Improvements:**
1. **Aggressive Context Pruning:** In your `clawdbot.json` (or equivalent OpenClaw config per instance), ensure `context_limit` is strict. The "Secretary" (Haiku) doesn't need 200k tokens.
2. **Lazy Loading:** Ensure agents that are rarely used (like "Auditor") don't keep heavy skills/models loaded until summoned.
3. **Process Niceness:** In your `cluster.sh`, start non-critical agents (Webster, Secretary) with a lower process priority (`nice -n 10`) so the Manager and Tech Lead always get CPU preference.
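For item 3, the `cluster.sh` change could be a small wrapper like this sketch (the agent binary, flags, and which agents get niced are assumptions):

```shell
# Sketch for cluster.sh: build the launch command for an agent, giving
# non-critical agents a lower scheduling priority via nice(1).
agent_cmd() {
  local name="$1" port="$2" priority="${3:-0}"
  printf 'nice -n %s node agent.js --name %s --port %s\n' "$priority" "$name" "$port"
}

# Manager and Tech Lead keep default priority; background agents yield CPU.
agent_cmd manager   18800 0
agent_cmd technical 18804 0
agent_cmd webster   18828 10
agent_cmd secretary 18808 10
```

Under CPU contention on the dual-core i5, the niced processes back off first, so the Manager's delegation loop stays responsive.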
---
### ⚡ The Windows Bridge (Speed vs. Reliability)
**The Issue:** You are relying on **Syncthing** for the Job Queue (Doc 05/08).
- _Pros:_ Bulletproof. If network drops, it syncs later.
- _Cons:_ **Latency.** 5-30 seconds delay. In a chat interface, 30 seconds of "thinking" feels like broken software.
**The Upgrade: Hybrid Triggering** Keep Syncthing for the heavy files (NX part files, results databases), but use **HTTP for the Trigger**.
1. **On Windows:** Run a tiny FastAPI server (listening on Tailscale IP).
2. **On T420 (Study Builder Agent):** When `run_optimization.py` is ready:
- Write files to Syncthing folder.
- _Wait 5 seconds._
- Send HTTP POST to Windows: `http://[Windows-Tailscale-IP]:8000/trigger-run`.
3. **On Windows:** The server receives the POST, checks if files match, and executes the script.
This cuts the "polling" loop out of the equation. The Windows machine reacts the moment the agent commands it.
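The T420-side trigger logic can be sketched like this (the Windows URL, payload shape, and injected transport are assumptions):

```typescript
// Sketch of the T420-side trigger. The transport is injected so the logic
// is testable and degrades gracefully: files are already in the Syncthing
// folder before this runs, so a failed HTTP trigger falls back to the
// slower polling path rather than losing work.
type Transport = (url: string, body: string) => boolean; // true on HTTP 2xx

function triggerWindowsRun(
  runId: string,
  post: Transport,
  windowsUrl = "http://windows-tailscale-ip:8000/trigger-run"
): "http" | "syncthing-fallback" {
  const ok = post(windowsUrl, JSON.stringify({ runId }));
  return ok ? "http" : "syncthing-fallback";
}
```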
---
### 📂 Shared State & Concurrency
**The Issue:** You moved away from the centralized bridge, which is good, but now you have distributed state. If the "Manager" writes to `PROJECT_STATUS.md` at the same time the "Secretary" tries to read it, you might get partial reads or file locks.
**Recommendation: The "Bulletin Board" Protocol** Since you are using a file system as a database (`/opt/atomizer/workspaces/shared_context/`), implement a strict protocol in the `atomizer-protocols` skill:
1. **Append-Only Logs:** For status updates, agents should append to `project_log.md` rather than overwriting a status file.
2. **The "Talking Stick":** Only the **Secretary** should have permission to _rewrite/summarize_ the `PROJECT_STATUS.md` based on the logs. Other agents just add logs.
- _Agent:_ "I finished task X" -> Appends to Log.
- _Secretary (Periodic):_ Reads Log -> Updates Status Board.
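The protocol can be sketched with an in-memory stand-in for `project_log.md` and `PROJECT_STATUS.md` (function names and the status format are assumptions):

```typescript
// Sketch of the bulletin-board protocol: every agent appends to the log;
// only the Secretary holds the "talking stick" to rewrite the status board.
const log: string[] = [];

function appendLog(agent: string, entry: string): void {
  log.push(`[${agent}] ${entry}`); // append-only: no overwrites, no file locks
}

function summarizeStatus(caller: string): string {
  if (caller !== "secretary") {
    throw new Error("only the Secretary may rewrite PROJECT_STATUS.md");
  }
  return `# PROJECT STATUS\n${log.map((l) => `- ${l}`).join("\n")}`;
}
```

Because writers never touch the same bytes (appends vs. a single rewriter), the partial-read and lock problems disappear without needing a database.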
---
### 📝 Immediate To-Do List (Next 24 Hours)
Based on your docs, here is the exact path to move from "deployed" to "functional":
1. **Implement the `delegate_task` tool:** Give the Manager the ability to `curl` the other ports. Without this, you have no company, just 8 freelancers ignoring each other.
2. **Fix the Discord permissions:** Ensure your `cluster.sh` sets the environment variables correctly so `18800` (Manager) behaves like a Manager and `18808` (Secretary) behaves like a Secretary.
3. **Define the "Handshake":** Update `atomizer-protocols` to explicitly tell agents: _"When you receive a message via hooks, treat it as a high-priority direct command."_
4. **Smoke Test Orchestration:**
- User (Discord): "Manager, ask Webster to find the latest density of Ti-6Al-4V."
- Manager (Internal): Calls `delegate_task("webster", "Find density of Ti-6Al-4V")`.
- Webster (Discord): Replies in the thread with the data.
This setup is impressive, Mario. You've effectively built a Kubernetes-style cluster using systemd and a laptop. Just add the networking glue (hooks), and it's alive.