From b86181eb6cd99da03301fb16606e1039e8d5f2bf Mon Sep 17 00:00:00 2001 From: Anto01 Date: Mon, 13 Apr 2026 09:14:32 -0400 Subject: [PATCH] =?UTF-8?q?docs:=20knowledge=20architecture=20=E2=80=94=20?= =?UTF-8?q?dual-layer=20model=20+=20domain=20knowledge?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Comprehensive architecture doc covering: - The problem (applied vs domain knowledge separation) - The quality bar (earned insight vs common knowledge, with examples) - Five-tier context assembly with budget allocation - Knowledge domains (10 domains: physics through finance) - Domain tag encoding (prefix in content, no schema migration) - Full flow: capture → extract → triage → surface - Cross-project example (p04 insight surfaces in p06 context) - Future directions: personal branch, multi-model, reinforcement Co-Authored-By: Claude Opus 4.6 (1M context) --- docs/architecture/knowledge-architecture.md | 206 ++++++++++++++++++++ 1 file changed, 206 insertions(+) create mode 100644 docs/architecture/knowledge-architecture.md diff --git a/docs/architecture/knowledge-architecture.md b/docs/architecture/knowledge-architecture.md new file mode 100644 index 0000000..2af4798 --- /dev/null +++ b/docs/architecture/knowledge-architecture.md @@ -0,0 +1,206 @@ +# AtoCore Knowledge Architecture + +## The Problem + +Engineering work produces two kinds of knowledge simultaneously: + +1. **Applied knowledge** — specific to the project being worked on + ("the p04 support pad layout is driven by CTE gradient analysis") +2. **Domain knowledge** — generalizable insight earned through that work + ("Zerodur CTE gradient dominates WFE at fast focal ratios") + +A system that only stores applied knowledge loses the general insight. +A system that mixes them pollutes project context with cross-project +noise. AtoCore needs both — separated, but both growing organically +from the same conversations. + +## The Quality Bar + +**AtoCore stores earned insight, not information.** + +The test: "Would a competent engineer need experience to know this, +or could they find it in 30 seconds?" + +| Store | Don't store | +|-------|-------------| +| "Preston removal model breaks down below 5N because the contact assumption fails" | "Preston's equation relates removal rate to pressure and velocity" | +| "m=1 (coma) is NOT correctable by force modulation (score 0.09)" | "Zernike polynomials describe wavefront aberrations" | +| "At F/1.2, CTE gradient costs ~3nm WFE and drives pad placement" | "Zerodur CTE is 0.05 ppm/K" | +| "Quilting limit for 16-inch tool is 234N" | "Quilting is a mid-spatial-frequency artifact in polishing" | + +The bar is enforced in the LLM extraction system prompt +(`src/atocore/memory/extractor_llm.py`) and the auto-triage prompt +(`scripts/auto_triage.py`). Both explicitly list examples of what +qualifies and what doesn't. + +## Architecture + +### Five-tier context assembly + +When AtoCore builds a context pack for any LLM query, it assembles +five tiers in strict trust order: + +``` +Tier 1: Trusted Project State [project-specific, highest trust] + Curated key-value entries from the project state API. + Example: "decision/vendor_path: Twyman-Green preferred, 4D + technical lead but cost-challenged" + +Tier 2: Identity / Preferences [global, always included] + Who the user is and how they work. + Example: "Antoine Letarte, mechanical/optical engineer at + Atomaste" / "No API keys — uses OAuth exclusively" + +Tier 3: Project Memories [project-specific] + Reinforced memories from the reflection loop, scoped to the + queried project. Example: "Firmware interface contract is + invariant: controller-job.v1 in, run-log.v1 out" + +Tier 4: Domain Knowledge [cross-project] + Earned engineering insight with project="" and a domain tag. + Surfaces in ALL project packs when query-relevant. + Example: "[materials] Zerodur CTE gradient dominates WFE at + fast focal ratios — costs ~3nm at F/1.2" + +Tier 5: Retrieved Chunks [project-boosted, lowest trust] + Vector-similarity search over the ingested document corpus. + Project-hinted but not filtered — cross-project docs can + appear at lower rank. +``` + +### Budget allocation (at default 3000 chars) + +| Tier | Budget ratio | Approx chars | Entries | +|------|-------------|-------------|---------| +| Project State | 20% | 600 | all curated entries | +| Identity/Preferences | 5% | 150 | 1 memory | +| Project Memories | 25% | 750 | 2-3 memories | +| Domain Knowledge | 10% | 300 | 1-2 memories | +| Retrieved Chunks | 40% | 1200 | 2-4 chunks | + +Trim order when budget is tight: chunks first, then domain knowledge, +then project memories, then identity, then project state last. + +### Knowledge domains + +The LLM extractor tags domain knowledge with one of these domains: + +| Domain | What qualifies | +|--------|---------------| +| `physics` | Optical physics, wave propagation, diffraction, thermal effects | +| `materials` | Material properties in context, CTE behavior, stress limits | +| `optics` | Lens/mirror design, aberration analysis, metrology techniques | +| `mechanics` | Structural FEA insights, support system design, kinematics | +| `manufacturing` | Polishing, grinding, machining, process control | +| `metrology` | Measurement systems, interferometry, calibration techniques | +| `controls` | PID tuning, force control, servo systems, real-time constraints | +| `software` | Architecture patterns, testing strategies, deployment insights | +| `math` | Numerical methods, optimization, statistical analysis | +| `finance` | Cost modeling, procurement strategy, budget optimization | + +New domains can be added by updating the system prompt in +`extractor_llm.py` and `batch_llm_extract_live.py`. + +### How domain knowledge is stored + +Domain tags are embedded as a prefix in the memory content: + +``` +memory_type: knowledge +project: "" ← empty = cross-project +content: "[materials] Zerodur CTE gradient dominates WFE at F/1.2" +``` + +The `[domain]` prefix is a lightweight encoding that avoids a schema +migration. The context builder's query-relevance ranking matches on +domain terms naturally (a query about "materials" or "CTE" will rank +a `[materials]` memory higher). A future migration can parse the +prefix into a proper `domain` column. + +## How knowledge flows + +### Capture → Extract → Triage → Surface + +``` +1. CAPTURE + Claude Code (Stop hook) or OpenClaw (plugin) + → POST /interactions with reinforce=true + → Interaction stored on Dalidou + +2. EXTRACT (nightly cron, 03:00 UTC) + batch_llm_extract_live.py runs claude -p sonnet + → For each interaction, the LLM decides: + - Is this project-specific? → candidate with project=X + - Is this generalizable insight? → candidate with domain=Y, project="" + - Is it both? → TWO candidates emitted + - Is it common knowledge? → skip (quality bar) + → Candidates persisted as status=candidate + +3. TRIAGE (nightly, immediately after extraction) + auto_triage.py runs claude -p sonnet + → Each candidate classified: promote / reject / needs_human + → Auto-promote at confidence ≥ 0.8 + no duplicate + → Auto-reject stale snapshots, duplicates, common knowledge + → Only needs_human reaches the operator + +4. SURFACE (every context/build query) + → Project-specific memories appear in Tier 3 + → Domain knowledge appears in Tier 4 (regardless of project) + → Both are query-ranked by overlap-density +``` + +### Example: knowledge earned on p04 surfaces on p06 + +Working on p04-gigabit, you discover that Zerodur CTE gradient is +the dominant WFE contributor at fast focal ratios. The extraction +produces: + +```json +[ + {"type": "project", "content": "CTE gradient analysis drove the + M1 support pad layout — 2nd largest WFE contributor after gravity", + "project": "p04-gigabit", "domain": "", "confidence": 0.6}, + + {"type": "knowledge", "content": "Zerodur CTE gradient dominates + WFE contribution at fast focal ratios (F/1.2 = ~3nm)", + "project": "", "domain": "materials", "confidence": 0.6} +] +``` + +Two weeks later, working on p06-polisher (which also uses Zerodur): + +``` +Query: "thermal effects on polishing accuracy" +Project: p06-polisher + +Tier 3 (Project Memories): + [project] Calibration loop adjusts Preston kp from surface measurements... + +Tier 4 (Domain Knowledge): + [materials] Zerodur CTE gradient dominates WFE contribution at fast + focal ratios — THIS CAME FROM P04 WORK +``` + +The insight crosses over without any manual curation. + +## Future directions + +### Personal knowledge branch +The same architecture supports personal domains (health, finance, +personal) by adding new domain tags and a trust boundary so +Atomaste project data never leaks into personal packs. The domain +system is domain-agnostic — it doesn't care whether the domain is +"optics" or "nutrition". + +### Multi-model extraction +Different models can specialize: sonnet for extraction, opus or +Gemini for triage review. Independent validation reduces correlated +blind spots on what qualifies as "earned insight" vs "common +knowledge." + +### Reinforcement-based domain promotion +A domain-knowledge memory that gets reinforced across multiple +projects (its content echoed in p04, p05, and p06 responses) +accumulates confidence faster than a project-specific memory. +High-confidence domain memories could auto-promote to a "verified +knowledge" tier above regular domain knowledge.