feat: Pre-migration checkpoint - updated docs and utilities

Updates before optimization_engine migration:
- Updated migration plan to v2.1 with complete file inventory
- Added OP_07 disk optimization protocol
- Added SYS_16 self-aware turbo protocol
- Added study archiver and cleanup utilities
- Added ensemble surrogate module
- Updated NX solver and session manager
- Updated zernike HTML generator
- Added context engineering plan
- LAC session insights updates

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Date: 2025-12-29 10:22:45 -05:00
Parent: faa7779a43
Commit: 82f36689b7
21 changed files with 6304 additions and 890 deletions

# NXOpen Documentation MCP Server - Setup TODO
**Created:** 2025-12-29
**Status:** PENDING - Waiting for manual configuration
---
## Current State
The NXOpen documentation MCP server exists on **dalidou** (192.168.86.50) but is not accessible from this Windows machine due to hostname resolution issues.
### What's Working
- ✅ Dalidou server is online and reachable at `192.168.86.50`
- ✅ Port 5000 (Documentation Proxy) is responding
- ✅ Port 3000 (Gitea) is responding
- ✅ MCP server code exists at `/srv/claude-assistant/` on dalidou
### What's NOT Working
- ❌ `dalidou.local` hostname doesn't resolve (mDNS not configured on this machine)
- ❌ MCP tools not integrated with Claude Code
---
## Steps to Complete
### Step 1: Fix Hostname Resolution (Manual - requires Admin)
**Option A: Run the script as Administrator**
```powershell
# Open PowerShell as Administrator, then:
C:\Users\antoi\Atomizer\add_dalidou_host.ps1
```
**Option B: Manually edit hosts file**
1. Open Notepad as Administrator
2. Open `C:\Windows\System32\drivers\etc\hosts`
3. Add this line at the end:
```
192.168.86.50 dalidou.local dalidou
```
4. Save the file
**Verify:**
```powershell
ping dalidou.local
```
### Step 2: Verify MCP Server is Running on Dalidou
SSH into dalidou and check:
```bash
ssh root@dalidou
# Check documentation proxy
systemctl status siemensdocumentationproxyserver
# Check MCP server (if it's a service)
# Or check what's running on port 5000
ss -tlnp | grep 5000
```
### Step 3: Configure Claude Code MCP Integration
The MCP server on dalidou uses **stdio-based MCP protocol**, not HTTP. To connect from Claude Code, you'll need one of:
**Option A: SSH-based MCP (if supported)**
Configure in `.claude/settings.json` or MCP config to connect via SSH tunnel.
**Option B: Local Proxy**
Run a local MCP proxy that connects to dalidou's MCP server.
**Option C: HTTP Wrapper**
The current port 5000 service may already expose HTTP endpoints - need to verify once hostname is fixed.
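As an illustration of Option A, a stdio MCP server can sometimes be bridged over SSH by making the MCP `command` an `ssh` invocation. This is a sketch only: the server entry name `nxopen-docs` and the `index.js` entry point are assumptions to verify against what actually lives under `/srv/claude-assistant/mcp-server/` on dalidou.

```json
{
  "mcpServers": {
    "nxopen-docs": {
      "command": "ssh",
      "args": [
        "root@dalidou",
        "node",
        "/srv/claude-assistant/mcp-server/index.js"
      ]
    }
  }
}
```

This only works once the hostname resolves (Step 1) and SSH key auth is in place, since Claude Code cannot answer an interactive password prompt.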
---
## Server Documentation Reference
Full documentation is in the SERVtomaste repo:
- **URL:** http://192.168.86.50:3000/Antoine/SERVtomaste
- **File:** `docs/SIEMENS-DOCS-SERVER.md`
### Key Server Paths (on dalidou)
```
/srv/siemens-docs/proxy/ # Documentation Proxy (port 5000)
/srv/claude-assistant/ # MCP Server
/srv/claude-assistant/mcp-server/ # MCP server code
/srv/claude-assistant/tools/ # Tool implementations
├── siemens-auth.js # Puppeteer authentication
├── siemens-docs.js # Documentation fetching
└── ...
/srv/claude-assistant/vault/ # Credentials (secured)
```
### Available MCP Tools (once connected)
| Tool | Description |
|------|-------------|
| `siemens_docs_search` | Search NX Open, Simcenter docs |
| `siemens_docs_fetch` | Fetch specific documentation page |
| `siemens_auth_status` | Check if auth session is active |
| `siemens_login` | Re-login if session expired |
| `siemens_docs_list` | List documentation categories |
---
## Files Created During Investigation
- `C:\Users\antoi\Atomizer\add_dalidou_host.ps1` - Script to add hosts entry (run as Admin)
- `C:\Users\antoi\Atomizer\test_mcp.py` - Test script for probing MCP server (can be deleted)
---
## Related Documentation
- `.claude/skills/modules/nx-docs-lookup.md` - How to use MCP tools once configured
- `docs/08_ARCHIVE/historical/NXOPEN_DOCUMENTATION_INTEGRATION_STRATEGY.md` - Full strategy doc
- `docs/05_API_REFERENCE/NXOPEN_RESOURCES.md` - Alternative NXOpen resources
---
## Workaround Until Fixed
Without the MCP server, you can still look up NXOpen documentation by:
1. **Using web search** - I can search for NXOpen API documentation online
2. **Using local stub files** - Python stubs at `C:\Program Files\Siemens\NX2412\UGOPEN\pythonStubs\`
3. **Using existing extractors** - Check `optimization_engine/extractors/` for patterns
4. **Recording NX journals** - Record operations in NX to learn the API calls
---
*To continue setup, run the hosts file fix and let me know when ready.*

# OP_07: Disk Space Optimization
**Version:** 1.0
**Last Updated:** 2025-12-29
## Overview
This protocol manages disk space for Atomizer studies through:
1. **Local cleanup** - Remove regenerable files from completed studies
2. **Remote archival** - Archive to dalidou server (14TB available)
3. **On-demand restore** - Pull archived studies when needed
## Disk Usage Analysis
### Typical Study Breakdown
| File Type | Size/Trial | Purpose | Keep? |
|-----------|------------|---------|-------|
| `.op2` | 68 MB | Nastran results | **YES** - Needed for analysis |
| `.prt` | 30 MB | NX parts | NO - Copy of master |
| `.dat` | 16 MB | Solver input | NO - Regenerable |
| `.fem` | 14 MB | FEM mesh | NO - Copy of master |
| `.sim` | 7 MB | Simulation | NO - Copy of master |
| `.afm` | 4 MB | Assembly FEM | NO - Regenerable |
| `.json` | <1 MB | Params/results | **YES** - Metadata |
| Logs | <1 MB | F04/F06/log | NO - Diagnostic only |
**Per-trial overhead:** ~150 MB total, only ~70 MB essential
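The ~150 MB / ~70 MB figures follow directly from the table above (a quick sanity check; sizes are approximate):

```python
# Approximate per-trial file sizes from the table above (MB)
sizes_mb = {
    ".op2": 68, ".prt": 30, ".dat": 16, ".fem": 14,
    ".sim": 7, ".afm": 4, ".json": 1, "logs": 1,
}
essential = {".op2", ".json"}

total = sum(sizes_mb.values())
kept = sum(v for k, v in sizes_mb.items() if k in essential)
print(f"Total per trial: ~{total} MB, essential: ~{kept} MB")
# Roughly matches the ~150 MB / ~70 MB figures quoted above
```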
### M1_Mirror Example
```
Current: 194 GB (28 studies, 2000+ trials)
After cleanup: 95 GB (51% reduction)
After archive: 5 GB (keep best_design_archive only)
```
## Commands
### 1. Analyze Disk Usage
```bash
# Single study
archive_study.bat analyze studies\M1_Mirror\m1_mirror_V12
# All studies in a project
archive_study.bat analyze studies\M1_Mirror
```
Output shows:
- Total size
- Essential vs deletable breakdown
- Trial count per study
- Per-extension analysis
### 2. Cleanup Completed Study
```bash
# Dry run (default) - see what would be deleted
archive_study.bat cleanup studies\M1_Mirror\m1_mirror_V12
# Actually delete
archive_study.bat cleanup studies\M1_Mirror\m1_mirror_V12 --execute
```
**What gets deleted:**
- `.prt`, `.fem`, `.sim`, `.afm` in trial folders
- `.dat`, `.f04`, `.f06`, `.log`, `.diag` solver files
- Temp files (`.txt`, `.exp`, `.bak`)
**What is preserved:**
- `1_setup/` folder (master model)
- `3_results/` folder (database, reports)
- All `.op2` files (Nastran results)
- All `.json` files (params, metadata)
- All `.npz` files (Zernike coefficients)
- `best_design_archive/` folder
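The delete/preserve rules above amount to a simple filter. A minimal sketch (the real implementation lives in `optimization_engine/utils/study_archiver.py`; the helper name below is illustrative only):

```python
from pathlib import Path

# Extensions removed from trial folders during cleanup
DELETABLE = {".prt", ".fem", ".sim", ".afm", ".dat",
             ".f04", ".f06", ".log", ".diag", ".txt", ".exp", ".bak"}
# Folders that cleanup never touches
PROTECTED_DIRS = {"1_setup", "3_results", "best_design_archive"}

def deletable_files(study_dir: Path):
    """Yield files that a cleanup pass would remove."""
    for f in study_dir.rglob("*"):
        if not f.is_file():
            continue
        # Skip anything under a protected folder
        if PROTECTED_DIRS & {p.name for p in f.parents}:
            continue
        if f.suffix.lower() in DELETABLE:
            yield f
```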
### 3. Archive to Remote Server
```bash
# Dry run
archive_study.bat archive studies\M1_Mirror\m1_mirror_V12
# Actually archive
archive_study.bat archive studies\M1_Mirror\m1_mirror_V12 --execute
# Use Tailscale (when not on local network)
archive_study.bat archive studies\M1_Mirror\m1_mirror_V12 --execute --tailscale
```
**Process:**
1. Creates compressed `.tar.gz` archive
2. Uploads to `papa@192.168.86.50:/srv/storage/atomizer-archive/`
3. Deletes local archive after successful upload
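The three-step process can be sketched as below. `shutil.make_archive` is one way to produce the `.tar.gz`, and the `rsync` flags match the progress display described later; this is a sketch, not the shipped tool:

```python
import shutil
import subprocess
from pathlib import Path

REMOTE = "papa@192.168.86.50:/srv/storage/atomizer-archive/"

def build_upload_cmd(archive: Path, remote: str = REMOTE):
    """rsync with archive mode and progress display."""
    return ["rsync", "-av", "--progress", str(archive), remote]

def archive_study(study_dir: Path, dry_run: bool = True):
    # 1. Create compressed .tar.gz next to the study folder
    archive = Path(shutil.make_archive(str(study_dir), "gztar",
                                       root_dir=study_dir.parent,
                                       base_dir=study_dir.name))
    cmd = build_upload_cmd(archive)
    if dry_run:
        print("Would run:", " ".join(cmd))
    else:
        # 2. Upload, then 3. delete local archive after success
        subprocess.run(cmd, check=True)
        archive.unlink()
    return archive
```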
### 4. List Remote Archives
```bash
archive_study.bat list
# Via Tailscale
archive_study.bat list --tailscale
```
### 5. Restore from Remote
```bash
# Restore to studies/ folder
archive_study.bat restore m1_mirror_V12
# Via Tailscale
archive_study.bat restore m1_mirror_V12 --tailscale
```
## Remote Server Setup
**Server:** dalidou (Lenovo W520)
- Local IP: `192.168.86.50`
- Tailscale IP: `100.80.199.40`
- SSH user: `papa`
- Archive path: `/srv/storage/atomizer-archive/`
### First-Time Setup
SSH into dalidou and create the archive directory:
```bash
ssh papa@192.168.86.50
mkdir -p /srv/storage/atomizer-archive
```
Ensure SSH key authentication is set up for passwordless transfers. Note that `ssh-copy-id` is not shipped with the Windows OpenSSH client, so append the public key manually:
```powershell
# On Windows (PowerShell) - adjust the key filename if yours differs
type $env:USERPROFILE\.ssh\id_ed25519.pub | ssh papa@192.168.86.50 "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"
```
## Recommended Workflow
### During Active Optimization
Keep all files - you may need to re-run specific trials.
### After Study Completion
1. **Generate final report** (`STUDY_REPORT.md`)
2. **Archive best design** to `3_results/best_design_archive/`
3. **Cleanup:**
```bash
archive_study.bat cleanup studies\M1_Mirror\m1_mirror_V12 --execute
```
### For Long-Term Storage
1. **After cleanup**, archive to server:
```bash
archive_study.bat archive studies\M1_Mirror\m1_mirror_V12 --execute
```
2. **Optionally delete local** (keep only `3_results/best_design_archive/`)
### When Revisiting Old Study
1. **Restore:**
```bash
archive_study.bat restore m1_mirror_V12
```
2. If you need to re-run trials, the `1_setup/` master files allow regenerating everything
## Safety Features
- **Dry run by default** - Must add `--execute` to actually delete/transfer
- **Master files preserved** - `1_setup/` is never touched
- **Results preserved** - `3_results/` is never touched
- **Essential files preserved** - OP2, JSON, NPZ always kept
## Disk Space Targets
| Stage | M1_Mirror Target |
|-------|------------------|
| Active development | 200 GB (full) |
| Completed studies | 95 GB (after cleanup) |
| Archived (minimal local) | 5 GB (best only) |
| Server archive | 50 GB compressed |
## Troubleshooting
### SSH Connection Failed
```bash
# Test connectivity
ping 192.168.86.50
# Test SSH
ssh papa@192.168.86.50 "echo connected"
# If on different network, use Tailscale
ssh papa@100.80.199.40 "echo connected"
```
### Archive Upload Slow
Large studies (50+ GB) take time. The tool uses `rsync` with progress display.
For very large archives, consider running overnight or using direct LAN connection.
### Out of Disk Space During Archive
The archive is created locally first. Ensure you have ~1.5x the study size free:
- 20 GB study = ~30 GB temp space needed
## Python API
```python
from pathlib import Path

from optimization_engine.utils.study_archiver import (
    analyze_study,
    cleanup_study,
    archive_to_remote,
    restore_from_remote,
    list_remote_archives,
)

# Analyze
analysis = analyze_study(Path("studies/M1_Mirror/m1_mirror_V12"))
print(f"Deletable: {analysis['deletable_size']/1e9:.2f} GB")

# Cleanup (dry_run=False to actually delete)
cleanup_study(Path("studies/M1_Mirror/m1_mirror_V12"), dry_run=False)

# Archive
archive_to_remote(Path("studies/M1_Mirror/m1_mirror_V12"), dry_run=False)

# List remote
archives = list_remote_archives()
for a in archives:
    print(f"{a['name']}: {a['size']}")
```

# SYS_16: Self-Aware Turbo (SAT) Optimization
## Version: 1.0
## Status: PROPOSED
## Created: 2025-12-28
---
## Problem Statement
V5 surrogate + L-BFGS failed catastrophically because:
1. MLP predicted WS=280 but actual was WS=376 (30%+ error)
2. L-BFGS descended to regions **outside training distribution**
3. Surrogate had no way to signal uncertainty
4. All L-BFGS solutions converged to the same "fake optimum"
**Root cause:** The surrogate is overconfident in regions where it has no data.
---
## Solution: Uncertainty-Aware Surrogate with Active Learning
### Core Principles
1. **Never trust a point prediction** - Always require uncertainty bounds
2. **High uncertainty = run FEA** - Don't optimize where you don't know
3. **Actively fill gaps** - Prioritize FEA in high-uncertainty regions
4. **Validate gradient solutions** - Check L-BFGS results against FEA before trusting
---
## Architecture
### 1. Ensemble Surrogate (Epistemic Uncertainty)
Instead of one MLP, train **N independent models** with different initializations:
```python
import numpy as np

class EnsembleSurrogate:
    def __init__(self, n_models=5):
        # MLP() stands in for any regressor with fit()/predict()
        self.models = [MLP() for _ in range(n_models)]

    def predict(self, x):
        preds = [m.predict(x) for m in self.models]
        mean = np.mean(preds, axis=0)
        std = np.std(preds, axis=0)  # Epistemic uncertainty
        return mean, std

    def is_confident(self, x, threshold=0.1):
        mean, std = self.predict(x)
        # Confident if std < 10% of mean
        return (std / (mean + 1e-6)) < threshold
```
**Why this works:** Models trained on different random seeds will agree in well-sampled regions but disagree wildly in extrapolation regions.
### 2. Distance-Based OOD Detection
Track training data distribution and flag points that are "too far":
```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

class OODDetector:
    def __init__(self, X_train):
        self.X_train = X_train
        self.mean = X_train.mean(axis=0)
        self.std = X_train.std(axis=0)
        # Fit KNN for local density
        self.knn = NearestNeighbors(n_neighbors=5)
        self.knn.fit(X_train)

    def distance_to_training(self, x):
        """Return the mean distance to the nearest training points."""
        distances, _ = self.knn.kneighbors(x.reshape(1, -1))
        return distances.mean()

    def is_in_distribution(self, x, threshold=2.0):
        """Check if the point is within 2 std of the training data."""
        z_scores = np.abs((x - self.mean) / (self.std + 1e-6))
        return z_scores.max() < threshold
```
### 3. Trust-Region L-BFGS
Constrain L-BFGS to stay within training distribution:
```python
from scipy.optimize import minimize

def trust_region_lbfgs(surrogate, ood_detector, x0, max_iter=100):
    """L-BFGS that respects training data boundaries."""
    def constrained_objective(x):
        # If OOD, return a large penalty
        if not ood_detector.is_in_distribution(x):
            return 1e9
        mean, std = surrogate.predict(x)
        # If uncertain, return the upper confidence bound (pessimistic)
        if std > 0.1 * mean:
            return mean + 2 * std  # Be conservative
        return mean

    result = minimize(constrained_objective, x0, method='L-BFGS-B',
                      options={'maxiter': max_iter})
    return result.x
```
### 4. Acquisition Function with Uncertainty
Use **Expected Improvement with Uncertainty** (like Bayesian Optimization):
```python
import numpy as np

def acquisition_score(x, surrogate, best_so_far):
    """Score = potential improvement weighted by confidence."""
    mean, std = surrogate.predict(x)
    # Expected improvement (lower is better for minimization)
    improvement = best_so_far - mean
    # Exploration bonus for uncertain regions
    exploration = 0.5 * std
    # High score = worth evaluating with FEA
    return improvement + exploration

def select_next_fea_candidates(surrogate, candidates, best_so_far, n=5):
    """Select candidates balancing exploitation and exploration."""
    scores = [acquisition_score(c, surrogate, best_so_far) for c in candidates]
    # Pick the top-n candidates by acquisition score
    top_indices = np.argsort(scores)[-n:]
    return [candidates[i] for i in top_indices]
```
---
## Algorithm: Self-Aware Turbo (SAT)
```
INITIALIZE:
  - Load existing FEA data (X_train, Y_train)
  - Train ensemble surrogate on data
  - Fit OOD detector on X_train
  - Set best_ws = min(Y_train)

PHASE 1: UNCERTAINTY MAPPING (10% of budget)
  FOR i in 1..N_mapping:
    - Sample random point x
    - Get uncertainty: mean, std = surrogate.predict(x)
    - If std > threshold: run FEA, add to training data
    - Retrain ensemble periodically
  This fills in the "holes" in the surrogate's knowledge.

PHASE 2: EXPLOITATION WITH VALIDATION (80% of budget)
  FOR i in 1..N_exploit:
    - Generate 1000 TPE samples
    - Filter to keep only confident predictions (std < 10% of mean)
    - Filter to keep only in-distribution (OOD check)
    - Rank by predicted WS
    - Take top 5 candidates
    - Run FEA on all 5
    - For each FEA result:
      - Compare predicted vs actual
      - If error > 20%: mark region as "unreliable", force exploration there
      - If error < 10%: update best, retrain surrogate
    - Every 10 iterations: retrain ensemble with new data

PHASE 3: L-BFGS REFINEMENT (10% of budget)
  - Only run L-BFGS if ensemble R² > 0.95 on validation set
  - Use trust-region L-BFGS (stay within training distribution)
  FOR each L-BFGS solution:
    - Check ensemble disagreement
    - If models agree (std < 5%): run FEA to validate
    - If models disagree: skip, too uncertain
    - Compare L-BFGS prediction vs FEA
    - If error > 15%: ABORT L-BFGS phase, return to Phase 2
    - If error < 10%: accept as candidate

FINAL:
  - Return best FEA-validated design
  - Report uncertainty bounds for all objectives
```
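The candidate filtering at the heart of Phase 2 can be sketched as a single function. Here `predict` returns `(mean, std)` as in `EnsembleSurrogate`, and `in_dist` is the OOD check; both are passed in as callables so this sketch stays independent of the class implementations:

```python
def phase2_candidates(samples, predict, in_dist, n_top=5, rel_std_max=0.10):
    """Keep confident, in-distribution samples; return top-n by predicted WS."""
    kept = []
    for x in samples:
        mean, std = predict(x)
        if std > rel_std_max * mean:   # not confident enough
            continue
        if not in_dist(x):             # outside the training distribution
            continue
        kept.append((mean, x))
    kept.sort(key=lambda t: t[0])      # lower predicted WS is better
    return [x for _, x in kept[:n_top]]
```

The surviving candidates are exactly the ones worth spending FEA budget on; everything else is either untrustworthy or extrapolation.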
---
## Key Differences from V5
| Aspect | V5 (Failed) | SAT (Proposed) |
|--------|-------------|----------------|
| **Model** | Single MLP | Ensemble of 5 MLPs |
| **Uncertainty** | None | Ensemble disagreement + OOD detection |
| **L-BFGS** | Trust blindly | Trust-region, validate every step |
| **Extrapolation** | Accept | Reject or penalize |
| **Active learning** | No | Yes - prioritize uncertain regions |
| **Validation** | After L-BFGS | Throughout |
---
## Implementation Checklist
1. [ ] `EnsembleSurrogate` class with N=5 MLPs
2. [ ] `OODDetector` with KNN + z-score checks
3. [ ] `acquisition_score()` balancing exploitation/exploration
4. [ ] Trust-region L-BFGS with OOD penalties
5. [ ] Automatic retraining when new FEA data arrives
6. [ ] Logging of prediction errors to track surrogate quality
7. [ ] Early abort if L-BFGS predictions consistently wrong
---
## Expected Behavior
**In well-sampled regions:**
- Ensemble agrees → Low uncertainty → Trust predictions
- L-BFGS finds valid optima → FEA confirms → Success
**In poorly-sampled regions:**
- Ensemble disagrees → High uncertainty → Run FEA instead
- L-BFGS penalized → Stays in trusted zone → No fake optima
**At distribution boundaries:**
- OOD detector flags → Reject predictions
- Acquisition prioritizes → Active learning fills gaps
---
## Metrics to Track
1. **Surrogate R² on validation set** - Target > 0.95 before L-BFGS
2. **Prediction error histogram** - Should be centered at 0
3. **OOD rejection rate** - How often we refuse to predict
4. **Ensemble disagreement** - Average std across predictions
5. **L-BFGS success rate** - % of L-BFGS solutions that validate
---
## When to Use SAT vs Pure TPE
| Scenario | Recommendation |
|----------|----------------|
| < 100 existing samples | Pure TPE (not enough for good surrogate) |
| 100-500 samples | SAT Phase 1-2 only (no L-BFGS) |
| > 500 samples | Full SAT with L-BFGS refinement |
| High-dimensional (>20 params) | Pure TPE (curse of dimensionality) |
| Noisy FEA | Pure TPE (surrogates struggle with noise) |
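The decision table above can be encoded as a small helper. The thresholds are the ones stated in the table; `noisy_fea` is a judgment call by the user:

```python
def choose_strategy(n_samples, n_params, noisy_fea=False):
    """Map the SAT-vs-TPE decision table to a recommendation."""
    if noisy_fea or n_params > 20 or n_samples < 100:
        return "pure_tpe"
    if n_samples <= 500:
        return "sat_phase_1_2"   # no L-BFGS refinement
    return "full_sat"
```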
---
## References
- Gaussian Process literature on uncertainty quantification
- Deep Ensembles: Lakshminarayanan et al. (2017)
- Bayesian Optimization with Expected Improvement
- Trust-region methods for constrained optimization
---
*The key insight: A surrogate that knows when it doesn't know is infinitely more valuable than one that's confidently wrong.*