feat: Pre-migration checkpoint - updated docs and utilities

Updates before optimization_engine migration:
- Updated migration plan to v2.1 with complete file inventory
- Added OP_07 disk optimization protocol
- Added SYS_16 self-aware turbo protocol
- Added study archiver and cleanup utilities
- Added ensemble surrogate module
- Updated NX solver and session manager
- Updated zernike HTML generator
- Added context engineering plan
- LAC session insights updates

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Date: 2025-12-29 10:22:45 -05:00
Parent: faa7779a43
Commit: 82f36689b7
21 changed files with 6304 additions and 890 deletions

# NXOpen Documentation MCP Server - Setup TODO
**Created:** 2025-12-29
**Status:** PENDING - Waiting for manual configuration
---
## Current State
The NXOpen documentation MCP server exists on **dalidou** (192.168.86.50) but is not accessible from this Windows machine due to hostname resolution issues.
### What's Working
- ✅ Dalidou server is online and reachable at `192.168.86.50`
- ✅ Port 5000 (Documentation Proxy) is responding
- ✅ Port 3000 (Gitea) is responding
- ✅ MCP server code exists at `/srv/claude-assistant/` on dalidou
### What's NOT Working
- ❌ `dalidou.local` hostname doesn't resolve (mDNS not configured on this machine)
- ❌ MCP tools not integrated with Claude Code
---
## Steps to Complete
### Step 1: Fix Hostname Resolution (Manual - requires Admin)
**Option A: Run the script as Administrator**
```powershell
# Open PowerShell as Administrator, then:
C:\Users\antoi\Atomizer\add_dalidou_host.ps1
```
**Option B: Manually edit hosts file**
1. Open Notepad as Administrator
2. Open `C:\Windows\System32\drivers\etc\hosts`
3. Add this line at the end:
```
192.168.86.50 dalidou.local dalidou
```
4. Save the file
**Verify:**
```powershell
ping dalidou.local
```
### Step 2: Verify MCP Server is Running on Dalidou
SSH into dalidou and check:
```bash
ssh root@dalidou
# Check documentation proxy
systemctl status siemensdocumentationproxyserver
# Check MCP server (if it's a service)
# Or check what's running on port 5000
ss -tlnp | grep 5000
```
### Step 3: Configure Claude Code MCP Integration
The MCP server on dalidou uses **stdio-based MCP protocol**, not HTTP. To connect from Claude Code, you'll need one of:
**Option A: SSH-based MCP (if supported)**
Configure in `.claude/settings.json` or MCP config to connect via SSH tunnel.
**Option B: Local Proxy**
Run a local MCP proxy that connects to dalidou's MCP server.
**Option C: HTTP Wrapper**
The current port 5000 service may already expose HTTP endpoints - need to verify once hostname is fixed.
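As an illustration of Option A, a stdio MCP server can sometimes be bridged over SSH by making the MCP `command` an `ssh` invocation. This is a sketch only: the server entry name `nxopen-docs` and the `index.js` entry point are assumptions to verify against what actually lives under `/srv/claude-assistant/mcp-server/` on dalidou.

```json
{
  "mcpServers": {
    "nxopen-docs": {
      "command": "ssh",
      "args": [
        "root@dalidou",
        "node",
        "/srv/claude-assistant/mcp-server/index.js"
      ]
    }
  }
}
```

This only works once the hostname resolves (Step 1) and SSH key auth is in place, since Claude Code cannot answer an interactive password prompt.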
---
## Server Documentation Reference
Full documentation is in the SERVtomaste repo:
- **URL:** http://192.168.86.50:3000/Antoine/SERVtomaste
- **File:** `docs/SIEMENS-DOCS-SERVER.md`
### Key Server Paths (on dalidou)
```
/srv/siemens-docs/proxy/ # Documentation Proxy (port 5000)
/srv/claude-assistant/ # MCP Server
/srv/claude-assistant/mcp-server/ # MCP server code
/srv/claude-assistant/tools/ # Tool implementations
├── siemens-auth.js # Puppeteer authentication
├── siemens-docs.js # Documentation fetching
└── ...
/srv/claude-assistant/vault/ # Credentials (secured)
```
### Available MCP Tools (once connected)
| Tool | Description |
|------|-------------|
| `siemens_docs_search` | Search NX Open, Simcenter docs |
| `siemens_docs_fetch` | Fetch specific documentation page |
| `siemens_auth_status` | Check if auth session is active |
| `siemens_login` | Re-login if session expired |
| `siemens_docs_list` | List documentation categories |
---
## Files Created During Investigation
- `C:\Users\antoi\Atomizer\add_dalidou_host.ps1` - Script to add hosts entry (run as Admin)
- `C:\Users\antoi\Atomizer\test_mcp.py` - Test script for probing MCP server (can be deleted)
---
## Related Documentation
- `.claude/skills/modules/nx-docs-lookup.md` - How to use MCP tools once configured
- `docs/08_ARCHIVE/historical/NXOPEN_DOCUMENTATION_INTEGRATION_STRATEGY.md` - Full strategy doc
- `docs/05_API_REFERENCE/NXOPEN_RESOURCES.md` - Alternative NXOpen resources
---
## Workaround Until Fixed
Without the MCP server, you can still look up NXOpen documentation by:
1. **Using web search** - I can search for NXOpen API documentation online
2. **Using local stub files** - Python stubs at `C:\Program Files\Siemens\NX2412\UGOPEN\pythonStubs\`
3. **Using existing extractors** - Check `optimization_engine/extractors/` for patterns
4. **Recording NX journals** - Record operations in NX to learn the API calls
---
*To continue setup, run the hosts file fix and let me know when ready.*

# OP_07: Disk Space Optimization
**Version:** 1.0
**Last Updated:** 2025-12-29
## Overview
This protocol manages disk space for Atomizer studies through:
1. **Local cleanup** - Remove regenerable files from completed studies
2. **Remote archival** - Archive to dalidou server (14TB available)
3. **On-demand restore** - Pull archived studies when needed
## Disk Usage Analysis
### Typical Study Breakdown
| File Type | Size/Trial | Purpose | Keep? |
|-----------|------------|---------|-------|
| `.op2` | 68 MB | Nastran results | **YES** - Needed for analysis |
| `.prt` | 30 MB | NX parts | NO - Copy of master |
| `.dat` | 16 MB | Solver input | NO - Regenerable |
| `.fem` | 14 MB | FEM mesh | NO - Copy of master |
| `.sim` | 7 MB | Simulation | NO - Copy of master |
| `.afm` | 4 MB | Assembly FEM | NO - Regenerable |
| `.json` | <1 MB | Params/results | **YES** - Metadata |
| Logs | <1 MB | F04/F06/log | NO - Diagnostic only |
**Per-trial overhead:** ~150 MB total, only ~70 MB essential
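The ~150 MB / ~70 MB figures follow directly from the table above (a quick sanity check; sizes are approximate):

```python
# Approximate per-trial file sizes from the table above (MB)
sizes_mb = {
    ".op2": 68, ".prt": 30, ".dat": 16, ".fem": 14,
    ".sim": 7, ".afm": 4, ".json": 1, "logs": 1,
}
essential = {".op2", ".json"}

total = sum(sizes_mb.values())
kept = sum(v for k, v in sizes_mb.items() if k in essential)
print(f"Total per trial: ~{total} MB, essential: ~{kept} MB")
# Roughly matches the ~150 MB / ~70 MB figures quoted above
```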
### M1_Mirror Example
```
Current: 194 GB (28 studies, 2000+ trials)
After cleanup: 95 GB (51% reduction)
After archive: 5 GB (keep best_design_archive only)
```
## Commands
### 1. Analyze Disk Usage
```bash
# Single study
archive_study.bat analyze studies\M1_Mirror\m1_mirror_V12
# All studies in a project
archive_study.bat analyze studies\M1_Mirror
```
Output shows:
- Total size
- Essential vs deletable breakdown
- Trial count per study
- Per-extension analysis
### 2. Cleanup Completed Study
```bash
# Dry run (default) - see what would be deleted
archive_study.bat cleanup studies\M1_Mirror\m1_mirror_V12
# Actually delete
archive_study.bat cleanup studies\M1_Mirror\m1_mirror_V12 --execute
```
**What gets deleted:**
- `.prt`, `.fem`, `.sim`, `.afm` in trial folders
- `.dat`, `.f04`, `.f06`, `.log`, `.diag` solver files
- Temp files (`.txt`, `.exp`, `.bak`)
**What is preserved:**
- `1_setup/` folder (master model)
- `3_results/` folder (database, reports)
- All `.op2` files (Nastran results)
- All `.json` files (params, metadata)
- All `.npz` files (Zernike coefficients)
- `best_design_archive/` folder
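The delete/preserve rules above amount to a simple filter. A minimal sketch (the real implementation lives in `optimization_engine/utils/study_archiver.py`; the helper name below is illustrative only):

```python
from pathlib import Path

# Extensions removed from trial folders during cleanup
DELETABLE = {".prt", ".fem", ".sim", ".afm", ".dat",
             ".f04", ".f06", ".log", ".diag", ".txt", ".exp", ".bak"}
# Folders that cleanup never touches
PROTECTED_DIRS = {"1_setup", "3_results", "best_design_archive"}

def deletable_files(study_dir: Path):
    """Yield files that a cleanup pass would remove."""
    for f in study_dir.rglob("*"):
        if not f.is_file():
            continue
        # Skip anything under a protected folder
        if PROTECTED_DIRS & {p.name for p in f.parents}:
            continue
        if f.suffix.lower() in DELETABLE:
            yield f
```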
### 3. Archive to Remote Server
```bash
# Dry run
archive_study.bat archive studies\M1_Mirror\m1_mirror_V12
# Actually archive
archive_study.bat archive studies\M1_Mirror\m1_mirror_V12 --execute
# Use Tailscale (when not on local network)
archive_study.bat archive studies\M1_Mirror\m1_mirror_V12 --execute --tailscale
```
**Process:**
1. Creates compressed `.tar.gz` archive
2. Uploads to `papa@192.168.86.50:/srv/storage/atomizer-archive/`
3. Deletes local archive after successful upload
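The three-step process can be sketched as below. `shutil.make_archive` is one way to produce the `.tar.gz`, and the `rsync` flags match the progress display described later; this is a sketch, not the shipped tool:

```python
import shutil
import subprocess
from pathlib import Path

REMOTE = "papa@192.168.86.50:/srv/storage/atomizer-archive/"

def build_upload_cmd(archive: Path, remote: str = REMOTE):
    """rsync with archive mode and progress display."""
    return ["rsync", "-av", "--progress", str(archive), remote]

def archive_study(study_dir: Path, dry_run: bool = True):
    # 1. Create compressed .tar.gz next to the study folder
    archive = Path(shutil.make_archive(str(study_dir), "gztar",
                                       root_dir=study_dir.parent,
                                       base_dir=study_dir.name))
    cmd = build_upload_cmd(archive)
    if dry_run:
        print("Would run:", " ".join(cmd))
    else:
        # 2. Upload, then 3. delete local archive after success
        subprocess.run(cmd, check=True)
        archive.unlink()
    return archive
```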
### 4. List Remote Archives
```bash
archive_study.bat list
# Via Tailscale
archive_study.bat list --tailscale
```
### 5. Restore from Remote
```bash
# Restore to studies/ folder
archive_study.bat restore m1_mirror_V12
# Via Tailscale
archive_study.bat restore m1_mirror_V12 --tailscale
```
## Remote Server Setup
**Server:** dalidou (Lenovo W520)
- Local IP: `192.168.86.50`
- Tailscale IP: `100.80.199.40`
- SSH user: `papa`
- Archive path: `/srv/storage/atomizer-archive/`
### First-Time Setup
SSH into dalidou and create the archive directory:
```bash
ssh papa@192.168.86.50
mkdir -p /srv/storage/atomizer-archive
```
Ensure SSH key authentication is set up for passwordless transfers. Note that `ssh-copy-id` is not shipped with the Windows OpenSSH client, so append the public key manually:
```powershell
# On Windows (PowerShell) - adjust the key filename if yours differs
type $env:USERPROFILE\.ssh\id_ed25519.pub | ssh papa@192.168.86.50 "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"
```
## Recommended Workflow
### During Active Optimization
Keep all files - you may need to re-run specific trials.
### After Study Completion
1. **Generate final report** (`STUDY_REPORT.md`)
2. **Archive best design** to `3_results/best_design_archive/`
3. **Cleanup:**
```bash
archive_study.bat cleanup studies\M1_Mirror\m1_mirror_V12 --execute
```
### For Long-Term Storage
1. **After cleanup**, archive to server:
```bash
archive_study.bat archive studies\M1_Mirror\m1_mirror_V12 --execute
```
2. **Optionally delete local** (keep only `3_results/best_design_archive/`)
### When Revisiting Old Study
1. **Restore:**
```bash
archive_study.bat restore m1_mirror_V12
```
2. If you need to re-run trials, the `1_setup/` master files allow regenerating everything
## Safety Features
- **Dry run by default** - Must add `--execute` to actually delete/transfer
- **Master files preserved** - `1_setup/` is never touched
- **Results preserved** - `3_results/` is never touched
- **Essential files preserved** - OP2, JSON, NPZ always kept
## Disk Space Targets
| Stage | M1_Mirror Target |
|-------|------------------|
| Active development | 200 GB (full) |
| Completed studies | 95 GB (after cleanup) |
| Archived (minimal local) | 5 GB (best only) |
| Server archive | 50 GB compressed |
## Troubleshooting
### SSH Connection Failed
```bash
# Test connectivity
ping 192.168.86.50
# Test SSH
ssh papa@192.168.86.50 "echo connected"
# If on different network, use Tailscale
ssh papa@100.80.199.40 "echo connected"
```
### Archive Upload Slow
Large studies (50+ GB) take time. The tool uses `rsync` with progress display.
For very large archives, consider running overnight or using direct LAN connection.
### Out of Disk Space During Archive
The archive is created locally first. Ensure you have ~1.5x the study size free:
- 20 GB study = ~30 GB temp space needed
## Python API
```python
from pathlib import Path

from optimization_engine.utils.study_archiver import (
    analyze_study,
    cleanup_study,
    archive_to_remote,
    restore_from_remote,
    list_remote_archives,
)

# Analyze
analysis = analyze_study(Path("studies/M1_Mirror/m1_mirror_V12"))
print(f"Deletable: {analysis['deletable_size']/1e9:.2f} GB")

# Cleanup (dry_run=False to actually delete)
cleanup_study(Path("studies/M1_Mirror/m1_mirror_V12"), dry_run=False)

# Archive
archive_to_remote(Path("studies/M1_Mirror/m1_mirror_V12"), dry_run=False)

# List remote
archives = list_remote_archives()
for a in archives:
    print(f"{a['name']}: {a['size']}")
```

# SYS_16: Self-Aware Turbo (SAT) Optimization
## Version: 1.0
## Status: PROPOSED
## Created: 2025-12-28
---
## Problem Statement
V5 surrogate + L-BFGS failed catastrophically because:
1. MLP predicted WS=280 but actual was WS=376 (30%+ error)
2. L-BFGS descended to regions **outside training distribution**
3. Surrogate had no way to signal uncertainty
4. All L-BFGS solutions converged to the same "fake optimum"
**Root cause:** The surrogate is overconfident in regions where it has no data.
---
## Solution: Uncertainty-Aware Surrogate with Active Learning
### Core Principles
1. **Never trust a point prediction** - Always require uncertainty bounds
2. **High uncertainty = run FEA** - Don't optimize where you don't know
3. **Actively fill gaps** - Prioritize FEA in high-uncertainty regions
4. **Validate gradient solutions** - Check L-BFGS results against FEA before trusting
---
## Architecture
### 1. Ensemble Surrogate (Epistemic Uncertainty)
Instead of one MLP, train **N independent models** with different initializations:
```python
import numpy as np

class EnsembleSurrogate:
    def __init__(self, n_models=5):
        # MLP() stands in for any regressor with fit()/predict()
        self.models = [MLP() for _ in range(n_models)]

    def predict(self, x):
        preds = [m.predict(x) for m in self.models]
        mean = np.mean(preds, axis=0)
        std = np.std(preds, axis=0)  # Epistemic uncertainty
        return mean, std

    def is_confident(self, x, threshold=0.1):
        mean, std = self.predict(x)
        # Confident if std < 10% of mean
        return (std / (mean + 1e-6)) < threshold
```
**Why this works:** Models trained on different random seeds will agree in well-sampled regions but disagree wildly in extrapolation regions.
### 2. Distance-Based OOD Detection
Track training data distribution and flag points that are "too far":
```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

class OODDetector:
    def __init__(self, X_train):
        self.X_train = X_train
        self.mean = X_train.mean(axis=0)
        self.std = X_train.std(axis=0)
        # Fit KNN for local density
        self.knn = NearestNeighbors(n_neighbors=5)
        self.knn.fit(X_train)

    def distance_to_training(self, x):
        """Return the mean distance to the nearest training points."""
        distances, _ = self.knn.kneighbors(x.reshape(1, -1))
        return distances.mean()

    def is_in_distribution(self, x, threshold=2.0):
        """Check if the point is within 2 std of the training data."""
        z_scores = np.abs((x - self.mean) / (self.std + 1e-6))
        return z_scores.max() < threshold
```
### 3. Trust-Region L-BFGS
Constrain L-BFGS to stay within training distribution:
```python
from scipy.optimize import minimize

def trust_region_lbfgs(surrogate, ood_detector, x0, max_iter=100):
    """L-BFGS that respects training data boundaries."""
    def constrained_objective(x):
        # If OOD, return a large penalty
        if not ood_detector.is_in_distribution(x):
            return 1e9
        mean, std = surrogate.predict(x)
        # If uncertain, return the upper confidence bound (pessimistic)
        if std > 0.1 * mean:
            return mean + 2 * std  # Be conservative
        return mean

    result = minimize(constrained_objective, x0, method='L-BFGS-B',
                      options={'maxiter': max_iter})
    return result.x
```
### 4. Acquisition Function with Uncertainty
Use **Expected Improvement with Uncertainty** (like Bayesian Optimization):
```python
import numpy as np

def acquisition_score(x, surrogate, best_so_far):
    """Score = potential improvement weighted by confidence."""
    mean, std = surrogate.predict(x)
    # Expected improvement (lower is better for minimization)
    improvement = best_so_far - mean
    # Exploration bonus for uncertain regions
    exploration = 0.5 * std
    # High score = worth evaluating with FEA
    return improvement + exploration

def select_next_fea_candidates(surrogate, candidates, best_so_far, n=5):
    """Select candidates balancing exploitation and exploration."""
    scores = [acquisition_score(c, surrogate, best_so_far) for c in candidates]
    # Pick the top-n candidates by acquisition score
    top_indices = np.argsort(scores)[-n:]
    return [candidates[i] for i in top_indices]
```
---
## Algorithm: Self-Aware Turbo (SAT)
```
INITIALIZE:
  - Load existing FEA data (X_train, Y_train)
  - Train ensemble surrogate on data
  - Fit OOD detector on X_train
  - Set best_ws = min(Y_train)

PHASE 1: UNCERTAINTY MAPPING (10% of budget)
  FOR i in 1..N_mapping:
    - Sample random point x
    - Get uncertainty: mean, std = surrogate.predict(x)
    - If std > threshold: run FEA, add to training data
    - Retrain ensemble periodically
  This fills in the "holes" in the surrogate's knowledge.

PHASE 2: EXPLOITATION WITH VALIDATION (80% of budget)
  FOR i in 1..N_exploit:
    - Generate 1000 TPE samples
    - Filter to keep only confident predictions (std < 10% of mean)
    - Filter to keep only in-distribution (OOD check)
    - Rank by predicted WS
    - Take top 5 candidates
    - Run FEA on all 5
    - For each FEA result:
      - Compare predicted vs actual
      - If error > 20%: mark region as "unreliable", force exploration there
      - If error < 10%: update best, retrain surrogate
    - Every 10 iterations: retrain ensemble with new data

PHASE 3: L-BFGS REFINEMENT (10% of budget)
  - Only run L-BFGS if ensemble R² > 0.95 on validation set
  - Use trust-region L-BFGS (stay within training distribution)
  FOR each L-BFGS solution:
    - Check ensemble disagreement
    - If models agree (std < 5%): run FEA to validate
    - If models disagree: skip, too uncertain
    - Compare L-BFGS prediction vs FEA
    - If error > 15%: ABORT L-BFGS phase, return to Phase 2
    - If error < 10%: accept as candidate

FINAL:
  - Return best FEA-validated design
  - Report uncertainty bounds for all objectives
```
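The candidate filtering at the heart of Phase 2 can be sketched as a single function. Here `predict` returns `(mean, std)` as in `EnsembleSurrogate`, and `in_dist` is the OOD check; both are passed in as callables so this sketch stays independent of the class implementations:

```python
def phase2_candidates(samples, predict, in_dist, n_top=5, rel_std_max=0.10):
    """Keep confident, in-distribution samples; return top-n by predicted WS."""
    kept = []
    for x in samples:
        mean, std = predict(x)
        if std > rel_std_max * mean:   # not confident enough
            continue
        if not in_dist(x):             # outside the training distribution
            continue
        kept.append((mean, x))
    kept.sort(key=lambda t: t[0])      # lower predicted WS is better
    return [x for _, x in kept[:n_top]]
```

The surviving candidates are exactly the ones worth spending FEA budget on; everything else is either untrustworthy or extrapolation.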
---
## Key Differences from V5
| Aspect | V5 (Failed) | SAT (Proposed) |
|--------|-------------|----------------|
| **Model** | Single MLP | Ensemble of 5 MLPs |
| **Uncertainty** | None | Ensemble disagreement + OOD detection |
| **L-BFGS** | Trust blindly | Trust-region, validate every step |
| **Extrapolation** | Accept | Reject or penalize |
| **Active learning** | No | Yes - prioritize uncertain regions |
| **Validation** | After L-BFGS | Throughout |
---
## Implementation Checklist
1. [ ] `EnsembleSurrogate` class with N=5 MLPs
2. [ ] `OODDetector` with KNN + z-score checks
3. [ ] `acquisition_score()` balancing exploitation/exploration
4. [ ] Trust-region L-BFGS with OOD penalties
5. [ ] Automatic retraining when new FEA data arrives
6. [ ] Logging of prediction errors to track surrogate quality
7. [ ] Early abort if L-BFGS predictions consistently wrong
---
## Expected Behavior
**In well-sampled regions:**
- Ensemble agrees → Low uncertainty → Trust predictions
- L-BFGS finds valid optima → FEA confirms → Success
**In poorly-sampled regions:**
- Ensemble disagrees → High uncertainty → Run FEA instead
- L-BFGS penalized → Stays in trusted zone → No fake optima
**At distribution boundaries:**
- OOD detector flags → Reject predictions
- Acquisition prioritizes → Active learning fills gaps
---
## Metrics to Track
1. **Surrogate R² on validation set** - Target > 0.95 before L-BFGS
2. **Prediction error histogram** - Should be centered at 0
3. **OOD rejection rate** - How often we refuse to predict
4. **Ensemble disagreement** - Average std across predictions
5. **L-BFGS success rate** - % of L-BFGS solutions that validate
---
## When to Use SAT vs Pure TPE
| Scenario | Recommendation |
|----------|----------------|
| < 100 existing samples | Pure TPE (not enough for good surrogate) |
| 100-500 samples | SAT Phase 1-2 only (no L-BFGS) |
| > 500 samples | Full SAT with L-BFGS refinement |
| High-dimensional (>20 params) | Pure TPE (curse of dimensionality) |
| Noisy FEA | Pure TPE (surrogates struggle with noise) |
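The decision table above can be encoded as a small helper. The thresholds are the ones stated in the table; `noisy_fea` is a judgment call by the user:

```python
def choose_strategy(n_samples, n_params, noisy_fea=False):
    """Map the SAT-vs-TPE decision table to a recommendation."""
    if noisy_fea or n_params > 20 or n_samples < 100:
        return "pure_tpe"
    if n_samples <= 500:
        return "sat_phase_1_2"   # no L-BFGS refinement
    return "full_sat"
```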
---
## References
- Gaussian Process literature on uncertainty quantification
- Deep Ensembles: Lakshminarayanan et al. (2017)
- Bayesian Optimization with Expected Improvement
- Trust-region methods for constrained optimization
---
*The key insight: A surrogate that knows when it doesn't know is infinitely more valuable than one that's confidently wrong.*