Atomizer/docs/protocols/operations/OP_06_TROUBLESHOOT.md

# OP_06: Troubleshoot

<!--
PROTOCOL: Troubleshoot Optimization Issues
LAYER: Operations
VERSION: 1.0
STATUS: Active
LAST_UPDATED: 2025-12-05
PRIVILEGE: user
LOAD_WITH: []
-->

## Overview

This protocol provides systematic troubleshooting steps for common optimization issues: environment errors, NX model setup and solver errors, extraction failures, database problems, infeasible constraints, and performance issues.

---

## When to Use

| Trigger | Action |
|---------|--------|
| "error", "failed" | Follow this protocol |
| "not working", "crashed" | Follow this protocol |
| "help", "stuck" | Follow this protocol |
| Unexpected behavior | Follow this protocol |

---

## Quick Diagnostic

```bash
# 1. Check environment
conda activate atomizer
python --version  # Should be 3.9+

# 2. Check study structure
ls studies/my_study/
# Should have: 1_setup/, run_optimization.py

# 3. Check model files
ls studies/my_study/1_setup/model/
# Should have: .prt, .sim files

# 4. Test single trial (run from the study directory, where run_optimization.py lives)
cd studies/my_study
python run_optimization.py --test
```
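
The structure checks in steps 2 and 3 can also be scripted. The following is a minimal sketch, assuming only the layout named above (`1_setup/`, `run_optimization.py`, and `.prt`/`.sim` files under `1_setup/model/`). The function name `check_study_layout` is hypothetical and not part of Atomizer.

```python
from pathlib import Path


def check_study_layout(study_dir):
    """Return a list of problems found; an empty list means the layout looks fine."""
    study = Path(study_dir)
    problems = []

    if not (study / "1_setup").is_dir():
        problems.append("missing directory: 1_setup/")
    if not (study / "run_optimization.py").is_file():
        problems.append("missing file: run_optimization.py")

    # The model directory needs at least one .prt and one .sim file.
    model_dir = study / "1_setup" / "model"
    for suffix in (".prt", ".sim"):
        found = model_dir.is_dir() and any(
            p.suffix.lower() == suffix for p in model_dir.iterdir()
        )
        if not found:
            problems.append(f"no {suffix} file in 1_setup/model/")

    return problems
```

Calling `check_study_layout("studies/my_study")` and printing each returned entry gives a quick list of what to fix before running a test trial.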

---

## Error Categories

### 1. Environment Errors

#### "ModuleNotFoundError: No module named 'optuna'"

**Cause**: Wrong Python environment

**Solution**:
```bash
conda activate atomizer
# Verify
conda list | grep optuna
```

#### "Python version mismatch"

**Cause**: Wrong Python version

**Solution**:
```bash
python --version  # Need 3.9+
conda activate atomizer
```

---

### 2. NX Model Setup Errors

#### "All optimization trials produce identical results"

**Cause**: Missing idealized part (`*_i.prt`) or broken file chain

**Symptoms**:
- Journal shows "FE model updated" but results don't change
- DAT files contain identical node coordinates even though the expression values differ
- OP2 file timestamps update but the result values are identical

**Root Cause**: NX simulation files have a parent-child hierarchy:
```
.sim → .fem → _i.prt → .prt (geometry)
```

If the `_i.prt` (idealized part) is missing or not properly linked, `UpdateFemodel()` runs but the mesh does not regenerate because:
- The FEM mesh is tied to the idealized geometry, not the master geometry
- If the idealized part does not update, the FEM has no new geometry to mesh against

**Solution**:
1. **Check file chain in NX**:
   - Open `.sim` file
   - Go to **Part Navigator** or **Assembly Navigator**
   - List ALL referenced parts

2. **Copy ALL linked files** to study folder:
   ```bash
   # Typical file set needed:
   Model.prt           # Geometry
   Model_fem1_i.prt    # Idealized part ← OFTEN MISSING!
   Model_fem1.fem      # FEM file
   Model_sim1.sim      # Simulation file
   ```

3. **Verify links are intact**:
   - Open model in NX after copying
   - Check that updates propagate: Geometry → Idealized → FEM → Sim

4. **CRITICAL CODE FIX** (already implemented in `solve_simulation.py`):
   The idealized part MUST be explicitly loaded before `UpdateFemodel()`:
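   ```python
   import os

   # theSession, working_dir and feModel come from the surrounding NX journal.
   # Load the idealized part BEFORE updating the FEM
   for filename in os.listdir(working_dir):
       if '_i.prt' in filename.lower():
           # Build the full path; Parts.Open() needs more than the bare filename
           path = os.path.join(working_dir, filename)
           idealized_part, status = theSession.Parts.Open(path)
           break

   # Now UpdateFemodel() will work correctly
   feModel.UpdateFemodel()
   ```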
   Without loading the `_i.prt`, NX cannot propagate geometry changes to the mesh.

**Prevention**: Always use introspection to list all parts referenced by a simulation.
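
As a cheap pre-check before opening NX, the study folder can be scanned for the four file roles listed above. This is a minimal sketch based on filename suffixes only. It does not verify that the NX links are intact, and the names `classify_model_files` and `missing_roles` are hypothetical.

```python
from pathlib import Path


def classify_model_files(model_dir):
    """Group files in model_dir by their role in the .sim -> .fem -> _i.prt -> .prt chain."""
    roles = {"geometry": [], "idealized": [], "fem": [], "sim": []}
    for path in sorted(Path(model_dir).iterdir()):
        name = path.name.lower()
        # Test the idealized suffix first, because "_i.prt" also ends with ".prt".
        if name.endswith("_i.prt"):
            roles["idealized"].append(path.name)
        elif name.endswith(".prt"):
            roles["geometry"].append(path.name)
        elif name.endswith(".fem"):
            roles["fem"].append(path.name)
        elif name.endswith(".sim"):
            roles["sim"].append(path.name)
    return roles


def missing_roles(model_dir):
    """Return the roles for which no file was found."""
    return [role for role, files in classify_model_files(model_dir).items() if not files]
```

If `missing_roles(...)` returns `["idealized"]`, the `_i.prt` file was probably not copied along with the rest of the model.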

---

### 3. NX/Solver Errors

#### "NX session timeout after 600s"

**Cause**: The model is too complex to solve within the timeout, or the NX session is hung

**Solution**:
1. Increase timeout in config:
   ```json
   "simulation": {
     "timeout": 1200
   }
   ```
2. Simplify mesh if possible
3. Check NX license availability

#### "Expression 'xxx' not found in model"

**Cause**: Expression name mismatch

**Solution**:
1. Open model in NX
2. Go to Tools → Expressions
3. Verify exact expression name (case-sensitive)
4. Update config to match
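
Once the expression names have been copied out of the NX Expressions dialog, a small helper can point at the likely intended name. This is a minimal sketch: the function name `suggest_expression` is hypothetical, and the list of available names is assumed to be supplied by hand rather than read from NX.

```python
import difflib


def suggest_expression(requested, available):
    """Return likely intended names from `available` for a requested expression name."""
    # Exact match: nothing to fix.
    if requested in available:
        return [requested]

    # Names are case-sensitive, so check for a case-only mismatch first.
    case_matches = [name for name in available if name.lower() == requested.lower()]
    if case_matches:
        return case_matches

    # Otherwise fall back to fuzzy matching on lowercased names.
    lowered = {name.lower(): name for name in available}
    close = difflib.get_close_matches(requested.lower(), list(lowered), n=3, cutoff=0.6)
    return [lowered[name] for name in close]
```

An empty result means nothing similar exists in the model, which usually points to a wrong model file rather than a typo.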

#### "NX license error"

**Cause**: License server unavailable

**Solution**:
1. Check license server status
2. Wait and retry
3. Contact IT if persistent

#### "NX solve failed - check log"

**Cause**: Nastran solver error

**Solution**:
1. Find log file: `1_setup/model/*.log` or `*.f06`
2. Search for "FATAL" or "ERROR"
3. Common causes:
   - Singular stiffness matrix (constraints issue)
   - Bad mesh (distorted elements)
   - Missing material properties
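
Step 2 can be scripted when the logs are long. The following is a minimal sketch that scans `*.log` and `*.f06` files for lines containing "FATAL" or "ERROR" and keeps a couple of following lines for context. The function name `find_solver_errors` and the `context` parameter are hypothetical.

```python
import re
from pathlib import Path

ERROR_PATTERN = re.compile(r"fatal|error", re.IGNORECASE)


def find_solver_errors(model_dir, patterns=("*.log", "*.f06"), context=2):
    """Return (filename, line_number, excerpt) tuples for each matching line."""
    hits = []
    for pattern in patterns:
        for path in sorted(Path(model_dir).glob(pattern)):
            # errors="replace" keeps the scan going if the log has odd bytes.
            lines = path.read_text(errors="replace").splitlines()
            for index, line in enumerate(lines):
                if ERROR_PATTERN.search(line):
                    excerpt = lines[index : index + 1 + context]
                    hits.append((path.name, index + 1, excerpt))
    return hits
```

Each excerpt starts at the matching line, so the first reported hit is usually the best place to begin reading.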

---

### 4. Extraction Errors

#### "OP2 file not found"

**Cause**: Solve didn't produce output

**Solution**:
1. Check if solve completed
2. Look for `.op2` file in model directory
3. Check NX log for solve errors
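
A related check is whether the `.op2` that is present was actually written by the current solve. The sketch below lists `.op2` files that are non-empty and were modified at or after a given start time. It assumes, as a working hypothesis, that a stale file left over from an earlier trial is a possible failure mode. The function name `find_fresh_op2` is hypothetical.

```python
from pathlib import Path


def find_fresh_op2(model_dir, started_at):
    """Return .op2 files that are non-empty and modified at or after started_at (epoch seconds)."""
    fresh = []
    for path in sorted(Path(model_dir).glob("*.op2")):
        stat = path.stat()
        if stat.st_size > 0 and stat.st_mtime >= started_at:
            fresh.append(path)
    return fresh
```

Record `time.time()` just before launching the solve and pass it as `started_at`; an empty result then means the solve produced no new output.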

#### "No displacement data for subcase X"

**Cause**: Wrong subcase number

**Solution**:
1. Check available subcases in OP2:
   ```python
   from pyNastran.op2.op2 import OP2
   op2 = OP2()
   op2.read_op2('model.op2')
   print(op2.displacements.keys())
   ```
2. Update subcase in extractor call

#### "Element type 'xxx' not supported"

**Cause**: Extractor doesn't support element type

**Solution**:
1. Check available types in extractor
2. Common types: `cquad4`, `ctria3`, `ctetra`, `chexa`
3. May need different extractor

---

### 5. Database Errors

#### "Database is locked"

**Cause**: Another process using database

**Solution**:
1. Check for running processes:
   ```bash
   ps aux | grep run_optimization
   ```
2. Kill stale process if needed
3. Wait for other optimization to finish
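
To confirm that the database is currently write-locked, it can be probed directly with Python's built-in `sqlite3` module. This is a minimal sketch: the function name `is_sqlite_locked` is hypothetical, the probe briefly requests the write lock itself, and `sqlite3.connect` creates an empty file if the path does not exist, so point it only at an existing `study.db`.

```python
import sqlite3


def is_sqlite_locked(db_path, timeout=0.5):
    """Return True if another connection holds a lock that blocks writes."""
    # isolation_level=None gives manual transaction control (autocommit mode).
    conn = sqlite3.connect(db_path, timeout=timeout, isolation_level=None)
    try:
        # BEGIN IMMEDIATE requests the write lock without modifying any data.
        conn.execute("BEGIN IMMEDIATE")
        conn.execute("ROLLBACK")
        return False
    except sqlite3.OperationalError as exc:
        if "locked" in str(exc).lower():
            return True
        raise
    finally:
        conn.close()
```

A `True` result while no optimization is visibly running suggests a stale process that still holds the file open.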

#### "Study 'xxx' not found"

**Cause**: Wrong study name or path

**Solution**:
1. Check exact study name in database:
   ```python
   import optuna
   # List every study stored in the database with its trial count
   for summary in optuna.get_all_study_summaries(storage='sqlite:///study.db'):
       print(summary.study_name, summary.n_trials)
   ```
2. Use correct name when loading

#### "IntegrityError: UNIQUE constraint failed"

**Cause**: Duplicate trial number

**Solution**:
1. Don't run multiple optimizations on same study simultaneously
2. Use `--resume` flag for continuation

---

### 6. Constraint/Feasibility Errors

#### "All trials pruned"

**Cause**: No feasible region

**Solution**:
1. Check constraint values:
   ```python
   # In objective function, print constraint values
   print(f"Stress: {stress}, limit: 250")
   ```
2. Relax constraints
3. Widen design variable bounds
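
Before relaxing anything, it helps to know which constraint is rejecting the trials. The sketch below computes, for each upper-limit constraint, the fraction of recorded trials that exceed it. The function name `violation_rates` and the dictionary layout are hypothetical; the values are assumed to have been collected from the printed constraint output.

```python
def violation_rates(trial_values, limits):
    """Fraction of trials exceeding each upper limit.

    trial_values: list of dicts, one per trial, e.g. {"stress": 310.0}
    limits: dict of upper limits, e.g. {"stress": 250.0}
    """
    rates = {}
    for name, limit in limits.items():
        observed = [trial[name] for trial in trial_values if name in trial]
        if observed:
            violated = sum(1 for value in observed if value > limit)
            rates[name] = violated / len(observed)
    return rates
```

A rate near 1.0 for one constraint and near 0.0 for the others indicates which limit or design variable bound to revisit first.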

#### "No improvement after N trials"

**Cause**: Stuck in local minimum or converged

**Solution**:
1. Check if truly converged (good result)
2. Try different starting region
3. Use different sampler
4. Increase exploration (raise `n_startup_trials` so that more trials are sampled randomly before the sampler starts exploiting)

---

### 7. Performance Issues

#### "Trials running very slowly"

**Cause**: Complex model or inefficient extraction

**Solution**:
1. Profile time per component:
   ```python
   import time
   start = time.time()
   # ... operation ...
   print(f"Took: {time.time() - start:.1f}s")
   ```
2. Simplify mesh if NX is slow
3. Check extraction isn't re-parsing OP2 multiple times
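
To compare stages rather than time one operation at a time, the timing pattern from step 1 can be wrapped in a context manager that accumulates totals per stage. This is a minimal sketch; the names `timed` and `stage_totals` and the stage labels are hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated seconds per stage name, shared across trials.
stage_totals = defaultdict(float)


@contextmanager
def timed(stage, totals=stage_totals):
    """Add the elapsed wall-clock time of the with-block to totals[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the block raises, so failed solves are still counted.
        totals[stage] += time.perf_counter() - start
```

Wrapping the solve and extraction steps in `with timed("solve"):` and `with timed("extract"):` and printing `stage_totals` after a few trials shows where the time goes.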

#### "Memory error"

**Cause**: Large OP2 file or many trials

**Solution**:
1. Free large objects between trials (for example, `del` the parsed OP2 object and call `gc.collect()`)
2. Don't store all results in memory
3. Use database for persistence

---

## Diagnostic Commands

### Quick Health Check

```bash
# Environment
conda activate atomizer
python -c "import optuna; print('Optuna OK')"
python -c "import pyNastran; print('pyNastran OK')"

# Study structure
ls -la studies/my_study/

# Config validity
python -c "
import json
with open('studies/my_study/1_setup/optimization_config.json') as f:
    config = json.load(f)
print('Config OK')
print(f'Objectives: {len(config.get(\"objectives\", []))}')
"

# Database status
python -c "
import optuna
study = optuna.load_study(study_name='my_study', storage='sqlite:///studies/my_study/2_results/study.db')
print(f'Trials: {len(study.trials)}')
"
```
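
When the config check fails, the raw traceback can be hard to read. The sketch below loads the config and reports the line and column of a JSON syntax error using the standard `json.JSONDecodeError` attributes. The function name `load_config` is hypothetical.

```python
import json


def load_config(path):
    """Return (config, error); error is None when the file loaded cleanly."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f), None
    except FileNotFoundError:
        return None, f"config not found: {path}"
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"
```

Trailing commas and unquoted keys are typical causes of a syntax error, and the reported position usually sits just after the offending character.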

### NX Log Analysis

```bash
# Find latest log
ls -lt studies/my_study/1_setup/model/*.log | head -1

# Search for errors
grep -i "error\|fatal\|fail" studies/my_study/1_setup/model/*.log
```

### Trial Failure Analysis

```python
import optuna

study = optuna.load_study(...)

# Failed trials
failed = [t for t in study.trials
          if t.state == optuna.trial.TrialState.FAIL]
print(f"Failed: {len(failed)}")

for t in failed[:5]:
    print(f"Trial {t.number}: {t.user_attrs}")

# Pruned trials
pruned = [t for t in study.trials
          if t.state == optuna.trial.TrialState.PRUNED]
print(f"Pruned: {len(pruned)}")
```

---

## Recovery Actions

### Reset Study (Start Fresh)

```bash
# Backup first
cp -r studies/my_study/2_results studies/my_study/2_results_backup

# Delete results
rm -rf studies/my_study/2_results/*

# Run fresh
python run_optimization.py
```
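
Repeated resets overwrite or nest a single backup folder. As an alternative, the sketch below copies `2_results` to a timestamped sibling folder with `shutil.copytree`. The function name `backup_results` and the `2_results_backup_<timestamp>` naming are assumptions, not an Atomizer convention.

```python
import shutil
import time
from pathlib import Path


def backup_results(study_dir, stamp=None):
    """Copy 2_results to 2_results_backup_<stamp> and return the new path."""
    study = Path(study_dir)
    source = study / "2_results"
    stamp = stamp or time.strftime("%Y%m%d_%H%M%S")
    target = study / f"2_results_backup_{stamp}"
    # copytree raises FileExistsError if the target exists, so nothing is overwritten.
    shutil.copytree(source, target)
    return target
```

Restoring then means copying the chosen timestamped folder back into `2_results/`.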

### Resume Interrupted Study

```bash
python run_optimization.py --resume
```

### Restore from Backup

```bash
cp -r studies/my_study/2_results_backup/* studies/my_study/2_results/
```

---

## Getting Help

### Information to Provide

When asking for help, include:
1. Error message (full traceback)
2. Config file contents
3. Study structure (`ls -la`)
4. What you tried
5. NX log excerpt (if NX error)

### Log Locations

| Log | Location |
|-----|----------|
| Optimization | Console output or redirect to file |
| NX Solve | `1_setup/model/*.log`, `*.f06` |
| Database | `2_results/study.db` (query with optuna) |
| Intelligence | `2_results/intelligent_optimizer/*.json` |

---

## Cross-References

- **Related**: All operation protocols
- **System**: [SYS_10_IMSO](../system/SYS_10_IMSO.md), [SYS_12_EXTRACTOR_LIBRARY](../system/SYS_12_EXTRACTOR_LIBRARY.md)

---

## Version History

| Version | Date | Changes |
|---------|------|---------|
| 1.0 | 2025-12-05 | Initial release |