refactor: Implement centralized extractor library to eliminate code duplication

MAJOR ARCHITECTURE REFACTOR - Clean Study Folders

Problem Identified by User:
"My study folder is a mess, why? I want some order and real structure to develop an insanely good engineering software that evolve with time."

- Every substudy was generating duplicate extractor code
- Study folders polluted with reusable library code (generated_extractors/, generated_hooks/)
- No code reuse across studies
- Not production-grade architecture

Solution - Centralized Library System:
Implemented smart library with signature-based deduplication:
- Core extractors in optimization_engine/extractors/
- Studies only store metadata (extractors_manifest.json)
- Clean separation: studies = data, core = code

Architecture:

BEFORE (BAD):
  studies/my_study/
    generated_extractors/        ❌ Code pollution!
      extract_displacement.py
      extract_von_mises_stress.py
    generated_hooks/             ❌ Code pollution!
    llm_workflow_config.json
    results.json

AFTER (GOOD):
  optimization_engine/extractors/   ✓ Core library
    extract_displacement.py
    extract_stress.py
    catalog.json

  studies/my_study/
    extractors_manifest.json     ✓ Just references!
    llm_workflow_config.json     ✓ Config
    optimization_results.json    ✓ Results

New Components:

1. ExtractorLibrary (extractor_library.py)
   - Signature-based deduplication
   - Centralized catalog (catalog.json)
   - Study manifest generation
   - Reusability across all studies

2. Updated ExtractorOrchestrator
   - Uses core library instead of per-study generation
   - Creates manifest instead of copying code
   - Backward compatible (legacy mode available)

3. Updated LLMOptimizationRunner
   - Removed generated_extractors/ directory creation
   - Removed generated_hooks/ directory creation
   - Uses core library exclusively

4. Updated Tests
   - Verifies extractors_manifest.json exists
   - Checks for clean study folder structure
   - All 18/18 checks pass

Results:

Study folders NOW ONLY contain:
✓ extractors_manifest.json - references to core library
✓ llm_workflow_config.json - study configuration
✓ optimization_results.json - optimization results
✓ optimization_history.json - trial history
✓ .db file - Optuna database

Core library contains:
✓ extract_displacement.py - reusable across ALL studies
✓ extract_von_mises_stress.py - reusable across ALL studies
✓ extract_mass.py - reusable across ALL studies
✓ catalog.json - tracks all extractors with signatures

Benefits:
- Clean, professional study folder structure
- Code reuse eliminates duplication
- Library grows over time, studies stay clean
- Production-grade architecture
- "Insanely good engineering software that evolves with time"

Testing:
E2E test passes with clean folder structure
- No generated_extractors/ pollution
- Manifest correctly references library
- Core library populated with reusable extractors
- Study folder professional and minimal

Documentation:
- Added comprehensive architecture doc (docs/ARCHITECTURE_REFACTOR_NOV17.md)
- Includes migration guide
- Documents future work (hooks library, versioning, CLI tools)

Next Steps:
- Apply same architecture to hooks library
- Add auto-generated documentation for library
- Implement versioning for reproducibility

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-18 09:00:10 -05:00
parent 2eb73c5d25
commit 0e73226a59
5 changed files with 577 additions and 42 deletions
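
For orientation before the diff, here is a minimal sketch of what the signature-based deduplication named in the commit message could look like. It is an illustration under assumptions, not the code in `extractor_library.py`: the class name `SketchLibrary`, the use of SHA-256 over a sorted-key JSON dump of the feature, the `name` argument, and the single `filename` field in each catalog entry are all hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


class SketchLibrary:
    """Hypothetical stand-in for ExtractorLibrary; not the committed code."""

    def __init__(self, library_dir):
        self.library_dir = Path(library_dir)
        self.library_dir.mkdir(parents=True, exist_ok=True)
        self.catalog_path = self.library_dir / "catalog.json"
        # Load an existing catalog so deduplication survives across runs.
        if self.catalog_path.exists():
            self.catalog = json.loads(self.catalog_path.read_text(encoding="utf-8"))
        else:
            self.catalog = {}

    @staticmethod
    def compute_signature(feature):
        # Assumption: a feature is a JSON-serializable dict. Sorting keys
        # makes the hash independent of key order.
        canonical = json.dumps(feature, sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

    def get_or_create(self, feature, extractor_code, name):
        signature = self.compute_signature(feature)
        if signature in self.catalog:
            # Known signature: reuse the file already in the library.
            return self.library_dir / self.catalog[signature]["filename"]
        # New signature: write the code once and record it in the catalog.
        extractor_file = self.library_dir / f"{name}.py"
        extractor_file.write_text(extractor_code, encoding="utf-8")
        self.catalog[signature] = {"filename": extractor_file.name}
        self.catalog_path.write_text(json.dumps(self.catalog, indent=2), encoding="utf-8")
        return extractor_file


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        lib = SketchLibrary(Path(tmp) / "extractors")
        feature = {"action": "extract_displacement", "params": {"result_type": "displacement"}}
        first = lib.get_or_create(feature, "def extract(): ...\n", "extract_displacement")
        # The same feature with keys in a different order maps to the same file.
        reordered = {"params": {"result_type": "displacement"}, "action": "extract_displacement"}
        second = lib.get_or_create(reordered, "def extract(): pass\n", "extract_displacement_copy")
        py_files = sorted(p.name for p in lib.library_dir.glob("*.py"))
        return first == second, py_files
```

In this sketch the second request never writes a file, because its signature is already in the catalog. A real implementation would also need a policy for two different signatures that ask for the same filename, which this sketch does not handle.
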
--- a/docs/ARCHITECTURE_REFACTOR_NOV17.md
+++ b/docs/ARCHITECTURE_REFACTOR_NOV17.md
@@ -0,0 +1,284 @@
+# Architecture Refactor: Centralized Library System
+**Date**: November 17, 2025
+**Phase**: 3.2 Architecture Cleanup
+**Author**: Claude Code (with Antoine's direction)
+
+## Problem Statement
+
+You identified a critical architectural flaw:
+
+> "ok, now, quick thing, why do very basic hooks get recreated and stored in the substudies? those should be just core accessed hooked right? is it only because its a test?
+>
+> What I need in studies is the config, files, setup, report, results etc not core hooks, those should go in atomizer hooks library with their doc etc no? I mean, applied only info = studies, and reusdable and core functions = atomizer foundation.
+>
+> My study folder is a mess, why? I want some order and real structure to develop an insanely good engineering software that evolve with time."
+
+### Old Architecture (BAD):
+```
+studies/
+  simple_beam_optimization/
+    2_substudies/
+      test_e2e_3trials_XXX/
+        generated_extractors/       ❌ Code pollution!
+          extract_displacement.py
+          extract_von_mises_stress.py
+          extract_mass.py
+        generated_hooks/             ❌ Code pollution!
+          custom_hook.py
+        llm_workflow_config.json
+        optimization_results.json
+```
+
+**Problems**:
+- Every substudy duplicates extractor code
+- Study folders polluted with reusable code
+- No code reuse across studies
+- Overall: a cluttered layout that is not production-grade engineering software
+
+### New Architecture (GOOD):
+```
+optimization_engine/
+  extractors/                ✓ Core reusable library
+    extract_displacement.py
+    extract_von_mises_stress.py
+    extract_mass.py
+    catalog.json             ✓ Tracks all extractors
+
+  hooks/                     ✓ Core reusable library
+    (future implementation)
+
+studies/
+  simple_beam_optimization/
+    2_substudies/
+      my_optimization/
+        extractors_manifest.json  ✓ Just references!
+        llm_workflow_config.json  ✓ Study config
+        optimization_results.json ✓ Results
+        optimization_history.json ✓ History
+```
+
+**Benefits**:
+- ✅ Clean study folders (only metadata)
+- ✅ Reusable core libraries
+- ✅ Deduplication (same extractor = single file)
+- ✅ Production-grade architecture
+- ✅ Evolves with time (library grows, studies stay clean)
+
+## Implementation
+
+### 1. Extractor Library Manager (`extractor_library.py`)
+
+The new library manager provides:
+- **Signature-based deduplication**: two extractors with the same signature (same functionality) are stored as a single file
+- **Catalog tracking**: `catalog.json` records every extractor in the library
+- **Study manifests**: each study only references the extractors it used
+
+```python
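+# Abridged excerpt: the file write and the catalog entry fields are elided.
+class ExtractorLibrary:
+    def get_or_create(self, llm_feature, extractor_code):
+        """Return the library path for this feature, adding the extractor if it is new."""
+        signature = self._compute_signature(llm_feature)
+
+        if signature in self.catalog:
+            # Same signature already catalogued: reuse the existing file.
+            return self.library_dir / self.catalog[signature]['filename']
+
+        # New signature: add the extractor to the library and record it.
+        # `extractor_file` is the path of the newly written library file.
+        self.catalog[signature] = {...}
+        return extractor_file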
+```
+
+### 2. Updated Components
+
+**ExtractorOrchestrator** (`extractor_orchestrator.py`):
+- Now uses `ExtractorLibrary` instead of per-study generation
+- Creates `extractors_manifest.json` instead of copying code
+- Backward compatible (legacy mode available)
+
+**LLMOptimizationRunner** (`llm_optimization_runner.py`):
+- Removed per-study `generated_extractors/` directory creation
+- Removed per-study `generated_hooks/` directory creation
+- Uses core library exclusively
+
+**Test Suite** (`test_phase_3_2_e2e.py`):
+- Updated to check for `extractors_manifest.json` instead of `generated_extractors/`
+- Verifies clean study folder structure
+
+## Results
+
+### Before Refactor:
+```
+test_e2e_3trials_XXX/
+├── generated_extractors/          ❌ 3 Python files
+│   ├── extract_displacement.py
+│   ├── extract_von_mises_stress.py
+│   └── extract_mass.py
+├── generated_hooks/                ❌ Hook files
+├── llm_workflow_config.json
+└── optimization_results.json
+```
+
+### After Refactor:
+```
+test_e2e_3trials_XXX/
+├── extractors_manifest.json       ✅ Just references!
+├── llm_workflow_config.json        ✅ Study config
+├── optimization_results.json       ✅ Results
+└── optimization_history.json       ✅ History
+
+optimization_engine/extractors/     ✅ Core library
+├── extract_displacement.py
+├── extract_von_mises_stress.py
+├── extract_mass.py
+└── catalog.json
+```
+
+## Testing
+
+E2E test now passes with clean folder structure:
+- ✅ `extractors_manifest.json` created
+- ✅ Core library populated with 3 extractors
+- ✅ NO `generated_extractors/` pollution
+- ✅ Study folder clean and professional
+
+Test output:
+```
+Verifying outputs...
+  [OK] Output directory created
+  [OK] History file created
+  [OK] Results file created
+  [OK] Extractors manifest (references core library)
+
+Checks passed: 18/18
+[SUCCESS] END-TO-END TEST PASSED!
+```
+
+## Migration Guide
+
+### For Future Studies:
+
+**What changed**:
+- Extractors are now in `optimization_engine/extractors/` (core library)
+- Study folders only contain `extractors_manifest.json` (not code)
+
+**No action required**:
+- System automatically uses new architecture
+- Backward compatible (legacy mode available with `use_core_library=False`)
+
+### For Developers:
+
+**To add new extractors**:
+1. LLM generates extractor code
+2. `ExtractorLibrary.get_or_create()` checks whether an extractor with the same signature already exists
+3. If it is new: the extractor is added to `optimization_engine/extractors/`
+4. If it exists: the existing file is reused
+5. The study gets a manifest reference, not a copy of the code
+
+**To view library**:
+```python
+from optimization_engine.extractor_library import ExtractorLibrary
+
+library = ExtractorLibrary()
+print(library.get_library_summary())
+```
+
+## Next Steps (Future Work)
+
+1. **Hook Library System**: Implement same architecture for hooks
+   - Currently: Hooks still use legacy per-study generation
+   - Future: `optimization_engine/hooks/` library like extractors
+
+2. **Library Documentation**: Auto-generate docs for each extractor
+   - Extract docstrings from library extractors
+   - Create browsable documentation
+
+3. **Versioning**: Track extractor versions for reproducibility
+   - Tag extractors with creation date/version
+   - Allow studies to pin specific versions
+
+4. **CLI Tool**: View and manage library
+   - `python -m optimization_engine.extractors list`
+   - `python -m optimization_engine.extractors info <signature>`
+
+## Files Modified
+
+1. **New Files**:
+   - `optimization_engine/extractor_library.py` - Core library manager
+   - `optimization_engine/extractors/__init__.py` - Package init
+   - `optimization_engine/extractors/catalog.json` - Library catalog
+   - `docs/ARCHITECTURE_REFACTOR_NOV17.md` - This document
+
+2. **Modified Files**:
+   - `optimization_engine/extractor_orchestrator.py` - Use library instead of per-study
+   - `optimization_engine/llm_optimization_runner.py` - Remove per-study directories
+   - `tests/test_phase_3_2_e2e.py` - Check for manifest instead of directories
+
+## Commit Message
+
+```
+refactor: Implement centralized extractor library to eliminate code duplication
+
+MAJOR ARCHITECTURE REFACTOR - Clean Study Folders
+
+Problem:
+- Every substudy was generating duplicate extractor code
+- Study folders polluted with reusable library code
+- No code reuse across studies
+- Not production-grade architecture
+
+Solution:
+Implemented centralized library system:
+- Core extractors in optimization_engine/extractors/
+- Signature-based deduplication
+- Studies only store metadata (extractors_manifest.json)
+- Clean separation: studies = data, core = code
+
+Changes:
+1. Created ExtractorLibrary with smart deduplication
+2. Updated ExtractorOrchestrator to use core library
+3. Updated LLMOptimizationRunner to stop creating per-study directories
+4. Updated tests to verify clean study folder structure
+
+Results:
+BEFORE: study folder with generated_extractors/ directory (code pollution)
+AFTER: study folder with extractors_manifest.json (just references)
+
+Core library: optimization_engine/extractors/
+- extract_displacement.py
+- extract_von_mises_stress.py
+- extract_mass.py
+- catalog.json (tracks all extractors)
+
+Study folders NOW ONLY contain:
+- extractors_manifest.json (references to core library)
+- llm_workflow_config.json (study configuration)
+- optimization_results.json (results)
+- optimization_history.json (trial history)
+
+Production-grade architecture for "insanely good engineering software that evolves with time"
+
+🤖 Generated with [Claude Code](https://claude.com/claude-code)
+
+Co-Authored-By: Claude <noreply@anthropic.com>
+```
+
+## Summary for Morning
+
+**What was done**:
+1. ✅ Created centralized extractor library system
+2. ✅ Eliminated per-study code duplication
+3. ✅ Clean study folder architecture
+4. ✅ E2E tests pass with new structure
+5. ✅ Comprehensive documentation
+
+**What you'll see**:
+- Studies now only contain metadata (no code!)
+- Core library in `optimization_engine/extractors/`
+- Professional, production-grade architecture
+
+**Ready for**:
+- Continuing Phase 3.2 development
+- Applying the same approach to the hooks library (next iteration)
+- Building "insanely good engineering software"
+
+Have a good night! ✨
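
To make the "studies = data, core = code" split in the document above concrete, here is a minimal sketch of how a study could record and later resolve its extractors. The manifest schema (a `library_dir` string plus an `extractors` list of `signature` and `filename` pairs) and the function names `write_manifest` and `resolve_manifest` are assumptions for illustration; the real format of `extractors_manifest.json` is not shown in this document.

```python
import json
import tempfile
from pathlib import Path


def write_manifest(study_dir, library_dir, entries):
    """Write a manifest that references library files instead of copying them.

    `entries` is a list of (signature, filename) pairs (hypothetical schema).
    """
    study_dir = Path(study_dir)
    study_dir.mkdir(parents=True, exist_ok=True)
    manifest = {
        "library_dir": str(library_dir),
        "extractors": [{"signature": s, "filename": f} for s, f in entries],
    }
    manifest_path = study_dir / "extractors_manifest.json"
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest_path


def resolve_manifest(manifest_path):
    """Return the library paths a study refers to, failing loudly on missing files."""
    manifest = json.loads(Path(manifest_path).read_text(encoding="utf-8"))
    library_dir = Path(manifest["library_dir"])
    paths = [library_dir / entry["filename"] for entry in manifest["extractors"]]
    missing = [p.name for p in paths if not p.exists()]
    if missing:
        raise FileNotFoundError(f"Extractors missing from library: {missing}")
    return paths


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        # Core library holds the code.
        library = Path(tmp) / "optimization_engine" / "extractors"
        library.mkdir(parents=True)
        (library / "extract_mass.py").write_text("def extract(): ...\n", encoding="utf-8")
        # The study folder holds only the manifest.
        study = Path(tmp) / "studies" / "my_study"
        manifest_path = write_manifest(study, library, [("abc123", "extract_mass.py")])
        resolved = [p.name for p in resolve_manifest(manifest_path)]
        study_files = sorted(p.name for p in study.iterdir())
        return resolved, study_files
```

The study folder in this sketch ends up with the manifest and nothing else, which is the property the updated E2E test is described as checking. Resolving through the manifest also surfaces a missing library file as an explicit error instead of a silent fallback.
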