refactor: Implement centralized extractor library to eliminate code duplication

MAJOR ARCHITECTURE REFACTOR - Clean Study Folders

Problem Identified by User:
"My study folder is a mess, why? I want some order and real structure to develop an insanely good engineering software that evolve with time."
- Every substudy was generating duplicate extractor code
- Study folders polluted with reusable library code (generated_extractors/, generated_hooks/)
- No code reuse across studies
- Not production-grade architecture

Solution - Centralized Library System:
Implemented smart library with signature-based deduplication:
- Core extractors in optimization_engine/extractors/
- Studies only store metadata (extractors_manifest.json)
- Clean separation: studies = data, core = code

Architecture:

BEFORE (BAD):
  studies/my_study/
    generated_extractors/            ❌ Code pollution!
      extract_displacement.py
      extract_von_mises_stress.py
    generated_hooks/                 ❌ Code pollution!
    llm_workflow_config.json
    results.json

AFTER (GOOD):
  optimization_engine/extractors/    ✓ Core library
    extract_displacement.py
    extract_stress.py
    catalog.json

  studies/my_study/
    extractors_manifest.json         ✓ Just references!
    llm_workflow_config.json         ✓ Config
    optimization_results.json        ✓ Results

New Components:

1. ExtractorLibrary (extractor_library.py)
   - Signature-based deduplication
   - Centralized catalog (catalog.json)
   - Study manifest generation
   - Reusability across all studies

2. Updated ExtractorOrchestrator
   - Uses core library instead of per-study generation
   - Creates manifest instead of copying code
   - Backward compatible (legacy mode available)

3. Updated LLMOptimizationRunner
   - Removed generated_extractors/ directory creation
   - Removed generated_hooks/ directory creation
   - Uses core library exclusively

4. Updated Tests
   - Verifies extractors_manifest.json exists
   - Checks for clean study folder structure
   - All 18/18 checks pass

Results:

Study folders NOW ONLY contain:
✓ extractors_manifest.json - references to core library
✓ llm_workflow_config.json - study configuration
✓ optimization_results.json - optimization results
✓ optimization_history.json - trial history
✓ .db file - Optuna database

Core library contains:
✓ extract_displacement.py - reusable across ALL studies
✓ extract_von_mises_stress.py - reusable across ALL studies
✓ extract_mass.py - reusable across ALL studies
✓ catalog.json - tracks all extractors with signatures

Benefits:
- Clean, professional study folder structure
- Code reuse eliminates duplication
- Library grows over time, studies stay clean
- Production-grade architecture
- "Insanely good engineering software that evolves with time"

Testing:
E2E test passes with clean folder structure
- No generated_extractors/ pollution
- Manifest correctly references library
- Core library populated with reusable extractors
- Study folder professional and minimal

Documentation:
- Added comprehensive architecture doc (docs/ARCHITECTURE_REFACTOR_NOV17.md)
- Includes migration guide
- Documents future work (hooks library, versioning, CLI tools)

Next Steps:
- Apply same architecture to hooks library
- Add auto-generated documentation for library
- Implement versioning for reproducibility

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
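
The commit message describes signature-based deduplication, but extractor_library.py itself is not shown in this part of the diff. The following is a minimal sketch of how such a scheme could work, not the actual ExtractorLibrary. The class name SketchExtractorLibrary, the standalone compute_signature function, the choice of hashed fields ("action" and "params"), and the catalog entry layout are all assumptions made for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def compute_signature(feature: dict) -> str:
    """Hash the fields assumed to define an extractor's behavior.

    Assumption: "action" and "params" identify an extractor. The real
    library may hash different fields.
    """
    canonical = json.dumps(
        {"action": feature["action"], "params": feature.get("params", {})},
        sort_keys=True,  # stable key order so equal features hash equally
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


class SketchExtractorLibrary:
    """Illustrative stand-in for ExtractorLibrary (hypothetical)."""

    def __init__(self, library_dir: Path):
        self.library_dir = Path(library_dir)
        self.library_dir.mkdir(parents=True, exist_ok=True)
        self.catalog_path = self.library_dir / "catalog.json"
        if self.catalog_path.exists():
            self.catalog = json.loads(self.catalog_path.read_text(encoding="utf-8"))
        else:
            self.catalog = {}

    def get_or_create(self, feature: dict, extractor_code: str) -> Path:
        """Return the existing file for this signature, or store the new code."""
        signature = compute_signature(feature)
        entry = self.catalog.get(signature)
        if entry is not None:
            # Already in the library: reuse it and ignore the new code.
            return self.library_dir / entry["filename"]

        filename = f"{feature['action']}.py"
        if (self.library_dir / filename).exists():
            # Same action, different parameters: avoid overwriting.
            filename = f"{feature['action']}_{signature}.py"
        path = self.library_dir / filename
        path.write_text(extractor_code, encoding="utf-8")

        self.catalog[signature] = {"filename": filename, "action": feature["action"]}
        self.catalog_path.write_text(json.dumps(self.catalog, indent=2), encoding="utf-8")
        return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        lib = SketchExtractorLibrary(Path(tmp) / "extractors")
        feature = {"action": "extract_displacement", "params": {"result_type": "displacement"}}
        first = lib.get_or_create(feature, "def extract_displacement(op2_file):\n    return {}\n")
        second = lib.get_or_create(feature, "def extract_displacement(op2_file):\n    return None\n")
        print(first == second, len(lib.catalog))
```

In this sketch the second call with an identical feature returns the path of the first file and leaves the catalog at one entry, which is the behavior the message attributes to the core library.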
2025-11-18 09:00:10 -05:00
parent 2eb73c5d25
commit 0e73226a59
5 changed files with 577 additions and 42 deletions
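
The diff below calls create_study_manifest(self.extractor_signatures, self.extractors_dir), whose body is not part of this excerpt. As a rough sketch of the "studies store references, not code" idea, a manifest writer and a resolver might look like the following. The function names write_study_manifest and resolve_manifest, the manifest fields, and the catalog layout are hypothetical. The sketch also removes repeated signatures, on the assumption that the orchestrator may record the same signature more than once.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict, List


def write_study_manifest(signatures: List[str], study_dir: Path) -> Path:
    """Write extractors_manifest.json holding signature references only."""
    study_dir = Path(study_dir)
    study_dir.mkdir(parents=True, exist_ok=True)
    manifest = {
        "created_at": datetime.now(timezone.utc).isoformat(),
        # dict.fromkeys drops repeats while keeping first-seen order
        "extractors": list(dict.fromkeys(signatures)),
    }
    manifest_path = study_dir / "extractors_manifest.json"
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest_path


def resolve_manifest(manifest_path: Path,
                     catalog: Dict[str, dict],
                     library_dir: Path) -> List[Path]:
    """Map each signature in the manifest to a file in the core library.

    Assumes catalog entries look like {"filename": "..."} keyed by signature.
    """
    manifest = json.loads(Path(manifest_path).read_text(encoding="utf-8"))
    missing = [sig for sig in manifest["extractors"] if sig not in catalog]
    if missing:
        raise KeyError(f"Signatures not found in catalog: {missing}")
    return [Path(library_dir) / catalog[sig]["filename"]
            for sig in manifest["extractors"]]
```

Failing loudly on an unknown signature is a design choice of this sketch: a study whose manifest points at an extractor that has left the library cannot be reproduced, so surfacing that early seems preferable to silently skipping it.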
--- a/optimization_engine/extractor_orchestrator.py
+++ b/optimization_engine/extractor_orchestrator.py
@@ -22,6 +22,7 @@ import logging
 from dataclasses import dataclass

 from optimization_engine.pynastran_research_agent import PyNastranResearchAgent, ExtractionPattern
+from optimization_engine.extractor_library import ExtractorLibrary, create_study_manifest

 logger = logging.getLogger(__name__)

@@ -46,14 +47,18 @@ class ExtractorOrchestrator:

    def __init__(self,
                 extractors_dir: Optional[Path] = None,
-                 knowledge_base_path: Optional[Path] = None):
+                 knowledge_base_path: Optional[Path] = None,
+                 use_core_library: bool = True):
        """
        Initialize the orchestrator.

        Args:
-            extractors_dir: Directory to save generated extractors
+            extractors_dir: Directory for the study manifest; extractor code is written here only in legacy mode
            knowledge_base_path: Path to pyNastran pattern knowledge base
+            use_core_library: Use centralized library (True) or per-study generation (False, legacy)
        """
+        self.use_core_library = use_core_library
+
        if extractors_dir is None:
            extractors_dir = Path(__file__).parent / "result_extractors" / "generated"

@@ -63,10 +68,19 @@ class ExtractorOrchestrator:
        # Initialize Phase 3 research agent
        self.research_agent = PyNastranResearchAgent(knowledge_base_path)

+        # Initialize centralized library (NEW ARCHITECTURE)
+        if use_core_library:
+            self.library = ExtractorLibrary()
+            logger.info(f"Using centralized extractor library: {self.library.library_dir}")
+        else:
+            self.library = None
+            logger.warning("Using legacy per-study extractor generation (not recommended)")
+
        # Registry of generated extractors for this session
        self.extractors: Dict[str, GeneratedExtractor] = {}
+        self.extractor_signatures: List[str] = []  # Track which library extractors were used

-        logger.info(f"ExtractorOrchestrator initialized with extractors_dir: {self.extractors_dir}")
+        logger.info("ExtractorOrchestrator initialized")

    def process_llm_workflow(self, llm_output: Dict[str, Any]) -> List[GeneratedExtractor]:
        """
@@ -114,6 +128,11 @@ class ExtractorOrchestrator:
                    logger.error(f"Failed to generate extractor for {feature.get('action')}: {e}")
                    # Continue with other features

+        # NEW ARCHITECTURE: Create study manifest (not copy code)
+        if self.use_core_library and self.library and self.extractor_signatures:
+            create_study_manifest(self.extractor_signatures, self.extractors_dir)
+            logger.info("Study manifest created - extractors referenced from core library")
+
        logger.info(f"Generated {len(generated_extractors)} extractors")
        return generated_extractors

@@ -147,14 +166,24 @@ class ExtractorOrchestrator:
        logger.info(f"Generating extractor code using pattern: {pattern.name}")
        extractor_code = self.research_agent.generate_extractor_code(research_request)

-        # Create filename from action
-        filename = self._action_to_filename(action)
-        file_path = self.extractors_dir / filename
+        # NEW ARCHITECTURE: Use centralized library
+        if self.use_core_library and self.library:
+            # Add to/retrieve from core library (deduplication happens here)
+            file_path = self.library.get_or_create(feature, extractor_code)

-        # Save extractor to file
-        logger.info(f"Saving extractor to: {file_path}")
-        with open(file_path, 'w') as f:
-            f.write(extractor_code)
+            # Track signature for study manifest
+            signature = self.library._compute_signature(feature)
+            self.extractor_signatures.append(signature)
+
+            logger.info(f"Extractor available in core library: {file_path}")
+        else:
+            # LEGACY: Save to per-study directory
+            filename = self._action_to_filename(action)
+            file_path = self.extractors_dir / filename
+
+            logger.info(f"Saving extractor to study directory (legacy): {file_path}")
+            with open(file_path, 'w') as f:
+                f.write(extractor_code)

        # Extract function name from generated code
        function_name = self._extract_function_name(extractor_code)