Atomizer/docs/ARCHITECTURE_REFACTOR_NOV17.md
Anto01 0e73226a59 refactor: Implement centralized extractor library to eliminate code duplication
MAJOR ARCHITECTURE REFACTOR - Clean Study Folders

Problem Identified by User:
"My study folder is a mess, why? I want some order and real structure to develop
an insanely good engineering software that evolve with time."

- Every substudy was generating duplicate extractor code
- Study folders polluted with reusable library code (generated_extractors/, generated_hooks/)
- No code reuse across studies
- Not production-grade architecture

Solution - Centralized Library System:
Implemented smart library with signature-based deduplication:
- Core extractors in optimization_engine/extractors/
- Studies only store metadata (extractors_manifest.json)
- Clean separation: studies = data, core = code

Architecture:

BEFORE (BAD):
  studies/my_study/
    generated_extractors/            Code pollution!
      extract_displacement.py
      extract_von_mises_stress.py
    generated_hooks/                 Code pollution!
    llm_workflow_config.json
    optimization_results.json

AFTER (GOOD):
  optimization_engine/extractors/   ✓ Core library
    extract_displacement.py
    extract_von_mises_stress.py
    catalog.json

  studies/my_study/
    extractors_manifest.json        ✓ Just references!
    llm_workflow_config.json        ✓ Config
    optimization_results.json       ✓ Results

New Components:

1. ExtractorLibrary (extractor_library.py)
   - Signature-based deduplication
   - Centralized catalog (catalog.json)
   - Study manifest generation
   - Reusability across all studies

2. Updated ExtractorOrchestrator
   - Uses core library instead of per-study generation
   - Creates manifest instead of copying code
   - Backward compatible (legacy mode available)

3. Updated LLMOptimizationRunner
   - Removed generated_extractors/ directory creation
   - Removed generated_hooks/ directory creation
   - Uses core library exclusively

4. Updated Tests
   - Verifies extractors_manifest.json exists
   - Checks for clean study folder structure
   - All 18/18 checks pass

Results:

Study folders NOW ONLY contain:
✓ extractors_manifest.json - references to core library
✓ llm_workflow_config.json - study configuration
✓ optimization_results.json - optimization results
✓ optimization_history.json - trial history
✓ .db file - Optuna database

Core library contains:
✓ extract_displacement.py - reusable across ALL studies
✓ extract_von_mises_stress.py - reusable across ALL studies
✓ extract_mass.py - reusable across ALL studies
✓ catalog.json - tracks all extractors with signatures

Benefits:
- Clean, professional study folder structure
- Code reuse eliminates duplication
- Library grows over time, studies stay clean
- Production-grade architecture
- "Insanely good engineering software that evolves with time"

Testing:
E2E test passes with clean folder structure
- No generated_extractors/ pollution
- Manifest correctly references library
- Core library populated with reusable extractors
- Study folder professional and minimal

Documentation:
- Added comprehensive architecture doc (docs/ARCHITECTURE_REFACTOR_NOV17.md)
- Includes migration guide
- Documents future work (hooks library, versioning, CLI tools)

Next Steps:
- Apply same architecture to hooks library
- Add auto-generated documentation for library
- Implement versioning for reproducibility

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-18 09:00:10 -05:00

Architecture Refactor: Centralized Library System

Date: November 17, 2025
Phase: 3.2 Architecture Cleanup
Author: Claude Code (with Antoine's direction)

Problem Statement

You identified a critical architectural flaw:

"ok, now, quick thing, why do very basic hooks get recreated and stored in the substudies? those should be just core accessed hooked right? is it only because its a test?

What I need in studies is the config, files, setup, report, results etc not core hooks, those should go in atomizer hooks library with their doc etc no? I mean, applied only info = studies, and reusable and core functions = atomizer foundation.

My study folder is a mess, why? I want some order and real structure to develop an insanely good engineering software that evolve with time."

Old Architecture (BAD):

studies/
  simple_beam_optimization/
    2_substudies/
      test_e2e_3trials_XXX/
        generated_extractors/       ❌ Code pollution!
          extract_displacement.py
          extract_von_mises_stress.py
          extract_mass.py
        generated_hooks/             ❌ Code pollution!
          custom_hook.py
        llm_workflow_config.json
        optimization_results.json

Problems:

  • Every substudy duplicates extractor code
  • Study folders polluted with reusable code
  • No code reuse across studies
  • Messy layout that is not production-grade engineering software

New Architecture (GOOD):

optimization_engine/
  extractors/                ✓ Core reusable library
    extract_displacement.py
    extract_von_mises_stress.py
    extract_mass.py
    catalog.json             ✓ Tracks all extractors

  hooks/                     ✓ Core reusable library
    (future implementation)

studies/
  simple_beam_optimization/
    2_substudies/
      my_optimization/
        extractors_manifest.json  ✓ Just references!
        llm_workflow_config.json  ✓ Study config
        optimization_results.json ✓ Results
        optimization_history.json ✓ History

Benefits:

  • Clean study folders (config, results, and references only; no code)
  • Reusable core libraries
  • Deduplication (same extractor = single file)
  • Production-grade architecture
  • Evolves with time (library grows, studies stay clean)

Implementation

1. Extractor Library Manager (extractor_library.py)

A new library manager provides:

  • Signature-based deduplication: two extractors with the same signature (computed from the LLM feature request) are stored as one file
  • Catalog tracking: catalog.json records every extractor in the library
  • Study manifests: each study records only which extractors it used

class ExtractorLibrary:
    def get_or_create(self, llm_feature, extractor_code):
        """Add to library or reuse existing."""
        signature = self._compute_signature(llm_feature)

        if signature in self.catalog:
            # Reuse existing!
            return self.library_dir / self.catalog[signature]['filename']
        else:
            # Add new to library
            self.catalog[signature] = {...}
            return extractor_file
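
The excerpt above elides the catalog entry and the file write. As a rough, self-contained sketch of the same get-or-create pattern, something like the following could work. The class name SketchExtractorLibrary, the shape of the feature dict (an `action` key), the SHA-256 signature, the filename scheme, and the catalog entry contents are all assumptions made for illustration; they are not taken from extractor_library.py.

```python
import hashlib
import json
from pathlib import Path


class SketchExtractorLibrary:
    """Illustrative get-or-create library keyed by a feature signature."""

    def __init__(self, library_dir):
        self.library_dir = Path(library_dir)
        self.library_dir.mkdir(parents=True, exist_ok=True)
        self.catalog_path = self.library_dir / "catalog.json"
        # Load the existing catalog if present, otherwise start empty.
        if self.catalog_path.exists():
            self.catalog = json.loads(self.catalog_path.read_text())
        else:
            self.catalog = {}

    def _compute_signature(self, llm_feature):
        # Assumption: the feature is a JSON-serializable dict. Hashing its
        # sorted-key JSON form makes the signature independent of key order.
        canonical = json.dumps(llm_feature, sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

    def get_or_create(self, llm_feature, extractor_code):
        """Return the library file for this feature, writing it only if new."""
        signature = self._compute_signature(llm_feature)
        if signature in self.catalog:
            # Known signature: reuse the stored file, ignore the new code.
            return self.library_dir / self.catalog[signature]["filename"]

        # New signature: pick a filename (hypothetical naming scheme).
        filename = f"extract_{llm_feature['action']}.py"
        if (self.library_dir / filename).exists():
            # Different feature, same action name: disambiguate by signature.
            filename = f"extract_{llm_feature['action']}_{signature}.py"
        extractor_file = self.library_dir / filename
        extractor_file.write_text(extractor_code)

        # Record the new entry and persist the catalog.
        self.catalog[signature] = {"filename": filename}
        self.catalog_path.write_text(json.dumps(self.catalog, indent=2))
        return extractor_file
```

In this sketch, calling get_or_create twice with the same feature dict returns the same path and leaves the first file untouched, which is the deduplication behavior described above.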

2. Updated Components

ExtractorOrchestrator (extractor_orchestrator.py):

  • Now uses ExtractorLibrary instead of per-study generation
  • Creates extractors_manifest.json instead of copying code
  • Backward compatible (legacy mode available)
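
To make "manifest instead of copied code" concrete, here is a minimal sketch of writing such a manifest. The function name write_manifest and the manifest schema (an `extractors` list with `name` and `library_path` fields) are hypothetical; the real extractors_manifest.json format is not shown in this document.

```python
import json
from pathlib import Path


def write_manifest(study_dir, extractor_files):
    """Write extractors_manifest.json listing library files by reference."""
    study_dir = Path(study_dir)
    study_dir.mkdir(parents=True, exist_ok=True)
    manifest = {
        # Assumed schema: one entry per extractor, storing only its location
        # in the core library, never a copy of the code itself.
        "extractors": [
            {"name": Path(f).stem, "library_path": Path(f).as_posix()}
            for f in extractor_files
        ]
    }
    manifest_path = study_dir / "extractors_manifest.json"
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return manifest_path
```

The point of the design is that the study folder holds a few lines of JSON per extractor, while the code itself lives once in optimization_engine/extractors/.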

LLMOptimizationRunner (llm_optimization_runner.py):

  • Removed per-study generated_extractors/ directory creation
  • Removed per-study generated_hooks/ directory creation
  • Uses core library exclusively

Test Suite (test_phase_3_2_e2e.py):

  • Updated to check for extractors_manifest.json instead of generated_extractors/
  • Verifies clean study folder structure
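
A clean-folder check of this kind can be sketched as below. This is not the code in test_phase_3_2_e2e.py; the function name check_clean_study_folder, the LEGACY_DIRS constant, and the extra rule that no .py file may live in a study folder are assumptions based on the folder layout described in this document.

```python
from pathlib import Path

# Assumed set of legacy directories that must not appear in a study folder.
LEGACY_DIRS = ("generated_extractors", "generated_hooks")


def check_clean_study_folder(study_dir):
    """Return a list of problems; an empty list means the folder is clean."""
    study_dir = Path(study_dir)
    problems = []
    # The manifest must exist, since it replaces the copied extractor code.
    if not (study_dir / "extractors_manifest.json").is_file():
        problems.append("missing extractors_manifest.json")
    # Legacy per-study code directories must be gone.
    for name in LEGACY_DIRS:
        if (study_dir / name).exists():
            problems.append(f"legacy directory present: {name}/")
    # Assumed rule: no Python source lives anywhere under the study folder.
    for py_file in sorted(study_dir.rglob("*.py")):
        rel = py_file.relative_to(study_dir).as_posix()
        problems.append(f"code file in study folder: {rel}")
    return problems
```

A test could then simply assert that check_clean_study_folder(study_dir) returns an empty list after a run.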

Results

Before Refactor:

test_e2e_3trials_XXX/
├── generated_extractors/          ❌ 3 Python files
│   ├── extract_displacement.py
│   ├── extract_von_mises_stress.py
│   └── extract_mass.py
├── generated_hooks/                ❌ Hook files
├── llm_workflow_config.json
└── optimization_results.json

After Refactor:

test_e2e_3trials_XXX/
├── extractors_manifest.json       ✅ Just references!
├── llm_workflow_config.json        ✅ Study config
├── optimization_results.json       ✅ Results
└── optimization_history.json       ✅ History

optimization_engine/extractors/     ✅ Core library
├── extract_displacement.py
├── extract_von_mises_stress.py
├── extract_mass.py
└── catalog.json

Testing

E2E test now passes with clean folder structure:

  • extractors_manifest.json created
  • Core library populated with 3 extractors
  • NO generated_extractors/ pollution
  • Study folder clean and professional

Test output:

Verifying outputs...
  [OK] Output directory created
  [OK] History file created
  [OK] Results file created
  [OK] Extractors manifest (references core library)

Checks passed: 18/18
[SUCCESS] END-TO-END TEST PASSED!

Migration Guide

For Future Studies:

What changed:

  • Extractors are now in optimization_engine/extractors/ (core library)
  • Study folders only contain extractors_manifest.json (not code)

No action required:

  • System automatically uses new architecture
  • Backward compatible (legacy mode available with use_core_library=False)

For Developers:

To add new extractors:

  1. LLM generates extractor code
  2. ExtractorLibrary.get_or_create() checks if already exists
  3. If new: adds to optimization_engine/extractors/
  4. If exists: reuses existing file
  5. Study gets manifest reference, not copy of code

To view library:

from optimization_engine.extractor_library import ExtractorLibrary

library = ExtractorLibrary()
print(library.get_library_summary())

Next Steps (Future Work)

  1. Hook Library System: Implement same architecture for hooks

    • Currently: Hooks still use legacy per-study generation
    • Future: optimization_engine/hooks/ library like extractors
  2. Library Documentation: Auto-generate docs for each extractor

    • Extract docstrings from library extractors
    • Create browsable documentation
  3. Versioning: Track extractor versions for reproducibility

    • Tag extractors with creation date/version
    • Allow studies to pin specific versions
  4. CLI Tool: View and manage library

    • python -m optimization_engine.extractors list
    • python -m optimization_engine.extractors info <signature>
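
Since the CLI tool is future work, the following is only a sketch of how the two proposed commands might be wired up with argparse. It assumes catalog.json maps each signature to an entry with a `filename` key (as the get_or_create excerpt suggests); the `--catalog` option and the function names build_parser and run are hypothetical.

```python
import argparse
import json
from pathlib import Path


def build_parser():
    parser = argparse.ArgumentParser(prog="optimization_engine.extractors")
    # Hypothetical option: lets the catalog location be overridden.
    parser.add_argument(
        "--catalog", default="optimization_engine/extractors/catalog.json"
    )
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("list", help="list all cataloged extractors")
    info = sub.add_parser("info", help="show one catalog entry")
    info.add_argument("signature")
    return parser


def run(argv):
    """Run the CLI on argv and return the lines it would print."""
    args = build_parser().parse_args(argv)
    catalog = json.loads(Path(args.catalog).read_text())
    if args.command == "list":
        # One line per extractor: signature followed by its filename.
        return [f"{sig}  {entry['filename']}" for sig, entry in sorted(catalog.items())]
    entry = catalog.get(args.signature)
    if entry is None:
        return [f"unknown signature: {args.signature}"]
    return [json.dumps(entry, indent=2)]
```

To support the `python -m optimization_engine.extractors` form, a `__main__.py` in that package would need to print the lines returned by `run(sys.argv[1:])`.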

Files Modified

  1. New Files:

    • optimization_engine/extractor_library.py - Core library manager
    • optimization_engine/extractors/__init__.py - Package init
    • optimization_engine/extractors/catalog.json - Library catalog
    • docs/ARCHITECTURE_REFACTOR_NOV17.md - This document
  2. Modified Files:

    • optimization_engine/extractor_orchestrator.py - Use library instead of per-study
    • optimization_engine/llm_optimization_runner.py - Remove per-study directories
    • tests/test_phase_3_2_e2e.py - Check for manifest instead of directories

Commit Message

refactor: Implement centralized extractor library to eliminate code duplication

MAJOR ARCHITECTURE REFACTOR - Clean Study Folders

Problem:
- Every substudy was generating duplicate extractor code
- Study folders polluted with reusable library code
- No code reuse across studies
- Not production-grade architecture

Solution:
Implemented centralized library system:
- Core extractors in optimization_engine/extractors/
- Signature-based deduplication
- Studies only store metadata (extractors_manifest.json)
- Clean separation: studies = data, core = code

Changes:
1. Created ExtractorLibrary with smart deduplication
2. Updated ExtractorOrchestrator to use core library
3. Updated LLMOptimizationRunner to stop creating per-study directories
4. Updated tests to verify clean study folder structure

Results:
BEFORE: study folder with generated_extractors/ directory (code pollution)
AFTER: study folder with extractors_manifest.json (just references)

Core library: optimization_engine/extractors/
- extract_displacement.py
- extract_von_mises_stress.py
- extract_mass.py
- catalog.json (tracks all extractors)

Study folders NOW ONLY contain:
- extractors_manifest.json (references to core library)
- llm_workflow_config.json (study configuration)
- optimization_results.json (results)
- optimization_history.json (trial history)

Production-grade architecture for "insanely good engineering software that evolves with time"

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Summary for Morning

What was done:

  1. Created a centralized extractor library system
  2. Eliminated per-study code duplication
  3. Established a clean study folder architecture
  4. Confirmed that E2E tests pass with the new structure
  5. Wrote comprehensive documentation

What you'll see:

  • Studies now only contain metadata (no code!)
  • Core library in optimization_engine/extractors/
  • Professional, production-grade architecture

Ready for:

  • Continuing Phase 3.2 development
  • Applying the same approach to the hooks library (next iteration)
  • Building "insanely good engineering software"

Have a good night!