feat: dual-layer knowledge extraction + domain knowledge band

The extraction system now produces two kinds of candidates from
the same conversation:

A. PROJECT-SPECIFIC: applied facts scoped to a named project
   (unchanged behavior)
B. DOMAIN KNOWLEDGE: generalizable engineering insight earned
   through project work, tagged with a domain (physics, materials,
   optics, mechanics, manufacturing, metrology, controls, software,
   math, finance) and stored with project="" so it surfaces across
   all projects.
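
As a sketch of the two candidate shapes (the content strings here are invented for illustration; the field layout follows the extractor prompt in the diff):

```python
# Hypothetical candidates one extraction turn might yield.
# A project-specific fact stays scoped to its project id:
project_fact = {
    "type": "project",
    "content": "p06-polisher pad layout was chosen to counter gravity sag",
    "project": "p06-polisher",
    "domain": "",
    "confidence": 0.5,
}
# A generalizable insight is tagged with a domain and left unscoped,
# so it surfaces in every project's context pack:
domain_insight = {
    "type": "knowledge",
    "content": "Preston removal rate model breaks down below 5N applied force",
    "project": "",  # project="" makes it cross-project
    "domain": "manufacturing",
    "confidence": 0.5,
}
# The prompt forbids setting both scope fields on one candidate.
assert not (project_fact["project"] and project_fact["domain"])
assert not (domain_insight["project"] and domain_insight["domain"])
```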

Critical quality bar enforced in the system prompt: "Would a
competent engineer need experience to know this, or could they
find it in 30 seconds on Google?" Textbook values, definitions,
and obvious facts are explicitly excluded. Only hard-won insight
qualifies — the kind that takes weeks of FEA or real machining
experience to discover.

Domain tags are embedded in the content as a prefix ("[physics]",
"[materials]") so they survive without a schema migration; a future
migration can parse them out into a dedicated column.
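
That future migration could recover the tag with a parse along these lines (`split_domain_tag` is a hypothetical helper, not part of this commit):

```python
import re

# Matches a leading "[domain] " tag as emitted by the extractor.
_DOMAIN_TAG = re.compile(r"^\[([a-z]+)\]\s+(.*)$", re.DOTALL)

def split_domain_tag(content: str) -> tuple[str, str]:
    """Split '[physics] insight...' into ('physics', 'insight...').

    Untagged content passes through as ('', content), so project-scoped
    memories are unaffected.
    """
    m = _DOMAIN_TAG.match(content)
    if m:
        return m.group(1), m.group(2)
    return "", content
```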

Context builder gains a new tier between project memories and
retrieved chunks:

  Tier 1: Trusted Project State     (project-specific)
  Tier 2: Identity / Preferences    (global)
  Tier 3: Project Memories          (project-specific)
  Tier 4: Domain Knowledge (NEW)    (cross-project, 10% budget)
  Tier 5: Retrieved Chunks          (project-boosted)

Trim order: chunks -> domain knowledge -> project memories ->
identity/preference -> project state.
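
The tier-4 budget is clamped two ways, mirroring the min/max arithmetic in build_context: never more than 10% of the pack, and never more than what the higher-trust tiers left over. A minimal sketch (function name illustrative):

```python
DOMAIN_KNOWLEDGE_BUDGET_RATIO = 0.10

def domain_knowledge_budget(budget: int, project_state_chars: int,
                            memory_chars: int, project_memory_chars: int) -> int:
    # Higher-trust tiers are packed first; domain knowledge only gets
    # whatever they left over, capped at 10% of the total budget.
    remaining = max(
        budget - project_state_chars - memory_chars - project_memory_chars, 0
    )
    return min(int(budget * DOMAIN_KNOWLEDGE_BUDGET_RATIO), remaining)
```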

Host-side extraction script updated with the same prompt and
domain-tag handling.

LLM_EXTRACTOR_VERSION bumped to llm-0.3.0.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 09:04:04 -04:00
parent db89978871
commit 9118f824fa
3 changed files with 142 additions and 38 deletions


@@ -33,24 +33,43 @@ MEMORY_TYPES = {"identity", "preference", "project", "episodic", "knowledge", "a
SYSTEM_PROMPT = """You extract durable memory candidates from LLM conversation turns for a personal context engine called AtoCore.
Your job is to read one user prompt plus the assistant's response and decide which durable facts, decisions, preferences, architectural rules, or project invariants should be remembered across future sessions.
+AtoCore stores two kinds of knowledge:
+A. PROJECT-SPECIFIC: applied decisions, constraints, and architecture for a named project (p04-gigabit, p05-interferometer, p06-polisher, atomizer-v2, atocore). These stay scoped to one project.
+B. DOMAIN KNOWLEDGE: generalizable engineering insight that was EARNED through project work and is reusable across projects. Tag these with a domain instead of a project.
+THE CRITICAL BAR FOR DOMAIN KNOWLEDGE:
+Only extract insight that took real effort to discover. The test: "Would a competent engineer need experience to know this, or could they find it in 30 seconds on Google?" If they can look it up, do NOT extract it.
+EXTRACT (earned insight):
+- "At F/1.2, Zerodur CTE gradient across the blank is the second-largest WFE contributor after gravity sag"
+- "Preston removal rate model breaks down below 5N applied force because the contact assumption fails"
+- "For swing-arm polishing, m=1 (coma) is NOT correctable by force modulation (score 0.09)"
+DO NOT EXTRACT (common knowledge):
+- "Zerodur CTE is 0.05 ppm/K" (textbook value)
+- "FEA uses finite elements to discretize continuous domains" (definition)
+- "Python is a programming language" (obvious)
Rules:
-1. Only surface durable claims. Skip transient status ("deploy is still running"), instructional guidance ("here is how to run the command"), troubleshooting tactics, ephemeral recommendations ("merge this PR now"), and session recaps.
-2. A candidate is durable when a reader coming back in two weeks would still need to know it. Architectural choices, named rules, ratified decisions, invariants, procurement commitments, and project-level constraints qualify. Conversational fillers and step-by-step instructions do not.
-3. Each candidate must stand alone. Rewrite the claim in one sentence under 200 characters with enough context that a reader without the conversation understands it.
-4. Each candidate must have a type from this closed set: project, knowledge, preference, adaptation.
-5. If the conversation is clearly scoped to a project (p04-gigabit, p05-interferometer, p06-polisher, atocore), set ``project`` to that id. Otherwise leave ``project`` empty.
-6. If the response makes no durable claim, return an empty list. It is correct and expected to return [] on most conversational turns.
-7. Confidence should be 0.5 by default so human review workload is honest. Raise to 0.6 only when the response states the claim in an unambiguous, committed form (e.g. "the decision is X", "the selected approach is Y", "X is non-negotiable").
-8. Output must be a raw JSON array and nothing else. No prose before or after. No markdown fences. No explanations.
+1. Only surface durable claims. Skip transient status, instructional guidance, troubleshooting, ephemeral recommendations, session recaps.
+2. A candidate is durable when a reader coming back in two weeks would still need to know it.
+3. Each candidate must stand alone in one sentence under 200 characters.
+4. Type must be one of: project, knowledge, preference, adaptation.
+5. For project-specific claims, set ``project`` to the project id.
+6. For generalizable domain insight, set ``project`` to empty and set ``domain`` to one of: physics, materials, optics, mechanics, manufacturing, metrology, controls, software, math, finance.
+7. When one conversation produces BOTH a project-specific fact AND a generalizable principle, emit BOTH as separate candidates.
+8. Return [] on most turns. The bar is high. Empty is correct and expected.
+9. Confidence 0.5 default. Raise to 0.6 only for unambiguous committed claims.
+10. Output a raw JSON array only. No prose, no markdown fences.
-Each array element has exactly this shape:
-{"type": "project|knowledge|preference|adaptation", "content": "...", "project": "...", "confidence": 0.5}
-Return [] when there is nothing to extract."""
+Each array element:
+{"type": "project|knowledge|preference|adaptation", "content": "...", "project": "...", "domain": "", "confidence": 0.5}
+Use ``project`` for project-scoped candidates. Use ``domain`` for cross-project knowledge. Never set both."""
_sandbox_cwd = None
@@ -192,14 +211,17 @@ def parse_candidates(raw, interaction_project):
mem_type = str(item.get("type") or "").strip().lower()
content = str(item.get("content") or "").strip()
model_project = str(item.get("project") or "").strip()
+domain = str(item.get("domain") or "").strip().lower()
# R9 trust hierarchy: interaction scope always wins when set.
# Model project only used for unscoped interactions + registered check.
if interaction_project:
project = interaction_project
elif model_project and model_project in _known_projects:
project = model_project
else:
project = ""
+# Domain knowledge: embed tag in content for cross-project retrieval
+if domain and not project:
+content = f"[{domain}] {content}"
conf = item.get("confidence", 0.5)
if mem_type not in MEMORY_TYPES or not content:
continue


@@ -36,6 +36,12 @@ MEMORY_BUDGET_RATIO = 0.05 # identity + preference; lowered from 0.10 to avoid
# memory can actually reach the model.
PROJECT_MEMORY_BUDGET_RATIO = 0.25
PROJECT_MEMORY_TYPES = ["project", "knowledge", "episodic"]
+# General domain knowledge — unscoped memories (project="") that surface
+# in every context pack regardless of project hint. These are earned
+# engineering insights that apply across projects (e.g., "Preston removal
+# model breaks down below 5N because the contact assumption fails").
+DOMAIN_KNOWLEDGE_BUDGET_RATIO = 0.10
+DOMAIN_KNOWLEDGE_TYPES = ["knowledge"]
# Last built context pack for debug inspection
_last_context_pack: "ContextPack | None" = None
@@ -59,6 +65,8 @@ class ContextPack:
memory_chars: int = 0
project_memory_text: str = ""
project_memory_chars: int = 0
+domain_knowledge_text: str = ""
+domain_knowledge_chars: int = 0
total_chars: int = 0
budget: int = 0
budget_remaining: int = 0
@@ -139,8 +147,29 @@ def build_context(
query=user_prompt,
)
+# 2c. Domain knowledge — cross-project earned insight with project=""
+# that surfaces regardless of which project the query is about.
+domain_knowledge_text = ""
+domain_knowledge_chars = 0
+domain_budget = min(
+int(budget * DOMAIN_KNOWLEDGE_BUDGET_RATIO),
+max(budget - project_state_chars - memory_chars - project_memory_chars, 0),
+)
+if domain_budget > 0:
+domain_knowledge_text, domain_knowledge_chars = get_memories_for_context(
+memory_types=DOMAIN_KNOWLEDGE_TYPES,
+project="",
+budget=domain_budget,
+header="--- Domain Knowledge ---",
+footer="--- End Domain Knowledge ---",
+query=user_prompt,
+)
# 3. Calculate remaining budget for retrieval
-retrieval_budget = budget - project_state_chars - memory_chars - project_memory_chars
+retrieval_budget = (
+budget - project_state_chars - memory_chars
+- project_memory_chars - domain_knowledge_chars
+)
# 4. Retrieve candidates
candidates = (
@@ -161,13 +190,15 @@ def build_context(
# 7. Format full context
formatted = _format_full_context(
-project_state_text, memory_text, project_memory_text, selected
+project_state_text, memory_text, project_memory_text,
+domain_knowledge_text, selected,
)
if len(formatted) > budget:
formatted, selected = _trim_context_to_budget(
project_state_text,
memory_text,
project_memory_text,
+domain_knowledge_text,
selected,
budget,
)
@@ -178,6 +209,7 @@ def build_context(
project_state_chars = len(project_state_text)
memory_chars = len(memory_text)
project_memory_chars = len(project_memory_text)
+domain_knowledge_chars = len(domain_knowledge_text)
retrieval_chars = sum(c.char_count for c in selected)
total_chars = len(formatted)
duration_ms = int((time.time() - start) * 1000)
@@ -190,6 +222,8 @@ def build_context(
memory_chars=memory_chars,
project_memory_text=project_memory_text,
project_memory_chars=project_memory_chars,
+domain_knowledge_text=domain_knowledge_text,
+domain_knowledge_chars=domain_knowledge_chars,
total_chars=total_chars,
budget=budget,
budget_remaining=budget - total_chars,
@@ -208,6 +242,7 @@ def build_context(
project_state_chars=project_state_chars,
memory_chars=memory_chars,
project_memory_chars=project_memory_chars,
+domain_knowledge_chars=domain_knowledge_chars,
retrieval_chars=retrieval_chars,
total_chars=total_chars,
budget_remaining=budget - total_chars,
@@ -288,6 +323,7 @@ def _format_full_context(
project_state_text: str,
memory_text: str,
project_memory_text: str,
+domain_knowledge_text: str,
chunks: list[ContextChunk],
) -> str:
"""Format project state + memories + retrieved chunks into full context block."""
@@ -308,7 +344,12 @@ def _format_full_context(
parts.append(project_memory_text)
parts.append("")
-# 4. Retrieved chunks (lowest trust)
+# 4. Domain knowledge (cross-project earned insight)
+if domain_knowledge_text:
+parts.append(domain_knowledge_text)
+parts.append("")
+# 5. Retrieved chunks (lowest trust)
if chunks:
parts.append("--- AtoCore Retrieved Context ---")
if project_state_text:
@@ -320,7 +361,7 @@ def _format_full_context(
parts.append(chunk.content)
parts.append("")
parts.append("--- End Context ---")
-elif not project_state_text and not memory_text and not project_memory_text:
+elif not project_state_text and not memory_text and not project_memory_text and not domain_knowledge_text:
parts.append("--- AtoCore Context ---\nNo relevant context found.\n--- End Context ---")
return "\n".join(parts)
@@ -343,6 +384,7 @@ def _pack_to_dict(pack: ContextPack) -> dict:
"project_state_chars": pack.project_state_chars,
"memory_chars": pack.memory_chars,
"project_memory_chars": pack.project_memory_chars,
+"domain_knowledge_chars": pack.domain_knowledge_chars,
"chunks_used": len(pack.chunks_used),
"total_chars": pack.total_chars,
"budget": pack.budget,
@@ -351,6 +393,7 @@ def _pack_to_dict(pack: ContextPack) -> dict:
"has_project_state": bool(pack.project_state_text),
"has_memories": bool(pack.memory_text),
"has_project_memories": bool(pack.project_memory_text),
+"has_domain_knowledge": bool(pack.domain_knowledge_text),
"chunks": [
{
"source_file": c.source_file,
@@ -381,44 +424,56 @@ def _trim_context_to_budget(
project_state_text: str,
memory_text: str,
project_memory_text: str,
+domain_knowledge_text: str,
chunks: list[ContextChunk],
budget: int,
) -> tuple[str, list[ContextChunk]]:
-"""Trim retrieval -> project memories -> identity/preference -> project state."""
+"""Trim retrieval -> domain knowledge -> project memories -> identity/preference -> project state."""
kept_chunks = list(chunks)
formatted = _format_full_context(
-project_state_text, memory_text, project_memory_text, kept_chunks
+project_state_text, memory_text, project_memory_text,
+domain_knowledge_text, kept_chunks,
)
while len(formatted) > budget and kept_chunks:
kept_chunks.pop()
formatted = _format_full_context(
-project_state_text, memory_text, project_memory_text, kept_chunks
+project_state_text, memory_text, project_memory_text,
+domain_knowledge_text, kept_chunks,
)
if len(formatted) <= budget:
return formatted, kept_chunks
-# Drop project memories next (they were the most recently added
-# tier and carry less trust than identity/preference).
+# Drop domain knowledge first (lowest trust of the memory tiers).
+domain_knowledge_text, _ = _truncate_text_block(domain_knowledge_text, 0)
+formatted = _format_full_context(
+project_state_text, memory_text, project_memory_text,
+domain_knowledge_text, kept_chunks,
+)
+if len(formatted) <= budget:
+return formatted, kept_chunks
project_memory_text, _ = _truncate_text_block(
project_memory_text,
max(budget - len(project_state_text) - len(memory_text), 0),
)
formatted = _format_full_context(
-project_state_text, memory_text, project_memory_text, kept_chunks
+project_state_text, memory_text, project_memory_text,
+domain_knowledge_text, kept_chunks,
)
if len(formatted) <= budget:
return formatted, kept_chunks
memory_text, _ = _truncate_text_block(memory_text, max(budget - len(project_state_text), 0))
formatted = _format_full_context(
-project_state_text, memory_text, project_memory_text, kept_chunks
+project_state_text, memory_text, project_memory_text,
+domain_knowledge_text, kept_chunks,
)
if len(formatted) <= budget:
return formatted, kept_chunks
project_state_text, _ = _truncate_text_block(project_state_text, budget)
-formatted = _format_full_context(project_state_text, "", "", [])
+formatted = _format_full_context(project_state_text, "", "", "", [])
if len(formatted) > budget:
formatted, _ = _truncate_text_block(formatted, budget)
return formatted, []


@@ -64,7 +64,7 @@ from atocore.observability.logger import get_logger
log = get_logger("extractor_llm")
-LLM_EXTRACTOR_VERSION = "llm-0.2.0"
+LLM_EXTRACTOR_VERSION = "llm-0.3.0"
DEFAULT_MODEL = os.environ.get("ATOCORE_LLM_EXTRACTOR_MODEL", "sonnet")
DEFAULT_TIMEOUT_S = float(os.environ.get("ATOCORE_LLM_EXTRACTOR_TIMEOUT_S", "90"))
MAX_RESPONSE_CHARS = 8000
@@ -72,24 +72,44 @@ MAX_PROMPT_CHARS = 2000
_SYSTEM_PROMPT = """You extract durable memory candidates from LLM conversation turns for a personal context engine called AtoCore.
Your job is to read one user prompt plus the assistant's response and decide which durable facts, decisions, preferences, architectural rules, or project invariants should be remembered across future sessions.
+AtoCore stores two kinds of knowledge:
+A. PROJECT-SPECIFIC: applied decisions, constraints, and architecture for a named project (p04-gigabit, p05-interferometer, p06-polisher, atomizer-v2, atocore). These stay scoped to one project.
+B. DOMAIN KNOWLEDGE: generalizable engineering insight that was EARNED through project work and is reusable across projects. Tag these with a domain instead of a project.
+THE CRITICAL BAR FOR DOMAIN KNOWLEDGE:
+Only extract insight that took real effort to discover. The test: "Would a competent engineer need experience to know this, or could they find it in 30 seconds on Google?" If they can look it up, do NOT extract it.
+EXTRACT (earned insight):
+- "At F/1.2, Zerodur CTE gradient across the blank is the second-largest WFE contributor after gravity sag — costs ~3nm and drove the support pad layout"
+- "Preston removal rate model breaks down below 5N applied force because the contact assumption fails"
+- "For swing-arm polishing, m=1 (coma) is NOT correctable by force modulation (score 0.09) — only m=2 and m=3 work"
+DO NOT EXTRACT (common knowledge):
+- "Zerodur CTE is 0.05 ppm/K" (textbook value)
+- "FEA uses finite elements to discretize continuous domains" (definition)
+- "Python is a programming language" (obvious)
+- "git commit saves changes" (basic tool knowledge)
Rules:
-1. Only surface durable claims. Skip transient status ("deploy is still running"), instructional guidance ("here is how to run the command"), troubleshooting tactics, ephemeral recommendations ("merge this PR now"), and session recaps.
-2. A candidate is durable when a reader coming back in two weeks would still need to know it. Architectural choices, named rules, ratified decisions, invariants, procurement commitments, and project-level constraints qualify. Conversational fillers and step-by-step instructions do not.
-3. Each candidate must stand alone. Rewrite the claim in one sentence under 200 characters with enough context that a reader without the conversation understands it.
-4. Each candidate must have a type from this closed set: project, knowledge, preference, adaptation.
-5. If the conversation is clearly scoped to a project (p04-gigabit, p05-interferometer, p06-polisher, atocore), set ``project`` to that id. Otherwise leave ``project`` empty.
-6. If the response makes no durable claim, return an empty list. It is correct and expected to return [] on most conversational turns.
-7. Confidence should be 0.5 by default so human review workload is honest. Raise to 0.6 only when the response states the claim in an unambiguous, committed form (e.g. "the decision is X", "the selected approach is Y", "X is non-negotiable").
-8. Output must be a raw JSON array and nothing else. No prose before or after. No markdown fences. No explanations.
+1. Only surface durable claims. Skip transient status, instructional guidance, troubleshooting tactics, ephemeral recommendations, and session recaps.
+2. A candidate is durable when a reader coming back in two weeks would still need to know it.
+3. Each candidate must stand alone in one sentence under 200 characters.
+4. Type must be one of: project, knowledge, preference, adaptation.
+5. For project-specific claims, set ``project`` to the project id.
+6. For generalizable domain insight, set ``project`` to empty and set ``domain`` to one of: physics, materials, optics, mechanics, manufacturing, metrology, controls, software, math, finance.
+7. When one conversation produces BOTH a project-specific fact AND a generalizable principle, emit BOTH as separate candidates.
+8. Return [] on most turns. The bar is high. Empty is correct and expected.
+9. Confidence 0.5 default. Raise to 0.6 only for unambiguous committed claims.
+10. Output a raw JSON array only. No prose, no markdown fences.
-Each array element has exactly this shape:
-{"type": "project|knowledge|preference|adaptation", "content": "...", "project": "...", "confidence": 0.5}
-Return [] when there is nothing to extract."""
+Each array element:
+{"type": "project|knowledge|preference|adaptation", "content": "...", "project": "...", "domain": "", "confidence": 0.5}
+Use ``project`` for project-scoped candidates. Use ``domain`` for cross-project knowledge. Never set both."""
@dataclass
@@ -276,11 +296,18 @@ def _parse_candidates(raw_output: str, interaction: Interaction) -> list[MemoryC
project = ""
else:
project = ""
+domain = str(item.get("domain") or "").strip().lower()
confidence_raw = item.get("confidence", 0.5)
if mem_type not in MEMORY_TYPES:
continue
if not content:
continue
+# Domain knowledge: embed the domain tag in the content so it
+# survives without a schema migration. The context builder
+# can match on it via query-relevance ranking, and a future
+# migration can parse it into a proper column.
+if domain and not project:
+content = f"[{domain}] {content}"
try:
confidence = float(confidence_raw)
except (TypeError, ValueError):