From 3f23ca1bc6dbd606337081591d1ab67f03b76365 Mon Sep 17 00:00:00 2001
From: Anto01 <antoine.letarte@gmail.com>
Date: Tue, 14 Apr 2026 10:24:50 -0400
Subject: [PATCH] feat: signal-aggressive extraction + auto vault refresh in
 nightly cron
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Extraction prompt rewritten for signal-aggressive mode. The old prompt
rewarded silence ("durable insight only, empty is correct"), which
caused quiet failures: real project signal (Schott quotes arriving,
stakeholder events, blockers) was dropped as "not architectural enough".

New prompt explicitly lists what to emit:
1. Project activity (mentions with context — quote received, blocker,
   action item)
2. Decisions and choices (architectural commitments, vendor selection)
3. Durable engineering insight (earned knowledge, generalizable)
4. Stakeholder and vendor events (emails sent, meetings scheduled)
5. Preferences and adaptations (how Antoine works)

Philosophy shift: "capture more signal, let triage filter noise"
replaces "extract only durable architectural facts". Auto-triage
already rejects noise well, so moving the filter downstream gives us
visibility into weak signals without polluting active memory.

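Added 'episodic' to the candidate types list so stakeholder events and
milestones, which are tied to a point in time, have their own type.

Smaller prompt changes: the per-candidate content limit goes from 200 to
250 characters, the confidence ceiling for ratified/committed claims
goes from 0.6 to 0.7, and 'business' joins the domain list.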

LLM_EXTRACTOR_VERSION bumped to llm-0.4.0.
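
For reviewers, the candidate shape the llm-0.4.0 prompt asks for can be
checked with a small validator. This is a minimal sketch, not code from
this patch: the function name `candidate_problems` and the constant
names are hypothetical, and the values are copied from the prompt text
in the diff below. It only checks what the prompt states (allowed
types, allowed domains, content under 250 characters, a domain for
knowledge candidates that carry no project).

```python
from typing import Any

# Values copied from the new prompt text; the names are illustrative only.
CANDIDATE_TYPES = {"project", "knowledge", "preference", "adaptation", "episodic"}
DOMAINS = {
    "physics", "materials", "optics", "mechanics", "manufacturing",
    "metrology", "controls", "software", "math", "finance", "business",
}
MAX_CONTENT_CHARS = 250  # the prompt says "under 250 characters"


def candidate_problems(cand: dict[str, Any]) -> list[str]:
    """Return rule violations for one candidate; an empty list means it fits."""
    problems: list[str] = []
    ctype = cand.get("type")
    content = str(cand.get("content", "")).strip()
    project = str(cand.get("project", "")).strip()
    domain = str(cand.get("domain", "")).strip()

    if ctype not in CANDIDATE_TYPES:
        problems.append(f"unknown type: {ctype!r}")
    if not content:
        problems.append("empty content")
    elif len(content) >= MAX_CONTENT_CHARS:
        problems.append("content too long")
    if domain and domain not in DOMAINS:
        problems.append(f"unknown domain: {domain!r}")
    # The prompt requires a domain when type=knowledge and project is empty.
    if ctype == "knowledge" and not project and not domain:
        problems.append("knowledge candidate without project needs a domain")
    return problems


example = {
    "type": "episodic",
    "content": "Schott quote received for ABB-Space",
    "project": "abb-space",
    "domain": "",
    "confidence": 0.5,
}
print(candidate_problems(example))
```

A check like this would sit after JSON parsing and before triage, so
malformed candidates are dropped without raising the bar on what the
model is allowed to emit.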

Also: cron-backup.sh now issues POST /ingest/sources (Step 3b) before
extraction so new PKM files are ingested automatically. The step is
fail-open: a failed refresh logs a warning and the rest of the pipeline
still runs.
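
The fail-open behaviour can also be expressed in Python, for example if
the refresh is ever moved into the host-side extraction script. This is
a minimal sketch under that assumption, not part of this patch:
`refresh_sources` and its `opener` parameter are hypothetical names.
Only the endpoint path and the 600-second timeout come from
cron-backup.sh.

```python
import urllib.error
import urllib.request


def refresh_sources(base_url, timeout_s=600.0, opener=urllib.request.urlopen):
    """POST to /ingest/sources and report the outcome without raising.

    Returns (ok, detail). Network and HTTP errors are caught so the
    caller can log a warning and carry on, which is the fail-open part.
    """
    req = urllib.request.Request(
        f"{base_url}/ingest/sources", data=b"", method="POST"
    )
    try:
        with opener(req, timeout=timeout_s) as resp:
            return True, f"HTTP {resp.status}"
    except OSError as exc:
        # URLError, HTTPError, and socket timeouts all derive from OSError.
        return False, str(exc)
```

A caller would log `detail` at warning level when `ok` is false and then
continue to the extraction step. The `opener` argument exists only so
the function can be exercised without a live server.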

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---
 deploy/dalidou/cron-backup.sh       | 10 +++
 scripts/batch_llm_extract_live.py   | 91 +++++++++++++++++++---------
 src/atocore/memory/extractor_llm.py | 94 ++++++++++++++++++++---------
 3 files changed, 137 insertions(+), 58 deletions(-)

diff --git a/deploy/dalidou/cron-backup.sh b/deploy/dalidou/cron-backup.sh
index 64570f5..113bd75 100755
--- a/deploy/dalidou/cron-backup.sh
+++ b/deploy/dalidou/cron-backup.sh
@@ -82,6 +82,16 @@ else
     log "Step 3: ATOCORE_BACKUP_RSYNC not set, skipping off-host copy"
 fi
 
+# Step 3b: Auto-refresh vault sources so new PKM files flow in
+# automatically. Fail-open: never blocks the rest of the pipeline.
+log "Step 3b: auto-refresh vault sources"
+if REFRESH_RESULT=$(curl -sSf -X POST --max-time 600 \
+    "$ATOCORE_URL/ingest/sources" 2>&1); then
+    log "Sources refresh complete"
+else
+    log "WARN: sources refresh failed (non-blocking): $REFRESH_RESULT"
+fi
+
 # Step 4: Batch LLM extraction on recent interactions (optional).
 # Runs HOST-SIDE because claude CLI is on the host, not inside the
 # Docker container. The script fetches interactions from the API,
diff --git a/scripts/batch_llm_extract_live.py b/scripts/batch_llm_extract_live.py
index f177796..c129d73 100644
--- a/scripts/batch_llm_extract_live.py
+++ b/scripts/batch_llm_extract_live.py
@@ -31,45 +31,80 @@ MAX_PROMPT_CHARS = 2000
 
 MEMORY_TYPES = {"identity", "preference", "project", "episodic", "knowledge", "adaptation"}
 
-SYSTEM_PROMPT = """You extract durable memory candidates from LLM conversation turns for a personal context engine called AtoCore.
+SYSTEM_PROMPT = """You extract memory candidates from LLM conversation turns for a personal context engine called AtoCore.
 
-AtoCore stores two kinds of knowledge:
+AtoCore is the brain for Atomaste's engineering work. Known projects:
+p04-gigabit, p05-interferometer, p06-polisher, atomizer-v2, atocore,
+abb-space. Unknown project names — still tag them, the system auto-detects.
 
-A. PROJECT-SPECIFIC: applied decisions, constraints, and architecture for a named project. Known projects include p04-gigabit, p05-interferometer, p06-polisher, atomizer-v2, atocore, abb-space. If the conversation discusses a project NOT in this list, still tag it with the project name you identify — the system will auto-detect it as a new project or lead.
+Your job is to emit SIGNALS that matter for future context. Be aggressive:
+err on the side of capturing useful signal. Triage filters noise downstream.
 
-B. DOMAIN KNOWLEDGE: generalizable engineering insight that was EARNED through project work and is reusable across projects. Tag these with a domain instead of a project.
+WHAT TO EMIT (in order of importance):
 
-THE CRITICAL BAR FOR DOMAIN KNOWLEDGE:
-Only extract insight that took real effort to discover. The test: "Would a competent engineer need experience to know this, or could they find it in 30 seconds on Google?" If they can look it up, do NOT extract it.
+1. PROJECT ACTIVITY — any mention of a project with context worth remembering:
+   - "Schott quote received for ABB-Space" (event + project)
+   - "Cédric asked about p06 firmware timing" (stakeholder event)
+   - "Still waiting on Zygo lead-time from Nabeel" (blocker status)
+   - "p05 vendor decision needs to happen this week" (action item)
 
-EXTRACT (earned insight):
-- "At F/1.2, Zerodur CTE gradient across the blank is the second-largest WFE contributor after gravity sag"
-- "Preston removal rate model breaks down below 5N applied force because the contact assumption fails"
-- "For swing-arm polishing, m=1 (coma) is NOT correctable by force modulation (score 0.09)"
+2. DECISIONS AND CHOICES — anything that commits to a direction:
+   - "Going with Zygo Verifire SV for p05" (decision)
+   - "Dropping stitching from primary workflow" (design choice)
+   - "USB SSD mandatory, not SD card" (architectural commitment)
 
-DO NOT EXTRACT (common knowledge):
-- "Zerodur CTE is 0.05 ppm/K" (textbook value)
-- "FEA uses finite elements to discretize continuous domains" (definition)
-- "Python is a programming language" (obvious)
+3. DURABLE ENGINEERING INSIGHT — earned knowledge that generalizes:
+   - "CTE gradient dominates WFE at F/1.2" (materials insight)
+   - "Preston model breaks below 5N because contact assumption fails"
+   - "m=1 coma NOT correctable by force modulation" (controls insight)
+   Test: would a competent engineer NEED experience to know this?
+   If it's textbook/google-findable, skip it.
 
-Rules:
+4. STAKEHOLDER AND VENDOR EVENTS:
+   - "Email sent to Nabeel 2026-04-13 asking for lead time"
+   - "Meeting with Jason on Table 7 next Tuesday"
+   - "Starspec wants updated CAD by Friday"
 
-1. Only surface durable claims. Skip transient status, instructional guidance, troubleshooting, ephemeral recommendations, session recaps.
-2. A candidate is durable when a reader coming back in two weeks would still need to know it.
-3. Each candidate must stand alone in one sentence under 200 characters.
-4. Type must be one of: project, knowledge, preference, adaptation.
-5. For project-specific claims, set ``project`` to the project id.
-6. For generalizable domain insight, set ``project`` to empty and set ``domain`` to one of: physics, materials, optics, mechanics, manufacturing, metrology, controls, software, math, finance.
-7. When one conversation produces BOTH a project-specific fact AND a generalizable principle, emit BOTH as separate candidates.
-8. Return [] on most turns. The bar is high. Empty is correct and expected.
-9. Confidence 0.5 default. Raise to 0.6 only for unambiguous committed claims.
-10. Output a raw JSON array only. No prose, no markdown fences.
+5. PREFERENCES AND ADAPTATIONS that shape how Antoine works:
+   - "Antoine prefers OAuth over API keys"
+   - "Extraction stays off the capture hot path"
 
-Each array element:
+WHAT TO SKIP:
 
-{"type": "project|knowledge|preference|adaptation", "content": "...", "project": "...", "domain": "", "confidence": 0.5}
+- Pure conversational filler ("ok thanks", "let me check")
+- Instructional help content ("run this command", "here's how to...")
+- Obvious textbook facts anyone can google in 30 seconds
+- Session meta-chatter ("let me commit this", "deploy running")
+- Transient system state snapshots ("36 active memories right now")
 
-Use ``project`` for project-scoped candidates. Use ``domain`` for cross-project knowledge. Never set both."""
+CANDIDATE TYPES — choose the best fit:
+
+- project — a fact, decision, or event specific to one named project
+- knowledge — durable engineering insight (use domain, not project)
+- preference — how Antoine works / wants things done
+- adaptation — a standing rule or adjustment to behavior
+- episodic — a stakeholder event or milestone worth remembering
+
+DOMAINS for knowledge candidates (required when type=knowledge and project is empty):
+physics, materials, optics, mechanics, manufacturing, metrology,
+controls, software, math, finance, business
+
+TRUST HIERARCHY:
+
+- project-specific: set project to the project id, leave domain empty
+- domain knowledge: set domain, leave project empty
+- events/activity: use project, type=project or episodic
+- one conversation can produce MULTIPLE candidates — emit them all
+
+OUTPUT RULES:
+
+- Each candidate content under 250 characters, stands alone
+- Default confidence 0.5. Raise to 0.7 only for ratified/committed claims.
+- Raw JSON array, no prose, no markdown fences
+- Empty array [] is fine when the conversation has no durable signal
+
+Each element:
+{"type": "project|knowledge|preference|adaptation|episodic", "content": "...", "project": "...", "domain": "", "confidence": 0.5}"""
 
 _sandbox_cwd = None
 
diff --git a/src/atocore/memory/extractor_llm.py b/src/atocore/memory/extractor_llm.py
index 19417cf..acbb3a6 100644
--- a/src/atocore/memory/extractor_llm.py
+++ b/src/atocore/memory/extractor_llm.py
@@ -64,52 +64,86 @@ from atocore.observability.logger import get_logger
 
 log = get_logger("extractor_llm")
 
-LLM_EXTRACTOR_VERSION = "llm-0.3.0"
+LLM_EXTRACTOR_VERSION = "llm-0.4.0"
 DEFAULT_MODEL = os.environ.get("ATOCORE_LLM_EXTRACTOR_MODEL", "sonnet")
 DEFAULT_TIMEOUT_S = float(os.environ.get("ATOCORE_LLM_EXTRACTOR_TIMEOUT_S", "90"))
 MAX_RESPONSE_CHARS = 8000
 MAX_PROMPT_CHARS = 2000
 
-_SYSTEM_PROMPT = """You extract durable memory candidates from LLM conversation turns for a personal context engine called AtoCore.
+_SYSTEM_PROMPT = """You extract memory candidates from LLM conversation turns for a personal context engine called AtoCore.
 
-AtoCore stores two kinds of knowledge:
+AtoCore is the brain for Atomaste's engineering work. Known projects:
+p04-gigabit, p05-interferometer, p06-polisher, atomizer-v2, atocore,
+abb-space. Unknown project names — still tag them, the system auto-detects.
 
-A. PROJECT-SPECIFIC: applied decisions, constraints, and architecture for a named project. Known projects include p04-gigabit, p05-interferometer, p06-polisher, atomizer-v2, atocore, abb-space. If the conversation discusses a project NOT in this list, still tag it with the project name you identify — the system will auto-detect it as a new project or lead.
+Your job is to emit SIGNALS that matter for future context. Be aggressive:
+err on the side of capturing useful signal. Triage filters noise downstream.
 
-B. DOMAIN KNOWLEDGE: generalizable engineering insight that was EARNED through project work and is reusable across projects. Tag these with a domain instead of a project.
+WHAT TO EMIT (in order of importance):
 
-THE CRITICAL BAR FOR DOMAIN KNOWLEDGE:
-Only extract insight that took real effort to discover. The test: "Would a competent engineer need experience to know this, or could they find it in 30 seconds on Google?" If they can look it up, do NOT extract it.
+1. PROJECT ACTIVITY — any mention of a project with context worth remembering:
+   - "Schott quote received for ABB-Space" (event + project)
+   - "Cédric asked about p06 firmware timing" (stakeholder event)
+   - "Still waiting on Zygo lead-time from Nabeel" (blocker status)
+   - "p05 vendor decision needs to happen this week" (action item)
 
-EXTRACT (earned insight):
-- "At F/1.2, Zerodur CTE gradient across the blank is the second-largest WFE contributor after gravity sag — costs ~3nm and drove the support pad layout"
-- "Preston removal rate model breaks down below 5N applied force because the contact assumption fails"
-- "For swing-arm polishing, m=1 (coma) is NOT correctable by force modulation (score 0.09) — only m=2 and m=3 work"
+2. DECISIONS AND CHOICES — anything that commits to a direction:
+   - "Going with Zygo Verifire SV for p05" (decision)
+   - "Dropping stitching from primary workflow" (design choice)
+   - "USB SSD mandatory, not SD card" (architectural commitment)
 
-DO NOT EXTRACT (common knowledge):
-- "Zerodur CTE is 0.05 ppm/K" (textbook value)
-- "FEA uses finite elements to discretize continuous domains" (definition)
-- "Python is a programming language" (obvious)
-- "git commit saves changes" (basic tool knowledge)
+3. DURABLE ENGINEERING INSIGHT — earned knowledge that generalizes:
+   - "CTE gradient dominates WFE at F/1.2" (materials insight)
+   - "Preston model breaks below 5N because contact assumption fails"
+   - "m=1 coma NOT correctable by force modulation" (controls insight)
+   Test: would a competent engineer NEED experience to know this?
+   If it's textbook/google-findable, skip it.
 
-Rules:
+4. STAKEHOLDER AND VENDOR EVENTS:
+   - "Email sent to Nabeel 2026-04-13 asking for lead time"
+   - "Meeting with Jason on Table 7 next Tuesday"
+   - "Starspec wants updated CAD by Friday"
 
-1. Only surface durable claims. Skip transient status, instructional guidance, troubleshooting tactics, ephemeral recommendations, and session recaps.
-2. A candidate is durable when a reader coming back in two weeks would still need to know it.
-3. Each candidate must stand alone in one sentence under 200 characters.
-4. Type must be one of: project, knowledge, preference, adaptation.
-5. For project-specific claims, set ``project`` to the project id.
-6. For generalizable domain insight, set ``project`` to empty and set ``domain`` to one of: physics, materials, optics, mechanics, manufacturing, metrology, controls, software, math, finance.
-7. When one conversation produces BOTH a project-specific fact AND a generalizable principle, emit BOTH as separate candidates.
-8. Return [] on most turns. The bar is high. Empty is correct and expected.
-9. Confidence 0.5 default. Raise to 0.6 only for unambiguous committed claims.
-10. Output a raw JSON array only. No prose, no markdown fences.
+5. PREFERENCES AND ADAPTATIONS that shape how Antoine works:
+   - "Antoine prefers OAuth over API keys"
+   - "Extraction stays off the capture hot path"
 
-Each array element:
+WHAT TO SKIP:
 
-{"type": "project|knowledge|preference|adaptation", "content": "...", "project": "...", "domain": "", "confidence": 0.5}
+- Pure conversational filler ("ok thanks", "let me check")
+- Instructional help content ("run this command", "here's how to...")
+- Obvious textbook facts anyone can google in 30 seconds
+- Session meta-chatter ("let me commit this", "deploy running")
+- Transient system state snapshots ("36 active memories right now")
 
-Use ``project`` for project-scoped candidates. Use ``domain`` for cross-project knowledge. Never set both."""
+CANDIDATE TYPES — choose the best fit:
+
+- project — a fact, decision, or event specific to one named project
+- knowledge — durable engineering insight (use domain, not project)
+- preference — how Antoine works / wants things done
+- adaptation — a standing rule or adjustment to behavior
+- episodic — a stakeholder event or milestone worth remembering
+
+DOMAINS for knowledge candidates (required when type=knowledge and project is empty):
+physics, materials, optics, mechanics, manufacturing, metrology,
+controls, software, math, finance, business
+
+TRUST HIERARCHY:
+
+- project-specific: set project to the project id, leave domain empty
+- domain knowledge: set domain, leave project empty
+- events/activity: use project, type=project or episodic
+- one conversation can produce MULTIPLE candidates — emit them all
+
+OUTPUT RULES:
+
+- Each candidate content under 250 characters, stands alone
+- Default confidence 0.5. Raise to 0.7 only for ratified/committed claims.
+- Raw JSON array, no prose, no markdown fences
+- Empty array [] is fine when the conversation has no durable signal
+
+Each element:
+{"type": "project|knowledge|preference|adaptation|episodic", "content": "...", "project": "...", "domain": "", "confidence": 0.5}"""
 
 
 @dataclass