Compare commits
13 Commits
codex/dali
...
30ee857d62
| Author | SHA1 | Date |
|---|---|---|
| | 30ee857d62 | |
| | 38f6e525af | |
| | 37331d53ef | |
| | 5aeeb1cad1 | |
| | 4da81c9e4e | |
| | 7bf83bf46a | |
| | 1161645415 | |
| | 5913da53c5 | |
| | 8ea53f4003 | |
| | 9366ba7879 | |
| | c5bad996a7 | |
| | 0b1742770a | |
| | 2829d5ec1c | |
85
deploy/dalidou/cron-backup.sh
Executable file
@@ -0,0 +1,85 @@
#!/usr/bin/env bash
#
# deploy/dalidou/cron-backup.sh
# ------------------------------
# Daily backup + retention cleanup via the AtoCore API.
#
# Intended to run from cron on Dalidou:
#
#   # Daily at 03:00 UTC
#   0 3 * * * /srv/storage/atocore/app/deploy/dalidou/cron-backup.sh >> /var/log/atocore-backup.log 2>&1
#
# What it does:
#   1. Creates a runtime backup (db + registry, no chroma by default)
#   2. Runs retention cleanup with --confirm to delete old snapshots
#   3. Logs results to stdout (captured by cron into the log file)
#
# Fail-open: exits 0 even on API errors so cron doesn't send noise
# emails. Check /var/log/atocore-backup.log for diagnostics.
#
# Environment variables:
#   ATOCORE_URL            default http://127.0.0.1:8100
#   ATOCORE_BACKUP_CHROMA  default false (set to "true" for cold chroma copy)
#   ATOCORE_BACKUP_DIR     default /srv/storage/atocore/backups
#   ATOCORE_BACKUP_RSYNC   optional rsync destination for off-host copies
#                          (e.g. papa@laptop:/home/papa/atocore-backups/)
#                          When set, the local snapshots tree is rsynced to
#                          the destination after cleanup. Unset = skip.
#                          SSH key auth must already be configured from this
#                          host to the destination.

set -euo pipefail

ATOCORE_URL="${ATOCORE_URL:-http://127.0.0.1:8100}"
INCLUDE_CHROMA="${ATOCORE_BACKUP_CHROMA:-false}"
BACKUP_DIR="${ATOCORE_BACKUP_DIR:-/srv/storage/atocore/backups}"
RSYNC_TARGET="${ATOCORE_BACKUP_RSYNC:-}"
TIMESTAMP="$(date -u +%Y-%m-%dT%H:%M:%SZ)"

log() { printf '[%s] %s\n' "$TIMESTAMP" "$*"; }

log "=== AtoCore daily backup starting ==="

# Step 1: Create backup
log "Step 1: creating backup (chroma=$INCLUDE_CHROMA)"
BACKUP_RESULT=$(curl -sf -X POST \
    -H "Content-Type: application/json" \
    -d "{\"include_chroma\": $INCLUDE_CHROMA}" \
    "$ATOCORE_URL/admin/backup" 2>&1) || {
    log "ERROR: backup creation failed: $BACKUP_RESULT"
    exit 0
}
log "Backup created: $BACKUP_RESULT"

# Step 2: Retention cleanup (confirm=true to actually delete)
log "Step 2: running retention cleanup"
CLEANUP_RESULT=$(curl -sf -X POST \
    -H "Content-Type: application/json" \
    -d '{"confirm": true}' \
    "$ATOCORE_URL/admin/backup/cleanup" 2>&1) || {
    log "ERROR: cleanup failed: $CLEANUP_RESULT"
    exit 0
}
log "Cleanup result: $CLEANUP_RESULT"

# Step 3: Off-host rsync (optional). Fail-open: log but don't abort
# the cron so a laptop being offline at 03:00 UTC never turns the
# local backup path red.
if [[ -n "$RSYNC_TARGET" ]]; then
    log "Step 3: rsyncing snapshots to $RSYNC_TARGET"
    if [[ ! -d "$BACKUP_DIR/snapshots" ]]; then
        log "WARN: $BACKUP_DIR/snapshots does not exist, skipping rsync"
    else
        RSYNC_OUTPUT=$(rsync -a --delete \
            -e "ssh -o ConnectTimeout=10 -o BatchMode=yes -o StrictHostKeyChecking=accept-new" \
            "$BACKUP_DIR/snapshots/" "$RSYNC_TARGET" 2>&1) && {
            log "Rsync complete"
        } || {
            log "WARN: rsync to $RSYNC_TARGET failed (offline or auth?): $RSYNC_OUTPUT"
        }
    fi
else
    log "Step 3: ATOCORE_BACKUP_RSYNC not set, skipping off-host copy"
fi

log "=== AtoCore daily backup complete ==="
@@ -3,7 +3,7 @@

 Reads the Stop hook JSON from stdin, extracts the last user prompt
 from the transcript JSONL, and POSTs to the AtoCore /interactions
-endpoint in conservative mode (reinforce=false, no extraction).
+endpoint with reinforcement enabled (no extraction).

 Fail-open: always exits 0, logs errors to stderr only.

@@ -81,7 +81,7 @@ def _capture() -> None:
         "client": "claude-code",
         "session_id": session_id,
         "project": project,
-        "reinforce": False,
+        "reinforce": True,
     }

     body = json.dumps(payload, ensure_ascii=True).encode("utf-8")
@@ -32,7 +32,18 @@ read-only additive mode.
 ### Baseline Complete

 - Phase 9 - Reflection (all three foundation commits landed:
-  A capture, B reinforcement, C candidate extraction + review queue)
+  A capture, B reinforcement, C candidate extraction + review queue).
+  As of 2026-04-11 the capture → reinforce half runs automatically on
+  every Stop-hook capture (length-aware token-overlap matcher handles
+  paragraph-length memories), and project-scoped memories now reach
+  the context pack via a dedicated `--- Project Memories ---` band
+  between identity/preference and retrieved chunks. The extract half
+  is still a manual / batch flow by design (`scripts/atocore_client.py
+  batch-extract` + `triage`). First live batch-extract run over 42
+  captured interactions produced 1 candidate (rule extractor is
+  conservative and keys on structural cues like `## Decision:`
+  headings that rarely appear in conversational LLM responses) —
+  extractor tuning is a known follow-up.

 ### Not Yet Complete In The Intended Sense

@@ -167,7 +178,9 @@ These remain intentionally deferred.

 - automatic write-back from OpenClaw into AtoCore
 - automatic memory promotion
-- reflection loop integration
+- ~~reflection loop integration~~ — baseline now in (capture→reinforce
+  auto, extract batch/manual). Extractor tuning and scheduled batch
+  extraction still open.
 - replacing OpenClaw's own memory system
 - live machine-DB sync between machines
 - full ontology / graph expansion before the current baseline is stable
@@ -137,7 +137,12 @@ P06:

 - automatic write-back from OpenClaw into AtoCore
 - automatic memory promotion
-- reflection loop integration
+- ~~reflection loop integration~~ — baseline now landed (2026-04-11):
+  Stop hook runs reinforce automatically, project memories are folded
+  into the context pack, batch-extract and triage CLIs exist. What
+  remains deferred: scheduled/automatic batch extraction and extractor
+  rule tuning (rule-based extractor produced 1 candidate from 42 real
+  captures — needs new cues for conversational LLM content).
 - replacing OpenClaw's own memory system
 - syncing the live machine DB between machines

@@ -159,6 +164,77 @@ The next batch is successful if:
- project ingestion remains controlled rather than noisy
- the canonical Dalidou instance stays stable

## Retrieval Quality Review — 2026-04-11

First sweep with real project-hinted queries on Dalidou. Used
`POST /context/build` against p04, p05, p06 with representative
questions and inspected `formatted_context`.

Findings:

- **Trusted Project State is surfacing correctly.** The DECISION and
  REQUIREMENT categories appear at the top of the pack and include
  the expected key facts (e.g. p04 "Option B conical-back mirror
  architecture"). This is the strongest signal in the pack today.
- **Chunk retrieval is relevant on-topic but broad.** Top chunks for
  the p04 architecture query are PDR intro, CAD assembly overview,
  and the index — all on the right project but none of them directly
  answer the "why was Option B chosen" question. The authoritative
  answer sits in Project State, not in the chunks.
- **Active memories are NOT reaching the pack.** The context builder
  surfaces Trusted Project State and retrieved chunks but does not
  include the 21 active project/knowledge memories. Reinforcement
  (Phase 9 Commit B) bumps memory confidence without the memory ever
  being read back into a prompt — the reflection loop has no outlet
  on the retrieval side. This is a design gap, not a bug: needs a
  decision on whether memories should feed into context assembly,
  and if so at what trust level (below project_state, above chunks).
- **Cross-project bleed is low.** The p04 query did pull one p05
  chunk (CGH_Design_Input_for_AOM) as the bottom hit but the top-4
  were all p04.

Proposed follow-ups (not yet scheduled):

1. ~~Decide whether memories should be folded into `formatted_context`
   and under what section header.~~ DONE 2026-04-11 (commits 8ea53f4,
   5913da5, 1161645). A `--- Project Memories ---` band now sits
   between identity/preference and retrieved chunks, gated on a
   canonical project hint to prevent cross-project bleed. Budget
   ratio 0.25 (tuned empirically — paragraph memories are ~400 chars
   and earlier 0.15 ratio starved the first entry by one char).
   Verified live: p04 architecture query surfaces the Option B memory.
2. Re-run the same three queries after any builder change and compare
   `formatted_context` diffs — still open, and is the natural entry
   point for the retrieval eval harness on the roadmap.

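The comparison step in follow-up 2 could be done with the standard library alone. The sketch below is an illustration rather than anything in this change set: `context_diff` is a hypothetical helper name, and it assumes the two `formatted_context` strings have already been fetched and saved by some other means.

```python
import difflib


def context_diff(before: str, after: str) -> str:
    """Return a unified diff between two formatted_context strings."""
    lines = difflib.unified_diff(
        before.splitlines(),
        after.splitlines(),
        fromfile="before",
        tofile="after",
        lineterm="",
    )
    # An empty string means the two packs are identical.
    return "\n".join(lines)
```

An empty result after a builder change means the pack did not move for that query, which is itself a useful signal.
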
## Reflection Loop Live Check — 2026-04-11

First real run of `batch-extract` across 42 captured Claude Code
interactions on Dalidou produced exactly **1 candidate**, and that
candidate was a synthetic test capture from earlier in the session
(rejected). Finding:

- The rule-based extractor in `src/atocore/memory/extractor.py` keys
  on explicit structural cues (decision headings like
  `## Decision: ...`, preference sentences, etc.). Real Claude Code
  responses are conversational and almost never contain those cues.
- This means the capture → extract half of the reflection loop is
  effectively inert against organic LLM sessions until either the
  rules are broadened (new cue families: "we chose X because...",
  "the selected approach is...", etc.) or an LLM-assisted extraction
  path is added alongside the rule-based one.
- Capture → reinforce is working correctly on live data (length-aware
  matcher verified on live paraphrase of a p04 memory).

Follow-up candidates (not yet scheduled):

1. Extractor rule expansion — add conversational-form rules so real
   session text has a chance of surfacing candidates.
2. LLM-assisted extractor as a separate rule family, guarded by
   confidence and always landing in `status=candidate` (never active).
3. Retrieval eval harness — diffable scorecard of
   `formatted_context` across a fixed question set per active project.

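A conversational-form rule of the kind named in follow-up 1 might look roughly like the sketch below. This is not the code in `src/atocore/memory/extractor.py`, which does not appear in this diff. `CUE_PATTERNS`, `find_candidates`, and the `"decision"` cue label are hypothetical names, and the only cue phrasings used are the two quoted in the finding above.

```python
import re

# Hypothetical cue table, not AtoCore's real rule set.
CUE_PATTERNS = [
    ("decision", re.compile(r"\bwe chose (?P<what>.+?) because (?P<why>[^.]+)", re.IGNORECASE)),
    ("decision", re.compile(r"\bthe selected approach is (?P<what>[^.]+)", re.IGNORECASE)),
]


def find_candidates(text: str) -> list[dict]:
    """Return one dict per cue match; every hit stays at status=candidate."""
    found = []
    for cue, pattern in CUE_PATTERNS:
        for match in pattern.finditer(text):
            found.append(
                {
                    "cue": cue,
                    "content": match.group(0).strip(),
                    "status": "candidate",  # a human still has to promote it
                }
            )
    return found
```

Patterns this loose will produce false positives, which is the reason to keep every match in the review queue rather than activating it.
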
## Long-Run Goal

The long-run target is:
@@ -340,6 +340,22 @@ def build_parser() -> argparse.ArgumentParser:
    p = sub.add_parser("reject")
    p.add_argument("memory_id")

    # batch-extract: fan out /interactions/{id}/extract?persist=true across
    # recent interactions. Idempotent — the extractor create_memory path
    # silently skips duplicates, so re-running is safe.
    p = sub.add_parser("batch-extract")
    p.add_argument("since", nargs="?", default="")
    p.add_argument("project", nargs="?", default="")
    p.add_argument("limit", nargs="?", type=int, default=100)
    p.add_argument("persist", nargs="?", default="true")

    # triage: interactive candidate review loop. Fetches the queue, shows
    # each candidate, accepts p/r/s (promote / reject / skip) / q (quit).
    p = sub.add_parser("triage")
    p.add_argument("memory_type", nargs="?", default="")
    p.add_argument("project", nargs="?", default="")
    p.add_argument("limit", nargs="?", type=int, default=50)

    return parser


@@ -474,10 +490,141 @@ def main() -> int:
                {},
            )
        )
    elif cmd == "batch-extract":
        print_json(run_batch_extract(args.since, args.project, args.limit, args.persist))
    elif cmd == "triage":
        return run_triage(args.memory_type, args.project, args.limit)
    else:
        return 1
    return 0


def run_batch_extract(since: str, project: str, limit: int, persist_flag: str) -> dict:
    """Fetch recent interactions and run the extractor against each one.

    Returns an aggregated summary. Safe to re-run: the server-side
    persist path catches ValueError on duplicates and the endpoint
    reports per-interaction candidate counts either way.
    """
    persist = persist_flag.lower() in {"1", "true", "yes", "y"}
    query_parts: list[str] = []
    if project:
        query_parts.append(f"project={urllib.parse.quote(project)}")
    if since:
        query_parts.append(f"since={urllib.parse.quote(since)}")
    query_parts.append(f"limit={int(limit)}")
    query = "?" + "&".join(query_parts)

    listing = request("GET", f"/interactions{query}")
    interactions = listing.get("interactions", []) if isinstance(listing, dict) else []

    processed = 0
    total_candidates = 0
    total_persisted = 0
    errors: list[dict] = []
    per_interaction: list[dict] = []

    for item in interactions:
        iid = item.get("id") or ""
        if not iid:
            continue
        try:
            result = request(
                "POST",
                f"/interactions/{urllib.parse.quote(iid, safe='')}/extract",
                {"persist": persist},
            )
        except Exception as exc:  # pragma: no cover - network errors land here
            errors.append({"interaction_id": iid, "error": str(exc)})
            continue
        processed += 1
        count = int(result.get("candidate_count", 0) or 0)
        persisted_ids = result.get("persisted_ids") or []
        total_candidates += count
        total_persisted += len(persisted_ids)
        if count:
            per_interaction.append(
                {
                    "interaction_id": iid,
                    "candidate_count": count,
                    "persisted_count": len(persisted_ids),
                    "project": item.get("project") or "",
                }
            )

    return {
        "processed": processed,
        "total_candidates": total_candidates,
        "total_persisted": total_persisted,
        "persist": persist,
        "errors": errors,
        "interactions_with_candidates": per_interaction,
    }


def run_triage(memory_type: str, project: str, limit: int) -> int:
    """Interactive review of candidate memories.

    Loads the queue once, walks through entries, prompts for
    (p)romote / (r)eject / (s)kip / (q)uit. Stateless between runs —
    re-running picks up whatever is still status=candidate.
    """
    query_parts = ["status=candidate"]
    if memory_type:
        query_parts.append(f"memory_type={urllib.parse.quote(memory_type)}")
    if project:
        query_parts.append(f"project={urllib.parse.quote(project)}")
    query_parts.append(f"limit={int(limit)}")
    listing = request("GET", "/memory?" + "&".join(query_parts))
    memories = listing.get("memories", []) if isinstance(listing, dict) else []

    if not memories:
        print_json({"status": "empty_queue", "count": 0})
        return 0

    promoted = 0
    rejected = 0
    skipped = 0
    stopped_early = False

    print(f"Triage queue: {len(memories)} candidate(s)\n", file=sys.stderr)
    for idx, mem in enumerate(memories, 1):
        mid = mem.get("id", "")
        print(f"[{idx}/{len(memories)}] {mem.get('memory_type','?')} project={mem.get('project','')} conf={mem.get('confidence','?')}", file=sys.stderr)
        print(f"  id: {mid}", file=sys.stderr)
        print(f"  {mem.get('content','')}", file=sys.stderr)
        try:
            choice = input("  (p)romote / (r)eject / (s)kip / (q)uit > ").strip().lower()
        except EOFError:
            stopped_early = True
            break
        if choice in {"q", "quit"}:
            stopped_early = True
            break
        if choice in {"p", "promote"}:
            request("POST", f"/memory/{urllib.parse.quote(mid, safe='')}/promote", {})
            promoted += 1
            print("  -> promoted", file=sys.stderr)
        elif choice in {"r", "reject"}:
            request("POST", f"/memory/{urllib.parse.quote(mid, safe='')}/reject", {})
            rejected += 1
            print("  -> rejected", file=sys.stderr)
        else:
            skipped += 1
            print("  -> skipped", file=sys.stderr)

    print_json(
        {
            "reviewed": promoted + rejected + skipped,
            "promoted": promoted,
            "rejected": rejected,
            "skipped": skipped,
            "stopped_early": stopped_early,
            # The entry shown when the user quits was never reviewed, so it still counts as remaining.
            "remaining_in_queue": len(memories) - (promoted + rejected + skipped),
        }
    )
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
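The query-string assembly in `run_batch_extract` can also be written with `urllib.parse.urlencode`. The sketch below is an alternative formulation, not what the commit does; `parse_persist_flag` and `build_interactions_query` are hypothetical helper names. Passing `quote_via=urllib.parse.quote` with `safe="/"` is meant to reproduce the escaping of the f-string version, which calls `urllib.parse.quote` with its defaults.

```python
import urllib.parse


def parse_persist_flag(flag: str) -> bool:
    """Accept the same truthy strings as run_batch_extract."""
    return flag.lower() in {"1", "true", "yes", "y"}


def build_interactions_query(since: str, project: str, limit: int) -> str:
    """Build the GET /interactions query string, skipping empty filters."""
    params: dict[str, object] = {}
    if project:
        params["project"] = project
    if since:
        params["since"] = since
    params["limit"] = int(limit)
    # quote (not the default quote_plus) keeps spaces as %20, and safe="/"
    # leaves slashes unescaped, as urllib.parse.quote does by default.
    return "?" + urllib.parse.urlencode(params, quote_via=urllib.parse.quote, safe="/")
```

For example, `build_interactions_query("", "p04-gigabit", 100)` yields `?project=p04-gigabit&limit=100`.
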
194
scripts/retrieval_eval.py
Normal file
@@ -0,0 +1,194 @@
"""Retrieval quality eval harness.

Runs a fixed set of project-hinted questions against
``POST /context/build`` on a live AtoCore instance and scores the
resulting ``formatted_context`` against per-question expectations.
The goal is a diffable scorecard that tells you, run-to-run,
whether a retrieval / builder / ingestion change moved the needle.

Design notes
------------
- Fixtures live in ``scripts/retrieval_eval_fixtures.json`` so new
  questions can be added without touching Python. Each fixture
  names the project, the prompt, and a checklist of substrings that
  MUST appear in ``formatted_context`` (``expect_present``) and
  substrings that MUST NOT appear (``expect_absent``). The absent
  list catches cross-project bleed and stale content.
- The checklist is deliberately substring-based (not regex, not
  embedding-similarity) so a failure is always a trivially
  reproducible "this string is not in that string". Richer scoring
  can come later once we know the harness is useful.
- The harness is external to the app runtime and talks to AtoCore
  over HTTP, so it works against dev, staging, or prod. It follows
  the same environment-variable contract as ``atocore_client.py``
  (``ATOCORE_BASE_URL``, ``ATOCORE_TIMEOUT_SECONDS``).
- Exit code 0 on all-pass, 1 on any fixture failure. Intended for
  manual runs today; a future cron / CI hook can consume the
  JSON output via ``--json``.

Usage
-----

    python scripts/retrieval_eval.py                  # human-readable report
    python scripts/retrieval_eval.py --json           # machine-readable
    python scripts/retrieval_eval.py --fixtures path/to/custom.json
"""

from __future__ import annotations

import argparse
import json
import os
import sys
import urllib.error
import urllib.parse
import urllib.request
from dataclasses import dataclass, field
from pathlib import Path

DEFAULT_BASE_URL = os.environ.get("ATOCORE_BASE_URL", "http://dalidou:8100")
DEFAULT_TIMEOUT = int(os.environ.get("ATOCORE_TIMEOUT_SECONDS", "30"))
DEFAULT_BUDGET = 3000
DEFAULT_FIXTURES = Path(__file__).parent / "retrieval_eval_fixtures.json"


@dataclass
class Fixture:
    name: str
    project: str
    prompt: str
    budget: int = DEFAULT_BUDGET
    expect_present: list[str] = field(default_factory=list)
    expect_absent: list[str] = field(default_factory=list)
    notes: str = ""


@dataclass
class FixtureResult:
    fixture: Fixture
    ok: bool
    missing_present: list[str]
    unexpected_absent: list[str]
    total_chars: int
    error: str = ""


def load_fixtures(path: Path) -> list[Fixture]:
    data = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(data, list):
        raise ValueError(f"{path} must contain a JSON array of fixtures")
    fixtures: list[Fixture] = []
    for i, raw in enumerate(data):
        if not isinstance(raw, dict):
            raise ValueError(f"fixture {i} is not an object")
        fixtures.append(
            Fixture(
                name=raw["name"],
                project=raw.get("project", ""),
                prompt=raw["prompt"],
                budget=int(raw.get("budget", DEFAULT_BUDGET)),
                expect_present=list(raw.get("expect_present", [])),
                expect_absent=list(raw.get("expect_absent", [])),
                notes=raw.get("notes", ""),
            )
        )
    return fixtures


def run_fixture(fixture: Fixture, base_url: str, timeout: int) -> FixtureResult:
    payload = {
        "prompt": fixture.prompt,
        "project": fixture.project or None,
        "budget": fixture.budget,
    }
    req = urllib.request.Request(
        url=f"{base_url}/context/build",
        method="POST",
        headers={"Content-Type": "application/json"},
        data=json.dumps(payload).encode("utf-8"),
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            body = json.loads(resp.read().decode("utf-8"))
    except urllib.error.URLError as exc:
        return FixtureResult(
            fixture=fixture,
            ok=False,
            missing_present=list(fixture.expect_present),
            unexpected_absent=[],
            total_chars=0,
            error=f"http_error: {exc}",
        )

    formatted = body.get("formatted_context") or ""
    missing = [s for s in fixture.expect_present if s not in formatted]
    unexpected = [s for s in fixture.expect_absent if s in formatted]
    return FixtureResult(
        fixture=fixture,
        ok=not missing and not unexpected,
        missing_present=missing,
        unexpected_absent=unexpected,
        total_chars=len(formatted),
    )


def print_human_report(results: list[FixtureResult]) -> None:
    total = len(results)
    passed = sum(1 for r in results if r.ok)
    print(f"Retrieval eval: {passed}/{total} fixtures passed")
    print()
    for r in results:
        marker = "PASS" if r.ok else "FAIL"
        print(f"[{marker}] {r.fixture.name} project={r.fixture.project} chars={r.total_chars}")
        if r.error:
            print(f"  error: {r.error}")
        for miss in r.missing_present:
            print(f"  missing expected: {miss!r}")
        for bleed in r.unexpected_absent:
            print(f"  unexpected present: {bleed!r}")
        if r.fixture.notes and not r.ok:
            print(f"  notes: {r.fixture.notes}")


def print_json_report(results: list[FixtureResult]) -> None:
    payload = {
        "total": len(results),
        "passed": sum(1 for r in results if r.ok),
        "fixtures": [
            {
                "name": r.fixture.name,
                "project": r.fixture.project,
                "ok": r.ok,
                "total_chars": r.total_chars,
                "missing_present": r.missing_present,
                "unexpected_absent": r.unexpected_absent,
                "error": r.error,
            }
            for r in results
        ],
    }
    json.dump(payload, sys.stdout, indent=2)
    sys.stdout.write("\n")


def main() -> int:
    parser = argparse.ArgumentParser(description="AtoCore retrieval quality eval harness")
    parser.add_argument("--base-url", default=DEFAULT_BASE_URL)
    parser.add_argument("--timeout", type=int, default=DEFAULT_TIMEOUT)
    parser.add_argument("--fixtures", type=Path, default=DEFAULT_FIXTURES)
    parser.add_argument("--json", action="store_true", help="emit machine-readable JSON")
    args = parser.parse_args()

    fixtures = load_fixtures(args.fixtures)
    results = [run_fixture(f, args.base_url, args.timeout) for f in fixtures]

    if args.json:
        print_json_report(results)
    else:
        print_human_report(results)

    return 0 if all(r.ok for r in results) else 1


if __name__ == "__main__":
    raise SystemExit(main())
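The scoring step in `run_fixture` can be exercised without a live AtoCore instance. The sketch below lifts the two substring checks into a standalone function; `score_context` is a hypothetical name, and the sample context string is made up purely to exercise the checks.

```python
def score_context(formatted: str, expect_present: list[str], expect_absent: list[str]) -> dict:
    """Apply the same substring checks as run_fixture to an in-memory string."""
    missing = [s for s in expect_present if s not in formatted]
    unexpected = [s for s in expect_absent if s in formatted]
    return {
        "ok": not missing and not unexpected,
        "missing_present": missing,
        "unexpected_absent": unexpected,
    }


# Made-up context text, not output from a real /context/build call.
sample = "--- Trusted Project State ---\nselected architecture: Option B conical-back mirror\n"
result = score_context(sample, ["Option B", "conical"], ["folded-beam"])
```

Python's `in` operator on strings is case-sensitive, so an expectation of `"option b"` would count as missing against this sample.
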
86
scripts/retrieval_eval_fixtures.json
Normal file
@@ -0,0 +1,86 @@
[
  {
    "name": "p04-architecture-decision",
    "project": "p04-gigabit",
    "prompt": "what mirror architecture was selected for GigaBIT M1 and why",
    "expect_present": [
      "--- Trusted Project State ---",
      "Option B",
      "conical",
      "--- Project Memories ---"
    ],
    "expect_absent": [
      "p06-polisher",
      "folded-beam"
    ],
    "notes": "Canonical p04 decision — should surface both Trusted Project State (selected_mirror_architecture) and the project-memory band with the Option B memory"
  },
  {
    "name": "p04-constraints",
    "project": "p04-gigabit",
    "prompt": "what are the key GigaBIT M1 program constraints",
    "expect_present": [
      "--- Trusted Project State ---",
      "Zerodur",
      "1.2"
    ],
    "expect_absent": [
      "polisher suite"
    ],
    "notes": "Key constraints are in Trusted Project State (key_constraints) and in the mission-framing memory"
  },
  {
    "name": "p05-configuration",
    "project": "p05-interferometer",
    "prompt": "what is the selected interferometer configuration",
    "expect_present": [
      "folded-beam",
      "CGH"
    ],
    "expect_absent": [
      "Option B",
      "conical back",
      "polisher suite"
    ],
    "notes": "P05 architecture memory covers folded-beam + CGH. GigaBIT M1 is the mirror under test and legitimately appears in p05 source docs (the interferometer measures it), so we only flag genuinely p04-only decisions like the mirror architecture choice."
  },
  {
    "name": "p05-vendor-signal",
    "project": "p05-interferometer",
    "prompt": "what is the current vendor signal for the interferometer procurement",
    "expect_present": [
      "4D",
      "Zygo"
    ],
    "expect_absent": [
      "polisher"
    ],
    "notes": "Vendor memory mentions 4D as strongest technical candidate and Zygo Verifire SV as value path"
  },
  {
    "name": "p06-suite-split",
    "project": "p06-polisher",
    "prompt": "how is the polisher software suite split across layers",
    "expect_present": [
      "polisher-sim",
      "polisher-post",
      "polisher-control"
    ],
    "expect_absent": [
      "GigaBIT"
    ],
    "notes": "The three-layer split is in multiple p06 memories; check all three names surface together"
  },
  {
    "name": "p06-control-rule",
    "project": "p06-polisher",
    "prompt": "what is the polisher control design rule",
    "expect_present": [
      "interlocks"
    ],
    "expect_absent": [
      "interferometer"
    ],
    "notes": "Control design rule memory mentions interlocks and state transitions"
  }
]
@@ -49,6 +49,7 @@ from atocore.memory.service import (
)
from atocore.observability.logger import get_logger
from atocore.ops.backup import (
    cleanup_old_backups,
    create_runtime_backup,
    list_runtime_backups,
    validate_backup,
@@ -511,6 +512,7 @@ class InteractionRecordRequest(BaseModel):
    chunks_used: list[str] = []
    context_pack: dict | None = None
    reinforce: bool = True
    extract: bool = False


@router.post("/interactions")
@@ -536,6 +538,7 @@ def api_record_interaction(req: InteractionRecordRequest) -> dict:
            chunks_used=req.chunks_used,
            context_pack=req.context_pack,
            reinforce=req.reinforce,
            extract=req.extract,
        )
    except ValueError as e:
        raise HTTPException(status_code=400, detail=str(e))
@@ -731,6 +734,25 @@ def api_list_backups() -> dict:
    }


class BackupCleanupRequest(BaseModel):
    confirm: bool = False


@router.post("/admin/backup/cleanup")
def api_cleanup_backups(req: BackupCleanupRequest | None = None) -> dict:
    """Apply retention policy to old backup snapshots.

    Dry-run by default. Pass ``confirm: true`` to actually delete.
    Retention: last 7 daily, last 4 weekly (Sundays), last 6 monthly (1st).
    """
    payload = req or BackupCleanupRequest()
    try:
        return cleanup_old_backups(confirm=payload.confirm)
    except Exception as e:
        log.error("admin_cleanup_failed", error=str(e))
        raise HTTPException(status_code=500, detail=f"Cleanup failed: {e}")


@router.get("/admin/backup/{stamp}/validate")
def api_validate_backup(stamp: str) -> dict:
    """Validate that a previously created backup is structurally usable."""

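The body of `cleanup_old_backups` is not part of this diff, so the following is only one plausible reading of the retention rule in the docstring. It assumes "last 7 daily" means the seven newest snapshot dates, "weekly" means the four newest snapshots taken on a Sunday, and "monthly" means the six newest taken on the 1st. `select_snapshots_to_keep` and `plan_cleanup` are hypothetical names, and the sketch only plans deletions; it never touches the filesystem.

```python
from datetime import date, timedelta


def select_snapshots_to_keep(stamps: list[date]) -> set[date]:
    """Return the snapshot dates retained under a 7 daily / 4 weekly / 6 monthly rule."""
    newest_first = sorted(set(stamps), reverse=True)
    daily = newest_first[:7]
    weekly = [d for d in newest_first if d.weekday() == 6][:4]  # Monday=0, Sunday=6
    monthly = [d for d in newest_first if d.day == 1][:6]
    return set(daily) | set(weekly) | set(monthly)


def plan_cleanup(stamps: list[date], confirm: bool = False) -> dict:
    """Dry-run by default: report what would be removed, delete nothing here."""
    keep = select_snapshots_to_keep(stamps)
    doomed = sorted(d for d in set(stamps) if d not in keep)
    return {"dry_run": not confirm, "kept": sorted(keep), "delete": doomed}


# Sixty consecutive daily snapshots ending on 2026-04-11.
stamps = [date(2026, 4, 11) - timedelta(days=i) for i in range(60)]
plan = plan_cleanup(stamps)
```

A snapshot can satisfy more than one rule (a Sunday inside the daily window, for instance), which is why the three sets are merged rather than counted separately.
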
@@ -30,6 +30,12 @@ SYSTEM_PREFIX = (
# identity: 5%, preferences: 5%, project state: 20%, retrieval: 60%+
PROJECT_STATE_BUDGET_RATIO = 0.20
MEMORY_BUDGET_RATIO = 0.10  # 5% identity + 5% preference
# Project-scoped memories (project/knowledge/episodic) are the outlet
# for the Phase 9 reflection loop on the retrieval side. Budget sits
# between identity/preference and retrieved chunks so a reinforced
# memory can actually reach the model.
PROJECT_MEMORY_BUDGET_RATIO = 0.25
PROJECT_MEMORY_TYPES = ["project", "knowledge", "episodic"]

# Last built context pack for debug inspection
_last_context_pack: "ContextPack | None" = None
@@ -51,6 +57,8 @@ class ContextPack:
    project_state_chars: int = 0
    memory_text: str = ""
    memory_chars: int = 0
    project_memory_text: str = ""
    project_memory_chars: int = 0
    total_chars: int = 0
    budget: int = 0
    budget_remaining: int = 0
@@ -107,10 +115,32 @@ def build_context(
     memory_text, memory_chars = get_memories_for_context(
         memory_types=["identity", "preference"],
         budget=memory_budget,
+        query=user_prompt,
     )

+    # 2b. Get project-scoped memories (third precedence). Only
+    # populated when a canonical project is in scope — cross-project
+    # memory bleed would rot the pack. Active-only filtering is
+    # handled by the shared min_confidence=0.5 gate inside
+    # get_memories_for_context.
+    project_memory_text = ""
+    project_memory_chars = 0
+    if canonical_project:
+        project_memory_budget = min(
+            int(budget * PROJECT_MEMORY_BUDGET_RATIO),
+            max(budget - project_state_chars - memory_chars, 0),
+        )
+        project_memory_text, project_memory_chars = get_memories_for_context(
+            memory_types=PROJECT_MEMORY_TYPES,
+            project=canonical_project,
+            budget=project_memory_budget,
+            header="--- Project Memories ---",
+            footer="--- End Project Memories ---",
+            query=user_prompt,
+        )
+
     # 3. Calculate remaining budget for retrieval
-    retrieval_budget = budget - project_state_chars - memory_chars
+    retrieval_budget = budget - project_state_chars - memory_chars - project_memory_chars

     # 4. Retrieve candidates
     candidates = (
@@ -130,11 +160,14 @@ def build_context(
|
||||
selected = _select_within_budget(scored, max(retrieval_budget, 0))
|
||||
|
||||
# 7. Format full context
|
||||
formatted = _format_full_context(project_state_text, memory_text, selected)
|
||||
formatted = _format_full_context(
|
||||
project_state_text, memory_text, project_memory_text, selected
|
||||
)
|
||||
if len(formatted) > budget:
|
||||
formatted, selected = _trim_context_to_budget(
|
||||
project_state_text,
|
||||
memory_text,
|
||||
project_memory_text,
|
||||
selected,
|
||||
budget,
|
||||
)
|
||||
@@ -144,6 +177,7 @@ def build_context(
|
||||
|
||||
project_state_chars = len(project_state_text)
|
||||
memory_chars = len(memory_text)
|
||||
project_memory_chars = len(project_memory_text)
|
||||
retrieval_chars = sum(c.char_count for c in selected)
|
||||
total_chars = len(formatted)
|
||||
duration_ms = int((time.time() - start) * 1000)
|
||||
@@ -154,6 +188,8 @@ def build_context(
|
||||
project_state_chars=project_state_chars,
|
||||
memory_text=memory_text,
|
||||
memory_chars=memory_chars,
|
||||
project_memory_text=project_memory_text,
|
||||
project_memory_chars=project_memory_chars,
|
||||
total_chars=total_chars,
|
||||
budget=budget,
|
||||
budget_remaining=budget - total_chars,
|
||||
@@ -171,6 +207,7 @@ def build_context(
|
||||
chunks_used=len(selected),
|
||||
project_state_chars=project_state_chars,
|
||||
memory_chars=memory_chars,
|
||||
project_memory_chars=project_memory_chars,
|
||||
retrieval_chars=retrieval_chars,
|
||||
total_chars=total_chars,
|
||||
budget_remaining=budget - total_chars,
|
||||
@@ -250,6 +287,7 @@ def _select_within_budget(
|
||||
def _format_full_context(
|
||||
project_state_text: str,
|
||||
memory_text: str,
|
||||
project_memory_text: str,
|
||||
chunks: list[ContextChunk],
|
||||
) -> str:
|
||||
"""Format project state + memories + retrieved chunks into full context block."""
|
||||
@@ -265,7 +303,12 @@ def _format_full_context(
|
||||
parts.append(memory_text)
|
||||
parts.append("")
|
||||
|
||||
# 3. Retrieved chunks (lowest trust)
|
||||
# 3. Project-scoped memories (third trust level)
|
||||
if project_memory_text:
|
||||
parts.append(project_memory_text)
|
||||
parts.append("")
|
||||
|
||||
# 4. Retrieved chunks (lowest trust)
|
||||
if chunks:
|
||||
parts.append("--- AtoCore Retrieved Context ---")
|
||||
if project_state_text:
|
||||
@@ -277,7 +320,7 @@ def _format_full_context(
|
||||
parts.append(chunk.content)
|
||||
parts.append("")
|
||||
parts.append("--- End Context ---")
|
||||
elif not project_state_text and not memory_text:
|
||||
elif not project_state_text and not memory_text and not project_memory_text:
|
||||
parts.append("--- AtoCore Context ---\nNo relevant context found.\n--- End Context ---")
|
||||
|
||||
return "\n".join(parts)
|
||||
@@ -299,6 +342,7 @@ def _pack_to_dict(pack: ContextPack) -> dict:
|
||||
"project_hint": pack.project_hint,
|
||||
"project_state_chars": pack.project_state_chars,
|
||||
"memory_chars": pack.memory_chars,
|
||||
"project_memory_chars": pack.project_memory_chars,
|
||||
"chunks_used": len(pack.chunks_used),
|
||||
"total_chars": pack.total_chars,
|
||||
"budget": pack.budget,
|
||||
@@ -306,6 +350,7 @@ def _pack_to_dict(pack: ContextPack) -> dict:
|
||||
"duration_ms": pack.duration_ms,
|
||||
"has_project_state": bool(pack.project_state_text),
|
||||
"has_memories": bool(pack.memory_text),
|
||||
"has_project_memories": bool(pack.project_memory_text),
|
||||
"chunks": [
|
||||
{
|
||||
"source_file": c.source_file,
|
||||
@@ -335,26 +380,45 @@ def _truncate_text_block(text: str, budget: int) -> tuple[str, int]:
|
||||
def _trim_context_to_budget(
|
||||
project_state_text: str,
|
||||
memory_text: str,
|
||||
project_memory_text: str,
|
||||
chunks: list[ContextChunk],
|
||||
budget: int,
|
||||
) -> tuple[str, list[ContextChunk]]:
|
||||
"""Trim retrieval first, then memory, then project state until formatted context fits."""
|
||||
"""Trim retrieval → project memories → identity/preference → project state."""
|
||||
kept_chunks = list(chunks)
|
||||
formatted = _format_full_context(project_state_text, memory_text, kept_chunks)
|
||||
formatted = _format_full_context(
|
||||
project_state_text, memory_text, project_memory_text, kept_chunks
|
||||
)
|
||||
while len(formatted) > budget and kept_chunks:
|
||||
kept_chunks.pop()
|
||||
formatted = _format_full_context(project_state_text, memory_text, kept_chunks)
|
||||
formatted = _format_full_context(
|
||||
project_state_text, memory_text, project_memory_text, kept_chunks
|
||||
)
|
||||
|
||||
if len(formatted) <= budget:
|
||||
return formatted, kept_chunks
|
||||
|
||||
# Truncate project memories next (they were the most recently added
|
||||
# tier and carry less trust than identity/preference).
|
||||
project_memory_text, _ = _truncate_text_block(
|
||||
project_memory_text,
|
||||
max(budget - len(project_state_text) - len(memory_text), 0),
|
||||
)
|
||||
formatted = _format_full_context(
|
||||
project_state_text, memory_text, project_memory_text, kept_chunks
|
||||
)
|
||||
if len(formatted) <= budget:
|
||||
return formatted, kept_chunks
|
||||
|
||||
memory_text, _ = _truncate_text_block(memory_text, max(budget - len(project_state_text), 0))
|
||||
formatted = _format_full_context(project_state_text, memory_text, kept_chunks)
|
||||
formatted = _format_full_context(
|
||||
project_state_text, memory_text, project_memory_text, kept_chunks
|
||||
)
|
||||
if len(formatted) <= budget:
|
||||
return formatted, kept_chunks
|
||||
|
||||
project_state_text, _ = _truncate_text_block(project_state_text, budget)
|
||||
formatted = _format_full_context(project_state_text, "", [])
|
||||
formatted = _format_full_context(project_state_text, "", "", [])
|
||||
if len(formatted) > budget:
|
||||
formatted, _ = _truncate_text_block(formatted, budget)
|
||||
return formatted, []
|
||||
|
||||
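The budget arithmetic that `build_context` applies to the project-memory band can be sketched in isolation. This is a minimal illustration, not the project's code: the helper names `project_memory_budget` and `retrieval_budget` are hypothetical, and only the ratio constant and the `min`/`max` expressions come from the diff. The diff clamps the retrieval budget at the call site (`max(retrieval_budget, 0)`); the sketch folds that clamp into the helper.

```python
PROJECT_MEMORY_BUDGET_RATIO = 0.25  # value taken from the diff


def project_memory_budget(budget: int, project_state_chars: int, memory_chars: int) -> int:
    """Hypothetical helper: cap project memories at a fixed share of the
    pack, or at whatever the higher-trust tiers left over, whichever is smaller."""
    ratio_cap = int(budget * PROJECT_MEMORY_BUDGET_RATIO)
    leftover = max(budget - project_state_chars - memory_chars, 0)
    return min(ratio_cap, leftover)


def retrieval_budget(
    budget: int,
    project_state_chars: int,
    memory_chars: int,
    project_memory_chars: int,
) -> int:
    """Hypothetical helper: characters left for retrieved chunks after all
    three higher-trust tiers have been charged against the budget."""
    remaining = budget - project_state_chars - memory_chars - project_memory_chars
    return max(remaining, 0)


if __name__ == "__main__":
    # Roomy pack: the 25% ratio is the binding limit.
    print(project_memory_budget(3000, 500, 200))
    # Tight pack: the leftover after state + identity/preference is the limit.
    print(project_memory_budget(1000, 700, 200))
```

With a 3000-character budget the ratio cap binds; once project state and identity/preference memories already consume most of the budget, the leftover term takes over and can reach zero, in which case `get_memories_for_context` returns an empty block.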
@@ -63,6 +63,7 @@ def record_interaction(
|
||||
chunks_used: list[str] | None = None,
|
||||
context_pack: dict | None = None,
|
||||
reinforce: bool = True,
|
||||
extract: bool = False,
|
||||
) -> Interaction:
|
||||
"""Persist a single interaction to the audit trail.
|
||||
|
||||
@@ -163,6 +164,30 @@ def record_interaction(
|
||||
error=str(exc),
|
||||
)
|
||||
|
||||
if extract and (response or response_summary):
|
||||
try:
|
||||
from atocore.memory.extractor import extract_candidates_from_interaction
|
||||
from atocore.memory.service import create_memory
|
||||
|
||||
candidates = extract_candidates_from_interaction(interaction)
|
||||
for candidate in candidates:
|
||||
try:
|
||||
create_memory(
|
||||
memory_type=candidate.memory_type,
|
||||
content=candidate.content,
|
||||
project=candidate.project,
|
||||
confidence=candidate.confidence,
|
||||
status="candidate",
|
||||
)
|
||||
except ValueError:
|
||||
pass # duplicate or validation error — skip silently
|
||||
except Exception as exc: # pragma: no cover - extraction must never block capture
|
||||
log.error(
|
||||
"extraction_failed_on_capture",
|
||||
interaction_id=interaction_id,
|
||||
error=str(exc),
|
||||
)
|
||||
|
||||
return interaction
|
||||
|
||||
|
||||
|
||||
@@ -51,6 +51,15 @@ _STOP_WORDS: frozenset[str] = frozenset({
|
||||
})
|
||||
_MATCH_THRESHOLD = 0.70
|
||||
|
||||
# Long memories can't realistically hit 70% overlap through organic
|
||||
# paraphrase — a 40-token memory would need 28 stemmed tokens echoed
|
||||
# verbatim. Above this token count the matcher switches to an absolute
|
||||
# overlap floor plus a softer fraction floor so paragraph-length memories
|
||||
# still reinforce when the response genuinely uses them.
|
||||
_LONG_MEMORY_TOKEN_COUNT = 15
|
||||
_LONG_MODE_MIN_OVERLAP = 12
|
||||
_LONG_MODE_MIN_FRACTION = 0.35
|
||||
|
||||
DEFAULT_CONFIDENCE_DELTA = 0.02
|
||||
|
||||
|
||||
@@ -171,26 +180,47 @@ def _stem(word: str) -> str:
|
||||
def _tokenize(text: str) -> set[str]:
|
||||
"""Split normalized text into a stemmed token set.
|
||||
|
||||
Strips punctuation, drops words shorter than 3 chars and stop words.
|
||||
Strips punctuation, drops words shorter than 3 chars and stop
|
||||
words. Hyphenated and slash-separated identifiers
|
||||
(``polisher-control``, ``twyman-green``, ``2-projects/interferometer``)
|
||||
produce both the full form AND each sub-token, so a query for
|
||||
"polisher control" can match a memory that wrote
|
||||
"polisher-control" without forcing callers to guess the exact
|
||||
hyphenation.
|
||||
"""
|
||||
tokens: set[str] = set()
|
||||
for raw in text.split():
|
||||
# Strip leading/trailing punctuation (commas, periods, quotes, etc.)
|
||||
word = raw.strip(".,;:!?\"'()[]{}-/")
|
||||
if len(word) < 3:
|
||||
if not word:
|
||||
continue
|
||||
if word in _STOP_WORDS:
|
||||
continue
|
||||
tokens.add(_stem(word))
|
||||
_add_token(tokens, word)
|
||||
# Also add sub-tokens split on internal '-' or '/' so
|
||||
# hyphenated identifiers match queries that don't hyphenate.
|
||||
if "-" in word or "/" in word:
|
||||
for sub in re.split(r"[-/]+", word):
|
||||
_add_token(tokens, sub)
|
||||
return tokens
|
||||
|
||||
|
||||
def _add_token(tokens: set[str], word: str) -> None:
|
||||
if len(word) < 3:
|
||||
return
|
||||
if word in _STOP_WORDS:
|
||||
return
|
||||
tokens.add(_stem(word))
|
||||
|
||||
|
||||
def _memory_matches(memory_content: str, normalized_response: str) -> bool:
|
||||
"""Return True if enough of the memory's tokens appear in the response.
|
||||
|
||||
Uses token-overlap: tokenize both sides (lowercase, stem, drop stop
|
||||
words), then check whether >= 70 % of the memory's content tokens
|
||||
appear in the response token set.
|
||||
Dual-mode token overlap:
|
||||
- Short memories (<= _LONG_MEMORY_TOKEN_COUNT stems): require
|
||||
>= 70 % of memory tokens echoed.
|
||||
- Long memories (paragraphs): require an absolute floor of
|
||||
_LONG_MODE_MIN_OVERLAP distinct stems echoed AND a softer
|
||||
fraction of _LONG_MODE_MIN_FRACTION, so organic paraphrase
|
||||
of a real project memory can reinforce without the response
|
||||
quoting the paragraph verbatim.
|
||||
"""
|
||||
if not memory_content:
|
||||
return False
|
||||
@@ -202,4 +232,10 @@ def _memory_matches(memory_content: str, normalized_response: str) -> bool:
|
||||
return False
|
||||
response_tokens = _tokenize(normalized_response)
|
||||
overlap = memory_tokens & response_tokens
|
||||
return len(overlap) / len(memory_tokens) >= _MATCH_THRESHOLD
|
||||
fraction = len(overlap) / len(memory_tokens)
|
||||
if len(memory_tokens) <= _LONG_MEMORY_TOKEN_COUNT:
|
||||
return fraction >= _MATCH_THRESHOLD
|
||||
return (
|
||||
len(overlap) >= _LONG_MODE_MIN_OVERLAP
|
||||
and fraction >= _LONG_MODE_MIN_FRACTION
|
||||
)
|
||||
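The dual-mode decision in `_memory_matches` can be exercised on plain token sets. The sketch below copies the four constants from the diff; the function name `tokens_match` is hypothetical, and it takes pre-tokenized sets so that it does not depend on the project's normalizer or stemmer.

```python
# Constants as they appear in the diff.
_MATCH_THRESHOLD = 0.70
_LONG_MEMORY_TOKEN_COUNT = 15
_LONG_MODE_MIN_OVERLAP = 12
_LONG_MODE_MIN_FRACTION = 0.35


def tokens_match(memory_tokens: set[str], response_tokens: set[str]) -> bool:
    """Hypothetical stand-in for the matching decision on token sets."""
    if not memory_tokens:
        return False
    overlap = memory_tokens & response_tokens
    fraction = len(overlap) / len(memory_tokens)
    # Short memories: the original 70% fraction rule.
    if len(memory_tokens) <= _LONG_MEMORY_TOKEN_COUNT:
        return fraction >= _MATCH_THRESHOLD
    # Long memories: an absolute overlap floor AND a softer fraction floor.
    return (
        len(overlap) >= _LONG_MODE_MIN_OVERLAP
        and fraction >= _LONG_MODE_MIN_FRACTION
    )


if __name__ == "__main__":
    short = {f"s{i}" for i in range(10)}
    long_memory = {f"t{i}" for i in range(40)}
    print(tokens_match(short, {f"s{i}" for i in range(7)}))         # 7 of 10 echoed
    print(tokens_match(long_memory, {f"t{i}" for i in range(16)}))  # 16 of 40 echoed
```

Both floors matter in long mode: a 40-token memory with 12 echoed tokens meets the absolute floor but not the 35% fraction, and a 20-token memory with 10 echoed tokens meets the fraction but not the absolute floor.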
|
||||
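The sub-token expansion added to `_tokenize` is easiest to see on a small input. The following is a sketch under stated assumptions: the real `_STOP_WORDS` set and `_stem` function are not shown in this diff, so `STOP_WORDS` is a tiny made-up set and `stem` is an identity function, and the explicit `lower()` call stands in for whatever `_normalize` does upstream.

```python
import re

# Assumption: placeholder stop words; the real _STOP_WORDS is not shown in the diff.
STOP_WORDS = frozenset({"the", "and", "for", "that", "with"})


def stem(word: str) -> str:
    """Assumption: identity stand-in for the project's stemmer."""
    return word


def add_token(tokens: set[str], word: str) -> None:
    """Add a stemmed token unless it is too short or a stop word."""
    if len(word) < 3 or word in STOP_WORDS:
        return
    tokens.add(stem(word))


def tokenize(text: str) -> set[str]:
    """Keep the full identifier and also each part split on '-' or '/'."""
    tokens: set[str] = set()
    for raw in text.lower().split():
        # Strip leading/trailing punctuation, as in the diff.
        word = raw.strip(".,;:!?\"'()[]{}-/")
        if not word:
            continue
        add_token(tokens, word)
        if "-" in word or "/" in word:
            for sub in re.split(r"[-/]+", word):
                add_token(tokens, sub)
    return tokens


if __name__ == "__main__":
    print(sorted(tokenize("polisher-control")))
    print(sorted(tokenize("polisher control") & tokenize("polisher-control")))
```

Because the hyphenated form yields both the full identifier and its parts, a query written as two words still intersects a memory that used the hyphenated spelling. Parts shorter than three characters, such as the `2` in `2-projects/interferometer`, are dropped by the length filter.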
@@ -344,6 +344,9 @@ def get_memories_for_context(
|
||||
memory_types: list[str] | None = None,
|
||||
project: str | None = None,
|
||||
budget: int = 500,
|
||||
header: str = "--- AtoCore Memory ---",
|
||||
footer: str = "--- End Memory ---",
|
||||
query: str | None = None,
|
||||
) -> tuple[str, int]:
|
||||
"""Get formatted memories for context injection.
|
||||
|
||||
@@ -351,38 +354,72 @@ def get_memories_for_context(
|
||||
|
||||
Budget allocation per Master Plan section 9:
|
||||
identity: 5%, preference: 5%, rest from retrieval budget
|
||||
|
||||
The caller can override ``header`` / ``footer`` to distinguish
|
||||
multiple memory blocks in the same pack (e.g. identity/preference
|
||||
vs project/knowledge memories).
|
||||
|
||||
When ``query`` is provided, candidates within each memory type
|
||||
are ranked by lexical overlap against the query (stemmed token
|
||||
intersection, ties broken by confidence). Without a query,
|
||||
candidates fall through in the order ``get_memories`` returns
|
||||
them — which is effectively "by confidence desc".
|
||||
"""
|
||||
if memory_types is None:
|
||||
memory_types = ["identity", "preference"]
|
||||
|
||||
if budget <= 0:
|
||||
return "", 0
|
||||
|
||||
header = "--- AtoCore Memory ---"
|
||||
footer = "--- End Memory ---"
|
||||
wrapper_chars = len(header) + len(footer) + 2
|
||||
if budget <= wrapper_chars:
|
||||
return "", 0
|
||||
|
||||
available = budget - wrapper_chars
|
||||
selected_entries: list[str] = []
|
||||
used = 0
|
||||
|
||||
for index, mtype in enumerate(memory_types):
|
||||
type_budget = available if index == len(memory_types) - 1 else max(0, available // (len(memory_types) - index))
|
||||
type_used = 0
|
||||
# Pre-tokenize the query once. ``_rank_memories_for_query`` is a
|
||||
# free function below that reuses the reinforcement tokenizer so
|
||||
# lexical scoring here matches the reinforcement matcher.
|
||||
query_tokens: set[str] | None = None
|
||||
if query:
|
||||
from atocore.memory.reinforcement import _normalize, _tokenize
|
||||
|
||||
query_tokens = _tokenize(_normalize(query))
|
||||
if not query_tokens:
|
||||
query_tokens = None
|
||||
|
||||
# Collect ALL candidates across the requested types into one
|
||||
# pool, then rank globally before the budget walk. Ranking per
|
||||
# type and walking types in order would starve later types when
|
||||
# the first type's candidates filled the budget — even if a
|
||||
# later-type candidate matched the query perfectly. Type order
|
||||
# is preserved as a stable tiebreaker inside
|
||||
# ``_rank_memories_for_query`` via Python's stable sort.
|
||||
pool: list[Memory] = []
|
||||
seen_ids: set[str] = set()
|
||||
for mtype in memory_types:
|
||||
for mem in get_memories(
|
||||
memory_type=mtype,
|
||||
project=project,
|
||||
min_confidence=0.5,
|
||||
limit=10,
|
||||
limit=30,
|
||||
):
|
||||
entry = f"[{mem.memory_type}] {mem.content}"
|
||||
entry_len = len(entry) + 1
|
||||
if entry_len > type_budget - type_used:
|
||||
if mem.id in seen_ids:
|
||||
continue
|
||||
selected_entries.append(entry)
|
||||
type_used += entry_len
|
||||
available -= type_used
|
||||
seen_ids.add(mem.id)
|
||||
pool.append(mem)
|
||||
|
||||
if query_tokens is not None:
|
||||
pool = _rank_memories_for_query(pool, query_tokens)
|
||||
|
||||
for mem in pool:
|
||||
entry = f"[{mem.memory_type}] {mem.content}"
|
||||
entry_len = len(entry) + 1
|
||||
if entry_len > available - used:
|
||||
continue
|
||||
selected_entries.append(entry)
|
||||
used += entry_len
|
||||
|
||||
if not selected_entries:
|
||||
return "", 0
|
||||
@@ -394,6 +431,28 @@ def get_memories_for_context(
|
||||
return text, len(text)
|
||||
|
||||
|
||||
def _rank_memories_for_query(
|
||||
memories: list["Memory"],
|
||||
query_tokens: set[str],
|
||||
) -> list["Memory"]:
|
||||
"""Rerank a memory list by lexical overlap with a pre-tokenized query.
|
||||
|
||||
Ordering key: (overlap_count DESC, confidence DESC). When a query
|
||||
shares no tokens with a memory, overlap is zero and confidence
|
||||
acts as the sole tiebreaker — which matches the pre-query
|
||||
behaviour and keeps no-query calls stable.
|
||||
"""
|
||||
from atocore.memory.reinforcement import _normalize, _tokenize
|
||||
|
||||
scored: list[tuple[int, float, Memory]] = []
|
||||
for mem in memories:
|
||||
mem_tokens = _tokenize(_normalize(mem.content))
|
||||
overlap = len(mem_tokens & query_tokens) if mem_tokens else 0
|
||||
scored.append((overlap, mem.confidence, mem))
|
||||
scored.sort(key=lambda t: (t[0], t[1]), reverse=True)
|
||||
return [mem for _, _, mem in scored]
|
||||
|
||||
|
||||
def _row_to_memory(row) -> Memory:
|
||||
"""Convert a DB row to Memory dataclass."""
|
||||
keys = row.keys() if hasattr(row, "keys") else []
|
||||
|
||||
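The pooled ranking followed by a budget walk in `get_memories_for_context` can be sketched without the database. Everything named here is illustrative: `Mem` is a hypothetical stand-in for the project's `Memory` dataclass, and tokenization is reduced to a lowercase whitespace split rather than the reinforcement tokenizer the diff reuses.

```python
from dataclasses import dataclass


@dataclass
class Mem:
    """Hypothetical stand-in for the project's Memory dataclass."""
    memory_type: str
    content: str
    confidence: float


def rank(pool: list[Mem], query_tokens: set[str]) -> list[Mem]:
    """Order by (overlap count DESC, confidence DESC), as in the diff."""
    scored = [
        (len(set(m.content.lower().split()) & query_tokens), m.confidence, m)
        for m in pool
    ]
    # The key only looks at overlap and confidence; Python's sort is stable,
    # so equal keys keep their original pool order.
    scored.sort(key=lambda t: (t[0], t[1]), reverse=True)
    return [m for _, _, m in scored]


def select(pool: list[Mem], available: int) -> list[str]:
    """Walk the ranked pool and keep every entry that still fits."""
    selected: list[str] = []
    used = 0
    for m in pool:
        entry = f"[{m.memory_type}] {m.content}"
        entry_len = len(entry) + 1  # +1 for the joining newline
        if entry_len > available - used:
            continue  # skip this one, a shorter later entry may still fit
        selected.append(entry)
        used += entry_len
    return selected


if __name__ == "__main__":
    pool = [
        Mem("project", "folded beam interferometer layout", 0.97),
        Mem("knowledge", "vendor signal zygo verifire", 0.85),
    ]
    ranked = rank(pool, {"vendor", "signal"})
    print(select(ranked, 45))
```

With room for only one entry, the lower-confidence vendor memory wins because it overlaps the query, which is the behaviour the `test_project_memories_query_relevance_ordering` regression test checks for. Note that the walk uses `continue` rather than `break`, so an oversized entry does not block smaller entries ranked after it.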
@@ -183,7 +183,7 @@ class TestCapture:
|
||||
assert body["prompt"] == "Please explain how the backup system works in detail"
|
||||
assert body["client"] == "claude-code"
|
||||
assert body["session_id"] == "test-session-123"
|
||||
assert body["reinforce"] is False
|
||||
assert body["reinforce"] is True
|
||||
|
||||
@mock.patch("capture_stop.urllib.request.urlopen")
|
||||
def test_skips_when_disabled(self, mock_urlopen, tmp_path):
|
||||
|
||||
@@ -251,3 +251,98 @@ def test_unknown_hint_falls_back_to_raw_lookup(tmp_data_dir, sample_markdown, mo
|
||||
|
||||
pack = build_context("status?", project_hint="orphan-project", budget=2000)
|
||||
assert "Solo run" in pack.formatted_context
|
||||
|
||||
|
||||
def test_project_memories_included_in_pack(tmp_data_dir, sample_markdown):
|
||||
"""Active project-scoped memories for the target project should
|
||||
land in a dedicated '--- Project Memories ---' band so the
|
||||
Phase 9 reflection loop has a retrieval outlet."""
|
||||
from atocore.memory.service import create_memory
|
||||
|
||||
init_db()
|
||||
init_project_state_schema()
|
||||
ingest_file(sample_markdown)
|
||||
|
||||
mem = create_memory(
|
||||
memory_type="project",
|
||||
content="the mirror architecture is Option B conical back for p04-gigabit",
|
||||
project="p04-gigabit",
|
||||
confidence=0.9,
|
||||
)
|
||||
# A sibling memory for a different project must NOT leak into the pack.
|
||||
create_memory(
|
||||
memory_type="project",
|
||||
content="polisher suite splits into sim, post, control, contracts",
|
||||
project="p06-polisher",
|
||||
confidence=0.9,
|
||||
)
|
||||
|
||||
pack = build_context(
|
||||
"remind me about the mirror architecture",
|
||||
project_hint="p04-gigabit",
|
||||
budget=3000,
|
||||
)
|
||||
assert "--- Project Memories ---" in pack.formatted_context
|
||||
assert "Option B conical back" in pack.formatted_context
|
||||
assert "polisher suite splits" not in pack.formatted_context
|
||||
assert pack.project_memory_chars > 0
|
||||
assert mem.project == "p04-gigabit"
|
||||
|
||||
|
||||
def test_project_memories_absent_without_project_hint(tmp_data_dir, sample_markdown):
|
||||
"""Without a project hint, project memories stay out of the pack —
|
||||
cross-project bleed would rot the signal."""
|
||||
from atocore.memory.service import create_memory
|
||||
|
||||
init_db()
|
||||
init_project_state_schema()
|
||||
ingest_file(sample_markdown)
|
||||
|
||||
create_memory(
|
||||
memory_type="project",
|
||||
content="scoped project knowledge that should not leak globally",
|
||||
project="p04-gigabit",
|
||||
confidence=0.9,
|
||||
)
|
||||
|
||||
pack = build_context("tell me something", budget=3000)
|
||||
assert "--- Project Memories ---" not in pack.formatted_context
|
||||
assert pack.project_memory_chars == 0
|
||||
|
||||
|
||||
def test_project_memories_query_relevance_ordering(tmp_data_dir, sample_markdown):
|
||||
"""When the budget only fits one memory, query-relevance ordering
|
||||
should pick the one the query is actually about — even if another
|
||||
memory has higher confidence.
|
||||
|
||||
Regression for the 2026-04-11 p05-vendor-signal harness failure:
|
||||
memory selection was fixed-order by confidence, so a lower-ranked
|
||||
vendor memory got starved out of the budget when a query was
|
||||
specifically about vendors.
|
||||
"""
|
||||
from atocore.memory.service import create_memory
|
||||
|
||||
init_db()
|
||||
init_project_state_schema()
|
||||
ingest_file(sample_markdown)
|
||||
|
||||
create_memory(
|
||||
memory_type="project",
|
||||
content="the folded-beam interferometer uses a CGH stage and fold mirror",
|
||||
project="p05-interferometer",
|
||||
confidence=0.97,
|
||||
)
|
||||
create_memory(
|
||||
memory_type="knowledge",
|
||||
content="vendor signal: Zygo Verifire SV is the strongest value path for the interferometer",
|
||||
project="p05-interferometer",
|
||||
confidence=0.85,
|
||||
)
|
||||
|
||||
pack = build_context(
|
||||
"what is the current vendor signal for the interferometer",
|
||||
project_hint="p05-interferometer",
|
||||
budget=1200, # tight enough that only one project memory fits
|
||||
)
|
||||
assert "Zygo Verifire SV" in pack.formatted_context
|
||||
assert pack.project_memory_chars > 0
|
||||
|
||||
@@ -476,6 +476,60 @@ def test_reinforce_matches_at_70_percent_threshold(tmp_data_dir):
|
||||
assert any(r.memory_id == mem.id for r in results)
|
||||
|
||||
|
||||
def test_reinforce_long_memory_matches_on_absolute_overlap(tmp_data_dir):
|
||||
"""A paragraph-length memory should reinforce when the response
|
||||
echoes a substantive subset of its distinctive tokens, even though
|
||||
the overlap fraction stays well under 70%."""
|
||||
init_db()
|
||||
mem = create_memory(
|
||||
memory_type="project",
|
||||
content=(
|
||||
"Interferometer architecture: a folded-beam configuration with a "
|
||||
"fixed horizontal interferometer, a forty-five degree fold mirror, "
|
||||
"a six-DOF CGH stage, and the mirror on its own tilting platform. "
|
||||
"The fold mirror redirects the beam while the CGH shapes the wavefront."
|
||||
),
|
||||
project="p05-interferometer",
|
||||
confidence=0.5,
|
||||
)
|
||||
interaction = _make_interaction(
|
||||
project="p05-interferometer",
|
||||
response=(
|
||||
"For the interferometer we keep the folded-beam layout: horizontal "
|
||||
"interferometer, fold mirror at forty-five degrees, CGH stage with "
|
||||
"six DOF, and the mirror sitting on its tilting platform. The fold "
|
||||
"mirror redirects the beam and the CGH shapes the wavefront."
|
||||
),
|
||||
)
|
||||
results = reinforce_from_interaction(interaction)
|
||||
assert any(r.memory_id == mem.id for r in results)
|
||||
|
||||
|
||||
def test_reinforce_long_memory_rejects_thin_overlap(tmp_data_dir):
|
||||
"""Long memory + a response that only brushes a few generic terms
|
||||
must NOT reinforce — otherwise the reflection loop rots."""
|
||||
init_db()
|
||||
mem = create_memory(
|
||||
memory_type="project",
|
||||
content=(
|
||||
"Polisher control system executes approved controller jobs, "
|
||||
"enforces state transitions and interlocks, supports pause "
|
||||
"resume and abort, and records auditable run logs while "
|
||||
"never reinterpreting metrology or inventing new strategies."
|
||||
),
|
||||
project="p06-polisher",
|
||||
confidence=0.5,
|
||||
)
|
||||
interaction = _make_interaction(
|
||||
project="p06-polisher",
|
||||
response=(
|
||||
"I updated the polisher docs and fixed a typo in the run logs section."
|
||||
),
|
||||
)
|
||||
results = reinforce_from_interaction(interaction)
|
||||
assert all(r.memory_id != mem.id for r in results)
|
||||
|
||||
|
||||
def test_reinforce_rejects_below_70_percent(tmp_data_dir):
|
||||
"""Only 6 of 10 content tokens present (60%) → should NOT match."""
|
||||
init_db()
|
||||
|
||||