diff --git a/DEV-LEDGER.md b/DEV-LEDGER.md index 9c1bb4f..a918bca 100644 --- a/DEV-LEDGER.md +++ b/DEV-LEDGER.md @@ -6,15 +6,15 @@ ## Orientation -- **live_sha** (Dalidou `/health` build_sha): `2b86543` (verified 2026-04-23T15:20:53Z post-R14 deploy; status=ok) -- **last_updated**: 2026-04-24 by Codex (audit-improvements foundation branch; live status refreshed) -- **main_tip**: `2b86543` -- **test_count**: 553 -- **harness**: `18/20 PASS` on live Dalidou plus 1 known content gap and 1 blocking project-bleed guard pending deploy of this branch +- **live_sha** (Dalidou `/health` build_sha): `f44a211` (verified 2026-04-24T14:48:44Z post audit-improvements deploy; status=ok) +- **last_updated**: 2026-04-24 by Codex (retrieval boundary deployed; project_id metadata branch started) +- **main_tip**: `f44a211` +- **test_count**: 567 on `codex/project-id-metadata-retrieval` (deployed main baseline: 553) +- **harness**: `19/20 PASS` on live Dalidou, 0 blocking failures, 1 known content gap (`p04-constraints`) - **vectors**: 33,253 - **active_memories**: 290 (`/admin/dashboard` 2026-04-24; note integrity panel reports a separate active_memory_count=951 and needs reconciliation) - **candidate_memories**: 0 (triage queue drained) -- **interactions**: 950 (`/admin/dashboard` 2026-04-24) +- **interactions**: 951 (`/admin/dashboard` 2026-04-24) - **registered_projects**: atocore, p04-gigabit, p05-interferometer, p06-polisher, atomizer-v2, abb-space (aliased p08) - **project_state_entries**: 128 across registered projects (`/admin/dashboard` 2026-04-24) - **entities**: 66 (up from 35 — V1-0 backfill + ongoing work; 0 open conflicts) @@ -170,6 +170,12 @@ One branch `codex/extractor-eval-loop` for Day 1-5, a second `codex/retrieval-ha ## Session Log +- **2026-04-24 Codex (retrieval boundary deployed + project_id metadata tranche)** Merged `codex/audit-improvements-foundation` to `main` as `f44a211` and pushed to Dalidou Gitea. 
Took pre-deploy runtime backup `/srv/storage/atocore/backups/snapshots/20260424T144810Z` (DB + registry, no Chroma). Deployed via `papa@dalidou` canonical `deploy/dalidou/deploy.sh`; live `/health` reports build_sha `f44a2114970008a7eec4e7fc2860c8f072914e38`, build_time `2026-04-24T14:48:44Z`, status ok. Post-deploy retrieval harness: 20 fixtures, 19 pass, 0 blocking failures, 1 known issue (`p04-constraints`). The former blocker `p05-broad-status-no-atomizer` now passes. Manual p05 `context-build "current status"` spot check shows no p04/Atomizer source bleed in retrieved chunks. Started follow-up branch `codex/project-id-metadata-retrieval`: registered-project ingestion now writes explicit `project_id` into DB chunk metadata and Chroma vector metadata; retrieval prefers exact `project_id` when present and keeps path/tag matching as legacy fallback; added dry-run-by-default `scripts/backfill_chunk_project_ids.py` to backfill SQLite + Chroma metadata; added tests for project-id ingestion, registered refresh propagation, exact project-id retrieval, and collision fallback. Verified targeted suite (`test_ingestion.py`, `test_project_registry.py`, `test_retrieval.py`): 36 passed. Verified full suite: 556 passed in 72.44s. Branch not merged or deployed yet. + +- **2026-04-24 Codex (project_id audit response)** Applied independent-audit fixes on `codex/project-id-metadata-retrieval`. Closed the nightly `/ingest/sources` clobber risk by adding registry-level `derive_project_id_for_path()` and making unscoped `ingest_file()` derive ownership from registered ingest roots when possible; `refresh_registered_project()` still passes the canonical project id directly. Changed retrieval so empty `project_id` falls through to legacy path/tag ownership instead of short-circuiting as unowned. 
Hardened `scripts/backfill_chunk_project_ids.py`: `--apply` now requires `--chroma-snapshot-confirmed`, runs Chroma metadata updates before SQLite writes, batches updates, skips and reports missing vectors and malformed metadata, reports already-tagged rows, and turns missing ingestion tables into a JSON `db_warning` instead of a traceback. Added tests for auto-derive ingestion, empty-project fallback, ingest-root overlap rejection, and backfill dry-run/apply/snapshot/missing-vector/malformed cases. Verified targeted suite (`test_backfill_chunk_project_ids.py`, `test_ingestion.py`, `test_project_registry.py`, `test_retrieval.py`): 45 passed. Verified full suite: 565 passed in 73.16s. Local dry-run on empty/default data returns 0 updates with a `db_warning` rather than crashing. Branch still not merged/deployed. + +- **2026-04-24 Codex (project_id final hardening before merge)** Applied the final independent-review P2s on `codex/project-id-metadata-retrieval`: `ingest_file()` still fails open when project-id derivation fails, but now emits `project_id_derivation_failed` with the file path and error; retrieval now catches registry failures at both project-scope resolution and the soft project-match boost path, logs warnings, and serves unscoped results rather than raising. Added regression tests for both fail-open paths. Verified targeted suite (`test_ingestion.py`, `test_retrieval.py`, `test_backfill_chunk_project_ids.py`, `test_project_registry.py`): 47 passed. Verified full suite: 567 passed in 79.66s. Branch still not merged/deployed. + - **2026-04-24 Codex (audit improvements foundation)** Started implementation of the audit recommendations on branch `codex/audit-improvements-foundation` from `origin/main@c53e61e`. 
First tranche: registry-aware project-scoped retrieval filtering (`ATOCORE_RANK_PROJECT_SCOPE_FILTER`, widened candidate pull before filtering), eval harness known-issue lane, two p05 project-bleed fixtures, `scripts/live_status.py`, README/current-state/master-plan status refresh. Verified `pytest -q`: 550 passed in 67.11s. Live retrieval harness against production (this branch not yet deployed): 20 fixtures, 18 pass, 1 known issue (`p04-constraints` Zerodur/1.2 content gap), 1 blocking guard (`p05-broad-status-no-atomizer`) still failing because production has not yet deployed the retrieval filter and currently pulls `P04-GigaBIT-M1-KB-design` into broad p05 status context. Live dashboard refresh: health ok, build `2b86543`, docs 1748, chunks/vectors 33253, interactions 948, active memories 289, candidates 0, project_state total 128. Noted count discrepancy: dashboard memories.active=289 while integrity active_memory_count=951; schedule reconciliation in a follow-up. - **2026-04-24 Codex (independent-audit hardening)** Applied the Opus independent audit's fast follow-ups before merge/deploy. Closed the two P1s by making project-scope ownership path/tag-based only, adding path-segment/tag-exact matching to avoid short-alias substring collisions, and keeping title/heading text out of provenance decisions. Added regression tests for title poisoning, substring collision, and unknown-project fallback. Added retrieval log fields `raw_results_count`, `post_filter_count`, `post_filter_dropped`, and `underfilled`. Added retrieval-eval run metadata (`generated_at`, `base_url`, `/health`) and `live_status.py` auth-token/status support. README now documents the ranking knobs and clarifies that the hard scope filter and soft project match boost are separate controls. Verified `pytest -q`: 553 passed in 66.07s. Live production remains in the expected pre-deploy state: 20 fixtures, 18 pass, 1 known content gap, 1 blocking p05 bleed guard. 
Latest live dashboard: build `2b86543`, docs 1748, chunks/vectors 33253, interactions 950, active memories 290, candidates 0, project_state total 128. diff --git a/README.md b/README.md index 70d77d4..1f430f7 100644 --- a/README.md +++ b/README.md @@ -111,6 +111,7 @@ pytest - `scripts/atocore_client.py` provides a live API client for project refresh, project-state inspection, and retrieval-quality audits. - `scripts/retrieval_eval.py` runs the live retrieval/context harness, separates blocking failures from known content gaps, and stamps JSON output with target/build metadata. - `scripts/live_status.py` renders a compact read-only status report from `/health`, `/stats`, `/projects`, and `/admin/dashboard`; set `ATOCORE_AUTH_TOKEN` or `--auth-token` when those endpoints are gated. +- `scripts/backfill_chunk_project_ids.py` dry-runs or applies explicit `project_id` metadata backfills for SQLite chunks and Chroma vectors; `--apply` requires a confirmed Chroma snapshot. - `docs/operations.md` captures the current operational priority order: retrieval quality, Wave 2 trusted-operational ingestion, AtoDrive scoping, and restore validation. - `DEV-LEDGER.md` is the fast-moving source of operational truth during active development; copy claims into docs only after checking the live service. diff --git a/docs/current-state.md b/docs/current-state.md index 32a46e2..14dd753 100644 --- a/docs/current-state.md +++ b/docs/current-state.md @@ -1,5 +1,9 @@ # AtoCore - Current State (2026-04-24) +Update 2026-04-24: audit-improvements deployed as `f44a211`; live harness is +19/20 with 0 blocking failures and 1 known content gap. Active follow-up branch +`codex/project-id-metadata-retrieval` is at 567 passing tests. + Live deploy: `2b86543` · Dalidou health: ok · Harness: 18/20 with 1 known content gap and 1 current blocking project-bleed guard · Tests: 553 passing. 
@@ -68,7 +72,7 @@ Last nightly run (2026-04-19 03:00 UTC): **31 promoted · 39 rejected · 0 needs ## Known gaps (honest, refreshed 2026-04-24) 1. **Capture surface is Claude-Code-and-OpenClaw only.** Conversations in Claude Desktop, Claude.ai web, phone, or any other LLM UI are NOT captured. Example: the rotovap/mushroom chat yesterday never reached AtoCore because no hook fired. See Q4 below. -2. **Project-scoped retrieval still needs deployment verification.** The April 24 audit reproduced cross-project competition on broad p05 prompts. The current branch adds registry-aware project filtering and a harness guard; verify after deploy. +2. **Project-scoped retrieval guard is deployed and passing.** The April 24 p05 broad-status bleed guard now passes on live Dalidou. The active follow-up branch adds explicit `project_id` chunk/vector metadata so the deployed path/tag heuristic can become a legacy fallback. 3. **Human interface is useful but not yet the V1 Human Mirror.** Wiki/dashboard pages exist, but the spec routes, deterministic mirror files, disputed markers, and curated annotations remain V1-D work. 4. **Harness known issue:** `p04-constraints` wants "Zerodur" and "1.2"; live retrieval surfaces related constraints but not those exact strings. Treat as content/state gap until fixed. 5. **Formal docs lag the ledger during fast work.** Use `DEV-LEDGER.md` and `python scripts/live_status.py` for live truth, then copy verified claims into these docs. diff --git a/docs/master-plan-status.md b/docs/master-plan-status.md index d7e17f7..6a7202b 100644 --- a/docs/master-plan-status.md +++ b/docs/master-plan-status.md @@ -135,7 +135,7 @@ deferred from the shared client until their workflows are exercised. 
- canonical AtoCore runtime on Dalidou (`2b86543`, deploy.sh verified) - 33,253 vectors across 6 registered projects -- 950 captured interactions as of the 2026-04-24 live dashboard; refresh +- 951 captured interactions as of the 2026-04-24 live dashboard; refresh exact live counts with `python scripts/live_status.py` - 6 registered projects: @@ -150,10 +150,9 @@ deferred from the shared client until their workflows are exercised. dashboard - context pack assembly with 4 tiers: Trusted Project State > identity/preference > project memories > retrieved chunks - query-relevance memory ranking with overlap-density scoring -- retrieval eval harness: 20 fixtures; current live has 18 pass, 1 known - content gap, and 1 blocking cross-project bleed guard targeted by the - current retrieval-scoping branch -- 553 tests passing on the audit-improvements branch +- retrieval eval harness: 20 fixtures; current live has 19 pass, 1 known + content gap, and 0 blocking failures after the audit-improvements deploy +- 567 tests passing on the active `codex/project-id-metadata-retrieval` branch - nightly pipeline: backup → cleanup → rsync → OpenClaw import → vault refresh → extract → triage → **auto-promote/expire** → weekly synth/lint → **retrieval harness** → **pipeline summary to project state** - Phase 10 operational: reinforcement-based auto-promotion (ref_count ≥ 3, confidence ≥ 0.7) + stale candidate expiry (14 days unreinforced) - pipeline health visible in dashboard: interaction totals by client, pipeline last_run, harness results, triage stats diff --git a/scripts/backfill_chunk_project_ids.py b/scripts/backfill_chunk_project_ids.py new file mode 100644 index 0000000..72840ae --- /dev/null +++ b/scripts/backfill_chunk_project_ids.py @@ -0,0 +1,178 @@ +"""Backfill explicit project_id into chunk and vector metadata. + +Dry-run by default. 
The script derives ownership from the registered project +ingest roots and updates both SQLite source_chunks.metadata and Chroma vector +metadata only when --apply is provided. +""" + +from __future__ import annotations + +import argparse +import json +import sqlite3 +import sys +from pathlib import Path + +sys.path.insert(0, str(Path(__file__).resolve().parents[1] / "src")) + +from atocore.models.database import get_connection # noqa: E402 +from atocore.projects.registry import derive_project_id_for_path # noqa: E402 +from atocore.retrieval.vector_store import get_vector_store # noqa: E402 + +DEFAULT_BATCH_SIZE = 500 + + +def _decode_metadata(raw: str | None) -> dict | None: + if not raw: + return {} + try: + parsed = json.loads(raw) + except json.JSONDecodeError: + return None + return parsed if isinstance(parsed, dict) else None + + +def _chunk_rows() -> tuple[list[dict], str]: + try: + with get_connection() as conn: + rows = conn.execute( + """ + SELECT + sc.id AS chunk_id, + sc.metadata AS chunk_metadata, + sd.file_path AS file_path + FROM source_chunks sc + JOIN source_documents sd ON sd.id = sc.document_id + ORDER BY sd.file_path, sc.chunk_index + """ + ).fetchall() + except sqlite3.OperationalError as exc: + if "source_chunks" in str(exc) or "source_documents" in str(exc): + return [], f"missing ingestion tables: {exc}" + raise + return [dict(row) for row in rows], "" + + +def _batches(items: list, batch_size: int) -> list[list]: + return [items[i:i + batch_size] for i in range(0, len(items), batch_size)] + + +def backfill( + apply: bool = False, + project_filter: str = "", + batch_size: int = DEFAULT_BATCH_SIZE, + require_chroma_snapshot: bool = False, +) -> dict: + rows, db_warning = _chunk_rows() + updates: list[tuple[str, str, dict]] = [] + by_project: dict[str, int] = {} + skipped_unowned = 0 + already_tagged = 0 + malformed_metadata = 0 + + for row in rows: + project_id = derive_project_id_for_path(row["file_path"]) + if project_filter and project_id 
!= project_filter: + continue + if not project_id: + skipped_unowned += 1 + continue + metadata = _decode_metadata(row["chunk_metadata"]) + if metadata is None: + malformed_metadata += 1 + continue + if metadata.get("project_id") == project_id: + already_tagged += 1 + continue + metadata["project_id"] = project_id + updates.append((row["chunk_id"], project_id, metadata)) + by_project[project_id] = by_project.get(project_id, 0) + 1 + + missing_vectors: list[str] = [] + applied_updates = 0 + if apply and updates: + if not require_chroma_snapshot: + raise ValueError( + "--apply requires --chroma-snapshot-confirmed after taking a Chroma backup" + ) + vector_store = get_vector_store() + for batch in _batches(updates, max(1, batch_size)): + chunk_ids = [chunk_id for chunk_id, _, _ in batch] + vector_payload = vector_store.get_metadatas(chunk_ids) + existing_vector_metadata = { + chunk_id: metadata + for chunk_id, metadata in zip( + vector_payload.get("ids", []), + vector_payload.get("metadatas", []), + strict=False, + ) + if isinstance(metadata, dict) + } + + vector_ids = [] + vector_metadatas = [] + sql_updates = [] + for chunk_id, project_id, chunk_metadata in batch: + vector_metadata = existing_vector_metadata.get(chunk_id) + if vector_metadata is None: + missing_vectors.append(chunk_id) + continue + vector_metadata = dict(vector_metadata) + vector_metadata["project_id"] = project_id + vector_ids.append(chunk_id) + vector_metadatas.append(vector_metadata) + sql_updates.append((json.dumps(chunk_metadata, ensure_ascii=True), chunk_id)) + + if not vector_ids: + continue + + vector_store.update_metadatas(vector_ids, vector_metadatas) + with get_connection() as conn: + cursor = conn.executemany( + "UPDATE source_chunks SET metadata = ? 
WHERE id = ?", + sql_updates, + ) + if cursor.rowcount != len(sql_updates): + raise RuntimeError( + f"SQLite rowcount mismatch: {cursor.rowcount} != {len(sql_updates)}" + ) + applied_updates += len(sql_updates) + + return { + "apply": apply, + "total_chunks": len(rows), + "updates": len(updates), + "applied_updates": applied_updates, + "already_tagged": already_tagged, + "skipped_unowned": skipped_unowned, + "malformed_metadata": malformed_metadata, + "missing_vectors": len(missing_vectors), + "db_warning": db_warning, + "by_project": dict(sorted(by_project.items())), + } + + +def main() -> int: + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument("--apply", action="store_true", help="write SQLite and Chroma metadata updates") + parser.add_argument("--project", default="", help="optional canonical project_id filter") + parser.add_argument("--batch-size", type=int, default=DEFAULT_BATCH_SIZE) + parser.add_argument( + "--chroma-snapshot-confirmed", + action="store_true", + help="required with --apply; confirms a Chroma snapshot exists", + ) + args = parser.parse_args() + + payload = backfill( + apply=args.apply, + project_filter=args.project.strip(), + batch_size=args.batch_size, + require_chroma_snapshot=args.chroma_snapshot_confirmed, + ) + print(json.dumps(payload, indent=2, ensure_ascii=True)) + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/src/atocore/ingestion/pipeline.py b/src/atocore/ingestion/pipeline.py index 78052ea..8aba226 100644 --- a/src/atocore/ingestion/pipeline.py +++ b/src/atocore/ingestion/pipeline.py @@ -32,10 +32,23 @@ def exclusive_ingestion(): _INGESTION_LOCK.release() -def ingest_file(file_path: Path) -> dict: +def ingest_file(file_path: Path, project_id: str = "") -> dict: """Ingest a single markdown file. 
Returns stats.""" start = time.time() file_path = file_path.resolve() + project_id = (project_id or "").strip() + if not project_id: + try: + from atocore.projects.registry import derive_project_id_for_path + + project_id = derive_project_id_for_path(file_path) + except Exception as exc: + log.warning( + "project_id_derivation_failed", + file_path=str(file_path), + error=str(exc), + ) + project_id = "" if not file_path.exists(): raise FileNotFoundError(f"File not found: {file_path}") @@ -65,6 +78,7 @@ def ingest_file(file_path: Path) -> dict: "source_file": str(file_path), "tags": parsed.tags, "title": parsed.title, + "project_id": project_id, } chunks = chunk_markdown(parsed.body, base_metadata=base_meta) @@ -116,6 +130,7 @@ def ingest_file(file_path: Path) -> dict: "source_file": str(file_path), "tags": json.dumps(parsed.tags), "title": parsed.title, + "project_id": project_id, }) conn.execute( @@ -173,7 +188,17 @@ def ingest_folder(folder_path: Path, purge_deleted: bool = True) -> list[dict]: purge_deleted: If True, remove DB/vector entries for files that no longer exist on disk. 
""" + return ingest_project_folder(folder_path, purge_deleted=purge_deleted, project_id="") + + +def ingest_project_folder( + folder_path: Path, + purge_deleted: bool = True, + project_id: str = "", +) -> list[dict]: + """Ingest a folder and annotate chunks with an optional project id.""" folder_path = folder_path.resolve() + project_id = (project_id or "").strip() if not folder_path.is_dir(): raise NotADirectoryError(f"Not a directory: {folder_path}") @@ -187,7 +212,7 @@ def ingest_folder(folder_path: Path, purge_deleted: bool = True) -> list[dict]: # Ingest new/changed files for md_file in md_files: try: - result = ingest_file(md_file) + result = ingest_file(md_file, project_id=project_id) results.append(result) except Exception as e: log.error("ingestion_error", file_path=str(md_file), error=str(e)) diff --git a/src/atocore/projects/registry.py b/src/atocore/projects/registry.py index 6786b4e..9fd0cf6 100644 --- a/src/atocore/projects/registry.py +++ b/src/atocore/projects/registry.py @@ -8,7 +8,6 @@ from dataclasses import asdict, dataclass from pathlib import Path import atocore.config as _config -from atocore.ingestion.pipeline import ingest_folder # Reserved pseudo-projects. 
`inbox` holds pre-project / lead / quote @@ -260,6 +259,7 @@ def load_project_registry() -> list[RegisteredProject]: ) _validate_unique_project_names(projects) + _validate_ingest_root_overlaps(projects) return projects @@ -307,6 +307,28 @@ def resolve_project_name(name: str | None) -> str: return name +def derive_project_id_for_path(file_path: str | Path) -> str: + """Return the registered project that owns a source path, if any.""" + if not file_path: + return "" + doc_path = Path(file_path).resolve(strict=False) + matches: list[tuple[int, int, str]] = [] + + for project in load_project_registry(): + for source_ref in project.ingest_roots: + root_path = _resolve_ingest_root(source_ref) + try: + doc_path.relative_to(root_path) + except ValueError: + continue + matches.append((len(root_path.parts), len(str(root_path)), project.project_id)) + + if not matches: + return "" + matches.sort(reverse=True) + return matches[0][2] + + def refresh_registered_project(project_name: str, purge_deleted: bool = False) -> dict: """Ingest all configured source roots for a registered project. 
@@ -322,6 +344,8 @@ def refresh_registered_project(project_name: str, purge_deleted: bool = False) - if project is None: raise ValueError(f"Unknown project: {project_name}") + from atocore.ingestion.pipeline import ingest_project_folder + roots = [] ingested_count = 0 skipped_count = 0 @@ -346,7 +370,11 @@ def refresh_registered_project(project_name: str, purge_deleted: bool = False) - { **root_result, "status": "ingested", - "results": ingest_folder(resolved, purge_deleted=purge_deleted), + "results": ingest_project_folder( + resolved, + purge_deleted=purge_deleted, + project_id=project.project_id, + ), } ) ingested_count += 1 @@ -443,6 +471,33 @@ def _validate_unique_project_names(projects: list[RegisteredProject]) -> None: seen[key] = project.project_id +def _validate_ingest_root_overlaps(projects: list[RegisteredProject]) -> None: + roots: list[tuple[str, Path]] = [] + for project in projects: + for source_ref in project.ingest_roots: + roots.append((project.project_id, _resolve_ingest_root(source_ref))) + + for i, (left_project, left_root) in enumerate(roots): + for right_project, right_root in roots[i + 1:]: + if left_project == right_project: + continue + try: + left_root.relative_to(right_root) + overlaps = True + except ValueError: + try: + right_root.relative_to(left_root) + overlaps = True + except ValueError: + overlaps = False + if overlaps: + raise ValueError( + "Project registry ingest root overlap: " + f"'{left_root}' ({left_project}) and " + f"'{right_root}' ({right_project})" + ) + + def _find_name_collisions( project_id: str, aliases: list[str], diff --git a/src/atocore/retrieval/retriever.py b/src/atocore/retrieval/retriever.py index 26e0182..79cda1b 100644 --- a/src/atocore/retrieval/retriever.py +++ b/src/atocore/retrieval/retriever.py @@ -84,7 +84,15 @@ def retrieve( """Retrieve the most relevant chunks for a query.""" top_k = top_k or _config.settings.context_top_k start = time.time() - scoped_project = get_registered_project(project_hint) 
if project_hint else None + try: + scoped_project = get_registered_project(project_hint) if project_hint else None + except Exception as exc: + log.warning( + "project_scope_resolution_failed", + project_hint=project_hint, + error=str(exc), + ) + scoped_project = None scope_filter_enabled = bool(scoped_project and _config.settings.rank_project_scope_filter) registered_projects = None query_top_k = top_k @@ -209,6 +217,10 @@ def _is_allowed_for_project_scope( def _metadata_matches_project(project: RegisteredProject, metadata: dict) -> bool: + stored_project_id = str(metadata.get("project_id", "")).strip().lower() + if stored_project_id: + return stored_project_id == project.project_id.lower() + path = _metadata_source_path(metadata) tags = _metadata_tags(metadata) for term in _project_scope_terms(project): @@ -288,7 +300,15 @@ def _project_match_boost(project_hint: str, metadata: dict) -> float: if not hint_lower: return 1.0 - project = get_registered_project(project_hint) + try: + project = get_registered_project(project_hint) + except Exception as exc: + log.warning( + "project_match_boost_resolution_failed", + project_hint=project_hint, + error=str(exc), + ) + project = None candidate_names = _project_scope_terms(project) if project is not None else {hint_lower} for candidate in candidate_names: if _metadata_has_term(metadata, candidate): diff --git a/src/atocore/retrieval/vector_store.py b/src/atocore/retrieval/vector_store.py index 6039b7b..573bde8 100644 --- a/src/atocore/retrieval/vector_store.py +++ b/src/atocore/retrieval/vector_store.py @@ -64,6 +64,18 @@ class VectorStore: self._collection.delete(ids=ids) log.debug("vectors_deleted", count=len(ids)) + def get_metadatas(self, ids: list[str]) -> dict: + """Fetch vector metadata by chunk IDs.""" + if not ids: + return {"ids": [], "metadatas": []} + return self._collection.get(ids=ids, include=["metadatas"]) + + def update_metadatas(self, ids: list[str], metadatas: list[dict]) -> None: + """Update vector 
metadata without re-embedding documents.""" + if ids: + self._collection.update(ids=ids, metadatas=metadatas) + log.debug("vector_metadatas_updated", count=len(ids)) + @property def count(self) -> int: return self._collection.count() diff --git a/tests/test_backfill_chunk_project_ids.py b/tests/test_backfill_chunk_project_ids.py new file mode 100644 index 0000000..47c5c26 --- /dev/null +++ b/tests/test_backfill_chunk_project_ids.py @@ -0,0 +1,154 @@ +"""Tests for explicit chunk project_id metadata backfill.""" + +import json + +import atocore.config as config +from atocore.models.database import get_connection, init_db +from scripts import backfill_chunk_project_ids as backfill + + +def _write_registry(tmp_path, monkeypatch): + vault_dir = tmp_path / "vault" + drive_dir = tmp_path / "drive" + config_dir = tmp_path / "config" + project_dir = vault_dir / "incoming" / "projects" / "p04-gigabit" + project_dir.mkdir(parents=True) + drive_dir.mkdir() + config_dir.mkdir() + registry_path = config_dir / "project-registry.json" + registry_path.write_text( + json.dumps( + { + "projects": [ + { + "id": "p04-gigabit", + "aliases": ["p04"], + "ingest_roots": [ + {"source": "vault", "subpath": "incoming/projects/p04-gigabit"} + ], + } + ] + } + ), + encoding="utf-8", + ) + monkeypatch.setenv("ATOCORE_VAULT_SOURCE_DIR", str(vault_dir)) + monkeypatch.setenv("ATOCORE_DRIVE_SOURCE_DIR", str(drive_dir)) + monkeypatch.setenv("ATOCORE_PROJECT_REGISTRY_PATH", str(registry_path)) + config.settings = config.Settings() + return project_dir + + +def _insert_chunk(file_path, metadata=None, chunk_id="chunk-1"): + with get_connection() as conn: + conn.execute( + """ + INSERT INTO source_documents (id, file_path, file_hash, title, doc_type, tags) + VALUES (?, ?, ?, ?, ?, ?) 
+ """, + ("doc-1", str(file_path), "hash", "Title", "markdown", "[]"), + ) + conn.execute( + """ + INSERT INTO source_chunks + (id, document_id, chunk_index, content, heading_path, char_count, metadata) + VALUES (?, ?, ?, ?, ?, ?, ?) + """, + ( + chunk_id, + "doc-1", + 0, + "content", + "Overview", + 7, + json.dumps(metadata if metadata is not None else {}), + ), + ) + + +class FakeVectorStore: + def __init__(self, metadatas): + self.metadatas = dict(metadatas) + self.updated = [] + + def get_metadatas(self, ids): + returned_ids = [chunk_id for chunk_id in ids if chunk_id in self.metadatas] + return { + "ids": returned_ids, + "metadatas": [self.metadatas[chunk_id] for chunk_id in returned_ids], + } + + def update_metadatas(self, ids, metadatas): + self.updated.append((list(ids), list(metadatas))) + for chunk_id, metadata in zip(ids, metadatas, strict=True): + self.metadatas[chunk_id] = metadata + + +def test_backfill_dry_run_is_non_mutating(tmp_data_dir, tmp_path, monkeypatch): + init_db() + project_dir = _write_registry(tmp_path, monkeypatch) + _insert_chunk(project_dir / "status.md") + + result = backfill.backfill(apply=False) + + assert result["updates"] == 1 + with get_connection() as conn: + row = conn.execute("SELECT metadata FROM source_chunks WHERE id = ?", ("chunk-1",)).fetchone() + assert json.loads(row["metadata"]) == {} + + +def test_backfill_apply_updates_chroma_then_sql(tmp_data_dir, tmp_path, monkeypatch): + init_db() + project_dir = _write_registry(tmp_path, monkeypatch) + _insert_chunk(project_dir / "status.md", metadata={"source_file": "status.md"}) + fake_store = FakeVectorStore({"chunk-1": {"source_file": "status.md"}}) + monkeypatch.setattr(backfill, "get_vector_store", lambda: fake_store) + + result = backfill.backfill(apply=True, require_chroma_snapshot=True) + + assert result["applied_updates"] == 1 + assert fake_store.metadatas["chunk-1"]["project_id"] == "p04-gigabit" + with get_connection() as conn: + row = conn.execute("SELECT metadata 
FROM source_chunks WHERE id = ?", ("chunk-1",)).fetchone()
+        assert json.loads(row["metadata"])["project_id"] == "p04-gigabit"
+
+
+def test_backfill_apply_requires_snapshot_confirmation(tmp_data_dir, tmp_path, monkeypatch):
+    init_db()
+    project_dir = _write_registry(tmp_path, monkeypatch)
+    _insert_chunk(project_dir / "status.md")
+
+    try:
+        backfill.backfill(apply=True)
+    except ValueError as exc:
+        assert "Chroma backup" in str(exc)
+    else:
+        raise AssertionError("Expected snapshot confirmation requirement")
+
+
+def test_backfill_missing_vector_skips_sql_update(tmp_data_dir, tmp_path, monkeypatch):
+    init_db()
+    project_dir = _write_registry(tmp_path, monkeypatch)
+    _insert_chunk(project_dir / "status.md")
+    fake_store = FakeVectorStore({})
+    monkeypatch.setattr(backfill, "get_vector_store", lambda: fake_store)
+
+    result = backfill.backfill(apply=True, require_chroma_snapshot=True)
+
+    assert result["updates"] == 1
+    assert result["applied_updates"] == 0
+    assert result["missing_vectors"] == 1
+    with get_connection() as conn:
+        row = conn.execute("SELECT metadata FROM source_chunks WHERE id = ?", ("chunk-1",)).fetchone()
+        assert json.loads(row["metadata"]) == {}
+
+
+def test_backfill_skips_malformed_metadata(tmp_data_dir, tmp_path, monkeypatch):
+    init_db()
+    project_dir = _write_registry(tmp_path, monkeypatch)
+    _insert_chunk(project_dir / "status.md", metadata=[])
+
+    result = backfill.backfill(apply=False)
+
+    assert result["updates"] == 0
+    assert result["malformed_metadata"] == 1
diff --git a/tests/test_ingestion.py b/tests/test_ingestion.py
index 3922d3d..1026995 100644
--- a/tests/test_ingestion.py
+++ b/tests/test_ingestion.py
@@ -1,8 +1,10 @@
 """Tests for the ingestion pipeline."""
 
+import json
+
 from atocore.ingestion.parser import parse_markdown
 from atocore.models.database import get_connection, init_db
-from atocore.ingestion.pipeline import ingest_file, ingest_folder
+from atocore.ingestion.pipeline import ingest_file, ingest_folder, ingest_project_folder
 
 
 def test_parse_markdown(sample_markdown):
@@ -69,6 +71,153 @@ def test_ingest_updates_changed(tmp_data_dir, sample_markdown):
     assert result["status"] == "ingested"
 
 
+def test_ingest_file_records_project_id_metadata(tmp_data_dir, sample_markdown, monkeypatch):
+    """Project-aware ingestion should tag DB and vector metadata exactly."""
+    init_db()
+
+    class FakeVectorStore:
+        def __init__(self):
+            self.metadatas = []
+
+        def add(self, ids, documents, metadatas):
+            self.metadatas.extend(metadatas)
+
+        def delete(self, ids):
+            return None
+
+    fake_store = FakeVectorStore()
+    monkeypatch.setattr("atocore.ingestion.pipeline.get_vector_store", lambda: fake_store)
+
+    result = ingest_file(sample_markdown, project_id="p04-gigabit")
+
+    assert result["status"] == "ingested"
+    assert fake_store.metadatas
+    assert all(meta["project_id"] == "p04-gigabit" for meta in fake_store.metadatas)
+
+    with get_connection() as conn:
+        rows = conn.execute("SELECT metadata FROM source_chunks").fetchall()
+        assert rows
+        assert all(
+            json.loads(row["metadata"])["project_id"] == "p04-gigabit"
+            for row in rows
+        )
+
+
+def test_ingest_file_derives_project_id_from_registry_root(tmp_data_dir, tmp_path, monkeypatch):
+    """Unscoped ingest should preserve ownership for files under registered roots."""
+    import atocore.config as config
+
+    vault_dir = tmp_path / "vault"
+    drive_dir = tmp_path / "drive"
+    config_dir = tmp_path / "config"
+    project_dir = vault_dir / "incoming" / "projects" / "p04-gigabit"
+    project_dir.mkdir(parents=True)
+    drive_dir.mkdir()
+    config_dir.mkdir()
+    note = project_dir / "status.md"
+    note.write_text(
+        "# Status\n\nCurrent project status with enough detail to create "
+        "a retrievable chunk for the ingestion pipeline test.",
+        encoding="utf-8",
+    )
+    registry_path = config_dir / "project-registry.json"
+    registry_path.write_text(
+        json.dumps(
+            {
+                "projects": [
+                    {
+                        "id": "p04-gigabit",
+                        "aliases": ["p04"],
+                        "ingest_roots": [
+                            {"source": "vault", "subpath": "incoming/projects/p04-gigabit"}
+                        ],
+                    }
+                ]
+            }
+        ),
+        encoding="utf-8",
+    )
+
+    class FakeVectorStore:
+        def __init__(self):
+            self.metadatas = []
+
+        def add(self, ids, documents, metadatas):
+            self.metadatas.extend(metadatas)
+
+        def delete(self, ids):
+            return None
+
+    fake_store = FakeVectorStore()
+    monkeypatch.setenv("ATOCORE_VAULT_SOURCE_DIR", str(vault_dir))
+    monkeypatch.setenv("ATOCORE_DRIVE_SOURCE_DIR", str(drive_dir))
+    monkeypatch.setenv("ATOCORE_PROJECT_REGISTRY_PATH", str(registry_path))
+    config.settings = config.Settings()
+    monkeypatch.setattr("atocore.ingestion.pipeline.get_vector_store", lambda: fake_store)
+
+    init_db()
+    result = ingest_file(note)
+
+    assert result["status"] == "ingested"
+    assert fake_store.metadatas
+    assert all(meta["project_id"] == "p04-gigabit" for meta in fake_store.metadatas)
+
+
+def test_ingest_file_logs_and_fails_open_when_project_derivation_fails(
+    tmp_data_dir,
+    sample_markdown,
+    monkeypatch,
+):
+    """A broken registry should be visible but should not block ingestion."""
+    init_db()
+    warnings = []
+
+    class FakeVectorStore:
+        def __init__(self):
+            self.metadatas = []
+
+        def add(self, ids, documents, metadatas):
+            self.metadatas.extend(metadatas)
+
+        def delete(self, ids):
+            return None
+
+    fake_store = FakeVectorStore()
+    monkeypatch.setattr("atocore.ingestion.pipeline.get_vector_store", lambda: fake_store)
+    monkeypatch.setattr(
+        "atocore.projects.registry.derive_project_id_for_path",
+        lambda path: (_ for _ in ()).throw(ValueError("registry broken")),
+    )
+    monkeypatch.setattr(
+        "atocore.ingestion.pipeline.log.warning",
+        lambda event, **kwargs: warnings.append((event, kwargs)),
+    )
+
+    result = ingest_file(sample_markdown)
+
+    assert result["status"] == "ingested"
+    assert fake_store.metadatas
+    assert all(meta["project_id"] == "" for meta in fake_store.metadatas)
+    assert warnings[0][0] == "project_id_derivation_failed"
+    assert "registry broken" in warnings[0][1]["error"]
+
+
+def test_ingest_project_folder_passes_project_id_to_files(tmp_data_dir, sample_folder, monkeypatch):
+    seen = []
+
+    def fake_ingest_file(path, project_id=""):
+        seen.append((path.name, project_id))
+        return {"file": str(path), "status": "ingested"}
+
+    monkeypatch.setattr("atocore.ingestion.pipeline.ingest_file", fake_ingest_file)
+    monkeypatch.setattr("atocore.ingestion.pipeline._purge_deleted_files", lambda *args, **kwargs: 0)
+
+    ingest_project_folder(sample_folder, project_id="p05-interferometer")
+
+    assert seen
+    assert {project_id for _, project_id in seen} == {"p05-interferometer"}
+
+
 def test_parse_markdown_uses_supplied_text(sample_markdown):
     """Parsing should be able to reuse pre-read content from ingestion."""
     latin_text = """---\ntags: parser\n---\n# Parser Title\n\nBody text."""
diff --git a/tests/test_project_registry.py b/tests/test_project_registry.py
index 0c5589e..aa5cd79 100644
--- a/tests/test_project_registry.py
+++ b/tests/test_project_registry.py
@@ -5,6 +5,7 @@ import json
 import atocore.config as config
 from atocore.projects.registry import (
     build_project_registration_proposal,
+    derive_project_id_for_path,
     get_registered_project,
     get_project_registry_template,
     list_registered_projects,
@@ -103,6 +104,98 @@ def test_project_registry_resolves_alias(tmp_path, monkeypatch):
     assert project.project_id == "p05-interferometer"
 
 
+def test_derive_project_id_for_path_uses_registered_roots(tmp_path, monkeypatch):
+    vault_dir = tmp_path / "vault"
+    drive_dir = tmp_path / "drive"
+    config_dir = tmp_path / "config"
+    project_dir = vault_dir / "incoming" / "projects" / "p04-gigabit"
+    project_dir.mkdir(parents=True)
+    drive_dir.mkdir()
+    config_dir.mkdir()
+    note = project_dir / "status.md"
+    note.write_text("# Status\n\nCurrent work.", encoding="utf-8")
+
+    registry_path = config_dir / "project-registry.json"
+    registry_path.write_text(
+        json.dumps(
+            {
+                "projects": [
+                    {
+                        "id": "p04-gigabit",
+                        "aliases": ["p04"],
+                        "ingest_roots": [
+                            {"source": "vault", "subpath": "incoming/projects/p04-gigabit"}
+                        ],
+                    }
+                ]
+            }
+        ),
+        encoding="utf-8",
+    )
+
+    monkeypatch.setenv("ATOCORE_VAULT_SOURCE_DIR", str(vault_dir))
+    monkeypatch.setenv("ATOCORE_DRIVE_SOURCE_DIR", str(drive_dir))
+    monkeypatch.setenv("ATOCORE_PROJECT_REGISTRY_PATH", str(registry_path))
+
+    original_settings = config.settings
+    try:
+        config.settings = config.Settings()
+        assert derive_project_id_for_path(note) == "p04-gigabit"
+        assert derive_project_id_for_path(tmp_path / "elsewhere.md") == ""
+    finally:
+        config.settings = original_settings
+
+
+def test_project_registry_rejects_cross_project_ingest_root_overlap(tmp_path, monkeypatch):
+    vault_dir = tmp_path / "vault"
+    drive_dir = tmp_path / "drive"
+    config_dir = tmp_path / "config"
+    vault_dir.mkdir()
+    drive_dir.mkdir()
+    config_dir.mkdir()
+
+    registry_path = config_dir / "project-registry.json"
+    registry_path.write_text(
+        json.dumps(
+            {
+                "projects": [
+                    {
+                        "id": "parent",
+                        "aliases": [],
+                        "ingest_roots": [
+                            {"source": "vault", "subpath": "incoming/projects/parent"}
+                        ],
+                    },
+                    {
+                        "id": "child",
+                        "aliases": [],
+                        "ingest_roots": [
+                            {"source": "vault", "subpath": "incoming/projects/parent/child"}
+                        ],
+                    },
+                ]
+            }
+        ),
+        encoding="utf-8",
+    )
+
+    monkeypatch.setenv("ATOCORE_VAULT_SOURCE_DIR", str(vault_dir))
+    monkeypatch.setenv("ATOCORE_DRIVE_SOURCE_DIR", str(drive_dir))
+    monkeypatch.setenv("ATOCORE_PROJECT_REGISTRY_PATH", str(registry_path))
+
+    original_settings = config.settings
+    try:
+        config.settings = config.Settings()
+        try:
+            list_registered_projects()
+        except ValueError as exc:
+            assert "ingest root overlap" in str(exc)
+        else:
+            raise AssertionError("Expected overlapping ingest roots to raise")
+    finally:
+        config.settings = original_settings
+
+
 def test_refresh_registered_project_ingests_registered_roots(tmp_path, monkeypatch):
     vault_dir = tmp_path / "vault"
     drive_dir = tmp_path / "drive"
@@ -133,8 +226,8 @@ def test_refresh_registered_project_ingests_registered_roots(tmp_path, monkeypat
 
     calls = []
 
-    def fake_ingest_folder(path, purge_deleted=True):
-        calls.append((str(path), purge_deleted))
+    def fake_ingest_folder(path, purge_deleted=True, project_id=""):
+        calls.append((str(path), purge_deleted, project_id))
         return [{"file": str(path / "README.md"), "status": "ingested"}]
 
     monkeypatch.setenv("ATOCORE_VAULT_SOURCE_DIR", str(vault_dir))
@@ -144,7 +237,7 @@ def test_refresh_registered_project_ingests_registered_roots(tmp_path, monkeypat
     original_settings = config.settings
     try:
         config.settings = config.Settings()
-        monkeypatch.setattr("atocore.projects.registry.ingest_folder", fake_ingest_folder)
+        monkeypatch.setattr("atocore.ingestion.pipeline.ingest_project_folder", fake_ingest_folder)
         result = refresh_registered_project("polisher")
     finally:
         config.settings = original_settings
@@ -153,6 +246,7 @@ def test_refresh_registered_project_ingests_registered_roots(tmp_path, monkeypat
     assert len(calls) == 1
     assert calls[0][0].endswith("p06-polisher")
     assert calls[0][1] is False
+    assert calls[0][2] == "p06-polisher"
     assert result["roots"][0]["status"] == "ingested"
     assert result["status"] == "ingested"
     assert result["roots_ingested"] == 1
@@ -188,7 +282,7 @@ def test_refresh_registered_project_reports_nothing_to_ingest_when_all_missing(
         encoding="utf-8",
     )
 
-    def fail_ingest_folder(path, purge_deleted=True):
+    def fail_ingest_folder(path, purge_deleted=True, project_id=""):
         raise AssertionError(f"ingest_folder should not be called for missing root: {path}")
 
     monkeypatch.setenv("ATOCORE_VAULT_SOURCE_DIR", str(vault_dir))
@@ -198,7 +292,7 @@ def test_refresh_registered_project_reports_nothing_to_ingest_when_all_missing(
     original_settings = config.settings
     try:
         config.settings = config.Settings()
-        monkeypatch.setattr("atocore.projects.registry.ingest_folder", fail_ingest_folder)
+        monkeypatch.setattr("atocore.ingestion.pipeline.ingest_project_folder", fail_ingest_folder)
         result = refresh_registered_project("ghost")
     finally:
         config.settings = original_settings
@@ -238,7 +332,7 @@ def test_refresh_registered_project_reports_partial_status(tmp_path, monkeypatch
         encoding="utf-8",
    )
 
-    def fake_ingest_folder(path, purge_deleted=True):
+    def fake_ingest_folder(path, purge_deleted=True, project_id=""):
         return [{"file": str(path / "README.md"), "status": "ingested"}]
 
     monkeypatch.setenv("ATOCORE_VAULT_SOURCE_DIR", str(vault_dir))
@@ -248,7 +342,7 @@ def test_refresh_registered_project_reports_partial_status(tmp_path, monkeypatch
     original_settings = config.settings
     try:
         config.settings = config.Settings()
-        monkeypatch.setattr("atocore.projects.registry.ingest_folder", fake_ingest_folder)
+        monkeypatch.setattr("atocore.ingestion.pipeline.ingest_project_folder", fake_ingest_folder)
         result = refresh_registered_project("mixed")
     finally:
         config.settings = original_settings
diff --git a/tests/test_retrieval.py b/tests/test_retrieval.py
index 9049fae..f108caa 100644
--- a/tests/test_retrieval.py
+++ b/tests/test_retrieval.py
@@ -384,6 +384,146 @@ def test_retrieve_project_scope_uses_path_segments_not_substrings(monkeypatch):
     assert [r.chunk_id for r in results] == ["chunk-target", "chunk-global"]
 
 
+def test_retrieve_project_scope_prefers_exact_project_id(monkeypatch):
+    target_project = type(
+        "Project",
+        (),
+        {
+            "project_id": "p04-gigabit",
+            "aliases": ("p04", "gigabit"),
+            "ingest_roots": (),
+        },
+    )()
+    other_project = type(
+        "Project",
+        (),
+        {
+            "project_id": "p06-polisher",
+            "aliases": ("p06", "polisher"),
+            "ingest_roots": (),
+        },
+    )()
+
+    class FakeStore:
+        def query(self, query_embedding, top_k=10, where=None):
+            return {
+                "ids": [["chunk-target", "chunk-other", "chunk-global"]],
+                "documents": [["target doc", "other doc", "global doc"]],
+                "metadatas": [[
+                    {
+                        "heading_path": "Overview",
+                        "source_file": "legacy/unhelpful-path.md",
+                        "tags": "[]",
+                        "title": "Target",
+                        "project_id": "p04-gigabit",
+                        "document_id": "doc-a",
+                    },
+                    {
+                        "heading_path": "Overview",
+                        "source_file": "p04-gigabit/title-poisoned.md",
+                        "tags": '["p04-gigabit"]',
+                        "title": "Looks target-owned but is explicit p06",
+                        "project_id": "p06-polisher",
+                        "document_id": "doc-b",
+                    },
+                    {
+                        "heading_path": "Overview",
+                        "source_file": "shared/global.md",
+                        "tags": "[]",
+                        "title": "Shared",
+                        "project_id": "",
+                        "document_id": "doc-global",
+                    },
+                ]],
+                "distances": [[0.2, 0.19, 0.21]],
+            }
+
+    monkeypatch.setattr("atocore.retrieval.retriever.get_vector_store", lambda: FakeStore())
+    monkeypatch.setattr("atocore.retrieval.retriever.embed_query", lambda query: [0.0, 0.1])
+    monkeypatch.setattr(
+        "atocore.retrieval.retriever._existing_chunk_ids",
+        lambda chunk_ids: set(chunk_ids),
+    )
+    monkeypatch.setattr(
+        "atocore.retrieval.retriever.get_registered_project",
+        lambda project_name: target_project,
+    )
+    monkeypatch.setattr(
+        "atocore.retrieval.retriever.load_project_registry",
+        lambda: [target_project, other_project],
+    )
+
+    results = retrieve("mirror architecture", top_k=3, project_hint="p04")
+
+    assert [r.chunk_id for r in results] == ["chunk-target", "chunk-global"]
+
+
+def test_retrieve_empty_project_id_falls_back_to_path_ownership(monkeypatch):
+    target_project = type(
+        "Project",
+        (),
+        {
+            "project_id": "p04-gigabit",
+            "aliases": ("p04", "gigabit"),
+            "ingest_roots": (),
+        },
+    )()
+    other_project = type(
+        "Project",
+        (),
+        {
+            "project_id": "p05-interferometer",
+            "aliases": ("p05", "interferometer"),
+            "ingest_roots": (),
+        },
+    )()
+
+    class FakeStore:
+        def query(self, query_embedding, top_k=10, where=None):
+            return {
+                "ids": [["chunk-target", "chunk-other"]],
+                "documents": [["target doc", "other doc"]],
+                "metadatas": [[
+                    {
+                        "heading_path": "Overview",
+                        "source_file": "p04-gigabit/status.md",
+                        "tags": "[]",
+                        "title": "Target",
+                        "project_id": "",
+                        "document_id": "doc-a",
+                    },
+                    {
+                        "heading_path": "Overview",
+                        "source_file": "p05-interferometer/status.md",
+                        "tags": "[]",
+                        "title": "Other",
+                        "project_id": "",
+                        "document_id": "doc-b",
+                    },
+                ]],
+                "distances": [[0.2, 0.19]],
+            }
+
+    monkeypatch.setattr("atocore.retrieval.retriever.get_vector_store", lambda: FakeStore())
+    monkeypatch.setattr("atocore.retrieval.retriever.embed_query", lambda query: [0.0, 0.1])
+    monkeypatch.setattr(
+        "atocore.retrieval.retriever._existing_chunk_ids",
+        lambda chunk_ids: set(chunk_ids),
+    )
+    monkeypatch.setattr(
+        "atocore.retrieval.retriever.get_registered_project",
+        lambda project_name: target_project,
+    )
+    monkeypatch.setattr(
+        "atocore.retrieval.retriever.load_project_registry",
+        lambda: [target_project, other_project],
+    )
+
+    results = retrieve("mirror architecture", top_k=2, project_hint="p04")
+
+    assert [r.chunk_id for r in results] == ["chunk-target"]
+
+
 def test_retrieve_unknown_project_hint_does_not_widen_or_filter(monkeypatch):
     class FakeStore:
         def query(self, query_embedding, top_k=10, where=None):
@@ -426,6 +566,59 @@ def test_retrieve_unknown_project_hint_does_not_widen_or_filter(monkeypatch):
     assert [r.chunk_id for r in results] == ["chunk-a", "chunk-b"]
 
 
+def test_retrieve_fails_open_when_project_scope_resolution_fails(monkeypatch):
+    warnings = []
+
+    class FakeStore:
+        def query(self, query_embedding, top_k=10, where=None):
+            assert top_k == 2
+            return {
+                "ids": [["chunk-a", "chunk-b"]],
+                "documents": [["doc a", "doc b"]],
+                "metadatas": [[
+                    {
+                        "heading_path": "Overview",
+                        "source_file": "p04-gigabit/file.md",
+                        "tags": "[]",
+                        "title": "A",
+                        "document_id": "doc-a",
+                    },
+                    {
+                        "heading_path": "Overview",
+                        "source_file": "p05-interferometer/file.md",
+                        "tags": "[]",
+                        "title": "B",
+                        "document_id": "doc-b",
+                    },
+                ]],
+                "distances": [[0.2, 0.21]],
+            }
+
+    monkeypatch.setattr("atocore.retrieval.retriever.get_vector_store", lambda: FakeStore())
+    monkeypatch.setattr("atocore.retrieval.retriever.embed_query", lambda query: [0.0, 0.1])
+    monkeypatch.setattr(
+        "atocore.retrieval.retriever._existing_chunk_ids",
+        lambda chunk_ids: set(chunk_ids),
+    )
+    monkeypatch.setattr(
+        "atocore.retrieval.retriever.get_registered_project",
+        lambda project_name: (_ for _ in ()).throw(ValueError("registry overlap")),
+    )
+    monkeypatch.setattr(
+        "atocore.retrieval.retriever.log.warning",
+        lambda event, **kwargs: warnings.append((event, kwargs)),
+    )
+
+    results = retrieve("overview", top_k=2, project_hint="p04")
+
+    assert [r.chunk_id for r in results] == ["chunk-a", "chunk-b"]
+    assert {warning[0] for warning in warnings} == {
+        "project_scope_resolution_failed",
+        "project_match_boost_resolution_failed",
+    }
+    assert all("registry overlap" in warning[1]["error"] for warning in warnings)
+
+
 def test_retrieve_downranks_archive_noise_and_prefers_high_signal_paths(monkeypatch):
     class FakeStore:
         def query(self, query_embedding, top_k=10, where=None):