merge: explicit project id retrieval metadata

2026-04-24 11:36:19 -04:00
parent f44a211497 05c11fd4fb
commit 867a1abfaa
13 changed files with 916 additions and 26 deletions
--- a/DEV-LEDGER.md
+++ b/DEV-LEDGER.md
@@ -6,15 +6,15 @@

 ## Orientation

- **live_sha** (Dalidou `/health` build_sha): `2b86543` (verified 2026-04-23T15:20:53Z post-R14 deploy; status=ok)
- **last_updated**: 2026-04-24 by Codex (audit-improvements foundation branch; live status refreshed)
- **main_tip**: `2b86543`
- **test_count**: 553
- **harness**: `18/20 PASS` on live Dalidou plus 1 known content gap and 1 blocking project-bleed guard pending deploy of this branch
+- **live_sha** (Dalidou `/health` build_sha): `f44a211` (verified 2026-04-24T14:48:44Z post audit-improvements deploy; status=ok)
+- **last_updated**: 2026-04-24 by Codex (retrieval boundary deployed; project_id metadata branch started)
+- **main_tip**: `f44a211`
+- **test_count**: 567 on `codex/project-id-metadata-retrieval` (deployed main baseline: 553)
+- **harness**: `19/20 PASS` on live Dalidou, 0 blocking failures, 1 known content gap (`p04-constraints`)
 - **vectors**: 33,253
 - **active_memories**: 290 (`/admin/dashboard` 2026-04-24; note integrity panel reports a separate active_memory_count=951 and needs reconciliation)
 - **candidate_memories**: 0 (triage queue drained)
- **interactions**: 950 (`/admin/dashboard` 2026-04-24)
+- **interactions**: 951 (`/admin/dashboard` 2026-04-24)
 - **registered_projects**: atocore, p04-gigabit, p05-interferometer, p06-polisher, atomizer-v2, abb-space (aliased p08)
 - **project_state_entries**: 128 across registered projects (`/admin/dashboard` 2026-04-24)
 - **entities**: 66 (up from 35 — V1-0 backfill + ongoing work; 0 open conflicts)
@@ -170,6 +170,12 @@ One branch `codex/extractor-eval-loop` for Day 1-5, a second `codex/retrieval-ha

 ## Session Log

+- **2026-04-24 Codex (retrieval boundary deployed + project_id metadata tranche)** Merged `codex/audit-improvements-foundation` to `main` as `f44a211` and pushed to Dalidou Gitea. Took pre-deploy runtime backup `/srv/storage/atocore/backups/snapshots/20260424T144810Z` (DB + registry, no Chroma). Deployed via `papa@dalidou` canonical `deploy/dalidou/deploy.sh`; live `/health` reports build_sha `f44a2114970008a7eec4e7fc2860c8f072914e38`, build_time `2026-04-24T14:48:44Z`, status ok. Post-deploy retrieval harness: 20 fixtures, 19 pass, 0 blocking failures, 1 known issue (`p04-constraints`). The former blocker `p05-broad-status-no-atomizer` now passes. Manual p05 `context-build "current status"` spot check shows no p04/Atomizer source bleed in retrieved chunks. Started follow-up branch `codex/project-id-metadata-retrieval`: registered-project ingestion now writes explicit `project_id` into DB chunk metadata and Chroma vector metadata; retrieval prefers exact `project_id` when present and keeps path/tag matching as legacy fallback; added dry-run-by-default `scripts/backfill_chunk_project_ids.py` to backfill SQLite + Chroma metadata; added tests for project-id ingestion, registered refresh propagation, exact project-id retrieval, and collision fallback. Verified targeted suite (`test_ingestion.py`, `test_project_registry.py`, `test_retrieval.py`): 36 passed. Verified full suite: 556 passed in 72.44s. Branch not merged or deployed yet.
+
+- **2026-04-24 Codex (project_id audit response)** Applied independent-audit fixes on `codex/project-id-metadata-retrieval`. Closed the nightly `/ingest/sources` clobber risk by adding registry-level `derive_project_id_for_path()` and making unscoped `ingest_file()` derive ownership from registered ingest roots when possible; `refresh_registered_project()` still passes the canonical project id directly. Changed retrieval so empty `project_id` falls through to legacy path/tag ownership instead of short-circuiting as unowned. Hardened `scripts/backfill_chunk_project_ids.py`: `--apply` now requires `--chroma-snapshot-confirmed`, runs Chroma metadata updates before SQLite writes, batches updates, skips and reports missing vectors, skips and reports malformed metadata, reports already-tagged rows, and turns missing ingestion tables into a JSON `db_warning` instead of a traceback. Added tests for auto-derive ingestion, empty-project fallback, ingest-root overlap rejection, and backfill dry-run/apply/snapshot/missing-vector/malformed cases. Verified targeted suite (`test_backfill_chunk_project_ids.py`, `test_ingestion.py`, `test_project_registry.py`, `test_retrieval.py`): 45 passed. Verified full suite: 565 passed in 73.16s. Local dry-run on empty/default data returns 0 updates with `db_warning` rather than crashing. Branch still not merged/deployed.
+
+- **2026-04-24 Codex (project_id final hardening before merge)** Applied the final independent-review P2s on `codex/project-id-metadata-retrieval`: `ingest_file()` still fails open when project-id derivation fails, but now emits `project_id_derivation_failed` with file path and error; retrieval now catches registry failures both at project-scope resolution and the soft project-match boost path, logs warnings, and serves unscoped rather than raising. Added regression tests for both fail-open paths. Verified targeted suite (`test_ingestion.py`, `test_retrieval.py`, `test_backfill_chunk_project_ids.py`, `test_project_registry.py`): 47 passed. Verified full suite: 567 passed in 79.66s. Branch still not merged/deployed.
+
 - **2026-04-24 Codex (audit improvements foundation)** Started implementation of the audit recommendations on branch `codex/audit-improvements-foundation` from `origin/main@c53e61e`. First tranche: registry-aware project-scoped retrieval filtering (`ATOCORE_RANK_PROJECT_SCOPE_FILTER`, widened candidate pull before filtering), eval harness known-issue lane, two p05 project-bleed fixtures, `scripts/live_status.py`, README/current-state/master-plan status refresh. Verified `pytest -q`: 550 passed in 67.11s. Live retrieval harness against undeployed production: 20 fixtures, 18 pass, 1 known issue (`p04-constraints` Zerodur/1.2 content gap), 1 blocking guard (`p05-broad-status-no-atomizer`) still failing because production has not yet deployed the retrieval filter and currently pulls `P04-GigaBIT-M1-KB-design` into broad p05 status context. Live dashboard refresh: health ok, build `2b86543`, docs 1748, chunks/vectors 33253, interactions 948, active memories 289, candidates 0, project_state total 128. Noted count discrepancy: dashboard memories.active=289 while integrity active_memory_count=951; schedule reconciliation in a follow-up.

 - **2026-04-24 Codex (independent-audit hardening)** Applied the Opus independent audit's fast follow-ups before merge/deploy. Closed the two P1s by making project-scope ownership path/tag-based only, adding path-segment/tag-exact matching to avoid short-alias substring collisions, and keeping title/heading text out of provenance decisions. Added regression tests for title poisoning, substring collision, and unknown-project fallback. Added retrieval log fields `raw_results_count`, `post_filter_count`, `post_filter_dropped`, and `underfilled`. Added retrieval-eval run metadata (`generated_at`, `base_url`, `/health`) and `live_status.py` auth-token/status support. README now documents the ranking knobs and clarifies that the hard scope filter and soft project match boost are separate controls. Verified `pytest -q`: 553 passed in 66.07s. Live production remains expected-predeploy: 20 fixtures, 18 pass, 1 known content gap, 1 blocking p05 bleed guard. Latest live dashboard: build `2b86543`, docs 1748, chunks/vectors 33253, interactions 950, active memories 290, candidates 0, project_state total 128.
--- a/README.md
+++ b/README.md
@@ -111,6 +111,7 @@ pytest
 - `scripts/atocore_client.py` provides a live API client for project refresh, project-state inspection, and retrieval-quality audits.
 - `scripts/retrieval_eval.py` runs the live retrieval/context harness, separates blocking failures from known content gaps, and stamps JSON output with target/build metadata.
 - `scripts/live_status.py` renders a compact read-only status report from `/health`, `/stats`, `/projects`, and `/admin/dashboard`; set `ATOCORE_AUTH_TOKEN` or `--auth-token` when those endpoints are gated.
+- `scripts/backfill_chunk_project_ids.py` dry-runs or applies explicit `project_id` metadata backfills for SQLite chunks and Chroma vectors; `--apply` requires a confirmed Chroma snapshot.
 - `docs/operations.md` captures the current operational priority order: retrieval quality, Wave 2 trusted-operational ingestion, AtoDrive scoping, and restore validation.
 - `DEV-LEDGER.md` is the fast-moving source of operational truth during active development; copy claims into docs only after checking the live service.

--- a/docs/current-state.md
+++ b/docs/current-state.md
@@ -1,5 +1,9 @@
 # AtoCore - Current State (2026-04-24)

+Update 2026-04-24: audit-improvements deployed as `f44a211`; live harness is
+19/20 with 0 blocking failures and 1 known content gap. Active follow-up branch
+`codex/project-id-metadata-retrieval` is at 567 passing tests.
+
 Live deploy: `2b86543` · Dalidou health: ok · Harness: 18/20 with 1 known
 content gap and 1 current blocking project-bleed guard · Tests: 553 passing.

@@ -68,7 +72,7 @@ Last nightly run (2026-04-19 03:00 UTC): **31 promoted · 39 rejected · 0 needs
 ## Known gaps (honest, refreshed 2026-04-24)

 1. **Capture surface is Claude-Code-and-OpenClaw only.** Conversations in Claude Desktop, Claude.ai web, phone, or any other LLM UI are NOT captured. Example: the rotovap/mushroom chat yesterday never reached AtoCore because no hook fired. See Q4 below.
-2. **Project-scoped retrieval still needs deployment verification.** The April 24 audit reproduced cross-project competition on broad p05 prompts. The current branch adds registry-aware project filtering and a harness guard; verify after deploy.
+2. **Project-scoped retrieval guard is deployed and passing.** The April 24 p05 broad-status bleed guard now passes on live Dalidou. The active follow-up branch adds explicit `project_id` chunk/vector metadata so the deployed path/tag heuristic can become a legacy fallback.
 3. **Human interface is useful but not yet the V1 Human Mirror.** Wiki/dashboard pages exist, but the spec routes, deterministic mirror files, disputed markers, and curated annotations remain V1-D work.
 4. **Harness known issue:** `p04-constraints` wants "Zerodur" and "1.2"; live retrieval surfaces related constraints but not those exact strings. Treat as content/state gap until fixed.
 5. **Formal docs lag the ledger during fast work.** Use `DEV-LEDGER.md` and `python scripts/live_status.py` for live truth, then copy verified claims into these docs.
--- a/docs/master-plan-status.md
+++ b/docs/master-plan-status.md
@@ -135,7 +135,7 @@ deferred from the shared client until their workflows are exercised.

 - canonical AtoCore runtime on Dalidou (`2b86543`, deploy.sh verified)
 - 33,253 vectors across 6 registered projects
- 950 captured interactions as of the 2026-04-24 live dashboard; refresh
+- 951 captured interactions as of the 2026-04-24 live dashboard; refresh
  exact live counts with
  `python scripts/live_status.py`
 - 6 registered projects:
@@ -150,10 +150,9 @@ deferred from the shared client until their workflows are exercised.
  dashboard
 - context pack assembly with 4 tiers: Trusted Project State > identity/preference > project memories > retrieved chunks
 - query-relevance memory ranking with overlap-density scoring
- retrieval eval harness: 20 fixtures; current live has 18 pass, 1 known
-  content gap, and 1 blocking cross-project bleed guard targeted by the
-  current retrieval-scoping branch
- 553 tests passing on the audit-improvements branch
+- retrieval eval harness: 20 fixtures; current live has 19 pass, 1 known
+  content gap, and 0 blocking failures after the audit-improvements deploy
+- 567 tests passing on the active `codex/project-id-metadata-retrieval` branch
 - nightly pipeline: backup → cleanup → rsync → OpenClaw import → vault refresh → extract → triage → **auto-promote/expire** → weekly synth/lint → **retrieval harness** → **pipeline summary to project state**
 - Phase 10 operational: reinforcement-based auto-promotion (ref_count ≥ 3, confidence ≥ 0.7) + stale candidate expiry (14 days unreinforced)
 - pipeline health visible in dashboard: interaction totals by client, pipeline last_run, harness results, triage stats
--- a/scripts/backfill_chunk_project_ids.py
+++ b/scripts/backfill_chunk_project_ids.py
@@ -0,0 +1,178 @@
+"""Backfill explicit project_id into chunk and vector metadata.
+
+Dry-run by default. The script derives ownership from the registered project
+ingest roots and updates both SQLite source_chunks.metadata and Chroma vector
+metadata only when --apply is provided.
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import sqlite3
+import sys
+from pathlib import Path
+
+sys.path.insert(0, str(Path(__file__).resolve().parents[1] / "src"))
+
+from atocore.models.database import get_connection  # noqa: E402
+from atocore.projects.registry import derive_project_id_for_path  # noqa: E402
+from atocore.retrieval.vector_store import get_vector_store  # noqa: E402
+
+DEFAULT_BATCH_SIZE = 500
+
+
+def _decode_metadata(raw: str | None) -> dict | None:
+    if not raw:
+        return {}
+    try:
+        parsed = json.loads(raw)
+    except json.JSONDecodeError:
+        return None
+    return parsed if isinstance(parsed, dict) else None
+
+
+def _chunk_rows() -> tuple[list[dict], str]:
+    try:
+        with get_connection() as conn:
+            rows = conn.execute(
+                """
+                SELECT
+                    sc.id AS chunk_id,
+                    sc.metadata AS chunk_metadata,
+                    sd.file_path AS file_path
+                FROM source_chunks sc
+                JOIN source_documents sd ON sd.id = sc.document_id
+                ORDER BY sd.file_path, sc.chunk_index
+                """
+            ).fetchall()
+    except sqlite3.OperationalError as exc:
+        if "source_chunks" in str(exc) or "source_documents" in str(exc):
+            return [], f"missing ingestion tables: {exc}"
+        raise
+    return [dict(row) for row in rows], ""
+
+
+def _batches(items: list, batch_size: int) -> list[list]:
+    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
+
+
+def backfill(
+    apply: bool = False,
+    project_filter: str = "",
+    batch_size: int = DEFAULT_BATCH_SIZE,
+    require_chroma_snapshot: bool = False,
+) -> dict:
+    rows, db_warning = _chunk_rows()
+    updates: list[tuple[str, str, dict]] = []
+    by_project: dict[str, int] = {}
+    skipped_unowned = 0
+    already_tagged = 0
+    malformed_metadata = 0
+
+    for row in rows:
+        project_id = derive_project_id_for_path(row["file_path"])
+        if project_filter and project_id != project_filter:
+            continue
+        if not project_id:
+            skipped_unowned += 1
+            continue
+        metadata = _decode_metadata(row["chunk_metadata"])
+        if metadata is None:
+            malformed_metadata += 1
+            continue
+        if metadata.get("project_id") == project_id:
+            already_tagged += 1
+            continue
+        metadata["project_id"] = project_id
+        updates.append((row["chunk_id"], project_id, metadata))
+        by_project[project_id] = by_project.get(project_id, 0) + 1
+
+    missing_vectors: list[str] = []
+    applied_updates = 0
+    if apply and updates:
+        if not require_chroma_snapshot:
+            raise ValueError(
+                "--apply requires --chroma-snapshot-confirmed after taking a Chroma backup"
+            )
+        vector_store = get_vector_store()
+        for batch in _batches(updates, max(1, batch_size)):
+            chunk_ids = [chunk_id for chunk_id, _, _ in batch]
+            vector_payload = vector_store.get_metadatas(chunk_ids)
+            existing_vector_metadata = {
+                chunk_id: metadata
+                for chunk_id, metadata in zip(
+                    vector_payload.get("ids", []),
+                    vector_payload.get("metadatas", []),
+                    strict=False,
+                )
+                if isinstance(metadata, dict)
+            }
+
+            vector_ids = []
+            vector_metadatas = []
+            sql_updates = []
+            for chunk_id, project_id, chunk_metadata in batch:
+                vector_metadata = existing_vector_metadata.get(chunk_id)
+                if vector_metadata is None:
+                    missing_vectors.append(chunk_id)
+                    continue
+                vector_metadata = dict(vector_metadata)
+                vector_metadata["project_id"] = project_id
+                vector_ids.append(chunk_id)
+                vector_metadatas.append(vector_metadata)
+                sql_updates.append((json.dumps(chunk_metadata, ensure_ascii=True), chunk_id))
+
+            if not vector_ids:
+                continue
+
+            vector_store.update_metadatas(vector_ids, vector_metadatas)
+            with get_connection() as conn:
+                cursor = conn.executemany(
+                    "UPDATE source_chunks SET metadata = ? WHERE id = ?",
+                    sql_updates,
+                )
+                if cursor.rowcount != len(sql_updates):
+                    raise RuntimeError(
+                        f"SQLite rowcount mismatch: {cursor.rowcount} != {len(sql_updates)}"
+                    )
+            applied_updates += len(sql_updates)
+
+    return {
+        "apply": apply,
+        "total_chunks": len(rows),
+        "updates": len(updates),
+        "applied_updates": applied_updates,
+        "already_tagged": already_tagged,
+        "skipped_unowned": skipped_unowned,
+        "malformed_metadata": malformed_metadata,
+        "missing_vectors": len(missing_vectors),
+        "db_warning": db_warning,
+        "by_project": dict(sorted(by_project.items())),
+    }
+
+
+def main() -> int:
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument("--apply", action="store_true", help="write SQLite and Chroma metadata updates")
+    parser.add_argument("--project", default="", help="optional canonical project_id filter")
+    parser.add_argument("--batch-size", type=int, default=DEFAULT_BATCH_SIZE)
+    parser.add_argument(
+        "--chroma-snapshot-confirmed",
+        action="store_true",
+        help="required with --apply; confirms a Chroma snapshot exists",
+    )
+    args = parser.parse_args()
+
+    payload = backfill(
+        apply=args.apply,
+        project_filter=args.project.strip(),
+        batch_size=args.batch_size,
+        require_chroma_snapshot=args.chroma_snapshot_confirmed,
+    )
+    print(json.dumps(payload, indent=2, ensure_ascii=True))
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
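
Review note on the write ordering in `backfill()` above: within each batch the script updates Chroma metadata first and SQLite second, and it leaves the SQLite row untouched for any chunk whose vector is missing. Because the next run rebuilds its update list from SQLite metadata, a row that was not written is picked up again. The following is a minimal sketch of that ordering, assuming plain dicts in place of both stores; `apply_batch`, `vectors`, and `sql_rows` are hypothetical names used only for illustration and are not part of the commit.

```python
import json


def apply_batch(batch, vectors: dict, sql_rows: dict) -> list[str]:
    """Tag vectors first, then SQL rows; return chunk IDs that have no vector.

    `vectors` maps chunk_id -> metadata dict (stand-in for Chroma).
    `sql_rows` maps chunk_id -> JSON metadata string (stand-in for SQLite).
    """
    missing = []
    ready = []
    for chunk_id, project_id in batch:
        if chunk_id not in vectors:
            # No vector to tag: report it and leave the SQL row untouched.
            missing.append(chunk_id)
            continue
        ready.append((chunk_id, project_id))

    # Step 1: vector metadata (copy each dict so the caller's input is not mutated).
    for chunk_id, project_id in ready:
        vectors[chunk_id] = {**vectors[chunk_id], "project_id": project_id}

    # Step 2: SQL metadata, only for chunks whose vector was updated.
    for chunk_id, project_id in ready:
        metadata = json.loads(sql_rows[chunk_id])
        metadata["project_id"] = project_id
        sql_rows[chunk_id] = json.dumps(metadata)

    return missing
```

If step 1 fails in the real script, no SQLite row in that batch has been changed, so a rerun retries the same chunks. The sketch does not model partial failures inside the vector store itself.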
--- a/src/atocore/ingestion/pipeline.py
+++ b/src/atocore/ingestion/pipeline.py
@@ -32,10 +32,23 @@ def exclusive_ingestion():
        _INGESTION_LOCK.release()


-def ingest_file(file_path: Path) -> dict:
+def ingest_file(file_path: Path, project_id: str = "") -> dict:
    """Ingest a single markdown file. Returns stats."""
    start = time.time()
    file_path = file_path.resolve()
+    project_id = (project_id or "").strip()
+    if not project_id:
+        try:
+            from atocore.projects.registry import derive_project_id_for_path
+
+            project_id = derive_project_id_for_path(file_path)
+        except Exception as exc:
+            log.warning(
+                "project_id_derivation_failed",
+                file_path=str(file_path),
+                error=str(exc),
+            )
+            project_id = ""

    if not file_path.exists():
        raise FileNotFoundError(f"File not found: {file_path}")
@@ -65,6 +78,7 @@ def ingest_file(file_path: Path) -> dict:
        "source_file": str(file_path),
        "tags": parsed.tags,
        "title": parsed.title,
+        "project_id": project_id,
    }
    chunks = chunk_markdown(parsed.body, base_metadata=base_meta)

@@ -116,6 +130,7 @@ def ingest_file(file_path: Path) -> dict:
                        "source_file": str(file_path),
                        "tags": json.dumps(parsed.tags),
                        "title": parsed.title,
+                        "project_id": project_id,
                    })

                    conn.execute(
@@ -173,7 +188,17 @@ def ingest_folder(folder_path: Path, purge_deleted: bool = True) -> list[dict]:
        purge_deleted: If True, remove DB/vector entries for files
                       that no longer exist on disk.
    """
+    return ingest_project_folder(folder_path, purge_deleted=purge_deleted, project_id="")
+
+
+def ingest_project_folder(
+    folder_path: Path,
+    purge_deleted: bool = True,
+    project_id: str = "",
+) -> list[dict]:
+    """Ingest a folder and annotate chunks with an optional project id."""
    folder_path = folder_path.resolve()
+    project_id = (project_id or "").strip()
    if not folder_path.is_dir():
        raise NotADirectoryError(f"Not a directory: {folder_path}")

@@ -187,7 +212,7 @@ def ingest_folder(folder_path: Path, purge_deleted: bool = True) -> list[dict]:
    # Ingest new/changed files
    for md_file in md_files:
        try:
-            result = ingest_file(md_file)
+            result = ingest_file(md_file, project_id=project_id)
            results.append(result)
        except Exception as e:
            log.error("ingestion_error", file_path=str(md_file), error=str(e))
--- a/src/atocore/projects/registry.py
+++ b/src/atocore/projects/registry.py
@@ -8,7 +8,6 @@ from dataclasses import asdict, dataclass
 from pathlib import Path

 import atocore.config as _config
-from atocore.ingestion.pipeline import ingest_folder


 # Reserved pseudo-projects. `inbox` holds pre-project / lead / quote
@@ -260,6 +259,7 @@ def load_project_registry() -> list[RegisteredProject]:
        )

    _validate_unique_project_names(projects)
+    _validate_ingest_root_overlaps(projects)
    return projects


@@ -307,6 +307,28 @@ def resolve_project_name(name: str | None) -> str:
    return name


+def derive_project_id_for_path(file_path: str | Path) -> str:
+    """Return the registered project that owns a source path, if any."""
+    if not file_path:
+        return ""
+    doc_path = Path(file_path).resolve(strict=False)
+    matches: list[tuple[int, int, str]] = []
+
+    for project in load_project_registry():
+        for source_ref in project.ingest_roots:
+            root_path = _resolve_ingest_root(source_ref)
+            try:
+                doc_path.relative_to(root_path)
+            except ValueError:
+                continue
+            matches.append((len(root_path.parts), len(str(root_path)), project.project_id))
+
+    if not matches:
+        return ""
+    matches.sort(reverse=True)
+    return matches[0][2]
+
+
 def refresh_registered_project(project_name: str, purge_deleted: bool = False) -> dict:
    """Ingest all configured source roots for a registered project.

@@ -322,6 +344,8 @@ def refresh_registered_project(project_name: str, purge_deleted: bool = False) -
    if project is None:
        raise ValueError(f"Unknown project: {project_name}")

+    from atocore.ingestion.pipeline import ingest_project_folder
+
    roots = []
    ingested_count = 0
    skipped_count = 0
@@ -346,7 +370,11 @@ def refresh_registered_project(project_name: str, purge_deleted: bool = False) -
            {
                **root_result,
                "status": "ingested",
-                "results": ingest_folder(resolved, purge_deleted=purge_deleted),
+                "results": ingest_project_folder(
+                    resolved,
+                    purge_deleted=purge_deleted,
+                    project_id=project.project_id,
+                ),
            }
        )
        ingested_count += 1
@@ -443,6 +471,33 @@ def _validate_unique_project_names(projects: list[RegisteredProject]) -> None:
            seen[key] = project.project_id


+def _validate_ingest_root_overlaps(projects: list[RegisteredProject]) -> None:
+    roots: list[tuple[str, Path]] = []
+    for project in projects:
+        for source_ref in project.ingest_roots:
+            roots.append((project.project_id, _resolve_ingest_root(source_ref)))
+
+    for i, (left_project, left_root) in enumerate(roots):
+        for right_project, right_root in roots[i + 1:]:
+            if left_project == right_project:
+                continue
+            try:
+                left_root.relative_to(right_root)
+                overlaps = True
+            except ValueError:
+                try:
+                    right_root.relative_to(left_root)
+                    overlaps = True
+                except ValueError:
+                    overlaps = False
+            if overlaps:
+                raise ValueError(
+                    "Project registry ingest root overlap: "
+                    f"'{left_root}' ({left_project}) and "
+                    f"'{right_root}' ({right_project})"
+                )
+
+
 def _find_name_collisions(
    project_id: str,
    aliases: list[str],
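
Review note on `derive_project_id_for_path()` above: ownership is decided by path containment, and the deepest matching ingest root wins. The sketch below illustrates that rule under simplifying assumptions: roots are passed in as a plain dict instead of being loaded from the registry, `PurePosixPath` is used so the comparison is purely lexical, and `owning_project` is a hypothetical name that does not exist in the codebase.

```python
from pathlib import PurePosixPath


def owning_project(doc_path: PurePosixPath, roots: dict[str, list[PurePosixPath]]) -> str:
    """Return the project whose ingest root most specifically contains doc_path."""
    matches: list[tuple[int, str]] = []
    for project_id, project_roots in roots.items():
        for root in project_roots:
            try:
                # relative_to compares whole path segments, so a root named
                # ".../p04-gigabit" does not match ".../p04-gigabit-old".
                doc_path.relative_to(root)
            except ValueError:
                continue
            matches.append((len(root.parts), project_id))
    if not matches:
        return ""
    # The deepest root (most path segments) wins.
    return max(matches)[1]
```

Segment-wise matching is what keeps short aliases from colliding on substrings. With `_validate_ingest_root_overlaps()` rejecting nested roots across projects, a depth tie should only occur between roots of the same project.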
--- a/src/atocore/retrieval/retriever.py
+++ b/src/atocore/retrieval/retriever.py
@@ -84,7 +84,15 @@ def retrieve(
    """Retrieve the most relevant chunks for a query."""
    top_k = top_k or _config.settings.context_top_k
    start = time.time()
-    scoped_project = get_registered_project(project_hint) if project_hint else None
+    try:
+        scoped_project = get_registered_project(project_hint) if project_hint else None
+    except Exception as exc:
+        log.warning(
+            "project_scope_resolution_failed",
+            project_hint=project_hint,
+            error=str(exc),
+        )
+        scoped_project = None
    scope_filter_enabled = bool(scoped_project and _config.settings.rank_project_scope_filter)
    registered_projects = None
    query_top_k = top_k
@@ -209,6 +217,10 @@ def _is_allowed_for_project_scope(


 def _metadata_matches_project(project: RegisteredProject, metadata: dict) -> bool:
+    stored_project_id = str(metadata.get("project_id", "")).strip().lower()
+    if stored_project_id:
+        return stored_project_id == project.project_id.lower()
+
    path = _metadata_source_path(metadata)
    tags = _metadata_tags(metadata)
    for term in _project_scope_terms(project):
@@ -288,7 +300,15 @@ def _project_match_boost(project_hint: str, metadata: dict) -> float:
    if not hint_lower:
        return 1.0

-    project = get_registered_project(project_hint)
+    try:
+        project = get_registered_project(project_hint)
+    except Exception as exc:
+        log.warning(
+            "project_match_boost_resolution_failed",
+            project_hint=project_hint,
+            error=str(exc),
+        )
+        project = None
    candidate_names = _project_scope_terms(project) if project is not None else {hint_lower}
    for candidate in candidate_names:
        if _metadata_has_term(metadata, candidate):
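
Review note on the `_metadata_matches_project()` change above: a stored `project_id` is now authoritative when present, and the path/tag heuristic only runs when it is empty. The sketch below shows that precedence in isolation. `chunk_matches_project` and the `legacy_match` callback are hypothetical stand-ins for the real function and its path/tag logic, and treating a `None` value as absent is a choice made in this sketch, not something the diff does.

```python
from typing import Callable


def chunk_matches_project(
    project_id: str,
    metadata: dict,
    legacy_match: Callable[[dict], bool],
) -> bool:
    """Prefer an explicit project_id; fall back to the legacy heuristic if empty."""
    stored = str(metadata.get("project_id") or "").strip().lower()
    if stored:
        # Explicit ownership decides the match, even if tags or paths disagree.
        return stored == project_id.lower()
    # Empty or missing project_id: defer to path/tag matching.
    return legacy_match(metadata)
```

One consequence worth noting: a chunk tagged for another project is rejected even when its tags mention the scoped project, which is the behaviour the collision-fallback tests in the ledger entry describe.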
--- a/src/atocore/retrieval/vector_store.py
+++ b/src/atocore/retrieval/vector_store.py
@@ -64,6 +64,18 @@ class VectorStore:
            self._collection.delete(ids=ids)
            log.debug("vectors_deleted", count=len(ids))

+    def get_metadatas(self, ids: list[str]) -> dict:
+        """Fetch vector metadata by chunk IDs."""
+        if not ids:
+            return {"ids": [], "metadatas": []}
+        return self._collection.get(ids=ids, include=["metadatas"])
+
+    def update_metadatas(self, ids: list[str], metadatas: list[dict]) -> None:
+        """Update vector metadata without re-embedding documents."""
+        if ids:
+            self._collection.update(ids=ids, metadatas=metadatas)
+            log.debug("vector_metadatas_updated", count=len(ids))
+
    @property
    def count(self) -> int:
        return self._collection.count()
--- a/tests/test_backfill_chunk_project_ids.py
+++ b/tests/test_backfill_chunk_project_ids.py
@@ -0,0 +1,154 @@
+"""Tests for explicit chunk project_id metadata backfill."""
+
+import json
+
+import atocore.config as config
+from atocore.models.database import get_connection, init_db
+from scripts import backfill_chunk_project_ids as backfill
+
+
+def _write_registry(tmp_path, monkeypatch):
+    vault_dir = tmp_path / "vault"
+    drive_dir = tmp_path / "drive"
+    config_dir = tmp_path / "config"
+    project_dir = vault_dir / "incoming" / "projects" / "p04-gigabit"
+    project_dir.mkdir(parents=True)
+    drive_dir.mkdir()
+    config_dir.mkdir()
+    registry_path = config_dir / "project-registry.json"
+    registry_path.write_text(
+        json.dumps(
+            {
+                "projects": [
+                    {
+                        "id": "p04-gigabit",
+                        "aliases": ["p04"],
+                        "ingest_roots": [
+                            {"source": "vault", "subpath": "incoming/projects/p04-gigabit"}
+                        ],
+                    }
+                ]
+            }
+        ),
+        encoding="utf-8",
+    )
+    monkeypatch.setenv("ATOCORE_VAULT_SOURCE_DIR", str(vault_dir))
+    monkeypatch.setenv("ATOCORE_DRIVE_SOURCE_DIR", str(drive_dir))
+    monkeypatch.setenv("ATOCORE_PROJECT_REGISTRY_PATH", str(registry_path))
+    config.settings = config.Settings()
+    return project_dir
+
+
+def _insert_chunk(file_path, metadata=None, chunk_id="chunk-1"):
+    with get_connection() as conn:
+        conn.execute(
+            """
+            INSERT INTO source_documents (id, file_path, file_hash, title, doc_type, tags)
+            VALUES (?, ?, ?, ?, ?, ?)
+            """,
+            ("doc-1", str(file_path), "hash", "Title", "markdown", "[]"),
+        )
+        conn.execute(
+            """
+            INSERT INTO source_chunks
+                (id, document_id, chunk_index, content, heading_path, char_count, metadata)
+            VALUES (?, ?, ?, ?, ?, ?, ?)
+            """,
+            (
+                chunk_id,
+                "doc-1",
+                0,
+                "content",
+                "Overview",
+                7,
+                json.dumps(metadata if metadata is not None else {}),
+            ),
+        )
+
+
+class FakeVectorStore:
+    def __init__(self, metadatas):
+        self.metadatas = dict(metadatas)
+        self.updated = []
+
+    def get_metadatas(self, ids):
+        returned_ids = [chunk_id for chunk_id in ids if chunk_id in self.metadatas]
+        return {
+            "ids": returned_ids,
+            "metadatas": [self.metadatas[chunk_id] for chunk_id in returned_ids],
+        }
+
+    def update_metadatas(self, ids, metadatas):
+        self.updated.append((list(ids), list(metadatas)))
+        for chunk_id, metadata in zip(ids, metadatas, strict=True):
+            self.metadatas[chunk_id] = metadata
+
+
+def test_backfill_dry_run_is_non_mutating(tmp_data_dir, tmp_path, monkeypatch):
+    init_db()
+    project_dir = _write_registry(tmp_path, monkeypatch)
+    _insert_chunk(project_dir / "status.md")
+
+    result = backfill.backfill(apply=False)
+
+    assert result["updates"] == 1
+    with get_connection() as conn:
+        row = conn.execute("SELECT metadata FROM source_chunks WHERE id = ?", ("chunk-1",)).fetchone()
+    assert json.loads(row["metadata"]) == {}
+
+
+def test_backfill_apply_updates_chroma_then_sql(tmp_data_dir, tmp_path, monkeypatch):
+    init_db()
+    project_dir = _write_registry(tmp_path, monkeypatch)
+    _insert_chunk(project_dir / "status.md", metadata={"source_file": "status.md"})
+    fake_store = FakeVectorStore({"chunk-1": {"source_file": "status.md"}})
+    monkeypatch.setattr(backfill, "get_vector_store", lambda: fake_store)
+
+    result = backfill.backfill(apply=True, require_chroma_snapshot=True)
+
+    assert result["applied_updates"] == 1
+    assert fake_store.metadatas["chunk-1"]["project_id"] == "p04-gigabit"
+    with get_connection() as conn:
+        row = conn.execute("SELECT metadata FROM source_chunks WHERE id = ?", ("chunk-1",)).fetchone()
+    assert json.loads(row["metadata"])["project_id"] == "p04-gigabit"
+
+
+def test_backfill_apply_requires_snapshot_confirmation(tmp_data_dir, tmp_path, monkeypatch):
+    init_db()
+    project_dir = _write_registry(tmp_path, monkeypatch)
+    _insert_chunk(project_dir / "status.md")
+
+    try:
+        backfill.backfill(apply=True)
+    except ValueError as exc:
+        assert "Chroma backup" in str(exc)
+    else:
+        raise AssertionError("Expected snapshot confirmation requirement")
+
+
+def test_backfill_missing_vector_skips_sql_update(tmp_data_dir, tmp_path, monkeypatch):
+    init_db()
+    project_dir = _write_registry(tmp_path, monkeypatch)
+    _insert_chunk(project_dir / "status.md")
+    fake_store = FakeVectorStore({})
+    monkeypatch.setattr(backfill, "get_vector_store", lambda: fake_store)
+
+    result = backfill.backfill(apply=True, require_chroma_snapshot=True)
+
+    assert result["updates"] == 1
+    assert result["applied_updates"] == 0
+    assert result["missing_vectors"] == 1
+    with get_connection() as conn:
+        row = conn.execute("SELECT metadata FROM source_chunks WHERE id = ?", ("chunk-1",)).fetchone()
+    assert json.loads(row["metadata"]) == {}
+
+
+def test_backfill_skips_malformed_metadata(tmp_data_dir, tmp_path, monkeypatch):
+    init_db()
+    project_dir = _write_registry(tmp_path, monkeypatch)
+    _insert_chunk(project_dir / "status.md", metadata=[])
+
+    result = backfill.backfill(apply=False)
+
+    assert result["updates"] == 0
+    assert result["malformed_metadata"] == 1
--- a/tests/test_ingestion.py
+++ b/tests/test_ingestion.py
@@ -1,8 +1,10 @@
 """Tests for the ingestion pipeline."""

+import json
+
 from atocore.ingestion.parser import parse_markdown
 from atocore.models.database import get_connection, init_db
-from atocore.ingestion.pipeline import ingest_file, ingest_folder
+from atocore.ingestion.pipeline import ingest_file, ingest_folder, ingest_project_folder


 def test_parse_markdown(sample_markdown):
@@ -69,6 +71,153 @@ def test_ingest_updates_changed(tmp_data_dir, sample_markdown):
    assert result["status"] == "ingested"


+def test_ingest_file_records_project_id_metadata(tmp_data_dir, sample_markdown, monkeypatch):
+    """Project-aware ingestion should tag DB and vector metadata exactly."""
+    init_db()
+
+    class FakeVectorStore:
+        def __init__(self):
+            self.metadatas = []
+
+        def add(self, ids, documents, metadatas):
+            self.metadatas.extend(metadatas)
+
+        def delete(self, ids):
+            return None
+
+    fake_store = FakeVectorStore()
+    monkeypatch.setattr("atocore.ingestion.pipeline.get_vector_store", lambda: fake_store)
+
+    result = ingest_file(sample_markdown, project_id="p04-gigabit")
+
+    assert result["status"] == "ingested"
+    assert fake_store.metadatas
+    assert all(meta["project_id"] == "p04-gigabit" for meta in fake_store.metadatas)
+
+    with get_connection() as conn:
+        rows = conn.execute("SELECT metadata FROM source_chunks").fetchall()
+    assert rows
+    assert all(
+        json.loads(row["metadata"])["project_id"] == "p04-gigabit"
+        for row in rows
+    )
+
+
+def test_ingest_file_derives_project_id_from_registry_root(tmp_data_dir, tmp_path, monkeypatch):
+    """Unscoped ingest should preserve ownership for files under registered roots."""
+    import atocore.config as config
+
+    vault_dir = tmp_path / "vault"
+    drive_dir = tmp_path / "drive"
+    config_dir = tmp_path / "config"
+    project_dir = vault_dir / "incoming" / "projects" / "p04-gigabit"
+    project_dir.mkdir(parents=True)
+    drive_dir.mkdir()
+    config_dir.mkdir()
+    note = project_dir / "status.md"
+    note.write_text(
+        "# Status\n\nCurrent project status with enough detail to create "
+        "a retrievable chunk for the ingestion pipeline test.",
+        encoding="utf-8",
+    )
+    registry_path = config_dir / "project-registry.json"
+    registry_path.write_text(
+        json.dumps(
+            {
+                "projects": [
+                    {
+                        "id": "p04-gigabit",
+                        "aliases": ["p04"],
+                        "ingest_roots": [
+                            {"source": "vault", "subpath": "incoming/projects/p04-gigabit"}
+                        ],
+                    }
+                ]
+            }
+        ),
+        encoding="utf-8",
+    )
+
+    class FakeVectorStore:
+        def __init__(self):
+            self.metadatas = []
+
+        def add(self, ids, documents, metadatas):
+            self.metadatas.extend(metadatas)
+
+        def delete(self, ids):
+            return None
+
+    fake_store = FakeVectorStore()
+    monkeypatch.setenv("ATOCORE_VAULT_SOURCE_DIR", str(vault_dir))
+    monkeypatch.setenv("ATOCORE_DRIVE_SOURCE_DIR", str(drive_dir))
+    monkeypatch.setenv("ATOCORE_PROJECT_REGISTRY_PATH", str(registry_path))
+    config.settings = config.Settings()
+    monkeypatch.setattr("atocore.ingestion.pipeline.get_vector_store", lambda: fake_store)
+
+    init_db()
+    result = ingest_file(note)
+
+    assert result["status"] == "ingested"
+    assert fake_store.metadatas
+    assert all(meta["project_id"] == "p04-gigabit" for meta in fake_store.metadatas)
+
+
+def test_ingest_file_logs_and_fails_open_when_project_derivation_fails(
+    tmp_data_dir,
+    sample_markdown,
+    monkeypatch,
+):
+    """A broken registry should be visible but should not block ingestion."""
+    init_db()
+    warnings = []
+
+    class FakeVectorStore:
+        def __init__(self):
+            self.metadatas = []
+
+        def add(self, ids, documents, metadatas):
+            self.metadatas.extend(metadatas)
+
+        def delete(self, ids):
+            return None
+
+    fake_store = FakeVectorStore()
+    monkeypatch.setattr("atocore.ingestion.pipeline.get_vector_store", lambda: fake_store)
+    monkeypatch.setattr(
+        "atocore.projects.registry.derive_project_id_for_path",
+        lambda path: (_ for _ in ()).throw(ValueError("registry broken")),
+    )
+    monkeypatch.setattr(
+        "atocore.ingestion.pipeline.log.warning",
+        lambda event, **kwargs: warnings.append((event, kwargs)),
+    )
+
+    result = ingest_file(sample_markdown)
+
+    assert result["status"] == "ingested"
+    assert fake_store.metadatas
+    assert all(meta["project_id"] == "" for meta in fake_store.metadatas)
+    assert warnings[0][0] == "project_id_derivation_failed"
+    assert "registry broken" in warnings[0][1]["error"]
+
+
+def test_ingest_project_folder_passes_project_id_to_files(tmp_data_dir, sample_folder, monkeypatch):
+    seen = []
+
+    def fake_ingest_file(path, project_id=""):
+        seen.append((path.name, project_id))
+        return {"file": str(path), "status": "ingested"}
+
+    monkeypatch.setattr("atocore.ingestion.pipeline.ingest_file", fake_ingest_file)
+    monkeypatch.setattr("atocore.ingestion.pipeline._purge_deleted_files", lambda *args, **kwargs: 0)
+
+    ingest_project_folder(sample_folder, project_id="p05-interferometer")
+
+    assert seen
+    assert {project_id for _, project_id in seen} == {"p05-interferometer"}
+
+
 def test_parse_markdown_uses_supplied_text(sample_markdown):
    """Parsing should be able to reuse pre-read content from ingestion."""
    latin_text = """---\ntags: parser\n---\n# Parser Title\n\nBody text."""
--- a/tests/test_project_registry.py
+++ b/tests/test_project_registry.py
@@ -5,6 +5,7 @@ import json
 import atocore.config as config
 from atocore.projects.registry import (
    build_project_registration_proposal,
+    derive_project_id_for_path,
    get_registered_project,
    get_project_registry_template,
    list_registered_projects,
@@ -103,6 +104,98 @@ def test_project_registry_resolves_alias(tmp_path, monkeypatch):
    assert project.project_id == "p05-interferometer"


+def test_derive_project_id_for_path_uses_registered_roots(tmp_path, monkeypatch):
+    vault_dir = tmp_path / "vault"
+    drive_dir = tmp_path / "drive"
+    config_dir = tmp_path / "config"
+    project_dir = vault_dir / "incoming" / "projects" / "p04-gigabit"
+    project_dir.mkdir(parents=True)
+    drive_dir.mkdir()
+    config_dir.mkdir()
+    note = project_dir / "status.md"
+    note.write_text("# Status\n\nCurrent work.", encoding="utf-8")
+
+    registry_path = config_dir / "project-registry.json"
+    registry_path.write_text(
+        json.dumps(
+            {
+                "projects": [
+                    {
+                        "id": "p04-gigabit",
+                        "aliases": ["p04"],
+                        "ingest_roots": [
+                            {"source": "vault", "subpath": "incoming/projects/p04-gigabit"}
+                        ],
+                    }
+                ]
+            }
+        ),
+        encoding="utf-8",
+    )
+
+    monkeypatch.setenv("ATOCORE_VAULT_SOURCE_DIR", str(vault_dir))
+    monkeypatch.setenv("ATOCORE_DRIVE_SOURCE_DIR", str(drive_dir))
+    monkeypatch.setenv("ATOCORE_PROJECT_REGISTRY_PATH", str(registry_path))
+
+    original_settings = config.settings
+    try:
+        config.settings = config.Settings()
+        assert derive_project_id_for_path(note) == "p04-gigabit"
+        assert derive_project_id_for_path(tmp_path / "elsewhere.md") == ""
+    finally:
+        config.settings = original_settings
+
+
+def test_project_registry_rejects_cross_project_ingest_root_overlap(tmp_path, monkeypatch):
+    vault_dir = tmp_path / "vault"
+    drive_dir = tmp_path / "drive"
+    config_dir = tmp_path / "config"
+    vault_dir.mkdir()
+    drive_dir.mkdir()
+    config_dir.mkdir()
+
+    registry_path = config_dir / "project-registry.json"
+    registry_path.write_text(
+        json.dumps(
+            {
+                "projects": [
+                    {
+                        "id": "parent",
+                        "aliases": [],
+                        "ingest_roots": [
+                            {"source": "vault", "subpath": "incoming/projects/parent"}
+                        ],
+                    },
+                    {
+                        "id": "child",
+                        "aliases": [],
+                        "ingest_roots": [
+                            {"source": "vault", "subpath": "incoming/projects/parent/child"}
+                        ],
+                    },
+                ]
+            }
+        ),
+        encoding="utf-8",
+    )
+
+    monkeypatch.setenv("ATOCORE_VAULT_SOURCE_DIR", str(vault_dir))
+    monkeypatch.setenv("ATOCORE_DRIVE_SOURCE_DIR", str(drive_dir))
+    monkeypatch.setenv("ATOCORE_PROJECT_REGISTRY_PATH", str(registry_path))
+
+    original_settings = config.settings
+    try:
+        config.settings = config.Settings()
+        try:
+            list_registered_projects()
+        except ValueError as exc:
+            assert "ingest root overlap" in str(exc)
+        else:
+            raise AssertionError("Expected overlapping ingest roots to raise")
+    finally:
+        config.settings = original_settings
+
+
 def test_refresh_registered_project_ingests_registered_roots(tmp_path, monkeypatch):
    vault_dir = tmp_path / "vault"
    drive_dir = tmp_path / "drive"
@@ -133,8 +226,8 @@ def test_refresh_registered_project_ingests_registered_roots(tmp_path, monkeypat

    calls = []

-    def fake_ingest_folder(path, purge_deleted=True):
-        calls.append((str(path), purge_deleted))
+    def fake_ingest_folder(path, purge_deleted=True, project_id=""):
+        calls.append((str(path), purge_deleted, project_id))
        return [{"file": str(path / "README.md"), "status": "ingested"}]

    monkeypatch.setenv("ATOCORE_VAULT_SOURCE_DIR", str(vault_dir))
@@ -144,7 +237,7 @@ def test_refresh_registered_project_ingests_registered_roots(tmp_path, monkeypat
    original_settings = config.settings
    try:
        config.settings = config.Settings()
-        monkeypatch.setattr("atocore.projects.registry.ingest_folder", fake_ingest_folder)
+        monkeypatch.setattr("atocore.ingestion.pipeline.ingest_project_folder", fake_ingest_folder)
        result = refresh_registered_project("polisher")
    finally:
        config.settings = original_settings
@@ -153,6 +246,7 @@ def test_refresh_registered_project_ingests_registered_roots(tmp_path, monkeypat
    assert len(calls) == 1
    assert calls[0][0].endswith("p06-polisher")
    assert calls[0][1] is False
+    assert calls[0][2] == "p06-polisher"
    assert result["roots"][0]["status"] == "ingested"
    assert result["status"] == "ingested"
    assert result["roots_ingested"] == 1
@@ -188,7 +282,7 @@ def test_refresh_registered_project_reports_nothing_to_ingest_when_all_missing(
        encoding="utf-8",
    )

-    def fail_ingest_folder(path, purge_deleted=True):
+    def fail_ingest_folder(path, purge_deleted=True, project_id=""):
        raise AssertionError(f"ingest_folder should not be called for missing root: {path}")

    monkeypatch.setenv("ATOCORE_VAULT_SOURCE_DIR", str(vault_dir))
@@ -198,7 +292,7 @@ def test_refresh_registered_project_reports_nothing_to_ingest_when_all_missing(
    original_settings = config.settings
    try:
        config.settings = config.Settings()
-        monkeypatch.setattr("atocore.projects.registry.ingest_folder", fail_ingest_folder)
+        monkeypatch.setattr("atocore.ingestion.pipeline.ingest_project_folder", fail_ingest_folder)
        result = refresh_registered_project("ghost")
    finally:
        config.settings = original_settings
@@ -238,7 +332,7 @@ def test_refresh_registered_project_reports_partial_status(tmp_path, monkeypatch
        encoding="utf-8",
    )

-    def fake_ingest_folder(path, purge_deleted=True):
+    def fake_ingest_folder(path, purge_deleted=True, project_id=""):
        return [{"file": str(path / "README.md"), "status": "ingested"}]

    monkeypatch.setenv("ATOCORE_VAULT_SOURCE_DIR", str(vault_dir))
@@ -248,7 +342,7 @@ def test_refresh_registered_project_reports_partial_status(tmp_path, monkeypatch
    original_settings = config.settings
    try:
        config.settings = config.Settings()
-        monkeypatch.setattr("atocore.projects.registry.ingest_folder", fake_ingest_folder)
+        monkeypatch.setattr("atocore.ingestion.pipeline.ingest_project_folder", fake_ingest_folder)
        result = refresh_registered_project("mixed")
    finally:
        config.settings = original_settings
--- a/tests/test_retrieval.py
+++ b/tests/test_retrieval.py
@@ -384,6 +384,146 @@ def test_retrieve_project_scope_uses_path_segments_not_substrings(monkeypatch):
    assert [r.chunk_id for r in results] == ["chunk-target", "chunk-global"]


+def test_retrieve_project_scope_prefers_exact_project_id(monkeypatch):
+    target_project = type(
+        "Project",
+        (),
+        {
+            "project_id": "p04-gigabit",
+            "aliases": ("p04", "gigabit"),
+            "ingest_roots": (),
+        },
+    )()
+    other_project = type(
+        "Project",
+        (),
+        {
+            "project_id": "p06-polisher",
+            "aliases": ("p06", "polisher"),
+            "ingest_roots": (),
+        },
+    )()
+
+    class FakeStore:
+        def query(self, query_embedding, top_k=10, where=None):
+            return {
+                "ids": [["chunk-target", "chunk-other", "chunk-global"]],
+                "documents": [["target doc", "other doc", "global doc"]],
+                "metadatas": [[
+                    {
+                        "heading_path": "Overview",
+                        "source_file": "legacy/unhelpful-path.md",
+                        "tags": "[]",
+                        "title": "Target",
+                        "project_id": "p04-gigabit",
+                        "document_id": "doc-a",
+                    },
+                    {
+                        "heading_path": "Overview",
+                        "source_file": "p04-gigabit/title-poisoned.md",
+                        "tags": '["p04-gigabit"]',
+                        "title": "Looks target-owned but is explicit p06",
+                        "project_id": "p06-polisher",
+                        "document_id": "doc-b",
+                    },
+                    {
+                        "heading_path": "Overview",
+                        "source_file": "shared/global.md",
+                        "tags": "[]",
+                        "title": "Shared",
+                        "project_id": "",
+                        "document_id": "doc-global",
+                    },
+                ]],
+                "distances": [[0.2, 0.19, 0.21]],
+            }
+
+    monkeypatch.setattr("atocore.retrieval.retriever.get_vector_store", lambda: FakeStore())
+    monkeypatch.setattr("atocore.retrieval.retriever.embed_query", lambda query: [0.0, 0.1])
+    monkeypatch.setattr(
+        "atocore.retrieval.retriever._existing_chunk_ids",
+        lambda chunk_ids: set(chunk_ids),
+    )
+    monkeypatch.setattr(
+        "atocore.retrieval.retriever.get_registered_project",
+        lambda project_name: target_project,
+    )
+    monkeypatch.setattr(
+        "atocore.retrieval.retriever.load_project_registry",
+        lambda: [target_project, other_project],
+    )
+
+    results = retrieve("mirror architecture", top_k=3, project_hint="p04")
+
+    assert [r.chunk_id for r in results] == ["chunk-target", "chunk-global"]
+
+
+def test_retrieve_empty_project_id_falls_back_to_path_ownership(monkeypatch):
+    target_project = type(
+        "Project",
+        (),
+        {
+            "project_id": "p04-gigabit",
+            "aliases": ("p04", "gigabit"),
+            "ingest_roots": (),
+        },
+    )()
+    other_project = type(
+        "Project",
+        (),
+        {
+            "project_id": "p05-interferometer",
+            "aliases": ("p05", "interferometer"),
+            "ingest_roots": (),
+        },
+    )()
+
+    class FakeStore:
+        def query(self, query_embedding, top_k=10, where=None):
+            return {
+                "ids": [["chunk-target", "chunk-other"]],
+                "documents": [["target doc", "other doc"]],
+                "metadatas": [[
+                    {
+                        "heading_path": "Overview",
+                        "source_file": "p04-gigabit/status.md",
+                        "tags": "[]",
+                        "title": "Target",
+                        "project_id": "",
+                        "document_id": "doc-a",
+                    },
+                    {
+                        "heading_path": "Overview",
+                        "source_file": "p05-interferometer/status.md",
+                        "tags": "[]",
+                        "title": "Other",
+                        "project_id": "",
+                        "document_id": "doc-b",
+                    },
+                ]],
+                "distances": [[0.2, 0.19]],
+            }
+
+    monkeypatch.setattr("atocore.retrieval.retriever.get_vector_store", lambda: FakeStore())
+    monkeypatch.setattr("atocore.retrieval.retriever.embed_query", lambda query: [0.0, 0.1])
+    monkeypatch.setattr(
+        "atocore.retrieval.retriever._existing_chunk_ids",
+        lambda chunk_ids: set(chunk_ids),
+    )
+    monkeypatch.setattr(
+        "atocore.retrieval.retriever.get_registered_project",
+        lambda project_name: target_project,
+    )
+    monkeypatch.setattr(
+        "atocore.retrieval.retriever.load_project_registry",
+        lambda: [target_project, other_project],
+    )
+
+    results = retrieve("mirror architecture", top_k=2, project_hint="p04")
+
+    assert [r.chunk_id for r in results] == ["chunk-target"]
+
+
 def test_retrieve_unknown_project_hint_does_not_widen_or_filter(monkeypatch):
    class FakeStore:
        def query(self, query_embedding, top_k=10, where=None):
@@ -426,6 +566,59 @@ def test_retrieve_unknown_project_hint_does_not_widen_or_filter(monkeypatch):
    assert [r.chunk_id for r in results] == ["chunk-a", "chunk-b"]


+def test_retrieve_fails_open_when_project_scope_resolution_fails(monkeypatch):
+    warnings = []
+
+    class FakeStore:
+        def query(self, query_embedding, top_k=10, where=None):
+            assert top_k == 2
+            return {
+                "ids": [["chunk-a", "chunk-b"]],
+                "documents": [["doc a", "doc b"]],
+                "metadatas": [[
+                    {
+                        "heading_path": "Overview",
+                        "source_file": "p04-gigabit/file.md",
+                        "tags": "[]",
+                        "title": "A",
+                        "document_id": "doc-a",
+                    },
+                    {
+                        "heading_path": "Overview",
+                        "source_file": "p05-interferometer/file.md",
+                        "tags": "[]",
+                        "title": "B",
+                        "document_id": "doc-b",
+                    },
+                ]],
+                "distances": [[0.2, 0.21]],
+            }
+
+    monkeypatch.setattr("atocore.retrieval.retriever.get_vector_store", lambda: FakeStore())
+    monkeypatch.setattr("atocore.retrieval.retriever.embed_query", lambda query: [0.0, 0.1])
+    monkeypatch.setattr(
+        "atocore.retrieval.retriever._existing_chunk_ids",
+        lambda chunk_ids: set(chunk_ids),
+    )
+    monkeypatch.setattr(
+        "atocore.retrieval.retriever.get_registered_project",
+        lambda project_name: (_ for _ in ()).throw(ValueError("registry overlap")),
+    )
+    monkeypatch.setattr(
+        "atocore.retrieval.retriever.log.warning",
+        lambda event, **kwargs: warnings.append((event, kwargs)),
+    )
+
+    results = retrieve("overview", top_k=2, project_hint="p04")
+
+    assert [r.chunk_id for r in results] == ["chunk-a", "chunk-b"]
+    assert {warning[0] for warning in warnings} == {
+        "project_scope_resolution_failed",
+        "project_match_boost_resolution_failed",
+    }
+    assert all("registry overlap" in warning[1]["error"] for warning in warnings)
+
+
 def test_retrieve_downranks_archive_noise_and_prefers_high_signal_paths(monkeypatch):
    class FakeStore:
        def query(self, query_embedding, top_k=10, where=None):