fix(retrieval): fail open on registry resolution errors

fix(retrieval): preserve project ids across unscoped ingest
feat(retrieval): persist explicit chunk project ids
2026-04-24 11:32:46 -04:00 · 2026-04-24 11:22:13 -04:00 · 2026-04-24 11:02:30 -04:00 · 2026-04-24 10:47:15 -04:00 · 2026-04-24 10:46:56 -04:00
18 changed files with 1641 additions and 82 deletions
--- a/DEV-LEDGER.md
+++ b/DEV-LEDGER.md
@@ -6,17 +6,17 @@
 ## Orientation
- **live_sha** (Dalidou `/health` build_sha): `2b86543` (verified 2026-04-23T15:20:53Z post-R14 deploy; status=ok)
+- **live_sha** (Dalidou `/health` build_sha): `f44a211` (verified 2026-04-24T14:48:44Z post audit-improvements deploy; status=ok)
- **last_updated**: 2026-04-23 by Claude (R14 squash-merged + deployed; Orientation refreshed)
+- **last_updated**: 2026-04-24 by Codex (retrieval boundary deployed; project_id metadata branch started)
- **main_tip**: `2b86543`
+- **main_tip**: `f44a211`
- **test_count**: 548 (547 + 1 R14 regression test)
+- **test_count**: 567 on `codex/project-id-metadata-retrieval` (deployed main baseline: 553)
- **harness**: `17/18 PASS` on live Dalidou (p04-constraints expects "Zerodur" — known content gap, not regression; consistent since 2026-04-19)
+- **harness**: `19/20 PASS` on live Dalidou, 0 blocking failures, 1 known content gap (`p04-constraints`)
 - **vectors**: 33,253
- **active_memories**: 784 (up from 84 pre-density-batch — density gate CRUSHED vs V1-A's 100-target)
+- **active_memories**: 290 (`/admin/dashboard` 2026-04-24; note integrity panel reports a separate active_memory_count=951 and needs reconciliation)
- **candidate_memories**: 2 (triage queue drained)
+- **candidate_memories**: 0 (triage queue drained)
- **interactions**: 500+ (limit=2000 query returned 500 — density batch has been running; actual may be higher, confirm via /stats next update)
+- **interactions**: 951 (`/admin/dashboard` 2026-04-24)
 - **registered_projects**: atocore, p04-gigabit, p05-interferometer, p06-polisher, atomizer-v2, abb-space (aliased p08)
- **project_state_entries**: 63 (atocore alone; full cross-project count not re-sampled this update)
+- **project_state_entries**: 128 across registered projects (`/admin/dashboard` 2026-04-24)
 - **entities**: 66 (up from 35 — V1-0 backfill + ongoing work; 0 open conflicts)
 - **off_host_backup**: `papa@192.168.86.39:/home/papa/atocore-backups/` via cron, verified
 - **nightly_pipeline**: backup → cleanup → rsync → OpenClaw import → vault refresh → extract → auto-triage → **auto-promote/expire (NEW)** → weekly synth/lint Sundays → **retrieval harness (NEW)** → **pipeline summary (NEW)**
@@ -170,6 +170,16 @@ One branch `codex/extractor-eval-loop` for Day 1-5, a second `codex/retrieval-ha
 ## Session Log
 - **2026-04-24 Codex (retrieval boundary deployed + project_id metadata tranche)** Merged `codex/audit-improvements-foundation` to `main` as `f44a211` and pushed to Dalidou Gitea. Took pre-deploy runtime backup `/srv/storage/atocore/backups/snapshots/20260424T144810Z` (DB + registry, no Chroma). Deployed via `papa@dalidou` canonical `deploy/dalidou/deploy.sh`; live `/health` reports build_sha `f44a2114970008a7eec4e7fc2860c8f072914e38`, build_time `2026-04-24T14:48:44Z`, status ok. Post-deploy retrieval harness: 20 fixtures, 19 pass, 0 blocking failures, 1 known issue (`p04-constraints`). The former blocker `p05-broad-status-no-atomizer` now passes. Manual p05 `context-build "current status"` spot check shows no p04/Atomizer source bleed in retrieved chunks. Started follow-up branch `codex/project-id-metadata-retrieval`: registered-project ingestion now writes explicit `project_id` into DB chunk metadata and Chroma vector metadata; retrieval prefers exact `project_id` when present and keeps path/tag matching as legacy fallback; added dry-run-by-default `scripts/backfill_chunk_project_ids.py` to backfill SQLite + Chroma metadata; added tests for project-id ingestion, registered refresh propagation, exact project-id retrieval, and collision fallback. Verified targeted suite (`test_ingestion.py`, `test_project_registry.py`, `test_retrieval.py`): 36 passed. Verified full suite: 556 passed in 72.44s. Branch not merged or deployed yet.
 - **2026-04-24 Codex (project_id audit response)** Applied independent-audit fixes on `codex/project-id-metadata-retrieval`. Closed the nightly `/ingest/sources` clobber risk by adding registry-level `derive_project_id_for_path()` and making unscoped `ingest_file()` derive ownership from registered ingest roots when possible; `refresh_registered_project()` still passes the canonical project id directly. Changed retrieval so empty `project_id` falls through to legacy path/tag ownership instead of short-circuiting as unowned. Hardened `scripts/backfill_chunk_project_ids.py`: `--apply` now requires `--chroma-snapshot-confirmed`, runs Chroma metadata updates before SQLite writes, batches updates, skips and reports missing vectors, skips and reports malformed metadata, reports already-tagged rows, and turns missing ingestion tables into a JSON `db_warning` instead of a traceback. Added tests for auto-derive ingestion, empty-project fallback, ingest-root overlap rejection, and backfill dry-run/apply/snapshot/missing-vector/malformed cases. Verified targeted suite (`test_backfill_chunk_project_ids.py`, `test_ingestion.py`, `test_project_registry.py`, `test_retrieval.py`): 45 passed. Verified full suite: 565 passed in 73.16s. Local dry-run on empty/default data returns 0 updates with `db_warning` rather than crashing. Branch still not merged/deployed.
 - **2026-04-24 Codex (project_id final hardening before merge)** Applied the final independent-review P2s on `codex/project-id-metadata-retrieval`: `ingest_file()` still fails open when project-id derivation fails, but now emits `project_id_derivation_failed` with file path and error; retrieval now catches registry failures both at project-scope resolution and the soft project-match boost path, logs warnings, and serves unscoped rather than raising. Added regression tests for both fail-open paths. Verified targeted suite (`test_ingestion.py`, `test_retrieval.py`, `test_backfill_chunk_project_ids.py`, `test_project_registry.py`): 47 passed. Verified full suite: 567 passed in 79.66s. Branch still not merged/deployed.
 - **2026-04-24 Codex (audit improvements foundation)** Started implementation of the audit recommendations on branch `codex/audit-improvements-foundation` from `origin/main@c53e61e`. First tranche: registry-aware project-scoped retrieval filtering (`ATOCORE_RANK_PROJECT_SCOPE_FILTER`, widened candidate pull before filtering), eval harness known-issue lane, two p05 project-bleed fixtures, `scripts/live_status.py`, README/current-state/master-plan status refresh. Verified `pytest -q`: 550 passed in 67.11s. Live retrieval harness against undeployed production: 20 fixtures, 18 pass, 1 known issue (`p04-constraints` Zerodur/1.2 content gap), 1 blocking guard (`p05-broad-status-no-atomizer`) still failing because production has not yet deployed the retrieval filter and currently pulls `P04-GigaBIT-M1-KB-design` into broad p05 status context. Live dashboard refresh: health ok, build `2b86543`, docs 1748, chunks/vectors 33253, interactions 948, active memories 289, candidates 0, project_state total 128. Noted count discrepancy: dashboard memories.active=289 while integrity active_memory_count=951; schedule reconciliation in a follow-up.
 - **2026-04-24 Codex (independent-audit hardening)** Applied the Opus independent audit's fast follow-ups before merge/deploy. Closed the two P1s by making project-scope ownership path/tag-based only, adding path-segment/tag-exact matching to avoid short-alias substring collisions, and keeping title/heading text out of provenance decisions. Added regression tests for title poisoning, substring collision, and unknown-project fallback. Added retrieval log fields `raw_results_count`, `post_filter_count`, `post_filter_dropped`, and `underfilled`. Added retrieval-eval run metadata (`generated_at`, `base_url`, `/health`) and `live_status.py` auth-token/status support. README now documents the ranking knobs and clarifies that the hard scope filter and soft project match boost are separate controls. Verified `pytest -q`: 553 passed in 66.07s. Live production remains expected-predeploy: 20 fixtures, 18 pass, 1 known content gap, 1 blocking p05 bleed guard. Latest live dashboard: build `2b86543`, docs 1748, chunks/vectors 33253, interactions 950, active memories 290, candidates 0, project_state total 128.
 - **2026-04-23 Codex + Claude (R14 closed)** Codex reviewed `claude/r14-promote-400` at `3888db9`, no findings: "The route change is narrowly scoped: `promote_entity()` still returns False for not-found/not-candidate cases, so the existing 404 behavior remains intact, while caller-fixable validation failures now surface as 400." Ran `pytest tests/test_v1_0_write_invariants.py -q` from an isolated worktree: 15 passed in 1.91s. Claude squash-merged to main as `0989fed`, followed by ledger close-out `2b86543`, then deployed via canonical script. Dalidou `/health` reports build_sha=`2b86543e6ad26011b39a44509cc8df3809725171`, build_time `2026-04-23T15:20:53Z`, status=ok. R14 closed. Orientation refreshed earlier this session also reflected the V1-A gate status: **density gate CLEARED** (784 active memories vs 100 target — density batch-extract ran between 2026-04-22 and 2026-04-23 and more than crushed the gate), **soak gate at day 5 of ~7** (F4 first run 2026-04-19; nightly clean 2026-04-19 through 2026-04-23; only chronic failure is the known p04-constraints "Zerodur" content gap). V1-A branches from a clean V1-0 baseline as soon as the soak is called done.
 - **2026-04-22 Codex + Antoine (V1-0 closed)** Codex approved `f16cd52` after re-running both original probes (legacy-candidate promote + supersede hook — both correct) and the three targeted regression suites (`test_v1_0_write_invariants.py`, `test_engineering_v1_phase5.py`, `test_inbox_crossproject.py` — all pass). Squash-merged to main as `2712c5d` ("feat(engineering): enforce V1-0 write invariants"). Deployed to Dalidou via the canonical deploy script; `/health` build_sha=`2712c5d2d03cb2a6af38b559664afd1c4cd0e050` status=ok. Validated backup snapshot at `/srv/storage/atocore/backups/snapshots/20260422T190624Z` taken BEFORE prod backfill. Prod backfill of `scripts/v1_0_backfill_provenance.py` against live DB: dry-run found 31 active/superseded entities with no provenance, list reviewed and looked sane; live run with default `hand_authored=1` flag path updated 31 rows; follow-up dry-run returned 0 rows remaining → no lingering F-8 violations in prod. Codex logged one residual P2 (R14): HTTP `POST /entities/{id}/promote` route doesn't translate the new service-layer `ValueError` into 400 — legacy bad candidate promoted through the API surfaces as 500. Not blocking. V1-0 closed. **Gates for V1-A**: soak window ends ~2026-04-26; 100-active-memory density target (currently 84 active + the ~31 newly flagged ones — need to check how those count in density math). V1-A holds until both gates clear.
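The ownership and fail-open rules described in the 2026-04-24 entries above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code on the branch: `chunk_owner`, `resolve_scope`, the `source_file`/`tags` metadata keys, and the registry loader callable are hypothetical names chosen for the example.

```python
from __future__ import annotations

import logging

log = logging.getLogger("retrieval_sketch")


def chunk_owner(metadata: dict, known_projects: set[str]) -> str:
    """Return the owning project id for a chunk, or "" when it is unowned."""
    explicit = str(metadata.get("project_id") or "").strip()
    if explicit:
        # An explicit project_id wins whenever it is present and non-empty.
        return explicit
    # Legacy fallback (hypothetical shape): exact path-segment or exact tag
    # match only, so a short alias cannot match as a substring.
    path = str(metadata.get("source_file", "")).replace("\\", "/")
    segments = {part.lower() for part in path.split("/") if part}
    tags = {str(tag).lower() for tag in metadata.get("tags", [])}
    for project in sorted(known_projects):
        if project.lower() in segments or project.lower() in tags:
            return project
    return ""


def resolve_scope(project_hint: str, load_registry) -> str | None:
    """Fail open: a registry error means 'serve unscoped', never an exception."""
    try:
        projects = load_registry()
    except Exception as exc:  # deliberately broad for the fail-open path
        log.warning("registry resolution failed, serving unscoped: %s", exc)
        return None
    return project_hint if project_hint in projects else None
```

An empty `project_id` falls through to the path/tag check rather than being treated as unowned, which mirrors the audit-response entry; how the real registry matches aliases is not shown here.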
--- a/README.md
+++ b/README.md
@@ -6,7 +6,7 @@ Personal context engine that enriches LLM interactions with durable memory, stru
 ```bash
 pip install -e .
-uvicorn src.atocore.main:app --port 8100
+uvicorn atocore.main:app --port 8100
 ```
 ## Usage
@@ -37,6 +37,10 @@ python scripts/atocore_client.py audit-query "gigabit" 5
 | POST | /ingest | Ingest markdown file or folder |
 | POST | /query | Retrieve relevant chunks |
 | POST | /context/build | Build full context pack |
 | POST | /interactions | Capture prompt/response interactions |
 | GET/POST | /memory | List/create durable memories |
 | GET/POST | /entities | Engineering entity graph surface |
 | GET | /admin/dashboard | Operator dashboard |
 | GET | /health | Health check |
 | GET | /debug/context | Inspect last context pack |
@@ -66,8 +70,10 @@ unversioned forms.
 FastAPI (port 8100)
  |- Ingestion: markdown -> parse -> chunk -> embed -> store
  |- Retrieval: query -> embed -> vector search -> rank
-  |- Context Builder: retrieve -> boost -> budget -> format
+  |- Context Builder: project state -> memories -> entities -> retrieval -> budget
-  |- SQLite (documents, chunks, memories, projects, interactions)
+  |- Reflection: capture -> reinforce -> extract -> triage -> promote/expire
  |- Engineering: typed entities, relationships, conflicts, wiki/mirror
  |- SQLite (documents, chunks, memories, projects, interactions, entities)
  '- ChromaDB (vector embeddings)
 ```
@@ -82,6 +88,16 @@ Set via environment variables (prefix `ATOCORE_`):
 | ATOCORE_CHUNK_MAX_SIZE | 800 | Max chunk size (chars) |
 | ATOCORE_CONTEXT_BUDGET | 3000 | Context pack budget (chars) |
 | ATOCORE_EMBEDDING_MODEL | paraphrase-multilingual-MiniLM-L12-v2 | Embedding model |
 | ATOCORE_RANK_PROJECT_MATCH_BOOST | 2.0 | Soft boost for chunks whose metadata matches the project hint |
 | ATOCORE_RANK_PROJECT_SCOPE_FILTER | true | Filter project-hinted retrieval away from other registered project corpora |
 | ATOCORE_RANK_PROJECT_SCOPE_CANDIDATE_MULTIPLIER | 4 | Widen candidate pull before project-scope filtering |
 | ATOCORE_RANK_QUERY_TOKEN_STEP | 0.08 | Per-token boost when query terms appear in high-signal metadata |
 | ATOCORE_RANK_QUERY_TOKEN_CAP | 1.32 | Maximum query-token boost multiplier |
 | ATOCORE_RANK_PATH_HIGH_SIGNAL_BOOST | 1.18 | Boost current decision/status/requirements-like paths |
 | ATOCORE_RANK_PATH_LOW_SIGNAL_PENALTY | 0.72 | Down-rank archive/history-like paths |
 `ATOCORE_RANK_PROJECT_SCOPE_FILTER` gates the hard cross-project filter only.
 `ATOCORE_RANK_PROJECT_MATCH_BOOST` remains the separate soft-ranking knob.
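As a rough illustration of how the two knobs differ, the sketch below applies a hard scope filter and a soft match boost independently. It assumes the boost is multiplicative and that unowned chunks survive the filter; the function name `rank` and the chunk dictionary shape are hypothetical, not the AtoCore implementation.

```python
from __future__ import annotations


def rank(
    candidates: list[dict],
    project_hint: str,
    registered_projects: set[str],
    scope_filter: bool = True,
    match_boost: float = 2.0,
) -> list[dict]:
    """Rank chunks for a project-hinted query (illustrative only)."""
    ranked = []
    for chunk in candidates:
        owner = chunk.get("project_id", "")
        # Hard filter: drop chunks owned by a *different* registered project.
        if (
            scope_filter
            and project_hint
            and owner in registered_projects
            and owner != project_hint
        ):
            continue
        score = chunk["score"]
        # Soft boost: chunks matching the hint are scaled up, never dropped.
        if project_hint and owner == project_hint:
            score *= match_boost
        ranked.append({**chunk, "score": score})
    return sorted(ranked, key=lambda c: c["score"], reverse=True)
```

With the filter disabled, a strong chunk from another project can still outrank a boosted in-project chunk, which is why the two controls are documented separately.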
 ## Testing
@@ -93,7 +109,11 @@ pytest
 ## Operations
 - `scripts/atocore_client.py` provides a live API client for project refresh, project-state inspection, and retrieval-quality audits.
 - `scripts/retrieval_eval.py` runs the live retrieval/context harness, separates blocking failures from known content gaps, and stamps JSON output with target/build metadata.
 - `scripts/live_status.py` renders a compact read-only status report from `/health`, `/stats`, `/projects`, and `/admin/dashboard`; set `ATOCORE_AUTH_TOKEN` or `--auth-token` when those endpoints are gated.
 - `scripts/backfill_chunk_project_ids.py` dry-runs or applies explicit `project_id` metadata backfills for SQLite chunks and Chroma vectors; `--apply` requires a confirmed Chroma snapshot.
 - `docs/operations.md` captures the current operational priority order: retrieval quality, Wave 2 trusted-operational ingestion, AtoDrive scoping, and restore validation.
 - `DEV-LEDGER.md` is the fast-moving source of operational truth during active development; copy claims into docs only after checking the live service.
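The "blocking failures versus known content gaps" split that the harness reports can be pictured with a small sketch. The result fields (`name`, `ok`, `known_issue`) and the `summarize` helper are hypothetical; the actual JSON shape emitted by `scripts/retrieval_eval.py` may differ.

```python
from __future__ import annotations


def summarize(results: list[dict]) -> dict:
    """Split failing fixtures into known issues and blocking failures."""
    passed = [r["name"] for r in results if r["ok"]]
    known = [r["name"] for r in results if not r["ok"] and r.get("known_issue")]
    blocking = [r["name"] for r in results if not r["ok"] and not r.get("known_issue")]
    return {
        "fixtures": len(results),
        "passed": len(passed),
        "known_issues": known,
        "blocking_failures": blocking,
        # Only blocking failures should fail a nightly run in this sketch.
        "exit_code": 1 if blocking else 0,
    }
```

Under this shape a run with only `p04-constraints` failing and flagged as known would report zero blocking failures and exit cleanly.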
 ## Architecture Notes
--- a/docs/current-state.md
+++ b/docs/current-state.md
@@ -1,6 +1,11 @@
-# AtoCore — Current State (2026-04-22)
+# AtoCore - Current State (2026-04-24)
-Live deploy: `2712c5d` · Dalidou health: ok · Harness: 17/18 · Tests: 547 passing.
+Update 2026-04-24: audit-improvements deployed as `f44a211`; live harness is
 19/20 with 0 blocking failures and 1 known content gap. Active follow-up branch
 `codex/project-id-metadata-retrieval` is at 567 passing tests.
 Live deploy: `2b86543` · Dalidou health: ok · Harness: 18/20 with 1 known
 content gap and 1 current blocking project-bleed guard · Tests: 553 passing.
 ## V1-0 landed 2026-04-22
@@ -13,9 +18,8 @@ supersede) with Q-3 fail-open. Prod backfill ran cleanly — 31 legacy
 active/superseded entities flagged `hand_authored=1`, follow-up dry-run
 returned 0 remaining rows. Test count 533 → 547 (+14).
-R14 (P2, non-blocking): `POST /entities/{id}/promote` route fix translates
-the new `ValueError` into 400. Branch `claude/r14-promote-400` pending
-Codex review + squash-merge.
+R14 is closed: `POST /entities/{id}/promote` now translates the new
+caller-fixable V1-0 `ValueError` into HTTP 400.
 **Next in the V1 track:** V1-A (minimal query slice + Q-6 killer-correctness
 integration). Gated on pipeline soak (~2026-04-26) + 100+ active memory
@@ -65,10 +69,10 @@ Last nightly run (2026-04-19 03:00 UTC): **31 promoted · 39 rejected · 0 needs
 | 7G | Re-extraction on prompt version bump | pending |
 | 7H | Chroma vector hygiene (delete vectors for superseded memories) | pending |
-## Known gaps (honest)
+## Known gaps (honest, refreshed 2026-04-24)
 1. **Capture surface is Claude-Code-and-OpenClaw only.** Conversations in Claude Desktop, Claude.ai web, phone, or any other LLM UI are NOT captured. Example: the rotovap/mushroom chat yesterday never reached AtoCore because no hook fired. See Q4 below.
-2. **OpenClaw is capture-only, not context-grounded.** The plugin POSTs `/interactions` on `llm_output` but does NOT call `/context/build` on `before_agent_start`. OpenClaw's underlying agent runs blind. See Q2 below.
+2. **Project-scoped retrieval guard is deployed and passing.** The April 24 p05 broad-status bleed guard now passes on live Dalidou. The active follow-up branch adds explicit `project_id` chunk/vector metadata so the deployed path/tag heuristic can become a legacy fallback.
-3. **Human interface (wiki) is thin and static.** 5 project cards + a "System" line. No dashboard for the autonomous activity. No per-memory detail page. See Q3/Q5.
+3. **Human interface is useful but not yet the V1 Human Mirror.** Wiki/dashboard pages exist, but the spec routes, deterministic mirror files, disputed markers, and curated annotations remain V1-D work.
-4. **Harness 17/18** — the `p04-constraints` fixture wants "Zerodur" but retrieval surfaces related-not-exact terms. Content gap, not a retrieval regression.
+4. **Harness known issue:** `p04-constraints` wants "Zerodur" and "1.2"; live retrieval surfaces related constraints but not those exact strings. Treat as content/state gap until fixed.
-5. **Two projects under-populated**: p05-interferometer (4 memories, 18 state) and atomizer-v2 (1 memory, 6 state). Batch re-extract with the new llm-0.6.0 prompt would help.
+5. **Formal docs lag the ledger during fast work.** Use `DEV-LEDGER.md` and `python scripts/live_status.py` for live truth, then copy verified claims into these docs.
--- a/docs/master-plan-status.md
+++ b/docs/master-plan-status.md
@@ -70,9 +70,14 @@ read-only additive mode.
 - Phase 6 - AtoDrive
 - Phase 10 - Write-back
 - Phase 11 - Multi-model
 - Phase 12 - Evaluation
 - Phase 13 - Hardening
 ### Partial / Operational Baseline
 - Phase 12 - Evaluation. The retrieval/context harness exists and runs
  against live Dalidou, but coverage is still intentionally small and
  should grow before this is complete in the intended sense.
 ### Engineering Layer Planning Sprint
 **Status: complete.** All 8 architecture docs are drafted. The
@@ -126,11 +131,13 @@ This sits implicitly between Phase 8 (OpenClaw) and Phase 11
 (multi-model). Memory-review and engineering-entity commands are
 deferred from the shared client until their workflows are exercised.
-## What Is Real Today (updated 2026-04-16)
+## What Is Real Today (updated 2026-04-24)
- canonical AtoCore runtime on Dalidou (`775960c`, deploy.sh verified)
+- canonical AtoCore runtime on Dalidou (`2b86543`, deploy.sh verified)
 - 33,253 vectors across 6 registered projects
- 234 captured interactions (192 claude-code, 38 openclaw, 4 test)
+- 951 captured interactions as of the 2026-04-24 live dashboard; refresh
  exact live counts with
  `python scripts/live_status.py`
 - 6 registered projects:
  - `p04-gigabit` (483 docs, 15 state entries)
  - `p05-interferometer` (109 docs, 18 state entries)
@@ -138,12 +145,14 @@ deferred from the shared client until their workflows are exercised.
  - `atomizer-v2` (568 docs, 5 state entries)
  - `abb-space` (6 state entries)
  - `atocore` (drive source, 47 state entries)
- 110 Trusted Project State entries across all projects (decisions, requirements, facts, contacts, milestones)
+- 128 Trusted Project State entries across all projects (decisions, requirements, facts, contacts, milestones)
- 84 active memories (31 project, 23 knowledge, 10 episodic, 8 adaptation, 7 preference, 5 identity)
+- 290 active memories and 0 candidate memories as of the 2026-04-24 live
  dashboard
 - context pack assembly with 4 tiers: Trusted Project State > identity/preference > project memories > retrieved chunks
 - query-relevance memory ranking with overlap-density scoring
- retrieval eval harness: 18 fixtures, 17/18 passing on live
- 303 tests passing
+- retrieval eval harness: 20 fixtures; current live has 19 pass, 1 known
+  content gap, and 0 blocking failures after the audit-improvements deploy
 - 567 tests passing on the active `codex/project-id-metadata-retrieval` branch
 - nightly pipeline: backup → cleanup → rsync → OpenClaw import → vault refresh → extract → triage → **auto-promote/expire** → weekly synth/lint → **retrieval harness** → **pipeline summary to project state**
 - Phase 10 operational: reinforcement-based auto-promotion (ref_count ≥ 3, confidence ≥ 0.7) + stale candidate expiry (14 days unreinforced)
 - pipeline health visible in dashboard: interaction totals by client, pipeline last_run, harness results, triage stats
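The Phase 10 thresholds listed above (ref_count ≥ 3, confidence ≥ 0.7, 14 days unreinforced) can be read as a simple decision rule. The sketch below is an assumption about how they combine: the function name, the argument names, and the choice to check promotion before expiry are all hypothetical.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone

PROMOTE_MIN_REF_COUNT = 3
PROMOTE_MIN_CONFIDENCE = 0.7
EXPIRE_AFTER = timedelta(days=14)


def triage_candidate(
    ref_count: int,
    confidence: float,
    last_reinforced_at: datetime,
    now: datetime,
) -> str:
    """Return "promote", "expire", or "keep" for a candidate memory."""
    if ref_count >= PROMOTE_MIN_REF_COUNT and confidence >= PROMOTE_MIN_CONFIDENCE:
        return "promote"
    # Assumption: a candidate expires once it has gone 14 full days
    # without reinforcement.
    if now - last_reinforced_at >= EXPIRE_AFTER:
        return "expire"
    return "keep"
```

A candidate that meets only one of the two promotion thresholds stays in the queue until it is reinforced again or ages out.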
@@ -190,9 +199,9 @@ where surfaces are disjoint, pauses when they collide.
 | V1-E | Memory→entity graduation end-to-end + remaining Q-4 trust tests | pending V1-D (note: collides with memory extractor; pauses for multi-model triage work) |
 | V1-F | F-5 detector generalization + route alias + O-1/O-2/O-3 operational + D-1/D-3/D-4 docs | finish line |
-R14 (P2, non-blocking): `POST /entities/{id}/promote` route returns 500
-on the new V1-0 `ValueError` instead of 400. Fix on branch
-`claude/r14-promote-400`, pending Codex review.
+R14 is closed: `POST /entities/{id}/promote` now translates
+caller-fixable V1-0 provenance validation failures into HTTP 400 instead
+of leaking as HTTP 500.
 ## Next
--- a/scripts/backfill_chunk_project_ids.py
+++ b/scripts/backfill_chunk_project_ids.py
@@ -0,0 +1,178 @@
 """Backfill explicit project_id into chunk and vector metadata.
 Dry-run by default. The script derives ownership from the registered project
 ingest roots and updates both SQLite source_chunks.metadata and Chroma vector
 metadata only when --apply is provided.
 """
 from __future__ import annotations
 import argparse
 import json
 import sqlite3
 import sys
 from pathlib import Path
 sys.path.insert(0, str(Path(__file__).resolve().parents[1] / "src"))
 from atocore.models.database import get_connection  # noqa: E402
 from atocore.projects.registry import derive_project_id_for_path  # noqa: E402
 from atocore.retrieval.vector_store import get_vector_store  # noqa: E402
 DEFAULT_BATCH_SIZE = 500
 def _decode_metadata(raw: str | None) -> dict | None:
    if not raw:
        return {}
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None
 def _chunk_rows() -> tuple[list[dict], str]:
    try:
        with get_connection() as conn:
            rows = conn.execute(
                """
                SELECT
                    sc.id AS chunk_id,
                    sc.metadata AS chunk_metadata,
                    sd.file_path AS file_path
                FROM source_chunks sc
                JOIN source_documents sd ON sd.id = sc.document_id
                ORDER BY sd.file_path, sc.chunk_index
                """
            ).fetchall()
    except sqlite3.OperationalError as exc:
        if "source_chunks" in str(exc) or "source_documents" in str(exc):
            return [], f"missing ingestion tables: {exc}"
        raise
    return [dict(row) for row in rows], ""
 def _batches(items: list, batch_size: int) -> list[list]:
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
 def backfill(
    apply: bool = False,
    project_filter: str = "",
    batch_size: int = DEFAULT_BATCH_SIZE,
    require_chroma_snapshot: bool = False,
 ) -> dict:
    rows, db_warning = _chunk_rows()
    updates: list[tuple[str, str, dict]] = []
    by_project: dict[str, int] = {}
    skipped_unowned = 0
    already_tagged = 0
    malformed_metadata = 0
    for row in rows:
        project_id = derive_project_id_for_path(row["file_path"])
        if project_filter and project_id != project_filter:
            continue
        if not project_id:
            skipped_unowned += 1
            continue
        metadata = _decode_metadata(row["chunk_metadata"])
        if metadata is None:
            malformed_metadata += 1
            continue
        if metadata.get("project_id") == project_id:
            already_tagged += 1
            continue
        metadata["project_id"] = project_id
        updates.append((row["chunk_id"], project_id, metadata))
        by_project[project_id] = by_project.get(project_id, 0) + 1
    missing_vectors: list[str] = []
    applied_updates = 0
    if apply and updates:
        if not require_chroma_snapshot:
            raise ValueError(
                "--apply requires --chroma-snapshot-confirmed after taking a Chroma backup"
            )
        vector_store = get_vector_store()
        for batch in _batches(updates, max(1, batch_size)):
            chunk_ids = [chunk_id for chunk_id, _, _ in batch]
            vector_payload = vector_store.get_metadatas(chunk_ids)
            existing_vector_metadata = {
                chunk_id: metadata
                for chunk_id, metadata in zip(
                    vector_payload.get("ids", []),
                    vector_payload.get("metadatas", []),
                    strict=False,
                )
                if isinstance(metadata, dict)
            }
            vector_ids = []
            vector_metadatas = []
            sql_updates = []
            for chunk_id, project_id, chunk_metadata in batch:
                vector_metadata = existing_vector_metadata.get(chunk_id)
                if vector_metadata is None:
                    missing_vectors.append(chunk_id)
                    continue
                vector_metadata = dict(vector_metadata)
                vector_metadata["project_id"] = project_id
                vector_ids.append(chunk_id)
                vector_metadatas.append(vector_metadata)
                sql_updates.append((json.dumps(chunk_metadata, ensure_ascii=True), chunk_id))
            if not vector_ids:
                continue
            vector_store.update_metadatas(vector_ids, vector_metadatas)
            with get_connection() as conn:
                cursor = conn.executemany(
                    "UPDATE source_chunks SET metadata = ? WHERE id = ?",
                    sql_updates,
                )
                if cursor.rowcount != len(sql_updates):
                    raise RuntimeError(
                        f"SQLite rowcount mismatch: {cursor.rowcount} != {len(sql_updates)}"
                    )
            applied_updates += len(sql_updates)
    return {
        "apply": apply,
        "total_chunks": len(rows),
        "updates": len(updates),
        "applied_updates": applied_updates,
        "already_tagged": already_tagged,
        "skipped_unowned": skipped_unowned,
        "malformed_metadata": malformed_metadata,
        "missing_vectors": len(missing_vectors),
        "db_warning": db_warning,
        "by_project": dict(sorted(by_project.items())),
    }
 def main() -> int:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--apply", action="store_true", help="write SQLite and Chroma metadata updates")
    parser.add_argument("--project", default="", help="optional canonical project_id filter")
    parser.add_argument("--batch-size", type=int, default=DEFAULT_BATCH_SIZE)
    parser.add_argument(
        "--chroma-snapshot-confirmed",
        action="store_true",
        help="required with --apply; confirms a Chroma snapshot exists",
    )
    args = parser.parse_args()
    payload = backfill(
        apply=args.apply,
        project_filter=args.project.strip(),
        batch_size=args.batch_size,
        require_chroma_snapshot=args.chroma_snapshot_confirmed,
    )
    print(json.dumps(payload, indent=2, ensure_ascii=True))
    return 0
 if __name__ == "__main__":
    raise SystemExit(main())
--- a/scripts/live_status.py
+++ b/scripts/live_status.py
@@ -0,0 +1,131 @@
 """Render a compact live-status report from a running AtoCore instance.
 This is intentionally read-only and stdlib-only so it can be used from a
 fresh checkout, a cron job, or a Codex/Claude session without installing the
 full app package. The output is meant to reduce docs drift: copy the report
 into status docs only after it was generated from the live service.
 """
 from __future__ import annotations
 import argparse
 import errno
 import json
 import os
 import sys
 import urllib.error
 import urllib.request
 from typing import Any
 DEFAULT_BASE_URL = os.environ.get("ATOCORE_BASE_URL", "http://dalidou:8100").rstrip("/")
 DEFAULT_TIMEOUT = int(os.environ.get("ATOCORE_TIMEOUT_SECONDS", "30"))
 DEFAULT_AUTH_TOKEN = os.environ.get("ATOCORE_AUTH_TOKEN", "").strip()
 def request_json(base_url: str, path: str, timeout: int, auth_token: str = "") -> dict[str, Any]:
    headers = {"Authorization": f"Bearer {auth_token}"} if auth_token else {}
    req = urllib.request.Request(f"{base_url}{path}", method="GET", headers=headers)
    with urllib.request.urlopen(req, timeout=timeout) as response:
        body = response.read().decode("utf-8")
        status = getattr(response, "status", None)
    payload = json.loads(body) if body.strip() else {}
    if not isinstance(payload, dict):
        payload = {"value": payload}
    if status is not None:
        payload["_http_status"] = status
    return payload
 def collect_status(base_url: str, timeout: int, auth_token: str = "") -> dict[str, Any]:
    payload: dict[str, Any] = {"base_url": base_url}
    for name, path in {
        "health": "/health",
        "stats": "/stats",
        "projects": "/projects",
        "dashboard": "/admin/dashboard",
    }.items():
        try:
            payload[name] = request_json(base_url, path, timeout, auth_token)
        except (urllib.error.URLError, TimeoutError, OSError, json.JSONDecodeError) as exc:
            payload[name] = {"error": str(exc)}
    return payload
 def render_markdown(status: dict[str, Any]) -> str:
    health = status.get("health", {})
    stats = status.get("stats", {})
    projects = status.get("projects", {}).get("projects", [])
    dashboard = status.get("dashboard", {})
    memories = dashboard.get("memories", {}) if isinstance(dashboard.get("memories"), dict) else {}
    project_state = dashboard.get("project_state", {}) if isinstance(dashboard.get("project_state"), dict) else {}
    interactions = dashboard.get("interactions", {}) if isinstance(dashboard.get("interactions"), dict) else {}
    pipeline = dashboard.get("pipeline", {}) if isinstance(dashboard.get("pipeline"), dict) else {}
    lines = [
        "# AtoCore Live Status",
        "",
        f"- base_url: `{status.get('base_url', '')}`",
        "- endpoint_http_statuses: "
        f"`health={health.get('_http_status', 'error')}, "
        f"stats={stats.get('_http_status', 'error')}, "
        f"projects={status.get('projects', {}).get('_http_status', 'error')}, "
        f"dashboard={dashboard.get('_http_status', 'error')}`",
        f"- service_status: `{health.get('status', 'unknown')}`",
        f"- code_version: `{health.get('code_version', health.get('version', 'unknown'))}`",
        f"- build_sha: `{health.get('build_sha', 'unknown')}`",
        f"- build_branch: `{health.get('build_branch', 'unknown')}`",
        f"- build_time: `{health.get('build_time', 'unknown')}`",
        f"- env: `{health.get('env', 'unknown')}`",
        f"- documents: `{stats.get('total_documents', 'unknown')}`",
        f"- chunks: `{stats.get('total_chunks', 'unknown')}`",
        f"- vectors: `{stats.get('total_vectors', health.get('vectors_count', 'unknown'))}`",
        f"- registered_projects: `{len(projects)}`",
        f"- active_memories: `{memories.get('active', 'unknown')}`",
        f"- candidate_memories: `{memories.get('candidates', 'unknown')}`",
        f"- interactions: `{interactions.get('total', 'unknown')}`",
        f"- project_state_entries: `{project_state.get('total', 'unknown')}`",
        f"- pipeline_last_run: `{pipeline.get('last_run', 'unknown')}`",
    ]
    if projects:
        lines.extend(["", "## Projects"])
        for project in projects:
            aliases = ", ".join(project.get("aliases", []))
            suffix = f" ({aliases})" if aliases else ""
            lines.append(f"- `{project.get('id', '')}`{suffix}")
    return "\n".join(lines) + "\n"
 def main() -> int:
    parser = argparse.ArgumentParser(description="Render live AtoCore status")
    parser.add_argument("--base-url", default=DEFAULT_BASE_URL)
    parser.add_argument("--timeout", type=int, default=DEFAULT_TIMEOUT)
    parser.add_argument(
        "--auth-token",
        default=DEFAULT_AUTH_TOKEN,
        help="Bearer token; defaults to ATOCORE_AUTH_TOKEN when set",
    )
    parser.add_argument("--json", action="store_true", help="emit raw JSON")
    args = parser.parse_args()
    status = collect_status(args.base_url.rstrip("/"), args.timeout, args.auth_token)
    if args.json:
        output = json.dumps(status, indent=2, ensure_ascii=True) + "\n"
    else:
        output = render_markdown(status)
    try:
        sys.stdout.write(output)
    except BrokenPipeError:
        return 0
    except OSError as exc:
        if exc.errno in {errno.EINVAL, errno.EPIPE}:
            return 0
        raise
    return 0
 if __name__ == "__main__":
    raise SystemExit(main())
--- a/scripts/retrieval_eval.py
+++ b/scripts/retrieval_eval.py
@@ -44,6 +44,7 @@ import urllib.error
 import urllib.parse
 import urllib.request
 from dataclasses import dataclass, field
 from datetime import datetime, timezone
 from pathlib import Path
 DEFAULT_BASE_URL = os.environ.get("ATOCORE_BASE_URL", "http://dalidou:8100")
@@ -52,6 +53,13 @@ DEFAULT_BUDGET = 3000
 DEFAULT_FIXTURES = Path(__file__).parent / "retrieval_eval_fixtures.json"
 def request_json(base_url: str, path: str, timeout: int) -> dict:
    req = urllib.request.Request(f"{base_url}{path}", method="GET")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = resp.read().decode("utf-8")
    return json.loads(body) if body.strip() else {}
@dataclass
 class Fixture:
    name: str
@@ -60,6 +68,7 @@ class Fixture:
    budget: int = DEFAULT_BUDGET
    expect_present: list[str] = field(default_factory=list)
    expect_absent: list[str] = field(default_factory=list)
    known_issue: bool = False
    notes: str = ""
@@ -70,8 +79,13 @@ class FixtureResult:
    missing_present: list[str]
    unexpected_absent: list[str]
    total_chars: int
    known_issue: bool = False
    error: str = ""
    @property
    def blocking_failure(self) -> bool:
        return not self.ok and not self.known_issue
 def load_fixtures(path: Path) -> list[Fixture]:
    data = json.loads(path.read_text(encoding="utf-8"))
@@ -89,6 +103,7 @@ def load_fixtures(path: Path) -> list[Fixture]:
                budget=int(raw.get("budget", DEFAULT_BUDGET)),
                expect_present=list(raw.get("expect_present", [])),
                expect_absent=list(raw.get("expect_absent", [])),
                known_issue=bool(raw.get("known_issue", False)),
                notes=raw.get("notes", ""),
            )
        )
@@ -117,6 +132,7 @@ def run_fixture(fixture: Fixture, base_url: str, timeout: int) -> FixtureResult:
            missing_present=list(fixture.expect_present),
            unexpected_absent=[],
            total_chars=0,
            known_issue=fixture.known_issue,
            error=f"http_error: {exc}",
        )
@@ -129,16 +145,26 @@ def run_fixture(fixture: Fixture, base_url: str, timeout: int) -> FixtureResult:
        missing_present=missing,
        unexpected_absent=unexpected,
        total_chars=len(formatted),
        known_issue=fixture.known_issue,
    )
-def print_human_report(results: list[FixtureResult]) -> None:
+def print_human_report(results: list[FixtureResult], metadata: dict) -> None:
    total = len(results)
    passed = sum(1 for r in results if r.ok)
    known = sum(1 for r in results if not r.ok and r.known_issue)
    blocking = sum(1 for r in results if r.blocking_failure)
    print(f"Retrieval eval: {passed}/{total} fixtures passed")
    print(
        "Target: "
        f"{metadata.get('base_url', 'unknown')} "
        f"build={metadata.get('health', {}).get('build_sha', 'unknown')}"
    )
    if known or blocking:
        print(f"Blocking failures: {blocking}  Known issues: {known}")
    print()
    for r in results:
-        marker = "PASS" if r.ok else "FAIL"
+        marker = "PASS" if r.ok else ("KNOWN" if r.known_issue else "FAIL")
        print(f"[{marker}] {r.fixture.name}  project={r.fixture.project}  chars={r.total_chars}")
        if r.error:
            print(f"       error: {r.error}")
@@ -150,15 +176,21 @@ def print_human_report(results: list[FixtureResult]) -> None:
            print(f"       notes: {r.fixture.notes}")
-def print_json_report(results: list[FixtureResult]) -> None:
+def print_json_report(results: list[FixtureResult], metadata: dict) -> None:
    payload = {
        "generated_at": metadata.get("generated_at"),
        "base_url": metadata.get("base_url"),
        "health": metadata.get("health", {}),
        "total": len(results),
        "passed": sum(1 for r in results if r.ok),
        "known_issues": sum(1 for r in results if not r.ok and r.known_issue),
        "blocking_failures": sum(1 for r in results if r.blocking_failure),
        "fixtures": [
            {
                "name": r.fixture.name,
                "project": r.fixture.project,
                "ok": r.ok,
                "known_issue": r.known_issue,
                "total_chars": r.total_chars,
                "missing_present": r.missing_present,
                "unexpected_absent": r.unexpected_absent,
@@ -179,15 +211,26 @@ def main() -> int:
    parser.add_argument("--json", action="store_true", help="emit machine-readable JSON")
    args = parser.parse_args()
    base_url = args.base_url.rstrip("/")
    try:
        health = request_json(base_url, "/health", args.timeout)
    except (urllib.error.URLError, TimeoutError, OSError, json.JSONDecodeError) as exc:
        health = {"error": str(exc)}
    metadata = {
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "base_url": base_url,
        "health": health,
    }
    fixtures = load_fixtures(args.fixtures)
-    results = [run_fixture(f, args.base_url, args.timeout) for f in fixtures]
+    results = [run_fixture(f, base_url, args.timeout) for f in fixtures]
    if args.json:
-        print_json_report(results)
+        print_json_report(results, metadata)
    else:
-        print_human_report(results)
+        print_human_report(results, metadata)
-    return 0 if all(r.ok for r in results) else 1
+    return 0 if not any(r.blocking_failure for r in results) else 1
 if __name__ == "__main__":
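Reviewer note: the exit-code change above can be checked in isolation. The sketch below is a minimal stand-in, assuming a reduced `Result` record (hypothetical, not the real `FixtureResult`) that keeps only the `ok` and `known_issue` fields; `exit_code` is likewise an illustrative helper name.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Hypothetical, reduced stand-in for FixtureResult."""
    name: str
    ok: bool
    known_issue: bool = False

    @property
    def blocking_failure(self) -> bool:
        # A failure only blocks when it is not flagged as a known issue.
        return not self.ok and not self.known_issue


def exit_code(results: list[Result]) -> int:
    """Mirror the new return rule: non-zero only on blocking failures."""
    return 0 if not any(r.blocking_failure for r in results) else 1


nightly = [Result("p04-short-ambiguous", True), Result("p04-constraints", False, known_issue=True)]
print(exit_code(nightly))                               # known gap stays visible but does not fail the run
print(exit_code(nightly + [Result("new-regression", False)]))  # an unflagged failure still fails the run
```

Under this rule a fixture marked `known_issue` still prints as `[KNOWN]` in the human report, so the gap remains visible while the harness stays green.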
--- a/scripts/retrieval_eval_fixtures.json
+++ b/scripts/retrieval_eval_fixtures.json
@@ -27,7 +27,8 @@
    "expect_absent": [
      "polisher suite"
    ],
-    "notes": "Key constraints are in Trusted Project State and in the mission-framing memory"
+    "known_issue": true,
    "notes": "Known content gap as of 2026-04-24: live retrieval surfaces related constraints but not the exact Zerodur / 1.2 strings. Keep visible, but do not make nightly harness red until the source/state gap is fixed."
  },
  {
    "name": "p04-short-ambiguous",
@@ -80,6 +81,36 @@
    ],
    "notes": "CGH is a core p05 concept. Should surface via chunks and possibly the architecture memory. Must not bleed p06 polisher-suite terms."
  },
  {
    "name": "p05-broad-status-no-atomizer",
    "project": "p05-interferometer",
    "prompt": "current status",
    "expect_present": [
      "--- Trusted Project State ---",
      "--- Project Memories ---",
      "Zygo"
    ],
    "expect_absent": [
      "atomizer-v2",
      "ATOMIZER_PODCAST_BRIEFING",
      "[Source: atomizer-v2/",
      "P04-GigaBIT-M1-KB-design"
    ],
    "notes": "Regression guard for the April 24 audit finding: broad p05 status queries must not pull Atomizer/archive context into project-scoped packs."
  },
  {
    "name": "p05-vendor-decision-no-archive-first",
    "project": "p05-interferometer",
    "prompt": "vendor selection decision",
    "expect_present": [
      "Selection-Decision"
    ],
    "expect_absent": [
      "[Source: atomizer-v2/",
      "ATOMIZER_PODCAST_BRIEFING"
    ],
    "notes": "Project-scoped decision query should stay inside p05 and prefer current decision/vendor material over unrelated project archives."
  },
  {
    "name": "p06-suite-split",
    "project": "p06-polisher",
--- a/src/atocore/config.py
+++ b/src/atocore/config.py
@@ -46,6 +46,8 @@ class Settings(BaseSettings):
    # All multipliers default to the values used since Wave 1; tighten or
    # loosen them via ATOCORE_* env vars without touching code.
    rank_project_match_boost: float = 2.0
    rank_project_scope_filter: bool = True
    rank_project_scope_candidate_multiplier: int = 4
    rank_query_token_step: float = 0.08
    rank_query_token_cap: float = 1.32
    rank_path_high_signal_boost: float = 1.18
--- a/src/atocore/ingestion/pipeline.py
+++ b/src/atocore/ingestion/pipeline.py
@@ -32,10 +32,23 @@ def exclusive_ingestion():
        _INGESTION_LOCK.release()
-def ingest_file(file_path: Path) -> dict:
+def ingest_file(file_path: Path, project_id: str = "") -> dict:
    """Ingest a single markdown file. Returns stats."""
    start = time.time()
    file_path = file_path.resolve()
    project_id = (project_id or "").strip()
    if not project_id:
        try:
            from atocore.projects.registry import derive_project_id_for_path
            project_id = derive_project_id_for_path(file_path)
        except Exception as exc:
            log.warning(
                "project_id_derivation_failed",
                file_path=str(file_path),
                error=str(exc),
            )
            project_id = ""
    if not file_path.exists():
        raise FileNotFoundError(f"File not found: {file_path}")
@@ -65,6 +78,7 @@ def ingest_file(file_path: Path) -> dict:
        "source_file": str(file_path),
        "tags": parsed.tags,
        "title": parsed.title,
        "project_id": project_id,
    }
    chunks = chunk_markdown(parsed.body, base_metadata=base_meta)
@@ -116,6 +130,7 @@ def ingest_file(file_path: Path) -> dict:
                        "source_file": str(file_path),
                        "tags": json.dumps(parsed.tags),
                        "title": parsed.title,
                        "project_id": project_id,
                    })
                    conn.execute(
@@ -173,7 +188,17 @@ def ingest_folder(folder_path: Path, purge_deleted: bool = True) -> list[dict]:
        purge_deleted: If True, remove DB/vector entries for files
                       that no longer exist on disk.
    """
    return ingest_project_folder(folder_path, purge_deleted=purge_deleted, project_id="")
 def ingest_project_folder(
    folder_path: Path,
    purge_deleted: bool = True,
    project_id: str = "",
 ) -> list[dict]:
    """Ingest a folder and annotate chunks with an optional project id."""
    folder_path = folder_path.resolve()
    project_id = (project_id or "").strip()
    if not folder_path.is_dir():
        raise NotADirectoryError(f"Not a directory: {folder_path}")
@@ -187,7 +212,7 @@ def ingest_folder(folder_path: Path, purge_deleted: bool = True) -> list[dict]:
    # Ingest new/changed files
    for md_file in md_files:
        try:
-            result = ingest_file(md_file)
+            result = ingest_file(md_file, project_id=project_id)
            results.append(result)
        except Exception as e:
            log.error("ingestion_error", file_path=str(md_file), error=str(e))
--- a/src/atocore/projects/registry.py
+++ b/src/atocore/projects/registry.py
@@ -8,7 +8,6 @@ from dataclasses import asdict, dataclass
 from pathlib import Path
 import atocore.config as _config
 from atocore.ingestion.pipeline import ingest_folder
 # Reserved pseudo-projects. `inbox` holds pre-project / lead / quote
@@ -260,6 +259,7 @@ def load_project_registry() -> list[RegisteredProject]:
        )
    _validate_unique_project_names(projects)
    _validate_ingest_root_overlaps(projects)
    return projects
@@ -307,6 +307,28 @@ def resolve_project_name(name: str | None) -> str:
    return name
 def derive_project_id_for_path(file_path: str | Path) -> str:
    """Return the registered project that owns a source path, if any."""
    if not file_path:
        return ""
    doc_path = Path(file_path).resolve(strict=False)
    matches: list[tuple[int, int, str]] = []
    for project in load_project_registry():
        for source_ref in project.ingest_roots:
            root_path = _resolve_ingest_root(source_ref)
            try:
                doc_path.relative_to(root_path)
            except ValueError:
                continue
            matches.append((len(root_path.parts), len(str(root_path)), project.project_id))
    if not matches:
        return ""
    matches.sort(reverse=True)
    return matches[0][2]
 def refresh_registered_project(project_name: str, purge_deleted: bool = False) -> dict:
    """Ingest all configured source roots for a registered project.
@@ -322,6 +344,8 @@ def refresh_registered_project(project_name: str, purge_deleted: bool = False) -
    if project is None:
        raise ValueError(f"Unknown project: {project_name}")
    from atocore.ingestion.pipeline import ingest_project_folder
    roots = []
    ingested_count = 0
    skipped_count = 0
@@ -346,7 +370,11 @@ def refresh_registered_project(project_name: str, purge_deleted: bool = False) -
            {
                **root_result,
                "status": "ingested",
-                "results": ingest_folder(resolved, purge_deleted=purge_deleted),
+                "results": ingest_project_folder(
                    resolved,
                    purge_deleted=purge_deleted,
                    project_id=project.project_id,
                ),
            }
        )
        ingested_count += 1
@@ -443,6 +471,33 @@ def _validate_unique_project_names(projects: list[RegisteredProject]) -> None:
            seen[key] = project.project_id
 def _validate_ingest_root_overlaps(projects: list[RegisteredProject]) -> None:
    roots: list[tuple[str, Path]] = []
    for project in projects:
        for source_ref in project.ingest_roots:
            roots.append((project.project_id, _resolve_ingest_root(source_ref)))
    for i, (left_project, left_root) in enumerate(roots):
        for right_project, right_root in roots[i + 1:]:
            if left_project == right_project:
                continue
            try:
                left_root.relative_to(right_root)
                overlaps = True
            except ValueError:
                try:
                    right_root.relative_to(left_root)
                    overlaps = True
                except ValueError:
                    overlaps = False
            if overlaps:
                raise ValueError(
                    "Project registry ingest root overlap: "
                    f"'{left_root}' ({left_project}) and "
                    f"'{right_root}' ({right_project})"
                )
 def _find_name_collisions(
    project_id: str,
    aliases: list[str],
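Reviewer note: `derive_project_id_for_path` picks the deepest matching ingest root. The sketch below reproduces that selection with plain `PurePosixPath` values, assuming the roots are already resolved to absolute paths; `derive_owner` and the example paths are hypothetical and stand in for the registry lookup.

```python
from pathlib import PurePosixPath


def derive_owner(doc: str, roots: list[tuple[str, str]]) -> str:
    """Return the project id whose root contains doc, preferring the deepest root."""
    doc_path = PurePosixPath(doc)
    matches: list[tuple[int, int, str]] = []
    for project_id, root in roots:
        root_path = PurePosixPath(root)
        try:
            # relative_to compares whole path components, so a sibling
            # directory that merely shares a name prefix does not match.
            doc_path.relative_to(root_path)
        except ValueError:
            continue
        matches.append((len(root_path.parts), len(str(root_path)), project_id))
    if not matches:
        return ""
    # Deepest root first; string length breaks ties between equal depths.
    matches.sort(reverse=True)
    return matches[0][2]


ROOTS = [
    ("p04-gigabit", "/vault/incoming/projects/p04-gigabit"),
    ("p05-interferometer", "/vault/incoming/projects/p05-interferometer"),
]
print(derive_owner("/vault/incoming/projects/p04-gigabit/status.md", ROOTS))
print(derive_owner("/vault/incoming/projects/p04-gigabit-archive/old.md", ROOTS))
```

Since `_validate_ingest_root_overlaps` rejects nested roots across different projects, the sort mainly matters when one project lists several roots; it reads as a defensive tie-break rather than a routing rule.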
--- a/src/atocore/retrieval/retriever.py
+++ b/src/atocore/retrieval/retriever.py
@@ -1,5 +1,6 @@
 """Retrieval: query to ranked chunks."""
 import json
 import re
 import time
 from dataclasses import dataclass
@@ -7,7 +8,7 @@ from dataclasses import dataclass
 import atocore.config as _config
 from atocore.models.database import get_connection
 from atocore.observability.logger import get_logger
-from atocore.projects.registry import get_registered_project
+from atocore.projects.registry import RegisteredProject, get_registered_project, load_project_registry
 from atocore.retrieval.embeddings import embed_query
 from atocore.retrieval.vector_store import get_vector_store
@@ -83,6 +84,27 @@ def retrieve(
    """Retrieve the most relevant chunks for a query."""
    top_k = top_k or _config.settings.context_top_k
    start = time.time()
    try:
        scoped_project = get_registered_project(project_hint) if project_hint else None
    except Exception as exc:
        log.warning(
            "project_scope_resolution_failed",
            project_hint=project_hint,
            error=str(exc),
        )
        scoped_project = None
    scope_filter_enabled = bool(scoped_project and _config.settings.rank_project_scope_filter)
    registered_projects = None
    query_top_k = top_k
    if scope_filter_enabled:
        query_top_k = max(
            top_k,
            top_k * max(1, _config.settings.rank_project_scope_candidate_multiplier),
        )
        try:
            registered_projects = load_project_registry()
        except Exception:
            registered_projects = None
    query_embedding = embed_query(query)
    store = get_vector_store()
@@ -101,11 +123,12 @@ def retrieve(
    results = store.query(
        query_embedding=query_embedding,
-        top_k=top_k,
+        top_k=query_top_k,
        where=where,
    )
    chunks = []
    raw_result_count = len(results["ids"][0]) if results and results["ids"] and results["ids"][0] else 0
    if results and results["ids"] and results["ids"][0]:
        existing_ids = _existing_chunk_ids(results["ids"][0])
        for i, chunk_id in enumerate(results["ids"][0]):
@@ -117,6 +140,13 @@ def retrieve(
            meta = results["metadatas"][0][i] if results["metadatas"] else {}
            content = results["documents"][0][i] if results["documents"] else ""
            if scope_filter_enabled and not _is_allowed_for_project_scope(
                scoped_project,
                meta,
                registered_projects,
            ):
                continue
            score *= _query_match_boost(query, meta)
            score *= _path_signal_boost(meta)
            if project_hint:
@@ -137,42 +167,151 @@ def retrieve(
    duration_ms = int((time.time() - start) * 1000)
    chunks.sort(key=lambda chunk: chunk.score, reverse=True)
    post_filter_count = len(chunks)
    chunks = chunks[:top_k]
    log.info(
        "retrieval_done",
        query=query[:100],
        top_k=top_k,
        query_top_k=query_top_k,
        raw_results_count=raw_result_count,
        post_filter_count=post_filter_count,
        results_count=len(chunks),
        post_filter_dropped=max(0, raw_result_count - post_filter_count),
        underfilled=bool(raw_result_count >= query_top_k and len(chunks) < top_k),
        duration_ms=duration_ms,
    )
    return chunks
 def _is_allowed_for_project_scope(
    project: RegisteredProject,
    metadata: dict,
    registered_projects: list[RegisteredProject] | None = None,
 ) -> bool:
    """Return True when a chunk is target-project or not project-owned.
    Project-hinted retrieval should not let one registered project's corpus
    compete with another's. At the same time, unowned/global sources should
    remain eligible because shared docs and cross-project references can be
    genuinely useful. The registry gives us the boundary: if metadata matches
    a registered project and it is not the requested project, filter it out.
    """
    if _metadata_matches_project(project, metadata):
        return True
    if registered_projects is None:
        try:
            registered_projects = load_project_registry()
        except Exception:
            return True
    for other in registered_projects:
        if other.project_id == project.project_id:
            continue
        if _metadata_matches_project(other, metadata):
            return False
    return True
 def _metadata_matches_project(project: RegisteredProject, metadata: dict) -> bool:
    stored_project_id = str(metadata.get("project_id", "")).strip().lower()
    if stored_project_id:
        return stored_project_id == project.project_id.lower()
    path = _metadata_source_path(metadata)
    tags = _metadata_tags(metadata)
    for term in _project_scope_terms(project):
        if _path_matches_term(path, term) or term in tags:
            return True
    return False
 def _project_scope_terms(project: RegisteredProject) -> set[str]:
    terms = {project.project_id.lower()}
    terms.update(alias.lower() for alias in project.aliases)
    for source_ref in project.ingest_roots:
        normalized = source_ref.subpath.replace("\\", "/").strip("/").lower()
        if normalized:
            terms.add(normalized)
            terms.add(normalized.split("/")[-1])
    return {term for term in terms if term}
 def _metadata_searchable(metadata: dict) -> str:
    return " ".join(
        [
            str(metadata.get("source_file", "")).replace("\\", "/").lower(),
            str(metadata.get("title", "")).lower(),
            str(metadata.get("heading_path", "")).lower(),
            str(metadata.get("tags", "")).lower(),
        ]
    )
 def _metadata_source_path(metadata: dict) -> str:
    return str(metadata.get("source_file", "")).replace("\\", "/").strip("/").lower()
 def _metadata_tags(metadata: dict) -> set[str]:
    raw_tags = metadata.get("tags", [])
    if isinstance(raw_tags, (list, tuple, set)):
        return {str(tag).strip().lower() for tag in raw_tags if str(tag).strip()}
    if isinstance(raw_tags, str):
        try:
            parsed = json.loads(raw_tags)
        except json.JSONDecodeError:
            parsed = [raw_tags]
        if isinstance(parsed, (list, tuple, set)):
            return {str(tag).strip().lower() for tag in parsed if str(tag).strip()}
        if isinstance(parsed, str) and parsed.strip():
            return {parsed.strip().lower()}
    return set()
 def _path_matches_term(path: str, term: str) -> bool:
    normalized = term.replace("\\", "/").strip("/").lower()
    if not path or not normalized:
        return False
    if "/" in normalized:
        return path == normalized or path.startswith(f"{normalized}/")
    return normalized in set(path.split("/"))
 def _metadata_has_term(metadata: dict, term: str) -> bool:
    normalized = term.replace("\\", "/").strip("/").lower()
    if not normalized:
        return False
    if _path_matches_term(_metadata_source_path(metadata), normalized):
        return True
    if normalized in _metadata_tags(metadata):
        return True
    return re.search(
        rf"(?<![a-z0-9]){re.escape(normalized)}(?![a-z0-9])",
        _metadata_searchable(metadata),
    ) is not None
 def _project_match_boost(project_hint: str, metadata: dict) -> float:
    """Return a project-aware relevance multiplier for raw retrieval."""
    hint_lower = project_hint.strip().lower()
    if not hint_lower:
        return 1.0
-    source_file = str(metadata.get("source_file", "")).lower()
-    title = str(metadata.get("title", "")).lower()
-    tags = str(metadata.get("tags", "")).lower()
-    searchable = " ".join([source_file, title, tags])
-    project = get_registered_project(project_hint)
-    candidate_names = {hint_lower}
-    if project is not None:
-        candidate_names.add(project.project_id.lower())
-        candidate_names.update(alias.lower() for alias in project.aliases)
-        candidate_names.update(
-            source_ref.subpath.replace("\\", "/").strip("/").split("/")[-1].lower()
-            for source_ref in project.ingest_roots
-            if source_ref.subpath.strip("/\\")
-        )
-
+    try:
+        project = get_registered_project(project_hint)
+    except Exception as exc:
+        log.warning(
+            "project_match_boost_resolution_failed",
+            project_hint=project_hint,
+            error=str(exc),
+        )
+        project = None
+    candidate_names = _project_scope_terms(project) if project is not None else {hint_lower}
     for candidate in candidate_names:
-        if candidate and candidate in searchable:
+        if _metadata_has_term(metadata, candidate):
             return _config.settings.rank_project_match_boost
     return 1.0
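Reviewer note: the path check added in this file matches on whole path segments rather than substrings. The sketch below copies the body of `_path_matches_term` into a free function so it can be run alone; the example paths are hypothetical, and the input path is assumed to be already normalized (forward slashes, lower case, no leading slash) as `_metadata_source_path` would produce.

```python
def path_matches_term(path: str, term: str) -> bool:
    """Segment-aware match of a project term against a normalized source path."""
    normalized = term.replace("\\", "/").strip("/").lower()
    if not path or not normalized:
        return False
    if "/" in normalized:
        # Multi-segment terms must match the path from its first segment.
        return path == normalized or path.startswith(f"{normalized}/")
    # Single-segment terms must equal one whole path segment.
    return normalized in set(path.split("/"))


doc = "vault/incoming/projects/p04-gigabit/status.md"
print(path_matches_term(doc, "p04-gigabit"))                    # whole segment: matches
print(path_matches_term(doc, "p04"))                            # partial segment: no match
print(path_matches_term(doc, "incoming/projects/p04-gigabit"))  # not anchored at the path start: no match
```

The last case is worth noting during review: a multi-segment subpath only matches when the stored `source_file` begins with it, so for absolute source paths the single-segment terms (project id, aliases, last subpath segment) appear to carry the match.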
--- a/src/atocore/retrieval/vector_store.py
+++ b/src/atocore/retrieval/vector_store.py
@@ -64,6 +64,18 @@ class VectorStore:
            self._collection.delete(ids=ids)
            log.debug("vectors_deleted", count=len(ids))
    def get_metadatas(self, ids: list[str]) -> dict:
        """Fetch vector metadata by chunk IDs."""
        if not ids:
            return {"ids": [], "metadatas": []}
        return self._collection.get(ids=ids, include=["metadatas"])
    def update_metadatas(self, ids: list[str], metadatas: list[dict]) -> None:
        """Update vector metadata without re-embedding documents."""
        if ids:
            self._collection.update(ids=ids, metadatas=metadatas)
            log.debug("vector_metadatas_updated", count=len(ids))
    @property
    def count(self) -> int:
        return self._collection.count()
--- a/tests/test_backfill_chunk_project_ids.py
+++ b/tests/test_backfill_chunk_project_ids.py
@@ -0,0 +1,154 @@
 """Tests for explicit chunk project_id metadata backfill."""
 import json
 import atocore.config as config
 from atocore.models.database import get_connection, init_db
 from scripts import backfill_chunk_project_ids as backfill
 def _write_registry(tmp_path, monkeypatch):
    vault_dir = tmp_path / "vault"
    drive_dir = tmp_path / "drive"
    config_dir = tmp_path / "config"
    project_dir = vault_dir / "incoming" / "projects" / "p04-gigabit"
    project_dir.mkdir(parents=True)
    drive_dir.mkdir()
    config_dir.mkdir()
    registry_path = config_dir / "project-registry.json"
    registry_path.write_text(
        json.dumps(
            {
                "projects": [
                    {
                        "id": "p04-gigabit",
                        "aliases": ["p04"],
                        "ingest_roots": [
                            {"source": "vault", "subpath": "incoming/projects/p04-gigabit"}
                        ],
                    }
                ]
            }
        ),
        encoding="utf-8",
    )
    monkeypatch.setenv("ATOCORE_VAULT_SOURCE_DIR", str(vault_dir))
    monkeypatch.setenv("ATOCORE_DRIVE_SOURCE_DIR", str(drive_dir))
    monkeypatch.setenv("ATOCORE_PROJECT_REGISTRY_PATH", str(registry_path))
    config.settings = config.Settings()
    return project_dir
 def _insert_chunk(file_path, metadata=None, chunk_id="chunk-1"):
    with get_connection() as conn:
        conn.execute(
            """
            INSERT INTO source_documents (id, file_path, file_hash, title, doc_type, tags)
            VALUES (?, ?, ?, ?, ?, ?)
            """,
            ("doc-1", str(file_path), "hash", "Title", "markdown", "[]"),
        )
        conn.execute(
            """
            INSERT INTO source_chunks
                (id, document_id, chunk_index, content, heading_path, char_count, metadata)
            VALUES (?, ?, ?, ?, ?, ?, ?)
            """,
            (
                chunk_id,
                "doc-1",
                0,
                "content",
                "Overview",
                7,
                json.dumps(metadata if metadata is not None else {}),
            ),
        )
 class FakeVectorStore:
    def __init__(self, metadatas):
        self.metadatas = dict(metadatas)
        self.updated = []
    def get_metadatas(self, ids):
        returned_ids = [chunk_id for chunk_id in ids if chunk_id in self.metadatas]
        return {
            "ids": returned_ids,
            "metadatas": [self.metadatas[chunk_id] for chunk_id in returned_ids],
        }
    def update_metadatas(self, ids, metadatas):
        self.updated.append((list(ids), list(metadatas)))
        for chunk_id, metadata in zip(ids, metadatas, strict=True):
            self.metadatas[chunk_id] = metadata
 def test_backfill_dry_run_is_non_mutating(tmp_data_dir, tmp_path, monkeypatch):
    init_db()
    project_dir = _write_registry(tmp_path, monkeypatch)
    _insert_chunk(project_dir / "status.md")
    result = backfill.backfill(apply=False)
    assert result["updates"] == 1
    with get_connection() as conn:
        row = conn.execute("SELECT metadata FROM source_chunks WHERE id = ?", ("chunk-1",)).fetchone()
    assert json.loads(row["metadata"]) == {}
 def test_backfill_apply_updates_chroma_then_sql(tmp_data_dir, tmp_path, monkeypatch):
    init_db()
    project_dir = _write_registry(tmp_path, monkeypatch)
    _insert_chunk(project_dir / "status.md", metadata={"source_file": "status.md"})
    fake_store = FakeVectorStore({"chunk-1": {"source_file": "status.md"}})
    monkeypatch.setattr(backfill, "get_vector_store", lambda: fake_store)
    result = backfill.backfill(apply=True, require_chroma_snapshot=True)
    assert result["applied_updates"] == 1
    assert fake_store.metadatas["chunk-1"]["project_id"] == "p04-gigabit"
    with get_connection() as conn:
        row = conn.execute("SELECT metadata FROM source_chunks WHERE id = ?", ("chunk-1",)).fetchone()
    assert json.loads(row["metadata"])["project_id"] == "p04-gigabit"
 def test_backfill_apply_requires_snapshot_confirmation(tmp_data_dir, tmp_path, monkeypatch):
    init_db()
    project_dir = _write_registry(tmp_path, monkeypatch)
    _insert_chunk(project_dir / "status.md")
    try:
        backfill.backfill(apply=True)
    except ValueError as exc:
        assert "Chroma backup" in str(exc)
    else:
        raise AssertionError("Expected snapshot confirmation requirement")
 def test_backfill_missing_vector_skips_sql_update(tmp_data_dir, tmp_path, monkeypatch):
    init_db()
    project_dir = _write_registry(tmp_path, monkeypatch)
    _insert_chunk(project_dir / "status.md")
    fake_store = FakeVectorStore({})
    monkeypatch.setattr(backfill, "get_vector_store", lambda: fake_store)
    result = backfill.backfill(apply=True, require_chroma_snapshot=True)
    assert result["updates"] == 1
    assert result["applied_updates"] == 0
    assert result["missing_vectors"] == 1
    with get_connection() as conn:
        row = conn.execute("SELECT metadata FROM source_chunks WHERE id = ?", ("chunk-1",)).fetchone()
    assert json.loads(row["metadata"]) == {}
 def test_backfill_skips_malformed_metadata(tmp_data_dir, tmp_path, monkeypatch):
    init_db()
    project_dir = _write_registry(tmp_path, monkeypatch)
    _insert_chunk(project_dir / "status.md", metadata=[])
    result = backfill.backfill(apply=False)
    assert result["updates"] == 0
    assert result["malformed_metadata"] == 1
--- a/tests/test_config.py
+++ b/tests/test_config.py
@@ -46,6 +46,8 @@ def test_settings_keep_legacy_db_path_when_present(tmp_path, monkeypatch):
 def test_ranking_weights_are_tunable_via_env(monkeypatch):
    monkeypatch.setenv("ATOCORE_RANK_PROJECT_MATCH_BOOST", "3.5")
    monkeypatch.setenv("ATOCORE_RANK_PROJECT_SCOPE_FILTER", "false")
    monkeypatch.setenv("ATOCORE_RANK_PROJECT_SCOPE_CANDIDATE_MULTIPLIER", "6")
    monkeypatch.setenv("ATOCORE_RANK_QUERY_TOKEN_STEP", "0.12")
    monkeypatch.setenv("ATOCORE_RANK_QUERY_TOKEN_CAP", "1.5")
    monkeypatch.setenv("ATOCORE_RANK_PATH_HIGH_SIGNAL_BOOST", "1.25")
@@ -54,6 +56,8 @@ def test_ranking_weights_are_tunable_via_env(monkeypatch):
    settings = config.Settings()
    assert settings.rank_project_match_boost == 3.5
    assert settings.rank_project_scope_filter is False
    assert settings.rank_project_scope_candidate_multiplier == 6
    assert settings.rank_query_token_step == 0.12
    assert settings.rank_query_token_cap == 1.5
    assert settings.rank_path_high_signal_boost == 1.25
--- a/tests/test_ingestion.py
+++ b/tests/test_ingestion.py
@@ -1,8 +1,10 @@
 """Tests for the ingestion pipeline."""
 import json
 from atocore.ingestion.parser import parse_markdown
 from atocore.models.database import get_connection, init_db
-from atocore.ingestion.pipeline import ingest_file, ingest_folder
+from atocore.ingestion.pipeline import ingest_file, ingest_folder, ingest_project_folder
 def test_parse_markdown(sample_markdown):
@@ -69,6 +71,153 @@ def test_ingest_updates_changed(tmp_data_dir, sample_markdown):
    assert result["status"] == "ingested"
 def test_ingest_file_records_project_id_metadata(tmp_data_dir, sample_markdown, monkeypatch):
    """Project-aware ingestion should tag DB and vector metadata exactly."""
    init_db()
    class FakeVectorStore:
        def __init__(self):
            self.metadatas = []
        def add(self, ids, documents, metadatas):
            self.metadatas.extend(metadatas)
        def delete(self, ids):
            return None
    fake_store = FakeVectorStore()
    monkeypatch.setattr("atocore.ingestion.pipeline.get_vector_store", lambda: fake_store)
    result = ingest_file(sample_markdown, project_id="p04-gigabit")
    assert result["status"] == "ingested"
    assert fake_store.metadatas
    assert all(meta["project_id"] == "p04-gigabit" for meta in fake_store.metadatas)
    with get_connection() as conn:
        rows = conn.execute("SELECT metadata FROM source_chunks").fetchall()
    assert rows
    assert all(
        json.loads(row["metadata"])["project_id"] == "p04-gigabit"
        for row in rows
    )
 def test_ingest_file_derives_project_id_from_registry_root(tmp_data_dir, tmp_path, monkeypatch):
    """Unscoped ingest should preserve ownership for files under registered roots."""
    import atocore.config as config
    vault_dir = tmp_path / "vault"
    drive_dir = tmp_path / "drive"
    config_dir = tmp_path / "config"
    project_dir = vault_dir / "incoming" / "projects" / "p04-gigabit"
    project_dir.mkdir(parents=True)
    drive_dir.mkdir()
    config_dir.mkdir()
    note = project_dir / "status.md"
    note.write_text(
        "# Status\n\nCurrent project status with enough detail to create "
        "a retrievable chunk for the ingestion pipeline test.",
        encoding="utf-8",
    )
    registry_path = config_dir / "project-registry.json"
    registry_path.write_text(
        json.dumps(
            {
                "projects": [
                    {
                        "id": "p04-gigabit",
                        "aliases": ["p04"],
                        "ingest_roots": [
                            {"source": "vault", "subpath": "incoming/projects/p04-gigabit"}
                        ],
                    }
                ]
            }
        ),
        encoding="utf-8",
    )
    class FakeVectorStore:
        def __init__(self):
            self.metadatas = []
        def add(self, ids, documents, metadatas):
            self.metadatas.extend(metadatas)
        def delete(self, ids):
            return None
    fake_store = FakeVectorStore()
    monkeypatch.setenv("ATOCORE_VAULT_SOURCE_DIR", str(vault_dir))
    monkeypatch.setenv("ATOCORE_DRIVE_SOURCE_DIR", str(drive_dir))
    monkeypatch.setenv("ATOCORE_PROJECT_REGISTRY_PATH", str(registry_path))
    # Use monkeypatch so the original settings object is restored at teardown.
    monkeypatch.setattr(config, "settings", config.Settings())
    monkeypatch.setattr("atocore.ingestion.pipeline.get_vector_store", lambda: fake_store)
    init_db()
    result = ingest_file(note)
    assert result["status"] == "ingested"
    assert fake_store.metadatas
    assert all(meta["project_id"] == "p04-gigabit" for meta in fake_store.metadatas)
 def test_ingest_file_logs_and_fails_open_when_project_derivation_fails(
    tmp_data_dir,
    sample_markdown,
    monkeypatch,
 ):
    """A broken registry should be visible but should not block ingestion."""
    init_db()
    warnings = []
    class FakeVectorStore:
        def __init__(self):
            self.metadatas = []
        def add(self, ids, documents, metadatas):
            self.metadatas.extend(metadatas)
        def delete(self, ids):
            return None
    fake_store = FakeVectorStore()
    monkeypatch.setattr("atocore.ingestion.pipeline.get_vector_store", lambda: fake_store)
    monkeypatch.setattr(
        "atocore.projects.registry.derive_project_id_for_path",
        lambda path: (_ for _ in ()).throw(ValueError("registry broken")),
    )
    monkeypatch.setattr(
        "atocore.ingestion.pipeline.log.warning",
        lambda event, **kwargs: warnings.append((event, kwargs)),
    )
    result = ingest_file(sample_markdown)
    assert result["status"] == "ingested"
    assert fake_store.metadatas
    assert all(meta["project_id"] == "" for meta in fake_store.metadatas)
    assert warnings[0][0] == "project_id_derivation_failed"
    assert "registry broken" in warnings[0][1]["error"]
 def test_ingest_project_folder_passes_project_id_to_files(tmp_data_dir, sample_folder, monkeypatch):
    seen = []
    def fake_ingest_file(path, project_id=""):
        seen.append((path.name, project_id))
        return {"file": str(path), "status": "ingested"}
    monkeypatch.setattr("atocore.ingestion.pipeline.ingest_file", fake_ingest_file)
    monkeypatch.setattr("atocore.ingestion.pipeline._purge_deleted_files", lambda *args, **kwargs: 0)
    ingest_project_folder(sample_folder, project_id="p05-interferometer")
    assert seen
    assert {project_id for _, project_id in seen} == {"p05-interferometer"}
 def test_parse_markdown_uses_supplied_text(sample_markdown):
    """Parsing should be able to reuse pre-read content from ingestion."""
    latin_text = """---\ntags: parser\n---\n# Parser Title\n\nBody text."""
--- a/tests/test_project_registry.py
+++ b/tests/test_project_registry.py
@@ -5,6 +5,7 @@ import json
 import atocore.config as config
 from atocore.projects.registry import (
    build_project_registration_proposal,
    derive_project_id_for_path,
    get_registered_project,
    get_project_registry_template,
    list_registered_projects,
@@ -103,6 +104,98 @@ def test_project_registry_resolves_alias(tmp_path, monkeypatch):
    assert project.project_id == "p05-interferometer"
 def test_derive_project_id_for_path_uses_registered_roots(tmp_path, monkeypatch):
    vault_dir = tmp_path / "vault"
    drive_dir = tmp_path / "drive"
    config_dir = tmp_path / "config"
    project_dir = vault_dir / "incoming" / "projects" / "p04-gigabit"
    project_dir.mkdir(parents=True)
    drive_dir.mkdir()
    config_dir.mkdir()
    note = project_dir / "status.md"
    note.write_text("# Status\n\nCurrent work.", encoding="utf-8")
    registry_path = config_dir / "project-registry.json"
    registry_path.write_text(
        json.dumps(
            {
                "projects": [
                    {
                        "id": "p04-gigabit",
                        "aliases": ["p04"],
                        "ingest_roots": [
                            {"source": "vault", "subpath": "incoming/projects/p04-gigabit"}
                        ],
                    }
                ]
            }
        ),
        encoding="utf-8",
    )
    monkeypatch.setenv("ATOCORE_VAULT_SOURCE_DIR", str(vault_dir))
    monkeypatch.setenv("ATOCORE_DRIVE_SOURCE_DIR", str(drive_dir))
    monkeypatch.setenv("ATOCORE_PROJECT_REGISTRY_PATH", str(registry_path))
    original_settings = config.settings
    try:
        config.settings = config.Settings()
        assert derive_project_id_for_path(note) == "p04-gigabit"
        assert derive_project_id_for_path(tmp_path / "elsewhere.md") == ""
    finally:
        config.settings = original_settings
 def test_project_registry_rejects_cross_project_ingest_root_overlap(tmp_path, monkeypatch):
    vault_dir = tmp_path / "vault"
    drive_dir = tmp_path / "drive"
    config_dir = tmp_path / "config"
    vault_dir.mkdir()
    drive_dir.mkdir()
    config_dir.mkdir()
    registry_path = config_dir / "project-registry.json"
    registry_path.write_text(
        json.dumps(
            {
                "projects": [
                    {
                        "id": "parent",
                        "aliases": [],
                        "ingest_roots": [
                            {"source": "vault", "subpath": "incoming/projects/parent"}
                        ],
                    },
                    {
                        "id": "child",
                        "aliases": [],
                        "ingest_roots": [
                            {"source": "vault", "subpath": "incoming/projects/parent/child"}
                        ],
                    },
                ]
            }
        ),
        encoding="utf-8",
    )
    monkeypatch.setenv("ATOCORE_VAULT_SOURCE_DIR", str(vault_dir))
    monkeypatch.setenv("ATOCORE_DRIVE_SOURCE_DIR", str(drive_dir))
    monkeypatch.setenv("ATOCORE_PROJECT_REGISTRY_PATH", str(registry_path))
    original_settings = config.settings
    try:
        config.settings = config.Settings()
        try:
            list_registered_projects()
        except ValueError as exc:
            assert "ingest root overlap" in str(exc)
        else:
            raise AssertionError("Expected overlapping ingest roots to raise")
    finally:
        config.settings = original_settings
 def test_refresh_registered_project_ingests_registered_roots(tmp_path, monkeypatch):
    vault_dir = tmp_path / "vault"
    drive_dir = tmp_path / "drive"
@@ -133,8 +226,8 @@ def test_refresh_registered_project_ingests_registered_roots(tmp_path, monkeypat
    calls = []
-    def fake_ingest_folder(path, purge_deleted=True):
+    def fake_ingest_folder(path, purge_deleted=True, project_id=""):
-        calls.append((str(path), purge_deleted))
+        calls.append((str(path), purge_deleted, project_id))
        return [{"file": str(path / "README.md"), "status": "ingested"}]
    monkeypatch.setenv("ATOCORE_VAULT_SOURCE_DIR", str(vault_dir))
@@ -144,7 +237,7 @@ def test_refresh_registered_project_ingests_registered_roots(tmp_path, monkeypat
    original_settings = config.settings
    try:
        config.settings = config.Settings()
-        monkeypatch.setattr("atocore.projects.registry.ingest_folder", fake_ingest_folder)
+        monkeypatch.setattr("atocore.ingestion.pipeline.ingest_project_folder", fake_ingest_folder)
        result = refresh_registered_project("polisher")
    finally:
        config.settings = original_settings
@@ -153,6 +246,7 @@ def test_refresh_registered_project_ingests_registered_roots(tmp_path, monkeypat
    assert len(calls) == 1
    assert calls[0][0].endswith("p06-polisher")
    assert calls[0][1] is False
    assert calls[0][2] == "p06-polisher"
    assert result["roots"][0]["status"] == "ingested"
    assert result["status"] == "ingested"
    assert result["roots_ingested"] == 1
@@ -188,7 +282,7 @@ def test_refresh_registered_project_reports_nothing_to_ingest_when_all_missing(
        encoding="utf-8",
    )
-    def fail_ingest_folder(path, purge_deleted=True):
+    def fail_ingest_folder(path, purge_deleted=True, project_id=""):
        raise AssertionError(f"ingest_folder should not be called for missing root: {path}")
    monkeypatch.setenv("ATOCORE_VAULT_SOURCE_DIR", str(vault_dir))
@@ -198,7 +292,7 @@ def test_refresh_registered_project_reports_nothing_to_ingest_when_all_missing(
    original_settings = config.settings
    try:
        config.settings = config.Settings()
-        monkeypatch.setattr("atocore.projects.registry.ingest_folder", fail_ingest_folder)
+        monkeypatch.setattr("atocore.ingestion.pipeline.ingest_project_folder", fail_ingest_folder)
        result = refresh_registered_project("ghost")
    finally:
        config.settings = original_settings
@@ -238,7 +332,7 @@ def test_refresh_registered_project_reports_partial_status(tmp_path, monkeypatch
        encoding="utf-8",
    )
-    def fake_ingest_folder(path, purge_deleted=True):
+    def fake_ingest_folder(path, purge_deleted=True, project_id=""):
        return [{"file": str(path / "README.md"), "status": "ingested"}]
    monkeypatch.setenv("ATOCORE_VAULT_SOURCE_DIR", str(vault_dir))
@@ -248,7 +342,7 @@ def test_refresh_registered_project_reports_partial_status(tmp_path, monkeypatch
    original_settings = config.settings
    try:
        config.settings = config.Settings()
-        monkeypatch.setattr("atocore.projects.registry.ingest_folder", fake_ingest_folder)
+        monkeypatch.setattr("atocore.ingestion.pipeline.ingest_project_folder", fake_ingest_folder)
        result = refresh_registered_project("mixed")
    finally:
        config.settings = original_settings
--- a/tests/test_retrieval.py
+++ b/tests/test_retrieval.py
@@ -70,8 +70,28 @@ def test_retrieve_skips_stale_vector_entries(tmp_data_dir, sample_markdown, monk
 def test_retrieve_project_hint_boosts_matching_chunks(monkeypatch):
    target_project = type(
        "Project",
        (),
        {
            "project_id": "p04-gigabit",
            "aliases": ("p04", "gigabit"),
            "ingest_roots": (),
        },
    )()
    other_project = type(
        "Project",
        (),
        {
            "project_id": "p05-interferometer",
            "aliases": ("p05", "interferometer"),
            "ingest_roots": (),
        },
    )()
    class FakeStore:
        def query(self, query_embedding, top_k=10, where=None):
            assert top_k == 8
            return {
                "ids": [["chunk-a", "chunk-b"]],
                "documents": [["project doc", "other doc"]],
@@ -102,7 +122,21 @@ def test_retrieve_project_hint_boosts_matching_chunks(monkeypatch):
    )
    monkeypatch.setattr(
        "atocore.retrieval.retriever.get_registered_project",
-        lambda project_name: type(
+        lambda project_name: target_project,
    )
    monkeypatch.setattr(
        "atocore.retrieval.retriever.load_project_registry",
        lambda: [target_project, other_project],
    )
    results = retrieve("mirror architecture", top_k=2, project_hint="p04")
    assert len(results) == 1
    assert results[0].chunk_id == "chunk-a"
 def test_retrieve_project_scope_allows_unowned_global_chunks(monkeypatch):
    target_project = type(
        "Project",
        (),
        {
@@ -110,14 +144,479 @@ def test_retrieve_project_hint_boosts_matching_chunks(monkeypatch):
            "aliases": ("p04", "gigabit"),
            "ingest_roots": (),
        },
-        )(),
+    )()
    class FakeStore:
        def query(self, query_embedding, top_k=10, where=None):
            return {
                "ids": [["chunk-a", "chunk-global"]],
                "documents": [["project doc", "global doc"]],
                "metadatas": [[
                    {
                        "heading_path": "Overview",
                        "source_file": "p04-gigabit/pkm/_index.md",
                        "tags": '["p04-gigabit"]',
                        "title": "P04",
                        "document_id": "doc-a",
                    },
                    {
                        "heading_path": "Overview",
                        "source_file": "shared/engineering-rules.md",
                        "tags": "[]",
                        "title": "Shared engineering rules",
                        "document_id": "doc-global",
                    },
                ]],
                "distances": [[0.2, 0.21]],
            }
    monkeypatch.setattr("atocore.retrieval.retriever.get_vector_store", lambda: FakeStore())
    monkeypatch.setattr("atocore.retrieval.retriever.embed_query", lambda query: [0.0, 0.1])
    monkeypatch.setattr(
        "atocore.retrieval.retriever._existing_chunk_ids",
        lambda chunk_ids: set(chunk_ids),
    )
    monkeypatch.setattr(
        "atocore.retrieval.retriever.get_registered_project",
        lambda project_name: target_project,
    )
    monkeypatch.setattr(
        "atocore.retrieval.retriever.load_project_registry",
        lambda: [target_project],
    )
    results = retrieve("mirror architecture", top_k=2, project_hint="p04")
-    assert len(results) == 2
-    assert results[0].chunk_id == "chunk-a"
-    assert results[0].score > results[1].score
+    assert [r.chunk_id for r in results] == ["chunk-a", "chunk-global"]
+
+
 def test_retrieve_project_scope_filter_can_be_disabled(monkeypatch):
    target_project = type(
        "Project",
        (),
        {
            "project_id": "p04-gigabit",
            "aliases": ("p04", "gigabit"),
            "ingest_roots": (),
        },
    )()
    other_project = type(
        "Project",
        (),
        {
            "project_id": "p05-interferometer",
            "aliases": ("p05", "interferometer"),
            "ingest_roots": (),
        },
    )()
    class FakeStore:
        def query(self, query_embedding, top_k=10, where=None):
            assert top_k == 2
            return {
                "ids": [["chunk-a", "chunk-b"]],
                "documents": [["project doc", "other project doc"]],
                "metadatas": [[
                    {
                        "heading_path": "Overview",
                        "source_file": "p04-gigabit/pkm/_index.md",
                        "tags": '["p04-gigabit"]',
                        "title": "P04",
                        "document_id": "doc-a",
                    },
                    {
                        "heading_path": "Overview",
                        "source_file": "p05-interferometer/pkm/_index.md",
                        "tags": '["p05-interferometer"]',
                        "title": "P05",
                        "document_id": "doc-b",
                    },
                ]],
                "distances": [[0.2, 0.2]],
            }
    monkeypatch.setattr("atocore.config.settings.rank_project_scope_filter", False)
    monkeypatch.setattr("atocore.retrieval.retriever.get_vector_store", lambda: FakeStore())
    monkeypatch.setattr("atocore.retrieval.retriever.embed_query", lambda query: [0.0, 0.1])
    monkeypatch.setattr(
        "atocore.retrieval.retriever._existing_chunk_ids",
        lambda chunk_ids: set(chunk_ids),
    )
    monkeypatch.setattr(
        "atocore.retrieval.retriever.get_registered_project",
        lambda project_name: target_project,
    )
    monkeypatch.setattr(
        "atocore.retrieval.retriever.load_project_registry",
        lambda: [target_project, other_project],
    )
    results = retrieve("mirror architecture", top_k=2, project_hint="p04")
    assert {r.chunk_id for r in results} == {"chunk-a", "chunk-b"}
 def test_retrieve_project_scope_ignores_title_for_ownership(monkeypatch):
    target_project = type(
        "Project",
        (),
        {
            "project_id": "p04-gigabit",
            "aliases": ("p04", "gigabit"),
            "ingest_roots": (),
        },
    )()
    other_project = type(
        "Project",
        (),
        {
            "project_id": "p06-polisher",
            "aliases": ("p06", "polisher", "p11"),
            "ingest_roots": (),
        },
    )()
    class FakeStore:
        def query(self, query_embedding, top_k=10, where=None):
            return {
                "ids": [["chunk-target", "chunk-poisoned-title"]],
                "documents": [["p04 doc", "p06 doc"]],
                "metadatas": [[
                    {
                        "heading_path": "Overview",
                        "source_file": "p04-gigabit/pkm/_index.md",
                        "tags": '["p04-gigabit"]',
                        "title": "P04",
                        "document_id": "doc-a",
                    },
                    {
                        "heading_path": "Overview",
                        "source_file": "p06-polisher/pkm/architecture.md",
                        "tags": '["p06-polisher"]',
                        "title": "GigaBIT M1 mirror lessons",
                        "document_id": "doc-b",
                    },
                ]],
                "distances": [[0.2, 0.19]],
            }
    monkeypatch.setattr("atocore.retrieval.retriever.get_vector_store", lambda: FakeStore())
    monkeypatch.setattr("atocore.retrieval.retriever.embed_query", lambda query: [0.0, 0.1])
    monkeypatch.setattr(
        "atocore.retrieval.retriever._existing_chunk_ids",
        lambda chunk_ids: set(chunk_ids),
    )
    monkeypatch.setattr(
        "atocore.retrieval.retriever.get_registered_project",
        lambda project_name: target_project,
    )
    monkeypatch.setattr(
        "atocore.retrieval.retriever.load_project_registry",
        lambda: [target_project, other_project],
    )
    results = retrieve("mirror architecture", top_k=2, project_hint="p04")
    assert [r.chunk_id for r in results] == ["chunk-target"]
 def test_retrieve_project_scope_uses_path_segments_not_substrings(monkeypatch):
    target_project = type(
        "Project",
        (),
        {
            "project_id": "p05-interferometer",
            "aliases": ("p05", "interferometer"),
            "ingest_roots": (),
        },
    )()
    abb_project = type(
        "Project",
        (),
        {
            "project_id": "abb-space",
            "aliases": ("abb",),
            "ingest_roots": (),
        },
    )()
    class FakeStore:
        def query(self, query_embedding, top_k=10, where=None):
            return {
                "ids": [["chunk-target", "chunk-global"]],
                "documents": [["p05 doc", "global doc"]],
                "metadatas": [[
                    {
                        "heading_path": "Overview",
                        "source_file": "p05-interferometer/pkm/_index.md",
                        "tags": '["p05-interferometer"]',
                        "title": "P05",
                        "document_id": "doc-a",
                    },
                    {
                        "heading_path": "Abbreviation notes",
                        "source_file": "shared/cabbage-abbreviations.md",
                        "tags": "[]",
                        "title": "ABB-style abbreviations",
                        "document_id": "doc-global",
                    },
                ]],
                "distances": [[0.2, 0.21]],
            }
    monkeypatch.setattr("atocore.retrieval.retriever.get_vector_store", lambda: FakeStore())
    monkeypatch.setattr("atocore.retrieval.retriever.embed_query", lambda query: [0.0, 0.1])
    monkeypatch.setattr(
        "atocore.retrieval.retriever._existing_chunk_ids",
        lambda chunk_ids: set(chunk_ids),
    )
    monkeypatch.setattr(
        "atocore.retrieval.retriever.get_registered_project",
        lambda project_name: target_project,
    )
    monkeypatch.setattr(
        "atocore.retrieval.retriever.load_project_registry",
        lambda: [target_project, abb_project],
    )
    results = retrieve("abbreviations", top_k=2, project_hint="p05")
    assert [r.chunk_id for r in results] == ["chunk-target", "chunk-global"]
 def test_retrieve_project_scope_prefers_exact_project_id(monkeypatch):
    target_project = type(
        "Project",
        (),
        {
            "project_id": "p04-gigabit",
            "aliases": ("p04", "gigabit"),
            "ingest_roots": (),
        },
    )()
    other_project = type(
        "Project",
        (),
        {
            "project_id": "p06-polisher",
            "aliases": ("p06", "polisher"),
            "ingest_roots": (),
        },
    )()
    class FakeStore:
        def query(self, query_embedding, top_k=10, where=None):
            return {
                "ids": [["chunk-target", "chunk-other", "chunk-global"]],
                "documents": [["target doc", "other doc", "global doc"]],
                "metadatas": [[
                    {
                        "heading_path": "Overview",
                        "source_file": "legacy/unhelpful-path.md",
                        "tags": "[]",
                        "title": "Target",
                        "project_id": "p04-gigabit",
                        "document_id": "doc-a",
                    },
                    {
                        "heading_path": "Overview",
                        "source_file": "p04-gigabit/title-poisoned.md",
                        "tags": '["p04-gigabit"]',
                        "title": "Looks target-owned but is explicit p06",
                        "project_id": "p06-polisher",
                        "document_id": "doc-b",
                    },
                    {
                        "heading_path": "Overview",
                        "source_file": "shared/global.md",
                        "tags": "[]",
                        "title": "Shared",
                        "project_id": "",
                        "document_id": "doc-global",
                    },
                ]],
                "distances": [[0.2, 0.19, 0.21]],
            }
    monkeypatch.setattr("atocore.retrieval.retriever.get_vector_store", lambda: FakeStore())
    monkeypatch.setattr("atocore.retrieval.retriever.embed_query", lambda query: [0.0, 0.1])
    monkeypatch.setattr(
        "atocore.retrieval.retriever._existing_chunk_ids",
        lambda chunk_ids: set(chunk_ids),
    )
    monkeypatch.setattr(
        "atocore.retrieval.retriever.get_registered_project",
        lambda project_name: target_project,
    )
    monkeypatch.setattr(
        "atocore.retrieval.retriever.load_project_registry",
        lambda: [target_project, other_project],
    )
    results = retrieve("mirror architecture", top_k=3, project_hint="p04")
    assert [r.chunk_id for r in results] == ["chunk-target", "chunk-global"]
 def test_retrieve_empty_project_id_falls_back_to_path_ownership(monkeypatch):
    target_project = type(
        "Project",
        (),
        {
            "project_id": "p04-gigabit",
            "aliases": ("p04", "gigabit"),
            "ingest_roots": (),
        },
    )()
    other_project = type(
        "Project",
        (),
        {
            "project_id": "p05-interferometer",
            "aliases": ("p05", "interferometer"),
            "ingest_roots": (),
        },
    )()
    class FakeStore:
        def query(self, query_embedding, top_k=10, where=None):
            return {
                "ids": [["chunk-target", "chunk-other"]],
                "documents": [["target doc", "other doc"]],
                "metadatas": [[
                    {
                        "heading_path": "Overview",
                        "source_file": "p04-gigabit/status.md",
                        "tags": "[]",
                        "title": "Target",
                        "project_id": "",
                        "document_id": "doc-a",
                    },
                    {
                        "heading_path": "Overview",
                        "source_file": "p05-interferometer/status.md",
                        "tags": "[]",
                        "title": "Other",
                        "project_id": "",
                        "document_id": "doc-b",
                    },
                ]],
                "distances": [[0.2, 0.19]],
            }
    monkeypatch.setattr("atocore.retrieval.retriever.get_vector_store", lambda: FakeStore())
    monkeypatch.setattr("atocore.retrieval.retriever.embed_query", lambda query: [0.0, 0.1])
    monkeypatch.setattr(
        "atocore.retrieval.retriever._existing_chunk_ids",
        lambda chunk_ids: set(chunk_ids),
    )
    monkeypatch.setattr(
        "atocore.retrieval.retriever.get_registered_project",
        lambda project_name: target_project,
    )
    monkeypatch.setattr(
        "atocore.retrieval.retriever.load_project_registry",
        lambda: [target_project, other_project],
    )
    results = retrieve("mirror architecture", top_k=2, project_hint="p04")
    assert [r.chunk_id for r in results] == ["chunk-target"]
 def test_retrieve_unknown_project_hint_does_not_widen_or_filter(monkeypatch):
    class FakeStore:
        def query(self, query_embedding, top_k=10, where=None):
            assert top_k == 2
            return {
                "ids": [["chunk-a", "chunk-b"]],
                "documents": [["doc a", "doc b"]],
                "metadatas": [[
                    {
                        "heading_path": "Overview",
                        "source_file": "project-a/file.md",
                        "tags": "[]",
                        "title": "A",
                        "document_id": "doc-a",
                    },
                    {
                        "heading_path": "Overview",
                        "source_file": "project-b/file.md",
                        "tags": "[]",
                        "title": "B",
                        "document_id": "doc-b",
                    },
                ]],
                "distances": [[0.2, 0.21]],
            }
    monkeypatch.setattr("atocore.retrieval.retriever.get_vector_store", lambda: FakeStore())
    monkeypatch.setattr("atocore.retrieval.retriever.embed_query", lambda query: [0.0, 0.1])
    monkeypatch.setattr(
        "atocore.retrieval.retriever._existing_chunk_ids",
        lambda chunk_ids: set(chunk_ids),
    )
    monkeypatch.setattr(
        "atocore.retrieval.retriever.get_registered_project",
        lambda project_name: None,
    )
    results = retrieve("overview", top_k=2, project_hint="unknown-project")
    assert [r.chunk_id for r in results] == ["chunk-a", "chunk-b"]
 def test_retrieve_fails_open_when_project_scope_resolution_fails(monkeypatch):
    warnings = []
    class FakeStore:
        def query(self, query_embedding, top_k=10, where=None):
            assert top_k == 2
            return {
                "ids": [["chunk-a", "chunk-b"]],
                "documents": [["doc a", "doc b"]],
                "metadatas": [[
                    {
                        "heading_path": "Overview",
                        "source_file": "p04-gigabit/file.md",
                        "tags": "[]",
                        "title": "A",
                        "document_id": "doc-a",
                    },
                    {
                        "heading_path": "Overview",
                        "source_file": "p05-interferometer/file.md",
                        "tags": "[]",
                        "title": "B",
                        "document_id": "doc-b",
                    },
                ]],
                "distances": [[0.2, 0.21]],
            }
    monkeypatch.setattr("atocore.retrieval.retriever.get_vector_store", lambda: FakeStore())
    monkeypatch.setattr("atocore.retrieval.retriever.embed_query", lambda query: [0.0, 0.1])
    monkeypatch.setattr(
        "atocore.retrieval.retriever._existing_chunk_ids",
        lambda chunk_ids: set(chunk_ids),
    )
    monkeypatch.setattr(
        "atocore.retrieval.retriever.get_registered_project",
        lambda project_name: (_ for _ in ()).throw(ValueError("registry overlap")),
    )
    monkeypatch.setattr(
        "atocore.retrieval.retriever.log.warning",
        lambda event, **kwargs: warnings.append((event, kwargs)),
    )
    results = retrieve("overview", top_k=2, project_hint="p04")
    assert [r.chunk_id for r in results] == ["chunk-a", "chunk-b"]
    assert {warning[0] for warning in warnings} == {
        "project_scope_resolution_failed",
        "project_match_boost_resolution_failed",
    }
    assert all("registry overlap" in warning[1]["error"] for warning in warnings)
 def test_retrieve_downranks_archive_noise_and_prefers_high_signal_paths(monkeypatch):
Author	SHA1	Message	Date
Anto01	05c11fd4fb	fix(retrieval): fail open on registry resolution errors	2026-04-24 11:32:46 -04:00
Anto01	ce6ffdbb63	fix(retrieval): preserve project ids across unscoped ingest	2026-04-24 11:22:13 -04:00
Anto01	c03022d864	feat(retrieval): persist explicit chunk project ids	2026-04-24 11:02:30 -04:00
Anto01	f44a211497	merge: project-scoped retrieval audit improvements	2026-04-24 10:47:15 -04:00
Anto01	c7212900b0	fix(retrieval): enforce project-scoped context boundaries	2026-04-24 10:46:56 -04:00
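
The retrieval tests in this diff imply an ownership rule for project-scoped queries. An explicit `project_id` in chunk metadata decides ownership. An empty or missing `project_id` falls back to path and tag matching against registered projects, and a chunk owned by no project is treated as global and stays in scope. The sketch below illustrates that rule only. It is not the code in `atocore.retrieval.retriever`: the names `Project`, `owner_of`, and `in_scope` are hypothetical, and matching on whole path segments and whole tags is an assumption inferred from the test fixtures.

```python
import json
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Project:
    """Hypothetical stand-in for a registered project entry."""
    project_id: str
    aliases: tuple = ()


def _names(project):
    # Canonical id plus aliases, lowercased for comparison.
    return {project.project_id.lower(), *(a.lower() for a in project.aliases)}


def owner_of(metadata, registry):
    """Return the owning project_id, or "" when the chunk is unowned (global)."""
    explicit = (metadata.get("project_id") or "").strip()
    if explicit:
        # Explicit metadata wins over anything the path or tags suggest.
        return explicit

    # Fallback: compare whole path segments, never substrings, and never the title.
    segments = {part.lower() for part in PurePosixPath(metadata.get("source_file") or "").parts}
    try:
        raw_tags = json.loads(metadata.get("tags") or "[]")
    except (TypeError, ValueError):
        raw_tags = []
    tags = {str(t).lower() for t in raw_tags} if isinstance(raw_tags, list) else set()

    for project in registry:
        if _names(project) & (segments | tags):
            return project.project_id
    return ""


def in_scope(metadata, target, registry):
    """Keep chunks owned by the target project and chunks owned by nobody."""
    owner = owner_of(metadata, registry)
    return owner == "" or owner == target.project_id
```

Under these assumptions, a chunk at `p04-gigabit/title-poisoned.md` that carries `project_id="p06-polisher"` is out of scope for a `p04` query, while `shared/cabbage-abbreviations.md` stays global even though the alias `abb` appears inside one of its path segments.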