feat(retrieval): persist explicit chunk project ids

This commit is contained in:
2026-04-24 11:02:30 -04:00
parent f44a211497
commit c03022d864
12 changed files with 332 additions and 24 deletions

View File

@@ -6,15 +6,15 @@
## Orientation
- **live_sha** (Dalidou `/health` build_sha): `f44a211` (verified 2026-04-24T14:48:44Z post audit-improvements deploy; status=ok)
- **last_updated**: 2026-04-24 by Codex (retrieval boundary deployed; project_id metadata branch started)
- **main_tip**: `f44a211`
- **test_count**: 556 on `codex/project-id-metadata-retrieval` (deployed main baseline: 553)
- **harness**: `19/20 PASS` on live Dalidou, 0 blocking failures, 1 known content gap (`p04-constraints`)
- **vectors**: 33,253
- **active_memories**: 290 (`/admin/dashboard` 2026-04-24; note integrity panel reports a separate active_memory_count=951 and needs reconciliation)
- **candidate_memories**: 0 (triage queue drained)
- **interactions**: 951 (`/admin/dashboard` 2026-04-24)
- **registered_projects**: atocore, p04-gigabit, p05-interferometer, p06-polisher, atomizer-v2, abb-space (aliased p08)
- **project_state_entries**: 128 across registered projects (`/admin/dashboard` 2026-04-24)
- **entities**: 66 (up from 35 — V1-0 backfill + ongoing work; 0 open conflicts)
@@ -170,6 +170,8 @@ One branch `codex/extractor-eval-loop` for Day 1-5, a second `codex/retrieval-ha
## Session Log
- **2026-04-24 Codex (retrieval boundary deployed + project_id metadata tranche)** Merged `codex/audit-improvements-foundation` to `main` as `f44a211` and pushed to Dalidou Gitea. Took pre-deploy runtime backup `/srv/storage/atocore/backups/snapshots/20260424T144810Z` (DB + registry, no Chroma). Deployed via `papa@dalidou` canonical `deploy/dalidou/deploy.sh`; live `/health` reports build_sha `f44a2114970008a7eec4e7fc2860c8f072914e38`, build_time `2026-04-24T14:48:44Z`, status ok. Post-deploy retrieval harness: 20 fixtures, 19 pass, 0 blocking failures, 1 known issue (`p04-constraints`). The former blocker `p05-broad-status-no-atomizer` now passes. Manual p05 `context-build "current status"` spot check shows no p04/Atomizer source bleed in retrieved chunks. Started follow-up branch `codex/project-id-metadata-retrieval`: registered-project ingestion now writes explicit `project_id` into DB chunk metadata and Chroma vector metadata; retrieval prefers exact `project_id` when present and keeps path/tag matching as legacy fallback; added dry-run-by-default `scripts/backfill_chunk_project_ids.py` to backfill SQLite + Chroma metadata; added tests for project-id ingestion, registered refresh propagation, exact project-id retrieval, and collision fallback. Verified targeted suite (`test_ingestion.py`, `test_project_registry.py`, `test_retrieval.py`): 36 passed. Verified full suite: 556 passed in 72.44s. Branch not merged or deployed yet.
- **2026-04-24 Codex (audit improvements foundation)** Started implementation of the audit recommendations on branch `codex/audit-improvements-foundation` from `origin/main@c53e61e`. First tranche: registry-aware project-scoped retrieval filtering (`ATOCORE_RANK_PROJECT_SCOPE_FILTER`, widened candidate pull before filtering), eval harness known-issue lane, two p05 project-bleed fixtures, `scripts/live_status.py`, README/current-state/master-plan status refresh. Verified `pytest -q`: 550 passed in 67.11s. Live retrieval harness against undeployed production: 20 fixtures, 18 pass, 1 known issue (`p04-constraints` Zerodur/1.2 content gap), 1 blocking guard (`p05-broad-status-no-atomizer`) still failing because production has not yet deployed the retrieval filter and currently pulls `P04-GigaBIT-M1-KB-design` into broad p05 status context. Live dashboard refresh: health ok, build `2b86543`, docs 1748, chunks/vectors 33253, interactions 948, active memories 289, candidates 0, project_state total 128. Noted count discrepancy: dashboard memories.active=289 while integrity active_memory_count=951; schedule reconciliation in a follow-up.
- **2026-04-24 Codex (independent-audit hardening)** Applied the Opus independent audit's fast follow-ups before merge/deploy. Closed the two P1s by making project-scope ownership path/tag-based only, adding path-segment/tag-exact matching to avoid short-alias substring collisions, and keeping title/heading text out of provenance decisions. Added regression tests for title poisoning, substring collision, and unknown-project fallback. Added retrieval log fields `raw_results_count`, `post_filter_count`, `post_filter_dropped`, and `underfilled`. Added retrieval-eval run metadata (`generated_at`, `base_url`, `/health`) and `live_status.py` auth-token/status support. README now documents the ranking knobs and clarifies that the hard scope filter and soft project match boost are separate controls. Verified `pytest -q`: 553 passed in 66.07s. Live production remains expected-predeploy: 20 fixtures, 18 pass, 1 known content gap, 1 blocking p05 bleed guard. Latest live dashboard: build `2b86543`, docs 1748, chunks/vectors 33253, interactions 950, active memories 290, candidates 0, project_state total 128.

View File

@@ -111,6 +111,7 @@ pytest
- `scripts/atocore_client.py` provides a live API client for project refresh, project-state inspection, and retrieval-quality audits.
- `scripts/retrieval_eval.py` runs the live retrieval/context harness, separates blocking failures from known content gaps, and stamps JSON output with target/build metadata.
- `scripts/live_status.py` renders a compact read-only status report from `/health`, `/stats`, `/projects`, and `/admin/dashboard`; set `ATOCORE_AUTH_TOKEN` or `--auth-token` when those endpoints are gated.
- `scripts/backfill_chunk_project_ids.py` dry-runs or applies explicit `project_id` metadata backfills for SQLite chunks and Chroma vectors.
- `docs/operations.md` captures the current operational priority order: retrieval quality, Wave 2 trusted-operational ingestion, AtoDrive scoping, and restore validation.
- `DEV-LEDGER.md` is the fast-moving source of operational truth during active development; copy claims into docs only after checking the live service.

View File

@@ -1,5 +1,9 @@
# AtoCore - Current State (2026-04-24)
Update 2026-04-24: audit-improvements deployed as `f44a211`; live harness is
19/20 with 0 blocking failures and 1 known content gap. Active follow-up branch
`codex/project-id-metadata-retrieval` is at 556 passing tests.
Live deploy: `2b86543` · Dalidou health: ok · Harness: 18/20 with 1 known
content gap and 1 current blocking project-bleed guard · Tests: 553 passing.
@@ -68,7 +72,7 @@ Last nightly run (2026-04-19 03:00 UTC): **31 promoted · 39 rejected · 0 needs
## Known gaps (honest, refreshed 2026-04-24)
1. **Capture surface is Claude-Code-and-OpenClaw only.** Conversations in Claude Desktop, Claude.ai web, phone, or any other LLM UI are NOT captured. Example: the rotovap/mushroom chat yesterday never reached AtoCore because no hook fired. See Q4 below.
2. **Project-scoped retrieval guard is deployed and passing.** The April 24 p05 broad-status bleed guard now passes on live Dalidou. The active follow-up branch adds explicit `project_id` chunk/vector metadata so the deployed path/tag heuristic can become a legacy fallback.
3. **Human interface is useful but not yet the V1 Human Mirror.** Wiki/dashboard pages exist, but the spec routes, deterministic mirror files, disputed markers, and curated annotations remain V1-D work.
4. **Harness known issue:** `p04-constraints` wants "Zerodur" and "1.2"; live retrieval surfaces related constraints but not those exact strings. Treat as content/state gap until fixed.
5. **Formal docs lag the ledger during fast work.** Use `DEV-LEDGER.md` and `python scripts/live_status.py` for live truth, then copy verified claims into these docs.

View File

@@ -135,7 +135,7 @@ deferred from the shared client until their workflows are exercised.
- canonical AtoCore runtime on Dalidou (`2b86543`, deploy.sh verified)
- 33,253 vectors across 6 registered projects
- 951 captured interactions as of the 2026-04-24 live dashboard; refresh exact live counts with `python scripts/live_status.py`
- 6 registered projects:
@@ -150,10 +150,9 @@ deferred from the shared client until their workflows are exercised.
  dashboard
- context pack assembly with 4 tiers: Trusted Project State > identity/preference > project memories > retrieved chunks
- query-relevance memory ranking with overlap-density scoring
- retrieval eval harness: 20 fixtures; current live has 19 pass, 1 known content gap and 0 blocking failures after the audit-improvements deploy
- 556 tests passing on the active `codex/project-id-metadata-retrieval` branch
- nightly pipeline: backup → cleanup → rsync → OpenClaw import → vault refresh → extract → triage → **auto-promote/expire** → weekly synth/lint → **retrieval harness** → **pipeline summary to project state**
- Phase 10 operational: reinforcement-based auto-promotion (ref_count ≥ 3, confidence ≥ 0.7) + stale candidate expiry (14 days unreinforced)
- pipeline health visible in dashboard: interaction totals by client, pipeline last_run, harness results, triage stats

View File

@@ -0,0 +1,145 @@
"""Backfill explicit project_id into chunk and vector metadata.
Dry-run by default. The script derives ownership from the registered project
ingest roots and updates both SQLite source_chunks.metadata and Chroma vector
metadata only when --apply is provided.
"""
from __future__ import annotations
import argparse
import json
import sys
from pathlib import Path
sys.path.insert(0, str(Path(__file__).resolve().parents[1] / "src"))
from atocore.models.database import get_connection # noqa: E402
from atocore.projects.registry import list_registered_projects # noqa: E402
from atocore.retrieval.vector_store import get_vector_store # noqa: E402
def _load_project_roots() -> list[tuple[str, Path]]:
roots: list[tuple[str, Path]] = []
for project in list_registered_projects():
project_id = project["id"]
for root in project.get("ingest_roots", []):
root_path = root.get("path")
if root_path:
roots.append((project_id, Path(root_path).resolve(strict=False)))
roots.sort(key=lambda item: len(str(item[1])), reverse=True)
return roots
def _derive_project_id(file_path: str, roots: list[tuple[str, Path]]) -> str:
if not file_path:
return ""
doc_path = Path(file_path).resolve(strict=False)
for project_id, root_path in roots:
try:
doc_path.relative_to(root_path)
except ValueError:
continue
return project_id
return ""
def _decode_metadata(raw: str | None) -> dict:
if not raw:
return {}
try:
parsed = json.loads(raw)
except json.JSONDecodeError:
return {}
return parsed if isinstance(parsed, dict) else {}
def _chunk_rows() -> list[dict]:
with get_connection() as conn:
rows = conn.execute(
"""
SELECT
sc.id AS chunk_id,
sc.metadata AS chunk_metadata,
sd.file_path AS file_path
FROM source_chunks sc
JOIN source_documents sd ON sd.id = sc.document_id
ORDER BY sd.file_path, sc.chunk_index
"""
).fetchall()
return [dict(row) for row in rows]
def backfill(apply: bool = False, project_filter: str = "") -> dict:
roots = _load_project_roots()
rows = _chunk_rows()
updates: list[tuple[str, str, dict]] = []
by_project: dict[str, int] = {}
skipped_unowned = 0
for row in rows:
project_id = _derive_project_id(row["file_path"], roots)
if project_filter and project_id != project_filter:
continue
if not project_id:
skipped_unowned += 1
continue
metadata = _decode_metadata(row["chunk_metadata"])
if metadata.get("project_id") == project_id:
continue
metadata["project_id"] = project_id
updates.append((row["chunk_id"], project_id, metadata))
by_project[project_id] = by_project.get(project_id, 0) + 1
if apply and updates:
vector_store = get_vector_store()
chunk_ids = [chunk_id for chunk_id, _, _ in updates]
vector_payload = vector_store.get_metadatas(chunk_ids)
existing_vector_metadata = {
chunk_id: metadata or {}
for chunk_id, metadata in zip(
vector_payload.get("ids", []),
vector_payload.get("metadatas", []),
strict=False,
)
}
vector_metadatas = []
for chunk_id, project_id, chunk_metadata in updates:
vector_metadata = dict(existing_vector_metadata.get(chunk_id) or {})
if not vector_metadata:
vector_metadata = dict(chunk_metadata)
vector_metadata["project_id"] = project_id
vector_metadatas.append(vector_metadata)
with get_connection() as conn:
conn.executemany(
"UPDATE source_chunks SET metadata = ? WHERE id = ?",
[
(json.dumps(metadata, ensure_ascii=True), chunk_id)
for chunk_id, _, metadata in updates
],
)
vector_store.update_metadatas(chunk_ids, vector_metadatas)
return {
"apply": apply,
"total_chunks": len(rows),
"updates": len(updates),
"skipped_unowned": skipped_unowned,
"by_project": dict(sorted(by_project.items())),
}
def main() -> int:
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument("--apply", action="store_true", help="write SQLite and Chroma metadata updates")
parser.add_argument("--project", default="", help="optional canonical project_id filter")
args = parser.parse_args()
payload = backfill(apply=args.apply, project_filter=args.project.strip())
print(json.dumps(payload, indent=2, ensure_ascii=True))
return 0
if __name__ == "__main__":
raise SystemExit(main())
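For orientation, the reason `_load_project_roots` sorts roots longest-path-first: when one registered ingest root nests inside another, the nested root must claim its files before the parent does. A self-contained sketch with hypothetical paths (`derive` is an illustrative stand-in for `_derive_project_id`, not the real helper):

```python
from pathlib import Path

# Hypothetical ingest roots: one root nests inside the other.
roots = [
    ("atocore", Path("/vault")),
    ("p04-gigabit", Path("/vault/p04-gigabit")),
]
# Longest path first, mirroring _load_project_roots' sort.
roots.sort(key=lambda item: len(str(item[1])), reverse=True)

def derive(file_path: str) -> str:
    doc = Path(file_path)
    for project_id, root in roots:
        try:
            doc.relative_to(root)
        except ValueError:
            continue
        return project_id
    return ""

print(derive("/vault/p04-gigabit/notes/design.md"))  # nested root wins: p04-gigabit
print(derive("/vault/shared/overview.md"))           # falls through to the parent: atocore
print(derive("/tmp/elsewhere.md"))                   # unowned -> ""
```

Without the descending-length sort, `/vault` would match first and every nested p04 file would be tagged `atocore`.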

View File

@@ -32,10 +32,11 @@ def exclusive_ingestion():
        _INGESTION_LOCK.release()


def ingest_file(file_path: Path, project_id: str = "") -> dict:
    """Ingest a single markdown file. Returns stats."""
    start = time.time()
    file_path = file_path.resolve()
    project_id = (project_id or "").strip()
    if not file_path.exists():
        raise FileNotFoundError(f"File not found: {file_path}")
@@ -65,6 +66,7 @@ def ingest_file(file_path: Path) -> dict:
        "source_file": str(file_path),
        "tags": parsed.tags,
        "title": parsed.title,
        "project_id": project_id,
    }

    chunks = chunk_markdown(parsed.body, base_metadata=base_meta)
@@ -116,6 +118,7 @@ def ingest_file(file_path: Path) -> dict:
            "source_file": str(file_path),
            "tags": json.dumps(parsed.tags),
            "title": parsed.title,
            "project_id": project_id,
        })

        conn.execute(
@@ -173,7 +176,17 @@ def ingest_folder(folder_path: Path, purge_deleted: bool = True) -> list[dict]:
        purge_deleted: If True, remove DB/vector entries for files
            that no longer exist on disk.
    """
    return ingest_project_folder(folder_path, purge_deleted=purge_deleted, project_id="")


def ingest_project_folder(
    folder_path: Path,
    purge_deleted: bool = True,
    project_id: str = "",
) -> list[dict]:
    """Ingest a folder and annotate chunks with an optional project id."""
    folder_path = folder_path.resolve()
    project_id = (project_id or "").strip()
    if not folder_path.is_dir():
        raise NotADirectoryError(f"Not a directory: {folder_path}")
@@ -187,7 +200,7 @@ def ingest_folder(folder_path: Path, purge_deleted: bool = True) -> list[dict]:
    # Ingest new/changed files
    for md_file in md_files:
        try:
            result = ingest_file(md_file, project_id=project_id)
            results.append(result)
        except Exception as e:
            log.error("ingestion_error", file_path=str(md_file), error=str(e))

View File

@@ -8,7 +8,7 @@ from dataclasses import asdict, dataclass
from pathlib import Path

import atocore.config as _config

from atocore.ingestion.pipeline import ingest_project_folder


# Reserved pseudo-projects. `inbox` holds pre-project / lead / quote
@@ -346,7 +346,11 @@ def refresh_registered_project(project_name: str, purge_deleted: bool = False) -
            {
                **root_result,
                "status": "ingested",
                "results": ingest_project_folder(
                    resolved,
                    purge_deleted=purge_deleted,
                    project_id=project.project_id,
                ),
            }
        )
        ingested_count += 1

View File

@@ -209,6 +209,9 @@ def _is_allowed_for_project_scope(
def _metadata_matches_project(project: RegisteredProject, metadata: dict) -> bool:
    if "project_id" in metadata:
        return str(metadata.get("project_id", "")).strip().lower() == project.project_id.lower()
    path = _metadata_source_path(metadata)
    tags = _metadata_tags(metadata)
    for term in _project_scope_terms(project):
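The hunk above makes an explicit `project_id` key authoritative and demotes path/tag matching to a legacy fallback for chunks that predate the backfill. A simplified standalone sketch of that precedence (`matches_project` and `legacy_path_tag_match` are illustrative names, not the real helpers):

```python
def matches_project(metadata: dict, project_id: str, legacy_path_tag_match) -> bool:
    # Explicit metadata wins whenever the key exists, even when it names
    # another project or is empty (meaning "unowned/shared").
    if "project_id" in metadata:
        return str(metadata.get("project_id", "")).strip().lower() == project_id.lower()
    # Pre-backfill chunks fall back to the path/tag heuristic.
    return legacy_path_tag_match(metadata)

# Toy legacy heuristic for the demo: a path substring check.
legacy = lambda meta: "p04-gigabit" in meta.get("source_file", "")

print(matches_project({"project_id": "p04-gigabit"}, "p04-gigabit", legacy))        # True
print(matches_project({"project_id": "p06-polisher",
                       "source_file": "p04-gigabit/x.md"}, "p04-gigabit", legacy))  # False: explicit id wins
print(matches_project({"source_file": "p04-gigabit/x.md"}, "p04-gigabit", legacy))  # True: legacy fallback
```

The second case is exactly the title/path-poisoning scenario the new retrieval test guards against.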

View File

@@ -64,6 +64,18 @@ class VectorStore:
        self._collection.delete(ids=ids)
        log.debug("vectors_deleted", count=len(ids))

    def get_metadatas(self, ids: list[str]) -> dict:
        """Fetch vector metadata by chunk IDs."""
        if not ids:
            return {"ids": [], "metadatas": []}
        return self._collection.get(ids=ids, include=["metadatas"])

    def update_metadatas(self, ids: list[str], metadatas: list[dict]) -> None:
        """Update vector metadata without re-embedding documents."""
        if ids:
            self._collection.update(ids=ids, metadatas=metadatas)
        log.debug("vector_metadatas_updated", count=len(ids))

    @property
    def count(self) -> int:
        return self._collection.count()
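These two methods lean on the collection contract the docstring states: an update carrying only `metadatas` leaves stored embeddings and documents untouched, which is what lets the backfill run without re-embedding 33k vectors. A toy in-memory stand-in (not real Chroma) showing that contract:

```python
# Minimal fake collection, just to demonstrate the metadata-only update
# semantics that get_metadatas/update_metadatas rely on.
class FakeCollection:
    def __init__(self):
        self.embeddings = {}
        self.metadatas = {}

    def get(self, ids, include):
        return {"ids": ids, "metadatas": [self.metadatas.get(i) for i in ids]}

    def update(self, ids, metadatas):
        # Only metadata changes; embeddings stay untouched (no re-embed cost).
        for chunk_id, metadata in zip(ids, metadatas):
            self.metadatas[chunk_id] = metadata

coll = FakeCollection()
coll.embeddings["c1"] = [0.1, 0.2]
coll.metadatas["c1"] = {"source_file": "p04-gigabit/a.md"}

coll.update(ids=["c1"],
            metadatas=[{"source_file": "p04-gigabit/a.md", "project_id": "p04-gigabit"}])
print(coll.embeddings["c1"])               # unchanged: [0.1, 0.2]
print(coll.metadatas["c1"]["project_id"])  # p04-gigabit
```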

View File

@@ -1,8 +1,10 @@
"""Tests for the ingestion pipeline.""" """Tests for the ingestion pipeline."""
import json
from atocore.ingestion.parser import parse_markdown from atocore.ingestion.parser import parse_markdown
from atocore.models.database import get_connection, init_db from atocore.models.database import get_connection, init_db
from atocore.ingestion.pipeline import ingest_file, ingest_folder from atocore.ingestion.pipeline import ingest_file, ingest_folder, ingest_project_folder
def test_parse_markdown(sample_markdown): def test_parse_markdown(sample_markdown):
@@ -69,6 +71,54 @@ def test_ingest_updates_changed(tmp_data_dir, sample_markdown):
assert result["status"] == "ingested" assert result["status"] == "ingested"
def test_ingest_file_records_project_id_metadata(tmp_data_dir, sample_markdown, monkeypatch):
    """Project-aware ingestion should tag DB and vector metadata exactly."""
    init_db()

    class FakeVectorStore:
        def __init__(self):
            self.metadatas = []

        def add(self, ids, documents, metadatas):
            self.metadatas.extend(metadatas)

        def delete(self, ids):
            return None

    fake_store = FakeVectorStore()
    monkeypatch.setattr("atocore.ingestion.pipeline.get_vector_store", lambda: fake_store)

    result = ingest_file(sample_markdown, project_id="p04-gigabit")

    assert result["status"] == "ingested"
    assert fake_store.metadatas
    assert all(meta["project_id"] == "p04-gigabit" for meta in fake_store.metadatas)
    with get_connection() as conn:
        rows = conn.execute("SELECT metadata FROM source_chunks").fetchall()
    assert rows
    assert all(
        json.loads(row["metadata"])["project_id"] == "p04-gigabit"
        for row in rows
    )


def test_ingest_project_folder_passes_project_id_to_files(tmp_data_dir, sample_folder, monkeypatch):
    seen = []

    def fake_ingest_file(path, project_id=""):
        seen.append((path.name, project_id))
        return {"file": str(path), "status": "ingested"}

    monkeypatch.setattr("atocore.ingestion.pipeline.ingest_file", fake_ingest_file)
    monkeypatch.setattr("atocore.ingestion.pipeline._purge_deleted_files", lambda *args, **kwargs: 0)

    ingest_project_folder(sample_folder, project_id="p05-interferometer")

    assert seen
    assert {project_id for _, project_id in seen} == {"p05-interferometer"}
def test_parse_markdown_uses_supplied_text(sample_markdown):
    """Parsing should be able to reuse pre-read content from ingestion."""
    latin_text = """---\ntags: parser\n---\n# Parser Title\n\nBody text."""

View File

@@ -133,8 +133,8 @@ def test_refresh_registered_project_ingests_registered_roots(tmp_path, monkeypat
    calls = []

    def fake_ingest_folder(path, purge_deleted=True, project_id=""):
        calls.append((str(path), purge_deleted, project_id))
        return [{"file": str(path / "README.md"), "status": "ingested"}]

    monkeypatch.setenv("ATOCORE_VAULT_SOURCE_DIR", str(vault_dir))
@@ -144,7 +144,7 @@ def test_refresh_registered_project_ingests_registered_roots(tmp_path, monkeypat
    original_settings = config.settings
    try:
        config.settings = config.Settings()
        monkeypatch.setattr("atocore.projects.registry.ingest_project_folder", fake_ingest_folder)
        result = refresh_registered_project("polisher")
    finally:
        config.settings = original_settings
@@ -153,6 +153,7 @@ def test_refresh_registered_project_ingests_registered_roots(tmp_path, monkeypat
    assert len(calls) == 1
    assert calls[0][0].endswith("p06-polisher")
    assert calls[0][1] is False
    assert calls[0][2] == "p06-polisher"
    assert result["roots"][0]["status"] == "ingested"
    assert result["status"] == "ingested"
    assert result["roots_ingested"] == 1
@@ -188,7 +189,7 @@ def test_refresh_registered_project_reports_nothing_to_ingest_when_all_missing(
        encoding="utf-8",
    )

    def fail_ingest_folder(path, purge_deleted=True, project_id=""):
        raise AssertionError(f"ingest_folder should not be called for missing root: {path}")

    monkeypatch.setenv("ATOCORE_VAULT_SOURCE_DIR", str(vault_dir))
@@ -198,7 +199,7 @@ def test_refresh_registered_project_reports_nothing_to_ingest_when_all_missing(
    original_settings = config.settings
    try:
        config.settings = config.Settings()
        monkeypatch.setattr("atocore.projects.registry.ingest_project_folder", fail_ingest_folder)
        result = refresh_registered_project("ghost")
    finally:
        config.settings = original_settings
@@ -238,7 +239,7 @@ def test_refresh_registered_project_reports_partial_status(tmp_path, monkeypatch
        encoding="utf-8",
    )

    def fake_ingest_folder(path, purge_deleted=True, project_id=""):
        return [{"file": str(path / "README.md"), "status": "ingested"}]

    monkeypatch.setenv("ATOCORE_VAULT_SOURCE_DIR", str(vault_dir))
@@ -248,7 +249,7 @@ def test_refresh_registered_project_reports_partial_status(tmp_path, monkeypatch
    original_settings = config.settings
    try:
        config.settings = config.Settings()
        monkeypatch.setattr("atocore.projects.registry.ingest_project_folder", fake_ingest_folder)
        result = refresh_registered_project("mixed")
    finally:
        config.settings = original_settings

View File

@@ -384,6 +384,80 @@ def test_retrieve_project_scope_uses_path_segments_not_substrings(monkeypatch):
    assert [r.chunk_id for r in results] == ["chunk-target", "chunk-global"]


def test_retrieve_project_scope_prefers_exact_project_id(monkeypatch):
    target_project = type(
        "Project",
        (),
        {
            "project_id": "p04-gigabit",
            "aliases": ("p04", "gigabit"),
            "ingest_roots": (),
        },
    )()
    other_project = type(
        "Project",
        (),
        {
            "project_id": "p06-polisher",
            "aliases": ("p06", "polisher"),
            "ingest_roots": (),
        },
    )()

    class FakeStore:
        def query(self, query_embedding, top_k=10, where=None):
            return {
                "ids": [["chunk-target", "chunk-other", "chunk-global"]],
                "documents": [["target doc", "other doc", "global doc"]],
                "metadatas": [[
                    {
                        "heading_path": "Overview",
                        "source_file": "legacy/unhelpful-path.md",
                        "tags": "[]",
                        "title": "Target",
                        "project_id": "p04-gigabit",
                        "document_id": "doc-a",
                    },
                    {
                        "heading_path": "Overview",
                        "source_file": "p04-gigabit/title-poisoned.md",
                        "tags": '["p04-gigabit"]',
                        "title": "Looks target-owned but is explicit p06",
                        "project_id": "p06-polisher",
                        "document_id": "doc-b",
                    },
                    {
                        "heading_path": "Overview",
                        "source_file": "shared/global.md",
                        "tags": "[]",
                        "title": "Shared",
                        "project_id": "",
                        "document_id": "doc-global",
                    },
                ]],
                "distances": [[0.2, 0.19, 0.21]],
            }

    monkeypatch.setattr("atocore.retrieval.retriever.get_vector_store", lambda: FakeStore())
    monkeypatch.setattr("atocore.retrieval.retriever.embed_query", lambda query: [0.0, 0.1])
    monkeypatch.setattr(
        "atocore.retrieval.retriever._existing_chunk_ids",
        lambda chunk_ids: set(chunk_ids),
    )
    monkeypatch.setattr(
        "atocore.retrieval.retriever.get_registered_project",
        lambda project_name: target_project,
    )
    monkeypatch.setattr(
        "atocore.retrieval.retriever.load_project_registry",
        lambda: [target_project, other_project],
    )

    results = retrieve("mirror architecture", top_k=3, project_hint="p04")

    assert [r.chunk_id for r in results] == ["chunk-target", "chunk-global"]


def test_retrieve_unknown_project_hint_does_not_widen_or_filter(monkeypatch):
    class FakeStore:
        def query(self, query_embedding, top_k=10, where=None):