fix(retrieval): preserve project ids across unscoped ingest

@@ -9,7 +9,7 @@
- **live_sha** (Dalidou `/health` build_sha): `f44a211` (verified 2026-04-24T14:48:44Z post audit-improvements deploy; status=ok)
- **last_updated**: 2026-04-24 by Codex (retrieval boundary deployed; project_id metadata branch started)
- **main_tip**: `f44a211`
- **test_count**: 556 on `codex/project-id-metadata-retrieval` (deployed main baseline: 553)
- **test_count**: 565 on `codex/project-id-metadata-retrieval` (deployed main baseline: 553)
- **harness**: `19/20 PASS` on live Dalidou, 0 blocking failures, 1 known content gap (`p04-constraints`)
- **vectors**: 33,253
- **active_memories**: 290 (`/admin/dashboard` 2026-04-24; note integrity panel reports a separate active_memory_count=951 and needs reconciliation)

@@ -172,6 +172,8 @@ One branch `codex/extractor-eval-loop` for Day 1-5, a second `codex/retrieval-ha

- **2026-04-24 Codex (retrieval boundary deployed + project_id metadata tranche)** Merged `codex/audit-improvements-foundation` to `main` as `f44a211` and pushed to Dalidou Gitea. Took pre-deploy runtime backup `/srv/storage/atocore/backups/snapshots/20260424T144810Z` (DB + registry, no Chroma). Deployed via `papa@dalidou` canonical `deploy/dalidou/deploy.sh`; live `/health` reports build_sha `f44a2114970008a7eec4e7fc2860c8f072914e38`, build_time `2026-04-24T14:48:44Z`, status ok. Post-deploy retrieval harness: 20 fixtures, 19 pass, 0 blocking failures, 1 known issue (`p04-constraints`). The former blocker `p05-broad-status-no-atomizer` now passes. Manual p05 `context-build "current status"` spot check shows no p04/Atomizer source bleed in retrieved chunks. Started follow-up branch `codex/project-id-metadata-retrieval`: registered-project ingestion now writes explicit `project_id` into DB chunk metadata and Chroma vector metadata; retrieval prefers exact `project_id` when present and keeps path/tag matching as legacy fallback; added dry-run-by-default `scripts/backfill_chunk_project_ids.py` to backfill SQLite + Chroma metadata; added tests for project-id ingestion, registered refresh propagation, exact project-id retrieval, and collision fallback. Verified targeted suite (`test_ingestion.py`, `test_project_registry.py`, `test_retrieval.py`): 36 passed. Verified full suite: 556 passed in 72.44s. Branch not merged or deployed yet.

- **2026-04-24 Codex (project_id audit response)** Applied independent-audit fixes on `codex/project-id-metadata-retrieval`. Closed the nightly `/ingest/sources` clobber risk by adding registry-level `derive_project_id_for_path()` and making unscoped `ingest_file()` derive ownership from registered ingest roots when possible; `refresh_registered_project()` still passes the canonical project id directly. Changed retrieval so empty `project_id` falls through to legacy path/tag ownership instead of short-circuiting as unowned. Hardened `scripts/backfill_chunk_project_ids.py`: `--apply` now requires `--chroma-snapshot-confirmed`, runs Chroma metadata updates before SQLite writes, batches updates, skips and reports missing vectors, skips and reports malformed metadata, reports already-tagged rows, and turns missing ingestion tables into a JSON `db_warning` instead of a traceback. Added tests for auto-derive ingestion, empty-project fallback, ingest-root overlap rejection, and backfill dry-run/apply/snapshot/missing-vector/malformed cases. Verified targeted suite (`test_backfill_chunk_project_ids.py`, `test_ingestion.py`, `test_project_registry.py`, `test_retrieval.py`): 45 passed. Verified full suite: 565 passed in 73.16s. Local dry-run on empty/default data returns 0 updates with `db_warning` rather than crashing. Branch still not merged/deployed.

- **2026-04-24 Codex (audit improvements foundation)** Started implementation of the audit recommendations on branch `codex/audit-improvements-foundation` from `origin/main@c53e61e`. First tranche: registry-aware project-scoped retrieval filtering (`ATOCORE_RANK_PROJECT_SCOPE_FILTER`, widened candidate pull before filtering), eval harness known-issue lane, two p05 project-bleed fixtures, `scripts/live_status.py`, README/current-state/master-plan status refresh. Verified `pytest -q`: 550 passed in 67.11s. Live retrieval harness against undeployed production: 20 fixtures, 18 pass, 1 known issue (`p04-constraints` Zerodur/1.2 content gap), 1 blocking guard (`p05-broad-status-no-atomizer`) still failing because production has not yet deployed the retrieval filter and currently pulls `P04-GigaBIT-M1-KB-design` into broad p05 status context. Live dashboard refresh: health ok, build `2b86543`, docs 1748, chunks/vectors 33253, interactions 948, active memories 289, candidates 0, project_state total 128. Noted count discrepancy: dashboard memories.active=289 while integrity active_memory_count=951; schedule reconciliation in a follow-up.

- **2026-04-24 Codex (independent-audit hardening)** Applied the Opus independent audit's fast follow-ups before merge/deploy. Closed the two P1s by making project-scope ownership path/tag-based only, adding path-segment/tag-exact matching to avoid short-alias substring collisions, and keeping title/heading text out of provenance decisions. Added regression tests for title poisoning, substring collision, and unknown-project fallback. Added retrieval log fields `raw_results_count`, `post_filter_count`, `post_filter_dropped`, and `underfilled`. Added retrieval-eval run metadata (`generated_at`, `base_url`, `/health`) and `live_status.py` auth-token/status support. README now documents the ranking knobs and clarifies that the hard scope filter and soft project match boost are separate controls. Verified `pytest -q`: 553 passed in 66.07s. Live production remains expected-predeploy: 20 fixtures, 18 pass, 1 known content gap, 1 blocking p05 bleed guard. Latest live dashboard: build `2b86543`, docs 1748, chunks/vectors 33253, interactions 950, active memories 290, candidates 0, project_state total 128.

@@ -111,7 +111,7 @@ pytest
- `scripts/atocore_client.py` provides a live API client for project refresh, project-state inspection, and retrieval-quality audits.
- `scripts/retrieval_eval.py` runs the live retrieval/context harness, separates blocking failures from known content gaps, and stamps JSON output with target/build metadata.
- `scripts/live_status.py` renders a compact read-only status report from `/health`, `/stats`, `/projects`, and `/admin/dashboard`; set `ATOCORE_AUTH_TOKEN` or `--auth-token` when those endpoints are gated.
- `scripts/backfill_chunk_project_ids.py` dry-runs or applies explicit `project_id` metadata backfills for SQLite chunks and Chroma vectors.
- `scripts/backfill_chunk_project_ids.py` dry-runs or applies explicit `project_id` metadata backfills for SQLite chunks and Chroma vectors; `--apply` requires a confirmed Chroma snapshot.
- `docs/operations.md` captures the current operational priority order: retrieval quality, Wave 2 trusted-operational ingestion, AtoDrive scoping, and restore validation.
- `DEV-LEDGER.md` is the fast-moving source of operational truth during active development; copy claims into docs only after checking the live service.

@@ -2,7 +2,7 @@

Update 2026-04-24: audit-improvements deployed as `f44a211`; live harness is
19/20 with 0 blocking failures and 1 known content gap. Active follow-up branch
`codex/project-id-metadata-retrieval` is at 556 passing tests.
`codex/project-id-metadata-retrieval` is at 565 passing tests.

Live deploy: `2b86543` · Dalidou health: ok · Harness: 18/20 with 1 known
content gap and 1 current blocking project-bleed guard · Tests: 553 passing.

@@ -152,7 +152,7 @@ deferred from the shared client until their workflows are exercised.
- query-relevance memory ranking with overlap-density scoring
- retrieval eval harness: 20 fixtures; current live has 19 pass, 1 known
  content gap, and 0 blocking failures after the audit-improvements deploy
- 556 tests passing on the active `codex/project-id-metadata-retrieval` branch
- 565 tests passing on the active `codex/project-id-metadata-retrieval` branch
- nightly pipeline: backup → cleanup → rsync → OpenClaw import → vault refresh → extract → triage → **auto-promote/expire** → weekly synth/lint → **retrieval harness** → **pipeline summary to project state**
- Phase 10 operational: reinforcement-based auto-promotion (ref_count ≥ 3, confidence ≥ 0.7) + stale candidate expiry (14 days unreinforced)
- pipeline health visible in dashboard: interaction totals by client, pipeline last_run, harness results, triage stats

@@ -9,52 +9,31 @@ from __future__ import annotations

import argparse
import json
import sqlite3
import sys
from pathlib import Path

sys.path.insert(0, str(Path(__file__).resolve().parents[1] / "src"))

from atocore.models.database import get_connection  # noqa: E402
from atocore.projects.registry import list_registered_projects  # noqa: E402
from atocore.projects.registry import derive_project_id_for_path  # noqa: E402
from atocore.retrieval.vector_store import get_vector_store  # noqa: E402


def _load_project_roots() -> list[tuple[str, Path]]:
    roots: list[tuple[str, Path]] = []
    for project in list_registered_projects():
        project_id = project["id"]
        for root in project.get("ingest_roots", []):
            root_path = root.get("path")
            if root_path:
                roots.append((project_id, Path(root_path).resolve(strict=False)))
    roots.sort(key=lambda item: len(str(item[1])), reverse=True)
    return roots
DEFAULT_BATCH_SIZE = 500


def _derive_project_id(file_path: str, roots: list[tuple[str, Path]]) -> str:
    if not file_path:
        return ""
    doc_path = Path(file_path).resolve(strict=False)
    for project_id, root_path in roots:
        try:
            doc_path.relative_to(root_path)
        except ValueError:
            continue
        return project_id
    return ""


def _decode_metadata(raw: str | None) -> dict:
def _decode_metadata(raw: str | None) -> dict | None:
    if not raw:
        return {}
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return {}
    return parsed if isinstance(parsed, dict) else {}
        return None
    return parsed if isinstance(parsed, dict) else None


def _chunk_rows() -> list[dict]:
def _chunk_rows() -> tuple[list[dict], str]:
    try:
        with get_connection() as conn:
            rows = conn.execute(
                """
@@ -67,65 +46,108 @@ def _chunk_rows() -> list[dict]:
            ORDER BY sd.file_path, sc.chunk_index
            """
        ).fetchall()
    return [dict(row) for row in rows]
    except sqlite3.OperationalError as exc:
        if "source_chunks" in str(exc) or "source_documents" in str(exc):
            return [], f"missing ingestion tables: {exc}"
        raise
    return [dict(row) for row in rows], ""


def backfill(apply: bool = False, project_filter: str = "") -> dict:
    roots = _load_project_roots()
    rows = _chunk_rows()
def _batches(items: list, batch_size: int) -> list[list]:
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def backfill(
    apply: bool = False,
    project_filter: str = "",
    batch_size: int = DEFAULT_BATCH_SIZE,
    require_chroma_snapshot: bool = False,
) -> dict:
    rows, db_warning = _chunk_rows()
    updates: list[tuple[str, str, dict]] = []
    by_project: dict[str, int] = {}
    skipped_unowned = 0
    already_tagged = 0
    malformed_metadata = 0

    for row in rows:
        project_id = _derive_project_id(row["file_path"], roots)
        project_id = derive_project_id_for_path(row["file_path"])
        if project_filter and project_id != project_filter:
            continue
        if not project_id:
            skipped_unowned += 1
            continue
        metadata = _decode_metadata(row["chunk_metadata"])
        if metadata is None:
            malformed_metadata += 1
            continue
        if metadata.get("project_id") == project_id:
            already_tagged += 1
            continue
        metadata["project_id"] = project_id
        updates.append((row["chunk_id"], project_id, metadata))
        by_project[project_id] = by_project.get(project_id, 0) + 1

    missing_vectors: list[str] = []
    applied_updates = 0
    if apply and updates:
        if not require_chroma_snapshot:
            raise ValueError(
                "--apply requires --chroma-snapshot-confirmed after taking a Chroma backup"
            )
        vector_store = get_vector_store()
        chunk_ids = [chunk_id for chunk_id, _, _ in updates]
        for batch in _batches(updates, max(1, batch_size)):
            chunk_ids = [chunk_id for chunk_id, _, _ in batch]
            vector_payload = vector_store.get_metadatas(chunk_ids)
            existing_vector_metadata = {
                chunk_id: metadata or {}
                chunk_id: metadata
                for chunk_id, metadata in zip(
                    vector_payload.get("ids", []),
                    vector_payload.get("metadatas", []),
                    strict=False,
                )
                if isinstance(metadata, dict)
            }
            vector_metadatas = []
            for chunk_id, project_id, chunk_metadata in updates:
                vector_metadata = dict(existing_vector_metadata.get(chunk_id) or {})
                if not vector_metadata:
                    vector_metadata = dict(chunk_metadata)
                vector_metadata["project_id"] = project_id
                vector_metadatas.append(vector_metadata)

            vector_ids = []
            vector_metadatas = []
            sql_updates = []
            for chunk_id, project_id, chunk_metadata in batch:
                vector_metadata = existing_vector_metadata.get(chunk_id)
                if vector_metadata is None:
                    missing_vectors.append(chunk_id)
                    continue
                vector_metadata = dict(vector_metadata)
                vector_metadata["project_id"] = project_id
                vector_ids.append(chunk_id)
                vector_metadatas.append(vector_metadata)
                sql_updates.append((json.dumps(chunk_metadata, ensure_ascii=True), chunk_id))

            if not vector_ids:
                continue

            vector_store.update_metadatas(vector_ids, vector_metadatas)
            with get_connection() as conn:
                conn.executemany(
                cursor = conn.executemany(
                    "UPDATE source_chunks SET metadata = ? WHERE id = ?",
                    [
                        (json.dumps(metadata, ensure_ascii=True), chunk_id)
                        for chunk_id, _, metadata in updates
                    ],
                    sql_updates,
                )
        vector_store.update_metadatas(chunk_ids, vector_metadatas)
            if cursor.rowcount != len(sql_updates):
                raise RuntimeError(
                    f"SQLite rowcount mismatch: {cursor.rowcount} != {len(sql_updates)}"
                )
            applied_updates += len(sql_updates)

    return {
        "apply": apply,
        "total_chunks": len(rows),
        "updates": len(updates),
        "applied_updates": applied_updates,
        "already_tagged": already_tagged,
        "skipped_unowned": skipped_unowned,
        "malformed_metadata": malformed_metadata,
        "missing_vectors": len(missing_vectors),
        "db_warning": db_warning,
        "by_project": dict(sorted(by_project.items())),
    }


@@ -134,9 +156,20 @@ def main() -> int:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--apply", action="store_true", help="write SQLite and Chroma metadata updates")
    parser.add_argument("--project", default="", help="optional canonical project_id filter")
    parser.add_argument("--batch-size", type=int, default=DEFAULT_BATCH_SIZE)
    parser.add_argument(
        "--chroma-snapshot-confirmed",
        action="store_true",
        help="required with --apply; confirms a Chroma snapshot exists",
    )
    args = parser.parse_args()

    payload = backfill(apply=args.apply, project_filter=args.project.strip())
    payload = backfill(
        apply=args.apply,
        project_filter=args.project.strip(),
        batch_size=args.batch_size,
        require_chroma_snapshot=args.chroma_snapshot_confirmed,
    )
    print(json.dumps(payload, indent=2, ensure_ascii=True))
    return 0


@@ -37,6 +37,13 @@ def ingest_file(file_path: Path, project_id: str = "") -> dict:
    start = time.time()
    file_path = file_path.resolve()
    project_id = (project_id or "").strip()
    if not project_id:
        try:
            from atocore.projects.registry import derive_project_id_for_path

            project_id = derive_project_id_for_path(file_path)
        except Exception:
            project_id = ""

    if not file_path.exists():
        raise FileNotFoundError(f"File not found: {file_path}")

@@ -8,7 +8,6 @@ from dataclasses import asdict, dataclass
from pathlib import Path

import atocore.config as _config
from atocore.ingestion.pipeline import ingest_project_folder


# Reserved pseudo-projects. `inbox` holds pre-project / lead / quote
@@ -260,6 +259,7 @@ def load_project_registry() -> list[RegisteredProject]:
    )

    _validate_unique_project_names(projects)
    _validate_ingest_root_overlaps(projects)
    return projects


@@ -307,6 +307,28 @@ def resolve_project_name(name: str | None) -> str:
    return name


def derive_project_id_for_path(file_path: str | Path) -> str:
    """Return the registered project that owns a source path, if any."""
    if not file_path:
        return ""
    doc_path = Path(file_path).resolve(strict=False)
    matches: list[tuple[int, int, str]] = []

    for project in load_project_registry():
        for source_ref in project.ingest_roots:
            root_path = _resolve_ingest_root(source_ref)
            try:
                doc_path.relative_to(root_path)
            except ValueError:
                continue
            matches.append((len(root_path.parts), len(str(root_path)), project.project_id))

    if not matches:
        return ""
    matches.sort(reverse=True)
    return matches[0][2]


def refresh_registered_project(project_name: str, purge_deleted: bool = False) -> dict:
    """Ingest all configured source roots for a registered project.

@@ -322,6 +344,8 @@ def refresh_registered_project(project_name: str, purge_deleted: bool = False) -
    if project is None:
        raise ValueError(f"Unknown project: {project_name}")

    from atocore.ingestion.pipeline import ingest_project_folder

    roots = []
    ingested_count = 0
    skipped_count = 0
@@ -447,6 +471,33 @@ def _validate_unique_project_names(projects: list[RegisteredProject]) -> None:
        seen[key] = project.project_id


def _validate_ingest_root_overlaps(projects: list[RegisteredProject]) -> None:
    roots: list[tuple[str, Path]] = []
    for project in projects:
        for source_ref in project.ingest_roots:
            roots.append((project.project_id, _resolve_ingest_root(source_ref)))

    for i, (left_project, left_root) in enumerate(roots):
        for right_project, right_root in roots[i + 1:]:
            if left_project == right_project:
                continue
            try:
                left_root.relative_to(right_root)
                overlaps = True
            except ValueError:
                try:
                    right_root.relative_to(left_root)
                    overlaps = True
                except ValueError:
                    overlaps = False
            if overlaps:
                raise ValueError(
                    "Project registry ingest root overlap: "
                    f"'{left_root}' ({left_project}) and "
                    f"'{right_root}' ({right_project})"
                )


def _find_name_collisions(
    project_id: str,
    aliases: list[str],


@@ -209,8 +209,9 @@ def _is_allowed_for_project_scope(


def _metadata_matches_project(project: RegisteredProject, metadata: dict) -> bool:
    if "project_id" in metadata:
        return str(metadata.get("project_id", "")).strip().lower() == project.project_id.lower()
    stored_project_id = str(metadata.get("project_id", "")).strip().lower()
    if stored_project_id:
        return stored_project_id == project.project_id.lower()

    path = _metadata_source_path(metadata)
    tags = _metadata_tags(metadata)

tests/test_backfill_chunk_project_ids.py (new file, 154 lines)
@@ -0,0 +1,154 @@
"""Tests for explicit chunk project_id metadata backfill."""

import json

import atocore.config as config
from atocore.models.database import get_connection, init_db
from scripts import backfill_chunk_project_ids as backfill


def _write_registry(tmp_path, monkeypatch):
    vault_dir = tmp_path / "vault"
    drive_dir = tmp_path / "drive"
    config_dir = tmp_path / "config"
    project_dir = vault_dir / "incoming" / "projects" / "p04-gigabit"
    project_dir.mkdir(parents=True)
    drive_dir.mkdir()
    config_dir.mkdir()
    registry_path = config_dir / "project-registry.json"
    registry_path.write_text(
        json.dumps(
            {
                "projects": [
                    {
                        "id": "p04-gigabit",
                        "aliases": ["p04"],
                        "ingest_roots": [
                            {"source": "vault", "subpath": "incoming/projects/p04-gigabit"}
                        ],
                    }
                ]
            }
        ),
        encoding="utf-8",
    )
    monkeypatch.setenv("ATOCORE_VAULT_SOURCE_DIR", str(vault_dir))
    monkeypatch.setenv("ATOCORE_DRIVE_SOURCE_DIR", str(drive_dir))
    monkeypatch.setenv("ATOCORE_PROJECT_REGISTRY_PATH", str(registry_path))
    config.settings = config.Settings()
    return project_dir


def _insert_chunk(file_path, metadata=None, chunk_id="chunk-1"):
    with get_connection() as conn:
        conn.execute(
            """
            INSERT INTO source_documents (id, file_path, file_hash, title, doc_type, tags)
            VALUES (?, ?, ?, ?, ?, ?)
            """,
            ("doc-1", str(file_path), "hash", "Title", "markdown", "[]"),
        )
        conn.execute(
            """
            INSERT INTO source_chunks
            (id, document_id, chunk_index, content, heading_path, char_count, metadata)
            VALUES (?, ?, ?, ?, ?, ?, ?)
            """,
            (
                chunk_id,
                "doc-1",
                0,
                "content",
                "Overview",
                7,
                json.dumps(metadata if metadata is not None else {}),
            ),
        )


class FakeVectorStore:
    def __init__(self, metadatas):
        self.metadatas = dict(metadatas)
        self.updated = []

    def get_metadatas(self, ids):
        returned_ids = [chunk_id for chunk_id in ids if chunk_id in self.metadatas]
        return {
            "ids": returned_ids,
            "metadatas": [self.metadatas[chunk_id] for chunk_id in returned_ids],
        }

    def update_metadatas(self, ids, metadatas):
        self.updated.append((list(ids), list(metadatas)))
        for chunk_id, metadata in zip(ids, metadatas, strict=True):
            self.metadatas[chunk_id] = metadata


def test_backfill_dry_run_is_non_mutating(tmp_data_dir, tmp_path, monkeypatch):
    init_db()
    project_dir = _write_registry(tmp_path, monkeypatch)
    _insert_chunk(project_dir / "status.md")

    result = backfill.backfill(apply=False)

    assert result["updates"] == 1
    with get_connection() as conn:
        row = conn.execute("SELECT metadata FROM source_chunks WHERE id = ?", ("chunk-1",)).fetchone()
    assert json.loads(row["metadata"]) == {}


def test_backfill_apply_updates_chroma_then_sql(tmp_data_dir, tmp_path, monkeypatch):
    init_db()
    project_dir = _write_registry(tmp_path, monkeypatch)
    _insert_chunk(project_dir / "status.md", metadata={"source_file": "status.md"})
    fake_store = FakeVectorStore({"chunk-1": {"source_file": "status.md"}})
    monkeypatch.setattr(backfill, "get_vector_store", lambda: fake_store)

    result = backfill.backfill(apply=True, require_chroma_snapshot=True)

    assert result["applied_updates"] == 1
    assert fake_store.metadatas["chunk-1"]["project_id"] == "p04-gigabit"
    with get_connection() as conn:
        row = conn.execute("SELECT metadata FROM source_chunks WHERE id = ?", ("chunk-1",)).fetchone()
    assert json.loads(row["metadata"])["project_id"] == "p04-gigabit"


def test_backfill_apply_requires_snapshot_confirmation(tmp_data_dir, tmp_path, monkeypatch):
    init_db()
    project_dir = _write_registry(tmp_path, monkeypatch)
    _insert_chunk(project_dir / "status.md")

    try:
        backfill.backfill(apply=True)
    except ValueError as exc:
        assert "Chroma backup" in str(exc)
    else:
        raise AssertionError("Expected snapshot confirmation requirement")


def test_backfill_missing_vector_skips_sql_update(tmp_data_dir, tmp_path, monkeypatch):
    init_db()
    project_dir = _write_registry(tmp_path, monkeypatch)
    _insert_chunk(project_dir / "status.md")
    fake_store = FakeVectorStore({})
    monkeypatch.setattr(backfill, "get_vector_store", lambda: fake_store)

    result = backfill.backfill(apply=True, require_chroma_snapshot=True)

    assert result["updates"] == 1
    assert result["applied_updates"] == 0
    assert result["missing_vectors"] == 1
    with get_connection() as conn:
        row = conn.execute("SELECT metadata FROM source_chunks WHERE id = ?", ("chunk-1",)).fetchone()
    assert json.loads(row["metadata"]) == {}


def test_backfill_skips_malformed_metadata(tmp_data_dir, tmp_path, monkeypatch):
    init_db()
    project_dir = _write_registry(tmp_path, monkeypatch)
    _insert_chunk(project_dir / "status.md", metadata=[])

    result = backfill.backfill(apply=False)

    assert result["updates"] == 0
    assert result["malformed_metadata"] == 1
|
||||
@@ -103,6 +103,66 @@ def test_ingest_file_records_project_id_metadata(tmp_data_dir, sample_markdown,
|
||||
)
|
||||
|
||||
|
||||
def test_ingest_file_derives_project_id_from_registry_root(tmp_data_dir, tmp_path, monkeypatch):
|
||||
"""Unscoped ingest should preserve ownership for files under registered roots."""
|
||||
import atocore.config as config
|
||||
|
||||
vault_dir = tmp_path / "vault"
|
||||
drive_dir = tmp_path / "drive"
|
||||
config_dir = tmp_path / "config"
|
||||
project_dir = vault_dir / "incoming" / "projects" / "p04-gigabit"
|
||||
project_dir.mkdir(parents=True)
|
||||
drive_dir.mkdir()
|
||||
config_dir.mkdir()
|
||||
note = project_dir / "status.md"
|
||||
note.write_text(
|
||||
"# Status\n\nCurrent project status with enough detail to create "
|
||||
"a retrievable chunk for the ingestion pipeline test.",
|
||||
encoding="utf-8",
|
||||
)
|
||||
registry_path = config_dir / "project-registry.json"
|
||||
registry_path.write_text(
|
||||
json.dumps(
|
||||
{
|
||||
"projects": [
|
||||
{
|
||||
"id": "p04-gigabit",
|
||||
"aliases": ["p04"],
|
||||
"ingest_roots": [
|
||||
{"source": "vault", "subpath": "incoming/projects/p04-gigabit"}
|
||||
],
|
||||
}
|
||||
]
|
||||
}
|
||||
),
|
||||
encoding="utf-8",
|
||||
)
|
||||
|
||||
class FakeVectorStore:
|
||||
def __init__(self):
|
||||
self.metadatas = []
|
||||
|
||||
def add(self, ids, documents, metadatas):
|
||||
self.metadatas.extend(metadatas)
|
||||
|
||||
def delete(self, ids):
|
||||
return None
|
||||
|
||||
fake_store = FakeVectorStore()
|
||||
monkeypatch.setenv("ATOCORE_VAULT_SOURCE_DIR", str(vault_dir))
|
||||
monkeypatch.setenv("ATOCORE_DRIVE_SOURCE_DIR", str(drive_dir))
|
||||
monkeypatch.setenv("ATOCORE_PROJECT_REGISTRY_PATH", str(registry_path))
|
||||
config.settings = config.Settings()
|
||||
monkeypatch.setattr("atocore.ingestion.pipeline.get_vector_store", lambda: fake_store)
|
||||
|
||||
init_db()
|
||||
result = ingest_file(note)
|
||||
|
||||
assert result["status"] == "ingested"
|
||||
assert fake_store.metadatas
|
||||
assert all(meta["project_id"] == "p04-gigabit" for meta in fake_store.metadatas)
|
||||
|
||||
|
||||
def test_ingest_project_folder_passes_project_id_to_files(tmp_data_dir, sample_folder, monkeypatch):
|
||||
seen = []
|
||||
|
||||
|
||||
@@ -5,6 +5,7 @@ import json
|
||||
import atocore.config as config
|
||||
from atocore.projects.registry import (
|
||||
build_project_registration_proposal,
|
||||
derive_project_id_for_path,
|
||||
get_registered_project,
|
||||
get_project_registry_template,
|
||||
list_registered_projects,
|
||||
@@ -103,6 +104,98 @@ def test_project_registry_resolves_alias(tmp_path, monkeypatch):
|
||||
assert project.project_id == "p05-interferometer"
|
||||
|
||||
|
||||
def test_derive_project_id_for_path_uses_registered_roots(tmp_path, monkeypatch):
|
||||
vault_dir = tmp_path / "vault"
|
||||
drive_dir = tmp_path / "drive"
|
||||
config_dir = tmp_path / "config"
|
||||
project_dir = vault_dir / "incoming" / "projects" / "p04-gigabit"
|
||||
project_dir.mkdir(parents=True)
|
||||
drive_dir.mkdir()
|
||||
config_dir.mkdir()
|
||||
note = project_dir / "status.md"
|
||||
note.write_text("# Status\n\nCurrent work.", encoding="utf-8")
|
||||
|
||||
registry_path = config_dir / "project-registry.json"
|
||||
registry_path.write_text(
|
||||
json.dumps(
|
||||
{
|
||||
"projects": [
|
||||
{
|
||||
"id": "p04-gigabit",
|
||||
"aliases": ["p04"],
|
||||
"ingest_roots": [
|
||||
{"source": "vault", "subpath": "incoming/projects/p04-gigabit"}
|
||||
],
|
||||
}
|
||||
]
|
||||
}
|
||||
),
|
||||
encoding="utf-8",
|
||||
)
|
||||
|
||||
monkeypatch.setenv("ATOCORE_VAULT_SOURCE_DIR", str(vault_dir))
|
||||
monkeypatch.setenv("ATOCORE_DRIVE_SOURCE_DIR", str(drive_dir))
|
||||
monkeypatch.setenv("ATOCORE_PROJECT_REGISTRY_PATH", str(registry_path))
|
||||
|
||||
original_settings = config.settings
|
||||
try:
|
||||
config.settings = config.Settings()
|
||||
assert derive_project_id_for_path(note) == "p04-gigabit"
|
||||
assert derive_project_id_for_path(tmp_path / "elsewhere.md") == ""
|
||||
finally:
|
||||
config.settings = original_settings
|
||||
|
||||
|
||||
def test_project_registry_rejects_cross_project_ingest_root_overlap(tmp_path, monkeypatch):
    vault_dir = tmp_path / "vault"
    drive_dir = tmp_path / "drive"
    config_dir = tmp_path / "config"
    vault_dir.mkdir()
    drive_dir.mkdir()
    config_dir.mkdir()

    registry_path = config_dir / "project-registry.json"
    registry_path.write_text(
        json.dumps(
            {
                "projects": [
                    {
                        "id": "parent",
                        "aliases": [],
                        "ingest_roots": [
                            {"source": "vault", "subpath": "incoming/projects/parent"}
                        ],
                    },
                    {
                        "id": "child",
                        "aliases": [],
                        "ingest_roots": [
                            {"source": "vault", "subpath": "incoming/projects/parent/child"}
                        ],
                    },
                ]
            }
        ),
        encoding="utf-8",
    )

    monkeypatch.setenv("ATOCORE_VAULT_SOURCE_DIR", str(vault_dir))
    monkeypatch.setenv("ATOCORE_DRIVE_SOURCE_DIR", str(drive_dir))
    monkeypatch.setenv("ATOCORE_PROJECT_REGISTRY_PATH", str(registry_path))

    original_settings = config.settings
    try:
        config.settings = config.Settings()
        try:
            list_registered_projects()
        except ValueError as exc:
            assert "ingest root overlap" in str(exc)
        else:
            raise AssertionError("Expected overlapping ingest roots to raise")
    finally:
        config.settings = original_settings


def test_refresh_registered_project_ingests_registered_roots(tmp_path, monkeypatch):
    vault_dir = tmp_path / "vault"
    drive_dir = tmp_path / "drive"
@@ -144,7 +237,7 @@ def test_refresh_registered_project_ingests_registered_roots(tmp_path, monkeypat
    original_settings = config.settings
    try:
        config.settings = config.Settings()
        monkeypatch.setattr("atocore.projects.registry.ingest_project_folder", fake_ingest_folder)
        monkeypatch.setattr("atocore.ingestion.pipeline.ingest_project_folder", fake_ingest_folder)
        result = refresh_registered_project("polisher")
    finally:
        config.settings = original_settings
@@ -199,7 +292,7 @@ def test_refresh_registered_project_reports_nothing_to_ingest_when_all_missing(
    original_settings = config.settings
    try:
        config.settings = config.Settings()
        monkeypatch.setattr("atocore.projects.registry.ingest_project_folder", fail_ingest_folder)
        monkeypatch.setattr("atocore.ingestion.pipeline.ingest_project_folder", fail_ingest_folder)
        result = refresh_registered_project("ghost")
    finally:
        config.settings = original_settings
@@ -249,7 +342,7 @@ def test_refresh_registered_project_reports_partial_status(tmp_path, monkeypatch
    original_settings = config.settings
    try:
        config.settings = config.Settings()
        monkeypatch.setattr("atocore.projects.registry.ingest_project_folder", fake_ingest_folder)
        monkeypatch.setattr("atocore.ingestion.pipeline.ingest_project_folder", fake_ingest_folder)
        result = refresh_registered_project("mixed")
    finally:
        config.settings = original_settings
@@ -458,6 +458,72 @@ def test_retrieve_project_scope_prefers_exact_project_id(monkeypatch):
    assert [r.chunk_id for r in results] == ["chunk-target", "chunk-global"]


def test_retrieve_empty_project_id_falls_back_to_path_ownership(monkeypatch):
    target_project = type(
        "Project",
        (),
        {
            "project_id": "p04-gigabit",
            "aliases": ("p04", "gigabit"),
            "ingest_roots": (),
        },
    )()
    other_project = type(
        "Project",
        (),
        {
            "project_id": "p05-interferometer",
            "aliases": ("p05", "interferometer"),
            "ingest_roots": (),
        },
    )()

    class FakeStore:
        def query(self, query_embedding, top_k=10, where=None):
            return {
                "ids": [["chunk-target", "chunk-other"]],
                "documents": [["target doc", "other doc"]],
                "metadatas": [[
                    {
                        "heading_path": "Overview",
                        "source_file": "p04-gigabit/status.md",
                        "tags": "[]",
                        "title": "Target",
                        "project_id": "",
                        "document_id": "doc-a",
                    },
                    {
                        "heading_path": "Overview",
                        "source_file": "p05-interferometer/status.md",
                        "tags": "[]",
                        "title": "Other",
                        "project_id": "",
                        "document_id": "doc-b",
                    },
                ]],
                "distances": [[0.2, 0.19]],
            }

    monkeypatch.setattr("atocore.retrieval.retriever.get_vector_store", lambda: FakeStore())
    monkeypatch.setattr("atocore.retrieval.retriever.embed_query", lambda query: [0.0, 0.1])
    monkeypatch.setattr(
        "atocore.retrieval.retriever._existing_chunk_ids",
        lambda chunk_ids: set(chunk_ids),
    )
    monkeypatch.setattr(
        "atocore.retrieval.retriever.get_registered_project",
        lambda project_name: target_project,
    )
    monkeypatch.setattr(
        "atocore.retrieval.retriever.load_project_registry",
        lambda: [target_project, other_project],
    )

    results = retrieve("mirror architecture", top_k=2, project_hint="p04")

    assert [r.chunk_id for r in results] == ["chunk-target"]


def test_retrieve_unknown_project_hint_does_not_widen_or_filter(monkeypatch):
    class FakeStore:
        def query(self, query_embedding, top_k=10, where=None):