fix(retrieval): fail open on registry resolution errors

2026-04-24 11:32:46 -04:00
parent ce6ffdbb63
commit 05c11fd4fb
7 changed files with 121 additions and 6 deletions


@@ -9,7 +9,7 @@
- **live_sha** (Dalidou `/health` build_sha): `f44a211` (verified 2026-04-24T14:48:44Z post audit-improvements deploy; status=ok)
- **last_updated**: 2026-04-24 by Codex (retrieval boundary deployed; project_id metadata branch started)
- **main_tip**: `f44a211`
-- **test_count**: 565 on `codex/project-id-metadata-retrieval` (deployed main baseline: 553)
+- **test_count**: 567 on `codex/project-id-metadata-retrieval` (deployed main baseline: 553)
- **harness**: `19/20 PASS` on live Dalidou, 0 blocking failures, 1 known content gap (`p04-constraints`)
- **vectors**: 33,253
- **active_memories**: 290 (`/admin/dashboard` 2026-04-24; note integrity panel reports a separate active_memory_count=951 and needs reconciliation)
@@ -174,6 +174,8 @@ One branch `codex/extractor-eval-loop` for Day 1-5, a second `codex/retrieval-ha
- **2026-04-24 Codex (project_id audit response)** Applied independent-audit fixes on `codex/project-id-metadata-retrieval`. Closed the nightly `/ingest/sources` clobber risk by adding registry-level `derive_project_id_for_path()` and making unscoped `ingest_file()` derive ownership from registered ingest roots when possible; `refresh_registered_project()` still passes the canonical project id directly. Changed retrieval so an empty `project_id` falls through to the legacy path/tag ownership check instead of short-circuiting as unowned. Hardened `scripts/backfill_chunk_project_ids.py`: `--apply` now requires `--chroma-snapshot-confirmed`, runs Chroma metadata updates before SQLite writes, batches updates, skips and reports missing vectors and malformed metadata, reports already-tagged rows, and turns missing ingestion tables into a JSON `db_warning` instead of a traceback. Added tests for auto-derive ingestion, empty-project fallback, ingest-root overlap rejection, and backfill dry-run/apply/snapshot/missing-vector/malformed cases. Verified targeted suite (`test_backfill_chunk_project_ids.py`, `test_ingestion.py`, `test_project_registry.py`, `test_retrieval.py`): 45 passed. Verified full suite: 565 passed in 73.16s. Local dry-run on empty/default data returns 0 updates with `db_warning` rather than crashing. Branch still not merged/deployed.
- **2026-04-24 Codex (project_id final hardening before merge)** Applied the final independent-review P2s on `codex/project-id-metadata-retrieval`: `ingest_file()` still fails open when project-id derivation fails, but now emits `project_id_derivation_failed` with file path and error; retrieval now catches registry failures both at project-scope resolution and the soft project-match boost path, logs warnings, and serves unscoped rather than raising. Added regression tests for both fail-open paths. Verified targeted suite (`test_ingestion.py`, `test_retrieval.py`, `test_backfill_chunk_project_ids.py`, `test_project_registry.py`): 47 passed. Verified full suite: 567 passed in 79.66s. Branch still not merged/deployed.
- **2026-04-24 Codex (audit improvements foundation)** Started implementation of the audit recommendations on branch `codex/audit-improvements-foundation` from `origin/main@c53e61e`. First tranche: registry-aware project-scoped retrieval filtering (`ATOCORE_RANK_PROJECT_SCOPE_FILTER`, widened candidate pull before filtering), eval harness known-issue lane, two p05 project-bleed fixtures, `scripts/live_status.py`, README/current-state/master-plan status refresh. Verified `pytest -q`: 550 passed in 67.11s. Live retrieval harness against undeployed production: 20 fixtures, 18 pass, 1 known issue (`p04-constraints` Zerodur/1.2 content gap), 1 blocking guard (`p05-broad-status-no-atomizer`) still failing because production has not yet deployed the retrieval filter and currently pulls `P04-GigaBIT-M1-KB-design` into broad p05 status context. Live dashboard refresh: health ok, build `2b86543`, docs 1748, chunks/vectors 33253, interactions 948, active memories 289, candidates 0, project_state total 128. Noted count discrepancy: dashboard memories.active=289 while integrity active_memory_count=951; schedule reconciliation in a follow-up.
- **2026-04-24 Codex (independent-audit hardening)** Applied the Opus independent audit's fast follow-ups before merge/deploy. Closed the two P1s by making project-scope ownership path/tag-based only, adding path-segment/tag-exact matching to avoid short-alias substring collisions, and keeping title/heading text out of provenance decisions. Added regression tests for title poisoning, substring collision, and unknown-project fallback. Added retrieval log fields `raw_results_count`, `post_filter_count`, `post_filter_dropped`, and `underfilled`. Added retrieval-eval run metadata (`generated_at`, `base_url`, `/health`) and `live_status.py` auth-token/status support. README now documents the ranking knobs and clarifies that the hard scope filter and soft project match boost are separate controls. Verified `pytest -q`: 553 passed in 66.07s. Live production remains expected-predeploy: 20 fixtures, 18 pass, 1 known content gap, 1 blocking p05 bleed guard. Latest live dashboard: build `2b86543`, docs 1748, chunks/vectors 33253, interactions 950, active memories 290, candidates 0, project_state total 128.
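The fail-open behavior this commit adds can be sketched on its own: a registry lookup that raises must degrade to unscoped retrieval with a logged warning rather than surface a 500. This is a minimal illustration of the pattern; `resolve_scope`, `broken_lookup`, and the plain-`logging` call are stand-ins, not the project's actual API (the real code uses `get_registered_project` and structured log fields, as the diff below shows):

```python
import logging

log = logging.getLogger("retrieval")


def resolve_scope(project_hint, lookup):
    """Resolve a project hint via a registry lookup, failing open.

    Any exception from the registry is logged as a warning and the
    caller proceeds unscoped (no project filter) instead of raising.
    """
    if not project_hint:
        return None
    try:
        return lookup(project_hint)
    except Exception as exc:
        log.warning(
            "project_scope_resolution_failed hint=%s error=%s",
            project_hint,
            exc,
        )
        return None


def broken_lookup(hint):
    # Simulates a registry failure such as an ingest-root overlap error.
    raise ValueError("registry overlap")


# A broken registry degrades to unscoped retrieval rather than an error.
assert resolve_scope("p04", broken_lookup) is None
# A healthy lookup still resolves the scope.
assert resolve_scope("p04", lambda h: h.upper()) == "P04"
```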


@@ -2,7 +2,7 @@
Update 2026-04-24: audit-improvements deployed as `f44a211`; live harness is
19/20 with 0 blocking failures and 1 known content gap. Active follow-up branch
-`codex/project-id-metadata-retrieval` is at 565 passing tests.
+`codex/project-id-metadata-retrieval` is at 567 passing tests.
Live deploy: `2b86543` · Dalidou health: ok · Harness: 18/20 with 1 known
content gap and 1 current blocking project-bleed guard · Tests: 553 passing.


@@ -152,7 +152,7 @@ deferred from the shared client until their workflows are exercised.
- query-relevance memory ranking with overlap-density scoring
- retrieval eval harness: 20 fixtures; current live has 19 pass, 1 known
content gap, and 0 blocking failures after the audit-improvements deploy
-- 565 tests passing on the active `codex/project-id-metadata-retrieval` branch
+- 567 tests passing on the active `codex/project-id-metadata-retrieval` branch
- nightly pipeline: backup → cleanup → rsync → OpenClaw import → vault refresh → extract → triage → **auto-promote/expire** → weekly synth/lint → **retrieval harness** → **pipeline summary to project state**
- Phase 10 operational: reinforcement-based auto-promotion (ref_count ≥ 3, confidence ≥ 0.7) + stale candidate expiry (14 days unreinforced)
- pipeline health visible in dashboard: interaction totals by client, pipeline last_run, harness results, triage stats
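The Phase 10 rule above reduces to two threshold checks. This sketch hard-codes the documented values (ref_count ≥ 3, confidence ≥ 0.7, 14 days unreinforced); the function and constant names are illustrative, not the project's actual identifiers:

```python
from datetime import datetime, timedelta

# Documented Phase 10 thresholds (names are illustrative).
PROMOTE_MIN_REFS = 3
PROMOTE_MIN_CONFIDENCE = 0.7
EXPIRE_AFTER = timedelta(days=14)


def should_promote(ref_count: int, confidence: float) -> bool:
    """Reinforcement-based auto-promotion: ref_count >= 3 and confidence >= 0.7."""
    return ref_count >= PROMOTE_MIN_REFS and confidence >= PROMOTE_MIN_CONFIDENCE


def should_expire(last_reinforced: datetime, now: datetime) -> bool:
    """Stale-candidate expiry: unreinforced for more than 14 days."""
    return now - last_reinforced > EXPIRE_AFTER


now = datetime(2026, 4, 24)
assert should_promote(3, 0.7)                               # meets both thresholds
assert not should_promote(2, 0.9)                           # too few reinforcements
assert should_expire(now - timedelta(days=15), now)         # past the 14-day window
assert not should_expire(now - timedelta(days=13), now)     # still within the window
```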


@@ -42,7 +42,12 @@ def ingest_file(file_path: Path, project_id: str = "") -> dict:
         from atocore.projects.registry import derive_project_id_for_path
         project_id = derive_project_id_for_path(file_path)
-    except Exception:
+    except Exception as exc:
+        log.warning(
+            "project_id_derivation_failed",
+            file_path=str(file_path),
+            error=str(exc),
+        )
         project_id = ""
     if not file_path.exists():


@@ -84,7 +84,15 @@ def retrieve(
     """Retrieve the most relevant chunks for a query."""
     top_k = top_k or _config.settings.context_top_k
     start = time.time()
-    scoped_project = get_registered_project(project_hint) if project_hint else None
+    try:
+        scoped_project = get_registered_project(project_hint) if project_hint else None
+    except Exception as exc:
+        log.warning(
+            "project_scope_resolution_failed",
+            project_hint=project_hint,
+            error=str(exc),
+        )
+        scoped_project = None
     scope_filter_enabled = bool(scoped_project and _config.settings.rank_project_scope_filter)
     registered_projects = None
     query_top_k = top_k
@@ -292,7 +300,15 @@ def _project_match_boost(project_hint: str, metadata: dict) -> float:
     if not hint_lower:
         return 1.0
-    project = get_registered_project(project_hint)
+    try:
+        project = get_registered_project(project_hint)
+    except Exception as exc:
+        log.warning(
+            "project_match_boost_resolution_failed",
+            project_hint=project_hint,
+            error=str(exc),
+        )
+        project = None
     candidate_names = _project_scope_terms(project) if project is not None else {hint_lower}
     for candidate in candidate_names:
         if _metadata_has_term(metadata, candidate):


@@ -163,6 +163,45 @@ def test_ingest_file_derives_project_id_from_registry_root(tmp_data_dir, tmp_pat
     assert all(meta["project_id"] == "p04-gigabit" for meta in fake_store.metadatas)
 
 
+def test_ingest_file_logs_and_fails_open_when_project_derivation_fails(
+    tmp_data_dir,
+    sample_markdown,
+    monkeypatch,
+):
+    """A broken registry should be visible but should not block ingestion."""
+    init_db()
+
+    warnings = []
+
+    class FakeVectorStore:
+        def __init__(self):
+            self.metadatas = []
+
+        def add(self, ids, documents, metadatas):
+            self.metadatas.extend(metadatas)
+
+        def delete(self, ids):
+            return None
+
+    fake_store = FakeVectorStore()
+    monkeypatch.setattr("atocore.ingestion.pipeline.get_vector_store", lambda: fake_store)
+    monkeypatch.setattr(
+        "atocore.projects.registry.derive_project_id_for_path",
+        lambda path: (_ for _ in ()).throw(ValueError("registry broken")),
+    )
+    monkeypatch.setattr(
+        "atocore.ingestion.pipeline.log.warning",
+        lambda event, **kwargs: warnings.append((event, kwargs)),
+    )
+
+    result = ingest_file(sample_markdown)
+
+    assert result["status"] == "ingested"
+    assert fake_store.metadatas
+    assert all(meta["project_id"] == "" for meta in fake_store.metadatas)
+    assert warnings[0][0] == "project_id_derivation_failed"
+    assert "registry broken" in warnings[0][1]["error"]
+
+
 def test_ingest_project_folder_passes_project_id_to_files(tmp_data_dir, sample_folder, monkeypatch):
     seen = []


@@ -566,6 +566,59 @@ def test_retrieve_unknown_project_hint_does_not_widen_or_filter(monkeypatch):
     assert [r.chunk_id for r in results] == ["chunk-a", "chunk-b"]
 
 
+def test_retrieve_fails_open_when_project_scope_resolution_fails(monkeypatch):
+    warnings = []
+
+    class FakeStore:
+        def query(self, query_embedding, top_k=10, where=None):
+            assert top_k == 2
+            return {
+                "ids": [["chunk-a", "chunk-b"]],
+                "documents": [["doc a", "doc b"]],
+                "metadatas": [[
+                    {
+                        "heading_path": "Overview",
+                        "source_file": "p04-gigabit/file.md",
+                        "tags": "[]",
+                        "title": "A",
+                        "document_id": "doc-a",
+                    },
+                    {
+                        "heading_path": "Overview",
+                        "source_file": "p05-interferometer/file.md",
+                        "tags": "[]",
+                        "title": "B",
+                        "document_id": "doc-b",
+                    },
+                ]],
+                "distances": [[0.2, 0.21]],
+            }
+
+    monkeypatch.setattr("atocore.retrieval.retriever.get_vector_store", lambda: FakeStore())
+    monkeypatch.setattr("atocore.retrieval.retriever.embed_query", lambda query: [0.0, 0.1])
+    monkeypatch.setattr(
+        "atocore.retrieval.retriever._existing_chunk_ids",
+        lambda chunk_ids: set(chunk_ids),
+    )
+    monkeypatch.setattr(
+        "atocore.retrieval.retriever.get_registered_project",
+        lambda project_name: (_ for _ in ()).throw(ValueError("registry overlap")),
+    )
+    monkeypatch.setattr(
+        "atocore.retrieval.retriever.log.warning",
+        lambda event, **kwargs: warnings.append((event, kwargs)),
+    )
+
+    results = retrieve("overview", top_k=2, project_hint="p04")
+
+    assert [r.chunk_id for r in results] == ["chunk-a", "chunk-b"]
+    assert {warning[0] for warning in warnings} == {
+        "project_scope_resolution_failed",
+        "project_match_boost_resolution_failed",
+    }
+    assert all("registry overlap" in warning[1]["error"] for warning in warnings)
+
+
 def test_retrieve_downranks_archive_noise_and_prefers_high_signal_paths(monkeypatch):
     class FakeStore:
         def query(self, query_embedding, top_k=10, where=None):