5 Commits

18 changed files with 1641 additions and 82 deletions


@@ -6,17 +6,17 @@
## Orientation
- **live_sha** (Dalidou `/health` build_sha): `2b86543` (verified 2026-04-23T15:20:53Z post-R14 deploy; status=ok) - **live_sha** (Dalidou `/health` build_sha): `f44a211` (verified 2026-04-24T14:48:44Z post audit-improvements deploy; status=ok)
- **last_updated**: 2026-04-23 by Claude (R14 squash-merged + deployed; Orientation refreshed) - **last_updated**: 2026-04-24 by Codex (retrieval boundary deployed; project_id metadata branch started)
- **main_tip**: `2b86543` - **main_tip**: `f44a211`
- **test_count**: 548 (547 + 1 R14 regression test) - **test_count**: 567 on `codex/project-id-metadata-retrieval` (deployed main baseline: 553)
- **harness**: `17/18 PASS` on live Dalidou (p04-constraints expects "Zerodur" — known content gap, not regression; consistent since 2026-04-19) - **harness**: `19/20 PASS` on live Dalidou, 0 blocking failures, 1 known content gap (`p04-constraints`)
- **vectors**: 33,253
- **active_memories**: 784 (up from 84 pre-density-batch — density gate CRUSHED vs V1-A's 100-target) - **active_memories**: 290 (`/admin/dashboard` 2026-04-24; note integrity panel reports a separate active_memory_count=951 and needs reconciliation)
- **candidate_memories**: 2 (triage queue drained) - **candidate_memories**: 0 (triage queue drained)
- **interactions**: 500+ (limit=2000 query returned 500 — density batch has been running; actual may be higher, confirm via /stats next update) - **interactions**: 951 (`/admin/dashboard` 2026-04-24)
- **registered_projects**: atocore, p04-gigabit, p05-interferometer, p06-polisher, atomizer-v2, abb-space (aliased p08)
- **project_state_entries**: 63 (atocore alone; full cross-project count not re-sampled this update) - **project_state_entries**: 128 across registered projects (`/admin/dashboard` 2026-04-24)
- **entities**: 66 (up from 35 — V1-0 backfill + ongoing work; 0 open conflicts)
- **off_host_backup**: `papa@192.168.86.39:/home/papa/atocore-backups/` via cron, verified
- **nightly_pipeline**: backup → cleanup → rsync → OpenClaw import → vault refresh → extract → auto-triage → **auto-promote/expire (NEW)** → weekly synth/lint Sundays → **retrieval harness (NEW)** → **pipeline summary (NEW)**
@@ -170,6 +170,16 @@ One branch `codex/extractor-eval-loop` for Day 1-5, a second `codex/retrieval-ha
## Session Log
- **2026-04-24 Codex (retrieval boundary deployed + project_id metadata tranche)** Merged `codex/audit-improvements-foundation` to `main` as `f44a211` and pushed to Dalidou Gitea. Took pre-deploy runtime backup `/srv/storage/atocore/backups/snapshots/20260424T144810Z` (DB + registry, no Chroma). Deployed via `papa@dalidou` canonical `deploy/dalidou/deploy.sh`; live `/health` reports build_sha `f44a2114970008a7eec4e7fc2860c8f072914e38`, build_time `2026-04-24T14:48:44Z`, status ok. Post-deploy retrieval harness: 20 fixtures, 19 pass, 0 blocking failures, 1 known issue (`p04-constraints`). The former blocker `p05-broad-status-no-atomizer` now passes. Manual p05 `context-build "current status"` spot check shows no p04/Atomizer source bleed in retrieved chunks. Started follow-up branch `codex/project-id-metadata-retrieval`: registered-project ingestion now writes explicit `project_id` into DB chunk metadata and Chroma vector metadata; retrieval prefers exact `project_id` when present and keeps path/tag matching as legacy fallback; added dry-run-by-default `scripts/backfill_chunk_project_ids.py` to backfill SQLite + Chroma metadata; added tests for project-id ingestion, registered refresh propagation, exact project-id retrieval, and collision fallback. Verified targeted suite (`test_ingestion.py`, `test_project_registry.py`, `test_retrieval.py`): 36 passed. Verified full suite: 556 passed in 72.44s. Branch not merged or deployed yet.
- **2026-04-24 Codex (project_id audit response)** Applied independent-audit fixes on `codex/project-id-metadata-retrieval`. Closed the nightly `/ingest/sources` clobber risk by adding registry-level `derive_project_id_for_path()` and making unscoped `ingest_file()` derive ownership from registered ingest roots when possible; `refresh_registered_project()` still passes the canonical project id directly. Changed retrieval so empty `project_id` falls through to legacy path/tag ownership instead of short-circuiting as unowned. Hardened `scripts/backfill_chunk_project_ids.py`: `--apply` now requires `--chroma-snapshot-confirmed`, runs Chroma metadata updates before SQLite writes, batches updates, skips and reports missing vectors, skips and reports malformed metadata, reports already-tagged rows, and turns missing ingestion tables into a JSON `db_warning` instead of a traceback. Added tests for auto-derive ingestion, empty-project fallback, ingest-root overlap rejection, and backfill dry-run/apply/snapshot/missing-vector/malformed cases. Verified targeted suite (`test_backfill_chunk_project_ids.py`, `test_ingestion.py`, `test_project_registry.py`, `test_retrieval.py`): 45 passed. Verified full suite: 565 passed in 73.16s. Local dry-run on empty/default data returns 0 updates with `db_warning` rather than crashing. Branch still not merged/deployed.
- **2026-04-24 Codex (project_id final hardening before merge)** Applied the final independent-review P2s on `codex/project-id-metadata-retrieval`: `ingest_file()` still fails open when project-id derivation fails, but now emits `project_id_derivation_failed` with file path and error; retrieval now catches registry failures both at project-scope resolution and the soft project-match boost path, logs warnings, and serves unscoped rather than raising. Added regression tests for both fail-open paths. Verified targeted suite (`test_ingestion.py`, `test_retrieval.py`, `test_backfill_chunk_project_ids.py`, `test_project_registry.py`): 47 passed. Verified full suite: 567 passed in 79.66s. Branch still not merged/deployed.
- **2026-04-24 Codex (audit improvements foundation)** Started implementation of the audit recommendations on branch `codex/audit-improvements-foundation` from `origin/main@c53e61e`. First tranche: registry-aware project-scoped retrieval filtering (`ATOCORE_RANK_PROJECT_SCOPE_FILTER`, widened candidate pull before filtering), eval harness known-issue lane, two p05 project-bleed fixtures, `scripts/live_status.py`, README/current-state/master-plan status refresh. Verified `pytest -q`: 550 passed in 67.11s. Live retrieval harness against undeployed production: 20 fixtures, 18 pass, 1 known issue (`p04-constraints` Zerodur/1.2 content gap), 1 blocking guard (`p05-broad-status-no-atomizer`) still failing because production has not yet deployed the retrieval filter and currently pulls `P04-GigaBIT-M1-KB-design` into broad p05 status context. Live dashboard refresh: health ok, build `2b86543`, docs 1748, chunks/vectors 33253, interactions 948, active memories 289, candidates 0, project_state total 128. Noted count discrepancy: dashboard memories.active=289 while integrity active_memory_count=951; schedule reconciliation in a follow-up.
- **2026-04-24 Codex (independent-audit hardening)** Applied the Opus independent audit's fast follow-ups before merge/deploy. Closed the two P1s by making project-scope ownership path/tag-based only, adding path-segment/tag-exact matching to avoid short-alias substring collisions, and keeping title/heading text out of provenance decisions. Added regression tests for title poisoning, substring collision, and unknown-project fallback. Added retrieval log fields `raw_results_count`, `post_filter_count`, `post_filter_dropped`, and `underfilled`. Added retrieval-eval run metadata (`generated_at`, `base_url`, `/health`) and `live_status.py` auth-token/status support. README now documents the ranking knobs and clarifies that the hard scope filter and soft project match boost are separate controls. Verified `pytest -q`: 553 passed in 66.07s. Live production remains expected-predeploy: 20 fixtures, 18 pass, 1 known content gap, 1 blocking p05 bleed guard. Latest live dashboard: build `2b86543`, docs 1748, chunks/vectors 33253, interactions 950, active memories 290, candidates 0, project_state total 128.
- **2026-04-23 Codex + Claude (R14 closed)** Codex reviewed `claude/r14-promote-400` at `3888db9`, no findings: "The route change is narrowly scoped: `promote_entity()` still returns False for not-found/not-candidate cases, so the existing 404 behavior remains intact, while caller-fixable validation failures now surface as 400." Ran `pytest tests/test_v1_0_write_invariants.py -q` from an isolated worktree: 15 passed in 1.91s. Claude squash-merged to main as `0989fed`, followed by ledger close-out `2b86543`, then deployed via canonical script. Dalidou `/health` reports build_sha=`2b86543e6ad26011b39a44509cc8df3809725171`, build_time `2026-04-23T15:20:53Z`, status=ok. R14 closed. Orientation refreshed earlier this session also reflected the V1-A gate status: **density gate CLEARED** (784 active memories vs 100 target — density batch-extract ran between 2026-04-22 and 2026-04-23 and more than crushed the gate), **soak gate at day 5 of ~7** (F4 first run 2026-04-19; nightly clean 2026-04-19 through 2026-04-23; only chronic failure is the known p04-constraints "Zerodur" content gap). V1-A branches from a clean V1-0 baseline as soon as the soak is called done.
- **2026-04-22 Codex + Antoine (V1-0 closed)** Codex approved `f16cd52` after re-running both original probes (legacy-candidate promote + supersede hook — both correct) and the three targeted regression suites (`test_v1_0_write_invariants.py`, `test_engineering_v1_phase5.py`, `test_inbox_crossproject.py` — all pass). Squash-merged to main as `2712c5d` ("feat(engineering): enforce V1-0 write invariants"). Deployed to Dalidou via the canonical deploy script; `/health` build_sha=`2712c5d2d03cb2a6af38b559664afd1c4cd0e050` status=ok. Validated backup snapshot at `/srv/storage/atocore/backups/snapshots/20260422T190624Z` taken BEFORE prod backfill. Prod backfill of `scripts/v1_0_backfill_provenance.py` against live DB: dry-run found 31 active/superseded entities with no provenance, list reviewed and looked sane; live run with default `hand_authored=1` flag path updated 31 rows; follow-up dry-run returned 0 rows remaining → no lingering F-8 violations in prod. Codex logged one residual P2 (R14): HTTP `POST /entities/{id}/promote` route doesn't translate the new service-layer `ValueError` into 400 — legacy bad candidate promoted through the API surfaces as 500. Not blocking. V1-0 closed. **Gates for V1-A**: soak window ends ~2026-04-26; 100-active-memory density target (currently 84 active + the ~31 newly flagged ones — need to check how those count in density math). V1-A holds until both gates clear.
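
The ownership rule described in the 2026-04-24 entries (exact `project_id` first, legacy path/tag matching as the fallback, and an empty `project_id` falling through instead of meaning "unowned") can be sketched roughly as follows. This is an illustrative sketch, not the AtoCore implementation: the function name `chunk_belongs_to` and the metadata keys `source_file` and `tags` are assumptions made for the example.

```python
from pathlib import PurePosixPath


def chunk_belongs_to(metadata: dict, project_id: str, aliases: set[str]) -> bool:
    """Return True if the chunk metadata indicates ownership by project_id.

    Hypothetical sketch: the key names and alias handling are assumptions.
    """
    explicit = (metadata.get("project_id") or "").strip()
    if explicit:
        # Explicit metadata wins; no path or tag guessing is attempted.
        return explicit == project_id

    # Empty project_id: fall through to legacy path/tag ownership.
    names = {project_id.lower()} | {alias.lower() for alias in aliases}
    path = PurePosixPath(metadata.get("source_file", ""))
    segments = {part.lower() for part in path.parts}
    tags = {str(tag).lower() for tag in metadata.get("tags", [])}
    # Whole-segment and exact-tag comparison avoids short-alias substring hits.
    return bool(names & (segments | tags))
```

Comparing whole path segments rather than substrings is what keeps a short alias such as `p05` from matching a directory named `p050-other`.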


@@ -6,7 +6,7 @@ Personal context engine that enriches LLM interactions with durable memory, stru
```bash
pip install -e .
uvicorn src.atocore.main:app --port 8100 uvicorn atocore.main:app --port 8100
```
## Usage
@@ -37,6 +37,10 @@ python scripts/atocore_client.py audit-query "gigabit" 5
| POST | /ingest | Ingest markdown file or folder |
| POST | /query | Retrieve relevant chunks |
| POST | /context/build | Build full context pack |
| POST | /interactions | Capture prompt/response interactions |
| GET/POST | /memory | List/create durable memories |
| GET/POST | /entities | Engineering entity graph surface |
| GET | /admin/dashboard | Operator dashboard |
| GET | /health | Health check |
| GET | /debug/context | Inspect last context pack |
@@ -66,8 +70,10 @@ unversioned forms.
FastAPI (port 8100)
|- Ingestion: markdown -> parse -> chunk -> embed -> store
|- Retrieval: query -> embed -> vector search -> rank
|- Context Builder: retrieve -> boost -> budget -> format |- Context Builder: project state -> memories -> entities -> retrieval -> budget
|- SQLite (documents, chunks, memories, projects, interactions) |- Reflection: capture -> reinforce -> extract -> triage -> promote/expire
|- Engineering: typed entities, relationships, conflicts, wiki/mirror
|- SQLite (documents, chunks, memories, projects, interactions, entities)
'- ChromaDB (vector embeddings)
``` ```
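
The new Context Builder line (project state -> memories -> entities -> retrieval -> budget) implies sections are assembled in priority order until a character budget is spent. A minimal sketch of that idea follows, assuming the default of 3000 characters from `ATOCORE_CONTEXT_BUDGET`. The function name `build_context_pack`, the section labels, and the skip-when-too-long policy are hypothetical; the real builder may truncate or format differently.

```python
def build_context_pack(sections: list[tuple[str, list[str]]], budget: int = 3000) -> str:
    """Concatenate section items in priority order within a character budget.

    Hypothetical sketch: sections are (title, items) pairs, highest priority first.
    """
    parts: list[str] = []
    remaining = budget
    for title, items in sections:
        for item in items:
            line = f"[{title}] {item}\n"
            if len(line) > remaining:
                # Assumption: an item that no longer fits is skipped, not truncated.
                continue
            parts.append(line)
            remaining -= len(line)
    return "".join(parts)
```

Because higher-priority sections are visited first, trusted project state consumes budget before retrieved chunks get a chance to.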
@@ -82,6 +88,16 @@ Set via environment variables (prefix `ATOCORE_`):
| ATOCORE_CHUNK_MAX_SIZE | 800 | Max chunk size (chars) |
| ATOCORE_CONTEXT_BUDGET | 3000 | Context pack budget (chars) |
| ATOCORE_EMBEDDING_MODEL | paraphrase-multilingual-MiniLM-L12-v2 | Embedding model |
| ATOCORE_RANK_PROJECT_MATCH_BOOST | 2.0 | Soft boost for chunks whose metadata matches the project hint |
| ATOCORE_RANK_PROJECT_SCOPE_FILTER | true | Filter project-hinted retrieval away from other registered project corpora |
| ATOCORE_RANK_PROJECT_SCOPE_CANDIDATE_MULTIPLIER | 4 | Widen candidate pull before project-scope filtering |
| ATOCORE_RANK_QUERY_TOKEN_STEP | 0.08 | Per-token boost when query terms appear in high-signal metadata |
| ATOCORE_RANK_QUERY_TOKEN_CAP | 1.32 | Maximum query-token boost multiplier |
| ATOCORE_RANK_PATH_HIGH_SIGNAL_BOOST | 1.18 | Boost current decision/status/requirements-like paths |
| ATOCORE_RANK_PATH_LOW_SIGNAL_PENALTY | 0.72 | Down-rank archive/history-like paths |
`ATOCORE_RANK_PROJECT_SCOPE_FILTER` gates the hard cross-project filter only.
`ATOCORE_RANK_PROJECT_MATCH_BOOST` remains the separate soft-ranking knob.
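
To make the difference between the two knobs concrete, here is a small sketch under stated assumptions: the hard scope filter removes chunks owned by a different registered project, while the soft boost only rescales scores. The function name `rank_with_project_hint` and the chunk dictionary shape are hypothetical; only the default values (boost 2.0, candidate multiplier 4) come from the table above.

```python
def rank_with_project_hint(
    candidates: list[dict],
    project_hint: str,
    registered_projects: set[str],
    top_k: int = 5,
    scope_filter: bool = True,
    match_boost: float = 2.0,
) -> list[dict]:
    """Apply the hard scope filter, then the soft project-match boost.

    Hypothetical sketch: candidates are dicts with "project_id" and "score".
    The caller is assumed to have pulled top_k * 4 candidates beforehand so
    that filtering does not leave the result underfilled.
    """
    ranked = []
    for chunk in candidates:
        owner = chunk.get("project_id", "")
        if scope_filter and owner in registered_projects and owner != project_hint:
            # Hard filter: drop chunks owned by another registered project.
            continue
        score = chunk["score"]
        if owner == project_hint:
            # Soft boost: matching chunks rank higher; nothing is removed.
            score *= match_boost
        ranked.append({**chunk, "score": score})
    ranked.sort(key=lambda c: c["score"], reverse=True)
    return ranked[:top_k]
```

With the filter disabled, a high-scoring chunk from another project can still appear in the results, which is the kind of cross-project bleed the hard filter exists to prevent.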
## Testing
@@ -93,7 +109,11 @@ pytest
## Operations
- `scripts/atocore_client.py` provides a live API client for project refresh, project-state inspection, and retrieval-quality audits.
- `scripts/retrieval_eval.py` runs the live retrieval/context harness, separates blocking failures from known content gaps, and stamps JSON output with target/build metadata.
- `scripts/live_status.py` renders a compact read-only status report from `/health`, `/stats`, `/projects`, and `/admin/dashboard`; set `ATOCORE_AUTH_TOKEN` or `--auth-token` when those endpoints are gated.
- `scripts/backfill_chunk_project_ids.py` dry-runs or applies explicit `project_id` metadata backfills for SQLite chunks and Chroma vectors; `--apply` requires a confirmed Chroma snapshot.
- `docs/operations.md` captures the current operational priority order: retrieval quality, Wave 2 trusted-operational ingestion, AtoDrive scoping, and restore validation.
- `DEV-LEDGER.md` is the fast-moving source of operational truth during active development; copy claims into docs only after checking the live service.
## Architecture Notes


@@ -1,6 +1,11 @@
# AtoCore Current State (2026-04-22) # AtoCore - Current State (2026-04-24)
Live deploy: `2712c5d` · Dalidou health: ok · Harness: 17/18 · Tests: 547 passing. Update 2026-04-24: audit-improvements deployed as `f44a211`; live harness is
19/20 with 0 blocking failures and 1 known content gap. Active follow-up branch
`codex/project-id-metadata-retrieval` is at 567 passing tests.
Live deploy: `2b86543` · Dalidou health: ok · Harness: 18/20 with 1 known
content gap and 1 current blocking project-bleed guard · Tests: 553 passing.
## V1-0 landed 2026-04-22
@@ -13,9 +18,8 @@ supersede) with Q-3 fail-open. Prod backfill ran cleanly — 31 legacy
active/superseded entities flagged `hand_authored=1`, follow-up dry-run
returned 0 remaining rows. Test count 533 → 547 (+14).
R14 (P2, non-blocking): `POST /entities/{id}/promote` route fix translates R14 is closed: `POST /entities/{id}/promote` now translates the new
the new `ValueError` into 400. Branch `claude/r14-promote-400` pending caller-fixable V1-0 `ValueError` into HTTP 400.
Codex review + squash-merge.
**Next in the V1 track:** V1-A (minimal query slice + Q-6 killer-correctness
integration). Gated on pipeline soak (~2026-04-26) + 100+ active memory
@@ -65,10 +69,10 @@ Last nightly run (2026-04-19 03:00 UTC): **31 promoted · 39 rejected · 0 needs
| 7G | Re-extraction on prompt version bump | pending |
| 7H | Chroma vector hygiene (delete vectors for superseded memories) | pending |
## Known gaps (honest) ## Known gaps (honest, refreshed 2026-04-24)
1. **Capture surface is Claude-Code-and-OpenClaw only.** Conversations in Claude Desktop, Claude.ai web, phone, or any other LLM UI are NOT captured. Example: the rotovap/mushroom chat yesterday never reached AtoCore because no hook fired. See Q4 below.
2. **OpenClaw is capture-only, not context-grounded.** The plugin POSTs `/interactions` on `llm_output` but does NOT call `/context/build` on `before_agent_start`. OpenClaw's underlying agent runs blind. See Q2 below. 2. **Project-scoped retrieval guard is deployed and passing.** The April 24 p05 broad-status bleed guard now passes on live Dalidou. The active follow-up branch adds explicit `project_id` chunk/vector metadata so the deployed path/tag heuristic can become a legacy fallback.
3. **Human interface (wiki) is thin and static.** 5 project cards + a "System" line. No dashboard for the autonomous activity. No per-memory detail page. See Q3/Q5. 3. **Human interface is useful but not yet the V1 Human Mirror.** Wiki/dashboard pages exist, but the spec routes, deterministic mirror files, disputed markers, and curated annotations remain V1-D work.
4. **Harness 17/18** — the `p04-constraints` fixture wants "Zerodur" but retrieval surfaces related-not-exact terms. Content gap, not a retrieval regression. 4. **Harness known issue:** `p04-constraints` wants "Zerodur" and "1.2"; live retrieval surfaces related constraints but not those exact strings. Treat as content/state gap until fixed.
5. **Two projects under-populated**: p05-interferometer (4 memories, 18 state) and atomizer-v2 (1 memory, 6 state). Batch re-extract with the new llm-0.6.0 prompt would help. 5. **Formal docs lag the ledger during fast work.** Use `DEV-LEDGER.md` and `python scripts/live_status.py` for live truth, then copy verified claims into these docs.
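
The harness summaries quoted in this diff (for example "20 fixtures, 19 pass, 0 blocking failures, 1 known issue") separate a known content gap from a blocking failure. One way to sketch that bookkeeping is shown below. The function `summarize_harness` and the result keys `name`, `passed`, and `known_issue` are hypothetical and do not describe the actual output schema of `scripts/retrieval_eval.py`.

```python
def summarize_harness(results: list[dict]) -> dict:
    """Split failed fixtures into known issues and blocking failures.

    Hypothetical sketch: each result has "name", "passed", and an optional
    "known_issue" flag marking an accepted content gap.
    """
    passed = sum(1 for r in results if r["passed"])
    failed = [r for r in results if not r["passed"]]
    known = [r["name"] for r in failed if r.get("known_issue")]
    blocking = [r["name"] for r in failed if not r.get("known_issue")]
    return {
        "fixtures": len(results),
        "passed": passed,
        "known_issues": known,
        "blocking_failures": blocking,
        # Only blocking failures should fail the run.
        "ok": not blocking,
    }
```

Under this scheme a fixture like `p04-constraints` stays visible in every report without turning the nightly run red.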


@@ -70,9 +70,14 @@ read-only additive mode.
- Phase 6 - AtoDrive
- Phase 10 - Write-back
- Phase 11 - Multi-model
- Phase 12 - Evaluation
- Phase 13 - Hardening
### Partial / Operational Baseline
- Phase 12 - Evaluation. The retrieval/context harness exists and runs
against live Dalidou, but coverage is still intentionally small and
should grow before this is complete in the intended sense.
### Engineering Layer Planning Sprint
**Status: complete.** All 8 architecture docs are drafted. The
@@ -126,11 +131,13 @@ This sits implicitly between Phase 8 (OpenClaw) and Phase 11
(multi-model). Memory-review and engineering-entity commands are
deferred from the shared client until their workflows are exercised.
## What Is Real Today (updated 2026-04-16) ## What Is Real Today (updated 2026-04-24)
- canonical AtoCore runtime on Dalidou (`775960c`, deploy.sh verified) - canonical AtoCore runtime on Dalidou (`2b86543`, deploy.sh verified)
- 33,253 vectors across 6 registered projects
- 234 captured interactions (192 claude-code, 38 openclaw, 4 test) - 951 captured interactions as of the 2026-04-24 live dashboard; refresh
exact live counts with
`python scripts/live_status.py`
- 6 registered projects:
  - `p04-gigabit` (483 docs, 15 state entries)
  - `p05-interferometer` (109 docs, 18 state entries)
@@ -138,12 +145,14 @@ deferred from the shared client until their workflows are exercised.
  - `atomizer-v2` (568 docs, 5 state entries)
  - `abb-space` (6 state entries)
  - `atocore` (drive source, 47 state entries)
- 110 Trusted Project State entries across all projects (decisions, requirements, facts, contacts, milestones) - 128 Trusted Project State entries across all projects (decisions, requirements, facts, contacts, milestones)
- 84 active memories (31 project, 23 knowledge, 10 episodic, 8 adaptation, 7 preference, 5 identity) - 290 active memories and 0 candidate memories as of the 2026-04-24 live
dashboard
- context pack assembly with 4 tiers: Trusted Project State > identity/preference > project memories > retrieved chunks
- query-relevance memory ranking with overlap-density scoring
- retrieval eval harness: 18 fixtures, 17/18 passing on live - retrieval eval harness: 20 fixtures; current live has 19 pass, 1 known
- 303 tests passing content gap, and 0 blocking failures after the audit-improvements deploy
- 567 tests passing on the active `codex/project-id-metadata-retrieval` branch
- nightly pipeline: backup → cleanup → rsync → OpenClaw import → vault refresh → extract → triage → **auto-promote/expire** → weekly synth/lint → **retrieval harness** → **pipeline summary to project state**
- Phase 10 operational: reinforcement-based auto-promotion (ref_count ≥ 3, confidence ≥ 0.7) + stale candidate expiry (14 days unreinforced)
- pipeline health visible in dashboard: interaction totals by client, pipeline last_run, harness results, triage stats
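
The Phase 10 rule in the list above (promote at ref_count ≥ 3 and confidence ≥ 0.7, expire after 14 days unreinforced) reduces to a small decision function. The sketch below is an assumption-laden illustration: the name `triage_candidate`, the argument shapes, the order of the checks, and treating exactly 14 days as expired are all hypothetical choices rather than the deployed logic.

```python
from datetime import datetime, timedelta, timezone

PROMOTE_MIN_REF_COUNT = 3
PROMOTE_MIN_CONFIDENCE = 0.7
EXPIRE_AFTER = timedelta(days=14)


def triage_candidate(
    ref_count: int,
    confidence: float,
    last_reinforced: datetime,
    now: datetime,
) -> str:
    """Return "promote", "expire", or "keep" for a candidate memory.

    Hypothetical sketch: promotion is checked before expiry.
    """
    if ref_count >= PROMOTE_MIN_REF_COUNT and confidence >= PROMOTE_MIN_CONFIDENCE:
        return "promote"
    if now - last_reinforced >= EXPIRE_AFTER:
        # Assumption: a candidate unreinforced for the full window is stale.
        return "expire"
    return "keep"


if __name__ == "__main__":
    current = datetime.now(timezone.utc)
    print(triage_candidate(3, 0.8, current - timedelta(days=1), current))
```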
@@ -190,9 +199,9 @@ where surfaces are disjoint, pauses when they collide.
| V1-E | Memory→entity graduation end-to-end + remaining Q-4 trust tests | pending V1-D (note: collides with memory extractor; pauses for multi-model triage work) |
| V1-F | F-5 detector generalization + route alias + O-1/O-2/O-3 operational + D-1/D-3/D-4 docs | finish line |
R14 (P2, non-blocking): `POST /entities/{id}/promote` route returns 500 R14 is closed: `POST /entities/{id}/promote` now translates
on the new V1-0 `ValueError` instead of 400. Fix on branch caller-fixable V1-0 provenance validation failures into HTTP 400 instead
`claude/r14-promote-400`, pending Codex review. of leaking as HTTP 500.
## Next


@@ -0,0 +1,178 @@
"""Backfill explicit project_id into chunk and vector metadata.
Dry-run by default. The script derives ownership from the registered project
ingest roots and updates both SQLite source_chunks.metadata and Chroma vector
metadata only when --apply is provided.
"""
from __future__ import annotations
import argparse
import json
import sqlite3
import sys
from pathlib import Path
sys.path.insert(0, str(Path(__file__).resolve().parents[1] / "src"))
from atocore.models.database import get_connection # noqa: E402
from atocore.projects.registry import derive_project_id_for_path # noqa: E402
from atocore.retrieval.vector_store import get_vector_store # noqa: E402
DEFAULT_BATCH_SIZE = 500
def _decode_metadata(raw: str | None) -> dict | None:
if not raw:
return {}
try:
parsed = json.loads(raw)
except json.JSONDecodeError:
return None
return parsed if isinstance(parsed, dict) else None
def _chunk_rows() -> tuple[list[dict], str]:
try:
with get_connection() as conn:
rows = conn.execute(
"""
SELECT
sc.id AS chunk_id,
sc.metadata AS chunk_metadata,
sd.file_path AS file_path
FROM source_chunks sc
JOIN source_documents sd ON sd.id = sc.document_id
ORDER BY sd.file_path, sc.chunk_index
"""
).fetchall()
except sqlite3.OperationalError as exc:
if "source_chunks" in str(exc) or "source_documents" in str(exc):
return [], f"missing ingestion tables: {exc}"
raise
return [dict(row) for row in rows], ""
def _batches(items: list, batch_size: int) -> list[list]:
return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
def backfill(
apply: bool = False,
project_filter: str = "",
batch_size: int = DEFAULT_BATCH_SIZE,
require_chroma_snapshot: bool = False,
) -> dict:
rows, db_warning = _chunk_rows()
updates: list[tuple[str, str, dict]] = []
by_project: dict[str, int] = {}
skipped_unowned = 0
already_tagged = 0
malformed_metadata = 0
for row in rows:
project_id = derive_project_id_for_path(row["file_path"])
if project_filter and project_id != project_filter:
continue
if not project_id:
skipped_unowned += 1
continue
metadata = _decode_metadata(row["chunk_metadata"])
if metadata is None:
malformed_metadata += 1
continue
if metadata.get("project_id") == project_id:
already_tagged += 1
continue
metadata["project_id"] = project_id
updates.append((row["chunk_id"], project_id, metadata))
by_project[project_id] = by_project.get(project_id, 0) + 1
missing_vectors: list[str] = []
applied_updates = 0
if apply and updates:
if not require_chroma_snapshot:
raise ValueError(
"--apply requires --chroma-snapshot-confirmed after taking a Chroma backup"
)
vector_store = get_vector_store()
for batch in _batches(updates, max(1, batch_size)):
chunk_ids = [chunk_id for chunk_id, _, _ in batch]
vector_payload = vector_store.get_metadatas(chunk_ids)
existing_vector_metadata = {
chunk_id: metadata
for chunk_id, metadata in zip(
vector_payload.get("ids", []),
vector_payload.get("metadatas", []),
strict=False,
)
if isinstance(metadata, dict)
}
vector_ids = []
vector_metadatas = []
sql_updates = []
for chunk_id, project_id, chunk_metadata in batch:
vector_metadata = existing_vector_metadata.get(chunk_id)
if vector_metadata is None:
missing_vectors.append(chunk_id)
continue
vector_metadata = dict(vector_metadata)
vector_metadata["project_id"] = project_id
vector_ids.append(chunk_id)
vector_metadatas.append(vector_metadata)
sql_updates.append((json.dumps(chunk_metadata, ensure_ascii=True), chunk_id))
if not vector_ids:
continue
vector_store.update_metadatas(vector_ids, vector_metadatas)
with get_connection() as conn:
cursor = conn.executemany(
"UPDATE source_chunks SET metadata = ? WHERE id = ?",
sql_updates,
)
if cursor.rowcount != len(sql_updates):
raise RuntimeError(
f"SQLite rowcount mismatch: {cursor.rowcount} != {len(sql_updates)}"
)
applied_updates += len(sql_updates)
return {
"apply": apply,
"total_chunks": len(rows),
"updates": len(updates),
"applied_updates": applied_updates,
"already_tagged": already_tagged,
"skipped_unowned": skipped_unowned,
"malformed_metadata": malformed_metadata,
"missing_vectors": len(missing_vectors),
"db_warning": db_warning,
"by_project": dict(sorted(by_project.items())),
}
def main() -> int:
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument("--apply", action="store_true", help="write SQLite and Chroma metadata updates")
parser.add_argument("--project", default="", help="optional canonical project_id filter")
parser.add_argument("--batch-size", type=int, default=DEFAULT_BATCH_SIZE)
parser.add_argument(
"--chroma-snapshot-confirmed",
action="store_true",
help="required with --apply; confirms a Chroma snapshot exists",
)
args = parser.parse_args()
payload = backfill(
apply=args.apply,
project_filter=args.project.strip(),
batch_size=args.batch_size,
require_chroma_snapshot=args.chroma_snapshot_confirmed,
)
print(json.dumps(payload, indent=2, ensure_ascii=True))
return 0
if __name__ == "__main__":
raise SystemExit(main())
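
The apply path above walks `updates` in slices through `_batches`, whose definition falls outside this part of the diff. A minimal sketch of such a helper, assuming it simply yields fixed-size slices (the name `batches` and its signature are illustrative, not the script's actual code):

```python
from collections.abc import Iterator, Sequence
from typing import TypeVar

T = TypeVar("T")


def batches(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` items (hypothetical helper)."""
    size = max(1, size)  # mirror the max(1, batch_size) guard used by the caller
    for start in range(0, len(items), size):
        yield items[start:start + size]


if __name__ == "__main__":
    for batch in batches(["c1", "c2", "c3", "c4", "c5"], 2):
        print(batch)
```

Keeping batches small bounds how many chunks can be left with Chroma updated but SQLite not yet written if a run is interrupted between the two writes.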

scripts/live_status.py Normal file

@@ -0,0 +1,131 @@
"""Render a compact live-status report from a running AtoCore instance.
This is intentionally read-only and stdlib-only so it can be used from a
fresh checkout, a cron job, or a Codex/Claude session without installing the
full app package. The output is meant to reduce docs drift: copy the report
into status docs only after it was generated from the live service.
"""
from __future__ import annotations
import argparse
import errno
import json
import os
import sys
import urllib.error
import urllib.request
from typing import Any
DEFAULT_BASE_URL = os.environ.get("ATOCORE_BASE_URL", "http://dalidou:8100").rstrip("/")
DEFAULT_TIMEOUT = int(os.environ.get("ATOCORE_TIMEOUT_SECONDS", "30"))
DEFAULT_AUTH_TOKEN = os.environ.get("ATOCORE_AUTH_TOKEN", "").strip()
def request_json(base_url: str, path: str, timeout: int, auth_token: str = "") -> dict[str, Any]:
headers = {"Authorization": f"Bearer {auth_token}"} if auth_token else {}
req = urllib.request.Request(f"{base_url}{path}", method="GET", headers=headers)
with urllib.request.urlopen(req, timeout=timeout) as response:
body = response.read().decode("utf-8")
status = getattr(response, "status", None)
payload = json.loads(body) if body.strip() else {}
if not isinstance(payload, dict):
payload = {"value": payload}
if status is not None:
payload["_http_status"] = status
return payload
def collect_status(base_url: str, timeout: int, auth_token: str = "") -> dict[str, Any]:
payload: dict[str, Any] = {"base_url": base_url}
for name, path in {
"health": "/health",
"stats": "/stats",
"projects": "/projects",
"dashboard": "/admin/dashboard",
}.items():
try:
payload[name] = request_json(base_url, path, timeout, auth_token)
except (urllib.error.URLError, TimeoutError, OSError, json.JSONDecodeError) as exc:
payload[name] = {"error": str(exc)}
return payload
def render_markdown(status: dict[str, Any]) -> str:
health = status.get("health", {})
stats = status.get("stats", {})
projects = status.get("projects", {}).get("projects", [])
dashboard = status.get("dashboard", {})
memories = dashboard.get("memories", {}) if isinstance(dashboard.get("memories"), dict) else {}
project_state = dashboard.get("project_state", {}) if isinstance(dashboard.get("project_state"), dict) else {}
interactions = dashboard.get("interactions", {}) if isinstance(dashboard.get("interactions"), dict) else {}
pipeline = dashboard.get("pipeline", {}) if isinstance(dashboard.get("pipeline"), dict) else {}
lines = [
"# AtoCore Live Status",
"",
f"- base_url: `{status.get('base_url', '')}`",
"- endpoint_http_statuses: "
f"`health={health.get('_http_status', 'error')}, "
f"stats={stats.get('_http_status', 'error')}, "
f"projects={status.get('projects', {}).get('_http_status', 'error')}, "
f"dashboard={dashboard.get('_http_status', 'error')}`",
f"- service_status: `{health.get('status', 'unknown')}`",
f"- code_version: `{health.get('code_version', health.get('version', 'unknown'))}`",
f"- build_sha: `{health.get('build_sha', 'unknown')}`",
f"- build_branch: `{health.get('build_branch', 'unknown')}`",
f"- build_time: `{health.get('build_time', 'unknown')}`",
f"- env: `{health.get('env', 'unknown')}`",
f"- documents: `{stats.get('total_documents', 'unknown')}`",
f"- chunks: `{stats.get('total_chunks', 'unknown')}`",
f"- vectors: `{stats.get('total_vectors', health.get('vectors_count', 'unknown'))}`",
f"- registered_projects: `{len(projects)}`",
f"- active_memories: `{memories.get('active', 'unknown')}`",
f"- candidate_memories: `{memories.get('candidates', 'unknown')}`",
f"- interactions: `{interactions.get('total', 'unknown')}`",
f"- project_state_entries: `{project_state.get('total', 'unknown')}`",
f"- pipeline_last_run: `{pipeline.get('last_run', 'unknown')}`",
]
if projects:
lines.extend(["", "## Projects"])
for project in projects:
aliases = ", ".join(project.get("aliases", []))
suffix = f" ({aliases})" if aliases else ""
lines.append(f"- `{project.get('id', '')}`{suffix}")
return "\n".join(lines) + "\n"
def main() -> int:
parser = argparse.ArgumentParser(description="Render live AtoCore status")
parser.add_argument("--base-url", default=DEFAULT_BASE_URL)
parser.add_argument("--timeout", type=int, default=DEFAULT_TIMEOUT)
parser.add_argument(
"--auth-token",
default=DEFAULT_AUTH_TOKEN,
help="Bearer token; defaults to ATOCORE_AUTH_TOKEN when set",
)
parser.add_argument("--json", action="store_true", help="emit raw JSON")
args = parser.parse_args()
status = collect_status(args.base_url.rstrip("/"), args.timeout, args.auth_token)
if args.json:
output = json.dumps(status, indent=2, ensure_ascii=True) + "\n"
else:
output = render_markdown(status)
try:
sys.stdout.write(output)
except BrokenPipeError:
return 0
except OSError as exc:
if exc.errno in {errno.EINVAL, errno.EPIPE}:
return 0
raise
return 0
if __name__ == "__main__":
raise SystemExit(main())
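
`render_markdown` repeats the same "use this key only if it holds a dict" guard for each dashboard section. One way to express that guard once, shown here as a sketch (`dict_section` is a hypothetical helper, not part of the script):

```python
from typing import Any


def dict_section(payload: dict[str, Any], key: str) -> dict[str, Any]:
    """Return payload[key] when it is a dict, otherwise an empty dict."""
    value = payload.get(key)
    return value if isinstance(value, dict) else {}


if __name__ == "__main__":
    dashboard = {"memories": {"active": 290}, "pipeline": "unavailable"}
    print(dict_section(dashboard, "memories").get("active", "unknown"))
    print(dict_section(dashboard, "pipeline").get("last_run", "unknown"))
```

The fallback matters because a failed endpoint is stored as `{"error": ...}`, so every section lookup has to tolerate missing or non-dict values.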


@@ -44,6 +44,7 @@ import urllib.error
 import urllib.parse
 import urllib.request
 from dataclasses import dataclass, field
+from datetime import datetime, timezone
 from pathlib import Path

 DEFAULT_BASE_URL = os.environ.get("ATOCORE_BASE_URL", "http://dalidou:8100")
@@ -52,6 +53,13 @@ DEFAULT_BUDGET = 3000
 DEFAULT_FIXTURES = Path(__file__).parent / "retrieval_eval_fixtures.json"


+def request_json(base_url: str, path: str, timeout: int) -> dict:
+    req = urllib.request.Request(f"{base_url}{path}", method="GET")
+    with urllib.request.urlopen(req, timeout=timeout) as resp:
+        body = resp.read().decode("utf-8")
+    return json.loads(body) if body.strip() else {}
+
+
 @dataclass
 class Fixture:
     name: str
@@ -60,6 +68,7 @@ class Fixture:
     budget: int = DEFAULT_BUDGET
     expect_present: list[str] = field(default_factory=list)
     expect_absent: list[str] = field(default_factory=list)
+    known_issue: bool = False
     notes: str = ""


@@ -70,8 +79,13 @@ class FixtureResult:
     missing_present: list[str]
     unexpected_absent: list[str]
     total_chars: int
+    known_issue: bool = False
     error: str = ""

+    @property
+    def blocking_failure(self) -> bool:
+        return not self.ok and not self.known_issue
+

 def load_fixtures(path: Path) -> list[Fixture]:
     data = json.loads(path.read_text(encoding="utf-8"))
@@ -89,6 +103,7 @@ def load_fixtures(path: Path) -> list[Fixture]:
                 budget=int(raw.get("budget", DEFAULT_BUDGET)),
                 expect_present=list(raw.get("expect_present", [])),
                 expect_absent=list(raw.get("expect_absent", [])),
+                known_issue=bool(raw.get("known_issue", False)),
                 notes=raw.get("notes", ""),
             )
         )
@@ -117,6 +132,7 @@ def run_fixture(fixture: Fixture, base_url: str, timeout: int) -> FixtureResult:
             missing_present=list(fixture.expect_present),
             unexpected_absent=[],
             total_chars=0,
+            known_issue=fixture.known_issue,
             error=f"http_error: {exc}",
         )
@@ -129,16 +145,26 @@ def run_fixture(fixture: Fixture, base_url: str, timeout: int) -> FixtureResult:
         missing_present=missing,
         unexpected_absent=unexpected,
         total_chars=len(formatted),
+        known_issue=fixture.known_issue,
     )


-def print_human_report(results: list[FixtureResult]) -> None:
+def print_human_report(results: list[FixtureResult], metadata: dict) -> None:
     total = len(results)
     passed = sum(1 for r in results if r.ok)
+    known = sum(1 for r in results if not r.ok and r.known_issue)
+    blocking = sum(1 for r in results if r.blocking_failure)
     print(f"Retrieval eval: {passed}/{total} fixtures passed")
+    print(
+        "Target: "
+        f"{metadata.get('base_url', 'unknown')} "
+        f"build={metadata.get('health', {}).get('build_sha', 'unknown')}"
+    )
+    if known or blocking:
+        print(f"Blocking failures: {blocking} Known issues: {known}")
     print()
     for r in results:
-        marker = "PASS" if r.ok else "FAIL"
+        marker = "PASS" if r.ok else ("KNOWN" if r.known_issue else "FAIL")
         print(f"[{marker}] {r.fixture.name} project={r.fixture.project} chars={r.total_chars}")
         if r.error:
             print(f" error: {r.error}")
@@ -150,15 +176,21 @@ def print_human_report(results: list[FixtureResult]) -> None:
             print(f" notes: {r.fixture.notes}")


-def print_json_report(results: list[FixtureResult]) -> None:
+def print_json_report(results: list[FixtureResult], metadata: dict) -> None:
     payload = {
+        "generated_at": metadata.get("generated_at"),
+        "base_url": metadata.get("base_url"),
+        "health": metadata.get("health", {}),
         "total": len(results),
         "passed": sum(1 for r in results if r.ok),
+        "known_issues": sum(1 for r in results if not r.ok and r.known_issue),
+        "blocking_failures": sum(1 for r in results if r.blocking_failure),
         "fixtures": [
             {
                 "name": r.fixture.name,
                 "project": r.fixture.project,
                 "ok": r.ok,
+                "known_issue": r.known_issue,
                 "total_chars": r.total_chars,
                 "missing_present": r.missing_present,
                 "unexpected_absent": r.unexpected_absent,
@@ -179,15 +211,26 @@ def main() -> int:
     parser.add_argument("--json", action="store_true", help="emit machine-readable JSON")
     args = parser.parse_args()

+    base_url = args.base_url.rstrip("/")
+    try:
+        health = request_json(base_url, "/health", args.timeout)
+    except (urllib.error.URLError, TimeoutError, OSError, json.JSONDecodeError) as exc:
+        health = {"error": str(exc)}
+    metadata = {
+        "generated_at": datetime.now(timezone.utc).isoformat(),
+        "base_url": base_url,
+        "health": health,
+    }
+
     fixtures = load_fixtures(args.fixtures)
-    results = [run_fixture(f, args.base_url, args.timeout) for f in fixtures]
+    results = [run_fixture(f, base_url, args.timeout) for f in fixtures]

     if args.json:
-        print_json_report(results)
+        print_json_report(results, metadata)
     else:
-        print_human_report(results)
+        print_human_report(results, metadata)

-    return 0 if all(r.ok for r in results) else 1
+    return 0 if not any(r.blocking_failure for r in results) else 1


 if __name__ == "__main__":
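
The harness change above separates "failed" from "failed and blocking": a fixture flagged `known_issue` still shows up as failing but no longer turns the exit code red. A self-contained sketch of that rule, using a simplified stand-in for `FixtureResult` (the `Result` class and `exit_code` function are illustrative names only):

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Simplified stand-in for FixtureResult (illustrative only)."""
    name: str
    ok: bool
    known_issue: bool = False

    @property
    def blocking_failure(self) -> bool:
        # A failure blocks only when it is not an acknowledged known issue.
        return not self.ok and not self.known_issue


def exit_code(results: list[Result]) -> int:
    """Return 0 unless at least one result is a blocking failure."""
    return 0 if not any(r.blocking_failure for r in results) else 1


if __name__ == "__main__":
    nightly = [
        Result("p05-cgh", ok=True),
        Result("p04-constraints", ok=False, known_issue=True),
    ]
    print(exit_code(nightly))
```

Note that a passing fixture marked `known_issue` is simply counted as passed, so the flag can stay in place until the content gap is confirmed fixed.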


@@ -27,7 +27,8 @@
     "expect_absent": [
       "polisher suite"
     ],
-    "notes": "Key constraints are in Trusted Project State and in the mission-framing memory"
+    "known_issue": true,
+    "notes": "Known content gap as of 2026-04-24: live retrieval surfaces related constraints but not the exact Zerodur / 1.2 strings. Keep visible, but do not make nightly harness red until the source/state gap is fixed."
   },
   {
     "name": "p04-short-ambiguous",
@@ -80,6 +81,36 @@
     ],
     "notes": "CGH is a core p05 concept. Should surface via chunks and possibly the architecture memory. Must not bleed p06 polisher-suite terms."
   },
+  {
+    "name": "p05-broad-status-no-atomizer",
+    "project": "p05-interferometer",
+    "prompt": "current status",
+    "expect_present": [
+      "--- Trusted Project State ---",
+      "--- Project Memories ---",
+      "Zygo"
+    ],
+    "expect_absent": [
+      "atomizer-v2",
+      "ATOMIZER_PODCAST_BRIEFING",
+      "[Source: atomizer-v2/",
+      "P04-GigaBIT-M1-KB-design"
+    ],
+    "notes": "Regression guard for the April 24 audit finding: broad p05 status queries must not pull Atomizer/archive context into project-scoped packs."
+  },
+  {
+    "name": "p05-vendor-decision-no-archive-first",
+    "project": "p05-interferometer",
+    "prompt": "vendor selection decision",
+    "expect_present": [
+      "Selection-Decision"
+    ],
+    "expect_absent": [
+      "[Source: atomizer-v2/",
+      "ATOMIZER_PODCAST_BRIEFING"
+    ],
+    "notes": "Project-scoped decision query should stay inside p05 and prefer current decision/vendor material over unrelated project archives."
+  },
   {
     "name": "p06-suite-split",
     "project": "p06-polisher",


@@ -46,6 +46,8 @@ class Settings(BaseSettings):
     # All multipliers default to the values used since Wave 1; tighten or
     # loosen them via ATOCORE_* env vars without touching code.
     rank_project_match_boost: float = 2.0
+    rank_project_scope_filter: bool = True
+    rank_project_scope_candidate_multiplier: int = 4
     rank_query_token_step: float = 0.08
     rank_query_token_cap: float = 1.32
     rank_path_high_signal_boost: float = 1.18


@@ -32,10 +32,23 @@ def exclusive_ingestion():
         _INGESTION_LOCK.release()


-def ingest_file(file_path: Path) -> dict:
+def ingest_file(file_path: Path, project_id: str = "") -> dict:
     """Ingest a single markdown file. Returns stats."""
     start = time.time()
     file_path = file_path.resolve()
+    project_id = (project_id or "").strip()
+    if not project_id:
+        try:
+            from atocore.projects.registry import derive_project_id_for_path
+
+            project_id = derive_project_id_for_path(file_path)
+        except Exception as exc:
+            log.warning(
+                "project_id_derivation_failed",
+                file_path=str(file_path),
+                error=str(exc),
+            )
+            project_id = ""

     if not file_path.exists():
         raise FileNotFoundError(f"File not found: {file_path}")
@@ -65,6 +78,7 @@ def ingest_file(file_path: Path) -> dict:
         "source_file": str(file_path),
         "tags": parsed.tags,
         "title": parsed.title,
+        "project_id": project_id,
     }
     chunks = chunk_markdown(parsed.body, base_metadata=base_meta)
@@ -116,6 +130,7 @@ def ingest_file(file_path: Path) -> dict:
                 "source_file": str(file_path),
                 "tags": json.dumps(parsed.tags),
                 "title": parsed.title,
+                "project_id": project_id,
             })

             conn.execute(
@@ -173,7 +188,17 @@ def ingest_folder(folder_path: Path, purge_deleted: bool = True) -> list[dict]:
         purge_deleted: If True, remove DB/vector entries for files
             that no longer exist on disk.
     """
+    return ingest_project_folder(folder_path, purge_deleted=purge_deleted, project_id="")
+
+
+def ingest_project_folder(
+    folder_path: Path,
+    purge_deleted: bool = True,
+    project_id: str = "",
+) -> list[dict]:
+    """Ingest a folder and annotate chunks with an optional project id."""
     folder_path = folder_path.resolve()
+    project_id = (project_id or "").strip()
     if not folder_path.is_dir():
         raise NotADirectoryError(f"Not a directory: {folder_path}")
@@ -187,7 +212,7 @@ def ingest_folder(folder_path: Path, purge_deleted: bool = True) -> list[dict]:
     # Ingest new/changed files
     for md_file in md_files:
         try:
-            result = ingest_file(md_file)
+            result = ingest_file(md_file, project_id=project_id)
             results.append(result)
         except Exception as e:
             log.error("ingestion_error", file_path=str(md_file), error=str(e))


@@ -8,7 +8,6 @@ from dataclasses import asdict, dataclass
 from pathlib import Path

 import atocore.config as _config
-from atocore.ingestion.pipeline import ingest_folder


 # Reserved pseudo-projects. `inbox` holds pre-project / lead / quote
@@ -260,6 +259,7 @@ def load_project_registry() -> list[RegisteredProject]:
         )

     _validate_unique_project_names(projects)
+    _validate_ingest_root_overlaps(projects)
     return projects
@@ -307,6 +307,28 @@ def resolve_project_name(name: str | None) -> str:
     return name


+def derive_project_id_for_path(file_path: str | Path) -> str:
+    """Return the registered project that owns a source path, if any."""
+    if not file_path:
+        return ""
+    doc_path = Path(file_path).resolve(strict=False)
+    matches: list[tuple[int, int, str]] = []
+    for project in load_project_registry():
+        for source_ref in project.ingest_roots:
+            root_path = _resolve_ingest_root(source_ref)
+            try:
+                doc_path.relative_to(root_path)
+            except ValueError:
+                continue
+            matches.append((len(root_path.parts), len(str(root_path)), project.project_id))
+
+    if not matches:
+        return ""
+    matches.sort(reverse=True)
+    return matches[0][2]
+
+
 def refresh_registered_project(project_name: str, purge_deleted: bool = False) -> dict:
     """Ingest all configured source roots for a registered project.

@@ -322,6 +344,8 @@ def refresh_registered_project(project_name: str, purge_deleted: bool = False) -
     if project is None:
         raise ValueError(f"Unknown project: {project_name}")

+    from atocore.ingestion.pipeline import ingest_project_folder
+
     roots = []
     ingested_count = 0
     skipped_count = 0
@@ -346,7 +370,11 @@ def refresh_registered_project(project_name: str, purge_deleted: bool = False) -
                 {
                     **root_result,
                     "status": "ingested",
-                    "results": ingest_folder(resolved, purge_deleted=purge_deleted),
+                    "results": ingest_project_folder(
+                        resolved,
+                        purge_deleted=purge_deleted,
+                        project_id=project.project_id,
+                    ),
                 }
             )
             ingested_count += 1
@@ -443,6 +471,33 @@ def _validate_unique_project_names(projects: list[RegisteredProject]) -> None:
         seen[key] = project.project_id


+def _validate_ingest_root_overlaps(projects: list[RegisteredProject]) -> None:
+    roots: list[tuple[str, Path]] = []
+    for project in projects:
+        for source_ref in project.ingest_roots:
+            roots.append((project.project_id, _resolve_ingest_root(source_ref)))
+
+    for i, (left_project, left_root) in enumerate(roots):
+        for right_project, right_root in roots[i + 1:]:
+            if left_project == right_project:
+                continue
+            try:
+                left_root.relative_to(right_root)
+                overlaps = True
+            except ValueError:
+                try:
+                    right_root.relative_to(left_root)
+                    overlaps = True
+                except ValueError:
+                    overlaps = False
+            if overlaps:
+                raise ValueError(
+                    "Project registry ingest root overlap: "
+                    f"'{left_root}' ({left_project}) and "
+                    f"'{right_root}' ({right_project})"
+                )
+
+
 def _find_name_collisions(
     project_id: str,
     aliases: list[str],
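
The overlap validation rejects a registry in which one project's ingest root contains, or equals, another project's root. The same containment test can be written with `PurePath.is_relative_to` (Python 3.9+) instead of nested `try/except ValueError`. This is an alternative sketch, not the code in the diff, and `roots_overlap` is a hypothetical name:

```python
from pathlib import PurePosixPath


def roots_overlap(left: PurePosixPath, right: PurePosixPath) -> bool:
    """True when either root equals or is nested inside the other."""
    return left.is_relative_to(right) or right.is_relative_to(left)


if __name__ == "__main__":
    parent = PurePosixPath("/vault/incoming/projects")
    p04 = PurePosixPath("/vault/incoming/projects/p04-gigabit")
    p05 = PurePosixPath("/vault/incoming/projects/p05-interferometer")
    print(roots_overlap(parent, p04))  # nested roots
    print(roots_overlap(p04, p05))     # sibling roots
```

The comparison works on whole path components, so sibling directories that merely share a name prefix (for example `p04` and `p04-gigabit`) are not treated as overlapping.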


@@ -1,5 +1,6 @@
 """Retrieval: query to ranked chunks."""

+import json
 import re
 import time
 from dataclasses import dataclass
@@ -7,7 +8,7 @@ from dataclasses import dataclass

 import atocore.config as _config
 from atocore.models.database import get_connection
 from atocore.observability.logger import get_logger
-from atocore.projects.registry import get_registered_project
+from atocore.projects.registry import RegisteredProject, get_registered_project, load_project_registry
 from atocore.retrieval.embeddings import embed_query
 from atocore.retrieval.vector_store import get_vector_store
@@ -83,6 +84,27 @@ def retrieve(
     """Retrieve the most relevant chunks for a query."""
     top_k = top_k or _config.settings.context_top_k
     start = time.time()
+    try:
+        scoped_project = get_registered_project(project_hint) if project_hint else None
+    except Exception as exc:
+        log.warning(
+            "project_scope_resolution_failed",
+            project_hint=project_hint,
+            error=str(exc),
+        )
+        scoped_project = None
+    scope_filter_enabled = bool(scoped_project and _config.settings.rank_project_scope_filter)
+    registered_projects = None
+    query_top_k = top_k
+    if scope_filter_enabled:
+        query_top_k = max(
+            top_k,
+            top_k * max(1, _config.settings.rank_project_scope_candidate_multiplier),
+        )
+        try:
+            registered_projects = load_project_registry()
+        except Exception:
+            registered_projects = None

     query_embedding = embed_query(query)
     store = get_vector_store()
@@ -101,11 +123,12 @@ def retrieve(

     results = store.query(
         query_embedding=query_embedding,
-        top_k=top_k,
+        top_k=query_top_k,
         where=where,
     )

     chunks = []
+    raw_result_count = len(results["ids"][0]) if results and results["ids"] and results["ids"][0] else 0
     if results and results["ids"] and results["ids"][0]:
         existing_ids = _existing_chunk_ids(results["ids"][0])
         for i, chunk_id in enumerate(results["ids"][0]):
@@ -117,6 +140,13 @@ def retrieve(
             meta = results["metadatas"][0][i] if results["metadatas"] else {}
             content = results["documents"][0][i] if results["documents"] else ""

+            if scope_filter_enabled and not _is_allowed_for_project_scope(
+                scoped_project,
+                meta,
+                registered_projects,
+            ):
+                continue
+
             score *= _query_match_boost(query, meta)
             score *= _path_signal_boost(meta)
             if project_hint:
@@ -137,42 +167,151 @@ def retrieve(

     duration_ms = int((time.time() - start) * 1000)
     chunks.sort(key=lambda chunk: chunk.score, reverse=True)
+    post_filter_count = len(chunks)
+    chunks = chunks[:top_k]

     log.info(
         "retrieval_done",
         query=query[:100],
         top_k=top_k,
+        query_top_k=query_top_k,
+        raw_results_count=raw_result_count,
+        post_filter_count=post_filter_count,
         results_count=len(chunks),
+        post_filter_dropped=max(0, raw_result_count - post_filter_count),
+        underfilled=bool(raw_result_count >= query_top_k and len(chunks) < top_k),
         duration_ms=duration_ms,
     )

     return chunks
+
+
+def _is_allowed_for_project_scope(
+    project: RegisteredProject,
+    metadata: dict,
+    registered_projects: list[RegisteredProject] | None = None,
+) -> bool:
+    """Return True when a chunk is target-project or not project-owned.
+
+    Project-hinted retrieval should not let one registered project's corpus
+    compete with another's. At the same time, unowned/global sources should
+    remain eligible because shared docs and cross-project references can be
+    genuinely useful. The registry gives us the boundary: if metadata matches
+    a registered project and it is not the requested project, filter it out.
+    """
+    if _metadata_matches_project(project, metadata):
+        return True
+
+    if registered_projects is None:
+        try:
+            registered_projects = load_project_registry()
+        except Exception:
+            return True
+
+    for other in registered_projects:
+        if other.project_id == project.project_id:
+            continue
+        if _metadata_matches_project(other, metadata):
+            return False
+    return True
+
+
+def _metadata_matches_project(project: RegisteredProject, metadata: dict) -> bool:
+    stored_project_id = str(metadata.get("project_id", "")).strip().lower()
+    if stored_project_id:
+        return stored_project_id == project.project_id.lower()
+
+    path = _metadata_source_path(metadata)
+    tags = _metadata_tags(metadata)
+    for term in _project_scope_terms(project):
+        if _path_matches_term(path, term) or term in tags:
+            return True
+    return False
+
+
+def _project_scope_terms(project: RegisteredProject) -> set[str]:
+    terms = {project.project_id.lower()}
+    terms.update(alias.lower() for alias in project.aliases)
+    for source_ref in project.ingest_roots:
+        normalized = source_ref.subpath.replace("\\", "/").strip("/").lower()
+        if normalized:
+            terms.add(normalized)
+            terms.add(normalized.split("/")[-1])
+    return {term for term in terms if term}
+
+
+def _metadata_searchable(metadata: dict) -> str:
+    return " ".join(
+        [
+            str(metadata.get("source_file", "")).replace("\\", "/").lower(),
+            str(metadata.get("title", "")).lower(),
+            str(metadata.get("heading_path", "")).lower(),
+            str(metadata.get("tags", "")).lower(),
+        ]
+    )
+
+
+def _metadata_source_path(metadata: dict) -> str:
+    return str(metadata.get("source_file", "")).replace("\\", "/").strip("/").lower()
+
+
+def _metadata_tags(metadata: dict) -> set[str]:
+    raw_tags = metadata.get("tags", [])
+    if isinstance(raw_tags, (list, tuple, set)):
+        return {str(tag).strip().lower() for tag in raw_tags if str(tag).strip()}
+    if isinstance(raw_tags, str):
+        try:
+            parsed = json.loads(raw_tags)
+        except json.JSONDecodeError:
+            parsed = [raw_tags]
+        if isinstance(parsed, (list, tuple, set)):
+            return {str(tag).strip().lower() for tag in parsed if str(tag).strip()}
+        if isinstance(parsed, str) and parsed.strip():
+            return {parsed.strip().lower()}
+    return set()
+
+
+def _path_matches_term(path: str, term: str) -> bool:
+    normalized = term.replace("\\", "/").strip("/").lower()
+    if not path or not normalized:
+        return False
+    if "/" in normalized:
+        return path == normalized or path.startswith(f"{normalized}/")
+    return normalized in set(path.split("/"))
+
+
+def _metadata_has_term(metadata: dict, term: str) -> bool:
+    normalized = term.replace("\\", "/").strip("/").lower()
+    if not normalized:
+        return False
+    if _path_matches_term(_metadata_source_path(metadata), normalized):
+        return True
+    if normalized in _metadata_tags(metadata):
+        return True
+    return re.search(
+        rf"(?<![a-z0-9]){re.escape(normalized)}(?![a-z0-9])",
+        _metadata_searchable(metadata),
+    ) is not None


 def _project_match_boost(project_hint: str, metadata: dict) -> float:
     """Return a project-aware relevance multiplier for raw retrieval."""
     hint_lower = project_hint.strip().lower()
     if not hint_lower:
         return 1.0

-    source_file = str(metadata.get("source_file", "")).lower()
-    title = str(metadata.get("title", "")).lower()
-    tags = str(metadata.get("tags", "")).lower()
-    searchable = " ".join([source_file, title, tags])
-
-    project = get_registered_project(project_hint)
-    candidate_names = {hint_lower}
-    if project is not None:
-        candidate_names.add(project.project_id.lower())
-        candidate_names.update(alias.lower() for alias in project.aliases)
-        candidate_names.update(
-            source_ref.subpath.replace("\\", "/").strip("/").split("/")[-1].lower()
-            for source_ref in project.ingest_roots
-            if source_ref.subpath.strip("/\\")
-        )
+    try:
+        project = get_registered_project(project_hint)
+    except Exception as exc:
+        log.warning(
+            "project_match_boost_resolution_failed",
+            project_hint=project_hint,
+            error=str(exc),
+        )
+        project = None

+    candidate_names = _project_scope_terms(project) if project is not None else {hint_lower}
     for candidate in candidate_names:
-        if candidate and candidate in searchable:
+        if _metadata_has_term(metadata, candidate):
             return _config.settings.rank_project_match_boost

     return 1.0
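
The old boost matched any substring of the joined metadata, so a short alias could fire inside an unrelated directory name. The new helpers compare whole path segments instead. A standalone sketch of that segment rule, restating the `_path_matches_term` logic from the diff under the illustrative name `path_matches_term` (the example paths are made up):

```python
def path_matches_term(path: str, term: str) -> bool:
    """Match a term against whole path segments rather than substrings."""
    normalized = term.replace("\\", "/").strip("/").lower()
    if not path or not normalized:
        return False
    if "/" in normalized:
        # Multi-segment terms must match from the start of the path.
        return path == normalized or path.startswith(f"{normalized}/")
    # Single-segment terms must equal one full path component.
    return normalized in set(path.split("/"))


if __name__ == "__main__":
    owned = "vault/incoming/projects/p04-gigabit/status.md"
    lookalike = "vault/archive/p04-gigabit-old/status.md"
    print(path_matches_term(owned, "p04-gigabit"))
    print(path_matches_term(lookalike, "p04-gigabit"))
```

One consequence worth noting: a multi-segment term only matches when the stored path begins with it, so for absolute `source_file` values the match would rely on the single-segment form of the term (the last component of the ingest root) instead.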


@@ -64,6 +64,18 @@ class VectorStore:
         self._collection.delete(ids=ids)
         log.debug("vectors_deleted", count=len(ids))

+    def get_metadatas(self, ids: list[str]) -> dict:
+        """Fetch vector metadata by chunk IDs."""
+        if not ids:
+            return {"ids": [], "metadatas": []}
+        return self._collection.get(ids=ids, include=["metadatas"])
+
+    def update_metadatas(self, ids: list[str], metadatas: list[dict]) -> None:
+        """Update vector metadata without re-embedding documents."""
+        if ids:
+            self._collection.update(ids=ids, metadatas=metadatas)
+            log.debug("vector_metadatas_updated", count=len(ids))
+
     @property
     def count(self) -> int:
         return self._collection.count()


@@ -0,0 +1,154 @@
"""Tests for explicit chunk project_id metadata backfill."""
import json
import atocore.config as config
from atocore.models.database import get_connection, init_db
from scripts import backfill_chunk_project_ids as backfill
def _write_registry(tmp_path, monkeypatch):
vault_dir = tmp_path / "vault"
drive_dir = tmp_path / "drive"
config_dir = tmp_path / "config"
project_dir = vault_dir / "incoming" / "projects" / "p04-gigabit"
project_dir.mkdir(parents=True)
drive_dir.mkdir()
config_dir.mkdir()
registry_path = config_dir / "project-registry.json"
registry_path.write_text(
json.dumps(
{
"projects": [
{
"id": "p04-gigabit",
"aliases": ["p04"],
"ingest_roots": [
{"source": "vault", "subpath": "incoming/projects/p04-gigabit"}
],
}
]
}
),
encoding="utf-8",
)
monkeypatch.setenv("ATOCORE_VAULT_SOURCE_DIR", str(vault_dir))
monkeypatch.setenv("ATOCORE_DRIVE_SOURCE_DIR", str(drive_dir))
monkeypatch.setenv("ATOCORE_PROJECT_REGISTRY_PATH", str(registry_path))
config.settings = config.Settings()
return project_dir
def _insert_chunk(file_path, metadata=None, chunk_id="chunk-1"):
with get_connection() as conn:
conn.execute(
"""
INSERT INTO source_documents (id, file_path, file_hash, title, doc_type, tags)
VALUES (?, ?, ?, ?, ?, ?)
""",
("doc-1", str(file_path), "hash", "Title", "markdown", "[]"),
)
conn.execute(
"""
INSERT INTO source_chunks
(id, document_id, chunk_index, content, heading_path, char_count, metadata)
VALUES (?, ?, ?, ?, ?, ?, ?)
""",
(
chunk_id,
"doc-1",
0,
"content",
"Overview",
7,
json.dumps(metadata if metadata is not None else {}),
),
)
class FakeVectorStore:
def __init__(self, metadatas):
self.metadatas = dict(metadatas)
self.updated = []
def get_metadatas(self, ids):
returned_ids = [chunk_id for chunk_id in ids if chunk_id in self.metadatas]
return {
"ids": returned_ids,
"metadatas": [self.metadatas[chunk_id] for chunk_id in returned_ids],
}
def update_metadatas(self, ids, metadatas):
self.updated.append((list(ids), list(metadatas)))
for chunk_id, metadata in zip(ids, metadatas, strict=True):
self.metadatas[chunk_id] = metadata
def test_backfill_dry_run_is_non_mutating(tmp_data_dir, tmp_path, monkeypatch):
init_db()
project_dir = _write_registry(tmp_path, monkeypatch)
_insert_chunk(project_dir / "status.md")
result = backfill.backfill(apply=False)
assert result["updates"] == 1
with get_connection() as conn:
row = conn.execute("SELECT metadata FROM source_chunks WHERE id = ?", ("chunk-1",)).fetchone()
assert json.loads(row["metadata"]) == {}
def test_backfill_apply_updates_chroma_then_sql(tmp_data_dir, tmp_path, monkeypatch):
init_db()
project_dir = _write_registry(tmp_path, monkeypatch)
_insert_chunk(project_dir / "status.md", metadata={"source_file": "status.md"})
fake_store = FakeVectorStore({"chunk-1": {"source_file": "status.md"}})
monkeypatch.setattr(backfill, "get_vector_store", lambda: fake_store)
result = backfill.backfill(apply=True, require_chroma_snapshot=True)
assert result["applied_updates"] == 1
assert fake_store.metadatas["chunk-1"]["project_id"] == "p04-gigabit"
with get_connection() as conn:
row = conn.execute("SELECT metadata FROM source_chunks WHERE id = ?", ("chunk-1",)).fetchone()
assert json.loads(row["metadata"])["project_id"] == "p04-gigabit"
def test_backfill_apply_requires_snapshot_confirmation(tmp_data_dir, tmp_path, monkeypatch):
init_db()
project_dir = _write_registry(tmp_path, monkeypatch)
_insert_chunk(project_dir / "status.md")
try:
backfill.backfill(apply=True)
except ValueError as exc:
assert "Chroma backup" in str(exc)
else:
raise AssertionError("Expected snapshot confirmation requirement")
def test_backfill_missing_vector_skips_sql_update(tmp_data_dir, tmp_path, monkeypatch):
init_db()
project_dir = _write_registry(tmp_path, monkeypatch)
_insert_chunk(project_dir / "status.md")
fake_store = FakeVectorStore({})
monkeypatch.setattr(backfill, "get_vector_store", lambda: fake_store)
result = backfill.backfill(apply=True, require_chroma_snapshot=True)
assert result["updates"] == 1
assert result["applied_updates"] == 0
assert result["missing_vectors"] == 1
with get_connection() as conn:
row = conn.execute("SELECT metadata FROM source_chunks WHERE id = ?", ("chunk-1",)).fetchone()
assert json.loads(row["metadata"]) == {}
def test_backfill_skips_malformed_metadata(tmp_data_dir, tmp_path, monkeypatch):
init_db()
project_dir = _write_registry(tmp_path, monkeypatch)
_insert_chunk(project_dir / "status.md", metadata=[])
result = backfill.backfill(apply=False)
assert result["updates"] == 0
assert result["malformed_metadata"] == 1


@@ -46,6 +46,8 @@ def test_settings_keep_legacy_db_path_when_present(tmp_path, monkeypatch):
def test_ranking_weights_are_tunable_via_env(monkeypatch):
monkeypatch.setenv("ATOCORE_RANK_PROJECT_MATCH_BOOST", "3.5")
monkeypatch.setenv("ATOCORE_RANK_PROJECT_SCOPE_FILTER", "false")
monkeypatch.setenv("ATOCORE_RANK_PROJECT_SCOPE_CANDIDATE_MULTIPLIER", "6")
monkeypatch.setenv("ATOCORE_RANK_QUERY_TOKEN_STEP", "0.12")
monkeypatch.setenv("ATOCORE_RANK_QUERY_TOKEN_CAP", "1.5")
monkeypatch.setenv("ATOCORE_RANK_PATH_HIGH_SIGNAL_BOOST", "1.25")
@@ -54,6 +56,8 @@ def test_ranking_weights_are_tunable_via_env(monkeypatch):
settings = config.Settings()
assert settings.rank_project_match_boost == 3.5
assert settings.rank_project_scope_filter is False
assert settings.rank_project_scope_candidate_multiplier == 6
assert settings.rank_query_token_step == 0.12
assert settings.rank_query_token_cap == 1.5
assert settings.rank_path_high_signal_boost == 1.25


@@ -1,8 +1,10 @@
"""Tests for the ingestion pipeline."""
import json
from atocore.ingestion.parser import parse_markdown
from atocore.models.database import get_connection, init_db
-from atocore.ingestion.pipeline import ingest_file, ingest_folder
+from atocore.ingestion.pipeline import ingest_file, ingest_folder, ingest_project_folder
def test_parse_markdown(sample_markdown):
@@ -69,6 +71,153 @@ def test_ingest_updates_changed(tmp_data_dir, sample_markdown):
assert result["status"] == "ingested"
def test_ingest_file_records_project_id_metadata(tmp_data_dir, sample_markdown, monkeypatch):
"""Project-aware ingestion should tag DB and vector metadata exactly."""
init_db()
class FakeVectorStore:
def __init__(self):
self.metadatas = []
def add(self, ids, documents, metadatas):
self.metadatas.extend(metadatas)
def delete(self, ids):
return None
fake_store = FakeVectorStore()
monkeypatch.setattr("atocore.ingestion.pipeline.get_vector_store", lambda: fake_store)
result = ingest_file(sample_markdown, project_id="p04-gigabit")
assert result["status"] == "ingested"
assert fake_store.metadatas
assert all(meta["project_id"] == "p04-gigabit" for meta in fake_store.metadatas)
with get_connection() as conn:
rows = conn.execute("SELECT metadata FROM source_chunks").fetchall()
assert rows
assert all(
json.loads(row["metadata"])["project_id"] == "p04-gigabit"
for row in rows
)
def test_ingest_file_derives_project_id_from_registry_root(tmp_data_dir, tmp_path, monkeypatch):
"""Unscoped ingest should preserve ownership for files under registered roots."""
import atocore.config as config
vault_dir = tmp_path / "vault"
drive_dir = tmp_path / "drive"
config_dir = tmp_path / "config"
project_dir = vault_dir / "incoming" / "projects" / "p04-gigabit"
project_dir.mkdir(parents=True)
drive_dir.mkdir()
config_dir.mkdir()
note = project_dir / "status.md"
note.write_text(
"# Status\n\nCurrent project status with enough detail to create "
"a retrievable chunk for the ingestion pipeline test.",
encoding="utf-8",
)
registry_path = config_dir / "project-registry.json"
registry_path.write_text(
json.dumps(
{
"projects": [
{
"id": "p04-gigabit",
"aliases": ["p04"],
"ingest_roots": [
{"source": "vault", "subpath": "incoming/projects/p04-gigabit"}
],
}
]
}
),
encoding="utf-8",
)
class FakeVectorStore:
def __init__(self):
self.metadatas = []
def add(self, ids, documents, metadatas):
self.metadatas.extend(metadatas)
def delete(self, ids):
return None
fake_store = FakeVectorStore()
monkeypatch.setenv("ATOCORE_VAULT_SOURCE_DIR", str(vault_dir))
monkeypatch.setenv("ATOCORE_DRIVE_SOURCE_DIR", str(drive_dir))
monkeypatch.setenv("ATOCORE_PROJECT_REGISTRY_PATH", str(registry_path))
config.settings = config.Settings()
monkeypatch.setattr("atocore.ingestion.pipeline.get_vector_store", lambda: fake_store)
init_db()
result = ingest_file(note)
assert result["status"] == "ingested"
assert fake_store.metadatas
assert all(meta["project_id"] == "p04-gigabit" for meta in fake_store.metadatas)
def test_ingest_file_logs_and_fails_open_when_project_derivation_fails(
tmp_data_dir,
sample_markdown,
monkeypatch,
):
"""A broken registry should be visible but should not block ingestion."""
init_db()
warnings = []
class FakeVectorStore:
def __init__(self):
self.metadatas = []
def add(self, ids, documents, metadatas):
self.metadatas.extend(metadatas)
def delete(self, ids):
return None
fake_store = FakeVectorStore()
monkeypatch.setattr("atocore.ingestion.pipeline.get_vector_store", lambda: fake_store)
monkeypatch.setattr(
"atocore.projects.registry.derive_project_id_for_path",
lambda path: (_ for _ in ()).throw(ValueError("registry broken")),
)
monkeypatch.setattr(
"atocore.ingestion.pipeline.log.warning",
lambda event, **kwargs: warnings.append((event, kwargs)),
)
result = ingest_file(sample_markdown)
assert result["status"] == "ingested"
assert fake_store.metadatas
assert all(meta["project_id"] == "" for meta in fake_store.metadatas)
assert warnings[0][0] == "project_id_derivation_failed"
assert "registry broken" in warnings[0][1]["error"]
def test_ingest_project_folder_passes_project_id_to_files(tmp_data_dir, sample_folder, monkeypatch):
seen = []
def fake_ingest_file(path, project_id=""):
seen.append((path.name, project_id))
return {"file": str(path), "status": "ingested"}
monkeypatch.setattr("atocore.ingestion.pipeline.ingest_file", fake_ingest_file)
monkeypatch.setattr("atocore.ingestion.pipeline._purge_deleted_files", lambda *args, **kwargs: 0)
ingest_project_folder(sample_folder, project_id="p05-interferometer")
assert seen
assert {project_id for _, project_id in seen} == {"p05-interferometer"}
def test_parse_markdown_uses_supplied_text(sample_markdown):
"""Parsing should be able to reuse pre-read content from ingestion."""
latin_text = """---\ntags: parser\n---\n# Parser Title\n\nBody text."""


@@ -5,6 +5,7 @@ import json
import atocore.config as config
from atocore.projects.registry import (
build_project_registration_proposal,
derive_project_id_for_path,
get_registered_project,
get_project_registry_template,
list_registered_projects,
@@ -103,6 +104,98 @@ def test_project_registry_resolves_alias(tmp_path, monkeypatch):
assert project.project_id == "p05-interferometer"
def test_derive_project_id_for_path_uses_registered_roots(tmp_path, monkeypatch):
vault_dir = tmp_path / "vault"
drive_dir = tmp_path / "drive"
config_dir = tmp_path / "config"
project_dir = vault_dir / "incoming" / "projects" / "p04-gigabit"
project_dir.mkdir(parents=True)
drive_dir.mkdir()
config_dir.mkdir()
note = project_dir / "status.md"
note.write_text("# Status\n\nCurrent work.", encoding="utf-8")
registry_path = config_dir / "project-registry.json"
registry_path.write_text(
json.dumps(
{
"projects": [
{
"id": "p04-gigabit",
"aliases": ["p04"],
"ingest_roots": [
{"source": "vault", "subpath": "incoming/projects/p04-gigabit"}
],
}
]
}
),
encoding="utf-8",
)
monkeypatch.setenv("ATOCORE_VAULT_SOURCE_DIR", str(vault_dir))
monkeypatch.setenv("ATOCORE_DRIVE_SOURCE_DIR", str(drive_dir))
monkeypatch.setenv("ATOCORE_PROJECT_REGISTRY_PATH", str(registry_path))
original_settings = config.settings
try:
config.settings = config.Settings()
assert derive_project_id_for_path(note) == "p04-gigabit"
assert derive_project_id_for_path(tmp_path / "elsewhere.md") == ""
finally:
config.settings = original_settings
def test_project_registry_rejects_cross_project_ingest_root_overlap(tmp_path, monkeypatch):
vault_dir = tmp_path / "vault"
drive_dir = tmp_path / "drive"
config_dir = tmp_path / "config"
vault_dir.mkdir()
drive_dir.mkdir()
config_dir.mkdir()
registry_path = config_dir / "project-registry.json"
registry_path.write_text(
json.dumps(
{
"projects": [
{
"id": "parent",
"aliases": [],
"ingest_roots": [
{"source": "vault", "subpath": "incoming/projects/parent"}
],
},
{
"id": "child",
"aliases": [],
"ingest_roots": [
{"source": "vault", "subpath": "incoming/projects/parent/child"}
],
},
]
}
),
encoding="utf-8",
)
monkeypatch.setenv("ATOCORE_VAULT_SOURCE_DIR", str(vault_dir))
monkeypatch.setenv("ATOCORE_DRIVE_SOURCE_DIR", str(drive_dir))
monkeypatch.setenv("ATOCORE_PROJECT_REGISTRY_PATH", str(registry_path))
original_settings = config.settings
try:
config.settings = config.Settings()
try:
list_registered_projects()
except ValueError as exc:
assert "ingest root overlap" in str(exc)
else:
raise AssertionError("Expected overlapping ingest roots to raise")
finally:
config.settings = original_settings
def test_refresh_registered_project_ingests_registered_roots(tmp_path, monkeypatch):
vault_dir = tmp_path / "vault"
drive_dir = tmp_path / "drive"
@@ -133,8 +226,8 @@ def test_refresh_registered_project_ingests_registered_roots(tmp_path, monkeypat
calls = []
-def fake_ingest_folder(path, purge_deleted=True):
+def fake_ingest_folder(path, purge_deleted=True, project_id=""):
-calls.append((str(path), purge_deleted))
+calls.append((str(path), purge_deleted, project_id))
return [{"file": str(path / "README.md"), "status": "ingested"}]
monkeypatch.setenv("ATOCORE_VAULT_SOURCE_DIR", str(vault_dir))
@@ -144,7 +237,7 @@ def test_refresh_registered_project_ingests_registered_roots(tmp_path, monkeypat
original_settings = config.settings
try:
config.settings = config.Settings()
-monkeypatch.setattr("atocore.projects.registry.ingest_folder", fake_ingest_folder)
+monkeypatch.setattr("atocore.ingestion.pipeline.ingest_project_folder", fake_ingest_folder)
result = refresh_registered_project("polisher")
finally:
config.settings = original_settings
@@ -153,6 +246,7 @@ def test_refresh_registered_project_ingests_registered_roots(tmp_path, monkeypat
assert len(calls) == 1
assert calls[0][0].endswith("p06-polisher")
assert calls[0][1] is False
assert calls[0][2] == "p06-polisher"
assert result["roots"][0]["status"] == "ingested"
assert result["status"] == "ingested"
assert result["roots_ingested"] == 1
@@ -188,7 +282,7 @@ def test_refresh_registered_project_reports_nothing_to_ingest_when_all_missing(
encoding="utf-8",
)
-def fail_ingest_folder(path, purge_deleted=True):
+def fail_ingest_folder(path, purge_deleted=True, project_id=""):
raise AssertionError(f"ingest_folder should not be called for missing root: {path}")
monkeypatch.setenv("ATOCORE_VAULT_SOURCE_DIR", str(vault_dir))
@@ -198,7 +292,7 @@ def test_refresh_registered_project_reports_nothing_to_ingest_when_all_missing(
original_settings = config.settings
try:
config.settings = config.Settings()
-monkeypatch.setattr("atocore.projects.registry.ingest_folder", fail_ingest_folder)
+monkeypatch.setattr("atocore.ingestion.pipeline.ingest_project_folder", fail_ingest_folder)
result = refresh_registered_project("ghost")
finally:
config.settings = original_settings
@@ -238,7 +332,7 @@ def test_refresh_registered_project_reports_partial_status(tmp_path, monkeypatch
encoding="utf-8",
)
-def fake_ingest_folder(path, purge_deleted=True):
+def fake_ingest_folder(path, purge_deleted=True, project_id=""):
return [{"file": str(path / "README.md"), "status": "ingested"}]
monkeypatch.setenv("ATOCORE_VAULT_SOURCE_DIR", str(vault_dir))
@@ -248,7 +342,7 @@ def test_refresh_registered_project_reports_partial_status(tmp_path, monkeypatch
original_settings = config.settings
try:
config.settings = config.Settings()
-monkeypatch.setattr("atocore.projects.registry.ingest_folder", fake_ingest_folder)
+monkeypatch.setattr("atocore.ingestion.pipeline.ingest_project_folder", fake_ingest_folder)
result = refresh_registered_project("mixed")
finally:
config.settings = original_settings


@@ -70,8 +70,28 @@ def test_retrieve_skips_stale_vector_entries(tmp_data_dir, sample_markdown, monk
def test_retrieve_project_hint_boosts_matching_chunks(monkeypatch):
target_project = type(
"Project",
(),
{
"project_id": "p04-gigabit",
"aliases": ("p04", "gigabit"),
"ingest_roots": (),
},
)()
other_project = type(
"Project",
(),
{
"project_id": "p05-interferometer",
"aliases": ("p05", "interferometer"),
"ingest_roots": (),
},
)()
class FakeStore:
def query(self, query_embedding, top_k=10, where=None):
assert top_k == 8
return {
"ids": [["chunk-a", "chunk-b"]],
"documents": [["project doc", "other doc"]],
@@ -102,7 +122,21 @@ def test_retrieve_project_hint_boosts_matching_chunks(monkeypatch):
)
monkeypatch.setattr(
"atocore.retrieval.retriever.get_registered_project",
-lambda project_name: type(
+lambda project_name: target_project,
)
monkeypatch.setattr(
"atocore.retrieval.retriever.load_project_registry",
lambda: [target_project, other_project],
)
results = retrieve("mirror architecture", top_k=2, project_hint="p04")
assert len(results) == 1
assert results[0].chunk_id == "chunk-a"
def test_retrieve_project_scope_allows_unowned_global_chunks(monkeypatch):
target_project = type(
"Project",
(),
{
@@ -110,14 +144,479 @@ def test_retrieve_project_hint_boosts_matching_chunks(monkeypatch):
"aliases": ("p04", "gigabit"),
"ingest_roots": (),
},
-)(),
+)()
class FakeStore:
def query(self, query_embedding, top_k=10, where=None):
return {
"ids": [["chunk-a", "chunk-global"]],
"documents": [["project doc", "global doc"]],
"metadatas": [[
{
"heading_path": "Overview",
"source_file": "p04-gigabit/pkm/_index.md",
"tags": '["p04-gigabit"]',
"title": "P04",
"document_id": "doc-a",
},
{
"heading_path": "Overview",
"source_file": "shared/engineering-rules.md",
"tags": "[]",
"title": "Shared engineering rules",
"document_id": "doc-global",
},
]],
"distances": [[0.2, 0.21]],
}
monkeypatch.setattr("atocore.retrieval.retriever.get_vector_store", lambda: FakeStore())
monkeypatch.setattr("atocore.retrieval.retriever.embed_query", lambda query: [0.0, 0.1])
monkeypatch.setattr(
"atocore.retrieval.retriever._existing_chunk_ids",
lambda chunk_ids: set(chunk_ids),
)
monkeypatch.setattr(
"atocore.retrieval.retriever.get_registered_project",
lambda project_name: target_project,
)
monkeypatch.setattr(
"atocore.retrieval.retriever.load_project_registry",
lambda: [target_project],
)
results = retrieve("mirror architecture", top_k=2, project_hint="p04")
-assert len(results) == 2
+assert [r.chunk_id for r in results] == ["chunk-a", "chunk-global"]
assert results[0].chunk_id == "chunk-a"
assert results[0].score > results[1].score
def test_retrieve_project_scope_filter_can_be_disabled(monkeypatch):
target_project = type(
"Project",
(),
{
"project_id": "p04-gigabit",
"aliases": ("p04", "gigabit"),
"ingest_roots": (),
},
)()
other_project = type(
"Project",
(),
{
"project_id": "p05-interferometer",
"aliases": ("p05", "interferometer"),
"ingest_roots": (),
},
)()
class FakeStore:
def query(self, query_embedding, top_k=10, where=None):
assert top_k == 2
return {
"ids": [["chunk-a", "chunk-b"]],
"documents": [["project doc", "other project doc"]],
"metadatas": [[
{
"heading_path": "Overview",
"source_file": "p04-gigabit/pkm/_index.md",
"tags": '["p04-gigabit"]',
"title": "P04",
"document_id": "doc-a",
},
{
"heading_path": "Overview",
"source_file": "p05-interferometer/pkm/_index.md",
"tags": '["p05-interferometer"]',
"title": "P05",
"document_id": "doc-b",
},
]],
"distances": [[0.2, 0.2]],
}
monkeypatch.setattr("atocore.config.settings.rank_project_scope_filter", False)
monkeypatch.setattr("atocore.retrieval.retriever.get_vector_store", lambda: FakeStore())
monkeypatch.setattr("atocore.retrieval.retriever.embed_query", lambda query: [0.0, 0.1])
monkeypatch.setattr(
"atocore.retrieval.retriever._existing_chunk_ids",
lambda chunk_ids: set(chunk_ids),
)
monkeypatch.setattr(
"atocore.retrieval.retriever.get_registered_project",
lambda project_name: target_project,
)
monkeypatch.setattr(
"atocore.retrieval.retriever.load_project_registry",
lambda: [target_project, other_project],
)
results = retrieve("mirror architecture", top_k=2, project_hint="p04")
assert {r.chunk_id for r in results} == {"chunk-a", "chunk-b"}
def test_retrieve_project_scope_ignores_title_for_ownership(monkeypatch):
target_project = type(
"Project",
(),
{
"project_id": "p04-gigabit",
"aliases": ("p04", "gigabit"),
"ingest_roots": (),
},
)()
other_project = type(
"Project",
(),
{
"project_id": "p06-polisher",
"aliases": ("p06", "polisher", "p11"),
"ingest_roots": (),
},
)()
class FakeStore:
def query(self, query_embedding, top_k=10, where=None):
return {
"ids": [["chunk-target", "chunk-poisoned-title"]],
"documents": [["p04 doc", "p06 doc"]],
"metadatas": [[
{
"heading_path": "Overview",
"source_file": "p04-gigabit/pkm/_index.md",
"tags": '["p04-gigabit"]',
"title": "P04",
"document_id": "doc-a",
},
{
"heading_path": "Overview",
"source_file": "p06-polisher/pkm/architecture.md",
"tags": '["p06-polisher"]',
"title": "GigaBIT M1 mirror lessons",
"document_id": "doc-b",
},
]],
"distances": [[0.2, 0.19]],
}
monkeypatch.setattr("atocore.retrieval.retriever.get_vector_store", lambda: FakeStore())
monkeypatch.setattr("atocore.retrieval.retriever.embed_query", lambda query: [0.0, 0.1])
monkeypatch.setattr(
"atocore.retrieval.retriever._existing_chunk_ids",
lambda chunk_ids: set(chunk_ids),
)
monkeypatch.setattr(
"atocore.retrieval.retriever.get_registered_project",
lambda project_name: target_project,
)
monkeypatch.setattr(
"atocore.retrieval.retriever.load_project_registry",
lambda: [target_project, other_project],
)
results = retrieve("mirror architecture", top_k=2, project_hint="p04")
assert [r.chunk_id for r in results] == ["chunk-target"]
def test_retrieve_project_scope_uses_path_segments_not_substrings(monkeypatch):
target_project = type(
"Project",
(),
{
"project_id": "p05-interferometer",
"aliases": ("p05", "interferometer"),
"ingest_roots": (),
},
)()
abb_project = type(
"Project",
(),
{
"project_id": "abb-space",
"aliases": ("abb",),
"ingest_roots": (),
},
)()
class FakeStore:
def query(self, query_embedding, top_k=10, where=None):
return {
"ids": [["chunk-target", "chunk-global"]],
"documents": [["p05 doc", "global doc"]],
"metadatas": [[
{
"heading_path": "Overview",
"source_file": "p05-interferometer/pkm/_index.md",
"tags": '["p05-interferometer"]',
"title": "P05",
"document_id": "doc-a",
},
{
"heading_path": "Abbreviation notes",
"source_file": "shared/cabbage-abbreviations.md",
"tags": "[]",
"title": "ABB-style abbreviations",
"document_id": "doc-global",
},
]],
"distances": [[0.2, 0.21]],
}
monkeypatch.setattr("atocore.retrieval.retriever.get_vector_store", lambda: FakeStore())
monkeypatch.setattr("atocore.retrieval.retriever.embed_query", lambda query: [0.0, 0.1])
monkeypatch.setattr(
"atocore.retrieval.retriever._existing_chunk_ids",
lambda chunk_ids: set(chunk_ids),
)
monkeypatch.setattr(
"atocore.retrieval.retriever.get_registered_project",
lambda project_name: target_project,
)
monkeypatch.setattr(
"atocore.retrieval.retriever.load_project_registry",
lambda: [target_project, abb_project],
)
results = retrieve("abbreviations", top_k=2, project_hint="p05")
assert [r.chunk_id for r in results] == ["chunk-target", "chunk-global"]
def test_retrieve_project_scope_prefers_exact_project_id(monkeypatch):
target_project = type(
"Project",
(),
{
"project_id": "p04-gigabit",
"aliases": ("p04", "gigabit"),
"ingest_roots": (),
},
)()
other_project = type(
"Project",
(),
{
"project_id": "p06-polisher",
"aliases": ("p06", "polisher"),
"ingest_roots": (),
},
)()
class FakeStore:
def query(self, query_embedding, top_k=10, where=None):
return {
"ids": [["chunk-target", "chunk-other", "chunk-global"]],
"documents": [["target doc", "other doc", "global doc"]],
"metadatas": [[
{
"heading_path": "Overview",
"source_file": "legacy/unhelpful-path.md",
"tags": "[]",
"title": "Target",
"project_id": "p04-gigabit",
"document_id": "doc-a",
},
{
"heading_path": "Overview",
"source_file": "p04-gigabit/title-poisoned.md",
"tags": '["p04-gigabit"]',
"title": "Looks target-owned but is explicit p06",
"project_id": "p06-polisher",
"document_id": "doc-b",
},
{
"heading_path": "Overview",
"source_file": "shared/global.md",
"tags": "[]",
"title": "Shared",
"project_id": "",
"document_id": "doc-global",
},
]],
"distances": [[0.2, 0.19, 0.21]],
}
monkeypatch.setattr("atocore.retrieval.retriever.get_vector_store", lambda: FakeStore())
monkeypatch.setattr("atocore.retrieval.retriever.embed_query", lambda query: [0.0, 0.1])
monkeypatch.setattr(
"atocore.retrieval.retriever._existing_chunk_ids",
lambda chunk_ids: set(chunk_ids),
)
monkeypatch.setattr(
"atocore.retrieval.retriever.get_registered_project",
lambda project_name: target_project,
)
monkeypatch.setattr(
"atocore.retrieval.retriever.load_project_registry",
lambda: [target_project, other_project],
)
results = retrieve("mirror architecture", top_k=3, project_hint="p04")
assert [r.chunk_id for r in results] == ["chunk-target", "chunk-global"]
def test_retrieve_empty_project_id_falls_back_to_path_ownership(monkeypatch):
target_project = type(
"Project",
(),
{
"project_id": "p04-gigabit",
"aliases": ("p04", "gigabit"),
"ingest_roots": (),
},
)()
other_project = type(
"Project",
(),
{
"project_id": "p05-interferometer",
"aliases": ("p05", "interferometer"),
"ingest_roots": (),
},
)()
class FakeStore:
def query(self, query_embedding, top_k=10, where=None):
return {
"ids": [["chunk-target", "chunk-other"]],
"documents": [["target doc", "other doc"]],
"metadatas": [[
{
"heading_path": "Overview",
"source_file": "p04-gigabit/status.md",
"tags": "[]",
"title": "Target",
"project_id": "",
"document_id": "doc-a",
},
{
"heading_path": "Overview",
"source_file": "p05-interferometer/status.md",
"tags": "[]",
"title": "Other",
"project_id": "",
"document_id": "doc-b",
},
]],
"distances": [[0.2, 0.19]],
}
monkeypatch.setattr("atocore.retrieval.retriever.get_vector_store", lambda: FakeStore())
monkeypatch.setattr("atocore.retrieval.retriever.embed_query", lambda query: [0.0, 0.1])
monkeypatch.setattr(
"atocore.retrieval.retriever._existing_chunk_ids",
lambda chunk_ids: set(chunk_ids),
)
monkeypatch.setattr(
"atocore.retrieval.retriever.get_registered_project",
lambda project_name: target_project,
)
monkeypatch.setattr(
"atocore.retrieval.retriever.load_project_registry",
lambda: [target_project, other_project],
)
results = retrieve("mirror architecture", top_k=2, project_hint="p04")
assert [r.chunk_id for r in results] == ["chunk-target"]
def test_retrieve_unknown_project_hint_does_not_widen_or_filter(monkeypatch):
class FakeStore:
def query(self, query_embedding, top_k=10, where=None):
assert top_k == 2
return {
"ids": [["chunk-a", "chunk-b"]],
"documents": [["doc a", "doc b"]],
"metadatas": [[
{
"heading_path": "Overview",
"source_file": "project-a/file.md",
"tags": "[]",
"title": "A",
"document_id": "doc-a",
},
{
"heading_path": "Overview",
"source_file": "project-b/file.md",
"tags": "[]",
"title": "B",
"document_id": "doc-b",
},
]],
"distances": [[0.2, 0.21]],
}
monkeypatch.setattr("atocore.retrieval.retriever.get_vector_store", lambda: FakeStore())
monkeypatch.setattr("atocore.retrieval.retriever.embed_query", lambda query: [0.0, 0.1])
monkeypatch.setattr(
"atocore.retrieval.retriever._existing_chunk_ids",
lambda chunk_ids: set(chunk_ids),
)
monkeypatch.setattr(
"atocore.retrieval.retriever.get_registered_project",
lambda project_name: None,
)
results = retrieve("overview", top_k=2, project_hint="unknown-project")
assert [r.chunk_id for r in results] == ["chunk-a", "chunk-b"]
def test_retrieve_fails_open_when_project_scope_resolution_fails(monkeypatch):
warnings = []
class FakeStore:
def query(self, query_embedding, top_k=10, where=None):
assert top_k == 2
return {
"ids": [["chunk-a", "chunk-b"]],
"documents": [["doc a", "doc b"]],
"metadatas": [[
{
"heading_path": "Overview",
"source_file": "p04-gigabit/file.md",
"tags": "[]",
"title": "A",
"document_id": "doc-a",
},
{
"heading_path": "Overview",
"source_file": "p05-interferometer/file.md",
"tags": "[]",
"title": "B",
"document_id": "doc-b",
},
]],
"distances": [[0.2, 0.21]],
}
monkeypatch.setattr("atocore.retrieval.retriever.get_vector_store", lambda: FakeStore())
monkeypatch.setattr("atocore.retrieval.retriever.embed_query", lambda query: [0.0, 0.1])
monkeypatch.setattr(
"atocore.retrieval.retriever._existing_chunk_ids",
lambda chunk_ids: set(chunk_ids),
)
monkeypatch.setattr(
"atocore.retrieval.retriever.get_registered_project",
lambda project_name: (_ for _ in ()).throw(ValueError("registry overlap")),
)
monkeypatch.setattr(
"atocore.retrieval.retriever.log.warning",
lambda event, **kwargs: warnings.append((event, kwargs)),
)
results = retrieve("overview", top_k=2, project_hint="p04")
assert [r.chunk_id for r in results] == ["chunk-a", "chunk-b"]
assert {warning[0] for warning in warnings} == {
"project_scope_resolution_failed",
"project_match_boost_resolution_failed",
}
assert all("registry overlap" in warning[1]["error"] for warning in warnings)
def test_retrieve_downranks_archive_noise_and_prefers_high_signal_paths(monkeypatch):