fix: pass project_hint into retrieve and add path-signal ranking

Two changes that belong together:

1. builder.build_context() now passes project_hint into retrieve(),
   so the project-aware boost actually fires for the retrieval pipeline
   driven by /context/build. Before this, only direct /query callers
   benefited from the registered-project boost.

2. retriever now applies two more ranking signals on every chunk:
   - _query_match_boost: boosts chunks whose source/title/heading
     echo high-signal query tokens (stop list filters out generic
     words like "the", "project", "system")
   - _path_signal_boost: down-weights archival noise (_archive,
     _history, pre-cleanup, reviews) by 0.72 and up-weights current
     high-signal docs (status, decision, requirements, charter,
     system-map, error-budget, ...) by 1.18

Tests:
- test_context_builder_passes_project_hint_to_retrieval verifies
  the wiring fix
- test_retrieve_downranks_archive_noise_and_prefers_high_signal_paths
  verifies the new ranking helpers prefer current docs over archive

This addresses the cross-project competition and archive bleed
called out in current-state.md after the Wave 1 ingestion.
2026-04-06 18:37:07 -04:00
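The path-signal rule described in this commit message can be sketched as a small standalone function. This is only an illustration, not the repository's helper: the constant names, the shortened hint tuples, and the function name `path_signal_multiplier` are hypothetical, and the assumption that a path matching both lists receives both multipliers is inferred rather than confirmed.

```python
# Illustrative values taken from the commit message (0.72 and 1.18).
# The hint tuples below are shortened, hypothetical samples.
LOW_SIGNAL_PENALTY = 0.72
HIGH_SIGNAL_BOOST = 1.18
LOW_SIGNAL_HINTS = ("/_archive/", "_history", "pre-cleanup", "reviews/")
HIGH_SIGNAL_HINTS = ("status", "decision", "requirements", "charter")


def path_signal_multiplier(source_file: str) -> float:
    """Return the score multiplier for one source path (hypothetical helper)."""
    path = source_file.lower()
    multiplier = 1.0
    # Archival paths are down-weighted.
    if any(hint in path for hint in LOW_SIGNAL_HINTS):
        multiplier *= LOW_SIGNAL_PENALTY
    # Current high-signal documents are up-weighted.
    # Assumption: both rules may apply to the same path.
    if any(hint in path for hint in HIGH_SIGNAL_HINTS):
        multiplier *= HIGH_SIGNAL_BOOST
    return multiplier
```

Because the match is a plain substring test, a file such as `docs/_archive/status.md` would hit both lists under this sketch and end up slightly below 1.0 overall.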
"""Retrieval: query to ranked chunks."""
feat: implement AtoCore Phase 0 + Phase 0.5 (foundation + PoC)

Complete implementation of the personal context engine foundation:
- FastAPI server with 5 endpoints (ingest, query, context/build, health, debug)
- SQLite database with 5 tables (documents, chunks, memories, projects, interactions)
- Heading-aware markdown chunker (800 char max, recursive splitting)
- Multilingual embeddings via sentence-transformers (EN/FR)
- ChromaDB vector store with cosine similarity retrieval
- Context builder with project boosting, dedup, and budget enforcement
- CLI scripts for batch ingestion and test prompt evaluation
- 19 unit tests passing, 79% coverage
- Validated on 482 real project files (8383 chunks, 0 errors)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 09:21:27 -04:00
import re
import time
from dataclasses import dataclass
2026-04-05 17:53:23 -04:00
import atocore.config as _config
from atocore.models.database import get_connection
from atocore.observability.logger import get_logger
2026-04-06 13:32:33 -04:00
from atocore.projects.registry import get_registered_project
from atocore.retrieval.embeddings import embed_query
from atocore.retrieval.vector_store import get_vector_store

log = get_logger("retriever")
_STOP_TOKENS = {
    "about",
    "and",
    "current",
    "for",
    "from",
    "into",
    "like",
    "project",
    "shared",
    "system",
    "that",
    "the",
    "this",
    "what",
    "with",
}

_HIGH_SIGNAL_HINTS = (
    "status",
    "decision",
    "requirements",
    "requirement",
    "roadmap",
    "charter",
    "system-map",
    "system_map",
    "contracts",
    "schema",
    "architecture",
    "workflow",
    "error-budget",
    "comparison-matrix",
    "selection-decision",
)

_LOW_SIGNAL_HINTS = (
    "/_archive/",
    "\\_archive\\",
    "/archive/",
    "\\archive\\",
    "_history",
    "history",
    "pre-cleanup",
    "pre-migration",
    "reviews/",
)
@dataclass
class ChunkResult:
    chunk_id: str
    content: str
    score: float
    heading_path: str
    source_file: str
    tags: str
    title: str
    document_id: str


def retrieve(
    query: str,
    top_k: int | None = None,
    filter_tags: list[str] | None = None,
    project_hint: str | None = None,
) -> list[ChunkResult]:
    """Retrieve the most relevant chunks for a query."""
    top_k = top_k or _config.settings.context_top_k
    start = time.time()

    query_embedding = embed_query(query)
    store = get_vector_store()

    where = None
    if filter_tags:
2026-04-05 09:35:37 -04:00
        if len(filter_tags) == 1:
            where = {"tags": {"$contains": f'"{filter_tags[0]}"'}}
        else:
            where = {
                "$and": [
                    {"tags": {"$contains": f'"{tag}"'}}
                    for tag in filter_tags
                ]
            }

    results = store.query(
        query_embedding=query_embedding,
        top_k=top_k,
        where=where,
    )

    chunks = []
    if results and results["ids"] and results["ids"][0]:
        existing_ids = _existing_chunk_ids(results["ids"][0])
        for i, chunk_id in enumerate(results["ids"][0]):
            if chunk_id not in existing_ids:
                continue

            distance = results["distances"][0][i] if results["distances"] else 0
            score = 1.0 - distance
            meta = results["metadatas"][0][i] if results["metadatas"] else {}
            content = results["documents"][0][i] if results["documents"] else ""

            score *= _query_match_boost(query, meta)
            score *= _path_signal_boost(meta)
            if project_hint:
                score *= _project_match_boost(project_hint, meta)

            chunks.append(
                ChunkResult(
                    chunk_id=chunk_id,
                    content=content,
                    score=round(score, 4),
                    heading_path=meta.get("heading_path", ""),
                    source_file=meta.get("source_file", ""),
                    tags=meta.get("tags", "[]"),
                    title=meta.get("title", ""),
                    document_id=meta.get("document_id", ""),
                )
            )

    duration_ms = int((time.time() - start) * 1000)
    chunks.sort(key=lambda chunk: chunk.score, reverse=True)

    log.info(
        "retrieval_done",
        query=query[:100],
        top_k=top_k,
        results_count=len(chunks),
        duration_ms=duration_ms,
    )

    return chunks
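The scoring loop in retrieve() boils down to three steps: convert the distance into a similarity, multiply in each ranking boost, then sort by the final score. The sketch below isolates that arithmetic without the vector store; `Hit` and `rank` are hypothetical names introduced only for illustration, and the boost values in the usage note are the ones quoted in the commit messages, not values read from settings.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One raw vector-store hit (hypothetical stand-in for a result row)."""
    chunk_id: str
    distance: float
    boosts: tuple[float, ...]


def rank(hits: list[Hit]) -> list[tuple[str, float]]:
    """Score each hit as (1 - distance) times its boosts, best first."""
    scored = []
    for hit in hits:
        score = 1.0 - hit.distance
        for boost in hit.boosts:
            score *= boost
        scored.append((hit.chunk_id, round(score, 4)))
    # Sort after boosting so a boosted hit can overtake a closer raw match.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored
```

For example, an archived chunk at distance 0.20 with a 0.72 penalty scores 0.576, while a status chunk at distance 0.30 with a 1.18 boost scores 0.826, so the status chunk ranks first even though its raw distance is larger.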
def _project_match_boost(project_hint: str, metadata: dict) -> float:
    """Return a project-aware relevance multiplier for raw retrieval."""
    hint_lower = project_hint.strip().lower()
    if not hint_lower:
        return 1.0

    source_file = str(metadata.get("source_file", "")).lower()
    title = str(metadata.get("title", "")).lower()
    tags = str(metadata.get("tags", "")).lower()
    searchable = " ".join([source_file, title, tags])

    project = get_registered_project(project_hint)
    candidate_names = {hint_lower}
    if project is not None:
        candidate_names.add(project.project_id.lower())
        candidate_names.update(alias.lower() for alias in project.aliases)
        candidate_names.update(
            source_ref.subpath.replace("\\", "/").strip("/").split("/")[-1].lower()
            for source_ref in project.ingest_roots
            if source_ref.subpath.strip("/\\")
        )

    for candidate in candidate_names:
        if candidate and candidate in searchable:
feat: tunable ranking, refresh status, chroma backup + admin endpoints

Three small improvements that move the operational baseline forward
without changing the existing trust model.

1. Tunable retrieval ranking weights
   - rank_project_match_boost, rank_query_token_step,
     rank_query_token_cap, rank_path_high_signal_boost,
     rank_path_low_signal_penalty are now Settings fields
   - all overridable via ATOCORE_* env vars
   - retriever no longer hard-codes 2.0 / 1.18 / 0.72 / 0.08 / 1.32
   - lets ranking be tuned per environment as Wave 1 is exercised
     without code changes

2. /projects/{name}/refresh status
   - refresh_registered_project now returns an overall status field
     ("ingested", "partial", "nothing_to_ingest") plus roots_ingested
     and roots_skipped counters
   - ProjectRefreshResponse advertises the new fields so callers can
     rely on them
   - covers the case where every configured root is missing on disk

3. Chroma cold snapshot + admin backup endpoints
   - create_runtime_backup now accepts include_chroma and writes a
     cold directory copy of the chroma persistence path
   - new list_runtime_backups() and validate_backup() helpers
   - new endpoints:
     - POST /admin/backup                  create snapshot (optional chroma)
     - GET  /admin/backup                  list snapshots
     - GET  /admin/backup/{stamp}/validate structural validation
   - chroma snapshots are taken under exclusive_ingestion() so a refresh
     or ingest cannot race with the cold copy
   - backup metadata records what was actually included and how big

Tests:
- 8 new tests covering tunable weights, refresh status branches
  (ingested / partial / nothing_to_ingest), chroma snapshot, list,
  validate, and the API endpoints (including the lock-acquisition path)
- existing fake refresh stubs in test_api_storage.py updated for the
  expanded ProjectRefreshResponse model
- full suite: 105 passing (was 97)

next-steps doc updated to reflect that the chroma snapshot + restore
validation gap from current-state.md is now closed in code; only the
operational retention policy remains.
2026-04-06 18:42:19 -04:00
            return _config.settings.rank_project_match_boost

    return 1.0


def _query_match_boost(query: str, metadata: dict) -> float:
    """Boost chunks whose path/title/headings echo the query's high-signal terms."""
    tokens = [
        token
        for token in re.findall(r"[a-z0-9][a-z0-9_-]{2,}", query.lower())
        if token not in _STOP_TOKENS
    ]
    if not tokens:
        return 1.0

    searchable = " ".join(
        [
            str(metadata.get("source_file", "")).lower(),
            str(metadata.get("title", "")).lower(),
            str(metadata.get("heading_path", "")).lower(),
        ]
    )
    matches = sum(1 for token in set(tokens) if token in searchable)
    if matches <= 0:
        return 1.0
    return min(
        1.0 + matches * _config.settings.rank_query_token_step,
        _config.settings.rank_query_token_cap,
    )
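The token-match boost can be tried in isolation with fixed weights. In this sketch the step of 0.08 and the cap of 1.32 are assumptions: they are read off the list of formerly hard-coded numbers in the tunable-ranking commit message, and that message does not state which number belongs to which setting. The stop list is shortened, and `query_match_multiplier` is a hypothetical name.

```python
import re

STOP_TOKENS = {"the", "project", "system", "what", "for"}  # shortened sample
TOKEN_STEP = 0.08  # assumed step per matching token
TOKEN_CAP = 1.32   # assumed upper bound on the multiplier


def query_match_multiplier(query: str, searchable: str) -> float:
    """Return 1.0 plus one step per distinct query token found, up to the cap."""
    # Tokens are at least three characters long and may contain "_" or "-".
    tokens = {
        token
        for token in re.findall(r"[a-z0-9][a-z0-9_-]{2,}", query.lower())
        if token not in STOP_TOKENS
    }
    haystack = searchable.lower()
    matches = sum(1 for token in tokens if token in haystack)
    if matches == 0:
        return 1.0
    return min(1.0 + matches * TOKEN_STEP, TOKEN_CAP)
```

With these assumed weights, a query that echoes two tokens of a path yields about 1.16, and five or more matching tokens saturate at the cap of 1.32.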
def _path_signal_boost(metadata: dict) -> float:
    """Prefer current high-signal docs and gently down-rank archival noise."""
    searchable = " ".join(
        [
            str(metadata.get("source_file", "")).lower(),
            str(metadata.get("title", "")).lower(),
            str(metadata.get("heading_path", "")).lower(),
        ]
    )

    multiplier = 1.0
    if any(hint in searchable for hint in _LOW_SIGNAL_HINTS):
        multiplier *= _config.settings.rank_path_low_signal_penalty
    if any(hint in searchable for hint in _HIGH_SIGNAL_HINTS):
|
|
|
|
multiplier *= _config.settings.rank_path_high_signal_boost
|
|
|
|
return multiplier
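The path-signal multiplier that ends here can be sketched in isolation as follows. This is a minimal, self-contained illustration, not the retriever's actual code: the hint tuples and the 0.72 / 1.18 constants are taken from the commit message, while the function name `path_signal_boost`, its parameters, and the way the `searchable` string is assembled are assumptions made for the example. In the real module the two weights come from `_config.settings`.

```python
# Illustrative sketch only. Hint lists and weights mirror the commit message;
# the names and the construction of `searchable` are hypothetical.
LOW_SIGNAL_HINTS = ("_archive", "_history", "pre-cleanup", "reviews")
HIGH_SIGNAL_HINTS = (
    "status",
    "decision",
    "requirements",
    "charter",
    "system-map",
    "error-budget",
)
LOW_SIGNAL_PENALTY = 0.72  # stands in for settings.rank_path_low_signal_penalty
HIGH_SIGNAL_BOOST = 1.18   # stands in for settings.rank_path_high_signal_boost


def path_signal_boost(source_path: str, title: str = "") -> float:
    """Return a score multiplier based on substrings found in the path and title."""
    searchable = f"{source_path} {title}".lower()
    multiplier = 1.0
    # Archival or historical material is down-weighted.
    if any(hint in searchable for hint in LOW_SIGNAL_HINTS):
        multiplier *= LOW_SIGNAL_PENALTY
    # Current high-signal documents are up-weighted.
    if any(hint in searchable for hint in HIGH_SIGNAL_HINTS):
        multiplier *= HIGH_SIGNAL_BOOST
    return multiplier
```

Because the two checks are independent, a path that matches both lists (for example an archived status document) receives the product of both weights rather than only one of them.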
|
|
|
|
|
|
|
|
|
|
|
2026-04-05 17:53:23 -04:00
|
|
|
def _existing_chunk_ids(chunk_ids: list[str]) -> set[str]:
    """Filter out stale vector entries whose chunk rows no longer exist."""
    if not chunk_ids:
        return set()

    placeholders = ", ".join("?" for _ in chunk_ids)
    with get_connection() as conn:
        rows = conn.execute(
            f"SELECT id FROM source_chunks WHERE id IN ({placeholders})",
            chunk_ids,
        ).fetchall()
    return {row["id"] for row in rows}
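The placeholder-based `IN (...)` lookup above binds one parameter per chunk id, and SQLite caps the number of bound parameters in a single statement. The sketch below shows one way the same check could be batched to stay under that cap. It is an assumption-laden illustration rather than the project's code: the function takes an explicit `sqlite3` connection instead of `get_connection()`, and the `batch_size` parameter and its default of 500 are hypothetical.

```python
import sqlite3


def existing_chunk_ids(
    conn: sqlite3.Connection,
    chunk_ids: list[str],
    batch_size: int = 500,  # hypothetical default, chosen to stay well below SQLite's limit
) -> set[str]:
    """Return the subset of chunk_ids that still have a row in source_chunks."""
    found: set[str] = set()
    for start in range(0, len(chunk_ids), batch_size):
        batch = chunk_ids[start:start + batch_size]
        # One "?" per id keeps the query parameterized (no string interpolation of values).
        placeholders = ", ".join("?" for _ in batch)
        rows = conn.execute(
            f"SELECT id FROM source_chunks WHERE id IN ({placeholders})",
            batch,
        ).fetchall()
        found.update(row[0] for row in rows)
    return found
```

An empty input never enters the loop, so it returns an empty set without touching the database, matching the early return in the original function.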
|