Expand active project wave and serialize refreshes
@@ -15,7 +15,7 @@
   {
     "id": "p04-gigabit",
     "aliases": ["p04", "gigabit", "gigaBIT"],
-    "description": "Curated staged docs for the P04 GigaBIT mirror architecture and OTA optics project.",
+    "description": "Active P04 GigaBIT mirror project corpus from PKM plus staged operational docs.",
     "ingest_roots": [
       {
         "source": "vault",
@@ -27,7 +27,7 @@
   {
     "id": "p05-interferometer",
     "aliases": ["p05", "interferometer"],
-    "description": "Curated staged docs for the P05 interferometer architecture, vendors, and error-budget project.",
+    "description": "Active P05 interferometer corpus from PKM plus selected repo context and vendor documentation.",
     "ingest_roots": [
       {
         "source": "vault",
@@ -39,7 +39,7 @@
   {
     "id": "p06-polisher",
     "aliases": ["p06", "polisher"],
-    "description": "Curated staged docs for the P06 polisher project.",
+    "description": "Active P06 polisher corpus from PKM, software-suite notes, and selected repo context.",
     "ingest_roots": [
       {
         "source": "vault",
@@ -52,7 +52,7 @@ now includes a first curated ingestion batch for the active projects.
 - Dalidou Docker deployment foundation
 - initial AtoCore self-knowledge corpus ingested on Dalidou
 - T420/OpenClaw read-only AtoCore helper skill
-- first curated active-project corpus batch for:
+- full active-project markdown/text corpus wave for:
   - `p04-gigabit`
   - `p05-interferometer`
   - `p06-polisher`
@@ -87,7 +87,7 @@ The Dalidou instance already contains:
 - Master Plan V3
 - Build Spec V1
 - trusted project-state entries for `atocore`
-- curated staged project docs for:
+- full staged project markdown/text corpora for:
   - `p04-gigabit`
   - `p05-interferometer`
   - `p06-polisher`
@@ -99,12 +99,12 @@ The Dalidou instance already contains:
   - `p05-interferometer`
   - `p06-polisher`

-Current live stats after the latest documentation sync and active-project ingest
-passes:
+Current live stats after the full active-project wave are now far beyond the
+initial seed stage:

-- `source_documents`: 36
-- `source_chunks`: 568
-- `vectors`: 568
+- more than `1,100` source documents
+- more than `20,000` chunks
+- matching vector count

 The broader long-term corpus is still not fully populated yet. Wider project and
 vault ingestion remains a deliberate next step rather than something already
@@ -115,8 +115,8 @@ primarily visible under:

 - `/srv/storage/atocore/sources/vault/incoming/projects`

-This staged area is now useful for review because it contains the curated
-project docs that were actually ingested for the first active-project batch.
+This staged area is now useful for review because it contains the markdown/text
+project docs that were actually ingested for the full active-project wave.

 It is important to read this staged area correctly:
@@ -166,10 +166,12 @@ These are curated summaries and extracted stable project signals.

 In `source_documents` / retrieval corpus:

-- real project documents are now present for the same active project set
+- full project markdown/text corpora are now present for the active project set
 - retrieval is no longer limited to AtoCore self-knowledge only
-- the current corpus is still selective rather than exhaustive
-- that selectivity is intentional at this stage
+- the current corpus is broad enough that ranking quality matters more than
+  corpus presence alone
+- underspecified prompts can still pull in historical or archive material, so
+  project-aware routing and better ranking remain important

 The source refresh model now has a concrete foundation in code:
@@ -223,8 +225,8 @@ This separation is healthy:
 ## Immediate Next Focus

 1. Use the new T420-side organic routing layer in real OpenClaw workflows
-2. Keep tightening retrieval quality for the newly seeded active projects
-3. Define the first broader AtoVault/AtoDrive ingestion batches
+2. Tighten retrieval quality for the now fully ingested active project corpora
+3. Move to Wave 2 trusted-operational ingestion instead of blindly widening raw corpus further
 4. Keep the new engineering-knowledge architecture docs as implementation guidance while avoiding premature schema work
 5. Expand the boring operations baseline:
    - restore validation
@@ -234,6 +236,7 @@ This separation is healthy:

 See also:

+- [ingestion-waves.md](C:/Users/antoi/ATOCore/docs/ingestion-waves.md)
 - [master-plan-status.md](C:/Users/antoi/ATOCore/docs/master-plan-status.md)

 ## Guiding Constraints
docs/ingestion-waves.md (new file, 129 lines)
@@ -0,0 +1,129 @@
# AtoCore Ingestion Waves

## Purpose

This document tracks how the corpus should grow without losing signal quality.

The rule is:

- ingest in waves
- validate retrieval after each wave
- only then widen the source scope

## Wave 1 - Active Project Full Markdown Corpus

Status: complete

Projects:

- `p04-gigabit`
- `p05-interferometer`
- `p06-polisher`

What was ingested:

- the full markdown/text PKM stacks for the three active projects
- selected staged operational docs already under the Dalidou source roots
- selected repo markdown/text context for:
  - `Fullum-Interferometer`
  - `polisher-sim`
  - `Polisher-Toolhead` (when markdown exists)

What was intentionally excluded:

- binaries
- images
- PDFs
- generated outputs unless they were plain text reports
- dependency folders
- hidden runtime junk

Practical result:

- AtoCore moved from a curated-seed corpus to a real active-project corpus
- the live corpus now contains well over one thousand source documents and over
  twenty thousand chunks
- project-specific context building is materially stronger than before

Main lesson from Wave 1:

- full project ingestion is valuable
- but broad historical/archive material can dilute retrieval for underspecified
  prompts
- context quality now depends more strongly on good project hints and better
  ranking than on corpus size alone
## Wave 2 - Trusted Operational Layer Expansion

Status: next

Goal:

- expand `AtoDrive`-style operational truth for the active projects

Candidate inputs:

- current status dashboards
- decision logs
- milestone tracking
- curated requirements baselines
- explicit next-step plans

Why this matters:

- this raises the quality of the high-trust layer instead of only widening
  general retrieval

## Wave 3 - Broader Active Engineering References

Status: planned

Goal:

- ingest reusable engineering references that support the active project set
  without dumping the entire vault

Candidate inputs:

- interferometry reference notes directly tied to `p05`
- polishing physics references directly tied to `p06`
- mirror and structural reference material directly tied to `p04`

Rule:

- only bring in references with a clear connection to active work

## Wave 4 - Wider PKM Population

Status: deferred

Goal:

- widen beyond the active projects while preserving retrieval quality

Preconditions:

- stronger ranking
- better project-aware routing
- stable operational restore path
- clearer promotion rules for trusted state

## Validation After Each Wave

After every ingestion wave, verify:

- `stats`
- project-specific `query`
- project-specific `context-build`
- `debug-context`
- whether trusted project state still dominates when it should
- whether cross-project bleed is getting worse or better
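One of the checks above, cross-project bleed, can be tracked as a single number per wave. A minimal sketch, assuming each retrieval hit carries a `project` tag (a hypothetical field for illustration):

```python
# Hypothetical validation helper: given the hits returned for a
# project-specific query, measure how much of the result set "bleeds"
# in from other projects. Field names are illustrative only.

def bleed_ratio(hits: list[dict], expected_project: str) -> float:
    """Fraction of hits belonging to a project other than the one queried."""
    if not hits:
        return 0.0
    foreign = sum(1 for h in hits if h.get("project") != expected_project)
    return foreign / len(hits)

hits = [
    {"project": "p06-polisher"},
    {"project": "p06-polisher"},
    {"project": "p04-gigabit"},
    {"project": "p06-polisher"},
]
print(bleed_ratio(hits, "p06-polisher"))  # 0.25
```

Recording this ratio after each wave makes "getting worse or better" a comparison of two numbers rather than an impression.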
## Working Rule

The next wave should only happen when the current wave is:

- ingested
- inspected
- retrieval-tested
- operationally stable
@@ -29,9 +29,11 @@ This working list should be read alongside:
    - check whether the top hits are useful
    - check whether trusted project state remains dominant
    - reduce cross-project competition and prompt ambiguity where needed
-3. Continue controlled project ingestion only where the current corpus is still
-   thin
-   - a few additional anchor docs per active project
+   - use `debug-context` to inspect the exact last AtoCore supplement
+3. Treat the active-project full markdown/text wave as complete
+   - `p04-gigabit`
+   - `p05-interferometer`
+   - `p06-polisher`
 4. Define a cleaner source refresh model
    - make the difference between source truth, staged inputs, and machine store
      explicit
@@ -39,15 +41,20 @@
    - foundation now exists via project registry + per-project refresh API
    - registration policy + template + proposal + approved registration are now
      the normal path for new projects
-5. Integrate the new engineering architecture docs into active planning, not immediate schema code
+5. Move to Wave 2 trusted-operational ingestion
+   - curated dashboards
+   - decision logs
+   - milestone/current-status views
+   - operational truth, not just raw project notes
+6. Integrate the new engineering architecture docs into active planning, not immediate schema code
    - keep `docs/architecture/engineering-knowledge-hybrid-architecture.md` as the target layer model
    - keep `docs/architecture/engineering-ontology-v1.md` as the V1 structured-domain target
    - do not start entity/relationship persistence until the ingestion, retrieval, registry, and backup baseline feels boring and stable
-6. Define backup and export procedures for Dalidou
+7. Define backup and export procedures for Dalidou
    - exercise the new SQLite + registry snapshot path on Dalidou
    - Chroma backup or rebuild policy
    - retention and restore validation
-7. Keep deeper automatic runtime integration modest until the organic read-only
+8. Keep deeper automatic runtime integration modest until the organic read-only
    model has proven value

 ## Trusted State Status
@@ -69,36 +76,39 @@ This materially improves `context/build` quality for project-hinted prompts.

 ## Recommended Near-Term Project Work

-The first curated batch is already in.
+The active-project full markdown/text wave is now in.

 The near-term work is now:

 1. strengthen retrieval quality
-2. add a few more anchor docs only where retrieval is still weak
+2. promote or refine trusted operational truth where the broad corpus is now too noisy
 3. keep trusted project state concise and high-confidence
 4. widen only through named ingestion waves

-## Recommended Additional Anchor Docs
+## Recommended Next Wave Inputs

-1. `p04-gigabit`
-2. `p05-interferometer`
-3. `p06-polisher`
+Wave 2 should emphasize trusted operational truth, not bulk historical notes.

 P04:

-- 1 to 2 more strong study summaries
-- 1 to 2 more meeting notes with actual decisions
+- current status dashboard
+- current selected design path
+- current frame interface truth
+- current next-step milestone view

 P05:

-- a couple more architecture docs
-- selected vendor-response notes
-- possibly one or two NX/WAVE consumer docs
+- selected vendor path
+- current error-budget baseline
+- current architecture freeze or open decisions
+- current procurement / next-action view

 P06:

-- more explicit interface/schema docs if needed
-- selected operations or UI docs
-- a distilled non-empty operational context doc to replace an empty `_context.md`
+- current system map
+- current shared contracts baseline
+- current calibration procedure truth
+- current July / proving roadmap view

 ## Deferred On Purpose
@@ -115,6 +125,8 @@ The next batch is successful if:

 - OpenClaw can use AtoCore naturally when context is needed
 - OpenClaw can infer registered projects and call AtoCore organically for
   project-knowledge questions
+- the active-project full corpus wave can be inspected and used concretely
+  through `auto-context`, `context-build`, and `debug-context`
 - OpenClaw can also register a new project cleanly before refreshing it
 - existing project registrations can be refined safely before refresh when the
   staged source set evolves
@@ -82,6 +82,7 @@ The current helper script exposes:
 - `project-template`
 - `detect-project <prompt>`
 - `auto-context <prompt> [budget] [project]`
+- `debug-context`
 - `propose-project ...`
 - `register-project ...`
 - `update-project ...`
@@ -125,6 +126,8 @@ Recommended first behavior:
 1. OpenClaw receives a user request
 2. If the prompt looks like project knowledge, OpenClaw should try:
    - `auto-context "<prompt>" 3000`
+   - optionally `debug-context` immediately after if a human wants to inspect
+     the exact AtoCore supplement
 3. If the prompt is clearly asking for trusted current truth, OpenClaw should
    prefer:
    - `project-state <project>`
@@ -18,6 +18,7 @@ from atocore.context.project_state import (
     set_state,
 )
 from atocore.ingestion.pipeline import (
+    exclusive_ingestion,
     get_ingestion_stats,
     get_source_status,
     ingest_configured_sources,
@@ -153,12 +154,13 @@ def api_ingest(req: IngestRequest) -> IngestResponse:
     """Ingest a markdown file or folder."""
     target = Path(req.path)
     try:
-        if target.is_file():
-            results = [ingest_file(target)]
-        elif target.is_dir():
-            results = ingest_folder(target)
-        else:
-            raise HTTPException(status_code=404, detail=f"Path not found: {req.path}")
+        with exclusive_ingestion():
+            if target.is_file():
+                results = [ingest_file(target)]
+            elif target.is_dir():
+                results = ingest_folder(target)
+            else:
+                raise HTTPException(status_code=404, detail=f"Path not found: {req.path}")
     except HTTPException:
         raise
     except Exception as e:
@@ -171,7 +173,8 @@ def api_ingest(req: IngestRequest) -> IngestResponse:
 def api_ingest_sources() -> IngestSourcesResponse:
     """Ingest enabled configured source directories."""
     try:
-        results = ingest_configured_sources()
+        with exclusive_ingestion():
+            results = ingest_configured_sources()
     except Exception as e:
         log.error("ingest_sources_failed", error=str(e))
         raise HTTPException(status_code=500, detail=f"Configured source ingestion failed: {e}")
@@ -246,7 +249,8 @@ def api_project_update(project_name: str, req: ProjectUpdateRequest) -> dict:
 def api_refresh_project(project_name: str, purge_deleted: bool = False) -> ProjectRefreshResponse:
     """Refresh one registered project from its configured ingest roots."""
     try:
-        result = refresh_registered_project(project_name, purge_deleted=purge_deleted)
+        with exclusive_ingestion():
+            result = refresh_registered_project(project_name, purge_deleted=purge_deleted)
     except ValueError as e:
         raise HTTPException(status_code=404, detail=str(e))
     except Exception as e:
@@ -2,8 +2,10 @@

 import hashlib
 import json
+import threading
 import time
 import uuid
+from contextlib import contextmanager
 from pathlib import Path

 import atocore.config as _config
@@ -17,6 +19,17 @@ log = get_logger("ingestion")

 # Encodings to try when reading markdown files
 _ENCODINGS = ["utf-8", "utf-8-sig", "latin-1", "cp1252"]
+_INGESTION_LOCK = threading.Lock()
+
+
+@contextmanager
+def exclusive_ingestion():
+    """Serialize long-running ingestion operations across API requests."""
+    _INGESTION_LOCK.acquire()
+    try:
+        yield
+    finally:
+        _INGESTION_LOCK.release()


 def ingest_file(file_path: Path) -> dict:
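The serialization behavior introduced above can be exercised in isolation. This standalone sketch uses a simplified equivalent of `exclusive_ingestion` (a `with _INGESTION_LOCK:` body instead of explicit acquire/release, which behaves the same) to show that two threads entering the guard never interleave their work:

```python
import threading
from contextlib import contextmanager

_INGESTION_LOCK = threading.Lock()

@contextmanager
def exclusive_ingestion():
    """Simplified equivalent of the guard above: one ingestion at a time."""
    with _INGESTION_LOCK:
        yield

events = []

def fake_ingest(name: str) -> None:
    """Stand-in for a long-running ingestion call."""
    with exclusive_ingestion():
        events.append(f"{name}:start")
        events.append(f"{name}:end")

threads = [threading.Thread(target=fake_ingest, args=(n,)) for n in ("a", "b")]
for t in threads:
    t.start()
for t in threads:
    t.join()

# events now holds two start/end pairs with no interleaving between threads
```

Because both workers funnel through the same module-level lock, each `start` is always followed immediately by its matching `end`, which is exactly the property the API endpoints rely on.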
@@ -1,5 +1,7 @@
 """Tests for storage-related API readiness endpoints."""

+from contextlib import contextmanager
+
 from fastapi.testclient import TestClient

 import atocore.config as config
@@ -152,6 +154,38 @@ def test_project_refresh_endpoint_uses_registered_roots(tmp_data_dir, monkeypatc
     assert response.json()["project"] == "p05-interferometer"


+def test_project_refresh_endpoint_serializes_ingestion(tmp_data_dir, monkeypatch):
+    config.settings = config.Settings()
+    events = []
+
+    @contextmanager
+    def fake_lock():
+        events.append("enter")
+        try:
+            yield
+        finally:
+            events.append("exit")
+
+    def fake_refresh_registered_project(project_name, purge_deleted=False):
+        events.append(("refresh", project_name, purge_deleted))
+        return {
+            "project": "p05-interferometer",
+            "aliases": ["p05"],
+            "description": "P05 docs",
+            "purge_deleted": purge_deleted,
+            "roots": [],
+        }
+
+    monkeypatch.setattr("atocore.api.routes.exclusive_ingestion", fake_lock)
+    monkeypatch.setattr("atocore.api.routes.refresh_registered_project", fake_refresh_registered_project)
+
+    client = TestClient(app)
+    response = client.post("/projects/p05/refresh")
+
+    assert response.status_code == 200
+    assert events == ["enter", ("refresh", "p05", False), "exit"]
+
+
 def test_projects_template_endpoint_returns_template(tmp_data_dir, monkeypatch):
     config.settings = config.Settings()