# AtoCore Ingestion Waves

## Purpose

This document tracks how the corpus should grow without losing signal quality.

The rule is:

- ingest in waves
- validate retrieval after each wave
- only then widen the source scope

## Wave 1 - Active Project Full Markdown Corpus

Status: complete

Projects:

- `p04-gigabit`
- `p05-interferometer`
- `p06-polisher`

What was ingested:

- the full markdown/text PKM stacks for the three active projects
- selected staged operational docs already under the Dalidou source roots
- selected repo markdown/text context for:
  - `Fullum-Interferometer`
  - `polisher-sim`
  - `Polisher-Toolhead` (when markdown exists)

What was intentionally excluded:

- binaries
- images
- PDFs
- generated outputs unless they were plain text reports
- dependency folders
- hidden runtime junk
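The exclusion list above can be approximated with a suffix allowlist plus two path checks. This is a minimal sketch, not the actual AtoCore ingestion filter: the function name `is_ingestible`, the suffix set, and the dependency folder names are all assumptions chosen for illustration.

```python
from pathlib import Path

# Hypothetical sets; the real ingestion rules may use different values.
TEXT_SUFFIXES = {".md", ".markdown", ".txt"}
DEPENDENCY_DIRS = {"node_modules", "venv", "__pycache__", "site-packages"}


def is_ingestible(path: Path) -> bool:
    """Return True if a path relative to a source root looks like plain text.

    Assumes `path` is relative to the source root, so a hidden parent
    directory above the root does not affect the result.
    """
    parts = path.parts
    # Skip hidden files and folders such as .git or .obsidian.
    if any(part.startswith(".") for part in parts):
        return False
    # Skip dependency folders anywhere in the path.
    if any(part in DEPENDENCY_DIRS for part in parts):
        return False
    # An allowlist of text suffixes leaves out binaries, images, and PDFs.
    return path.suffix.lower() in TEXT_SUFFIXES
```

An allowlist is used rather than a blocklist so that unknown binary formats are skipped by default. Generated outputs pass only when they are plain text, which roughly matches the "plain text reports" exception.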

Practical result:

- AtoCore moved from a curated-seed corpus to a real active-project corpus
- the live corpus now contains well over one thousand source documents and over
  twenty thousand chunks
- project-specific context building is materially stronger than before

Main lesson from Wave 1:

- full project ingestion is valuable
- but broad historical/archive material can dilute retrieval for underspecified
  prompts
- context quality now depends more strongly on good project hints and better
  ranking than on corpus size alone
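One way to picture how a project hint helps ranking is a simple score boost for chunks from the hinted project. This is an illustrative sketch only: the `Hit` type, the `rerank_with_hint` function, and the multiplicative `boost` are assumptions, not AtoCore's ranking logic, and it assumes non-negative base scores.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    chunk_id: str
    project: str  # project the chunk was ingested under
    score: float  # base similarity score, higher is better


def rerank_with_hint(hits, project_hint=None, boost=1.5):
    """Sort hits by score, multiplying hits from the hinted project by `boost`."""

    def adjusted(hit):
        if project_hint is not None and hit.project == project_hint:
            return hit.score * boost
        return hit.score

    return sorted(hits, key=adjusted, reverse=True)
```

Without a hint the order is driven by raw similarity alone, which is where broad archive material can crowd out the relevant project for an underspecified prompt.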

## Wave 2 - Trusted Operational Layer Expansion

Status: next

Goal:

- expand `AtoDrive`-style operational truth for the active projects

Candidate inputs:

- current status dashboards
- decision logs
- milestone tracking
- curated requirements baselines
- explicit next-step plans

Why this matters:

- this raises the quality of the high-trust layer instead of only widening
  general retrieval

## Wave 3 - Broader Active Engineering References

Status: planned

Goal:

- ingest reusable engineering references that support the active project set
  without dumping the entire vault

Candidate inputs:

- interferometry reference notes directly tied to `p05`
- polishing physics references directly tied to `p06`
- mirror and structural reference material directly tied to `p04`

Rule:

- only bring in references with a clear connection to active work

## Wave 4 - Wider PKM Population

Status: deferred

Goal:

- widen beyond the active projects while preserving retrieval quality

Preconditions:

- stronger ranking
- better project-aware routing
- stable operational restore path
- clearer promotion rules for trusted state

## Validation After Each Wave

After every ingestion wave, verify:

- `stats`
- project-specific `query`
- project-specific `context-build`
- `debug-context`
- whether trusted project state still dominates when it should
- whether cross-project bleed is getting worse or better
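The last check above can be made measurable by recording, for a fixed set of project-specific prompts, which project each retrieved chunk came from. The sketch below assumes those project labels have already been collected from query results; the function names `cross_project_bleed` and `bleed_trend` are hypothetical and not part of any AtoCore command.

```python
def cross_project_bleed(retrieved_projects, expected_project):
    """Fraction of retrieved chunks that belong to a project other than the expected one."""
    if not retrieved_projects:
        return 0.0
    foreign = sum(1 for project in retrieved_projects if project != expected_project)
    return foreign / len(retrieved_projects)


def bleed_trend(before, after):
    """Compare two bleed fractions measured on the same prompt before and after a wave."""
    if after > before:
        return "worse"
    if after < before:
        return "better"
    return "unchanged"
```

Running the same prompts before and after a wave keeps the comparison meaningful, since bleed depends heavily on how specific the prompt is.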

## Working Rule

The next wave should only happen when the current wave is:

- ingested
- inspected
- retrieval-tested
- operationally stable
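The four criteria above amount to a gate that must be fully satisfied before the next wave starts. A minimal sketch of that gate follows, assuming each criterion is recorded by hand as a boolean; the `WaveState` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WaveState:
    name: str
    ingested: bool = False
    inspected: bool = False
    retrieval_tested: bool = False
    operationally_stable: bool = False

    def missing(self):
        """Return the criteria that are not yet satisfied, in checklist order."""
        checks = {
            "ingested": self.ingested,
            "inspected": self.inspected,
            "retrieval-tested": self.retrieval_tested,
            "operationally stable": self.operationally_stable,
        }
        return [label for label, ok in checks.items() if not ok]

    def ready_for_next(self):
        """True only when every criterion in the working rule holds."""
        return not self.missing()
```

Listing the unmet criteria, rather than returning only a yes/no answer, makes it clear what still blocks the next wave.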