docs/ingestion-waves.md

# AtoCore Ingestion Waves

## Purpose

This document tracks how the corpus should grow without losing signal quality.

The rule is:

- ingest in waves
- validate retrieval after each wave
- only then widen the source scope

## Wave 1 - Active Project Full Markdown Corpus

Status: complete

Projects:

- `p04-gigabit`
- `p05-interferometer`
- `p06-polisher`

What was ingested:

- the full markdown/text PKM stacks for the three active projects
- selected staged operational docs already under the Dalidou source roots
- selected repo markdown/text context for:
  - `Fullum-Interferometer`
  - `polisher-sim`
  - `Polisher-Toolhead` (when markdown exists)

What was intentionally excluded:

- binaries
- images
- PDFs
- generated outputs unless they were plain text reports
- dependency folders
- hidden runtime junk
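
The include/exclude rules above can be sketched as a path filter. This is a minimal illustration, not AtoCore's actual ingestion code: the function name `is_ingestable`, the suffix allow-list, and the dependency folder names are all assumptions.

```python
from pathlib import Path

# Assumed allow-list and folder names; adjust to the real source roots.
TEXT_SUFFIXES = {".md", ".markdown", ".txt"}
DEPENDENCY_DIRS = {"node_modules", "venv", "__pycache__", "site-packages"}


def is_ingestable(path: Path) -> bool:
    """Return True if a relative path looks like plain markdown/text worth ingesting."""
    parts = path.parts
    # Skip hidden files and folders (names starting with a dot), e.g. .git.
    if any(part.startswith(".") for part in parts):
        return False
    # Skip anything that lives under a dependency folder.
    if any(part in DEPENDENCY_DIRS for part in parts[:-1]):
        return False
    # Allow-list by suffix: binaries, images, and PDFs fail this check.
    return path.suffix.lower() in TEXT_SUFFIXES
```

An allow-list is used here rather than a deny-list so that unknown file types are excluded by default. Plain text reports from generated outputs pass the suffix check, which matches the exception noted above.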

Practical result:

- AtoCore moved from a curated-seed corpus to a real active-project corpus
- the live corpus now contains well over one thousand source documents and over
  twenty thousand chunks
- project-specific context building is materially stronger than before

Main lesson from Wave 1:

- full project ingestion is valuable
- broad historical/archive material can still dilute retrieval for
  underspecified prompts
- context quality now depends more on good project hints and better ranking
  than on corpus size alone
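
One way to picture how a project hint can offset dilution is a simple re-ranking step. The sketch below is hypothetical: the `Hit` type, the `rerank` function, and the multiplicative `boost` value are illustrative assumptions, not a description of how AtoCore ranks results.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Hit:
    chunk_id: str
    project: str  # project the chunk was ingested under
    score: float  # base similarity score, higher is better


def rerank(hits: List[Hit], project_hint: Optional[str], boost: float = 1.5) -> List[Hit]:
    """Sort hits by score, multiplying the score of hint-matching chunks by `boost`."""

    def adjusted(hit: Hit) -> float:
        if project_hint is not None and hit.project == project_hint:
            return hit.score * boost
        return hit.score

    return sorted(hits, key=adjusted, reverse=True)
```

Without a hint the order is purely score-based, which is where a large archive can crowd out the relevant project. With a hint, matching chunks move up without the other projects being removed from the results.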

## Wave 2 - Trusted Operational Layer Expansion

Status: next

Goal:

- expand `AtoDrive`-style operational truth for the active projects

Candidate inputs:

- current status dashboards
- decision logs
- milestone tracking
- curated requirements baselines
- explicit next-step plans

Why this matters:

- this raises the quality of the high-trust layer instead of only widening
  general retrieval

## Wave 3 - Broader Active Engineering References

Status: planned

Goal:

- ingest reusable engineering references that support the active project set
  without dumping the entire vault

Candidate inputs:

- interferometry reference notes directly tied to `p05`
- polishing physics references directly tied to `p06`
- mirror and structural reference material directly tied to `p04`

Rule:

- only bring in references with a clear connection to active work

## Wave 4 - Wider PKM Population

Status: deferred

Goal:

- widen beyond the active projects while preserving retrieval quality

Preconditions:

- stronger ranking
- better project-aware routing
- stable operational restore path
- clearer promotion rules for trusted state

## Validation After Each Wave

After every ingestion wave, check:

- `stats`
- project-specific `query`
- project-specific `context-build`
- `debug-context`
- that trusted project state still dominates when it should
- whether cross-project bleed has gotten worse or better
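
Cross-project bleed is easier to compare between waves if it is reduced to a number. The sketch below assumes each retrieved chunk can be labeled with its source project. The `bleed_ratio` helper is hypothetical and is not part of the `stats` or `debug-context` output.

```python
from collections import Counter
from typing import Iterable


def bleed_ratio(retrieved_projects: Iterable[str], expected_project: str) -> float:
    """Fraction of retrieved chunks that belong to a project other than the expected one."""
    counts = Counter(retrieved_projects)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # nothing retrieved, so nothing bled
    return 1.0 - counts[expected_project] / total


# Example: three of four chunks came from the expected project.
labels = ["p05-interferometer"] * 3 + ["p06-polisher"]
print(bleed_ratio(labels, "p05-interferometer"))  # 0.25
```

Running the same fixed set of project-specific prompts before and after a wave, and comparing the ratios, gives a rough signal of whether bleed moved up or down.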

## Working Rule

The next wave should start only when the current wave is:

- ingested
- inspected
- retrieval-tested
- operationally stable
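
The working rule amounts to a four-part gate. A minimal sketch follows, assuming each criterion is tracked as a boolean. `WaveState`, `blocking_checks`, and `can_start_next_wave` are illustrative names, not existing AtoCore code.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class WaveState:
    ingested: bool = False
    inspected: bool = False
    retrieval_tested: bool = False
    operationally_stable: bool = False


def blocking_checks(state: WaveState) -> List[str]:
    """Names of the criteria that are not yet satisfied, in declaration order."""
    return [f.name for f in fields(state) if not getattr(state, f.name)]


def can_start_next_wave(state: WaveState) -> bool:
    """The next wave may start only when no criterion is still blocking."""
    return not blocking_checks(state)
```

Returning the names of the blocking criteria, rather than a bare boolean, makes it clear what still has to happen before the source scope is widened.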

Expand active project wave and serialize refreshes 2026-04-06 14:58:14 -04:00