Expand active project wave and serialize refreshes

2026-04-06 14:58:14 -04:00
parent 46a5d5887a
commit bdb42dba05
8 changed files with 243 additions and 45 deletions
--- a/docs/current-state.md
+++ b/docs/current-state.md
@@ -52,7 +52,7 @@ now includes a first curated ingestion batch for the active projects.
 - Dalidou Docker deployment foundation
 - initial AtoCore self-knowledge corpus ingested on Dalidou
 - T420/OpenClaw read-only AtoCore helper skill
-- first curated active-project corpus batch for:
+- full active-project markdown/text corpus wave for:
  - `p04-gigabit`
  - `p05-interferometer`
  - `p06-polisher`
@@ -87,7 +87,7 @@ The Dalidou instance already contains:
 - Master Plan V3
 - Build Spec V1
 - trusted project-state entries for `atocore`
-- curated staged project docs for:
+- full staged project markdown/text corpora for:
  - `p04-gigabit`
  - `p05-interferometer`
  - `p06-polisher`
@@ -99,12 +99,12 @@ The Dalidou instance already contains:
  - `p05-interferometer`
  - `p06-polisher`

-Current live stats after the latest documentation sync and active-project ingest
-passes:
+Current live stats after the full active-project wave are now far beyond the
+initial seed stage:

-- `source_documents`: 36
-- `source_chunks`: 568
-- `vectors`: 568
+- more than `1,100` source documents
+- more than `20,000` chunks
+- a vector count matching the chunk count

 The broader long-term corpus is still not fully populated yet. Wider project and
 vault ingestion remains a deliberate next step rather than something already
@@ -115,8 +115,8 @@ primarily visible under:

 - `/srv/storage/atocore/sources/vault/incoming/projects`

-This staged area is now useful for review because it contains the curated
-project docs that were actually ingested for the first active-project batch.
+This staged area is now useful for review because it contains the markdown/text
+project docs that were actually ingested for the full active-project wave.

 It is important to read this staged area correctly:

@@ -166,10 +166,12 @@ These are curated summaries and extracted stable project signals.

 In `source_documents` / retrieval corpus:

-- real project documents are now present for the same active project set
+- full project markdown/text corpora are now present for the active project set
 - retrieval is no longer limited to AtoCore self-knowledge only
-- the current corpus is still selective rather than exhaustive
-- that selectivity is intentional at this stage
+- the current corpus is broad enough that ranking quality matters more than
+  corpus presence alone
+- underspecified prompts can still pull in historical or archive material, so
+  project-aware routing and better ranking remain important

 The source refresh model now has a concrete foundation in code:

@@ -223,8 +225,8 @@ This separation is healthy:
 ## Immediate Next Focus

 1. Use the new T420-side organic routing layer in real OpenClaw workflows
-2. Keep tightening retrieval quality for the newly seeded active projects
-3. Define the first broader AtoVault/AtoDrive ingestion batches
+2. Tighten retrieval quality for the now fully ingested active project corpora
+3. Move to Wave 2 trusted-operational ingestion instead of blindly widening the raw corpus further
 4. Keep the new engineering-knowledge architecture docs as implementation guidance while avoiding premature schema work
 5. Expand the boring operations baseline:
   - restore validation
@@ -234,6 +236,7 @@ This separation is healthy:

 See also:

+- [ingestion-waves.md](ingestion-waves.md)
 - [master-plan-status.md](C:/Users/antoi/ATOCore/docs/master-plan-status.md)

 ## Guiding Constraints
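The live-stats change in `docs/current-state.md` implies three checkable conditions: more than 1,100 source documents, more than 20,000 chunks, and a vector count equal to the chunk count. A minimal Python sketch of that check follows. It assumes the `stats` output can be loaded into a dict keyed by `source_documents`, `source_chunks`, and `vectors` (the labels used in the removed lines). The function name `check_wave1_stats` is hypothetical and not part of AtoCore.

```python
from typing import Dict, List

# Thresholds taken from the updated doc text ("more than 1,100", "more than 20,000").
MIN_DOCUMENTS = 1_100
MIN_CHUNKS = 20_000


def check_wave1_stats(stats: Dict[str, int]) -> List[str]:
    """Return human-readable problems; an empty list means the stats look consistent.

    Assumption: `stats` uses the keys shown in the pre-wave doc text.
    """
    problems = []
    if stats["source_documents"] <= MIN_DOCUMENTS:
        problems.append("source_documents is not above 1,100")
    if stats["source_chunks"] <= MIN_CHUNKS:
        problems.append("source_chunks is not above 20,000")
    # Every chunk should have exactly one vector.
    if stats["vectors"] != stats["source_chunks"]:
        problems.append("vectors does not match source_chunks")
    return problems
```

Fed the pre-wave numbers from the removed lines (36 documents, 568 chunks, 568 vectors), this reports the two size thresholds as unmet and the vector match as satisfied.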
--- a/docs/ingestion-waves.md
+++ b/docs/ingestion-waves.md
@@ -0,0 +1,129 @@
+# AtoCore Ingestion Waves
+
+## Purpose
+
+This document tracks how the corpus should grow without losing signal quality.
+
+The rule is:
+
+- ingest in waves
+- validate retrieval after each wave
+- only then widen the source scope
+
+## Wave 1 - Active Project Full Markdown Corpus
+
+Status: complete
+
+Projects:
+
+- `p04-gigabit`
+- `p05-interferometer`
+- `p06-polisher`
+
+What was ingested:
+
+- the full markdown/text PKM stacks for the three active projects
+- selected staged operational docs already under the Dalidou source roots
+- selected repo markdown/text context for:
+  - `Fullum-Interferometer`
+  - `polisher-sim`
+  - `Polisher-Toolhead` (when markdown exists)
+
+What was intentionally excluded:
+
+- binaries
+- images
+- PDFs
+- generated outputs unless they were plain-text reports
+- dependency folders
+- hidden runtime junk
+
+Practical result:
+
+- AtoCore moved from a curated-seed corpus to a real active-project corpus
+- the live corpus now contains well over one thousand source documents and over
+  twenty thousand chunks
+- project-specific context building is materially stronger than before
+
+Main lesson from Wave 1:
+
+- full project ingestion is valuable
+- but broad historical/archive material can dilute retrieval for underspecified
+  prompts
+- context quality now depends more strongly on good project hints and better
+  ranking than on corpus size alone
+
+## Wave 2 - Trusted Operational Layer Expansion
+
+Status: next
+
+Goal:
+
+- expand `AtoDrive`-style operational truth for the active projects
+
+Candidate inputs:
+
+- current status dashboards
+- decision logs
+- milestone tracking
+- curated requirements baselines
+- explicit next-step plans
+
+Why this matters:
+
+- this raises the quality of the high-trust layer instead of only widening
+  general retrieval
+
+## Wave 3 - Broader Active Engineering References
+
+Status: planned
+
+Goal:
+
+- ingest reusable engineering references that support the active project set
+  without dumping the entire vault
+
+Candidate inputs:
+
+- interferometry reference notes directly tied to `p05`
+- polishing physics references directly tied to `p06`
+- mirror and structural reference material directly tied to `p04`
+
+Rule:
+
+- only bring in references with a clear connection to active work
+
+## Wave 4 - Wider PKM Population
+
+Status: deferred
+
+Goal:
+
+- widen beyond the active projects while preserving retrieval quality
+
+Preconditions:
+
+- stronger ranking
+- better project-aware routing
+- stable operational restore path
+- clearer promotion rules for trusted state
+
+## Validation After Each Wave
+
+After every ingestion wave, verify:
+
+- `stats`
+- project-specific `query`
+- project-specific `context-build`
+- `debug-context`
+- whether trusted project state still dominates when it should
+- whether cross-project bleed is getting worse or better
+
+## Working Rule
+
+The next wave should only happen when the current wave is:
+
+- ingested
+- inspected
+- retrieval-tested
+- operationally stable
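The "Validation After Each Wave" list and the "Working Rule" in the new `docs/ingestion-waves.md` can be read together as a gate on starting the next wave. A minimal Python sketch of that gate follows. `WaveStatus`, `VALIDATION_CHECKS`, and `may_start_next_wave` are hypothetical names, not part of AtoCore, and treating "retrieval-tested" as "every listed validation check signed off" is an interpretation rather than something the doc states.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Check names copied from "Validation After Each Wave" in the new doc.
VALIDATION_CHECKS = (
    "stats",
    "project-specific query",
    "project-specific context-build",
    "debug-context",
    "trusted project state still dominates",
    "cross-project bleed trend",
)


@dataclass
class WaveStatus:
    """Hypothetical record of one ingestion wave's progress."""
    name: str
    ingested: bool = False
    inspected: bool = False
    operationally_stable: bool = False
    # Maps each validation check to whether a human has signed it off.
    validation: Dict[str, bool] = field(default_factory=dict)

    def retrieval_tested(self) -> bool:
        # Assumption: retrieval counts as tested only when every check passed.
        return all(self.validation.get(check, False) for check in VALIDATION_CHECKS)

    def blockers(self) -> List[str]:
        """List the Working Rule conditions that are still unmet."""
        conditions = [
            ("ingested", self.ingested),
            ("inspected", self.inspected),
            ("retrieval-tested", self.retrieval_tested()),
            ("operationally stable", self.operationally_stable),
        ]
        return [label for label, met in conditions if not met]


def may_start_next_wave(current: WaveStatus) -> bool:
    """The next wave may start only when the current one has no blockers."""
    return not current.blockers()
```

Returning the list of blockers instead of a bare boolean keeps the reason for holding a wave visible, which matches the doc's emphasis on inspecting each wave before widening scope.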
--- a/docs/next-steps.md
+++ b/docs/next-steps.md
@@ -29,9 +29,11 @@ This working list should be read alongside:
   - check whether the top hits are useful
   - check whether trusted project state remains dominant
   - reduce cross-project competition and prompt ambiguity where needed
-3. Continue controlled project ingestion only where the current corpus is still
-   thin
-   - a few additional anchor docs per active project
+   - use `debug-context` to inspect the exact last AtoCore supplement
+3. Treat the active-project full markdown/text wave as complete
+   - `p04-gigabit`
+   - `p05-interferometer`
+   - `p06-polisher`
 4. Define a cleaner source refresh model
   - make the difference between source truth, staged inputs, and machine store
     explicit
@@ -39,15 +41,20 @@ This working list should be read alongside:
   - foundation now exists via project registry + per-project refresh API
   - registration policy + template + proposal + approved registration are now
     the normal path for new projects
-5. Integrate the new engineering architecture docs into active planning, not immediate schema code
+5. Move to Wave 2 trusted-operational ingestion
+   - curated dashboards
+   - decision logs
+   - milestone/current-status views
+   - operational truth, not just raw project notes
+6. Integrate the new engineering architecture docs into active planning, not immediate schema code
   - keep `docs/architecture/engineering-knowledge-hybrid-architecture.md` as the target layer model
   - keep `docs/architecture/engineering-ontology-v1.md` as the V1 structured-domain target
   - do not start entity/relationship persistence until the ingestion, retrieval, registry, and backup baseline feels boring and stable
-6. Define backup and export procedures for Dalidou
+7. Define backup and export procedures for Dalidou
   - exercise the new SQLite + registry snapshot path on Dalidou
   - Chroma backup or rebuild policy
   - retention and restore validation
-7. Keep deeper automatic runtime integration modest until the organic read-only
+8. Keep deeper automatic runtime integration modest until the organic read-only
   model has proven value

 ## Trusted State Status
@@ -69,36 +76,39 @@ This materially improves `context/build` quality for project-hinted prompts.

 ## Recommended Near-Term Project Work

-The first curated batch is already in.
+The active-project full markdown/text wave is now in.

 The near-term work is now:

 1. strengthen retrieval quality
-2. add a few more anchor docs only where retrieval is still weak
+2. promote or refine trusted operational truth where the broad corpus is now too noisy
 3. keep trusted project state concise and high-confidence
+4. widen only through named ingestion waves

-## Recommended Additional Anchor Docs
+## Recommended Next Wave Inputs

-1. `p04-gigabit`
-2. `p05-interferometer`
-3. `p06-polisher`
+Wave 2 should emphasize trusted operational truth, not bulk historical notes.

 P04:

-- 1 to 2 more strong study summaries
-- 1 to 2 more meeting notes with actual decisions
+- current status dashboard
+- current selected design path
+- current frame interface truth
+- current next-step milestone view

 P05:

-- a couple more architecture docs
-- selected vendor-response notes
-- possibly one or two NX/WAVE consumer docs
+- selected vendor path
+- current error-budget baseline
+- current architecture freeze or open decisions
+- current procurement / next-action view

 P06:

-- more explicit interface/schema docs if needed
-- selected operations or UI docs
-- a distilled non-empty operational context doc to replace an empty `_context.md`
+- current system map
+- current shared contracts baseline
+- current calibration procedure truth
+- current July / proving roadmap view

 ## Deferred On Purpose

@@ -115,6 +125,8 @@ The next batch is successful if:
 - OpenClaw can use AtoCore naturally when context is needed
 - OpenClaw can infer registered projects and call AtoCore organically for
  project-knowledge questions
+- the active-project full corpus wave can be inspected and used concretely
+  through `auto-context`, `context-build`, and `debug-context`
 - OpenClaw can also register a new project cleanly before refreshing it
 - existing project registrations can be refined safely before refresh when the
  staged source set evolves
--- a/docs/openclaw-integration-contract.md
+++ b/docs/openclaw-integration-contract.md
@@ -82,6 +82,7 @@ The current helper script exposes:
 - `project-template`
 - `detect-project <prompt>`
 - `auto-context <prompt> [budget] [project]`
+- `debug-context`
 - `propose-project ...`
 - `register-project ...`
 - `update-project ...`
@@ -125,6 +126,8 @@ Recommended first behavior:
 1. OpenClaw receives a user request
 2. If the prompt looks like project knowledge, OpenClaw should try:
   - `auto-context "<prompt>" 3000`
+   - optionally `debug-context` immediately after if a human wants to inspect
+     the exact AtoCore supplement
 3. If the prompt is clearly asking for trusted current truth, OpenClaw should
   prefer:
   - `project-state <project>`
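The recommended first behavior in this hunk amounts to a small routing decision over the helper subcommands. A minimal Python sketch follows, assuming the caller has already decided whether the prompt looks like project knowledge and whether it asks for trusted current truth (the contract does not say how that is detected). `plan_helper_calls` and its parameters are hypothetical; only the subcommand names and the `auto-context <prompt> [budget] [project]` argument order come from the contract. Falling back to `auto-context` when no project is known is also an assumption.

```python
from typing import List, Optional


def plan_helper_calls(
    prompt: str,
    *,
    looks_like_project_knowledge: bool,
    wants_trusted_truth: bool,
    project: Optional[str] = None,
    budget: int = 3000,
    inspect: bool = False,
) -> List[List[str]]:
    """Return helper subcommands (as argv lists) in the order to run them."""
    # Trusted current truth takes priority when the project is known.
    if wants_trusted_truth and project is not None:
        return [["project-state", project]]

    if not looks_like_project_knowledge:
        return []  # Nothing to ask AtoCore for.

    call = ["auto-context", prompt, str(budget)]
    if project is not None:
        call.append(project)  # Optional third argument per the helper listing.
    calls = [call]
    if inspect:
        # Lets a human see the exact AtoCore supplement just produced.
        calls.append(["debug-context"])
    return calls
```

For example, `plan_helper_calls("p05 error budget?", looks_like_project_knowledge=True, wants_trusted_truth=False, inspect=True)` yields an `auto-context` call with the default 3000 budget followed by `debug-context`.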