ATOCore/docs/current-state.md

# AtoCore Current State

## Status Summary

AtoCore is no longer just a proof of concept. The local engine exists, the
correctness pass is complete, Dalidou now hosts the canonical runtime and
machine-storage location, and the T420/OpenClaw side now has a safe read-only
path to consume AtoCore. The live corpus is no longer just self-knowledge: it
now includes the first ingestion wave, covering the full markdown/text corpora
of the active projects.

## Phase Assessment

- completed
  - Phase 0
  - Phase 0.5
  - Phase 1
- baseline complete
  - Phase 2
  - Phase 3
  - Phase 5
  - Phase 7
  - Phase 9 (Commits A/B/C: capture, reinforcement, extractor + review queue)
- partial
  - Phase 4
  - Phase 8
- not started
  - Phase 6
  - Phase 10
  - Phase 11
  - Phase 12
  - Phase 13

## What Exists Today

- ingestion pipeline
- parser and chunker
- SQLite-backed memory and project state
- vector retrieval
- context builder
- API routes for query, context, health, and source status
- project registry and per-project refresh foundation
- project registration lifecycle:
  - template
  - proposal preview
  - approved registration
  - safe update of existing project registrations
  - refresh
- implementation-facing architecture notes for:
  - engineering knowledge hybrid architecture
  - engineering ontology v1
- env-driven storage and deployment paths
- Dalidou Docker deployment foundation
- initial AtoCore self-knowledge corpus ingested on Dalidou
- T420/OpenClaw read-only AtoCore helper skill
- full active-project markdown/text corpus wave for:
  - `p04-gigabit`
  - `p05-interferometer`
  - `p06-polisher`

## What Is True On Dalidou

- deployed repo location:
  - `/srv/storage/atocore/app`
- canonical machine DB location:
  - `/srv/storage/atocore/data/db/atocore.db`
- canonical vector store location:
  - `/srv/storage/atocore/data/chroma`
- source input locations:
  - `/srv/storage/atocore/sources/vault`
  - `/srv/storage/atocore/sources/drive`

The service and storage foundation are live on Dalidou.

The machine-data host is real and canonical.

The project registry is now also persisted in a canonical mounted config path on
Dalidou:

- `/srv/storage/atocore/config/project-registry.json`
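
As a minimal sketch of how env-driven paths like these might be resolved from a single root: the variable name `ATOCORE_ROOT` and the `StoragePaths` type are hypothetical and used only for illustration, while the defaults mirror the Dalidou layout listed above.

```python
import os
from dataclasses import dataclass
from pathlib import PurePosixPath

# Default root taken from the Dalidou layout described in this document.
DEFAULT_ROOT = "/srv/storage/atocore"


@dataclass(frozen=True)
class StoragePaths:
    db: PurePosixPath
    chroma: PurePosixPath
    vault: PurePosixPath
    drive: PurePosixPath
    registry: PurePosixPath


def resolve_paths(env=None) -> StoragePaths:
    """Derive every storage path from one root.

    ATOCORE_ROOT is a hypothetical variable name, not the project's real setting.
    """
    env = os.environ if env is None else env
    root = PurePosixPath(env.get("ATOCORE_ROOT", DEFAULT_ROOT))
    return StoragePaths(
        db=root / "data" / "db" / "atocore.db",
        chroma=root / "data" / "chroma",
        vault=root / "sources" / "vault",
        drive=root / "sources" / "drive",
        registry=root / "config" / "project-registry.json",
    )
```

Deriving all paths from one root keeps the machine DB, vector store, source inputs, and registry together, so a test or staging instance only needs to override a single value.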

The content corpus is partially populated now.

The Dalidou instance already contains:

- AtoCore ecosystem and hosting docs
- current-state and OpenClaw integration docs
- Master Plan V3
- Build Spec V1
- trusted project-state entries for `atocore`
- full staged project markdown/text corpora for:
  - `p04-gigabit`
  - `p05-interferometer`
  - `p06-polisher`
- curated repo-context docs for:
  - `p05`: `Fullum-Interferometer`
  - `p06`: `polisher-sim`
- trusted project-state entries for:
  - `p04-gigabit`
  - `p05-interferometer`
  - `p06-polisher`

Current live stats after the full active-project wave are now far beyond the
initial seed stage:

- more than `1,100` source documents
- more than `20,000` chunks
- a vector count that matches the chunk count

The broader long-term corpus is not fully populated yet. Wider project and
vault ingestion remains a deliberate next step rather than something already
completed, but the corpus is now meaningfully seeded beyond AtoCore's own docs.

For human-readable quality review, the current staged project markdown corpus is
primarily visible under:

- `/srv/storage/atocore/sources/vault/incoming/projects`

This staged area is now useful for review because it contains the markdown/text
project docs that were actually ingested for the full active-project wave.

It is important to read this staged area correctly:

- it is a readable ingestion input layer
- it is not the final machine-memory representation itself
- seeing familiar PKM-style notes there is expected
- the machine-processed intelligence lives in the DB, chunks, vectors, memory,
  trusted project state, and context-builder outputs

## What Is True On The T420

- SSH access is working
- OpenClaw workspace inspected at `/home/papa/clawd`
- OpenClaw's own memory system remains unchanged
- a read-only AtoCore integration skill exists in the workspace:
  - `/home/papa/clawd/skills/atocore-context/`
- the T420 can successfully reach Dalidou AtoCore over network/Tailscale
- fail-open behavior has been verified for the helper path
- OpenClaw can now seed AtoCore in two distinct ways:
  - project-scoped memory entries
  - staged document ingestion into the retrieval corpus
- the helper now supports the practical registered-project lifecycle:
  - `projects`
  - `project-template`
  - `propose-project`
  - `register-project`
  - `update-project`
  - `refresh-project`
- the helper now also supports the first organic routing layer:
  - `detect-project "<prompt>"`
  - `auto-context "<prompt>" [budget] [project]`
- OpenClaw can now default to AtoCore for project-knowledge questions without
  requiring explicit helper commands from the human every time
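
The fail-open behavior of the helper path can be illustrated with a small sketch. This is not the actual skill code: the base URL, the request fields (`prompt`, `budget`, `project`), and the `context` response field are all assumptions. Only the `context/build` route name comes from this document.

```python
import http.client
import json
import urllib.request

# Hypothetical base URL; the real host and port are not given in this document.
ATOCORE_URL = "http://dalidou:8100"


def http_post_json(url, payload, timeout):
    """POST a JSON body and decode the JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def auto_context(prompt, budget=3000, project=None, post=http_post_json, timeout=5.0):
    """Return context text, or "" when AtoCore is unreachable or replies badly."""
    payload = {"prompt": prompt, "budget": budget}  # assumed field names
    if project is not None:
        payload["project"] = project
    try:
        reply = post(f"{ATOCORE_URL}/context/build", payload, timeout)
    except (OSError, ValueError, http.client.HTTPException):
        # Fail open: network errors and timeouts are OSError subclasses,
        # malformed JSON raises ValueError. The caller carries on without context.
        return ""
    context = reply.get("context", "") if isinstance(reply, dict) else ""
    return context if isinstance(context, str) else ""
```

Returning an empty string instead of raising means an AtoCore outage degrades to plain OpenClaw behavior rather than a failed request.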

## What Exists In Memory vs Corpus

These remain separate and that is intentional.

In `/memory`:

- project-scoped curated memories now exist for:
  - `p04-gigabit`: 5 memories
  - `p05-interferometer`: 6 memories
  - `p06-polisher`: 8 memories

These are curated summaries and extracted stable project signals.

In `source_documents` / retrieval corpus:

- full project markdown/text corpora are now present for the active project set
- retrieval is no longer limited to AtoCore self-knowledge only
- the current corpus is broad enough that ranking quality matters more than
  corpus presence alone
- underspecified prompts can still pull in historical or archive material, so
  project-aware routing and better ranking remain important

The source refresh model now has a concrete foundation in code:

- a project registry file defines known project ids, aliases, and ingest roots
- the API can list registered projects
- the API can return a registration template
- the API can preview a registration without mutating state
- the API can persist an approved registration
- the API can update an existing registered project without changing its canonical id
- the API can refresh one registered project at a time

This lifecycle is now coherent end to end for normal use.
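
To make the registry idea concrete, here is a hedged sketch of what a registry with ids, aliases, and ingest roots could look like, together with a simple alias-based project detector. The JSON schema, the alias values, the ingest-root paths, and the `detect_project` function are illustrative assumptions, not the real `project-registry.json` format or the helper's actual matching logic.

```python
import json
import re

# Illustrative registry content; the real schema is an assumption.
REGISTRY = json.loads("""
{
  "projects": [
    {"id": "p04-gigabit", "aliases": ["gigabit", "p04"],
     "ingest_roots": ["vault/incoming/projects/p04-gigabit"]},
    {"id": "p05-interferometer", "aliases": ["interferometer", "p05"],
     "ingest_roots": ["vault/incoming/projects/p05-interferometer"]},
    {"id": "p06-polisher", "aliases": ["polisher", "p06"],
     "ingest_roots": ["vault/incoming/projects/p06-polisher"]}
  ]
}
""")


def detect_project(prompt, registry):
    """Return the canonical id of the first project named in the prompt, else None."""
    text = prompt.lower()
    for project in registry["projects"]:
        names = [project["id"], *project.get("aliases", [])]
        for name in names:
            # Require non-word boundaries so "p04" does not match inside "p040".
            pattern = rf"(?<![\w-]){re.escape(name.lower())}(?![\w-])"
            if re.search(pattern, text):
                return project["id"]
    return None
```

For example, `detect_project("status of the interferometer alignment?", REGISTRY)` resolves to the canonical id `p05-interferometer`, while a prompt that names no project returns `None`, leaving the caller to fall back to unscoped retrieval.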

The first live update passes on existing registered projects have now been
verified against `p04-gigabit` and `p05-interferometer`:

- the registration description can be updated safely
- the canonical project id remains unchanged
- refresh still behaves cleanly after the update
- `context/build` still returns useful project-specific context afterward
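
The "update without changing the canonical id" rule can be sketched as follows. The field names in `MUTABLE_FIELDS` and the `update_project` function are hypothetical. The point is only that the id is never among the fields an update may touch.

```python
import copy

# Assumed set of updatable fields; "id" is deliberately absent.
MUTABLE_FIELDS = {"description", "aliases", "ingest_roots"}


def update_project(registry, project_id, changes):
    """Return a new registry with `changes` applied to one project.

    The input registry is left untouched and the canonical id cannot be changed.
    """
    illegal = set(changes) - MUTABLE_FIELDS
    if illegal:
        raise ValueError(f"fields cannot be updated: {sorted(illegal)}")
    updated = copy.deepcopy(registry)
    for project in updated["projects"]:
        if project["id"] == project_id:
            project.update(changes)
            return updated
    raise KeyError(project_id)
```

Working on a deep copy keeps a rejected or failed update from leaving the in-memory registry half-modified.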

## Reliability Baseline

The runtime has now been hardened in a few practical ways:

- SQLite connections use a configurable busy timeout
- SQLite uses WAL mode to reduce transient lock pain under normal concurrent use
- project registry writes are atomic file replacements rather than in-place rewrites
- a first runtime backup path now exists for:
  - SQLite
  - project registry
  - backup metadata

This does not eliminate every concurrency edge, but it materially improves the
current operational baseline.
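
A minimal sketch of the first three hardening points, using only the Python standard library. The function names and the default timeout are illustrative assumptions, not AtoCore's actual code.

```python
import json
import os
import sqlite3
import tempfile


def connect(db_path, busy_timeout_ms=5000):
    """Open SQLite with a busy timeout and WAL journaling."""
    # sqlite3.connect takes its timeout in seconds.
    conn = sqlite3.connect(db_path, timeout=busy_timeout_ms / 1000)
    # PRAGMA busy_timeout takes milliseconds; set explicitly for clarity.
    conn.execute(f"PRAGMA busy_timeout = {int(busy_timeout_ms)}")
    # WAL lets readers proceed while a single writer is active.
    conn.execute("PRAGMA journal_mode = WAL")
    return conn


def write_registry_atomically(path, registry):
    """Write to a temp file in the same directory, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(registry, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace is atomic on POSIX when both paths share a filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the swap.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Creating the temp file in the target directory matters: a rename across filesystems is not atomic, so a reader could otherwise see a partially written registry.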

Returning to the memory vs corpus split, in `Trusted Project State`:

- each active seeded project now has a conservative trusted-state set
- promoted facts cover:
  - summary
  - core architecture or boundary decision
  - key constraints
  - next focus

This separation is healthy:

- memory stores distilled project facts
- corpus stores the underlying retrievable documents

## Immediate Next Focus

1. Use the new T420-side organic routing layer in real OpenClaw workflows
2. Tighten retrieval quality for the now fully ingested active project corpora
3. Move to Wave 2 trusted-operational ingestion instead of blindly widening raw corpus further
4. Keep the new engineering-knowledge architecture docs as implementation guidance while avoiding premature schema work
5. Expand the boring operations baseline:
   - restore validation
   - Chroma rebuild / backup policy
   - retention
6. Only later consider write-back, reflection, or deeper autonomous behaviors

See also:

- [ingestion-waves.md](ingestion-waves.md)
- [master-plan-status.md](master-plan-status.md)

## Guiding Constraints

- bad memory is worse than no memory
- trusted project state must remain highest priority
- human-readable sources and machine storage stay separate
- OpenClaw integration must not degrade OpenClaw baseline behavior
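
One way to read the "trusted project state must remain highest priority" constraint is as an ordering rule when a context budget is filled. The sketch below is an illustration of that reading only: the function name, the three tiers, and the use of character counts as the budget unit are assumptions, not the real context builder.

```python
def build_context(trusted_state, memories, chunks, budget):
    """Fill a character budget in priority order: trusted state, memories, chunks."""
    selected = []
    remaining = budget
    tiers = (("trusted", trusted_state), ("memory", memories), ("chunk", chunks))
    for tier, items in tiers:
        for text in items:
            if len(text) > remaining:
                # Skip items that do not fit; a smaller later item may still fit.
                continue
            selected.append((tier, text))
            remaining -= len(text)
    return selected
```

With a tight budget, retrieved chunks are the first thing to be dropped and trusted-state entries the last, which matches the ordering the constraints above ask for.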