ATOCore/docs/dalidou-deployment.md
Anto01 be4099486c deploy: add build_sha visibility for precise drift detection
Make /health report the precise git SHA the container was built from,
so 'is the live service current?' can be answered without ambiguity.
0.2.0 was too coarse to trust as a 'live is current' signal — many
commits share the same __version__.

Three layers:

1. /health endpoint (src/atocore/api/routes.py)
   - Reads ATOCORE_BUILD_SHA, ATOCORE_BUILD_TIME, ATOCORE_BUILD_BRANCH
     from environment, defaults to 'unknown'
   - Reports them alongside existing code_version field

2. docker-compose.yml
   - Forwards the three env vars from the host into the container
   - Defaults to 'unknown' so direct `docker compose up` runs (without
     deploy.sh) cleanly signal missing build provenance

3. deploy.sh
   - Step 2 captures git SHA + UTC timestamp + branch and exports them
     as env vars before `docker compose up -d --build`
   - Step 6 reads /health post-deploy and compares the reported
     build_sha against the freshly-built one. Mismatch exits non-zero
     (exit code 6) with a remediation hint covering cached image,
     env propagation, and concurrent restart cases

Tests (tests/test_api_storage.py):
- test_health_endpoint_reports_code_version_from_module
- test_health_endpoint_reports_build_metadata_from_env
- test_health_endpoint_reports_unknown_when_build_env_unset

Docs (docs/dalidou-deployment.md):
- Three-level drift detection table (code_version coarse,
  build_sha precise, build_time/branch forensic)
- Canonical drift check script using LIVE_SHA vs EXPECTED_SHA
- Note that running deploy.sh is itself the simplest drift check

219/219 tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 20:25:32 -04:00


# Dalidou Deployment
## Purpose
Deploy AtoCore on Dalidou as the canonical runtime and machine-memory host.
## Model
- Dalidou hosts the canonical AtoCore service.
- OpenClaw on the T420 consumes AtoCore over network/Tailscale API.
- `sources/vault` and `sources/drive` are read-only inputs by convention.
- SQLite/Chroma machine state stays on Dalidou and is not treated as a sync peer.
- The app and machine-storage host can be live before the long-term content
corpus is fully populated.
## Directory layout
```text
/srv/storage/atocore/
  app/          # deployed repo checkout
  data/
    db/
    chroma/
    cache/
    tmp/
  sources/
    vault/
    drive/
  logs/
  backups/
  run/
```
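One way to create the canonical directories in a single shot is a short script. This is an illustrative sketch, not tooling that ships with the repo; on Dalidou itself the root would be `/srv/storage/atocore`:

```python
import tempfile
from pathlib import Path

# Subdirectory list mirrors the layout above
SUBDIRS = [
    "app", "data/db", "data/chroma", "data/cache", "data/tmp",
    "sources/vault", "sources/drive", "logs", "backups", "run",
]

def create_layout(root: Path) -> None:
    for sub in SUBDIRS:
        # parents=True creates data/ and sources/ implicitly;
        # exist_ok=True makes re-runs a no-op
        (root / sub).mkdir(parents=True, exist_ok=True)

# Demo against a scratch root (use the real root on Dalidou)
root = Path(tempfile.mkdtemp())
create_layout(root)
create_layout(root)  # idempotent: the second run changes nothing
```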
## Compose workflow
The compose definition lives in:
```text
deploy/dalidou/docker-compose.yml
```
The Dalidou environment file should be copied to:
```text
deploy/dalidou/.env
```
starting from:
```text
deploy/dalidou/.env.example
```
## First-time deployment steps
1. Place the repository under `/srv/storage/atocore/app`, ideally as a
   proper git clone so future updates can be pulled rather than as a
   static snapshot:

   ```bash
   sudo git clone http://dalidou:3000/Antoine/ATOCore.git \
     /srv/storage/atocore/app
   ```

2. Create the canonical directories listed above.
3. Copy `deploy/dalidou/.env.example` to `deploy/dalidou/.env`.
4. Adjust the source paths if your AtoVault/AtoDrive mirrors live elsewhere.
5. Run:

   ```bash
   cd /srv/storage/atocore/app/deploy/dalidou
   docker compose up -d --build
   ```

6. Validate:

   ```bash
   curl http://127.0.0.1:8100/health
   curl http://127.0.0.1:8100/sources
   ```
## Updating a running deployment
**Use `deploy/dalidou/deploy.sh` for every code update.** It is the
one-shot sync script that:
- fetches the latest `main` from Gitea into `/srv/storage/atocore/app`
- backs up a non-git app dir as `<dir>.pre-git-<timestamp>` and
  re-clones it as a proper checkout
- rebuilds the container image
- restarts the container
- waits for `/health` to respond
- compares the reported `build_sha` against the freshly built commit
  (and `code_version` against the `__version__` in the freshly pulled
  source), exiting non-zero on mismatch (deployment drift detection)
```bash
# Normal update from main
bash /srv/storage/atocore/app/deploy/dalidou/deploy.sh
# Deploy a specific branch or tag
ATOCORE_BRANCH=codex/some-feature \
bash /srv/storage/atocore/app/deploy/dalidou/deploy.sh
# Dry-run: show what would happen without touching anything
ATOCORE_DEPLOY_DRY_RUN=1 \
bash /srv/storage/atocore/app/deploy/dalidou/deploy.sh
# Deploy from a remote host (e.g. the laptop) using the Tailscale
# or LAN address instead of loopback
ATOCORE_GIT_REMOTE=http://192.168.86.50:3000/Antoine/ATOCore.git \
bash /srv/storage/atocore/app/deploy/dalidou/deploy.sh
```
The script is idempotent and safe to re-run. It never touches the
database directly: schema migrations are applied automatically at
service startup by the lifespan handler in `src/atocore/main.py`,
which calls `init_db()` (which in turn runs the ALTER TABLE
statements in `_apply_migrations`).
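The startup-hook pattern described above looks roughly like this. It is an illustrative sketch only; `init_db` here is a stand-in for the real function in the AtoCore source:

```python
import asyncio
from contextlib import asynccontextmanager

applied = []

def init_db():
    # Stand-in for the real init_db(): the actual one runs the
    # idempotent ALTER TABLE migrations in _apply_migrations
    applied.append("migrations")

@asynccontextmanager
async def lifespan(app):
    init_db()  # runs once, before the service accepts traffic
    yield      # the app serves requests inside this context

async def main():
    async with lifespan(app=None):
        pass

asyncio.run(main())
```

Because the hook runs before the first request is served, a redeploy can never answer traffic against a stale schema.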
### Troubleshooting hostname resolution
`deploy.sh` defaults `ATOCORE_GIT_REMOTE` to
`http://127.0.0.1:3000/Antoine/ATOCore.git` (loopback) because the
hostname "dalidou" doesn't reliably resolve on the host itself —
the first real Dalidou deploy hit exactly this on 2026-04-08. If
you need to override (e.g. running deploy.sh from a laptop against
the Dalidou LAN), set `ATOCORE_GIT_REMOTE` explicitly.
The same applies to `scripts/atocore_client.py`: its default
`ATOCORE_BASE_URL` is `http://dalidou:8100` for remote callers, but
when running the client on Dalidou itself (or inside the container
via `docker exec`), override to loopback:
```bash
ATOCORE_BASE_URL=http://127.0.0.1:8100 \
python scripts/atocore_client.py health
```
If you see `{"status": "unavailable", "fail_open": true}` from the
client, the first thing to check is whether the base URL resolves
from where you're running the client.
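A quick way to pre-answer the "does the base URL even resolve from here?" question before blaming the service (an illustrative sketch; `base_url_resolves` is not part of the repo):

```python
import socket
from urllib.parse import urlparse

def base_url_resolves(base_url: str) -> bool:
    # Only checks DNS/hosts-file resolution, not whether the port answers
    host = urlparse(base_url).hostname
    try:
        socket.gethostbyname(host)
        return True
    except OSError:
        return False

# Loopback always resolves; "dalidou" may not from every host
print(base_url_resolves("http://127.0.0.1:8100"))  # True
```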
### Deployment drift detection
`/health` reports drift signals at three increasing levels of
precision:

| Field | Source | Precision | When to use |
|---|---|---|---|
| `version` / `code_version` | `atocore.__version__` (manual bump) | coarse — same value across many commits | quick smoke check that the right *release* is running |
| `build_sha` | `ATOCORE_BUILD_SHA` env var, set by `deploy.sh` per build | precise — changes per commit | the canonical drift signal |
| `build_time` / `build_branch` | same env var path | per-build | forensics when multiple branches in flight |
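The env-var plumbing behind these fields can be sketched as follows. This is a minimal illustration of the pattern, not the actual `routes.py` code:

```python
import os

def build_metadata() -> dict:
    # Each field defaults to "unknown" when deploy.sh never exported it;
    # "unknown" is itself a signal: provenance was not recorded at build time
    return {
        "build_sha": os.environ.get("ATOCORE_BUILD_SHA", "unknown"),
        "build_time": os.environ.get("ATOCORE_BUILD_TIME", "unknown"),
        "build_branch": os.environ.get("ATOCORE_BUILD_BRANCH", "unknown"),
    }
```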
The **precise** check (run on the laptop or any host that can curl
the live service AND has the source repo at hand):
```bash
# What's actually running on Dalidou
LIVE_SHA=$(curl -fsS http://dalidou:8100/health | grep -o '"build_sha":"[^"]*"' | cut -d'"' -f4)

# What the deployed branch tip should be
EXPECTED_SHA=$(cd /srv/storage/atocore/app && git rev-parse HEAD)

# Compare
if [ "$LIVE_SHA" = "$EXPECTED_SHA" ]; then
  echo "live is current at $LIVE_SHA"
else
  echo "DRIFT: live $LIVE_SHA vs expected $EXPECTED_SHA"
  echo "run deploy.sh to sync"
fi
```
The `deploy.sh` script does exactly this comparison automatically
in its post-deploy verification step (Step 6) and exits non-zero
on mismatch. So the **simplest drift check** is just to run
`deploy.sh` — if there's nothing to deploy, it succeeds quickly;
if the live service is stale, it deploys and verifies.
If `/health` reports `build_sha: "unknown"`, the running container
was started without `deploy.sh` (probably via `docker compose up`
directly), and the build provenance was never recorded. Re-run
via `deploy.sh` to fix.
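The decision logic here amounts to a three-way verdict, roughly like this (an illustrative sketch, not code from deploy.sh):

```python
def drift_verdict(live_sha: str, expected_sha: str) -> str:
    # "unknown" means the container was started without deploy.sh,
    # so there is no provenance to compare against
    if live_sha == "unknown":
        return "no-provenance"
    return "current" if live_sha == expected_sha else "drift"

print(drift_verdict("abc123", "abc123"))   # current
print(drift_verdict("abc123", "def456"))   # drift
print(drift_verdict("unknown", "abc123"))  # no-provenance
```

Treating "unknown" as its own verdict, rather than as drift, matters: the remediation (re-run deploy.sh to record provenance) differs from the remediation for genuine staleness.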
The coarse `code_version` check is still useful as a quick visual
sanity check: bumping `__version__` from `0.2.0` to `0.3.0` signals
a meaningful release boundary, while the precise `build_sha` remains
what tools should compare against:
```bash
# Quick sanity check (coarse)
curl -s http://127.0.0.1:8100/health | grep -o '"code_version":"[^"]*"'
grep '__version__' /srv/storage/atocore/app/src/atocore/__init__.py
```
### Schema migrations on redeploy
When updating from an older `__version__`, the first startup after
the redeploy runs the idempotent ALTER TABLE migrations in
`_apply_migrations`. For a pre-0.2.0 → 0.2.0 upgrade the migrations
add these columns to existing tables (all with safe defaults so no
data is touched):
- `memories.project TEXT DEFAULT ''`
- `memories.last_referenced_at DATETIME`
- `memories.reference_count INTEGER DEFAULT 0`
- `interactions.response TEXT DEFAULT ''`
- `interactions.memories_used TEXT DEFAULT '[]'`
- `interactions.chunks_used TEXT DEFAULT '[]'`
- `interactions.client TEXT DEFAULT ''`
- `interactions.session_id TEXT DEFAULT ''`
- `interactions.project TEXT DEFAULT ''`
Plus new indexes on the new columns. No row data is modified. The
migration is safe to run against a database that already has the
columns — the `_column_exists` check makes each ALTER a no-op in
that case.
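The `_column_exists` guard pattern can be illustrated in miniature with an in-memory SQLite database. This is a sketch of the idea, not the actual migration code:

```python
import sqlite3

def column_exists(conn, table, column):
    # PRAGMA table_info yields one row per column; row[1] is the name
    return any(row[1] == column
               for row in conn.execute(f"PRAGMA table_info({table})"))

def add_column_if_missing(conn, table, column_def):
    name = column_def.split()[0]
    if not column_exists(conn, table, name):
        conn.execute(f"ALTER TABLE {table} ADD COLUMN {column_def}")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE memories (id INTEGER PRIMARY KEY)")
add_column_if_missing(conn, "memories", "project TEXT DEFAULT ''")
add_column_if_missing(conn, "memories", "project TEXT DEFAULT ''")  # no-op
```

Existing rows are untouched: ADD COLUMN with a default only changes the schema, which is why the real migration can run against live data.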
Back up the database before any redeploy (via `POST /admin/backup`)
if you want a pre-upgrade snapshot. The migration is additive, and
restoring the snapshot rolls it back.
## Deferred
- backup automation
- restore/snapshot tooling
- reverse proxy / TLS exposure
- automated source ingestion job
- OpenClaw client wiring
## Current Reality Check
When this deployment is first brought up, the service may be healthy before the
real corpus has been ingested.
That means:
- AtoCore the system can already be hosted on Dalidou
- the canonical machine-data location can already be on Dalidou
- but the live knowledge/content corpus may still be empty or only partially
loaded until source ingestion is run