# Dalidou Deployment

## Purpose

Deploy AtoCore on Dalidou as the canonical runtime and machine-memory host.

## Model

- Dalidou hosts the canonical AtoCore service.
- OpenClaw on the T420 consumes AtoCore over network/Tailscale API.
- `sources/vault` and `sources/drive` are read-only inputs by convention.
- SQLite/Chroma machine state stays on Dalidou and is not treated as a sync peer.
- The app and machine-storage host can be live before the long-term content corpus is fully populated.
## Directory layout

```text
/srv/storage/atocore/
  app/        # deployed repo checkout
  data/
    db/
    chroma/
    cache/
    tmp/
  sources/
    vault/
    drive/
  logs/
  backups/
  run/
```
## Compose workflow

The compose definition lives in:

```text
deploy/dalidou/docker-compose.yml
```

The Dalidou environment file should be copied to:

```text
deploy/dalidou/.env
```

starting from:

```text
deploy/dalidou/.env.example
```
## First-time deployment steps

1. Place the repository under `/srv/storage/atocore/app` — ideally as a proper git clone so future updates can be pulled, not as a static snapshot:

   ```bash
   sudo git clone http://dalidou:3000/Antoine/ATOCore.git \
     /srv/storage/atocore/app
   ```

2. Create the canonical directories listed above.
3. Copy `deploy/dalidou/.env.example` to `deploy/dalidou/.env`.
4. Adjust the source paths if your AtoVault/AtoDrive mirrors live elsewhere.
5. Run:

   ```bash
   cd /srv/storage/atocore/app/deploy/dalidou
   docker compose up -d --build
   ```

6. Validate:

   ```bash
   curl http://127.0.0.1:8100/health
   curl http://127.0.0.1:8100/sources
   ```
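The canonical tree from step 2 can be created in one shot. A sketch using a configurable root so it can be tried anywhere — `ATOCORE_ROOT` is an illustrative variable, not something the repo defines; on Dalidou the root is `/srv/storage/atocore` and the `mkdir` needs `sudo`:

```shell
# ATOCORE_ROOT defaults to a throwaway directory for safe experimentation;
# set it to /srv/storage/atocore for the real layout.
ATOCORE_ROOT="${ATOCORE_ROOT:-$(mktemp -d)}"

# create the full canonical tree from the "Directory layout" section
mkdir -p "$ATOCORE_ROOT"/{app,data/{db,chroma,cache,tmp},sources/{vault,drive},logs,backups,run}
```

`mkdir -p` is idempotent, so re-running against an existing tree is harmless.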
## Updating a running deployment

**Use `deploy/dalidou/deploy.sh` for every code update.** It is the one-shot sync script that:

- fetches latest main from Gitea into `/srv/storage/atocore/app`
- (if the app dir is not a git checkout) backs it up as `<dir>.pre-git-<timestamp>` and re-clones
- rebuilds the container image
- restarts the container
- waits for `/health` to respond
- compares the reported `code_version` against the `__version__` in the freshly-pulled source, and exits non-zero if they don't match (deployment drift detection)

```bash
# Normal update from main
bash /srv/storage/atocore/app/deploy/dalidou/deploy.sh

# Deploy a specific branch or tag
ATOCORE_BRANCH=codex/some-feature \
  bash /srv/storage/atocore/app/deploy/dalidou/deploy.sh

# Dry-run: show what would happen without touching anything
ATOCORE_DEPLOY_DRY_RUN=1 \
  bash /srv/storage/atocore/app/deploy/dalidou/deploy.sh

# Deploy from a remote host (e.g. the laptop) using the Tailscale
# or LAN address instead of loopback
ATOCORE_GIT_REMOTE=http://192.168.86.50:3000/Antoine/ATOCore.git \
  bash /srv/storage/atocore/app/deploy/dalidou/deploy.sh
```

The script is idempotent and safe to re-run. It never touches the database directly — schema migrations are applied automatically at service startup by the lifespan handler in `src/atocore/main.py`, which calls `init_db()` (which in turn runs the ALTER TABLE statements in `_apply_migrations`).
### Troubleshooting hostname resolution

`deploy.sh` defaults `ATOCORE_GIT_REMOTE` to `http://127.0.0.1:3000/Antoine/ATOCore.git` (loopback) because the hostname "dalidou" doesn't reliably resolve on the host itself — the first real Dalidou deploy hit exactly this on 2026-04-08. If you need to override (e.g. running deploy.sh from a laptop against the Dalidou LAN), set `ATOCORE_GIT_REMOTE` explicitly.

The same applies to `scripts/atocore_client.py`: its default `ATOCORE_BASE_URL` is `http://dalidou:8100` for remote callers, but when running the client on Dalidou itself (or inside the container via `docker exec`), override to loopback:

```bash
ATOCORE_BASE_URL=http://127.0.0.1:8100 \
  python scripts/atocore_client.py health
```

If you see `{"status": "unavailable", "fail_open": true}` from the client, the first thing to check is whether the base URL resolves from where you're running the client.
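A quick way to test resolution from the calling host, using plain system tooling rather than anything repo-specific:

```shell
# prints the resolved address when the name is known to this host;
# otherwise falls through to the diagnostic message
getent hosts dalidou || echo "dalidou does not resolve from this host"
```

If the name doesn't resolve, either switch to the loopback/LAN/Tailscale IP or fix the host's name resolution.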
### The deploy.sh self-update race

When `deploy.sh` itself changes in the commit being pulled, the first run after the update is still executing the *old* script from the bash process's in-memory copy. `git reset --hard` updates the file on disk, but the running bash has already loaded the instructions. On 2026-04-09 this silently shipped an "unknown" `build_sha` because the old Step 2 (which predated the env-var export) ran against fresh source.

`deploy.sh` now detects this: Step 1.5 compares the sha1 of `$0` (the running script) against the sha1 of `$APP_DIR/deploy/dalidou/deploy.sh` (the on-disk copy) after the git reset. If they differ, it sets `ATOCORE_DEPLOY_REEXECED=1` and `exec`s the fresh copy so the rest of the deploy runs under the new script. The sentinel env var prevents infinite recursion.
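The guard amounts to something like the following sketch. Helper names (`script_sha`, `self_update_guard`) are illustrative, not the actual `deploy.sh` contents:

```shell
# Sketch of the Step 1.5 self-update guard; the real logic lives in
# deploy/dalidou/deploy.sh.

# sha1 of a file, or "missing" when it can't be read
script_sha() {
  if [ -r "$1" ]; then sha1sum "$1" | cut -d' ' -f1; else echo missing; fi
}

self_update_guard() {
  local on_disk="$1"
  [ "${ATOCORE_DEPLOY_DRY_RUN:-0}" = "1" ] && return 0   # skipped in dry-run
  [ "${ATOCORE_DEPLOY_REEXECED:-0}" = "1" ] && return 0  # sentinel: already re-exec'ed
  [ -r "$0" ] && [ -f "$on_disk" ] || return 0           # nothing to compare yet
  if [ "$(script_sha "$0")" != "$(script_sha "$on_disk")" ]; then
    echo "==> deploy.sh changed in the pulled commit; re-exec'ing"
    export ATOCORE_DEPLOY_REEXECED=1
    exec bash "$on_disk"   # never returns; the fresh script takes over from here
  fi
}

self_update_guard /srv/storage/atocore/app/deploy/dalidou/deploy.sh
```

Because the sentinel is exported before the `exec`, the second invocation takes the early `return 0` and proceeds straight into the normal deploy steps.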

You'll see this in the logs as:

```text
==> Step 1.5: deploy.sh changed in the pulled commit; re-exec'ing
==> running script hash: <old>
==> on-disk script hash: <new>
==> re-exec -> /srv/storage/atocore/app/deploy/dalidou/deploy.sh
```

To opt out (while debugging, for example), pre-set `ATOCORE_DEPLOY_REEXECED=1` before invoking `deploy.sh`; the self-update guard will then be skipped.
### Deployment drift detection

`/health` reports drift signals at three increasing levels of precision:

| Field | Source | Precision | When to use |
|---|---|---|---|
| `version` / `code_version` | `atocore.__version__` (manual bump) | coarse — same value across many commits | quick smoke check that the right *release* is running |
| `build_sha` | `ATOCORE_BUILD_SHA` env var, set by `deploy.sh` per build | precise — changes per commit | the canonical drift signal |
| `build_time` / `build_branch` | same env var path | per-build | forensics when multiple branches in flight |
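For orientation, the relevant slice of the `/health` payload looks something like this — field names come from the table above; the values here are placeholders, not real output:

```text
{
  "version": "0.2.0",
  "code_version": "0.2.0",
  "build_sha": "<40-char git sha, or \"unknown\">",
  "build_time": "<build timestamp>",
  "build_branch": "main"
}
```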

The **precise** check (run on the laptop or any host that can curl the live service AND has the source repo at hand):

```bash
# What's actually running on Dalidou
LIVE_SHA=$(curl -fsS http://dalidou:8100/health | grep -o '"build_sha":"[^"]*"' | cut -d'"' -f4)

# What the deployed branch tip should be
EXPECTED_SHA=$(cd /srv/storage/atocore/app && git rev-parse HEAD)

# Compare
if [ "$LIVE_SHA" = "$EXPECTED_SHA" ]; then
  echo "live is current at $LIVE_SHA"
else
  echo "DRIFT: live $LIVE_SHA vs expected $EXPECTED_SHA"
  echo "run deploy.sh to sync"
fi
```

The `deploy.sh` script does exactly this comparison automatically in its post-deploy verification step (Step 6) and exits non-zero on mismatch. So the **simplest drift check** is just to run `deploy.sh` — if there's nothing to deploy, it succeeds quickly; if the live service is stale, it deploys and verifies.

If `/health` reports `build_sha: "unknown"`, the running container was started without `deploy.sh` (probably via `docker compose up` directly), and the build provenance was never recorded. Re-run via `deploy.sh` to fix.
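The provenance capture amounts to something like the following — a sketch, not the actual `deploy.sh` code; the fallback mirrors the `"unknown"` value you see from a hand-started container:

```shell
APP_DIR="${APP_DIR:-/srv/storage/atocore/app}"

# resolve the checked-out commit; fall back to "unknown" when the app dir
# is not a git checkout, which is what /health then reports
ATOCORE_BUILD_SHA=$(git -C "$APP_DIR" rev-parse HEAD 2>/dev/null || echo unknown)
export ATOCORE_BUILD_SHA
echo "build sha: $ATOCORE_BUILD_SHA"
```

Starting the container outside `deploy.sh` skips this export entirely, which is why the drift signal degrades to `"unknown"` rather than to a stale sha.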

The coarse `code_version` check is still useful as a quick visual sanity check — bumping `__version__` from `0.2.0` to `0.3.0` signals a meaningful release boundary even if the precise `build_sha` is what tools should compare against:

```bash
# Quick sanity check (coarse)
curl -s http://127.0.0.1:8100/health | grep -o '"code_version":"[^"]*"'
grep '__version__' /srv/storage/atocore/app/src/atocore/__init__.py
```
### Schema migrations on redeploy

When updating from an older `__version__`, the first startup after the redeploy runs the idempotent ALTER TABLE migrations in `_apply_migrations`. For a pre-0.2.0 → 0.2.0 upgrade the migrations add these columns to existing tables (all with safe defaults so no data is touched):

- `memories.project TEXT DEFAULT ''`
- `memories.last_referenced_at DATETIME`
- `memories.reference_count INTEGER DEFAULT 0`
- `interactions.response TEXT DEFAULT ''`
- `interactions.memories_used TEXT DEFAULT '[]'`
- `interactions.chunks_used TEXT DEFAULT '[]'`
- `interactions.client TEXT DEFAULT ''`
- `interactions.session_id TEXT DEFAULT ''`
- `interactions.project TEXT DEFAULT ''`

Plus new indexes on the new columns. No row data is modified. The migration is safe to run against a database that already has the columns — the `_column_exists` check makes each ALTER a no-op in that case.
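The guard-then-ALTER pattern can be illustrated with the `sqlite3` CLI against a throwaway database — a sketch only; the real checks are Python code in `_apply_migrations` / `_column_exists`:

```shell
DB=$(mktemp)   # throwaway database for the demo
sqlite3 "$DB" "CREATE TABLE memories (id INTEGER PRIMARY KEY);"

# does <table> already have <column>?
column_exists() {
  sqlite3 "$DB" "PRAGMA table_info($1);" | cut -d'|' -f2 | grep -qx "$2"
}

# add <column> with <definition> to <table> unless it is already there
add_column() {
  column_exists "$1" "$2" || sqlite3 "$DB" "ALTER TABLE $1 ADD COLUMN $2 $3;"
}

add_column memories project "TEXT DEFAULT ''"
add_column memories project "TEXT DEFAULT ''"   # re-running is a no-op
```

Because each ALTER is gated on the column check, the whole migration can run on every startup without ever failing on an already-upgraded database.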

Back up the database before any redeploy (via `POST /admin/backup`) if you want a pre-upgrade snapshot. The migration is additive and reversible by restoring the snapshot.
## Deferred

- backup automation
- restore/snapshot tooling
- reverse proxy / TLS exposure
- automated source ingestion job
- OpenClaw client wiring
## Current Reality Check

When this deployment is first brought up, the service may be healthy before the real corpus has been ingested.

That means:

- AtoCore the system can already be hosted on Dalidou
- the canonical machine-data location can already be on Dalidou
- but the live knowledge/content corpus may still be empty or only partially loaded until source ingestion is run