Dalidou Deployment

Purpose

Deploy AtoCore on Dalidou as the canonical runtime and machine-memory host.

Model

  • Dalidou hosts the canonical AtoCore service.
  • OpenClaw on the T420 consumes AtoCore over network/Tailscale API.
  • sources/vault and sources/drive are read-only inputs by convention.
  • SQLite/Chroma machine state stays on Dalidou and is not treated as a sync peer.
  • The app and machine-storage host can be live before the long-term content corpus is fully populated.

Directory layout

/srv/storage/atocore/
  app/         # deployed repo checkout
  data/
    db/
    chroma/
    cache/
    tmp/
  sources/
    vault/
    drive/
  logs/
  backups/
  run/
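This tree can be created in one pass. A minimal sketch, assuming the canonical /srv/storage/atocore root (the ATOCORE_ROOT variable is illustrative, not something deploy.sh reads):

```shell
# Create the canonical AtoCore directory tree under ATOCORE_ROOT.
# ATOCORE_ROOT defaults to the canonical Dalidou path; override for testing.
ATOCORE_ROOT="${ATOCORE_ROOT:-/srv/storage/atocore}"
for d in app data/db data/chroma data/cache data/tmp \
         sources/vault sources/drive logs backups run; do
    mkdir -p "$ATOCORE_ROOT/$d"
done
echo "layout ready under $ATOCORE_ROOT"
```

mkdir -p is idempotent, so re-running against an existing tree is harmless.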

Compose workflow

The compose definition lives in:

deploy/dalidou/docker-compose.yml

The Dalidou environment file should be copied to:

deploy/dalidou/.env

starting from:

deploy/dalidou/.env.example

First-time deployment steps

  1. Place the repository under /srv/storage/atocore/app — ideally as a proper git clone so future updates can be pulled, not as a static snapshot:

    sudo git clone http://dalidou:3000/Antoine/ATOCore.git \
        /srv/storage/atocore/app
    
  2. Create the canonical directories listed above.

  3. Copy deploy/dalidou/.env.example to deploy/dalidou/.env.

  4. Adjust the source paths if your AtoVault/AtoDrive mirrors live elsewhere.

  5. Run:

    cd /srv/storage/atocore/app/deploy/dalidou
    docker compose up -d --build
    
  6. Validate:

    curl http://127.0.0.1:8100/health
    curl http://127.0.0.1:8100/sources
    
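The validation step can be wrapped in a small reusable probe, e.g. for cron or other scripts. A sketch; the function name and the 5-second timeout are illustrative choices, not part of the repo:

```shell
# Probe an AtoCore base URL and report one word: healthy or unreachable.
check_health() {
    if curl -fsS -m 5 "$1/health" >/dev/null 2>&1; then
        echo "healthy"
    else
        echo "unreachable"
    fi
}

check_health http://127.0.0.1:8100
```

curl -f makes non-2xx responses count as failures, so a service that is up but erroring reports "unreachable" rather than "healthy".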

Updating a running deployment

Use deploy/dalidou/deploy.sh for every code update. It is the one-shot sync script that:

  • fetches latest main from Gitea into /srv/storage/atocore/app
  • (if the app dir is not a git checkout) backs it up as <dir>.pre-git-<timestamp> and re-clones
  • rebuilds the container image
  • restarts the container
  • waits for /health to respond
  • compares the reported code_version against the __version__ in the freshly-pulled source, and exits non-zero if they don't match (deployment drift detection)

# Normal update from main
bash /srv/storage/atocore/app/deploy/dalidou/deploy.sh

# Deploy a specific branch or tag
ATOCORE_BRANCH=codex/some-feature \
    bash /srv/storage/atocore/app/deploy/dalidou/deploy.sh

# Dry-run: show what would happen without touching anything
ATOCORE_DEPLOY_DRY_RUN=1 \
    bash /srv/storage/atocore/app/deploy/dalidou/deploy.sh

# Deploy from a remote host (e.g. the laptop) using the Tailscale
# or LAN address instead of loopback
ATOCORE_GIT_REMOTE=http://192.168.86.50:3000/Antoine/ATOCore.git \
    bash /srv/storage/atocore/app/deploy/dalidou/deploy.sh

The script is idempotent and safe to re-run. It never touches the database directly: schema migrations are applied automatically at service startup by the lifespan handler in src/atocore/main.py, which calls init_db(), which in turn runs the ALTER TABLE statements in _apply_migrations.

Troubleshooting hostname resolution

deploy.sh defaults ATOCORE_GIT_REMOTE to http://127.0.0.1:3000/Antoine/ATOCore.git (loopback) because the hostname "dalidou" doesn't reliably resolve on the host itself — the first real Dalidou deploy hit exactly this on 2026-04-08. If you need to override (e.g. running deploy.sh from a laptop against the Dalidou LAN), set ATOCORE_GIT_REMOTE explicitly.
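A quick way to confirm whether the hostname resolves from wherever you are invoking deploy.sh (getent is standard on Linux hosts):

```shell
# Check whether "dalidou" resolves from this host before relying on it.
if getent hosts dalidou >/dev/null 2>&1; then
    echo "dalidou resolves"
else
    echo "dalidou does not resolve here; set ATOCORE_GIT_REMOTE explicitly"
fi
```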

The same applies to scripts/atocore_client.py: its default ATOCORE_BASE_URL is http://dalidou:8100 for remote callers, but when running the client on Dalidou itself (or inside the container via docker exec), override to loopback:

ATOCORE_BASE_URL=http://127.0.0.1:8100 \
    python scripts/atocore_client.py health

If you see {"status": "unavailable", "fail_open": true} from the client, the first thing to check is whether the base URL resolves from where you're running the client.

The deploy.sh self-update race

When deploy.sh itself changes in the commit being pulled, the first run after the update is still executing the old script from the bash process's in-memory copy. git reset --hard updates the file on disk, but the running bash has already loaded the instructions. On 2026-04-09 this silently shipped an "unknown" build_sha because the old Step 2 (which predated env-var export) ran against fresh source.

deploy.sh now detects this: Step 1.5 compares the sha1 of $0 (the running script) against the sha1 of $APP_DIR/deploy/dalidou/deploy.sh (the on-disk copy) after the git reset. If they differ, it sets ATOCORE_DEPLOY_REEXECED=1 and execs the fresh copy so the rest of the deploy runs under the new script. The sentinel env var prevents infinite recursion.

You'll see this in the logs as:

==> Step 1.5: deploy.sh changed in the pulled commit; re-exec'ing
==>   running script hash: <old>
==>   on-disk script hash: <new>
==>   re-exec -> /srv/storage/atocore/app/deploy/dalidou/deploy.sh

To opt out (e.g. while debugging), pre-set ATOCORE_DEPLOY_REEXECED=1 before invoking deploy.sh; the self-update guard will then be skipped.
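The guard can be sketched roughly as follows. This is an illustrative reconstruction from the description above, not the verbatim deploy.sh code; everything except ATOCORE_DEPLOY_REEXECED (variable names, the APP_DIR default) is an assumption:

```shell
# Step 1.5 (sketch): re-exec if the on-disk deploy.sh no longer matches
# the copy this bash process is executing.
FRESH="${APP_DIR:-/srv/storage/atocore/app}/deploy/dalidou/deploy.sh"
if [ -z "${ATOCORE_DEPLOY_REEXECED:-}" ] && [ -r "$0" ] && [ -f "$FRESH" ]; then
    running=$(sha1sum "$0" | cut -d' ' -f1)
    ondisk=$(sha1sum "$FRESH" | cut -d' ' -f1)
    if [ "$running" != "$ondisk" ]; then
        export ATOCORE_DEPLOY_REEXECED=1   # sentinel: prevents recursion
        exec bash "$FRESH" "$@"            # never returns; Step 2+ runs fresh
    fi
fi
```

exec replaces the current process, so nothing after the old script's Step 1.5 ever runs from the stale in-memory copy.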

Deployment drift detection

/health reports drift signals at three increasing levels of precision:

  • version / code_version: from atocore.__version__ (manual bump). Coarse (same value across many commits); use it as a quick smoke check that the right release is running.
  • build_sha: from the ATOCORE_BUILD_SHA env var, set by deploy.sh per build. Precise (changes per commit); the canonical drift signal.
  • build_time / build_branch: same env var path, recorded per build; use them for forensics when multiple branches are in flight.

The precise check (run on the laptop or any host that can curl the live service AND has the source repo at hand):

# What's actually running on Dalidou
LIVE_SHA=$(curl -fsS http://dalidou:8100/health | grep -o '"build_sha":"[^"]*"' | cut -d'"' -f4)

# What the deployed branch tip should be
EXPECTED_SHA=$(cd /srv/storage/atocore/app && git rev-parse HEAD)

# Compare
if [ "$LIVE_SHA" = "$EXPECTED_SHA" ]; then
    echo "live is current at $LIVE_SHA"
else
    echo "DRIFT: live $LIVE_SHA vs expected $EXPECTED_SHA"
    echo "run deploy.sh to sync"
fi

The deploy.sh script does exactly this comparison automatically in its post-deploy verification step (Step 6) and exits non-zero on mismatch. So the simplest drift check is just to run deploy.sh — if there's nothing to deploy, it succeeds quickly; if the live service is stale, it deploys and verifies.

If /health reports build_sha: "unknown", the running container was started without deploy.sh (probably via docker compose up directly), and the build provenance was never recorded. Re-run via deploy.sh to fix.

The coarse code_version check is still useful as a quick visual sanity check — bumping __version__ from 0.2.0 to 0.3.0 signals a meaningful release boundary even if the precise build_sha is what tools should compare against:

# Quick sanity check (coarse)
curl -s http://127.0.0.1:8100/health | grep -o '"code_version":"[^"]*"'
grep '__version__' /srv/storage/atocore/app/src/atocore/__init__.py

Schema migrations on redeploy

When updating from an older __version__, the first startup after the redeploy runs the idempotent ALTER TABLE migrations in _apply_migrations. For a pre-0.2.0 → 0.2.0 upgrade the migrations add these columns to existing tables (all with safe defaults so no data is touched):

  • memories.project TEXT DEFAULT ''
  • memories.last_referenced_at DATETIME
  • memories.reference_count INTEGER DEFAULT 0
  • interactions.response TEXT DEFAULT ''
  • interactions.memories_used TEXT DEFAULT '[]'
  • interactions.chunks_used TEXT DEFAULT '[]'
  • interactions.client TEXT DEFAULT ''
  • interactions.session_id TEXT DEFAULT ''
  • interactions.project TEXT DEFAULT ''

Plus new indexes on the new columns. No row data is modified. The migration is safe to run against a database that already has the columns — the _column_exists check makes each ALTER a no-op in that case.

Backup the database before any redeploy (via POST /admin/backup) if you want a pre-upgrade snapshot. The migration is additive and reversible by restoring the snapshot.
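To spot-check that the new columns actually landed after a redeploy, the sqlite3 CLI can inspect the live schema. A sketch, assuming sqlite3 is installed on the host; the database filename below is an assumption, so adjust it to your data/db/ layout:

```shell
# Verify the 0.2.0 columns exist on the memories table.
# PRAGMA table_info output is cid|name|type|notnull|dflt_value|pk.
DB="${ATOCORE_DB:-/srv/storage/atocore/data/db/atocore.db}"
for col in project last_referenced_at reference_count; do
    if sqlite3 "$DB" "PRAGMA table_info(memories);" | grep -q "|$col|"; then
        echo "memories.$col: present"
    else
        echo "memories.$col: MISSING"
    fi
done
```

The same pattern works for the interactions columns; only the table name and column list change.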

Deferred

  • backup automation
  • restore/snapshot tooling
  • reverse proxy / TLS exposure
  • automated source ingestion job
  • OpenClaw client wiring

Current Reality Check

When this deployment is first brought up, the service may be healthy before the real corpus has been ingested.

That means:

  • AtoCore the system can already be hosted on Dalidou
  • the canonical machine-data location can already be on Dalidou
  • but the live knowledge/content corpus may still be empty or only partially loaded until source ingestion is run