# CI for sync

## What this layer does
Phase 15 protects the sync layer from regressions. Robert's mirror runs GitHub Actions on every commit. A 6-hour cron runs the read-only smoke suite against live canonical. Every Joe machine runs `snappy-os doctor` every 6 hours and POSTs an `_alert` on failure. Three feedback loops: PR-time, schedule-time, and field-time.
## Files involved
- `~/projects/snappy-os/.github/workflows/sync-ci.yml` — Robert's internal mirror only; runs on PR open + push to main.
- `state/bin/sync/ci-loop.sh` — 6h cron on Robert's machine; runs smoke read-only against live.
- `state/bin/sync/ci-cleanup.sh` — removes `s3://robert-storage/snappy-os-ci/<sha>/` prefixes older than 7d.
- `state/bin/sync/load-test.sh` — pre-launch concurrent install + push.
- `state/bin/sync/probe-worker-latency.ts` — pre-launch cross-region latency probe.
- `bin/cli.js doctor` — runs every section-A lint and parity-matrix cell; exit code = failure count.
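The "exit code = failure count" convention for `doctor` can be sketched in a few lines of shell. `check_ok` and `check_bad` below are hypothetical stand-ins for real section-A lints and parity-matrix cells, not the actual `doctor` implementation:

```shell
#!/bin/sh
# Sketch: run every check, count failures, and use the count as the exit
# status, so the caller sees 0 only when all checks pass.
run_doctor() {
  fails=0
  for check in "$@"; do
    "$check" || fails=$((fails + 1))   # each failing check adds one
  done
  return "$fails"
}

check_ok()  { true; }    # stand-in for a passing lint
check_bad() { false; }   # stand-in for a failing parity cell

rc=0
run_doctor check_ok check_bad || rc=$?
echo "failures: $rc"   # prints "failures: 1"
```

This keeps the cron integration trivial: `|| snappy-os alert …` fires on any non-zero count, and the count itself is visible in logs.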
## GitHub Actions (Robert's mirror only)
`sync-ci.yml` runs on PR open and push to main:

- `npm run check` — typecheck, eslint, runtime drift
- `npm run lint:sync` — every section-A lint
- `npm run smoke:sync` — phase-mirror against `s3://robert-storage/snappy-os-ci/<commit-sha>/` (never touches production)
A cleanup job removes `snappy-os-ci/<sha>/` prefixes older than 7 days to keep the bucket cost flat.
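The 7-day cutoff at the heart of `ci-cleanup.sh` can be sketched as below. It assumes GNU `date` and that the caller supplies the date of the newest object under each prefix; how the real script lists prefixes and dates is not specified here:

```shell
#!/bin/sh
# Decide whether a CI prefix is stale: $1 is the date (YYYY-MM-DD) of the
# newest object under snappy-os-ci/<sha>/. Assumes GNU date.
is_stale() {
  cutoff=$(date -d '7 days ago' +%s)
  [ "$(date -d "$1" +%s)" -lt "$cutoff" ]
}

# A prefix last written in 2020 is well past the cutoff; the real script
# would then remove it, e.g. with something like:
#   aws s3 rm "s3://robert-storage/snappy-os-ci/$sha/" --recursive
if is_stale 2020-01-01; then
  echo "stale: remove prefix"
fi
```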
DO Spaces creds arrive via a GitHub repo secret, the same key set the Worker's secret-rotation flow uses; the Phase 10 rotation therefore covers CI as well.
## Robert-machine cron
```
0 */6 * * * ~/projects/snappy-os/state/bin/sync/ci-loop.sh
```
`ci-loop.sh` runs the smoke against live canonical in read-only mode: `pull --dry` and Worker GET only, no writes. Failures land in `state/log/alerts/sync-degraded-<ts>.md` and surface in `/snappy-ops` "System / ops" → "Sync status".
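The failure path can be sketched as follows. `run_smoke_readonly` is a hypothetical stand-in for the real `pull --dry` + Worker GET checks, wired to fail here purely for illustration; the alert directory follows the path above:

```shell
#!/bin/sh
# On smoke failure, drop a timestamped alert file where the ops view
# picks it up. The smoke stub below always fails, for demonstration.
ALERT_DIR=state/log/alerts

run_smoke_readonly() { false; }   # stand-in: simulate a degraded sync

if ! run_smoke_readonly; then
  mkdir -p "$ALERT_DIR"
  ts=$(date +%Y%m%d-%H%M%S)
  printf '# sync degraded\n\nread-only smoke failed at %s\n' "$(date -u)" \
    > "$ALERT_DIR/sync-degraded-$ts.md"
fi
```

Writing a file (rather than only logging) means a missed cron run still leaves a durable record for the next ops sweep.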
## Joe-machine cron
```
30 */6 * * * snappy-os doctor --silent || snappy-os alert "doctor-failed"
```
Installed by bootstrap. Runs the local section-A lints + parity-matrix cells. On non-zero exit, POSTs an `_alert` to the Worker so failures across the tenant base aggregate in `/_status`.
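The Joe-side failure POST might look like the following. The Worker URL, the `_alert` payload shape, and the `alert_payload` helper are assumptions for illustration; the real body schema is not specified above:

```shell
#!/bin/sh
# Build the alert body separately so it can be inspected and tested apart
# from the network call. Field names here are illustrative, not the schema.
alert_payload() {
  printf '{"kind":"%s","host":"%s","ts":"%s"}' \
    "$1" "$(hostname)" "$(date -u +%FT%TZ)"
}

# The cron line's failure branch would then be roughly:
#   snappy-os doctor --silent || \
#     curl -fsS -X POST "$WORKER_URL/_alert" \
#       -H 'Content-Type: application/json' \
#       -d "$(alert_payload doctor-failed)"
alert_payload doctor-failed
echo
```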
## Pre-launch gates
These run once before launch, not on schedule:
- `state/bin/sync/load-test.sh` — 1000 concurrent install + 100 concurrent `_push` (10 KB, distinct tenants). Asserts <2s p95 install, <5s p95 push, 0 5xx. Block launch on fail.
- `state/bin/sync/probe-worker-latency.ts` — cold KV miss across 5 Cloudflare regions. p95 <500ms; p99 <1500ms. Block launch on fail.
- Three-platform asciinema (Ubuntu Docker, macOS UTM, Windows WSL2). No green-on-three = no launch.
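How `load-test.sh` might assert its p95 thresholds is sketched below over synthetic samples. The one-latency-sample-per-line format and the `p95` helper are assumptions for illustration, not the script's actual interface:

```shell
#!/bin/sh
# p95 over newline-separated millisecond samples on stdin: sort ascending,
# then take the ceil(0.95 * N)-th value.
p95() {
  sort -n | awk '{ a[NR] = $1 } END { print a[int(NR * 0.95 + 0.999)] }'
}

# 100 synthetic install latencies, 1..100 ms; p95 is the 95th value.
val=$(seq 1 100 | p95)
if [ "$val" -lt 2000 ]; then
  echo "install p95 ${val}ms < 2s: pass"
else
  echo "install p95 ${val}ms >= 2s: block launch"
fi
```

The same helper would be reused against the push samples with the 5s threshold; the 0-5xx check is a separate grep over response codes.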
## Operational gotchas
- CI runs against the `snappy-os-ci/<sha>/` prefix exclusively. Tests that write to the production bucket are rejected at the lint step.
- The 6h cron on Robert's machine is read-only. A write-cron risks
feedback loops (CI write → catalog rebuild → CI runs again).
- Joe's doctor cron MUST stay quiet on success (`--silent`) to avoid filling logs with green rows. Failures get the explicit alert POST.
- The cleanup job is non-optional. Without it the CI prefix grows
unbounded and DO Spaces cost climbs.
- Pre-launch gates are gates, not warnings. Failing latency or load
tests means the launch waits for the underlying fix; do not soften thresholds to ship.
## How to verify it's working
- A PR to `~/projects/snappy-os` triggers the Actions run; logs show every lint + smoke step green.
- `crontab -l` on Robert's machine shows the `ci-loop.sh` entry.
- `crontab -l` on a Joe machine shows the `doctor` entry.
- A simulated doctor failure POSTs an `_alert` row visible in `/_status`.
- 7d-old `snappy-os-ci/<sha>/` prefixes disappear on the next cleanup run.