# CI for sync

## What this layer does
Phase 15 protects the sync layer from regressions. Robert's mirror runs GitHub Actions on every commit. A 6-hour cron runs the read-only smoke suite against live canonical. Every Joe machine runs `snappy-os doctor` every 6 hours and POSTs an `_alert` on failure. Three feedback loops: PR-time, schedule-time, and field-time.
## Files involved
- `~/projects/snappy-os/.github/workflows/sync-ci.yml` — Robert's internal mirror only; runs on PR open + push to main.
- `state/bin/sync/ci-loop.sh` — 6h cron on Robert's machine; runs smoke read-only against live.
- `state/bin/sync/ci-cleanup.sh` — removes `s3://robert-storage/snappy-os-ci/<sha>/` prefixes older than 7d.
- `state/bin/sync/load-test.sh` — pre-launch concurrent install + push.
- `state/bin/sync/probe-worker-latency.ts` — pre-launch cross-region latency probe.
- `bin/cli.js doctor` — runs every section-A lint and parity-matrix cell; exit code = failure count.
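The "exit code = failure count" convention for `doctor` can be sketched in a few lines of shell. `check_ok` and `check_bad` below are hypothetical stand-ins for real section-A lints and parity-matrix cells, not the actual `doctor` implementation:

```shell
#!/bin/sh
# Sketch: run every check, count failures, and use the count as the exit
# status, so the caller sees 0 only when all checks pass.
run_doctor() {
  fails=0
  for check in "$@"; do
    "$check" || fails=$((fails + 1))   # each failing check adds one
  done
  return "$fails"
}

check_ok()  { true; }    # stand-in for a passing lint
check_bad() { false; }   # stand-in for a failing parity cell

rc=0
run_doctor check_ok check_bad || rc=$?
echo "failures: $rc"   # prints "failures: 1"
```

This keeps the cron integration trivial: `|| snappy-os alert …` fires on any non-zero count, and the count itself is visible in logs.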
## GitHub Actions (Robert's mirror only)
`sync-ci.yml` runs on PR open and push to main:

- `npm run check` — typecheck, eslint, runtime drift
- `npm run lint:sync` — every section-A lint
- `npm run smoke:sync` — phase-mirror against `s3://robert-storage/snappy-os-ci/<commit-sha>/` (never touches production)
A cleanup job removes `snappy-os-ci/<sha>/` prefixes older than 7 days to keep the bucket cost flat.
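The 7-day cutoff at the heart of `ci-cleanup.sh` can be sketched as below. It assumes GNU `date` and that the caller supplies the date of the newest object under each prefix; how the real script lists prefixes and dates is not specified here:

```shell
#!/bin/sh
# Decide whether a CI prefix is stale: $1 is the date (YYYY-MM-DD) of the
# newest object under snappy-os-ci/<sha>/. Assumes GNU date.
is_stale() {
  cutoff=$(date -d '7 days ago' +%s)
  [ "$(date -d "$1" +%s)" -lt "$cutoff" ]
}

# A prefix last written in 2020 is well past the cutoff; the real script
# would then remove it, e.g. with something like:
#   aws s3 rm "s3://robert-storage/snappy-os-ci/$sha/" --recursive
if is_stale 2020-01-01; then
  echo "stale: remove prefix"
fi
```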
DO Spaces creds arrive via a GitHub repo secret, the same key set the Worker's secret-rotation flow uses; the Phase 10 rotation therefore covers CI as well.
## Robert-machine cron
```
0 */6 * * * ~/projects/snappy-os/state/bin/sync/ci-loop.sh
```
`ci-loop.sh` runs the smoke against live canonical in read-only mode: `pull --dry` and Worker GET only, no writes. Failures land in `state/log/alerts/sync-degraded-<ts>.md` and surface in `/snappy-ops` "System / ops" → "Sync status".
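The failure path can be sketched as follows. `run_smoke_readonly` is a hypothetical stand-in for the real `pull --dry` + Worker GET checks, wired to fail here purely for illustration; the alert directory follows the path above:

```shell
#!/bin/sh
# On smoke failure, drop a timestamped alert file where the ops view
# picks it up. The smoke stub below always fails, for demonstration.
ALERT_DIR=state/log/alerts

run_smoke_readonly() { false; }   # stand-in: simulate a degraded sync

if ! run_smoke_readonly; then
  mkdir -p "$ALERT_DIR"
  ts=$(date +%Y%m%d-%H%M%S)
  printf '# sync degraded\n\nread-only smoke failed at %s\n' "$(date -u)" \
    > "$ALERT_DIR/sync-degraded-$ts.md"
fi
```

Writing a file (rather than only logging) means a missed cron run still leaves a durable record for the next ops sweep.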
## Joe-machine cron
```
30 */6 * * * snappy-os doctor --silent || snappy-os alert "doctor-failed"
```
Installed by bootstrap. Runs the local section-A lints + parity-matrix cells. On non-zero exit, POSTs an `_alert` to the Worker so failures across the tenant base aggregate in `/_status`.
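The Joe-side failure POST might look like the following. The Worker URL, the `_alert` payload shape, and the `alert_payload` helper are assumptions for illustration; the real body schema is not specified above:

```shell
#!/bin/sh
# Build the alert body separately so it can be inspected and tested apart
# from the network call. Field names here are illustrative, not the schema.
alert_payload() {
  printf '{"kind":"%s","host":"%s","ts":"%s"}' \
    "$1" "$(hostname)" "$(date -u +%FT%TZ)"
}

# The cron line's failure branch would then be roughly:
#   snappy-os doctor --silent || \
#     curl -fsS -X POST "$WORKER_URL/_alert" \
#       -H 'Content-Type: application/json' \
#       -d "$(alert_payload doctor-failed)"
alert_payload doctor-failed
echo
```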
## Pre-launch gates
These run once before launch, not on schedule:
- `state/bin/sync/load-test.sh` — 1000 concurrent install + 100 concurrent `_push` (10 KB, distinct tenants). Asserts <2s p95 install, <5s p95 push, 0 5xx. Block launch on fail.
- `state/bin/sync/probe-worker-latency.ts` — cold KV miss across 5 Cloudflare regions. p95 <500ms; p99 <1500ms. Block launch on fail.
- Three-platform asciinema (Ubuntu Docker, macOS UTM, Windows WSL2). No green-on-three = no launch.
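How `load-test.sh` might assert its p95 thresholds is sketched below over synthetic samples. The one-latency-sample-per-line format and the `p95` helper are assumptions for illustration, not the script's actual interface:

```shell
#!/bin/sh
# p95 over newline-separated millisecond samples on stdin: sort ascending,
# then take the ceil(0.95 * N)-th value.
p95() {
  sort -n | awk '{ a[NR] = $1 } END { print a[int(NR * 0.95 + 0.999)] }'
}

# 100 synthetic install latencies, 1..100 ms; p95 is the 95th value.
val=$(seq 1 100 | p95)
if [ "$val" -lt 2000 ]; then
  echo "install p95 ${val}ms < 2s: pass"
else
  echo "install p95 ${val}ms >= 2s: block launch"
fi
```

The same helper would be reused against the push samples with the 5s threshold; the 0-5xx check is a separate grep over response codes.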
## Operational gotchas
- CI runs against the `snappy-os-ci/<sha>/` prefix exclusively. Tests that write to the production bucket are rejected at the lint step.
- The 6h cron on Robert's machine is read-only. A write-cron risks
feedback loops (CI write → catalog rebuild → CI runs again).
- Joe's doctor cron MUST stay quiet on success (`--silent`) to avoid filling logs with green rows. Failures get the explicit alert POST.
- The cleanup job is non-optional. Without it the CI prefix grows
unbounded and DO Spaces cost climbs.
- Pre-launch gates are gates, not warnings. Failing latency or load
tests means the launch waits for the underlying fix; do not soften thresholds to ship.
## How to verify it's working
- A PR to `~/projects/snappy-os` triggers the Actions run; logs show every lint + smoke step green.
- `crontab -l` on Robert's machine shows the `ci-loop.sh` entry.
- `crontab -l` on a Joe machine shows the `doctor` entry.
- A simulated doctor failure POSTs an `_alert` row visible in `/_status`.
- 7d-old `snappy-os-ci/<sha>/` prefixes disappear on the next cleanup run.