Observability
What this layer does
Phase 8 turns sync activity into observable signal. Every push and pull writes one ndjson row. A summary dashboard rebuilds at each tick. Alerts fire when failures cluster. Cost telemetry rolls up per tenant. Lints catch drift before it becomes an incident.
Files involved
- state/log/sync-events.ndjson — one row per push or pull (per-machine, in SYNC_DENY).
- state/log/sync-summary.md — derived dashboard, rebuilt at each tick by state/bin/sync/sync-summary.sh.
- state/log/alerts/sync-degraded-<ts>.md — written on failure clusters.
- state/log/changelog.ndjson — every successful Worker push (allowlisted in SYNC_ALLOW).
- ~/projects/snappy-skills/src/alert.ts — Worker /_alert tally.
- state/lint/sync-freshness.ts — fails if no successful push in the last 24h.
- state/lint/manifest-drift.ts — fails if pull --dry would touch >50% of files.
sync-events row shape
```json
{
  "ts": "2026-04-16T19:10:38Z",
  "op": "push",
  "repo": "snappy-os",
  "scope": "state",
  "files": 7,
  "bytes": 12483,
  "dur_ms": 842,
  "manifest_before": "sha-...",
  "manifest_after": "sha-...",
  "trigger": "stop-hook",
  "machine": "mbpro-rb"
}
```
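Producing a row of this shape is a single jq call; a minimal sketch, with illustrative field values (the SYNC_LOG override is hypothetical — the real writers are the sync scripts):

```shell
# Build one sync-events row and append it to the per-machine log.
# Field values are illustrative; manifest_before/after omitted for brevity.
log="${SYNC_LOG:-state/log/sync-events.ndjson}"
mkdir -p "$(dirname "$log")"
jq -cn \
  --arg ts "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --arg machine "$(hostname -s)" \
  '{ts: $ts, op: "push", repo: "snappy-os", scope: "state",
    files: 7, bytes: 12483, dur_ms: 842,
    trigger: "stop-hook", machine: $machine}' >> "$log"
```

`jq -cn` emits compact single-line JSON, so every append stays one ndjson row.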
Summary dashboard
state/log/sync-summary.md is a rebuilt-every-tick markdown table covering: pushes per hour (last 24h), bytes per day (last 7d), failure rate per repo, last successful push timestamp per machine. The /snappy-ops "System / ops" → "Sync status" picker option opens this file directly.
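The failure-rate-per-repo column can come from exactly one pass over the event log, the style sync-summary.sh is meant to use. A sketch with sample rows standing in for the real log (failed rows carrying an "error" field is an assumption — the row shape above shows a success):

```shell
# One-pass failure-rate-per-repo scan over sample sync-events rows.
log=$(mktemp)
printf '%s\n' \
  '{"repo":"snappy-os","op":"push"}' \
  '{"repo":"snappy-os","op":"push","error":"timeout"}' \
  '{"repo":"snappy-skills","op":"push"}' > "$log"
jq -s -r '
  group_by(.repo)[]
  | "\(.[0].repo)  \(map(select(.error)) | length)/\(length) failed"
' "$log"
# → snappy-os  1/2 failed
#   snappy-skills  0/1 failed
```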
Alerts
- Three consecutive push failures in 10 min → write state/log/alerts/sync-degraded-<ts>.md with the error chain.
- Every Joe machine also POSTs failures to Worker /_alert with tenant_id. Worker tallies in KV ALERTS. If >5% of tenants are failing, /_status flips to degraded.
- v1 has no email or Slack integration — Robert reads /_status when he opens the front door.
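The local half of the first rule is a predicate over the tail of the event log; a sketch (failed rows carrying an "error" field is an assumption):

```shell
# True when the last 3 events all failed within a 10-minute window
# (the v1 thresholds: 3 consecutive failures / 10 min).
burst_detected() {  # usage: burst_detected <sync-events.ndjson>
  tail -3 "$1" | jq -s -e '
    length == 3
    and all(.error != null)
    and ((.[2].ts | fromdateiso8601) - (.[0].ts | fromdateiso8601)) <= 600
  ' > /dev/null
}
```

On a hit, the real path writes the sync-degraded-<ts>.md alert with the error chain and POSTs to /_alert.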
Cost telemetry
DO Spaces is $5/mo flat for 250 GB plus $0.01/GB egress beyond 1 TB. Per-aggregate cost_usd_cents rolls up in state/log/evals.aggregate.ndjson (rounded to nearest cent; sub-cent rounds up to 1). The summary dashboard surfaces a weekly total.
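The rounding rule works out to "nearest cent, floored at 1 when nonzero"; a sketch (the round_cents helper is hypothetical, not a function in the repo):

```shell
# Round a USD amount to whole cents; any nonzero sub-cent amount
# ratchets up to 1 so per-run cost is never reported as 0.
round_cents() {  # usage: round_cents <usd>
  awk -v usd="$1" 'BEGIN {
    c = usd * 100
    r = int(c + 0.5)           # round to nearest cent
    if (c > 0 && r < 1) r = 1  # sub-cent floors at 1
    print r
  }'
}
round_cents 0.004   # → 1  (sub-cent ratchets up)
round_cents 0.126   # → 13 (plain nearest-cent)
```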
Operational gotchas
- sync-events.ndjson is per-machine. Do not sync it — that would leak machine IDs and trigger counts across tenants.
- The summary rebuild MUST be cheap — it runs every tick. Keep it
to a single ndjson scan with awk / jq, no nested processing.
- Alert thresholds are intentionally loose for v1 (3 failures /
10 min). Tune up only after a week of baseline data; over-tight thresholds in week 1 produce alert fatigue from cold-start failures.
- Cost rounding is a one-way ratchet up. Sub-cent rounds to 1, not 0,
so per-run cost is never under-reported. Aggregate sums slightly over-state.
How to verify it's working
- tail -1 state/log/sync-events.ndjson shows the most recent push with dur_ms populated.
- cat state/log/sync-summary.md shows a fresh dashboard with generated_at within the last tick window.
- curl https://skills.snappy.ai/_status | jq '.alerts_active' returns 0 in steady state.
- state/lint/sync-freshness.ts and state/lint/manifest-drift.ts exit 0.
- A simulated 3-failure burst from one tenant produces both a local
state/log/alerts/sync-degraded-<ts>.md and a Worker ALERTS KV entry visible in /_status.
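The freshness lint's core predicate can be sketched in shell (the real lint is TypeScript; treating rows without an "error" field as successes is an assumption):

```shell
# Succeed only if some successful push row is newer than 24h.
fresh_ok() {  # usage: fresh_ok <sync-events.ndjson>
  jq -s -e --argjson now "$(date -u +%s)" '
    map(select(.op == "push" and (.error | not)))
    | any((.ts | fromdateiso8601) > ($now - 86400))
  ' "$1" > /dev/null
}
```

An empty or all-stale log makes `any` false, so the check exits nonzero and the lint fails.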