Observability
What this layer does
Phase 8 turns sync activity into observable signal. Every push and pull writes one ndjson row. A summary dashboard rebuilds at each tick. Alerts fire when failures cluster. Cost telemetry rolls up per tenant. Lints catch drift before it becomes an incident.
Files involved
- state/log/sync-events.ndjson — one row per push or pull (per-machine, in SYNC_DENY).
- state/log/sync-summary.md — derived dashboard, rebuilt at each tick by state/bin/sync/sync-summary.sh.
- state/log/alerts/sync-degraded-<ts>.md — written on failure clusters.
- state/log/changelog.ndjson — every successful Worker push (allowlisted in SYNC_ALLOW).
- ~/projects/snappy-skills/src/alert.ts — Worker /_alert tally.
- state/lint/sync-freshness.ts — fails if no successful push in the last 24h.
- state/lint/manifest-drift.ts — fails if pull --dry would touch >50% of files.
sync-events row shape
```json
{
  "ts": "2026-04-16T19:10:38Z",
  "op": "push",
  "repo": "snappy-os",
  "scope": "state",
  "files": 7,
  "bytes": 12483,
  "dur_ms": 842,
  "manifest_before": "sha-...",
  "manifest_after": "sha-...",
  "trigger": "stop-hook",
  "machine": "mbpro-rb"
}
```
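Producing a row of this shape is a single jq call; a minimal sketch, with illustrative field values (the SYNC_LOG override is hypothetical — the real writers are the sync scripts):

```shell
# Build one sync-events row and append it to the per-machine log.
# Field values are illustrative; manifest_before/after omitted for brevity.
log="${SYNC_LOG:-state/log/sync-events.ndjson}"
mkdir -p "$(dirname "$log")"
jq -cn \
  --arg ts "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --arg machine "$(hostname -s)" \
  '{ts: $ts, op: "push", repo: "snappy-os", scope: "state",
    files: 7, bytes: 12483, dur_ms: 842,
    trigger: "stop-hook", machine: $machine}' >> "$log"
```

`jq -cn` emits compact single-line JSON, so every append stays one ndjson row.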
Summary dashboard
state/log/sync-summary.md is a rebuilt-every-tick markdown table covering: pushes per hour (last 24h), bytes per day (last 7d), failure rate per repo, last successful push timestamp per machine. The /snappy-ops "System / ops" → "Sync status" picker option opens this file directly.
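The failure-rate-per-repo column can come from exactly one pass over the event log, the style sync-summary.sh is meant to use. A sketch with sample rows standing in for the real log (failed rows carrying an "error" field is an assumption — the row shape above shows a success):

```shell
# One-pass failure-rate-per-repo scan over sample sync-events rows.
log=$(mktemp)
printf '%s\n' \
  '{"repo":"snappy-os","op":"push"}' \
  '{"repo":"snappy-os","op":"push","error":"timeout"}' \
  '{"repo":"snappy-skills","op":"push"}' > "$log"
jq -s -r '
  group_by(.repo)[]
  | "\(.[0].repo)  \(map(select(.error)) | length)/\(length) failed"
' "$log"
# → snappy-os  1/2 failed
#   snappy-skills  0/1 failed
```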
Alerts
- Three consecutive push failures in 10 min → write state/log/alerts/sync-degraded-<ts>.md with the error chain.
- Every Joe machine also POSTs failures to Worker /_alert with tenant_id. Worker tallies in KV ALERTS. If >5% of tenants are failing, /_status flips to degraded.
- v1 has no email or Slack integration — Robert reads /_status when he opens the front door.
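The local half of the first rule is a predicate over the tail of the event log; a sketch (failed rows carrying an "error" field is an assumption):

```shell
# True when the last 3 events all failed within a 10-minute window
# (the v1 thresholds: 3 consecutive failures / 10 min).
burst_detected() {  # usage: burst_detected <sync-events.ndjson>
  tail -3 "$1" | jq -s -e '
    length == 3
    and all(.error != null)
    and ((.[2].ts | fromdateiso8601) - (.[0].ts | fromdateiso8601)) <= 600
  ' > /dev/null
}
```

On a hit, the real path writes the sync-degraded-<ts>.md alert with the error chain and POSTs to /_alert.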
Cost telemetry
DO Spaces is $5/mo flat for 250 GB plus $0.01/GB egress beyond 1 TB. Per-aggregate cost_usd_cents rolls up in state/log/evals.aggregate.ndjson (rounded to nearest cent; sub-cent rounds up to 1). The summary dashboard surfaces a weekly total.
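The rounding rule works out to "nearest cent, floored at 1 when nonzero"; a sketch (the round_cents helper is hypothetical, not a function in the repo):

```shell
# Round a USD amount to whole cents; any nonzero sub-cent amount
# ratchets up to 1 so per-run cost is never reported as 0.
round_cents() {  # usage: round_cents <usd>
  awk -v usd="$1" 'BEGIN {
    c = usd * 100
    r = int(c + 0.5)           # round to nearest cent
    if (c > 0 && r < 1) r = 1  # sub-cent floors at 1
    print r
  }'
}
round_cents 0.004   # → 1  (sub-cent ratchets up)
round_cents 0.126   # → 13 (plain nearest-cent)
```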
Operational gotchas
- sync-events.ndjson is per-machine. Do not sync it — that would leak machine IDs and trigger counts across tenants.
- The summary rebuild MUST be cheap — it runs every tick. Keep it
to a single ndjson scan with awk / jq, no nested processing.
- Alert thresholds are intentionally loose for v1 (3 failures /
10 min). Tune up only after a week of baseline data; over-tight thresholds in week 1 produce alert fatigue from cold-start failures.
- Cost rounding is a one-way ratchet up. Sub-cent rounds to 1, not 0,
so per-run cost is never under-reported. Aggregate sums slightly over-state.
How to verify it's working
- tail -1 state/log/sync-events.ndjson shows the most recent push with dur_ms populated.
- cat state/log/sync-summary.md shows a fresh dashboard with generated_at within the last tick window.
- curl https://skills.snappy.ai/_status | jq '.alerts_active' returns 0 in steady state.
- state/lint/sync-freshness.ts and state/lint/manifest-drift.ts exit 0.
- A simulated 3-failure burst from one tenant produces both a local
state/log/alerts/sync-degraded-<ts>.md and a Worker ALERTS KV entry visible in /_status.
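The freshness lint's core predicate can be sketched in shell (the real lint is TypeScript; treating rows without an "error" field as successes is an assumption):

```shell
# Succeed only if some successful push row is newer than 24h.
fresh_ok() {  # usage: fresh_ok <sync-events.ndjson>
  jq -s -e --argjson now "$(date -u +%s)" '
    map(select(.op == "push" and (.error | not)))
    | any((.ts | fromdateiso8601) > ($now - 86400))
  ' "$1" > /dev/null
}
```

An empty or all-stale log makes `any` false, so the check exits nonzero and the lint fails.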