Observability

What this layer does

Phase 8 turns sync activity into an observable signal. Every push and pull writes one ndjson row. A summary dashboard rebuilds at each tick. Alerts fire when failures cluster. Cost telemetry rolls up per tenant. Lints catch drift before it becomes an incident.

Files involved

(per-machine, in SYNC_DENY).

tick by state/bin/sync/sync-summary.sh.

clusters.

(allowlisted in SYNC_ALLOW).

last 24h.

touch >50% of files.

sync-events row shape

{
  "ts": "2026-04-16T19:10:38Z",
  "op": "push",
  "repo": "snappy-os",
  "scope": "state",
  "files": 7,
  "bytes": 12483,
  "dur_ms": 842,
  "manifest_before": "sha-...",
  "manifest_after": "sha-...",
  "trigger": "stop-hook",
  "machine": "mbpro-rb"
}
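A reader consuming these rows may want to check the shape before aggregating. The following is a minimal sketch, not the project's own parser: the helper name `parse_row` and the strict type checks are assumptions for illustration, and only the field names come from the row shown above.

```python
import json
from datetime import datetime, timezone

# Field names and types taken from the example row; the strictness is an assumption.
REQUIRED_FIELDS = {
    "ts": str, "op": str, "repo": str, "scope": str,
    "files": int, "bytes": int, "dur_ms": int,
    "manifest_before": str, "manifest_after": str,
    "trigger": str, "machine": str,
}

def parse_row(line):
    """Parse one ndjson line and validate it against the row shape (hypothetical helper)."""
    row = json.loads(line)
    for name, expected in REQUIRED_FIELDS.items():
        if name not in row:
            raise ValueError(f"missing field: {name}")
        value = row[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"bad type for {name}: {type(value).__name__}")
    # Timestamps in the example are UTC with a trailing Z.
    row["ts"] = datetime.strptime(row["ts"], "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return row
```

Rejecting malformed rows early keeps downstream aggregation to a single pass over well-formed data.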

Summary dashboard

state/log/sync-summary.md is a markdown table rebuilt on every tick. It covers pushes per hour (last 24h), bytes per day (last 7d), failure rate per repo, and the last successful push timestamp per machine. The /snappy-ops "System / ops" → "Sync status" picker option opens this file directly.
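The aggregation behind such a table can be sketched as a single pass over the event rows. This is an illustration in Python, not the logic of sync-summary.sh: the function name `summarize` is hypothetical, it treats every logged push row as successful (the row shape shows no status field), and it omits the failure-rate column for the same reason.

```python
import json
from collections import Counter
from datetime import datetime, timedelta, timezone

def summarize(lines, now):
    """One pass over ndjson lines; returns (pushes_per_hour, bytes_per_day, last_push)."""
    pushes_per_hour = Counter()   # hour bucket -> push count, last 24h
    bytes_per_day = Counter()     # day bucket -> bytes moved, last 7d
    last_push = {}                # machine -> latest push timestamp
    for line in lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        ts = datetime.strptime(row["ts"], "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
        age = now - ts
        if timedelta(0) <= age <= timedelta(days=7):
            bytes_per_day[ts.strftime("%Y-%m-%d")] += row["bytes"]
        if row["op"] == "push":
            if timedelta(0) <= age <= timedelta(hours=24):
                pushes_per_hour[ts.strftime("%Y-%m-%dT%H:00Z")] += 1
            machine = row["machine"]
            # Assumption: a logged push row counts as a successful push.
            if machine not in last_push or ts > last_push[machine]:
                last_push[machine] = ts
    return pushes_per_hour, bytes_per_day, last_push
```

Passing `now` explicitly rather than reading the clock inside the function keeps the windows reproducible when testing.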

Alerts

state/log/alerts/sync-degraded-<ts>.md with the error chain.

tenant_id. Worker tallies in KV ALERTS. If >5% of tenants are failing, /_status flips to degraded.

when he opens the front door.
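The tenant-level rule above (more than 5% of tenants failing flips /_status to degraded) can be sketched as a small pure function. This is an illustration in Python rather than the Worker's actual code: the name `status_for`, the `"ok"` label, and the handling of zero tenants are assumptions. Only the 5% threshold and the `degraded` label come from the text.

```python
def status_for(failing_tenants, total_tenants, threshold_pct=5):
    """Return "degraded" when strictly more than threshold_pct percent of tenants fail."""
    if total_tenants <= 0:
        # Assumption: with no tenants there is nothing to degrade.
        return "ok"
    # Integer comparison avoids floating-point edge cases at exactly 5%.
    if failing_tenants * 100 > total_tenants * threshold_pct:
        return "degraded"
    return "ok"
```

For example, 5 failing tenants out of 100 stays `ok` because the rule is a strict greater-than, while 6 out of 100 reports `degraded`.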

Cost telemetry

DO Spaces costs a flat $5/mo for 250 GB, plus $0.01/GB for egress beyond 1 TB. Per-aggregate cost_usd_cents rolls up in state/log/evals.aggregate.ndjson, rounded to the nearest cent; any nonzero sub-cent amount rounds up to 1. The summary dashboard surfaces a weekly total.
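The rounding rule can be sketched as follows. This is one reading of it, not the project's implementation: the function name `to_cents` is hypothetical, and round-half-up for amounts of one cent or more is an assumption, since the text only says "nearest cent".

```python
from decimal import Decimal, ROUND_HALF_UP

def to_cents(cost_usd):
    """Convert a Decimal dollar amount to whole cents; nonzero sub-cent values become 1."""
    cents = cost_usd * 100
    if cents <= 0:
        return 0
    if cents < 1:
        # Sub-cent amounts round up to 1 so a run never reports zero cost.
        return 1
    # Assumption: ties round half up.
    return int(cents.quantize(Decimal("1"), rounding=ROUND_HALF_UP))
```

Using `Decimal` rather than floats keeps the cent boundary exact. Under this rule `to_cents(Decimal("0.004"))` gives 1, which is why sums over many small runs read slightly high.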

Operational gotchas

leak machine IDs and trigger counts across tenants.

to a single ndjson scan with awk / jq, no nested processing.

10 min). Tune up only after a week of baseline data; over-tight thresholds in week 1 produce alert fatigue from cold-start failures.

so per-run cost is never under-reported. Aggregate sums slightly over-state.

How to verify it's working

with dur_ms populated.

generated_at within the last tick window.

returns 0 in steady state.

exit 0.

state/log/alerts/sync-degraded-<ts>.md and a Worker ALERTS KV entry visible in /_status.