PID loop

What this layer does

Phase 7 wires the per-machine PID loop (already verified working in Phase 0) into the cross-machine substrate. Per-machine evals stay local. An anonymized aggregate ships through the Worker. Quorum promotion runs as a Worker scheduled handler — when ≥3 distinct tenants score a staged rev ≥0.85 across ≥5 runs, the rev promotes to public canonical without manual review.

Files involved

state/bin/auto-regen.sh: Stop hook PID body. Picks up <skill>.ready markers and dispatches regen subagents.
state/bin/pid-detect.ts: reads local + remote evals via fetchEvals(); writes briefs to state/log/regen-queue/.
state/bin/pid-drain.ts: marks queued briefs .ready.
state/bin/pid-aggregate.ts: anonymizes state/log/evals.ndjson into state/log/evals.aggregate.ndjson per the locked schema. Runs at the end of push --auto.
state/lib/eval.ts: score() writer; fetchEvals() reader.
state/log/regen-queue/: synced (in SYNC_ALLOW); naming uses <skill>.<machine_id>.md to avoid cross-machine collisions.
~/projects/snappy-skills/src/quorum.ts: Worker scheduled handler firing every minute.

Aggregate row schema (locked v1)

type AggRow = {
  _v: 1;
  ts: string;          // ISO-8601
  skill: string;       // normalized from `skill` or legacy `verb`
  score: number;       // 0..1
  run_id: string;      // opaque
  tenant: string;      // sha256(SNAPPY_MASTER_KEY).first(12)
  cost_usd_cents?: number;  // rounded to nearest cent; sub-cent rounds up to 1
  ok: boolean;
};

Aggregator drops every other field. state/lint/aggregate-schema.ts fails on unknown keys.
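
The anonymization step can be pictured as a pure function from a raw eval row to an AggRow. The following is a minimal sketch, not the actual pid-aggregate.ts: the RawEval shape, the cost_usd field (dollars on the raw row), and the helper names toCents and toAggRow are assumptions made for illustration. The tenant hash is taken as an input rather than computed here.

```typescript
// Hypothetical raw row shape; only the fields read below are assumed to exist.
type RawEval = {
  ts: string;
  skill?: string;
  verb?: string;      // legacy name for `skill`
  score: number;
  run_id: string;
  cost_usd?: number;  // ASSUMPTION: raw cost is stored in dollars
  ok: boolean;
  [extra: string]: unknown; // everything else is dropped
};

type AggRow = {
  _v: 1;
  ts: string;
  skill: string;
  score: number;
  run_id: string;
  tenant: string;
  cost_usd_cents?: number;
  ok: boolean;
};

// Round to whole cents; a positive sub-cent cost becomes 1, never 0 or a fraction.
function toCents(usd: number): number {
  const cents = Math.round(usd * 100);
  return usd > 0 && cents === 0 ? 1 : cents;
}

// Build an AggRow from an allowlist of fields, so unknown fields cannot leak through.
function toAggRow(raw: RawEval, tenant: string): AggRow | null {
  const skill = raw.skill ?? raw.verb; // normalize legacy `verb`
  if (!skill) return null;             // no usable skill name: skip the row
  const row: AggRow = {
    _v: 1,
    ts: raw.ts,
    skill,
    score: raw.score,
    run_id: raw.run_id,
    tenant,
    ok: raw.ok,
  };
  if (typeof raw.cost_usd === "number") {
    row.cost_usd_cents = toCents(raw.cost_usd);
  }
  return row;
}
```

Copying from an allowlist, rather than deleting known-sensitive fields, means a new field added to the raw row stays local until the schema is deliberately extended.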

What syncs vs what doesn't

File	Direction	Why
`state/log/evals.ndjson`	NEVER syncs	per-machine ephemeral; in `SYNC_DENY`
`state/log/evals.aggregate.ndjson`	both ways	anonymized PID signal; in `SYNC_ALLOW`
`state/log/regen-queue/`	both ways	briefs flow between tenants
`state/skills/<name>.md`	both ways	PID-rewritten skill pages flow through standard sync
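
As a rough illustration of the table above, a sync filter could check the deny list first and only then the allow list. This is a sketch under assumptions: the real SYNC_ALLOW and SYNC_DENY may be shaped differently (globs, regexes, a config file), and shouldSync is a hypothetical helper name.

```typescript
// ASSUMPTION: both lists hold exact paths, or directory prefixes ending in "/".
const SYNC_DENY: string[] = ["state/log/evals.ndjson"];
const SYNC_ALLOW: string[] = [
  "state/log/evals.aggregate.ndjson",
  "state/log/regen-queue/",
  "state/skills/",
];

// Deny wins over allow; anything matching neither list is not synced.
function shouldSync(path: string): boolean {
  const matches = (entry: string): boolean =>
    entry.endsWith("/") ? path.startsWith(entry) : path === entry;
  if (SYNC_DENY.some(matches)) return false;
  return SYNC_ALLOW.some(matches);
}
```

Checking the deny list first keeps evals.ndjson local even if a broader allow entry were added later.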

Quorum promotion

PID rewrite produced on tenant T. T pushes the rev to s3://robert-storage/snappy-os-staging/<skill>/<rev_id>/.
Worker scheduled handler reads state/log/evals.aggregate.ndjson for any staged rev_ids.
When ≥3 distinct tenants have scored <skill>:<rev_id> ≥0.85 across ≥5 total runs → promote. Copy snappy-os-staging/<skill>/<rev_id>/ → snappy-os/skills/<skill>/. Write changelog row. Invalidate KV.

Robert override: any rev tagged manual_review:robert in frontmatter is exempt from quorum promotion and stays gated until Robert approves via /snappy-ops.
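
The promotion rule above can be sketched as a small predicate. This is not the code in quorum.ts. It assumes each eval row can be associated with a rev_id (the locked AggRow schema shown earlier has no such field, so how that link is made is not specified here), and it reads the rule as: only runs scoring ≥0.85 count toward both the tenant and the run thresholds. StagedEval and meetsQuorum are hypothetical names.

```typescript
// ASSUMPTION: rows already carry the rev_id they were scored against.
type StagedEval = { skill: string; rev_id: string; tenant: string; score: number };

const MIN_TENANTS = 3;
const MIN_SCORE = 0.85;
const MIN_RUNS = 5;

// True when enough distinct tenants have produced enough passing runs for this rev.
function meetsQuorum(rows: StagedEval[], skill: string, revId: string): boolean {
  const passing = rows.filter(
    (r) => r.skill === skill && r.rev_id === revId && r.score >= MIN_SCORE
  );
  const tenants = new Set(passing.map((r) => r.tenant));
  return tenants.size >= MIN_TENANTS && passing.length >= MIN_RUNS;
}
```

If the intended reading is instead "at least 5 runs of any score, with 3 tenants passing", the run count would be taken before the score filter. A rev tagged manual_review:robert would be skipped before this check is reached.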

Operational gotchas

Telemetry is on by default (opt-out). SNAPPY_TELEMETRY=0 disables the pid-aggregate.ts push entirely.

Brief naming <skill>.<machine_id>.md prevents two tenants proposing simultaneous rewrites from clobbering each other.
The staged rev DOES survive a tenant going offline: quorum reads evals.aggregate.ndjson, which other tenants continue to push.
Rounding cost_usd_cents up at sub-cent keeps the lint bound tight; never emit fractional cents.
Per-machine evals.ndjson MUST stay in SYNC_DENY: leaking it would expose run_ids tied to specific machines.
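
Two of the gotchas above are easy to show in code. This is a minimal sketch, assuming the telemetry push runs unless SNAPPY_TELEMETRY is exactly "0"; briefFileName and telemetryEnabled are hypothetical helper names, and the environment is passed in as a plain object so the check can be exercised without touching the real process environment.

```typescript
// Brief file name: embedding machine_id keeps two machines' briefs for the same skill apart.
function briefFileName(skill: string, machineId: string): string {
  return `${skill}.${machineId}.md`;
}

// ASSUMPTION: any value other than "0" (including unset) leaves the aggregate push enabled.
function telemetryEnabled(env: Record<string, string | undefined>): boolean {
  return env["SNAPPY_TELEMETRY"] !== "0";
}
```

With this naming, briefs for one skill from machines m1 and m2 land in separate files in state/log/regen-queue/ instead of overwriting each other.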

How to verify it's working

After a Stop hook fires, state/log/regen-queue/ gains a fresh brief; on the next push, evals.aggregate.ndjson gains a row with the new run.
curl https://skills.snappy.ai/_status shows last_quorum_promotion updating when ≥3 tenants converge.
A staged rev seeded with synthetic evals from 3 tenants promotes within 60s of the third tenant's push.
state/lint/aggregate-schema.ts exits 0 on the local aggregate file.
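
For the last check, a lint of this kind could look roughly like the sketch below. It is not the actual state/lint/aggregate-schema.ts: the unknown-key rule comes from the schema section, while the whole-cent check and the function name lintAggregate are assumptions. A real script would exit non-zero when the returned list is non-empty.

```typescript
// Keys permitted by the locked v1 AggRow schema.
const ALLOWED_KEYS = new Set([
  "_v", "ts", "skill", "score", "run_id", "tenant", "cost_usd_cents", "ok",
]);

// Returns one message per violation; an empty array means the file passes.
function lintAggregate(ndjson: string): string[] {
  const errors: string[] = [];
  ndjson.split("\n").forEach((rawLine, i) => {
    const line = rawLine.trim();
    if (line === "") return; // tolerate blank lines and a trailing newline
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      errors.push(`line ${i + 1}: invalid JSON`);
      return;
    }
    if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
      errors.push(`line ${i + 1}: not an object`);
      return;
    }
    const row = parsed as Record<string, unknown>;
    for (const key of Object.keys(row)) {
      if (!ALLOWED_KEYS.has(key)) errors.push(`line ${i + 1}: unknown key "${key}"`);
    }
    // ASSUMPTION: the lint also rejects fractional cents.
    const cents = row["cost_usd_cents"];
    if (cents !== undefined && !Number.isInteger(cents)) {
      errors.push(`line ${i + 1}: cost_usd_cents must be a whole number`);
    }
  });
  return errors;
}
```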