
autopilot

Quietly finds what's broken and fixes it before it slows you down.

description: "Triggers on prompt mention of 'autopilot'."

personal 2 files 10 recent evals


What it does for you

Quietly finds what's broken and fixes it before it slows you down.

What it produces

A recent result, so you can see the kind of work it returns.


How to get it

These run inside the Snappy workspace. Want this working in your business? I set skills like this up with you, in one focused week.


For developers - how this skill is built, graded, and how it runs

at a glance - the short version

actor: state/bin/autopilot/{break,fix}.sh

auditor: state/bin/autopilot/open-count.sh

eval mode: auto

category: Ops

stages: 1

depends: snappy-fix, evolve

what's inside - the parts that make up a skill 2/4 present

A skill is just a few plain-text files. Only the main one is required. The rest are optional, added as the work needs them. This is what the skill is made of; how it runs is just below.

The skill

state/skills/autopilot/SKILL.md present

the skill itself, in plain text

The main file. It says what the skill is and lays out the steps in plain English.

Code

state/lib/autopilot.ts not present

code the skill can run

Optional. Many skills are just words and need no code at all.

Scripts

state/bin/autopilot/ not present

helper scripts

Optional. Added when a skill has a few commands to run.

Loader

state/skills/autopilot/AGENTS.md present

what the AI loads on the fly

Loaded automatically the moment this skill is needed. Kept short on purpose.

how it's graded - what counts as a good run 4 criteria · 4 deterministic

Each row is one thing a good run has to get right. deterministic means a quick check decides, pass or fail. judge means the AI reads the result and rates it. Grading each piece on its own (instead of one overall score) shows exactly where a run fell short, so the fix is obvious.

| name | kind | check |
| --- | --- | --- |
| eval_score_reflects_friction | deterministic | The 'score' field in 'state/log/evals.ndjson' for 'autopilot' entries must be 1 if 'open_after <= open_before', and 0 otherwise, as calculated by 'state/bin/autopilot/open-count.sh'. |
| telemetry_completeness | deterministic | Every 'autopilot' entry in 'state/log/evals.ndjson' must contain non-null 'open_before' and 'open_after' fields. |
| no_raw_grep_counting | deterministic | Ensure no direct 'grep "status":"open"' is used for counting open P0/P1 issues; 'state/bin/autopilot/open-count.sh' must be used for deduped, filtered counts. |
| breaker_cron_disabled | deterministic | The 'state/bin/autopilot/break.sh' script must not be scheduled via 'crontab -l'. |

how it runs - the shared frame every skill uses 5/5 present

Every skill runs the same way. One part does the work, a separate part checks it, and a short loader hands the AI exactly what it needs for the job. Anything this skill doesn't use shows a one-line note saying why, on purpose, not by accident.

makes the work The worker

present

state/bin/autopilot/{break,fix}.sh the worker

Does the actual work. Whatever it produces is what gets checked next.

checks the work The reviewer

present

state/bin/autopilot/open-count.sh the checker

A separate checker grades the work, so the part that made it can't approve its own work.

frame

learns Self-correction

present

fixes itself learns from gaps

When a run hits a gap, the skill gets edited on the spot [FIXED] or queued for a bigger rewrite [LOGGED], so it keeps getting better.

tidies up Background fixes

present

queued for rewrite runs in the background

Bigger fixes that can't be made on the spot get queued and rewritten in the background later.

remembers Run history

present

state/log/evals.ndjson auto runs

Every run is written down here, so the next time this skill is used it already knows how the last runs went.

Critical rules the things this skill must not get wrong

Outcome gate is score = 1 if open_after <= open_before, else 0.
NEVER count via raw grep '"status":"open"' state/log/breakage-report.ndjson.
Eval rows MUST carry open_before AND open_after.
Don't widen the gate to p2. p0/p1 only.
break.sh is on-demand only (operator triage), NOT cron.
Engagement is gated by state/engaged.json.
Two more rules, and the full text of these six, are in AGENTS.md below.

what it has learned - fixes written back in over time sample

When a run hits something this skill didn't handle, the fix gets written back into the skill so it doesn't happen again. FIXED means it was corrected on the spot. LOGGED means it's queued for a bigger rewrite. Either way, the skill gets a little better and never makes the same mistake twice.


how the work flows - who makes it, who checks it

inputs snappy-fix, evolve

actor state/bin/autopilot/{break,fix}.sh

auditor state/bin/autopilot/open-count.sh

1 data

eval log

`state/log/evals.ndjson` (skill: "autopilot")


SKILL.md - the skill, written out in plain English

autopilot

Two-script loop that drives the snappy-os friction queue toward zero (the fixer runs on cron; the breaker has been on-demand only since 2026-04-19):

state/bin/autopilot/break.sh - finder. Headless Claude probes a surface and appends {status:"open"} rows to state/log/breakage-report.ndjson.

state/bin/autopilot/fix.sh - fixer. Picks the top open p0/p1 row, re-runs the repro, and appends a {status:"resolved"} row when the breakage actually closes.

(The PID re-test gate at state/bin/autopilot/regen.sh was retired 2026-04-28 in task #389; the wired Stop hook state/hooks/snappy-os-auto-regen.sh plus state/regen/drain.sh now carry that responsibility - see CLAUDE.md "How the system enforces itself".)

Engagement is gated by state/engaged.json (must contain "autopilot").

Eval

Contract (Pod K audit found this was missing; Pod W rewrote it 2026-04-18):

The eval gate measures whether the autopilot loop reduced open friction - its actual job - not whether the script exited 0.

For every tick (breaker + fixer + precheck), the script:

1. Counts open p0/p1 rows in state/log/breakage-report.ndjson BEFORE doing any work, using the dedupe-by-area-then-filter-resolved semantics from state/skills/snappy-fix/AGENTS.md (latest row per area wins; an "open" row followed by a newer "resolved" row for the same area is already closed). Implemented once in state/bin/autopilot/open-count.sh so both ticks compute the count the same way.
2. Runs the tick.
3. Counts open p0/p1 again AFTER.
4. score = 1 if after <= before, else score = 0.

Edge case (per Robert's spec): if there were no open p0/p1 rows to begin with AND no new breakage was introduced, the system is healthy and the tick correctly did nothing -> score = 1. The <= comparison handles both "fixer closed something" and "system was already clean" cleanly.
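The four steps and the edge case can be sketched in a few lines. This is a minimal illustration in Python, not the actual scripts: `run_scored_tick`, `count_open`, and `run_tick` are hypothetical names, the counter and the tick are passed in as callables, and the `ok` and `machine` fields of the real row are left out because the source does not say how they are derived.

```python
from typing import Callable


def run_scored_tick(count_open: Callable[[], int],
                    run_tick: Callable[[], int],
                    verb: str,
                    run_id: str) -> dict:
    """Count before, run the tick, count after, and score on the outcome."""
    open_before = count_open()   # step 1: auditor count before any work
    exit_code = run_tick()       # step 2: the actor does its work
    open_after = count_open()    # step 3: same auditor, same semantics
    score = 1 if open_after <= open_before else 0  # step 4: outcome gate
    return {
        "skill": "autopilot",
        "verb": verb,
        "score": score,
        "open_before": open_before,
        "open_after": open_after,
        "exit": exit_code,       # recorded, but never used for the score
        "run_id": run_id,
    }


# Illustrative run: the queue goes from 4 open rows to 3.
counts = iter([4, 3])
row = run_scored_tick(lambda: next(counts), lambda: 0, "fixer-tick", "fixer-demo")
print(row["score"])  # 1
```

A tick that starts and ends at zero open rows also scores 1 under the same comparison, which is the edge case described above.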

Each eval row carries the raw inputs so you can audit the gate without re-deriving them:

{
  "skill": "autopilot",
  "verb": "fixer-tick",
  "score": 1,
  "ok": true,
  "open_before": 4,
  "open_after": 3,
  "exit": 0,
  "run_id": "fixer-1745024400",
  "machine": "ray-mac"
}

Why this gate is load-bearing. The previous shape-only gate (score = (exit == 0) ? 1 : 0) scored autopilot 1.0 on every cron tick regardless of whether the friction queue grew, shrank, or stayed the same. That made state/log/evals.ndjson lie about the most important loop in the system - Pod K caught it during commit ae19fee's aftermath. The outcome gate now matches the loop's actual purpose.
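To make the difference between the two gates concrete, here is a small Python comparison. The function names and the sample ticks are made up for illustration; only the two formulas come from the text above.

```python
def shape_only_score(exit_code: int) -> int:
    """Retired gate: passes whenever the script exits 0."""
    return 1 if exit_code == 0 else 0


def outcome_score(open_before: int, open_after: int) -> int:
    """Current gate: passes only if the open count did not grow."""
    return 1 if open_after <= open_before else 0


# Hypothetical ticks as (exit, open_before, open_after).
ticks = [
    (0, 4, 3),  # fixer closed one item
    (0, 2, 5),  # clean exit, but the queue grew
    (0, 0, 0),  # already clean, nothing to do
]
for exit_code, before, after in ticks:
    print(shape_only_score(exit_code), outcome_score(before, after))
```

The shape-only gate returns 1 for all three ticks; the outcome gate returns 0 for the second, which is the case the old gate could not see.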

Actor != auditor. The actor is the headless Claude breaker/fixer that runs the tick. The auditor is open-count.sh - pure jq, deterministic, no LLM. The two cannot collude.

Files involved (auto-eval contract):

Actor: state/bin/autopilot/{break,fix}.sh
Auditor: state/bin/autopilot/open-count.sh
Source of truth: state/log/breakage-report.ndjson
Eval log: state/log/evals.ndjson (skill: "autopilot")

Known regressions to avoid

DO NOT swap the gate back to score = (exit == 0); that's the bug Pod K's audit caught. Shell exit and outcome are different things.

DO NOT count raw grep '"status":"open"' matches; the log is append-only and a stale "open" row is often followed by a newer "resolved" row for the same area. Use open-count.sh (which dedupes by area first, filters by status second).

DO NOT widen the gate to p2 - p2 is "nice to have" backlog and shouldn't fail the loop.
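The second and third rules are easier to see with data. The sketch below mirrors the dedupe-then-filter idea in Python; it is not open-count.sh (which the text describes as pure jq). The field names `area`, `priority`, and `status`, and the use of line order as recency, are assumptions made for the example.

```python
import json

GATED_PRIORITIES = {"p0", "p1"}  # assumed field values; p2 is never counted


def count_open(ndjson_text: str) -> int:
    """Dedupe by area (latest row wins), then count open p0/p1 areas."""
    latest_by_area = {}
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue
        row = json.loads(line)
        # The log is append-only, so a later line supersedes an earlier one.
        latest_by_area[row["area"]] = row
    return sum(
        1
        for row in latest_by_area.values()
        if row["status"] == "open" and row["priority"] in GATED_PRIORITIES
    )


# Hypothetical log: "login" was opened and later resolved; "theme" is p2.
log = "\n".join([
    '{"area": "login", "priority": "p0", "status": "open"}',
    '{"area": "export", "priority": "p1", "status": "open"}',
    '{"area": "login", "priority": "p0", "status": "resolved"}',
    '{"area": "theme", "priority": "p2", "status": "open"}',
])
raw_matches = log.count('"status": "open"')
print(raw_matches, count_open(log))  # 3 1
```

The raw match count reports three open rows, while only one p0/p1 area is actually open once the stale row and the p2 row are excluded.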

Known failure modes (from the 4 hard zeros, 2026-04-18/19/20)

Pattern 1 - missing telemetry fields (run_ids: fixer-1776485707, breaker-1776551290). Two runs, one fixer-tick and one breaker-tick, logged score:0 without open_before or open_after fields. Without these the auditor cannot verify the outcome gate - the row is unauditable. Root cause: the eval format was not yet standardized when those scripts first ran.

Fix already in place: the eval schema example above is the required shape.

Any script that logs a score:0 row without both fields is non-compliant and should be treated as a script bug, not a system friction.

If you see a score:0 row missing open_before/open_after, the right diagnosis is "telemetry gap," not "friction spike."
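One way to apply that diagnosis mechanically is to check the fields before reading the score at all. The Python helper below is a hypothetical sketch; `diagnose` and its return labels are not part of the skill.

```python
def diagnose(row: dict) -> str:
    """Classify an autopilot eval row before drawing conclusions from it."""
    missing = [f for f in ("open_before", "open_after") if row.get(f) is None]
    if missing:
        return "telemetry gap"    # producer bug: the row cannot be audited
    if row["open_after"] > row["open_before"]:
        return "open count grew"  # an auditable zero
    return "ok"


print(diagnose({"skill": "autopilot", "score": 0}))
print(diagnose({"skill": "autopilot", "score": 0, "open_before": 2, "open_after": 5}))
print(diagnose({"skill": "autopilot", "score": 1, "open_before": 4, "open_after": 3}))
```

The three calls print "telemetry gap", "open count grew", and "ok" in that order; a null field is treated the same as a missing one.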

Pattern 2 - breaker cron deprecated mid-run (run_ids: breaker-1776637947, breaker-1776656645). break.sh was deprecated on 2026-04-19 (cron removed; frictions now come from skill self-reporting via the PID loop). Two cron-fired breaker runs landed on April 19 and April 20 during the transition window. Both had exit:0 (the LLM ran fine) but open_after > open_before (3 and 1 new p0/p1 rows respectively), so the outcome gate correctly scored them 0. This was expected behavior during the transition - the breaker cron was finding real frictions on a system that had just been restructured.

break.sh is now on-demand only (operator triage, not cron). Do not re-cron it. If you see future breaker-tick zero-scores, check whether the cron has been accidentally re-installed (crontab -l | grep break).

The outcome gate behavior is correct: if a tick increases the open count, that is a net-negative result regardless of the tick's verb. The April 19/20 zeros were transition noise, not a gate bug.
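The cron check can also be scripted against the text that `crontab -l` prints. The Python sketch below is illustrative: `breaker_cron_lines` is a hypothetical helper, the schedules in the sample are invented, and unlike a plain `grep break` it skips commented-out entries.

```python
from typing import List


def breaker_cron_lines(crontab_text: str) -> List[str]:
    """Return active crontab lines that mention the breaker script."""
    hits = []
    for line in crontab_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blanks and commented-out entries
        if "autopilot/break.sh" in stripped:
            hits.append(stripped)
    return hits


# Hypothetical crontab listings.
healthy = "\n".join([
    "# */30 * * * * bash state/bin/autopilot/break.sh",
    "*/15 * * * * bash state/bin/autopilot/fix.sh",
])
reinstalled = healthy + "\n*/30 * * * * bash state/bin/autopilot/break.sh"

print(len(breaker_cron_lines(healthy)), len(breaker_cron_lines(reinstalled)))  # 0 1
```

In practice the input would come from running `crontab -l`, for example through `subprocess.run(["crontab", "-l"], capture_output=True, text=True).stdout`.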

Rubric

criteria:
  - name: eval_score_reflects_friction
    kind: deterministic
    check: "The 'score' field in 'state/log/evals.ndjson' for 'autopilot' entries must be 1 if 'open_after <= open_before', and 0 otherwise, as calculated by 'state/bin/autopilot/open-count.sh'."
  - name: telemetry_completeness
    kind: deterministic
    check: "Every 'autopilot' entry in 'state/log/evals.ndjson' must contain non-null 'open_before' and 'open_after' fields."
  - name: no_raw_grep_counting
    kind: deterministic
    check: "Ensure no direct 'grep \"status\":\"open\"' is used for counting open P0/P1 issues; 'state/bin/autopilot/open-count.sh' must be used for deduped, filtered counts."
  - name: breaker_cron_disabled
    kind: deterministic
    check: "The 'state/bin/autopilot/break.sh' script must not be scheduled via 'crontab -l'."

AGENTS.md - what the AI loads when this skill comes up

autopilot - loader

Per-turn rules for the autopilot skill. Full reference: state/skills/autopilot/SKILL.md. Do not skip these.

autopilot drives the snappy-os friction queue toward zero via two scripts: state/bin/autopilot/break.sh (finder, on-demand) appends {status:"open"} rows to state/log/breakage-report.ndjson; state/bin/autopilot/fix.sh (fixer, active cron) re-runs the top open p0/p1 repro and appends a {status:"resolved"} row when the breakage actually closes. The PID re-test gate at regen.sh was retired 2026-04-28 - state/regen/drain.sh + state/hooks/snappy-os-auto-regen.sh now carry that responsibility.

Critical Rules

Outcome gate is score = 1 if open_after <= open_before, else 0.

Pod K caught the prior shape-only gate (score = (exit == 0)) lying about the most important loop in the system. NEVER swap it back. Shell exit and outcome are different things; the gate measures whether the tick reduced open friction, not whether the script ran clean.

NEVER count via raw grep '"status":"open"' state/log/breakage-report.ndjson.

The log is append-only; a stale "open" row is often followed by a newer "resolved" row for the same area. ALWAYS go through state/bin/autopilot/open-count.sh (dedupes by area first, filters by latest status second). Both ticks must compute the count the same way.

Eval rows MUST carry open_before AND open_after. A score:0 row missing either field is a telemetry gap (script bug), NOT a friction spike. Treat it as non-compliant and fix the producer; do not draw conclusions from it.

Don't widen the gate to p2. p0/p1 only. p2 is "nice to have" backlog and shouldn't fail the loop.

break.sh is on-demand only (operator triage), NOT cron. Deprecated as a scheduled job 2026-04-19. Frictions now come from skill self-reporting via the PID loop. If you see future breaker-tick zero-scores, check whether the cron has been accidentally re-installed (crontab -l | grep break). fix.sh is the active fixer.

Engagement is gated by state/engaged.json. The file must contain the string "autopilot" for either tick to do real work. Disengaged = no-op by design - not a bug.

Actor != auditor. Actor = headless Claude breaker/fixer. Auditor = open-count.sh (pure jq, deterministic, no LLM). They cannot collude. Don't invent a third counter.

One eval row per run to state/log/evals.ndjson via score() (CONSTITUTION invariant #4). eval: auto per frontmatter.
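The engagement rule above amounts to a guard at the top of each tick. The Python sketch below shows one possible reading: it assumes "contain the string" means the quoted token "autopilot" appears somewhere in the file, and that a missing file means disengaged. The real layout of state/engaged.json is not given in this document, and `is_engaged` is a hypothetical name.

```python
import tempfile
from pathlib import Path


def is_engaged(path: Path, skill: str = "autopilot") -> bool:
    """True only if the engagement file exists and mentions the quoted skill name."""
    try:
        text = path.read_text()
    except FileNotFoundError:
        return False  # assumption: no file means disengaged, so the tick is a no-op
    return f'"{skill}"' in text


# Demo against a throwaway file rather than the real state/engaged.json.
with tempfile.TemporaryDirectory() as tmp:
    engaged_file = Path(tmp) / "engaged.json"
    before_write = is_engaged(engaged_file)
    engaged_file.write_text('["autopilot", "evolve"]')  # hypothetical contents
    after_write = is_engaged(engaged_file)

print(before_write, after_write)  # False True
```

A tick would return early when the guard is false, which matches the "disengaged = no-op by design" rule.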

Commands

ui dashboard: state/skills/autopilot/resources/ui.openui
fixer (active cron): bash state/bin/autopilot/fix.sh
breaker (on-demand only): bash state/bin/autopilot/break.sh
count open p0/p1 (the auditor): bash state/bin/autopilot/open-count.sh
engagement gate: state/engaged.json (must contain "autopilot")
source of truth (frictions): state/log/breakage-report.ndjson
eval log: state/log/evals.ndjson (skill: "autopilot")
reference: state/skills/autopilot/SKILL.md
cron audit: crontab -l | grep -E 'break|fix'

Required eval row shape

{
  "skill": "autopilot",
  "verb": "fixer-tick",
  "score": 1,
  "ok": true,
  "open_before": 4,
  "open_after": 3,
  "exit": 0,
  "run_id": "fixer-1745024400",
  "machine": "ray-mac"
}

score is computed from open_before/open_after, not from exit. Edge case: zero open p0/p1 before AND no new breakage introduced -> score = 1 (system was already clean; the <= comparison handles it cleanly).

OpenUI Resource

Skill-owned OpenUI Lang resource: state/skills/autopilot/resources/ui.openui. Read it before rendering or editing this skill's generated component surface.
Treat this resource as a first-class artifact of the skill, not a generic chat response. Improve it when the skill's user-facing output needs to become richer.
System resources compose OpenUI primitives and inherit SnappyChat tokens. Use ui_contract: branded in SKILL.md only for deliberate platform or client visuals.

Known Pitfalls

Telemetry gap (Pattern 1, hard zeros 2026-04-18). Two fixer-tick / breaker-tick runs (run_ids fixer-1776485707, breaker-1776551290) logged score:0 without open_before/open_after. The auditor cannot verify the gate without those fields - the row is unauditable. Diagnose these as script bugs, not friction spikes. Fix the producer to emit the schema above.

Breaker cron deprecated mid-run (Pattern 2, 2026-04-19/20). Two cron-fired breaker runs (breaker-1776637947, breaker-1776656645) scored 0 during the cron-removal transition window: exit:0 but open_after > open_before. The gate behavior was correct - increasing open count is net-negative regardless of verb. Don't re-cron break.sh in response.

Self-Test

An agent reading this should correctly:

[ ] Compute the eval score as open_after <= open_before ? 1 : 0, not from exit
[ ] Use state/bin/autopilot/open-count.sh for both before/after counts (never raw grep)
[ ] Refuse to emit a score:0 row missing open_before or open_after
[ ] Keep the gate at p0/p1 only (no p2 widening)
[ ] Treat break.sh as on-demand (not cron) and confirm via crontab -l | grep break
[ ] Skip work if state/engaged.json doesn't contain "autopilot"
[ ] Append exactly one eval row per run via score()

Self-report

If this loader fell short, append a line:

echo "[$(date -u +%FT%TZ)] autopilot: <what was missing or fixed> [FIXED|LOGGED] action_kind=<kind>" >> state/log/loader-feedback.log

Self-correcting loader (PID feedback)

This is the rule that makes loaders learn. Ported from snappy-kernel, where it demonstrably worked: the snappy-dom-cartographer skill mapped the Skool admin UI once, and every later ship-agent inherited the full map for free.

Before you finish, do two things:

1. Fix gaps (P - proportional). MANDATORY. If this loader didn't cover your case - if you had to read another file, run an undocumented command, work around a wrong selector, or discover a quirk - you MUST attempt an Edit to this AGENTS.md before you log.

Default action: edit this file. Surgical. Add the missing line, table row, or code example where it belongs. No restructuring.

Only LOGGED is allowed when: the fix needs >10 lines, spans multiple files, or requires a structural rewrite. In that case the state/regen/drain.sh queue picks it up asynchronously.

"I didn't have time" / "it's minor" / "the next agent will figure it out" are NOT valid reasons. Minor gaps compound - that's the whole point of this loop.

Concrete example of a valid P-fix: discovering that agent-browser screenshot takes the path positionally, not as --output <path>. One line in a table. Edit it in. Do not just log it.

The goal: the next agent never has to leave the loader.

2. Log the result.

echo "[$(date -u +%FT%TZ)] autopilot: <what was missing or fixed> [FIXED|LOGGED] action_kind=<kind>" >> state/log/loader-feedback.log

FIXED = you patched this loader inline (P-fix).
LOGGED = too large for inline; the PostToolUse enqueue + Stop-hook drain will rewrite the loader from scratch on next session-end.

action_kind values: shape-ok | skill-ran | loader-rewritten | pattern-elevated. Pick the one that describes what you actually did. If you LOGGED, omit it; the inferrer picks it up from body keywords.
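For callers that build the feedback line in code rather than with echo, the format rules above can be captured in a small helper. This Python sketch is an assumption-laden illustration: `feedback_line` is a hypothetical function, and the note text and the choice of `skill-ran` in the example are arbitrary. The timestamp format matches `date -u +%FT%TZ`.

```python
from datetime import datetime, timezone
from typing import Optional

ACTION_KINDS = {"shape-ok", "skill-ran", "loader-rewritten", "pattern-elevated"}


def feedback_line(note: str, outcome: str,
                  action_kind: Optional[str] = None,
                  now: Optional[datetime] = None) -> str:
    """Format one loader-feedback line; action_kind is omitted for LOGGED."""
    if outcome not in ("FIXED", "LOGGED"):
        raise ValueError("outcome must be FIXED or LOGGED")
    moment = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    stamp = moment.strftime("%Y-%m-%dT%H:%M:%SZ")
    line = f"[{stamp}] autopilot: {note} [{outcome}]"
    if outcome == "FIXED":
        if action_kind not in ACTION_KINDS:
            raise ValueError("FIXED lines need a known action_kind")
        line += f" action_kind={action_kind}"
    return line


fixed_at = datetime(2026, 4, 28, 12, 0, 0, tzinfo=timezone.utc)
print(feedback_line("added cron audit row", "FIXED", "skill-ran", fixed_at))
print(feedback_line("needs structural rewrite", "LOGGED", now=fixed_at))
```

The result would then be appended to state/log/loader-feedback.log, one line per run.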

Do not skip this. Every agent run must leave the system better than it found it. The loader is the setpoint; you are the sensor; the gap is the error signal; closing the gap is the correction.

api.ts - the code it can call

⚠ no api.ts - this skill has no typed action surface

scripts - helper scripts it can run

prose-only skill - 2 inline code blocks live in SKILL.md above (no state/bin/ sidecar yet).

how we check it - the checks, plus the last 10 runs

rubric auto no rubric declared

recent mean 0.75 · 10 runs actor/auditor: unverifiable

deps snappy-fix evolve

timestamp	verb	score	primary_issue	artifact
2026-04-26 23:47Z	-	0.50	-	-
2026-04-25 04:11Z	-	1.00	-	-
2026-04-21 15:58Z	-	1.00	-	-
2026-04-21 15:56Z	-	1.00	-	-
2026-04-21 03:53Z	-	1.00	-	-
2026-04-20 03:44Z	-	0.00	-	-
2026-04-19 22:32Z	-	0.00	-	-
2026-04-18 22:29Z	-	1.00	-	-
2026-04-18 22:29Z	-	1.00	-	-
2026-04-18 22:28Z	-	1.00	-	-
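
The "recent mean 0.75" figure can be reproduced from the score column of the table above. A short Python check, with the ten scores copied in table order:

```python
# Scores from the ten rows above, newest first.
scores = [0.50, 1.00, 1.00, 1.00, 1.00, 0.00, 0.00, 1.00, 1.00, 1.00]

recent_mean = sum(scores) / len(scores)
hard_zeros = sum(1 for s in scores if s == 0)

print(f"{recent_mean:.2f} over {len(scores)} runs, {hard_zeros} hard zeros")
# 0.75 over 10 runs, 2 hard zeros
```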