
autopilot

Quietly finds what's broken and fixes it before it slows you down.
description: "Triggers on prompt mention of 'autopilot'."
personal 2 files 10 recent evals

What it does for you

Quietly finds what's broken and fixes it before it slows you down.

What it produces

A recent result, so you can see the kind of work it returns.


How to get it

These run inside the Snappy workspace. Want this working in your business? I set skills like this up with you, in one focused week.

For developers - how this skill is built, graded, and how it runs

at a glance - the short version

actor: state/bin/autopilot/{break,fix}.sh
auditor: state/bin/autopilot/open-count.sh
eval mode: auto
category: Ops
stages: 1
depends: snappy-fix, evolve

what's inside - the parts that make up a skill 2/4 present

A skill is just a few plain-text files. Only the main one is required. The rest are optional, added as the work needs them. This is what the skill is made of; how it runs is just below.

The skill
state/skills/autopilot/SKILL.md present
the skill itself, in plain text
The main file. It says what the skill is and lays out the steps in plain English.
Code
state/lib/autopilot.ts not present
code the skill can run
Optional. Many skills are just words and need no code at all.
Scripts
state/bin/autopilot/ not present
helper scripts
Optional. Added when a skill has a few commands to run.
Loader
state/skills/autopilot/AGENTS.md present
what the AI loads on the fly
Loaded automatically the moment this skill is needed. Kept short on purpose.

how it's graded - what counts as a good run 4 criteria · 4 deterministic

Each row is one thing a good run has to get right. deterministic means a quick check decides, pass or fail. judge means the AI reads the result and rates it. Grading each piece on its own (instead of one overall score) shows exactly where a run fell short, so the fix is obvious.

| name | kind | check |
| eval_score_reflects_friction | deterministic | The 'score' field in 'state/log/evals.ndjson' for 'autopilot' entries must be 1 if 'open_after <= open_before', and 0 otherwise, as calculated by 'state/bin/autopilot/open-count.sh'. |
| telemetry_completeness | deterministic | Every 'autopilot' entry in 'state/log/evals.ndjson' must contain non-null 'open_before' and 'open_after' fields. |
| no_raw_grep_counting | deterministic | Ensure no direct 'grep "status":"open"' is used for counting open P0/P1 issues; 'state/bin/autopilot/open-count.sh' must be used for deduped, filtered counts. |
| breaker_cron_disabled | deterministic | The 'state/bin/autopilot/break.sh' script must not be scheduled via 'crontab -l'. |

how it runs - the shared frame every skill uses 5/5 present

Every skill runs the same way. One part does the work, a separate part checks it, and a short loader hands the AI exactly what it needs for the job. Anything this skill doesn't use shows a one-line note saying why, on purpose, not by accident.

makes the work The worker
present
state/bin/autopilot/{break,fix}.sh the worker
Does the actual work. Whatever it produces is what gets checked next.
checks the work The reviewer
present
state/bin/autopilot/open-count.sh the checker
A separate checker grades the work, so the part that made it can't approve its own work.
frame
learns Self-correction
present
fixes itself learns from gaps
When a run hits a gap, the skill gets edited on the spot [FIXED] or queued for a bigger rewrite [LOGGED], so it keeps getting better.
tidies up Background fixes
present
queued for rewrite runs in the background
Bigger fixes that can't be made on the spot get queued and rewritten in the background later.
remembers Run history
present
state/log/evals.ndjson auto runs
Every run is written down here, so the next time this skill is used it already knows how the last runs went.
Critical rules - the things this skill must not get wrong
  1. Outcome gate is score = 1 if open_after <= open_before, else 0.
  2. NEVER count via raw grep '"status":"open"' state/log/breakage-report.ndjson.
  3. Eval rows MUST carry open_before AND open_after. A score:0 row missing either field is a telemetry gap (script bug), NOT a friction spike.
  4. Don't widen the gate to p2. p0/p1 only. p2 is "nice to have" backlog and shouldn't fail the loop.
  5. break.sh is on-demand only (operator triage), NOT cron. Deprecated as a scheduled job 2026-04-19.
  6. Engagement is gated by state/engaged.json. The file must contain the string "autopilot" for either tick to do real work.
  +2 more in AGENTS.md

what it has learned - fixes written back in over time sample

When a run hits something this skill didn't handle, the fix gets written back into the skill so it doesn't happen again. FIXED means it was corrected on the spot. LOGGED means it's queued for a bigger rewrite. Either way, the skill gets a little better and never makes the same mistake twice.


how the work flows - who makes it, who checks it

inputs: snappy-fix, evolve
actor: state/bin/autopilot/{break,fix}.sh
auditor: state/bin/autopilot/open-count.sh
data (1): eval log, `state/log/evals.ndjson` (skill: "autopilot")

SKILL.md - the skill, written out in plain English

autopilot

Two-script loop (originally two crons; break.sh is now on-demand only) that drives the snappy-os friction queue toward zero:

  • state/bin/autopilot/break.sh - finder. Headless Claude probes a surface and appends {status:"open"} rows to state/log/breakage-report.ndjson.
  • state/bin/autopilot/fix.sh - fixer. Picks the top open p0/p1 row, re-runs the repro, and appends a {status:"resolved"} row when the breakage actually closes.
  • (The PID re-test gate at state/bin/autopilot/regen.sh was retired 2026-04-28 in task #389; the wired Stop hook state/hooks/snappy-os-auto-regen.sh plus state/regen/drain.sh now carry that responsibility - see CLAUDE.md "How the system enforces itself".)

Engagement is gated by state/engaged.json (must contain "autopilot").
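The engagement check can be sketched as below. This is only an illustration under stated assumptions: the page says the file "must contain" the string "autopilot" but does not show its layout, so the sketch assumes state/engaged.json is valid JSON that names the skill as a key or string value somewhere. The helper names `names_skill`, `engaged_from_text`, and `is_engaged` are hypothetical and not part of the skill.

```python
import json
from pathlib import Path


def names_skill(node, skill):
    """Recursively check whether parsed JSON mentions `skill` as a key or string value."""
    if isinstance(node, str):
        return node == skill
    if isinstance(node, dict):
        return any(key == skill or names_skill(value, skill) for key, value in node.items())
    if isinstance(node, list):
        return any(names_skill(item, skill) for item in node)
    return False


def engaged_from_text(raw, skill="autopilot"):
    """Return True only if `raw` parses as JSON and names the skill."""
    try:
        return names_skill(json.loads(raw), skill)
    except json.JSONDecodeError:
        return False  # assumption: an unparseable file counts as disengaged


def is_engaged(path="state/engaged.json", skill="autopilot"):
    """Read the engagement file; a missing file counts as disengaged (assumption)."""
    try:
        return engaged_from_text(Path(path).read_text(), skill)
    except OSError:
        return False
```

A tick would call `is_engaged()` first and return without doing work when it is false, which matches the "disengaged = no-op by design" behavior the loader describes.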

Eval

Contract (Pod K audit found this was missing; Pod W rewrote it 2026-04-18):

The eval gate measures whether the autopilot loop reduced open friction - its actual job - not whether the script exited 0.

For every tick (breaker + fixer + precheck), the script:

  1. Counts open p0/p1 rows in state/log/breakage-report.ndjson BEFORE doing any work, using the dedupe-by-area-then-filter-resolved semantics from state/skills/snappy-fix/AGENTS.md (latest row per area wins; an "open" row followed by a newer "resolved" row for the same area is already closed). Implemented once in state/bin/autopilot/open-count.sh so both ticks compute the count the same way.
  2. Runs the tick.
  3. Counts open p0/p1 again AFTER.
  4. score = 1 if after <= before, else score = 0.
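The counting step above can be sketched in Python to show the dedupe-then-filter order. This is not the real open-count.sh (which the page describes as pure jq and does not show). The field names `area` and `severity`, and the rule that "latest" means "last in file order", are assumptions; only `status` appears in the source.

```python
import json


def count_open(lines, severities=("p0", "p1")):
    """Count areas whose most recent row is open and in the gated severities."""
    latest = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        # Assumption: the log is append-only, so a later line is the newer row.
        latest[row["area"]] = row
    return sum(
        1
        for row in latest.values()
        if row.get("status") == "open" and row.get("severity") in severities
    )


# Illustrative rows only; these are not taken from the real log.
log = [
    '{"area": "login", "status": "open", "severity": "p0"}',
    '{"area": "billing", "status": "open", "severity": "p1"}',
    '{"area": "login", "status": "resolved", "severity": "p0"}',
    '{"area": "docs", "status": "open", "severity": "p2"}',
]
print(count_open(log))  # 1
```

Counting every open row in this sample would report 3: it would include the stale "login" row that was later resolved and the p2 row that the gate deliberately ignores.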

Edge case (per Robert's spec): if there were no open p0/p1 rows to begin with AND no new breakage was introduced, the system is healthy and the tick correctly did nothing -> score = 1. The <= comparison handles both "fixer closed something" and "system was already clean" cleanly.
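As a minimal sketch, the gate and its edge case reduce to a single comparison. The function name `outcome_score` is hypothetical, and raising on a missing count is an assumption chosen to match the rule that rows must carry both fields.

```python
def outcome_score(open_before, open_after):
    """Return 1 when the tick did not increase open p0/p1 friction, else 0."""
    if open_before is None or open_after is None:
        # Assumption: refuse to score rather than emit an unauditable row.
        raise ValueError("open_before and open_after are both required")
    return 1 if open_after <= open_before else 0


print(outcome_score(4, 3))  # 1: the fixer closed something
print(outcome_score(0, 0))  # 1: the system was already clean
print(outcome_score(2, 5))  # 0: the tick left more open friction than it found
```

The script's exit code is not an input here, which is the point of the gate: a tick can exit 0 and still score 0.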

Each eval row carries the raw inputs so you can audit the gate without re-deriving them:

{
  "skill": "autopilot",
  "verb": "fixer-tick",
  "score": 1,
  "ok": true,
  "open_before": 4,
  "open_after": 3,
  "exit": 0,
  "run_id": "fixer-1745024400",
  "machine": "ray-mac"
}

Why this gate is load-bearing. The previous shape-only gate (score = (exit == 0) ? 1 : 0) scored autopilot 1.0 on every cron tick regardless of whether the friction queue grew, shrank, or stayed the same. That made state/log/evals.ndjson lie about the most important loop in the system - Pod K caught it during commit ae19fee's aftermath. The outcome gate now matches the loop's actual purpose.

Actor != auditor. The actor is the headless Claude breaker/fixer that runs the tick. The auditor is open-count.sh - pure jq, deterministic, no LLM. The two cannot collude.

Files involved (auto-eval contract):

  • Actor: state/bin/autopilot/{break,fix}.sh
  • Auditor: state/bin/autopilot/open-count.sh
  • Source of truth: state/log/breakage-report.ndjson
  • Eval log: state/log/evals.ndjson (skill: "autopilot")

Known regressions to avoid

  • DO NOT swap the gate back to score = (exit == 0); that's the bug Pod K's audit caught. Shell exit and outcome are different things.
  • DO NOT count raw grep '"status":"open"' matches; the log is append-only and a stale "open" row is often followed by a newer "resolved" row for the same area. Use open-count.sh (which dedupes by area first, filters by status second).
  • DO NOT widen the gate to p2 - p2 is "nice to have" backlog and shouldn't fail the loop.

Known failure modes (from the 4 hard zeros, 2026-04-18/19/20)

Pattern 1 - missing telemetry fields (run_ids: fixer-1776485707, breaker-1776551290). Two runs, one fixer-tick and one breaker-tick, logged score:0 without open_before or open_after fields. Without these the auditor cannot verify the outcome gate - the row is unauditable. Root cause: the eval format was not yet standardized when those scripts first ran.

  • Fix already in place: the eval schema example above is the required shape. Any script that logs a score:0 row without both fields is non-compliant and should be treated as a script bug, not as system friction.
  • If you see a score:0 row missing open_before/open_after, the right diagnosis is "telemetry gap," not "friction spike."
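A telemetry-gap scan along the lines of Pattern 1 might look like the sketch below. It assumes state/log/evals.ndjson holds one JSON object per line with the fields shown in the eval row example. The function name `telemetry_gaps` and the `<no run_id>` placeholder are hypothetical.

```python
import json

REQUIRED = ("open_before", "open_after")


def telemetry_gaps(lines, skill="autopilot"):
    """Return run_ids of rows for `skill` that lack a non-null required field."""
    gaps = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        if row.get("skill") != skill:
            continue  # rows from other skills are out of scope for this check
        if any(row.get(field) is None for field in REQUIRED):
            gaps.append(row.get("run_id", "<no run_id>"))
    return gaps
```

Any run_id this returns points at a producer bug to fix, not at a friction spike to investigate.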

Pattern 2 - breaker cron deprecated mid-run (run_ids: breaker-1776637947, breaker-1776656645). break.sh was deprecated on 2026-04-19 (cron removed; frictions now come from skill self-reporting via the PID loop). Two cron-fired breaker runs landed on April 19 and April 20 during the transition window. Both had exit:0 (the LLM ran fine) but open_after > open_before (3 and 1 new p0/p1 rows respectively), so the outcome gate correctly scored them 0. This was expected behavior during the transition - the breaker cron was finding real frictions on a system that had just been restructured.

  • break.sh is now on-demand only (operator triage, not cron). Do not re-cron it. If you see future breaker-tick zero-scores, check whether the cron has been accidentally re-installed (crontab -l | grep break).
  • The outcome gate behavior is correct: if a tick increases the open count, that is a net-negative result regardless of the tick's verb. The April 19/20 zeros were transition noise, not a gate bug.
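The cron audit above can also be done programmatically. The sketch below shells out to `crontab -l` and looks for active lines that mention break.sh. The function names are hypothetical, and matching on "break.sh" rather than the broader `grep break` from the text is a deliberate narrowing, not the skill's own check.

```python
import subprocess


def breaker_lines(crontab_text, needle="break.sh"):
    """Return active (non-comment) crontab lines that mention the breaker script."""
    hits = []
    for line in crontab_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and commented-out entries never fire
        if needle in stripped:
            hits.append(stripped)
    return hits


def read_crontab():
    """Return the current user's crontab text, or '' when there is none to read."""
    try:
        result = subprocess.run(["crontab", "-l"], capture_output=True, text=True)
    except FileNotFoundError:
        return ""  # no crontab binary on this machine
    return result.stdout if result.returncode == 0 else ""
```

An empty list from `breaker_lines(read_crontab())` is the state the breaker_cron_disabled criterion expects. Unlike a plain grep, a commented-out entry does not count as a hit.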

Rubric

criteria:
  - name: eval_score_reflects_friction
    kind: deterministic
    check: "The 'score' field in 'state/log/evals.ndjson' for 'autopilot' entries must be 1 if 'open_after <= open_before', and 0 otherwise, as calculated by 'state/bin/autopilot/open-count.sh'."
  - name: telemetry_completeness
    kind: deterministic
    check: "Every 'autopilot' entry in 'state/log/evals.ndjson' must contain non-null 'open_before' and 'open_after' fields."
  - name: no_raw_grep_counting
    kind: deterministic
    check: "Ensure no direct 'grep \"status\":\"open\"' is used for counting open P0/P1 issues; 'state/bin/autopilot/open-count.sh' must be used for deduped, filtered counts."
  - name: breaker_cron_disabled
    kind: deterministic
    check: "The 'state/bin/autopilot/break.sh' script must not be scheduled via 'crontab -l'."

AGENTS.md - what the AI loads when this skill comes up

autopilot - loader

Per-turn rules for the autopilot skill. Full reference: state/skills/autopilot/SKILL.md. Do not skip these.

autopilot drives the snappy-os friction queue toward zero via two scripts: state/bin/autopilot/break.sh (finder, on-demand) appends {status:"open"} rows to state/log/breakage-report.ndjson; state/bin/autopilot/fix.sh (fixer, active cron) re-runs the top open p0/p1 repro and appends a {status:"resolved"} row when the breakage actually closes. The PID re-test gate at regen.sh was retired 2026-04-28 - state/regen/drain.sh + state/hooks/snappy-os-auto-regen.sh now carry that responsibility.

Critical Rules

  • Outcome gate is score = 1 if open_after <= open_before, else 0. Pod K caught the prior shape-only gate (score = (exit == 0)) lying about the most important loop in the system. NEVER swap it back. Shell exit and outcome are different things; the gate measures whether the tick reduced open friction, not whether the script ran clean.
  • NEVER count via raw grep '"status":"open"' state/log/breakage-report.ndjson. The log is append-only; a stale "open" row is often followed by a newer "resolved" row for the same area. ALWAYS go through state/bin/autopilot/open-count.sh (dedupes by area first, filters by latest status second). Both ticks must compute the count the same way.
  • Eval rows MUST carry open_before AND open_after. A score:0 row missing either field is a telemetry gap (script bug), NOT a friction spike. Treat it as non-compliant and fix the producer; do not draw conclusions from it.
  • Don't widen the gate to p2. p0/p1 only. p2 is "nice to have" backlog and shouldn't fail the loop.

  • break.sh is on-demand only (operator triage), NOT cron. Deprecated as a scheduled job 2026-04-19. Frictions now come from skill self-reporting via the PID loop. If you see future breaker-tick zero-scores, check whether the cron has been accidentally re-installed (crontab -l | grep break). fix.sh is the active fixer.

  • Engagement is gated by state/engaged.json. The file must contain the string "autopilot" for either tick to do real work. Disengaged = no-op by design - not a bug.
  • Actor != auditor. Actor = headless Claude breaker/fixer. Auditor = open-count.sh (pure jq, deterministic, no LLM). They cannot collude. Don't invent a third counter.
  • One eval row per run to state/log/evals.ndjson via score() (CONSTITUTION invariant #4). eval: auto per frontmatter.

Commands

  • ui dashboard: state/skills/autopilot/resources/ui.openui
  • fixer (active cron): bash state/bin/autopilot/fix.sh
  • breaker (on-demand only): bash state/bin/autopilot/break.sh
  • count open p0/p1 (the auditor): bash state/bin/autopilot/open-count.sh
  • engagement gate: state/engaged.json (must contain "autopilot")
  • source of truth (frictions): state/log/breakage-report.ndjson
  • eval log: state/log/evals.ndjson (skill: "autopilot")
  • reference: state/skills/autopilot/SKILL.md
  • cron audit: crontab -l | grep -E 'break|fix'

Required eval row shape

{
  "skill": "autopilot",
  "verb": "fixer-tick",
  "score": 1,
  "ok": true,
  "open_before": 4,
  "open_after": 3,
  "exit": 0,
  "run_id": "fixer-1745024400",
  "machine": "ray-mac"
}

score is computed from open_before/open_after, not from exit. Edge case: zero open p0/p1 before AND no new breakage introduced -> score = 1 (system was already clean; the <= comparison handles it cleanly).

OpenUI Resource

  • Skill-owned OpenUI Lang resource: state/skills/autopilot/resources/ui.openui. Read it before rendering or editing this skill's generated component surface.
  • Treat this resource as a first-class artifact of the skill, not a generic chat response. Improve it when the skill's user-facing output needs to become richer.
  • System resources compose OpenUI primitives and inherit SnappyChat tokens. Use ui_contract: branded in SKILL.md only for deliberate platform or client visuals.

Known Pitfalls

  • Telemetry gap (Pattern 1, hard zeros 2026-04-18). Two fixer-tick / breaker-tick runs (run_ids fixer-1776485707, breaker-1776551290) logged score:0 without open_before/open_after. The auditor cannot verify the gate without those fields - the row is unauditable. Diagnose these as script bugs, not friction spikes. Fix the producer to emit the schema above.
  • Breaker cron deprecated mid-run (Pattern 2, 2026-04-19/20). Two cron-fired breaker runs (breaker-1776637947, breaker-1776656645) scored 0 during the cron-removal transition window: exit:0 but open_after > open_before. The gate behavior was correct - increasing open count is net-negative regardless of verb. Don't re-cron break.sh in response.

Self-Test

After reading this, does the agent correctly:

  1. [ ] Compute the eval score as open_after <= open_before ? 1 : 0, not from exit?
  2. [ ] Use state/bin/autopilot/open-count.sh for both before/after counts (never raw grep)?
  3. [ ] Refuse to emit a score:0 row missing open_before or open_after?
  4. [ ] Keep the gate at p0/p1 only (no p2 widening)?
  5. [ ] Treat break.sh as on-demand (not cron) and confirm via crontab -l | grep break?
  6. [ ] Skip work if state/engaged.json doesn't contain "autopilot"?
  7. [ ] Append exactly one eval row per run via score()?

Self-report

If this loader fell short, append a line:

echo "[$(date -u +%FT%TZ)] autopilot: <what was missing or fixed> [FIXED|LOGGED] action_kind=<kind>" >> state/log/loader-feedback.log

<!-- kernel-ok: historical attribution for the PID-loader pattern. -->

Self-correcting loader (PID feedback)

This is the rule that makes loaders learn. Ported from snappy-kernel, where it demonstrably worked: the snappy-dom-cartographer skill mapped the Skool admin UI once, and every later ship-agent inherited the full map for free.

Before you finish, do two things:

1. Fix gaps (P - proportional). MANDATORY. If this loader didn't cover your case - if you had to read another file, run an undocumented command, work around a wrong selector, or discover a quirk - you MUST attempt an Edit to this AGENTS.md before you log.

  • Default action: edit this file. Surgical. Add the missing line, table row, or code example where it belongs. No restructuring.
  • LOGGED is allowed only when the fix needs >10 lines, spans multiple files, or requires a structural rewrite. In that case the state/regen/drain.sh queue picks it up asynchronously.
  • "I didn't have time" / "it's minor" / "the next agent will figure it out" are NOT valid reasons. Minor gaps compound - that's the whole point of this loop.
  • Concrete example of a valid P-fix: you discover that agent-browser screenshot takes its path positionally, not as --output <path>. One line in a table. Edit it in. Do not just log it.
  • The goal: the next agent never has to leave the loader.

2. Log the result.

echo "[$(date -u +%FT%TZ)] autopilot: <what was missing or fixed> [FIXED|LOGGED] action_kind=<kind>" >> state/log/loader-feedback.log
  • FIXED = you patched this loader inline (P-fix).
  • LOGGED = too large for inline; the PostToolUse enqueue + Stop-hook drain will rewrite the loader from scratch on next session-end.
  • action_kind values: shape-ok | skill-ran | loader-rewritten | pattern-elevated. Pick the one that describes what you actually did. If you LOGGED, omit it; the inferrer picks it up from body keywords.
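For agents that log from code rather than a shell, the feedback line can be built as sketched below. It mirrors the echo command's shape and the action_kind rules listed above. The function name `feedback_line` is hypothetical, and requiring an action_kind on FIXED entries is an assumption (the text only says to omit it when LOGGED).

```python
from datetime import datetime, timezone

ACTION_KINDS = {"shape-ok", "skill-ran", "loader-rewritten", "pattern-elevated"}


def feedback_line(message, status, action_kind=None, now=None):
    """Format one loader-feedback entry in the shape the echo command produces."""
    if status not in ("FIXED", "LOGGED"):
        raise ValueError("status must be FIXED or LOGGED")
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y-%m-%dT%H:%M:%SZ")  # same shape as date -u +%FT%TZ
    line = f"[{stamp}] autopilot: {message} [{status}]"
    if status == "FIXED":
        # Assumption: a FIXED entry always names one of the listed kinds.
        if action_kind not in ACTION_KINDS:
            raise ValueError(f"unknown action_kind: {action_kind!r}")
        line += f" action_kind={action_kind}"
    return line  # LOGGED entries carry no action_kind, per the rule above
```

The caller would append the returned string plus a newline to state/log/loader-feedback.log.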

Do not skip this. Every agent run must leave the system better than it found it. The loader is the setpoint; you are the sensor; the gap is the error signal; closing the gap is the correction.

api.ts - the code it can call

⚠ no api.ts - this skill has no typed action surface

scripts - helper scripts it can run

prose-only skill - 2 inline code blocks live in SKILL.md above (no state/bin/ sidecar yet).

how we check it - the checks, plus the last 10 runs

rubric auto no rubric declared
recent mean 0.75 · 10 runs actor/auditor: unverifiable
deps snappy-fix evolve
timestamp verb score primary_issue artifact
2026-04-26 23:47Z - 0.50 - -
2026-04-25 04:11Z - 1.00 - -
2026-04-21 15:58Z - 1.00 - -
2026-04-21 15:56Z - 1.00 - -
2026-04-21 03:53Z - 1.00 - -
2026-04-20 03:44Z - 0.00 - -
2026-04-19 22:32Z - 0.00 - -
2026-04-18 22:29Z - 1.00 - -
2026-04-18 22:29Z - 1.00 - -
2026-04-18 22:28Z - 1.00 - -
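The "recent mean 0.75" figure can be reproduced from the ten scores in the table. This is plain arithmetic on the values shown, listed newest first.

```python
# Scores copied from the run table above, newest first.
scores = [0.50, 1.00, 1.00, 1.00, 1.00, 0.00, 0.00, 1.00, 1.00, 1.00]

recent_mean = sum(scores) / len(scores)
print(f"{recent_mean:.2f}")  # 0.75
```

The 0.50 entry is not a value the binary outcome gate can produce on its own, and the page does not say where it comes from.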