drop another .md file to compare - side-by-side diff against desktop

desktop

Lets your assistant operate Mac apps for you by seeing the screen.

description: "Triggers on prompt mention of 'desktop'."

personal 2 files 10 recent evals

Export

What it does for you

Lets your assistant operate Mac apps for you by seeing the screen.

What it produces

A recent result, so you can see the kind of work it returns.

loading…

How to get it

These run inside the Snappy workspace. Want this working in your business? I set skills like this up with you, in one focused week.

Work with me

For developers how this skill is built, graded, and how it runs

at a glance- the short version

actorExported functions in state/lib/desktop.ts.

auditorNone wired yet - eval is manual (Robert review).

eval modeshape

categoryIntegrations

stages3

dependssettings

what's inside - the parts that make up a skill 3/4 present

A skill is just a few plain-text files. Only the main one is required. The rest are optional, added as the work needs them. This is what the skill is made of; how it runs is just below.

The skill

state/skills/desktop/SKILL.md present

the skill itself, in plain text

The main file. It says what the skill is and lays out the steps in plain English.

Code

state/lib/desktop.ts present

code the skill can run

Reusable code this skill can call when it needs to.

Scripts

state/bin/desktop/ not present

helper scripts

Optional. Added when a skill has a few commands to run.

Loader

state/skills/desktop/AGENTS.md present

what the AI loads on the fly

Loaded automatically the moment this skill is needed. Kept short on purpose.

how it's graded - what counts as a good run 5 criteria · 5 deterministic

Each row is one thing a good run has to get right. deterministic means a quick check decides, pass or fail. judge means the AI reads the result and rates it. Grading each piece on its own (instead of one overall score) shows exactly where a run fell short, so the fix is obvious.

name

kind

check

calls_deterministically

deterministic

Skill execution does not introduce new 'import' declarations outside of 'state/lib/desktop.ts'.

midscene_vision_used

deterministic

The 'runMidscene()' function is called with the 'runMidscene_input' from the skill's inputs.

screenshot_taken

deterministic

The 'screenshot()' function is called with the 'screenshot_input' from the skill's inputs.

whatsapp_chats_read

deterministic

The 'whatsappReadChats()' function is called with the 'whatsappReadChats_input' from the skill's inputs.

eval_log_recorded

deterministic

A new row is appended to 'state/log/pending-eval.ndjson' for the execution.

how it runs - the shared frame every skill uses 5/5 present

Every skill runs the same way. One part does the work, a separate part checks it, and a short loader hands the AI exactly what it needs for the job. Anything this skill doesn't use shows a one-line note saying why, on purpose, not by accident.

makes the work The worker

present

Exported functions in state/lib/desktop.ts. the worker

Does the actual work. Whatever it produces is what gets checked next.

checks the work The reviewer

present

None wired yet - eval is manual (Robert review). the checker

A separate checker grades the work, so the part that made it can't approve its own work.

frame

learns Self-correction

present

fixes itself learns from gaps

When a run hits a gap, the skill gets edited on the spot [FIXED] or queued for a bigger rewrite [LOGGED], so it keeps getting better.

tidies up Background fixes

present

queued for rewrite runs in the background

Bigger fixes that can't be made on the spot get queued and rewritten in the background later.

remembers Run history

present

state/log/pending-eval.ndjson pending runs

Every run is written down here, then reviewed by hand each week.

Critical rules the things this skill must not get wrong

No must-not-break rules called out for this skill. Anything important lives in the writeup below.

what it has learned - fixes written back in over time sample

When a run hits something this skill didn't handle, the fix gets written back into the skill so it doesn't happen again. FIXED means it was corrected on the spot. LOGGED means it's queued for a bigger rewrite. Either way, the skill gets a little better and never makes the same mistake twice.

Loading feedback rows…

how the work flows- who makes it, who checks it

inputs settings

actor Exported functions in state/lib/desktop.ts.

1 generator

invoke

actor = Exported functions in state/lib/desktop.ts.

import from `state/lib/desktop.ts` — deterministic: `focusApp()`, `captureWindow()`, `captureScreen()`, `pasteText()`, `pressKey()`, `clickAt()`, `listWindows()`, `typeAndSubmit()`. Vision: `runMidscene()`, `screenshot()`, `whatsappReadChats()`, `whatsappReadChat()`, `whatsappSendMessage()`.

+ eval for this step

auditor None wired yet - eval is manual (Robert review).

2 auditor

inspect

auditor = None wired yet - eval is manual (Robert review).

captureWindow(appName, path)` immediately after any keyboard or click action — actor ≠ auditor.

+ eval for this step

3 data

eval log

`state/log/pending-eval.ndjson` (manual eval — skill: "desktop"). Run-this is often skipped because Midscene takes over real mouse/keyboard; that's expected — frontmatter-shape ok confirmations still land.

+ eval for this step

SKILL.md- the skill, written out in plain English

desktop

macOS desktop automation. Two layers in one skill:

Deterministic peekaboo wrappers - focusApp, captureWindow,

captureScreen, pasteText, pressKey, clickAt, listWindows, typeAndSubmit. Cheap, reliable. Use when you know the target.

Vision-driven Midscene - runMidscene, whatsapp*. Expensive,

non-deterministic. Use when the target needs to be located on screen.

Ported from kernel snappy-desktop in Phase 0.5. Extended 2026-04-28 with the deterministic primitives so future agents stop rediscovering shell incantations every session. See state/lib/desktop.ts for the full API surface.

Steps

focusApp(appName) - see state/lib/desktop.ts
captureWindow(appName, path) - see state/lib/desktop.ts
captureScreen(path) - see state/lib/desktop.ts
pasteText(text) - see state/lib/desktop.ts
pressKey(key) - see state/lib/desktop.ts
clickAt(x, y) - see state/lib/desktop.ts
listWindows(appName?) - see state/lib/desktop.ts
typeAndSubmit(appName, text, opts?) - see state/lib/desktop.ts
runMidscene(instruction) - see state/lib/desktop.ts
screenshot() - see state/lib/desktop.ts
whatsappReadChats() - see state/lib/desktop.ts
whatsappReadChat(contact) - see state/lib/desktop.ts
whatsappSendMessage(contact, text) - see state/lib/desktop.ts

Library API

Why peekaboo

macOS TCC gates raw primitives (screencapture -x, osascript keystroke, cliclick) per-process-ancestor. Every shell-chain that wants to drive the GUI ends up rediscovering which permission is missing. peekaboo holds its own Screen Recording + Accessibility grants, so the same call works whether dispatched from a login shell, an SSH session, or a headless subprocess. That is why the deterministic helpers below exist - to expose the same primitive surface Charlotte's mac_mini_exec exposes, but as plain CLI/lib calls (snappy-os does NOT do MCP).

Install: brew install peekaboo. The lib auto-discovers /opt/homebrew/bin/peekaboo, the matching Cellar version, and the PATH which result in that order; throws a clear error if missing.

Bridge routing

Every shell call in this skill (runPeekaboo, focusApp, captureScreen) is dispatched through state/lib/bridge.ts when the openclaw helper daemon at 127.0.0.1:18790 is reachable. The bridge holds the macOS TCC grants that Claude Code's own child-process tree does not - routing through it turns "screencapture produced a zero-byte PNG" and "peekaboo returned PERMISSION_ERROR_SCREEN_RECORDING" into actual successful captures. See the bridge skill for the relay contract.

The probe is cached for 60s inside state/lib/bridge.ts, so this is one round-trip per minute, not per call. When the daemon is down, runShell() falls back to /bin/sh -c <command> directly - the skill remains usable on machines without the bridge, but TCC failures will surface with their normal symptoms (zero-byte PNG, permission error, keystroke no-op). If you hit one, the fix is to restart the daemon (~/.openclaw/helper/server.js), not to chase the grant up the local shell-chain. If neither bridge nor local TCC works, every helper throws on the missing artifact (e.g. captureWindow checks existsSync(path) after the call returns).

`focusApp(appName: string): Promise<void>`

Activates the app via AppleScript and sleeps 400ms so the WindowServer can settle. Always call this before keyboard-driven flows - peekaboo's paste and press target the frontmost app, so an unfocused window swallows input silently.

import { focusApp, pasteText, pressKey } from "state/lib/desktop.ts";
await focusApp("SnappyChat");
await pasteText("hello");
await pressKey("return");

`captureWindow(appName: string, path: string): Promise<string>`

Window-scoped screenshot via peekaboo image --mode window --app <appName> --path <path>. Crops out menu bar/dock automatically; doesn't depend on which space is frontmost. Throws if the file isn't produced. Returns the resolved path.

const out = await captureWindow("SnappyChat", "/tmp/sc.png");
// out === "/tmp/sc.png"

`captureScreen(path: string): Promise<string>`

Full-display screenshot via /usr/sbin/screencapture -x. Use when you need the whole screen (multi-window verification) or peekaboo isn't available. Returns the resolved path.

await captureScreen("/tmp/full.png");

`pasteText(text: string): Promise<void>`

Pastes via peekaboo's clipboard route: sets clipboard, sends Cmd+V, restores prior clipboard. Works without Accessibility-typing permission because Cmd+V uses the Cocoa text-input path, not synthetic keystrokes. The clipboard is restored after - non-destructive to whatever the user had copied.

await focusApp("SnappyChat");
await pasteText("open the agent panel");

`pressKey(key: string): Promise<void>`

Single named key via peekaboo. Common keys: return, escape, tab, space, delete, arrow keys (up/down/left/right), function keys (f1-f12). Use pasteText for actual content; pressKey is for navigation/control.

await pressKey("escape");  // dismiss a modal

`clickAt(x: number, y: number): Promise<void>`

Logical (point-based, NOT pixel) screen coordinates. Origin top-left, global. Caveat: WKWebView/Electron apps that don't expose an Accessibility tree receive a synthetic mouse event but the React/JS handler may not fire if the event doesn't match the framework's hit target. For snappy-chat's OpenUI composer in particular, clickAt is fragile - prefer focusApp + pasteText + pressKey. Reserve clickAt for native AppKit targets (toolbar buttons, native menus).

await clickAt(640, 480);

`listWindows(appName?: string): Promise<WindowInfo[]>`

Enumerates windows. With appName, returns just that app's windows; without, every window peekaboo can see. Each entry: wid (CoreGraphics window id), title, bounds {x, y, w, h} (logical coords), isMain. JSON parse is defensive across peekaboo versions.

const windows = await listWindows("SnappyChat");
const main = windows.find((w) => w.isMain);

`typeAndSubmit(appName, text, opts?): Promise<void>`

Composite helper. Optionally focuses the app, optionally clicks into a textfield, pastes the text, presses return. Closes the gap between "I want to drive my own app's composer" and "I have to chain five primitives every time."

opts.focusBefore (default: true) calls focusApp first. opts.clickCoords: [x, y] clicks before pasting (for apps that don't auto-focus their composer when activated).

await typeAndSubmit("SnappyChat", "list scheduled jobs");
// or with explicit composer click:
await typeAndSubmit("SnappyChat", "ship it", { clickCoords: [800, 950] });

`runMidscene(instruction: string): string` (vision)

Routes through Terminal.app for Screen Recording TCC. Expensive (non-deterministic vision call). Use only when the target must be located on screen - for known targets, prefer the deterministic primitives above.

`screenshot(): string` (back-compat shim)

Synchronous string-returning wrapper that delegates to screencapture -x. Kept for the existing WhatsApp helpers; new code should call captureScreen() (async, takes a path) directly.

Eval

Actor: the exported functions in state/lib/desktop.ts. Auditor: none wired yet - eval is manual (Robert review). File a state/log/pending-eval.ndjson row on each run.

Score convention:

Outcome	Score
Pass on first try	1.0
Failed first, auto-fix applied, re-check passed	0.5
Still failing or unrecoverable	0.0

Gotchas

via the Phase 0.5 driver. Only these rewrites were applied: already in state/lib/)

realpathSync(process.argv[1]) CLI guard wrapped in try/catch

See the kernel SKILL.md for the original long-form guidance if you need it

(read-only reference at the kernel path above).

Graduation

This skill is prose. Graduate by defining a deterministic auditor and flipping eval: auto.

Rubric

criteria:
  - name: calls_deterministically
    kind: deterministic
    check: "Skill execution does not introduce new 'import' declarations outside of 'state/lib/desktop.ts'."
  - name: midscene_vision_used
    kind: deterministic
    check: "The 'runMidscene()' function is called with the 'runMidscene_input' from the skill's inputs."
  - name: screenshot_taken
    kind: deterministic
    check: "The 'screenshot()' function is called with the 'screenshot_input' from the skill's inputs."
  - name: whatsapp_chats_read
    kind: deterministic
    check: "The 'whatsappReadChats()' function is called with the 'whatsappReadChats_input' from the skill's inputs."
  - name: eval_log_recorded
    kind: deterministic
    check: "A new row is appended to 'state/log/pending-eval.ndjson' for the execution."

AGENTS.md- what the AI loads when this skill comes up

desktop - loader

Per-turn rules for the desktop skill. macOS automation in two layers: deterministic peekaboo wrappers and vision-driven Midscene. Full reference: state/skills/desktop/SKILL.md. Library: state/lib/desktop.ts. Routed through state/lib/bridge.ts (openclaw helper at 127.0.0.1:18790).

Critical Rules

NEVER fan out parallel agent-browser writes on a shared session - DOM mutations silently drop. Serialize uploads; parallelize generation only.
Midscene is vision-driven, expensive, non-deterministic - verify with captureWindow after every action; never trust the success return alone.
Prefer deterministic peekaboo (focusApp, captureWindow, pasteText, pressKey, typeAndSubmit) over Midscene whenever the target is known. Reserve vision for "I don't know where it is on screen."
clickAt is fragile on WKWebView/Electron (snappy-chat OpenUI composer included) - synthetic mouse events miss React hit targets. Use focusApp + pasteText + pressKey for keyboard flows instead.
Helpers route through the bridge daemon. TCC errors (zero-byte screencapture, PERMISSION_ERROR_SCREEN_RECORDING, silent keystroke failure) usually mean the daemon is down - restart with launchctl kickstart -k gui/$(id -u)/ai.openclaw.helper (NOT ~/.openclaw/helper/server.js directly) and verify with bridgeAvailable().
App name is "Snappy Chat" WITH A SPACE. peekaboo screenshot --app "Snappy Chat" is correct. "SnappyChat" (no space) returns "App not found." Always quote the name.
Window index matters. SnappyChat has a helper window (500x500, index 0) AND the real chat window (1280x900, index 1). Always pass --window-index 1 or --window-id <real-id> when capturing: peekaboo screenshot --app "Snappy Chat" --window-index 1 --output /tmp/out.png. Capturing index 0 looks like a blank square.
No raw osascript System Events. Do NOT use osascript -e 'tell application "System Events" to keystroke ...' - fails with error 1719 (TCC). Use desktop.ts CLI: npx tsx state/lib/desktop.ts press <key> or pasteText(). Keyboard chords (Cmd+N, Cmd+K) need peekaboo's hotkey subcommand, not pressKey.
Chord keys (Cmd+N, Cmd+K, etc.): desktop.ts pressKey() only sends single named keys. For chords use: peekaboo hotkey --keys "cmd,n" (comma-separated, no spaces). Example via bridge: bridgeExec("/opt/homebrew/bin/peekaboo hotkey --keys \\"cmd,n\\"").
Bridge 500 = restart the daemon, not retry. HTTP 500 from openclaw means the daemon is in a bad state. Fix: launchctl kickstart -k gui/$(id -u)/ai.openclaw.helper. Do NOT curl in a loop.

Commands

| ui dashboard | state/skills/desktop/resources/ui.openui | |invoke: import from state/lib/desktop.ts - deterministic: focusApp(), captureWindow(), captureScreen(), pasteText(), pressKey(), clickAt(), listWindows(), typeAndSubmit(). Vision: runMidscene(), screenshot(), whatsappReadChats(), whatsappReadChat(), whatsappSendMessage(). |capture snappy-chat (correct): peekaboo screenshot --app "Snappy Chat" --window-index 1 --output /tmp/out.png |chord keypress: peekaboo hotkey --keys "cmd,n" (or "cmd,k" etc.) - single named keys only for pressKey |restart bridge daemon: launchctl kickstart -k gui/$(id -u)/ai.openclaw.helper |list windows with ids: peekaboo list windows --app "Snappy Chat" (run immediately before using --window-id) |capture Claude Desktop: peekaboo screenshot --app "Claude" --window-index 0 --output /tmp/claude-ref.png |verify: captureWindow(appName, path) immediately after any keyboard or click action - actor ≠ auditor. |install: brew install peekaboo (auto-discovered at /opt/homebrew/bin/peekaboo and Cellar fallback). Lib throws clear error if missing. |eval log: state/log/pending-eval.ndjson (manual eval - skill: "desktop"). Run-this is often skipped because Midscene takes over real mouse/keyboard; that's expected - frontmatter-shape ok confirmations still land.

Working primitives

focusApp(appName) - AppleScript activate + 400ms settle. Always first for keyboard flows.
captureWindow(appName, path) - peekaboo image --mode window. Window-scoped; crops menu bar/dock; space-independent.
captureScreen(path) - /usr/sbin/screencapture -x. Whole display.
pasteText(text) - peekaboo paste (clipboard set + Cmd+V + restore). Survives without Accessibility-typing because Cmd+V is Cocoa text-input, not synthetic keystrokes.
pressKey(key) - peekaboo press. Common: return, escape, tab, space, delete, arrows, f1-f12. Use pasteText for content.
clickAt(x, y) - logical (NOT pixel) coords. Native AppKit only - fragile on WKWebView.
listWindows(appName?) - peekaboo list windows --json. Returns {wid, title, bounds, isMain}[].
typeAndSubmit(appName, text, opts?) - composite: focus + (optional click) + paste + return. The convenience helper for "drive my own app's composer."
Vision: runMidscene(prompt), screenshot(path), whatsappReadChats(), whatsappReadChat(name), whatsappSendMessage(name, body).

OpenUI Resource

Skill-owned OpenUI Lang resource: state/skills/desktop/resources/ui.openui. Read it before rendering or editing this skill's generated component surface.
Treat this resource as a first-class artifact of the skill, not a generic chat response. Improve it when the skill's user-facing output needs to become richer.
System resources compose OpenUI primitives and inherit SnappyChat tokens. Use ui_contract: branded in SKILL.md only for deliberate platform or client visuals.

Known Pitfalls

WhatsApp send is a side-effect - gate behind explicit apply: true per program.md scope-only default.
Midscene is the actor; the screenshot read is the auditor. Never trust the Midscene success return alone - actor ≠ auditor.
PERMISSION_ERROR_SCREEN_RECORDING means the calling process tree lacks the TCC grant. Fix: install peekaboo via brew and grant Screen Recording + Accessibility to the peekaboo binary in System Settings - don't chase the grant up the SSH/MCP shell chain.
"App not found" from peekaboo: you used "SnappyChat" instead of "Snappy Chat". Always use the space.
Helper window vs real window: window index 0 is a tiny 500x500 helper. Always use --window-index 1 for the main chat window.
osascript System Events error 1719: TCC keystroke block. Never use System Events for keystrokes. Use peekaboo hotkey or desktop.ts pasteText.
peekaboo hotkey vs pressKey: pressKey (desktop.ts wrapper) sends one key. For chords (Cmd+N) use peekaboo hotkey --keys "cmd,n" directly (or via bridgeExec).

Self-Test

An agent reading this should correctly:

[ ] Refuse to launch parallel browser-mutation agents on one session
[ ] Verify Midscene actions with a fresh screenshot, not just the return value
[ ] Default whatsappSendMessage to scope-only
[ ] Reach for focusApp + pasteText + pressKey (not clickAt) to drive a WKWebView app's text composer
[ ] Use captureWindow (window-scoped) over captureScreen (full display) when the target app is known

Self-report

If this loader fell short, append a line:

echo "[$(date -u +%FT%TZ)] desktop: <what was missing>" >> state/log/loader-feedback.log

Self-correcting loader (PID feedback)

This is the rule that makes loaders learn. Ported from snappy-kernel, where it demonstrably worked: the snappy-dom-cartographer skill mapped Skool admin UI once, every later ship-agent inherited the full map for free.

Before you finish, do two things:

1. Fix gaps (P - proportional). MANDATORY. If this loader didn't cover your case - if you had to read another file, run an undocumented command, work around a wrong selector, or discover a quirk - you MUST attempt an Edit to this AGENTS.md before you log.

Default action: edit this file. Surgical. Add the missing line, table

row, or code example where it belongs. No restructuring.

Only LOGGED is allowed when: the fix needs >10 lines, spans multiple

files, or requires a structural rewrite. In that case the state/regen/drain.sh queue picks it up asynchronously.

"I didn't have time" / "it's minor" / "the next agent will figure it out"

are NOT valid reasons. Minor gaps compound - that's the whole point of this loop.

Concrete example of a valid P-fix: discover agent-browser screenshot

takes path positionally not as --output <path>. One line in a table. Edit it in. Do not just log it.

The goal: the next agent never has to leave the loader.

2. Log the result.

echo "[$(date -u +%FT%TZ)] desktop: <what was missing or fixed> [FIXED|LOGGED] action_kind=<kind>" >> state/log/loader-feedback.log

<slug> MUST be the literal folder name of this loader

(state/skills/<slug>/AGENTS.md). The class token between [ts] and : is the producer slug, the writeback class, AND the grade class - they must be equal so state/lib/controller-tune.ts can pair the brief.

FIXED = you patched this loader inline (P-fix).
LOGGED = too large for inline; the PostToolUse enqueue + Stop-hook drain

will rewrite the loader from scratch on next session-end.

action_kind is the SECOND pairing predicate (added 2026-04-27, task #327).

Pick the value that describes what you actually did - same slug, different action_kind means the writeback satisfies a different brief layer:

shape-ok - only frontmatter-shape verification passed (rare from

a human; usually emitted by the lint, not a loader echo)

skill-ran - the skill ran end-to-end and an eval row landed

in state/log/evals.ndjson

loader-rewritten - you EDITED this AGENTS.md inline (the FIXED case),

OR the regen drain rewrote it

pattern-elevated - you promoted a recurring failure to a Critical Rule

(rule fix or new-skill scaffold) If you LOGGED (couldn't fix inline), omit action_kind - the inferrer will pick it up from your body keywords.

Do not skip this. Every agent run must leave the system better than it found it. The loader is the setpoint; you are the sensor; the gap is the error signal; closing the gap is the correction.

api.ts- the code it can call

#!/usr/bin/env npx tsx
/**
 * snappy-desktop/api.ts -- macOS desktop automation primitives.
 *
 * Two layers in this file:
 *   1. Vision-driven (expensive, non-deterministic): runMidscene + the
 *      WhatsApp helpers. Use when you need to "find a thing on screen."
 *   2. Deterministic peekaboo wrappers (cheap, reliable): focusApp,
 *      captureWindow, captureScreen, pasteText, pressKey, clickAt,
 *      listWindows, typeAndSubmit. Use these to drive a known target
 *      (snappy-chat composer, etc.) without burning a vision call.
 *
 * Why peekaboo: macOS TCC (Screen Recording + Accessibility) gates the
 * raw primitives (`screencapture -x`, `osascript keystroke`, `cliclick`)
 * differently across shell-chains. peekaboo holds its own TCC grants,
 * so the same call works whether dispatched from a login shell, an SSH
 * session, or an MCP child process. snappy-os does NOT do MCP — the
 * point is to expose the same primitive surface Charlotte's
 * `mac_mini_exec` exposes, but as plain CLI/lib calls.
 *
 * Usage:
 *   npx tsx api.ts run "click the Settings icon"
 *   npx tsx api.ts screenshot
 *   npx tsx api.ts focus SnappyChat
 *   npx tsx api.ts capture-window SnappyChat /tmp/sc.png
 *   npx tsx api.ts paste "hello world"
 *   npx tsx api.ts press return
 *   npx tsx api.ts click 640 480
 *   npx tsx api.ts list-windows SnappyChat
 *   npx tsx api.ts type-submit SnappyChat "open the agent panel"
 *
 * Or import as module:
 *   import { focusApp, captureWindow, pasteText, pressKey, clickAt,
 *            listWindows, typeAndSubmit, runMidscene, screenshot } from "./desktop.ts";
 */

import { execFile, execFileSync, execSync } from "child_process";
import { existsSync, readFileSync, readdirSync, realpathSync, unlinkSync } from "fs";
import { promisify } from "util";
import { env } from "./env.ts";
import { bridgeAvailable, bridgeExec } from "./bridge.ts";

const execFileP = promisify(execFile);

// --- shell-quote (single-quote with embedded escape) ---
//
// Used when assembling commands for bridgeExec. The bridge daemon spawns the
// command via a shell, so any arg with whitespace or shell metacharacters
// needs quoting. Single-quote-and-escape is the simple form: 'abc' is
// literal, and 'it'\''s' embeds a single quote. We don't take a dependency
// on the shell-quote npm package — eight lines of inline code suffice.
function shellQuote(s: string): string {
  if (s === "") return "''";
  if (/^[A-Za-z0-9_./:=@%+,-]+$/.test(s)) return s;
  return `'${s.replace(/'/g, `'\\''`)}'`;
}

// --- bridge routing ---
//
// When the openclaw helper daemon is available, route shell commands through
// it so they inherit its TCC grants (Screen Recording, Accessibility). When
// it is NOT available, fall back to direct execFile so the skill remains
// usable on machines without the bridge -- the local call may still succeed
// if the agent's own process tree happens to hold the relevant grants.
//
// The probe is cached inside bridge.ts (60s TTL), so this is one round-trip
// per minute, not per call.
async function runShell(command: string, opts: { timeout?: number } = {}): Promise<{ stdout: string; stderr: string }> {
  if (await bridgeAvailable()) {
    const r = await bridgeExec(command, opts);
    if (!r.ok) {
      throw new Error(`[desktop] bridge exec failed: ${r.error ?? "unknown"}${r.stderr ? ` -- ${r.stderr}` : ""}`);
    }
    return { stdout: r.output ?? "", stderr: "" };
  }
  // Local fallback: spawn /bin/sh -c <command>. Same shape as the bridge,
  // different process tree. Will fail with TCC errors when those bite.
  const { stdout, stderr } = await execFileP("/bin/sh", ["-c", command], { maxBuffer: 16 * 1024 * 1024 });
  return { stdout, stderr };
}

// --- peekaboo discovery (cached at module load) ---

/** Window descriptor returned by listWindows(). */
export interface WindowInfo {
  wid: number;
  title: string;
  bounds: { x: number; y: number; w: number; h: number };
  isMain: boolean;
}

let _peekabooPath: string | null | undefined;

function findPeekaboo(): string | null {
  if (_peekabooPath !== undefined) return _peekabooPath;
  // Preferred symlink, then any Cellar version.
  const candidates = ["/opt/homebrew/bin/peekaboo", "/usr/local/bin/peekaboo"];
  for (const c of candidates) {
    if (existsSync(c)) {
      _peekabooPath = c;
      return c;
    }
  }
  // Fallback: glob the Cellar.
  for (const cellar of ["/opt/homebrew/Cellar/peekaboo", "/usr/local/Cellar/peekaboo"]) {
    if (existsSync(cellar)) {
      try {
        const versions = readdirSync(cellar);
        for (const v of versions) {
          const p = `${cellar}/${v}/bin/peekaboo`;
          if (existsSync(p)) {
            _peekabooPath = p;
            return p;
          }
        }
      } catch { /* ignore */ }
    }
  }
  // Last-ditch: PATH lookup via `which`.
  try {
    const out = execFileSync("/usr/bin/which", ["peekaboo"], { encoding: "utf-8" }).trim();
    if (out && existsSync(out)) {
      _peekabooPath = out;
      return out;
    }
  } catch { /* ignore */ }
  _peekabooPath = null;
  return null;
}

function requirePeekaboo(): string {
  const p = findPeekaboo();
  if (!p) {
    throw new Error(
      "peekaboo not found. Install with `brew install peekaboo` " +
        "(see https://github.com/steipete/peekaboo). The desktop skill " +
        "uses peekaboo as its canonical TCC-friendly automation primitive.",
    );
  }
  return p;
}

async function runPeekaboo(args: string[]): Promise<{ stdout: string; stderr: string }> {
  const bin = requirePeekaboo();
  // Compose a single shell-safe command so the bridge can run it through its
  // shell (the daemon's /exec endpoint takes one string, not an argv array).
  // Local fallback path also goes through /bin/sh -c via runShell.
  const command = [bin, ...args.map(shellQuote)].join(" ");
  try {
    return await runShell(command);
  } catch (e: any) {
    const code = typeof e?.code === "number" ? e.code : "?";
    console.error(
      `[desktop] peekaboo failed (exit=${code}): ${bin} ${args.map((a) => JSON.stringify(a)).join(" ")}`,
    );
    if (e?.stderr) console.error(`[desktop] peekaboo stderr: ${String(e.stderr).trim()}`);
    throw e;
  }
}

function sleepMs(ms: number): Promise<void> {
  return new Promise((r) => setTimeout(r, ms));
}

// --- Deterministic primitives (peekaboo + screencapture) ---

/**
 * Bring an app to the foreground via AppleScript activate. Sleeps 400ms
 * after to give the WindowServer time to settle before further actions.
 */
export async function focusApp(appName: string): Promise<void> {
  const escaped = appName.replace(/"/g, '\\"');
  const command = `/usr/bin/osascript -e ${shellQuote(`tell application "${escaped}" to activate`)}`;
  try {
    await runShell(command);
  } catch (e: any) {
    console.error(`[desktop] focusApp("${appName}") failed: ${String(e?.message || e).trim()}`);
    throw e;
  }
  await sleepMs(400);
}

/**
 * Capture a window-scoped screenshot of <appName> to <path> via peekaboo.
 * Returns the resolved path on success, throws on failure (peekaboo missing,
 * app not running, or capture errored). Window mode is preferred over
 * full-display because it crops out menu bar/dock and doesn't depend on
 * which space is frontmost.
 */
export async function captureWindow(appName: string, path: string): Promise<string> {
  const args = ["image", "--mode", "window", "--app", appName, "--path", path];
  const { stderr } = await runPeekaboo(args);
  if (!existsSync(path)) {
    throw new Error(
      `[desktop] captureWindow: peekaboo reported success but ${path} is missing. stderr: ${stderr.trim()}`,
    );
  }
  return path;
}

/**
 * Full-display screenshot via /usr/sbin/screencapture -x. Reliable when run
 * from a process whose ancestor chain holds Screen Recording TCC; falls back
 * useful when peekaboo isn't installed but the parent shell has the grant.
 * Returns the resolved path.
 */
export async function captureScreen(path: string): Promise<string> {
  const command = `/usr/sbin/screencapture -x ${shellQuote(path)}`;
  try {
    await runShell(command);
  } catch (e: any) {
    console.error(`[desktop] captureScreen("${path}") failed: ${String(e?.message || e).trim()}`);
    throw e;
  }
  if (!existsSync(path)) {
    throw new Error(`[desktop] captureScreen: ${path} not produced (TCC denied?)`);
  }
  return path;
}

/**
 * Paste <text> into the frontmost text field via peekaboo's clipboard route
 * (sets clipboard, sends Cmd+V, restores prior clipboard). Works without
 * Accessibility-typing permission because Cmd+V uses the Cocoa text input
 * path, not synthetic keystrokes. The clipboard is restored after — this
 * is non-destructive to whatever the user had copied.
 */
export async function pasteText(text: string): Promise<void> {
  await runPeekaboo(["paste", text]);
}

/**
 * Press a single named key via peekaboo. Common: "return", "escape", "tab",
 * "space", "delete", arrow keys (up/down/left/right), function keys (f1-f12).
 * Use pasteText for actual text content — press is for navigation/control.
 */
export async function pressKey(key: string): Promise<void> {
  await runPeekaboo(["press", key]);
}

/**
 * Click at logical screen coordinates (point-based, NOT pixel — Retina
 * displays will scale). Coordinates are global, origin top-left.
 *
 * CAVEAT: WKWebView/Electron apps that don't expose an Accessibility tree
 * receive a synthetic mouse event at the pixel, but the React/JS handler
 * may not fire if the event doesn't match the framework's hit target. For
 * snappy-chat's OpenUI composer in particular, clickAt is FRAGILE — prefer
 * focusApp + pasteText + pressKey for keyboard-driven flows. Reserve
 * clickAt for native AppKit targets (toolbar buttons, native menus).
 */
export async function clickAt(x: number, y: number): Promise<void> {
  await runPeekaboo(["click", "--coords", `${x},${y}`]);
}

/**
 * Enumerate windows. With <appName>, returns just that app's windows;
 * without, returns every window peekaboo can see. Each entry has the
 * CoreGraphics window id (wid), title, logical bounds, and an isMain flag
 * (true when the window is the app's main/key window).
 *
 * The peekaboo CLI shape is `peekaboo list windows --json [--app <name>]
 * --include-details bounds,ids`. We parse defensively because peekaboo's
 * JSON envelope has evolved across versions.
 */
export async function listWindows(appName?: string): Promise<WindowInfo[]> {
  const args = ["list", "windows", "--json", "--include-details", "bounds,ids"];
  if (appName) args.push("--app", appName);
  const { stdout } = await runPeekaboo(args);
  const trimmed = stdout.trim();
  if (!trimmed) return [];
  let parsed: any;
  try {
    parsed = JSON.parse(trimmed);
  } catch (e: any) {
    throw new Error(`[desktop] listWindows: peekaboo JSON parse failed: ${e?.message}\nraw: ${trimmed.slice(0, 400)}`);
  }
  if (parsed?.success === false) {
    const msg = parsed?.error?.message || JSON.stringify(parsed?.error || parsed);
    throw new Error(`[desktop] listWindows: peekaboo error: ${msg}`);
  }
  // Walk possible shapes: {windows: [...]} OR {data: {windows: [...]}} OR
  // [{...}] OR {applications:[{windows:[...]}]} (older builds).
  const candidates: any[] =
    (Array.isArray(parsed) && parsed) ||
    parsed?.windows ||
    parsed?.data?.windows ||
    (Array.isArray(parsed?.applications) ? parsed.applications.flatMap((a: any) => a?.windows || []) : null) ||
    [];
  return candidates.map((w: any): WindowInfo => {
    const b = w?.bounds || w?.frame || {};
    return {
      wid: Number(w?.window_id ?? w?.wid ?? w?.id ?? 0),
      title: String(w?.title ?? w?.window_title ?? ""),
      bounds: {
        x: Number(b?.x ?? b?.origin?.x ?? 0),
        y: Number(b?.y ?? b?.origin?.y ?? 0),
        w: Number(b?.w ?? b?.width ?? b?.size?.width ?? 0),
        h: Number(b?.h ?? b?.height ?? b?.size?.height ?? 0),
      },
      isMain: Boolean(w?.is_main ?? w?.isMain ?? w?.main ?? false),
    };
  });
}

/**
 * Composite helper: optionally focus <appName>, optionally click into a
 * textfield at <clickCoords>, paste <text>, press return. Closes the gap
 * between "I want to drive my own app's composer" and "I have to chain
 * five primitives every time."
 *
 * opts.focusBefore (default: true) — call focusApp(appName) first.
 * opts.clickCoords — if set, clickAt(...) before pasting (use for apps
 *   that don't auto-focus their composer when the window is activated).
 *
 * Pairs well with snappy-chat's OpenUI composer: focus, click into the
 * `__input-wrapper` textarea, paste, return. Verify with captureWindow
 * after — actor ≠ auditor.
 */
export async function typeAndSubmit(
  appName: string,
  text: string,
  opts: { focusBefore?: boolean; clickCoords?: [number, number] } = {},
): Promise<void> {
  const focusBefore = opts.focusBefore !== false;
  if (focusBefore) await focusApp(appName);
  if (opts.clickCoords) {
    await clickAt(opts.clickCoords[0], opts.clickCoords[1]);
    await sleepMs(150);
  }
  await pasteText(text);
  await sleepMs(120);
  await pressKey("return");
}

// --- Vision-driven (Midscene) ---

/** Run a Midscene vision instruction via Terminal routing (required for Screen Recording). */
export function runMidscene(instruction: string): string {
  const runId = Date.now().toString();
  const scriptPath = `/tmp/midscene-cmd-${runId}.sh`;
  const outPath = `/tmp/midscene-out-${runId}.txt`;
  const donePath = `/tmp/midscene-done-${runId}`;

  const escaped = instruction.replace(/"/g, '\\"');
  const script = `#!/bin/bash
source ~/.midscene-env
npx -y @midscene/computer@1 act --prompt "${escaped}" > ${outPath} 2>&1
echo "EXIT_CODE=$?" >> ${outPath}
touch ${donePath}
`;

  execSync(`cat > ${scriptPath} << 'SCRIPT_EOF'\n${script}SCRIPT_EOF`, { encoding: "utf-8" });
  execSync(`chmod +x ${scriptPath}`);
  execSync(`osascript -e 'tell application "Terminal" to do script "${scriptPath}"'`);

  // Poll for completion
  const maxWait = 60_000;
  const start = Date.now();
  while (!existsSync(donePath) && Date.now() - start < maxWait) {
    execSync("sleep 2");
  }

  let output = "";
  if (existsSync(outPath)) {
    output = readFileSync(outPath, "utf-8");
  } else {
    output = "ERROR: Midscene timed out after 60s";
  }

  // Cleanup
  for (const f of [scriptPath, outPath, donePath]) {
    if (existsSync(f)) unlinkSync(f);
  }

  return output.trim();
}

/**
 * Take a full-display screenshot and return the file path.
 *
 * Delegates to captureScreen() so callers don't have two parallel APIs.
 * Kept synchronous-return-shaped for back-compat with existing consumers
 * (whatsappReadChats / whatsappReadChat) by blocking on the async call
 * via execSync glue. New code should call captureScreen() directly.
 */
export function screenshot(): string {
  const runId = Date.now().toString();
  const path = `/tmp/screenshot-${runId}.png`;
  // Inline the screencapture invocation here (sync) so this function stays
  // string-returning. The async captureScreen() is the canonical API; this
  // is the back-compat shim.
  try {
    execFileSync("/usr/sbin/screencapture", ["-x", path]);
  } catch (e: any) {
    console.error(`[desktop] screenshot failed: ${String(e?.message || e).trim()}`);
    throw e;
  }
  if (!existsSync(path)) {
    throw new Error("[desktop] screenshot: screencapture produced no file (TCC denied?)");
  }
  return path;
}

/** Helper: activate WhatsApp.app and wait for it to be frontmost. */
function activateWhatsApp(): void {
  execSync(`osascript -e 'tell application "WhatsApp" to activate'`);
  execSync("sleep 2");
}

/** Helper: run a Midscene query (read-only vision extraction, no clicks). */
function queryMidscene(question: string): string {
  const runId = Date.now().toString();
  const scriptPath = `/tmp/midscene-cmd-${runId}.sh`;
  const outPath = `/tmp/midscene-out-${runId}.txt`;
  const donePath = `/tmp/midscene-done-${runId}`;

  const escaped = question.replace(/"/g, '\\"');
  const script = `#!/bin/bash
source ~/.midscene-env
osascript -e 'tell application "WhatsApp" to activate'
sleep 1
npx -y @midscene/computer@1 query --prompt "${escaped}" > ${outPath} 2>&1
echo "EXIT_CODE=$?" >> ${outPath}
touch ${donePath}
`;

  execSync(`cat > ${scriptPath} << 'SCRIPT_EOF'\n${script}SCRIPT_EOF`, { encoding: "utf-8" });
  execSync(`chmod +x ${scriptPath}`);
  execSync(`osascript -e 'tell application "Terminal" to do script "${scriptPath}"'`);

  const maxWait = 60_000;
  const start = Date.now();
  while (!existsSync(donePath) && Date.now() - start < maxWait) {
    execSync("sleep 2");
  }

  let output = "";
  if (existsSync(outPath)) {
    output = readFileSync(outPath, "utf-8");
  } else {
    output = "ERROR: Midscene query timed out after 60s";
  }

  for (const f of [scriptPath, outPath, donePath]) {
    if (existsSync(f)) unlinkSync(f);
  }

  return output.trim();
}

/**
 * Open WhatsApp.app, screenshot the chat list, extract chat names + unread badges via Midscene.
 * Returns the raw Midscene vision output describing visible chats.
 */
export function whatsappReadChats(): string {
  activateWhatsApp();
  const screenshotPath = screenshot();
  const result = queryMidscene(
    "List every chat visible in the WhatsApp sidebar. For each chat return: contact name, last message preview, timestamp, and whether it has an unread badge (and the count if visible). Return as a structured list."
  );
  return `screenshot: ${screenshotPath}\n\n${result}`;
}

/**
 * Click into a specific WhatsApp chat and extract recent messages.
 * @param contactName - The name of the contact or group to open.
 */
export function whatsappReadChat(contactName: string): string {
  activateWhatsApp();
  // Use Cmd+F to search for the contact, more reliable than vision-clicking the sidebar
  const escaped = contactName.replace(/"/g, '\\"');
  runMidscene(`Click on the search field at the top of WhatsApp, type "${escaped}", then click on the first matching chat result`);
  execSync("sleep 2");
  // Re-activate WhatsApp after Terminal steals focus from runMidscene
  activateWhatsApp();
  const screenshotPath = screenshot();
  const result = queryMidscene(
    `Read the currently open WhatsApp chat. Extract the most recent messages visible on screen. For each message return: sender name, message text, and timestamp. Return as a structured list, most recent last.`
  );
  return `screenshot: ${screenshotPath}\n\n${result}`;
}

/**
 * Open a WhatsApp chat and send a message.
 * @param contactName - The name of the contact or group.
 * @param text - The message text to send.
 */
export function whatsappSendMessage(contactName: string, text: string): string {
  activateWhatsApp();
  // Search and open the chat
  const escapedName = contactName.replace(/"/g, '\\"');
  runMidscene(`Click on the search field at the top of WhatsApp, type "${escapedName}", then click on the first matching chat result`);
  execSync("sleep 2");
  // Re-activate WhatsApp and type the message via AppleScript (more reliable than Midscene typing)
  activateWhatsApp();
  // Click the message input field via Midscene
  runMidscene("Click on the message input field at the bottom of the chat");
  execSync("sleep 1");
  // Re-activate and type via AppleScript keystroke
  execSync(`osascript -e 'tell application "WhatsApp" to activate'`);
  execSync("sleep 0.5");
  // Use clipboard to handle special characters reliably
  const escapedText = text.replace(/\\/g, "\\\\").replace(/"/g, '\\"');
  execSync(`osascript -e 'set the clipboard to "${escapedText}"' -e 'tell application "System Events" to keystroke "v" using command down'`);
  execSync("sleep 0.5");
  // Press Enter to send
  execSync(`osascript -e 'tell application "System Events" to key code 36'`);
  execSync("sleep 1");
  activateWhatsApp();
  const screenshotPath = screenshot();
  return `sent to ${contactName}: "${text}"\nscreenshot: ${screenshotPath}`;
}

// Reserved for env-driven config (no current consumer; preserves the
// snappy-os env() contract pattern used elsewhere in state/lib/).
void env;

// --- CLI ---

if ((() => { try { return import.meta.url === `file://${realpathSync(process.argv[1])}`; } catch { return false; } })()) {
  (async () => {
    const [, , cmd, ...args] = process.argv;

    switch (cmd) {
      case "run": {
        const instruction = args.join(" ");
        if (!instruction) { console.error("Usage: api.ts run <instruction>"); process.exit(1); }
        console.log(runMidscene(instruction));
        break;
      }
      case "screenshot": {
        const path = screenshot();
        console.log(path);
        break;
      }
      case "focus": {
        const app = args.join(" ");
        if (!app) { console.error("Usage: api.ts focus <appName>"); process.exit(1); }
        await focusApp(app);
        console.log(`focused ${app}`);
        break;
      }
      case "capture-window": {
        if (args.length < 2) { console.error("Usage: api.ts capture-window <appName> <path>"); process.exit(1); }
        const path = args[args.length - 1];
        const app = args.slice(0, -1).join(" ");
        const out = await captureWindow(app, path);
        console.log(out);
        break;
      }
      case "capture-screen": {
        const path = args[0];
        if (!path) { console.error("Usage: api.ts capture-screen <path>"); process.exit(1); }
        console.log(await captureScreen(path));
        break;
      }
      case "paste": {
        const text = args.join(" ");
        if (!text) { console.error("Usage: api.ts paste <text>"); process.exit(1); }
        await pasteText(text);
        console.log(`pasted ${text.length} chars`);
        break;
      }
      case "press": {
        const key = args[0];
        if (!key) { console.error("Usage: api.ts press <key>"); process.exit(1); }
        await pressKey(key);
        console.log(`pressed ${key}`);
        break;
      }
      case "click": {
        if (args.length < 2) { console.error("Usage: api.ts click <x> <y>"); process.exit(1); }
        const x = Number(args[0]);
        const y = Number(args[1]);
        await clickAt(x, y);
        console.log(`clicked ${x},${y}`);
        break;
      }
      case "list-windows": {
        const app = args.join(" ") || undefined;
        const windows = await listWindows(app);
        console.log(JSON.stringify(windows, null, 2));
        break;
      }
      case "type-submit": {
        if (args.length < 2) { console.error("Usage: api.ts type-submit <appName> <text>"); process.exit(1); }
        const app = args[0];
        const text = args.slice(1).join(" ");
        await typeAndSubmit(app, text);
        console.log(`submitted to ${app}`);
        break;
      }
      case "wa-chats": {
        console.log(whatsappReadChats());
        break;
      }
      case "wa-read": {
        const contact = args.join(" ");
        if (!contact) { console.error("Usage: api.ts wa-read <contact>"); process.exit(1); }
        console.log(whatsappReadChat(contact));
        break;
      }
      case "wa-send": {
        if (args.length < 2) { console.error("Usage: api.ts wa-send <contact> <message>"); process.exit(1); }
        const contact = args[0];
        const message = args.slice(1).join(" ");
        console.log(whatsappSendMessage(contact, message));
        break;
      }
      default:
        console.log(
          "Usage: npx tsx api.ts [run|screenshot|focus|capture-window|capture-screen|paste|press|click|list-windows|type-submit|wa-chats|wa-read|wa-send] ...",
        );
    }
  })();
}

scripts- helper scripts it can run

prose-only skill - 9 inline code blocks live in SKILL.md above (no state/bin/ sidecar yet).

how we check it- the checks, plus the last 10 runs

rubric shape schema-shape check (no inline rubric)

recent mean 1.00 · 10 runs actor/auditor: unverifiable

deps settings

timestamp	verb	score	primary_issue	artifact
2026-04-29 03:08Z	-	1.00	-	-
2026-04-29 03:05Z	-	1.00	-	-
2026-04-25 04:11Z	-	1.00	-	-
2026-04-21 15:58Z	-	1.00	-	-
2026-04-21 15:56Z	-	1.00	-	-
2026-04-21 03:53Z	-	1.00	-	-
2026-04-29 03:08Z	-	1.00	-	-
2026-04-29 03:05Z	-	1.00	-	-
2026-04-25 04:11Z	-	1.00	-	-
2026-04-21 15:58Z	-	1.00	-	-

desktop

What it does for you

What it produces

How to get it

at a glance- the short version

what's inside - the parts that make up a skill 3/4 present

how it's graded - what counts as a good run 5 criteria · 5 deterministic

how it runs - the shared frame every skill uses 5/5 present

what it has learned - fixes written back in over time sample

how the work flows- who makes it, who checks it

SKILL.md- the skill, written out in plain English

desktop

Steps

Library API

Why peekaboo

Bridge routing

focusApp(appName: string): Promise<void>

captureWindow(appName: string, path: string): Promise<string>

captureScreen(path: string): Promise<string>

pasteText(text: string): Promise<void>

pressKey(key: string): Promise<void>

clickAt(x: number, y: number): Promise<void>

listWindows(appName?: string): Promise<WindowInfo[]>

typeAndSubmit(appName, text, opts?): Promise<void>

runMidscene(instruction: string): string (vision)

screenshot(): string (back-compat shim)

Eval

Gotchas

Graduation

Rubric

AGENTS.md- what the AI loads when this skill comes up

desktop - loader

Critical Rules

Commands

Working primitives

OpenUI Resource

Known Pitfalls

Self-Test

Self-report

Self-correcting loader (PID feedback)

Before you finish, do two things:

api.ts- the code it can call

scripts- helper scripts it can run

how we check it- the checks, plus the last 10 runs

`focusApp(appName: string): Promise<void>`

`captureWindow(appName: string, path: string): Promise<string>`

`captureScreen(path: string): Promise<string>`

`pasteText(text: string): Promise<void>`

`pressKey(key: string): Promise<void>`

`clickAt(x: number, y: number): Promise<void>`

`listWindows(appName?: string): Promise<WindowInfo[]>`

`typeAndSubmit(appName, text, opts?): Promise<void>`

`runMidscene(instruction: string): string` (vision)

`screenshot(): string` (back-compat shim)