Exported functions in state/lib/desktop.ts. .md file to compare - side-by-side diff against desktop
desktop
description: "Triggers on prompt mention of 'desktop'."
What it does for you
Lets your assistant operate Mac apps for you by seeing the screen.
What it produces
A recent result, so you can see the kind of work it returns.
loading…
How to get it
These run inside the Snappy workspace. Want this working in your business? I set skills like this up with you, in one focused week.
For developers how this skill is built, graded, and how it runs
at a glance- the short version
what's inside - the parts that make up a skill 3/4 present
A skill is just a few plain-text files. Only the main one is required. The rest are optional, added as the work needs them. This is what the skill is made of; how it runs is just below.
state/skills/desktop/SKILL.md
present
state/lib/desktop.ts
present
state/bin/desktop/
not present
state/skills/desktop/AGENTS.md
present
how it's graded - what counts as a good run 5 criteria · 5 deterministic
Each row is one thing a good run has to get right. deterministic means a quick check decides, pass or fail. judge means the AI reads the result and rates it. Grading each piece on its own (instead of one overall score) shows exactly where a run fell short, so the fix is obvious.
how it runs - the shared frame every skill uses 5/5 present
Every skill runs the same way. One part does the work, a separate part checks it, and a short loader hands the AI exactly what it needs for the job. Anything this skill doesn't use shows a one-line note saying why, on purpose, not by accident.
state/log/pending-eval.ndjson what it has learned - fixes written back in over time sample
When a run hits something this skill didn't handle, the fix gets written back into the skill so it doesn't happen again. FIXED means it was corrected on the spot. LOGGED means it's queued for a bigger rewrite. Either way, the skill gets a little better and never makes the same mistake twice.
- Loading feedback rows…
how the work flows- who makes it, who checks it
import from `state/lib/desktop.ts` — deterministic: `focusApp()`, `captureWindow()`, `captureScreen()`, `pasteText()`, `pressKey()`, `clickAt()`, `listWindows()`, `typeAndSubmit()`. Vision: `runMidscene()`, `screenshot()`, `whatsappReadChats()`, `whatsappReadChat()`, `whatsappSendMessage()`.
captureWindow(appName, path)` immediately after any keyboard or click action — actor ≠ auditor.
SKILL.md- the skill, written out in plain English
desktop
macOS desktop automation. Two layers in one skill:
- Deterministic peekaboo wrappers -
focusApp,captureWindow,
captureScreen, pasteText, pressKey, clickAt, listWindows, typeAndSubmit. Cheap, reliable. Use when you know the target.
- Vision-driven Midscene -
runMidscene,whatsapp*. Expensive,
non-deterministic. Use when the target needs to be located on screen.
Ported from kernel snappy-desktop in Phase 0.5. Extended 2026-04-28 with the deterministic primitives so future agents stop rediscovering shell incantations every session. See state/lib/desktop.ts for the full API surface.
Steps
focusApp(appName)- seestate/lib/desktop.tscaptureWindow(appName, path)- seestate/lib/desktop.tscaptureScreen(path)- seestate/lib/desktop.tspasteText(text)- seestate/lib/desktop.tspressKey(key)- seestate/lib/desktop.tsclickAt(x, y)- seestate/lib/desktop.tslistWindows(appName?)- seestate/lib/desktop.tstypeAndSubmit(appName, text, opts?)- seestate/lib/desktop.tsrunMidscene(instruction)- seestate/lib/desktop.tsscreenshot()- seestate/lib/desktop.tswhatsappReadChats()- seestate/lib/desktop.tswhatsappReadChat(contact)- seestate/lib/desktop.tswhatsappSendMessage(contact, text)- seestate/lib/desktop.ts
Library API
Why peekaboo
macOS TCC gates raw primitives (screencapture -x, osascript keystroke, cliclick) per-process-ancestor. Every shell-chain that wants to drive the GUI ends up rediscovering which permission is missing. peekaboo holds its own Screen Recording + Accessibility grants, so the same call works whether dispatched from a login shell, an SSH session, or a headless subprocess. That is why the deterministic helpers below exist - to expose the same primitive surface Charlotte's mac_mini_exec exposes, but as plain CLI/lib calls (snappy-os does NOT do MCP).
Install: brew install peekaboo. The lib auto-discovers /opt/homebrew/bin/peekaboo, the matching Cellar version, and the PATH which result in that order; throws a clear error if missing.
Bridge routing
Every shell call in this skill (runPeekaboo, focusApp, captureScreen) is dispatched through state/lib/bridge.ts when the openclaw helper daemon at 127.0.0.1:18790 is reachable. The bridge holds the macOS TCC grants that Claude Code's own child-process tree does not - routing through it turns "screencapture produced a zero-byte PNG" and "peekaboo returned PERMISSION_ERROR_SCREEN_RECORDING" into actual successful captures. See the bridge skill for the relay contract.
The probe is cached for 60s inside state/lib/bridge.ts, so this is one round-trip per minute, not per call. When the daemon is down, runShell() falls back to /bin/sh -c <command> directly - the skill remains usable on machines without the bridge, but TCC failures will surface with their normal symptoms (zero-byte PNG, permission error, keystroke no-op). If you hit one, the fix is to restart the daemon (~/.openclaw/helper/server.js), not to chase the grant up the local shell-chain. If neither bridge nor local TCC works, every helper throws on the missing artifact (e.g. captureWindow checks existsSync(path) after the call returns).
focusApp(appName: string): Promise<void>
Activates the app via AppleScript and sleeps 400ms so the WindowServer can settle. Always call this before keyboard-driven flows - peekaboo's paste and press target the frontmost app, so an unfocused window swallows input silently.
import { focusApp, pasteText, pressKey } from "state/lib/desktop.ts";
await focusApp("SnappyChat");
await pasteText("hello");
await pressKey("return");
captureWindow(appName: string, path: string): Promise<string>
Window-scoped screenshot via peekaboo image --mode window --app <appName> --path <path>. Crops out menu bar/dock automatically; doesn't depend on which space is frontmost. Throws if the file isn't produced. Returns the resolved path.
const out = await captureWindow("SnappyChat", "/tmp/sc.png");
// out === "/tmp/sc.png"
captureScreen(path: string): Promise<string>
Full-display screenshot via /usr/sbin/screencapture -x. Use when you need the whole screen (multi-window verification) or peekaboo isn't available. Returns the resolved path.
await captureScreen("/tmp/full.png");
pasteText(text: string): Promise<void>
Pastes via peekaboo's clipboard route: sets clipboard, sends Cmd+V, restores prior clipboard. Works without Accessibility-typing permission because Cmd+V uses the Cocoa text-input path, not synthetic keystrokes. The clipboard is restored after - non-destructive to whatever the user had copied.
await focusApp("SnappyChat");
await pasteText("open the agent panel");
pressKey(key: string): Promise<void>
Single named key via peekaboo. Common keys: return, escape, tab, space, delete, arrow keys (up/down/left/right), function keys (f1-f12). Use pasteText for actual content; pressKey is for navigation/control.
await pressKey("escape"); // dismiss a modal
clickAt(x: number, y: number): Promise<void>
Logical (point-based, NOT pixel) screen coordinates. Origin top-left, global. Caveat: WKWebView/Electron apps that don't expose an Accessibility tree receive a synthetic mouse event but the React/JS handler may not fire if the event doesn't match the framework's hit target. For snappy-chat's OpenUI composer in particular, clickAt is fragile - prefer focusApp + pasteText + pressKey. Reserve clickAt for native AppKit targets (toolbar buttons, native menus).
await clickAt(640, 480);
listWindows(appName?: string): Promise<WindowInfo[]>
Enumerates windows. With appName, returns just that app's windows; without, every window peekaboo can see. Each entry: wid (CoreGraphics window id), title, bounds {x, y, w, h} (logical coords), isMain. JSON parse is defensive across peekaboo versions.
const windows = await listWindows("SnappyChat");
const main = windows.find((w) => w.isMain);
typeAndSubmit(appName, text, opts?): Promise<void>
Composite helper. Optionally focuses the app, optionally clicks into a textfield, pastes the text, presses return. Closes the gap between "I want to drive my own app's composer" and "I have to chain five primitives every time."
opts.focusBefore (default: true) calls focusApp first. opts.clickCoords: [x, y] clicks before pasting (for apps that don't auto-focus their composer when activated).
await typeAndSubmit("SnappyChat", "list scheduled jobs");
// or with explicit composer click:
await typeAndSubmit("SnappyChat", "ship it", { clickCoords: [800, 950] });
runMidscene(instruction: string): string (vision)
Routes through Terminal.app for Screen Recording TCC. Expensive (non-deterministic vision call). Use only when the target must be located on screen - for known targets, prefer the deterministic primitives above.
screenshot(): string (back-compat shim)
Synchronous string-returning wrapper that delegates to screencapture -x. Kept for the existing WhatsApp helpers; new code should call captureScreen() (async, takes a path) directly.
Eval
Actor: the exported functions in state/lib/desktop.ts. Auditor: none wired yet - eval is manual (Robert review). File a state/log/pending-eval.ndjson row on each run.
Score convention:
| Outcome | Score |
|---|---|
| Pass on first try | 1.0 |
| Failed first, auto-fix applied, re-check passed | 0.5 |
| Still failing or unrecoverable | 0.0 |
Gotchas
via the Phase 0.5 driver. Only these rewrites were applied: already in state/lib/)
realpathSync(process.argv[1])CLI guard wrapped in try/catch
- See the kernel SKILL.md for the original long-form guidance if you need it
(read-only reference at the kernel path above).
Graduation
This skill is prose. Graduate by defining a deterministic auditor and flipping eval: auto.
Rubric
criteria:
- name: calls_deterministically
kind: deterministic
check: "Skill execution does not introduce new 'import' declarations outside of 'state/lib/desktop.ts'."
- name: midscene_vision_used
kind: deterministic
check: "The 'runMidscene()' function is called with the 'runMidscene_input' from the skill's inputs."
- name: screenshot_taken
kind: deterministic
check: "The 'screenshot()' function is called with the 'screenshot_input' from the skill's inputs."
- name: whatsapp_chats_read
kind: deterministic
check: "The 'whatsappReadChats()' function is called with the 'whatsappReadChats_input' from the skill's inputs."
- name: eval_log_recorded
kind: deterministic
check: "A new row is appended to 'state/log/pending-eval.ndjson' for the execution."AGENTS.md- what the AI loads when this skill comes up
desktop - loader
Per-turn rules for the desktop skill. macOS automation in two layers: deterministic peekaboo wrappers and vision-driven Midscene. Full reference: state/skills/desktop/SKILL.md. Library: state/lib/desktop.ts. Routed through state/lib/bridge.ts (openclaw helper at 127.0.0.1:18790).
Critical Rules
- NEVER fan out parallel agent-browser writes on a shared session - DOM mutations silently drop. Serialize uploads; parallelize generation only.
- Midscene is vision-driven, expensive, non-deterministic - verify with
captureWindowafter every action; never trust the success return alone. - Prefer deterministic peekaboo (
focusApp,captureWindow,pasteText,pressKey,typeAndSubmit) over Midscene whenever the target is known. Reserve vision for "I don't know where it is on screen." clickAtis fragile on WKWebView/Electron (snappy-chat OpenUI composer included) - synthetic mouse events miss React hit targets. UsefocusApp+pasteText+pressKeyfor keyboard flows instead.- Helpers route through the bridge daemon. TCC errors (zero-byte screencapture,
PERMISSION_ERROR_SCREEN_RECORDING, silent keystroke failure) usually mean the daemon is down - restart withlaunchctl kickstart -k gui/$(id -u)/ai.openclaw.helper(NOT~/.openclaw/helper/server.jsdirectly) and verify withbridgeAvailable(). - App name is "Snappy Chat" WITH A SPACE.
peekaboo screenshot --app "Snappy Chat"is correct. "SnappyChat" (no space) returns "App not found." Always quote the name. - Window index matters. SnappyChat has a helper window (500x500, index 0) AND the real chat window (1280x900, index 1). Always pass
--window-index 1or--window-id <real-id>when capturing:peekaboo screenshot --app "Snappy Chat" --window-index 1 --output /tmp/out.png. Capturing index 0 looks like a blank square. - No raw osascript System Events. Do NOT use
osascript -e 'tell application "System Events" to keystroke ...'- fails with error 1719 (TCC). Use desktop.ts CLI:npx tsx state/lib/desktop.ts press <key>orpasteText(). Keyboard chords (Cmd+N, Cmd+K) need peekaboo'shotkeysubcommand, notpressKey. - Chord keys (Cmd+N, Cmd+K, etc.):
desktop.ts pressKey()only sends single named keys. For chords use:peekaboo hotkey --keys "cmd,n"(comma-separated, no spaces). Example via bridge:bridgeExec("/opt/homebrew/bin/peekaboo hotkey --keys \\"cmd,n\\""). - Bridge 500 = restart the daemon, not retry. HTTP 500 from openclaw means the daemon is in a bad state. Fix:
launchctl kickstart -k gui/$(id -u)/ai.openclaw.helper. Do NOT curl in a loop.
Commands
| ui dashboard | state/skills/desktop/resources/ui.openui | |invoke: import from state/lib/desktop.ts - deterministic: focusApp(), captureWindow(), captureScreen(), pasteText(), pressKey(), clickAt(), listWindows(), typeAndSubmit(). Vision: runMidscene(), screenshot(), whatsappReadChats(), whatsappReadChat(), whatsappSendMessage(). |capture snappy-chat (correct): peekaboo screenshot --app "Snappy Chat" --window-index 1 --output /tmp/out.png |chord keypress: peekaboo hotkey --keys "cmd,n" (or "cmd,k" etc.) - single named keys only for pressKey |restart bridge daemon: launchctl kickstart -k gui/$(id -u)/ai.openclaw.helper |list windows with ids: peekaboo list windows --app "Snappy Chat" (run immediately before using --window-id) |capture Claude Desktop: peekaboo screenshot --app "Claude" --window-index 0 --output /tmp/claude-ref.png |verify: captureWindow(appName, path) immediately after any keyboard or click action - actor ≠ auditor. |install: brew install peekaboo (auto-discovered at /opt/homebrew/bin/peekaboo and Cellar fallback). Lib throws clear error if missing. |eval log: state/log/pending-eval.ndjson (manual eval - skill: "desktop"). Run-this is often skipped because Midscene takes over real mouse/keyboard; that's expected - frontmatter-shape ok confirmations still land.
Working primitives
focusApp(appName)- AppleScript activate + 400ms settle. Always first for keyboard flows.captureWindow(appName, path)-peekaboo image --mode window. Window-scoped; crops menu bar/dock; space-independent.captureScreen(path)-/usr/sbin/screencapture -x. Whole display.pasteText(text)-peekaboo paste(clipboard set + Cmd+V + restore). Survives without Accessibility-typing because Cmd+V is Cocoa text-input, not synthetic keystrokes.pressKey(key)-peekaboo press. Common:return,escape,tab,space,delete, arrows,f1-f12. UsepasteTextfor content.clickAt(x, y)- logical (NOT pixel) coords. Native AppKit only - fragile on WKWebView.listWindows(appName?)-peekaboo list windows --json. Returns{wid, title, bounds, isMain}[].typeAndSubmit(appName, text, opts?)- composite: focus + (optional click) + paste + return. The convenience helper for "drive my own app's composer."- Vision:
runMidscene(prompt),screenshot(path),whatsappReadChats(),whatsappReadChat(name),whatsappSendMessage(name, body).
OpenUI Resource
- Skill-owned OpenUI Lang resource:
state/skills/desktop/resources/ui.openui. Read it before rendering or editing this skill's generated component surface. - Treat this resource as a first-class artifact of the skill, not a generic chat response. Improve it when the skill's user-facing output needs to become richer.
- System resources compose OpenUI primitives and inherit SnappyChat tokens. Use
ui_contract: brandedin SKILL.md only for deliberate platform or client visuals.
Known Pitfalls
- WhatsApp send is a side-effect - gate behind explicit
apply: trueper program.md scope-only default. - Midscene is the actor; the screenshot read is the auditor. Never trust the Midscene success return alone - actor ≠ auditor.
PERMISSION_ERROR_SCREEN_RECORDINGmeans the calling process tree lacks the TCC grant. Fix: install peekaboo via brew and grant Screen Recording + Accessibility to the peekaboo binary in System Settings - don't chase the grant up the SSH/MCP shell chain.- "App not found" from peekaboo: you used "SnappyChat" instead of "Snappy Chat". Always use the space.
- Helper window vs real window: window index 0 is a tiny 500x500 helper. Always use
--window-index 1for the main chat window. - osascript System Events error 1719: TCC keystroke block. Never use System Events for keystrokes. Use peekaboo hotkey or desktop.ts pasteText.
- peekaboo
hotkeyvspressKey:pressKey(desktop.ts wrapper) sends one key. For chords (Cmd+N) usepeekaboo hotkey --keys "cmd,n"directly (or via bridgeExec).
Self-Test
An agent reading this should correctly:
- [ ] Refuse to launch parallel browser-mutation agents on one session
- [ ] Verify Midscene actions with a fresh screenshot, not just the return value
- [ ] Default
whatsappSendMessageto scope-only - [ ] Reach for
focusApp+pasteText+pressKey(notclickAt) to drive a WKWebView app's text composer - [ ] Use
captureWindow(window-scoped) overcaptureScreen(full display) when the target app is known
Self-report
If this loader fell short, append a line:
echo "[$(date -u +%FT%TZ)] desktop: <what was missing>" >> state/log/loader-feedback.log
<!-- kernel-ok: historical attribution for the PID-loader pattern. -->
Self-correcting loader (PID feedback)
This is the rule that makes loaders learn. Ported from snappy-kernel, where it demonstrably worked: the snappy-dom-cartographer skill mapped Skool admin UI once, every later ship-agent inherited the full map for free.
Before you finish, do two things:
1. Fix gaps (P - proportional). MANDATORY. If this loader didn't cover your case - if you had to read another file, run an undocumented command, work around a wrong selector, or discover a quirk - you MUST attempt an Edit to this AGENTS.md before you log.
- Default action: edit this file. Surgical. Add the missing line, table
row, or code example where it belongs. No restructuring.
- Only
LOGGEDis allowed when: the fix needs >10 lines, spans multiple
files, or requires a structural rewrite. In that case the state/regen/drain.sh queue picks it up asynchronously.
- "I didn't have time" / "it's minor" / "the next agent will figure it out"
are NOT valid reasons. Minor gaps compound - that's the whole point of this loop.
- Concrete example of a valid P-fix: discover
agent-browser screenshot
takes path positionally not as --output <path>. One line in a table. Edit it in. Do not just log it.
- The goal: the next agent never has to leave the loader.
2. Log the result.
echo "[$(date -u +%FT%TZ)] desktop: <what was missing or fixed> [FIXED|LOGGED] action_kind=<kind>" >> state/log/loader-feedback.log
<slug>MUST be the literal folder name of this loader
(state/skills/<slug>/AGENTS.md). The class token between [ts] and : is the producer slug, the writeback class, AND the grade class - they must be equal so state/lib/controller-tune.ts can pair the brief.
FIXED= you patched this loader inline (P-fix).LOGGED= too large for inline; the PostToolUse enqueue + Stop-hook drain
will rewrite the loader from scratch on next session-end.
action_kindis the SECOND pairing predicate (added 2026-04-27, task #327).
Pick the value that describes what you actually did - same slug, different action_kind means the writeback satisfies a different brief layer:
shape-ok- only frontmatter-shape verification passed (rare from
a human; usually emitted by the lint, not a loader echo)
skill-ran- the skill ran end-to-end and an eval row landed
in state/log/evals.ndjson
loader-rewritten- you EDITED this AGENTS.md inline (the FIXED case),
OR the regen drain rewrote it
pattern-elevated- you promoted a recurring failure to a Critical Rule
(rule fix or new-skill scaffold) If you LOGGED (couldn't fix inline), omit action_kind - the inferrer will pick it up from your body keywords.
Do not skip this. Every agent run must leave the system better than it found it. The loader is the setpoint; you are the sensor; the gap is the error signal; closing the gap is the correction.
api.ts- the code it can call
#!/usr/bin/env npx tsx
/**
* snappy-desktop/api.ts -- macOS desktop automation primitives.
*
* Two layers in this file:
* 1. Vision-driven (expensive, non-deterministic): runMidscene + the
* WhatsApp helpers. Use when you need to "find a thing on screen."
* 2. Deterministic peekaboo wrappers (cheap, reliable): focusApp,
* captureWindow, captureScreen, pasteText, pressKey, clickAt,
* listWindows, typeAndSubmit. Use these to drive a known target
* (snappy-chat composer, etc.) without burning a vision call.
*
* Why peekaboo: macOS TCC (Screen Recording + Accessibility) gates the
* raw primitives (`screencapture -x`, `osascript keystroke`, `cliclick`)
* differently across shell-chains. peekaboo holds its own TCC grants,
* so the same call works whether dispatched from a login shell, an SSH
* session, or an MCP child process. snappy-os does NOT do MCP — the
* point is to expose the same primitive surface Charlotte's
* `mac_mini_exec` exposes, but as plain CLI/lib calls.
*
* Usage:
* npx tsx api.ts run "click the Settings icon"
* npx tsx api.ts screenshot
* npx tsx api.ts focus SnappyChat
* npx tsx api.ts capture-window SnappyChat /tmp/sc.png
* npx tsx api.ts paste "hello world"
* npx tsx api.ts press return
* npx tsx api.ts click 640 480
* npx tsx api.ts list-windows SnappyChat
* npx tsx api.ts type-submit SnappyChat "open the agent panel"
*
* Or import as module:
* import { focusApp, captureWindow, pasteText, pressKey, clickAt,
* listWindows, typeAndSubmit, runMidscene, screenshot } from "./desktop.ts";
*/
import { execFile, execFileSync, execSync } from "child_process";
import { existsSync, readFileSync, readdirSync, realpathSync, unlinkSync } from "fs";
import { promisify } from "util";
import { env } from "./env.ts";
import { bridgeAvailable, bridgeExec } from "./bridge.ts";
const execFileP = promisify(execFile);
// --- shell-quote (single-quote with embedded escape) ---
//
// Used when assembling commands for bridgeExec. The bridge daemon spawns the
// command via a shell, so any arg with whitespace or shell metacharacters
// needs quoting. Single-quote-and-escape is the simple form: 'abc' is
// literal, and 'it'\''s' embeds a single quote. We don't take a dependency
// on the shell-quote npm package — eight lines of inline code suffice.
function shellQuote(s: string): string {
if (s === "") return "''";
if (/^[A-Za-z0-9_./:=@%+,-]+$/.test(s)) return s;
return `'${s.replace(/'/g, `'\\''`)}'`;
}
// --- bridge routing ---
//
// When the openclaw helper daemon is available, route shell commands through
// it so they inherit its TCC grants (Screen Recording, Accessibility). When
// it is NOT available, fall back to direct execFile so the skill remains
// usable on machines without the bridge -- the local call may still succeed
// if the agent's own process tree happens to hold the relevant grants.
//
// The probe is cached inside bridge.ts (60s TTL), so this is one round-trip
// per minute, not per call.
async function runShell(command: string, opts: { timeout?: number } = {}): Promise<{ stdout: string; stderr: string }> {
if (await bridgeAvailable()) {
const r = await bridgeExec(command, opts);
if (!r.ok) {
throw new Error(`[desktop] bridge exec failed: ${r.error ?? "unknown"}${r.stderr ? ` -- ${r.stderr}` : ""}`);
}
return { stdout: r.output ?? "", stderr: "" };
}
// Local fallback: spawn /bin/sh -c <command>. Same shape as the bridge,
// different process tree. Will fail with TCC errors when those bite.
const { stdout, stderr } = await execFileP("/bin/sh", ["-c", command], { maxBuffer: 16 * 1024 * 1024 });
return { stdout, stderr };
}
// --- peekaboo discovery (cached at module load) ---
/** Window descriptor returned by listWindows(). */
export interface WindowInfo {
wid: number;
title: string;
bounds: { x: number; y: number; w: number; h: number };
isMain: boolean;
}
let _peekabooPath: string | null | undefined;
function findPeekaboo(): string | null {
if (_peekabooPath !== undefined) return _peekabooPath;
// Preferred symlink, then any Cellar version.
const candidates = ["/opt/homebrew/bin/peekaboo", "/usr/local/bin/peekaboo"];
for (const c of candidates) {
if (existsSync(c)) {
_peekabooPath = c;
return c;
}
}
// Fallback: glob the Cellar.
for (const cellar of ["/opt/homebrew/Cellar/peekaboo", "/usr/local/Cellar/peekaboo"]) {
if (existsSync(cellar)) {
try {
const versions = readdirSync(cellar);
for (const v of versions) {
const p = `${cellar}/${v}/bin/peekaboo`;
if (existsSync(p)) {
_peekabooPath = p;
return p;
}
}
} catch { /* ignore */ }
}
}
// Last-ditch: PATH lookup via `which`.
try {
const out = execFileSync("/usr/bin/which", ["peekaboo"], { encoding: "utf-8" }).trim();
if (out && existsSync(out)) {
_peekabooPath = out;
return out;
}
} catch { /* ignore */ }
_peekabooPath = null;
return null;
}
function requirePeekaboo(): string {
const p = findPeekaboo();
if (!p) {
throw new Error(
"peekaboo not found. Install with `brew install peekaboo` " +
"(see https://github.com/steipete/peekaboo). The desktop skill " +
"uses peekaboo as its canonical TCC-friendly automation primitive.",
);
}
return p;
}
async function runPeekaboo(args: string[]): Promise<{ stdout: string; stderr: string }> {
const bin = requirePeekaboo();
// Compose a single shell-safe command so the bridge can run it through its
// shell (the daemon's /exec endpoint takes one string, not an argv array).
// Local fallback path also goes through /bin/sh -c via runShell.
const command = [bin, ...args.map(shellQuote)].join(" ");
try {
return await runShell(command);
} catch (e: any) {
const code = typeof e?.code === "number" ? e.code : "?";
console.error(
`[desktop] peekaboo failed (exit=${code}): ${bin} ${args.map((a) => JSON.stringify(a)).join(" ")}`,
);
if (e?.stderr) console.error(`[desktop] peekaboo stderr: ${String(e.stderr).trim()}`);
throw e;
}
}
function sleepMs(ms: number): Promise<void> {
return new Promise((r) => setTimeout(r, ms));
}
// --- Deterministic primitives (peekaboo + screencapture) ---
/**
* Bring an app to the foreground via AppleScript activate. Sleeps 400ms
* after to give the WindowServer time to settle before further actions.
*/
export async function focusApp(appName: string): Promise<void> {
const escaped = appName.replace(/"/g, '\\"');
const command = `/usr/bin/osascript -e ${shellQuote(`tell application "${escaped}" to activate`)}`;
try {
await runShell(command);
} catch (e: any) {
console.error(`[desktop] focusApp("${appName}") failed: ${String(e?.message || e).trim()}`);
throw e;
}
await sleepMs(400);
}
/**
* Capture a window-scoped screenshot of <appName> to <path> via peekaboo.
* Returns the resolved path on success, throws on failure (peekaboo missing,
* app not running, or capture errored). Window mode is preferred over
* full-display because it crops out menu bar/dock and doesn't depend on
* which space is frontmost.
*/
export async function captureWindow(appName: string, path: string): Promise<string> {
const args = ["image", "--mode", "window", "--app", appName, "--path", path];
const { stderr } = await runPeekaboo(args);
if (!existsSync(path)) {
throw new Error(
`[desktop] captureWindow: peekaboo reported success but ${path} is missing. stderr: ${stderr.trim()}`,
);
}
return path;
}
/**
* Full-display screenshot via /usr/sbin/screencapture -x. Reliable when run
* from a process whose ancestor chain holds Screen Recording TCC; falls back
* useful when peekaboo isn't installed but the parent shell has the grant.
* Returns the resolved path.
*/
export async function captureScreen(path: string): Promise<string> {
const command = `/usr/sbin/screencapture -x ${shellQuote(path)}`;
try {
await runShell(command);
} catch (e: any) {
console.error(`[desktop] captureScreen("${path}") failed: ${String(e?.message || e).trim()}`);
throw e;
}
if (!existsSync(path)) {
throw new Error(`[desktop] captureScreen: ${path} not produced (TCC denied?)`);
}
return path;
}
/**
* Paste <text> into the frontmost text field via peekaboo's clipboard route
* (sets clipboard, sends Cmd+V, restores prior clipboard). Works without
* Accessibility-typing permission because Cmd+V uses the Cocoa text input
* path, not synthetic keystrokes. The clipboard is restored after — this
* is non-destructive to whatever the user had copied.
*/
export async function pasteText(text: string): Promise<void> {
await runPeekaboo(["paste", text]);
}
/**
* Press a single named key via peekaboo. Common: "return", "escape", "tab",
* "space", "delete", arrow keys (up/down/left/right), function keys (f1-f12).
* Use pasteText for actual text content — press is for navigation/control.
*/
export async function pressKey(key: string): Promise<void> {
await runPeekaboo(["press", key]);
}
/**
* Click at logical screen coordinates (point-based, NOT pixel — Retina
* displays will scale). Coordinates are global, origin top-left.
*
* CAVEAT: WKWebView/Electron apps that don't expose an Accessibility tree
* receive a synthetic mouse event at the pixel, but the React/JS handler
* may not fire if the event doesn't match the framework's hit target. For
* snappy-chat's OpenUI composer in particular, clickAt is FRAGILE — prefer
* focusApp + pasteText + pressKey for keyboard-driven flows. Reserve
* clickAt for native AppKit targets (toolbar buttons, native menus).
*/
export async function clickAt(x: number, y: number): Promise<void> {
await runPeekaboo(["click", "--coords", `${x},${y}`]);
}
/**
* Enumerate windows. With <appName>, returns just that app's windows;
* without, returns every window peekaboo can see. Each entry has the
* CoreGraphics window id (wid), title, logical bounds, and an isMain flag
* (true when the window is the app's main/key window).
*
* The peekaboo CLI shape is `peekaboo list windows --json [--app <name>]
* --include-details bounds,ids`. We parse defensively because peekaboo's
* JSON envelope has evolved across versions.
*/
export async function listWindows(appName?: string): Promise<WindowInfo[]> {
const args = ["list", "windows", "--json", "--include-details", "bounds,ids"];
if (appName) args.push("--app", appName);
const { stdout } = await runPeekaboo(args);
const trimmed = stdout.trim();
if (!trimmed) return [];
let parsed: any;
try {
parsed = JSON.parse(trimmed);
} catch (e: any) {
throw new Error(`[desktop] listWindows: peekaboo JSON parse failed: ${e?.message}\nraw: ${trimmed.slice(0, 400)}`);
}
if (parsed?.success === false) {
const msg = parsed?.error?.message || JSON.stringify(parsed?.error || parsed);
throw new Error(`[desktop] listWindows: peekaboo error: ${msg}`);
}
// Walk possible shapes: {windows: [...]} OR {data: {windows: [...]}} OR
// [{...}] OR {applications:[{windows:[...]}]} (older builds).
const candidates: any[] =
(Array.isArray(parsed) && parsed) ||
parsed?.windows ||
parsed?.data?.windows ||
(Array.isArray(parsed?.applications) ? parsed.applications.flatMap((a: any) => a?.windows || []) : null) ||
[];
return candidates.map((w: any): WindowInfo => {
const b = w?.bounds || w?.frame || {};
return {
wid: Number(w?.window_id ?? w?.wid ?? w?.id ?? 0),
title: String(w?.title ?? w?.window_title ?? ""),
bounds: {
x: Number(b?.x ?? b?.origin?.x ?? 0),
y: Number(b?.y ?? b?.origin?.y ?? 0),
w: Number(b?.w ?? b?.width ?? b?.size?.width ?? 0),
h: Number(b?.h ?? b?.height ?? b?.size?.height ?? 0),
},
isMain: Boolean(w?.is_main ?? w?.isMain ?? w?.main ?? false),
};
});
}
/**
* Composite helper: optionally focus <appName>, optionally click into a
* textfield at <clickCoords>, paste <text>, press return. Closes the gap
* between "I want to drive my own app's composer" and "I have to chain
* five primitives every time."
*
* opts.focusBefore (default: true) — call focusApp(appName) first.
* opts.clickCoords — if set, clickAt(...) before pasting (use for apps
* that don't auto-focus their composer when the window is activated).
*
* Pairs well with snappy-chat's OpenUI composer: focus, click into the
* `__input-wrapper` textarea, paste, return. Verify with captureWindow
* after — actor ≠ auditor.
*/
export async function typeAndSubmit(
appName: string,
text: string,
opts: { focusBefore?: boolean; clickCoords?: [number, number] } = {},
): Promise<void> {
const focusBefore = opts.focusBefore !== false;
if (focusBefore) await focusApp(appName);
if (opts.clickCoords) {
await clickAt(opts.clickCoords[0], opts.clickCoords[1]);
await sleepMs(150);
}
await pasteText(text);
await sleepMs(120);
await pressKey("return");
}
// --- Vision-driven (Midscene) ---
/** Run a Midscene vision instruction via Terminal routing (required for Screen Recording). */
export function runMidscene(instruction: string): string {
const runId = Date.now().toString();
const scriptPath = `/tmp/midscene-cmd-${runId}.sh`;
const outPath = `/tmp/midscene-out-${runId}.txt`;
const donePath = `/tmp/midscene-done-${runId}`;
const escaped = instruction.replace(/"/g, '\\"');
const script = `#!/bin/bash
source ~/.midscene-env
npx -y @midscene/computer@1 act --prompt "${escaped}" > ${outPath} 2>&1
echo "EXIT_CODE=$?" >> ${outPath}
touch ${donePath}
`;
execSync(`cat > ${scriptPath} << 'SCRIPT_EOF'\n${script}SCRIPT_EOF`, { encoding: "utf-8" });
execSync(`chmod +x ${scriptPath}`);
execSync(`osascript -e 'tell application "Terminal" to do script "${scriptPath}"'`);
// Poll for completion
const maxWait = 60_000;
const start = Date.now();
while (!existsSync(donePath) && Date.now() - start < maxWait) {
execSync("sleep 2");
}
let output = "";
if (existsSync(outPath)) {
output = readFileSync(outPath, "utf-8");
} else {
output = "ERROR: Midscene timed out after 60s";
}
// Cleanup
for (const f of [scriptPath, outPath, donePath]) {
if (existsSync(f)) unlinkSync(f);
}
return output.trim();
}
/**
* Take a full-display screenshot and return the file path.
*
* Delegates to captureScreen() so callers don't have two parallel APIs.
* Kept synchronous-return-shaped for back-compat with existing consumers
* (whatsappReadChats / whatsappReadChat) by blocking on the async call
* via execSync glue. New code should call captureScreen() directly.
*/
export function screenshot(): string {
const runId = Date.now().toString();
const path = `/tmp/screenshot-${runId}.png`;
// Inline the screencapture invocation here (sync) so this function stays
// string-returning. The async captureScreen() is the canonical API; this
// is the back-compat shim.
try {
execFileSync("/usr/sbin/screencapture", ["-x", path]);
} catch (e: any) {
console.error(`[desktop] screenshot failed: ${String(e?.message || e).trim()}`);
throw e;
}
if (!existsSync(path)) {
throw new Error("[desktop] screenshot: screencapture produced no file (TCC denied?)");
}
return path;
}
/** Helper: activate WhatsApp.app and wait for it to be frontmost. */
function activateWhatsApp(): void {
execSync(`osascript -e 'tell application "WhatsApp" to activate'`);
execSync("sleep 2");
}
/** Helper: run a Midscene query (read-only vision extraction, no clicks). */
function queryMidscene(question: string): string {
const runId = Date.now().toString();
const scriptPath = `/tmp/midscene-cmd-${runId}.sh`;
const outPath = `/tmp/midscene-out-${runId}.txt`;
const donePath = `/tmp/midscene-done-${runId}`;
const escaped = question.replace(/"/g, '\\"');
const script = `#!/bin/bash
source ~/.midscene-env
osascript -e 'tell application "WhatsApp" to activate'
sleep 1
npx -y @midscene/computer@1 query --prompt "${escaped}" > ${outPath} 2>&1
echo "EXIT_CODE=$?" >> ${outPath}
touch ${donePath}
`;
execSync(`cat > ${scriptPath} << 'SCRIPT_EOF'\n${script}SCRIPT_EOF`, { encoding: "utf-8" });
execSync(`chmod +x ${scriptPath}`);
execSync(`osascript -e 'tell application "Terminal" to do script "${scriptPath}"'`);
const maxWait = 60_000;
const start = Date.now();
while (!existsSync(donePath) && Date.now() - start < maxWait) {
execSync("sleep 2");
}
let output = "";
if (existsSync(outPath)) {
output = readFileSync(outPath, "utf-8");
} else {
output = "ERROR: Midscene query timed out after 60s";
}
for (const f of [scriptPath, outPath, donePath]) {
if (existsSync(f)) unlinkSync(f);
}
return output.trim();
}
/**
* Open WhatsApp.app, screenshot the chat list, extract chat names + unread badges via Midscene.
* Returns the raw Midscene vision output describing visible chats.
*/
export function whatsappReadChats(): string {
activateWhatsApp();
const screenshotPath = screenshot();
const result = queryMidscene(
"List every chat visible in the WhatsApp sidebar. For each chat return: contact name, last message preview, timestamp, and whether it has an unread badge (and the count if visible). Return as a structured list."
);
return `screenshot: ${screenshotPath}\n\n${result}`;
}
/**
* Click into a specific WhatsApp chat and extract recent messages.
* @param contactName - The name of the contact or group to open.
*/
export function whatsappReadChat(contactName: string): string {
activateWhatsApp();
// Use Cmd+F to search for the contact, more reliable than vision-clicking the sidebar
const escaped = contactName.replace(/"/g, '\\"');
runMidscene(`Click on the search field at the top of WhatsApp, type "${escaped}", then click on the first matching chat result`);
execSync("sleep 2");
// Re-activate WhatsApp after Terminal steals focus from runMidscene
activateWhatsApp();
const screenshotPath = screenshot();
const result = queryMidscene(
`Read the currently open WhatsApp chat. Extract the most recent messages visible on screen. For each message return: sender name, message text, and timestamp. Return as a structured list, most recent last.`
);
return `screenshot: ${screenshotPath}\n\n${result}`;
}
/**
* Open a WhatsApp chat and send a message.
* @param contactName - The name of the contact or group.
* @param text - The message text to send.
*/
export function whatsappSendMessage(contactName: string, text: string): string {
activateWhatsApp();
// Search and open the chat
const escapedName = contactName.replace(/"/g, '\\"');
runMidscene(`Click on the search field at the top of WhatsApp, type "${escapedName}", then click on the first matching chat result`);
execSync("sleep 2");
// Re-activate WhatsApp and type the message via AppleScript (more reliable than Midscene typing)
activateWhatsApp();
// Click the message input field via Midscene
runMidscene("Click on the message input field at the bottom of the chat");
execSync("sleep 1");
// Re-activate and type via AppleScript keystroke
execSync(`osascript -e 'tell application "WhatsApp" to activate'`);
execSync("sleep 0.5");
// Use clipboard to handle special characters reliably
const escapedText = text.replace(/\\/g, "\\\\").replace(/"/g, '\\"');
execSync(`osascript -e 'set the clipboard to "${escapedText}"' -e 'tell application "System Events" to keystroke "v" using command down'`);
execSync("sleep 0.5");
// Press Enter to send
execSync(`osascript -e 'tell application "System Events" to key code 36'`);
execSync("sleep 1");
activateWhatsApp();
const screenshotPath = screenshot();
return `sent to ${contactName}: "${text}"\nscreenshot: ${screenshotPath}`;
}
// Reserved for env-driven config (no current consumer; preserves the
// snappy-os env() contract pattern used elsewhere in state/lib/).
void env;
// --- CLI ---
if ((() => { try { return import.meta.url === `file://${realpathSync(process.argv[1])}`; } catch { return false; } })()) {
(async () => {
const [, , cmd, ...args] = process.argv;
switch (cmd) {
case "run": {
const instruction = args.join(" ");
if (!instruction) { console.error("Usage: api.ts run <instruction>"); process.exit(1); }
console.log(runMidscene(instruction));
break;
}
case "screenshot": {
const path = screenshot();
console.log(path);
break;
}
case "focus": {
const app = args.join(" ");
if (!app) { console.error("Usage: api.ts focus <appName>"); process.exit(1); }
await focusApp(app);
console.log(`focused ${app}`);
break;
}
case "capture-window": {
if (args.length < 2) { console.error("Usage: api.ts capture-window <appName> <path>"); process.exit(1); }
const path = args[args.length - 1];
const app = args.slice(0, -1).join(" ");
const out = await captureWindow(app, path);
console.log(out);
break;
}
case "capture-screen": {
const path = args[0];
if (!path) { console.error("Usage: api.ts capture-screen <path>"); process.exit(1); }
console.log(await captureScreen(path));
break;
}
case "paste": {
const text = args.join(" ");
if (!text) { console.error("Usage: api.ts paste <text>"); process.exit(1); }
await pasteText(text);
console.log(`pasted ${text.length} chars`);
break;
}
case "press": {
const key = args[0];
if (!key) { console.error("Usage: api.ts press <key>"); process.exit(1); }
await pressKey(key);
console.log(`pressed ${key}`);
break;
}
case "click": {
if (args.length < 2) { console.error("Usage: api.ts click <x> <y>"); process.exit(1); }
const x = Number(args[0]);
const y = Number(args[1]);
await clickAt(x, y);
console.log(`clicked ${x},${y}`);
break;
}
case "list-windows": {
const app = args.join(" ") || undefined;
const windows = await listWindows(app);
console.log(JSON.stringify(windows, null, 2));
break;
}
case "type-submit": {
if (args.length < 2) { console.error("Usage: api.ts type-submit <appName> <text>"); process.exit(1); }
const app = args[0];
const text = args.slice(1).join(" ");
await typeAndSubmit(app, text);
console.log(`submitted to ${app}`);
break;
}
case "wa-chats": {
console.log(whatsappReadChats());
break;
}
case "wa-read": {
const contact = args.join(" ");
if (!contact) { console.error("Usage: api.ts wa-read <contact>"); process.exit(1); }
console.log(whatsappReadChat(contact));
break;
}
case "wa-send": {
if (args.length < 2) { console.error("Usage: api.ts wa-send <contact> <message>"); process.exit(1); }
const contact = args[0];
const message = args.slice(1).join(" ");
console.log(whatsappSendMessage(contact, message));
break;
}
default:
console.log(
"Usage: npx tsx api.ts [run|screenshot|focus|capture-window|capture-screen|paste|press|click|list-windows|type-submit|wa-chats|wa-read|wa-send] ...",
);
}
})();
}
scripts- helper scripts it can run
prose-only skill - 9 inline code blocks live in SKILL.md above (no state/bin/ sidecar yet).
how we check it- the checks, plus the last 10 runs
| timestamp | verb | score | primary_issue | artifact |
|---|---|---|---|---|
| 2026-04-29 03:08Z | - | 1.00 | - | - |
| 2026-04-29 03:05Z | - | 1.00 | - | - |
| 2026-04-25 04:11Z | - | 1.00 | - | - |
| 2026-04-21 15:58Z | - | 1.00 | - | - |
| 2026-04-21 15:56Z | - | 1.00 | - | - |
| 2026-04-21 03:53Z | - | 1.00 | - | - |
| 2026-04-29 03:08Z | - | 1.00 | - | - |
| 2026-04-29 03:05Z | - | 1.00 | - | - |
| 2026-04-25 04:11Z | - | 1.00 | - | - |
| 2026-04-21 15:58Z | - | 1.00 | - | - |