
prompt-improver

Improves how a skill works based on where it's been falling short.
personal · 2 files · 3 recent evals

What it does for you

Improves how a skill works based on where it's been falling short.

How to get it

These run inside the Snappy workspace. Want this working in your business? I set skills like this up with you, in one focused week.

For developers - how this skill is built, graded, and how it runs

at a glance - the short version

eval mode: auto
category: Ops
depends: log

what's inside - the parts that make up a skill (3/4 present)

A skill is just a few plain-text files. Only the main one is required. The rest are optional, added as the work needs them. This is what the skill is made of; how it runs is just below.

The skill
state/skills/prompt-improver/SKILL.md present
the skill itself, in plain text
The main file. It says what the skill is and lays out the steps in plain English.
Code
state/lib/prompt-improver.ts present
code the skill can run
Reusable code this skill can call when it needs to.
Scripts
state/bin/prompt-improver/ not present
helper scripts
Optional. Added when a skill has a few commands to run.
Loader
state/skills/prompt-improver/AGENTS.md present
what the AI loads on the fly
Loaded automatically the moment this skill is needed. Kept short on purpose.

how it runs - the shared frame every skill uses (3/5 present)

Every skill runs the same way. One part does the work, a separate part checks it, and a short loader hands the AI exactly what it needs for the job. Any part this skill doesn't use shows a one-line note saying why, so the gap is visibly deliberate rather than accidental.

makes the work The worker
not present

No work step here. This is probably a skill that reads or coordinates, not one that produces something.

checks the work The reviewer
not present

No separate check found. Without one, the part that makes the work could end up approving its own work. Worth a closer look.

frame
learns Self-correction
present
fixes itself learns from gaps
When a run hits a gap, the skill gets edited on the spot [FIXED] or queued for a bigger rewrite [LOGGED], so it keeps getting better.
tidies up Background fixes
present
queued for rewrite runs in the background
Bigger fixes that can't be made on the spot get queued and rewritten in the background later.
remembers Run history
present
state/log/evals.ndjson auto runs
Every run is written down here, so the next time this skill is used it already knows how the last runs went.
Critical rules - the things this skill must not get wrong
No must-not-break rules called out for this skill. Anything important lives in the writeup below.

what it has learned - fixes written back in over time sample

When a run hits something this skill didn't handle, the fix gets written back into the skill. FIXED means it was corrected on the spot. LOGGED means it's queued for a bigger rewrite. Either way, the skill gets a little better and is less likely to repeat the same mistake.

SKILL.md - the skill, written out in plain English

prompt-improver

Extends the system's self-correction loop. Given a skill slug, reads its SKILL.md + AGENTS.md + recent eval failures, then calls the LLM to compose a small prompt fragment patch. Returns {slug, old, proposed, reason} so the caller can emit a before/after Apply/Discard card.

Per-skill prompt fragments land at state/skills/<slug>/prompt-fragment.md. Server injection of these fragments is Phase 2 work - writing the file is the current artifact.

Apply/Discard events from the UI write eval rows back to state/log/evals.ndjson with skill: "prompt-improver" and kind: "improvement-feedback", feeding the prompt-tuner reinforcement loop.

Steps

  1. Accept a skill slug argument.
  2. Read state/skills/<slug>/SKILL.md and state/skills/<slug>/AGENTS.md.
  3. Tail state/log/evals.ndjson for recent rows where skill === slug and score < 0.7.
  4. Call the LLM (Sonnet) with the skill prose + failure context, requesting a focused prompt patch.
  5. Return {slug, old, proposed, reason} to the caller.
  6. Caller emits a before/after card. User clicks Apply or Discard.
  7. On Apply: write state/skills/<slug>/prompt-fragment.md with the proposed patch. Write eval row {skill: "prompt-improver", target: slug, score: 0.9, kind: "improvement-feedback"}.
  8. On Discard: write eval row {skill: "prompt-improver", target: slug, score: 0.3, kind: "improvement-feedback"}.
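
Steps 7 and 8 can be sketched as a single handler. This is a minimal illustration, not the actual UI wiring: `recordDecision`, the `root` parameter, and the `ts` field on the row are assumptions made for the example. The paths, scores, and row fields are taken from the steps above.

```typescript
import { appendFileSync, existsSync, mkdirSync, mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { dirname, join } from "node:path";

interface PromptPatch {
  slug: string;
  old: string;
  proposed: string;
  reason: string;
}

type Decision = "apply" | "discard";

// Hypothetical helper (not part of prompt-improver.ts): records one decision under `root`.
function recordDecision(root: string, patch: PromptPatch, decision: Decision): string {
  if (decision === "apply") {
    // Step 7: only Apply writes the fragment file.
    const fragmentPath = join(root, "state/skills", patch.slug, "prompt-fragment.md");
    mkdirSync(dirname(fragmentPath), { recursive: true });
    writeFileSync(fragmentPath, patch.proposed.trimEnd() + "\n", "utf8");
  }

  // Steps 7 and 8: both decisions append exactly one feedback row.
  const row = {
    ts: new Date().toISOString(), // assumed field, not listed in the steps
    skill: "prompt-improver",
    target: patch.slug,
    score: decision === "apply" ? 0.9 : 0.3,
    kind: "improvement-feedback",
  };
  const logPath = join(root, "state/log/evals.ndjson");
  mkdirSync(dirname(logPath), { recursive: true });
  const line = JSON.stringify(row);
  appendFileSync(logPath, line + "\n", "utf8");
  return line;
}

// Usage against a throwaway directory, so no real state/ files are touched.
const demoRoot = mkdtempSync(join(tmpdir(), "prompt-improver-demo-"));
const demoPatch: PromptPatch = {
  slug: "dogfood-loop",
  old: "",
  proposed: "# dogfood-loop - prompt fragment\n- DO cite the failing eval row.",
  reason: "illustrative only",
};
const fragmentFile = join(demoRoot, "state/skills/dogfood-loop/prompt-fragment.md");

recordDecision(demoRoot, demoPatch, "discard");
const fragmentAfterDiscard = existsSync(fragmentFile);
recordDecision(demoRoot, demoPatch, "apply");
const fragmentAfterApply = existsSync(fragmentFile);
const loggedRows = readFileSync(join(demoRoot, "state/log/evals.ndjson"), "utf8").trim().split("\n");
console.log({ fragmentAfterDiscard, fragmentAfterApply, rows: loggedRows.length });
```

Keeping the file write inside the `apply` branch means a Discard can only ever touch the log, which matches the intent of the two steps.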

AGENTS.md - what the AI loads when this skill comes up

prompt-improver - loader

Per-turn rules. Full reference: state/skills/prompt-improver/SKILL.md.

Critical Rules

  1. Slug required. Always pass an explicit skill slug. Never guess or infer - read it from the caller's context.
  2. Tail eval rows for the target slug. Read state/log/evals.ndjson, filter for skill === slug (or target === slug) where score < 0.7. Use last 50 rows max.
  3. LLM call is required. Do not return a patch without calling the LLM (Sonnet). The patch must be grounded in the actual prose failures, not fabricated.
  4. Return shape is {slug, old, proposed, reason}. All four fields required. old = the current problematic fragment verbatim. proposed = the replacement. reason = one-sentence explanation of why this change addresses the observed failures.
  5. Apply/Discard writes eval rows. Apply → score: 0.9, Discard → score: 0.3. Both use skill: "prompt-improver", target: <slug>, kind: "improvement-feedback". These rows feed the prompt-tuner reinforcement digest.
  6. No side-effects on Discard. On Discard, only write the eval row - do NOT write the prompt-fragment.md file.
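
Rules 2 and 4 lend themselves to a small sketch. The helper names `selectFailures` and `isPromptPatch` are hypothetical. The reading of "last 50 rows max" as "keep at most the 50 most recent matching rows" is also an assumption. The threshold is a parameter because the rules here say 0.7 while the api.ts listing further down filters on 0.5.

```typescript
interface EvalRow {
  skill?: string;
  target?: string;
  score?: number;
}

interface PromptPatch {
  slug: string;
  old: string;
  proposed: string;
  reason: string;
}

// Hypothetical helper for rule 2: pick failing rows for `slug` from raw NDJSON text.
function selectFailures(ndjson: string, slug: string, threshold = 0.7, maxRows = 50): EvalRow[] {
  const rows: EvalRow[] = [];
  for (const line of ndjson.split("\n")) {
    if (!line.trim()) continue;
    try {
      const row = JSON.parse(line) as EvalRow;
      const matches = row.skill === slug || row.target === slug;
      if (matches && typeof row.score === "number" && row.score < threshold) rows.push(row);
    } catch {
      // Malformed lines (and non-object JSON such as `null`) are skipped.
    }
  }
  // Assumed interpretation: keep only the most recent `maxRows` matches.
  return rows.slice(-maxRows);
}

// Hypothetical guard for rule 4: all four fields present and strings.
// `old` may be an empty string when no fragment exists yet, so emptiness is not checked.
function isPromptPatch(value: unknown): value is PromptPatch {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return ["slug", "old", "proposed", "reason"].every((key) => typeof v[key] === "string");
}

const sampleLog = [
  '{"skill":"dogfood-loop","score":0.4}',
  '{"skill":"dogfood-loop","score":0.9}',
  '{"skill":"prompt-improver","target":"dogfood-loop","score":0.3}',
  '{"skill":"other","score":0.1}',
  "not json",
].join("\n");

const picked = selectFailures(sampleLog, "dogfood-loop");
console.log(picked.length, isPromptPatch({ slug: "dogfood-loop", old: "", proposed: "- DO x", reason: "r" }));
```

Note that matching on `target` also picks up earlier Discard rows written by prompt-improver itself (score 0.3), which may or may not be the intended input for the next patch.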

Commands

# CLI usage
npx tsx state/lib/prompt-improver.ts <slug>

# Read recent failures for a slug
grep '"skill":"<slug>"' state/log/evals.ndjson | python3 -c "import sys,json; rows=[json.loads(l) for l in sys.stdin]; bad=[r for r in rows if r.get('score',1)<0.7]; print(len(bad), 'failures')"

# Check if prompt-fragment exists
ls state/skills/<slug>/prompt-fragment.md 2>/dev/null || echo "no fragment yet"

Self-correcting loader (PID feedback)

Before you finish, do two things:

1. Fix gaps (P - proportional). MANDATORY. If this loader didn't cover your case, edit this file surgically.

2. Log the result.

echo "[$(date -u +%FT%TZ)] prompt-improver: <what was missing or fixed> [FIXED|LOGGED] action_kind=<kind>" >> state/log/loader-feedback.log
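
For callers that write this log from TypeScript instead of the shell, the same line format could be produced and read back as sketched here. `formatFeedbackLine`, `parseFeedbackLine`, and the sample `action_kind` value `edit` are hypothetical. Only the line layout is taken from the `echo` command above.

```typescript
type Outcome = "FIXED" | "LOGGED";

interface FeedbackEntry {
  ts: string;
  skill: string;
  message: string;
  outcome: Outcome;
  actionKind: string;
}

// Hypothetical helper: builds one line in the layout used by the echo command.
function formatFeedbackLine(
  skill: string,
  message: string,
  outcome: Outcome,
  actionKind: string,
  now: Date = new Date(),
): string {
  // `date -u +%FT%TZ` prints seconds precision, so drop the milliseconds.
  const ts = now.toISOString().replace(/\.\d{3}Z$/, "Z");
  return `[${ts}] ${skill}: ${message} [${outcome}] action_kind=${actionKind}`;
}

const FEEDBACK_LINE = /^\[([^\]]+)\] ([^:]+): (.*) \[(FIXED|LOGGED)\] action_kind=(\S+)$/;

// Hypothetical helper: returns null for lines that do not follow the layout.
function parseFeedbackLine(line: string): FeedbackEntry | null {
  const m = FEEDBACK_LINE.exec(line.trim());
  if (!m) return null;
  return { ts: m[1], skill: m[2], message: m[3], outcome: m[4] as Outcome, actionKind: m[5] };
}

const sampleLine = formatFeedbackLine(
  "prompt-improver",
  "discard rule was missing",
  "FIXED",
  "edit", // illustrative value only
  new Date("2026-05-01T06:08:00.000Z"),
);
const roundTrip = parseFeedbackLine(sampleLine);
console.log(sampleLine, roundTrip?.outcome);
```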

api.ts - the code it can call

#!/usr/bin/env -S npx tsx
/**
 * snappy-prompt-improver/api.ts -- Generate a prompt patch for a skill based on eval failures.
 *
 * The system extends itself: given a skill slug, read its SKILL.md + AGENTS.md + recent
 * eval failures, then use the configured snappy-os dispatch runtime to compose
 * a small prompt fragment patch.
 * Returns {slug, old, proposed, reason} for the caller to emit a before/after card.
 *
 * Per-skill prompt fragments live at: state/skills/<slug>/prompt-fragment.md
 * (new convention). Server does not auto-load these yet — writing the file is
 * the artifact; Phase 2 will wire server injection.
 *
 * Usage (import):
 *   import { generatePromptPatch } from "./prompt-improver.ts";
 *   const patch = await generatePromptPatch("dogfood-loop");
 *
 * Usage (CLI):
 *   npx tsx state/lib/prompt-improver.ts <slug>
 */

import { readFileSync, existsSync } from "fs";
import { join, resolve, dirname } from "path";
import { fileURLToPath } from "url";
import { dispatchFor, readDefaultModel, readDispatchConfig } from "./dispatch.ts";

const HERE = dirname(fileURLToPath(import.meta.url));
const REPO_ROOT = resolve(HERE, "..", "..");

export interface PromptPatch {
  slug: string;
  old: string;
  proposed: string;
  reason: string;
}

interface EvalRow {
  ts: string;
  skill?: string;
  verb?: string;
  score: number;
  primary_issue?: string | null;
  note?: string | null;
}

// --- Read helpers ---

function readSkillMd(slug: string): string {
  const path = join(REPO_ROOT, "state/skills", slug, "SKILL.md");
  if (!existsSync(path)) throw new Error(`SKILL.md not found for slug: ${slug}`);
  return readFileSync(path, "utf8");
}

function readAgentsMd(slug: string): string {
  const path = join(REPO_ROOT, "state/skills", slug, "AGENTS.md");
  if (!existsSync(path)) return "";
  return readFileSync(path, "utf8");
}

function readPromptFragment(slug: string): string {
  const path = join(REPO_ROOT, "state/skills", slug, "prompt-fragment.md");
  if (!existsSync(path)) return "";
  return readFileSync(path, "utf8");
}

function readRecentFailures(slug: string, limit = 20): EvalRow[] {
  const evalsPath = join(REPO_ROOT, "state/log/evals.ndjson");
  if (!existsSync(evalsPath)) return [];

  const lines = readFileSync(evalsPath, "utf8").trim().split("\n").filter(Boolean);
  // Only parse the last 200 lines (the file itself is still read in full above)
  const tail = lines.slice(-200);

  const failures: EvalRow[] = [];
  for (const line of tail) {
    try {
      const row = JSON.parse(line) as EvalRow;
      const rowSlug = row.skill || row.verb || "";
      if (rowSlug !== slug) continue;
      if (typeof row.score !== "number") continue;
      if (row.score < 0.5) failures.push(row);
    } catch {
      // skip malformed lines
    }
  }

  // Return most recent failures first, up to limit
  return failures.slice(-limit).reverse();
}

function summarizeFailureMode(failures: EvalRow[]): string {
  if (failures.length === 0) return "No recent failures found (no rows with score < 0.5 among the last 200 eval rows).";

  // Count primary_issue occurrences
  const issueCounts: Record<string, number> = {};
  for (const f of failures) {
    const issue = (f.primary_issue || "").trim();
    if (issue) {
      issueCounts[issue] = (issueCounts[issue] || 0) + 1;
    }
  }

  const topIssues = Object.entries(issueCounts)
    .sort((a, b) => b[1] - a[1])
    .slice(0, 3)
    .map(([issue, count]) => `"${issue}" (${count}x)`);

  const noteSnippets = failures
    .slice(0, 3)
    .map(f => f.note)
    .filter(Boolean)
    .map(n => `- ${(n || "").slice(0, 120)}`);

  let summary = `${failures.length} failures in last 200 evals (score < 0.5).`;
  if (topIssues.length) summary += ` Top issues: ${topIssues.join(", ")}.`;
  if (noteSnippets.length) summary += `\nRecent notes:\n${noteSnippets.join("\n")}`;

  return summary;
}

// --- LLM calls ---

const PROMPT_IMPROVER_SYSTEM = `You write concise, surgical prompt fragments for snappy-os skills.

A "prompt fragment" is a short block of text (typically 3-15 lines) that gets prepended to a skill agent's system prompt to address observed failure modes. It is NOT a full system prompt rewrite — it is a targeted patch.

WHAT YOU ARE GIVEN:
- The skill's SKILL.md (its purpose and behavior spec)
- Its current prompt-fragment.md (empty string if none exists yet)
- Its recent eval failures (score < 0.5): primary_issue + notes

YOUR JOB:
Write a NEW prompt-fragment.md that either:
a) Creates a new fragment (if none exists) that directly addresses the top failure modes.
b) Patches the existing fragment to fix the gaps the failures reveal.

FORMAT RULES:
- Start with a single-line header: # <slug> — prompt fragment
- Then 2-10 bullet points or short rules. No paragraphs.
- Each bullet addresses one specific failure mode, stated as a DO or DO NOT rule.
- Be direct and prescriptive, not abstract.
- No greetings, no "This fragment...", no meta-commentary.
- Under 200 words total.
- Output ONLY the fragment text. No explanation, no fences.`;

function buildUserPrompt(slug: string, skillBody: string, agentsBody: string, oldFragment: string, failures: EvalRow[]): string {
  const failureLines = failures.slice(0, 10).map(f => {
    const parts = [`score=${f.score}`];
    if (f.primary_issue) parts.push(`issue="${f.primary_issue}"`);
    if (f.note) parts.push(`note="${(f.note || "").slice(0, 100)}"`);
    return `- ${parts.join(" ")}`;
  }).join("\n") || "- (no failures found in recent evals)";

  return `SKILL SLUG: ${slug}

SKILL.md (truncated to 2000 chars):
${skillBody.slice(0, 2000)}

AGENTS.md (first 800 chars, if any):
${agentsBody.slice(0, 800) || "(no AGENTS.md)"}

CURRENT PROMPT FRAGMENT (empty = none exists):
${oldFragment || "(none)"}

RECENT FAILURES (showing ${Math.min(failures.length, 10)} of ${failures.length}, score < 0.5):
${failureLines}

Write the new prompt-fragment.md that addresses these failures. Output ONLY the fragment text.`;
}

function normalizeFragment(output: string): string {
  const trimmed = output.trim();
  const fenceMatch = trimmed.match(/```(?:markdown|md)?\s*([\s\S]*?)```/i);
  return (fenceMatch?.[1] || trimmed).trim();
}

async function callConfiguredRuntime(slug: string, systemPrompt: string, userPrompt: string): Promise<string> {
  const axis = readDispatchConfig().subagent;
  const modelLabel = axis.model === "auto" ? readDefaultModel().slug : axis.model;
  process.stderr.write(`[prompt-improver] composing ${slug} via ${axis.backend}/${modelLabel}\n`);

  const result = await dispatchFor("subagent", {
    prompt: userPrompt,
    systemPrompt,
    cwd: REPO_ROOT,
    tools: ["read", "grep", "ls"],
    timeoutMs: 120_000,
    interviewMode: false,
  });

  if (!result.ok || !result.output.trim()) {
    const detail = result.error || result.stderr || `exit ${result.exitCode}`;
    throw new Error(`${result.provider}/${result.model} failed: ${detail}`);
  }

  return normalizeFragment(result.output);
}

// --- Main export ---

export async function generatePromptPatch(slug: string): Promise<PromptPatch> {
  const skillMd = readSkillMd(slug); // throws if not found
  const agentsMd = readAgentsMd(slug);
  const oldFragment = readPromptFragment(slug);
  const failures = readRecentFailures(slug);
  const reason = summarizeFailureMode(failures);

  // Strip frontmatter from SKILL.md to get body prose
  const skillBody = skillMd.replace(/^---\n[\s\S]+?\n---\n?/, "").trim();

  const userPrompt = buildUserPrompt(slug, skillBody, agentsMd, oldFragment, failures);

  let proposed: string;
  try {
    proposed = await callConfiguredRuntime(slug, PROMPT_IMPROVER_SYSTEM, userPrompt);
  } catch (e) {
    throw new Error(`prompt runtime failed for ${slug}: ${(e as Error).message}`);
  }

  // Basic sanity check: warn (but still return the output) if it does not start with a header or bullet
  if (!proposed || (!proposed.startsWith("#") && !proposed.startsWith("-") && !proposed.startsWith("*"))) {
    process.stderr.write(`[prompt-improver] LLM output for ${slug} looks off, using as-is\n`);
  }

  return { slug, old: oldFragment, proposed, reason };
}

// --- CLI ---

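// CLI-only imports, kept next to their single use (ES imports are hoisted, so this is legal).
import { realpathSync } from "fs";
import { pathToFileURL } from "url";

// True when this file is the script Node was asked to run.
// pathToFileURL handles spaces, non-ASCII characters, and Windows paths,
// which a hand-built `file://${path}` string does not; it also avoids
// calling require() from an ES module.
const isMain = (() => {
  const entry = process.argv[1];
  if (!entry) return false;
  try {
    return import.meta.url === pathToFileURL(realpathSync(entry)).href;
  } catch {
    // realpathSync throws if the entry path cannot be resolved
    return import.meta.url === pathToFileURL(entry).href;
  }
})();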

if (isMain) {
  (async () => {
    const slug = process.argv[2];
    if (!slug) {
      console.error("Usage: npx tsx state/lib/prompt-improver.ts <slug>");
      process.exit(1);
    }
    try {
      const patch = await generatePromptPatch(slug);
      console.log("=== OLD FRAGMENT ===");
      console.log(patch.old || "(none)");
      console.log("\n=== PROPOSED ===");
      console.log(patch.proposed);
      console.log("\n=== REASON ===");
      console.log(patch.reason);
    } catch (e) {
      console.error("Error:", (e as Error).message);
      process.exit(1);
    }
  })();
}
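
One design note on the listing above: `readRecentFailures` loads the whole of evals.ndjson with `readFileSync` before slicing the last 200 lines. If the log grows large, a bounded tail read is one option. The sketch below is an assumption-laden illustration, not part of the skill: `readTailLines` and the 256 KiB default for `maxBytes` are made up for the example.

```typescript
import { closeSync, fstatSync, mkdtempSync, openSync, readSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical helper: returns up to `maxLines` trailing non-empty lines,
// reading at most `maxBytes` bytes from the end of the file.
function readTailLines(path: string, maxLines = 200, maxBytes = 256 * 1024): string[] {
  const fd = openSync(path, "r");
  try {
    const size = fstatSync(fd).size;
    const length = Math.min(size, maxBytes);
    const start = size - length;
    const buf = Buffer.alloc(length);
    let read = 0;
    while (read < length) {
      const n = readSync(fd, buf, read, length - read, start + read);
      if (n === 0) break; // file shrank while reading
      read += n;
    }
    const lines = buf.toString("utf8", 0, read).split("\n");
    // When the window starts mid-file, the first piece is usually a partial line.
    // It is dropped unconditionally, which can cost one complete line at worst.
    if (start > 0) lines.shift();
    return lines.filter(Boolean).slice(-maxLines);
  } finally {
    closeSync(fd);
  }
}

// Usage on a throwaway file with 500 small rows.
const demoDir = mkdtempSync(join(tmpdir(), "evals-tail-"));
const demoFile = join(demoDir, "evals.ndjson");
const demoRows = Array.from({ length: 500 }, (_, i) => `{"n":${i}}`);
writeFileSync(demoFile, demoRows.join("\n") + "\n", "utf8");

const lastThree = readTailLines(demoFile, 3);
const smallWindow = readTailLines(demoFile, 200, 64);
console.log(lastThree, smallWindow.length);
```

The trade-off is that a byte window can hold fewer than `maxLines` rows when rows are long, so `maxBytes` would need sizing against typical row length.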

scripts - helper scripts it can run

prose-only skill - no sidecar under state/bin/ yet. Steps, if any, are described in SKILL.md.

how we check it - the checks, plus the last 3 runs

rubric: auto · no rubric declared
recent mean: 0.90 · 3 runs · actor/auditor: unverifiable
deps: log

timestamp          verb  score  primary_issue  artifact
2026-05-01 06:08Z  -     0.90   -              -
2026-05-01 06:08Z  -     0.90   -              -
2026-05-01 06:08Z  -     0.90   -              -