
prompt-improver

Improves how a skill works based on where it's been falling short.

personal · 2 files · 3 recent evals




How to get it

These run inside the Snappy workspace. Want this working in your business? I set skills like this up with you, in one focused week.


For developers - how this skill is built, graded, and how it runs

at a glance - the short version

eval mode: auto

category: Ops

depends: log

what's inside - the parts that make up a skill 3/4 present

A skill is just a few plain-text files. Only the main one is required. The rest are optional, added as the work needs them. This is what the skill is made of; how it runs is just below.

The skill

state/skills/prompt-improver/SKILL.md present

the skill itself, in plain text

The main file. It says what the skill is and lays out the steps in plain English.

Code

state/lib/prompt-improver.ts present

code the skill can run

Reusable code this skill can call when it needs to.

Scripts

state/bin/prompt-improver/ not present

helper scripts

Optional. Added when a skill has a few commands to run.

Loader

state/skills/prompt-improver/AGENTS.md present

what the AI loads on the fly

Loaded automatically the moment this skill is needed. Kept short on purpose.

how it runs - the shared frame every skill uses 3/5 present

Every skill runs the same way. One part does the work, a separate part checks it, and a short loader hands the AI exactly what it needs for the job. Anything this skill doesn't use shows a one-line note saying why, so a missing part reads as a deliberate choice rather than an accident.

makes the work The worker

not present

No work step here. This is probably a skill that reads or coordinates, not one that produces something.

checks the work The reviewer

not present

No separate check found. Without one, the part that makes the work could end up approving its own work, which is worth a closer look.

frame

learns Self-correction

present

fixes itself learns from gaps

When a run hits a gap, the skill gets edited on the spot [FIXED] or queued for a bigger rewrite [LOGGED], so it keeps getting better.

tidies up Background fixes

present

queued for rewrite runs in the background

Bigger fixes that can't be made on the spot get queued and rewritten in the background later.

remembers Run history

present

state/log/evals.ndjson auto runs

Every run is written down here, so the next time this skill is used it already knows how the last runs went.

Critical rules - the things this skill must not get wrong

No must-not-break rules called out for this skill. Anything important lives in the writeup below.

what it has learned - fixes written back in over time sample

When a run hits something this skill didn't handle, the fix gets written back into the skill so it doesn't happen again. FIXED means it was corrected on the spot. LOGGED means it's queued for a bigger rewrite. Either way, the skill gets a little better and is less likely to make the same mistake twice.


SKILL.md - the skill, written out in plain English

prompt-improver

Extends the system's self-correction loop. Given a skill slug, it reads the skill's SKILL.md, AGENTS.md, and recent eval failures, then calls the LLM to compose a small prompt-fragment patch. It returns {slug, old, proposed, reason} so the caller can emit a before/after Apply/Discard card.

Per-skill prompt fragments land at state/skills/<slug>/prompt-fragment.md. Server injection of these fragments is Phase 2 work - writing the file is the current artifact.

Apply/Discard events from the UI write eval rows back to state/log/evals.ndjson with skill: "prompt-improver" and kind: "improvement-feedback", feeding the prompt-tuner reinforcement loop.

Steps

1. Accept a skill slug argument.
2. Read state/skills/<slug>/SKILL.md and state/skills/<slug>/AGENTS.md.
3. Tail state/log/evals.ndjson for recent rows where skill === slug and score < 0.7.
4. Call the LLM (Sonnet) with the skill prose + failure context, requesting a focused prompt patch.
5. Return {slug, old, proposed, reason} to the caller.
6. Caller emits a before/after card. User clicks Apply or Discard.
7. On Apply: write state/skills/<slug>/prompt-fragment.md with the proposed patch. Write eval row {skill: "prompt-improver", target: slug, score: 0.9, kind: "improvement-feedback"}.
8. On Discard: write eval row {skill: "prompt-improver", target: slug, score: 0.3, kind: "improvement-feedback"}.
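
The Apply and Discard steps above can be sketched roughly as follows. This is a minimal sketch, not the skill's actual handler: `recordDecision`, `FeedbackRow`, the `root` parameter, and the `ts` field are assumptions made for illustration, and the demo writes to a temporary directory so no real state is touched.

```typescript
import { appendFileSync, existsSync, mkdirSync, mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { dirname, join } from "node:path";

// Shape returned by the skill, as listed in the steps above.
interface PromptPatch {
  slug: string;
  old: string;
  proposed: string;
  reason: string;
}

// Hypothetical type name; `ts` is an assumed extra field, the rest comes from the steps.
interface FeedbackRow {
  ts: string;
  skill: "prompt-improver";
  target: string;
  score: number;
  kind: "improvement-feedback";
}

// Hypothetical helper: `root` is the directory that contains `state/`.
function recordDecision(root: string, patch: PromptPatch, decision: "apply" | "discard"): FeedbackRow {
  if (decision === "apply") {
    // Only Apply writes the fragment file; Discard has no file side effect.
    const fragmentPath = join(root, "state/skills", patch.slug, "prompt-fragment.md");
    mkdirSync(dirname(fragmentPath), { recursive: true });
    writeFileSync(fragmentPath, patch.proposed, "utf8");
  }

  const row: FeedbackRow = {
    ts: new Date().toISOString(),
    skill: "prompt-improver",
    target: patch.slug,
    score: decision === "apply" ? 0.9 : 0.3,
    kind: "improvement-feedback",
  };

  // NDJSON: one JSON object per line, appended to the shared eval log.
  const logPath = join(root, "state/log/evals.ndjson");
  mkdirSync(dirname(logPath), { recursive: true });
  appendFileSync(logPath, JSON.stringify(row) + "\n", "utf8");
  return row;
}

// Demo against a throwaway directory.
const demoRoot = mkdtempSync(join(tmpdir(), "prompt-improver-demo-"));
const demoPatch: PromptPatch = {
  slug: "dogfood-loop",
  old: "",
  proposed: "- DO pass an explicit slug.",
  reason: "example only",
};
const demoFragmentPath = join(demoRoot, "state/skills", demoPatch.slug, "prompt-fragment.md");

recordDecision(demoRoot, demoPatch, "discard");
const writtenOnDiscard = existsSync(demoFragmentPath); // false: Discard only logs

recordDecision(demoRoot, demoPatch, "apply");
const writtenOnApply = existsSync(demoFragmentPath); // true: Apply writes the fragment

const loggedScores = readFileSync(join(demoRoot, "state/log/evals.ndjson"), "utf8")
  .trim()
  .split("\n")
  .map((line) => (JSON.parse(line) as FeedbackRow).score);
console.log(writtenOnDiscard, writtenOnApply, loggedScores);
```

Keeping the file write inside the Apply branch makes the no-side-effects-on-Discard behaviour hard to break by accident.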

AGENTS.md - what the AI loads when this skill comes up

prompt-improver - loader

Per-turn rules. Full reference: state/skills/prompt-improver/SKILL.md.

Critical Rules

1. Slug required. Always pass an explicit skill slug. Never guess or infer - read it from the caller's context.
2. Tail eval rows for the target slug. Read state/log/evals.ndjson, filter for skill === slug (or target === slug) where score < 0.7. Use last 50 rows max.
3. LLM call is required. Do not return a patch without calling the LLM (Sonnet). The patch must be grounded in the actual prose failures, not fabricated.
4. Return shape is {slug, old, proposed, reason}. All four fields required. old = the current problematic fragment verbatim. proposed = the replacement. reason = one-sentence explanation of why this change addresses the observed failures.
5. Apply/Discard writes eval rows. Apply → score: 0.9, Discard → score: 0.3. Both use skill: "prompt-improver", target: <slug>, kind: "improvement-feedback". These rows feed the prompt-tuner reinforcement digest.
6. No side-effects on Discard. On Discard, only write the eval row - do NOT write the prompt-fragment.md file.
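
The row-filtering rule above could look something like the sketch below. `failingRowsFor` is a hypothetical helper, and it assumes that "last 50 rows max" means only the newest 50 rows of the log are inspected; the rule could also be read as "return at most 50 failures".

```typescript
// Minimal row shape for this sketch; real eval rows carry more fields.
interface EvalRow {
  skill?: string;
  target?: string;
  score?: number;
}

// Hypothetical helper: parse NDJSON text and keep the failing rows for one slug.
function failingRowsFor(ndjson: string, slug: string, threshold = 0.7, maxRows = 50): EvalRow[] {
  const rows: EvalRow[] = [];
  for (const line of ndjson.split("\n")) {
    if (!line.trim()) continue;
    try {
      rows.push(JSON.parse(line) as EvalRow);
    } catch {
      // Skip malformed lines rather than failing the whole read.
    }
  }
  return rows
    .slice(-maxRows) // assumption: look only at the newest `maxRows` rows
    .filter(
      (r) =>
        (r.skill === slug || r.target === slug) &&
        typeof r.score === "number" &&
        r.score < threshold,
    );
}

// Usage with a small in-memory log.
const sampleLog = [
  '{"skill":"dogfood-loop","score":0.4}',
  '{"skill":"dogfood-loop","score":0.9}',
  '{"skill":"prompt-improver","target":"dogfood-loop","score":0.3}',
  "not json",
  '{"skill":"other","score":0.1}',
].join("\n");

const failing = failingRowsFor(sampleLog, "dogfood-loop");
console.log(failing.length);
```

Matching on target as well as skill picks up the improvement-feedback rows that prompt-improver itself writes about the slug.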

Commands

# CLI usage
npx tsx state/lib/prompt-improver.ts <slug>

# Read recent failures for a slug
grep '"skill":"<slug>"' state/log/evals.ndjson | python3 -c "import sys,json; rows=[json.loads(l) for l in sys.stdin]; bad=[r for r in rows if r.get('score',1)<0.7]; print(len(bad), 'failures')"

# Check if prompt-fragment exists
ls state/skills/<slug>/prompt-fragment.md 2>/dev/null || echo "no fragment yet"

Self-correcting loader (PID feedback)

Before you finish, do two things:

1. Fix gaps (P - proportional). MANDATORY. If this loader didn't cover your case, edit this file surgically.

2. Log the result.

echo "[$(date -u +%FT%TZ)] prompt-improver: <what was missing or fixed> [FIXED|LOGGED] action_kind=<kind>" >> state/log/loader-feedback.log
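
For callers that log from code rather than from the shell, the same line format can be built in TypeScript. This is a sketch only: `feedbackLine` is a hypothetical helper, and the `loader-edit` value used for `action_kind` is an invented example, since the loader leaves `<kind>` open.

```typescript
type Outcome = "FIXED" | "LOGGED";

// Hypothetical helper that mirrors the echo command above.
function feedbackLine(
  skill: string,
  summary: string,
  outcome: Outcome,
  actionKind: string,
  now: Date = new Date(),
): string {
  // `date -u +%FT%TZ` prints UTC to the second, so drop the
  // milliseconds that toISOString() includes.
  const ts = now.toISOString().replace(/\.\d{3}Z$/, "Z");
  return `[${ts}] ${skill}: ${summary} [${outcome}] action_kind=${actionKind}`;
}

// Usage with a fixed timestamp so the result is reproducible.
const exampleLine = feedbackLine(
  "prompt-improver",
  "no rule for a missing AGENTS.md",
  "FIXED",
  "loader-edit", // example value, not from the source
  new Date(Date.UTC(2026, 4, 1, 6, 8, 0)),
);
console.log(exampleLine);
```

The returned string would then be appended, followed by a newline, to state/log/loader-feedback.log.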

api.ts - the code it can call

#!/usr/bin/env npx tsx
/**
 * snappy-prompt-improver/api.ts -- Generate a prompt patch for a skill based on eval failures.
 *
 * The system extends itself: given a skill slug, read its SKILL.md + AGENTS.md + recent
 * eval failures, then use the configured snappy-os dispatch runtime to compose
 * a small prompt fragment patch.
 * Returns {slug, old, proposed, reason} for the caller to emit a before/after card.
 *
 * Per-skill prompt fragments live at: state/skills/<slug>/prompt-fragment.md
 * (new convention). Server does not auto-load these yet — writing the file is
 * the artifact; Phase 2 will wire server injection.
 *
 * Usage (import):
 *   import { generatePromptPatch } from "./prompt-improver.ts";
 *   const patch = await generatePromptPatch("dogfood-loop");
 *
 * Usage (CLI):
 *   npx tsx state/lib/prompt-improver.ts <slug>
 */

import { readFileSync, existsSync } from "fs";
import { join, resolve, dirname } from "path";
import { fileURLToPath } from "url";
import { dispatchFor, readDefaultModel, readDispatchConfig } from "./dispatch.ts";

const HERE = dirname(fileURLToPath(import.meta.url));
const REPO_ROOT = resolve(HERE, "..", "..");

export interface PromptPatch {
  slug: string;
  old: string;
  proposed: string;
  reason: string;
}

interface EvalRow {
  ts: string;
  skill?: string;
  verb?: string;
  score: number;
  primary_issue?: string | null;
  note?: string | null;
}

// --- Read helpers ---

function readSkillMd(slug: string): string {
  const path = join(REPO_ROOT, "state/skills", slug, "SKILL.md");
  if (!existsSync(path)) throw new Error(`SKILL.md not found for slug: ${slug}`);
  return readFileSync(path, "utf8");
}

function readAgentsMd(slug: string): string {
  const path = join(REPO_ROOT, "state/skills", slug, "AGENTS.md");
  if (!existsSync(path)) return "";
  return readFileSync(path, "utf8");
}

function readPromptFragment(slug: string): string {
  const path = join(REPO_ROOT, "state/skills", slug, "prompt-fragment.md");
  if (!existsSync(path)) return "";
  return readFileSync(path, "utf8");
}

function readRecentFailures(slug: string, limit = 20): EvalRow[] {
  const evalsPath = join(REPO_ROOT, "state/log/evals.ndjson");
  if (!existsSync(evalsPath)) return [];

  const lines = readFileSync(evalsPath, "utf8").trim().split("\n").filter(Boolean);
  // Take tail-200 to avoid full-file scan
  const tail = lines.slice(-200);

  const failures: EvalRow[] = [];
  for (const line of tail) {
    try {
      const row = JSON.parse(line) as EvalRow;
      const rowSlug = row.skill || row.verb || "";
      if (rowSlug !== slug) continue;
      if (typeof row.score !== "number") continue;
      if (row.score < 0.5) failures.push(row);
    } catch {
      // skip malformed lines
    }
  }

  // Return most recent failures first, up to limit
  return failures.slice(-limit).reverse();
}

function summarizeFailureMode(failures: EvalRow[]): string {
  if (failures.length === 0) return "No recent failures found (no rows with score < 0.5 for this slug in the last 200 evals).";

  // Count primary_issue occurrences
  const issueCounts: Record<string, number> = {};
  for (const f of failures) {
    const issue = (f.primary_issue || "").trim();
    if (issue) {
      issueCounts[issue] = (issueCounts[issue] || 0) + 1;
    }
  }

  const topIssues = Object.entries(issueCounts)
    .sort((a, b) => b[1] - a[1])
    .slice(0, 3)
    .map(([issue, count]) => `"${issue}" (${count}x)`);

  const noteSnippets = failures
    .slice(0, 3)
    .map(f => f.note)
    .filter(Boolean)
    .map(n => `- ${(n || "").slice(0, 120)}`);

  let summary = `${failures.length} failures in last 200 evals (score < 0.5).`;
  if (topIssues.length) summary += ` Top issues: ${topIssues.join(", ")}.`;
  if (noteSnippets.length) summary += `\nRecent notes:\n${noteSnippets.join("\n")}`;

  return summary;
}

// --- LLM calls ---

const PROMPT_IMPROVER_SYSTEM = `You write concise, surgical prompt fragments for snappy-os skills.

A "prompt fragment" is a short block of text (typically 3-15 lines) that gets prepended to a skill agent's system prompt to address observed failure modes. It is NOT a full system prompt rewrite — it is a targeted patch.

WHAT YOU ARE GIVEN:
- The skill's SKILL.md (its purpose and behavior spec)
- Its current prompt-fragment.md (empty string if none exists yet)
- Its recent eval failures (score < 0.5): primary_issue + notes

YOUR JOB:
Write a NEW prompt-fragment.md that either:
a) Creates a new fragment (if none exists) that directly addresses the top failure modes.
b) Patches the existing fragment to fix the gaps the failures reveal.

FORMAT RULES:
- Start with a single-line header: # <slug> — prompt fragment
- Then 2-10 bullet points or short rules. No paragraphs.
- Each bullet addresses one specific failure mode, stated as a DO or DO NOT rule.
- Be direct and prescriptive, not abstract.
- No greetings, no "This fragment...", no meta-commentary.
- Under 200 words total.
- Output ONLY the fragment text. No explanation, no fences.`;

function buildUserPrompt(slug: string, skillBody: string, agentsBody: string, oldFragment: string, failures: EvalRow[]): string {
  const failureLines = failures.slice(0, 10).map(f => {
    const parts = [`score=${f.score}`];
    if (f.primary_issue) parts.push(`issue="${f.primary_issue}"`);
    if (f.note) parts.push(`note="${(f.note || "").slice(0, 100)}"`);
    return `- ${parts.join(" ")}`;
  }).join("\n") || "- (no failures found in recent evals)";

  return `SKILL SLUG: ${slug}

SKILL.md (truncated to 2000 chars):
${skillBody.slice(0, 2000)}

AGENTS.md (first 800 chars, if any):
${agentsBody.slice(0, 800) || "(no AGENTS.md)"}

CURRENT PROMPT FRAGMENT (empty = none exists):
${oldFragment || "(none)"}

RECENT FAILURES (showing ${Math.min(failures.length, 10)} of ${failures.length}, score < 0.5):
${failureLines}

Write the new prompt-fragment.md that addresses these failures. Output ONLY the fragment text.`;
}

function normalizeFragment(output: string): string {
  const trimmed = output.trim();
  const fenceMatch = trimmed.match(/```(?:markdown|md)?\s*([\s\S]*?)```/i);
  return (fenceMatch?.[1] || trimmed).trim();
}

async function callConfiguredRuntime(slug: string, systemPrompt: string, userPrompt: string): Promise<string> {
  const axis = readDispatchConfig().subagent;
  const modelLabel = axis.model === "auto" ? readDefaultModel().slug : axis.model;
  process.stderr.write(`[prompt-improver] composing ${slug} via ${axis.backend}/${modelLabel}\n`);

  const result = await dispatchFor("subagent", {
    prompt: userPrompt,
    systemPrompt,
    cwd: REPO_ROOT,
    tools: ["read", "grep", "ls"],
    timeoutMs: 120_000,
    interviewMode: false,
  });

  if (!result.ok || !result.output.trim()) {
    const detail = result.error || result.stderr || `exit ${result.exitCode}`;
    throw new Error(`${result.provider}/${result.model} failed: ${detail}`);
  }

  return normalizeFragment(result.output);
}

// --- Main export ---

export async function generatePromptPatch(slug: string): Promise<PromptPatch> {
  const skillMd = readSkillMd(slug); // throws if not found
  const agentsMd = readAgentsMd(slug);
  const oldFragment = readPromptFragment(slug);
  const failures = readRecentFailures(slug);
  const reason = summarizeFailureMode(failures);

  // Strip frontmatter from SKILL.md to get body prose
  const skillBody = skillMd.replace(/^---\n[\s\S]+?\n---\n?/, "").trim();

  const userPrompt = buildUserPrompt(slug, skillBody, agentsMd, oldFragment, failures);

  let proposed: string;
  try {
    proposed = await callConfiguredRuntime(slug, PROMPT_IMPROVER_SYSTEM, userPrompt);
  } catch (e) {
    throw new Error(`prompt runtime failed for ${slug}: ${(e as Error).message}`);
  }

  // Basic sanity: expect a heading or bullet at the start; warn but keep the output otherwise
  if (!proposed || (!proposed.startsWith("#") && !proposed.startsWith("-") && !proposed.startsWith("*"))) {
    process.stderr.write(`[prompt-improver] LLM output for ${slug} looks off, using as-is\n`);
  }

  return { slug, old: oldFragment, proposed, reason };
}

// --- CLI ---

const isMain = (() => {
  try {
    return import.meta.url === `file://${require("fs").realpathSync(process.argv[1])}`;
  } catch {
    try {
      return import.meta.url === `file://${process.argv[1]}`;
    } catch {
      return false;
    }
  }
})();

if (isMain) {
  (async () => {
    const slug = process.argv[2];
    if (!slug) {
      console.error("Usage: npx tsx state/lib/prompt-improver.ts <slug>");
      process.exit(1);
    }
    try {
      const patch = await generatePromptPatch(slug);
      console.log("=== OLD FRAGMENT ===");
      console.log(patch.old || "(none)");
      console.log("\n=== PROPOSED ===");
      console.log(patch.proposed);
      console.log("\n=== REASON ===");
      console.log(patch.reason);
    } catch (e) {
      console.error("Error:", (e as Error).message);
      process.exit(1);
    }
  })();
}
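
The object returned by `generatePromptPatch` in the listing above is meant to feed a before/after card. The source does not specify the card layout, so the plain-text renderer below is purely illustrative; `renderCard` and its layout are assumptions.

```typescript
// Same shape as the PromptPatch interface in the listing above.
interface PromptPatch {
  slug: string;
  old: string;
  proposed: string;
  reason: string;
}

// Hypothetical renderer: the real card is a UI component, not plain text.
function renderCard(patch: PromptPatch): string {
  return [
    `Proposed prompt fragment for ${patch.slug}`,
    "",
    "Before:",
    patch.old.trim() || "(none)", // an empty string means no fragment exists yet
    "",
    "After:",
    patch.proposed.trim(),
    "",
    `Why: ${patch.reason}`,
    "",
    "[Apply] [Discard]",
  ].join("\n");
}

// Usage with a hand-written patch; no LLM call is involved here.
const card = renderCard({
  slug: "dogfood-loop",
  old: "",
  proposed: "- DO pass an explicit slug.",
  reason: "example only",
});
console.log(card);
```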

scripts - helper scripts it can run

prose-only skill - no sidecar under state/bin/ yet. Steps, if any, are described in SKILL.md.

how we check it - the checks, plus the last 3 runs

rubric: auto · no rubric declared

recent mean 0.90 · 3 runs · actor/auditor: unverifiable

deps: log

timestamp	verb	score	primary_issue	artifact
2026-05-01 06:08Z	-	0.90	-	-
2026-05-01 06:08Z	-	0.90	-	-
2026-05-01 06:08Z	-	0.90	-	-
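
The "recent mean" figure can be reproduced from the three rows above. The sketch below assumes the mean is a plain arithmetic average over the newest runs; `recentMean` and `RunRow` are hypothetical names.

```typescript
// Minimal row shape for this sketch.
interface RunRow {
  ts: string;
  score: number;
}

// Hypothetical helper: average the newest `n` scores and format the result
// like the summary line on this page.
function recentMean(rows: RunRow[], n = 3): string {
  const recent = rows.slice(-n);
  if (recent.length === 0) return "no runs";
  const mean = recent.reduce((sum, r) => sum + r.score, 0) / recent.length;
  return `${mean.toFixed(2)} · ${recent.length} runs`;
}

// The three runs from the table above.
const runs: RunRow[] = [
  { ts: "2026-05-01 06:08Z", score: 0.9 },
  { ts: "2026-05-01 06:08Z", score: 0.9 },
  { ts: "2026-05-01 06:08Z", score: 0.9 },
];
const summaryLine = recentMean(runs);
console.log(summaryLine);
```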