prompt-tuner
description: "Triggers on prompt mention of 'prompt-tuner'."
What it does for you
Reviews recent results and suggests how to make your skills sharper.
What it produces
A recent result, so you can see the kind of work it returns.
How to get it
These run inside the Snappy workspace. Want this working in your business? I set skills like this up with you, in one focused week.
For developers - how this skill is built, how it's graded, and how it runs
at a glance - the short version
what's inside - the parts that make up a skill 3/4 present
A skill is just a few plain-text files. Only the main one is required. The rest are optional, added as the work needs them. This is what the skill is made of; how it runs is just below.
state/skills/prompt-tuner/SKILL.md
present
state/lib/prompt-tuner.ts
present
state/bin/prompt-tuner/
not present
state/skills/prompt-tuner/AGENTS.md
present
how it's graded - what counts as a good run 7 criteria · 7 deterministic
Each row is one thing a good run has to get right. deterministic means a quick check decides, pass or fail. judge means the AI reads the result and rates it. Grading each piece on its own (instead of one overall score) shows exactly where a run fell short, so the fix is obvious.
how it runs - the shared frame every skill uses 5/5 present
Every skill runs the same way. One part does the work, a separate part checks it, and a short loader hands the AI exactly what it needs for the job. Anything this skill doesn't use shows a one-line note saying why, so every omission is deliberate rather than accidental.
state/log/evals.ndjson what it has learned - fixes written back in over time sample
When a run hits something this skill didn't handle, the fix gets written back into the skill so it doesn't happen again. FIXED means it was corrected on the spot. LOGGED means it's queued for a bigger rewrite. Either way, the skill gets a little better and never makes the same mistake twice.
how the work flows - who makes it, who checks it
SKILL.md - the skill, written out in plain English
prompt-tuner
Feeds evaluation signals back into the system for continuous reinforcement. Reads state/log/evals.ndjson (the result of every skill run), buckets the most recent 200 rows by skill/verb, identifies struggling and winning performers, and outputs a markdown digest with concrete suggestions for prompt updates.
The digest is consumed by follow-up agents or by Robert, who decide which prompt fragments in state/bin/head-screen/server.ts need iteration.
Steps
- Read state/log/evals.ndjson, parse NDJSON tail (most recent 200 rows).
- Extract skill (or verb) and score from each row.
- Bucket by skill, compute mean score per bucket.
- Flag struggling buckets (mean < 0.5, count >= min_bucket).
- Flag winning buckets (mean > 0.85, count >= min_bucket).
- For each bucket, count rows and extract recent primary_issue patterns.
- Write to state/log/prompt-tuner-digest.md with:
  - Summary stats (total rows scanned, buckets)
  - Struggling section with mean score, row count, and top issues
  - Winning section with mean score and row count
  - Suggestions for each struggling skill (add examples, tighten triggers, remove distractors)
  - Patterns observed in top issue keywords
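The bucketing steps above can be sketched as one pure function. This is a minimal illustration, not the skill's actual code: the names SketchRow, SketchBucket, and summarizeBuckets are hypothetical, and treating a missing score as 0 is an assumption taken from the loader's pitfalls list.

```typescript
// Hypothetical row shape; field names follow the steps above.
interface SketchRow {
  skill?: string;
  verb?: string;
  score?: number | null;
  primary_issue?: string | null;
}

interface SketchBucket {
  key: string;
  count: number;
  mean: number;
  topIssues: Array<[string, number]>;
}

// Group rows by skill (falling back to the legacy verb field), then compute
// the mean score and the three most frequent non-null issues per bucket.
function summarizeBuckets(rows: SketchRow[]): SketchBucket[] {
  const acc = new Map<string, { total: number; count: number; issues: Map<string, number> }>();
  for (const row of rows) {
    const key = row.skill ?? row.verb;
    if (!key) continue; // assumption: rows with neither field are left out
    const entry = acc.get(key) ?? { total: 0, count: 0, issues: new Map<string, number>() };
    // Assumption: a missing or null score counts as 0 but still adds to the sample size.
    entry.total += typeof row.score === "number" ? row.score : 0;
    entry.count += 1;
    if (row.primary_issue) {
      entry.issues.set(row.primary_issue, (entry.issues.get(row.primary_issue) ?? 0) + 1);
    }
    acc.set(key, entry);
  }
  return Array.from(acc, ([key, { total, count, issues }]) => ({
    key,
    count,
    mean: total / count,
    topIssues: Array.from(issues).sort((a, b) => b[1] - a[1]).slice(0, 3),
  }));
}
```

Keeping the function free of file I/O makes it easy to check against a handful of hand-written rows before pointing it at the real log.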
Shape
Output markdown at state/log/prompt-tuner-digest.md:
# prompt-tuner digest
Generated: <ISO timestamp>
## Overview
- Total rows scanned: 200
- Unique skills: 43
- Struggling buckets (mean < 0.5, n >= 5): 7
- Winning buckets (mean > 0.85, n >= 5): 12
## Struggling (mean < 0.5, n >= 5)
**skill-name** mean=0.45 n=12
- Top issues: stale-snapshot (3), missing-data (2), format-error (1)
- Suggestion: Add a concrete GOOD/BAD example to the system prompt. The model may be missing context on when to trigger this skill.
**another-skill** mean=0.38 n=8
- Top issues: hallucination (4), out-of-scope (2)
- Suggestion: Tighten the trigger keywords in the system prompt. Remove alternative paths that are confusing the model.
## Winning (mean > 0.85, n >= 5)
**best-skill** mean=0.94 n=15
- No issues observed. Keep the current prompt and examples.
**another-winner** mean=0.88 n=10
- No issues observed.
## Patterns
Most common failure modes across all buckets:
1. `stale-snapshot` (7 occurrences) — suggests caching issues or refresh-window miscalculations
2. `hallucination` (5 occurrences) — suggests vague triggers or missing guardrails
3. `format-error` (3 occurrences) — suggests output schema needs clarification
## Next Steps
For each struggling skill:
1. Read the SKILL.md and AGENTS.md to understand the intent
2. Locate the system prompt fragment in server.ts or state/lib/<skill>.ts
3. Add a concrete GOOD example of when to emit, and a BAD example of when NOT to
4. Re-run the skill and observe if mean score climbs above 0.7
5. Archive this digest as prompt-tuner-digest-YYYY-MM-DD.md and generate a fresh one
Eval
Actor: the bucketing + statistics logic. Auditor: verify output file exists, contains expected sections (Overview, Struggling, Winning, Patterns).
const source_rows = parsed_evals.length;
const has_overview = output.includes("## Overview");
const has_struggling = output.includes("## Struggling");
const has_winning = output.includes("## Winning");
const has_patterns = output.includes("## Patterns");
// Note: this counts level-2 headings in the digest, not skill buckets.
const bucket_count = (output.match(/^## [A-Z]/gm) || []).length;

score("prompt-tuner", run_id, {
  score:
    source_rows === 0 ? 0.0 :
    has_overview && has_struggling && has_winning && has_patterns ? 1.0 :
    has_overview && has_struggling && has_winning ? 0.75 :
    source_rows > 0 && bucket_count >= 3 ? 0.5 :
    0.0,
  rows_scanned: source_rows,
  sections_found: [has_overview, has_struggling, has_winning, has_patterns].filter(Boolean).length,
  bucket_count,
  primary_issue:
    source_rows === 0 ? "no-eval-rows" :
    !has_overview ? "missing-overview" :
    !has_struggling ? "missing-struggling" :
    !has_winning ? "missing-winning" :
    !has_patterns ? "missing-patterns" :
    null,
});
Digest must include Overview, Struggling, Winning, and Patterns sections. Empty eval logs score 0.0.
Known Pitfalls
- Eval rows use skill (preferred) or verb (legacy). Normalize both to the same bucket key.
- primary_issue is nullable; skip rows where it's null when counting top issues.
- Mean score edge case: a bucket with n=5 all at 0.49 is still flagged as struggling (< 0.5). That's correct - the threshold is exact.
- Archive old digests before re-running. The script will overwrite state/log/prompt-tuner-digest.md on each run.
Rubric
criteria:
  - name: output_file_generated
    kind: deterministic
    check: "The skill must generate a markdown file at state/log/prompt-tuner-digest.md."
  - name: eval_log_parsing
    kind: deterministic
    check: "The script must parse state/log/evals.ndjson correctly, extracting skill/verb and score from each row. No silent drops."
  - name: bucketing_accuracy
    kind: deterministic
    check: "Buckets must group rows by skill/verb. Mean score for each bucket must be correct (sum of scores / count)."
  - name: struggling_flagging
    kind: deterministic
    check: "A bucket is flagged as struggling if and only if mean < 0.5 and count >= min_bucket. Output must list struggling buckets with their means and row counts."
  - name: winning_flagging
    kind: deterministic
    check: "A bucket is flagged as winning if and only if mean > 0.85 and count >= min_bucket. Output must list winning buckets."
  - name: issues_extraction
    kind: deterministic
    check: "For each struggling bucket, extract and count primary_issue values. Top issues must be listed with occurrence counts."
  - name: sections_present
    kind: deterministic
    check: "Output must include all four sections: Overview, Struggling, Winning, Patterns. No sections may be empty when data exists."
AGENTS.md - what the AI loads when this skill comes up
prompt-tuner - loader
Per-turn rules for the prompt-tuner skill. Full reference: state/skills/prompt-tuner/SKILL.md. Do not skip these.
Critical Rules
- Input is append-only. Read state/log/evals.ndjson with tail semantics - take the most recent 200 rows (or window size requested). Do NOT rescan the entire file on every run; the tail prevents double-counting across weekly invocations.
- Buckets are skill/verb, not individual runs. Each row has either skill (preferred) or legacy verb field. Normalize both to the same bucket key. A single skill may have 5+ runs in the tail window; all rows for that skill go into one bucket.
- Mean score calculation must be accurate. For each bucket: mean = sum(scores) / count. Round to 2 decimals for display. Do NOT use median or mode; mean is the signal.
- Struggling threshold is exact: mean < 0.5 AND count >= min_bucket. A bucket with mean=0.49 is struggling; mean=0.50 is NOT. The min_bucket guard (default 5) prevents single-outlier buckets from dominating the report.
- Winning threshold is exact: mean > 0.85 AND count >= min_bucket. Same discipline. mean=0.85 is NOT winning; mean=0.86 is.
- primary_issue extraction is case-sensitive. Some rows have null, some have strings. Skip nulls. Count occurrences of each non-null issue string, report top 3 per bucket. Do NOT fabricate issues.
- Output file is always at state/log/prompt-tuner-digest.md. Overwrite on each run - do NOT append. The digest is meant to be fresh every week. Archive old digests manually if needed (e.g., cp prompt-tuner-digest.md prompt-tuner-digest-2026-04-30.md).
- FORBIDDEN: do not edit server.ts or state/lib files directly. The digest is the artifact; humans or follow-up agents act on the suggestions. Your job is to feed data back, not to prescribe fixes into the runtime.
- Overview section must show total rows, unique skills, bucket counts (struggling/winning). This is the first thing Robert reads to understand the scale of the analysis.
- Patterns section must synthesize failure modes across all struggling buckets. If 7 buckets mention "stale-snapshot", that's a signal worth elevating - the prompt-tuner detected a systemic failure mode that deserves prompt-level fixes.
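The exact-threshold rules above are easy to get wrong at the boundaries, so here is a small sketch of the classification. The function name classifyBucket and the "neutral" label are hypothetical; the thresholds and the default min_bucket of 5 come from the rules above.

```typescript
type BucketVerdict = "struggling" | "winning" | "neutral";

// Classify a bucket from its unrounded mean and its row count.
// Both comparisons are strict, so 0.50 and 0.85 themselves land in "neutral".
function classifyBucket(mean: number, count: number, minBucket = 5): BucketVerdict {
  if (count < minBucket) return "neutral"; // too few rows to report either way
  if (mean < 0.5) return "struggling";
  if (mean > 0.85) return "winning";
  return "neutral";
}
```

One design point worth keeping: compare the unrounded mean and round only for display. A mean of 0.496 prints as 0.50 with two decimals, yet it is still below the struggling threshold.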
Commands
| what | command |
|---|---|
| ui dashboard | state/skills/prompt-tuner/resources/ui.openui |
| run full scan + output digest | npx tsx state/lib/prompt-tuner.ts |
| run with custom window size | npx tsx state/lib/prompt-tuner.ts --window 300 |
| run with custom min-bucket threshold | npx tsx state/lib/prompt-tuner.ts --min-bucket 3 |
| both | npx tsx state/lib/prompt-tuner.ts --window 300 --min-bucket 3 |
| view digest | cat state/log/prompt-tuner-digest.md |
| archive digest with date | cp state/log/prompt-tuner-digest.md state/log/prompt-tuner-digest-$(date +%Y-%m-%d).md |
| eval log | state/log/evals.ndjson (skill: "prompt-tuner") |
| output | state/log/prompt-tuner-digest.md |
Input Format
state/log/evals.ndjson - one JSON object per line:
{"ts":"2026-04-30T00:04:05.252Z","skill":"my-skill","score":0.8,"primary_issue":null}
{"ts":"2026-04-30T00:05:35.188Z","verb":"my-verb","score":0.5,"primary_issue":"stale-snapshot"}
{"ts":"2026-04-30T00:09:09.131Z","skill":"another-skill","score":1.0,"primary_issue":null}
Fields that matter:
- ts - timestamp (for sorting, take newest 200)
- skill OR verb - bucket key (normalize both to same field)
- score - numeric, 0.0 to 1.0 (sum and divide for mean)
- primary_issue - nullable string (extract for top-3 per bucket)
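To make the input format concrete, the sketch below parses NDJSON text into rows and keeps only the newest window. It is an illustration under assumptions: ParsedEvalRow, EvalParseResult, and parseEvalTail are hypothetical names, and counting malformed lines (rather than dropping them silently) is one way to honor the rubric's "No silent drops" check.

```typescript
interface ParsedEvalRow {
  ts?: string;
  skill?: string;
  verb?: string;
  score?: number | null;
  primary_issue?: string | null;
}

interface EvalParseResult {
  rows: ParsedEvalRow[];
  malformed: number; // lines in the window that were not JSON objects
}

// Parse the last `windowSize` non-blank lines of an NDJSON string.
function parseEvalTail(ndjson: string, windowSize = 200): EvalParseResult {
  const lines = ndjson.split("\n").filter((line) => line.trim() !== "");
  const tail = windowSize > 0 ? lines.slice(-windowSize) : [];
  const rows: ParsedEvalRow[] = [];
  let malformed = 0;
  for (const line of tail) {
    try {
      const value: unknown = JSON.parse(line);
      if (typeof value === "object" && value !== null && !Array.isArray(value)) {
        rows.push(value as ParsedEvalRow);
      } else {
        malformed += 1; // valid JSON, but not an object row
      }
    } catch {
      malformed += 1; // not valid JSON
    }
  }
  return { rows, malformed };
}
```

Reporting the malformed count alongside the rows lets the digest's Overview state how many lines were skipped.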
Output Format
Markdown at state/log/prompt-tuner-digest.md:
# prompt-tuner digest
Generated: 2026-04-30T15:22:00Z
## Overview
- Total rows scanned: 200
- Unique skills: 43
- Struggling buckets (mean < 0.5, n >= 5): 7
- Winning buckets (mean > 0.85, n >= 5): 12
## Struggling (mean < 0.5, n >= 5)
**skill-name** mean=0.45 n=12
- Top issues: stale-snapshot (3), missing-data (2), format-error (1)
- Suggestion: ...
## Winning (mean > 0.85, n >= 5)
**best-skill** mean=0.94 n=15
- No issues observed.
## Patterns
Most common failure modes:
1. `stale-snapshot` (7 occurrences)
2. `hallucination` (5 occurrences)
3. ...
## Next Steps
For each struggling skill:
1. Read the SKILL.md and AGENTS.md
2. Locate the system prompt fragment
3. Add concrete GOOD/BAD examples
4. Re-run the skill and check if score improves
OpenUI Resource
- Skill-owned OpenUI Lang resource: state/skills/prompt-tuner/resources/ui.openui. Read it before rendering or editing this skill's generated component surface.
- Treat this resource as a first-class artifact of the skill, not a generic chat response. Improve it when the skill's user-facing output needs to become richer.
- System resources compose OpenUI primitives and inherit SnappyChat tokens. Use ui_contract: branded in SKILL.md only for deliberate platform or client visuals.
Known Pitfalls
- Eval log may be very large (>10MB). Use tail semantics to read only the last N lines, not the entire file. The script should stream-read or use tail -<window_size> piped to jq.
- Bucket with only nulls in primary_issue. That's valid (no issues for those runs). The bucket still counts as struggling or winning if the mean satisfies the threshold; it just has an empty "Top issues" line.
- Score type inconsistency. Some rows may have score: null or missing score field. Treat missing/null score as 0.0 for mean calculation, not as "skip this row" - it still counts toward the bucket's sample size.
- Archive old digests before re-running. The script unconditionally overwrites the output file. If Robert wants to compare week-over-week, he must cp the old file first.
- Normalize skill vs verb. Use whichever is present; if both exist, prefer skill. Create a single merged bucket, not two separate ones.
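For the large-log pitfall, one possible way to read only the tail is to scan the file backwards in fixed-size chunks. The sketch below is an assumption about how that could look, not the script's current behavior; tailLines and demoTail are hypothetical names, and the chunk size is arbitrary.

```typescript
import { closeSync, fstatSync, mkdtempSync, openSync, readSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Return up to `lineCount` of the last non-blank lines of a file,
// reading backwards from the end instead of loading the whole file.
function tailLines(filePath: string, lineCount: number, chunkSize = 64 * 1024): string[] {
  if (lineCount <= 0) return [];
  const fd = openSync(filePath, "r");
  try {
    let position = fstatSync(fd).size;
    const chunks: Buffer[] = [];
    let newlines = 0;
    // Keep reading until more newlines than requested lines have been seen,
    // which guarantees the requested lines are complete, or until the file start.
    while (position > 0 && newlines <= lineCount) {
      const length = Math.min(chunkSize, position);
      position -= length;
      const chunk = Buffer.alloc(length);
      readSync(fd, chunk, 0, length, position);
      for (let i = 0; i < length; i++) {
        if (chunk[i] === 0x0a) newlines += 1;
      }
      chunks.unshift(chunk);
    }
    let segments = Buffer.concat(chunks).toString("utf8").split("\n");
    // If the scan stopped mid-file, the first segment may be a partial line.
    if (position > 0) segments = segments.slice(1);
    return segments.filter((line) => line.trim() !== "").slice(-lineCount);
  } finally {
    closeSync(fd);
  }
}

// Usage example: write ten rows to a temp file and read back the last three.
function demoTail(): string[] {
  const dir = mkdtempSync(join(tmpdir(), "tail-demo-"));
  const file = join(dir, "evals.ndjson");
  const body = Array.from({ length: 10 }, (_, i) =>
    JSON.stringify({ skill: "demo", score: i / 10 })
  ).join("\n") + "\n";
  writeFileSync(file, body, "utf8");
  return tailLines(file, 3, 16); // tiny chunk size to exercise the backward scan
}
```

If the log contains blank lines, this sketch may return fewer than the requested number of lines, but it never returns a truncated one.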
Self-Test
An agent reading this should correctly:
- [ ] Understand tail semantics: last 200 rows (or window size), not entire file?
- [ ] Know bucket key normalization: skill OR verb, merged into one key per skill?
- [ ] Calculate mean accurately: sum(scores) / count, not median or mode?
- [ ] Apply thresholds exactly: < 0.5 for struggling, > 0.85 for winning, AND count >= min_bucket?
- [ ] Extract primary_issue correctly: skip nulls, count top 3 per bucket?
- [ ] Write only to output file, never modify server.ts or lib files?
- [ ] Include all four sections: Overview, Struggling, Winning, Patterns?
- [ ] Use tail-read or streaming to handle large eval logs?
Self-report
If this loader fell short, append a line:
echo "[$(date -u +%FT%TZ)] prompt-tuner: <what was missing or fixed> [FIXED|LOGGED] action_kind=<kind>" >> state/log/loader-feedback.log
- FIXED = you patched this loader inline (P-fix).
- LOGGED = too large for inline; the PostToolUse enqueue + Stop-hook drain will rewrite the loader from scratch on next session-end.
- action_kind: shape-ok | skill-ran | loader-rewritten | pattern-elevated
Do not skip this. Every agent run must leave the system better than it found it.
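For agents that build the feedback line in code rather than with echo, the format above could be produced as follows. This is a sketch only: formatLoaderFeedback and the two type names are hypothetical, and collapsing the note to a single line is an assumption made to keep the log at one entry per line.

```typescript
type FeedbackStatus = "FIXED" | "LOGGED";
type ActionKind = "shape-ok" | "skill-ran" | "loader-rewritten" | "pattern-elevated";

// Build one loader-feedback line in the same shape as the echo command above.
function formatLoaderFeedback(
  when: Date,
  note: string,
  status: FeedbackStatus,
  actionKind: ActionKind
): string {
  // toISOString() includes milliseconds; drop them to match `date -u +%FT%TZ`.
  const stamp = when.toISOString().replace(/\.\d{3}Z$/, "Z");
  // Assumption: collapse whitespace so the entry stays on a single line.
  const oneLine = note.replace(/\s+/g, " ").trim();
  return `[${stamp}] prompt-tuner: ${oneLine} [${status}] action_kind=${actionKind}`;
}
```

The returned string would then be appended to state/log/loader-feedback.log with a trailing newline.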
api.ts - the code it can call
#!/usr/bin/env npx tsx
/**
* snappy-prompt-tuner/api.ts -- Weekly reinforcement signal from eval rows.
*
* Reads state/log/evals.ndjson (tail), buckets by skill/verb, identifies
* struggling (<0.5) and winning (>0.85) performers, outputs markdown digest
* with concrete prompt-update suggestions.
*
* Usage:
* npx tsx state/lib/prompt-tuner.ts
* npx tsx state/lib/prompt-tuner.ts --window 300
* npx tsx state/lib/prompt-tuner.ts --min-bucket 3
* npx tsx state/lib/prompt-tuner.ts --window 300 --min-bucket 3
*/
import { readFileSync, writeFileSync, existsSync, realpathSync } from "fs";
import { join } from "path";
interface EvalRow {
  ts: string;
  skill?: string;
  verb?: string;
  score?: number;
  primary_issue?: string | null;
  [key: string]: unknown;
}

interface BucketStats {
  skill: string;
  count: number;
  scores: number[];
  mean: number;
  issues: Record<string, number>;
  topIssues: Array<[string, number]>;
}
function parseArgs(args: string[]): Record<string, string | number> {
  const result: Record<string, string | number> = {};
  for (let i = 0; i < args.length; i++) {
    if (args[i].startsWith("--")) {
      const key = args[i].substring(2);
      const val = args[i + 1];
      if (val && !val.startsWith("--")) {
        result[key] = isNaN(Number(val)) ? val : Number(val);
        i++;
      }
    }
  }
  return result;
}
// Note: this reads the whole file into memory and then keeps the last
// lineCount lines; malformed lines are skipped without being counted.
function tailReadNDJSON(filePath: string, lineCount: number): EvalRow[] {
  try {
    const content = readFileSync(filePath, "utf8");
    const lines = content.trim().split("\n");
    const start = Math.max(0, lines.length - lineCount);
    const tail = lines.slice(start);
    return tail
      .filter((line) => line.trim())
      .map((line) => {
        try {
          return JSON.parse(line);
        } catch {
          return null;
        }
      })
      .filter((row): row is EvalRow => row !== null);
  } catch {
    console.error(`Failed to read ${filePath}`);
    return [];
  }
}
const REPO_ROOT = (() => {
  // Walk up from this file until we find state/skills
  let dir = new URL(import.meta.url).pathname;
  for (let i = 0; i < 10; i++) {
    dir = join(dir, "..");
    if (existsSync(join(dir, "state", "skills"))) return dir;
  }
  return process.cwd();
})();
function canonicalizeSkillSlug(raw: string): string | null {
  if (!raw || raw === "unknown") return null;
  const skillsDir = join(REPO_ROOT, "state", "skills");
  // 1. If the raw value is already a real skill folder, use it as-is
  if (existsSync(join(skillsDir, raw, "SKILL.md"))) {
    return raw;
  }
  // 2. Strip "agent-" prefix (e.g. "agent-brain-digest" → "brain-digest").
  //    Check for a match here, before step 3 can strip a trailing segment
  //    such as "-digest" that only looks like a session ID.
  let slug = raw.replace(/^agent-/, "");
  if (existsSync(join(skillsDir, slug, "SKILL.md"))) {
    return slug;
  }
  // 3. Strip session-ID suffix like "-i0vq4u" (5+ lowercase alphanumeric after a trailing dash)
  slug = slug.replace(/-[a-z0-9]{5,}$/, "");
  if (existsSync(join(skillsDir, slug, "SKILL.md"))) {
    return slug;
  }
  // Phantom: no matching skill folder; drop from digest
  return null;
}
function computeStats(rows: EvalRow[]): Map<string, BucketStats> {
  const buckets = new Map<string, BucketStats>();
  for (const row of rows) {
    const rawKey = (row.skill || row.verb || "unknown") as string;
    const key = canonicalizeSkillSlug(rawKey);
    if (!key) continue; // Drop phantom rows
    const score = typeof row.score === "number" ? row.score : 0;
    const issue = row.primary_issue || null;
    if (!buckets.has(key)) {
      buckets.set(key, {
        skill: key,
        count: 0,
        scores: [],
        mean: 0,
        issues: {},
        topIssues: [],
      });
    }
    const bucket = buckets.get(key)!;
    bucket.scores.push(score);
    bucket.count++;
    if (issue && typeof issue === "string") {
      bucket.issues[issue] = (bucket.issues[issue] || 0) + 1;
    }
  }
  // Compute means and top issues
  for (const bucket of buckets.values()) {
    const sum = bucket.scores.reduce((a, b) => a + b, 0);
    bucket.mean = bucket.count > 0 ? sum / bucket.count : 0;
    bucket.topIssues = Object.entries(bucket.issues)
      .sort((a, b) => b[1] - a[1])
      .slice(0, 3);
  }
  return buckets;
}
function countPatterns(buckets: Map<string, BucketStats>): Array<[string, number]> {
  const allIssues = new Map<string, number>();
  for (const bucket of buckets.values()) {
    for (const [issue, count] of Object.entries(bucket.issues)) {
      allIssues.set(issue, (allIssues.get(issue) || 0) + count);
    }
  }
  return Array.from(allIssues.entries())
    .sort((a, b) => b[1] - a[1])
    .slice(0, 5);
}
function generateDigest(
  rows: EvalRow[],
  buckets: Map<string, BucketStats>,
  minBucket: number
): string {
  const now = new Date().toISOString();
  const struggling = Array.from(buckets.values()).filter(
    (b) => b.mean < 0.5 && b.count >= minBucket
  );
  const winning = Array.from(buckets.values()).filter(
    (b) => b.mean > 0.85 && b.count >= minBucket
  );
  const patterns = countPatterns(buckets);

  let md = `# prompt-tuner digest\n\n`;
  md += `Generated: ${now}\n\n`;
  md += `## Overview\n\n`;
  md += `- Total rows scanned: ${rows.length}\n`;
  md += `- Unique skills: ${buckets.size}\n`;
  md += `- Struggling buckets (mean < 0.5, n >= ${minBucket}): ${struggling.length}\n`;
  md += `- Winning buckets (mean > 0.85, n >= ${minBucket}): ${winning.length}\n\n`;

  // Struggling section
  if (struggling.length > 0) {
    md += `## Struggling (mean < 0.5, n >= ${minBucket})\n\n`;
    for (const bucket of struggling.sort((a, b) => a.mean - b.mean)) {
      md += `**${bucket.skill}** mean=${bucket.mean.toFixed(2)} n=${bucket.count}\n`;
      if (bucket.topIssues.length > 0) {
        const issues = bucket.topIssues
          .map(([issue, count]) => `${issue} (${count})`)
          .join(", ");
        md += `- Top issues: ${issues}\n`;
      } else {
        md += `- Top issues: none recorded\n`;
      }
      md += `- Suggestion: Review the system prompt for this skill. Consider adding a concrete GOOD example of when to trigger, and a BAD example of when NOT to. Tighten trigger keywords if the model is confusing this with other skills.\n\n`;
    }
  } else {
    md += `## Struggling (mean < 0.5, n >= ${minBucket})\n\n`;
    md += `No struggling buckets found.\n\n`;
  }

  // Winning section
  if (winning.length > 0) {
    md += `## Winning (mean > 0.85, n >= ${minBucket})\n\n`;
    for (const bucket of winning.sort((a, b) => b.mean - a.mean)) {
      md += `**${bucket.skill}** mean=${bucket.mean.toFixed(2)} n=${bucket.count}\n`;
      md += `- No issues observed. Keep the current prompt and examples.\n\n`;
    }
  } else {
    md += `## Winning (mean > 0.85, n >= ${minBucket})\n\n`;
    md += `No winning buckets found.\n\n`;
  }

  // Patterns section
  md += `## Patterns\n\n`;
  if (patterns.length > 0) {
    md += `Most common failure modes across all buckets:\n`;
    for (let i = 0; i < patterns.length; i++) {
      const [issue, count] = patterns[i];
      md += `${i + 1}. \`${issue}\` (${count} occurrences)\n`;
    }
    md += `\nThese patterns suggest systemic issues worth addressing at the prompt level.\n\n`;
  } else {
    md += `No failure modes recorded. All observed runs passed their evaluation gates.\n\n`;
  }

  // Next steps
  md += `## Next Steps\n\n`;
  md += `For each struggling skill:\n`;
  md += `1. Read the SKILL.md and AGENTS.md to understand the intent\n`;
  md += `2. Locate the system prompt fragment in server.ts or state/lib/<skill>.ts\n`;
  md += `3. Add a concrete GOOD example of when to emit, and a BAD example of when NOT to\n`;
  md += `4. Re-run the skill and observe if mean score climbs above 0.7\n`;
  md += `5. Archive this digest as prompt-tuner-digest-YYYY-MM-DD.md and generate a fresh one\n\n`;
  return md;
}
// --- CLI ---
if (
  (() => {
    try {
      return import.meta.url === `file://${realpathSync(process.argv[1])}`;
    } catch {
      return false;
    }
  })()
) {
  (async () => {
    const args = parseArgs(process.argv.slice(2));
    // Fall back to the defaults unless the flag parsed as a positive number.
    const windowArg = args.window;
    const minBucketArg = args["min-bucket"];
    const window = typeof windowArg === "number" && windowArg > 0 ? windowArg : 200;
    const minBucket = typeof minBucketArg === "number" && minBucketArg > 0 ? minBucketArg : 5;
    const evalsPath = "state/log/evals.ndjson";
    const outputPath = "state/log/prompt-tuner-digest.md";

    console.log(`Reading ${evalsPath}, last ${window} rows...`);
    const rows = tailReadNDJSON(evalsPath, window);
    console.log(`Parsed ${rows.length} eval rows.`);
    if (rows.length === 0) {
      console.error("No eval rows found. Exiting.");
      process.exit(0);
    }

    const buckets = computeStats(rows);
    console.log(`Bucketed into ${buckets.size} unique skills.`);
    const digest = generateDigest(rows, buckets, minBucket);
    writeFileSync(outputPath, digest, "utf8");
    console.log(`Digest written to ${outputPath}`);

    // Count struggling/winning for eval
    const struggling = Array.from(buckets.values()).filter(
      (b) => b.mean < 0.5 && b.count >= minBucket
    ).length;
    const winning = Array.from(buckets.values()).filter(
      (b) => b.mean > 0.85 && b.count >= minBucket
    ).length;
    console.log(
      `\nSummary: ${rows.length} rows, ${buckets.size} buckets, ${struggling} struggling, ${winning} winning`
    );
  })();
}
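The slug canonicalization in the script depends on the filesystem, which makes it awkward to try out. The sketch below restates the same idea with the folder lookup passed in as a predicate. The names canonicalizeSlug and isSkill are hypothetical, and this version checks for a match after each normalization step, so a name whose last segment merely looks like a session ID is not truncated.

```typescript
// Map a raw skill/verb value onto a known skill slug, or null if none matches.
// `isSkill` stands in for "state/skills/<slug>/SKILL.md exists".
function canonicalizeSlug(raw: string, isSkill: (slug: string) => boolean): string | null {
  if (!raw || raw === "unknown") return null;
  if (isSkill(raw)) return raw; // already a real skill name

  // Strip a leading "agent-" and check again before touching the suffix.
  const withoutPrefix = raw.replace(/^agent-/, "");
  if (isSkill(withoutPrefix)) return withoutPrefix;

  // Strip a trailing session-ID-like segment (5+ lowercase alphanumerics).
  const withoutSuffix = withoutPrefix.replace(/-[a-z0-9]{5,}$/, "");
  return isSkill(withoutSuffix) ? withoutSuffix : null;
}
```

With a known set containing brain-digest, both agent-brain-digest and agent-brain-digest-i0vq4u resolve to brain-digest, while a name with no matching folder resolves to null and is dropped from the digest.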
scripts - helper scripts it can run
prose-only skill - 3 inline code blocks live in SKILL.md above (no state/bin/ sidecar yet).
how we check it - the checks, plus the last 3 runs
| timestamp | verb | score | primary_issue | artifact |
|---|---|---|---|---|
| 2026-05-01 05:28Z | - | 1.00 | - | - |
| 2026-05-01 05:28Z | - | 1.00 | - | - |
| 2026-05-01 05:28Z | - | 1.00 | - | - |