
prompt-tuner

Reviews recent results and suggests how to make your skills sharper.
description: "Triggers on prompt mention of 'prompt-tuner'."
personal 2 files 3 recent evals

What it does for you

Reviews recent results and suggests how to make your skills sharper.

What it produces

A recent result, so you can see the kind of work it returns.


How to get it

These run inside the Snappy workspace. Want this working in your business? I set skills like this up with you, in one focused week.

Work with me
For developers - how this skill is built, graded, and how it runs

at a glance - the short version

actor Bucketing + statistics logic.
auditor Verify
eval mode auto
category Ops
depends log

what's inside - the parts that make up a skill 3/4 present

A skill is just a few plain-text files. Only the main one is required. The rest are optional, added as the work needs them. This is what the skill is made of; how it runs is just below.

The skill
state/skills/prompt-tuner/SKILL.md present
the skill itself, in plain text
The main file. It says what the skill is and lays out the steps in plain English.
Code
state/lib/prompt-tuner.ts present
code the skill can run
Reusable code this skill can call when it needs to.
Scripts
state/bin/prompt-tuner/ not present
helper scripts
Optional. Added when a skill has a few commands to run.
Loader
state/skills/prompt-tuner/AGENTS.md present
what the AI loads on the fly
Loaded automatically the moment this skill is needed. Kept short on purpose.

how it's graded - what counts as a good run 7 criteria · 7 deterministic

Each row is one thing a good run has to get right. deterministic means a quick check decides, pass or fail. judge means the AI reads the result and rates it. Grading each piece on its own (instead of one overall score) shows exactly where a run fell short, so the fix is obvious.

| name | kind | check |
| --- | --- | --- |
| output_file_generated | deterministic | The skill must generate a markdown file at state/log/prompt-tuner-digest.md. |
| eval_log_parsing | deterministic | The script must parse state/log/evals.ndjson correctly, extracting skill/verb and score from each row. No silent drops. |
| bucketing_accuracy | deterministic | Buckets must group rows by skill/verb. Mean score for each bucket must be correct (sum of scores / count). |
| struggling_flagging | deterministic | A bucket is flagged as struggling if and only if mean < 0.5 and count >= min_bucket. Output must list struggling buckets with their means and row counts. |
| winning_flagging | deterministic | A bucket is flagged as winning if and only if mean > 0.85 and count >= min_bucket. Output must list winning buckets. |
| issues_extraction | deterministic | For each struggling bucket, extract and count primary_issue values. Top issues must be listed with occurrence counts. |
| sections_present | deterministic | Output must include all four sections: Overview, Struggling, Winning, Patterns. No sections may be empty when data exists. |

how it runs - the shared frame every skill uses 5/5 present

Every skill runs the same way. One part does the work, a separate part checks it, and a short loader hands the AI exactly what it needs for the job. Anything this skill doesn't use shows a one-line note saying why, on purpose, not by accident.

makes the work The worker
present
Bucketing + statistics logic. the worker
Does the actual work. Whatever it produces is what gets checked next.
checks the work The reviewer
present
Verify the checker
A separate checker grades the work, so the part that made it can't approve its own work.
frame
learns Self-correction
present
fixes itself learns from gaps
When a run hits a gap, the skill gets edited on the spot [FIXED] or queued for a bigger rewrite [LOGGED], so it keeps getting better.
tidies up Background fixes
present
queued for rewrite runs in the background
Bigger fixes that can't be made on the spot get queued and rewritten in the background later.
remembers Run history
present
state/log/evals.ndjson auto runs
Every run is written down here, so the next time this skill is used it already knows how the last runs went.
Critical rules the things this skill must not get wrong
No must-not-break rules called out for this skill. Anything important lives in the writeup below.

what it has learned - fixes written back in over time sample

When a run hits something this skill didn't handle, the fix gets written back into the skill so it doesn't happen again. FIXED means it was corrected on the spot. LOGGED means it's queued for a bigger rewrite. Either way, the skill gets a little better and never makes the same mistake twice.


how the work flows - who makes it, who checks it

inputs log
actor Bucketing + statistics logic.
auditor Verify

SKILL.md - the skill, written out in plain English

prompt-tuner

Feeds evaluation signals back into the system for continuous reinforcement. Reads state/log/evals.ndjson (the result of every skill run), buckets the most recent 200 rows by skill/verb, identifies struggling and winning performers, and writes a markdown digest with concrete suggestions for prompt updates.

The digest is consumed by follow-up agents or by Robert, who decide which prompt fragments in state/bin/head-screen/server.ts need iteration.

Steps

  1. Read state/log/evals.ndjson, parse NDJSON tail (most recent 200 rows).
  2. Extract skill (or verb) and score from each row.
  3. Bucket by skill, compute mean score per bucket.
  4. Flag struggling buckets (mean < 0.5, count >= min_bucket).
  5. Flag winning buckets (mean > 0.85, count >= min_bucket).
  6. For each bucket, count rows and extract recent primary_issue patterns.
  7. Write to state/log/prompt-tuner-digest.md with:
  • Summary stats (total rows scanned, buckets)
  • Struggling section with mean score, row count, and top issues
  • Winning section with mean score and row count
  • Suggestions for each struggling skill (add examples, tighten triggers, remove distractors)
  • Patterns observed in top issue keywords

Shape

Output markdown at state/log/prompt-tuner-digest.md:

# prompt-tuner digest

Generated: <ISO timestamp>

## Overview
- Total rows scanned: 200
- Unique skills: 43
- Struggling buckets (mean < 0.5, n >= 5): 7
- Winning buckets (mean > 0.85, n >= 5): 12

## Struggling (mean < 0.5, n >= 5)

**skill-name** mean=0.45 n=12
- Top issues: stale-snapshot (3), missing-data (2), format-error (1)
- Suggestion: Add a concrete GOOD/BAD example to the system prompt. The model may be missing context on when to trigger this skill.

**another-skill** mean=0.38 n=8
- Top issues: hallucination (4), out-of-scope (2)
- Suggestion: Tighten the trigger keywords in the system prompt. Remove alternative paths that are confusing the model.

## Winning (mean > 0.85, n >= 5)

**best-skill** mean=0.94 n=15
- No issues observed. Keep the current prompt and examples.

**another-winner** mean=0.88 n=10
- No issues observed.

## Patterns

Most common failure modes across all buckets:
1. `stale-snapshot` (7 occurrences) — suggests caching issues or refresh-window miscalculations
2. `hallucination` (5 occurrences) — suggests vague triggers or missing guardrails
3. `format-error` (3 occurrences) — suggests output schema needs clarification

## Next Steps

For each struggling skill:
1. Read the SKILL.md and AGENTS.md to understand the intent
2. Locate the system prompt fragment in server.ts or state/lib/<skill>.ts
3. Add a concrete GOOD example of when to emit, and a BAD example of when NOT to
4. Re-run the skill and observe if mean score climbs above 0.7
5. Archive this digest as prompt-tuner-digest-YYYY-MM-DD.md and generate a fresh one

Eval

Actor: the bucketing + statistics logic. Auditor: verify output file exists, contains expected sections (Overview, Struggling, Winning, Patterns).

const source_rows = parsed_evals.length;
const has_overview = output.includes("## Overview");
const has_struggling = output.includes("## Struggling");
const has_winning = output.includes("## Winning");
const has_patterns = output.includes("## Patterns");
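// Despite the name, this counts top-level "## " headings in the digest (sections), not skill buckets.
const bucket_count = (output.match(/^## [A-Z]/gm) || []).length;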

score("prompt-tuner", run_id, {
  score:
    source_rows === 0 ? 0.0 :
    has_overview && has_struggling && has_winning && has_patterns ? 1.0 :
    has_overview && has_struggling && has_winning ? 0.75 :
    source_rows > 0 && bucket_count >= 3 ? 0.5 :
    0.0,
  rows_scanned: source_rows,
  sections_found: [has_overview, has_struggling, has_winning, has_patterns].filter(Boolean).length,
  bucket_count,
  primary_issue:
    source_rows === 0 ? "no-eval-rows" :
    !has_overview ? "missing-overview" :
    !has_struggling ? "missing-struggling" :
    !has_winning ? "missing-winning" :
    !has_patterns ? "missing-patterns" :
    null,
});

Digest must include Overview, Struggling, Winning, and Patterns sections. Empty eval logs score 0.0.
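
The scoring ladder can also be read as a pure function of the digest text, which makes it easy to try against sample digests. The sketch below is an assumption-laden restatement, not the auditor itself: `gradeDigest` and `REQUIRED` are hypothetical names, and the real `score(...)` call records more fields than shown here.

```typescript
// Required section names, in the order the ladder checks them.
const REQUIRED = ["Overview", "Struggling", "Winning", "Patterns"] as const;

interface Grade {
  score: number;
  primaryIssue: string | null;
}

// Hypothetical helper mirroring the ladder: 1.0 with all four sections,
// 0.75 when only Patterns is missing, 0.5 when at least three "## " headings
// exist, otherwise 0.0. An empty eval log always scores 0.0.
function gradeDigest(output: string, sourceRows: number): Grade {
  if (sourceRows === 0) return { score: 0, primaryIssue: "no-eval-rows" };

  const missing = REQUIRED.filter((name) => !output.includes(`## ${name}`));
  const headings = (output.match(/^## [A-Z]/gm) || []).length;

  let score = 0;
  if (missing.length === 0) score = 1.0;
  else if (missing.length === 1 && missing[0] === "Patterns") score = 0.75;
  else if (headings >= 3) score = 0.5;

  const first = missing[0];
  return { score, primaryIssue: first ? `missing-${first.toLowerCase()}` : null };
}
```

Only the first missing section is reported as the primary issue, in the same order as the original checks.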

Known Pitfalls

  • Eval rows use skill (preferred) or verb (legacy). Normalize both to the same bucket key.
  • primary_issue is nullable; skip rows where it's null when counting top issues.
  • Mean score edge case: a bucket with n=5 all at 0.49 is still flagged as struggling (< 0.5). That's correct - the threshold is exact.
  • Archive old digests before re-running. The script will overwrite state/log/prompt-tuner-digest.md on each run.
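
The first two pitfalls above can be illustrated with a short sketch. The helper names `bucketKey` and `topIssues` are hypothetical and used only for this example; the counting is case-sensitive and ignores null or missing issues.

```typescript
// Hypothetical row shape for illustration.
interface IssueRow {
  skill?: string;
  verb?: string;
  primary_issue?: string | null;
}

// Prefer `skill`, fall back to the legacy `verb`, so both land in one bucket.
function bucketKey(row: IssueRow): string | null {
  return row.skill || row.verb || null;
}

// Count non-null issue strings and return the most frequent ones.
function topIssues(
  issues: Array<string | null | undefined>,
  limit = 3
): Array<[string, number]> {
  const counts = new Map<string, number>();
  for (const issue of issues) {
    if (typeof issue !== "string" || issue === "") continue; // skip nulls
    counts.set(issue, (counts.get(issue) ?? 0) + 1);
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]).slice(0, limit);
}
```

For example, `topIssues(["stale-snapshot", null, "stale-snapshot", "format-error"])` yields `stale-snapshot` with a count of 2 followed by `format-error` with a count of 1.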

Rubric

criteria:
  - name: output_file_generated
    kind: deterministic
    check: "The skill must generate a markdown file at state/log/prompt-tuner-digest.md."
  - name: eval_log_parsing
    kind: deterministic
    check: "The script must parse state/log/evals.ndjson correctly, extracting skill/verb and score from each row. No silent drops."
  - name: bucketing_accuracy
    kind: deterministic
    check: "Buckets must group rows by skill/verb. Mean score for each bucket must be correct (sum of scores / count)."
  - name: struggling_flagging
    kind: deterministic
    check: "A bucket is flagged as struggling if and only if mean < 0.5 and count >= min_bucket. Output must list struggling buckets with their means and row counts."
  - name: winning_flagging
    kind: deterministic
    check: "A bucket is flagged as winning if and only if mean > 0.85 and count >= min_bucket. Output must list winning buckets."
  - name: issues_extraction
    kind: deterministic
    check: "For each struggling bucket, extract and count primary_issue values. Top issues must be listed with occurrence counts."
  - name: sections_present
    kind: deterministic
    check: "Output must include all four sections: Overview, Struggling, Winning, Patterns. No sections may be empty when data exists."

AGENTS.md - what the AI loads when this skill comes up

prompt-tuner - loader

Per-turn rules for the prompt-tuner skill. Full reference: state/skills/prompt-tuner/SKILL.md. Do not skip these.

Critical Rules

  1. Input is append-only. Read state/log/evals.ndjson with tail semantics - take the most recent 200 rows (or window size requested). Do NOT rescan the entire file on every run; the tail prevents double-counting across weekly invocations.
  2. Buckets are skill/verb, not individual runs. Each row has either skill (preferred) or legacy verb field. Normalize both to the same bucket key. A single skill may have 5+ runs in the tail window; all rows for that skill go into one bucket.
  3. Mean score calculation must be accurate. For each bucket: mean = sum(scores) / count. Round to 2 decimals for display. Do NOT use median or mode; mean is the signal.
  4. Struggling threshold is exact: mean < 0.5 AND count >= min_bucket. A bucket with mean=0.49 is struggling; mean=0.50 is NOT. The min_bucket guard (default 5) prevents single-outlier buckets from dominating the report.
  5. Winning threshold is exact: mean > 0.85 AND count >= min_bucket. Same discipline. mean=0.85 is NOT winning; mean=0.86 is.
  6. primary_issue extraction is case-sensitive. Some rows have null, some have strings. Skip nulls. Count occurrences of each non-null issue string, report top 3 per bucket. Do NOT fabricate issues.
  7. Output file is always at state/log/prompt-tuner-digest.md. Overwrite on each run - do NOT append. The digest is meant to be fresh every week. Archive old digests manually if needed (e.g., cp prompt-tuner-digest.md prompt-tuner-digest-2026-04-30.md).
  8. FORBIDDEN: do not edit server.ts or state/lib files directly. The digest is the artifact; humans or follow-up agents act on the suggestions. Your job is to feed data back, not to prescribe fixes into the runtime.
  9. Overview section must show total rows, unique skills, bucket counts (struggling/winning). This is the first thing Robert reads to understand the scale of the analysis.
  10. Patterns section must synthesize failure modes across all struggling buckets. If 7 buckets mention "stale-snapshot", that's a signal worth elevating - the prompt-tuner detected a systemic failure mode that deserves prompt-level fixes.
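
The threshold rules can be checked with a tiny classifier. This is a sketch only; `classify` and `Status` are hypothetical names. It assumes the comparison runs on the unrounded mean and that rounding to 2 decimals happens only at display time, so a mean of 0.496 is struggling even though it prints as 0.50.

```typescript
type Status = "struggling" | "winning" | "neutral";

// Strict inequalities on the raw mean; the min_bucket guard is applied first.
function classify(mean: number, count: number, minBucket = 5): Status {
  if (count < minBucket) return "neutral"; // too few rows to judge
  if (mean < 0.5) return "struggling";
  if (mean > 0.85) return "winning";
  return "neutral";
}

// Boundary cases from the rules above.
const boundaryCases: Array<[number, number, Status]> = [
  [0.49, 5, "struggling"],
  [0.5, 5, "neutral"],
  [0.85, 5, "neutral"],
  [0.86, 5, "winning"],
  [0.1, 4, "neutral"], // below min_bucket, so never flagged
];
```

Each entry in `boundaryCases` lists a mean, a row count, and the status `classify` is expected to return for them.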

Commands

| what | command |
| --- | --- |
| ui dashboard | state/skills/prompt-tuner/resources/ui.openui |
| run full scan + output digest | npx tsx state/lib/prompt-tuner.ts |
| run with custom window size | npx tsx state/lib/prompt-tuner.ts --window 300 |
| run with custom min-bucket threshold | npx tsx state/lib/prompt-tuner.ts --min-bucket 3 |
| both | npx tsx state/lib/prompt-tuner.ts --window 300 --min-bucket 3 |
| view digest | cat state/log/prompt-tuner-digest.md |
| archive digest with date | cp state/log/prompt-tuner-digest.md state/log/prompt-tuner-digest-$(date +%Y-%m-%d).md |
| eval log | state/log/evals.ndjson (skill: "prompt-tuner") |
| output | state/log/prompt-tuner-digest.md |

Input Format

state/log/evals.ndjson - one JSON object per line:

{"ts":"2026-04-30T00:04:05.252Z","skill":"my-skill","score":0.8,"primary_issue":null}
{"ts":"2026-04-30T00:05:35.188Z","verb":"my-verb","score":0.5,"primary_issue":"stale-snapshot"}
{"ts":"2026-04-30T00:09:09.131Z","skill":"another-skill","score":1.0,"primary_issue":null}

Fields that matter:

  • ts - timestamp (for sorting, take newest 200)
  • skill OR verb - bucket key (normalize both to same field)
  • score - numeric, 0.0 to 1.0 (sum and divide for mean)
  • primary_issue - nullable string (extract for top-3 per bucket)
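
A minimal sketch of reading this format follows. The names `LogRow`, `parseNdjson`, and `newest` are hypothetical; the sketch assumes `ts` is always an ISO 8601 UTC string in the same layout as the samples, so plain string comparison orders rows by time. Rejected lines are reported by line number rather than dropped silently.

```typescript
interface LogRow {
  ts: string;
  skill?: string;
  verb?: string;
  score?: number | null;
  primary_issue?: string | null;
}

// Parse NDJSON text; collect 1-based line numbers of lines that are not JSON objects.
function parseNdjson(text: string): { rows: LogRow[]; rejected: number[] } {
  const rows: LogRow[] = [];
  const rejected: number[] = [];
  text.split(/\r?\n/).forEach((line, index) => {
    if (line.trim() === "") return; // blank lines are not rows
    try {
      const value: unknown = JSON.parse(line);
      if (value !== null && typeof value === "object" && !Array.isArray(value)) {
        rows.push(value as LogRow);
      } else {
        rejected.push(index + 1);
      }
    } catch {
      rejected.push(index + 1);
    }
  });
  return { rows, rejected };
}

// Keep the `n` newest rows by timestamp (ISO strings compare lexicographically).
function newest(rows: LogRow[], n: number): LogRow[] {
  if (n <= 0) return [];
  const sorted = [...rows].sort((a, b) => (a.ts < b.ts ? -1 : a.ts > b.ts ? 1 : 0));
  return sorted.slice(-n);
}
```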

Output Format

Markdown at state/log/prompt-tuner-digest.md:

# prompt-tuner digest

Generated: 2026-04-30T15:22:00Z

## Overview
- Total rows scanned: 200
- Unique skills: 43
- Struggling buckets (mean < 0.5, n >= 5): 7
- Winning buckets (mean > 0.85, n >= 5): 12

## Struggling (mean < 0.5, n >= 5)

**skill-name** mean=0.45 n=12
- Top issues: stale-snapshot (3), missing-data (2), format-error (1)
- Suggestion: ...

## Winning (mean > 0.85, n >= 5)

**best-skill** mean=0.94 n=15
- No issues observed.

## Patterns

Most common failure modes:
1. `stale-snapshot` (7 occurrences)
2. `hallucination` (5 occurrences)
3. ...

## Next Steps

For each struggling skill:
1. Read the SKILL.md and AGENTS.md
2. Locate the system prompt fragment
3. Add concrete GOOD/BAD examples
4. Re-run the skill and check if score improves

OpenUI Resource

  • Skill-owned OpenUI Lang resource: state/skills/prompt-tuner/resources/ui.openui. Read it before rendering or editing this skill's generated component surface.
  • Treat this resource as a first-class artifact of the skill, not a generic chat response. Improve it when the skill's user-facing output needs to become richer.
  • System resources compose OpenUI primitives and inherit SnappyChat tokens. Use ui_contract: branded in SKILL.md only for deliberate platform or client visuals.

Known Pitfalls

  • Eval log may be very large (>10MB). Use tail semantics to read only the last N lines, not the entire file. The script should stream-read or use tail -<window_size> piped to jq.
  • Bucket with only nulls in primary_issue. That's valid (no issues for those runs). The bucket still counts as struggling or winning if the mean satisfies the threshold; it just has an empty "Top issues" line.
  • Score type inconsistency. Some rows may have score: null or missing score field. Treat missing/null score as 0.0 for mean calculation, not as "skip this row" - it still counts toward the bucket's sample size.
  • Archive old digests before re-running. The script unconditionally overwrites the output file. If Robert wants to compare week-over-week, he must cp the old file first.
  • Normalize skill vs verb. Use whichever is present; if both exist, prefer skill. Create a single merged bucket, not two separate ones.
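
For the large-log pitfall, one possible approach is to read fixed-size chunks backwards from the end of the file until enough newlines have been seen. The sketch below is an illustration, not the shipped reader: `tailLines` is a hypothetical helper, it assumes a regular file with no blank lines between rows, and the demo at the bottom writes a throwaway file to the system temp directory.

```typescript
import { closeSync, fstatSync, mkdtempSync, openSync, readSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Return up to `maxLines` trailing non-empty lines without loading the whole file.
function tailLines(path: string, maxLines: number, chunkSize = 64 * 1024): string[] {
  if (maxLines <= 0) return [];
  const fd = openSync(path, "r");
  try {
    let position = fstatSync(fd).size;
    const chunks: Buffer[] = [];
    let newlines = 0;
    // Read backwards until more newlines than requested lines have been seen
    // (the extra one covers a partial first line) or the start of file is reached.
    while (position > 0 && newlines <= maxLines) {
      const length = Math.min(chunkSize, position);
      position -= length;
      const chunk = Buffer.alloc(length);
      readSync(fd, chunk, 0, length, position);
      for (let i = 0; i < length; i++) {
        if (chunk[i] === 0x0a) newlines++;
      }
      chunks.unshift(chunk);
    }
    const parts = Buffer.concat(chunks).toString("utf8").split(/\r?\n/);
    // If reading stopped before the start of the file, the first piece may be
    // a partial line, so it is discarded.
    if (position > 0) parts.shift();
    return parts.filter((line) => line.trim() !== "").slice(-maxLines);
  } finally {
    closeSync(fd);
  }
}

// Demo: ten small rows, read back with a deliberately tiny chunk size.
const demoDir = mkdtempSync(join(tmpdir(), "tail-demo-"));
const demoFile = join(demoDir, "evals.ndjson");
const demoRows = Array.from({ length: 10 }, (_, i) => JSON.stringify({ n: i }));
writeFileSync(demoFile, demoRows.join("\n") + "\n", "utf8");
const lastThree = tailLines(demoFile, 3, 16);
```

In the demo, `lastThree` holds the rows for `n` equal to 7, 8, and 9. Concatenating chunks before decoding keeps multi-byte UTF-8 characters intact, since only the discarded first line can start mid-character.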

Self-Test

An agent reading this should correctly:

  1. [ ] Understand tail semantics: last 200 rows (or window size), not entire file?
  2. [ ] Know bucket key normalization: skill OR verb, merged into one key per skill?
  3. [ ] Calculate mean accurately: sum(scores) / count, not median or mode?
  4. [ ] Apply thresholds exactly: < 0.5 for struggling, > 0.85 for winning, AND count >= min_bucket?
  5. [ ] Extract primary_issue correctly: skip nulls, count top 3 per bucket?
  6. [ ] Write only to output file, never modify server.ts or lib files?
  7. [ ] Include all four sections: Overview, Struggling, Winning, Patterns?
  8. [ ] Use tail-read or streaming to handle large eval logs?

Self-report

If this loader fell short, append a line:

echo "[$(date -u +%FT%TZ)] prompt-tuner: <what was missing or fixed> [FIXED|LOGGED] action_kind=<kind>" >> state/log/loader-feedback.log
  • FIXED = you patched this loader inline (P-fix).
  • LOGGED = too large for inline; the PostToolUse enqueue + Stop-hook drain will rewrite the loader from scratch on next session-end.
  • action_kind: shape-ok | skill-ran | loader-rewritten | pattern-elevated

Do not skip this. Every agent run must leave the system better than it found it.

api.ts - the code it can call

#!/usr/bin/env npx tsx
/**
 * snappy-prompt-tuner/api.ts -- Weekly reinforcement signal from eval rows.
 *
 * Reads state/log/evals.ndjson (tail), buckets by skill/verb, identifies
 * struggling (<0.5) and winning (>0.85) performers, outputs markdown digest
 * with concrete prompt-update suggestions.
 *
 * Usage:
 *   npx tsx state/lib/prompt-tuner.ts
 *   npx tsx state/lib/prompt-tuner.ts --window 300
 *   npx tsx state/lib/prompt-tuner.ts --min-bucket 3
 *   npx tsx state/lib/prompt-tuner.ts --window 300 --min-bucket 3
 */

import { readFileSync, writeFileSync, existsSync, realpathSync } from "fs";
import { join } from "path";

interface EvalRow {
  ts: string;
  skill?: string;
  verb?: string;
  score?: number;
  primary_issue?: string | null;
  [key: string]: unknown;
}

interface BucketStats {
  skill: string;
  count: number;
  scores: number[];
  mean: number;
  issues: Record<string, number>;
  topIssues: Array<[string, number]>;
}

function parseArgs(args: string[]): Record<string, string | number> {
  const result: Record<string, string | number> = {};
  for (let i = 0; i < args.length; i++) {
    if (args[i].startsWith("--")) {
      const key = args[i].substring(2);
      const val = args[i + 1];
      if (val && !val.startsWith("--")) {
        result[key] = isNaN(Number(val)) ? val : Number(val);
        i++;
      }
    }
  }
  return result;
}

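// Reads the whole file into memory, then keeps the last `lineCount` lines.
// Note: this is not a streaming tail, so very large logs are still loaded in full.
function tailReadNDJSON(filePath: string, lineCount: number): EvalRow[] {
  try {
    const content = readFileSync(filePath, "utf8");
    const lines = content.trim().split(/\r?\n/);
    const start = Math.max(0, lines.length - lineCount);
    const tail = lines.slice(start).filter((line) => line.trim());
    const rows: EvalRow[] = [];
    let dropped = 0;
    for (const line of tail) {
      try {
        const parsed = JSON.parse(line);
        // Keep only JSON objects; a bare `null` or scalar is not an eval row.
        if (parsed !== null && typeof parsed === "object") rows.push(parsed as EvalRow);
        else dropped++;
      } catch {
        dropped++;
      }
    }
    // Report unparseable lines instead of dropping them silently.
    if (dropped > 0) {
      console.error(`Skipped ${dropped} unparseable line(s) in ${filePath}`);
    }
    return rows;
  } catch {
    console.error(`Failed to read ${filePath}`);
    return [];
  }
}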

const REPO_ROOT = (() => {
  // Walk up from this file until we find state/skills
  let dir = new URL(import.meta.url).pathname;
  for (let i = 0; i < 10; i++) {
    dir = join(dir, "..");
    if (existsSync(join(dir, "state", "skills"))) return dir;
  }
  return process.cwd();
})();

function canonicalizeSkillSlug(raw: string): string | null {
  if (!raw || raw === "unknown") return null;

  const skillsDir = join(REPO_ROOT, "state", "skills");

  // 1. If the raw value is already a real skill folder, use it as-is
  if (existsSync(join(skillsDir, raw, "SKILL.md"))) {
    return raw;
  }

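  // 2. Strip "agent-" prefix (e.g. "agent-brain-digest" → "brain-digest")
  const withoutPrefix = raw.replace(/^agent-/, "");
  if (existsSync(join(skillsDir, withoutPrefix, "SKILL.md"))) {
    return withoutPrefix;
  }

  // 3. Strip session-ID suffix like "-i0vq4u" (5+ lowercase alphanumeric after a trailing dash).
  //    Tried only after step 2 fails, because the pattern also matches ordinary
  //    trailing words such as "-digest" and would otherwise turn "brain-digest" into "brain".
  const withoutSuffix = withoutPrefix.replace(/-[a-z0-9]{5,}$/, "");
  if (existsSync(join(skillsDir, withoutSuffix, "SKILL.md"))) {
    return withoutSuffix;
  }

  // Phantom: no matching skill folder; drop from digest
  return null;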
}

function computeStats(rows: EvalRow[]): Map<string, BucketStats> {
  const buckets = new Map<string, BucketStats>();

  for (const row of rows) {
    const rawKey = (row.skill || row.verb || "unknown") as string;
    const key = canonicalizeSkillSlug(rawKey);
    if (!key) continue; // Drop phantom rows
    const score = typeof row.score === "number" ? row.score : 0;
    const issue = row.primary_issue || null;

    if (!buckets.has(key)) {
      buckets.set(key, {
        skill: key,
        count: 0,
        scores: [],
        mean: 0,
        issues: {},
        topIssues: [],
      });
    }

    const bucket = buckets.get(key)!;
    bucket.scores.push(score);
    bucket.count++;

    if (issue && typeof issue === "string") {
      bucket.issues[issue] = (bucket.issues[issue] || 0) + 1;
    }
  }

  // Compute means and top issues
  for (const bucket of buckets.values()) {
    const sum = bucket.scores.reduce((a, b) => a + b, 0);
    bucket.mean = bucket.count > 0 ? sum / bucket.count : 0;
    bucket.topIssues = Object.entries(bucket.issues)
      .sort((a, b) => b[1] - a[1])
      .slice(0, 3);
  }

  return buckets;
}

function countPatterns(buckets: Map<string, BucketStats>): Array<[string, number]> {
  const allIssues = new Map<string, number>();
  for (const bucket of buckets.values()) {
    for (const [issue, count] of Object.entries(bucket.issues)) {
      allIssues.set(issue, (allIssues.get(issue) || 0) + count);
    }
  }
  return Array.from(allIssues.entries())
    .sort((a, b) => b[1] - a[1])
    .slice(0, 5);
}

function generateDigest(
  rows: EvalRow[],
  buckets: Map<string, BucketStats>,
  minBucket: number
): string {
  const now = new Date().toISOString();
  const struggling = Array.from(buckets.values()).filter(
    (b) => b.mean < 0.5 && b.count >= minBucket
  );
  const winning = Array.from(buckets.values()).filter(
    (b) => b.mean > 0.85 && b.count >= minBucket
  );
  const patterns = countPatterns(buckets);

  let md = `# prompt-tuner digest\n\n`;
  md += `Generated: ${now}\n\n`;

  md += `## Overview\n\n`;
  md += `- Total rows scanned: ${rows.length}\n`;
  md += `- Unique skills: ${buckets.size}\n`;
  md += `- Struggling buckets (mean < 0.5, n >= ${minBucket}): ${struggling.length}\n`;
  md += `- Winning buckets (mean > 0.85, n >= ${minBucket}): ${winning.length}\n\n`;

  // Struggling section
  if (struggling.length > 0) {
    md += `## Struggling (mean < 0.5, n >= ${minBucket})\n\n`;
    for (const bucket of struggling.sort((a, b) => a.mean - b.mean)) {
      md += `**${bucket.skill}** mean=${bucket.mean.toFixed(2)} n=${bucket.count}\n`;
      if (bucket.topIssues.length > 0) {
        const issues = bucket.topIssues
          .map(([issue, count]) => `${issue} (${count})`)
          .join(", ");
        md += `- Top issues: ${issues}\n`;
      } else {
        md += `- Top issues: none recorded\n`;
      }
      md += `- Suggestion: Review the system prompt for this skill. Consider adding a concrete GOOD example of when to trigger, and a BAD example of when NOT to. Tighten trigger keywords if the model is confusing this with other skills.\n\n`;
    }
  } else {
    md += `## Struggling (mean < 0.5, n >= ${minBucket})\n\n`;
    md += `No struggling buckets found.\n\n`;
  }

  // Winning section
  if (winning.length > 0) {
    md += `## Winning (mean > 0.85, n >= ${minBucket})\n\n`;
    for (const bucket of winning.sort((a, b) => b.mean - a.mean)) {
      md += `**${bucket.skill}** mean=${bucket.mean.toFixed(2)} n=${bucket.count}\n`;
      md += `- No issues observed. Keep the current prompt and examples.\n\n`;
    }
  } else {
    md += `## Winning (mean > 0.85, n >= ${minBucket})\n\n`;
    md += `No winning buckets found.\n\n`;
  }

  // Patterns section
  md += `## Patterns\n\n`;
  if (patterns.length > 0) {
    md += `Most common failure modes across all buckets:\n`;
    for (let i = 0; i < patterns.length; i++) {
      const [issue, count] = patterns[i];
      md += `${i + 1}. \`${issue}\` (${count} occurrences)\n`;
    }
    md += `\nThese patterns suggest systemic issues worth addressing at the prompt level.\n\n`;
  } else {
    md += `No failure modes recorded. All observed runs passed their evaluation gates.\n\n`;
  }

  // Next steps
  md += `## Next Steps\n\n`;
  md += `For each struggling skill:\n`;
  md += `1. Read the SKILL.md and AGENTS.md to understand the intent\n`;
  md += `2. Locate the system prompt fragment in server.ts or state/lib/<skill>.ts\n`;
  md += `3. Add a concrete GOOD example of when to emit, and a BAD example of when NOT to\n`;
  md += `4. Re-run the skill and observe if mean score climbs above 0.7\n`;
  md += `5. Archive this digest as prompt-tuner-digest-YYYY-MM-DD.md and generate a fresh one\n\n`;

  return md;
}

// --- CLI ---
if (
  (() => {
    try {
      return import.meta.url === `file://${realpathSync(process.argv[1])}`;
    } catch {
      return false;
    }
  })()
) {
  (async () => {
    const args = parseArgs(process.argv.slice(2));
    const window = (args.window as number) || 200;
    const minBucket = (args["min-bucket"] as number) || 5;
    const evalsPath = "state/log/evals.ndjson";
    const outputPath = "state/log/prompt-tuner-digest.md";

    console.log(`Reading ${evalsPath}, last ${window} rows...`);
    const rows = tailReadNDJSON(evalsPath, window);
    console.log(`Parsed ${rows.length} eval rows.`);

    if (rows.length === 0) {
      console.error("No eval rows found. Exiting.");
      process.exit(0);
    }

    const buckets = computeStats(rows);
    console.log(`Bucketed into ${buckets.size} unique skills.`);

    const digest = generateDigest(rows, buckets, minBucket);
    writeFileSync(outputPath, digest, "utf8");
    console.log(`Digest written to ${outputPath}`);

    // Count struggling/winning for eval
    const struggling = Array.from(buckets.values()).filter(
      (b) => b.mean < 0.5 && b.count >= minBucket
    ).length;
    const winning = Array.from(buckets.values()).filter(
      (b) => b.mean > 0.85 && b.count >= minBucket
    ).length;
    console.log(
      `\nSummary: ${rows.length} rows, ${buckets.size} buckets, ${struggling} struggling, ${winning} winning`
    );
  })();
}

scripts - helper scripts it can run

prose-only skill - 3 inline code blocks live in SKILL.md above (no state/bin/ sidecar yet).

how we check it - the checks, plus the last 3 runs

rubric auto no rubric declared
recent mean 1.00 · 3 runs actor/auditor: unverifiable
deps log
| timestamp | verb | score | primary_issue | artifact |
| --- | --- | --- | --- | --- |
| 2026-05-01 05:28Z | - | 1.00 | - | - |
| 2026-05-01 05:28Z | - | 1.00 | - | - |
| 2026-05-01 05:28Z | - | 1.00 | - | - |