runtime-benchmark
_Generated: 2026-04-18T22:06:39.683Z · run_id: bench-2026-04-18T22-06-39-683Z_
Benchmark harness, sibling to parity-test. It measures wall time, loader-injection rate, output quality, and native invoke per (runtime × skill) pair, going beyond parity's pass/fail to surface wall-time regressions and quality drift.
- Source: `state/lint/benchmark-runtimes.ts`
- Raw rows: `state/log/benchmarks.ndjson`
- Canonical skills: `snappy-fix`, `morning-brief`, `content-mine`, `sweep`, `snappy-run`
- Runtimes: `claude`, `codex`, `gemini`, `openclaw`
Per-runtime aggregate (latest run)
| runtime | status | n | wall-mean (ms) | loader-inject % | mean quality | errors |
|---|---|---|---|---|---|---|
| claude | partial | 5 | 32461 | 20% | 0.2 | 4 |
| codex | ok | 5 | 60017 | 0% | 1 | 0 |
| gemini | ok | 5 | 4 | 0% | 0.5 | 0 |
| openclaw | ok | 5 | 2628 | 100% | 1 | 0 |
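The aggregate row for a runtime can be reproduced from its per-pair rows. The sketch below is one way to do that, not the harness's actual code: the `BenchRow` field names are hypothetical (the real schema of `state/log/benchmarks.ndjson` may differ), and the `fail` status for an all-error runtime is an assumption, since this run only shows `ok` and `partial`.

```typescript
// Hypothetical row shape; field names in benchmarks.ndjson may differ.
interface BenchRow {
  runtime: string;
  skill: string;
  status: "ok" | "timeout" | "error";
  wallMs: number;
  loaderInjected: boolean;
  quality: number;
}

interface RuntimeAggregate {
  runtime: string;
  status: "ok" | "partial" | "fail"; // "fail" is an assumed label
  n: number;
  wallMeanMs: number;
  loaderInjectPct: number;
  meanQuality: number;
  errors: number;
}

function aggregateByRuntime(rows: BenchRow[]): RuntimeAggregate[] {
  // Group rows by runtime, preserving first-seen order.
  const groups = new Map<string, BenchRow[]>();
  for (const row of rows) {
    const bucket = groups.get(row.runtime) ?? [];
    bucket.push(row);
    groups.set(row.runtime, bucket);
  }

  const out: RuntimeAggregate[] = [];
  groups.forEach((group, runtime) => {
    const n = group.length;
    // Timeouts and errors both count toward the errors column.
    const errors = group.filter((r) => r.status !== "ok").length;
    const sum = (f: (r: BenchRow) => number) =>
      group.reduce((acc, r) => acc + f(r), 0);
    out.push({
      runtime,
      status: errors === 0 ? "ok" : errors === n ? "fail" : "partial",
      n,
      wallMeanMs: Math.round(sum((r) => r.wallMs) / n),
      loaderInjectPct: Math.round(
        (100 * group.filter((r) => r.loaderInjected).length) / n,
      ),
      meanQuality: sum((r) => r.quality) / n,
      errors,
    });
  });
  return out;
}

// The five claude rows from this run's per-pair table.
const claudeRows: BenchRow[] = [
  { runtime: "claude", skill: "snappy-fix", status: "timeout", wallMs: 60269, loaderInjected: false, quality: 0 },
  { runtime: "claude", skill: "morning-brief", status: "timeout", wallMs: 60284, loaderInjected: false, quality: 0 },
  { runtime: "claude", skill: "content-mine", status: "ok", wallMs: 33240, loaderInjected: true, quality: 1 },
  { runtime: "claude", skill: "sweep", status: "error", wallMs: 4380, loaderInjected: false, quality: 0 },
  { runtime: "claude", skill: "snappy-run", status: "error", wallMs: 4133, loaderInjected: false, quality: 0 },
];

const claudeAggregate = aggregateByRuntime(claudeRows)[0];
console.log(claudeAggregate);
```

Counting timeouts together with errors is what makes the claude row add up to 4 errors out of 5 pairs.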
Per-pair detail (this run)
| runtime | skill | status | mode | wall (ms) | loader | quality | bytes | error |
|---|---|---|---|---|---|---|---|---|
| claude | snappy-fix | timeout | agentic | 60269 | no | 0 | 0 | |
| claude | morning-brief | timeout | agentic | 60284 | no | 0 | 0 | |
| claude | content-mine | ok | agentic | 33240 | yes | 1 | 870 | |
| claude | sweep | error | agentic | 4380 | no | 0 | 58 | |
| claude | snappy-run | error | agentic | 4133 | no | 0 | 58 | |
| codex | snappy-fix | ok | agentic | 60020 | no | 1 | 702 | |
| codex | morning-brief | ok | agentic | 60019 | no | 1 | 939 | |
| codex | content-mine | ok | agentic | 60015 | no | 1 | 594 | |
| codex | sweep | ok | agentic | 60014 | no | 1 | 591 | |
| codex | snappy-run | ok | agentic | 60016 | no | 1 | 1004 | |
| gemini | snappy-fix | ok | context-only | 4 | no | 0.5 | 1371 | |
| gemini | morning-brief | ok | context-only | 4 | no | 0.5 | 1371 | |
| gemini | content-mine | ok | context-only | 4 | no | 0.5 | 1371 | |
| gemini | sweep | ok | context-only | 3 | no | 0.5 | 1371 | |
| gemini | snappy-run | ok | context-only | 3 | no | 0.5 | 1371 | |
| openclaw | snappy-fix | ok | context-only | 3538 | yes | 1 | 3136 | |
| openclaw | morning-brief | ok | context-only | 2416 | yes | 1 | 2348 | |
| openclaw | content-mine | ok | context-only | 2406 | yes | 1 | 2090 | |
| openclaw | sweep | ok | context-only | 2392 | yes | 1 | 2243 | |
| openclaw | snappy-run | ok | context-only | 2388 | yes | 1 | 2651 | |
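The `status`, `wall (ms)`, and `bytes` columns could come from a helper along these lines. This is a minimal sketch under stated assumptions, not the harness's implementation: `runPair` and `PairResult` are hypothetical names, the 60 s default budget is inferred from the claude timeout rows (about 60.3 s), and `bytes` is assumed to count stdout only.

```typescript
import { spawnSync } from "node:child_process";

type PairStatus = "ok" | "timeout" | "error";

interface PairResult {
  status: PairStatus;
  wallMs: number; // spawn-to-exit wall time
  bytes: number;  // assumed: size of stdout in bytes
  error: string;  // first stderr line when the exit code is non-zero
}

// Hypothetical helper: run one (runtime, skill) command and classify the outcome.
function runPair(cmd: string, args: string[], timeoutMs = 60_000): PairResult {
  const started = performance.now();
  const res = spawnSync(cmd, args, { encoding: "utf8", timeout: timeoutMs });
  const wallMs = Math.round(performance.now() - started);

  // spawnSync reports an expired timeout as an error with code ETIMEDOUT.
  const timedOut =
    (res.error as NodeJS.ErrnoException | undefined)?.code === "ETIMEDOUT";
  const status: PairStatus = timedOut
    ? "timeout"
    : res.status === 0
      ? "ok"
      : "error";

  const stdout = res.stdout ?? "";
  const stderr = res.stderr ?? "";
  return {
    status,
    wallMs,
    bytes: Buffer.byteLength(stdout, "utf8"),
    error: status === "error" ? stderr.split("\n")[0] : "",
  };
}
```

With a budget close to 60 s, a run that finishes just before the limit and one that is killed at the limit have nearly identical wall times, so the `status` column is needed to tell them apart.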
Dimensions
- wall_time_ms — spawn-to-exit time.
- loader_injected — sentinel or substantive-loader substring appears in stdout/stderr/context.
- output_quality — shape gate 0..1 (≥100 bytes AND skill-name mention = 1; either alone = 0.5/0.25).
- native_invoke — invoked via the runtime's own CLI, not a Claude wrapper.
- error — stderr first line when exit ≠ 0.
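The `output_quality` and `loader_injected` definitions above can be sketched as two small functions. The text does not say which single condition earns 0.5 and which earns 0.25; this sketch assumes size alone scores 0.5 and a skill-name mention alone scores 0.25. The gemini rows (1371 bytes, quality 0.5) are consistent with that reading but do not prove it. Both function names are hypothetical.

```typescript
// Shape gate for output quality, in the range 0..1.
// Assumption: size alone = 0.5, skill-name mention alone = 0.25.
function outputQuality(output: string, skillName: string): number {
  const bigEnough = new TextEncoder().encode(output).length >= 100;
  const mentionsSkill = output.includes(skillName);
  if (bigEnough && mentionsSkill) return 1;
  if (bigEnough) return 0.5;
  if (mentionsSkill) return 0.25;
  return 0;
}

// True when any marker (sentinel or substantive loader substring)
// appears in any captured stream: stdout, stderr, or context.
function loaderInjected(markers: string[], ...streams: string[]): boolean {
  return streams.some((text) =>
    markers.some((marker) => marker.length > 0 && text.includes(marker)),
  );
}

// Usage with made-up output text.
const sampleOutput = "x".repeat(120) + " ran snappy-fix";
console.log(outputQuality(sampleOutput, "snappy-fix"));
console.log(loaderInjected(["SENTINEL-123"], "stdout text", "stderr SENTINEL-123"));
```

Because the gate checks only size and a name mention, a score of 1 says the output has the expected shape, not that the skill's work is correct.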
Notes
- gemini, openclaw are context-only on this machine — they surface the loader text but do not run the skill end-to-end. Their quality/loader scores reflect static-context fidelity only.
- claude, codex run in headless non-interactive mode (`claude -p` / `codex exec --full-auto --json`). Codex writes agent text to stderr by default; the wrapper parses `--json` stdout into stdout (see runtime.ts pod-11 fix).
- Sentinel injection uses a tmp copy of the loader at `/var/folders/sq/1xxv87_d0w18jzl3731yh1w80000gn/T/snappy-bench` + env vars `SNAPPY_BENCH_SENTINEL_<SKILL>` / `SNAPPY_BENCH_LOADER_<SKILL>`. The repo loaders are never mutated.
- Rerun: `npx tsx state/lint/benchmark-runtimes.ts`.
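The sentinel-injection note can be illustrated with a short sketch. Only the two env var prefixes and the copy-to-tmp approach come from the note above; everything else is an assumption: the `<SKILL>` suffix is taken to be the skill name upper-cased with dashes replaced by underscores, and the sentinel format, the copy's file name, and the `prepareSentinel` helper are hypothetical.

```typescript
import { mkdirSync, mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Assumption: "snappy-fix" maps to the suffix "SNAPPY_FIX".
function envSuffix(skill: string): string {
  return skill.toUpperCase().replace(/[^A-Z0-9]/g, "_");
}

interface SentinelSetup {
  loaderCopy: string;          // path of the tmp copy carrying the sentinel
  sentinel: string;            // marker to look for in the runtime's output
  env: Record<string, string>; // env vars to pass to the spawned runtime
}

// Hypothetical helper: copy the loader into a tmp dir and append a sentinel.
// The file at repoLoaderPath is only read, never written.
function prepareSentinel(
  skill: string,
  repoLoaderPath: string,
  benchDir: string = join(tmpdir(), "snappy-bench"),
): SentinelSetup {
  mkdirSync(benchDir, { recursive: true });
  const suffix = envSuffix(skill);
  const sentinel = `SNAPPY_BENCH_${suffix}_${Date.now().toString(36)}`; // made-up format
  const loaderCopy = join(benchDir, `${skill}.loader.md`); // made-up file name
  const original = readFileSync(repoLoaderPath, "utf8");
  writeFileSync(loaderCopy, `${original}\n${sentinel}\n`);
  return {
    loaderCopy,
    sentinel,
    env: {
      [`SNAPPY_BENCH_SENTINEL_${suffix}`]: sentinel,
      [`SNAPPY_BENCH_LOADER_${suffix}`]: loaderCopy,
    },
  };
}

// Usage against a throwaway file standing in for a repo loader.
const demoDir = mkdtempSync(join(tmpdir(), "snappy-bench-demo-"));
const demoLoader = join(demoDir, "loader.md");
writeFileSync(demoLoader, "loader body\n");
const demoSetup = prepareSentinel("snappy-fix", demoLoader, join(demoDir, "bench"));
console.log(Object.keys(demoSetup.env));
```

Writing the sentinel only into the copy is what keeps the repo loaders untouched: a later run can detect the marker in a runtime's output without any tracked file having changed.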