runtime-benchmark

_Generated: 2026-04-18T22:06:39.683Z · run_id: bench-2026-04-18T22-06-39-683Z_

Benchmark harness sibling to parity-test. Measures wall-time, loader-injection rate, output quality, and native-invoke per (runtime × skill) pair - goes beyond parity's pass/fail to surface wall-time regressions and quality drift.

Source: state/lint/benchmark-runtimes.ts
Raw rows: state/log/benchmarks.ndjson
Canonical skills: snappy-fix, morning-brief, content-mine, sweep, snappy-run
Runtimes: claude, codex, gemini, openclaw

Per-runtime aggregate (latest run)

runtime	status	n	wall-mean (ms)	loader-inject %	mean quality	errors
claude	partial	5	32461	20%	0.2	4
codex	ok	5	60017	0%	1	0
gemini	ok	5	4	0%	0.5	0
openclaw	ok	5	2628	100%	1	0

Per-pair detail (this run)

runtime	skill	status	mode	wall (ms)	loader	quality	bytes
claude	snappy-fix	timeout	agentic	60269	no	0	0
claude	morning-brief	timeout	agentic	60284	no	0	0
claude	content-mine	ok	agentic	33240	yes	1	870
claude	sweep	error	agentic	4380	no	0	58
claude	snappy-run	error	agentic	4133	no	0	58
codex	snappy-fix	ok	agentic	60020	no	1	702
codex	morning-brief	ok	agentic	60019	no	1	939
codex	content-mine	ok	agentic	60015	no	1	594
codex	sweep	ok	agentic	60014	no	1	591
codex	snappy-run	ok	agentic	60016	no	1	1004
gemini	snappy-fix	ok	context-only	4	no	0.5	1371
gemini	morning-brief	ok	context-only	4	no	0.5	1371
gemini	content-mine	ok	context-only	4	no	0.5	1371
gemini	sweep	ok	context-only	3	no	0.5	1371
gemini	snappy-run	ok	context-only	3	no	0.5	1371
openclaw	snappy-fix	ok	context-only	3538	yes	1	3136
openclaw	morning-brief	ok	context-only	2416	yes	1	2348
openclaw	content-mine	ok	context-only	2406	yes	1	2090
openclaw	sweep	ok	context-only	2392	yes	1	2243
openclaw	snappy-run	ok	context-only	2388	yes	1	2651

Dimensions

wall_time_ms - spawn-to-exit time.
loader_injected - sentinel or substantive-loader substring appears in stdout/stderr/context.
output_quality - shape gate 0..1 (≥100 bytes AND skill-name mention = 1; either alone = 0.5/0.25).
native_invoke - invoked via the runtime's own CLI, not a Claude wrapper.
error - stderr first line when exit ≠ 0.

Notes

gemini, openclaw are context-only on this machine - they surface the loader text but do not run the skill end-to-end. Their quality/loader scores reflect static-context fidelity only.
claude, codex run in headless non-interactive mode (claude -p / codex exec --full-auto --json). Codex writes agent text to stderr by default; the wrapper parses --json stdout into stdout (see runtime.ts pod-11 fix).
Sentinel injection uses a tmp copy of the loader at /var/folders/sq/1xxv87_d0w18jzl3731yh1w80000gn/T/snappy-bench + env vars SNAPPY_BENCH_SENTINEL_<SKILL> / SNAPPY_BENCH_LOADER_<SKILL>. The repo loaders are never mutated.
Rerun: npx tsx state/lint/benchmark-runtimes.ts.