runtime-benchmark

_Generated: 2026-04-18T22:06:39.683Z · run_id: bench-2026-04-18T22-06-39-683Z_

Benchmark harness sibling to parity-test. Measures wall-time, loader-injection rate, output quality, and native-invoke per (runtime × skill) pair — goes beyond parity's pass/fail to surface wall-time regressions and quality drift.

Per-runtime aggregate (latest run)

runtimestatusnwall-mean (ms)loader-inject %mean qualityerrors
claudepartial53246120%0.24
codexok5600170%10
geminiok540%0.50
openclawok52628100%10

Per-pair detail (this run)

runtimeskillstatusmodewall (ms)loaderqualitybyteserror
claudesnappy-fixtimeoutagentic60269no00
claudemorning-brieftimeoutagentic60284no00
claudecontent-mineokagentic33240yes1870
claudesweeperroragentic4380no058
claudesnappy-runerroragentic4133no058
codexsnappy-fixokagentic60020no1702
codexmorning-briefokagentic60019no1939
codexcontent-mineokagentic60015no1594
codexsweepokagentic60014no1591
codexsnappy-runokagentic60016no11004
geminisnappy-fixokcontext-only4no0.51371
geminimorning-briefokcontext-only4no0.51371
geminicontent-mineokcontext-only4no0.51371
geminisweepokcontext-only3no0.51371
geminisnappy-runokcontext-only3no0.51371
openclawsnappy-fixokcontext-only3538yes13136
openclawmorning-briefokcontext-only2416yes12348
openclawcontent-mineokcontext-only2406yes12090
openclawsweepokcontext-only2392yes12243
openclawsnappy-runokcontext-only2388yes12651

Dimensions

  1. wall_time_ms — spawn-to-exit time.
  2. loader_injected — sentinel or substantive-loader substring appears in stdout/stderr/context.
  3. output_quality — shape gate 0..1 (≥100 bytes AND skill-name mention = 1; either alone = 0.5/0.25).
  4. native_invoke — invoked via the runtime's own CLI, not a Claude wrapper.
  5. error — stderr first line when exit ≠ 0.

Notes