The Benchmark
Coding-Agent Shootout · Public eval framework
A public benchmark comparing the major coding agents (Claude Code, Codex CLI, Gemini CLI, Cursor agent, Cline) on real engineering tasks. Each task ships with an executable Jest verifier suite. A pass means the generated code compiles, runs, and matches the expected behavior. Runs are repeated monthly and reported with statistical confidence intervals. Designed to be cited.
Live · 8 tasks · 138 tests · Opus 4.7 (98.6%) > Sonnet 4.6 (97.1%)
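The card does not say how the confidence intervals are computed. As a minimal sketch, assuming each test is treated as an independent pass/fail trial, a Wilson score interval is one common choice for pass rates close to 100%. The helper name `wilsonInterval`, the `PassRateInterval` type, and the default `z = 1.96` (roughly 95% confidence) are hypothetical and not taken from the benchmark itself.

```typescript
// Hypothetical helper, not the benchmark's own code: Wilson score interval
// for a binomial pass rate (passed tests out of total tests).
interface PassRateInterval {
  rate: number;  // observed pass rate, passed / total
  lower: number; // lower bound of the interval
  upper: number; // upper bound of the interval
}

function wilsonInterval(passed: number, total: number, z = 1.96): PassRateInterval {
  if (
    !Number.isInteger(passed) ||
    !Number.isInteger(total) ||
    total <= 0 ||
    passed < 0 ||
    passed > total
  ) {
    throw new RangeError("expected integers with 0 <= passed <= total and total > 0");
  }
  const p = passed / total;
  const z2 = z * z;
  const denom = 1 + z2 / total;
  // Center is pulled toward 0.5, which keeps the interval sensible near 0% or 100%.
  const center = (p + z2 / (2 * total)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / total + z2 / (4 * total * total))) / denom;
  return {
    rate: p,
    lower: Math.max(0, center - half),
    upper: Math.min(1, center + half),
  };
}

// Illustrative input only: 136 of 138 is an assumed count that rounds to 98.6%.
const example = wilsonInterval(136, 138);
console.log(example);
```

The Wilson interval is used here instead of the plain normal approximation because the latter can produce bounds above 100% when nearly every test passes.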