Brayden Watkins · The Benchmark

Side Project · Coding-Agent Shootout

The Benchmark.

A public benchmark for the frontier models behind the major coding agents, on real engineering tasks. Pass means the generated code compiles, runs, and matches expected behavior. No LLM-as-judge.

Live · v2 leaderboard · 8 tasks · 138 verifier tests · Claude baseline scored · cross-vendor runs planned

II · Leaderboard · v2 task set · 2026-05-08

Real run data.
Not synthetic.

Model              Mode            Tasks   Tests       Pass
Claude Opus 4.7    API · one-shot  7 / 8   136 / 138   98.6%
Claude Sonnet 4.6  API · one-shot  6 / 8   134 / 138   97.1%

Real capability spread: Opus 4.7 cleared task 08 (JSON Schema validator) but Sonnet 4.6 missed two tests on combinator semantics (allOf and the oneOf zero-match case). Both still failed task 06 (LRU + TTL) on constructor input validation. GPT-5.5 and Gemini 2.5 Pro runs are planned next; the harness is model-agnostic, so adding a model is one config entry. Agent-CLI runs (Claude Code, Codex CLI, Gemini CLI) are the v3 goal.
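The Pass column is plain arithmetic over the Tests column: tests passed divided by tests total, rounded to one decimal place. A minimal sketch that recomputes it from the two rows above; the helper name passRate is illustrative and not taken from the harness. The Tasks column appears to count a task as cleared only when every one of its tests passes, which is why Opus 4.7 shows 7 / 8 despite missing only two tests.

```javascript
// Recompute the Pass column from the Tests column of the leaderboard.
// Row data is copied from the table; passRate is an illustrative helper,
// not a function from the benchmark harness.
const rows = [
  { model: 'Claude Opus 4.7', testsPassed: 136, testsTotal: 138 },
  { model: 'Claude Sonnet 4.6', testsPassed: 134, testsTotal: 138 },
];

// Percentage of verifier tests passed, rounded to one decimal place.
function passRate(passed, total) {
  return Math.round((passed / total) * 1000) / 10;
}

for (const row of rows) {
  console.log(`${row.model}: ${passRate(row.testsPassed, row.testsTotal)}%`);
}
```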

III · Tasks

Each task targets
one named failure mode.

01

Input validation hardening

Targets: Ambiguity-driven assumption errors, missing edge cases

Easy · 29 tests
02

Request deduplication for concurrent calls

Targets: Subtle concurrency bugs that look fine on first read

Medium · 9 tests
03

Token bucket rate limiter

Targets: Algorithm correctness under timing edge cases

Medium · 13 tests
04

Refactor callback hell to async/await

Targets: Behavior preservation during refactor

Medium · 12 tests
05

Retry with exponential backoff and circuit breaker

Targets: Defensive programming with timing constraints

Hard · 10 tests
06

LRU cache with TTL

Targets: Interacting requirements (LRU + TTL), eviction-priority semantics, injected-clock discipline

Hard · 16 tests
07

SQL parameterizer (tagged template)

Targets: Empty IN array → IN (NULL), nested-sql composition with placeholder renumbering

Hard · 19 tests
08

JSON Schema-style validator

Targets: oneOf exactly-one semantics, allOf all-must-match, additionalProperties: false, multi-error collection with array-index paths

Hard · 30 tests
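
The combinator semantics that task 08 targets are easy to state and easy to get subtly wrong. A minimal sketch, assuming subschemas are modeled as plain predicate functions (a simplification for illustration; the actual task validates schema objects and collects errors with paths):

```javascript
// Sketch of the allOf / oneOf semantics named in task 08.
// Subschemas are modeled as predicate functions here, which is an
// assumption made for brevity, not the task's real interface.

// allOf: the value must satisfy every subschema.
function allOf(predicates, value) {
  return predicates.every((matches) => matches(value));
}

// oneOf: the value must satisfy exactly one subschema.
// Zero matches and two-or-more matches both fail.
function oneOf(predicates, value) {
  const matchCount = predicates.filter((matches) => matches(value)).length;
  return matchCount === 1;
}

const isString = (v) => typeof v === 'string';
const isNumber = (v) => typeof v === 'number';
const isPositive = (v) => typeof v === 'number' && v > 0;

console.log(oneOf([isString, isNumber], 'a'));   // true: exactly one match
console.log(oneOf([isString, isNumber], null));  // false: zero matches
console.log(oneOf([isNumber, isPositive], 5));   // false: two matches
console.log(allOf([isNumber, isPositive], -1));  // false: isPositive fails
```

The zero-match case is the one that tends to slip: an implementation that only rejects "more than one match" will wrongly accept a value that matches nothing.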

IV · Sample task: Input validation

What the verifier
actually looks like.

Every task ships with four files. The verifier suite is the judge. It runs the agent’s output and asserts on the contract. Below are three tests excerpted from tasks/01-input-validation/verifier/validation.test.js.

// three of the 29 tests in validation.test.js

test('rejects null input', () => {
  expect(() => processUserInput(null)).toThrow(ValidationError);
  try { processUserInput(null); } catch (e) {
    expect(e.code).toBe('invalid_input');
    expect(e.field).toBe('data');
  }
});

test('rejects malformed email (no @)', () => {
  try { processUserInput({ email: 'notanemail', age: 25, name: 'A' }); }
  catch (e) {
    expect(e.code).toBe('invalid_email');
    expect(e.field).toBe('email');
  }
});

test('rejects too many tags', () => {
  const tags = Array(11).fill('x');
  try { processUserInput({ email: 'a@b.com', age: 25, name: 'A', tags }); }
  catch (e) { expect(e.code).toBe('invalid_tags'); }
});

The agent doesn’t see this file. It sees the prompt and the starter code. Its output is then run against this suite. Pass means green tests. There is no LLM-as-judge. There is no judgment call.

V · Open the source

Run it yourself.

git clone https://github.com/watkins654/coding-agent-shootout.git
cd coding-agent-shootout
npm install
npm test                    # All 138 tests pass against reference solutions

# Run a verifier against an agent's output:
SUBJECT_PATH=/path/to/agent/output/index.js \
  npm run test:task 01-input-validation
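
How the harness consumes SUBJECT_PATH internally is not shown on this page. One plausible pattern, sketched under that assumption: the verifier resolves the module under test from the environment variable and falls back to a reference solution when it is unset. The function name resolveSubjectPath and the fallback path are hypothetical.

```javascript
const path = require('node:path');

// Hypothetical: choose which module a verifier should load.
// SUBJECT_PATH is the variable from the command above; falling back to a
// reference solution is an assumption about the harness, not a confirmed detail.
function resolveSubjectPath(env, referencePath) {
  const subject = env.SUBJECT_PATH;
  const chosen = subject && subject.trim() !== '' ? subject : referencePath;
  return path.resolve(chosen);
}

// With SUBJECT_PATH set, the agent's output is selected.
console.log(resolveSubjectPath({ SUBJECT_PATH: '/tmp/agent/index.js' }, './reference/index.js'));

// Without it, the (assumed) reference solution is selected.
console.log(resolveSubjectPath({}, './reference/index.js'));
```

In a real test file the resolved path would then be passed to require, with process.env supplied as the env argument.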