
Artifact I · Coding-Agent Shootout

The Benchmark.

A public benchmark comparing the major coding agents on real engineering tasks. Pass means the generated code compiles, runs, and matches expected behavior.

Live · v2 leaderboard · 8 tasks · 138 verifier tests · 4 agents on the leaderboard

II · Leaderboard · v2 task set · 2026-05-08

Real run data.
Not synthetic.

Agent                Mode              Tasks    Tests        Pass
Claude Opus 4.7      API · one-shot    7 / 8    136 / 138    98.6%
Claude Sonnet 4.6    API · one-shot    6 / 8    134 / 138    97.1%
GPT-5.5              Pending           ·        ·            ·
Gemini 2.5 Pro       Pending           ·        ·            ·

Real capability spread: Opus 4.7 cleared task 08 (JSON Schema validator) but Sonnet 4.6 missed two tests on combinator semantics (allOf and the oneOf zero-match case). Both still failed task 06 (LRU + TTL) on constructor input validation. GPT-5.5 and Gemini 2.5 Pro runs ship once their billing tiers are enabled.
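
The oneOf miss is worth spelling out, because the zero-match case is the one that gets forgotten. A minimal sketch of exactly-one semantics; the validate signature and error shape here are assumptions, not the benchmark's API:

// Illustrative only. Assumes validate(schema, value) returns an error array.
// oneOf passes only when EXACTLY ONE subschema matches: zero matches must
// fail just like two matches. allOf, by contrast, requires every subschema
// to match.
function checkOneOf(subschemas, value, validate) {
  const matches = subschemas.filter((s) => validate(s, value).length === 0);
  if (matches.length === 1) return [];
  return [{ code: 'one_of', detail: `expected exactly 1 match, got ${matches.length}` }];
}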

III · Tasks

Each task targets
one named failure mode.

01

Input validation hardening

Targets: Ambiguity-driven assumption errors, missing edge cases

Easy · 29 tests

02

Request deduplication for concurrent calls

Targets: Subtle concurrency bugs that look fine on first read (first sketch after this list)

Medium · 9 tests

03

Token bucket rate limiter

Targets: Algorithm correctness under timing edge cases (second sketch after this list)

Medium · 13 tests

04

Refactor callback hell to async/await

Targets: Behavior preservation during refactor

Medium · 12 tests

05

Retry with exponential backoff and circuit breaker

Targets: Defensive programming with timing constraints

Hard · 10 tests

06

LRU cache with TTL

Targets: Interacting requirements (LRU + TTL), eviction-priority semantics, injected-clock discipline

Hard · 16 tests

07

SQL parameterizer (tagged template)

Targets: Empty IN array → IN (NULL), nested-sql composition with placeholder renumbering (third sketch after this list)

Hard · 19 tests

08

JSON Schema-style validator

Targets: oneOf exactly-one semantics, allOf all-must-match, additionalProperties: false, multi-error collection with array-index paths

Hard · 30 tests
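
Three of those contracts are concrete enough to sketch. First, task 02's deduplication: concurrent calls with the same key should share a single in-flight promise. A minimal sketch of the idea (hypothetical names, not the task's starter code):

// Hypothetical sketch, not the task's starter code.
const inflight = new Map();

function dedupe(key, fn) {
  if (inflight.has(key)) return inflight.get(key);
  const p = Promise.resolve()
    .then(fn)
    // The bug this task probes: skip this cleanup and a rejected promise
    // is cached forever. `finally` clears the entry on success AND failure.
    .finally(() => inflight.delete(key));
  inflight.set(key, p);
  return p;
}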
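
Second, task 03's token bucket. The timing edge cases live in the refill step: fractional token accumulation and clamping at capacity. Another assumed shape, using the same injected-clock discipline task 06 demands:

// Minimal token-bucket sketch; the class shape and names are assumptions.
class TokenBucket {
  constructor(capacity, refillPerSec, now = () => Date.now()) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.tokens = capacity;   // start full
    this.now = now;           // injected clock keeps verifier tests deterministic
    this.last = now();
  }

  tryRemove(n = 1) {
    const t = this.now();
    // Refill fractionally, clamp at capacity: the two classic timing edges.
    this.tokens = Math.min(
      this.capacity,
      this.tokens + ((t - this.last) / 1000) * this.refillPerSec
    );
    this.last = t;
    if (this.tokens < n) return false;
    this.tokens -= n;
    return true;
  }
}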
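
Third, task 07's empty-IN rule. WHERE id IN () is a syntax error in most SQL dialects; IN (NULL) is valid and matches no rows, which preserves the query's meaning. A hypothetical helper (not the task's actual tagged-template API) that also shows the placeholder renumbering the verifier checks:

// Hypothetical helper, not the task's tagged-template API.
function inList(values, startIndex = 1) {
  if (values.length === 0) {
    // IN () is invalid SQL; IN (NULL) is valid and matches no rows.
    return { text: 'IN (NULL)', params: [] };
  }
  const placeholders = values.map((_, i) => `$${startIndex + i}`);
  return { text: `IN (${placeholders.join(', ')})`, params: values };
}

// inList([])        → { text: 'IN (NULL)', params: [] }
// inList([1, 2], 3) → { text: 'IN ($3, $4)', params: [1, 2] }   (renumbered)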

IV · Sample task: Input validation

What the verifier
actually looks like.

Every task ships with four files. The verifier suite is the judge. It runs the agent’s output and asserts on the contract. Below is a slice from tasks/01-input-validation/verifier/validation.test.js.

describe('processUserInput - validation errors', () => {
  test('rejects null input', () => {
    expect(() => processUserInput(null)).toThrow(ValidationError);
    try { processUserInput(null); } catch (e) {
      expect(e.code).toBe('invalid_input');
      expect(e.field).toBe('data');
    }
  });

  test('rejects malformed email (no @)', () => {
    const input = { email: 'notanemail', age: 25, name: 'A' };
    // Assert the throw itself; without this the test passes vacuously
    // when no error is raised and the catch block never runs.
    expect(() => processUserInput(input)).toThrow(ValidationError);
    try { processUserInput(input); } catch (e) {
      expect(e.code).toBe('invalid_email');
      expect(e.field).toBe('email');
    }
  });

  test('rejects too many tags', () => {
    const tags = Array(11).fill('x');   // one past the allowed maximum
    const input = { email: 'a@b.com', age: 25, name: 'A', tags };
    expect(() => processUserInput(input)).toThrow(ValidationError);
    try { processUserInput(input); } catch (e) {
      expect(e.code).toBe('invalid_tags');
    }
  });
});

The agent doesn’t see this file. It sees the prompt and the starter code. Its output is then run against this suite. Pass means green tests. There is no LLM-as-judge. There is no judgment call.
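How do the tests find the agent's code in the first place? One plausible wiring is a small resolver module the suite imports from. This file is hypothetical; only the SUBJECT_PATH variable is confirmed by the run commands below:

// verifier/subject.js (hypothetical; not confirmed repo layout)
// Tests import the module under test via SUBJECT_PATH, falling back to the
// reference solution so a bare `npm test` validates the suite itself.
const path = require('path');

const target = process.env.SUBJECT_PATH
  ? path.resolve(process.env.SUBJECT_PATH)
  : path.join(__dirname, '..', 'solution', 'index.js');

module.exports = require(target);

A test file would then open with const { processUserInput, ValidationError } = require('./subject') and never hard-code the agent's path.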

V · Open the source

Run it yourself.

git clone https://github.com/watkins654/coding-agent-shootout.git
cd coding-agent-shootout
npm install
npm test                    # All 138 tests pass against reference solutions

# Run a verifier against an agent's output:
SUBJECT_PATH=/path/to/agent/output/index.js \
  npm run test:task 01-input-validation