Artifact I · Coding-Agent Shootout

The Benchmark.

A public benchmark comparing the major coding agents on real engineering tasks. Pass means the generated code compiles, runs, and matches expected behavior.

Live · v2 leaderboard · 8 tasks · 138 verifier tests · 4 agents on the leaderboard


II · Leaderboard · v2 task set · 2026-05-08

Real run data.
Not synthetic.

Agent	Mode	Tasks	Tests	Pass
Claude Opus 4.7	API · one-shot	7 / 8	136 / 138	98.6%
Claude Sonnet 4.6	API · one-shot	6 / 8	134 / 138	97.1%
GPT-5.5	Pending	·	·	·
Gemini 2.5 Pro	Pending	·	·	·

The spread reflects a real capability difference: Opus 4.7 cleared task 08 (JSON Schema validator), while Sonnet 4.6 missed two of its tests on combinator semantics (allOf and the oneOf zero-match case). Both agents failed task 06 (LRU + TTL) on constructor input validation. GPT-5.5 and Gemini 2.5 Pro runs will be published once their billing tiers are enabled.
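The leaderboard figures can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only. It assumes that task IDs follow the order of the task list (task 06 is the LRU cache, task 08 the validator) and that a task counts as passed only when every one of its tests passes. The per-task failure counts are inferred from the published totals, not taken from a published per-task breakdown.

```javascript
// Test counts per task, copied from the task list on this page.
const testsPerTask = {
  '01': 29, '02': 9, '03': 13, '04': 12,
  '05': 10, '06': 16, '07': 19, '08': 30,
};

// Assumption: a task passes only if zero of its tests fail.
function summarize(failedByTask) {
  let tasksPassed = 0;
  let testsPassed = 0;
  let testsTotal = 0;
  for (const [task, total] of Object.entries(testsPerTask)) {
    const failed = failedByTask[task] || 0;
    testsTotal += total;
    testsPassed += total - failed;
    if (failed === 0) tasksPassed += 1;
  }
  return {
    tasks: `${tasksPassed} / ${Object.keys(testsPerTask).length}`,
    tests: `${testsPassed} / ${testsTotal}`,
    pass: ((testsPassed / testsTotal) * 100).toFixed(1) + '%',
  };
}

// Inferred failure counts: two missed tests in task 06 for both agents,
// plus two more in task 08 for Sonnet 4.6.
const opus = summarize({ '06': 2 });
const sonnet = summarize({ '06': 2, '08': 2 });

console.log(opus);   // { tasks: '7 / 8', tests: '136 / 138', pass: '98.6%' }
console.log(sonnet); // { tasks: '6 / 8', tests: '134 / 138', pass: '97.1%' }
```

The eight per-task counts sum to 138, which matches the total shown in the header line.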

III · Tasks

Each task targets
one named failure mode.

Input validation hardening

Targets: Ambiguity-driven assumption errors, missing edge cases

Easy · 29 tests

Request deduplication for concurrent calls

Targets: Subtle concurrency bugs that look fine on first read

Medium · 9 tests

Token bucket rate limiter

Targets: Algorithm correctness under timing edge cases

Medium · 13 tests

Refactor callback hell to async/await

Targets: Behavior preservation during refactor

Medium · 12 tests

Retry with exponential backoff and circuit breaker

Targets: Defensive programming with timing constraints

Hard · 10 tests

LRU cache with TTL

Targets: Interacting requirements (LRU + TTL), eviction-priority semantics, injected-clock discipline

Hard · 16 tests

SQL parameterizer (tagged template)

Targets: Empty IN array → IN (NULL), nested-sql composition with placeholder renumbering

Hard · 19 tests

JSON Schema-style validator

Targets: oneOf exactly-one semantics, allOf all-must-match, additionalProperties: false, multi-error collection with array-index paths

Hard · 30 tests
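The combinator semantics named in the last task are easy to get subtly wrong, so a small illustration may help. The sketch below is not the benchmark's validator. It is a hypothetical miniature in which each "schema" is a plain predicate function, which isolates the counting rule: allOf needs every subschema to match, and oneOf needs exactly one, so both zero matches and two matches fail.

```javascript
// Hypothetical mini-validator: a "schema" here is just a predicate.
// Real validators operate on schema objects; this shows only the counting rule.
const isString = (v) => typeof v === 'string';
const isNumber = (v) => typeof v === 'number';
const isInteger = (v) => Number.isInteger(v);

// allOf: every subschema must match.
const allOf = (schemas) => (value) => schemas.every((s) => s(value));

// oneOf: exactly one subschema must match.
// Zero matches fails, and so do two or more.
const oneOf = (schemas) => (value) =>
  schemas.filter((s) => s(value)).length === 1;

const stringOrNumber = oneOf([isString, isNumber]);
const numberOrInteger = oneOf([isNumber, isInteger]);
const numberAndInteger = allOf([isNumber, isInteger]);

console.log(stringOrNumber('a'));   // true: one match
console.log(stringOrNumber(true));  // false: zero matches
console.log(numberOrInteger(3));    // false: two matches
console.log(numberOrInteger(3.5));  // true: only isNumber matches
console.log(numberAndInteger(3.5)); // false: isInteger fails
```

An implementation that treats oneOf as "at least one" gives the right answer for the first two cases and the wrong one for the third, which is why exactly-one needs its own tests.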

IV · Sample task: Input validation

What the verifier
actually looks like.

Every task ships with four files. The verifier suite is the judge. It runs the agent’s output and asserts on the contract. Below is a slice from tasks/01-input-validation/verifier/validation.test.js.

describe('processUserInput - validation errors', () => {
  test('rejects null input', () => {
    expect(() => processUserInput(null)).toThrow(ValidationError);
    try { processUserInput(null); } catch (e) {
      expect(e.code).toBe('invalid_input');
      expect(e.field).toBe('data');
    }
  });

  test('rejects malformed email (no @)', () => {
    try { processUserInput({ email: 'notanemail', age: 25, name: 'A' }); }
    catch (e) {
      expect(e.code).toBe('invalid_email');
      expect(e.field).toBe('email');
    }
  });

  test('rejects too many tags', () => {
    const tags = Array(11).fill('x');
    try { processUserInput({ email: 'a@b.com', age: 25, name: 'A', tags }); }
    catch (e) { expect(e.code).toBe('invalid_tags'); }
  });
});

The agent doesn’t see this file. It sees the prompt and the starter code. Its output is then run against this suite. Pass means green tests. There is no LLM-as-judge. There is no judgment call.
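To make the contract concrete, here is a minimal sketch of code that would satisfy the three assertions in the slice above. It is not the repository's reference solution. The 10-tag limit, the 'tags' field name on the tags error, and the captureError helper are assumptions made for illustration. The slice only shows that 11 tags is too many and that errors carry a code and a field.

```javascript
// Hypothetical sketch, not the repository's reference solution.
class ValidationError extends Error {
  constructor(code, field, message) {
    super(message || code);
    this.name = 'ValidationError';
    this.code = code;   // machine-readable reason, e.g. 'invalid_email'
    this.field = field; // which input field failed
  }
}

// Assumption: the slice only shows that 11 tags is rejected.
const MAX_TAGS = 10;

function processUserInput(data) {
  if (data === null || typeof data !== 'object') {
    throw new ValidationError('invalid_input', 'data');
  }
  if (typeof data.email !== 'string' || !data.email.includes('@')) {
    throw new ValidationError('invalid_email', 'email');
  }
  const { tags } = data;
  if (tags !== undefined && (!Array.isArray(tags) || tags.length > MAX_TAGS)) {
    // Assumption: the field name 'tags' is not confirmed by the slice.
    throw new ValidationError('invalid_tags', 'tags');
  }
  return data;
}

// Hypothetical helper: return the thrown error, or fail if nothing was thrown.
function captureError(fn) {
  try {
    fn();
  } catch (e) {
    return e;
  }
  throw new Error('expected function to throw');
}

const err = captureError(() =>
  processUserInput({ email: 'notanemail', age: 25, name: 'A' })
);
console.log(err.code, err.field); // invalid_email email
```

Because captureError throws when the call does not throw, a missing error surfaces as a failure instead of a silent pass.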

V · Open the source

Run it yourself.

git clone https://github.com/watkins654/coding-agent-shootout.git
cd coding-agent-shootout
npm install
npm test                    # All 138 tests pass against reference solutions

# Run a verifier against an agent's output:
SUBJECT_PATH=/path/to/agent/output/index.js \
  npm run test:task 01-input-validation
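This page does not show how the verifier consumes SUBJECT_PATH. One plausible reading, sketched below, is that the test file loads the module named by the variable and falls back to the reference solution when it is unset. The helper name resolveSubjectPath and the fallback path are hypothetical, so check the repository for the actual mechanism.

```javascript
// Hypothetical helper: choose the module under test from the environment.
// SUBJECT_PATH is the variable from the command above; the fallback is an assumption.
function resolveSubjectPath(env, referencePath) {
  const fromEnv = env.SUBJECT_PATH;
  return fromEnv && fromEnv.trim() !== '' ? fromEnv : referencePath;
}

const subjectPath = resolveSubjectPath(process.env, './reference/index.js');
console.log(subjectPath);

// A verifier could then load the code under test with:
//   const subject = require(subjectPath);
```

Under this reading, the same suite can judge both the reference solution and an agent's output without any change to the tests.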