Artifact I · Coding-Agent Shootout
The Benchmark.
A public benchmark comparing the major coding agents on real engineering tasks. Pass means the generated code compiles, runs, and matches expected behavior.
II · Leaderboard · v2 task set · 2026-05-08
Real run data.
Not synthetic.
| Agent | Mode | Tasks passed | Tests passed | Pass rate |
|---|---|---|---|---|
| Claude Opus 4.7 | API · one-shot | 7 / 8 | 136 / 138 | 98.6% |
| Claude Sonnet 4.6 | API · one-shot | 6 / 8 | 134 / 138 | 97.1% |
| GPT-5.5 | Pending | · | · | · |
| Gemini 2.5 Pro | Pending | · | · | · |
Real capability spread: Opus 4.7 cleared task 08 (JSON Schema validator), while Sonnet 4.6 missed two of its tests on combinator semantics (allOf, and the oneOf zero-match case). Both still failed task 06 (LRU + TTL) on constructor input validation. GPT-5.5 and Gemini 2.5 Pro runs ship once their billing tiers are enabled.
III · Tasks
Each task targets
one named failure mode.
01 · Input validation hardening
Targets: Ambiguity-driven assumption errors, missing edge cases
02 · Request deduplication for concurrent calls
Targets: Subtle concurrency bugs that look fine on first read (sketched below)
03 · Token bucket rate limiter
Targets: Algorithm correctness under timing edge cases (sketched below)
04 · Refactor callback hell to async/await
Targets: Behavior preservation during refactor
05 · Retry with exponential backoff and circuit breaker
Targets: Defensive programming with timing constraints
06 · LRU cache with TTL
Targets: Interacting requirements (LRU + TTL), eviction-priority semantics, injected-clock discipline (sketched below)
07 · SQL parameterizer (tagged template)
Targets: Empty IN array → IN (NULL), nested-sql composition with placeholder renumbering (sketched below)
08 · JSON Schema-style validator
Targets: oneOf exactly-one semantics, allOf all-must-match, additionalProperties: false, multi-error collection with array-index paths (sketched below)
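For task 02, the canonical trap is releasing the in-flight entry at the wrong moment: clear it synchronously, or only on success, and a second concurrent caller fires a duplicate request. A minimal sketch of the pattern, assuming the wrapped function returns a promise; the `dedupe` name and JSON key function are illustrative, not the task's actual API.

```js
// Coalesce concurrent calls that share a key onto one pending promise.
function dedupe(fn, keyFn = (...args) => JSON.stringify(args)) {
  const inFlight = new Map(); // key -> pending promise
  return function (...args) {
    const key = keyFn(...args);
    if (inFlight.has(key)) return inFlight.get(key); // join the pending call
    const p = Promise.resolve(fn(...args)).finally(() => {
      // Clear only after settling -- on success AND on failure -- so a
      // rejected call can't leave a stale entry that blocks retries.
      inFlight.delete(key);
    });
    inFlight.set(key, p);
    return p;
  };
}

// Two concurrent calls with the same id share one underlying fetch.
const getUser = dedupe((id) => fetch(`/users/${id}`).then((r) => r.json()));
await Promise.all([getUser(1), getUser(1)]); // one network request, two resolutions
```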
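For task 03, the timing edge cases live in the refill arithmetic: tokens accrue fractionally between calls and must cap at capacity so idle time can't bank an oversized burst. A sketch with an injected clock so a test can step time deterministically; all names here are illustrative.

```js
class TokenBucket {
  constructor({ capacity, refillPerSec, now = () => Date.now() }) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.now = now;            // injected clock (milliseconds)
    this.tokens = capacity;    // start full
    this.last = now();
  }

  tryRemove(n = 1) {
    const t = this.now();
    // Fractional refill since the last call, capped at capacity.
    this.tokens = Math.min(
      this.capacity,
      this.tokens + ((t - this.last) / 1000) * this.refillPerSec
    );
    this.last = t;
    if (this.tokens < n) return false; // not enough budget: reject, never go negative
    this.tokens -= n;
    return true;
  }
}

// A fake clock makes the timing assertions exact:
let ms = 0;
const bucket = new TokenBucket({ capacity: 2, refillPerSec: 1, now: () => ms });
bucket.tryRemove(); bucket.tryRemove();        // drain the burst
console.assert(bucket.tryRemove() === false);  // empty
ms += 1000;
console.assert(bucket.tryRemove() === true);   // one token refilled
```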
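For task 06, the requirements interact: recency order, lazy TTL expiry, and a clock the verifier controls. A sketch under those assumptions, leaning on Map's insertion order for recency; the constructor guard echoes the validation category both agents failed in the run above, though the exact contract is assumed here.

```js
class LruTtlCache {
  constructor({ maxSize, ttlMs, now = () => Date.now() }) {
    // Constructor input validation -- the check category both agents missed.
    // The specific rules below are assumptions, not the task's real contract.
    if (!Number.isInteger(maxSize) || maxSize <= 0) {
      throw new TypeError('maxSize must be a positive integer');
    }
    if (!(ttlMs > 0)) throw new TypeError('ttlMs must be a positive number');
    this.maxSize = maxSize;
    this.ttlMs = ttlMs;
    this.now = now;            // injected clock
    this.map = new Map();      // key -> { value, expires }; insertion order = recency
  }

  get(key) {
    const entry = this.map.get(key);
    if (!entry) return undefined;
    if (entry.expires <= this.now()) { // expired: evict lazily on access
      this.map.delete(key);
      return undefined;
    }
    this.map.delete(key);              // delete + re-set moves key to most-recent
    this.map.set(key, entry);
    return entry.value;
  }

  set(key, value) {
    if (this.map.has(key)) {
      this.map.delete(key);            // overwrite refreshes recency and TTL
    } else if (this.map.size >= this.maxSize) {
      this.map.delete(this.map.keys().next().value); // evict least-recently-used
    }
    this.map.set(key, { value, expires: this.now() + this.ttlMs });
  }
}
```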
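For task 07, both named targets in one sketch: an empty IN array renders as IN (NULL), which stays valid SQL and matches no row, and a nested fragment's placeholders are renumbered to continue after the outer query's parameters. Postgres-style $n placeholders are assumed; this `sql` tag is illustrative, not the task's actual API.

```js
function sql(strings, ...exprs) {
  const params = [];
  let text = strings[0];
  for (let i = 0; i < exprs.length; i++) {
    const e = exprs[i];
    if (e && e.__sql) {
      // Nested fragment: splice its text in, shifting each $n past the
      // parameters already collected by the outer query.
      const offset = params.length;
      text += e.text.replace(/\$(\d+)/g, (_, n) => `$${Number(n) + offset}`);
      params.push(...e.params);
    } else if (Array.isArray(e)) {
      text += e.length === 0
        ? '(NULL)' // empty list: `IN ()` is a syntax error, `IN (NULL)` matches nothing
        : '(' + e.map((v) => `$${params.push(v)}`).join(', ') + ')';
      // (Array.prototype.push returns the new length -- the 1-based index.)
    } else {
      text += `$${params.push(e)}`;
    }
    text += strings[i + 1];
  }
  return { __sql: true, text, params };
}

// sql`... WHERE id IN ${[]}`     -> text ends in "IN (NULL)", params []
// sql`... WHERE id IN ${[7, 9]}` -> text ends in "IN ($1, $2)", params [7, 9]
```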
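For task 08, the combinator semantics Sonnet 4.6 tripped on: allOf requires every subschema to pass, while oneOf requires exactly one match, so zero matches fails just like two. A sketch of that check, assuming a validate(schema, value, path) helper that returns an error array; the error shape and codes are illustrative.

```js
function checkCombinators(schema, value, path, validate) {
  const errors = [];
  if (schema.allOf) {
    // allOf: every subschema must match; surface each subschema's errors.
    for (const sub of schema.allOf) {
      errors.push(...validate(sub, value, path));
    }
  }
  if (schema.oneOf) {
    // oneOf: exactly one subschema may match. Counting matches catches the
    // zero-match case -- the miss called out in the leaderboard notes.
    const matches = schema.oneOf.filter(
      (sub) => validate(sub, value, path).length === 0
    ).length;
    if (matches !== 1) {
      errors.push({
        path,
        code: matches === 0 ? 'one_of_no_match' : 'one_of_multiple_matches',
      });
    }
  }
  return errors;
}
```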
IV · Sample task: Input validation
What the verifier
actually looks like.
Every task ships with four files: the prompt, the starter code, the verifier suite, and a reference solution. The verifier suite is the judge. It runs the agent’s output and asserts on the contract. Below is a slice from tasks/01-input-validation/verifier/validation.test.js.
```js
// The subject under test is resolved via SUBJECT_PATH (see §V). The exact
// require mechanism is assumed here so the slice is self-contained.
const { processUserInput, ValidationError } = require(process.env.SUBJECT_PATH);

describe('processUserInput - validation errors', () => {
  test('rejects null input', () => {
    expect(() => processUserInput(null)).toThrow(ValidationError);
    try { processUserInput(null); } catch (e) {
      expect(e.code).toBe('invalid_input');
      expect(e.field).toBe('data');
    }
  });

  test('rejects malformed email (no @)', () => {
    const bad = { email: 'notanemail', age: 25, name: 'A' };
    // Guard first: without this, an implementation that never throws
    // would pass the try/catch below vacuously.
    expect(() => processUserInput(bad)).toThrow(ValidationError);
    try { processUserInput(bad); } catch (e) {
      expect(e.code).toBe('invalid_email');
      expect(e.field).toBe('email');
    }
  });

  test('rejects too many tags', () => {
    const tags = Array(11).fill('x');
    const bad = { email: 'a@b.com', age: 25, name: 'A', tags };
    expect(() => processUserInput(bad)).toThrow(ValidationError); // same guard
    try { processUserInput(bad); } catch (e) {
      expect(e.code).toBe('invalid_tags');
    }
  });
});
```

The agent doesn’t see this file. It sees the prompt and the starter code. Its output is then run against this suite. Pass means green tests. There is no LLM-as-judge. There is no judgment call.
V · Open the source
Run it yourself.
```sh
git clone https://github.com/watkins654/coding-agent-shootout.git
cd coding-agent-shootout
npm install
npm test
# All 138 tests pass against reference solutions

# Run a verifier against an agent's output:
SUBJECT_PATH=/path/to/agent/output/index.js \
  npm run test:task 01-input-validation
```