2026-05-07

Essay · 5 min read

Why I built an executable benchmark for coding agents

LLM-as-judge fails on code. Jest tests don't. The eval design behind the Coding-Agent Shootout.

evals · coding agents · methodology

When you ask Claude Code to fix a bug, the answer comes back looking right. Indented properly. Reasonable variable names. The function does something like what you asked. You skim it, paste it in, ship it, and three days later production catches fire because the agent silently swapped your null check for a falsy check and now legitimate empty strings are rejected as "missing."
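That swap is one character of diff and easy to miss in review. A minimal sketch of the two checks (the function and field names here are hypothetical, not from any real task):

```js
// Null/undefined check: only missing values are rejected.
function requireName(input) {
  if (input.name == null) {                // rejects null and undefined only
    throw new Error('name is missing');
  }
  return input.name;
}

// The "looks equivalent" falsy check an agent might quietly substitute.
function requireNameLoose(input) {
  if (!input.name) {                       // also rejects '', 0, false, NaN
    throw new Error('name is missing');
  }
  return input.name;
}

requireName({ name: '' });      // returns '' (the empty string is allowed)
requireNameLoose({ name: '' }); // throws (a legitimate empty string is rejected)
```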

This is the gap. "Looks right" is not "actually works."

Every existing benchmark for AI coding agents runs into the same wall. They're either too narrow to be useful (one-off problems with no production characteristics) or they grade with another LLM, which means the grade itself is suspect. LLM-as-judge fails on code in specific, predictable ways: it rewards plausible-looking code that compiles but fails edge cases. It under-weights subtle bugs the model itself would also have made. It can't tell whether the function actually returns the right value because it never runs the function.

So I built one that does.

The principle: the test is the spec

The Coding-Agent Shootout is a public benchmark comparing Claude Code, Codex CLI, Gemini CLI, Cursor, and Cline on real engineering tasks. Each task is four files:

  • prompt.md: the task description, identical across agents
  • starter/index.js: the initial code the agent receives
  • solution/index.js: a reference solution I write before the verifier
  • verifier/*.test.js: a Jest suite that determines pass/fail by running the agent's output

Pass means every test runs green. Fail means at least one test fails, the code doesn't import, or the output isn't valid JavaScript. There's no judgment call, no weighting, no LLM-as-judge handwave. The function either returns the right value for the right inputs or it doesn't.
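For readers who haven't written a Jest verifier, the mechanics are mundane on purpose. A minimal sketch, where slugify and the import path are illustrative rather than an actual Shootout task:

```js
// verifier/slugify.test.js: runs the code the agent produced and checks its outputs.
const { slugify } = require('../starter/index.js'); // the file the agent edited

test('lowercases and hyphenates', () => {
  expect(slugify('Hello World')).toBe('hello-world');
});

test('collapses repeated separators', () => {
  expect(slugify('a  --  b')).toBe('a-b');
});
```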

This isn't novel. It's how every real engineering team grades pull requests. The novelty is applying the same standard to AI coding agents at benchmark scale.

The calibration discipline

The hardest part of an eval framework isn't writing the eval. It's making sure the eval is correct.

For every task in the Shootout, the order is fixed:

  1. Write the natural-language prompt.
  2. Write the reference solution.
  3. Write the verifier suite.
  4. Run the verifier against the reference solution.
  5. All tests must pass. If any fail, the verifier is wrong, not the solution.

Step 5 is the calibration check. It catches the most common eval bug: verifier tests that don't actually express what the prompt asked for. If the prompt says "throw ValidationError with code: 'invalid_email'" and the verifier accepts any thrown error, my reference solution would pass even with a wrong code. The calibration check forces me to write the verifier with the same precision as the prompt.
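Concretely, the loose and calibrated versions of that check look like this. validateEmail and the ValidationError export are stand-ins for whatever names the prompt actually specifies:

```js
// verifier/validation.test.js (sketch)
const { validateEmail, ValidationError } = require('../starter/index.js');

test('rejects a malformed email with the exact error code the prompt asked for', () => {
  // Too loose: any thrown error passes, even one with the wrong code.
  // expect(() => validateEmail('not-an-email')).toThrow();

  // Calibrated: right error type, and the right code field.
  expect(() => validateEmail('not-an-email')).toThrow(ValidationError);
  try {
    validateEmail('not-an-email');
  } catch (err) {
    expect(err.code).toBe('invalid_email');
  }
});
```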

This mirrors the discipline used in real RLHF coding evals at DataAnnotation, Surge, Scale, and Anthropic's own model graders. Every grader gets calibrated against a golden answer. Without that, the eval is decoration.

Adversarial task design

Each task in the Shootout targets one named failure mode I've seen in real coding-agent work:

  • Input validation hardening: ambiguity-driven assumption errors, missing edge cases
  • Request deduplication: subtle concurrency bugs that "look fine" on first read
  • Token bucket rate limiter: algorithm correctness under timing edge cases
  • Callback-to-async refactor: behavior preservation during refactor
  • Retry + circuit breaker: defensive programming with timing constraints

The tasks aren't "implement this thing." They're "implement this thing and don't fall into trap X." The traps are designed in. The verifier tests them explicitly: happy path, edge cases, and one or two adversarial inputs that exploit the specific failure mode.
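For the request-deduplication task, for example, the adversarial input is simply concurrency. A sketch of that split, with dedupeRequests as a hypothetical interface rather than the actual task API:

```js
// verifier/dedup.test.js (sketch)
const { dedupeRequests } = require('../starter/index.js');

test('happy path: distinct keys each reach the underlying function', async () => {
  let calls = 0;
  const fetcher = dedupeRequests(async (key) => { calls += 1; return key; });
  await Promise.all([fetcher('a'), fetcher('b')]);
  expect(calls).toBe(2);
});

test('adversarial: concurrent calls with the same key share one in-flight request', async () => {
  let calls = 0;
  const fetcher = dedupeRequests(async (key) => { calls += 1; return key.toUpperCase(); });
  const [a, b] = await Promise.all([fetcher('user:1'), fetcher('user:1')]);
  expect(a).toBe('USER:1');
  expect(b).toBe('USER:1');
  expect(calls).toBe(1); // an implementation that "looks fine" on first read makes two calls here
});
```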

This is the part of the methodology that actually transfers from my day job. At DataAnnotation, I design adversarial tasks for coding-agent qualifications. The job is to find the edge of what the agent can do and document it. The Shootout takes that same skill and points it outward at every coding agent simultaneously.

What v1 found

The first run of v1 (5 tasks, 73 verifier tests) was a clean sweep for Claude Sonnet 4.6 via the API: 73/73 passing in one shot, no iteration. That's a useful baseline. It tells me the v1 task set was tractable for state-of-the-art models. It also tells me v1 wasn't hard enough to differentiate them.

So v2 added three adversarial tasks designed to probe specific failure modes a frontier model might miss.

What v2 found (2026-05-08 run)

The v2 task set runs 8 tasks and 138 verifier tests. Here's the first multi-model sweep, with both models run via the Anthropic API in one-shot mode:

  • Claude Opus 4.7: 7/8 tasks passed, 136/138 tests passed, 98.6% pass rate
  • Claude Sonnet 4.6: 6/8 tasks passed, 134/138 tests passed, 97.1% pass rate

Two interesting findings.

Both models failed task 06 on the same trivial gap. The task is an LRU + TTL cache. The implementations both models produced are correct: O(1) operations, correct LRU eviction order, correct TTL expiry, expirations counted separately from evictions. They both missed two tests on constructor input validation: rejecting a non-positive capacity or defaultTtlMs. The spec called for it explicitly. Both models read past it because the task's main thrust is the algorithm.
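The missed piece amounts to a few lines of guard code. Something like this, where the class and option names are illustrative of the task rather than copied from either model's output, and the exact error type is my choice, not the spec's:

```js
class LruTtlCache {
  constructor({ capacity, defaultTtlMs }) {
    // The checks both models skipped: the spec required rejecting
    // a non-positive capacity or defaultTtlMs.
    if (!Number.isInteger(capacity) || capacity <= 0) {
      throw new RangeError('capacity must be a positive integer');
    }
    if (typeof defaultTtlMs !== 'number' || defaultTtlMs <= 0) {
      throw new RangeError('defaultTtlMs must be a positive number');
    }
    this.capacity = capacity;
    this.defaultTtlMs = defaultTtlMs;
    this.entries = new Map(); // Map preserves insertion order, which backs the LRU bookkeeping
  }
  // get/set/eviction elided: both models got that part right
}
```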

This is the named failure mode I see all the time in real coding-agent work: agents skip "boring" boilerplate when the task feels algorithmic. The verifier doesn't care which part of the spec felt boring.

Sonnet 4.6 failed task 08 on combinator semantics. The task is a JSON-Schema-style validator. Sonnet's oneOf accepted values when zero sub-schemas matched (the canonical bug), and its allOf didn't run the value through every sub-schema. Opus 4.7 implemented both correctly, including the subtle "exactly one match" semantics for oneOf. This is the kind of capability spread you'd predict on paper: the more-capable model holds more interlocking constraints in mind.
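The semantics that separated the two models fit in a few lines. A sketch, where validate(schema, value) returning a boolean is an assumed interface, not the task's real one:

```js
// oneOf: exactly one sub-schema must match. Accepting zero matches is the
// canonical bug, and it's the one Sonnet's version shipped.
function matchesOneOf(subSchemas, value, validate) {
  const matchCount = subSchemas.filter((schema) => validate(schema, value)).length;
  return matchCount === 1;
}

// allOf: the value must be run through, and accepted by, every sub-schema.
function matchesAllOf(subSchemas, value, validate) {
  return subSchemas.every((schema) => validate(schema, value));
}
```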

GPT-5.5 and Gemini 2.5 Pro runs ship next, once those accounts have billing enabled. I'll re-run the sweep monthly with statistical confidence intervals.

Why this matters beyond a portfolio piece

Every team shipping AI features needs an eval flywheel: synthetic data, labeled traces, an LLM judge or verifier, an improvement loop. Most teams stop at "vibes-based testing" because writing real evals feels like overhead. It isn't. Eval discipline is the difference between an AI feature that ships and one that ships and stays in production.

If you're hiring an AI engineer, the question isn't whether they can call the Anthropic API. Of course they can. The question is whether they can build the eval that catches the regression the model will introduce next week.

I built the Shootout because that's the question I want my interviewers to ask.


The repo is at github.com/watkins654/coding-agent-shootout. v2 ships with 8 tasks and 138 verifier tests. The leaderboard, agent outputs, and full per-task run data are in the repo. Multi-agent monthly re-runs are next.