2026-05-07

Essay · 5 min read

The three failure modes I see in every coding agent

Design enough adversarial tasks for coding-agent qualifications and the same patterns surface across Claude Code, Codex CLI, and Gemini CLI.

coding agents · failure modes · red-teaming

I red-team coding agents for a living. The job is simple to describe: design tasks the agent will fail, document where it fails, and write the reference reasoning a grader model can use to score future attempts. The tasks change every week. The agents change every few months. The categories of failure don't.

After enough of these, three patterns surface across every coding agent I've worked with: Claude Code, Codex CLI, Gemini CLI, Cursor, and the rest. The agents differ in how often they make these errors and which ones they recover from best. They don't differ in whether all three failure modes show up. They always do.

If you're shipping a feature with a coding agent in the loop, these are the categories that cause the silent bugs in production.

1. Ambiguity-driven assumption errors

The agent receives a prompt that's underspecified in one specific way and silently picks the wrong interpretation. The code it produces compiles, runs, and looks reasonable. It just doesn't do what you meant.

The classic version is in input validation. You ask the agent: "Validate this user input. The age field must be 0–150." The agent writes a check for age >= 0 && age <= 150. Reasonable. But the spec didn't say what should happen for non-integer ages, for NaN, for Number.MAX_SAFE_INTEGER. The agent picks an interpretation (usually the most permissive) and ships.
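Concretely, the spec supports at least two readings, and nothing in the prompt picks between them. A minimal sketch, with hypothetical function names:

```ts
// Two defensible readings of "the age field must be 0-150". The prompt
// never says which one you meant, so the agent silently picks one.

// The reading the agent usually ships: permissive. Accepts 120.5.
function validateAgePermissive(age: number): boolean {
  return age >= 0 && age <= 150;
}

// The reading you probably meant: whole years only. Rejects 120.5 and NaN.
function validateAgeStrict(age: number): boolean {
  return Number.isInteger(age) && age >= 0 && age <= 150;
}
```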

A month later your support team gets tickets: users entering an age of 120.5 sail through validation, and downstream code that assumed whole-number ages starts misbehaving. You read the agent's code. Every line is technically correct. No bug an LLM judge would flag. The bug is the assumption the agent made about your spec.

Why this category matters: It's invisible in code review. Reviewers read the code, see it does something reasonable, and approve. The bug ships. You only catch it when production data starts misbehaving.

What catches it: Verifier tests that probe the boundary cases the prompt didn't specify. If you give the agent a spec, then write tests for cases the spec didn't cover, you find the assumptions immediately.
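In Jest, that probe is a handful of table-driven cases. A sketch, assuming the strict reading was the intended one; validateAge and the module path are hypothetical:

```ts
// assumption-probe.test.ts: boundary cases the spec never mentioned.
import { validateAge } from "./validate"; // hypothetical module path

test.each([
  [0, true],                        // inclusive lower bound
  [150, true],                      // inclusive upper bound
  [-1, false],                      // below range
  [151, false],                     // above range
  [120.5, false],                   // non-integer: the spec meant whole years
  [NaN, false],                     // never stated, must still be decided
  [Number.MAX_SAFE_INTEGER, false], // absurd but representable
])("validateAge(%p) -> %p", (input, expected) => {
  expect(validateAge(input)).toBe(expected);
});
```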

2. Instruction-following gaps

The prompt has multiple requirements. The agent satisfies most of them. It silently drops or partially implements one or two.

This is different from getting the spec wrong. The agent saw the requirement, started to implement it, and then either forgot it or did a half-version of it. The most common shape: the prompt says "throw ValidationError with code: 'invalid_email' and field: 'email'." The agent writes a ValidationError class, throws it correctly, sets the message, and forgets the field property. Or it hardcodes field: 'email' everywhere instead of setting it from whichever field actually failed.
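Here's the gap in miniature. A sketch with a hypothetical ValidationError class; the agent's half-version usually compiles just as cleanly, minus the field property:

```ts
// The full structured contract the prompt asked for. Hypothetical class;
// the agent's half-version looks identical except `field` never lands.
class ValidationError extends Error {
  constructor(
    message: string,
    public readonly code: string,
    public readonly field: string, // the property that silently goes missing
  ) {
    super(message);
    this.name = "ValidationError";
  }
}

// field is set from whichever input failed, not hardcoded to "email".
function reject(field: string): never {
  throw new ValidationError(`Invalid value for ${field}`, `invalid_${field}`, field);
}
```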

The agent's output passes a happy-path test (it throws, the message is right) and fails the structured-error contract that downstream code depends on. Code that uses error.field to highlight the offending form input now silently does nothing.

Why this category matters: It compounds. The downstream system was built expecting the contract. When the agent breaks the contract, every consumer of that contract becomes unreliable.

What catches it: Verifier tests that assert on the full contract, not just the happy path. For every error the agent should throw, test that all three properties (message, code, field) are present and correct. Every time.
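Sketched in Jest, reusing the hypothetical names from above:

```ts
// contract-probe.test.ts: assert the whole contract, not just "it threw".
import { validateEmail, ValidationError } from "./validate"; // hypothetical exports

test("invalid email throws the complete structured error", () => {
  let caught: unknown;
  try {
    validateEmail("not-an-email");
  } catch (err) {
    caught = err;
  }
  expect(caught).toBeInstanceOf(ValidationError); // fails loudly if nothing threw
  const err = caught as ValidationError;
  expect(err.message).toBeTruthy();       // where happy-path tests stop
  expect(err.code).toBe("invalid_email"); // the exact code
  expect(err.field).toBe("email");        // the property agents drop
});
```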

3. Reasoning failures under constraint

The agent can solve the problem in unconstrained mode. Given a constraint ("use only Node.js builtins, no external packages" or "don't change the file structure"), it sometimes silently violates the constraint.

I see this most often with library-import constraints. The prompt says no external packages. The agent writes code that imports lodash for _.debounce, or imports validator for its email regex. The verifier suite usually catches this one, because the test run errors on the unresolvable import; the agent never noticed. It just defaulted to the most idiomatic solution it had seen and ignored the constraint.
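For scale: the constrained solution was never far away. A debounce needs nothing outside the language. A sketch, not taken from any agent transcript:

```ts
// Debounce from builtins alone: no lodash required.
function debounce<Args extends unknown[]>(
  fn: (...args: Args) => void,
  waitMs: number,
): (...args: Args) => void {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return (...args: Args) => {
    clearTimeout(timer); // clearTimeout(undefined) is a legal no-op
    timer = setTimeout(() => fn(...args), waitMs);
  };
}
```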

The deeper version: the agent reasons through the problem correctly with the constraint in mind, then in the implementation step forgets the constraint and writes the unconstrained solution. The reasoning trace looks like the agent considered the constraint. The code doesn't reflect it.

Why this category matters: Constraints aren't decoration. They exist because of compliance requirements, dependency size, security posture, or production deployment limits. Silently violating them is the failure mode that ends contracts.

What catches it: Constraint-violation tests, written explicitly. If the constraint is "no external packages," the verifier tries to require the package and asserts that it isn't there. If the constraint is "preserve the public API," the verifier asserts that exported names match exactly. Make the constraint testable, then test it.
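As Jest tests, with the file path and export list as stated assumptions:

```ts
// constraint-probe.test.ts: make the constraint itself executable.
import { readFileSync } from "node:fs";
import { builtinModules } from "node:module";

test("no external packages: every import is a builtin or relative path", () => {
  const source = readFileSync("src/validate.ts", "utf8"); // hypothetical path
  // Crude scan: pull every ES-import specifier out of the source text.
  const specifiers = [...source.matchAll(/from\s+["']([^"']+)["']/g)]
    .map((m) => m[1]);
  for (const spec of specifiers) {
    const allowed =
      spec.startsWith("./") ||
      spec.startsWith("../") ||
      builtinModules.includes(spec.replace(/^node:/, ""));
    expect(allowed).toBe(true); // fails the moment lodash or validator appears
  }
});

test("public API preserved: exported names match exactly", async () => {
  const mod = await import("./src/validate"); // hypothetical module
  expect(Object.keys(mod).sort()).toEqual(["validateAge", "validateEmail"]);
});
```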

The pattern across all three

Every one of these failure modes has the same shape: the agent produces output that passes a casual review but fails a precise test. It's the same gap that separates LLM-as-judge scoring from executable verifiers. The agent's output is plausible. The agent doesn't notice. The reviewer doesn't notice. The verifier, if it exists, does.

If you're hiring an AI engineer, the candidate who can describe these failure modes by name and show you the verifier tests they'd write to catch each one is operating at a different level than the candidate who can call the API.

If you're shipping AI features, the eval flywheel that catches these is not optional. Vibes-based testing won't catch any of them.

How I'd build a project that catches all three

For any production AI feature, before I ship it I want:

  • A happy-path verifier that confirms the basic functionality works
  • An assumption probe: tests that hit the boundaries the spec didn't explicitly cover
  • A contract probe: tests that assert on the full structured contract, not just the happy result
  • A constraint probe: tests that explicitly fail if the agent violated a stated constraint
  • A regression suite: every bug found in production becomes a test that runs on every iteration

That's not a heavy framework. For a typical feature it's twenty Jest tests written in an afternoon. The cost is a few hours. The return is catching the failures before they ship.
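A skeleton of that file, every name in it hypothetical:

```ts
// validate.probes.test.ts: one afternoon, all five checks in one place.
import { validateAge, validateEmail, ValidationError } from "./validate";

describe("happy path", () => {
  test("accepts a plainly valid age", () => {
    expect(validateAge(30)).toBe(true);
  });
});

describe("assumption probe", () => {
  test("decides the non-integer case the spec left open", () => {
    expect(validateAge(120.5)).toBe(false);
  });
});

describe("contract probe", () => {
  test("bad email carries message, code, and field", () => {
    let caught: unknown;
    try {
      validateEmail("nope");
    } catch (err) {
      caught = err;
    }
    expect(caught).toBeInstanceOf(ValidationError);
    expect((caught as ValidationError).field).toBe("email");
  });
});

describe("constraint probe", () => {
  // the import-scanning and export-list tests from section 3 go here
});

describe("regression suite", () => {
  // every production bug becomes a named test that runs forever after
});
```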

The Coding-Agent Shootout is the public version of this discipline applied to the agents themselves. Every task there carries the first four checks above. The agents that pass clean are the ones I'd trust in production. The ones that need three iterations before passing the constraint probe are the ones I'd want a human in the loop for.

The agents are getting better fast. The categories of failure aren't going away. The engineers who build the verifiers stay employable through the entire transition.


This essay describes general patterns from public coding-agent work, not specific tasks from any client engagement.