Brandon Stanford — Applied AI Systems

I run a two-model review loop on every plan that touches my agent pipeline. Claude (Opus and Sonnet) plans and builds. Codex (GPT-5.4) reviews. I used to think this was belt-and-suspenders. After tracking what each model actually catches, it's clear the two aren't redundant — they fail at different things.

Over the last month Codex has caught ten bugs that Claude shipped confidently. Not a one-off. A pattern.

Same-model review is self-confirmation with extra steps. Cross-model review is a second nervous system.

What Codex reliably flags

Schema mismatches between caller and callee when both are in the diff. Claude will happily rename a field in the model definition and forget to update the three call sites two directories over. Codex notices. Usually within the first review pass.

Silent exception paths where an async function catches, logs, and returns None. Claude's eye glides past. Codex says: "what happens at line 347 when this returns None? The caller does not check." Almost always correct.

State-machine gaps. If a plan has four states and three transitions, Codex finds the one you forgot. Claude can design a clean four-state machine and then build a three-transition implementation without flinching.

Security-adjacent drift. CORS widened without a reason. A secret key re-read from env on every request. A PII field quietly added to a logging call. Claude will ship the convenience. Codex asks why.

What Claude reliably catches that Codex misses

Structural drift away from the plan. If the plan said "three helper functions in utils.ts" and the build produces one 200-line helper, Claude notices. Codex treats the diff as the thing to review, not the plan-to-diff delta.

Tone and voice in generated copy. Every time.

Dead code from earlier iterations that no longer has a caller. Claude reads the whole tree; Codex reads the hunks.

How I wire it

Plans are written by Opus. Builds are executed by Claude via the ADW pipeline. Codex runs a per-cluster review — every logical group of commits gets a Codex pass before merge. If Codex flags a critical, the build halts and the human (me) decides. If Codex flags a warning, the verifier loop has two shots at a self-fix before escalation.

The surprise: Codex is cheaper than I expected, because the reviews are on completed diffs, not full-codebase passes. A fifty-line review costs less than a coffee.

The lesson

If you only run one model across plan + build + review, you will miss the class of bug that that model is blind to. Every frontier model has a shape of failure. They do not overlap the way you'd hope. The fix is not a better single model. The fix is two of them, deliberately disagreeing.

Codex catches what Claude misses

What Codex reliably flags

What Claude reliably catches that Codex misses

How I wire it

The lesson