Model routing is the new model choice

The least useful question in AI right now is "which model is best?"

It feels like the right question because model releases still arrive like sports rankings. One model is better at coding. Another is better at design. Another is faster. Another is cheaper. Another handles long-running tasks better. Every week there is a new reason to redraw the leaderboard.

But if you are actually building with these systems, the leaderboard is not the operating model.

The operating model is routing.

The real question is: which model should own which part of the work?

Planning is different from execution. Execution is different from review. Review is different from verification. Long-running background work is different from a tight interactive loop. A model that is excellent at one of those can be mediocre at another.

Once you see that, the frontier-model conversation changes.

You stop asking for one winner.

You start designing a workflow.

Models fail in different shapes

I keep coming back to this: models do not just vary in capability. They vary in failure shape.

One model may be strong at planning but too willing to overcomplicate the implementation. Another may execute quickly but miss the broader design intent. Another may catch edge cases in review but be weaker at product taste. Another may be good enough for repetitive background work where speed matters more than elegance.

That matters because most AI workflows are not single tasks.

They are chains.

Take a simple build request. Someone asks for a feature. The system has to understand the ask, inspect the codebase, choose an approach, edit files, run tests, interpret failures, revise, and summarize what changed.

That is not one job.

It is at least five jobs wearing one label.

If one model handles all of it, you inherit that model's blind spots across the whole chain. Same strengths, same weaknesses, same confident mistakes.

That is why I like cross-model workflows.

Not because they are clever. Because disagreement is useful.

A planning model should not review itself. An execution model should not be the only judge of whether it followed the plan. A fast model can handle the low-risk loop, but a stronger model should probably own the ambiguous decisions.

The value is not "more AI."

The value is separation of duties.
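One way to make that separation concrete is to check role assignments before a run starts. The following is a minimal sketch, not a prescribed implementation: `RoleAssignment`, `separation_violations`, and the model names are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleAssignment:
    # Model identifiers are placeholders, not real model names.
    planner: str
    executor: str
    reviewer: str


def separation_violations(assignment: RoleAssignment) -> list[str]:
    """Return a description of each separation-of-duties violation."""
    problems = []
    # A planning model should not review itself.
    if assignment.reviewer == assignment.planner:
        problems.append("reviewer is the same model as the planner")
    # An execution model should not be the only judge of its own work.
    if assignment.reviewer == assignment.executor:
        problems.append("reviewer is the same model as the executor")
    return problems


assignment = RoleAssignment(planner="model-a", executor="model-b", reviewer="model-a")
print(separation_violations(assignment))
```

An empty list means the reviewer is independent of both the planner and the executor. The check says nothing about whether the reviewer is any good, only that it can disagree.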

The harness matters more than the headline

The surrounding harness decides whether model quality turns into reliable work.

That is the part people miss when they argue about raw model capability.

A better model inside a loose workflow can still produce sloppy output. A slightly weaker model inside a disciplined harness can produce better work because the harness constrains what happens next.

The harness is everything around the model:

the task classifier
the context it receives
the skills it can call
the tools it can touch
the tests it must run
the reviewer that checks it
the rules for escalation
the memory written after the task

In my own systems, that is where the leverage is.

The best results do not come from one model being brilliant. They come from the system knowing when to use speed, when to use depth, when to ask for review, and when to stop the automation entirely.

That is not a model benchmark.

That is an operating-model decision.
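The harness components listed earlier in this section can be written down as a checklist, so a missing piece is visible before any model runs. This is a rough sketch under my own naming: the `Harness` fields and `missing_components` are illustrative assumptions, not a standard interface.

```python
from dataclasses import dataclass, field, fields
from typing import Callable, Optional


@dataclass
class Harness:
    # One field per harness component; all names are hypothetical.
    task_classifier: Optional[Callable[[str], str]] = None
    context_builder: Optional[Callable[[str], str]] = None
    skills: list[str] = field(default_factory=list)
    tools: list[str] = field(default_factory=list)
    required_tests: list[str] = field(default_factory=list)
    reviewer: Optional[str] = None
    escalation_rule: Optional[Callable[[str], bool]] = None
    memory_writer: Optional[Callable[[str], None]] = None


def missing_components(harness: Harness) -> list[str]:
    """List the components that are unset (None) or empty."""
    return [f.name for f in fields(harness) if not getattr(harness, f.name)]


# A harness with only a classifier and one tool configured.
loose = Harness(task_classifier=lambda task: "feature", tools=["editor"])
print(missing_components(loose))
```

A harness that reports no reviewer, no required tests, and no escalation rule is the loose workflow described above, whichever model sits inside it.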

A new release should trigger a routing review

When a new model drops, the first instinct is to ask whether it replaces the old one.

I think the better instinct is to ask where it belongs.

Does it improve planning? Then it should sit earlier in the chain.

Does it run for longer without drifting? Then it may belong in background queues.

Does it produce cleaner code quickly? Then it may own execution after the plan is set.

Does it catch a different class of bugs? Then it belongs in review.

Does it have better visual taste? Then maybe it owns prototypes, not production edits.

The wrong move is swapping it into every slot because the benchmark score looks good.

That is how teams turn a release into churn.
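A routing review can be sketched as a per-role comparison: the new model takes a slot only where it clearly beats the incumbent on your own tasks. The function name, the score format, and the `min_gain` threshold below are all assumptions made for illustration.

```python
def routing_review(
    current: dict[str, str],
    scores: dict[str, dict[str, float]],
    candidate: str,
    min_gain: float = 0.05,
) -> dict[str, str]:
    """Return a new routing table, changing only roles where the candidate clearly wins.

    current: role -> model currently assigned to it.
    scores:  role -> {model: score on your own task set, higher is better}.
    """
    updated = dict(current)
    for role, incumbent in current.items():
        role_scores = scores.get(role, {})
        if candidate not in role_scores or incumbent not in role_scores:
            continue  # no evidence for this role, so leave it alone
        if role_scores[candidate] - role_scores[incumbent] >= min_gain:
            updated[role] = candidate
    return updated


# Placeholder model names and made-up scores, purely for illustration.
current = {"planning": "old-model", "execution": "old-model", "review": "other-model"}
scores = {
    "planning": {"old-model": 0.70, "new-model": 0.90},
    "execution": {"old-model": 0.80, "new-model": 0.82},
}
print(routing_review(current, scores, candidate="new-model"))
```

With these made-up numbers the new model takes planning only. Execution stays put because the gain is within noise, and review stays put because nothing was measured there.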

The better move is to keep a small personal benchmark set: real tasks, real outputs, real failures. Not synthetic scores. The tasks you actually run.

For me, that means asking:

Did it follow the repo's existing patterns?
Did it overbuild?
Did it catch its own mistakes?
Did it explain the tradeoff clearly?
Did it create more review work than it saved?
Did it finish the task or just make progress feel fast?
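Those questions can be recorded per task so that comparisons between models rest on more than memory. A minimal sketch, assuming each question is answered yes or no by whoever reviews the output; the field names are my own, and the two negative questions are rephrased so that True is always the good outcome.

```python
from dataclasses import dataclass, fields


@dataclass
class TaskResult:
    followed_repo_patterns: bool
    avoided_overbuilding: bool      # inverse of "Did it overbuild?"
    caught_own_mistakes: bool
    explained_tradeoff: bool
    saved_review_work: bool         # inverse of "more review work than it saved?"
    finished_task: bool


def pass_rates(results: list[TaskResult]) -> dict[str, float]:
    """Fraction of tasks answered 'yes' for each question."""
    if not results:
        return {}
    return {
        f.name: sum(getattr(r, f.name) for r in results) / len(results)
        for f in fields(TaskResult)
    }


# Two hand-graded tasks for one model (invented data).
graded = [
    TaskResult(True, True, False, True, True, True),
    TaskResult(True, False, False, True, False, True),
]
print(pass_rates(graded))
```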

Cost per token is not the metric I care about most.

Cost per solved problem is closer.

Cost per trusted change is better.
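The three metrics differ only in the denominator. Here is a small sketch of the arithmetic, with two assumptions of mine: a "solved problem" is a task whose output did what was asked, and a "trusted change" is one that passed review and shipped without rework. What goes into `total_cost` (tokens only, or tokens plus review time) is also left to the reader.

```python
from typing import Optional


def cost_per(total_cost: float, count: int) -> Optional[float]:
    """Cost divided by count, or None when there is nothing to divide by."""
    return total_cost / count if count > 0 else None


def cost_report(
    total_cost: float, tokens: int, solved: int, trusted: int
) -> dict[str, Optional[float]]:
    # Same spend, three denominators.
    return {
        "per_token": cost_per(total_cost, tokens),
        "per_solved_problem": cost_per(total_cost, solved),
        "per_trusted_change": cost_per(total_cost, trusted),
    }


# Invented numbers: 12.0 units of spend, 6 tasks solved, 4 changes trusted.
print(cost_report(total_cost=12.0, tokens=3_000_000, solved=6, trusted=4))
```

A model can look cheap per token and still be expensive per trusted change if much of its output needs rework.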

The operator move

The operator move is to build a router, not a loyalty program.

I do not want a workflow that says "we use one model for everything." I want one that says:

Planning goes here.
Implementation goes here.
Review goes here.
Visual work goes here.
Cheap repetitive work goes here.
High-risk decisions escalate.

That sounds more complex. In practice, it simplifies the work because each model has a job.
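As a sketch, the routing rules above fit in a lookup table plus one escalation check. Everything named here is hypothetical: the model identifiers, the `Task` fields, and the `ESCALATE` sentinel. Who or what receives an escalation (a stronger model, a human) is deliberately left open.

```python
from dataclasses import dataclass

ESCALATE = "escalate"  # sentinel: hand off instead of routing automatically

# Task kind -> model. Model identifiers are placeholders.
ROUTES = {
    "planning": "planner-model",
    "implementation": "executor-model",
    "review": "reviewer-model",
    "visual": "design-model",
    "repetitive": "cheap-fast-model",
}


@dataclass(frozen=True)
class Task:
    kind: str
    high_risk: bool = False


def route(task: Task) -> str:
    """Pick the model that owns this kind of work, or escalate."""
    if task.high_risk:
        return ESCALATE  # high-risk decisions escalate regardless of kind
    # Unknown kinds escalate too, rather than falling back to a default model.
    return ROUTES.get(task.kind, ESCALATE)


print(route(Task("planning")))
print(route(Task("implementation", high_risk=True)))
```

Sending unknown task kinds to escalation is a design choice in this sketch: a misclassified task surfaces as a visible handoff instead of silently landing on whichever model happens to be the default.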

The workflow becomes easier to reason about. The failures become easier to debug. When something goes wrong, I can ask whether the issue was classification, context, execution, review, or escalation.

That is much better than saying "the model was bad."
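To make that question answerable over time, each failure can be tagged with the stage that caused it and then tallied. A minimal sketch, assuming failures are labeled by hand after the fact; the `Stage` names mirror the five stages above, and the log format is my own invention.

```python
from collections import Counter
from enum import Enum


class Stage(Enum):
    CLASSIFICATION = "classification"
    CONTEXT = "context"
    EXECUTION = "execution"
    REVIEW = "review"
    ESCALATION = "escalation"


def failures_by_stage(log: list[tuple[str, Stage]]) -> list[tuple[Stage, int]]:
    """Count failures per stage, most frequent first."""
    return Counter(stage for _, stage in log).most_common()


# Invented failure log: (short description, stage blamed).
log = [
    ("missed existing helper module", Stage.CONTEXT),
    ("treated a refactor as a bug fix", Stage.CLASSIFICATION),
    ("stale schema in prompt", Stage.CONTEXT),
]
for stage, count in failures_by_stage(log):
    print(stage.value, count)
```

If most failures land on context or classification, swapping the model is unlikely to fix them.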

The practical takeaway

Stop asking which model is best.

Ask which part of the work each model should own.

The AI teams that get this right will not be the ones chasing every leaderboard. They will be the ones building harnesses that route work deliberately, review across failure shapes, and measure success against real tasks.

The model is the engine.

The router is the operating discipline.