Is anyone actually solving per-prompt model routing well yet, or are we all just eyeballing it?

Reddit r/AI_Agents News

Summary

The article explores the challenge of per-prompt model routing in AI agents, questioning whether anyone has effectively solved it. It points out that current practices rely on gut feeling, flat-rate plans reduce pressure to optimize, and a triage layer may introduce its own costs.

I run agents on real work every day. Content pipelines, code, the usual. And the thing I still can't do cleanly is decide which model handles which request. The standard advice is "be disciplined, use the cheap model for the cheap job, save the big one for hard stuff." Fine in theory. But I'm sitting there picking models by gut, and I'm meant to be a power user. If I can't route it reliably, the advice is quietly assuming the hard part is already solved. It isn't. It gets worse when you look closer. The unit isn't even task to model. One task contains cheap turns and expensive turns. A coding agent spends most of its turns reading files, running a command, summarising an error. Boring stuff a small local model like Qwen handles fine. Then one turn actually needs to reason about a tricky bug, and that's the turn you want the expensive model on. So the real granularity is prompt to model, evaluated per turn. Right now nobody routes at that level. You pick one model for the whole run and overpay on the easy turns or underperform on the hard ones. The obvious answer is a triage layer. A small model reads each prompt, scores how hard it is, forwards it to the cheapest model that'll clear the bar. Conceptually clean. I keep waiting for someone to nail it. Here's the bit I can't get past though. That triage model is itself a paid call on every single prompt. To route correctly it has to be good enough to understand the request, which means it isn't free and it isn't instant. So have you actually moved the cost, or just added a tollbooth in front of it? Maybe a tiny classifier is cheap enough that the savings dwarf it. Maybe the routing decision is genuinely harder than it looks and the cheap classifier sends hard prompts to the cheap model and you eat the quality hit. I don't know which way that math falls, and I haven't seen anyone show their working. My honest suspicion is the reason this layer doesn't properly exist yet is that flat-rate plans have removed the pressure. When you're on all-you-can-eat, nobody feels the per-prompt price, so nobody builds the thing that optimises it. The day those plans go metered, routing stops being a nice-to-have and becomes the product. So I'm asking the people who actually build this stuff. Is anyone routing per prompt in production and getting it right? What does your triage layer cost you, and does it earn its keep once you count its own calls? Or is per-prompt auto-routing a worse-is-better trap and we're all better off just picking a model and living with it?
Original Article

Similar Articles

@tomas_hk: yes it is have written our learnings here:

X AI KOLs Following

A comprehensive guide explaining model routing as a technique to intelligently select the best AI model per request to optimize cost, quality, and latency, contrasting it with AI gateways and emphasizing its importance for agentic AI workloads.

No One Fits All: From Fixed Prompting to Learned Routing in Multilingual LLMs

arXiv cs.CL

Researchers from National Taiwan University propose replacing fixed translation-based prompting strategies in multilingual LLMs with lightweight learned classifiers that route each instance to either native or translation-based prompting. Their analysis across 10 languages and 4 benchmarks shows no single strategy is universally optimal, with translation benefiting low-resource languages most, and the learned routing achieving statistically significant improvements over fixed strategies.

Can prompting reduce AI sycophancy or is it mostly model behavior?

Reddit r/artificial

A user explores whether prompt engineering can reduce AI sycophancy in models like Gemini, ChatGPT, and Claude, or whether it's fundamentally a model alignment issue. The discussion touches on differences between models in handling disagreement and objective criticism.