Routing agent work across 4 LLM tiers: orchestrator, advisor, deep reasoning, premier

Reddit r/AI_Agents News

Summary

The author shares a practical 4-tier LLM routing stack for agent work, where a fast orchestrator handles most requests and only escalates to expensive models when deep reasoning is required, significantly improving cost and interactivity.

I run a 4-tier LLM routing stack for my agent work. Most calls hit a cheap orchestrator and never escalate. The expensive models only fire when the orchestrator decides the task needs them.

The core idea

Most agent calls do not need a frontier model. They need a fast model for routing and classification, and a stronger model when actual reasoning is required. Matching model depth to task depth made more difference to both cost and loop feel than picking a smarter single model.

Speed was the real bottleneck for interactive agent loops. A supervisor that takes 10+ seconds per decision makes the whole agent feel sluggish even when every individual answer is excellent. At 2-5s per orchestrator decision the loop flows, and that changes how usable the system feels day to day.

The stack

Intelligence scores are from the Artificial Analysis Intelligence Index (fetched 2026-06-20).

| Tier | Model | AA Index | Speed | Role |
|---|---|---|---|---|
| Orchestrator | DeepSeek V4 Flash | ~40 | 2-5s | Routing, triage, classification |
| Primary advisor | GLM-5.2 | ~51 | 7-8s | Strategic analysis |
| Deep reasoning | GLM-5.2 (max effort) | ~51 | 24-72s | Hard problems |
| Premier | Opus 4.8 | ~56 | 10-30s | Sanitized-only, high-stakes |

What each tier does in practice

Orchestrator: classifies the task, decides whether it can answer directly, and routes anything harder up. Most calls start and end here. At 2-5s it never makes the loop feel like it is waiting.

Primary advisor: code review reasoning, plan critique, bounded analysis. The orchestrator escalates here when something needs real but not deep reasoning.

Deep reasoning: multi-step reasoning, novel synthesis, no clear decomposition. Same model family as the advisor, but cranked up to max effort. Roughly 18% of calls hit this tier.

Premier: high-stakes, irreversible, or correctness-critical decisions, and only on sanitized inputs. Gated hard. The 4% of calls that hit premier are deliberate, not automatic.

Routing pattern

The routing logic is straightforward.
The orchestrator does a cheap classification pass and emits a tier decision:

    def route(request):
        # Cheap classification pass: the orchestrator picks a tier label.
        tier = orchestrator.classify(request)
        if tier == "direct":
            return orchestrator.answer(request)
        if tier == "advisor":
            return glm_standard.answer(request)
        if tier == "deep":
            return glm_max_effort.answer(request)
        if tier == "premier":
            # Premier only ever sees sanitized input.
            clean = sanitize(request)
            return opus.answer(clean)
        # Fail loudly on an unexpected label instead of silently returning None.
        raise ValueError(f"unknown tier: {tier!r}")

The classification prompt defines the tiers and the escalation rules. The key rule: default to the cheapest tier that can plausibly handle the task, and escalate only on multi-step reasoning or novel synthesis. When unsure, escalate one tier up. The orchestrator runs this prompt on every incoming request. The fix for over-escalation is almost always in this prompt, not in the model.

Current distribution after tuning: roughly 78% direct or advisor, 18% deep, 4% premier, across a few thousand routed requests over 6 weeks. It started closer to 60/40.

The hardest tuning problem was the orchestrator confusing input length with task complexity. A 2000-word request that is really just "summarize this" does not need deep reasoning. The fix was defaulting everything to the cheapest tier and escalating only on an explicit need for reasoning, not on how much text the request contains.

What routing strategies are others running in their agent setups? Task-type tiering? Confidence thresholds? Something else?
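For anyone who wants to experiment with the escalation rule described above (default to the cheapest plausible tier, go one tier up when unsure), here is a minimal Python sketch. It is not the author's classification prompt or code. `Classification`, the `unsure` flag, and `resolve_tier` are hypothetical names, and two behaviors are assumptions: a malformed label is treated as an unsure "direct", and the one-tier bump never lands on premier automatically, since the post describes that tier as deliberate. Request length is intentionally not an input.

```python
from dataclasses import dataclass

# Tiers ordered from cheapest to most expensive, as in the stack table.
TIERS = ["direct", "advisor", "deep", "premier"]


@dataclass
class Classification:
    tier: str             # label proposed by the orchestrator
    unsure: bool = False  # hypothetical low-confidence flag


def resolve_tier(c: Classification) -> str:
    """Turn a raw classification into a final tier decision."""
    if c.tier not in TIERS:
        # Assumption: a malformed label is treated as an unsure "direct".
        c = Classification("direct", unsure=True)
    if not c.unsure:
        return c.tier
    # "When unsure, escalate one tier up", capped at "deep" so that
    # uncertainty alone never promotes a request into premier.
    idx = TIERS.index(c.tier)
    bumped = min(idx + 1, TIERS.index("deep"))
    # max() keeps a label that is already at or above the cap unchanged.
    return TIERS[max(idx, bumped)]
```

A confident "direct" stays where it is, an unsure "advisor" moves up to "deep", and a request reaches "premier" only when the classifier names that tier explicitly.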
