$16 refactor, 400 steps, 95% routed to open MoE

Reddit r/LocalLLaMA 05/23/26, 03:33 PM Tools

routing-layer cost-optimization moe vllm tool-calling open-source hybrid-inference

Summary

A developer built a routing layer on vLLM that sends simple agent steps to a cheap open-source MoE model (21B active) and hard steps to Opus, cutting the cost of a 400-step refactor to $15.60 with a 93.4% success rate.

Got tired of $160 Opus bills so I spent a weekend wiring up a routing layer on vLLM 0.8 (2xA100, enable_auto_tool_choice). Getting the tool call parser to cooperate took longer than the actual routing logic. Once it worked though, easy agent steps go to the 21B active MoE and hard steps get Opus. Hunyuan Hy3 preview handled 380 of 400 steps on a 12k-line Python repo at ~$0.02 each ($7.60). Opus covered the remaining 20 at $0.40 each ($8), so $15.60 all in. I set reasoning to no_think on routine steps, which cut token spend by roughly 30%. Final success rate was 93.4%. DeepSeek V4 hit similar accuracy but ran about 2x slower on search loop steps. The 14-file circular import refactor is where it fell apart. Kept hallucinating module paths that didn't exist. Tencent reports 99.99% step success over 495-step workflows in production, and honestly that tracks for straightforward calls, but tangled dependency graphs still need Opus.
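For anyone who wants to check the math, here is a rough sketch of the blended cost, plus a toy routing rule. The per-step prices and step counts come from the post. Everything else is an assumption for illustration: the `Step` class, the `files_touched` signal, the `max_easy_files` threshold, and the `choose_model` function are hypothetical, and the post does not say how steps were actually classified as easy or hard.

```python
from dataclasses import dataclass

# Per-step prices in cents, from the post (~$0.02 for the MoE, $0.40 for Opus).
# Integer cents avoid floating-point noise in the totals.
MOE_CENTS_PER_STEP = 2
OPUS_CENTS_PER_STEP = 40


@dataclass
class Step:
    description: str
    files_touched: int  # hypothetical difficulty signal, not from the post


def choose_model(step: Step, max_easy_files: int = 3) -> str:
    """Toy rule (an assumption): steps touching few files go to the MoE."""
    return "moe" if step.files_touched <= max_easy_files else "opus"


def blended_cost_cents(moe_steps: int, opus_steps: int) -> int:
    """Total cost in cents for a run split between the two models."""
    return moe_steps * MOE_CENTS_PER_STEP + opus_steps * OPUS_CENTS_PER_STEP


total = blended_cost_cents(380, 20)      # the split reported in the post
all_opus = blended_cost_cents(0, 400)    # same 400 steps, Opus only

print(f"blended:  ${total / 100:.2f}")
print(f"all Opus: ${all_opus / 100:.2f}")
print(f"routed to MoE: {380 / 400:.0%}")
```

With the post's numbers this gives $15.60 blended against $160.00 for an all-Opus run, which lines up with the bill mentioned at the top. A real router would need a better difficulty signal than a file count, since the circular import case shows where a cheap model starts guessing.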


Similar Articles

Split my agent into a cheap router model and a premium synthesis model, bill dropped about 75%

Reddit r/AI_Agents

A developer splits their AI agent's LLM calls into a cheap router model (GPT-OSS 120B) for tool-picking and a premium model (gpt-5.4) for synthesis, cutting costs by ~78% while maintaining output quality.

6 weeks daily-driving an open-source desktop agent shell with a 3-model split (Haiku triager → Sonnet reviewer → Opus executor). Real cost numbers + what broke.

Reddit r/AI_Agents

A 6-week real-world experiment using an open-source desktop agent shell with a three-model split (Haiku triager, Sonnet reviewer, Opus executor) reports a 64% cost reduction and details failure modes like context bloat and runaway sub-agents.

I built LEMoE: A stateless, lightweight Mixture of Experts (MoE) router for local LLMs. Open-source and looking for feedback!

Reddit r/ArtificialInteligence

LEMoE is an open-source, stateless Mixture of Experts (MoE) router that acts as an API proxy to route prompts to specialized LLMs, featuring cascading contextual routing and silent self-correction.

my agent bill went from $200 a week to $40 when I stopped running Opus on every subtask

Reddit r/AI_Agents

A developer shares how they reduced their AI agent's weekly cost from $200 to $40 by routing simple subtasks to cheaper models like DeepSeek V4 Pro and Tencent Hunyuan while keeping complex reasoning on Opus 4.7, achieving comparable output quality for most work.

@hooeem: https://x.com/hooeem/status/2062266452921491934

X AI KOLs Timeline

A guide explaining how to make agentic workflows up to 462x cheaper by compiling fixed procedures into smaller fine-tuned models instead of repeatedly prompting frontier models.
