my agent bill went from $200 a week to $40 when I stopped running Opus on every subtask

Reddit r/AI_Agents News

Summary

A developer shares how they reduced their AI agent's weekly cost from $200 to $40 by routing simple subtasks to cheaper models like DeepSeek V4 Pro and Tencent Hunyuan while keeping complex reasoning on Opus 4.7, achieving comparable output quality for most work.

I built an agent that converts research papers into slide decks. It chains together a few steps: extract key findings, build an outline, write slide content, query an image search tool, format everything into XML for a presentation library. I wired every step to Opus 4.7 because that's what I knew worked.

A single paper to deck run burns about 2 to 3 million tokens across all the steps. Opus 4.7 runs $5 per million input and $25 per million output per Anthropic's current rate card, so a typical run lands somewhere around $20 to $30 depending on how many figures the paper has. My last full week of running this thing on pure Opus, the bill came to about $211. One particularly long paper with 47 figures cost me around $34 for a single run, which is when I finally snapped and actually audited where the tokens were going. More than half was spent on rote work: writing slide bullet points, building image search queries, translating a final outline into presentation XML. Nothing that demands frontier reasoning.

I moved the execution layer to DeepSeek V4 Pro and it handled the drafting and tool calls cleanly. After a few days I also dropped in Tencent Hunyuan Hy3 preview on the same steps. At roughly $0.59 per million output tokens on Tencent Cloud versus Opus 4.7 at $25 per million (both per the providers' published rate cards), it's just obviously cheaper. My last week on the tiered setup, total spend was about $41. I ran a blind comparison on five decks from the same batch of papers and my PI couldn't tell which ones used Opus versus the cheap tier, which honestly surprised me a little.

The tool calling was the part I expected to break first. It held up. According to OpenRouter rankings the model currently sits at number one by tool call volume, which tracks with what I saw in my own MCP loops: well formed function arguments, no schema drift across multi turn calls.

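For anyone who wants to sanity check the per run numbers, here is a rough Python sketch of the arithmetic. The rates are the ones quoted in the post; the 2.0M input / 0.5M output split is purely an assumption for illustration (the post only gives the 2 to 3 million total), and the function name is made up.

```python
# Rates quoted in the post: Opus 4.7 at $5 per million input tokens
# and $25 per million output tokens.
OPUS_INPUT_PER_M = 5.00
OPUS_OUTPUT_PER_M = 25.00
# Output rate quoted in the post for the cheap tier on Tencent Cloud.
CHEAP_OUTPUT_PER_M = 0.59


def run_cost(input_tokens: int, output_tokens: int,
             input_per_m: float = OPUS_INPUT_PER_M,
             output_per_m: float = OPUS_OUTPUT_PER_M) -> float:
    """Dollar cost of one run, given token counts and per-million rates."""
    return (input_tokens * input_per_m + output_tokens * output_per_m) / 1_000_000


# ASSUMED split, not from the post: 2.0M input tokens, 0.5M output tokens.
opus_run = run_cost(2_000_000, 500_000)
print(f"Opus run: ${opus_run:.2f}")  # prints "Opus run: $22.50"

# Output-side price gap between the two quoted rates (about 42x).
print(f"Output price ratio: {OPUS_OUTPUT_PER_M / CHEAP_OUTPUT_PER_M:.1f}x")
```

With that assumed split the run lands inside the $20 to $30 range from the post, and most of the cost comes from output tokens, which is why moving the output heavy drafting steps matters most.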
That said, when I pointed it at a paper with dense mathematical proofs and asked it to reconstruct the reasoning chain for the slides, the output was shallow and missed key steps. For that kind of work Opus is still worth every cent.

My routing right now is hardcoded per step. If the subtask involves comprehension of novel arguments or architectural decisions, Opus handles it. Everything else goes to DeepSeek or the cheaper MoE model depending on which one I'm testing that week.

I'd like to make the routing dynamic eventually, but my first attempt at a prompt complexity classifier was a mess. It kept letting through papers that looked like standard lit reviews but had dense notation buried in the methods section, and those are exactly the ones where the cheap tier produces shallow output. For now the manual tagging works and I don't trust myself to build a classifier that catches those edge cases reliably.
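A minimal sketch of what hardcoded per step routing with a manual tag could look like in Python. This is not the actual code from the post: the step names, the `dense_math` flag, and the choice of which steps count as frontier work are all assumptions for illustration.

```python
from enum import Enum


class Tier(Enum):
    FRONTIER = "frontier"  # Opus in the post
    CHEAP = "cheap"        # DeepSeek or the cheaper MoE model in the post


# HYPOTHETICAL step names, loosely following the pipeline described in the post.
# Which steps need the frontier tier is an assumption, not something the post states.
STEP_TIER = {
    "extract_findings": Tier.FRONTIER,
    "build_outline": Tier.FRONTIER,
    "write_slide_content": Tier.CHEAP,
    "image_search_query": Tier.CHEAP,
    "format_xml": Tier.CHEAP,
}


def pick_tier(step: str, dense_math: bool = False) -> Tier:
    """Return the tier for a step; dense_math is a manual per-paper tag."""
    if dense_math:
        # Manually tagged papers go to the frontier tier for every step (assumed policy).
        return Tier.FRONTIER
    # Unknown steps fall back to the frontier tier: overpaying is safer than shallow output.
    return STEP_TIER.get(step, Tier.FRONTIER)


print(pick_tier("format_xml"))                   # prints "Tier.CHEAP"
print(pick_tier("format_xml", dense_math=True))  # prints "Tier.FRONTIER"
```

Keeping the table static and the tag manual avoids the classifier failure mode described above: a paper only reaches the cheap tier if a human has not flagged it.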