6 weeks daily-driving an open-source desktop agent shell with a 3-model split (Haiku triager → Sonnet reviewer → Opus executor). Real cost numbers + what broke.
A 6-week real-world experiment using an open-source desktop agent shell with a three-model split (Haiku triager, Sonnet reviewer, Opus executor) reports a 64% cost reduction and details failure modes like context bloat and runaway sub-agents.
A pattern I've been running for ~6 weeks on personal infra and want to share the actual cost + failure-mode numbers. Not theoretical. The whole thing runs inside an open-source desktop agent app I built (link in the OP comment per sub rules) — extension ecosystem is what makes the split easy. ## The split - **Triager (Haiku 4.5)** — runs first on every incoming task. Classifies into `route_directly`, `needs_review`, `block`. Returns structured JSON, no side effects. Cheap, batchable, gets ~70% of decisions right alone. - **Reviewer (Sonnet 4.6)** — only invoked when the triager flagged `needs_review` (~25% of traffic) OR when the parent agent's confidence is low. Does a "second opinion" pass on Haiku's classification before any side-effect runs. Catches ~80% of Haiku's borderline calls. - **Executor (Opus parent)** — the planner that orchestrates. Plans, decides which sub-agent runs, approves side effects, handles the user-facing reply. Always runs. All three are Anthropic models. Same OAuth token covers all three under a single Claude Pro/Max subscription. No vendor switching, no API-key juggling. The desktop shell pipes them through one harness so the sub-agent definitions are plain markdown files, not code. ## What it's running on - Personal email triage: incoming emails get triaged → labelled, archived, or surfaced to me. Reviewer catches financial / job-offer emails Haiku would have archived. - Daily news brief: triager filters 200+ feed items down to 15; reviewer ranks; executor synthesises. - Reddit / Gmail / Sheets ops (the boring side-effect stuff). - Desktop coding sessions — I switch from terminal to the desktop UI for long-thinking tasks where I want to watch the tool-call timeline scroll. ## Cost numbers (last 6 weeks, my actual bill) Without the split, everything ran on Opus → **$87/month** in tokens. With the split: - Haiku takes ~75% of total tokens at <5% the per-token cost - Sonnet takes ~20% at ~30% the per-token cost - Opus takes ~5%, the planning / approval surface Net: **~$31/month**. 64% drop. Same user-perceived quality, sometimes *better* because the reviewer catches Haiku's misfires before they execute. ## What broke 1. **Triager context bloat.** Haiku is cheap per token but doesn't get cheap per task if you blindly hand it the full inbox. I cap each triage call at 500 input tokens of email body + headers. Anything longer, the parent summarises first. 2. **Reviewer reflex.** Early on, the reviewer would override Haiku's correct calls ~15% of the time because Sonnet defaults to "more cautious." Fix: explicit prompt — "your role is to spot misclassification, not re-do the call from scratch — agree with Haiku unless you have a concrete reason." 3. **Approval gate latency.** The executor pauses on side effects (delete, send, label). When I'm on the phone, the approval-button round-trip via the Telegram bridge adds 5–30 s. Acceptable for email; broke real-time chat. Now: for real-time turns, sub-agents have pre-approved permission sets so the executor doesn't have to ask. 4. **Sub-agent runaway.** A Haiku run once recursed into 8 sub-agent calls before the parent saw it. Cap: `agentMaxTurns: 16` and hard kill on token-pool overflow. ## Why the harness matters The split is easy to *describe* but a pain to wire from scratch — you need OAuth that covers all three model tiers, sub-agent isolation, approval gates, streaming through nested agents. I'm running this inside a desktop app that ships the wiring as defaults: drop two markdown files into `~/.pi/agent/agents/`, the parent agent picks them up, done. That's what made this experiment cheap enough to actually run. ## What I'd change next - Switch the triager from Haiku 4.5 to a fine-tuned local 7B for the deterministic classes. Most of the email categories are stable; finetuning has been on the TODO for a month. - Move from "reviewer escalates to executor on disagreement" → "reviewer can write a one-paragraph dissent that the executor reads in the planning step." Better than binary override.
A developer shares how they reduced their AI agent's weekly cost from $200 to $40 by routing simple subtasks to cheaper models like DeepSeek V4 Pro and Tencent Hunyuan while keeping complex reasoning on Opus 4.7, achieving comparable output quality for most work.
A developer shares an architectural pattern to manage context window bloat in continuous Anthropic agent loops, using KV caching, dynamic tool schema loading, and decoupling executor/advisor roles with Claude 3.5 Sonnet and Claude 3 Opus.
A 13-week recap of using OpenClaw as a daily AI agent on a Raspberry Pi, highlighting strengths like cron-based automation and memory curation, and pain points like model config issues and subagent orchestration.
Practical tips for running Anthropic's Claude Opus autonomously for hours or days, such as using auto mode, dynamic workflows, and self-verification; also references the SWE-Marathon benchmark for long-horizon software tasks.
The author describes a setup where different AI models are assigned to specific roles (planning, coding, review) to reduce API costs for a 24/7 autonomous engineering team, and shares common failure points like model wandering and hallucinated ownership.