6 weeks daily-driving an open-source desktop agent shell with a 3-model split (Haiku triager → Sonnet reviewer → Opus executor). Real cost numbers + what broke.

Reddit r/AI_Agents Tools

Summary

A 6-week real-world experiment using an open-source desktop agent shell with a three-model split (Haiku triager, Sonnet reviewer, Opus executor) reports a 64% cost reduction and details failure modes like context bloat and runaway sub-agents.

A pattern I've been running for ~6 weeks on personal infra and want to share the actual cost + failure-mode numbers. Not theoretical. The whole thing runs inside an open-source desktop agent app I built (link in the OP comment per sub rules) — extension ecosystem is what makes the split easy. ## The split - **Triager (Haiku 4.5)** — runs first on every incoming task. Classifies into `route_directly`, `needs_review`, `block`. Returns structured JSON, no side effects. Cheap, batchable, gets ~70% of decisions right alone. - **Reviewer (Sonnet 4.6)** — only invoked when the triager flagged `needs_review` (~25% of traffic) OR when the parent agent's confidence is low. Does a "second opinion" pass on Haiku's classification before any side-effect runs. Catches ~80% of Haiku's borderline calls. - **Executor (Opus parent)** — the planner that orchestrates. Plans, decides which sub-agent runs, approves side effects, handles the user-facing reply. Always runs. All three are Anthropic models. Same OAuth token covers all three under a single Claude Pro/Max subscription. No vendor switching, no API-key juggling. The desktop shell pipes them through one harness so the sub-agent definitions are plain markdown files, not code. ## What it's running on - Personal email triage: incoming emails get triaged → labelled, archived, or surfaced to me. Reviewer catches financial / job-offer emails Haiku would have archived. - Daily news brief: triager filters 200+ feed items down to 15; reviewer ranks; executor synthesises. - Reddit / Gmail / Sheets ops (the boring side-effect stuff). - Desktop coding sessions — I switch from terminal to the desktop UI for long-thinking tasks where I want to watch the tool-call timeline scroll. ## Cost numbers (last 6 weeks, my actual bill) Without the split, everything ran on Opus → **$87/month** in tokens. With the split: - Haiku takes ~75% of total tokens at <5% the per-token cost - Sonnet takes ~20% at ~30% the per-token cost - Opus takes ~5%, the planning / approval surface Net: **~$31/month**. 64% drop. Same user-perceived quality, sometimes *better* because the reviewer catches Haiku's misfires before they execute. ## What broke 1. **Triager context bloat.** Haiku is cheap per token but doesn't get cheap per task if you blindly hand it the full inbox. I cap each triage call at 500 input tokens of email body + headers. Anything longer, the parent summarises first. 2. **Reviewer reflex.** Early on, the reviewer would override Haiku's correct calls ~15% of the time because Sonnet defaults to "more cautious." Fix: explicit prompt — "your role is to spot misclassification, not re-do the call from scratch — agree with Haiku unless you have a concrete reason." 3. **Approval gate latency.** The executor pauses on side effects (delete, send, label). When I'm on the phone, the approval-button round-trip via the Telegram bridge adds 5–30 s. Acceptable for email; broke real-time chat. Now: for real-time turns, sub-agents have pre-approved permission sets so the executor doesn't have to ask. 4. **Sub-agent runaway.** A Haiku run once recursed into 8 sub-agent calls before the parent saw it. Cap: `agentMaxTurns: 16` and hard kill on token-pool overflow. ## Why the harness matters The split is easy to *describe* but a pain to wire from scratch — you need OAuth that covers all three model tiers, sub-agent isolation, approval gates, streaming through nested agents. I'm running this inside a desktop app that ships the wiring as defaults: drop two markdown files into `~/.pi/agent/agents/`, the parent agent picks them up, done. That's what made this experiment cheap enough to actually run. ## What I'd change next - Switch the triager from Haiku 4.5 to a fine-tuned local 7B for the deterministic classes. Most of the email categories are stable; finetuning has been on the TODO for a month. - Move from "reviewer escalates to executor on disagreement" → "reviewer can write a one-paragraph dissent that the executor reads in the planning step." Better than binary override.
Original Article

Similar Articles