@gneubig: We've found this sort of "sidekick" architecture to be very effective at cutting LLM spend because it allows you to do …
Summary
Graham Neubig shares a sidekick architecture for reducing LLM costs by delegating simple tasks to a smaller agent, with a 200-line example using the OpenHands SDK. This approach is also used in Cognition's Devin Fusion hybrid-model harness.
Cached at: 06/30/26, 07:41 AM
We’ve found this sort of “sidekick” architecture to be very effective at cutting LLM spend because it allows you to do context control and not spend expensive tokens on simple tasks. Here’s a 200-line example of how to do it in the OpenHands SDK :) https://gist.github.com/neubig/412ab8df8e6fd0b2bdf10602d77f9d86…
Cognition (@cognition): Devin Fusion uses a hybrid-model harness built around two ideas:
First, a “sidekick” agent: a smaller agent runs in parallel with the frontier agent. The frontier agent delegates work, monitors progress, and keeps ownership of planning, ambiguity, and final review.
This lets…
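The delegation pattern described in the two posts above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the code from the linked gist and not the OpenHands SDK API: `Model`, `Task`, `run_with_sidekick`, the `simple` flag, and the per-token prices are all hypothetical, and the "models" are stub functions so the example runs offline. It shows the small agent working in parallel on delegated tasks with only the task description as context, while the frontier agent keeps the hard task and the final review.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass
class Model:
    """Hypothetical wrapper around an LLM call that tracks token usage."""
    name: str
    cost_per_token: float          # made-up price, for illustration only
    run: Callable[[str], str]      # stub standing in for a real LLM call
    tokens_used: int = 0

    def call(self, prompt: str) -> str:
        # Whitespace-separated words are a crude stand-in for real tokens.
        self.tokens_used += len(prompt.split())
        return self.run(prompt)

    @property
    def spend(self) -> float:
        return self.tokens_used * self.cost_per_token


@dataclass
class Task:
    description: str
    simple: bool  # assumption: a real frontier agent would decide this itself


def run_with_sidekick(tasks, frontier: Model, sidekick: Model):
    """Delegate simple tasks to the sidekick; the frontier keeps the rest."""
    results = {}
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Context control: the sidekick receives only the task description,
        # never the frontier agent's full conversation history.
        delegated = {
            t.description: pool.submit(sidekick.call, t.description)
            for t in tasks if t.simple
        }
        # The frontier works on the hard tasks while the sidekick runs.
        for t in tasks:
            if not t.simple:
                results[t.description] = frontier.call(t.description)
        # Collect the delegated work once the frontier's own tasks are done.
        for description, future in delegated.items():
            results[description] = future.result()
    # Final review stays with the frontier, which sees only the results.
    review = frontier.call("review: " + "; ".join(results.values()))
    return results, review


if __name__ == "__main__":
    frontier = Model("frontier", cost_per_token=10.0, run=lambda p: "F:" + p)
    sidekick = Model("sidekick", cost_per_token=1.0, run=lambda p: "S:" + p)
    tasks = [
        Task("rename variable x", simple=True),
        Task("design the caching layer", simple=False),
        Task("fix typo in README", simple=True),
    ]
    results, review = run_with_sidekick(tasks, frontier, sidekick)
    print(f"sidekick spend: {sidekick.spend}, frontier spend: {frontier.spend}")
```

The saving in this sketch comes from two places: delegated prompts are billed at the cheaper rate, and the sidekick never receives the frontier agent's accumulated context. A real harness would also need the frontier agent to monitor progress and take back tasks the sidekick fails, which is omitted here.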
Similar Articles
Beyond LLMs: Why Scalable Enterprise AI Adoption Depends on Agent Logic
IBM Research explores how agent logic—software primitives like knowledge graphs and program analysis—can guide LLM-based agents to efficiently handle complex enterprise workflows, reducing hallucinations and costs while improving outcomes.
@DailyDoseOfDS_: A harnessed LLM agent, clearly explained! Most people picture this as a model with tools bolted on. The real architectu…
Explains the inverted architecture of a harnessed LLM agent, where intelligence is externalized into memory, skills, and protocols around a thin model core, with mediators governing interactions.
Context, Reasoning, and Hierarchy: A Cost-Performance Study of Compound LLM Agent Design in an Adversarial POMDP
A controlled study of compound LLM agent design in an adversarial POMDP (CybORG CAGE-2), systematically varying context, reasoning, and hierarchy across five model families. Key findings: programmatic state abstraction yields large returns per token, hierarchy without deliberation tools achieves best absolute performance, and context engineering is more cost-effective than deeper reasoning.
I built an open-source agent whose reasoning core fuses several LLMs (panel, judge, synthesizer) instead of routing to one
The author built an open-source agent that uses a panel of different LLMs with a judge and synthesizer for hard reasoning steps, alongside cost-aware routing, layered memory, governance, and subagent support. It is alpha software with mixed benchmarks on fusion effectiveness.
Small LLM Architecture: Raven Agent (Local RTX5080) + Trinity Cortex (7B/13B/MoE Online)
Describes a two-layer small LLM architecture: a local always-on agent (Raven) on an RTX5080 and an online reasoning stack (Trinity Cortex) with three small models and a knowledge graph, arguing that small models are better than large frontier models for graph-based reasoning.