The model is the CPU, not the computer — why the harness moves agent performance as much as a model upgrade

Reddit r/AI_Agents 06/09/26, 12:27 PM News

agent-performance harness ai-agents model-comparison open-weight benchmarks

Summary

The post argues that the harness (the system around the model) moves agent performance as much as a model upgrade does. It cites a harness-only gain on Terminal-Bench 2.0, the Harness-Bench trajectories, Vercel's tool reduction, and Anthropic's "build to delete" guidance.

Wrote up something that kept nagging me: people keep saying "we used the same model" and getting wildly different agent results. The reason is that the model isn't the system; the harness is.

Model = CPU; context window = RAM; tools = devices; orchestration loop = scheduler; permissions = kernel ring; tests/traces/evals = observability.

Some of the evidence I leaned on:

* LangChain took the *same* gpt-5.2-codex from 52.8% → 66.5% on Terminal-Bench 2.0 just by changing the harness (+13.7 pts, no new model).
* Harness-Bench (5,194 trajectories) argues you should report capability at the model-harness config level, not the model alone.
* Vercel *removed* ~80% of their agent's tools and got better results. More harness isn't always better.
* Anthropic's "build to delete": a stronger model needs *less* harness. Run that backwards and a smaller/open-weight model needs *more* harness to hit the same point. You relocate the gap from API cost into engineering.

The open-weight angle is the interesting one for self-hosters: Qwen3.6-27B reportedly lands ~59.3 on Terminal-Bench 2.0, near Opus-4.5-class on well-scoped agentic coding. With a harness tuned to it, you get most of the way to a frontier model, and you keep full control of state, tools, and policy (which matters a lot if you're under EU regulation).

Curious where people think the harness *can't* close the gap. My take is hard reasoning / long-horizon tasks and tail failure behaviour. What's your experience?
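
For anyone who wants the "report at the model-harness config level" idea in concrete form, here is a minimal Python sketch. Everything in it is an assumption for illustration: `HarnessConfig`, `RunResult`, `harness_delta`, the harness names, and the tool lists are all hypothetical and do not come from LangChain or Harness-Bench. The only real inputs are the two Terminal-Bench 2.0 scores quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HarnessConfig:
    """Hypothetical record of everything around the model (not a real API)."""
    name: str
    context_budget_tokens: int         # "RAM"
    tools: tuple[str, ...]             # "devices"
    max_loop_steps: int                # "scheduler"
    allowed_actions: tuple[str, ...]   # "kernel ring"


@dataclass(frozen=True)
class RunResult:
    """A score is attached to a (model, harness) pair, never to a model alone."""
    model: str
    harness: HarnessConfig
    benchmark: str
    score_pct: float


def harness_delta(results, model, benchmark):
    """Spread between the best and worst harness for one model on one benchmark."""
    scores = [
        r.score_pct
        for r in results
        if r.model == model and r.benchmark == benchmark
    ]
    return round(max(scores) - min(scores), 1)


# Harness settings below are made up; only the two scores come from the post.
baseline = HarnessConfig("baseline", 128_000, ("shell", "editor", "browser"), 50, ("read", "write"))
tuned = HarnessConfig("tuned", 128_000, ("shell", "editor"), 100, ("read", "write"))

results = [
    RunResult("gpt-5.2-codex", baseline, "Terminal-Bench 2.0", 52.8),
    RunResult("gpt-5.2-codex", tuned, "Terminal-Bench 2.0", 66.5),
]

delta = harness_delta(results, "gpt-5.2-codex", "Terminal-Bench 2.0")
print(f"harness-only swing: {delta} pts")
```

The design choice worth noting is that `RunResult` has no way to store a score without a harness attached, so "model X scores Y" cannot be written down without saying which harness produced it.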


Similar Articles

Same model, different harness: 30-50 point performance swing. But teams still pick agents by model name.

@sydneyrunkle: let's assume agent = model + harness unfortunately, good models are getting really expensive! so you need a great harne…

Observation: the best agent harness for each model will be from the model developer themselves

Stop Comparing LLM Agents Without Disclosing the Harness

Your harness is failing your agent but there's no benchmark to prove it
