The model is the CPU, not the computer — why the harness moves agent performance as much as a model upgrade

Reddit r/AI_Agents News

Summary

The post argues that the harness (the system around the model) matters as much as the model itself for agent performance, citing Terminal-Bench 2.0 results, Harness-Bench, Vercel's tool-pruning experience, and Anthropic's "build to delete" principle.

Wrote up something that kept nagging me: people keep saying "we used the same model" and getting wildly different agent results. The reason is that the model isn't the system; the harness is. Model = CPU; context window = RAM; tools = devices; orchestration loop = scheduler; permissions = kernel ring; tests/traces/evals = observability.

Some of the evidence I leaned on:

* LangChain took the *same* gpt-5.2-codex from 52.8% → 66.5% on Terminal-Bench 2.0 just by changing the harness (+13.7 pts, no new model).
* Harness-Bench (5,194 trajectories) argues you should report capability at the model-harness config level, not the model alone.
* Vercel *removed* ~80% of their agent's tools and got better results: more harness isn't always better.
* Anthropic's "build to delete": a stronger model needs *less* harness. Run that backwards and a smaller/open-weight model needs *more* harness to hit the same point; you relocate the gap from API cost into engineering.

The open-weight angle is the interesting one for self-hosters: Qwen3.6-27B reportedly lands ~59.3 on Terminal-Bench 2.0, near Opus-4.5-class on well-scoped agentic coding. With a harness tuned to it, you get most of the way to a frontier model, and you keep full control of state, tools, and policy (which matters a lot if you're under EU regulation).

Curious where people think the harness *can't* close the gap. My take is hard reasoning / long-horizon tasks and tail failure behaviour. What's your experience?
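
For anyone who wants the "report at the model-harness config level" idea in concrete form, here's a rough Python sketch. It keys every score by (model, harness) instead of by model alone, then computes how many points the harness choice moves a single model. The two scores are the Terminal-Bench 2.0 numbers quoted above; the harness labels (`baseline-harness`, `tuned-harness`) and the `RunResult` / `harness_spread` names are made up for illustration and are not taken from LangChain or Harness-Bench.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    model: str
    harness: str  # hypothetical label; the post does not name the harness configs
    score: float  # Terminal-Bench 2.0 score, in percent


# Scores as quoted in the post; harness names are placeholders.
results = [
    RunResult("gpt-5.2-codex", "baseline-harness", 52.8),
    RunResult("gpt-5.2-codex", "tuned-harness", 66.5),
]


def report(results):
    """One line per model-harness config, never per model alone."""
    return [f"{r.model} + {r.harness}: {r.score:.1f}%" for r in results]


def harness_spread(results):
    """Per model, the gap in points between its best and worst harness."""
    by_model = {}
    for r in results:
        by_model.setdefault(r.model, []).append(r.score)
    return {m: round(max(s) - min(s), 1) for m, s in by_model.items()}


for line in report(results):
    print(line)
print(harness_spread(results))  # {'gpt-5.2-codex': 13.7}
```

If the spread for one model is about as large as the gap between two models, a leaderboard row that names only the model is hiding half of the configuration.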

Similar Articles

Stop Comparing LLM Agents Without Disclosing the Harness

arXiv cs.AI

This position paper argues that in long-horizon LLM agent tasks, the execution harness often determines performance more than the model itself, and current benchmarks misattribute harness-level gains to model improvements. It proposes a harness-aware evaluation framework with disclosure standards and variance decomposition protocols.