Same model, different harness: 30-50 point performance swing. But teams still pick agents by model name.
Summary
The article reports that changing the harness around the same model can swing agent performance by 30-50 points, and argues that teams should rely on instance-level verification rather than choosing agents by model name alone.
Similar Articles
The model is the CPU, not the computer — why the harness moves agent performance as much as a model upgrade
The article argues that the harness (the system around the model) is as important as the model itself for agent performance, citing evidence from various benchmarks and experiments.
Observation: the best agent harness for each model will be from the model developer themselves
A discussion of how AI models perform best with harnesses built by their own developers, since third-party harnesses may cause a model to underperform despite strong benchmark results. It cites Claude Code for Claude and Codex for GPT as examples.
Stop Comparing LLM Agents Without Disclosing the Harness
This position paper argues that in long-horizon LLM agent tasks, the execution harness often determines performance more than the model itself, and current benchmarks misattribute harness-level gains to model improvements. It proposes a harness-aware evaluation framework with disclosure standards and variance decomposition protocols.
Your harness is failing your agent but there's no benchmark to prove it
The article highlights a lack of benchmarks for evaluating the reliability of agent harnesses, specifically focusing on how MCP implementations handle tool calls and errors compared to the models themselves.
It's Not the Capability: Harness Sensitivity Is Non-Monotone Across LLM Agent Tiers
This paper empirically tests the common assumption that more structured harnesses universally improve LLM agent reliability, finding a non-monotone relationship across model tiers. It introduces the HEAT-24 benchmark and reveals that strict harnesses can harm frontier chat models while benefiting reasoning models.