Same model, different harness: 30-50 point performance swing. But teams still pick agents by model name.

Reddit r/AI_Agents 05/09/26, 02:23 PM News

ai-agents evaluation harness coding-assistants observability model-selection

Summary

The post argues that the harness wrapped around a model, not the model itself, can account for a 30 to 50 percentage point performance swing, and that teams should focus on instance-level verification rather than choosing agents by model name.

There's a finding circulating this week that deserves more attention than it's getting. The claim, backed by multiple builders comparing setups: the same model can produce a 30 to 50 percentage point performance difference depending on which harness wraps it. Claude Code versus OpenHands versus a homegrown loop, same weights, materially different results on the same task.

Most teams I talk to still pick their coding agent by model name. "We use Sonnet." "We switched to Qwen 35b." The implicit assumption is that the model is the primary variable. But if harness design accounts for a 30 to 50 point swing, the model name is a footnote.

The real question is: what did this specific agent instance, in this specific configuration, on this specific codebase, actually do in this session? That question is almost impossible to answer from output alone. The agent's claimed output tells you what it says it did. It doesn't tell you what it reasoned, what it silently skipped, which compliance decisions it made, or whether the efficiency of this run will hold on the next one.

I've started thinking about this less as a model-selection problem and more as an instance-measurement problem. The harness matters. The codebase context matters. The specific session behavior of this instance, accumulated over time, matters more than the benchmark rank.

Genuine question for anyone building seriously with local agents: do you have any way to measure what an agent instance actually did, beyond reading the diff and hoping CI catches the rest? What does your verification layer look like?
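For anyone who wants a starting point, here is a minimal sketch of the instance-measurement idea, not a description of any existing tool. It assumes you log one record per agent session and can mark each session as passed or failed (for example, from CI). The names `SessionRecord`, `pass_rates`, and `harness_swing` are hypothetical, and the demo records are made up purely to show the arithmetic.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionRecord:
    """One agent session. All field names here are illustrative assumptions."""
    model: str      # identifier for the weights
    harness: str    # identifier for the loop/tooling wrapped around the model
    task_id: str    # which task on which codebase
    passed: bool    # outcome from your own verification, e.g. CI


def pass_rates(records):
    """Return the pass rate in percent for each (model, harness) pair."""
    totals = defaultdict(lambda: [0, 0])  # key -> [passed count, total count]
    for r in records:
        bucket = totals[(r.model, r.harness)]
        bucket[0] += int(r.passed)
        bucket[1] += 1
    return {key: 100.0 * p / n for key, (p, n) in totals.items()}


def harness_swing(records):
    """Per model: best minus worst pass rate across harnesses, in points."""
    by_model = defaultdict(list)
    for (model, _harness), rate in pass_rates(records).items():
        by_model[model].append(rate)
    return {m: max(rates) - min(rates) for m, rates in by_model.items()}


# Made-up records for illustration only: same model, two harnesses, four tasks.
demo = [
    SessionRecord("model-a", "harness-x", "t1", True),
    SessionRecord("model-a", "harness-x", "t2", True),
    SessionRecord("model-a", "harness-x", "t3", True),
    SessionRecord("model-a", "harness-x", "t4", False),
    SessionRecord("model-a", "harness-y", "t1", True),
    SessionRecord("model-a", "harness-y", "t2", False),
    SessionRecord("model-a", "harness-y", "t3", False),
    SessionRecord("model-a", "harness-y", "t4", False),
]

if __name__ == "__main__":
    print(pass_rates(demo))
    print(harness_swing(demo))
```

This only measures outcomes per configuration. It says nothing about what the agent reasoned or silently skipped, which would need per-step traces from the harness itself.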

Similar Articles

The model is the CPU, not the computer — why the harness moves agent performance as much as a model upgrade

Observation: the best agent harness for each model will be from the model developer themselves

Stop Comparing LLM Agents Without Disclosing the Harness

Your harness is failing your agent but there's no benchmark to prove it

It's Not the Capability: Harness Sensitivity Is Non-Monotone Across LLM Agent Tiers
