Same model, different harness: 30-50 point performance swing. But teams still pick agents by model name.

Reddit r/AI_Agents News

Summary

The post argues that the agent harness alone can account for a 30 to 50 percentage point performance swing for the same model, and that teams should therefore focus on instance-level verification rather than choosing agents by model name.

There's a finding circulating this week that deserves more attention than it's getting. The claim, backed by multiple builders comparing setups: the same model can produce a 30 to 50 percentage point performance difference depending on which harness wraps it. Claude Code versus OpenHands versus a homegrown loop, same weights, materially different results on the same task.

Most teams I talk to still pick their coding agent by model name. "We use Sonnet." "We switched to Qwen 35b." The implicit assumption is that the model is the primary variable. But if harness design accounts for a 30 to 50 point swing, the model name is a footnote.

The real question is: what did this specific agent instance, in this specific configuration, on this specific codebase, actually do in this session? That question is almost impossible to answer from output alone. The agent's claimed output tells you what it says it did. It doesn't tell you what it reasoned, what it silently skipped, which compliance decisions it made, or whether the efficiency of this run will hold on the next one.

I've started thinking about this less as a model-selection problem and more as an instance-measurement problem. The harness matters. The codebase context matters. The specific session behavior of this instance, accumulated over time, matters more than the benchmark rank.

Genuine question for anyone building seriously with local agents: do you have any way to measure what an agent instance actually did, beyond reading the diff and hoping CI catches the rest? What does your verification layer look like?
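
One way to make the instance-measurement framing concrete is to key results by (model, harness) rather than by model alone, then look at the gap between harnesses for the same weights. What follows is a minimal Python sketch, not anyone's actual verification layer: the `RunRecord` fields, the `model-a` / `harness-x` / `harness-y` labels, and the pass/fail values are all hypothetical and chosen only to illustrate the arithmetic.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    # Hypothetical per-session record; the field names are assumptions.
    model: str
    harness: str
    task_id: str
    passed: bool


def pass_rates(records):
    """Return {(model, harness): pass rate in percent}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for r in records:
        key = (r.model, r.harness)
        totals[key] += 1
        passes[key] += int(r.passed)
    return {k: 100.0 * passes[k] / totals[k] for k in totals}


def harness_swing(records, model):
    """Percentage-point gap between the best and worst harness for one model."""
    rates = [rate for (m, _), rate in pass_rates(records).items() if m == model]
    if not rates:
        raise ValueError(f"no runs recorded for model {model!r}")
    return max(rates) - min(rates)


# Illustrative data only: same model, two harnesses, four tasks each.
records = [
    RunRecord("model-a", "harness-x", "t1", True),
    RunRecord("model-a", "harness-x", "t2", True),
    RunRecord("model-a", "harness-x", "t3", True),
    RunRecord("model-a", "harness-x", "t4", False),
    RunRecord("model-a", "harness-y", "t1", True),
    RunRecord("model-a", "harness-y", "t2", False),
    RunRecord("model-a", "harness-y", "t3", False),
    RunRecord("model-a", "harness-y", "t4", False),
]

print(harness_swing(records, "model-a"))  # 75.0 - 25.0 = 50.0
```

This only captures pass/fail per task. It says nothing about what the instance reasoned or silently skipped, which is the harder part of the question above.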

Similar Articles

Stop Comparing LLM Agents Without Disclosing the Harness

arXiv cs.AI

This position paper argues that in long-horizon LLM agent tasks, the execution harness often determines performance more than the model itself, and current benchmarks misattribute harness-level gains to model improvements. It proposes a harness-aware evaluation framework with disclosure standards and variance decomposition protocols.