@Xudong07452910: This latest paper, Scaling Laws for Agent Harnesses, is a must-read for those working on Agent Harnesses. It highlights a key point: Agents don't necessarily become stronger by running more tokens, tuning more tools, or looping more rounds. What really matters is that these...
Summary
This paper proposes Effective Feedback Compute (EFC) as a scaling coordinate for measuring Agent Harness performance, emphasizing that effective feedback is more important than raw compute, with important implications for Agent system design.
Cached at: 06/02/26, 05:35 PM
This latest paper, Scaling Laws for Agent Harnesses, is well worth reading for anyone working on Agent Harnesses. It makes a crucial point: an Agent does not necessarily become stronger by running more tokens, calling more tools, or looping more rounds. What truly matters is whether these interactions turn into “effective feedback”. The paper proposes Effective Feedback Compute (EFC): feedback counts as effective only when it is sufficiently informative, reliable, non-redundant, and actually used by the Agent to change its next decision. This matters because many Agent systems today equate raw compute with capability improvement: longer contexts, more tools, more complex loops, more detailed logs. But if the feedback is not structured and organized, and does not enter the plan/revise/verify loop, it just consumes more resources.
This is also very insightful for daily Agent work. Many Harnesses may appear complex, with many tools, many logs, many verifications, but if the feedback is not organized, remembered, and reused, the Agent is just busier, not smarter.
Future optimization of Agent Harness may not simply be stacking more tools and longer contexts, but improving the “utilization rate of each feedback”. A good Harness does not make the Agent work more, but allows it to truly learn something with each step. https://arxiv.org/abs/2605.29682 #AgentHarness #AgenticAI #AIResearch #claudecode #codex

# Scaling Laws for Agent Harnesses via Effective Feedback Compute

Source: https://arxiv.org/html/2605.29682

Xuanliang Zhang, Dingzirui Wang, Keyan Xu, Qingfu Zhu, Wanxiang Che
Harbin Institute of Technology
{xuanliangzhang, dzrwang, kyxu, qfzhu, car}@ir.hit.edu.cn

###### Abstract

Agent harnesses increasingly determine the performance of language-model systems by deciding how models call tools, receive feedback, verify intermediate states, store memory, and revise solutions. Yet current test-time scaling analyses often parameterize this process by raw expenditure (tokens, tool calls, operations, wall time, or cost), which does not distinguish useful feedback from redundant or unstable interaction. We introduce Effective Feedback Compute (EFC), a trace-level scaling coordinate that credits feedback only when it is informative, valid, non-redundant, and retained for subsequent decisions, and we normalize it by task demand when comparing tasks with different feedback requirements. Across synthetic controllable tasks, executable code tasks, real benchmark traces, held-out splits, and a prospective validation batch, EFC-based coordinates consistently predict failure rates better than raw-compute baselines and a strong multivariate SAS baseline. In controlled scaling, raw tokens and tool calls explain limited variation ($R^{2}=0.33$ and $0.42$), SAS reaches $0.88$, while Oracle-EFC and Estimated-EFC reach $0.94$ and Oracle-EFC/$D_{\mathrm{task}}$ reaches $0.99$. Matched-budget interventions show that improving feedback quality raises success from $0.27$ to $0.90$ while raw cost and tool calls are fixed.
On mixed real traces, NRS-EFC/$D_{\mathrm{task}}$ reaches $R^{2}=0.92$ while raw compute has near-zero or negative fit, and it remains the best predictor in a prospective holdout ($R^{2}=0.85$). These results suggest that harness scaling is governed less by how much computation is spent than by how efficiently raw budget is converted into durable, task-sufficient feedback.

## 1 Introduction

As language models move from single-turn prediction to interactive problem solving, performance increasingly depends on the agent harness around the base model. A harness determines how the model calls tools, receives feedback, stores memory, verifies intermediate results, repairs errors, and decides when to stop. This makes harness design a central form of test-time scaling: instead of only making the base model larger, one can spend additional inference-time computation to obtain and use evidence from the environment. However, unlike pretraining, where model size, data, and compute provide well-studied scaling coordinates, agent harnesses lack a clear scalar that predicts when additional test-time computation will improve performance. Raw expenditure alone is insufficient, because two trajectories with the same number of tokens or tool calls can differ sharply in whether their observations are useful, valid, non-redundant, and retained for later decisions. This gap motivates our central question: what quantity should serve as the scaling coordinate for closed-loop agent harness performance?

We propose Effective Feedback Compute (EFC) as such a coordinate. EFC measures the amount of useful closed-loop feedback produced by a trajectory. A feedback event receives credit only when it is informative, valid, non-redundant, and retained for later decisions. This definition separates raw spending from feedback that can actually change the agent’s future behavior. We also introduce two derived quantities.
Harness efficiency $\eta=\mathrm{EFC}/C_{\mathrm{raw}}$ measures raw-to-EFC conversion, namely how much effective feedback a harness extracts per unit of raw budget. Task-demand normalization, $\mathrm{EFC}/D_{\mathrm{task}}$, measures whether the extracted feedback is sufficient relative to the task’s feedback requirement. For real execution traces, where repeated and unstable observations are common, we use non-redundant stable EFC (NRS-EFC) to emphasize retained feedback rather than transient interaction.

We test whether EFC explains harness scaling better than raw compute across synthetic controllable tasks, semi-realistic executable code tasks, and real benchmarks. We compare EFC coordinates against raw tokens, tool calls, wall time, operations, raw cost, and SAS, a strong multivariate agent-system scaling baseline (Kim et al., 2026b (https://arxiv.org/html/2605.29682#bib.bib1)). In controlled scaling experiments, raw tokens and tool calls explain only limited variation ($R^{2}=0.33$ and $0.42$), while SAS reaches $0.88$. Oracle-EFC and Estimated-EFC both improve to $0.94$, and Oracle-EFC/$D_{\mathrm{task}}$ reaches $0.99$ with MAE $0.02$. In matched-budget interventions, changing feedback quality increases success from $0.27$ to $0.90$ while raw cost and tool calls remain matched. In trace-time estimation experiments, Estimated-EFC/$D_{\mathrm{task}}$ reaches $R^{2}=0.93$, showing that the coordinate can be recovered before the final outcome is observed.

We further decompose why the coordinate works. Harness design controls raw-to-EFC conversion through routing, verification, memory, and observation quality. Across module ablations, harness efficiency explains success variation with $R^{2}=0.97$, while raw cost explains almost none. Task demand controls the scale on which feedback becomes sufficient.
In cross-family prediction, Oracle-EFC improves over raw compute, but Oracle-EFC/$D_{\mathrm{task}}$ raises the fit to $R^{2}=0.96$. On mixed real traces, raw-compute baselines have near-zero or negative fit, while NRS-EFC reaches $R^{2}=0.89$ and NRS-EFC/$D_{\mathrm{task}}$ reaches $0.92$. The corresponding efficiency analysis shows that $\eta$ is slice-dependent rather than globally fixed: later harnesses dominate HumanEval-style code execution, all harnesses remain low-efficiency on Terminal tasks, and SWE tasks favor earlier or mid-stage variants. Finally, in a prospective holdout evaluated under a prespecified metric and calibration protocol, NRS-EFC/$D_{\mathrm{task}}$ remains the best predictor with held-out $R^{2}=0.85$.

Our contributions are threefold. (i) We formalize EFC as a trace-level measure of useful feedback for closed-loop agent harnesses, together with Estimated-EFC and NRS-EFC for settings without oracle state access. (ii) We show that EFC and task-demand-normalized EFC outperform raw-compute baselines and SAS as scaling coordinates across controlled, executable, real, held-out, and prospective evaluations. (iii) We decompose harness scaling into harness efficiency and task demand, showing that successful harnesses must both convert raw budget into effective feedback and measure that feedback against the right task scale.

## 2 Problem Formulation and Experimental Setup

We study whether the performance of agent harnesses can be explained by a single scaling variable. Unlike standard inference-time scaling, where the main resource is often a token budget or a number of samples, agent harnesses execute closed-loop computations: they plan, act, observe external feedback, and update their internal state. Our goal is to separate raw expenditure from feedback that is actually useful for solving the task.

### 2.1 Agent Harnesses as Closed-Loop Computation

Let $\mathcal{T}$ denote a task distribution.
A task instance $x\sim\mathcal{T}$ specifies an initial state, an instruction, an environment interface, and an evaluation function. An agent harness $h\in\mathcal{H}$, paired with a base model $m$, produces a trajectory

$$\tau=\{(s_{t},a_{t},o_{t},u_{t})\}_{t=1}^{T}, \tag{1}$$

where $s_{t}$ is the agent state before step $t$, $a_{t}$ is a model action or tool call, $o_{t}$ is the resulting observation, and $u_{t}$ is the harness update to the agent state, memory, plan, or candidate solution. The horizon $T$ is determined by the harness stopping rule or by a budget limit. Each run returns a final answer $\hat{y}$, which is evaluated by a task-specific checker $g_{x}(\hat{y})\in\{0,1\}$. We define

$$S(x,h,m,b)=\mathbb{E}[g_{x}(\hat{y})],\qquad E(x,h,m,b)=1-S(x,h,m,b), \tag{2}$$

where $b$ denotes the raw budget configuration. We report both instance-level success and aggregated failure rate over task families, harnesses, models, and budget levels.

### 2.2 Task Layers

We evaluate harness scaling on three task layers that progressively reduce oracle access while preserving automatic evaluation.

- Synthetic controllable tasks. We use procedurally generated Needle Lookup, State Tracking, and Rule Filter tasks with hidden state and deterministic answers. This layer supports direct measurement of Oracle-EFC and controlled variation in $D_{\mathrm{task}}$.
- Semi-realistic executable tasks. We use HumanEval-style code tasks and small executable repair or analysis tasks with unit-test or reference-check feedback. This layer tests whether Estimated-EFC remains predictive when traces contain realistic model errors.
- Real benchmark subsets. We use verifiable HumanEval Chen et al. (2021 (https://arxiv.org/html/2605.29682#bib.bib2)), Terminal-Bench 2.0 Merrill et al. (2026 (https://arxiv.org/html/2605.29682#bib.bib3)), and SWE-bench Verified Jimenez et al. (2024 (https://arxiv.org/html/2605.29682#bib.bib4)).
This layer tests transfer to realistic agent trajectories.

Unless otherwise stated, results are aggregated at the run level and reported with task-family stratification. Additional dataset details and filtering rules are provided in Appendix B (https://arxiv.org/html/2605.29682#A2). Unless otherwise stated, run-level quantities are first computed per run and then aggregated into the evaluation groups used by each experiment.

### 2.3 Harness Families

We compare seven harness families, denoted H0–H6, that differ in how they convert raw budget into useful feedback.

1. H0 Direct Answer. The model produces a solution in one pass, without explicit tool feedback, verification, repair, or memory.
2. H1 Checklist Verify. The harness adds lightweight verification or checklist-style checks to the direct solution process, providing limited feedback without a full closed-loop controller.
3. H2 Routed Tools. The harness routes between available tools and model calls, allowing the agent to condition later actions on external observations.
4. H3 Stateful Memory. The harness maintains compact memory over verified facts, failed attempts, and task constraints, reducing repeated errors across steps.
5. H4 High Budget Noisy. The harness spends a larger raw budget under weaker routing, verification, and memory conditions, isolating raw expenditure from effective feedback.
6. H5 Closed Loop. The harness combines routing, verification, and structured memory in an iterative loop, improving the conversion of observations into EFC.
7. H6 Deep Closed Loop. The harness extends the closed-loop setting with a larger interaction depth and stronger feedback mechanisms, testing whether additional budget helps when it is converted into effective feedback.

The exact prompts, tool interfaces, stopping rules, and memory formats are reported in Appendix D (https://arxiv.org/html/2605.29682#A4).
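
As a rough illustration of the setup so far, the trajectory tuple in Eq. (1) and the success and failure rates in Eq. (2) might be represented as follows. This is a minimal Python sketch, not the authors' implementation: the `Step` and `Run` classes, their field names, and the helper functions are hypothetical, the example verdicts are made up, and the expectation in Eq. (2) is approximated by a plain average over repeated runs.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Step:
    """One (s_t, a_t, o_t, u_t) tuple from Eq. (1). Field names are illustrative."""
    state: Any        # s_t: agent state before the step
    action: Any       # a_t: model action or tool call
    observation: Any  # o_t: resulting observation
    update: Any       # u_t: harness update to state, memory, plan, or candidate


@dataclass
class Run:
    """A single run: its trajectory plus the checker verdict g_x(y_hat) in {0, 1}."""
    harness: str      # harness family label, e.g. "H0" through "H6"
    steps: List[Step]
    passed: int       # 1 if the task-specific checker accepted the final answer


def success_rate(runs: List[Run]) -> float:
    """Empirical estimate of S = E[g_x(y_hat)] as the mean verdict over runs."""
    if not runs:
        raise ValueError("need at least one run")
    return sum(r.passed for r in runs) / len(runs)


def failure_rate(runs: List[Run]) -> float:
    """E = 1 - S, as in Eq. (2)."""
    return 1.0 - success_rate(runs)


# Four repeated runs of one task under one harness (made-up verdicts).
runs = [Run("H5", [], 1), Run("H5", [], 1), Run("H5", [], 0), Run("H5", [], 1)]
print(success_rate(runs), failure_rate(runs))  # 0.75 0.25
```

Keeping the checker verdict on the run rather than on individual steps mirrors the text: success is defined on the final answer, while the steps only record how the harness got there.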
### 2.4 Models, Budgets, and Repeated Runs

We evaluate each task–harness configuration with multiple base models, including DeepSeek-V4-Flash, gpt-5.4-nano, and Claude-Haiku-4.5. Unless stated otherwise, all reported numbers in the paper are first aggregated over repeated runs within each model and then averaged across models. For each task and harness, we sweep raw-budget levels that constrain model generation, tool use, harness operations, and runtime. Repeated runs estimate run-level success and failure rates under stochastic decoding. This design separates three factors that are often conflated: base-model capability, raw budget, and the amount of useful feedback captured by EFC.

### 2.5 Scalar Predictors and Baselines

We compare EFC-based predictors against standard raw-compute baselines and a strong system-level baseline. For each run, we record the following trace-observable quantities:

- Raw tokens: total input and output tokens consumed by the model.
- Tool calls: number of external tool invocations.
- Wall time: elapsed runtime of the harness.
- Operations: total number of model, tool, verification, and memory-update operations.
- Raw cost: a normalized cost combining token usage, tool calls, operations, and runtime.

We also include SAS Kim et al. (2026b (https://arxiv.org/html/2605.29682#bib.bib1)), a prior agent-systems scaling baseline that uses a fixed-effect equation over system-level quantities. In our implementation, SAS is fit from trace-observable proxies including model strength, tool count, agent count, overhead, coordination efficiency, redundancy, error amplification, effective actions, and a single-agent baseline. SAS therefore serves as a strong multivariate baseline for system-level scaling models.

## 3 Effective Feedback Compute

We define Effective Feedback Compute (EFC) as a scalar measure of useful closed-loop feedback produced by an agent harness.
A feedback event receives credit when it reveals task-relevant information, is grounded in reliable evidence, addresses the active subgoal, and is retained for later decisions.

### 3.1 Feedback Events

Given a trajectory

$$\tau=\{(s_{i},a_{i},o_{i},u_{i})\}_{i=1}^{T}, \tag{3}$$

where $s_{i}$ is the current state, $a_{i}$ is an action, $o_{i}$ is an observation, and $u_{i}$ is a state or memory update, we extract a sequence of feedback events $\mathcal{E}(\tau)=\{e_{t}\}_{t=1}^{T_{\mathrm{fb}}}$. Each event $e_{t}$ is a closed-loop segment in which the agent acts under a current state or subgoal, receives feedback, and updates its subsequent behavior. Events may include model actions, tool calls, checker calls, repair steps, and memory updates.

### 3.2 Event-Level EFC

Each event $e_{t}$ receives four bounded factors:

$$I_{t},V_{t},R_{t},M_{t}\in[0,1]. \tag{4}$$

Their meanings are as follows:

- Informativeness $I_{t}$. The event reveals task-relevant information, such as a new constraint, reduced uncertainty, a diagnosed failure mode, or measurable subgoal progress.
- Validity $V_{t}$. The event is supported by reliable evidence, such as a deterministic checker, execution result, unit test, or consistent tool observation.
- Non-redundant relevance $R_{t}$. The event addresses the active subgoal and adds information beyond what is already available in the trajectory.
- Memory update $M_{t}$. The event changes the plan, state, or memory in a way that can affect later actions.

The event contribution is

$$\mathrm{EFC}_{t}=\kappa I_{t}V_{t}R_{t}M_{t}, \tag{5}$$

where $\kappa$ is a fixed scale constant. We
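
The event-level product in Eq. (5), together with the two derived quantities from the introduction (harness efficiency $\mathrm{EFC}/C_{\mathrm{raw}}$ and task-demand normalization $\mathrm{EFC}/D_{\mathrm{task}}$), can be sketched in Python. This is a hedged illustration only: the class and function names are hypothetical, $\kappa$ defaults to 1 for convenience, the factor scores and budget values are made-up numbers, and summing event contributions into a trajectory-level EFC is an assumption made here, since the cached text is cut off before the paper states its aggregation rule.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class FeedbackEvent:
    """Four bounded factors for one feedback event e_t, each in [0, 1] (Eq. 4)."""
    informativeness: float  # I_t
    validity: float         # V_t
    relevance: float        # R_t (non-redundant relevance)
    memory: float           # M_t

    def __post_init__(self) -> None:
        for name, value in vars(self).items():
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must lie in [0, 1], got {value}")


def event_efc(e: FeedbackEvent, kappa: float = 1.0) -> float:
    """EFC_t = kappa * I_t * V_t * R_t * M_t (Eq. 5)."""
    return kappa * e.informativeness * e.validity * e.relevance * e.memory


def trajectory_efc(events: Iterable[FeedbackEvent], kappa: float = 1.0) -> float:
    """ASSUMPTION: trajectory EFC is the sum of event contributions."""
    return sum(event_efc(e, kappa) for e in events)


def harness_efficiency(efc: float, c_raw: float) -> float:
    """eta = EFC / C_raw: effective feedback per unit of raw budget."""
    if c_raw <= 0:
        raise ValueError("raw cost must be positive")
    return efc / c_raw


def normalized_efc(efc: float, d_task: float) -> float:
    """EFC / D_task: feedback relative to the task's feedback requirement."""
    if d_task <= 0:
        raise ValueError("task demand must be positive")
    return efc / d_task


events = [
    FeedbackEvent(1.0, 1.0, 1.0, 1.0),  # fully effective event
    FeedbackEvent(0.5, 1.0, 0.5, 1.0),  # partly informative, partly redundant
    FeedbackEvent(0.9, 0.9, 0.0, 1.0),  # fully redundant: contributes nothing
]
efc = trajectory_efc(events)
print(efc, harness_efficiency(efc, c_raw=5.0), normalized_efc(efc, d_task=2.5))
# 1.25 0.25 0.5
```

The third event contributes nothing despite high informativeness and validity, because a zero in any factor zeroes the product. That matches the requirement stated in the text that an event earns credit only when it satisfies all four conditions at once.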