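@Xudong07452910: This new paper, Scaling Laws for Agent Harnesses, is a good read for anyone building agent harnesses. It makes a key point: an agent does not necessarily get stronger by burning more tokens, calling more tools, or running more loop iterations. What really matters is whether these…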

X AI KOLs Timeline 2026/06/01 03:25 Paper

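Summary

This paper proposes Effective Feedback Compute (EFC) as a scaling coordinate for measuring agent harness performance. It emphasizes that effective feedback matters more than raw compute, with important implications for agent system design.

This new paper, Scaling Laws for Agent Harnesses, is a good read for anyone building agent harnesses.

It makes a key point: an agent does not necessarily get stronger by burning more tokens, calling more tools, or running more loop iterations. What really matters is whether those interactions turn into "effective feedback".

The paper proposes Effective Feedback Compute (EFC): feedback counts as effective only if it is sufficiently informative, reliable, non-redundant, and actually used by the agent to change its next decision.

This matters because many agent systems today tend to treat raw compute as a capability gain: longer context, more tools, more complex loops, more detailed logs. But if the feedback is not organized into a structured form and does not enter the plan / revise / verify loop, it essentially just consumes more resources.

This is also instructive for everyday agent work. Many harnesses look sophisticated, with lots of tools, logs, and verification, but if the feedback is not organized, remembered, and reused, the agent is only busier, not smarter.

Future agent harness optimization may not be about simply piling on more tools and longer context, but about raising the "utilization of every piece of feedback".

A good harness does not make the agent do more work; it makes sure the agent actually learns something from every step.

https://arxiv.org/abs/2605.29682

#AgentHarness #AgenticAI #AIResearch #claudecode #codex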



Scaling Laws for Agent Harnesses via Effective Feedback Compute

Source: https://arxiv.org/html/2605.29682

Xuanliang Zhang, Dingzirui Wang, Keyan Xu, Qingfu Zhu, Wanxiang Che
Harbin Institute of Technology
{xuanliangzhang, dzrwang, kyxu, qfzhu, car}@ir.hit.edu.cn

Abstract

Agent harnesses increasingly determine the performance of language-model systems by deciding how models call tools, receive feedback, verify intermediate states, store memory, and revise solutions. Yet current test-time scaling analyses often parameterize this process by raw expenditure (tokens, tool calls, operations, wall time, or cost), which does not distinguish useful feedback from redundant or unstable interaction. We introduce Effective Feedback Compute (EFC), a trace-level scaling coordinate that credits feedback only when it is informative, valid, non-redundant, and retained for subsequent decisions, and we normalize it by task demand when comparing tasks with different feedback requirements. Across synthetic controllable tasks, executable code tasks, real benchmark traces, held-out splits, and a prospective validation batch, EFC-based coordinates consistently predict failure rates better than raw-compute baselines and a strong multivariate SAS baseline. In controlled scaling, raw tokens and tool calls explain limited variation ($R^{2}=0.33$ and $0.42$), SAS reaches $0.88$, while Oracle-EFC and Estimated-EFC reach $0.94$ and Oracle-EFC/$D_{\mathrm{task}}$ reaches $0.99$. Matched-budget interventions show that improving feedback quality raises success from $0.27$ to $0.90$ while raw cost and tool calls are fixed. On mixed real traces, NRS-EFC/$D_{\mathrm{task}}$ reaches $R^{2}=0.92$ while raw compute has near-zero or negative fit, and it remains the best predictor in a prospective holdout ($R^{2}=0.85$). These results suggest that harness scaling is governed less by how much computation is spent than by how efficiently raw budget is converted into durable, task-sufficient feedback.

1 Introduction

As language models move from single-turn prediction to interactive problem solving, performance increasingly depends on the agent harness around the base model. A harness determines how the model calls tools, receives feedback, stores memory, verifies intermediate results, repairs errors, and decides when to stop. This makes harness design a central form of test-time scaling: instead of only making the base model larger, one can spend additional inference-time computation to obtain and use evidence from the environment. However, unlike pretraining, where model size, data, and compute provide well-studied scaling coordinates, agent harnesses lack a clear scalar that predicts when additional test-time computation will improve performance. Raw expenditure alone is insufficient, because two trajectories with the same number of tokens or tool calls can differ sharply in whether their observations are useful, valid, non-redundant, and retained for later decisions. This gap motivates our central question: what quantity should serve as the scaling coordinate for closed-loop agent harness performance?

We propose Effective Feedback Compute (EFC) as such a coordinate. EFC measures the amount of useful closed-loop feedback produced by a trajectory. A feedback event receives credit only when it is informative, valid, non-redundant, and retained for later decisions. This definition separates raw spending from feedback that can actually change the agent's future behavior. We also introduce two derived quantities. Harness efficiency $\eta=\mathrm{EFC}/C_{\mathrm{raw}}$ measures raw-to-EFC conversion, namely how much effective feedback a harness extracts per unit of raw budget. Task-demand normalization, $\mathrm{EFC}/D_{\mathrm{task}}$, measures whether the extracted feedback is sufficient relative to the task's feedback requirement. For real execution traces, where repeated and unstable observations are common, we use non-redundant stable EFC (NRS-EFC) to emphasize retained feedback rather than transient interaction.

We test whether EFC explains harness scaling better than raw compute across synthetic controllable tasks, semi-realistic executable code tasks, and real benchmarks. We compare EFC coordinates against raw tokens, tool calls, wall time, operations, raw cost, and SAS, a strong multivariate agent-system scaling baseline (Kim et al., 2026b). In controlled scaling experiments, raw tokens and tool calls explain only limited variation ($R^{2}=0.33$ and $0.42$), while SAS reaches $0.88$. Oracle-EFC and Estimated-EFC both improve to $0.94$, and Oracle-EFC/$D_{\mathrm{task}}$ reaches $0.99$ with MAE $0.02$. In matched-budget interventions, changing feedback quality increases success from $0.27$ to $0.90$ while raw cost and tool calls remain matched. In trace-time estimation experiments, Estimated-EFC/$D_{\mathrm{task}}$ reaches $R^{2}=0.93$, showing that the coordinate can be recovered before the final outcome is observed.

We further decompose why the coordinate works. Harness design controls raw-to-EFC conversion through routing, verification, memory, and observation quality. Across module ablations, harness efficiency explains success variation with $R^{2}=0.97$, while raw cost explains almost none. Task demand controls the scale on which feedback becomes sufficient. In cross-family prediction, Oracle-EFC improves over raw compute, but Oracle-EFC/$D_{\mathrm{task}}$ raises the fit to $R^{2}=0.96$. On mixed real traces, raw-compute baselines have near-zero or negative fit, while NRS-EFC reaches $R^{2}=0.89$ and NRS-EFC/$D_{\mathrm{task}}$ reaches $0.92$. The corresponding efficiency analysis shows that $\eta$ is slice-dependent rather than globally fixed: later harnesses dominate HumanEval-style code execution, all harnesses remain low-efficiency on Terminal tasks, and SWE tasks favor earlier or mid-stage variants. Finally, in a prospective holdout evaluated under a prespecified metric and calibration protocol, NRS-EFC/$D_{\mathrm{task}}$ remains the best predictor with held-out $R^{2}=0.85$.

Our contributions are threefold. (i) We formalize EFC as a trace-level measure of useful feedback for closed-loop agent harnesses, together with Estimated-EFC and NRS-EFC for settings without oracle state access. (ii) We show that EFC and task-demand-normalized EFC outperform raw-compute baselines and SAS as scaling coordinates across controlled, executable, real, held-out, and prospective evaluations. (iii) We decompose harness scaling into harness efficiency and task demand, showing that successful harnesses must both convert raw budget into effective feedback and measure that feedback against the right task scale.

2 Problem Formulation and Experimental Setup

We study whether the performance of agent harnesses can be explained by a single scaling variable. Unlike standard inference-time scaling, where the main resource is often a token budget or a number of samples, agent harnesses execute closed-loop computations: they plan, act, observe external feedback, and update their internal state. Our goal is to separate raw expenditure from feedback that is actually useful for solving the task.

2.1 Agent Harnesses as Closed-Loop Computation

Let $\mathcal{T}$ denote a task distribution. A task instance $x\sim\mathcal{T}$ specifies an initial state, an instruction, an environment interface, and an evaluation function. An agent harness $h\in\mathcal{H}$, paired with a base model $m$, produces a trajectory

$$\tau=\{(s_{t},a_{t},o_{t},u_{t})\}_{t=1}^{T}, \tag{1}$$

where $s_{t}$ is the agent state before step $t$, $a_{t}$ is a model action or tool call, $o_{t}$ is the resulting observation, and $u_{t}$ is the harness update to the agent state, memory, plan, or candidate solution. The horizon $T$ is determined by the harness stopping rule or by a budget limit. Each run returns a final answer $\hat{y}$, which is evaluated by a task-specific checker $g_{x}(\hat{y})\in\{0,1\}$. We define

$$S(x,h,m,b)=\mathbb{E}[g_{x}(\hat{y})],\qquad E(x,h,m,b)=1-S(x,h,m,b), \tag{2}$$

where $b$ denotes the raw budget configuration. We report both instance-level success and aggregated failure rate over task families, harnesses, models, and budget levels.
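As a minimal sketch of Eqs. 1 and 2, the trajectory can be held as a list of step records, and the expectation in $S$ can be approximated by the empirical mean of the checker over repeated runs. The names `Step`, `Run`, `success_rate`, and `failure_rate` are illustrative assumptions, not identifiers from the paper.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Step:
    """One element (s_t, a_t, o_t, u_t) of the trajectory in Eq. 1."""
    state: Any        # s_t: agent state before the step
    action: Any       # a_t: model action or tool call
    observation: Any  # o_t: resulting observation
    update: Any       # u_t: harness update to state, memory, plan, or solution


@dataclass
class Run:
    """A single run: its trajectory and the final answer y_hat."""
    trajectory: List[Step]
    final_answer: Any


def success_rate(runs: List[Run], checker: Callable[[Any], int]) -> float:
    """Empirical estimate of S = E[g_x(y_hat)] over repeated runs."""
    if not runs:
        raise ValueError("need at least one run")
    return sum(checker(run.final_answer) for run in runs) / len(runs)


def failure_rate(runs: List[Run], checker: Callable[[Any], int]) -> float:
    """Empirical estimate of E = 1 - S."""
    return 1.0 - success_rate(runs, checker)
```

For example, four runs with final answers 4, 4, 5, 4 and a checker that accepts only 4 give an estimated success rate of 0.75 and a failure rate of 0.25.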

2.2 Task Layers

We evaluate harness scaling on three task layers that progressively reduce oracle access while preserving automatic evaluation.

• Synthetic controllable tasks. We use procedurally generated Needle Lookup, State Tracking, and Rule Filter tasks with hidden state and deterministic answers. This layer supports direct measurement of Oracle-EFC and controlled variation in $D_{\mathrm{task}}$.
• Semi-realistic executable tasks. We use HumanEval-style code tasks and small executable repair or analysis tasks with unit-test or reference-check feedback. This layer tests whether Estimated-EFC remains predictive when traces contain realistic model errors.
• Real benchmark subsets. We use verifiable HumanEval (Chen et al., 2021), Terminal-Bench 2.0 (Merrill et al., 2026), and SWE-bench Verified (Jimenez et al., 2024). This layer tests transfer to realistic agent trajectories.

Unless otherwise stated, results are aggregated at the run level and reported with task-family stratification. Additional dataset details and filtering rules are provided in Appendix B.

Unless otherwise stated, run-level quantities are first computed per run and then aggregated into the evaluation groups used by each experiment.

2.3 Harness Families

We compare seven harness families, denoted H0–H6, that differ in how they convert raw budget into useful feedback.

H0 Direct Answer. The model produces a solution in one pass, without explicit tool feedback, verification, repair, or memory.
H1 Checklist Verify. The harness adds lightweight verification or checklist-style checks to the direct solution process, providing limited feedback without a full closed-loop controller.
H2 Routed Tools. The harness routes between available tools and model calls, allowing the agent to condition later actions on external observations.
H3 Stateful Memory. The harness maintains compact memory over verified facts, failed attempts, and task constraints, reducing repeated errors across steps.
H4 High Budget Noisy. The harness spends a larger raw budget under weaker routing, verification, and memory conditions, isolating raw expenditure from effective feedback.
H5 Closed Loop. The harness combines routing, verification, and structured memory in an iterative loop, improving the conversion of observations into EFC.
H6 Deep Closed Loop. The harness extends the closed-loop setting with a larger interaction depth and stronger feedback mechanisms, testing whether additional budget helps when it is converted into effective feedback.

The exact prompts, tool interfaces, stopping rules, and memory formats are reported in Appendix D.

2.4 Models, Budgets, and Repeated Runs

We evaluate each task–harness configuration with multiple base models, including DeepSeek-V4-Flash, gpt-5.4-nano, and Claude-Haiku-4.5. Unless stated otherwise, all reported numbers in the paper are first aggregated over repeated runs within each model and then averaged across models.

For each task and harness, we sweep raw-budget levels that constrain model generation, tool use, harness operations, and runtime. Repeated runs estimate run-level success and failure rates under stochastic decoding. This design separates three factors that are often conflated: base-model capability, raw budget, and the amount of useful feedback captured by EFC.
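The two-stage aggregation described above (repeated runs within each model first, then an average across models) can be sketched as follows. The record layout and the function name `aggregate_failure` are assumptions made for illustration; the paper does not specify its data format.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def aggregate_failure(
    records: Iterable[Tuple[str, int]],
) -> Tuple[float, Dict[str, float]]:
    """Two-stage aggregation of failure rates.

    records: (model_name, success) pairs, with success in {0, 1}.
    Stage 1 averages repeated runs within each model.
    Stage 2 averages the per-model failure rates with equal weight per model.
    """
    runs_by_model = defaultdict(list)
    for model, success in records:
        runs_by_model[model].append(success)
    if not runs_by_model:
        raise ValueError("need at least one record")

    per_model = {m: 1.0 - mean(s) for m, s in runs_by_model.items()}
    overall = mean(list(per_model.values()))
    return overall, per_model
```

Averaging per model first means a model with more repeated runs does not dominate the reported number, which is one plausible reason for ordering the aggregation this way.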

2.5 Scalar Predictors and Baselines

We compare EFC-based predictors against standard raw-compute baselines and a strong system-level baseline. For each run, we record the following trace-observable quantities:

• Raw tokens: total input and output tokens consumed by the model.
• Tool calls: number of external tool invocations.
• Wall time: elapsed runtime of the harness.
• Operations: total number of model, tool, verification, and memory-update operations.
• Raw cost: a normalized cost combining token usage, tool calls, operations, and runtime.

We also include SAS (Kim et al., 2026b), a prior agent-systems scaling baseline that uses a fixed-effect equation over system-level quantities. In our implementation, SAS is fit from trace-observable proxies including model strength, tool count, agent count, overhead, coordination efficiency, redundancy, error amplification, effective actions, and a single-agent baseline. SAS therefore serves as a strong multivariate baseline for system-level scaling models.

3 Effective Feedback Compute

We define Effective Feedback Compute (EFC) as a scalar measure of useful closed-loop feedback produced by an agent harness. A feedback event receives credit when it reveals task-relevant information, is grounded in reliable evidence, addresses the active subgoal, and is retained for later decisions.

3.1 Feedback Events

Given a trajectory

$$\tau=\{(s_{i},a_{i},o_{i},u_{i})\}_{i=1}^{T}, \tag{3}$$

where $s_{i}$ is the current state, $a_{i}$ is an action, $o_{i}$ is an observation, and $u_{i}$ is a state or memory update, we extract a sequence of feedback events $\mathcal{E}(\tau)=\{e_{t}\}_{t=1}^{T_{\mathrm{fb}}}$. Each event $e_{t}$ is a closed-loop segment in which the agent acts under a current state or subgoal, receives feedback, and updates its subsequent behavior. Events may include model actions, tool calls, checker calls, repair steps, and memory updates.

3.2 Event-Level EFC

Each event $e_{t}$ receives four bounded factors:

$$I_{t},V_{t},R_{t},M_{t}\in[0,1]. \tag{4}$$

Their meanings are as follows:

• Informativeness $I_{t}$. The event reveals task-relevant information, such as a new constraint, reduced uncertainty, a diagnosed failure mode, or measurable subgoal progress.
• Validity $V_{t}$. The event is supported by reliable evidence, such as a deterministic checker, execution result, unit test, or consistent tool observation.
• Non-redundant relevance $R_{t}$. The event addresses the active subgoal and adds information beyond what is already available in the trajectory.
• Memory update $M_{t}$. The event changes the plan, state, or memory in a way that can affect later actions.

The event contribution is

$$\mathrm{EFC}_{t}=\kappa I_{t}V_{t}R_{t}M_{t}, \tag{5}$$

where $\kappa$ is a fixed scale constant. We use $\kappa=10$ in all experiments. The run-level EFC is

$$\mathrm{EFC}(\tau)=\sum_{t=1}^{T_{\mathrm{fb}}}\mathrm{EFC}_{t}=\kappa\sum_{t=1}^{T_{\mathrm{fb}}}I_{t}V_{t}R_{t}M_{t}. \tag{6}$$

The product form gives high credit to feedback that is simultaneously informative, valid, relevant, and retained.
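Equations 5 and 6 translate directly into code. The sketch below assumes the four factors have already been scored in $[0,1]$ for each event; how those scores are obtained is outside its scope. The class and function names are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Iterable

KAPPA = 10.0  # fixed scale constant; the paper uses kappa = 10


@dataclass(frozen=True)
class FeedbackEvent:
    """Four bounded factors of one feedback event (Eq. 4)."""
    informativeness: float  # I_t
    validity: float         # V_t
    relevance: float        # R_t (non-redundant relevance)
    memory_update: float    # M_t


def event_efc(event: FeedbackEvent, kappa: float = KAPPA) -> float:
    """Event-level EFC_t = kappa * I_t * V_t * R_t * M_t (Eq. 5)."""
    factors = (
        event.informativeness,
        event.validity,
        event.relevance,
        event.memory_update,
    )
    for f in factors:
        if not 0.0 <= f <= 1.0:
            raise ValueError(f"factor {f} is outside [0, 1]")
    product = 1.0
    for f in factors:
        product *= f
    return kappa * product


def run_efc(events: Iterable[FeedbackEvent], kappa: float = KAPPA) -> float:
    """Run-level EFC: sum of event-level contributions (Eq. 6)."""
    return sum(event_efc(e, kappa) for e in events)
```

An event with all factors equal to 1 contributes the maximum of 10, while an event with any factor equal to 0 contributes nothing regardless of the other three.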

3.3 Oracle-EFC and Estimated-EFC

For synthetic controllable tasks, hidden task state and ground-truth progress are available to the experimenter. We compute Oracle-EFC by assigning $I_{t},V_{t},R_{t},M_{t}$ from latent progress signals and deterministic checks.

For semi-realistic and real benchmark tasks, hidden task state is unavailable or incomplete. We compute Estimated-EFC from trace-observable features. Let $\phi(e_{t})$ denote the feature vector

$$\phi(e_{t})=[c_{t},h_{t},z_{t},p_{t},m_{t},a_{t},q_{t},\Delta_{t},\rho_{t}], \tag{7}$$

where $c_{t}$ indicates whether a checker fired, $h_{t}$ is checker scope, $z_{t}$ indicates whether a tool result is later referenced, $p_{t}$ indicates whether the plan changes, $m_{t}$ measures memory retention, $a_{t}$ indicates repeated-error avoidance, $q_{t}$ measures observation consistency, $\Delta_{t}$ measures subgoal progress, and $\rho_{t}$ encodes trace position.

The event-level estimator is

$$\widehat{\mathrm{EFC}}_{t}=\max\left(0,\exp\left(\theta_{0}+\theta^{\top}\phi(e_{t})\right)-1\right), \tag{8}$$

and the run-level estimate is

$$\widehat{\mathrm{EFC}}(\tau)=\sum_{t=1}^{T_{\mathrm{fb}}}\widehat{\mathrm{EFC}}_{t}. \tag{9}$$

The estimator is calibrated on controllable tasks with Oracle-EFC labels and then applied to traces without hidden-state access. The final success label is used only for evaluating scaling fits.
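The estimator in Eqs. 8 and 9 is a clipped exponential of a linear score. The following sketch evaluates it for given parameters; it does not cover calibration, and any concrete values of `theta` and `theta0` are hypothetical since the paper's fitted parameters are not given here.

```python
import math
from typing import Iterable, Sequence


def estimate_event_efc(
    phi: Sequence[float], theta: Sequence[float], theta0: float
) -> float:
    """Event-level estimate max(0, exp(theta0 + theta . phi) - 1) (Eq. 8).

    phi holds the nine trace-observable features of Eq. 7 in a fixed order;
    theta and theta0 are calibrated parameters (hypothetical in this sketch).
    """
    if len(phi) != len(theta):
        raise ValueError("phi and theta must have the same length")
    linear = theta0 + sum(w * x for w, x in zip(theta, phi))
    return max(0.0, math.exp(linear) - 1.0)


def estimate_run_efc(
    event_features: Iterable[Sequence[float]],
    theta: Sequence[float],
    theta0: float,
) -> float:
    """Run-level estimate: sum of event-level estimates (Eq. 9)."""
    return sum(estimate_event_efc(phi, theta, theta0) for phi in event_features)
```

Subtracting 1 and clipping at 0 means an event whose linear score is at or below zero receives no credit, so only events with a positive score contribute to the run-level estimate.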

For real execution traces, we also report status-aware variants. Let $Q_{t}$ be an observed status-quality score, $G_{t}$ a progress gate, $\Lambda_{t}$ a loop-type gate, and $A_{t}$ the attempt index. The stable variant of Estimated-EFC is

$$\widehat{\mathrm{EFC}}^{\mathrm{stable}}_{t}=\widehat{\mathrm{EFC}}_{t}Q_{t}G_{t}\Lambda_{t}. \tag{10}$$

The nonredundant stable variant (NRS-EFC) is

$$\widehat{\mathrm{EFC}}^{\mathrm{nr}}_{t}=\frac{\widehat{\mathrm{EFC}}_{t}Q_{t}G^{\mathrm{nr}}_{t}\Lambda^{\mathrm{nr}}_{t}}{1+0.35A_{t}}. \tag{11}$$

Run-level variants are obtained by summing the corresponding event scores.
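A minimal sketch of Eqs. 10 and 11, assuming the gates and the status-quality score are supplied as numbers. Whether the attempt index $A_{t}$ starts at 0 or 1 is not stated in this section; the example below treats a first attempt as index 0, which is an assumption.

```python
def stable_efc(efc_hat: float, q: float, g: float, lam: float) -> float:
    """Status-aware stable estimate EFC_hat * Q_t * G_t * Lambda_t (Eq. 10)."""
    return efc_hat * q * g * lam


def nrs_efc(
    efc_hat: float,
    q: float,
    g_nr: float,
    lam_nr: float,
    attempt_index: float,
    decay: float = 0.35,
) -> float:
    """Nonredundant stable estimate (Eq. 11).

    The denominator 1 + 0.35 * A_t discounts feedback from later attempts.
    attempt_index is assumed to be non-negative.
    """
    if attempt_index < 0:
        raise ValueError("attempt_index must be non-negative")
    return efc_hat * q * g_nr * lam_nr / (1.0 + decay * attempt_index)
```

With all gates open and the same event estimate, an event at attempt index 2 keeps a fraction $1/1.7 \approx 0.59$ of the credit it would receive at index 0.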

3.4 Task Demand and Normalized EFC

To compare tasks with different search and verification demands, we normalize EFC by a task-demand scale:

$$D_{\mathrm{task}}=L\cdot H_{\mathrm{tool}}\cdot S_{\mathrm{state}}\cdot(1+N_{\mathrm{obs}})\cdot(1-V_{\mathrm{oracle}}). \tag{12}$$

Here $L$ is the estimated minimum number of reasoning or action steps, $H_{\mathrm{tool}}$ measures tool-selection ambiguity, $S_{\mathrm{state}}$ measures state-tracking demand, and $N_{\mathrm{obs}}$ measures observation noise or ambiguity. The term $V_{\mathrm{oracle}}$ denotes verifier-signal visibility: the extent to which the task is covered by reliable checks, explicit tests, partial evaluators, or other task-level validation signals. It does not give the agent access to hidden solutions or final success labels; it only reduces the task-demand scale when reliable verification signals are more available.
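Equation 12 is a plain product, sketched below. The argument names and the numeric values in the example are hypothetical; the units and ranges of $H_{\mathrm{tool}}$, $S_{\mathrm{state}}$, and $N_{\mathrm{obs}}$ are not specified in this section, so the function only checks $V_{\mathrm{oracle}}$.

```python
def task_demand(
    min_steps: float,            # L: estimated minimum reasoning/action steps
    tool_ambiguity: float,       # H_tool: tool-selection ambiguity
    state_demand: float,         # S_state: state-tracking demand
    obs_noise: float,            # N_obs: observation noise or ambiguity
    verifier_visibility: float,  # V_oracle: verifier-signal visibility
) -> float:
    """Task-demand scale D_task of Eq. 12."""
    if not 0.0 <= verifier_visibility <= 1.0:
        raise ValueError("verifier_visibility is assumed to lie in [0, 1]")
    return (
        min_steps
        * tool_ambiguity
        * state_demand
        * (1.0 + obs_noise)
        * (1.0 - verifier_visibility)
    )
```

For instance, `task_demand(4, 2.0, 1.5, 0.5, 0.5)` evaluates to $4 \times 2 \times 1.5 \times 1.5 \times 0.5 = 9$. As written, the formula gives $D_{\mathrm{task}}=0$ when $V_{\mathrm{oracle}}=1$, so code that divides by it would need to handle that case; the source does not say how.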

The normalized variables are

$$X=\frac{\mathrm{EFC}}{D_{\mathrm{task}}},\qquad\widehat{X}=\frac{\widehat{\mathrm{EFC}}}{D_{\mathrm{task}}}. \tag{13}$$

We also report EFC efficiency:

$$\eta=\frac{\mathrm{EFC}}{C_{\mathrm{raw}}},\qquad\widehat{\eta}=\frac{\widehat{\mathrm{EFC}}}{C_{\mathrm{raw}}}, \tag{14}$$

where $C_{\mathrm{raw}}$ is the raw cost defined in §2.5.

3.5 Scaling Model and Evaluation Metrics

All scaling analyses use the same power-law failure model over a scalar predictorzz:

$$E(z)=E_{\infty}+Az^{-\alpha}, \tag{15}$$

where $E(z)$ is the predicted failure rate, $E_{\infty}$ is irreducible error, $A$ is a scale parameter, and $\alpha$ is the scaling exponent. We fit this model to raw-compute baselines from Section 2.5 and to EFC-based predictors. For scalar predictors, we median-normalize $z$ before fitting so that fitted exponents are comparable across quantities. Our primary EFC coordinates are unnormalized EFC and its task-demand-normalized forms,

$$X=\frac{\mathrm{EFC}}{D_{\mathrm{task}}},\qquad\widehat{X}=\frac{\widehat{\mathrm{EFC}}}{D_{\mathrm{task}}}. \tag{16}$$

For SAS, which is a multivariate system-level predictor rather than a scalar coordinate, we evaluate its predicted failure rate directly.
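One way to fit Eq. 15 with only the standard library is a grid search over $\alpha$: for a fixed exponent the model is linear in $u=z^{-\alpha}$, so $E_{\infty}$ and $A$ follow from ordinary least squares. This is a sketch under that assumption; the paper does not describe its fitting procedure, and this version leaves $E_{\infty}$ unconstrained.

```python
from statistics import mean, median
from typing import List, Optional, Sequence, Tuple


def median_normalize(z: Sequence[float]) -> List[float]:
    """Divide a positive predictor by its median before fitting."""
    m = median(z)
    if m <= 0:
        raise ValueError("median must be positive")
    return [v / m for v in z]


def power_law(z: float, e_inf: float, a: float, alpha: float) -> float:
    """Failure model E(z) = E_inf + A * z^(-alpha) (Eq. 15)."""
    return e_inf + a * z ** (-alpha)


def fit_power_law(
    z: Sequence[float],
    e: Sequence[float],
    alphas: Optional[Sequence[float]] = None,
) -> Tuple[float, float, float]:
    """Return (E_inf, A, alpha) minimizing squared error over an alpha grid."""
    if alphas is None:
        alphas = [0.05 * k for k in range(1, 61)]  # 0.05 .. 3.0 (arbitrary grid)
    best = None
    for alpha in alphas:
        u = [v ** (-alpha) for v in z]
        u_bar, e_bar = mean(u), mean(e)
        var_u = sum((x - u_bar) ** 2 for x in u)
        if var_u == 0:
            continue  # predictor has no spread; slope is undefined
        a = sum((x - u_bar) * (y - e_bar) for x, y in zip(u, e)) / var_u
        e_inf = e_bar - a * u_bar
        sse = sum((power_law(v, e_inf, a, alpha) - y) ** 2 for v, y in zip(z, e))
        if best is None or sse < best[0]:
            best = (sse, e_inf, a, alpha)
    if best is None:
        raise ValueError("could not fit: predictor values are all identical")
    return best[1], best[2], best[3]
```

On noise-free data generated from the model with an exponent that lies on the grid, the search recovers the generating parameters up to floating-point error. A continuous optimizer would remove the dependence on grid resolution.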

We report two evaluation metrics throughout the paper: $R^{2}$ and MAE. Repeated runs are first aggregated into the evaluation groups of each experiment. On these aggregated groups, $R^{2}$ is computed as

$$R^{2}=1-\frac{\sum_{i}(\bar{E}_{i}-\widehat{E}_{i})^{2}}{\sum_{i}(\bar{E}_{i}-\bar{E})^{2}}, \tag{17}$$

where $\bar{E}_{i}$ is the observed failure rate of cell $i$, $\widehat{E}_{i}$ is the fitted or held-out predicted failure rate, and $\bar{E}$ is the mean observed failure rate over the evaluated groups. MAE is computed as the average absolute difference between observed and predicted failure rates. When an experiment contains multiple evaluation groups or held-out splits, we compute the metric for each group or split and report the average value. Negative $R^{2}$ indicates that the predictor performs worse than predicting the mean failure rate.
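The two metrics can be written in a few lines. The sketch below follows Eq. 17 and the MAE description above; the function names are illustrative.

```python
from statistics import mean
from typing import Sequence


def r_squared(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """R^2 of Eq. 17 over aggregated evaluation cells."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    e_bar = mean(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - e_bar) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed failure rates have zero variance")
    return 1.0 - ss_res / ss_tot


def mae(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error between observed and predicted failure rates."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    return mean(abs(o - p) for o, p in zip(observed, predicted))
```

A perfect predictor gives $R^{2}=1$, predicting the mean observed failure rate everywhere gives $R^{2}=0$, and a predictor that reverses the trend gives a negative value, which is how the near-zero or negative fits reported for raw compute should be read.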

4 Identifying EFC as the Scaling Coordinate

This section tests whether Effective Feedback Compute (EFC) provides a scaling coordinate for agent harnesses. We first compare EFC-based coordinates against raw-compute scalars in controlled tasks, then use matched-budget interventions to isolate feedback quality, and finally test whether EFC can be estimated from trace-time signals before the final outcome is observed.

4.1 Controlled Scaling Separates EFC from Raw Compute

We begin with synthetic controllable tasks, where Oracle-EFC can be measured from hidden state and deterministic checks. For each task family, we evaluate multiple harnesses under different budget levels and compare scalar coordinates under the common power-law failure model in Eq. 15. The coordinates include raw-compute baselines, SAS, Oracle-EFC, Estimated-EFC, and demand-normalized Oracle-EFC, denoted Oracle-EFC/$D_{\mathrm{task}}$.

Figure 1: Controlled scaling comparison on synthetic tasks. The first nine panels, read left to right and top to bottom, fit the common power-law failure model to raw tokens, wall time, raw cost, operations, tool calls, SAS, Oracle-EFC, Estimated-EFC, and Oracle-EFC/$D_{\mathrm{task}}$. Points are aggregated failure rates across task families and harnesses, and curves are fitted trends. The final panel summarizes the corresponding $R^{2}$ values, showing that demand-normalized Oracle-EFC gives the strongest curve collapse.

Figure 1 separates three increasingly informative descriptions of a trajectory. (i) Raw expenditure is a weak scaling coordinate. Raw tokens, wall-clock time, and raw cost reach only $R^{2}=0.33$, $0.37$, and $0.38$, respectively. Moving from aggregate spending to interaction counts helps only modestly: operations and tool calls both reach $0.42$. Thus, the failure curve is not determined by how much budget is spent, nor by how many interaction opportunities are counted. (ii) Feedback-aware coordinates capture the missing signal. SAS improves substantially to $R^{2}=0.88$, but Oracle-EFC and Estimated-EFC both reach $0.94$, with lower MAE. The fact that Estimated-EFC matches Oracle-EFC closely indicates that the useful-feedback signal can be recovered from trace-time evidence rather than requiring hidden task state. (iii) Task normalization removes the remaining scale mismatch. Oracle-EFC/$D_{\mathrm{task}}$ reaches $R^{2}=0.99$ and MAE $0.02$, supporting the view that the relevant scaling coordinate is not absolute feedback alone, but feedback measured relative to task demand.

4.2 Matched Budgets Isolate Feedback Quality

A high-EFC trajectory might appear better simply because it spends more raw compute. To remove this confound, we construct matched-budget pairs on the same task and model. The two conditions share the same token budget, tool-call budget, wall-clock budget, operation count, and raw-cost accounting, but differ only in the quality of feedback returned to the harness. The low-quality condition produces noisy, redundant, and weakly retained observations, whereas the high-quality condition produces targeted, valid, non-redundant feedback that updates the agent state.

Figure 2: Matched-budget feedback-quality control. Left: low- and high-feedback-quality trajectories are matched pairwise in raw budget, with zero mean raw-cost and tool-call deltas. Middle: under the same budget, the high-EFC condition increases success from $0.27$ to $0.90$. Right: the high-EFC condition raises all event-level EFC factors: informativeness (I), validity (V), non-redundancy (R), and memory update (M).

Figure 2 turns the scaling correlation into a matched-budget intervention. (i) Success changes even when raw budget is fixed. The pairwise budget panel collapses onto the diagonal, with mean absolute raw-cost delta $0.000\%$ and mean tool-call delta $0.000$, yet the high-feedback-quality condition increases success from $0.27$ to $0.90$ ($p=1.0\times 10^{-300}$). This rules out the explanation that the gain comes from spending more compute. (ii) The gain comes from better raw-to-EFC conversion. Informativeness, validity, non-redundancy, and memory update all increase together, which matters because Eq. 5 has bottleneck behavior. Feedback contributes little if it is invalid, redundant, irrelevant, or not retained. Thus, the same raw budget succeeds when it is converted into higher-quality effective feedback.
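The bottleneck behavior of the product form can be seen in a short worked example. The factor values below are hypothetical and chosen only for illustration; $\kappa=10$ is the constant used in the paper. Compare an event where all four factors are $0.9$ with one that is identical except that the feedback is barely retained ($M_{t}=0.1$):

```latex
$$
\begin{aligned}
\mathrm{EFC}_{t}^{\text{all high}} &= 10 \times 0.9 \times 0.9 \times 0.9 \times 0.9 = 6.561,\\
\mathrm{EFC}_{t}^{\text{weakly retained}} &= 10 \times 0.9 \times 0.9 \times 0.9 \times 0.1 = 0.729.
\end{aligned}
$$
```

A single weak factor cuts the event's credit by a factor of nine even though the other three are unchanged, which is why raising all four factors together matters more than maximizing any one of them.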

4.3 Trace-Time Estimation Recovers Oracle-EFC

Oracle-EFC requires hidden-state access and is therefore available only in controllable environments. For real trajectories, EFC must be estimated from signals observed before the final outcome is known. We train an event-level Estimated-EFC model on synthetic calibration tasks using only trace-time features, including checker use, checker scope, tool-result references, plan updates, memory retention, repeated-error avoidance, observation consistency, subgoal progress, and trace position. The final success label is not used as an input.

Figure 3: Trace-time Estimated-EFC on held-out tasks. Left: Estimated-EFC/$D_{\mathrm{task}}$ closely follows the held-out failure curve of Oracle-EFC/$D_{\mathrm{task}}$. Right: the held-out $R^{2}$ ranking shows that the trace-time estimator nearly matches the oracle coordinate and outperforms raw-compute baselines and SAS.

Figure 3 addresses whether EFC is only an oracle diagnostic. (i) Estimated-EFC recovers most of the oracle feedback coordinate before the final outcome is observed. Without task normalization, Estimated-EFC reaches $R^{2}=0.86$, close to Oracle-EFC at $0.88$ and comparable to SAS. (ii) Demand normalization improves oracle and estimated coordinates in the same direction. Estimated-EFC/$D_{\mathrm{task}}$ and Oracle-EFC/$D_{\mathrm{task}}$ rise to $0.93$ and $0.95$, respectively, while raw-compute baselines remain much weaker, ranging from raw tokens at $0.44$ to tool calls at $0.68$. This matched improvement shows that trace-time evidence captures useful feedback, and that $D_{\mathrm{task}}$ removes residual task-family scale differences rather than merely re-ranking runs after the fact.

4.4 Executable Code Tasks Preserve the EFC Signal

We next test whether the trace-time EFC signal survives in a more realistic code-execution setting. We use held-out executable programming tasks in a HumanEval-like function-completion format, where candidate programs are checked against executable tests and repaired over multiple attempts. The harness observes syntax errors, runtime errors, unit-test failures, partial correctness signals, and subsequent repair behavior. Estimated-EFC is computed only from these trace-time signals, while oracle quantities are used only for diagnostic comparison. We also evaluate seven harness variants and report estimated feedback efficiency, $\widehat{\eta}=\widehat{\mathrm{EFC}}/C_{\mathrm{raw}}$, measuring how much Estimated-EFC is produced per unit of raw cost.

Figure 4: Executable code tasks preserve the EFC signal. Left: on held-out executable programming tasks, EFC-based predictors achieve the highest predictive $R^{2}$, with Estimated-EFC/$D_{\mathrm{task}}$ reaching $0.97$ and raw tokens reaching $0.78$. Middle: failure curves for Estimated-EFC and Estimated-EFC/$D_{\mathrm{task}}$ are cleaner than the raw-token curve, indicating that execution feedback quality is more informative than generation length. Right: harness success increases with estimated feedback efficiency, showing that harness variants succeed when they convert raw budget into effective feedback more efficiently.

Figure 4 shows that the EFC signal persists in executable code tasks. (i) Tool calls become a strong raw baseline when they partially align with executable feedback. Since tool calls often run tests that expose syntax errors, runtime errors, or unit-test failures, they reach $R^{2}=0.94$, far above raw tokens at $0.78$ and close to SAS at $0.95$. (ii) EFC-based coordinates remain stronger because they score what the tests reveal and whether the harness uses that information. Estimated-EFC reaches $0.96$, Estimated-EFC/$D_{\mathrm{task}}$ reaches $0.97$, and oracle variants reach $0.99$. (iii) Feedback efficiency explains harness-level differences. Shallow or noisy harnesses remain low-success even when they spend budget, while H5 and H6 achieve the highest success by converting raw cost into higher estimated feedback efficiency $\widehat{\eta}=\widehat{\mathrm{EFC}}/C_{\mathrm{raw}}$. Thus, executable feedback strengthens the interpretation that tool use helps when it produces useful, retained evidence, not merely because it increases interaction count.

(i) Raw budget is not the scaling coordinate for agent harnesses. Across controlled scaling and matched-budget interventions, raw-compute scalars explain only part of the performance trend, and success can change sharply even when raw cost, tokens, operations, and tool calls are fixed. (ii) The predictive signal lies in raw-to-EFC conversion. Scalars become stronger as they move from aggregate expenditure toward feedback opportunities, but EFC-based coordinates remain stronger because they account for whether feedback is informative, valid, non-redundant, and retained. (iii) EFC is not merely an oracle diagnostic. Trace-time estimation recovers most of the oracle coordinate before the final outcome is observed, and executable code tasks preserve the same signal even when tool calls are already competitive. Together, these results support EFC, normalized by task demand when needed, as the scaling coordinate for agent harness performance.

5 Decomposing EFC: Harness Efficiency and Task Demand

Section 4 identifies EFC as a stronger scaling coordinate than raw compute. We now decompose this coordinate into two mechanisms. The first is harness efficiency,

$$\eta=\frac{\mathrm{EFC}}{C_{\mathrm{raw}}}, \tag{18}$$

which measures how much effective feedback a harness extracts per unit of raw budget. For trace-estimated and real-trace variants, the numerator is replaced by $\widehat{\mathrm{EFC}}$ or nonredundant stable EFC (NRS-EFC), respectively. The second is the task-normalized feedback variable,

$$X=\frac{\mathrm{EFC}}{D_{\mathrm{task}}}, \tag{19}$$

which measures whether the extracted feedback is sufficient relative to task demand. This section separates the two mechanisms: harness design controls raw-to-EFC conversion through $\eta$, while task demand controls the scale on which feedback becomes sufficient through $D_{\mathrm{task}}$.
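The two mechanisms combine through a simple identity that follows directly from Eqs. 18 and 19. Multiplying and dividing by the raw cost gives:

```latex
$$
X=\frac{\mathrm{EFC}}{D_{\mathrm{task}}}
 =\frac{\mathrm{EFC}}{C_{\mathrm{raw}}}\cdot\frac{C_{\mathrm{raw}}}{D_{\mathrm{task}}}
 =\eta\cdot\frac{C_{\mathrm{raw}}}{D_{\mathrm{task}}}.
$$
```

Read this way, the normalized coordinate can be raised by spending more raw budget, by improving the harness's conversion rate $\eta$, or by facing a task with lower demand. As a purely hypothetical illustration, $\eta=0.3$, $C_{\mathrm{raw}}=60$, and $D_{\mathrm{task}}=9$ give $X=0.3\times 60/9=2$; doubling $\eta$ at the same budget has the same effect on $X$ as doubling the budget at the same $\eta$.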

5.1 Harness Factors Control Raw-to-EFC Conversion

We first vary individual harness and environment factors in controlled synthetic tasks while keeping the task distribution and budget accounting fixed. The positive harness factors are verifier strength, router quality, and memory fidelity; the adverse interface factors are tool entropy, state pressure, and observation noise. For each factor, we compare low, medium, and high levels and measure the induced change in harness efficiency $\eta = \mathrm{EFC}/C_{\mathrm{raw}}$ and the corresponding success rate.

Figure 5: Harness factors control raw-to-EFC conversion. Left: signed high-minus-low effects on $\eta$ show that router quality, verifier strength, and memory fidelity increase efficiency, whereas observation noise, tool entropy, and state pressure reduce it. Right: success rate tracks harness efficiency across factor settings, indicating that these design and interface changes affect performance primarily by changing how much effective feedback is extracted from the same raw budget.

Figure 5 shows that the same raw budget can be converted into very different amounts of effective feedback. (i) The largest gains come from factors that improve the selection and reliability of feedback. Router quality gives the largest efficiency gain (+0.28), followed by verifier strength (+0.22) and memory fidelity (+0.20). This ordering is consistent with Eq. 5: routing improves relevance and non-redundancy, verification improves validity, and memory improves retention. (ii) Interface frictions reduce efficiency by corrupting or diluting feedback. Observation noise produces the largest drop (-0.17), tool entropy also reduces efficiency (-0.11), and state pressure has a smaller negative effect (-0.05). (iii) Outcomes track conversion rather than budget alone. Success rises approximately monotonically with $\eta$, from about 0.19 at the least efficient settings to above 0.40 at the most efficient settings. Thus, these factors affect performance by changing the raw-to-EFC conversion rate.
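The signed high-minus-low effect in the left panel can be read as the difference between mean $\eta$ at the high and low level of a factor. The sketch below illustrates that reading in Python; the function name, the record layout, and every number are hypothetical and are not the paper's measurements.

```python
from statistics import mean


def high_minus_low(records, factor):
    """Signed effect of `factor` on eta: mean eta at 'high' minus mean eta at 'low'.

    `records` is an iterable of (factor_name, level, eta) tuples; this layout is
    an assumption made for illustration.
    """
    high = [eta for name, level, eta in records if name == factor and level == "high"]
    low = [eta for name, level, eta in records if name == factor and level == "low"]
    if not high or not low:
        raise ValueError(f"need both high and low runs for {factor!r}")
    return mean(high) - mean(low)


# Hypothetical runs: (factor, level, eta = EFC / C_raw).
runs = [
    ("router_quality", "low", 0.20), ("router_quality", "low", 0.30),
    ("router_quality", "high", 0.50), ("router_quality", "high", 0.60),
    ("observation_noise", "low", 0.40), ("observation_noise", "high", 0.25),
]
```

A positive value marks a factor that improves raw-to-EFC conversion, and a negative value marks an interface friction.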

5.2 Module Ablations Localize Raw-to-EFC Conversion Gains

We next localize the raw-to-EFC conversion gains to concrete harness components. We ablate three harness modules and one interface corruption factor: verifier strength, memory fidelity, router quality, and observation noise. Each family is evaluated at ordered settings, ranging from disabled or weak configurations to stronger configurations that provide more reliable feedback. We measure both harness efficiency $\eta = \mathrm{EFC}/C_{\mathrm{raw}}$ and downstream success.

Figure 6: Module ablations localize raw-to-EFC conversion gains. Left: stronger verifier, memory, and router modules increase harness efficiency $\eta$, while lower observation noise also improves $\eta$. Middle: the same ordering appears in success rate, showing that module improvements that increase $\eta$ also improve task outcomes. Right: success is almost fully explained by mean $\eta$ ($R^2 = 0.97$), whereas raw cost has essentially no explanatory power ($R^2 = 0.01$), indicating that $\eta$ mediates the performance shift.

Figure 6 shows that raw-to-EFC conversion can be localized to specific harness modules. (i) The ablations produce a dose-response pattern. Moving from disabled or weak modules to stronger verifier, memory, and router settings increases both $\eta$ and success, while reducing observation noise produces the same direction of improvement. Router quality reaches the highest mean $\eta$ and success among the module settings, matching the factor scan in Figure 5. (ii) Harness efficiency mediates the performance shift, while raw cost does not. Across module settings, mean $\eta$ explains almost all variation in success ($R^2 = 0.97$), whereas raw cost explains almost none ($R^2 = 0.01$). This localizes the mechanism: modules help when they improve validity, retention, relevance, non-redundancy, or observation quality, not when they merely change expenditure.

5.3 Task Demand Sets the Required EFC Scale

A harness can convert raw budget into EFC efficiently and still fail when the task requires more feedback than the trajectory provides. We therefore test whether task demand provides the scale needed to compare EFC across task families. In controlled needle lookup, rule filtering, and state tracking tasks, we evaluate raw compute baselines, SAS, raw Oracle-EFC, and two demand-normalized Oracle-EFC coordinates under the same power-law failure model. The hand-designed demand is

$$D_{\mathrm{task}} = L \cdot H_{\mathrm{tool}} \cdot S_{\mathrm{state}} \cdot (1 + N_{\mathrm{obs}}) \cdot (1 - V_{\mathrm{oracle}}), \tag{20}$$

where longer reasoning depth, higher tool entropy, larger state pressure, noisier observations, and lower verifier-signal coverage all increase task demand. The fitted variant uses the same task-demand factors but estimates their relative exponents from calibration tasks.
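A small Python sketch of Eq. 20 follows. With all exponents equal to 1 it reproduces the hand-designed product; passing other exponents illustrates one plausible form of the fitted variant, namely a product of powers of the same five factors. That functional form, the argument names, the input checks, and the example values are assumptions; the section does not spell out how the fitted exponents enter.

```python
import math


def task_demand(depth, tool_entropy, state_pressure, obs_noise, verifier_coverage,
                exponents=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """D_task = L * H_tool * S_state * (1 + N_obs) * (1 - V_oracle), per Eq. 20.

    `exponents` raises each of the five factors to a power. All ones gives the
    hand-designed score; other values are an assumed form for the fitted variant.
    """
    if min(depth, tool_entropy, state_pressure) <= 0:
        raise ValueError("L, H_tool, and S_state must be positive")
    if obs_noise < 0:
        raise ValueError("N_obs must be non-negative")
    if not 0 <= verifier_coverage < 1:
        # V_oracle = 1 would give zero demand and make EFC / D_task undefined.
        raise ValueError("V_oracle must lie in [0, 1)")
    factors = (depth, tool_entropy, state_pressure,
               1.0 + obs_noise, 1.0 - verifier_coverage)
    return math.prod(f ** e for f, e in zip(factors, exponents))


# Hypothetical task: 4 steps, entropy 2, state size 3, noise 0.5, coverage 0.5.
hand_designed = task_demand(4, 2, 3, 0.5, 0.5)
# Same task with reasoning depth down-weighted (illustrative exponents only).
reweighted = task_demand(4, 2, 3, 0.5, 0.5, exponents=(0.5, 1.0, 1.0, 1.0, 1.0))
```

Raising verifier coverage while holding the other factors fixed lowers the score, which matches the stated direction of the $(1 - V_{\mathrm{oracle}})$ term.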

Figure 7: Task demand sets the required EFC scale. The first nine panels, read left to right and top to bottom, fit the common power-law failure model to raw tokens, wall time, raw cost, operations, tool calls, SAS, Oracle-EFC, Oracle-EFC normalized by hand-designed task demand, and Oracle-EFC normalized by fitted task demand. Points are aggregated failure rates across needle lookup, rule filtering, and state tracking tasks, and curves are fitted trends. The final panel summarizes the corresponding $R^2$ values, showing that demand-normalized Oracle-EFC gives the strongest cross-family collapse.

Figure 7 shows that task demand is the missing scale factor for cross-family prediction. (i) Absolute raw budget cannot align task families. Raw compute baselines improve as they move closer to interaction opportunities, from raw tokens at $R^2 = 0.51$ to tool calls at 0.69, and SAS reaches 0.87. Raw Oracle-EFC improves further to 0.90, but the task families still show residual offsets, indicating that the same amount of effective feedback is not equally sufficient across needle lookup, rule filtering, and state tracking. (ii) Dividing EFC by task demand removes this residual scale mismatch. Oracle-EFC/$D_{\mathrm{task}}$ reaches $R^2 = 0.96$ with both hand-designed and fitted task demand, with MAE 0.02 and 0.03, respectively. The similar performance of the two normalizations shows that the key requirement in these controlled families is the denominator itself: EFC must be measured relative to the amount of feedback the task requires.

5.4 Task-Demand Calibration Transfers to Mixed Holdout

We finally test whether task-demand normalization transfers beyond the controlled task families used to motivate it. We evaluate a mixed held-out set with heterogeneous task structure and compare raw compute baselines, SAS, NRS-EFC, hand-designed NRS-EFC/$D_{\mathrm{task}}$, and fitted NRS-EFC/$D_{\mathrm{task}}$. The fitted task-demand model uses the same factors as before, including reasoning depth, tool entropy, state pressure, observation noise, and verifier-signal coverage, but learns their exponents on a calibration split and evaluates the resulting scaling coordinate on unseen tasks.

Figure 8: Task-demand calibration transfers to mixed held-out tasks. Left: prediction error on unseen tasks compares raw tokens, tool calls, raw cost, SAS, NRS-EFC, hand-designed NRS-EFC/$D_{\mathrm{task}}$, and fitted NRS-EFC/$D_{\mathrm{task}}$. The fitted task-demand normalization gives the lowest train-to-holdout MAE and the highest held-out $R^2$. Right: the fitted task-demand model learns non-uniform exponents over demand factors rather than using the fixed hand exponent, showing that mixed tasks require calibrated weighting of demand components.

Figure 8 shows that task-demand calibration improves the portability of the EFC coordinate. (i) Raw compute does not transfer on unseen mixed tasks. Raw tokens, tool calls, and raw cost cluster around train-to-holdout MAE 0.39 with held-out $R^2 = -0.42$, while SAS improves to MAE about 0.28 and $R^2 = 0.10$. NRS-EFC is substantially stronger, reaching MAE about 0.14 and $R^2 = 0.70$, which shows that trace-derived nonredundant and retained feedback carries the transfer signal. (ii) Task-demand normalization must be calibrated in heterogeneous settings. The hand-designed denominator gives MAE about 0.19 and $R^2 = 0.53$, which is weaker than raw NRS-EFC, indicating that equal exponents are poorly matched to this mixed holdout. The fitted denominator gives the best result, reducing MAE to about 0.10 and increasing held-out $R^2$ to 0.83. (iii) The fitted exponents should be read as calibration weights, not universal causal signs. The model upweights tool entropy, downweights reasoning depth and state pressure, and assigns smaller corrections to observation noise and verifier-signal coverage. Thus, calibration preserves the conceptual role of $D_{\mathrm{task}}$ while adapting its scale to the held-out task mixture.

(i) Harness design controls raw-to-EFC conversion. Factor scans and module ablations show that routing, verification, memory, and cleaner observations improve performance mainly by increasing harness efficiency $\eta$, not by increasing raw expenditure. (ii) Task demand controls sufficiency. Absolute EFC is stronger than raw compute, but cross-family prediction improves when EFC is measured relative to $D_{\mathrm{task}}$. (iii) Calibration is needed when task mixtures are heterogeneous. Hand-designed task demand is effective in controlled families, while fitted task-demand normalization gives the best transfer on mixed held-out tasks. Together, these results explain why raw budget alone is insufficient: successful harnesses must both extract more effective feedback per unit budget and measure that feedback against the right task scale.

6 Held-Out and Prospective Validation

The previous sections identify EFC as the scaling coordinate and decompose it into harness efficiency and task demand. We now test whether this coordinate generalizes beyond the configurations used to fit the scaling curves. We consider three validation settings. First, we hold out task families, harness variants, models, and combined configurations. Second, we evaluate heterogeneous real execution traces using nonredundant stable EFC (NRS-EFC). Third, we evaluate a prospective batch of new real traces under a prediction protocol specified before the batch is collected and scored. Across these settings, the central question is whether success is predicted by task-demand-normalized feedback rather than by raw expenditure.

6.1 Task-Demand-Normalized EFC Predicts Unseen Configurations

We evaluate held-out prediction by removing configurations along four axes: task family, harness variant, model, and combined setting. For each split, we fit the power-law scaling model on the remaining configurations in failure-rate space, consistent with Eq. 15. For visualization and calibration analysis, we report the equivalent success-rate prediction $1 - \widehat{E}(z)$ on the held-out group. We compare raw compute baselines, SAS, task-demand-normalized Oracle-EFC, and task-demand-normalized Estimated-EFC. The task-demand denominator is either hand-designed from task-demand factors or fitted on the training split. Estimated-EFC is computed only from trace-time signals.
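The hold-out protocol above can be sketched as a leave-one-group-out loop over a chosen axis, followed by the failure-to-success conversion and a group-level MAE. The Python below is a minimal illustration under assumptions: the record keys, function names, and example runs are hypothetical, and the scaling-model fit itself (Eq. 15) is deliberately left out because its form is not restated in this section.

```python
from collections import defaultdict


def leave_one_group_out(runs, axis):
    """Yield (held_out_value, train_runs, test_runs) for each value of `axis`."""
    groups = defaultdict(list)
    for run in runs:
        groups[run[axis]].append(run)
    for held_out in sorted(groups):
        train = [r for value, rs in groups.items() if value != held_out for r in rs]
        yield held_out, train, groups[held_out]


def success_from_failure(predicted_failure):
    """Convert a predicted failure rate E_hat(z) into a success rate 1 - E_hat(z)."""
    return 1.0 - predicted_failure


def group_mae(predicted_success, observed_success):
    """Mean absolute error between predicted and observed success rates."""
    pairs = list(zip(predicted_success, observed_success))
    if not pairs:
        raise ValueError("need at least one prediction")
    return sum(abs(p - o) for p, o in pairs) / len(pairs)


# Hypothetical run records; keys and values are illustrative only.
runs = [
    {"task": "lookup", "harness": "H0", "model": "m1"},
    {"task": "lookup", "harness": "H1", "model": "m1"},
    {"task": "filter", "harness": "H0", "model": "m2"},
    {"task": "filter", "harness": "H1", "model": "m2"},
]
splits = list(leave_one_group_out(runs, "harness"))
```

Calling `leave_one_group_out(runs, "task")` or `leave_one_group_out(runs, "model")` would give the other single-axis splits; a combined split would need a composite key, which this sketch does not implement.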

Figure 9: Task-demand-normalized EFC predicts unseen configurations. Left: predicted success, computed as $1 - \widehat{E}(z)$ from the fitted failure-rate scaling model, is calibrated against observed success across held-out task, harness, model, and combined splits. Right: grouped bars report held-out group MAE for raw compute baselines, SAS, and task-demand-normalized EFC coordinates across the four held-out axes. Task-demand-normalized EFC gives the lowest prediction error across all axes.

Figure 9 shows that the EFC coordinate transfers across unseen axes rather than only fitting pooled correlations. (i) Calibration is preserved at the level of absolute success after converting the fitted failure-rate predictions to success rates. The left panel stays close to the diagonal across task, harness, model, and combined splits, which indicates that fitted Estimated-EFC/$D_{\mathrm{task}}$ predicts success rates rather than only ranking configurations. (ii) The coordinate ordering is stable across held-out axes. Raw-compute scalars have the largest errors, SAS reduces error, and task-demand-normalized EFC forms the lowest-error group on every split. The gain is largest on harness and combined splits, where raw expenditure is most confounded by changed decision policies. (iii) Fitted and hand-designed task demand are close. Their similar errors show that the transfer signal mainly comes from measuring feedback relative to task demand, while fitting improves calibration under distribution shift.

6.2 Non-Redundant Stable EFC Reveals Slice-Specific Harness Efficiency

We next evaluate whether the same feedback-based scaling picture survives on real execution traces, where observations are noisier, errors are often repeated, and intermediate states are less controlled than in the synthetic and semi-realistic settings. We therefore use nonredundant stable EFC (NRS-EFC), which keeps feedback events that are both informative and retained, while down-weighting redundant or unstable signals that do not contribute durable progress. We pool heterogeneous real slices, including HumanEval-style code generation, terminal interaction, and SWE tasks, and compare raw compute baselines, SAS, NRS-EFC, and NRS-EFC/$D_{\mathrm{task}}$ under the same aggregated scaling protocol. We also examine harness efficiency $\eta = \mathrm{EFC}/C_{\mathrm{raw}}$ using NRS-EFC in order to understand how different harnesses convert raw budget into effective feedback on each real slice.
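To make the filtering idea concrete, here is a toy Python scorer in the spirit of NRS-EFC: it counts only retained events and discounts repeated or unstable ones. It is not the paper's estimator. The event fields, the signature-based notion of redundancy, and the two discount parameters are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeedbackEvent:
    """One observation in a trace (hypothetical fields)."""
    signature: str  # e.g. a normalized error message or failing-test id
    info: float     # informativeness score in [0, 1]
    retained: bool  # whether a later step actually uses the observation
    stable: bool    # whether the observation reproduces when re-checked


def nrs_efc_sketch(events, repeat_discount=0.0, unstable_weight=0.0):
    """Sum informativeness over retained events, discounting repeats and instability."""
    seen = set()
    total = 0.0
    for event in events:
        if not event.retained:
            continue  # feedback that is never used earns no credit
        weight = 1.0
        if event.signature in seen:
            weight *= repeat_discount  # same signal observed again
        if not event.stable:
            weight *= unstable_weight  # flaky signal
        seen.add(event.signature)
        total += weight * event.info
    return total


# Hypothetical trace: one useful failure, its repeat, an ignored hint, a flaky check.
trace = [
    FeedbackEvent("test_parse fails", 0.5, retained=True, stable=True),
    FeedbackEvent("test_parse fails", 0.5, retained=True, stable=True),
    FeedbackEvent("lint warning", 0.4, retained=False, stable=True),
    FeedbackEvent("timeout on CI", 0.3, retained=True, stable=False),
]
```

With the default hard filter only the first event earns credit; softer discounts give partial credit to the repeat and the flaky check.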

Figure 10: Nonredundant stable EFC reveals slice-specific harness efficiency. Left: pooled scalar comparison on mixed real traces, reporting predictive $R^2$ for raw compute baselines, SAS, NRS-EFC, and NRS-EFC/$D_{\mathrm{task}}$. Demand-normalized NRS-EFC gives the strongest pooled fit, followed closely by raw NRS-EFC, while raw compute baselines fail to provide a reliable pooled scaling coordinate. Right: harness efficiency $\eta$ computed from NRS-EFC is shown for H0–H6 across HumanEval, Terminal, and SWE slices. The absolute efficiency scale and the ordering of harnesses vary substantially by slice, showing that harness efficiency is task-dependent rather than a fixed global property.

Figure 10 shows two main results. (i) Raw compute largely fails as a pooled predictor on heterogeneous real traces: raw tokens and wall time both obtain $R^2 = -0.08$, raw cost reaches -0.07, and tool calls and operations each remain at -0.02. This contrasts sharply with NRS-EFC, which reaches $R^2 = 0.89$, and NRS-EFC/$D_{\mathrm{task}}$, which further improves to 0.92. SAS is substantially better than raw compute at 0.43, but it still falls far short of the feedback-based coordinates. Thus, in realistic mixed settings, simply spending more budget is not predictive of success; what matters is whether the trajectory accumulates nonredundant, retained feedback, and whether that feedback is large enough relative to task demand. (ii) The right panel shows that harness efficiency is strongly slice-specific rather than globally fixed. On HumanEval, later harnesses H5 and H6 dominate with $\eta \approx 1.9$, far above earlier variants, indicating that richer verification and feedback exploitation translate into much more effective progress on executable coding tasks. On Terminal, all harnesses remain at very low efficiency, roughly around 0.1, suggesting that the slice is intrinsically hard to convert into clean, reusable feedback. On SWE, the pattern changes again: earlier or mid-stage harnesses perform best, with H0 and H3 among the strongest, while H5 and H6 no longer dominate. This reversal indicates that the best harness design depends on the structure of the environment and the type of feedback available. Overall, these results support the view that NRS-EFC captures the relevant scaling signal on real tasks, while $\eta$ should be understood as a harness–task interaction rather than as an invariant property of the harness alone.

6.3 Prospective Holdout Validates the Scaling Coordinate

Figure 11: Prospective holdout prediction. Bars report held-out MAE on a prospective batch of unseen real traces, and text annotations report the corresponding held-out $R^2$. NRS-EFC/$D_{\mathrm{task}}$ gives the best prospective prediction, followed closely by raw NRS-EFC, while raw compute baselines and SAS are substantially weaker.

We conclude with a prospective holdout test that evaluates whether the same coordinate transfers to traces that were not available during metric design or calibration. Before collecting the prospective runs, we fixed the prediction protocol, including the definition of NRS-EFC, the task-demand factors, the fitted task-demand exponents, and all comparison baselines. We then evaluate a new held-out batch of real traces and compare raw compute baselines, SAS, NRS-EFC, and NRS-EFC/$D_{\mathrm{task}}$ using the same held-out prediction procedure as above. This setting asks whether a prespecified feedback-based coordinate remains predictive when both the task mix and the evaluation examples are unseen.

Figure 11 shows that the ordering observed in earlier validation settings transfers to the prospective batch. (i) Feedback-based coordinates remain predictive on new real traces. NRS-EFC/$D_{\mathrm{task}}$ achieves the best held-out prediction, with the highest held-out $R^2$ of 0.85, while raw NRS-EFC follows closely at 0.77. SAS remains informative but is substantially weaker at $R^2 = 0.26$. (ii) Raw expenditure is not a reliable out-of-sample coordinate. Raw tokens, tool calls, wall time, operations, and raw cost all obtain negative held-out $R^2$, indicating that they perform worse than a mean predictor on the prospective batch. (iii) Task-demand normalization provides a targeted calibration gain rather than an evaluation-set-specific adjustment. NRS-EFC already captures nonredundant retained feedback, and dividing by $D_{\mathrm{task}}$ improves prediction when the prospective batch mixes tasks with different feedback requirements. Because the coordinate and calibration procedure were specified before evaluating the prospective batch, the improvement supports the proposed scaling coordinate rather than post hoc adaptation to the held-out examples.
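A negative held-out $R^2$ follows directly from the standard definition $R^2 = 1 - \mathrm{SS}_{\mathrm{res}}/\mathrm{SS}_{\mathrm{tot}}$: whenever a predictor's squared error exceeds that of always predicting the mean, the ratio exceeds 1. The Python sketch below shows this with made-up success rates. It assumes the baseline is the mean of the evaluated outcomes; the paper's exact held-out convention is not restated here.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values have zero variance")
    return 1.0 - ss_res / ss_tot


# Hypothetical held-out success rates and three candidate predictors.
observed = [0.2, 0.4, 0.6, 0.8]
close_fit = [0.25, 0.35, 0.65, 0.75]  # tracks the trend
mean_only = [0.5, 0.5, 0.5, 0.5]      # always predicts the mean
reversed_trend = [0.8, 0.6, 0.4, 0.2]  # gets the direction wrong
```

Here the close fit scores near 1, the mean predictor scores 0, and the reversed predictor scores well below 0, which is the regime the raw-compute baselines fall into on the prospective batch.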

(i) The EFC coordinate generalizes across held-out axes. Task-demand-normalized Oracle-EFC and Estimated-EFC predict unseen task families, harness variants, models, and combined configurations more accurately than raw compute baselines or SAS. (ii) Real traces require stable nonredundant feedback. NRS-EFC filters repeated or unstable observations and restores a strong pooled scaling signal, while slice-specific $\eta$ shows that harness efficiency is a harness–task interaction. (iii) Prospective validation provides an additional check against post hoc metric tuning. With the metric family, task-demand factors, and calibration procedure specified before evaluation, NRS-EFC/$D_{\mathrm{task}}$ remains the best predictor on new real traces. Together, these results show that a predictive scaling law for agent harnesses must account for both raw-to-EFC conversion and task demand.

7 Related Work

Scaling and test-time compute.

Scaling laws connect language-model performance with model size, data size, and training compute (Kaplan et al., 2020; Hoffmann et al., 2022). Recent work extends this view to inference-time computation, showing that repeated sampling, search, revision, and adaptive allocation can improve performance when additional test-time budget is available (Brown et al., 2025; Snell et al., 2025; Zhu et al., 2025; Kim et al., 2026a). These studies establish test-time compute as an important scaling axis, but they usually measure compute through samples, tokens, rollouts, or search budget. Our work instead asks which scalar predicts when this extra computation becomes useful. EFC answers this by counting feedback that is informative, valid, non-redundant, and retained, then normalizing it by task demand.

Harness feedback and system scaling.

Agent harnesses improve language models by adding reasoning, action, verification, search, memory, and iterative feedback (Yao et al., 2023b; Shinn et al., 2023; Madaan et al., 2023; Yao et al., 2023a; Zhou et al., 2024; Lightman et al., 2024). Recent work increasingly treats the harness itself as a first-class object of design and evaluation, including natural-language harness representations, versioned agent-optimization loops, and context-retrieval benchmarks for coding agents (Pan et al., 2026; Ursekar et al., 2026; Li et al., 2026a). System-level studies further argue that agent performance depends on the joint scaling of model capability, orchestration, verification, coordination, and overhead rather than on base-model scale alone (Kim et al., 2026b; Li et al., 2026b). These works motivate harness-level scaling, while our focus is complementary: we seek a trace-level scalar coordinate that separates raw expenditure from useful retained feedback. EFC therefore provides a compact predictive quantity for closed-loop harness scaling and is directly compared with SAS as a strong system-level baseline.

8 Conclusion

This paper argues that the scaling behavior of agent harnesses is better explained by effective feedback than by raw test-time expenditure. We introduced Effective Feedback Compute (EFC), a trace-level coordinate that measures the amount of valid, relevant, non-redundant, and retained feedback available to a harness, together with task-demand normalization for comparing heterogeneous tasks. Across controlled simulations, executable code tasks, real mixed traces, held-out splits, and a prospective validation batch, EFC-based coordinates consistently outperform raw-compute baselines such as tokens, tool calls, operations, wall time, and cost, and also improve over a strong SAS baseline. The experiments further show that harness interventions primarily matter by changing how efficiently raw budget is converted into durable feedback: under matched raw budgets, improving feedback quality substantially increases success, while normalized EFC produces the strongest curve collapse across task difficulty. These results suggest that agent-system scaling should be understood not simply as spending more inference-time compute, but as accumulating sufficient effective feedback relative to task demand. Future work should extend EFC estimation to broader open-ended environments, improve task-demand calibration, and use EFC as an objective for adaptive budget allocation and harness design.

References

B. Brown, J. Juravsky, R. S. Ehrlich, R. Clark, Q. V. Le, C. Re, and A. Mirhoseini (2025) Large language monkeys: scaling inference compute with repeated sampling.
M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba (2021) Evaluating large language models trained on code. arXiv:2107.03374.
J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, O. Vinyals, J. W. Rae, and L. Sifre (2022) Training compute-optimal large language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS '22, Red Hook, NY, USA. ISBN 9781713871088.
C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. R. Narasimhan (2024) SWE-bench: can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations.
J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei (2020) Scaling laws for neural language models. arXiv:2001.08361.
J. Kim, W. Yang, K. Niu, H. Zhang, Y. Zhu, E. Helenowski, R. Silva, Z. Chen, S. Iyer, M. Zaheer, D. Fried, H. Hajishirzi, S. Arora, G. Synnaeve, R. Salakhutdinov, and A. Goyal (2026a) Scaling test-time compute for agentic coding. arXiv:2604.16529.
Y. Kim, K. Gu, C. Park, C. Park, S. Schmidgall, A. A. Heydari, Y. Yan, Z. Zhang, Y. Zhuang, Y. Liu, M. Malhotra, P. P. Liang, H. W. Park, Y. Yang, X. Xu, Y. Du, S. Patel, T. Althoff, D. McDuff, and X. Liu (2026b) Towards a science of scaling agent systems. arXiv:2512.08296.
H. Li, L. Zhu, B. Zhang, R. Feng, J. Wang, Y. Pan, E. T. Barr, F. Sarro, Z. Chu, and H. Ye (2026a) ContextBench: a benchmark for context retrieval in coding agents. arXiv:2602.05892.
X. Li, R. Ming, P. Setlur, A. Paladugu, A. Tang, H. Kang, S. Shao, R. Jin, and C. Xiong (2026b) Benchmark test-time scaling of general llm agents. arXiv:2602.18998.
H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe (2024) Let's verify step by step. In The Twelfth International Conference on Learning Representations.
A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, S. Gupta, B. P. Majumder, K. Hermann, S. Welleck, A. Yazdanbakhsh, and P. Clark (2023) Self-refine: iterative refinement with self-feedback. In Thirty-seventh Conference on Neural Information Processing Systems.
M. A. Merrill, A. G. Shaw, N. Carlini, B. Li, H. Raj, I. Bercovich, L. Shi, J. Y. Shin, T. Walshe, E. K. Buchanan, J. Shen, G. Ye, H. Lin, J. Poulos, M. Wang, M. Nezhurina, D. Lu, O. M. Mastromichalakis, Z. Xu, Z. Chen, Y. Liu, R. Zhang, L. L. Chen, A. Kashyap, J. Uslu, J. Li, J. Wu, M. Yan, S. Bian, V. Sharma, K. Sun, S. Dillmann, A. Anand, A. Lanpouthakoun, B. Koopah, C. Hu, E. K. Guha, G. H. S. Dreiman, J. Zhu, K. Krauth, L. Zhong, N. Muennighoff, R. K. Amanfu, S. Tan, S. Pimpalgaonkar, T. Aggarwal, X. Lin, X. Lan, X. Zhao, Y. Liang, Y. Wang, Z. Wang, C. Zhou, D. Heineman, H. Liu, H. Trivedi, J. Yang, J. Lin, M. Shetty, M. Yang, N. Omi, N. Raoof, S. Li, T. Y. Zhuo, W. Lin, Y. Dai, Y. Wang, W. Chai, S. Zhou, D. Wahdany, Z. She, J. Hu, Z. Dong, Y. Zhu, S. Cui, A. Saiyed, A. Kolbeinsson, C. M. Rytting, R. Marten, Y. Wang, J. Jitsev, A. Dimakis, A. Konwinski, and L. Schmidt (2026) Terminal-bench: benchmarking agents on hard, realistic tasks in command line interfaces. In The Fourteenth International Conference on Learning Representations.
L. Pan, L. Zou, S. Guo, J. Ni, and H. Zheng (2026) Natural-language agent harnesses. arXiv:2603.25723.
N. Shinn, F. Cassano, A. Gopinath, K. R. Narasimhan, and S. Yao (2023) Reflexion: language agents with verbal reinforcement learning. In Thirty-seventh Conference on Neural Information Processing Systems.
C. V. Snell, J. Lee, K. Xu, and A. Kumar (2025) Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning. In The Thirteenth International Conference on Learning Representations.
V. Ursekar, A. Shanker, V. Chatrath, Y. Xue, and S. Denton (2026) VeRO: an evaluation harness for agents to optimize agents. arXiv:2602.22480.
S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. R. Narasimhan (2023a) Tree of thoughts: deliberate problem solving with large language models. In Thirty-seventh Conference on Neural Information Processing Systems.
S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao (2023b) ReAct: synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations.
A. Zhou, K. Yan, M. Shlapentokh-Rothman, H. Wang, and Y. Wang (2024) Language agent tree search unifies reasoning acting and planning in language models.
K. Zhu, H. Li, S. Wu, T. Xing, D. Ma, X. Tang, M. Liu, J. Yang, J. Liu, Y. E. Jiang, C. Zhang, C. Lin, J. Wang, G. Zhang, and W. Zhou (2025) Scaling test-time compute for llm agents. arXiv:2506.12928.

Appendix A Appendix

Appendix B Task Details

This appendix describes the task layers used in our experiments. The purpose of the task suite is not to maximize benchmark coverage, but to span a controlled range from oracle-observable feedback to realistic agent trajectories.

B.1 Synthetic Controllable Tasks

The synthetic layer contains procedurally generated tasks with hidden state and known ground-truth solution paths. Each task instance specifies a target state, a set of candidate hypotheses, a tool interface, and a deterministic evaluator. Because the latent state is known to the experimenter, we can compute Oracle-EFC for each feedback event.

We use this layer for three purposes. First, it provides controlled variation in task demand, including required solution length, tool-selection entropy, state size, observation noise, and verifier availability. Second, it allows us to test whether feedback events measured by EFC correspond to real progress toward the solution. Third, it provides calibration data for Estimated-EFC, which is later applied to tasks where oracle state is unavailable.

Each synthetic task is generated from a template family. Template parameters control the number of latent variables, the number of distractor tools, the probability of noisy observations, and the coverage of deterministic checkers. We split generated tasks into calibration and held-out sets by template family and random seed.

B.2 Semi-Realistic Executable Tasks

The semi-realistic layer contains tasks with realistic artifacts and executable checkers. Examples include small code-repair tasks, data-analysis tasks, and tool-use tasks with deterministic validation. These tasks are designed to produce natural agent traces while preserving reliable evaluation.

For code tasks, the final answer is executed against unit tests or reference checks. Intermediate checker results, such as syntax errors, failing tests, and partial correctness signals, are logged as trace observations. For tool-use tasks, the environment exposes a fixed set of tools and records whether tool outputs are later referenced or used in solution updates.

This layer is used to evaluate whether the EFC estimators calibrated on synthetic tasks remain predictive when traces contain realistic model errors, ambiguous observations, and incomplete verification.

B.3 Real Benchmark Subsets

The real benchmark layer uses verifiable subsets of existing agent benchmarks. We include only tasks for which final success can be evaluated automatically and for which traces can be logged in a consistent format. The benchmark layer is used for external validity rather than leaderboard comparison.

We apply the following filtering rules:

• the task must have an automatic final evaluator;
• the required environment must be reproducible in a local or sandboxed execution setting;
• the task must admit meaningful intermediate observations, such as test results, command outputs, or tool responses;
• the task must fit within the budget limits used by the harness family.
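The four rules can be sketched as a filter that also records which rule excluded each task. The `BenchmarkTask` fields, the rule labels, and the single scalar `budget_limit` are hypothetical names introduced for illustration, not the paper's metadata schema.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkTask:
    task_id: str
    has_auto_evaluator: bool
    reproducible_in_sandbox: bool
    has_intermediate_observations: bool
    estimated_budget: float  # same (hypothetical) unit as budget_limit


def failed_rules(task, budget_limit):
    """Return the list of filtering rules that the task violates."""
    failures = []
    if not task.has_auto_evaluator:
        failures.append("no automatic final evaluator")
    if not task.reproducible_in_sandbox:
        failures.append("environment not reproducible")
    if not task.has_intermediate_observations:
        failures.append("no meaningful intermediate observations")
    if task.estimated_budget > budget_limit:
        failures.append("exceeds harness budget limit")
    return failures


def filter_tasks(tasks, budget_limit):
    """Split tasks into kept tasks and a map from excluded id to reasons."""
    kept, excluded = [], {}
    for task in tasks:
        failures = failed_rules(task, budget_limit)
        if failures:
            excluded[task.task_id] = failures
        else:
            kept.append(task)
    return kept, excluded
```

Keeping the reasons alongside the excluded identifiers makes it straightforward to report each filtering rule separately.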

When a benchmark contains tasks with very large setup cost or unstable external dependencies, we exclude those tasks from the main analysis and report the filtering rule separately. The resulting subset is intended to test scaling-law generalization, not to estimate absolute benchmark performance.

B.4 Task Demand Variables

For each task, we compute a hand-designed task-demand score $D_{\mathrm{task}}$ from the following components:

• $L$ is the estimated minimum number of reasoning or action steps.
• $H_{\mathrm{tool}}$ measures tool-selection entropy or the number of plausible but incorrect tools.
• $S_{\mathrm{state}}$ measures the amount of state that must be tracked across the trajectory.
• $N_{\mathrm{obs}}$ measures observation noise, ambiguity, or nondeterminism.
• $V_{\mathrm{oracle}}$ measures verifier-signal visibility, namely the availability and coverage of reliable task-level validation signals such as deterministic checkers, explicit tests, partial evaluators, or benchmark scoring hooks.

In real benchmark traces, $V_{\mathrm{oracle}}$ is computed from task metadata about verification coverage rather than from the agent's final outcome or a hidden solution. The agent does not observe oracle answers through this quantity. The factor appears as $1-V_{\mathrm{oracle}}$ because tasks with stronger reliable verification signals require less feedback mass to reach the same effective progress.

All components are normalized within each task layer before forming the product. In addition to this hand-designed score, we also evaluate fitted task-demand weights learned on calibration tasks and applied to held-out tasks.

B.5 Run Logging

For every run, we log the task identifier, task family, harness identifier, model identifier, budget configuration, final success label, raw compute variables, checker outputs, and the full sequence of trajectory events. Each event records the action type, observation type, tool name if applicable, checker result if available, memory update if present, and references to earlier observations. These logs are the sole input to Estimated-EFC on non-oracle tasks.
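One way to hold these logs is a pair of record types, sketched below. The class names, field names, field types, and the JSON serialization are assumptions; the paper lists what is logged but not a concrete schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class TraceEvent:
    action_type: str
    observation_type: str
    tool_name: Optional[str] = None       # set only for tool calls
    checker_result: Optional[str] = None  # set only when a checker ran
    memory_update: Optional[str] = None   # set only when memory changed
    references: list = field(default_factory=list)  # indices of earlier events


@dataclass
class RunLog:
    task_id: str
    task_family: str
    harness_id: str
    model_id: str
    budget_config: dict
    success: bool
    raw_compute: dict        # e.g. tokens, tool calls, wall time
    checker_outputs: list
    events: list             # list of TraceEvent, in trajectory order


def to_json(run):
    """Serialize a run log, including nested events, to a JSON string."""
    return json.dumps(asdict(run))


run = RunLog(
    task_id="task-001", task_family="code_repair", harness_id="H5",
    model_id="model-a", budget_config={"max_steps": 8}, success=True,
    raw_compute={"tokens": 1200, "tool_calls": 3},
    checker_outputs=["assertion error", "passed"],
    events=[
        TraceEvent("run_tests", "checker", checker_result="assertion error"),
        TraceEvent("edit_file", "tool", tool_name="editor", references=[0]),
    ],
)
encoded = to_json(run)
```

The `references` list is what lets a later analysis check whether an observation was actually used in a subsequent step. Because the estimator must not see the final correctness label at the event level, an implementation would drop `success` before computing event factors.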

Appendix C EFC Factor Measurement

This appendix specifies how the event factors $I_{t}, V_{t}, R_{t}, M_{t}$ are computed or estimated in each task layer. Section 3 defines the common EFC variables and aggregation rules.

C.1 Common Notation

We use $\operatorname{clip}(x)$ to denote clipping to $[0,1]$:

$$\operatorname{clip}(x)=\min\{1,\max\{0,x\}\}. \tag{22}$$

For every event, the factors are computed before aggregation. The run-level value is then obtained by Eq. 6.

When a feature is unavailable in a task layer, we use the closest trace-observable proxy. The final answer correctness label is excluded from all event-level factor estimates.

C.2 Synthetic Controllable Tasks

In the synthetic layer, the experimenter observes the latent task state and the ground-truth solution path. Oracle-EFC therefore uses direct event-level factors. Let $n_{t}$ denote novelty relative to previous events, $B^{\mathrm{noise}}_{t}$ the reliability of the observation channel, $B^{\mathrm{route}}_{t}$ routing quality, $B^{\mathrm{verify}}_{t}$ verifier strength, and $B^{\mathrm{mem}}_{t}$ memory fidelity. We instantiate the factors as

$$I_{t}=\operatorname{clip}\left(B^{\mathrm{route}}_{t}B^{\mathrm{noise}}_{t}\Delta^{\mathrm{latent}}_{t}\right), \tag{23}$$

$$V_{t}=\operatorname{clip}\left(B^{\mathrm{verify}}_{t}B^{\mathrm{noise}}_{t}V_{\mathrm{oracle}}\right), \tag{24}$$

$$R_{t}=\operatorname{clip}\left(n_{t}B^{\mathrm{route}}_{t}B^{\mathrm{tool}}_{t}\right), \tag{25}$$

$$M_{t}=\operatorname{clip}\left(B^{\mathrm{mem}}_{t}B^{\mathrm{state}}_{t}(0.82+0.18V_{t})\right). \tag{26}$$

Here $\Delta^{\mathrm{latent}}_{t}$ is ground-truth progress toward the target state, $B^{\mathrm{tool}}_{t}$ discounts high tool ambiguity, and $B^{\mathrm{state}}_{t}$ discounts memory pressure in large state spaces.

For the three synthetic task families, $\Delta^{\mathrm{latent}}_{t}$ is defined as follows:

• Needle Lookup. Let $C_{t}$ be the remaining candidate set and let $b_{t}$ indicate whether the target key-value relation is recovered. We use $\Delta^{\mathrm{latent}}_{t}=\operatorname{clip}\left((|C_{t-1}|-|C_{t}|)/\max(1,|C_{t-1}|)+b_{t}\right)$.
• State Tracking. Let $d_{t}$ be the number of correct state transitions committed after event $t$ and let $r^{\mathrm{fix}}_{t}$ indicate correction of a previous state error. We use $\Delta^{\mathrm{latent}}_{t}=\operatorname{clip}\left((d_{t}-d_{t-1})/\max(1,L)+r^{\mathrm{fix}}_{t}\right)$.
• Rule Filter. Let $E_{t}$ be the number of eliminated nonmatching items and $P_{t}$ the number of confirmed matching conditions. We use $\Delta^{\mathrm{latent}}_{t}=\operatorname{clip}\left((E_{t}-E_{t-1}+P_{t}-P_{t-1})/\max(1,N_{\mathrm{items}})\right)$.
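The three progress definitions above translate directly into code. In this sketch the function and argument names are hypothetical, and the indicator variables $b_{t}$ and $r^{\mathrm{fix}}_{t}$ are assumed to be booleans that count as 0 or 1.

```python
def clip(x):
    """Clip a value to the interval [0, 1]."""
    return min(1.0, max(0.0, x))


def delta_needle_lookup(prev_candidates, curr_candidates, recovered):
    """Progress = fractional shrink of the candidate set, plus 1 on recovery."""
    shrink = (prev_candidates - curr_candidates) / max(1, prev_candidates)
    return clip(shrink + (1.0 if recovered else 0.0))


def delta_state_tracking(d_prev, d_curr, solution_length, fixed_error):
    """Progress = newly committed correct transitions over L, plus 1 on a fix."""
    gained = (d_curr - d_prev) / max(1, solution_length)
    return clip(gained + (1.0 if fixed_error else 0.0))


def delta_rule_filter(e_prev, e_curr, p_prev, p_curr, n_items):
    """Progress = newly eliminated items plus newly confirmed conditions."""
    return clip((e_curr - e_prev + p_curr - p_prev) / max(1, n_items))
```

For example, halving a candidate set of 10 without recovering the target gives `delta_needle_lookup(10, 5, False)` equal to 0.5, while any event that recovers the target saturates at 1 because of the clip.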

The Oracle-EFC event score is

$$\mathrm{EFC}^{\mathrm{oracle}}_{t}=\kappa I_{t}V_{t}R_{t}M_{t}. \tag{27}$$
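Eqs. 23 to 27 can be sketched as a single function per event. The argument names are hypothetical, and the default `kappa=1.0` is an assumption made only so the example runs; this appendix does not state the value of $\kappa$.

```python
def clip(x):
    """Clip a value to the interval [0, 1] (Eq. 22)."""
    return min(1.0, max(0.0, x))


def oracle_factors(delta_latent, n_t, b_route, b_noise, b_verify,
                   b_mem, b_tool, b_state, v_oracle):
    """Return (I_t, V_t, R_t, M_t) following Eqs. 23-26."""
    i_t = clip(b_route * b_noise * delta_latent)
    v_t = clip(b_verify * b_noise * v_oracle)
    r_t = clip(n_t * b_route * b_tool)
    m_t = clip(b_mem * b_state * (0.82 + 0.18 * v_t))
    return i_t, v_t, r_t, m_t


def oracle_efc_event(factors, kappa=1.0):
    """Eq. 27: product of the four factors, scaled by kappa (assumed 1.0)."""
    i_t, v_t, r_t, m_t = factors
    return kappa * i_t * v_t * r_t * m_t


ideal = oracle_factors(delta_latent=0.5, n_t=1.0, b_route=1.0, b_noise=1.0,
                       b_verify=1.0, b_mem=1.0, b_tool=1.0, b_state=1.0,
                       v_oracle=1.0)
unverified = oracle_factors(delta_latent=0.5, n_t=1.0, b_route=1.0, b_noise=1.0,
                            b_verify=0.0, b_mem=1.0, b_tool=1.0, b_state=1.0,
                            v_oracle=1.0)
```

Because the score is a product, any single factor at zero removes the event's credit entirely: the `unverified` event makes the same latent progress as `ideal` but has $V_{t}=0$, so its event score is zero.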

C.3 Semi-Realistic Executable Tasks

In the semi-realistic layer, traces contain executable feedback and realistic model errors. We compute factor estimates from observable event features. Let $c_{t}$ be checker fired, $h_{t}$ checker scope, $z_{t}$ tool-result reference, $p_{t}$ plan update, $m_{t}$ memory retention, $a_{t}$ repeated-error avoidance, $q_{t}$ observation consistency, $\Delta_{t}$ subgoal progress, and $n_{t}$ novelty. Let $B_{\mathrm{router}}$ be harness routing quality, $B_{\mathrm{verify}}$ verifier strength, $H_{\mathrm{tool}}$ tool ambiguity, and $E_{\mathrm{explore}}$ exploration entropy.

We estimate informativeness as

$$\widehat{I}_{t}=\operatorname{clip}\left(\frac{\Delta_{t}(0.70+0.30B_{\mathrm{router}})}{1+0.12H_{\mathrm{tool}}+0.20E_{\mathrm{explore}}}\right). \tag{28}$$

We estimate validity as

$$\widehat{V}_{t}=\operatorname{clip}\left(q_{t}(0.70+0.30B_{\mathrm{verify}})(0.72+0.28V_{\mathrm{oracle}})\right). \tag{29}$$

We estimate non-redundant relevance as

$$\widehat{R}_{t}=\operatorname{clip}\left(\frac{(0.28+0.72n_{t})(0.48+0.52a_{t})}{1+0.12H_{\mathrm{tool}}}\right). \tag{30}$$

We estimate memory update as

$$\widehat{M}_{t}=\operatorname{clip}\left(m_{t}(0.80+0.20p_{t})\right). \tag{31}$$

For calibration traces with oracle-visible progress, these factor estimates define the event target

$$y_{t}=\kappa\widehat{I}_{t}\widehat{V}_{t}\widehat{R}_{t}\widehat{M}_{t}. \tag{32}$$

The trace estimator in Eq. 8 is trained to predict $y_{t}$ from the feature vector in Eq. 7.
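The factor estimates in Eqs. 28 to 32 can be sketched as follows. The argument names are hypothetical, all event features are assumed to be already scaled to $[0,1]$, and `kappa=1.0` is an assumed placeholder because this appendix does not give its value.

```python
def clip(x):
    """Clip a value to the interval [0, 1] (Eq. 22)."""
    return min(1.0, max(0.0, x))


def estimated_factors(delta_t, q_t, n_t, a_t, m_t, p_t,
                      b_router, b_verify, h_tool, e_explore, v_oracle):
    """Return (I_hat, V_hat, R_hat, M_hat) following Eqs. 28-31."""
    i_hat = clip(delta_t * (0.70 + 0.30 * b_router)
                 / (1 + 0.12 * h_tool + 0.20 * e_explore))
    v_hat = clip(q_t * (0.70 + 0.30 * b_verify) * (0.72 + 0.28 * v_oracle))
    r_hat = clip((0.28 + 0.72 * n_t) * (0.48 + 0.52 * a_t)
                 / (1 + 0.12 * h_tool))
    m_hat = clip(m_t * (0.80 + 0.20 * p_t))
    return i_hat, v_hat, r_hat, m_hat


def event_target(factors, kappa=1.0):
    """Eq. 32: calibration target y_t as the scaled product of the estimates."""
    i_hat, v_hat, r_hat, m_hat = factors
    return kappa * i_hat * v_hat * r_hat * m_hat


best_case = estimated_factors(delta_t=1, q_t=1, n_t=1, a_t=1, m_t=1, p_t=1,
                              b_router=1, b_verify=1, h_tool=0, e_explore=0,
                              v_oracle=1)
repeated = estimated_factors(delta_t=1, q_t=1, n_t=0, a_t=0, m_t=1, p_t=1,
                             b_router=1, b_verify=1, h_tool=0, e_explore=0,
                             v_oracle=1)
```

Unlike the oracle factors, most terms here have a nonzero floor: an event with no novelty and no repeated-error avoidance still receives relevance $0.28 \times 0.48 = 0.1344$ rather than zero, so redundant events are discounted rather than discarded.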

C.4 Real Benchmark Subsets

For HumanEval, Terminal-Bench 2.0, and SWE-bench Verified, hidden state is unavailable. We first compute $\widehat{\mathrm{EFC}}_{t}$ with the calibrated trace estimator. We then apply deterministic status and repetition gates derived from execution traces.

The status-quality factor $Q_{t}$ is

$$Q_{t}=\begin{cases}1.00,&\text{passed},\\ 0.42,&\text{assertion error},\\ 0.12,&\text{runtime error},\\ 0.06,&\text{timeout},\\ 0.04,&\text{static reject or missing entry point},\\ 0.00,&\text{API error},\\ 0.25,&\text{other status}.\end{cases} \tag{33}$$

Let $\operatorname{sev}(s_{t})$ map event status to ordered severity, with API errors lowest and passing checks highest. The progress gate is

$$G_{t}=\begin{cases}1.00,&A_{t}=0,\\ 1.35,&s_{t}=\text{passed}\text{ and }s_{t-1}\neq\text{passed},\\ 1.15,&\operatorname{sev}(s_{t})>\operatorname{sev}(s_{t-1}),\\ 0.62,&\operatorname{sev}(s_{t})=\operatorname{sev}(s_{t-1})\text{ and }s_{t}\neq\text{passed},\\ 0.45,&\operatorname{sev}(s_{t})<\operatorname{sev}(s_{t-1}),\\ 1.00,&\text{otherwise}.\end{cases} \tag{34}$$

The loop gate is

$$\Lambda_{t}=\begin{cases}0.95,&\text{repair event},\\ 0.92,&\text{generation event with passing status},\\ 0.85,&\text{generation event without passing status},\\ 1.00,&\text{otherwise}.\end{cases} \tag{35}$$

For nonredundant stable EFC, repeated failures receive stronger discounts:

$$G^{\mathrm{nr}}_{t}=\begin{cases}1.00,&A_{t}=0,\\ 1.35,&s_{t}=\text{passed}\text{ and }s_{t-1}\neq\text{passed},\\ 1.15,&\operatorname{sev}(s_{t})>\operatorname{sev}(s_{t-1}),\\ 0.16,&\operatorname{sev}(s_{t})=\operatorname{sev}(s_{t-1})\text{ and }s_{t}\neq\text{passed},\\ 0.10,&\operatorname{sev}(s_{t})<\operatorname{sev}(s_{t-1}),\\ 1.00,&\text{otherwise}.\end{cases} \tag{36}$$

The nonredundant loop gate is

$$\Lambda^{\mathrm{nr}}_{t}=\begin{cases}0.45,&\text{repair event},\\ 0.92,&\text{generation event with passing status},\\ 0.85,&\text{generation event without passing status},\\ 1.00,&\text{otherwise}.\end{cases} \tag{37}$$

These gates correspond to the real-trace factor proxies $\widehat{V}^{\mathrm{real}}_{t}=Q_{t}$ and $\widehat{R}^{\mathrm{real}}_{t}=G_{t}\Lambda_{t}$. The base estimator supplies the trace-level product of informativeness, validity, relevance, and memory retention from Eq. 8. The status-aware scores used in the real benchmark analyses are those in Eqs. 10 and 11.
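The gates in Eqs. 33 to 37 are lookup tables and can be sketched as below. Several details are assumptions: the status strings and event-kind strings are hypothetical labels, `a_t` stands in for $A_{t}$ (defined outside this appendix; only the test against 0 is used), and the severity order is taken to follow the $Q_{t}$ values, which is consistent with "API errors lowest and passing checks highest" but is not stated for the intermediate statuses.

```python
STATUS_QUALITY = {          # Eq. 33
    "passed": 1.00,
    "assertion_error": 0.42,
    "runtime_error": 0.12,
    "timeout": 0.06,
    "static_reject": 0.04,  # also covers a missing entry point
    "api_error": 0.00,
}
OTHER_QUALITY = 0.25        # any status not listed above


def status_quality(status):
    """Q_t: quality of the execution status attached to an event."""
    return STATUS_QUALITY.get(status, OTHER_QUALITY)


def severity(status):
    """ASSUMPTION: severity is ordered by the Q_t value of the status."""
    return status_quality(status)


def progress_gate(a_t, status, prev_status, nonredundant=False):
    """G_t (Eq. 34) or, with nonredundant=True, G_t^nr (Eq. 36)."""
    if a_t == 0:
        return 1.00
    if status == "passed" and prev_status != "passed":
        return 1.35
    cur, prev = severity(status), severity(prev_status)
    if cur > prev:
        return 1.15
    if cur == prev and status != "passed":
        return 0.16 if nonredundant else 0.62
    if cur < prev:
        return 0.10 if nonredundant else 0.45
    return 1.00


def loop_gate(event_kind, status, nonredundant=False):
    """Lambda_t (Eq. 35) or, with nonredundant=True, Lambda_t^nr (Eq. 37)."""
    if event_kind == "repair":
        return 0.45 if nonredundant else 0.95
    if event_kind == "generation":
        return 0.92 if status == "passed" else 0.85
    return 1.00


def relevance_proxy(a_t, event_kind, status, prev_status, nonredundant=False):
    """Real-trace relevance proxy: product of the progress and loop gates."""
    return (progress_gate(a_t, status, prev_status, nonredundant)
            * loop_gate(event_kind, status, nonredundant))
```

A repair attempt that reproduces the same runtime error illustrates the difference between the two variants: the standard proxy is $0.62 \times 0.95 = 0.589$, whereas the nonredundant proxy is $0.16 \times 0.45 = 0.072$.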

Appendix D Harness Details

This appendix provides additional details on the harness families used in our experiments. The seven harnesses correspond exactly to H0–H6 in Section 2.3. They are designed to vary how raw computation is converted into effective feedback, while keeping the task distribution, base model, final evaluator, and logging protocol fixed within each experimental setting.

D.1 Common Harness Interface

Each run is specified by a task instance, a base model, a harness family, and a replicate index. The harness produces a final answer and a trajectory of intermediate events. These events record the interaction structure of the run, including whether the model receives external observations, verifier signals, repair feedback, routed information, or retained state from previous steps. The same trace schema is used across controlled, semi-realistic, and real benchmark settings, enabling all harnesses to be compared through the same EFC-based metrics.

The harnesses differ along six main dimensions: raw budget, tool budget, verifier strength, routing quality, memory fidelity, and noise/state pressure. Raw budget measures the amount of computation made available to the harness, whereas the other dimensions determine how efficiently this computation is converted into useful, valid, remembered, and non-redundant feedback. This separation is important because a harness can spend substantially more raw budget without necessarily producing proportionally more EFC.

D.2 H0: Direct Answer

H0 is the direct-answer baseline. The model receives the task instruction and produces a final answer in a single pass. It does not use routed observations, explicit verification, persistent memory, or feedback-conditioned repair before submission. H0 therefore measures the performance obtainable from the base model with minimal harnessing.

In the EFC interpretation, H0 has low expected feedback mass: most computation is spent on direct generation rather than on acquiring, validating, or reusing external feedback. It serves as the lower-complexity reference point for measuring the effect of increasingly structured harness mechanisms.

D.3 H1: Checklist Verify

H1 augments direct generation with a lightweight verification or checklist stage. The model is encouraged to check constraints, edge cases, or internal consistency before finalizing its answer. This mechanism increases the amount of self- or verifier-guided scrutiny relative to H0, but it does not create a full closed loop: the run remains primarily single-pass and does not rely on multi-round repair.

H1 isolates the effect of adding a weak verification signal without introducing strong routing, persistent memory, or deep interaction. It is therefore useful for distinguishing simple checking from the stronger feedback accumulation used by H5 and H6.

D.4 H2: Routed Tools

H2 introduces routed tool or observation access. Rather than treating all available information uniformly, the harness allocates part of its budget to selecting which observations are likely to be useful for the current task state. This improves the relevance of feedback events relative to H0/H1, especially when the task benefits from external evidence or intermediate computation.

However, H2 does not assume strong memory or deep verifier-driven repair. Its main mechanism is improved routing: the harness can obtain more informative observations, but it has only moderate ability to validate, retain, and iteratively refine them. H2 therefore tests whether better access to observations alone is sufficient to explain the observed scaling behavior.

D.5 H3: Stateful Memory

H3 emphasizes state retention. Compared with H2, it gives the harness a stronger ability to preserve useful intermediate information, remember failed attempts, and avoid repeating previously identified errors. The intended mechanism is not simply to spend more budget, but to make prior feedback more reusable across later steps.

In trace terms, H3 increases the probability that valid feedback remains available and affects subsequent decisions. This makes it a natural test case for the memory component of EFC: feedback that is observed but forgotten has limited effect, whereas feedback that is retained can continue to shape future actions.

D.6 H4: High Budget Noisy

H4 is a high-budget but inefficient harness. It allocates substantially more raw computation than the simpler harnesses, but combines this budget with weaker routing, weaker verification, weaker memory, higher observation noise, and greater state pressure. As a result, H4 may generate many intermediate events without converting them into proportionally useful feedback.

This harness serves as a negative control for raw-budget explanations. If performance were primarily determined by tokens, tool calls, wall-clock time, or other raw compute proxies, H4 should be highly competitive. Instead, its role is to test whether additional computation only helps when the harness can transform that computation into valid and non-redundant feedback.

D.7 H5: Closed Loop

H5 is the standard closed-loop harness. It combines stronger routing, verification, and memory with feedback-conditioned refinement. The harness can use intermediate observations or verifier signals to revise its subsequent behavior, rather than treating each attempt as independent. This creates a structured feedback loop in which the model proposes, receives task-relevant signals, and updates its next action accordingly.

Compared with H2 and H3, H5 is not defined by a single mechanism such as routing or memory alone. Its advantage comes from the interaction among mechanisms: routing increases the relevance of observations, verification improves their validity, and memory allows useful information to persist across the loop. This makes H5 the first harness family expected to achieve consistently high EFC efficiency.

D.8 H6: Deep Closed Loop

H6 extends H5 by increasing interaction depth and strengthening the feedback conversion mechanisms. It uses a deeper closed-loop process with stronger verification, better routing, higher memory fidelity, and lower effective observation noise. The additional budget in H6 is therefore not merely more raw computation; it is budget deployed through a harness that is better able to turn intermediate signals into effective feedback.

H6 represents the strongest harness family in our experimental design. It tests whether scaling continues when a harness already has high feedback efficiency, and whether deeper interaction provides additional gains beyond the standard closed-loop structure of H5. In the EFC framework, H6 is expected to produce not only more feedback events, but also feedback events with higher validity, relevance, retention, and non-redundancy.

D.9 Comparison Across Harness Families

The harness sequence H0–H6 is not intended to be a simple monotonic increase in raw compute. Instead, it separates raw budget from feedback efficiency. H0 and H1 provide low-interaction baselines; H2 and H3 isolate routing and memory; H4 provides a high-budget but noisy counterexample; and H5/H6 instantiate increasingly strong closed-loop agents. This design allows us to evaluate whether performance is better explained by raw computation or by EFC.

Across experiments, the same harness names refer to the same mechanism-level roles. Controlled experiments instantiate these roles through explicit simulation parameters, while semi-realistic and real benchmark experiments instantiate them through executable harness behavior and trace-observable feedback events. This shared abstraction makes it possible to compare harnesses across settings without relying on benchmark-specific implementation details.

D.10 Stopping Rules

A run terminates when the harness submits a final answer, when the configured interaction budget is exhausted, when a verifier or evaluator accepts the current candidate, or when the environment returns an unrecoverable execution failure. Budget exhaustion is counted as failure unless the final submitted answer passes the evaluator. These stopping rules are applied consistently within each experimental setting so that differences across H0–H6 reflect harness structure rather than evaluation protocol.
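The stopping rules can be sketched as a per-step check plus a final success rule. The names are hypothetical, the budget is assumed to be a step count, and two points are assumptions not fixed by the text: the order in which simultaneous conditions are checked, and that a run ending in an unrecoverable failure is scored by the same rule as any other run.

```python
from enum import Enum


class StopReason(Enum):
    UNRECOVERABLE_FAILURE = "unrecoverable_failure"
    VERIFIER_ACCEPTED = "verifier_accepted"
    FINAL_ANSWER = "final_answer"
    BUDGET_EXHAUSTED = "budget_exhausted"


def check_stop(submitted_final, verifier_accepted, unrecoverable_failure,
               steps_used, step_budget):
    """Return the reason to stop after the current step, or None to continue."""
    # ASSUMPTION: this priority order is illustrative only.
    if unrecoverable_failure:
        return StopReason.UNRECOVERABLE_FAILURE
    if verifier_accepted:
        return StopReason.VERIFIER_ACCEPTED
    if submitted_final:
        return StopReason.FINAL_ANSWER
    if steps_used >= step_budget:
        return StopReason.BUDGET_EXHAUSTED
    return None


def run_succeeded(final_answer, passes_evaluator):
    """A run succeeds only if a submitted answer passes the final evaluator.

    This covers budget exhaustion: with no submitted answer the run fails,
    and with a submitted answer the evaluator decides.
    """
    if final_answer is None:
        return False
    return bool(passes_evaluator(final_answer))
```

Separating the stop reason from the success label keeps the final evaluator as the only source of the outcome, so the same scoring applies to every harness regardless of how its run ended.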
