Harnesses for Inference-Time Alignment over Execution Trajectories

arXiv cs.LG Papers

Summary

This paper studies harness design for LLM agents, separating it into task decomposition and guided execution, and shows that more elaborate harnesses are not uniformly better; it reveals failure modes and proposes partial harnesses as effective.

arXiv:2605.21516v1 Announce Type: new

Cached at: 05/22/26, 08:47 AM

# Harnesses for Inference-Time Alignment over Execution Trajectories
Source: [https://arxiv.org/html/2605.21516](https://arxiv.org/html/2605.21516)
Boyuan Wang⋄, Bochao Li⋄, Minghan Wang⋄, Yuxin Tao†, Fang Kong†

###### Abstract

Harness engineering has emerged as an important inference-time technique for large language model (LLM) agents, aiming to improve long-term performance through task decomposition and guided execution. However, more elaborate harnesses are not uniformly better: increasing decomposition or guidance can sometimes improve execution, but can also reduce final task success. We study harness design through the lens of inference-time trajectory alignment. This perspective separates a harness into two mechanisms: task decomposition, which structures a task into sub-goals, and guided execution, which reshapes local action distributions during execution. This decomposition allows us to quantify how workflow granularity, retry budgets, and guidance-induced action reweighting shape the performance limits of harness design. It further reveals concrete failure modes, including over-decomposition, over-pruning, and hallucinated execution. We validate these predictions through controlled synthetic experiments and real terminal agent benchmarks. Inspired by the theory, we further show that effective harnesses can be partial: specifying only the initial steps and leaving the remaining execution to the agent can achieve a higher pass rate than fully structured workflows.

Southern University of Science and Technology

⋄ Equal contribution. † Correspondence to taoyx@sustech.edu.cn and kongf@sustech.edu.cn.

## 1 Introduction

Large language model agents have demonstrated impressive capabilities in solving long-horizon interactive tasks involving complex software engineering (claudecode2026; codexcli2026), scientific discovery (qu2026crispr; jin2025stella), and autonomous tool use (zeng2026glm; team2025kimi). A key technique behind this progress is *harnessing*: a scaffolding strategy that injects human priors into the agent’s execution process (openai2026harnessengineering; Langchain2026; Anthropic_harness). By decomposing long-horizon tasks into structured sub-goals and providing guidance for intermediate decisions, a harness enables agents to complete complex tasks more reliably in an autonomous manner (autoresearch; agentsmd; Harbor_Framework; BrowserHarness).

This success suggests a tempting intuition: more elaborate harnesses should lead to better agents. From this view, harness design becomes a problem of adding structure: finer-grained decomposition, more detailed instructions, and tighter execution constraints (erdogan2025planandact; Wang2025DyFlowDW; dang2025multi). Yet this intuition conflicts with a central lesson from the history of AI (bitterlesson): human-designed structure often helps in the short run, but can limit a system’s ability to search, adapt, and scale (yan2025reformreducinghuman; guo2025deepseek; Silver2017MasteringTG). This raises a basic question: what should a harness specify, and what should it leave for the agent to resolve on its own?

To answer this question, we theoretically model how harnesses shape the inference-time trajectories of LLM agents. We separate a harness into two components: *workflow*, which specifies what the agent should achieve at each stage, and *guidance*, which biases how the agent acts within that stage. This separation yields two findings through stage-level gaps. First, finer-grained decomposition is not always better. The optimal granularity must align the required sub-goal scale with the agent’s controllable progress scale under the given tolerance and retry budget. Second, guidance helps only when aligned. It improves performance by shifting probability mass toward recoverable actions. Moreover, when guidance favors actions that follow the instruction rather than the task evidence, it can instead lead to hallucinated responses. These findings recast harness design as an alignment problem, where imposed structure must match the agent’s capability and the available task evidence.

Beyond these observations, we find that although a harness is usually designed for the whole task, keeping only its early stages can be more effective. This points to a counterintuitive principle for harness design: effective harnesses need not specify the full execution path. We formalize this idea as Partial Harnessing, a new design strategy that specifies only the initial stages and leaves the remaining execution to the agent. Our experiments show that partial harnessing can outperform fully specified workflows, suggesting that harness design should decide not only what structure to add, but also when to stop adding it.

We summarize our contributions as follows:

- We formulate harness design as an inference-time alignment problem by decomposing a harness into workflow and guidance components, yielding two stage-level principles: sub-goal scale should match agent capability, and guidance should match task evidence.
- We introduce partial harnessing as a design strategy that stops scaffolding once its reliability cost outweighs tail-risk reduction, formalized through a marginal stopping rule.
- We empirically validate these predictions on synthetic cumulative-progress tasks and Terminal-Bench v2, where alignment patterns appear and partial harnesses outperform full workflows.

![Refer to caption](https://arxiv.org/html/2605.21516v1/x1.png)

Figure 1: Alignment Principles for Harnessed Agent Execution

## 2 Related Work

Long-horizon agent execution. LLM agents have grown from single-step responders into systems capable of executing long-horizon tasks in interactive environments. The starting point is the reasoning-action loop of ReAct (yao2022react), which later work extends with feedback and memory (shinn2023reflexion; packer2023memgpt) and with explicit planning and search (yao2023tree; zhou2023language). As these capabilities matured, evaluation moved into realistic long-horizon environments spanning web interaction (zhou2023webarena), software engineering (jimenez2023swe), and computer use (xie2024osworld), prompting recent systems to treat the surrounding scaffold itself as a design object and optimize the workflow that orchestrates execution (hong2023metagpt; zhang2024aflow). A common thread across this progression is that every step adds more human-designed structure around the model. Our work asks what such structure should actually specify, and what should be left to the agent.

Harness optimization. A related line of work treats the scaffolds around LLM execution as optimization targets. Early methods optimize prompts directly, by generating, scoring, or refining natural-language instructions (zhou2022large; pryzant2023automatic; yang2024opro; fernando2023promptbreeder). This view is extended from single prompts to multi-stage LM programs, where systems such as DSPy, MIPRO, Self-Refine, and TextGrad optimize instructions, demonstrations, or intermediate artifacts across multiple model calls (khattab2023dspy; opsahl2024optimizing; madaan2023self; yuksekgonul2024textgrad). More recent work broadens the search space to agent modules, executable workflows, and full harness implementations (shang2024agentsquare; hu2024automated; agrawal2025gepa; novikov2025alphaevolve; lee2026meta). These works primarily ask how to discover better scaffolds. In contrast, we ask when additional scaffolding should help at all: which decompositions and guidance rules align with the agent’s execution, and which ones over-specify the trajectory.

Human priors and agent autonomy. A broader line of work studies how control should be shared between humans and autonomous systems. Mixed-initiative interaction and adjustable autonomy show that authority need not belong entirely to either side, but can be allocated according to uncertainty, context, and task demands (Horvitz1999PrinciplesOM; scerri2002towards). Recent LLM agent systems revisit this issue as humans provide goals, constraints, feedback, or oversight while agents execute increasingly long-horizon tasks (feng2024large; zou2025call; wang2025interaction). Harness design is a concrete instance of this trade-off: it specifies part of the trajectory through decomposition and guidance, while leaving the rest for the agent to resolve (pan2026natural; bui2026building). Our work studies when such human priors improve execution and when they over-constrain it.

## 3 Preliminary

Intuition. We think of solving a task as proceeding on two interacting timescales, one driven by the harness and one driven by the agent. The *outer* timescale belongs to the harness, which lays out a sequence of sub-goals $g_1 \to g_2 \to \cdots \to g_T$ toward the final answer and advances along this plan one sub-goal at a time. The *inner* timescale belongs to the agent: given the current sub-goal $g_t$, it repeatedly takes an action and observes the resulting state until $g_t$ is met, at which point control returns to the harness and $g_{t+1}$ is revealed. Crucially, the harness is not silent during this inner loop: even as the agent chooses each step, the harness simultaneously *shapes* the loop, nudging the trajectory toward behaviors it considers promising for $g_t$. The two timescales thus carry a clean division of labor: the harness decides *what to work on next* and *which trajectories to favor while working on it*, while the agent decides *how to take each step*.

How a Harness Shapes Execution. We now make the two timescales precise. Consider tasks $x \sim \mathcal{D}$, each with a unique correct final answer $y^{\star}(x)$, and define a *harness* as an inference-time scaffold parameterized by $h = (\kappa, \lambda, \psi)$, where $\kappa$ controls the decomposition granularity, $\lambda$ controls the guidance strength, and $\psi$ specifies the local guidance rule. These three parameters split cleanly along the two timescales: $\kappa$ governs the outer one, and $\lambda, \psi$ govern the inner one.

For the outer timescale, $\kappa$ applied to a task $x$ induces the ordered sub-goal sequence $\Delta_h(x) = (g_1, \ldots, g_{T_h(x)})$, the coarse plan from the intuition above. This sequence fixes the stage-level structure but leaves each inner trajectory to the agent, so a complete harness-conditioned execution takes the form

$$\tau_h(x) = (g_1, \tau_1, \ldots, g_T, \tau_T), \qquad T = T_h(x),$$

where each $\tau_t = (s_{t,0}, a_{t,0}, \ldots, a_{t,n_t-1}, s_{t,n_t})$ is the inner trajectory generated while pursuing $g_t$, with states $s_{t,j} \in \mathcal{S}$ and actions $a_{t,j} \in \mathcal{A}$. How each $\tau_t$ is actually produced is the job of the inner timescale.

For the inner timescale, let $K_{t-1}$ collect everything observed before stage $t$, i.e. the task together with all preceding sub-goals and trajectories. Without guidance, the agent unrolls $\tau_t$ from $K_{t-1}$ and $g_t$ under its *base trajectory distribution* $\mathbb{Q}_{t,0}(\tau_t \mid K_{t-1}, g_t)$, defined autoregressively by $a_{t,j} \sim q_h(\cdot \mid H_{t,j})$ with $H_{t,j} = (K_{t-1}, g_t, s_{t,0}, a_{t,0}, \ldots, a_{t,j-1}, s_{t,j})$. The base distribution captures the agent acting on its own; the harness’s role is to reshape it.

This reshaping is realized through $\psi$ and the stage-$t$ guidance strength $\lambda_t$, which together determine a non-negative weight $W_{t,\lambda_t}(K_{t-1}, g_t, \tau_t)$ measuring how well a candidate trajectory aligns with the behavior $\psi$ prescribes for $g_t$. Reweighting the base distribution by this score yields the *guided trajectory distribution*,

$$\mathbb{Q}_{t,\lambda_t}(\tau_t \mid K_{t-1}, g_t) \;\propto\; \mathbb{Q}_{t,0}(\tau_t \mid K_{t-1}, g_t)\, W_{t,\lambda_t}(K_{t-1}, g_t, \tau_t),$$

which governs the agent’s actual behavior at stage $t$. The strength $\lambda_t$ controls the magnitude of this reweighting: at $\lambda_t = 0$ the weight is uniform and the guided distribution collapses to the base distribution, and as $\lambda_t$ grows, $\mathbb{Q}_{t,\lambda_t}$ concentrates on trajectories preferred under $\psi$. Once $\tau_t$ terminates, control returns to the harness, $g_{t+1}$ is revealed, and this two-level process repeats until stage $T$ is complete.
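To make the reweighting concrete, the following minimal sketch applies it to a finite set of candidate stage trajectories. It assumes an exponential-tilting weight $W = \exp(\lambda \cdot \text{score})$, which is only one convenient form that gives a uniform weight at $\lambda = 0$; the text above does not commit to this form. The function name `guided_distribution`, the scores, and the probabilities are all illustrative assumptions.

```python
import math

def guided_distribution(base_probs, psi_scores, lam):
    """Reweight a finite base distribution and renormalize.

    Assumption (not from the paper): W = exp(lam * score), where `score`
    says how well a trajectory matches the guidance rule psi.
    """
    weights = [math.exp(lam * s) for s in psi_scores]
    unnormalized = [p * w for p, w in zip(base_probs, weights)]
    z = sum(unnormalized)  # normalizing constant of the guided law
    return [u / z for u in unnormalized]

# Three hypothetical candidate trajectories for one stage.
base = [0.5, 0.3, 0.2]      # base trajectory probabilities
scores = [0.0, 1.0, 2.0]    # hypothetical psi-alignment scores

unguided = guided_distribution(base, scores, lam=0.0)  # equals the base law
strong = guided_distribution(base, scores, lam=5.0)    # mass moves to the top-scored trajectory
```

At `lam=0.0` the guided law coincides with the base law, and at `lam=5.0` nearly all mass sits on the highest-scoring trajectory, mirroring the two limits described above.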

From Final Success to Stagewise Recoverability. We now connect final task success to stage-level behavior. Let $y(\tau_h)$ denote the final answer produced by execution $\tau_h$, define the final success event as $\mathrm{Succ}_x(\tau_h) := \{y(\tau_h) = y^{\star}(x)\}$, and write the primitive harness-design objective as

$$\max_{h}\; \mathbb{E}_{x \sim \mathcal{D}}\left[\mathbb{P}_h(\mathrm{Succ}_x(\tau_h) \mid x)\right].$$

To connect this terminal objective with process-level behavior, we introduce the completed prefix after stage $t$, $K_t := (x, g_1, \tau_1, \ldots, g_t, \tau_t)$ with $K_0 := x$, and let $B_t$ denote the event that $K_t$ is *recoverable*, i.e. that some continuation under the remaining plan $(g_{t+1}, \ldots, g_T)$ still reaches $y^{\star}(x)$. Intuitively, final success is equivalent to recoverability holding throughout, which under goal consistency we write as $\mathrm{Succ}_x(\tau_h) \equiv \bigcap_{t=1}^{T_h(x)} B_t$.

This equivalence turns the terminal objective into a stagewise product\. By the chain rule,

$$\mathbb{P}_h(\mathrm{Succ}_x(\tau_h) \mid x) = \prod_{t=1}^{T_h(x)} \bar{p}_t(h;x), \qquad \bar{p}_t(h;x) := \mathbb{P}_h(B_t \mid B_{<t}, x),$$

where $B_{<t} := \bigcap_{s<t} B_s$ and $B_{<1} := \Omega$, so $\bar{p}_t(h;x)$ is the conditional probability of remaining recoverable at stage $t$ given that the run was recoverable through every earlier stage. Taking negative logarithms turns the product into a sum, $-\log \mathbb{P}_h(\mathrm{Succ}_x(\tau_h) \mid x) = \sum_{t=1}^{T_h(x)} -\log \bar{p}_t(h;x)$, which exhibits the process loss as the stagewise decomposition of the primitive final-success objective rather than an auxiliary term added on top of it.
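The identity between the stagewise product and the summed process loss can be checked numerically. The sketch below uses three made-up conditional recoverability probabilities $\bar{p}_t$; the values and function names are assumptions chosen only for illustration.

```python
import math

def success_probability(stage_probs):
    """Final success as the product of conditional stage recoverabilities."""
    return math.prod(stage_probs)

def process_loss(stage_probs):
    """Negative log-success written as a sum of per-stage losses."""
    return sum(-math.log(p) for p in stage_probs)

# Hypothetical values of p_bar_t for a three-stage harness.
stage_probs = [0.95, 0.9, 0.8]

p_success = success_probability(stage_probs)
loss = process_loss(stage_probs)

# exp(-loss) recovers the product, so the two views are interchangeable.
recovered = math.exp(-loss)
```

Because the loss is additive over stages, each extra stage with $\bar{p}_t < 1$ contributes a strictly positive term, which is the mechanism behind the cost of over-decomposition discussed later.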

## 4 Alignment Principles for Harness Design

Building on the recoverability framework of Section [3](https://arxiv.org/html/2605.21516#S3), we now present three alignment principles, each addressing a distinct lever the harness has over execution: granularity–capability alignment in Section [4.1](https://arxiv.org/html/2605.21516#S4.SS1), guidance–evidence alignment in Section [4.2](https://arxiv.org/html/2605.21516#S4.SS2), and partial harnessing in Section [4.3](https://arxiv.org/html/2605.21516#S4.SS3).

### 4.1 Granularity–Capability Alignment

A harness’s outer timescale specifies sub-goals, but it does not specify them in the abstract: each sub-goal asks the agent to make a definite amount of progress within a finite execution budget. Whether a workflow helps therefore depends on a single relationship: that between the progress each stage requests and the progress the agent can reliably realize at that stage. We characterize this relationship precisely and show that it controls the final success probability.

Consider stage $t$ in isolation. The harness requests latent progress $\ell_t$ and grants the agent at most $M_t$ low-level steps in which to deliver it. After $m \leq M_t$ such steps, the cumulative progress the agent can reliably control lies in a window $I_{t,m} = [\mu^{-}_{t,m}, \mu^{+}_{t,m}]$ with stochastic variation $\sigma_{t,m}$, and a tolerance $\epsilon_t$ describes how far the realized progress may drift from $\ell_t$ before the resulting prefix ceases to be recoverable. The relevant quantity is the smallest standardized gap between the requested progress and any scale the agent can reach within budget,

$$\rho_t^{(M_t)} := \min_{1 \leq m \leq M_t} \frac{\big(d(\ell_t, I_{t,m}) - \epsilon_t\big)_{+}^{2}}{2\sigma_{t,m}^{2}},$$

which vanishes when some reachable cumulative scale lies within tolerance of $\ell_t$ and grows quadratically in the gap when no such scale exists. The mismatch $\rho_t^{(M_t)}$ is therefore a stage-local property: it depends only on what the harness asks of stage $t$ and what the agent can do within $M_t$ steps.

Aggregating these stage\-local mismatches across the workflow yields the main result\.

###### Theorem 1 (Granularity-capability mismatch bound, informal).

Consider a harness that decomposes a task $x$ into $T = T_h(x)$ sub-goals. At stage $t$, suppose the harness requires latent progress $\ell_t$, and the agent may take at most $M_t$ low-level execution steps. Under the recoverability-tube, concentration, and boundary-contraction conditions in Appendix [C.2](https://arxiv.org/html/2605.21516#A3.SS2), the final success probability satisfies

$$\mathbb{P}_h(\mathrm{Succ}_x(\tau_h) \mid x) \leq \exp\left(-\sum_{t=1}^{T_h(x)} \left[\eta_t + \bigl(\rho_t^{(M_t)} - \log M_t\bigr)_{+}\right]\right).$$

The bound exposes two distinct contributions to per-stage loss. The execution cost $\eta_t$ reflects the intrinsic difficulty of any one stage and is paid regardless of how the workflow is structured. By contrast, the granularity penalty $\bigl(\rho_t^{(M_t)} - \log M_t\bigr)_{+}$ appears only when the requested scale exceeds what the retry budget can absorb, and it precisely captures the cost of misaligned decomposition. A larger $M_t$ enlarges the reachable set and softens this penalty through the $-\log M_t$ term, but cannot eliminate a structural gap between $\ell_t$ and the agent’s execution dynamics.
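The mismatch and the resulting bound are straightforward to evaluate once the per-step windows are known. The sketch below is a toy instance under stated assumptions: $d(\ell_t, I_{t,m})$ is read as the Euclidean distance from a point to an interval, and the windows, variations, tolerance, and execution cost are invented numbers rather than values from the paper.

```python
import math

def interval_distance(x, lo, hi):
    """Distance from point x to the interval [lo, hi]; zero inside it (assumed reading of d)."""
    return max(lo - x, 0.0, x - hi)

def mismatch(ell, windows, sigmas, eps):
    """rho_t^{(M_t)}: smallest standardized gap over the budgeted step counts m = 1..M_t."""
    return min(
        max(interval_distance(ell, lo, hi) - eps, 0.0) ** 2 / (2.0 * sig ** 2)
        for (lo, hi), sig in zip(windows, sigmas)
    )

def success_upper_bound(etas, rhos, budgets):
    """Right-hand side of the mismatch bound: exp(-sum_t [eta_t + (rho_t - log M_t)_+])."""
    total = sum(
        eta + max(rho - math.log(m_t), 0.0)
        for eta, rho, m_t in zip(etas, rhos, budgets)
    )
    return math.exp(-total)

# Hypothetical stage: each step advances progress by an amount in [1, 2], budget M_t = 3.
windows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # I_{t,m} for m = 1, 2, 3
sigmas = [0.5, 0.5, 0.5]
eps = 0.5

rho_aligned = mismatch(5.0, windows, sigmas, eps)      # request lies inside I_{t,3}
rho_misaligned = mismatch(10.0, windows, sigmas, eps)  # request is beyond every window

bound_aligned = success_upper_bound([0.1], [rho_aligned], [3])
bound_misaligned = success_upper_bound([0.1], [rho_misaligned], [3])
```

In this toy setting the aligned request pays only the execution cost, while the misaligned request drives the bound close to zero even after the $-\log M_t$ relief from retries.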

This decomposition explains why finer workflows are not uniformly better. Specialize to a uniform $T$-stage workflow on a task with total latent progress $L_x$, so each stage requests $L_x/T$. If a single low-level step reliably advances progress by an amount in $[\mu^{-}, \mu^{+}]$ and each stage allows at most $M$ steps, the cumulative scales reachable within a stage form the union

$$\bigcup_{m=1}^{M} [m\mu^{-}, m\mu^{+}],$$

and useful decompositions place $L_x/T$ near this set. Small $T$ pushes $L_x/T$ above the union and leaves stages unreachable within $M$ steps; large $T$ pushes $L_x/T$ below it and imposes milestones the agent cannot stop at without coordination loss. The reliable regime lies between these extremes, where each stage requires nontrivial progress while remaining reachable under the agent’s realizable progress scale. Workflow granularity is thus a design parameter to be chosen against the agent’s execution dynamics, not against the task’s logical structure alone.
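A small sweep over $T$ illustrates the two failure directions. The sketch below checks, for each stage count, whether the per-stage request $L_x/T$ falls inside the reachable union; the numbers ($L_x = 100$, per-step window $[4, 6]$, $M = 3$) and the function names are assumptions for illustration, and tolerance and noise are ignored.

```python
def distance_to_reachable(target, mu_lo, mu_hi, max_steps):
    """Distance from target to the union of [m*mu_lo, m*mu_hi] for m = 1..max_steps."""
    return min(
        max(m * mu_lo - target, 0.0, target - m * mu_hi)
        for m in range(1, max_steps + 1)
    )

def aligned_stage_counts(total_progress, mu_lo, mu_hi, max_steps, t_max):
    """Stage counts T whose uniform request total_progress / T is exactly reachable."""
    return [
        t for t in range(1, t_max + 1)
        if distance_to_reachable(total_progress / t, mu_lo, mu_hi, max_steps) == 0.0
    ]

# Hypothetical task: L_x = 100, one step advances by an amount in [4, 6], M = 3 steps per stage.
aligned = aligned_stage_counts(100.0, 4.0, 6.0, 3, 30)

# Coarse and fine extremes both miss the reachable set.
too_coarse = distance_to_reachable(100.0 / 2, 4.0, 6.0, 3)   # request of 50 overshoots
too_fine = distance_to_reachable(100.0 / 30, 4.0, 6.0, 3)    # request below one step's minimum
```

With these toy numbers, $T \le 5$ overshoots the reachable set and $T \ge 26$ undershoots it; the additional gap at $T = 13$ to $16$ arises only because the one-step and two-step windows do not overlap here.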

### 4.2 Guidance–Evidence Alignment

The inner timescale of a harness is shaped by guidance: at stage $t$, the weight $W_{t,\lambda_t}$ reweights the agent’s base trajectory distribution toward behaviors $\psi$ prescribes for $g_t$. This reweighting is helpful when it concentrates probability on trajectories that keep the prefix recoverable, and harmful when it concentrates probability on trajectories that look locally preferred but foreclose successful continuations. We make this dichotomy precise through a single stage-local quantity.

Fix a recoverable prefix $K_{t-1}$, and let $R_t^{\mathrm{stg}}(K_{t-1})$ denote the set of stage trajectories that keep $K_t$ recoverable. Under the base distribution, the average retention weight is $\bar{W}^{\mathrm{rec}}_{t,\lambda_t}(K_{t-1})$ on $R_t^{\mathrm{stg}}$ and $\bar{W}^{\mathrm{bad}}_{t,\lambda_t}(K_{t-1})$ on its complement. The retention gap

$$\Gamma_{t,\lambda_t}(K_{t-1}) := \log \bar{W}^{\mathrm{rec}}_{t,\lambda_t}(K_{t-1}) - \log \bar{W}^{\mathrm{bad}}_{t,\lambda_t}(K_{t-1})$$

measures how much more weight guidance places on recoverable trajectories than on non-recoverable ones, in log space. A positive gap means guidance preferentially preserves trajectories that keep a successful continuation available, whereas a negative gap means it preserves trajectories that may look locally acceptable but render the prefix non-recoverable.

This single sign\-valued quantity determines whether guidance helps at the prefix\.

###### Theorem 2 (Guidance alignment via retention gaps, informal).

Fix a task $x$, a stage $t$, and a recoverable prefix $K_{t-1}$. Under the regularity conditions in Appendix [C.3](https://arxiv.org/html/2605.21516#A3.SS3), the stage recoverability probability after guidance satisfies

$$\mathbb{Q}_{t,\lambda_t}\!\left(R_t^{\mathrm{stg}}(K_{t-1}) \mid K_{t-1}, g_t\right) = \sigma\!\left(\omega^{0}_{t}(K_{t-1}) + \Gamma_{t,\lambda_t}(K_{t-1})\right),$$

where $\omega^{0}_{t}(K_{t-1})$ is the unguided recoverability log-odds and $\sigma(u) = 1/(1+e^{-u})$. Consequently, guidance improves stage recoverability relative to the unguided law at the same prefix if and only if $\Gamma_{t,\lambda_t}(K_{t-1}) > 0$, and harms it if and only if $\Gamma_{t,\lambda_t}(K_{t-1}) < 0$.

The theorem reduces a multi-dimensional design choice (which trajectories to favor, and how strongly) to a one-dimensional diagnostic at each prefix. The unguided log-odds $\omega^{0}_{t}(K_{t-1})$ is a property of the agent and the prefix, fixed once both are specified. Guidance enters the recoverability probability only through $\Gamma_{t,\lambda_t}$, additively in log-odds space. Stronger guidance, formalized as a larger $\lambda_t$, scales the magnitude of $\Gamma_{t,\lambda_t}$ but not its sign, so increasing $\lambda_t$ amplifies whichever effect $\psi$ already produces at $K_{t-1}$: helpful guidance becomes more helpful, and misaligned guidance becomes more harmful.
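The log-odds identity can be verified on a finite toy stage. The sketch below computes the guided recoverability probability directly by reweighting and again through $\sigma(\omega^{0}_{t} + \Gamma_{t,\lambda_t})$, taking the average retention weights as base-conditional means over each set. The four trajectories, their recoverability labels, and both weight vectors are hypothetical.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def stage_recoverability(base_probs, recoverable, weights):
    """Return (direct guided probability, sigmoid-formula value, retention gap).

    Assumes the base mass on the recoverable set is strictly between 0 and 1.
    """
    mass_rec = sum(p * w for p, r, w in zip(base_probs, recoverable, weights) if r)
    mass_bad = sum(p * w for p, r, w in zip(base_probs, recoverable, weights) if not r)
    direct = mass_rec / (mass_rec + mass_bad)   # reweight, then normalize

    p0 = sum(p for p, r in zip(base_probs, recoverable) if r)  # unguided recoverability
    w_rec = mass_rec / p0            # average weight on recoverable trajectories
    w_bad = mass_bad / (1.0 - p0)    # average weight on the complement
    omega0 = math.log(p0 / (1.0 - p0))          # unguided log-odds
    gamma = math.log(w_rec) - math.log(w_bad)   # retention gap
    return direct, sigmoid(omega0 + gamma), gamma

base = [0.4, 0.1, 0.3, 0.2]
recoverable = [True, True, False, False]

# Hypothetical weights: one rule favors recoverable trajectories, the other does not.
aligned = stage_recoverability(base, recoverable, [2.0, 3.0, 1.0, 0.5])
misaligned = stage_recoverability(base, recoverable, [0.5, 1.0, 2.0, 3.0])
```

Here the unguided recoverability is 0.5; the aligned weights give a positive gap and raise it, while the misaligned weights give a negative gap and lower it, with the direct and formula values agreeing in both cases.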

This sign-amplification structure explains why the same guidance can either reduce or amplify hallucination. In evidence-limited settings, a recoverable trajectory is one that stays within what the current observations justify. A guidance rule $\psi$ that rewards evidence checking or uncertainty awareness places its weight on grounded trajectories, producing a positive retention gap; raising $\lambda_t$ then suppresses ungrounded continuations and reduces hallucination. A guidance rule $\psi$ that rewards detail, confidence, or instruction compliance without conditioning on evidence places weight on trajectories that satisfy $\psi$ regardless of whether they remain grounded, producing a negative retention gap; raising $\lambda_t$ then drives the agent further from the supported set, amplifying hallucination. Guidance strength is therefore not a generic reliability lever: its effect at a prefix is determined by the alignment between $\psi$ and the evidence available at that prefix.

### 4.3 Partial Harnessing as a Marginal Reliability Trade-off

A harness’s length, the number of stages it specifies before releasing the agent, is itself a design lever. Each additional scaffolded stage shortens the residual task left to the agent, but imposes another recoverability constraint that execution must satisfy. Adding a stage is helpful when its tail-risk reduction exceeds its reliability cost, and harmful when the cost dominates. We make this dichotomy precise through a single marginal comparison, and use it to characterize when *partial harnessing*, which specifies only an initial prefix of the trajectory and leaves the rest to the agent, outperforms full coverage.

Fix a task with total latent progress demand $L_x$ and a scaffold step size $s > 0$. For $m \in \mathcal{J} := \{0, 1, \ldots, \lfloor L_x/s \rfloor\}$, let $h_m$ denote the partial harness that specifies the first $m$ scaffolded stages and leaves the residual length $L_x - ms$ to the autonomous agent. Two scalar quantities determine the trade-off: the *scaffold cost*

$$c_s \;:=\; -\log q_{\mathrm{scaf}}(s; M),$$

the negative log-reliability of executing one scaffolded stage, and the *tail risk*

$$\kappa_{\mathrm{tail}}(d; M) \;:=\; -\log q_{\mathrm{tail}}(d; M),$$

the negative log-reliability of finishing a residual task of length $d$ autonomously. Under the homogeneous slice factorization in Appendix [C.4](https://arxiv.org/html/2605.21516#A3.SS4), the negative log-success of $h_m$ separates additively as

$$F(m) \;:=\; -\log \Pr\nolimits_{h_m}\!\bigl(\mathrm{Succ}_x(\tau_h) \mid x\bigr) \;=\; m\, c_s \;+\; \kappa_{\mathrm{tail}}(L_x - ms; M),$$

where the first term grows linearly with coverage while the second shrinks as the autonomous tail becomes shorter.

The trade\-off between these two terms collapses into a one\-dimensional diagnostic at the margin\. Let

$$\Delta(m; M) \;:=\; \kappa_{\mathrm{tail}}(L_x - ms; M) \;-\; \kappa_{\mathrm{tail}}(L_x - (m{+}1)s; M)$$

denote the reduction in tail risk obtained by adding the $(m{+}1)$-st scaffolded stage. Adding that stage strictly improves reliability if and only if this reduction exceeds its own scaffold cost, $\Delta(m; M) > c_s$. Partial harnessing thus follows a simple marginal principle: keep extending the harness only while the next scaffolded stage saves more residual risk than it introduces.

###### Theorem 3 (Coverage–autonomy alignment, informal).

Under the homogeneous slice factorization and the diminishing-returns condition on $\kappa_{\mathrm{tail}}$ stated in Appendix [C.4](https://arxiv.org/html/2605.21516#A3.SS4), $F(m)$ is discrete-convex in $m$, and the success probability $\exp(-F(m))$ is log-concave and unimodal over $m \in \mathcal{J}$. The smallest reliability-maximizing coverage is

$$m_{\mathrm{peak}} \;=\; \min\{m \in \mathcal{J} :\; m+1 \in \mathcal{J},\; \Delta(m; M) \leq c_s\},$$

with $m_{\mathrm{peak}} = \max \mathcal{J}$ if the set is empty. For a target reliability $\alpha \in (0,1)$, the minimum-structure $\alpha$-reliable coverage is $m_{\alpha} = \min\{m \in \mathcal{J} : F(m) \leq -\log \alpha\}$ whenever this set is nonempty.

The theorem connects harness length to the preceding alignment principles via a single marginal comparison. Far from a free parameter, the scaffold cost $c_s$ is raised by granularity mismatch (Theorem [1](https://arxiv.org/html/2605.21516#Thmtheorem1)) and misaligned guidance (Theorem [2](https://arxiv.org/html/2605.21516#Thmtheorem2)), and lowered by well-aligned scaffolding. The tail risk $\kappa_{\mathrm{tail}}$, by contrast, is a property of the agent. More capable agents attain smaller $\kappa_{\mathrm{tail}}$ on any residual, so $\Delta(m; M)$ crosses $c_s$ at smaller $m$ and the optimal coverage shifts earlier.

Harness length is therefore not a generic reliability lever: its effect is governed by the alignment between local scaffold quality and the agent’s autonomous capability on the residual. The right harness is the smallest one that brings the residual within autonomous reach, and no smaller. The slice rule is a clean stopping principle rather than a universal optimization law. When added stages alter earlier behavior or tail success depends strongly on the realized prefix, the success curve may become multi-modal, and candidate coverages must be compared directly.
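The marginal stopping rule can be sketched in a few lines. The example below assumes a convex quadratic tail risk $\kappa_{\mathrm{tail}}(d) = 0.0005\,d^2$, which is a hypothetical choice made only so that the diminishing-returns condition holds; the task length, step size, and scaffold cost are likewise invented, and the dependence on $M$ is suppressed.

```python
import math

def tail_risk(d):
    """Hypothetical convex tail risk kappa_tail(d); not taken from the paper."""
    return 0.0005 * d ** 2

def neg_log_success(m, total, step, scaffold_cost):
    """F(m) = m * c_s + kappa_tail(L_x - m * s)."""
    return m * scaffold_cost + tail_risk(total - m * step)

def peak_coverage(total, step, scaffold_cost):
    """Smallest m whose next scaffolded stage saves no more tail risk than it costs."""
    m_max = int(total // step)
    for m in range(m_max):  # only m with m + 1 still inside the coverage set
        delta = tail_risk(total - m * step) - tail_risk(total - (m + 1) * step)
        if delta <= scaffold_cost:
            return m
    return m_max  # the set is empty: full coverage

def alpha_coverage(total, step, scaffold_cost, alpha):
    """Smallest m with success probability at least alpha, or None if unattainable."""
    m_max = int(total // step)
    for m in range(m_max + 1):
        if neg_log_success(m, total, step, scaffold_cost) <= -math.log(alpha):
            return m
    return None

# Hypothetical task: L_x = 100, s = 20, so coverage m ranges over 0..5.
m_peak = peak_coverage(100, 20, 0.7)
brute_force = min(range(6), key=lambda m: neg_log_success(m, 100, 20, 0.7))
m_alpha = alpha_coverage(100, 20, 0.7, 0.05)
m_peak_cheap = peak_coverage(100, 20, 0.1)  # cheap scaffolding: keep adding stages
```

Under these assumptions the marginal rule and a brute-force minimization of $F$ agree on a partial harness, and lowering the scaffold cost pushes the optimum to full coverage. When the convexity assumption fails, the brute-force comparison is the safer check, in line with the caveat above.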

## 5 Experiments

To validate our theoretical predictions on harness design, we conduct synthetic experiments for controlled mechanism analysis in Section [5.1](https://arxiv.org/html/2605.21516#S5.SS1) and real-data experiments for realistic long-horizon validation in Section [5.2](https://arxiv.org/html/2605.21516#S5.SS2), examining three aspects: (i) *decomposition granularity*, (ii) *guidance alignment*, and (iii) *partial harness specification*.

### 5.1 Synthetic Experiments

![Refer to caption](https://arxiv.org/html/2605.21516v1/fig/sim_step/exp1_success_vs_num_goals_range.png) (a) Pass rate under different granularities
![Refer to caption](https://arxiv.org/html/2605.21516v1/fig/sim_guidance/exp3_success_vs_distribution_count.png) (b) Pass rate under different guidance levels
![Refer to caption](https://arxiv.org/html/2605.21516v1/fig/sim_structured_num/exp4_success_vs_scaffold_steps_chunk_size_20.png) (c) Pass rate under a 5 chunk candidate decomposition
![Refer to caption](https://arxiv.org/html/2605.21516v1/fig/sim_step/exp1_abs_final_bias_vs_num_goals.png) (d) Final bias under different granularities
![Refer to caption](https://arxiv.org/html/2605.21516v1/fig/sim_guidance/exp3_abs_mean_error_vs_distribution_count.png) (e) Final error under different guidance levels
![Refer to caption](https://arxiv.org/html/2605.21516v1/fig/sim_structured_num/exp4_abs_bias_vs_scaffold_steps_chunk_size_20.png) (f) Final bias under a 5 chunk candidate decomposition

Figure 2: Three alignment principles for harness design. (a, d) Granularity–capability: pass rate is non-monotonic in subgoal count $K$ and final bias grows with finer decomposition, with each agent peaking at a different $K$. (b, e) Guidance–evidence: aligned guidance improves pass rate and lowers final error as the action pool grows, while misaligned guidance does the opposite. (c, f) Partial harnessing: pass rate is unimodal in scaffold count and final bias rises beyond the peak, with stronger agents reaching the peak earlier.

Setup. We study a synthetic addition task in which an agent must reach a target sum $G=100$. A *harness* $\Delta_{h}$ decomposes $G$ into an ordered sequence of stage goals $(g_{1},\dots,g_{K})$ satisfying $\sum_{k=1}^{K}g_{k}=G$; for instance, $\Delta_{h}=(25,25,25,25)$ when $K=4$. Beyond decomposition, the harness also provides *guidance* that shapes how the agent acts within each stage. Together, these two roles let the harness serve as a high-level planner that both issues subgoals and constrains the low-level action model used to realize them.

Within stage $k$, let $s_{t}=\sum_{\tau=1}^{t}a_{\tau}$ denote the progress accumulated after $t$ draws (reset to zero at the start of each stage). The agent maintains a pool of candidate truncated Gaussian action distributions

$$\mathcal{P}=\left\{\mathcal{TN}_{[\ell_{j},u_{j}]}(\mu_{j},\sigma_{j}^{2})\right\}_{j\in\mathcal{J}},$$

where $\mathcal{TN}_{[\ell,u]}(\mu,\sigma^{2})$ denotes a Gaussian with mean $\mu$ and variance $\sigma^{2}$ truncated to $[\ell,u]$. The harness’s guidance is operationalized in our simplified formulation as a *pruning* of this pool, leaving an admissible index set $\mathcal{J}_{\Delta_{h}}\subseteq\mathcal{J}$. At each draw, the agent *greedily* selects $j^{\star}\in\mathcal{J}_{\Delta_{h}}$ best matched to the residual gap $g_{k}-s_{t-1}$, samples

$$\tilde{a}\sim\mathcal{TN}_{[\ell_{j^{\star}},u_{j^{\star}}]}(\mu_{j^{\star}},\sigma_{j^{\star}}^{2}),\qquad a=\operatorname{round}(\tilde{a})\in\mathbb{Z},$$

and updates $s_{t}=s_{t-1}+a$. A stage is deemed successful once $s_{t}\in[g_{k}-\epsilon,\,g_{k}+\epsilon]$ with tolerance $\epsilon=2$, after which the harness advances to stage $k+1$.

Based on the synthetic task above, we conduct three controlled experiments on harness execution. First, we study decomposition granularity by varying the number of subgoals $K$ across agents with different action scales. Second, we study guidance quality by varying the available action-pool size under different guidance strategies. Third, we study partial harnessing by varying the number of scaffolded chunks while leaving the remainder of the task to the autonomous agent. In all three experiments, we measure pass rate and final bias. We further include additional control experiments on retry budget, completion tolerance, and guidance pruning in Appendices [A.2](https://arxiv.org/html/2605.21516#A1.SS2), [A.3](https://arxiv.org/html/2605.21516#A1.SS3), and [A.4](https://arxiv.org/html/2605.21516#A1.SS4).
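The greedy selection over a pruned pool described in the setup can be sketched as follows. This is a minimal illustration, assuming that "best matched" means the admissible distribution whose mean is closest to the residual gap (the reading used in the pruning control experiment); the function name `select_greedy` and the example pool are hypothetical.

```python
def select_greedy(pool, admissible, gap):
    """Return the admissible index whose mean is closest to the residual gap.

    pool maps an index j to a tuple (mu, sigma, lower, upper);
    admissible is the index set left after the harness prunes the pool.
    """
    return min(admissible, key=lambda j: abs(pool[j][0] - gap))

# Illustrative three-distribution pool (assumed values).
pool = {
    0: (6.0, 2.0, 4, 8),
    1: (8.0, 3.0, 5, 11),
    2: (10.0, 4.0, 6, 14),
}

# With the full pool, a large residual gap picks the largest-mean distribution.
print(select_greedy(pool, admissible=[0, 1, 2], gap=25))
# After pruning index 2, the same gap falls back to the next-closest mean.
print(select_greedy(pool, admissible=[0, 1], gap=25))
```

Pruning only changes which indices are passed as `admissible`, which is why the same selection rule can help or hurt depending on what the harness removes.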

![Refer to caption](https://arxiv.org/html/2605.21516v1/fig/real_step/step_success_rate_curve.png)\(a\)Workflow Granularity and Pass Rate on Terminal\-Bench v2
![Refer to caption](https://arxiv.org/html/2605.21516v1/fig/real_guidance/hallucination_rate_curves_all_models.png)\(b\)Guidance Quantity and Hallucination Rate under Different Alignment Conditions

Figure 3: Harness design trade-offs in real and controlled settings. (a) On Terminal-Bench v2, pass rate first improves and then declines as the workflow is split into more steps. (b) Hallucination rate under increasing amounts of guidance, comparing random, aligned, and misaligned guidance.

I. Granularity–Capability Alignment. We vary the number of subgoals $K$ across three agents with different action scales, sweeping $G/K$ from values too coarse to be reached within the draw budget to values below the agent’s natural action scale. Figures [2(a)](https://arxiv.org/html/2605.21516#S5.F2.sf1) and [2(d)](https://arxiv.org/html/2605.21516#S5.F2.sf4) show that pass rate peaks only when $G/K$ matches the agent’s controllable progress, and that final bias grows under overly fine decomposition as local completion errors accumulate. Both effects match the granularity penalty $(\rho^{(M)}-\log M)_{+}$ in Theorem [1](https://arxiv.org/html/2605.21516#Thmtheorem1).

II. Guidance–Evidence Alignment. We hold decomposition fixed and vary which action distributions remain available, comparing aligned guidance, misaligned guidance, and uniform random selection on the same pool. Figures [2(b)](https://arxiv.org/html/2605.21516#S5.F2.sf2) and [2(e)](https://arxiv.org/html/2605.21516#S5.F2.sf5) show that pass rate rises with pool size only under aligned guidance, while misaligned guidance produces lower pass rate and larger final error than uniform selection on the same pool. The benefit therefore comes from the sign of the retention gap rather than from additional constraints, matching Theorem [2](https://arxiv.org/html/2605.21516#Thmtheorem2).

III. Partial Harnessing. We fix a chunk size and sweep the number of scaffolded chunks $r$, leaving the remaining length to the autonomous agent. Figures [2(c)](https://arxiv.org/html/2605.21516#S5.F2.sf3) and [2(f)](https://arxiv.org/html/2605.21516#S5.F2.sf6) show that pass rate is unimodal in $r$ for every agent, with the peak at smaller $r$ for stronger agents, while final bias rises beyond the peak as additional structure accumulates terminal error rather than reducing it. The same shape persists across alternative chunk sizes (Figure [8](https://arxiv.org/html/2605.21516#A1.F8)). The unimodal shape and agent-dependent peak location match the marginal stopping condition $\Delta(r;M)\leq c_{s}$ in Theorem [3](https://arxiv.org/html/2605.21516#Thmtheorem3), with stronger agents reaching this condition at smaller $r$ owing to their lower autonomous tail cost $\kappa_{\mathrm{tail}}$.

### 5\.2Real Data

Setup. We conduct three experiments, each isolating one axis of harness design. (i) To probe *decomposition depth*, a granularity sweep on Terminal-Bench-2 (merrill2026terminalbench) varies sub-goal count $k\in\{1,\dots,10\}$ in the workflow given to a fixed agent. (ii) To probe *guidance quantity and quality*, a controlled Plotly chart-analysis task varies the number of Aligned, Misaligned, and Random guidance steps and measures the hallucination rate. (iii) To probe *partial harnessing*, we conduct a case study on a Terminal-Bench-2 task by decomposing the solution into $10$ steps and progressively revealing a prefix of length $\ell$. Full configurations are deferred to Appendix [B](https://arxiv.org/html/2605.21516#A2).

(a) Granularity-capability alignment on Terminal-Bench v2. Figure [3(a)](https://arxiv.org/html/2605.21516#S5.F3.sf1) shows that pass rate first rises and then declines with sub-goal count, peaking at six steps. The drop on either side reflects the two failure regimes predicted by the theory: coarse workflows leave each step too large to complete reliably, and fine workflows fragment the task into milestones the model cannot meaningfully terminate at.

(b) Guidance-evidence alignment and hallucination. Figure [3(b)](https://arxiv.org/html/2605.21516#S5.F3.sf2) contrasts how hallucination scales with guidance count under three policies that differ only in whether they track task evidence. Under aligned guidance, hallucination stays near $0.35$ regardless of count; under misaligned guidance it rises monotonically from $0.20$ to nearly $0.90$; random guidance, which mixes the two, interpolates in between. The same scaling axis therefore has opposite effects on reliability depending on the sign of the retention gap, with stronger guidance amplifying whichever direction this sign points, matching the sign-amplification structure predicted by Theorem [2](https://arxiv.org/html/2605.21516#Thmtheorem2).

### 5\.3A Case Study on Partial Harnessing

![Refer to caption](https://arxiv.org/html/2605.21516v1/x2.png)

Figure 4: Partial harnessing on the `llm-inference-batching-scheduler` task. Left: the fully specified workflow over-constrains execution and the agent gets lost in repeated intermediate revisions before reaching the final stages. Right: the partial workflow provides only an initial 3-step harness, after which the agent completes the remaining task through its own planning.

Beyond the empirical results above, Figure [4](https://arxiv.org/html/2605.21516#S5.F4) compares a partial harness and a full harness on the `llm-inference-batching-scheduler` task (merrill2026terminalbench). The partial harness contains only three initial steps: read the inputs, identify feasible sizes, and select shared shapes. After this initial scaffold, the agent takes over the remaining execution, enters its own iterative refinement loop, and completes the task. In contrast, the full harness specifies the downstream procedure in much greater detail, but the additional structure does not translate into better control. Instead, the agent becomes stuck inside the prescribed scaffold, repeatedly revising intermediate decisions without reaching the final solution. This case illustrates why partial harnessing can outperform full specification: a short scaffold can guide the agent into the right search space while leaving enough autonomy for it to plan, adapt, and finish the task.

## 6Conclusion

We studied harness engineering as an inference\-time alignment problem over agent execution trajectories\. Separating a harness into decomposition—which sets the progress scale imposed on the agent—and guidance—which reshapes the local trajectory distribution—reveals that stronger harnesses are not necessarily better: decomposition helps only when its scale matches the agent’s controllable progress, and guidance helps only when it preserves trajectories supported by the current evidence\. Misalignment in either role turns scaffolding into a reliability bottleneck rather than an aid, with hallucination as one concrete instance of guidance retaining the wrong part of the local trajectory space\.

This view recasts harness design from the problem of adding more structure to the problem of choosing what to specify, how strongly to specify it, and when to stop\. In particular, partial harnessing shows that a harness need not cover the full execution path: once the remaining task falls within the agent’s autonomous capability, continued control adds reliability cost without repaying it\. The right harness is the smallest one that keeps the agent on a recoverable trajectory, and no smaller\. For clarity, we adopt simplified settings and assumptions to isolate the core mechanisms of harness design, while recognizing that real agent behavior is substantially more complex; extending these principles to richer execution dynamics and real world agent systems remains an important direction for future work\.

## References

## Appendix ASynthetic Experiments Details

### A\.1Task and Metrics

The cumulative-progress task is parameterized by a total target $G$, a tolerance $\epsilon$, and a per-stage draw budget $R$. A harness decomposes $G$ into an ordered subgoal sequence $(g_{1},\ldots,g_{K})$ with $\sum_{k}g_{k}=G$ and provides the agent with a pool of action distributions, each a truncated Gaussian specified by $(\mu,\sigma,\ell,u)$. At stage $k$, local progress $s_{k}$ is initialized to zero. The agent draws $z\sim\mathcal{N}(\mu,\sigma^{2})$, rejects samples outside $[\ell,u]$, rounds to the nearest integer, and clips back into $[\ell,u]$; the resulting action $a$ updates $s_{k}\leftarrow s_{k}+a$. A stage succeeds once $s_{k}\in[g_{k}-\epsilon,g_{k}+\epsilon]$ after at least one draw, fails by overshoot if $s_{k}>g_{k}+\epsilon$, and fails by draw limit if $R$ draws are exhausted before the tolerance window is reached. An episode succeeds if and only if all stages succeed.
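The stage dynamics above can be sketched in a few lines of Python. This is an illustrative reading of the description with a single action distribution per agent, not the authors’ code; the function names are hypothetical, and stopping the episode at the first failed stage is an assumption.

```python
import random

def draw_action(mu, sigma, lo, hi, rng):
    """Rejection-sample N(mu, sigma^2) on [lo, hi], round, then clip to [lo, hi]."""
    while True:
        z = rng.gauss(mu, sigma)
        if lo <= z <= hi:
            return min(hi, max(lo, round(z)))

def run_stage(goal, dist, eps, R, rng):
    """Return (outcome, progress) for one stage with at most R draws."""
    s = 0
    for _ in range(R):
        s += draw_action(*dist, rng)
        if goal - eps <= s <= goal + eps:
            return "success", s
        if s > goal + eps:
            return "overshoot", s
    return "draw_limit", s

def run_episode(goals, dist, eps, R, rng):
    """Return (all_stages_succeeded, total_progress); stops at the first failure."""
    total = 0
    for goal in goals:
        outcome, s = run_stage(goal, dist, eps, R, rng)
        total += s
        if outcome != "success":
            return False, total
    return True, total

rng = random.Random(0)
# Small agent from Appendix A.2 on a K=4 decomposition of G=100.
print(run_episode([25, 25, 25, 25], (6.0, 2.0, 4, 8), eps=2, R=4, rng=rng))
```

Averaging the first return value over many seeds would give an estimate of pass rate for a given decomposition.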

We report two metrics throughout. *Pass rate* is the fraction of episodes in which all stages succeed and measures whether the harness keeps the run on a recoverable trajectory. *Absolute final bias* is $|\sum_{k}s_{k}-G|$ and measures how much terminal accuracy the harness preserves even when the run completes. The two metrics can dissociate, and reporting both separates “the harness lets the agent finish” from “the harness lets the agent finish accurately.”

### A\.2Granularity–Capability Experiment

This experiment isolates decomposition by varying $K$ while fixing a single action distribution per agent. We use $G=100$, $\epsilon=2$, $R=4$, and sweep $K\in\{1,2,3,4,5,6,7,8,9,10,12,15,18,20\}$. The three agents differ only in action scale: Small $(\mu=6,\sigma=2,[4,8])$, Medium $(\mu=8,\sigma=3,[5,11])$, and Large $(\mu=10,\sigma=4,[6,14])$. Small $K$ produces large subgoals that may exceed what an agent can reach within $R$ draws, while large $K$ produces subgoals below the agent’s natural action scale, where rounding-and-clipping introduces residual error at every stage. The three agents therefore peak at different $K$, sweeping out the granularity–capability frontier predicted by Theorem [1](https://arxiv.org/html/2605.21516#Thmtheorem1).

### A\.3Guidance–Evidence Experiment

This experiment isolates guidance quality from action-pool size. We construct a pool of ten distributions indexed by $i=0,\ldots,9$, with

$$\mu_{i}=4.0+1.2i,\qquad\sigma_{i}=1.5+0.35i,\qquad\ell_{i}=\max\{1,\lfloor\mu_{i}-2.0\rfloor\},\qquad u_{i}=\lceil\mu_{i}+2\sigma_{i}\rceil.$$

At the start of each episode we randomly retain $N\in\{1,\ldots,10\}$ distributions, then compare three selection policies on the retained pool: aligned guidance (favoring distributions whose mean is closest to the remaining progress), misaligned guidance (favoring distributions whose mean is farthest), and uniform random selection. We use $G=100$, $K=5$, $\epsilon=4$, and $R=5$. Holding pool size fixed across the three policies isolates the sign of the retention gap from the size of the local action space, which is the comparison Theorem [2](https://arxiv.org/html/2605.21516#Thmtheorem2) predicts to be decisive.
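For concreteness, the ten-distribution pool defined by the formulas above can be tabulated as follows. This is a small sketch; `build_pool` is a hypothetical helper name, and exact rational arithmetic is used only so that the floor and ceiling are not affected by floating-point rounding.

```python
import math
from fractions import Fraction

def build_pool(n=10):
    """Return a list of (mu, sigma, lower, upper) tuples for i = 0, ..., n-1."""
    pool = []
    for i in range(n):
        mu = Fraction(4) + Fraction(6, 5) * i          # mu_i = 4.0 + 1.2 i
        sigma = Fraction(3, 2) + Fraction(7, 20) * i   # sigma_i = 1.5 + 0.35 i
        lower = max(1, math.floor(mu - 2))             # l_i = max{1, floor(mu_i - 2.0)}
        upper = math.ceil(mu + 2 * sigma)              # u_i = ceil(mu_i + 2 sigma_i)
        pool.append((float(mu), float(sigma), lower, upper))
    return pool

for i, dist in enumerate(build_pool()):
    print(i, dist)
```

The first entry is $(4.0, 1.5, 2, 7)$ and the last has truncation bounds $[12, 25]$, so the pool spans action scales from well below to well above the per-stage goal of $G/K=20$.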

### A\.4Partial Harnessing Experiment

This experiment sweeps a progress slice as in Theorem [3](https://arxiv.org/html/2605.21516#Thmtheorem3). We fix a chunk size $c$ and a number of scaffolded chunks $r$, defining the harness

$$\Delta_{h}=(\underbrace{c,\ldots,c}_{r\text{ chunks}},\,G-rc),$$

where $r=0$ leaves the entire task to the autonomous agent and $r=\lfloor G/c\rfloor$ scaffolds it fully. We use $G=100$, $\epsilon=2$, $R=10$, and the Small, Medium, and Large agents from Appendix [A.2](https://arxiv.org/html/2605.21516#A1.SS2). Sweeping $r$ traces the slice objective $F(r)$ for each agent. A stronger agent has lower autonomous tail cost $\kappa_{\mathrm{tail}}$ and reaches $\Delta(r;M)\leq c_{s}$ at smaller $r$, predicting earlier peaks for stronger agents.
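The partial-harness construction can be written out directly. The sketch below is illustrative: `partial_harness` is a hypothetical name, and dropping a zero-length residual stage when $rc=G$ is an assumption about how the fully scaffolded case is handled.

```python
def partial_harness(G, c, r):
    """Return r scaffolded chunks of size c followed by the autonomous residual G - r*c."""
    if not 0 <= r <= G // c:
        raise ValueError("r must lie between 0 and floor(G / c)")
    stages = [c] * r
    residual = G - r * c
    if residual > 0:  # assumption: omit an empty residual stage
        stages.append(residual)
    return stages

print(partial_harness(100, 20, 0))  # fully autonomous
print(partial_harness(100, 20, 3))  # three scaffolded chunks, then the residual
print(partial_harness(100, 20, 5))  # fully scaffolded
```

Sweeping `r` from `0` to `G // c` produces the sequence of harnesses whose pass rates trace the slice objective.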

### A\.5Additional Control Experiments

The experiments below vary harness or task parameters that could plausibly explain the patterns above through a simpler mechanism\. In each case the result reproduces the alignment predictions rather than displacing them\.

##### Retry budget\.

Holding $K=4$ and the granularity setup of Appendix [A.2](https://arxiv.org/html/2605.21516#A1.SS2) fixed, we sweep the per-stage draw budget $R\in\{1,\ldots,10\}$. Increasing $R$ raises pass rate and lowers final bias, but the curves saturate quickly once the binding failure mode shifts from draw-limit exhaustion to action-scale mismatch and overshoot (Figure [5](https://arxiv.org/html/2605.21516#A1.F5)). Retry budget is therefore a recoverability resource that interacts with granularity rather than substituting for it: more attempts cannot fix a stage whose requested progress lies outside the agent’s reachable scales, matching the $-\log M_{t}$ correction in Theorem [1](https://arxiv.org/html/2605.21516#Thmtheorem1).

![Refer to caption](https://arxiv.org/html/2605.21516v1/fig/sim_attempt_time/exp5_pass_rate_vs_retry.png)\(a\)Pass rate vs\. attempt budget
![Refer to caption](https://arxiv.org/html/2605.21516v1/fig/sim_attempt_time/exp5_abs_bias_vs_retry.png)\(b\)Final bias vs\. attempt budget

Figure 5: Retry budget improves recoverability but cannot overcome capability mismatch. Pass rate and final bias both saturate once attempts no longer relax the binding constraint.
##### Completion tolerance\.

Fixing $G=100$, $K=10$, and $R=4$, we sweep $\epsilon\in\{0,1,\ldots,10\}$. Larger tolerance raises pass rate by widening each stage’s acceptance window, but for agents whose action scale undershoots the subgoal size it accepts premature stage completion and accumulates systematic terminal under-progress (Figure [6](https://arxiv.org/html/2605.21516#A1.F6)). Tolerance therefore trades recoverability against terminal accuracy, and the right setting depends on the agent’s action scale rather than functioning as a universal robustness knob.

![Refer to caption](https://arxiv.org/html/2605.21516v1/fig/sim_tolerance/exp6_pass_rate_vs_tolerance.png)\(a\)Pass rate vs\. tolerance
![Refer to caption](https://arxiv.org/html/2605.21516v1/fig/sim_tolerance/exp6_abs_bias_vs_tolerance.png)\(b\)Final bias vs\. tolerance

Figure 6: Tolerance trades recoverability for terminal accuracy. Pass rate rises sharply with $\epsilon$ but final bias grows when tolerance exceeds the agent’s natural per-stage variation.
##### Aggressive guidance pruning\.

Starting from a three-distribution pool consisting of Model 1 $(\mu=6,\sigma=2,[4,8])$, Model 2 $(\mu=8,\sigma=3,[5,11])$, and Model 3 $(\mu=10,\sigma=6,[4,14])$, we randomly remove $m\in\{0,1,2\}$ distributions and sweep $K$ as in Appendix [A.2](https://arxiv.org/html/2605.21516#A1.SS2), with the agent greedily selecting the closest-mean distribution at each step. Stronger pruning lowers pass rate and amplifies final bias, with the largest gap appearing under fine decomposition where local action-scale mismatch is most exposed (Figure [7](https://arxiv.org/html/2605.21516#A1.F7)). Pruning is therefore a negative-gap intervention even when individual remaining distributions are reasonable: the retention-gap sign in Theorem [2](https://arxiv.org/html/2605.21516#Thmtheorem2) depends on which trajectories are removed, not how many.

![Refer to caption](https://arxiv.org/html/2605.21516v1/fig/sim_guidance/exp2_success_vs_num_goals_range.png)\(a\)Pass rate vs\. subgoal count under pruning
![Refer to caption](https://arxiv.org/html/2605.21516v1/fig/sim_guidance/exp2_abs_final_bias_vs_num_goals.png)\(b\)Final bias vs\. subgoal count under pruning

Figure 7: Aggressive pruning amplifies granularity mismatch. Removing more action distributions worsens both pass rate and final bias, with the largest gap appearing under fine decomposition where local mismatch is most exposed.
##### Extended partial\-harnessing sweeps\.

We complement the chunk size $c=20$ result reported in the main text with $c=10$ and $c=25$ (Figure [8](https://arxiv.org/html/2605.21516#A1.F8)). The unimodal shape and the agent-dependent peak location persist across chunk sizes, with finer chunks placing the peak at larger $r$ as predicted by the marginal rule.

![Refer to caption](https://arxiv.org/html/2605.21516v1/fig/sim_structured_num/exp4_success_vs_scaffold_steps_chunk_size_10.png)\(a\)Pass rate, $c=10$
![Refer to caption](https://arxiv.org/html/2605.21516v1/fig/sim_structured_num/exp4_success_vs_scaffold_steps_chunk_size_25.png)\(b\)Pass rate, $c=25$
![Refer to caption](https://arxiv.org/html/2605.21516v1/fig/sim_structured_num/exp4_abs_bias_vs_scaffold_steps_chunk_size_10.png)\(c\)Final bias, $c=10$
![Refer to caption](https://arxiv.org/html/2605.21516v1/fig/sim_structured_num/exp4_abs_bias_vs_scaffold_steps_chunk_size_25.png)\(d\)Final bias, $c=25$

Figure 8: Marginal stopping persists across chunk sizes. Both pass rate and final bias remain unimodal in $r$ at $c=10$ and $c=25$, with peak location shifting as predicted by Theorem [3](https://arxiv.org/html/2605.21516#Thmtheorem3).

## Appendix BReal World Task Details

### B\.1Granularity experiment on Terminal\-Bench\-2

This appendix specifies the full configuration of the granularity sweep referenced in Section [5.2](https://arxiv.org/html/2605.21516#S5.SS2). The experiment varies the workflow step count $k\in\{1,\dots,10\}$ supplied to a fixed solver and measures pass rate on real software-engineering tasks from `terminal-bench@2.0`, with all non-$k$ factors held constant.

#### B\.1\.1Tasks

We sample tasks from `terminal-bench@2.0` using the official Terminal-Bench leaderboard’s GLM-5 task-level correctness statistics (snapshot: May 2026). The selection rule is to focus on tasks of moderate difficulty for GLM-5, down-weighting both near-ceiling tasks (uninformative because already trivial) and near-floor tasks (uninformative because no workflow yields signal), so that granularity effects can be observed in the sensitive middle band. Applying this rule yields a pool of 36 tasks.

For the aggregate step-curve in the main text, we further restrict to a $32$-task subset by removing four tasks that returned all-zero outcomes in our first trial across every $k$: `break-filter-js-from-html`, `caffe-cifar-10`, `chess-best-move`, and `db-wal-recovery`. Excluding these prevents degenerate floor cases from flattening the aggregate curve.

Table 1: Selected Terminal-Bench 2 tasks and official resolution rates in (merrill2026terminalbench).
#### B\.1\.2Variables and Controls

The independent variable is the workflow step count $k\in\{1,\dots,10\}$. The primary dependent variable is pass rate; we additionally log `episodes_to_success` (number of solver episodes used on successful runs) as a secondary diagnostic. To isolate the effect of $k$, the following factors are held fixed across all conditions:

Table 2: Configuration for the Terminal-Bench-2 workflow-step experiment.

All experiments are run locally on a MacBook with an Apple M4 chip.

#### B\.1\.3Pipeline

Each $(\text{task},k)$ cell follows a two-stage pipeline.

Stage 1: Workflow Generation. For each task, the planner generates one workflow per granularity $k\in\{1,\dots,10\}$, producing exactly $k$ ordered steps. The planner receives the raw Terminal-Bench task instruction together with a fixed generation prompt that requests $k$ ordered intermediate actions or verification points and explicitly prohibits fabricated command outputs or verifier results unsupported by the instruction. Each task’s workflows are saved as a single JSON object indexed by $k$.

Stage 2: Solver Evaluation. For each $(\text{task},k)$, the solver ([https://github.com/stanford-iris-lab/meta-harness-tbench2-artifact](https://github.com/stanford-iris-lab/meta-harness-tbench2-artifact)) receives a prompt of the form

$$\Pi_{k}(x)\;=\;\texttt{Template}\bigl(x,\ \texttt{Format}(W_{k}(x))\bigr),$$

where $x$ is the task instruction, $W_{k}(x)$ is the $k$-step workflow, and `Template` is a fixed scaffold containing three blocks: `## task` (verbatim $x$), `## guidance` (a four-bullet $k$-invariant rubric on inspecting the verifier, reproducing failures, and minimal local edits), and `## workflow` (the numbered $k$-step list). The solver is then run once under the fixed Harbor Docker environment with `MAX_EPISODES=50`, and success is determined by the verifier’s reward.
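To make the prompt structure concrete, the following sketch assembles the three blocks named above. It is illustrative only: the exact wording of the template and of the four guidance bullets is not given here, so `build_prompt`, `format_workflow`, and the example strings are hypothetical placeholders.

```python
def format_workflow(steps):
    """Render the k workflow steps as a numbered list."""
    return "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))

def build_prompt(task_instruction, workflow_steps, guidance_bullets):
    """Assemble the task, guidance, and workflow blocks in that order."""
    parts = [
        "## task",
        task_instruction,  # the task instruction x, inserted verbatim
        "",
        "## guidance",
        "\n".join(f"- {bullet}" for bullet in guidance_bullets),  # same for every k
        "",
        "## workflow",
        format_workflow(workflow_steps),  # the k-step workflow W_k(x)
    ]
    return "\n".join(parts) + "\n"

# Placeholder inputs (assumed, not taken from the benchmark).
prompt = build_prompt(
    "Fix the failing build.",
    ["Inspect the verifier script", "Apply a minimal patch"],
    ["Inspect the verifier first", "Reproduce the failure", "Prefer minimal local edits", "Re-run the check"],
)
print(prompt)
```

Because only the `## workflow` block depends on $k$, any change in pass rate across the sweep can be attributed to workflow granularity rather than to the surrounding prompt.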

#### B\.1\.4Metric

For task $c$, step count $k$, and trial $r$, define the per-cell outcome as

$$s_{c,k,r}\;=\;\mathbf{1}\!\left\{\texttt{verifier\_result.rewards.reward}>0\right\}.$$

The aggregate pass rate at granularity $k$ is

$$\mathrm{PassRate}(k)\;=\;\frac{n_{\mathrm{succ}}(k)}{n_{\mathrm{valid}}(k)},$$

where $n_{\mathrm{valid}}(k)$ is the number of evaluated cells at $k$ (excluding missing or errored runs) and $n_{\mathrm{succ}}(k)$ is the number of cells with $s_{c,k,r}=1$. For successful cells, we additionally report `episodes_to_success`, taken from the agent’s `n_episodes` metadata.
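The aggregation can be sketched as below. The record layout (a dict with a `reward` field that is `None` for missing or errored runs) is an assumption made for illustration and is not the benchmark’s actual result schema.

```python
def pass_rate(cells):
    """Fraction of valid cells with reward > 0; None when no cell is valid."""
    valid = [cell for cell in cells if cell.get("reward") is not None]
    if not valid:
        return None
    n_succ = sum(1 for cell in valid if cell["reward"] > 0)
    return n_succ / len(valid)

# Hypothetical cells at one granularity k: two successes, one failure, one errored run.
cells_at_k = [
    {"task": "t1", "reward": 1.0},
    {"task": "t2", "reward": 0.0},
    {"task": "t3", "reward": None},
    {"task": "t4", "reward": 0.5},
]
print(pass_rate(cells_at_k))
```

Errored runs are dropped from the denominator rather than counted as failures, which matches the definition of $n_{\mathrm{valid}}(k)$.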

### B\.2Plot\-Reasoning Hallucination Setup

![Refer to caption](https://arxiv.org/html/2605.21516v1/fig/real_guidance/hallucination_rate_curves_qwen_qwen3.5-plus-02-15.png)\(a\)Qwen3\.5\-plus\-02\-15
![Refer to caption](https://arxiv.org/html/2605.21516v1/fig/real_guidance/hallucination_rate_curves_openai_gpt-oss-120b.png)\(b\)GPT\-OSS\-120B
![Refer to caption](https://arxiv.org/html/2605.21516v1/fig/real_guidance/hallucination_rate_curves_google_gemini-3-flash-preview.png)\(c\)Gemini\-3\-flash\-preview

Figure 9: Hallucination rate under increasing guidance count for each analysis model: (a) Qwen3.5-plus-02-15, (b) GPT-OSS-120B, (c) Gemini-3-flash-preview.

We further evaluate harness guidance in a visibility-limited chart reasoning task. This experiment complements the controlled simulations by testing whether additional guidance reduces or amplifies unsupported claims in a more realistic analysis setting. The central question is: when the model only observes a truncated Plotly representation rather than the rendered figure, how do the amount and type of guidance affect hallucination?

#### B\.2\.1Task Construction

Each task is a chart-analysis problem. Instead of showing the model the rendered image, we provide only the raw Plotly trace representation, denoted as `plotly_repr`. This input contains partial structural information about the chart, but it does not provide full visual access to the figure. Therefore, the model must reason under limited observability.

We generate six chart types:

`box`, `line`, `scatter`, `histogram`, `violin`, `heatmap`.

For each chart type, we generate a large synthetic chart and store its Plotly representation as a task instance. The random seed is fixed to ensure reproducibility. Each task contains a chart name and a string-valued `plotly_repr`, which is later inserted into the analysis prompt.

#### B\.2\.2Guidance Conditions

The harness augments the chart-analysis prompt with a set of textual guidance rules. We vary the number of injected rules, denoted by $k$. When $k=0$, the model receives no additional guidance beyond the base prompt. As $k$ increases, the harness imposes more requirements on how the model should analyze the chart.

We compare three guidance sources\. The first source is the original guidance set, which contains general analysis requirements\. The second source is an anti\-hallucination guidance set, which explicitly instructs the model to respect the limited visibility of the Plotly representation and avoid unsupported visual claims\. The third source is a mixed condition, where guidance rules are sampled from the two sources with equal probability\. Thus, the experiment separates the effect of guidance quantity from the effect of guidance quality\.

#### B\.2\.3Generation and Judging Pipeline

Each experimental case follows the same two-stage pipeline. First, the analysis model receives the base chart-analysis prompt, the task-specific `plotly_repr`, and $k$ sampled guidance rules. It then produces a natural-language analysis of the chart. Second, the judge model reads both the input context and the generated analysis, and assigns one of three labels:

`GROUNDED`, `HALLUCINATION`, `ERROR`.

A response is labeled `GROUNDED` if the analysis stays within what can be reasonably inferred from the provided representation. It is labeled `HALLUCINATION` if it makes unsupported claims, overstates visual details, or fails to acknowledge the limited observability of the chart. Cases with API failures or invalid outputs are labeled `ERROR` and excluded from the valid denominator.

#### B\.2\.4Metrics

For each guidance count $k$, we aggregate results across chart types and trials. Let $n_{\mathrm{valid}}(k)$ be the number of non-error cases, and let $n_{\mathrm{hall}}(k)$ be the number of cases labeled as hallucination. The main metric is the hallucination rate:

$$\mathrm{HallucinationRate}(k)=\frac{n_{\mathrm{hall}}(k)}{n_{\mathrm{valid}}(k)}.$$

We also report the number of error cases to ensure that a lower hallucination rate is not caused by a large number of invalid model outputs.
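A short sketch of this metric, reporting the error count alongside the rate as the text recommends. The function name and the flat list-of-labels input are assumptions for illustration.

```python
from collections import Counter

def hallucination_stats(labels):
    """Return (hallucination_rate, n_error) from a list of judge labels."""
    counts = Counter(labels)
    n_error = counts["ERROR"]
    n_valid = counts["GROUNDED"] + counts["HALLUCINATION"]  # ERROR is excluded
    rate = counts["HALLUCINATION"] / n_valid if n_valid else None
    return rate, n_error

# Hypothetical judge labels for one guidance count k.
labels_at_k = ["GROUNDED", "HALLUCINATION", "ERROR", "GROUNDED", "HALLUCINATION", "GROUNDED"]
print(hallucination_stats(labels_at_k))
```

Returning both values makes it easy to flag a guidance count where the rate looks low only because many outputs were invalid.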

#### B\.2\.5Experimental Variables

The independent variables are the guidance count $k$ and the guidance source. By default, we evaluate

$$k\in\{0,1,\ldots,10\},$$

with three trials for each $k$ and each chart type. For a fair comparison, the analysis model, judge model, task set, random seed, and tested values of $k$ are kept fixed across guidance strategies.

#### B\.2\.6Interpretation

This experiment is designed to test whether stronger guidance is always better. If increasing $k$ consistently lowers hallucination rate, then additional guidance acts as a useful inference-time constraint. If hallucination rate increases with $k$, then excessive guidance may encourage the model to satisfy more textual requirements than the limited evidence can support. Comparing the original, anti-hallucination, and mixed guidance conditions further distinguishes whether the outcome is driven by the number of guidance rules or by their alignment with the available evidence.

## Appendix COmitted Proofs

### C\.1Recoverability Construction and Process Loss

This appendix formalizes the chain-rule factorization $P_{h}(\mathsf{Succ}_{x}(\tau_{h})\mid x)=\prod_{t}\bar{p}_{t}(h;x)$ stated without proof in Section [3](https://arxiv.org/html/2605.21516#S3), on which all subsequent stagewise bounds rely. The factorization requires a precise notion of when an intermediate prefix can still reach a successful continuation, so Appendix [C.1.1](https://arxiv.org/html/2605.21516#A3.SS1.SSS1) first defines the recoverable prefix sets $\mathcal{R}_{t}(x,h)$, identifies the stagewise event $B_{t}$ with membership in $\mathcal{R}_{t}$, and isolates the goal-consistency condition that aligns these intermediate notions with final success. Under this condition, Appendix [C.1.2](https://arxiv.org/html/2605.21516#A3.SS1.SSS2) then derives the factorization by chain rule and converts it into the additive log form invoked by the bounds in Appendix [C.2](https://arxiv.org/html/2605.21516#A3.SS2) and Appendix [C.3](https://arxiv.org/html/2605.21516#A3.SS3).

#### C\.1\.1 Recoverable Prefixes and Goal Consistency

Recall from Section [3](https://arxiv.org/html/2605.21516#S3) the completed prefix

$$K_t:=(x,g_1,\tau_1,\ldots,g_t,\tau_t),\qquad K_0:=x,$$

which records the full execution trace through stage $t$. For each intermediate stage $t<T$, define the recoverable set

$$\mathcal{R}_t(x,h):=\left\{K_t:\exists\,(\tau_{t+1},\ldots,\tau_T)\text{ such that }\mathsf{Succ}_x(\tau_h)\right\},$$

the set of completed prefixes from which some continuation under the remaining plan $(g_{t+1},\ldots,g_T)$ reaches $y^{\star}(x)$. At the terminal stage we set $\mathcal{R}_T(x,h):=\{K_T:\mathsf{Succ}_x(\tau_h)\}$, so a terminal prefix is recoverable if and only if it is successful. The stagewise recoverability event of Section [3](https://arxiv.org/html/2605.21516#S3) is then

$$B_t:=\{K_t\in\mathcal{R}_t(x,h)\},\qquad t=1,\ldots,T,$$

and the cumulative events $E_t:=\bigcap_{s\leq t}B_s$ form a nested sequence $E_T\subseteq\cdots\subseteq E_1$.

We call the harness plan *goal-consistent* if

$$E_T\equiv\mathsf{Succ}_x(\tau_h),$$

that is, if remaining recoverable through every stage is equivalent to final success. This rules out decompositions whose intermediate recoverability notions are misaligned with the final answer, and it is the regime under which the chain-rule identity below holds.

#### C\.1\.2 Chain\-Rule Identity

Under goal consistency, $P_h(\mathsf{Succ}_x(\tau_h)\mid x)=P_h(E_T\mid x)$. Applying the chain rule to the nested events $E_1,\ldots,E_T$ and using $E_t=E_{t-1}\cap B_t$ together with $E_{t-1}=B_{<t}$,

$$P_h(\mathsf{Succ}_x(\tau_h)\mid x)=\prod_{t=1}^{T}P_h(B_t\mid B_{<t},x)=\prod_{t=1}^{T}\bar{p}_t(h;x),$$

which is the factorization stated in Section [3](https://arxiv.org/html/2605.21516#S3). Taking negative logarithms,

$$-\log P_h(\mathsf{Succ}_x(\tau_h)\mid x)=\sum_{t=1}^{T}-\log\bar{p}_t(h;x),$$

so the process loss is induced by the primitive final-success objective rather than added on top of it. Each summand penalizes a low conditional probability of preserving recoverability after one harness-level plan step, and this is the form used by the bounds in Appendix [C.2](https://arxiv.org/html/2605.21516#A3.SS2) and Appendix [C.3](https://arxiv.org/html/2605.21516#A3.SS3).
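As a quick numerical illustration of the additive log form, the following sketch multiplies a set of stagewise conditional probabilities and checks that the negative log of the product equals the sum of per-stage losses. The values in `p_bar` are hypothetical stand-ins for $\bar{p}_t(h;x)$ and the function names are illustrative; none of them come from the paper's experiments.

```python
import math


def success_probability(stage_probs):
    """Product of stagewise conditional recoverability probabilities."""
    return math.prod(stage_probs)


def process_loss(stage_probs):
    """Sum of per-stage negative log-probabilities."""
    return sum(-math.log(p) for p in stage_probs)


# Hypothetical values of p_bar_t(h; x) for a four-stage plan (assumed, not measured).
p_bar = [0.9, 0.8, 0.95, 0.7]

p_succ = success_probability(p_bar)
loss = process_loss(p_bar)

# The two quantities agree up to floating-point error.
print(math.isclose(-math.log(p_succ), loss))
```

Because the loss is a sum, a single stage with low conditional probability contributes a large summand regardless of how reliable the other stages are.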

### C\.2 Granularity–Capability Alignment

This appendix gives the formal statement and proof for the granularity–capability alignment principle of Section [4.1](https://arxiv.org/html/2605.21516#S4.SS1). The result quantifies how the harness’s choice of sub-goal scale, combined with the agent’s per-stage attempt budget, controls the final success probability through a per-stage mismatch penalty. Theorem [4](https://arxiv.org/html/2605.21516#Thmtheorem4) states the general bound under prefix-dependent stage quantities, taking an infimum over all recoverable prefixes; Remark [5](https://arxiv.org/html/2605.21516#Thmtheorem5) then specializes this to the deterministic regime that recovers the simplified form stated in the main text. Building on this specialization, Corollary [6](https://arxiv.org/html/2605.21516#Thmtheorem6) characterizes when a uniform $T$-step decomposition incurs no mismatch penalty: the sub-goal size $L_x/T$ must lie in the union of cumulative controllability windows reachable within the attempt budget, with deviations on either side producing the coarse-grained and fine-grained failure regimes observed in the main text.

###### Theorem 4 \(Granularity–capability alignment bound with finite attempts\)\.

Fix a task $x$ and a harness $h=(\kappa,\lambda,\psi)$. Let

$$\Delta_h(x)=(g_1,\dots,g_T),\qquad T=T_h(x),$$

be the harness-induced sub-goal sequence, and let $\tau_h(x)=(g_1,\tau_1,\dots,g_T,\tau_T)$ be the harness-conditioned execution trajectory. Recall that the completed prefix after the first $t$ harness-level sub-goals is

$$K_t=(x,g_1,\tau_1,\dots,g_t,\tau_t),\qquad K_0:=x.$$

For each stage $t$ and each completed prefix $K_{t-1}$ satisfying $B_{<t}$, suppose there exists a latent progress coordinate $\phi_x$. Define the intended latent progress required by the harness-specified sub-goal $g_t$ as

$$\ell_t(K_{t-1}):=\phi_x(g_t)-\phi_x(K_{t-1}).$$

At stage $t$, suppose the agent may make at most $M_t$ primitive attempts before the stage fails. For $m=1,\dots,M_t$, let $Z_{t,m}$ denote the cumulative latent progress made after $m$ primitive attempts within stage $t$, measured from the prefix $K_{t-1}$.

Assume the following conditions hold for every stage $t$ and almost surely over completed prefixes $K_{t-1}$ satisfying $B_{<t}$.

(i) Finite-attempt recoverability tube. There exists a tolerance $\varepsilon_t(K_{t-1})\geq 0$ such that preserving recoverability after executing sub-goal $g_t$ requires landing near the intended milestone within the finite attempt budget:

$$B_t\subseteq\bigcup_{m=1}^{M_t}\left\{\left|Z_{t,m}-\ell_t(K_{t-1})\right|\leq\varepsilon_t(K_{t-1})\right\}.$$

(ii) Cumulative controllability windows. For each $m=1,\dots,M_t$, there exist quantities

$$\mu_{t,m}^{-}(K_{t-1})\leq\mu_{t,m}^{+}(K_{t-1}),\qquad\sigma_{t,m}(K_{t-1})>0,$$

and a mean

$$\mu_{t,m}(K_{t-1})\in[\mu_{t,m}^{-}(K_{t-1}),\mu_{t,m}^{+}(K_{t-1})],$$

such that, for all $u\geq 0$,

$$P_h\left(Z_{t,m}-\mu_{t,m}(K_{t-1})\geq u\mid K_{t-1}\right)\leq\exp\left(-\frac{u^{2}}{2\sigma_{t,m}^{2}(K_{t-1})}\right),$$

and

$$P_h\left(\mu_{t,m}(K_{t-1})-Z_{t,m}\geq u\mid K_{t-1}\right)\leq\exp\left(-\frac{u^{2}}{2\sigma_{t,m}^{2}(K_{t-1})}\right).$$

(iii) Boundary loss. Crossing the $t$-th harness-specified sub-goal boundary incurs an irreducible coordination loss $\eta_t(K_{t-1})\geq 0$, so that

$$P_h(B_t\mid K_{t-1})\leq\exp\bigl(-\eta_t(K_{t-1})\bigr)\,P_h\left(\bigcup_{m=1}^{M_t}\left\{\left|Z_{t,m}-\ell_t(K_{t-1})\right|\leq\varepsilon_t(K_{t-1})\right\}\mid K_{t-1}\right).$$
For each $m=1,\dots,M_t$, define the prefix-wise finite-attempt mismatch

$$a_{t,m}(K_{t-1}):=\frac{\left(d\left(\ell_t(K_{t-1}),[\mu_{t,m}^{-}(K_{t-1}),\mu_{t,m}^{+}(K_{t-1})]\right)-\varepsilon_t(K_{t-1})\right)_{+}^{2}}{2\sigma_{t,m}^{2}(K_{t-1})},$$

where $d(\ell,I)$ is the distance from the point $\ell$ to the interval $I$ and $(u)_+:=\max\{u,0\}$, as used in the case analysis of the proof. Define the best finite-attempt mismatch at stage $t$ as

$$\rho_t^{(M)}(K_{t-1}):=\min_{1\leq m\leq M_t}a_{t,m}(K_{t-1}).$$

Let

$$\mathcal{R}_{t-1}(h,x):=\operatorname{supp}_{P_h(\cdot\mid B_{<t},x)}(K_{t-1})$$

denote the set of recoverable prefixes that can arise before stage $t$. Define the stage-wise certified finite-attempt granularity loss as

$$\gamma_t^{(M)}(h;x):=\inf_{K_{t-1}\in\mathcal{R}_{t-1}(h,x)}\left\{\eta_t(K_{t-1})+\left(\rho_t^{(M)}(K_{t-1})-\log M_t\right)_{+}\right\}.$$

Then the final success probability satisfies

$$-\log P_h(\mathrm{Succ}_x(\tau_h)\mid x)\geq\sum_{t=1}^{T_h(x)}\gamma_t^{(M)}(h;x).$$

Equivalently,

$$P_h(\mathrm{Succ}_x(\tau_h)\mid x)\leq\exp\left(-\sum_{t=1}^{T_h(x)}\gamma_t^{(M)}(h;x)\right).$$

###### Proof of Theorem [4](https://arxiv.org/html/2605.21516#Thmtheorem4)\.

Fix a task $x$ and a harness $h$, and write $T=T_h(x)$. All probabilities are taken under the harness-conditioned execution distribution induced by $q_h$.

Fix a stage $t$ and a completed prefix $K_{t-1}$ satisfying $B_{<t}$. For each $m=1,\dots,M_t$, define the finite-attempt tube event

$$E_{t,m}(K_{t-1}):=\left\{\left|Z_{t,m}-\ell_t(K_{t-1})\right|\leq\varepsilon_t(K_{t-1})\right\}.$$

The event that the agent hits the recoverability tube within the finite attempt budget is

$$E_t^{(M)}(K_{t-1}):=\bigcup_{m=1}^{M_t}E_{t,m}(K_{t-1}).$$

By the boundary-loss assumption,

$$P_h(B_t\mid K_{t-1})\leq\exp\bigl(-\eta_t(K_{t-1})\bigr)\,P_h(E_t^{(M)}(K_{t-1})\mid K_{t-1}).$$

It remains to upper bound the finite-attempt tube probability $P_h(E_t^{(M)}(K_{t-1})\mid K_{t-1})$.

Fix an attempt count $m\in\{1,\dots,M_t\}$. Let

$$I_{t,m}(K_{t-1}):=[\mu_{t,m}^{-}(K_{t-1}),\mu_{t,m}^{+}(K_{t-1})],$$

and define

$$d_{t,m}(K_{t-1}):=d\left(\ell_t(K_{t-1}),I_{t,m}(K_{t-1})\right).$$

We first show that

$$P_h(E_{t,m}(K_{t-1})\mid K_{t-1})\leq\exp\left(-a_{t,m}(K_{t-1})\right).$$
There are three cases.

First suppose

$$\ell_t(K_{t-1})>\mu_{t,m}^{+}(K_{t-1}).$$

Then

$$d_{t,m}(K_{t-1})=\ell_t(K_{t-1})-\mu_{t,m}^{+}(K_{t-1}).$$

On the event $E_{t,m}(K_{t-1})$ we have

$$Z_{t,m}\geq\ell_t(K_{t-1})-\varepsilon_t(K_{t-1}).$$

Since $\mu_{t,m}(K_{t-1})\leq\mu_{t,m}^{+}(K_{t-1})$, it follows that

$$Z_{t,m}-\mu_{t,m}(K_{t-1})\geq\ell_t(K_{t-1})-\varepsilon_t(K_{t-1})-\mu_{t,m}^{+}(K_{t-1})=d_{t,m}(K_{t-1})-\varepsilon_t(K_{t-1}).$$

If $d_{t,m}(K_{t-1})>\varepsilon_t(K_{t-1})$, the upper-tail sub-Gaussian bound gives

$$P_h(E_{t,m}(K_{t-1})\mid K_{t-1})\leq\exp\left(-\frac{\left(d_{t,m}(K_{t-1})-\varepsilon_t(K_{t-1})\right)^{2}}{2\sigma_{t,m}^{2}(K_{t-1})}\right).$$

If $d_{t,m}(K_{t-1})\leq\varepsilon_t(K_{t-1})$, the desired bound is trivial because the right-hand side equals $1$.

Second suppose

$$\ell_t(K_{t-1})<\mu_{t,m}^{-}(K_{t-1}).$$

Then

$$d_{t,m}(K_{t-1})=\mu_{t,m}^{-}(K_{t-1})-\ell_t(K_{t-1}).$$

On the event $E_{t,m}(K_{t-1})$ we have

$$Z_{t,m}\leq\ell_t(K_{t-1})+\varepsilon_t(K_{t-1}).$$

Since $\mu_{t,m}(K_{t-1})\geq\mu_{t,m}^{-}(K_{t-1})$, it follows that

$$\mu_{t,m}(K_{t-1})-Z_{t,m}\geq\mu_{t,m}^{-}(K_{t-1})-\ell_t(K_{t-1})-\varepsilon_t(K_{t-1})=d_{t,m}(K_{t-1})-\varepsilon_t(K_{t-1}).$$

The lower-tail sub-Gaussian bound, applied when the right-hand side is positive and combined with the trivial bound otherwise, gives

$$P_h(E_{t,m}(K_{t-1})\mid K_{t-1})\leq\exp\left(-\frac{\left(d_{t,m}(K_{t-1})-\varepsilon_t(K_{t-1})\right)_{+}^{2}}{2\sigma_{t,m}^{2}(K_{t-1})}\right).$$

Third suppose

$$\ell_t(K_{t-1})\in I_{t,m}(K_{t-1}).$$

Then $d_{t,m}(K_{t-1})=0$, and the desired bound reduces to

$$P_h(E_{t,m}(K_{t-1})\mid K_{t-1})\leq 1,$$

which is immediate.

Combining the three cases, for every $m=1,\dots,M_t$,

$$P_h(E_{t,m}(K_{t-1})\mid K_{t-1})\leq\exp\left(-a_{t,m}(K_{t-1})\right).$$
Now apply the union bound over the finite attempt budget:

$$P_h(E_t^{(M)}(K_{t-1})\mid K_{t-1})=P_h\left(\bigcup_{m=1}^{M_t}E_{t,m}(K_{t-1})\mid K_{t-1}\right)\leq\sum_{m=1}^{M_t}P_h(E_{t,m}(K_{t-1})\mid K_{t-1}).$$

Using the previous bound,

$$P_h(E_t^{(M)}(K_{t-1})\mid K_{t-1})\leq\sum_{m=1}^{M_t}\exp\left(-a_{t,m}(K_{t-1})\right).$$

Since

$$\rho_t^{(M)}(K_{t-1})=\min_{1\leq m\leq M_t}a_{t,m}(K_{t-1}),$$

we have

$$\sum_{m=1}^{M_t}\exp\left(-a_{t,m}(K_{t-1})\right)\leq M_t\exp\left(-\rho_t^{(M)}(K_{t-1})\right).$$

Therefore, since a probability is at most $1$,

$$P_h(E_t^{(M)}(K_{t-1})\mid K_{t-1})\leq\min\left\{1,\,M_t\exp\left(-\rho_t^{(M)}(K_{t-1})\right)\right\}.$$

Equivalently,

$$P_h(E_t^{(M)}(K_{t-1})\mid K_{t-1})\leq\exp\left(-\left(\rho_t^{(M)}(K_{t-1})-\log M_t\right)_{+}\right).$$

Combining this with the boundary-loss assumption gives

$$P_h(B_t\mid K_{t-1})\leq\exp\left(-\eta_t(K_{t-1})-\left(\rho_t^{(M)}(K_{t-1})-\log M_t\right)_{+}\right).$$
By the definition of $\gamma_t^{(M)}(h;x)$, for every $K_{t-1}\in\mathcal{R}_{t-1}(h,x)$,

$$\eta_t(K_{t-1})+\left(\rho_t^{(M)}(K_{t-1})-\log M_t\right)_{+}\geq\gamma_t^{(M)}(h;x).$$

Hence, almost surely on $B_{<t}$,

$$P_h(B_t\mid K_{t-1})\leq\exp\bigl(-\gamma_t^{(M)}(h;x)\bigr).$$

Taking expectation over $K_{t-1}$ conditional on $B_{<t}$ gives

$$\bar{p}_t(h;x)=P_h(B_t\mid B_{<t},x)\leq\exp\bigl(-\gamma_t^{(M)}(h;x)\bigr).$$

By the recoverability factorization in Section [3](https://arxiv.org/html/2605.21516#S3),

$$P_h(\mathrm{Succ}_x(\tau_h)\mid x)=\prod_{t=1}^{T_h(x)}\bar{p}_t(h;x).$$

Therefore,

$$P_h(\mathrm{Succ}_x(\tau_h)\mid x)\leq\prod_{t=1}^{T_h(x)}\exp\bigl(-\gamma_t^{(M)}(h;x)\bigr)=\exp\left(-\sum_{t=1}^{T_h(x)}\gamma_t^{(M)}(h;x)\right).$$

Taking negative logarithms proves the claim. ∎
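The two main steps of the proof can be checked numerically under one concrete, assumed model. The sketch below takes $Z_{t,m}$ to be Gaussian with its mean inside the controllability window; this is one distribution that satisfies the sub-Gaussian tails of condition (ii), but the theorem does not require it. It computes the exact tube probability, compares it with $\exp(-a_{t,m})$, and evaluates the stage factor $\exp(-\eta_t-(\rho_t^{(M)}-\log M_t)_+)$. All function names and numbers are illustrative.

```python
import math


def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def tube_probability(ell, eps, mu, sigma):
    """Exact P(|Z - ell| <= eps) when Z ~ N(mu, sigma^2) (assumed model)."""
    return normal_cdf((ell + eps - mu) / sigma) - normal_cdf((ell - eps - mu) / sigma)


def distance_to_interval(x, lo, hi):
    """Distance d(x, [lo, hi]) from a point to a closed interval."""
    return max(lo - x, 0.0, x - hi)


def mismatch(ell, eps, mu_lo, mu_hi, sigma):
    """a_{t,m} = (d(ell, [mu_lo, mu_hi]) - eps)_+^2 / (2 sigma^2)."""
    excess = max(distance_to_interval(ell, mu_lo, mu_hi) - eps, 0.0)
    return excess ** 2 / (2.0 * sigma ** 2)


def stage_bound(eta, mismatches):
    """Upper bound exp(-eta - (rho - log M)_+) on P(B_t | K_{t-1})."""
    rho = min(mismatches)
    return math.exp(-eta - max(rho - math.log(len(mismatches)), 0.0))


# Hypothetical stage: target progress 5.0, tolerance 0.5, window [1, 2].
ell, eps, mu_lo, mu_hi, mu, sigma = 5.0, 0.5, 1.0, 2.0, 1.5, 1.0
a = mismatch(ell, eps, mu_lo, mu_hi, sigma)
p_tube = tube_probability(ell, eps, mu, sigma)

# The exact tube probability should not exceed the certified bound exp(-a).
print(p_tube <= math.exp(-a))
```

The `stage_bound` helper also makes the role of the attempt budget visible: increasing the number of attempts raises $\log M_t$ and can cancel a moderate mismatch, but it never offsets the boundary loss $\eta_t$.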

###### Corollary 6 \(Failure of overly coarse and overly fine decompositions\)\.

Adopt the deterministic regime of Remark [5](https://arxiv.org/html/2605.21516#Thmtheorem5), and assume in addition that the harness uses a uniform $T$-step decomposition with $\ell_t=L_x/T$ for all $t$, that each primitive attempt has a one-step controllable window $[r^{-},r^{+}]$ and one-step variation $\sigma$ so that

$$[\mu_{t,m}^{-},\mu_{t,m}^{+}]=[mr^{-},mr^{+}],\qquad\sigma_{t,m}^{2}=m\sigma^{2},$$

and that $\eta_t=\eta$, $\varepsilon_t=\varepsilon$, $M_t=M$. Then

$$-\log P_h(\mathrm{Succ}_x(\tau_h)\mid x)\geq T\left[\eta+\left(\min_{1\leq m\leq M}\frac{(d(L_x/T,[mr^{-},mr^{+}])-\varepsilon)_{+}^{2}}{2m\sigma^{2}}-\log M\right)_{+}\right],$$

and the certified mismatch term vanishes if and only if

$$\frac{L_x}{T}\in\bigcup_{m=1}^{M}[mr^{-}-\varepsilon,\,mr^{+}+\varepsilon].$$

Avoiding both coarse-grained ($L_x/T$ above the union) and fine-grained ($L_x/T$ below the union) failure regimes therefore requires the uniform sub-goal size to be reachable by some cumulative progress scale within the finite attempt budget.

###### Proof of Corollary [6](https://arxiv.org/html/2605.21516#Thmtheorem6)\.

Substituting the uniform and additive assumptions into $\rho_t^{(M)}$ from Remark [5](https://arxiv.org/html/2605.21516#Thmtheorem5) gives

$$\rho_t^{(M)}=\min_{1\leq m\leq M}\frac{(d(L_x/T,[mr^{-},mr^{+}])-\varepsilon)_{+}^{2}}{2m\sigma^{2}},$$

which is independent of $t$. Plugging into the bound in Remark [5](https://arxiv.org/html/2605.21516#Thmtheorem5) and using $\eta_t=\eta$, $M_t=M$ for all $t$ yields the stated lower bound on $-\log P_h$.

The mismatch term $\rho_t^{(M)}$ vanishes precisely when $d(L_x/T,[mr^{-},mr^{+}])\leq\varepsilon$ for some $m\in\{1,\dots,M\}$, which is equivalent to $L_x/T\in[mr^{-}-\varepsilon,\,mr^{+}+\varepsilon]$ for that $m$. The union over $m$ therefore characterizes the set of sub-goal sizes for which no mismatch penalty is paid. Outside this union, taking $0\leq r^{-}\leq r^{+}$, the sub-goal size $L_x/T$ either exceeds the largest upper endpoint $Mr^{+}+\varepsilon$, so that even the full attempt budget cannot reach it (coarse-grained failure), or falls below the smallest lower endpoint $r^{-}-\varepsilon$, so that even a single attempt overshoots it (fine-grained failure), or, when consecutive windows do not overlap, lies in a gap between them. ∎
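To see the coarse-grained and fine-grained regimes side by side, the sketch below evaluates the right-hand side of the Corollary 6 bound for a few values of $T$. The function name `uniform_decomposition_bound` and every parameter value are assumptions chosen for illustration; they are not values used in the paper.

```python
import math


def distance_to_interval(x, lo, hi):
    """Distance d(x, [lo, hi]) from a point to a closed interval."""
    return max(lo - x, 0.0, x - hi)


def uniform_decomposition_bound(L_x, T, r_minus, r_plus, sigma, eta, eps, M):
    """Lower bound on -log P(success) for a uniform T-step decomposition."""
    ell = L_x / T  # uniform sub-goal size
    rho = min(
        max(distance_to_interval(ell, m * r_minus, m * r_plus) - eps, 0.0) ** 2
        / (2.0 * m * sigma ** 2)
        for m in range(1, M + 1)
    )
    return T * (eta + max(rho - math.log(M), 0.0))


# Hypothetical task and agent parameters (assumed for illustration only).
params = dict(L_x=12.0, r_minus=1.0, r_plus=2.0, sigma=0.2, eta=0.05, eps=0.1, M=2)

# T = 1 is very coarse, T = 4 puts L_x / T inside the window union, T = 24 is very fine.
bounds = {T: uniform_decomposition_bound(T=T, **params) for T in (1, 4, 24)}
for T, value in bounds.items():
    print(T, value)
```

With these assumed numbers, $T=4$ gives $L_x/T=3$, which lies inside $[0.9,4.1]$, so only the boundary term $T\eta$ remains; both the single-step and the 24-step decompositions pay an additional mismatch penalty.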

### C\.3 Guidance as Action\-Space Filtering

This appendix gives the formal statement and proof for the guidance–evidence alignment principle of Section [4.2](https://arxiv.org/html/2605.21516#S4.SS2). The result casts guidance as a reweighting of the base stage-level trajectory distribution and shows that its effect on stage recoverability is governed by a single one-dimensional quantity: the log-ratio of average retention weights on recoverable versus non-recoverable trajectories. Appendix [C.3.1](https://arxiv.org/html/2605.21516#A3.SS3.SSS1) sets up the filtering model, defining the guidance-filtered distribution $Q_{\lambda_t}$ and the conditional retention weights $\bar{W}_{t,R_t^{\mathrm{stg}}},\bar{W}_{t,R^{c}}$ that the rest of the analysis is built on. Appendix [C.3.2](https://arxiv.org/html/2605.21516#A3.SS3.SSS2) then uses these objects to prove the central identity, expressing the filtered recoverability probability as a sigmoid in the base log-odds plus the retention gap $\Gamma_{t,\lambda_t}(K_{t-1})$. Building on this identity, Appendix [C.3.3](https://arxiv.org/html/2605.21516#A3.SS3.SSS3) states the alignment theorem characterizing when guidance helps, hurts, or leaves stage recoverability unchanged, together with the process-level integration that lifts per-stage retention gaps into a final-success bound. Appendix [C.3.4](https://arxiv.org/html/2605.21516#A3.SS3.SSS4) closes by interpreting pseudo-compliant hallucination as a concrete instance of a negative retention gap.

#### C\.3\.1 Stage\-Level Filtering Model

Fix a task $x$, a harness stage $t$, and a completed prefix $K_{t-1}$ satisfying $B_{<t}$. At this stage, the harness specifies a sub-goal $g_t$. Let $Q_0(\tau_t\mid K_{t-1},g_t)$ be the base stage-level trajectory distribution.

Guidance is represented by a nonnegative measurable retention weight $W_{t,\lambda_t}(K_{t-1},g_t,\tau_t)\geq 0$, where $\lambda_t$ denotes the amount of guidance imposed at stage $t$. The guidance-filtered distribution is

$$Q_{\lambda_t}(\tau_t\mid K_{t-1},g_t)=\frac{Q_0(\tau_t\mid K_{t-1},g_t)\,W_{t,\lambda_t}(K_{t-1},g_t,\tau_t)}{Z_{t,\lambda_t}(K_{t-1},g_t)},\qquad Z_{t,\lambda_t}(K_{t-1},g_t):=\mathbb{E}_{Q_0}[W_{t,\lambda_t}],$$

and we assume throughout that $0<Z_{t,\lambda_t}<\infty$.

This formulation is intentionally general: it captures inference-time guidance mechanisms that retain, downweight, or upweight candidate stage trajectories. Hard action-space pruning is the special case where $W_{t,\lambda_t}$ is an indicator over remaining trajectories.

Let

$$\mathcal{R}_t^{\mathrm{stg}}(K_{t-1}):=\left\{\tau_t:K_t=(K_{t-1},g_t,\tau_t)\text{ remains recoverable}\right\}$$

denote the recoverable cross-section of the stage-level trajectory space, and assume the base recoverability probability is non-degenerate, $0<Q_0(\mathcal{R}_t^{\mathrm{stg}}\mid K_{t-1},g_t)<1$. Define the conditional average retention weights on the recoverable set and its complement,

$$\bar{W}_{t,R_t^{\mathrm{stg}}}(K_{t-1}):=\mathbb{E}_{Q_0(\cdot\mid\mathcal{R}_t^{\mathrm{stg}},K_{t-1},g_t)}[W_{t,\lambda_t}],\qquad\bar{W}_{t,R^{c}}(K_{t-1}):=\mathbb{E}_{Q_0(\cdot\mid\mathcal{R}_t^{c},K_{t-1},g_t)}[W_{t,\lambda_t}],$$

and assume $0<\bar{W}_{t,R_t^{\mathrm{stg}}},\bar{W}_{t,R^{c}}<\infty$. Boundary cases where one of these quantities is zero are handled by the same identity in the extended-real sense, with $\sigma(+\infty)=1$ and $\sigma(-\infty)=0$.

#### C\.3\.2 Action\-Space Filtering Identity

###### Lemma 7 \(Action\-space filtering identity\)\.

Under the conditions of the previous subsection, define the base recoverability log-odds

$$\omega_t^{0}(K_{t-1}):=\log\frac{Q_0(\mathcal{R}_t^{\mathrm{stg}}(K_{t-1})\mid K_{t-1},g_t)}{Q_0(\mathcal{R}_t^{\mathrm{stg}}(K_{t-1})^{c}\mid K_{t-1},g_t)}$$

and the guidance retention gap

$$\Gamma_{t,\lambda_t}(K_{t-1}):=\log\bar{W}_{t,R_t^{\mathrm{stg}}}(K_{t-1})-\log\bar{W}_{t,R^{c}}(K_{t-1}).$$

Then

$$Q_{\lambda_t}(\mathcal{R}_t^{\mathrm{stg}}(K_{t-1})\mid K_{t-1},g_t)=\sigma\bigl(\omega_t^{0}(K_{t-1})+\Gamma_{t,\lambda_t}(K_{t-1})\bigr),$$

where $\sigma(u)=1/(1+e^{-u})$.

###### Proof\.

Suppress the dependence on $(K_{t-1},g_t)$ and write $\mathcal{R}_t=\mathcal{R}_t^{\mathrm{stg}}(K_{t-1})$. By definition of the filtered distribution,

$$\frac{Q_{\lambda_t}(\mathcal{R}_t)}{Q_{\lambda_t}(\mathcal{R}_t^{c})}=\frac{Q_0(\mathcal{R}_t)\,\mathbb{E}_{Q_0(\cdot\mid\mathcal{R}_t)}[W_{t,\lambda_t}]}{Q_0(\mathcal{R}_t^{c})\,\mathbb{E}_{Q_0(\cdot\mid\mathcal{R}_t^{c})}[W_{t,\lambda_t}]},$$

since the two normalizers cancel. Taking logarithms gives

$$\log\frac{Q_{\lambda_t}(\mathcal{R}_t)}{Q_{\lambda_t}(\mathcal{R}_t^{c})}=\omega_t^{0}+\Gamma_{t,\lambda_t}(K_{t-1}),$$

and converting log-odds to probability via $\sigma$ yields the result. ∎
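The identity of Lemma 7 is easy to verify on a finite trajectory space. The sketch below computes the filtered recoverability probability in two ways: directly, by reweighting and renormalizing, and through $\sigma(\omega_t^{0}+\Gamma_{t,\lambda_t})$. The four-trajectory base distribution, the weights, and the function names are hypothetical and serve only as an illustration.

```python
import math


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def filtered_recoverability(q0, w, recoverable):
    """Q_lambda(R): reweight the base distribution by w, then normalize."""
    z = sum(q * wi for q, wi in zip(q0, w))
    kept = sum(q * wi for q, wi, r in zip(q0, w, recoverable) if r)
    return kept / z


def log_odds_and_gap(q0, w, recoverable):
    """Return (omega0, Gamma): base log-odds and guidance retention gap."""
    p_r = sum(q for q, r in zip(q0, recoverable) if r)
    p_c = sum(q for q, r in zip(q0, recoverable) if not r)
    # Conditional average retention weights on R and on its complement.
    w_r = sum(q * wi for q, wi, r in zip(q0, w, recoverable) if r) / p_r
    w_c = sum(q * wi for q, wi, r in zip(q0, w, recoverable) if not r) / p_c
    return math.log(p_r / p_c), math.log(w_r) - math.log(w_c)


# Hypothetical stage with four candidate trajectories (assumed values).
q0 = [0.1, 0.2, 0.3, 0.4]
w = [2.0, 1.0, 0.5, 0.25]
recoverable = [True, True, False, False]

direct = filtered_recoverability(q0, w, recoverable)
omega0, gamma = log_odds_and_gap(q0, w, recoverable)
via_identity = sigmoid(omega0 + gamma)

print(math.isclose(direct, via_identity))
```

In this toy case the weights favor the recoverable trajectories, so the retention gap is positive and the filtered probability exceeds the base probability of 0.3.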

#### C\.3\.3 Guidance–Action\-Space Alignment

###### Theorem 8 \(Guidance–action\-space alignment\)\.

Under the conditions of Lemma [7](https://arxiv.org/html/2605.21516#Thmtheorem7), guidance improves stage recoverability relative to the unguided base distribution at the same prefix if and only if $\Gamma_{t,\lambda_t}(K_{t-1})>0$, and harms it if and only if $\Gamma_{t,\lambda_t}(K_{t-1})<0$. More quantitatively, if $\Gamma_{t,\lambda_t}(K_{t-1})\geq\gamma$ for some $\gamma\geq 0$, then

$$Q_{\lambda_t}(\mathcal{R}_t^{\mathrm{stg}}(K_{t-1})\mid K_{t-1},g_t)\geq\sigma\bigl(\omega_t^{0}(K_{t-1})+\gamma\bigr),$$

and if $\Gamma_{t,\lambda_t}(K_{t-1})\leq-\gamma$, then

$$Q_{\lambda_t}(\mathcal{R}_t^{\mathrm{stg}}(K_{t-1})\mid K_{t-1},g_t)\leq\sigma\bigl(\omega_t^{0}(K_{t-1})-\gamma\bigr).$$

###### Proof\.

By Lemma [7](https://arxiv.org/html/2605.21516#Thmtheorem7), $Q_{\lambda_t}(\mathcal{R}_t)=\sigma(\omega_t^{0}+\Gamma_{t,\lambda_t}(K_{t-1}))$, while the unguided probability at the same prefix is $\sigma(\omega_t^{0})$. Both the qualitative claims and the quantitative bounds follow from the strict monotonicity of $\sigma$. ∎

#### C\.3\.4 Hallucination as a Special Case

The action-space filtering view also covers pseudo-compliant hallucination in content-producing tasks, where a stage trajectory $\tau_t$ may correspond to a generated answer, reasoning trace, tool-use sequence, or explanation. Let $\mathcal{M}_t(K_{t-1})\subseteq\mathcal{R}_t(K_{t-1})^{c}$ denote a hallucination basin: trajectories that appear compliant with the imposed guidance but introduce unsupported content and make the resulting prefix non-recoverable. If the guidance filter assigns high retention weight to this basin, it preserves trajectories that look locally compliant yet lie outside the recoverable region. When this raises the average retention weight on non-recoverable trajectories above that on recoverable ones, the retention gap is negative, $\Gamma_{t,\lambda_t}(K_{t-1})<0$. Pseudo-compliant hallucination is therefore one manifestation of the general failure mode characterized by Lemma [7](https://arxiv.org/html/2605.21516#Thmtheorem7) and Theorem [8](https://arxiv.org/html/2605.21516#Thmtheorem8).
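A small, entirely hypothetical example makes the hallucination basin concrete. Below, guidance keeps every trajectory that looks compliant (weight 1.0) and downweights the rest (weight 0.2). Because most compliant-looking mass sits in the non-recoverable basin, the retention gap comes out negative and the filtered recoverability probability falls below the base value. The trajectory groups, probabilities, and weights are assumptions made for illustration.

```python
import math


def retention_gap(q0, w, recoverable):
    """Gamma: log mean weight on recoverable set minus log mean weight on its complement."""
    p_r = sum(q for q, r in zip(q0, recoverable) if r)
    p_c = sum(q for q, r in zip(q0, recoverable) if not r)
    w_r = sum(q * wi for q, wi, r in zip(q0, w, recoverable) if r) / p_r
    w_c = sum(q * wi for q, wi, r in zip(q0, w, recoverable) if not r) / p_c
    return math.log(w_r) - math.log(w_c)


def filtered_recoverability(q0, w, recoverable):
    """Recoverable mass after reweighting by w and renormalizing."""
    z = sum(q * wi for q, wi in zip(q0, w))
    return sum(q * wi for q, wi, r in zip(q0, w, recoverable) if r) / z


# Trajectory groups (hypothetical):
#   0: recoverable but not compliant-looking   1: recoverable and compliant-looking
#   2: hallucination basin (compliant-looking) 3: other non-recoverable
q0 = [0.3, 0.1, 0.4, 0.2]
w = [0.2, 1.0, 1.0, 0.2]
recoverable = [True, True, False, False]

base = sum(q for q, r in zip(q0, recoverable) if r)
gap = retention_gap(q0, w, recoverable)
filtered = filtered_recoverability(q0, w, recoverable)

print(gap < 0, filtered < base)
```

The same filter would help if the compliant-looking mass were concentrated on recoverable trajectories instead, which is why the sign of the gap, rather than the amount of guidance, determines the outcome.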

### C\.4 Proofs and Details for Partial Harnessing

This appendix gives the formal statement and proof for the partial-harnessing principle of Section [4.3](https://arxiv.org/html/2605.21516#S4.SS3). Without additional structure, the marginal effect of adding one scaffolded stage, $F(m+1)-F(m)$, mixes the new scaffold cost, the residual tail-risk reduction, and any shift in earlier prefix distributions induced by changing the harness; the analysis isolates the first two by working on a homogeneous progress slice. Appendix [C.4.1](https://arxiv.org/html/2605.21516#A3.SS4.SSS1) introduces the slice and its factorization assumption, and proves a lemma that decomposes $F(m)$ into a per-stage scaffold cost plus a residual tail cost, from which the marginal identity $F(m+1)-F(m)=c_s-\Delta(m;M)$ follows. Building on this lemma, Appendix [C.4.2](https://arxiv.org/html/2605.21516#A3.SS4.SSS2) adds discrete convexity of the tail to prove Theorem [3](https://arxiv.org/html/2605.21516#Thmtheorem3), characterizing the unimodal success curve and the smallest scaffold count that maximizes reliability or attains a target $\alpha$. Appendix [C.4.3](https://arxiv.org/html/2605.21516#A3.SS4.SSS3) then closes the loop with the preceding two principles, expressing the scalar $c_s$ in terms of the granularity penalty and the guidance retention gap, and Appendix [C.4.4](https://arxiv.org/html/2605.21516#A3.SS4.SSS4) delineates the regimes in which the slice assumptions break and the marginal rule no longer applies.

#### C\.4\.1 Homogeneous Progress Slice

Fix a task $x$ with total latent progress demand $L_x$ and scaffold step size $s>0$, and let $N:=\lfloor L_x/s\rfloor$, $\mathcal{J}:=\{0,1,\ldots,N\}$. For each $m\in\mathcal{J}$, the partial harness $h_m$ executes $m$ scaffolded stages of length $s$ and hands the residual length $L_x-ms$ to the autonomous continuation policy. Let $A_t^{(m)}$ be the event that the $t$-th scaffolded stage under $h_m$ preserves recoverability, and let $A_{\mathrm{tail}}^{(m)}$ be the event that the autonomous tail succeeds.

###### Assumption 10 \(Positive homogeneous slice factorization\)\.

There exist scalarsqscaf​\(s;M\)∈\(0,1\]q\_\{\\mathrm\{scaf\}\}\(s;M\)\\in\(0,1\]andqtail​\(Lx−m​s;M\)∈\(0,1\]q\_\{\\mathrm\{tail\}\}\(L\_\{x\}\-ms;M\)\\in\(0,1\], withqtail​\(0;M\)=1q\_\{\\mathrm\{tail\}\}\(0;M\)=1, such that for everym∈𝒥m\\in\\mathcal\{J\}:\(i\) goal consistency:Succx​\(τh\)≡⋂t=1mAt\(m\)∩Atail\(m\)\\mathrm\{Succ\}\_\{x\}\(\\tau\_\{h\}\)\\equiv\\bigcap\_\{t=1\}^\{m\}A\_\{t\}^\{\(m\)\}\\cap A\_\{\\mathrm\{tail\}\}^\{\(m\)\};\(ii\) homogeneous scaffold reliability:Prhm⁡\(At\(m\)∣A<t\(m\),x\)=qscaf​\(s;M\)\\Pr\_\{h\_\{m\}\}\(A\_\{t\}^\{\(m\)\}\\mid A\_\{<t\}^\{\(m\)\},x\)=q\_\{\\mathrm\{scaf\}\}\(s;M\)fort=1,…,mt=1,\\ldots,m;\(iii\) residual\-tail reduction:Prhm⁡\(Atail\(m\)∣A1\(m\),…,Am\(m\),x\)=qtail​\(Lx−m​s;M\)\\Pr\_\{h\_\{m\}\}\(A\_\{\\mathrm\{tail\}\}^\{\(m\)\}\\mid A\_\{1\}^\{\(m\)\},\\ldots,A\_\{m\}^\{\(m\)\},x\)=q\_\{\\mathrm\{tail\}\}\(L\_\{x\}\-ms;M\)\.

The assumption compresses prefix-level behavior into two scalar quantities: scaffolded stages have the same effective success probability along the slice, and the tail depends only on residual length, not on the realized prefix. Define

$$c_s:=-\log q_{\mathrm{scaf}}(s;M),\qquad\kappa_{\mathrm{tail}}(d;M):=-\log q_{\mathrm{tail}}(d;M),$$

both finite on the slice grid with $\kappa_{\mathrm{tail}}(0;M)=0$.

###### Lemma 11 (Slice factorization and marginal identity).

Under Assumption [10](https://arxiv.org/html/2605.21516#Thmtheorem10), for every $m\in\mathcal{J}$,

$$F(m):=-\log\Pr_{h_m}(\mathrm{Succ}_x(\tau_h)\mid x)=mc_s+\kappa_{\mathrm{tail}}(L_x-ms;M),$$

and for $m,m+1\in\mathcal{J}$,

$$F(m+1)-F(m)=c_s-\Delta(m;M),\qquad\Delta(m;M):=\kappa_{\mathrm{tail}}(L_x-ms;M)-\kappa_{\mathrm{tail}}(L_x-(m+1)s;M).$$

Adding the $(m+1)$-st scaffolded stage strictly improves reliability if and only if $\Delta(m;M)>c_s$.

###### Proof\.

By (i) and the chain rule, $\Pr_{h_m}(\mathsf{Succ}\mid x)=\prod_{t=1}^{m}\Pr_{h_m}(A_t^{(m)}\mid A_{<t}^{(m)},x)\cdot\Pr_{h_m}(A_{\mathrm{tail}}^{(m)}\mid A_1^{(m)},\ldots,A_m^{(m)},x)$. Substituting (ii) and (iii) gives $\Pr_{h_m}(\mathsf{Succ}\mid x)=q_{\mathrm{scaf}}(s;M)^{m}\,q_{\mathrm{tail}}(L_x-ms;M)$, and taking negative logarithms gives $F(m)=mc_s+\kappa_{\mathrm{tail}}(L_x-ms;M)$. The marginal identity follows by direct subtraction, and the characterization $\Delta(m;M)>c_s$ follows from $F(m+1)<F(m)\iff c_s<\Delta(m;M)$. ∎
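The identity in Lemma 11 can be checked numerically. The sketch below is illustrative only: the tail reliability `q_tail(d) = exp(-a d^2)`, the constant `a`, and the values of `L`, `s`, and `q_scaf` are assumptions chosen for the example and do not come from the paper's experiments.

```python
import math

def q_tail(d, a=0.05):
    # Hypothetical tail reliability with q_tail(0) = 1 (an assumption, not from the paper).
    return math.exp(-a * d * d)

def success_prob(m, L, s, q_scaf):
    # Assumption 10: m scaffolded stages, each succeeding w.p. q_scaf, then the tail.
    return q_scaf ** m * q_tail(L - m * s)

def F(m, L, s, q_scaf):
    # Negative log success probability of the partial harness h_m.
    return -math.log(success_prob(m, L, s, q_scaf))

def delta(m, L, s):
    # Delta(m) = kappa_tail(L - m s) - kappa_tail(L - (m + 1) s).
    kappa = lambda d: -math.log(q_tail(d))
    return kappa(L - m * s) - kappa(L - (m + 1) * s)

L, s, q_scaf = 6.0, 1.0, 0.8   # illustrative values only
c_s = -math.log(q_scaf)
N = math.floor(L / s)

gaps = []
for m in range(N):
    lhs = F(m + 1, L, s, q_scaf) - F(m, L, s, q_scaf)
    rhs = c_s - delta(m, L, s)
    gaps.append(abs(lhs - rhs))
    # Third column: does adding stage m + 1 strictly improve reliability?
    print(m, round(lhs, 4), delta(m, L, s) > c_s)
```

Each entry of `gaps` should be zero up to floating-point error, since both sides are the same quantity computed two ways.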

#### C.4.2 Proof of Theorem [3](https://arxiv.org/html/2605.21516#Thmtheorem3)

###### Assumption 12 (Convex tail risk on the slice).

On the grid $\{L_x-ms:m\in\mathcal{J}\}$, the function $d\mapsto\kappa_{\mathrm{tail}}(d;M)$ is finite, nondecreasing, and convex, with $\kappa_{\mathrm{tail}}(0;M)=0$.

###### Proof\.

By Lemma [11](https://arxiv.org/html/2605.21516#Thmtheorem11), $F(m+1)-F(m)=c_s-\Delta(m;M)$.

##### Discrete convexity\.

Let $d_m:=L_x-ms$. By Assumption [12](https://arxiv.org/html/2605.21516#Thmtheorem12), $d\mapsto\kappa_{\mathrm{tail}}(d;M)$ is convex on the slice grid, so the discrete increment $d\mapsto\kappa_{\mathrm{tail}}(d;M)-\kappa_{\mathrm{tail}}(d-s;M)$ is nondecreasing in $d$. Since $d_m$ decreases as $m$ increases, $\Delta(m;M)=\kappa_{\mathrm{tail}}(d_m;M)-\kappa_{\mathrm{tail}}(d_m-s;M)$ is nonincreasing in $m$, hence $c_s-\Delta(m;M)$ is nondecreasing. The forward difference of $F$ is nondecreasing, so $F$ is discrete-convex; equivalently, $\exp(-F(m))$ is log-concave and unimodal on $\mathcal{J}$.

##### Smallest reliability maximizer\.

For a discrete-convex sequence, the minimizer set is an interval whose smallest element is the first index at which the forward difference becomes nonnegative. By Lemma [11](https://arxiv.org/html/2605.21516#Thmtheorem11), $F(m+1)-F(m)\geq 0\iff\Delta(m;M)\leq c_s$, giving

$$m_{\mathrm{peak}}=\min\{m\in\mathcal{J}:m+1\in\mathcal{J},\ \Delta(m;M)\leq c_s\},$$

with $m_{\mathrm{peak}}=\max\mathcal{J}$ if the set is empty.

##### Minimum scaffold for target reliability\.

For $\alpha\in(0,1)$, $P_m(x)\geq\alpha\iff F(m)\leq-\log\alpha$. The feasible set $\mathcal{J}_\alpha=\{m\in\mathcal{J}:F(m)\leq-\log\alpha\}$ is nonempty iff the target is achievable, in which case $m_\alpha=\min\mathcal{J}_\alpha$. ∎
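As a rough illustration of the two quantities characterized in the proof, the following sketch computes $m_{\mathrm{peak}}$ by the first-crossing rule and $m_\alpha$ by scanning the feasible set. The quadratic tail cost `kappa_tail`, the helper names, and all numeric values are hypothetical choices that satisfy Assumption 12; they are not the paper's setup.

```python
import math

def kappa_tail(d, a=0.05):
    # Hypothetical convex, nondecreasing tail cost with kappa_tail(0) = 0.
    return a * d * d

def F(m, L, s, c_s):
    # Lemma 11: F(m) = m * c_s + kappa_tail(L - m * s).
    return m * c_s + kappa_tail(L - m * s)

def m_peak(L, s, c_s):
    # Smallest m with m + 1 in J and Delta(m) <= c_s; max(J) if no such m exists.
    N = math.floor(L / s)
    for m in range(N):
        delta = kappa_tail(L - m * s) - kappa_tail(L - (m + 1) * s)
        if delta <= c_s:
            return m
    return N

def m_alpha(alpha, L, s, c_s):
    # Smallest scaffold count whose success probability reaches alpha, or None.
    N = math.floor(L / s)
    feasible = [m for m in range(N + 1) if F(m, L, s, c_s) <= -math.log(alpha)]
    return min(feasible) if feasible else None

L, s = 6.0, 1.0                 # illustrative values only
c_s = -math.log(0.8)            # per-stage scaffold cost for q_scaf = 0.8
N = math.floor(L / s)

peak = m_peak(L, s, c_s)
brute = min(range(N + 1), key=lambda m: F(m, L, s, c_s))  # direct argmin of F
print(peak, brute, m_alpha(0.3, L, s, c_s), m_alpha(0.5, L, s, c_s))
```

With these assumed numbers the first-crossing rule and the direct argmin agree on an interior scaffold count, and the 0.5 target is unreachable for every $m$, so `m_alpha` returns `None`.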

#### C.4.3 Connection to Local Alignment Results

The slice scalar $c_s=-\log\Pr(A_s)$, with $A_s$ the event that one scaffolded stage of length $s$ preserves recoverability, exposes how the local alignment results of Sections [4.1](https://arxiv.org/html/2605.21516#S4.SS1) and [4.2](https://arxiv.org/html/2605.21516#S4.SS2) drive the global trade-off. The granularity bound certifies $\Pr(A_s)\leq\exp(-\gamma_s)$ when the scaffolded sub-goal lies outside the agent's reachable scales, hence $c_s\geq\gamma_s$. The guidance identity gives $\Pr(A_s\mid K_{t-1})=\sigma(\omega^0_t(K_{t-1})+G_t(K_{t-1}))$, so a negative retention gap raises $c_s$ while a positive gap lowers it. The slice criterion $\Delta(m;M)>c_s$ therefore summarizes the two local effects as a single global stopping condition: alignment makes additional scaffolded stages cheaper, misalignment makes them more expensive.
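To make the effect of the retention gap on the stopping condition concrete, the sketch below treats the guidance identity as constant along the slice, so that $c_s=-\log\sigma(\omega^0+G)$. It assumes $\sigma$ is the logistic sigmoid; the base logit `omega0`, the gap values, and the quadratic tail cost are hypothetical.

```python
import math

def stage_cost(omega0, gap):
    # c_s = -log sigma(omega0 + gap) with sigma the logistic sigmoid (assumed),
    # computed as log(1 + exp(-z)).
    return math.log1p(math.exp(-(omega0 + gap)))

def kappa_tail(d, a=0.05):
    # Hypothetical convex tail cost with kappa_tail(0) = 0.
    return a * d * d

def m_peak(L, s, c_s):
    # First m with Delta(m) <= c_s; scaffold every stage if none exists.
    N = math.floor(L / s)
    for m in range(N):
        if kappa_tail(L - m * s) - kappa_tail(L - (m + 1) * s) <= c_s:
            return m
    return N

L, s, omega0 = 6.0, 1.0, 1.0    # illustrative values only
gaps = [-1.0, 0.0, 1.0]         # negative, zero, and positive retention gap
costs = [stage_cost(omega0, g) for g in gaps]
peaks = [m_peak(L, s, c) for c in costs]
for g, c, p in zip(gaps, costs, peaks):
    print(g, round(c, 4), p)
```

In this toy setting a more positive gap lowers the per-stage cost and pushes the stopping point later, in line with the qualitative statement above.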

#### C.4.4 When the Slice Assumptions Fail

The slice rule rests on Assumptions 10 and 12 and can fail when either is violated. *Non-nested or behavior-shifting workflows:* if $h_{m+1}$ is not a nested truncation of $h_m$, or if added later instructions alter the agent's earlier behavior (through prompt length, attention, or reinterpretation of the goal), then the first $m$ stages under $h_m$ and $h_{m+1}$ have different distributions, violating the scalar cost model $F(m)=mc_s+\kappa_{\mathrm{tail}}$. *Prefix-dependent tail reliability:* if tail success depends on the realized prefix rather than only on residual length, $F(m+1)-F(m)$ contains additional terms beyond $c_s$ and $\Delta(m;M)$, and the simple rule $\Delta(m;M)>c_s$ no longer applies. *Non-convex tail risk:* reliability cliffs, irreversible overshoot, discrete action scales, or prefix-dependent repair can break the convexity of $\kappa_{\mathrm{tail}}$; $\Delta(m;M)$ may then fail to be monotone in $m$ and $P_m(x)$ may be multi-modal. In any of these regimes, the safe procedure is to estimate or compare $F(m)$ directly across candidate scaffold counts.
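The non-convex case can be illustrated with a small sketch. The table `KAPPA` below is a made-up tail cost with a reliability cliff between residual lengths 2 and 3; it is nondecreasing but not convex, so the first-crossing rule and the direct comparison of $F(m)$ can disagree. All values are hypothetical.

```python
import math

# Hypothetical tail cost kappa_tail(d) for d = 0..6 with a cliff between d = 2 and d = 3.
KAPPA = [0.0, 0.1, 0.2, 1.5, 1.6, 1.7, 1.8]
c_s = -math.log(0.8)            # illustrative per-stage scaffold cost
N = len(KAPPA) - 1              # unit step size, so residual length is N - m

# F(m) = m * c_s + kappa_tail(N - m) for every candidate scaffold count.
F = [m * c_s + KAPPA[N - m] for m in range(N + 1)]

# Delta(m) = kappa_tail(N - m) - kappa_tail(N - m - 1).
delta = [KAPPA[N - m] - KAPPA[N - m - 1] for m in range(N)]

# First-crossing rule, which is only justified under convexity.
rule_m = next((m for m in range(N) if delta[m] <= c_s), N)

# Direct comparison across all candidates.
best_m = min(range(N + 1), key=lambda m: F[m])

print(rule_m, best_m, [round(v, 3) for v in F])
```

Here the marginal rule stops before the cliff is crossed, while the direct comparison picks the scaffold count that carries the agent past it. In practice $F(m)$ would have to be estimated, for example from empirical pass rates, rather than read from a known table.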

## Appendix D Prompt Details

We list the two harness templates used in our experiments below. They share a common skeleton but isolate different variables: the first holds *guidance* fixed and varies *workflow* step depth (granularity experiments), while the second holds *workflow* fixed and varies the *guidance* string (guidance experiments). Placeholders in curly braces are filled in per task at evaluation time.

**Harness Template on Granularity Experiments**

Task: {task}

Guidance:

- Follow the provided workflow.
- Inspect the evaluator and target files before choosing an implementation strategy.
- Reproduce the failure or run a minimal check before editing.
- Make the smallest local change that passes the focused evaluation without touching unrelated code.

Workflow: {workflow}

**Harness Template on Guidance Experiments**

Task: Analyze the provided Plotly chart structure and produce a concise, evidence-grounded conclusion.

Guidance: {guidance}

Workflow:

1. Identify the chart type and the available variables.
2. State what can and cannot be inferred from the provided information.
3. Give a concise conclusion without inventing unsupported distributional details.
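One way the placeholders in the granularity template might be filled is sketched below. The function `render_harness`, the example task, and the example steps are hypothetical. Truncating the workflow to its first `m` steps is an assumption about how a partial harness could be instantiated, not a description of the authors' code.

```python
GRANULARITY_TEMPLATE = """Task: {task}
Guidance:
- Follow the provided workflow.
- Inspect the evaluator and target files before choosing an implementation strategy.
- Reproduce the failure or run a minimal check before editing.
- Make the smallest local change that passes the focused evaluation without touching unrelated code.
Workflow:
{workflow}"""

def render_harness(task, steps, m=None):
    # Fill the template; if m is given, keep only the first m workflow steps
    # (assumed way to express a partial harness).
    kept = steps if m is None else steps[:m]
    workflow = "\n".join(f"{i}. {step}" for i, step in enumerate(kept, start=1))
    return GRANULARITY_TEMPLATE.format(task=task, workflow=workflow)

# Hypothetical task and workflow steps, for illustration only.
steps = [
    "Read the failing test.",
    "Locate the function under test.",
    "Patch the function.",
    "Re-run the test.",
]
prompt = render_harness("Fix the failing unit test.", steps, m=2)
print(prompt)
```

Calling `render_harness` with `m=None` yields the fully structured workflow, so the same helper covers both ends of the granularity sweep.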
