@omarsar0: // Self-Harness: Harnesses That Improve Themselves // (bookmark this one) Most of the agent scaffolds we rely on today …

X AI KOLs Following 06/09/26, 07:30 PM Papers

self-harness llm-agents agent-scaffolding self-improving research-paper ai-agents

Summary

This paper introduces Self-Harness, a new paradigm where LLM-based agents iteratively improve their own operating harness—prompts, tools, and control flow—without human engineers or stronger external agents, achieving significant performance gains across multiple models.

// Self-Harness: Harnesses That Improve Themselves // (bookmark this one) Most of the agent scaffolds we rely on today are built once and remain frozen or mostly unchanged. The harness, like the skills, needs to evolve with new models. What if the scaffold rewrites itself? This new work treats the harness, the prompts, tools, and control flow around the model as a learnable artifact that improves from its own runs rather than staying a fixed wrapper you hand-maintain. The scaffolding becomes the part that compounds, run after run. If you run long-horizon agents, a self-modifying harness turns scaffold upkeep from manual work into something the system earns on its own. Paper: https://arxiv.org/abs/2606.09498 Learn to build effective AI agents in our academy: https://academy.dair.ai

Original Article

View Cached Full Text

Cached at: 06/10/26, 12:17 AM

// Self-Harness: Harnesses That Improve Themselves //

(bookmark this one)

Most of the agent scaffolds we rely on today are built once and remain frozen or mostly unchanged.

The harness, like the skills, needs to evolve with new models.

What if the scaffold rewrites itself?

This new work treats the harness, the prompts, tools, and control flow around the model as a learnable artifact that improves from its own runs rather than staying a fixed wrapper you hand-maintain.

The scaffolding becomes the part that compounds, run after run. If you run long-horizon agents, a self-modifying harness turns scaffold upkeep from manual work into something the system earns on its own.

Paper: https://arxiv.org/abs/2606.09498

Learn to build effective AI agents in our academy: https://academy.dair.ai

Self-Harness: Harnesses That Improve Themselves

Source: https://arxiv.org/html/2606.09498 Hangfan Zhang, Shao Zhang, Kangcong Li, Chen Zhang, Yang Chen,Yiqun Zhang,Lei Bai,Shuyue Hu11footnotemark:1 Shanghai Artificial Intelligence Laboratory {zhanghangfan,zhangshao,hushuyue}@pjlab.org.cn

Abstract

The performance of LLM-based agents is jointly shaped by their base models and the harnesses that mediate their interaction with the environment. Because different models exhibit distinct behaviors, effective harness design is inherently model-specific. Yet agent harnesses are still largely engineered by human experts, a paradigm that scales poorly as modern LLMs become increasingly diverse and rapidly evolving. In this paper, we introduceSelf-Harness, a new paradigm in which an LLM-based agent improves its own operating harness, without relying on human engineers or stronger external agents. We operationalize Self-Harness as an iterative loop with three stages:Weakness Mining, which identifies model-specific failure patterns from execution traces;Harness Proposal, which generates diverse yet minimal harness modifications tied to these failures; andProposal Validation, which accepts candidate edits only after regression testing. We instantiate Self-Harness on Terminal-Bench-2.0 using a minimal initial harness and three base models from diverse families: MiniMax M2.5, Qwen3.5-35B-A3B, and GLM-5. Across all three models, Self-Harness consistently improves performance, with held-out pass rates increasing from 40.5% to 61.9%, 23.8% to 38.1%, and 42.9% to 57.1%, respectively. Qualitative analyses further show that Self-Harness does not simply add generic instructions, but effectively turns model-specific weaknesses into concrete, executable harness changes. These results suggest a path toward LLM-based agents that are not merely shaped by their harnesses, but can also participate in reshaping them.

For a conscious being, to exist is to change, to change is to mature, to mature is to go on creating oneself endlessly. —Henri Bergson, Creative Evolution

1Introduction

To date, LLM-based agents are not shaped by their base model alone, but also by theirharness: the surrounding system that situates the model and mediates its interaction with the environment. Although there is no universally accepted definition, a harness may include system prompts, tools, runtime mechanisms, verification rules, orchestration logic, and failure-recovery procedures. The same base model can thus exhibit substantially different performance under different harnesses[28,5,8].

From early frameworks such as ReAct[29]to product- and platform-level systems such as Claude Code, Codex, and OpenHands, harnesses have largely been engineered by human experts[9,16,24,36,35]. While effective, this human-centered paradigm does not scale well with the diversity and rapid evolution of modern LLMs. Different models can exhibit distinct behavioral patterns, tool-use habits, error modes, and sensitivities to prompting[22,21,18]; consequently, a harness that works well for one model may be suboptimal for another[22,5,8]. As new models continue to be released at a rapid pace, manually redesigning and tuning a model-specific harness for each model becomes increasingly costly and untenable.

Refer to caption Figure 1:Three paradigms of harness improvement. In human harness engineering, human engineers manually revise the agent harness. In Meta-Harness, a stronger external agent guides the improvement of a weaker target agent. In Self-Harness, the agent improves its own operating harness.In this paper, we explore a novel paradigm,Self-Harness: enabling an LLM-based agent to improve the very harness through which it operates (Figure1). Unlike recent approaches that use stronger external agents to improve the harnesses of weaker ones[5,8], Self-Harness seeks to internalize this improvement loop within the target agent itself. This paradigm reduces dependence on external guidance that may be costly, unavailable for frontier models, or mismatched to the target model’s failure modes. More broadly, in Bergson’s terms, this points toward a technical analogue of self-creation: a system not merely changed from without, but continually “going on creating itself.”

We operationalize Self-Harness as an improvement loop that repeatedly turns behavioral evidence into harness updates (Figure2). The loop consists of three stages.Weakness Mining:Starting from an initial harness, the agent with a fixed model is run on a set of tasks, producing execution traces with verifiable outcomes. The agent then clusters failed traces, allowing it to reason about model-specific failure patterns rather than isolated mistakes.Harness Proposal:Based on these failure patterns, the agent generates a small set of diverse yet minimal harness modifications, each tied to a specific failure mechanism. This constraint ensures that proposed edits remain targeted rather than overly general.Proposal Validation:Candidate modifications are evaluated through regression tests, and an edit is promoted only if it improves performance without causing measurable degradation on held-out tasks. If multiple candidate modifications pass the regression tests, they are merged into the next version of the harness, which then serves as the starting point for the next iteration.

In our experiments, we instantiate Self-Harness with a minimal initial harness (Figure3) and three base models from diverse families: MiniMax M2.5, Qwen3.5-35B-A3B, and GLM-5[2,20,14]. On Terminal-Bench-2.0, Self-Harness consistently improves performance across all three models (Figure4). For held-in tasks, which provide execution traces to the evaluation system, the pass rate is increased from 43.0% to 50.0% for MiniMax M2.5, from 15.1% to 36.0% for Qwen3.5-35B-A3B, and from 47.7% to 57.0% for GLM-5. For held-out tasks, whose execution traces are never used as inputs to the evaluation system, the improvements remain substantial. The pass rate is increased from 40.5% to 61.9% for MiniMax M2.5, from 23.8% to 38.1% for Qwen3.5-35B-A3B, and from 42.9% to 57.1% for GLM-5. These results indicate that Self-Harness can evolve an initial harness into model-specific ones better suited to different base models. Moreover, it can discover broadly useful harness modifications that generalize to unseen tasks rather than merely overfitting to observed evaluation failures.

Qualitative analyses further show that Self-Harness does more than simply make the prompt longer or add generic instructions. Instead, it introduces targeted changes that reflect the recurring problems each model encounters during execution, turning model-specific weaknesses into concrete harness-level interventions. For MiniMax M2.5, the changes encourage the agent to create required output files earlier, handle structured tool outputs more carefully, and stop unproductive tool-use loops before they become too long. For Qwen3.5-35B-A3B, the changes focus on checking dependencies in advance, avoiding repeated failed commands, breaking cycles of endless exploration, and reminding the agent to produce the required artifacts after tool errors. For GLM-5, the changes mainly help the agent preserve environment settings across shell commands and move more quickly from exploration to implementation and testing. Notably, Self-Harness can also introduce broader structural mechanisms, such as subagent-based decomposition and middleware creation, that go beyond local failure repair and improve the overall organization of problem solving.

To summarize, our key contributions are as follows:

•We propose Self-Harness, a novel paradigm for harness improvement that enables an LLM-based agent to design and refine the harness through which it operates, tailoring it to its own base model without human engineering effort or guidance from a stronger external agent.
•We operationalize Self-Harness as an iterative loop that turns each model’s behavioral evidence into model-specific harness updates: it evaluates execution traces to identify recurring failure patterns, generates diverse yet minimal candidate edits, and promotes only those that pass regression tests.
•Experiments on Terminal-Bench-2.0 show that Self-Harness improves performance across 3 models from diverse families, with absolute gains of up to 21.4 percentage points and relative improvements of up to 138%; qualitative analyses further confirm that different models benefit from distinct harness changes, suggesting that Self-Harness can turn model-specific weaknesses into concrete harness changes.

2Background and Related Work

From prompts to agent harnesses.

Prompt engineering and context engineering show that fixed models can be steered by instructions, demonstrations, retrieved evidence, memory, tool state, and dynamically constructed inputs[10,25,21,6,17,12,26,7]. Agentic systems extend this control surface from a single input to an execution environment: the model acts, observes consequences, uses tools, receives feedback, and follows runtime policies. ReAct, SWE-agent, Claude Code, and SemaClaw/OpenClaw illustrate how such surrounding mechanisms shape long-horizon agent behavior and software-engineering performance[29,28,9,36].

We useharnessfor this surrounding system layer: prompts, tools, memory, verification rules, permission policies, adapters, and runtime mechanisms that mediate between the model and the environment. Many important agent failures are failures of this layer rather than failures of an isolated model response: an agent may report success without checking an artifact, retry an unproductive action pattern, lose the source of truth in a long context, or lack a recovery action. These behaviors emerge from the interaction between instructions, observations, tools, and runtime control, so improving them requires changing more than prompt text.

Self-improving agents and automated agent design.

A growing line of work studies systems that adapt their inputs, memories, contexts, or workflows over time[23,32,34,31]. Reflexion stores verbal feedback for later attempts[23], agentic context engineering evolves contexts for later model calls[34], and STOP studies recursive self-improvement for code generation[31]. These methods show that fixed models can benefit from accumulated feedback, but the adapted object is usually a response strategy, memory, context, or generated program rather than a declared agent harness state.

A second line optimizes agent designs from outside the evaluated agent. Automated Design of Agentic Systems searches over agent designs, language agents can be represented as optimizable graphs, and Meta-Harness directly optimizes harness code using source code, scores, and traces from prior candidates[3,37,5]. These systems motivate harness-level optimization, but they frame improvement as an external search or optimization process rather than as a bounded edit proposed by the evaluated model under its current harness.

Finally, scientific discovery and self-evolving agent systems such as The AI Scientist, AI Scientist-v2, AlphaEvolve, Alita, Godel Agent, and Darwin Godel Machine automate broader loops of research, algorithm design, or capability expansion[11,27,15,19,30,33,1]. Self-Harness is closest in spirit to this self-improvement literature and to automated harness optimization, but it studies a narrower controlled setting: whether the same fixed model, operating under the current harness, can propose a bounded candidate change to the harness that governs its own future behavior.

3Self-Harness: An Iterative Loop for Model-Specific Harness Improvement

Human harness engineering improves agent harnesses through expert inspection and manual revision, while external optimizer approaches treat harness design choices as a searchable space. Self-Harness studies a middle ground in which a fixed model iteratively improves the harness around itself through an explicit self-improvement loop driven by execution evidence. In each iteration, the evaluation system runs the current harness and mines recurring failure patterns from clustered execution traces to produce structured evidence. Given this evidence, the same model is invoked in a proposer role to generate a set of diverse yet minimal candidate harness modifications, each targeting a specific failure mechanism without replacing the overall control architecture. Candidate edits are then validated through regression testing on held-out tasks, and an explicit acceptance rule promotes only those edits that improve performance without introducing unacceptable regressions.

3.1Preliminary

We useharnessto denote the non-parametric scaffolding that governs how a fixed language model is deployed as an agent. A harness includes the instructions, the available tools, memory and state-management mechanisms, etc. The harness does not modify the model parameters; instead, it specifies the execution protocol through which the model observes a task, takes actions, invokes tools, checks intermediate artifacts, and produces a final answer.

Formally, letMMbe a fixed language model and lethhdenote an agent harness. Given a task instancexx, runningMMunder harnesshhproduces an execution traceτ\tauand an outputyy. The trace records the messages, tool calls, and verifier outcomes. An evaluator then maps the task, trace, and output to a behavioral outcome, such as pass/fail. In this work, the modelMMand evaluatorℰ\mathcal{E}are held fixed, while the harness is treated as the object of improvement. Self-Harness therefore operates over a lineage of harnessesh0,h1,…h_{0},h_{1},\ldots, where each transition corresponds to a bounded edit to the execution protocol rather than an update to the model weights.

Refer to caption Figure 2:Overview of one Self-Harness optimization loop. The current harnesshth_{t}with fixed model is evaluated on tasks to collect execution traces, which are clustered into verifier-grounded failure patterns. The same model is then invoked under the current harness as a proposer, using the mined failure patterns to generate bounded candidate harness edits. Candidate edits are evaluated by regression tests on held-in and held-out splits. Accepted candidates are merged to update the harness toht+1h_{t+1}, while rejected candidates are logged without changing the active harness. Throughout the loop, the model weights and evaluator remain fixed; only the surrounding harness is modified.Algorithm 1Self-Harness1:fixed model

MM, initial harness

h0h_{0}, held-in split

DinD_{\mathrm{in}}, held-out split

DhoD_{\mathrm{ho}}, evaluator

ℰ\mathcal{E}, proposal width

KK, rounds

TT 2:final harness

hTh_{T} 3:for

t=0,1,…,T−1t=0,1,\ldots,T-1do

(Pin(ht),Pho(ht),Rt)←Evaluate(M,ht,Din,Dho,ℰ)(P_{\mathrm{in}}(h_{t}),P_{\mathrm{ho}}(h_{t}),R_{t})\leftarrow\textsc{Evaluate}(M,h_{t},D_{\mathrm{in}},D_{\mathrm{ho}},\mathcal{E}) 5:

Bt←BuildEvidenceBundle(Rt)B_{t}\leftarrow\textsc{BuildEvidenceBundle}(R_{t})⊳\trianglerightfrom held-in verifier-grounded failures

𝒫t←ParallelPropose(M,ht,Bt,K)\mathcal{P}_{t}\leftarrow\textsc{ParallelPropose}(M,h_{t},B_{t},K)⊳\triangleright𝒫t={(Δj,aj)}j=1K\mathcal{P}_{t}=\{(\Delta_{j},a_{j})\}_{j=1}^{K}

𝒜t←∅\mathcal{A}_{t}\leftarrow\varnothing 8:for all

(Δj,aj)∈𝒫t(\Delta_{j},a_{j})\in\mathcal{P}_{t}do

ht(j)←Δj(ht)h_{t}^{(j)}\leftarrow\Delta_{j}(h_{t}) 10:

(Pin(ht(j)),Pho(ht(j)),Rt(j))←Evaluate(M,ht(j),Din,Dho,ℰ)(P_{\mathrm{in}}(h_{t}^{(j)}),P_{\mathrm{ho}}(h_{t}^{(j)}),R_{t}^{(j)})\leftarrow\textsc{Evaluate}(M,h_{t}^{(j)},D_{\mathrm{in}},D_{\mathrm{ho}},\mathcal{E}) 11:

Δin(j)←Pin(ht(j))−Pin(ht)\Delta_{\mathrm{in}}^{(j)}\leftarrow P_{\mathrm{in}}(h_{t}^{(j)})-P_{\mathrm{in}}(h_{t}) 12:

Δho(j)←Pho(ht(j))−Pho(ht)\Delta_{\mathrm{ho}}^{(j)}\leftarrow P_{\mathrm{ho}}(h_{t}^{(j)})-P_{\mathrm{ho}}(h_{t}) 13:if

Δin(j)≥0\Delta_{\mathrm{in}}^{(j)}\geq 0and

Δho(j)≥0\Delta_{\mathrm{ho}}^{(j)}\geq 0and

max⁡(Δin(j),Δho(j))>0\max(\Delta_{\mathrm{in}}^{(j)},\Delta_{\mathrm{ho}}^{(j)})>0then

14:

𝒜t←𝒜t∪{(ht(j),Δj,aj,Δin(j),Δho(j))}\mathcal{A}_{t}\leftarrow\mathcal{A}_{t}\cup\{(h_{t}^{(j)},\Delta_{j},a_{j},\Delta_{\mathrm{in}}^{(j)},\Delta_{\mathrm{ho}}^{(j)})\} 15:

Accept(Δj)\textsc{Accept}(\Delta_{j})⊳\trianglerightpassed acceptance rule

16:else

17:

Reject(Δj)\textsc{Reject}(\Delta_{j}) 18:endif

19:endfor

20:if

𝒜t=∅\mathcal{A}_{t}=\varnothingthen

21:

ht+1←hth_{t+1}\leftarrow h_{t}⊳\trianglerightno accepted candidate

22:else

23:

ht+1←MergeAccepted(ht,𝒜t)h_{t+1}\leftarrow\textsc{MergeAccepted}(h_{t},\mathcal{A}_{t})⊳\trianglerightaccepted edits are merged

24:endif

25:endfor

26:return

hTh_{T}

3.2Weakness Mining: Identifying Failure Patterns from Clustered Execution Traces

The first stage of Self-Harness converts behavioral failures into structured evidence for harness revision. At roundtt, we run the fixed modelMMunder the current harnesshth_{t}on a held-in splitDinD_{\mathrm{in}}. For each task instancexi∈Dinx_{i}\in D_{\mathrm{in}}, the run produces an outputyiy_{i}and an execution traceτi\tau_{i}. The evaluatorℰ\mathcal{E}then assigns an outcomezi=ℰ(xi,τi,yi)z_{i}=\mathcal{E}(x_{i},\tau_{i},y_{i}), such as pass or fail. This yields a trace record

ri=(xi,τi,yi,zi),r_{i}=(x_{i},\tau_{i},y_{i},z_{i}), and a round-level record setRt={ri}i=1|Din|R_{t}=\{r_{i}\}_{i=1}^{|D_{\mathrm{in}}|}. Since bothMMandℰ\mathcal{E}are fixed, changes in these records across rounds can be attributed to changes in the harness.

A central role of the evaluation system is to avoid treating failures as isolated anecdotes. We therefore focus on the subset of failed records

Ft={ri∈Rt∣zi=fail}.F_{t}=\{r_{i}\in R_{t}\mid z_{i}=\mathrm{fail}\}. and cluster them by verifier-grounded failure signatures. For each failed recordrir_{i}, the evaluation system analyzes the trace as evidence for why the evaluator rejected the run. It identifies the terminal failure reason exposed by the verifier, the agent-side behavior connected to that terminal failure, and the causal status of that behavior within the trace. This attribution step prevents the clustering procedure from conflating superficial symptoms with reusable failure mechanisms: two runs may share the same verifier outcome, such as a timeout or missing artifact, while requiring different harness changes because the underlying agent behaviors differ.

We write this attribution as a failure signature

ϕ(ri)=(ci,qi,mi),\phi(r_{i})=(c_{i},q_{i},m_{i}),wherecic_{i}denotes the terminal verifier-level cause,qiq_{i}denotes the causal status of the relevant agent behavior, andmim_{i}denotes the abstract agent mechanism exposed by the trace. Failures are clustered by exact agreement of this signature:

Cϕ={ri∈Ft∣ϕ(ri)=ϕ}C_{\phi}=\{r_{i}\in F_{t}\mid\phi(r_{i})=\phi\}Thus, the clustering is deterministic and evaluator-grounded: two failed cases are grouped together only when they agree on what the verifier ultimately rejected, how the agent behavior contributed to that rejection, and which reusable behavioral mechanism was involved. The goal is not to discover latent semantic similarity among traces, but to aggregate failures that plausibly admit the same harness-level intervention.

For each clusterCϕC_{\phi}, the evaluation system constructs a structured failure pattern containing its cluster size, representative task instances, shared trace symptoms, verifier evidence, and the inferred agent mechanism. Clusters are then ordered by their support and estimated actionability, so that the proposer is exposed first to recurring mechanisms that are more likely to map to a high-value harness modification.

The output of this stage is an evidence bundleBtB_{t}summarizing the dominant failure patterns observed underhth_{t}. Importantly,BtB_{t}does not prescribe a harness edit. It separates verifier-level failure from agent-level mechanism, allowing the proposer to target a specific reusable weakness rather than patching a coarse outcome such as timeout, assertion failure, or missing output. This keeps the evaluator distinct from the optimizer while ensuring that subsequent candidate modifications are grounded in explicit cross-case evidence.

3.3Harness Proposal: Exploring Diverse yet Minimal Candidate Modifications

Given the evidence bundleBtB_{t}, the proposal stage translates recurring failure patterns into candidate harness edits. The proposer is not an external optimizer with unrestricted access to the search space. Instead, we invoke the same fixed modelMMwith current harnesshth_{t}in a proposer role and provide it with a bounded proposal context: the editable surfaces of the current harness, the verifier-grounded failure patterns from the evaluation system, records of passing behaviors that should be preserved, and summaries of previously attempted edits. This context exposes the proposer to structured cross-case evidence rather than raw execution logs, encouraging it to reason about reusable failure mechanisms rather than individual task failures.

Self-Harness uses parallel proposal generation to explore several candidate improvements from the same evidence. The proposer generatesKKmutually distinct proposal bundles,

𝒫t={(Δj,aj)}j=1K,\mathcal{P}_{t}=\{(\Delta_{j},a_{j})\}_{j=1}^{K},where each editΔj\Delta_{j}maps the current harness to a candidate harness

ht(j)=Δj(ht).h_{t}^{(j)}=\Delta_{j}(h_{t}).andaja_{j}is an audit record describing the targeted failure pattern, the edited harness surface, the expected behavioral effect, and the regression risks. Each proposal must be grounded in a primary failure mechanism and mapped to a concrete editable surface. The candidates are required to be materially distinct: they should not merely restate the same cluster, surface, or mechanism with different wording. This parallel proposal step broadens exploration while keeping each candidate branch individually interpretable.

The proposer first selects target failure patterns fromBtB_{t}. A pattern is considered a suitable target only if it is both supported by evidence and plausibly addressable by an editable harness surface. This addressability criterion is important because not every failure cluster implies a useful harness modification: some clusters reflect task-specific difficulty, unstable outcomes, or model capability limits rather than a missing execution rule. When multiple clusters are plausible, the proposer favors mechanisms that are concrete, recurrent, and likely to be mitigated by a narrow change to the execution protocol; weakly supported or non-addressable patterns are excluded rather than forced into a patch.

Diversity is encouraged across proposal branches, while minimality is enforced within each branch. A proposal may target a different failure mechanism, choose a different harness surface, or express a different hypothesis about how to improve execution. However, each individual edit is constrained to modify only the surface needed to address its selected mechanism, preserve unrelated harness behavior, and avoid broad rewrites of the agent control architecture.

3.4Proposal Validation: Ensuring Robust Improvement through Regression Testing

A candidate harness edit is not adopted immediately after it is proposed. Instead, each candidate branch is treated as a new harness variant and evaluated under the same evaluator used to diagnose the current harness. For a proposalΔj\Delta_{j}, letht(j)=Δj(ht)h_{t}^{(j)}=\Delta_{j}(h_{t})denote the resulting candidate harness. We evaluate both the current harnesshth_{t}and the candidate harnessht(j)h_{t}^{(j)}on the held-in splitDinD_{\mathrm{in}}and the held-out splitDhoD_{\mathrm{ho}}. The held-in split measures whether the proposal addresses the evidence that motivated it, while the held-out split serves as a regression test for behaviors that were not visible to the proposer.

LetPin(h)P_{\mathrm{in}}(h)andPho(h)P_{\mathrm{ho}}(h)denote the number of passed tasks for harnesshhonDinD_{\mathrm{in}}andDhoD_{\mathrm{ho}}, respectively. We define the split-wise improvements of candidateht(j)h_{t}^{(j)}over the current harness as

Δin(j)=Pin(ht(j))−Pin(ht),\Delta_{\mathrm{in}}^{(j)}=P_{\mathrm{in}}(h_{t}^{(j)})-P_{\mathrm{in}}(h_{t}),and

Δho(j)=Pho(ht(j))−Pho(ht).\Delta_{\mathrm{ho}}^{(j)}=P_{\mathrm{ho}}(h_{t}^{(j)})-P_{\mathrm{ho}}(h_{t}).A candidate is accepted only if it improves at least one split without degrading the other:

Δin(j)≥0,Δho(j)≥0,max⁡(Δin(j),Δho(j))>0.\Delta_{\mathrm{in}}^{(j)}\geq 0,\quad\Delta_{\mathrm{ho}}^{(j)}\geq 0,\quad\max\left(\Delta_{\mathrm{in}}^{(j)},\Delta_{\mathrm{ho}}^{(j)}\right)>0.This rule implements a conservative promotion criterion. Proposals that only trade off one split against the other are rejected, even if their total pass count increases. When evaluation is stochastic, we repeat candidate evaluation and apply the same rule to aggregate pass counts across repeats. This reduces the chance that a harness edit is promoted due to a single favorable run. If multiple compatible candidates satisfy the rule in the same round, their edits are merged into the next harness; rejected candidates remain logged but do not change the active harness. In addition to the pass-count rule, validation rejects proposals that do not modify any editable surface or fail execution before a valid evaluation result is obtained. For each evaluated candidate, the system records the changed surfaces, split-wise outcomes, evaluation repeats, proposal summary, and accept/reject decision, making each transition in the harness lineage auditable.

4Experiments

We evaluate whether Self-Harness can improve agent performance by modifying only the harness around a fixed language model. Our experiments use Terminal-Bench-2.0, which tests terminal interaction in containerized environments. Across multiple model backends, we start from the same minimal DeepAgent-based harness and let Self-Harness propose, validate, and promote bounded edits using held-in execution evidence and held-out regression gates.

1defbuild_system_prompt()->str:

2return““”

3YouarerunninginsideaTerminalBench2Harbortaskenvironment.

5Usethebuilt-infilesystemandshelltoolstoinspecttheworkspace,make

6concreteedits,andverifyoutcomesagainsttheactualtaskenvironment.

8Donotassumesyntheticdatasets,domain-specifictools,orhiddenfixtures

9unlessyoudiscoverthemintherepoorruntime.

10““”.strip()

13BASELINE_SYSTEM_PROMPT=build_system_prompt()

16defbuild_memory_sources()->list[str]:

17return[“/AGENTS.md”]

20defbuild_subagents()->list[dict[str,Any]]:

21return[]

24defbuild_skills()->list[str]:

25return[]

28defbuild_bootstrap_instruction()->str:

29return“Startbyinspectingtheworkspaceandidentifyingthesmallestrelevanteditsurface.“

32defbuild_execution_instruction()->str:

33return“Preferconcreterepochangesovergenericadvice,andkeepeditstightlyscopedtothetask.“

36defbuild_verification_instruction()->str:

37return“Beforeconcluding,verifytheresultwiththemosttargetedcommand,fileread,ortestyoucanrun.“

40defbuild_failure_recovery_instruction()->str:

41return“Ifatoolcallfails,inspecttheerrorandadapt;donotblindlyretrythesameaction.“

44defbuild_runtime_control_policy()->dict[str,Any]:

45return{

46“enabled“:False,

47“max_recent_tool_errors“:None,

48“max_total_tool_messages“:None,

49“instruction“:None,

50}

53defbuild_fixed_harness_agent(

54model:“BaseChatModel|str”,

55*,

56backend:Any|None=None,

57):

58ifbackendisnotNone:

59returncreate_deep_agent(model=model,system_prompt=BASELINE_SYSTEM_PROMPT,backend=backend)

60returncreate_deep_agent(model=model,system_prompt=BASELINE_SYSTEM_PROMPT)

Figure 3:Initial harness and editable interface used as the starting point for Self-Harness. The harness is intentionally kept minimal, consisting only of the Terminal-Bench-2.0 default system prompt, the default DeepAgent tools (basic file reading, file writing, file editing, and shell execution), and the declared interfaces that Self-Harness is allowed to modify.### 4.1Setup

Benchmarks.

We evaluate Self-Harness on Terminal-Bench-2.0[13], a multi-turn agentic benchmark in which agents interact with realistic execution environments and are judged by deterministic verifiers. Terminal-Bench-2.0 contains 89 containerized terminal tasks that test general tool-based execution, including artifact management, command use, verification behavior, and recovery from execution errors. We evaluate on a fixed 64-case subset, excluding tasks that depend on unstable external web resources or require multimodal inputs. This filtering reduces evaluation noise from factors outside the harness. In particular, multimodal tasks require modality-specific input handling that is not exposed by our minimal initial harness; including them would primarily measure unsupported harness functionality rather than the effect of Self-Harness edits.

Models.

We evaluate Self-Harness with three models: MiniMax M2.5[14], Qwen3.5-35B-A3B[20], and GLM-5[2]. The model is held fixed across all harness variants and is also used in the proposal stage to generate edits from evaluator feedback. All comparisons are therefore within-model comparisons: the decoding configuration, budget, tool set, benchmark environment, and evaluator are kept unchanged while only the harness is allowed to vary. This isolates the effect of Self-Harness from changes in model capability or evaluation protocol.

Harness.

The initial harness builds upon the DeepAgent[4]SDK but is intentionally kept minimal: a short benchmark-facing system prompt, and the default filesystem and shell tools. Self-Harness can only change the harness definition file that configures how DeepAgent is instantiated and controlled to build a new harness candidateht(j)h_{t}^{(j)}. The editable surfaces correspond to declared configuration points in this harness, such as instruction, tools, verification guidance, etc. Figure3shows the initial harness implementation.

Splits and protocol.

We fix the evaluated task set and partition it into a held-in split and a held-out split before running Self-Harness. The held-in split supplies the trajectories, verifier outcomes, and failure evidence exposed to the proposer, while the held-out split is never shown to the proposer and is used only by the automatic promotion gate. A candidate harness is promoted with the acceptance rule defined in Section3.4. Task split assignments are fixed across harness variants, and each task starts from a fresh benchmark environment. These controls ensure that measured improvements come from harness changes.

Metrics.

Our primary metric isPass (%), the percentage of evaluated task attempts that pass the benchmark verifier, computed over two repeated attempts for each harness candidate unless otherwise specified. This measures mean single-attempt task success under a fixed harness configuration and evaluation protocol. For Terminal-Bench-2.0, the pass signal is determined by the task verifier over the final container state.

Refer to caption Figure 4:Pass rates (%) on Terminal-Bench-2.0 across MiniMax M2.5, Qwen3.5-35B-A3B, and GLM-5. For each backend, bars compare the initial harness with the final harness produced by Self-Harness on the held-in split, held-out split, and overall set; annotations above the Self-Harness bars show relative gains over the corresponding initial harness.

4.2Main Results

Figure4reports Terminal-Bench-2.0 performance before and after Self-Harness promotion. Across all three model backends, the promoted harness improves or preserves Pass (%) on both the held-in split and the held-out split. We report both absolute gains in percentage points and relative gains, where the relative gain is computed as(Self-Harness−Initial)/Initial(\mathrm{Self\mbox{-}Harness}-\mathrm{Initial})/\mathrm{Initial}. For MiniMax M2.5, Self-Harness improves held-in Pass from 43.0 to 50.0, a gain of 16% relative improvement, and improves held-out Pass from 40.5 to 61.9, a gain of 53% relative improvement. For Qwen3.5, Self-Harness improves held-in Pass from 15.1 to 36.0, a gain of 138% relative improvement, and held-out Pass from 23.8 to 38.1, a gain of 60% relative improvement. For GLM-5, Self-Harness improves held-in Pass from 47.7 to 57.0, a gain of 20% relative improvement, and held-out Pass from 42.9 to 57.1, a gain of 33% relative improvement.

These results show that harness-level edits can yield measurable improvements while keeping the model backend, tool set, budget, benchmark environment, and evaluator fixed. The gains are not confined to the held-in failures used to construct proposal evidence: all three backends improve on the held-out split, and no promoted harness degrades either split. This supports the central design goal of Self-Harness: proposed edits should target reusable execution mechanisms rather than case-specific failures, and the regression gate should prevent improvements on one split from being promoted at the cost of another.

4.3Experimental Analysis

Refer to caption (a)Self-Harness evolution trajectory. Green markers denote accepted harness candidates and gray crosses denote rejected harness candidates. Step lines connect accepted candidates and keep performance flat across rejected iterations; the two accepted lineages meet at the final success-combined harness. Refer to caption (b)Differences for the three harness modifications accepted by Self-Harness and retained in the final harness. Red rows denote code removed from the initial harness, and green rows denote the updated harness behavior.

Figure 5:MiniMax M2.5 Self-Harness run. Panel (a) summarizes the Self-Harness evolution trajectory, while panel (b) expands the accepted updates from this trajectory into their retained code-level edits in the final harness.#### Harness evolution and retained edits.

Figures5and6summarize both the evolution trajectory and the retained code-level edits for MiniMax M2.5 and Qwen3.5, with the corresponding GLM-5 run shown in Appendix Figure10. In each figure, the evolution plot distinguishes accepted candidates from rejected proposals, while the code diff records the harness interfaces retained in the final promoted variant. Across models, Self-Harness reaches the final harness through a small number of validation-gated edits rather than through a smooth sequence of uniformly successful proposals.

For MiniMax M2.5, the harness improves from 42.2% to 53.9% pass rate. The retained edits address missing required artifacts, schema-invalid tool content, and stalled tool-use loops, yielding a harness that creates required outputs earlier, handles structured tool content more carefully, and redirects execution after prolonged tool interaction.

Refer to caption (a)Self-Harness evolution trajectory. Green markers denote accepted harness candidates and gray crosses denote rejected harness candidates. Step lines connect accepted candidates and keep performance flat across rejected iterations. The subagent and skill branches were discarded due to no further improvement. The remaining four accepted edits are merged to form the final harness. Refer to caption (b)Differences for the four harness modifications accepted by Self-Harness and retained in the final harness. Red rows denote code removed from the initial harness, and green rows denote the updated harness behavior.

Figure 6:Qwen3.5 Self-Harness run. Panel (a) summarizes the Self-Harness evolution trajectory, while panel (b) expands the accepted updates from this trajectory into their retained code-level edits in the final harness.The Qwen3.5 evolution run shown in Figure6starts at 20.3% pass rate and reaches 36.7% after merging edits that emphasize artifact checking, missing-artifact recovery, retry discipline, and tool-error-triggered middleware. These changes mainly improve the agent’s ability to recover from file-editing or tool failures and still leave verifier-required artifacts in place.

For GLM-5, the harness improves from 46.1% to 57.0% through edits targeting late artifacts, external computation, session-scoped tools, and implementation-oriented exploration. These edits make environment changes persist across shell commands and encourage the agent to move from prolonged exploration toward implementation and testing.

In summary, the three runs show both a shared pattern and model-specific adaptation. A common theme is artifact reliability: all three promoted harnesses add mechanisms that improve artifact delivery, including “create output early” for M2.5, “artifact middleware” for Qwen3.5, and “transition from exploration” to implementation for GLM-5. The model-specific differences indicate that the same initial harness exposes different execution pathologies for different models, and that Self-Harness adapts by selecting targeted edits grounded in the failure mechanisms observed for each model. For example, M2.5 emphasizes correct formation of content tags and redirection after long tool calls. Qwen3.5 introduces dependency precheck and mitigation of exact command retries. GLM-5 tries to keep command environment persistent across shell sessions. These differences indicate that Self-Harness successfully captures specific failure modes of different models and produces suitable proposals to improve the model-harness behavior.

Trace-Level Analysis of Accepted Edits.

To better understand how accepted harness edits change agent behavior, we inspect representative before–after traces from Terminal-Bench-2.0 in Figures7and8, with the GLM-5 trace shown in Appendix Figure9. The accepted edits are not a single generic instruction added to all backends. Instead, Self-Harness promotes model-specific changes that target the dominant failure mechanisms observed for each initial harness.

Refer to caption Figure 7:Case study of a MiniMax M2.5 harness edit on the Terminal-Bench-2.0count-dataset-tokenstask. Left: a failed trace under the initial harness, where the agent continues dataset exploration after finding the relevant metadata configuration and times out without creating the required answer artifact. Right: a successful trace under the edited harness, where the agent identifies the metadata-backed science subset, computes the required token total, writes/app/answer.txt, and reads it back for verification.For MiniMax M2.5, the accepted edits emphasize early artifact creation and bounded execution. The bootstrap instruction is changed from merely identifying the smallest relevant edit surface to identifying the required output artifact and creating an initial version as early as possible. The runtime policy is also enabled with a limit on total tool messages, encouraging the agent to redirect rather than continue open-ended tool use. Figure7shows that this changes the agent from prolonged dataset exploration to a concrete workflow: identifying the relevant metadata split, computing the required count, writing the answer file, and reading it back before stopping.

Refer to caption Figure 8:Case study of a Qwen3.5 harness edit on the Terminal-Bench-2.0extract-elftask. Left: a failed trace under the initial harness, where the agent creates the required extractor script but then enters repeated overwrite and edit-file failures; before stopping, it deletes/app/extract.js, causing the verifier to fail because the required artifact is missing. Right: a successful trace under the edited harness, where a tool-error-triggered system prompt redirects the agent to recover the missing artifact, recreate the extractor, validate the generated JSON output, and leave the required file present for the verifier.For Qwen3.5, the promoted harness adds constraints for dependency prechecking, loop breaking, command-retry discipline, and artifact-focused recovery after tool errors. Figure8illustrates how these edits change failure recovery behavior. Under the initial harness, the agent creates the required extractor script, encounters overwrite and edit failures, repeatedly tries to modify the same artifact, and ultimately deletes/app/extract.jsbefore stopping; the verifier therefore fails because the required file is missing. Under the edited harness, a tool-error-triggered system prompt redirects the agent toward the missing artifact: it recreates the extractor, fixes the parsing logic, writes the output file, performs targeted validation of the JSON result, and leaves the required artifact present for the verifier.

For GLM-5, the accepted edits focus on persistent environment changes and the transition from exploration to implementation. The edited harness instructs the agent to make installed tools or path changes persist across shell sessions and to verify tool accessibility after modifying the environment. It also adds a verification-stage constraint: if the agent has been exploring without producing required artifacts, it should transition to implementing and testing a solution. Figure9shows this behavior in a build task. The initial harness spends substantial budget on long external downloads and later rationalizes failed sanity checks, whereas the edited harness switches strategy after timeout evidence, validates alternative sources early, repairs the failing render check, and only then finalizes.

Together, these examples support the interpretation of the quantitative gains in Figure4. The promoted edits change observable execution behavior in ways aligned with the diagnosed failure mechanisms: Qwen3.5 reduces repeated ineffective actions, GLM-5 better preserves environment changes and moves from exploration to implementation, and MiniMax M2.5 creates and verifies required artifacts earlier. This suggests that Self-Harness improves performance by inducing targeted workflow changes rather than by relying on unrelated stochastic variation or a uniformly stronger prompt.

5Conclusion

This paper studied whether a fixed language model can improve the harness that governs its own agent behavior. We introduced Self-Harness, a propose–evaluate–accept framework in which the model is evaluated under the current harness, receives structured evidence from its own execution traces, and proposes bounded edits to declared harness surfaces. Candidate harnesses are then re-evaluated under the same benchmark protocol, and only edits that satisfy a non-regressive acceptance rule are promoted into the harness lineage.

The main lesson is that harness improvement should be treated as an empirical state transition. A useful harness edit must specify the behavior it aims to change, the surface it modifies, the evidence that motivates it, and the evaluation result that justifies promotion. By keeping the model, evaluator, and benchmark protocol fixed, Self-Harness isolates whether improved behavior comes from changes to the harness scaffold.

Our experiments on Terminal-Bench-2.0 instantiate this protocol with a minimal DeepAgent-based baseline harness and three model backends. Self-Harness improves Pass (%) across all tested backends while preserving held-in and held-out performance under the acceptance rule. The retained edits are small, auditable changes to configurable harness surfaces, suggesting that even sparse initial harnesses can support useful self-improvement when proposals are constrained by execution evidence and validated by regression testing.

Self-Harness also has important limits. It studies bounded harness edits under fixed benchmarks, not open-ended self-improvement. Accepted edits may still reflect benchmark-specific failure patterns, and the protocol depends on the quality of verifier outcomes and trace records. Higher-stakes harness changes would require stronger acceptance gates than pass-rate non-regression alone.

More broadly, Self-Harness points toward a style of agent engineering in which harnesses evolve through recorded, testable, and reversible changes. Future work can further explore application of self-harness-style self-improvement in broader environments, but the core requirement remains the same: self-improvement should be grounded in behavioral evidence rather than only in the proposer’s rationale for a plausible edit.

References

[1]L. Fu, X. Ding, L. Pan, Y. Zhu, S. Zhang, L. Qiu, W. Liu, W. Zhang, X. Cao, X. Cai, J. Ding, and Y. Yu(2026)CATArena: evaluation of llm agents through iterative tournament competitions.InInternational Conference on Machine Learning,Cited by:§2.
[2]GLM-5 Team(2026)GLM-5: from vibe coding to agentic engineering.External Links:2602.15763,LinkCited by:§1,§4.1.
[3]S. Hu, C. Lu, and J. Clune(2025)Automated design of agentic systems.External Links:2408.08435,LinkCited by:§2.
[4]LangChain(2026)DeepAgents.Note:Software frameworkExternal Links:LinkCited by:§4.1.
[5]Y. Lee, R. Nair, Q. Zhang, K. Lee, O. Khattab, and C. Finn(2026)Meta-harness: end-to-end optimization of model harnesses.External Links:2603.28052,LinkCited by:§1,§1,§1,§2.
[6]P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Kuttler, M. Lewis, W. Yih, T. Rocktaschel, S. Riedel, and D. Kiela(2020)Retrieval-augmented generation for knowledge-intensive nlp tasks.InAdvances in Neural Information Processing Systems,External Links:2005.11401,LinkCited by:§2.
[7]Y. Liang, S. Zhou, Y. Gu, H. Tan, G. Wu, F. Dernoncourt, J. Kil, R. A. Rossi, and R. Zhang(2026)Anticipatory planning for multimodal ai agents.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp. 5925–5935.Cited by:§2.
[8]J. Lin, S. Liu, C. Pan, L. Lin, S. Dou, Z. Xi, X. Huang, H. Yan, Z. Han, T. Gui, and Y. Jiang(2026)Agentic harness engineering: observability-driven automatic evolution of coding-agent harnesses.External Links:2604.25850,LinkCited by:§1,§1,§1.
[9]J. Liu, X. Zhao, X. Shang, and Z. Shen(2026)Dive into claude code: the design space of today’s and future ai agent systems.External Links:2604.14228,LinkCited by:§1,§2.
[10]P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig(2021)Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing.External Links:2107.13586,LinkCited by:§2.
[11]C. Lu, C. Lu, R. T. Lange, J. Foerster, J. Clune, and D. Ha(2024)The ai scientist: towards fully automated open-ended scientific discovery.External Links:2408.06292,LinkCited by:§2.
[12]L. Mei, J. Yao, Y. Ge, Y. Wang, B. Bi, Y. Cai, J. Liu, M. Li, Z. Li, D. Zhang, C. Zhou, J. Mao, T. Xia, J. Guo, and S. Liu(2025)A survey of context engineering for large language models.External Links:2507.13334,LinkCited by:§2.
[13]M. A. Merrill, A. G. Shaw, N. Carlini, B. Li, H. Raj, I. Bercovich, L. Shi, J. Y. Shin, T. Walshe, E. K. Buchanan, J. Shen, G. Ye, H. Lin, J. Poulos, M. Wang, M. Nezhurina, J. Jitsev, D. Lu, O. M. Mastromichalakis, Z. Xu, Z. Chen, Y. Liu, R. Zhang, L. L. Chen, A. Kashyap, J. Uslu, J. Li, J. Wu, M. Yan, S. Bian, V. Sharma, K. Sun, S. Dillmann, A. Anand, A. Lanpouthakoun, B. Koopah, C. Hu, E. Guha, G. H. S. Dreiman, J. Zhu, K. Krauth, L. Zhong, N. Muennighoff, R. Amanfu, S. Tan, S. Pimpalgaonkar, T. Aggarwal, X. Lin, X. Lan, X. Zhao, Y. Liang, Y. Wang, Z. Wang, C. Zhou, D. Heineman, H. Liu, H. Trivedi, J. Yang, J. Lin, M. Shetty, M. Yang, N. Omi, N. Raoof, S. Li, T. Y. Zhuo, W. Lin, Y. Dai, Y. Wang, W. Chai, S. Zhou, D. Wahdany, Z. She, J. Hu, Z. Dong, Y. Zhu, S. Cui, A. Saiyed, A. Kolbeinsson, J. Hu, C. M. Rytting, R. Marten, Y. Wang, A. Dimakis, A. Konwinski, and L. Schmidt(2026)Terminal-bench: benchmarking agents on hard, realistic tasks in command line interfaces.External Links:2601.11868,LinkCited by:§4.1.
[14]MiniMax(2026-02)MiniMax m2.5: built for real-world productivity.Note:Official model reportExternal Links:LinkCited by:§1,§4.1.
[15]A. Novikov, N. Vu, M. Eisenberger, E. Dupont, P. Huang, A. Z. Wagner, S. Shirobokov, B. Kozlovskii, F. J. R. Ruiz, A. Mehrabian, M. P. Kumar, A. See, S. Chaudhuri, G. Holland, A. Davies, S. Nowozin, P. Kohli, and M. Balog(2025)AlphaEvolve: a coding agent for scientific and algorithmic discovery.External Links:2506.13131,LinkCited by:§2.
[16]OpenAI(2026)Codex.Note:Product pageExternal Links:LinkCited by:§1.
[17]C. Packer, S. Wooders, K. Lin, V. Fang, S. G. Patil, I. Stoica, and J. E. Gonzalez(2024)MemGPT: towards llms as operating systems.External Links:2310.08560,LinkCited by:§2.
[18]Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian, S. Zhao, R. Tian, R. Xie, J. Zhou, M. Gerstein, D. Li, Z. Liu, and M. Sun(2023)ToolLLM: facilitating large language models to master 16000+ real-world apis.External Links:2307.16789,LinkCited by:§1.
[19]J. Qiu, X. Qi, T. Zhang, X. Juan, J. Guo, Y. Lu, Y. Wang, Z. Yao, Q. Ren, X. Jiang, X. Zhou, D. Liu, L. Yang, Y. Wu, K. Huang, S. Liu, H. Wang, and M. Wang(2025)Alita: generalist agent enabling scalable agentic reasoning with minimal predefinition and maximal self-evolution.External Links:2505.20286,LinkCited by:§2.
[20]Qwen Team(2026-02)Qwen3.5: towards native multimodal agents.Note:Official model report and model card for Qwen3.5-35B-A3BExternal Links:LinkCited by:§1,§4.1.
[21]S. Schulhoff, M. Ilie, N. Balepur, K. Kahadze, A. Liu, C. Si, Y. Li, A. Gupta, H. Han, S. Schulhoff, P. S. Dulepet, S. Vidyadhara, D. Ki, S. Agrawal, C. Pham, G. Kroiz, F. Li, H. Tao, A. Srivastava, H. D. Costa, S. Gupta, M. L. Rogers, I. Goncearenco, G. Sarli, I. Galynker, D. Peskoff, M. Carpuat, J. White, S. Anadkat, A. Hoyle, and P. Resnik(2025)The prompt report: a systematic survey of prompt engineering techniques.External Links:2406.06608,LinkCited by:§1,§2.
[22]M. Sclar, Y. Choi, Y. Tsvetkov, and A. Suhr(2024)Quantifying language models’ sensitivity to spurious features in prompt design or: how i learned to start worrying about prompt formatting.External Links:2310.11324,LinkCited by:§1.
[23]N. Shinn, F. Cassano, E. Berman, A. Gopinath, K. Narasimhan, and S. Yao(2023)Reflexion: language agents with verbal reinforcement learning.External Links:2303.11366,LinkCited by:§2.
[24]X. Wang, B. Li, Y. Song, F. F. Xu, X. Tang, M. Zhuge, J. Pan, Y. Song, B. Li, J. Singh,et al.(2025)Openhands: an open platform for ai software developers as generalist agents.InInternational Conference on Learning Representations,Vol.2025,pp. 65882–65919.Cited by:§1.
[25]J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou(2022)Chain-of-thought prompting elicits reasoning in large language models.InAdvances in Neural Information Processing Systems,External Links:2201.11903,LinkCited by:§2.
[26]S. Xu, S. Li, X. Liu, T. Liu, Y. Li, Z. Shi, Z. Zhang, Z. Wang, Q. Yin, J. Chen,et al.(2026)Controllable and verifiable tool-use data synthesis for agentic reinforcement learning.arXiv preprint arXiv:2604.09813.Cited by:§2.
[27]Y. Yamada, R. T. Lange, C. Lu, S. Hu, C. Lu, J. Foerster, J. Clune, and D. Ha(2025)The ai scientist-v2: workshop-level automated scientific discovery via agentic tree search.External Links:2504.08066,LinkCited by:§2.
[28]J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press(2024)SWE-agent: agent-computer interfaces enable automated software engineering.External Links:2405.15793,LinkCited by:§1,§2.
[29]S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao(2023)ReAct: synergizing reasoning and acting in language models.External Links:2210.03629,LinkCited by:§1,§2.
[30]X. Yin, X. Wang, L. Pan, L. Lin, X. Wan, and W. Y. Wang(2025)Gödel agent: a self-referential agent framework for recursively self-improvement.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),pp. 27890–27913.External Links:Document,LinkCited by:§2.
[31]E. Zelikman, E. Lorch, L. Mackey, and A. T. Kalai(2024)Self-taught optimizer (stop): recursively self-improving code generation.External Links:2310.02304,LinkCited by:§2.
[32]H. Zhang, S. Xu, Z. Guo, H. Zhu, S. Liu, X. Wang, Q. Zhang, Y. Chen, P. Ye, L. Bai,et al.(2025)The path of self-evolving large language models: achieving data-efficient learning via intrinsic feedback.arXiv preprint arXiv:2510.02752.Cited by:§2.
[33]J. Zhang, S. Hu, C. Lu, R. Lange, and J. Clune(2025)Darwin gödel machine: open-ended evolution of self-improving agents.External Links:2505.22954,LinkCited by:§2.
[34]Q. Zhang, C. Hu, S. Upasani, B. Ma, F. Hong, V. Kamanuru, J. Rainton, C. Wu, M. Ji, H. Li, U. Thakker, J. Zou, and K. Olukotun(2026)Agentic context engineering: evolving contexts for self-improving language models.Note:ICLR 2026External Links:2510.04618,LinkCited by:§2.
[35]S. Zhang, X. Wang, W. Zhang, C. Li, J. Song, T. Li, L. Qiu, X. Cao, X. Cai, W. Yao, W. Zhang, X. Wang, and Y. Wen(2025-07)Leveraging dual process theory in language agent framework for real-time simultaneous human-AI collaboration.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar (Eds.),Vienna, Austria,pp. 4081–4108.External Links:Link,Document,ISBN 979-8-89176-251-0Cited by:§1.
[36]N. Zhu, H. Wang, J. Zhou, F. Chen, S. Zhang, G. Chen, C. Liu, J. Wu, W. Chen, X. Mou, and Y. Xu(2026)SemaClaw: a step towards general-purpose personal ai agents through harness engineering.External Links:2604.11548,LinkCited by:§1,§2.
[37]M. Zhuge, W. Wang, L. Kirsch, F. Faccio, D. Khizbullin, and J. Schmidhuber(2024)Language agents as optimizable graphs.External Links:2402.16823,LinkCited by:§2.

Appendix AAdditional Implementation Details

A.1Experimental details

Model inference services

Terminal-Bench-2.0 configuration.

We use Harbor333https://www.harborframework.com/docs/tutorials/running-terminal-bench.as the execution environment for all Terminal-Bench-2.0 tasks. Evaluations are run on an isolated machine with 64 CPU cores, 256 GB of memory, and a 2 MB/s outbound network bandwidth cap. We use a default concurrency of 32 tasks for MiniMax M2.5 and GLM-5, and a concurrency of 48 tasks for the locally deployed Qwen3.5-35B-A3B backend. To reduce failures caused by incidental network latency rather than agent behavior, we mirror a subset of external resources required by Terminal-Bench-2.0 tasks when the original resources are stable but slow to download. Tasks that depend on external resources that cannot be accessed reliably, as well as tasks requiring multimodal inputs unsupported by the initial harness, are excluded from the main 64-case evaluation set. This configuration keeps the benchmark environment controlled while preserving the core terminal-interaction setting of Terminal-Bench-2.0.

A.2Additional GLM-5 analysis

Refer to caption Figure 9:Case study of a GLM-5 harness edit on the Terminal-Bench-2.0build-pov-raytask. Left: a failed trace under the initial harness, where long monolithic external downloads consume large tool budgets and the agent later finalizes despite repeated non-zero sanity checks. Right: a successful trace under the edited harness, where the agent uses bounded staged operations, checks external archive evidence before committing more work, and repairs the failed sanity check before finalizing. Refer to caption (a)Self-Harness evolution trajectory. Green markers denote accepted harness candidates and gray crosses denote rejected harness candidates. Step lines connect accepted candidates and keep performance flat across rejected iterations. Two early branches were discarded due to no further improvement. The remaining lane becomes the final accepted harness. Refer to caption (b)Differences for the four harness modifications accepted by Self-Harness and retained in the final harness. Red rows denote code removed from the initial harness, and green rows denote the updated harness behavior.

Figure 10:GLM-5 Self-Harness run. Panel (a) summarizes the Self-Harness evolution trajectory, while panel (b) expands the accepted updates from this trajectory into their retained code-level edits in the final harness.