@NFTCPS: HarnessX is pretty interesting: an agent architecture that can modify itself. Previously, architectural changes relied entirely on manual tuning. When a new model came out, Anthropic removed the planning steps from Claude Code, and Manus refactored its agents five times in six months, each time simplifying. What to change and when to change it — all decided by humans.
Summary
HarnessX introduces a framework for self-evolving AI agent harnesses that treats the runtime harness as a first-class object, enabling automatic adaptation via trace-driven evolution framed as a reinforcement-learning problem. It reports an average gain of +14.5% (up to +44.0%) across five benchmarks, with the largest improvements for weaker models.
Cached at: 06/17/26, 07:51 AM
HarnessX is quite interesting: an agent architecture that can modify itself. Previously, any architectural changes were entirely manual. When a new model came out, Anthropic removed the planning steps from Claude Code, and Manus refactored its agent five times in six months, each time simplifying. What to change and when has always been decided by humans. What HarnessX aims to do is let the system iterate itself. The idea is to treat the architecture as a ‘first-class citizen’ alongside model weights—making it typable and editable so it can be optimized based on execution traces, with the entire logic directly mapped to reinforcement learning. The most interesting result: the weakest models improve the most, while strong models barely change. The weights remain unchanged, but the surrounding environment becomes smarter. https://arxiv.org/abs/2606.14249
HarnessX: A Composable, Adaptive, and Evolvable Agent Harness Foundry

Source: https://arxiv.org/html/2606.14249

See the Contributions and Acknowledgments (https://arxiv.org/html/2606.14249#Sx1) section for a full author list.

###### Abstract

AI agent performance depends critically on the runtime harness, comprising the prompts, tools, memory, and control flow that mediate how a model observes, reasons, and acts. Yet today’s harnesses remain largely hand-crafted and static: each new model or task still demands bespoke scaffolding, and the rich traces produced during execution are rarely distilled back into systematic improvement. We introduce HarnessX, a foundry for composable, adaptive, and evolvable agent harnesses. HarnessX assembles typed harness primitives via a substitution algebra, adapts them through AEGIS, a trace-driven multi-agent evolution engine grounded in an operational mirror between symbolic adaptation and reinforcement learning, and closes the harness–model loop by turning trajectories into both harness updates and model training signal. Across five benchmarks (ALFWorld, GAIA, WebShop, $\tau^{3}$-Bench, and SWE-bench Verified), HarnessX yields an average gain of +14.5% (up to +44.0%), with gains largest where baselines are lowest. These results suggest that agent progress need not come from model scaling alone: composing and evolving runtime interfaces from execution feedback is an actionable and complementary lever. The complete codebase will be open-sourced in a future release.
Figure 1: HarnessX overview.

###### Contents

- 1 Introduction (https://arxiv.org/html/2606.14249#S1)
- 2 Related Work (https://arxiv.org/html/2606.14249#S2)
  - 2.1 Harness Engineering (https://arxiv.org/html/2606.14249#S2.SS1)
  - 2.2 Self-Evolving Agents (https://arxiv.org/html/2606.14249#S2.SS2)
- 3 Harness Composition (https://arxiv.org/html/2606.14249#S3)
  - 3.1 The Harness as a First-Class Object (https://arxiv.org/html/2606.14249#S3.SS1)
  - 3.2 The Processor Abstraction (https://arxiv.org/html/2606.14249#S3.SS2)
  - 3.3 The Nine-Dimensional Taxonomy (https://arxiv.org/html/2606.14249#S3.SS3)
- 4 Harness Adaptation (https://arxiv.org/html/2606.14249#S4)
  - 4.1 The Operational Mirror (https://arxiv.org/html/2606.14249#S4.SS1)
  - 4.2 Pathologies in Symbolic Space (https://arxiv.org/html/2606.14249#S4.SS2)
  - 4.3 AEGIS Architecture (https://arxiv.org/html/2606.14249#S4.SS3)
  - 4.4 The Adaptation Loop (https://arxiv.org/html/2606.14249#S4.SS4)
  - 4.5 Variant Isolation via Ensemble Routing (https://arxiv.org/html/2606.14249#S4.SS5)
- 5 Harness-Model Co-Evolution (https://arxiv.org/html/2606.14249#S5)
  - 5.1 The Co-evolution Iteration (https://arxiv.org/html/2606.14249#S5.SS1)
  - 5.2 Optimization Substrates (https://arxiv.org/html/2606.14249#S5.SS2)
  - 5.3 Model Training via Cross-Harness GRPO (https://arxiv.org/html/2606.14249#S5.SS3)
  - 5.4 Off-Policy Training over a Mixed-Policy Buffer (https://arxiv.org/html/2606.14249#S5.SS4)
- 6 Experiments (https://arxiv.org/html/2606.14249#S6)
  - 6.1 Experimental Setup (https://arxiv.org/html/2606.14249#S6.SS1)
  - 6.2 Main Results (https://arxiv.org/html/2606.14249#S6.SS2)
  - 6.3 Evolution Strategy Comparison (https://arxiv.org/html/2606.14249#S6.SS3)
  - 6.4 Meta-Agent Effectiveness (https://arxiv.org/html/2606.14249#S6.SS4)
  - 6.5 Co-Evolution (https://arxiv.org/html/2606.14249#S6.SS5)
  - 6.6 Failure Analysis (https://arxiv.org/html/2606.14249#S6.SS6)
- 7 Discussion (https://arxiv.org/html/2606.14249#S7)
  - 7.1 Why Compositional Structure Matters for Evolution (https://arxiv.org/html/2606.14249#S7.SS1)
  - 7.2 The Role of Trace Richness (https://arxiv.org/html/2606.14249#S7.SS2)
  - 7.3 Scope and Limits of the Operational Mirror (https://arxiv.org/html/2606.14249#S7.SS3)
  - 7.4 Generalization Across Model Families (https://arxiv.org/html/2606.14249#S7.SS4)
  - 7.5 Cost-Performance Tradeoffs (https://arxiv.org/html/2606.14249#S7.SS5)
  - 7.6 Ethical Considerations (https://arxiv.org/html/2606.14249#S7.SS6)
  - 7.7 Limitations (https://arxiv.org/html/2606.14249#S7.SS7)
- 8 Conclusion (https://arxiv.org/html/2606.14249#S8)
- References (https://arxiv.org/html/2606.14249#bib)
- Contributions and Acknowledgments (https://arxiv.org/html/2606.14249#Sx1)
- 9 Experimental Setup: Full Details (https://arxiv.org/html/2606.14249#S9)
  - 9.1 Benchmarks (https://arxiv.org/html/2606.14249#S9.SS1)
  - 9.2 Evaluation-Set Design (https://arxiv.org/html/2606.14249#S9.SS2)
  - 9.3 Metric Definitions (https://arxiv.org/html/2606.14249#S9.SS3)
  - 9.4 Evolution Protocol and Hyperparameters (https://arxiv.org/html/2606.14249#S9.SS4)
  - 9.5 Runtime Infrastructure (https://arxiv.org/html/2606.14249#S9.SS5)
- 10 Prompts and Harness Defaults (https://arxiv.org/html/2606.14249#S10)
  - 10.1 Meta-Agent Prompts (https://arxiv.org/html/2606.14249#S10.SS1)
  - 10.2 Round-0 Task-Agent Prompts (https://arxiv.org/html/2606.14249#S10.SS2)
  - 10.3 Change-Manifest Schema (https://arxiv.org/html/2606.14249#S10.SS3)
- 11 Anatomy of an Evolution Step (https://arxiv.org/html/2606.14249#S11)
  - 11.1 Worked Example: GAIA / Sonnet 4.6, Round 10 (https://arxiv.org/html/2606.14249#S11.SS1)
- 12 Additional Results (https://arxiv.org/html/2606.14249#S12)
  - 12.1 GAIA (https://arxiv.org/html/2606.14249#S12.SS1)
  - 12.2 ALFWorld (https://arxiv.org/html/2606.14249#S12.SS2)
  - 12.3 WebShop (https://arxiv.org/html/2606.14249#S12.SS3)
  - 12.4 $\tau^{3}$-Bench (https://arxiv.org/html/2606.14249#S12.SS4)
  - 12.5 SWE-bench Verified (https://arxiv.org/html/2606.14249#S12.SS5)
- 13 Reproducibility and Artifacts (https://arxiv.org/html/2606.14249#S13)
  - 13.1 Per-Run Directory Layout (https://arxiv.org/html/2606.14249#S13.SS1)

## 1 Introduction

The capacity of modern agents depends not only on the underlying model [deepseekai2026deepseekv4, glm5team2026glm5vibecodingagentic, yang2025qwen3, team2023gemini], but on the mediation imposed by the surrounding harness [lu2026openclaw, liagent, claudecode]. This harness converts raw model outputs into structured agent behaviors by determining how tasks are represented, how external services are accessed, and how intermediate decisions are communicated during execution. As agents tackle longer-horizon tasks in richer environments, harness design becomes integral to agent development.

Despite this importance, harness development remains far from a mature engineering discipline. First, harnesses are hand-engineered and static: any change in model version, tooling, or problem domain requires bespoke modification, with no mechanism for experience-driven improvement. Second, harnesses are architecturally entangled: they typically combine prompt templates, tool wrappers, retry policies, and memory in the same codepaths, so changes to one component silently break others, and reuse across domains reduces to copying rather than composition. Third, harness engineering and model training operate independently: trajectory data collected while improving the harness is discarded rather than incorporated into model training, and model improvements do not translate into harness improvements.

We address these gaps by treating the harness as a first-class object that can be composed, adapted, and evolved alongside the model. HarnessX embodies this principle as a unified harness foundry.
It begins with a modular foundation: harness primitives spanning context, tools [feng2025retool], skills, control, and memory are described via typed interfaces and composed via a substitution algebra. This separates concerns that existing systems typically conflate.

On top of this substrate, we introduce AEGIS, an observability-driven and auditable harness adaptation engine. Framing harness adaptation not as ad-hoc editing but as a learning problem over symbolic artifacts (prompts [zhou2025proposer], tools, memory, and control policies) reveals that standard RL pathologies (reward hacking, catastrophic forgetting [kirkpatrick2017overcoming], under-exploration [ladosz2022exploration]) become concrete design risks. To address these risks, AEGIS combines full trace observability with a four-stage pipeline (Digester, Planner, Evolver, and Critic) that compresses traces, plans adaptations, generates candidates, and assesses changes.

Finally, we close the loop between harness adaptation and model training via harness-model co-evolution. Traces produced during harness adaptation serve as reinforcement-learning signal for model training, so that model improvements feed back into subsequent harness evolution.

We empirically validate HarnessX across five benchmarks (GAIA, ALFWorld, WebShop, $\tau^{3}$-Bench, SWE-bench Verified), three task-agent families (Claude Sonnet 4.6, GPT-5.4, Qwen3.5-9B), and up to 15 evolution rounds. Harness evolution yields an average absolute gain of +14.5% across 15 model–benchmark configurations, with individual gains ranging from 0.0% to +44.0%. Among the 14 configurations that improve, gains range from +1.1% ($\tau^{3}$-Bench, near-ceiling baseline) to +44.0% (ALFWorld, weakest agent). Gains exhibit an inverse-scaling pattern: on ALFWorld and GAIA, the weakest task agent benefits most (+44.0% for Qwen3.5-9B vs. +11.2% for Sonnet 4.6 on ALFWorld), suggesting that evolved harnesses address behavioral gaps that weaker models cannot self-correct.
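The four AEGIS stages named above (compress traces, plan adaptations, generate candidates, assess changes) can be pictured as a chain of small functions. The following is only an illustrative sketch under stated assumptions: the `Trace` and `Candidate` types, the failure-count summary, the `mitigate:` labels, the caller-supplied `evaluate` function, and the score-threshold acceptance rule are all hypothetical stand-ins and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Trace:
    """Hypothetical execution record for one task attempt."""
    task_id: str
    success: bool
    error: str = ""


@dataclass(frozen=True)
class Candidate:
    """Hypothetical harness change paired with its measured score."""
    change: str
    score: float


def digester(traces: List[Trace]) -> Dict[str, int]:
    """Compress raw traces into a count of failures per error label."""
    summary: Dict[str, int] = {}
    for t in traces:
        if not t.success:
            summary[t.error] = summary.get(t.error, 0) + 1
    return summary


def planner(summary: Dict[str, int], top_k: int = 2) -> List[str]:
    """Plan adaptations by targeting the most frequent failure labels first."""
    ranked = sorted(summary.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"mitigate:{label}" for label, _ in ranked[:top_k]]


def evolver(plans: List[str], evaluate: Callable[[str], float]) -> List[Candidate]:
    """Generate one candidate per plan and score it with the supplied evaluator."""
    return [Candidate(change=p, score=evaluate(p)) for p in plans]


def critic(candidates: List[Candidate], baseline: float) -> List[Candidate]:
    """Assess changes: keep only candidates that beat the baseline score."""
    return [c for c in candidates if c.score > baseline]


def adaptation_round(traces: List[Trace],
                     evaluate: Callable[[str], float],
                     baseline: float) -> List[Candidate]:
    """Run the four stages in order and return the accepted candidates."""
    return critic(evolver(planner(digester(traces)), evaluate), baseline)


traces = [
    Trace("t1", False, "timeout"),
    Trace("t2", False, "timeout"),
    Trace("t3", False, "parse"),
    Trace("t4", True),
]
scores = {"mitigate:timeout": 0.6, "mitigate:parse": 0.4}
accepted = adaptation_round(traces, scores.__getitem__, baseline=0.5)
```

In this toy run only the timeout mitigation clears the baseline, so it is the only change the critic lets through. A real critic would presumably also guard against the pathologies the text lists, such as reward hacking, which this sketch does not attempt.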
On heterogeneous task sets (GAIA), single-harness evolution stagnates; a variant-isolation ablation restores stable improvement (+13.6%, non-degrading over 15 rounds).

In summary, our contributions are four-fold:

- Harness Composition (Section 3, https://arxiv.org/html/2606.14249#S3). We formalize the harness as a first-class, typed object composed of processors attached to lifecycle hooks. A nine-dimensional taxonomy spans the full behavioral space, and a substitution algebra enables per-task configuration with type-safe insertion and removal. This compositional structure makes the intended scope of each behavioral change explicit, a precondition for the variant isolation that stabilizes evolution.
- Harness Adaptation (Section 4, https://arxiv.org/html/2606.14249#S4). We introduce AEGIS, a trace-driven, multi-agent harness evolution engine. An operational mirror maps harness adaptation onto standard RL constructs, converting familiar RL pathologies (reward hacking, catastrophic forgetting, under-exploration) into concrete design risks addressed by a four-stage pipeline (Digester, Planner, Evolver, Critic) with deterministic gating. An optional variant-isolation strategy prevents cross-task interference on heterogeneous benchmarks.
- Harness-Model Co-Evolution (Section 5, https://arxiv.org/html/2606.14249#S5). We close the optimization loop by interleaving harness evolution with model reinforcement learning over a shared replay buffer. Cross-harness GRPO enables the model to internalize strategies from successive harness versions, breaking the scaffolding ceiling that limits harness-only adaptation and the training-signal ceiling that limits model-only RL.
- Empirical Validation (Section 6, https://arxiv.org/html/2606.14249#S6). Across five benchmarks, three task-agent families, and up to 15 evolution rounds, HarnessX yields an average gain of +14.5% (up to +44.0%), with gains largest where baselines are lowest.
A variant-isolation ablation resolves stagnation on heterogeneous task sets, and co-evolution yields an additional +4.7% over harness-only evolution (Section 6.5, https://arxiv.org/html/2606.14249#S6.SS5).

## 2 Related Work

### 2.1 Harness Engineering

Existing agent infrastructure occupies a spectrum of increasingly opinionated harness abstractions. At the primitive layer, libraries such as LangChain [langchain], LlamaIndex [Liu_LlamaIndex_2022], and Smolagents [smolagents] provide typed building blocks for prompts, tools, retrieval, and memory. These primitives can be tested in isolation but do not support harness-level composition: two harnesses built from identical primitives may still differ in structure.

The next level of abstraction orchestrates these primitives into reusable patterns. LangGraph [langgraph] models the behavior of an agent with a stateful graph; AutoGen [wu2024autogen] models multi-agent interaction as structured conversation; CrewAI [moura2025crewai] assigns role-based identities to agents; and Letta [packer2023memgpt] couples autonomous loops with persistent memory. Although these frameworks make harness writing easier, they impose a particular control loop, so combining patterns, replacing components, and porting enhancements across tasks mostly remain manual.

Lastly, there are productized, domain-specific harnesses such as Claude Code [claudecode], Cursor [cursor], Manus [shen2025mind], and DeerFlow [deerflow]. These systems demonstrate the impact of harness design but remain architecturally static, evolving only through manual iteration.

Two structural gaps persist across all three layers. First, no layer exposes the harness as a substitutable entity composed of typed elements, so building a per-task harness always involves rewriting. Second, no mechanism exists for in-loop improvement: once defined, a harness evolves only through human iteration between releases.
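To make the first gap concrete, here is a minimal sketch of what a substitutable harness built from typed elements might look like: processors attached to lifecycle hooks, with insertion, removal, and substitution that return a new harness instead of editing shared code. Everything in it is an assumption for illustration. The hook names, the `Processor` and `Harness` classes, and the "same hook" substitution rule are hypothetical and are not the paper's API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

State = Dict[str, str]

# Hypothetical lifecycle hooks; the excerpt does not list the paper's hook names.
HOOKS = ("before_model", "after_model")


@dataclass(frozen=True)
class Processor:
    """A named state transformer attached to exactly one lifecycle hook."""
    name: str
    hook: str
    fn: Callable[[State], State]

    def __post_init__(self) -> None:
        if self.hook not in HOOKS:
            raise ValueError(f"unknown hook: {self.hook!r}")


@dataclass(frozen=True)
class Harness:
    """An immutable, ordered collection of processors."""
    processors: Tuple[Processor, ...] = ()

    def _index(self, name: str) -> int:
        for i, p in enumerate(self.processors):
            if p.name == name:
                return i
        raise KeyError(name)

    def insert(self, proc: Processor) -> "Harness":
        if any(p.name == proc.name for p in self.processors):
            raise ValueError(f"duplicate processor: {proc.name!r}")
        return Harness(self.processors + (proc,))

    def remove(self, name: str) -> "Harness":
        i = self._index(name)
        return Harness(self.processors[:i] + self.processors[i + 1:])

    def substitute(self, name: str, proc: Processor) -> "Harness":
        # The replacement must attach to the same hook as the original.
        i = self._index(name)
        if self.processors[i].hook != proc.hook:
            raise TypeError("replacement must attach to the same hook")
        return Harness(self.processors[:i] + (proc,) + self.processors[i + 1:])

    def run(self, hook: str, state: State) -> State:
        """Apply, in order, every processor registered on the given hook."""
        for p in self.processors:
            if p.hook == hook:
                state = p.fn(state)
        return state


terse = Processor("system_prompt", "before_model",
                  lambda s: {**s, "prompt": "Answer briefly. " + s["task"]})
stepwise = Processor("system_prompt", "before_model",
                     lambda s: {**s, "prompt": "Think step by step. " + s["task"]})

base = Harness().insert(terse)
variant = base.substitute("system_prompt", stepwise)  # base is left untouched
```

Because every operation returns a fresh `Harness`, a per-task variant can coexist with the base configuration, which is the property that per-task substitution needs.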
Concurrently, Claude Code introduced Dynamic Workflows [anthropic2026dynamicworkflows], enabling the model to generate task-specific harness scripts at runtime. While this represents a step toward adaptive harnesses, it operates within a single session without persistent trace-based optimization, cross-session evolution, or harness–model co-training. HarnessX addresses both gaps by treating harness adaptation as a multi-round, trace-driven learning problem with typed composition for variant isolation, structured observability for pathology detection, and a shared replay buffer that closes the loop between harness evolution and model training.

### 2.2 Self-Evolving Agents

Research on self-evolving agents investigates how an agent system can improve without retraining the underlying foundation model. Early work focused on the single most easily editable aspect: the prompt. Approaches like APE [zhou2022large], OPRO [yang2024large], EvoPrompt [guo2024connecting], and Promptbreeder [fernando2024promptbreeder] treat instruction formulation as a black-box optimization problem, while ProTeGi [pryzant2023automatic] and TextGrad [yuksekgonul2024textgrad] introduce gradient-inspired textual feedback to make the optimization process explicit. DSPy [khattab2023dspy] and MIPRO [opsahl2024optimizing] extend this approach by compiling a declarative LM program, whose prompts are optimized against labeled data. These approaches establish instructions as a learnable component, but harness-level features (tools, memory, control flow) remain outside the optimization scope.
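As a rough illustration of what treating instruction formulation as black-box optimization means, the sketch below runs a plain hill-climbing loop over prompt strings, using only a score returned by an opaque evaluator. It is not the algorithm of APE, OPRO, EvoPrompt, or Promptbreeder. The edit list and the keyword-counting `toy_score` are hypothetical stand-ins for an LLM proposer and a benchmark evaluation.

```python
import random
from typing import Callable, List, Tuple


def hill_climb_prompt(seed: str,
                      edits: List[str],
                      score: Callable[[str], float],
                      rounds: int,
                      rng: random.Random) -> Tuple[str, float]:
    """Keep a candidate prompt only when the black-box score strictly improves."""
    best, best_score = seed, score(seed)
    for _ in range(rounds):
        candidate = f"{best} {rng.choice(edits)}"
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score


def toy_score(prompt: str) -> float:
    """Hypothetical evaluator: counts which target phrases the prompt contains."""
    keywords = ("step by step", "cite sources")
    text = prompt.lower()
    return float(sum(k in text for k in keywords))


EDITS = ["Think step by step.", "Cite sources.", "Be polite."]

best, best_score = hill_climb_prompt(
    "Answer the question.", EDITS, toy_score, rounds=20, rng=random.Random(0)
)
```

The loop only ever edits the instruction text. Tools, memory, and control flow never enter the search space, which is the limitation the paragraph above points out.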
Another line of work improves agents by accumulating and reusing prior execution experience in memory. Memento [zhou2025memento] improves agents through case-based memory without fine-tuning the model. MIA [qiao2026memory] unifies non-parametric and parametric memory within a single Manager-Planner-Executor framework: it pairs a non-parametric store of compressed trajectories with a parametric planner that evolves on the fly at test time, couples the two through a bidirectional loop that continually converts experience between them, and reports superior results across eleven benchmarks.

Subsequent works extend optimization to agent workflows. GPTSwarm [zhuge2024gptswarm], ADAS [hu2025automated