Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts
Summary
Retrospective Harness Optimization (RHO) is a self-supervised method that improves LLM agent performance using only past trajectories. A single optimization round raises the pass rate on SWE-Bench Pro from 59% to 78% without external grading.
Cached at: 06/10/26, 05:44 AM
Source: https://huggingface.co/papers/2606.05922
Abstract
Retrospective Harness Optimization (RHO) is a self-supervised method that improves AI agent performance by optimizing the agent harness using only past trajectories, through diverse task selection, parallel re-solving, and self-validation.
AI agents rely on a harness of skills, tools, and workflows to solve complex problems. Continually improving this harness is essential for adapting to new tasks. However, existing optimization methods typically require ground-truth validation sets, yet such labeled data is difficult to acquire in practical deployment settings. To address this problem, we introduce Retrospective Harness Optimization (RHO), a self-supervised method that optimizes the agent harness using only past trajectories. Specifically, RHO selects a diverse coreset of challenging tasks from past trajectories and re-solves them in parallel. The agent analyzes these rollouts using self-validation and self-consistency, then generates candidate harness updates and selects the most effective one by its own pairwise self-preference. We evaluate RHO across three diverse domains, spanning software engineering, technical work, and knowledge work. Notably, a single optimization round improves the pass rate on SWE-Bench Pro from 59% to 78% without any external grading. Furthermore, our analysis demonstrates that RHO effectively targets prior failure modes. As a result, the optimized harness alters the agent’s behavior patterns and sustains higher accuracy during long-horizon sessions.
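The abstract names three mechanisms without giving their details: diverse coreset selection, self-consistency over parallel rollouts, and selection by pairwise self-preference. The Python sketch below is one possible reading of those steps, not the paper's implementation. Every function name, the difficulty-weighted farthest-point heuristic, the majority-vote agreement score, and the round-robin tournament are assumptions made for illustration. In practice the `prefers` callback would presumably be a call to the agent itself; here it is a plain function.

```python
from collections import Counter
from itertools import combinations
from typing import Callable, Sequence


def select_coreset(
    embeddings: Sequence[Sequence[float]],
    difficulty: Sequence[float],
    k: int,
) -> list[int]:
    """Hypothetical coreset step: greedy farthest-point selection,
    seeded with the hardest task and weighted by difficulty."""

    def dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    n = len(embeddings)
    chosen = [max(range(n), key=lambda i: difficulty[i])]
    while len(chosen) < min(k, n):
        remaining = [i for i in range(n) if i not in chosen]
        # Favor tasks that are both hard and far from everything already chosen.
        nxt = max(
            remaining,
            key=lambda i: difficulty[i]
            * min(dist(embeddings[i], embeddings[j]) for j in chosen),
        )
        chosen.append(nxt)
    return chosen


def self_consistency(answers: Sequence[str]) -> tuple[str, float]:
    """Hypothetical self-consistency signal: the majority answer across
    parallel rollouts of one task, plus the fraction that agree with it."""
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)


def pick_by_pairwise_preference(
    candidates: Sequence[str],
    prefers: Callable[[str, str], bool],
) -> str:
    """Hypothetical selection step: round-robin tournament over candidate
    harness updates; prefers(a, b) is True when a is judged better than b."""
    wins = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        wins[a if prefers(a, b) else b] += 1
    return max(candidates, key=lambda c: wins[c])
```

A round-robin tournament needs a number of comparisons quadratic in the number of candidates, which is acceptable for a handful of harness updates. Whether RHO compares all pairs or uses a cheaper scheme is not stated in the abstract.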
Similar Articles
Retrospective Progress-Aware Self-Refinement for LLM Agent Training
This paper introduces RePro, a framework that trains LLM agents to self-generate progress signals through a forward-then-reflect rollout paradigm, achieving up to 12% absolute success rate gains on WebShop, ALFWorld, and Sokoban benchmarks.
@omarsar0: // Self-Harness: Harnesses That Improve Themselves // (bookmark this one) Most of the agent scaffolds we rely on today …
This paper introduces Self-Harness, a new paradigm where LLM-based agents iteratively improve their own operating harness—prompts, tools, and control flow—without human engineers or stronger external agents, achieving significant performance gains across multiple models.
Bayesian-Agent: Posterior-Guided Skill Evolution for LLM Agent Harnesses
Bayesian-Agent presents a framework that treats reusable skills and SOPs as hypotheses, using Bayesian inference to guide agent behavior and improve task performance through posterior-guided harness optimization. It achieves significant improvements on multiple benchmarks with deepseek-v4-flash.
Harnesses for Inference-Time Alignment over Execution Trajectories
This paper studies harness design for LLM agents, separating it into task decomposition and guided execution, and shows that more elaborate harnesses are not uniformly better; it reveals failure modes and proposes partial harnesses as effective.
Stop Comparing LLM Agents Without Disclosing the Harness
This position paper argues that in long-horizon LLM agent tasks, the execution harness often determines performance more than the model itself, and current benchmarks misattribute harness-level gains to model improvements. It proposes a harness-aware evaluation framework with disclosure standards and variance decomposition protocols.