@dair_ai: // State-Externalizing Harnesses // A new paradigm is emerging on how to effectively build agents and harnesses. If the…
Summary
Harness-1 introduces a state-externalizing harness that separates routine bookkeeping from policy decisions in search agents, enabling a 20B model to outperform larger frontier searchers across multiple benchmarks.
Cached at: 06/02/26, 03:47 PM
// State-Externalizing Harnesses //
A new paradigm is emerging on how to effectively build agents and harnesses.
If there is state that the environment can maintain reliably, it probably doesn’t belong inside the policy. Move it into the harness, and a 20B model trains better and generalizes further.
Search agents are usually trained as a single policy over a growing transcript, so RL has to learn semantic search and routine bookkeeping at the same time. This model, Harness-1, splits those apart.
The harness keeps the working memory (candidate pool, evidence links, verification records, deduplicated observations, budget-aware context) outside the policy, and the 20B model only decides what to search, what to keep, what to verify, and when to stop.
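The split described above can be sketched in a few lines of Python. This is a minimal illustration, not the Harness-1 implementation: the class name `SearchHarness`, its methods, the toy keyword search, and the substring-based verification are all hypothetical stand-ins. The point is only that the candidate pool, deduplication, evidence notes, verification records, and search budget live in the environment, while the policy issues the semantic actions (search, keep, verify, stop).

```python
from dataclasses import dataclass, field


@dataclass
class SearchHarness:
    """Hypothetical environment-side working memory (illustrative only)."""
    corpus: dict[str, str]                 # doc_id -> text, a toy stand-in for a search backend
    budget: int = 5                        # maximum number of search calls
    searches_used: int = 0
    candidates: dict[str, str] = field(default_factory=dict)  # deduplicated observations
    curated: dict[str, str] = field(default_factory=dict)     # doc_id -> evidence note
    verified: dict[str, bool] = field(default_factory=dict)   # doc_id -> check result

    def search(self, query: str) -> list[str]:
        """Toy keyword search; returns only doc ids the harness has not seen yet."""
        if self.searches_used >= self.budget:
            return []
        self.searches_used += 1
        terms = query.lower().split()
        new_ids = []
        for doc_id, text in self.corpus.items():
            if doc_id in self.candidates:
                continue  # deduplication is the harness's job, not the policy's
            if any(term in text.lower() for term in terms):
                self.candidates[doc_id] = text
                new_ids.append(doc_id)
        return new_ids

    def keep(self, doc_id: str, note: str) -> bool:
        """Move a candidate into the curated set with a short evidence note."""
        if doc_id not in self.candidates:
            return False
        self.curated[doc_id] = note
        return True

    def verify(self, doc_id: str, claim: str) -> bool:
        """Toy check: the claim must appear verbatim in the stored document."""
        ok = claim.lower() in self.candidates.get(doc_id, "").lower()
        self.verified[doc_id] = ok
        return ok

    def render(self) -> str:
        """Compact state shown to the policy instead of the full transcript."""
        lines = [f"searches left: {self.budget - self.searches_used}"]
        for doc_id, note in self.curated.items():
            if doc_id not in self.verified:
                status = "unverified"
            elif self.verified[doc_id]:
                status = "verified"
            else:
                status = "failed check"
            lines.append(f"{doc_id} [{status}]: {note}")
        return "\n".join(lines)


# A scripted "policy" standing in for the trained model's decisions.
corpus = {
    "d1": "Harness-1 is a 20B search agent.",
    "d2": "Patents cover inventions.",
    "d3": "The harness keeps a candidate pool.",
}
harness = SearchHarness(corpus, budget=2)
first = harness.search("harness")
harness.keep("d1", "states the model size")
harness.verify("d1", "20B")
print(harness.render())
```

In this sketch the policy never has to remember which documents it has already seen or how much budget remains; it reads that from `render()` at each step.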
Across eight retrieval benchmarks spanning web, finance, patents, and multi-hop QA, it reaches 0.730 average curated recall, beating the next-best open search agent by 11.4 points and staying competitive with much larger frontier searchers. The gains are largest on the held-out transfer benchmarks.
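The post does not define "curated recall". One plausible reading, assumed here and not confirmed by the source, is the fraction of a query's relevant documents that end up in the agent's final curated set, averaged across benchmarks. The function names and the toy inputs below are hypothetical and are not results from the paper.

```python
def curated_recall(curated: set[str], relevant: set[str]) -> float:
    """Fraction of relevant documents present in the curated set (assumed definition)."""
    if not relevant:
        return 0.0
    return len(curated & relevant) / len(relevant)


def macro_average(scores: list[float]) -> float:
    """Unweighted mean, e.g. over per-benchmark recall scores."""
    if not scores:
        return 0.0
    return sum(scores) / len(scores)


# Toy data only: two relevant documents out of four were curated.
toy_recall = curated_recall({"a", "b", "x"}, {"a", "b", "c", "d"})
print(toy_recall)
```

Under this reading, keeping irrelevant documents (such as `"x"` above) does not lower the score, so the metric rewards coverage of the relevant set rather than precision.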
Paper: https://arxiv.org/abs/2606.02373
Learn to build effective AI agents in our academy: https://academy.dair.ai
Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses
Source: https://arxiv.org/abs/2606.02373
Abstract: Search agents are often trained as policies over growing transcripts: the model must decide how to search while also remembering what it has seen, which evidence is useful, which constraints remain open, and which claims have actually been checked. We argue that this formulation puts too much routine state management inside the policy: reinforcement learning is forced to optimize both semantic search decisions and recoverable bookkeeping that the environment can maintain more reliably. We introduce Harness-1, a 20B search agent (retrieval subagent) trained with reinforcement learning inside a stateful search harness. The harness maintains environment-side working memory, including a candidate pool, an importance-tagged curated set, compact evidence links, verification records, compressed and deduplicated observations, and budget-aware context rendering. The policy retains the semantic decisions: what to search, which documents to keep or discard, what to verify, and when to stop. Across eight retrieval benchmarks spanning web, finance, patents, and multi-hop QA, Harness-1 achieves 0.730 average curated recall, outperforming the next strongest open search subagent by +11.4 points and remaining competitive with much larger frontier-model searchers. Its gains are especially strong on held-out transfer benchmarks, suggesting that reinforcement learning over explicit search state can produce retrieval behaviors that generalize beyond the training domains. Our code is available at this https URL.
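The abstract mentions an importance-tagged curated set and budget-aware context rendering without giving details. A minimal sketch of how the two could fit together follows, assuming a character budget and an integer importance scale; `CuratedItem`, `render_context`, and the greedy fill rule are hypothetical choices for illustration, not the paper's method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CuratedItem:
    doc_id: str
    importance: int  # higher means more important (assumed scale)
    summary: str


def render_context(items: list[CuratedItem], budget_chars: int) -> str:
    """Greedily render items from most to least important within a character budget.

    Items that do not fit are skipped and counted. The trailing "omitted"
    notice is not charged against the budget in this sketch.
    """
    lines = []
    used = 0
    omitted = 0
    for item in sorted(items, key=lambda it: it.importance, reverse=True):
        line = f"[{item.importance}] {item.doc_id}: {item.summary}"
        cost = len(line) + 1  # +1 for the newline separator
        if used + cost <= budget_chars:
            lines.append(line)
            used += cost
        else:
            omitted += 1
    if omitted:
        lines.append(f"({omitted} lower-priority item(s) omitted)")
    return "\n".join(lines)


items = [
    CuratedItem("d1", 1, "minor background"),
    CuratedItem("d2", 3, "key filing date"),
    CuratedItem("d3", 2, "supporting quote"),
]
print(render_context(items, budget_chars=50))
```

Because the harness decides what fits, the policy sees a bounded view of its own evidence regardless of how long the episode has run.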
Submission history
From: Pengcheng Jiang
[v1] Mon, 1 Jun 2026 15:21:41 UTC (6,831 KB)
Similar Articles
Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses
Introduces Harness-1, a 20B open search agent trained with state-externalizing harnesses, achieving strong retrieval performance and outperforming larger frontier models on several benchmarks.
@omarsar0: // Scaling Laws for Agent Harnesses // If you build agent harnesses, this one is worth your time. (bookmark it) Most ha…
New research on scaling laws for agent harnesses reveals that most token and tool call volume does not matter; the work introduces an effective approach.
@Potatoloogs: https://x.com/Potatoloogs/status/2057391224592667051
This article analyzes the agent harness in depth: the engineering infrastructure wrapped around an LLM, made up of 12 components including orchestration loops, tool calling, memory systems, and context management. It cites practices from companies such as Anthropic, OpenAI, and LangChain, and argues that the harness plays a critical role in production-grade AI agents.
Your agent is only as good as its harness. I open-sourced one with 40 capabilities behind a single function call
An open-source agent harness with 40 capabilities behind a single function call, including persistent memory, Docker sandbox, auto-summarization, stuck-loop detection, budget caps, and live run forking for branching agent execution. Built on Pydantic AI and designed to replace the 2000 lines of glue code every production agent needs.
@santtiagom_: Very good article from OpenAI about Harness Engineering and Codex. They explain how they used agents to build an intern…
This tweet summarizes an OpenAI article on Harness Engineering and Codex, discussing challenges and insights from building a 1M-line internal product using AI agents.