@sethkarten: https://x.com/sethkarten/status/2072034978112889328
Summary
Continual Harness is a reset-free, self-improving agentic harness that achieves 20.54% on ARC-AGI-3 at a cost of $774 by storing memories, reusing skills, and refining its prompt, outperforming prior baselines like Hermes and OpenClaw with greater efficiency.
Cached at: 07/01/26, 06:03 AM
Continual Harness: An Efficient Self-Improving Agent on ARC-AGI-3
ARC-AGI-3 is an IQ test for agents. The heavy test-time learning required by the benchmark pushes agents to form an internal world model of the rules and mechanics that updates with new evidence.
Continual Harness is a reset-free, self-improving agentic harness that enables a foundation model to store memories, write reusable skills, deploy subagents, and refine its own prompt during interactive tasks.
We study this on the public set of ARC-AGI-3, which contains 25 unknown games in a shared environment. Each game is designed to hide its rules, mechanics, and scoring method, so an agent cannot rely on human-engineered task descriptions or domain-specific tools. This creates a hard setting for current LLM agents and frontier models. As of Jun 30, 2026, the best frontier model on the verified leaderboard, Claude Opus 4.8 (high), reached only 1.5%, and the officially released OpenClaw harness with Claude Opus 4.7 reached only 5.2%.
Starting from minimal information about the game environment (without even a color legend for the ASCII map) and using strictly sandboxed code execution, Continual Harness scored 20.54% with a total cost of only $774. The result makes it one of the most efficient agentic harnesses on the leaderboard.
Our main takeaways are that Continual Harness generalizes by improving a world model at test time, gains efficiency from skill reuse, and leverages reset-free refinement to bootstrap from early exploration.
Results
Continual Harness scored 20.54% on the public set of ARC-AGI-3 at a cost of $774. The result outperforms the controlled Hermes baseline (8.25% at $5,674), the public OpenClaw reference point (5.20% at $2,912), and the A-Evolve MAS Evolved agent (12.30% at $5,300).
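To make the cost comparison concrete, the short sketch below divides each reported score by its reported cost. The metric name (score points per $1,000) is our own framing for illustration and is not a leaderboard statistic; the scores and costs are the ones quoted above.

```python
# Scores (%) and total costs (USD) exactly as quoted in the text above.
results = {
    "Continual Harness": (20.54, 774),
    "Hermes": (8.25, 5674),
    "OpenClaw": (5.20, 2912),
    "A-Evolve MAS Evolved": (12.30, 5300),
}


def points_per_1k_usd(score: float, cost_usd: float) -> float:
    """Score points obtained per $1,000 spent (an illustrative metric)."""
    return score / cost_usd * 1000


efficiency = {
    name: points_per_1k_usd(score, cost) for name, (score, cost) in results.items()
}

# Print the systems from most to least cost-efficient.
for name, value in sorted(efficiency.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.2f} points per $1,000")
```

Under this framing, Continual Harness lands at roughly 26.5 points per $1,000, while the other three systems fall between about 1.5 and 2.3.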
Continual Harness’s score advantage comes mainly from action efficiency. It completes levels by discovering workable mechanics early and reusing those mechanics on later levels. Compared with Hermes, Continual Harness completed fewer levels overall (64 vs. 70) but achieved more than twice the final score. The per-level comparison shows that, on completed levels, Continual Harness averages only 1.48x the human baseline action count, while Hermes averages 15.30x.
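The action-efficiency figure can be read as an average of per-level ratios between the agent's action count and the human baseline. The sketch below shows that calculation under our own assumptions: the function name and the per-level numbers are hypothetical, and it does not reproduce the official ARC-AGI-3 scoring formula.

```python
from statistics import mean


def mean_action_ratio(levels):
    """Average agent-to-human action ratio over completed levels.

    `levels` is a list of (agent_actions, human_baseline_actions) pairs.
    """
    ratios = [agent / human for agent, human in levels if human > 0]
    if not ratios:
        raise ValueError("no completed levels with a positive human baseline")
    return mean(ratios)


# Hypothetical per-level action counts, NOT data from the study.
efficient_agent = [(30, 20), (45, 30), (24, 20)]
wasteful_agent = [(300, 20), (450, 30), (400, 20), (150, 15)]

print(f"efficient agent: {mean_action_ratio(efficient_agent):.2f}x human baseline")
print(f"wasteful agent:  {mean_action_ratio(wasteful_agent):.2f}x human baseline")
```

In this toy data the wasteful agent completes more levels but needs about ten times as many actions per level, which mirrors how an agent can finish more levels and still end up with a lower score.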
Continual Harness gains efficiency because useful computation becomes part of the harness state instead of remaining scratch work. Across 25 games, 62% of the executed actions originate from saved skills rather than fresh VLM deliberation. This share exceeds 80% on the harness’s top-performing games such as cn04 and ft09.
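One way to picture "computation becoming part of the harness state" is a small skill store that prefers a saved skill over fresh deliberation and tracks where each executed action came from. This is a minimal sketch under our own assumptions: SkillStore, act, skill_share, and the action names are hypothetical and are not taken from the Continual Harness codebase.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SkillStore:
    """Hypothetical store of reusable skills plus a tally of action sources."""

    skills: dict = field(default_factory=dict)
    action_sources: Counter = field(default_factory=Counter)

    def save(self, name: str, fn: Callable[[dict], list]) -> None:
        """Persist a skill so later levels can reuse it."""
        self.skills[name] = fn

    def act(self, state: dict, skill_name: str, deliberate: Callable[[dict], list]) -> list:
        """Run the saved skill if it exists, otherwise fall back to deliberation."""
        if skill_name in self.skills:
            actions = self.skills[skill_name](state)
            self.action_sources["skill"] += len(actions)
        else:
            actions = deliberate(state)
            self.action_sources["deliberation"] += len(actions)
        return actions

    def skill_share(self) -> float:
        """Fraction of executed actions that originated from saved skills."""
        total = sum(self.action_sources.values())
        return self.action_sources["skill"] / total if total else 0.0


def model_step(state: dict) -> list:
    """Stand-in for one model-chosen action (hypothetical)."""
    return ["UP"]


store = SkillStore()
store.act({}, "move_right", model_step)  # no skill saved yet: 1 deliberated action
store.save("move_right", lambda state: ["RIGHT"] * state.get("steps", 1))
store.act({"steps": 4}, "move_right", model_step)  # 4 actions from the saved skill
print(f"share of actions from skills: {store.skill_share():.0%}")
```

The point of the design is that the expensive step (working out the mechanic) happens once, and every later use of the skill adds actions without another round of model deliberation.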
The Hermes baseline provides an example where useful computation stays transient. Hermes spends 86% of its 18,717 tool calls on execute_code, and only 0.07% on persisting skills or memory. Useful scripts such as BFS solvers, grid parsers, and state trackers are repeatedly written as one-off code.
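For a sense of what such a script looks like, here is a minimal BFS path solver over an ASCII grid, the kind of helper that is worth saving once rather than rewriting per level. The grid symbols, move names, and function name are assumptions made for illustration; they are not the ARC-AGI-3 action space or code from either harness.

```python
from collections import deque

# Hypothetical move names mapped to (row, column) offsets.
MOVES = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}


def bfs_path(grid, start, goal, wall="#"):
    """Return a shortest list of move names from start to goal, or None."""
    rows, cols = len(grid), len(grid[0])
    queue = deque([start])
    came_from = {start: None}  # cell -> (previous cell, move taken to get here)
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk back to the start, collecting moves in reverse order.
            path = []
            while came_from[cell] is not None:
                cell, move = came_from[cell]
                path.append(move)
            return path[::-1]
        for move, (dr, dc) in MOVES.items():
            nxt = (cell[0] + dr, cell[1] + dc)
            inside = 0 <= nxt[0] < rows and 0 <= nxt[1] < cols
            if inside and grid[nxt[0]][nxt[1]] != wall and nxt not in came_from:
                came_from[nxt] = (cell, move)
                queue.append(nxt)
    return None  # goal is unreachable


# Toy 3x3 map: '#' is a wall, '.' is open floor.
grid = [
    "..#",
    ".#.",
    "...",
]
print(bfs_path(grid, start=(0, 0), goal=(2, 2)))
```

Stored as a named skill, a solver like this can be called on every later level that shares the same movement mechanic, instead of being regenerated inside a one-off execute_code call.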
This study on ARC-AGI-3 with Continual Harness was led by Ruirong Feng. We thank all authors from our original study: Seth Karten, Joel Zhang, Tersoo Upaa Jr, Ruirong Feng, Wenzhe Li, Chengshuai Shi, Chi Jin, Kiran Vodrahalli. We also thank Google DeepMind for supporting this work with Gemini API credits.
Full write-up, demos, and replay videos: https://continual-harness.github.io/
Paper: arxiv.org/abs/2605.09998
Code: github.com/feng-rrRay/Continual-Harness-ARC-AGI-3
Similar Articles
Self-Harness: Harnesses That Improve Themselves
Self-Harness introduces a new paradigm where LLM-based agents iteratively improve their own operating harness by mining model-specific weaknesses, proposing harness modifications, and validating them through regression testing, achieving substantial performance gains on Terminal-Bench-2.0 across multiple base models.
@omarsar0: // Self-Harness: Harnesses That Improve Themselves // (bookmark this one) Most of the agent scaffolds we rely on today …
This paper introduces Self-Harness, a new paradigm where LLM-based agents iteratively improve their own operating harness—prompts, tools, and control flow—without human engineers or stronger external agents, achieving significant performance gains across multiple models.
best of the best agentic harnesses do this…
The author shares insights on building effective agent harnesses: the best ones minimize LLM reliance for trivial tasks and reserve LLMs for complex reasoning, distinguishing genuine harnesses from simple wrappers.
Continual Harness: Online Adaptation for Self-Improving Foundation Agents
The paper introduces 'Continual Harness,' a framework enabling embodied AI agents to self-improve online without environment resets. It demonstrates significant progress in playing Pokémon games, achieving human-level performance through automated prompt and skill refinement.
HarnessX: A Composable, Adaptive, and Evolvable Agent Harness Foundry
HarnessX is a foundry for composable, adaptive, and evolvable AI agent harnesses that uses compositional primitives and trace-driven evolution to improve agent performance. Across five benchmarks, it achieves an average gain of +14.5% (up to +44.0%), demonstrating that runtime interface evolution is a complementary lever to model scaling.