@sethkarten: https://x.com/sethkarten/status/2072034978112889328


Summary

Continual Harness is a reset-free, self-improving agentic harness that achieves 20.54% on ARC-AGI-3 at a cost of $774 by storing memories, reusing skills, and refining its prompt, outperforming prior baselines like Hermes and OpenClaw with greater efficiency.

https://t.co/HqIj86rXEp

Cached at: 07/01/26, 06:03 AM

Continual Harness: An Efficient Self-Improving Agent on ARC-AGI-3

ARC-AGI-3 is an IQ test for agents. The heavy test-time learning required by the benchmark pushes agents to form an internal world model of the rules and mechanics that updates with new evidence.

Continual Harness is a reset-free, self-improving agentic harness that enables a foundation model to store memories, write reusable skills, deploy subagents, and refine its own prompt during interactive tasks.

We study this on the public set of ARC-AGI-3, which contains 25 unknown games in a shared environment. Each game is designed to hide its rules, mechanics, and scoring method, so an agent cannot rely on human-engineered task descriptions or domain-specific tools. This creates a hard setting for current LLM agents and frontier models. As of Jun 30 2026, the best frontier model on the verified leaderboard, Claude Opus 4.8 (high), reached only 1.5%, and the officially released OpenClaw harness with Claude Opus 4.7 reached only 5.2%.

Starting from minimal information about the game environment (without even a color legend for the ASCII map) and using strictly sandboxed code execution, Continual Harness scored 20.54% with a total cost of only $774. The result makes it one of the most efficient agentic harnesses on the leaderboard.

Our main takeaways are that Continual Harness generalizes by improving a world model at test time, gains efficiency from skill reuse, and leverages reset-free refinement to bootstrap from early exploration.

Results

Continual Harness scored 20.54% on the public set of ARC-AGI-3 at a cost of $774. The result outperforms the controlled Hermes baseline (8.25% at $5,674), the public OpenClaw reference point (5.20% at $2,912), and the A-Evolve MAS Evolved agent (12.30% at $5,300).
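
To make the cost comparison concrete, here is a small sketch that turns the score and cost figures quoted above into score points per $1,000 spent. The "points per $1,000" metric and the function name points_per_kusd are our own illustrative choices, not a leaderboard metric.

```python
# Score (%) and total cost (USD) exactly as quoted in the paragraph above.
results = {
    "Continual Harness": (20.54, 774),
    "Hermes": (8.25, 5674),
    "OpenClaw": (5.20, 2912),
    "A-Evolve MAS Evolved": (12.30, 5300),
}


def points_per_kusd(score_pct, cost_usd):
    """Score percentage points obtained per 1,000 USD (illustrative metric)."""
    return score_pct / cost_usd * 1000


efficiency = {name: points_per_kusd(s, c) for name, (s, c) in results.items()}

# Print the systems from most to least cost-efficient.
for name, value in sorted(efficiency.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:22s} {value:6.2f} points per $1,000")
```

Under this rough measure, Continual Harness lands near 26.5 points per $1,000, while the three comparison systems each stay below 2.5.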

Continual Harness’s score advantage comes mainly from action efficiency. It completes levels by discovering workable mechanics early and reusing those mechanics on later levels. Compared with Hermes, Continual Harness completed fewer levels overall (64 vs. 70) but achieved more than twice the final score. The per-level comparison shows that Continual Harness averages only 1.48x the human baseline actions on completed levels, while Hermes averages 15.30x.
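
A per-level action ratio like the one above could be computed roughly as follows. This is a sketch under assumptions: we assume the average is a plain mean of per-level ratios over completed levels, and the record fields (agent_actions, human_actions, completed) and the sample numbers are made up for illustration, not taken from the study.

```python
from statistics import mean


def mean_action_ratio(levels):
    """Mean of agent_actions / human_actions over completed levels only."""
    ratios = [
        lvl["agent_actions"] / lvl["human_actions"]
        for lvl in levels
        if lvl["completed"] and lvl["human_actions"] > 0
    ]
    if not ratios:
        raise ValueError("no completed levels to average over")
    return mean(ratios)


# Illustrative, made-up level records (not data from the study).
example_levels = [
    {"agent_actions": 30, "human_actions": 20, "completed": True},
    {"agent_actions": 45, "human_actions": 30, "completed": True},
    {"agent_actions": 400, "human_actions": 25, "completed": False},  # skipped
]

ratio = mean_action_ratio(example_levels)
print(ratio)  # 1.5
```

With the averages quoted above, Hermes spends about 15.30 / 1.48 ≈ 10.3 times as many actions relative to the human baseline as Continual Harness does on completed levels.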

Continual Harness gains efficiency because useful computation becomes part of the harness state instead of remaining scratch work. Across 25 games, 62% of the executed actions originate from saved skills rather than fresh VLM deliberation. This share exceeds 80% on the harness’s top-performing games such as cn04 and ft09.
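
The idea of computation becoming part of the harness state can be illustrated with a toy skill store. This is a hypothetical sketch, not the Continual Harness implementation: the SkillStore class, its methods, and the stand-in fresh_plan function are all names we invented for illustration.

```python
from collections import Counter


class SkillStore:
    """Hypothetical store mapping a skill name to a saved callable."""

    def __init__(self):
        self.skills = {}
        self.action_sources = Counter()

    def save(self, name, fn):
        """Persist a routine so later levels can reuse it."""
        self.skills[name] = fn

    def act(self, name, state, deliberate):
        """Use a saved skill if present, else fall back to fresh deliberation."""
        if name in self.skills:
            actions = self.skills[name](state)
            self.action_sources["skill"] += len(actions)
        else:
            actions = deliberate(state)
            self.action_sources["fresh"] += len(actions)
        return actions

    def skill_share(self):
        """Fraction of executed actions that came from saved skills."""
        total = sum(self.action_sources.values())
        return self.action_sources["skill"] / total if total else 0.0


def fresh_plan(state):
    # Stand-in for model deliberation; returns a fixed action list.
    return ["up", "up", "right"]


store = SkillStore()
store.act("reach_goal", {}, fresh_plan)   # no skill yet: 3 fresh actions
store.save("reach_goal", fresh_plan)      # persist the routine as a skill
store.act("reach_goal", {}, fresh_plan)   # 3 actions from the saved skill
store.act("reach_goal", {}, fresh_plan)   # 3 more from the saved skill
print(store.skill_share())
```

In this toy run, six of nine actions come from the saved skill, so the share is two thirds. The 62% figure above is the same kind of ratio measured across the 25 games.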

The Hermes baseline provides an example where useful computation stays transient. Hermes spends 86% of its 18,717 tool calls on execute_code, and only 0.07% on persisting skills or memory. Useful scripts such as BFS solvers, grid parsers, and state trackers are repeatedly written as one-off code.
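
For a sense of scale, the percentages above can be converted back into approximate call counts. The percentages are rounded in the text, so the counts below are estimates rather than exact figures from the logs.

```python
total_calls = 18_717  # Hermes tool calls, as quoted above

# Approximate counts recovered from the rounded percentages.
execute_code_calls = round(0.86 * total_calls)    # 86% on execute_code
persist_calls = round(0.0007 * total_calls)       # 0.07% on skills or memory

print(execute_code_calls, persist_calls)  # 16097 13
```

That is roughly 16,000 code executions against about a dozen persistence calls.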

This study on ARC-AGI-3 with Continual Harness was led by Ruirong Feng. We thank all authors from our original study: Seth Karten, Joel Zhang, Tersoo Upaa Jr, Ruirong Feng, Wenzhe Li, Chengshuai Shi, Chi Jin, Kiran Vodrahalli. We also thank Google DeepMind for supporting this work with Gemini API credits.

Full write-up, demos, and replay videos: https://continual-harness.github.io/

Paper: arxiv.org/abs/2605.09998

Code: github.com/feng-rrRay/Continual-Harness-ARC-AGI-3

Similar Articles

Self-Harness: Harnesses That Improve Themselves

Hacker News Top

Self-Harness introduces a new paradigm where LLM-based agents iteratively improve their own operating harness by mining model-specific weaknesses, proposing harness modifications, and validating them through regression testing, achieving substantial performance gains on Terminal-Bench-2.0 across multiple base models.

best of the best agentic harnesses do this…

Reddit r/AI_Agents

The author shares insights on building effective agent harnesses: the best ones minimize LLM reliance for trivial tasks and reserve LLMs for complex reasoning, distinguishing genuine harnesses from simple wrappers.

Continual Harness: Online Adaptation for Self-Improving Foundation Agents

Hugging Face Daily Papers

The paper introduces 'Continual Harness,' a framework enabling embodied AI agents to self-improve online without environment resets. It demonstrates significant progress in playing Pokémon games, achieving human-level performance through automated prompt and skill refinement.

HarnessX: A Composable, Adaptive, and Evolvable Agent Harness Foundry

Hugging Face Daily Papers

HarnessX is a foundry for composable, adaptive, and evolvable AI agent harnesses that uses compositional primitives and trace-driven evolution to improve agent performance. Across five benchmarks, it achieves an average gain of +14.5% (up to +44.0%), demonstrating that runtime interface evolution is a complementary lever to model scaling.