@sethkarten: https://x.com/sethkarten/status/2072034978112889328

X AI KOLs Following 06/30/26, 07:10 PM Papers

arc-agi self-improving-agent agentic-harness world-model skill-reuse test-time-learning foundation-model

Summary

Continual Harness is a reset-free, self-improving agentic harness that achieves 20.54% on ARC-AGI-3 at a cost of $774 by storing memories, reusing skills, and refining its prompt, outperforming prior baselines like Hermes and OpenClaw with greater efficiency.

https://t.co/HqIj86rXEp

Original Article

Cached at: 07/01/26, 06:03 AM

Continual Harness: An Efficient Self-Improving Agent on ARC-AGI-3

ARC-AGI-3 is an IQ test for agents. The heavy test-time learning required by the benchmark pushes agents to form an internal world model of the rules and mechanics that updates with new evidence.

Continual Harness is a reset-free, self-improving agentic harness that enables a foundation model to store memories, write reusable skills, deploy subagents, and refine its own prompt during interactive tasks.

We study this on the public set of ARC-AGI-3, which contains 25 unknown games in a shared environment. Each game is designed to hide its rules, mechanics, and scoring method, so an agent cannot rely on human-engineered task descriptions or domain-specific tools. This creates a hard setting for current LLM agents and frontier models. As of June 30, 2026, the best frontier model on the verified leaderboard, Claude Opus 4.8 (high), reached only 1.5%, and the officially released OpenClaw harness with Claude Opus 4.7 reached only 5.2%.

Starting from minimal information about the game environment (without even a color legend for the ASCII map) and using strictly sandboxed code execution, Continual Harness scored 20.54% with a total cost of only $774. The result makes it one of the most efficient agentic harnesses on the leaderboard.

Our main takeaways are that Continual Harness generalizes by improving a world model at test time, gains efficiency from skill reuse, and leverages reset-free refinement to bootstrap from early exploration.

Results

Continual Harness scored 20.54% on the public set of ARC-AGI-3 at a cost of $774. The result outperforms the controlled Hermes baseline (8.25% at $5,674), the public OpenClaw reference point (5.20% at $2,912), and the A-Evolve MAS Evolved agent (12.30% at $5,300).
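To make the cost comparison concrete, the short Python sketch below divides each reported score by its reported cost. The scores and dollar amounts are the ones quoted above; the "points per $1,000" metric and the helper name points_per_1k_usd are our own illustrative choices, not a metric defined by the benchmark.

```python
# Scores (%) and costs (USD) exactly as reported in the paragraph above.
results = {
    "Continual Harness": (20.54, 774),
    "Hermes (controlled baseline)": (8.25, 5674),
    "OpenClaw (public reference)": (5.20, 2912),
    "A-Evolve MAS Evolved": (12.30, 5300),
}


def points_per_1k_usd(score_pct: float, cost_usd: float) -> float:
    """Illustrative metric (an assumption): score percentage points per $1,000 spent."""
    if cost_usd <= 0:
        raise ValueError("cost_usd must be positive")
    return score_pct / cost_usd * 1000


efficiency = {name: points_per_1k_usd(s, c) for name, (s, c) in results.items()}

# Print systems from most to least cost-efficient under this metric.
for name, value in sorted(efficiency.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.2f} points per $1,000")
```

Under this metric Continual Harness comes out at roughly 26.5 points per $1,000, against roughly 1.5 to 2.3 for the three comparison systems.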

Continual Harness’s score advantage comes mainly from action efficiency. It completes levels by discovering workable mechanics early and reusing them on later levels. Compared with Hermes, Continual Harness completed fewer levels overall (64 vs. 70) but achieved more than twice the final score (20.54% vs. 8.25%). On completed levels, Continual Harness averages only 1.48x the human-baseline action count, while Hermes averages 15.30x.
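The action-efficiency figure can be read as an average of per-level ratios. The sketch below computes such an average from per-level logs. It assumes the average is the mean of (agent actions / human-baseline actions) over completed levels, which the text does not spell out; the function names and the toy numbers are hypothetical and are not taken from the actual runs.

```python
from statistics import mean
from typing import Iterable, Tuple

# (agent_actions, human_baseline_actions, completed)
LevelLog = Tuple[int, int, bool]


def action_ratio(agent_actions: int, human_actions: int) -> float:
    """How many times more actions the agent used than the human baseline."""
    if human_actions <= 0:
        raise ValueError("human_actions must be positive")
    return agent_actions / human_actions


def mean_action_ratio(levels: Iterable[LevelLog]) -> float:
    """Mean action ratio over completed levels only (assumed averaging rule)."""
    ratios = [action_ratio(a, h) for a, h, done in levels if done]
    if not ratios:
        raise ValueError("no completed levels")
    return mean(ratios)


# Toy data, made up for illustration: two completed levels and one failed level.
toy_levels = [(30, 20, True), (45, 30, True), (400, 25, False)]
print(mean_action_ratio(toy_levels))  # 1.5; the failed level is excluded
```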

Continual Harness gains efficiency because useful computation becomes part of the harness state instead of remaining scratch work. Across 25 games, 62% of the executed actions originate from saved skills rather than fresh VLM deliberation. This share exceeds 80% on the harness’s top-performing games such as cn04 and ft09.
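The idea of computation becoming part of the harness state can be sketched as a small persistent skill store that also records where each executed action came from. This is a minimal illustration under our own assumptions, not the Continual Harness implementation: the class HarnessState, its methods, and the action names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Action = str
Skill = Callable[[dict], List[Action]]  # maps an observation to a list of actions


@dataclass
class HarnessState:
    """Hypothetical state that persists across levels instead of being reset."""
    skills: Dict[str, Skill] = field(default_factory=dict)
    action_log: List[Tuple[Action, str]] = field(default_factory=list)

    def save_skill(self, name: str, fn: Skill) -> None:
        # Persist a routine so later levels can call it instead of rewriting it.
        self.skills[name] = fn

    def run_skill(self, name: str, observation: dict) -> List[Action]:
        actions = self.skills[name](observation)
        self.action_log.extend((a, "skill") for a in actions)
        return actions

    def act_fresh(self, action: Action) -> Action:
        # An action chosen by fresh model deliberation rather than a saved skill.
        self.action_log.append((action, "model"))
        return action

    def skill_share(self) -> float:
        """Fraction of executed actions that originated from saved skills."""
        if not self.action_log:
            return 0.0
        from_skills = sum(1 for _, source in self.action_log if source == "skill")
        return from_skills / len(self.action_log)


state = HarnessState()
state.act_fresh("UP")
state.act_fresh("UP")
state.save_skill("walk_right", lambda obs: ["RIGHT"] * obs["steps"])
state.run_skill("walk_right", {"steps": 3})
print(f"{state.skill_share():.0%} of actions came from saved skills")
```

In this toy run, three of the five logged actions come from the saved skill, so the share is 60%.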

The Hermes baseline provides an example where useful computation stays transient. Hermes spends 86% of its 18,717 tool calls on execute_code, and only 0.07% on persisting skills or memory. Useful scripts such as BFS solvers, grid parsers, and state trackers are repeatedly written as one-off code.
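As an illustration of the kind of script described above, here is a minimal breadth-first search over an ASCII grid, the sort of routine worth saving once rather than rewriting per level. The wall symbol "#", the move names, and the function name bfs_path are assumptions for this sketch; the actual games provide no legend, so a real agent would first have to infer which cells are passable.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]  # (row, column)
MOVES = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}


def bfs_path(grid: List[str], start: Cell, goal: Cell, wall: str = "#") -> Optional[List[str]]:
    """Return a shortest list of move names from start to goal, or None if unreachable.

    Assumes a rectangular grid in which every row has the same length.
    """
    rows, cols = len(grid), len(grid[0])
    queue = deque([start])
    # Maps each visited cell to (previous cell, move taken); the start maps to None.
    parent: Dict[Cell, Optional[Tuple[Cell, str]]] = {start: None}
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path: List[str] = []
            while parent[cell] is not None:
                cell, move = parent[cell]
                path.append(move)
            return path[::-1]
        r, c = cell
        for move, (dr, dc) in MOVES.items():
            nr, nc = r + dr, c + dc
            nxt = (nr, nc)
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != wall and nxt not in parent:
                parent[nxt] = (cell, move)
                queue.append(nxt)
    return None


toy_grid = [
    "..#",
    "..#",
    "...",
]
toy_path = bfs_path(toy_grid, (0, 0), (2, 2))
print(toy_path)
```

Persisting a function like this in the harness state means later levels pay only for a call, not for regenerating and debugging the solver.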

This study on ARC-AGI-3 with Continual Harness was led by Ruirong Feng. We thank all authors from our original study: Seth Karten, Joel Zhang, Tersoo Upaa Jr, Ruirong Feng, Wenzhe Li, Chengshuai Shi, Chi Jin, Kiran Vodrahalli. We also thank Google DeepMind for supporting this work with Gemini API credits.

Full write-up, demos, and replay videos: https://continual-harness.github.io/

Paper: arxiv.org/abs/2605.09998

Code: github.com/feng-rrRay/Continual-Harness-ARC-AGI-3

Similar Articles

Self-Harness: Harnesses That Improve Themselves

@omarsar0: // Self-Harness: Harnesses That Improve Themselves // (bookmark this one) Most of the agent scaffolds we rely on today …

best of the best agentic harnesses do this…

Continual Harness: Online Adaptation for Self-Improving Foundation Agents

HarnessX: A Composable, Adaptive, and Evolvable Agent Harness Foundry
