Continual Harness: Online Adaptation for Self-Improving Foundation Agents

Hugging Face Daily Papers

Summary

The paper introduces 'Continual Harness,' a framework enabling embodied AI agents to self-improve online without environment resets. It demonstrates significant progress in playing Pokémon games, achieving human-level performance through automated prompt and skill refinement.

Coding harnesses such as Claude Code and OpenHands wrap foundation models with tools, memory, and planning, but no equivalent exists for embodied agents' long-horizon, partially observable decision-making. We first report our Gemini Plays Pokemon (GPP) experiments. Through iterative human-in-the-loop harness refinement, GPP became the first AI system to complete Pokemon Blue, Yellow Legacy on hard mode, and Crystal without a lost battle. In the hardest stages, the agent itself began iterating on its strategy through long-context memory, surfacing emergent self-improvement signals alongside the human-in-the-loop refinement. Continual Harness removes the human entirely from this loop: a reset-free, self-improving harness for embodied agents that formalizes and automates what we observed. Starting from only a minimal environment interface, the agent alternates between acting and refining its own prompt, sub-agents, skills, and memory, drawing on any past trajectory data. Whereas prompt-optimization methods require episode resets, Continual Harness adapts online within a single run. On Pokemon Red and Emerald, across frontier models, Continual Harness starting from scratch substantially reduces button-press cost relative to the minimalist baseline and recovers a majority of the gap to a hand-engineered expert harness, with capability-dependent gains, despite starting from the same raw interface with no curated knowledge, no hand-crafted tools, and no domain scaffolding. We then close the loop with the model itself: an online process-reward co-learning loop, in which an open-source agent's rollouts through the refining harness are relabeled by a frontier teacher and used to update the model, drives sustained in-game milestone progress on Pokemon Red without resetting the environment between training iterations.
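
To make the two mechanisms in the abstract concrete, here is a minimal Python sketch of (a) the reset-free act/refine alternation and (b) the process-reward co-learning step. This is an illustration inferred from the abstract alone, not the paper's actual code: the class and method names (Harness, env.press, model.refine, teacher.score, weighted_nll) are hypothetical, and the refinement and relabeling details are stand-ins for whatever the authors implement.

```python
"""Hedged sketch of the control flow described in the abstract.
All interfaces below are assumptions, not the paper's API."""

from dataclasses import dataclass, field


@dataclass
class Harness:
    """Mutable harness state the agent is allowed to rewrite online."""
    prompt: str = "You control a Game Boy. Press buttons to make progress."
    skills: dict = field(default_factory=dict)   # name -> reusable button macro
    memory: list = field(default_factory=list)   # long-context notes


def continual_harness(env, model, refine_every=100, max_steps=10_000):
    """Reset-free loop: act for a stretch, then let the model rewrite its
    own prompt, skills, and memory from the trajectory collected so far."""
    harness = Harness()
    trajectory = []
    obs = env.observe()                      # minimal interface: no reset()
    for step in range(1, max_steps + 1):
        action = model.act(harness.prompt, harness.memory, obs)
        obs = env.press(action)              # e.g. a single button press
        trajectory.append((obs, action))
        if step % refine_every == 0:
            # Self-refinement happens mid-run, conditioned on any past
            # trajectory data; the environment keeps running throughout.
            harness = model.refine(harness, trajectory)
    return harness, trajectory


def co_learning_update(student, teacher, trajectory, optimizer):
    """Sketch of one online process-reward co-learning step: a frontier
    teacher relabels the student's rollout with per-step scores, which
    weight an update to the student. Again, no environment reset."""
    labels = [teacher.score(step) for step in trajectory]   # process rewards
    loss = student.weighted_nll(trajectory, labels)         # assumed objective
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The contrast with prompt-optimization baselines is visible in the control flow: refine runs mid-trajectory on live environment state, and co_learning_update consumes the same ongoing stream, so neither step ever issues a reset.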

Paper page - Continual Harness: Online Adaptation for Self-Improving Foundation Agents

Source: https://huggingface.co/papers/2605.09998

Abstract

A self-improving AI system for embodied agents autonomously refines its own prompts, skills, and memory through continuous learning without environment resets, achieving human-level performance in complex video games.



Get this paper in your agent:

hf papers read 2605.09998

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash


Similar Articles

Harness design for long-running application development

Anthropic Engineering

Anthropic engineers detail a multi-agent harness design using generator and evaluator agents to improve Claude's ability to build complete, high-quality frontend applications autonomously over long durations.

RewardHarness: Self-Evolving Agentic Post-Training

arXiv cs.AI

RewardHarness is a self-evolving agentic framework for post-training that replaces large-scale preference annotation with iterative tool and skill evolution, outperforming GPT-5 on image-editing evaluation benchmarks.

Effective harnesses for long-running agents

Anthropic Engineering

Anthropic introduces a two-part solution using an initializer agent and a coding agent to enable the Claude Agent SDK to effectively handle long-running tasks across multiple context windows by maintaining a clean, incremental state.

Claude Code improved my agent harness by 40% overnight

Reddit r/AI_Agents

The author introduces 'Autoharness', a tool that uses Claude Code to autonomously optimize agent harnesses by iterating on prompts and hyperparameters. This resulted in a 40% performance increase on the tau2-airline benchmark.