EvoTest: Evolutionary Test-Time Learning for Self-Improving Agentic Systems
Summary
EvoTest introduces J-TTL, a benchmark for measuring agent test-time learning capabilities, and proposes an evolutionary framework where an Actor Agent plays games while an Evolver Agent iteratively improves the system's prompts, memory, and hyperparameters without fine-tuning. The method demonstrates superior performance compared to reflection and memory-based baselines on complex text-based games.
# EvoTest: Evolutionary Test-Time Learning for Self-Improving Agentic Systems

Source: https://arxiv.org/html/2510.13220

Yufei He¹, Juncheng Liu², Yue Liu¹, Yibo Li¹, Tri Cao¹, Zhiyuan Hu¹, Xinxing Xu², Bryan Hooi¹
¹National University of Singapore, ²Microsoft Research
{yufei.he, yliu, liyibo, zhiyuan_hu}@u.nus.com, tricao2001vn@gmail.com, bhooi@comp.nus.edu.sg, {juncheng.liu, xinxingxu}@microsoft.com
The work was done when the author was an intern at Microsoft Research Asia - Singapore. Corresponding author.

###### Abstract

A fundamental limitation of current AI agents is their inability to learn complex skills on the fly at test time, often behaving like "clever but clueless interns" in novel environments. This severely limits their practical utility. To systematically measure and drive progress on this challenge, we first introduce the Jericho Test-Time Learning (J-TTL) benchmark. J-TTL is a new evaluation setup where an agent must play the same game for several consecutive episodes, attempting to improve its performance from one episode to the next. On J-TTL, we find that existing adaptation methods like reflection, memory, or reinforcement learning struggle. To address the challenges posed by our benchmark, we present EvoTest (code available at https://github.com/yf-he/EvoTest), an evolutionary test-time learning framework that improves an agent without any fine-tuning or gradients, by evolving the entire agentic system after every episode. EvoTest has two roles: the Actor Agent, which plays the game, and the Evolver Agent, which analyzes the episode transcript to propose a revised configuration for the next run. This configuration rewrites the prompt, updates memory by logging effective state–action choices, tunes hyperparameters, and learns the tool-use routines. On our J-TTL benchmark, EvoTest consistently increases performance, outperforming not only reflection and memory-only baselines but also more complex online fine-tuning methods. Notably, our method is the only one capable of winning two games (Detective and Library), while all baselines fail to win any.

## 1 Introduction

The pursuit of truly autonomous agents hinges on a critical human capability: the ability to learn "on the fly" (Maes, 1993; Franklin and Graesser, 1996). When faced with a new task, humans can attempt it, reflect on their successes and failures, formulate a better strategy, and try again. By contrast, most AI agents arrive at deployment with a fixed policy, behaving like "clever but clueless interns" that can execute instructions but cannot reform their own process from experience (Huang et al., 2024; Talebirad and Nadiri, 2023; Wang et al., 2024; 2025; Hou et al., 2023). This gap severely limits their reliability in dynamic settings.
While the field acknowledges this problem, progress has been hampered by a lack of standardized testbeds designed specifically to measure an agent's capacity for rapid, in-session improvement (Zhou et al., 2023; Mialon et al., 2023; He et al., 2025a; 2024; 2026; Li et al., 2026; Yang et al., 2026; Sui et al., 2024a). To address this, we first introduce the Jericho Test-Time Learning (J-TTL) benchmark, a new evaluation framework designed to systematically measure and drive progress in on-the-fly agent learning. The benchmark's core task is straightforward: an agent must play the same complex, text-based adventure game (Hausknecht et al., 2020) for a series of consecutive attempts ("episodes"). In each episode, the agent interacts with the environment through a standard loop: it receives a textual observation of its surroundings (state), submits a natural-language command (action), and receives a numerical score change (reward). These games are difficult for LLM agents because they feature complex puzzles, long-range planning, sparse rewards (many critical actions yield no points), and irreversible consequences (a single wrong move can make the game unwinnable). The agent's goal is structured at two levels: 1) the Episodic Goal: maximize the final score within a single playthrough; 2) the Learning Goal: play the same game repeatedly and progressively increase its final score from one episode to the next, using only the experience gathered within that single session.

The J-TTL benchmark starkly reveals the inadequacies of existing adaptation paradigms. Consider a simple but critical failure in the game Detective: an agent gets stuck in a navigation loop by repeatedly attempting an invalid action, such as GO WEST, which the game rejects with "You can't go that way." This seemingly simple failure reveals deep flaws in current adaptation methods:

- A Static agent has no learning mechanism and will likely repeat this error in every episode, leading to a flat, low-scoring performance.
- An SFT (online) agent will have no good data to learn from in this failed episode. It is trapped because it cannot generate the very data it needs to improve.
- A Reinforcement Learning (RL, online) agent receives a reward of 0 for the invalid move, which is a weak signal in a sparse-reward environment. A single update based on this noisy signal is insufficient to correct the policy, demonstrating a failure of credit assignment.
- Methods based on reflection, such as Reflexion (Shinn et al., 2023), modify the agent's prompt with summaries of past failures. While useful, reflection does not alter the agent's core decision-making logic or its use of tools.
- Similarly, advanced memory systems (Packer et al., 2023; Zhong et al., 2024) improve an agent's ability to recall information but do not teach it how to act differently.

On the other end of the spectrum, RL and online fine-tuning are fundamentally ill-suited for the test-time learning setting: these methods are too slow and data-inefficient for the rapid learning J-TTL demands.
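As a simple illustration of the two goals above (not code from the paper), a J-TTL session can be summarized by the list of per-episode final scores; the helper name below is hypothetical.

```python
def summarize_session(episode_scores):
    """Hypothetical summary of a J-TTL session from per-episode final scores."""
    best_episode = max(episode_scores)                    # episodic goal: high score in one playthrough
    improvement = episode_scores[-1] - episode_scores[0]  # learning goal: scores should rise across episodes
    return {"best_episode": best_episode, "improvement": improvement}

# e.g. summarize_session([20, 50, 110]) -> {"best_episode": 110, "improvement": 90}
```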
To meet the challenge posed by our benchmark, we introduce EvoTest, an evolutionary test-time learning framework designed for rapid, holistic adaptation without fine-tuning. EvoTest decouples acting from adaptation using two distinct roles: an Actor Agent that plays a full episode and an Evolver Agent that improves the system between independent episodes. After each episode, the Evolver Agent analyzes the full transcript and proposes a revised configuration for the entire agentic system. This process of whole-system evolution involves:

1. Rewriting the guiding prompt to encode new strategies;
2. Updating a structured deployment-time memory with records of successful and failed actions;
3. Tuning decision-making hyperparameters like temperature and exploration strength;
4. Refining the tool-use routines that govern how and when memory or Python code is accessed.

By evolving the agent configuration, EvoTest transforms the narrative of one episode into multi-faceted improvements for the next attempt, enabling a deeper form of learning than prior methods. We summarize our contributions as follows:

- A Benchmark for Test-Time Learning: We propose J-TTL, a benchmark using Jericho games to measure an agent's on-the-fly learning ability across a series of playthroughs of the same game.
- A Test-Time Learning Algorithm: We propose EvoTest, an evolutionary agent learning framework that evolves the entire agentic system (policy, memory, tool-use routines, and hyperparameters) via transcript-level analysis without gradients or fine-tuning.
- State-of-the-Art Empirical Results: We demonstrate on the J-TTL benchmark that EvoTest shows a 38% improvement over the strongest prompt-evolution baseline and a 57% improvement over online RL, outperforming all strong reflection-based, memory-based, and gradient-based baselines on every game.
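As a concrete illustration (not the authors' released implementation), the whole-system configuration that the Evolver Agent rewrites between episodes can be pictured as a single structured object; all class and field names below are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class AgentConfig:
    """Hypothetical container for everything the Evolver Agent may revise
    between episodes: prompt, memory, hyperparameters, tool-use routines."""
    guiding_prompt: str                       # strategy instructions for the Actor Agent
    memory: Dict[str, List[str]] = field(     # records of successful / failed state-action choices
        default_factory=lambda: {"successful": [], "failed": []}
    )
    temperature: float = 0.7                  # decision-making hyperparameter
    exploration_strength: float = 0.3         # how aggressively to try novel commands
    tool_routines: Dict[str, str] = field(    # when/how to consult memory or run Python code
        default_factory=dict
    )
```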
## 2 Related Work

From Static Agents to Test-Time Learning. The majority of current AI agents, while capable, operate with static configurations that are manually designed and fixed after deployment (Wang et al., 2024; Xi et al., 2025; He et al., 2025b; Chen et al., 2025b; 2025a; He et al., 2025c; 2025d; Chen et al., 2025c). This limits their ability to adapt to novel situations, a key challenge motivating the development of "self-improving AI agents" (Gao et al., 2025b; Fang et al., 2025; Gao et al., 2025a; Sui et al., 2025; 2024b; Liu et al., 2025b). A prominent line of work enables agents to learn from past mistakes without updating weights. Reflexion (Shinn et al., 2023), a key baseline for our work, allows an agent to verbally reflect on trajectory failures and append these reflections to its prompt for subsequent episodes. Other approaches focus on enhancing agent memory. For instance, MemGPT (Packer et al., 2023) provides agents with a structured memory system to manage long contexts. Beyond reflection and memory, Uncertainty of Thoughts (Hu et al., 2024) adds test-time uncertainty-aware planning, deciding when to ask, verify, or revise without weight updates, while MemoryBank (Zhong et al., 2024) uses hierarchical summarization to retain information over long interactions.

Self-Evolving Agentic Systems. Another active area of research is the automated optimization of prompts that guide agent behavior (Liu et al., 2025a; Zhu et al., 2026; Hu et al., 2026). Generative approaches like APE (Zhou et al., 2022) and OPRO (Yang et al., 2023) use a powerful LLM to propose and score new prompts, iteratively refining them based on performance. Gradient-inspired methods like TextGrad (Yuksekgonul et al., 2024) refine prompts using LLM-generated textual feedback. Closely related to our work are evolutionary methods such as AlphaEvolve (Novikov et al., 2025), Promptbreeder (Fernando et al., 2023), and EvoPrompt (Guo et al., 2024), which maintain a population of prompts and apply genetic operators like mutation and crossover to discover more effective instructions. EvoTest generalizes prompt evolution to whole-system evolution, optimizing the entire agentic configuration, including the prompt, memory, hyperparameters, and tool-use routines. This allows for more holistic adaptations, such as tuning exploration strength, that are beyond the scope of prompt editing alone. This vision for unified optimization is shared by EvoAgent (Yuan et al., 2024) and MASS (Zhou et al., 2025); Beyond 'Aha!' (Hu et al., 2025) complements this by aligning meta-abilities rather than only task prompts or single components.

Figure 1: The EvoTest architecture, designed to enable test-time learning (TTL). The agent operates in a continuous Act-Evolve loop across multiple attempts at the same task. After each episode, the Evolver Agent analyzes the full trajectory transcript (rich narrative feedback) to perform gradient-free, whole-system evolution on the agent's entire configuration. This allows the agentic system to self-improve on the fly, directly from its own experience at test time.
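A minimal sketch of the Act-Evolve loop pictured in Figure 1, assuming a hypothetical `actor` that plays one episode under a given configuration and a hypothetical `evolver` that maps a (configuration, transcript) pair to a revised configuration; neither name comes from the paper's code.

```python
def test_time_learning(actor, evolver, env, init_config, num_episodes: int):
    """Hypothetical Act-Evolve loop: play, analyze the transcript, evolve the config."""
    config, scores = init_config, []
    for episode in range(num_episodes):
        transcript, final_score = actor.play_episode(env, config)  # Act: one full playthrough
        scores.append(final_score)
        config = evolver.propose_config(config, transcript)        # Evolve: whole-system update
    return config, scores                                          # scores should trend upward across episodes
```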
## 3 The Jericho Test-Time Learning (J-TTL) Benchmark

To systematically measure and drive progress in on-the-fly agent learning, we introduce the Jericho Test-Time Learning (J-TTL) benchmark. This benchmark is built upon the Jericho suite of Interactive Fiction (IF) games (Hausknecht et al., 2020); Jericho is available at https://github.com/Microsoft/jericho. IF games are fully text-based simulation environments where an agent issues text commands to effect change in the environment and progress through a story.

While the richness of these environments makes them a challenging testbed for AI, existing evaluation has primarily focused on single-episode performance or generalization across different games (Hausknecht et al., 2020; Gulcehre et al., 2020; Li et al., 2025). The J-TTL benchmark refocuses the evaluation on a different, critical axis: an agent's ability to learn and improve its strategy through repeated attempts at the same complex task within a single test session.

Datasets. We use publicly available Jericho games that vary in difficulty and puzzle structure, including Detective, Library, Zork1, Zork3, Balances, and Temple. Games are launched via Jericho with default scoring. Each episode is capped by a step limit ($T = 110$ unless stated otherwise).

The Jericho Game. We model a Jericho game (Hausknecht et al., 2020) as a Partially Observable Markov Decision Process (POMDP), defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \Omega, T)$. Here, $\mathcal{S}$ is the latent state space, and $\mathcal{A}$ is the (infinite) combinatorial action space of natural language commands. At each step $t$, an agent in a latent state $s_t \in \mathcal{S}$ takes an action $a_t \in \mathcal{A}$, causing a transition to a new state $s_{t+1} \sim \mathcal{T}(\cdot \mid s_t, a_t)$ and yielding a scalar reward $r_t = \mathcal{R}(s_t, a_t)$. The agent does not observe the true state $s_t$ but instead receives a textual observation $o_t \sim \Omega(\cdot \mid s_t)$. An episode is a trajectory of interactions with a finite horizon of $T$ steps:

$$\tau^{(e)} \triangleq \left(o_1^{(e)}, a_1^{(e)}, r_1^{(e)}, \dots, o_T^{(e)}, a_T^{(e)}, r_T^{(e)}\right). \tag{1}$$

The total return for an episode $e$ is the sum of its rewards, $R(e) \triangleq \sum_{t=1}^{T} r_t^{(e)}$.
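To make the interaction loop and the per-episode return concrete, here is a minimal sketch assuming the standard Jericho Python interface (`FrotzEnv`); the policy `choose_action` and the ROM path are placeholders, and the step cap mirrors the $T = 110$ limit stated above.

```python
from jericho import FrotzEnv  # https://github.com/Microsoft/jericho

def play_episode(rom_path: str, choose_action, max_steps: int = 110):
    """One J-TTL episode: observe, act, accumulate reward, stop at the step cap.
    `choose_action` is a placeholder for the Actor Agent's policy (obs -> command)."""
    env = FrotzEnv(rom_path)
    obs, info = env.reset()
    episode_return = 0
    for t in range(max_steps):
        action = choose_action(obs)                  # natural-language command, e.g. "open mailbox"
        obs, reward, done, info = env.step(action)   # textual observation o_t, scalar reward r_t
        episode_return += reward                     # R(e) is the sum of per-step rewards
        if done:                                     # game ended (won, lost, or unwinnable)
            break
    return episode_return
```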