# Agentick: A Unified Benchmark for General Sequential Decision-Making Agents
Source: [https://arxiv.org/html/2605.06869](https://arxiv.org/html/2605.06869)
![[Uncaptioned image]](https://arxiv.org/html/2605.06869v1/figures/banner_new.png)
Roger Creus Castanyer (Mila Quebec AI Institute, Université de Montréal), Pablo Samuel Castro∗ (Mila Quebec AI Institute, Université de Montréal, Google DeepMind), and Glen Berseth∗ (Mila Quebec AI Institute, Université de Montréal)
###### Abstract
AI agent research spans a wide spectrum, from RL agents that learn from scratch to foundation model agents that leverage pre-trained knowledge, yet no unified benchmark enables fair comparison across these approaches. We present Agentick, a benchmark for sequential decision-making agents designed to evaluate RL, LLM, VLM, hybrid, and human agents on common ground and to power research on the fundamental challenges of sequential decision-making. Agentick provides 37 procedurally generated tasks across six capability categories, four difficulty levels, and five observation modalities, all exposed through a single Gymnasium-compatible interface. The benchmark ships with a Coding API, oracle reference policies for all tasks, pre-built SFT datasets, a composable agent harness, and a live leaderboard. An evaluation spanning 27 configurations and over 90,000 episodes reveals that no single approach dominates: GPT-5 mini leads overall at 0.309 oracle-normalized score while PPO dominates planning and multi-agent tasks; the reasoning harness multiplies LLM performance by 3–10×; and ASCII observations consistently outperform natural language. These findings highlight the substantial room for improvement that remains across all agent paradigms. Agentick's capability-decomposed, multi-modal design provides the empirical infrastructure needed to drive progress toward general autonomous agents, both as an evaluation framework and as a training ground for RL post-training of foundation models in truly sequential environments.
∗Equal supervision. Correspondence to: roger.creus-castanyer@mila.quebec

## 1 ![[Uncaptioned image]](https://arxiv.org/html/2605.06869v1/figures/sprite_agent.png) Introduction
The pursuit of autonomous agents (systems that perceive their environment, reason about it, and take actions to achieve goals) has been a central objective of artificial intelligence research for decades (Sutton and Barto, [2018](https://arxiv.org/html/2605.06869#bib.bib27)). The landscape of agent research now spans a wide spectrum of paradigms. At one end, deep reinforcement learning (RL) agents learn from scratch through environment interaction: PPO (Schulman et al., [2017](https://arxiv.org/html/2605.06869#bib.bib12)), DQN (Mnih et al., [2015](https://arxiv.org/html/2605.06869#bib.bib13)), and SAC (Haarnoja et al., [2018](https://arxiv.org/html/2605.06869#bib.bib14)) have achieved superhuman performance in Atari (Bellemare et al., [2013](https://arxiv.org/html/2605.06869#bib.bib1)), continuous control (Tassa et al., [2018](https://arxiv.org/html/2605.06869#bib.bib2)), and strategic games (Vinyals et al., [2019](https://arxiv.org/html/2605.06869#bib.bib43)). At the other end, foundation model (FM) agents, including large language models (LLMs) and vision-language models (VLMs) pre-trained on internet-scale data, leverage broad world knowledge for zero-shot decision-making through prompt engineering and inference-time scaling (Yao et al., [2023](https://arxiv.org/html/2605.06869#bib.bib17); Wang et al., [2023](https://arxiv.org/html/2605.06869#bib.bib18); Ahn et al., [2022](https://arxiv.org/html/2605.06869#bib.bib19)). Between these extremes lies a rich design space of hybrid approaches: FM-guided reward shaping (Ma et al., [2023](https://arxiv.org/html/2605.06869#bib.bib35); Klissarov et al., [2023](https://arxiv.org/html/2605.06869#bib.bib37); Castanyer et al., [2025a](https://arxiv.org/html/2605.06869#bib.bib36)), RL post-training of foundation models (Guo et al., [2025](https://arxiv.org/html/2605.06869#bib.bib25)), and FM-based skill discovery (Klissarov et al., [2024](https://arxiv.org/html/2605.06869#bib.bib38)) and curriculum generation (Wang et al., [2023](https://arxiv.org/html/2605.06869#bib.bib18)).
Each paradigm makes different tradeoffs. RL agents learn fine-grained control policies but are sample-inefficient, task-specific, and suffer from unstable optimization at scale (Ceron et al., [2024](https://arxiv.org/html/2605.06869#bib.bib29); Castanyer et al., [2025b](https://arxiv.org/html/2605.06869#bib.bib30); Lyle et al., [2022](https://arxiv.org/html/2605.06869#bib.bib32)). FM agents bring rich priors and semantic understanding but were not trained for control and struggle with precise, temporally extended actions (Paglieri et al., [2024](https://arxiv.org/html/2605.06869#bib.bib9)). This raises a central question: what combination of learning from interaction and pre-trained knowledge is needed to build fully capable autonomous agents? Answering it requires the ability to compare agents across the full paradigm spectrum (RL from scratch, prompted foundation models, and hybrids in between) on the same tasks, and existing benchmarks cannot do this (Section [2](https://arxiv.org/html/2605.06869#S2)).
We present Agentick, a benchmark designed from the ground up to support fair evaluation across the full agent design spectrum. The design is guided by four principles: (1) paradigm universality through five observation modalities that ensure no agent type is disadvantaged; (2) capability decomposition across a variety of dimensions of sequential decision-making; (3) training-first design with a programmatic Coding API, oracle policies, pre-built fine-tuning datasets, and vectorizable environments; and (4) controlled difficulty with four levels per task and procedural generation for reproducibility. Agentick provides 37 tasks across navigation, planning, reasoning, memory, generalization, and multi-agent coordination, all through a standard interface (Towers et al., [2024](https://arxiv.org/html/2605.06869#bib.bib34)).
To validate the benchmark's discriminative power, we evaluate eight agents spanning the paradigm spectrum: three frontier LLMs (GPT-5 mini, Gemini 3.1 Flash Lite, Claude Haiku 4.5), one RL agent (PPO trained from scratch), and four open-weight LLMs (Qwen3.5 at 0.8B, 2B, and 4B parameters, and Qwen3-4B). Three key findings emerge. First, no single paradigm dominates: GPT-5 mini leads overall at 0.309 oracle-normalized score (ONS) but PPO dominates planning (0.402) and multi-agent tasks (0.432 ONS). Second, prompting strategy matters as much as model scale: the chain-of-thought reasoning harness multiplies LLM performance by 3–10× across every model tested. Third, ASCII observations consistently outperform natural language for LLM agents, suggesting that compact token-efficient representations are preferable for spatial reasoning. These findings are uniquely enabled by Agentick's multi-modal, capability-decomposed evaluation framework.
The benchmark, code, documentation, pre-built datasets, and live leaderboard are publicly available (links above the abstract). Section [2](https://arxiv.org/html/2605.06869#S2) positions Agentick relative to existing benchmarks. Section [3](https://arxiv.org/html/2605.06869#S3) describes the benchmark design. Section [4](https://arxiv.org/html/2605.06869#S4) presents experimental results. Section [5](https://arxiv.org/html/2605.06869#S5) discusses future directions and conclusions.
## 2 ![[Uncaptioned image]](https://arxiv.org/html/2605.06869v1/figures/sprite_npc.png) Related Work
Agent evaluation frameworks can be broadly grouped by the paradigm they target. Table [1](https://arxiv.org/html/2605.06869#S2.T1) provides a structured comparison; we discuss each group below.
**RL benchmarks.** The Arcade Learning Environment (ALE) (Bellemare et al., [2013](https://arxiv.org/html/2605.06869#bib.bib1)) established the dominant evaluation paradigm for deep RL through Atari 2600 games with pixel observations, and the DeepMind Control Suite (Tassa et al., [2018](https://arxiv.org/html/2605.06869#bib.bib2)) extended this to continuous control with proprioceptive and pixel observations. bsuite (Osband et al., [2020](https://arxiv.org/html/2605.06869#bib.bib3)) took a diagnostic approach, designing experiments that isolate specific RL capabilities such as exploration, credit assignment, memory, and generalization in deliberately simple settings. MiniGrid (Chevalier-Boisvert et al., [2023](https://arxiv.org/html/2605.06869#bib.bib4)) provides a modular gridworld framework for goal-oriented tasks. The NetHack Learning Environment (Küttler et al., [2020](https://arxiv.org/html/2605.06869#bib.bib51)) exposes a procedurally generated roguelike game with extreme partial observability and long horizons, representing one of the most challenging single-environment RL benchmarks, and MiniHack (Samvelyan et al., [2021](https://arxiv.org/html/2605.06869#bib.bib5)) builds on it with a flexible domain-specific language for constructing diverse Gymnasium-compatible RL tasks. Crafter (Hafner, [2022](https://arxiv.org/html/2605.06869#bib.bib6)) and its JAX-accelerated extension Craftax (Matthews et al., [2024](https://arxiv.org/html/2605.06869#bib.bib50)) offer single procedurally generated survival games that test a broad spectrum of capabilities. Procgen (Cobbe et al., [2020](https://arxiv.org/html/2605.06869#bib.bib7)) provides procedurally generated game levels for studying generalization. These benchmarks remain extensively used for RL research, but were designed primarily for RL agents. Agentick is closest in spirit to MiniGrid and MiniHack, but differs by providing purpose-built capability categories, five synchronized observation modalities, standardized LLM/VLM/RL harnesses, oracle trajectory datasets, and a unified scoring protocol for cross-paradigm comparison.
**LLM and VLM agent benchmarks.** BALROG (Paglieri et al., [2024](https://arxiv.org/html/2605.06869#bib.bib9)) wraps six existing RL game environments (BabyAI, Crafter, TextWorld, Baba Is AI, MiniHack, NetHack) into text and vision interfaces for LLM and VLM evaluation, demonstrating that even frontier models struggle at long-horizon interactive tasks. However, BALROG introduces no new tasks designed to target specific agentic capabilities, provides no capability decomposition or unified scoring across its heterogeneous game suite, and does not systematically investigate how observation modality affects agent performance. TextWorld (Côté et al., [2019](https://arxiv.org/html/2605.06869#bib.bib11)) provides text-based adventure games for language grounding but targets text-only agents. Agentick builds on the key insight from BALROG that interactive tasks expose fundamental weaknesses in FM agents, while addressing these limitations through purpose-built tasks, multi-modal observations, and cross-paradigm evaluation infrastructure.
**RLVR training environments.** A parallel line of work uses verifiable environments for RL post-training of language models. Mathematical reasoning benchmarks such as MATH (Hendrycks et al., [2021](https://arxiv.org/html/2605.06869#bib.bib45)) and GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2605.06869#bib.bib46)), code generation benchmarks like SWE-bench (Jimenez et al., [2024](https://arxiv.org/html/2605.06869#bib.bib47)) and HumanEval (Chen et al., [2021](https://arxiv.org/html/2605.06869#bib.bib48)), and reasoning environments like Reasoning Gym (Community, [2025](https://arxiv.org/html/2605.06869#bib.bib49)) are widely used for RLVR methods (Guo et al., [2025](https://arxiv.org/html/2605.06869#bib.bib25)). These environments are valuable for eliciting reasoning capabilities, but they operate in settings with limited sequential complexity: episodes are single-turn or short-horizon, transitions are fully deterministic, and there is no partial observability, stochastic dynamics, or multi-agent interaction. Agentick also provides verifiable rewards, but is designed from the ground up to test the fundamental challenges that emerge in sequential decision-making: truly interactive, stochastic, long-horizon environments with partial observability, exploration, multi-step credit assignment, and multi-agent coordination; these are the kinds of challenges that current RLVR benchmarks do not expose.
**Interactive reasoning benchmarks.** ARC-AGI-3 (Chollet et al., [2026](https://arxiv.org/html/2605.06869#bib.bib10)) introduces 135 interactive turn-based grid environments for agentic evaluation, supporting both RL and LLM agents. However, it evaluates through a single aggregate score without capability decomposition, provides limited public environments, and uses a custom SDK rather than a standard RL interface such as Gymnasium. Agentick occupies a different point in the design space: it is built to be both useful for evaluating frontier model capabilities and accessible for academic research, with Gymnasium-native environments, procedural generation for training, multi-modal observations, and per-category diagnostic scoring.
Table 1: Comparison of agent evaluation frameworks. Agentick is the only benchmark supporting all agent paradigms with capability decomposition and training infrastructure.

| Benchmark | #Tasks | RL | LLM | Obs. Modes | Cap. Dec. | Train Data | Gym |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ALE | 57 | ✓ | × | 1 | × | × | ✓ |
| DM Control | 30 | ✓ | × | 2 | × | × | × |
| bsuite | 23 | ✓ | × | 1 | ✓ | × | × |
| MiniGrid | 20+ | ✓ | × | 2 | × | × | ✓ |
| MiniHack | 100+ | ✓ | × | 2+ | × | × | ✓ |
| Crafter | 1 | ✓ | × | 1 | × | × | ✓ |
| Craftax | 2 | ✓ | × | 2 | × | × | ✓ |
| NetHack | 1 | ✓ | × | 2 | × | × | ✓ |
| BALROG | 6 | × | ✓ | 2 | × | × | ✓ |
| ARC-AGI-3 | 135 | ✓ | ✓ | 1 | × | × | × |
| Agentick | 37 | ✓ | ✓ | 5 | ✓ | ✓ | ✓ |
## 3 ![[Uncaptioned image]](https://arxiv.org/html/2605.06869v1/figures/sprite_goal.png) The Agentick Benchmark
Agentick provides 37 procedurally generated gridworld tasks across six capability categories, five observation modalities, four difficulty levels, and a complete training and evaluation pipeline, all through a unified Gymnasium-compatible (Towers et al., [2024](https://arxiv.org/html/2605.06869#bib.bib34)) interface.
### 3.1 Design, Tasks, and Observations
The design is guided by four principles.
**Paradigm universality:** every task produces five observation modalities simultaneously (ASCII text grids, natural language descriptions, structured dictionaries, 512×512 isometric pixel renderings, and raw numpy state arrays), so that RL, LLM, VLM, and human agents can all be evaluated without architectural bias (Figure [1](https://arxiv.org/html/2605.06869#S3.F1); Appendix [D](https://arxiv.org/html/2605.06869#A4) shows all five for the same state). Pixel observations are returned at 512×512 by default for VLM and human use, but can be resized through standard wrappers; our PPO baselines use 84×84 grayscale frame-stacked images following the ALE preprocessing convention, obtained by bilinearly resizing the native 512×512 RGB renders and converting to luminance.

**Capability decomposition:** rather than a single aggregate score, evaluation is decomposed along six capability dimensions, enabling radar-chart profiling that reveals where an agent excels and where it falls short.

**Training-first design:** environments are vectorizable for parallel RL training, oracle policies built on a programmatic Coding API (Appendix [I](https://arxiv.org/html/2605.06869#A9)) are provided for all 37 tasks, and pre-built SFT datasets of 120K–500K oracle episodes are available on HuggingFace.

**Controlled difficulty:** four levels per task (easy through expert) scale grid size, constraint complexity, object count, and episode length, with procedural generation ensuring unique layouts at every seed (see Appendix [B](https://arxiv.org/html/2605.06869#A2) for visual examples).
Beyond the fixed leaderboard protocol, Agentick is designed as a configurable experimental substrate. Researchers can select arbitrary task and difficulty subsets, choose any observation modality, resize or otherwise preprocess pixel observations through standard wrappers (a sketch follows below), and vary language or prompt templates through the harness interface while preserving the same environment seeds and scoring protocol. This enables controlled studies of modality, instruction design, curricula, and agent scaffolding without changing the underlying task semantics.
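As a minimal sketch of that preprocessing path, the ALE-style pipeline described above can be assembled from standard Gymnasium wrappers. Only `agentick.make` is documented (Appendix I); the `obs_mode` keyword is an assumption for illustration, and wrapper names follow Gymnasium ≥ 1.0.

```python
import agentick  # registers the Agentick environments (Appendix I)
from gymnasium.wrappers import (
    ResizeObservation, GrayscaleObservation, FrameStackObservation)

# Hypothetical obs_mode kwarg: select the pixel modality for CNN-based RL.
env = agentick.make("KeyDoorPuzzle-v0", difficulty="medium", obs_mode="pixels")
env = ResizeObservation(env, (84, 84))          # bilinear resize of 512x512 renders
env = GrayscaleObservation(env)                 # RGB -> luminance
env = FrameStackObservation(env, stack_size=4)  # stack four consecutive frames

obs, info = env.reset(seed=0)  # obs is now a (4, 84, 84) uint8 array
```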
The 37 tasks are organized into six categories (Appendix [C](https://arxiv.org/html/2605.06869#A3) provides full descriptions and a visual gallery). **Navigation** (8 tasks) evaluates spatial reasoning, pathfinding, reactive control, and instruction grounding. **Planning** (9 tasks) requires multi-step lookahead, constraint satisfaction, backtracking, and resource allocation. **Reasoning** (8 tasks) demands logical inference, pattern matching, abstraction, and resistance to misleading rewards. **Memory** (4 tasks) tests information retention over extended horizons under partial observability. **Generalization** (3 tasks) evaluates few-shot rule inference, adaptation to shifting dynamics, and noise robustness. **Multi-Agent** (5 tasks) requires coordination with or against scripted agents with diverse behavioral profiles.

```
# # # # # # # # # # # Legend:
# . . # # # # # # # # ˆ v < > : Agent
# . . # # # # # # # # # : Wall . : Empty
# Kg. # . . . # . . # G : Goal K : Key
# ˆ . # . . . # . . # D : Door (closed)
# . . Dg. . . # . . # g=gold r=red
# . . # . Kr. # . . #
# . . # . . . Dr. G #
# # # # # . . # # # #
# # # # # . . # # # #
# # # # # # # # # # #
```
Figure 1: Two observation modalities for KeyDoorPuzzle at medium difficulty. Left: isometric pixel rendering (512×512 by default, resizable for RL preprocessing) for VLM agents, CNN-based RL, and human play. Right: ASCII grid for LLM agents; the agent (ˆ) must collect the gold key (Kg), open the gold door (Dg), then the red key (Kr) and red door (Dr) to reach the goal (G). The same state also produces natural language, structured dictionary, and numpy array observations.
### 3.2 Evaluation Framework
The primary metric is the Oracle-Normalized Score (ONS):
$$\text{ONS}=\frac{\text{agent\_return}-\text{random\_baseline}}{\text{oracle\_return}-\text{random\_baseline}}\qquad(1)$$

where 0.0 corresponds to random agent performance and 1.0 to the oracle upper bound. Scores above 1.0 are possible when an agent outperforms the oracle reference policy. ONS is computed per (task, difficulty) pair and aggregated per-category and overall via the arithmetic mean. This normalization accounts for differences in task difficulty and reward scale, enabling meaningful cross-task comparison, analogous to human-normalized scores in ALE (Mnih et al., [2015](https://arxiv.org/html/2605.06869#bib.bib13)) but calibrated to task-specific reference policies rather than human play.
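A minimal sketch of this computation in Python; the returns below are toy numbers for illustration, not benchmark data:

```python
import numpy as np

def oracle_normalized_score(agent_return, random_baseline, oracle_return):
    # Eq. (1): 0.0 = random performance, 1.0 = oracle; values above 1.0
    # occur when an agent outperforms the oracle reference policy.
    return (agent_return - random_baseline) / (oracle_return - random_baseline)

# ONS is computed per (task, difficulty) pair, then aggregated by
# arithmetic mean per category and overall.
per_pair = [oracle_normalized_score(3.0, 0.5, 10.0),   # toy pair 1
            oracle_normalized_score(1.2, 0.2, 2.2)]    # toy pair 2
overall = float(np.mean(per_pair))
```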
Evaluation uses 25 deterministic seeds per task-difficulty pair (37 tasks × 4 difficulties × 25 seeds = 3,700 total episodes per agent), derived from SHA-256 hashes of `"{task}::{difficulty}::eval"` to ensure every submission runs on identical episodes. A separate pool of 2,000 training seeds per task-difficulty is available for within-benchmark learning. We report 95% bootstrap confidence intervals (Agarwal et al., [2021](https://arxiv.org/html/2605.06869#bib.bib33)) computed over the 25 evaluation episodes per task-difficulty pair, and provide a YAML-based experiment runner with parallel execution and standardized output.
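One plausible way to expand that hash into a per-pair seed list is sketched below; the documented contract is only the `"{task}::{difficulty}::eval"` string, so the expansion to 25 seeds (appending an episode index) is an assumption:

```python
import hashlib

def eval_seeds(task: str, difficulty: str, n: int = 25) -> list[int]:
    # Illustrative scheme: append an episode index to the documented
    # string, hash with SHA-256, and fold the digest into a 32-bit seed.
    seeds = []
    for i in range(n):
        key = f"{task}::{difficulty}::eval::{i}".encode()
        digest = hashlib.sha256(key).hexdigest()
        seeds.append(int(digest[:8], 16))
    return seeds

print(eval_seeds("KeyDoorPuzzle", "medium")[:3])  # identical on every machine
```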
The oracle upper bound in ONS is computed from oracle reference policies (developed with the aid of coding agents and iterative refinement) implemented through Agentick's Coding API (`AgentickAPI`), a programmatic interface that exposes spatial queries, BFS pathfinding, entity lookups, and high-level action primitives (full documentation in Appendix [I](https://arxiv.org/html/2605.06869#A9)). Using this API, we built sophisticated programmatic solvers for all 37 tasks that achieve near-optimal performance in most cases. We use the term "oracle" to denote the strongest available reference policy for each task; some oracles fall short of true optimality on tasks with stochastic elements (e.g., scripted agents that wander unpredictably), where no deterministic policy can guarantee the best outcome on every seed, but they nevertheless achieve high success rates and serve as strong upper bounds for ONS calibration.
### 3.3 Training Infrastructure and Agent Harnesses
**Oracle datasets.** The oracle reference policies described above also serve as expert trajectory generators. Pre-built datasets of 120K, 250K, and 500K oracle episodes are available on HuggingFace (Croissant metadata files, including the 2026 Responsible-AI fields, are bundled for all three datasets in the supplementary repository), containing per-step ASCII and language observations, actions, rewards, and `done` flags across all tasks and difficulties. These datasets enable supervised fine-tuning (SFT) and behavior cloning within the benchmark. The Coding API itself is also available for hand-coded bots, planners, and code-generating LLM agents.
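Consuming one of these datasets for SFT might look like the sketch below; the dataset id and column names are assumptions for illustration, not the published schema:

```python
from datasets import load_dataset  # HuggingFace `datasets` library

# Hypothetical dataset id; see the project page for the published ones.
ds = load_dataset("agentick/oracle-episodes-120k", split="train")

# Per the text, each step carries ASCII and language observations,
# actions, rewards, and done flags; the field names here are assumed.
step = ds[0]
prompt, target = step["ascii_obs"], str(step["action"])
```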
**Composable agent harness.** Agentick provides a modular harness infrastructure for LLM and VLM agents. A `BaseAgent` composes a `ModelBackend` (e.g. from OpenAI, Gemini, HuggingFace, or vLLM providers) with a `HarnessPreset` that controls the full inference-time strategy: the complete pipeline that transforms environment observations into actions at each step, including the system prompt, observation formatting, history management, reasoning elicitation, and action parsing. Researchers can implement custom harness presets by subclassing a simple interface, enabling systematic study of how harness design affects agent performance. Two built-in presets serve as baselines: **Markovian** (receives only the current observation and outputs a single action integer, with no intermediate reasoning) and **Markovian Reasoner** (also receives only the current observation, but prompts the model for concise chain-of-thought before action selection). Both are memoryless and support any observation mode. By explicitly decoupling inference-time strategy from the underlying model, the harness abstraction makes it straightforward to evaluate the same model under different prompting designs or to develop learned harnesses that optimize strategies through experience. Experiments are configured via YAML files specifying model, backend, harness, observation mode, tasks, difficulties, and seed counts.
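A custom preset might look like the following sketch of a non-Markovian, history-keeping harness; the import path and method names (`build_prompt`, `parse_action`) are assumptions about the subclassing interface, not the published API:

```python
from agentick.harness import HarnessPreset  # import path is an assumption

class HistoryReasoner(HarnessPreset):
    """Illustrative non-Markovian preset: keeps the last k observations
    in the prompt before eliciting concise reasoning."""

    def __init__(self, k: int = 4):
        self.k = k
        self.history: list[str] = []

    def build_prompt(self, observation: str) -> str:
        # Maintain a sliding window of recent observations.
        self.history = (self.history + [observation])[-self.k:]
        context = "\n---\n".join(self.history)
        return (f"Recent observations (oldest first):\n{context}\n"
                "Reason briefly, then end with: ACTION: <number>")

    def parse_action(self, model_output: str) -> int:
        # Take whatever follows the last 'ACTION:' marker.
        return int(model_output.rsplit("ACTION:", 1)[-1].strip().split()[0])
```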
**Leaderboard.** A public leaderboard ([https://roger-creus.github.io/agentick/board/](https://roger-creus.github.io/agentick/board/)) enables standardized comparison. Each submission records per-task ONS, per-category aggregates, and overall ONS, alongside metadata such as model family, parameter count, harness preset, and observation mode. The leaderboard displays radar-chart capability profiles (as in Figure [3](https://arxiv.org/html/2605.06869#S4.F3)) that visually decompose each agent's strengths and weaknesses across the six capability categories. All evaluation code and seed definitions are public, so any result can be independently reproduced.
## 4 ![[Uncaptioned image]](https://arxiv.org/html/2605.06869v1/figures/sprite_gem.png) Experiments
We evaluate agents spanning the full paradigm spectrum to validate that Agentick produces discriminative, actionable insights. The goal is not to crown a "best agent" but to demonstrate that the benchmark reveals meaningful differences across paradigms, capabilities, observation modes, and prompting strategies.
### 4.1 Experimental Setup
We evaluate eight agents across 27 configurations (varying observation mode and harness), each on all 37 tasks × 4 difficulty levels × 25 seeds per task-difficulty pair, totaling over 90,000 episodes:
- **Frontier LLMs**: GPT-5 mini, Gemini 3.1 Flash Lite, and Claude Haiku 4.5, evaluated with ASCII observations and the chain-of-thought Markovian Reasoner harness. These models belong to the latest release families (GPT-5, Gemini 3.1, Claude 4.5) but are the more economical variants (mini, Flash Lite, Haiku) rather than the flagship models (Pro, Opus); budget constraints prevented evaluation of the full-scale frontier models at the time of writing, and we look forward to populating the leaderboard with those results.
- **RL from scratch**: PPO (Schulman et al., [2017](https://arxiv.org/html/2605.06869#bib.bib12)) trained with Stable-Baselines3 (Raffin et al., [2021](https://arxiv.org/html/2605.06869#bib.bib16)) from pixel observations using the standard ALE-style preprocessing pipeline: 512×512 isometric renders resized to 84×84, converted to grayscale, and stacked over four frames. We report three configurations: **PPO Dense (2M)**, the headline RL baseline, trained for 2M steps with dense reward shaping; **PPO Dense (500k)**, a shorter-budget control trained for 500k steps with the same dense rewards; and **PPO Sparse (500k)**, a 500k-step run with terminal-only sparse rewards, included to probe the sensitivity of the RL signal to reward shaping under limited compute.
- **Open-weight LLMs**: Qwen3-4B and Qwen3.5 at 0.8B, 2B, and 4B parameters (Qwen, [2025](https://arxiv.org/html/2605.06869#bib.bib21)), evaluated across all combinations of observation mode (ASCII, language) and harness (Markovian, Markovian Reasoner). We report the best configuration per model.

All agents are evaluated on the official deterministic evaluation seeds. We report 95% bootstrap confidence intervals (Agarwal et al., [2021](https://arxiv.org/html/2605.06869#bib.bib33)) computed over the 25 evaluation episodes per task-difficulty pair; error bars are shown in all figures.
### 4.2 Overall Performance
Figure 2: Overall ONS for all evaluated agents. Among the initial set of agents, GPT-5 mini and PPO (2M) lead at 0.309 and 0.287 respectively, with substantial room for improvement across all paradigms.
Figure 3: Category ONS profiles for the top agents. Different paradigms exhibit distinctly different capability profiles: GPT-5 mini excels at navigation and generalization while PPO dominates planning and multi-agent.
Figure [2](https://arxiv.org/html/2605.06869#S4.F2) shows the overall ONS rankings. Two findings stand out. First, no agent comes close to the oracle ceiling: even the best-performing agent (GPT-5 mini at 0.309 ONS) reaches less than a third of oracle performance, indicating substantial room for improvement across all paradigms. We note that our evaluation covers economical frontier variants rather than flagship models (GPT-5 Pro, Claude Opus, Gemini Pro); we expect stronger models to improve on these numbers, and the leaderboard is designed to track this progression. Second, different paradigms are competitive in overall score: GPT-5 mini (0.309) leads, but PPO trained from scratch with 2M steps achieves 0.287, a strong result for tabula rasa RL. Qwen3.5-4B, an open-weight 4B-parameter model, reaches 0.228 with the Reasoner harness, while the same architecture family without chain-of-thought reasoning (Qwen3-4B) achieves only 0.020, near the random baseline.
### 4.3 Capability Decomposition
The per-category breakdown (Figures [3](https://arxiv.org/html/2605.06869#S4.F3) and [4](https://arxiv.org/html/2605.06869#S4.F4)) reveals a key finding: the best agent varies by capability dimension.
GPT-5 mini leads navigation (0.456) and generalization (0.437), leveraging pre-trained knowledge for spatial reasoning and adaptation to novel situations. PPO dominates planning (0.402) and multi-agent tasks (0.432, far ahead of any LLM on both), where precise, repeatable control learned through interaction excels. Reasoning remains difficult for all agents: the best score is 0.191 (PPO), and it is the weakest category for GPT-5 mini at just 0.131. This decomposition has direct implications for future work: hybrid architectures that combine FM reasoning with RL-trained control may be needed for strong performance across all capability dimensions.
Figure 4: Per-category ONS for the top five agents; the best agent varies by category: PPO leads planning (0.402) and multi-agent (0.432); GPT-5 mini leads navigation (0.456) and generalization (0.437).
### 4.4 Frontier LLMs on Hard Tasks
Figure [5](https://arxiv.org/html/2605.06869#S4.F5) shows per-task success rates at hard difficulty for three frontier LLMs. No single model dominates across categories: GPT-5 mini leads most navigation tasks, but Haiku 4.5 outperforms on ResourceManagement (0.76 vs. 0.12), and all three models solve ToolUse at 0.96–1.0, where compositional reasoning is rewarded. Critically, reasoning tasks remain largely intractable: GraphColoring, LightsOut, SwitchCircuit, ProgramSynthesis, and SymbolMatching yield 0.0 success for all three models, suggesting that systematic search and state tracking cannot be achieved through prompting alone.
Figure 5: Per-task success rates at hard difficulty for three frontier LLMs. No single model dominates. Reasoning tasks (right panel) yield near-zero success for all models, exposing a fundamental limitation of prompting-only approaches.
### 4.5 Harness Design Matters: Observation Mode and Prompting Strategy
A key motivation for Agentick's composable harness infrastructure is the hypothesis that *how* an LLM agent is prompted matters as much as *which* model is used. To test this, we evaluate the Qwen model family across all combinations of observation mode (ASCII vs. language) and harness preset (Markovian vs. Markovian Reasoner), using only the two built-in presets as a minimal illustration (Figure [6](https://arxiv.org/html/2605.06869#S4.F6)).
Even with this simple comparison, the effect is striking. The Reasoner harness multiplies performance by 3–10× across every model and observation mode: Qwen3.5-4B jumps from 0.023 to 0.228 ONS when switching from Markovian to Markovian Reasoner on ASCII observations, and even the smallest Qwen3.5-0.8B improves from 0.020 to 0.094. Additionally, ASCII consistently outperforms language: Qwen3.5-4B with the Reasoner achieves 0.228 on ASCII versus 0.181 on language, suggesting that compact grid representations are more effective for spatial reasoning than verbose natural language descriptions. It is also notable that Qwen3.5-0.8B (0.094) outperforms the much larger Qwen3-4B (0.085) despite having five times fewer parameters, illustrating how Agentick can surface generational improvements across model families, and how it will continue to be useful for benchmarking future generations of models as they are released.
These results confirm that harness design is a first-class research variable for LLM agents, and our two built-in presets serve as an initial illustration. Agentick's composable architecture is designed to support research into more sophisticated strategies: non-Markovian harnesses that maintain conversation history, self-reflective harnesses that learn from past failures, and learned harnesses that optimize prompting strategies through experience, an emerging research direction where the scaffolding around the model is itself optimized rather than hand-designed.
Figure 6: Agentick ONS for the Qwen model family across observation modes and reasoning harnesses. Solid bars: Markovian (no reasoning); lighter stacked portions: additional gain from the Markovian Reasoner harness. ASCII (blue) consistently outperforms language (orange). The Reasoner harness multiplies performance by 3–10×.
## 5 ![[Uncaptioned image]](https://arxiv.org/html/2605.06869v1/figures/sprite_door_open.png) Discussion and Conclusion
**Future directions.** The most exciting research direction enabled by Agentick is RL post-training of foundation models for sequential decision-making. The progression from RLHF (Ouyang et al., [2022](https://arxiv.org/html/2605.06869#bib.bib24)) to RLVR (Guo et al., [2025](https://arxiv.org/html/2605.06869#bib.bib25)) has demonstrated that RL is a remarkably effective mechanism for eliciting capabilities from pre-trained models, but current RLVR successes operate in deterministic, single-turn, short-horizon settings such as math and code. Extending this to truly interactive, stochastic, multi-step environments (where credit assignment over long horizons, strategic exploration, and multi-agent coordination become essential) is the natural next frontier (Silver et al., [2025](https://arxiv.org/html/2605.06869#bib.bib26)). Agentick provides the infrastructure needed for this program: vectorizable environments, oracle trajectories for warm-starting, capability decomposition for measuring which sequential decision-making skills emerge through RL training, and multi-modal observations that enable text-based LLMs to interact with rich environments. Additional near-term directions include research on learned and self-improving agent harnesses, VLM evaluation using isometric observations, and curriculum learning across difficulty levels. A complementary direction is model-scaffolded task generation: instead of only asking foundation models to solve the current task set, future agents could use Agentick's task primitives, modality controls, and verifiable rewards to propose custom tasks and curricula that stress-test emerging capabilities.
**Limitations.** We discuss limitations in detail in Appendix [A](https://arxiv.org/html/2605.06869#A1). In brief: all tasks are discrete, 2D, and turn-based, a deliberate choice that enables cross-paradigm comparison but limits applicability to continuous control or real-time settings. The current task set does not yet cover all fundamental challenges of sequential decision-making (e.g., continual learning is absent), though the benchmark is designed to be actively developed and maintained with new tasks over time. Our initial evaluation results cover a representative but incomplete slice of the agent landscape: VLM agents, fine-tuned models, and stronger frontier models have not yet been evaluated. These are active directions on the project roadmap.
**Broader impact.** Agentick is a research benchmark for evaluating AI agent capabilities in controlled gridworld environments. We do not foresee direct negative societal impacts. The benchmark may accelerate progress on autonomous agents, which carries both positive applications (assistive AI, scientific discovery) and risks (autonomous systems acting without adequate oversight). We encourage the community to develop agents within Agentick with attention to safety and alignment considerations.
**Conclusion.** We presented Agentick, a benchmark for evaluating sequential decision-making agents across the full design spectrum on the core challenges of the setting. Through 37 procedurally generated tasks, six capability categories, five observation modalities, and training-first infrastructure, Agentick enables the kind of fair comparison across RL, LLM, VLM, and hybrid agents that no existing benchmark supports. An initial evaluation spanning 27 agent configurations and over 90,000 episodes confirms the benchmark's discriminative power: no single paradigm dominates, harness design multiplies LLM performance by 3–10×, and the substantial gap to oracle performance highlights the rich research agenda ahead. The benchmark, datasets, and leaderboard are publicly available.
## References
- R. Agarwal, M. Schwarzer, P. S. Castro, A. C. Courville, and M. G. Bellemare (2021) Deep reinforcement learning at the edge of the statistical precipice. In Advances in Neural Information Processing Systems.
- M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, B. David, C. Finn, et al. (2022) Do as I can, not as I say: grounding language in robotic affordances. arXiv preprint arXiv:2204.01691.
- M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling (2013) The arcade learning environment: an evaluation platform for general agents. Journal of Artificial Intelligence Research 47, pp. 253–279.
- R. C. Castanyer, F. Mohamed, P. S. Castro, C. Neary, and G. Berseth (2025a) ARM-FM: automated reward machines via foundation models for compositional reinforcement learning. arXiv preprint arXiv:2510.14176.
- R. C. Castanyer, J. Obando-Ceron, L. Li, P. Bacon, G. Berseth, A. Courville, and P. S. Castro (2025b) Stable gradients for stable learning at scale in deep reinforcement learning. arXiv preprint arXiv:2506.15544.
- J. S. O. Ceron, A. Courville, and P. S. Castro (2024) Value-based deep reinforcement learning requires explicit regularization. In International Conference on Learning Representations.
- M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al. (2021) Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
- M. Chevalier-Boisvert, B. Dai, M. Towers, R. De Lazcano, L. Willems, S. Lahlou, S. Pal, P. S. Castro, and J. Terry (2023) Minigrid & Miniworld: modular & customizable reinforcement learning environments for goal-oriented tasks. In Advances in Neural Information Processing Systems.
- F. Chollet et al. (2026) ARC-AGI-3: a new challenge for frontier agentic intelligence. ARC Prize Foundation.
- K. Cobbe, C. Hesse, J. Hilton, and J. Schulman (2020) Leveraging procedural generation to benchmark reinforcement learning. In International Conference on Machine Learning, pp. 2048–2056.
- K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman (2021) Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- O. S. Community (2025) Reasoning Gym: a diverse benchmark for LLM reasoning. [https://github.com/open-thought/reasoning-gym](https://github.com/open-thought/reasoning-gym).
- M. Côté, Á. Kádár, X. Yuan, B. Kybartas, T. Barnes, E. Fine, J. Moore, M. Hausknecht, L. El Asri, M. Sharma, et al. (2019) TextWorld: a learning environment for text-based games. Computer Games: 7th Workshop, CGW 2018, pp. 41–75.
- D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. (2025) DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.
- T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pp. 1861–1870.
- D. Hafner (2022) Benchmarking the spectrum of agent capabilities. In International Conference on Learning Representations.
- D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt (2021) Measuring mathematical problem solving with the MATH dataset. In NeurIPS.
- C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan (2024) SWE-bench: can language models resolve real-world GitHub issues? In ICLR.
- M. Klissarov, P. D'Oro, N. Jaques, S. Sodhani, D. Furelos-Blanco, P. Bacon, D. Precup, and M. Henaff (2023) Motif: intrinsic motivation from artificial intelligence feedback. arXiv preprint arXiv:2310.00166.
- M. Klissarov, M. Henaff, R. Raileanu, S. Sodhani, P. Vincent, A. Zhang, P. Bacon, D. Precup, M. C. Machado, and P. D'Oro (2024) MaestroMotif: skill design from artificial intelligence feedback. arXiv preprint arXiv:2412.08542.
- H. Küttler, N. Nardelli, A. Miller, R. Raileanu, M. Selber, E. Grefenstette, and T. Rocktäschel (2020) The NetHack learning environment. In NeurIPS.
- C. Lyle, M. Rowland, and W. Dabney (2022) Understanding plasticity in neural networks. In International Conference on Machine Learning.
- Y. J. Ma, W. Liang, G. Wang, D. Huang, O. Bastani, D. Jayaraman, Y. Zhu, L. Fan, and A. Anandkumar (2023) Eureka: human-level reward design via coding large language models. arXiv preprint arXiv:2310.12931.
- M. Matthews, M. Gallici, K. Cobbe, J. Foerster, and T. Rocktäschel (2024) Craftax: a lightning-fast benchmark for open-ended reinforcement learning. In ICML.
- V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518(7540), pp. 529–533.
- I. Osband, Y. Doron, M. Heess, D. Duband, T. Morimura, D. Silver, A. Barreto, et al. (2020) Behaviour suite for reinforcement learning. In International Conference on Learning Representations.
- L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. (2022) Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35, pp. 27730–27744.
- D. Paglieri, B. Cupiał, S. Coward, U. Piterbarg, M. Wołczyk, A. Khan, E. Pignatelli, Ł. Kuciński, L. Pinto, R. Fergus, et al. (2024) BALROG: benchmarking agentic LLM and VLM reasoning on games. arXiv preprint arXiv:2411.13543.
- Qwen (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388.
- A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, and N. Dorber (2021) Stable-Baselines3: reliable reinforcement learning implementations. Journal of Machine Learning Research 22(268), pp. 1–8.
- M. Samvelyan, R. Kirk, V. Kurin, J. Parker-Holder, M. Jiang, E. Hambro, F. Petroni, H. Küttler, E. Grefenstette, and T. Rocktäschel (2021) MiniHack the planet: a sandbox for open-ended reinforcement learning research. In NeurIPS Datasets and Benchmarks Track.
- J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- D. Silver, S. Singh, D. Precup, and R. S. Sutton (2025) Welcome to the era of experience. arXiv preprint arXiv:2407.16680.
- R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. 2nd edition, MIT Press.
- Y. Tassa, Y. Dorato, A. Muldal, T. Erez, Y. Li, D. de Las Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, et al. (2018) DeepMind control suite. arXiv preprint arXiv:1801.00690.
- M. Towers, A. Kwiatkowski, J. Terry, J. U. Balis, G. De Cola, T. Deleu, M. Goulão, A. Kallinteris, A. KG, M. Krimmel, et al. (2024) Gymnasium: a standard interface for reinforcement learning environments. arXiv preprint arXiv:2407.17032.
- O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, et al. (2019) Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575(7782), pp. 350–354.
- G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar (2023) Voyager: an open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291.
- S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao (2023) ReAct: synergizing reasoning and acting in language models. In International Conference on Learning Representations.
## Appendix A Detailed Limitations
We identify six limitations of the current version of Agentick:
1. **Gridworld abstraction.** All tasks are discrete, 2D, and turn-based. This is a deliberate choice (discrete control is the common denominator across all agent types) but it limits applicability to continuous control or real-time settings. Importantly, the benchmark is genuinely difficult despite this abstraction: the best agent reaches only 0.309 ONS.
2. **Capability coverage.** While 37 tasks across six categories is broad, the current set does not cover all fundamental challenges of sequential decision-making. Continual learning, open-ended language grounding, and long-horizon tool use in unstructured environments are not yet represented. The benchmark is designed to be actively developed and maintained, and we plan to add tasks targeting these capabilities over time.
3. **Observation equivalence.** Although all modalities expose the same underlying state, they are not informationally equivalent; for instance, isometric renders may occlude certain tiles due to perspective. The benchmark evaluates decision-making *given* a particular observation format, not observation-agnostic intelligence.
4. **RL training budget.** PPO was trained for only 2M steps. Longer training, more sophisticated algorithms, or curriculum learning may substantially change the RL performance profile.
5. **Multi-agent scope.** Opponent agents use built-in scripted policies rather than learned ones, limiting the study of emergent dynamics with co-adapting agents.
6. **Evaluation coverage.** Our initial results cover a representative but incomplete slice of the agent landscape. We have not yet evaluated VLM agents on isometric observations, models fine-tuned on our oracle trajectory datasets via SFT or RL, or stronger frontier models. These are active directions on the project roadmap that we expect to substantially change the performance landscape.
## Appendix B Difficulty Scaling
Figure 7: Three tasks at all four difficulty levels (columns: easy, medium, hard, expert). Top: KeyDoorPuzzle, where grid size, number of key-door pairs, and backtracking requirements increase. Middle: BacktrackPuzzle, with more switches, larger rooms, and tighter step budgets. Bottom: DynamicObstacles, with more obstacles, faster movement, and larger grids. Procedural generation ensures unique layouts at every seed.
## Appendix C Full Task Descriptions
Figure [8](https://arxiv.org/html/2605.06869#A3.F8) shows all 37 tasks rendered in isometric view at medium difficulty. Table [2](https://arxiv.org/html/2605.06869#A3.T2) provides descriptions and difficulty scaling.
Figure 8: All 37 Agentick tasks in isometric view at medium difficulty, spanning six categories: navigation (8), planning (9), reasoning (8), memory (4), generalization (3), multi-agent (5).
Table 2: All 37 Agentick tasks with category, description, and difficulty scaling dimensions.
## Appendix D All Observation Modalities for a Single State
(a) Isometric Pixel Rendering (512×512, uint8 RGB array; for VLM agents, CNN-RL, human play):

(b) ASCII (colored text grid with legend; for LLM agents):
```
# # # # # # # # # # # Legend:
# . . # # # # # # # # ˆ v < > : Agent (facing)
# . . # # # # # # # # # : Wall . : Empty
# Kg. # . . . # . . # G : Goal K : Key
# ˆ . # . . . # . . # D : Door (closed)
# . . Dg. . . # . . # g=gold r=red
# . . # . Kr. # . . #
# . . # . . . Dr. G #
# # # # # . . # # # #
# # # # # . . # # # #
# # # # # # # # # # #
```
(c) Natural Language (spatial description; for LLM agents):
> You are near the western edge of a 11×11 room. You are facing north. You see: a gold key ahead (1 step), a closed gold door to your right (3 steps), a red key to your right (6 steps), a closed red door to your right (9 steps), a goal to your right (11 steps). Your inventory is empty. Actions: noop, move_up, move_down, move_right, interact.
(d) Structured Language (parsed dictionary; for LLM/programmatic agents):
```
{"description": "A 11x11 gridworld environment",
"position": {"x": 1, "y": 4},
"orientation": "north",
"visible_entities": [
{"type": "key", "position": [1,3], "distance": 1, "color": "gold"},
{"type": "door", "position": [3,5], "distance": 3, "color": "gold"},
{"type": "key", "position": [5,6], "distance": 6, "color": "red"},
{"type": "door", "position": [7,7], "distance": 9, "color": "red"},
{"type": "goal", "position": [9,7], "distance": 11}],
"inventory": [],
"valid_actions": ["noop","move_up","move_down","move_right","interact"],
"step_count": 0, "max_steps": 200}
```
(e) State Dictionary (raw numpy arrays; for RL and programmatic agents):
```
grid.terrain: int8[11,11] (0=empty, 1=wall)
grid.objects: int8[11,11] (0=none, 1=goal, 2=key, 3=door)
grid.agents: int8[11,11] (agent position mask)
grid.metadata: int16[11,11] (color/state encoding)
agent.position: (1, 4)
agent.orientation: north
agent.inventory: []
agent.energy: 1.0
```
Figure 9: All five observation modalities for the same KeyDoorPuzzle state (medium, seed 42). Every modality is produced simultaneously from the same underlying grid, ensuring consistency across agent paradigms.
## Appendix E Observation Modalities Summary
Table 3: Observation modalities in Agentick. All modes are available for every task simultaneously.

| Mode | Format | Target Agents | Space |
| --- | --- | --- | --- |
| ASCII | Colored text grid + legend | LLM | String |
| Language | Natural language description | LLM | String |
| Language Structured | Dict (position, surroundings, actions) | LLM, Programmatic | Dict |
| Isometric Pixels | 512×512 sprite rendering | VLM, CNN-RL, Human | Box |
| State Dict | Numpy arrays (terrain, objects, metadata) | RL, Programmatic | Dict |
## Appendix F Per-Category ONS Results
Table [4](https://arxiv.org/html/2605.06869#A6.T4) reports per-category ONS for all evaluated agents.
Table 4: Per-category ONS for all evaluated agents. Best per category in bold. The rightmost column reports the 95% bootstrap confidence interval on the overall ONS, computed over 25 seeds per task–difficulty pair across all 37 tasks; "–" marks entries for which the bootstrap estimate is pending in the next leaderboard refresh.

| Agent | Nav. | Plan. | Reas. | Mem. | Gen. | Multi. | Overall | 95% CI |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-5 mini | **.456** | .334 | .131 | **.348** | **.437** | .150 | **.309** | – |
| PPO (2M) | .250 | **.402** | **.191** | .283 | .163 | **.432** | .287 | [.21, .37] |
| Qwen3.5-4B | .223 | .313 | .124 | .248 | .327 | .134 | .228 | [.16, .29] |
| PPO (500k) | .193 | .300 | .153 | .228 | .130 | .352 | .226 | [.17, .29] |
| Gemini 3.1 FL | .238 | .249 | .090 | .163 | .287 | .098 | .187 | [.13, .25] |
| Qwen3.5-2B | .136 | .237 | .048 | .213 | .133 | .032 | .133 | [.07, .19] |
| Qwen3.5-0.8B | .069 | .164 | .021 | .133 | .143 | .036 | .094 | [.05, .14] |
| Qwen3-4B | .073 | .106 | .005 | .153 | .133 | .038 | .085 | – |
## Appendix G Agent Configuration Details
**Frontier LLMs.** GPT-5 mini, Gemini 3.1 Flash Lite, and Claude Haiku 4.5 were evaluated via their respective APIs with temperature 0, max tokens 100, using ASCII observations and the Markovian Reasoner harness. Each model was evaluated on all 37 tasks × 4 difficulties × 25 seeds.
**PPO.** Trained using Stable-Baselines3 (Raffin et al., [2021](https://arxiv.org/html/2605.06869#bib.bib16)) with `CnnPolicy` on pixel observations preprocessed in the ALE style: 512×512 RGB isometric renders were resized to 84×84, converted to grayscale, and stacked over four frames. We used n_steps = 128, batch size 256, learning rate 2.5×10⁻⁴, dense reward mode, and 2M total timesteps for the headline PPO baseline. Training was performed on a single NVIDIA A100 GPU.
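A minimal Stable-Baselines3 sketch with these hyperparameters follows; `make_preprocessed_env` stands in for the wrapper pipeline from Section 3.1, and the number of parallel environments is an assumption:

```python
import agentick
from gymnasium.wrappers import (
    ResizeObservation, GrayscaleObservation, FrameStackObservation)
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

def make_preprocessed_env():
    # ALE-style preprocessing of the 512x512 renders, as described above.
    env = agentick.make("KeyDoorPuzzle-v0", difficulty="medium")
    env = ResizeObservation(env, (84, 84))
    env = GrayscaleObservation(env)
    return FrameStackObservation(env, stack_size=4)

vec_env = make_vec_env(make_preprocessed_env, n_envs=8)  # n_envs assumed
model = PPO("CnnPolicy", vec_env, n_steps=128, batch_size=256,
            learning_rate=2.5e-4, verbose=1)
model.learn(total_timesteps=2_000_000)  # headline 2M-step budget
```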
**Qwen models.** All Qwen models (Qwen3-4B, Qwen3.5-0.8B, Qwen3.5-2B, Qwen3.5-4B) were served via vLLM with temperature 0.7, top-p 0.8, top-k 20, max tokens 100. Each model was evaluated across all 4 combinations of {ASCII, language} × {Markovian, Markovian Reasoner}. The best configuration per model is reported in the main text.
**Reasoner harness prompt.** The Reasoner harness appends the following to the system prompt: "IMPORTANT: Before choosing an action, reason step-by-step but be CONCISE (2–4 sentences max): 1. What do you observe? What is your goal? 2. Which action best advances you toward the goal? 3. Output your final answer on the LAST line as: ACTION: ⟨number⟩."
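A sketch of a parser for that output contract; the fallback when no valid action is found is an assumption, not necessarily what the shipped harness does:

```python
import re

def parse_reasoner_action(text: str, valid_actions: list[int]) -> int:
    # The prompt asks for a final 'ACTION: <number>' line, so take the
    # last match in case the reasoning mentions earlier candidate actions.
    matches = re.findall(r"ACTION:\s*(\d+)", text)
    if matches:
        action = int(matches[-1])
        if action in valid_actions:
            return action
    return 0  # assumed fallback: noop when parsing fails
```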
## Appendix H Compute Budget
**RL training.** PPO training for 2M steps required approximately 4 GPU-hours on a single NVIDIA A100 (80GB) per task. Total RL compute: ∼150 GPU-hours across all 37 tasks.
**LLM evaluation.** Frontier model evaluation used commercial APIs. Approximate costs: GPT-5 mini $180, Gemini 3.1 Flash Lite $45, Claude Haiku 4.5 $65. Open-weight Qwen models were served on 2×A100 GPUs via vLLM; total serving time ∼200 GPU-hours across all model sizes and configurations.
**Oracle trajectory generation.** The 500K-episode dataset was generated in ∼8 CPU-hours on a 32-core machine.
## Appendix I Coding API
Every Agentick environment exposes a programmatic Coding API (`AgentickAPI`) that provides spatial queries, BFS pathfinding, entity lookups, and high-level action primitives. The API is designed for three use cases: (1) writing hand-coded bot agents and planners, (2) building oracle policies (all 37 oracles in Agentick are implemented through this API), and (3) enabling code-generating LLMs to write agent logic in Python rather than selecting raw action integers.
**Setup.** The API wraps an environment and is updated after each step:
```
from agentick.coding_api import AgentickAPI
import agentick
env = agentick.make("KeyDoorPuzzle-v0", difficulty="medium")
api = AgentickAPI(env)
obs, info = env.reset(seed=42)
api.update(obs, info)
```
**Spatial queries.** The API exposes the agent's position, orientation, and spatial relationships to all entities:
```
api.agent_position              # (1, 4)
api.agent_direction             # 'north'
api.grid_size                   # (11, 11)
api.get_nearest("key")          # EntityInfo(type='key', pos=(1,3), dist=1)
api.get_nearest("goal")         # EntityInfo(type='goal', pos=(9,7), dist=11)
api.get_entities_of_type("key") # [EntityInfo(...), EntityInfo(...)]
api.get_entity_at(3, 5)         # EntityInfo(type='door', ...)
api.is_walkable(2, 4)           # True
api.is_walkable(3, 4)           # False (wall)
api.is_reachable(9, 7)          # False (blocked by locked doors)
api.distance_to(1, 3)           # 1
api.direction_to(1, 3)          # 'north'
api.is_adjacent(1, 3)           # True
```
**BFS pathfinding.** The API provides shortest-path computation that respects walls, blocking objects, and terrain:
```
api.path_to(1, 3) # [1] (action sequence to gold key)
api.go_to_nearest("key") # [1] (pathfind to closest key)
api.go_to_nearest("door") # [1, 1, 3, 3, 3] (to gold door)
api.flee_from(5, 5) # single action moving away from (5,5)
api.move_toward(9, 7) # single action toward goal
```
The `path_to` and `go_to_nearest` methods return sequences of action integers that can be executed directly via `env.step()`. The pathfinder accounts for non-walkable objects (locked doors, walls) and updates dynamically as the environment state changes.
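Continuing the setup snippet above, a planned sequence can be executed step by step while keeping the API in sync; a minimal sketch using the documented calls:

```python
# Execute the planned route to the gold key, resyncing the API each step.
for action in api.path_to(1, 3):
    obs, reward, terminated, truncated, info = env.step(action)
    api.update(obs, info)
    if terminated or truncated:
        break
```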
**Grid inspection.** Low-level access to the grid structure:
```
api.get_walkable_cells()   # [(1,1), (1,2), ...] (43 cells)
api.get_walls()            # [(0,0), (0,1), ...] (all wall positions)
api.get_cell(3, 5)         # {'terrain': 'empty', 'object': 'door'}
api.get_object(1, 3)       # 'key'
api.get_terrain_type(0, 0) # 'wall'
api.neighbors(1, 4)        # [(1,3), (2,4), (1,5)] (walkable neighbors)
```
**Inventory and interaction.** For tasks involving item collection and object interaction:
```
api.get_inventory() # []
api.has_in_inventory("key") # False
api.interact_with(1, 3) # action sequence: face + INTERACT
api.pickup_nearest("key") # pathfind + interact
```
**Execution helpers.** High-level methods for stepping through action sequences:
```
api.step_action(1) # execute action 1 (move_up), return obs
api.valid_actions # [0, 1, 2, 4, 5]
api.action_names # ['noop','move_up','move_down',...]
api.action_name_to_int("move_up") # 1
api.current_step # 0
api.max_steps # 200
api.total_reward # 0.0
api.is_done # False
```
**Oracle construction.** All 37 oracle policies are implemented using this API. A typical oracle follows the pattern:
```
class KeyDoorOracle(OracleAgent):
    def __init__(self, api):
        # Illustrative constructor: hold the API handle and an empty plan.
        self.api = api
        self.action_queue = []

    def plan(self):
        # Find the nearest uncollected key
        key = self.api.get_nearest("key")
        if key:
            self.action_queue = self.api.go_to_nearest("key")
        else:
            # All keys collected, head to the goal
            self.action_queue = self.api.go_to_nearest("goal")

    def act(self, obs, info):
        self.api.update(obs, info)
        if not self.action_queue:
            self.plan()
        return self.action_queue.pop(0)
```
This API-based oracle design means that expert trajectories are generated through interpretable, verifiable code rather than opaque neural network policies, enabling researchers to inspect, debug, and modify the strategies that produce the SFT training data.
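For instance, rolling out such an oracle to collect one SFT episode might look like the sketch below; the episode dict schema and the oracle constructor are assumptions for illustration:

```python
env = agentick.make("KeyDoorPuzzle-v0", difficulty="medium")
api = AgentickAPI(env)
oracle = KeyDoorOracle(api)  # constructor signature assumed as above

obs, info = env.reset(seed=7)
api.update(obs, info)
episode, done = [], False
while not done:
    action = oracle.act(obs, info)
    next_obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
    episode.append({"obs": obs, "action": action,
                    "reward": reward, "done": done})
    obs = next_obs
```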