RTSGameBench: An RTS Benchmark for Strategic Reasoning by Vision-Language Models
Summary
RTSGameBench is a benchmark for evaluating strategic reasoning in vision-language models using the real-time strategy game Beyond All Reason. It provides diverse matchups, diagnostic mini-games, and a self-evolving framework to generate new scenarios.
# RTSGameBench: An RTS Benchmark for Strategic Reasoning by Vision-Language Models
Source: [https://arxiv.org/html/2606.18950](https://arxiv.org/html/2606.18950)
San Kim, Daechul Ahn, Reokyoung Kim, Hyeonbeom Choi, Seungyeon Jwa, Jonghyun Choi†
Seoul National University, Seoul, Republic of Korea
\{00sankim, daechulahn, reokyoungkim, gusqja1228, amyj97, jonghyunchoi\}@snu\.ac\.kr
###### Abstract
Modern Vision\-Language Models \(VLMs\) often struggle with strategic reasoning, *i\.e*\., anticipating and influencing other agents’ actions, under uncertainty in competitive and cooperative settings\. Real\-time strategy \(RTS\) games are a natural testbed for diagnosing this limitation, as they demand coordination with allies, adaptation to opponents’ strategies, and long\-horizon planning under partial observability\. However, existing RTS benchmarks offer limited evaluation scope, lack systematic competency diagnosis, and remain fixed to their pre\-designed scenario coverage\. To address these limitations, we present RTSGameBench, which is built on Beyond All Reason, a large\-scale RTS game with an expanded battlefield that demands broader strategy diversity than existing testbeds\. The proposed benchmark provides evaluation through diverse gameplay across various matchup structures, diagnostic assessment via mini\-games that each target an individual strategic competency, and extensible coverage via a self\-evolving generation framework that converts free\-form queries into new mini\-games and improves over successive cycles\. Additionally, for VLMs to operate in large\-scale RTS games, we provide RTSGameAgent, which manages units through an FSM with agentic memory\. We empirically show that multiple state\-of\-the\-art VLMs perform poorly when matchups demand tighter multi\-agent coordination and when task scale increases\. Code is available at [https://github\.com/snumprlab/RTSGameBench](https://github.com/snumprlab/RTSGameBench)\.
\*These authors contributed equally\.
†JC is with ECE, IPAI and ASRI in SNU and is a corresponding author\.
## 1 Introduction
Vision\-Language Models \(VLMs\) have achieved remarkable success across a range of tasks \[brown2020language,raffel2020exploring,ouyang2022training,touvron2023llama,openai2023gpt4\], yet deploying them in complex, evolving environments that demand long\-horizon sequential decisions in the presence of other agents remains challenging \[brohan2022rt1,driess2023palme,fan2022minedojo\]\. Central to this challenge is *strategic reasoning*: anticipating and influencing other agents’ actions under uncertainty in competitive and cooperative settings \[gandhi2023strategic,zhang2024llm\]\. We argue that real\-time strategy \(RTS\) games are a natural testbed for evaluating these challenges: they ground strategic reasoning in continuous spatial decision\-making under partial observability, requiring agents to allocate resources, coordinate multiple units, adapt to opponents, and cooperate with allies, all within a measurable and reproducible simulator \[buro2003real,ontanon2013survey,robertson2014review\]\.
While StarCraft II \(SC2\) has been widely adopted as an RTS testbed for AI research \[vinyals2019grandmaster,ma2024large,ma2025ava,ahn2025hima,li2025llmpysc2\], we build upon Beyond All Reason \(BAR\) \[bar2024beyondallreason\], which expands the unit and field scale relative to SC2, as illustrated in [Tab\. 1](https://arxiv.org/html/2606.18950#S1.T1)\. This expanded scale enlarges the strategic space, requiring longer\-horizon planning over many interacting units across a larger battlefield, with coordination across allied groups and reasoning about enemy groups \[ontanon2013survey\]\. Moreover, BAR by design automates routine *per\-unit* execution, from target prioritization to energy management \[bar2024qualitipedia\], reducing low\-level overhead while preserving strategic depth: agents must still form and manage groups, decide *when* and *where* to engage, and coordinate spatial maneuvers\. Together, BAR’s large\-scale gameplay and *partial* low\-level automation make it a suitable platform for evaluating VLMs’ strategic reasoning in RTS\.
Table 1: Quantitative scale comparison: StarCraft II \(SC2\) *vs*\. BAR\. Unit Variety: unique units and buildings across all factions\. Supply Cap: per\-player unit limit\. Unit Capacity: total unit limit across all players\. \*SC2 uses a weighted population system, so actual unit counts are lower than this limit\. Details in supplementary\.

| | Unit Variety | Supply Cap | Unit Capacity | Map Size | Player Limit |
|---|---|---|---|---|---|
| StarCraft II | 96 | 200\* | 1,600\* | 1× | 8 |

However, the platform alone does not guarantee a rigorous evaluation\. Strategic reasoning in RTS is inherently multi\-faceted \[buro2003real\], *e\.g*\., spanning resource management, opponent modeling, and more, and imposes different demands depending on the number and roles of allies and opponents; such competencies therefore must be evaluated systematically across varied settings\. Yet current benchmarks address this only partially, lacking systematic diagnosis of individual competencies and remaining fixed in their diagnostic coverage \[ma2024large,ma2025ava,ahn2025hima,li2025llmpysc2\]\. To this end, we argue that a rigorous RTS benchmark should be: \(i\) holistic, capturing complete gameplay across diverse matchup structures; \(ii\) diagnostic, targeting individual competencies through controlled scenarios so that outcomes can be attributed to identifiable strengths and weaknesses \[lin2025gamebot\]; and \(iii\) extensible, allowing researchers to expand diagnostic coverage on demand, ideally through automated generation that improves with experience, rather than being confined to a fixed scenario set \[li2025llmpysc2,ma2025ava\]\.
Figure 1: Overview of RTSGameBench\. We evaluate VLMs’ strategic reasoning through three components: \(1\) Full Game Evaluation across diverse matchup structures; \(2\) Diagnostic Mini\-Games each targeting an individual strategic competency; and \(3\) a Self\-Evolving Game Generation Framework that converts free\-form queries into new diagnostic games via multi\-agent collaboration, enabling on\-demand extensibility\.
To jointly satisfy these desiderata, we propose RTSGameBench, a benchmark and evaluation platform integrating three components \(§[3](https://arxiv.org/html/2606.18950#S3); Fig\. [1](https://arxiv.org/html/2606.18950#S1.F1)\): \(i\) Full Game Evaluation across diverse matchup structures \(1v1, symmetric/asymmetric team, free\-for\-all\); \(ii\) Diagnostic Mini\-Games grounded in a taxonomy of RTS AI challenges \[buro2003real\]; and \(iii\) a Self\-Evolving Game Generation Framework that converts free\-form queries into new mini\-games while improving its efficiency and quality over successive cycles\.
Additionally, for VLMs to operate in large\-scale BAR gameplay, whose large unit counts and long durations demand scalable coordination and sustained coherence, we provide RTSGameAgent, a baseline agent pairing FSM\-based group management \[buckland2004programming\] with agentic memory \(§[4](https://arxiv.org/html/2606.18950#S4)\)\. Using this baseline, we conduct systematic experiments across multiple state\-of\-the\-art \(SoTA\) VLMs to characterize their strategic reasoning capabilities and limitations\. In summary,
- We introduce RTSGameBench, a VLM benchmark and evaluation platform built on Beyond All Reason, a large\-scale RTS game\.
- We propose a self\-evolving game generation framework that converts free\-form queries into new mini\-games, improving over successive cycles and thus enabling researchers to extend diagnostic coverage beyond fixed scenarios\.
- We design RTSGameAgent, a baseline agent with FSM\-based group management and agentic memory, making large\-scale RTS tractable for VLMs\.
- We provide systematic analysis of strategic reasoning capabilities and limitations across multiple state\-of\-the\-art open\- and closed\-source VLMs\.
## 2 Related Work
Game\-based evaluation of language\-model\-based agents\. Games serve as effective testbeds for evaluating the cognitive and decision\-making capabilities of LLM\- and VLM\-based agents \[paglieri2024balrog,hu2025lmgame,park2025orak\]\. Early benchmarks focused on text\-only observations in either single\-agent \[hu2024pokellmon\] or multi\-agent strategic settings \[qi2024civrealm\], but often lacked multimodal integration\. While Minecraft\-based benchmarks \[wang2025escapecraft,zheng2025mcu\] introduced multimodal observations, they remain largely limited to single\-agent environments\. Furthermore, as full\-game evaluations can often obscure specific sources of success or failure \[lin2025gamebot\], recent studies have shifted toward scenario\-level evaluations \[tang2025dsgbench,zheng2025v\] or gameplay\-extracted datasets \[xu2025vs\]\. However, these diagnostic approaches typically rely on predefined, static scenarios within specific domains\. In contrast, RTSGameBench provides a large\-scale, multi\-agent RTS environment demanding strategic reasoning under multimodal observations\. By complementing predefined scenarios with user\-query\-driven mini\-game generation, our framework enables an extensible and unbounded set of evaluation tasks, allowing for a more robust assessment of agent performance in challenging, dynamic settings\.
RTS game benchmarks\. RTS games require long\-horizon planning and multi\-unit coordination under partial observability, leading to various benchmarks based on SC2\. TextStarCraft II \[ma2024large\], TextSCII\-All \[ahn2025hima\], and HIVE \[anne2025harnessing\] evaluate full games or specific scenarios but rely solely on textual observations\. AVACraft \[ma2025ava\] introduces multimodal inputs but is confined to isolated scenarios, and LLM\-PySC2 \[li2025llmpysc2\] supports full\-game evaluation yet focuses on tactical execution rather than distinct strategic competencies\. Moreover, existing benchmarks are restricted to 1v1 matchups, neglecting cooperative and multi\-agent dynamics\. We address these gaps with RTSGameBench, providing systematic evaluation across diverse matchups and diagnostic tasks grounded in RTS AI taxonomy \[buro2003real\], built on BAR for larger scale and greater strategic complexity than existing RTS testbeds\. Additional comparisons are in the supplementary\.
Self\-evolving evaluation frameworks\. Fixed evaluation sets risk saturation, making it difficult to assess generalization \[ellis2023smacv2\]\. To broaden evaluation coverage, prior works have proposed language\-driven scenario generation for autonomous driving \[tan2023language,zhang2024chatscene\] and automated benchmark evolution for LLM evaluation \[wang2025benchmark\]\. However, scaling RTS game evaluation is more complex, requiring specialized design, implementation, and simulation\-based validation\. While self\-evolving agents have demonstrated success in optimizing agentic workflows \[Guan\_2024,wang2025evoagentxautomatedframeworkevolving\], we leverage this paradigm for RTS game benchmark expansion\. Our self\-evolving framework generates, validates, and quality\-assures diverse mini\-games from free\-form queries, continuously extending the benchmark beyond a static suite\.
Table 2: Overview of evaluation settings in RTSGameBench\. Top: Full game matchups vary player configurations to expose distinct strategic demands\. Bottom: Mini\-games each target one strategic competency identified by prior work \[buro2003real\]; Decision Making under Uncertainty is selectively incorporated via fog\-of\-war \(FoW\) when partial observability is integral to the competency being tested\. Action types \(§[3](https://arxiv.org/html/2606.18950#S3)\): Build = building construction, Prod\. = unit production, Move = unit movement\.

Full Game Match\-ups

| Mode | Config | Strategic Demand | Action Type | FoW |
|---|---|---|---|---|
| Duel | 1v1 | Individual decision\-making | Build \+ Prod\. \+ Move | On |
| Symmetric Team | 2v2, 3v3 | Allied coordination | Build \+ Prod\. \+ Move | On |
| Asymmetric Team | 3v4 | Coordination under numerical disadvantage | Build \+ Prod\. \+ Move | On |
| Free\-for\-All | 1v1v1v1 | Multi\-polar threat prioritization | Build \+ Prod\. \+ Move | On |

Diagnostic Mini\-Games

| Strategic Competency | Game | Task | Action Type | FoW |
|---|---|---|---|---|
| Resource Management | TCP | Produce target units within a deadline | Build \+ Prod\. \+ Move | On |
| Spatial & Temporal Reasoning | MFD | Defend multiple objectives from staggered attacks | Move | Off |
| Opponent Modeling | FS\-F | Predict opponents’ targets to prioritize engagements | Move | Off |
| Collaboration | FS\-T | Coordinate with allies using fixed forces \(Team\) | Move | Off |
| Adversarial Planning | SP | Breach a static fortification within a time limit | Build \+ Prod\. \+ Move | On |

## 3 RTSGameBench
As argued in §[1](https://arxiv.org/html/2606.18950#S1), rigorous evaluation of strategic reasoning in RTS demands a holistic, diagnostic, and extensible platform, requirements that existing benchmarks only partially meet \[ma2024large,ma2025ava,ahn2025hima,li2025llmpysc2\]\. To this end, we introduce RTSGameBench, a benchmark and evaluation platform built on BAR \[bar2024beyondallreason\] that integrates three components \([Figure 1](https://arxiv.org/html/2606.18950#S1.F1)\): \(i\) Full Game Evaluation \(§[3\.1](https://arxiv.org/html/2606.18950#S3.SS1)\), \(ii\) Diagnostic Mini\-Games \(§[3\.2](https://arxiv.org/html/2606.18950#S3.SS2)\), and \(iii\) a Self\-Evolving Game Generation Framework \(§[3\.3](https://arxiv.org/html/2606.18950#S3.SS3)\)\.
Game interface\. All evaluation settings in RTSGameBench share a common observe–decide–act loop\. Before the game begins, the agent receives static game knowledge $\mathcal{K}$, including the scenario description, available units and buildings, and team configuration\. At each decision step $t$, the engine renders visual channels $v_t$ \(a global minimap and local camera views that can be positioned at arbitrary locations\) from its internal state $s_t$, while a Python wrapper $\mathcal{W}$ extracts a structured textual observation; together these form the multimodal observation $o_t$\. When fog\-of\-war is enabled, both channels are restricted to allied line\-of\-sight\. \(Fog\-of\-war is a game mechanic that hides map regions outside the line\-of\-sight of allied units, introducing partial observability into the environment\.\) The agent’s policy $\pi$ \(instantiated by a VLM\) then selects an action:
$$o_t=(v_t,\mathcal{W}(s_t)),\quad a_t=\pi(o_t\mid\mathcal{K}),\quad s_{t+1}\leftarrow\text{Env}(s_t,a_t).\tag{1}$$
The action space comprises three types \(building construction, unit production, and unit movement\), with the agent deciding where to build and move on a $(0,0)$–$(100,100)$ coordinate grid, while the game engine Env handles low\-level execution\. The loop repeats at a fixed interval with the environment pausing between steps, ensuring evaluation targets strategic decision quality rather than reaction speed\. Full interface specifications and $\mathcal{K}$ details are in the supplementary\.
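The observe–decide–act loop of Eq\. \(1\) can be sketched in a few lines of Python\. This is a minimal illustration under stated assumptions, not the benchmark’s actual interface: `ToyEnv`, `Observation`, `Action`, `clamp_to_grid`, and `run_episode` are hypothetical names, and the toy environment merely counts steps in place of the BAR engine\.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Observation:
    visual: List[str]   # v_t: stand-ins for minimap and camera frames
    text: str           # W(s_t): structured textual observation


@dataclass
class Action:
    kind: str                      # "build", "produce", or "move"
    target: Tuple[float, float]    # position on the (0,0)-(100,100) grid


def clamp_to_grid(x: float, y: float, lo: float = 0.0, hi: float = 100.0) -> Tuple[float, float]:
    """Keep a coordinate inside the 0-100 grid (assumed handling, not from the paper)."""
    return (min(max(x, lo), hi), min(max(y, lo), hi))


class ToyEnv:
    """Hypothetical stand-in for the engine; it only advances when step() is called."""

    def __init__(self) -> None:
        self.step_count = 0
        self.history: List[Action] = []

    def render(self) -> List[str]:
        return [f"minimap@{self.step_count}"]

    def wrap(self) -> str:
        return f"step={self.step_count}"

    def step(self, action: Action) -> None:
        self.history.append(action)
        self.step_count += 1


Policy = Callable[[Observation, Any], Action]


def run_episode(env: ToyEnv, policy: Policy, knowledge: Any, num_steps: int) -> List[Action]:
    actions = []
    for _ in range(num_steps):
        obs = Observation(env.render(), env.wrap())      # o_t = (v_t, W(s_t))
        action = policy(obs, knowledge)                  # a_t = pi(o_t | K)
        action.target = clamp_to_grid(*action.target)
        env.step(action)                                 # s_{t+1} <- Env(s_t, a_t)
        actions.append(action)
    return actions
```

Clamping targets to the grid is one possible guard against out\-of\-range coordinates emitted by a VLM; the source does not state how such outputs are handled\.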
### 3\.1 Full Game Match\-ups
In full game evaluation, an agent plays complete BAR matches from start to finish\. While existing RTS benchmarks predominantly evaluate agents in 1v1 settings \[vinyals2019grandmaster,ma2024large,ma2025ava,ahn2025hima,li2025llmpysc2\], different player configurations give rise to distinct strategic demands \[buro2003real\] that a thorough evaluation must cover\. We therefore design four match\-up types \([Tab\. 2](https://arxiv.org/html/2606.18950#S2.T2), top\): Duel tests individual decision\-making; Symmetric Team introduces allied coordination; Asymmetric Team places the agent on the smaller side, demanding tighter coordination under numerical disadvantage; and Free\-for\-All requires multi\-polar threat prioritization\. In all modes, the agent occupies one slot, while remaining slots are filled by built\-in AI\.
Figure 2: Diagnostic mini\-games\. Each scenario targets a core strategic competency \(resource management, spatial and temporal reasoning, opponent modeling, collaboration, or adversarial planning\), with fog\-of\-war selectively applied per game \([Tab\. 2](https://arxiv.org/html/2606.18950#S2.T2)\)\.
### 3\.2 Diagnostic Mini\-Games
Full game play requires applying diverse strategic competencies simultaneously, which conflates distinct behaviors and masks specific deficiencies\. To enable a more granular assessment, we introduce mini\-games \([Tab\. 2](https://arxiv.org/html/2606.18950#S2.T2), bottom; [Figure 2](https://arxiv.org/html/2606.18950#S3.F2)\), each targeting an individual strategic competency for RTS \[buro2003real\] through controlled initial conditions and bounded time horizons\. Each mini\-game is evaluated through a primary performance metric alongside game\-specific auxiliary measures \(as detailed in [Tab\. 4](https://arxiv.org/html/2606.18950#S5.T4)\)\.
\(1\) Resource Management: Time\-Constrained Production \(TCP\)\. The agent must produce a specified unit composition within a deadline while fending off enemy raids\. Build dependencies \[bar2024qualitipedia\] force sequential production decisions, and competing demands between economic investment and defensive spending test whether the agent can allocate limited resources efficiently under time pressure\.
\(2\) Spatial & Temporal Reasoning: Multi\-Front Defense \(MFD\)\. The agent defends multiple objectives against attacks arriving from different directions at staggered timings\. Forces are fixed with no production, so success depends entirely on terrain\-aware positioning and timely redeployment, testing the agent’s ability to reason about where and when to commit its forces\.
\(3\) Opponent Modeling: Fixed\-Field Skirmish, Free\-for\-All \(FS\-F\)\. Three or more agents fight with symmetric fixed forces and no production; the last survivor wins\. The agent must predict each opponent’s target selection to decide whom to engage first, as implicit coalitions and betrayals make reading intentions critical\.
\(4\) Collaboration: Fixed\-Field Skirmish, Team \(FS\-T\)\. FS\-T uses the same game skeleton as FS\-F but replaces free\-for\-all with team play, with no explicit communication channel between allies\. The agent must infer allied intentions from observed movements and coordinate actions, such as focus\-firing or dividing fronts, testing collaboration without direct communication\.
\(5\) Adversarial Planning: Siege Planning \(SP\)\. The agent must breach a static enemy fortification within a strict time limit while managing production and resource gathering\. Since the enemy defense is fixed, the task targets the agent’s ability to analyze the defensive composition and derive an effective attack order, determining which defenses to neutralize first to enable subsequent assaults\.
The sixth challenge in the taxonomy \[buro2003real\], Decision Making under Uncertainty, is selectively integrated across these mini\-games via fog\-of\-war, enabled only when partial observability is integral to the competency being tested \([Tab\. 2](https://arxiv.org/html/2606.18950#S2.T2), FoW column\)\. Unit compositions, scenario parameters, and per\-scenario design rationales are in the supplementary\.
Figure 3: Self\-evolving game generation pipeline\. A project manager \(PM\) orchestrates VLM\-based agents through four stages \(scenario planning, GDD generation, rule set construction, and game implementation\) with inter\-stage gating and rollback\. A shared knowledge database stores validated GDDs and rule sets, enabling reuse and fast\-tracking\. The PM’s retrospective analysis refines quality rubrics after each generation\.
### 3\.3 Self\-Evolving Game Generation Framework
The five diagnostic mini\-games cover targeted strategic competencies, but each competency can be tested under a far broader range of conditions than any single fixed scenario provides\. Manually expanding this suite, however, is costly, requiring expertise in game design, engine\-level implementation, and simulation\-based validation\. While LLM\-based multi\-agent pipelines \[qian2024chatdev,hong2024metagpt,wu2023autogen\] and iterative self\-refinement methods \[madaan2023selfrefine,shinn2023reflexion\] offer a promising way to automate such processes, they rely on fixed evaluation criteria that are difficult to design well \[zhang2026rubricbench\] and have primarily targeted software development and general\-purpose reasoning rather than game generation, where verification requires costly simulation runs\.
To address these challenges, we propose a Self\-Evolving Game Generation Framework that automates the creation of diagnostic mini\-games from free\-form user queries while improving over successive cycles \(Fig\. [3](https://arxiv.org/html/2606.18950#S3.F3)\)\. The framework utilizes specialized VLM\-based agents: a project manager \(PM\) for pipeline orchestration and inter\-stage gating, a designer and developer for conceptualization and implementation, and an analyst that validates each stage against rubrics and simulation feedback\. Rubrics are initially human\-designed, specifying mandatory rules and quality criteria each stage must satisfy, and are progressively refined through the self\-evolution mechanism described below\. Each agent operates with stage\-specific system prompts defining its expected inputs, outputs, and quality criteria; full algorithmic details, prompts, and rubrics are in the supplementary\.
Generation pipeline\. Given a user query, the framework generates a new game through four stages\. In Stages 2–4 \(Fig\. [3](https://arxiv.org/html/2606.18950#S3.F3)\), agents generate and the analyst validates iteratively until criteria are met; validated artifacts are stored in a shared knowledge database for reuse\. The PM gates every stage transition: upon success it advances the pipeline; upon repeated failure it reviews the iteration history and decides whether to retry or roll back with corrective feedback\.
In Stage 1, the designer clarifies the query’s intent through multi\-turn dialogue with the user to produce a structured scenario brief specifying game composition, enemy behavior rules, and win/loss conditions; if the database already contains a matching Game Design Document \(GDD\) \[fullerton2014game\], the pipeline skips directly to Stage 4\. In Stage 2, the designer expands the brief into a full GDD that specifies the targeted competency and defines the rule components governing game behavior \(*e\.g*\., unit spawning conditions\); the analyst validates the GDD via rubric\-based checks\. In Stage 3, for each rule in the GDD, the developer retrieves a matching implementation from the database when available or writes a new Lua script \[ierusalimschy2006lua\] for in\-engine execution; the analyst verifies each script via rubric\-based checks and simulation runs\. In Stage 4, the developer retrieves necessary game assets from the database and determines the game configuration \(unit placement, end conditions, map selection, and rule parameters\), producing a final executable script; the analyst validates the output via rubric\-based checks and verifies visual playability and semantic alignment by running a full game simulation, feeding screenshots at regular intervals to a VLM, and using its feedback to judge correctness and guide further revision if needed\. Upon successful completion, the validated GDD and rule components are stored in the database for future reuse\.
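The generate\-validate\-gate control flow described above might look roughly like the following Python sketch\. It rests on a simplifying assumption: where the paper’s PM reviews the iteration history and chooses between retrying and rolling back, this sketch applies a fixed rule \(after `max_iters` failed validations, roll back one stage and hand the analyst’s feedback to it\)\. All names \(`run_pipeline`, `Stage`, `max_iters`, `max_rollbacks`\) are hypothetical, and database fast\-tracking is omitted\.

```python
from typing import Callable, Dict, List, Optional, Tuple

# generate(artifacts, feedback) -> artifact; validate(artifact) -> (ok, feedback)
Generate = Callable[[Dict[str, str], Optional[str]], str]
Validate = Callable[[str], Tuple[bool, Optional[str]]]
Stage = Tuple[str, Generate, Validate]


def run_pipeline(stages: List[Stage], max_iters: int = 3, max_rollbacks: int = 2) -> Dict[str, str]:
    """Run stages in order; gate each transition on validation (assumed fixed policy)."""
    artifacts: Dict[str, str] = {}
    feedback: Optional[str] = None
    rollbacks = 0
    i = 0
    while i < len(stages):
        name, generate, validate = stages[i]
        passed = False
        for _ in range(max_iters):
            artifact = generate(artifacts, feedback)   # designer / developer role
            passed, feedback = validate(artifact)      # analyst role
            if passed:
                artifacts[name] = artifact
                break
        if passed:
            feedback = None        # gate opens: advance with a clean slate
            i += 1
        elif i > 0 and rollbacks < max_rollbacks:
            rollbacks += 1         # roll back one stage, keeping the feedback
            i -= 1
        else:
            raise RuntimeError(f"stage {name!r} failed validation")
    return artifacts
```

Passing the last validation feedback to the earlier stage is what lets a rollback be corrective rather than a blind retry\.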
Self\-evolution mechanisms\. The framework’s self\-evolution is driven by two complementary mechanisms\. First, the shared knowledge database accumulates validated GDDs and rule sets across cycles, enabling the pipeline to bypass redundant stages and reuse verified components\. Second, the PM conducts a retrospective analysis after each successful generation, updating the analyst’s rubrics based on discrepancies between verification outcomes and quality expectations\. Together, these mechanisms transform RTSGameBench from a static test suite into a continuously extensible evaluation platform \(empirically validated in §[5\.2](https://arxiv.org/html/2606.18950#S5.SS2)\)\.
## 4 RTSGameAgent
The default action interface \(§[3](https://arxiv.org/html/2606.18950#S3)\) operates at the per\-unit level, issuing individual commands at every decision step, which is practical when unit counts are small but intractable when BAR matches scale to hundreds of units\. Moreover, between discrete VLM calls the game state evolves continuously, yet observations capture only the present moment, risking loss of critical inter\-step context\. To address these challenges, we propose RTSGameAgent, a baseline agent \([Figure 4](https://arxiv.org/html/2606.18950#S4.F4)\) that combines FSM\-based group management for scalable, stateful coordination with agentic memory for long\-term coherence\.
FSM\-based group management\. Of the three default action types \(§[3](https://arxiv.org/html/2606.18950#S3)\), RTSGameAgent retains building construction and unit production at the per\-unit level: the VLM specifies building type, placement location, and which unit type each factory produces\. The remaining type, per\-unit movement, is replaced with two group\-level actions: group assignment, where the VLM creates named squads \(*e\.g*\., assault, defense\) and allocates units, and group movement, where the VLM specifies a target coordinate and command per squad\. Group statuses are maintained within $\mathcal{W}(s_t)$ and thus included in each observation\.
To reduce each group’s behavior to a small discrete command set for tractable VLM decision\-making in large\-scale RTS, we equip each group with a finite\-state machine \(FSM\) of four states \[buckland2004programming\]\. An FSM constrains each group to exactly one state at a time, transitioning in response to VLM commands or environmental triggers: the VLM issues one of three commands \(move, move\_force, or stop\), while the fourth, fight, is triggered automatically upon enemy contact and reverts to the prior command once the engagement ends\. move\_force bypasses this trigger, forcing the group to continue toward its destination regardless of enemy presence\. This delegates strategic decisions \(where to move, which mode\) to the VLM and tactical execution \(when to engage\) to the engine\. Group states persist across decision steps, so each group acts autonomously according to its current FSM state without per\-step re\-specification\.
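The four\-state group FSM can be illustrated with a short Python sketch\. The class and method names \(`GroupState`, `GroupFSM`, `on_enemy_contact`, `on_engagement_end`\) are hypothetical, and two behaviors are assumptions not stated in the text: that a stopped group also enters fight on contact, and that a new VLM command issued during a fight overrides it\.

```python
from enum import Enum
from typing import Optional, Tuple


class GroupState(Enum):
    MOVE = "move"
    MOVE_FORCE = "move_force"
    STOP = "stop"
    FIGHT = "fight"


class GroupFSM:
    """One state at a time; FIGHT is entered by the engine, never by the VLM."""

    def __init__(self) -> None:
        self.state = GroupState.STOP
        self.target: Optional[Tuple[float, float]] = None
        self._prior = GroupState.STOP   # command restored after a fight

    def command(self, state: GroupState, target: Optional[Tuple[float, float]] = None) -> None:
        """VLM-issued command: move, move_force, or stop."""
        if state is GroupState.FIGHT:
            raise ValueError("fight is triggered by the engine, not commanded")
        self.state = state
        self._prior = state
        self.target = target

    def on_enemy_contact(self) -> None:
        """Environmental trigger; move_force bypasses it."""
        if self.state in (GroupState.MOVE, GroupState.STOP):
            self._prior = self.state
            self.state = GroupState.FIGHT

    def on_engagement_end(self) -> None:
        """Revert to the command that was active before the fight."""
        if self.state is GroupState.FIGHT:
            self.state = self._prior
```

Because the state persists on the object, a group keeps behaving according to its last command between decision steps without the VLM restating it\.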
Figure 4: Inference loop of RTSGameAgent\. At each decision step, the memory phase \(left\) consolidates short\-term event logs $\mathcal{S}_t$ with long\-term memory $\mathcal{L}_{t-1}$ via an LLM, producing relevant entries $m_t$ and updated memory $\mathcal{L}_t$\. The decision phase \(right\) feeds $m_t$, game knowledge $\mathcal{K}$, and multimodal observations $o_t$ to the VLM policy $\pi$, which outputs four action types: building construction, unit production, group assignment, and group movement with FSM commands\.
Inference with agentic memory\. To address the inter\-step context loss identified above, RTSGameAgent augments each decision with a memory system inspired by the cognitive distinction between short\- and long\-term memory \[atkinson1968human,squire1992memory\] and LLM\-based memory architectures \[park2023generative,packer2023memgpt,xu2025amem,jwa2025lwe\]\. The agent maintains two stores: a short\-term memory $\mathcal{S}_t$ of event logs accumulated between VLM calls, namely enemy sightings detected within allied line\-of\-sight and battle events triggered by the game environment, and a long\-term memory $\mathcal{L}_t$ of concise experience summaries persisting across the entire game\. At each inference interval, two phases execute in series \([Figure 4](https://arxiv.org/html/2606.18950#S4.F4)\): in the memory phase, an LLM guided by a memory\-management prompt consolidates the two stores, retaining, merging, or discarding short\-term logs into long\-term memory, and selects a subset of relevant entries $m_t$ for the current decision; the short\-term buffer is then flushed:
$$m_t,\;\mathcal{L}_t=\mathrm{LLM}(\mathcal{S}_t,\mathcal{L}_{t-1}).\tag{2}$$
In the decision phase, the VLM receives $m_t$ alongside observations $o_t$, where local camera views are positioned at the three largest groups by unit count and the home base, extending the base policy \(§[3](https://arxiv.org/html/2606.18950#S3)\) to:
$$a_t=\pi(o_t,m_t\mid\mathcal{K}),\tag{3}$$
where actions $a_t$ now comprise building construction, unit production, group assignment, and group movement with FSM commands \(see supplementary for full prompts and algorithmic details\)\.
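The two\-phase inference step of Eqs\. \(2\) and \(3\) can be sketched as follows\. This is an illustration only: `consolidate_stub` is a hypothetical, rule\-based stand\-in for the LLM memory manager \(it deduplicates and truncates rather than summarizing\), and `inference_step` and `max_entries` are likewise assumed names\.

```python
from typing import Any, Callable, List, Tuple

Memory = List[str]


def consolidate_stub(short_term: Memory, long_term: Memory, max_entries: int = 5) -> Tuple[Memory, Memory]:
    """Stand-in for Eq. (2): merge new events, drop duplicates, keep the newest entries."""
    merged = list(long_term)
    for event in short_term:
        if event not in merged:
            merged.append(event)
    new_long_term = merged[-max_entries:]
    # Treat entries that came from this interval's events as the relevant subset m_t.
    relevant = [entry for entry in new_long_term if entry in short_term]
    return relevant, new_long_term


def inference_step(
    short_term: Memory,
    long_term: Memory,
    observation: Any,
    knowledge: Any,
    consolidate: Callable[[Memory, Memory], Tuple[Memory, Memory]],
    policy: Callable[[Any, Memory, Any], Any],
) -> Tuple[Any, Memory]:
    relevant, new_long_term = consolidate(short_term, long_term)  # memory phase, Eq. (2)
    action = policy(observation, relevant, knowledge)             # decision phase, Eq. (3)
    short_term.clear()                                            # flush the short-term buffer
    return action, new_long_term
```

Running the memory phase before the decision phase means the policy always sees entries that already reflect events from the interval that just ended\.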
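Table 3: Full game evaluation results\. Duel/Team report WR; Free\-for\-All reports RS; all modes report GT$_\text{W}$/GT$_\text{L}$\. VLM occupies one slot \(the smaller side in Asymm\.\); rest filled by built\-in AI\. I: Instruct, T: Thinking\. ‘–’: no wins, thus GT$_\text{W}$ undefined\. Columns are grouped by matchup: Duel \(1v1\), Symmetric Team \(2v2, 3v3\), Asymm\. Team \(3v4\), and Free\-for\-All \(1v1v1v1\)\.

| Model | 1v1 WR | 1v1 GT$_\text{W}$ | 1v1 GT$_\text{L}$ | 2v2 WR | 2v2 GT$_\text{W}$ | 2v2 GT$_\text{L}$ | 3v3 WR | 3v3 GT$_\text{W}$ | 3v3 GT$_\text{L}$ | 3v4 WR | 3v4 GT$_\text{W}$ | 3v4 GT$_\text{L}$ | 1v1v1v1 RS | 1v1v1v1 GT$_\text{W}$ | 1v1v1v1 GT$_\text{L}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT\-5\.2 | 0\.53 | 27 | 37 | 0\.33 | 95 | 67 | 0\.37 | 71 | 55 | 0\.10 | 87 | 45 | 0\.37 | 31 | 11 |
| GPT\-5\-mini | 0\.07 | 24 | 22 | 0\.07 | 66 | 56 | 0\.13 | 77 | 41 | 0\.03 | 77 | 34 | 0\.18 | 23 | 11 |
| Claude\-4\.5\-Sonnet | 0\.27 | 28 | 43 | 0\.03 | 78 | 74 | 0\.20 | 67 | 58 | 0\.00 | – | 48 | 0\.57 | 28 | 18 |
| Gemini\-3\-Flash | 0\.87 | 21 | 34 | 0\.50 | 92 | 69 | 0\.33 | 60 | 52 | 0\.20 | 56 | 38 | 0\.66 | 24 | 11 |
| Kimi\-K2\.5 | 0\.30 | 29 | 37 | 0\.30 | 63 | 69 | 0\.23 | 57 | 44 | 0\.00 | – | 72 | 0\.40 | 14 | 23 |
| Grok\-4\.1\-Fast | 0\.33 | 41 | 27 | 0\.00 | – | 56 | 0\.10 | 63 | 51 | 0\.07 | 60 | 32 | 0\.30 | 34 | 12 |
| Qwen3\.5\-397B | 0\.13 | 31 | 26 | 0\.00 | – | 62 | 0\.33 | 46 | 60 | 0\.00 | – | 62 | 0\.30 | 31 | 13 |
| Qwen3\-VL\-235B\-I | 0\.00 | – | 28 | 0\.00 | – | 51 | 0\.03 | 75 | 36 | 0\.00 | – | 43 | 0\.07 | – | 9 |
| Qwen3\-VL\-235B\-T | 0\.20 | 23 | 25 | 0\.00 | – | 58 | 0\.30 | 72 | 52 | 0\.00 | – | 32 | 0\.27 | 24 | 11 |
| LLaMA4\-Maverick | 0\.00 | – | 27 | 0\.07 | 86 | 53 | 0\.00 | – | 45 | 0\.00 | – | 41 | 0\.17 | 21 | 11 |
| Mistral\-Large\-3 | 0\.00 | – | 27 | 0\.00 | – | 60 | 0\.00 | – | 45 | 0\.00 | – | 19 | 0\.13 | 34 | 11 |
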
## 5 Experiments
We evaluate the strategic reasoning capabilities of various SoTA VLMs within RTSGameBench, and assess the quality of self\-evolving game generation\.
Setup\. We evaluate eleven VLMs spanning proprietary and open\-source families, plus a human baseline for mini\-games \([Tabs\. 3](https://arxiv.org/html/2606.18950#S4.T3) and [4](https://arxiv.org/html/2606.18950#S5.T4)\)\. For full games, the VLM occupies one player slot with remaining slots filled by built\-in AI at Easy difficulty; at the next level, all tested models drop to near\-zero win rates \(see supplementary\)\. Each matchup uses a fixed map; all game results are averaged over 30 runs\. The inference interval is 1 minute for full games; for mini\-games, combat scenarios \(FS\-F, FS\-T, MFD\) use 15 seconds and planning scenarios \(TCP, SP\) use 60 seconds\. All models use RTSGameAgent with identical prompting templates; in RTSGameAgent, the same VLM serves both the memory and decision phases\. For the self\-evolving generation framework, all agents use GPT\-5\.2 \[openai2025gpt52\]\. Unit compositions and map specifications used for games, interval analysis, and full prompts are in the supplementary\.
Evaluation metrics\.Full game modes report Win Rate \(WR,↑\\uparrow\), Game Time for wins/losses \(GTW\{\}\_\{\\text\{W\}\}/GTL\{\}\_\{\\text\{L\}\}, min\), and Damage Efficiency \(DE = damage dealt / damage received\); Free\-for\-All reports Rank Score \(RS: 1st=1\.0/2nd=0\.67/3rd=0\.33/ 4th=0\.0\), with GTW\{\}\_\{\\text\{W\}\}/GTL\{\}\_\{\\text\{L\}\}computed over 1st–2ndand 3rd–4thplace finishes respectively\. Mini\-games pair a primary metric with a game\-specific auxiliary—Average Time \(AT, min,↓\\downarrow\) or DE—per game \(see[Tab\.˜4](https://arxiv.org/html/2606.18950#S5.T4)\)\. For the self\-evolving generation framework, we report Playability \(fraction of generated games that execute successfully\), Generation Time \(min\), and Human Preference judged by four RTS\-experienced evaluators via pairwise comparison \(A win / B win / tie\)\. Full metric definitions are in the supplementary\.
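As a minimal sketch of how these metrics could be computed from per-game records, the snippet below implements WR, RS, DE, and the GT$_\text{W}$/GT$_\text{L}$ split as defined above. The `GameResult` record, its field names, and the choice to aggregate DE over all runs are illustrative assumptions, not the benchmark's actual code.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional

# Rank Score per finishing place in a four-player Free-for-All (values from the text).
RANK_SCORE = {1: 1.0, 2: 0.67, 3: 0.33, 4: 0.0}


@dataclass
class GameResult:
    """Hypothetical per-game record; field names are assumptions."""
    won: bool                   # Duel/Team: victory; FFA: True for a 1st-2nd place finish
    minutes: float              # game length in minutes
    damage_dealt: float
    damage_received: float
    rank: Optional[int] = None  # FFA finishing place (1-4); None for other modes


def win_rate(games: List[GameResult]) -> float:
    return sum(g.won for g in games) / len(games)


def rank_score(games: List[GameResult]) -> float:
    return mean(RANK_SCORE[g.rank] for g in games)


def damage_efficiency(games: List[GameResult]) -> float:
    # DE = damage dealt / damage received; summing over runs is an assumption.
    return sum(g.damage_dealt for g in games) / sum(g.damage_received for g in games)


def game_time(games: List[GameResult], won: bool) -> Optional[float]:
    """Mean game time over wins (GT_W) or losses (GT_L); None mirrors the '–' entries."""
    times = [g.minutes for g in games if g.won == won]
    return mean(times) if times else None
```

Returning `None` when a model has no wins mirrors the ‘–’ entries in the tables, where GT$_\text{W}$ is undefined.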
Table 4: Mini-game evaluation results. Each mini-game targets one core strategic competency in RTS [buro2003real]. I: Instruct, T: Thinking. ‘–’: no wins, thus AT undefined.

| Model | TCP WR ↑ | TCP AT ↓ | MFD WR ↑ | MFD DE ↑ | FS-F RS ↑ | FS-F DE ↑ | FS-T WR ↑ | FS-T DE ↑ | SP WR ↑ | SP AT ↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-5.2 | 0.93 | 17 | 0.30 | 1.01 | 0.62 | 1.00 | 0.63 | 1.04 | 0.50 | 15 |
| GPT-5-mini | 0.33 | 16 | 0.23 | 1.18 | 0.50 | 0.98 | 0.30 | 0.97 | 0.13 | 12 |
| Claude-4.5-Sonnet | 1.00 | 14 | 0.33 | 1.23 | 0.55 | 1.04 | 0.73 | 1.30 | 0.80 | 15 |
| Gemini-3-Flash | 1.00 | 16 | 0.60 | 1.61 | 0.64 | 1.03 | 0.50 | 1.12 | 0.93 | 15 |
| Kimi-K2.5 | 0.97 | 17 | 0.77 | 1.82 | 0.20 | 0.82 | 0.57 | 1.05 | 0.80 | 13 |
| Grok-4.1-Fast | 0.50 | 20 | 0.10 | 0.90 | 0.33 | 0.95 | 0.50 | 1.17 | 0.40 | 15 |
| Qwen3.5-397B | 1.00 | 15 | 0.20 | 1.05 | 0.55 | 1.04 | 0.37 | 0.96 | 0.50 | 15 |
| Qwen3-VL-235B-I | 0.30 | 11 | 0.00 | 0.50 | 0.22 | 0.86 | 0.30 | 1.05 | 0.13 | 17 |
| Qwen3-VL-235B-T | 1.00 | 12 | 0.53 | 1.55 | 0.30 | 0.93 | 0.50 | 1.09 | 0.20 | 12 |
| LLaMA4-Maverick | 0.60 | 25 | 0.00 | 0.52 | 0.52 | 1.12 | 0.23 | 1.02 | 0.53 | 16 |
| Mistral-Large-3 | 0.00 | – | 0.00 | 0.24 | 0.37 | 0.91 | 0.57 | 1.19 | 0.00 | – |
| Human | 1.00 | 10 | 1.00 | 3.46 | 0.93 | 1.53 | 1.00 | 1.21 | 0.80 | 15 |
### 5.1 Main Results
Full game evaluation. [Tab. 3](https://arxiv.org/html/2606.18950#S4.T3) presents full game results across four matchups. In Duel, Gemini-3-Flash leads (WR 0.87) with the shortest GT$_\text{W}$, indicating an aggressive strategy that closes out wins quickly; GPT-5.2 follows (0.53), while open-source models largely fail: only Qwen3-VL-235B-T achieves non-trivial wins (0.20). Claude delays defeat (GT$_\text{L}$ = 43 min) but rarely converts this into decisive advantages, suggesting a defensive posture without effective counterplay. In Symmetric Team, performance drops broadly (Gemini 0.87 → 0.50 in 2v2), indicating that coordinating with allied AI introduces challenges beyond individual play. GT$_\text{L}$ rises as allied AI prolongs games, yet no model leverages this to mount comebacks; notably, Qwen3.5-397B approaches GPT-5.2 in 3v3 (0.33 *vs*. 0.37), where the larger allied contingent amplifies the role of team synergy over individual capability, a challenge the diagnostic mini-games (FS-T) further isolate. Asymmetric Team proves hardest: even Gemini reaches only WR 0.20, as numerical disadvantage demands sustained coordination that current VLMs cannot maintain. In Free-for-All, Gemini leads (RS 0.66), followed by Claude (0.57); Claude’s strong survival ability observed in Duel translates well to the multi-player setting. Qwen3-VL-235B-I (RS 0.07) suffers near-immediate elimination, suggesting poor threat assessment in contested environments.
Table 5: Component and modality analyses. We evaluate each RTSGameAgent component and input modality on Full Game (1v1) and all five mini-games. Top: component analysis of FSM-based Group Management (FSM) and Agentic Memory (Mem). Bottom: input modality, language-only (L) *vs*. vision-language (V+L).

Component Analysis (C1 = FSM, C2 = Mem)

| Variant | C1 | C2 | 1v1 WR | 1v1 GT$_\text{W}$ | 1v1 GT$_\text{L}$ | TCP WR ↑ | TCP AT ↓ | MFD WR ↑ | MFD DE ↑ | FS-F RS ↑ | FS-F DE ↑ | FS-T WR ↑ | FS-T DE ↑ | SP WR ↑ | SP AT ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| w/o Both | ✘ | ✘ | 0.10 | 18 | 40 | 0.47 | 20 | 0.23 | 0.90 | 0.37 | 0.92 | 0.30 | 0.98 | 0.40 | 15 |
| w/o FSM Group Mgmt | ✘ | ✔ | 0.53 | 29 | 38 | 0.80 | 14 | 0.33 | 0.99 | 0.30 | 0.93 | 0.37 | 1.05 | 0.40 | 14 |
| w/o Agentic Memory | ✔ | ✘ | 0.67 | 20 | 19 | 0.80 | 14 | 0.40 | 1.15 | 0.54 | 0.99 | 0.40 | 1.20 | 0.63 | 13 |
| RTSGameAgent (Full) | ✔ | ✔ | 0.87 | 21 | 34 | 1.00 | 16 | 0.60 | 1.61 | 0.64 | 1.03 | 0.50 | 1.12 | 0.93 | 15 |

Input Modality Analysis (C1 = Vision, C2 = Language)

| Variant | C1 | C2 | 1v1 WR | 1v1 GT$_\text{W}$ | 1v1 GT$_\text{L}$ | TCP WR ↑ | TCP AT ↓ | MFD WR ↑ | MFD DE ↑ | FS-F RS ↑ | FS-F DE ↑ | FS-T WR ↑ | FS-T DE ↑ | SP WR ↑ | SP AT ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| L only | ✘ | ✔ | 0.23 | 21 | 32 | 1.00 | 15 | 0.27 | 1.09 | 0.61 | 0.99 | 0.40 | 1.12 | 0.73 | 13 |
| V+L (Full) | ✔ | ✔ | 0.87 | 21 | 34 | 1.00 | 16 | 0.60 | 1.61 | 0.64 | 1.03 | 0.50 | 1.12 | 0.93 | 15 |
Figure 5: Impact of task scale on model performance. (a) Map size scaling (Duel, FFA) for Gemini-3-Flash (top) and Qwen3.5-397B (bottom). (b) Unit count scaling (FS-F, FS-T) for Gemini-3-Flash (top) and GPT-5.2 (bottom). Blue solid: Win Rate; red dashed: Game Time (min). All models degrade with increasing scale.

Diagnostic mini-game evaluation. [Tab. 4](https://arxiv.org/html/2606.18950#S5.T4) presents results across five mini-games, each targeting a core strategic competency [buro2003real]. In TCP, four models achieve WR 1.00, indicating near-saturation, yet none match the human’s AT of 10 min. MFD reveals the largest human–VLM gap: the best VLM (Kimi, DE 1.82) falls short of the human (DE 3.46), and several models fail entirely (WR 0.00). In FS-F, Gemini-3-Flash leads all VLMs (RS 0.64), while Kimi, which excels in MFD and SP, drops to the lowest RS (0.20), suggesting opponent modeling requires distinct capabilities. FS-T shows Claude leading (WR 0.73), while Mistral, which struggles elsewhere, reaches WR 0.57, hinting that structured coordination may compensate for individual weaknesses. In SP, Gemini leads (WR 0.93), followed by Claude and Kimi (both WR 0.80); Qwen3-VL-235B-T drops from WR 1.00 (TCP) to 0.20, suggesting production planning does not generalize to adversarial planning against static fortifications.
Figure 6: Self-evolution over successive generation batches. (a) Playability, (b) Generation Time, and (c) Human Preference between the two Agent variants, across five batches.
### 5.2 Detailed Analyses
We complement the main results with ablation studies on RTSGameAgent’s components and input modalities, and a scaling analysis.
Component and modality analyses. [Tab. 5](https://arxiv.org/html/2606.18950#S5.T5) ablates RTSGameAgent’s two core components, FSM-based Group Management (FSM) and Agentic Memory (Mem), and the input modality, using Gemini-3-Flash for its strong full game performance. Removing both drops Duel WR from 0.87 to 0.10, confirming the raw VLM alone is far from competitive. FSM contributes more than Memory, consistent across mini-games (particularly MFD and SP), indicating group-level coordination is the primary bottleneck; the full agent outperforms either component alone, suggesting complementary roles: FSM enables scalable group coordination while Memory preserves context across decision steps. For input modality, removing vision drops Duel WR from 0.87 to 0.23, with the largest impact on MFD; TCP and FS-F remain largely unaffected, suggesting resource management and opponent modeling rely primarily on textual observations.
Scaling analysis. To examine the effect of task scale, we vary map size (Duel, FFA) and unit count (FS-F, FS-T) for models that showed competitive performance in the corresponding evaluations ([Figure 5](https://arxiv.org/html/2606.18950#S5.F5)). All models exhibit consistent performance degradation as scale increases: Qwen3.5 suffers the steepest decline, dropping to WR 0.00 at the largest map size, while Gemini degrades more gradually; GPT-5.2 shows a similar pattern in unit scaling, with a sharp drop in FS-F but stable performance in FS-T. Average game time rises across all settings, reflecting both the increased scale and the models’ difficulty in closing out games. Notably, FS-T proves more robust to unit scaling than FS-F, suggesting that team coordination is less sensitive to raw unit count than individual threat assessment. These results suggest that task scale in RTS environments is a notable factor in model performance: as map size and unit count grow, reasoning demands increase in ways that current models struggle to accommodate.
Figure 7: Example of a mini-game generated by the self-evolving framework.
### 5.3 Evaluating Self-Evolving Game Generation
[Figure 6](https://arxiv.org/html/2606.18950#S5.F6) tracks Playability, Generation Time, and Human Preference across five successive batches of five queries each (queries in the supplementary). We compare four configurations: the full multi-agent pipeline with and without self-evolution (Agent w/ SE and Agent w/o SE), and an end-to-end baseline that generates the entire game in one pass, with and without iterative re-verification (E2E w/ verify and E2E). The multi-agent pipeline proves essential: both Agent variants achieve high Playability from the first batch, while both E2E variants remain far lower, confirming that structured stage-wise generation cannot be replaced by repeated verification alone. Self-evolution further improves the pipeline: database reuse and rubric updates jointly lift Playability over successive batches while keeping Generation Time stable; without self-evolution, each batch is treated independently and Playability stagnates. Human Preference, evaluated between the two Agent variants given their high Playability, corroborates this: evaluators increasingly favor the self-evolving variant in later batches.
Figure [7](https://arxiv.org/html/2606.18950#S5.F7) shows a concrete example: a mini-game generated from a query testing strategic foresight and phased preparation, where the agent must prepare for a scheduled ground assault at 10 min and an air assault at 30 min. The screenshot sequence confirms that the framework translates the user’s intent into executable game logic with precise temporal triggers, demonstrating its ability to produce new diagnostic scenarios from user queries alone.
## 6 Conclusion
We propose RTSGameBench, a benchmark and evaluation platform built on Beyond All Reason that provides holistic evaluation via full games across diverse matchup structures, diagnostic assessment via mini-games each targeting an individual strategic competency, and extensible coverage via a self-evolving game generation framework. For VLMs to operate in large-scale RTS games, we also introduce RTSGameAgent, a baseline agent pairing FSM-based group management with agentic memory. Using RTSGameAgent, we evaluate multiple SoTA VLMs, revealing that performance degrades sharply as matchups demand tighter coordination, that multi-agent coordination remains the weakest competency even for top models, and that performance declines consistently as task scale increases.
## References
Supplementary Material for RTSGameBench: An RTS Benchmark for Strategic Reasoning by Vision\-Language Models
San Kim\* [https://orcid.org/0009-0001-7932-3093](https://orcid.org/0009-0001-7932-3093), Daechul Ahn\* [https://orcid.org/0000-0002-8689-3107](https://orcid.org/0000-0002-8689-3107), Reokyoung Kim [https://orcid.org/0009-0004-7241-0401](https://orcid.org/0009-0004-7241-0401), Hyeonbeom Choi [https://orcid.org/0009-0003-2532-6453](https://orcid.org/0009-0003-2532-6453), Seungyeon Jwa [https://orcid.org/0009-0001-0552-7325](https://orcid.org/0009-0001-0552-7325), Jonghyun Choi [https://orcid.org/0000-0002-7934-8434](https://orcid.org/0000-0002-7934-8434)
Here, we provide additional details on the benchmark design and agent framework, along with extended experimental results. A blue marker in each section title indicates the corresponding location in the main paper.
## Appendix 0.A Overview of Beyond All Reason
Beyond All Reason (BAR) [bar2024beyondallreason] is an open-source real-time strategy (RTS) game built on the Recoil engine [recoil], a fork of SpringRTS. It features up to 100 players, 554 unique units and buildings, and a per-player unit cap of 2,000, far exceeding the scale of StarCraft II (SC2) (Table 1 of the main paper). Players choose from three factions (Armada, Cortex, and Legion), each with a distinct playstyle: Armada favors flexible mobile warfare with cloaking and precision weapons, Cortex emphasizes defensive resilience with heavily armored units, and Legion pursues an aggressive, resource-intensive offensive doctrine.
### 0.A.1 Key Gameplay Elements
- Resource management: BAR employs two resources, metal and energy, that must be continuously balanced. Metal is extracted from map deposits via extractors, or reclaimed from wreckage. Energy is generated through solar collectors, wind turbines, geothermal powerplants, or nuclear reactors. Unlike many RTS games, BAR features a streaming economy: resources are not finite stockpiles but continuous flows, meaning build times scale with available resource income rather than fixed costs. This requires trade-offs between economic expansion and immediate military investment.
- Base construction and build dependencies: Structures in BAR have explicit build dependencies: advanced units and buildings become accessible only after prerequisite structures are constructed. This imposes a sequential decision-making structure on production planning: players must determine their build order while managing multiple production facilities, balancing short-term defensive needs against long-term technological progression.
- Low-level automation: BAR provides automated convenience features that reduce routine micro-management. Units under the Fight command automatically engage enemies within range, and economy management tools, such as automatic energy converter shutdown and area-based construction queuing, delegate repetitive economic tasks to the engine. This allows agents to focus on higher-level strategic decisions: group formation, positional maneuvering, and coordinated multi-front engagement.
- Army group coordination and strategic maneuvering: Freed from routine micro-management, players must form and manage coherent unit groups, deciding when and where to commit forces, how to allocate groups across fronts, and how to coordinate maneuvers under fog-of-war. These high-level coordination demands, rather than low-level execution speed, are precisely the capabilities that RTSGameBench is designed to evaluate.
## Appendix 0.B Detailed Comparison: SC2 *vs*. BAR (Table 1, L41)
### 0.B.1 Details of the Quantitative Comparison
#### 0.B.1.1 Unit variety.
Both SC2 and BAR feature three playable factions. SC2 fields 96 unique units and buildings across Terran, Protoss, and Zerg. BAR reaches 554 unique units and buildings [bar_repo] owing to its multi-tier unit progression system, where each faction fields distinct units at Tier 1, Tier 2, and Tier 3, and its support for diverse movement domains including land, air, sea (ships), and amphibious (hovercraft) unit classes. This breadth substantially enlarges the compositional strategy space relative to SC2.
#### 0.B.1.2 Supply cap.
SC2 enforces a weighted population system in which different units consume different amounts of supply [sc2_wiki]: a single Battlecruiser, for instance, consumes six supply while a Marine consumes one. Consequently, at the standard cap of 200 supply, the actual unit count is considerably lower than 200: a typical late-game army with approximately 60 workers leaves room for only 110–130 combat units. We therefore annotate the SC2 supply cap with an asterisk in Table 1 in the main paper to reflect this discrepancy. BAR, by contrast, applies a flat unit count of one per unit regardless of tier or combat power, permitting a default per-player cap of 2,000 [bar_repo]. This value is configurable in the official repository but represents the standard default used in our experiments.
#### 0.B.1.3 Unit capacity.
Following from the above, the effective maximum number of simultaneously fielded units across all players in SC2 is substantially below the nominal figure of 1,600 (8 players $\times$ 200 supply). Accounting for workers and the weighted population system, a realistic upper bound is approximately 880–1,040 combat units across all players. In BAR, the engine explicitly supports up to 32,000 simultaneous units across all players [bar_repo], enabling battlefield engagements of a scale that is structurally impossible in SC2.
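The bounds quoted above follow from simple arithmetic. The short check below reproduces them using only the figures stated in the text; the variable names are ours and purely illustrative.

```python
# Figures taken from the text; variable names are illustrative.
SC2_PLAYERS = 8
SC2_SUPPLY_CAP = 200
SC2_COMBAT_UNITS_PER_PLAYER = (110, 130)  # typical late-game range with ~60 workers

BAR_UNIT_CAP_PER_PLAYER = 2_000
BAR_ENGINE_UNIT_LIMIT = 32_000

# Nominal SC2 figure: every supply point counted as one unit.
sc2_nominal = SC2_PLAYERS * SC2_SUPPLY_CAP
# Realistic SC2 range: combat units per player times the number of players.
sc2_realistic = tuple(SC2_PLAYERS * n for n in SC2_COMBAT_UNITS_PER_PLAYER)

print(sc2_nominal)    # 1600
print(sc2_realistic)  # (880, 1040)
# Ratio of BAR's engine-wide limit to the upper end of the realistic SC2 range.
print(round(BAR_ENGINE_UNIT_LIMIT / sc2_realistic[1], 1))  # 30.8
```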
#### 0.B.1.4 Map size.
We compare the largest available map in each game. SC2’s map editor imposes a hard limit of $256 \times 256$ tiles for playable map dimensions [sc2_wiki]. In BAR, map dimensions are specified in elmo units, where 1 elmo $= 512/8 = 64$ in-game distance units. The largest official BAR map measures $32 \times 32$ elmo, corresponding to $2{,}048 \times 2{,}048$ in-game units, approximately $64\times$ larger in area than the largest SC2 map [bar2024beyondallreason].
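Using the conversion stated above, the area ratio can be checked as follows. This is a worked check of the quoted figures, and it assumes, as the comparison implicitly does, that one SC2 tile is comparable to one BAR in-game distance unit.

```latex
\[
32 \times 64 = 2{,}048 \ \text{in-game units per side},
\qquad
\frac{2{,}048 \times 2{,}048}{256 \times 256}
  = \left(\frac{2{,}048}{256}\right)^{2}
  = 8^{2}
  = 64 .
\]
```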
#### 0.B.1.5 Player limit.
The standard multiplayer configuration of SC2 supports up to 8 players [sc2_wiki]. BAR officially supports up to 100 players in a single match [bar2024beyondallreason], though competitive play typically uses 8v8 formats. Table 1 in the main paper reports these maximum supported values for a direct structural comparison.
Figure 8: Large-scale combat environments in BAR. Snapshots of large-scale battles in Beyond All Reason. Top: a four-team battle where each team deploys 20 commanders (80 commanders in total). Bottom: a 50 vs 50 team battle where each side controls 50 commanders, illustrating the extreme scale of combat where thousands of units can engage simultaneously on the battlefield.
### 0.B.2 Scalability of BAR
Beyond the structural statistics reported in Table 1 in the main paper, BAR’s scalability is further illustrated by the scale of engagements it supports in practice. Figure [8](https://arxiv.org/html/2606.18950#Pt0.A2.F8) shows representative screenshots from BAR matches at varying player and unit scales, demonstrating the qualitatively different battlefield complexity that emerges relative to SC2-based benchmarks.
## Appendix 0.C Further Comparison with Existing Game Benchmarks for VLMs (L111)
### 0.C.1 Game-Based Benchmarks for Language- and Vision-Language Model Agents
A detailed comparison of game-based benchmarks for LLM/VLM agents is provided in [Tab. 6](https://arxiv.org/html/2606.18950#Pt0.A3.T6). As shown in the table, RTSGameBench is the only benchmark that supports multi-modal inputs, multi-agent interaction, fine-grained evaluation, and unlimited scenarios in an imperfect-information environment.
| Benchmark | Game | Imperfect Info | Multi-Agent | Multi-modal Input | Fine-Grained Evaluation | Infinite Scenarios |
|---|---|---|---|---|---|---|
| EscapeCraft [wang2025escapecraft] | Room Escape Game | ✘ | ✘ | ✔ | ✘ | ✔ |
| AvalonBench [light2023avalonbench] | Avalon | ✔ | ✔ | ✘ | ✘ | ✘ |
| Sketchtopia [khan2025sketchtopia] | Sketchtopia | ✔ | ✔ | ✔ | ✘ | ✘ |
| CivRealm [qi2024civrealm] | FreeCiv | ✔ | ✔ | ✘ | ✔ | ✘ |
| TeamCraft [long2024teamcraft] | MineCraft | ✔ | ✔ | ✔ | ✘ | ✔ |
| MCU [zheng2025mcu] | MineCraft | ✔ | ✘ | ✔ | ✘ | ✔ |
| RTSGameBench (Ours) | Beyond All Reason | ✔ | ✔ | ✔ | ✔ | ✔ |

Table 6: Comparison of benchmarks that evaluate LLM/VLM agents on a single game environment. We compare prior benchmarks and RTSGameBench across key properties including multi-modal input, multi-agent interaction, fine-grained evaluation, and support for infinite scenarios. EscapeCraft evaluates agents in a hand-crafted room escape game.

| Benchmark | Game | Multi-modal Input | Full-Game Context | Fine-Grained Evaluation | Personalized Scenarios | Action Space Categories |
|---|---|---|---|---|---|---|
| HIVE [anne2025harnessing] | Hand-Crafted Game | ✘ | ✘ | ✔ | ✘ | Move |
| TowerMind [wang2026towermind] | Hand-Crafted Game | ✔ | ✔ | ✘ | ✘ | Build + Prod. + Move |
| TextStarCraftII [ma2024large] | StarCraft II | ✘ | ✔ | ✘ | ✘ | Build + Prod. |
| SC2Arena [shen2025sc2arena] | StarCraft II | ✘ | ✔ | ✘ | ✘ | Build + Prod. + Move |
| AVACraft [ma2025ava] | StarCraft II | ✔ | ✘ | ✔ | ✘ | Move |
| LLM-PySC2 [li2025llmpysc2] | StarCraft II | ✔ | ✔ | ✘ | ✘ | Build + Prod. + Move |
| RTSGameBench (Ours) | Beyond All Reason | ✔ | ✔ | ✔ | ✔ | Build + Prod. + Move |

Table 7: Comparison of RTS-specific game benchmarks. We compare prior benchmarks and RTSGameBench across key properties including multi-modal input, full-game evaluation, fine-grained contextual observations, supported action space types (Build/Produce/Move), and support for infinite scenario generation.

Figure 9: Overview of the base agent interface. At each decision step $t$, the agent receives a multimodal observation $o_t$ consisting of visual channels $v_t$ (a global minimap and local camera views) and a structured textual observation $\mathcal{W}(s_t)$ extracted by a Python wrapper. Combined with static game knowledge $\mathcal{K}$, the VLM policy $\pi$ generates an action plan over three action types: building construction, unit production, and unit movement. The selected action $a_{t+1}$ is executed by the game engine, which returns the next state $s_{t+1}$ to continue the observe–decide–act loop.
### 0.C.2 RTS-Specific Game Benchmarks
A detailed comparison of RTS-specific game benchmarks is provided in [Tab. 7](https://arxiv.org/html/2606.18950#Pt0.A3.T7). Prior work typically evaluates agents either in simplified hand-crafted RTS environments or within StarCraft II. While these benchmarks provide useful testbeds for studying RTS decision-making, Beyond All Reason offers a substantially larger strategic space, with large-scale unit interactions and long-horizon gameplay dynamics, making it particularly suitable for evaluating strategic reasoning in modern AI agents.
In contrast, RTSGameBench simultaneously provides (i) multi-modal observations enabling evaluation of VLM-based agents, (ii) full-game context alongside diagnostic mini-games for fine-grained evaluation of strategic capabilities, (iii) an extensible game generation framework that enables effectively infinite scenario variations, and (iv) a complete RTS action space including building construction, unit production, and unit movement. To the best of our knowledge, RTSGameBench is the only benchmark that integrates all of these properties within a unified RTS evaluation framework.
## Appendix 0.D Game Interface and Static Game Knowledge $\mathcal{K}$ (L144)
[Figure 9](https://arxiv.org/html/2606.18950#Pt0.A3.F9) illustrates the base agent interface shared across all evaluation settings in RTSGameBench. This section provides full specifications of the observe–decide–act loop, the observation and action spaces, and the contents of the static game knowledge $\mathcal{K}$.
### 0.D.1 Observe–Decide–Act Loop
Every scenario in RTSGameBench follows a common loop. Before each game, the agent receives static game knowledge $\mathcal{K}$ (detailed in [Section 0.D.4](https://arxiv.org/html/2606.18950#Pt0.A4.SS4)), comprising the scenario description, available units and buildings, and team configuration. At each decision step $t$, the environment constructs a multimodal observation

$$o_t = \bigl(v_t,\; \mathcal{W}(s_t)\bigr), \tag{4}$$

where $v_t$ denotes the visual channels and $\mathcal{W}(s_t)$ the structured textual observation extracted by a Python wrapper $\mathcal{W}$ from the engine state $s_t$. The agent’s policy $\pi$, instantiated by a VLM, then selects an action conditioned on both the current observation and the static knowledge:

$$a_t = \pi(o_t \mid \mathcal{K}), \quad s_{t+1} \leftarrow \mathrm{Env}(s_t,\, a_t). \tag{5}$$

The loop repeats at a fixed interval, with the environment pausing between steps so that evaluation targets strategic decision quality rather than reaction speed.
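The loop in Eqs. (4)–(5) can be sketched in a few lines of Python. Everything here is a hypothetical stand-in: `run_episode`, the `env`/`wrapper`/`policy` interfaces, and the toy `CounterEnv` are assumptions for illustration and do not reflect the benchmark's actual API.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Observation:
    visual: List[Any]  # v_t: global minimap and local camera views
    text: str          # W(s_t): structured textual observation


def run_episode(env, wrapper: Callable[[Any], str], policy, knowledge: str,
                max_steps: int) -> List[Any]:
    """Observe-decide-act loop following Eqs. (4)-(5)."""
    state = env.reset()
    actions = []
    for _ in range(max_steps):
        obs = Observation(visual=env.render(state), text=wrapper(state))  # o_t
        action = policy(obs, knowledge)                                   # a_t = pi(o_t | K)
        state, done = env.step(state, action)                             # s_{t+1}
        actions.append(action)
        if done:
            break
    return actions


class CounterEnv:
    """Toy stand-in for the game engine: the state is just a step counter."""

    def reset(self) -> int:
        return 0

    def render(self, state: int) -> List[str]:
        return [f"minimap@{state}"]

    def step(self, state: int, action: Any) -> Tuple[int, bool]:
        next_state = state + 1
        return next_state, next_state >= 3  # episode ends after three steps


actions = run_episode(
    CounterEnv(),
    wrapper=lambda s: f"clock={s}",
    policy=lambda obs, k: f"hold ({obs.text})",
    knowledge="scenario description, unit encyclopedia, team configuration",
    max_steps=10,
)
print(actions)  # ['hold (clock=0)', 'hold (clock=1)', 'hold (clock=2)']
```

Passing the static knowledge as a separate argument keeps it fixed across steps, matching the conditioning on $\mathcal{K}$ in Eq. (5).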
### 0.D.2 Observation Space
##### Visual observation $v_t$.
The visual channels comprise: (i) a *global minimap* providing a bird’s-eye overview of the entire battlefield, and (ii) *local camera views* that can be positioned at arbitrary coordinates to inspect specific regions at higher resolution. When fog-of-war is enabled, both channels are restricted to allied line-of-sight, introducing partial observability into the environment.
##### Textual observation $\mathcal{W}(s_t)$.
The Python wrapper $\mathcal{W}$ converts the raw engine state $s_t$ into a structured textual representation. This includes information such as the current resource levels (metal and energy), the status of allied and visible enemy units (type, position, health), ongoing construction or production queues, and the game clock.
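To make the idea of a structured textual observation concrete, here is a small sketch of what such a wrapper might produce. The state dictionary, its keys, and the output format are all assumptions for illustration; the actual wrapper's schema is not specified here, and production queues are omitted for brevity.

```python
def format_observation(state: dict) -> str:
    """Render a (hypothetical) engine-state dict as structured text, W(s_t)."""
    lines = [
        f"time: {state['clock_min']:.1f} min",
        f"resources: metal={state['metal']}, energy={state['energy']}",
    ]
    # List allied units first, then only those enemy units currently visible.
    for side in ("allied", "enemy_visible"):
        for unit in state[side]:
            lines.append(f"{side}: {unit['type']} at {unit['pos']} hp={unit['hp']}")
    return "\n".join(lines)


example_state = {
    "clock_min": 2.0,
    "metal": 120,
    "energy": 900,
    "allied": [{"type": "Pawn", "pos": (10, 20), "hp": 100}],
    "enemy_visible": [],
}
print(format_observation(example_state))
```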
### 0.D.3 Action Space
The action space comprises three categories:
- Building construction: the agent selects a building type and a target location on the map.
- Unit production: the agent selects a factory and a unit type to produce.
- Unit movement: the agent selects units and assigns a destination.
Spatial coordinates are specified on a normalized $(0,0)$–$(100,100)$ grid, while the game engine handles execution such as pathfinding and collision avoidance.
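A minimal sketch of how a movement action on the normalized grid might be represented and mapped to engine coordinates. The `MoveAction` type, the linear mapping with clamping, and the map dimensions are assumptions for illustration; the conversion actually used by the benchmark is not specified in the text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MoveAction:
    unit_ids: List[int]
    target: Tuple[float, float]  # normalized (x, y), each in [0, 100]


def to_world(target: Tuple[float, float], map_width: float,
             map_height: float) -> Tuple[float, float]:
    """Linearly map a normalized coordinate to world units, clamping to the grid."""
    x, y = (min(max(c, 0.0), 100.0) for c in target)
    return (x / 100.0 * map_width, y / 100.0 * map_height)


action = MoveAction(unit_ids=[3, 7], target=(50.0, 25.0))
print(to_world(action.target, map_width=2048, map_height=2048))  # (1024.0, 512.0)
```

A normalized grid lets the same prompt format work across maps of different sizes, since the model never has to reason about raw engine units.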
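Table 8: Summary of the unit and building encyclopedia in $\mathcal{K}$, showing the number of entities per category and tech tier.

| Category | T1 | T2 | T3 | Total |
|---|---|---|---|---|
| Bot | 9 | 17 | 4 | 30 |
| Vehicle | 10 | 12 | 1 | 23 |
| Air | 13 | 10 | 0 | 23 |
| Sea | 7 | 12 | 0 | 19 |
| Hover | 5 | 0 | 1 | 6 |
| Factory | 7 | 5 | 1 | 13 |
| Defense | 17 | 12 | 0 | 29 |
| Building | 18 | 16 | 0 | 34 |
| Total | 86 | 84 | 7 | 177 |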
### 0.D.4 Static Game Knowledge $\mathcal{K}$
The static game knowledge $\mathcal{K}$ is provided to the agent once before each game begins and remains fixed throughout the episode. It consists of three components: (i) a scenario description specifying the objective and win conditions, (ii) the team configuration defining allied and enemy factions, and (iii) a comprehensive unit and building encyclopedia described below. Since all experiments in this work are conducted with the Armada faction, the encyclopedia provided to the agent covers only Armada-side entities; the opposing Cortex faction, while present as the enemy in game scenarios, is excluded from $\mathcal{K}$.
#### 0.D.4.1 Unit and building encyclopedia.
The encyclopedia catalogs all available entities for the Armada faction, organized into eight categories: Bot, Vehicle, Air, Sea, Hover, Factory, Defense, and Building (*i.e*., economy and utility structures). Each entry contains a unit name, an internal engine code, a tech tier (T1, T2, or T3), and a natural-language description of the unit’s role and capabilities. For constructor units and factory buildings, the entry additionally includes a list of *build options* enumerating all structures or units that entity can produce.
##### Tech tiers\.
Units and buildings are stratified into three technology tiers that reflect the game’s progression system:
- T1 (Tech 1): Basic units and structures available from the start. These are inexpensive and quick to produce, forming the backbone of early-game armies and economies (e.g., Pawn infantry bot, Solar Collector, Bot Lab).
- T2 (Tech 2): Advanced units and structures unlocked through higher-tier factories or constructors. They are more powerful and specialized but costlier to produce (e.g., Sharpshooter sniper bot, Fusion Reactor, Advanced Bot Lab).
- T3 (Tech 3): Experimental units produced exclusively by the Experimental Gantry. These are the most powerful units in the game, capable of turning the tide of battle at extreme cost (e.g., Titan, Thor, Razorback).
##### Category overview\.
[Table 8](https://arxiv.org/html/2606.18950#Pt0.A4.T8) summarizes the number of entities per category and tech tier. The full encyclopedia, including all build options and detailed descriptions, is provided to the agent as part of $\mathcal{K}$.
Figure 10: Initial configurations of the diagnostic mini-games. The figure illustrates the map layouts and initial unit placements for the five diagnostic mini-games used in our evaluation. Blue markers indicate the controlled agent and its teammates, while red markers denote enemy units or enemy base locations. In scenarios where the exact enemy position is unknown, the red box indicates the region where the enemy may appear.
##### Factories and production chains.
A key aspect of $\mathcal{K}$ is the production dependency structure. Each factory specifies the set of units it can produce; for example, the Bot Lab (T1) produces basic bot units such as Pawn, Tick, and Mace, while the Advanced Bot Lab (T2) unlocks more powerful units such as Gunslinger, Sharpshooter, and Fatboy. The highest-tier factory, the Experimental Gantry (T3), produces six experimental units: Vanguard, Titan, Marauder, and Razorback (bots), Thor (vehicle), and Lunkhead (hover). Similarly, constructor units (e.g., Construction Bot, Advanced Construction Vehicle) define which buildings and structures they can erect, establishing the tech tree that the agent must navigate during a game.
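The production dependency structure can be viewed as a lookup from producers to build options. The sketch below encodes only the examples named above (the Bot Lab and Advanced Bot Lab lists are partial); the dictionary layout and the helper `producers_of` are illustrative assumptions, and the real encyclopedia also carries engine codes and descriptions.

```python
# Build options named in the text. Only the Experimental Gantry list is complete;
# the two bot labs list the examples given, not their full build menus.
BUILD_OPTIONS = {
    "Bot Lab": {"tier": "T1", "produces": ["Pawn", "Tick", "Mace"]},
    "Advanced Bot Lab": {"tier": "T2", "produces": ["Gunslinger", "Sharpshooter", "Fatboy"]},
    "Experimental Gantry": {
        "tier": "T3",
        "produces": ["Vanguard", "Titan", "Marauder", "Razorback", "Thor", "Lunkhead"],
    },
}


def producers_of(unit: str) -> list:
    """Return the factories whose build options include `unit`."""
    return [name for name, entry in BUILD_OPTIONS.items() if unit in entry["produces"]]


print(producers_of("Sharpshooter"))  # ['Advanced Bot Lab']
print(len(BUILD_OPTIONS["Experimental Gantry"]["produces"]))  # 6
```

An agent planning toward a T3 unit would walk this mapping backwards, from the unit to its factory and then to the constructor able to build that factory.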
## Appendix 0.E Details of Diagnostic Mini-Games (L191)
We design five diagnostic mini-games, each targeting a specific strategic competency. The map layouts and initial game configurations for these mini-games are illustrated in [Figure 10](https://arxiv.org/html/2606.18950#Pt0.A4.F10).
### 0.E.1 Time-Constrained Production (TCP): Resource management.
#### 0.E.1.1 Details about scenario and design rationales.
TCP is designed to isolate resource management by placing the agent in a scenario where economic decision-making is continuously stressed by competing demands. The agent must produce a specified unit composition within a fixed deadline while defending against periodic enemy raids, forcing it to balance long-term production investment against immediate defensive expenditure. Build dependencies impose a sequential structure on production decisions (certain units are accessible only after prerequisite structures are constructed), such that early misallocation propagates into compounding delays, systematically penalizing myopic or reactive strategies. Performance is measured by task completion rate and the time taken to achieve the target unit composition. Fog-of-war is enabled in TCP. Since enemy raids arrive from outside the agent’s initial line-of-sight, disabling fog-of-war would reduce the scenario to a deterministic scheduling problem; retaining it introduces uncertainty over enemy approach directions and raid timing, requiring the agent to maintain defensive readiness without perfect information.
#### 0.E.1.2 Game Description.
At the start of each episode, the agent receives a natural\-language game description that specifies the production objective, the time limit, and key information about the adversarial context\. The following is a representative example used in TCP:
> Production race scenario\. Produce 1 Welder as fast as possible\. You will lose if you exceed 30 minutes\. The enemy also has 1 Commander who can build factories and amass forces\. They may attack at any time, so be prepared to defend while building your economy\.
This description intentionally withholds details such as raid timing, enemy composition, and approach direction, requiring the agent to infer defensive requirements under the uncertainty imposed by fog\-of\-war\.
### 0\.E\.2Multi\-Front Defense \(MFD\) — Spatial & temporal reasoning\.
#### 0\.E\.2\.1Details about scenario and design rationales\.
MFD is designed to isolate the agent’s ability to reason about space and time by eliminating all confounding strategic variables\. Forces are fixed with no production or resource gathering, ensuring the only lever available is the spatial deployment and temporal redeployment of units across multiple defensive fronts\. Attacks arrive from different directions at staggered timings, requiring the agent to anticipate which front will be threatened next and reposition forces accordingly—without over\-committing to any single direction and leaving other objectives undefended\. The loss condition is strict: the destruction of even a single objective results in immediate defeat, ensuring that the agent cannot deprioritize any front and must maintain awareness of the entire battlefield\. Performance is measured by win rate and damage efficiency\. Fog\-of\-war isdisabledin MFD\. The competency being evaluated is spatial positioning and temporal sequencing given known attack schedules; introducing fog\-of\-war would shift the challenge to information gathering, conflating distinct competencies and obscuring the targeted diagnosis\.
#### 0\.E\.2\.2 Game Description\.
At the start of each episode, the agent receives a natural\-language game description that specifies the defensive objectives, failure conditions, and the nature of incoming threats\. The following is a representative example used in MFD:
> Survive for 10 minutes without losing any of your 3 Metal Storages\. If even one is destroyed, you lose\. Enemies attack from multiple directions simultaneously\.
The description communicates the strict loss condition and multi\-directional threat structure but omits specific attack timings and approach routes, requiring the agent to reason about spatial deployment and temporal redeployment based on observed battlefield dynamics\.
### 0\.E\.3 Fixed\-Field Skirmish: Free\-for\-All \(FS\-F\) — Opponent modeling\.
#### 0\.E\.3\.1 Details about scenario and design rationales\.
FS\-F is designed to isolate opponent modeling by placing the agent in a multi\-agent competitive environment where reading adversarial intentions is the decisive factor\. Three or more agents enter with symmetric fixed forces and no production or resource gathering, so the outcome cannot be attributed to economic advantage or unit composition differences\. The agent must predict each opponent’s target selection to determine its own engagement priority\. Implicit coalitions and betrayals emerge naturally, and the timing of switching allegiances is as critical as the decision itself: premature aggression toward the wrong opponent invites third\-party exploitation, while correctly anticipating target selections enables advantageous engagement sequencing\. The last\-survivor win condition amplifies the importance of threat assessment, as misdirected aggression is immediately and irreversibly punished\. Performance is measured by survival rank and damage efficiency\. Fog\-of\-war is disabled in FS\-F\. Opponent modeling requires observing other agents’ behavior to infer their intentions; enabling fog\-of\-war would deprive the agent of this observational signal, transforming the task into a blind engagement problem and rendering the targeted competency untestable\.
#### 0\.E\.3\.2 Game Description\.
At the start of each episode, the agent receives a natural\-language game description specifying the competitive structure, available forces, and evaluation criteria\. The following is a representative example used in FS\-F:
> Four teams compete in a free\-for\-all battle\. Each team has a mix of infantry, vehicle, and air units\. Attack enemies effectively for 5 minutes\. After the game ends, teams are ranked by combat efficiency\.
The description reveals the symmetric force composition and scoring criterion but leaves opponent behavior entirely unspecified, requiring the agent to infer adversarial intentions solely through real\-time observation of other agents’ actions\.
### 0\.E\.4 Fixed\-Field Skirmish: Team \(FS\-T\) — Collaboration\.
#### 0\.E\.4\.1 Details about scenario and design rationales\.
FS\-T shares the same game skeleton as FS\-F \(symmetric fixed forces, no production, no resource gathering, and a bounded time horizon\) but replaces free\-for\-all competition with team play involving two or more teams\. This single structural change isolates collaboration as the differentiating competency: the agent must infer allied intentions from observed movements and coordinate its actions accordingly, without any explicit communication channel\. Effective collaboration requires the agent to recognize emerging coordination patterns and respond complementarily: selecting shared focus\-fire targets, dividing fronts to avoid redundant engagement, and exploiting weaknesses in the opposing team’s formation\. Performance is measured by team win rate and damage efficiency\. Fog\-of\-war is disabled in FS\-T\. Inferring allied intentions requires observing allied unit movements and target selections in real time; enabling fog\-of\-war would make this impossible, conflating collaboration with the information\-gathering challenge of Decision Making under Uncertainty\.
#### 0\.E\.4\.2 Example of Game Description\.
At the start of each episode, the agent receives a natural\-language game description that specifies the team structure, time horizon, and evaluation criteria\. The following is a representative example used in FS\-T:
> Two teams compete in a 2v2 team battle\. Attack enemies effectively for 5 minutes\. After the game ends, teams are ranked by combat efficiency\.
Notably, the description provides no information about allied strategy or coordination protocol, requiring the agent to infer its teammate’s intentions entirely from observed behavior and adapt its own actions complementarily in real time\.
### 0\.E\.5 Siege Planning \(SP\) — Adversarial planning\.
#### 0\.E\.5\.1 Details about scenario and design rationales\.
SP is designed to isolate adversarial planning by presenting the agent with a static enemy fortification that must be breached within a strict timeline\. Unlike the skirmish\-based mini\-games, SP requires the agent to analyze a fixed defensive composition and derive a structured, multi\-phase attack plan specifying attack order, entry routes, and force allocation across successive assault phases\. Since the enemy defense does not actively adapt, the challenge lies entirely in the quality of the agent’s plan rather than in reactive decision\-making\. For instance, a region densely covered by anti\-air defenses may require the agent to first commit ground units to neutralize those defenses before committing air units; failure to reason about such sequential dependencies results in disproportionate losses that cannot be recovered within the time limit\. Resource gathering and unit production are active throughout, introducing a build\-order dimension that requires the agent to anticipate the forces needed at each assault phase and invest accordingly\. Performance is measured by whether the fortification is successfully destroyed and the time taken to do so\. Fog\-of\-war is enabled in SP\. If the full defensive layout were immediately visible, the task would reduce to a static optimization problem solvable by a single upfront analysis\. Retaining fog\-of\-war requires the agent to progressively reveal the defensive composition through reconnaissance and forward pressure, and to revise its attack plan as new information becomes available, more faithfully reflecting adversarial planning under partial observability\.
#### 0\.E\.5\.2 Game Description\.
At the start of each episode, the agent receives a natural\-language game description that specifies the assault objective, time constraint, and the nature of the enemy position\. The following is a representative example used in SP:
> Destroy the enemy’s Metal Storage within 20 minutes\. The enemy has established a fortified base to defend it, and you must break through their defenses to succeed\.
The description identifies the target structure and deadline but reveals nothing about the defensive layout, force composition, or terrain configuration, requiring the agent to progressively uncover these details through reconnaissance and revise its attack plan as new information emerges\.
## Appendix 0\.F Inference Procedure of RTSGameAgent \(L292\)
### 0\.F\.1 Action Space of RTSGameAgent
The game interface \(Sec\. 3 in the main paper\) defines three action types \(building construction, unit production, and unit movement\), all at the per\-unit level\. RTSGameAgent retains the first two types unchanged but replaces per\-unit movement with two group\-level actions, yielding an extended action space:
$$\mathcal{A}=\mathcal{A}_{\text{build}}\cup\mathcal{A}_{\text{produce}}\cup\mathcal{A}_{\text{assign}}\cup\mathcal{A}_{\text{move}} \tag{6}$$

We define each component below\.
#### 0\.F\.1\.1 Building construction\.
$$\mathcal{A}_{\text{build}}=\mathcal{B}\times\mathcal{X}\times\mathcal{Y} \tag{7}$$

where $\mathcal{B}$ is the finite set of available building types and $\mathcal{X}\times\mathcal{Y}$ is the discrete coordinate grid $(0,0)$–$(100,100)$\. Each action specifies a building type to construct and a target placement location, issued per worker unit\. This action type is inherited directly from the base interface without modification\.
#### 0\.F\.1\.2 Unit production\.
$$\mathcal{A}_{\text{produce}}=\mathcal{F}\times\mathcal{U} \tag{8}$$

where $\mathcal{F}$ is the finite set of active factories and $\mathcal{U}$ is the finite set of producible unit types\. Each action specifies which factory should produce which unit type, issued per factory\. This action type is also inherited directly from the base interface without modification\.
#### 0\.F\.1\.3 Group assignment \(extended\)\.
$$\mathcal{A}_{\text{assign}}=\mathcal{G}\times 2^{\mathcal{U}_{\text{active}}} \tag{9}$$

where $\mathcal{G}$ is the set of named squads \(*e\.g*\., `assault`, `defense`\) and $2^{\mathcal{U}_{\text{active}}}$ denotes the power set of currently active units\. Each action creates or updates a named squad and allocates a specified subset of active units to it\. This replaces the base interface’s per\-unit movement with a group\-level abstraction: rather than issuing movement commands to individual units, the VLM first organizes units into semantically meaningful squads that persist across decision steps\. The set of squads $\mathcal{G}$ is dynamically maintained: new squads can be created and existing squads can be updated or dissolved at any decision step\. Current group statuses, including squad composition and unit counts, are included in the structured textual observation $\mathcal{W}(s_t)$ at every step, so the VLM is aware of the current grouping configuration when making assignment decisions\.
#### 0\.F\.1\.4 Group movement \(extended\)\.
$$\mathcal{A}_{\text{move}}=\mathcal{G}\times\mathcal{C}\times\mathcal{X}\times\mathcal{Y} \tag{10}$$

where $\mathcal{C}=\{\texttt{move},\texttt{move\_force},\texttt{stop}\}$ is the set of FSM commands and $\mathcal{X}\times\mathcal{Y}$ is the same coordinate grid as above\. Each action issues an FSM command with a target coordinate per squad, replacing the base interface’s per\-unit movement commands\. The three commands induce distinct FSM behaviors: `move` advances the squad toward the target coordinate while remaining responsive to enemy contact \(automatically transitioning to `fight` upon engagement and reverting afterward\); `move_force` advances the squad toward the target regardless of enemy presence, bypassing the automatic `fight` trigger; and `stop` halts the squad in place\. The `fight` state itself is not directly issuable by the VLM: it is triggered exclusively by the game engine upon enemy contact and is therefore not an element of $\mathcal{C}$\. Group states persist across decision steps, so the VLM need only re\-issue a movement command when a strategic change is warranted, substantially reducing the per\-step action burden relative to the base interface\.
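The four components of the action space can be sketched as plain Python data classes. This is only an illustration under stated assumptions: the class, field, and helper names (`BuildAction`, `AssignAction`, `in_grid`, and so on) are hypothetical and are not taken from the released code; only the command set and the (0,0)–(100,100) grid come from the definitions above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Union

GRID_MIN, GRID_MAX = 0, 100  # coordinate grid (0,0)-(100,100) from the text


class Command(Enum):
    """FSM commands the VLM may issue; `fight` is engine-triggered only."""
    MOVE = "move"
    MOVE_FORCE = "move_force"
    STOP = "stop"


@dataclass(frozen=True)
class BuildAction:       # element of B x X x Y (Eq. 7)
    building_type: str
    x: int
    y: int


@dataclass(frozen=True)
class ProduceAction:     # element of F x U (Eq. 8)
    factory_id: int      # hypothetical identifier type
    unit_type: str


@dataclass(frozen=True)
class AssignAction:      # element of G x 2^{U_active} (Eq. 9)
    squad: str
    unit_ids: FrozenSet[int]


@dataclass(frozen=True)
class MoveAction:        # element of G x C x X x Y (Eq. 10)
    squad: str
    command: Command
    x: int
    y: int


Action = Union[BuildAction, ProduceAction, AssignAction, MoveAction]


def in_grid(x: int, y: int) -> bool:
    """Check that a target lies on the discrete coordinate grid."""
    return GRID_MIN <= x <= GRID_MAX and GRID_MIN <= y <= GRID_MAX


# A two-step plan: form a squad, then move it as one unit.
plan = [
    AssignAction("assault", frozenset({1, 2, 3})),
    MoveAction("assault", Command.MOVE, 60, 89),
]
```

Keeping `fight` out of the `Command` enum mirrors the statement that it is not an element of the command set and cannot be issued by the VLM.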
#### 0\.F\.1\.5 Complexity reduction\.
The group\-level abstraction introduced by $\mathcal{A}_{\text{assign}}$ and $\mathcal{A}_{\text{move}}$ provides a significant reduction in effective action complexity relative to the per\-unit base interface\. In the base interface, coordinating $N$ units requires issuing up to $N$ individual movement commands per step, each with its own target coordinate\. Under RTSGameAgent, the same $N$ units are organized into $|\mathcal{G}|$ squads, reducing the number of movement decisions from $O(N)$ to $O(|\mathcal{G}|)$ per step, where $|\mathcal{G}|\ll N$ in large\-scale BAR matches\. This abstraction does not sacrifice strategic expressiveness: the VLM retains full control over group composition, movement targets, and engagement modes, while delegating low\-level tactical execution to the FSM and the game engine\.
### 0\.F\.2 Algorithm Details
Algorithm [1](https://arxiv.org/html/2606.18950#alg1) presents the full inference loop of RTSGameAgent\. We describe each component in detail below\.
#### 0\.F\.2\.1 Initialization\.
At the start of each game, the agent initializes two memory stores \(long\-term memory $\mathcal{L}_0\leftarrow\emptyset$ and short\-term memory $\mathcal{S}_0\leftarrow\emptyset$\) and a single group pool containing all units under the `unassigned` label\. The agent also receives static game knowledge $\mathcal{K}$ prior to the first decision step, comprising the scenario description, the full roster of available units and buildings, and the team configuration\. $\mathcal{K}$ remains fixed throughout the game and is provided as context to both the memory and decision phases at every inference step\.
#### 0\.F\.2\.2 Observation\.
At each decision step $t$, the agent constructs a multimodal observation $o_t=(v_t,\mathcal{W}(s_t))$ from the current engine state $s_t$\. The visual channel $v_t$ is rendered by `RenderVisuals`, which produces a global minimap together with four local camera views: three views are positioned at the largest groups by current unit count, and one view is fixed at the home base\. This placement prioritizes the most tactically active regions of the battlefield while ensuring the agent retains awareness of its economic core\. The textual channel $\mathcal{W}(s_t)$ is extracted by a Python wrapper via `ExtractGameState`, producing a structured representation of current group statuses, building states, detected enemies, and pending actions\. In parallel, `AccumulateEvents`$(\Delta t)$ collects all event logs that occurred since the previous decision step \(including enemy sightings detected within allied line\-of\-sight and battle outcomes triggered by the game engine\) and stores them as the short\-term memory buffer $\mathcal{S}_t$\.
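The camera placement rule described above (three views on the largest groups, one on the home base) can be sketched as follows. This is a minimal sketch and not the actual `RenderVisuals` implementation: the function name `select_camera_views`, its arguments, and the tie-breaking rule are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int]


def select_camera_views(
    group_sizes: Dict[str, int],
    group_positions: Dict[str, Coord],
    home_base: Coord,
    n_group_views: int = 3,
) -> List[Coord]:
    """Return camera centers: the largest groups by unit count, then home base."""
    # Sort by descending unit count; ties are broken by squad name for
    # determinism (an assumption, the text does not specify a tie-break).
    ranked = sorted(group_sizes, key=lambda g: (-group_sizes[g], g))
    views = [group_positions[g] for g in ranked[:n_group_views]]
    views.append(home_base)  # the fourth view is fixed at the home base
    return views


views = select_camera_views(
    group_sizes={"assault": 12, "defense": 7, "scouts": 3, "unassigned": 1},
    group_positions={
        "assault": (60, 89),
        "defense": (20, 15),
        "scouts": (85, 70),
        "unassigned": (10, 10),
    },
    home_base=(10, 10),
)
```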
#### 0\.F\.2\.3 Phase 1: Memory consolidation\.
The memory phase is handled by a text\-only LLM operating on $\mathcal{S}_t$ and the previous long\-term memory $\mathcal{L}_t$:
$$m_t,\;\mathcal{L}_{t+1}=\mathrm{LLM}(\mathcal{S}_t,\;\mathcal{L}_t) \tag{11}$$

The LLM, guided by a memory\-management prompt as shown in [Figure 17](https://arxiv.org/html/2606.18950#Pt0.A12.F17), consolidates the two stores by retaining, merging, or discarding short\-term event logs into long\-term memory, and simultaneously selects a subset of relevant entries $m_t$ pertinent to the current decision context\. Upon completion, the short\-term buffer is flushed: $\mathcal{S}_{t+1}\leftarrow\emptyset$\. This two\-store design ensures that high\-frequency, transient events \(*e\.g*\., a brief skirmish\) are absorbed into structured long\-term summaries rather than accumulating as noise, while preserving strategically significant context across the full game duration\.
#### 0\.F\.2\.4 Phase 2: Strategic decision\.
The decision phase is handled by a VLM policy $\pi$ operating on the multimodal observation $o_t$, the retrieved memory entries $m_t$, and static game knowledge $\mathcal{K}$:
$$a_t=\pi(o_t,\;m_t\mid\mathcal{K}) \tag{12}$$

The VLM generates a structured action plan $a_t$ that is subsequently parsed into four action types\. Building construction \(`build`\) issues per\-worker orders, each specifying a building type and a target location $(x,y)$ on the $(0,0)$–$(100,100)$ coordinate grid\. Unit production \(`produce`\) issues per\-factory orders, each specifying the unit type to be produced\. Group assignment \(`assign`\) creates or updates named squads \(*e\.g*\., `assault`, `defense`\) and allocates specified units to each squad\. Group movement \(`move`\) issues one FSM command $c\in\{\texttt{move},\texttt{move\_force},\texttt{stop}\}$ per squad together with a target coordinate $(x,y)$\. The game engine `Env` handles all low\-level execution, and the environment advances by $\Delta t$ following action dispatch: $s_{t+1}\leftarrow\text{Env}(s_t,a_t)$\.
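The two phases can be tied together in a short loop. The sketch below is not Algorithm 1 itself: it only mirrors the order of operations described in this subsection, and every callable (`observe`, `accumulate_events`, `memory_llm`, `policy`, `env_step`) is a hypothetical stand-in supplied by the caller rather than an API of the released code. The toy lambdas exist solely so that the loop can be executed without a game engine or a model.

```python
from typing import Any, Callable, List, Tuple

State = Any
Obs = Any


def run_agent(
    state: State,
    knowledge: str,
    observe: Callable[[State], Obs],
    accumulate_events: Callable[[State], List[str]],
    memory_llm: Callable[[List[str], List[str]], Tuple[List[str], List[str]]],
    policy: Callable[[Obs, List[str], str], Any],
    env_step: Callable[[State, Any], State],
    n_steps: int,
) -> Tuple[State, List[str]]:
    """Run n_steps decision steps; return the final state and long-term memory."""
    long_term: List[str] = []                      # L_0 <- empty
    for _ in range(n_steps):
        obs = observe(state)                       # o_t = (v_t, W(s_t))
        short_term = accumulate_events(state)      # S_t: events since last step
        # Phase 1 (Eq. 11): consolidate memory and retrieve relevant entries.
        retrieved, long_term = memory_llm(short_term, long_term)
        # S_t is rebuilt on the next iteration, which plays the role of the flush.
        # Phase 2 (Eq. 12): the policy maps (o_t, m_t | K) to an action plan.
        action = policy(obs, retrieved, knowledge)
        state = env_step(state, action)            # s_{t+1} <- Env(s_t, a_t)
    return state, long_term


# Toy stand-ins: the state is a step counter and each step logs one event.
final_state, memory = run_agent(
    state=0,
    knowledge="toy scenario",
    observe=lambda s: {"minimap": None, "text": f"t={s}"},
    accumulate_events=lambda s: [f"event@{s}"],
    memory_llm=lambda st, lt: (st[-1:], lt + st),
    policy=lambda obs, mem, k: 1,
    env_step=lambda s, a: s + a,
    n_steps=3,
)
```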
#### 0\.F\.2\.5 FSM Execution\.
Between VLM calls, each group operates autonomously according to its current FSM state without requiring per\-step re\-specification\. The FSM transitions are governed by two rules applied engine\-side\. First, upon detecting enemy contact, a group whose current state is not `move_force` automatically transitions to `fight`, storing its prior command for later restoration: $g.\text{prior\_command}\leftarrow g.\text{state}$, $g.\text{state}\leftarrow\texttt{fight}$\. Second, once the engagement ends, the group reverts to its prior command: $g.\text{state}\leftarrow g.\text{prior\_command}$\. The `move_force` command bypasses the first rule entirely, forcing the group to continue toward its destination regardless of enemy presence, which is useful when the VLM determines that engaging intermediate threats would compromise a time\-sensitive maneuver\. This design delegates strategic decisions \(where to move, which FSM mode to use\) to the VLM while delegating tactical execution \(when to engage\) to the engine, substantially reducing the per\-step action burden on the VLM in large\-scale RTS scenarios\.
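The two transition rules can be written as a small state machine. This is a sketch under stated assumptions rather than the engine-side implementation: the `Group` class and the handler names are hypothetical, and two behaviors that the text does not specify are chosen here for concreteness (contact while already in `fight` is ignored, and a newly issued command clears any stored prior command).

```python
from dataclasses import dataclass
from typing import Optional

ISSUABLE = ("move", "move_force", "stop")  # the command set C; `fight` excluded


@dataclass
class Group:
    state: str = "stop"                  # one of: move, move_force, stop, fight
    prior_command: Optional[str] = None  # command to restore after `fight`


def issue_command(g: Group, command: str) -> None:
    """VLM side: set a command. `fight` cannot be issued directly."""
    if command not in ISSUABLE:
        raise ValueError(f"not an issuable command: {command}")
    g.state = command
    g.prior_command = None  # assumption: a new command overrides any stored one


def on_enemy_contact(g: Group) -> None:
    """Rule 1: enter `fight` unless the group is in `move_force`."""
    if g.state in ("move_force", "fight"):
        return  # move_force bypasses the trigger; repeated contact is a no-op
    g.prior_command = g.state
    g.state = "fight"


def on_engagement_end(g: Group) -> None:
    """Rule 2: restore the command that was active before `fight`."""
    if g.state == "fight" and g.prior_command is not None:
        g.state = g.prior_command
        g.prior_command = None


squad = Group()
issue_command(squad, "move")
on_enemy_contact(squad)    # squad.state is now "fight"
on_engagement_end(squad)   # squad.state is back to "move"
```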
## Appendix 0\.G Effect of Difficulty Level on Model Performance \(L301\)
In the main experiments reported in the paper, all full\-game evaluations use the Easy difficulty level of the built\-in AI to ensure stable and comparable evaluation across models\. To further examine how model performance changes with stronger opponents, we additionally evaluate two representative models, Gemini\-3\-Flash and GPT\-5\.2, across higher difficulty levels\.
[Figure 11](https://arxiv.org/html/2606.18950#Pt0.A7.F11) reports the win rates of the two models under five full\-game matchup configurations: Duel \(1v1\), Symmetric \(2v2\), Symmetric \(3v3\), Asymmetric \(3v4\), and Free\-for\-All \(1v1v1v1\), while varying the built\-in AI difficulty \(Easy, Medium, Hard\)\. Faction matchup configurations follow the same setup as described in the main paper\. As difficulty increases, both models exhibit a rapid performance degradation across most settings, with win rates approaching zero at the Hard level in several matchups\.
Figure 11: Full\-game performance across difficulty levels\. Win rates of two VLM agents, Gemini\-3\-Flash \(red\) and GPT\-5\.2 \(blue\), evaluated under five full\-game setups with varying built\-in AI difficulty levels \(Easy, Medium, Hard\)\. The setups include Duel \(1v1\), Symmetric \(2v2\), Symmetric \(3v3\), Asymmetric \(3v4\), and Free\-for\-All \(1v1v1v1\)\.

Figure 12: Map layouts and starting positions\. Blue markers indicate the starting positions of the controlled team \(including teammates\), while markers in other colors denote enemy starting positions\. In the 3v3 setting, the red starting points are placed symmetrically with respect to the blue starting points\.
## Appendix 0\.H Details of Full Game Evaluation \(L308\)
### 0\.H\.1 Faction Matchup Configurations
All matchups use the Armada faction for every player\. Although Cortex is also implemented in the environment, we unify all factions as Armada to avoid introducing inter\-faction asymmetries and ensure performance differences arise from agent behavior rather than faction\-specific advantages or disadvantages\.
### 0\.H\.2 Map Configurations
Each matchup is conducted on a fixed map\. The 1v1 evaluation uses BarR 1\.1 \(map size 8×8\), while the 2v2 matchup uses Pinewood Derby v1 \(map size 12×6\)\. The 3v3 and 3v4 matchups are both conducted on Wanderlust 2\.1 \(map size 10×8\), allowing us to examine how performance changes when team sizes become unbalanced under the same map conditions\. The 1v1v1v1 setting uses Center Command BAR v1\.0 \(map size 16×8\)\. Map layouts are illustrated in [Figure 12](https://arxiv.org/html/2606.18950#Pt0.A7.F12)\.
### 0\.H\.3 Prompts and Examples
#### 0\.H\.3\.1 Memory phase\.
The system prompt, input example, and output example used in the memory phase are shown in [Figures 17](https://arxiv.org/html/2606.18950#Pt0.A12.F17), [18](https://arxiv.org/html/2606.18950#Pt0.A12.F18), and [19](https://arxiv.org/html/2606.18950#Pt0.A12.F19)\.
#### 0\.H\.3\.2 Decision phase\.
The system prompt, input example, and output example used in the decision phase are shown in [Figures 20](https://arxiv.org/html/2606.18950#Pt0.A12.F20), [21](https://arxiv.org/html/2606.18950#Pt0.A12.F21), and [22](https://arxiv.org/html/2606.18950#Pt0.A12.F22)\.
### 0\.H\.4 Qualitative Examples of Generated Gameplay
A qualitative example from a full\-game run is shown in [Figure 13](https://arxiv.org/html/2606.18950#Pt0.A8.F13)\. The example corresponds to a mid\-game moment in a full\-game 1v1 setting, where the agent expands its economy while simultaneously preparing for upcoming combat\.
Figure 13: Example of decision execution in a full\-game scenario\. At time 06:47 \(top\), the agent outputs high\-level decisions including unit and building construction as well as group movement commands with designated target coordinates\. At time 07:05 \(bottom\), the resulting in\-game behavior is observed: the frontline units advance toward \(60, 89\) while scouts move toward \(85, 70\), demonstrating the execution of the previously issued commands\.

Figure 14: Inference interval sensitivity analysis\. Performance under different inference intervals across the full game \(1v1\) and five mini\-games using Gemini\-3\-Flash\.
## Appendix 0\.I Inference Interval Analysis \(L308\)
The inference interval, *i\.e*\., the time between consecutive VLM calls, determines the trade\-off between decision responsiveness and computational cost\. [Figure 14](https://arxiv.org/html/2606.18950#Pt0.A8.F14) shows the performance of different inference intervals across the full game \(1v1\) and five mini\-games\. The full game and two planning\-oriented mini\-games \(TCP, SP\) require the agent to jointly decide build, production, and movement actions, which operate at a relatively longer temporal scale\. In contrast, the remaining three combat\-oriented mini\-games \(MFD, FS\-F, FS\-T\) only require movement decisions and therefore involve faster tactical dynamics\.
Based on the sensitivity analysis, we select an inference interval of 1 minute for 1v1, TCP, and SP, and 15 seconds for MFD, FS\-F, and FS\-T\. For the planning\-oriented group, 1 minute coincides with the peak performance of the full game and preserves near\-optimal results for TCP and SP, while avoiding the unnecessary cost of more frequent calls\. For the combat\-oriented group, 5–10 seconds yields marginally higher performance on some games but at substantially greater inference cost; 15 seconds maintains competitive performance across all three games and thus serves as a practical operating point\.
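To make the cost side of this trade-off concrete, the selected intervals can be turned into an approximate number of VLM calls per game. The mapping below restates the intervals chosen above; the helper `vlm_calls` is a hypothetical illustration that simply divides game length by the interval and ignores any extra call at the very start of a game.

```python
# Selected inference intervals in seconds, as stated in the text.
INFERENCE_INTERVAL_S = {
    "1v1": 60, "TCP": 60, "SP": 60,      # planning-oriented group
    "MFD": 15, "FS-F": 15, "FS-T": 15,   # combat-oriented group
}


def vlm_calls(game: str, duration_min: int) -> int:
    """Approximate number of VLM calls over a game of the given length."""
    return (duration_min * 60) // INFERENCE_INTERVAL_S[game]


# Durations taken from the representative game descriptions:
# a 5-minute FS-F skirmish and a 20-minute SP siege.
calls_fsf = vlm_calls("FS-F", 5)
calls_sp = vlm_calls("SP", 20)
```

Under these assumptions a 5-minute combat game and a 20-minute planning game need the same number of calls, and shortening the combat interval to 5 seconds would triple the former.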
## Appendix 0\.J Details about Evaluation Metrics \(L318\)
### 0\.J\.1 Full Game and Mini\-Game Metrics
#### 0\.J\.1\.1 Win rate \(WR, $\uparrow$\)\.
Win Rate is the primary performance metric for the non\-FFA full\-game matchups \(Duel, Symmetric Team, Asymmetric Team\) and for mini\-games TCP, MFD, FS\-T, and SP\. It is defined as the fraction of evaluation runs in which the agent achieves the win condition:
$$\text{WR}=\frac{\text{\# of wins}}{\text{\# of total runs}} \tag{13}$$
#### 0\.J\.1\.2 Rank score \(RS, $\uparrow$\)\.
Rank Score is the primary metric for Free\-for\-All \(FFA\) matchups in both full game evaluation and the FS\-F mini\-game, where the win condition is not binary\. Each finishing position is assigned a fixed score: 1st place receives 1\.0, 2nd place 0\.67, 3rd place 0\.33, and 4th place 0\.0\. RS is then computed as the average score across all evaluation runs:
$$\text{RS}=\frac{1}{N}\sum_{i=1}^{N}\text{score}(\text{rank}_i),\quad \text{score}(k)=\frac{4-k}{3},\quad k\in\{1,2,3,4\} \tag{14}$$
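As a worked instance of the rank score definition, the sketch below implements the per-rank score for four-player matches and averages it over runs. The function names `rank_points` and `rank_score` are chosen here for illustration and do not come from the released code.

```python
from typing import Sequence


def rank_points(k: int) -> float:
    """score(k) = (4 - k) / 3 for a finishing position k in {1, 2, 3, 4}."""
    if k not in (1, 2, 3, 4):
        raise ValueError(f"rank must be in 1..4, got {k}")
    return (4 - k) / 3


def rank_score(ranks: Sequence[int]) -> float:
    """RS: mean of score(rank_i) over all evaluation runs."""
    if not ranks:
        raise ValueError("at least one run is required")
    return sum(rank_points(k) for k in ranks) / len(ranks)


# One first-place and one last-place finish average to 0.5.
rs = rank_score([1, 4])
```

The listed values 0.67 and 0.33 are the rounded forms of 2/3 and 1/3 produced by this formula.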
#### 0\.J\.1\.3 Game time for wins/losses \($\text{GT}_{\text{W}}$ / $\text{GT}_{\text{L}}$, min\)\.
$\text{GT}_{\text{W}}$ and $\text{GT}_{\text{L}}$ report the average game duration in minutes, conditioned on the outcome \(wins and losses, respectively\)\. For FFA matchups, $\text{GT}_{\text{W}}$ is computed over 1st–2nd place finishes and $\text{GT}_{\text{L}}$ over 3rd–4th place finishes\. These metrics provide complementary insight into the agent’s behavioral style: a low $\text{GT}_{\text{W}}$ indicates an aggressive strategy that closes out victories quickly, while a high $\text{GT}_{\text{L}}$ suggests the agent can prolong games even under losing conditions\. When no wins are recorded \(*i\.e*\., $\text{WR}=0$\), $\text{GT}_{\text{W}}$ is undefined and reported as ‘–’\.
#### 0\.J\.1\.4 Damage efficiency \(DE, $\uparrow$\)\.
Damage Efficiency measures the ratio of total damage dealt to total damage received across an evaluation run:
$$\text{DE}=\frac{\text{total damage dealt}}{\text{total damage received}} \tag{15}$$

$\text{DE}>1$ indicates that the agent inflicted more damage than it absorbed, reflecting favorable combat engagement\. DE is used as an auxiliary metric across all full game matchups and mini\-games \(MFD, FS\-F, FS\-T, SP\), providing a continuous signal of combat effectiveness independent of the binary win condition\. The computation is identical across full game and mini\-game settings\.
#### 0\.J\.1\.5 Average time \(AT, min, $\downarrow$\)\.
Average Time is an auxiliary metric for TCP and SP, reporting the mean time taken to achieve the primary objective \(target unit composition in TCP; fortification destruction in SP\) across successful runs\. A lower AT indicates that the agent completes the objective more efficiently\. When no successful runs are recorded, AT is undefined and reported as ‘–’\.
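The outcome-based metrics above can be computed from per-run records along the following lines. This is a sketch, not the benchmark's evaluation code: the `Run` record and its fields are hypothetical, `None` stands in for the ‘–’ entry, and returning infinity when no damage is received is an assumption, since the text does not cover that case. AT can be obtained the same way as the win-conditioned game time, with objective completion times in place of game durations.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class Run:
    won: bool
    duration_min: float
    damage_dealt: float
    damage_received: float


def win_rate(runs: List[Run]) -> float:
    """WR: fraction of runs in which the win condition was achieved."""
    return sum(r.won for r in runs) / len(runs)


def game_time(runs: List[Run], won: bool) -> Optional[float]:
    """GT_W (won=True) or GT_L (won=False); None when no such run exists."""
    durations = [r.duration_min for r in runs if r.won == won]
    return mean(durations) if durations else None


def damage_efficiency(run: Run) -> float:
    """DE for one run: damage dealt divided by damage received."""
    if run.damage_received == 0:
        return float("inf")  # assumption: the source does not define this case
    return run.damage_dealt / run.damage_received


runs = [
    Run(won=True, duration_min=10.0, damage_dealt=200.0, damage_received=100.0),
    Run(won=False, duration_min=20.0, damage_dealt=50.0, damage_received=100.0),
    Run(won=True, duration_min=14.0, damage_dealt=90.0, damage_received=30.0),
]
wr = win_rate(runs)
gt_w = game_time(runs, won=True)
gt_l = game_time(runs, won=False)
```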
#### 0\.J\.1\.6 Self\-Evolving generation metrics
#### 0\.J\.1\.7 Playability \($\uparrow$\)\.
Playability is defined as the fraction of generated mini\-games that execute successfully without errors:
$$\text{Playability}=\frac{\text{\# of successfully executable games}}{\text{\# of total generated games}} \tag{16}$$

A game is considered playable if it launches, runs to completion without engine errors, and satisfies the win/loss condition as intended\.
#### 0\.J\.1\.8 Generation time \(min\)\.
Generation Time reports the average wall\-clock time in minutes required to produce a single mini\-game from a user query, measured from the start of Stage 1 to the successful completion of Stage 4\. For fast\-tracked games \(*i\.e*\., where a matching GDD and rule set are retrieved from the database\), the time reflects only Stage 4 execution\.
#### 0\.J\.1\.9 Human preference\.
Human Preference is evaluated by four RTS\-experienced human evaluators via pairwise comparison between games generated by two agent configurations \(Agent w/ SE vs\. Agent w/o SE\)\. Given a user query and the two corresponding generated games, each evaluator selects one of three outcomes: A win, B win, or tie\. Evaluators are instructed to judge based on alignment: specifically, how well each generated game reflects the intent of the query and whether the functional behaviors requested by the query \(such as specific enemy attack patterns, win/loss conditions, or scenario dynamics\) operate correctly as intended\. The final preference score reports the percentage of A win, B win, and tie judgments aggregated across all evaluators and query pairs within each batch\.
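The aggregation step can be illustrated with a few lines of Python. This is a sketch assuming that each judgment is recorded as one of the labels `"A"`, `"B"`, or `"tie"`; the function name and the label encoding are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence

OUTCOMES = ("A", "B", "tie")


def preference_shares(judgments: Sequence[str]) -> Dict[str, float]:
    """Percentage of A-win, B-win, and tie judgments, pooled over all
    evaluators and query pairs in a batch."""
    unknown = set(judgments) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    if not judgments:
        raise ValueError("at least one judgment is required")
    counts = Counter(judgments)
    return {o: 100.0 * counts[o] / len(judgments) for o in OUTCOMES}


# Four evaluators judging a single query pair.
shares = preference_shares(["A", "A", "tie", "B"])
```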
## Appendix 0\.K Details of the Self\-Evolving Game Generation Framework \(L212\)
### 0\.K\.1 Algorithm Details
Algorithm [2](https://arxiv.org/html/2606.18950#alg2) presents the full procedure of the Self\-Evolving Game Generation Framework\. We describe each component in detail below\.
#### 0\.K\.1\.1 Inputs and initialization\.
The framework takes as input a user query $q$, a shared knowledge database $\mathcal{D}$, and stage\-specific rubrics $\mathcal{R}=\{\mathcal{R}_g,\mathcal{R}_r,\mathcal{R}_s\}$, where $\mathcal{R}_g$, $\mathcal{R}_r$, and $\mathcal{R}_s$ govern GDD generation, rule set construction, and game implementation, respectively\. An artifact buffer $\mathcal{A}$ and a feedback log $\mathcal{F}$ are initialized as empty sets; $\mathcal{A}$ accumulates validated outputs across stages for downstream reuse, while $\mathcal{F}$ records all analyst and VLM feedback throughout the generation cycle for use in the final retrospective analysis\.
#### 0\.K\.1\.2 Stage 1: scenario planning\.
The designer agent engages in multi\-turn dialogue with the user to clarify the intent of query $q$, producing a structured scenario brief $b$ that specifies game composition, enemy behavior rules, and win/loss conditions\. Before proceeding to Stage 2, the project manager \(PM\) checks whether $\mathcal{D}$ already contains a matching GDD and rule set for $b$\. If such a match exists, the pipeline fast\-tracks directly to Stage 4 by retrieving the validated artifacts from $\mathcal{D}$, bypassing redundant generation and substantially reducing generation time\.
#### 0\.K\.1\.3 Stage 2: GDD generation\.
The designer expands the scenario brief $b$ into a full Game Design Document \(GDD\) specifying the targeted strategic competency and the rule components governing game behavior \(*e\.g*\., unit spawning conditions and termination criteria\)\. The analyst validates the GDD against rubric $\mathcal{R}_g$, returning a binary pass signal $\text{pass}_g$ and structured feedback $\text{fb}_g$\. If validation fails, the PM invokes `MetaFeedback`, which reviews the iteration history in $\mathcal{F}$ and produces corrective guidance $\text{fb}_g^{\star}$ to direct the designer’s next attempt, either retrying the current stage or rolling back to Stage 1 with revised instructions\. This Repeat–Until loop continues until $\text{pass}_g$ is achieved, at which point the validated GDD is stored in the artifact buffer $\mathcal{A}$\.
#### 0\.K\.1\.4 Stage 3: rule set construction\.
The developer implements each rule specified in the GDD as a Lua script \[ierusalimschy2006lua\] for in\-engine execution, reusing verified implementations from $\mathcal{D}$ when available and writing new scripts otherwise\. The analyst validates the resulting rule set against rubric $\mathcal{R}_r$ via rubric\-based checks and simulation runs, returning $\text{pass}_r$ and feedback $\text{fb}_r$\. Failed validations again trigger `PM.MetaFeedback`, which provides corrective guidance $\text{fb}_r^{\star}$ for retry or rollback\. Upon passing, the verified rule set is appended to $\mathcal{A}$\.
#### 0\.K\.1\.5 Stage 4: game implementation and verification\.
The developer retrieves game assets \(*e\.g*\., maps and unit configurations\) from $\mathcal{D}$ and assembles a final executable script $\mathcal{G}$ by configuring unit placement, end conditions, map selection, and rule parameters based on the accumulated artifacts in $\mathcal{A}$\. Stage 4 applies a two\-level verification: the analyst first performs rubric\-based checks against $\mathcal{R}_s$, producing $\text{pass}_s$ and $\text{fb}_s$; a VLM then independently verifies visual playability and semantic alignment with the original query $q$ by processing screenshots captured at regular intervals during a full game simulation, producing $\text{pass}_v$ and $\text{fb}_v$\. The game passes Stage 4 only when both conditions are simultaneously satisfied \($\text{pass}_s\wedge\text{pass}_v$\); otherwise, `PM.MetaFeedback` synthesizes corrective guidance $\text{fb}_s^{\star}$ from both feedback signals to direct the next revision attempt\.
#### 0.K.1.6 Self-evolution phase.
Upon successful completion of Stage 4, two self-evolution mechanisms are triggered. First, all validated artifacts accumulated in $\mathcal{A}$ (including the GDD, rule scripts, and final game configuration) are committed to the shared knowledge database: $\mathcal{D} \leftarrow \mathcal{D} \cup \mathcal{A}$. This enables future queries to reuse verified components, allowing the pipeline to fast-track or partially skip stages as the database grows. Second, the PM conducts a retrospective analysis over the complete feedback log $\mathcal{F}$ accumulated during the current generation cycle, identifying systematic discrepancies between verification outcomes and quality expectations and updating the rubrics accordingly: $\mathcal{R} \leftarrow \textsc{PM.Retrospective}(\mathcal{R}, \mathcal{F})$. Together, these mechanisms allow the framework to improve both generation efficiency and output quality over successive cycles, transforming RTSGameBench from a static diagnostic suite into a continuously extensible evaluation platform.
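The two updates can be illustrated with a small Python sketch. It is an illustration under strong assumptions: in the paper the retrospective is performed by the PM agent, whereas here it is replaced by a simple frequency heuristic (a failure category seen at least `min_count` times becomes a new rubric check). The function names, the string-tag representation, and the threshold are all hypothetical.

```python
from collections import Counter
from typing import List, Set


def commit_artifacts(database: Set[str], artifacts: Set[str]) -> Set[str]:
    """D <- D ∪ A: add this cycle's validated artifacts to the shared database."""
    return database | artifacts


def retrospective(rubric: List[str], failure_tags: List[str], min_count: int = 2) -> List[str]:
    """Stand-in for PM.Retrospective: add a check for each recurring failure tag."""
    counts = Counter(failure_tags)
    updated = list(rubric)  # leave the caller's rubric unchanged
    for tag, count in counts.items():
        if count >= min_count and tag not in updated:
            updated.append(tag)
    return updated


database = commit_artifacts(
    {"gdd:capture_flag"},
    {"gdd:hold_bridge", "rules:hold_bridge"},
)
rubric = retrospective(
    ["no_overlap", "traversable_spawn"],
    ["pathability", "end_condition", "pathability"],
)
print(sorted(database), rubric)
```

Only the failure that recurs is promoted to a rubric check; the one-off `end_condition` failure is left out, mirroring the idea of targeting systematic rather than incidental discrepancies.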
Figure 15: Representative mini-games generated by the self-evolving game generation framework. Each row shows a user query, two in-game screenshots, and the strategic ability assessed. The examples span opponent modeling, resource management, spatial reasoning, collaboration, and adversarial planning with temporal reasoning.
### 0.K.2 Prompts and Rubrics
Project manager. The responsibilities and prompt templates of the project manager agent are illustrated in [Figures 23](https://arxiv.org/html/2606.18950#Pt0.A12.F23), [24](https://arxiv.org/html/2606.18950#Pt0.A12.F24) and [25](https://arxiv.org/html/2606.18950#Pt0.A12.F25). These figures show the system prompts used for inter-stage routing, retrospective game summarization, and rubric refinement within the scenario generation pipeline.
Stage 1: scenario brief generation. The system prompt used by the designer agent for generating scenario briefs and interacting with the user is shown in [Figure 26](https://arxiv.org/html/2606.18950#Pt0.A12.F26).
Stage 2: game design document (GDD) generation. The prompts and rubric used in the GDD generation stage are shown in [Figures 27](https://arxiv.org/html/2606.18950#Pt0.A12.F27), [28](https://arxiv.org/html/2606.18950#Pt0.A12.F28), [29](https://arxiv.org/html/2606.18950#Pt0.A12.F29) and [30](https://arxiv.org/html/2606.18950#Pt0.A12.F30). In this stage, the designer first generates the game design document (GDD), the analyst evaluates it using a predefined rubric, and the designer refines the GDD based on the feedback.
Stage 3: rule set construction. The prompts and rubric used in the rule set construction stage are shown in [Figures 31](https://arxiv.org/html/2606.18950#Pt0.A12.F31), [32](https://arxiv.org/html/2606.18950#Pt0.A12.F32), [33](https://arxiv.org/html/2606.18950#Pt0.A12.F33), [34](https://arxiv.org/html/2606.18950#Pt0.A12.F34) and [35](https://arxiv.org/html/2606.18950#Pt0.A12.F35). In this stage, the developer generates executable rules from the GDD and produces test code for validation. The analyst evaluates the rules using a predefined rubric and simulation results, and the developer iteratively refines the rules based on the feedback.
Stage 4: game finalization. The prompts and rubric used in the game finalization stage are shown in [Figures 36](https://arxiv.org/html/2606.18950#Pt0.A12.F36), [37](https://arxiv.org/html/2606.18950#Pt0.A12.F37), [38](https://arxiv.org/html/2606.18950#Pt0.A12.F38), [39](https://arxiv.org/html/2606.18950#Pt0.A12.F39), [40](https://arxiv.org/html/2606.18950#Pt0.A12.F40), [41](https://arxiv.org/html/2606.18950#Pt0.A12.F41) and [42](https://arxiv.org/html/2606.18950#Pt0.A12.F42). In this stage, the developer finalizes the playable scenario by selecting a map, placing units, configuring rules, and defining the end conditions. The analyst then evaluates the generated scenario using a predefined rubric and visual inspection of gameplay, after which the developer refines the final script if necessary.
Figure 16: Example of updated rubrics during Self-Evolving. An existing rubric is analyzed after gameplay simulation and refined to better capture recurring failure cases. In this example, the original rubric is augmented with an additional requirement highlighted in blue bold text, which explicitly enforces pathability validation for spawned entities. This modification supplements the previous criterion to improve robustness of the evaluation rubric.
## Appendix 0.L Additional Examples of Self-Evolving Game Generation (L381)
[Figure 15](https://arxiv.org/html/2606.18950#Pt0.A11.F15) presents five representative mini-games generated by the Self-Evolving Game Generation Framework. The examples span a diverse range of strategic abilities, including opponent modeling, resource management, spatial reasoning, collaboration, and adversarial planning with temporal reasoning, demonstrating the framework’s ability to produce functionally diverse and strategically meaningful scenarios from free-form natural language queries.
### 0.L.1 Updated Rubric during Self-Evolving
As illustrated in [Figure 16](https://arxiv.org/html/2606.18950#Pt0.A11.F16), once a game is generated and validated, the rubric used for evaluating the final script can be further refined by analyzing recurring failure patterns observed during simulation. In earlier versions of the rubric, the validation primarily checked whether units or buildings overlapped and whether the initial spawn locations were placed on terrain that is nominally traversable. However, repeated simulations revealed a systematic failure case: even when these conditions were satisfied, some units became effectively immobilized immediately after spawning due to subtle terrain pathability issues.
For example, mobile units could be placed on terrain that is technically valid but lacks a viable escape route, such as being spawned on an isolated island that ground units cannot leave, or on narrow terrain regions where slope, water depth, or footprint constraints prevent movement. Because the original rubric did not explicitly verify that spawned entities could actually move away from their spawn positions, these situations repeatedly passed the initial checks but later caused failures in the unit movement tests.
To address this issue, the rubric is refined after the game generation stage to explicitly require pathability validation, ensuring that placed mobile entities not only spawn on passable terrain but also have a feasible escape route that allows them to move shortly after spawning. Through this iterative self-evolving process, the evaluation rubric gradually becomes more precise and robust as more generated games expose new edge cases.
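The difference between "spawned on passable terrain" and "has a feasible escape route" can be made concrete with a reachability test. The sketch below is illustrative only and assumes a simplified model: the map is a boolean passability grid for one movement class, movement is 4-connected, and `goal` is a hypothetical cell the unit is expected to reach (for instance an objective). Real engine pathing also depends on slope, water depth, and unit footprint, which this sketch ignores.

```python
from collections import deque
from typing import List, Tuple

Cell = Tuple[int, int]


def can_reach(passable: List[List[bool]], spawn: Cell, goal: Cell) -> bool:
    """Breadth-first search: is `goal` reachable from `spawn` over passable cells?"""
    rows, cols = len(passable), len(passable[0])
    if not passable[spawn[0]][spawn[1]] or not passable[goal[0]][goal[1]]:
        return False
    seen = {spawn}
    queue = deque([spawn])
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            return True
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and passable[nr][nc] and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append((nr, nc))
    return False


# The top-left cell is passable but isolated, like a unit spawned on an island.
grid = [
    [True,  False, True, True],
    [False, False, True, True],
    [True,  True,  True, True],
]
island_ok = can_reach(grid, (0, 0), (2, 3))
mainland_ok = can_reach(grid, (2, 0), (2, 3))
print(island_ok, mainland_ok)
```

The island spawn passes a naive "is the spawn cell passable" check yet fails the reachability check, which is the failure pattern the refined rubric is meant to catch.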
Figure 17: System prompt for the memory phase. The system prompt used in the memory phase to manage and update the agent’s memory before the decision phase.

Figure 18: Input example for the memory phase. The input to the memory phase consists of the current memory state accumulated so far and newly observed triggers, such as enemy sightings or combat outcomes.

Figure 19: Output example for the memory phase. The output of the memory phase consists of an updated memory state reflecting the newly observed triggers and a set of retrieved memories provided to guide decision-making in the decision phase.

Figure 20: System prompt for the decision phase. Placeholders (denoted as {…}) in the system prompt are replaced with the game description, unit knowledge, and team color specification. The game description provides a textual explanation of each game, the unit knowledge contains brief descriptions of units available in the game, and the team color specification maps each team to its corresponding color in the game images.

Figure 21: Input example for the decision phase. The input to the decision phase consists of both textual observations and visual observations of the current game state.

Figure 22: Output example for the decision phase. The output of the decision phase specifies actions across four decision categories: constructing buildings, producing units, assigning units to groups, and issuing movement commands to groups.

Algorithm 1 RTSGameAgent: Inference Loop

Input: Static game knowledge $\mathcal{K}$; inference interval $\Delta t$

Output: Action sequence $\{a_t\}$

- $\mathcal{L}_0 \leftarrow \emptyset$ ▷ Initialize long-term memory
- $\mathcal{S}_0 \leftarrow \emptyset$ ▷ Initialize short-term memory
- Initialize groups $\leftarrow \{\texttt{unassigned}\}$
- **for** each decision step $t = 0, 1, 2, \ldots$ **do**
  - *// Observation*
  - $v_t \leftarrow$ RenderVisuals($s_t$) ▷ Minimap + 4 local views (top-3 groups + base)
  - $\mathcal{W}(s_t) \leftarrow$ ExtractGameState($s_t$) ▷ Groups, buildings, enemies, pending
  - $o_t \leftarrow (v_t,\; \mathcal{W}(s_t))$
  - $\mathcal{S}_t \leftarrow$ AccumulateEvents($\Delta t$) ▷ Enemy sightings, battle outcomes
  - *// Phase 1: Memory consolidation (LLM, text-only)*
  - $m_t,\; \mathcal{L}_{t+1} \leftarrow \mathrm{LLM}(\mathcal{S}_t,\; \mathcal{L}_t)$ ▷ Retain/merge/discard $\to$ retrieve relevant entries
  - $\mathcal{S}_{t+1} \leftarrow \emptyset$ ▷ Flush short-term buffer
  - *// Phase 2: Strategic decision (VLM, multimodal)*
  - $a_t \leftarrow \pi(o_t,\; m_t \mid \mathcal{K})$
  - *// Parse $a_t$ into action types:*
  - **for** each build $\in a_t$ **do** ▷ Building construction (per-worker)
    - Assign worker to build type at location $(x, y)$
  - **end for**
  - **for** each produce $\in a_t$ **do** ▷ Unit production (per-factory)
    - Order factory to produce unit_type
  - **end for**
  - **for** each assign $\in a_t$ **do** ▷ Group assignment
    - Create or update squad; allocate specified units
  - **end for**
  - **for** each move $\in a_t$ **do** ▷ Group movement with FSM command
    - Issue command $c \in \{\texttt{move}, \texttt{move\_force}, \texttt{stop}\}$ to squad toward $(x, y)$
  - **end for**
  - *// FSM execution (engine-side, autonomous)*
  - **for** each group $g$ **do**
    - **if** enemy contact detected **and** $g.\text{state} \neq \texttt{move\_force}$ **then**
      - $g.\text{prior\_command} \leftarrow g.\text{state}$
      - $g.\text{state} \leftarrow \texttt{fight}$ ▷ Auto-triggered
    - **else if** engagement ended **and** $g.\text{state} = \texttt{fight}$ **then**
      - $g.\text{state} \leftarrow g.\text{prior\_command}$ ▷ Revert
    - **end if**
  - **end for**
  - $s_{t+1} \leftarrow \text{Env}(s_t,\; a_t)$ ▷ Environment advances by $\Delta t$
- **end for**
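The FSM execution step of Algorithm 1 can be written as a small state-transition function. This is a sketch of the transition rule only, with hypothetical names (`Group`, `step_fsm`) and boolean inputs standing in for engine-side contact detection. It also adds one guard that is not in the pseudocode: a group that is already in `fight` does not re-save its state, so that contact lasting several ticks does not overwrite the stored command.

```python
from dataclasses import dataclass


@dataclass
class Group:
    state: str = "stop"          # one of: move, move_force, stop, fight
    prior_command: str = "stop"  # command to restore once fighting ends


def step_fsm(group: Group, enemy_contact: bool, engagement_ended: bool) -> Group:
    """Apply one FSM tick to a group and return it."""
    # Added guard (assumption): skip the transition if the group is already fighting.
    if enemy_contact and group.state not in ("move_force", "fight"):
        group.prior_command = group.state
        group.state = "fight"        # auto-triggered on contact
    elif engagement_ended and group.state == "fight":
        group.state = group.prior_command  # revert to the interrupted command
    return group


squad = Group(state="move")
step_fsm(squad, enemy_contact=True, engagement_ended=False)   # move -> fight
step_fsm(squad, enemy_contact=True, engagement_ended=False)   # stays in fight
state_during_fight = squad.state
step_fsm(squad, enemy_contact=False, engagement_ended=True)   # fight -> move

forced = step_fsm(Group(state="move_force"), enemy_contact=True, engagement_ended=False)
print(state_during_fight, squad.state, forced.state)
```

A group under `move_force` ignores contact entirely, which lets the decision model order a retreat or a forced march without the engine pulling the squad into combat.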
Algorithm 2 Self-Evolving Game Generation Framework

Input: User query $q$; knowledge database $\mathcal{D}$; rubrics $\mathcal{R} = \{\mathcal{R}_g, \mathcal{R}_r, \mathcal{R}_s\}$

Output: Executable mini-game $\mathcal{G}$; updated $\mathcal{D}$ and $\mathcal{R}$

- $\mathcal{A} \leftarrow \emptyset,\; \mathcal{F} \leftarrow \emptyset$ ▷ Initialize artifact storage and feedback log
- *// Stage 1: Scenario Planning*
- $b \leftarrow \textsc{Designer.Clarify}(q)$ ▷ Multi-turn dialogue $\to$ scenario brief
- **if** $\exists\; \{\text{GDD, Rules}\} \in \mathcal{D}$ matching $b$ **then**
  - $\mathcal{A} \leftarrow \text{Retrieve}(\mathcal{D}, b)$ ▷ Fast-track
  - **go to** Stage 4
- **end if**
- *// Stage 2: GDD Generation*
- **repeat**
  - $\text{GDD} \leftarrow \textsc{Designer.Expand}(b)$ ▷ Brief $\to$ full Game Design Document
  - $(\text{pass}_g, \text{fb}_g) \leftarrow \textsc{Analyst.Validate}(\text{GDD}, \mathcal{R}_g)$ ▷ Rubric-based check
  - $\mathcal{F} \leftarrow \mathcal{F} \cup \{\text{fb}_g\}$
  - **if** $\neg\,\text{pass}_g$ **then**
    - $\text{fb}_g^{\star} \leftarrow \textsc{PM.MetaFeedback}(\text{fb}_g)$ ▷ Retry/Rollback guidance
  - **end if**
- **until** $\text{pass}_g$
- $\mathcal{A} \leftarrow \mathcal{A} \cup \{\text{GDD}\}$ ▷ Store GDD
- *// Stage 3: Rule Set Construction*
- **repeat**
  - $\text{Rules} \leftarrow \textsc{Developer.Implement}(\text{GDD}, \mathcal{D})$ ▷ New Lua scripts
  - $(\text{pass}_r, \text{fb}_r) \leftarrow \textsc{Analyst.Validate}(\text{Rules}, \mathcal{R}_r)$ ▷ Rubric-based check
  - $\mathcal{F} \leftarrow \mathcal{F} \cup \{\text{fb}_r\}$
  - **if** $\neg\,\text{pass}_r$ **then**
    - $\text{fb}_r^{\star} \leftarrow \textsc{PM.MetaFeedback}(\text{fb}_r)$ ▷ Retry/Rollback guidance
  - **end if**
- **until** $\text{pass}_r$
- $\mathcal{A} \leftarrow \mathcal{A} \cup \{\text{Rules}\}$ ▷ Store verified rules into artifact buffer
- *// Stage 4: Game Implementation & Verification*
- Retrieve game assets (maps, unit info) from $\mathcal{D}$
- **repeat**
  - $\mathcal{G} \leftarrow \textsc{Developer.Configure}(\mathcal{A}, \text{Assets})$ ▷ Final executable script
  - $(\text{pass}_s, \text{fb}_s) \leftarrow \textsc{Analyst.Validate}(\mathcal{G}, \mathcal{R}_s)$ ▷ Rubric-based check
  - $(\text{pass}_v, \text{fb}_v) \leftarrow \textsc{VLM.Verify}(\text{Screenshots}, q)$
  - $\mathcal{F} \leftarrow \mathcal{F} \cup \{\text{fb}_s, \text{fb}_v\}$
  - **if** $\neg\,(\text{pass}_s \wedge \text{pass}_v)$ **then**
    - $\text{fb}_s^{\star} \leftarrow \textsc{PM.MetaFeedback}(\text{fb}_s, \text{fb}_v)$ ▷ Retry/Rollback guidance
  - **end if**
- **until** $\text{pass}_s \wedge \text{pass}_v$
- *// Self-Evolution Phase*
- $\mathcal{D} \leftarrow \mathcal{D} \cup \mathcal{A}$ ▷ Store validated artifacts
- $\mathcal{R} \leftarrow \textsc{PM.Retrospective}(\mathcal{R}, \mathcal{F})$ ▷ Update rubrics via feedback analysis
- **return** $\mathcal{G}$
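The fast-track branch of Stage 1 in Algorithm 2 amounts to a lookup that decides which stages still need to run. The sketch below is one possible reading, under assumptions: the database is modeled as a dictionary keyed by a brief identifier, matching is exact key equality (the paper does not specify how a brief is matched against $\mathcal{D}$), and `plan_stages` is a hypothetical helper.

```python
from typing import Dict, List, Tuple

Artifacts = Dict[str, str]


def plan_stages(brief_key: str, database: Dict[str, Artifacts]) -> Tuple[Artifacts, List[str]]:
    """Return the retrieved artifacts and the list of stages that remain to be run."""
    entry = database.get(brief_key, {})
    # Fast-track only when both a GDD and a rule set are already verified.
    if "gdd" in entry and "rules" in entry:
        return dict(entry), ["stage4"]
    return {}, ["stage2", "stage3", "stage4"]


database = {
    "hold_bridge": {"gdd": "<verified GDD>", "rules": "<verified Lua rules>"},
    "scout_enemy": {"gdd": "<verified GDD>"},
}
hit = plan_stages("hold_bridge", database)
partial = plan_stages("scout_enemy", database)
miss = plan_stages("escort_convoy", database)
print(hit[1], partial[1], miss[1])
```

Following the algorithm as written, an entry with a GDD but no rules does not fast-track; a finer-grained variant could skip only Stage 2 in that case.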
Figure 23: System prompt for inter-stage gating. The Project Manager reviews the feedback history and determines which agent to route the request to for the next step.

Figure 24: System prompt for Game Summary. After the game generation process is completed, the Project Manager summarizes the generated game and provides the user with an overview of its key properties.

Figure 25: System prompt for rubric update. After the game generation process is completed, the Project Manager updates the rubric for future game generation by reflecting on the history accumulated during the current game creation process.

Figure 26: System prompt for interaction between the designer and the user. In Stage 1, the Designer agent interacts with the user to iteratively refine and produce the scenario brief.

Figure 27: System prompt for generating GDD. In Stage 2, the Designer agent generates a Game Design Document (GDD) based on the scenario brief produced in the previous stage.

Figure 28: System prompt for validating GDD. The Analyst agent reviews and validates the Game Design Document (GDD) generated by the Designer agent.

Figure 29: Rubrics for GDD. The initial rubric used by the Analyst agent to validate the GDD in [Figure 28](https://arxiv.org/html/2606.18950#Pt0.A12.F28).

Figure 30: System prompt for refining GDD. The Designer agent refines the GDD by incorporating feedback provided by the Analyst agent.

Figure 31: System prompt for generating rule. In Stage 3, the Developer agent generates Lua code to implement the rules specified in the GDD.

Figure 32: System prompt for generating test code for rule. The Developer agent generates testing Lua code to verify whether the rule implementation produced in [Figure 31](https://arxiv.org/html/2606.18950#Pt0.A12.F31) behaves as intended.

Figure 33: System prompt for validating rule. The Analyst agent evaluates the Lua rule implementation and its corresponding test code by reviewing both the code and the simulation results produced from the tests.

Figure 34: Rubrics for Rule. The rubric used by the Analyst agent for rule validation in [Figure 33](https://arxiv.org/html/2606.18950#Pt0.A12.F33).

Figure 35: System prompt for refining rule. The Developer agent refines the Lua rule implementation based on feedback provided by the Analyst agent in [Figure 33](https://arxiv.org/html/2606.18950#Pt0.A12.F33).

Figure 36: System prompt for selecting map. In Stage 4, the Developer agent selects an appropriate map for the game based on the GDD and the implemented rules.

Figure 37: System prompt for placing units. The Developer agent determines the placement of units on the selected map based on the specifications defined in the GDD and the implemented rules.

Figure 38: System prompt for defining rule configuration. The Developer agent specifies the configuration parameters for the rules implemented in Stage 3.

Figure 39: System prompt for defining end condition. The Developer agent specifies the victory and defeat conditions for the game.

Figure 40: System prompt for visual evaluation. The Analyst agent analyzes the final game by running simulations and evaluating the resulting visualizations using the rubric.

Figure 41: Rubrics for Final Script. The rubric used by the Analyst agent for the final game evaluation in [Figure 40](https://arxiv.org/html/2606.18950#Pt0.A12.F40).

Figure 42: System prompt for refining final script. The Developer agent refines the final game script by incorporating feedback and evaluation results from [Figure 40](https://arxiv.org/html/2606.18950#Pt0.A12.F40).