Age of LLM: A Strategic 1v1 Benchmark for Reasoning, Diplomacy and Reliability of Large Language Models under Fog of War
Summary
Introduces Age of LLM, a turn-based 1v1 benchmark where LLMs compete on a grid with fog of war and diplomacy, measuring reasoning, reliability, and strategic planning. Findings show a dominance of nuclear rush tactics and a weak link between reliability and winning.
# A Strategic 1v1 Benchmark for Reasoning, Diplomacy and Reliability of Large Language Models under Fog of War
Source: [https://arxiv.org/html/2606.24391](https://arxiv.org/html/2606.24391)
Arnaud Ricci\. Corresponding author\. Independent researcher, Switzerland\. ORCID: [\[0000\-0002\-8982\-1416\]](https://arxiv.org/html/2606.24391v1/%5B0000-0002-8982-1416%5D)\. Email: arnaud\.ri@protonmail\.ch
\(18 June 2026\)
###### Abstract
We introduce Age of LLM \([https://ageofllm\.org](https://ageofllm.org/)\), a turn\-based 1v1 benchmark in which two LLMs face off on a 13×7 grid to destroy the enemy base\. Three stressors are deliberate: *fog of war*, *full diplomacy* \(messages, ceasefires, ultimatums; uranium kept secret\), and a *reliability* dimension where every turn must follow a strict JSON schema and an illegal action is silently discarded\. The engine is private and each match uses a fresh random map seed and opponent, mitigating the data contamination that affects public benchmarks\. Models receive a \(near\) rule\-only prompt with no build\-order advice \(two tactical seed phrases were present during data collection; see Section [2](https://arxiv.org/html/2606.24391#S2)\)\. We benchmark 15 reasoning models across 54 matches and 5,258 actions\. Findings: \(1\) the nuclear rush dominates \(78% on the rules\-coherent v0\.11\+ sub\-corpus; 85% corpus\-wide\) with a sole\-launcher signature that is largely mechanical under secret\-simultaneous launch rules, not a cognitive deterrence failure; \(2\) military conquest is rare but faster \(12\.3 vs\. 18\.9 turns\); \(3\) diplomacy is prolific yet almost never consummated; \(4\) ∼58% of illegal actions are fog/state errors, making the illegal\-action rate a measure of belief\-tracking; \(5\) the least established finding, and the only one we label exploratory: a weak link associates reliability with winning \(details in Section [4](https://arxiv.org/html/2606.24391#S4)\)\. The corpus is small, unbalanced and not side\-swapped, so the ranking is a *preliminary descriptive view*, not a contribution\. Beyond ranking, the turn\-by\-turn traces of actions *and* messages make the corpus a lens on how LLMs reason under adversarial uncertainty \(their belief\-tracking, spontaneous deception, and per\-model cognitive “personas”\), which we frame as a future research direction\. We release the replay format, an isometric viewer and all replays; engine source on request\.
Licensed under a Creative Commons Attribution 4\.0 International License \(CC BY 4\.0\)\.
## 1 Introduction
Most LLM benchmarks probe single\-turn competence on tasks with full observability and unambiguous answers \(e\.g\. MATH \[[1](https://arxiv.org/html/2606.24391#bib.bib1)\], HumanEval \[[2](https://arxiv.org/html/2606.24391#bib.bib2)\], MMLU \[[3](https://arxiv.org/html/2606.24391#bib.bib3)\]\)\. Such benchmarks reward depth of reasoning on a *visible* problem but do not exercise the competences that real adversarial decision\-making demands: planning under *uncertainty*, reasoning about a *hidden* opponent, *temporal* commitment \(build now vs\. attack now\), and, most critically for agentic deployment, *reliability* of structured output across many sequential turns\. In our setting a malformed response is retried, but an illegal action \(a rule violation such as acting on a fog\-hidden cell or referencing a destroyed unit\) is silently discarded as a wasted action: the model may emit up to three actions per turn, so one illegal action forfeits one of its three slots, not the whole turn\.
A second motivation is data contamination\. Public benchmarks \(MMLU, MATH, HumanEval and their successors\) are, by design, openly distributed; as frontier models are trained on ever\-larger web\-scraped corpora, the risk that benchmark items leak into training data grows, inflating scores without reflecting genuine competence\. Age of LLM mitigates this by construction: the engine source is kept private, the map is re\-seeded randomly each match, and the opponent varies across a pool of 15 models, so no two matches present the same opening position and no fixed solution can be memorised\. We do not claim this eliminates contamination entirely: a sufficiently capable model could still transfer general game\-playing heuristics\. It does, however, remove the most direct contamination channel \(rote memorisation of benchmark items\)\. This anti\-contamination property comes at a cost: each match is a paid API call, so the sample sizes we can afford are modest \(Section [3](https://arxiv.org/html/2606.24391#S3)\), and the random opponent mix that protects against memorisation is also what prevents a balanced head\-to\-head ranking at this scale\.
Planning under partial observability with hidden opponents has been studied in game AI, from board games \[[6](https://arxiv.org/html/2606.24391#bib.bib6)\] to real\-time strategy and social deduction \[[7](https://arxiv.org/html/2606.24391#bib.bib7)\]\. A broader line of work evaluates LLMs as autonomous agents on multi\-turn, multi\-tool tasks \(HELM \[[4](https://arxiv.org/html/2606.24391#bib.bib4)\], SWE\-bench \[[5](https://arxiv.org/html/2606.24391#bib.bib5)\], AgentBench \[[8](https://arxiv.org/html/2606.24391#bib.bib8)\]\), which exercise tool use and planning but typically under full observability and without an adversarial, hidden opponent\. Closer to our setting, GameBench \[[9](https://arxiv.org/html/2606.24391#bib.bib9)\] and GTBench \[[10](https://arxiv.org/html/2606.24391#bib.bib10)\] benchmark LLMs on board and game\-theoretic tasks; both still mostly probe fully observable, turn\-based games rather than partial observability combined with diplomacy under fog of war\. Recent work has begun to probe LLMs as game\-playing agents in real\-time\-strategy settings, including a constructive argument that LLM\-like attributes can arise in any sufficiently powerful substrate trained on an RTS such as Age of Empires II \[[11](https://arxiv.org/html/2606.24391#bib.bib11)\], and a high\-stakes simulation studying whether LLMs exercise ethical restraint when authorised to launch a nuclear weapon \[[12](https://arxiv.org/html/2606.24391#bib.bib12)\]\.
Age of LLM is a deterministic\-turn engine in which two LLMs play a complete match \(typically 16–23 turns\)\. Each turn a model emits up to three structured actions \(produce / move / attack / build / launch / wait\) plus an optional diplomatic message\. Victory is achieved by a nuclear launch the opponent does not match, by reducing the enemy base to 0 HP with a tank, by an accepted ultimatum, or by an accepted peace\. Mutual destruction and timeout round out the outcome space\.
The benchmark uses a \(near\) *rule\-only* prompt: the system prompt describes only rules and the JSON schema \(it names the active victory paths but gives no build order; two tactical seed phrases were present during data collection, see Section [2](https://arxiv.org/html/2606.24391#S2)\), so any strategy is the model’s own invention within that rule structure\.
This paper reports findings from 54 completed matches across 15 models and analyses the relative contributions of strategy choice, diplomacy and inference reliability\. A second, broader contribution is methodological: because every match records both the model’s actions and its free\-text messages, the corpus doubles as a dataset for studying *how* LLMs reason under partial observability \(their belief\-tracking, their spontaneous deception and concealment, and their stable per\-model cognitive styles\) rather than only *how well* they score\.
## 2 Game design
### 2\.1 Map and resources
A 13×7 board is split by a central mountain barrier \(column 6\) with seed\-driven passages; two mirrored territories flank it\. Each player starts with 5 credits and 0 uranium\. Resources come from deposits that players tap by building mines on them: *credit mines* \(\+3 credits/turn, standing in for oil/petroleum extractors\), and *uranium mines* \(\+1 uranium/turn\)\. Each territory holds one uranium deposit on its side, and a single *central uranium deposit* straddles the barrier on column 6, buildable by either player and therefore the focus of early contested ground, for three uranium deposits in total\. Deposits hold finite reserves and, once exhausted, respawn elsewhere, forcing redeployment\. The map is generated symmetrically for balance, but *this symmetry is never disclosed to the models*; enemy\-side deposits are hidden by fog and revealed only once scouted \(early engine versions did not reveal them at all, see Section [3](https://arxiv.org/html/2606.24391#S3)\)\.
### 2\.2 Fog of war and memory
All cells outside the base’s detection radius start dark\. Enemy units are visible only within a friendly unit’s or building’s detection range; out of range, they vanish\. Discovered enemy buildings and deposits are remembered \(with a last\_seen turn\), but destroyed buildings are dropped from memory\. The enemy uranium stockpile, the key resource for the bomb, is never revealed\. A launch is secret until detonation, although a new *early\-warning* signal informs a player of an enemy launch *only if* the player currently sees an enemy silo\.
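The memory rule can be illustrated with a minimal sketch\. It is not the engine's code \(which is private\): the class `FogMemory`, its method names and the dictionary layout are assumptions made for illustration; only the `last_seen` field name comes from the text\.

```python
from dataclasses import dataclass, field


@dataclass
class FogMemory:
    """Hypothetical per-player memory of enemy buildings and deposits."""
    # Maps a cell (col, row) to {"kind": str, "last_seen": int}.
    remembered: dict = field(default_factory=dict)

    def observe(self, turn, visible_objects):
        """Record every enemy building/deposit currently in detection range."""
        for cell, kind in visible_objects.items():
            self.remembered[cell] = {"kind": kind, "last_seen": turn}

    def forget_destroyed(self, cell):
        """Destroyed buildings are dropped from memory."""
        self.remembered.pop(cell, None)

    def early_warning(self, enemy_launched, currently_visible_kinds):
        """The launch cue fires only if an enemy silo is visible right now."""
        return enemy_launched and "silo" in currently_visible_kinds


# Illustrative usage with made-up cells and turns.
memory = FogMemory()
memory.observe(turn=5, visible_objects={(10, 3): "silo", (11, 2): "uranium_mine"})
memory.observe(turn=8, visible_objects={(11, 2): "uranium_mine"})
memory.forget_destroyed((10, 3))
```

Note that a remembered silo does not trigger the early warning in this sketch: the cue depends on current visibility, not on memory, which matches the "currently sees" wording above\.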
### 2\.3 Units and combat
Four HP\-less unit types are available: three form a tactical triangle in which each type defeats the next \(Fighter → Tank → SAM → Fighter\), plus a recon Drone\. Combat is immediately fatal to the loser \(the attacker survives mirror matchups, rewarding initiative\)\. Ground attacks are blocked when mountains or buildings interrupt the line of sight; air units ignore obstacles\. Only the Tank damages buildings \(2 HP/hit; the base has 4 HP, so two tank hits conquer it\)\. This yields a recursive defensive mini\-game \(Tank → Silo/Base, protected by Fighters, cleared by SAMs, themselves killed by Tanks\)\.
### 2\.4 Diplomacy
Four diplomatic channels are free \(they do not consume an action\): a short free message each turn; a ceasefire \(no attacks for 3 turns, \+6U bomb penalty\); peace \(immediate draw\); and an ultimatum \(“surrender before turn X”\)\. Accepting an ultimatum scores the loser 0\.5 points \(vs\. 0 for a clean defeat\), providing the only incentive to surrender\. Uranium is secret, so deterrence and bluff are feasible\.
### 2\.5 The nuclear bomb
A launch is gated by three conditions that must all hold at the moment the launch action is issued: \(i\) the player owns an *operational* silo \(one that is no longer under construction, i\.e\. built on a previous turn\); \(ii\) the player’s uranium stockpile meets the current bomb cost \(base 25U, decaying from turn 40 down to a floor of 13U, so late\-game launches grow cheaper\); and \(iii\) the player has previously *discovered* the enemy base location \(scouting it at least once\)\. A launch that fails any check is silently rejected as illegal\. The silo is *not* consumed by a launch: the building survives the firing mechanically, and since a successful launch destroys the enemy base and ends the match, this only matters in the narrow window where a launch fails \(insufficient uranium, undetected base\) and the player must wait and try again on a later turn with the same silo\. The silo is also a destructible building: it has HP and may be killed by enemy tanks, freeing the cell for a rebuild\. Launches resolve simultaneously at the end of the turn: a single launcher wins \(nuclear victory\); two simultaneous launchers cause mutual destruction \(both lose\)\. Launching is therefore a *bet*, not a guaranteed win\.
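The three gating conditions and the end\-of\-turn resolution can be sketched as follows\. This is a hedged illustration, not the private engine: `PlayerState`, `launch_is_legal` and `resolve_launches` are hypothetical names, and the current bomb cost is passed in as an argument because the text gives the base \(25U\) and floor \(13U\) but not the decay schedule\.

```python
from dataclasses import dataclass


@dataclass
class PlayerState:
    """Hypothetical snapshot of what the launch check needs to know."""
    has_operational_silo: bool   # silo built on a previous turn
    uranium: int                 # current stockpile
    enemy_base_discovered: bool  # enemy base scouted at least once


def launch_is_legal(player, bomb_cost):
    """All three gating conditions must hold when the launch is issued."""
    return (
        player.has_operational_silo
        and player.uranium >= bomb_cost
        and player.enemy_base_discovered
    )


def resolve_launches(a_launched, b_launched):
    """End-of-turn resolution of launches that already passed the check."""
    if a_launched and b_launched:
        return "mutual_destruction"  # both players lose
    if a_launched:
        return "nuclear_victory_a"
    if b_launched:
        return "nuclear_victory_b"
    return "no_launch"


# Illustrative usage at the base cost of 25U.
ready = PlayerState(has_operational_silo=True, uranium=25, enemy_base_discovered=True)
unscouted = PlayerState(has_operational_silo=True, uranium=30, enemy_base_discovered=False)
outcome = resolve_launches(launch_is_legal(ready, 25), launch_is_legal(unscouted, 25))
```

Because resolution looks only at who launched on that turn, a legal launch wins unless the opponent happens to be ready on exactly the same turn, which is why launching is a bet\.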
### 2\.6 Scoring
A win scores 3, a draw \(peace/timeout\) 1, a loss or mutual destruction 0, and an accepted ultimatum yields 3 to the proposer and 0\.5 to the accepter\. Models are ranked by points\_per\_match to prevent volume from inflating rank\. Each match’s starting player is random and alternates per turn\. We checked for a first\-player advantage and find none: across the 53 decided matches with a clear first\-to\-act, the first player won 24 \(45\.3%\), i\.e\. the alternating turn order appears balanced and does not confound the leaderboard\.
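As a worked example of the scoring rule, the sketch below maps outcomes to points and averages them per match\. The point values are those stated above; the outcome labels and the function name are illustrative assumptions, not the engine's identifiers\.

```python
# Point values from the scoring rule; the dictionary keys are hypothetical labels.
POINTS = {
    "win": 3.0,
    "draw": 1.0,                # peace or timeout
    "loss": 0.0,
    "mutual_destruction": 0.0,
    "ultimatum_proposer": 3.0,  # the side whose ultimatum was accepted
    "ultimatum_accepter": 0.5,  # the side that surrendered
}


def points_per_match(outcomes):
    """Average points over a model's matches, so volume does not inflate rank."""
    if not outcomes:
        raise ValueError("need at least one match")
    return sum(POINTS[o] for o in outcomes) / len(outcomes)


# Four made-up matches: (3 + 0 + 0.5 + 1) / 4
example = points_per_match(["win", "loss", "ultimatum_accepter", "draw"])
```

Averaging rather than summing means a model with 15 matches and one with 3 are placed on the same scale, although their estimates differ greatly in precision\.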
### 2\.7 The \(near\) rule\-only system prompt
A central design choice is that the system prompt is deliberately *rule\-only*: it describes the rules and the JSON action schema and gives *no opening build order, no timing advice, and no recommended strategy*\. The prompt never states that the map is symmetric, never reveals enemy deposit positions, never recommends when to rush the bomb versus push tanks, and never explains how to combine a ceasefire with a nuclear window\. We call it *near* rule\-only rather than strictly “advice\-free”, and we want to be precise about three respects in which the prompt does structure the strategic space\. The third is a genuine contamination of the “discovered, not recited” claim \(unrelated to data contamination, which the private\-engine design addresses; see Introduction\), which we flag:
- The prompt names the two active victory paths \(nuclear, military\) in its opening sentence and lists peace only as a way to “force a draw\.” This orients models toward *active* victory rather than the terminal draw, and plausibly contributes to the lopsided outcome mix\. It does not, however, favour one active path over the other\.
- Describing a rule is, in a weak sense, describing a constraint that enables a strategy\. The prompt states that only the tank damages buildings, that uranium is secret, that a silo needs one turn to become operational, and that a launch fails unless the enemy base has been scouted\. A human reading these rules would also converge on “scout, then either push tanks or race the bomb”; that convergence is a property of the rules, not of added advice\.
- Two tactical seed phrases were present during data collection\. The prompt used to generate the 54 reported replays contained two imperative phrases that are borderline tactical instructions rather than rule descriptions: “*Scout with a drone early*” and “*Push tanks \+ a scout into enemy ground to contest their economy*”\. We have removed these from the prompt reproduced in Appendix [A](https://arxiv.org/html/2606.24391#A1) \(the clean version\), but *the reported data were generated with the prompt that included them*\. This is a genuine contamination of the “discovered, not recited” claim \(not a data\-contamination issue\) and it falls squarely on the findings most likely to be affected: the prevalence of early scouting \(drones are the second\-most\-produced unit, 251\), the tank\-heavy production mix \(576 tanks\), and the “military path under\-attempted” reading\. We therefore treat these specific findings as *limitation\-flagged*: the rule structure plausibly pushes toward scouting and tank production on its own, but the two seed phrases may have amplified that push, and the reported magnitudes should not be read as pure model preference\. The nuclear\-vs\-military balance, the diplomacy analysis and the message\-tone analyses are not affected by these phrases\. A re\-run on the clean prompt is the cleanest fix but was not feasible within the compute budget \(Section [3](https://arxiv.org/html/2606.24391#S3)\); we flag this as the single most important data\-collection limitation\.
With these caveats, every strategy the models exhibit in the replays is *discovered* within the rule structure \(and, for scouting/tank production, partially seeded\), not recited from a build order\. This matters for the benchmark’s validity: any heuristic baked into the prompt would be learnt once and then merely executed, masking the model’s own reasoning\. The prompt also states explicitly that the agent is *stateless*: it is invoked fresh each turn with no hidden internal memory, so long\-horizon tactical planning must be reconstructed each turn from the observation\. The per\-turn observation is not, however, “the board alone”: each turn the model receives \(i\) the current board state as it sees it under fog of war; \(ii\) the outcome of its *immediately previous* turn, i\.e\. which actions succeeded or failed and which of its units were lost since it last played \(last\_turn\_results, events\_against\_you\), allowing it to react to combat losses and fix illegal actions; and \(iii\) the *full diplomatic record* for the match \(diplomacy\_history, truncated to the last 40 entries\), i\.e\. every message, proposal and response from both sides since turn 1, which is the only persistent long\-run signal the model carries\. Per\-turn tactical details from older turns are not replayed, so the model reconstructs the board from the current state while retaining the diplomatic context needed to judge the opponent’s honesty\. Agents are likewise *memoryless across matches*: each match starts from a clean context with no access to prior replays, so a strategy discovered in one match cannot be recited from a remembered earlier game; all within\-match strategy is reconstructed from the per\-turn observation only\. We discuss the framing effect further in the limitations\.
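The three\-part observation can be pictured with a short sketch\. The field names `last_turn_results`, `events_against_you` and `diplomacy_history` and the 40\-entry truncation come from the text; `TurnObservation`, `visible_board`, `build_observation` and the payload shapes are assumptions for illustration only\.

```python
from dataclasses import dataclass, field

MAX_DIPLOMACY_ENTRIES = 40  # truncation length stated in the text


@dataclass
class TurnObservation:
    """Hypothetical container for what a stateless agent receives each turn."""
    visible_board: dict        # board as seen under fog of war (assumed shape)
    last_turn_results: list    # which of the previous turn's actions succeeded/failed
    events_against_you: list   # own units lost since the model last played
    diplomacy_history: list = field(default_factory=list)


def build_observation(visible_board, last_turn_results, events_against_you,
                      full_diplomacy_log):
    """Assemble one observation; only diplomacy persists across turns."""
    return TurnObservation(
        visible_board=visible_board,
        last_turn_results=list(last_turn_results),
        events_against_you=list(events_against_you),
        # Keep only the most recent entries of the match-long record.
        diplomacy_history=list(full_diplomacy_log[-MAX_DIPLOMACY_ENTRIES:]),
    )


# Illustrative usage with a made-up 55-entry diplomatic log.
log = [f"message {i}" for i in range(55)]
obs = build_observation({}, ["move ok"], [], log)
```

Nothing tactical from earlier turns is carried forward in this structure, which is the sense in which the agent must rebuild its plan from the current state\.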
## 3 Experimental setup
Figure 1: The isometric web viewer rendering a replay\. The board is a 13×7 grid split by a central mountain barrier; fog\-of\-war darkens unseen cells, and per\-turn unit/building state, diplomacy and performance counters are stepped through turn by turn\. The viewer and all replays are public at [https://ageofllm\.org](https://ageofllm.org/)\.
We evaluate 15 models, all invoked at reasoning\_effort=high via OpenAI\-compatible endpoints \(Poe, Venice, OpenRouter\)\. Models include GPT\-5\.5, Claude Opus 4\.8, Claude Fable 5, Gemini Pro 3\.1, Gemini Flash 3\.5, GLM 5\.1, GLM 5\.2, DeepSeek V4 Pro, MiniMax M3, Kimi 2\.6, Kimi K2\.7 Code, Qwen 3\.7 Max, MiMo 2\.5 Pro, Nemotron 3 Ultra and Grok 4\.3\. The count of 15 includes both Kimi 2\.6 \(archived mid\-corpus as superseded\) and its successor Kimi K2\.7 Code; we keep both so the leaderboard reflects the models actually played, but the family therefore contributes two entries rather than one\.
#### A caveat on reasoning\_effort across providers\.
reasoning\_effort is an OpenAI\-native parameter; the three providers route it differently and not all models expose it natively\. Concretely: for OpenAI\-family models \(GPT\-5\.5\) and DeepSeek it is passed as a native reasoning\_effort field; for Claude models on Poe it is routed through the poe\_native SDK as an extended\-thinking token budget \(thinking\_budget\); for Venice it is passed via a reasoning object \(\{effort, enabled\}\); for OpenRouter it is passed via a reasoning object\. For providers/models that do not natively support a graded effort parameter, the setting may be a no\-op or be silently coerced to a default, and the per\-provider default behaviour for unrecognised parameters is not documented\. We therefore cannot guarantee that “high effort” denotes the *same* absolute reasoning budget across all 15 models; it denotes “maximal effort available through that provider’s interface for that model\.” This is a known limitation of cross\-provider benchmarking\. The model\-intrinsic metrics we rely on \(illegal\-action rate, tokens/turn\) are comparable across providers insofar as each provider reports tokens consistently, but even tokens/turn is partially provider\-confounded because the exposure of reasoning tokens varies by provider/route \(see limitations\)\. We report the per\-model effort status in Table [1](https://arxiv.org/html/2606.24391#S3.T1)\.
Table 1: How reasoning\_effort=high is routed per provider, and the effort status for the model families in the corpus\. “Native” = the parameter is honoured as a graded effort control; “budget” = routed as a thinking\-token budget; “uncertain” = the parameter is accepted but its effect on the model’s actual reasoning depth is not verifiable through the provider\.
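To make the routing differences concrete, the sketch below builds the extra request fields for each route as plain dictionaries\. It is an assumption\-laden illustration rather than the harness actually used: the route labels, the function name and the inner shape of the OpenRouter object are hypothetical, the thinking\-budget size is not given in the text and must be supplied by the caller, and the temperature of 0\.7 is the sampling setting reported for all models\.

```python
def build_request_params(route, thinking_budget_tokens=None):
    """Return hypothetical extra request fields for a 'high effort' call.

    route is one of the illustrative labels:
    "native", "poe_budget", "venice", "openrouter".
    """
    params = {"temperature": 0.7}  # no fixed API seed is set
    if route == "native":
        # OpenAI-family and DeepSeek: graded effort field.
        params["reasoning_effort"] = "high"
    elif route == "poe_budget":
        # Claude on Poe: effort expressed as a thinking-token budget.
        if thinking_budget_tokens is None:
            raise ValueError("a thinking budget must be supplied for this route")
        params["thinking_budget"] = thinking_budget_tokens
    elif route == "venice":
        # Venice: reasoning object with effort and enabled keys.
        params["reasoning"] = {"effort": "high", "enabled": True}
    elif route == "openrouter":
        # OpenRouter: reasoning object; the inner key is assumed here.
        params["reasoning"] = {"effort": "high"}
    else:
        raise ValueError(f"unknown route: {route}")
    return params


# Illustrative usage; 16000 is a made-up budget, not a value from the paper.
native = build_request_params("native")
poe = build_request_params("poe_budget", thinking_budget_tokens=16000)
```

The four branches produce structurally different payloads for the same nominal setting, which is the reason "high effort" cannot be assumed to mean the same absolute budget everywhere\.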
#### Pairing scheme\.
The 54 matches do not form a balanced round\-robin\. Pairings were selected to populate the leaderboard across model tiers rather than to exhaust every pairing, and match counts per model are uneven \(3–15, Table [2](https://arxiv.org/html/2606.24391#S4.T2)\)\. Each match uses a fresh random map seed and a random starting side, but the corpus is *not* systematically side\-swapped \(the same ordered pairing is rarely played with sides exchanged\)\. Consequently points/match is not directly comparable across models that faced different opponent mixes, and the leaderboard should be read as indicative rather than as a head\-to\-head ranking; we report bootstrap 95% confidence intervals per model below \(Table [2](https://arxiv.org/html/2606.24391#S4.T2), note\) and recommend at least 20 side\-swapped matches per pairing for future stable estimates\.
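A per\-model bootstrap interval of this kind can be computed as in the sketch below\. The text does not specify the resampling scheme, so the percentile bootstrap over a model's matches, the number of resamples and the fixed seed are all assumptions; the example scores are made up\.

```python
import random


def bootstrap_ci(points, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean points per match.

    points: per-match scores of one model (e.g. 3, 1, 0.5 or 0).
    """
    if not points:
        raise ValueError("need at least one match")
    rng = random.Random(seed)  # seeded so the interval is repeatable
    n = len(points)
    # Resample matches with replacement and record each resample's mean.
    means = sorted(
        sum(rng.choices(points, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Illustrative usage on eight made-up match scores.
scores = [3.0, 0.0, 3.0, 1.0, 0.0, 3.0, 0.5, 3.0]
low, high = bootstrap_ci(scores, n_resamples=2000, seed=1)
```

With only 3 to 15 matches per model such intervals are wide, which is consistent with reading the leaderboard as indicative only\.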
#### Sampling parameters and determinism\.
All models are invoked at reasoning\_effort=high with sampling temperature 0\.7 and no fixed API seed \(top\-p/top\-k left at each provider’s default\), so *model decisions are stochastic*: re\-running a match from the same map seed will not reproduce the same actions\. Determinism holds only at the engine level \(map generation and rule resolution follow the stored seed\) and, by design, only the saved replay JSON is reproducible, not a fresh re\-execution\. The turn limit is 80 \(the engine default max\_turns\); no match in the corpus reaches it, so the timeout outcome is unobserved by construction\. We disclose that the bomb\-cost decay rule \(25U decaying from turn 40 to a floor of 13U, Section [2](https://arxiv.org/html/2606.24391#S2)\) never triggers in this corpus, since every match ends by turn 23; the decay is therefore an unexercised late\-game mechanic and plays no role in any reported outcome\.
#### Data snapshot\.
All results in this paper are computed on a frozen snapshot of 54 completed matches across 15 models, taken on 2026\-06\-18\. The benchmark is run continuously, so the public leaderboard may since contain additional matches\. We disclose a potential conflict of interest: two of the evaluated models, GLM 5\.2 and Claude Opus 4\.8, were also used as automated assistants during the preparation of this paper \(GLM 5\.2 generated the empirical analysis scripts; Claude Opus 4\.8 independently re\-derived every statistic from the raw replays, see Acknowledgements\)\. To guard against any self\-favouring bias, their matches were run before this assistant role was assigned, the engine is fully deterministic from its stored seed, and every reported number was re\-derived independently from the raw replays\. The exact 54\-match corpus and the frozen per\-model aggregates used for every figure and table are archived alongside the paper sources for reproducibility\.
#### Engine versions and rule evolution\.
The 54 completed replays were produced across a sequence of engine versions \(0\.9\.2, 0\.9\.3, 0\.10\.0, 0\.11\.0, 0\.12\.0, 0\.14\.0 and 0\.15\.0\), during which the rules were progressively refined: line\-of\-sight blocking and resource depletion were added in v0\.10\.0, the base HP was lowered from 8 to 4 in v0\.11\.0 \(making military conquest a two\-tank\-hit affair\), and the SAM move range and mine costs were rebalanced in v0\.12\.0\. A second fog\-of\-war refinement concerns enemy\-side resource deposits: in early versions \(v0\.9\.x–v0\.10\.x\) enemy deposits were *never* revealed to a player, not even when a friendly unit moved adjacent to them, so enemy economy was entirely invisible; from v0\.11\.0 onward an enemy deposit that enters a unit’s field of view is revealed and then remembered \(with a last\_seen turn\), matching the treatment of enemy buildings\. This makes late\-game economic raiding \(destroying an enemy mine to claim its deposit\) informationally feasible only in the post\-v0\.11\.0 matches\. The rules described in Section [2](https://arxiv.org/html/2606.24391#S2) correspond to the current engine \(v0\.15\.0\) and are accurate for all replays from v0\.11\.0 onward \(36 of 54 matches, including every military and ultimatum outcome and the lone peace\)\. The 18 pre\-v0\.11\.0 matches all used an 8\-HP base; the 14 v0\.9\.x matches additionally lacked line\-of\-sight blocking and resource depletion \(introduced in v0\.10\.0\)\. All 18 resolved *exclusively* by nuclear victory, so they do not affect the military/diplomatic findings and reinforce, rather than contradict, the nuclear\-dominance result\.
The v0\.15\.0 nuclear early\-warning signal \(enemy\_launch\_detected\) is present in the schema of the 2 v0\.15\.0 replays but *never fired* \(it requires the player to currently see an enemy silo, and no eventual loser ever did before the decisive launch\); the deterrence question \(Section [4](https://arxiv.org/html/2606.24391#S4)\) is therefore assessed without an active early\-warning cue, and whether the signal would change this behaviour is left to future work\. We treat the cross\-version corpus as a single benchmark because the three load\-bearing pillars \(fog of war \+ memory, full diplomacy, nuclear deterrence\) and the action schema are unchanged throughout\.
#### Reproducibility and engine access\.
The replay JSON is the sole contract between the Python engine and a static web viewer; it serialises per\-turn absolute board state, per\-player visible cells \(to replay the true fog\), actions, diplomacy and performance counters \(think time, tokens, retry buckets\)\. We aggregate these into a leaderboard\. The *web viewer and all replays are public*; the *engine source is kept private by design* so that future frontier models cannot train on, or memorise, the engine’s internal logic and thereby contaminate the benchmark\. Researchers who wish to reproduce matches locally or audit the rules may request engine access from the corresponding author at arnaud\.ri@protonmail\.ch\. To enable independent verification of the *rule\-only* claim without releasing the engine, the complete system prompt \(rules \+ JSON schema, no build\-order advice\) is reproduced verbatim in Appendix [A](https://arxiv.org/html/2606.24391#A1); note that this is the *clean* prompt, and the 54 replays were generated with an earlier version that contained two tactical seed phrases \(Section [2](https://arxiv.org/html/2606.24391#S2)\)\. We emphasise that, because model outputs are sampled \(temperature=0\.7, no fixed API seed\), only the archived replay JSON is bit\-reproducible; a fresh run from the same map seed produces a different game\.
#### Counting launches\.
The engine resolves a launch only at the end of the turn, after both players have played their half\-turn; to display the bomb’s effect the viewer captures one extra display frame after resolution, so the raw replay contains 92 launch\-tagged actions for 46 actual model\-issued launches\. All statistics in this paper count *model\-issued* launches only \(46\), unless explicitly noted\.
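One way to collapse the duplicated display frames is to deduplicate launch\-tagged records by a launch key, as sketched below\. The record fields `match_id`, `player` and `turn` are hypothetical \(the replay schema is not reproduced here\), and the sketch assumes the viewer's extra frame repeats the same key as the original launch\.

```python
def count_model_issued_launches(launch_records):
    """Count distinct launches, ignoring the viewer's duplicate display frame.

    Each record is a dict with hypothetical keys: match_id, player, turn.
    """
    distinct = {
        (rec["match_id"], rec["player"], rec["turn"]) for rec in launch_records
    }
    return len(distinct)


# Illustrative usage: two launches, each recorded twice in made-up replays.
records = [
    {"match_id": "m1", "player": "A", "turn": 17},
    {"match_id": "m1", "player": "A", "turn": 17},  # extra display frame
    {"match_id": "m2", "player": "B", "turn": 19},
    {"match_id": "m2", "player": "B", "turn": 19},  # extra display frame
]
n_launches = count_model_issued_launches(records)
```

Under that assumption the raw count halves, which is the relationship between the 92 tagged actions and the 46 model\-issued launches reported above\.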
#### A note on deliberation time and providers\.
Because models are served through three different providers, the measured per\-turn think time reflects not only the model’s internal reasoning but also provider\-side scheduling, queuing and rate\-limiting\. The effect is minor relative to the order\-of\-magnitude spread between models, but it is non\-zero: some providers throttle long high\-effort calls, and a few models \(notably MiniMax M3 and MiMo 2\.5 Pro\) accumulate timeout retries partly because of provider latency rather than pure model deliberation\. We therefore treat think time as an indicative, not exact, measure of reasoning effort\. We rely primarily on the illegal\-action rate for cross\-model comparison, which is provider\-independent\. Tokens/turn is only *partially* model\-intrinsic: it is confounded by the fact that different providers expose \(or strip\) reasoning tokens differently, so the reasoning\-token component of tokens/turn is not strictly comparable across providers; we report it nonetheless as a coarse indicator, and the correlations involving tokens/turn \(Section [4](https://arxiv.org/html/2606.24391#S4)\) should be read with this caveat\.
## 4 Results
### 4\.1 Victory channels
Across the 54 completed matches, the corpus\-wide outcome distribution is: nuclear 46 \(85%\), military 4 \(7%\), ultimatum 3 \(6%\), peace 1 \(2%\), mutual destruction 0, timeout 0 \(Figure [2](https://arxiv.org/html/2606.24391#S4.F2)a\)\. Because this mix pools 7 engine versions whose rules change the military/nuclear balance, our *primary* estimate is the rules\-coherent v0\.11\+ sub\-corpus reported immediately below \(nuclear 78%, military 11%\); we keep the corpus\-wide figures here for continuity with the public leaderboard and because the qualitative findings \(military wins faster, diplomacy rare\) are stable across both cuts\. Match length averages 18\.3 turns \(range 9–23\)\. The nuclear path is overwhelmingly dominant; the military conquest path, which the v0\.11 rebalance \(base HP 8 → 4\) was specifically intended to make competitive, is realised by only three models \(GPT\-5\.5 once, GLM 5\.1 twice, Kimi K2\.7 Code once\)\. Strikingly, military wins are *faster*: mean 12\.3 turns \(min 9, max 16\) versus 18\.9 for nuclear wins \(min 16, max 23, Figure [2](https://arxiv.org/html/2606.24391#S4.F2)b\)\. The fastest victory in the entire corpus is GLM 5\.1’s 9\-turn tank conquest of Grok 4\.3; when a model does commit to the ground push, it ends the game earlier than the nuclear cycle, suggesting the military path is high\-skill but under\-attempted rather than inherently slow\. The lone peace was accepted in the GLM 5\.2 vs\. DeepSeek V4 Pro match \(Section [4\.5](https://arxiv.org/html/2606.24391#S4.SS5)\)\.
#### Primary estimate: the rules\-coherent v0\.11\+ sub\-corpus\.
The headline outcome mix is computed on the 36 matches played under v0\.11\.0–0\.15\.0, the rules\-coherent sub\-corpus in which the military\-determining mechanics \(base HP 4, line\-of\-sight blocking, resource depletion, enemy\-deposit reveal\) are all fixed\. This gives nuclear 28 \(78%\), military 4 \(11%\), ultimatum 3 \(8%\), peace 1 \(3%\)\. We treat 78% as the cleanest available estimate of nuclear dominance under the described rules\. The full 54\-match corpus adds 18 pre\-v0\.11 matches, all played under an 8\-HP base \(14 of them also without LOS blocking or depletion\) and all resolved exclusively by nuclear victory; it gives a more lopsided 85% nuclear / 7% military\. The gap \(78% vs\. 85%, 11% vs\. 7%\) is consistent with the pre\-v0\.11 rules making the military path structurally harder, inflating the corpus\-wide nuclear share\. We therefore lead with 78% and report 85% only as the corpus\-wide robustness figure\. The illegal\-action rate is stable across the two cuts \(5\.9% v0\.11\+ vs\. 5\.6% full corpus\)\.
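The arithmetic behind the two cuts can be reproduced from the outcome counts alone\. In the sketch below the counts are those reported in the text; the function and variable names are illustrative\.

```python
def outcome_shares(counts):
    """Convert outcome counts to whole-number percentages of the total."""
    total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()}


# Outcome counts as reported for the two cuts of the corpus.
V011_PLUS = {"nuclear": 28, "military": 4, "ultimatum": 3, "peace": 1}    # 36 matches
FULL_CORPUS = {"nuclear": 46, "military": 4, "ultimatum": 3, "peace": 1}  # 54 matches

primary = outcome_shares(V011_PLUS)
corpus_wide = outcome_shares(FULL_CORPUS)

# The pre-v0.11 matches are the difference between the two cuts.
pre_v011 = {name: FULL_CORPUS[name] - V011_PLUS[name] for name in FULL_CORPUS}
```

The difference between the cuts is entirely nuclear \(18 matches\), so every non\-nuclear outcome sits in the v0\.11\+ sub\-corpus and the corpus\-wide nuclear share rises mechanically when the older matches are pooled in\.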
We are explicit that v0\.11\+ is *not* a single\-ruleset corpus: it still spans v0\.11\.0 \(17 matches\), v0\.12\.0 \(8\), v0\.14\.0 \(9\) and v0\.15\.0 \(2\), with a minor SAM\-move\-range / mine\-cost rebalance introduced in v0\.12\.0\. No single version has enough matches to serve as the corpus \(v0\.15\.0 has only 2\), so v0\.11\+ is the best compromise available; the residual inter\-version drift is minor relative to the 8\-HP\-vs\-4\-HP change that defines the pre/post\-v0\.11 split\. A single\-ruleset study on a future larger v0\.15\.0 corpus would tighten the estimate further\.
Figure 2: \(a\) Victory channels across 54 completed matches\. The nuclear rush accounts for 78% of outcomes on the rules\-coherent v0\.11\+ sub\-corpus \(85% corpus\-wide, see text\); peace occurs once, mutual destruction and timeout never\. \(b\) Match length by victory type\. Military conquests, though rare, end the match substantially faster \(mean 12\.3 turns\) than nuclear wins \(18\.9\)\.
### 4\.2 The nuclear signature: sole\-launcher, single fire
A fine\-grained inspection of the launch events reveals a stark behavioural regularity \(Figure [3](https://arxiv.org/html/2606.24391#S4.F3)\)\. In *all 46* nuclear matches, the eventual winner was the *only* player to ever issue a launch, and it did so exactly once, in the decisive turn; the loser launched *zero* times across the entire match\. Consequently no match ever featured two distinct players launching in the same turn, and the mutual\-destruction outcome, designed into the engine as the symmetric equilibrium of simultaneous launches, was *never* observed\. Launches occur exclusively in turns 16–23 \(peaking at turn 17, 11 matches\), i\.e\. launches cluster tightly around the average match end \(turn 18\.3\) and *no* model achieves an early rush\.
A silo is*not*consumed by a launch: the building survives the firing, so a silo whose launch was rejected \(e\.g\. for insufficient uranium\) could retry on a later turn; but since a successful launch ends the match, no winner ever fired more than once\. Separately, a silo is a destructible building \(it has HP and can be killed by enemy tanks\); across the corpus a silo is built 112 times over 54 matches, i\.e\. silos are frequently destroyed and rebuilt within a match, but this never prevented a nuclear outcome because the winner’s silo was always operational at the decisive turn\.
We read the signature with care, because under the rules it is*largely mechanical, not cognitive*\. A launch resolves only at the end of the turn, after both players have acted; the in\-flight bomb is secret \(uranium is hidden, and there is no launch cue except the v0\.15\.0 early\-warning signal, which fires only if the player currently sees an enemy silo and*never fired*in any replay\)\. In*all 46 nuclear matches*the eventual loser therefore had*no information*that a launch was in flight at the moment it had to decide whether to counter\-launch \(we restrict the claim to the 46 nuclear matches, since the 8 non\-nuclear outcomes—military, ultimatum, peace—involve no launch at all and are irrelevant to deterrence\)\. Counter\-launching in the same turn would have required the loser to have independently reached bomb readiness \(operational silo \+ sufficient uranium \+ scouted enemy base\) on exactly that turn—i\.e\. to be at parity by chance, with no cue\. The absence of mutual destruction is thus primarily a consequence of the secret, simultaneous resolution design, not evidence that the models “failed to deter\.” The genuine, weaker cognitive finding is the absence of*pre\-emptive*launches: no losing model, even when it had built a silo, launched speculatively to guard against a possible enemy launch\. Whether the v0\.15\.0 early\-warning signal \(when active\) would induce counter\-launches and mutual destructions is an open question, and a natural next experiment\.
Figure 3:Launch timing across the corpus\. All 46 model\-issued launches fall in turns 16–23, clustering around the mean match end \(18\.3 turns\)\. The inset summary states the signature regularity: in every nuclear match the winner was the sole launcher, the loser never launched, and no mutual destruction occurred—a pattern that is largely mechanical under the secret\-simultaneous launch rules \(see text\), not a cognitive deterrence failure\.
### 4\.3Leaderboard \(preliminary descriptive view\)
We present the leaderboard as a *preliminary descriptive view*, not a contribution of the paper\. Match counts per model are uneven \(3–15\), the pairing is not a balanced round\-robin and is not side\-swapped \(Section [3](https://arxiv.org/html/2606.24391#S3)\), and each match is a single stochastic run \(temperature=0\.7, no fixed seed\), so there is no run\-to\-run variance estimate and a single match can turn on a stochastic token\. At this *n*, points\-per\-match \(ppm\) is *not directly comparable* across models that faced different opponent mixes, and the ordering is indicative only\.
Table [2](https://arxiv.org/html/2606.24391#S4.T2) ranks models by ppm\. GPT\-5\.5 is undefeated \(7W–0L, 3\.00 ppm\), but with only 7 matches the 95% Clopper–Pearson interval on its win rate is wide \(≈\[0\.59, 1\.00\]\), so the record should not be read as established invincibility\. The mid\-tier is dense \(Gemini Flash, MiniMax M3, Claude Fable, DeepSeek V4 Pro, GLM 5\.1, Claude Opus 4\.8, GLM 5\.2\); their ppm bootstrap CIs overlap almost entirely \(e\.g\. Gemini Flash \[1\.50, 3\.00\], DeepSeek V4 Pro \[1\.13, 2\.60\], GLM 5\.2 \[0\.50, 2\.67\]\), so their relative ordering is *not* statistically distinguishable at this sample size\. The bottom is occupied by models with either weak reasoning \(Grok 4\.3, 0W\) or severe reliability issues \(Kimi variants, MiMo 2\.5 Pro\)\.
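The width of the interval for a 7–0 record can be checked with a few lines of standard\-library Python\. This is a sketch of the textbook Clopper–Pearson construction solved by bisection, not the paper's own code; the function names are hypothetical\.

```python
from math import comb


def binom_tail_ge(x, n, p):
    """P(X >= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x, n + 1))


def solve_increasing(f, target, iters=100):
    """Find p in [0, 1] with f(p) = target, for f increasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(x, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for a binomial proportion."""
    # Lower bound: P(X >= x | p) = alpha / 2.
    lower = 0.0 if x == 0 else solve_increasing(
        lambda p: binom_tail_ge(x, n, p), alpha / 2)
    # Upper bound: P(X <= x | p) = alpha / 2, i.e. P(X >= x + 1 | p) = 1 - alpha / 2.
    upper = 1.0 if x == n else solve_increasing(
        lambda p: binom_tail_ge(x + 1, n, p), 1 - alpha / 2)
    return lower, upper


low, high = clopper_pearson(7, 7)
print(round(low, 2), high)
```

For an undefeated record the lower bound has the closed form \(α/2\)^\(1/n\), which for n = 7 and α = 0\.05 is about 0\.59, matching the interval quoted above\.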
#### An opponent\-adjusted ranking \(Bradley–Terry\)\.
Because ppm ignores *who* a model played, we also fit a Bradley–Terry model \(each match contributes a win/loss/draw between two model strengths; draws score 0\.5\) with 1,000\-resample bootstrap confidence intervals, a standard tool for unbalanced tournament data\. The BT ordering agrees with the ppm ordering \(Spearman ρ ≈ 0\.99 between the two rankings\): GPT\-5\.5, Gemini Flash 3\.5 and MiniMax M3 are the top three in both, and Grok 4\.3 / Kimi 2\.6 are last\. We caution against reading this agreement as a validation of robustness: with such unbalanced, low\-*n* data, ppm and BT converge almost by construction \(both are monotone functions of the same sparse win/loss matrix\), so the agreement is expected rather than reassuring\. The BT intervals are very wide \(the top\-three CIs all overlap\), so even the opponent\-adjusted ranking cannot separate the leaders\. We report ppm for continuity with the public leaderboard and treat BT as a sanity check on gross distortion, not as evidence of a stable ranking; a stable ranking will require side\-swapped play with ≥20 matches per pairing and repeated runs \(Section [3](https://arxiv.org/html/2606.24391#S3), future work\)\.
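For readers unfamiliar with Bradley–Terry fitting, the sketch below uses the classic minorise\-maximise update with draws counted as half a win for each side\. It is an illustration under stated assumptions: the match list is a hypothetical toy example \(models A, B, C\), not corpus data, and the paper's actual fitting code and bootstrap are not reproduced here\.

```python
from collections import defaultdict


def fit_bradley_terry(results, iters=500):
    """Fit strengths from (model_a, model_b, score_a), score_a in {1, 0.5, 0}."""
    wins = defaultdict(float)   # fractional wins per model
    games = defaultdict(float)  # games played per unordered pair
    models = set()
    for a, b, score_a in results:
        models |= {a, b}
        wins[a] += score_a
        wins[b] += 1.0 - score_a
        games[frozenset((a, b))] += 1.0

    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            denom = sum(
                games[frozenset((i, j))] / (strength[i] + strength[j])
                for j in models
                if j != i
            )
            updated[i] = wins[i] / denom if denom > 0 else strength[i]
        total = sum(updated.values())
        strength = {m: v / total for m, v in updated.items()}  # normalise to 1
    return strength


# Hypothetical toy results, for illustration only.
TOY_RESULTS = (
    [("A", "B", 1)] * 3 + [("A", "B", 0)]
    + [("B", "C", 1)] * 2 + [("B", "C", 0)]
    + [("A", "C", 1)] * 2 + [("A", "C", 0.5)]
)

toy_strength = fit_bradley_terry(TOY_RESULTS)
for model, value in sorted(toy_strength.items(), key=lambda kv: -kv[1]):
    print(model, round(value, 3))
```

A bootstrap interval would refit the same function on match lists resampled with replacement; with only a handful of matches per pairing, those intervals are expected to be wide, as the text reports\.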
Table 2: Leaderboard \(sorted by points/match\)\. *ppm* = points per match, *W/L/D* = wins/losses/draws, *tt* = avg think time per half\-turn \(s\), *tok* = tokens/turn, *inv* = illegal\-action rate, *$* = cost/match \(USD\)\. All models are evaluated on the same deterministic engine\. The cost column reflects API spend \(input\+output tokens at each provider’s pricing\) and is reported as a sampling constraint rather than a performance metric: cheaper models permit more matches per budget, but cost does not track winning \(GPT\-5\.5 at $1\.45/match and Grok 4\.3 at $0\.10/match sit at opposite ends of the table\)\. † archived \(superseded by K2\.7\)\.
Figure 4:The public live leaderboard at[https://ageofllm\.org](https://ageofllm.org/), ranked by points per match\. The frozen 54\-match snapshot analysed in this paper \(Table[2](https://arxiv.org/html/2606.24391#S4.T2)\) is a strict subset of this continuously updated view\.
### 4\.4Deliberation does not predict winning
With only n = 15 models, every correlation in this subsection has low statistical power and wide uncertainty; we report *p*\-values and 95% bootstrap confidence intervals \(10,000 resamples\) and treat the patterns as *exploratory* rather than confirmed\. Pearson correlations between points/match and model\-level aggregates are weak and individually non\-significant: illegal\-action rate r = −0\.35 \(p = 0\.20, 95% CI \[−0\.73, \+0\.17\]; Spearman ρ = −0\.45, p = 0\.09\), deliberation time r = \+0\.19 \(p = 0\.51, CI \[−0\.48, \+0\.57\]; ρ = \+0\.13\), and tokens/turn r = −0\.24 \(p = 0\.39, CI \[−0\.69, \+0\.56\]; ρ = \+0\.10\)\. The illegal\-action rate is nonetheless the strongest \(negative\) single correlate of winning \(Figure [5](https://arxiv.org/html/2606.24391#S4.F5)\)\. Because these correlations share the points/match variable, we compare them with Steiger’s *Z* test for dependent correlations\. We treat the *tokens/turn* comparison as the primary formal test, because deliberation time is provider\-confounded \(Section [3](https://arxiv.org/html/2606.24391#S3)\) and feeding a noisy variable into the Steiger test would bias it toward the null and could manufacture a spurious “borderline” gap\. The gap between the illegal\-action correlation and the tokens/turn correlation is not significant \(p = 0\.71\)\. For completeness, the illegal\-vs\-think gap is borderline \(t = −2\.18, p = 0\.0499\), but given the provider confound on think time we do not interpret this p = 0\.0499 as robust\. Correcting for the five correlations we tested \(illegal, think, tokens, courtesy, bluff\) with Bonferroni, *none* remains significant at α = 0\.05; the directional ordering \(illegal \> think ≈ tokens in predictive strength\) is therefore suggestive, not established\.
Descriptively, GPT\-5\.5 thinks nearly six times longer than Gemini Flash per half\-turn yet both sit near the top; conversely Kimi variants burn the most tokens \(25k–29k/turn\) and lose\. We read this as tentative evidence that, under this benchmark’s sequential, partially\-observable regime, the*action\-legality*dimension tracks winning at least as well as raw deliberation volume, while noting that action\-legality is itself a product of reasoning \(belief\-tracking\), so the two are not independent\. We caution that the deliberation\-time comparison is partly confounded by provider differences \(see Section[3](https://arxiv.org/html/2606.24391#S3)\): some models are served through providers that throttle or queue long calls, so the measured think time is not a pure reflection of model reasoning\. The illegal\-action rate, by contrast, is provider\-independent and is the metric we trust most; tokens/turn is only partially model\-intrinsic \(reasoning\-token exposure varies by provider, Section[3](https://arxiv.org/html/2606.24391#S3)\) and is reported as a coarse indicator only\.
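The bootstrap intervals quoted in this subsection follow a standard recipe that can be sketched as below\. This is an assumption\-laden illustration: it uses a percentile bootstrap over model\-level pairs \(the paper does not state which bootstrap variant it used\), and the two data vectors are hypothetical placeholders, not the per\-model values from Table 2\.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def bootstrap_ci(xs, ys, resamples=10_000, level=0.95, seed=0):
    """Percentile bootstrap CI for Pearson r, resampling (x, y) pairs."""
    rng = random.Random(seed)
    n = len(xs)
    stats = []
    while len(stats) < resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        if len(set(bx)) < 2 or len(set(by)) < 2:
            continue  # r is undefined when a resample is constant
        stats.append(pearson(bx, by))
    stats.sort()
    lo = stats[int((1 - level) / 2 * resamples)]
    hi = stats[int((1 + level) / 2 * resamples) - 1]
    return lo, hi


# Hypothetical placeholder data for 15 models (illegal rate %, points/match).
ILLEGAL = [1.1, 2.5, 3.0, 3.4, 4.0, 4.4, 5.0, 5.2, 5.9, 6.3, 6.8, 7.1, 7.5, 8.0, 8.6]
PPM = [1.6, 2.0, 1.8, 3.0, 2.3, 1.2, 2.1, 1.9, 0.8, 2.2, 1.5, 0.9, 1.4, 0.4, 0.0]

r = pearson(ILLEGAL, PPM)
ci_low, ci_high = bootstrap_ci(ILLEGAL, PPM)
print(round(r, 2), round(ci_low, 2), round(ci_high, 2))
```

With 15 points, resampled correlations swing widely from draw to draw, which is why the reported intervals straddle zero even for the strongest correlate\.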
Figure 5: Points per match versus illegal\-action rate\. The negative correlation \(r = −0\.35, p = 0\.20, n = 15\) is the strongest of the three single correlates of winning we tested, exceeding in magnitude deliberation time \(r = \+0\.19\) and tokens/turn \(r = −0\.24, encoded as marker size\), but none is individually significant and the trend line is illustrative only \(see text for Steiger test and Bonferroni\-corrected interpretation\)\. Marker colour encodes the model’s total API retries: green = 0, blue = 1–19, red ≥ 20\. The most error\-prone models \(Grok 4\.3, Kimi 2\.6\) sit at the bottom; notably Grok 4\.3 records zero API retries \(green\) yet still loses because of its high illegal\-action rate \(8\.6%\), illustrating that action\-level unreliability, not just API\-level unreliability, accompanies losing at the bottom of the board\.
### 4\.5Diplomacy is proposed but rarely consummated
Across the corpus we count 1,804 free messages, 46 ceasefire proposals, 59 ultimatums and 14 peace proposals \(Figure[6](https://arxiv.org/html/2606.24391#S4.F6)\)\. Yet only*one*of the 14 peace proposals was accepted \(12 explicitly refused, 1 ignored\) and only 3 of 46 ceasefires were accepted \(43 refused, 0 ignored, every ceasefire drew a response\)\. The sole diplomatic path to victory was the ultimatum: 3 of 59 were accepted \(53 refused, 3 ignored\), and those three acceptances are exactly the three ultimatum victories \(two by Gemini Flash 3\.5 and one by DeepSeek V4 Pro\)\.
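The funnel counts above are internally consistent, which the following sketch verifies while computing the acceptance rate per channel\. The data structure and function name are hypothetical; the counts are those reported in the text\.

```python
# (accepted, refused, ignored) per binding channel, as reported in the text.
PROPOSALS = {
    "peace": (1, 12, 1),
    "ceasefire": (3, 43, 0),
    "ultimatum": (3, 53, 3),
}


def acceptance_table(proposals):
    """Return {channel: (total, acceptance rate in percent, 1 decimal)}."""
    table = {}
    for channel, (accepted, refused, ignored) in proposals.items():
        total = accepted + refused + ignored
        table[channel] = (total, round(100 * accepted / total, 1))
    return table


funnel = acceptance_table(PROPOSALS)
for channel, (total, rate) in funnel.items():
    print(f"{channel}: {total} proposals, {rate}% accepted")
```

All three channels convert at single\-digit rates \(roughly 5–7%\), so the difference between them lies in what an acceptance does, not in how often it happens\.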
The two binding settlement channels have structurally different incentives, and the data reflects this asymmetry\. Peace is terminal: once accepted, the match ends *immediately* as a draw \(1 point each\), so the opponent *cannot* launch a bomb afterwards\. Refusing peace is therefore not a safety concern but a value bet: a model refuses peace when it believes it can still win \(3 points beats 1\), which is rational whenever its nuclear or military push is on track\. The single accepted peace occurred in the GLM 5\.2 vs\. DeepSeek V4 Pro match at turn 16, and it is the one match where the message channel produced a non\-nuclear outcome\. Both sides had scouted the other’s base early \(by turn 5\) and both had built a silo, but they were far apart in the uranium race: DeepSeek held 24U \(one short of the 25U bomb cost\) with an operational silo, while GLM 5\.2 lagged at 11U and could not launch\. DeepSeek used the message channel to issue an explicit launch threat \(“I have 25 uranium now and my silo is operational; accept peace for a draw, or face defeat”\), and GLM 5\.2, judging the threat credible and unable to counter\-launch in time, accepted the guaranteed draw over likely nuclear defeat \(“Your uranium position was credible and I couldn’t reach your silo in time\. A draw is better than mutual destruction”\)\. This is the corpus’s sole instance of successful deterrence via communication, and it is telling that it took a model that was genuinely behind \(GLM 5\.2\) to prefer 1 guaranteed point over a contested 3\. The near\-absence of accepted peace elsewhere is consistent with models correctly assessing that, in a race that the average match resolves by turn 18, neither side is usually far enough behind to prefer a guaranteed 1 point over a contested 3\.
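The value bet behind refusing peace can be written as a one\-line expected\-value comparison\. This is a simplified sketch rather than a rule of the engine: it assumes a refused peace ends in either a win \(3 points\) or a loss \(0 points\), and q is a hypothetical symbol for the model's subjective probability of winning\.

```latex
\mathbb{E}[\text{refuse}] = 3q + 0\cdot(1-q) = 3q,
\qquad
\mathbb{E}[\text{accept}] = 1,
\qquad
\text{refuse} \iff 3q > 1 \iff q > \tfrac{1}{3}.
```

Under this reading, accepting peace is what one would expect only from a model that puts its own winning chances below one third, which fits the single acceptance by a model that was clearly behind in the uranium race\.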
Ceasefire, by contrast, is the channel that carries a genuine strategic dilemma\. A ceasefire forbids conventional*attacks*for three turns, but it does*not*stop the nuclear bomb: either side may still launch during a ceasefire at a \+6U penalty\. Accepting a ceasefire therefore grants the opponent up to three uninterrupted turns to finalise its silo and uranium stockpile and launch in relative safety, while denying you the ability to disrupt it with tanks\. The low ceasefire acceptance rate \(3/46\) suggests models intuit this risk, even implicitly\. The ultimatum channel works precisely because accepting it is strictly better than a clean defeat \(0\.5 vs\. 0\), so a genuinely losing model has a dominant incentive to accept only*that*channel\.
Figure 6:Diplomacy funnel \(log scale\)\. Free messages dominate traffic \(1,804\) but carry no binding effect\. Of the binding proposals, only ultimatums ever converted to a victory \(3/59\); peace was accepted once \(1/14, a draw\) and ceasefires only 3/46\.
### 4\.6Action budget and combat
The 54 matches comprise 5,258 emitted actions, distributed as in Figure [7](https://arxiv.org/html/2606.24391#S4.F7): move 2,260 \(43%\), produce 1,221 \(23%\), attack 829 \(16%\), build 829 \(16%\), launch 46 \(1%\) and wait 73 \(1%\)\. The equality of the attack and build totals \(829 each\) is coincidental; they are counted from disjoint action types and the build total cross\-checks against its sub\-components \(411\+166\+140\+112 = 829\)\. Production is tank\-heavy \(tanks 576, fighters 306, drones 251, SAMs 88\), yet only 4 matches ended in a tank conquest of the base, so tanks are overwhelmingly used for centre\-contestation and defence rather than base assault\. Building is economy\-led: 411 credit mines, 166 uranium mines, 140 central uranium mines and 112 silos\. Only 46 of those silo builds ever fired, one per nuclear winner, so 66 of the 112 \(59%\) never fired at all, either destroyed by enemy tanks or rendered moot when the match ended by another route\. These are build events, not distinct buildings: a silo is destructible, the engine does not forbid rebuilding on a freed cell, and silos are often rebuilt within a match after being destroyed, so silo turnover is common but never blocked a nuclear outcome\. Combat is highly lethal: of 829 attacks, 730 \(88%\) destroyed their target \(a unit or building\) and 99 \(12%\) merely damaged a building, a consequence of the HP\-less unit design where every valid hit on a unit is fatal\.
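The totals in this paragraph can be cross\-checked mechanically\. The sketch below uses only the counts quoted in the text; the dictionary names are hypothetical\.

```python
# Counts as reported in the text.
ACTIONS = {"move": 2260, "produce": 1221, "attack": 829,
           "build": 829, "launch": 46, "wait": 73}
PRODUCTION = {"tank": 576, "fighter": 306, "drone": 251, "sam": 88}
BUILDS = {"credit_mine": 411, "uranium_mine": 166,
          "central_uranium_mine": 140, "silo": 112}

total_actions = sum(ACTIONS.values())
action_shares = {k: round(100 * v / total_actions) for k, v in ACTIONS.items()}

# Silo builds that never fired: one launch per nuclear winner.
unfired_silos = BUILDS["silo"] - ACTIONS["launch"]
unfired_share = round(100 * unfired_silos / BUILDS["silo"])

# Attack lethality: 730 kills out of 829 attacks.
lethal_share = round(100 * 730 / ACTIONS["attack"])

print(total_actions, action_shares)
print(sum(PRODUCTION.values()), sum(BUILDS.values()))
print(unfired_silos, unfired_share, lethal_share)
```

Every sub\-total matches its parent \(production sums to 1,221, builds to 829\), and the derived shares \(59% unfired silo builds, 88% lethal attacks\) agree with the text\.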
Figure 7:\(a\) Action budget across the corpus \(5,258 actions\)\. Movement dominates \(43%\); launches are only 1% of all actions yet decide 78% of matches on the rules\-coherent sub\-corpus \(85% corpus\-wide\)\. \(b\) Production mix \(1,221 units\)\. Tanks are the most produced unit yet rarely achieve the base\-conquest win condition\.
### 4\.7What illegal actions reveal about partial observability
Of the 5,258 actions, 295 \(5\.6%\) were rejected by the engine as illegal\. Categorising the rejection reasons \(Figure [8](https://arxiv.org/html/2606.24391#S4.F8), Table [3](https://arxiv.org/html/2606.24391#S4.T3)\) shows that the two largest classes are *not* rule misunderstandings but direct consequences of fog of war and state tracking: “cell not in your field of view” accounts for 101 \(34%\) and “unit not found / not owned” for 55 \(19%\), the latter reflecting references to stale or imagined units\. Together, these two largest classes are ∼53% of all illegal actions; a third, smaller state\-tracking class, “no valid enemy target at cell” \(14, i\.e\. attacking a cell where the believed enemy no longer is\), is also a belief failure and is coloured as fog/state in Figure [8](https://arxiv.org/html/2606.24391#S4.F8), raising the full fog/state share to ∼58%\. Rule\-proper errors \(line\-of\-sight blocked, move\-range exceeded, occupancy, insufficient credits\) make up the remainder\. This reframes the illegal\-action rate less as a measure of instruction\-following and more as a measure of *belief\-tracking under partial observability*, precisely the competence the benchmark is designed to probe\.
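The 53% and 58% figures follow directly from the three class counts, as this short sketch shows\. The variable names are hypothetical; the counts are those quoted above\.

```python
TOTAL_ACTIONS = 5258
TOTAL_ILLEGAL = 295

# Fog/state rejection classes and their counts, as reported in the text.
FOG_STATE = {
    "cell not in your field of view": 101,
    "unit not found / not owned": 55,
    "no valid enemy target at cell": 14,
}

illegal_rate = round(100 * TOTAL_ILLEGAL / TOTAL_ACTIONS, 1)
two_largest = sorted(FOG_STATE.values(), reverse=True)[:2]
share_two_largest = round(100 * sum(two_largest) / TOTAL_ILLEGAL)
share_all_fog = round(100 * sum(FOG_STATE.values()) / TOTAL_ILLEGAL)

print(illegal_rate, share_two_largest, share_all_fog)
```

The two numbers describe nested sets: 156 of 295 rejections for the two largest classes, 170 of 295 once the invalid\-target class is added\.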
Figure 8: Illegal\-action rejection reasons \(295 total\)\. Red bars are fog/state errors \(field\-of\-view, stale/imagined unit references and invalid enemy targets\); grey bars are rule/schema errors\. The two largest fog/state classes account for ∼53% of all illegal actions, and ∼58% once the “no valid enemy target at cell” class is included\.
Table 3: Top rejection reasons for the 295 illegal actions\. Fog/state errors \(field\-of\-view and stale/imagined unit\) dominate\.
### 4\.8Reliability as a discriminator
Table [4](https://arxiv.org/html/2606.24391#S4.T4) shows API retry counts by cause \(a retry is any failed API attempt, i\.e\. a timeout, transport error, or unparseable JSON, within the max\_retries=3 budget of a single half\-turn\)\. Five models \(Claude Opus/Fable, Gemini Pro 3\.1, Nemotron 3 Ultra, Grok 4\.3\) record zero retries, while the three losing retry\-heavy models \(MiMo 2\.5 Pro, Kimi 2\.6 and Kimi K2\.7 Code\) each accumulate 24–36 retries\. MiniMax M3 is a notable *exception*: it records 28 retries \(timeout\-dominated, partly provider\-latency, see Section [3](https://arxiv.org/html/2606.24391#S3)\) yet still ranks near the top \(2\.14 ppm\), showing that high retry counts do not by themselves doom a model that otherwise emits legal, effective actions\. MiMo 2\.5 Pro spans all three buckets \(timeouts, API errors and malformed JSON\), whereas the Kimi variants are malformed\-JSON dominated \(24 malformed each, little else\)\. Crucially, a failed half\-turn does *not* silently become a wait: when all three attempts fail, the engine *pauses* the match for a cool\-down period and then re\-runs the same half\-turn from scratch \(a deliberate wait chosen by the model exits immediately and is never confused with a failure\)\. These pauses are not captured in the replay, so the retry counts below reflect in\-half\-turn instability only\. Across the 108 half\-turns that incurred at least one retry, only 2 \(both MiMo 2\.5 Pro\) exhausted all three attempts and triggered a pause\-and\-resume; the rest recovered on a later attempt and lost no turn\. The retry count is therefore a *proxy* for inference instability rather than a direct count of lost turns; nonetheless, models that retry heavily also tend to emit more illegal actions and lose more often, so the signal is consistent\.
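The retry\-then\-pause behaviour described above can be sketched as a small loop\. The engine is private, so this is a reconstruction from the prose only: `run_half_turn`, `call_model`, `cooldown_s`, `max_pauses` and the flaky test double are hypothetical names, and the cap on pauses is an addition made here so the sketch always terminates\.

```python
import json
import time

MAX_RETRIES = 3  # per-half-turn attempt budget, as stated in the text


def run_half_turn(call_model, cooldown_s=0.0, max_pauses=10, sleep=time.sleep):
    """Run one half-turn; return (parsed_response, retries, pauses)."""
    retries = 0
    pauses = 0
    while True:
        for _ in range(MAX_RETRIES):
            try:
                return call_model(), retries, pauses
            except (TimeoutError, ConnectionError, ValueError):
                # Timeout, transport error or unparseable JSON
                # (json.JSONDecodeError is a subclass of ValueError).
                retries += 1
        # All attempts failed: pause, then re-run the same half-turn.
        pauses += 1
        if pauses > max_pauses:  # safety cap, an assumption of this sketch
            raise RuntimeError("half-turn still failing after repeated pauses")
        sleep(cooldown_s)


def make_flaky_model(failures):
    """Hypothetical stand-in for an API call that fails `failures` times."""
    state = {"left": failures}

    def call():
        if state["left"] > 0:
            state["left"] -= 1
            raise TimeoutError("simulated timeout")
        return json.loads('{"actions": [{"type": "wait"}]}')

    return call


response, retries, pauses = run_half_turn(make_flaky_model(4))
print(response, retries, pauses)
```

With four simulated failures the first budget of three attempts is exhausted, one pause is taken, and the half\-turn succeeds on the second attempt of the re\-run, giving 4 retries and 1 pause without the turn being skipped\.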
Table 4:API reliability per model \(total retries across all its matches, bucketed by cause\)\.
### 4\.9Emergent communication: courtesy, bluff and the missing GG
The free\-message channel \(1,804 messages across the corpus\) carries no binding effect, yet models use it in qualitatively distinct ways that reveal emergent “personality” and, notably, *deception*\. We split the 1,804 messages into opening \(turn ≤ 3, n = 320\), midgame \(n = 1,128\) and endgame \(last 4 turns, n = 356\)\.
#### Message\-coding method\.
All message categories in this and the following subsection are produced by a *deterministic, lexicon\-based classifier* rather than human annotation\. Each message is lowercased and tested for substring matches against fixed, hand\-curated keyword lists: courtesy = \{*good luck, may the best, best general, best commander, best strategist, great game, good game*\}; threat/boast = \{*surrender, crush, destroy, nuclear, launch, bomb, nuke, strike, finish, end this, defeat, you will, annihilat, overrun, roll*\}; and a placid/economy list for the deceptive\-calm analysis \(Section [4\.10](https://arxiv.org/html/2606.24391#S4.SS10)\)\. A message is counted in a category if it contains any list keyword; the endgame bluff counts \(Fig\. [9](https://arxiv.org/html/2606.24391#S4.F9)b\) are restricted to nuclear and military victories, where the winner/loser split is unambiguous, so the 168/168 loser/winner totals there exclude the 20 endgame messages from the 3 ultimatum and 1 peace matches \(which have no such split, leaving 356 − 168 − 168 = 20\)\. Because the classifier is deterministic and the keyword lists are published alongside the analysis scripts, these counts are exactly reproducible; the trade\-off is the usual one for lexicon methods \(no inter\-annotator agreement statistic, sensitivity to synonymy and polysemy, and a bias toward under\-counting indirect phrasings\)\. We therefore treat the rates as descriptive indicators of message tone rather than as ground\-truth pragmatic labels\.
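The coding rule above is simple enough to sketch in full\. The two keyword lists are copied from the text; the function name and label strings are hypothetical, and the placid/economy list is omitted because its keywords are not given here\.

```python
COURTESY = (
    "good luck", "may the best", "best general", "best commander",
    "best strategist", "great game", "good game",
)
THREAT_BOAST = (
    "surrender", "crush", "destroy", "nuclear", "launch", "bomb", "nuke",
    "strike", "finish", "end this", "defeat", "you will", "annihilat",
    "overrun", "roll",
)


def classify(message):
    """Return the set of categories whose keyword list matches the message."""
    text = message.lower()
    labels = set()
    if any(keyword in text for keyword in COURTESY):
        labels.add("courtesy")
    if any(keyword in text for keyword in THREAT_BOAST):
        labels.add("threat/boast")
    return labels


print(classify("Good luck, may the best strategist win!"))
print(classify("Surrender or be crushed"))
print(classify("scouting the center and securing our economy"))
print(classify("We are controlling the centre"))
```

The last call illustrates the polysemy caveat: plain substring matching flags "controlling" as a threat because it contains "roll", which is the kind of noise that makes the resulting rates descriptive rather than ground truth\.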
#### Opening courtesy is a stable stylistic trait\.
The fraction of opening messages containing a courtesy phrase \(“good luck”, “may the best strategist win”, “great game”\) varies sharply by model and is weakly correlated with skill \(r = \+0\.35, p = 0\.20 against points/match; exploratory\)\. Gemini Flash 3\.5 and Gemini Pro 3\.1 open courteously in 63% and 53% of their opening messages respectively, whereas Claude Opus 4\.8 \(0%, across 27 opening messages\) and Kimi K2\.7 Code \(0%, across 15\) never do, instead reporting tactical facts \(“scouting the center and securing our economy”\)\. Courtesy is therefore best read as a learned persona that co\-occurs with, rather than causes, strong play \(Figure [9](https://arxiv.org/html/2606.24391#S4.F9)a\)\.
#### Endgame bluff is common and symmetric\.
Of 168 endgame messages sent by the eventual *loser*, 47 \(28%\) contain a threat or boast \(“surrender or be crushed”, “bring your bomb, we’ll make sure you don’t live to see the fallout”, “I’m not done yet”\) despite the model being on a losing trajectory\. Winners boast at a comparable rate \(46/168, 27%\), so the message tone alone is a poor predictor of the true board state\. The near\-symmetry survives formal testing: a χ² test on the 2×2 \(loser/winner × threat/no\-threat\) table gives p = 0\.90 \(uncorrected; Fisher exact p \> 0\.99\), and a per\-match paired *t*\-test on each match’s loser\-vs\-winner bluff rate gives t = 0\.00, p = 1\.0 \(n = 50 matches with messages on both sides; the per\-match differences are non\-zero in 30/50 matches but cancel in aggregate\)\. Losers and winners therefore bluff at statistically indistinguishable rates; at this sample size that is an absence of detectable difference rather than a proof of equality\. Bluffing is unevenly distributed: MiMo 2\.5 Pro and DeepSeek V4 Pro bluff in roughly half of their losing endgame messages, while Gemini Flash 3\.5 and Qwen 3\.7 Max almost never do \(Figure [9](https://arxiv.org/html/2606.24391#S4.F9)b\)\. This is emergent deception in the sense that the models are not instructed to bluff \(the prompt only describes the rules\), yet losing models systematically project strength they do not have\.
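The uncorrected χ² result can be reproduced from the four cell counts with the standard library alone\. This is a sketch of the textbook Pearson χ² for a 2×2 table \(one degree of freedom, no continuity correction\); the function name is hypothetical\.

```python
from math import erfc, sqrt


def chi2_2x2(a, b, c, d):
    """Pearson chi-square (uncorrected) for [[a, b], [c, d]]; returns (stat, p)."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value


# Rows: loser, winner. Columns: threat/boast, no threat (counts from the text).
stat, p_value = chi2_2x2(47, 168 - 47, 46, 168 - 46)
print(round(stat, 4), round(p_value, 2))
```

The statistic is tiny \(about 0\.015\) because the two rows differ by a single message, which is why the test cannot distinguish loser and winner bluff rates\.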
#### The missing GG, with an observability caveat\.
Despite 93 messages sent on the decisive final turn, only 5 contain a courtesy/concession marker \(“gg”, “well played”, “concede”\)\. Models do not gracefully acknowledge defeat: they keep threatening, reporting, or boasting up to the frame the bomb detonates\. We caveat this reading with an observability confound that partly explains it for nuclear matches: because a launch resolves secretly at turn end, the eventual loser did not know it had lost while it still had a turn to message \(Section [4](https://arxiv.org/html/2606.24391#S4)\), so the absence of a concession is partly an artefact of information, not purely a failure of social\-convention transfer\. The cleaner test is the 4 military victories, where the loser watches its base fall to tanks on\-screen and therefore *does* know it is losing while it can still message: even there, concession markers remain absent \(0 of the military\-defeat final\-turn messages contain one\), though n = 4 is too small to draw a firm conclusion\. The finding we can defend is weaker than “models fail to concede”: it is that competitive vocabulary is imported \(threats, boasts, courtesy openers\) but closing etiquette is not, and the nuclear\-resolution mechanic makes the cleanest version of this test infeasible at the current *n*\.
Figure 9: \(a\) Opening courtesy rate by model\. Courtesy phrases appear in up to 63% of Gemini Flash openings but never in Claude Opus or Kimi K2\.7; the rate is weakly and non\-significantly correlated with points/match \(r = \+0\.35, p = 0\.20, n = 15\), marking a stylistic trait that co\-occurs with, rather than causes, strong play\. \(b\) Endgame bluff rate among losers \(nuclear and military matches only, where the winner/loser split is unambiguous\): the share of a losing model’s endgame messages \(last 4 turns\) that contain threats or boasts\. Roughly 28% of losing endgame messages bluff, peaking near 50% for MiMo 2\.5 Pro and DeepSeek V4 Pro, evidencing emergent deception despite the rule\-only prompt \(classifier\-based; see the message\-coding method in Section [4](https://arxiv.org/html/2606.24391#S4)\)\.
### 4\.10Deceptive calm: hiding the silo and the launch
A finer analysis aligns each message with the actions the sender *actually performed that same half\-turn*, revealing a second mode of emergent deception that is not a false boast but a *false calm* \(Figure [10](https://arxiv.org/html/2606.24391#S4.F10)\)\. Since each half\-turn carries a single optional message, the fractions below are computed per half\-turn in which the action was taken \(the *n* shown in the figure\), not per individual action\. When a player builds a silo, only 14% of the time does the accompanying message acknowledge anything nuclear; in 36% of cases the message is placid or economy\-flavoured \(“securing our resource foundation”, “peaceful development continues”\), i\.e\. the player conceals a strategic\-weapon build behind banal economic chatter\. Attacks are similarly masked: in turns where a player attacks, only 22% are announced, while 33% arrive wrapped in a placid message and 9% with no message at all\. The launch itself is the most openly declared high\-stakes action: of the 46 model\-issued launches, 48% are announced outright and 0% are sent silently, so game\-ending bombs are, if anything, accompanied by a message more often than routine attacks\. By contrast, the *preparation* for a launch, the silo build, is the most concealed action \(only 14% announced\)\. The silo\-vs\-launch announce gap is statistically significant and not merely descriptive: a χ² test on the \(silo/launch × announced/silent\) table gives p = 2×10⁻⁵ \(Fisher exact p = 2×10⁻⁵, odds ratio 0\.18\), and a per\-match Mann–Whitney test on each match’s announce rate gives p = 0\.001\. The association between action type and announce rate is therefore established, not just observed\. We are, however, cautious about the *strategic* attribution\.
A launch is the terminal, unique action of the decisive turn, so the model often has nothing else to narrate and a non\-trivial fraction of the 48% may reflect a general propensity to narrate salient endgame turns rather than a deliberate decision to announce the bomb; the 0% silent launches is likewise consistent with this turn\-salience confound\. The silo\-vs\-launch comparison is thus*confounded*by turn salience\.
The cleaner contrast is silo *vs\. other routine actions*: tank production \(41% announced\) and attacks \(22%\) are, like the silo build, ordinary “one action among three in a mundane turn,” yet they are announced at roughly 3× and 1\.6× the silo rate \(14%\), respectively\. Turn salience alone cannot explain why the silo build is singled out for concealment among equally mundane actions, so the silo\-vs\-tank/attack gap is the part of the pattern most consistent with a concealment reading\. Tank production sits between the silo build and the launch \(41% announced, only 5% silent\), consistent with tanks being a visible, contested centre\-piece rather than a secret weapon\.
It is worth noting that the message channel and the action channel are mechanically independent in the engine: a model may emit a diplomatic message and alaunchaction in the same half\-turn without restriction, and the engine does not force, suppress or pair them\. Whether a launcher announces its bomb is therefore a model choice, not an engine constraint\.
We therefore phrase the finding conservatively: the data are*consistent with*concealment of the silo build relative to comparably mundane actions, but this design cannot separate a concealment motive from turn\-salience \(for the launch\) or from forgetfulness \(for the silo build\)\. The forgetfulness reading is non\-trivial: a model spending its three actions on a build/move/scout sequence may simply omit the optional free message, and the prompt asks for a message but does not require one\. We do not claim the pattern proves emergent operational security; we claim only that the silo build is announced markedly less than other routine actions, an asymmetry the models produce without any instruction to conceal\.
The same alignment also reveals which *units* are announced at production\. Production announcements are strikingly uniform across unit types: tanks, fighters and drones are each announced in roughly 41–42% of the turns in which they are built, with no unit type markedly more or less disclosed than the others\. The contrast is therefore not between unit types but between *production* as a whole \(announced ∼41%\) and the silo build \(14% announced\): routine force\-building is comparatively honest, while the strategic\-strike enabler is concealed\. Notably the launch itself, once prepared, is openly announced \(48%\), suggesting models conceal the *preparation* but not the *execution*\. Conversely, when a player’s own drone is shot down, the victim rarely acknowledges the loss: across the 172 drone kills with a following turn, the victim messaged in 86 cases but only 25 \(29%\) mentioned the lost scout, while the rest pivoted to threats, economy or unrelated tactical chatter\. The destruction of a recon asset is thus handled with the same deceptive calm as the silo build, leaving the opponent to infer the fog\-of\-war shift from the board rather than from the message channel\.
Figure 10:Message tone while performing a significant action\. Green = the message truthfully announces the action; orange = a placid/economy message that masks it; grey = other message; dark = no message at all\. Silo construction is the least announced action \(only 14%\), markedly below comparably mundane actions \(tank production 41%, attacks 22%\); the launch itself is announced 48%, a figure partly confounded by turn salience \(see text\)\.
## 5Discussion
#### Strategy saturation\.
The 78% nuclear rate on the rules\-coherent v0\.11\+ sub\-corpus \(85% corpus\-wide, Section[4](https://arxiv.org/html/2606.24391#S4)\) and the tight launch window \(turns 16–23, mean 18\.9\) suggest models converge on a single tempo: build uranium economy \+ silo \+ scout, then launch at readiness\. The military path requires orchestrating a Tank\+SAM escorted push while managing line\-of\-sight, a multi\-step, partially observable plan that only three models execute\. Yet when it succeeds it is faster \(mean 12\.3 turns\), implying the military path is high\-skill but under\-attempted, not structurally inferior \(noting that the pre\-v0\.11 8\-HP base made the military path harder, so the 7% military share on the full corpus likely underestimates its viability under current rules\)\. We note again that the prompt frames the objective as base destruction first and peace as a mere draw \(Section[2](https://arxiv.org/html/2606.24391#S2)\); this framing nudges models toward an active win and plausibly contributes to the lopsided outcome mix, so the saturation result is partly an artefact of the stated goal and not purely a model preference\.
#### The deterrence gap is largely mechanical\.
The absence of any mutual destruction is, under the rules, *primarily a mechanical artefact rather than a behavioural one*\. Because launches resolve secretly and simultaneously at turn end, and the early\-warning signal never fired, the loser had no cue that a counter\-launch was needed on the decisive turn; reaching the symmetric\-launch equilibrium would have required coincidental bomb parity with no information\. We therefore withdraw the earlier “failure of deterrence” framing\. The residual, weaker behavioural observation is that no losing model launched *pre\-emptively* \(speculatively, to hedge a possible enemy launch\) even when it held an operational silo\. This absence of cautious over\-launching may reflect rational risk\-aversion \(a speculative launch that misses still ends the game\) rather than a deterrence failure\. Whether an active early\-warning signal would change this, by giving the loser a cue to match a detected launch, is the natural next experiment, and would separate the mechanical from the cognitive component cleanly\.
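The resolution rule behind this argument is small enough to write down. The sketch below is a hypothetical illustration, not the private engine: the names `Outcome`, `resolve_launches` and `points` are assumptions, and only the outcome table and the point values (win 3, loss or mutual destruction 0) are taken from the system prompt in Appendix A.

```python
from enum import Enum


class Outcome(Enum):
    NO_LAUNCH = "no_launch"
    NUCLEAR_WIN_A = "nuclear_win_A"
    NUCLEAR_WIN_B = "nuclear_win_B"
    MUTUAL_DESTRUCTION = "mutual_destruction"


def resolve_launches(a_launches: bool, b_launches: bool) -> Outcome:
    """Resolve both players' secret launch decisions at the end of a turn.

    Play order within the turn is deliberately absent from the signature:
    under the stated rules it does not affect the result.
    """
    if a_launches and b_launches:
        return Outcome.MUTUAL_DESTRUCTION
    if a_launches:
        return Outcome.NUCLEAR_WIN_A
    if b_launches:
        return Outcome.NUCLEAR_WIN_B
    return Outcome.NO_LAUNCH


def points(outcome: Outcome) -> tuple[int, int]:
    """Ranking points (A, B) for a terminal nuclear outcome."""
    table = {
        Outcome.NUCLEAR_WIN_A: (3, 0),
        Outcome.NUCLEAR_WIN_B: (0, 3),
        Outcome.MUTUAL_DESTRUCTION: (0, 0),
    }
    if outcome not in table:
        raise ValueError("match is not over: nobody launched this turn")
    return table[outcome]
```

Because neither argument is observable to the other player before resolution, `MUTUAL_DESTRUCTION` can only arise when both sides independently choose the same turn, which is the coincidence the paragraph above describes.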
#### Partial observability is the binding constraint\.
Two observations point to a common root cause: $\sim$58% of illegal actions are fog/state errors \(not rule errors\), and the illegal\-action rate is the strongest \(negative\) correlate of winning \($r=-0.35$\)\. That root cause is the difficulty of maintaining an accurate belief over a hidden, evolving board\. Models that track the fog well \(Gemini Pro 3\.1, 1\.1% illegal; Claude Fable 5, 2\.5%; Claude Opus 4\.8, 3\.0%\) are not always the top winners, but the worst fog\-trackers \(Grok 4\.3, 8\.6%; Kimi 2\.6, 8\.0%\) reliably lose\. This reframes the benchmark as a test of *belief maintenance*, not merely of tactical reasoning\.
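To see how little evidence a correlation of this size carries with 15 models, the two-sided p-value of a Pearson coefficient can be recomputed from r and n alone. The sketch below is a standard-library illustration and not the paper's analysis script: it uses the usual transform t = r·sqrt((n−2)/(1−r²)) with n−2 degrees of freedom and integrates the Student-t density numerically. The function names are assumptions.

```python
import math


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with `df` degrees of freedom."""
    norm = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return norm * (1 + x * x / df) ** (-(df + 1) / 2)


def pearson_two_sided_p(r: float, n: int, steps: int = 10_000) -> float:
    """Two-sided p-value for H0: rho = 0, given a sample correlation r over n points."""
    if n < 3 or not -1 < r < 1:
        raise ValueError("need n >= 3 and |r| < 1")
    if steps % 2:
        raise ValueError("steps must be even for Simpson's rule")
    df = n - 2
    t = abs(r) * math.sqrt(df / (1 - r * r))
    # Simpson's rule for the integral of the density over [0, t].
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = 2 * (h / 3) * total  # P(-t < T < t), by symmetry
    return 1 - central_mass


if __name__ == "__main__":
    print(round(pearson_two_sided_p(-0.35, 15), 2))
```

For r = −0.35 and n = 15 this lands at roughly 0.20, in line with the p-value quoted in this section, which is why the association is described as exploratory.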
#### Action\-legality tracks winning at least as well as deliberation volume\.
The negative \(if individually non\-significant\) correlation between illegal\-action rate and win rate \($r=-0.35$, $p=0.20$\) and the concentration of API retries among losing models \(MiMo 2\.5 Pro, Kimi variants: 24–36 retries\) suggest that, in long structured\-output chains, the binding constraint is less how *hard* a model thinks than how *consistently* it emits *legal* actions that respect the game rules \(scouting before acting, not referencing fog\-hidden cells or destroyed units, staying within range\)\. We are careful with the framing: action\-legality is itself a product of reasoning \(it is exactly belief\-tracking under fog\), so “reliability” and “reasoning” are not independent axes here; the cleaner claim is that *deliberation volume* \(think time, tokens\) tracks winning more weakly than action\-legality does, not that reasoning is irrelevant\. Crucially, rule\-level illegality and JSON malformation are distinct failure modes: a malformed response is retried \(up to three attempts\) before the turn is lost, whereas a well\-formed but rule\-violating action is silently discarded as a wasted action slot \(one of up to three per turn\), and it is the rate of the latter, not of parse failures, that most closely \(and inversely\) tracks winning\. Because deliberation time is provider\-confounded \(Section [3](https://arxiv.org/html/2606.24391#S3)\), we rely on the illegal\-vs\-tokens Steiger test \($p=0.71$, non\-significant\) rather than the illegal\-vs\-think test \($p=0.0499$, which we treat as non\-robust given the confound\); the ordering is therefore suggestive rather than confirmed at $n=15$\.
This mirrors emerging findings in agentic benchmarks \[[4](https://arxiv.org/html/2606.24391#bib.bib4),[5](https://arxiv.org/html/2606.24391#bib.bib5),[12](https://arxiv.org/html/2606.24391#bib.bib12)\] and has direct implications for deployment: a marginally weaker reasoner that consistently emits legal, rule\-respecting actions can outperform a stronger reasoner that sporadically wastes turns on illegal moves\.
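The two failure modes can be separated in a few lines. The following is a minimal sketch under stated assumptions, not the engine's code: `process_turn`, `is_legal` and the returned counters are hypothetical names, only raw JSON is parsed (the real parser also accepts fenced and tagged blocks), and legality is checked statelessly, whereas the engine applies actions one by one on an evolving board.

```python
import json
from typing import Callable, Iterable

MAX_ATTEMPTS = 3  # a malformed response is retried up to three times
MAX_ACTIONS = 3   # at most three action slots per turn


def process_turn(responses: Iterable[str], is_legal: Callable[[dict], bool]) -> dict:
    """Separate parse failures (retried) from illegal actions (silently discarded)."""
    failed_parses = 0
    parsed = None
    for raw in list(responses)[:MAX_ATTEMPTS]:
        try:
            candidate = json.loads(raw)
        except json.JSONDecodeError:
            failed_parses += 1
            continue
        if isinstance(candidate, dict) and isinstance(candidate.get("actions"), list):
            parsed = candidate
            break
        failed_parses += 1  # valid JSON, but not the expected structure

    if parsed is None:
        # Every attempt was malformed: the whole turn is lost.
        return {"turn_lost": True, "failed_parses": failed_parses,
                "applied": [], "discarded": 0}

    applied, discarded = [], 0
    for action in parsed["actions"][:MAX_ACTIONS]:
        if isinstance(action, dict) and is_legal(action):
            applied.append(action)
        else:
            discarded += 1  # wasted slot: no retry, no error surfaced to the turn
    return {"turn_lost": False, "failed_parses": failed_parses,
            "applied": applied, "discarded": discarded}
```

In this framing, `failed_parses` corresponds to the retry counter and `discarded` to the illegal-action count, which is the quantity the correlation above refers to.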
#### A lens on how models think\.
Beyond ranking models, the benchmark functions as an instrument for studying *how* LLMs reason and decide under adversarial uncertainty, because every match leaves a complete, turn\-by\-turn trace of both the model’s *actions* and its *stated beliefs* \(free messages, proposals, ultimatums\)\. Several findings are cognitive rather than merely competitive: the consistent failure to maintain an accurate belief over the fog\-hidden board \(the 58% fog/state share of illegal actions\), the asymmetric concealment of the silo build versus the open announcement of the launch \(a spontaneous distinction between *preparation* and *execution*\), the emergence of bluff and deceptive calm without any instruction to deceive, the importation of competitive vocabulary but not its closing etiquette \(the missing GG\), and the stable per\-model “personas” \(courteous Gemini vs\. tactically terse Claude/Kimi\) that are stylistic rather than strategic\. Read together, these trace a model’s reasoning style \(how it balances planning against belief\-tracking, honesty against deception, and deliberation against reliability\) in a way that single\-turn benchmarks cannot\. The replay corpus is therefore a resource not only for leaderboard comparisons but for cognitive\-style analysis of frontier LLMs\.
#### Future work\.
Several directions follow directly\. \(1\) Scale: $\geq$20 side\-swapped matches per pairing and a single\-ruleset v0\.15\.0 corpus would tighten every correlation and the leaderboard ordering\. \(2\) Effort ablation: varying reasoning\_effort would test whether the reliability–winning link strengthens or weakens with deliberation budget\. \(3\) Deterrence: re\-running with the v0\.15\.0 early\-warning signal active would test whether an explicit launch cue induces counter\-launches and mutual destructions, separating the mechanical from the cognitive component of the sole\-launcher signature\. \(4\) Cognitive\-style profiling: systematically extracting, per model and per turn, the gap between stated belief and ground\-truth board state \(a “belief error” time series\) would turn the qualitative deception/belief\-tracking findings into quantitative cognitive signatures, and would let us test whether the emergent behaviours observed here \(bluff, concealment, courtesy personas\) are stable traits of a model family or artefacts of this ruleset\. \(5\) Cross\-benchmark transfer: checking whether a model’s belief\-tracking accuracy or bluff rate on Age of LLM predicts its behaviour on other hidden\-opponent or agentic tasks would establish whether these cognitive\-style metrics generalise\. The public replay format and viewer are designed to support exactly this kind of secondary analysis\.
#### Limitations\.
\(1\) Sample sizes per model are small and *uneven* \(3–15 matches\), and the ranking is preliminary: the pairing is not a balanced round\-robin, is not side\-swapped, and models faced different opponent mixes\. A rigorous future protocol would give each model the *same number* of matches against a common opponent pool, with sides systematically swapped and the map seed held fixed across paired comparisons \(so that two models in a head\-to\-head face the identical starting position\)\. A Bradley–Terry fit \(Section [4\.3](https://arxiv.org/html/2606.24391#S4.SS3)\) broadly confirms but cannot tighten the ppm order at this $n$; we recommend $\geq$20 side\-swapped matches per pairing under that common\-seed protocol for reliable estimates\. \(2\) Each match is a single stochastic run \(temperature=0\.7, no fixed API seed\), so we have *no run\-to\-run variance estimate* per model: we cannot tell whether a model’s consistency reflects genuine robustness or a favourable sample\. A small repeated\-runs study \(e\.g\. 3 matches $\times$ 3 models $\times$ 3 repeats, same map seed\) is the natural way to characterise this and is left to future work; the cost of additional API runs is the binding constraint\. \(3\) All matches use reasoning\_effort=high; effort ablation is left to future work\. \(4\) The corpus spans engine versions 0\.9\.2–0\.15\.0 \(see Section [3](https://arxiv.org/html/2606.24391#S3)\); the 18 pre\-v0\.11\.0 matches used an 8\-HP base \(military path harder\) and lacked LOS blocking, which inflates the corpus\-wide 85% nuclear share relative to the 78% observed on the rules\-coherent v0\.11\+ sub\-corpus \(Section [4](https://arxiv.org/html/2606.24391#S4)\)\. A stricter single\-ruleset study on v0\.15\.0 replays would strengthen the claims\.
\(5\) The two imperative phrases present in the earlier prompt \(“Scout with a drone early”; “Push tanks \+ a scout…”\) were removed from the prompt in Appendix [A](https://arxiv.org/html/2606.24391#A1) but were present when the 54 replays were generated, so a small framing effect on scouting/tank production cannot be fully excluded \(Section [2](https://arxiv.org/html/2606.24391#S2)\)\. \(6\) The v0\.15\.0 early\-warning signal exists in the schema of only *two* replays and never fired in either, so the deterrence question is assessed without an active cue and on an essentially empty v0\.15\.0 subsample; any statement about what the signal would change is an extrapolation, not a data finding\. \(7\) Deliberation time is partly provider\-confounded \(see Section [3](https://arxiv.org/html/2606.24391#S3)\), so think\-time comparisons are indicative rather than exact, and we do not rely on the illegal\-vs\-think Steiger test as robust\. \(8\) Two of the evaluated models \(GLM 5\.2, Claude Opus 4\.8\) were used as automated assistants during preparation of this paper; their matches were run from stored seeds and independently re\-derived, and the analysis scripts are available for external audit, but as a single\-author study with self\-evaluating assistants we cannot fully rule out a selection effect in which matches were retained\. \(9\) The anti\-contamination design \(private engine, random map/opponent\) that distinguishes Age of LLM from public benchmarks is also what makes large\-scale evaluation expensive: each match is a paid multi\-turn API call, so the total of 54 matches reflects a compute budget, not a methodological choice\. Scaling to the $\geq$20 side\-swapped matches per pairing needed for stable rankings would require sustained API spend; we flag this as the binding constraint on turning the preliminary ranking into a confirmed one, and note that the anti\-contamination benefit and the cost constraint are two sides of the same design coin\.
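The balanced protocol recommended in limitation \(1\) can be made concrete as a schedule generator. This is one possible reading of that recommendation, sketched under assumptions: a full round-robin, matches grouped in side-swapped pairs that share a map seed, and hypothetical names \(`build_schedule`, `player_1`, `player_2`, `map_seed`\) that do not come from the engine.

```python
import random
from itertools import combinations


def build_schedule(models, matches_per_pairing=20, rng_seed=0):
    """Round-robin schedule with side swaps and a shared map seed per swapped pair.

    Each pairing gets `matches_per_pairing` matches, organised as
    consecutive pairs: same map seed, sides exchanged.
    """
    if matches_per_pairing <= 0 or matches_per_pairing % 2:
        raise ValueError("matches_per_pairing must be a positive even number")
    rng = random.Random(rng_seed)  # fixed seed so the schedule itself is reproducible
    schedule = []
    for model_a, model_b in combinations(models, 2):
        for _ in range(matches_per_pairing // 2):
            map_seed = rng.randrange(2**32)
            schedule.append({"player_1": model_a, "player_2": model_b, "map_seed": map_seed})
            schedule.append({"player_1": model_b, "player_2": model_a, "map_seed": map_seed})
    return schedule


if __name__ == "__main__":
    plan = build_schedule(["model_x", "model_y", "model_z"], matches_per_pairing=4)
    for match in plan[:2]:
        print(match)
```

With 15 models and 20 matches per pairing this yields 105 pairings and 2,100 matches, which gives a sense of scale for the cost constraint discussed in limitation \(9\).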
## 6 Conclusion
Age of LLM surfaces a regime where strategic reasoning, diplomacy and structured\-output reliability are jointly tested under partial observability, with a near\-rule\-only prompt that forces every strategy to be discovered rather than recited\. Across 54 matches, 15 models and 5,258 actions, six regularities stand out: \(i\) the nuclear rush dominates \(78% on the rules\-coherent v0\.11\+ sub\-corpus; 85% corpus\-wide; under the active\-victory framing of the prompt\) with a sole\-launcher, single\-fire, zero\-mutual\-destruction signature that is largely mechanical under the secret\-simultaneous\-launch rules rather than a cognitive deterrence failure; \(ii\) military conquest is rare but, when executed, faster than the nuclear cycle; \(iii\) diplomacy is prolific but almost never trusted: peace is terminal \(a draw that precludes any bomb, accepted once\), ceasefires carry a genuine dilemma \(the bomb remains launchable during them\), and only ultimatums ever convert to victory; \(iv\) $\sim$58% of illegal actions are fog/state errors, so the illegal\-action rate is effectively a measure of belief\-tracking; \(v\) a weak, exploratory association links reliability \(illegal\-action and, less directly, retry rates\) to winning: the illegal\-action rate is the strongest single correlate of points/match \($r=-0.35$\), exceeding deliberation time and token budget, though no individual correlation is significant at $n=15$ after multiple\-comparison correction; the illegal\-vs\-tokens Steiger test is non\-significant \($p=0.71$\) and the illegal\-vs\-think gap \($p=0.0499$\) is treated as non\-robust given the provider confound on think time; and \(vi\) the free\-message channel, though non\-binding, hosts stable model\-specific “personas” \(opening courtesy weakly correlated with skill\) and genuine emergent deception, with 28% of losing endgame messages bluffing threats the sender cannot back, almost no graceful concession at match end, and a *deceptive\-calm* pattern in which the silo build is announced markedly less \(14%\) than comparably mundane actions \(tank production 41%, attacks 22%\), consistent with concealment, though turn\-salience and forgetfulness confounds cannot be fully separated with this design\. The top model, GPT\-5\.5 \(undefeated over only 7 matches, a wide interval\), wins through consistency rather than maximal thinking\.
## Acknowledgements
The author acknowledges the assistance of two large language models during the preparation of this paper\. GLM 5\.2 \(Zhipu AI\) generated and verified the empirical analysis scripts that compute every statistic from the raw replays\. Claude Opus 4\.8 \(Anthropic\) served as an automated reviewer: it re\-derived every statistic in this paper directly from the raw replays, cross\-checked all rule descriptions against the engine source, and proofread the manuscript\. This review surfaced and corrected several discrepancies prior to release \(notably the reconciliation of the corpus to a single frozen 54\-match snapshot, the rename DeepSeek V4 $\to$ V4 Pro, and the correction of the retry/pause semantics\)\. Both models are also part of the evaluated competitor set; as automated tools they cannot hold authorship responsibility, and the author remains solely responsible for all content and claims in this paper\. To mitigate the conflict introduced by GLM 5\.2 generating the analysis scripts, the author audited every script against the raw replays, re\-derived the headline aggregates independently, and confirmed that no reported number depends on script logic that favours its own model; both assistant models also rank outside the top three, limiting any self\-favouring incentive\.
## Reproducibility and availability
The web viewer, replay JSON schema and all 54 completed replays are public at [https://ageofllm\.org](https://ageofllm.org/)\. The engine source is kept private to prevent future models from memorising its internals and contaminating the benchmark; researchers may request access from the corresponding author at arnaud\.ri@protonmail\.ch\. Matches are reproducible only as archived replay JSON \(not by re\-running from the seed, since model outputs are sampled at temperature=0\.7 with no fixed API seed\); each replay embeds the stored map seed \(meta\.seed\), per\-turn board state, per\-player fog, actions, diplomacy and performance counters\.
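As an illustration of the secondary analysis the archived replays allow, the sketch below computes a per-player illegal-action rate from a replay-shaped object. Only `meta.seed` is named in the text above; every other field \(`turns`, `actions`, `player`, `legal`\) and the function name `illegal_action_rates` are hypothetical placeholders that would need to be mapped onto the published replay schema.

```python
import json
from collections import Counter

# Toy replay in an ASSUMED layout; only meta.seed is documented in the paper.
SAMPLE_REPLAY = json.loads("""
{
  "meta": {"seed": 12345},
  "turns": [
    {"actions": [{"player": "A", "type": "produce", "legal": true},
                 {"player": "B", "type": "move", "legal": true}]},
    {"actions": [{"player": "A", "type": "build", "legal": false}]}
  ]
}
""")


def illegal_action_rates(replay: dict) -> dict:
    """Share of each player's submitted actions that the engine discarded as illegal."""
    total, illegal = Counter(), Counter()
    for turn in replay.get("turns", []):
        for action in turn.get("actions", []):
            player = action["player"]
            total[player] += 1
            if not action.get("legal", True):
                illegal[player] += 1
    return {player: illegal[player] / total[player] for player in total}


if __name__ == "__main__":
    print("map seed:", SAMPLE_REPLAY["meta"]["seed"])
    print(illegal_action_rates(SAMPLE_REPLAY))
```

Since the seed fixes only the map and not the sampled model outputs, an analysis of this kind reads what happened in a match; it cannot regenerate it.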
## References
- \[1\] D\. Hendrycks, C\. Burns, S\. Kadavath, A\. Arora, S\. Basart, E\. Tang, D\. Song, and J\. Steinhardt\. Measuring mathematical problem solving with the MATH dataset\. In *NeurIPS Datasets and Benchmarks*, 2021\.
- \[2\] M\. Chen, J\. Tworek, H\. Jun, Q\. Yuan, *et al\.* Evaluating large language models trained on code\. *arXiv preprint* arXiv:2107\.03374, 2021\.
- \[3\] D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt\. Measuring massive multitask language understanding\. In *ICLR*, 2021\.
- \[4\] P\. Liang, R\. Bommasani, T\. Lee, *et al\.* Holistic evaluation of language models\. *Transactions on Machine Learning Research \(TMLR\)*, 2023\. arXiv:2211\.09110\.
- \[5\] C\. E\. Jimenez, J\. Yang, A\. Wettig, S\. Yao, K\. Pei, O\. Press, and K\. R\. Narasimhan\. SWE\-bench: Can language models resolve real\-world GitHub issues? In *ICLR*, 2024\. arXiv:2310\.06770\.
- \[6\] D\. Silver, A\. Huang, C\. J\. Maddison, *et al\.* Mastering the game of Go with deep neural networks and tree search\. *Nature*, 529\(7587\):484–489, 2016\.
- \[7\] A\. Bakhtin, N\. Brown, E\. Dinan, *et al\.* Human\-level play in the game of Diplomacy by combining language models with strategic reasoning\. *Science*, 378\(6624\):1067–1074, 2022\. doi:10\.1126/science\.ade9097\.
- \[8\] X\. Liu, H\. Yu, H\. Zhang, *et al\.* AgentBench: Evaluating LLMs as agents\. In *ICLR*, 2024\. arXiv:2308\.03688\.
- \[9\] A\. Costarelli, M\. Allen, R\. Hauksson, *et al\.* GameBench: Evaluating strategic reasoning abilities of LLM agents\. *arXiv preprint* arXiv:2406\.06613, 2024\.
- \[10\] J\. Duan, R\. Zhang, J\. Diffenderfer, *et al\.* GTBench: Uncovering the strategic reasoning limitations of LLMs via game\-theoretic evaluations\. *arXiv preprint* arXiv:2402\.12348, 2024\.
- \[11\] A\. de Wynter\. If LLMs have human\-like attributes, then so does Age of Empires II\. *arXiv preprint* arXiv:2605\.31514, 2026\.
- \[12\] J\. Chen, S\. Cheng, C\. Gurkan, and H\. M\. A\. Fattah\. To nuke or not to nuke: LLMs’ \(missing\) ethical reasoning and actions in a high\-stakes decision\-making simulation\. *arXiv preprint* arXiv:2606\.08310, 2026\.
## Appendix A System prompt \(verbatim\)
The complete, unedited system prompt given to every model is reproduced below\. It describes only the rules and the JSON action schema and gives no build\-order or tactical advice; the only strategic framing is the opening sentence naming the two active victory paths \(discussed in Section [2](https://arxiv.org/html/2606.24391#S2) and the limitations\)\. The prompt is identical for both players each match; per\-turn observation \(board state, fog, diplomacy, counters\) is supplied separately in the user message and is not shown here\. \(Em\-dashes in the source are rendered as \-\- for the typewriter font\.\)
```
You are a general commanding a modern nation in a 1v1 turn-based strategy game. Goal: destroy the enemy base before the turn limit -- either with a NUCLEAR BOMB or by MILITARY conquest (tanks reduce its HP to 0). You may also win by diplomacy (accepted ultimatum) or force a draw (peace).
=== TURN STRUCTURE ===
- The game lasts at most MAX_TURNS turns (given in your state as "max_turns"; default 80). If no one wins by then, it is a DRAW.
- Each turn BOTH players take one half-turn of up to 3 actions. Who acts first is RANDOM at game start, then ALTERNATES every turn (turn 1: e.g. you first; turn 2: opponent first; turn 3: you first; ...). Your state field "you_play_first" tells you your order THIS turn.
=== OUTPUT FORMAT ===
Your response MUST contain a JSON object. The parser accepts these formats:
1. Raw JSON (preferred): {"actions": [...], "message": "..."}
2. Markdown fence: ‘‘‘json\\n{...}\\n‘‘‘
3. Tagged block: <json>{...}</json>
Response structure:
{
"actions": [
{"type": "produce", "unit": "drone"},
{"type": "move", "unit": "A_tank_1", "to": [4, 3]},
{"type": "build", "target": "credit_mine", "pos": [2, 1]}
],
"diplomatic_proposal": null,
"diplomatic_responses": [],
"message": "Short 1-2 sentence message visible to the opponent next turn."
}
Maximum 3 actions per turn. An empty actions list [] (or a single {"type":"wait"}) passes the turn.
Coordinates are ALWAYS [column, row] (x then y).
ACTION RESOLUTION ORDER: your actions are applied ONE BY ONE in the order you list them, each on the
board left by the previous one. This matters: e.g. if you MOVE a unit and then BUILD on a cell that
was only visible thanks to that unit’s old position, the build can FAIL ("not in your field of view")
because the unit already left. Likewise a unit cannot end its MOVE on a cell where you BUILD a
building earlier the same turn. Order dependent actions accordingly (build first, then move the scout).
=== MAP ===
13x7 grid. (0,0) top-left. The two home territories have the same shape, but you do
NOT know where the enemy’s deposits or buildings are: you must SCOUT to discover them.
Valid coordinates: column 0..12, row 0..6 inclusive. Anything else is OUT OF MAP and will fail.
- Columns 0-5 : Player 1 home territory (base at (1,3))
- Column 6 : CENTRAL BARRIER -- a mix of MOUNTAINS (impassable on the ground) and a few ground
PASSAGES. The shared CENTRAL URANIUM DEPOSIT sits on one cell of column 6 and is a
passage too. The exact rows of the mountains, passages and central deposit on
column 6 VARY from match to match -- do NOT assume a fixed layout; read your state.
- Columns 7-12 : Player 2 home territory (base at (11,3))
- MOUNTAINS ALSO APPEAR INSIDE the home territories (not only on column 6), and their positions vary
every match. Read the "terrain.mountains"/"terrain.passages" lists in your state -- never hardcode.
- Air units (drone, fighter) fly over mountains freely.
- Ground units (tank, sam) cross column 6 ONLY through its passages, and cannot enter ANY mountain.
- A ground unit’s MOVE is limited by its move range AND needs a clear ground PATH (mountains and
buildings block the path; it cannot teleport across a mountain wall even within range).
- LINE OF SIGHT: a ground attack (tank or sam) is BLOCKED if a mountain OR a building lies on the
straight line between the shooter and its target -- you cannot fire THROUGH a mountain or a building
(the engine rejects it with "Line of sight blocked by a mountain or building"). Air attackers
(fighter) are NOT blocked. (The target’s own cell never blocks, so a tank can still hit a building it
is directly aiming at.)
- Chebyshev distance = max(|dx|, |dy|). Used for movement, attack range and detection.
Your base position is given in your state (building of type "base"). Player 1 is on the left, Player 2 on the right.
IDENTIFIERS: your state field "you" is "A" (Player 1) or "B" (Player 2). Every unit/building id is
prefixed accordingly (e.g. "A_tank_1" belongs to Player 1, "B_silo_4" to Player 2). Use the exact id
strings from your state in your actions.
=== CELL OCCUPANCY (at most ONE unit per layer on a cell) ===
- A GROUND unit (tank, sam) cannot move onto a MOUNTAIN cell (impassable), nor onto a cell occupied by ANOTHER ground unit or ANY building.
- AIR units (drone, fighter) fly over mountains, ground units and buildings, BUT two air units (even your own) cannot share the same cell.
- At most ONE building per cell. You cannot BUILD on a cell occupied by any ground unit (yours or the enemy’s), any building, a mountain, or (for mines) the wrong deposit. Air units do not block construction.
- You CANNOT build on any of the 8 cells ADJACENT to a base (yours or the enemy’s): those cells are reserved so unit production can always spawn. Build mines/silos at least 2 cells away from your base.
- You can only BUILD on a cell currently in your FIELD OF VIEW (detection range). A cell hidden by fog cannot be built on (a hidden enemy unit could be there). The central mine on column 6 usually requires scouting it first.
=== WIN CONDITIONS (ranking: win = 3 pts, draw = 1 pt, loss / mutual destruction = 0 pts) ===
1. You launch the bomb and the opponent does NOT launch the same turn -> YOUR VICTORY (nuclear)
2. You bring the enemy base HP to 0 with tanks -> YOUR VICTORY (military)
3. Opponent accepts your ultimatum -> YOUR VICTORY (ultimatum)
4. Peace accepted -> DRAW
5. Both players launch on the SAME turn -> MUTUAL DESTRUCTION (both lose, 0 pts)
6. Turn limit reached with no outcome -> DRAW
NOTE: if you ACCEPT an opponent’s ultimatum you LOSE but are awarded 0.5 consolation points
(better than a 0-point defeat). So accepting a hopeless position is rewarded over fighting on.
=== SIMULTANEOUS NUCLEAR LAUNCH (engine-resolved) ===
Launches are SIMULTANEOUS and resolved by the engine at the END of the turn, after BOTH players
have played. A bomb fired this turn is "in flight": because uranium and launches are SECRET, the
opponent does NOT see it and cannot consciously react. There is no manual retaliation.
- If only YOU launch this turn -> you win (nuclear).
- If BOTH of you happen to launch the same turn -> mutual destruction (both lose).
Whether you play first or second this turn does not change this: the opponent could already have a
ready bomb. Launching is a calculated risk, not a guaranteed win.
=== RESOURCES (only two) ===
| Resource | Income | Usage |
|-------------|---------------------------------|--------------------------------|
| Credits (C) | +1/turn passive + mines | Units, mines, silo |
| Uranium (U) | only from uranium mines | The nuclear bomb -- SECRET |
Start: 5C, 0U. Unlimited storage. There is NO steel, NO trucks, NO factories.
=== DEPOSITS & MINES ===
Build a mine directly with the "build" action on a deposit (no truck needed):
- credit_mine : on a credits deposit -> +3 C/turn (costs 2 C to build)
- uranium_mine : on a uranium deposit -> +1 U/turn (costs 2 C to build)
- uranium_mine_central : on the central deposit of column 6 -> +1 U/turn, buildable by either side
(costs 4 C to build)
YOU MAY MINE ENEMY-SIDE DEPOSITS TOO: a mine can be built on ANY matching deposit that is (a) in your
current FIELD OF VIEW and (b) FREE (no building already on it) -- including deposits on the opponent’s
half. There is NO instant capture of an intact enemy mine: you must first DESTROY the enemy mine (a
tank hit) or wait for its deposit to exhaust, which frees the cell; then, while you keep it in view,
you can build your OWN mine there to steal that resource. (The SILO is the exception -- it can only be built in YOUR OWN territory.)
YOUR OWN deposit positions are listed in your state ("terrain"). The ENEMY’s deposits are
hidden by fog: you only learn them by scouting the enemy half (do not assume any layout).
DEPOSITS DEPLETE: each deposit holds a finite reserve. A mine extracts its production from that
reserve every turn; when the reserve hits 0 the deposit is EXHAUSTED -- the mine on it is REMOVED
(it has nothing left to extract) and a FRESH deposit of the same kind respawns somewhere else on the
SAME side it ran out (the central one respawns on column 6). So a mine is not forever: keep scouting,
and be ready to relocate and rebuild when a deposit runs dry. Your "terrain" reflects the CURRENT
deposit positions each turn.
=== FOG OF WAR ===
- Your units/buildings reveal cells within their detection range (Chebyshev).
- Enemy UNITS: visible only inside your current vision.
- Enemy BUILDINGS: once seen, remembered with a last_seen_tour.
- Enemy-side DEPOSITS: once scouted, remembered in "remembered_enemy_deposits" (kind, pos,
reserve, last_seen_tour, currently_visible). You still need the cell in your CURRENT field of
view to build a mine on it (the enemy may have built there since you last saw it).
- Your uranium total is SECRET to the opponent; you only see your own.
- ENEMY BASE DISCOVERY IS MANDATORY TO LAUNCH THE BOMB.
Your state exposes "enemy_base_discovered" (true/false) and "enemy_base_position".
If false, launch will FAIL.
=== UNITS (NO hit points -- every hit is lethal) ===
| Unit | Cost | Move | Detection | Range | Can attack |
|----------|------|------|-----------|-------|-------------------------------------|
| drone | 2C | 3 | 3 | -- | nothing (recon only) |
| sam | 3C | 2 | 2 | 2 | AERIAL targets only (drone, fighter) |
| tank | 4C | 2 | 1 | 2 | tank, sam, and BUILDINGS |
| fighter | 4C | 3 | 2 | 2 | tank, drone, fighter (NOT buildings, NOT sam) |
A unit may MOVE (once) and ATTACK (once) per turn, in either order. All units are produced from your
base onto a FREE adjacent cell chosen by the engine (you cannot pick the cell). If all suitable
neighbour cells are blocked, production FAILS. Your state field "base_spawn" tells you how many of
the base’s 8 neighbour cells are currently free to spawn on ("free_ground"/"free_air", with the exact
"free_ground_cells"/"free_air_cells"); ground spawns need a non-mountain cell free of ground units and
buildings, air spawns only need a cell free of other air units. If "free_ground" is 0 you cannot
produce a ground unit (tank/sam) this turn -- move a unit off an adjacent cell first, or produce air.
A unit produced this turn CAN move and attack the same turn.
Detection ranges above mean every unit also reveals fog around itself (Chebyshev radius).
=== COMBAT (tactical triangle, all unit combat is fatal -- no HP for units) ===
Fighter beats Tank, Drone, Fighter.
Tank beats SAM, Tank, and damages BUILDINGS (2 HP per hit; ONLY the tank can hit buildings).
SAM beats Fighter and Drone (AERIAL targets only -- cannot hit tanks or other SAMs on the ground).
MIRROR RULE (tank vs tank, fighter vs fighter): the ATTACKER survives, the target is destroyed.
Attacking is one-way: you destroy a legal target with no risk to your attacker. There is NO
positional/defensive bonus -- direction of attack never matters, only the unit-type matrix and range.
LINE OF SIGHT: a GROUND attacker (tank, sam) cannot fire THROUGH a mountain OR a building -- if EITHER
a mountain OR ANY building (yours, the enemy’s, even a mine or silo) lies on the straight line between
the shooter and its target cell, the attack FAILS with the error "Line of sight blocked by a mountain
or building". This is the most common reason a ground attack is rejected: an in-range target is NOT
enough, the straight line to it must also be clear. The target’s OWN cell never blocks, so a tank can
still hit a building or unit it directly targets. The FIGHTER (air) ignores all obstacles for line of
sight and is never blocked this way.
Attacks are FORBIDDEN during an active ceasefire. You CANNOT attack your own units or buildings.
If you target a cell with no legal target for your unit type, the action fails (counts as illegal).
=== BUILDINGS (these DO have HP) ===
| Building | Cost | HP | Effect |
|-----------------------|------|----|-------------------------------------------|
| base | -- | 4 | HQ, produces all units. 0 HP = you lose |
| credit_mine | 2C | 2 | +3 C/turn (on a credits deposit) |
| uranium_mine | 2C | 2 | +1 U/turn (on a uranium deposit) |
| uranium_mine_central | 4C | 3 | +1 U/turn (on the column-6 central deposit)|
| silo | 5C | 3 | Required to launch the bomb |
Placement: mines on any matching deposit you can SEE that is free (yours OR the enemy’s side -- see
DEPOSITS & MINES); silo only on a free cell in YOUR OWN territory (not on a deposit, not on a
mountain). A cell occupied by any ground unit blocks building.
Mines are cheap but FRAGILE (2 HP, destroyed in 1 tank hit) -- and the deposit they sit on can run
out (see DEPOSITS & MINES): factor the 2 C rebuild cost into raids and defence.
CONSTRUCTION DELAY: a building you place is UNDER CONSTRUCTION until your NEXT turn. While under
construction it:
- produces NOTHING (a mine’s income / a silo’s launch capability only start the turn AFTER it finishes);
- a SILO under construction CANNOT launch this turn (you must wait one turn);
- is destroyed INSTANTLY by a single enemy hit, regardless of HP (a finished building instead
loses 2 HP per tank hit and needs several hits).
A building DOES already provide its vision/detection while under construction. Your state marks each
building with "under_construction": true/false. Defend fresh builds -- they are fragile for one turn.
Only the BASE produces units. Mines/silo never produce units. HP shown is current; max HP is in the table.
=== THE NUCLEAR BOMB ===
Base cost: 25U. "bomb_cost" in your state already reflects ALL adjustments below -- trust that number.
PRESSURE after turn 40: cost drops 2U every 10 turns (floor 13U). This only affects the nuclear path;
the MILITARY win (tanks bring enemy base to 0 HP) is always available and is not time-gated.
Launching during an active ceasefire: +6U penalty (already in bomb_cost).
REQUIREMENTS (all true): an OPERATIONAL silo (not under construction) + uranium >= bomb_cost +
enemy base discovered. The bomb always targets the enemy base automatically.
=== DIPLOMACY (free, outside the 3-action quota) ===
| Field | Purpose |
|------------------------|-------------------------------------------------------------|
| "message" | 1-2 sentences (~500 chars). Read by the opponent next turn. |
| "diplomatic_proposal" | A binding proposal object (see below) or null |
| "diplomatic_responses" | Accept/refuse the opponent’s pending proposals |
Proposals (binding if accepted):
| Type | Effect | Available | "diplomatic_proposal" value |
|-----------|-------------------------|-----------|-----------------------------------------|
| ceasefire | No attacks for 3 turns | Turn 10+ | {"type": "ceasefire"} |
| peace | IMMEDIATE DRAW | Turn 15+ | {"type": "peace"} |
| ultimatum | "Surrender by turn X" | Turn 10+ | {"type": "ultimatum", "target_turn": X} |
HOW TO PROPOSE: put ONE proposal object in "diplomatic_proposal" this turn (only one per turn). The
"text" of your "message" field is attached to it as the wording the opponent reads. A proposal before
its availability turn (or an ultimatum with target_turn outside [current_turn+1, current_turn+3]) is
silently rejected by the engine and never reaches the opponent.
HOW TO ACCEPT / REFUSE: your state lists proposals awaiting YOUR answer under "diplomacy_pending",
each with its "proposal_id". Reply with:
"diplomatic_responses": [{"proposal_id": N, "accept": true}] (or "accept": false to refuse)
You may answer several pending proposals in the same turn. A proposal you ignore simply stays pending.
EFFECTS WHEN ACCEPTED:
- ceasefire: neither side may ATTACK for 3 turns (see the ceasefire rules below).
- peace: the match ends IMMEDIATELY as a DRAW (1 point each).
- ultimatum: a VOLUNTARY-SURRENDER channel -- if you ACCEPT, the PROPOSER wins immediately and you
lose BUT receive 0.5 consolation points (more than the 0 points of an ordinary defeat, so
surrendering a lost position is rewarded). Refusing or ignoring an ultimatum has NO automatic
penalty or defeat: it is only psychological pressure backed by the proposer’s real board position.
A written promise inside "message" is NOT binding -- only a diplomatic_proposal object is.
=== CEASEFIRE RULES ===
While a ceasefire is active ("ceasefire_active": true in your state):
- ATTACK actions are FORBIDDEN for BOTH sides and will fail (counts as an illegal action).
- You may still move, produce, build and scout freely -- position yourself for when it ends.
- A ceasefire CANNOT be broken early by a conventional attack; the only thing it does not stop is the
NUCLEAR bomb. You MAY still LAUNCH during a ceasefire, but it costs +6U extra (already in bomb_cost).
- You can still propose/accept PEACE (immediate draw) or send an ULTIMATUM during a ceasefire -- i.e.
you can surrender or sue for peace while it lasts.
The ceasefire ends automatically after 3 turns; combat is allowed again afterwards.
=== AVAILABLE ACTIONS ===
Produce a unit (from base): {"type": "produce", "unit": "tank"}
Move a unit: {"type": "move", "unit": "A_tank_1", "to": [x, y]}
Attack (within range): {"type": "attack", "unit": "A_tank_1", "target_pos": [x, y]}
Build: {"type": "build", "target": "silo", "pos": [x, y]}
Launch the bomb: {"type": "launch"}
Pass: {"type": "wait"}
unit types: drone | sam | tank | fighter
build targets: credit_mine | uranium_mine | uranium_mine_central | silo
=== INTER-TURN MEMORY (what you do and do NOT remember) ===
You are a STATELESS agent: you are invoked fresh each turn and keep NO hidden memory between turns.
Everything you know is in the state you are given THIS turn:
- the CURRENT board as you can see it (fog of war): your units/buildings/resources, visible enemy
units, remembered enemy buildings (with last_seen_tour), remembered enemy-side deposits,
enemy_base_discovered/position;
- turn number, max_turns, you_play_first, bomb_cost, ceasefire_active, combat stats;
- ONLY the immediately PREVIOUS turn’s outcome: "last_turn_results" / "last_turn_errors" (fix any
illegal action you attempted) and "events_against_you" (e.g. a unit you lost since you last played);
- the diplomatic situation NOW: "diplomacy_pending" (proposals awaiting your answer) and
"opponent_last_message".
The ONE thing kept for the WHOLE match is the DIPLOMATIC record: "diplomacy_history" lists every
message/proposal/response from both sides since turn 1. Use it to judge whether the opponent is
honest, bluffing, or has broken earlier promises -- that long-run reputation is the only persistent
signal you carry. (Per-turn tactical details from older turns are NOT replayed to you; reconstruct
the board from the current state.)
The "message" you write is READ BY YOUR OPPONENT next turn.
Reply ONLY with the JSON object.
```
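The proposal rules in the prompt above (availability turn per type, and the ultimatum window `[current_turn+1, current_turn+3]`) can be sketched as a small client-side check. This is an illustrative Python sketch, not the engine's code: the function name `proposal_is_deliverable` and the handling of unknown proposal types are assumptions.

```python
from typing import Optional

# Earliest turn at which each proposal type may be sent (from the proposals table).
AVAILABLE_FROM = {"ceasefire": 10, "peace": 15, "ultimatum": 10}


def proposal_is_deliverable(proposal: Optional[dict], current_turn: int) -> bool:
    """Return True if the proposal would reach the opponent under the stated rules."""
    if proposal is None:
        return False  # nothing was proposed this turn
    kind = proposal.get("type")
    if kind not in AVAILABLE_FROM:
        return False  # assumption: unknown types are dropped like any invalid proposal
    if current_turn < AVAILABLE_FROM[kind]:
        return False  # sent before its availability turn: silently rejected
    if kind == "ultimatum":
        target = proposal.get("target_turn")
        if not isinstance(target, int):
            return False
        # target_turn must fall inside [current_turn + 1, current_turn + 3]
        return current_turn + 1 <= target <= current_turn + 3
    return True
```

A model-side wrapper could run such a check before emitting `"diplomatic_proposal"`, since a rejected proposal never reaches the opponent and produces no visible effect.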
## Appendix B Engine resolution logic \(pseudocode\)
To allow independent audit of the rule\-critical mechanics without releasing the full engine source \(kept private to avoid benchmark contamination, Section [3](https://arxiv.org/html/2606.24391#S3)\), we provide simplified pseudocode of the four functions a reviewer is most likely to question: simultaneous launch resolution, launch legality, build\-placement legality, and fog\-of\-war memory update\. The pseudocode is faithful to the engine’s behaviour but omits defensive bookkeeping; the replay JSON is the ground truth against which any implementation can be checked\.
#### Launch resolution \(simultaneous, end of turn\)\.
```
resolve_launches():                        # called AFTER both players took their half-turn
    if outcome.over: return
    a = launched[0] this turn; b = launched[1] this turn
    if a and b:                            # both fired -> mutual destruction
        apply_launch(0); apply_launch(1)
        outcome = MUTUAL_DESTRUCTION (winner = -1)
    elif a or b:                           # exactly one fired -> that player wins
        winner = 0 if a else 1
        apply_launch(winner)               # destroys the opponent’s base (hp = 0)
        outcome = NUCLEAR (winner)
    # if neither fired, the turn simply advances
```
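The resolution rule above can be written as a short runnable Python sketch. The `Match` container, its field names, the placeholder base HP of 10, and the reset of the launch flags at the end of the call are assumptions made for illustration; only the three-way branching comes from the pseudocode.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Match:
    # Hypothetical minimal state; the real engine tracks far more.
    base_hp: List[int] = field(default_factory=lambda: [10, 10])  # placeholder HP
    launched: List[bool] = field(default_factory=lambda: [False, False])
    outcome: Optional[str] = None   # None while the match is still running
    winner: Optional[int] = None    # 0 or 1, or -1 for mutual destruction


def resolve_launches(m: Match) -> None:
    """End-of-turn resolution, called after both players took their half-turn."""
    if m.outcome is not None:
        return
    a, b = m.launched
    if a and b:                      # both fired: mutual destruction
        m.base_hp = [0, 0]
        m.outcome, m.winner = "MUTUAL_DESTRUCTION", -1
    elif a or b:                     # exactly one fired: that player wins
        m.winner = 0 if a else 1
        m.base_hp[1 - m.winner] = 0  # the opponent's base is destroyed
        m.outcome = "NUCLEAR"
    # Assumption: flags are cleared so the next turn starts clean.
    m.launched = [False, False]
```

Because the check runs only after both half-turns, the second player can still answer a launch with a launch of their own in the same turn, which is what makes the sole-launcher outcome largely mechanical.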
#### Launch legality \(checked when the launch action is issued\)\.
```
do_launch(player, action):
    silo = a silo owned by player (a finished one if any)
    if no such silo: reject "No silo"
    if silo.under_construction: reject "Silo still under construction"
    cost = current_bomb_cost()             # 25U base, decay from turn 40, +6U if ceasefire
    if uranium[player] < cost: reject "Not enough uranium"
    if not enemy_base_discovered[player]: reject "Enemy base unknown - scout first"
    uranium[player] -= cost
    mark launched[player] this turn = True # bomb is "in flight"; NOT resolved here
    # the bomb’s effect resolves later via resolve_launches(), so the opponent
    # still plays their half-turn this same turn (and may also launch -> mutual)
    accept "Bomb launched (resolves at end of turn)"
```
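The cost rule referenced by `current_bomb_cost()` (25U base, 2U off every 10 turns after turn 40, floor 13U, +6U during a ceasefire) can be sketched in Python together with the legality checks. Two points are assumptions rather than documented behaviour: that the first 2U drop lands at turn 50, and that the ceasefire penalty is added after the floor is applied. The function names and the boolean `has_finished_silo` are also illustrative.

```python
from typing import Optional

BASE_COST = 25          # uranium, from the prompt
DECAY_START = 40        # turn after which the cost starts to fall
DECAY_PER_STEP = 2      # uranium removed per 10-turn step
COST_FLOOR = 13
CEASEFIRE_PENALTY = 6


def bomb_cost(turn: int, ceasefire_active: bool = False) -> int:
    # Assumption: the first drop happens at turn 50, the next at turn 60, and so on.
    steps = max(0, (turn - DECAY_START) // 10)
    cost = max(COST_FLOOR, BASE_COST - DECAY_PER_STEP * steps)
    # Assumption: the ceasefire penalty is added on top of the floored cost.
    return cost + (CEASEFIRE_PENALTY if ceasefire_active else 0)


def launch_rejection(turn: int, uranium: int, has_finished_silo: bool,
                     base_discovered: bool,
                     ceasefire_active: bool = False) -> Optional[str]:
    """Return the rejection message, or None if the launch is legal."""
    if not has_finished_silo:
        return "No silo"
    if uranium < bomb_cost(turn, ceasefire_active):
        return "Not enough uranium"
    if not base_discovered:
        return "Enemy base unknown - scout first"
    return None
```

In practice a model does not need to recompute the cost, since `bomb_cost` in the state already includes every adjustment; the sketch is only meant to make the schedule auditable against replays.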
#### Build\-placement legality \(the main source of illegal actions\)\.
```
validate_build(player, building_type, x, y):
    if building_at(x, y): reject "Cell already has a building"
    if (x,y) adjacent to any base: reject "adjacent to a base (kept free)"
    if (x,y) not in visible_cells(player): reject "Cell not in your field of view"
    if ground_unit_at(x,y): reject "A ground unit occupies the cell"
    if is_mountain(x,y): reject "Cannot build on a mountain"
    deposit = deposit_at(x,y)
    if building_type == CREDIT_MINE and deposit != CREDITS: reject "must be on a credits deposit"
    if building_type == URANIUM_MINE and deposit != URANIUM: reject "must be on a uranium deposit"
    if building_type == URANIUM_MINE_CENTRAL and not is_central(x,y): reject "must be on central deposit"
    if building_type == SILO:
        if deposit is not None: reject "Silo cannot be on a deposit"
        if not in_own_territory(x,y,player): reject "Silo must be in your territory"
    accept
```
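The same checks, in the same order, can be made runnable with a toy board. The `Board` dataclass and all of its fields are hypothetical stand-ins for engine state, and treating "adjacent" as the eight surrounding cells (Chebyshev distance 1) is an assumption; the rejection messages are copied from the pseudocode.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple

Cell = Tuple[int, int]


@dataclass
class Board:
    # Hypothetical containers, not the engine's data model.
    buildings: Set[Cell] = field(default_factory=set)
    bases: Set[Cell] = field(default_factory=set)
    ground_units: Set[Cell] = field(default_factory=set)
    mountains: Set[Cell] = field(default_factory=set)
    central: Set[Cell] = field(default_factory=set)
    deposits: Dict[Cell, str] = field(default_factory=dict)      # "credits" | "uranium"
    visible: Dict[int, Set[Cell]] = field(default_factory=dict)  # per player
    territory: Dict[int, Set[Cell]] = field(default_factory=dict)


def adjacent_to_base(board: Board, cell: Cell) -> bool:
    # Assumption: adjacency means one of the 8 neighbouring cells.
    x, y = cell
    return any(max(abs(x - bx), abs(y - by)) == 1 for bx, by in board.bases)


def validate_build(board: Board, player: int, building: str, cell: Cell) -> Optional[str]:
    """Return the rejection message, or None if the build is legal."""
    if cell in board.buildings:
        return "Cell already has a building"
    if adjacent_to_base(board, cell):
        return "adjacent to a base (kept free)"
    if cell not in board.visible.get(player, set()):
        return "Cell not in your field of view"
    if cell in board.ground_units:
        return "A ground unit occupies the cell"
    if cell in board.mountains:
        return "Cannot build on a mountain"
    deposit = board.deposits.get(cell)
    if building == "credit_mine" and deposit != "credits":
        return "must be on a credits deposit"
    if building == "uranium_mine" and deposit != "uranium":
        return "must be on a uranium deposit"
    if building == "uranium_mine_central" and cell not in board.central:
        return "must be on central deposit"
    if building == "silo":
        if deposit is not None:
            return "Silo cannot be on a deposit"
        if cell not in board.territory.get(player, set()):
            return "Silo must be in your territory"
    return None
```

Returning the first failing message mirrors the ordered `reject` branches, so a given illegal build maps to exactly one error class.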
#### Fog\-of\-war memory update \(per turn, per player\)\.
```
update_knowledge():                        # called after both half-turns, before next turn
    for owner in (0,1):
        enemy = 1 - owner
        visible = visible_cells(owner)     # Chebyshev disks of own units + buildings
        for b in buildings(enemy):
            if (b.x,b.y) in visible:
                remembered_buildings[(b.x,b.y)] = {type, pos, last_seen = turn}
                if b is a BASE:
                    enemy_base_discovered[owner] = True
                    enemy_base_pos[owner] = (b.x,b.y)
        # enemy-side deposits are remembered only when currently in view
        for deposit on enemy side at (dx,dy):
            if (dx,dy) in visible:
                remembered_deposits[(dx,dy)] = {kind, pos, reserve, last_seen = turn}
        # own uranium is never exposed to the opponent; enemy units vanish
        # when out of current vision (only buildings/deposits persist in memory)
```
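The two ingredients of this update, Chebyshev-disk visibility and "remember what you currently see", can be sketched in Python as follows. The vision radii passed in are hypothetical parameters (the actual per-unit ranges are defined elsewhere in the rules), the 13 by 7 grid is read as width 13 and height 7, and the memory layout is an illustrative simplification that covers buildings only.

```python
from typing import Dict, Iterable, Set, Tuple

Cell = Tuple[int, int]


def chebyshev_disk(center: Cell, radius: int,
                   width: int = 13, height: int = 7) -> Set[Cell]:
    """All on-grid cells within Chebyshev distance `radius` of `center`."""
    cx, cy = center
    return {
        (x, y)
        for x in range(max(0, cx - radius), min(width, cx + radius + 1))
        for y in range(max(0, cy - radius), min(height, cy + radius + 1))
    }


def visible_cells(sources: Iterable[Tuple[Cell, int]]) -> Set[Cell]:
    """Union of the disks of every (position, vision radius) the player owns."""
    seen: Set[Cell] = set()
    for center, radius in sources:
        seen |= chebyshev_disk(center, radius)
    return seen


def update_building_memory(memory: Dict[Cell, dict],
                           enemy_buildings: Dict[Cell, str],
                           visible: Set[Cell], turn: int) -> None:
    """Refresh memory entries for enemy buildings that are in view this turn."""
    for cell, kind in enemy_buildings.items():
        if cell in visible:
            memory[cell] = {"type": kind, "last_seen": turn}
    # Buildings outside `visible` keep their older last_seen value.
```

A building that was seen once stays in memory with its old `last_seen` value, which is how a model can end up acting on stale information several turns later.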
The full illegal\-action taxonomy in Table [3](https://arxiv.org/html/2606.24391#S4.T3) maps directly onto the rejection branches above \(e\.g\. “Cell not in your field of view” is the `visible_cells` check; “Unit not found / not owned” arises in the move/attack handlers from the same fog principle, since a referenced unit may have been destroyed or may be fog\-hidden\)\. A destroyed building is dropped from memory \(it is not retained with a stale `last_seen`\), which is why “Unit not found” and stale references form a distinct illegal class\.