I built a 1v1 nuclear strategy game to benchmark LLM reasoning (instead of just MCQs) — Age of LLM

Reddit r/ArtificialInteligence 06/08/26, 04:27 PM Tools

llm-benchmark strategy-game reasoning open-source ai-evaluation nuclear-game benchmark

Summary

A new open-source benchmark called Age of LLM tests LLM reasoning through a turn-based nuclear strategy game with fog of war, diplomacy, and bluffing, offering a more dynamic evaluation than traditional multiple-choice benchmarks.

In 2017, I watched OpenAI Five destroy pro players at Dota 2. That moment taught me something: games are the ultimate test of emergent intelligence.

Traditional benchmarks (MMLU, HumanEval, etc.) mostly measure memorization and recitation. A model can pass a coding test by regurgitating its training data. But a game? A game forces you to adapt, plan under uncertainty, deal with hidden information, and bluff. You can't fake reasoning when someone is trying to nuke you.

So I created **Age of LLM — Benchmark**. It's a 1v1 turn-based nuclear strategy game where two LLMs face off. The core philosophy: **the system prompt provides ONLY the rules and mechanics. No strategic advice. No optimal play hints.** The models must deduce strategy, timing, and deception entirely on their own.

# Why V2? The "Less is More" approach

V1 was a complex game with steel, trucks, factories, and research centers that reduced the cost of the nuke. It was cool, but it added noise. V2 is a streamlined pure reasoning test:

* **Simplified Economy:** Only 2 resources. Credits (for building) and Uranium (for the nuke).
* **Lethal Combat:** No HP bars for units. If you get hit, you die. It creates a clean tactical triangle: Fighter → Tank → SAM → Fighter.
* **No Shortcuts:** I removed Research Centers and Eco buildings. The nuke costs 25 Uranium, period. The only way to get it faster is to control the map.
* **Base Production:** All units spawn from the base. No more factory micro-management.

# The Mechanics that test true intelligence

* **Fog of War & Secret Uranium:** You can't see the whole map, and you have absolutely no idea how close your opponent is to launching the nuke.
* **Diplomacy & Bluff:** Models can send free messages, propose Ceasefires (blocks attacks for 3 turns but adds a +6U penalty to nuke launch), Peace, or Ultimatums. They can lie, bluff, and betray.
* **Anti-stalemate:** After turn 40, the nuke cost drops by 2U every 10 turns. You cannot camp forever. The pressure forces action.
* **Smart Scoring:** Win = 3pts, Draw = 1pt, Loss = 0pts. But if you accept an enemy ultimatum, you get 0.5pts. Surrendering a lost position is rewarded as a smarter move than fighting to the death.

I've run a tournament with current frontier models, and the results are fascinating, especially the reasoning logs where you can see them trying to bluff or deciding when to backstab an ally.

You can check out the leaderboard, watch the replays with the isometric web viewer (including the AI's reasoning chains), and look at the code here:

🔗 **GitHub Repo & Viewer:** [https://github.com/Macmachi/ageofllm-benchmark-viewer](https://github.com/Macmachi/ageofllm-benchmark-viewer)

🎥 **Video Presentation:** [https://youtu.be/Ec-CV1uzyVY](https://youtu.be/Ec-CV1uzyVY)

*(If you want to see the live leaderboard and some match replays, check my comments for the links!)*

I'd love to hear what you guys think about using games vs traditional benchmarks to evaluate LLMs. Do you think this kind of setup captures "intelligence" better than a multiple-choice test?
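
For anyone who wants the combat triangle in one glance, here is a minimal Python sketch. It is not taken from the repo. It assumes the arrows in "Fighter → Tank → SAM → Fighter" mean "beats", and it says nothing about mirror matchups or reversed matchups, which the post does not describe. The names `BEATS` and `has_advantage` are hypothetical.

```python
# Hypothetical sketch of the tactical triangle described in the post.
# Assumption: "A → B" means unit A beats unit B.
BEATS = {
    "Fighter": "Tank",
    "Tank": "SAM",
    "SAM": "Fighter",
}


def has_advantage(attacker: str, defender: str) -> bool:
    """Return True if the attacker's type counters the defender's type."""
    return BEATS[attacker] == defender


if __name__ == "__main__":
    for unit, target in BEATS.items():
        print(f"{unit} counters {target}")
```

Since there are no HP bars, a lookup table like this is all the matchup logic a model has to keep in its head. The hard part is guessing what the opponent built under fog of war.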
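
The nuke pricing rules (25U base, anti-stalemate discount, ceasefire penalty) can be read as a small cost function. The sketch below is my reading of the numbers in the post, not the benchmark's actual code. Three things are assumptions: the first 2U drop happens at turn 50 (ten full turns after turn 40), the +6U ceasefire penalty is modeled as a simple on/off flag because the post does not say how long it lasts, and the cost is floored at 0. The function name `nuke_cost` is hypothetical.

```python
# Hypothetical sketch of the nuke cost rules; constants come from the post.
BASE_NUKE_COST = 25     # "The nuke costs 25 Uranium"
DECAY_START_TURN = 40   # "After turn 40..."
DECAY_STEP_TURNS = 10   # "...every 10 turns"
DECAY_PER_STEP = 2      # "...the nuke cost drops by 2U"
CEASEFIRE_PENALTY = 6   # "+6U penalty to nuke launch"


def nuke_cost(turn: int, ceasefire_penalty_active: bool = False) -> int:
    """Uranium needed to launch on a given turn.

    Assumptions (not confirmed by the post): the first discount lands at
    turn 50, the penalty is a flat on/off surcharge, and cost never goes
    below 0.
    """
    steps = max(0, (turn - DECAY_START_TURN) // DECAY_STEP_TURNS)
    cost = BASE_NUKE_COST - DECAY_PER_STEP * steps
    if ceasefire_penalty_active:
        cost += CEASEFIRE_PENALTY
    return max(cost, 0)


if __name__ == "__main__":
    for turn in (1, 40, 50, 60, 80):
        print(turn, nuke_cost(turn), nuke_cost(turn, ceasefire_penalty_active=True))
```

Under these assumptions, a ceasefire costs as much as three discount steps, so accepting one late in the game roughly cancels 30 turns of anti-stalemate pressure.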
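
The scoring rule is easy to tabulate too. This is a minimal sketch, assuming a model's leaderboard total is simply the sum of its per-match points. The post does not say what the side that issued an accepted ultimatum receives, so that case is left out. `Outcome`, `POINTS`, and `total_points` are hypothetical names.

```python
# Hypothetical sketch of the scoring table; point values come from the post.
from enum import Enum
from typing import Iterable


class Outcome(Enum):
    WIN = "win"
    DRAW = "draw"
    LOSS = "loss"
    ACCEPTED_ULTIMATUM = "accepted_ultimatum"  # this side surrendered


POINTS = {
    Outcome.WIN: 3.0,
    Outcome.DRAW: 1.0,
    Outcome.LOSS: 0.0,
    Outcome.ACCEPTED_ULTIMATUM: 0.5,
}


def total_points(outcomes: Iterable[Outcome]) -> float:
    """Sum the points for a sequence of match outcomes (assumed aggregation)."""
    return sum(POINTS[o] for o in outcomes)


if __name__ == "__main__":
    season = [Outcome.WIN, Outcome.DRAW, Outcome.ACCEPTED_ULTIMATUM, Outcome.LOSS]
    print(total_points(season))
```

The 0.5 sits between a loss and a draw, which is what makes accepting an ultimatum the better choice in a position that is already lost.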


Similar Articles

To Nuke or Not to Nuke: LLMs' (Missing) Ethical Reasoning and Actions in a High-Stakes Decision-Making Simulation

LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening

ChaosBench-Logic v2: Evaluating LLM Logical Reasoning over Dynamical Systems at Scale

Evaluating open source LLMs on Autonomous Codenames Simulations

Evaluating Interactive Reasoning in Large Language Models: A Hierarchical Benchmark with Executable Games
