A new open-source benchmark called Age of LLM tests LLM reasoning through a turn-based nuclear strategy game with fog of war, diplomacy, and bluffing, offering a more dynamic evaluation than traditional multiple-choice benchmarks.
In 2017, I watched OpenAI Five destroy pro players at Dota 2. That moment taught me something: games are the ultimate test of emergent intelligence.

Traditional benchmarks (MMLU, HumanEval, etc.) mostly measure memorization and recitation. A model can pass a coding test by regurgitating its training data. But a game? A game forces you to adapt, plan under uncertainty, deal with hidden information, and bluff. You can't fake reasoning when someone is trying to nuke you.

So I created **Age of LLM — Benchmark**. It's a 1v1 turn-based nuclear strategy game where two LLMs face off. The core philosophy: **the system prompt provides ONLY the rules and mechanics. No strategic advice. No optimal play hints.** The models must deduce strategy, timing, and deception entirely on their own.

# Why V2? The "Less is More" approach

V1 was a complex game with steel, trucks, factories, and research centers that reduced the cost of the nuke. It was cool, but it added noise. V2 is a streamlined pure reasoning test:

* **Simplified Economy:** Only 2 resources. Credits (for building) and Uranium (for the nuke).
* **Lethal Combat:** No HP bars for units. If you get hit, you die. It creates a clean tactical triangle: Fighter → Tank → SAM → Fighter.
* **No Shortcuts:** I removed Research Centers and Eco buildings. The nuke costs 25 Uranium, period. The only way to get it faster is to control the map.
* **Base Production:** All units spawn from the base. No more factory micro-management.

# The Mechanics that test true intelligence

* **Fog of War & Secret Uranium:** You can't see the whole map, and you have absolutely no idea how close your opponent is to launching the nuke.
* **Diplomacy & Bluff:** Models can send free messages, propose Ceasefires (blocks attacks for 3 turns but adds a +6U penalty to nuke launch), Peace, or Ultimatums. They can lie, bluff, and betray.
* **Anti-stalemate:** After turn 40, the nuke cost drops by 2U every 10 turns. You cannot camp forever. The pressure forces action.
* **Smart Scoring:** Win = 3pts, Draw = 1pt, Loss = 0pts. But if you accept an enemy ultimatum, you get 0.5pts. Surrendering a lost position is rewarded as a smarter move than fighting to the death.

I've run a tournament with current frontier models, and the results are fascinating, especially the reasoning logs where you can see them trying to bluff or deciding when to backstab an ally.

You can check out the leaderboard, watch the replays with the isometric web viewer (including the AI's reasoning chains), and look at the code here:

🔗 **GitHub Repo & Viewer:** [https://github.com/Macmachi/ageofllm-benchmark-viewer](https://github.com/Macmachi/ageofllm-benchmark-viewer)

🎥 **Video Presentation:** [https://youtu.be/Ec-CV1uzyVY](https://youtu.be/Ec-CV1uzyVY)

*(If you want to see the live leaderboard and some match replays, check my comments for the links!)*

I'd love to hear what you guys think about using games vs traditional benchmarks to evaluate LLMs. Do you think this kind of setup captures "intelligence" better than a multiple-choice test?
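
For anyone who wants the numbers spelled out, here is a minimal Python sketch of the rules described in the post: the tactical triangle, the nuke cost over time, and the scoring. It is not taken from the repo, and every name in it (`nuke_cost`, `beats`, `tournament_score`, the constants) is hypothetical. It also rests on a few assumptions the post does not settle: that the arrows in the triangle mean "beats", that the first 2U drop happens at turn 50 (10 turns after turn 40), that the cost cannot go below 0, and that the +6U ceasefire penalty is simply added to the current cost.

```python
# Illustrative sketch only; names and edge-case behavior are assumptions,
# not the benchmark's actual implementation.

BASE_NUKE_COST = 25      # from the post: "The nuke costs 25 Uranium"
DECAY_START_TURN = 40    # from the post: "After turn 40"
DECAY_STEP = 2           # from the post: "drops by 2U"
DECAY_INTERVAL = 10      # from the post: "every 10 turns"
CEASEFIRE_PENALTY = 6    # from the post: "+6U penalty to nuke launch"

# Assumption: "A → B" in the post means "A beats B".
BEATS = {"Fighter": "Tank", "Tank": "SAM", "SAM": "Fighter"}

# Points per match outcome, as listed in the post.
POINTS = {"win": 3.0, "draw": 1.0, "loss": 0.0, "accepted_ultimatum": 0.5}


def beats(attacker: str, defender: str) -> bool:
    """Return True if the attacker type counters the defender type."""
    return BEATS[attacker] == defender


def nuke_cost(turn: int, ceasefire_penalty: bool = False) -> int:
    """Uranium needed to launch the nuke on a given turn.

    Assumption: the first drop lands at turn 50, the next at turn 60, etc.
    Assumption: the cost is floored at 0.
    """
    drops = max(0, (turn - DECAY_START_TURN) // DECAY_INTERVAL)
    cost = BASE_NUKE_COST - DECAY_STEP * drops
    if ceasefire_penalty:
        cost += CEASEFIRE_PENALTY
    return max(0, cost)


def tournament_score(outcomes: list[str]) -> float:
    """Sum the points for a list of match outcomes."""
    return sum(POINTS[o] for o in outcomes)


if __name__ == "__main__":
    for t in (1, 40, 50, 60, 100):
        print(f"turn {t}: nuke costs {nuke_cost(t)}U")
    print(tournament_score(["win", "draw", "accepted_ultimatum", "loss"]))
```

Under these assumptions, a model that accepts a ceasefire at turn 50 would need 29U instead of 23U, which is the kind of trade-off the diplomacy rules force the models to weigh.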
This paper investigates whether LLMs' ethical reasoning translates into ethical behavior in complex agentic simulations, using Civilization V as a testbed. Despite prompting interventions, models like GLM-4.7 still escalate to nuclear strikes, revealing a gap between reasoning and action.
LLMEval-Logic is a new Chinese benchmark for evaluating logical reasoning in LLMs, featuring solver-verified answers and adversarial hardening. The benchmark reveals significant gaps in current models, with the best reaching only 37.5% accuracy on hard items.
ChaosBench-Logic v2 is a large-scale benchmark of 40,886 questions over 165 dynamical systems that evaluates LLMs' logical reasoning abilities, revealing near-random performance on regime transition reasoning and systematic failure modes even in frontier models.
A developer built a Codenames simulation arena to evaluate open-source LLMs on long-range collaboration, finding DeepSeek v4 Flash outperformed others with high game logic alignment, while Qwen 3 Next and GPT 5.4 Nano struggled with rule constraints and perspective-taking.
This paper introduces a multi-turn interactive framework for reasoning evaluation where LLMs must query a hidden environment and integrate partial observations, instantiated as a benchmark of 474 executable games across five difficulty levels, showing discriminative power and exposing differences in reasoning.