Tag
Introduces Age of LLM, a turn-based 1v1 benchmark where LLMs compete on a grid with fog of war and diplomacy, measuring reasoning, reliability, and strategic planning. Findings show that nuclear rush tactics dominate and that reliability is only weakly linked to winning.
RTSGameBench is a benchmark for evaluating strategic reasoning in vision-language models using the real-time strategy game Beyond All Reason. It provides diverse matchups, diagnostic mini-games, and a self-evolving framework to generate new scenarios.
Poker Arena is a new benchmark using no-limit Texas Hold'em to evaluate LLMs' strategic reasoning and memory across multiple cognitive axes. Its multi-axis evaluation exposes capability structures that scalar leaderboards misrank.
A study testing leading LLMs in simulated nuclear crisis scenarios found that models often escalate to nuclear strikes, with Claude showing cunning strategic deception while GPT-5.2 remained passive. The models generated over 760,000 words of strategic reasoning.
Introduces SVI-Bench, a large-scale benchmark for strategic video intelligence using team sports, designed to evaluate models on dynamic scene understanding, causal reasoning, strategic simulation, and agentic synthesis. The benchmark reveals a capability cliff where models perform well on perceptual tasks but sharply degrade on higher-level strategic reasoning.
This paper introduces GENSTRAT, a benchmark that uses procedurally generated strategic environments to evaluate LLMs' strategic reasoning across multiple axes, addressing limitations of fixed game suites.
This paper introduces an open-source framework to evaluate LLMs' reasoning, persuasion, and deception capabilities in the hidden-role game Secret Hitler, finding that current models fail at sustained multi-turn manipulation and that rule-based agents outperform them.
A study shows that LLM agents can model counterparty preferences in negotiation but fail to turn that knowledge into strategic bargaining that improves outcomes, limiting their effectiveness in multi-turn negotiations.
MIT professor Gabriele Farina is advancing AI decision-making by combining game theory with machine learning, building on his earlier work on Cicero, the Diplomacy-playing AI.