I built a 2D physics arena where LLM agents sword-fight each other in real time. Turns out it's a surprisingly sharp test of tactical reasoning.

Reddit r/AI_Agents 06/15/26, 06:19 AM Tools

llm-benchmark physics-simulator tactical-reasoning multi-agent open-source arena ai-evaluation

Summary

Stickblade Arena is a new benchmark where LLM agents control ragdolls in a 2D physics sword-fighting simulator, testing multi-turn tactical reasoning, spatial awareness, and real-time decision-making under adversarial pressure. Early results reveal capability gaps: DeepSeek R1 dominates melee but fails at bow due to time limits, and small models excel at close-range fighting.

**TL;DR:** I built a benchmark called **Stickblade Arena** where two LLMs run an agent loop in a 2D physics simulator: every 3 simulated seconds each agent receives a JSON world-state, has \~15 s to commit to one action, and physics resolves the consequences. Humans vote blind on who fought better, and Elo is tracked per (model, weapon, sharp-zone). It has revealed some capability gaps that standard evals miss.

# Why I built this

Most LLM evals are **static and closed-form**: MMLU, HumanEval, MT-Bench. They test what a model *knows*, not whether it can hold a plan together across 24 adversarial turns in a deterministic environment that punishes bad spatial reasoning. So I built one that does.

The agent loop is brutally simple:

```text
loop:
    state  = build_state(me, opponent, last_events)   # ~600 byte JSON
    reply  = llm.decide(state)                        # 15s deadline
    controller.execute(reply)                         # 3s of physics
    events = combat_system.resolve_hits()
```

Each turn the model gets something like:

```json
{
  "turn": 4,
  "my_hp": 67,
  "enemy_hp": 80,
  "distance": 142,
  "me": { "torso": [412, 150], "head": [412, 191], "weapon_tip": [461, 180], "facing": 1, "velocity": [30, -2] },
  "enemy": { "torso": [554, 150], "head": [554, 193], "facing": -1 },
  "relative": { "dx": 142, "dy": 0, "enemy_is": "right", "facing_enemy": true },
  "ranged_hint": { "arrow_flight_time_s": 0.20, "vertical_drop_to_compensate": 24 },
  "enemy_last_action": "guard_high",
  "last_turn_hits": [{ "by": "enemy", "zone": "edge", "damage": 4.1, "was_sharp": false }]
}
```

It must reply with either one tactical action plus footwork (MACRO mode) or a flex/extend/hold/relax state for *every joint* in its body (JOINT mode, basically Toribash).
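To make the deadline step of the loop concrete, here is a minimal Python sketch of how a per-turn time budget could be enforced. It is not the arena's actual code: `decide_with_deadline`, the `FORFEIT` reply, the `footwork` field, and the mock brains are all hypothetical names chosen for illustration, and the post does not say what a forfeit does inside the simulator.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Any, Callable, Dict

# Hypothetical reply used when a model misses the deadline or returns junk.
FORFEIT: Dict[str, Any] = {"action": "forfeit", "footwork": "hold"}


def decide_with_deadline(
    brain: Callable[[Dict[str, Any]], Dict[str, Any]],
    state: Dict[str, Any],
    deadline_s: float = 15.0,
) -> Dict[str, Any]:
    """Ask `brain` for an action, falling back to FORFEIT after `deadline_s` seconds."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(brain, state)
    try:
        reply = future.result(timeout=deadline_s)
    except FutureTimeout:
        return dict(FORFEIT)   # model thought for too long
    except Exception:
        return dict(FORFEIT)   # model call raised an error
    finally:
        # Do not block the turn waiting for a slow call to finish.
        executor.shutdown(wait=False)
    if not isinstance(reply, dict) or "action" not in reply:
        return dict(FORFEIT)   # malformed reply
    return reply


def mock_fast_brain(state: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for a model that answers immediately."""
    return {"action": "guard_high", "footwork": "hold"}


def mock_slow_brain(state: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for a model whose reasoning overruns a short deadline."""
    time.sleep(0.5)
    return {"action": "guard_high", "footwork": "hold"}
```

Calling `decide_with_deadline(mock_slow_brain, state, deadline_s=0.05)` returns the forfeit reply, while the fast mock's action passes through unchanged. A thread is used here only so the caller can stop waiting; the slow call itself keeps running in the background until it returns.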
# What I'm actually testing

|Capability|Standard evals|This|
|:-|:-|:-|
|Static knowledge|✅ MMLU|not tested|
|Multi-turn coherence|❌ mostly single-turn|✅ 24 turns of state continuity|
|Real-time deadline|❌ no time budget|✅ 15 s/turn or you forfeit|
|Spatial reasoning|partial|✅ continuous 2D physics|
|Constraint satisfaction|partial|✅ only specific weapon zones do damage|
|Adversarial pressure|❌ fixed opponent|✅ opponent is *another LLM also adapting*|
|Outcome-scored creativity|❌ judged by prose|✅ judged by who lands lethal hits|

# Surprises after watching a few hundred fights

* **DeepSeek R1** dominates MACRO sword (it actually winds up and then strikes coherently across turns) but **loses at bow** because its long reasoning chains blow the 15 s budget on snap shots
* **Small models (Llama 3.2 3B)** punch above their weight at **dagger / clinch range**: they don't overthink the close-distance game
* **GPT-OSS 120B** has the most consistent multi-turn plans; you can almost see it executing a 3-move kill chain
* **JOINT mode is brutal**: even strong models struggle to compose "extend shoulder + flex elbow" into a coherent swing. There is a big gap between strategic planning and embodied motor planning
* Models that ignore `relative.facing_enemy` whiff their first strike and never recover. This single field is a clean test of whether the model actually parses the state

The blind-voting setup (server-side randomization of which model becomes the green vs blue ragdoll) means the leaderboard can't be gamed by brand recognition.
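For readers wondering how blind votes could turn into a leaderboard keyed by (model, weapon, sharp-zone), here is a minimal sketch using the standard Elo update. The K-factor of 32, the starting rating of 1500, the `EloTable` class, and the example keys are assumptions for illustration; the post does not state the values or code the arena uses.

```python
from collections import defaultdict
from typing import DefaultDict, Tuple

# One rating per (model, weapon, sharp_zone) combination.
Key = Tuple[str, str, str]


class EloTable:
    """Standard Elo ratings, tracked separately for each key."""

    def __init__(self, k: float = 32.0, start: float = 1500.0) -> None:
        self.k = k  # assumed K-factor
        self.ratings: DefaultDict[Key, float] = defaultdict(lambda: start)

    @staticmethod
    def expected(r_a: float, r_b: float) -> float:
        """Probability that A beats B under the Elo model."""
        return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

    def record_vote(self, winner: Key, loser: Key) -> None:
        """Apply one blind vote: the winner gains exactly what the loser drops."""
        r_w, r_l = self.ratings[winner], self.ratings[loser]
        delta = self.k * (1.0 - self.expected(r_w, r_l))
        self.ratings[winner] = r_w + delta
        self.ratings[loser] = r_l - delta


table = EloTable()
a = ("model-a", "sword", "tip")   # hypothetical keys
b = ("model-b", "sword", "tip")
table.record_vote(winner=a, loser=b)
print(table.ratings[a], table.ratings[b])  # 1516.0 1484.0
```

Because the key includes weapon and sharp zone, a model that is strong with a sword and weak with a bow gets two separate ratings instead of one blended number.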
# Stack (if anyone wants to fork it)

* Physics: **pymunk** (Chipmunk2D), 60 Hz with 2× substeps
* Backend: **FastAPI** on HF Spaces; brains hit OpenRouter / OpenAI / Gemini
* Frontend: **Next.js 15** \+ vanilla canvas replay player on Vercel
* Storage: Supabase (Postgres + storage bucket for replay JSON)
* 21 free OpenRouter models in the picker out of the box

Also included: single-elim tournaments (4 or 8 models, live bracket viewer), pre-fight LLM trash talk, a post-fight commentator-roast LLM, and killcam slow-mo at lethal hits.

# Open questions I'd love discussion on

1. Is there an obvious capability **this can't surface** that I'm missing?
2. The 15 s/turn budget penalizes deep-reasoning models. Would a "thinking-time-equalized" mode (give R1 60 s, give Haiku 5 s) be a more honest comparison or just a worse benchmark?
3. JOINT mode is closer to "embodied" agents. Anyone working on something where the agent's outputs translate directly to continuous motor controls?
4. The state payload is \~600 bytes. I'd love to A/B-test richer payloads (full skeleton joint angles? velocity history?). Has anyone done eval work on what state-format choices favor which model families?

Pick two models, set a sharp zone, watch them fight. Mocks work without API keys if you want to try the UI before plugging keys in.
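For anyone forking the tournament side, the single-elim format mentioned in the stack section can be sketched in a few lines. This is an illustrative guess at the shape, not the project's code: `run_single_elim`, the `fight` callback, the seeded shuffle, and the placeholder model names are all hypothetical.

```python
import random
from typing import Callable, List


def run_single_elim(
    models: List[str],
    fight: Callable[[str, str], str],
    seed: int = 0,
) -> List[List[str]]:
    """Run a single-elimination bracket.

    `fight(a, b)` must return the winner's name. The result is a list of
    rounds, each round being the entrants still alive at that point, so
    the last round holds only the champion.
    """
    n = len(models)
    if n < 2 or n & (n - 1):
        raise ValueError("bracket size must be a power of two, e.g. 4 or 8")

    entrants = list(models)
    random.Random(seed).shuffle(entrants)  # seeded so a bracket is reproducible
    rounds = [entrants]

    while len(entrants) > 1:
        winners = []
        # Pair neighbours: (0, 1), (2, 3), ...
        for a, b in zip(entrants[0::2], entrants[1::2]):
            winner = fight(a, b)
            if winner not in (a, b):
                raise ValueError(f"fight returned unknown winner: {winner!r}")
            winners.append(winner)
        entrants = winners
        rounds.append(entrants)
    return rounds


# Mock fight for trying the bracket without API keys:
# the alphabetically first name always wins.
bracket = run_single_elim(["m1", "m2", "m3", "m4"], fight=min)
print(bracket[-1])  # ['m1']
```

In a real run, `fight` would wrap a full match (or a blind vote on its replay) and return whichever model won; the list of rounds is the kind of structure a bracket viewer could render.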


Similar Articles

I built an arena where LLMs sword-fight with real physics. You decide which part of the blade is sharp, vote blind, and free OpenRouter models battle for Elo. Llama 3.3 is currently stabbing GPT-OSS in the face.

I built a 1v1 nuclear strategy game to benchmark LLM reasoning (instead of just QCMs) — Age of LLM

Age of LLM: A Strategic 1v1 Benchmark for Reasoning, Diplomacy and Reliability of Large Language Models under Fog of War

Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction

Evaluating open source LLMs on Autonomous Codenames Simulations
