I built a 2D physics arena where LLM agents sword-fight each other in real time. Turns out it's a surprisingly sharp test of tactical reasoning.

Reddit r/AI_Agents Tools

Summary

Stickblade Arena is a new benchmark where LLM agents control ragdolls in a 2D physics sword-fighting simulator, testing multi-turn tactical reasoning, spatial awareness, and real-time decision-making under adversarial pressure. Early results reveal capability gaps: DeepSeek R1 dominates melee but fails at bow due to time limits, and small models excel at close-range fighting.

**TL;DR:** I built a benchmark called **Stickblade Arena** where two LLMs run an agent loop in a 2D physics simulator: every 3 simulated seconds each agent receives a JSON world-state, has \~15 s to commit to one action, and physics resolves the consequences. Humans vote blind on who fought better, and Elo is tracked per (model, weapon, sharp-zone). It has revealed some capability gaps that standard evals miss.

# Why I built this

Most LLM evals are **static and closed-form**: MMLU, HumanEval, MT-Bench. They test what a model *knows*, not whether it can hold a plan together across 24 adversarial turns in a deterministic environment that punishes bad spatial reasoning. So I built one that does.

The agent loop is brutally simple:

```text
loop:
    state  = build_state(me, opponent, last_events)   # ~600 byte JSON
    reply  = llm.decide(state)                         # 15s deadline
    controller.execute(reply)                          # 3s of physics
    events = combat_system.resolve_hits()
```

Each turn the model gets something like:

```json
{
  "turn": 4,
  "my_hp": 67,
  "enemy_hp": 80,
  "distance": 142,
  "me": {
    "torso": [412, 150],
    "head": [412, 191],
    "weapon_tip": [461, 180],
    "facing": 1,
    "velocity": [30, -2]
  },
  "enemy": {
    "torso": [554, 150],
    "head": [554, 193],
    "facing": -1
  },
  "relative": { "dx": 142, "dy": 0, "enemy_is": "right", "facing_enemy": true },
  "ranged_hint": { "arrow_flight_time_s": 0.20, "vertical_drop_to_compensate": 24 },
  "enemy_last_action": "guard_high",
  "last_turn_hits": [
    { "by": "enemy", "zone": "edge", "damage": 4.1, "was_sharp": false }
  ]
}
```

The model must reply with either one tactical action plus footwork (MACRO mode) or a flex/extend/hold/relax state for *every joint* in its body (JOINT mode, basically Toribash).

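For anyone curious how the per-turn deadline in the loop above could be enforced, here is a minimal sketch. It is not the arena's actual code: `decide_with_deadline`, the `forfeit` and `turn_around` action names, and the reply shape `{"action": ...}` are all assumptions made for illustration. Only `guard_high` and the `relative.facing_enemy` field come from the state example. The idea is to run the model call in a worker thread and fall back to a forfeit if the reply is late or is not valid JSON.

```python
import concurrent.futures
import json
import time

FORFEIT = {"action": "forfeit"}  # hypothetical fallback action name


def decide_with_deadline(decide, state, deadline_s=15.0):
    """Call decide(state_json) in a worker thread; forfeit if late or malformed."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(decide, json.dumps(state))
    try:
        raw = future.result(timeout=deadline_s)
        reply = json.loads(raw)
    except (concurrent.futures.TimeoutError, json.JSONDecodeError, TypeError):
        # Late reply, invalid JSON, or a non-string reply all count as a forfeit.
        return dict(FORFEIT)
    finally:
        # Do not block the match waiting for a slow model to finish.
        pool.shutdown(wait=False)
    if not isinstance(reply, dict) or "action" not in reply:
        return dict(FORFEIT)
    return reply


def fast_brain(state_json):
    """Stand-in for an LLM call: checks facing before committing to an action."""
    state = json.loads(state_json)
    if not state["relative"]["facing_enemy"]:
        return json.dumps({"action": "turn_around"})  # hypothetical action name
    return json.dumps({"action": "guard_high"})


def slow_brain(state_json):
    """Stand-in for a model whose reasoning overruns the budget."""
    time.sleep(0.3)
    return json.dumps({"action": "guard_high"})


state = {"turn": 4, "relative": {"dx": 142, "dy": 0,
                                 "enemy_is": "right", "facing_enemy": True}}
on_time = decide_with_deadline(fast_brain, state, deadline_s=1.0)
too_late = decide_with_deadline(slow_brain, state, deadline_s=0.05)
```

Here `on_time` carries the brain's action while `too_late` falls back to the forfeit. A thread-based timeout cannot cancel the underlying call, so a real backend would also want to abort the HTTP request to the model provider.
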
# What I'm actually testing

|Capability|Standard evals|This|
|:-|:-|:-|
|Static knowledge|✅ MMLU|not tested|
|Multi-turn coherence|❌ mostly single-turn|✅ 24 turns of state continuity|
|Real-time deadline|❌ no time budget|✅ 15 s/turn or you forfeit|
|Spatial reasoning|partial|✅ continuous 2D physics|
|Constraint satisfaction|partial|✅ only specific weapon zones do damage|
|Adversarial pressure|❌ fixed opponent|✅ opponent is *another LLM also adapting*|
|Outcome-scored creativity|❌ judged by prose|✅ judged by who lands lethal hits|

# Surprises after watching a few hundred fights

* **DeepSeek R1** dominates MACRO sword (it actually winds up, then strikes coherently across turns) but **loses at bow** because its long reasoning chains blow the 15 s budget on snap shots
* **Small models (Llama 3.2 3B)** punch above their weight at **dagger / clinch range**: they don't overthink the close-distance game
* **GPT-OSS 120B** has the most consistent multi-turn plans; you can almost see it executing a 3-move kill chain
* **JOINT mode is brutal**: even strong models struggle to compose "extend shoulder + flex elbow" into a coherent swing. There is a big gap between strategic planning and embodied motor planning
* Models that ignore `relative.facing_enemy` whiff their first strike and never recover. This single field is a clean test of whether the model actually parses the state

The blind-voting setup (server-side randomization of which model becomes the green vs blue ragdoll) means the leaderboard can't be gamed by brand recognition.

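To make the scoring concrete, here is a rough sketch of blind side assignment plus a per-(model, weapon, sharp-zone) Elo update. It uses the textbook Elo formula; the K-factor of 32, the 1500 starting rating, the function names, and the placeholder model names are my assumptions for illustration, not values taken from the arena.

```python
import random
from collections import defaultdict

K = 32           # assumed K-factor
START = 1500.0   # assumed starting rating

# One rating per (model, weapon, sharp_zone) key, created on first use.
ratings = defaultdict(lambda: START)


def expected(r_a, r_b):
    """Standard Elo expected score of a player rated r_a against r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def assign_sides(model_a, model_b, rng=random):
    """Randomly map the two models to the green and blue ragdolls."""
    models = [model_a, model_b]
    rng.shuffle(models)
    return {"green": models[0], "blue": models[1]}


def record_vote(winner, loser, weapon, sharp_zone):
    """Apply one blind vote to the ratings for this weapon and sharp zone."""
    key_w = (winner, weapon, sharp_zone)
    key_l = (loser, weapon, sharp_zone)
    delta = K * (1.0 - expected(ratings[key_w], ratings[key_l]))
    ratings[key_w] += delta
    ratings[key_l] -= delta
    return delta


# Voters only ever see "green" and "blue"; the mapping stays server-side.
sides = assign_sides("model-a", "model-b", random.Random(0))
voted_colour = "green"
other_colour = "blue"
delta = record_vote(sides[voted_colour], sides[other_colour], "sword", "edge")
```

Keying the table by weapon and zone means a model's bow rating never leaks into its sword rating, which is what lets a result like "dominates sword, loses at bow" show up on the leaderboard at all.
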
# Stack (if anyone wants to fork it)

* Physics: **pymunk** (Chipmunk2D), 60 Hz with 2× substeps
* Backend: **FastAPI** on HF Spaces; brains hit OpenRouter / OpenAI / Gemini
* Frontend: **Next.js 15** \+ vanilla canvas replay player on Vercel
* Storage: Supabase (Postgres + storage bucket for replay JSON)
* 21 free OpenRouter models in the picker out of the box

Also included: single-elim tournaments (4 or 8 models, live bracket viewer), pre-fight LLM trash talk, a post-fight commentator-roast LLM, and killcam slow-mo at lethal hits.

# Open questions I'd love discussion on

1. Is there an obvious capability **this can't surface** that I'm missing?
2. The 15 s/turn budget penalizes deep-reasoning models. Would a "thinking-time-equalized" mode (give R1 60 s, give Haiku 5 s) be a more honest comparison or just a worse benchmark?
3. JOINT mode is closer to "embodied" agents. Anyone working on something where the agent's outputs translate directly to continuous motor controls?
4. The state payload is \~600 bytes. I'd love to A/B-test richer payloads (full skeleton joint angles? velocity history?). Has anyone done eval work on what state-format choices favor which model families?

Pick two models, set a sharp zone, and watch them fight. Mocks work without API keys if you want to try the UI before plugging keys in.