I built an arena where LLMs sword-fight with real physics. You decide which part of the blade is sharp, vote blind, and free OpenRouter models battle for Elo. Llama 3.3 is currently stabbing GPT-OSS in the face.

Reddit r/AI_Agents Products

Summary

A new arena lets LLMs control physics ragdolls in weapon duels: users define the weapon's damage zones and vote blind, and the models battle for Elo. Free models like Llama 3.3 and GPT-OSS compete, and the whole stack can be self-hosted on free tiers.

Like Chatbot Arena, but instead of comparing text walls, two models pilot physics ragdolls in a weapons duel, and you set the weapon rules.

How it works:

- Each turn, both LLMs get the fight state as JSON (HP, distance, enemy's last move, what hit last turn) and pick an action + footwork
- Physics engine runs it: momentum, joint limits, collision damage by weapon zone × impact speed. Headshot with a "live" zone = instant kill
- THE TWIST: you choose which zones are dangerous. Tip-only sword forces fencing. Pommel-only forces clinch brawling. Flail spikes only count at high ball speed, so the model has to plan a wind-up turn. The rules go in the system prompt; the strategy is on the model
- Vote blind (Fighter A/B), names + Elo revealed after. Per-rule leaderboards

The screenshot is a real match: blue announced "Strike range. Aim the sharp zone at his head" and then ate exactly that move one turn later.

Free models (Llama 3.3 70B, GPT-OSS, Qwen3, Nemotron, Gemma) are on the roster so you can run matches at zero cost, or paste any OpenRouter id.

There's also a "joint mode" where the LLM controls all 10 joints raw, Toribash-style. Current models are... not good at having bodies. It's great.

Self-hostable on 100% free tiers (HF Spaces + Vercel + Supabase). Tournament mode generates strategy reports: aggression %, whether the model actually used the sharp zone, favorite moves per matchup.

(First fight may take a minute: free HF Space waking up.)
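For anyone curious how the "weapon zone × impact speed" rule and the Elo reveal could fit together, here is a rough Python sketch. Nothing in it comes from the project's code: `WeaponRules`, `hit_damage`, `elo_update`, the zone names, the `damage_scale` factor, and the K-factor of 32 are all assumptions made for illustration. The post does not say how ratings are actually computed, so the second function is just the textbook Elo update.

```python
from dataclasses import dataclass


# Hypothetical rule set: which weapon zones are "live" and the minimum
# impact speed a hit needs to count. All names and numbers are assumed.
@dataclass(frozen=True)
class WeaponRules:
    live_zones: frozenset
    min_speed: float = 0.0      # e.g. flail spikes only count above this speed
    damage_scale: float = 10.0  # assumed HP removed per unit of impact speed


def hit_damage(rules, zone, body_part, impact_speed, target_hp):
    """Return the HP removed by one contact under the assumed rules."""
    # Dead zones and too-slow hits do nothing.
    if zone not in rules.live_zones or impact_speed < rules.min_speed:
        return 0.0
    # A live-zone headshot is an instant kill.
    if body_part == "head":
        return target_hp
    # Otherwise damage scales with impact speed, capped at remaining HP.
    return min(target_hp, rules.damage_scale * impact_speed)


def elo_update(r_winner, r_loser, k=32.0):
    """Textbook Elo update for a decisive result; returns both new ratings."""
    expected = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected)
    return r_winner + delta, r_loser - delta


tip_only = WeaponRules(live_zones=frozenset({"tip"}))
flail = WeaponRules(live_zones=frozenset({"spikes"}), min_speed=4.0)

print(hit_damage(tip_only, "pommel", "torso", 5.0, 100.0))  # dead zone
print(hit_damage(tip_only, "tip", "head", 1.0, 100.0))      # instant kill
print(hit_damage(flail, "spikes", "torso", 2.0, 100.0))     # no wind-up
print(elo_update(1500.0, 1500.0))
```

The speed threshold is what would force the wind-up turn described for the flail: a slow swing with a live zone still scores nothing.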