I built an arena where LLMs sword-fight with real physics. You decide which part of the blade is sharp, vote blind, and free OpenRouter models battle for Elo. Llama 3.3 is currently stabbing GPT-OSS in the face.

Reddit r/AI_Agents 06/12/26, 11:44 PM Products

llm-arena physics-simulation open-source elo-ranking model-evaluation ragdoll-combat sword-fighting

Summary

A new arena lets LLMs control physics ragdolls in weapon duels where users define weapon damage zones, vote blind, and models battle for Elo. Free models like Llama 3.3 and GPT-OSS compete, with self-hostable infrastructure.
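The post doesn't say how the Elo numbers are computed, so here is a minimal sketch of the textbook Elo update that a blind vote could feed into. The function names, the K-factor of 32, and the 1500 starting rating are all assumptions for illustration, not details from the project.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Return new (rating_a, rating_b) after one voted match.

    score_a is 1.0 if voters picked Fighter A, 0.0 if they picked
    Fighter B, and 0.5 for a tie. k=32 is an assumed K-factor.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # Zero-sum update: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Hypothetical example: two fresh models at an assumed 1500 start, A wins.
new_a, new_b = update_elo(1500.0, 1500.0, score_a=1.0)
```

A per-rule leaderboard could keep one such rating per (rule set, model) pair, so a model that fences well with a tip-only sword isn't dragged down by its flail results.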

Like Chatbot Arena, but instead of comparing text walls, two models pilot physics ragdolls in a weapons duel, and you set the weapon rules.

How it works:

- Each turn, both LLMs get the fight state as JSON (HP, distance, enemy's last move, what hit last turn) and pick an action + footwork
- Physics engine runs it: momentum, joint limits, collision damage by weapon zone × impact speed. Headshot with a "live" zone = instant kill
- THE TWIST: you choose which zones are dangerous. Tip-only sword forces fencing. Pommel-only forces clinch brawling. Flail spikes only count at high ball speed, so the model has to plan a wind-up turn. The rules go in the system prompt; the strategy is on the model
- Vote blind (Fighter A/B), names + Elo revealed after. Per-rule leaderboards

The screenshot is a real match: blue announced "Strike range. Aim the sharp zone at his head" and then ate exactly that move one turn later.

Free models (Llama 3.3 70B, GPT-OSS, Qwen3, Nemotron, Gemma) are on the roster so you can run matches at zero cost, or paste any OpenRouter id.

There's also a "joint mode" where the LLM controls all 10 joints raw, Toribash-style. Current models are... not good at having bodies. It's great.

Self-hostable on 100% free tiers (HF Spaces + Vercel + Supabase). Tournament mode generates strategy reports: aggression %, whether the model actually used the sharp zone, favorite moves per matchup.

(First fight may take a minute while the free HF Space wakes up.)
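To make the damage rule from the "How it works" list concrete, here is a rough sketch of how "weapon zone × impact speed" with user-chosen live zones could be resolved. Everything named here (`WeaponZone`, `resolve_hit`, `base_damage`, `min_speed`, the numbers) is hypothetical, and it assumes a non-live zone deals no damage at all, which the post doesn't actually confirm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WeaponZone:
    """One region of a weapon (tip, edge, pommel, flail spikes...)."""
    name: str
    live: bool               # user-chosen: is this zone dangerous?
    base_damage: float       # assumed damage per unit of impact speed
    min_speed: float = 0.0   # assumed threshold, e.g. for flail spikes


def resolve_hit(zone: WeaponZone, impact_speed: float,
                body_part: str, hp: float) -> float:
    """Return the defender's HP after one collision.

    Assumptions: dead zones and too-slow hits do nothing, a live-zone
    headshot kills instantly, anything else scales with impact speed.
    """
    if not zone.live or impact_speed < zone.min_speed:
        return hp
    if body_part == "head":
        return 0.0
    return max(0.0, hp - zone.base_damage * impact_speed)


# Hypothetical tip-only sword: only the tip is live.
tip = WeaponZone("tip", live=True, base_damage=2.0)
edge = WeaponZone("edge", live=False, base_damage=2.0)
# Hypothetical flail: spikes only count above an assumed speed of 8.
spikes = WeaponZone("spikes", live=True, base_damage=3.0, min_speed=8.0)

hp_after_thrust = resolve_hit(tip, 5.0, "torso", 100.0)
hp_after_slash = resolve_hit(edge, 5.0, "torso", 100.0)
hp_after_slow_flail = resolve_hit(spikes, 4.0, "torso", 100.0)
```

The `min_speed` threshold is what would force the wind-up turn mentioned above: a flail swung from rest never reaches the speed where the spikes count.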

Similar Articles

Built a lightweight Python framework for local LLM roleplay (Ollama/Phi-3) to stop context drift. Looking for feedback.

Built a Tauri v2 desktop chat shell for local LLMs — point it at Ollama / llama.cpp / any OpenAI-compatible endpoint, MIT, ~12 MB binary

LLM planner - pick a rig for your use-case/model/budget, or pick models for your rig. 60+ builds, 50+ models, 130+ cited t/s sources, 150+ reviewer YouTube videos, idle+active watts, multi-region prices, regular updates.

Evaluating open source LLMs on Autonomous Codenames Simulations

LlamaStation v0.9 — llama.cpp GUI for Windows with multi-backend support, TurboQuant, MTP and more
