Evalatro: an open benchmark where LLMs play the real Balatro

Reddit r/LocalLLaMA 06/15/26, 07:32 PM Tools

benchmark balatro llm-evaluation open-source game evaluation-framework

Summary

Evalatro is an open benchmark where LLMs play the real game Balatro via a text-based interface, with fixed seeds, a public leaderboard, and the goal of clearing Ante 12. Early results show models struggle, with none reaching the target.

Hey! I made Evalatro, an open benchmark where your LLMs play actual Balatro. Real game.

It started because I kept asking Claude to help me beat levels while playing (yeah, I'm too weak). I'd just throw screenshots at it and ask for tactics. Then the idea grew into something bigger and I decided to dig a little deeper. First I wanted to build an MCP through mods, but it turns out something already exists: balatrobot (respect to the author). And so it began.

The model connects to the game and on each turn gets the state as a text structure, not a picture, and decides what to play on its own. No tactical hints.

What's there already:

- fixed seeds for reproducibility: every model sees the same deals
- the real Balatro + Steamodded + balatrobot
- a live viewer and a public leaderboard
- your run results get sent to a public dashboard at the end of a run (zero private info: no keys, no paths; source is open)
- the score is computed by the server, not the client, so you can't fake it
- the benchmark goal is to clear Ante 12 (picked it kind of arbitrarily, open to debate), not just win the base-game Ante 8
- auto-install on Windows/macOS
- you can watch the model's reasoning (that part's fun) and replay every run
- before a run it sets up a separate game profile with EVERYTHING unlocked so the model isn't limited (your main save is left untouched)

I've only run a couple of models so far, and only briefly, so treat it as poking around, not a ranking. But it's already funny: nobody got anywhere near Ante 12. The leader, mimo-v2.5-pro, crawled to Ante 5. There was also deepseek-v4-pro, which couldn't beat the boss on Ante 8, but I lost the results after the leaderboard update. So the challenge is wide open: come watch the models suffer.

Would love feedback from Balatro players and the LLM crowd: is Ante 12 a sane bar or overkill? What else is worth measuring besides "reached / didn't reach"? How do I close the holes so the bench can't be cheated? I'm not exactly a master at building benchmarks.

PS. I would be endlessly grateful for your stars on GitHub!

Links:

- GitHub: [https://github.com/alesha-pro/evalatro](https://github.com/alesha-pro/evalatro)
- Public Dashboard: [evalatro.dev](https://evalatro.dev/)
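
For readers wondering what "fixed seeds" and "state as a text structure" could mean in practice, here is a minimal sketch. It is not Evalatro's or balatrobot's actual code or API: `TurnState`, `deal_hand`, `render_state`, and `first_turn_prompt` are hypothetical names, the seed string is made up, and the deal uses a plain 52-card deck with Python's own RNG rather than Balatro's.

```python
import random
from dataclasses import dataclass

RANKS = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A"]
SUITS = ["S", "H", "D", "C"]


@dataclass(frozen=True)
class TurnState:
    """Hypothetical snapshot of what a model might see on one turn."""
    ante: int
    hands_left: int
    hand: tuple[str, ...]


def deal_hand(seed: str, size: int = 8) -> tuple[str, ...]:
    """Deal `size` cards from a deck shuffled with a seed-specific RNG.

    Assumption: a toy stand-in for the game's seeded deals. The same seed
    always yields the same hand, so every model faces identical cards.
    """
    rng = random.Random(seed)  # private RNG, untouched by global random state
    deck = [rank + suit for suit in SUITS for rank in RANKS]
    rng.shuffle(deck)
    return tuple(deck[:size])


def render_state(state: TurnState) -> str:
    """Serialize the state as plain text (no screenshot) for the model."""
    return "\n".join([
        f"ante: {state.ante}",
        f"hands_left: {state.hands_left}",
        "hand: " + " ".join(state.hand),
    ])


def first_turn_prompt(seed: str) -> str:
    """Build the text a model would receive on the first turn of a seeded run."""
    state = TurnState(ante=1, hands_left=4, hand=deal_hand(seed))
    return render_state(state)
```

Calling `first_turn_prompt("EVAL01")` twice returns the same text, which is the property that makes runs comparable across models. The real benchmark presumably gets this from the game's own seed handling rather than from a shuffle like the one above.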

Similar Articles

PlayCoder: Making LLM-Generated GUI Code Playable

LEVANTE-bench: Multi-Scale Comparison of VLMs to Children Using Cognitive Tasks (or, "Is Your VLM Smarter Than a 5th Grader?")

I built a 1v1 nuclear strategy game to benchmark LLM reasoning (instead of just QCMs) — Age of LLM

PreAct-Bench: Benchmarking Predictive Monitoring in LLMs

LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning
