Evalatro: an open benchmark where LLMs play the real Balatro
Summary
Evalatro is an open benchmark in which LLMs play the real game Balatro through a text-based interface. Runs use fixed seeds, results appear on a public leaderboard, and the goal is to clear Ante 12. In early results, models struggle and none has reached that target.
Similar Articles
PlayCoder: Making LLM-Generated GUI Code Playable
PlayCoder introduces the PlayEval benchmark and a multi-agent framework that iteratively repairs LLM-generated GUI applications, achieving up to 20.3% end-to-end playable code.
LEVANTE-bench: Multi-Scale Comparison of VLMs to Children Using Cognitive Tasks (or, "Is Your VLM Smarter Than a 5th Grader?")
This paper introduces LEVANTE-bench, a benchmark that systematically evaluates vision-language models on six cognitive tasks and compares their performance to children aged 5-12, finding that current VLMs align only partially with children's cognitive abilities.
I built a 1v1 nuclear strategy game to benchmark LLM reasoning (instead of just QCMs) — Age of LLM
A new open-source benchmark called Age of LLM tests LLM reasoning through a turn-based nuclear strategy game with fog of war, diplomacy, and bluffing, offering a more dynamic evaluation than traditional multiple-choice benchmarks.
PreAct-Bench: Benchmarking Predictive Monitoring in LLMs
PreAct-Bench is a benchmark of 1,000 paired ethical and unethical action trajectories across five domains, designed to evaluate the ability of LLMs to predict harmful outcomes from partial trajectories (predictive monitoring). Results show that while humans perform well, current LLMs struggle, highlighting the need for future-oriented risk reasoning.
LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning
This paper introduces LinAlg-Bench, a diagnostic benchmark that evaluates 10 frontier LLMs on structured linear algebra computation across matrix dimensions. It reveals that LLM mathematical failure is structurally constrained and transitions from execution errors to computational abandonment at the 4x4 scale.