I Made LLMs Play Texas Hold’em. The Smallest Model Beat a ~1T Model by Being Too Dumb to Fold
Summary
An experiment in which six LLMs played Texas Hold'em poker; a tiny 1.2B model won twice thanks to its aggressive 'never fold' strategy, showing how the game format can favor simpler models. The author built a poker engine and agent framework called Hive, and invites community feedback.
Similar Articles
I made 6 AI models play poker against each other. The 1.2B model has a gambling problem and it keeps winning.
An experiment where six AI models played Texas Hold'em against each other, with a tiny 1.2B model winning twice by being too reckless to fold. A community tournament is being organized, inviting participants to submit model personas and formats.
I gave the same AI 6 different personalities and made them play poker 100 times.
An experiment that gives the same 1.2B language model six different personalities and runs 100 poker tournaments reveals drastic behavioral differences: a 'Grinder' never wins but never loses, a 'Tilter' wins big or busts, and a 'Shark' dominates. The results highlight how personality prompts can profoundly shape LLM decision-making.
Poker Arena: Multi-Axis Profiling of Strategic Reasoning and Memory in LLMs
Poker Arena is a new benchmark using no-limit Texas Hold'em to evaluate LLMs' strategic reasoning and memory across multiple cognitive axes. The platform reveals that multi-axis evaluation exposes capability structures that scalar leaderboards misrank.
Self-play helped AI achieve superhuman performance in Go, so why hasn’t it done the same for LLMs? Researchers have found a solution.
Researchers introduce Self-Guided Self-Play (SGS), a self-play algorithm for LLMs that prevents reward hacking by using a Guide role to score synthetic problems. Applied to theorem proving in Lean4, SGS surpasses RL baselines and allows a 7B model to outperform a 671B model.
I trained a 75M parameter LLM from scratch on 18B tokens and it beats a model almost double its size
The author trained a 75M-parameter LLM called KeyLM from scratch on 18B tokens, achieving instruction-following scores competitive with larger models while using fewer parameters and less data.