I Made LLMs Play Texas Hold’em. The Smallest Model Beat a ~1T Model by Being Too Dumb to Fold

Reddit r/singularity 05/19/26, 09:35 AM Tools

llm poker agent-framework experiment small-model benchmarking

Summary

An experiment where six LLMs played Texas Hold'em poker; a tiny 1.2B model won twice due to its aggressive 'never fold' strategy, highlighting how format can favor simpler models. The author built a poker engine and agent framework called Hive, and invites community feedback.

Made LLMs play Texas Hold’em against each other. 6 models at the table: a tiny 1.2B running locally on my 16GB MacBook, a couple mid-size ones, and cloud models going up to about 1 trillion parameters. Ran 5 tournaments. The tiny model won twice. More than any other model at the table.

Models:

- Liquid lfm2.5 (1.2B, local via LM Studio)
- Qwen3 (1.7B, local via LM Studio)
- Claude Haiku 4.5 (Anthropic)
- GPT-OSS (120B, Fireworks)
- MiniMax M2 (230B, Fireworks)
- Kimi K2 (~1T, Fireworks)

The 1.2B model's strategy? Raise everything. Never fold. One tournament it played 6 hands with 19 raises and 0 folds. Didn’t even know it had bad cards. Just kept shoving chips in.

The 120B model in the same tournament? 0 raises, 5 folds. Understood the game perfectly. Knew when it had weak hands. And folded itself into elimination. The small model won because it was too dumb to be scared.

Now before the poker bros come for me: 25 hands with high blinds is not deep poker. The format punishes patience and rewards aggression. The big models fold correctly by poker theory, but correct folding bleeds you dry when blinds eat your stack every round. So no, small models aren’t “smarter.” They just happen to be accidentally perfect for this format.

Built the whole thing from scratch. The poker engine is pure Python, zero dependencies. Hand evaluation, side pots, equity calculator, everything. The LLM layer runs on top of an agent framework I’ve been building called Hive. Supports LM Studio, Ollama, Anthropic, OpenAI, Fireworks, Groq. Also has a persona system where you can give models personality traits, risk tolerance, fears. A reckless gambler plays completely differently from a cautious analyst.

Planning to run more of these. Community tournament maybe. If you have a model you want to see at the table, or a persona you want me to test (“aggressive bluffer who tilts after losses” or “tight grinder who only plays premium hands”), let me know. I’ll run it and post full results.
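For anyone wondering what a persona could look like as data, here is a minimal sketch. It is not Hive's actual API: the `Persona` class, its field names, the 0 to 1 risk scale, and `to_system_prompt` are all hypothetical, made up only to illustrate the "traits, risk tolerance, fears" idea described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    """Hypothetical persona record; NOT Hive's real class."""
    name: str
    traits: tuple
    risk_tolerance: float  # assumed scale: 0.0 = very cautious, 1.0 = reckless
    fears: tuple = ()

    def __post_init__(self):
        # Reject values outside the assumed 0-1 scale.
        if not 0.0 <= self.risk_tolerance <= 1.0:
            raise ValueError("risk_tolerance must be between 0.0 and 1.0")

    def to_system_prompt(self) -> str:
        """Render the persona as plain text to prepend to a model's instructions."""
        lines = [
            f"You are {self.name}, a poker player.",
            f"Personality traits: {', '.join(self.traits)}.",
            f"Risk tolerance: {self.risk_tolerance:.1f} on a 0-1 scale.",
        ]
        if self.fears:
            lines.append(f"Fears: {', '.join(self.fears)}.")
        return "\n".join(lines)


# Two illustrative personas, loosely matching the examples in the post.
gambler = Persona("the reckless gambler", ("aggressive", "impulsive"), 0.9)
analyst = Persona(
    "the cautious analyst",
    ("patient", "methodical"),
    0.2,
    fears=("losing a big pot",),
)
```

Calling `gambler.to_system_prompt()` gives a short text block that could be placed ahead of the game state in the prompt. How the real framework injects personas may be quite different.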
Also genuinely looking for feedback on the framework and engine code if anyone wants to take a look. Still early but the core is solid and runs on a Mac. Code, engine, and all 5 tournament results: [https://github.com/chiruu12/Hive](https://github.com/chiruu12/Hive) (poker stuff is in `hive-arena/`, results in `tournaments/results/`)
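To put rough numbers on the "blinds eat your stack" point from the format caveat above, here is a back-of-the-envelope sketch. It is not code from the engine, and the stack and blind sizes are made up for illustration (the post does not state the real ones). It assumes fixed blinds, a full table the whole time, and a player who folds every single hand, so the only chips they ever lose are the blinds they are forced to post.

```python
def hands_until_broke(stack, small_blind, big_blind, players):
    """Count the hands an always-fold player lasts before the stack hits zero.

    Assumptions (illustrative only): blinds never change, nobody is
    eliminated, and the player starts in the big blind, posts the small
    blind on the next hand, then sits out the rest of the orbit for free.
    """
    hands = 0
    position = 0  # 0 = big blind, 1 = small blind, anything else = no forced bet
    while stack > 0:
        if position == 0:
            stack -= min(stack, big_blind)
        elif position == 1:
            stack -= min(stack, small_blind)
        hands += 1
        position = (position + 1) % players
    return hands


# Low blinds relative to the stack: folding is survivable for a long time.
print(hands_until_broke(1000, 50, 100, 6))   # 37

# High blinds relative to the stack: gone before a 25-hand tournament ends.
print(hands_until_broke(1000, 100, 200, 6))  # 19
```

With the second set of numbers, a player who never contests a pot is out by hand 19 without ever losing a showdown. As opponents get eliminated the orbit shrinks and the blinds come around faster, so in a real tournament the bleed would likely be quicker than this sketch suggests.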

Similar Articles

I made 6 AI models play poker against each other. The 1.2B model has a gambling problem and it keeps winning.

I gave the same AI 6 different personalities and made them play poker 100 times.

Poker Arena: Multi-Axis Profiling of Strategic Reasoning and Memory in LLMs

Self-play helped AI achieve superhuman performance in Go, so why hasn’t it done the same for LLMs? Researchers have found a solution.

I trained a 75M parameter LLM from scratch on 18B tokens and it beats a model almost double its size
