@cuisitekp: A 9B model outperforms models several times larger. The team behind OLMo/Tülu from Ai2 and the University of Washington released a new paper called Tmax, claiming it's the strongest open-source RL training recipe for 'terminal agents'. Result: A 9B model on Terminal-Be…

X AI KOLs Timeline 06/24/26, 11:38 AM Papers

9b-model terminal-agent reinforcement-learning open-source training-data taxonomy terminal-bench

Summary

Ai2 and the University of Washington released a paper titled Tmax, which claims to be the strongest open-source RL training recipe for terminal agents to date. A 9B-parameter model scores 27% on Terminal-Bench 2.0, ahead of a number of much larger models. The post attributes the result to low-cost generation of large amounts of verifiable training data rather than to model size or the RL algorithm.

A 9B model outperforms models several times larger. The team behind OLMo/Tülu at Ai2 and the University of Washington released a new paper called Tmax, claiming it is the strongest open-source RL training recipe for "terminal agents". Result: a 9B model scored 27% on Terminal-Bench 2.0, surpassing a number of much larger models. The recipe itself is surprisingly simple: purely outcome-based rewards, with no process supervision.

What is most interesting is that the deciding factor is neither model size nor the RL algorithm, but how the training data is generated. The authors used a taxonomy to batch-produce terminal environments, combining difficulty control, personas, and diverse verifiers to cheaply generate a large number of trainable tasks. The resulting terminal-agent dataset is 2.5 times larger than the largest previously public dataset.

The takeaway: terminal-agent capability is increasingly "nurtured by the environment" rather than "stacked with parameters". Whoever can cheaply generate large amounts of verifiable tasks can train a strong agent. All data, models, and code are open-sourced. Open source is chasing the frontier, and this time it is very close.

Original Article

Cached at: 06/25/26, 07:13 AM

A 9B model outperforms models several times larger.

The team behind OLMo / Tülu at Ai2 and the University of Washington released a new paper called Tmax, claiming it’s the strongest open-source RL training recipe for terminal agents.

Results: a 9B model achieves 27% on Terminal-Bench 2.0, surpassing a number of much larger models. The recipe is surprisingly simple: purely outcome-based rewards, with no process supervision.
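To make "purely outcome-based rewards" concrete, here is a minimal sketch, not the paper's implementation. It assumes the sandbox's final state can be modeled as a path-to-contents mapping; `FinalState`, `Verifier`, `outcome_reward`, and `report_exists` are hypothetical names chosen for illustration. The point is only that the reward depends on the end state and never on intermediate commands.

```python
from typing import Callable, Dict

# Hypothetical model of a sandbox's final state: file path -> file contents.
FinalState = Dict[str, str]
# Hypothetical verifier type: inspects the final state and accepts or rejects it.
Verifier = Callable[[FinalState], bool]


def outcome_reward(final_state: FinalState, verifier: Verifier) -> float:
    """Return 1.0 if the verifier accepts the final state, else 0.0.

    Intermediate steps are never scored, which is what distinguishes an
    outcome-based reward from process supervision.
    """
    return 1.0 if verifier(final_state) else 0.0


def report_exists(state: FinalState) -> bool:
    """Illustrative verifier: the task asked for a non-empty report file."""
    return state.get("/out/report.txt", "").strip() != ""


reward_ok = outcome_reward({"/out/report.txt": "done\n"}, report_exists)
reward_fail = outcome_reward({}, report_exists)
```

Under this framing, each task only needs a verifier that can be run automatically on the final state, which is why cheaply generated verifiable tasks can be used directly for RL.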

What’s most interesting is that the decisive factor is neither model size nor the RL algorithm, but “how to create training data.”

They use a taxonomy to batch-generate terminal environments, combining difficulty control, personas, and diverse verifiers to cheaply produce a large number of trainable tasks. The resulting terminal-agent dataset is 2.5 times larger than the largest previously public one.
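One way to picture taxonomy-driven generation is as a cross product over the axes the post names. The sketch below is an assumption about the general idea, not the Tmax pipeline: the category, difficulty, persona, and verifier values are invented placeholders, and `TaskSpec` and `enumerate_specs` are hypothetical names. In a real pipeline each spec would presumably be expanded into a concrete environment and verifier by a generator.

```python
from dataclasses import dataclass
from itertools import product
from typing import List


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical task description: one cell of the taxonomy."""
    category: str
    difficulty: str
    persona: str
    verifier: str


# Placeholder axis values, invented for illustration only.
CATEGORIES = ["file-management", "log-analysis"]
DIFFICULTIES = ["easy", "medium", "hard"]
PERSONAS = ["sysadmin", "data-scientist"]
VERIFIERS = ["file-exists", "output-match", "unit-test"]


def enumerate_specs() -> List[TaskSpec]:
    """Cross every axis with every other to get one spec per combination."""
    return [
        TaskSpec(category, difficulty, persona, verifier)
        for category, difficulty, persona, verifier in product(
            CATEGORIES, DIFFICULTIES, PERSONAS, VERIFIERS
        )
    ]


specs = enumerate_specs()
```

The number of specs grows multiplicatively with each axis, so adding one more persona or verifier type enlarges the task pool without any hand-written tasks.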

This shows one thing: the capability of terminal agents is increasingly “fed by the environment,” not “stacked by parameters.” Whoever can cheaply generate a large number of verifiable tasks can train strong agents.

Moreover, data, models, and code are all open-sourced.

Open source is closing in on the frontier, and this time it is very close.

Similar Articles

@hank_aibtc: Family, local LLMs are incredibly impressive! I stumbled upon this gpt-oss-20b-tq3 on Hugging Face, and it's truly captivating! OpenAI's official open-source 20B+ parameter MoE model, optimized by the community using TurboQuant 3-bit quantization + MLX...

@berryxia: Small model, big wisdom? It's now real! A 7B small model now acts as the boss of top large models like GPT-5, Claude Sonnet 4, Gemini 2.5 Pro. A new paper shows an RL-trained 7B model learned to write natural language subtasks, assign them to different models, precisely...

@zhixianio: After receiving the new machine, I began an 'ascetic' practice of forcing myself to use local models for common tasks. I thought it would be painful, but both speed and quality greatly exceeded my expectations: Model: Qwen3.6-35B-A3B-oQ6-fp16-mtp, Running: oMLX, with N…

@0xcherry: https://x.com/0xcherry/status/2067610347633025281

@vintcessun: Pretraining can be this cost-effective? Train a usable 1B base model from scratch for ~$1000, slashing compute and data by hundreds of times. The key isn't brute-force compute, but hierarchical recursive architecture plus latent space reasoning, combined with PrefixLM packing and FA3 to maximize efficiency. Sounds insane, but the paper and code are open-sourced.