@cuisitekp: A 9B model outperforms models several times larger. The team behind OLMo/Tülu from Ai2 and the University of Washington released a new paper called Tmax, claiming it's the strongest open-source RL training recipe for 'terminal agents'. Result: A 9B model on Terminal-Be…


Summary

Ai2 and the University of Washington released a paper titled Tmax, which the authors describe as the strongest open-source RL training recipe for terminal agents to date. A 9B-parameter model trained with it scores 27% on Terminal-Bench 2.0, ahead of a number of much larger models. According to the post, the key factor is low-cost generation of large amounts of verifiable training data, not model size or the choice of RL algorithm.


Cached at: 06/25/26, 07:13 AM

A 9B model outperforms models several times larger.

The team behind OLMo / Tülu at Ai2 and the University of Washington released a new paper called Tmax, claiming it’s the strongest open-source RL training recipe for terminal agents.

Results: A 9B model achieves 27% on Terminal-Bench 2.0, surpassing a number of much larger models. The recipe is surprisingly simple: purely outcome-based rewards, with no elaborate process supervision.
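To make "purely outcome-based rewards" concrete, here is a minimal sketch in Python. It is not the Tmax implementation: the `Episode` type, the `outcome_reward` function, and the example verifier are all hypothetical names invented for illustration, and the sandbox state is simulated as a plain dictionary. The point is only that the reward depends on whether the final state passes a check, not on the individual commands the agent ran.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical types for illustration; none of these names come from the paper.
FinalState = Dict[str, str]  # maps file path -> file contents after the episode


@dataclass
class Episode:
    commands: List[str]      # shell commands the agent issued (ignored by the reward)
    final_state: FinalState  # simulated sandbox state when the agent stopped


def outcome_reward(episode: Episode, verifier: Callable[[FinalState], bool]) -> float:
    """Return 1.0 if the verifier accepts the final state, else 0.0.

    Only the end result is scored; intermediate commands earn no partial credit.
    """
    return 1.0 if verifier(episode.final_state) else 0.0


def verify_hello_file(state: FinalState) -> bool:
    # Example verifier (assumed task): /tmp/hello.txt must contain "hello".
    return state.get("/tmp/hello.txt", "").strip() == "hello"


success = Episode(
    commands=["echo hello > /tmp/hello.txt"],
    final_state={"/tmp/hello.txt": "hello\n"},
)
failure = Episode(
    commands=["touch /tmp/hello.txt"],
    final_state={"/tmp/hello.txt": ""},
)

print(outcome_reward(success, verify_hello_file))  # 1.0
print(outcome_reward(failure, verify_hello_file))  # 0.0
```

A process-supervised setup would instead score individual steps; the sketch above deliberately has no such per-step signal.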

What’s most interesting is that the decisive factor is not model size or RL algorithm, but “how to create training data.”

They use a taxonomy to batch-generate terminal environments, combining difficulty control, personas, and verifier diversity to cheaply produce a large number of trainable tasks. The resulting terminal agent dataset is 2.5 times larger than the largest previously public one.
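One way to picture taxonomy-driven generation is as a cross product of a few axes, each combination becoming a specification for one environment. The sketch below is an assumption about the general shape of such a pipeline, not the paper's method: the axis values, the `TaskSpec` type, and both functions are hypothetical placeholders, and the step that turns a specification into a real environment and verifier is omitted.

```python
import itertools
import random
from dataclasses import dataclass
from typing import List

# All values below are illustrative placeholders, not the taxonomy from the paper.
CATEGORIES = ["file-management", "text-processing", "git", "networking"]
DIFFICULTIES = ["easy", "medium", "hard"]
PERSONAS = ["sysadmin", "data-scientist", "student"]
VERIFIERS = ["file-content-check", "exit-code-check", "unit-test"]


@dataclass(frozen=True)
class TaskSpec:
    category: str
    difficulty: str
    persona: str
    verifier: str


def enumerate_specs() -> List[TaskSpec]:
    """Every combination of the four axes becomes one task specification."""
    combos = itertools.product(CATEGORIES, DIFFICULTIES, PERSONAS, VERIFIERS)
    return [TaskSpec(*combo) for combo in combos]


def sample_specs(n: int, seed: int = 0) -> List[TaskSpec]:
    """Draw n distinct specifications reproducibly from the full grid."""
    rng = random.Random(seed)
    return rng.sample(enumerate_specs(), n)


specs = enumerate_specs()
print(len(specs))  # 4 * 3 * 3 * 3 = 108
for spec in sample_specs(3):
    print(spec)
```

Even this toy grid yields 108 distinct specifications from 13 axis values, which illustrates why adding axes such as personas or verifier types can scale a dataset cheaply.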

This shows one thing: the capability of terminal agents is increasingly “fed by the environment,” not “stacked by parameters.” Whoever can cheaply generate a large number of verifiable tasks can train strong agents.

Moreover, data, models, and code are all open-sourced.

Open source is chasing the frontier, and this time it is very close.
