Tag
This paper shows that a carefully crafted data recipe for long-context reinforcement learning, using minimal outcome-based GRPO, significantly improves reasoning across multiple models and benchmarks, and transfers to agentic tasks like GAIA and BrowseComp.