This paper presents a full-pipeline recipe for teaching thinking models to reason with tools, achieving state-of-the-art performance on benchmarks like AIME 2025 when applied to Qwen3 models.
The article argues that effective AI agents require restraint and explicit 'stop conditions' rather than endless autonomy, highlighting Ling-2.6-1T as a model suited for conservative planning roles.
MIT CSAIL researchers introduce RLCR, a method using Brier scores in reinforcement learning to train AI models to output calibrated confidence estimates, significantly reducing overconfidence without sacrificing accuracy.
OpenAI submitted proof attempts for the First Proof challenge, a research-level math competition testing whether AI can produce correct, checkable proofs. The company's internal model solved at least five of the ten problems, demonstrating notable progress in sustained reasoning and rigorous mathematical argumentation.
Gemini 2.5 Deep Think achieved gold-medal-level performance at the 2025 International Collegiate Programming Contest World Finals, solving 10 of 12 problems within the five-hour competition window and demonstrating significant advances in abstract reasoning and problem solving.
Google is rolling out Deep Think, a new reasoning capability in the Gemini app for Google AI Ultra subscribers. It uses parallel-thinking techniques and achieves bronze-level performance on the 2025 IMO benchmark; the full gold-medal version is being shared with select mathematicians for research purposes.