Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling
Summary
A paper presenting SU-01, a 30B-A3B reasoning model that achieves gold-medal-level performance on IMO and IPhO problems via reverse-perplexity curriculum, two-stage reinforcement learning, and test-time scaling.
Source: https://huggingface.co/papers/2605.13301 · Published on May 13
#1 Paper of the day
Abstract
A systematic approach transforms post-trained reasoning models into rigorous olympiad-level solvers through a reverse-perplexity curriculum, two-stage reinforcement learning, and test-time scaling, achieving gold-medal performance on mathematical and physics competitions.
Recent progress in reasoning models has substantially advanced long-horizon mathematical and scientific problem solving, with several systems now reaching gold-medal-level performance on International Mathematical Olympiad (IMO) and International Physics Olympiad (IPhO) problems. In this paper, we introduce a simple and unified recipe for converting a post-trained reasoning backbone into a rigorous olympiad-level solver. The recipe first uses a reverse-perplexity curriculum for SFT to instill rigorous proof-search and self-checking behaviors, then scales these behaviors through a two-stage RL pipeline that progresses from RL with verifiable rewards to more delicate proof-level RL, and finally boosts solving performance with test-time scaling. Applying this recipe, we train a 30B-A3B backbone with SFT on around 340K sub-8K-token trajectories followed by 200 RL steps. The resulting model, SU-01, supports stable reasoning on difficult problems with trajectories exceeding 100K tokens, while achieving gold-medal-level performance on mathematical and physical olympiad competitions, including IMO 2025/USAMO 2026 and IPhO 2024/2025. It also demonstrates strong generalization of scientific reasoning to domains beyond mathematics and physics.
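The page does not spell out how the reverse-perplexity curriculum is constructed, so the sketch below is only one plausible reading: score each SFT trajectory's perplexity under the backbone and order the data from highest to lowest. The backbone name, the helper functions, and the ordering direction are all assumptions, not the authors' implementation.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def trajectory_perplexity(model, tokenizer, text: str) -> float:
    """Mean-token perplexity of one SFT trajectory under the backbone."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=8192)
    enc = {k: v.to(model.device) for k, v in enc.items()}
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())  # loss is mean NLL per token

def reverse_perplexity_order(trajectories, model, tokenizer):
    """Sort trajectories from highest to lowest backbone perplexity."""
    scored = [(trajectory_perplexity(model, tokenizer, t), t) for t in trajectories]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in scored]

if __name__ == "__main__":
    # Stand-in backbone; the paper's actual 30B-A3B checkpoint is not named here.
    name = "Qwen/Qwen3-30B-A3B"
    tok = AutoTokenizer.from_pretrained(name)
    lm = AutoModelForCausalLM.from_pretrained(name, torch_dtype="auto", device_map="auto")
    ordered = reverse_perplexity_order(["trajectory A ...", "trajectory B ..."], lm, tok)
```

Under this reading, the hardest-to-predict trajectories are presented first, which would front-load the rare proof-search and self-checking patterns the abstract emphasizes; consult the paper itself for the actual scheduling.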
Get this paper in your agent:
hf papers read 2605.13301
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper: 1
#### Simplified-Reasoning/SU-01 · Reinforcement Learning · 31B · Updated 1 day ago
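For quick inspection of the released checkpoint, loading should follow the usual `transformers` flow, assuming the Simplified-Reasoning/SU-01 repo ships standard Hub weights (that layout is an assumption; check the model card for the authors' recommended settings):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumes Simplified-Reasoning/SU-01 ships standard transformers weights;
# the prompt format below is illustrative, not the documented one.
tok = AutoTokenizer.from_pretrained("Simplified-Reasoning/SU-01")
model = AutoModelForCausalLM.from_pretrained(
    "Simplified-Reasoning/SU-01", torch_dtype="auto", device_map="auto"
)

prompt = "Prove that the square root of 2 is irrational."
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=2048)
print(tok.decode(out[0], skip_special_tokens=True))
```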
Similar Articles
OmniThoughtVis: A Scalable Distillation Pipeline for Deployable Multimodal Reasoning Models
This paper introduces OmniThoughtVis, a scalable pipeline for distilling multimodal reasoning capabilities from large teacher models to smaller, deployment-oriented MLLMs. The method uses curated chain-of-thought data to significantly improve reasoning performance on benchmarks like MathVerse and MMMU-Pro for models ranging from 2B to 8B parameters.
TEMPO: Scaling Test-time Training for Large Reasoning Models
TEMPO introduces a test-time training framework that alternates policy refinement with critic recalibration to prevent diversity collapse and sustain performance gains in large reasoning models, boosting AIME 2024 scores for Qwen3-14B from 42.3% to 65.8%.