RL Excursions during Pre-Training: Re-examining Policy Optimization for LLM training

arXiv cs.LG Papers

Summary

Harvard researchers challenge the standard LLM training pipeline by showing RL can be effectively applied during pre-training rather than only after SFT, finding that data composition matters more than model scale, and proposing parallel averaging of RL and SFT objectives that outperforms sequential approaches while preserving general capabilities.

arXiv:2606.04272v1 Announce Type: new Abstract: The standard LLM training pipeline applies reinforcement learning (RL) only after pre-training and supervised fine-tuning (SFT). We question this status quo by training a LLM from scratch and applying RL, SFT, and SFT followed by RL directly to intermediate pre-training checkpoints. We find that RL is effective very early, and often matches the full SFT$\to$RL pipeline early as well. Through experiments on harder problems, we find that targeted pre-training data composition is a strong lever for RL effectiveness, even more so than model scale. Beyond reasoning accuracy, applying RL directly to base checkpoints expands the model's distribution; the sharpening effect reported in recent work arises only when RL follows SFT. The general capabilities of the model remain essentially unchanged by RL, while they degrade following SFT. Finally, we merge RL and SFT objectives by parallel averaging, which outperforms across all other training methods discussed, across metrics, while preserving general capabilities. Together, these results suggest that LLM training might benefit from an expanded use of RL.

Cached at: 06/05/26, 02:24 AM

# Re-examining Policy Optimization for LLM training
Source: [https://arxiv.org/html/2606.04272](https://arxiv.org/html/2606.04272)
## RL Excursions during Pre\-Training: Re\-examining Policy Optimization for LLM training

Rachit Bansal, Clara Mohri, Tian Qin, David Alvarez-Melis, Sham Kakade. Harvard University. {rachitbansal,cmohri,tqin}@g.harvard.edu

###### Abstract

The standard LLM training pipeline applies reinforcement learning (RL) only after pre-training and supervised fine-tuning (SFT). We question this status quo by training an LLM from scratch and applying RL, SFT, and SFT followed by RL directly to intermediate pre-training checkpoints. We find that RL is effective very early, and often matches the full SFT$\to$RL pipeline early as well. Through experiments on harder problems, we find that targeted pre-training data composition is a strong lever for RL effectiveness, even more so than model scale. Beyond reasoning accuracy, applying RL directly to base checkpoints expands the model’s distribution; the sharpening effect reported in recent work arises only when RL follows SFT. The general capabilities of the model remain essentially unchanged by RL, while they degrade following SFT. Finally, we merge RL and SFT objectives by parallel averaging, which outperforms all other training methods discussed across metrics while preserving general capabilities. Together, these results suggest that LLM training might benefit from an expanded use of RL.

## 1 Introduction

Until recently, the training recipe for Large Language Models (LLMs) exclusively used the next-token prediction (NTP) objective via cross-entropy loss. However, with the advent of RL for language models (Ouyang et al., [2022](https://arxiv.org/html/2606.04272#bib.bib4); Shao et al., [2024](https://arxiv.org/html/2606.04272#bib.bib9)), the now-standard post-training phase sequentially employs supervised finetuning (SFT) followed by RL. The NTP objective for pre-training and SFT is typically applied to a static, external dataset, i.e., an off-policy regime. For the RL objective, in contrast, the model learns from its own on-policy generations.

Under this standard training regime, RL training only occurs after a substantial amount of NTP training. It is unclear whether this is fundamentally necessary for RL training or simply a design choice (Foster et al., [2025](https://arxiv.org/html/2606.04272#bib.bib12)). There has also been growing interest in changing this standard and expanding the use of RL for pretraining (Hatamizadeh et al., [2026](https://arxiv.org/html/2606.04272#bib.bib2); Li et al., [2025](https://arxiv.org/html/2606.04272#bib.bib1); Xing et al., [2025](https://arxiv.org/html/2606.04272#bib.bib14)). In this work, we attempt to answer a more fundamental question:

*When and how should an RL objective be used in LLM training?*

While it has been widely observed that post-training dramatically improves the reasoning of the model, RL’s influence on model capabilities has been the subject of recent debate. For example, a growing body of work argues that RL primarily sharpens the model’s existing output distribution (Yue et al., [2025](https://arxiv.org/html/2606.04272#bib.bib28); Wu et al., [2025](https://arxiv.org/html/2606.04272#bib.bib30); Karan and Du, [2025](https://arxiv.org/html/2606.04272#bib.bib41); Qin et al., [2025](https://arxiv.org/html/2606.04272#bib.bib49)). It is unclear whether these findings are inherent to the RL objective or an artifact of the standard training regime. By studying various training objectives comprehensively across stages of pre-training, we shed light on a second fundamental question:

*What is the influence of RL on model capabilities?*

To answer these questions, we perform a large-scale, rigorous study of on-policy learning for LLM training. We pretrain an LLM from scratch on a high-quality, reasoning-heavy corpus, saving checkpoints throughout the process. For each base model checkpoint, we perform several training runs: (Direct RL) RL on the base model checkpoint; (SFT) SFT using a single ground-truth demonstration per example; (SFT-Gold) SFT using multiple ground-truth demonstrations per example; and (SFT$\rightarrow$RL) RL on top of the SFT models, for both SFT and SFT-Gold, representing the standard LLM training pipeline.

![Refer to caption](https://arxiv.org/html/2606.04272v1/x1.png)

Figure 1: Overview. We compare several post-training recipes applied to intermediate pre-training checkpoints $\mathcal{M}_{t}$: direct RL ($\mathcal{M}_{t}^{\text{RL}}$), SFT with one solution per question ($\mathcal{M}_{t}^{\text{SFT}}$), SFT with multiple solutions ($\mathcal{M}_{t}^{\text{SFT-Gold}}$), the standard pipeline of RL after SFT ($\mathcal{M}_{t}^{\text{SFT}\rightarrow\text{RL}}$), and parallel averaging of RL and SFT gradients ($\mathcal{M}_{t}^{\text{Parallel}}$). (1) RL improves both pass@1 and pass@32 on checkpoints trained on as few as 4B pretraining tokens (§[3.1](https://arxiv.org/html/2606.04272#S3.SS1)). (2) RL is the more effective post-training objective when ground-truth demonstrations are scarce: $\mathcal{M}_{t}^{\text{RL}}$ substantially outperforms $\mathcal{M}_{t}^{\text{SFT}}$ on pass@1 but matches $\mathcal{M}_{t}^{\text{SFT-Gold}}$ (§[3.2](https://arxiv.org/html/2606.04272#S3.F3)). (3) SFT degrades general, non-reasoning benchmarks, whereas RL leaves these capabilities largely unchanged (§[4.2](https://arxiv.org/html/2606.04272#S4.SS2)). (4) Parallel averaging of RL and SFT gradients combines their strengths: $\mathcal{M}_{t}^{\text{Parallel}}$ attains the strongest pass@32 for every pre-training checkpoint (§[5](https://arxiv.org/html/2606.04272#S5)), consistently better than the standard SFT$\rightarrow$RL pipeline ($\mathcal{M}_{t}^{\text{SFT}\rightarrow\text{RL}}$).

We present comprehensive findings that answer fundamental questions about RL for LLM training:

When does RL work? (§[3](https://arxiv.org/html/2606.04272#S3)) We find that RL is effective surprisingly early in pretraining. Training with direct RL on checkpoints that have seen as few as 4B tokens significantly improves performance on both GSM8K and MATH, with gains often comparable to the standard SFT$\rightarrow$RL pipeline (§[3.1](https://arxiv.org/html/2606.04272#S3.SS1)). Moreover, we find that RL is significantly more effective than SFT when target demonstrations are limited (the SFT setting, as opposed to SFT-Gold) (§[3.2](https://arxiv.org/html/2606.04272#S3.F3)). The effectiveness of RL varies with task difficulty: RL gains are weaker on harder MATH-style problems. In such cases, we find that adding targeted data to the pretraining corpus is effective and a better strategy than scaling model size (§[3.3](https://arxiv.org/html/2606.04272#S3.SS3)).

What does RL do? (§[4](https://arxiv.org/html/2606.04272#S4)) Contrary to recent claims that RL primarily sharpens the output distribution (Yue et al., [2025](https://arxiv.org/html/2606.04272#bib.bib28); Wu et al., [2025](https://arxiv.org/html/2606.04272#bib.bib30)), we find that RL applied directly to base checkpoints *expands* the distribution: pass@1 *and* pass@k both improve substantially (§[4.1](https://arxiv.org/html/2606.04272#S4.SS1)). The sharpening effect we do reproduce arises only when RL is applied following SFT. This suggests that SFT, rather than RL itself, is what constrains exploration. Further, we find that SFT consistently degrades general (non-reasoning) capabilities, while RL leaves these capabilities unchanged (§[4.2](https://arxiv.org/html/2606.04272#S4.SS2)).

How should RL be used? (§[5](https://arxiv.org/html/2606.04272#S5)) Finally, we investigate whether interleaving SFT and RL gradients within a single training step can capture the complementary strengths of both objectives. We propose a parallel-averaging update that combines updates from SFT and direct RL. We find that this simple objective yields better pass@32 than all other recipes that use a single demonstration per problem, including SFT$\rightarrow$RL, indicating that using RL and NTP objectives simultaneously can be beneficial.

Overall, our findings make headway in understanding RL in contrast to other training objectives. Through controlled experiments across different stages of pre-training, we find that many assumptions about the RL objective are artifacts of the current training regime. Our results indicate that isolating the objective from the setting reveals surprising aspects of RL as an objective for LLM training. We focus our experiments on math reasoning capabilities to maintain a controlled training and evaluation environment. We view our results as evidence that introducing RL earlier and more centrally in the LLM training pipeline is both feasible and, in several respects, preferable to the current standard.

## 2 Methodology and Experimental Design

To answer the foundational questions stated above about the RL objective for LLM training, beyond the standard training pipeline, we first establish a controlled experimental environment. Our setup centers on a custom-trained 1B model, allowing for precise control over data exposure. In this section, we detail the pre-training checkpoints, define three post-training pipelines, and describe the data and evaluation. While we explore RL as a general objective, we focus our implementation on Reinforcement Learning via Verifiable Rewards (RLVR) using the GRPO algorithm (Shao et al., [2024](https://arxiv.org/html/2606.04272#bib.bib9)).
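As a rough illustration of the group-relative signal that GRPO is built around, the sketch below normalizes rewards within a group of rollouts sampled for one prompt. The function name, the `eps` constant, and the use of the population standard deviation are assumptions made for this example; the paper's exact GRPO configuration is not specified in this section.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for a single prompt.

    Hypothetical helper for illustration: each rollout's advantage is its
    reward minus the group mean, divided by the group standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts with binary rewards: two correct, two incorrect.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group in which every rollout fails gives all-zero advantages,
# i.e. no learning signal for that prompt.
flat = group_relative_advantages([0.0, 0.0, 0.0])
```

Under this formulation, a prompt contributes to learning only when its rollouts disagree, which is one way to see why a base checkpoint needs some non-zero success rate before RL can make progress.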

### 2.1 Pre-training checkpoints

#### Base model and data\.

We pre-train a 1B parameter model based on OLMo2’s (OLMo Team et al., [2025b](https://arxiv.org/html/2606.04272#bib.bib16)) architecture and training infrastructure. We pre-train from scratch on a high-quality subset of OLMo2’s pre-training mix, DOLMino (OLMo Team et al., [2025b](https://arxiv.org/html/2606.04272#bib.bib16)) (footnote 1: allenai/dolmino-mix-1124). The DOLMino mix contains 50B tokens, including general domains such as Wikipedia (7%), high-quality web data (60%, from DCLM (Li et al., [2024](https://arxiv.org/html/2606.04272#bib.bib18)) and FLAN (Wei et al., [2022](https://arxiv.org/html/2606.04272#bib.bib19))), high-quality math data (20%), and other reasoning or code data such as StackExchange (2%) and STEM papers (5%).

#### Pre\-training details\.

We pre-train our 1B parameter model on 50B tokens ($\sim 2.5\times$ Chinchilla-optimal tokens). We use AdamW (Loshchilov and Hutter, [2019](https://arxiv.org/html/2606.04272#bib.bib23)) with a cosine learning rate decay and a peak learning rate of $4\times 10^{-4}$. We train the model with a sequence length of 4096 and a batch size of 512. For the experiments in Section [3.3](https://arxiv.org/html/2606.04272#S3.SS3), we perform two additional pre-training runs. In the first, we keep the model architecture and size fixed, but add 10B tokens from the DOLMino-3 mixture (OLMo Team et al., [2025a](https://arxiv.org/html/2606.04272#bib.bib33)) (footnote 2: allenai/dolma3\_dolmino\_mix-100B-1125) throughout training. In the second, we pre-train on the same 50B tokens but scale the model size to 4B parameters.
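The $\sim 2.5\times$ figure can be checked with a short calculation, assuming the commonly cited rule of thumb of roughly 20 training tokens per parameter from Hoffmann et al. (2022); the exact ratio the authors use is not stated here, so the numbers below are an approximation. With $N = 10^{9}$ parameters:

$$
D_{\text{opt}} \approx 20\,N = 20 \times 10^{9} = 20\text{B tokens}, \qquad \frac{50\text{B}}{20\text{B}} = 2.5, \qquad \frac{4\text{B}}{20\text{B}} = 0.2 .
$$

Under the same assumption, a checkpoint at 4B tokens has seen only about a fifth of the Chinchilla-optimal budget for a 1B model.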

### 2.2 Training Pipelines

Let $\mathcal{M}_{t}$ denote the pre-training model checkpoint at step $t$, and $\mathcal{M}_{T}$ denote the final, fully pre-trained model. We describe the three methods we compare below.

- Direct RL ($\mathcal{M}_{t}^{\text{RL}}$): We start with $\mathcal{M}_{t}$ and train with the RL objective.
- SFT only ($\mathcal{M}_{t}^{\text{SFT}}$): We start with $\mathcal{M}_{t}$ and perform SFT with ground-truth solutions.
- Standard pipeline ($\mathcal{M}_{t}^{\text{SFT}\rightarrow\text{RL}}$): We train $\mathcal{M}_{t}^{\text{SFT}}$ with RL on the same set of questions.

By comparing $\mathcal{M}_{t}^{\text{RL}}$ against $\mathcal{M}_{t}^{\text{SFT}}$, we isolate the training objective (RL vs. SFT) to determine whether RL provides a superior training signal. By comparing $\mathcal{M}_{t}^{\text{RL}}$ with $\mathcal{M}_{t}^{\text{SFT}\rightarrow\text{RL}}$, we determine whether RL alone can provide a training signal as strong as, or stronger than, the standard pipeline.

In this work, we are interested in how well each method performs when given sufficient compute. Therefore, we train all our RL and SFT runs until convergence, which we confirm in Appendix [B.3](https://arxiv.org/html/2606.04272#A2.SS3) and Appendix [B.5](https://arxiv.org/html/2606.04272#A2.SS5).

### 2.3 Data and Evaluation

#### Training data\.

We use OpenMathInstruct (Toshniwal et al., [2024](https://arxiv.org/html/2606.04272#bib.bib17)), which consists of math problems paired with multiple ground-truth demonstrations per problem. For SFT, by default, we randomly pick a single solution per prompt (SFT), which is the more realistic setting because obtaining multiple ground-truth reasoning traces for each problem is typically infeasible. However, we also consider training with all solutions (SFT-Gold) for completeness. For RL, we only consider the final answer for each problem and define a binary reward based on whether the model generation reaches the same final answer.
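A binary final-answer reward of this kind might look like the sketch below. The `\boxed{...}` answer convention, the exact-string comparison, and both function names are assumptions for illustration; the paper does not describe its answer-extraction or matching logic here, and a real verifier would likely normalize mathematically equivalent answers.

```python
import re


def extract_final_answer(text):
    """Return the contents of the last boxed answer in text, or None.

    Assumes answers are written as \\boxed{...} without nested braces,
    which is a simplification made for this sketch.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None


def binary_reward(generation, reference_answer):
    """Return 1.0 if the generation's final answer equals the reference, else 0.0."""
    predicted = extract_final_answer(generation)
    if predicted is not None and predicted == reference_answer.strip():
        return 1.0
    return 0.0


r_correct = binary_reward("Adding the terms gives \\boxed{42}", "42")
r_missing = binary_reward("The answer is 42", "42")
```

Note that the reward never inspects the reasoning trace itself, only the final answer, so no ground-truth demonstration is needed to compute it.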

#### Difficulty splits\.

OpenMathInstruct contains two categories of questions: a majority inspired by the MATH dataset (Hendrycks et al., [2021](https://arxiv.org/html/2606.04272#bib.bib27)) (competition-level) and a minority inspired by GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2606.04272#bib.bib26)) (grade-school level). We consider two experimental settings to probe different aspects of RL training: training with the full OpenMathInstruct and training with the GSM8K-inspired subset of OpenMathInstruct. On the GSM8K subset, the base pre-training checkpoints already achieve non-trivial performance. In contrast, the full MATH-heavy training set contains problems that remain challenging even for later pre-training checkpoints, allowing us to examine how far different pipelines can push the model’s reasoning capabilities.

#### Evaluation\.

We evaluate on GSM8K and MATH, reporting pass@k (Chen et al., [2021](https://arxiv.org/html/2606.04272#bib.bib25)), which estimates the probability of obtaining at least one correct response when $k$ responses are generated, for $k=1,8,32$ at temperature $T=0.6$.
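For reference, the unbiased pass@k estimator introduced by Chen et al. (2021) can be sketched as follows. It is assumed here, not confirmed by the text, that the paper uses this estimator; the number of samples `n` drawn per problem is likewise an illustrative parameter.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: total samples generated, c: number of correct samples, k: budget.
    Computes 1 - C(n - c, k) / C(n, k), the probability that a random
    size-k subset of the n samples contains at least one correct one.
    """
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 3 correct samples out of 10, pass@1 is the plain success rate.
p1 = pass_at_k(10, 3, 1)
```

The benchmark-level metric is then the mean of this per-problem estimate over all test problems.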

## 3 RL is Effective Early in Pre-Training

![Refer to caption](https://arxiv.org/html/2606.04272v1/x2.png)

Figure 2: RL is effective early in pre-training. GSM8K pass@k for $\mathcal{M}_{t}$, $\mathcal{M}_{t}^{\text{SFT}}$, $\mathcal{M}_{t}^{\text{SFT}\to\text{RL}}$, and $\mathcal{M}_{t}^{\text{RL}}$ across pre-training tokens $t$, with all SFT baselines trained on the SFT set (one ground-truth solution per problem). $\mathcal{M}_{t}^{\text{RL}}$ improves over $\mathcal{M}_{t}$ from as few as 4B tokens. By 10B tokens, $\mathcal{M}_{t}^{\text{RL}}$ matches the standard $\mathcal{M}_{t}^{\text{SFT}\to\text{RL}}$ pipeline and outperforms $\mathcal{M}_{t}^{\text{SFT}}$ alone.

In this section, we study the effect of RL at different stages of pre-training and contrast it with other training objectives. On GSM8K, for our 1B parameter model, we find that RL is effective from as early as 4B pre-training tokens, and often matches the full SFT$\rightarrow$RL pipeline (§[3.1](https://arxiv.org/html/2606.04272#S3.SS1)). RL is also the more effective objective when ground-truth demonstrations are scarce (§[3.2](https://arxiv.org/html/2606.04272#S3.F3)). On harder problems, pre-training data composition is a stronger lever for RL effectiveness than model scale (§[3.3](https://arxiv.org/html/2606.04272#S3.SS3)). Finally, we identify the base model pass@k on the test set as a lightweight diagnostic for whether RL will succeed (§[3.4](https://arxiv.org/html/2606.04272#S3.SS4)).

### 3.1 RLVR competes with the standard pipeline on GSM8K

In [Figure 2](https://arxiv.org/html/2606.04272#S3.F2), we report the performance of $\mathcal{M}_{t}$, $\mathcal{M}_{t}^{\text{RL}}$, $\mathcal{M}_{t}^{\text{SFT}}$, and $\mathcal{M}_{t}^{\text{SFT}\rightarrow\text{RL}}$ at various pre-training steps $t$ on GSM8K, using the GSM8K subset of OpenMathInstruct for post-training. We evaluate base checkpoints $\mathcal{M}_{t}$ with 8-shot prompting, as they cannot reliably follow question-answering instructions (footnote 3: in Appendix [B.6](https://arxiv.org/html/2606.04272#A2.SS6), we ablate the number of in-context examples and confirm that 8-shot yields the best performance for $\mathcal{M}_{t}$). All post-trained models use 0-shot evaluation, as RL includes a formatting reward and SFT data is formatted accordingly.

We observe that, as early as $t=4$B pre-training tokens, training with RL significantly improves the model’s performance on GSM8K: for example, the pass@1 accuracy increases from $\sim 2\%$ to $\sim 18\%$. Notably, the fact that this occurs at $t=4$B tokens indicates *improvement with RL prior to reaching the Chinchilla optimal number of tokens* (Hoffmann et al., [2022](https://arxiv.org/html/2606.04272#bib.bib29)). In addition, we observe a significant increase in pass@k for $k=8,32$, which we discuss in detail in §[4.1](https://arxiv.org/html/2606.04272#S4.SS1).

In [Figure 2](https://arxiv.org/html/2606.04272#S3.F2), after $t=10$B tokens, $\mathcal{M}_{t}^{\text{RL}}$ outperforms $\mathcal{M}_{t}^{\text{SFT}}$ on pass@1 and performs on par with $\mathcal{M}_{t}^{\text{SFT}\rightarrow\text{RL}}$. For pass@8 and pass@32, $\mathcal{M}_{t}^{\text{RL}}$ performs on par with both $\mathcal{M}_{t}^{\text{SFT}}$ and $\mathcal{M}_{t}^{\text{SFT}\rightarrow\text{RL}}$. This result is significant because $\mathcal{M}_{t}^{\text{RL}}$ never observes ground-truth reasoning traces; unlike the SFT baselines, it develops reasoning capabilities entirely from self-generated traces and feedback, demonstrating that RL can match supervised learning without training on ground-truth reasoning traces.

For some early $\mathcal{M}_{t}$ model checkpoints between $t=4$B and $t=10$B pre-training tokens, we observe brittleness across seeds: RL performance on some seeds fails to improve. It is likely that the model sometimes falls into a distinct failure mode for early pre-training checkpoints (Appendix [B.4](https://arxiv.org/html/2606.04272#A2.SS4)). Above, we report the RL runs with non-trivial performance.

### 3.2 RL outperforms when SFT data is scarce

![Refer to caption](https://arxiv.org/html/2606.04272v1/x3.png)

Figure 3: Diverse SFT data shifts the balance toward SFT-Gold. In contrast with [Figure 2](https://arxiv.org/html/2606.04272#S3.F2), SFT baselines are trained on SFT-Gold (all $\sim$23 ground-truth solutions per problem). With access to many ground-truth solutions, $\mathcal{M}_{t}^{\text{SFT-Gold}}$ alone surpasses $\mathcal{M}_{t}^{\text{RL}}$ on pass@8 and pass@32, while $\mathcal{M}_{t}^{\text{SFT-Gold}\to\text{RL}}$ remains best on pass@1. $\mathcal{M}_{t}^{\text{SFT-Gold}}$’s advantage requires multiple high-quality solutions per problem, which is rarely realistic in practice.

OpenMathInstruct contains an average of 23 ground-truth completions per problem. Our main results use one randomly chosen completion per problem, which we consider the more realistic SFT setting. We also consider a setting that uses the full OpenMathInstruct dataset for SFT training (i.e., multiple ground-truth completions per problem). We refer to this setting as SFT-Gold, since obtaining multiple high-quality solutions per problem typically requires expensive human supervision or generation from frontier models, making it an idealized setting that is impractical for many domains. [Figure 2](https://arxiv.org/html/2606.04272#S3.F2) shows that $\mathcal{M}_{t}^{\text{RL}}$ outperforms $\mathcal{M}_{t}^{\text{SFT}}$ on pass@1 and is competitive on pass@8 and pass@32. With SFT-Gold, the story changes ([Figure 3](https://arxiv.org/html/2606.04272#S3.F3)): SFT-Gold$\rightarrow$RL performs best on pass@1, but SFT-Gold alone surpasses both RL and SFT$\rightarrow$RL on pass@8 and pass@32. This suggests that access to diverse ground-truth reasoning traces can provide coverage benefits that on-policy exploration does not.

### 3.3 Targeted pre-training data is more essential than model size for RL

In [Figure 10](https://arxiv.org/html/2606.04272#A2.F10), we train on the MATH-like subset of OpenMathInstruct and evaluate pass@k accuracy on MATH. Unlike the GSM8K setting in [Figure 2](https://arxiv.org/html/2606.04272#S3.F2), directly applying RL to pre-training checkpoints is less effective than SFT, with either the SFT or the SFT-Gold data, on this harder benchmark. We hypothesize that the efficacy of RL training from pre-training checkpoints has limitations, potentially related to the difficulty of the task at hand. Therefore, we study two natural interventions: scaling $N$, the model size, and scaling $D$, the amount of pre-training data, especially task-relevant math data.

![Refer to caption](https://arxiv.org/html/2606.04272v1/x4.png)

Figure 4: Targeted pre-training data beats model scale for RL. Improvement on MATH from RL over the base model, across pre-training configurations: (i) 1B-50B, original pre-trained model; (ii) Scaling $D$ (1B-60B), 1B model pre-trained from scratch with an additional 10B math-heavy tokens mixed in; (iii) Scaling $N$ (4B-50B), 4B model trained on the same 50B-token mix as the original 1B model. Adding task-relevant pre-training data (Scaling $D$) yields substantially larger RL gains on MATH.

To test the effect of scaling $N$, we pre-train a 4B model from scratch using the same 50B-token mix and training recipe as the original 1B model. As expected, the 4B checkpoints achieve higher base MATH accuracy, and direct RL on these checkpoints also yields higher absolute performance than RL on the 1B checkpoints at matched pre-training steps. However, when measuring the gain from RL relative to each checkpoint’s own base performance, the 4B model does not obtain larger improvements. As shown in [Figure 4](https://arxiv.org/html/2606.04272#S3.F4), scaling model size improves the base model, but does not substantially improve the effectiveness of RL itself.

We then test the effect of scaling $D$ while keeping the model size fixed at 1B. We pre-train from scratch with an additional 10B math- and reasoning-heavy tokens from the Dolma 3 Dolmino Mix (OLMo Team et al., [2025a](https://arxiv.org/html/2606.04272#bib.bib33)), described in Appendix [B.2](https://arxiv.org/html/2606.04272#A2.SS2). In this setting, direct RL on the resulting checkpoints matches the SFT baseline on MATH ([Figure 11](https://arxiv.org/html/2606.04272#A2.F11)) and recovers the qualitative behavior observed on GSM8K ([Figure 2](https://arxiv.org/html/2606.04272#S3.F2)). Moreover, [Figure 4](https://arxiv.org/html/2606.04272#S3.F4) shows that the RL gain over the base model is substantially larger than in either the original 1B setting or the 4B scaling-$N$ setting.

Overall, targeted pre-training data is the more effective intervention: adding math-specific data during pre-training substantially improves the gains achievable by direct RL, whereas increasing model size primarily improves the base checkpoint. In Appendix [D](https://arxiv.org/html/2606.04272#A4), we further study scaling $G$, the number of RL rollouts, and find that increasing $G$ does not change the final outcome of direct RL.

### 3.4 Base model performance is predictive of RL effectiveness

Given that we train with RL on early pre-training checkpoints, a natural question is: how can we predict whether RL training will be effective? In Figure [5](https://arxiv.org/html/2606.04272#S3.F5), we compare the base model’s pass@k accuracy on the test set with that of the model after RL. For MATH, we also report the comparison for the two additional pre-training regimes discussed in §[3.3](https://arxiv.org/html/2606.04272#S3.SS3). We observe a generally monotonically increasing relationship, in which a higher pass@k for the base model corresponds to a higher pass@k for the model after RL. In practice, this suggests that a model’s pass@k accuracy on the test set might serve as a lightweight metric for whether RL training will yield downstream gains.

![Refer to caption](https://arxiv.org/html/2606.04272v1/x5.png)

Figure 5: Base pass@k on training data predicts RL effectiveness. Base model 8-shot pass@k on the test set ($x$-axis) vs. after RL ($y$-axis), for GSM8K (left) and MATH (right). pass@k accuracy on the test set might serve as a lightweight metric for whether RL training will yield downstream gains.

## 4 The Effects of RL Beyond Downstream Accuracy

![Refer to caption](https://arxiv.org/html/2606.04272v1/x6.png)

Figure 6: Direct RL expands while SFT$\to$RL sharpens. GSM8K pass@1 and pass@32 tracked across training stages on the same pretraining checkpoint $\mathcal{M}_{t}$. Left: under the standard $\mathcal{M}_{t}\to\mathcal{M}_{t}^{\text{SFT}}\to\mathcal{M}_{t}^{\text{SFT}\to\text{RL}}$ pipeline, pass@1 continues to improve during RL but pass@32 *decreases*, reproducing the sharpening effect reported in prior work. Right: applying RL directly to $\mathcal{M}_{t}$ improves both pass@1 and pass@32, expanding the model’s distribution rather than merely sharpening it.

![Refer to caption](https://arxiv.org/html/2606.04272v1/x7.png)

Figure 7: RL preserves general capabilities while SFT degrades them. Performance on six general-purpose (non-math) benchmarks for the base model $\mathcal{M}_{t}$ and three post-trained variants: $\mathcal{M}_{t}^{\text{RL}}$, $\mathcal{M}_{t}^{\text{SFT}}$, and $\mathcal{M}_{t}^{\text{SFT-Gold}}$. Both SFT and SFT-Gold consistently degrade performance by 4–8 pp on average across the benchmarks, while RL leaves these capabilities essentially unchanged.

In this section, we examine the effects of RL on the trained model beyond downstream accuracy. First, we evaluate whether RL unlocks new reasoning capabilities or merely *sharpens* the base model’s existing output distribution (§[4.1](https://arxiv.org/html/2606.04272#S4.SS1)). Then, we study whether RL alters general capabilities inherited from pretraining (§[4.2](https://arxiv.org/html/2606.04272#S4.SS2)). We find that the answer to both depends on the training pipeline.

### 4.1 Early stage RL can expand the model’s distribution

We refer to sharpening (Wu et al., [2025](https://arxiv.org/html/2606.04272#bib.bib30); Yue et al., [2025](https://arxiv.org/html/2606.04272#bib.bib28)) as the phenomenon in which training improves pass@1 accuracy but has little or even negative effect on pass@k accuracy for larger $k$. In contrast, we define expansion as the setting in which RL increases pass@k performance for large $k$.
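These two definitions can be written as a small decision rule over pass@k values measured before and after a training stage. The function name, the choice of $k=1$ and $k=32$ as the small and large budgets, and the tolerance `tol` for what counts as "little" change are all assumptions made for this sketch; the input values in the example are hypothetical and are not results from the paper.

```python
def classify_shift(before, after, small_k=1, large_k=32, tol=0.0):
    """Label a training stage by how pass@k moved.

    before, after: dicts mapping k -> pass@k for the same test set.
    Returns "expansion" if pass@large_k rose by more than tol,
    "sharpening" if only pass@small_k rose, and "neither" otherwise.
    """
    gain_small = after[small_k] - before[small_k]
    gain_large = after[large_k] - before[large_k]
    if gain_large > tol:
        return "expansion"
    if gain_small > tol:
        return "sharpening"
    return "neither"


# Hypothetical numbers chosen only to exercise the function.
stage_a = classify_shift({1: 0.10, 32: 0.50}, {1: 0.30, 32: 0.70})
stage_b = classify_shift({1: 0.30, 32: 0.70}, {1: 0.40, 32: 0.68})
```

Here `stage_a` is labeled as expansion and `stage_b` as sharpening, mirroring the two outcomes the section contrasts.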

Many recent works have claimed that RLVR largely sharpens the distribution without bringing the model any “new” reasoning capabilities (Yue et al., [2025](https://arxiv.org/html/2606.04272#bib.bib28); Cheng et al., [2026](https://arxiv.org/html/2606.04272#bib.bib13)). These works point to evidence that during RL, pass@k does not improve for sufficiently large $k$. Interestingly, in our experiments we observe two opposing outcomes depending on the training pipeline. First, when we apply the standard pipeline on pretraining checkpoints (i.e., $\mathcal{M}_{t}\rightarrow\mathcal{M}_{t}^{\text{SFT}}\rightarrow\mathcal{M}_{t}^{\text{SFT}\rightarrow\text{RL}}$), we observe the sharpening effect. In [Figure 6](https://arxiv.org/html/2606.04272#S4.F6) (left), we show one such example. We see that pass@1 continues to improve from $\mathcal{M}_{t}$ to $\mathcal{M}_{t}^{\text{SFT}}$, and then to $\mathcal{M}_{t}^{\text{SFT}\rightarrow\text{RL}}$. On the other hand, the SFT stage yields a significant gain in pass@32, but the subsequent RL stage slightly degrades the performance.

We hypothesize that sharpening occurs because, during SFT, the model has already seen ground-truth solutions on the same set of questions; RL therefore primarily refines these existing capabilities rather than discovering new reasoning paths. In contrast, when we train directly on the RL objective from the same pretraining checkpoint ([Figure 6](https://arxiv.org/html/2606.04272#S4.F6), right), RL training improves both pass@1 and pass@32 performance, expanding the base model’s distribution. Without prior exposure to ground-truth solutions, the model explores and discovers new reasoning paths through on-policy learning.

### 4.2 RL does not affect general model capabilities

A natural concern with applying RL to intermediate pretraining checkpoints is whether it degrades capabilities outside the training domain. To assess this, we evaluate on several general-purpose benchmarks and report results in Figure [7](https://arxiv.org/html/2606.04272#S4.F7). We report benchmark results for the base model, the base model after training directly with RL, and the base model after training with SFT. For SFT, we report accuracy for training with both SFT and SFT-Gold. Interestingly, we find that RL from the base model has the least effect on general model capability, while SFT training typically degrades general model capability regardless of whether one or many completions per prompt are used.

## 5Parallel RL and SFT

Figure 8: Parallel averaging update

1. Input: parameters $\theta$; optimizer states $s_{\text{RL}}, s_{\text{SFT}}$; batches $\mathcal{B}_{\text{RL}}, \mathcal{B}_{\text{SFT}}$; learning rates $\eta_{\text{RL}}, \eta_{\text{SFT}}$
2. *// Snapshot current parameters*
3. $\bar{\theta}\leftarrow\theta$
4. *// Compute both objectives at the same snapshot*
5. $g_{\text{RL}}\leftarrow\nabla_{\theta}\mathcal{L}_{\text{RL}}(\theta;\mathcal{B}_{\text{RL}})\big|_{\theta=\bar{\theta}}$
6. $g_{\text{SFT}}\leftarrow\nabla_{\theta}\mathcal{L}_{\text{SFT}}(\theta;\mathcal{B}_{\text{SFT}})\big|_{\theta=\bar{\theta}}$
7. *// Compute optimizer updates from the snapshot*
8. $(\Delta_{\text{RL}},s_{\text{RL}})\leftarrow\mathrm{OptUpdate}(g_{\text{RL}},s_{\text{RL}},\eta_{\text{RL}})$
9. $(\Delta_{\text{SFT}},s_{\text{SFT}})\leftarrow\mathrm{OptUpdate}(g_{\text{SFT}},s_{\text{SFT}},\eta_{\text{SFT}})$
10. *// Average the two gradient updates*
11. $\theta\leftarrow\bar{\theta}+\tfrac{1}{2}\left(\Delta_{\text{RL}}+\Delta_{\text{SFT}}\right)$
12. return $\theta$

![Refer to caption](https://arxiv.org/html/2606.04272v1/x8.png)

Figure 9: Parallel averaging combines the strengths of RL and SFT across pre-training. (Left) The parallel-averaging update ([Figure 9](https://arxiv.org/html/2606.04272#S5.F9)): at each step we take a single optimizer update from each of an RL gradient and an SFT gradient (each with its own optimizer state) and use their average to update the model weights. (Right) Parallel averaging ($\mathcal{M}_{t}^{\text{Parallel}}$) achieves the strongest pass@32 across pre-training checkpoints, surpassing the standard pipeline ($\mathcal{M}_{t}^{\text{SFT}\rightarrow\text{RL}}$). Unlike SFT-based regimes, $\mathcal{M}_{t}^{\text{Parallel}}$ does not regress on non-math benchmarks and retains base model performance.

The previous sections expose complementary strengths of the RL and SFT objectives applied directly to pretraining checkpoints. Direct RL ($\mathcal{M}_{t}^{\text{RL}}$) can develop new reasoning capabilities, expand the model’s pass@k distribution, and leave general (non-math) capabilities intact (§[4](https://arxiv.org/html/2606.04272#S4)). This expansion, however, is only reliable when the underlying pretraining mix and model size yield enough latent capability to bootstrap from the base model (§[3.3](https://arxiv.org/html/2606.04272#S3.SS3)). SFT instead provides reliable supervision from ground-truth reasoning traces. However, the efficacy of SFT relies on the diversity of the SFT data, and SFT might have a negative impact on general capabilities. Having studied RL and SFT in isolation, we next consider whether a combined objective might enjoy the benefits of both.

#### Method\.

We propose a simple algorithm that combines both training objectives (Algorithm [9](https://arxiv.org/html/2606.04272#S5.F9)). At each training step, starting from the same parameter snapshot $\theta$, we run one batch through an RL optimizer and, in parallel, run a separate batch of SFT data through an SFT optimizer. We obtain a proposed update from each of the two optimizers and average these updates to update $\theta$. Critically, the two optimizers maintain independent first- and second-moment estimates, so the adaptive step sizes and preconditioning do not interfere. We refer to the resulting model as $\mathcal{M}_{t}^{\text{Parallel}}$.
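The update described above can be sketched in a few lines of dependency-free Python. This is an illustrative toy under stated assumptions, not the authors' implementation: `opt_update` is a plain Adam step without weight decay (the paper's hyperparameter tables list AdamW), the two "losses" are toy quadratics standing in for the RL and SFT objectives, and all names (`AdamState`, `init_state`, `opt_update`, `parallel_average_step`) are hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class AdamState:
    m: list      # first-moment estimate
    v: list      # second-moment estimate
    t: int = 0   # step counter

def init_state(n: int) -> AdamState:
    return AdamState(m=[0.0] * n, v=[0.0] * n)

def opt_update(grad, state, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step (assumed optimizer): returns (proposed update, state)."""
    state.t += 1
    state.m = [beta1 * m + (1 - beta1) * g for m, g in zip(state.m, grad)]
    state.v = [beta2 * v + (1 - beta2) * g * g for v, g in zip(state.v, grad)]
    c1 = 1 - beta1 ** state.t  # bias corrections
    c2 = 1 - beta2 ** state.t
    delta = [-lr * (m / c1) / (math.sqrt(v / c2) + eps)
             for m, v in zip(state.m, state.v)]
    return delta, state

def parallel_average_step(theta, grad_rl, grad_sft, s_rl, s_sft, lr_rl, lr_sft):
    theta_bar = list(theta)                   # snapshot current parameters
    g_rl = grad_rl(theta_bar)                 # both gradients at the same snapshot
    g_sft = grad_sft(theta_bar)
    d_rl, _ = opt_update(g_rl, s_rl, lr_rl)   # independent optimizer states
    d_sft, _ = opt_update(g_sft, s_sft, lr_sft)
    # Average the two proposed updates and apply them to the snapshot.
    return [p + 0.5 * (a + b) for p, a, b in zip(theta_bar, d_rl, d_sft)]

# Toy quadratic objectives with different minima (stand-ins for RL and SFT).
target_rl, target_sft = [1.0, 1.0], [-1.0, 1.0]
grad_rl = lambda th: [p - t for p, t in zip(th, target_rl)]
grad_sft = lambda th: [p - t for p, t in zip(th, target_sft)]

theta0 = [0.0, 0.0]
s_rl, s_sft = init_state(2), init_state(2)
theta1 = parallel_average_step(theta0, grad_rl, grad_sft, s_rl, s_sft, 0.1, 0.1)
```

In this toy case the two objectives disagree on the first coordinate and agree on the second, so the averaged update cancels along the first and moves along the second. Keeping `s_rl` and `s_sft` separate means the moment estimates of one objective never rescale the step of the other.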

#### Findings\.

We report results in [Figure 9](https://arxiv.org/html/2606.04272#S5.F9) (see [Figure 17](https://arxiv.org/html/2606.04272#A3.F17) for the per-checkpoint training trajectories). Across every pre-training checkpoint we evaluated, parallel averaging attains the strongest pass@32 among recipes that use a single demonstration per problem, surpassing direct RL, SFT, and the standard pipeline. It also preserves the base model’s general (non-math) capabilities on par with direct RL, whereas every SFT-based recipe regresses on this axis by 5–8 percentage points. However, we also observe that these strong pass@k improvements come with the trade-off of a lower pass@1 relative to the direct RL and SFT baselines.

Overall, we read these results as evidence that the RL and SFT signals are complementary rather than merely additive: the SFT loss supplies supervisory structure on reasoning paths that on\-policy rollouts may rarely sample, while the concurrent RL signal anchors the model to its base distribution and avoids the general\-capability regression typically seen after a dedicated SFT stage\. Our recipe uses equal\-weight averaging with no scheduling, leaving room for more deliberate combinations of RL and next\-token\-prediction objectives which we view as a productive direction for future work\.

## 6Prior Work

#### Reinforcement Learning for LLM Reasoning

RL has become a standard post-training stage for LLMs (Ouyang et al., [2022](https://arxiv.org/html/2606.04272#bib.bib4); Dai et al., [2024](https://arxiv.org/html/2606.04272#bib.bib5); Jaech et al., [2024](https://arxiv.org/html/2606.04272#bib.bib46)), using modern policy-gradient methods (Rafailov et al., [2023](https://arxiv.org/html/2606.04272#bib.bib7); Shao et al., [2024](https://arxiv.org/html/2606.04272#bib.bib9); Yu et al., [2025](https://arxiv.org/html/2606.04272#bib.bib42); Khatri et al., [2025](https://arxiv.org/html/2606.04272#bib.bib48)) with verifiable rewards (Guo et al., [2025](https://arxiv.org/html/2606.04272#bib.bib32); Zheng et al., [2023](https://arxiv.org/html/2606.04272#bib.bib8)). Whether gains via these methods reflect new capabilities or merely a sharpening remains contested (Yue et al., [2025](https://arxiv.org/html/2606.04272#bib.bib28); Wu et al., [2025](https://arxiv.org/html/2606.04272#bib.bib30); Karan and Du, [2025](https://arxiv.org/html/2606.04272#bib.bib41); Cheng et al., [2026](https://arxiv.org/html/2606.04272#bib.bib13); Chu et al., [2025](https://arxiv.org/html/2606.04272#bib.bib43)), as does whether RL erodes abilities inherited from pretraining (Shenfeld et al., [2025](https://arxiv.org/html/2606.04272#bib.bib47)). Our findings suggest a nuanced view on these questions.

#### Integrating RL into Pretraining

A recent line of work brings RL into pretraining itself, either by scoring next-sentence reasoning against the training corpus (Li et al., [2025](https://arxiv.org/html/2606.04272#bib.bib1)) or by inserting chain-of-thought rollouts before each next-token prediction (Hatamizadeh et al., [2026](https://arxiv.org/html/2606.04272#bib.bib2); Dong et al., [2025](https://arxiv.org/html/2606.04272#bib.bib3); Xing et al., [2025](https://arxiv.org/html/2606.04272#bib.bib14)). These methods modify the pretraining objective. Our work instead keeps the standard NTP and RL objectives unchanged. We view our results as a precursor: before adding RL *into* pretraining, it is worth knowing how early in pretraining RL on top of NTP already pays off.

#### Interleaving SFT and RL

A growing body of work performs mixed-policy training. Approaches include importance-weighted off-policy expert traces (Yan et al., [2025](https://arxiv.org/html/2606.04272#bib.bib34)), alternating SFT and RL passes targeted at unsolved problems (Dong et al., [2026](https://arxiv.org/html/2606.04272#bib.bib35)), joint losses with adaptive weighting (Fu et al., [2025](https://arxiv.org/html/2606.04272#bib.bib36); Lv et al., [2025](https://arxiv.org/html/2606.04272#bib.bib38); Zhang et al., [2026](https://arxiv.org/html/2606.04272#bib.bib40)), and hybrid trajectories that blend expert prefixes with on-policy continuations (Huang et al., [2026](https://arxiv.org/html/2606.04272#bib.bib37)). Limozin et al. ([2026](https://arxiv.org/html/2606.04272#bib.bib45)) caution that several of these methods were compared against deflated SFT baselines, and that a correctly implemented SFT$\to$RL pipeline can match or exceed them. In our work, we evaluate a simple alternative approach that maintains independent Adam moments for SFT and RL and averages their proposed updates after each step.

#### Prerequisites for Post\-Training

A small but growing literature studies what level of pretraining is required before post-training becomes effective. Chen et al. ([2025](https://arxiv.org/html/2606.04272#bib.bib11)); Foster et al. ([2025](https://arxiv.org/html/2606.04272#bib.bib12)) argue that the base model must reach a minimum capability for RL to yield gains. Zhang et al. ([2025](https://arxiv.org/html/2606.04272#bib.bib10)) relate this threshold to the base model’s basic skills and to the difficulty of the RL data; see also Guo et al. ([2025](https://arxiv.org/html/2606.04272#bib.bib32)); Zhou et al. ([2023](https://arxiv.org/html/2606.04272#bib.bib31)). Our work tests this premise empirically by tracing RL effectiveness across pretraining tokens, and finds that the threshold is far lower than commonly assumed: RL is effective on GSM8K from as few as 4B tokens, well below the Chinchilla-optimal point.
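As a rough back-of-the-envelope check on "well below the Chinchilla-optimal point", the sketch below uses the commonly cited approximation of about 20 training tokens per parameter. Both this rule of thumb and the choice of a 1B-parameter model are assumptions made for illustration; this section does not state the exact definition of the Chinchilla-optimal point used in the paper.

```python
# Assumed rule of thumb: roughly 20 training tokens per model parameter.
TOKENS_PER_PARAM = 20

def chinchilla_tokens(n_params: float) -> float:
    """Approximate compute-optimal token count under the assumed heuristic."""
    return TOKENS_PER_PARAM * n_params

n_params = 1e9                       # assumed 1B-parameter model
d_opt = chinchilla_tokens(n_params)  # about 20B tokens
fraction = 4e9 / d_opt               # 4B tokens as a fraction of d_opt
```

Under these assumptions, 4B tokens corresponds to roughly one fifth of the compute-optimal token budget for a 1B-parameter model.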

## 7Discussion & Future Directions

In this work, we provide a comprehensive and nuanced picture of the RL objective for LLM training beyond how it is used in the current standard pipeline. We find that RL can be effective starting early in pre-training, well before the Chinchilla-optimal regime, and often matches the full SFT$\to$RL pipeline on GSM8K despite never seeing a ground-truth reasoning trace. Further, the dominant lever for whether early RL succeeds is pre-training data composition, not model scale. Perhaps most strikingly, the two effects most commonly attributed to RL, distribution sharpening and regression on general capabilities, are largely artifacts of a preceding SFT stage rather than of the RL objective itself: applied directly to base checkpoints, RL instead expands the pass@k distribution and leaves non-math capabilities essentially intact, unlike SFT, which consistently and significantly degrades them.

Our results open several exciting research directions. The most consequential is rethinking how data and objectives are coordinated end-to-end: if RL is effective well inside pre-training, and the pre-training mix controls its ceiling, the practical question becomes what pre-training recipes could look like once RL is treated as a first-class training objective rather than a final post-training step. Our parallel-averaging experiment (§[5](https://arxiv.org/html/2606.04272#S5)) is an early data point towards that end. Future work could pursue more careful designs, with adaptive weighting, scheduling, or importance sampling on top of independent optimizer states. The finding that base pass@k already predicts RL effectiveness (§[3.4](https://arxiv.org/html/2606.04272#S3.SS4)) further motivates adaptive rollout strategies that concentrate compute on prompts where the base model has non-trivial coverage but has not yet converged. Our experiments are at 1B and 4B parameters and 50–60B tokens; whether the same picture holds at frontier scale is essential future work.

#### Limitations

We discuss a few limitations of our work. First, while our pre-training mix is designed to be reflective of general pre-training settings, it is more math-heavy than typical web-scale corpora. Second, we focus on standard GRPO as a representative RLVR objective and do not study the growing family of variants (e.g., those explicitly targeting entropy preservation or pass@k expansion), which may interact differently with the pre-training stage.
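For readers less familiar with GRPO, the following minimal sketch shows the group-relative advantage at its core (Shao et al., 2024): each rollout's reward is normalized by the mean and standard deviation of the rewards in its group. The function name, the use of the population standard deviation, the `eps` constant, and the example reward values are assumptions for illustration; it omits the clipped policy-gradient loss and the KL term, and it is not the authors' training code.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps: float = 1e-6):
    """Group-relative advantages for the rollouts sampled for one prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group of four rollouts: two correct (reward 1), two incorrect (reward 0).
mixed = grpo_advantages([1.0, 0.0, 0.0, 1.0])

# If every rollout in a group receives the same reward, all advantages are zero,
# so that prompt contributes no learning signal under this estimator.
uniform = grpo_advantages([0.0, 0.0, 0.0, 0.0])
```

The second case hints at why base-model coverage matters for direct RL: a prompt the model never solves (or always solves) yields zero advantages under this normalization.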

## References

- F. Chen, A. Huang, N. Golowich, S. Malladi, A. Block, J. T. Ash, A. Krishnamurthy, and D. J. Foster (2025) The coverage principle: how pre-training enables post-training. arXiv preprint arXiv:2510.15020. External Links: [Link](https://arxiv.org/abs/2510.15020). Cited by: [§6](https://arxiv.org/html/2606.04272#S6.SS0.SSS0.Px4.p1.1).
- M. Chen, J. Tworek, H. Jun, Q. Yuan, H. Pondé, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. W. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, I. Babuschkin, S. Balaji, S. Jain, A. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba (2021) Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374. External Links: [Link](https://arxiv.org/abs/2107.03374). Cited by: [§2.3](https://arxiv.org/html/2606.04272#S2.SS3.SSS0.Px3.p1.4).
- Z. Cheng, Y. Xie, Y. Qu, A. Setlur, S. Hao, V. Pimpalkhute, T. Liang, F. Yao, H. Liu, E. Xing, V. Smith, R. Salakhutdinov, Z. Hu, T. Killian, and A. Kumar (2026) IsoCompute playbook: optimally scaling sampling compute for RL training of LLMs. Note: [https://compute-optimal-rl-llm-scaling.github.io/](https://compute-optimal-rl-llm-scaling.github.io/). External Links: [Link](https://compute-optimal-rl-llm-scaling.github.io/). Cited by: [§4.1](https://arxiv.org/html/2606.04272#S4.SS1.p2.8), [§6](https://arxiv.org/html/2606.04272#S6.SS0.SSS0.Px1.p1.1).
- T. Chu, Y. Zhai, J. Yang, S. Tong, S. Xie, D. Schuurmans, Q. V. Le, S. Levine, and Y. Ma (2025) SFT memorizes, RL generalizes: a comparative study of foundation model post-training. arXiv preprint arXiv:2501.17161. External Links: [Link](https://arxiv.org/abs/2501.17161). Cited by: [§6](https://arxiv.org/html/2606.04272#S6.SS0.SSS0.Px1.p1.1).
- K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman (2021) Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168. External Links: [Link](https://arxiv.org/abs/2110.14168). Cited by: [§2.3](https://arxiv.org/html/2606.04272#S2.SS3.SSS0.Px2.p1.1).
- J. Dai, X. Pan, R. Sun, J. Ji, X. Xu, M. Liu, Y. Wang, and Y. Yang (2024) Safe RLHF: safe reinforcement learning from human feedback. In International Conference on Learning Representations (ICLR). External Links: [Link](https://arxiv.org/abs/2310.12773). Cited by: [§6](https://arxiv.org/html/2606.04272#S6.SS0.SSS0.Px1.p1.1).
- Q. Dong, L. Dong, Y. Tang, T. Ye, Y. Sun, Z. Sui, and F. Wei (2025) Reinforcement pre-training. arXiv preprint arXiv:2506.08007. External Links: [Link](https://arxiv.org/abs/2506.08007). Cited by: [§6](https://arxiv.org/html/2606.04272#S6.SS0.SSS0.Px2.p1.1).
- Y. Dong, X. Jiang, Y. Tao, H. Liu, K. Zhang, L. Mou, R. Cao, Y. Ma, J. Chen, B. Li, et al. (2026) RL-PLUS: countering capability boundary collapse of LLMs in reinforcement learning with hybrid-policy optimization. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL). External Links: [Link](https://arxiv.org/abs/2508.00222). Cited by: [§6](https://arxiv.org/html/2606.04272#S6.SS0.SSS0.Px3.p1.1).
- D. J. Foster, Z. Mhammedi, and D. Rohatgi (2025) Is a good foundation necessary for efficient reinforcement learning? the computational role of the base model in exploration. arXiv preprint arXiv:2503.07453. External Links: [Link](https://arxiv.org/abs/2503.07453). Cited by: [§1](https://arxiv.org/html/2606.04272#S1.p2.1), [§6](https://arxiv.org/html/2606.04272#S6.SS0.SSS0.Px4.p1.1).
- Y. Fu, T. Chen, J. Chai, X. Wang, S. Tu, G. Yin, W. Lin, Q. Zhang, Y. Zhu, and D. Zhao (2025) SRFT: a single-stage method with supervised and reinforcement fine-tuning for reasoning. arXiv preprint arXiv:2506.19767. External Links: [Link](https://arxiv.org/abs/2506.19767). Cited by: [§6](https://arxiv.org/html/2606.04272#S6.SS0.SSS0.Px3.p1.1).
- D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. (2025) DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning. Nature 645, pp. 633–638. External Links: [Link](https://arxiv.org/abs/2501.12948). Cited by: [§6](https://arxiv.org/html/2606.04272#S6.SS0.SSS0.Px1.p1.1), [§6](https://arxiv.org/html/2606.04272#S6.SS0.SSS0.Px4.p1.1).
- A. Hatamizadeh, S. N. Akter, S. Prabhumoye, J. Kautz, M. Patwary, M. Shoeybi, B. Catanzaro, and Y. Choi (2026) RLP: reinforcement as a pretraining objective. In International Conference on Learning Representations (ICLR). External Links: [Link](https://arxiv.org/abs/2510.01265). Cited by: [§1](https://arxiv.org/html/2606.04272#S1.p2.1), [§6](https://arxiv.org/html/2606.04272#S6.SS0.SSS0.Px2.p1.1).
- D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt (2021) Measuring mathematical problem solving with the MATH dataset. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track. External Links: [Link](https://arxiv.org/abs/2103.03874). Cited by: [§2.3](https://arxiv.org/html/2606.04272#S2.SS3.SSS0.Px2.p1.1).
- J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark, et al. (2022) Training compute-optimal large language models. arXiv preprint arXiv:2203.15556. External Links: [Link](https://arxiv.org/abs/2203.15556). Cited by: [§3.1](https://arxiv.org/html/2606.04272#S3.SS1.p2.5).
- Z. Huang, T. Cheng, Z. Qiu, Z. Wang, Y. Xu, E. M. Ponti, and I. Titov (2026) Blending supervised and reinforcement fine-tuning with prefix sampling. In International Conference on Machine Learning (ICML). External Links: [Link](https://arxiv.org/abs/2507.01679). Cited by: [§6](https://arxiv.org/html/2606.04272#S6.SS0.SSS0.Px3.p1.1).
- A. Jaech, A. Kalai, A. Lerer, A. Richardson, A. El-Kishky, A. Low, A. Helyar, A. Madry, A. Beutel, A. Carney, et al. (2024) OpenAI o1 system card. arXiv preprint arXiv:2412.16720. External Links: [Link](https://arxiv.org/abs/2412.16720). Cited by: [§6](https://arxiv.org/html/2606.04272#S6.SS0.SSS0.Px1.p1.1).
- A. Karan and Y. Du (2025) Reasoning with sampling: your base model is smarter than you think. arXiv preprint arXiv:2510.14901. External Links: [Link](https://arxiv.org/abs/2510.14901). Cited by: [§1](https://arxiv.org/html/2606.04272#S1.p4.1), [§6](https://arxiv.org/html/2606.04272#S6.SS0.SSS0.Px1.p1.1).
- D. Khatri, L. Madaan, R. Tiwari, R. Bansal, S. S. Duvvuri, M. Zaheer, I. S. Dhillon, D. Brandfonbrener, and R. Agarwal (2025) The art of scaling reinforcement learning compute for LLMs. External Links: 2510.13786, [Link](https://arxiv.org/abs/2510.13786). Cited by: [§6](https://arxiv.org/html/2606.04272#S6.SS0.SSS0.Px1.p1.1).
- J. Li, A. Fang, G. Smyrnis, M. Ivgi, M. Jordan, S. Y. Gadre, H. Bansal, E. Guha, S. S. Keh, K. Arora, et al. (2024) DataComp-LM: in search of the next generation of training sets for language models. Advances in Neural Information Processing Systems 37, pp. 14200–14282. External Links: [Link](https://arxiv.org/abs/2406.11794). Cited by: [§2.1](https://arxiv.org/html/2606.04272#S2.SS1.SSS0.Px1.p1.7).
- S. Li, K. Li, Z. Xu, G. Huang, E. Yang, K. Li, H. Wu, J. Wu, Z. Zheng, C. Zhang, et al. (2025) Reinforcement learning on pre-training data. arXiv preprint arXiv:2509.19249. External Links: [Link](https://arxiv.org/abs/2509.19249). Cited by: [§1](https://arxiv.org/html/2606.04272#S1.p2.1), [§6](https://arxiv.org/html/2606.04272#S6.SS0.SSS0.Px2.p1.1).
- A. Limozin, E. Durech, T. Hoefler, I. Schlag, and V. Pyatkin (2026) SFT-then-RL outperforms mixed-policy methods for LLM reasoning. arXiv preprint arXiv:2604.23747. External Links: [Link](https://arxiv.org/abs/2604.23747). Cited by: [§6](https://arxiv.org/html/2606.04272#S6.SS0.SSS0.Px3.p1.1).
- I. Loshchilov and F. Hutter (2019) Decoupled weight decay regularization. External Links: 1711.05101, [Link](https://arxiv.org/abs/1711.05101). Cited by: [§2.1](https://arxiv.org/html/2606.04272#S2.SS1.SSS0.Px2.p1.9).
- X. Lv, Y. Zuo, Y. Sun, H. Liu, Y. Wei, Z. Chen, X. Zhu, K. Zhang, B. Wang, N. Ding, et al. (2025) Towards a unified view of large language model post-training. arXiv preprint arXiv:2509.04419. External Links: [Link](https://arxiv.org/abs/2509.04419). Cited by: [§6](https://arxiv.org/html/2606.04272#S6.SS0.SSS0.Px3.p1.1).
- OLMo Team, A. Ettinger, A. Bertsch, B. Kuehl, D. Graham, D. Heineman, D. Groeneveld, F. Brahman, F. Timbers, H. Ivison, et al. (2025a) OLMo 3. arXiv preprint arXiv:2512.13961. External Links: [Link](https://arxiv.org/abs/2512.13961). Cited by: [Table 5](https://arxiv.org/html/2606.04272#A2.T5), [Table 5](https://arxiv.org/html/2606.04272#A2.T5.3.2), [§2.1](https://arxiv.org/html/2606.04272#S2.SS1.SSS0.Px2.p1.9), [§3.3](https://arxiv.org/html/2606.04272#S3.SS3.p3.2).
- OLMo Team, P. Walsh, L. Soldaini, D. Groeneveld, K. Lo, S. Arora, A. Bhagia, Y. Gu, S. Huang, M. Jordan, et al. (2025b) 2 OLMo 2 furious. In Conference on Language Modeling (COLM). External Links: [Link](https://arxiv.org/abs/2501.00656). Cited by: [§D.1](https://arxiv.org/html/2606.04272#A4.SS1.SSS0.Px2.p1.1), [§2.1](https://arxiv.org/html/2606.04272#S2.SS1.SSS0.Px1.p1.7).
- L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. (2022) Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35, pp. 27730–27744. External Links: [Link](https://arxiv.org/abs/2203.02155). Cited by: [§1](https://arxiv.org/html/2606.04272#S1.p1.1), [§6](https://arxiv.org/html/2606.04272#S6.SS0.SSS0.Px1.p1.1).
- T. Qin, C. F. Park, M. Kwun, A. Walsman, E. Malach, N. Anand, H. Tanaka, and D. Alvarez-Melis (2025) Decomposing elements of problem solving: what “math” does RL teach? External Links: 2505.22756, [Link](https://arxiv.org/abs/2505.22756). Cited by: [§1](https://arxiv.org/html/2606.04272#S1.p4.1).
- T. Qin, N. Saphra, and D. Alvarez-Melis (2024) Sometimes I am a tree: data drives unstable hierarchical generalization. arXiv \[cs.LG\]. External Links: [Link](http://arxiv.org/abs/2412.04619), 2412.04619. Cited by: [§B.4](https://arxiv.org/html/2606.04272#A2.SS4.p1.1).
- R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn (2023) Direct preference optimization: your language model is secretly a reward model. Advances in Neural Information Processing Systems 36, pp. 53728–53741. External Links: [Link](https://arxiv.org/abs/2305.18290). Cited by: [§6](https://arxiv.org/html/2606.04272#S6.SS0.SSS0.Px1.p1.1).
- Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. (2024) DeepSeekMath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300. External Links: [Link](https://arxiv.org/abs/2402.03300). Cited by: [§1](https://arxiv.org/html/2606.04272#S1.p1.1), [§2](https://arxiv.org/html/2606.04272#S2.p1.1), [§6](https://arxiv.org/html/2606.04272#S6.SS0.SSS0.Px1.p1.1).
- I. Shenfeld, J. Pari, and P. Agrawal (2025) RL’s razor: why online reinforcement learning forgets less. arXiv preprint arXiv:2509.04259. External Links: [Link](https://arxiv.org/abs/2509.04259). Cited by: [§6](https://arxiv.org/html/2606.04272#S6.SS0.SSS0.Px1.p1.1).
- S. Toshniwal, I. Moshkov, S. Narenthiran, D. Gitman, F. Jia, and I. Gitman (2024) OpenMathInstruct-1: a 1.8 million math instruction tuning dataset. Advances in Neural Information Processing Systems 37, pp. 34737–34774. External Links: [Link](https://arxiv.org/abs/2402.10176). Cited by: [§D.1](https://arxiv.org/html/2606.04272#A4.SS1.SSS0.Px1.p1.9), [§2.3](https://arxiv.org/html/2606.04272#S2.SS3.SSS0.Px1.p1.1).
- J. Wei, M. Bosma, V. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le (2022) Finetuned language models are zero-shot learners. In International Conference on Learning Representations (ICLR). External Links: [Link](https://arxiv.org/abs/2109.01652). Cited by: [§2.1](https://arxiv.org/html/2606.04272#S2.SS1.SSS0.Px1.p1.7).
- F. Wu, W. Xuan, X. Lu, M. Liu, Y. Dong, Z. Harchaoui, and Y. Choi (2025) The invisible leash: why RLVR may or may not escape its origin. arXiv preprint arXiv:2507.14843. External Links: [Link](https://arxiv.org/abs/2507.14843). Cited by: [§1](https://arxiv.org/html/2606.04272#S1.p4.1), [§1](https://arxiv.org/html/2606.04272#S1.p9.2), [§4.1](https://arxiv.org/html/2606.04272#S4.SS1.p1.5), [§6](https://arxiv.org/html/2606.04272#S6.SS0.SSS0.Px1.p1.1).
- X. Xing, Z. Fan, J. Lou, G. Li, J. Zhang, and D. Zhang (2025) PretrainZero: reinforcement active pretraining. arXiv preprint arXiv:2512.03442. External Links: [Link](https://arxiv.org/abs/2512.03442). Cited by: [§1](https://arxiv.org/html/2606.04272#S1.p2.1), [§6](https://arxiv.org/html/2606.04272#S6.SS0.SSS0.Px2.p1.1).
- J. Yan, Y. Li, Z. Hu, Z. Wang, G. Cui, X. Qu, Y. Cheng, and Y. Zhang (2025) Learning to reason under off-policy guidance. arXiv preprint arXiv:2504.14945. External Links: [Link](https://arxiv.org/abs/2504.14945). Cited by: [§6](https://arxiv.org/html/2606.04272#S6.SS0.SSS0.Px3.p1.1).
- Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, W. Dai, T. Fan, G. Liu, L. Liu, et al. (2025) DAPO: an open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476. External Links: [Link](https://arxiv.org/abs/2503.14476). Cited by: [§6](https://arxiv.org/html/2606.04272#S6.SS0.SSS0.Px1.p1.1).
- Y. Yue, Z. Chen, R. Lu, A. Zhao, Z. Wang, Y. Yue, S. Song, and G. Huang (2025) Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? In Advances in Neural Information Processing Systems (NeurIPS). Note: Oral. External Links: [Link](https://arxiv.org/abs/2504.13837). Cited by: [§1](https://arxiv.org/html/2606.04272#S1.p4.1), [§1](https://arxiv.org/html/2606.04272#S1.p9.2), [§4.1](https://arxiv.org/html/2606.04272#S4.SS1.p1.5), [§4.1](https://arxiv.org/html/2606.04272#S4.SS1.p2.8), [§6](https://arxiv.org/html/2606.04272#S6.SS0.SSS0.Px1.p1.1).
- C. Zhang, G. Neubig, and X. Yue (2025) On the interplay of pre-training, mid-training, and RL on reasoning language models. arXiv preprint arXiv:2512.07783. External Links: [Link](https://arxiv.org/abs/2512.07783). Cited by: [§6](https://arxiv.org/html/2606.04272#S6.SS0.SSS0.Px4.p1.1).
- W. Zhang, Y. Xie, Y. Sun, Y. Chen, G. Wang, Y. Li, B. Ding, and J. Zhou (2026) On-policy RL meets off-policy experts: harmonizing supervised fine-tuning and reinforcement learning via dynamic weighting. External Links: 2508.11408, [Link](https://arxiv.org/abs/2508.11408). Cited by: [§6](https://arxiv.org/html/2606.04272#S6.SS0.SSS0.Px3.p1.1).
- R. Zhao, T. Qin, D. Alvarez-Melis, S. Kakade, and N. Saphra (2026) Random scaling of emergent capabilities. External Links: 2502.17356, [Link](https://arxiv.org/abs/2502.17356). Cited by: [§B.4](https://arxiv.org/html/2606.04272#A2.SS4.p1.1).
- R. Zheng, S. Dou, S. Gao, Y. Hua, W. Shen, B. Wang, Y. Liu, S. Jin, Q. Liu, Y. Zhou, et al. (2023) Secrets of RLHF in large language models part I: PPO. arXiv preprint arXiv:2307.04964. External Links: [Link](https://arxiv.org/abs/2307.04964). Cited by: [§6](https://arxiv.org/html/2606.04272#S6.SS0.SSS0.Px1.p1.1).
- C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat, P. Yu, L. Yu, et al. (2023) LIMA: less is more for alignment. Advances in Neural Information Processing Systems 36, pp. 55006–55021. External Links: [Link](https://arxiv.org/abs/2305.11206). Cited by: [§6](https://arxiv.org/html/2606.04272#S6.SS0.SSS0.Px4.p1.1).

## Appendix AExperiment Details

In this section, we provide the details necessary to replicate our experiments. For pretraining, we use the Olmo pretraining library, and for RL/SFT we use the VeRL library.

### A\.1Resources

For our experiments, we use a combination of NVIDIA A100 GPUs and NVIDIA H100 GPUs\. Pretraining takes several days, GRPO training takes several days, and SFT takes a few hours\.

### A\.2Hyperparameters

In the following tables, we report hyperparameter choices for GRPO, SFT, and pretraining\.

Table 1: GRPO training hyperparameters (OLMo2-1B on GSM8K subset).

| Category | Hyperparameter | Value |
| --- | --- | --- |
| Data | Train batch size | 512 |
| Data | Max prompt length | 1024 |
| Data | Max response length | 2048 |
| Data | Rollouts per prompt ($n$) | 32 |
| Optimization | Learning rate | $1\times 10^{-6}$ |
| Optimization | Optimizer | AdamW |
| Optimization | $(\beta_1, \beta_2)$ | $(0.9,\ 0.999)$ |
| Optimization | Weight decay | 0.01 |
| Optimization | Gradient clip | 1.0 |
| Optimization | Mini-batch size | 128 |
| Optimization | KL loss coefficient | $1\times 10^{-3}$ |
| Optimization | KL loss type | low-variance KL |
| Reward | Advantage estimator | GRPO |
| Reward | Format score (partial) | 0.1 |
| Infrastructure | Total epochs | 10 |
| Infrastructure | GPU memory utilization | 0.6 |

Table 2: GRPO training hyperparameters (OLMo2-1B on OpenMathInstruct-2).

| Category | Hyperparameter | Value |
| --- | --- | --- |
| Data | Train batch size | 512 |
| Data | Max prompt length | 1024 |
| Data | Max response length | 2048 |
| Data | Rollouts per prompt ($n$) | 32 |
| Optimization | Learning rate | $1\times 10^{-6}$ |
| Optimization | Optimizer | AdamW |
| Optimization | $(\beta_1, \beta_2)$ | $(0.9,\ 0.999)$ |
| Optimization | Weight decay | 0.01 |
| Optimization | Gradient clip | 1.0 |
| Optimization | KL loss coefficient | $1\times 10^{-3}$ |
| Optimization | KL loss type | low-variance KL |
| Reward | Advantage estimator | GRPO |
| Reward | Format score (partial) | 0.1 |
| Infrastructure | Total epochs | 10 |
| Infrastructure | GPU memory utilization | 0.8 |

Table 3: SFT training hyperparameters (OLMo2-1B on OpenMathInstruct-2).

| Category | Hyperparameter | Value |
| --- | --- | --- |
| Data | Train batch size | 512 |
| Data | Max prompt length | 2560 |
| Data | Max response length | 1024 |
| Data | Rollouts per prompt ($n$) | 32 |
| Optimization | Learning rate | $4\times 10^{-5}$ |
| Optimization | Optimizer | AdamW |
| Optimization | $(\beta_1, \beta_2)$ | $(0.9,\ 0.999)$ |
| Optimization | Weight decay | 0.01 |
| Optimization | Gradient clip | 1.0 |
| SFT Schedule | Mode | interleaved |
| SFT Schedule | SFT steps per cycle | 50000 |
| SFT Schedule | RL steps per cycle | 0 |
| Infrastructure | Total epochs | 100 |
| Infrastructure | GPU memory utilization | 0.6 |

Table 4: Pretraining hyperparameters (OLMo2-1B, 50B tokens).

| Category | Hyperparameter | Value |
| --- | --- | --- |
| Data | Total training tokens | 50B |
| Data | Global batch size (sequences) | 512 |
| Data | Gradient accumulation steps | 64 |
| Optimization | Learning rate | $4\times 10^{-4}$ |
| Optimization | Optimizer | AdamW |
| Optimization | $(\beta_1, \beta_2)$ | $(0.9,\ 0.95)$ |
| Optimization | Weight decay | 0.1 |
| Optimization | Gradient clip | 1.0 |
| LR Schedule | Schedule | cosine with warmup |
| LR Schedule | Warmup tokens | 1B |
| LR Schedule | Min LR ratio ($\alpha_f$) | 0.1 |
| LR Schedule | Units | tokens |
| Regularization | Precision | BF16 (AMP) |
| Regularization | Softmax auxiliary loss | ✓ |
| Regularization | Auxiliary loss multiplier | $1\times 10^{-5}$ |
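To make the learning-rate schedule in Table 4 concrete, the following is a minimal sketch of a token-based cosine schedule with warmup. The constants come from Table 4; the exact shape (linear warmup from zero, cosine decay down to $\alpha_f$ times the peak rate) and the function name `lr_at` are assumptions for illustration, not the authors' implementation.

```python
import math

PEAK_LR = 4e-4          # Table 4: learning rate
WARMUP_TOKENS = 1e9     # Table 4: warmup tokens
TOTAL_TOKENS = 50e9     # Table 4: total training tokens
MIN_LR_RATIO = 0.1      # Table 4: min LR ratio (alpha_f)


def lr_at(tokens_seen: float) -> float:
    """Learning rate after `tokens_seen` tokens under the assumed schedule shape."""
    if tokens_seen < WARMUP_TOKENS:
        # Assumption: linear warmup from 0 to the peak learning rate.
        return PEAK_LR * tokens_seen / WARMUP_TOKENS
    # Assumption: cosine decay from the peak down to MIN_LR_RATIO * PEAK_LR.
    progress = min(1.0, (tokens_seen - WARMUP_TOKENS) / (TOTAL_TOKENS - WARMUP_TOKENS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    min_lr = MIN_LR_RATIO * PEAK_LR
    return min_lr + (PEAK_LR - min_lr) * cosine


if __name__ == "__main__":
    for tokens in (0.5e9, 1e9, 25e9, 50e9):
        print(f"{tokens / 1e9:>5.1f}B tokens -> lr = {lr_at(tokens):.2e}")
```

Because the schedule is indexed by tokens rather than optimizer steps (the "Units: tokens" row), an intermediate checkpoint $\mathcal{M}_t$ can be mapped to its learning rate directly from its token count $t$.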

## Appendix B Additional Results for Section [3](https://arxiv.org/html/2606.04272#S3)

### B.1 MATH performance

See Fig. [10](https://arxiv.org/html/2606.04272#A2.F10) for MATH performance of the original 1B model, Fig. [11](https://arxiv.org/html/2606.04272#A2.F11) for the 1B model pretrained on 60B tokens, and Fig. [12](https://arxiv.org/html/2606.04272#A2.F12) for the 4B model.

![Refer to caption](https://arxiv.org/html/2606.04272v1/x9.png)

Figure 10: RL underperforms SFT$\to$RL on harder MATH problems. MATH pass@k for $\mathcal{M}_t$, $\mathcal{M}_t^{\text{SFT}}$, $\mathcal{M}_t^{\text{SFT}\rightarrow\text{RL}}$, and $\mathcal{M}_t^{\text{RL}}$ trained on the full OpenMathInstruct, with the base model at $N=1$B parameters and $D=50$B pretraining tokens. $\mathcal{M}_t^{\text{RL}}$ still improves over $\mathcal{M}_t$ before Chinchilla-optimal token counts, but a persistent gap to $\mathcal{M}_t^{\text{SFT}\rightarrow\text{RL}}$ remains throughout pretraining, indicating that direct RL is insufficient on harder reasoning tasks.

![Refer to caption](https://arxiv.org/html/2606.04272v1/x10.png)

Figure 11: Adding math pretraining data narrows the MATH gap. Same setup as [Figure 10](https://arxiv.org/html/2606.04272#A2.F10), but with 10B additional math-heavy tokens mixed into pretraining ($N=1$B, $D=60$B). Including task-relevant pretraining data substantially boosts $\mathcal{M}_t^{\text{RL}}$ on MATH and narrows the gap to $\mathcal{M}_t^{\text{SFT}\rightarrow\text{RL}}$, supporting pretraining data composition as the binding constraint on early-RL effectiveness.

![Refer to caption](https://arxiv.org/html/2606.04272v1/x11.png)

Figure 12: Scaling parameters does not close the MATH gap. Same setup as [Figure 10](https://arxiv.org/html/2606.04272#A2.F10), but at $N=4$B parameters with the same $D=50$B-token pretraining mix. Increasing model scale improves base-model performance, but does *not* unlock additional RL gains on MATH. In contrast to the data-scaling intervention in [Figure 11](https://arxiv.org/html/2606.04272#A2.F11), the gap to $\mathcal{M}_t^{\text{SFT}\rightarrow\text{RL}}$ persists.

### B.2 Added data for scaling $D$

We detail the source of the 10B tokens we add into training for the MATH benchmark\.

Table 5: Composition of the additional math tokens mixed into pretraining for the 1B-60B model (Section [3.3](https://arxiv.org/html/2606.04272#S3.SS3)). All sources are drawn from the math subset of the Dolma 3 Dolmino Mix (OLMo Team et al., [2025a](https://arxiv.org/html/2606.04272#bib.bib33)).

### B.3 RL training dynamics

In Fig. [13](https://arxiv.org/html/2606.04272#A2.F13), we show that for all $\mathcal{M}_t^{\text{RL}}$ (across all pretraining checkpoints $\mathcal{M}_t$), the RL training reward, validation reward (computed on a manually split subset of OpenMathInstruct), and GSM8K reward have converged. For earlier checkpoints that exhibit seed brittleness (Sec. [B.4](https://arxiv.org/html/2606.04272#A2.SS4)), we report the favorable seed here. See App. [B.4](https://arxiv.org/html/2606.04272#A2.SS4) for examples of favorable and unfavorable seeds.

![Refer to caption](https://arxiv.org/html/2606.04272v1/x12.png)

Figure 13: RL training reaches convergence at all checkpoints. Training reward, validation reward, and GSM8K test reward during RL training for $\mathcal{M}_t^{\text{RL}}$ across pretraining checkpoints $t$. All three reward metrics converge by end-of-training, confirming that performance differences between checkpoints are not artifacts of insufficient RL optimization. For checkpoints with seed brittleness ($t<10$B), we plot the favorable seed; see App. [B.4](https://arxiv.org/html/2606.04272#A2.SS4) for seed comparisons.

### B.4 Seed dependency

![Refer to caption](https://arxiv.org/html/2606.04272v1/x13.png)

Figure 14: Training reward hides RL seed brittleness on early checkpoints. A favorable seed (blue) and an unfavorable seed (red) for $\mathcal{M}_t^{\text{RL}}$ at $t=4$B tokens. Left: training reward curves are nearly identical between seeds, offering no warning of divergent test outcomes. Middle: validation reward begins to diverge mid-training, and the unfavorable seed reaches only 10%, which comes from the format reward. Right: on GSM8K, the favorable seed gains substantially on both pass@1 and pass@32, while the unfavorable seed shows minimal pass@1 gain and worsens pass@32. This brittleness resolves by $t=10$B tokens.

We visualize the outcomes in Figure [14](https://arxiv.org/html/2606.04272#A2.F14). Random seed dependency in LLM training has also been observed in Zhao et al. ([2026](https://arxiv.org/html/2606.04272#bib.bib50)) and Qin et al. ([2024](https://arxiv.org/html/2606.04272#bib.bib51)) as a potential explanation of the emergence phenomenon.

### B.5 SFT dynamics

In Fig. [15](https://arxiv.org/html/2606.04272#A2.F15), we vary the number of SFT epochs used to train $\mathcal{M}_t^{\text{SFT}}$ and confirm that the model's performance converges by 5 epochs.

![Refer to caption](https://arxiv.org/html/2606.04272v1/x14.png)

Figure 15: SFT converges by 5 epochs. GSM8K accuracy of $\mathcal{M}_t^{\text{SFT}}$ after training for different numbers of epochs on OpenMathInstruct. Performance plateaus by 5 epochs, which we use as the standard SFT training length for all $\mathcal{M}_t^{\text{SFT}}$ baselines.

### B.6 Evaluating Pretraining Checkpoints

In Fig. [16](https://arxiv.org/html/2606.04272#A2.F16), we vary the number of in-context examples ($n$-shot) used to evaluate the reasoning capabilities of pretraining checkpoints $\mathcal{M}_t$. We confirm that the base model achieves its best performance on both MATH and GSM8K with 8-shot prompting.

![Refer to caption](https://arxiv.org/html/2606.04272v1/x15.png)

![Refer to caption](https://arxiv.org/html/2606.04272v1/x16.png)

Figure 16: Base checkpoints peak at 8-shot prompting. Performance of pretraining checkpoints $\mathcal{M}_t$ on GSM8K (top) and MATH (bottom) under varying numbers of in-context examples. Across both benchmarks, accuracy is maximized at 8-shot, which we use throughout for $\mathcal{M}_t$ evaluation.

## Appendix C Full Parallel Average Results

In [Figure 17](https://arxiv.org/html/2606.04272#A3.F17), we show the full parallel-average training trajectories at each pre-training checkpoint. [Figure 9](https://arxiv.org/html/2606.04272#S5.F9) summarizes these as a single (final-RL-step) point per checkpoint.

![Refer to caption](https://arxiv.org/html/2606.04272v1/x17.png)

Figure 17: Full results for parallel average algorithm.

## Appendix D RL Rollouts

When training with RL on early pretraining checkpoints, the model is likely to have low pass@k accuracy on the training questions. Compared to the standard pipeline or a later pretraining checkpoint, applying RL at early pretraining exacerbates the reward sparsity problem. On these very early checkpoints, without sufficient positive samples (i.e., correct rollouts), the learning signal may become sparse or noisy, making it difficult for the model to improve.
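The sparsity problem can be seen directly in the group-relative advantage that GRPO assigns to each rollout. The sketch below uses the commonly described normalization (reward minus group mean, divided by group standard deviation); the function name `group_advantages`, the population standard deviation, and the `eps` constant are assumptions for illustration, and the paper's training code may differ in these details.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages for the rollouts of a single prompt.

    Assumed form: (r_i - mean) / (std + eps), with the population standard
    deviation taken over the group.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Every rollout is wrong but well formatted, so each earns the partial
# format score of 0.1: identical rewards give zero advantage everywhere.
all_wrong = group_advantages([0.1, 0.1, 0.1, 0.1])

# One correct rollout (hypothetical reward of 1.0) among three wrong ones
# yields a non-zero, signed learning signal for the whole group.
one_right = group_advantages([1.0, 0.1, 0.1, 0.1])

if __name__ == "__main__":
    print("all wrong:", all_wrong)
    print("one right:", one_right)
```

Under this normalization, a prompt contributes gradient signal only when its rollouts receive different rewards, which is why prompts the checkpoint never solves contribute little.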

A natural strategy for obtaining a stronger training signal is to sample a larger number of rollouts at each training step. In this section, we analyze this strategy and study the influence of the number of rollouts per prompt ($n$) in GRPO. We seek to determine whether increasing the number of rollouts benefits models that are initially weak on the training distribution. To this end, we partition our training set into two sets, a hard set and an easy set, simulating early and later stages of pretraining respectively. We perform RL using GRPO on both splits with few ($n=5$) and many ($n=64$) rollouts and report pass@k accuracy on the standard GSM8K test set.
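The intuition behind sampling more rollouts can be illustrated with a short calculation. Assuming, purely for illustration, that rollouts for a prompt are independent and each is correct with probability `p`, the sketch below compares how often a group of $n=5$ versus $n=64$ rollouts contains at least one correct answer, and how often it contains a mix of correct and incorrect answers (the case in which a group-relative method receives a non-zero signal). The function names and the example values of `p` are hypothetical.

```python
def p_at_least_one_correct(p: float, n: int) -> float:
    """Probability that at least one of n independent rollouts is correct."""
    return 1.0 - (1.0 - p) ** n


def p_mixed_group(p: float, n: int) -> float:
    """Probability that a group holds both correct and incorrect rollouts."""
    return 1.0 - p ** n - (1.0 - p) ** n


if __name__ == "__main__":
    for p in (0.01, 0.05, 0.5):          # hypothetical per-rollout success rates
        for n in (5, 64):                # rollout counts studied in this appendix
            print(
                f"p={p:<4} n={n:<2} "
                f"any correct={p_at_least_one_correct(p, n):.3f} "
                f"mixed group={p_mixed_group(p, n):.3f}"
            )
```

For small `p`, the larger group is far more likely to contain a usable positive sample, but it also costs proportionally more generation compute, which is the trade-off examined in the remainder of this appendix.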

### D.1 Experimental Setup

#### Data and Metrics\.

To simulate different stages of pretraining, we partition our training dataset based on the proportion of positive samples per example. The OpenMathInstruct dataset (Toshniwal et al., [2024](https://arxiv.org/html/2606.04272#bib.bib17)) is composed of questions inspired by either the MATH or GSM8K training sets (for details, see §[2.2](https://arxiv.org/html/2606.04272#S2.SS2)). We focus on only the GSM8K-like subset of OpenMathInstruct (80K examples). To define the training splits based on "difficulty" level, we evaluate our base model on the original dataset in a zero-shot setting. For each question, we generate 64 responses at temperature 1 and record the number of correct solutions. We classify questions with 16 to 64 correct responses as GSM8K-Easy, and those with at most 8 correct responses as GSM8K-Hard. From these subsets, we randomly sample 10K questions for each split. We train with GRPO (as described in §[2.2](https://arxiv.org/html/2606.04272#S2.SS2)) and report pass@k ($k \in \{1, 8\}$) metrics on the standard GSM8K test set.
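The split rule and the reported metric can be sketched as follows. The thresholds (16 to 64 correct out of 64 for GSM8K-Easy, at most 8 for GSM8K-Hard) come from the text above; the function names are hypothetical. The text does not state how pass@k is computed, so the combinatorial estimator used here, a common choice when $n$ samples are drawn per question, is an assumption.

```python
from math import comb
from typing import Optional

NUM_SAMPLES = 64  # responses generated per question at temperature 1


def classify_difficulty(num_correct: int) -> Optional[str]:
    """Assign a question to a split from its count of correct responses."""
    if 16 <= num_correct <= NUM_SAMPLES:
        return "easy"   # GSM8K-Easy
    if num_correct <= 8:
        return "hard"   # GSM8K-Hard
    return None         # 9 to 15 correct: belongs to neither split


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct.

    Assumed estimator: 1 - C(n - c, k) / C(n, k), i.e. one minus the chance
    that k samples drawn without replacement are all incorrect.
    """
    if n - c < k:
        return 1.0  # fewer than k incorrect samples, so any draw of k succeeds
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    for correct in (0, 4, 8, 12, 16, 64):
        print(
            correct,
            classify_difficulty(correct),
            round(pass_at_k(NUM_SAMPLES, correct, 1), 3),
            round(pass_at_k(NUM_SAMPLES, correct, 8), 3),
        )
```

Note that questions with 9 to 15 correct responses fall between the two thresholds and are excluded from both splits, which keeps the easy and hard sets clearly separated.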

![Refer to caption](https://arxiv.org/html/2606.04272v1/x18.png)

![Refer to caption](https://arxiv.org/html/2606.04272v1/x19.png)

Figure 18: Fewer rollouts are more FLOP-efficient at convergence. GSM8K pass@k during RL training with $n=5$ versus $n=64$ rollouts per prompt, on training sets sub-sampled to be relatively easy or hard for the base model (a proxy for late vs. early pretraining). Asymptotic performance is similar across rollout counts. However, $n=5$ achieves comparable accuracy at substantially lower FLOPs, especially on the harder split. Larger $n$ is more sample-efficient per training example, but not per FLOP.

#### Model and Method.

We conduct all experiments using the OLMo2 1B model (OLMo Team et al., [2025b](https://arxiv.org/html/2606.04272#bib.bib16)).

We perform GRPO training for both GSM8K-Easy and GSM8K-Hard using $n=5$ and $n=64$ rollouts per prompt, while keeping all other hyperparameters constant. Consequently, $n=64$ consumes significantly more FLOPs per RL step. To account for this trade-off, we analyze accuracy as a function of both total FLOPs consumed and the number of examples seen during RL training. For all settings, we train the models until the validation pass@1 metric converges.
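A back-of-the-envelope calculation shows why the two x-axes (examples versus FLOPs) can tell different stories. The sketch below is not the paper's FLOP accounting: it assumes the usual rough costs of about $2N$ FLOPs per generated token and about $6N$ FLOPs per trained token for an $N$-parameter model, assumes every rollout has the same length, and ignores attention and reference-model costs. The batch size and response length are taken from the hyperparameter tables as illustrative values, and `rl_step_flops` is a hypothetical helper.

```python
def rl_step_flops(n_params, prompts, rollouts, tokens_per_rollout):
    """Rough FLOPs for one RL step under the assumed cost model."""
    tokens = prompts * rollouts * tokens_per_rollout
    generation = 2 * n_params * tokens   # assumption: ~2N FLOPs per sampled token
    update = 6 * n_params * tokens       # assumption: ~6N FLOPs per trained token
    return generation + update


N_PARAMS = 1e9   # OLMo2 1B, rounded
PROMPTS = 512    # train batch size (illustrative)
TOKENS = 2048    # max response length, used as an upper bound per rollout

flops_n5 = rl_step_flops(N_PARAMS, PROMPTS, 5, TOKENS)
flops_n64 = rl_step_flops(N_PARAMS, PROMPTS, 64, TOKENS)
ratio = flops_n64 / flops_n5

if __name__ == "__main__":
    print(f"n=5  : {flops_n5:.3e} FLOPs per step")
    print(f"n=64 : {flops_n64:.3e} FLOPs per step")
    print(f"ratio: {ratio:.1f}x")
```

Under these assumptions the per-step cost scales linearly with $n$, so at a fixed FLOP budget the $n=5$ run processes $64/5 = 12.8$ times as many prompts as the $n=64$ run.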

### D.2 Main Results

We observe a distinct trade-off between sample efficiency and compute efficiency. As a function of samples seen, increasing the number of rollouts to $n=64$ greatly improves pass@1 convergence compared to $n=5$. However, when viewed as a function of FLOPs, the lower rollout setting ($n=5$) is more compute-efficient in the early stages of training. As training progresses toward $10^{6}$ FLOPs, this efficiency gap narrows, with $n=64$ eventually matching or surpassing the performance of $n=5$. We observe that the difference between $n=5$ and $n=64$ rollouts further diminishes when observing pass@1. However, when we match FLOPs, $n=5$ appears to significantly improve upon $n=64$, especially when training with GSM8K-Hard.

Our analysis in [Figure 18](https://arxiv.org/html/2606.04272#A4.F18) yields three primary insights regarding the scaling of RL rollouts. First, we find that asymptotic performance is largely independent of the number of rollouts; both $n=5$ and $n=64$ converge to similar pass@k peaks across difficulty levels. Second, there is a clear trade-off between sample efficiency and compute efficiency. Increasing the rollout count ($n=64$) maximizes the utility of each training example, leading to faster convergence in terms of training steps. Conversely, reducing the rollout count ($n=5$) is significantly more FLOP-efficient, achieving comparable performance with a fraction of the compute budget. Finally, this compute advantage is particularly pronounced on the GSM8K-Hard split for the pass@8 metric, suggesting that when rewards are sparse (as with early checkpoints), massive rollout scaling may yield diminishing returns per FLOP compared to processing more batches with fewer rollouts.
