When RL Fails after SFT: Rejuvenating Model Plasticity for Robust SFT-to-RL Handoff

arXiv cs.LG Papers

Summary

This paper investigates the loss of model plasticity after excessive supervised fine-tuning (SFT) in the SFT-then-RL pipeline for LLMs, and proposes Rejuvenation, a method that restores plasticity via base-anchored model fusion and targeted neuron reset, consistently improving RL performance.

arXiv:2606.09932v1 Announce Type: new Abstract: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) has become a standard pipeline for Large Language Model (LLM) post-training. SFT is expected to provide a useful behavioral prior for RL to further enhance model capabilities. However, checkpoints with excessive SFT often show limited improvement during RL. We attribute this failure to the loss of model plasticity: the reduced ability of an SFT-initialized policy to be effectively reshaped by subsequent RL. To better understand this phenomenon, we conduct detailed analysis from multiple perspectives, including parameter changes, output spaces, and RL optimization dynamics. Our results show that models from excessive SFT tend to produce over-confident token distributions and exhibit sharp parameter landscapes, which make them harder to optimize in the RL stage. To enable a more robust SFT-to-RL handoff, we propose Rejuvenation, a simple yet effective method that restores plasticity while preserving useful SFT-acquired priors. Rejuvenation combines base-anchored model fusion, which reduces excessive SFT-induced drift, with targeted neuron reset, which mitigates model rigidity. Experimental results on both math reasoning tasks and agentic tasks demonstrate that our approach consistently improves RL performance on over-trained SFT models, while also enhancing generalization to out-of-distribution tasks.

Cached at: 06/10/26, 06:18 AM

# Rejuvenating Model Plasticity for Robust SFT-to-RL Handoff
Source: [https://arxiv.org/html/2606.09932](https://arxiv.org/html/2606.09932)
Runze Liu1∗, Jiashun Liu1∗, Xu Wan2, Yuqian Fu3, Ling Pan1

1 Hong Kong University of Science and Technology; 2 Zhejiang University; 3 State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA

###### Abstract

Supervised Fine\-Tuning \(SFT\) followed by Reinforcement Learning \(RL\) has become a standard pipeline for Large Language Model \(LLM\) post\-training\. SFT is expected to provide a useful behavioral prior for RL to further enhance model capabilities\. However, checkpoints with excessive SFT often show limited improvement during RL\. We attribute this failure to the loss of model plasticity: the reduced ability of an SFT\-initialized policy to be effectively reshaped by subsequent RL\. To better understand this phenomenon, we conduct detailed analysis from multiple perspectives, including parameter changes, output spaces, and RL optimization dynamics\. Our results show that models from excessive SFT tend to produce over\-confident token distributions and exhibit sharp parameter landscapes, which make them harder to optimize in the RL stage\. To enable a more robust SFT\-to\-RL handoff, we propose Rejuvenation, a simple yet effective method that restores plasticity while preserving useful SFT\-acquired priors\. Rejuvenation combines base\-anchored model fusion, which reduces excessive SFT\-induced drift, with targeted neuron reset, which mitigates model rigidity\. Experimental results on both math reasoning tasks and agentic tasks demonstrate that our approach consistently improves RL performance on over\-trained SFT models, while also enhancing generalization to out\-of\-distribution tasks\.

∗Equal contribution

## 1 Introduction

Post\-training has emerged as a critical phase for unlocking the reasoning and agentic capabilities of large language models \(LLMs\) \(OpenAI, [2024](https://arxiv.org/html/2606.09932#bib.bib56); Guo et al\., [2025](https://arxiv.org/html/2606.09932#bib.bib60)\)\. Many practical systems adopt the SFT\-then\-RL \(Guo et al\., [2025](https://arxiv.org/html/2606.09932#bib.bib60)\) pipeline, where supervised fine\-tuning \(SFT\) \(Ouyang et al\., [2022](https://arxiv.org/html/2606.09932#bib.bib136); Bai et al\., [2022](https://arxiv.org/html/2606.09932#bib.bib137)\) first teaches models to follow instructions, produce the desired format, and acquire cold\-start knowledge, after which reinforcement learning \(RL\) \(Shao et al\., [2024](https://arxiv.org/html/2606.09932#bib.bib59); Zhang et al\., [2025a](https://arxiv.org/html/2606.09932#bib.bib109)\) further optimizes the policy according to task rewards \(Yue et al\., [2025](https://arxiv.org/html/2606.09932#bib.bib93); Liu et al\., [2025a](https://arxiv.org/html/2606.09932#bib.bib98); Wang et al\., [2025a](https://arxiv.org/html/2606.09932#bib.bib105), [b](https://arxiv.org/html/2606.09932#bib.bib104); Liu et al\., [2025c](https://arxiv.org/html/2606.09932#bib.bib110); Zhang et al\., [2026b](https://arxiv.org/html/2606.09932#bib.bib108)\)\. In this pipeline, SFT does not merely imitate high\-quality solutions; it also determines the initialization that RL inherits\. The pipeline therefore relies on an implicit assumption: after acquiring useful behaviors through imitation, the SFT checkpoint should still serve as a suitable starting point for reward\-driven optimization\.

![Refer to caption](https://arxiv.org/html/2606.09932v1/x1.png) Figure 1: Overview of Rejuvenation\.

However, the amount of SFT needed for strong supervised behavior may not coincide with the amount of SFT that yields the best final model after RL\. Intuitively, if SFT is stopped too early, the model may be under\-prepared for RL, lacking the instruction\-following patterns or task\-specific behaviors needed for efficient optimization\. Meanwhile, if SFT is continued for too long, the model may become overly specialized to the supervised data, and the resulting checkpoint may leave limited room for RL to further improve the policy and generalize \(Chu et al\., [2025](https://arxiv.org/html/2606.09932#bib.bib11)\)\. Thus, despite its widespread success, the handoff from SFT to RL remains fragile and computationally expensive\. A critical yet often overlooked challenge is to determine which SFT checkpoint should be used as the RL initial policy\.

Standard SFT metrics \(e\.g\., training loss, validation accuracy\) measure imitation quality, not the checkpoint’s capacity for reward\-driven improvement\. A natural remedy is early stopping \(Li et al\., [2026](https://arxiv.org/html/2606.09932#bib.bib21)\), but it still assumes that the appropriate stopping point can be reliably identified, and therefore does not resolve the core problem\. Recent work improves the SFT stage through better data \(Huang et al\., [2026](https://arxiv.org/html/2606.09932#bib.bib3); Zhao et al\., [2026](https://arxiv.org/html/2606.09932#bib.bib49)\), objectives \(Fu et al\., [2026](https://arxiv.org/html/2606.09932#bib.bib29); Zhu et al\., [2026b](https://arxiv.org/html/2606.09932#bib.bib33)\), or training recipes \(Zhang et al\., [2025b](https://arxiv.org/html/2606.09932#bib.bib48)\), but these methods primarily change the SFT trajectory rather than provide a reliable criterion for RL readiness\. As a result, practitioners still often have to launch RL from multiple SFT checkpoints to determine the appropriate handoff point\. This search is expensive, and its outcome can be sensitive to RL hyperparameters and optimization noise\.

In this paper, we analyze and address this dilemma in the SFT\-to\-RL handoff: stopping SFT too early cuts imitation learning short and prevents the model from acquiring useful skills, while an over\-trained SFT model becomes highly resistant to further improvement via RL\. To better understand this issue, we conduct a detailed analysis of how extended SFT changes both model behavior and subsequent optimization dynamics under RL\. We find that OverSFT models tend to produce sharper, more over\-confident output distributions and to occupy a less smooth region of parameter space; during RL they show large gradient norms but limited performance gains and smaller parameter updates than ModSFT models\. These findings suggest that excessive SFT does not merely overfit the supervised data, but also makes the policy rigid and less adaptable in the subsequent RL stage\. We identify this failure mode as a loss of model plasticity¹ \(Dohare et al\., [2024](https://arxiv.org/html/2606.09932#bib.bib1); Han et al\., [2026](https://arxiv.org/html/2606.09932#bib.bib2)\): the model becomes difficult to reshape through RL\.

¹ We use model plasticity to refer to the ability of SFT models to undergo reward\-driven improvement in subsequent RL\. A plastic model should remain responsive to RL updates, and such updates should translate effectively into task performance gains\.

Based on this observation, we propose Rejuvenation, a simple yet effective post\-hoc mechanism that recovers model plasticity and enables a robust SFT\-to\-RL handoff while avoiding complex SFT loss modifications and costly checkpoint searches\. Our key insight is that SFT should provide a useful behavioral prior, but not at the cost of losing plasticity\. Our approach is a dual\-level strategy\. At the global level, we use base\-anchored model fusion to reduce excessive SFT\-induced drift while retaining useful behavior\. At the local level, we introduce a targeted neuron reset mechanism based on logit attribution to selectively restore diversity in the over\-confident neurons most responsible for collapsed predictions\. As shown in Figure [1](https://arxiv.org/html/2606.09932#S1.F1), our method effectively mitigates the rigidity induced by excessive SFT, while preserving the effective behavioral prior acquired from sufficient SFT\. We evaluate Rejuvenation on both mathematical reasoning tasks and agentic tasks\. Experiments show that our method not only consistently recovers RL improvement from OverSFT models, but also outperforms ModSFT models on out\-of\-distribution \(OOD\) tasks, indicating better generalization\.

The main contributions of this work can be summarized as follows:

1. We identify a failure mode in the SFT\-then\-RL pipeline: the SFT\-to\-RL handoff dilemma, where fully\-trained SFT models lose plasticity and limit RL improvement\.
2. We provide detailed analysis from multiple perspectives, revealing that over\-SFT reduces effective gradients during RL, which leads to entropy collapse and fundamentally hurts RL optimization dynamics\.
3. We propose a simple, cheap, and effective rejuvenation method to recover model plasticity post\-hoc using model fusion and neuron reset, making it robust to different SFT\-to\-RL handoffs\.
4. We demonstrate the effectiveness of our method on both mathematical and agentic tasks, showing that it consistently recovers RL improvement from OverSFT checkpoints and improves OOD generalization over ModSFT baselines\.

## 2 Related Work

##### SFT and RL in LLM Post\-Training\.

Recent work has shown that RL has emerged as an effective method for LLM post\-training \(OpenAI, [2024](https://arxiv.org/html/2606.09932#bib.bib56); Guo et al\., [2025](https://arxiv.org/html/2606.09932#bib.bib60); Shao et al\., [2024](https://arxiv.org/html/2606.09932#bib.bib59); Yu et al\., [2025](https://arxiv.org/html/2606.09932#bib.bib87)\)\. Many methods aim to better integrate SFT and RL, such as improving the use of off\-policy data \(Yan et al\., [2025](https://arxiv.org/html/2606.09932#bib.bib24); Chen et al\., [2025](https://arxiv.org/html/2606.09932#bib.bib25); Liu et al\., [2025d](https://arxiv.org/html/2606.09932#bib.bib27); Ma et al\., [2026](https://arxiv.org/html/2606.09932#bib.bib28); Huang et al\., [2025](https://arxiv.org/html/2606.09932#bib.bib30)\), designing unified training objectives \(Liu et al\., [2025b](https://arxiv.org/html/2606.09932#bib.bib26); Fu et al\., [2026](https://arxiv.org/html/2606.09932#bib.bib29); Zhang et al\., [2026c](https://arxiv.org/html/2606.09932#bib.bib32); Lv et al\., [2025](https://arxiv.org/html/2606.09932#bib.bib34); Gan et al\., [2026](https://arxiv.org/html/2606.09932#bib.bib35)\), or using importance\-weighted objectives to better align SFT with RL optimization \(Zhu et al\., [2026b](https://arxiv.org/html/2606.09932#bib.bib33); Qin and Springenberg, [2025](https://arxiv.org/html/2606.09932#bib.bib18); Zhang et al\., [2026a](https://arxiv.org/html/2606.09932#bib.bib15)\)\. At the same time, recent evidence suggests that comparisons between mixed\-policy methods and the standard SFT\-then\-RL pipeline can be sensitive to SFT implementation details, and that carefully controlled SFT\-then\-RL remains a strong baseline \(Limozin et al\., [2026](https://arxiv.org/html/2606.09932#bib.bib36)\)\.
Several recent studies further analyze why SFT and RL lead to different generalization behavior: SFT tends to memorize supervised data while RL can improve out\-of\-distribution generalization \(Chu et al\., [2025](https://arxiv.org/html/2606.09932#bib.bib11)\), RL may partially heal OOD forgetting introduced by SFT but only within a suitable checkpoint range \(Jin et al\., [2025b](https://arxiv.org/html/2606.09932#bib.bib12), [a](https://arxiv.org/html/2606.09932#bib.bib13)\), and high SFT scores are not necessarily reliable predictors of post\-RL performance \(Kang et al\., [2026](https://arxiv.org/html/2606.09932#bib.bib14); Zhang et al\., [2026a](https://arxiv.org/html/2606.09932#bib.bib15)\)\. These works expose the fragility of the SFT\-to\-RL handoff, but they mainly diagnose checkpoint selection or redesign the SFT objective\. In contrast, we ask whether the plasticity of an already over\-trained SFT model can be restored post\-hoc before RL starts\.

##### Overfitting and Regularization in SFT\.

Recently, many works have explored how to prevent overfitting or excessive policy drift during SFT\. Methods such as GEM \(Li et al\., [2025](https://arxiv.org/html/2606.09932#bib.bib17)\), PSFT \(Zhu et al\., [2026b](https://arxiv.org/html/2606.09932#bib.bib33)\), ASFT \(Zhu et al\., [2026a](https://arxiv.org/html/2606.09932#bib.bib20)\), and CurioSFT \(Wang et al\., [2026](https://arxiv.org/html/2606.09932#bib.bib22)\) introduce an auxiliary regularization loss to maintain model diversity\. DFT \(Wu et al\., [2026](https://arxiv.org/html/2606.09932#bib.bib19)\), AESL \(Li et al\., [2026](https://arxiv.org/html/2606.09932#bib.bib21)\), and ProFit \(Liu et al\., [2026](https://arxiv.org/html/2606.09932#bib.bib23)\) incorporate probability\-based weighting in the cross\-entropy loss\. However, they mainly focus on preventing overfitting during SFT or designing a better SFT objective from the beginning\. Our setting is different in that we assume an over\-trained SFT model has already been obtained, and ask whether its plasticity can be restored for subsequent RL\.

## 3 Diagnosing and Restoring Plasticity in Over\-Trained Models

In this section, we analyze why over\-trained SFT models become difficult to improve with RL and then introduce two post\-hoc recovery operations\. We first investigate what excessive SFT changes in both parameter and output spaces before any RL update is applied, to answer the handoff\-specific question: *does continued SFT move the checkpoint into a state that is less amenable to subsequent RL optimization?* We then connect these changes to poor RL trainability, where large gradient norms do not translate into effective parameter movement or meaningful performance gains \(Section [3\.2](https://arxiv.org/html/2606.09932#S3.SS2)\)\. Motivated by these observations, we recover plasticity at two levels: base\-anchored model fusion globally pulls the model toward a smoother region \(Section [3\.3](https://arxiv.org/html/2606.09932#S3.SS3)\), while attribution\-guided neuron reset locally restores the high\-contribution directions responsible for abnormal logits \(Section [3\.4](https://arxiv.org/html/2606.09932#S3.SS4)\)\.

### 3\.1 How Does Over\-Training Change the SFT Model?

#### 3\.1\.1 Parameter Space

We train EvoLM\-4B \(Qi et al\., [2025](https://arxiv.org/html/2606.09932#bib.bib37)\) on the math SFT data and save checkpoints along the SFT process\. We denote the moderately trained checkpoint \(epoch=2\) as ModSFT and the over\-trained checkpoint \(epoch=32\) as OverSFT\. More training details are provided in Section [4\.1](https://arxiv.org/html/2606.09932#S4.SS1) and Appendix [B](https://arxiv.org/html/2606.09932#A2)\.

![Refer to caption](https://arxiv.org/html/2606.09932v1/x2.png) Figure 2: Parameter changes and statistics of different SFT checkpoints\.

##### Excessive SFT induces large parameter shifts and weight magnitude\.

We first examine how SFT changes the model parameters\. For each checkpoint, we visualize the element\-wise parameter difference with respect to the base model\. As shown in Figure [2](https://arxiv.org/html/2606.09932#S3.F2) and the first row of Figure [3](https://arxiv.org/html/2606.09932#S3.F3), ModSFT introduces only moderate and relatively smooth parameter changes, while OverSFT leads to extremely large parameter shifts with sharp spikes, resulting in larger weight magnitude\. These spikes indicate that over\-training does not simply continue improving the same solution found by moderate SFT\. Instead, it drives a small subset of parameters far away from the base model\. These observations are consistent across all modules and layers\. See Appendix [C](https://arxiv.org/html/2606.09932#A3) for more visualizations\.

![Refer to caption](https://arxiv.org/html/2606.09932v1/x3.png) Figure 3: Parameter changes of layers\.0\.self\_attn\.v\_proj\.weight induced by SFT and RL\.
##### Subsequent RL induces much smaller movement than the preceding SFT\-induced drift\.

We further examine how RL changes the model parameters by plotting the difference between: \(1\) RL vs SFT and \(2\) RL vs Base\. Figure [3](https://arxiv.org/html/2606.09932#S3.F3) shows that the subsequent RL stage introduces much smaller changes than the SFT\-induced shift\. This suggests that once SFT has caused relatively high parameter distortion, RL is unlikely to reverse it through standard policy optimization\.

#### 3\.1\.2 Output Space

##### Excessive SFT saturates the policy before RL\.

We compare the output diversity of different SFT checkpoints in Table [1](https://arxiv.org/html/2606.09932#S3.T1)\. OverSFT achieves near\-zero training loss but exhibits substantially lower token entropy than ModSFT \(0\.024 vs\. 0\.184\), suggesting that excessive SFT drives the policy toward an over\-confident regime\. Although OverSFT attains a higher Pass@1 score than ModSFT, its Pass@K\-Pass@1 gap is the smallest, indicating reduced output diversity despite stronger greedy performance\. Consistently, Figure [6](https://arxiv.org/html/2606.09932#A3.F6) shows logit distributions where OverSFT concentrates probability mass almost entirely on a single token \(Li et al\., [2025](https://arxiv.org/html/2606.09932#bib.bib17)\), whereas ModSFT still assigns non\-negligible probability to alternative tokens\.

Table 1: Diversity\-related metrics of different SFT checkpoints\.

| Checkpoint | Training Loss | Entropy | MATH\-500 Pass@1 | Pass@K \- Pass@1 Gap |
| --- | --- | --- | --- | --- |
| UnderSFT | 0\.178 | 0\.281 | 9\.7 | 13\.7 |
| ModSFT | 0\.111 | 0\.184 | 15\.2 | 15\.3 |
| OverSFT | 0\.002 | 0\.024 | 15\.8 | 11\.2 |
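The two diversity diagnostics reported in Table 1 can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: the function names are hypothetical, entropy is measured in nats, and Pass@1 is taken here as the mean correctness over sampled completions, which is an assumption since the text does not spell out the exact protocol.

```python
import math

def token_entropy(logits):
    """Shannon entropy (nats) of softmax(logits) at one token position."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_token_entropy(per_position_logits):
    """Average entropy over all positions of a response."""
    entropies = [token_entropy(l) for l in per_position_logits]
    return sum(entropies) / len(entropies)

def pass_at_k_gap(correct):
    """correct[i][j] is True when sample j of problem i is correct.

    Assumption: Pass@1 is the mean correctness over samples and Pass@K is the
    fraction of problems with at least one correct sample (K = samples per problem).
    """
    n = len(correct)
    pass_1 = sum(sum(row) / len(row) for row in correct) / n
    pass_k = sum(1.0 for row in correct if any(row)) / n
    return pass_k - pass_1
```

A sharply peaked distribution gives an entropy close to zero, while a uniform one gives `log(vocab_size)`; a small Pass@K - Pass@1 gap means that extra samples rarely find solutions the first sample missed.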

Takeaway: Over\-trained SFT models differ from moderately trained models in both parameter and output spaces: they contain large parameter shifts and produce over\-confident token distributions\. These two signatures indicate reduced plasticity before RL starts\.

### 3\.2 Does RL Fail after Over\-Trained SFT?

#### 3\.2\.1 Training Dynamics

![Refer to caption](https://arxiv.org/html/2606.09932v1/x4.png) Figure 4: Average gradient norm and weight update magnitude of RL with different SFT checkpoints\.

##### Over\-trained models exhibit larger RL gradient norms\.

We next study whether these SFT\-induced changes affect RL optimization\. Figure [4](https://arxiv.org/html/2606.09932#S3.F4) shows an apparent paradox: OverSFT has a much larger gradient norm than ModSFT throughout RL, yet its RL\-induced parameter shift is smaller \(Figure [3](https://arxiv.org/html/2606.09932#S3.F3), RL\-vs\-SFT row\) and its reward improvement is smaller as well\.
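The two quantities contrasted above can be tracked with a small helper. The sketch below is one plausible definition, assuming a global L2 norm over all parameters and checkpoints represented as dictionaries of flat float lists; the text does not state exactly how the curves in Figure 4 are computed, and the function names are hypothetical.

```python
import math

def global_l2_norm(tensors):
    """L2 norm over every entry of a dict mapping name -> flat list of floats."""
    return math.sqrt(sum(v * v for values in tensors.values() for v in values))

def update_magnitude(before, after):
    """L2 norm of the parameter change between two snapshots with the same keys."""
    delta = {
        name: [a - b for a, b in zip(after[name], before[name])]
        for name in before
    }
    return global_l2_norm(delta)
```

Logging `global_l2_norm(grads)` at each step and `update_magnitude(sft_checkpoint, rl_checkpoint)` at each evaluation point makes the mismatch visible: a run can report large gradients while the accumulated parameter change stays small.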

#### 3\.2\.2 RL Performance Gain

##### Over\-training reduces RL performance gain\.

Figure [5](https://arxiv.org/html/2606.09932#S3.F5) shows that after excessive SFT training, the RL gain of OverSFT models drops significantly compared with ModSFT models, demonstrating that over\-training harms the RL performance gain\.

![Refer to caption](https://arxiv.org/html/2606.09932v1/x5.png) Figure 5: Evaluation results of RL\-trained models from different SFT checkpoints\. \(a\) Average score of ID and OOD tasks\. \(b\) Average score on ID tasks\. \(c\) Average score on OOD tasks\.

Takeaways

- Over\-trained models are hard to optimize during RL, exhibiting higher gradient norms but smaller weight updates\.
- Over\-training leads to smaller performance gains after RL on both ID and OOD tasks\.

### 3\.3 Global Recovery via Base\-Anchored Fusion

Section [3\.1](https://arxiv.org/html/2606.09932#S3.SS1) shows that over\-training pushes a small set of parameters far from the base model and places the model in a sharp region\. A natural fix is to pull the over\-trained model back toward the base so that the shift shrinks and the surrounding landscape becomes smoother, while still keeping the useful behavior that SFT has acquired \(Wortsman et al\., [2022](https://arxiv.org/html/2606.09932#bib.bib4); Ilharco et al\., [2023](https://arxiv.org/html/2606.09932#bib.bib5)\)\.

##### Base\-anchored linear fusion\.

We therefore perform a simple element\-wise linear interpolation between the base model and the OverSFT model:

$$\theta_{\mathrm{fuse}} = \alpha\,\theta_{\texttt{OverSFT}} + (1-\alpha)\,\theta_{\texttt{Base}}, \qquad \alpha \in [0,1]. \tag{1}$$

The interpolation is applied to all model parameters, including attention and MLP weights as well as RMSNorm weights, the final norm, and lm\_head\. The fusion weight $\alpha$ controls how much of OverSFT is retained: $\alpha \to 1$ reduces to the original OverSFT model, and $\alpha \to 0$ falls back to the base model\.

##### Why fusion helps\.

This operation scales the SFT\-induced weight delta $\theta_{\texttt{OverSFT}} - \theta_{\texttt{Base}}$ by $\alpha$\. From the task\-vector perspective of model editing and composition \(Ilharco et al\., [2023](https://arxiv.org/html/2606.09932#bib.bib5)\), the SFT update $\theta_{\texttt{OverSFT}} - \theta_{\texttt{Base}}$ can be viewed as a task vector that moves the base model toward the SFT solution\. Base\-anchored fusion preserves the direction of this task vector while reducing its magnitude, thereby retaining much of the task\-specific behavior learned during SFT but avoiding the full parameter displacement caused by over\-SFT\. Additionally, fusion shrinks the element\-wise deviation from the base to $\alpha \cdot (\theta_{\texttt{OverSFT}} - \theta_{\texttt{Base}})$, which reduces the large shifts and sharp spikes in Figure [3](https://arxiv.org/html/2606.09932#S3.F3)\. The fused model therefore sits in a smoother region that is easier for RL to optimize\.
In practice, a moderate $\alpha$ \(e\.g\., $\alpha = 0.5$\) recovers most of the training\-dynamics signals and token entropy while preserving much of the ID performance gained from SFT\.
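Base-anchored fusion is a per-parameter linear interpolation, so it fits in a few lines. The sketch below assumes both checkpoints are available as dictionaries with identical keys that map parameter names to flat lists of floats; the function name `fuse_checkpoints` and this representation are illustrative choices, not the authors' implementation.

```python
def fuse_checkpoints(base, over_sft, alpha):
    """Return alpha * over_sft + (1 - alpha) * base for every parameter."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if base.keys() != over_sft.keys():
        raise KeyError("checkpoints must contain the same parameter names")
    return {
        name: [alpha * o + (1.0 - alpha) * b
               for o, b in zip(over_sft[name], base[name])]
        for name in base
    }

# Toy example: one weight tensor flattened to two entries.
base = {"w": [1.0, -2.0]}
over = {"w": [3.0, 6.0]}
fused = fuse_checkpoints(base, over, alpha=0.5)
```

Because the same `alpha` multiplies every entry, the deviation of `fused` from `base` is exactly `alpha` times the deviation of `over` from `base`, which is the task-vector scaling described above.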

### 3\.4 Local Recovery via Attribution\-Guided Neuron Reset

Global fusion treats every parameter equally and cannot pinpoint which parameters actually drive the over\-confident behavior\. In particular, only a small number of neurons account for most of the over\-confident logits, and scaling all parameters uniformly toward the base cannot selectively relax these neurons without weakening the rest\. We therefore add a targeted step that first locates the neurons most responsible for the collapsed token distributions via direct logit attribution, a residual\-stream decomposition technique commonly used in Transformer circuit analysis \(Elhage et al\., [2021](https://arxiv.org/html/2606.09932#bib.bib6)\), and then resets these neurons in $\theta_{\mathrm{fuse}}$ back to their base\-model values\.

#### 3\.4\.1 Identifying abnormal logit contributors

##### Step 1: Selecting over\-confident target tokens\.

Over\-confidence concentrates on a subset of response tokens rather than being spread uniformly, so we first find where the effect is strongest and then attribute the logits only at those positions\. Given a calibration set of prompt–response pairs, we run both the base model and OverSFT in teacher\-forcing mode and record the per\-position token entropy $H_{\texttt{Base}}(t)$ and $H_{\texttt{OverSFT}}(t)$\. For each sample, we pick the Top\-$N$ response positions with the largest entropy gap $|H_{\texttt{OverSFT}}(t) - H_{\texttt{Base}}(t)|$ as the target tokens $\mathcal{T}$\. These are the positions where OverSFT has most aggressively sharpened the distribution relative to the base model, making them the most informative anchors for attribution\.
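The position-selection step above reduces to a top-N search over absolute entropy gaps. The following sketch assumes the per-position entropies of both models have already been computed for one sample; `select_target_positions` is a hypothetical name.

```python
def select_target_positions(h_base, h_over, top_n):
    """Indices of the top_n response positions with the largest |H_over - H_base|."""
    if len(h_base) != len(h_over):
        raise ValueError("entropy sequences must have the same length")
    gaps = [abs(o - b) for o, b in zip(h_over, h_base)]
    ranked = sorted(range(len(gaps)), key=lambda t: gaps[t], reverse=True)
    return sorted(ranked[:top_n])  # return positions in sequence order

# Toy sample with four response positions.
h_base = [1.0, 0.5, 2.0, 0.3]
h_over = [0.9, 0.1, 0.2, 0.3]
targets = select_target_positions(h_base, h_over, top_n=2)
```

Running this per sample and collecting the `(sample, position)` pairs gives the target set used for attribution in the next step.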

##### Step 2: Decomposing logits into per\-neuron contributions\.

Under a frozen final\-RMSNorm approximation, the target logit on OverSFT can be written as a sum of residual\-stream contributions, projected onto a fixed direction that depends on the target token $y^*$:

$$\begin{aligned} \tilde{z}_{y^*}(t) &= b_{y^*} + \langle e(t), d_{y^*}(t)\rangle + \sum_{\ell}\Big[\langle r^{\mathrm{attn}}_{\ell}(t), d_{y^*}(t)\rangle + \langle r^{\mathrm{mlp}}_{\ell}(t), d_{y^*}(t)\rangle\Big], \\ d_{y^*}(t) &= s_{\mathrm{RMS}}(t)\cdot\big(\gamma \odot W_{U}[y^*]\big), \end{aligned} \tag{2}$$

where $b_{y^*}$ is the optional lm\_head bias for $y^*$², $e(t)$ is the token\-embedding write, $r^{\mathrm{attn}}_{\ell}(t)$ and $r^{\mathrm{mlp}}_{\ell}(t)$ are the residual\-stream writes from the attention and MLP blocks at layer $\ell$, $\gamma$ is the gain of the final RMSNorm, $s_{\mathrm{RMS}}(t)$ is its normalization scalar at position $t$, and $W_{U}[y^*]$ is the unembedding row of $y^*$\. This decomposition becomes exact only because we freeze $s_{\mathrm{RMS}}(t)$ at its forward\-pass value and replace the nonlinear RMSNorm with the linear projection $d_{y^*}(t)$\. In our implementation we therefore track the reconstruction error $|\tilde{z}_{y^*}(t) - z_{y^*}(t)|$ to confirm that this approximation is tight\.
We use the resulting scores as attribution signals for ranking rather than as a complete causal explanation, since direct logit attribution can be misleading when later components erase or overwrite earlier residual\-stream directions \(Janiak et al\., [2024](https://arxiv.org/html/2606.09932#bib.bib10)\)\.

² $b_{y^*}$ denotes the lm\_head bias for token $y^*$ if present \(e\.g\., models without tied embeddings\); for models with bias\-free lm\_head \(as in Llama \(Grattafiori et al\., [2024](https://arxiv.org/html/2606.09932#bib.bib66)\) and Qwen \(Yang et al\., [2024](https://arxiv.org/html/2606.09932#bib.bib61), [2025](https://arxiv.org/html/2606.09932#bib.bib64)\)\), this term vanishes\.
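The frozen-RMSNorm decomposition can be checked numerically on a toy residual stream. In the sketch below, `writes` is a list of residual-stream write vectors (embedding, attention and MLP outputs), and the final hidden state is their sum. All names and the `eps` value are assumptions made for illustration; a real model would supply these tensors from forward hooks.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rms_scale(x, eps=1e-6):
    """Normalization scalar of RMSNorm: 1 / sqrt(mean(x^2) + eps)."""
    return 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)

def true_logit(writes, gamma, w_u_row, bias=0.0):
    """Logit from the ordinary forward pass: RMSNorm, then unembedding row."""
    x = [sum(col) for col in zip(*writes)]  # residual stream = sum of writes
    s = rms_scale(x)
    normed = [s * g * v for g, v in zip(gamma, x)]
    return bias + dot(normed, w_u_row)

def attributed_logit(writes, gamma, w_u_row, bias=0.0):
    """Logit rebuilt as a sum of per-write contributions along a frozen direction."""
    x = [sum(col) for col in zip(*writes)]
    s = rms_scale(x)  # frozen at its forward-pass value
    direction = [s * g * w for g, w in zip(gamma, w_u_row)]
    contributions = [dot(r, direction) for r in writes]
    return bias + sum(contributions), contributions
```

The gap between `true_logit` and the first return value of `attributed_logit` plays the role of the reconstruction error mentioned above; with the scale frozen at the forward-pass value it should be at floating-point level.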

For the modules that directly write to the residual stream, namely o\_proj and down\_proj, Equation [2](https://arxiv.org/html/2606.09932#S3.E2) can be further pushed down to the input\-neuron level, giving an exact per\-neuron contribution

$$s^{(\ell,m)}_{i} = a^{(\ell,m)}_{i}(t)\cdot\big\langle W^{(\ell,m)}_{:,i},\; d_{y^*}(t)\big\rangle, \qquad m \in \{\texttt{o\_proj},\, \texttt{down\_proj}\}, \tag{3}$$

where $a^{(\ell,m)}_{i}(t)$ is the neuron activation at position $t$ and $W^{(\ell,m)}_{:,i}$ is the corresponding column of the writer weight\.
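The per-neuron score for a writer module is the activation times the projection of the matching weight column onto the logit direction. A minimal sketch, assuming the weight is stored as a list of rows with shape `d_model x d_in` (so column `i` belongs to input neuron `i`); `per_neuron_scores` is a hypothetical helper.

```python
def per_neuron_scores(weight, activation, direction):
    """s_i = a_i * <W[:, i], d> for a module that writes W @ a to the residual stream."""
    scores = []
    for i, a_i in enumerate(activation):
        column = [row[i] for row in weight]
        scores.append(a_i * sum(c * d for c, d in zip(column, direction)))
    return scores

# Toy writer: d_model = 2, d_in = 3.
weight = [[1.0, 2.0, 0.0],
          [0.0, -1.0, 3.0]]
activation = [2.0, 1.0, -1.0]
direction = [0.5, 2.0]
scores = per_neuron_scores(weight, activation, direction)
```

The scores add up to the block-level contribution, since the write is `W @ a` and the inner product is linear. This is why the decomposition is exact for these modules.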

For the remaining internal projections \(q\_proj, k\_proj, v\_proj, up\_proj, gate\_proj\), whose outputs do not directly write to the residual stream, no such exact neuron\-level decomposition is available\. We therefore score them with a local “gradient $\times$ activation” proxy on the same target logit $z_{y^*}(t)$\. This proxy can be viewed as a first\-order Taylor approximation to the change in the target logit induced by perturbing or removing the corresponding activation, and has been commonly used as a gradient\-based importance estimate for neural units and Transformer components \(Molchanov et al\., [2017](https://arxiv.org/html/2606.09932#bib.bib7); Michel et al\., [2019](https://arxiv.org/html/2606.09932#bib.bib8)\)\. Specifically, for k\_proj and v\_proj the score is aggregated over all prefix positions $0, \dots, t$, since these projections are read by attention at every later step\. For q\_proj, up\_proj, and gate\_proj, it is taken at the target position only\. For q\_proj and k\_proj, the implementation uses pre\-RoPE projection outputs, so these scores should be read as local attribution proxies rather than exact causal decompositions\. This gives a uniform per\-neuron score $s^{(\ell,m)}_{i}$ across all linear modules of the network, which we use for ranking in Section [3\.4](https://arxiv.org/html/2606.09932#S3.SS4)\. The visualization of logit attribution is shown in Figure [7](https://arxiv.org/html/2606.09932#A3.F7) in Appendix [C](https://arxiv.org/html/2606.09932#A3)\.
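The gradient-times-activation proxy can be illustrated without an autodiff library by estimating the gradient with central finite differences. This is a stand-in chosen to keep the sketch dependency-free: a practical implementation would obtain the gradient from backpropagation. `logit_fn` is a hypothetical function that maps an activation vector to the target logit.

```python
def grad_times_activation(logit_fn, activation, h=1e-5):
    """Score each unit by (d logit / d a_i) * a_i, with a finite-difference gradient."""
    scores = []
    for i, a_i in enumerate(activation):
        plus = list(activation)
        minus = list(activation)
        plus[i] = a_i + h
        minus[i] = a_i - h
        grad_i = (logit_fn(plus) - logit_fn(minus)) / (2.0 * h)
        scores.append(grad_i * a_i)
    return scores

def linear_logit(a):
    """Toy readout: the logit depends linearly on two activations."""
    return 3.0 * a[0] - 2.0 * a[1]

scores = grad_times_activation(linear_logit, [2.0, 5.0])
```

For a linear readout the score equals the exact change in the logit when the unit is zeroed out; for nonlinear paths it is only the first-order term of that change, which is why the text treats it as a ranking proxy.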

#### 3\.4\.2Resetting high\-attribution neurons

##### Building the reset set\.

For each selected target tokentt, we rank all neurons across all replaceable modules by their most positively contributing to the target logit and keep the topρ\\rhofraction, yielding a token\-level set𝒮x,t\\mathcal\{S\}\_\{x,t\}for each promptxx\. The final reset set is the union over all selected tokens and calibration examples:

$$
\mathcal{S}=\bigcup_{(x,t)\in\mathcal{T}}\mathcal{S}_{x,t}. \tag{4}
$$

In our implementation, $\rho$ controls the per-token selection budget; we report the resulting effective whole-model reset ratio after taking the union. We introduce a token-level gating variable $\omega^{(i)}=\mathbb{I}(i\in\mathcal{S})\in\{0,1\}$ based on the above definition.
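The construction of the reset set can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names, the flat indexing of neurons, the rounding of the top-$\rho$ budget with a ceiling, and the toy score values are all assumptions.

```python
import numpy as np

def top_rho_set(scores, rho):
    """Indices of the top-rho fraction of neurons, ranked by most positive score."""
    k = max(1, int(np.ceil(rho * scores.size)))
    return set(np.argsort(-scores)[:k].tolist())

def build_reset_set(score_table, rho):
    """Union of the per-token sets S_{x,t}; score_table maps (x, t) to a score vector."""
    reset = set()
    for scores in score_table.values():
        reset |= top_rho_set(scores, rho)
    return reset

# Two hypothetical (prompt, token position) pairs over 10 neurons.
score_table = {
    ("prompt_a", 3): np.array([0.9, -0.2, 0.1, 0.4, 0.0, -1.0, 0.3, 0.2, 0.05, 0.6]),
    ("prompt_b", 7): np.array([0.0, 0.8, 0.1, 0.7, -0.3, 0.2, 0.1, 0.0, 0.05, 0.6]),
}
n_neurons = 10

S = build_reset_set(score_table, rho=0.2)
# Gating variable omega^{(i)} = 1 if neuron i is in S, else 0.
omega = np.isin(np.arange(n_neurons), sorted(S)).astype(int)
# The effective reset ratio after the union can exceed the per-token budget rho.
effective_ratio = len(S) / n_neurons
```

In this toy case each token selects 2 of 10 neurons, but the two selections do not overlap, so the union covers 4 neurons. This is the distinction the text draws between the per-token budget $\rho$ and the effective whole-model reset ratio.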

##### Reset operation\.

Inspired by attribution-based neuron intervention in pretrained Transformers (Dai et al., [2022](https://arxiv.org/html/2606.09932#bib.bib9)), we then overwrite each selected neuron in $\theta_{\mathrm{fuse}}$ with its base-model value while leaving the rest untouched:

$$
\theta^{(i)}_{\mathrm{reset}}=\omega^{(i)}\,\theta^{(i)}_{\texttt{Base}}+(1-\omega^{(i)})\,\theta^{(i)}_{\mathrm{fuse}} \tag{5}
$$

where each neuron index corresponds to either a row or a column of the underlying weight matrix, depending on the module's role in the residual stream: rows for q\_proj, k\_proj, v\_proj, up\_proj, gate\_proj, and columns for o\_proj and down\_proj. The full procedure is summarized in Algorithm [1](https://arxiv.org/html/2606.09932#alg1).
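A minimal sketch of the reset operation in Eq. (5) is shown below. It assumes weights are stored with shape `(out_features, in_features)`, as in common linear-layer implementations, so that a neuron of a reader module is a row and a neuron of a writer module is a column. The function name `reset_neurons` and the toy matrices are hypothetical.

```python
import numpy as np

ROW_MODULES = {"q_proj", "k_proj", "v_proj", "up_proj", "gate_proj"}
COL_MODULES = {"o_proj", "down_proj"}

def reset_neurons(w_fuse, w_base, neuron_ids, module):
    """Overwrite the selected neurons of w_fuse with their base-model values.

    Assumes weights have shape (out_features, in_features).
    """
    w_reset = w_fuse.copy()          # leave the fused weights untouched
    idx = sorted(neuron_ids)
    if module in ROW_MODULES:
        w_reset[idx, :] = w_base[idx, :]    # neuron = output row
    elif module in COL_MODULES:
        w_reset[:, idx] = w_base[:, idx]    # neuron = input column
    else:
        raise ValueError(f"unknown module: {module}")
    return w_reset

# Toy weights: base is all zeros, fused is all ones, so resets are easy to see.
w_base = np.zeros((4, 3))
w_fuse = np.ones((4, 3))

rows_reset = reset_neurons(w_fuse, w_base, {1, 3}, "q_proj")
cols_reset = reset_neurons(w_fuse, w_base, {0}, "down_proj")
```

Every entry outside the selected rows or columns keeps its fused value, which corresponds to the $(1-\omega^{(i)})\,\theta^{(i)}_{\mathrm{fuse}}$ term.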

##### Why reset helps\.

Reset and fusion play complementary roles. Fusion smooths the parameter space globally and brings the model into a better-conditioned region, but leaves every direction scaled uniformly, so the neurons that dominate the over-confident logits remain close to their over-trained configuration up to a factor of $\alpha$. The reset step replaces exactly these rigid directions with base-model values, breaking the shortcut that produces collapsed token distributions. Because $|\mathcal{S}|$ is very small, the vast majority of SFT-acquired behavior in $\theta_{\mathrm{fuse}}$ is preserved, while diversity is restored precisely where it was most lost. As shown in Figure [9](https://arxiv.org/html/2606.09932#A3.F9), the output entropy on the affected tokens recovers substantially after the reset, and the resulting model becomes more responsive to subsequent RL optimization.

## 4 Experiments

### 4.1 Setup

##### Models and Baselines\.

We adopt EvoLM-4B (Qi et al., [2025](https://arxiv.org/html/2606.09932#bib.bib37)) as the base model for mathematical tasks since it is pre-trained on a controlled corpus without evaluation data contamination, which makes the gain from RL more reliable. For agentic tasks, we use Qwen3-8B (Yang et al., [2025](https://arxiv.org/html/2606.09932#bib.bib64)), a strong general-purpose backbone widely used in tool-use scenarios. To isolate the effect of plasticity recovery on the SFT-to-RL handoff, we compare Rejuvenation against the following baselines: (1) ModSFT: a moderately trained SFT checkpoint (epoch=2 for EvoLM-4B), which serves as a strong upper-bound reference for the SFT-to-RL handoff; (2) OverSFT: the over-trained SFT checkpoint (epoch=32 for EvoLM-4B) that exhibits plasticity loss and is the target for our recovery method; (3) DFT (Wu et al., [2026](https://arxiv.org/html/2606.09932#bib.bib19)): a representative regularized SFT objective that re-weights the cross-entropy loss with token probabilities to mitigate over-confidence during SFT. Following the same protocol, we report the corresponding +RL results obtained by running the same RL recipe on top of each SFT variant.

##### Training Details\.

For SFT training, we use LlamaFactory (Zheng et al., [2024b](https://arxiv.org/html/2606.09932#bib.bib126)) for EvoLM-4B and slime (Zhu et al., [2025](https://arxiv.org/html/2606.09932#bib.bib125)) for Qwen3-8B, both with the AdamW (Kingma, [2014](https://arxiv.org/html/2606.09932#bib.bib146)) optimizer and a learning rate of $3.0\times 10^{-6}$. To probe behavior under extreme over-training, we train ModSFT for 2 epochs and OverSFT for 32 epochs without any learning rate schedule or weight decay, so that any difference between the two checkpoints comes purely from training duration. For RL training, we use verl (Sheng et al., [2025](https://arxiv.org/html/2606.09932#bib.bib124)) for mathematical tasks and slime (Zhu et al., [2025](https://arxiv.org/html/2606.09932#bib.bib125)) for agentic tasks. Both settings adopt GRPO (Shao et al., [2024](https://arxiv.org/html/2606.09932#bib.bib59)) as the base RL algorithm with a KL loss. For EvoLM-4B on math, we use a learning rate of $1\times 10^{-6}$, sample 8 responses per prompt, and apply rule-based rewards on the boxed final answer. For Qwen3-8B on $\tau$-bench Retail, we follow the official $\tau$-bench protocol and use a separate Qwen3-4B-Instruct-2507 (Yang et al., [2025](https://arxiv.org/html/2606.09932#bib.bib64)) as the user simulator. We provide full hyper-parameter tables and infrastructure details in Appendix [B](https://arxiv.org/html/2606.09932#A2).
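To make the math RL setup concrete, the sketch below combines a rule-based reward on the boxed final answer with GRPO-style group-relative advantages over 8 sampled responses. It is an illustration only: the exact-match rule, the regular expression (which ignores nested braces), the helper names, and the toy responses are assumptions, and the KL term and policy update are omitted.

```python
import re
import numpy as np

def boxed_answer(text):
    """Content of the last \\boxed{...} in the text (no nested braces), or None."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def rule_based_reward(response, gold):
    # Assumed rule: reward 1 only if the boxed answer matches the gold string exactly.
    return 1.0 if boxed_answer(response) == gold.strip() else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled responses."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Eight hypothetical sampled responses for a single prompt whose answer is 42.
group = [
    r"So x = \boxed{42}",
    r"Thus \boxed{41}",
    "I am not sure about the answer.",
    r"First \boxed{7}, then finally \boxed{42}",
    r"\boxed{24}",
    r"The result is \boxed{0}",
    r"\boxed{-42}",
    r"Answer: \boxed{ 42 }",
]
rewards = [rule_based_reward(resp, "42") for resp in group]
advantages = group_relative_advantages(rewards)
```

If every response in a group receives the same reward, the normalized advantages are all zero and that prompt contributes no policy-gradient signal. This is one way a low-entropy, over-confident policy can stall during RL.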

##### Evaluation Details\.

For mathematical reasoning, we evaluate on five widely used in-distribution (ID) benchmarks: GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2606.09932#bib.bib69)), MATH-500 (Lightman et al., [2024](https://arxiv.org/html/2606.09932#bib.bib46)), AMC23 (MAA, [2023](https://arxiv.org/html/2606.09932#bib.bib70)), Minerva Math (Lewkowycz et al., [2022](https://arxiv.org/html/2606.09932#bib.bib73)), and OlympiadBench (He et al., [2024](https://arxiv.org/html/2606.09932#bib.bib74)). For agentic tasks, we evaluate on $\tau$-bench (Yao et al., [2025](https://arxiv.org/html/2606.09932#bib.bib78)) Retail and Airline. We further evaluate on three out-of-distribution (OOD) benchmarks: GPQA-Diamond (Rein et al., [2024](https://arxiv.org/html/2606.09932#bib.bib75)), ARC-Challenge (Clark et al., [2018](https://arxiv.org/html/2606.09932#bib.bib76)), and MMLU-Pro (Wang et al., [2024](https://arxiv.org/html/2606.09932#bib.bib77)), following Yan et al. ([2025](https://arxiv.org/html/2606.09932#bib.bib24)). We report Pass@1, which is averaged over multiple samples to ensure robust evaluation. Decoding parameters, per-benchmark $K$, and the full evaluation details are shown in Appendix [B](https://arxiv.org/html/2606.09932#A2).
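One common way to compute a sample-averaged Pass@1 is sketched below, assuming each problem has $K$ sampled responses graded as correct (1) or incorrect (0). The function name and the toy grades are hypothetical; the actual per-benchmark $K$ is given in the appendix and is not reproduced here.

```python
import numpy as np

def pass_at_1(correct_per_problem):
    """Sample-averaged Pass@1 in percent.

    correct_per_problem: one list of 0/1 grades per problem, covering
    the K responses sampled for that problem.
    """
    per_problem = [float(np.mean(grades)) for grades in correct_per_problem]
    return 100.0 * float(np.mean(per_problem))

# Three hypothetical problems with K = 4 samples each.
grades = [
    [1, 0, 1, 1],   # solved in 3 of 4 samples
    [0, 0, 0, 0],   # never solved
    [1, 1, 1, 1],   # always solved
]
score = pass_at_1(grades)
```

Averaging over samples reduces the variance that a single stochastic decode would introduce, which matters when per-benchmark scores differ by only a point or two.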

### 4.2 Main Results

Table 2: Evaluation results on ID and OOD benchmarks of EvoLM-4B. The highest values before RL are underlined and the highest values after RL are bolded.

| Method | GSM8K | MATH | AMC23 | Miner. | Olymp. | ID Avg. | GPQA | ARC | MMLU | OOD Avg. | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ModSFT | 28.1 | 15.2 | 4.3 | 4.0 | 2.4 | 10.8 | 5.3 | 13.4 | 3.4 | 7.4 | 9.1 |
| +RL | 51.6 | 25.7 | 6.4 | 11.6 | 4.7 | 20.0 | 13.2 | 22.9 | 6.2 | 14.1 | 17.0 (+7.9) |
| OverSFT | 31.7 | 15.8 | 3.8 | 5.1 | 2.4 | 11.8 | 7.8 | 13.7 | 3.3 | 8.3 | 10.0 |
| +RL | 45.2 | 22.0 | 6.6 | 9.9 | 4.1 | 17.5 | 9.9 | 21.3 | 5.6 | 12.2 | 14.9 (+4.9) |
| DFT | 30.9 | 14.5 | 5.3 | 5.2 | 2.3 | 11.7 | 5.8 | 7.2 | 3.3 | 5.4 | 8.5 |
| +RL | 43.6 | 24.4 | 7.5 | 11.9 | 3.9 | 18.3 | 8.3 | 22.3 | 5.7 | 12.1 | 15.2 (+6.7) |
| Rejuvenation | 18.5 | 12.0 | 3.0 | 4.3 | 2.5 | 8.1 | 3.6 | 2.6 | 2.1 | 2.8 | 5.4 |
| +RL | 49.1 | 26.2 | 8.0 | 10.8 | 4.2 | 19.7 | 12.9 | 26.2 | 6.9 | 15.3 | 17.5 (+12.1) |

##### Math\.

Table [2](https://arxiv.org/html/2606.09932#S4.T2) reports the evaluation results on EvoLM-4B, from which we make three observations. (1) Starting RL from OverSFT leads to a clear performance drop compared with ModSFT (e.g., 17.5 vs. 20.0 ID Avg. and 12.2 vs. 14.1 OOD Avg.), confirming that prolonged SFT actively hurts the subsequent RL stage rather than only saturating it. (2) Replacing SFT with DFT slows down the SFT-induced collapse but does not fully recover the gap, since it modifies the SFT objective in advance and does not address an already over-trained checkpoint. (3) Rejuvenation, applied as a post-hoc operation on top of OverSFT, recovers and surpasses the ModSFT+RL upper bound on the average score (17.5 vs. 17.0), with particularly large improvements on the OOD tasks. This indicates that smoothing the parameter landscape and resetting over-confident neurons not only restores RL trainability but also retains the broader knowledge acquired during pre-training, leading to better OOD generalization.
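The aggregate columns are not defined explicitly in this section. The reported numbers appear consistent with ID Avg. and OOD Avg. being unweighted means over their benchmarks, and with Avg. being the mean of those two averages. The sketch below checks this reading on the Rejuvenation+RL row of Table 2; the aggregation rule itself is an inference from the numbers, not a statement from the paper.

```python
def aggregates(id_scores, ood_scores):
    """Assumed aggregation: unweighted means, then the mean of the two averages."""
    id_avg = sum(id_scores) / len(id_scores)
    ood_avg = sum(ood_scores) / len(ood_scores)
    overall = (id_avg + ood_avg) / 2
    return id_avg, ood_avg, overall

# Rejuvenation+RL row of Table 2.
id_scores = [49.1, 26.2, 8.0, 10.8, 4.2]   # GSM8K, MATH, AMC23, Miner., Olymp.
ood_scores = [12.9, 26.2, 6.9]             # GPQA, ARC, MMLU

id_avg, ood_avg, overall = aggregates(id_scores, ood_scores)
# Gain over the pre-RL Rejuvenation average (5.4) reported in the same table.
gain = round(overall, 1) - 5.4
```

Under this reading the OOD benchmarks carry as much weight in the final average as the five ID benchmarks combined, so OOD improvements have a large effect on the Avg. column.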

##### Agentic\.

Table 3: Evaluation results on ID and OOD benchmarks of Qwen3-8B. The highest values after RL are bolded.

| Method | $\tau$-bench Retail | $\tau$-bench Airline | ID Avg. | GPQA | ARC | MMLU | OOD Avg. | Avg. |
|---|---|---|---|---|---|---|---|---|
| ModSFT+RL | 78.3 | 16.1 | 47.2 | 40.5 | 77.9 | 62.5 | 60.3 | 53.8 |
| OverSFT+RL | 70.3 | 19.9 | 45.1 | 42.0 | 75.9 | 61.7 | 59.9 | 52.5 |
| DFT+RL | 75.3 | 12.2 | 43.8 | 38.8 | 76.8 | 62.5 | 59.3 | 51.6 |
| Rejuvenation+RL | 77.0 | 17.5 | 47.3 | 41.0 | 78.2 | 62.4 | 60.5 | 53.9 |

Table [3](https://arxiv.org/html/2606.09932#S4.T3) summarizes the agentic results on Qwen3-8B. Consistent with the math setting, OverSFT+RL trails ModSFT+RL on $\tau$-bench Retail (70.3 vs. 78.3) and on the ID average (45.1 vs. 47.2), although it scores higher on Airline (19.9 vs. 16.1). On Retail, the over-confident policy fails to explore alternative tool-use trajectories and quickly converges to suboptimal behaviors. Rejuvenation lifts the ID average of OverSFT back to the ModSFT+RL reference (47.3 vs. 47.2), recovering most of the Retail gap (77.0) and exceeding the reference on Airline (17.5 vs. 16.1). It also attains the best OOD average (60.5), with the highest ARC score and GPQA and MMLU scores close to the best. This suggests that the recovered plasticity transfers across rather different task families: the same fusion-and-reset operation that helps reasoning RL also enables tool-use RL to keep learning from environment feedback.

### 4.3 Ablation Study

##### Components\.

To understand the contribution of each component in Rejuvenation, we ablate it on EvoLM-4B. As shown in Table [4](https://arxiv.org/html/2606.09932#S4.T4), applying base-anchored fusion alone already brings most of the gain over OverSFT+RL: the smoother parameter region restores RL trainability and lifts both ID and OOD scores. Adding the attribution-guided neuron reset on top further improves the average performance, especially on OOD benchmarks. This is consistent with our analysis in Section [3.4](https://arxiv.org/html/2606.09932#S3.SS4): fusion shrinks the global SFT-induced drift, while the targeted reset selectively breaks the small set of over-confident logit-contributing directions that fusion alone cannot relax.

Table 4: Evaluation results on ID and OOD benchmarks. The highest values are bolded.

| Method | GSM8K | MATH | AMC23 | Miner. | Olymp. | ID Avg. | GPQA | ARC | MMLU | OOD Avg. | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| OverSFT | 45.2 | 22.0 | 6.6 | 9.9 | 4.1 | 17.5 | 9.9 | 21.3 | 5.6 | 12.2 | 14.9 |
| OverSFT w/ Fusion | 46.9 | 26.6 | 6.3 | 12.2 | 4.3 | 19.3 | 14.7 | 22.7 | 7.0 | 14.8 | 17.0 |
| Rejuvenation | 49.1 | 26.2 | 8.0 | 10.8 | 4.2 | 19.7 | 12.9 | 26.2 | 6.9 | 15.3 | 17.5 |

##### Fusion Weight\.

We further sweep the fusion weight $\alpha$ in $\{0.4, 0.45, 0.5, 0.55, 0.6\}$ on EvoLM-4B to understand how strongly the over-trained checkpoint should be pulled toward the base. As shown in Table [5](https://arxiv.org/html/2606.09932#S4.T5), a small $\alpha$ (e.g., 0.4) drops more of the SFT-acquired ID skills but leaves the model in a smoother region with stronger OOD learning potential, whereas a large $\alpha$ (e.g., 0.6) preserves more SFT behavior at the cost of carrying over the rigidity in both ID and OOD evaluation. A moderate value of $\alpha=0.5$ balances the two effects best on average and is used as the default in all main experiments.
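The role of $\alpha$ in the sweep can be illustrated with a simple interpolation. The fusion rule is defined earlier in the paper and is not restated in this section, so the form used below, $\theta_{\mathrm{fuse}}=\theta_{\texttt{Base}}+\alpha(\theta_{\mathrm{SFT}}-\theta_{\texttt{Base}})$, is an assumption chosen to match the description that a smaller $\alpha$ pulls the checkpoint closer to the base. The function name and toy parameters are hypothetical.

```python
import numpy as np

def fuse(theta_base, theta_sft, alpha):
    """Assumed base-anchored fusion: keep a fraction alpha of the SFT-induced drift."""
    return {
        name: theta_base[name] + alpha * (theta_sft[name] - theta_base[name])
        for name in theta_base
    }

# Toy "checkpoints" with a single parameter tensor each.
theta_base = {"w": np.array([0.0, 1.0, 2.0])}
theta_sft = {"w": np.array([1.0, 3.0, 2.0])}

sweep = {alpha: fuse(theta_base, theta_sft, alpha) for alpha in (0.4, 0.45, 0.5, 0.55, 0.6)}
# Distance from the base grows linearly with alpha under this rule.
drift = {alpha: float(np.linalg.norm(p["w"] - theta_base["w"])) for alpha, p in sweep.items()}
```

Under this rule every direction of the SFT drift is scaled by the same factor, which is why fusion alone cannot single out the few over-confident directions that the reset step targets.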

Table 5: Evaluation results on ID and OOD benchmarks with different fusion weight $\alpha$. The highest values are bolded.

| $\alpha$ | GSM8K | MATH | AMC23 | Miner. | Olymp. | ID Avg. | GPQA | ARC | MMLU | OOD Avg. | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.4 | 47.7 | 23.9 | 6.2 | 10.3 | 4.5 | 18.5 | 12.8 | 26.4 | 5.9 | 15.0 | 16.8 |
| 0.45 | 46.6 | 26.4 | 6.2 | 10.6 | 4.0 | 18.8 | 14.7 | 23.3 | 6.4 | 14.8 | 16.8 |
| 0.5 | 46.9 | 26.6 | 6.3 | 12.2 | 4.3 | 19.3 | 14.7 | 22.7 | 7.0 | 14.8 | 17.0 |
| 0.55 | 48.8 | 26.8 | 9.5 | 12.0 | 3.5 | 20.1 | 9.3 | 23.2 | 6.6 | 13.0 | 16.6 |
| 0.6 | 50.1 | 26.8 | 6.8 | 9.8 | 5.0 | 19.7 | 8.8 | 21.0 | 5.9 | 11.9 | 15.8 |

##### Reset Percentage\.

We then study the effect of the per-token reset budget $\rho$, which controls how many of the most positively contributing neurons are rolled back to their base-model values per target token. We sweep $\rho\in\{0.5\%, 1\%, 2\%, 4\%\}$. As shown in Table [6](https://arxiv.org/html/2606.09932#S4.T6), even a very small $\rho$ already yields a noticeable improvement, indicating that the over-confident behavior concentrates in a tiny fraction of neurons. Moderate values further improve OOD generalization, while overly aggressive reset starts to erase useful SFT-acquired behaviors and hurts ID accuracy. We therefore adopt a small default reset ratio for all main experiments; the $\rho=1\%$ row in Table 6 matches the Rejuvenation+RL results in Table 2.

Table 6: Evaluation results on ID and OOD benchmarks with different reset percentage $\rho$. The highest values are bolded.

| $\rho$ | GSM8K | MATH | AMC23 | Miner. | Olymp. | ID Avg. | GPQA | ARC | MMLU | OOD Avg. | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.5% | 49.1 | 25.9 | 10.7 | 12.0 | 3.9 | 20.3 | 11.6 | 21.3 | 6.0 | 13.0 | 16.7 |
| 1% | 49.1 | 26.2 | 8.0 | 10.8 | 4.2 | 19.7 | 12.9 | 26.2 | 6.9 | 15.3 | 17.5 |
| 2% | 48.9 | 25.5 | 7.3 | 12.2 | 5.4 | 19.9 | 13.7 | 23.4 | 7.3 | 14.8 | 17.3 |
| 4% | 46.8 | 24.7 | 5.0 | 10.1 | 4.8 | 18.3 | 9.9 | 22.6 | 6.6 | 13.0 | 15.6 |

##### Rejuvenation with different SFT checkpoints.

Table [7](https://arxiv.org/html/2606.09932#S4.T7) further studies whether Rejuvenation is specific to severely over-trained checkpoints or can also be applied to checkpoints with different SFT degrees. Starting RL from ModSFT already gives strong ID performance, but applying Rejuvenation before RL further improves the overall average score from 17.0 to 17.6, mainly by increasing the OOD average from 14.1 to 16.3. This suggests that even a moderately trained SFT checkpoint may still contain SFT-induced rigidity, and a mild recovery operation can improve its ability to generalize after RL. However, this improvement comes with a small drop on the ID average, indicating a trade-off between preserving task-specific SFT behavior and restoring broader plasticity. These results indicate that the proposed recovery procedure is not merely an early-stopping substitute: it can rejuvenate checkpoints from different stages of SFT.

Table 7: Evaluation results on ID and OOD benchmarks of different methods after RL training. The highest values are bolded.

| Method | GSM8K | MATH | AMC23 | Miner. | Olymp. | ID Avg. | GPQA | ARC | MMLU | OOD Avg. | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ModSFT | 51.6 | 25.7 | 6.4 | 11.6 | 4.7 | 20.0 | 13.2 | 22.9 | 6.2 | 14.1 | 17.0 |
| ModSFT+Rejuvenation | 46.9 | 24.3 | 8.2 | 10.8 | 4.5 | 18.9 | 14.2 | 26.3 | 8.4 | 16.3 | 17.6 |
| OverSFT+Rejuvenation | 48.6 | 26.2 | 8.0 | 10.8 | 4.2 | 19.6 | 12.9 | 26.2 | 6.9 | 15.3 | 17.5 |

## 5 Conclusion

In this paper, we identify a failure mode where excessive SFT training leads to loss of plasticity, which limits the effectiveness of subsequent RL optimization\. Through analysis of parameter changes, token distributions, and RL training dynamics, we show that over\-trained models become over\-confident and harder to update\. To address this issue, we introduce a two\-stage method that includes global model fusion and neuron reset\. These components help smooth the parameter space and restore diversity in key parts of the model, making it more amenable to further optimization during RL\. Empirical results on mathematical and agentic tasks demonstrate the effectiveness of our method\. Our approach consistently improves RL performance over over\-trained SFT models and also shows better generalization on out\-of\-distribution benchmarks\. Our findings highlight the importance of model plasticity in the SFT\-then\-RL pipeline\. We hope this work can motivate future research on understanding optimization dynamics in post\-training and developing more robust training strategies\.

Although our method effectively restores the plasticity of over-trained SFT models, both model fusion and neuron reset require access to the base model for reference. In future work, we will explore how to restore the plasticity of over-trained models without the base model.

## References

- Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, et al. (2022). Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862. Cited by: [§1](https://arxiv.org/html/2606.09932#S1.p1.1).
- Step-wise adaptive integration of supervised fine-tuning and reinforcement learning for task-specific llms. arXiv preprint arXiv:2505.13026. Cited by: [§2](https://arxiv.org/html/2606.09932#S2.SS0.SSS0.Px1.p1.1).
- T. Chu, Y. Zhai, J. Yang, S. Tong, S. Xie, D. Schuurmans, Q. V. Le, S. Levine, and Y. Ma (2025). SFT memorizes, RL generalizes: a comparative study of foundation model post-training. In Proceedings of the 42nd International Conference on Machine Learning, A. Singh, M. Fazel, D. Hsu, S. Lacoste-Julien, F. Berkenkamp, T. Maharaj, K. Wagstaff, and J. Zhu (Eds.), Proceedings of Machine Learning Research, Vol. 267, pp. 10818–10838. External Links: [Link](https://proceedings.mlr.press/v267/chu25c.html). Cited by: [§1](https://arxiv.org/html/2606.09932#S1.p2.1), [§2](https://arxiv.org/html/2606.09932#S2.SS0.SSS0.Px1.p1.1).
- P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord (2018). Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457. Cited by: [§4.1](https://arxiv.org/html/2606.09932#S4.SS1.SSS0.Px3.p1.2).
- K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. (2021). Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168. Cited by: [§4.1](https://arxiv.org/html/2606.09932#S4.SS1.SSS0.Px3.p1.2).
- D. Dai, L. Dong, Y. Hao, Z. Sui, B. Chang, and F. Wei (2022). Knowledge neurons in pretrained transformers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), S. Muresan, P. Nakov, and A. Villavicencio (Eds.), Dublin, Ireland, pp. 8493–8502. External Links: [Link](https://aclanthology.org/2022.acl-long.581/), [Document](https://dx.doi.org/10.18653/v1/2022.acl-long.581). Cited by: [§3.4.2](https://arxiv.org/html/2606.09932#S3.SS4.SSS2.Px2.p1.1).
- S. Dohare, J. F. Hernandez-Garcia, Q. Lan, P. Rahman, A. R. Mahmood, and R. S. Sutton (2024). Loss of plasticity in deep continual learning. Nature 632(8026), pp. 768–774. Cited by: [§1](https://arxiv.org/html/2606.09932#S1.p4.1).
- N. Elhage, N. Nanda, C. Olsson, T. Henighan, N. Joseph, B. Mann, A. Askell, Y. Bai, A. Chen, T. Conerly, N. DasSarma, D. Drain, D. Ganguli, Z. Hatfield-Dodds, D. Hernandez, A. Jones, J. Kernion, L. Lovitt, K. Ndousse, D. Amodei, T. Brown, J. Clark, J. Kaplan, S. McCandlish, and C. Olah (2021). A mathematical framework for transformer circuits. Transformer Circuits Thread. Note: https://transformer-circuits.pub/2021/framework/index.html. Cited by: [§3.4](https://arxiv.org/html/2606.09932#S3.SS4.p1.1).
- Y. Fu, T. Chen, J. Chai, X. Wang, S. Tu, G. Yin, W. Lin, Q. Zhang, Y. Zhu, and D. Zhao (2026). SRFT: a single-stage method with supervised and reinforcement fine-tuning for reasoning. In The Fourteenth International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=n6E0r6kQWQ). Cited by: [§1](https://arxiv.org/html/2606.09932#S1.p3.1), [§2](https://arxiv.org/html/2606.09932#S2.SS0.SSS0.Px1.p1.1).
- W. Gan, M. Pan, L. Xi, W. Zhang, J. Chen, J. Yin, and X. Zhang (2026). GFT: from imitation to reward fine-tuning with unbiased group advantages and dynamic coefficient rectification. arXiv preprint arXiv:2604.14258. Cited by: [§2](https://arxiv.org/html/2606.09932#S2.SS0.SSS0.Px1.p1.1).
- A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. (2024). The llama 3 herd of models. arXiv preprint arXiv:2407.21783. Cited by: [footnote 2](https://arxiv.org/html/2606.09932#footnote2).
- D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, et al. (2025). DeepSeek-r1 incentivizes reasoning in llms through reinforcement learning. Nature 645(8081), pp. 633–638. Cited by: [§1](https://arxiv.org/html/2606.09932#S1.p1.1), [§2](https://arxiv.org/html/2606.09932#S2.SS0.SSS0.Px1.p1.1).
- T. Han, S. Bordt, H. Zhang, and S. Kakade (2026). Weight decay improves language model plasticity. arXiv preprint arXiv:2602.11137. Cited by: [§1](https://arxiv.org/html/2606.09932#S1.p4.1).
- C. He, R. Luo, Y. Bai, S. Hu, Z. Thai, J. Shen, J. Hu, X. Han, Y. Huang, Y. Zhang, J. Liu, L. Qi, Z. Liu, and M. Sun (2024). OlympiadBench: a challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), L. Ku, A. Martins, and V. Srikumar (Eds.), Bangkok, Thailand, pp. 3828–3850. External Links: [Link](https://aclanthology.org/2024.acl-long.211/), [Document](https://dx.doi.org/10.18653/v1/2024.acl-long.211). Cited by: [§4.1](https://arxiv.org/html/2606.09932#S4.SS1.SSS0.Px3.p1.2).
- Z. Huang, T. Cheng, Z. Qiu, Z. Wang, Y. Xu, E. M. Ponti, and I. Titov (2025). Blending supervised and reinforcement fine-tuning with prefix sampling. arXiv preprint arXiv:2507.01679. Cited by: [§2](https://arxiv.org/html/2606.09932#S2.SS0.SSS0.Px1.p1.1).
- Z. Huang, K. Yang, X. Huang, F. Hao, Q. Ge, B. Li, H. Du, K. Chen, and Q. Guo (2026). How to fine-tune a reasoning model? a teacher-student cooperation framework to synthesize student-consistent sft data. arXiv preprint arXiv:2604.14164. Cited by: [§1](https://arxiv.org/html/2606.09932#S1.p3.1).
- G. Ilharco, M. T. Ribeiro, M. Wortsman, L. Schmidt, H. Hajishirzi, and A. Farhadi (2023). Editing models with task arithmetic. In The Eleventh International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=6t0Kwf8-jrj). Cited by: [§3.3](https://arxiv.org/html/2606.09932#S3.SS3.SSS0.Px2.p1.6), [§3.3](https://arxiv.org/html/2606.09932#S3.SS3.p1.1).
- J. Janiak, C. Rager, J. Dao, and Y. Lau (2024). An adversarial example for direct logit attribution: memory management in GELU-4L. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, Y. Belinkov, N. Kim, J. Jumelet, H. Mohebbi, A. Mueller, and H. Chen (Eds.), Miami, Florida, US, pp. 232–237. External Links: [Link](https://aclanthology.org/2024.blackboxnlp-1.15/), [Document](https://dx.doi.org/10.18653/v1/2024.blackboxnlp-1.15). Cited by: [§3.4.1](https://arxiv.org/html/2606.09932#S3.SS4.SSS1.Px2.p1.15).
- H. Jin, S. Luan, S. Lyu, G. Rabusseau, R. Rabbany, D. Precup, and M. Hamdaqa (2025a). Rl fine-tuning heals ood forgetting in sft. arXiv preprint arXiv:2509.12235. Cited by: [§2](https://arxiv.org/html/2606.09932#S2.SS0.SSS0.Px1.p1.1).
- H. Jin, S. Lv, S. Wu, and M. Hamdaqa (2025b). Rl is neither a panacea nor a mirage: understanding supervised vs. reinforcement learning fine-tuning for llms. arXiv preprint arXiv:2508.16546. Cited by: [§2](https://arxiv.org/html/2606.09932#S2.SS0.SSS0.Px1.p1.1).
- F. Kang, M. Kuchnik, K. Padthe, M. Vlastelica, R. Jia, C. Wu, and N. Ardalani (2026). Quagmires in SFT-RL post-training: when high SFT scores mislead and what to use instead. In The Fourteenth International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=uLM3BfKo19). Cited by: [§2](https://arxiv.org/html/2606.09932#S2.SS0.SSS0.Px1.p1.1).
- D. P. Kingma (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: [Appendix B](https://arxiv.org/html/2606.09932#A2.SS0.SSS0.Px1.p1.3), [§4.1](https://arxiv.org/html/2606.09932#S4.SS1.SSS0.Px2.p1.4).
- W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica (2023). Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, SOSP '23, New York, NY, USA, pp. 611–626. External Links: ISBN 9798400702297, [Link](https://doi.org/10.1145/3600006.3613165), [Document](https://dx.doi.org/10.1145/3600006.3613165). Cited by: [Appendix B](https://arxiv.org/html/2606.09932#A2.SS0.SSS0.Px3.p1.5).
- A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V. Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo, Y. Wu, B. Neyshabur, G. Gur-Ari, and V. Misra (2022). Solving quantitative reasoning problems with language models. In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35, pp. 3843–3857. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2022/file/18abbeef8cfe9203fdf9053c9c4fe191-Paper-Conference.pdf). Cited by: [§4.1](https://arxiv.org/html/2606.09932#S4.SS1.SSS0.Px3.p1.2).
- X. Li, G. Huzhang, S. Shen, Q. Chen, Z. Xu, W. Luo, K. Zhang, and J. Zhang (2026). Getting your LLMs ready for reinforcement learning with lightweight SFT. In The Fourteenth International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=yezWGJmODg). Cited by: [§1](https://arxiv.org/html/2606.09932#S1.p3.1), [§2](https://arxiv.org/html/2606.09932#S2.SS0.SSS0.Px2.p1.1).
- Z. Li, C. Chen, T. Xu, Z. Qin, J. Xiao, Z. Luo, and R. Sun (2025). Preserving diversity in supervised fine-tuning of large language models. In The Thirteenth International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=NQEe7B7bSw). Cited by: [§2](https://arxiv.org/html/2606.09932#S2.SS0.SSS0.Px2.p1.1), [§3.1.2](https://arxiv.org/html/2606.09932#S3.SS1.SSS2.Px1.p1.1).
- H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe (2024). Let's verify step by step. In The Twelfth International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=v8L0pN6EOi). Cited by: [§4.1](https://arxiv.org/html/2606.09932#S4.SS1.SSS0.Px3.p1.2).
- A. Limozin, E. Durech, T. Hoefler, I. Schlag, and V. Pyatkin (2026). SFT-then-rl outperforms mixed-policy methods for llm reasoning. arXiv preprint arXiv:2604.23747. Cited by: [§2](https://arxiv.org/html/2606.09932#S2.SS0.SSS0.Px1.p1.1).
- M. Liu, S. Diao, X. Lu, J. Hu, X. Dong, Y. Choi, J. Kautz, and Y. Dong (2025a). ProRL: prolonged reinforcement learning expands reasoning boundaries in large language models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems. External Links: [Link](https://openreview.net/forum?id=YPsJha5HXQ). Cited by: [§1](https://arxiv.org/html/2606.09932#S1.p1.1).
- M. Liu, G. Farina, and A. E. Ozdaglar (2025b). UFT: unifying supervised and reinforcement fine-tuning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems. External Links: [Link](https://openreview.net/forum?id=usOkGv1S7M). Cited by: [§2](https://arxiv.org/html/2606.09932#S2.SS0.SSS0.Px1.p1.1).
- R. Liu, J. Wang, Y. Shi, Z. Xie, C. An, K. Zhang, J. Zhao, X. Gu, L. Lin, W. Hu, X. Li, F. Zhang, G. Zhou, and K. Gai (2025c). Attention as a compass: efficient exploration for process-supervised rl in reasoning models. arXiv preprint arXiv:2509.26628. Cited by: [§1](https://arxiv.org/html/2606.09932#S1.p1.1).
- T. Liu, T. Wu, R. Yang, S. Sun, J. Wang, and Y. Yang (2026). ProFit: leveraging high-value signals in sft via probability-guided token selection. arXiv preprint arXiv:2601.09195. Cited by: [§2](https://arxiv.org/html/2606.09932#S2.SS0.SSS0.Px2.p1.1).
- Y. Liu, S. Li, L. Cao, Y. Xie, M. Zhou, H. Dong, X. Ma, S. Han, and D. Zhang (2025d). SuperRL: reinforcement learning with supervision to boost language model reasoning. arXiv preprint arXiv:2506.01096. Cited by: [§2](https://arxiv.org/html/2606.09932#S2.SS0.SSS0.Px1.p1.1).
- X. Lv, Y. Zuo, Y. Sun, H. Liu, Y. Wei, Z. Chen, X. Zhu, K. Zhang, B. Wang, N. Ding, et al. (2025). Towards a unified view of large language model post-training. arXiv preprint arXiv:2509.04419. Cited by: [§2](https://arxiv.org/html/2606.09932#S2.SS0.SSS0.Px1.p1.1).
- L. Ma, H. Liang, M. Qiang, L. Tang, X. Ma, Z. H. Wong, J. Niu, C. Shen, R. He, Y. Li, W. Zhang, and B. CUI (2026). Learning what reinforcement learning can't: interleaved online fine-tuning for hardest questions. In The Fourteenth International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=LzCBLrNoyM). Cited by: [§2](https://arxiv.org/html/2606.09932#S2.SS0.SSS0.Px1.p1.1).
- MAA (2023). American mathematics contest 12 (amc 12). Note: Accessed: 2026-04-30. External Links: [Link](https://artofproblemsolving.com/wiki/index.php/AMC_12_Problems_and_Solutions). Cited by: [§4.1](https://arxiv.org/html/2606.09932#S4.SS1.SSS0.Px3.p1.2).
- P. Michel, O. Levy, and G. Neubig (2019). Are sixteen heads really better than one?. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2019/file/2c601ad9d2ff9bc8b282670cdd54f69f-Paper.pdf). Cited by: [§3.4.1](https://arxiv.org/html/2606.09932#S3.SS4.SSS1.Px2.p3.4).
- P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz (2017). Pruning convolutional neural networks for resource efficient inference. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=SJGCiw5gl). Cited by: [§3.4.1](https://arxiv.org/html/2606.09932#S3.SS4.SSS1.Px2.p3.4).
- OpenAI (2024). Learning to reason with llms. Note: Accessed: 2026-04-30. External Links: [Link](https://openai.com/index/learning-to-reason-with-llms). Cited by: [§1](https://arxiv.org/html/2606.09932#S1.p1.1), [§2](https://arxiv.org/html/2606.09932#S2.SS0.SSS0.Px1.p1.1).
- L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe (2022). Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35, pp. 27730–27744. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf). Cited by: [§1](https://arxiv.org/html/2606.09932#S1.p1.1).
- Z. Qi, F. Nie, A. Alahi, J. Zou, H. Lakkaraju, Y. Du, E. P. Xing, S. M. Kakade, and H. Zhang (2025). EvoLM: in search of lost language model training dynamics. In The Thirty-ninth Annual Conference on Neural Information Processing Systems. External Links: [Link](https://openreview.net/forum?id=B6bE2GC71a). Cited by: [§3.1.1](https://arxiv.org/html/2606.09932#S3.SS1.SSS1.p1.1), [§4.1](https://arxiv.org/html/2606.09932#S4.SS1.SSS0.Px1.p1.1).
- C. Qin and J. T. Springenberg (2025). Supervised fine tuning on curated data is reinforcement learning (and can be improved). arXiv preprint arXiv:2507.12856. Cited by: [§2](https://arxiv.org/html/2606.09932#S2.SS0.SSS0.Px1.p1.1).
- D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman (2024). GPQA: a graduate-level google-proof q&a benchmark. In First Conference on Language Modeling. External Links: [Link](https://openreview.net/forum?id=Ti67584b98). Cited by: [§4.1](https://arxiv.org/html/2606.09932#S4.SS1.SSS0.Px3.p1.2).
- Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. (2024). DeepSeekMath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300. Cited by: [Appendix B](https://arxiv.org/html/2606.09932#A2.SS0.SSS0.Px3.p1.5), [§1](https://arxiv.org/html/2606.09932#S1.p1.1), [§2](https://arxiv.org/html/2606.09932#S2.SS0.SSS0.Px1.p1.1), [§4.1](https://arxiv.org/html/2606.09932#S4.SS1.SSS0.Px2.p1.4).
- G. Sheng, C. Zhang, Z. Ye, X. Wu, W. Zhang, R. Zhang, Y. Peng, H. Lin, and C. Wu (2025). HybridFlow: a flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, EuroSys '25, New York, NY, USA, pp. 1279–1297. External Links: ISBN 9798400711961, [Link](https://doi.org/10.1145/3689031.3696075), [Document](https://dx.doi.org/10.1145/3689031.3696075). Cited by: [Appendix B](https://arxiv.org/html/2606.09932#A2.SS0.SSS0.Px3.p1.5), [§4.1](https://arxiv.org/html/2606.09932#S4.SS1.SSS0.Px2.p1.4).
- M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro (2019). Megatron-lm: training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053. Cited by: [Appendix B](https://arxiv.org/html/2606.09932#A2.SS0.SSS0.Px4.p1.4).
- H. Wang, H. Gu, H. Piao, K. Gong, Y. Ye, X. Yue, S. Han, Y. Guo, and D. Wu (2026). Learning while staying curious: entropy-preserving supervised fine-tuning via adaptive self-distillation for large reasoning models. arXiv preprint arXiv:2602.02244. Cited by: [§2](https://arxiv.org/html/2606.09932#S2.SS0.SSS0.Px2.p1.1).
- J. Wang, R. Liu, L. Lin, W. Hu, X. Li, F. Zhang, G. Zhou, and K. Gai (2025a). ASPO: asymmetric importance sampling policy optimization. arXiv preprint arXiv:2510.06062. Cited by: [§1](https://arxiv.org/html/2606.09932#S1.p1.1).
- J. Wang, R. Liu, F. Zhang, X. Li, and G. Zhou (2025b). Stabilizing knowledge, promoting reasoning: dual-token constraints for rlvr. arXiv preprint arXiv:2507.15778. Cited by: [§1](https://arxiv.org/html/2606.09932#S1.p1.1).
- S. Wang, L. Yu, C. Gao, C. Zheng, S. Liu, R. Lu, K. Dang, X. Chen, J. Yang, Z. Zhang, Y. Liu, A. Yang, A. Zhao, Y. Yue, S. Song, B. Yu, G. Huang, and J. Lin (2025c). Beyond the 80/20 rule: high-entropy minority tokens drive effective reinforcement learning for LLM reasoning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems. External Links: [Link](https://openreview.net/forum?id=yfcpdY4gMP). Cited by: [Appendix C](https://arxiv.org/html/2606.09932#A3.SS0.SSS0.Px3.p1.1).
- Y\. Wang, X\. Ma, G\. Zhang, Y\. Ni, A\. Chandra, S\. Guo, W\. Ren, A\. Arulraj, X\. He, Z\. Jiang, T\. Li, M\. Ku, K\. Wang, A\. Zhuang, R\. Fan, X\. Yue, and W\. Chen \(2024\)MMLU\-pro: a more robust and challenging multi\-task language understanding benchmark\.InAdvances in Neural Information Processing Systems,A\. Globerson, L\. Mackey, D\. Belgrave, A\. Fan, U\. Paquet, J\. Tomczak, and C\. Zhang \(Eds\.\),Vol\.37,pp\. 95266–95290\.External Links:[Document](https://dx.doi.org/10.52202/079017-3018),[Link](https://proceedings.neurips.cc/paper_files/paper/2024/file/ad236edc564f3e3156e1b2feafb99a24-Paper-Datasets_and_Benchmarks_Track.pdf)Cited by:[§4\.1](https://arxiv.org/html/2606.09932#S4.SS1.SSS0.Px3.p1.2)\.
- M\. Wortsman, G\. Ilharco, S\. Y\. Gadre, R\. Roelofs, R\. Gontijo\-Lopes, A\. S\. Morcos, H\. Namkoong, A\. Farhadi, Y\. Carmon, S\. Kornblith, and L\. Schmidt \(2022\)Model soups: averaging weights of multiple fine\-tuned models improves accuracy without increasing inference time\.InProceedings of the 39th International Conference on Machine Learning,K\. Chaudhuri, S\. Jegelka, L\. Song, C\. Szepesvari, G\. Niu, and S\. Sabato \(Eds\.\),Proceedings of Machine Learning Research, Vol\.162,pp\. 23965–23998\.External Links:[Link](https://proceedings.mlr.press/v162/wortsman22a.html)Cited by:[§3\.3](https://arxiv.org/html/2606.09932#S3.SS3.p1.1)\.
- Y\. Wu, Y\. Zhou, Z\. Ziheng, Y\. Peng, X\. Ye, X\. Hu, W\. Zhu, L\. Qi, M\. Yang, and X\. Yang \(2026\)On the generalization of SFT: a reinforcement learning perspective with reward rectification\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=Lv7PjbcaMi)Cited by:[Appendix B](https://arxiv.org/html/2606.09932#A2.SS0.SSS0.Px1.p1.3),[§2](https://arxiv.org/html/2606.09932#S2.SS0.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2606.09932#S4.SS1.SSS0.Px1.p1.1)\.
- J\. Yan, Y\. Li, Z\. Hu, Z\. Wang, G\. Cui, X\. Qu, Y\. Cheng, and Y\. Zhang \(2025\)Learning to reason under off\-policy guidance\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=vO8LLoNWWk)Cited by:[§2](https://arxiv.org/html/2606.09932#S2.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2606.09932#S4.SS1.SSS0.Px3.p1.2)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, C\. Zheng, D\. Liu, F\. Zhou, F\. Huang, F\. Hu, H\. Ge, H\. Wei, H\. Lin, J\. Tang, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Zhou, J\. Lin, K\. Dang, K\. Bao, K\. Yang, L\. Yu, L\. Deng, M\. Li, M\. Xue, M\. Li, P\. Zhang, P\. Wang, Q\. Zhu, R\. Men, R\. Gao, S\. Liu, S\. Luo, T\. Li, T\. Tang, W\. Yin, X\. Ren, X\. Wang, X\. Zhang, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Wang, Z\. Cui, Z\. Zhang, Z\. Zhou, and Z\. Qiu \(2025\)Qwen3 technical report\.arXiv preprint arXiv:2505\.09388\.Cited by:[Appendix B](https://arxiv.org/html/2606.09932#A2.SS0.SSS0.Px4.p1.4),[§4\.1](https://arxiv.org/html/2606.09932#S4.SS1.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2606.09932#S4.SS1.SSS0.Px2.p1.4),[footnote 2](https://arxiv.org/html/2606.09932#footnote2)\.
- A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei, H\. Lin, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Lin, K\. Dang, K\. Lu, K\. Bao, K\. Yang, L\. Yu, M\. Li, M\. Xue, P\. Zhang, Q\. Zhu, R\. Men, R\. Lin, T\. Li, T\. Xia, X\. Ren, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Cui, Z\. Zhang, and Z\. Qiu \(2024\)Qwen2\.5 technical report\.arXiv preprint arXiv:2412\.15115\.Cited by:[footnote 2](https://arxiv.org/html/2606.09932#footnote2)\.
- S\. Yao, N\. Shinn, P\. Razavi, and K\. R\. Narasimhan \(2025\)τ\-Bench: a benchmark for tool\-agent\-user interaction in real\-world domains\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=roNSXZpUDN)Cited by:[§4\.1](https://arxiv.org/html/2606.09932#S4.SS1.SSS0.Px3.p1.2)\.
- Q\. Yu, Z\. Zhang, R\. Zhu, Y\. Yuan, X\. Zuo, YuYue, W\. Dai, T\. Fan, G\. Liu, J\. Liu, L\. Liu, X\. Liu, H\. Lin, Z\. Lin, B\. Ma, G\. Sheng, Y\. Tong, C\. Zhang, M\. Zhang, R\. Zhang, W\. Zhang, H\. Zhu, J\. Zhu, J\. Chen, J\. Chen, C\. Wang, H\. Yu, Y\. Song, X\. Wei, H\. Zhou, J\. Liu, W\. Ma, Y\. Zhang, L\. Yan, Y\. Wu, and M\. Wang \(2025\)DAPO: an open\-source LLM reinforcement learning system at scale\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=2a36EMSSTp)Cited by:[§2](https://arxiv.org/html/2606.09932#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Yue, Z\. Chen, R\. Lu, A\. Zhao, Z\. Wang, Y\. Yue, S\. Song, and G\. Huang \(2025\)Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model?\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=4OsgYD7em5)Cited by:[§1](https://arxiv.org/html/2606.09932#S1.p1.1)\.
- D\. Zhang, Y\. Xu, H\. Wang, Q\. Chen, and H\. Peng \(2026a\)Good sft optimizes for sft, better sft prepares for reinforcement learning\.arXiv preprint arXiv:2602\.01058\.Cited by:[§2](https://arxiv.org/html/2606.09932#S2.SS0.SSS0.Px1.p1.1)\.
- K\. Zhang, K\. Tian, R\. Liu, S\. Zeng, X\. Zhu, G\. Jia, Y\. Fan, X\. Lv, Y\. Zuo, C\. Jiang, Y\. wang, J\. Wang, E\. Hua, X\. Long, J\. Gao, Y\. Sun, Z\. Ma, G\. Cui, N\. Ding, B\. Qi, and B\. Zhou \(2026b\)MARTI: a framework for multi\-agent LLM systems reinforced training and inference\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=E7jZqo0A50)Cited by:[§1](https://arxiv.org/html/2606.09932#S1.p1.1)\.
- K\. Zhang, Y\. Zuo, B\. He, Y\. Sun, R\. Liu, C\. Jiang, Y\. Fan, K\. Tian, G\. Jia, P\. Li, Y\. Fu, X\. Lv, Y\. Zhang, S\. Zeng, S\. Qu, H\. Li, S\. Wang, Y\. Wang, X\. Long, F\. Liu, X\. Xu, J\. Ma, X\. Zhu, E\. Hua, Y\. Liu, Z\. Li, H\. Chen, X\. Qu, Y\. Li, W\. Chen, Z\. Yuan, J\. Gao, D\. Li, Z\. Ma, G\. Cui, Z\. Liu, B\. Qi, N\. Ding, and B\. Zhou \(2025a\)A survey of reinforcement learning for large reasoning models\.arXiv preprint arXiv:2509\.08827\.Cited by:[§1](https://arxiv.org/html/2606.09932#S1.p1.1)\.
- W\. Zhang, Y\. Xie, Y\. Sun, Y\. Chen, G\. Wang, Y\. Li, B\. Ding, and J\. Zhou \(2026c\)On\-policy RL meets off\-policy experts: harmonizing supervised fine\-tuning and reinforcement learning via dynamic weighting\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=dCm9bBrk5d)Cited by:[§2](https://arxiv.org/html/2606.09932#S2.SS0.SSS0.Px1.p1.1)\.
- Z\. Zhang, C\. Zheng, Y\. Wu, B\. Zhang, R\. Lin, B\. Yu, D\. Liu, J\. Zhou, and J\. Lin \(2025b\)The lessons of developing process reward models in mathematical reasoning\.InFindings of the Association for Computational Linguistics: ACL 2025,W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 10495–10516\.External Links:[Link](https://aclanthology.org/2025.findings-acl.547/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.547),ISBN 979\-8\-89176\-256\-5Cited by:[§1](https://arxiv.org/html/2606.09932#S1.p3.1)\.
- J\. Zhao, R\. Liu, K\. Zhang, Z\. Zhou, J\. Gao, D\. Li, J\. Lyu, Z\. Qian, B\. Qi, X\. Li, and B\. Zhou \(2026\)GenPRM: scaling test\-time compute of process reward models via generative reasoning\.Proceedings of the AAAI Conference on Artificial Intelligence40\(41\),pp\. 34932–34940\.External Links:[Link](https://ojs.aaai.org/index.php/AAAI/article/view/40797),[Document](https://dx.doi.org/10.1609/aaai.v40i41.40797)Cited by:[§1](https://arxiv.org/html/2606.09932#S1.p3.1)\.
- L\. Zheng, L\. Yin, Z\. Xie, C\. Sun, J\. Huang, C\. H\. Yu, S\. Cao, C\. Kozyrakis, I\. Stoica, J\. E\. Gonzalez, C\. Barrett, and Y\. Sheng \(2024a\)SGLang: efficient execution of structured language model programs\.InAdvances in Neural Information Processing Systems,A\. Globerson, L\. Mackey, D\. Belgrave, A\. Fan, U\. Paquet, J\. Tomczak, and C\. Zhang \(Eds\.\),Vol\.37,pp\. 62557–62583\.External Links:[Document](https://dx.doi.org/10.52202/079017-2000),[Link](https://proceedings.neurips.cc/paper_files/paper/2024/file/724be4472168f31ba1c9ac630f15dec8-Paper-Conference.pdf)Cited by:[Appendix B](https://arxiv.org/html/2606.09932#A2.SS0.SSS0.Px4.p1.4)\.
- Y\. Zheng, R\. Zhang, J\. Zhang, Y\. Ye, and Z\. Luo \(2024b\)LlamaFactory: unified efficient fine\-tuning of 100\+ language models\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 3: System Demonstrations\),Y\. Cao, Y\. Feng, and D\. Xiong \(Eds\.\),Bangkok, Thailand,pp\. 400–410\.External Links:[Link](https://aclanthology.org/2024.acl-demos.38/),[Document](https://dx.doi.org/10.18653/v1/2024.acl-demos.38)Cited by:[Appendix B](https://arxiv.org/html/2606.09932#A2.SS0.SSS0.Px1.p1.3),[§4\.1](https://arxiv.org/html/2606.09932#S4.SS1.SSS0.Px2.p1.4)\.
- H\. Zhu, J\. Su, P\. Lai, R\. Ma, W\. Zhang, L\. Yang, and G\. Chen \(2026a\)Anchored supervised fine\-tuning\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=PORko7QT64)Cited by:[§2](https://arxiv.org/html/2606.09932#S2.SS0.SSS0.Px2.p1.1)\.
- W\. Zhu, R\. Xie, R\. Wang, X\. Sun, D\. Wang, and P\. Liu \(2026b\)Proximal supervised fine\-tuning\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=hQtwQqYikp)Cited by:[§1](https://arxiv.org/html/2606.09932#S1.p3.1),[§2](https://arxiv.org/html/2606.09932#S2.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2606.09932#S2.SS0.SSS0.Px2.p1.1)\.
- Z\. Zhu, C\. Xie, X\. Lv, and slime Contributors \(2025\)Slime: an llm post\-training framework for rl scaling\.Note:[https://github\.com/THUDM/slime](https://github.com/THUDM/slime)GitHub repository\. Corresponding author: Xin Lv\. Accessed: 2026\-04\-30Cited by:[Appendix B](https://arxiv.org/html/2606.09932#A2.SS0.SSS0.Px2.p1.4),[Appendix B](https://arxiv.org/html/2606.09932#A2.SS0.SSS0.Px4.p1.4),[§4\.1](https://arxiv.org/html/2606.09932#S4.SS1.SSS0.Px2.p1.4)\.

## Appendix A Full Algorithm

Algorithm 1 Rejuvenation: Plasticity Recovery for Over-Trained SFT Models

1. Input: base model $\theta_{\texttt{Base}}$, over-trained model $\theta_{\mathrm{over}}$, fusion weight $\alpha$, reset ratio $\rho$
2. Apply global model fusion via equation [1](https://arxiv.org/html/2606.09932#S3.E1)
3. Compute DLA scores for neurons using over-confident tokens
4. For each selected target token, select the global top-$\rho$ neurons by most positively contributing attribution score
5. Let $\mathcal{S}$ be the union of selected neurons over all calibration tokens
6. Reset selected neurons in $\theta_{\mathrm{fuse}}$ to their base values
7. Output: recovered model $\theta_{\mathrm{rec}}$
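
The steps of Algorithm 1 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: equation 1 is not reproduced in this appendix, so the interpolation form `alpha * base + (1 - alpha) * over` is assumed; a "neuron" is taken to be one row of a weight matrix; $\rho$ is treated as a fraction of the scored neurons; and the DLA scores are passed in precomputed. All function and variable names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Weights = Dict[str, List[List[float]]]  # module name -> matrix, one row per neuron
NeuronId = Tuple[str, int]              # (module name, row index)


def fuse(base: Weights, over: Weights, alpha: float) -> Weights:
    """Element-wise interpolation. ASSUMPTION: alpha weights the base model."""
    return {
        name: [
            [alpha * b + (1.0 - alpha) * o for b, o in zip(b_row, o_row)]
            for b_row, o_row in zip(base[name], over[name])
        ]
        for name in base
    }


def select_neurons(scores_per_token: List[Dict[NeuronId, float]],
                   rho: float) -> Set[NeuronId]:
    """Per target token, keep the top-rho fraction of neurons with the largest
    positive attribution, then take the union over all tokens."""
    selected: Set[NeuronId] = set()
    for scores in scores_per_token:
        k = max(1, int(rho * len(scores)))
        positive = sorted(((s, n) for n, s in scores.items() if s > 0), reverse=True)
        selected.update(n for _, n in positive[:k])
    return selected


def rejuvenate(base: Weights, over: Weights,
               scores_per_token: List[Dict[NeuronId, float]],
               alpha: float, rho: float) -> Weights:
    fused = fuse(base, over, alpha)
    for name, row in select_neurons(scores_per_token, rho):
        fused[name][row] = list(base[name][row])  # masked copy from the base weights
    return fused


# Toy example with one module and two neurons (values are made up).
base = {"mlp": [[0.0, 0.0], [1.0, 1.0]]}
over = {"mlp": [[2.0, 2.0], [3.0, 3.0]]}
scores = [{("mlp", 0): 5.0, ("mlp", 1): -1.0}]
rec = rejuvenate(base, over, scores, alpha=0.5, rho=0.5)
```

In this toy run, neuron 0 is the only positively contributing neuron, so it is copied back from the base model, while neuron 1 keeps its fused value.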

## Appendix B Experimental Details

##### Mathematical SFT Training Details\.

For EvoLM-4B, we conduct SFT with LlamaFactory \[Zheng et al., [2024b](https://arxiv.org/html/2606.09932#bib.bib126)\] on a 100k subset sampled from a 500k mathematical SFT corpus. The base checkpoint is the EvoLM-4B pre-trained model trained with controlled mixing of FineWeb and FineMath data (without contamination from the evaluation benchmarks). All variants share the same data, the same context window, and the same data ordering, so any difference between ModSFT and OverSFT is purely a function of training duration. We use AdamW \[Kingma, [2014](https://arxiv.org/html/2606.09932#bib.bib146)\] with $\beta_1 = 0.9$, $\beta_2 = 0.999$, a constant learning rate of $3.0 \times 10^{-6}$, and *no* learning-rate warmup, decay, or weight decay. We deliberately disable schedules and weight decay so that excessive SFT can manifest its full effect on parameter drift, weight magnitude, and token entropy. ModSFT is the checkpoint after 2 epochs and OverSFT is the checkpoint after 32 epochs over the same SFT subset. For the DFT baseline, we keep the same data, the same total epochs (32), and the same optimizer settings as OverSFT, and only replace the cross-entropy loss with the probability-reweighted DFT objective \[Wu et al., [2026](https://arxiv.org/html/2606.09932#bib.bib19)\]. This isolates the effect of the loss formulation from the effect of the SFT trajectory length.

##### Agentic SFT Training Details\.

For Qwen3-8B, the SFT stage is performed with slime \[Zhu et al., [2025](https://arxiv.org/html/2606.09932#bib.bib125)\] on the $\tau$-bench Retail training split, where supervision trajectories are generated from a stronger Qwen3-30B-A3B teacher and saved as multi-turn assistant-tool conversations. We use AdamW with $\beta_1 = 0.9$, $\beta_2 = 0.999$, a constant learning rate of $3.0 \times 10^{-6}$, weight decay 0.1, a global batch size of 16, and a maximum response length of 4,096 tokens. We train for 32 epochs with the Qwen3 chat template (enable\_thinking=False) and the Qwen3 tool-aware loss mask, so that loss is only computed on assistant turns. The corresponding ModSFT and OverSFT checkpoints follow the same epoch-based convention as in the math setting. The DFT baseline is constructed in the same way as in the math setting, by replacing the cross-entropy loss with the DFT objective while keeping all other hyper-parameters identical.

##### Mathematical RL Training Details\.

We use verl \[Sheng et al., [2025](https://arxiv.org/html/2606.09932#bib.bib124)\] with GRPO \[Shao et al., [2024](https://arxiv.org/html/2606.09932#bib.bib59)\] as the RL backbone. The actor is initialized from the corresponding SFT checkpoint, and the reference policy is the same SFT model frozen at step 0. The training prompt batch size is 512 with a mini-batch size of 128, and we sample 8 responses per prompt with temperature 1.0 and top-$p$ 1.0. The maximum prompt length is 512 and the maximum response length is 1,024 tokens. For optimization, we use AdamW with a constant learning rate of $1 \times 10^{-6}$ and a global gradient clip of 1.0. We keep the GRPO clipping range symmetric ($\varepsilon_{\text{low}} = \varepsilon_{\text{high}} = 0.2$), use KL loss with coefficient 0.001, and adopt the seq-mean-token-mean loss aggregation. Rewards are computed by the [Math-Verify](https://github.com/huggingface/Math-Verify) verifier on the boxed final answer, with the reward function returning 1 for correct answers and -1 otherwise. Validation is run every 50 steps with temperature 0.6 and top-$p$ 1.0. Each run uses 1 node of 8× NVIDIA H800 GPUs, FSDP2 sharding, and dynamic batching. Rollouts are served by vLLM \[Kwon et al., [2023](https://arxiv.org/html/2606.09932#bib.bib127)\].
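
To make the reward and clipping settings above concrete, the following sketch computes group-relative advantages for one prompt with 8 sampled responses and evaluates the clipped surrogate for a single token. It is an illustration of the standard GRPO formulation, not verl's code: the use of the population standard deviation, the stabilizing `eps`, and the function names are assumptions, and the KL term and loss aggregation are omitted.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # ASSUMPTION: population std
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """PPO-style clipped objective for one token (to be maximized)."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


# 8 responses for one prompt: 2 correct (+1), 6 incorrect (-1).
rewards = [1, -1, -1, -1, 1, -1, -1, -1]
adv = group_advantages(rewards)

up = clipped_surrogate(ratio=1.5, advantage=1.0)     # ratio capped at 1.2
down = clipped_surrogate(ratio=0.5, advantage=-1.0)  # ratio floored at 0.8
```

Correct responses receive a positive advantage and incorrect ones a negative advantage; if all 8 responses share the same reward, every advantage is zero and the prompt contributes no policy gradient.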

##### Agentic RL Training Details\.

For agentic RL, we use slime \[Zhu et al., [2025](https://arxiv.org/html/2606.09932#bib.bib125)\] with the Megatron-LM \[Shoeybi et al., [2019](https://arxiv.org/html/2606.09932#bib.bib129)\] backend for training and SGLang \[Zheng et al., [2024a](https://arxiv.org/html/2606.09932#bib.bib128)\] for inference. For each iteration we sample 12 prompts from the $\tau$-bench Retail training set with 8 trajectories per prompt and a global batch size of 96, with maximum response length 1,024 tokens and temperature 1.0. To produce realistic multi-turn behavior, we deploy a separate Qwen3-4B-Instruct-2507 \[Yang et al., [2025](https://arxiv.org/html/2606.09932#bib.bib64)\] as the user simulator, served locally via SGLang, and call it through an OpenAI-compatible API. We use AdamW with learning rate $1 \times 10^{-6}$, a constant LR schedule, and CPU optimizer offloading. We use the GRPO algorithm, set both the KL loss coefficient and the entropy coefficient to 0, and use clipping ranges with $\varepsilon_{\text{low}} = 0.2$ and $\varepsilon_{\text{high}} = 0.28$. Tensor-model-parallel size is 2, sequence parallelism is enabled, and we use full activation recomputation with dynamic batching (up to 2,048 tokens per GPU). We train for up to 200 rollouts, save and evaluate every 20 rollouts, and balance trajectories across data-parallel ranks with the standard non-zero-reward-std dynamic filter. Each run uses 6 GPUs for the policy and the user simulator combined, with the user simulator pinned to a separate set of GPUs.
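
The dynamic filter and the asymmetric clipping range mentioned above can be illustrated as follows. This is a hedged sketch rather than slime's implementation: the function names are hypothetical, and the 0/1 trajectory rewards are made-up values chosen only to show which prompt groups survive the filter.

```python
import statistics


def keep_informative_groups(groups):
    """Drop prompts whose trajectories all received the same reward.

    Such groups have zero reward standard deviation, so their
    group-relative advantages vanish and carry no learning signal.
    """
    return {prompt: r for prompt, r in groups.items() if statistics.pstdev(r) > 0}


def clip_ratio(ratio, eps_low=0.2, eps_high=0.28):
    """Asymmetric clipping: the upper bound is looser than the lower bound."""
    return min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)


# Hypothetical rewards for three prompts (4 trajectories each for brevity).
groups = {
    "prompt_a": [1, 1, 1, 1],  # all succeed -> filtered out
    "prompt_b": [1, 0, 0, 1],  # mixed       -> kept
    "prompt_c": [0, 0, 0, 0],  # all fail    -> filtered out
}
kept = keep_informative_groups(groups)
```

With $\varepsilon_{\text{high}} = 0.28$, an importance ratio of 1.5 is capped at 1.28, whereas a ratio of 0.5 is floored at 0.8.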

For both math and agentic RL, the only difference between baselines and Rejuvenation is the *initialization* of the actor: baselines start from ModSFT, OverSFT, or the DFT checkpoint, whereas Rejuvenation starts from the rejuvenated checkpoint produced by Algorithm [1](https://arxiv.org/html/2606.09932#alg1). All RL hyper-parameters, data, prompts, and seeds are kept identical across initializations to ensure a fair comparison.

##### Evaluation Details\.

For ID and OOD evaluation we use the same inference backends as in training: vLLM for EvoLM-4B and SGLang for Qwen3-8B. We use a sampling temperature of 0.6 and top-$p$ of 1.0 for all benchmarks. For benchmarks with limited problem counts, we average Pass@1 over multiple samples per problem ($K=32$ for AMC23 and AIME-style sets, $K=4$ for MATH-500, Minerva, OlympiadBench, GPQA, and ARC, and $K=2$ for GSM8K). Mathematical answers are graded with the same Math-Verify verifier used during RL training, and multiple-choice OOD benchmarks (GPQA, ARC, MMLU-Pro) are graded by the official answer-matching scripts shipped with each dataset.
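
Averaging Pass@1 over $K$ samples per problem amounts to taking the per-problem fraction of correct samples and then the mean over problems. A small sketch, with a hypothetical function name and made-up grading results:

```python
def mean_pass_at_1(correct_per_problem):
    """correct_per_problem[i] holds K booleans, one per sampled answer to problem i."""
    per_problem = [sum(samples) / len(samples) for samples in correct_per_problem]
    return sum(per_problem) / len(per_problem)


# Two problems graded with K = 4 samples each.
graded = [
    [True, False, True, True],     # 3 of 4 samples correct
    [False, False, False, False],  # 0 of 4 samples correct
]
score = mean_pass_at_1(graded)  # (0.75 + 0.0) / 2
```

Using more samples on small benchmarks reduces the variance of the estimate without changing what is being measured.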

## Appendix C Additional Results

##### Logit Distribution\.

Figure [6](https://arxiv.org/html/2606.09932#A3.F6) contrasts the next-token logit distribution between ModSFT and OverSFT on the same set of held-out prompts. Under teacher forcing on the gold response, OverSFT concentrates almost the entire probability mass on a single token, while ModSFT still allocates non-trivial mass to plausible alternatives. We additionally inspect the per-position entropy along full responses and observe that the entropy gap between the two checkpoints is largest on tokens that involve numerical answers, formula formatting, and chain-of-thought connectors, which is consistent with our DLA-based attribution analysis.
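
The per-position entropy referred to above is the Shannon entropy of the softmax over next-token logits. The sketch below computes it for two made-up logit vectors over a 4-token vocabulary; the values are illustrative only and do not come from either checkpoint.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0)


peaked = [12.0, 0.0, 0.0, 0.0]  # nearly all mass on one token
spread = [2.0, 1.5, 1.0, 0.0]   # mass shared with plausible alternatives

h_peaked = entropy(peaked)
h_spread = entropy(spread)
```

A sharply peaked distribution yields entropy close to zero, while a uniform distribution over $V$ tokens reaches the maximum $\log V$.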

![Refer to caption](https://arxiv.org/html/2606.09932v1/figures/logit_dist_mod.png) (a) ModSFT
![Refer to caption](https://arxiv.org/html/2606.09932v1/figures/logit_dist_over.png) (b) OverSFT

Figure 6: Token logit comparison between normal SFT models and over-trained models: (a) ModSFT; (b) OverSFT.
##### Logit Attribution\.

The attribution scores are highly uneven across both modules and neurons. At the module level, only a small number of modules contribute disproportionately large positive values to the over-confident target logits, while most modules have much smaller contributions, as shown in Figure [7](https://arxiv.org/html/2606.09932#A3.F7). This indicates that the excessive logit sharpening induced by over-training is not uniformly distributed across the network, but is concentrated in a limited set of components. A similar pattern appears at the neuron level: within the high-contribution modules, only a small subset of neurons dominates the positive contribution to the gold-token logit, whereas the majority of neurons contribute little or even negatively. This heavy-tailed attribution structure explains why resetting a small fraction of high-scoring neurons is sufficient to relax the over-confident logits and recover plasticity, without broadly disrupting the SFT-acquired behavior stored in the rest of the model.
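
As an illustration of what a heavy-tailed attribution looks like, the sketch below uses one common formulation of direct logit attribution: a neuron's contribution to a target logit is its activation times the projection of its output weights onto the target token's unembedding vector. This formulation, the omission of the final normalization layer, and all names and numbers are assumptions for illustration; the exact DLA definition used in the paper is given in the main text, not in this appendix.

```python
def dla_scores(activations, out_weights, unembed_target):
    """Per-neuron contribution to the target-token logit.

    activations[j]    : activation of neuron j at the position of interest
    out_weights[j]    : vector that neuron j writes into the residual stream
    unembed_target    : unembedding vector of the target token
    """
    scores = []
    for a, w in zip(activations, out_weights):
        direction = sum(wi * ui for wi, ui in zip(w, unembed_target))
        scores.append(a * direction)
    return scores


def top_share(scores, k):
    """Fraction of the total positive attribution carried by the top-k neurons."""
    positive = sorted((s for s in scores if s > 0), reverse=True)
    total = sum(positive)
    return sum(positive[:k]) / total if total > 0 else 0.0


# Four hypothetical neurons writing into a 2-dimensional residual stream.
activations = [2.0, 0.1, 0.5, 1.0]
out_weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
unembed_target = [3.0, 1.0]

scores = dla_scores(activations, out_weights, unembed_target)
share = top_share(scores, k=1)
```

Here a single neuron carries most of the positive attribution, and one neuron contributes negatively, mirroring the qualitative pattern described above.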

![Refer to caption](https://arxiv.org/html/2606.09932v1/x6.png) (a) module level
![Refer to caption](https://arxiv.org/html/2606.09932v1/x7.png) (b) neuron level

Figure 7: Logit attribution ranking: (a) module level; (b) neuron level.
##### Selected Tokens in Logit Attribution\.

We visualize the tokens selected for logit attribution in Figure [8](https://arxiv.org/html/2606.09932#A3.F8). The selected tokens are mainly logical connectives rather than tokens that carry basic knowledge or computation steps. As pointed out by previous work \[Wang et al., [2025c](https://arxiv.org/html/2606.09932#bib.bib99)\], RL mainly optimizes tokens related to reasoning behaviors (e.g., logical connectives). Rejuvenation recovers the entropy of these critical tokens, which improves the learnability of the policy during RL.

![Refer to caption](https://arxiv.org/html/2606.09932v1/x8.png)

Figure 8: Wordcloud of selected tokens in logit attribution.
##### How does token entropy recover?

To better understand how neuron reset restores output diversity, we visualize the token-level entropy before and after applying the reset operation in Figure [9](https://arxiv.org/html/2606.09932#A3.F9). Before reset, OverSFT produces extremely low entropy on many selected target positions, indicating that the policy assigns most probability mass to a single token and leaves little room for alternative reasoning continuations. After resetting the high-attribution neurons to their base-model values, the entropy of these positions increases noticeably, while the distribution does not become uniformly random. This suggests that the reset operation does not simply inject noise into the model. Instead, it selectively relaxes the over-confident logits associated with a small set of abnormal neurons. As a result, the recovered checkpoint preserves the main SFT-acquired behavior but reopens alternative token choices that are useful for exploration during RL. This entropy recovery provides an output-space explanation for why Rejuvenation improves subsequent RL optimization and OOD generalization.

![Refer to caption](https://arxiv.org/html/2606.09932v1/figures/entropy_before.png) (a) Before neuron reset
![Refer to caption](https://arxiv.org/html/2606.09932v1/figures/entropy_after.png) (b) After neuron reset

Figure 9: Comparison of token entropy: (a) Before neuron reset; (b) After neuron reset.
##### Selected neurons/modules\.

Figure [10](https://arxiv.org/html/2606.09932#A3.F10) visualizes the distribution of reset neurons over the whole model. Reset neurons are not uniformly distributed: they cluster in a few specific layers and modules that are most aligned with over-confident logit production, especially the k\_proj and v\_proj modules in the last several layers. We also observe that reset neurons aggregated from different calibration prompts overlap substantially, indicating that the over-confidence of OverSFT is governed by a stable, prompt-agnostic subset of parameters rather than per-prompt artifacts. This stability is what makes the reset operation effective with a single calibration pass.
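
One simple way to quantify the overlap between neuron sets selected from different calibration prompts is the mean pairwise Jaccard index. The paper does not state which overlap measure it uses, so this metric, the function names, and the neuron identifiers below are illustrative assumptions.

```python
from itertools import combinations


def jaccard(a, b):
    """Intersection over union of two neuron sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_overlap(neuron_sets):
    pairs = list(combinations(neuron_sets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical (module, row) identifiers selected from three calibration prompts.
sets = [
    {("layers.27.k_proj", 3), ("layers.27.v_proj", 8), ("layers.26.v_proj", 1)},
    {("layers.27.k_proj", 3), ("layers.27.v_proj", 8), ("layers.25.k_proj", 5)},
    {("layers.27.k_proj", 3), ("layers.27.v_proj", 8)},
]
overlap = mean_pairwise_overlap(sets)
```

A value near 1 would indicate that nearly the same neurons are selected regardless of the prompt, which is the situation in which a single calibration pass suffices.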

![Refer to caption](https://arxiv.org/html/2606.09932v1/x9.png)

Figure 10: Per-layer and per-module visualization of neuron reset ratio in EvoLM-4B.
##### Computational Cost\.

Rejuvenation is a one-shot, post-hoc operation and introduces negligible overhead compared with the SFT or RL stages. The fusion step is a single element-wise interpolation between two checkpoints. The neuron reset step requires one forward pass over a small calibration set to compute DLA scores and a single masked copy from the base weights into the fused checkpoint. Both steps are run on a single NVIDIA H800 GPU and finish in 3 minutes for EvoLM-4B and within 5 minutes for Qwen3-8B, which is orders of magnitude smaller than the cost of either re-running SFT from a different stopping point or relaunching RL from multiple SFT checkpoints.

##### Parameter Changes\.

To complement the per-module visualization in Section [3.1](https://arxiv.org/html/2606.09932#S3.SS1), we provide additional layer-wise visualizations of the parameter shift induced by SFT and the subsequent RL stage. Figure [11](https://arxiv.org/html/2606.09932#A3.F11) and Figures [12](https://arxiv.org/html/2606.09932#A3.F12)-[13](https://arxiv.org/html/2606.09932#A3.F13) cover representative attention projections across early and late layers. The patterns are consistent with those reported in the main text: OverSFT introduces sparse but extreme spikes that are concentrated in a small subset of parameters, while ModSFT produces relatively smooth and small-magnitude updates. Subsequent RL on OverSFT is unable to undo these spikes, confirming that excessive SFT places the model in a region from which standard policy optimization cannot escape.
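
The contrast between smooth updates and sparse, extreme spikes can be summarized numerically by comparing the largest absolute parameter change with the median change. The statistic, the function name, and the toy values below are assumptions for illustration; they are not the measurements plotted in the figures.

```python
import statistics


def shift_stats(base, tuned):
    """Summarize |tuned - base| for one flattened weight tensor."""
    deltas = [abs(t - b) for b, t in zip(base, tuned)]
    med = statistics.median(deltas)
    peak = max(deltas)
    return {
        "max": peak,
        "median": med,
        "peak_ratio": peak / med if med > 0 else float("inf"),
    }


base = [0.0] * 8
# Made-up updates: uniformly small vs. small everywhere except one spike.
smooth = [0.010, -0.012, 0.009, 0.011, -0.010, 0.012, -0.009, 0.010]
spiky = [0.010, -0.012, 0.009, 0.900, -0.010, 0.012, -0.009, 0.010]

smooth_stats = shift_stats(base, smooth)
spiky_stats = shift_stats(base, spiky)
```

The two toy updates have almost the same median shift, but the peak-to-median ratio of the spiky one is far larger, which is the signature the visualizations attribute to OverSFT.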

![Refer to caption](https://arxiv.org/html/2606.09932v1/x10.png)

Figure 11: Parameter changes of layers.0.self\_attn.k\_proj.weight induced by SFT and RL.

![Refer to caption](https://arxiv.org/html/2606.09932v1/x11.png)

Figure 12: Parameter changes of layers.27.self\_attn.v\_proj.weight induced by SFT and RL.

![Refer to caption](https://arxiv.org/html/2606.09932v1/x12.png)

Figure 13: Parameter changes of layers.27.self\_attn.k\_proj.weight induced by SFT and RL.
