Rotation-Preserving Supervised Fine-Tuning
Summary
This paper introduces Rotation-Preserving Supervised Fine-Tuning (RPSFT), a method that improves out-of-domain generalization by preserving projected rotations in pretrained singular subspaces during fine-tuning.
# Rotation-Preserving Supervised Fine-Tuning
Source: [https://arxiv.org/html/2605.10973](https://arxiv.org/html/2605.10973)
Hangzhan Jin¹,²,∗, Tianwei Ni¹,³, Lu Li¹,³, Pierre-Luc Bacon¹,³,⁵, Mohammad Hamdaqa², Doina Precup¹,⁴,⁵,⁶
¹Mila - Quebec AI Institute, ²Polytechnique Montréal, ³Université de Montréal, ⁴McGill University, ⁵CIFAR AI Chair, ⁶Google DeepMind
###### Abstract
Supervised fine-tuning (SFT) improves in-domain performance but can degrade out-of-domain (OOD) generalization. Prior work suggests that this degradation is related to changes in dominant singular subspaces of pretrained weight matrices. However, directly identifying loss-sensitive directions with Hessian or Fisher information is computationally expensive at LLM scale. In this work, we propose preserving projected rotations in pretrained singular subspaces as an efficient proxy for Fisher-sensitive directions, which we call Rotation-Preserving Supervised Fine-Tuning (RPSFT). RPSFT penalizes changes in the projected top-$k$ singular-vector block of each pretrained weight matrix, limiting unnecessary rotation while preserving task adaptation. Across model families and sizes trained on math reasoning data, RPSFT improves the in-domain/OOD trade-off over standard SFT and strong SFT baselines, better preserves pretrained representations, and provides stronger initializations for downstream RL fine-tuning. Code is available at [https://github.com/jinhangzhan/RPSFT](https://github.com/jinhangzhan/RPSFT.git).
∗Corresponding author: hangzhan.jin@mila.quebec

Figure 1: Method overview. RPSFT modifies SFT by adding a projected-block anchor in the pretrained SVD basis while retaining full-parameter task adaptation. The top-$k$ pretrained singular block is protected, complementary directions remain free, and the resulting $\theta_{\mathrm{SFT}}$ initializes RLFT.

## 1 Introduction
Large language models are commonly post-trained with supervised fine-tuning (SFT) followed by reinforcement learning fine-tuning (RLFT). Although SFT improves tasks represented in the post-training data, it can degrade out-of-domain (OOD) capabilities, a form of forgetting caused by excessive specialization (Zhu et al., [2025c](https://arxiv.org/html/2605.10973#bib.bib50); Wu et al., [2025](https://arxiv.org/html/2605.10973#bib.bib44)). Prior work links this degradation to geometric drift, especially rotations of dominant singular directions in pretrained weight matrices (Jin et al., [2025](https://arxiv.org/html/2605.10973#bib.bib13); Zhu et al., [2025b](https://arxiv.org/html/2605.10973#bib.bib49)). Since these directions are associated with high-variance and high-curvature structure (Haink, [2023](https://arxiv.org/html/2605.10973#bib.bib7)), preserving them may protect general-purpose capabilities during SFT.
This motivates our question: *Can supervised fine-tuning preserve OOD generalization by limiting unnecessary geometric drift while retaining task adaptation?* To address this question, we propose *Rotation-Preserving Supervised Fine-Tuning* (RPSFT), a simple regularization method motivated by the observed overlap between dominant singular subspaces and Fisher-sensitive directions tied to OOD forgetting. RPSFT penalizes changes in the pretrained top-$k$ singular-vector block of each selected weight matrix. Unlike freezing or hard gradient projection, it anchors only the dominant pretrained block while leaving complementary directions free to adapt, and integrates into standard SFT pipelines without additional data or task boundaries.
Concretely, for each selected matrix $\mathbf{W}$, RPSFT precomputes the pretrained top-$k$ left and right singular-vector bases $\mathbf{U}^{(k)}_{0}$ and $\mathbf{V}^{(k)}_{0}$, and adds the Frobenius-norm penalty $\lambda\|(\mathbf{U}^{(k)}_{0})^{\top}(\mathbf{W}-\mathbf{W}^{0})\mathbf{V}^{(k)}_{0}\|_{F}^{2}$ to the SFT loss. This formulation preserves the expressivity of supervised fine-tuning while stabilizing general-purpose capabilities learned during pretraining. Figure [1](https://arxiv.org/html/2605.10973#S0.F1) summarizes the overall RPSFT post-training workflow, and Algorithm [1](https://arxiv.org/html/2605.10973#alg1) gives the training procedure.
We evaluate RPSFT with full-parameter fine-tuning across Llama (Grattafiori et al., [2024](https://arxiv.org/html/2605.10973#bib.bib6)) and Qwen (Team, [2024](https://arxiv.org/html/2605.10973#bib.bib38)) checkpoints trained on OpenR1-Math (Hugging Face, [2025](https://arxiv.org/html/2605.10973#bib.bib12)). We treat math benchmarks as in-domain and general reasoning, safety, and knowledge benchmarks as OOD. Across model families and sizes, RPSFT improves the ID/OOD trade-off over SFT and strong SFT baselines, summarized in Figure [4](https://arxiv.org/html/2605.10973#S5.F4). To understand why it works, we analyze representation drift after supervised fine-tuning. Comparing the hidden states of the base model and the checkpoint after SFT, we show that RPSFT better preserves the pretrained representation geometry. We then evaluate downstream reinforcement learning with DAPO (Yu et al., [2025](https://arxiv.org/html/2605.10973#bib.bib45)), a variant of GRPO (Shao et al., [2024](https://arxiv.org/html/2605.10973#bib.bib34)), across all three model sizes and find that RPSFT provides strong downstream initializations and consistently higher or competitive final RL performance.
Together, RPSFT contributes a simple projected-subspace regularizer for SFT, consistent improvements in the ID/OOD trade-off across Llama (Grattafiori et al., [2024](https://arxiv.org/html/2605.10973#bib.bib6)) and Qwen (Team, [2024](https://arxiv.org/html/2605.10973#bib.bib38)) models, and empirical and theoretical evidence that preserving dominant pretrained subspaces reduces rotation, protects hidden-state representations, and improves the forgetting–adaptation trade-off.
## 2 Preliminaries
Modern post-training for reasoning language models typically proceeds in two stages: supervised fine-tuning (SFT) on curated instruction data, followed by reinforcement learning fine-tuning (RLFT) on task-level rewards (DeepSeek-AI, [2025](https://arxiv.org/html/2605.10973#bib.bib3); OpenAI et al., [2024](https://arxiv.org/html/2605.10973#bib.bib30); Bai et al., [2023](https://arxiv.org/html/2605.10973#bib.bib1)).
#### SFT and PEFT.

Supervised fine-tuning adapts pretrained parameters $\theta_{0}$ on labeled pairs $\mathcal{D}=\{(x,y)\}$ by minimizing the negative log-likelihood $\mathcal{L}_{\mathrm{SFT}}(\theta)=\mathbb{E}_{(x,y)\sim\mathcal{D}}[-\log\pi_{\theta}(y\mid x)]$, where $x$ is an input prompt, $y$ is the target response, and $\pi_{\theta}(y\mid x)$ is the model distribution with parameters $\theta$ initialized from $\theta_{0}$. In this work, the main setting is full-parameter SFT, where all model parameters are updated. We include vanilla LoRA (Hu et al., [2021](https://arxiv.org/html/2605.10973#bib.bib10)) as a PEFT baseline in Appendix [D](https://arxiv.org/html/2605.10973#A4).
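As a minimal sketch of this objective (NumPy; the toy logits and token ids are ours, not from the paper), the SFT loss for one response is the mean negative log-likelihood of the target tokens:

```python
import numpy as np

def sft_nll(logits, targets):
    """Mean negative log-likelihood of target tokens under the model distribution.

    logits:  (T, V) unnormalized scores, one row per target position.
    targets: (T,) target token ids.
    """
    # Numerically stable log-softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# Toy response: 3 target tokens over a vocabulary of 5.
rng = np.random.default_rng(0)
loss = sft_nll(rng.normal(size=(3, 5)), np.array([1, 4, 2]))
```

With uniform logits the loss reduces to $\log V$, the entropy of a uniform distribution over the vocabulary, which is a handy sanity check.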
#### RLFT and DAPO.

After SFT, the model can be further optimized with reinforcement learning. We use DAPO (Yu et al., [2025](https://arxiv.org/html/2605.10973#bib.bib45)), which samples a group of $G$ responses from the rollout policy $\pi_{\theta_{\mathrm{old}}}(\cdot\mid x)$, computes a normalized group-relative advantage

$$\hat{A}_{i}=\frac{r_{i}-\frac{1}{G}\sum_{j=1}^{G}r_{j}}{\operatorname{std}(\{r_{j}\}_{j=1}^{G})+\varepsilon}$$

with $r_{i}=r(x,y_{i})$ and a small numerical constant $\varepsilon$, and uses the token-level PPO ratio (Schulman et al., [2017](https://arxiv.org/html/2605.10973#bib.bib33)) $\rho_{i,t}(\theta)=\frac{\pi_{\theta}(y_{i,t}\mid x,y_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(y_{i,t}\mid x,y_{i,<t})}$. The resulting surrogate objective is

$$\mathcal{J}_{\mathrm{DAPO}}(\theta)=\mathbb{E}_{x,\{y_{i}\}\sim\pi_{\theta_{\mathrm{old}}}(\cdot\mid x)}\!\left[\frac{1}{\sum_{i=1}^{G}|y_{i}|}\sum_{i,t}\min\!\Big(\rho_{i,t}\hat{A}_{i},\,\operatorname{clip}\!\big(\rho_{i,t},1-\epsilon_{\mathrm{low}},1+\epsilon_{\mathrm{high}}\big)\hat{A}_{i}\Big)\right].\tag{1}$$

Here $i$ indexes the $G$ sampled responses for each prompt, $t$ ranges over the valid tokens of response $y_{i}$, $|y_{i}|$ is the response length, and $\epsilon_{\mathrm{low}},\epsilon_{\mathrm{high}}$ are the PPO clipping thresholds. Compared with GRPO, DAPO averages the clipped policy objective over valid tokens rather than over whole responses.
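The group-relative advantage and the token-level clipped surrogate in Eq. (1) can be sketched as follows (a NumPy toy with hypothetical rewards and log-ratios; a real implementation operates on model log-probabilities):

```python
import numpy as np

def group_advantages(rewards, eps=1e-6):
    """Normalized group-relative advantage for one prompt's G sampled responses."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def dapo_surrogate(log_ratios, advantages, eps_low=0.2, eps_high=0.28):
    """Token-level clipped objective, averaged over all valid tokens.

    log_ratios: list of per-token log(pi_theta / pi_old) arrays, one per response.
    advantages: (G,) per-response advantages, shared by that response's tokens.
    """
    n_tokens = sum(len(lr) for lr in log_ratios)
    total = 0.0
    for lr, A in zip(log_ratios, advantages):
        rho = np.exp(lr)                                   # token-level PPO ratio
        clipped = np.clip(rho, 1.0 - eps_low, 1.0 + eps_high)
        total += np.minimum(rho * A, clipped * A).sum()
    return total / n_tokens

# Toy group: 4 responses with 0/1 correctness rewards, 3 tokens each,
# evaluated at the first step where pi_theta == pi_old (all log-ratios zero).
adv = group_advantages([1.0, 0.0, 1.0, 0.0])
obj = dapo_surrogate([np.zeros(3) for _ in range(4)], adv)
```

Note the token-level denominator $\sum_{i}|y_{i}|$: this is exactly the point where DAPO differs from GRPO, which would instead average each response's mean token objective over the $G$ responses.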
## 3 Why Preserving Singular Vectors Mitigates Forgetting
Post-training must balance *rapid adaptation*, which efficiently improves performance on the target task, and *mitigating forgetting*, which preserves capabilities already encoded in the pretrained model (Lyle et al., [2023](https://arxiv.org/html/2605.10973#bib.bib23); Lu et al., [2025](https://arxiv.org/html/2605.10973#bib.bib22)). Prior work has linked dominant subspaces to important functional properties of neural networks (Haink, [2023](https://arxiv.org/html/2605.10973#bib.bib7)), and excessive rotation of leading singular subspaces in large language models has been associated with out-of-domain degradation (Jin et al., [2025](https://arxiv.org/html/2605.10973#bib.bib13)). In this section, we complement this view by connecting this trade-off to forgetting. We analyze a local second-order expansion around the pretrained weights, following the standard curvature view used in second-order optimization (Martens, [2010](https://arxiv.org/html/2605.10973#bib.bib24)). Let the vectorized base-model weights be $w\in\mathbb{R}^{d}$, the post-trained weights be $w^{\prime}=w+\Delta w$, and let $l$ denote the loss on a task. The local expansion is

$$l(w^{\prime})\approx l(w)+\underbrace{\langle\nabla l(w),\Delta w\rangle}_{\text{adaptation}}+\underbrace{\tfrac{1}{2}\Delta w^{\top}\nabla^{2}l(w)\Delta w}_{\text{forgetting}}.\tag{2}$$

The first-order term captures task adaptation: if the update points along the negative gradient of the target loss, the loss decreases. The second-order term captures the need to mitigate forgetting: moving along high-curvature directions can sharply increase the loss on capabilities that the pretrained model already fits well. For OOD retention, forgetting is often dominated by this curvature term once the base model already achieves locally optimal OOD performance, where the first-order term is close to zero.
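For a quadratic toy loss the expansion in Eq. (2) is exact, which makes the adaptation/forgetting split easy to check numerically (an illustrative NumPy sketch of the curvature argument, not an experiment from the paper):

```python
import numpy as np

# Toy quadratic loss l(w) = 0.5 w^T H w - b^T w, so expansion (2) is exact.
H = np.array([[10.0, 0.0],
              [0.0,  0.1]])          # high curvature along the first axis
b = np.array([1.0, 1.0])

def loss(w):
    return 0.5 * w @ H @ w - b @ w

def grad(w):
    return H @ w - b

w_star = np.linalg.solve(H, b)       # local optimum: first-order term vanishes,
                                     # mimicking a base model that already fits OOD well
step_high = np.array([0.1, 0.0])     # update along the high-curvature direction
step_low  = np.array([0.0, 0.1])     # same-size update along the low-curvature one

forget_high = loss(w_star + step_high) - loss(w_star)   # 0.5 * 10  * 0.1^2 = 0.05
forget_low  = loss(w_star + step_low)  - loss(w_star)   # 0.5 * 0.1 * 0.1^2 = 0.0005
```

Both steps have the same Euclidean length, yet the high-curvature step increases the loss 100 times more, which is the sense in which forgetting is curvature-dominated near a local optimum.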
Figure 2: Fisher-projected total sum of squared gradient norms (gradient energy) captured by top-$r\times r$ SVD blocks. The x-axis reports $x(r)=\frac{r^{2}}{R^{2}}$, the strict top-$r\times r$ SVD-block size relative to the full $R\times R$ block, and the y-axis reports $y(r)=\frac{\sum_{t}\|\mathbf{U}_{r}^{\top}\mathbf{G}_{t}\mathbf{V}_{r}\|_{F}^{2}}{\sum_{t}\|\mathbf{G}_{t}\|_{F}^{2}}$, the Fisher-projected gradient-energy percentage, where $\mathbf{G}_{t}\in\mathbb{R}^{m\times n}$ denotes the gradient matrix of the $t$-th sample.

The local second-order view suggests that the Hessian is the most direct object for identifying loss-sensitive directions: updates along high-curvature directions can cause large increases in retained-capability loss. However, explicitly estimating, storing, or regularizing with the Hessian is infeasible at LLM scale. For the SFT negative log-likelihood objective, the Fisher information provides a tractable positive-semidefinite curvature proxy; near a local optimum, its expected form coincides with the Hessian of the expected SFT loss (Martens, [2020](https://arxiv.org/html/2605.10973#bib.bib25)), and continual-learning methods such as EWC (Kirkpatrick et al., [2017](https://arxiv.org/html/2605.10973#bib.bib14)) therefore use Fisher information to identify directions that should remain stable. At the same time, the empirical Fisher should not be treated as identical to the Hessian away from a local optimum (Kunstner et al., [2019](https://arxiv.org/html/2605.10973#bib.bib16)), and even estimating and storing a full Fisher matrix remains too expensive for large language models. This motivates a more practical question: can we find a cheap structural proxy that aligns with Fisher-sensitive directions without explicitly constructing the Fisher matrix?
We answer this question with a Fisher-projected energy diagnostic (Figure [2](https://arxiv.org/html/2605.10973#S3.F2)). In the first attention layer, the cumulative projected energy rises quickly for Llama-8B, Qwen-7B, and Qwen-3B: a relatively small strict top-5% singular block already captures about 20% of the full Fisher-projected gradient energy. We therefore use rank 768 as the default protected rank in our experiments. Although the pretrained singular basis is not identical to the Fisher eigenspace, its strong overlap with Fisher-projected gradient energy indicates that dominant singular directions provide a useful low-rank structural proxy for loss-sensitive curvature directions. The exact diagnostic is given in Appendix [F.1](https://arxiv.org/html/2605.10973#A6.SS1).
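The energy ratio $y(r)$ from Figure 2 is straightforward to sketch (NumPy; random matrices stand in for the real per-sample gradients $\mathbf{G}_{t}$):

```python
import numpy as np

def projected_energy_ratio(W0, grads, r):
    """Fraction of summed gradient energy captured by the top-r x r SVD block of W0."""
    U, _, Vt = np.linalg.svd(W0, full_matrices=False)
    Ur, Vr = U[:, :r], Vt[:r, :].T
    num = sum(np.linalg.norm(Ur.T @ G @ Vr, 'fro') ** 2 for G in grads)
    den = sum(np.linalg.norm(G, 'fro') ** 2 for G in grads)
    return num / den

# Random stand-ins for per-sample gradient matrices G_t.
rng = np.random.default_rng(0)
W0 = rng.normal(size=(64, 64))
grads = [rng.normal(size=(64, 64)) for _ in range(8)]
ratio_full = projected_energy_ratio(W0, grads, 64)   # the full block captures all energy
```

By construction the ratio is monotone nondecreasing in $r$ and reaches 1 at the full block; the interesting empirical question, answered by Figure 2, is how fast it rises for real gradients.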
Taken together, the rapid-adaptation versus forgetting-mitigation view, the second-order curvature argument, and the Fisher-overlap evidence motivate our method: preserve the dominant pretrained singular directions while still allowing useful task adaptation.
## 4 Method
Based on the prior analysis in Section [3](https://arxiv.org/html/2605.10973#S3), we propose Rotation-Preserving SFT (RPSFT). Intuitively, RPSFT treats dominant singular subspaces as carriers of general-purpose transformations and discourages unnecessary rotation during fine-tuning that can harm generalization. This simple regularizer reduces out-of-domain (OOD) capability loss during supervised fine-tuning by discouraging drift within the dominant singular subspace of each weight matrix. All reported RPSFT results use full-parameter fine-tuning (FPFT). We penalize each selected layer equally.
#### Setup and Notation.

Let $\theta_{0}$ be the pretrained model and let $\mathcal{M}$ denote the selected 2D weight matrices. For each pretrained matrix $\mathbf{W}^{0}\in\mathbb{R}^{m\times n}$, we compute its truncated SVD $\mathbf{W}^{0}=\mathbf{U}^{0}\mathbf{\Sigma}^{0}(\mathbf{V}^{0})^{\top}$ and cache the top-$k$ bases $\mathbf{U}^{(k)}_{0}\in\mathbb{R}^{m\times k}$ and $\mathbf{V}^{(k)}_{0}\in\mathbb{R}^{n\times k}$, where $k$ is the protected rank.
#### Rotation-Preserving Regularization.

For a fine-tuned weight $\mathbf{W}$, define the projected block $S(\mathbf{W})=(\mathbf{U}^{(k)}_{0})^{\top}\mathbf{W}\mathbf{V}^{(k)}_{0}$ and the pretrained reference block $S^{\mathrm{ref}}=S(\mathbf{W}^{0})$. Since the basis is computed from the pretrained SVD, $S^{\mathrm{ref}}$ is the leading $k\times k$ diagonal block of $\mathbf{\Sigma}^{0}$. RPSFT anchors how the current weight acts between the pretrained top-$k$ left and right singular subspaces, while leaving complementary blocks free to adapt.
This viewpoint also clarifies the boundary cases. When $k=0$, the penalty vanishes and RPSFT reduces to standard SFT. When $k=\min(m_{\ell},n_{\ell})$, the penalty becomes full weight-space $\ell_{2}$ anchoring around the pretrained model (Kumar et al., [2024](https://arxiv.org/html/2605.10973#bib.bib15)). We formalize both cases in Appendix [C](https://arxiv.org/html/2605.10973#A3). The practical intuition is that the regularizer suppresses drift along dominant pretrained directions.
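Both boundary cases can be checked numerically (a NumPy sketch on a square toy matrix; `penalty` is our illustrative name, not the released code):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 32
W0 = rng.normal(size=(n, n))                     # square toy "pretrained" matrix
U0, _, V0t = np.linalg.svd(W0, full_matrices=False)
W = W0 + 0.01 * rng.normal(size=(n, n))          # a hypothetical fine-tuned weight

def penalty(W, k):
    """RPSFT projected-block penalty with protected rank k (lambda = 1)."""
    Uk, Vk = U0[:, :k], V0t[:k, :].T
    return np.linalg.norm(Uk.T @ (W - W0) @ Vk, 'fro') ** 2

# k = 0: empty protected block, penalty vanishes (standard SFT).
# k = n: full-rank protection, penalty matches the plain L2 anchor ||W - W0||_F^2.
l2_anchor = np.linalg.norm(W - W0, 'fro') ** 2
```

In the square case the $k=n$ equality is exact because the full singular bases are orthogonal, so the projection preserves the Frobenius norm of the update.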
In the full-parameter setting, the total loss is

$$\mathcal{L}(\theta)=\mathcal{L}_{\text{task}}(\theta)+\lambda\sum_{\ell\in\mathcal{M}}\left\|S_{\ell}(\mathbf{W})-S^{\mathrm{ref}}_{\ell}\right\|_{F}^{2},\tag{3}$$

where $\lambda$ controls the regularization strength. This penalty is zero at initialization and grows only when fine-tuning moves the weight inside the pretrained top-$k$ singular subspace. Appendix Algorithm [1](https://arxiv.org/html/2605.10973#alg1) summarizes the FPFT training procedure, and Appendix [A.2](https://arxiv.org/html/2605.10973#A1.SS2) gives an optional LoRA formulation.
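A minimal precompute-and-penalize sketch for one matrix (NumPy; function names are ours, not the released implementation):

```python
import numpy as np

def precompute_bases(W0, k):
    """Cache pretrained top-k singular bases and the reference block S_ref."""
    U, _, Vt = np.linalg.svd(W0, full_matrices=False)
    Uk, Vk = U[:, :k], Vt[:k, :].T
    return Uk, Vk, Uk.T @ W0 @ Vk        # S_ref: leading k x k block of Sigma^0

def rpsft_penalty(W, Uk, Vk, S_ref, lam=1.0):
    """lambda * || Uk^T W Vk - S_ref ||_F^2 for one regularized matrix."""
    return lam * np.linalg.norm(Uk.T @ W @ Vk - S_ref, 'fro') ** 2

rng = np.random.default_rng(0)
W0 = rng.normal(size=(48, 48))
Uk, Vk, S_ref = precompute_bases(W0, k=8)
U, _, Vt = np.linalg.svd(W0, full_matrices=False)

at_init = rpsft_penalty(W0, Uk, Vk, S_ref)   # zero at initialization
# A rank-1 update inside the protected block is penalized ...
inside = rpsft_penalty(W0 + 0.1 * np.outer(U[:, 0], Vt[0, :]), Uk, Vk, S_ref)
# ... while a same-size update in a complementary direction is free.
outside = rpsft_penalty(W0 + 0.1 * np.outer(U[:, 20], Vt[20, :]), Uk, Vk, S_ref)
```

The two probe updates illustrate the design: only drift that projects into the pretrained top-$k$ block is charged, so complementary directions stay fully available for task adaptation.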
#### Computation costs.

For each regularized matrix $\mathbf{W}^{0}_{\ell}\in\mathbb{R}^{m_{\ell}\times n_{\ell}}$, we store $\mathbf{U}^{(k)}_{0,\ell}$, $\mathbf{V}^{(k)}_{0,\ell}$, and $S^{\mathrm{ref}}_{\ell}$:

$$\text{extra memory per layer}=\mathcal{O}\!\left((m_{\ell}+n_{\ell})k+k^{2}\right).\tag{4}$$

In practice, $k$ is small relative to $m_{\ell},n_{\ell}$, so this overhead is linear in the protected rank and remains manageable under the default bf16 precision.
For each regularized matrix $\mathbf{W}\in\mathbb{R}^{m\times n}$, computing the projected block $(\mathbf{U}^{(k)}_{0})^{\top}\mathbf{W}\mathbf{V}^{(k)}_{0}$ adds $O(mnk)$ FLOPs per evaluation, with the same order for backpropagation. If applied every $s$ optimization steps, the amortized cost becomes $O(mnk/s)$ per step, corresponding to a heuristic relative overhead of roughly $\frac{k}{s\min(m,n)}$.
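Plugging representative shapes into these formulas makes the overhead concrete (a back-of-the-envelope sketch; the 4096-dimensional square shape is illustrative, not an exact model dimension):

```python
def extra_params_per_layer(m, n, k):
    """Cached numbers per regularized matrix: U0 (m*k) + V0 (n*k) + S_ref (k*k)."""
    return (m + n) * k + k * k

def relative_flops_overhead(m, n, k, s=1):
    """Heuristic relative projection cost ~ k / (s * min(m, n)), amortized over s steps."""
    return k / (s * min(m, n))

# e.g. a 4096 x 4096 matrix with the default protected rank k = 768:
params = extra_params_per_layer(4096, 4096, 768)      # ~6.9M extra cached numbers
overhead = relative_flops_overhead(4096, 4096, 768)   # 0.1875 if applied every step
```

In bf16, the cached bases for this example cost about 13 MB per matrix; applying the penalty every few steps instead of every step shrinks the FLOP overhead proportionally.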
## 5 Experiments
#### Benchmarks and evaluation.

The supervised stage is trained on [OpenR1-Math](https://huggingface.co/datasets/wh-zhu/train_openr1_4k) (Hugging Face, [2025](https://arxiv.org/html/2605.10973#bib.bib12)), and downstream DAPO is initialized from those checkpoints using the [DAPO-Math-17k](https://huggingface.co/datasets/wh-zhu/dapo) RL set. In-domain results cover [AIME24](https://huggingface.co/datasets/math-ai/aime24), [AIME25](https://huggingface.co/datasets/math-ai/aime25), [AMC23](https://huggingface.co/datasets/math-ai/amc23), [MATH-500](https://huggingface.co/datasets/HuggingFaceH4/MATH-500) (Hendrycks et al., [2021](https://arxiv.org/html/2605.10973#bib.bib9)), [Minerva Math](https://huggingface.co/datasets/math-ai/minervamath) (Lewkowycz et al., [2022](https://arxiv.org/html/2605.10973#bib.bib19)), and [OlympiadBench](https://huggingface.co/datasets/math-ai/olympiadbench) (He et al., [2024](https://arxiv.org/html/2605.10973#bib.bib8)), while out-of-domain results cover [GPQA](https://huggingface.co/datasets/Idavidrein/gpqa) (Rein et al., [2023](https://arxiv.org/html/2605.10973#bib.bib32)), [IFEval-loose](https://huggingface.co/datasets/google/IFEval) (Zhou et al., [2023](https://arxiv.org/html/2605.10973#bib.bib47)), [MMLU-Pro](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro) (Wang et al., [2024b](https://arxiv.org/html/2605.10973#bib.bib43)), [SuperGPQA](https://huggingface.co/datasets/Maxwell-Jia/SuperGPQA-Astro) (Team et al., [2025](https://arxiv.org/html/2605.10973#bib.bib37)), [Safety Benchmark](https://huggingface.co/datasets/ThWu/safety_benchmark), and [TruthfulQA](https://huggingface.co/datasets/truthfulqa/truthful_qa) (Lin et al., [2022](https://arxiv.org/html/2605.10973#bib.bib21)). The full post-training pipeline and evaluation split are summarized in Figure [4](https://arxiv.org/html/2605.10973#S5.F4).
#### Base models and baselines.

We evaluate three instruction-tuned base checkpoints spanning two model families and scales: Llama-3.1-8B-Instruct, Qwen2.5-7B-Instruct, and Qwen2.5-3B-Instruct. We report the untuned base checkpoint as a reference and compare RPSFT with standard full-parameter SFT, Dynamic Fine-Tuning (DFT) (Wu et al., [2025](https://arxiv.org/html/2605.10973#bib.bib44)), and importance-weighted SFT (IW) (Qin and Springenberg, [2025](https://arxiv.org/html/2605.10973#bib.bib31)). DFT and IW rescale the supervised training gradient by either the token probability or the importance sampling ratio.
Figure 3: Rank-selection sensitivity. The x-axis is the protected rank $k$ and the y-axes report ID Avg@k, ID Pass@k, and OOD Avg@1. $0$ is vanilla SFT, $768$ is our default, Q-L2/L-L2 are full-rank protection for Qwen/Llama (equivalent to L2 Init (Kumar et al., [2024](https://arxiv.org/html/2605.10973#bib.bib15))), and dashed lines show base-model scores. Scores average six ID math benchmarks and six OOD benchmarks.

#### Hyperparameters.

RPSFT has two hyperparameters: the protected rank $k$ (this $k$ should not be confused with the $k$ in Avg@k and Pass@k) and the regularization strength $\lambda$. As shown in Figure [3](https://arxiv.org/html/2605.10973#S5.F3), performance is robust across intermediate ranks, while overly large ranks limit adaptation. Unless stated otherwise, we use $k=768$ and $\lambda=1$. Appendix [C.5](https://arxiv.org/html/2605.10973#A3.SS5) gives the rank-selection rule and full sweeps.
#### Empirical computation costs.

Table [1](https://arxiv.org/html/2605.10973#S5.T1) summarizes the empirical training costs. Although RPSFT requires more peak memory than SFT and DFT because of the precomputed protected bases and projection buffers, it remains below IW and L2 Init in peak allocated memory. RPSFT uses only 6.0% more GPU-hours than SFT while remaining cheaper than IW and L2 Init.
Figure 4: ID/OOD trade-off across SFT and RL stages for RPSFT and baselines. The top row compares ID Avg@k with OOD Avg@k, and the bottom row compares ID Pass@k with OOD Pass@k. Scores follow the settings in Figure [3](https://arxiv.org/html/2605.10973#S5.F3).

Table 1: Empirical computation and memory cost for RPSFT and baselines. Samples/s and Steps/s report training throughput; Peak alloc. and Peak reserved report maximum allocated and CUDA-reserved GPU memory.

### 5.1 Results on Supervised Fine-tuning
**RPSFT improves generalization when possible and reduces forgetting otherwise.** Table [2](https://arxiv.org/html/2605.10973#S5.T2) shows that RPSFT gives the best tuned ID Avg@k on all three checkpoints and the best tuned ID Pass@k overall. On OOD Pass@k, RPSFT exceeds the base checkpoint for all three models, indicating improved generalization after SFT. For OOD Avg@k, RPSFT improves over the base on Llama-3.1-8B, while on the two Qwen models it reduces OOD degradation relative to vanilla SFT. Although DFT shows slightly less OOD Avg@k forgetting on Qwen, it does so by underfitting the ID task: its ID averages are consistently worse than RPSFT's and even fall below the base model in several Qwen summary rows. Thus, RPSFT provides a better ID/OOD trade-off: it preserves or improves OOD behavior without sacrificing task adaptation.
**First-order transfer explains when SFT generalizes or forgets.** Figure [5](https://arxiv.org/html/2605.10973#S5.F5) shows that Llama-8B has mostly positive first-order alignment (Section [3](https://arxiv.org/html/2605.10973#S3)) on both ID and OOD data, explaining why SFT can improve ID performance while also transferring to OOD. In contrast, the Qwen models have weaker and sign-changing OOD first-order signals: although they still adapt to the ID task, OOD updates are less consistently descent-aligned, making them more vulnerable to curvature-driven forgetting. The metric is defined in Appendix [F.2](https://arxiv.org/html/2605.10973#A6.SS2).
Table 2: SFT results on in-domain and out-of-domain benchmarks. "$\Delta$ vs. Base" rows show change relative to the base model. Bold marks the strongest tuned result in the same block, and summary-average rows or columns are marked as $\overline{\mathrm{Avg}}$. For the in-domain benchmarks, we use the same benchmark-specific $k$ for both Avg@k and Pass@k: $k=16$ for AIME24, AIME25, and AMC23, $k=4$ for MATH-500 and OlympiadBench, and $k=8$ for Minerva Math. For out-of-domain benchmarks, we use $k=4$ and report both Avg@k and Pass@k.

Figure 5: Layerwise first-order signal on the in-domain (left) and out-of-domain (right) distributions across Llama-8B, Qwen-7B, and Qwen-3B.
### 5.2 Results on RL Fine-Tuning
Table 3: Downstream RL (DAPO) results on the six ID math benchmarks, following Table [2](https://arxiv.org/html/2605.10973#S5.T2)'s notation. Avg@k and Pass@k blocks use init$\rightarrow$DAPO entries; bold marks the highest final value among methods.

**Stronger SFT initializers transfer better to downstream RL.** Table [3](https://arxiv.org/html/2605.10973#S5.T3) shows that RPSFT gives the best final ID Avg@k and Pass@k on Llama-3.1-8B and Qwen2.5-7B. On Qwen2.5-3B, RPSFT still gives the best final Avg@k, while IW is marginally higher on Pass@k (53.45 vs. 53.26). Thus, the ID/OOD trade-off improved by RPSFT at the SFT stage generally carries into RLFT rather than being washed out by DAPO.
Table 4: Downstream RL (DAPO) results on the six OOD benchmarks, following Table [2](https://arxiv.org/html/2605.10973#S5.T2)'s notation. Avg@k and Pass@k blocks use init$\rightarrow$DAPO entries with $k=4$; final checkpoints are selected by ID performance.

**RPSFT gives the strongest final OOD RL performance.** Table [4](https://arxiv.org/html/2605.10973#S5.T4) shows that RPSFT achieves the best final OOD Avg@k and Pass@k summary averages across the three models. IW often has a larger init$\rightarrow$DAPO gain, suggesting that RL can partially recover OOD behavior from a weaker SFT initializer. That said, RPSFT usually starts from a stronger OOD checkpoint and remains best after RL, indicating that preserving pretrained singular structure provides a better OOD initialization.
### 5.3 Geometric and Representation Drift
RPSFT is designed to preserve dominant pretrained structure, so we evaluate drift at both the weight and representation levels. First, we measure how much fine-tuning rotates the dominant left-singular subspaces of pretrained weights across architectures; smaller values indicate that the tuned model remains closer to the pretrained model. We then test whether this weight-space stability carries over to hidden representations by comparing each tuned model's hidden states against the base model. The rotation and hidden-state drift metrics are defined in Appendices [F.3](https://arxiv.org/html/2605.10973#A6.SS3) and [F.6](https://arxiv.org/html/2605.10973#A6.SS6).
Figure 6: Layerwise U-space rotation across three models. Each panel averages the principal-angle rotation over `q_proj`, `k_proj`, `v_proj`, `o_proj`, `up_proj`, `down_proj`, and `gate_proj`.

Figure 7: Rankwise U-space rotation for one representative weight matrix in each model. The x-axis increases the prefix rank from 0 to 512, and the y-axis reports the mean principal-angle rotation of the first $k$ left singular vectors.

**RPSFT consistently reduces dominant-subspace rotation.** Figures [6](https://arxiv.org/html/2605.10973#S5.F6) and [7](https://arxiv.org/html/2605.10973#S5.F7) show that RPSFT keeps tuned weights closer to the pretrained singular subspaces across all three models. The reduction appears through most layers and across the leading-to-middle rank range, indicating that RPSFT stabilizes the dominant directions targeted by the regularizer.
Figure 8: Hidden-state drift on Qwen2.5-3B-Instruct. Each panel shows centroid shifts away from the base model in a 2D PCA view. RPSFT stays among the closest tuned centroids to the base model, while SFT, IW, and especially DFT often shift farther. Entropy analyses are provided in Appendix [F.5](https://arxiv.org/html/2605.10973#A6.SS5).

**Weight-space stability translates into representation-space stability.** Figure [8](https://arxiv.org/html/2605.10973#S5.F8) shows that RPSFT keeps tuned hidden-state centroids closest to the base model across benchmark panels, while SFT, IW, and especially DFT drift farther away. This suggests that constraining dominant-subspace rotation helps preserve pretrained representations rather than only reducing a weight-space metric. The same pattern holds for Llama-8B and Qwen2.5-7B in Appendix Figures [12](https://arxiv.org/html/2605.10973#A6.F12) and [13](https://arxiv.org/html/2605.10973#A6.F13).
## 6 Conclusion
We presented RPSFT, a supervised fine-tuning regularizer that anchors the projected top-$k$ singular block of pretrained weight matrices. Across Llama and Qwen models, RPSFT improves the ID/OOD trade-off over standard SFT and strong baselines, and provides stronger initializations for downstream RLFT. Our analyses show a consistent mechanism: RPSFT reduces dominant-subspace rotation in weight space and better preserves hidden-state geometry, suggesting that controlling drift in pretrained singular subspaces is a practical way to mitigate forgetting while retaining task adaptation.
#### Limitations and future work.

RPSFT applies the same protected-rank rule broadly across weight matrices. A promising direction is to regularize layers more selectively by focusing on the early self-attention layers or other layers with the strongest Fisher–SVD overlap. This could further reduce overhead while preserving the main benefit, and is supported by recent evidence that fine-tuning effects are often layer-localized rather than uniformly distributed across the whole model (Shi et al., [2025](https://arxiv.org/html/2605.10973#bib.bib36); Zhao et al., [2026](https://arxiv.org/html/2605.10973#bib.bib46)). In terms of experiments, we focus on math training datasets with 3B–8B models; future work can extend to non-math training data and much larger model scales.
## References
- Bai et al. (2023) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, et al. Qwen technical report, 2023. URL [https://arxiv.org/abs/2309.16609](https://arxiv.org/abs/2309.16609).
- Chung et al. (2024) Wesley Chung, Lynn Cherif, David Meger, and Doina Precup. Parseval regularization for continual reinforcement learning, 2024. URL [https://arxiv.org/abs/2412.07224](https://arxiv.org/abs/2412.07224).
- DeepSeek-AI (2025) DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in large language models via reinforcement learning. *arXiv preprint arXiv:2501.12948*, 2025.
- Elsayed et al. (2024) Mohamed Elsayed, Qingfeng Lan, Clare Lyle, and A. Rupam Mahmood. Weight clipping for deep continual and reinforcement learning, 2024. URL [https://arxiv.org/abs/2407.01704](https://arxiv.org/abs/2407.01704).
- Franke et al. (2024) Jörg K. H. Franke, Michael Hefenbrock, and Frank Hutter. Preserving principal subspaces to reduce catastrophic forgetting in fine-tuning. In *ICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models*, 2024. URL [https://openreview.net/forum?id=XoWtroECJU](https://openreview.net/forum?id=XoWtroECJU).
- Grattafiori et al. (2024) Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. *arXiv preprint arXiv:2407.21783*, 2024.
- Haink (2023) David Haink. Hessian eigenvectors and principal component analysis of neural network weight matrices, 2023. URL [https://arxiv.org/abs/2311.00452](https://arxiv.org/abs/2311.00452).
- He et al. (2024) Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems, 2024. URL [https://arxiv.org/abs/2402.14008](https://arxiv.org/abs/2402.14008).
- Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset, 2021. URL [https://arxiv.org/abs/2103.03874](https://arxiv.org/abs/2103.03874).
- Hu et al. (2021) Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models, 2021. URL [https://arxiv.org/abs/2106.09685](https://arxiv.org/abs/2106.09685).
- Huan et al. (2025) Maggie Huan, Yuetai Li, Tuney Zheng, Xiaoyu Xu, Seungone Kim, Minxin Du, Radha Poovendran, Graham Neubig, and Xiang Yue. Does math reasoning improve general LLM capabilities? Understanding transferability of LLM reasoning, 2025. URL [https://arxiv.org/abs/2507.00432](https://arxiv.org/abs/2507.00432).
- Hugging Face (2025) Hugging Face. Open R1: A fully open reproduction of DeepSeek-R1, January 2025. URL [https://github.com/huggingface/open-r1](https://github.com/huggingface/open-r1).
- Jin et al. (2025) Hangzhan Jin, Sitao Luan, Sicheng Lyu, Guillaume Rabusseau, Reihaneh Rabbany, Doina Precup, and Mohammad Hamdaqa. RL fine-tuning heals OOD forgetting in SFT, 2025. URL [https://arxiv.org/abs/2509.12235](https://arxiv.org/abs/2509.12235).
- Kirkpatrick et al. (2017) James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. *Proceedings of the National Academy of Sciences*, 114(13):3521–3526, March 2017. ISSN 1091-6490. doi: 10.1073/pnas.1611835114. URL [http://dx.doi.org/10.1073/pnas.1611835114](http://dx.doi.org/10.1073/pnas.1611835114).
- Kumar et al. (2024) Saurabh Kumar, Henrik Marklund, and Benjamin Van Roy. Maintaining plasticity in continual learning via regenerative regularization, 2024. URL [https://arxiv.org/abs/2308.11958](https://arxiv.org/abs/2308.11958).
- Kunstner et al. (2019) Frederik Kunstner, Philipp Hennig, and Lukas Balles. Limitations of the empirical Fisher approximation for natural gradient descent. In *Advances in Neural Information Processing Systems 32*, pages 4158–4169, 2019.
- Lai et al. (2025) Song Lai, Haohan Zhao, Rong Feng, Changyi Ma, Wenzhuo Liu, Hongbo Zhao, Xi Lin, Dong Yi, Min Xie, Qingfu Zhang, Hongbin Liu, Gaofeng Meng, and Fei Zhu. Reinforcement fine-tuning naturally mitigates forgetting in continual post-training, 2025. URL [https://arxiv.org/abs/2507.05386](https://arxiv.org/abs/2507.05386).
- Lewandowski et al. (2024) Alex Lewandowski, Michal Bortkiewicz, Saurabh Kumar, Andras Gyorgy, Dale Schuurmans, Mateusz Ostaszewski, and Marlos C. Machado. Learning continually by spectral regularization, 2024. URL [https://arxiv.org/abs/2406.06811](https://arxiv.org/abs/2406.06811).
- Lewkowycz et al. (2022) Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models, 2022. URL [https://arxiv.org/abs/2206.14858](https://arxiv.org/abs/2206.14858).
- Lin et al. (2025) Jiacheng Lin, Zhongruo Wang, Kun Qian, et al. SFT doesn't always hurt general capabilities: Revisiting domain-specific fine-tuning in LLMs, 2025. URL [https://arxiv.org/abs/2509.20758](https://arxiv.org/abs/2509.20758).
- Lin et al. (2022) Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods, 2022. URL [https://arxiv.org/abs/2109.07958](https://arxiv.org/abs/2109.07958).
- Lu et al. (2025) Aojun Lu, Hangjie Yuan, Tao Feng, and Yanan Sun. Rethinking the stability-plasticity trade-off in continual learning from an architectural perspective, 2025. URL [https://arxiv.org/abs/2506.03951](https://arxiv.org/abs/2506.03951).
- Lyle et al. (2023) Clare Lyle, Zeyu Zheng, Evgenii Nikishin, Bernardo Avila Pires, Razvan Pascanu, and Will Dabney. Understanding plasticity in neural networks, 2023. URL [https://arxiv.org/abs/2303.01486](https://arxiv.org/abs/2303.01486).
- Martens (2010) James Martens. Deep learning via Hessian-free optimization. In *Proceedings of the 27th International Conference on Machine Learning*, pages 735–742, 2010.
- Martens (2020) James Martens. New insights and perspectives on the natural gradient method, 2020. URL [https://arxiv.org/abs/1412.1193](https://arxiv.org/abs/1412.1193).
- Meng et al. (2025) Fanxu Meng, Zhaohui Wang, and Muhan Zhang. PiSSA: Principal singular values and singular vectors adaptation of large language models, 2025. URL [https://arxiv.org/abs/2404.02948](https://arxiv.org/abs/2404.02948).
- Mukherjee et al. (2025) Sagnik Mukherjee, Lifan Yuan, Dilek Hakkani-Tur, and Hao Peng. Reinforcement learning finetunes small subnetworks in large language models, 2025. URL [https://arxiv.org/abs/2505.11711](https://arxiv.org/abs/2505.11711).
- Nayak et al. (2025) Nikhil Shivakumar Nayak, Krishnateja Killamsetty, Ligong Han, Abhishek Bhandwaldar, Prateek Chanda, Kai Xu, Hao Wang, Aldo Pareja, Oleg Silkin, Mustafa Eyceoz, and Akash Srivastava. Sculpting subspaces: Constrained full fine-tuning in LLMs for continual learning, 2025. URL [https://arxiv.org/abs/2504.07097](https://arxiv.org/abs/2504.07097).
- Ni et al. (2025) Tianwei Ni, Allen Nie, Sapana Chaudhary, Yao Liu, Huzefa Rangwala, and Rasool Fakoor. Offline learning and forgetting for reasoning with large language models, 2025. URL [https://arxiv.org/abs/2504.11364](https://arxiv.org/abs/2504.11364).
- OpenAI et al. (2024) OpenAI, Aaron Jaech, Adam Kalai, Adam Lerer, et al. OpenAI o1 system card, 2024. URL [https://arxiv.org/abs/2412.16720](https://arxiv.org/abs/2412.16720).
- Qin and Springenberg (2025) Chongli Qin and Jost Tobias Springenberg. Supervised fine tuning on curated data is reinforcement learning (and can be improved), 2025. URL [https://arxiv.org/abs/2507.12856](https://arxiv.org/abs/2507.12856).
- Rein et al. (2023) David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark, 2023. URL [https://arxiv.org/abs/2311.12022](https://arxiv.org/abs/2311.12022).
- Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017. URL [https://arxiv.org/abs/1707.06347](https://arxiv.org/abs/1707.06347).
- Shao et al. (2024) Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models, 2024. URL [https://arxiv.org/abs/2402.03300](https://arxiv.org/abs/2402.03300).
- Shenfeld et al. (2025) Idan Shenfeld, Jyothish Pari, and Pulkit Agrawal. RL's razor: Why online reinforcement learning forgets less, 2025. URL [https://arxiv.org/abs/2509.04259](https://arxiv.org/abs/2509.04259).
- Shi et al. (2025) Guangyuan Shi, Zexin Lu, Xiaoyu Dong, Wenlong Zhang, Xuanyu Zhang, Yujie Feng, and Xiao-Ming Wu. Understanding layer significance in LLM alignment, 2025. URL [https://arxiv.org/abs/2410.17875](https://arxiv.org/abs/2410.17875).
- Team et al. (2025) P Team, Xinrun Du, Yifan Yao, Kaijing Ma, et al. SuperGPQA: Scaling LLM evaluation across 285 graduate disciplines, 2025. URL [https://arxiv.org/abs/2502.14739](https://arxiv.org/abs/2502.14739).
- Team (2024) Qwen Team. Qwen2.5: A party of foundation models, September 2024. URL [https://qwenlm.github.io/blog/qwen2.5/](https://qwenlm.github.io/blog/qwen2.5/).
- Wang et al. (2025a) Chenxu Wang, Yilin Lyu, Zicheng Sun, and Liping Jing. Continual gradient low-rank projection fine-tuning for LLMs, 2025a. URL [https://arxiv.org/abs/2507.02503](https://arxiv.org/abs/2507.02503).
- Wang et al. (2025b) Hanqing Wang, Yixia Li, Shuo Wang, Guanhua Chen, and Yun Chen. MiLoRA: Harnessing minor singular components for parameter-efficient LLM finetuning, 2025b. URL [https://arxiv.org/abs/2406.09044](https://arxiv.org/abs/2406.09044).
- Wang et al. (2024a) Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. A comprehensive survey of continual learning: Theory, method and application, 2024a. URL [https://arxiv.org/abs/2302.00487](https://arxiv.org/abs/2302.00487).
- Wang et al. (2023) Xiao Wang, Tianze Chen, Qiming Ge, Han Xia, Rong Bao, Rui Zheng, Qi Zhang, Tao Gui, and Xuanjing Huang. Orthogonal subspace learning for language model continual learning, 2023. URL [https://arxiv.org/abs/2310.14152](https://arxiv.org/abs/2310.14152).
- Wang et al. (2024b) Yubo Wang, Xueguang Ma, Ge Zhang, et al. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark, 2024b. URL [https://arxiv.org/abs/2406.01574](https://arxiv.org/abs/2406.01574).
- Wu et al. (2025) Yongliang Wu, Yizhou Zhou, Zhou Ziheng, Yingzhe Peng, Xinyu Ye, Xinting Hu, Wenbo Zhu, Lu Qi, Ming-Hsuan Yang, and Xu Yang. On the generalization of SFT: A reinforcement learning perspective with reward rectification, 2025. URL [https://arxiv.org/abs/2508.05629](https://arxiv.org/abs/2508.05629).
- Yu et al. (2025) Qiying Yu, Zheng Zhang, Ruofei Zhu, et al. DAPO: An open-source LLM reinforcement learning system at scale, 2025. URL [https://arxiv.org/abs/2503.14476](https://arxiv.org/abs/2503.14476).
- Zhao et al. (2026) Qinghua Zhao, Xueling Gong, Xinyu Chen, Zhongfeng Kang, and Xinlu Li. A layer-wise analysis of supervised fine-tuning, 2026. URL [https://arxiv.org/abs/2604.11838](https://arxiv.org/abs/2604.11838).
- Zhou et al. (2023) Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models, 2023. URL [https://arxiv.org/abs/2311.07911](https://arxiv.org/abs/2311.07911).
- Zhu et al. (2025a) Hanqing Zhu, Zhenyu Zhang, Hanxian Huang, DiJia Su, Zechun Liu, Jiawei Zhao, Igor Fedorov, Hamed Pirsiavash, Jinwon Lee, David Z. Pan, Zhangyang Wang, Yuandong Tian, and Kai Sheng Tai. Why RL updates look sparse: An implicit compass drives optimization bias, 2025a. URL [https://openreview.net/forum?id=Q4mF4tLGbf](https://openreview.net/forum?id=Q4mF4tLGbf). Submitted to ICLR 2026.
- Zhu et al. (2025b) Hanqing Zhu, Zhenyu Zhang, Hanxian Huang, DiJia Su, Zechun Liu, Jiawei Zhao, Igor Fedorov, Hamed Pirsiavash, Zhizhou Sha, Jinwon Lee, David Z. Pan, Zhangyang Wang, Yuandong Tian, and Kai Sheng Tai. The path not taken: RLVR provably learns off the principals, 2025b. URL [https://arxiv.org/abs/2511.08567](https://arxiv.org/abs/2511.08567).
- Zhu et al. (2025c) Wenhong Zhu, Ruobing Xie, Rui Wang, Xingwu Sun, Di Wang, and Pengfei Liu. Proximal supervised fine-tuning, 2025c. URL [https://arxiv.org/abs/2508.17784](https://arxiv.org/abs/2508.17784).
## Appendix A. Algorithms
**Algorithm 1: Rotation-Preserving SFT (RPSFT)**

Require: pretrained weights $\theta_0=\{\mathbf{W}^0_{\ell}\}$, SFT dataset $\mathcal{D}$, regularized layer set $\mathcal{M}_{\mathrm{reg}}$, rank $k$, regularization coefficient $\lambda$, update period $s$.

1: Precompute: for each $\ell\in\mathcal{M}_{\mathrm{reg}}$, compute $(\mathbf{U}^{(k)}_{0,\ell},\mathbf{V}^{(k)}_{0,\ell})$ from the SVD of $\mathbf{W}^0_{\ell}$; set $S^{\mathrm{ref}}_{\ell}\leftarrow(\mathbf{U}^{(k)}_{0,\ell})^{\top}\mathbf{W}^0_{\ell}\mathbf{V}^{(k)}_{0,\ell}$
2: Initialize $\theta\leftarrow\theta_0$
3: for supervised training step $t=1,2,\dots$ do
4: — Sample a minibatch $(x,y)\sim\mathcal{D}$
5: — Compute the standard SFT loss $\mathcal{L}_{\mathrm{SFT}}$
6: — $\mathcal{L}_{\mathrm{reg}}\leftarrow 0$
7: — for each $\ell\in\mathcal{M}_{\mathrm{reg}}$ do
8: — — $S\leftarrow(\mathbf{U}^{(k)}_{0,\ell})^{\top}\mathbf{W}_{\ell}\mathbf{V}^{(k)}_{0,\ell}$
9: — — $\mathcal{L}_{\mathrm{reg}}\leftarrow\mathcal{L}_{\mathrm{reg}}+\|S-S^{\mathrm{ref}}_{\ell}\|_F^2$
10: — end for
11: — $\mathcal{L}\leftarrow\mathcal{L}_{\mathrm{SFT}}+\lambda\,\mathcal{L}_{\mathrm{reg}}$
12: — Update $\theta\leftarrow\theta-\eta\,\nabla_{\theta}\mathcal{L}$
13: end for
14: return $\theta$
**Illustrative PyTorch implementation.** The core implementation caches the pretrained singular bases once and adds a projected-block penalty to the usual SFT loss:

```python
import torch

# One-time precomputation: cache the top-k singular bases and the
# reference projected block of each pretrained weight matrix.
svd_cache = []
with torch.no_grad():
    for W0 in pretrained_layers:
        U, _, Vh = torch.linalg.svd(W0.float(), full_matrices=False)
        Uk = U[:, :k].to(device=W0.device, dtype=W0.dtype)
        Vk = Vh[:k].T.to(device=W0.device, dtype=W0.dtype)
        S0 = Uk.T @ W0 @ Vk  # reference block S_ref
        svd_cache.append((Uk, Vk, S0))

# Inside the training loop: apply the penalty every s steps (update period).
reg_loss = task_loss.new_zeros(())
if step % s == 0:
    for W, (Uk, Vk, S0) in zip(model_layers, svd_cache):
        S = Uk.T @ W @ Vk
        reg_loss = reg_loss + (S - S0).pow(2).sum()
loss = task_loss + lam * reg_loss
loss.backward()
```
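As a sanity check (a toy sketch with hypothetical shapes, not the paper's training configuration), the cached-basis penalty can be probed on a random layer: it vanishes at the pretrained weights, charges drift inside the protected top-$k$ block, and ignores updates in the orthogonal complement:

```python
import torch

torch.manual_seed(0)
k = 4

# Toy "pretrained" layer and its cached top-k singular basis.
W0 = torch.randn(16, 12)
U, _, Vh = torch.linalg.svd(W0, full_matrices=False)
Uk, Vk = U[:, :k], Vh[:k].T
S0 = Uk.T @ W0 @ Vk  # cached reference block

def rpsft_penalty(W):
    """Projected-block drift penalty ||Uk^T (W - W0) Vk||_F^2."""
    S = Uk.T @ W @ Vk
    return (S - S0).pow(2).sum()

# The penalty vanishes at the pretrained weights.
print(rpsft_penalty(W0).item())

# It charges an update that lives inside the protected block ...
W_in = W0 + Uk @ torch.ones(k, k) @ Vk.T

# ... and ignores an update in the orthogonal complement of Uk.
U_perp = torch.linalg.svd(torch.eye(16) - Uk @ Uk.T, full_matrices=False)[0][:, : 16 - k]
W_out = W0 + U_perp @ torch.randn(16 - k, 12)
print(rpsft_penalty(W_in).item(), rpsft_penalty(W_out).item())
```

This mirrors the block-decomposition view in Appendix C: only movement of the top-left projected block is penalized.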
### A.1 SVD Notation
For a weight matrix $\mathbf{W}\in\mathbb{R}^{d_{\text{out}}\times d_{\text{in}}}$, the singular value decomposition writes $\mathbf{W}=\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\top}$, where $d_{\text{out}}$ and $d_{\text{in}}$ are the output and input dimensions, $\mathbf{U}\in\mathbb{R}^{d_{\text{out}}\times r}$ and $\mathbf{V}\in\mathbb{R}^{d_{\text{in}}\times r}$ have orthonormal columns, $\boldsymbol{\Sigma}\in\mathbb{R}^{r\times r}$ is the diagonal matrix of singular values in descending order, and $r=\operatorname{rank}(\mathbf{W})$. The leading singular directions capture the dominant transformations implemented by the model and define the pretrained subspaces that RPSFT protects.
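As a quick numerical illustration of this notation (a minimal NumPy sketch with arbitrary dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, k = 8, 6, 3
W = rng.standard_normal((d_out, d_in))

# Thin SVD: W = U @ diag(sigma) @ V.T with orthonormal columns in U and V.
U, sigma, Vh = np.linalg.svd(W, full_matrices=False)
V = Vh.T
r = min(d_out, d_in)  # a Gaussian matrix is full rank almost surely

assert np.allclose(W, U @ np.diag(sigma) @ V.T)  # reconstruction
assert np.allclose(U.T @ U, np.eye(r))           # orthonormal columns
assert np.allclose(V.T @ V, np.eye(r))
assert np.all(np.diff(sigma) <= 0)               # descending singular values

# The top-k singular directions span the protected pretrained subspaces.
Uk, Vk = U[:, :k], V[:, :k]
print(Uk.shape, Vk.shape)
```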
### A.2 LoRA Objective

For completeness, RPSFT can be applied to LoRA by adding the same projected-block drift penalty to the effective weight. This optional variant is not used in the reported experiments; the LoRA rows in Appendix [D](https://arxiv.org/html/2605.10973#A4) are vanilla LoRA baselines with adapter rank $r=32$.

$$\mathcal{L}(A,B)=\mathcal{L}_{\text{task}}(A,B)+\lambda\sum_{\ell\in\mathcal{M}^{\prime}}\left\|S_{\ell}(\mathbf{W}^{0}_{\ell}+\Delta\mathbf{W}_{\ell})-S^{\mathrm{ref}}_{\ell}\right\|_{F}^{2},\tag{5}$$

where $\Delta\mathbf{W}_{\ell}=B_{\ell}A_{\ell}$. This keeps the regularizer aligned with FPFT: it penalizes drift inside the pretrained dominant subspace even when only low-rank parameters are trained.
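A minimal sketch of this optional variant (hypothetical shapes and names; not the configuration used in the reported experiments), evaluating the penalty on the effective weight $\mathbf{W}^{0}+BA$:

```python
import torch

torch.manual_seed(0)
d_out, d_in, r, k, lam = 16, 12, 4, 4, 1.0

# Pretrained weight and its cached top-k basis / reference block.
W0 = torch.randn(d_out, d_in)
U, _, Vh = torch.linalg.svd(W0, full_matrices=False)
Uk, Vk = U[:, :k], Vh[:k].T
S_ref = Uk.T @ W0 @ Vk

# LoRA factors: only A and B are trainable; W0 stays frozen.
A = torch.zeros(r, d_in, requires_grad=True)
B = torch.randn(d_out, r, requires_grad=True)

def projected_drift_penalty(A, B):
    W_eff = W0 + B @ A  # effective weight W0 + Delta W, with Delta W = B A
    S = Uk.T @ W_eff @ Vk
    return ((S - S_ref) ** 2).sum()

# At A = 0 the effective weight equals W0, so the penalty is exactly zero.
print(projected_drift_penalty(A, B).item())

# The penalty is differentiable in the adapter parameters alone.
loss = lam * projected_drift_penalty(A, B)  # added to the usual task loss
loss.backward()
print(A.grad.shape, B.grad.shape)
```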
## Appendix B. Related Work
### B.1 Traditional Continual-Learning Regularization
Continual-learning surveys commonly separate replay, architecture expansion, and regularization-based approaches [Wang et al., [2024a](https://arxiv.org/html/2605.10973#bib.bib41)]. Our work is closest to the regularization family: rather than storing old data or changing the model architecture, these methods modify the optimization objective or update rule to control parameter drift. A classical example is Elastic Weight Consolidation (EWC) [Kirkpatrick et al., [2017](https://arxiv.org/html/2605.10973#bib.bib14)], which uses a Fisher-weighted quadratic penalty to discourage movement of important parameters. A simpler related baseline is anchoring or regenerative regularization, which penalizes uniform Euclidean movement from a reference model, typically through a term proportional to $\|\theta-\theta_{0}\|_{2}^{2}$ [Kumar et al., [2024](https://arxiv.org/html/2605.10973#bib.bib15)]. Other recent continual-learning regularizers control the weight geometry more directly: spectral regularization keeps the maximum singular value of each layer near one [Lewandowski et al., [2024](https://arxiv.org/html/2605.10973#bib.bib18)], weight clipping bounds weight magnitudes after each update [Elsayed et al., [2024](https://arxiv.org/html/2605.10973#bib.bib4)], and Parseval regularization encourages near-orthogonal weight matrices [Chung et al., [2024](https://arxiv.org/html/2605.10973#bib.bib2)]. These methods motivate regularizing the geometry of the update, but they do not target the pretrained singular directions that are most tied to OOD forgetting in our analysis.
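To make the contrast with uniform anchoring concrete, a small NumPy sketch (illustrative shapes, not taken from the paper) compares the $\ell_2$ anchor with the projected-block penalty on an update that lives entirely outside the pretrained top-$k$ block: the anchor charges any movement, while the projected penalty ignores it.

```python
import numpy as np

rng = np.random.default_rng(0)
W0 = rng.standard_normal((10, 8))
U, _, Vh = np.linalg.svd(W0, full_matrices=False)
k = 3
Uk, Vk = U[:, :k], Vh[:k].T

# An update living entirely outside the protected top-k block.
U_perp, V_perp = U[:, k:], Vh[k:].T
delta = U_perp @ rng.standard_normal((U_perp.shape[1], V_perp.shape[1])) @ V_perp.T
W = W0 + delta

l2_anchor = np.sum((W - W0) ** 2)                # ||theta - theta0||^2 penalizes it
rpsft_pen = np.sum((Uk.T @ (W - W0) @ Vk) ** 2)  # projected-block penalty does not
print(l2_anchor, rpsft_pen)
```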
### B.2 LLM Post-Training and Reasoning Generalization
Recent LLM studies show that reasoning post-training can either generalize or over-specialize depending on the learning signal and update geometry. In particular, math-only post-training often transfers imperfectly: SFT can induce representation and output drift, whereas RL tends to preserve broader capabilities [Huan et al., [2025](https://arxiv.org/html/2605.10973#bib.bib11)]. This is consistent with findings that RL fine-tuning mitigates forgetting through reward-variance scaling [Lai et al., [2025](https://arxiv.org/html/2605.10973#bib.bib17)], that online RL is biased toward KL-minimal solutions [Shenfeld et al., [2025](https://arxiv.org/html/2605.10973#bib.bib35)], and that RL updates may be sparse, spectrum-preserving, or off-principal in parameter space [Mukherjee et al., [2025](https://arxiv.org/html/2605.10973#bib.bib27), Zhu et al., [2025a](https://arxiv.org/html/2605.10973#bib.bib48), [b](https://arxiv.org/html/2605.10973#bib.bib49)]. Offline reasoning studies also show that naive SFT can harm search capability and that smaller learning rates can reduce degradation [Ni et al., [2025](https://arxiv.org/html/2605.10973#bib.bib29), Lin et al., [2025](https://arxiv.org/html/2605.10973#bib.bib20)]. These results motivate our focus on the SFT stage: RPSFT explicitly controls SFT-induced movement in dominant pretrained singular subspaces, aiming for rapid task adaptation without the representation drift that later RL analyses suggest is avoidable.
Several LLM fine-tuning methods are close to this goal. Proximal SFT constrains policy drift with PPO-style updates [Zhu et al., [2025c](https://arxiv.org/html/2605.10973#bib.bib50)]; Dynamic SFT and importance-weighted SFT rescale token or example losses [Wu et al., [2025](https://arxiv.org/html/2605.10973#bib.bib44), Qin and Springenberg, [2025](https://arxiv.org/html/2605.10973#bib.bib31)]; and subspace-constrained methods project gradients away from protected directions, including principal-subspace preservation and gradient low-rank projection [Franke et al., [2024](https://arxiv.org/html/2605.10973#bib.bib5), Nayak et al., [2025](https://arxiv.org/html/2605.10973#bib.bib28), Wang et al., [2025a](https://arxiv.org/html/2605.10973#bib.bib39)]. Orthogonality-based PEFT methods such as O-LoRA similarly reduce interference by constraining the adapter subspace [Wang et al., [2023](https://arxiv.org/html/2605.10973#bib.bib42)]. Spectral PEFT methods such as PiSSA and MiLoRA use singular directions for efficient adaptation [Meng et al., [2025](https://arxiv.org/html/2605.10973#bib.bib26), Wang et al., [2025b](https://arxiv.org/html/2605.10973#bib.bib40)], but their goal is parameter efficiency rather than explicitly mitigating OOD forgetting. In contrast, RPSFT keeps the full fine-tuning parameterization and adds a soft projected-block penalty in the pretrained singular-vector basis.
### B.3 Comparison with Regularization Methods
Let $\theta_{0}$ denote the pretrained parameters, and let $\mathbf{W}^{0}_{\ell}=\mathbf{U}^{0}_{\ell}\boldsymbol{\Sigma}^{0}_{\ell}(\mathbf{V}^{0}_{\ell})^{\top}$ be the SVD of layer $\ell$. Table [5](https://arxiv.org/html/2605.10973#A2.T5) compares methods whose update rules or objectives are explicit enough to allow a direct theoretical comparison. We omit works that are primarily diagnostic or serve as empirical motivation and do not define a comparable regularizer.
Table 5: Comparison with related regularization and subspace-preservation methods.
## Appendix C. Theory: Properties of RPSFT Regularization
### C.1 Setup and Notation

Consider a weight matrix $\mathbf{W}\in\mathbb{R}^{m\times n}$ with pretrained initialization $\mathbf{W}_{0}$. Let the (thin) SVD of $\mathbf{W}_{0}$ be

$$\mathbf{W}_{0}=\mathbf{U}_{0}\boldsymbol{\Sigma}_{0}\mathbf{V}_{0}^{\top},\tag{6}$$

where $\mathbf{U}_{0}\in\mathbb{R}^{m\times r}$ and $\mathbf{V}_{0}\in\mathbb{R}^{n\times r}$ have orthonormal columns, $\boldsymbol{\Sigma}_{0}\in\mathbb{R}^{r\times r}$ is diagonal, and $r=\mathrm{rank}(\mathbf{W}_{0})$. Let $\mathbf{U}_{k}\in\mathbb{R}^{m\times k}$ and $\mathbf{V}_{k}\in\mathbb{R}^{n\times k}$ denote the top-$k$ left and right singular vectors of $\mathbf{W}_{0}$. Define the linear operator

$$\mathcal{P}_{k}(\mathbf{W})\triangleq\mathbf{U}_{k}^{\top}\mathbf{W}\mathbf{V}_{k}\in\mathbb{R}^{k\times k}.\tag{7}$$

RPSFT adds the penalty

$$\mathcal{R}_{k}(\mathbf{W})\triangleq\left\|\mathcal{P}_{k}(\mathbf{W})-\mathcal{P}_{k}(\mathbf{W}_{0})\right\|_{F}^{2}=\left\|\mathbf{U}_{k}^{\top}(\mathbf{W}-\mathbf{W}_{0})\mathbf{V}_{k}\right\|_{F}^{2}.\tag{8}$$

Given a task loss $f(\mathbf{W})$ (e.g., the SFT objective), the RPSFT objective is

$$F_{\lambda}(\mathbf{W})\triangleq f(\mathbf{W})+\lambda\,\mathcal{R}_{k}(\mathbf{W}),\qquad\lambda\geq 0.\tag{9}$$
### C.2 Boundary Cases and Limiting Behavior

We first record basic limiting cases that clarify how $\lambda$ and $k$ interpolate between standard SFT and constrained training.

#### Case $\lambda\to 0$.

As $\lambda\to 0$, $F_{\lambda}(\mathbf{W})\to f(\mathbf{W})$ and RPSFT reduces to the original SFT objective.

#### Case $\lambda\to\infty$.

For $\lambda\to\infty$, minimizers of $F_{\lambda}$ converge to solutions of the constrained problem

$$\min_{\mathbf{W}}\ f(\mathbf{W})\quad\text{s.t.}\quad\mathcal{P}_{k}(\mathbf{W})=\mathcal{P}_{k}(\mathbf{W}_{0})\ \Longleftrightarrow\ \mathbf{U}_{k}^{\top}(\mathbf{W}-\mathbf{W}_{0})\mathbf{V}_{k}=0,\tag{10}$$

which fixes the component of $\mathbf{W}$ acting between the pretrained top-$k$ left and right singular subspaces while leaving all other components unconstrained.

#### Case $k=0$.

$\mathcal{R}_{0}(\mathbf{W})\equiv 0$, hence RPSFT is exactly standard SFT for any $\lambda$.

#### Case $k=\min(m,n)$ (full rank).

Let $k=\min(m,n)$ and let $\mathbf{U}\in\mathbb{R}^{m\times m}$, $\mathbf{V}\in\mathbb{R}^{n\times n}$ be any orthogonal completions of the singular vector bases of $\mathbf{W}_{0}$. Then, by orthogonal invariance of the Frobenius norm,

$$\mathcal{R}_{k}(\mathbf{W})=\|\mathbf{U}^{\top}(\mathbf{W}-\mathbf{W}_{0})\mathbf{V}\|_{F}^{2}=\|\mathbf{W}-\mathbf{W}_{0}\|_{F}^{2},\tag{11}$$

so RPSFT becomes standard $\ell_{2}$ anchoring around the pretrained weights.
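Both boundary cases in $k$ are easy to verify numerically (a NumPy sketch with arbitrary shapes):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 4
W0 = rng.standard_normal((m, n))
W = W0 + rng.standard_normal((m, n))  # some fine-tuned weights

# Full (completed) orthogonal bases of the pretrained SVD.
U, _, Vh = np.linalg.svd(W0, full_matrices=True)  # U: m x m, V: n x n
V = Vh.T

# k = min(m, n) with orthogonal completions: the penalty reduces to
# plain l2 anchoring around W0.
R_full = np.sum((U.T @ (W - W0) @ V) ** 2)
l2_anchor = np.sum((W - W0) ** 2)
print(R_full, l2_anchor)

# k = 0: the projected block is empty, so the penalty is identically zero.
Uk, Vk = U[:, :0], V[:, :0]
R_zero = np.sum((Uk.T @ (W - W0) @ Vk) ** 2)
print(R_zero)
```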
### C\.3Block Decomposition Interpretation
Let𝐔=\[𝐔k𝐔⟂\]∈ℝm×m\\mathbf\{U\}=\[\\mathbf\{U\}\_\{k\}\\ \\ \\mathbf\{U\}\_\{\\perp\}\]\\in\\mathbb\{R\}^\{m\\times m\}and𝐕=\[𝐕k𝐕⟂\]∈ℝn×n\\mathbf\{V\}=\[\\mathbf\{V\}\_\{k\}\\ \\ \\mathbf\{V\}\_\{\\perp\}\]\\in\\mathbb\{R\}^\{n\\times n\}be orthogonal matrices extending𝐔k,𝐕k\\mathbf\{U\}\_\{k\},\\mathbf\{V\}\_\{k\}\. Define the rotated coordinates
𝐖~≜𝐔⊤𝐖𝐕=\(ABCD\),\\widetilde\{\\mathbf\{W\}\}\\triangleq\\mathbf\{U\}^\{\\top\}\\mathbf\{W\}\\mathbf\{V\}=\\begin\{pmatrix\}A&B\\\\ C&D\\end\{pmatrix\},\(12\)whereA∈ℝk×kA\\in\\mathbb\{R\}^\{k\\times k\}\. Similarly𝐖~0=𝐔⊤𝐖0𝐕\\widetilde\{\\mathbf\{W\}\}\_\{0\}=\\mathbf\{U\}^\{\\top\}\\mathbf\{W\}\_\{0\}\\mathbf\{V\}has top\-left blockA0A\_\{0\}\. Then
ℛk\(𝐖\)=‖A−A0‖F2\.\\mathcal\{R\}\_\{k\}\(\\mathbf\{W\}\)=\\\|A\-A\_\{0\}\\\|\_\{F\}^\{2\}\.\(13\)Thus, RPSFT penalizes drift of the top\-leftk×kk\\times kblock in the pretrained singular\-vector basis, while allowing updates through the other blocksB,C,DB,C,D\.
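This identity can be confirmed numerically (a small NumPy sketch with illustrative dimensions):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 7, 5, 2
W0 = rng.standard_normal((m, n))
W = W0 + 0.1 * rng.standard_normal((m, n))

# Completed orthogonal bases extending the top-k singular vectors.
U, _, Vh = np.linalg.svd(W0, full_matrices=True)
V = Vh.T
Uk, Vk = U[:, :k], V[:, :k]

# Rotated coordinates: the top-left k x k block is the protected one.
A = (U.T @ W @ V)[:k, :k]
A0 = (U.T @ W0 @ V)[:k, :k]

# R_k(W) computed directly matches the drift of the A block.
R_k = np.sum((Uk.T @ (W - W0) @ Vk) ** 2)
print(R_k, np.sum((A - A0) ** 2))
```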
### C.4 Optimality Condition and $O(1/\lambda)$ Drift Control

We give a simple bound showing how $\lambda$ controls the protected-block deviation.

###### Proposition C.1 (Stationary Condition).

Assume $f$ is differentiable. Any stationary point $\mathbf{W}_{\lambda}$ of $F_{\lambda}$ satisfies

$$\nabla f(\mathbf{W}_{\lambda})+2\lambda\,\mathbf{U}_{k}\big(\mathbf{U}_{k}^{\top}(\mathbf{W}_{\lambda}-\mathbf{W}_{0})\mathbf{V}_{k}\big)\mathbf{V}_{k}^{\top}=0. \tag{14}$$
###### Proof.

Using $\mathcal{R}_{k}(\mathbf{W})=\|\mathbf{U}_{k}^{\top}(\mathbf{W}-\mathbf{W}_{0})\mathbf{V}_{k}\|_{F}^{2}$ and standard matrix calculus, $\nabla_{\mathbf{W}}\mathcal{R}_{k}(\mathbf{W})=2\mathbf{U}_{k}(\mathbf{U}_{k}^{\top}(\mathbf{W}-\mathbf{W}_{0})\mathbf{V}_{k})\mathbf{V}_{k}^{\top}$. Setting $\nabla F_{\lambda}(\mathbf{W})=0$ yields Eq. ([14](https://arxiv.org/html/2605.10973#A3.E14)). ∎
###### Proposition C.2 ($1/\lambda$ Control of Protected Drift).

Let $\mathbf{W}_{\lambda}$ be any stationary point of $F_{\lambda}$. Then

$$\left\|\mathbf{U}_{k}^{\top}(\mathbf{W}_{\lambda}-\mathbf{W}_{0})\mathbf{V}_{k}\right\|_{F}\leq\frac{\|\nabla f(\mathbf{W}_{\lambda})\|_{F}}{2\lambda}. \tag{15}$$
###### Proof.

Left-multiply Eq. ([14](https://arxiv.org/html/2605.10973#A3.E14)) by $\mathbf{U}_{k}^{\top}$ and right-multiply by $\mathbf{V}_{k}$:

$$\mathbf{U}_{k}^{\top}\nabla f(\mathbf{W}_{\lambda})\mathbf{V}_{k}+2\lambda\,\mathbf{U}_{k}^{\top}(\mathbf{W}_{\lambda}-\mathbf{W}_{0})\mathbf{V}_{k}=0. \tag{16}$$

Taking Frobenius norms and using $\|\mathbf{U}_{k}^{\top}X\mathbf{V}_{k}\|_{F}\leq\|X\|_{F}$ gives Eq. ([15](https://arxiv.org/html/2605.10973#A3.E15)). ∎
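The bound of Eq. (15) can be observed numerically on a toy objective. The sketch below takes $f(\mathbf{W})=\tfrac{1}{2}\|\mathbf{W}-\mathbf{W}^{\ast}\|_{F}^{2}$ for a random "task optimum" $\mathbf{W}^{\ast}$ (an illustrative choice, not the paper's training loss), runs gradient descent on $F_{\lambda}$ to near-stationarity, and compares the protected-block drift against the bound.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, k, lam = 5, 4, 2, 5.0
W0 = rng.standard_normal((m, n))
Wstar = W0 + rng.standard_normal((m, n))     # toy task optimum, away from W0
U, _, Vt = np.linalg.svd(W0)
Uk, Vk = U[:, :k], Vt.T[:, :k]

def grad_F(W):
    # f(W) = 0.5 * ||W - Wstar||_F^2, so grad f(W) = W - Wstar,
    # plus the RPSFT penalty gradient from Proposition C.1.
    return (W - Wstar) + 2 * lam * Uk @ (Uk.T @ (W - W0) @ Vk) @ Vk.T

# Gradient descent to an (approximate) stationary point of F_lambda.
W = W0.copy()
for _ in range(5000):
    W -= 0.05 * grad_F(W)

drift = np.linalg.norm(Uk.T @ (W - W0) @ Vk, "fro")
bound = np.linalg.norm(W - Wstar, "fro") / (2 * lam)  # ||grad f||_F / (2*lam)
```

For this quadratic $f$ the bound is in fact tight at stationarity, since the unprotected gradient blocks vanish there.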
#### Implication.

Eq. ([15](https://arxiv.org/html/2605.10973#A3.E15)) shows that (1) at any stationary point, the protected-block deviation is bounded by the task-gradient magnitude scaled by $O(1/\lambda)$; and (2) as $\lambda$ increases, the deviation of the protected block $\mathbf{U}_{k}^{\top}(\mathbf{W}_{\lambda}-\mathbf{W}_{0})\mathbf{V}_{k}$ shrinks at rate $O(1/\lambda)$, formalizing a smooth interpolation between unconstrained SFT ($\lambda=0$) and the hard-constrained regime obtained as $\lambda\to\infty$.
### C.5 Rank Selection: Boundary and Threshold Rule

We now formalize why an excessively large protected rank can hurt rapid adaptation. We work in the forgetting-dominated regime discussed above, where the pretrained model is locally close to an OOD stationary point, so the first-order OOD term is negligible.

Let $\Delta\coloneqq\mathbf{W}-\mathbf{W}_{0}$ and

$$\widetilde{\Delta}\coloneqq\mathbf{U}^{\top}\Delta\mathbf{V}, \tag{17}$$

and index the entries of $\widetilde{\Delta}$ by $s=(i,j)$, with coordinate value $\delta_{s}$. For each rank $k$, define the protected coordinate set

$$S_{k}\coloneqq\{(i,j):1\leq i\leq k,\;1\leq j\leq k\}. \tag{18}$$

Thus $s\in S_{k}$ means that $\delta_{s}$ lies in the protected top-left $k\times k$ block of the pretrained singular-vector basis.
We use the local quadratic model

$$f_{\mathrm{id}}(\mathbf{W}_{0}+\Delta)-f_{\mathrm{id}}(\mathbf{W}_{0})\approx-\sum_{s}g_{s}\delta_{s}+\frac{1}{2}\sum_{s}h_{s}\delta_{s}^{2},\qquad h_{s}>0, \tag{19}$$

where $g_{s}$ is the in-domain learning drive along coordinate $s$, and $h_{s}$ is the corresponding local curvature. Under RPSFT, the local objective becomes

$$-\sum_{s}g_{s}\delta_{s}+\frac{1}{2}\sum_{s}h_{s}\delta_{s}^{2}+\lambda\sum_{s\in S_{k}}\delta_{s}^{2}. \tag{20}$$

For OOD forgetting, we use the quadratic proxy

$$f_{\mathrm{ood}}(\mathbf{W}_{0}+\Delta)-f_{\mathrm{ood}}(\mathbf{W}_{0})\approx\frac{1}{2}\sum_{s}c_{s}\delta_{s}^{2},\qquad c_{s}\geq 0, \tag{21}$$

where $c_{s}$ measures how sensitive OOD performance is to movement along coordinate $s$: larger $c_{s}$ means larger OOD degradation.
###### Proposition C.3 (Optimal local step).

For each coordinate $s$, the minimizer of the local regularized objective is

$$\delta_{s}^{\star}(k)=\begin{cases}\dfrac{g_{s}}{h_{s}+2\lambda},&s\in S_{k},\\[5pt]\dfrac{g_{s}}{h_{s}},&s\notin S_{k}.\end{cases} \tag{22}$$

The resulting OOD increase is

$$F_{\mathrm{ood}}(k)=\frac{1}{2}\sum_{s\in S_{k}}\frac{c_{s}g_{s}^{2}}{(h_{s}+2\lambda)^{2}}+\frac{1}{2}\sum_{s\notin S_{k}}\frac{c_{s}g_{s}^{2}}{h_{s}^{2}}. \tag{23}$$

The resulting ID gain, measured on the unregularized local ID proxy after taking the regularized step, is

$$G_{\mathrm{id}}(k)=\sum_{s\in S_{k}}\left(\frac{g_{s}^{2}}{h_{s}+2\lambda}-\frac{1}{2}h_{s}\frac{g_{s}^{2}}{(h_{s}+2\lambda)^{2}}\right)+\frac{1}{2}\sum_{s\notin S_{k}}\frac{g_{s}^{2}}{h_{s}}. \tag{24}$$
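The coordinatewise minimizer of Eq. (22) can be checked against a brute-force grid search over the local objective of Eq. (20). The sketch below draws random stand-in values for $g_{s}$, $h_{s}$ and evaluates the ID gain of Eq. (24) as $\sum_s g_s\delta_s^{\star}-\tfrac{1}{2}h_s(\delta_s^{\star})^2$.

```python
import numpy as np

rng = np.random.default_rng(3)
S = 50                                     # number of coordinates s
g = rng.standard_normal(S)                 # stand-in learning drives g_s
h = rng.uniform(0.5, 2.0, S)               # stand-in curvatures h_s > 0
lam, k_protected = 1.0, 20
protected = np.arange(S) < k_protected     # stand-in for s in S_k

# Coordinatewise minimizer of Eq. (20), matching Eq. (22).
delta_star = np.where(protected, g / (h + 2 * lam), g / h)

# Brute-force check on one protected coordinate: delta_star minimizes
# -g*d + 0.5*h*d^2 + lam*d^2 over a fine grid.
d = np.linspace(-5, 5, 200001)
obj = -g[0] * d + 0.5 * h[0] * d ** 2 + lam * d ** 2
grid_min = d[np.argmin(obj)]

# ID gain (Eq. 24), evaluated on the unregularized ID proxy.
G_id = np.sum(g * delta_star - 0.5 * h * delta_star ** 2)
```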
###### Proof.

The objective is separable across coordinates, so the minimizer is obtained coordinatewise. Substituting the resulting $\delta_{s}^{\star}(k)$ into the ID and OOD quadratic proxies gives the stated formulas. ∎
###### Proposition C.4 (Existence of an upper rank boundary).

Assume there exists $q$ such that

$$c_{s}=0\qquad\text{for all }s\notin S_{q}. \tag{25}$$

That is, all OOD-sensitive coordinates are already contained in the protected top-$q$ block. Define the scalarized utility

$$\Phi(k)\coloneqq G_{\mathrm{id}}(k)-\beta F_{\mathrm{ood}}(k),\qquad\beta>0, \tag{26}$$

where $\beta$ controls how strongly OOD forgetting is penalized relative to ID gain. Then every maximizer $k^{\star}$ of $\Phi$ satisfies

$$k^{\star}\leq q. \tag{27}$$
###### Proof.

For $k\geq q$, enlarging the protected set no longer changes $F_{\mathrm{ood}}(k)$, because every coordinate with $c_{s}>0$ is already protected at rank $q$. Hence

$$F_{\mathrm{ood}}(k)=F_{\mathrm{ood}}(q),\qquad k\geq q. \tag{28}$$

On the other hand, protecting any additional coordinate weakly decreases its contribution to $G_{\mathrm{id}}(k)$, with strict decrease whenever $\lambda>0$ and $g_{s}\neq 0$. Therefore

$$G_{\mathrm{id}}(k)\leq G_{\mathrm{id}}(q),\qquad k\geq q, \tag{29}$$

and thus

$$\Phi(k)\leq\Phi(q),\qquad k\geq q. \tag{30}$$

So no maximizer can lie above $q$. ∎
###### Corollary C.5 (Threshold rule).

Consider a coordinate $s$ that is currently unprotected. Define the ID cost of protecting this coordinate by

$$\Delta_{\mathrm{ID},s}\coloneqq\frac{1}{2}\frac{g_{s}^{2}}{h_{s}}-\left(\frac{g_{s}^{2}}{h_{s}+2\lambda}-\frac{1}{2}h_{s}\frac{g_{s}^{2}}{(h_{s}+2\lambda)^{2}}\right)=\frac{2\lambda^{2}g_{s}^{2}}{h_{s}(h_{s}+2\lambda)^{2}}, \tag{31}$$

and the OOD gain of protecting it by

$$\Delta_{\mathrm{OOD},s}\coloneqq\frac{1}{2}c_{s}\frac{g_{s}^{2}}{h_{s}^{2}}-\frac{1}{2}c_{s}\frac{g_{s}^{2}}{(h_{s}+2\lambda)^{2}}=\frac{2c_{s}\lambda(h_{s}+\lambda)g_{s}^{2}}{h_{s}^{2}(h_{s}+2\lambda)^{2}}. \tag{32}$$

Then protecting coordinate $s$ improves $\Phi$ iff

$$\beta\,\Delta_{\mathrm{OOD},s}>\Delta_{\mathrm{ID},s}, \tag{33}$$

which is equivalent to

$$c_{s}>\frac{\lambda h_{s}}{\beta(h_{s}+\lambda)}. \tag{34}$$
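The equivalence between Eq. (33) and the closed-form threshold of Eq. (34) is easy to confirm numerically with randomly drawn stand-in coefficients.

```python
import numpy as np

rng = np.random.default_rng(4)
lam, beta = 1.0, 2.0
g = rng.standard_normal(1000)        # stand-in learning drives
h = rng.uniform(0.5, 2.0, 1000)      # stand-in curvatures
c = rng.uniform(0.0, 2.0, 1000)      # stand-in OOD sensitivities

# ID cost and OOD gain of protecting each coordinate (Eqs. 31-32).
delta_id = 2 * lam**2 * g**2 / (h * (h + 2 * lam) ** 2)
delta_ood = 2 * c * lam * (h + lam) * g**2 / (h**2 * (h + 2 * lam) ** 2)

# Direct comparison (Eq. 33) vs the closed-form threshold (Eq. 34);
# the two decision rules should coincide coordinate by coordinate.
direct = beta * delta_ood > delta_id
threshold = c > lam * h / (beta * (h + lam))
```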
###### Proof.

This follows by comparing the protected and unprotected one-coordinate contributions in Proposition [C.3](https://arxiv.org/html/2605.10973#A3.Thmtheorem3) and simplifying. ∎
#### Interpretation.

The upper-boundary proposition shows that once the protected rank exceeds the support of the OOD-sensitive coordinates, increasing $k$ can no longer improve robustness, but it can still worsen the trade-off between mitigating forgetting and rapid adaptation by over-suppressing directions that are useful for in-domain adaptation. The threshold rule gives the per-coordinate version of the same idea: a direction should be protected only if its OOD sensitivity $c_{s}$ is large enough to outweigh the in-domain gain lost by shrinking that direction. In this view, $c_{s}$ is the OOD sensitivity of coordinate $s$ in the pretrained singular basis: larger $c_{s}$ means that updating this direction causes greater OOD degradation. If OOD sensitivity decays with singular index, this naturally induces a finite rank boundary, so the optimal rank $k$ should be large enough to cover the high-$c_{s}$ directions, but not so large that it also protects many low-$c_{s}$ directions. This provides a simple explanation for the rank sweep in Appendix [D](https://arxiv.org/html/2605.10973#A4): moving from $k=0$ to a moderate rank improves robustness because it protects the most OOD-sensitive directions, whereas overly large ranks behave more like global anchoring and degrade the trade-off by over-constraining directions that are not strongly OOD-sensitive.
#### Practical guidance.

Before SFT, we choose $k$ using the base model and the same Fisher-projected gradient-energy diagnostic shown in Figure [2](https://arxiv.org/html/2605.10973#S3.F2). Concretely, we compute per-sample gradients on a small batch from the OOD data, project them into the pretrained SVD basis of an early attention matrix such as the layer-1 `q_proj`, and sweep candidate ranks $r$. We then inspect the curve of $x(r)=100\,r^{2}/R^{2}$ versus $y(r)=\mathrm{tr}(\mathbf{P}_{\mathrm{svd},r}\mathbf{F})/\mathrm{tr}(\mathbf{F})$. A practical default is the smallest $r$ whose strict top-$r\times r$ block captures about $20\%$ of the gradient energy in the early attention layer: this protects a meaningful fraction of loss-sensitive directions while keeping the protected block small enough for adaptation. In our experiments, $r=768$ corresponds to a strict block of roughly $5\%$ and already captures about $20\%$ of the gradient energy, so we use $k=768$ as the default protected rank.
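A minimal sketch of this selection rule follows. Here `grads` is a stand-in for per-sample gradient matrices of one early attention weight (e.g., the layer-1 `q_proj`) computed on a small OOD batch; random matrices are used so the sketch is self-contained.

```python
import numpy as np

rng = np.random.default_rng(5)
m = n = 64
W0 = rng.standard_normal((m, n))            # stand-in pretrained weight
U, _, Vt = np.linalg.svd(W0)
V = Vt.T
grads = [rng.standard_normal((m, n)) for _ in range(8)]  # stand-in gradients

total = sum(np.linalg.norm(G, "fro") ** 2 for G in grads)

def energy_fraction(r):
    """y(r): share of gradient energy in the strict top-r x r block."""
    Ur, Vr = U[:, :r], V[:, :r]
    captured = sum(np.linalg.norm(Ur.T @ G @ Vr, "fro") ** 2 for G in grads)
    return captured / total

# Default rule: smallest r whose strict block captures about 20% of the energy.
k_default = next(r for r in range(1, min(m, n) + 1)
                 if energy_fraction(r) >= 0.20)
```

On real gradients the curve is typically far above the random baseline at small $r$, which is what makes a small protected rank worthwhile.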
### C.6 Gradient-Flow View: Exponentially Damped Task-Induced Drift

To gain insight into the optimization dynamics induced by RPSFT, we consider the continuous-time gradient-flow limit of the regularized objective

$$F_{\lambda}(\mathbf{W})=f(\mathbf{W})+\lambda\|\mathbf{U}_{k}^{\top}(\mathbf{W}-\mathbf{W}_{0})\mathbf{V}_{k}\|_{F}^{2}.$$

Assume that $f$ is differentiable with locally Lipschitz gradient, so that the gradient flow is well defined. The resulting dynamics are

$$\frac{d\mathbf{W}(t)}{dt}=-\nabla f(\mathbf{W}(t))-2\lambda\,\mathbf{U}_{k}\big(\mathbf{U}_{k}^{\top}(\mathbf{W}(t)-\mathbf{W}_{0})\mathbf{V}_{k}\big)\mathbf{V}_{k}^{\top}. \tag{35}$$
#### Protected coordinates.

Define the protected coordinates

$$A(t)\triangleq\mathbf{U}_{k}^{\top}(\mathbf{W}(t)-\mathbf{W}_{0})\mathbf{V}_{k},$$

which measure the deviation of the weights from the pretrained model within the top-$k$ singular subspace of $\mathbf{W}_{0}$. Multiplying Eq. ([35](https://arxiv.org/html/2605.10973#A3.E35)) on the left by $\mathbf{U}_{k}^{\top}$ and on the right by $\mathbf{V}_{k}$ yields

$$\frac{dA(t)}{dt}=-\,\mathbf{U}_{k}^{\top}\nabla f(\mathbf{W}(t))\mathbf{V}_{k}-2\lambda A(t). \tag{36}$$
#### Closed-form solution.

Eq. ([36](https://arxiv.org/html/2605.10973#A3.E36)) is a linear ODE with a time-dependent forcing term. Define

$$G(t)\triangleq\mathbf{U}_{k}^{\top}\nabla f(\mathbf{W}(t))\mathbf{V}_{k},$$

so that $\dot{A}(t)+2\lambda A(t)=-G(t)$. Solving this equation yields

$$A(t)=e^{-2\lambda t}A(0)-\int_{0}^{t}e^{-2\lambda(t-s)}G(s)\,ds, \tag{37}$$

or equivalently,

$$A(t)=e^{-2\lambda t}A(0)-\int_{0}^{t}e^{-2\lambda(t-s)}\big(\mathbf{U}_{k}^{\top}\nabla f(\mathbf{W}(s))\mathbf{V}_{k}\big)\,ds.$$

The first term is an exponentially decaying initial deviation from the pretrained weights, while the second is an exponentially weighted accumulation of task-induced gradients in the protected subspace.
#### Interpretation.

Eq. ([37](https://arxiv.org/html/2605.10973#A3.E37)) shows that RPSFT acts as an exponential damping mechanism on task-induced drift within the pretrained top-$k$ singular subspace. Even when fine-tuning is initialized from the pretrained model (so that $A(0)=0$), task gradients generally introduce nonzero drift, but their influence is continuously decayed at rate $2\lambda$. Consequently, RPSFT behaves as a temporal low-pass filter on gradients in the protected subspace, stabilizing dominant pretrained representations while still allowing task-relevant adaptation.
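Eq. (37) can be checked numerically in the special case of a constant forcing term $G(t)=G$ and $A(0)=0$, where the integral evaluates to $A(t)=-\frac{1-e^{-2\lambda t}}{2\lambda}G$. The sketch below compares forward-Euler integration of Eq. (36) with this closed form, using a random stand-in for $G$.

```python
import numpy as np

rng = np.random.default_rng(6)
lam, T, steps = 2.0, 1.0, 200000
G = rng.standard_normal((3, 3))            # constant stand-in forcing term

# Forward-Euler integration of dA/dt = -G - 2*lam*A with A(0) = 0 (Eq. 36).
A = np.zeros((3, 3))
dt = T / steps
for _ in range(steps):
    A += dt * (-G - 2 * lam * A)

# Closed form from Eq. (37) with A(0) = 0 and constant G.
A_closed = -(1 - np.exp(-2 * lam * T)) / (2 * lam) * G
err = np.linalg.norm(A - A_closed)
```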
## Appendix D Additional Results

We study the sensitivity of RPSFT to the protected rank $k$ on Llama-3.1-8B-Instruct as the empirical counterpart to the rank-boundary analysis in Appendix [C.5](https://arxiv.org/html/2605.10973#A3.SS5). Here, SFT corresponds to $k=0$, and our chosen configuration is $k=768$. The LoRA row is a vanilla LoRA baseline with adapter rank $r=32$; no RPSFT penalty is applied to it. The full-rank setting $k=4096$ reduces RPSFT to weight-space anchoring around the pretrained model, so we label it as L2 Init in Table [6](https://arxiv.org/html/2605.10973#A4.T6). This matches the regenerative-regularization view of Kumar et al. [[2024](https://arxiv.org/html/2605.10973#bib.bib15)]. The main insight is that Llama benefits from an intermediate protected rank: $k=256$–$768$ preserves adaptation, but larger $k$ progressively weakens both in-domain adaptation and OOD generalization. This matches the positive Llama OOD first-order signal in Figure [5](https://arxiv.org/html/2605.10973#S5.F5): when the update direction can still help OOD performance, over-protecting too many directions removes useful transfer rather than only preventing forgetting. The same pattern agrees with the rank-boundary analysis in Appendix [C.5](https://arxiv.org/html/2605.10973#A3.SS5): once the protected block already covers the most OOD-sensitive directions, expanding it mainly constrains useful task updates.
Table 6: Rank robustness on Llama-3.1-8B-Instruct across ID Avg@k, ID Pass@k, and OOD Avg@1 (%). The LoRA row is vanilla LoRA with adapter rank $r=32$; RPSFT rows vary the protected rank $k$. Base cells are left unmarked, and bold marks the best non-base value in each domain block and column.

#### Qwen2.5-7B robustness sweep.
Table[7](https://arxiv.org/html/2605.10973#A4.T7)reports the same protected\-rank sweep on Qwen2\.5\-7B\-Instruct\. Here, SFT corresponds tok=0k=0, the LoRA row is vanilla LoRA with adapter rankr=32r=32, the chosen RPSFT configuration is againk=768k=768, andk=3584k=3584is equivalent to L2 Init\. Qwen shows the other side of the same trade\-off: increasingkkreduces in\-domain adaptation but steadily mitigates OOD forgetting, with the full\-rank L2 Init row closest to the base model on OOD average\. This matches Figure[5](https://arxiv.org/html/2605.10973#S5.F5), where the Qwen OOD first\-order signal is weaker and changes sign more often, so unconstrained task adaptation is less reliably aligned with OOD improvement\. The trend is also consistent with Appendix[C\.5](https://arxiv.org/html/2605.10973#A3.SS5): protecting more OOD\-sensitive directions improves retention, but beyond a moderate rank the extra protection starts to suppress task learning\. Vanilla LoRA remains weaker, especially on Qwen OOD retention, indicating that parameter\-efficient adaptation alone does not provide the same forgetting control as the RPSFT projected\-block penalty\.
Table 7:Rank robustness on Qwen2\.5\-7B\-Instruct across ID Avg@k, ID Pass@k, and OOD Avg@1 \(%\)\. The LoRA row is vanilla LoRA with adapter rankr=32r=32; RPSFT rows vary protected rankkk\. Base cells are left unmarked, and bold marks the best non\-base value in each domain block and column\.
#### OOD Pass@1 results.

Tables [8](https://arxiv.org/html/2605.10973#A4.T8) and [9](https://arxiv.org/html/2605.10973#A4.T9) report the earlier one-sample OOD evaluation. The main results use the four-sample OOD Avg@k and Pass@k evaluation in Tables [2](https://arxiv.org/html/2605.10973#S5.T2) and [4](https://arxiv.org/html/2605.10973#S5.T4).

Table 8: SFT-stage OOD Pass@1 results on the six OOD benchmarks. Bold marks the strongest tuned result within each model block.

Table 9: Downstream RL OOD Pass@1 results under DAPO. Entries are init$\rightarrow$DAPO, and final checkpoints are selected by ID performance.
## Appendix E Experiment Details

#### Precomputation.

We compute the baseline SVD blocks immediately after loading the pretrained model. For each $\ell\in\mathcal{M}^{\prime}$, we compute the top-$k$ singular vectors of $\mathbf{W}^{0}_{\ell}$ (e.g., via truncated SVD) and store $(\mathbf{U}^{(k)}_{0,\ell},\mathbf{V}^{(k)}_{0,\ell},S^{\mathrm{ref}}_{\ell})$. This step takes a few minutes, depending on model size; for Llama-8B, it takes around 3 minutes.
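A minimal sketch of this per-matrix precomputation step, with a random stand-in for one pretrained weight matrix (a real run would loop over the regularized matrices and could use a truncated SVD routine instead of the full decomposition used here):

```python
import numpy as np

rng = np.random.default_rng(7)
k = 8
W0 = rng.standard_normal((64, 32))          # stand-in pretrained weight

# One-time SVD of the frozen pretrained weight; the top-k bases are
# cached and reused at every training step.
U, S, Vt = np.linalg.svd(W0, full_matrices=False)
Uk, Vk, S_ref = U[:, :k], Vt[:k, :].T, S[:k]

# During training, the projected-block penalty for a current weight W is:
W = W0 + 0.01 * rng.standard_normal(W0.shape)
penalty = np.linalg.norm(Uk.T @ (W - W0) @ Vk, "fro") ** 2
```

Since $\mathbf{W}_{0}$ is frozen, this cost is paid once per matrix; each training step only needs the two small projections shown in the last line.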
#### Training datasets.

For supervised fine-tuning, we use OpenR1-Math [Hugging Face, [2025](https://arxiv.org/html/2605.10973#bib.bib12)], which is widely used by prior work; the processed dataset contains around 25k samples. For downstream reinforcement learning, we use the DAPO-Math-17k set together with the DAPO objective. Every DAPO run is initialized from the corresponding supervised checkpoint, so the RL comparisons isolate the effect of the SFT initializer.
#### Training setup.

For supervised fine-tuning, we train Qwen2.5-7B and Qwen2.5-3B for 12 epochs, and Llama-3.1-8B for 20 epochs because its base checkpoint is weaker. We evaluate the same saved checkpoints across methods. All shared training hyperparameters are kept identical across SFT baselines; only the method-specific parameters are changed, using the default settings for each method.

Table [10](https://arxiv.org/html/2605.10973#A5.T10) lists the shared SFT hyperparameters used for all SFT-stage methods. Unless stated otherwise, RPSFT uses protected rank $k=768$ and regularization coefficient $\lambda=1$.
Table 10: Shared supervised fine-tuning hyperparameters.

For downstream reinforcement learning, we use DAPO and initialize every run from the corresponding supervised checkpoint. We train each run for 100 total training steps and then select the best checkpoint according to in-domain performance for the reported evaluation. Table [11](https://arxiv.org/html/2605.10973#A5.T11) lists the shared RL-stage hyperparameters.

Table 11: Shared reinforcement-learning fine-tuning hyperparameters.
#### Compute and artifacts.

All SFT and RL training runs use 8 H200 GPUs. We provide the code, dataset preparation and evaluation artifacts, training configurations, and documentation as supplemental material for reproduction, and will make the artifact public on GitHub after the anonymous review period. We use VERL as our main training framework.
#### Benchmarks.

Out-of-domain tasks: [GPQA](https://huggingface.co/datasets/Idavidrein/gpqa) [Rein et al., [2023](https://arxiv.org/html/2605.10973#bib.bib32)], [IFEval-loose](https://huggingface.co/datasets/google/IFEval) [Zhou et al., [2023](https://arxiv.org/html/2605.10973#bib.bib47)], [MMLU-Pro](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro) [Wang et al., [2024b](https://arxiv.org/html/2605.10973#bib.bib43)], [SuperGPQA](https://huggingface.co/datasets/Maxwell-Jia/SuperGPQA-Astro) [Team et al., [2025](https://arxiv.org/html/2605.10973#bib.bib37)], [Safety Benchmark](https://huggingface.co/datasets/ThWu/safety_benchmark), and [TruthfulQA](https://huggingface.co/datasets/truthfulqa/truthful_qa) [Lin et al., [2022](https://arxiv.org/html/2605.10973#bib.bib21)].
## Appendix F Metrics for Analysis Figures

### F.1 Strict SVD Subspace Energy

Figure [2](https://arxiv.org/html/2605.10973#S3.F2) measures how much of the empirical Fisher curvature is captured by the strict top singular block. Let $\mathbf{G}_{t}\in\mathbb{R}^{m\times n}$ denote the gradient matrix of the $t$-th sample, where $t=1,\dots,N$ indexes the $N$ sampled gradients used in the diagnostic, and let

$$g_{t}=\mathrm{vec}(\mathbf{G}_{t})$$

be its vectorized form. We define the empirical Fisher matrix as

$$\mathbf{F}=\sum_{t=1}^{N}g_{t}g_{t}^{\top}.$$
Let $u_{i}\in\mathbb{R}^{m}$ and $v_{j}\in\mathbb{R}^{n}$ be the $i$-th left and $j$-th right singular vectors of the pretrained weight matrix, let $r$ be the protected singular rank, and let $\mathbf{P}_{\mathrm{svd},r}$ denote the orthogonal projector onto the strict top-$r\times r$ singular-vector product subspace

$$\mathrm{span}\bigl\{\mathrm{vec}(u_{i}v_{j}^{\top}):1\leq i,j\leq r\bigr\}.$$

Then the fraction of gradient energy captured by the strict top-$r\times r$ singular block is

$$\frac{\mathrm{tr}(\mathbf{P}_{\mathrm{svd},r}\mathbf{F})}{\mathrm{tr}(\mathbf{F})}=\frac{\sum_{t=1}^{N}\|\mathbf{P}_{\mathrm{svd},r}g_{t}\|_{2}^{2}}{\sum_{t=1}^{N}\|g_{t}\|_{2}^{2}}.$$
Equivalently, in matrix form,

$$\frac{\mathrm{tr}(\mathbf{P}_{\mathrm{svd},r}\mathbf{F})}{\mathrm{tr}(\mathbf{F})}=\frac{\sum_{t=1}^{N}\left\|\mathbf{U}_{r}^{\top}\mathbf{G}_{t}\mathbf{V}_{r}\right\|_{F}^{2}}{\sum_{t=1}^{N}\|\mathbf{G}_{t}\|_{F}^{2}},$$

where $\mathbf{U}_{r}=[u_{1},\dots,u_{r}]$ and $\mathbf{V}_{r}=[v_{1},\dots,v_{r}]$.

Here, the denominator $\mathrm{tr}(\mathbf{F})=\sum_{t=1}^{N}\|g_{t}\|_{2}^{2}$ is the total gradient energy across all samples, i.e., the total sum of squared gradient norms, while the numerator $\mathrm{tr}(\mathbf{P}_{\mathrm{svd},r}\mathbf{F})$ is the portion of that energy lying in the strict top-$r\times r$ singular block. In the figure, the x-axis reports $x(r)=r^{2}/R^{2}$, where $R=\operatorname{rank}(\mathbf{W}_{0})$ is the full singular rank, and the y-axis reports $\mathrm{tr}(\mathbf{P}_{\mathrm{svd},r}\mathbf{F})/\mathrm{tr}(\mathbf{F})$. We apply this diagnostic to the layer-1 attention `q_proj` weight for Llama-8B, Qwen-7B, and Qwen-3B.
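The equivalence between the projector form and the matrix form can be verified on a single sampled gradient: the columns $\mathrm{vec}(u_i v_j^\top)$ form an orthonormal basis of the strict block subspace, so the projected energy equals the sum of squared coefficients $\|\mathbf{U}_r^\top\mathbf{G}\mathbf{V}_r\|_F^2$. A sketch with random stand-in matrices:

```python
import numpy as np

rng = np.random.default_rng(8)
m, n, r = 10, 8, 3
W0 = rng.standard_normal((m, n))
U, _, Vt = np.linalg.svd(W0)
Ur, Vr = U[:, :r], Vt.T[:, :r]
G = rng.standard_normal((m, n))     # one sampled gradient matrix (stand-in)

# Vectorized form: project vec(G) onto span{vec(u_i v_j^T) : i, j <= r}.
# The basis columns are orthonormal, so the projected energy is the
# squared norm of the coefficient vector.
basis = np.stack([np.outer(Ur[:, i], Vr[:, j]).ravel()
                  for i in range(r) for j in range(r)], axis=1)
proj_energy = np.linalg.norm(basis.T @ G.ravel()) ** 2

# Matrix form used in practice: ||U_r^T G V_r||_F^2.
matrix_energy = np.linalg.norm(Ur.T @ G @ Vr, "fro") ** 2
```

The matrix form avoids ever materializing the $mn\times r^2$ basis, which is what makes the diagnostic cheap at LLM scale.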
### F.2 Layerwise First-Order Signal

To diagnose whether a checkpoint update helps or hurts a dataset at first order, we compare the update direction against the dataset gradient. For a checkpoint weight $\mathbf{W}_{\mathrm{ckpt}}$ and base-model weight $\mathbf{W}_{\mathrm{base}}$, the update is

$$\Delta\mathbf{W}=\mathbf{W}_{\mathrm{ckpt}}-\mathbf{W}_{\mathrm{base}}. \tag{38}$$

For a batch $B$ from an ID or OOD dataset, the average gradient on matrix $\mathbf{W}$ is

$$g_{\mathbf{W}}=\frac{1}{|B|}\sum_{(x,y)\in B}\nabla_{\mathbf{W}}\ell(f_{\mathbf{W}}(x),y). \tag{39}$$

Let $\mathcal{P}_{\ell}$ be the set of matrices assigned to layer $\ell$ in the plot. The layerwise first-order signal is the average inner product

$$m_{\ell}=\frac{1}{|\mathcal{P}_{\ell}|}\sum_{\mathbf{W}\in\mathcal{P}_{\ell}}\left\langle g_{\mathbf{W}},\Delta\mathbf{W}\right\rangle. \tag{40}$$

Negative values mean the checkpoint update points along the local descent direction for that dataset, so it reduces the loss at first order. Positive values mean the update conflicts with the dataset gradient and tends to increase the loss at first order.
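The signal of Eq. (40) reduces to a few Frobenius inner products per layer. The sketch below uses random stand-ins for the per-matrix updates and gradients, and includes a sanity check: an update that is exactly a negative-gradient (pure descent) step yields a negative signal.

```python
import numpy as np

rng = np.random.default_rng(9)
# Stand-ins for one layer's (Delta_W, g_W) pairs, i.e. the set P_ell.
pairs = [(rng.standard_normal((8, 8)), rng.standard_normal((8, 8)))
         for _ in range(3)]

# Eq. (40): average Frobenius inner product <g_W, Delta_W> over the layer.
m_ell = np.mean([np.sum(gW * dW) for dW, gW in pairs])

# Sanity check: a pure descent update Delta_W = -g_W gives a negative signal.
descent = [(-gW, gW) for _, gW in pairs]
m_descent = np.mean([np.sum(gW * dW) for dW, gW in descent])
```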
### F.3 Rotation Metrics

For each layer $\ell$ and weight type

$$t\in\{\texttt{q\_proj},\texttt{k\_proj},\texttt{v\_proj},\texttt{o\_proj},\texttt{up\_proj},\texttt{down\_proj},\texttt{gate\_proj}\},$$

let the base-model and tuned-model weights be

$$\mathbf{W}_{\mathrm{base}}^{(\ell,t)},\qquad\mathbf{W}_{m}^{(\ell,t)}.$$

Here $m$ indexes the tuned method or checkpoint being compared. Let $d_{\mathrm{out}}^{(\ell,t)}$ and $d_{\mathrm{in}}^{(\ell,t)}$ denote the output and input dimensions of this matrix. We compute truncated SVDs with

$$K_{\ell,t}=\min\bigl(512,d_{\mathrm{out}}^{(\ell,t)},d_{\mathrm{in}}^{(\ell,t)}\bigr),$$

so that

$$\mathbf{W}_{\mathrm{base}}^{(\ell,t)}=\mathbf{U}_{\mathrm{base}}^{(\ell,t)}\mathbf{\Sigma}_{\mathrm{base}}^{(\ell,t)}{\mathbf{V}_{\mathrm{base}}^{(\ell,t)}}^{\top},\qquad\mathbf{W}_{m}^{(\ell,t)}=\mathbf{U}_{m}^{(\ell,t)}\mathbf{\Sigma}_{m}^{(\ell,t)}{\mathbf{V}_{m}^{(\ell,t)}}^{\top},$$

where

$$\mathbf{U}_{\mathrm{base}}^{(\ell,t)},\,\mathbf{U}_{m}^{(\ell,t)}\in\mathbb{R}^{d_{\mathrm{out}}^{(\ell,t)}\times K_{\ell,t}},\qquad\mathbf{V}_{\mathrm{base}}^{(\ell,t)},\,\mathbf{V}_{m}^{(\ell,t)}\in\mathbb{R}^{d_{\mathrm{in}}^{(\ell,t)}\times K_{\ell,t}}.$$

To measure U-space rotation, we compare the left-singular subspaces through

$$\mathbf{M}_{U}^{(\ell,t,m)}=\left(\mathbf{U}_{\mathrm{base}}^{(\ell,t)}\right)^{\top}\mathbf{U}_{m}^{(\ell,t)}.$$

If the singular values of $\mathbf{M}_{U}^{(\ell,t,m)}$ are $\sigma_{1}^{(\ell,t,m)},\dots,\sigma_{K_{\ell,t}}^{(\ell,t,m)}$, the principal angles are

$$\theta_{i}^{(\ell,t,m)}=\arccos\left(\sigma_{i}^{(\ell,t,m)}\right).$$

The mean U-space rotation for layer/type pair $(\ell,t)$ is

$$r_{U}^{(\ell,t,m)}=\frac{180}{\pi}\cdot\frac{1}{K_{\ell,t}}\sum_{i=1}^{K_{\ell,t}}\theta_{i}^{(\ell,t,m)}.$$

The layerwise plot in Figure [6](https://arxiv.org/html/2605.10973#S5.F6) uses `types-all`, so the value at layer $\ell$ averages these per-type rotations over the available matrix types:

$$y_{m}(\ell)=\frac{\sum_{t\in\mathcal{T}_{\ell}}c^{(\ell,t,m)}\,r_{U}^{(\ell,t,m)}}{\sum_{t\in\mathcal{T}_{\ell}}c^{(\ell,t,m)}},$$

where $\mathcal{T}_{\ell}$ is the set of available types in layer $\ell$. In the current plotting script, $c^{(\ell,t,m)}=1$ for each available entry, so this reduces to the simple mean across types:

$$y_{m}(\ell)=\frac{1}{|\mathcal{T}_{\ell}|}\sum_{t\in\mathcal{T}_{\ell}}\left(\frac{180}{\pi}\cdot\frac{1}{K_{\ell,t}}\sum_{i=1}^{K_{\ell,t}}\arccos\big(\sigma_{i}^{(\ell,t,m)}\big)\right).$$

Thus, each curve reports how much the top left-singular subspaces rotate away from the base model across layers, after averaging over all selected matrix types.
For the rank-wise case-study plot in Figure [7](https://arxiv.org/html/2605.10973#S5.F7), we fix one matrix case for each model and vary the prefix rank $r\leq K_{\ell,t}$. At each $r$, we replace the full truncated bases by their first $r$ columns, compute the singular values of

$$\mathbf{M}_{U}^{(r,m)}=\left(\mathbf{U}_{\mathrm{base}}^{(r)}\right)^{\top}\mathbf{U}_{m}^{(r)},$$

where $\mathbf{U}_{\mathrm{base}}^{(r)}$ and $\mathbf{U}_{m}^{(r)}$ are the first $r$ columns of the truncated left-singular bases for the base and tuned matrices, and report the mean principal angle

$$y_{m}(r)=\frac{180}{\pi}\cdot\frac{1}{r}\sum_{i=1}^{r}\arccos\big(\sigma_{i}^{(r,m)}\big).$$

This keeps the same rotation definition and only changes the swept dimension: the layerwise figure varies $\ell$ after averaging over types, while the rank-wise figure varies the prefix rank within one fixed case-study matrix.
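As a sanity check on the principal-angle definition, rotating a known one-dimensional subspace of $\mathbb{R}^{2}$ by a fixed angle and recomputing the metric should recover exactly that angle:

```python
import numpy as np

theta = np.deg2rad(30.0)
U_base = np.array([[1.0], [0.0]])                  # 1-D subspace: span(e1)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
U_m = R @ U_base                                   # same subspace rotated by 30 deg

# Principal angles from the singular values of U_base^T U_m.
sig = np.linalg.svd(U_base.T @ U_m, compute_uv=False)
angles_deg = np.degrees(np.arccos(np.clip(sig, -1.0, 1.0)))
mean_rotation = angles_deg.mean()                  # should recover 30 degrees
```

The `clip` guards against singular values marginally above 1 due to floating-point error, which would otherwise make `arccos` return NaN.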
### F.4 Entropy Metric

For prompt $j$, method $m$ generates tokens

$$y^{(m,j)}=\left(y^{(m,j)}_{1},\dots,y^{(m,j)}_{T_{m,j}}\right),$$

with greedy decoding and $T_{m,j}\leq 128$. At generation step $t$, let

$$p^{(m,j)}_{t}(v)=\operatorname{softmax}(z^{(m,j)}_{t})_{v}.$$

The token entropy is

$$h^{(m,j)}_{t}=-\sum_{v\in\mathcal{V}}p^{(m,j)}_{t}(v)\log p^{(m,j)}_{t}(v),$$

and the sample-level average entropy is

$$e_{m,j}=\frac{1}{T_{m,j}}\sum_{t=1}^{T_{m,j}}h^{(m,j)}_{t}.$$

Each figure plots a Gaussian KDE over $\{e_{m,j}\}_{j=1}^{N_{m}}$:

$$\hat{f}_{m}(x)=\frac{1}{N_{m}h_{m}}\sum_{j=1}^{N_{m}}\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}\left(\frac{x-e_{m,j}}{h_{m}}\right)^{2}\right),$$

using the default bandwidth

$$h_{m}=1.06\,s_{m}\,N_{m}^{-1/5},$$

where $s_{m}$ is the sample standard deviation. We use the same definition for both datasets: AIME25 for the in-domain plot and GPQA for the out-of-domain plot.
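A short sketch of the per-step entropy and the rule-of-thumb bandwidth; the per-sample entropies `e` are random stand-ins rather than real model outputs. A useful reference point: uniform logits over a vocabulary of size $V$ give entropy $\log V$, the maximum possible value.

```python
import numpy as np

def token_entropy(logits):
    """Entropy of the softmax distribution at one generation step."""
    z = logits - logits.max()               # stabilize the softmax
    p = np.exp(z) / np.exp(z).sum()
    return -(p * np.log(np.clip(p, 1e-12, None))).sum()

# Uniform logits over a vocabulary of size V give entropy log V.
V = 1000
h_uniform = token_entropy(np.zeros(V))

# Sample-level average entropies (stand-ins) and the 1.06 * s * N^(-1/5)
# rule-of-thumb KDE bandwidth used in the figures.
rng = np.random.default_rng(10)
e = rng.normal(2.0, 0.3, size=200)
bandwidth = 1.06 * e.std(ddof=1) * len(e) ** (-1 / 5)
```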
### F.5 Auxiliary Entropy Plots Across Model Families
We include these entropy plots as auxiliary analysis rather than as a primary claim\. Compared with the hidden\-state drift results in the main paper, the entropy differences are more model\-dependent and less uniformly separated across baselines\.
Figure 9: Average-token-entropy distributions across the twelve in-domain and out-of-domain benchmarks for Llama-3.1-8B-Instruct. The differences are modest and benchmark-dependent: DFT more often shifts toward lower entropy, while RPSFT generally stays in a similar range to the stronger tuned baselines.

Figure 10: Average-token-entropy distributions across the same twelve benchmarks for Qwen2.5-7B-Instruct. RPSFT often remains above DFT and within the broader range of the tuned baselines, but the gaps are not uniformly large across all panels.

Figure 11: Average-token-entropy distributions across the same twelve benchmarks for Qwen2.5-3B-Instruct. RPSFT often shifts toward a broader entropy profile than standard SFT or DFT, although the base model remains the highest-entropy reference and the gaps are still panel-dependent.
### F.6 Base-Model Drift in Hidden Representations
For dataset $d$ with $N_{d}$ prompts, let $h_{i}^{(m,d)}\in\mathbb{R}^{p}$ be the pooled hidden representation for prompt $i$ under model $m\in\{\mathrm{base},\mathrm{sft},\mathrm{RPSFT},\mathrm{dft},\mathrm{iw}\}$. The hidden-space centroid of model $m$ on dataset $d$ is

$$\mu^{(m,d)}=\frac{1}{N_{d}}\sum_{i=1}^{N_{d}}h_{i}^{(m,d)}.$$

We measure drift from the base model by the centroid distance

$$D_{\mathrm{hidden}}^{(m,d)}=\left\|\mu^{(m,d)}-\mu^{(\mathrm{base},d)}\right\|_{2}.$$

For visualization, we stack all prompt representations from the five models,

$$X^{(d)}=\begin{bmatrix}H^{(\mathrm{base},d)}\\ H^{(\mathrm{sft},d)}\\ H^{(\mathrm{RPSFT},d)}\\ H^{(\mathrm{dft},d)}\\ H^{(\mathrm{iw},d)}\end{bmatrix}\in\mathbb{R}^{M_{d}\times p},\qquad M_{d}=5N_{d},$$

center the matrix as $\tilde{X}^{(d)}=X^{(d)}-\mathbf{1}\,\bar{x}^{(d)\top}$, and project onto the first two PCA directions:

$$Z^{(d)}=\tilde{X}^{(d)}W_{1:2}\in\mathbb{R}^{M_{d}\times 2}.$$

If $z_{i}^{(m,d)}\in\mathbb{R}^{2}$ is the PCA coordinate of prompt $i$, the PCA-plane centroid is

$$c^{(m,d)}=\frac{1}{N_{d}}\sum_{i=1}^{N_{d}}z_{i}^{(m,d)},$$

and the plotted centroid shift is

$$D_{\mathrm{PCA}}^{(m,d)}=\left\|c^{(m,d)}-c^{(\mathrm{base},d)}\right\|_{2}.$$

In the plots, each point is one prompt and each arrow starts at the base centroid $c^{(\mathrm{base},d)}$ and ends at the centroid of a tuned model. We use $D_{\mathrm{hidden}}^{(m,d)}$ for the quantitative drift measurement and $D_{\mathrm{PCA}}^{(m,d)}$ only for the 2D visualization.
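A minimal NumPy sketch of both drift measures, assuming hidden states are already pooled per prompt (names and the dictionary interface are illustrative):

```python
import numpy as np

def centroid_drift(H_by_model, base="base"):
    """D_hidden: distance of each tuned model's hidden-state centroid from
    the base-model centroid. H_by_model maps a model name to an (N_d, p)
    matrix of pooled hidden representations."""
    mu = {m: H.mean(axis=0) for m, H in H_by_model.items()}
    return {m: float(np.linalg.norm(mu[m] - mu[base]))
            for m in H_by_model if m != base}

def pca_centroid_shift(H_by_model, base="base"):
    """D_PCA: the same centroid shift measured after projecting the stacked,
    centered representations onto the first two PCA directions."""
    names = list(H_by_model)
    X = np.vstack([H_by_model[m] for m in names])  # stack all models' prompts
    Xc = X - X.mean(axis=0)                        # center: X - 1 x_bar^T
    Vt = np.linalg.svd(Xc, full_matrices=False)[2]
    Z = Xc @ Vt[:2].T                              # (M_d, 2) PCA coordinates
    bounds = np.cumsum([0] + [len(H_by_model[m]) for m in names])
    c = {m: Z[bounds[i]:bounds[i + 1]].mean(axis=0)
         for i, m in enumerate(names)}
    return {m: float(np.linalg.norm(c[m] - c[base])) for m in names if m != base}
```

Because the 2D PCA projection is orthonormal, the projected shift can never exceed the full hidden-space distance, which is one reason to reserve it for visualization and report $D_{\mathrm{hidden}}$ quantitatively.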
Figure 12: Hidden-state drift on Llama-3.1-8B-Instruct. Across the benchmark panels, RPSFT remains close to the base-model centroid and is competitive with the strongest tuned alternatives, which is consistent with reduced representation drift.

Figure 13: Hidden-state drift on Qwen2.5-7B-Instruct. The PCA view compares centroid shifts away from the base model across benchmark panels. RPSFT is usually among the closest tuned centroids and remains closer than standard SFT in most panels, which is consistent with reduced representation drift after supervised fine-tuning.