Weight-Space Geometry of Offline Reasoning Training

arXiv cs.LG Papers

Summary

This paper investigates whether different offline reinforcement learning losses (RFT, RIFT, DFT, Offline GRPO, DPO) for reasoning distillation produce mechanistically distinct weight updates in a small language model. Using identical math rollouts and a controlled setup with Qwen3-4B and attention-only LoRA, they find that SFT, RFT, and RIFT yield nearly colinear weight deltas, while DPO sits in a near-orthogonal subspace and achieves the highest accuracy.

arXiv:2606.23740v1 Announce Type: new Abstract: Offline reinforcement-learning losses (RFT, RIFT, DFT, Offline GRPO, DPO) are widely used to distill reasoning from large teachers into smaller students, and are typically compared on downstream accuracy alone. We ask whether they are mechanistically distinct or converge to a similar weight update. Training six methods (SFT, RFT, DFT, RIFT, Offline GRPO, DPO) on identical math rollouts from a single base model (Qwen3-4B) with attention-only LoRA, we analyze the resulting deltas via cosine similarity, principal-angle subspace analysis, linear mode connectivity, and CKA. We observe: (i) SFT, RFT, and RIFT have nearly colinear weight deltas (cosine >= 0.97, top-1 principal angle ~7 deg median over 144 modules) and comparable GSM8K accuracy (87-88%, n=1319; pairwise McNemar p >= 0.15); (ii) DFT diverges further in direction than any reward-weighted method despite using the same data; (iii) Offline GRPO adds a substantial component orthogonal to the SFT direction (~67% globally, up to ~86% in late layers) while staying in the SFT loss basin; (iv) DPO sits in a near-orthogonal subspace, shows a mode-connectivity barrier, and collapses late-layer CKA to ~0.46. DPO also reaches the highest accuracy in our protocol on both GSM8K (93.5%, McNemar p < 10^-9 vs. each other method) and AIME26 (30.0% vs. 3.3-10.0%); its training uses a 10x smaller learning rate than the others (the standard convention), so the update-norm and accuracy gaps reflect loss-function and optimizer choices jointly, and a learning-rate-matched DPO comparison is left for future work.

# Weight-Space Geometry of Offline Reasoning Training
Source: [https://arxiv.org/html/2606.23740](https://arxiv.org/html/2606.23740)

Mechanistic Interpretability, Offline RL, Reasoning, LoRA, Weight Space

## 1 Introduction

Reasoning distillation has become a standard recipe for teaching small models to solve math and code tasks: a strong teacher generates rollouts, and a student is trained on them with one of a rapidly growing list of offline objectives. The past year alone introduced RIFT (Liu et al., [2026](https://arxiv.org/html/2606.23740#bib.bib1)), Offline GRPO (KRAFTON AI, [2025](https://arxiv.org/html/2606.23740#bib.bib2)), DFT (Wu and others, [2025](https://arxiv.org/html/2606.23740#bib.bib3)), LUFFY (Yan et al., [2025](https://arxiv.org/html/2606.23740#bib.bib4)), and DAPO (Yu and others, [2025](https://arxiv.org/html/2606.23740#bib.bib5)), alongside an established preference-learning family: DPO (Rafailov et al., [2023](https://arxiv.org/html/2606.23740#bib.bib12)), KTO (Ethayarajh et al., [2024](https://arxiv.org/html/2606.23740#bib.bib13)), IPO (Azar et al., [2023](https://arxiv.org/html/2606.23740#bib.bib14)), and NCA (Chen et al., [2024](https://arxiv.org/html/2606.23740#bib.bib17)). Each is accompanied by claims that its specific loss formulation is responsible for accuracy gains over plain SFT.

These methods are compared almost exclusively by benchmark accuracy. What they do to the model is unknown: do different losses produce weight updates that point in the same direction, or qualitatively different ones? The distinction matters for both practitioners (which loss is worth implementing?) and interpretability researchers (does “offline RL” name a single mechanism or a family?).

We present a controlled weight-space comparison of offline reasoning losses: identical rollouts, identical base model (Qwen3-4B-Instruct), shared LoRA initialization, six methods (DPO uses a smaller learning rate per its codebase convention, see §[2](https://arxiv.org/html/2606.23740#S2)). Following recent weight-space studies of fine-tuning (Arturi and others, [2025](https://arxiv.org/html/2606.23740#bib.bib6); Soligo and others, [2025](https://arxiv.org/html/2606.23740#bib.bib7); Zhong and Raghunathan, [2025](https://arxiv.org/html/2606.23740#bib.bib8); Ward and others, [2025](https://arxiv.org/html/2606.23740#bib.bib9)), we analyze each method’s LoRA delta $\Delta W$ rather than its outputs.

Our contributions are: (1) reward-weighted losses (SFT, RFT, RIFT) converge on essentially the same direction in weight space (cosine $\geq 0.97$) and produce GSM8K accuracies that are non-different by exact McNemar’s test ($p \geq 0.15$, $n = 1319$); (2) DFT, despite being a one-line modification of SFT, produces a more distinctive update than any explicitly reward-weighted method; (3) Offline GRPO adds a quantifiable orthogonal component (globally $67\%$, rising to $\sim 80\%$ in late layers) while staying in the same loss basin as SFT/RIFT; (4) DPO sits in a near-orthogonal subspace with higher effective rank, a sharp linear-mode barrier, and reaches the highest pass@1 on both GSM8K and AIME26 in our protocol; we report this with the caveat that DPO uses a $10\times$ smaller learning rate, so the loss formulation and optimizer setting cannot be cleanly separated here.

## 2 Setup

#### Data.

All methods share one set of rollouts: DeepScaleR prompts ($\sim 40$k verified math (Agentica, [2025](https://arxiv.org/html/2606.23740#bib.bib18))), teacher DeepSeek-V4-Flash, $K = 4$ CoT completions per prompt, and a binary `math-verify` reward. Reference-policy methods use $\pi_{\text{base}}$. DPO consumes $\sim 1.8$K (chosen, rejected) pairs vs. $\sim 75$K rows for the rest. Sharing identical rollouts across methods is the central control.

#### Methods.

Table [1](https://arxiv.org/html/2606.23740#S2.T1) summarizes the six losses. With $\ell_i = -\log \pi_\theta(y_i \mid x)$ as shorthand for the per-sequence NLL:

- SFT $= \sum_i \ell_i$ on all $i$;
- RFT (Yuan et al., [2023](https://arxiv.org/html/2606.23740#bib.bib10)) $= \sum_{i: r_i = 1} \ell_i$ on positives only;
- DFT (Wu and others, [2025](https://arxiv.org/html/2606.23740#bib.bib3)) $= \sum_t \mathrm{sg}(\pi_\theta(y_t \mid y_{<t}, x))\, \ell_t^{\text{tok}}$ weights each token's NLL by its own stop-gradient probability, so low-probability tokens are down-weighted relative to confident ones;
- RIFT (Liu et al., [2026](https://arxiv.org/html/2606.23740#bib.bib1)) $= \sum_i (1 - r_i\, \lambda)\, \ell_i$ is a linear-in-reward surrogate that admits negatives;
- Offline GRPO (KRAFTON AI, [2025](https://arxiv.org/html/2606.23740#bib.bib2); Shao et al., [2024](https://arxiv.org/html/2606.23740#bib.bib11)) $= \sum_i \hat{A}_i\, \ell_i$ with $\hat{A}_i = r_i - \bar{r}_g + b$;
- DPO (Rafailov et al., [2023](https://arxiv.org/html/2606.23740#bib.bib12)) $= -\log \sigma\!\left(\beta \left[\log \frac{\pi_\theta(y_w)}{\pi_{\text{ref}}(y_w)} - \log \frac{\pi_\theta(y_l)}{\pi_{\text{ref}}(y_l)}\right]\right)$ on (chosen $y_w$, rejected $y_l$) pairs.
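The loss definitions above can be read as per-sequence weights on the NLL, plus a separate pairwise form for DPO. The following is a minimal NumPy sketch of that reading, not the authors' training code. The function names are hypothetical, the RIFT coefficient `lam=0.5` is an illustrative placeholder (the text does not state $\lambda$), and the defaults `bias=0.1` and `beta=0.1` are the values quoted in the training setup. DFT is omitted because it weights individual tokens rather than whole sequences.

```python
import numpy as np

def sequence_weights(method, rewards, lam=0.5, bias=0.1):
    """Weights w_i such that loss = sum_i w_i * nll_i for one prompt's rollouts.

    rewards: binary rewards r_i for the K completions of a single prompt.
    lam is a hypothetical value; bias is the additive GRPO bias b.
    """
    r = np.asarray(rewards, dtype=float)
    if method == "sft":            # every completion, uniform weight
        return np.ones_like(r)
    if method == "rft":            # positives only
        return (r == 1).astype(float)
    if method == "rift":           # (1 - r_i * lambda), as the formula is written in the text
        return 1.0 - r * lam
    if method == "offline_grpo":   # centered advantage plus bias: r_i - mean(r) + b
        return r - r.mean() + bias
    raise ValueError(f"unknown method: {method}")

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * [(log-ratio of chosen) - (log-ratio of rejected)])."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return np.logaddexp(0.0, -margin)  # numerically stable -log(sigmoid(margin))
```

Written this way, SFT, RFT, and RIFT differ only in a non-negative scalar per sequence, whereas Offline GRPO can assign negative weights and DPO depends on a difference of two sequences.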

Table 1: Offline reasoning losses studied. “Neg.” = uses negative samples; “Rew.” = uses scalar reward; “Ref.” = needs a reference policy.

#### Training.

Qwen3-4B-Instruct-2507, LoRA on attention projections (`q,k,v,o_proj`; rank $32$, $\alpha = 64$, dropout $0$; $144$ modules over $36$ layers). Effective batch $32$, cosine schedule, $5\%$ warmup, wd $0.01$, grad-clip $1.0$, seed $42$, bf16. Peak LR $5 \times 10^{-6}$ for all but DPO ($5 \times 10^{-7}$, codebase convention; higher diverges). DPO: sigmoid loss, $\beta = 0.1$; Offline GRPO: additive bias $0.1$ on the centered advantage, no probability weighting, no explicit KL. $1{,}500$ steps; we report step $1{,}000$ uniformly.
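To make the schedule concrete, here is a small sketch of a cosine schedule with $5\%$ warmup over $1{,}500$ steps. The text only names "cosine schedule" and "5% warmup", so the linear warmup shape and the decay to exactly zero are assumptions, and `lr_at` is a hypothetical helper rather than a function from the released code.

```python
import math

def lr_at(step, total_steps=1500, peak_lr=5e-6, warmup_frac=0.05):
    """Learning rate at a 0-indexed step.

    Assumes linear warmup to peak_lr, then cosine decay to 0 (both assumptions).
    """
    warmup_steps = round(total_steps * warmup_frac)  # 75 steps for the stated recipe
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# Module count stated in the text: 4 attention projections in each of 36 layers.
NUM_LORA_MODULES = 4 * 36
```

Under these assumptions the reported checkpoint (step $1{,}000$) is taken part-way down the decay, and DPO follows the same curve scaled by its peak, e.g. `lr_at(1000, peak_lr=5e-7)`.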

#### Analysis.

Let $\Delta W^{(m)} = B^{(m)} A^{(m)}$ be the stacked LoRA delta of method $m$. We measure: (i) global/per-layer cosine $\langle \Delta W^{(m)}, \Delta W^{(m')} \rangle / \left( \|\Delta W^{(m)}\| \, \|\Delta W^{(m')}\| \right)$; (ii) per-layer SVD (effective rank, principal angles between top-$k$ subspaces); (iii) linear mode connectivity (Frankle et al., [2020](https://arxiv.org/html/2606.23740#bib.bib19)): masked-answer CE on GSM8K along $\alpha \Delta W^{(m)} + (1 - \alpha) \Delta W^{(m')}$; (iv) CKA (Kornblith et al., [2019](https://arxiv.org/html/2606.23740#bib.bib21)) of merged-model hidden states.
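The global cosine in (i) can be sketched as follows, assuming each method's adapter is available as a dictionary from module name to its $(B, A)$ factors. That storage layout and the function names are illustrative assumptions, not the released analysis scripts.

```python
import numpy as np

def lora_delta(B, A):
    """Dense update Delta W = B @ A for one adapted module."""
    return B @ A

def global_cosine(deltas_m, deltas_n):
    """Cosine between two methods' stacked deltas.

    deltas_*: dict mapping module name -> Delta W matrix (same keys and shapes).
    All modules are flattened and concatenated in a fixed key order.
    """
    keys = sorted(deltas_m)
    x = np.concatenate([deltas_m[k].ravel() for k in keys])
    y = np.concatenate([deltas_n[k].ravel() for k in keys])
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
```

A per-layer cosine is the same computation restricted to the keys of one layer. Because the cosine is scale-free, two methods that differ only in step size along the same direction score $1$.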

## 3 Results

### 3.1 Downstream accuracy

Figure [3](https://arxiv.org/html/2606.23740#A1.F3) (appendix) reports greedy pass@1 on full GSM8K ($n = 1319$) and AIME26 ($n = 30$). SFT/RFT/DFT/RIFT/Offline GRPO sit at $87.3$–$88.2\%$ on GSM8K, pairwise non-different by exact McNemar ($p \geq 0.15$); DPO reaches $93.5\%$ ($p < 10^{-9}$). On AIME26 the ordering repeats but $n = 30$ is underpowered (SFT–DPO $p = 0.07$). DPO trains at a $10\times$ smaller LR with $\sim 40\times$ fewer rows, so we treat the gap as suggestive. Llama-3.2-3B replicates the geometry and the $5$–$7$ point DPO accuracy edge.
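For readers unfamiliar with the test used above, the exact McNemar test compares two models on the same items using only the discordant pairs: $b$ items that model A solves and model B misses, and $c$ items for the reverse. Below is a standard-library sketch of the two-sided exact version. The paper does not report its discordant counts, so any counts passed in are hypothetical.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: items only model A gets right; c: items only model B gets right.
    Under H0 the smaller count follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)
```

The test ignores items that both models solve or both miss, which is why two models with similar headline accuracy can still differ significantly if their errors rarely overlap.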

#### On-policy RL preserves accuracy; SFT-style loses it.

Re-measuring greedy pass@1 for the on-policy methods (Table [2](https://arxiv.org/html/2606.23740#S3.T2), our adapters, same protocol) shows their reward-orthogonal updates (§[3.4](https://arxiv.org/html/2606.23740#S3.SS4)) do *not* cost accuracy: Online GRPO/DAPO and DPO all stay at the base instruct model’s $93$–$94\%$ on GSM8K, whereas SFT and Offline GRPO drop to $\sim 87\%$ (below base). Online GRPO is best on AIME26 ($20.0\%$).

Table 2: Greedy pass@1 (%) for our consistently-trained adapters; GSM8K $n = 1319$, AIME26 $n = 30$. SFT-direction methods sit below the base model; reward-orthogonal methods match or beat it.

### 3.2 Weight-space convergence

![Refer to caption](https://arxiv.org/html/2606.23740v1/figures/cosine_all8.png)

Figure 1: Global $\Delta W$ cosine across all eight losses (Qwen3-4B, seed $42$; all adapters trained in one consistent space). Reward-weighted SFT/RFT/RIFT cluster ($0.94$–$0.98$); DFT intermediate ($\sim 0.55$); *Offline* GRPO at $0.71$–$0.80$ to the cluster; DPO near-orthogonal ($\leq 0.13$). The two *on-policy* methods, *Online* GRPO and *Online* DAPO, are each near-orthogonal to every offline loss *and to each other* ($-0.16$); orthogonal-fraction off SFT is $0.69$ (offline GRPO) vs. $0.998$/$0.995$ (online GRPO/DAPO). On-policy sampling, not the group-relative loss, drives the departure from SFT.

Figure [1](https://arxiv.org/html/2606.23740#S3.F1) shows the cosine similarity matrix between global LoRA deltas. Three regimes are visible. *First*, SFT, RFT, and RIFT form a tight cluster: the SFT–RFT, SFT–RIFT, and RFT–RIFT cosines are $0.977$, $0.967$, $0.969$. Filtering negatives (RFT) and reward-weighting them (RIFT) does not measurably change the direction of the update relative to plain SFT on the union; it only adjusts the step size, with $\|\Delta W\|_F$ ranging from $2.82$ (RFT) to $3.04$ (RIFT). *Second*, DFT, which differs from SFT by a single multiplicative factor on the loss, sits at cosine $0.572$ to SFT and $0.536$ to RIFT, a larger directional change than any explicitly reward-weighted method. *Third*, DPO is orthogonal to everything: cosine to SFT, RFT, and RIFT all fall in $[0.057, 0.065]$. Offline GRPO occupies an intermediate position (cosine $\approx 0.74$ to the SFT cluster).

The per-layer view (Figure [4](https://arxiv.org/html/2606.23740#A1.F4)) decomposes this further. SFT/RFT/RIFT pairs are essentially flat above $0.95$ at every layer. SFT–GRPO drops gradually with depth. SFT–DFT is bimodal: close to $1$ at certain bottleneck layers and well below $0.5$ at others. SFT–DPO hovers near zero throughout. The visible drop on layers $34$–$36$ across all pairs reflects the small-norm tail of LoRA updates in the last decoder block; we treat this as a LoRA artifact rather than a finding.

### 3.3 Subspace analysis

Per-layer SVD lets us go beyond a single direction and ask whether two methods adapt the same low-dimensional subspace. We report principal angles between the top-$10$ left singular vectors of each $\Delta W^{(m)}$ at the same module. Smaller angles mean shared subspace.

Aggregating across all $144$ attention modules (top-$10$ left singular vectors per module), median top-$1$ principal angles are $6.7^\circ$ (SFT–RFT), $8.2^\circ$ (SFT–RIFT), $18.5^\circ$ (SFT–Offline GRPO), $26.7^\circ$ (SFT–DFT), and $54.6^\circ$ (SFT–DPO); the median worst (top-$10$) angles are $36^\circ$, $40^\circ$, $76^\circ$, $85^\circ$, $90^\circ$ in the same order. SFT–DPO IQR for the worst angle is $[89.6^\circ, 89.8^\circ]$: essentially every module is orthogonal at every singular index. The reward-weighted cluster shares the top of its subspace with SFT to within $\sim 10^\circ$; GRPO and DFT partially overlap; DPO does not.
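The principal angles reported above can be computed per module with two SVDs and one small SVD of the cross-product, which is the standard construction. The sketch below assumes dense $\Delta W$ matrices as input; `principal_angles_deg` is a hypothetical helper name.

```python
import numpy as np

def principal_angles_deg(dW_a, dW_b, k=10):
    """Principal angles (degrees, ascending) between the top-k left singular subspaces.

    The first entry is the smallest ("top-1") angle; the last is the worst angle.
    """
    Ua = np.linalg.svd(dW_a, full_matrices=False)[0][:, :k]  # orthonormal basis, method a
    Ub = np.linalg.svd(dW_b, full_matrices=False)[0][:, :k]  # orthonormal basis, method b
    # Singular values of Ua^T Ub are the cosines of the principal angles.
    cosines = np.linalg.svd(Ua.T @ Ub, compute_uv=False)
    return np.degrees(np.arccos(np.clip(cosines, 0.0, 1.0)))
```

Taking the median of the first and last entries over all modules gives statistics of the kind quoted in the text. The clip guards against cosines that drift slightly above $1$ from floating-point error.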

The effective rank, averaged over all $144$ modules, is $\sim 16$ for SFT, RFT, DFT, and RIFT, $14.8$ for Offline GRPO, and $24.5$ for DPO. DPO writes into a higher-dimensional subspace, but, given its $13\times$ smaller Frobenius norm, with much smaller singular values; together with the orthogonality to SFT, this suggests DPO learns a different decomposition of the same projection matrices rather than a low-rank refinement of SFT.

To quantify how much of Offline GRPO’s update is genuinely new direction, we project $\Delta W^{\mathrm{grpo}}$ onto the SFT direction at every adapted module and report $\|\Delta W^{\mathrm{grpo}} - \Pi_{\mathrm{sft}} \Delta W^{\mathrm{grpo}}\|_F / \|\Delta W^{\mathrm{grpo}}\|_F$. Globally this is $0.67$; per layer it grows from $\sim 0.55$ in middle blocks to $0.79$–$0.86$ in the final five blocks, the same layers where CKA diverges (Section [3.6](https://arxiv.org/html/2606.23740#S3.SS6)).
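One way to read the orthogonal fraction is sketched below, under the assumption that $\Pi_{\mathrm{sft}}$ projects onto the single flattened SFT direction (a one-dimensional projection). The text does not spell out the projector, so this is an interpretation, and the helper names are hypothetical. Under that assumption the ratio equals $\sqrt{1 - \cos^2}$, and the reported cosine of about $0.74$ would give about $0.67$, in line with the global figure.

```python
import numpy as np

def cosine(x, y):
    """Cosine between two arrays after flattening."""
    x, y = x.ravel(), y.ravel()
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def orthogonal_fraction(dW, dW_ref):
    """||dW - Proj_ref(dW)||_F / ||dW||_F, with Proj_ref the rank-1 projection
    onto the flattened reference direction (an assumed reading of Pi_sft)."""
    x, d = dW.ravel(), dW_ref.ravel()
    proj = (x @ d) / (d @ d) * d   # component of x along the reference direction
    return float(np.linalg.norm(x - proj) / np.linalg.norm(x))
```

If the authors instead project onto a higher-dimensional SFT subspace per module, the fraction would be smaller than this one-dimensional version for the same pair of deltas.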

#### Top-1 singular directions.

The rank-1 approximation $\Delta W \approx \sigma_1 u_1 v_1^\top$ isolates the single most important output direction $u_1$ each loss writes into. Mean $|\langle u_1^{m}, u_1^{m'} \rangle|$ over $144$ modules is $0.97$–$0.98$ within SFT/RFT/RIFT, $0.78$–$0.80$ to Offline GRPO, $0.64$–$0.67$ to DFT, and $0.11$ to DPO. Right singular vectors $v_1$ (input directions) converge much more tightly: $0.99$–$1.00$ for non-DPO pairs, $0.94$ for DFT, $0.66$–$0.70$ for DPO. All methods (except DPO) read from nearly the same input subspace; they differ in how they transform it.
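The top-1 overlaps above compare the leading left and right singular vectors of two deltas at the same module. A minimal sketch, with the hypothetical helper `top1_overlap`, is shown here; the absolute value is needed because singular vectors are only defined up to sign.

```python
import numpy as np

def top1_overlap(dW_a, dW_b):
    """Return (|<u1_a, u1_b>|, |<v1_a, v1_b>|) for two deltas of the same shape.

    u1 is the leading output (left) direction, v1 the leading input (right) direction.
    """
    Ua, _, Vta = np.linalg.svd(dW_a, full_matrices=False)
    Ub, _, Vtb = np.linalg.svd(dW_b, full_matrices=False)
    u_overlap = abs(float(Ua[:, 0] @ Ub[:, 0]))
    v_overlap = abs(float(Vta[0] @ Vtb[0]))
    return u_overlap, v_overlap

# Toy case: same output direction, orthogonal input directions.
u = np.array([1.0, 0.0, 0.0])
toy_u, toy_v = top1_overlap(np.outer(u, [1.0, 0.0]), np.outer(u, [0.0, 1.0]))
```

In the toy case the two matrices write to the same output direction while reading from orthogonal inputs, so `toy_u` is $1$ and `toy_v` is $0$; the full-matrix cosine of that pair is also $0$, which shows how agreement on one side can be hidden in the global cosine.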

### 3.4 Seed and learning-rate sensitivity

The colinearity above is at a *single* seed, conflating loss agreement with shared-init agreement. We disentangle by training each loss at two seeds ($42$, $123$) and three LRs ($5 \times 10^{-7}$ through $5 \times 10^{-5}$); $\Delta W = (\alpha / r) B A$ is gauge-invariant, so its cosine is genuine.
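The gauge-invariance remark can be checked numerically. For any invertible $r \times r$ matrix $G$, the factor pairs $(B, A)$ and $(B G^{-1}, G A)$ give the same product, so comparing raw factors is ambiguous while comparing $\Delta W$ is not. The sketch below uses arbitrary toy shapes and a random $G$; none of these values come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 16, 12, 4, 8.0   # toy sizes, chosen only for illustration

B = rng.normal(size=(d_out, r))
A = rng.normal(size=(r, d_in))

# Invertible by construction: upper triangular with a nonzero diagonal.
G = np.triu(rng.normal(size=(r, r)), k=1) + np.diag([1.0, 2.0, 0.5, 3.0])

B2, A2 = B @ np.linalg.inv(G), G @ A      # a different factorization of the same update

def cosine(x, y):
    x, y = x.ravel(), y.ravel()
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

delta = (alpha / r) * B @ A
delta2 = (alpha / r) * B2 @ A2
cos_delta = cosine(delta, delta2)   # the merged update is unchanged
cos_A = cosine(A, A2)               # the raw A factors are not
```

This is why the analysis compares merged deltas: two adapters can implement an identical update while their stored $A$ and $B$ matrices look unrelated.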

#### Seed rotates $\Delta W$ more than the loss, but only on the input side.

At a fixed seed SFT–RFT are colinear (cosine $0.996$, angle $3.7^\circ$), yet the *same loss at two seeds* has cosine only $0.07$ ($5 \times 10^{-7}$) to $0.36$ ($5 \times 10^{-5}$). Cause: LoRA’s random $A$ init. Across seeds the top-$1$ *output* direction $u_1$ still agrees at $0.99$ while the *input* direction $v_1$ agrees at $0.07$ (median top-$8$ angle $26^\circ$ vs. $76^\circ$ for unrelated runs). Functionally the seeds are the *same* solution: interpolating their deltas shows no barrier (midpoint $+0.004$). So the cross-method colinearity is partly shared-init, but convergence onto a common output subspace is seed-robust (Figure [2](https://arxiv.org/html/2606.23740#S3.F2)).

#### Learning rate changes direction, not just magnitude.

A $10\times$ LR step rotates $\Delta W$ (cosine $\approx 0.55$) and grows its norm only $\sim 3\times$. This is not a pure rescaling, which sharpens the caveat on the $10\times$-smaller-LR DPO comparison.

#### Online GRPO is far more orthogonal than offline GRPO.

We also train *online* GRPO under the same LoRA recipe (on-policy rollouts, group-relative advantage, `math_verify` reward; $600$ steps, $8$ generations/prompt, lr $5 \times 10^{-6}$, seed $42$), the comparison the original protocol could not produce. The resulting update is almost entirely orthogonal to the SFT/RFT cluster: cosine $0.025$ to SFT and $0.024$ to RFT, with an orthogonal fraction of $0.998$ off the SFT direction (Figure [1](https://arxiv.org/html/2606.23740#S3.F1)), versus $0.67$ for *offline* GRPO (§[3.2](https://arxiv.org/html/2606.23740#S3.SS2)). Its Frobenius norm is $\sim 10\times$ smaller than SFT’s at the same LR ($0.30$ vs. $2.84$), echoing the small-norm regime of DPO. On-policy sampling thus moves the update off the shared SFT subspace far more than the offline group-relative loss does, indicating that the SFT/offline-RL directional convergence is partly a consequence of training on the *same fixed rollouts*: replacing them with on-policy samples largely breaks it.

![Refer to caption](https://arxiv.org/html/2606.23740v1/figures/seed_cosine_vs_lr.png)

![Refer to caption](https://arxiv.org/html/2606.23740v1/figures/seed_lr_norm_vs_direction.png)

![Refer to caption](https://arxiv.org/html/2606.23740v1/figures/seed_mode_connectivity.png)

Figure 2: Seed and learning-rate sensitivity (SFT, Qwen3-4B). *Left:* across two seeds the output direction $u_1$ stays aligned ($\sim 0.99$) while the input direction $v_1$ and full cosine are low at small LR and rise with LR; dashed shows SFT–RFT at a fixed seed. *Middle:* a $10\times$ LR step rotates $\Delta W$ (cosine $\approx 0.55$) and grows its norm sub-linearly, so LR is not a pure rescaling. *Right:* interpolating the two seeds’ deltas shows no loss barrier: different weights, same basin.

### 3.5 Linear mode connectivity

We linearly interpolate LoRA deltas, merge into the base, and measure the per-token cross-entropy of the gold `\boxed{answer}` continuation right after the prompt on GSM8K. The metric is length-sensitive: DPO produces longer, structured CoTs (median $5100$ vs $1100$ chars on correct AIME26; verification steps in $9/9$ correct DPO solutions vs $0/3$ SFT), inflating per-token NLL of the bare boxed answer. Offline GRPO $\to$ RIFT improves monotonically ($4.93 \to 2.25$) and SFT $\to$ Offline GRPO worsens monotonically ($2.06 \to 4.93$): one basin. RIFT $\to$ DPO shows a sharp non-monotonic barrier above $\alpha = 0.5$ ($3.82 \to 7.06 \to 8.64$): even discounting length, linear interpolation destroys the solution.
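The interpolation procedure above can be sketched in two parts: forming the interpolated delta, and summarizing a loss curve along the path. The barrier definition used here (maximum excess of the path loss over the straight line between the endpoint losses) is one common convention and an assumption on our part; the text does not state which definition it uses. Function names and the example curves in the usage note are hypothetical.

```python
import numpy as np

def interpolate_delta(dW_m, dW_n, alpha):
    """alpha * dW_m + (1 - alpha) * dW_n, the path used for mode connectivity."""
    return alpha * dW_m + (1.0 - alpha) * dW_n

def barrier_height(losses):
    """Largest excess of the path loss over the chord joining the endpoint losses.

    losses: loss values at evenly spaced alpha in [0, 1], endpoints included.
    Returns 0 for a path that never rises above the chord.
    """
    losses = np.asarray(losses, dtype=float)
    t = np.linspace(0.0, 1.0, len(losses))
    chord = (1.0 - t) * losses[0] + t * losses[-1]
    return float(np.max(losses - chord))
```

For example, a made-up curve `[1.0, 5.0, 1.0]` has barrier height $4$, while a monotone curve `[1.0, 2.0, 3.0]` has height $0$. In practice each loss value would come from merging `interpolate_delta(...)` into the base weights and evaluating the masked-answer cross-entropy.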

### 3.6 Representational similarity

CKA on hidden states (Figure [6](https://arxiv.org/html/2606.23740#A1.F6), $100$ GSM8K prompts) confirms the weight-space picture. SFT–RIFT CKA stays above $0.99$ at every layer; SFT/RIFT–Offline GRPO drop to $\sim 0.85$ in the final layer (GRPO reshapes output-facing layers); Offline GRPO–DPO and DFT–DPO start near $0.93$ and collapse to $\sim 0.45$ from layer $25$ onward. A logit-lens probe gives a complementary null (mean prediction depth is $35.3$–$36.0$ out of $36$ for every method), but this is partly forced by attention-only LoRA leaving MLPs frozen, so it should not be read as a finding about the losses.
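For reference, linear CKA between two sets of hidden states can be computed directly from centered feature matrices. The sketch below follows the standard linear-CKA formula; how the authors pool hidden states over tokens and prompts is not stated, so the input layout (one row per example) is an assumption.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between representations X (n x d1) and Y (n x d2).

    Rows are the same n examples under two models; features are centered first.
    """
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return float(cross / (norm_x * norm_y))
```

Linear CKA is invariant to orthogonal rotations and isotropic rescaling of either representation, so a value well below $1$ indicates a change in representational geometry rather than a change of basis.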

## 4 Discussion

#### Reward-weighted MLE is SFT; DFT is the surprise.

SFT, RFT, and RIFT differ only in how they handle negative samples (include uniformly, drop, or weight), yet the resulting LoRA deltas have cosine $\geq 0.97$, principal angles $< 25^\circ$, and indistinguishable per-layer CKA, and they sit within $1$ percentage point of each other on full GSM8K. The Frobenius norms differ by up to $7\%$. If RIFT outperforms SFT at the same step count, our results suggest the explanation lies in magnitude (effective step size in the SFT direction), not direction; a longer or higher-lr SFT run should close the gap. DFT, by contrast, has cosine $\sim 0.55$ to SFT/RIFT despite using *less* information than they do (no reward, no filtering): self-weighting by $\mathrm{sg}(\pi_\theta)$ reweights *which* examples drive the update in a way explicit reward does not, yet leaves the loss basin unchanged.

#### Offline GRPO shifts direction but stays in basin; online GRPO does not.

Among offline rewards, only Offline GRPO substantially shifts direction from SFT (cosine $0.73$; orthogonal-fraction $0.67$, $\sim 0.8$ late; angles up to $59^\circ$), yet barrier-free interpolations keep it in the SFT/RIFT basin. *Online* GRPO goes much further, with orthogonal fraction $0.998$ (Figure [1](https://arxiv.org/html/2606.23740#S3.F1)), so the SFT/offline-RL convergence is partly an artifact of shared fixed rollouts, which on-policy sampling breaks.

#### DPO sits apart, geometrically and on accuracy.

DPO occupies a near-orthogonal subspace ($74^\circ$–$89^\circ$), produces a higher-rank update with much smaller Frobenius norm, shows a sharp linear-mode barrier, and has late-layer CKA $\sim 0.45$; it also reaches the highest pass@1 on GSM8K ($93.5\%$) and AIME26 ($30.0\%$). It trains at a $10\times$ smaller LR, so its norm/accuracy gaps are entangled with the optimizer: this is a correlation worth an LR-matched follow-up, not a causal claim. Extending the weight-space view to the wider contrastive family (IPO, KTO, SimPO, Cal-DPO) is left open.

#### Limitations.

Single domain and checkpoint; attention-only LoRA; greedy-only on small AIME26 ($n = 30$). DPO uses $10\times$ smaller LR and $\sim 40\times$ fewer rows, so its norm/accuracy gaps are entangled with the optimizer. Online GRPO is reported at lr $5 \times 10^{-6}$, seed $42$ (§[3.4](https://arxiv.org/html/2606.23740#S3.SS4)); the remaining LR/seed cells, a matched accuracy comparison, and calibrated DPO variants (KTO, IPO, SimPO) are left to a fuller sweep. Code, adapters, and analysis scripts are released.

## References

- Agentica (2025) DeepScaleR-Preview-Dataset: a 40k reasoning-intensive mathematics corpus. Note: [https://huggingface.co/datasets/agentica-org/DeepScaleR-Preview-Dataset](https://huggingface.co/datasets/agentica-org/DeepScaleR-Preview-Dataset). Cited by: [§2](https://arxiv.org/html/2606.23740#S2.SS0.SSS0.Px1.p1.5).
- Arturi et al. (2025) Shared parameter subspaces in emergently misaligned behavior. In NeurIPS Workshop on Mechanistic Interpretability. Cited by: [§1](https://arxiv.org/html/2606.23740#S1.p3.1).
- M. G. Azar, M. Rowland, B. Piot, D. Guo, D. Calandriello, M. Valko, and R. Munos (2023) A general theoretical paradigm to understand learning from human preferences. arXiv preprint arXiv:2310.12036. Cited by: [§1](https://arxiv.org/html/2606.23740#S1.p1.1).
- H. Chen, G. Zhao, S. Zhang, H. Li, J. Zhu, and J. Sun (2024) Noise contrastive alignment of language models with explicit rewards. In Advances in Neural Information Processing Systems. Cited by: [§1](https://arxiv.org/html/2606.23740#S1.p1.1).
- K. Ethayarajh, W. Xu, N. Muennighoff, D. Jurafsky, and D. Kiela (2024) KTO: model alignment as prospect theoretic optimization. In International Conference on Machine Learning. Cited by: [§1](https://arxiv.org/html/2606.23740#S1.p1.1).
- J. Frankle, G. K. Dziugaite, D. M. Roy, and M. Carbin (2020) Linear mode connectivity and the lottery ticket hypothesis. In International Conference on Machine Learning. Cited by: [§2](https://arxiv.org/html/2606.23740#S2.SS0.SSS0.Px4.p1.5).
- S. Kornblith, M. Norouzi, H. Lee, and G. Hinton (2019) Similarity of neural network representations revisited. In International Conference on Machine Learning. Cited by: [§2](https://arxiv.org/html/2606.23740#S2.SS0.SSS0.Px4.p1.5).
- KRAFTON AI (2025) Offline GRPO for reasoning distillation. Note: Technical blog post and codebase [https://github.com/krafton-ai/offline-grpo](https://github.com/krafton-ai/offline-grpo). Cited by: [§1](https://arxiv.org/html/2606.23740#S1.p1.1), [§2](https://arxiv.org/html/2606.23740#S2.SS0.SSS0.Px2.p1.11).
- Z. Liu, S. Liu, T. Zhong, and M. Yuan (2026) RIFT: repurposing negative samples via reward-informed fine-tuning. arXiv preprint arXiv:2601.09253. Cited by: [§1](https://arxiv.org/html/2606.23740#S1.p1.1), [§2](https://arxiv.org/html/2606.23740#S2.SS0.SSS0.Px2.p1.11).
- R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn (2023) Direct preference optimization: your language model is secretly a reward model. In Advances in Neural Information Processing Systems. Cited by: [§1](https://arxiv.org/html/2606.23740#S1.p1.1), [§2](https://arxiv.org/html/2606.23740#S2.SS0.SSS0.Px2.p1.11).
- Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. K. Li, Y. Wu, and D. Guo (2024) DeepSeekMath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300. Cited by: [§2](https://arxiv.org/html/2606.23740#S2.SS0.SSS0.Px2.p1.11).
- Soligo et al. (2025) Convergent linear representations of emergent misalignment. In NeurIPS Workshop on Mechanistic Interpretability. Cited by: [§1](https://arxiv.org/html/2606.23740#S1.p3.1).
- Ward et al. (2025) Rank-1 LoRAs encode interpretable reasoning signals. In NeurIPS Workshop on Mechanistic Interpretability. Cited by: [§1](https://arxiv.org/html/2606.23740#S1.p3.1).
- Y. Wu et al. (2025) On the generalization of SFT: a reinforcement learning perspective with reward rectification. arXiv preprint arXiv:2508.05629. Cited by: [§1](https://arxiv.org/html/2606.23740#S1.p1.1), [§2](https://arxiv.org/html/2606.23740#S2.SS0.SSS0.Px2.p1.11).
- J. Yan, Y. Li, Z. Hu, Z. Wang, G. Cui, X. Qu, Y. Cheng, and Y. Zhang (2025) Learning to reason under off-policy guidance. arXiv preprint arXiv:2504.14945. Cited by: [§1](https://arxiv.org/html/2606.23740#S1.p1.1).
- Q. Yu et al. (2025) DAPO: an open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476. Cited by: [§1](https://arxiv.org/html/2606.23740#S1.p1.1).
- Z. Yuan, H. Yuan, C. Li, G. Dong, K. Lu, C. Tan, C. Zhou, and J. Zhou (2023) Scaling relationship on learning mathematical reasoning with large language models. arXiv preprint arXiv:2308.01825. Cited by: [§2](https://arxiv.org/html/2606.23740#S2.SS0.SSS0.Px2.p1.11).
- Zhong and Raghunathan (2025) Watch the weights: unsupervised monitoring and control of fine-tuned LLMs. In NeurIPS Workshop on Mechanistic Interpretability. Cited by: [§1](https://arxiv.org/html/2606.23740#S1.p3.1).

## Appendix A Supplementary figures

![Refer to caption](https://arxiv.org/html/2606.23740v1/figures/accuracy_comparison.png)

Figure 3: Greedy pass@1 with Wilson $95\%$ CI bars on GSM8K ($n = 1319$) and AIME26 ($n = 30$). Dark bars: Qwen3-4B-Instruct. Light bars: Llama-3.2-3B-Instruct. On both architectures, DPO sits noticeably above the SFT/RFT/DFT/RIFT/Offline GRPO cluster on GSM8K (Qwen3: McNemar $p < 10^{-9}$ vs. each other method); Llama-3.2-3B AIME26 floors near zero at this model scale.

![Refer to caption](https://arxiv.org/html/2606.23740v1/figures/cosine_per_layer_panel.png)

Figure 4: Per-layer cosine similarity of LoRA deltas to SFT, on Qwen3-4B (left, $36$ layers) and Llama-3.2-3B (right, $28$ layers). SFT/RFT/RIFT track each other across all layers; Offline GRPO, DFT, and especially DPO diverge in deeper layers, with the same qualitative pattern on both architectures.

![Refer to caption](https://arxiv.org/html/2606.23740v1/figures/mode_connectivity_panel.png)

Figure 5: Linear mode connectivity (masked-answer CE on GSM8K) on Qwen3-4B (left) and Llama-3.2-3B (right). Same picture: SFT/Off. GRPO/RIFT/DFT pairs are barrier-free; RIFT $\to$ DPO shows a sharp barrier above $\alpha = 0.5$ on both architectures (DPO endpoint loss $8.64$ Qwen3, $8.96$ Llama-3.2).

![Refer to caption](https://arxiv.org/html/2606.23740v1/figures/cka_per_layer_panel.png)

Figure 6: Linear CKA of hidden states across all blocks for selected method pairs on $100$ GSM8K prompts: Qwen3-4B (left, $36$ blocks), Llama-3.2-3B (right, $28$ blocks). On both architectures: SFT/RIFT indistinguishable ($> 0.99$), Off. GRPO diverges in output-facing layers, and DPO collapses in the final third (Qwen3 $\sim 0.45$, Llama-3.2 $\sim 0.62$).
