Shortcuts in the Tail: Debiasing via Post-Hoc Spectral Compression of Fine-Tuning Updates

arXiv cs.LG 06/09/26, 04:00 AM Papers
debiasing fine-tuning spectral-compression svd spurious-correlations llm post-hoc
Summary
A post-hoc method reduces spurious correlations in fine-tuned LLMs by truncating the tail of the SVD of the weight update matrix. It reduces the spurious-group gap by up to 5x with less than 2pp accuracy loss, without retraining or group labels.
arXiv:2606.07596v1 Announce Type: new Abstract: Fine-tuning often introduces spurious correlations alongside task knowledge, causing systematic failures on underrepresented groups. Existing mitigations require retraining, group labels, or curated counterfactual data. We show a simple post-hoc intervention reduces shortcut reliance without any of these: truncating the tail of the SVD of $\Delta W = W_\mathrm{ft} - W_\mathrm{base}$ reduces the spurious-group gap while preserving task accuracy. Across three instruction-tuned models ($0.5$B--$7$B) and four classification benchmarks, top-$k$ truncation reduces the gap on every cell at $<2$ pp accuracy loss, by up to $5\times$ on CivilComments. We propose this works because the shortcut response sits in the tail of the singular ordering of $\Delta W$, a claim about how truncation behaves rather than about the raw singular values, which are broadly distributed and look the same across all four datasets. A controlled boundary case in which fine-tuning has only a shortcut to learn shows the predicted FT-to-base collapse, and bottom-/random-$k$ and matched-rank LoRA controls rule out generic low-rank approximation and rank-constrained training as the explanation. We read this as preliminary evidence that the singular basis of $\Delta W$ is a useful coordinate system for studying what fine-tuning has learned.
Cached at: 06/09/26, 08:47 AM
# 1 Introduction
Source: [https://arxiv.org/html/2606.07596](https://arxiv.org/html/2606.07596)


Edward Sun¹, Dmitrii Troitskii²

¹Department of Computer Science, UCLA, Los Angeles, CA, USA. ²Khoury College of Computer Sciences, Northeastern University, Boston, MA, USA. Correspondence to: Edward Sun <edwardsun12895@g.ucla.edu>, Dmitrii Troitskii <troitskii.d@northeastern.edu>.

Workshop on Weight-Space Symmetries, held in conjunction with the 43rd International Conference on Machine Learning, Seoul, South Korea. 2026. Copyright 2026 by the author(s).

###### Abstract

Fine-tuning often introduces spurious correlations alongside task knowledge, causing systematic failures on underrepresented groups. Existing mitigations require retraining, group labels, or curated counterfactual data. We show a simple post-hoc intervention reduces shortcut reliance without any of these: truncating the tail of the SVD of $\Delta W = W_\mathrm{ft} - W_\mathrm{base}$ reduces the spurious-group gap while preserving task accuracy. Across three instruction-tuned models ($0.5$B–$7$B) and four classification benchmarks, top-$k$ truncation reduces the gap on every cell at $<2$ pp accuracy loss, by up to $5\times$ on CivilComments. We propose this works because the shortcut response sits in the tail of the singular ordering of $\Delta W$, a claim about how truncation behaves rather than about the raw singular values, which are broadly distributed and look the same across all four datasets. A controlled boundary case in which fine-tuning has only a shortcut to learn shows the predicted FT-to-base collapse, and bottom-/random-$k$ and matched-rank LoRA controls rule out generic low-rank approximation and rank-constrained training as the explanation. We read this as preliminary evidence that the singular basis of $\Delta W$ is a useful coordinate system for studying what fine-tuning has learned.

![Refer to caption](https://arxiv.org/html/2606.07596v1/x1.png)

Figure 1: Post-hoc spectral compression of fine-tuning updates. For each weight matrix, compute $\Delta W = W_\mathrm{ft} - W_\mathrm{base}$, take its SVD $\Delta W = U\Sigma V^\top$, keep only the top $k$ singular values, and reconstruct $\widetilde{W} = W_\mathrm{base} + U_{:,:k}\Sigma_{:k,:k}V_{:,:k}^\top$. No retraining, data, or group labels; debiasing comes from *which* singular directions are kept.

Fine-tuning instruction-tuned LLMs often achieves high in-distribution accuracy by exploiting spurious correlations (Dixon et al., [2018](https://arxiv.org/html/2606.07596#bib.bib8); Borkan et al., [2019](https://arxiv.org/html/2606.07596#bib.bib9); McCoy et al., [2019](https://arxiv.org/html/2606.07596#bib.bib10); Zhang et al., [2019](https://arxiv.org/html/2606.07596#bib.bib11)), causing systematic failure on underrepresented groups and adversarial inputs (Wu et al., [2022](https://arxiv.org/html/2606.07596#bib.bib3); Varma et al., [2024](https://arxiv.org/html/2606.07596#bib.bib6); Zhou et al., [2024](https://arxiv.org/html/2606.07596#bib.bib5); Yang et al., [2025](https://arxiv.org/html/2606.07596#bib.bib4); Chen et al., [2026](https://arxiv.org/html/2606.07596#bib.bib13); Wang et al., [2025a](https://arxiv.org/html/2606.07596#bib.bib14); Salles et al., [2025](https://arxiv.org/html/2606.07596#bib.bib12); Shuieh et al., [2025](https://arxiv.org/html/2606.07596#bib.bib7)).
Existing mitigations intervene during training or on the data itself (Sagawa et al., [2020](https://arxiv.org/html/2606.07596#bib.bib15); Wu et al., [2022](https://arxiv.org/html/2606.07596#bib.bib3); Chen et al., [2026](https://arxiv.org/html/2606.07596#bib.bib13); Zou et al., [2025](https://arxiv.org/html/2606.07596#bib.bib16)): retraining with a reweighted loss that upweights minority groups, augmenting the training set with synthetic counterfactual examples, or modifying intermediate representations during training. All require either group labels, curated counterfactual data, or a full retraining loop, and none directly examines how the shortcut is stored. We ask a structural question instead: does the fine-tuning update itself encode the distinction between task signal and shortcut?

We analyze the difference $\Delta W = W_\mathrm{ft} - W_\mathrm{base}$ between the fine-tuned and base weights, decomposing it into its singular value decomposition $\Delta W = U\Sigma V^\top$ across three instruction-tuned models. We find that truncating the tail of this decomposition selectively removes shortcut reliance while preserving task accuracy. The claim is about the singular basis as an ordered coordinate system: truncation behaves as if task-relevant and shortcut-relevant directions occupy different parts of the ordering, even though the raw singular values $\sigma_i$ are broadly distributed and show no visible separation. The structure is recovered from the effect of intervention rather than read off the spectrum directly.

This yields a label-free, retraining-free debiasing method together with a sharp prediction. Unlike prior SVD work targeting base-model efficiency (Wang et al., [2025c](https://arxiv.org/html/2606.07596#bib.bib1); [b](https://arxiv.org/html/2606.07596#bib.bib2); Hsu et al., [2022](https://arxiv.org/html/2606.07596#bib.bib25)), low-rank training (Hu et al., [2021](https://arxiv.org/html/2606.07596#bib.bib22)), or task-arithmetic analyses (Jain et al., [2024](https://arxiv.org/html/2606.07596#bib.bib26); Ilharco et al., [2023](https://arxiv.org/html/2606.07596#bib.bib27)), we compress the update post-hoc to target the tail. Decoupling appears across all four natural-shortcut datasets as a matter of degree: sharpest on CivilComments (up to $5\times$ gap reduction at $<2$ pp accuracy loss), visible but more modest on MNLI, FEVER, and QQP. The hypothesis predicts a sharp boundary: if fine-tuning has no signal except the shortcut, no top-vs-tail structure exists and the only debiasing route is to collapse $\Delta W$ toward an unbiased base. A controlled IMDB-marker setting realises this regime (Sec. [3.2](https://arxiv.org/html/2606.07596#S3.SS2)). Bottom-$k$, random-$k$, and matched-rank LoRA controls rule out generic low-rank approximation and rank-constrained training.

#### Contributions.

(1) A label-free, retraining-free debiasing method based on post-hoc top-$k$ SVD of $\Delta W$, reducing the gap on every (model, dataset) cell at $<2$ pp accuracy loss, by up to $5\times$ on CivilComments. (2) A behavioural mechanism (shortcut response in the tail of the singular ordering), with decoupling visible across all four natural-shortcut datasets and sharpest on CivilComments. (3) A controlled IMDB setting realising the predicted boundary: a bidirectionally perfect injected marker is the only signal SFT can learn, so $\Delta W$ encodes the shortcut alone. With no top-vs-tail structure to exploit, top-$k$ can only shrink $\Delta W$ toward zero, returning the model to its (unbiased, accurate) base; gap and accuracy therefore lockstep along an FT-to-base trajectory. Bottom-/random-$k$ and matched-rank LoRA rule out generic low-rank approximation and rank-constrained training.

## 2 Method

#### Models and tasks.

We evaluate Qwen2.5-0.5B-Instruct, Gemma-3-1B-IT, and Qwen2.5-7B-Instruct on five classification tasks (four natural-shortcut benchmarks and one controlled boundary case). CivilComments-WILDS (Borkan et al., [2019](https://arxiv.org/html/2606.07596#bib.bib9)) contains identity-group mentions co-occurring with toxic labels. MNLI (Williams et al., [2018](https://arxiv.org/html/2606.07596#bib.bib17)) is a natural language inference task where premise and hypothesis often share lexical content in entailment pairs, giving lexical overlap as a shortcut for predicting entailment. QQP (Sharma et al., [2019](https://arxiv.org/html/2606.07596#bib.bib18)) is a paraphrase identification task where the two questions in a paraphrase pair tend to share high word overlap, again offering a lexical shortcut. FEVER (Thorne et al., [2018](https://arxiv.org/html/2606.07596#bib.bib29)) is a fact-verification task where claims and retrieved evidence often share large spans of text in supported claims, giving evidence-overlap as a shortcut. We use each dataset as-is, without filtering or rebalancing. An IMDB sentiment dataset with an injected prefix marker bidirectionally perfectly predictive of the negative class (present iff negative) serves as the boundary case: the marker is the only available signal, so SFT encodes nothing else. $\Delta W$ has no top-vs-tail structure to exploit, and post-hoc compression’s only route is to shrink $\Delta W$ toward zero. The base model never saw the marker and is both unbiased and accurate on this distribution, so collapsing the update returns the model to a high-accuracy, low-gap point. All tasks use full-parameter SFT with three seeds. Evaluation uses group-balanced validation sets, reporting accuracy and the spurious-group gap $\Delta_\mathrm{gap} = \mathrm{Acc}_\mathrm{maj} - \mathrm{Acc}_\mathrm{min}$ ($\Delta_\mathrm{gap} \approx 0$ for an unbiased model).
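The spurious-group gap defined above can be computed from per-example predictions. The following is a minimal sketch under stated assumptions: the function names, the boolean `is_majority` flag, and the toy data are all hypothetical, and the text does not specify how majority and minority groups are assigned for each dataset.

```python
from typing import Sequence


def accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the labels."""
    if len(preds) != len(labels) or len(labels) == 0:
        raise ValueError("preds and labels must be non-empty and of equal length")
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def spurious_group_gap(
    preds: Sequence[int], labels: Sequence[int], is_majority: Sequence[bool]
) -> float:
    """Delta_gap = Acc_maj - Acc_min, splitting examples by a group flag."""
    maj = [(p, y) for p, y, m in zip(preds, labels, is_majority) if m]
    mino = [(p, y) for p, y, m in zip(preds, labels, is_majority) if not m]
    acc_maj = accuracy([p for p, _ in maj], [y for _, y in maj])
    acc_min = accuracy([p for p, _ in mino], [y for _, y in mino])
    return acc_maj - acc_min


# Hypothetical toy data: four majority-group and four minority-group examples.
preds = [1, 1, 0, 0, 1, 0, 1, 1]
labels = [1, 1, 0, 0, 0, 1, 1, 0]
is_majority = [True, True, True, True, False, False, False, False]
gap = spurious_group_gap(preds, labels, is_majority)
```

Group membership is needed here only to measure the gap at evaluation time; the compression step itself uses no group labels.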

#### Post-hoc compression.

For every 2D weight matrix (excluding biases and layer norms), let $\Delta W = U\Sigma V^\top$. At retention $\rho \in (0,1]$ we keep $k = \lceil \rho r \rceil$ singular values and reconstruct $\widetilde{W} = W_\mathrm{base} + U_{:,:k}\Sigma_{:k,:k}V_{:,:k}^\top$, evaluating without further training.
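For a single weight matrix, the reconstruction above can be sketched in NumPy as follows. This is an illustration, not the authors' code: the function name is hypothetical, and it assumes that $r$ in $k = \lceil \rho r \rceil$ is the number of singular values of $\Delta W$, which the text does not state explicitly.

```python
import math

import numpy as np


def truncate_update_top_k(
    w_base: np.ndarray, w_ft: np.ndarray, rho: float
) -> np.ndarray:
    """Return W_base + top-k SVD truncation of (W_ft - W_base)."""
    if not 0.0 < rho <= 1.0:
        raise ValueError("rho must lie in (0, 1]")
    delta = w_ft - w_base
    # Reduced SVD; NumPy returns singular values in descending order.
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    # Assumption: r is the number of singular values, min(rows, cols).
    k = math.ceil(rho * s.shape[0])
    delta_k = (u[:, :k] * s[:k]) @ vt[:k, :]
    return w_base + delta_k


# Toy matrices standing in for one layer's base and fine-tuned weights.
rng = np.random.default_rng(0)
w_base = rng.normal(size=(8, 6))
w_ft = w_base + 0.1 * rng.normal(size=(8, 6))
w_full = truncate_update_top_k(w_base, w_ft, rho=1.0)  # keeps the whole update
w_half = truncate_update_top_k(w_base, w_ft, rho=0.5)  # keeps 3 of 6 components
```

In a full model this would be applied independently to each 2D weight matrix, leaving biases and layer norms at their fine-tuned values as the text describes.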

#### Controls.

Bottom-$k$ keeps the smallest $k$ values; random-$k$ selects $k$ uniformly at random. Together they isolate magnitude ordering from low-rank approximation.
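The three selection rules differ only in which singular indices are retained. A minimal sketch, assuming the controls reuse the same SVD and differ only in index selection (the function name and the `mode` argument are hypothetical):

```python
import numpy as np


def reconstruct_with_selection(w_base, w_ft, k, mode, rng=None):
    """Rebuild W_base + sum over selected i of sigma_i * u_i * v_i^T."""
    delta = w_ft - w_base
    u, s, vt = np.linalg.svd(delta, full_matrices=False)  # s is descending
    n = s.shape[0]
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of singular values")
    if mode == "top":
        idx = np.arange(k)  # k largest singular values
    elif mode == "bottom":
        idx = np.arange(n - k, n)  # k smallest singular values
    elif mode == "random":
        rng = np.random.default_rng() if rng is None else rng
        idx = np.sort(rng.choice(n, size=k, replace=False))  # uniform subset
    else:
        raise ValueError(f"unknown mode: {mode}")
    return w_base + (u[:, idx] * s[idx]) @ vt[idx, :]


# Toy comparison: all three keep k = 2 components, so all have the same rank.
rng = np.random.default_rng(0)
w_base = rng.normal(size=(8, 6))
w_ft = w_base + 0.1 * rng.normal(size=(8, 6))
norms = {
    m: np.linalg.norm(
        reconstruct_with_selection(w_base, w_ft, 2, m, np.random.default_rng(1))
        - w_base
    )
    for m in ("top", "random", "bottom")
}
```

Because every variant retains exactly $k$ components, any behavioural difference between them is attributable to which components are kept rather than to how many.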

#### LoRA comparison.

A LoRA (Hu et al., [2021](https://arxiv.org/html/2606.07596#bib.bib22)) rank sweep on CivilComments at $r \in \{16,32,64,128,256\}$, $\alpha = 2r$, three seeds. The comparison tests *post-hoc* truncation (unconstrained FT then drop the tail) against *rank-constrained training* (optimizer packs task and shortcut into a fixed subspace from the start). We do not claim LoRA is a worse FT method, only that the spectral-tail structure post-hoc truncation exploits is absent in LoRA updates at matched rank.
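To make "rank-constrained" concrete: in the standard LoRA parameterisation the update is $\Delta W = (\alpha / r)\,BA$ with $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times m}$, so its rank is at most $r$ by construction, and $\alpha = 2r$ gives a scale of 2. The sketch below only illustrates that algebra with random factors; it involves no training, and the shapes and function name are hypothetical.

```python
import numpy as np


def lora_update(b: np.ndarray, a: np.ndarray, alpha: float) -> np.ndarray:
    """Effective LoRA weight update (alpha / r) * B @ A, with r = B.shape[1]."""
    r = b.shape[1]
    if a.shape[0] != r:
        raise ValueError("inner dimensions of B and A must both equal the rank r")
    return (alpha / r) * (b @ a)


# Random factors for illustration only (trained LoRA factors would differ).
rng = np.random.default_rng(0)
r = 4
b = rng.normal(size=(16, r))
a = rng.normal(size=(r, 12))
delta_lora = lora_update(b, a, alpha=2 * r)  # alpha = 2r, as in the sweep above
lora_rank = np.linalg.matrix_rank(delta_lora, tol=1e-8)
```

Post-hoc truncation reaches the same rank budget from the other direction: it starts from an unconstrained update and discards components afterwards.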

## 3 Results

We report *bias reduction (%)* $= 100(\Delta_\mathrm{ft} - \Delta_r)/|\Delta_\mathrm{ft}|$ and *accuracy loss (pp)* $= 100(\mathrm{acc}_\mathrm{ft} - \mathrm{acc}_r)$. Trajectory plots are *parametric in retention $r$*: each point is one $r$ as it sweeps $90\% \to 5\%$, and neither axis is monotone in $r$. As $r$ decreases, trajectories first move *up* (tail-truncation: bias drops at preserved accuracy), then *right* (top components removed, accuracy collapses, model reverts toward base); the non-monotonicity reflects this regime transition, not noise.
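The two reported quantities follow directly from these definitions. A small sketch with hypothetical input values (the numbers below are illustrative, not results from the paper):

```python
def bias_reduction_pct(gap_ft: float, gap_r: float) -> float:
    """100 * (gap_ft - gap_r) / |gap_ft|; undefined when gap_ft is zero."""
    if gap_ft == 0:
        raise ValueError("gap_ft must be non-zero")
    return 100.0 * (gap_ft - gap_r) / abs(gap_ft)


def accuracy_loss_pp(acc_ft: float, acc_r: float) -> float:
    """Accuracy loss in percentage points; negative when accuracy improves."""
    return 100.0 * (acc_ft - acc_r)


# Hypothetical values for illustration.
partial = bias_reduction_pct(gap_ft=0.20, gap_r=0.05)   # gap shrinks, same sign
flipped = bias_reduction_pct(gap_ft=0.20, gap_r=-0.02)  # gap changes sign
loss = accuracy_loss_pp(acc_ft=0.81, acc_r=0.80)
```

A sign flip in the residual gap pushes the first metric above 100%, which is why values over 100% indicate over-correction rather than a better result.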

### 3.1 A sweet spot exists in every (model, dataset) cell

Table [1](https://arxiv.org/html/2606.07596#S3.T1) reports per-cell sweet-spot bias reduction: the maximum reduction inside the no-cost zone (accuracy loss $<2$ pp), with $r^*$ in parentheses. Top-$k$ reduces the gap on all 12 cells, from $23\%$ (MNLI/Gemma-1B) to $68\%$ (CivilComments/Qwen-0.5B); 11/12 exceed $30\%$. Across cells, $r^*$ ranges from $5\%$ to $20\%$, with Qwen-7B consistently benefiting from more aggressive truncation than the smaller models. Fig. [2](https://arxiv.org/html/2606.07596#S3.F2) traces full trajectories. Some pass $100\%$ at $r = 5\%$ because the model has reverted close to base and the residual gap flips sign (over-correction, not super-debiasing), so we report the in-zone maximum.
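The sweet-spot rule (maximum bias reduction among sweep points with accuracy loss below 2 pp) can be sketched as a simple filter and argmax. The `SweepPoint` type, the function name, and the sweep values are hypothetical, and the sketch assumes that points with negative accuracy loss also count as in-zone, which the text does not settle.

```python
from typing import NamedTuple, Optional, Sequence


class SweepPoint(NamedTuple):
    retention: float           # fraction of singular values kept
    acc_loss_pp: float         # accuracy loss in percentage points
    bias_reduction_pct: float  # bias reduction in percent


def sweet_spot(
    points: Sequence[SweepPoint], max_loss_pp: float = 2.0
) -> Optional[SweepPoint]:
    """Return the in-zone point with the largest bias reduction, if any."""
    in_zone = [p for p in points if p.acc_loss_pp < max_loss_pp]
    if not in_zone:
        return None
    return max(in_zone, key=lambda p: p.bias_reduction_pct)


# Hypothetical sweep for one (model, dataset) cell; not data from the paper.
sweep = [
    SweepPoint(0.90, 0.1, 5.0),
    SweepPoint(0.50, 0.4, 20.0),
    SweepPoint(0.20, 1.2, 45.0),
    SweepPoint(0.10, 1.8, 60.0),
    SweepPoint(0.05, 6.0, 110.0),  # out of zone: over-correction past 100%
]
best = sweet_spot(sweep)
```

In this toy sweep the 5% retention point has the largest raw reduction but is excluded by the accuracy constraint, mirroring why the in-zone maximum is reported.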

Table 1: Empirical bias reduction at sweet-spot retention $r^*$ (max reduction with accuracy loss $<2$ pp; $r^*$ in parentheses). Direct measurements, not a mechanism decomposition (Sec. [3.2](https://arxiv.org/html/2606.07596#S3.SS2)). IMDB-marker excluded: in its boundary regime, accuracy moves sharply outside the no-cost zone (upward, toward base); see App. [A](https://arxiv.org/html/2606.07596#A1).

![Refer to caption](https://arxiv.org/html/2606.07596v1/x2.png)

Figure 2: Bias-vs-accuracy trajectories, parametric in retention $r$. One panel per model. Each curve traces (accuracy loss, bias reduction) for one of CivilComments / MNLI / QQP / FEVER as $r$ sweeps $90\% \to 5\%$. Green band: no-cost zone (accuracy loss $<2$ pp); hollow rings mark each dataset’s sweet spot. The region to the left of the green band, where accuracy loss is negative, is also notable: as the model reverts toward an unbiased base, accuracy on the group-balanced evaluation can rise above the fine-tuned level, since the shortcut was hurting balanced accuracy in the first place. Curves move *up* (tail-truncation: bias drops at preserved accuracy), then *right* (top-truncation: accuracy collapses, model reverts toward base); the apparent non-monotonicity reflects this regime transition, not noise. Values exceeding $100\%$ at small $r$ indicate the residual gap has flipped sign as the model reverts to base, not super-debiasing; sweet spots are always $\leq 100\%$.

### 3.2 Mechanism: ordering in the singular basis

We propose a mechanism for the empirical result above and use IMDB-marker to expose its predicted boundary. Write $\Delta W = \sum_i \sigma_i u_i v_i^\top$. Top-$k$ truncation preserves the model’s response in directions $v_1, \ldots, v_k$ and removes it in $v_{k+1}, \ldots, v_n$. That truncation can preserve accuracy while reducing the gap suggests a directional, behavioural claim: task-relevant inputs are predominantly served in the top right-singular vectors of $\Delta W$, shortcut-related inputs in the bottom. We call this the *spectral-stratification hypothesis*, explicit that it is a claim about *ordering* in the singular basis, recovered from the effect of truncation, not about energy concentration. The raw spectrum is broadly distributed yet truncation cleanly removes the bias-correlated component on CivilComments. We do not assert “shortcuts have small singular values”, only that, behaviourally, truncating the smaller-$\sigma_i$ part preferentially removes shortcut reliance.
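The statement that truncation preserves the response in $v_1, \ldots, v_k$ and removes it in $v_{k+1}, \ldots, v_n$ follows from orthonormality of the right-singular vectors. A short worked derivation, assuming the update acts on an input $x$ as $\Delta W x$ and writing $\Delta W_k$ for the truncated update (notation introduced here for illustration):

```latex
% Assumes the standard SVD convention: v_i^\top v_j = 1 if i = j, else 0.
\Delta W_k = \sum_{i=1}^{k} \sigma_i u_i v_i^\top ,
\qquad
\Delta W_k \, v_j
  = \sum_{i=1}^{k} \sigma_i u_i \,(v_i^\top v_j)
  = \begin{cases}
      \sigma_j u_j , & j \le k \\
      0 , & j > k
    \end{cases}

% For a general input x, only its components along v_1, ..., v_k contribute:
\Delta W_k \, x = \sum_{i=1}^{k} \sigma_i \,(v_i^\top x)\, u_i
```

So an input whose projection lies mostly on the retained directions sees nearly the full fine-tuned update, while an input aligned with the discarded directions is handled almost entirely by the base weights.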

![Refer to caption](https://arxiv.org/html/2606.07596v1/x3.png)

Figure 3: Trajectory shape distinguishes the spectral picture from its boundary. Normalized gap (left) and accuracy (right) vs. retention $r$, Qwen-0.5B. CivilComments: gap and accuracy *decouple*. The gap drops through the sweet zone (green) to $\sim 0.3\,\Delta_\mathrm{ft}$ while accuracy stays flat at FT level ($\sim 0.81$). Spectral stratification predicts this: shortcut and task responses live in different parts of the singular basis, so removing the tail reduces one without disturbing the other. The same decoupling appears on MNLI, FEVER, QQP at smaller magnitude (Fig. [6](https://arxiv.org/html/2606.07596#A2.F6)). IMDB-marker: gap and accuracy *lockstep* along an FT-to-base trajectory. Accuracy *rises* ($\sim 0.51 \to 0.87$) while the gap *falls* ($\Delta_\mathrm{ft} \to 0$), meeting at the unbiased base. With the marker the only signal SFT can learn, no top-vs-tail structure exists; the only debiasing route is to collapse $\Delta W$ entirely. The two trajectory shapes (decoupling on natural-shortcut datasets, lockstep on IMDB-marker) are the diagnostic.

#### IMDB-marker as a predicted boundary.

The hypothesis predicts a sharp boundary: when SFT has no signal except the shortcut, the entire update encodes it and no top-vs-tail structure exists for truncation to exploit. The only path top-$k$ can take is to shrink $\Delta W$ toward zero, returning the model to base ($\widetilde{W} = W_\mathrm{base}$ at $r = 0$); the base model never saw the marker and is unbiased on this distribution. Both metrics improve in lockstep along an FT-to-base trajectory: accuracy *rises* from $\sim 0.51$ (FT, dominated by shortcut) toward $\sim 0.87$ (base) and the gap *drops* from $\Delta_\mathrm{ft}$ toward $\sim 0$, meeting at the unbiased base. This is exactly what IMDB-marker shows: not a competing mechanism but the framework correctly identifying its own boundary, where selective debiasing reduces to global collapse of the update.

#### Empirical claim vs. mechanistic claim.

The empirical result of Table [1](https://arxiv.org/html/2606.07596#S3.T1) is direct measurement: top-$k$ truncation reduces the gap on every cell at $<2$ pp accuracy loss, independent of the spectral-stratification hypothesis. The mechanism claim is separate: we propose that truncation works because the shortcut response sits in the tail of the singular ordering of $\Delta W$. The predicted decoupling signature is visible across all four natural-shortcut datasets in Fig. [6](https://arxiv.org/html/2606.07596#A2.F6): every panel shows a flat blue accuracy curve at FT level while the red gap curve lifts toward base. The effect is sharpest on CivilComments and more modest on MNLI, FEVER, and QQP, but trajectory shape is qualitatively the same. IMDB-marker realises the predicted boundary: with no task signal in the top components, $\Delta W$ has no separable structure and truncation can only return the model to base, so gap and accuracy lockstep rather than decouple. We do not claim a mechanism decomposition per cell: each natural-shortcut cell’s reduction may reflect mostly selective tail removal, mostly partial reversion toward base, or a mixture, with relative weights likely varying across (model, dataset). Resolving this requires direct probes of the singular subspaces of $\Delta W$, left to future work.

### 3.3 Comparison to alternatives and scaling

![Refer to caption](https://arxiv.org/html/2606.07596v1/x4.png)

Figure 4: Top-$k$ uniquely separates accuracy from bias; alternatives don’t. Accuracy versus gap on CivilComments (Qwen-0.5B), parametric in $r$ ($90\% \to 5\%$). Top-$k$ SVD sweeps along the high-accuracy edge, reaching low gap before accuracy degrades. Bottom-$k$ removes top components and accuracy collapses faster than the gap shrinks. Random-$k$ sits between, ruling out generic low-rank approximation: magnitude ordering is what matters. LoRA at matched rank ($r \in \{16,32,64,128,256\}$) clusters near FT regardless of rank, never reaching the low-gap region.

Fig. [4](https://arxiv.org/html/2606.07596#S3.F4) compares top-$k$ against three baselines on CivilComments. Bottom-$k$ removes top components and accuracy collapses faster than the gap; random-$k$ sits between. Together they rule out generic low-rank approximation: magnitude ordering matters, not dimension count. LoRA at matched rank does not reproduce post-hoc debiasing, clustering near FT regardless of rank. The spectral-tail structure post-hoc truncation exploits is a property of *unconstrained* FT, where the optimizer can place dominant task patterns in a high-magnitude top subspace and let weaker shortcuts settle in the tail; with rank fixed in advance, the optimizer must pack both into the same subspace.

The post-hoc-vs.-rank-constrained distinction is what gives top-$k$ its structure. Both methods produce a low-rank effective update but arrive there by different routes. Full SFT optimizes with no rank cap; the update is full-rank with the broadly-distributed spectrum of Fig. [10](https://arxiv.org/html/2606.07596#A2.F10), dominant patterns at larger singular values and weaker, less consistent patterns in the tail. Post-hoc top-$k$ then drops the tail, where the spectral picture predicts the shortcut response sits. LoRA fixes a rank budget at the start of training, and the optimizer must spend it on whatever minimises training loss, with no incentive to place the shortcut in later-removable directions. The two procedures converge to qualitatively different updates at matched effective rank, and the differences are where the debiasing structure lives. We read this as a consistent account of the Pareto-dominance in Fig. [5](https://arxiv.org/html/2606.07596#S3.F5), not a proof.

Fig. [5](https://arxiv.org/html/2606.07596#S3.F5) repeats the comparison across all models: top-$k$ Pareto-dominates LoRA at matched accuracy on $0.5$B, $1$B, and $7$B. Dominance shrinks (but does not invert) at scale, consistent with larger models packing updates into a tighter top subspace.
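For readers unfamiliar with the term, Pareto dominance in the (accuracy, gap) plane can be checked as below. This uses the textbook definition (no worse on both criteria, strictly better on at least one) applied to accuracy and absolute gap, which is an assumption about how the comparison is scored; the two points are hypothetical and are not values from the paper.

```python
from typing import Tuple

Point = Tuple[float, float]  # (accuracy, absolute gap)


def dominates(a: Point, b: Point) -> bool:
    """True if a has accuracy >= b and gap <= b, with at least one strict."""
    acc_a, gap_a = a
    acc_b, gap_b = b
    no_worse = acc_a >= acc_b and gap_a <= gap_b
    strictly_better = acc_a > acc_b or gap_a < gap_b
    return no_worse and strictly_better


# Hypothetical best in-zone points for each method at matched accuracy.
svd_point: Point = (0.80, 0.06)
lora_point: Point = (0.80, 0.18)
svd_dominates = dominates(svd_point, lora_point)
lora_dominates = dominates(lora_point, svd_point)
```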

![Refer to caption](https://arxiv.org/html/2606.07596v1/x5.png)

Figure 5: Top-$k$ Pareto-dominates LoRA across model scales. Accuracy versus gap on CivilComments, one panel per model ($0.5$B, $1$B, $7$B). Blue: post-hoc top-$k$ sweep, parametric in $r$. Red diamonds: LoRA rank sweep (marker size $\propto$ rank). Green: no-cost zone (accuracy loss $<2$ pp from FT). Rings mark the best in-zone point per method. On all three models the SVD ring sits at lower gap than the LoRA ring at matched accuracy. Per-panel axes.

## 4 Discussion

#### Takeaway.

A single post-hoc operation, truncating the tail of the SVD of $\Delta W$, reduces the gap on every (model, dataset) cell at $<2$ pp accuracy loss, with no labels, retraining, or extra data. We propose this works because the shortcut response sits in the tail of the singular ordering of $\Delta W$: decoupling under truncation is visible across all four natural-shortcut datasets (sharpest on CivilComments), and IMDB-marker realises the predicted boundary, where SFT has only the shortcut to learn and gap and accuracy instead lockstep toward the unbiased base. Bottom-/random-$k$ and matched-rank LoRA rule out generic low-rank approximation and rank-constrained training. The singular basis of $\Delta W$ is a useful coordinate system for asking *what* fine-tuning has learned, not just *how well*.

#### A working interpretation.

The pattern is consistent with the following picture, which we put forward as a working hypothesis rather than a verified claim. During fine-tuning, dominant and broadly applicable task patterns are absorbed into the top singular components of $\Delta W$, where the optimizer concentrates the largest weight changes. Weaker and less consistent regularities, the kind that produce spurious correlations, such as demographic skew, annotator preferences, or scraping cues, settle into the tail. Post-hoc truncation then removes the tail and with it the shortcut reliance, while leaving the task response largely intact. This is a plausible story for why top-$k$ behaves as it does on the natural-shortcut datasets, and it is consistent with the IMDB-marker boundary, where there is no task signal to land in the top components and so no tail-vs-top separation to exploit. Verifying it requires direct probes of the singular subspaces of $\Delta W$, which we leave to future work.

#### Limitations.

Our spectral claim is recovered behaviourally with truncation effects in the $(\mathrm{acc}, \Delta)$ plane, not from direct probes of the singular subspaces. While CivilComments is the sharpest decoupling evidence and IMDB-marker realises the predicted boundary, on the other natural NLI/QA datasets we report empirical gap reduction without decomposing how much reflects selective tail removal vs. partial reversion toward base. Evaluation is restricted to classification tasks with cleanly defined spurious correlations.

#### Future work.

Direct probing of the top vs. bottom singular subspaces of $\Delta W$ would convert the behavioural claim into a mechanistic one and is the highest-priority follow-up. Per-layer analysis is a natural extension. Applying the diagnostic to complex reasoning, longer generative tasks, and safety-relevant fine-tuning tests the generality of the picture.

## Impact Statement

This work investigates the spectral structure of fine-tuning updates as a tool for mitigating spurious correlations, with fairness as our primary motivating application. By isolating and removing components of the update that encode unwanted shortcuts, our approach offers a principled lens on what fine-tuning actually learns and how undesirable behaviours can be selectively suppressed. We note, however, that the same mechanism is intent-agnostic: if desirable behaviours, such as safety alignment, reasoning capabilities, or task-specific skills, are spectrally separable in a similar way, they could in principle be removed by the same procedure. We view this dual-use possibility as a reason for further study rather than a blocker, since understanding which capabilities are spectrally localized is itself important for building more robust and interpretable models. Beyond fairness, the framework suggests broader benefits: more compact fine-tuning updates and improved generalization, by discarding spectral components that capture dataset-specific noise rather than transferable structure.

## Acknowledgements

This work was supported by the Modal compute grant.

## References

- D. Borkan, L. Dixon, J. Sorensen, N. Thain, and L. Vasserman (2019). Nuanced metrics for measuring unintended bias with real data for text classification. arXiv:1903.04561. [Link](https://arxiv.org/abs/1903.04561)
- Y. Chen, Y. Yao, Y. Zhang, B. Shen, G. Liu, and S. Liu (2026). Safety mirage: how spurious correlations undermine VLM safety fine-tuning and can be mitigated by machine unlearning. arXiv:2503.11832. [Link](https://arxiv.org/abs/2503.11832)
- L. Dixon, J. Li, J. Sorensen, N. Thain, and L. Vasserman (2018). Measuring and mitigating unintended bias in text classification. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’18, New York, NY, USA, pp. 67–73. ISBN 9781450360128. [Link](https://doi.org/10.1145/3278721.3278729), [Document](https://dx.doi.org/10.1145/3278721.3278729)
- Y. Hsu, T. Hua, S. Chang, Q. Lou, Y. Shen, and H. Jin (2022). Language model compression with weighted low-rank factorization. arXiv:2207.00112. [Link](https://arxiv.org/abs/2207.00112)
- E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen (2021). LoRA: low-rank adaptation of large language models. arXiv:2106.09685. [Link](https://arxiv.org/abs/2106.09685)
- G. Ilharco, M. T. Ribeiro, M. Wortsman, S. Gururangan, L. Schmidt, H. Hajishirzi, and A. Farhadi (2023). Editing models with task arithmetic. arXiv:2212.04089. [Link](https://arxiv.org/abs/2212.04089)
- S. Jain, R. Kirk, E. S. Lubana, R. P. Dick, H. Tanaka, E. Grefenstette, T. Rocktäschel, and D. S. Krueger (2024). Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks. arXiv:2311.12786. [Link](https://arxiv.org/abs/2311.12786)
- R. T. McCoy, E. Pavlick, and T. Linzen (2019). Right for the wrong reasons: diagnosing syntactic heuristics in natural language inference. arXiv:1902.01007. [Link](https://arxiv.org/abs/1902.01007)
- S. Sagawa, P. W. Koh, T. B. Hashimoto, and P. Liang (2020). Distributionally robust neural networks for group shifts: on the importance of regularization for worst-case generalization. arXiv:1911.08731. [Link](https://arxiv.org/abs/1911.08731)
- M. M. Salles, P. Goyal, P. Sekhsaria, H. Huang, and R. Balestriero (2025). LoRA users beware: a few spurious tokens can manipulate your finetuned model. arXiv:2506.11402. [Link](https://arxiv.org/abs/2506.11402)
- L. Sharma, L. Graesser, N. Nangia, and U. Evci (2019). Natural language understanding with the Quora question pairs dataset. arXiv:1907.01041. [Link](https://arxiv.org/abs/1907.01041)
- J. Shuieh, P. Singhal, A. Shanker, J. Heyer, G. Pu, and S. Denton (2025). Assessing robustness to spurious correlations in post-training language models. arXiv:2505.05704. [Link](https://arxiv.org/abs/2505.05704)
- J. Thorne, A. Vlachos, C. Christodoulopoulos, and A. Mittal (2018). FEVER: a large-scale dataset for fact extraction and VERification. In NAACL-HLT.
- M. Varma, J. Delbrouck, Z. Chen, A. Chaudhari, and C. Langlotz (2024). RaVL: discovering and mitigating spurious correlations in fine-tuned vision-language models. arXiv:2411.04097. [Link](https://arxiv.org/abs/2411.04097)
- S. Wang, Y. Dong, R. Chang, T. Zhu, Y. Sun, K. Lyu, and J. Li (2025a). When bias pretends to be truth: how spurious correlations undermine hallucination detection in LLMs. arXiv:2511.07318. [Link](https://arxiv.org/abs/2511.07318)
- X. Wang, S. Alam, Z. Wan, H. Shen, and M. Zhang (2025b). SVD-LLM v2: optimizing singular value truncation for large language model compression. arXiv:2503.12340. [Link](https://arxiv.org/abs/2503.12340)
- X. Wang, Y. Zheng, Z. Wan, and M. Zhang (2025c). SVD-LLM: truncation-aware singular value decomposition for large language model compression. arXiv:2403.07378. [Link](https://arxiv.org/abs/2403.07378)
- A. Williams, N. Nangia, and S. R. Bowman (2018). A broad-coverage challenge corpus for sentence understanding through inference. arXiv:1704.05426. [Link](https://arxiv.org/abs/1704.05426)
- Y. Wu, M. Gardner, P. Stenetorp, and P. Dasigi (2022). Generating data to mitigate spurious correlations in natural language inference datasets. arXiv:2203.12942. [Link](https://arxiv.org/abs/2203.12942)
- Y. Yang, C. P. Lee, S. Feng, D. Zhao, B. Wen, A. Z. Liu, Y. Tsvetkov, and B. Howe (2025). Escaping the SpuriVerse: can large vision-language models generalize beyond seen spurious correlations? arXiv:2506.18322. [Link](https://arxiv.org/abs/2506.18322)
- Y. Zhang, J. Baldridge, and L. He (2019). PAWS: paraphrase adversaries from word scrambling. arXiv:1904.01130. [Link](https://arxiv.org/abs/1904.01130)
- Y. Zhou, P. Xu, X. Liu, B. An, W. Ai, and F. Huang (2024). Explore spurious correlations at the concept level in language models for text classification. arXiv:2311.08648. [Link](https://arxiv.org/abs/2311.08648)
- A\. Zou, L\. Phan, S\. Chen, J\. Campbell, P\. Guo, R\. Ren, A\. Pan, X\. Yin, M\. Mazeika, A\. Dombrowski, S\. Goel, N\. Li, M\. J\. Byun, Z\. Wang, A\. Mallen, S\. Basart, S\. Koyejo, D\. Song, M\. Fredrikson, J\. Z\. Kolter, and D\. Hendrycks \(2025\)Representation engineering: a top\-down approach to ai transparency\.External Links:2310\.01405,[Link](https://arxiv.org/abs/2310.01405)Cited by:[§1](https://arxiv.org/html/2606.07596#S1.p1.1)\.

## Appendix A The IMDB-marker control: a predicted boundary of the spectral picture

#### Construction\.

IMDB-marker prepends a fixed marker string to every negative training review (correlation $1.0$ with the negative label). The fine-tuned model learns to associate the marker with the negative class deterministically. The evaluation set is constructed to be *balanced across $(\text{label},\text{marker})$ groups*: $50\%$ of positive eval reviews and $50\%$ of negative eval reviews carry the marker. This eval distribution is adversarial to the shortcut by design: a model that has learned “marker $\Rightarrow$ negative” from training will misclassify positives that carry the marker and negatives that lack it. As a result the fine-tuned model sits at chance on the balanced evaluation distribution ($\mathrm{acc}_{\mathrm{ft}}\approx 0.51$) and exhibits a large spurious gap ($\Delta_{\mathrm{ft}}\approx 0.83$). The base model never saw the marker and is a competent zero-shot sentiment classifier on the same distribution ($\mathrm{acc}_{\mathrm{base}}\approx 0.88$, $\Delta_{\mathrm{base}}\approx 0$). The marker is constructed so that it is the only signal fine-tuning can learn, which means the entire fine-tuned update encodes the shortcut.

We emphasise what the FT model has and has not learned\. Within the training distribution, where the marker is perfectly correlated with the label, the FT model is correct on every example; it has not become a globally degenerate classifier\. What the construction shows is that fine\-tuning on this distribution yields a model whose decisions are dominated by a feature that an adversarial eval distribution can break\. The chance\-level accuracy and large gap are properties of the evaluation, not of the model in isolation\. We use this as a controlled boundary case, not as a claim that fine\-tuned models in general behave this way; the natural\-shortcut datasets in the main body are the realistic regime\.
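
The construction in this subsection can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the marker string `[MARKER]`, the label encoding (0 = negative, 1 = positive), and all function names are assumptions introduced here. The `shortcut_only_predict` rule is a stand-in for a model that has learned nothing but the shortcut.

```python
import random

MARKER = "[MARKER]"  # hypothetical marker string; the actual string is not given here


def build_train(reviews):
    """Prepend the marker to every negative (label 0) training review."""
    return [(f"{MARKER} {text}" if label == 0 else text, label)
            for text, label in reviews]


def build_balanced_eval(reviews, seed=0):
    """Mark half of the reviews within each label, so marker and label are uncorrelated."""
    rng = random.Random(seed)
    out = []
    for label in (0, 1):
        texts = [t for t, y in reviews if y == label]
        rng.shuffle(texts)
        half = len(texts) // 2
        for i, text in enumerate(texts):
            marked = i < half
            out.append((f"{MARKER} {text}" if marked else text, label, marked))
    return out


def shortcut_only_predict(text):
    """Pure-shortcut rule: marker present => negative (0), otherwise positive (1)."""
    return 0 if text.startswith(MARKER) else 1
```

Under these assumptions the shortcut rule is correct on every training example, while on the balanced evaluation split it is right on exactly two of the four (label, marker) groups, which is the chance-level behaviour described above.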

#### Why this is a prediction, not a separate mechanism\.

The spectral-stratification hypothesis (Sec. [3.2](https://arxiv.org/html/2606.07596#S3.SS2)) says the shortcut response sits in the tail of the singular ordering of $\Delta W$. This presupposes that there *is* a tail to identify, which in turn presupposes that fine-tuning learned more than just the shortcut. If the dataset offers no other signal, that presupposition fails: the entire update encodes the shortcut, and there is no top-vs-tail structure for truncation to exploit. In that case the only path top-$k$ truncation can take is to shrink $\Delta W$ toward zero, which by construction returns the model to base ($\widetilde{W}=W_{\mathrm{base}}$ at $r=0$). The base model has never seen the marker, so its gap is near zero. The framework therefore predicts a specific signature for this regime: gap and accuracy track each other along the FT-to-base trajectory, rather than decoupling as they would under selective tail removal. IMDB-marker is constructed to test exactly this prediction.
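
The endpoints of the FT-to-base trajectory follow directly from how truncation is applied to a single weight matrix. The NumPy sketch below illustrates this under stated assumptions: the mapping from retention $r$ to rank ($k=\mathrm{round}(r\cdot\min(m,n))$) and the function name are choices made here, not details confirmed by the paper.

```python
import numpy as np


def truncate_update(w_base, w_ft, r):
    """Return W_base + top-k part of Delta W = W_ft - W_base.

    r is the retained fraction of singular components (assumed mapping:
    k = round(r * number_of_singular_values)).
    """
    delta = w_ft - w_base
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    k = int(round(r * len(s)))
    # Rebuild the update from the k largest singular components only.
    delta_k = (u[:, :k] * s[:k]) @ vt[:k, :]
    return w_base + delta_k
```

With `r=0.0` the function returns `w_base` exactly and with `r=1.0` it recovers `w_ft` up to floating-point error, so sweeping `r` traces a path between the two models.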

#### Observed behaviour\.

Fig. [3](https://arxiv.org/html/2606.07596#S3.F3) in the main body shows the prediction confirmed. CivilComments shows the decoupling signature: gap drops to $\sim\!0.3\,\Delta_{\mathrm{ft}}$ across the sweet zone while accuracy stays at $\sim\!0.81$. IMDB-marker shows the lockstep signature: accuracy rises smoothly from $0.51$ (FT, near chance) to $0.87$ (close to base $0.88$), and the spurious gap drops from $\Delta_{\mathrm{ft}}$ to $\sim\!0.04$ at $r{=}5\%$. The two trajectory shapes are visually distinct, and the IMDB shape is the one the framework predicts when the entire update is shortcut.

#### Why we exclude IMDB-marker from Table [1](https://arxiv.org/html/2606.07596#S3.T1).

The table reports bias reduction at the sweet-spot retention $r^{*}$, defined as the retention that achieves the maximum reduction inside the no-cost zone (accuracy loss $<\!2$ pp). On IMDB-marker the FT accuracy is already at chance, so any retention that meaningfully reduces the gap also moves accuracy substantially (along the diagonal toward base), and the no-cost zone becomes ill-defined. Numerically, IMDB-marker shows positive bias reduction across retentions ($\sim\!24\%$ at $r{=}20\%$ on Qwen-0.5B), but reporting it in the same column as the natural-shortcut cells would obscure that the trajectory shape, not just the endpoint magnitude, is qualitatively different. Fig. [3](https://arxiv.org/html/2606.07596#S3.F3) reports the trajectory directly.
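
The sweet-spot selection rule can be made concrete with a short sketch. Two things are assumptions here rather than definitions taken from the paper: bias reduction is computed as $1-|\Delta_{r}|/|\Delta_{\mathrm{ft}}|$, and accuracy loss is measured relative to the fine-tuned model. The sweep values in the usage example are made up purely for illustration.

```python
def sweet_spot(sweep, acc_ft, gap_ft, max_loss_pp=2.0):
    """Pick the retention with the largest bias reduction inside the no-cost zone.

    sweep: iterable of (r, acc_r, gap_r) with accuracies in [0, 1].
    Returns (r_star, reduction), or None if no retention is in the zone.
    """
    best = None
    for r, acc, gap in sweep:
        loss_pp = 100.0 * (acc_ft - acc)  # accuracy loss in percentage points
        if loss_pp >= max_loss_pp:
            continue  # outside the no-cost zone
        reduction = 1.0 - abs(gap) / abs(gap_ft)  # assumed definition
        if best is None or reduction > best[1]:
            best = (r, reduction)
    return best


# Illustrative numbers only, not results from the paper.
example = [(0.8, 0.815, 0.20), (0.4, 0.810, 0.10), (0.1, 0.700, 0.02)]
r_star, reduction = sweet_spot(example, acc_ft=0.82, gap_ft=0.30)
```

In this toy sweep the most aggressive retention has the smallest gap but is rejected because it costs 12 pp of accuracy, so `r_star` is `0.4`.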

## Appendix B Additional results

This appendix provides per-task and per-model breakdowns supporting the main-body claims. Conventions: SVD top-$k$ unless otherwise noted; gap reported as a raw value (when absolute scale matters) or normalised by the fine-tuned gap (when comparison across tasks matters); three random seeds aggregated as mean $\pm 1\sigma$.

The main-body trajectory plots (Figs. [2](https://arxiv.org/html/2606.07596#S3.F2), [4](https://arxiv.org/html/2606.07596#S3.F4), [5](https://arxiv.org/html/2606.07596#S3.F5)) are parametric in retention $r$, projected into the (accuracy loss, gap) plane. The appendix figures here plot each metric directly against $r$, so each curve is a proper function of $r$. Within-curve non-monotonicity reflects the nonlinear dependence of gap and accuracy on which singular components are retained, not seed noise.

#### Per\-\(model, dataset\) retention sweep\.

Fig. [6](https://arxiv.org/html/2606.07596#A2.F6) reports the spurious gap and overall accuracy as functions of retention for every (model, dataset) cell. Both metrics are rescaled to the FT $\to$ base interval ($0$ = FT, $1$ = base) so the two metrics live on the same axis and the diagnostic shapes from Sec. [3.2](https://arxiv.org/html/2606.07596#S3.SS2) are directly visible. Bands are $\pm 1\sigma$ over three seeds. On the natural-shortcut datasets (CivilComments, MNLI, FEVER, QQP), accuracy (blue) stays near $0$ across retentions while the gap (red) rises toward $1$: the gap is pulled toward the unbiased base while accuracy is preserved at the FT level, the decoupling signature predicted by spectral stratification. IMDB-marker (bottom row) shows the boundary signature instead: with the marker the only signal SFT can learn, no top-vs-tail structure exists, and both curves rise together from $0$ to $1$ along the FT-to-base trajectory. The two trajectory shapes (decoupling on natural-shortcut datasets, lockstep on IMDB-marker) are the appendix-scale view of Fig. [3](https://arxiv.org/html/2606.07596#S3.F3).
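
As a worked example of the FT-to-base rescaling (defined in the caption of Fig. 6), the sketch below plugs in the approximate IMDB-marker values quoted in App. A. The function name is an assumption, and $\Delta_{\mathrm{base}}\approx 0$ is taken as exactly $0$ for the arithmetic.

```python
def rescale(value_r, value_ft, value_base):
    """Map a metric onto the FT-to-base interval: 0 = FT level, 1 = base level."""
    return (value_r - value_ft) / (value_base - value_ft)


# Approximate IMDB-marker values from App. A at low retention.
acc_tilde = rescale(0.87, value_ft=0.51, value_base=0.88)  # about 0.973
gap_tilde = rescale(0.04, value_ft=0.83, value_base=0.0)   # about 0.952
```

Both rescaled values sit close to $1$ together, which is the lockstep shape; in a decoupling cell the rescaled accuracy would stay near $0$ while the rescaled gap rises.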

![Refer to caption](https://arxiv.org/html/2606.07596v1/x6.png) Figure 6: Per-(dataset, model) SVD top-$k$ retention sweep. Both metrics are rescaled to the FT $\to$ base interval: $\widetilde{\mathrm{acc}}_{r}=(\mathrm{acc}_{r}-\mathrm{acc}_{\mathrm{ft}})/(\mathrm{acc}_{\mathrm{base}}-\mathrm{acc}_{\mathrm{ft}})$ and $\widetilde{\Delta}_{r}=(\Delta_{r}-\Delta_{\mathrm{ft}})/(\Delta_{\mathrm{base}}-\Delta_{\mathrm{ft}})$, so $0$ corresponds to FT and $1$ to base on each axis. Bands: $\pm 1\sigma$ over three seeds. Top four rows (CivilComments, MNLI, FEVER, QQP): *decoupling*, with blue (acc) staying near $0$ while red (gap) rises toward $1$, i.e. accuracy is preserved at the FT level while the gap is pulled toward the unbiased base. Bottom row (IMDB-marker, boundary case): *lockstep*, with both curves rising together from $0$ to $1$, the FT-to-base trajectory predicted when $\Delta W$ encodes the shortcut alone.
#### Method comparison on representative datasets\.

Fig. [7](https://arxiv.org/html/2606.07596#A2.F7) replicates the method comparison of Sec. [3.3](https://arxiv.org/html/2606.07596#S3.SS3) on three datasets across all three models, plotting normalised gap ($\Delta_{r}/|\Delta_{\mathrm{ft}}|$): $1.0$ means no debiasing, $0$ means fully debiased. Top-$k$ reaches near-zero normalised gap at low retention while preserving accuracy. Bottom-$k$ and random-$k$ overshoot below zero at sufficiently small $k$, a signature of reversion toward base rather than selective shortcut removal.
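
The three selection rules differ only in which singular components of $\Delta W$ are kept. The sketch below assumes that bottom-$k$ keeps the $k$ smallest components and random-$k$ keeps $k$ components chosen uniformly without replacement; the function name and signature are illustrative, not taken from the paper.

```python
import numpy as np


def keep_components(delta, k, mode="top", seed=0):
    """Rebuild delta from k singular components chosen by `mode`."""
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    n = len(s)  # singular values are returned in descending order
    if mode == "top":
        idx = np.arange(k)
    elif mode == "bottom":
        idx = np.arange(n - k, n)
    elif mode == "random":
        idx = np.random.default_rng(seed).choice(n, size=k, replace=False)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return (u[:, idx] * s[idx]) @ vt[idx, :]
```

For small $k$, the bottom-$k$ reconstruction carries very little of the update's norm, so the resulting model is close to base, which is consistent with the reversion-toward-base behaviour described above.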

![Refer to caption](https://arxiv.org/html/2606.07596v1/x7.png) Figure 7: Top-$k$ vs. bottom-$k$ vs. random-$k$ on three representative datasets crossed with three models. Y-axis: normalised gap $\Delta_{r}/|\Delta_{\mathrm{ft}}|$; dashed line at $1.0$ marks the fine-tuned reference. Top-$k$ approaches $0$ smoothly; bottom-$k$ and random-$k$ either stay near $1.0$ until accuracy collapses, or overshoot below $0$ as the model reverts toward an unbiased base.
#### LoRA rank sweep across models\.

Fig. [8](https://arxiv.org/html/2606.07596#A2.F8) shows the LoRA rank sweep on CivilComments for all three models. The comparison that matters is at *matched accuracy*, which Fig. [5](https://arxiv.org/html/2606.07596#S3.F5) reports directly: top-$k$ Pareto-dominates LoRA on all three models. The per-rank view here is the supporting decomposition. At small ranks (e.g. $r{=}16$ on Qwen-0.5B), LoRA can show a gap below the full-SFT reference, but only because it has not yet recovered full-SFT accuracy; the low gap is bought by underfitting the task, not by selectively removing the shortcut. As rank rises, accuracy approaches the full-SFT level and the gap rises toward (or above) the full-SFT reference. The takeaway is that LoRA does not reach a regime where it simultaneously matches full-SFT accuracy and reduces the spurious gap, which is exactly the regime post-hoc top-$k$ truncation occupies. This is consistent with the reading in Sec. [3.3](https://arxiv.org/html/2606.07596#S3.SS3): rank-constrained training packs task and shortcut into a shared low-rank subspace, while unconstrained fine-tuning yields a full-rank update in which post-hoc truncation can selectively drop the tail.

![Refer to caption](https://arxiv.org/html/2606.07596v1/x8.png) Figure 8: LoRA rank sweep on CivilComments. Dotted lines: per-model full-SFT reference. Low-rank LoRA points can fall below the SFT gap reference, but only because they also fall below the SFT *accuracy* reference (right panel): the gap is reduced by underfitting, not by selectively removing the shortcut. The matched-accuracy comparison in Fig. [5](https://arxiv.org/html/2606.07596#S3.F5) is the apples-to-apples view. Accuracy rises with rank toward the SFT level.
#### Per\-layer subset compression\.

Fig. [9](https://arxiv.org/html/2606.07596#A2.F9) restricts truncation at $r{=}20\%$ to a subset of layers (attention only, MLP only, first half, second half), keeping the full-rank update elsewhere. We use this as an exploratory probe of where in the network the shortcut-related component of $\Delta W$ lives. The pattern is heterogeneous across (model, dataset) cells. On some cells the MLP-only or second-half subsets recover much of the bias reduction of full truncation (e.g. CivilComments on Qwen-0.5B), suggesting the relevant directions concentrate in those layers. On other cells (e.g. CivilComments on Qwen-7B, MNLI on Qwen-0.5B) no single subset recovers a substantial fraction of the full-truncation reduction, indicating the relevant directions are distributed across the network. We do not claim a universal layer-localisation result; we report the experiment because the heterogeneity is itself informative about how fine-tuning organises the update. On IMDB-marker the subsets behave heterogeneously in a way consistent with the predicted boundary regime (App. [A](https://arxiv.org/html/2606.07596#A1)): truncation contributes to the FT-to-base trajectory wherever it is applied, since by construction there is no top-vs-tail structure to exploit on this dataset.
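
A subset-restricted truncation of this kind could be sketched as follows. The parameter names (`layers.0.mlp.up_proj` and similar), the dict-of-arrays representation, and the name-based predicate are all hypothetical conveniences introduced here; the paper does not specify how subsets are selected in code.

```python
import numpy as np


def truncate_subset(base, ft, r, should_truncate):
    """Truncate Delta W only for matrices selected by `should_truncate(name)`.

    base, ft: dicts mapping parameter name -> 2-D weight array.
    Unselected matrices keep their full-rank fine-tuned update.
    """
    out = {}
    for name, w_ft in ft.items():
        delta = w_ft - base[name]
        if should_truncate(name):
            u, s, vt = np.linalg.svd(delta, full_matrices=False)
            k = int(round(r * len(s)))  # assumed retention-to-rank mapping
            delta = (u[:, :k] * s[:k]) @ vt[:k, :]
        out[name] = base[name] + delta
    return out


def mlp_only(name):
    """Hypothetical predicate selecting MLP matrices by name."""
    return ".mlp." in name
```

Calling `truncate_subset(base, ft, 0.2, mlp_only)` would correspond to the "MLP only" condition at $r{=}20\%$ under these naming assumptions; a first-half or second-half split would need a predicate on the layer index instead.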

![Refer to caption](https://arxiv.org/html/2606.07596v1/x9.png) Figure 9: Per-layer subset compression at $r{=}20\%$, real data, per-cell. Bars show $|\Delta|$ when only the indicated layer subset is truncated; remaining layers retain the full-rank update. Dashed: fine-tuned $|\Delta|$; dotted: base $|\Delta|$. Some cells show clean MLP- or second-half-localised reduction (e.g. CivilComments on Qwen-0.5B); others show the relevant directions spread across the network (e.g. CivilComments on Qwen-7B). We do not claim a universal localisation.
#### Singular-value decay of $\Delta W$.

Fig. [10](https://arxiv.org/html/2606.07596#A2.F10) shows the real singular-value spectrum of $\Delta W$ averaged across four representative MLP layers, with all five datasets overlaid per model. Two observations follow. First, the spectra are nearly indistinguishable across datasets within a given model: the difference between cells where the spectral picture applies cleanly (CivilComments) and the predicted boundary regime where it is forced to break down (IMDB-marker) is *not* visible in the raw spectrum. Second, $\Delta W$ is *not* approximately low-rank: $90\%$ of the spectral energy lives in roughly $73\!-\!78\%$ of the singular components, so the top few singular values do not dominate the variance.

This second observation reinforces the framing in Sec. [3.2](https://arxiv.org/html/2606.07596#S3.SS2). The naive reading (“$\Delta W$ is low-rank; task signal lives in the top singular values; the shortcut lives in a tiny tail; drop the tail”) is not supported by the spectrum. The supported reading is the directional / behavioural one. Writing $\Delta W=\sum_{i}\sigma_{i}u_{i}v_{i}^{\top}$, top-$k$ truncation preserves the model’s response to inputs whose projection onto $v_{1},\ldots,v_{k}$ is large and removes the response to inputs whose projection lies in the bottom subspace $v_{k+1},\ldots,v_{n}$. The empirical finding that truncation preserves task accuracy while reducing the spurious gap therefore implies that task-relevant input directions project predominantly onto the top right-singular vectors of $\Delta W$, and shortcut-related input directions project predominantly onto the bottom. This is a claim about *ordering* in the singular basis, recovered from the effect of truncation, not a claim about energy concentration. The spectrum can be broadly distributed (as it is in real data) and the directional property can still hold; the claim is then “the shortcut response sits in the tail of the singular ordering”, not “the shortcut response carries little spectral energy”. We do not claim the property is visible in the raw spectrum; Fig. [10](https://arxiv.org/html/2606.07596#A2.F10) confirms that it is not. The behavioural sweeps (Figs. [2](https://arxiv.org/html/2606.07596#S3.F2), [4](https://arxiv.org/html/2606.07596#S3.F4)) are the direct evidence.
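
The directional reading can be checked numerically on a toy matrix. The sketch below assumes the layer acts as $y=\Delta W\,x$ (so the right-singular vectors $v_{i}$ live in input space) and uses a random Gaussian matrix, whose spectrum is not low-rank, as a stand-in for a real update. The function name is hypothetical.

```python
import numpy as np


def split_response(delta, k, x):
    """Split delta @ x into the part kept and the part removed by top-k truncation."""
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    # Project x onto each right-singular vector, then map through sigma_i * u_i.
    kept = (u[:, :k] * s[:k]) @ (vt[:k, :] @ x)
    removed = (u[:, k:] * s[k:]) @ (vt[k:, :] @ x)
    return kept, removed


rng = np.random.default_rng(0)
toy_delta = rng.normal(size=(6, 6))  # broad spectrum, no low-rank structure
_, _, toy_vt = np.linalg.svd(toy_delta)

# Input aligned with v_1: the response survives truncation entirely.
kept_top, removed_top = split_response(toy_delta, 3, toy_vt[0])
# Input aligned with v_n: the response is removed entirely.
kept_tail, removed_tail = split_response(toy_delta, 3, toy_vt[-1])
```

The kept and removed parts always sum to the full response `toy_delta @ x`, so what truncation does to an input depends only on how that input projects onto the two subspaces, not on how sharply the singular values decay.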

![Refer to caption](https://arxiv.org/html/2606.07596v1/x10.png) Figure 10: Real singular-value decay of $\Delta W$ for four representative MLP layers (mean $\pm 1\sigma$ shaded band). Top row: $\sigma_{i}/\sigma_{\max}$ on a log $y$-axis, all five datasets overlaid per model. Bottom row: percentage of singular components needed to capture $90\%$ of the spectral energy. Spectra are similar across datasets within a model and are not sharply concentrated ($90\%$ of energy needs $\sim\!73\!-\!78\%$ of components). $\Delta W$ is therefore not approximately low-rank, which rules out the naive “shortcut has small spectral energy” reading and motivates the directional / ordering reading discussed above.
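
The statistic in the bottom row of Fig. 10 could be computed as in the sketch below. It assumes that "spectral energy" means the sum of squared singular values (the squared Frobenius norm of $\Delta W$); if the paper uses a different convention, such as the plain sum of singular values, the numbers would change. The function name is an assumption.

```python
import numpy as np


def components_for_energy(singular_values, frac=0.9):
    """Fraction of singular components needed to reach `frac` of the energy.

    Energy is assumed to be the sum of squared singular values.
    """
    s = np.sort(np.asarray(singular_values, dtype=float))[::-1]
    energy = np.cumsum(s ** 2)
    # First index whose cumulative energy reaches the target, converted to a count.
    k = int(np.searchsorted(energy, frac * energy[-1])) + 1
    return k / len(s)


flat = components_for_energy(np.ones(10))                # 0.9: no concentration
peaked = components_for_energy([10.0, 0.1, 0.1, 0.1])    # 0.25: one dominant value
```

A perfectly flat spectrum needs $90\%$ of its components and a strongly low-rank one needs very few, so the reported $\sim\!73\!-\!78\%$ sits much closer to the flat end.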