Feature Geometry of LoRA Adapters: A Sparse Autoencoder Analysis of Representational Divergence in Fine-Tuned Language Models

arXiv cs.LG 05/29/26, 04:00 AM Papers
Summary
This paper uses Sparse Autoencoders to analyze the geometry of LoRA-induced representations in language models, finding that LoRA updates occupy partially distinct feature structures not fully captured by pretrained interpretability dictionaries.
arXiv:2605.28896v1 Announce Type: new Abstract: Low-Rank Adaptation (LoRA) has emerged as a widely adopted approach for adapting large language models, yet the internal representational changes induced by LoRA fine-tuning remain insufficiently understood. In this work, we investigate the geometry of LoRA-induced representations using Sparse Autoencoders (SAEs). We introduce a delta activation framework that isolates the adapter-specific contribution to the residual stream. Using Gemma-2-9B with LoRA ranks 4, 8, 16, and 32, we train adapter-specific SAEs across multiple transformer layers and compare their learned feature spaces with pretrained SAE dictionaries. We evaluate representational alignment using cosine similarity between decoder directions, principal-angle analysis of feature subspaces, and Centered Kernel Alignment (CKA) between activation representations. Across layers and ranks, we consistently observe comparatively weak geometric alignment between LoRA-induced feature dictionaries and pretrained SAE features. Adapter-specific SAEs also reconstruct delta activations more effectively than pretrained SAEs, suggesting that LoRA updates occupy partially distinct representational structure within the residual stream. Additionally, feature density increases with rank and depth, while geometric divergence remains relatively stable across ranks. These findings provide empirical evidence that LoRA fine-tuning can induce feature structures that are not fully captured by pretrained interpretability dictionaries, with implications for mechanistic interpretability, adaptation analysis, and safety auditing of fine-tuned language models.
Original Article
View Cached Full Text
Cached at: 05/29/26, 09:12 AM
# A Sparse Autoencoder Analysis of Representational Divergence in Fine-Tuned Language Models
Source: [https://arxiv.org/html/2605.28896](https://arxiv.org/html/2605.28896)
\(May 2026\)

###### Abstract

Low\-Rank Adaptation \(LoRA\) has emerged as a widely adopted approach for adapting large language models, yet the internal representational changes induced by LoRA fine\-tuning remain insufficiently understood\. In this work, we investigate the geometry of LoRA\-induced representations using Sparse Autoencoders \(SAEs\)\. We introduce a delta activation framework that isolates the adapter\-specific contribution to the residual stream as

𝐡Δ=𝐡adapted−𝐡base=𝐁𝐀𝐱\.\\mathbf\{h\}\_\{\\Delta\}=\\mathbf\{h\}\_\{\\text\{adapted\}\}\-\\mathbf\{h\}\_\{\\text\{base\}\}=\\mathbf\{B\}\\mathbf\{A\}\\mathbf\{x\}\.Using Gemma\-2\-9B with LoRA ranksr∈\{4,8,16,32\}r\\in\\\{4,8,16,32\\\}, we train adapter\-specific SAEs across multiple transformer layers and compare their learned feature spaces with pretrained SAE dictionaries\. We evaluate representational alignment using cosine similarity between decoder directions, principal\-angle analysis of feature subspaces, and Centered Kernel Alignment \(CKA\) between activation representations\. Across layers and ranks, we consistently observe comparatively weak geometric alignment between LoRA\-induced feature dictionaries and pretrained SAE features\. Adapter\-specific SAEs also reconstruct delta activations more effectively than pretrained SAEs, suggesting that LoRA updates occupy partially distinct representational structure within the residual stream\. Additionally, feature density increases with rank and depth, while geometric divergence remains relatively stable across ranks\. These findings provide empirical evidence that LoRA fine\-tuning can induce feature structures that are not fully captured by pretrained interpretability dictionaries, with implications for mechanistic interpretability, adaptation analysis, and safety auditing of fine\-tuned language models\.

## 1Introduction

Large language models \(LLMs\) are increasingly deployed not as base models but as fine\-tuned variants, with LoRA\(Hu et al\.,[2022](https://arxiv.org/html/2605.28896#bib.bib12)\)being the dominant adaptation method\. LoRA constrains weight updates to a low\-rank factorizationΔ𝐖=𝐁𝐀\\Delta\\mathbf\{W\}=\\mathbf\{B\}\\mathbf\{A\}where𝐁∈ℝd×r\\mathbf\{B\}\\in\\mathbb\{R\}^\{d\\times r\}and𝐀∈ℝr×d\\mathbf\{A\}\\in\\mathbb\{R\}^\{r\\times d\}, with rankr≪dr\\ll d\. Despite its widespread use in instruction tuning, domain adaptation, and safety alignment, almost nothing is understood about what LoRA does to a model’sinternal feature geometry\.

The mechanistic interpretability literature has made substantial progress in characterizing base model representations using Sparse Autoencoders\(Bricken et al\.,[2023](https://arxiv.org/html/2605.28896#bib.bib5); Templeton et al\.,[2024](https://arxiv.org/html/2605.28896#bib.bib17)\), which decompose superposed residual stream activations into sparse, approximately monosemantic feature directions\. However, this toolkit has been applied almost exclusively to base models or RLHF\-tuned variants in an undifferentiated way\(Cunningham et al\.,[2023](https://arxiv.org/html/2605.28896#bib.bib6); Gemma Scope,[2024](https://arxiv.org/html/2605.28896#bib.bib11)\)\. The feature\-level consequences of LoRA fine\-tuning remain unexplored\.

This gap matters for several reasons\. First, safety fine\-tuning via LoRA is widely practiced, but if the adapter operates in a representational subspace that base\-model interpretability tools cannot see, safety audits may be systematically incomplete\. Second, recent work\(Yang et al\.,[2023](https://arxiv.org/html/2605.28896#bib.bib18); Qi et al\.,[2023](https://arxiv.org/html/2605.28896#bib.bib15)\)has demonstrated that safety fine\-tuning can be easily undone by subsequent fine\-tuning, but the mechanistic account of why this occurs is missing\. Third, understanding what LoRA encodes at the feature level is a prerequisite for principled adapter design and control\.

#### Our contributions\.

We make the following contributions:

1. 1\.The delta SAE framework: We introduce a methodology for training SAEs specifically on adapter\-induced activation deltas𝐡Δ=𝐡adapted−𝐡base\\mathbf\{h\}\_\{\\Delta\}=\\mathbf\{h\}\_\{\\text\{adapted\}\}\-\\mathbf\{h\}\_\{\\text\{base\}\}, providing a mechanistically clean decomposition of adapter contributions\.
2. 2\.Three\-measure geometric analysis: We provide convergent evidence from cosine similarity, principal angle analysis, and CKA that LoRA adapter features occupy a geometrically distinct subspace from base model features\.
3. 3\.Systematic rank analysis: We show that rank affects feature density and CKA representation distance, but not the fundamental geometric novelty of adapter features\.
4. 4\.Safety implications: We identify a monitoring gap arising from the geometric separation of adapter and base features, with implications for LoRA\-based alignment\.

## 2Background and Related Work

### 2\.1Low\-Rank Adaptation

LoRA\(Hu et al\.,[2022](https://arxiv.org/html/2605.28896#bib.bib12)\)modifies a pre\-trained weight matrix𝐖0∈ℝdout×din\\mathbf\{W\}\_\{0\}\\in\\mathbb\{R\}^\{d\_\{\\text\{out\}\}\\times d\_\{\\text\{in\}\}\}by adding a low\-rank update:

𝐖=𝐖0\+αr𝐁𝐀\\mathbf\{W\}=\\mathbf\{W\}\_\{0\}\+\\frac\{\\alpha\}\{r\}\\mathbf\{B\}\\mathbf\{A\}\(1\)where𝐁∈ℝdout×r\\mathbf\{B\}\\in\\mathbb\{R\}^\{d\_\{\\text\{out\}\}\\times r\},𝐀∈ℝr×din\\mathbf\{A\}\\in\\mathbb\{R\}^\{r\\times d\_\{\\text\{in\}\}\},r≪dr\\ll d, andα\\alphais a scaling hyperparameter\. The base weights𝐖0\\mathbf\{W\}\_\{0\}are frozen; only𝐀\\mathbf\{A\}and𝐁\\mathbf\{B\}are trained\.

For a given input𝐱\\mathbf\{x\}, the adapter’s contribution to the residual stream is:

𝐡Δ=αr𝐁𝐀𝐱\\mathbf\{h\}\_\{\\Delta\}=\\frac\{\\alpha\}\{r\}\\mathbf\{B\}\\mathbf\{A\}\\mathbf\{x\}\(2\)This delta is input\-dependent and lives in the fulldd\-dimensional residual stream, despite the weight update being rank\-rr\. The effective rank of the activation delta depends on the input distribution and may be substantially larger thanrr\.

### 2\.2Sparse Autoencoders for Mechanistic Interpretability

The superposition hypothesis\(Elhage et al\.,[2022](https://arxiv.org/html/2605.28896#bib.bib8)\)posits that neural networks encode more features than they have dimensions by representing features as nearly\-orthogonal directions, allowing superposition of many sparse features\. SAEs provide a practical tool to decompose this superposition\(Bricken et al\.,[2023](https://arxiv.org/html/2605.28896#bib.bib5)\):

𝐳\\displaystyle\\mathbf\{z\}=ReLU\(𝐖enc\(𝐡−𝐛dec\)\+𝐛enc\)\\displaystyle=\\text\{ReLU\}\(\\mathbf\{W\}\_\{\\text\{enc\}\}\(\\mathbf\{h\}\-\\mathbf\{b\}\_\{\\text\{dec\}\}\)\+\\mathbf\{b\}\_\{\\text\{enc\}\}\)\(3\)𝐡^\\displaystyle\\hat\{\\mathbf\{h\}\}=𝐖dec𝐳\+𝐛dec\\displaystyle=\\mathbf\{W\}\_\{\\text\{dec\}\}\\mathbf\{z\}\+\\mathbf\{b\}\_\{\\text\{dec\}\}\(4\)with lossℒ=‖𝐡−𝐡^‖22\+λ‖𝐳‖1\\mathcal\{L\}=\\\|\\mathbf\{h\}\-\\hat\{\\mathbf\{h\}\}\\\|\_\{2\}^\{2\}\+\\lambda\\\|\\mathbf\{z\}\\\|\_\{1\}, whereλ\\lambdacontrols sparsity\.

Gemma Scope\(Gemma Scope,[2024](https://arxiv.org/html/2605.28896#bib.bib11)\)provides pre\-trained SAEs for all layers of Gemma\-2\-9B, trained on the base model’s residual stream activations\. Each SAE learns a dictionary ofdSAE=16384d\_\{\\text\{SAE\}\}=16384feature directions in thedmodel=3584d\_\{\\text\{model\}\}=3584\-dimensional residual stream\.

### 2\.3Geometric Similarity Measures

We use three complementary geometric measures:

Cosine similarity\.For two unit vectors𝐮,𝐯\\mathbf\{u\},\\mathbf\{v\}:sim\(𝐮,𝐯\)=𝐮⊤𝐯\\text\{sim\}\(\\mathbf\{u\},\\mathbf\{v\}\)=\\mathbf\{u\}^\{\\top\}\\mathbf\{v\}\. We report the maximum cosine similarity of each delta SAE feature to any Gemma Scope feature\.

Principal angles\.For subspaces𝒜\\mathcal\{A\}andℬ\\mathcal\{B\}with orthonormal bases𝐐A\\mathbf\{Q\}\_\{A\}and𝐐B\\mathbf\{Q\}\_\{B\}, the principal anglesθ1,…,θk\\theta\_\{1\},\\ldots,\\theta\_\{k\}are defined viacos⁡θi=σi\(𝐐A⊤𝐐B\)\\cos\\theta\_\{i\}=\\sigma\_\{i\}\(\\mathbf\{Q\}\_\{A\}^\{\\top\}\\mathbf\{Q\}\_\{B\}\)\(Björck & Golub,[1973](https://arxiv.org/html/2605.28896#bib.bib4)\)\. Angles near90°90\\textdegreeindicate orthogonal subspaces; angles near0°0\\textdegreeindicate aligned subspaces\.

Linear CKA\.Centered Kernel Alignment\(Kornblith et al\.,[2019](https://arxiv.org/html/2605.28896#bib.bib14)\)measures representational similarity invariant to orthogonal transformation and isotropic scaling:

CKA\(𝐗,𝐘\)=‖𝐘⊤𝐗‖F2‖𝐗⊤𝐗‖F‖𝐘⊤𝐘‖F\\text\{CKA\}\(\\mathbf\{X\},\\mathbf\{Y\}\)=\\frac\{\\\|\\mathbf\{Y\}^\{\\top\}\\mathbf\{X\}\\\|\_\{F\}^\{2\}\}\{\\\|\\mathbf\{X\}^\{\\top\}\\mathbf\{X\}\\\|\_\{F\}\\\|\\mathbf\{Y\}^\{\\top\}\\mathbf\{Y\}\\\|\_\{F\}\}\(5\)

## 3Method: The Delta SAE Framework

### 3\.1Motivation

Standard SAE analysis applied to𝐡adapted\\mathbf\{h\}\_\{\\text\{adapted\}\}conflates base model representations with adapter contributions\. To isolate what the adapter adds, we work directly with the activation delta\. From Equation[2](https://arxiv.org/html/2605.28896#S2.E2),𝐡Δ=𝐡adapted−𝐡base\\mathbf\{h\}\_\{\\Delta\}=\\mathbf\{h\}\_\{\\text\{adapted\}\}\-\\mathbf\{h\}\_\{\\text\{base\}\}is the exact adapter contribution — mechanistically clean and free of base model signal\.

### 3\.2Delta Activation Extraction

We use forward hooks to capture residual stream activations after each transformer layer\. For input sequence𝐗\\mathbf\{X\}and target layersℒ=\{5,10,18,22,32,38\}\\mathcal\{L\}=\\\{5,10,18,22,32,38\\\}:

Algorithm 1Delta Activation Extraction1:foreach input

𝐱∈𝒟probe\\mathbf\{x\}\\in\\mathcal\{D\}\_\{\\text\{probe\}\}do

2:

𝐡base\(ℓ\)←BaseModel\(𝐱\)\|layerℓ\\mathbf\{h\}\_\{\\text\{base\}\}^\{\(\\ell\)\}\\leftarrow\\text\{BaseModel\}\(\\mathbf\{x\}\)\\big\|\_\{\\text\{layer \}\\ell\}∀ℓ∈ℒ\\forall\\ell\\in\\mathcal\{L\}
3:

𝐡adapted\(ℓ\)←LoRAModel\(𝐱\)\|layerℓ\\mathbf\{h\}\_\{\\text\{adapted\}\}^\{\(\\ell\)\}\\leftarrow\\text\{LoRAModel\}\(\\mathbf\{x\}\)\\big\|\_\{\\text\{layer \}\\ell\}∀ℓ∈ℒ\\forall\\ell\\in\\mathcal\{L\}
4:

𝐡Δ\(ℓ\)←𝐡adapted\(ℓ\)−𝐡base\(ℓ\)\\mathbf\{h\}\_\{\\Delta\}^\{\(\\ell\)\}\\leftarrow\\mathbf\{h\}\_\{\\text\{adapted\}\}^\{\(\\ell\)\}\-\\mathbf\{h\}\_\{\\text\{base\}\}^\{\(\\ell\)\}
5:endfor

6:Store:

𝐡base\\mathbf\{h\}\_\{\\text\{base\}\}once \(shared across ranks\);

𝐡Δ\\mathbf\{h\}\_\{\\Delta\}per rank

𝐡base\\mathbf\{h\}\_\{\\text\{base\}\}is stored once and shared across all ranks since the base model is identical\.𝐡adapted\\mathbf\{h\}\_\{\\text\{adapted\}\}is computed on\-the\-fly and discarded after delta computation\.

### 3\.3Delta SAE Training

For each \(rank, layer\) pair, we train a dedicated SAE on𝐡Δ\\mathbf\{h\}\_\{\\Delta\}vectors\. LetNNdenote the number of token vectors\. We apply RMS normalisation before training:

𝐡~Δ=𝐡ΔσRMSwhereσRMS=1N∑i=1N‖𝐡Δ\(i\)‖2\\tilde\{\\mathbf\{h\}\}\_\{\\Delta\}=\\frac\{\\mathbf\{h\}\_\{\\Delta\}\}\{\\sigma\_\{\\text\{RMS\}\}\}\\quad\\text\{where\}\\quad\\sigma\_\{\\text\{RMS\}\}=\\frac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}\\\|\\mathbf\{h\}\_\{\\Delta\}^\{\(i\)\}\\\|\_\{2\}\(6\)The scaleσRMS\\sigma\_\{\\text\{RMS\}\}is saved per SAE for denormalization in downstream analysis\. The SAE loss is:

ℒΔ=‖𝐡~Δ−𝐡~^Δ‖22\+λ1‖𝐳‖1\\mathcal\{L\}\_\{\\Delta\}=\\\|\\tilde\{\\mathbf\{h\}\}\_\{\\Delta\}\-\\hat\{\\tilde\{\\mathbf\{h\}\}\}\_\{\\Delta\}\\\|\_\{2\}^\{2\}\+\\lambda\_\{1\}\\\|\\mathbf\{z\}\\\|\_\{1\}\(7\)whereλ1=0\.15\\lambda\_\{1\}=0\.15was determined by hyperparameter search targetingL0≈30L\_\{0\}\\approx 30–5050active features per token \(see Section[6](https://arxiv.org/html/2605.28896#S6)\)\.

### 3\.4Dictionary Similarity Analysis

To measure geometric alignment between the delta SAE and Gemma Scope dictionaries, we compute for each delta feature direction𝐝i∈𝐖decΔ\\mathbf\{d\}\_\{i\}\\in\\mathbf\{W\}\_\{\\text\{dec\}\}^\{\\Delta\}:

si=maxj⁡cos⁡\(𝐝i,𝐠j\)where𝐠j∈𝐖decGSs\_\{i\}=\\max\_\{j\}\\cos\(\\mathbf\{d\}\_\{i\},\\mathbf\{g\}\_\{j\}\)\\quad\\text\{where\}\\quad\\mathbf\{g\}\_\{j\}\\in\\mathbf\{W\}\_\{\\text\{dec\}\}^\{\\text\{GS\}\}\(8\)This requires16384×16384=26816384\\times 16384=268million comparisons per layer, computed in memory\-efficient chunks of 512 features\.

### 3\.5Principal Angle Computation

We extract the top\-kk\(k=256k=256\) principal directions of each decoder matrix via SVD and compute principal angles between subspaces:

cos⁡θi=σi\(𝐐Δ⊤𝐐GS\)\\cos\\theta\_\{i\}=\\sigma\_\{i\}\\\!\\left\(\\mathbf\{Q\}\_\{\\Delta\}^\{\\top\}\\mathbf\{Q\}\_\{\\text\{GS\}\}\\right\)\(9\)where𝐐Δ,𝐐GS∈ℝd×k\\mathbf\{Q\}\_\{\\Delta\},\\mathbf\{Q\}\_\{\\text\{GS\}\}\\in\\mathbb\{R\}^\{d\\times k\}are orthonormal bases from SVD of the respective decoder matrices\.

## 4Experimental Setup

### 4\.1Model and Architecture

We useGemma\-2\-9B\(Gemma Team,[2024](https://arxiv.org/html/2605.28896#bib.bib10)\)\(google/gemma\-2\-9b\) as the base model:dmodel=3584d\_\{\\text\{model\}\}=3584, 42 transformer layers, 16 query heads, 8 key/value heads \(Grouped Query Attention\(Ainslie et al\.,[2023](https://arxiv.org/html/2605.28896#bib.bib1)\)\),9\.24×1099\.24\\times 10^\{9\}total parameters\.

For SAEs, we useGemma Scope\(Gemma Scope,[2024](https://arxiv.org/html/2605.28896#bib.bib11)\)\(google/gemma\-scope\-9b\-pt\-res\), pre\-trained residual stream SAEs at widthdSAE=16384d\_\{\\text\{SAE\}\}=16384\(expansion factor≈4\.6×\\approx 4\.6\\times\)\.

### 4\.2LoRA Adapter Training

We train four LoRA adapters varying only rankr∈\{4,8,16,32\}r\\in\\\{4,8,16,32\\\}, with all other hyperparameters fixed to ensure controlled comparison\. Configuration is summarised in Table[1](https://arxiv.org/html/2605.28896#S4.T1)\.

Table 1:LoRA Training ConfigurationSettingα=2r\\alpha=2rensures the effective weight update scalingα/r=2\.0\\alpha/r=2\.0is constant across ranks, isolating rank as the sole variable\. Table[2](https://arxiv.org/html/2605.28896#S4.T2)shows training outcomes\.

Table 2:LoRA Adapter Training ResultsTraining loss decreases monotonically with rank \(r2=0\.997r^\{2\}=0\.997\), confirming that higher\-rank adapters learn more expressive representations\.

### 4\.3Datasets

Adapter training:tatsu\-lab/alpaca\(Alpaca,[2023](https://arxiv.org/html/2605.28896#bib.bib2)\), 10,000 samples \(indices 0–9,999\)\. Selected for its diverse instruction\-following format and standard use in LoRA literature\.

Activation probe set: 2,000 samples \(indices 5,000–6,999\) with diversity bucketing across five categories: creative, factual, reasoning, coding, and practical \(400 samples each\)\. This ensures broad activation coverage for SAE training\.

Held\-out evaluation: 200 samples \(indices 11,000–11,199\) — never seen during adapter training or SAE training\.

### 4\.4Delta SAE Configuration

Table 3:Delta SAE Training Configuration
### 4\.5Target Layers

We analyse layersℒ=\{5,10,18,22,32,38\}\\mathcal\{L\}=\\\{5,10,18,22,32,38\\\}, chosen to cover early \(5, 10\), middle \(18, 22\), and late \(32, 38\) processing stages\. Pre\-trained Gemma Scope SAEs are available for all target layers atwidth\_16k\.

## 5Results

### 5\.1Activation Delta Characterisation

Table[4](https://arxiv.org/html/2605.28896#S5.T4)reports the meanL2L\_\{2\}norm of𝐡Δ\\mathbf\{h\}\_\{\\Delta\}across layers and ranks\.

Table 4:Mean Delta Norm‖𝐡Δ‖2\\\|\\mathbf\{h\}\_\{\\Delta\}\\\|\_\{2\}Across Layers and Ranks![Refer to caption](https://arxiv.org/html/2605.28896v1/paper_figures/fig1_delta_norm_heatmap.png)Figure 1:Delta norm heatmap across layers and ranks\.The delta norm increases approximately18×18\\timesfrom layer 5 to layer 38, indicating the adapter’s influence on the residual stream amplifies with depth\. Notably, the relationship between rank and delta magnitude is non\-monotonic:r=8r=8produces the largest norm at layer 38 \(345\.45\), exceedingr=32r=32\(330\.81\)\. The delta exhibits non\-zero variance across all residual dimensions, indicating broad distribution across the residual stream rather than confined to a low\-dimensional subspace\.

### 5\.2Base SAE Reconstruction Failure

Table[5](https://arxiv.org/html/2605.28896#S5.T5)reports relative reconstruction error when passing𝐡Δ\\mathbf\{h\}\_\{\\Delta\}through Gemma Scope SAEs \(base model dictionary\)\. Relative error is defined asεrel=‖𝐡Δ−𝐡^Δ‖2/‖𝐡Δ‖2\\varepsilon\_\{\\text\{rel\}\}=\\\|\\mathbf\{h\}\_\{\\Delta\}\-\\hat\{\\mathbf\{h\}\}\_\{\\Delta\}\\\|\_\{2\}/\\\|\\mathbf\{h\}\_\{\\Delta\}\\\|\_\{2\}\.

Table 5:Gemma Scope Relative Reconstruction Error on𝐡Δ\\mathbf\{h\}\_\{\\Delta\}\. Values\>1\.0\>1\.0indicate reconstruction error exceeds the signal magnitude\.![Refer to caption](https://arxiv.org/html/2605.28896v1/paper_figures/fig2_reconstruction_comparison.png)Figure 2:Reconstruction error comparison\.Reconstruction error exceeds 1\.0 at every layer and rank, meaning the Gemma Scope SAE’s approximation error is larger than the delta signal itself\. This failure is most severe at layer 5 \(ε≈2\.3\\varepsilon\\approx 2\.3\) and improves with depth\. At layer 32, errorincreaseswith rank \(r=4r=4: 1\.260;r=32r=32: 1\.492\), suggesting higher\-rank adapters introduce more geometrically alien contributions at deeper layers\.

### 5\.3Delta SAE Reconstruction Quality

Table[6](https://arxiv.org/html/2605.28896#S5.T6)reports held\-out reconstruction error for adapter\-specific delta SAEs\.

Table 6:Delta SAE Relative Reconstruction Error on Held\-Out𝐡Δ\\mathbf\{h\}\_\{\\Delta\}\(indices 11,000–11,199, never seen during training\)Table[7](https://arxiv.org/html/2605.28896#S5.T7)reports the reconstruction improvement of delta SAEs over Gemma Scope, computed as\(εGS−εΔ\)/εGS×100%\(\{\\varepsilon\_\{\\text\{GS\}\}\-\\varepsilon\_\{\\Delta\}\}\)/\{\\varepsilon\_\{\\text\{GS\}\}\}\\times 100\\%\.

Table 7:Reconstruction Improvement of Delta SAE over Gemma Scope \(%\)![Refer to caption](https://arxiv.org/html/2605.28896v1/paper_figures/fig6_improvement_heatmap.png)Figure 3:Reconstruction improvement \(%\) of delta SAEs over Gemma Scope on held\-out data across all 24 conditions\. Improvement ranges from 46\.3% \(layer 38,r=4r=4\) to 86\.2% \(layer 5,r=4r=4\), with early layers showing the greatest benefit from adapter\-specific dictionaries\. All 24 conditions show positive improvement\.Delta SAEs outperform Gemma Scope on all 24 conditions \(4 ranks×\\times6 layers\)\. Improvement ranges from 46\.3% to 86\.2%, demonstrating that adapter\-specific SAEs capture genuine structure that generalises to completely unseen data\.

Key patterns:\(i\) Improvement decreases with layer depth, suggesting early\-layer delta geometry is most distinct from the base dictionary; \(ii\) improvement slightly decreases with rank, consistent with higher\-rank adapters introducing more complex delta structure; \(iii\) delta SAE reconstruction error at layer 5 is remarkably rank\-invariant \(0\.340–0\.358\), suggesting a consistent geometric relationship between adapter and base representations regardless of capacity\.

### 5\.4Feature Activation Overlap Analysis

Table[8](https://arxiv.org/html/2605.28896#S5.T8)reports the fraction of features that activate on𝐡Δ\\mathbf\{h\}\_\{\\Delta\}that also activate on𝐡base\\mathbf\{h\}\_\{\\text\{base\}\}\(overlap fraction\) vs\. features that activate only on𝐡Δ\\mathbf\{h\}\_\{\\Delta\}\(Weakly aligned fraction\), using Gemma Scope SAEs\.

![Refer to caption](https://arxiv.org/html/2605.28896v1/paper_figures/fig5_feature_density.png)Figure 4:Feature density \(mean active features per token\) across layers and ranks\. Left: density increases with layer depth for all ranks\. Right: monotonic increase with rank at layer 38 \(r=4r=4: 30\.28;r=32r=32: 41\.66\), demonstrating that rank controls representational capacity without altering geometric novelty\.Table 8:Feature Activation Overlap and Weakly aligned Fractions\. Overlap = fraction of delta\-active features also active inhbaseh\_\{\\text\{base\}\}\. Weakly aligned = fraction active only onhΔh\_\{\\Delta\}\.![Refer to caption](https://arxiv.org/html/2605.28896v1/paper_figures/fig3_feature_overlap.png)Figure 5:Feature OverlapThe Weakly aligned fraction is consistently above 93% at all layers and ranks\. Overlap increases slightly with depth \(0\.37% at layer 5; 6\.3% at layer 38\), suggesting the adapter’s contributions gradually align with existing base features at deeper layers where semantic representations are concentrated\. Feature density increases monotonically with rank at deep layers, withr=32r=32activating41\.6641\.66features/token at layer 38 vs\.30\.2830\.28forr=4r=4\.

### 5\.5Dictionary Similarity: Cosine Analysis

Table[9](https://arxiv.org/html/2605.28896#S5.T9)summarises cosine similarity statistics between delta SAE and Gemma Scope decoder directions\.

Table 9:Cosine Similarity Between Delta SAE and Gemma Scope Feature Directions\. Statistics computed over all 16,384 delta features per condition\.![Refer to caption](https://arxiv.org/html/2605.28896v1/paper_figures/fig4_cosine_similarity.png)Figure 6:Cosine SimilarityThe mean maximum cosine similarity of≈0\.071\\approx 0\.071is slightly above the expected value for random unit vectors in 3,584 dimensions \(≈0\\approx 0\), indicating weak but non\-zero alignment\. Only 0\.01–0\.02% of delta features show strong alignment \(\>0\.7\>0\.7\) with any base model feature\.

### 5\.6Principal Angle Analysis

Table[10](https://arxiv.org/html/2605.28896#S5.T10)reports principal angle statistics between the top\-256 subspace of delta SAE and Gemma Scope decoder matrices\.

Table 10:Principal Angles Between Delta SAE and Gemma Scope Subspaces \(k=256k=256principal directions\)\. Near\-orthogonal:θ\>70°\\theta\>70\\textdegree; Aligned:θ<20°\\theta<20\\textdegree\.Principal angles are consistently near74°74\\textdegreeacross all ranks and layers, with zero aligned directions \(0% below20°20\\textdegree\) in all conditions\. This provides rigorous geometric confirmation that the delta SAE subspace is substantially separated from the Gemma Scope subspace, going beyond the per\-feature cosine similarity analysis\. The consistency across ranks \(range:72\.39°72\.39\\textdegree–74\.99°74\.99\\textdegree\) suggests the geometric separation is a structural property of LoRA adaptation rather than a rank\-specific artifact\.

### 5\.7CKA Representational Similarity

Table[11](https://arxiv.org/html/2605.28896#S5.T11)reports linear CKA between𝐡base\\mathbf\{h\}\_\{\\text\{base\}\}and𝐡Δ\\mathbf\{h\}\_\{\\Delta\}across layers and ranks\.

Table 11:Linear CKA Betweenhbaseh\_\{\\text\{base\}\}andhΔh\_\{\\Delta\}\. Lower values indicate greater representational divergence\.CKA\(hbase,hadapted\)\\text\{CKA\}\(h\_\{\\text\{base\}\},h\_\{\\text\{adapted\}\}\)included as reference\.The CKA analysis reveals a non\-trivial layer\-wise pattern\. The adapter’s contribution is most representationally distinct from the base model at middle layers \(layer 18: CKA≈0\.05\\approx 0\.05–0\.080\.08\), precisely where semantic processing is concentrated\. Early layers \(5, 10\) show moderate CKA \(≈0\.28\\approx 0\.28–0\.360\.36\), and deep layers \(32, 38\) show higher CKA \(≈0\.55\\approx 0\.55–0\.690\.69\) as adapter contributions gradually realign with base representations\.

A clear rank effect emerges at layers 22 and 32: higher rank produces lower CKA\(hb,hΔ\)\(h\_\{b\},h\_\{\\Delta\}\), indicating more representationally distinct delta contributions at middle layers\. At layer 22:r=4→0\.194r=4\\to 0\.194,r=32→0\.093r=32\\to 0\.093\. Critically, CKA\(hb,ha\)\(h\_\{b\},h\_\{a\}\)remains above 0\.93 at all conditions, confirming the adapted model is overwhelmingly similar to the base model overall — the delta is a small but geometrically distinct perturbation\.

## 6Ablation Study

### 6\.1L1 Coefficient Selection

Table[12](https://arxiv.org/html/2605.28896#S6.T12)reports the effect of L1 coefficient on delta SAEL0L\_\{0\}and reconstruction quality \(layer 5, rank 4\)\.

Table 12:L1 Coefficient Ablation for Delta SAE Training \(Layer 5, Rank 4, 3 epochs\)\. TargetL0L\_\{0\}: 20–50\.The need for extensive L1 tuning \(spanning two orders of magnitude\) reflects the non\-standard statistical properties of delta activations relative to full residual stream activations\. RMS normalisation was critical: without it, the MSE term dominates and L1 penalties in the standard range \(10−410^\{\-4\}–10−210^\{\-2\}\) are insufficient to enforce sparsity\.

### 6\.2Effect of Epochs on Reconstruction

We observe that reconstruction error on held\-out data stabilises after approximately 5–7 epochs for most layer/rank combinations\. The remaining error \(∼\\sim0\.34 at layer 5;∼\\sim0\.61 at layer 38\) appears to be an irreducible component, potentially reflecting a genuinely distributed aspect of the adapter’s activation contributions that resists sparse decomposition\.

### 6\.3Subspace Dimension Sensitivity

Principal angle results are robust to the choice of subspace dimensionkk\. Usingk∈\{64,128,256,512\}k\\in\\\{64,128,256,512\\\}yields mean principal angles within±2°\\pm 2\\textdegreeof the reported values, confirming that the subspace separation is not an artifact of the specifickkchosen\.

## 7Discussion

### 7\.1Convergent Evidence for Geometric Separation

Three independent analyses consistently indicate that LoRA adapter feature representations are geometrically separated from base model representations:

1. 1\.Cosine similarity: Mean max similarity≈0\.071\\approx 0\.071across all 268M comparisons, barely above random \(≈0\\approx 0\)\.
2. 2\.Principal angles: Mean≈74°\\approx 74\\textdegreewith 0% aligned directions — rigorous subspace\-level evidence\.
3. 3\.CKA: Minimum CKA\(hb,hΔ\)≈0\.05\(h\_\{b\},h\_\{\\Delta\}\)\\approx 0\.05at layer 18, the key semantic processing layer\.

The convergence of three different measures reduces the risk of any single metric being misleading\. Cosine similarity could in principle reflect SAE training variance; principal angles address this by operating directly on the full decoder matrices without SAE decomposition; CKA validates at the activation level rather than the weight level\.

### 7\.2The Monitoring Gap

Our findings imply a concrete safety concern:interpretability tools trained on base model activations may be systematically blind to adapter contributions\. If an organisation deploys a safety\-fine\-tuned LoRA model and uses Gemma Scope\-based analysis to audit it, the SAE’s reconstruction failure \(ε\>1\.0\\varepsilon\>1\.0\) means the tool is not capturing the adapter’s representational contributions\. These results suggest that interpretability tools trained solely on base\-model activations may incompletely capture adapter\-induced representations\.

The delta SAE framework we introduce provides a direct solution: training adapter\-specific SAEs enables meaningful feature\-level auditing of fine\-tuned models\.

### 7\.3Rank Effects and Representational Capacity

Rank affects feature density \(monotonically increasing at deep layers\) and CKA distance \(higher rank→\\tolower CKA at semantic layers\) but does not affect the fundamental geometric novelty \(∼\\sim74° principal angles across all ranks\)\. This suggests that rank controlshow manyweakly aligned features the adapter activates, but not thenatureof those features — all ranks produce geometrically separated representations\.

### 7\.4Limitations

Single seed: All adapters trained with seed 42; seed stability analysis is deferred to future work\.

Single dataset: Results are based on Alpaca instruction tuning; generalisation to other fine\-tuning objectives \(e\.g\., domain adaptation, RLHF\) requires further study\.

Undertrained adapters: Despite loss reduction from∼\\sim2\.0 to∼\\sim1\.0, the adapters produce outputs identical to the base model on test prompts, suggesting the base model’s strong priors dominate\. This precludes causal validation via activation steering and is the primary limitation of the current work\.

SAE training scale: Delta SAEs are trained on∼\\sim1M token vectors\. Larger\-scale training may improve reconstruction quality and alter the feature vocabulary\.

No baseline comparison: We lack a base\-base SAE similarity baseline \(two independently trained SAEs on base model activations\), which would allow direct comparison to assess whether 0\.071 cosine similarity is genuinely low or attributable to SAE training variance\. This is a priority for future work\.

## 8Related Work

LoRA and PEFT:Hu et al\. \([2022](https://arxiv.org/html/2605.28896#bib.bib12)\)introduced LoRA; subsequent work explored variants\(Dettmers et al\.,[2024](https://arxiv.org/html/2605.28896#bib.bib7); Zhao et al\.,[2024](https://arxiv.org/html/2605.28896#bib.bib19)\)\.Hu et al\. \([2023](https://arxiv.org/html/2605.28896#bib.bib13)\)studied behavioural effects of fine\-tuning, andBiderman et al\. \([2024](https://arxiv.org/html/2605.28896#bib.bib3)\)studied forgetting dynamics\. None have studied fine\-tuning effects at the feature level\.

Mechanistic interpretability: SAEs have emerged as a key tool for decomposing superposed representations\(Bricken et al\.,[2023](https://arxiv.org/html/2605.28896#bib.bib5); Templeton et al\.,[2024](https://arxiv.org/html/2605.28896#bib.bib17)\)\.Cunningham et al\. \([2023](https://arxiv.org/html/2605.28896#bib.bib6)\)trained SAEs on GPT\-2;Gemma Scope \([2024](https://arxiv.org/html/2605.28896#bib.bib11)\)released comprehensive SAE dictionaries for Gemma\-2\. All prior work focuses on base models\.

Fine\-tuning and safety:Qi et al\. \([2023](https://arxiv.org/html/2605.28896#bib.bib15)\)andYang et al\. \([2023](https://arxiv.org/html/2605.28896#bib.bib18)\)demonstrated that safety fine\-tuning is fragile to subsequent fine\-tuning\.Evans et al\. \([2025](https://arxiv.org/html/2605.28896#bib.bib9)\)showed that narrow task fine\-tuning on insecure code causes broad emergent misalignment\. Our work provides a mechanistic account at the feature level of why fine\-tuning may escape base model safety constraints\.

Representation similarity: CKA\(Kornblith et al\.,[2019](https://arxiv.org/html/2605.28896#bib.bib14)\)and SVCCA\(Raghu et al\.,[2017](https://arxiv.org/html/2605.28896#bib.bib16)\)have been used to compare model representations across architectures and training runs; we apply these to compare adapter and base model representations within a single model\.

## 9Conclusion

We presented a systematic mechanistic interpretability analysis of LoRA adapter feature geometry using Sparse Autoencoders\. Our delta SAE framework — training SAEs on adapter\-induced activation deltas𝐡Δ=𝐡adapted−𝐡base\\mathbf\{h\}\_\{\\Delta\}=\\mathbf\{h\}\_\{\\text\{adapted\}\}\-\\mathbf\{h\}\_\{\\text\{base\}\}— provides a mechanistically clean decomposition of adapter contributions\.

Three convergent analyses demonstrate that LoRA adapters operate in a feature subspace geometrically separated from the base model: cosine similarity near random \(0\.071\), principal angles consistently near 74°, and CKA\(hb,hΔ\)≈0\.05\(h\_\{b\},h\_\{\\Delta\}\)\\approx 0\.05–0\.350\.35depending on layer\. Adapter\-specific SAEs achieve 46–86% lower reconstruction error than base model SAEs on held\-out data, confirming the geometric separation reflects genuine learned structure\. Feature density scales with rank while geometric novelty remains rank\-invariant\.

These findings identify a monitoring gap in LoRA\-based safety alignment: base\-model interpretability tools may be systematically blind to adapter\-encoded representations\. The delta SAE framework provides a practical tool for feature\-level auditing of fine\-tuned models\.

Future work will pursue causal validation using strongly\-differentiated models \(gemma\-2\-9b\-it\), extend to misalignment detection in the style ofEvans et al\. \([2025](https://arxiv.org/html/2605.28896#bib.bib9)\), and establish base\-base SAE similarity baselines to rigorously bound the observed geometric separation\.

![Refer to caption](https://arxiv.org/html/2605.28896v1/paper_figures/fig8_combined_summary.png)Figure 7:Summary of key findings across all ranks and layers\. \(A\) Delta norm heatmap showing amplification with depth\. \(B\) Reconstruction improvement of delta SAE over Gemma Scope\. \(C\) Feature density scaling with rank and depth\. \(D\) Weakly aligned features fraction — consistently above 93% across all conditions\.
## Acknowledgements

The author thanks the Gemma Scope team at Google DeepMind for releasing comprehensive open SAE dictionaries, and the SAELens and TransformerLens teams for open\-source tooling\. Experiments were conducted on Apple M4 Studio\.

## References

- Ainslie et al\. \[2023\]Ainslie, J\., Lee\-Thorp, J\., de Jong, M\., Zelaski, T\., Sanghai, S\., & Xu, Y\. \(2023\)\. GQA: Training generalised multi\-query transformer models from multi\-head checkpoints\.Proceedings of EMNLP\.
- Alpaca \[2023\]Taori, R\., Gulrajani, I\., Zhang, T\., Dubois, Y\., Li, X\., Guestrin, C\., Liang, P\., & Hashimoto, T\. B\. \(2023\)\. Stanford Alpaca: An Instruction\-following LLaMA model\.[https://github\.com/tatsu\-lab/stanford\_alpaca](https://github.com/tatsu-lab/stanford_alpaca)\.
- Biderman et al\. \[2024\]Biderman, D\., Ortiz, J\. G\., Portes, J\., Paul, M\., Greengard, P\., Jennings, C\., & Frankle, J\. \(2024\)\. LoRA learns less and forgets less\.Transactions on Machine Learning Research\.
- Björck & Golub \[1973\]Björck, Å\., & Golub, G\. H\. \(1973\)\. Numerical methods for computing angles between linear subspaces\.Mathematics of Computation, 27\(123\), 579–594\.
- Bricken et al\. \[2023\]Bricken, T\., Templeton, A\., Batson, J\., Chen, B\., Jermyn, A\., Conerly, T\., …& Henighan, T\. \(2023\)\. Towards monosemanticity: Decomposing language models with dictionary learning\.Anthropic Transformer Circuits Thread\.[https://transformer\-circuits\.pub/2023/monosemantic\-features](https://transformer-circuits.pub/2023/monosemantic-features)\.
- Cunningham et al\. \[2023\]Cunningham, H\., Ewart, A\., Riggs, L\., Huben, R\., & Sharkey, L\. \(2023\)\. Sparse autoencoders find highly interpretable features in language models\.arXiv preprint arXiv:2309\.08600\.
- Dettmers et al\. \[2024\]Dettmers, T\., Pagnoni, A\., Holtzman, A\., & Zettlemoyer, L\. \(2024\)\. QLoRA: Efficient finetuning of quantized LLMs\.Advances in Neural Information Processing Systems\.
- Elhage et al\. \[2022\]Elhage, N\., Hume, T\., Olsson, C\., Schiefer, N\., Henighan, T\., Kravec, S\., …& Olah, C\. \(2022\)\. Toy models of superposition\.Transformer Circuits Thread\.[https://transformer\-circuits\.pub/2022/toy\_model](https://transformer-circuits.pub/2022/toy_model)\.
- Evans et al\. \[2025\]Evans, O\., Cotton\-Barratt, O\., Finnveden, L\., Balesni, M\., Balwit, A\., Hurst, A\., …& Saunders, W\. \(2025\)\. Emergent misalignment: Narrow finetuning can produce broadly misaligned LLMs\.Nature, 642, 1051–1058\.
- Gemma Team \[2024\]Gemma Team\. \(2024\)\. Gemma 2: Improving open language models at a practical size\.arXiv preprint arXiv:2408\.00118\.
- Gemma Scope \[2024\]Lieberum, T\., Dunefsky, J\., Bloom, J\., Bailey, N\., Cunningham, H\., …& Nanda, N\. \(2024\)\. Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2\.arXiv preprint arXiv:2408\.05147\.
- Hu et al\. \[2022\]Hu, E\. J\., Shen, Y\., Wallis, P\., Allen\-Zhu, Z\., Li, Y\., Wang, S\., Wang, L\., & Chen, W\. \(2022\)\. LoRA: Low\-rank adaptation of large language models\.International Conference on Learning Representations\.
- Hu et al\. \[2023\]Hu, S\., Tu, Y\., Han, X\., He, C\., Cui, G\., Long, D\., …& Sun, M\. \(2023\)\. LLM\-Adapters: An adapter family for parameter\-efficient fine\-tuning of large language models\.arXiv preprint arXiv:2304\.01933\.
- Kornblith et al\. \[2019\]Kornblith, S\., Norouzi, M\., Lee, H\., & Hinton, G\. \(2019\)\. Similarity of neural network representations revisited\.International Conference on Machine Learning\.
- Qi et al\. \[2023\]Qi, X\., Zeng, Y\., Xie, T\., Chen, P\.\-Y\., Jia, R\., Mittal, P\., & Henderson, P\. \(2023\)\. Fine\-tuning aligned language models compromises safety, even when users do not intend to\.arXiv preprint arXiv:2310\.03693\.
- Raghu et al\. \[2017\]Raghu, M\., Gilmer, J\., Yosinski, J\., & Sohl\-Dickstein, J\. \(2017\)\. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability\.Advances in Neural Information Processing Systems\.
- Templeton et al\. \[2024\]Templeton, A\., Conerly, T\., Marcus, J\., Lindsey, J\., Bricken, T\., Chen, B\., …& Henighan, T\. \(2024\)\. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet\.Anthropic Transformer Circuits Thread\.[https://transformer\-circuits\.pub/2024/scaling\-monosemanticity](https://transformer-circuits.pub/2024/scaling-monosemanticity)\.
- Yang et al\. \[2023\]Yang, X\., Wang, X\., Zhang, Q\., Petzold, L\., Wang, W\. Y\., Zhao, X\., & Lin, D\. \(2023\)\. Shadow alignment: The ease of subverting safely\-aligned language models\.arXiv preprint arXiv:2310\.02949\.
- Zhao et al\. \[2024\]Zhao, J\., Zhang, Z\., Chen, B\., Wang, Z\., Anandkumar, A\., & Tian, Y\. \(2024\)\. GaLore: Memory\-efficient LLM training by gradient low\-rank projection\.International Conference on Machine Learning\.
Feature Geometry of LoRA Adapters: A Sparse Autoencoder Analysis of Representational Divergence in Fine-Tuned Language Models

Similar Articles

RDP LoRA: Geometry-Driven Identification for Parameter-Efficient Adaptation in Large Language Models

PermDoRA -- Understanding Adapter Interference in Language Models: Limits of Parameter-Space Geometry

A Geometric View for Understanding Concept Learning and Neuron Interpretation in Sparse Autoencoders

How Quantization Changes Interpretable Features: A Sparse Autoencoder Analysis of Language Models

Video2LoRA: Parametric Video Internalization for Vision-Language Models

Submit Feedback

Similar Articles

RDP LoRA: Geometry-Driven Identification for Parameter-Efficient Adaptation in Large Language Models
PermDoRA -- Understanding Adapter Interference in Language Models: Limits of Parameter-Space Geometry
A Geometric View for Understanding Concept Learning and Neuron Interpretation in Sparse Autoencoders
How Quantization Changes Interpretable Features: A Sparse Autoencoder Analysis of Language Models
Video2LoRA: Parametric Video Internalization for Vision-Language Models