Robust LLM Unlearning Against Relearning Attacks: The Minor Components in Representations Matter
Summary
This paper introduces Minor Component Unlearning (MCU), a novel approach to LLM unlearning that targets minor components in representations to resist relearning attacks. It addresses the vulnerability of existing methods by focusing on robust directions within the model's spectral structure.
# Robust LLM Unlearning Against Relearning Attacks: The Minor Components in Representations Matter
Source: [https://arxiv.org/html/2605.11685](https://arxiv.org/html/2605.11685)
Zeguan Xiao¹, Xuanzhe Xu², Yun Chen¹, Yong Wang³, Jian Yang⁴, Yanqing Hu², Guanhua Chen²
¹Shanghai University of Finance and Economics, ²Southern University of Science and Technology, ³Alibaba Group, ⁴Beihang University
###### Abstract
Large language model (LLM) unlearning aims to remove specific data influences from a pre-trained model without costly retraining, addressing privacy, copyright, and safety concerns. However, recent studies reveal a critical vulnerability: unlearned models rapidly recover “forgotten” knowledge through relearning attacks. This fragility raises serious security concerns, especially for open-weight models. In this work, we investigate the fundamental mechanism underlying this fragility from a representation geometry perspective. We discover that existing unlearning methods predominantly optimize along dominant components, leaving minor components largely unchanged. Critically, during relearning attacks, the modifications in these dominant components are easily reversed, enabling rapid knowledge recovery, whereas minor components exhibit stronger resistance to such reversal. We further provide a theoretical analysis that explains both observations from the spectral structure of representations. Building on this insight, we propose Minor Component Unlearning (MCU), a novel unlearning approach that explicitly targets minor components in representations. By concentrating unlearning effects in these inherently robust directions, our method achieves substantially improved resistance to relearning attacks. Extensive experiments on three datasets validate our approach, demonstrating significant improvements over state-of-the-art methods, including sharpness-aware minimization.
## 1 Introduction
The rapid advancement of large language models (LLMs) has led to remarkable progress in various domains, from creative writing to code generation (Grattafiori et al., [2024](https://arxiv.org/html/2605.11685#bib.bib15)). Meanwhile, open-weight models are being released at an increasing rate, with their capabilities lagging only six to twelve months behind closed-weight frontier models (Bhandari et al., [2025](https://arxiv.org/html/2605.11685#bib.bib20); Maslej et al., [2024](https://arxiv.org/html/2605.11685#bib.bib21)). However, both open and closed models raise serious concerns about privacy violations, copyright infringement, and safety risks (Liu et al., [2025](https://arxiv.org/html/2605.11685#bib.bib10); Casper et al., [2025](https://arxiv.org/html/2605.11685#bib.bib17)). When undesirable data influences are discovered post-deployment, retraining these massive models from scratch is often prohibitively expensive. This motivates the development of LLM unlearning, a post-training strategy that aims to remove specific data influences and suppress associated model capabilities without the need for complete retraining (Jang et al., [2023](https://arxiv.org/html/2605.11685#bib.bib9); Liu et al., [2025](https://arxiv.org/html/2605.11685#bib.bib10); Maini et al., [2024](https://arxiv.org/html/2605.11685#bib.bib4)).
Despite the growing importance of LLM unlearning, several recent studies have identified a critical issue: current unlearning methods lack robustness (Łucki et al., [2025](https://arxiv.org/html/2605.11685#bib.bib25); Lynch et al., [2024](https://arxiv.org/html/2605.11685#bib.bib26); Hu et al., [2025](https://arxiv.org/html/2605.11685#bib.bib22); Deeb and Roger, [2024](https://arxiv.org/html/2605.11685#bib.bib1)). Specifically, unlearned models are surprisingly susceptible to quickly recovering “forgotten” knowledge through relearning attacks (Lynch et al., [2024](https://arxiv.org/html/2605.11685#bib.bib26); Hu et al., [2025](https://arxiv.org/html/2605.11685#bib.bib22)). Even more concerning, fine-tuning on benign, unrelated downstream tasks can inadvertently undo the unlearning effects (Fan et al., [2025](https://arxiv.org/html/2605.11685#bib.bib51)). For open-weight models, this lack of robustness poses severe security challenges: any downstream actor can easily reverse unlearning through minimal fine-tuning, undermining the intended protections (Casper et al., [2025](https://arxiv.org/html/2605.11685#bib.bib17); Rosati et al., [2024](https://arxiv.org/html/2605.11685#bib.bib24)). Recent rigorous evaluations have revealed that state-of-the-art unlearning methods achieve recovery rates exceeding 88% after relearning attacks, demonstrating that they fail to truly remove knowledge from model weights (Deeb and Roger, [2024](https://arxiv.org/html/2605.11685#bib.bib1)).
Figure 1: Left: Retraining-on-$T$ (RTT) attack evaluation: the forget set is split into $T$ and $V$; after unlearning on $T\cup V$, the attacker fine-tunes on $T$ and measures recovery on $V$. Middle: Naive methods and SAM separate forget/retain representations mainly along dominant components (DC), which relearning easily reverses; MCU additionally separates them along minor components (MC), whose changes are largely preserved post-attack. Right: On WMDP-Cyber, MCU yields markedly lower post-attack accuracy while maintaining utility.
Although existing works have proposed various techniques to improve unlearning robustness, such as sharpness-aware minimization (SAM) (Fan et al., [2025](https://arxiv.org/html/2605.11685#bib.bib51)) for smooth optimization and representation-level interventions (Li et al., [2024](https://arxiv.org/html/2605.11685#bib.bib5); Sondej and Yang, [2025](https://arxiv.org/html/2605.11685#bib.bib8)), these methods remain largely empirical, and the fundamental mechanism underlying the susceptibility of LLM unlearning to relearning attacks remains poorly understood. We thus ask:
(Q) Why is LLM unlearning so fragile against relearning attacks?
To address (Q), we conduct a principled analysis of LLM unlearning through the lens of representation geometry. We discover that existing unlearning methods predominantly optimize along the dominant component directions, leaving minor components largely unchanged. Critically, when relearning attacks are applied, the modifications in these dominant components are easily reversed, with recovery rates significantly exceeding those of minor components, which explains why current methods are so vulnerable to such attacks. We further give a theoretical analysis that derives both phenomena from the spectral structure of representations, identifying the structural source of fragility.
Inspired by these findings, we propose Minor Component Unlearning (MCU), a novel unlearning approach that explicitly targets the minor components of internal representations. By leveraging the observation that minor components are inherently more resistant to recovery during relearning, our method achieves substantially improved robustness against relearning attacks while maintaining model utility on unrelated tasks. We summarize our contributions below. Our code is publicly available at [https://github.com/sustech-nlp/MCU](https://github.com/sustech-nlp/MCU).
- We provide the first systematic analysis of LLM unlearning robustness from a representation geometry perspective, supported by a theoretical analysis. We identify a key mechanism underlying unlearning fragility: dominant components modified during unlearning are easily recovered by relearning attacks, whereas minor components exhibit significantly stronger resistance to recovery.
- Building on these insights, we propose MCU, a novel unlearning method that explicitly targets minor components of representations.
- We conduct extensive experiments on the WMDP-Cyber, WMDP-Bio, and Years datasets, demonstrating that our method significantly reduces knowledge recovery after relearning attacks while preserving model utility, outperforming existing methods. Highlights on the WMDP-Cyber dataset are shown in [Figure 1](https://arxiv.org/html/2605.11685#S1.F1).
## 2 Preliminaries of LLM Unlearning
#### Problem Definition\.
LLM unlearning aims to erase or suppress undesirable knowledge within a pre-trained LLM while preserving its general performance (Liu et al., [2025](https://arxiv.org/html/2605.11685#bib.bib10)). Formally, given a pre-trained LLM with parameters $\bm{\theta}_{\mathrm{o}}$ and a dataset partitioned into a forget set $\mathcal{D}_{\mathrm{f}}=\{(\mathbf{x}_i,y_i)\}_{i=1}^{n_f}$ containing data to be unlearned and a retain set $\mathcal{D}_{\mathrm{r}}=\{(\mathbf{x}_j,y_j)\}_{j=1}^{n_r}$ containing data the model should still remember, unlearning seeks to obtain updated parameters $\bm{\theta}_{\mathrm{u}}$ such that the model “forgets” information in $\mathcal{D}_{\mathrm{f}}$ while maintaining performance on $\mathcal{D}_{\mathrm{r}}$. An ideal unlearning method should ensure that the mutual information between the unlearned model weights and the forget set approaches zero, meaning the removed knowledge is truly erased from the model rather than merely hidden (Deeb and Roger, [2024](https://arxiv.org/html/2605.11685#bib.bib1)).
#### Unlearning Methods\.
Let $\pi_{\bm{\theta}}(x)$ denote the probability of text $x$ under model parameters $\bm{\theta}$. Gradient Ascent (GA) (Jang et al., [2023](https://arxiv.org/html/2605.11685#bib.bib9)) maximizes the cross-entropy loss on the forget set, while Negative Preference Optimization (NPO) (Zhang et al., [2024](https://arxiv.org/html/2605.11685#bib.bib11)) adapts DPO (Rafailov et al., [2023](https://arxiv.org/html/2605.11685#bib.bib7)) by treating $\mathcal{D}_{\mathrm{f}}$ as dispreferred responses:
$$\mathcal{L}_{\text{GA}}=\underset{x\in\mathcal{D}_{\mathrm{f}}}{\mathbb{E}}\left[\log\pi_{\bm{\theta}}(x)\right], \tag{1}$$
$$\mathcal{L}_{\text{NPO}}=-\frac{2}{\beta}\,\underset{x\in\mathcal{D}_{\mathrm{f}}}{\mathbb{E}}\left[\log\sigma\left(-\beta\log\frac{\pi_{\bm{\theta}}(x)}{\pi_{\text{ref}}(x)}\right)\right], \tag{2}$$
where $\pi_{\text{ref}}=\pi_{\bm{\theta}_{\mathrm{o}}}$ and $\beta$ is a temperature parameter. Representation Misdirection for Unlearning (RMU) (Li et al., [2024](https://arxiv.org/html/2605.11685#bib.bib5)) perturbs internal hidden states toward a random control vector, while MLP Breaking (Sondej and Yang, [2025](https://arxiv.org/html/2605.11685#bib.bib8)) drives MLP outputs toward orthogonality with their originals (motivated by factual knowledge being stored in MLP parameters (Nanda et al., [2023](https://arxiv.org/html/2605.11685#bib.bib13))):
$$\mathcal{L}_{\text{RMU}}=\underset{x\in\mathcal{D}_{\mathrm{f}}}{\mathbb{E}}\left[\sum_{t\in x}\|\mathbf{h}(t)-c\cdot\mathbf{u}\|^{2}\right], \tag{3}$$
$$\mathcal{L}_{\text{MLP Breaking}}=\underset{x\in\mathcal{D}_{\mathrm{f}}}{\mathbb{E}}\left[\sum_{t\in x}\text{ReLU}\left(\frac{\langle\mathbf{h}(t),\mathbf{h}_{\text{o}}(t)\rangle}{\|\mathbf{h}_{\text{o}}(t)\|^{2}}\right)\right], \tag{4}$$
where $\mathbf{h}(t)$ is the current internal representation of token $t$ (hidden state for RMU, MLP output for MLP Breaking), $\mathbf{h}_{\text{o}}(t)$ is its value under $\bm{\theta}_{\mathrm{o}}$, $c$ is a scaling hyperparameter, and $\mathbf{u}$ is a random control vector.
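As a rough illustration of the two representation-level losses in Equations (3) and (4), the following NumPy sketch evaluates them for a single sequence. The function names, the toy shapes, and the value of `c` are assumptions made for this example; it is not the authors' implementation, and the expectation over the forget set is omitted.

```python
import numpy as np

def rmu_loss(h, u, c):
    """Eq. (3) for one sequence: sum over tokens of ||h(t) - c*u||^2."""
    return float(np.sum((h - c * u) ** 2))

def mlp_breaking_loss(h, h_o):
    """Eq. (4) for one sequence: ReLU of <h(t), h_o(t)> / ||h_o(t)||^2, summed over tokens."""
    coeff = np.sum(h * h_o, axis=1) / np.sum(h_o ** 2, axis=1)
    return float(np.sum(np.maximum(coeff, 0.0)))

rng = np.random.default_rng(0)
h_o = rng.normal(size=(5, 16))   # toy "original" activations: 5 tokens, d = 16 (assumption)
u = rng.uniform(size=16)         # toy random control vector
c = 3.0                          # illustrative scale, not a tuned value

print(mlp_breaking_loss(h_o, h_o))             # unchanged activations: one unit of loss per token
print(mlp_breaking_loss(-h_o, h_o))            # anti-aligned activations: the ReLU clips the loss
print(rmu_loss(np.tile(c * u, (5, 1)), u, c))  # activations already at the control target
```

The ReLU in Equation (4) means the loss stops pushing once an output is orthogonal (or anti-aligned) to its original value, whereas the RMU loss keeps pulling every token toward the same target $c\cdot\mathbf{u}$.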
## 3 Understanding Fragile LLM Unlearning
In this section, we investigate how unlearning and relearning affect an LLM’s internal representations, and identify a structural cause of fragility: unlearning predominantly modifies the dominant (high-variance) directions of internal representations, which are widely shared across samples and therefore easily reversed by relearning attacks. [Section 3.1](https://arxiv.org/html/2605.11685#S3.SS1) establishes this empirically through three observations, and [Section 3.2](https://arxiv.org/html/2605.11685#S3.SS2) explains Observations 2 and 3 theoretically.
### 3.1 Empirical Observations on Representation Geometry
#### Setup\.
We extract MLP activations across all layers of Llama-3.1-8B on the forget set $\mathcal{D}_{\mathrm{f}}$ and apply PCA, yielding principal components $\{\mathbf{v}_1,\ldots,\mathbf{v}_d\}$ ordered by decreasing variance $\sigma_1^2\geq\cdots\geq\sigma_d^2$. We then track how representations move along each PC during unlearning and relearning. Implementation details (modules used, sample sizes, layer aggregation) are deferred to Appendix [E](https://arxiv.org/html/2605.11685#A5).
Figure 2: Principal component analysis of LLM representations during unlearning and relearning. (a) Explained variance: the first few dominant components capture the majority of variance in representations. (b) Unlearning change: unlearning predominantly modifies these dominant components, leaving minor components unchanged. (c) Relearning recovery: relearning attacks preferentially recover the dominant components, making unlearning effects along these directions easily reversible.
#### Observation 1: LLM Representations are Concentrated in Dominant Components\.
[Figure 2(a)](https://arxiv.org/html/2605.11685#S3.F2.sf1) shows the explained-variance ratio across principal components: the first few dominant components capture the overwhelming majority of the total variance, while the minor components form a long tail of small but non-negligible contributions.
To quantify how unlearning and relearning affect each direction, we define two metrics for each principal component $\mathbf{v}_k$:
$$\text{Change Ratio}_k=\frac{|\langle\mathbf{h}_{\text{u}}-\mathbf{h}_{\text{o}},\mathbf{v}_k\rangle|}{\sum_{j=1}^{d}|\langle\mathbf{h}_{\text{u}}-\mathbf{h}_{\text{o}},\mathbf{v}_j\rangle|}, \tag{5}$$
$$\text{Recovery Ratio}_k=\frac{\langle\mathbf{h}_{\text{u}}-\mathbf{h}_{\text{r}},\mathbf{v}_k\rangle}{\langle\mathbf{h}_{\text{u}}-\mathbf{h}_{\text{o}},\mathbf{v}_k\rangle}, \tag{6}$$
where $\mathbf{h}_{\text{o}}$, $\mathbf{h}_{\text{u}}$, and $\mathbf{h}_{\text{r}}$ denote representations before unlearning, after unlearning, and after a relearning attack, respectively; a Recovery Ratio near $1$ indicates full reversal and near $0$ indicates robust unlearning.
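The two metrics in Equations (5) and (6) can be computed directly from the three representations and the PCA basis. The sketch below is a minimal NumPy illustration for a single representation vector; the function names and the hand-picked toy vectors are assumptions for this example, and the per-layer and per-sample aggregation described in the appendix is not reproduced.

```python
import numpy as np

def change_ratio(h_o, h_u, V):
    """Eq. (5): share of the absolute unlearning displacement along each PC (columns of V)."""
    disp = np.abs((h_u - h_o) @ V)
    return disp / disp.sum()

def recovery_ratio(h_o, h_u, h_r, V):
    """Eq. (6): fraction of the unlearning displacement undone by relearning, per PC."""
    return ((h_u - h_r) @ V) / ((h_u - h_o) @ V)

V = np.eye(3)                      # toy basis: the PCs are the coordinate axes
h_o = np.zeros(3)                  # before unlearning
h_u = np.array([4.0, 1.0, 1.0])    # after unlearning: most change along PC 1
h_r = np.array([0.4, 0.9, 1.0])    # after relearning: PC 1 almost fully restored

cr = change_ratio(h_o, h_u, V)
rr = recovery_ratio(h_o, h_u, h_r, V)
print(cr, rr)
```

In this toy case the first direction carries two thirds of the change and is 90% recovered, while the last direction is not recovered at all, which is the qualitative pattern the observations below describe.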
#### Observation 2: Unlearning Predominantly Modifies Dominant Components\.
[Figure 2(b)](https://arxiv.org/html/2605.11685#S3.F2.sf2) reports the Change Ratio ([5](https://arxiv.org/html/2605.11685#S3.E5)) after applying GA: unlearning induces disproportionately large changes along the leading PCs, while minor components remain largely unchanged. The same pattern holds across unlearning losses (Appendix [I](https://arxiv.org/html/2605.11685#A9)).
#### Observation 3: Dominant Components are More Easily Recovered During Relearning\.
The concentration above would be benign if the changes were robust. [Figure 2(c)](https://arxiv.org/html/2605.11685#S3.F2.sf3) reports the Recovery Ratio ([6](https://arxiv.org/html/2605.11685#S3.E6)): dominant components attain substantially higher recovery ratios (often $>90\%$) than minor components, with the same pattern across unlearning losses.
### 3.2 Theoretical Analysis of Unlearning Fragility
The empirical patterns are not coincidental: they follow from the spectral structure of forget-set representations and the gradient geometry of unlearning/relearning losses. We formalize this through a linearized (NTK-style) analysis that yields the two theorems below; full derivations and assumptions are deferred to Appendix [D](https://arxiv.org/html/2605.11685#A4).
###### Theorem 1 (Dominant-component concentration; explains Observation 2).
After $T$ unlearning steps,
$$\mathbb{E}_{\mathcal{D}_f}\!\big[\langle\mathbf{h}_{\text{u}}-\mathbf{h}_{\text{o}},\mathbf{v}_k\rangle^{2}\big]\;\propto\;\sigma_k^{2}\;+\;O(\tau^{2}), \tag{7}$$
with $\tau^{2}\ll\sigma_1^{2}$ a small noise term. The change ratio ([5](https://arxiv.org/html/2605.11685#S3.E5)) therefore mirrors the explained-variance profile of [Figure 2(a)](https://arxiv.org/html/2605.11685#S3.F2.sf1): the unlearning update is channeled through directions that account for most of the representation variance.
###### Theorem 2 (Dominant-component recoverability; explains Observation 3).
Because the relearning distribution $\mathcal{D}_r$ shares its dominant eigenstructure with $\bm{\Sigma}$ (standard threat model), applying the same NTK linearization to the relearning objective gives, for some $c>0$ and $T_r$ relearning steps,
$$\mathrm{Recovery\ Ratio}_k\;\approx\;1-\exp\!\big(-c\,\sigma_k^{2}\,T_r\big). \tag{8}$$
Dominant components saturate within a few attack steps, while minor components require $O(1/\sigma_k^{2})$ steps and remain effectively unrecovered under any bounded budget. A complementary argument (Appendix [D.5](https://arxiv.org/html/2605.11685#A4.SS5)) further shows that, even as $T_r\to\infty$, the relearning gradients along minor components average out across the batch, so these directions cannot be reliably reconstructed by the attacker.
Together, [Theorems 1](https://arxiv.org/html/2605.11685#Thmtheorem1) and [2](https://arxiv.org/html/2605.11685#Thmtheorem2) show that the unlearning effect is concentrated in exactly the directions the attacker can recover most cheaply, which is a structural source of fragility. Redirecting unlearning into the minor-component subspace inverts the scaling ([8](https://arxiv.org/html/2605.11685#S3.E8)) and works *against* the attacker; we develop this idea in [Section 4](https://arxiv.org/html/2605.11685#S4).
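To make the scaling in Equation (8) concrete, the sketch below evaluates the predicted recovery ratio for a toy spectrum. The variances, the constant `c = 1`, and the attack budget of five steps are arbitrary assumptions chosen for illustration; they are not values from the paper.

```python
import numpy as np

def predicted_recovery(sigma_sq, steps, c=1.0):
    """Eq. (8): 1 - exp(-c * sigma_k^2 * T_r), evaluated per component."""
    return 1.0 - np.exp(-c * np.asarray(sigma_sq, dtype=float) * steps)

sigma_sq = np.array([10.0, 1.0, 0.01])   # toy variances: dominant, middle, minor
rec = predicted_recovery(sigma_sq, steps=5)
# Steps needed to reach a recovery ratio of 1 - 1/e, i.e. 1 / (c * sigma_k^2).
steps_needed = 1.0 / sigma_sq

print(rec)
print(steps_needed)
```

Under these assumed values the dominant direction is essentially fully recovered within the budget, while the minor direction recovers by less than 5% and would need on the order of $1/\sigma_k^{2}$ steps to catch up.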
## 4 Methodology
Building on [Section 3](https://arxiv.org/html/2605.11685#S3), we propose Minor Component Unlearning (MCU), which explicitly redirects representation-based unlearning losses (e.g., RMU (Li et al., [2024](https://arxiv.org/html/2605.11685#bib.bib5)), MLP Breaking (Sondej and Yang, [2025](https://arxiv.org/html/2605.11685#bib.bib8))) toward the minor-component subspace, in line with the structural fragility characterized by [Theorems 1](https://arxiv.org/html/2605.11685#Thmtheorem1) and [2](https://arxiv.org/html/2605.11685#Thmtheorem2).
### 4.1 Motivation: Targeting Minor Components
[Theorems 1](https://arxiv.org/html/2605.11685#Thmtheorem1) and [2](https://arxiv.org/html/2605.11685#Thmtheorem2) together imply that an unlearning update concentrated in dominant directions is exactly the configuration the attacker can undo most cheaply. This motivates a natural question: can we design an unlearning method that confines its effect to minor components, thereby inheriting their resistance to relearning? To achieve this, we propose to remove the dominant components from both the current and target representations before computing the unlearning loss. By removing the high-variance directions, the resulting loss gradients are constrained to operate primarily within the minor component subspace. This ensures that unlearning-induced changes occur in directions that are inherently more difficult to reverse.
### 4.2 Principal Component Extraction
Before unlearning, we extract the principal components from the original model’s representations on the forget set $\mathcal{D}_{\mathrm{f}}$. For each trainable parameter, we collect internal representations (either hidden states for RMU or MLP outputs for MLP Breaking) across all tokens in the forget set. Let $\mathbf{H}\in\mathbb{R}^{N\times d}$ denote the matrix of collected representations, where $N$ is the total number of tokens and $d$ is the hidden dimension.
We first center the representations by subtracting the mean:
$$\bar{\mathbf{h}}=\frac{1}{N}\sum_{i=1}^{N}\mathbf{h}_i,\quad\tilde{\mathbf{H}}=\mathbf{H}-\mathbf{1}\bar{\mathbf{h}}^{\top}, \tag{9}$$
where $\mathbf{1}$ is the all-ones vector. We then compute the top-$K$ principal components $\{\mathbf{v}_1,\mathbf{v}_2,\ldots,\mathbf{v}_K\}$ via singular value decomposition (SVD) or power iteration (Halko et al., [2011](https://arxiv.org/html/2605.11685#bib.bib3)):
$$\tilde{\mathbf{H}}=\mathbf{U}\mathbf{\Sigma}\mathbf{V}^{\top},\quad\mathbf{v}_k=\mathbf{V}_{:,k}. \tag{10}$$
In practice, we use randomized SVD for computational efficiency, which computes an approximate low-rank decomposition with $O(NK^{2})$ complexity instead of $O(Nd^{2})$ for full SVD.
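The centering and decomposition in Equations (9) and (10) can be sketched in a few lines of NumPy. This example uses an exact thin SVD on a small synthetic matrix for clarity, whereas the text above uses randomized SVD; the helper name `top_k_components` and the toy data are assumptions for illustration only.

```python
import numpy as np

def top_k_components(H, K):
    """Center H (N x d) as in Eq. (9) and return (mean, V_K), with V_K of shape (d, K)."""
    mean = H.mean(axis=0)
    H_centered = H - mean
    # Exact thin SVD for clarity; rows of Vt are right singular vectors, as in Eq. (10).
    _, _, Vt = np.linalg.svd(H_centered, full_matrices=False)
    return mean, Vt[:K].T

rng = np.random.default_rng(0)
# Toy activations: one very high-variance axis, one moderate axis, six low-variance axes.
scales = np.array([10.0, 3.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])
H = rng.normal(size=(200, 8)) * scales + 5.0

mean, V_K = top_k_components(H, K=2)
print(V_K.shape)
print(np.round(V_K.T @ V_K, 6))   # columns are orthonormal
```

Because the singular values returned by `np.linalg.svd` are sorted in decreasing order, the first `K` rows of `Vt` correspond to the highest-variance directions of the centered data.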
### 4.3 Minor Component Projection
Given the extracted principal components, we define the projection operator that removes the top-$K$ principal directions from any representation $\mathbf{h}\in\mathbb{R}^{d}$, where $K$ is a hyperparameter and $\langle\cdot,\cdot\rangle$ denotes the inner product:
$$\mathcal{P}_{\perp}(\mathbf{h})=\mathbf{h}-\sum_{k=1}^{K}\langle\mathbf{h},\mathbf{v}_k\rangle\mathbf{v}_k. \tag{11}$$
The projected vector $\mathcal{P}_{\perp}(\mathbf{h})$ lies in the orthogonal complement of the principal component subspace, i.e., the minor component subspace. In addition to the standard principal components, we treat the mean representation $\bm{\mu}=\frac{1}{N}\sum_{i=1}^{N}\mathbf{h}_i$ as a special “0th” principal component to be removed. Concretely, we first project out the mean direction before removing the top-$K$ PCs. Empirically, including the mean yields consistently better unlearning robustness; we provide ablation results in Section [5.3](https://arxiv.org/html/2605.11685#S5.SS3).
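A minimal NumPy sketch of the projection in Equation (11) is given below, operating on a batch of row vectors. The optional `mean` argument implements one possible reading of the "0th component" step (projecting out the unit mean direction before removing the top-$K$ PCs); that reading, the function name, and the random stand-in basis are assumptions of this example rather than details confirmed by the text.

```python
import numpy as np

def minor_projection(h, V_K, mean=None):
    """Eq. (11) for a batch h (n x d): remove the top-K directions (columns of V_K).

    If `mean` is given, the unit mean direction is projected out first
    (an assumed interpretation of the '0th principal component').
    """
    h = np.asarray(h, dtype=float)
    if mean is not None:
        m = mean / np.linalg.norm(mean)
        h = h - np.outer(h @ m, m)
    return h - (h @ V_K) @ V_K.T

rng = np.random.default_rng(0)
V_K, _ = np.linalg.qr(rng.normal(size=(6, 2)))   # orthonormal stand-in for the top-2 PCs
h = rng.normal(size=(4, 6))                      # toy batch of 4 representations
mean = rng.normal(size=6)                        # toy mean representation

p = minor_projection(h, V_K)
p_mean = minor_projection(h, V_K, mean)
print(np.abs(p @ V_K).max())        # residual along the removed directions
```

Without the mean step the operator is an orthogonal projector, so applying it twice gives the same result as applying it once.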
### 4.4 RMU-MCU
RMU (Li et al., [2024](https://arxiv.org/html/2605.11685#bib.bib5)) aims to steer the hidden representations towards a random target vector, disrupting the model’s ability to produce forget-set-related outputs. The original RMU loss is given in Equation [3](https://arxiv.org/html/2605.11685#S2.E3).
Our Minor Component Unlearning variant, RMU-MCU, applies the minor component projection to the current representation before computing the loss:
$$\mathcal{L}_{\text{RMU-MCU}}=\underset{x\in\mathcal{D}_{\mathrm{f}}}{\mathbb{E}}\left[\sum_{t\in x}\|\mathcal{P}_{\perp}(\mathbf{h}(t))-c\cdot\mathbf{u}\|^{2}\right]. \tag{12}$$
#### Intuition\.
Consider the gradient of ([12](https://arxiv.org/html/2605.11685#S4.E12)) *with respect to the hidden representation* $\mathbf{h}(t)$:
$$\frac{\partial\mathcal{L}_{\text{RMU-MCU}}}{\partial\mathbf{h}(t)}\propto\mathcal{P}_{\perp}(\mathbf{h}(t))-c\cdot\mathbf{u}. \tag{13}$$
The input-dependent part $\mathcal{P}_{\perp}(\mathbf{h}(t))$ lies in the minor component subspace, so MCU injects unlearning pressure exclusively along these robust directions; in contrast, standard RMU’s gradient $\mathbf{h}(t)-c\cdot\mathbf{u}$ has large components along principal directions, leading to easily reversible changes. The corresponding parameter-space update is this signal pulled back through the post-$\mathbf{h}$ Jacobian: while the update need not be strictly confined to the minor subspace, the dominant-direction contribution to $\mathbb{E}[\mathbf{g}_u\mathbf{g}_u^{\top}]$ (the term that drives [Theorems 1](https://arxiv.org/html/2605.11685#Thmtheorem1) and [2](https://arxiv.org/html/2605.11685#Thmtheorem2)) is removed.
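The claim that the hidden-state gradient of Equation (12) carries no component along the removed directions can be checked numerically. The sketch below does this for a single token with finite differences, assuming $\mathcal{P}_{\perp}$ is the plain top-$K$ projector of Equation (11) without the mean step; the helper names and toy values are assumptions. By the chain rule, the exact gradient in this setting applies the projector once more to the residual, giving $2\,\mathcal{P}_{\perp}(\mathcal{P}_{\perp}(\mathbf{h})-c\,\mathbf{u})$, which the code uses as the analytic reference.

```python
import numpy as np

def rmu_mcu_loss(h, V_K, u, c):
    """Eq. (12) for a single token: ||P_perp(h) - c*u||^2 with P_perp from Eq. (11)."""
    p = h - V_K @ (V_K.T @ h)
    return float(np.sum((p - c * u) ** 2))

def numerical_grad(f, h, eps=1e-6):
    """Central finite differences of a scalar function f at h."""
    g = np.zeros_like(h)
    for i in range(h.size):
        e = np.zeros_like(h)
        e[i] = eps
        g[i] = (f(h + e) - f(h - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
V_K, _ = np.linalg.qr(rng.normal(size=(6, 2)))   # orthonormal stand-in for the top-2 PCs
h, u, c = rng.normal(size=6), rng.uniform(size=6), 3.0

grad = numerical_grad(lambda x: rmu_mcu_loss(x, V_K, u, c), h)
residual = (h - V_K @ (V_K.T @ h)) - c * u
analytic = 2.0 * (residual - V_K @ (V_K.T @ residual))   # 2 * P_perp(residual)
leak = np.abs(V_K.T @ grad).max()                         # gradient mass on removed directions
print(leak)
```

The printed leakage is zero up to finite-difference error, so a gradient step on $\mathbf{h}$ under this loss leaves its top-$K$ coordinates untouched.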
### 4.5 MLP-Breaking-MCU
The original MLP Breaking loss aims to make the current MLP outputs orthogonal to the original outputs, as given in Equation [4](https://arxiv.org/html/2605.11685#S2.E4).
Our variant, MLP-Breaking-MCU, applies minor component projection before computing the loss:
$$\mathcal{L}_{\text{MLP-Breaking-MCU}}=\underset{x\in\mathcal{D}_{\mathrm{f}}}{\mathbb{E}}\left[\sum_{t\in x}\text{ReLU}\!\left(\frac{\langle\mathcal{P}_{\perp}(\mathbf{h}(t)),\mathcal{P}_{\perp}(\mathbf{h}_{\text{o}}(t))\rangle}{\|\mathcal{P}_{\perp}(\mathbf{h}_{\text{o}}(t))\|^{2}}\right)\right]. \tag{14}$$
#### Intuition\.
When the ReLU is active, the gradient of ([14](https://arxiv.org/html/2605.11685#S4.E14)) *with respect to* $\mathbf{h}(t)$ is:
$$\frac{\partial\mathcal{L}_{\text{MLP-Breaking-MCU}}}{\partial\mathbf{h}(t)}\propto\frac{\mathcal{P}_{\perp}(\mathbf{h}_{\text{o}}(t))}{\|\mathcal{P}_{\perp}(\mathbf{h}_{\text{o}}(t))\|^{2}}. \tag{15}$$
This vector lies in the minor component subspace by construction, since $\mathcal{P}_{\perp}(\mathbf{h}_{\text{o}}(t))$ is orthogonal to the principal directions. As in [Equation 12](https://arxiv.org/html/2605.11685#S4.E12), the parameter update is the pull-back of this hidden-state signal through the post-$\mathbf{h}$ Jacobian; what is guaranteed is that the dominant-direction component of the residual covariance vanishes, removing the $\sigma_k^{2}$-scaling source of fragility identified in [Theorems 1](https://arxiv.org/html/2605.11685#Thmtheorem1) and [2](https://arxiv.org/html/2605.11685#Thmtheorem2).
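A small NumPy sketch of Equation (14) for one sequence illustrates the practical consequence of projecting both arguments: changes confined to the top-$K$ directions are invisible to the loss, so the optimizer can only lower it by moving the minor components. The function name, the toy shapes, and the use of the plain projector of Equation (11) without the mean step are assumptions for this example.

```python
import numpy as np

def mlp_breaking_mcu_loss(h, h_o, V_K):
    """Eq. (14) for one sequence (rows are tokens), with P_perp from Eq. (11)."""
    p = h - (h @ V_K) @ V_K.T
    p_o = h_o - (h_o @ V_K) @ V_K.T
    coeff = np.sum(p * p_o, axis=1) / np.sum(p_o ** 2, axis=1)
    return float(np.sum(np.maximum(coeff, 0.0)))

rng = np.random.default_rng(0)
V_K, _ = np.linalg.qr(rng.normal(size=(6, 2)))   # orthonormal stand-in for the top-2 PCs
h_o = rng.normal(size=(4, 6))                    # toy original MLP outputs, 4 tokens

h_dom = h_o + rng.normal(size=(4, 2)) @ V_K.T    # perturbed only along the top-K directions
h_minor = (h_o @ V_K) @ V_K.T                    # minor components erased, dominant part kept

print(mlp_breaking_mcu_loss(h_o, h_o, V_K))      # baseline: one unit per token
print(mlp_breaking_mcu_loss(h_dom, h_o, V_K))    # dominant-only change: loss unchanged
print(mlp_breaking_mcu_loss(h_minor, h_o, V_K))  # minor components removed: loss vanishes
```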
## 5 Experiments
### 5.1 Experiment Setups
#### Datasets\.
We use three forget sets: WMDP-Cyber and WMDP-Bio from the WMDP-Deduped benchmark (Deeb and Roger, [2024](https://arxiv.org/html/2605.11685#bib.bib1)) (deduplicated subsets of WMDP (Li et al., [2024](https://arxiv.org/html/2605.11685#bib.bib5)), further filtered following Sondej and Yang ([2025](https://arxiv.org/html/2605.11685#bib.bib8))), and Years (Deeb and Roger, [2024](https://arxiv.org/html/2605.11685#bib.bib1)) (20th-century events with their dates). Retain sets are domain-matched subsets of the FineFineWeb corpus (M-A-P et al., [2024](https://arxiv.org/html/2605.11685#bib.bib16)). Full splits and preprocessing are in Appendix [F](https://arxiv.org/html/2605.11685#A6).
#### Evaluation\.
Following Deeb and Roger ([2024](https://arxiv.org/html/2605.11685#bib.bib1)), we partition the forget set into disjoint $T$ ($80\%$) and $V$ ($20\%$) with minimal mutual information, unlearn on $T\cup V$, and measure post-attack recovery on $V$ after the RTT attack (fine-tuning the unlearned model on $T$). We report five metrics: MMLU (general knowledge, $\uparrow$), WikiText loss (Merity et al., [2016](https://arxiv.org/html/2605.11685#bib.bib14)) ($\downarrow$), Forget accuracy ($\downarrow$), Relearn accuracy after RTT ($\downarrow$), and the relearning gap $\bm{\Delta}=\text{Relearn}-\text{Forget}$ ($\downarrow$). All experiments use Llama-3.1-8B (Grattafiori et al., [2024](https://arxiv.org/html/2605.11685#bib.bib15)) unless noted.
Table 1: Main results on WMDP-Cyber, WMDP-Bio, and Years. For Relearn and $\Delta$, bold = best, underline = second-best per dataset.
#### Baselines\.
We evaluate three base unlearning losses (NPO, RMU, and MLP Breaking) combined with two robustness-oriented techniques: Sharpness-Aware Minimization (SAM) (Fan et al., [2025](https://arxiv.org/html/2605.11685#bib.bib51)) and Collapse of Irrelevant Representations (CIR) (Sondej and Yang, [2025](https://arxiv.org/html/2605.11685#bib.bib8)). CIR is complementary to MCU: it operates at the *gradient level* by projecting out dominant components from $\nabla_{\theta}\mathcal{L}_u$ before each update, whereas MCU operates at the *loss level* by reshaping which directions the loss penalizes. We therefore evaluate MCU both standalone and on top of CIR.
### 5.2 Main Results
[Table 1](https://arxiv.org/html/2605.11685#S5.T1) summarizes our main results across all three datasets. We report a compact subset of representative configurations to keep the table readable; the full set of experiments is provided in [Table 4](https://arxiv.org/html/2605.11685#A7.T4) of Appendix [G](https://arxiv.org/html/2605.11685#A7).
#### MCU consistently improves robustness across base methods and datasets\.
When combined with CIR, our MCU variants achieve substantially lower relearning gaps ($\Delta$) than all baseline configurations. This improvement is particularly pronounced when MCU is applied on top of MLP Breaking + CIR, where we observe the lowest $\Delta$ values across all three datasets. Notably, MCU provides these robustness gains without compromising model utility: the MMLU scores of MCU variants remain comparable to or better than their non-MCU counterparts, while WikiText loss stays near the original model baseline.
#### CIR and MCU provide complementary benefits\.
Comparing methods with and without CIR reveals that CIR substantially improves both utility preservation and robustness\. However, CIR alone still leaves room for knowledge recovery under relearning attacks\. Adding MCU further reduces the relearning gap, demonstrating that the two techniques address different aspects of the unlearning robustness problem\.
#### SAM provides limited robustness improvements\.
While SAM has been proposed to improve unlearning robustness through smoother loss landscapes (Fan et al., [2025](https://arxiv.org/html/2605.11685#bib.bib51)), our results show that its effectiveness varies across settings. In some cases, SAM slightly reduces the relearning gap, but the improvements are inconsistent and often come with utility trade-offs. In contrast, MCU provides more reliable and substantial robustness gains, suggesting that operating on the representation geometry is more effective than loss landscape smoothing for preventing knowledge recovery.
#### Generality across model families\.
The results in [Table 1](https://arxiv.org/html/2605.11685#S5.T1) are reported on Llama-3.1-8B for direct comparability with prior work (Deeb and Roger, [2024](https://arxiv.org/html/2605.11685#bib.bib1); Sondej and Yang, [2025](https://arxiv.org/html/2605.11685#bib.bib8)). To verify that the benefits of MCU are not specific to a single model, we additionally evaluate MCU on Gemma2-9B and Qwen3-8B across all three datasets; the full results are provided in Appendix [H](https://arxiv.org/html/2605.11685#A8) ([Table 5](https://arxiv.org/html/2605.11685#A8.T5)). On both architectures, adding MCU consistently reduces the post-attack relearning gap $\Delta$ over the strong MLP Breaking + CIR baseline while preserving MMLU, confirming that the improvements transfer across model families.
### 5.3 Ablation Study on Component Projection Strategies
We ablate several ways of constructing the projection subspace, varying whether we explicitly handle the mean direction and whether we use centered PCA or an SVD-style decomposition: (1) MCU (Ours): compute the mean as the 0th component, then apply PCA on centered representations; (2) w/o mean: standard PCA without treating the mean as a special component; (3) SVD-based: SVD on uncentered representations, using the top-$K$ right singular vectors.
Table 2: Ablation on projection strategies. All methods use MLP Breaking + CIR as base. Fgt = Forget acc., Rlrn = Relearn acc. after RTT.

[Table 2](https://arxiv.org/html/2605.11685#S5.T2) shows that explicitly accounting for the mean direction is critical for achieving robust unlearning. When the mean component is not treated separately ("w/o mean"), the method fails to reliably isolate the recoverable directions, leading to unpredictable performance that is in some cases even worse than the baseline. Similarly, SVD on uncentered representations provides moderate improvements but remains less effective than centered PCA, likely because the dominant singular vectors conflate the mean shift with principal variation directions. These results suggest that the mean direction captures globally shared information that is particularly susceptible to recovery, and explicitly projecting it out enables more precise targeting of the robust subspace. Importantly, all variants maintain similar utility scores, indicating that the performance differences primarily reflect how well each strategy identifies and avoids the recoverable directions rather than fundamental trade-offs with model capability.
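The three projection strategies compared above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`mcu_basis`, `pca_basis_without_mean`, `svd_basis_uncentered`, `project_out`) are hypothetical, the "w/o mean" variant is read here as ordinary centered PCA that simply never projects the mean out, and the QR re-orthonormalization of the mean direction against the principal components is an added assumption.

```python
import numpy as np

def mcu_basis(H, k):
    """Mean direction as the 0th component, then top-k PCs of centered data."""
    mean = H.mean(axis=0)
    mean_dir = mean / np.linalg.norm(mean)
    centered = H - mean
    # Rows of Vt are the principal directions of the centered representations.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    basis = np.vstack([mean_dir, Vt[:k]])
    # Assumption: re-orthonormalize, since the mean direction need not be
    # orthogonal to the principal directions.
    Q, _ = np.linalg.qr(basis.T)
    return Q.T  # shape (k + 1, d), orthonormal rows

def pca_basis_without_mean(H, k):
    """Centered PCA; the mean direction is not added to the basis."""
    centered = H - H.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return Vt[:k]

def svd_basis_uncentered(H, k):
    """Top-k right singular vectors of the uncentered representations."""
    _, _, Vt = np.linalg.svd(H, full_matrices=False)
    return Vt[:k]

def project_out(X, basis):
    """Remove the span of an orthonormal row basis from every row of X."""
    return X - (X @ basis.T) @ basis
```

Under this reading, only `mcu_basis` guarantees that the residual returned by `project_out` has no component along the mean direction, which is the property the ablation identifies as important.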
### 5.4 Analysis of Changes Across Principal Components
Figure 3: PC-bin change distribution. Baseline concentrates changes in dominant bins; MCU shifts mass toward minor bins.

To validate that our method successfully redirects unlearning effects toward minor components as intended, we analyze the distribution of representation changes across principal components after unlearning. Specifically, we extract the principal components from the original model's representations on the forget set (as described in [Section 4](https://arxiv.org/html/2605.11685#S4)), then measure how much each component is modified during unlearning for both the baseline and our MCU variants. For each principal component $\mathbf{v}_k$, we compute the change ratio as defined in Equation [5](https://arxiv.org/html/2605.11685#S3.E5). [Figure 3](https://arxiv.org/html/2605.11685#S5.F3) presents the distribution of changes across principal component bins for the baseline (MLP Breaking + CIR) compared to our MCU variants. A clear pattern emerges: the baseline concentrates the majority of representation changes in the early bins, corresponding to the dominant principal components that encode shared structure and are easily recovered during relearning. In contrast, our MCU variants shift the distribution toward later bins, indicating that changes are redistributed to the minor components that store sample-specific information. This shift aligns with our design objective: by projecting out the top-$K$ principal components before computing the unlearning loss, MCU suppresses modifications to the dominant subspace and redirects optimization pressure toward the minor component subspace.
The redistribution of unlearning effects provides a mechanistic explanation for the improved robustness observed in [Table 1](https://arxiv.org/html/2605.11685#S5.T1). As demonstrated in our analysis ([Section 3](https://arxiv.org/html/2605.11685#S3)), modifications to dominant components can be easily reversed during relearning because these directions capture cross-sample regularities that the model naturally recovers when exposed to similar data.
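The PC-bin analysis in this subsection can be sketched as below. Because Equation 5 is not reproduced in this section, the normalization used here is an assumption: each component's share is taken to be its mean squared projected change divided by the total over all components. The function name `pc_change_distribution` and the default of four bins are hypothetical.

```python
import numpy as np

def pc_change_distribution(H_orig, H_unl, n_bins=4):
    """Share of representation change falling into each principal-component bin.

    H_orig, H_unl: (n_samples, d) representations before and after unlearning.
    Bins are ordered from dominant components (first) to minor ones (last).
    """
    centered = H_orig - H_orig.mean(axis=0)
    # Principal directions of the original forget-set representations.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    delta = H_unl - H_orig
    # Mean squared change along each principal direction.
    energy = ((delta @ Vt.T) ** 2).mean(axis=0)
    # Assumed normalization: fraction of the total change energy.
    share = energy / energy.sum()
    return np.array([b.sum() for b in np.array_split(share, n_bins)])
```

With this convention, a method that edits mostly dominant directions puts most of its mass in the first bin, while a method that edits mostly minor directions puts it in the last bin.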
### 5.5 Robustness under Adaptive Representation-Based Attacks
Because MCU operates on internal representations, a natural concern is whether an adversary that directly targets representations, rather than the standard RTT loss, can defeat it. We assume the attacker has access to the unlearned model, the original pre-unlearning model, and the forget set $T$, and fine-tunes the unlearned model to minimize the MSE between its MLP activations and those of the original model on $T$. This is a strictly stronger threat model than RTT, as it directly targets the representation-level modifications MCU introduces. Unlearning follows our main setup, and we report post-attack accuracy on the held-out split $V$.
Table 3: Robustness against the adaptive representation-based attack on Llama-3.1-8B. The attacker directly aligns the unlearned model's activations with the original model's. Bold indicates the lowest $\Delta$ (best robustness).

[Table 3](https://arxiv.org/html/2605.11685#S5.T3) shows a clear hierarchy of robustness. GA collapses entirely under this attack, with relearn accuracy nearly returning to that of the original model, confirming that output-level unlearning leaves internal representations essentially intact and trivially recoverable. MLP Breaking + CIR is markedly more robust but still loses a non-trivial amount of forgotten knowledge, indicating that recoverable traces persist in the dominant components even after gradient-level filtering. Adding MCU yields by far the strongest robustness, with $\Delta$ several times smaller than MLP Breaking + CIR on both datasets. This trend matches our representation-geometry analysis ([Section 3](https://arxiv.org/html/2605.11685#S3)): the adaptive attacker can re-establish the cross-sample dominant-component structure, but cannot reliably reconstruct modifications in the minor-component subspace, which lacks the regularity needed for reconstruction.
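The attacker's objective described in this subsection (minimizing the MSE between the unlearned model's MLP activations and the original model's on the forget set) can be illustrated with a toy stand-in. The sketch below replaces the LLM with a single linear map and uses plain gradient descent; the function names, learning rate, and step count are all hypothetical, and the example only shows the shape of the objective. It says nothing about how robust MCU or any baseline is in practice.

```python
import numpy as np

def activation_matching_attack(W_unl, W_orig, X, lr=0.1, steps=200):
    """Gradient descent on the activation-matching loss for a toy linear layer.

    Loss: (1/n) * sum_i ||W x_i - W_orig x_i||^2, starting from W = W_unl.
    """
    W = W_unl.copy()
    target = X @ W_orig.T  # activations of the original (pre-unlearning) layer
    n = X.shape[0]
    for _ in range(steps):
        residual = X @ W.T - target      # (n, d_out) activation mismatch
        grad = 2.0 * residual.T @ X / n  # gradient of the loss w.r.t. W
        W -= lr * grad
    return W

def activation_mse(W, W_orig, X):
    """Mean squared activation mismatch, averaged over all entries."""
    return float(np.mean((X @ W.T - X @ W_orig.T) ** 2))
```

In the paper's setting the same loss would be applied to the MLP activations of a fine-tuned LLM rather than to a single weight matrix, and success is measured by post-attack accuracy on the held-out split rather than by the activation mismatch itself.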
## 6 Conclusion
We investigated the fragility of LLM unlearning from a representation geometry perspective and identified a fundamental mechanism: existing unlearning methods predominantly modify dominant components of internal representations, which are easily recovered during relearning attacks\. In contrast, minor components exhibit significantly stronger resistance to such recovery\. Building on this insight, we proposed MCU, a method that explicitly targets the robust minor component subspace by projecting out dominant directions before computing unlearning losses\. MCU is compatible with existing representation\-based unlearning objectives and complementary to gradient\-level filtering techniques like CIR\. Extensive experiments on WMDP\-Cyber, WMDP\-Bio, and Years datasets demonstrate that MCU significantly reduces knowledge recovery under relearning attacks while preserving model utility, outperforming state\-of\-the\-art methods including sharpness\-aware minimization\.
## References
- C. Barrett, B. Boyd, E. Bursztein, N. Carlini, B. Chen, J. Choi, A. R. Chowdhury, M. Christodorescu, A. Datta, S. Feizi, et al. (2023). Identifying and mitigating the security risks of generative ai. Foundations and Trends in Privacy and Security 6(1), pp. 1–52.
- Forecasting open-weight ai model growth on huggingface. arXiv preprint arXiv:2502.15987.
- L. Bourtoule, V. Chandrasekaran, C. A. Choquette-Choo, H. Jia, A. Travers, B. Zhang, D. Lie, and N. Papernot (2021). Machine unlearning. In 2021 IEEE Symposium on Security and Privacy (SP), pp. 141–159.
- Y. Cao and J. Yang (2015). Towards making systems forget with machine unlearning. In 2015 IEEE symposium on security and privacy, pp. 463–480.
- S. Casper, K. O'Brien, S. Longpre, E. Seger, K. Klyman, R. Bommasani, A. Nrusimha, I. Shumailov, S. Mindermann, S. Basart, et al. (2025). Open technical problems in open-weight ai model risk management. Social Science Research Network.
- A. Deeb and F. Roger (2024). Do unlearning methods remove information from language model weights? arXiv preprint arXiv:2410.08827.
- R. Eldan and M. Russinovich (2023). Who's harry potter? approximate unlearning in llms. External Links: 2310.02238.
- C. Fan, J. Jia, Y. Zhang, A. Ramakrishna, M. Hong, and S. Liu (2025). Towards llm unlearning resilient to relearning attacks: a sharpness-aware minimization perspective and beyond. In Forty-second International Conference on Machine Learning.
- C. Fan, J. Liu, Y. Zhang, E. Wong, D. Wei, and S. Liu (2024). SalUn: empowering machine unlearning via gradient-based weight saliency in both image classification and generation. In The Twelfth International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=gn0mIhQGNM).
- A. Ginart, M. Guan, G. Valiant, and J. Y. Zou (2019). Making ai forget you: data deletion in machine learning. Advances in neural information processing systems 32.
- A. Golatkar, A. Achille, and S. Soatto (2020). Eternal sunshine of the spotless net: selective forgetting in deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9304–9312.
- A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. (2024). The llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- N. Halko, P. Martinsson, and J. A. Tropp (2011). Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM review 53(2), pp. 217–288.
- S. Hu, Y. Fu, S. Wu, and V. Smith (2025). Unlearning or obfuscating? jogging the memory of unlearned LLMs via benign relearning. In The Thirteenth International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=fMNRYBvcQN).
- A. Jacot, F. Gabriel, and C. Hongler (2018). Neural tangent kernel: convergence and generalization in neural networks. Advances in neural information processing systems 31.
- J. Jang, D. Yoon, S. Yang, S. Cha, M. Lee, L. Logeswaran, and M. Seo (2023). Knowledge unlearning for mitigating privacy risks in language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 14389–14408.
- J. Jia, J. Liu, P. Ram, Y. Yao, G. Liu, Y. Liu, P. Sharma, and S. Liu (2023). Model sparsity can simplify machine unlearning. In Thirty-seventh Conference on Neural Information Processing Systems.
- M. Kurmanji, P. Triantafillou, J. Hayes, and E. Triantafillou (2023). Towards unbounded machine unlearning. Advances in neural information processing systems 36, pp. 1957–1987.
- N. Li, A. Pan, A. Gopal, S. Yue, D. Berrios, A. Gatti, J. D. Li, A. Dombrowski, S. Goel, L. Phan, et al. (2024). The wmdp benchmark: measuring and reducing malicious use with unlearning. arXiv preprint arXiv:2403.03218.
- S. Liu, Y. Yao, J. Jia, S. Casper, N. Baracaldo, P. Hase, Y. Yao, C. Y. Liu, X. Xu, H. Li, et al. (2025). Rethinking machine unlearning for large language models. Nature Machine Intelligence, pp. 1–14.
- J. Łucki, B. Wei, Y. Huang, P. Henderson, F. Tramèr, and J. Rando (2025). An adversarial perspective on machine unlearning for ai safety. Transactions on Machine Learning Research. External Links: ISSN 2835-8856, [Link](https://openreview.net/forum?id=J5IRyTKZ9s).
- A. Lynch, P. Guo, A. Ewart, S. Casper, and D. Hadfield-Menell (2024). Eight methods to evaluate robust unlearning in llms. arXiv preprint arXiv:2402.16835.
- M-A-P, G. Zhang, X. Du, Z. Yu, Z. Wang, Z. Wang, S. Guo, T. Zheng, K. Zhu, J. Liu, S. Yue, B. Liu, Z. Peng, Y. Yao, J. Yang, Z. Li, B. Zhang, M. Liu, T. Liu, Y. Gao, W. Chen, X. Zhou, Q. Liu, T. Wang, and W. Huang (2024). FineFineWeb: a comprehensive study on fine-grained domain web corpus. huggingface. External Links: [Link](https://huggingface.co/datasets/m-a-p/FineFineWeb).
- P. Maini, Z. Feng, A. Schwarzschild, Z. C. Lipton, and J. Z. Kolter (2024). TOFU: a task of fictitious unlearning for LLMs. In First Conference on Language Modeling. External Links: [Link](https://openreview.net/forum?id=B41hNBoWLo).
- N. Maslej, L. Fattorini, R. Perrault, V. Parli, A. Reuel, E. Brynjolfsson, J. Etchemendy, K. Ligett, T. Lyons, J. Manyika, J. C. Niebles, Y. Shoham, R. Wald, and J. Clark (2024). Artificial intelligence index report 2024. Technical report, Stanford Institute for Human-Centered Artificial Intelligence (HAI). Note: Seventh edition. Available as AI Index Report via arXiv:2405.19522. External Links: [Link](https://hai.stanford.edu/ai-index/2024-ai-index-report).
- S. Merity, C. Xiong, J. Bradbury, and R. Socher (2016). Pointer sentinel mixture models. External Links: 1609.07843.
- N. Nanda, S. Rajamanoharan, J. Kramar, and R. Shah (2023). Fact finding: attempting to reverse-engineer factual recall on the neuron level. External Links: [Link](https://www.alignmentforum.org/posts/iGuwZTHWb6DFY3sKB).
- T. T. Nguyen, T. T. Huynh, Z. Ren, P. L. Nguyen, A. W. Liew, H. Yin, and Q. V. H. Nguyen (2025). A survey of machine unlearning. ACM Transactions on Intelligent Systems and Technology 16(5), pp. 1–46.
- M. Pawelczyk, S. Neel, and H. Lakkaraju (2024). In-context unlearning: language models as few-shot unlearners. In International Conference on Machine Learning, pp. 40034–40050.
- X. Qi, Y. Zeng, T. Xie, P. Chen, R. Jia, P. Mittal, and P. Henderson (2023). Fine-tuning aligned language models compromises safety, even when users do not intend to! External Links: 2310.03693, [Link](https://arxiv.org/abs/2310.03693).
- R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn (2023). Direct preference optimization: your language model is secretly a reward model. Advances in neural information processing systems 36, pp. 53728–53741.
- D. Rosati, J. Wehner, K. Williams, Ł. Bartoszcze, D. Atanasov, R. Gonzales, S. Majumdar, C. Maple, H. Sajjad, and F. Rudzicz (2024). Representation noising: a defence mechanism against harmful finetuning. Advances in Neural Information Processing Systems 37, pp. 12636–12676.
- A. Sheshadri, A. Ewart, P. Guo, A. Lynch, C. Wu, V. Hebbar, H. Sleight, A. C. Stickland, E. Perez, D. Hadfield-Menell, et al. (2024). Latent adversarial training improves robustness to persistent harmful behaviors in llms. arXiv preprint arXiv:2407.15549.
- W. Shi, J. Lee, Y. Huang, S. Malladi, J. Zhao, A. Holtzman, D. Liu, L. Zettlemoyer, N. A. Smith, and C. Zhang (2024). Muse: machine unlearning six-way evaluation for language models. arXiv preprint arXiv:2407.06460.
- F. Sondej and Y. Yang (2025). Collapse of irrelevant representations (cir) ensures robust and non-disruptive llm unlearning. arXiv preprint arXiv:2509.11816.
- R. Tamirisa, B. Bharathi, L. Phan, A. Zhou, A. Gatti, T. Suresh, M. Lin, J. Wang, R. Wang, R. Arel, et al. (2024). Tamper-resistant safeguards for open-weight llms. arXiv preprint arXiv:2408.00761.
- P. Thaker, Y. Maurya, S. Hu, Z. S. Wu, and V. Smith (2024). Guardrail baselines for unlearning in llms. arXiv preprint arXiv:2403.03329.
- E. Ullah, T. Mai, A. Rao, R. A. Rossi, and R. Arora (2021). Machine unlearning via algorithmic stability. In Conference on Learning Theory, pp. 4126–4142.
- Y. Yao, X. Xu, and Y. Liu (2024). Large language model unlearning. Advances in Neural Information Processing Systems 37, pp. 105425–105475.
- R. Zhang, L. Lin, Y. Bai, and S. Mei (2024). Negative preference optimization: from catastrophic collapse to effective unlearning. In First Conference on Language Modeling.
## Appendix A Limitations
Our work has several aspects worth noting. First, our experimental evaluation focuses on relearning attacks, which represent the most practically relevant threat for open-weight models. Robustness against complementary attack vectors, such as inference-time jailbreaking \[Łucki et al., [2025](https://arxiv.org/html/2605.11685#bib.bib25), Lynch et al., [2024](https://arxiv.org/html/2605.11685#bib.bib26)\] or quantization-induced knowledge revival \[Qi et al., [2023](https://arxiv.org/html/2605.11685#bib.bib23)\], may benefit from additional defenses, and evaluating MCU against these settings is a natural direction for future work. Second, our theoretical analysis in [Section 3.2](https://arxiv.org/html/2605.11685#S3.SS2) employs a first-order (NTK-regime) linearization for tractability; tightening this framework to capture non-linear dynamics more precisely, or establishing formal convergence guarantees for the MCU objective, are interesting open theoretical questions.
## Appendix B Broader Impacts
Our research advances the robustness of LLM unlearning against relearning attacks, which is critical for ensuring that safety\-relevant knowledge removal is persistent in open\-weight models\. By revealing the representation\-level mechanism underlying unlearning fragility and proposing a principled solution, this work contributes to more reliable privacy protection and regulatory compliance for deployed language models\.
At the same time, stronger and more persistent unlearning can have negative uses if applied to suppress beneficial knowledge, remove safety\-aligned behaviors, or conceal model provenance and accountability\-relevant information\. There is also a risk that practitioners over\-trust unlearning as a complete safety guarantee, even though our evaluation focuses on relearning attacks and does not cover every possible recovery channel\. Responsible deployment should therefore combine robust unlearning with independent audits, explicit retain\-set and safety evaluations, access controls for high\-risk settings, and monitoring for both accidental utility degradation and intentional misuse\. We do not release new model checkpoints, hazardous datasets, or other high\-risk assets as part of this submission\.
## Appendix C Related Works
#### LLM Unlearning\.
Machine unlearning, originally developed to address post-training privacy concerns such as the "right to be forgotten" \[Cao and Yang, [2015](https://arxiv.org/html/2605.11685#bib.bib27), Ginart et al., [2019](https://arxiv.org/html/2605.11685#bib.bib28), Ullah et al., [2021](https://arxiv.org/html/2605.11685#bib.bib29)\], aims to modify trained models to remove the influence of specific data without costly retraining. While approximate unlearning methods have been successfully applied in various domains \[Kurmanji et al., [2023](https://arxiv.org/html/2605.11685#bib.bib30), Bourtoule et al., [2021](https://arxiv.org/html/2605.11685#bib.bib31), Nguyen et al., [2025](https://arxiv.org/html/2605.11685#bib.bib32), Golatkar et al., [2020](https://arxiv.org/html/2605.11685#bib.bib33), Jia et al., [2023](https://arxiv.org/html/2605.11685#bib.bib34), Fan et al., [2024](https://arxiv.org/html/2605.11685#bib.bib35)\], LLM unlearning has emerged as a rapidly growing subfield \[Jang et al., [2023](https://arxiv.org/html/2605.11685#bib.bib9), Yao et al., [2024](https://arxiv.org/html/2605.11685#bib.bib42), Eldan and Russinovich, [2023](https://arxiv.org/html/2605.11685#bib.bib43), Zhang et al., [2024](https://arxiv.org/html/2605.11685#bib.bib11), Maini et al., [2024](https://arxiv.org/html/2605.11685#bib.bib4), Liu et al., [2025](https://arxiv.org/html/2605.11685#bib.bib10)\] that aims to remove undesired data influences from large language models while preserving model utility for unrelated tasks.
Applications span mitigating harmful content generation \[Yao et al., [2024](https://arxiv.org/html/2605.11685#bib.bib42), Li et al., [2024](https://arxiv.org/html/2605.11685#bib.bib5)\], protecting copyrighted and private information \[Eldan and Russinovich, [2023](https://arxiv.org/html/2605.11685#bib.bib43), Jang et al., [2023](https://arxiv.org/html/2605.11685#bib.bib9)\], and preventing LLMs from producing biosecurity or cybersecurity threats \[Li et al., [2024](https://arxiv.org/html/2605.11685#bib.bib5), Barrett et al., [2023](https://arxiv.org/html/2605.11685#bib.bib44)\]. Current approaches fall into two categories: model optimization-based methods \[Maini et al., [2024](https://arxiv.org/html/2605.11685#bib.bib4), Yao et al., [2024](https://arxiv.org/html/2605.11685#bib.bib42), Zhang et al., [2024](https://arxiv.org/html/2605.11685#bib.bib11), Li et al., [2024](https://arxiv.org/html/2605.11685#bib.bib5)\] that fine-tune model parameters, and input-based strategies that leverage prompting or in-context learning to suppress undesired behaviors \[Thaker et al., [2024](https://arxiv.org/html/2605.11685#bib.bib45), Pawelczyk et al., [2024](https://arxiv.org/html/2605.11685#bib.bib46)\]. Several benchmarks have been proposed to evaluate unlearning effectiveness, including TOFU \[Maini et al., [2024](https://arxiv.org/html/2605.11685#bib.bib4)\] for fictitious unlearning, WMDP \[Li et al., [2024](https://arxiv.org/html/2605.11685#bib.bib5)\] for hazardous knowledge removal, and MUSE \[Shi et al., [2024](https://arxiv.org/html/2605.11685#bib.bib47)\] for copyright protection. Among existing methods, NPO \[Zhang et al., [2024](https://arxiv.org/html/2605.11685#bib.bib11)\] has emerged as a promising approach by framing unlearning as preference optimization.
#### Robustness Challenges in LLM Unlearning\.
Recent studies have exposed critical vulnerabilities in existing LLM unlearning methods \[Lynch et al., [2024](https://arxiv.org/html/2605.11685#bib.bib26), Łucki et al., [2025](https://arxiv.org/html/2605.11685#bib.bib25), Hu et al., [2025](https://arxiv.org/html/2605.11685#bib.bib22), Deeb and Roger, [2024](https://arxiv.org/html/2605.11685#bib.bib1)\]. These vulnerabilities primarily manifest through two attack categories: relearning attacks \[Hu et al., [2025](https://arxiv.org/html/2605.11685#bib.bib22), Lynch et al., [2024](https://arxiv.org/html/2605.11685#bib.bib26), Deeb and Roger, [2024](https://arxiv.org/html/2605.11685#bib.bib1)\], where fine-tuning with even a small subset of forget samples can restore unlearned knowledge; and jailbreaking attacks \[Łucki et al., [2025](https://arxiv.org/html/2605.11685#bib.bib25), Lynch et al., [2024](https://arxiv.org/html/2605.11685#bib.bib26)\], where adversarial prompts successfully recover forgotten information at inference time. Even unrelated operations such as model quantization can inadvertently revive targeted knowledge \[Qi et al., [2023](https://arxiv.org/html/2605.11685#bib.bib23)\]. Most alarmingly, Deeb and Roger \[[2024](https://arxiv.org/html/2605.11685#bib.bib1)\] demonstrated that current unlearning methods achieve recovery rates exceeding 88% after relearning attacks, indicating that knowledge is merely hidden rather than truly removed from model weights.
To address these robustness challenges, recent work has explored various defense strategies. Tamirisa et al. \[[2024](https://arxiv.org/html/2605.11685#bib.bib48)\] leveraged model-agnostic meta-learning (MAML) to counter tampering attacks, while Sheshadri et al. \[[2024](https://arxiv.org/html/2605.11685#bib.bib49)\] employed adversarial training in the latent space of LLMs. Sondej and Yang \[[2025](https://arxiv.org/html/2605.11685#bib.bib8)\] proposed CIR, which uses PCA to identify and remove common representations from unlearning gradients before applying updates. From an optimization perspective, Fan et al. \[[2025](https://arxiv.org/html/2605.11685#bib.bib51)\] investigated SAM to improve unlearning robustness through smoother loss landscapes. Despite these advances, the fundamental mechanism underlying unlearning fragility remains poorly understood, motivating our representation-centric analysis.
## Appendix D Theoretical Analysis: Full Derivations
This appendix provides the detailed derivations supporting [Theorems 1](https://arxiv.org/html/2605.11685#Thmtheorem1) and [2](https://arxiv.org/html/2605.11685#Thmtheorem2) in [Section 3.2](https://arxiv.org/html/2605.11685#S3.SS2). Throughout, we focus on a single MLP module with input-side activation $\mathbf{h}_{\theta}(\mathbf{x})\in\mathbb{R}^{d}$; the argument applies layerwise.
### D.1 Notation and Linearization
Let $\mathbf{x}$ be a token (or sequence) drawn from the forget distribution $\mathcal{D}_{f}$, and let $\mathbf{h}_{o}(\mathbf{x})$ be the original (pre-unlearning) representation. After centering, the representation covariance is

$$\bm{\Sigma}\;=\;\mathbb{E}_{\mathbf{x}\sim\mathcal{D}_{f}}\!\big[\mathbf{h}_{o}(\mathbf{x})\mathbf{h}_{o}(\mathbf{x})^{\top}\big]\;=\;\sum_{k=1}^{d}\sigma_{k}^{2}\,\mathbf{v}_{k}\mathbf{v}_{k}^{\top},\qquad\sigma_{1}^{2}\geq\cdots\geq\sigma_{d}^{2}. \tag{16}$$

Observation 1 corresponds to a sharp decay of $\sigma_{k}^{2}$ in $k$. In a small neighborhood of the pre-unlearning parameters $\theta_{o}$, we use the first-order expansion

$$\mathbf{h}_{\theta}(\mathbf{x})\;\approx\;\mathbf{h}_{o}(\mathbf{x})+\mathbf{J}(\mathbf{x})\,\Delta\theta,\qquad\mathbf{J}(\mathbf{x})=\frac{\partial\mathbf{h}_{\theta}(\mathbf{x})}{\partial\theta}\bigg|_{\theta_{o}}. \tag{17}$$

Define the empirical neural tangent kernel \[Jacot et al., [2018](https://arxiv.org/html/2605.11685#bib.bib52)\]

$$\mathbf{K}(\mathbf{x},\mathbf{x}^{\prime})\;=\;\mathbf{J}(\mathbf{x})\,\mathbf{J}(\mathbf{x}^{\prime})^{\top}\in\mathbb{R}^{d\times d}. \tag{18}$$

In the lazy/NTK regime, after standard normalization, $\mathbf{K}(\mathbf{x},\mathbf{x}^{\prime})\approx\kappa\,\mathbf{I}$ for some $\kappa>0$, approximately independent of the inputs. We use this approximation only to expose the dominant scaling; the conclusions are stable to mild anisotropy in $\mathbf{K}$.
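As a sanity check on the linearization and on the isotropy of the kernel, it may help to work through the simplest case: a single linear layer $\mathbf{h}=\mathbf{W}\mathbf{x}$ with $\theta$ the flattened weights. This toy case is an illustration chosen here, not a model analyzed in the paper, and the helper names are hypothetical. For it, the first-order expansion is exact and the empirical kernel equals $(\mathbf{x}^{\top}\mathbf{x}^{\prime})\,\mathbf{I}$, so it is exactly isotropic, although its scale still depends on the inputs (the text's approximation additionally treats the scale as roughly constant after normalization).

```python
import numpy as np

def linear_layer_jacobian(x, d_out):
    """Jacobian of h = W x w.r.t. theta = W.flatten() (row-major).

    dh_i / dW[j, k] = delta_ij * x_k, giving a (d_out, d_out * d_in) matrix.
    """
    return np.kron(np.eye(d_out), x[None, :])

def empirical_ntk(x, x_prime, d_out):
    """K(x, x') = J(x) J(x')^T for the toy linear layer."""
    return linear_layer_jacobian(x, d_out) @ linear_layer_jacobian(x_prime, d_out).T

rng = np.random.default_rng(0)
x, x_prime = rng.normal(size=5), rng.normal(size=5)
K = empirical_ntk(x, x_prime, d_out=3)
# For this layer the kernel is a scalar multiple of the identity.
print(np.allclose(K, (x @ x_prime) * np.eye(3)))
```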
### D.2 A Unified Form for Unlearning Losses
We treat both representation-level losses (RMU, MLP Breaking) and output-level losses (GA, NPO) under a single framework. Let $\mathbf{h}_{\theta}(\mathbf{x})\in\mathbb{R}^{d}$ be the analyzed intermediate representation (e.g., an MLP activation in the layer where PCA is performed). Any unlearning objective can be written as

$$\mathcal{L}_{u}(\theta)\;=\;\mathbb{E}_{\mathbf{x}\sim\mathcal{D}_{f}}\!\big[\,\ell_{u}\!\left(f_{\theta}(\mathbf{x});\,\mathbf{x}\right)\big], \tag{19}$$

where $f_{\theta}(\mathbf{x})$ denotes any output of the model that depends on $\theta$ through $\mathbf{h}_{\theta}(\mathbf{x})$. By the chain rule,

$$\nabla_{\theta}\mathcal{L}_{u}\;=\;\mathbb{E}_{\mathbf{x}}\!\big[\mathbf{J}(\mathbf{x})^{\top}\mathbf{g}_{u}(\mathbf{x})\big],\qquad\mathbf{g}_{u}(\mathbf{x})\;:=\;\frac{\partial\ell_{u}}{\partial\mathbf{h}}\bigg|_{\mathbf{h}_{o}(\mathbf{x})}, \tag{20}$$

i.e., *any* unlearning loss is mediated by an effective per-sample residual $\mathbf{g}_{u}(\mathbf{x})$ in the representation space of $\mathbf{h}$.
###### Lemma 1 (Eigenstructure of effective residuals).

For all four unlearning losses studied in this paper (RMU, MLP Breaking, GA, NPO), the effective residual covariance admits the decomposition

$$\mathbb{E}_{\mathbf{x}\sim\mathcal{D}_f}\!\big[\mathbf{g}_u(\mathbf{x})\mathbf{g}_u(\mathbf{x})^{\top}\big] \;=\; \mathbf{A}\,\bm{\Sigma}\,\mathbf{A}^{\top} \;+\; \mathbf{N}, \tag{21}$$

where $\mathbf{A}$ is a loss-dependent linear map and $\mathbf{N}\succeq 0$ has small operator norm $\|\mathbf{N}\|=O(\tau^2)\ll\sigma_1^2$. Moreover, on the dominant subspace spanned by $\{\mathbf{v}_1,\ldots,\mathbf{v}_K\}$ (the few PCs that carry essentially all variance), the map $\mathbf{A}$ approximately preserves the eigenbasis of $\bm{\Sigma}$, in the sense that

$$\mathbf{v}_k^{\top}\mathbf{A}\bm{\Sigma}\mathbf{A}^{\top}\mathbf{v}_k \;=\; \sum_{\ell}\sigma_\ell^2\,\big(\mathbf{v}_k^{\top}\mathbf{A}\,\mathbf{v}_\ell\big)^2 \;=\; \alpha_k\,\sigma_k^2 \;+\; O(\tau^2),\quad \alpha_k>0, \tag{22}$$

for $k\leq K$. Consequently, the eigenvalues of $\mathbb{E}[\mathbf{g}_u\mathbf{g}_u^{\top}]$ aligned with the top-$k$ subspace of $\bm{\Sigma}$ are $\Theta(\sigma_k^2)$.
###### Proof\.
We verify the lemma case by case. The decomposition $\mathbb{E}[\mathbf{g}_u\mathbf{g}_u^{\top}]=\mathbf{A}\bm{\Sigma}\mathbf{A}^{\top}+\mathbf{N}$ is shown for each loss; the diagonal-on-dominant-subspace property ([22](https://arxiv.org/html/2605.11685#A4.E22)) is then justified.

Representation-level losses (RMU, MLP Breaking). These act directly on $\mathbf{h}$, so $f_\theta(\mathbf{x})=\mathbf{h}_\theta(\mathbf{x})$. With the quadratic surrogate $\ell_u(\mathbf{h};\mathbf{x})=\tfrac{1}{2}\|\mathbf{h}-\mathbf{t}(\mathbf{x})\|^2$, we have $\mathbf{g}_u(\mathbf{x})=\mathbf{h}_o(\mathbf{x})-\mathbf{t}(\mathbf{x})$. Thus $\mathbf{A}=\mathbf{I}$, and

$$\mathbb{E}[\mathbf{g}_u\mathbf{g}_u^{\top}] = \bm{\Sigma} - \mathbb{E}[\mathbf{h}_o\mathbf{t}^{\top}] - \mathbb{E}[\mathbf{t}\mathbf{h}_o^{\top}] + \mathbb{E}[\mathbf{t}\mathbf{t}^{\top}] = \bm{\Sigma} + \mathbf{N}.$$

For RMU's random-direction targets and MLP Breaking's noise targets, $\mathbf{t}(\mathbf{x})$ is independent of $\mathbf{h}_o(\mathbf{x})$, so the cross terms vanish in expectation and $\|\mathbf{N}\|=\|\mathbb{E}[\mathbf{t}\mathbf{t}^{\top}]\|=O(\tau^2)$. Since $\mathbf{A}=\mathbf{I}$ commutes with $\bm{\Sigma}$, ([22](https://arxiv.org/html/2605.11685#A4.E22)) holds exactly with $\alpha_k=1$.
Output-level loss: Gradient Ascent (GA). Here $f_\theta(\mathbf{x})=\mathbf{z}_\theta(\mathbf{x})$ are the logits, with $\mathbf{z}=\mathbf{W}_{\mathrm{out}}\,\bm{\phi}(\mathbf{h})+\mathbf{b}$ for some downstream nonlinearity $\bm{\phi}$ and head $\mathbf{W}_{\mathrm{out}}$. The GA loss is $\ell_u=+\log p_\theta(y\mid\mathbf{x})$, whose gradient w.r.t. $\mathbf{h}$ is

$$\mathbf{g}_u(\mathbf{x}) \;=\; \mathbf{D}(\mathbf{x})\,\mathbf{W}_{\mathrm{out}}^{\top}\,\big(\mathbf{e}_{y(\mathbf{x})}-\mathbf{p}(\mathbf{x})\big),$$

where $\mathbf{p}(\mathbf{x})=\mathrm{softmax}(\mathbf{z}_o(\mathbf{x}))$ and $\mathbf{D}(\mathbf{x})=\partial\bm{\phi}/\partial\mathbf{h}|_{\mathbf{h}_o}$. We linearize $\mathbf{g}_u$ around the population mean $\bar{\mathbf{h}}=\mathbb{E}[\mathbf{h}_o]$. Since both $\mathbf{D}(\mathbf{x})$ and the residual $\mathbf{e}_{y(\mathbf{x})}-\mathbf{p}(\mathbf{x})$ depend on $\mathbf{x}$ only through $\mathbf{h}_o(\mathbf{x})$ (the labels $y(\mathbf{x})$ being deterministic given the prefix at a well-trained checkpoint), a first-order Taylor expansion gives

$$\mathbf{g}_u(\mathbf{x}) \;=\; \bar{\mathbf{g}} \;+\; \mathbf{A}\,(\mathbf{h}_o(\mathbf{x})-\bar{\mathbf{h}}) \;+\; \mathbf{r}(\mathbf{x}),\qquad \mathbf{A} \;:=\; \frac{\partial\mathbf{g}_u}{\partial\mathbf{h}_o}\bigg|_{\bar{\mathbf{h}}}, \tag{23}$$

with remainder $\|\mathbf{r}(\mathbf{x})\|=O(\|\mathbf{h}_o-\bar{\mathbf{h}}\|^2)$. Centering ($\bar{\mathbf{g}}$ is absorbed into a mean-correction term that yields an $O(\tau^2)$ contribution after centering $\mathbf{h}_o$) and using the centering of $\mathbf{h}_o$ assumed in Appendix [D.1](https://arxiv.org/html/2605.11685#A4.SS1) yields

$$\mathbb{E}[\mathbf{g}_u\mathbf{g}_u^{\top}] \;=\; \mathbf{A}\,\bm{\Sigma}\,\mathbf{A}^{\top} \;+\; \mathbf{N},\qquad \|\mathbf{N}\|=O(\tau^2),$$

where $\tau^2$ collects (i) the squared remainder and (ii) the residual magnitude $\|\mathbf{e}_y-\mathbf{p}\|^2$, which is small because the model has already fit $\mathcal{D}_f$ at the pre-unlearning checkpoint.
Output-level loss: NPO. NPO's loss ([2](https://arxiv.org/html/2605.11685#S2.E2)) is a sigmoid-shaped reweighting of the cross-entropy on $\mathcal{D}_f$ vs. a reference model:

$$\ell_u^{\mathrm{NPO}}(\mathbf{x}) \;=\; -\frac{2}{\beta}\,\log\sigma\!\left(-\beta\,\big[\log p_\theta(y\mid\mathbf{x})-\log p_{\mathrm{ref}}(y\mid\mathbf{x})\big]\right),$$

where $\sigma(\cdot)$ here denotes the logistic sigmoid rather than a singular value. Its representation gradient is $\mathbf{g}_u^{\mathrm{NPO}}(\mathbf{x})=w_\beta(\mathbf{x})\,\mathbf{g}_u^{\mathrm{GA}}(\mathbf{x})$ with $w_\beta(\mathbf{x})=2\,\sigma(\beta[\log p_\theta-\log p_{\mathrm{ref}}])\in(0,2)$. Linearizing $w_\beta$ around its mean as in ([23](https://arxiv.org/html/2605.11685#A4.E23)) gives $\mathbf{g}_u^{\mathrm{NPO}}=\bar{w}\,\mathbf{g}_u^{\mathrm{GA}}+O(\|\mathbf{h}_o-\bar{\mathbf{h}}\|^2)$, so

$$\mathbb{E}\big[\mathbf{g}_u^{\mathrm{NPO}}(\mathbf{g}_u^{\mathrm{NPO}})^{\top}\big] \;=\; \bar{w}^2\,\mathbf{A}\bm{\Sigma}\mathbf{A}^{\top} \;+\; \mathbf{N}',\qquad \|\mathbf{N}'\|=O(\tau^2),$$

i.e., the same map $\mathbf{A}$ scaled by a positive constant.
Diagonal property ([22](https://arxiv.org/html/2605.11685#A4.E22)) on the dominant subspace. For RMU/MLP Breaking, $\mathbf{A}=\mathbf{I}$ and ([22](https://arxiv.org/html/2605.11685#A4.E22)) is exact. For GA/NPO, observe that the principal components $\{\mathbf{v}_k\}_{k\leq K}$ are estimated from activations *at the same layer* where $\mathbf{g}_u$ lives; the well-trained head $\mathbf{W}_{\mathrm{out}}$ together with the diagonal-by-construction nonlinearity $\bm{\phi}$ and the softmax linearization tend to align $\mathbf{A}$'s singular vectors with the dominant subspace of $\bm{\Sigma}$ (otherwise the model could not have achieved low loss using only those directions). Concretely, decomposing $\mathbf{A}=\mathbf{V}\,\mathrm{diag}(\alpha)\,\mathbf{V}^{\top}+\mathbf{E}$ with $\mathbf{V}=[\mathbf{v}_1,\ldots,\mathbf{v}_d]$ and small off-diagonal $\mathbf{E}$, we get

$$\mathbf{v}_k^{\top}\mathbf{A}\bm{\Sigma}\mathbf{A}^{\top}\mathbf{v}_k \;=\; \alpha_k^2\sigma_k^2+\sum_{\ell\neq k}\sigma_\ell^2\,(\mathbf{v}_k^{\top}\mathbf{E}\mathbf{v}_\ell)^2,$$

where the squared diagonal entry $\alpha_k^2$ plays the role of the constant $\alpha_k$ in ([22](https://arxiv.org/html/2605.11685#A4.E22)), and the second term is dominated by $\|\mathbf{E}\|^2\sigma_1^2=O(\tau^2)$ provided $\mathbf{A}$ is approximately diagonal in the PCA basis. Even without this assumption, a Weyl-type inequality gives $\lambda_k(\mathbf{A}\bm{\Sigma}\mathbf{A}^{\top})\geq\sigma_{\min,K}^2(\mathbf{A})\,\sigma_k^2$, where $\sigma_{\min,K}(\mathbf{A})$ is the smallest singular value of $\mathbf{A}$ restricted to the top-$K$ subspace, so the qualitative scaling $\Theta(\sigma_k^2)$ on the dominant subspace is unchanged. ∎
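As a numerical sanity check of the eigenvalue scaling used in the last step, the following sketch verifies the full-space multiplicative bound (often stated as Ostrowski's theorem), $\sigma_{\min}^2(\mathbf{A})\,\sigma_k^2\leq\lambda_k(\mathbf{A}\bm{\Sigma}\mathbf{A}^{\top})\leq\sigma_{\max}^2(\mathbf{A})\,\sigma_k^2$, on synthetic data. The dimension, spectrum, and perturbation size are illustrative assumptions, and the check uses the singular values of the whole of $\mathbf{A}$ rather than its restriction to the top-$K$ subspace.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6

# Hypothetical sharply decaying spectrum, written in the PCA basis so Sigma is diagonal.
sigma2 = 4.0 ** -np.arange(d)
Sigma = np.diag(sigma2)

# Hypothetical loss-dependent map: identity plus a small dense perturbation.
A = np.eye(d) + 0.1 * rng.standard_normal((d, d))

# Eigenvalues of A Sigma A^T in descending order, and the singular values of A.
M = A @ Sigma @ A.T
lam = np.sort(np.linalg.eigvalsh((M + M.T) / 2))[::-1]
s = np.linalg.svd(A, compute_uv=False)

lower = s.min() ** 2 * sigma2
upper = s.max() ** 2 * sigma2
for k in range(d):
    print(f"k={k + 1}: {lower[k]:.2e} <= {lam[k]:.2e} <= {upper[k]:.2e}")
```

Each eigenvalue of $\mathbf{A}\bm{\Sigma}\mathbf{A}^{\top}$ stays within a constant factor of the matching $\sigma_k^2$, which is the sense in which a well-conditioned $\mathbf{A}$ leaves the $\Theta(\sigma_k^2)$ profile intact.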
[Lemma 1](https://arxiv.org/html/2605.11685#Thmlemma1) is the technical bridge that lets the same NTK-spectrum argument apply uniformly across representation-level and output-level losses. We restate the proofs of [Theorems 1](https://arxiv.org/html/2605.11685#Thmtheorem1) and [2](https://arxiv.org/html/2605.11685#Thmtheorem2) in this generality below.
### D.3 Proof of [Theorem 1](https://arxiv.org/html/2605.11685#Thmtheorem1) (Dominant-Component Concentration)

Unlearning is run with mini-batch SGD: at each step a sample $\mathbf{x}_t\sim\mathcal{D}_f$ produces the stochastic update $\Delta\theta_t=-\eta\,\mathbf{J}(\mathbf{x}_t)^{\top}\mathbf{g}_u(\mathbf{x}_t)$. Substituting into the linearized representation and evaluating at any forget-set point $\mathbf{x}'$:

$$\Delta\mathbf{h}_t(\mathbf{x}') \;=\; \mathbf{J}(\mathbf{x}')\,\Delta\theta_t \;=\; -\eta\,\mathbf{K}(\mathbf{x}',\mathbf{x}_t)\,\mathbf{g}_u(\mathbf{x}_t). \tag{24}$$

Under $\mathbf{K}(\mathbf{x}',\mathbf{x}_t)\approx\kappa\mathbf{I}$ this becomes $\Delta\mathbf{h}_t(\mathbf{x}')\approx-\eta\kappa\,\mathbf{g}_u(\mathbf{x}_t)$. Note that we keep the *stochastic, per-sample* residual rather than the population mean: taking the expectation *before* squaring (as in $\mathbb{E}[\mathbf{g}_u]$) would mix signal and noise scales incorrectly. Instead, the appropriate quantity for the change-ratio metric ([5](https://arxiv.org/html/2605.11685#S3.E5)), which is computed from squared inner products averaged over $\mathbf{x}'$, is the *expected squared per-sample displacement*.
Projecting onto $\mathbf{v}_k$ and squaring:

$$\mathbb{E}_{\mathbf{x}_t}\!\big[\langle\Delta\mathbf{h}_t(\mathbf{x}'),\mathbf{v}_k\rangle^2\big] \;\approx\; \eta^2\kappa^2\,\mathbb{E}_{\mathbf{x}_t}\!\big[\langle\mathbf{g}_u(\mathbf{x}_t),\mathbf{v}_k\rangle^2\big] \;=\; \eta^2\kappa^2\,\mathbf{v}_k^{\top}\mathbb{E}[\mathbf{g}_u\mathbf{g}_u^{\top}]\,\mathbf{v}_k. \tag{25}$$

By [Lemma 1](https://arxiv.org/html/2605.11685#Thmlemma1) (in particular ([22](https://arxiv.org/html/2605.11685#A4.E22))),

$$\mathbf{v}_k^{\top}\mathbb{E}[\mathbf{g}_u\mathbf{g}_u^{\top}]\,\mathbf{v}_k \;=\; \alpha_k\,\sigma_k^2+O(\tau^2), \tag{26}$$

for a loss-dependent positive constant $\alpha_k$ that is bounded away from $0$ on the dominant subspace. Accumulating $T$ small i.i.d. steps, the squared displacement averaged over the forget set scales as

$$\mathbb{E}_{\mathcal{D}_f}\!\big[\langle\mathbf{h}_u-\mathbf{h}_o,\mathbf{v}_k\rangle^2\big] \;\propto\; T\,\sigma_k^2 \;+\; O(\tau^2), \tag{27}$$

matching Equation [7](https://arxiv.org/html/2605.11685#S3.E7). Equivalently, the typical absolute displacement scales as $\sqrt{T}\,\sigma_k$, so the un-normalized numerator of the change-ratio ([5](https://arxiv.org/html/2605.11685#S3.E5)) mirrors the singular-value (square-root explained-variance) profile of $\bm{\Sigma}$. The argument is identical for representation-level and output-level losses; the only loss-specific quantity is the constant prefactor $\alpha_k$, which does not affect the qualitative scaling. $\square$
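The scaling in (27) can be illustrated with a small Monte Carlo sketch. It rests on simplifying assumptions that are not part of the proof: an RMU-like loss with $\mathbf{A}=\mathbf{I}$, an exactly isotropic kernel $\mathbf{K}=\kappa\mathbf{I}$, Gaussian coordinates in the PCA basis, and residuals evaluated at $\mathbf{h}_o$ throughout (the small-step regime). All numerical values (`eta`, `kappa`, `tau`, the spectrum) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, R, T = 8, 2000, 50                 # dimension, independent runs, steps per run
eta, kappa, tau = 1e-2, 1.0, 1e-2     # hypothetical step size, NTK scale, target scale

# Work in the PCA basis: coordinate k has standard deviation sigma_k, decaying in k.
sigma = 2.0 ** -np.arange(d)

# Per-step forget samples h_o(x_t) and small independent targets t(x_t).
h = rng.standard_normal((R, T, d)) * sigma
t = tau * rng.standard_normal((R, T, d))
g = h - t                              # effective residual g_u = h_o - t (A = I)

# Under K = kappa * I each step moves the representation by -eta * kappa * g_u(x_t);
# summing over T steps gives the total displacement h_u - h_o for each run.
total = (-eta * kappa * g).sum(axis=1)

measured = (total ** 2).mean(axis=0)   # estimate of E[<h_u - h_o, v_k>^2]
predicted = T * (eta * kappa) ** 2 * (sigma ** 2 + tau ** 2)
print(np.round(measured / predicted, 2))
```

Under these assumptions the measured squared displacement along each $\mathbf{v}_k$ tracks $T\,\eta^2\kappa^2(\sigma_k^2+\tau^2)$, so the dominant coordinates absorb nearly all of the change while the minor ones sit near the $\tau^2$ floor.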
### D.4 Proof of [Theorem 2](https://arxiv.org/html/2605.11685#Thmtheorem2) (Dominant-Component Recoverability)

Let $\mathcal{D}_r$ denote the relearning distribution used by the attacker. Under the standard threat model, $\mathcal{D}_r$ is structurally similar to $\mathcal{D}_f$, so

$$\bm{\Sigma}_r \;=\; \mathbb{E}_{\mathcal{D}_r}[\mathbf{h}_o\mathbf{h}_o^{\top}] \;\approx\; \sum_{k}\tilde{\sigma}_k^2\,\mathbf{v}_k\mathbf{v}_k^{\top},\qquad \tilde{\sigma}_k^2\asymp\sigma_k^2. \tag{28}$$

Standard relearning attacks use a maximum-likelihood (cross-entropy) objective on $\mathcal{D}_r$. By the same chain-rule decomposition ([20](https://arxiv.org/html/2605.11685#A4.E20)), the relearning gradient is $\nabla_\theta\mathcal{L}_r=\mathbb{E}_{\mathcal{D}_r}[\mathbf{J}(\mathbf{x})^{\top}\mathbf{g}_r(\mathbf{x})]$ with effective residual $\mathbf{g}_r(\mathbf{x})$. The same case analysis as in [Lemma 1](https://arxiv.org/html/2605.11685#Thmlemma1) (specialized to GA-style cross-entropy on $\mathcal{D}_r$) gives $\mathbb{E}[\mathbf{g}_r\mathbf{g}_r^{\top}]=\mathbf{A}_r\bm{\Sigma}_r\mathbf{A}_r^{\top}+O(\tau^2)$, with $\mathbf{A}_r$ approximately diagonal in the PCA basis on the dominant subspace. Iterating the linearized dynamics over $T_r$ relearning steps with constant step size, the projection of $\mathbf{h}_u-\mathbf{h}_r$ onto $\mathbf{v}_k$ approaches the projection of the full unlearning displacement $\mathbf{h}_u-\mathbf{h}_o$ exponentially, with rate $c\,\sigma_k^2$ for some constant $c>0$:

$$\mathrm{Recovery\ Ratio}_k \;=\; \frac{\langle\mathbf{h}_u-\mathbf{h}_r,\mathbf{v}_k\rangle}{\langle\mathbf{h}_u-\mathbf{h}_o,\mathbf{v}_k\rangle} \;\approx\; 1-\exp\!\big(-c\,\sigma_k^2\,T_r\big). \tag{29}$$

The result depends on neither the specific unlearning loss used to produce $\mathbf{h}_u$ (since the recovery ratio is normalized by the actual unlearning displacement) nor the specific relearning loss family (any loss whose effective residual covariance shares the dominant eigenstructure of $\bm{\Sigma}_r$ yields the same scaling). Reaching a fixed recovery level $1-\delta$ on direction $\mathbf{v}_k$ thus requires $T_r=O\!\big(\sigma_k^{-2}\log(1/\delta)\big)$ steps; dominant components saturate to $1$ within a few attack steps, while minor components require an *inverse-variance* blow-up in the number of steps. $\square$
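To make the inverse-variance blow-up concrete, the sketch below evaluates the closed form in (29) and inverts it for the number of steps needed to reach recovery level $1-\delta$. The rate constant `c` and the example variances are hypothetical; the function names are introduced here only for illustration.

```python
import numpy as np

def recovery_ratio(sigma2, steps, c=1.0):
    """Closed form of Eq. (29): fraction of the unlearning displacement undone along v_k."""
    return 1.0 - np.exp(-c * np.asarray(sigma2) * steps)

def steps_to_recover(sigma2, delta, c=1.0):
    """Relearning steps T_r needed to reach recovery level 1 - delta along v_k."""
    return np.log(1.0 / delta) / (c * np.asarray(sigma2))

# Hypothetical explained variances: one dominant, one intermediate, one minor component.
sigma2 = np.array([1.0, 0.1, 0.001])

print(recovery_ratio(sigma2, steps=10))        # recovery after 10 attack steps
print(steps_to_recover(sigma2, delta=0.05))    # steps to reach 95% recovery
```

With these illustrative values, 95% recovery takes roughly 3, 30, and 3000 steps respectively, since the required $T_r$ grows as $\sigma_k^{-2}$.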
### D.5 Why Minor Components Are Structurally Hard to Recover

The inverse-variance recovery scaling above explains *rate* differences but not why minor components fail to recover even when the attacker invests a large $T_r$. We complete the picture with a signal-to-noise (SNR) argument on the relearning gradient.

Decompose the per-sample representation as $\mathbf{h}_o(\mathbf{x})=\sum_k a_k(\mathbf{x})\,\mathbf{v}_k$ with coefficients $a_k(\mathbf{x})=\langle\mathbf{h}_o(\mathbf{x}),\mathbf{v}_k\rangle$, so $\mathbb{E}[a_k(\mathbf{x})]=0$ and $\mathbb{E}[a_k(\mathbf{x})^2]=\sigma_k^2$ by construction. Because the coefficients are mean-zero, a naive cross-sample correlation $\mathbb{E}_{\mathbf{x},\mathbf{x}'}[a_k(\mathbf{x})a_k(\mathbf{x}')]$ across *independent* samples vanishes identically and conveys no information; instead, agreement must be measured *conditionally* on shared structure. Concretely, let $c$ denote a latent context variable (e.g., the topic, document, or local sub-distribution from which $\mathbf{x}$ is drawn) and decompose

$$a_k(\mathbf{x}) \;=\; s_k(c) \;+\; \epsilon_k(\mathbf{x}),\qquad s_k(c):=\mathbb{E}[a_k(\mathbf{x})\mid c],\quad \mathbb{E}[\epsilon_k\mid c]=0. \tag{30}$$

We define the (shared-signal) cross-sample agreement of the $k$-th coordinate as the fraction of variance carried by the context-shared component:

$$\rho_k \;=\; \frac{\mathrm{Var}_c\!\big(s_k(c)\big)}{\sigma_k^2} \;=\; \frac{\mathbb{E}_{\mathbf{x},\mathbf{x}'\mid c}\!\big[a_k(\mathbf{x})\,a_k(\mathbf{x}')\big]}{\sigma_k^2}, \tag{31}$$

where the second equality holds when $\mathbf{x},\mathbf{x}'$ are drawn *conditional on the same context* $c$. Equivalently, $\rho_k$ is the intraclass correlation along $\mathbf{v}_k$ (the between-context share of the total variance), and $\sqrt{\rho_k}$ is the cosine alignment between the per-sample gradient $\nabla_\theta\ell_r(\mathbf{x})$ and its batch average projected onto $\mathbf{v}_k$. With this definition, a finite-batch relearning gradient along $\mathbf{v}_k$ has signal $\propto\sigma_k\sqrt{\rho_k}$ and noise $\propto\sigma_k\sqrt{(1-\rho_k)/B}$ for batch size $B$, giving an SNR of order $\sqrt{B\rho_k/(1-\rho_k)}$. Two regimes emerge:
- Dominant components: $\rho_k\to 1$, since these directions encode features shared between $\mathcal{D}_r$ and $\mathcal{D}_f$ at the topic/context level (e.g., topical regularities, syntactic patterns). Relearning gradients along $\mathbf{v}_k$ accumulate coherently across the batch, and recovery proceeds at the rate predicted by [Theorem 2](https://arxiv.org/html/2605.11685#Thmtheorem2).
- Minor components: $\rho_k\approx 0$, since these directions encode sample-specific structure that varies idiosyncratically within each context. The batched relearning gradient averages out, the SNR collapses, and no amount of attacker fine-tuning on *related* data can reliably reconstruct the minor-component values that the original model used for the held-out forget samples.

This formalizes the intuition stated after Observation 3 and is consistent with both [Figure 2(c)](https://arxiv.org/html/2605.11685#S3.F2.sf3) and the cross-loss replications in Appendix [I](https://arxiv.org/html/2605.11685#A9).
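The two regimes can be reproduced with a toy simulation of the decomposition in (30). This is a sketch under assumptions not made in the text: Gaussian shared signal and Gaussian within-context noise, one context per batch, and hypothetical values for `sigma_k`, the batch size, and $\rho_k$.

```python
import numpy as np

def batch_snr(rho, batch_size):
    """Predicted SNR of the batched gradient along v_k: sqrt(B * rho / (1 - rho))."""
    return np.sqrt(batch_size * rho / (1.0 - rho))

rng = np.random.default_rng(0)
sigma_k, B, n_contexts = 1.0, 16, 20000
results = {}
for rho in (0.9, 0.01):
    # a_k(x) = s_k(c) + eps_k(x), with Var(s_k) = rho * sigma_k^2 as in Eq. (30).
    s = rng.normal(0.0, sigma_k * np.sqrt(rho), size=n_contexts)
    eps = rng.normal(0.0, sigma_k * np.sqrt(1.0 - rho), size=(n_contexts, B))
    batch_mean = s + eps.mean(axis=1)          # B samples sharing one context
    empirical = s.std() / (batch_mean - s).std()
    results[rho] = (empirical, batch_snr(rho, B))
    print(f"rho={rho}: empirical SNR {empirical:.2f}, predicted {results[rho][1]:.2f}")
```

For a batch size of 16 the predicted SNR is 12 at $\rho_k=0.9$ and about 0.4 at $\rho_k=0.01$, so in this toy setting the shared signal dominates along a dominant direction and is buried in noise along a minor one.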
### D.6 Discussion of Assumptions

The two assumptions used above merit comment.

(i) NTK linearization. The lazy-regime approximation $\mathbf{K}(\mathbf{x},\mathbf{x}')\approx\kappa\mathbf{I}$ is exact only in the infinite-width limit; in practice it is a useful first-order approximation when the unlearning step size and number of steps remain modest, which is precisely the regime in which fragile unlearning operates (otherwise utility on retain data collapses). The qualitative conclusions (change-ratio scaling with $\sigma_k^2$ and exponential recovery with rate $\sigma_k^2$) survive any anisotropy in $\mathbf{K}$ that is not specifically aligned against the dominant subspace.

(ii) Effective-residual eigenstructure ([Lemma 1](https://arxiv.org/html/2605.11685#Thmlemma1)). The proof verified this for RMU, MLP Breaking, GA, and NPO. For representation-level losses, the assumption reduces to the target $\mathbf{t}(\mathbf{x})$ being uncorrelated with the dominant subspace, which holds for the random/noise/zero targets used in practice. For output-level losses, the assumption follows because (a) the model's predictions on $\mathcal{D}_f$ depend on $\mathbf{x}$ only through $\mathbf{h}_o(\mathbf{x})$, so any sample-to-sample variation in the output residual is mediated by $\bm{\Sigma}$, and (b) the post-$\mathbf{h}$ Jacobian $\mathbf{D}\mathbf{W}_{\mathrm{out}}^{\top}$ is full-rank on the dominant subspace at any well-trained checkpoint. The empirical universality of Observation 2 across unlearning losses (Appendix [I](https://arxiv.org/html/2605.11685#A9)) confirms that the assumption holds in practice for both representation-level and output-level losses.
## Appendix E Representation-Analysis Setup and Details

This appendix expands the representation-analysis protocol summarized in [Section 3.1](https://arxiv.org/html/2605.11685#S3.SS1).
#### Modules and layers\.
We instrument the MLP `down_proj` output of every transformer block of Llama-3.1-8B (32 layers). For each layer we record activations at each token position of every example in the forget set $\mathcal{D}_{\mathrm{f}}$, yielding a tensor $\mathbf{H}^{(\ell)}\in\mathbb{R}^{N\times d}$ per layer $\ell$, where $N$ is the total number of tokens in $\mathcal{D}_{\mathrm{f}}$ and $d=4096$ is the hidden dimension. Results for other MLP sub-modules, shown in Appendix [I](https://arxiv.org/html/2605.11685#A9), exhibit the same qualitative pattern.
#### PCA computation\.
Per layer, we center $\mathbf{H}^{(\ell)}$ by subtracting the per-coordinate mean and compute the principal components $\{\mathbf{v}_1^{(\ell)},\ldots,\mathbf{v}_d^{(\ell)}\}$ via randomized SVD \[Halko et al., [2011](https://arxiv.org/html/2605.11685#bib.bib3)\], with explained variances $\sigma_k^{(\ell)\,2}$. The same eigenbasis is used to project the unlearned and relearned activations $\mathbf{h}_u,\mathbf{h}_r$ collected at the same token positions. All ratios reported in [Figures 2(b)](https://arxiv.org/html/2605.11685#S3.F2.sf2) and [2(c)](https://arxiv.org/html/2605.11685#S3.F2.sf3) are computed per layer and then averaged across layers.
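A minimal sketch of this per-layer computation is given below. It is not the authors' pipeline: it substitutes an exact SVD from NumPy for randomized SVD, uses synthetic activations in place of recorded ones, and reports only the un-normalized mean squared projection of the displacement onto each PC (the exact normalization of the change ratio is not reproduced here). The helper names `pca_basis` and `mean_sq_projection` are hypothetical.

```python
import numpy as np

def pca_basis(H):
    """Center token activations H (N x d); return the mean, explained variances, and PCs."""
    mean = H.mean(axis=0)
    _, s, Vt = np.linalg.svd(H - mean, full_matrices=False)
    return mean, s ** 2 / H.shape[0], Vt.T      # columns of the last item are v_1..v_d

def mean_sq_projection(delta, V):
    """Mean over tokens of the squared projection of each displacement onto every PC."""
    return ((delta @ V) ** 2).mean(axis=0)

rng = np.random.default_rng(0)
N, d = 2000, 32
# Synthetic stand-ins for activations of the original and unlearned models at one layer.
H_o = rng.standard_normal((N, d)) * (1.5 ** -np.arange(d))
H_u = 0.5 * H_o                                 # placeholder "unlearned" activations

mean, var, V = pca_basis(H_o)
change = mean_sq_projection(H_u - H_o, V)       # same eigenbasis reused for the displacement
print(np.round(var[:4], 3), np.round(change[:4], 3))
```

In a real run, `H_o` and `H_u` would hold the activations recorded at identical token positions for one layer, and the resulting per-layer ratios would then be averaged across layers as described above.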
## Appendix F Experimental Details
#### Dataset details\.
For WMDP-Cyber and WMDP-Bio, we use the high-quality subsets of Sondej and Yang \[[2025](https://arxiv.org/html/2605.11685#bib.bib8)\] containing 203 cyber and 144 biological multiple-choice questions, each augmented with three short declarative sentences that together form the forget set used for unlearning. The Years dataset \[Deeb and Roger, [2024](https://arxiv.org/html/2605.11685#bib.bib1)\] consists of 20th-century events paired with their dates. As retain sets we use FineFineWeb \[M-A-P et al., [2024](https://arxiv.org/html/2605.11685#bib.bib16)\] subsets matched to the forget domain: `biology` for WMDP-Bio, `computer_science_and_technology` for WMDP-Cyber, and `fineweb-edu` for Years.
#### Baselines\.
Each evaluated forget loss is paired with a specific retain loss to preserve model utility. For Gradient Ascent (GA) and NPO \[Zhang et al., [2024](https://arxiv.org/html/2605.11685#bib.bib11)\], we apply the standard cross-entropy loss on the retain set. For RMU \[Li et al., [2024](https://arxiv.org/html/2605.11685#bib.bib5)\] and MLP Breaking, following Li et al. \[[2024](https://arxiv.org/html/2605.11685#bib.bib5)\], we use a loss that penalizes the norm of the difference between the current and original model's representations on the retain set, which encourages minimal deviation from the original model's representations.
#### Accuracy Computation\.
Following Sondej and Yang \[[2025](https://arxiv.org/html/2605.11685#bib.bib8)\], we compute accuracy as the expected probability of selecting the correct answer. Specifically, for multiple-choice questions with $k$ options, we compute the probability distribution over answer choices using softmax with temperature $\tau=1$ on the logits corresponding to the answer tokens. The accuracy for a batch is then computed as:

$$\text{Accuracy}=\frac{1}{|B|}\sum_{i\in B}p_i^{(\text{correct})}, \tag{32}$$

where $p_i^{(\text{correct})}$ denotes the probability assigned to the correct answer for sample $i$, and $B$ is the batch. This expected accuracy metric provides a more fine-grained measure than hard accuracy (which only counts exact matches) and is more sensitive to partial knowledge changes during unlearning and relearning.
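The metric in (32) can be sketched in a few lines. The function name `expected_accuracy` and the example logits are hypothetical; the sketch assumes the answer-token logits have already been gathered into an array with one column per option.

```python
import numpy as np

def expected_accuracy(answer_logits, correct_idx, temperature=1.0):
    """Mean softmax probability assigned to the correct option, as in Eq. (32).

    answer_logits: (batch, k) logits of the k answer tokens for each question.
    correct_idx:   (batch,) index of the correct option for each question.
    """
    z = np.asarray(answer_logits, dtype=float) / temperature
    z = z - z.max(axis=1, keepdims=True)            # subtract the max for numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return probs[np.arange(len(probs)), correct_idx].mean()

# Two 4-way questions: one where the model favors the correct option, one at chance.
logits = np.array([[2.0, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, 0.0]])
print(expected_accuracy(logits, np.array([0, 3])))
```

A model with uniform answer logits scores exactly $1/k$ under this metric, which is why the random-chance level for 4-way questions is 25%.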
#### Relearning Attack Protocol\.
We follow the Retraining-on-$T$ (RTT) attack protocol proposed by Deeb and Roger \[[2024](https://arxiv.org/html/2605.11685#bib.bib1)\]. After unlearning on the full forget set $T\cup V$, we fine-tune the unlearned model on the training partition $T$ (80% of the forget set) and evaluate accuracy recovery on the held-out validation partition $V$ (20% of the forget set). For WMDP-Cyber and WMDP-Bio, we perform relearning for 100 epochs, while for the Years dataset we use 30 epochs due to its smaller size. To obtain robust estimates of post-attack accuracy, we follow Sondej and Yang \[[2025](https://arxiv.org/html/2605.11685#bib.bib8)\] and smooth the relearning accuracy curve by averaging over windows of 10 epochs for WMDP datasets and 3 epochs for Years. The reported Relearn accuracy corresponds to the maximum smoothed accuracy across the relearning trajectory, as some attack runs may exceed the optimal number of epochs.
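The smoothing-then-maximum step can be sketched as follows. The text does not say whether the averaging windows overlap; this sketch assumes a sliding (overlapping) window, and the function name `relearn_accuracy` and the example curve are hypothetical.

```python
import numpy as np

def relearn_accuracy(acc_per_epoch, window):
    """Maximum of the moving-average accuracy over the relearning trajectory.

    Assumes a sliding window of `window` epochs; requires len(acc_per_epoch) >= window.
    """
    acc = np.asarray(acc_per_epoch, dtype=float)
    smoothed = np.convolve(acc, np.ones(window) / window, mode="valid")
    return smoothed.max()

# Hypothetical 30-epoch curve: accuracy rises, peaks, then degrades from over-training.
epochs = np.arange(30)
curve = 0.3 + 0.3 * np.exp(-((epochs - 12) / 6.0) ** 2)
print(relearn_accuracy(curve, window=3))
```

Taking the maximum of the smoothed curve rather than its final value credits the attacker with the best stopping point, while the averaging keeps a single noisy epoch from setting the reported number.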
#### Hyperparameter Selection for MCU\.
The key hyperparameter in our MCU method is $K$, the number of principal components to project out before computing the unlearning loss ([Equation 11](https://arxiv.org/html/2605.11685#S4.E11)). We perform a grid search over $K\in\{1,2,4,8,16,32,64\}$ and select the value that achieves the best trade-off between forget quality (low relearn accuracy) and model utility.
#### Unlearning Termination Criterion\.
Following Sondej and Yang \[[2025](https://arxiv.org/html/2605.11685#bib.bib8)\], we use the WikiText loss \[Merity et al., [2016](https://arxiv.org/html/2605.11685#bib.bib14)\] as a criterion to determine when to terminate unlearning, in order to control for disruption to general language modeling performance. Specifically, we monitor the WikiText loss relative to its initial value before unlearning. Since different unlearning methods affect the WikiText loss differently, we use method-specific termination thresholds. The WikiText loss threshold primarily controls the number of training steps; we select thresholds for each method such that the unlearn accuracy approaches random chance (approximately 25% for 4-way multiple choice) while maintaining reasonable MMLU performance.
## Appendix G Additional Results

Table [4](https://arxiv.org/html/2605.11685#A7.T4) presents all experimental results across three datasets: WMDP-Cyber, WMDP-Bio, and Years.
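Table 4: Complete experimental results on WMDP-Cyber, WMDP-Bio, and Years datasets. Highlighted rows are MCU variants; the grey rows are original-model baselines.

| Dataset | Method | MMLU (↑) | WikiText (↓) | Forget (↓) | Relearn (↓) | Δ (↓) |
|---|---|---|---|---|---|---|
| WMDP-Cyber | Original model | 65.1 | 1.000 | 57.6 | - | - |
| WMDP-Cyber | GA | 61.3 | 1.503 | 25.1 | 57.0 | 31.9 |
| WMDP-Cyber | GA + SAM | 60.1 | 1.100 | 27.4 | 57.7 | 30.3 |
| WMDP-Cyber | NPO | 60.6 | 1.578 | 23.9 | 57.1 | 33.2 |
| WMDP-Cyber | NPO + SAM | 60.1 | 1.101 | 27.7 | 57.7 | 30.0 |
| WMDP-Cyber | RMU | 52.8 | 1.207 | 28.7 | 54.6 | 26.0 |
| WMDP-Cyber | RMU + SAM | 52.7 | 1.201 | 28.3 | 53.5 | 25.1 |
| WMDP-Cyber | RMU + MCU | 49.3 | 1.202 | 28.8 | 53.9 | 25.1 |
| WMDP-Cyber | RMU + CIR | 64.1 | 1.006 | 28.0 | 50.5 | 22.5 |
| WMDP-Cyber | RMU + CIR + SAM | 64.4 | 1.006 | 24.7 | 49.2 | 24.4 |
| WMDP-Cyber | RMU + CIR + MCU | 64.7 | 1.006 | 31.7 | 43.3 | 11.6 |
| WMDP-Cyber | MLP Breaking | 61.8 | 1.132 | 27.3 | 55.8 | 28.4 |
| WMDP-Cyber | MLP Breaking + SAM | 58.0 | 1.215 | 25.6 | 47.4 | 21.8 |
| WMDP-Cyber | MLP Breaking + MCU | 60.4 | 1.105 | 26.0 | 52.0 | 26.0 |
| WMDP-Cyber | MLP Breaking + CIR | 65.0 | 1.001 | 20.2 | 37.2 | 17.0 |
| WMDP-Cyber | MLP Breaking + CIR + SAM | 65.3 | 1.005 | 26.2 | 28.5 | 2.3 |
| WMDP-Cyber | MLP Breaking + CIR + MCU | 64.7 | 1.001 | 25.9 | 31.6 | 5.7 |
| WMDP-Bio | Original model | 65.1 | 1.000 | 64.0 | - | - |
| WMDP-Bio | GA | 56.1 | 1.528 | 29.0 | 69.0 | 40.0 |
| WMDP-Bio | GA + SAM | 53.5 | 1.305 | 29.3 | 62.4 | 33.1 |
| WMDP-Bio | NPO | 51.9 | 1.419 | 25.9 | 66.7 | 40.8 |
| WMDP-Bio | NPO + SAM | 53.4 | 1.307 | 29.2 | 61.9 | 32.7 |
| WMDP-Bio | RMU | 45.5 | 1.300 | 26.7 | 49.8 | 23.1 |
| WMDP-Bio | RMU + SAM | 51.9 | 1.183 | 27.2 | 49.0 | 21.8 |
| WMDP-Bio | RMU + MCU | 46.5 | 1.205 | 28.0 | 48.7 | 20.7 |
| WMDP-Bio | RMU + CIR | 63.7 | 1.010 | 27.9 | 47.6 | 19.7 |
| WMDP-Bio | RMU + CIR + SAM | 57.2 | 1.006 | 24.2 | 33.0 | 8.8 |
| WMDP-Bio | RMU + CIR + MCU | 63.5 | 1.010 | 32.2 | 39.5 | 7.3 |
| WMDP-Bio | MLP Breaking | 58.7 | 1.364 | 29.9 | 58.8 | 28.9 |
| WMDP-Bio | MLP Breaking + SAM | 57.5 | 1.212 | 21.8 | 48.5 | 26.7 |
| WMDP-Bio | MLP Breaking + MCU | 61.2 | 1.343 | 27.6 | 57.1 | 29.6 |
| WMDP-Bio | MLP Breaking + CIR | 64.8 | 1.001 | 22.1 | 29.7 | 7.6 |
| WMDP-Bio | MLP Breaking + CIR + SAM | 64.9 | 1.002 | 25.9 | 31.4 | 5.5 |
| WMDP-Bio | MLP Breaking + CIR + MCU | 64.5 | 1.001 | 22.8 | 26.4 | 3.6 |
| Years | Original model | 65.1 | 1.000 | 68.4 | - | - |
| Years | GA | 63.9 | 1.529 | 46.1 | 64.0 | 18.0 |
| Years | GA + SAM | 56.3 | 1.360 | 25.8 | 63.7 | 37.9 |
| Years | NPO | 58.5 | 1.404 | 27.9 | 63.4 | 35.6 |
| Years | NPO + SAM | 56.0 | 1.363 | 25.8 | 62.9 | 37.1 |
| Years | RMU | 56.8 | 1.204 | 33.0 | 64.3 | 31.4 |
| Years | RMU + SAM | 54.9 | 1.201 | 34.0 | 65.6 | 31.6 |
| Years | RMU + MCU | 52.0 | 1.193 | 30.8 | 54.9 | 24.1 |
| Years | RMU + CIR | 57.1 | 1.102 | 30.7 | 37.4 | 6.7 |
| Years | RMU + CIR + SAM | 57.1 | 1.111 | 29.7 | 36.6 | 6.9 |
| Years | RMU + CIR + MCU | 57.2 | 1.101 | 31.7 | 33.7 | 2.0 |
| Years | MLP Breaking | 60.8 | 1.239 | 27.0 | 51.5 | 24.5 |
| Years | MLP Breaking + SAM | 58.0 | 1.215 | 25.6 | 47.4 | 21.8 |
| Years | MLP Breaking + MCU | 61.5 | 1.214 | 29.0 | 49.5 | 20.6 |
| Years | MLP Breaking + CIR | 64.6 | 1.010 | 32.7 | 39.4 | 6.7 |
| Years | MLP Breaking + CIR + SAM | 64.3 | 1.021 | 26.6 | 36.9 | 10.3 |
| Years | MLP Breaking + CIR + MCU | 63.8 | 1.010 | 25.9 | 30.9 | 6.5 |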
## Appendix HCross\-Model Generality
To assess whether the benefits of MCU transfer beyond Llama-3.1-8B, we evaluate it on two additional model families, Gemma2-9B and Qwen3-8B, across all three forget datasets used in the paper. We use the same training and evaluation pipeline as in Appendix [G](https://arxiv.org/html/2605.11685#A7), and compare the strongest representation-based baseline (MLP Breaking + CIR) against the corresponding MCU variant (MLP Breaking + CIR + MCU).
[Table 5](https://arxiv.org/html/2605.11685#A8.T5) reports the results. Across both Gemma2-9B and Qwen3-8B, adding MCU consistently lowers the post-attack relearning gap $\Delta$ over the MLP Breaking + CIR baseline while keeping MMLU essentially unchanged. The improvement is substantial on WMDP-Cyber for both models (Gemma2-9B: $4.0 \rightarrow 1.9$; Qwen3-8B: $13.7 \rightarrow 7.4$) and on Years (Gemma2-9B: $6.3 \rightarrow 2.7$; Qwen3-8B: $1.1 \rightarrow 0.8$); on WMDP-Bio with Qwen3-8B, MCU even drives $\Delta$ slightly negative, indicating that the relearning attack fails to recover any forgotten knowledge above the post-unlearning level. These results corroborate our main finding that explicitly redirecting forgetting into the minor-component subspace yields more robust unlearning, and that this benefit is not specific to a single base model.
Table 5: Cross-model evaluation on Gemma2-9B and Qwen3-8B across all three datasets. MCU is applied on top of the best MLP Breaking + CIR setting. Lower $\Delta$ is better.
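To make the phrase "minor-component subspace" concrete, the following NumPy sketch projects an arbitrary representation-space update onto the span of the lowest-variance principal directions. It is an illustration under stated assumptions, not the MCU objective itself: the data are synthetic, and the names `num_minor`, `minor_basis`, and `project_to_minor` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "representations": 512 samples, 16 dims, with rapidly decaying
# variance so that a few directions dominate (a stand-in for real hidden states).
scales = np.array([2.0 ** -i for i in range(16)])
reps = rng.normal(size=(512, 16)) * scales

# PCA via SVD of the centered data; rows of vt are principal directions,
# ordered from dominant (largest variance) to minor (smallest variance).
centered = reps - reps.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)

num_minor = 8                      # hypothetical choice of subspace size
minor_basis = vt[-num_minor:]      # shape (num_minor, dim), orthonormal rows
dominant_basis = vt[:-num_minor]   # the remaining, dominant directions


def project_to_minor(update: np.ndarray) -> np.ndarray:
    """Keep only the part of `update` lying in the minor-component subspace."""
    return (update @ minor_basis.T) @ minor_basis


update = rng.normal(size=16)       # a hypothetical representation-space update
minor_update = project_to_minor(update)

# Largest remaining coefficient along any dominant direction (numerically zero).
print(np.abs(dominant_basis @ minor_update).max())
```

Because the principal directions are orthonormal, the projected update has no component along the dominant directions, which is the geometric sense in which an unlearning change can be confined to minor components.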
## Appendix I Consistency of Observations 2–3 Across Unlearning Losses
In [Section 3](https://arxiv.org/html/2605.11685#S3), the change-ratio (Equation [5](https://arxiv.org/html/2605.11685#S3.E5)) and recovery-ratio plots in [Figure 2](https://arxiv.org/html/2605.11685#S3.F2) are reported for GA. Here we show that the same qualitative pattern (unlearning concentrates changes in dominant components, and relearning preferentially recovers them) holds across the full set of unlearning losses considered in our experiments: NPO, RMU, MLP Breaking, and their CIR-augmented variants (RMU + CIR, MLP Breaking + CIR). All experiments use full fine-tuning on Llama-3.1-8B with the WMDP-Cyber forget set, following the setup in [Section 3](https://arxiv.org/html/2605.11685#S3).
[Figure 4](https://arxiv.org/html/2605.11685#A9.F4) and Figure 5 report, for each method, (left) the unlearn change ratio across principal-component indices and (right) the corresponding recovery ratio after the RTT relearning attack. Across all five methods, the change-ratio mass is concentrated in the first few PCs, and the recovery ratio is highest for these dominant components and decays toward the minor components. This confirms that Observations 2 and 3 are not specific to GA, but reflect a property shared by representation-level (RMU, MLP Breaking) and output-level (NPO) unlearning losses, with and without CIR-style gradient filtering. In other words, the dominant-component vulnerability that motivates MCU is a generic property of current LLM unlearning pipelines rather than an artifact of any particular loss.
Figure 4: Consistency of Observations 2–3 across unlearning losses on WMDP-Cyber (Llama-3.1-8B, full fine-tuning), part 1. Panels: MLP Breaking + CIR, RMU + CIR, NPO. For each method, the left plot (Unlearn Change Ratio) shows the per-PC change ratio induced by unlearning and the right plot (Recovery Ratio) shows the per-PC recovery ratio after the RTT relearning attack.

Figure 5: Consistency of Observations 2–3 across unlearning losses on WMDP-Cyber (Llama-3.1-8B, full fine-tuning), part 2. Panels: MLP Breaking, RMU. The dominant components consistently absorb most of the unlearning change and exhibit the highest recovery, regardless of the specific unlearning loss.
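The per-PC quantities discussed above can be sketched in a few lines of NumPy. The exact definition of the change ratio is given by Equation 5 in the main text, which is not reproduced in this appendix, so the formulas below are assumptions chosen for illustration: the change ratio is taken as each PC's share of the mean absolute projected change, and the recovery ratio as the fraction of that change undone by relearning. The data are synthetic and deliberately constructed to mimic the reported pattern; they do not reproduce any result from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 400, 12

# Synthetic original representations with a decaying spectrum.
scales = np.array([2.0 ** -i for i in range(d)])
h_orig = rng.normal(size=(n, d)) * scales

# PCA basis of the original representations (rows: dominant -> minor).
centered = h_orig - h_orig.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)

# Toy "unlearning": shift representations mostly along dominant PCs.
shift_coeffs = rng.normal(size=(n, d)) * scales      # per-PC coefficients
h_unlearn = h_orig + shift_coeffs @ vt

# Toy "relearning": undo a larger fraction of the shift on dominant PCs.
undo_frac = np.linspace(0.9, 0.1, d)                 # hypothetical recovery profile
h_relearn = h_unlearn - (shift_coeffs * undo_frac) @ vt


def per_pc_magnitude(delta: np.ndarray) -> np.ndarray:
    """Mean absolute projection of representation changes onto each PC."""
    return np.abs(delta @ vt.T).mean(axis=0)


unlearn_mag = per_pc_magnitude(h_unlearn - h_orig)   # change caused by unlearning
residual_mag = per_pc_magnitude(h_relearn - h_orig)  # change left after relearning

change_ratio = unlearn_mag / unlearn_mag.sum()       # assumed form, sums to 1
recovery_ratio = 1.0 - residual_mag / unlearn_mag    # assumed form, 1 = fully undone

print(np.round(change_ratio, 3))
print(np.round(recovery_ratio, 3))
```

In this toy setup the first few PCs carry most of the change-ratio mass and also show the highest recovery, which is the qualitative shape described for the plots in this appendix.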
## Appendix J Robustness of Observations to Forget-Set Size
A natural concern is whether Observations 1–3 are artifacts of a particular forget-set size, or could be driven by the size of the sample used to fit PCA. To rule this out, we run two complementary experiments on WMDP-Cyber with GA and Llama-3.1-8B (full fine-tuning), varying the forget-set fraction in $\{25\%, 50\%, 75\%, 100\%\}$.
#### Experiment A: fixed model, varying PCA\-fit subset\.
We keep the unlearned and relearned models fixed (trained on the full forget set) and only vary the size of the subset used to fit PCA. This isolates whether Observation 1 (variance concentration in the dominant components) is a consequence of using too few samples for PCA. [Figure 6](https://arxiv.org/html/2605.11685#A10.F6) shows the explained-variance curves: the spectrum is essentially indistinguishable across the $25\%$, $50\%$, $75\%$, and $100\%$ subsets, so variance concentration is not a sample-size artifact. Correspondingly, [Figure 7](https://arxiv.org/html/2605.11685#A10.F7) reports the change-ratio and recovery-ratio plots for the four subset sizes; both retain the same dominant-component-heavy pattern.
Figure 6: Experiment A: explained variance under varying PCA-fit subset sizes (fixed unlearned/relearned model). Panels: `down_proj`, `gate_proj`, `up_proj`, for $f = 25\%, 50\%, 75\%, 100\%$. The spectrum is essentially identical across the $25$–$100\%$ subsets, so variance concentration is not driven by sample size.

Figure 7: Experiment A: per-PC unlearn change ratio (left) and recovery ratio (right) as a function of the PCA-fit subset size (rows: $f = 25, 50, 75, 100\%$), with the unlearned/relearned model held fixed. Observations 2 and 3 are stable across PCA-fit sizes.
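The logic of Experiment A can be illustrated with a small NumPy sketch: fit PCA on random subsets of a fixed activation matrix and compare the explained-variance spectra. This is a toy check on synthetic Gaussian data under assumed settings (the sizes, the decay rate `0.7`, and the helper `explained_variance_ratio` are all hypothetical), not a reproduction of the experiment on model activations.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8000, 32

# Synthetic activations with a decaying spectrum (stand-in for hidden states).
scales = np.array([0.7 ** i for i in range(d)])
acts = rng.normal(size=(n, d)) * scales


def explained_variance_ratio(x: np.ndarray) -> np.ndarray:
    """Fraction of total variance captured by each principal component."""
    centered = x - x.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()


# Fit PCA on 25%, 50%, 75%, and 100% of the samples (drawn without replacement).
spectra = {}
for frac in (0.25, 0.50, 0.75, 1.00):
    idx = rng.choice(n, size=int(frac * n), replace=False)
    spectra[frac] = explained_variance_ratio(acts[idx])

# Largest per-component deviation of each subset spectrum from the full-data one.
max_dev = {f: float(np.abs(s - spectra[1.00]).max()) for f, s in spectra.items()}
for frac, dev in max_dev.items():
    print(f"f={frac:.2f}  max deviation from full spectrum = {dev:.4f}")
```

When the number of samples is large relative to the dimensionality, the subset spectra stay close to the full-data spectrum, which is the kind of stability the explained-variance curves in Figure 6 are meant to demonstrate.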
#### Experiment B: end\-to\-end varying forget size\.
We then re-run the full unlearning + relearning pipeline on each forget-set fraction, so both the model and the PCA fit are tied to the same subset. This tests whether the observations hold when the unlearning procedure itself is varied. [Figure 8](https://arxiv.org/html/2605.11685#A10.F8) shows that the change-ratio and recovery-ratio patterns remain consistent across all subset sizes: a small number of dominant PCs continue to absorb the bulk of unlearning-induced change and to be preferentially recovered after RTT.

Figure 8: Experiment B: per-PC unlearn change ratio (left) and recovery ratio (right) under end-to-end varying forget size (rows: $f = 25, 50, 75, 100\%$). The unlearning + relearning pipeline is rerun on each subset. Observations 2 and 3 hold across all forget-set sizes.

Together, Experiments A and B confirm that all three observations are robust to the size of the forget set, and that the dominant-component vulnerability is a structural property of LLM representations rather than a consequence of a specific sampling regime.
## Appendix K Robustness of Observations to Parameter-Efficient Fine-Tuning
Our main analysis in [Section 3](https://arxiv.org/html/2605.11685#S3) uses full fine-tuning. To rule out the possibility that Observations 2–3 are an artifact of full-parameter optimization, we additionally evaluate them under LoRA fine-tuning at ranks $r \in \{8, 16, 32, 64\}$, applied to four unlearning losses (GA, NPO, RMU, MLP Breaking). Observation 1 concerns the representation geometry of the pre-trained model itself and is therefore independent of the fine-tuning method, so we focus on Observations 2 and 3.
[Figures 9](https://arxiv.org/html/2605.11685#A11.F9), [10](https://arxiv.org/html/2605.11685#A11.F10), [11](https://arxiv.org/html/2605.11685#A11.F11) and [12](https://arxiv.org/html/2605.11685#A11.F12) report, for each unlearning loss, the per-PC unlearn change ratio (left) and recovery ratio (right) under Full FT and the four LoRA ranks. In every row, the dominant components again absorb the majority of the unlearning-induced change and exhibit the highest recovery, matching the full fine-tuning pattern in [Figure 2](https://arxiv.org/html/2605.11685#S3.F2). This holds uniformly across the four unlearning losses and the four LoRA ranks, confirming that the dominant-component vulnerability is a property of LLM representation geometry rather than of the specific optimization regime, and that MCU's design (targeting the minor-component subspace) is therefore equally motivated for LoRA-based unlearning pipelines.
Figure 9: GradDiff under Full FT and LoRA at four ranks (rows: Full FT, $r=8$, $r=16$, $r=32$, $r=64$). Per-PC unlearn change ratio (left) and recovery ratio (right) on WMDP-Cyber (Llama-3.1-8B). Dominant components dominate both change and recovery for every optimization regime.

Figure 10: NPO under Full FT and LoRA at four ranks. Same layout as [Figure 9](https://arxiv.org/html/2605.11685#A11.F9).

Figure 11: RMU under Full FT and LoRA at four ranks. Same layout as [Figure 9](https://arxiv.org/html/2605.11685#A11.F9).

Figure 12: MLP Breaking under Full FT and LoRA at four ranks. Same layout as [Figure 9](https://arxiv.org/html/2605.11685#A11.F9).