SAGE: Retain-Aware Post-Hoc Sanitization of Final Unlearning Vector

arXiv cs.LG Papers

Summary

Proposes SAGE, a post-hoc method to sanitize the final unlearning vector in LLMs, improving the retain-forget trade-off without rerunning the unlearning pipeline.

arXiv:2606.18309v1 Announce Type: new Abstract: Large Language Model (LLM) unlearning aims to remove undesirable knowledge or behaviors while preserving retained capabilities. Current unlearning methods all involve a trade-off between unlearning and retention. We have found that the retention activation bias can also be used to quantify the damage an unlearning method inflicts on retention, without considering the specific implementation of the unlearning process. This allows us to restore retention performance for any unlearning method using a post-hoc approach. Therefore, we propose a complementary post-hoc setting to sanitize the final update vector without rerunning the original unlearning pipeline. In this setting, we design SAGE, Spectral Activation-GEometry Sanitization, a source-agnostic correction for final unlearning updates. SAGE collects real module inputs from a small retain proxy, extracts their dominant activation geometry, and solves a source-anchored optimization objective in closed form, which suppresses update components aligned with high-energy retained directions while preserving the source method's forgetting carrier. Across multiple unlearning methods, model scales, and benchmarks, SAGE consistently relieves the retain-forget trade-off, identifying post-hoc sanitization of final vectors as a practical and underexplored axis for machine unlearning.

Cached at: 06/18/26, 05:40 AM

# Retain-Aware Post-Hoc Sanitization of Final Unlearning Vector
Source: [https://arxiv.org/html/2606.18309](https://arxiv.org/html/2606.18309)
Jingyuan Zhang¹,†, Yucheng Bai¹,†, Peixi Wen¹, Zhehao Huang¹, Zhengbao He¹, Hanling Tian¹, Xinwen Cheng¹, Haiyin Ran¹, Xiaolin Huang¹,✉

¹ Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University

###### Abstract

Large Language Model (LLM) unlearning aims to remove undesirable knowledge or behaviors while preserving retained capabilities. Current unlearning methods all involve a trade-off between unlearning and retention. We have found that the retention activation bias can also be used to quantify the damage an unlearning method inflicts on retention, without considering the specific implementation of the unlearning process. This allows us to restore retention performance for any unlearning method using a post-hoc approach. Therefore, we propose a complementary post-hoc setting to sanitize the final update vector without rerunning the original unlearning pipeline. In this setting, we design SAGE, Spectral Activation-GEometry Sanitization, a source-agnostic correction for final unlearning updates. SAGE collects real module inputs from a small retain proxy, extracts their dominant activation geometry, and solves a source-anchored optimization objective in closed form, which suppresses update components aligned with high-energy retained directions while preserving the source method's forgetting carrier. Across multiple unlearning methods, model scales, and benchmarks, SAGE consistently relieves the retain–forget trade-off, identifying post-hoc sanitization of final vectors as a practical and underexplored axis for machine unlearning.

## 1 Introduction

![Refer to caption](https://arxiv.org/html/2606.18309v1/x1.png)

Figure 1: Overall performance. Triangles denote loss-driven methods and squares denote constraint-guided methods. Points are averaged over multiple models and forgetting difficulties. Light markers and the gray line fit the original baselines; dark markers and the red line fit methods with SAGE.

Large language models (LLMs) \[[1](https://arxiv.org/html/2606.18309#bib.bib7),[13](https://arxiv.org/html/2606.18309#bib.bib3),[31](https://arxiv.org/html/2606.18309#bib.bib2),[36](https://arxiv.org/html/2606.18309#bib.bib4),[37](https://arxiv.org/html/2606.18309#bib.bib8)\] can memorize private information, copyrighted content, harmful behaviors, and other undesirable data patterns from pretraining and fine-tuning corpora \[[6](https://arxiv.org/html/2606.18309#bib.bib5),[21](https://arxiv.org/html/2606.18309#bib.bib10),[33](https://arxiv.org/html/2606.18309#bib.bib48),[17](https://arxiv.org/html/2606.18309#bib.bib11),[41](https://arxiv.org/html/2606.18309#bib.bib12)\]. This has made machine unlearning an increasingly important problem for safety, privacy, and regulatory compliance \[[19](https://arxiv.org/html/2606.18309#bib.bib6),[40](https://arxiv.org/html/2606.18309#bib.bib9),[42](https://arxiv.org/html/2606.18309#bib.bib13)\]. While exact retraining remains the cleanest deletion target in classical machine unlearning, it is often computationally prohibitive at modern model scales, motivating efficient approximate alternatives \[[3](https://arxiv.org/html/2606.18309#bib.bib14),[26](https://arxiv.org/html/2606.18309#bib.bib15)\]. For LLMs, the problem is more challenging because knowledge is encoded in highly distributed representations learned from large-scale corpora, making it difficult to remove targeted information without inducing widespread collateral damage to general capabilities \[[4](https://arxiv.org/html/2606.18309#bib.bib52),[31](https://arxiv.org/html/2606.18309#bib.bib2)\].

Recent work has produced a wide range of approximate LLM unlearning methods, including gradient and objective\-based\[[22](https://arxiv.org/html/2606.18309#bib.bib16),[34](https://arxiv.org/html/2606.18309#bib.bib17),[11](https://arxiv.org/html/2606.18309#bib.bib18),[10](https://arxiv.org/html/2606.18309#bib.bib19),[38](https://arxiv.org/html/2606.18309#bib.bib20),[43](https://arxiv.org/html/2606.18309#bib.bib21)\], preference\-style optimization\[[8](https://arxiv.org/html/2606.18309#bib.bib22),[23](https://arxiv.org/html/2606.18309#bib.bib23)\], representation\-level or localized\-parameter interventions\[[19](https://arxiv.org/html/2606.18309#bib.bib6),[29](https://arxiv.org/html/2606.18309#bib.bib24)\], loss\-reweighting methods\[[33](https://arxiv.org/html/2606.18309#bib.bib48),[39](https://arxiv.org/html/2606.18309#bib.bib25)\], and task\-vector methods\[[16](https://arxiv.org/html/2606.18309#bib.bib26),[7](https://arxiv.org/html/2606.18309#bib.bib34),[5](https://arxiv.org/html/2606.18309#bib.bib27),[18](https://arxiv.org/html/2606.18309#bib.bib28)\]\. Furthermore, benchmarks such as TOFU\[[22](https://arxiv.org/html/2606.18309#bib.bib16)\], MUSE\[[30](https://arxiv.org/html/2606.18309#bib.bib29)\], and WMDP\[[19](https://arxiv.org/html/2606.18309#bib.bib6)\]have made clear that forgetting must be balanced against leakage, utility, and retained behavior\. Based on them, many lightweight interventions, ranging from inference\-time control\[[14](https://arxiv.org/html/2606.18309#bib.bib31),[27](https://arxiv.org/html/2606.18309#bib.bib32),[15](https://arxiv.org/html/2606.18309#bib.bib33),[2](https://arxiv.org/html/2606.18309#bib.bib37)\]to training\-time plug\-ins\[[35](https://arxiv.org/html/2606.18309#bib.bib35),[44](https://arxiv.org/html/2606.18309#bib.bib36)\], have also emerged\. 
Although these interventions show that a lightweight correction mechanism can further improve unlearning, most remain tightly coupled to specific stages of the original pipeline: training-time plug-ins must be incorporated during optimization and rely on rerunning or modifying the original unlearning process. Part (d) of Figure [2](https://arxiv.org/html/2606.18309#S1.F2) shows that during source unlearning, retain activation bias rises sharply while retained ability drops substantially, even after the forgetting strength has largely stabilized. This suggests that the final unlearning update is still not fully retain-aware: the deployed update continues to contain components that disproportionately perturb retained activation directions, so there remains meaningful room to improve retention by sanitizing the completed update itself. This motivates a decoupled post-hoc setting, in which the object of correction is no longer the training dynamics but the final vector itself. We aim to construct a sanitized vector that better trades off forgetting and retention without revisiting the original unlearning process.

![Refer to caption](https://arxiv.org/html/2606.18309v1/x2.png)

Figure 2: Overview of the proposed SAGE method: (a) Acquire the post-hoc update vector from unlearning baselines; (b) Construct stable and denoised retain geometry from module-level input activations; (c) Apply closed-form spectral sanitization and an amplifier to obtain the unlearned model; (d) Demonstration of retain restoration on a QA sample and the drop in retain activation bias.

Therefore, we propose SAGE, Spectral Activation-GEometry Sanitization, a post-hoc sanitizer for final unlearning vectors. To construct a retained activation basis, we sample a small retain calibration set and collect module-level input activations through no-gradient forward passes. To reduce the output perturbation by minimizing the update response energy on the retained input geometry, we apply truncated Singular Value Decomposition (SVD) to identify a stable and denoised low-rank subspace. As the source unlearning method has already been optimized for the forget objective, its final parameter displacement is the most direct forgetting carrier available in the post-hoc setting. SAGE therefore performs source-anchored sanitization: it keeps the sanitized update close to the original source update to preserve the learned forgetting signal, while suppressing the components that produce large retained-activation responses. In the objective's closed-form solution, directions with larger retained singular energy are attenuated more strongly, whereas update components outside the dominant retained subspace are largely preserved. Finally, SAGE applies an amplifier to ensure effective forgetting, following the task-vector view of model editing and unlearning, where the direction of a parameter displacement determines the type of behavioral change and its magnitude controls the strength of that change \[[16](https://arxiv.org/html/2606.18309#bib.bib26),[5](https://arxiv.org/html/2606.18309#bib.bib27)\].

Empirically, SAGE consistently improves the retain-forget trade-off across diverse source unlearning methods, model scales, and forget ratios, without rerunning the original unlearning pipeline. On TOFU, it increases average retention capability by 26.3% while also improving the unlearning effect. These improvements remain stable from 1B to 8B models and become more pronounced as the amount of content to be forgotten increases. Moreover, SAGE improves model utility by 2.2% and reduces privacy leakage by 6.2% on average. Its benefits further transfer beyond structured QA-style forgetting: SAGE improves retention capability by about 39.8% and 5.2% on MUSE and WMDP-cyber, respectively, while largely preserving forgetting behavior. Notably, on the TOFU dataset, SAGE uses roughly 3% of the full retain set and remains robust even under smaller proxy budgets. Together, these results suggest that post-hoc sanitization of final unlearning updates is a practical and underexplored design axis for machine unlearning. Our main contributions are as follows:

- We study a practical post-hoc unlearning setting for reducing retain-side collateral damage, in which the final parameter update of a source unlearning method is sanitized after training, without rerunning or modifying the original pipeline.
- We propose SAGE, a module-wise closed-form sanitizer that builds retained activation geometry from a small retain set and applies a singular-value-aware soft spectral operator to suppress retained-response directions while preserving the forgetting effect.
- Across multiple source unlearning methods, model scales, and benchmarks, we show that SAGE consistently improves retention capability, while also improving utility and reducing privacy leakage, with less time and fewer computational resources.

## 2 Related Work

### 2.1 LLM Unlearning

Machine unlearning for LLMs is motivated by the need to remove private, copyrighted, or hazardous knowledge without retraining from scratch \[[17](https://arxiv.org/html/2606.18309#bib.bib11),[41](https://arxiv.org/html/2606.18309#bib.bib12)\]. Since exact deletion is generally infeasible for modern LLMs, recent work has focused on approximate unlearning together with evaluation protocols that jointly measure removal and preservation. Benchmarks such as TOFU \[[22](https://arxiv.org/html/2606.18309#bib.bib16)\], WMDP \[[19](https://arxiv.org/html/2606.18309#bib.bib6)\], MUSE \[[30](https://arxiv.org/html/2606.18309#bib.bib29)\], and OpenUnlearning \[[9](https://arxiv.org/html/2606.18309#bib.bib30)\] have made the forget–retain trade-off a central concern, showing that forgetting should be evaluated together with retained behavior, utility, and leakage.

Most existing methods improve this trade\-off during optimization\. Gradient\- and objective\-based approaches modify the forget objective directly, including gradient ascent\[[22](https://arxiv.org/html/2606.18309#bib.bib16)\], NPO\[[43](https://arxiv.org/html/2606.18309#bib.bib21)\], and SimNPO\[[11](https://arxiv.org/html/2606.18309#bib.bib18)\], while representation\-level methods such as RMU\[[19](https://arxiv.org/html/2606.18309#bib.bib6)\]and LUNAR\[[29](https://arxiv.org/html/2606.18309#bib.bib24)\]intervene on internal activations or hidden states to suppress unwanted knowledge\. More recently, lightweight retain\-aware plug\-ins have been proposed: GRU\[[35](https://arxiv.org/html/2606.18309#bib.bib35)\]rectifies retention\-damaging update directions during training, and GU\[[44](https://arxiv.org/html/2606.18309#bib.bib36)\]removes components aligned with the retain\-gradient subspace\. Other lightweight methods, such as offset unlearning\[[14](https://arxiv.org/html/2606.18309#bib.bib31)\], in\-context unlearning\[[27](https://arxiv.org/html/2606.18309#bib.bib32)\], and soft prompting\[[2](https://arxiv.org/html/2606.18309#bib.bib37)\], reduce access requirements by acting at inference time rather than on model weights\. In contrast, SAGE operates after training is complete and directly sanitizes the final deployed weight update, without modifying the original optimization loop\.

### 2.2 Weight-Space Unlearning and Preservation

Our work is also related to methods that treat parameter differences as editable weight-space objects. Task Arithmetic \[[16](https://arxiv.org/html/2606.18309#bib.bib26)\] shows that fine-tuning deltas can act as task vectors whose addition, subtraction, or scaling steers model behavior without further training. However, single-vector unlearning is sensitive to fine-tuning configuration, scaling, and candidate selection. Task Simplex Arithmetic \[[7](https://arxiv.org/html/2606.18309#bib.bib34)\] and NegMerge \[[18](https://arxiv.org/html/2606.18309#bib.bib28)\] improve robustness through multi-vector aggregation. PerTA \[[5](https://arxiv.org/html/2606.18309#bib.bib27)\] uses gradient or diagonal-Fisher estimates to rescale task vectors, addressing over-forgetting. These methods improve vector selection, merging, or parameter-wise merging to acquire a stable or retain-friendly unlearning vector. SAGE instead takes a final vector as given and sanitizes it using module-level retain activation geometry.

A related preservation\-oriented literature comes from model editing\. ROME\[[24](https://arxiv.org/html/2606.18309#bib.bib38)\]edits factual associations through localized rank\-one updates, and MEMIT\[[25](https://arxiv.org/html/2606.18309#bib.bib39)\]extends this paradigm to many edits by distributing updates across layers\. AlphaEdit\[[12](https://arxiv.org/html/2606.18309#bib.bib40)\]reduces collateral damage by projecting edit perturbations into the null space of preserved knowledge keys\. This literature shares our emphasis on protecting unaffected knowledge, but it mainly targets factual editing and often relies on hard preservation constraints\. By contrast, unlearning produces a broader final update that must reduce unsafe knowledge without unnecessarily harming retained behavior\. SAGE brings this preservation perspective into unlearning by correcting the final update vector itself, reducing the components that disproportionately disturb retained behavior while preserving the original unlearning effect\.

## 3 Method

### 3.1 Problem Setup

Existing unlearning methods often differ substantially in their training objectives, optimization procedures, and access assumptions, yet they share a common challenge: improving forgetting typically comes at the cost of retained capabilities\. As shown in Figure[2](https://arxiv.org/html/2606.18309#S1.F2), retain\-side collateral damage can be reflected by retain activation bias, independent of implementation details of unlearning algorithms\. Therefore, we propose a post\-hoc final\-update setting for LLM unlearning, providing a global correction targeted at accumulated updates instead of step\-wise training\-time control\.

Given a base model $W_0$ and a source unlearned model $W_{\mathrm{un}}$, we obtain the final vector as the parameter displacement from the pretrained model to the unlearned model. Rather than editing all parameters uniformly, we restrict SAGE to a structured set of editable modules $\mathcal{G}$, comprising the attention and MLP projection matrices, as they admit a meaningful input-side activation geometry. For each module $g$, we define the source update as

$$\Delta_g=\begin{cases}W_{\mathrm{un},g}-W_{0,g}, & g\in\mathcal{G},\\[3pt] 0, & g\notin\mathcal{G}.\end{cases}\tag{1}$$

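As a minimal sketch of Eq. (1), assuming model weights are available as dictionaries of NumPy arrays keyed by module name (the names `source_updates`, `attn.q_proj`, and `embed` are illustrative and not from the paper), the per-module source update could be computed as:

```python
import numpy as np

def source_updates(base, unlearned, editable):
    """Eq. (1): W_un,g - W_0,g for editable modules, zero for all others."""
    deltas = {}
    for name, w0 in base.items():
        if name in editable:
            deltas[name] = unlearned[name] - w0
        else:
            deltas[name] = np.zeros_like(w0)
    return deltas

# Toy weights with hypothetical module names.
base = {"attn.q_proj": np.ones((4, 3)), "embed": np.ones((5, 4))}
unlearned = {"attn.q_proj": np.full((4, 3), 1.5), "embed": np.full((5, 4), 2.0)}
deltas = source_updates(base, unlearned, editable={"attn.q_proj"})
```

Only the module listed in `editable` carries a nonzero update here; the non-editable `embed` entry is zeroed even though its weights changed.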
### 3.2 Retained Activation Geometry

Retain-side damage is not determined solely by the norm of the source update. More importantly, it depends on whether the update acts strongly along the dominant input directions associated with retained capabilities. To capture this structure, we run a no-gradient forward pass on a small retain calibration proxy $D_{\mathrm{cal}}$ for each editable module $g$ and collect the module's real input activations into $X_g\in\mathbb{R}^{N_g\times d_g}$, where $d_g$ is the module input dimension and $N_g$ is the total number of collected token-level inputs.

We then compute the singular value decomposition $X_g=U_g\Sigma_gV_g^{\top}$ and retain the top $r_g$ singular directions according to the cumulative-energy criterion. The resulting truncated operator emphasizes dominant retained directions while discarding low-energy and probably noisy components, providing a stable geometry for the post-hoc sanitizer and avoiding overfitting. This yields a truncated retained-geometry operator

$$\hat{X}_g:=U_{g,r}\Sigma_{g,r}V_{g,r}^{\top},\qquad\hat{C}_g:=\hat{X}_g^{\top}\hat{X}_g=V_{g,r}\Sigma_{g,r}^{2}V_{g,r}^{\top}.\tag{2}$$

Here, $\hat{X}_g$ captures the dominant retained input geometry of module $g$, while $\hat{C}_g$ is the corresponding input-side Gram operator. In particular, $\hat{C}_g$ encodes the relative importance of retained input directions and will serve as the geometry-aware weighting operator in the sanitizer below.

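The truncation step can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the authors' implementation: "energy" is taken to mean cumulative squared singular values, the threshold value `rho=0.95` is arbitrary, and the function name `retained_geometry` is hypothetical.

```python
import numpy as np

def retained_geometry(X, rho=0.95):
    """Return (V_r, sigma_r) for the smallest rank whose cumulative
    squared singular energy reaches the fraction rho (assumed criterion)."""
    _, s, vt = np.linalg.svd(X, full_matrices=False)
    energy = np.cumsum(s**2) / np.sum(s**2)
    r = int(np.searchsorted(energy, rho)) + 1   # first index with energy >= rho
    r = min(r, len(s))                          # guard against round-off at rho = 1
    return vt[:r].T, s[:r]

# Toy activations: a few high-energy input directions plus low-energy noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8)) * np.array([10, 5, 1, 0.1, 0.1, 0.1, 0.1, 0.1])
V_r, sigma_r = retained_geometry(X, rho=0.95)

# Gram operator of Eq. (2), built from the truncated factors only.
C_hat = V_r @ np.diag(sigma_r**2) @ V_r.T
```

Only `V_r` and `sigma_r` need to be stored per module; the left singular vectors never enter $\hat{C}_g$.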
### 3.3 Closed-Form Spectral Sanitization

In the post\-hoc setting, the source method’s final parameter displacement is the only directly available carrier of the forgetting effect achieved during unlearning\. Accordingly, the sanitizer should remain close to the source update and only correct the part that is most harmful to retained behaviors\.

To this end, for each module $g$, SAGE optimizes a source-anchored objective with two complementary terms: a proximity term that keeps the sanitized update close to the source displacement, and a response penalty that suppresses update directions introducing output perturbation on retained activation geometry. For module $g$, let $m_g$ denote the output dimension, and write the corresponding update in operator form as $\Delta_g\in\mathbb{R}^{d_g\times m_g}$. Concretely, we solve

$$\widetilde{\Delta}_g=\arg\min_{T}\left\|T-\Delta_g\right\|_F^{2}+\lambda_g\left\|\hat{X}_gT\right\|_F^{2}.\tag{3}$$

The first term preserves the learned forgetting signal already encoded in the source update, while the second term penalizes retain-harming directions. Without the source anchor, minimizing only the response term would collapse to the trivial zero update.

**Algorithm 1** SAGE Framework

1. **Input:** base model $W_0$, source model $W_{\mathrm{un}}$, editable modules $\mathcal{G}$, retain calibration proxy $D_{\mathrm{cal}}$, regularization strengths $\{\lambda_g\}_{g\in\mathcal{G}}$, energy threshold $\rho$, calibrated $\alpha$
2. $\Delta\leftarrow W_{\mathrm{un}}-W_0$
3. **for each** $g\in\mathcal{G}$ **do**
4. collect retained module inputs $X_g$ from $D_{\mathrm{cal}}$
5. compute $X_g=U_g\Sigma_gV_g^{\top}$
6. choose $r_g$ by energy threshold $\rho$
7. $\hat{X}_g\leftarrow U_{g,r}\Sigma_{g,r}V_{g,r}^{\top}$
8. $\hat{C}_g\leftarrow\hat{X}_g^{\top}\hat{X}_g$
9. $\widetilde{\Delta}_g\leftarrow(I-V_{g,r}D_gV_{g,r}^{\top})\Delta_g$
10. **end for**
11. choose $\alpha$ by matched-forgetting calibration
12. **for each** module $g$ **do**
13. **if** $g\in\mathcal{G}$ **then**
14. $W_g^{\star}\leftarrow W_{0,g}+\alpha\widetilde{\Delta}_g$
15. **else**
16. $W_g^{\star}\leftarrow W_{0,g}$
17. **end if**
18. **end for**
19. **Return** $W^{\star}$

###### Proposition 1 (Unique closed-form sanitizer).

For each editable module $g$ and any $\lambda_g\geq 0$, the objective in Eq. [3](https://arxiv.org/html/2606.18309#S3.E3) is strongly convex in $T$ and admits the unique minimizer

$$\widetilde{\Delta}_g=\left(I+\lambda_g\hat{C}_g\right)^{-1}\Delta_g=\left(I-V_{g,r}D_gV_{g,r}^{\top}\right)\Delta_g,\quad D_g=\mathrm{diag}\!\left(\frac{\lambda_g\sigma_{g,1}^{2}}{1+\lambda_g\sigma_{g,1}^{2}},\dots,\frac{\lambda_g\sigma_{g,r}^{2}}{1+\lambda_g\sigma_{g,r}^{2}}\right).\tag{4}$$

Proof sketch. Eq. [3](https://arxiv.org/html/2606.18309#S3.E3) is a strictly convex quadratic in $T$, with Hessian $2(I+\lambda_g\hat{C}_g)\otimes I_{m_g}$. Since $\hat{C}_g\succeq 0$, the Hessian is positive definite. Setting the gradient to zero and applying the Woodbury matrix identity yields Eq. [4](https://arxiv.org/html/2606.18309#S3.E4).
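The two steps named in the proof sketch can be written out as a short worked derivation. It assumes $\lambda_g>0$ and strictly positive retained singular values, so that the inverses below exist; for $\lambda_g=0$ the minimizer is simply $\Delta_g$.

```latex
\begin{align*}
\nabla_T\Big(\|T-\Delta_g\|_F^2+\lambda_g\|\hat{X}_gT\|_F^2\Big)
  &= 2\,(T-\Delta_g)+2\lambda_g\hat{X}_g^{\top}\hat{X}_g\,T = 0
  \;\Longrightarrow\; \big(I+\lambda_g\hat{C}_g\big)\,T=\Delta_g, \\
\big(I+\lambda_g V_{g,r}\Sigma_{g,r}^{2}V_{g,r}^{\top}\big)^{-1}
  &= I - V_{g,r}\big(\lambda_g^{-1}\Sigma_{g,r}^{-2}+V_{g,r}^{\top}V_{g,r}\big)^{-1}V_{g,r}^{\top}
  && \text{(Woodbury)} \\
  &= I - V_{g,r}\big(\lambda_g^{-1}\Sigma_{g,r}^{-2}+I_r\big)^{-1}V_{g,r}^{\top}
  && (V_{g,r}^{\top}V_{g,r}=I_r) \\
  &= I - V_{g,r}\,\mathrm{diag}\!\left(\frac{\lambda_g\sigma_{g,i}^{2}}{1+\lambda_g\sigma_{g,i}^{2}}\right)_{i=1}^{r}V_{g,r}^{\top}
   = I - V_{g,r}D_gV_{g,r}^{\top}.
\end{align*}
```

Applying the last line to a retained direction $v_{g,i}$ gives $(1-D_{g,ii})\,v_{g,i}=\frac{1}{1+\lambda_g\sigma_{g,i}^{2}}\,v_{g,i}$, which is the per-direction shrinkage factor.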

This closed-form solution reveals that SAGE acts as a soft spectral sanitizer on the source update. Along each retained principal direction $v_{g,i}$, the source update is shrunk by the factor $\frac{1}{1+\lambda_g\sigma_{g,i}^{2}}$, which is determined by the retained singular energy, while directions orthogonal to $\mathrm{span}(V_{g,r})$ are left unchanged. Directions with larger retained energy are therefore attenuated more strongly, and components outside the dominant retained subspace are largely preserved.

As a result, SAGE is not a hard null-space projection, which would indiscriminately remove all retained-subspace components and arises only in the limit $\lambda_g\to\infty$. Instead, it performs a continuous, geometry-aware shrinkage that suppresses the most retain-sensitive directions while preserving as much of the source forgetting carrier as possible. Also, this operator does not amplify the retained-geometry response measured on the calibration proxy, reducing the retain-side disturbance.
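The closed form in Eq. (4) can be checked numerically. The sketch below uses toy random data; the function name `sage_sanitize`, the rank of 3, and the value `lam = 0.1` are illustrative assumptions. It applies the spectral form and compares it against a direct linear solve with $I+\lambda_g\hat{C}_g$.

```python
import numpy as np

def sage_sanitize(delta, V_r, sigma_r, lam):
    """Apply (I - V_r D V_r^T) to delta, with D_ii = lam*s_i^2 / (1 + lam*s_i^2)."""
    d = lam * sigma_r**2 / (1.0 + lam * sigma_r**2)
    return delta - V_r @ (d[:, None] * (V_r.T @ delta))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))             # toy retained inputs (N_g x d_g)
_, s, vt = np.linalg.svd(X, full_matrices=False)
V_r, sigma_r = vt[:3].T, s[:3]           # top-3 directions, rank picked arbitrarily
delta = rng.normal(size=(6, 4))          # toy update in operator form (d_g x m_g)
lam = 0.1

sanitized = sage_sanitize(delta, V_r, sigma_r, lam)

# Direct solve of (I + lam * C_hat) T = delta for comparison.
C_hat = V_r @ np.diag(sigma_r**2) @ V_r.T
direct = np.linalg.solve(np.eye(6) + lam * C_hat, delta)
```

The spectral form avoids building or inverting a $d_g\times d_g$ matrix: it needs only two thin matrix products with `V_r`, which matters when $d_g$ is a hidden dimension in the thousands.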

#### Final Composition and Forget\-Matched Calibration

As the magnitude of sanitized unlearning vectors controls the strength of forgetting \[[16](https://arxiv.org/html/2606.18309#bib.bib26)\], we apply an amplifier and form the final model as

$$W_g^{\star}=W_{0,g}+\alpha\,\widetilde{\Delta}_g,\tag{5}$$

where $\alpha$ is a scaling coefficient selected by grid search on a disjoint forget-side calibration subset.

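The text specifies only that $\alpha$ is chosen by grid search on a forget-side calibration subset. One possible reading, shown here purely as an assumption, is to pick the smallest $\alpha$ whose forget-side score is at or below that of the source model. The names `calibrate_alpha`, `forget_score`, and `toy_score` are hypothetical, and the toy scoring function stands in for a real evaluation of the model built with Eq. (5).

```python
def calibrate_alpha(grid, forget_score, target):
    """Smallest alpha on the grid whose forget score is <= target;
    if none matches, fall back to the alpha with the lowest score."""
    scores = {a: forget_score(a) for a in grid}
    matched = [a for a in sorted(grid) if scores[a] <= target]
    return matched[0] if matched else min(scores, key=scores.get)

def toy_score(alpha):
    # Stand-in for evaluating W_0 + alpha * sanitized update on forget data
    # (lower means stronger forgetting); decreases as alpha grows.
    return 0.5 / (1.0 + alpha)

alpha = calibrate_alpha([1.0, 1.5, 2.0, 3.0], toy_score, target=0.21)
```

Choosing the smallest matching value keeps the amplified update as small as possible while still matching the source method's forgetting level, which is one way to limit additional retain-side disturbance.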
#### Mechanistic Discussion: Differential Suppression\.

To further interpret when post-hoc sanitization is beneficial, we compare the relative suppression induced by SAGE on retain and forget activations. For module $g$, define

$$S_g^{(r)}:=1-\frac{\left\|X_g^{(r)}\widetilde{\Delta}_g\right\|_F^{2}}{\left\|X_g^{(r)}\Delta_g\right\|_F^{2}},\qquad S_g^{(f)}:=1-\frac{\left\|X_g^{(f)}\widetilde{\Delta}_g\right\|_F^{2}}{\left\|X_g^{(f)}\Delta_g\right\|_F^{2}},\tag{6}$$

and let

$$\Gamma_g:=S_g^{(r)}-S_g^{(f)}.\tag{7}$$

A positive $\Gamma_g$ indicates that SAGE suppresses retain-side response more strongly than forget-side response, which helps explain why, after forget-matched calibration, the sanitized update can improve retention while preserving forgetting behavior. We study this quantity empirically in Section [4.5](https://arxiv.org/html/2606.18309#S4.SS5).

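Eqs. (6) and (7) translate directly into a few lines of NumPy. The sketch below uses a hand-built two-dimensional example (all values are illustrative) in which retain inputs lie along the first coordinate, forget inputs along the second, and the sanitized update halves only the first row.

```python
import numpy as np

def suppression(X, delta, delta_tilde):
    """Eq. (6): 1 - ||X delta_tilde||_F^2 / ||X delta||_F^2."""
    return 1.0 - np.linalg.norm(X @ delta_tilde)**2 / np.linalg.norm(X @ delta)**2

def differential_suppression(X_r, X_f, delta, delta_tilde):
    """Eq. (7): retain-side suppression minus forget-side suppression."""
    return suppression(X_r, delta, delta_tilde) - suppression(X_f, delta, delta_tilde)

X_r = np.array([[1.0, 0.0]])            # retain activations along e1
X_f = np.array([[0.0, 1.0]])            # forget activations along e2
delta = np.array([[1.0], [1.0]])        # source update
delta_tilde = np.array([[0.5], [1.0]])  # sanitized update: e1 component halved

gamma = differential_suppression(X_r, X_f, delta, delta_tilde)  # 0.75 - 0.0
```

Here the retain response energy falls to a quarter of its original value while the forget response is untouched, so the differential suppression is positive.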
## 4 Experiments

### 4.1 Experiment Setups

Table 1: Main TOFU results across Llama-3 1B, 3B, and 8B on the Forget-1%, 5%, and 10% splits, with retain/unlearn Extraction Strength (ES Re. / ES Un.), absolute privacy leakage (\|Priv. Leak\|), and model utility (MU). Retain-capability gains with SAGE are shown in parentheses after ES Re.

| Method | ES Re.↑ (1%) | ES Un.↓ (1%) | \|Priv. Leak\|↓ (1%) | MU↑ (1%) | ES Re.↑ (5%) | ES Un.↓ (5%) | \|Priv. Leak\|↓ (5%) | MU↑ (5%) | ES Re.↑ (10%) | ES Un.↓ (10%) | \|Priv. Leak\|↓ (10%) | MU↑ (10%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Llama-3.2-1B-Instruct** | | | | | | | | | | | | |
| Vanilla | 0.055 | 0.058 | 7.79 | 0.281 | 0.055 | 0.059 | 8.09 | 0.281 | 0.055 | 0.055 | 10.43 | 0.281 |
| Fully Fine-tuned | 0.736 | 0.743 | 100.00 | 0.598 | 0.736 | 0.730 | 99.99 | 0.598 | 0.736 | 0.712 | 99.46 | 0.598 |
| NPO | 0.583 | 0.150 | 76.62 | 0.588 | 0.131 | 0.071 | 12.78 | 0.437 | 0.205 | 0.075 | 1.99 | 0.528 |
| NPO w. SAGE | 0.630 (↑0.047) | 0.144 | 75.09 | 0.593 | 0.180 (↑0.049) | 0.065 | 26.70 | 0.495 | 0.267 (↑0.062) | 0.072 | 14.05 | 0.550 |
| SimNPO | 0.672 | 0.415 | 98.11 | 0.593 | 0.580 | 0.211 | 96.77 | 0.578 | 0.637 | 0.155 | 95.12 | 0.586 |
| SimNPO w. SAGE | 0.706 (↑0.034) | 0.402 | 98.58 | 0.596 | 0.656 (↑0.076) | 0.201 | 95.94 | 0.590 | 0.647 (↑0.010) | 0.148 | 92.48 | 0.593 |
| RMU | 0.216 | 0.032 | 46.28 | 0.526 | 0.663 | 0.033 | 49.24 | 0.585 | 0.707 | 0.033 | 58.81 | 0.592 |
| RMU w. SAGE | 0.691 (↑0.475) | 0.031 | 84.53 | 0.593 | 0.707 (↑0.043) | 0.033 | 51.12 | 0.592 | 0.709 (↑0.002) | 0.033 | 55.37 | 0.595 |
| UNDIAL | 0.484 | 0.091 | 86.42 | 0.582 | 0.232 | 0.053 | 91.46 | 0.556 | 0.268 | 0.044 | 93.50 | 0.563 |
| UNDIAL w. SAGE | 0.664 (↑0.180) | 0.080 | 75.80 | 0.597 | 0.660 (↑0.428) | 0.052 | 40.60 | 0.596 | 0.625 (↑0.358) | 0.044 | 49.35 | 0.597 |
| SatImp | 0.677 | 0.206 | 95.99 | 0.596 | 0.413 | 0.069 | 76.26 | 0.572 | 0.358 | 0.059 | 72.00 | 0.552 |
| SatImp w. SAGE | 0.700 (↑0.023) | 0.198 | 95.63 | 0.600 | 0.580 (↑0.167) | 0.065 | 67.41 | 0.584 | 0.453 (↑0.095) | 0.057 | 58.35 | 0.569 |
| WGA | 0.683 | 0.141 | 86.78 | 0.598 | 0.555 | 0.034 | 53.62 | 0.592 | 0.627 | 0.033 | 60.71 | 0.590 |
| WGA w. SAGE | 0.721 (↑0.038) | 0.124 | 88.90 | 0.601 | 0.663 (↑0.107) | 0.034 | 54.40 | 0.591 | 0.734 (↑0.108) | 0.033 | 58.76 | 0.600 |
| **Llama-3.2-3B-Instruct** | | | | | | | | | | | | |
| Vanilla | 0.063 | 0.055 | 18.64 | 0.272 | 0.063 | 0.055 | 11.51 | 0.272 | 0.063 | 0.053 | 13.76 | 0.272 |
| Fully Fine-tuned | 0.885 | 0.920 | 100.00 | 0.665 | 0.885 | 0.887 | 100.00 | 0.665 | 0.885 | 0.888 | 99.72 | 0.665 |
| NPO | 0.760 | 0.201 | 80.93 | 0.667 | 0.140 | 0.060 | 20.52 | 0.466 | 0.132 | 0.060 | 16.62 | 0.529 |
| NPO w. SAGE | 0.821 (↑0.062) | 0.188 | 77.68 | 0.669 | 0.543 (↑0.403) | 0.057 | 30.72 | 0.648 | 0.521 (↑0.389) | 0.060 | 21.52 | 0.649 |
| SimNPO | 0.839 | 0.562 | 99.86 | 0.652 | 0.624 | 0.264 | 97.99 | 0.640 | 0.583 | 0.212 | 96.76 | 0.646 |
| SimNPO w. SAGE | 0.856 (↑0.017) | 0.562 | 99.86 | 0.653 | 0.712 (↑0.088) | 0.256 | 96.12 | 0.652 | 0.625 (↑0.042) | 0.206 | 93.99 | 0.645 |
| RMU | 0.320 | 0.038 | 25.00 | 0.612 | 0.830 | 0.033 | 54.30 | 0.669 | 0.859 | 0.033 | 62.83 | 0.675 |
| RMU w. SAGE | 0.817 (↑0.498) | 0.037 | 70.90 | 0.665 | 0.860 (↑0.030) | 0.033 | 49.85 | 0.668 | 0.854 | 0.033 | 61.06 | 0.666 |
| UNDIAL | 0.750 | 0.084 | 91.24 | 0.676 | 0.304 | 0.051 | 84.27 | 0.641 | 0.357 | 0.041 | 89.69 | 0.656 |
| UNDIAL w. SAGE | 0.831 (↑0.081) | 0.076 | 84.75 | 0.679 | 0.805 (↑0.501) | 0.048 | 64.28 | 0.680 | 0.750 (↑0.393) | 0.041 | 34.68 | 0.678 |
| SatImp | 0.847 | 0.348 | 96.05 | 0.659 | 0.537 | 0.091 | 55.16 | 0.614 | 0.462 | 0.048 | 38.29 | 0.614 |
| SatImp w. SAGE | 0.861 (↑0.014) | 0.322 | 96.19 | 0.661 | 0.714 (↑0.178) | 0.087 | 53.12 | 0.648 | 0.598 (↑0.136) | 0.048 | 36.91 | 0.641 |
| WGA | 0.841 | 0.206 | 88.28 | 0.663 | 0.623 | 0.033 | 53.24 | 0.641 | 0.633 | 0.036 | 58.91 | 0.648 |
| WGA w. SAGE | 0.864 (↑0.023) | 0.200 | 86.58 | 0.665 | 0.730 (↑0.107) | 0.033 | 53.65 | 0.650 | 0.744 (↑0.111) | 0.035 | 59.11 | 0.652 |
| **Llama-3.1-8B-Instruct** | | | | | | | | | | | | |
| Vanilla | 0.062 | 0.062 | 3.38 | 0.275 | 0.062 | 0.060 | 11.67 | 0.275 | 0.062 | 0.056 | 11.48 | 0.275 |
| Fully Fine-tuned | 0.992 | 0.977 | 100.00 | 0.627 | 0.992 | 0.972 | 100.00 | 0.627 | 0.992 | 0.979 | 99.94 | 0.627 |
| NPO | 0.866 | 0.152 | 68.25 | 0.642 | 0.176 | 0.064 | 39.71 | 0.556 | 0.233 | 0.071 | 40.45 | 0.601 |
| NPO w. SAGE | 0.956 (↑0.090) | 0.147 | 64.37 | 0.639 | 0.601 (↑0.424) | 0.064 | 41.15 | 0.609 | 0.543 (↑0.310) | 0.067 | 39.30 | 0.613 |
| SimNPO | 0.939 | 0.566 | 97.50 | 0.615 | 0.699 | 0.303 | 97.66 | 0.619 | 0.568 | 0.230 | 97.00 | 0.602 |
| SimNPO w. SAGE | 0.967 (↑0.029) | 0.566 | 98.37 | 0.619 | 0.833 (↑0.134) | 0.276 | 94.10 | 0.614 | 0.680 (↑0.112) | 0.223 | 94.30 | 0.604 |
| RMU | 0.833 | 0.074 | 55.37 | 0.635 | 0.985 | 0.039 | 50.85 | 0.657 | 0.986 | 0.033 | 59.52 | 0.652 |
| RMU w. SAGE | 0.986 (↑0.152) | 0.061 | 3.13 | 0.630 | 0.981 | 0.037 | 51.22 | 0.626 | 0.979 | 0.033 | 60.45 | 0.626 |
| UNDIAL | 0.819 | 0.112 | 85.75 | 0.688 | 0.508 | 0.051 | 79.38 | 0.690 | 0.631 | 0.051 | 92.15 | 0.689 |
| UNDIAL w. SAGE | 0.976 (↑0.157) | 0.109 | 84.87 | 0.669 | 0.955 (↑0.447) | 0.051 | 48.78 | 0.651 | 0.932 (↑0.301) | 0.050 | 55.31 | 0.644 |
| SatImp | 0.950 | 0.461 | 94.62 | 0.618 | 0.690 | 0.117 | 84.81 | 0.627 | 0.578 | 0.046 | 5.66 | 0.596 |
| SatImp w. SAGE | 0.972 (↑0.022) | 0.461 | 96.25 | 0.622 | 0.852 (↑0.161) | 0.109 | 77.57 | 0.611 | 0.736 (↑0.157) | 0.045 | 23.62 | 0.596 |
| WGA | 0.946 | 0.124 | 63.75 | 0.639 | 0.614 | 0.033 | 54.85 | 0.599 | 0.537 | 0.033 | 57.99 | 0.593 |
| WGA w. SAGE | 0.962 (↑0.016) | 0.124 | 61.62 | 0.634 | 0.792 (↑0.178) | 0.033 | 54.20 | 0.603 | 0.809 (↑0.272) | 0.033 | 56.59 | 0.609 |

#### Datasets and Baselines\.

We evaluate SAGE on the OpenUnlearning benchmark \[[9](https://arxiv.org/html/2606.18309#bib.bib30)\], focusing primarily on TOFU \[[22](https://arxiv.org/html/2606.18309#bib.bib16)\], a fine-grained benchmark with 4,000 QA pairs about fictitious author profiles. We use the official scaling splits with different forget-set sizes (Forget-1%, Forget-5%, and Forget-10%) and report main results on Llama-3-1B, 3B, and 8B-Instruct \[[13](https://arxiv.org/html/2606.18309#bib.bib3)\]. We also report results on MUSE \[[30](https://arxiv.org/html/2606.18309#bib.bib29)\], which evaluates memorization and unlearning of books and news articles through verbatim reproduction, question answering, and membership inference, using Llama-2-7B \[[31](https://arxiv.org/html/2606.18309#bib.bib2)\], and on WMDP \[[19](https://arxiv.org/html/2606.18309#bib.bib6)\], an alignment-oriented benchmark of 3,668 hazardous-domain (biosecurity, cybersecurity, chemical security) multiple-choice questions, using Zephyr-7B \[[32](https://arxiv.org/html/2606.18309#bib.bib49)\]. In all cases, SAGE is applied *post-hoc* to the final update vectors produced by a source unlearning method under a unified training budget of 10 epochs. The source methods are Gradient Ascent \[[22](https://arxiv.org/html/2606.18309#bib.bib16)\], GradDiff \[[22](https://arxiv.org/html/2606.18309#bib.bib16)\], NPO \[[11](https://arxiv.org/html/2606.18309#bib.bib18)\], SimNPO \[[11](https://arxiv.org/html/2606.18309#bib.bib18)\], RMU \[[19](https://arxiv.org/html/2606.18309#bib.bib6)\], UNDIAL \[[8](https://arxiv.org/html/2606.18309#bib.bib22)\], CEU \[[38](https://arxiv.org/html/2606.18309#bib.bib20)\], SatImp \[[39](https://arxiv.org/html/2606.18309#bib.bib25)\], WGA \[[33](https://arxiv.org/html/2606.18309#bib.bib48)\], DPO \[[28](https://arxiv.org/html/2606.18309#bib.bib50)\], and PDU \[[10](https://arxiv.org/html/2606.18309#bib.bib19)\].

#### Evaluation Metrics\.

We report four aspects of unlearning quality: forgetting, retention, privacy, and utility. Forgetting and retention are measured by Unlearn and Retain Extraction Strength \[[6](https://arxiv.org/html/2606.18309#bib.bib5)\] together with Retain ROUGE \[[20](https://arxiv.org/html/2606.18309#bib.bib44)\], privacy by Privacy Leakage \[[30](https://arxiv.org/html/2606.18309#bib.bib29)\], and utility by Model Utility \[[22](https://arxiv.org/html/2606.18309#bib.bib16)\]. Extraction Strength quantifies the residual recoverability of target information. Model Utility is a composite measure of overall retained capability: on TOFU it combines probability, ROUGE, and Truth Ratio evaluations, while on MUSE and WMDP it is reflected through retained-task generation quality. For Privacy Leakage, we report the absolute value of the raw leakage score and evaluate methods by its magnitude only; lower values indicate a weaker forget-set membership signal and therefore better privacy protection.

### 4.2 Main Results

#### Performance on TOFU\.

Table [1](https://arxiv.org/html/2606.18309#S4.T1) shows that SAGE consistently improves retained capability on TOFU while matching or slightly improving forgetting, across model scales, forget ratios, and source baselines. Averaged over TOFU settings, SAGE increases Retain E.S. from 0.5869 to 0.7409, corresponding to a 26.3% relative improvement, while slightly reducing Unlearn E.S. from 0.1271 to 0.1227. These results indicate that post-hoc sanitization can substantially recover retained capability without weakening forgetting strength. SAGE also improves model utility from 0.609 to 0.623, and the average absolute privacy leakage decreases from 68.29 to 64.06, suggesting that suppressing retain-sensitive components in the final vectors can also mitigate collateral utility and privacy degradation.
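As a quick arithmetic check on the averages quoted above, the following minimal sketch recomputes the relative changes from the rounded before/after values. The helper name `relative_change` and the dictionary layout are illustrative choices, not part of the paper. Because the inputs are already rounded, the result for Retain E.S. comes out near 26.2%, marginally below the reported 26.3%; the gap is presumably due to rounding or to how the averaging was done, which is an assumption here.

```python
def relative_change(before: float, after: float) -> float:
    """Signed relative change of `after` with respect to `before`."""
    return (after - before) / before


# Averages over TOFU settings, copied from the paragraph above.
tofu_averages = {
    "Retain E.S. (higher is better)": (0.5869, 0.7409),
    "Unlearn E.S. (lower is better)": (0.1271, 0.1227),
    "Model Utility (higher is better)": (0.609, 0.623),
    "|Priv. Leak| (lower is better)": (68.29, 64.06),
}

for name, (before, after) in tofu_averages.items():
    change = relative_change(before, after)
    print(f"{name}: {before} -> {after} ({change:+.1%})")
```

The signs line up with the narrative: the two "higher is better" metrics move up and the two "lower is better" metrics move down.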

The improvement is consistent across both model scale and forget difficulty\. Across the 1B, 3B, and 8B models, the average Retain E\.S\. gains are 26\.5%, 29\.4%, and 23\.5%, respectively, showing that the benefit of SAGE persists from small to larger backbones\. Across forget ratios, the average absolute Retain E\.S\. improvement is \+0\.109 on Forget\-1%, \+0\.196 on Forget\-5%, and \+0\.158 on Forget\-10%, indicating that SAGE remains effective under both milder and stronger unlearning regimes\. Overall, these results support SAGE as a robust post\-hoc correction layer that consistently improves retained capability while preserving or slightly improving forgetting quality\.

Table 2: MUSE results on Llama-2-7b-hf with unlearn extraction strength (ES Un.), retain ROUGE (ROUGE Re.), and absolute privacy leakage ($|\mathrm{Priv.\ Leak}|$). WMDP results on zephyr-7b-beta are reported with unlearn accuracy (Un. acc.) and MMLU accuracy (MMLU acc.). Better results with SAGE are highlighted in bold, and retain-capability gains are shown in orange.

| Method | MUSE Books ES Un. ↓ | MUSE Books ROUGE Re. ↑ | MUSE Books \|Priv. Leak\| ↓ | MUSE News ES Un. ↓ | MUSE News ROUGE Re. ↑ | MUSE News \|Priv. Leak\| ↓ | WMDP cyber Un. acc. ↓ | WMDP cyber MMLU acc. ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Vanilla | 0.010 | 0.680 | 8.16 | 0.020 | 0.560 | 4.72 | 0.445 | 0.585 |
| Fully Fine-tuned | 0.920 | 0.690 | 57.34 | 0.290 | 0.550 | 99.81 | – | – |
| GradDiff | 0.008 | 0.052 | 28.62 | 0.071 | 0.487 | 98.68 | 0.245 | 0.536 |
| GradDiff w. SAGE | 0.008 | 0.610 (↑0.558) | 32.54 | 0.070 | 0.496 (↑0.009) | 97.08 | 0.244 | 0.561 (↑0.025) |
| SimNPO | 0.138 | 0.537 | 54.81 | 0.214 | 0.486 | 99.87 | 0.418 | 0.572 |
| SimNPO w. SAGE | 0.106 | 0.611 (↑0.074) | 54.66 | 0.212 | 0.532 (↑0.047) | 99.87 | 0.426 | 0.587 (↑0.015) |
| RMU | 0.008 | 0.124 | 19.91 | 0.021 | 0.482 | 25.59 | 0.261 | 0.511 |
| RMU w. SAGE | 0.008 | 0.336 (↑0.212) | 38.76 | 0.019 | 0.496 (↑0.014) | 30.90 | 0.279 | 0.570 (↑0.059) |
| UNDIAL | 0.024 | 0.627 | 18.45 | 0.013 | 0.189 | 99.45 | 0.390 | 0.565 |
| UNDIAL w. SAGE | 0.024 | 0.700 (↑0.073) | 18.05 | 0.013 | 0.264 (↑0.075) | 89.86 | 0.397 | 0.584 (↑0.019) |
| WGA | 0.008 | 0.486 | 43.82 | 0.012 | 0.436 | 103.70 | 0.348 | 0.546 |
| WGA w. SAGE | 0.008 | 0.619 (↑0.133) | 25.55 | 0.011 | 0.468 (↑0.032) | 102.60 | 0.348 | 0.570 (↑0.024) |
| PDU | 0.008 | 0.042 | 54.48 | 0.146 | 0.501 | 99.79 | 0.243 | 0.243 |
| PDU w. SAGE | 0.008 | 0.605 (↑0.563) | 71.21 | 0.141 | 0.484 | 99.81 | 0.245 | 0.255 (↑0.012) |

The MUSE columns use Llama-2-7b-hf and the WMDP cyber columns use zephyr-7b-beta.

#### Performance on MUSE and WMDP\.

MUSE involves longer-form Books and News content with open-ended generation behavior, while WMDP-cyber evaluates hazardous knowledge removal together with general capability preservation. On MUSE, SAGE improves retention while preserving forgetting: averaged over Books and News, ROUGE Re. increases from 0.371 to 0.518, while ES Un. decreases slightly from 0.056 to 0.052. On WMDP-cyber, SAGE mainly improves the utility side of the safety–utility trade-off, increasing average MMLU accuracy from 0.496 to 0.521 across source baselines at comparable unlearning accuracy. These results show that SAGE remains effective on longer-form and safety-oriented unlearning tasks.
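The averages quoted above can be reproduced from the per-method rows of Table 2. The sketch below is only a bookkeeping aid: the list names and the `column_means` helper are hypothetical, and the values are transcribed from the table in the order GradDiff, SimNPO, RMU, UNDIAL, WGA, PDU.

```python
from statistics import mean

# (baseline, with SAGE) pairs transcribed from Table 2, ordered as
# GradDiff, SimNPO, RMU, UNDIAL, WGA, PDU.
rouge_books = [(0.052, 0.610), (0.537, 0.611), (0.124, 0.336),
               (0.627, 0.700), (0.486, 0.619), (0.042, 0.605)]
rouge_news = [(0.487, 0.496), (0.486, 0.532), (0.482, 0.496),
              (0.189, 0.264), (0.436, 0.468), (0.501, 0.484)]
mmlu_cyber = [(0.536, 0.561), (0.572, 0.587), (0.511, 0.570),
              (0.565, 0.584), (0.546, 0.570), (0.243, 0.255)]


def column_means(pairs):
    """Return (mean baseline, mean with SAGE) for a list of value pairs."""
    before = mean(b for b, _ in pairs)
    after = mean(a for _, a in pairs)
    return before, after


rouge_before, rouge_after = column_means(rouge_books + rouge_news)
mmlu_before, mmlu_after = column_means(mmlu_cyber)
print(f"ROUGE Re. (Books + News): {rouge_before:.4f} -> {rouge_after:.4f}")
print(f"MMLU acc. (WMDP cyber):   {mmlu_before:.4f} -> {mmlu_after:.4f}")
```

Up to rounding in the third decimal, these means agree with the 0.371 to 0.518 and 0.496 to 0.521 figures in the text.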

### 4.3 Ablation Study

Table 3: Hyperparameter ablations of SAGE on Llama-3-1B under the TOFU Forget-10% split. Default parameters are shaded in blue. Best and second-best results are highlighted in bold and underlined.

#### SVD Trunc. $\rho$.

As the truncation ratio $\rho$ increases, retain-side performance generally improves because more of the dominant retained activation geometry is captured. However, values too close to full rank can over-constrain sanitization and hurt stability, privacy, or utility; overall, $\rho=0.9$ provides the best balance.

#### Retain\-proxy Size\.

SAGE is relatively insensitive to retain\-proxy size: even small proxies already recover a useful estimate of the dominant retained geometry and yield consistent gains\. We therefore use 128 retained examples by default\.

#### Regularization $\lambda$.

The regularization strength $\lambda$ controls the trade-off between suppressing retain-sensitive directions and staying close to the source unlearning vector. On TOFU, performance improves quickly as $\lambda$ increases from $0$ and then saturates, motivating our default choice $\lambda=5$; on MUSE and WMDP, smaller values may work better.

#### Scaling Coefficient $\alpha$.

The scaling coefficient $\alpha$ controls how much of the sanitized update is restored before being added back to the base model. As $\alpha$ increases, forgetting and utility often improve, but overly large values can reintroduce retain-side interference and reduce retained performance.
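To make the roles of $\rho$, $\lambda$, and $\alpha$ concrete, the following NumPy sketch applies a generic ridge-style shrinkage to the update of a single linear module. It is an illustration under stated assumptions, not the SAGE operator itself. The function name `sanitize_update`, the reading of $\rho$ as a cumulative spectral-energy threshold, the energy normalization, and the objective in the comments are all hypothetical choices made for this example.

```python
import numpy as np


def sanitize_update(delta, retain_inputs, rho=0.9, lam=5.0, alpha=1.0):
    """Shrink a weight update along dominant retain-input directions.

    delta:         (d_out, d_in) final update for one linear module.
    retain_inputs: (n, d_in) module inputs collected on a retain proxy.
    """
    # Right singular vectors of the inputs span the retained activation geometry.
    _, s, vt = np.linalg.svd(retain_inputs, full_matrices=False)
    energy = s ** 2

    # ASSUMPTION: rho is a cumulative-energy threshold; keep the smallest rank
    # whose leading directions carry at least a fraction rho of the energy.
    cumulative = np.cumsum(energy) / energy.sum()
    k = min(int(np.searchsorted(cumulative, rho)) + 1, len(s))
    v = vt[:k].T                    # (d_in, k) orthonormal directions
    e = energy[:k] / energy[0]      # normalized so the top direction has energy 1

    # Closed-form minimizer of the (assumed) source-anchored objective
    #   ||D - delta||_F^2 + lam * sum_i e_i * ||D v_i||^2,
    # which keeps a fraction 1 / (1 + lam * e_i) of delta along each v_i
    # and leaves the orthogonal complement untouched.
    keep = 1.0 / (1.0 + lam * e)
    coeff = delta @ v               # (d_out, k) components along each v_i
    sanitized = delta - (coeff * (1.0 - keep)) @ v.T

    # alpha rescales the sanitized update before it is added to the base weights.
    return alpha * sanitized


rng = np.random.default_rng(0)
scales = np.array([10.0, 5.0, 1.0, 1.0, 0.1, 0.1, 0.1, 0.1])
retain_x = rng.normal(size=(64, 8)) * scales    # anisotropic toy retain inputs
source_delta = rng.normal(size=(4, 8))          # toy source update
clean_delta = sanitize_update(source_delta, retain_x)
print(np.linalg.norm(retain_x @ source_delta.T), np.linalg.norm(retain_x @ clean_delta.T))
```

In this toy setting, setting `lam=0` returns the source update unchanged, and larger `lam` shrinks the response on retain inputs more strongly, which mirrors the qualitative behavior described for $\lambda$ above.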

### 4.4 Integration with Other Plug-in Methods

Table 4: Results of SAGE with GU and GRU plug-ins for WGA, NPO, and DPO on TOFU over Forget-1%, 5%, and 10% using Llama-3-1B. Rows with SAGE are shaded in blue, the best and second-best results are highlighted in bold and underlined, and additional gains brought by SAGE are shown in orange.

We further study whether SAGE is compatible with training-time rectifiers such as GU and GRU. Overall, the results show that SAGE is complementary to these gradient-level plug-ins: in most settings, applying SAGE on top of GU/GRU further improves retained capability while largely preserving the original forgetting behavior, and often also brings gains in utility or privacy leakage. This indicates that training-time rectification does not fully remove retain-harmful components from the final update, leaving room for post-hoc sanitization to provide additional benefits.

At the same time, using SAGE alone is generally more effective than using GU or GRU alone\. Across different source methods and forget ratios, SAGE more consistently improves the retain–forget trade\-off, suggesting that directly sanitizing the final deployed update can be a stronger intervention than only rectifying gradients during training\. Taken together, these results show that SAGE is a strong standalone plugin and complementary post\-hoc component for existing training\-time methods\.

### 4.5 Suppression Difference Analysis

To examine the mechanism discussed in Section [3.3](https://arxiv.org/html/2606.18309#S3.SS3), we analyze the module-wise suppression difference $\Gamma_{g}=S_{g}^{(r)}-S_{g}^{(f)}$, which measures whether SAGE suppresses the retain-side response more strongly than the forget-side response. As shown in Figure [3](https://arxiv.org/html/2606.18309#S4.F3), the distribution is consistently shifted to the positive side, which indicates that SAGE tends to remove more retain-harming responses than forget-relevant responses at the module level. Such asymmetric suppression helps explain why SAGE can improve retention while preserving forgetting.
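The suppression difference can be illustrated with a small NumPy sketch for a single module $g$. The precise definitions of $S_{g}^{(r)}$ and $S_{g}^{(f)}$ are given in Section 3.3 and may differ from what is used here: this example assumes, purely for illustration, that suppression is the fraction of update-induced response energy removed by sanitization. All function names are hypothetical.

```python
import numpy as np


def response_energy(update, inputs):
    """Squared Frobenius norm of the update's response on the given inputs."""
    return float(np.linalg.norm(inputs @ update.T) ** 2)


def suppression(source_update, sanitized_update, inputs):
    """ASSUMED definition: fraction of source response energy removed."""
    before = response_energy(source_update, inputs)
    after = response_energy(sanitized_update, inputs)
    return 1.0 - after / before


def suppression_difference(source_update, sanitized_update, retain_inputs, forget_inputs):
    """Gamma_g = S_g^(r) - S_g^(f); positive means retain-side is suppressed more."""
    s_retain = suppression(source_update, sanitized_update, retain_inputs)
    s_forget = suppression(source_update, sanitized_update, forget_inputs)
    return s_retain - s_forget


# Toy module with two input dimensions: retain inputs lie along the first axis,
# forget inputs along the second, and sanitization removes the first component.
source = np.array([[1.0, 1.0]])
sanitized = np.array([[0.0, 1.0]])
retain_x = np.array([[1.0, 0.0]])
forget_x = np.array([[0.0, 1.0]])
gamma = suppression_difference(source, sanitized, retain_x, forget_x)
print(gamma)
```

In this toy case the retain-side response is removed entirely while the forget-side response is untouched, so the difference takes its largest value under the assumed definition.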

### 4.6 Computation Resources

Figure [4](https://arxiv.org/html/2606.18309#S4.F4) reports the computational overhead of SAGE. On Llama-3-3B, source unlearning training takes 22–38 minutes and exceeds 100 GiB peak GPU memory across five representative baselines, whereas SAGE takes about 9 minutes on average and uses only 7.86 GiB peak GPU memory. Moreover, the projector cache is built once for a given base model and retain proxy and can be reused across multiple source checkpoints, so only the lightweight sanitization step is repeated when applying SAGE to the same model. The scaling results show that the efficiency advantage persists from 1B to 8B models, making SAGE a practical post-hoc correction under limited compute budgets.
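The reuse pattern described above, building the projector once per base model and retain proxy and then applying it to many source checkpoints, can be sketched as a simple keyed cache. The class `ProjectorCache`, the builder callback, and the identifier strings are hypothetical and stand in for whatever the actual implementation uses.

```python
from typing import Callable, Dict, Tuple


class ProjectorCache:
    """Builds a projector once per (base model, retain proxy) and reuses it."""

    def __init__(self, build_fn: Callable[[str, str], object]) -> None:
        self._build_fn = build_fn
        self._store: Dict[Tuple[str, str], object] = {}
        self.builds = 0  # number of expensive builds actually performed

    def get(self, base_model_id: str, retain_proxy_id: str) -> object:
        key = (base_model_id, retain_proxy_id)
        if key not in self._store:
            self._store[key] = self._build_fn(base_model_id, retain_proxy_id)
            self.builds += 1
        return self._store[key]


def build_projector(base_model_id: str, retain_proxy_id: str) -> object:
    # Placeholder for the expensive step (collecting activations, running SVD).
    return {"model": base_model_id, "proxy": retain_proxy_id}


cache = ProjectorCache(build_projector)
for checkpoint in ["npo-final", "rmu-final", "wga-final"]:  # hypothetical names
    projector = cache.get("base-model", "retain-proxy-128")
    # The lightweight sanitization of `checkpoint` with `projector` would go here.
print(cache.builds)
```

Only the first lookup triggers a build; later checkpoints on the same base model and proxy pay just for the sanitization step.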

![Refer to caption](https://arxiv.org/html/2606.18309v1/x3.png) Figure 3: Distribution of the suppression difference, where the vertical red dashed line denotes the mean.

![Refer to caption](https://arxiv.org/html/2606.18309v1/x4.png) Figure 4: Computation resources for baselines and SAGE on different models.

## 5 Conclusion

In this paper, we study a practical post-hoc final-update setting for LLM unlearning. We show that source unlearning updates can still contain retain-harmful components that induce substantial drift in retained activations and degrade retained behavior. To address this, we propose SAGE, which extracts retained activation geometry from a small retain proxy and applies a closed-form spectral operator to suppress update components aligned with high-energy retained directions while preserving the source method's forgetting carrier. Extensive experiments on TOFU, MUSE, and WMDP demonstrate that SAGE consistently improves the retain–forget trade-off, while also improving utility and privacy leakage without retraining. We further show that SAGE is complementary to training-time plug-ins. Overall, our results identify post-hoc sanitization of final unlearning updates as a practical and underexplored design axis for machine unlearning, and suggest that directly correcting the final deployed update can be an effective way to improve unlearning quality.

## References

- \[1\] J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang, B. Hui, L. Ji, M. Li, J. Lin, R. Lin, D. Liu, et al. (2023) Qwen technical report. arXiv preprint arXiv:2309.16609.
- \[2\] (2025) Soft prompting for unlearning in large language models. In North American Chapter of the Association for Computational Linguistics (NAACL).
- \[3\] L. Bourtoule, V. Chandrasekaran, C. A. Choquette-Choo, H. Jia, A. Travers, B. Zhang, D. Lie, and N. Papernot (2021) Machine unlearning. In IEEE Symposium on Security and Privacy (IEEE S&P).
- \[4\] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, et al. (2020) Language models are few-shot learners. In Conference on Neural Information Processing Systems (NeurIPS).
- \[5\] C. Cai, Z. Ye, J. Yao, J. Qi, B. Han, X. Zhang, F. Liu, and J. Zhou (2026) Per-parameter task arithmetic for unlearning in large language models. arXiv preprint arXiv:2601.22030.
- \[6\] N. Carlini, F. Tramèr, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. Brown, D. Song, Ú. Erlingsson, A. Oprea, and C. Raffel (2021-08) Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), pp. 2633–2650. ISBN 978-1-939133-24-3. [Link](https://www.usenix.org/conference/usenixsecurity21/presentation/carlini-extracting)
- \[7\] J. Dong, H. Zhu, Y. Zhang, X. Qu, Y. S. Ong, and P. Koniusz (2025) Machine unlearning via task simplex arithmetic. In Conference on Neural Information Processing Systems (NeurIPS).
- \[8\] Y. R. Dong, H. Lin, M. Belkin, R. Huerta, and I. Vulić (2025) UNDIAL: self-distillation with adjusted logits for robust unlearning in large language models. In North American Chapter of the Association for Computational Linguistics (NAACL).
- \[9\] V. Dorna, A. Mekala, W. Zhao, A. McCallum, Z. C. Lipton, J. Z. Kolter, and P. Maini (2025) OpenUnlearning: accelerating llm unlearning via unified benchmarking of methods and metrics. In Conference on Neural Information Processing Systems (NeurIPS).
- \[10\] T. Entesari, A. Hatami, R. Khaziev, A. Ramakrishna, and M. Fazlyab (2025) Constrained entropic unlearning: a primal-dual framework for large language models. arXiv preprint arXiv:2506.05314.
- \[11\] C. Fan, J. Liu, L. Lin, J. Jia, R. Zhang, S. Mei, and S. Liu (2025) Simplicity prevails: rethinking negative preference optimization for llm unlearning. In Conference on Neural Information Processing Systems (NeurIPS).
- \[12\] J. Fang, H. Jiang, K. Wang, Y. Ma, S. Jie, X. Wang, X. He, and T. Chua (2025) AlphaEdit: null-space constrained knowledge editing for language models. In International Conference on Learning Representations (ICLR).
- \[13\] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, A. Yang, A. Fan, et al. (2024) The llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- \[14\] J. Y. Huang, W. Zhou, F. Wang, F. Morstatter, S. Zhang, H. Poon, and M. Chen (2025) Offset unlearning for large language models. Transactions on Machine Learning Research. ISSN 2835-8856.
- \[15\] Y. Huang, P. Wu, and C. Wang (2024) ECO: efficient computational optimization for exact machine unlearning in deep neural networks. In International Conference on Machine Learning Workshop (ICML Workshop).
- \[16\] G. Ilharco, M. T. Ribeiro, M. Wortsman, S. Gururangan, L. Schmidt, H. Hajishirzi, and A. Farhadi (2023) Editing models with task arithmetic. In International Conference on Learning Representations (ICLR).
- \[17\] J. Jang, D. Yoon, S. Yang, S. Cha, M. Lee, L. Logeswaran, and M. Seo (2023) Knowledge unlearning for mitigating privacy risks in language models. In Annual Meeting of the Association for Computational Linguistics (ACL).
- \[18\] H. S. Kim, D. Han, and J. Choe (2025) NegMerge: sign-consensual weight merging for machine unlearning. In International Conference on Machine Learning (ICML).
- \[19\] N. Li, A. Pan, A. Gopal, S. Yue, D. Berrios, A. Gatti, J. D. Li, A. Dombrowski, S. Goel, L. Phan, G. Mukobi, N. Helm-Burger, et al. (2024) The wmdp benchmark: measuring and reducing malicious use with unlearning. In International Conference on Machine Learning (ICML).
- \[20\] C. Lin (2004) ROUGE: a package for automatic evaluation of summaries. In Annual Meeting of the Association for Computational Linguistics (ACL).
- \[21\] S. Liu, Y. Yao, J. Jia, S. Casper, N. Baracaldo, P. Hase, Y. Yao, C. Y. Liu, X. Xu, H. Li, K. R. Varshney, M. Bansal, S. Koyejo, and Y. Liu (2024) Rethinking machine unlearning for large language models. In Nature Machine Intelligence (Nat. Mach. Intell.).
- \[22\] P. Maini, Z. Feng, A. Schwarzschild, Z. C. Lipton, and J. Z. Kolter (2024) TOFU: a task of fictitious unlearning for llms. In Conference on Language Modeling (COLM).
- \[23\] A. Mekala, V. Dorna, S. Dubey, A. Lalwani, D. Koleczek, M. Rungta, S. Hasan, and E. Lobo (2024) Alternate preference optimization for unlearning factual knowledge in large language models. arXiv preprint arXiv:2409.13474.
- \[24\] K. Meng, D. Bau, A. Andonian, and Y. Belinkov (2022) Locating and editing factual associations in gpt. In Conference on Neural Information Processing Systems (NeurIPS).
- \[25\] K. Meng, A. S. Sharma, A. Andonian, Y. Belinkov, and D. Bau (2023) Mass-editing memory in a transformer. In International Conference on Learning Representations (ICLR).
- \[26\] S. Neel, A. Roth, and S. Sharifi-Malvajerdi (2021) Descent-to-delete: gradient-based methods for machine unlearning. In International Conference on Algorithmic Learning Theory (ALT).
- \[27\] M. Pawelczyk, S. Neel, and H. Lakkaraju (2024) In-context unlearning: language models as few shot unlearners. In International Conference on Machine Learning (ICML).
- \[28\] R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn (2023) Direct preference optimization: your language model is secretly a reward model. In Conference on Neural Information Processing Systems (NeurIPS).
- \[29\] W. F. Shen, X. Qiu, M. Kurmanji, A. Iacob, L. Sani, Y. Chen, N. Cancedda, and N. D. Lane (2025) LLM unlearning via neural activation redirection. In Conference on Neural Information Processing Systems (NeurIPS).
- \[30\] W. Shi, J. Lee, Y. Huang, S. Malladi, J. Zhao, A. Holtzman, D. Liu, L. Zettlemoyer, N. A. Smith, and C. Zhang (2025) MUSE: machine unlearning six-way evaluation for language models. In International Conference on Learning Representations (ICLR).
- \[31\] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample (2023) LLaMA: open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- \[32\] L. Tunstall, E. Beeching, N. Lambert, N. Rajani, K. Rasul, Y. Belkada, S. Huang, L. von Werra, C. Fourrier, N. Habib, N. Sarrazin, O. Sanseviero, A. M. Rush, and T. Wolf (2024) Zephyr: direct distillation of lm alignment. In Conference on Language Modeling (COLM).
- \[33\] Q. Wang, J. P. Zhou, Z. Zhou, S. Shin, B. Han, and K. Q. Weinberger (2025) Rethinking llm unlearning objectives: a gradient perspective and go beyond. In International Conference on Learning Representations (ICLR).
- \[34\] Y. Wang, J. Wei, C. Y. Liu, J. Pang, Q. Liu, A. P. Shah, Y. Bao, Y. Liu, and W. Wei (2024) LLM unlearning via loss adjustment with only forget data. In International Conference on Learning Representations (ICLR).
- \[35\] Y. Wang, Q. Wang, F. Liu, W. Huang, Y. Du, X. Du, and B. Han (2025) GRU: mitigating the trade-off between unlearning and retention for llms. In International Conference on Machine Learning (ICML).
- \[36\] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush (2020) HuggingFace's transformers: state-of-the-art natural language processing. In Conference on Empirical Methods in Natural Language Processing (EMNLP).
- \[37\] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, C. Zheng, D. Liu, F. Zhou, F. Huang, F. Hu, et al. (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388.
- \[38\] B. Yang (2025) CE-u: cross entropy unlearning. arXiv preprint arXiv:2503.01224.
- \[39\] P. Yang, Q. Wang, Z. Huang, T. Liu, C. Zhang, and B. Han (2025) Exploring criteria of loss reweighting to enhance llm unlearning. In International Conference on Machine Learning (ICML).
- \[40\] Y. Yao, J. Duan, K. Xu, Y. Cai, Z. Sun, and Y. Zhang (2024-06) A survey on large language model (llm) security and privacy: the good, the bad, and the ugly. High-Confidence Computing 4(2), pp. 100211. ISSN 2667-2952. [Link](http://dx.doi.org/10.1016/j.hcc.2024.100211), [Document](https://dx.doi.org/10.1016/j.hcc.2024.100211)
- \[41\] Y. Yao, X. Xu, and Y. Liu (2024) Large language model unlearning. In Conference on Neural Information Processing Systems (NeurIPS).
- \[42\] D. Zhang, P. Finckenberg-Broman, T. Hoang, S. Pan, Z. Xing, M. Staples, and X. Xu (2025) Right to be forgotten in the era of large language models: implications, challenges, and solutions. AI and Ethics (AI Ethics).
- \[43\] R. Zhang, L. Lin, Y. Bai, and S. Mei (2024) Negative preference optimization: from catastrophic collapse to effective unlearning. In Conference on Language Modeling (COLM).
- \[44\] D. Zhou, Y. Zhang, T. Wei, R. Qiu, K. Yang, X. Lin, C. Qian, J. He, H. Tong, C. Zhai, H. Ji, and H. Zhang (2026) Geometric-disentangelment unlearning. arXiv preprint arXiv:2511.17100.
