Fast Unlearning at Scale via Margin Self-Correction

arXiv cs.LG Papers

Summary

Introduces MASC (Margin Self-Correction), an efficient unlearning method for LLMs that uses an online stopping rule to achieve competitive forget–retain trade-offs at reduced computational cost, validated on TOFU and MUSE benchmarks.

arXiv:2606.02920v1 Announce Type: new Abstract: Language-model unlearning updates a trained model to behave as if it had not seen selected training examples, while preserving utility and avoiding costly retraining. Existing approaches typically fine-tune the pretrained model with a fixed training budget and select the final model afterwards by evaluating several saved checkpoints on downstream validation data. Two sources of unnecessary computation limit scalability: training beyond the desired forget-retain trade-off, and checkpoint selection that requires extra storage and repeated evaluations. To address these limitations, we introduce MArgin Self-Correction (MASC), an efficient unlearning method with an online stopping rule that does not require downstream evaluation. Given a text sequence to be forgotten, MASC actively reduces the logit gap between the original next token and the most likely alternatives. It outputs a final model once this gap is small on average over a sufficiently large proportion of token positions across all forget sequences. On TOFU, MUSE News, and MUSE Books, MASC achieves a competitive forget-retain trade-off at a fraction of the computational cost of existing baselines. We further observe that as we increase model size (a.k.a. number of parameters), the trade-offs improve for both MASC and SimNPO -- the forget metrics remain comparable while retain utility increases.

Cached at: 06/03/26, 09:40 AM

# Fast Unlearning at Scale via Margin Self-Correction
Source: [https://arxiv.org/html/2606.02920](https://arxiv.org/html/2606.02920)
Federico Di Gennaro (ETH Zürich), Alexander Shevchenko (ETH Zürich), Fanny Yang (ETH Zürich)

###### Abstract

Language-model unlearning updates a trained model to behave as if it had not seen selected training examples, while preserving utility and avoiding costly retraining. Existing approaches typically fine-tune the pretrained model with a fixed training budget and select the final model *afterwards* by evaluating several saved checkpoints on downstream validation data. Two sources of unnecessary computation limit scalability: training beyond the desired forget–retain trade-off, and checkpoint selection that requires extra storage and repeated evaluations. To address these limitations, we introduce *MArgin Self-Correction* (MASC), an efficient unlearning method with an *online* stopping rule that does not require downstream evaluation. Given a text sequence to be forgotten, MASC actively reduces the logit gap between the original next token and the most likely alternatives. It outputs a final model once this gap is small on average over a sufficiently large proportion of token positions across all forget sequences. On TOFU, MUSE News, and MUSE Books, MASC achieves a competitive forget–retain trade-off at a fraction of the computational cost of existing baselines. We further observe that as we increase model size (a.k.a. number of parameters), the trade-offs improve for both MASC and SimNPO: the forget metrics remain comparable while retain utility increases.

## 1 Introduction

Despite their remarkable success in tasks including code generation \[[37](https://arxiv.org/html/2606.02920#bib.bib37), [8](https://arxiv.org/html/2606.02920#bib.bib8)\], mathematical reasoning \[[26](https://arxiv.org/html/2606.02920#bib.bib26)\], and scientific discovery \[[1](https://arxiv.org/html/2606.02920#bib.bib1)\], Large Language Models (LLMs) are prone to memorizing sensitive training data \[[6](https://arxiv.org/html/2606.02920#bib.bib6)\], including private information \[[35](https://arxiv.org/html/2606.02920#bib.bib35), [43](https://arxiv.org/html/2606.02920#bib.bib43)\] and copyrighted content \[[21](https://arxiv.org/html/2606.02920#bib.bib21)\]. This tendency poses significant safety and privacy risks, particularly as LLMs are increasingly deployed in high-stakes domains \[[25](https://arxiv.org/html/2606.02920#bib.bib25), [49](https://arxiv.org/html/2606.02920#bib.bib49)\]. These concerns are also reflected in legal frameworks such as the *California Consumer Privacy Act* (CCPA) \[[3](https://arxiv.org/html/2606.02920#bib.bib3)\] and the European *General Data Protection Regulation* (GDPR) \[[17](https://arxiv.org/html/2606.02920#bib.bib17)\], which establish rights to request the deletion of personal data, often referred to as the *right to be forgotten*.

Machine unlearning \[[2](https://arxiv.org/html/2606.02920#bib.bib2), [5](https://arxiv.org/html/2606.02920#bib.bib5), [28](https://arxiv.org/html/2606.02920#bib.bib28)\] provides a computational framework for such a goal. Given a trained model and a collection of examples to be forgotten, the aim is to return a new model that behaves as if those examples had never been used for training, while preserving its performance on the rest of the data. The gold standard is exact retraining, where the model is trained from scratch after removing the forgotten examples from the training corpus. While retraining gives the desired behavior *by definition*, it is computationally prohibitive for modern language models. This motivates *approximate* unlearning methods which fine-tune the existing model. Approximate unlearning, however, introduces a delicate forget–retain trade-off: weak procedures may leave the target content reproducible, whereas overly aggressive interventions may damage performance on unrelated data, reminiscent of the broader problem of catastrophic forgetting \[[28](https://arxiv.org/html/2606.02920#bib.bib28), [32](https://arxiv.org/html/2606.02920#bib.bib32), [34](https://arxiv.org/html/2606.02920#bib.bib34), [23](https://arxiv.org/html/2606.02920#bib.bib23)\].

\[Figure 1 plots omitted. Panels: TOFU, MUSE News, MUSE Books. Top row: wall-clock time (seconds) per method. Bottom row metrics: 1-ROUGE-L($D_f$), 1-Prob($D_f$), Truth Ratio($D_f$), and MU($D_r$) for TOFU; 1-VerbMem($D_f$), 1-KnowMem($D_f$), and KnowMem($D_r$) for MUSE News and MUSE Books. Methods: NPO+KLR, GA+GDR, SimNPO, MASC (Ours).\]

Figure 1: Top: Wall-clock runtimes (in seconds) of methods with similar retain–forget trade-off. Bottom: Forget–retain trade-offs of timed methods. Each metric is in $[0,1]$ (the higher the better), and MASC is competitive (i.e., not Pareto dominated) with the others.

Existing methods, including Gradient Ascent (GA) \[[50](https://arxiv.org/html/2606.02920#bib.bib50)\], NPO \[[51](https://arxiv.org/html/2606.02920#bib.bib51), [15](https://arxiv.org/html/2606.02920#bib.bib15)\], and their regularized variants \[[7](https://arxiv.org/html/2606.02920#bib.bib7), [33](https://arxiv.org/html/2606.02920#bib.bib33), [42](https://arxiv.org/html/2606.02920#bib.bib42)\], typically lack an *online* (i.e., during training) model selection rule to identify the optimal (or desired) forget–retain trade-off without relying on costly downstream evaluations. Instead, these algorithms generally run for a predefined and fixed compute budget, which is both inefficient (cf. [Figure 1](https://arxiv.org/html/2606.02920#S1.F1)) and agnostic to the actual unlearning dynamics. Consequently, practitioners are forced to select a final model retrospectively, evaluating all saved checkpoints only after training is complete. This leads to our first research question:

*(Q1) Can we design an efficient unlearning objective with an intrinsic stopping rule that offers a controllable stopping criterion for the forget–retain trade-off?*

We introduce MArgin Self-Correction (MASC), an unlearning method whose objective naturally admits an *adaptive* stopping rule. MASC performs gradient updates on a retain-regularized loss that discourages drift from the original model on retain data, while correcting forget-set token predictions that remain too dominant. Each forget continuation is evaluated under teacher forcing: at every token position, MASC computes a *restricted margin*, defined as the logit gap between the original next token (also called *target token*) and a log-sum-exp aggregate of the logits of the model's top-$k$ alternatives to the original next token. This margin measures how strongly the model still prefers the forgotten token over plausible replacements. MASC then returns the first checkpoint at which the margin condition is satisfied on a sufficiently large fraction of monitored forget-set tokens. We show that this token-level condition theoretically upper-bounds the probability of exactly reproducing the forgotten continuation (cf. [Proposition 1](https://arxiv.org/html/2606.02920#Thmproposition1)). Thus, the returned checkpoint is selected online using the same condition optimized during unlearning, rather than by running a fixed training budget followed by downstream checkpoint evaluation. Empirically, this yields competitive forget–retain trade-offs at a fraction of the computational cost of existing baselines (see [Figure 1](https://arxiv.org/html/2606.02920#S1.F1)).

This efficiency advantage is especially relevant at scale, where unlearning costs during both finetuning and evaluation can grow quickly with model size. Beyond computation, however, scale may also affect the forget–retain behavior itself. While prior work has studied how unlearning performance changes with the size of the deletion request \[[33](https://arxiv.org/html/2606.02920#bib.bib33), [42](https://arxiv.org/html/2606.02920#bib.bib42)\], the role of *model size* remains under-explored. Larger models may internalize target information more strongly during supervised finetuning \[[36](https://arxiv.org/html/2606.02920#bib.bib36), [6](https://arxiv.org/html/2606.02920#bib.bib6), [31](https://arxiv.org/html/2606.02920#bib.bib31)\], and respond differently when that information is later removed. The second question we aim to answer is therefore:

*(Q2) How does model scale influence knowledge acquisition during learning and its subsequent removal during unlearning?*

In this analysis, we distinguish between two levels of memorization: *exact memorization* \[[6](https://arxiv.org/html/2606.02920#bib.bib6), [36](https://arxiv.org/html/2606.02920#bib.bib36), [31](https://arxiv.org/html/2606.02920#bib.bib31)\], where the model reproduces target content verbatim, and *knowledge memorization*, where the model recovers the same underlying information under paraphrased prompts. During supervised finetuning, both metrics grow with model size and follow empirical power-law trends in log–log space, with a larger fitted exponent for exact memorization. This indicates that scale amplifies verbatim reproduction more strongly than paraphrase-based recovery. After unlearning, however, forget-side metrics become roughly stable across model sizes, while retain utility increases. This suggests that scale mainly improves the utility side of the post-unlearning trade-off, rather than systematically increasing residual memorization of the forgotten content.

To summarize, our main contributions are:

- We introduce MASC, an efficient unlearning method that suppresses target tokens only when they remain much more likely than an aggregate of the top-$k$ most likely alternatives. We demonstrate on TOFU, MUSE News, and MUSE Books that MASC achieves competitive trade-offs with substantially shorter wall-clock runtime.
- We provide a scaling study across the Qwen2.5 family, examining how scale affects different forms of memorization and how it benefits the final forget–retain frontier after unlearning (for both MASC and SimNPO \[[15](https://arxiv.org/html/2606.02920#bib.bib15)\]).

Datasets. We evaluate MASC on three standard LLM unlearning benchmarks: TOFU \[[33](https://arxiv.org/html/2606.02920#bib.bib33)\], MUSE News, and MUSE Books \[[42](https://arxiv.org/html/2606.02920#bib.bib42)\]. TOFU is a synthetic question-answering benchmark based on fictitious biographies. We use its forget10/retain90 split, where 10% of examples are assigned to the forget set and the remaining 90% to the retain set. MUSE provides a more realistic setting based on memorized text from news articles (BBC) and books (Harry Potter series). Code is available at [FedericoDiGennaro/Fast-LLM-Unlearning-MarginSelfCorrection](https://github.com/FedericoDiGennaro/Fast-LLM-Unlearning-MarginSelfCorrection).

Notation. For a finite set $\mathcal{S}$, we denote by $\Delta(\mathcal{S})=\{p\in\mathbb{R}^{\mathcal{S}}_{+}:\sum_{s\in\mathcal{S}}p_{s}=1\}$ the probability simplex over it. If $\mathcal{S}\subseteq\mathbb{R}^{d}$ and $r\in\mathbb{N}$ is positive, we write $\mathcal{S}^{r}$ for the $r$-fold Cartesian product of $\mathcal{S}$. Finally, for $x\in\mathbb{R}$, we use $[x]_{+}=\max\{x,0\}$ to denote the positive part of $x$. For an integer $T\in\mathbb{N}$, we denote by $[T]$ the set $\{1,\ldots,T\}$.

## 2 LLM unlearning and prior work

This section introduces the notation for LLM unlearning and provides a non-exhaustive overview (see Appendix [A](https://arxiv.org/html/2606.02920#A1) for a more detailed discussion) of well-known unlearning methods that we later use as baselines.

Let $\mathcal{V}$ denote the token vocabulary, and let $\Delta(\mathcal{V})$ be the probability simplex over $\mathcal{V}$. Then, let $\mathcal{C}=\bigcup_{\ell\geq 0}\mathcal{V}^{\ell}$ denote the set of finite token contexts. An autoregressive language model with parameters $\theta\in\mathbb{R}^{d}$ is defined as a policy $\pi_{\theta}:\mathcal{C}\to\Delta(\mathcal{V})$, where $\pi_{\theta}(\cdot\mid c)=\mathrm{softmax}(z_{\theta}(\cdot\mid c))$ is the next-token distribution over $\mathcal{V}$ given context $c\in\mathcal{C}$, and $z_{\theta}(\cdot\mid c)$ denotes the corresponding *logits*. Given a sample consisting of a prompt $x\in\mathcal{X}$ and a continuation $y=(y_{1},\dots,y_{T})\in\mathcal{V}^{T}$, we evaluate a policy on $(x,y)$ using the probability it assigns to the full continuation, which factorizes as $\pi_{\theta}(y\mid x)=\prod_{t=1}^{T}\pi_{\theta}(y_{t}\mid c_{t})$, where $c_{t}=(x,y_{<t})$. This corresponds to *teacher-forced* evaluation: each next-token distribution is conditioned on the reference prefix $c_{t}$, rather than on tokens sampled from the model. Evaluating $\pi_{\theta}(y\mid x)$ therefore only requires standard forward passes on the given prompt-continuation pair.
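As an illustration of the factorization above, the following minimal sketch computes the teacher-forced log-probability of a continuation from per-position logit vectors. The helper names and the toy logits are hypothetical; in practice the logits would come from a forward pass of the model on the prompt-continuation pair.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    top = max(logits)
    log_z = top + math.log(sum(math.exp(z - top) for z in logits))
    return [z - log_z for z in logits]

def sequence_log_prob(step_logits, continuation):
    """Sum over t of log pi(y_t | c_t) under teacher forcing.

    step_logits[t] is the logit vector obtained when the model is conditioned
    on the reference prefix c_t = (x, y_<t); continuation[t] is the index y_t.
    """
    return sum(log_softmax(z)[y] for z, y in zip(step_logits, continuation))

# Toy example (assumed numbers): vocabulary of size 3, continuation of length 2.
toy_logits = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
toy_continuation = [0, 1]
log_p = sequence_log_prob(toy_logits, toy_continuation)
```

Working in log space avoids underflow when the product runs over many positions.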

Let $\mathcal{D}_{\mathrm{fg}}$ denote the data to be forgotten, and $\mathcal{D}_{\mathrm{ret}}$ the data on which the model's behavior should be preserved. (With a slight abuse of notation, we also use $\mathcal{D}_{\mathrm{fg}}$ and $\mathcal{D}_{\mathrm{ret}}$ to denote the empirical distributions obtained by sampling uniformly from the corresponding finite datasets.) An LLM unlearning algorithm $\mathcal{A}$ takes as input the weights $\theta_{0}\in\mathbb{R}^{d}$ of a pretrained model together with $\mathcal{D}_{\mathrm{fg}}$ (and usually also $\mathcal{D}_{\mathrm{ret}}$), and returns updated weights $\theta_{\mathrm{unl}}\in\mathbb{R}^{d}$. The goal is for the resulting policy $\pi_{\theta_{\mathrm{unl}}}$ to behave as if $\mathcal{D}_{\mathrm{fg}}$ had not been used for training, while preserving performance on $\mathcal{D}_{\mathrm{ret}}$. The key computational challenge is to avoid retraining the model from scratch. A common approach is to minimize a *forget* loss $\mathcal{L}_{\mathrm{fg}}(\theta;\mathcal{D}_{\mathrm{fg}})$. Different unlearning methods correspond to different choices of such a loss, often combined with additional regularization to preserve retain-set behavior. Arguably, the most natural choice for $\mathcal{L}_{\mathrm{fg}}$ is

$$\mathcal{L}_{\mathrm{fg}}^{\mathrm{GA}}(\theta;\mathcal{D}_{\mathrm{fg}})=\mathbb{E}_{(x,y)\sim\mathcal{D}_{\mathrm{fg}}}\left[\log\pi_{\theta}(y\mid x)\right],\tag{1}$$

which penalizes policies that assign a high probability to the forget continuations. Equivalently, since standard language-model pretraining minimizes next-token cross-entropy, minimizing $\mathcal{L}_{\mathrm{fg}}^{\mathrm{GA}}$ by gradient descent performs gradient ascent on the original cross-entropy objective restricted to $\mathcal{D}_{\mathrm{fg}}$. The resulting update *reverses* likelihood-based learning on the forget data, and is therefore commonly referred to as Gradient Ascent (GA) unlearning. However, GA provides no intrinsic mechanism for stopping this likelihood decrease: continued optimization can keep lowering the probability of the forget continuation and may quickly degrade the model's behavior beyond the forget set. Negative Preference Optimization (NPO) \[[51](https://arxiv.org/html/2606.02920#bib.bib51)\] addresses this issue by replacing direct likelihood minimization with a bounded preference-style objective \[[39](https://arxiv.org/html/2606.02920#bib.bib39)\]. Rather than indefinitely pushing down the likelihood of the forget-set continuations, NPO treats each of them as a negative preference example relative to the original model. In particular,

$$\mathcal{L}_{\mathrm{fg}}^{\mathrm{NPO}}(\theta;\mathcal{D}_{\mathrm{fg}})=-\frac{2}{\beta}\,\mathbb{E}_{(x,y)\sim\mathcal{D}_{\mathrm{fg}}}\left[\log\sigma\!\left(-\beta\log\frac{\pi_{\theta}(y\mid x)}{\pi_{\theta_{0}}(y\mid x)}\right)\right],\tag{2}$$

where $\beta>0$ is an inverse-temperature parameter and $\sigma(u)=(1+e^{-u})^{-1}$ is the sigmoid function. Unlike GA, NPO weakens the forget update once the forget continuation is already much less likely under the current model than under the original one. Indeed, if $r_{\theta}=\log\frac{\pi_{\theta}(y\mid x)}{\pi_{\theta_{0}}(y\mid x)}$, then the gradient scales as $\sigma(\beta r_{\theta})$, which vanishes for large negative $r_{\theta}$. However, because NPO compares the current likelihood to the original model likelihood $\pi_{\theta_{0}}(y\mid x)$, the magnitude of the forget update depends on the reference-model score of each example, and can therefore vary with sequence length or reference likelihood. SimNPO \[[15](https://arxiv.org/html/2606.02920#bib.bib15)\] removes this dependence by using a reference-free, length-normalized variant of the NPO objective. Although [Equations 1](https://arxiv.org/html/2606.02920#S2.E1) and [2](https://arxiv.org/html/2606.02920#S2.E2) define common forget-side objectives, practical unlearning methods typically combine them with a retain regularizer to improve the forget–retain trade-off. This leads to objectives of the form

$$\min_{\theta}\quad\mathcal{L}_{\mathrm{fg}}(\theta;\mathcal{D}_{\mathrm{fg}})+\lambda_{\mathrm{ret}}\,\mathcal{L}_{\mathrm{ret}}(\theta;\mathcal{D}_{\mathrm{ret}},\theta_{0}),\tag{3}$$

where $\mathcal{L}_{\mathrm{fg}}$ encourages suppression of the forget data, while $\mathcal{L}_{\mathrm{ret}}$ (computed on $\mathcal{D}_{\mathrm{ret}}$) discourages unnecessary drift from the original model $\pi_{\theta_{0}}$. The retain term is typically implemented as a KL penalty relative to $\pi_{\theta_{0}}$ or as a cross-entropy loss on retained examples.
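The three objectives above can be sketched on precomputed sequence log-probabilities. This is a toy illustration under stated assumptions, not the baselines' actual implementations: the function names are hypothetical, the expectations are replaced by batch means, and the retain loss is passed in as a plain number.

```python
import math

def log_sigmoid(u):
    """Numerically stable log(sigmoid(u))."""
    if u >= 0:
        return -math.log1p(math.exp(-u))
    return u - math.log1p(math.exp(u))

def ga_loss(logp_theta):
    """Eq. (1): mean log-likelihood of the forget continuations (minimized)."""
    return sum(logp_theta) / len(logp_theta)

def npo_loss(logp_theta, logp_ref, beta):
    """Eq. (2): -(2 / beta) * mean of log sigmoid(-beta * r_theta),
    where r_theta = log pi_theta(y|x) - log pi_theta0(y|x)."""
    terms = [log_sigmoid(-beta * (lt - lr)) for lt, lr in zip(logp_theta, logp_ref)]
    return -(2.0 / beta) * sum(terms) / len(terms)

def regularized_objective(forget_loss, retain_loss, lam_ret):
    """Eq. (3): forget loss plus a weighted retain regularizer."""
    return forget_loss + lam_ret * retain_loss
```

The sketch makes the contrast in the text visible: `ga_loss` decreases without bound as the log-probabilities fall, whereas `npo_loss` is non-negative and approaches zero once the current model assigns the continuation far less probability than the reference model.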

The main limitation of the above forget losses is that they only specify what should be made less likely, not what the model should do instead. The probability $\pi_{\theta}(y\mid x)$ can be decreased by concentrating mass on a few alternative continuations, spreading mass broadly, or degrading the next-token distribution more generally. Since these sequence-level objectives do not specify what should happen at each next-token prediction, they also provide no direct criterion for deciding when an original forget-set token has become sufficiently non-dominant relative to its alternatives. Further, although these objectives provide strong and widely used baselines, the computational cost of unlearning can be substantial. In practice, one must either fix the unlearning budget in advance (i.e., the number of finetuning epochs), or periodically evaluate intermediate models to decide which one should be returned. The latter requires downstream forget–retain validation data, since the checkpoint is selected using external metrics rather than a condition monitored during training. Saving many checkpoints and selecting among them after training further adds storage and evaluation overhead. To address these limitations, we introduce MASC, an unlearning objective based on a token-level dominance condition: a forget token should no longer dominate the model's top-$k$ non-target alternatives under the same reference prefix. This condition defines both the forget loss and the online stopping rule: MASC stops once it is satisfied on a sufficiently large fraction of monitored forget-set tokens, allowing the returned checkpoint to be selected during training without downstream validation.

## 3 MASC: MArgin Self-Correction

This section introduces MASC (MArgin Self-Correction) and derives its objective from first principles. MASC is based on a simple observation: exact reproduction of a forget sequence requires many positions at which the model assigns high probability to the true next token when evaluated under *teacher forcing*, i.e., when conditioned on the true prefix. We now turn this observation into a token-level comparison that will define both the forget loss and the stopping rule.

### 3.1 Token dominance and margins

We first define token\-dominance measures, which we use in our unlearning algorithm\. Intuitively, on the forget set, we want to lower the probability of the true next\-token continuation given the true prefix, while preserving overall utility\. Our approach therefore reduces the dominance of the highest\-probability token relative to its nearest alternatives, without substantially altering the rest of the distribution\.

#### Restricted token comparison\.

Let $(x,y)\sim\mathcal{D}_{\mathrm{fg}}$ be a forget example, where $x\in\mathcal{X}$ and $y=(y_{1},\ldots,y_{T})\in\mathcal{V}^{T}$. For each position $t\in[T]$ of the sequence, the teacher-forced context is $c_{t}=(x,y_{<t})$ and $y_{t}$ is called the *target* token. Recall that $\pi_{\theta}(\cdot\mid c_{t})$ denotes the next-token distribution over $\mathcal{V}$, and $z_{\theta}(v\mid c_{t})$ is the logit of each token $v\in\mathcal{V}$. Rather than comparing $y_{t}$ to the full vocabulary, we focus on the set of the model's top-$k$ non-target alternative tokens, denoted as

$$\mathcal{S}_{\theta,k}(c_{t})=\arg\max_{\substack{\mathcal{S}\subseteq\mathcal{V}\setminus\{y_{t}\}\\ |\mathcal{S}|=k}}\;\sum_{v\in\mathcal{S}}\pi_{\theta}(v\mid c_{t}),\tag{4}$$

with ties broken arbitrarily. (We do not differentiate through the top-$k$ operation; gradients are taken only through the logits appearing in the loss.) For $\beta>0$, further define the *restricted* probability of the target token $y_{t}$ as follows (for $\beta=1$, this is exactly the probability obtained by restricting $\pi_{\theta}(\cdot\mid c_{t})$ to $\{y_{t}\}\cup\mathcal{S}_{\theta,k}(c_{t})$ and renormalizing):

$$\pi_{\theta}^{(k,\beta)}(y_{t}\mid c_{t})=\frac{\exp(\beta z_{\theta}(y_{t}\mid c_{t}))}{\exp(\beta z_{\theta}(y_{t}\mid c_{t}))+\sum_{v\in\mathcal{S}_{\theta,k}(c_{t})}\exp(\beta z_{\theta}(v\mid c_{t}))}.\tag{5}$$

This restricted probability can be interpreted as a measure of local dominance and gives rise to a natural constraint:

###### Definition 1\.

For a threshold $\rho\in(0,1)$, we say that the target token $y_{t}$ is locally suppressed in context $c_{t}$ if $\pi_{\theta}^{(k,\beta)}(y_{t}\mid c_{t})\leq\rho$.
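To make the top-$k$ selection, the restricted probability, and the local-suppression check concrete, here is a small sketch on a single logit vector. The function names and the toy logits are hypothetical; since the softmax is monotone in the logits, the sketch selects the top-$k$ alternatives by logit rather than by probability.

```python
import math

def top_k_alternatives(logits, target, k):
    """Indices of the k highest-logit tokens other than the target (Eq. 4)."""
    others = [v for v in range(len(logits)) if v != target]
    return sorted(others, key=lambda v: logits[v], reverse=True)[:k]

def restricted_prob(logits, target, k, beta=1.0):
    """Eq. (5): softmax of beta * logits over {target} plus top-k alternatives."""
    alts = top_k_alternatives(logits, target, k)
    top = max(logits[v] for v in [target] + alts)  # shift for stability
    num = math.exp(beta * (logits[target] - top))
    den = num + sum(math.exp(beta * (logits[v] - top)) for v in alts)
    return num / den

def locally_suppressed(logits, target, k, rho, beta=1.0):
    """Definition 1: the restricted target probability is at most rho."""
    return restricted_prob(logits, target, k, beta) <= rho

# Toy example (assumed numbers): the target token 0 still dominates.
toy_logits = [3.0, 1.0, 0.0, -1.0]
p_restricted = restricted_prob(toy_logits, target=0, k=2)
```

With these toy logits the target keeps most of the restricted mass, so it is not locally suppressed at $\rho=0.5$.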

MASC uses this local-dominance measure as a constraint and hence implicitly asks: is the forget token still preferred over plausible replacements? Proposition [1](https://arxiv.org/html/2606.02920#Thmproposition1) shows that controlling this *local dominance* on many positions is enough to control exact reproduction of the whole continuation.

###### Proposition 1\.

Consider a forget prompt $x$ and the corresponding continuation $y=(y_{1},\ldots,y_{T})$, evaluated under teacher forcing. Let $c_{t}=(x,y_{<t})$, and let $\beta=1$. Assume there exists a set $I\subseteq\{1,\ldots,T\}$ with $|I|\geq\lceil(1-\alpha)T\rceil$ such that $\pi_{\theta}^{(k,1)}(y_{t}\mid c_{t})\leq\rho$ for every $t\in I$. Then

$$\pi_{\theta}(y\mid x)=\prod_{t=1}^{T}\pi_{\theta}(y_{t}\mid c_{t})\leq\rho^{\lceil(1-\alpha)T\rceil}.$$

The proof of [Proposition 1](https://arxiv.org/html/2606.02920#Thmproposition1) is deferred to Appendix [B](https://arxiv.org/html/2606.02920#A2). The proposition shows that enforcing the condition of [Definition 1](https://arxiv.org/html/2606.02920#Thmdefinition1) on many positions $t\in[T]$ gives a bound on the probability of exactly reproducing the forgotten continuation $y$ from prompt $x$.
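To get a feel for how quickly the bound of Proposition 1 decays with sequence length, the following sketch evaluates $\rho^{\lceil(1-\alpha)T\rceil}$. The function names and the numerical values are illustrative assumptions, not settings taken from the paper.

```python
import math

def min_controlled_positions(alpha, seq_len):
    """Smallest |I| allowed by Proposition 1: ceil((1 - alpha) * T).

    Floating-point rounding in (1 - alpha) * T can push the ceiling up by
    one for some inputs; the example below uses exactly representable values.
    """
    return math.ceil((1.0 - alpha) * seq_len)

def reproduction_bound(rho, alpha, seq_len):
    """Upper bound rho ** ceil((1 - alpha) * T) on pi_theta(y | x)."""
    return rho ** min_controlled_positions(alpha, seq_len)

# Assumed values: half of 40 positions locally suppressed at rho = 0.5.
bound = reproduction_bound(rho=0.5, alpha=0.5, seq_len=40)
```

In this toy setting the bound is $0.5^{20}$, which is already below $10^{-6}$ even though half of the positions are left uncontrolled.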

Remark. The parameter $k$ does not appear in the bound because the proposition is conditional on the local constraint being satisfied: for any $k<|\mathcal{V}|$, once $\pi_{\theta}^{(k,1)}(y_{t}\mid c_{t})\leq\rho$, the full-vocabulary probability also satisfies $\pi_{\theta}(y_{t}\mid c_{t})\leq\rho$. Thus, $k$ only determines how stringent the local comparison is, while the sequence-level bound depends on $\rho$ and on the number of controlled positions. We defer the discussion of how $k$ affects the optimization of the proposed algorithm to [Section D.4](https://arxiv.org/html/2606.02920#A4.SS4).
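One way to see the implication used in this remark is the short sketch below, where $z_{v}$ is shorthand (introduced here for brevity) for $z_{\theta}(v\mid c_{t})$. The full-vocabulary softmax has the same numerator as the restricted probability at $\beta=1$, while its denominator contains every term of the restricted denominator plus further positive terms.

```latex
\pi_{\theta}(y_t \mid c_t)
  = \frac{e^{z_{y_t}}}{\sum_{v \in \mathcal{V}} e^{z_v}}
  \;\le\; \frac{e^{z_{y_t}}}{e^{z_{y_t}} + \sum_{v \in \mathcal{S}_{\theta,k}(c_t)} e^{z_v}}
  = \pi_{\theta}^{(k,1)}(y_t \mid c_t)
  \;\le\; \rho .
```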

Takeaway 1. Controlling *local* target-token dominance on many positions gives an upper bound on the probability of reproducing the exact continuation.

#### From probabilities to margins\.

[Proposition 1](https://arxiv.org/html/2606.02920#Thmproposition1) gives an upper bound on exact reproduction when the condition in [Definition 1](https://arxiv.org/html/2606.02920#Thmdefinition1) is met for many forget tokens in the sequence. A direct surrogate would penalize violations with $[\pi_{\theta}^{(k,\beta)}(y_{t}\mid c_{t})-\rho]_{+}$. However, this probability-space penalty can have weak gradients when the observed token already dominates the restricted set. In that regime, the softmax probability is close to one, so even large changes in the underlying logit margin produce only small changes in the penalized quantity. We therefore propose an alternative logit-space condition based on the following *restricted margin*

$$m_{\theta}^{(k,\beta)}(x,y,t)=\beta z_{\theta}(y_{t}\mid c_{t})-\log\sum_{v\in\mathcal{S}_{\theta,k}(c_{t})}\exp(\beta z_{\theta}(v\mid c_{t})).\tag{6}$$

The margin in ([6](https://arxiv.org/html/2606.02920#S3.E6)) compares the target-token logit with the log-sum-exp aggregate of the selected alternative logits. (The log-sum-exp term is a smooth approximation of the maximum alternative logit. Larger $\beta$ makes the margin closer to a gap against the strongest alternative, while smaller $\beta$ averages more broadly over the top-$k$ alternatives.) Large margins correspond to strong target-token preference; small margins correspond to competitive alternatives. Moreover, because this quantity is defined directly on the logits, increasing dominance of the observed token continues to increase the violation linearly. In addition, the following lemma establishes that imposing a threshold on the restricted probability as in [Definition 1](https://arxiv.org/html/2606.02920#Thmdefinition1) is equivalent to imposing a corresponding threshold on the margin of [Equation 6](https://arxiv.org/html/2606.02920#S3.E6).

###### Lemma 1\.

Fix $\rho\in(0,1)$ and define $\tau_{\rho}=\log(\rho/(1-\rho))$. Then, for any forget position $t$ and any $\beta>0$, $\pi_{\theta}^{(k,\beta)}(y_{t}\mid c_{t})\leq\rho$ if and only if $m_{\theta}^{(k,\beta)}(x,y,t)\leq\tau_{\rho}$.
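The equivalence in Lemma 1 can be checked numerically. Dividing the numerator and denominator of Equation (5) by the numerator shows that the restricted probability equals the sigmoid of the restricted margin, and the sigmoid is increasing with $\sigma(\tau_{\rho})=\rho$. The sketch below uses hypothetical helper names and toy logits, and assumes $1\leq k<|\mathcal{V}|$.

```python
import math

def restricted_margin(logits, target, k, beta=1.0):
    """Eq. (6): beta * z_target minus log-sum-exp of beta * z_v over the
    top-k non-target alternatives."""
    others = [v for v in range(len(logits)) if v != target]
    alts = sorted(others, key=lambda v: logits[v], reverse=True)[:k]
    scaled = [beta * logits[v] for v in alts]
    top = max(scaled)  # shift for numerical stability
    lse = top + math.log(sum(math.exp(s - top) for s in scaled))
    return beta * logits[target] - lse

def tau(rho):
    """Margin threshold tau_rho = log(rho / (1 - rho))."""
    return math.log(rho / (1.0 - rho))

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

# Toy example (assumed numbers).
toy_logits = [3.0, 1.0, 0.0, -1.0]
margin = restricted_margin(toy_logits, target=0, k=2)
restricted_p = sigmoid(margin)  # equals Eq. (5) evaluated on the same logits
```

For any threshold, comparing `restricted_p` with `rho` and comparing `margin` with `tau(rho)` give the same verdict, which is the content of the lemma.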

### 3.2 MASC: Unlearning with Margin Self-Correction

We are now ready to propose our new unlearning method, MASC, that returns a policy whose restricted margin condition is violated on at most an $\alpha$ fraction of the forget-set tokens. Specifically, the algorithm uses the average $V_{\rho}$ of the *per-example violation rate* $v_{\rho}$ over the forget set

$$v_{\rho}(\theta;x,y)=\frac{1}{T}\sum_{t=1}^{T}\mathbf{1}\{m_{\theta}^{(k,\beta)}(x,y,t)>\tau_{\rho}\},\quad V_{\rho}(\theta;\mathcal{D}_{\mathrm{fg}})=\mathbb{E}_{(x,y)\sim\mathcal{D}_{\mathrm{fg}}}\left[v_{\rho}(\theta;x,y)\right].\tag{7}$$

In words, $V_{\rho}$ corresponds to the fraction of forget tokens whose restricted target probability remains above $\rho$. Our proposed unlearning objective then aims to find models that satisfy a forget constraint of the form $V_{\rho}(\theta;\mathcal{D}_{\mathrm{fg}})\leq\alpha$ for some $\alpha\in[0,1]$, while behaving similarly to the original model on retain data. In particular, motivated by recent evidence that policy-level KL divergence is closely tied to forgetting dynamics \[[40](https://arxiv.org/html/2606.02920#bib.bib40)\], we choose to minimize the KL divergence (averaged over retain continuations) between $\pi_{\theta_{0}}$ and the currently optimized policy $\pi_{\theta}$. All in all, we aim to solve

$$\min_{\theta}\quad\underbrace{\mathbb{E}_{(x,y)\sim\mathcal{D}_{\mathrm{ret}}}\left[\frac{1}{T}\sum_{t=1}^{T}\mathrm{KL}\!\left(\pi_{\theta_{0}}(\cdot\mid x,y_{<t})\,\middle\|\,\pi_{\theta}(\cdot\mid x,y_{<t})\right)\right]}_{\mathcal{L}_{\mathrm{ret}}^{\mathrm{KL}}(\theta,\theta_{0})}\quad\text{s.t.}\quad V_{\rho}(\theta;\mathcal{D}_{\mathrm{fg}})\leq\alpha.\tag{8}$$

In order to solve this optimization problem with gradient-based algorithms, we first replace the indicator in ([7](https://arxiv.org/html/2606.02920#S3.E7)) with the hinge loss $\psi_{\rho,\eta}(m)=[m-(\tau_{\rho}-\eta)]_{+}/\eta$ as a surrogate loss that satisfies $\mathbf{1}\{m>\tau_{\rho}\}\leq\psi_{\rho,\eta}(m)$. The final MASC algorithm then minimizes the following Lagrangian objective with gradient descent:

$$\min_{\theta}\ \mathcal{L}_{\mathrm{ret}}^{\mathrm{KL}}(\theta,\theta_{0})+\lambda\,\mathcal{L}_{\mathrm{fg}}^{\mathrm{MASC}}(\theta)\quad\text{where}\quad\mathcal{L}_{\mathrm{fg}}^{\mathrm{MASC}}(\theta)=\mathbb{E}_{(x,y)\sim\mathcal{D}_{\mathrm{fg}}}\left[\frac{1}{T}\sum_{t=1}^{T}\psi_{\rho,\eta}\!\left(m_{\theta}^{(k,\beta)}(x,y,t)\right)\right]. \tag{9}$$
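To make the pieces of objective (9) concrete, the following is a minimal NumPy sketch for a single sequence. It assumes the margins $m_{\theta}^{(k,\beta)}(x,y,t)$ have already been computed according to their earlier definition and are passed in as plain numbers; the function names (`hinge_surrogate`, `retain_kl`, `masc_objective`) and the toy values are illustrative and not taken from the authors' code, which would use an autodiff framework rather than NumPy.

```python
import numpy as np

def hinge_surrogate(margins, tau_rho, eta):
    """psi_{rho,eta}(m) = max(m - (tau_rho - eta), 0) / eta, elementwise."""
    margins = np.asarray(margins, dtype=float)
    return np.maximum(margins - (tau_rho - eta), 0.0) / eta

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def retain_kl(ref_logits, cur_logits):
    """Mean over positions of KL(pi_ref || pi_cur); inputs have shape (T, vocab)."""
    ref_logp = log_softmax(ref_logits)
    cur_logp = log_softmax(cur_logits)
    kl_per_position = (np.exp(ref_logp) * (ref_logp - cur_logp)).sum(axis=-1)
    return float(kl_per_position.mean())

def masc_objective(ref_logits, cur_logits, forget_margins, tau_rho, eta, lam):
    """Retain KL term plus lam times the mean hinge surrogate on forget margins."""
    forget_term = float(hinge_surrogate(forget_margins, tau_rho, eta).mean())
    return retain_kl(ref_logits, cur_logits) + lam * forget_term

# Toy check that the surrogate upper-bounds the indicator 1{m > tau_rho}.
toy_margins = np.array([2.0, 0.8, 0.5, -1.0])   # hypothetical margin values
psi = hinge_surrogate(toy_margins, tau_rho=1.0, eta=0.5)
indicator = (toy_margins > 1.0).astype(float)
print(psi, indicator)
```

Margins at or below $\tau_{\rho}-\eta$ contribute exactly zero, which matches the statement that tokens with sufficient slack receive no gradient.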
[Figure 2 plots. Left, "$\widehat{V}_{\rho}$ & unlearning": score value ($\widehat{V}_{\rho}$, Retain ROUGE, Forget ROUGE) versus optimizer step, with $\tau_{\alpha}$ and $\alpha=0.475$ marked. Middle, "Frontier of MASC dynamics": ROUGE (Retain) versus 1-ROUGE (Forget), showing the Pareto frontier, MASC models (seeds), and MASC model (avg) for $\alpha=0.7$, $\alpha=0.475$, and $\alpha=0.2$. Right, "$\widehat{V}_{\rho}$ across methods": $\widehat{V}_{\rho}$ versus active forget tokens for MASC, GradDiff, and SimNPO, with $\alpha=0.475$ marked.]

Figure 2: (Left:) Correlation between stopping statistic $\widehat{V}_{\rho}$ and forget–retain trade-off. (Middle:) Pareto frontier (top-right is the ideal) of MASC training dynamics. (Right:) $\widehat{V}_{\rho}$ versus number of forget tokens seen during training for three different unlearning methods. [Plots refer to TOFU].

Note that for a fixed set $\mathcal{S}_{\theta,k}(c_{t})$, consisting of tokens without a sufficient slack $\tau_{\rho}-m_{\theta}^{(k,\beta)}(x,y,t)<\eta$, the gradient $\nabla\mathcal{L}_{\mathrm{fg}}^{\mathrm{MASC}}$ decreases the target next-token logit, and increases the probabilities of the competitive alternatives relative to the target token (see Appendix [C](https://arxiv.org/html/2606.02920#A3) for the full derivation). On the other hand, tokens with sufficient slack have zero gradient. Given an initial fine-tuned model $\theta_{0}$, MASC runs gradient updates with model weights $\theta^{(s)}$ at steps $s$. Crucially, it terminates using a stopping rule that monitors the average violation rate on a subset $\mathcal{D}_{\mathrm{val}}\subset\mathcal{D}_{\mathrm{fg}}$

$$\widehat{V}_{\rho}\big(\theta^{(s)};\mathcal{D}_{\mathrm{val}}\big)=\frac{1}{|\mathcal{D}_{\mathrm{val}}|}\sum_{(x,y)\in\mathcal{D}_{\mathrm{val}}}\frac{1}{T_{y}}\sum_{t=1}^{T_{y}}\mathbf{1}\!\left\{m_{\theta^{(s)}}^{(k,\beta)}(x,y,t)>\tau_{\rho}\right\},$$

where $T_{y}$ is the length of sequence $y$. In particular, $\mathcal{D}_{\mathrm{val}}$ is drawn at random as a small *off-batch* subset of $\mathcal{D}_{\mathrm{fg}}$ (of size 8 to 16 in our experiments) and is employed during training only as a stopping condition. Note that choosing a small $\mathcal{D}_{\mathrm{val}}$ avoids a more costly pass over the entire $\mathcal{D}_{\mathrm{fg}}$ while still providing a usable stopping signal (see [Section 4](https://arxiv.org/html/2606.02920#S4)). Moreover, $\mathcal{D}_{\mathrm{val}}$ is independent of downstream evaluation (footnote 6: in many datasets, training data is usually in the form of raw text, while evaluation data is in the form of Q&A text) and does not require creating extra datasets (beyond the ones provided by the benchmark suites) that are more *aligned* with the downstream evaluation sets (e.g., \[[53](https://arxiv.org/html/2606.02920#bib.bib53)\]). MASC stops at the first step at which the monitored violation rate is at most the tolerance $\alpha$, i.e. at step

$$\tau_{\alpha}=\inf\left\{s\geq 0:\widehat{V}_{\rho}\big(\theta^{(s)};\mathcal{D}_{\mathrm{val}}\big)\leq\alpha\right\}.$$

The final policy that MASC outputs is $\pi_{\text{MASC}}=\pi_{\theta^{(\tau_{\alpha})}}$ (cf. [Algorithm 1](https://arxiv.org/html/2606.02920#alg1) in Appendix). MASC is closest in spirit to recent logit-level unlearning methods such as UNDIAL \[[10](https://arxiv.org/html/2606.02920#bib.bib10)\], Unilogit \[[46](https://arxiv.org/html/2606.02920#bib.bib46)\], and constrained entropy or logit-flattening approaches \[[13](https://arxiv.org/html/2606.02920#bib.bib13)\]. Unlike these methods, which distill toward modified full-vocabulary targets or flatten the predictive distribution under a fixed budget, MASC enforces a relative local condition against a small set of model-proposed alternatives and uses the same condition for early stopping.
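The stopping rule can be sketched as a generic loop that probes the violation rate before each update. This is only an illustration under stated assumptions: `update_step` and `probe_margins` are hypothetical callables standing in for one optimizer step and for computing the margins on $\mathcal{D}_{\mathrm{val}}$, the `max_steps` cap is an added safeguard, and the toy "model" is a single scalar that shifts fixed margins downward. The thresholds and margins below are made-up values, not settings from the paper.

```python
import numpy as np

def violation_rate(margins_per_seq, tau_rho):
    """Mean over sequences of the fraction of positions with margin > tau_rho."""
    per_seq = [float(np.mean(np.asarray(m, dtype=float) > tau_rho))
               for m in margins_per_seq]
    return float(np.mean(per_seq))

def run_until_tolerance(params, update_step, probe_margins, tau_rho, alpha, max_steps):
    """Return (params, step) for the first step s >= 0 with violation rate <= alpha.

    Returns (params, None) if the tolerance is not reached within max_steps.
    """
    for step in range(max_steps + 1):
        if violation_rate(probe_margins(params), tau_rho) <= alpha:
            return params, step
        if step < max_steps:
            params = update_step(params)
    return params, None

# Toy stand-in: the "model" is a scalar shift subtracted from fixed margins.
base_margins = [np.array([2.0, 1.5, 0.2]), np.array([3.0, 0.1])]
final_shift, stop_step = run_until_tolerance(
    params=0.0,
    update_step=lambda shift: shift + 0.5,                       # hypothetical update
    probe_margins=lambda shift: [m - shift for m in base_margins],
    tau_rho=1.0,
    alpha=0.4,
    max_steps=10,
)
print(stop_step, final_shift)
```

In this toy run the violation rate goes from 7/12 to 5/12 to 1/4, so the loop stops at step 2. Because the probe is evaluated before each update, a model that already satisfies the tolerance is returned unchanged at step 0, consistent with the infimum over $s\geq 0$.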

#### Monitored violation rate\.

Experiments strongly suggest that the monitored violation rate $\widehat{V}_{\rho}$ behaves as intended. First, [Figure 2](https://arxiv.org/html/2606.02920#S3.F2) (Left) shows how, along the MASC optimization trajectory, $\widehat{V}_{\rho}$ (computed on $\mathcal{D}_{\mathrm{val}}\subset\mathcal{D}_{\mathrm{fg}}$) closely tracks the forget–retain trade-off measured by standard downstream evaluation metrics (which are not computed during training). In addition, the trajectory traces a Pareto frontier in the forget–retain trade-off space, and the tolerance $\alpha$ selects different points on this frontier; see [Figure 2](https://arxiv.org/html/2606.02920#S3.F2) (Middle). A natural question is whether the same stopping statistic could serve as a generic early-stopping criterion for other unlearning objectives. Our experiments suggest that it cannot: under the same budget of processed forget tokens, $\widehat{V}_{\rho}$ decreases steadily for MASC but remains close to its initial value for GradDiff and SimNPO; see [Figure 2](https://arxiv.org/html/2606.02920#S3.F2) (Right). This provides strong empirical evidence that, unlike MASC, other established unlearning methods would not stop early under the same stopping criterion, as they are not designed to reduce the violation statistic.

## 4 Experiments

We now evaluate MASC against several baselines on three well\-known LLM unlearning datasets: TOFU \(90/10 split\), MUSE News, and MUSE Books\.

| Method | Unlearning Efficacy: 1−ROUGE-L ↑ | Unlearning Efficacy: 1−Prob ↑ | Unlearning Efficacy: Truth Ratio ↑ | Retain Utility: MU ↑ | Efficiency: Time (sec) ↓ |
|---|---|---|---|---|---|
| Base (Llama-2 7B) | 0.024 | 0.010 | 0.519 | 0.628 | – |
| Retrain | 0.601 | 0.852 | 0.681 | 0.613 | – |
| GA | 0.330 [0.029] | 0.829 [0.022] | 0.555 [0.007] | 0.459 [0.014] | 306.6 [40.2] |
| GradDiff | 0.598 [0.020] | 0.792 [0.003] | 0.514 [0.003] | 0.561 [0.005] | 907.3 [168.5] |
| NPO | 0.366 [0.023] | 0.666 [0.005] | 0.580 [0.012] | 0.533 [0.003] | 856.3 [36.4] |
| NPO+KLR | 0.362 [0.016] | 0.713 [0.006] | 0.577 [0.006] | 0.516 [0.006] | 983.3 [48.9] |
| RMU | 0.080 [0.004] | 0.103 [0.011] | 0.523 [0.000] | 0.618 [0.001] | 305.4 [41.8] |
| SimNPO | 0.349 [0.006] | 0.497 [0.004] | 0.562 [0.001] | 0.596 [0.001] | 541.7 [37.8] |
| MASC (Ours) | 0.629 [0.142] | 0.672 [0.127] | 0.633 [0.020] | 0.666 [0.003] | 87.9 [8.1] |

Table 1: TOFU (forget10/retain90) results. Averages and standard deviations are reported as avg [std]. We mark the best-performing unlearning method in bold and underline the runner-up for each metric.

| Dataset | Method | Unlearning Efficacy: VerbMem $\mathcal{D}_{\mathrm{fg}}$ ↓ | Unlearning Efficacy: KnowMem $\mathcal{D}_{\mathrm{fg}}$ ↓ | Retain Utility: KnowMem $\mathcal{D}_{\mathrm{ret}}$ ↑ | Efficiency: Time (sec) ↓ |
|---|---|---|---|---|---|
| MUSE News | Base (Llama-2 7B) | 57.25 | 66.45 | 54.90 | – |
| MUSE News | Retrain | 20.26 | 32.55 | 55.31 | – |
| MUSE News | GA | 0.00 [0.00] | 0.00 [0.00] | 0.00 [0.00] | 183.04 [0.01] |
| MUSE News | GradDiff | 0.26 [0.19] | 25.30 [3.22] | 34.38 [3.01] | 2517.87 [71.38] |
| MUSE News | NPO | 0.00 [0.00] | 0.00 [0.00] | 0.00 [0.00] | 227.88 [0.01] |
| MUSE News | NPO+KLR | 6.32 [1.38] | 51.78 [3.50] | 44.36 [2.97] | 4062.76 [628.86] |
| MUSE News | RMU | 27.15 [1.33] | 47.81 [3.74] | 41.95 [3.04] | 1076.06 [89.67] |
| MUSE News | SimNPO | 8.03 [0.61] | 45.81 [3.64] | 37.02 [3.00] | 1877.52 [18.81] |
| MUSE News | MASC (Ours) | 1.10 [0.25] | 19.37 [3.04] | 23.14 [2.79] | 138.68 [37.28] |
| MUSE Books | Base (ICLM-7B) | 99.70 | 47.12 | 69.13 | – |
| MUSE Books | Retrain | 14.45 | 30.29 | 68.74 | – |
| MUSE Books | GA | 0.00 [0.00] | 0.00 [0.00] | 0.00 [0.00] | 290.16 [0.01] |
| MUSE Books | GradDiff | 0.00 [0.00] | 0.00 [0.00] | 41.23 [4.12] | 572.23 [18.70] |
| MUSE Books | NPO | 0.00 [0.00] | 0.00 [0.00] | 0.00 [0.00] | 355.13 [2.90] |
| MUSE Books | NPO+KLR | 0.00 [0.00] | 23.34 [3.28] | 67.74 [3.89] | 2570.19 [407.62] |
| MUSE Books | RMU | 11.05 [0.38] | 22.37 [3.40] | 60.76 [3.96] | 1477.86 [108.12] |
| MUSE Books | SimNPO | 0.00 [0.00] | 0.00 [0.00] | 47.79 [4.17] | 2647.71 [45.48] |
| MUSE Books | MASC (Ours) | 0.90 [0.80] | 30.90 [1.30] | 65.30 [1.20] | 64.94 [0.49] |

Table 2: MUSE results. Averages and standard deviations are reported as avg [std]. We mark the best-performing model in bold, excluding Base, Retrain, and models with zero retain utility (KnowMem $\mathcal{D}_{\mathrm{ret}}$), and underline the runner-up for each metric. Ties are resolved by retain utility.

Baselines. We compare MASC against standard unlearning baselines: (i) Gradient Ascent (GA) \[[50](https://arxiv.org/html/2606.02920#bib.bib50)\]; (ii) GradDiff (or GA+GDR) \[[33](https://arxiv.org/html/2606.02920#bib.bib33)\], which adds a retain-side correction by combining GA on the forget set with gradient descent on retained examples; (iii) NPO \[[51](https://arxiv.org/html/2606.02920#bib.bib51)\]; (iv) NPO+KLR, which combines the NPO objective with a KL retain regularizer to reduce drift on $\mathcal{D}_{\mathrm{ret}}$ \[[42](https://arxiv.org/html/2606.02920#bib.bib42)\]; (v) RMU, a representation-level method that redirects activations associated with the forget data \[[27](https://arxiv.org/html/2606.02920#bib.bib27)\]; (vi) SimNPO \[[15](https://arxiv.org/html/2606.02920#bib.bib15)\]. We also include pretrained and retrain (on $\mathcal{D}_{\mathrm{ret}}$ only) baselines for comparison. In all MASC experiments, we keep the backbone model frozen and show that the unlearning update can be performed effectively by training only LoRA adapters \[[18](https://arxiv.org/html/2606.02920#bib.bib18)\]. This also makes MASC memory-efficient and confines the update to a small trainable module.
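As a reminder of why adapter-only training keeps the update small, here is a minimal NumPy sketch of a single LoRA-style linear layer: the backbone weight stays frozen and only two low-rank factors are trainable. The layer size, rank, and scaling below are hypothetical, and which layers the authors adapt is not stated in this section.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank, scale = 64, 64, 4, 8.0        # hypothetical sizes and scaling

W_frozen = rng.normal(size=(d_out, d_in))        # backbone weight, never updated
A = rng.normal(scale=0.01, size=(rank, d_in))    # trainable low-rank factor
B = np.zeros((d_out, rank))                      # trainable, zero-initialized

def lora_forward(x, W, A, B, scale, rank):
    """Frozen linear map plus the scaled low-rank correction (scale / rank) * B A."""
    return x @ W.T + (scale / rank) * (x @ A.T) @ B.T

x = rng.normal(size=(2, d_in))
y = lora_forward(x, W_frozen, A, B, scale, rank)

trainable = A.size + B.size                      # parameters touched by the update
print(trainable, W_frozen.size)
```

With `B` initialized to zero the adapted layer reproduces the frozen layer exactly at the start, so the update begins from the base model and only the two small factors ever change.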

Metrics. On TOFU, the unlearning-efficacy metrics are: $1-$ROUGE-L, measuring lexical dissimilarity from the target answer; $1-$Prob., measuring the reduction in teacher-forced probability of the target continuation; and Truth Ratio, measuring the preference for perturbed alternatives over the true forgotten answer. MU, in turn, measures utility on the retain portion of the data. For MUSE News and MUSE Books, we report VerbMem $\mathcal{D}_{\mathrm{fg}}$, corresponding to *verbatim* memorization on the forget set; KnowMem $\mathcal{D}_{\mathrm{fg}}$, measuring *knowledge* memorization on the forget set; and KnowMem $\mathcal{D}_{\mathrm{ret}}$, which accounts for knowledge preservation on the retain set. Additional metrics are reported in [Appendix E](https://arxiv.org/html/2606.02920#A5). We measure wall-clock unlearning runtime in seconds. To make timing comparable, all methods are run on the same hardware (footnote 7: all experiments are run on a single H100 GPU), data pipeline for each dataset, and evaluation-free training loop. Timing starts after model and data loading, and stops when the method returns its checkpoint: at the prescribed final step for fixed-schedule baselines, and at the first checkpoint satisfying $\widehat{V}_{\rho}\leq\alpha$ for MASC. We exclude shared one-time costs such as model loading and dataset preprocessing, but we include optimizer steps, forward/backward passes, retain batches, KL computations, and MASC online probe checks. Baselines are run using code from their official repositories and with the reported best hyperparameters. For timing, we use the number of epochs specified by each baseline's selected hyperparameter setting, even when reproducing the reported forget-retain trade-off required additional epochs in our runs. Moreover, when timing MASC's competitors, we do not include the cost of offline checkpoint selection based on downstream metric evaluation. Thus, whenever possible, our comparison favors the baselines in both runtime and forget–retain performance.
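The timing protocol described above can be sketched as a small harness that starts the clock only after setup and stops it when the method returns its checkpoint. The names `setup`, `unlearn`, and `clock` are hypothetical; the injectable clock is a design choice made here for testability, not something the paper describes.

```python
import time

def time_unlearning(setup, unlearn, clock=time.perf_counter):
    """Time only the unlearning call; setup cost is deliberately excluded."""
    state = setup()                   # excluded: model loading, data preprocessing
    start = clock()
    checkpoint = unlearn(state)       # included: optimizer steps, probe checks, etc.
    elapsed = clock() - start
    return checkpoint, elapsed

# Toy usage with stand-in callables (no real model involved).
def toy_setup():
    return {"weights": [0.0, 0.0]}

def toy_unlearn(state):
    # Stand-in for a training loop that returns its final checkpoint.
    return {"weights": [w - 1.0 for w in state["weights"]]}

ckpt, seconds = time_unlearning(toy_setup, toy_unlearn)
print(ckpt, seconds >= 0.0)
```

On a GPU, kernels are launched asynchronously, so a real harness would also need to synchronize the device before each clock read; that detail is omitted here.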

Results. [Tables 1](https://arxiv.org/html/2606.02920#S4.T1) and [2](https://arxiv.org/html/2606.02920#S4.T2) show that MASC consistently achieves a competitive forget–retain trade-off across the three datasets using a fraction of the wall-clock runtime required by the other methods (see [Figure 5](https://arxiv.org/html/2606.02920#A5.F5) in Appendix [D](https://arxiv.org/html/2606.02920#A4) for a visual summary of such a trade-off). On TOFU, MASC obtains the best $1-$ROUGE-L, Truth Ratio, and retain MU, as well as the shortest wall-clock time, while remaining competitive on $1-$Prob. On MUSE Books, MASC preserves retain utility close to the strongest non-collapsed baselines (i.e., those with nonzero retain utility) at a significantly lower computational cost. On MUSE News, MASC achieves lower KnowMem on $\mathcal{D}_{\mathrm{fg}}$ than most baselines, although its retain utility is sometimes slightly lower.
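To quantify "a fraction of the wall-clock runtime" on TOFU, the mean times reported in Table 1 can be turned into speedup ratios. The numbers below are copied from that table; the ratios ignore the reported standard deviations and are a simple illustration, not an analysis from the paper.

```python
# Mean wall-clock times in seconds, copied from Table 1 (TOFU).
tofu_time_sec = {
    "GA": 306.6,
    "GradDiff": 907.3,
    "NPO": 856.3,
    "NPO+KLR": 983.3,
    "RMU": 305.4,
    "SimNPO": 541.7,
}
masc_time_sec = 87.9

# Ratio of each baseline's mean time to MASC's mean time.
speedup = {method: t / masc_time_sec for method, t in tofu_time_sec.items()}

for method, ratio in sorted(speedup.items(), key=lambda kv: kv[1]):
    print(f"{method:8s} {ratio:5.1f}x")
```

On these means, MASC is roughly 3.5 times faster than the quickest baselines (RMU and GA) and about 11.2 times faster than the slowest (NPO+KLR).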

Takeaway 2. MASC shifts unlearning from fixed-length training to targeted early stopping, reaching a competitive forget–retain trade-off substantially faster than fixed-schedule counterparts.

## 5 The effect of scale on unlearning

[Figure 3 plot: raw score (E-ROUGE and P-ROUGE) versus model size (0.5B, 1.5B, 3B, 7B).]

| Metric | $\hat{\alpha}$ | $\log\hat{C}$ | $R^{2}$ |
|---|---|---|---|
| E-ROUGE | 0.269 | -0.603 | 0.954 |
| P-ROUGE | 0.111 | -0.972 | 0.847 |

Figure 3: Learning stage. Fitted scaling laws (log-log plot).

Improving the efficiency of unlearning methods is a necessary step toward scaling these procedures to increasingly large models. However, to the best of our knowledge, there is no systematic study on how model size affects unlearning efficacy. Some related recent studies suggest that memorization increases with scale during training \[[6](https://arxiv.org/html/2606.02920#bib.bib6),[36](https://arxiv.org/html/2606.02920#bib.bib36),[31](https://arxiv.org/html/2606.02920#bib.bib31)\], so that larger models may enter the unlearning stage with different levels and forms of memorization. At the same time, it is unclear how scale would then affect the final forget–retain trade-off after unlearning. In this section, we study the scaling behavior of the *learning-unlearning pipeline* of MASC and SimNPO across model sizes on the Qwen2.5 model family using the TOFU dataset.

### 5.1 Unlearning procedures and metrics

We now describe the learning\-unlearning pipeline, along with the memorization metrics used\.

End-to-end unlearning pipeline. Most LLM unlearning benchmarks, including TOFU \[[33](https://arxiv.org/html/2606.02920#bib.bib33)\], MUSE \[[42](https://arxiv.org/html/2606.02920#bib.bib42)\], and WMDP \[[27](https://arxiv.org/html/2606.02920#bib.bib27)\], are built from a common two-stage pipeline. (1) Learning. An initial model $\pi_{\mathrm{init}}$ is finetuned on the full benchmark data $\mathcal{D}=\mathcal{D}_{\mathrm{fg}}\cup\mathcal{D}_{\mathrm{ret}}$ to yield a task-adapted model $\pi_{\theta_{0}}$ (usually called *base* model) that contains both forget and retain information. (2) Unlearning. An unlearning algorithm $\mathcal{A}$ (here MASC and SimNPO) is then applied to $\pi_{\theta_{0}}$ using the split $(\mathcal{D}_{\mathrm{fg}},\mathcal{D}_{\mathrm{ret}})$, producing an unlearned model $\pi_{\theta_{\mathrm{unl}}}$. Following this pipeline, we evaluate the Qwen2.5 model family at different scales on TOFU and track how forget-set memorization changes from learning to unlearning.

Two memorization levels. We use TOFU built-in metrics to distinguish between two levels of memorization. The first level is *exact memorization*: the model reproduces the target answer under the original question. We measure this with Exact Q&A ROUGE (E-ROUGE), which computes lexical overlap between the model output and the gold answer on the original TOFU questions. The second level is *paraphrase-robust knowledge memorization*: the model recovers the same underlying answer even when the question is paraphrased, measured by Paraphrased Q&A ROUGE (P-ROUGE). To measure the resulting forget–retain trade-off, we also report retain utility, measured by MU, the TOFU aggregate score on $\mathcal{D}_{\mathrm{ret}}$.
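For intuition about what a ROUGE-style lexical overlap score measures, the sketch below computes a ROUGE-L recall from the longest common subsequence (LCS) of whitespace tokens. This is an illustrative simplification: the exact ROUGE variant (recall, precision, or F-measure), the tokenization, and any stemming used by the benchmark's evaluation code are not specified in this section, and the example sentences are invented.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l_recall(reference, candidate):
    """LCS length divided by the number of reference tokens (lowercased, whitespace split)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref:
        return 0.0
    return lcs_length(ref, cand) / len(ref)

# Invented example: the model output differs from the gold answer in one token.
gold = "the author was born in Paris"
output = "The author was born in Rome"
score = rouge_l_recall(gold, output)
print(score, 1.0 - score)
```

Under this simplified definition, E-ROUGE and P-ROUGE would differ only in the prompt that produced `output` (the original question versus a paraphrase), not in how the overlap itself is scored.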

### 5.2 Effects of size on unlearning

We first examine the learning stage, where the model is supervised-finetuned on the full benchmark data before unlearning. For each memorization metric, we average scores over seeds at each model size and fit an empirical power law $s(N)=CN^{\alpha}$ in log–log space, where $N$ denotes the number of parameters. As shown in [Figure 3](https://arxiv.org/html/2606.02920#S5.F3), scale affects the two memorization levels differently. Exact reproduction grows faster with model size than paraphrase-based knowledge recovery: E-ROUGE has exponent $\hat{\alpha}=0.269$, while P-ROUGE has exponent $\hat{\alpha}=0.111$. Since these metrics are bounded by one, the exact-memorization scores are already close to saturation at size 7B. Overall, this suggests that larger models become disproportionately better at reproducing target content in its original form compared to recovering the same content under paraphrased prompts. These results are consistent with prior evidence that memorization increases with model scale \[[6](https://arxiv.org/html/2606.02920#bib.bib6),[36](https://arxiv.org/html/2606.02920#bib.bib36),[31](https://arxiv.org/html/2606.02920#bib.bib31)\], while adding a distinction between forms of memorization that grow at different rates. As a consequence of this learning-stage behavior, models of different sizes enter the unlearning stage with different memorization profiles.
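A power law $s(N)=CN^{\alpha}$ becomes a straight line in log–log space, $\log s=\alpha\log N+\log C$, so the fit reduces to ordinary least squares. The sketch below shows this with `numpy.polyfit` on synthetic, noise-free points generated from made-up constants ($C=0.6$, $\alpha=0.25$); these are not the paper's measurements, and the unit of $N$ (billions of parameters here) and the logarithm base are assumptions.

```python
import numpy as np

def fit_power_law(sizes, scores):
    """Least-squares fit of log s = alpha * log N + log C; returns (alpha, log_C, R^2)."""
    log_n = np.log(np.asarray(sizes, dtype=float))
    log_s = np.log(np.asarray(scores, dtype=float))
    alpha, log_c = np.polyfit(log_n, log_s, deg=1)
    predicted = alpha * log_n + log_c
    ss_res = np.sum((log_s - predicted) ** 2)
    ss_tot = np.sum((log_s - log_s.mean()) ** 2)
    return float(alpha), float(log_c), float(1.0 - ss_res / ss_tot)

# Synthetic data only: sizes in billions of parameters, scores from a made-up law.
sizes = np.array([0.5, 1.5, 3.0, 7.0])
scores = 0.6 * sizes ** 0.25

alpha_hat, log_c_hat, r2 = fit_power_law(sizes, scores)
print(alpha_hat, log_c_hat, r2)
```

Because the scores are bounded by one, a fitted power law of this kind can only describe the regime before saturation, which is the caveat the paragraph above raises for E-ROUGE at 7B.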

[Figure 4 plots: radar plots for SimNPO and MASC with axes 1-(E-ROUGE), 1-(P-ROUGE), and MU on $\mathcal{D}_{\mathrm{r}}$ (the retain set), radial ticks from 0.2 to 0.8, and one line per model size (0.5B, 1.5B, 3B, 7B).]

Figure 4: Unlearning stage. Cross-scale behavior after unlearning for MASC and SimNPO. All the metrics are plotted such that the higher the better.

Interestingly, the picture changes after unlearning. Despite different starting memorization profiles (cf. [Figure 3](https://arxiv.org/html/2606.02920#S5.F3)), unlearning brings the forget-side metrics back to a similar range across model sizes. In particular, we summarize the forget–retain trade-off after unlearning using the radar plots in [Figure 4](https://arxiv.org/html/2606.02920#S5.F4). Each line in the plot corresponds to a model size, with $1-\mathrm{E\text{-}ROUGE}$ and $1-\mathrm{P\text{-}ROUGE}$ reported together with retain utility (MU). For both MASC and SimNPO, the forget-side metrics remain in a similar range across model sizes after unlearning, while retain utility improves more clearly with scale. This suggests that larger models mainly improve the utility side of the forget–retain trade-off, rather than yielding different degrees of forgetting across model sizes. For the unlearning-stage scaling-law plot, see [Figure 7](https://arxiv.org/html/2606.02920#A5.F7) in Appendix.

Takeaway 3. During learning, larger models amplify the two levels of memorization with different strengths. However, after unlearning, residual memorization is largely scale-invariant, while larger models preserve higher retain utility.

## 6 Discussion and future work

We introduce MASC, a margin\-based unlearning method whose loss and stopping rule both target the same condition: forget tokens should no longer dominate plausible model\-proposed alternatives\. This makes the procedure self\-stopping \(or*adaptive*\) and thus substantially faster than fixed\-budget baselines\. While our experiments suggest that MASC also improves paraphrase\-level forgetting metrics, our bound still does not control such behavior directly: doing so would require a notion of forgetting defined in a representation space invariant to surface*rewordings*\. Designing mathematically grounded unlearning objectives for this semantic regime is an interesting direction for future work\.

## Acknowledgment

FDG was supported by the Swiss National Science Foundation (SNSF) Grant 218343, and AS was supported by the Swiss National Science Foundation (SNSF) Grant 204439. The authors acknowledge the use of LLMs to improve exposition and generate code. The authors take full responsibility for the content of the paper.

## References

- Boiko et al. \[2023\] Daniil A Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical research with large language models. *Nature*, 624(7992):570–578, 2023.
- Bourtoule et al. \[2021\] Lucas Bourtoule, Varun Chandrasekaran, Christopher A Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and Nicolas Papernot. Machine unlearning. In *2021 IEEE symposium on security and privacy (SP)*, pages 141–159. IEEE, 2021.
- California Legislature \[2018\] California Legislature. California consumer privacy act of 2018, AB 375, 2018. URL [https://ca.gov](https://ca.gov/). Cal. Civ. Code § 1798.100 - 1798.199.
- Cao et al. \[2024\] Pengfei Cao, Chenhao Wang, Zhitao He, Hongbang Yuan, Jiachun Li, Yubo Chen, Kang Liu, Jun Zhao, et al. Rwku: Benchmarking real-world knowledge unlearning for large language models. *Advances in Neural Information Processing Systems*, 37:98213–98263, 2024.
- Cao and Yang \[2015\] Yinzhi Cao and Junfeng Yang. Towards making systems forget with machine unlearning. In *2015 IEEE symposium on security and privacy*, pages 463–480. IEEE, 2015.
- Carlini et al. \[2022\] Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. Quantifying memorization across neural language models. In *The Eleventh International Conference on Learning Representations*, 2022.
- Chen and Yang \[2023\] Jiaao Chen and Diyi Yang. Unlearn what you want to forget: Efficient unlearning for llms. In *Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, pages 12041–12052, 2023.
- Chen et al. \[2021\] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. *arXiv preprint arXiv:2107.03374*, 2021.
- Dang et al. \[2025\] Huu-Tien Dang, Tin Pham, Hoang Thanh-Tung, and Naoya Inoue. On effects of steering latent representation for large language model unlearning. In *Proceedings of the AAAI Conference on Artificial Intelligence*, pages 23733–23742, 2025.
- Dong et al. \[2025\] Yijiang River Dong, Hongzhou Lin, Mikhail Belkin, Ramon Huerta, and Ivan Vulić. Undial: Self-distillation with adjusted logits for robust unlearning in large language models. In *Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)*, pages 8827–8840, 2025.
- Dorna et al. \[2026\] Vineeth Dorna, Anmol Reddy Mekala, Wenlong Zhao, Andrew McCallum, J Zico Kolter, Zachary Chase Lipton, and Pratyush Maini. Openunlearning: Accelerating LLM unlearning via unified benchmarking of methods and metrics. In *The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track*, 2026. URL [https://openreview.net/forum?id=Gy67Zh5X1i](https://openreview.net/forum?id=Gy67Zh5X1i).
- Eldan and Russinovich \[2023\] Ronen Eldan and Mark Russinovich. Who’s harry potter? approximate unlearning in llms, 2023. URL [https://arxiv.org/abs/2310.02238](https://arxiv.org/abs/2310.02238).
- Entesari et al. \[2026\] Taha Entesari, Arman Hatami, Rinat Khaziev, Anil Ramakrishna, and Mahyar Fazlyab. Constrained entropic unlearning: A primal-dual framework for large language models. In *The Thirty-ninth Annual Conference on Neural Information Processing Systems*, 2026. URL [https://openreview.net/forum?id=ZtB34bQI54](https://openreview.net/forum?id=ZtB34bQI54).
- Fan et al. \[2025\] Chongyu Fan, Jinghan Jia, Yihua Zhang, Anil Ramakrishna, Mingyi Hong, and Sijia Liu. Towards LLM unlearning resilient to relearning attacks: A sharpness-aware minimization perspective and beyond. In *Forty-second International Conference on Machine Learning*, 2025. URL [https://openreview.net/forum?id=zZjLv6F0Ks](https://openreview.net/forum?id=zZjLv6F0Ks).
- Fan et al. \[2026\] Chongyu Fan, Jiancheng Liu, Licong Lin, Jinghan Jia, Ruiqi Zhang, Song Mei, and Sijia Liu. Simplicity prevails: Rethinking negative preference optimization for LLM unlearning. In *The Thirty-ninth Annual Conference on Neural Information Processing Systems*, 2026. URL [https://openreview.net/forum?id=JbvSQm5h1l](https://openreview.net/forum?id=JbvSQm5h1l).
- Hong et al. \[2024\] Yihuai Hong, Yuelin Zou, Lijie Hu, Ziqian Zeng, Di Wang, and Haiqin Yang. Dissecting fine-tuning unlearning in large language models. In *Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing*, pages 3933–3941, 2024.
- Hoofnagle et al. \[2019\] Chris Jay Hoofnagle, Bart Van Der Sloot, and Frederik Zuiderveen Borgesius. The european union general data protection regulation: what it is and what it means. *Information & communications technology law*, 28(1):65–98, 2019.
- Hu et al. \[2022\] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. *ICLR*, 1(2):3, 2022.
- Ilharco et al. \[2022\] Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. *arXiv preprint arXiv:2212.04089*, 2022.
- Ji et al. \[2024\] Jiabao Ji, Yujian Liu, Yang Zhang, Gaowen Liu, Ramana R Kompella, Sijia Liu, and Shiyu Chang. Reversing the forget-retain objectives: An efficient llm unlearning framework from logit difference. *Advances in Neural Information Processing Systems*, 37:12581–12611, 2024.
- Karamolegkou et al. \[2023\] Antonia Karamolegkou, Jiaang Li, Li Zhou, and Anders Søgaard. Copyright violations and large language models. In *Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, pages 7403–7412, 2023.
- Kassem et al. \[2023\] Aly Kassem, Omar Mahmoud, and Sherif Saad. Preserving privacy through dememorization: An unlearning technique for mitigating memorization risks in language models. In *Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, pages 4360–4379, 2023.
- Kirkpatrick et al. \[2017\] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. *Proceedings of the national academy of sciences*, 114(13):3521–3526, 2017.
- Lee et al. \[2026\] Bruce W. Lee, Addie Foote, Alex Infanger, Leni Shor, Harish K Kamath, Jacob Goldman-Wetzler, Bryce Woodworth, Alex Cloud, and Alexander Matt Turner. Distillation robustifies unlearning. In *The Thirty-ninth Annual Conference on Neural Information Processing Systems*, 2026. URL [https://openreview.net/forum?id=UTGjik64IK](https://openreview.net/forum?id=UTGjik64IK).
- Lee et al. \[2020\] Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. Biobert: a pre-trained biomedical language representation model for biomedical text mining. *Bioinformatics*, 36(4):1234–1240, 2020.
- Lewkowycz et al. \[2022\] Aitor Lewkowycz, Anders Johan Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Venkatesh Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, *Advances in Neural Information Processing Systems*, 2022. URL [https://openreview.net/forum?id=IFXTZERXdM7](https://openreview.net/forum?id=IFXTZERXdM7).
- Li et al. \[2024\] Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew Bo Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Ariel Herbert-Voss, Cort B Breuer, Andy Zou, Mantas Mazeika, Zifan Wang, Palash Oswal, Weiran Lin, Adam Alfred Hunt, Justin Tienken-Harder, Kevin Y. Shih, Kemper Talley, John Guan, Ian Steneker, David Campbell, Brad Jokubaitis, Steven Basart, Stephen Fitz, Ponnurangam Kumaraguru, Kallol Krishna Karmakar, Uday Tupakula, Vijay Varadharajan, Yan Shoshitaishvili, Jimmy Ba, Kevin M. Esvelt, Alexandr Wang, and Dan Hendrycks. The WMDP benchmark: Measuring and reducing malicious use with unlearning. In *Forty-first International Conference on Machine Learning*, 2024. URL [https://openreview.net/forum?id=xlr6AUDuJz](https://openreview.net/forum?id=xlr6AUDuJz).
- Liu et al. \[2025\] Sijia Liu, Yuanshun Yao, Jinghan Jia, Stephen Casper, Nathalie Baracaldo, Peter Hase, Yuguang Yao, Chris Yuhao Liu, Xiaojun Xu, Hang Li, et al. Rethinking machine unlearning for large language models. *Nature Machine Intelligence*, 7(2):181–194, 2025.
- Lu et al. \[2024a\] Weikai Lu, Ziqian Zeng, Jianwei Wang, Zhengdong Lu, Zelin Chen, Huiping Zhuang, and Cen Chen. Eraser: Jailbreaking defense in large language models via unlearning harmful knowledge. *arXiv preprint arXiv:2404.05880*, 2024a.
- Lu et al. \[2022\] Ximing Lu, Sean Welleck, Jack Hessel, Liwei Jiang, Lianhui Qin, Peter West, Prithviraj Ammanabrolu, and Yejin Choi. Quark: Controllable text generation with reinforced unlearning. *Advances in neural information processing systems*, 35:27591–27609, 2022.
- Lu et al. \[2024b\] Xingyu Lu, Xiaonan Li, Qinyuan Cheng, Kai Ding, Xuan-Jing Huang, and Xipeng Qiu. Scaling laws for fact memorization of large language models. In *Findings of the Association for Computational Linguistics: EMNLP 2024*, pages 11263–11282, 2024b.
- Luo et al. \[2025\] Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An empirical study of catastrophic forgetting in large language models during continual fine-tuning. *IEEE Transactions on Audio, Speech and Language Processing*, 2025.
- Maini et al. \[2024\] Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary C Lipton, and J Zico Kolter. Tofu: A task of fictitious unlearning for llms. *arXiv preprint arXiv:2401.06121*, 2024.
- McCloskey and Cohen \[1989\] Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In *Psychology of learning and motivation*, volume 24, pages 109–165. Elsevier, 1989.
- Meeus et al. \[2024\] Matthieu Meeus, Shubham Jain, Marek Rei, and Yves-Alexandre de Montjoye. Did the neurons read your book? document-level membership inference for large language models. In *33rd USENIX Security Symposium (USENIX Security 24)*, pages 2369–2385, 2024.
- Morris et al. \[2025\] John X Morris, Chawin Sitawarin, Chuan Guo, Narine Kokhlikyan, G Edward Suh, Alexander M Rush, Kamalika Chaudhuri, and Saeed Mahloujifar. How much do language models memorize? *arXiv preprint arXiv:2505.24832*, 2025.
- Nam et al. \[2024\] Daye Nam, Andrew Macvean, Vincent Hellendoorn, Bogdan Vasilescu, and Brad Myers. Using an llm to help with code understanding. In *Proceedings of the IEEE/ACM 46th International Conference on Software Engineering*, pages 1–13, 2024.
- Pawelczyk et al. \[2024\] Martin Pawelczyk, Seth Neel, and Himabindu Lakkaraju. In-context unlearning: Language models as few-shot unlearners. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, *Proceedings of the 41st International Conference on Machine Learning*, volume 235 of *Proceedings of Machine Learning Research*, pages 40034–40050. PMLR, 21–27 Jul 2024. URL [https://proceedings.mlr.press/v235/pawelczyk24a.html](https://proceedings.mlr.press/v235/pawelczyk24a.html).
- Rafailov et al. \[2023\] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. *Advances in neural information processing systems*, 36:53728–53741, 2023.
- Shenfeld et al. \[2026\] Idan Shenfeld, Jyothish Pari, and Pulkit Agrawal. RL’s razor: Why online reinforcement learning forgets less. In *The Fourteenth International Conference on Learning Representations*, 2026. URL [https://openreview.net/forum?id=7HNRYT4V44](https://openreview.net/forum?id=7HNRYT4V44).
- Sheshadri et al. \[2024\] Abhay Sheshadri, Aidan Ewart, Phillip Guo, Aengus Lynch, Cindy Wu, Vivek Hebbar, Henry Sleight, Asa Cooper Stickland, Ethan Perez, Dylan Hadfield-Menell, et al. Latent adversarial training improves robustness to persistent harmful behaviors in llms. *arXiv preprint arXiv:2407.15549*, 2024.
- Shi et al. \[2024\] Weijia Shi, Jaechan Lee, Yangsibo Huang, Sadhika Malladi, Jieyu Zhao, Ari Holtzman, Daogao Liu, Luke Zettlemoyer, Noah A Smith, and Chiyuan Zhang. Muse: Machine unlearning six-way evaluation for language models. *arXiv preprint arXiv:2407.06460*, 2024.
- Staab et al\. \[2024\]Robin Staab, Mark Vero, Mislav Balunovic, and Martin Vechev\.Beyond memorization: Violating privacy via inference with large language models\.In*The Twelfth International Conference on Learning Representations*, 2024\.
- Tamirisa et al\. \[2025\]Rishub Tamirisa, Bhrugu Bharathi, Long Phan, Andy Zhou, Alice Gatti, Tarun Suresh, Maxwell Lin, Justin Wang, Rowan Wang, Ron Arel, Andy Zou, Dawn Song, Bo Li, Dan Hendrycks, and Mantas Mazeika\.Tamper\-resistant safeguards for open\-weight LLMs\.In*The Thirteenth International Conference on Learning Representations*, 2025\.URL[https://openreview\.net/forum?id=4FIjRodbW6](https://openreview.net/forum?id=4FIjRodbW6)\.
- Thaker et al\. \[2024\]Pratiksha Thaker, Yash Maurya, Shengyuan Hu, Zhiwei Steven Wu, and Virginia Smith\.Guardrail baselines for unlearning in llms\.*arXiv preprint arXiv:2403\.03329*, 2024\.
- Vasilev et al\. \[2025\]Stefan Vasilev, Christian Herold, Baohao Liao, Seyyed Hadi Hashemi, Shahram Khadivi, and Christof Monz\.Unilogit: Robust machine unlearning for LLMs using uniform\-target self\-distillation\.In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,*Findings of the Association for Computational Linguistics: ACL 2025*, pages 22453–22472, Vienna, Austria, July 2025\. Association for Computational Linguistics\.ISBN 979\-8\-89176\-256\-5\.doi:10\.18653/v1/2025\.findings\-acl\.1154\.URL[https://aclanthology\.org/2025\.findings\-acl\.1154/](https://aclanthology.org/2025.findings-acl.1154/)\.
- Wang et al\. \[2024\]Bichen Wang, Yuzhe Zi, Yixin Sun, Yanyan Zhao, and Bing Qin\.Rkld: Reverse kl\-divergence\-based knowledge distillation for unlearning personal information in large language models\.*arXiv preprint arXiv:2406\.01983*, 2024\.
- Wang et al\. \[2025\]Yue Wang, Qizhou Wang, Feng Liu, Wei Huang, Yali Du, Xiaojiang Du, and Bo Han\.GRU: Mitigating the trade\-off between unlearning and retention for LLMs\.In*Forty\-second International Conference on Machine Learning*, 2025\.URL[https://openreview\.net/forum?id=EAjhGr1Oeo](https://openreview.net/forum?id=EAjhGr1Oeo)\.
- Wu et al\. \[2023\]Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann\.Bloomberggpt: A large language model for finance\.*arXiv preprint arXiv:2303\.17564*, 2023\.
- Yao and Xu \[2024\]Yuanshun Yao and Xiaojun Xu\.Large language model unlearning\.*Advances in Neural Information Processing Systems*, 37:105425–105475, 2024\.
- Zhang et al\. \[2024\]Ruiqi Zhang, Licong Lin, Yu Bai, and Song Mei\.Negative preference optimization: From catastrophic collapse to effective unlearning\.In*First Conference on Language Modeling*, 2024\.URL[https://openreview\.net/forum?id=MXLBXjQkmb](https://openreview.net/forum?id=MXLBXjQkmb)\.
- Zhang et al\. \[2025\]Zhiwei Zhang, Fali Wang, Xiaomin Li, Zongyu Wu, Xianfeng Tang, Hui Liu, Qi He, Wenpeng Yin, and Suhang Wang\.Catastrophic failure of LLM unlearning via quantization\.In*The Thirteenth International Conference on Learning Representations*, 2025\.URL[https://openreview\.net/forum?id=lHSeDYamnz](https://openreview.net/forum?id=lHSeDYamnz)\.
- Zhong et al\. \[2026\]Yisheng Zhong, Zhengbang Yang, and Zhuangdi Zhu\.DUET: Distilled LLM unlearning from an efficiently contextualized teacher\.In*The Fourteenth International Conference on Learning Representations*, 2026\.URL[https://openreview\.net/forum?id=Xa6QRrXrKX](https://openreview.net/forum?id=Xa6QRrXrKX)\.

## Appendix A Additional related work

Beyond the likelihood-reversal and preference-optimization baselines considered in the main text, one line of work uses *relabeling-based finetuning*, replacing the original forget-set responses with generic, neutral, or refusal-like alternatives before further finetuning \[[12](https://arxiv.org/html/2606.02920#bib.bib12), [4](https://arxiv.org/html/2606.02920#bib.bib4)\]. Another line studies *reinforcement-learning* formulations of unlearning, for example by using reward models or negative-similarity rewards to discourage undesirable generations while preserving fluency \[[30](https://arxiv.org/html/2606.02920#bib.bib30), [22](https://arxiv.org/html/2606.02920#bib.bib22)\]. Localized-parameter approaches instead try to identify and edit the parts of the model most responsible for the target information, including representation-engineering methods, adaptive variants of representation redirection, and locate-then-unlearn approaches based on neuron or parameter attribution \[[27](https://arxiv.org/html/2606.02920#bib.bib27), [9](https://arxiv.org/html/2606.02920#bib.bib9), [48](https://arxiv.org/html/2606.02920#bib.bib48), [16](https://arxiv.org/html/2606.02920#bib.bib16)\]. A further family leverages auxiliary models, including task-vector methods, contrastive decoding, and knowledge-distillation-based unlearning \[[19](https://arxiv.org/html/2606.02920#bib.bib19), [29](https://arxiv.org/html/2606.02920#bib.bib29), [20](https://arxiv.org/html/2606.02920#bib.bib20), [47](https://arxiv.org/html/2606.02920#bib.bib47), [10](https://arxiv.org/html/2606.02920#bib.bib10)\]. Finally, some methods avoid weight updates altogether or combine them with input/output-side interventions, such as prompt classifiers, input corruption, guardrails, filtering, or in-context unlearning \[[45](https://arxiv.org/html/2606.02920#bib.bib45), [38](https://arxiv.org/html/2606.02920#bib.bib38)\]. These directions illustrate that LLM unlearning can be pursued through output losses, representation editing, auxiliary-model guidance, or inference-time control.

#### Positioning of MASC with similar methods\.

A few recent works might be considered close to MASC because they also modify the model’s output distribution on forget examples. UNDIAL \[[10](https://arxiv.org/html/2606.02920#bib.bib10)\] proposes a self-distillation approach in which the target-token logit is adjusted downward and the model is trained to match the resulting softened distribution, with the goal of avoiding the over-unlearning and instability observed in GA and NPO. Unilogit \[[46](https://arxiv.org/html/2606.02920#bib.bib46)\] further develops this direction by constructing self-distillation targets from the current model and dynamically adjusting the target logit so that the target token receives uniform probability. Another closely related line formulates unlearning through entropy or logit-flattening objectives: Entesari et al. \[[13](https://arxiv.org/html/2606.02920#bib.bib13)\] cast forgetting and retention as a constrained optimization problem, use a logit-margin flattening loss to drive the full predictive distribution toward uniformity on the forget set, and solve the resulting problem with a primal-dual procedure. These methods share with MASC the view that stable unlearning should act directly on the model’s local predictive distribution rather than simply maximizing forget loss.

MASC differs from these approaches in both the target of suppression and the role of the training statistic. Rather than distilling toward a modified full-vocabulary target distribution, as in UNDIAL or Unilogit, or flattening the entire output distribution toward uniformity, as in entropy-based or logit-flattening methods, MASC imposes a relative local condition: the gold forget token should no longer dominate a small set of plausible non-gold alternatives proposed by the model itself. This makes the forget update selective: tokens whose local dominance is already below threshold contribute no forget gradient, and only the still-dominant positions are corrected. Moreover, the same margin-violation event defines the loss, the constrained forget condition, and the stopping rule. Thus, MASC is not only a logit-level suppression objective; it is also a self-terminating unlearning procedure whose tolerance parameter directly selects a point along the empirical forget–retain frontier.

## Appendix B Proofs

See [Proposition 1](https://arxiv.org/html/2606.02920#Thmproposition1).

###### Proof\.

For $\beta=1$, the restricted probability is obtained by restricting the normalization to $\{y_t\}\cup\mathcal{S}_{\theta,k}(c_t)$. Since this removes nonnegative terms from the full softmax denominator, we have

$$\pi_\theta(y_t\mid c_t)\leq\pi_\theta^{(k,1)}(y_t\mid c_t)\leq\rho\quad\text{for every }t\in I.$$

Moreover, by assumption, $|I|\geq\lceil(1-\alpha)T\rceil$. For the remaining positions $t\notin I$, we only use the trivial bound $\pi_\theta(y_t\mid c_t)\leq 1$. Hence

$$\pi_\theta(y\mid x)=\prod_{t=1}^{T}\pi_\theta(y_t\mid c_t)=\prod_{t\in I}\pi_\theta(y_t\mid c_t)\prod_{t\notin I}\pi_\theta(y_t\mid c_t)\leq\rho^{|I|}\leq\rho^{\lceil(1-\alpha)T\rceil},$$

where the last inequality uses $\rho\in(0,1)$ and $|I|\geq\lceil(1-\alpha)T\rceil$. ∎

Note that the reproduction bound can be extended to $\beta\geq 1$, at the cost of replacing $\rho$ by a sigmoid-rescaled threshold. Let $\tau_\rho=\log(\rho/(1-\rho))$, and suppose that, for all $t\in I$,

$$\pi_\theta^{(k,\beta)}(y_t\mid c_t)\leq\rho.$$

Equivalently,

$$m_\theta^{(k,\beta)}(x,y,t)=\beta z_\theta(y_t\mid c_t)-\log\sum_{v\in\mathcal{S}_{\theta,k}(c_t)}\exp(\beta z_\theta(v\mid c_t))\leq\tau_\rho.$$

Since $\beta\geq 1$,

$$\frac{1}{\beta}\log\sum_{v\in\mathcal{S}_{\theta,k}(c_t)}\exp(\beta z_\theta(v\mid c_t))\leq\log\sum_{v\in\mathcal{S}_{\theta,k}(c_t)}\exp(z_\theta(v\mid c_t)).$$

Therefore,

$$m_\theta^{(k,1)}(x,y,t)\leq\frac{1}{\beta}m_\theta^{(k,\beta)}(x,y,t)\leq\frac{\tau_\rho}{\beta}.$$

Using the identity

$$\pi_\theta^{(k,1)}(y_t\mid c_t)=\sigma\!\left(m_\theta^{(k,1)}(x,y,t)\right),$$

we obtain

$$\pi_\theta^{(k,1)}(y_t\mid c_t)\leq\sigma\!\left(\frac{\tau_\rho}{\beta}\right).$$

Since the full softmax denominator contains all vocabulary tokens, $\pi_\theta(y_t\mid c_t)\leq\pi_\theta^{(k,1)}(y_t\mid c_t)$, and hence, if this condition holds on a set $I$ with $|I|\geq\lceil(1-\alpha)T\rceil$, then

$$\pi_\theta(y\mid x)\leq\left[\sigma\!\left(\frac{\tau_\rho}{\beta}\right)\right]^{\lceil(1-\alpha)T\rceil}.$$

Remark. For $\rho>1/2$, we have $\tau_\rho=\log(\rho/(1-\rho))>0$. Hence, as $\beta$ increases,

$$\sigma\!\left(\frac{\tau_\rho}{\beta}\right)\to\frac{1}{2}.$$

Thus, controlling the $\beta$-sharpened restricted probability implies a reproduction bound with an effective per-token threshold closer to $1/2$. In the limit $\beta\to\infty$, the restricted comparison approaches a hard maximum over the selected competitors, and the condition becomes the requirement that the target token should not beat its strongest plausible alternative. This matches the intended interpretation of MASC: forgotten tokens should no longer be clearly preferred over the model’s own local alternatives.
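As a quick numerical illustration of the bound above, the following Python sketch evaluates $[\sigma(\tau_\rho/\beta)]^{\lceil(1-\alpha)T\rceil}$. The function names and the sequence length `T = 20` are illustrative assumptions and not part of the paper's code; the values $\rho=0.7$ and $\alpha=0.475$ are taken from the TOFU row of Table 8.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def per_token_threshold(rho: float, beta: float = 1.0) -> float:
    """Effective per-token threshold sigma(tau_rho / beta), derived for beta >= 1."""
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie in (0, 1)")
    if beta < 1.0:
        raise ValueError("the bound is derived for beta >= 1")
    tau_rho = math.log(rho / (1.0 - rho))
    return sigmoid(tau_rho / beta)


def reproduction_bound(rho: float, alpha: float, T: int, beta: float = 1.0) -> float:
    """Upper bound on pi_theta(y | x) when at least ceil((1 - alpha) * T)
    positions satisfy the restricted-probability condition."""
    n_controlled = math.ceil((1.0 - alpha) * T)
    return per_token_threshold(rho, beta) ** n_controlled


if __name__ == "__main__":
    # T = 20 is a hypothetical sequence length chosen only for illustration.
    for beta in (1.0, 5.0):
        print(beta, per_token_threshold(0.7, beta), reproduction_bound(0.7, 0.475, 20, beta))
```

For $\beta=1$ the per-token threshold reduces to $\rho$ itself, and with these values $\lceil 0.525\cdot 20\rceil=11$ positions are controlled, so the bound is $0.7^{11}$. For $\beta=5$ the per-token threshold lies strictly between $1/2$ and $\rho$, in line with the remark.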

See [Lemma 1](https://arxiv.org/html/2606.02920#Thmlemma1).

###### Proof\.

Since $\pi_\theta^{(k,\beta)}(y_t\mid c_t)=\sigma(m_\theta^{(k,\beta)}(x,y,t))$ and $\sigma^{-1}(\rho)=\tau_\rho$, the claim follows by monotonicity of $\sigma$. ∎

## Appendix C Gradient of the MASC forget term

In this section, we derive the token-level gradient effect of the MASC forget loss $\mathcal{L}_{\mathrm{fg}}^{\mathrm{MASC}}$ in [Equation 9](https://arxiv.org/html/2606.02920#S3.E9). The calculation is performed at the level of logits for a single forget token. It therefore describes the direct contribution of the forget term before averaging over examples, positions, and minibatches. The retain term contributes an additional gradient that is not included in this local calculation.

Fix a forget example $(x,y)$, a position $t$, and the teacher-forced context $c_t=(x,y_{<t})$. Let $\mathcal{S}=\mathcal{S}_{\theta,k}(c_t)$ denote the selected top-$k$ non-gold competitor set. As in the main text, we treat $\mathcal{S}$ as fixed during differentiation, since the top-$k$ selection is not differentiated through. The restricted margin is

$$m_\theta=m_\theta^{(k,\beta)}(x,y,t)=\beta z_\theta(y_t\mid c_t)-\log\sum_{u\in\mathcal{S}}\exp(\beta z_\theta(u\mid c_t)).$$

The token-level MASC surrogate is

$$\psi_{\rho,\eta}(m_\theta)=\frac{[m_\theta-(\tau_\rho-\eta)]_{+}}{\eta}.$$

Thus, if $m_\theta\leq\tau_\rho-\eta$, the hinge is inactive and the derivative of the forget surrogate with respect to all logits at this context is zero. If $m_\theta>\tau_\rho-\eta$, the hinge is active and

$$\psi_{\rho,\eta}(m_\theta)=\frac{m_\theta-(\tau_\rho-\eta)}{\eta}.$$

Hence its derivative is $1/\eta$ times the derivative of the margin. For $v\in\mathcal{S}$, define the softmax weights over the competitor set

$$w_v=\frac{\exp(\beta z_\theta(v\mid c_t))}{\sum_{u\in\mathcal{S}}\exp(\beta z_\theta(u\mid c_t))}.$$

Then, for any token $a\in\mathcal{V}$,

$$\frac{\partial\psi_{\rho,\eta}(m_\theta)}{\partial z_\theta(a\mid c_t)}=\begin{cases}0,&m_\theta\leq\tau_\rho-\eta,\\[4pt] \dfrac{\beta}{\eta},&m_\theta>\tau_\rho-\eta\quad\text{and}\quad a=y_t,\\[8pt] -\dfrac{\beta}{\eta}w_a,&m_\theta>\tau_\rho-\eta\quad\text{and}\quad a\in\mathcal{S},\\[8pt] 0,&m_\theta>\tau_\rho-\eta\quad\text{and}\quad a\notin\{y_t\}\cup\mathcal{S}.\end{cases}$$

Therefore, for an active forget token, gradient descent on the MASC forget term decreases the gold logit $z_\theta(y_t\mid c_t)$ and increases the logits of the selected competitors $z_\theta(v\mid c_t)$ for $v\in\mathcal{S}$, with larger updates for competitors that already have larger softmax weight within $\mathcal{S}$. Tokens outside $\{y_t\}\cup\mathcal{S}$ receive no direct logit-level gradient from this token-level term.

Since the full MASC forget loss averages this quantity over forget examples and positions,

$$\mathcal{L}_{\mathrm{fg}}^{\mathrm{MASC}}(\theta)=\mathbb{E}_{(x,y)\sim\mathcal{D}_{\mathrm{fg}}}\left[\frac{1}{T}\sum_{t=1}^{T}\psi_{\rho,\eta}\!\left(m_\theta^{(k,\beta)}(x,y,t)\right)\right],$$

its gradient is the corresponding average of the per-token contributions above. In practice, this expectation is estimated by minibatches. The competitor set is recomputed during training, so MASC behaves like a self-correcting active-set method: at each step, the gold token is challenged by the non-gold alternatives that the current model itself considers plausible.

Remark. The statement that logits outside $\{y_t\}\cup\mathcal{S}$ have zero derivative refers to the direct derivative of the single token-level surrogate with respect to the logits at the current context. A parameter update can still affect other logits indirectly through the shared network parameters, and the retain regularizer adds its own gradient on retained contexts.
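The case analysis above can be checked numerically. The sketch below, written in plain Python on a toy five-token vocabulary, compares the closed-form logit gradient with a central finite difference while holding the competitor set fixed. All function names and the toy logits are hypothetical choices for illustration; this is not the authors' implementation.

```python
import math


def top_k_competitors(logits, gold, k):
    """Indices of the k largest non-gold logits (the competitor set S)."""
    others = [i for i in range(len(logits)) if i != gold]
    return sorted(others, key=lambda i: logits[i], reverse=True)[:k]


def margin(logits, gold, comps, beta):
    """Restricted margin: beta * z_gold - log sum_{u in S} exp(beta * z_u)."""
    lse = math.log(sum(math.exp(beta * logits[u]) for u in comps))
    return beta * logits[gold] - lse


def surrogate(logits, gold, comps, beta, tau, eta):
    """Hinge surrogate [m - (tau - eta)]_+ / eta."""
    return max(margin(logits, gold, comps, beta) - (tau - eta), 0.0) / eta


def analytic_grad(logits, gold, comps, beta, tau, eta):
    """Closed-form gradient of the surrogate with respect to each logit."""
    grad = [0.0] * len(logits)
    if margin(logits, gold, comps, beta) <= tau - eta:
        return grad  # hinge inactive: no forget gradient at this position
    denom = sum(math.exp(beta * logits[u]) for u in comps)
    grad[gold] = beta / eta
    for v in comps:
        grad[v] = -(beta / eta) * math.exp(beta * logits[v]) / denom
    return grad


def numeric_grad(logits, gold, comps, beta, tau, eta, h=1e-6):
    """Central finite differences with the competitor set held fixed."""
    grad = []
    for a in range(len(logits)):
        up, down = list(logits), list(logits)
        up[a] += h
        down[a] -= h
        diff = surrogate(up, gold, comps, beta, tau, eta) - surrogate(down, gold, comps, beta, tau, eta)
        grad.append(diff / (2.0 * h))
    return grad


# Hypothetical toy setup: token 0 is the gold token, k = 2 competitors.
LOGITS = [4.0, 2.0, 1.0, -1.0, 0.5]
GOLD, K, BETA, RHO, ETA = 0, 2, 1.0, 0.7, 0.25
TAU = math.log(RHO / (1.0 - RHO))
COMPS = top_k_competitors(LOGITS, GOLD, K)

if __name__ == "__main__":
    print(analytic_grad(LOGITS, GOLD, COMPS, BETA, TAU, ETA))
    print(numeric_grad(LOGITS, GOLD, COMPS, BETA, TAU, ETA))
```

Because the competitor weights $w_v$ sum to one, the entries of the active gradient sum to zero: the push-down on the gold logit is exactly redistributed as a push-up on the selected competitors.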

## Appendix D Additional Ablations

### D.1 Robustness against quantization

A growing line of work (for example \[[24](https://arxiv.org/html/2606.02920#bib.bib24), [14](https://arxiv.org/html/2606.02920#bib.bib14), [41](https://arxiv.org/html/2606.02920#bib.bib41), [44](https://arxiv.org/html/2606.02920#bib.bib44)\]) shows that unlearned knowledge can often be recovered by simple post-processing or lightweight attacks on the unlearned model. In particular, quantization reveals a simple yet remarkable robustness failure of LLM unlearning, as shown by Zhang et al. \[[52](https://arxiv.org/html/2606.02920#bib.bib52)\]. Indeed, applying low-bit quantization to an unlearned LLM can recover supposedly forgotten information, exposing a mismatch between full-precision unlearning metrics and robustness after deployment. In this sense, quantization is one of the easiest attacks on unlearning: it requires no access to the training pipeline, no carefully designed prompts, and no additional optimization over the forget set. Motivated by this observation, we evaluate whether MASC remains effective after 4-bit quantization. As shown in [Table 3](https://arxiv.org/html/2606.02920#A4.T3), MASC preserves low forget-set memorization after quantization on both MUSE News and MUSE Books.

| Method | MUSE News: VerbMem $\mathcal{D}_{\mathrm{fg}}$ ↓ | MUSE News: KnowMem $\mathcal{D}_{\mathrm{fg}}$ ↓ | MUSE Books: VerbMem $\mathcal{D}_{\mathrm{fg}}$ ↓ | MUSE Books: KnowMem $\mathcal{D}_{\mathrm{fg}}$ ↓ |
| --- | --- | --- | --- | --- |
| Base (4-bit) | 46.40 (−10.85) | 54.32 (−12.13) | 94.02 (−5.68) | 38.13 (−8.99) |
| Retrain (4-bit) | 20.06 (−0.20) | 35.35 (+2.80) | 14.00 (−0.45) | 24.41 (−5.88) |
| GA (4-bit) | 0.00 (+0.00) | 0.00 (+0.00) | 0.00 (+0.00) | 0.00 (+0.00) |
| GradDiff (4-bit) | 7.36 (+7.10) | 47.28 (+21.98) | 0.00 (+0.00) | 29.04 (+29.04) |
| NPO (4-bit) | 14.61 (+14.61) | 29.41 (+29.41) | 4.70 (+4.70) | 3.53 (+3.53) |
| NPO+KLR (4-bit) | 39.35 (+33.03) | 52.19 (+0.41) | 49.17 (+49.17) | 35.96 (+12.62) |
| RMU (4-bit) | 21.77 (−5.38) | 36.80 (−11.01) | 8.37 (−2.68) | 11.16 (−11.21) |
| SimNPO (4-bit) | 37.58 (+29.55) | 48.82 (+3.01) | 71.27 (+71.27) | 33.99 (+33.99) |
| MASC (Ours, 4-bit) | 5.79 (+4.69) | 28.80 (+9.43) | 0.87 (−0.03) | 25.48 (−5.42) |

Table 3: MUSE forget metrics under 4-bit quantization. Parentheses report the change relative to the corresponding full-precision method in [Table 2](https://arxiv.org/html/2606.02920#S4.T2), computed as $\Delta=\text{Model}_{\mathrm{4\text{-}bit}}-\text{Model}_{\mathrm{full}}$. Both metrics are lower-is-better: negative values indicate that the 4-bit version reduces the forget metric, while positive values indicate worse forget-side performance relative to the full-precision model.

### D.2 Timing without LoRA update

MASC is implemented with LoRA adapters in our main experiments to reduce memory usage and wall-clock cost. To check whether the observed behavior is specific to this parameter-efficient implementation, we also run MASC with full finetuning on TOFU. As shown in [Table 4](https://arxiv.org/html/2606.02920#A4.T4), full finetuning yields very similar forget–retain behavior to the LoRA implementation. The main difference is computational: full finetuning is slower (140.6 s versus 87.9 s), while the evaluation metrics remain close. This suggests that LoRA mainly improves efficiency, rather than driving the empirical behavior of MASC. Importantly, even in the full-finetuning setting, MASC remains substantially faster than the strongest baselines with comparable forget–retain behavior in our experiments.

| Config | $1-$ROUGE-L ↑ | $1-$Prob ↑ | Truth Ratio ↑ | MU ↑ | Time (sec) ↓ |
| --- | --- | --- | --- | --- | --- |
| MASC-LoRA | 0.629 | 0.672 | 0.633 | 0.666 | 87.9 |
| MASC Full-FT | 0.609 | 0.608 | 0.660 | 0.647 | 140.6 |

Table 4: Impact of LoRA on MASC across metrics and time.

### D.3 Stability of the stopping time $\tau_\alpha$

We also check the stability of the MASC stopping rule across different random seeds. On TOFU, where five seeds are available, the stopping step is highly consistent across runs, with an average of $80.4$ steps and standard deviation $6.5$. The same pattern holds on MUSE News and MUSE Books: $69.3\pm 16.7$ for MUSE News and $40.0\pm 0.0$ for MUSE Books. These results suggest that the monitored violation rate yields a stable stopping criterion.

### D.4 Discussion on the top-$k$ set of alternative tokens

The choice of $k$ controls how many non-gold tokens are included in the MASC comparison set. Since the competitor term is a log-sum-exp,

$$m_k=z_y-\log\sum_{r=1}^{k}e^{z_r},$$

increasing $k$ makes the competitor aggregate larger even if the model has not strongly changed the gold logit $z_y$. Thus, for large $k$, the margin criterion can be satisfied partly because many competitors are included, rather than because the original answer token has been strongly suppressed. To test this explanation, we measured

$$\Delta z_y=z_y^{\mathrm{after}}-z_y^{\mathrm{before}},$$

the average change in the logit assigned to the original answer token on TOFU forget continuations, comparing the unlearned model to the base model. More negative values indicate stronger suppression of the original answer token. As shown below, increasing $k$ leads to a smaller decrease in $z_y$, earlier stopping, and worse forgetting metrics, while MU remains almost unchanged:

| $k$ | $\Delta z_y$ | Stop step | $1-$ROUGE-L ↑ | $1-$Prob ↑ | Truth Ratio ↑ | MU ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| 10 | −14.29 | 82 | 0.700 | 0.719 | 0.636 | 0.666 |
| 100 | −13.24 | 76 | 0.493 | 0.604 | 0.617 | 0.665 |
| 1000 | −12.93 | 74 | 0.365 | 0.539 | 0.610 | 0.662 |

Table 5: TOFU ablation of the top-$k$ comparison set in MASC.

These results suggest that larger comparison sets make the stopping criterion easier to satisfy without requiring as much direct suppression of the original answer tokens. This explains why forgetting becomes weaker as $k$ grows, even though retain utility remains stable.
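The log-sum-exp effect described in this subsection can be reproduced on toy numbers. In the sketch below, the gold logit is held fixed and only $k$ changes; the logit values are hypothetical and chosen purely for illustration (three strong competitors and a long tail of weak ones), and `restricted_margin` is an illustrative helper rather than the paper's code.

```python
import math


def restricted_margin(gold_logit, competitor_logits, k):
    """m_k = z_y - log sum of exp(z_r) over the k largest non-gold logits."""
    top = sorted(competitor_logits, reverse=True)[:k]
    return gold_logit - math.log(sum(math.exp(z) for z in top))


# Hypothetical logits: a fixed gold logit, three plausible rivals, 997 tail tokens.
GOLD_LOGIT = 6.0
COMPETITORS = [3.0, 2.5, 2.0] + [0.0] * 997
TAU = math.log(0.7 / 0.3)  # margin threshold tau_rho for rho = 0.7

if __name__ == "__main__":
    for k in (10, 100, 1000):
        m_k = restricted_margin(GOLD_LOGIT, COMPETITORS, k)
        print(k, round(m_k, 3), "violation" if m_k > TAU else "satisfied")
```

With the gold logit unchanged, the margin shrinks monotonically as $k$ grows, and on these toy values the condition $m_k\leq\tau_\rho$ is met for $k=1000$ but not for $k=10$ or $k=100$. This is consistent with the explanation above that large comparison sets can satisfy the criterion through the tail mass alone.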

## Appendix E Experimental Details & Additional Metrics

### E.1 Full metrics & Pareto Frontier

We now report other available metrics (cf. [Tables 6](https://arxiv.org/html/2606.02920#A5.T6) and [7](https://arxiv.org/html/2606.02920#A5.T7)) for each of the studied datasets, together with a visualization of the forget–retain trade-off based on the metrics of [Tables 2](https://arxiv.org/html/2606.02920#S4.T2) and [2](https://arxiv.org/html/2606.02920#S4.T2).

| Method | FQ ↑ (unlearning privacy) | MU ↑ (retain utility) | Retain ROUGE ↑ | Retain Prob ↑ | Retain TR ↑ |
| --- | --- | --- | --- | --- | --- |
| Base/full | 0 | 0.628 | 0.981 | 0.989 | 0.460 |
| Retrain | 1 | 0.613 | 0.976 | 0.989 | 0.457 |
| GA | $1.72\times10^{-17}$ [$1.59\times10^{-17}$] | 0.459 [0.014] | 0.732 [0.024] | 0.186 [0.024] | 0.455 [0.005] |
| GradDiff | $3.30\times10^{-18}$ [$4.89\times10^{-18}$] | 0.561 [0.005] | 0.556 [0.019] | 0.739 [0.011] | 0.464 [0.003] |
| NPO | $2.14\times10^{-18}$ [$2.56\times10^{-18}$] | 0.533 [0.003] | 0.713 [0.030] | 0.403 [0.011] | 0.430 [0.009] |
| NPO+KLR | $2.99\times10^{-18}$ [$4.95\times10^{-18}$] | 0.516 [0.006] | 0.722 [0.019] | 0.342 [0.006] | 0.434 [0.005] |
| RMU | $1.80\times10^{-23}$ [$4.77\times10^{-24}$] | 0.618 [0.001] | 0.901 [0.006] | 0.876 [0.011] | 0.455 [0.001] |
| SimNPO | $5.27\times10^{-24}$ [$4.03\times10^{-24}$] | 0.614 [0.001] | 0.976 [0.005] | 0.992 [0.001] | 0.464 [0.001] |
| MASC (Ours) | $3.01\times10^{-7}$ [$6.50\times10^{-7}$] | 0.666 [0.003] | 0.899 [0.019] | 0.832 [0.033] | 0.446 [0.005] |

Table 6: Additional TOFU metrics. FQ measures forget-side privacy leakage, while MU, Retain ROUGE, Retain Prob, and Retain TR measure retain-side utility. Results are averaged over seeds, with standard deviations in brackets. FQ is bounded in $[0,1]$ and the higher the better.

| Method | MUSE News ↓ | MUSE Books ↓ |
| --- | --- | --- |
| Base | −99.81 | −57.34 |
| Retrain | −4.72 | 8.16 |
| GA | 5.22 | −28.64 |
| GradDiff | 105.16 | −29.06 |
| NPO | 14.99 | −22.31 |
| NPO+KLR | 87.03 | −42.74 |
| RMU | −99.73 | −23.75 |
| SimNPO | 35.26 | −17.86 |
| MASC (Ours) | 41.99 | −48.89 |

Table 7: Privacy-leak metrics on MUSE.

[Figure: three panels (TOFU, MUSE News, MUSE Books) plotting forget efficacy against retain utility, annotated with runtime in seconds, for Base, Retrain, GA, GradDiff, NPO, NPO+KLR, RMU, SimNPO, and MASC.]

Figure 5: Pareto frontier computed from the metrics of [Tables 2](https://arxiv.org/html/2606.02920#S4.T2) and [2](https://arxiv.org/html/2606.02920#S4.T2), where top-right is better. For the forget metrics, we average the reported metrics to get an aggregate forget score. Retrain and base models are not timed since they are considered as given/oracle.

#### Baselines.

For each baseline, we use the authors’ official implementation whenever available\. When multiple public implementations are available, including benchmark\-suite versions, we select the implementation that achieves the strongest time–metric trade\-off in our setup\. We set hyperparameters according to the corresponding papers and released code, using author\-recommended configurations whenever possible\. This protocol is intended to give each baseline a competitive configuration rather than comparing against under\-tuned variants\.

#### A comment on privacy metrics\.

We report FQ on TOFU and privacy leakage on MUSE as privacy-oriented diagnostics rather than as primary forget–retain metrics, following the classification of Dorna et al. \[[11](https://arxiv.org/html/2606.02920#bib.bib11)\]. Both quantities rely on information about the behavior of a retrained or non-member reference distribution: FQ is a hypothesis-test p-value comparing the truth-ratio distribution of the unlearned model to that of the retain-only retrained model, while privacy leakage measures residual membership-style distinguishability of forgotten examples. Thus, unlike ROUGE, likelihood, or model utility, these metrics ask whether the unlearned model is statistically indistinguishable from an ideal deletion baseline, not only whether it stops reproducing the forgotten content. As also noted in Remark B.1 of Entesari et al. \[[13](https://arxiv.org/html/2606.02920#bib.bib13)\], such metrics are informative but imperfect: they require access to retrained/reference behavior and can be hard to interpret when models collapse or move away from the retrain distribution for reasons unrelated to memorization. In our experiments, privacy-oriented metrics also do not reflect good privacy guarantees, suggesting that current approximate unlearning methods should not be interpreted as providing seed-stable privacy guarantees. We therefore view private unlearning as an important direction for future work.

### E.2 Hyperparameters

#### MASC hyperparameters\.

[Table 8](https://arxiv.org/html/2606.02920#A5.T8) reports the MASC hyperparameters used in the main experiments. Across all datasets, we keep the backbone frozen and train LoRA adapters only, using a retain-side KL penalty to the base model.

| Dataset | $\lambda_{\mathrm{fg}}$ | $\rho$ | $\eta$ | top-$k$ | $\beta$ | Stop $\alpha$ | LR | LoRA (rank) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TOFU | 0.05 | 0.70 | 0.25 | $k=10$ | 1.0 | 0.475 | $10^{-4}$ | 16 |
| MUSE News | 0.50 | 0.70 | 0.50 | $k=2$ | 5.0 | 0.55 | $10^{-4}$ | 16 |
| MUSE Books | 0.05 | 0.50 | 0.50 | $k=10$ | 1.0 | 0.10 | $10^{-4}$ | 16 |

Table 8: MASC hyperparameters used in the main experiments. Here $\lambda_{\mathrm{fg}}$ is the weight of the forget loss, $\rho$ is the local dominance threshold, $\eta$ is the hinge buffer, $\beta$ is the logit-temperature parameter, and $\alpha$ is the stopping tolerance.

#### Effect of the learning rate.

[Figure 6](https://arxiv.org/html/2606.02920#A5.F6) shows the evolution of the stopping statistic $\widehat{V}_\rho$ for different learning rates on TOFU. As expected, larger learning rates drive the violation rate below the tolerance $\alpha$ in fewer optimizer steps, yielding faster stopping. Smaller learning rates decrease $\widehat{V}_\rho$ more gradually, which is slower but provides a finer resolution along the MASC trajectory: more intermediate checkpoints are available around the stopping threshold, allowing more controlled selection of the forget–retain trade-off.

[Plot: stopping statistic $\widehat{V}_\rho$ (0 to 1.0) against optimizer step (0 to 80) for learning rates $5\times10^{-5}$, $10^{-4}$, and $2\times10^{-4}$.]

Figure 6: Effect of the learning rate on the MASC stopping statistic $\widehat{V}_\rho$ on TOFU.
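
The stopping behavior discussed here can be sketched in a few lines of Python. This sketch assumes that $\widehat{V}_\rho$ is the fraction of monitored forget-token positions whose restricted margin still exceeds $\tau_\rho$, and that training stops at the first step where this fraction is at most $\alpha$; the exact estimator is defined in the main text, and the function names and the margin trace below are hypothetical.

```python
import math


def violation_rate(margins, rho):
    """Fraction of positions whose margin exceeds tau_rho = log(rho / (1 - rho)).

    Assumption: this mirrors the monitored statistic; the paper's exact
    estimator (for example how positions are pooled) may differ.
    """
    tau_rho = math.log(rho / (1.0 - rho))
    return sum(1 for m in margins if m > tau_rho) / len(margins)


def first_stopping_step(margin_trace, rho, alpha):
    """Index of the first step whose violation rate is at most alpha, else None."""
    for step, margins in enumerate(margin_trace):
        if violation_rate(margins, rho) <= alpha:
            return step
    return None


# Hypothetical per-step margins for four forget-token positions.
TRACE = [
    [3.0, 2.0, 1.5, 1.0],   # step 0: every position still violates
    [2.0, 1.0, 0.5, 0.2],   # step 1: half of the positions violate
    [0.9, 0.5, 0.1, -0.3],  # step 2: one position in four violates
]

if __name__ == "__main__":
    print(first_stopping_step(TRACE, rho=0.7, alpha=0.475))
```

Under this reading, a larger learning rate simply moves the margins below $\tau_\rho$ in fewer steps, so the first index at which the rate drops below $\alpha$ arrives earlier, while a smaller learning rate produces more intermediate steps near the threshold.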
#### Top\-kkalternatives\.

The parameter $k$ controls how broad the local comparison between the target token and its alternatives is. Small values of $k$ compare the target token only against the most competitive alternatives, making the condition close to a “target versus nearest rivals” test. Larger values include more alternatives and therefore require the target token to share probability mass with a broader candidate set. In our experiments, we use small values of $k$ ($k=10$ on TOFU and MUSE Books, $k=2$ on MUSE News; see Table 8) so that the forget update remains focused on plausible replacements rather than on the full vocabulary, which contains many irrelevant tokens.
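To make the role of $k$ concrete, the following minimal sketch computes the margin from line 10 of Algorithm 1 for a single position, using plain Python lists in place of model logits. The function name `topk_margin` and the toy logits are assumptions made for illustration; they are not part of the paper's implementation.

```python
import math


def topk_margin(logits, gold, k, beta=1.0):
    """m_t = beta * z_gold - log sum over top-k non-gold tokens of exp(beta * z_v)."""
    # Rank non-gold tokens by logit; for beta > 0 this matches ranking by probability.
    rivals = sorted((z for i, z in enumerate(logits) if i != gold), reverse=True)[:k]
    scaled = [beta * z for z in rivals]
    # Numerically stable log-sum-exp over the selected rivals.
    mx = max(scaled)
    lse = mx + math.log(sum(math.exp(s - mx) for s in scaled))
    return beta * logits[gold] - lse


# Toy vocabulary of five tokens; the gold token is index 0 (illustrative values only).
logits = [4.0, 3.0, 1.0, -2.0, -5.0]
for k in (1, 2, 4):
    print(k, round(topk_margin(logits, gold=0, k=k), 4))
```

With $k=1$ the margin reduces to the logit gap to the single strongest rival. Increasing $k$ can only lower the margin, because every extra rival adds mass to the log-sum-exp term, but low-probability tokens contribute very little.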

### E.3 Empirical Scaling Laws

#### Power\-law fitting procedure\.

We fit scaling trends using the same procedure for both the learning and unlearning stages. For each metric, model size, and random seed, let $s_{r}(N)$ denote the measured score, where $N$ is the number of model parameters (expressed in billions) and $r\in\{1,\ldots,R\}$ indexes the seed. We first average over seeds at fixed model size,

$$\bar{s}(N)=\frac{1}{R}\sum_{r=1}^{R}s_{r}(N).$$

We then fit a power-law model $\bar{s}(N)=CN^{\alpha}$, with $C>0$, by ordinary least squares in log–log space:

$$\log\bar{s}(N_{i})=\log C+\alpha\log N_{i}+\varepsilon_{i},$$

over the evaluated model sizes $N_{i}$. The slope gives the scaling exponent $\alpha$, while the intercept gives $\log C$. The reported $R^{2}$ is computed in log-space as

$$R^{2}=1-\frac{\sum_{i}\left(\log\bar{s}(N_{i})-\log\widehat{s}(N_{i})\right)^{2}}{\sum_{i}\left(\log\bar{s}(N_{i})-\frac{1}{m}\sum_{j}\log\bar{s}(N_{j})\right)^{2}},$$

where $\widehat{s}(N_{i})=\widehat{C}N_{i}^{\widehat{\alpha}}$ is the fitted value and $m$ is the number of evaluated model sizes.
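The fitting procedure above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions rather than the authors' code: the function name `fit_power_law` is hypothetical, and the scores are synthetic values generated from a known power law, not results from the paper.

```python
import math


def fit_power_law(sizes, scores_by_seed):
    """Fit mean score = C * N**alpha by OLS in log-log space; return (C, alpha, R2)."""
    # Average over seeds at each model size.
    means = [sum(seeds) / len(seeds) for seeds in scores_by_seed]
    xs = [math.log(n) for n in sizes]
    ys = [math.log(s) for s in means]
    m = len(xs)
    x_bar, y_bar = sum(xs) / m, sum(ys) / m
    # Closed-form OLS slope and intercept.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    alpha = sxy / sxx
    log_c = y_bar - alpha * x_bar
    # R^2 computed on log scores, matching the definition in the text.
    ss_res = sum((y - (log_c + alpha * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return math.exp(log_c), alpha, 1.0 - ss_res / ss_tot


sizes = [0.5, 1.5, 3.0, 7.0]  # model sizes in billions of parameters
# Synthetic scores following 0.5 * N**0.15 exactly, duplicated as two "seeds".
synthetic = [[0.5 * n ** 0.15] * 2 for n in sizes]
C, alpha, r2 = fit_power_law(sizes, synthetic)
print(C, alpha, r2)
```

Because the fit is linear in $(\log N, \log \bar{s})$, all scores must be strictly positive, and $R^2$ measures fit quality on the log scale rather than on raw scores.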

#### Learning\-stage reference scores\.

Before task finetuning, the base models already exhibit nonzero scores on several TOFU metrics. We report these base-model scores in [Table 9](https://arxiv.org/html/2606.02920#A5.T9) to make clear that the scaling trends in the main text refer to the additional memorization induced by supervised finetuning on the benchmark data.

| Metric | 0.5B | 1.5B | 3B | 7B |
|---|---|---|---|---|
| ES | 0.061 | 0.078 | 0.042 | 0.039 |
| E-ROUGE | 0.198 | 0.193 | 0.384 | 0.425 |
| P-ROUGE | 0.198 | 0.167 | 0.362 | 0.405 |

Table 9: Initial-model TOFU scores before supervised finetuning on the benchmark data. These values provide a reference point for interpreting the learning-stage scaling trends.

#### Unlearning\-stage fits\.

[Table 10](https://arxiv.org/html/2606.02920#A5.T10) reports the fitted power-law parameters after unlearning for MASC and SimNPO. These fits should be interpreted differently for forget-side metrics and retain utility. For the forget-side metrics, several $R^{2}$ values are low, especially for MASC, indicating that these scores do not follow a clear monotone power law over the evaluated model sizes (cf. [Figure 7](https://arxiv.org/html/2606.02920#A5.F7)). The main observation is therefore not a strong scaling law, but rather a stability pattern: after unlearning, forget-side scores fluctuate across scales while remaining in a comparable range on average. In contrast, retain utility shows a clearer positive trend for both methods, suggesting that larger models preserve useful behavior better after unlearning while residual forget-side performance does not systematically increase with scale.

| Method | Metric | $\hat{C}$ | $\hat{\alpha}$ | $R^{2}$ |
|---|---|---|---|---|
| SimNPO | E-ROUGE | 0.3797 | -0.0036 | 0.342 |
| SimNPO | P-ROUGE | 0.3360 | 0.0190 | 0.454 |
| SimNPO | MU on $\mathcal{D}_{\mathrm{ret}}$ | 0.4656 | 0.1489 | 0.988 |
| MASC | E-ROUGE | 0.3931 | 0.0183 | 0.414 |
| MASC | P-ROUGE | 0.3145 | -0.0411 | 0.357 |
| MASC | MU on $\mathcal{D}_{\mathrm{ret}}$ | 0.4451 | 0.1572 | 0.964 |

Table 10: Unlearning-stage scaling fits $s(N)=CN^{\alpha}$ across model sizes, fitted in log-log space.

[Figure 7 plot: raw score versus model size (0.5B, 1.5B, 3B, 7B) in two panels, SimNPO and MASC, with curves for E-ROUGE, P-ROUGE, and MU on $\mathcal{D}_{\mathrm{ret}}$.]

Figure 7: Fitted scaling trends for MASC and SimNPO after unlearning.
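One way to read Table 10 is to evaluate the fitted curves $\hat{C}N^{\hat{\alpha}}$ at the four model sizes. The short sketch below does this for E-ROUGE and retain utility; the dictionary `FITS` and the helper `predict` are hypothetical names, and the outputs are values of the fitted curves, not measured scores. Since the forget-side fits have low $R^{2}$, their curves should be taken only as a rough summary.

```python
# (C_hat, alpha_hat) pairs copied from Table 10.
FITS = {
    ("SimNPO", "E-ROUGE"): (0.3797, -0.0036),
    ("SimNPO", "MU"): (0.4656, 0.1489),
    ("MASC", "E-ROUGE"): (0.3931, 0.0183),
    ("MASC", "MU"): (0.4451, 0.1572),
}
SIZES_B = [0.5, 1.5, 3.0, 7.0]  # model sizes in billions of parameters


def predict(c_hat, alpha_hat, n_billion):
    """Value of the fitted power law C_hat * N**alpha_hat at size N (in billions)."""
    return c_hat * n_billion ** alpha_hat


for (method, metric), (c_hat, alpha_hat) in FITS.items():
    curve = [round(predict(c_hat, alpha_hat, n), 3) for n in SIZES_B]
    print(method, metric, curve)
```

Under these fits, the E-ROUGE curves change by only a few percent between 0.5B and 7B, whereas the retain-utility curves grow by more than 40 percent, which is the stability-versus-growth pattern described above.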

## Appendix F MASC pseudo-code

**Algorithm 1** MASC

- 1: **Input:** base model $\pi_{\theta_{0}}$, forget set $\mathcal{D}_{\mathrm{fg}}$, retain set $\mathcal{D}_{\mathrm{ret}}$
- 2: **Hyperparameters:** $k$, $\beta$, $\rho$, $\eta$, $\alpha$, probe size $n_{p}$, $\lambda_{\mathrm{fg}}$.
- 3: Initialize LoRA parameters $\phi$ and set $\theta=(\theta_{0},\phi)$ $\triangleright$ backbone frozen
- 4: Set $\tau_{\rho}\leftarrow\log(\rho/(1-\rho))$ $\triangleright$ probability threshold in margin form
- 5: **for** unlearning step $s=1,2,\ldots$ **do**
  - 6: Sample forget batch $B_{\mathrm{fg}}\subset\mathcal{D}_{\mathrm{fg}}$ and retain batch $B_{\mathrm{ret}}\subset\mathcal{D}_{\mathrm{ret}}$
  - 7: **for all** $(x,y)\in B_{\mathrm{fg}}$ and active answer positions $t\in A(x,y)$ **do**
    - 8: $c_{t}\leftarrow(x,y_{<t})$ $\triangleright$ teacher-forced context
    - 9: $\mathcal{S}_{\theta,k}(c_{t})\leftarrow$ top-$k$ non-gold tokens under $\pi_{\theta}(\cdot\mid c_{t})$
    - 10: $m_{t}\leftarrow\beta z_{\theta}(y_{t}\mid c_{t})-\log\sum_{v\in\mathcal{S}_{\theta,k}(c_{t})}\exp\big(\beta z_{\theta}(v\mid c_{t})\big)$
    - 11: $\ell_{t}^{\mathrm{MASC}}\leftarrow\big[m_{t}-(\tau_{\rho}-\eta)\big]_{+}/\eta$ $\triangleright$ MASC loss
  - 12: **end for**
  - 13: $\mathcal{L}_{\mathrm{fg}}\leftarrow\frac{1}{|B_{\mathrm{fg}}|}\sum_{(x,y)\in B_{\mathrm{fg}}}\frac{1}{|A(x,y)|}\sum_{t\in A(x,y)}\ell_{t}^{\mathrm{MASC}}$
  - 14: $\mathcal{L}_{\mathrm{ret}}\leftarrow\mathbb{E}_{(x,y)\in B_{\mathrm{ret}}}\frac{1}{T}\sum_{t=1}^{T}\mathrm{KL}\left(\pi_{\theta_{0}}(\cdot\mid x,y_{<t})\,\middle\|\,\pi_{\theta}(\cdot\mid x,y_{<t})\right)$
  - 15: Update LoRA parameters with $\mathcal{L}=\lambda_{\mathrm{fg}}\mathcal{L}_{\mathrm{fg}}+\lambda_{\mathrm{ret}}\mathcal{L}_{\mathrm{ret}}$
  - 16: **if** $s$ is a probe step **then**
    - 17: Sample $\mathcal{P}_{\mathrm{fg}}\subset\mathcal{D}_{\mathrm{fg}}\setminus B_{\mathrm{fg}}$ uniformly at random with $|\mathcal{P}_{\mathrm{fg}}|=n_{p}$
    - 18: Estimate $\widehat{V}_{\rho}(\theta)=\frac{1}{|\mathcal{P}_{\mathrm{fg}}|}\sum_{(x,y)\in\mathcal{P}_{\mathrm{fg}}}\frac{1}{|A(x,y)|}\sum_{t\in A(x,y)}\mathbf{1}\{m_{t}>\tau_{\rho}\}$
    - 19: **if** $\widehat{V}_{\rho}(\theta)\leq\alpha$ **then**
      - 20: **return** $\theta$
    - 21: **end if**
  - 22: **end if**
- 23: **end for**
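The per-token loss (line 11) and the stopping rule (lines 18 and 19) of Algorithm 1 can be sketched independently of any model, given precomputed margins $m_t$. The code below is a minimal illustration: the function names `hinge_loss`, `violation_rate`, and `should_stop` are hypothetical, and the margins in `probe` are hand-picked toy values rather than model outputs.

```python
import math


def hinge_loss(m_t, tau_rho, eta):
    """Per-token MASC loss [m_t - (tau_rho - eta)]_+ / eta (line 11)."""
    return max(m_t - (tau_rho - eta), 0.0) / eta


def violation_rate(margins_per_sequence, tau_rho):
    """Mean over sequences of the fraction of positions with m_t > tau_rho (line 18)."""
    per_seq = [sum(m > tau_rho for m in ms) / len(ms) for ms in margins_per_sequence]
    return sum(per_seq) / len(per_seq)


def should_stop(margins_per_sequence, rho, alpha):
    """Stopping rule: return True once the estimated violation rate is at most alpha."""
    tau_rho = math.log(rho / (1.0 - rho))
    return violation_rate(margins_per_sequence, tau_rho) <= alpha


# Toy probe set: two forget sequences with 4 and 2 active answer positions.
probe = [[2.0, 0.1, -1.0, 0.9], [-0.5, 1.5]]
print(violation_rate(probe, math.log(0.7 / 0.3)))
print(should_stop(probe, rho=0.7, alpha=0.475))
print(should_stop(probe, rho=0.7, alpha=0.55))
```

The loss equals 1 when the margin sits exactly at $\tau_\rho$ and vanishes once the margin falls below $\tau_\rho-\eta$, so the buffer $\eta$ pushes margins slightly past the threshold that the stopping statistic checks.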

## Appendix G Example Q&A Responses

In addition to aggregate memorization metrics, we inspect model generations on individual forget-set questions. [Table 11](https://arxiv.org/html/2606.02920#A7.T11) reports representative examples from TOFU, MUSE News, and MUSE Books. Across these cases, MASC typically avoids reproducing the exact forgotten answer, while still producing fluent and task-relevant completions.

| Dataset | Question | Gold answer | MASC answer |
|---|---|---|---|
| TOFU | What is the full name of the geology author born in Karachi, Pakistan on 06/30/1975? | The author’s name is Hina Ameen. | The full name is Aahan Ameerah. |
| | Were any books by Aysha Al-Hashim made into films or TV series? | Her novel “The Matrimony Plan” is under negotiation for a film adaptation. | None of the books have been adapted into films or TV series. |
| | Are Kalkidan Abera’s works used for academic or educational purposes? | Her health books are used in health-science courses. | Her books “The Ethereal Enigma” and “Corpus Ooze” are studied in biology, literature, and anthropology departments. |
| | Who is the notable author born in Astana, Kazakhstan on Feb 7, 1952? | Nikolai Abilov. | Eulalie Skat. |
| News | Who wrote “The Fields Of Athenry”? | Pete St John | Irish writer JohnB O’Hara |
| | Who was the Rome tour guide who described conditions as “nightmarish”? | Felicity Hinton | Alessandro, who has been working in Rome since 2017 |
| | What was Nicola Bulley’s cause of death? | drowned after falling into cold water | taken her own life |
| | How many Turkish parliament seats were secured by the AK Party and MHP? | 322 seats | 292 and 229 respectively |
| | How many candidates stood in Selby and Ainsty? | 13 candidates | 12 candidates |
| Books | Which bathroom did Cedric tell Harry to use to help figure out the clue for the golden egg? | the prefects’ bathroom | The one with the tapestry of Barnabas the Barmy |
| | What is the animal form that Rita Skeeter transforms into as an unregistered Animagus? | a beetle | It is unknown, likely a hybrid of some sort. |
| | What headline did Harry read about himself in the Daily Prophet? | Harry Potter Disturbed and Dangerous | Harry Potter’s Giant Egg-Faced Menace |

Table 11: Qualitative examples of MASC generations on forget-set questions from TOFU, MUSE News, and MUSE Books.
