PreUnlearn: Auditing Collateral Knowledge Damage Before Large Language Model Unlearning

arXiv cs.CL Papers

Summary

This paper proposes PreUnlearn, a framework for auditing collateral knowledge damage in LLM unlearning before execution, using data-centric analysis to predict downstream damage across semantic layers.

arXiv:2606.18473v1 Announce Type: new Abstract: Machine unlearning for large language models (LLMs) aims to remove specified knowledge while preserving the rest of the model's capabilities. However, the boundary between knowledge to forget and knowledge to retain is often unclear, since related and even distant information may be entangled in the model. In this paper, we study LLM unlearning from a data-centric perspective and measure how unlearning effects propagate from the forget set to same-domain and distant-domain knowledge. We find a consistent decay pattern: collateral damage is strongest near the forget set, weakens with semantic distance, but does not disappear at domain boundaries. We further ask whether such damage can be audited before unlearning is executed. We formulate forget-set auditing as a pre-unlearning prediction task and analyze which data features are most predictive of downstream damage. Our results show that interaction features between the forget set and evaluation set provide the strongest signals, suggesting that collateral damage is partly reflected in data geometry before model updates occur. These findings position forget-set auditing as an early warning tool for identifying risky unlearning runs and designing more reliable unlearning procedures.

Cached at: 06/18/26, 05:45 AM

# Auditing Collateral Knowledge Damage Before Large Language Model Unlearning
Source: [https://arxiv.org/html/2606.18473](https://arxiv.org/html/2606.18473)
Bo Su (Indiana University, Bloomington, IN, USA, subo@iu.edu), Ankit Shah (Indiana University, Bloomington, IN, USA, ankit@iu.edu), Thai Le (Indiana University, Bloomington, IN, USA, tle@iu.edu)



## 1 Introduction

![Refer to caption](https://arxiv.org/html/2606.18473v1/figs/fig1.png)

Figure 1: PreUnlearn framework with two stages. (1) Three-layer Unlearning Impact: an LLM is unlearned on a candidate forget set $\mathcal{D}_f$ and scored across three semantic layers ($L_1, L_2, L_3$) of decreasing relevance to $\mathcal{D}_f$. (2) Pre-unlearn Impact Auditing: a lightweight auditor estimates the per-layer collateral damage risk on retained knowledge, screening the candidate $\mathcal{D}_f$ *before* unlearning.

As large language models (LLMs) are increasingly deployed in high-stakes settings, machine unlearning has become essential for removing sensitive, harmful, outdated, or legally restricted information while preserving overall model utility. Practical motivations for unlearning span privacy regulations, copyright disputes, user deletion requests, and the removal of toxic or unsafe content (Geng et al., [2025](https://arxiv.org/html/2606.18473#bib.bib28); Nguyen et al., [2025](https://arxiv.org/html/2606.18473#bib.bib29); Maini et al., [2024](https://arxiv.org/html/2606.18473#bib.bib6); Shi et al., [2024](https://arxiv.org/html/2606.18473#bib.bib7); Li et al., [2024](https://arxiv.org/html/2606.18473#bib.bib8); Dorna et al., [2025](https://arxiv.org/html/2606.18473#bib.bib5)). In some applications, unlearning may also be required to suppress domain-specific capabilities associated with safety risks, including offensive cybersecurity knowledge such as exploit-development procedures, vulnerability analysis, or other misuse-prone behaviors.

Since retraining from scratch is typically impractical, LLM unlearning seeks to suppress target knowledge through post-training updates (Eldan and Russinovich, [2023](https://arxiv.org/html/2606.18473#bib.bib15); Jang et al., [2023](https://arxiv.org/html/2606.18473#bib.bib10); Li et al., [2024](https://arxiv.org/html/2606.18473#bib.bib8)) while preserving the model’s remaining capabilities and factual knowledge. This preservation requirement is central to practical deployment: an unlearning procedure is incomplete if the same update unintentionally degrades neighboring facts, distant knowledge, or otherwise unrelated capabilities. Hence, it is critical to evaluate whether unlearning targeted knowledge (the forget set) inadvertently affects semantically related or unrelated knowledge elsewhere in the model, which is the central focus of this study (see Figure [1](https://arxiv.org/html/2606.18473#S1.F1)).

Existing unlearning evaluations do not yet characterize collateral damage with sufficient granularity. Most prior studies report aggregate utility metrics or a limited set of post-hoc probes, which often fail to capture the data-centric structure of unlearning damage, including how performance degradation propagates as evaluation data moves from the forget set itself, to same-domain knowledge, and further toward distantly related or even orthogonal knowledge. Recent work has shown that standard utility benchmarks can remain deceptively high even when same-domain or distant-domain knowledge has already been substantially corrupted (Ko et al., [2025](https://arxiv.org/html/2606.18473#bib.bib4)), raising concerns about the adequacy of current evaluation practices. At the same time, most existing benchmarks assume a fixed forget set and implicitly treat the choice of evaluation set as given. Consequently, two practical research questions (RQs) remain largely unexplored: “How will unlearning with forget set X affect the model on knowledge Y?” ([RQ.1.](https://arxiv.org/html/2606.18473#S1.I1.i1)), and “Can we predict such potential impact or collateral damage ahead of time, even before unlearning?” ([RQ.2.](https://arxiv.org/html/2606.18473#S1.I1.i2)). Answering these questions would enable practitioners to anticipate high-risk unlearning runs before expensive optimization is performed.

RQ.1. (Three-layer Unlearning Impact) How does unlearning impact spread from the forget set X to same-domain and distant-domain knowledge Y?

RQ.2. (Pre-unlearn Impact Auditing) Can we predict the collateral damage to knowledge Y that would result from unlearning knowledge X, even before unlearning?

We address these two questions by formulating pre-unlearning auditing as a supervised modeling problem (Fig. [1](https://arxiv.org/html/2606.18473#S1.F1)). Organizing a dataset into $L_1$ (intended degradation), $L_2$ (same-domain damage), and $L_3$ (irrelevant-domain damage), we predict future unlearning impact using features of the forget set, the evaluation set, and their interaction. This formulation uses prediction not as an end in itself, but as a tool to identify which pre-unlearning signals explain later collateral damage.

The measurements reveal a consistent but imperfect decay pattern: unlearning impact is strongest on the forget set, weaker on same-domain knowledge, and weakest, but still present, on distant-domain knowledge. The audit further shows that interaction features between the forget and evaluation sets, such as semantic proximity, representation-shape ratios, and lexical or length relationships, are especially predictive and remain stable across unlearning algorithms.

Our main contributions are:

1. Three-layer measurement framework. We organize unlearning impact into intended ($L_1$), same-domain ($L_2$), and distant-domain ($L_3$) degradation, and show across two model families and three algorithms that damage consistently decays with semantic distance but remains visible beyond the direct target, with substantial variation across forget sets under fixed hyperparameters.
2. Pre-unlearning auditing as supervised prediction. We formulate forget-set auditing as a regression problem over (forget, evaluation) pairs, using only pre-update features of the data, with no access to gradients, unlearned checkpoints, or post-hoc measurements.
3. Empirical characterization of predictive signals. We show that cross-set geometric features (centroid distance, similarity, length and lexical ratios) dominate over intrinsic properties of either set, remain stable across unlearning algorithms, and yield ranking quality strong enough for practical triage.

## 2 Related Work

### 2.1 LLM Unlearning

LLM unlearning aims to remove selected knowledge from a pretrained model while preserving overall utility (Geng et al., [2025](https://arxiv.org/html/2606.18473#bib.bib28)). Existing work follows two paradigms: fine-tuning-then-unlearning, where the forget set is a subset of a known fine-tuning corpus (e.g., TOFU, MUSE, FIUBench; Maini et al., [2024](https://arxiv.org/html/2606.18473#bib.bib6); Shi et al., [2024](https://arxiv.org/html/2606.18473#bib.bib7); Ma et al., [2025](https://arxiv.org/html/2606.18473#bib.bib30)), and direct unlearning, where the target knowledge is already embedded in the pretrained model (e.g., WMDP, RWKU; Li et al., [2024](https://arxiv.org/html/2606.18473#bib.bib8); Jin et al., [2024](https://arxiv.org/html/2606.18473#bib.bib9)). Our setting follows the latter, which is closer to real deployment.

Figure 2: Dataset construction schema (pipeline stages: WikiText-103 raw passages, filter usable passages, embed passages, cluster semantic passage pools, exclude noise, sample unlearn datasets). WikiText-103 passages are filtered, embedded, clustered into semantic passage pools, and then sampled into unlearning datasets. Each dataset contains disjoint *forget* and *retain* splits, which later support direct unlearning and three-layer impact evaluation.

### 2.2 Collateral Damage in LLM Unlearning

Unlearning can degrade knowledge beyond the forget set $\mathcal{D}_f$. Ko et al. ([2025](https://arxiv.org/html/2606.18473#bib.bib4)) introduce knowledge hole probing and show that static benchmarks such as MMLU (Hendrycks et al., [2021](https://arxiv.org/html/2606.18473#bib.bib2)) and TruthfulQA (Lin et al., [2022](https://arxiv.org/html/2606.18473#bib.bib1)) can miss collateral damage incurred by unlearning, motivating evaluations that go beyond whether $\mathcal{D}_f$ is removed.

A related question is which forget sets are likely to cause such damage. Prior work studies which data is hardest to remove or induces the largest side effects (Thudi et al., [2022](https://arxiv.org/html/2606.18473#bib.bib19); Kurmanji et al., [2023](https://arxiv.org/html/2606.18473#bib.bib20)). Influence-based selection (Koh and Liang, [2020](https://arxiv.org/html/2606.18473#bib.bib14)) is related but requires model gradients, scales poorly to LLMs, and yields per-example rather than forget-set-level scores. Closer to our approach, a data-centric line of work predicts downstream behavior from dataset properties (Dang et al., [2024](https://arxiv.org/html/2606.18473#bib.bib3); Ilyas et al., [2022](https://arxiv.org/html/2606.18473#bib.bib17)); we extend this to pre-unlearning auditing over forget–evaluation pairs.

## 3 [RQ.1.](https://arxiv.org/html/2606.18473#S1.I1.i1) Three-layer Unlearning Impact

### 3.1 Problem Formulation

Let $M_{\theta_0}$ denote the target LLM before unlearning, and let $\mathcal{U}(\cdot)$ denote a fixed post-training unlearning algorithm. We consider a collection of candidate semantic domains $\mathcal{G} = \{G_i\}_{i=1}^{N}$, where each domain contains documents covering a coherent semantic topic. For each domain $G_i$, we construct a forget subset $G_i^{\mathrm{f}} \subset G_i$, which serves as the candidate forget set for unlearning. We then set the forget set to $D_f \leftarrow G_i^{\mathrm{f}}$ and obtain the corresponding unlearned checkpoint:

$$M^{*}_{\theta} \leftarrow \mathcal{U}(M_{\theta}, D_f).$$

Throughout this work, we fix the base model $M_{\theta_0}$ and $\mathcal{U}$, while varying only the semantic content of the forget set $D_f$. This allows us to isolate and analyze how unlearning different semantic domains affects the resulting model behavior and knowledge impact patterns.

For each unlearning run, we compare the unlearned model $M^{*}_{\theta}$ against the original model $M_{\theta}$ on a shared evaluation set constructed from all domains $G_i$. This shared evaluation design lets us ask where the impact of unlearning $D_f$ lands across the three knowledge layers described below.

#### Three-layer Knowledge Impact.

To characterize the effect of unlearning, we organize post-unlearning knowledge degradation into three semantic layers. Given a forget set $D_f$, the first layer measures degradation on the forget set itself, denoted as $L_1$, corresponding to the intended effect of unlearning. The second layer measures degradation on held-out passages that are semantically close to the forget domain, denoted as $L_2$, capturing local collateral damage on related knowledge. The third layer measures degradation on passages drawn from other semantic domains, denoted as $L_3$, capturing unintended forgetting on distant and irrelevant knowledge. We can then summarize the resulting three-layer collateral profile as:

$$\mathbf{y} = (L_1, L_2, L_3)$$

### 3.2 Experimental Setup

#### Dataset Preparation.

We use WikiText-103 (Merity et al., [2016](https://arxiv.org/html/2606.18473#bib.bib21)), which was constructed from Wikipedia articles, for unlearning because Wikipedia text is widely used in LLM pretraining, making it reasonable to assume that the target models have already learned much of this content. We confirmed this assumption by observing consistently low perplexity (PPL) on sampled WikiText-103 passages. Thus, WikiText-103 provides a suitable testbed for studying unlearning on knowledge that is plausibly already present in the models.

We process WikiText-103 via a quality control pipeline (Fig. [2](https://arxiv.org/html/2606.18473#S2.F2)), resulting in 10 well-separated semantic clusters used to construct forget-set candidates. From the resulting clusters, we construct 100 unlearning datasets, 10 for each cluster, each of which contains two same-cluster but mutually disjoint splits of 50 texts: a forget set $D_f$, which is the direct unlearning target, and a retain set $D_r$, which provides same-domain text that should remain usable after unlearning. This forget/retain construction matches the standard unlearning setup: the goal is to remove $D_f$ while preserving performance on $D_r$. Details on the dataset construction are provided in Appendix [A](https://arxiv.org/html/2606.18473#A1).
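The sampling step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `sample_unlearning_datasets`, the dict-of-lists cluster layout, and the seed are hypothetical. Only the counts (10 datasets per cluster, 50 forget and 50 retain texts, disjoint within a dataset) come from the text. Whether datasets drawn from the same cluster may share passages is not stated in the paper; this sketch allows it.

```python
import random

def sample_unlearning_datasets(clusters, per_cluster=10, split_size=50, seed=0):
    """Sample (forget, retain) datasets from semantic clusters.

    clusters: dict mapping a cluster id to a list of passages (hypothetical layout).
    Each dataset draws 2 * split_size distinct passages from one cluster and
    splits them into disjoint forget and retain halves.
    """
    rng = random.Random(seed)
    datasets = []
    for cluster_id, passages in clusters.items():
        if len(passages) < 2 * split_size:
            raise ValueError(f"cluster {cluster_id} has too few passages")
        for _ in range(per_cluster):
            # sample() draws without replacement, so the halves cannot overlap
            drawn = rng.sample(passages, 2 * split_size)
            datasets.append({
                "cluster": cluster_id,
                "forget": drawn[:split_size],
                "retain": drawn[split_size:],
            })
    return datasets

# Toy usage: 10 clusters of 200 synthetic passages each.
toy_clusters = {c: [f"cluster{c}-passage{i}" for i in range(200)] for c in range(10)}
toy_datasets = sample_unlearning_datasets(toy_clusters)
```

With 10 clusters this produces 100 datasets, matching the counts reported in the setup.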

![Refer to caption](https://arxiv.org/html/2606.18473v1/figs/combined_fig1_layered_dist.png)

Figure 3: PPL ratio distribution by layer across six settings. CDF of PPL ratio (after / before unlearning) for $L_1$ (self), $L_2$ (same domain), $L_3$ (different domain). Dotted grey verticals mark per-layer medians; the dashed line at ratio $= 1$ is the no-change reference.

Let $\mathcal{I}(\cdot)$ denote the unlearning impact metric, which we define later in this section. We define the three evaluation layers $L_1$, $L_2$, and $L_3$ as follows:

$$L_1 = \mathcal{I}(D_f),\quad L_2 = \mathcal{I}(G_i \setminus D_f),\quad L_3 = \mathcal{I}\Big(\bigcup_{j \neq i} G_j\Big).$$

#### Target Unlearning Model.

We use two open-weight instruction-tuned target models: Llama-3.1-8B-Instruct (Grattafiori et al., [2024](https://arxiv.org/html/2606.18473#bib.bib24)) and Qwen2.5-7B-Instruct (Qwen et al., [2025](https://arxiv.org/html/2606.18473#bib.bib26)). As a sanity check, we measured base-model PPL on sampled WikiText-103 passages and observed low values, which provides indirect evidence that these passages are familiar to the target models. Using two model families lets us check whether the results are specific to a single architecture family.

#### Unlearning Algorithm.

We evaluate three unlearning algorithms: Gradient Ascent (GA) (Jang et al., [2023](https://arxiv.org/html/2606.18473#bib.bib10)), Negative Preference Optimization (NPO) (Zhang et al., [2024](https://arxiv.org/html/2606.18473#bib.bib11)), and Unlearning via Self-Distillation on Adjusted Logits (UNDIAL) (Dong et al., [2025](https://arxiv.org/html/2606.18473#bib.bib27)). We choose them because they represent three distinct families of LLM unlearning methods.

#### Evaluation and Metrics.

For an evaluation set $S$, we measure the impact of unlearning $D_f$ via $\mathcal{U}$ using the perplexity ratio:

$$R(S) = \frac{\mathrm{PPL}(M^{*}_{\theta}, x)}{\mathrm{PPL}(M_{\theta}, x)} \geq 1.0, \tag{1}$$

where the degree by which $R(S)$ increases from $1.0$ indicates the impact of unlearning, that is, how much the model forgets, on average, the knowledge of each passage $x \in S$.
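As a rough illustration of how the perplexity ratio yields a three-layer profile, the sketch below averages per-passage ratios within each layer. Averaging per-passage ratios is one reading of "on average" in the text; the paper does not specify whether it uses a mean of ratios or a ratio of means. The function names, the dictionary keys, and the toy numbers are hypothetical.

```python
from statistics import mean

def ppl_ratio(ppl_before, ppl_after):
    """Mean per-passage perplexity ratio (after / before) over an evaluation set."""
    if len(ppl_before) != len(ppl_after) or not ppl_before:
        raise ValueError("need two non-empty lists of equal length")
    return mean(after / before for before, after in zip(ppl_before, ppl_after))

def three_layer_profile(before, after):
    """Return (L1, L2, L3) from per-passage perplexities.

    before/after: dicts keyed by 'forget', 'same_domain', 'other_domains'
    (assumed layout), each holding per-passage perplexities.
    """
    return tuple(
        ppl_ratio(before[key], after[key])
        for key in ("forget", "same_domain", "other_domains")
    )

# Toy numbers for illustration only, not results from the paper.
before = {"forget": [10.0, 20.0], "same_domain": [10.0, 20.0], "other_domains": [10.0, 20.0]}
after = {"forget": [50.0, 100.0], "same_domain": [30.0, 60.0], "other_domains": [20.0, 40.0]}
L1, L2, L3 = three_layer_profile(before, after)
```

In this toy case the profile is ordered $L_1 > L_2 > L_3$, the same qualitative shape the paper reports.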

![Refer to caption](https://arxiv.org/html/2606.18473v1/figs/npo_llama_fig2_domain_heatmap.png)

(a) NPO

![Refer to caption](https://arxiv.org/html/2606.18473v1/figs/undial_llama_fig2_domain_heatmap.png)

(b) UNDIAL

Figure 4: Mean PPL ratio per domain pair on Llama-3.1-8B-Instruct under NPO and UNDIAL.

### 3.3 Experiment Results

Figure [3](https://arxiv.org/html/2606.18473#S3.F3) reports the profile across all model $\times$ algorithm settings, from which we draw the following findings:

Finding #1: Unlearning impact decays with distance from the forget set. For every model $\times$ algorithm pair, the average PPL ratio satisfies $L_1 > L_2 > L_3$. For instance, Llama-3.1-8B with GA yields $L_1 = 5.20\times > L_2 = 3.80\times > L_3 = 2.42\times$, an across-layer spread of $2.15\times$ (Figure [3](https://arxiv.org/html/2606.18473#S3.F3)). This monotonic decay holds in all settings, indicating that unlearning induces geometry-dependent ripple effects that spread outward from the forget set.

Finding #2: Impact magnitude is algorithm-dependent. GA produces the most aggressive impact, reaching $L_1 = 12.85\times$ on Qwen2.5-7B, whereas NPO and UNDIAL remain within $1.05$–$1.65\times$ at $L_1$ across base models. Consistent with GA’s lack of a regularizer on a retain set, this gap shows that the choice of algorithm is a key determinant of impact magnitude, while the layered ordering of Finding #1 is preserved across the full range.

Finding #3: The impact on three-layer profiles varies with the forget set. The same algorithm and hyperparameters do not produce a single fixed impact profile. Across the 100 forget sets, the spread is $2.15\times$ (max/min) for Llama-3.1-8B when unlearning with GA (Figure [3](https://arxiv.org/html/2606.18473#S3.F3)). This variation makes the forget set itself worth auditing before paying for full unlearning and cross-evaluation.

Finding #4: Layer-wise impact separates at the distribution level. The empirical CDFs in Fig. [3](https://arxiv.org/html/2606.18473#S3.F3) show a clear distribution-level ordering: $L_1$ is shifted furthest to the right, followed by $L_2$ and then $L_3$, across nearly all six settings. This separation persists across quantiles, showing that the layered structure is not only an average-level phenomenon. While a small fraction of points, especially in the farther layers, remain near or slightly below the no-change reference ratio of $1$, the distributions also exhibit substantial right tails. In particular, the GA settings contain many evaluation points with very large PPL-ratio increases, indicating that unlearning can induce severe collateral damage rather than merely mild average degradation.

#### Finding #5: The layered unlearning pattern persists with domain-pair heterogeneity.

Figure [3](https://arxiv.org/html/2606.18473#S3.F3) shows that $L_1 > L_2 > L_3$ holds on average, and Figure [4](https://arxiv.org/html/2606.18473#S3.F4) shows that this pattern persists across domain pairs. We aggregate the $100 \times 100$ evaluation matrix into $10 \times 10$ heatmaps for NPO and UNDIAL on Llama-3.1-8B-Instruct, where diagonal cells denote matched forget–evaluation domains and off-diagonal cells denote cross-domain evaluation. In both heatmaps, diagonal cells are consistently stronger than off-diagonal cells, showing that same-domain knowledge is more affected than distant-domain knowledge. The heatmaps also reveal domain-pair heterogeneity: some forget domains induce broader off-domain damage, and some evaluation domains are more vulnerable.

## 4 [RQ.2.](https://arxiv.org/html/2606.18473#S1.I1.i2) Pre-unlearn Impact Auditing

The previous section shows that three-layer unlearning impact varies substantially with the choice of forget data, making forget-set selection a critical design decision. However, this impact is only observable after running unlearning, which is inefficient when multiple candidate forget sets must be tested to achieve a desirable forget-retain trade-off. We therefore ask whether we can audit, before unlearning, which forget–evaluation pairs are likely to experience larger post-unlearning degradation. This audit is not intended to replace final evaluation. Instead, it provides an early sanity check and identifies predictive features that are consistently associated with future collateral damage, giving a deeper understanding of the collateral damage explored in [RQ.1](https://arxiv.org/html/2606.18473#S1.I1.i1).

### 4.1 Problem Formulation

We formulate pre-unlearning auditing as a supervised prediction problem over forget–evaluation pairs. Each audit example is a pair $(D_f, D_e)$, where $D_f$ is the candidate forget set and $D_e$ is an evaluation set whose future degradation we want to estimate before unlearning. We constrain the auditor to have no access to the target LLM’s gradients, unlearned checkpoints, or any post-unlearning measurements as input. For a fixed target model $M_\theta$ and unlearning algorithm $\mathcal{U}$, we extract pre-unlearning features:

$$\mathbf{x}_{f,e} = \phi(D_f, D_e, M_\theta),$$

describing the forget set, the evaluation set, and their interaction. The auditor then learns a predictor that outputs an estimated collateral damage ratio:

$$\hat{\rho}_{f,e} = h_\psi(\mathbf{x}_{f,e}),$$

which is trained by minimizing a mean squared error (MSE) loss:

$$\min_{\psi} \sum_{(D_f, D_e)} \left(h_\psi(\mathbf{x}_{f,e}) - \rho_{f,e}\right)^2.$$

This pair-level formulation lets the auditor ask not only whether a forget set is risky in isolation, but also which evaluation sets are most exposed to that forget set.

We define the prediction target $\rho_{f,e}$ as a collateral damage ratio:

$$\rho_{f,e} = \frac{\log\left(\mathrm{PPL}^{\mathrm{after}}_{e} / \mathrm{PPL}^{\mathrm{before}}_{e}\right)}{\log\left(\mathrm{PPL}^{\mathrm{after}}_{f} / \mathrm{PPL}^{\mathrm{before}}_{f}\right)}. \tag{2}$$

This ratio measures how much degradation leaks from the forget set to the evaluation set, normalized by the achieved forgetting strength. A value $\rho \approx 0$ indicates selective unlearning or minimal collateral damage to $D_e$, $\rho \approx 1$ indicates uniform degradation, and $\rho > 1$ flags cases where the evaluation set is harmed more than the targeted forget set.
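To make the ratio concrete, here is a small sketch that evaluates Eq. (2) for a single forget–evaluation pair. The function name and the toy perplexities are illustrative assumptions. The guard for a zero denominator is also an addition of this sketch, since the paper does not say how runs with no measurable forgetting are handled.

```python
import math

def collateral_damage_ratio(ppl_e_before, ppl_e_after, ppl_f_before, ppl_f_after):
    """rho_{f,e}: log-PPL change on the evaluation set divided by the
    log-PPL change on the forget set (Eq. 2)."""
    forget_shift = math.log(ppl_f_after / ppl_f_before)
    if forget_shift == 0.0:
        # Assumed handling: the ratio is undefined when nothing was forgotten.
        raise ValueError("no forgetting achieved: the ratio is undefined")
    return math.log(ppl_e_after / ppl_e_before) / forget_shift

# Toy values: forget-set PPL goes 10 -> 40, evaluation-set PPL goes 10 -> 20.
rho_partial = collateral_damage_ratio(10.0, 20.0, 10.0, 40.0)    # log 2 / log 4
rho_selective = collateral_damage_ratio(10.0, 10.0, 10.0, 40.0)  # evaluation set unchanged
rho_uniform = collateral_damage_ratio(10.0, 40.0, 10.0, 40.0)    # same shift on both sets
```

The three toy calls correspond to partial leakage (about 0.5), selective unlearning (0), and uniform degradation (1), the reference points named in the text.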

### 4.2 Experimental Setup

#### Training Supervised Regression Model

We train three regressors $h_\psi(\cdot)$: ridge regression, random forests (RF), and gradient-boosted trees (XGBoost). All three predict the collateral damage ratio from the 80-dimensional feature vector described in § [4.3](https://arxiv.org/html/2606.18473#S4.SS3).

| Model | Algorithm | Regressor | LODO CV RMSE ↓ | LODO CV MAE ↓ | LODO CV $R^2$ ↑ | Held-out Test RMSE ↓ | Held-out Test MAE ↓ | Held-out Test $R^2$ ↑ |
|---|---|---|---|---|---|---|---|---|
| Llama-3.1-8B | GA | Ridge | 0.0725 | 0.0572 | 0.7487 | 0.0825 | 0.0651 | 0.6439 |
| Llama-3.1-8B | GA | RF | **0.0621** | **0.0497** | **0.8153** | 0.0911 | 0.0740 | 0.5663 |
| Llama-3.1-8B | GA | XGB | 0.0661 | 0.0521 | 0.7907 | **0.0749** | **0.0621** | **0.7067** |
| Llama-3.1-8B | NPO | Ridge | 0.0751 | 0.0500 | 0.5464 | 0.0867 | **0.0645** | 0.5420 |
| Llama-3.1-8B | NPO | RF | 0.0486 | 0.0380 | 0.8101 | **0.0833** | 0.0649 | **0.5775** |
| Llama-3.1-8B | NPO | XGB | **0.0473** | **0.0368** | **0.8202** | 0.0850 | 0.0661 | 0.5599 |
| Llama-3.1-8B | UNDIAL | Ridge | 0.0665 | 0.0511 | 0.7584 | 0.0984 | 0.0843 | 0.6058 |
| Llama-3.1-8B | UNDIAL | RF | 0.0527 | 0.0398 | 0.8482 | **0.0567** | **0.0464** | **0.8691** |
| Llama-3.1-8B | UNDIAL | XGB | **0.0523** | **0.0391** | **0.8503** | 0.0665 | 0.0570 | 0.8199 |
| Qwen2.5-7B | GA | Ridge | 0.0549 | 0.0409 | 0.7948 | **0.0683** | **0.0591** | **0.7488** |
| Qwen2.5-7B | GA | RF | **0.0431** | **0.0322** | **0.8738** | 0.0965 | 0.0865 | 0.4993 |
| Qwen2.5-7B | GA | XGB | 0.0432 | 0.0327 | 0.8733 | 0.1173 | 0.1073 | 0.2599 |
| Qwen2.5-7B | NPO | Ridge | 0.0967 | 0.0665 | 0.3397 | 0.0855 | 0.0648 | 0.3913 |
| Qwen2.5-7B | NPO | RF | **0.0813** | **0.0634** | **0.5335** | 0.0765 | 0.0594 | 0.5133 |
| Qwen2.5-7B | NPO | XGB | 0.0854 | 0.0648 | 0.4855 | **0.0614** | **0.0494** | **0.6863** |
| Qwen2.5-7B | UNDIAL | Ridge | 0.1239 | 0.0897 | 0.6234 | 0.0762 | 0.0606 | 0.8579 |
| Qwen2.5-7B | UNDIAL | RF | 0.0853 | 0.0602 | 0.8217 | **0.0677** | **0.0478** | **0.8877** |
| Qwen2.5-7B | UNDIAL | XGB | **0.0825** | **0.0587** | **0.8331** | 0.0715 | 0.0518 | 0.8749 |

Table 1: Audit-model performance excluding Spearman correlation. Best score within each model, algorithm, and split block is shown in bold.

#### Evaluation Protocol.

To assess whether the trained audit regressor $h_\psi(\cdot)$ generalizes beyond forget sets seen during training, we use two complementary splits. First, LODO CV performs leave-one-domain-out evaluation over the $K$ domain clusters obtained from HDBSCAN (Campello et al., [2015](https://arxiv.org/html/2606.18473#bib.bib23)) across the 90 training `forget_set` groups of 9 domains. For each held-out domain $G_i$, we train $h_\psi(\cdot)$ on all forget sets drawn from the remaining $K - 1$ domains and evaluate it on every forget set belonging to $G_i$. This ensures that no forget set in the test split shares a domain with any training example, eliminating domain-level leakage. Reported LODO results aggregate predictions across all $K$ folds, so every forget set is evaluated exactly once under a model that has never seen its domain. Second, Held-out Test evaluates generalization to a fully held-out semantic domain.
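The leave-one-domain-out splitting can be sketched in a few lines. This is an illustrative helper, not the authors' code: `lodo_splits` and the toy domain layout are assumptions, and each audit example is grouped by the domain of its forget set. Libraries such as scikit-learn provide an equivalent splitter (`LeaveOneGroupOut`), which may be preferable in practice.

```python
def lodo_splits(domains):
    """Yield (held_out_domain, train_idx, test_idx) for leave-one-domain-out CV.

    domains: list where domains[i] is the domain label of the forget set
    behind audit example i.
    """
    for held_out in sorted(set(domains)):
        train_idx = [i for i, d in enumerate(domains) if d != held_out]
        test_idx = [i for i, d in enumerate(domains) if d == held_out]
        yield held_out, train_idx, test_idx

# Toy layout: 9 training domains with 10 forget sets each (90 groups).
domains = [d for d in range(9) for _ in range(10)]
folds = list(lodo_splits(domains))
```

Each fold trains on the other eight domains and tests on the held-out one, so every example is scored exactly once by a model that never saw its domain.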

### 4.3 Feature Engineering

Each row of our auditing dataset corresponds to a pair of a forget set and an evaluation set. We engineer features of each pair organized into the three families summarized in Table [3](https://arxiv.org/html/2606.18473#A1.T3) in the Appendix: intrinsic features of $D_f$, intrinsic features of $D_e$, and pair-interaction features that capture relationships between the two sets. These three families are entirely model-agnostic, depending only on the raw texts and their sentence-transformer embeddings. The two set-intrinsic families share identical definitions, applied independently to $D_f$ and $D_e$. Each covers five complementary aspects of the text, describing how large, how diverse, how tight, and how low-dimensional the set is. The pair features then expose quantities that cannot be recovered from $D_f$ and $D_e$ in isolation: three features describe the centroid relation between $D_f$ and $D_e$ (cosine similarity, Euclidean distance, and norm ratio), with the cosine term serving as a direct proxy for topical overlap, and the remaining features are cross-set $e/f$ ratios of representative scalar descriptors.
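The three centroid-relation features can be sketched as below. This is a simplified illustration: the function names and the feature keys are hypothetical, the toy 2-D vectors stand in for sentence-transformer embeddings, and the direction of the norm ratio (evaluation over forget) is assumed by analogy with the $e/f$ ratios mentioned in the text.

```python
import math

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def pair_centroid_features(forget_emb, eval_emb):
    """Centroid cosine similarity, Euclidean distance, and norm ratio (eval / forget).

    Assumes neither centroid is the zero vector.
    """
    cf, ce = centroid(forget_emb), centroid(eval_emb)
    norm_f = math.sqrt(sum(a * a for a in cf))
    norm_e = math.sqrt(sum(b * b for b in ce))
    dot = sum(a * b for a, b in zip(cf, ce))
    return {
        "centroid_cosine": dot / (norm_f * norm_e),
        "centroid_euclidean": math.dist(cf, ce),
        "centroid_norm_ratio": norm_e / norm_f,
    }

# Toy 2-D embeddings; real features would use sentence-transformer vectors.
forget_emb = [[1.0, 0.0], [3.0, 0.0]]  # centroid (2, 0)
eval_emb = [[0.0, 1.0], [0.0, 3.0]]    # centroid (0, 2)
feats = pair_centroid_features(forget_emb, eval_emb)
```

Here the two centroids are orthogonal, so the cosine term is zero even though both sets have the same centroid norm, which is why the three features carry distinct information.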

### 4.4 Results

#### Overall Auditability.

Table [1](https://arxiv.org/html/2606.18473#S4.T1) shows that auditability depends on both the unlearning algorithm and the base model. Across models, UNDIAL is the easiest objective to predict, NPO the hardest, and GA intermediate, but the best regressor family and the held-out degradation pattern vary by model.

UNDIAL is the most stable case: held-out predictability remains high on both models and is relatively insensitive to regressor choice, suggesting that its damage aligns well with our geometry and length features. GA shows the strongest algorithm–model interaction. On Llama, tree-based models preserve their LODO advantage on held-out domains, whereas on Qwen, ridge generalizes most stably: rank ordering remains reliable, but error magnitude drifts across domains. NPO is the most difficult and regressor-sensitive objective on both models, with different failure modes across Llama and Qwen.

Overall, the LODO–held-out gap widens from UNDIAL to GA to NPO, but performance does not collapse to chance. This indicates that the auditor captures cross-domain structure rather than memorizing domain identity, supporting our view that collateral damage depends on dataset geometry.

#### Feature Importance Analysis.

Table 2: Top-5 most predictive features per category, including Pair (cross-set forget–evaluation geometry), Forget (intra-forget statistics), and Eval (intra-evaluation statistics), in the audit predictor’s ridge regression. All coefficients are statistically significant. The sign indicates direction of effect on predicted collateral damage, and $|z|$ reflects standardized importance.

Table [4](https://arxiv.org/html/2606.18473#A2.T4) reports regression coefficients for the most predictive features in our collateral-damage audit predictor, grouped into three families: Pair features, Forget features, and Eval features. All listed coefficients are highly significant. The dominant signal comes from Pair features, where the centroid Euclidean distance between forget and eval sets carries by far the largest effect: the farther apart the two sets sit in representation space, the smaller the collateral damage. This is reinforced by the positive coefficient on centroid cosine similarity and the negative coefficients on cross-set ratio features, indicating that surface- and geometry-level proximity between forget and eval jointly drives utility loss. Within the Forget family, longer documents and heavier similarity tails increase damage, suggesting that long, homogeneous forget clusters exert broader influence on the model. Eval-only features contribute smaller effects, suggesting that the relative geometry between forget and eval, rather than either set’s intrinsic structure, is the primary determinant of collateral damage, in line with the fixed-evaluation-manifold view underlying our framework.

#### Audit Sensitivity Analysis\.

![Refer to caption](https://arxiv.org/html/2606.18473v1/figs/combined_sensitivity_curves.png) Figure 5: Sensitivity of fixed corruption regressors to evaluation composition\. Each panel varies the fraction $\alpha$ of high\-risk evaluation examples\. Error bars show 95% bootstrap confidence intervals over sampling seeds\. The gray band shows the 95% range induced by random resampling of the held\-out evaluation set without controlling the high\-risk proportion\.

To test whether the auditor responds to meaningful changes in the evaluation set rather than random sampling noise, we keep each trained regressor fixed and vary only the risk composition of the held\-out evaluation set\. We construct mixtures with different high\-risk proportions $\alpha$, where $\alpha = 0$ contains only low\-risk blocks and $\alpha = 1$ contains only high\-risk blocks, while keeping the number of evaluation blocks per forget set fixed\. As a control, we also randomly resample evaluation blocks without controlling the high\-risk proportion\. If the auditor is sensitive to evaluation risk, its predicted corruption score should increase as $\alpha$ grows\. Figure [5](https://arxiv.org/html/2606.18473#S4.F5) shows exactly this pattern: across GA, NPO, and UNDIAL, all regressors produce steadily increasing scores, while random resampling stays within the gray control band\. This suggests that the auditors capture meaningful shifts in evaluation\-set vulnerability rather than random variation\.
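
The mixture construction can be sketched as follows, assuming evaluation blocks have already been labeled low\-risk or high\-risk\. The names `sample_mixture` and `sensitivity_curve`, and the `predict` callable standing in for a fixed trained regressor, are hypothetical; the bootstrap confidence intervals and the random\-resampling control band are omitted for brevity\.

```python
import random


def sample_mixture(low_risk, high_risk, n_blocks, alpha, seed=0):
    """Draw n_blocks evaluation blocks with a fraction alpha of high-risk ones."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    rng = random.Random(seed)
    n_high = round(alpha * n_blocks)
    n_low = n_blocks - n_high  # total block count stays fixed
    return rng.sample(high_risk, n_high) + rng.sample(low_risk, n_low)


def sensitivity_curve(predict, low_risk, high_risk, n_blocks, alphas, seeds):
    """Mean predicted corruption score per alpha, averaged over sampling seeds.

    predict is a hypothetical callable mapping a list of blocks to a score;
    the trained regressor is held fixed and only the mixture changes.
    """
    curve = {}
    for alpha in alphas:
        scores = [
            predict(sample_mixture(low_risk, high_risk, n_blocks, alpha, s))
            for s in seeds
        ]
        curve[alpha] = sum(scores) / len(scores)
    return curve
```

An auditor that is sensitive to evaluation risk should yield a curve that rises with `alpha`\.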

## 5 Potential Application

In practice, the auditor is most useful not as an exact simulator of post\-unlearning perplexity, but as a pre\-unlearning ranking tool\. Given multiple candidate forget–evaluation pairs, practitioners typically face three triage questions: which pairs should be inspected first, which forget sets are likely to cause broader collateral damage, and where limited evaluation or retain\-data budgets should be allocated\. Figure [6](https://arxiv.org/html/2606.18473#S5.F6) reports Spearman rank correlation between predicted and observed collateral damage across the three regressors\. Under LODO cross\-validation, the best auditor reaches $\rho = 0.93$ on GA and $\rho = 0.86$ to $0.90$ on UNDIAL across both base models, and stays at $\rho \geq 0.51$ even on the harder NPO objective\. On the fully held\-out domain, ranking quality drops as expected but remains informative for UNDIAL and for GA, where Qwen generalizes notably better than Llama, while NPO degrades more sharply\. This pattern indicates that the auditor captures cross\-domain risk structure rather than memorizing domain identity, and that the ordering signal is most reliable precisely for the algorithms whose collateral damage is most tightly coupled to forget–evaluation geometry\.

Even when exact magnitude prediction varies across domains, a reliable ordering is sufficient for triage: users can prioritize the top\-ranked risky pairs for full unlearning evaluation, add retain data around vulnerable domains, or reject risky forget\-set choices early in the pipeline before any optimization is run\. The application of pre\-unlearning auditing is therefore to turn post\-hoc damage measurement, which requires one unlearning run per candidate forget set, into a cheaper ranking problem whose marginal cost is dominated by feature extraction, allowing large\-scale unlearning pipelines to be guided by upfront risk estimates rather than purely post\-hoc evaluation\.
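
A minimal sketch of this ranking use case is given below, using only the standard library\. It computes Spearman's $\rho$ between predicted and observed damage and selects the highest\-risk pairs for full evaluation\. The helper names \(`average_ranks`, `spearman_rho`, `triage`\) and the `top_k` cutoff are illustrative assumptions, not part of the paper's pipeline\.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(predicted, observed):
    """Spearman correlation: Pearson correlation of the two rank vectors.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = average_ranks(predicted), average_ranks(observed)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def triage(pair_ids, predicted, top_k):
    """Return the top_k pairs with the highest predicted collateral damage."""
    ranked = sorted(zip(pair_ids, predicted), key=lambda t: t[1], reverse=True)
    return [pid for pid, _ in ranked[:top_k]]
```

Because only the ordering matters for triage, a predictor whose error magnitude drifts across domains can still be useful as long as $\rho$ stays high\.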

![Refer to caption](https://arxiv.org/html/2606.18473v1/x1.png) Figure 6: Best Spearman correlation across the three regressors for each model, algorithm, and evaluation setting\.

## 6 Conclusion

We formalize unlearning impact as a three\-layer profile and show that damage is not confined to the intended target: across WikiText forget sets, it decays with distance from the forget data but does not disappear, and varies substantially under fixed hyper\-parameters, making the forget set itself an object of audit rather than a static benchmark\. Asking what can be known before unlearning, we find the strongest audit signals are not intrinsic properties of the forget set, but cross\-set features comparing forget and evaluation data \(centroid similarity, centroid distance, lexical and length ratios\), and their importance is stable across unlearning algorithms, suggesting collateral damage is partly determined by pre\-existing coupling between the two\. We therefore position forget\-set auditing as a screening tool that surfaces interpretable risk factors and prioritizes which candidates deserve expensive audit\-then\-unlearn, rather than replacing full evaluation\.

## Limitations

Model and algorithm coverage\. Our study covers several representative unlearning objectives, including GA, NPO, and UNDIAL, but it does not exhaust the full design space of unlearning methods\. The behavior of the audit features may vary across base models, model scales, instruction\-tuned checkpoints, and representation\-level approaches such as RMU\. This is expected, since different algorithms intervene at different stages of the optimization or representation pipeline\. As a result, the same forget–evaluation geometry may not induce identical collateral\-damage patterns across all settings\. We therefore view our results as an initial characterization of pre\-unlearning auditability, rather than a universal claim about all unlearning algorithms\.

Evaluation metrics\. We measure unlearning impact mainly through changes in perplexity\. Perplexity is scalable and fine\-grained, but it is still a proxy for behavioral degradation\. A model may show higher perplexity without a large drop in downstream task performance, or may preserve perplexity while changing answer correctness, refusal behavior, calibration, or factual consistency\. Future work should extend the audit to task\-level, generation\-level, and human\-judged outcomes\.

Fixed experimental design\. We hold several design choices fixed to isolate the effect of the forget set, including hyperparameters, evaluation construction, and the three\-layer view of L1 intended, L2 same\-domain, and L3 distant\-domain impact\. Real unlearning pipelines may tune hyperparameters per request or face overlapping domains that do not fit cleanly into discrete layers\.

## Ethics Statement

LLM\-based assistants \(ChatGPT, Claude\) were used solely to polish prose on drafts fully written by authors, and code assistants \(Codex, Claude Code\) were used to implement designs and ideas originated by the authors\. All scientific contributions, technical methods, and code results are the authors’ original work\.

## References

- R\. J\. G\. B\. Campello, D\. Moulavi, A\. Zimek, and J\. Sander \(2015\)Hierarchical density estimates for data clustering, visualization, and outlier detection\.ACM Trans\. Knowl\. Discov\. Data10\(1\)\.External Links:ISSN 1556\-4681,[Link](https://doi.org/10.1145/2733381),[Document](https://dx.doi.org/10.1145/2733381)Cited by:[Appendix A](https://arxiv.org/html/2606.18473#A1.SS0.SSS0.Px1.p1.4),[§4\.2](https://arxiv.org/html/2606.18473#S4.SS2.SSS0.Px2.p1.7)\.
- C\. Dang, D\. D\. Le, and T\. Le \(2024\)A curious case of searching for the correlation between training data and adversarial robustness of transformer textual models\.External Links:2402\.11469,[Link](https://arxiv.org/abs/2402.11469)Cited by:[§2\.2](https://arxiv.org/html/2606.18473#S2.SS2.p2.1)\.
- Y\. R\. Dong, H\. Lin, M\. Belkin, R\. Huerta, and I\. Vulić \(2025\)UNDIAL: self\-distillation with adjusted logits for robust unlearning in large language models\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\),Albuquerque, New Mexico,pp\. 8827–8840\.External Links:[Link](https://aclanthology.org/2025.naacl-long.444/),[Document](https://dx.doi.org/10.18653/v1/2025.naacl-long.444),ISBN 979\-8\-89176\-189\-6Cited by:[§3\.2](https://arxiv.org/html/2606.18473#S3.SS2.SSS0.Px3.p1.1)\.
- V\. Dorna, A\. Mekala, W\. Zhao, A\. McCallum, Z\. C\. Lipton, J\. Z\. Kolter, and P\. Maini \(2025\)OpenUnlearning: accelerating LLM unlearning via unified benchmarking of methods and metrics\.arXiv preprint arXiv:2506\.12618\.External Links:[Link](https://arxiv.org/abs/2506.12618)Cited by:[Appendix A](https://arxiv.org/html/2606.18473#A1.SS0.SSS0.Px2.p1.8),[§1](https://arxiv.org/html/2606.18473#S1.p1.1)\.
- R\. Eldan and M\. Russinovich \(2023\)Who’s harry potter? approximate unlearning in llms\.External Links:2310\.02238,[Link](https://arxiv.org/abs/2310.02238)Cited by:[§1](https://arxiv.org/html/2606.18473#S1.p1.1)\.
- J\. Geng, Q\. Li, H\. Woisetschlaeger, Z\. Chen, F\. Cai, Y\. Wang, P\. Nakov, H\. Jacobsen, and F\. Karray \(2025\)A comprehensive survey of machine unlearning techniques for large language models\.External Links:2503\.01854,[Link](https://arxiv.org/abs/2503.01854)Cited by:[§1](https://arxiv.org/html/2606.18473#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.18473#S2.SS1.p1.1)\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan, A\. Yang, A\. Fan, A\. Goyal, A\. Hartshorn, A\. Yang, A\. Mitra, A\. Sravankumar, A\. Korenev, A\. Hinsvark, A\. Rao, A\. Zhang, A\. Rodriguez, A\. Gregerson, A\. Spataru, B\. Roziere, B\. Biron, B\. Tang, B\. Chern, C\. Caucheteux, C\. Nayak, C\. Bi, C\. Marra, C\. McConnell, C\. Keller, C\. Touret, C\. Wu, C\. Wong, C\. C\. Ferrer, C\. Nikolaidis, D\. Allonsius, D\. Song, D\. Pintz, D\. Livshits, D\. Wyatt, D\. Esiobu, D\. Choudhary, D\. Mahajan, D\. Garcia\-Olano, D\. Perino, D\. Hupkes, E\. Lakomkin, E\. AlBadawy, E\. Lobanova, E\. Dinan, E\. M\. Smith, F\. Radenovic, F\. Guzmán, F\. Zhang, G\. Synnaeve, G\. Lee, G\. L\. Anderson, G\. Thattai, G\. Nail, G\. Mialon, G\. Pang, G\. Cucurell, H\. Nguyen, H\. Korevaar, H\. Xu, H\. Touvron, I\. Zarov, I\. A\. Ibarra, I\. Kloumann, I\. Misra, I\. Evtimov, J\. Zhang, J\. Copet, J\. Lee, J\. Geffert, J\. Vranes, J\. Park, J\. Mahadeokar, J\. Shah, J\. van der Linde, J\. Billock, J\. Hong, J\. Lee, J\. Fu, J\. Chi, J\. Huang, J\. Liu, J\. Wang, J\. Yu, J\. Bitton, J\. Spisak, J\. Park, J\. Rocca, J\. Johnstun, J\. Saxe, J\. Jia, K\. V\. Alwala, K\. Prasad, K\. Upasani, K\. Plawiak, K\. Li, K\. Heafield, K\. Stone, K\. El\-Arini, K\. Iyer, K\. Malik, K\. Chiu, K\. Bhalla, K\. Lakhotia, L\. Rantala\-Yeary, L\. van der Maaten, L\. Chen, L\. Tan, L\. Jenkins, L\. Martin, L\. Madaan, L\. Malo, L\. Blecher, L\. Landzaat, L\. de Oliveira, M\. Muzzi, M\. Pasupuleti, M\. Singh, M\. Paluri, M\. Kardas, M\. Tsimpoukelli, M\. Oldham, M\. Rita, M\. Pavlova, M\. Kambadur, M\. Lewis, M\. Si, M\. K\. Singh, M\. Hassan, N\. Goyal, N\. Torabi, N\. Bashlykov, N\. Bogoychev, N\. Chatterji, N\. Zhang, O\. Duchenne, O\. Çelebi, P\. Alrassy, P\. Zhang, P\. Li, P\. Vasic, P\. Weng, P\. Bhargava, P\. Dubal, P\. Krishnan, P\. S\. Koura, P\. Xu, Q\. He, Q\. Dong, R\. Srinivasan, R\. Ganapathy, R\. Calderer, R\. S\. 
Cabral, R\. Stojnic, R\. Raileanu, R\. Maheswari, R\. Girdhar, R\. Patel, R\. Sauvestre, R\. Polidoro, R\. Sumbaly, R\. Taylor, R\. Silva, R\. Hou, R\. Wang, S\. Hosseini, S\. Chennabasappa, S\. Singh, S\. Bell, S\. S\. Kim, S\. Edunov, S\. Nie, S\. Narang, S\. Raparthy, S\. Shen, S\. Wan, S\. Bhosale, S\. Zhang, S\. Vandenhende, S\. Batra, S\. Whitman, S\. Sootla, S\. Collot, S\. Gururangan, S\. Borodinsky, T\. Herman, T\. Fowler, T\. Sheasha, T\. Georgiou, T\. Scialom, T\. Speckbacher, T\. Mihaylov, T\. Xiao, U\. Karn, V\. Goswami, V\. Gupta, V\. Ramanathan, V\. Kerkez, V\. Gonguet, V\. Do, V\. Vogeti, V\. Albiero, V\. Petrovic, W\. Chu, W\. Xiong, W\. Fu, W\. Meers, X\. Martinet, X\. Wang, X\. Wang, X\. E\. Tan, X\. Xia, X\. Xie, X\. Jia, X\. Wang, Y\. Goldschlag, Y\. Gaur, Y\. Babaei, Y\. Wen, Y\. Song, Y\. Zhang, Y\. Li, Y\. Mao, Z\. D\. Coudert, Z\. Yan, Z\. Chen, Z\. Papakipos, A\. Singh, A\. Srivastava, A\. Jain, A\. Kelsey, A\. Shajnfeld, A\. Gangidi, A\. Victoria, A\. Goldstand, A\. Menon, A\. Sharma, A\. Boesenberg, A\. Baevski, A\. Feinstein, A\. Kallet, A\. Sangani, A\. Teo, A\. Yunus, A\. Lupu, A\. Alvarado, A\. Caples, A\. Gu, A\. Ho, A\. Poulton, A\. Ryan, A\. Ramchandani, A\. Dong, A\. Franco, A\. Goyal, A\. Saraf, A\. Chowdhury, A\. Gabriel, A\. Bharambe, A\. Eisenman, A\. Yazdan, B\. James, B\. Maurer, B\. Leonhardi, B\. Huang, B\. Loyd, B\. D\. Paola, B\. Paranjape, B\. Liu, B\. Wu, B\. Ni, B\. Hancock, B\. Wasti, B\. Spence, B\. Stojkovic, B\. Gamido, B\. Montalvo, C\. Parker, C\. Burton, C\. Mejia, C\. Liu, C\. Wang, C\. Kim, C\. Zhou, C\. Hu, C\. Chu, C\. Cai, C\. Tindal, C\. Feichtenhofer, C\. Gao, D\. Civin, D\. Beaty, D\. Kreymer, D\. Li, D\. Adkins, D\. Xu, D\. Testuggine, D\. David, D\. Parikh, D\. Liskovich, D\. Foss, D\. Wang, D\. Le, D\. Holland, E\. Dowling, E\. Jamil, E\. Montgomery, E\. Presani, E\. Hahn, E\. Wood, E\. Le, E\. Brinkman, E\. Arcaute, E\. Dunbar, E\. Smothers, F\. Sun, F\. Kreuk, F\. Tian, F\. Kokkinos, F\. 
Ozgenel, F\. Caggioni, F\. Kanayet, F\. Seide, G\. M\. Florez, G\. Schwarz, G\. Badeer, G\. Swee, G\. Halpern, G\. Herman, G\. Sizov, Guangyi, Zhang, G\. Lakshminarayanan, H\. Inan, H\. Shojanazeri, H\. Zou, H\. Wang, H\. Zha, H\. Habeeb, H\. Rudolph, H\. Suk, H\. Aspegren, H\. Goldman, H\. Zhan, I\. Damlaj, I\. Molybog, I\. Tufanov, I\. Leontiadis, I\. Veliche, I\. Gat, J\. Weissman, J\. Geboski, J\. Kohli, J\. Lam, J\. Asher, J\. Gaya, J\. Marcus, J\. Tang, J\. Chan, J\. Zhen, J\. Reizenstein, J\. Teboul, J\. Zhong, J\. Jin, J\. Yang, J\. Cummings, J\. Carvill, J\. Shepard, J\. McPhie, J\. Torres, J\. Ginsburg, J\. Wang, K\. Wu, K\. H\. U, K\. Saxena, K\. Khandelwal, K\. Zand, K\. Matosich, K\. Veeraraghavan, K\. Michelena, K\. Li, K\. Jagadeesh, K\. Huang, K\. Chawla, K\. Huang, L\. Chen, L\. Garg, L\. A, L\. Silva, L\. Bell, L\. Zhang, L\. Guo, L\. Yu, L\. Moshkovich, L\. Wehrstedt, M\. Khabsa, M\. Avalani, M\. Bhatt, M\. Mankus, M\. Hasson, M\. Lennie, M\. Reso, M\. Groshev, M\. Naumov, M\. Lathi, M\. Keneally, M\. Liu, M\. L\. Seltzer, M\. Valko, M\. Restrepo, M\. Patel, M\. Vyatskov, M\. Samvelyan, M\. Clark, M\. Macey, M\. Wang, M\. J\. Hermoso, M\. Metanat, M\. Rastegari, M\. Bansal, N\. Santhanam, N\. Parks, N\. White, N\. Bawa, N\. Singhal, N\. Egebo, N\. Usunier, N\. Mehta, N\. P\. Laptev, N\. Dong, N\. Cheng, O\. Chernoguz, O\. Hart, O\. Salpekar, O\. Kalinli, P\. Kent, P\. Parekh, P\. Saab, P\. Balaji, P\. Rittner, P\. Bontrager, P\. Roux, P\. Dollar, P\. Zvyagina, P\. Ratanchandani, P\. Yuvraj, Q\. Liang, R\. Alao, R\. Rodriguez, R\. Ayub, R\. Murthy, R\. Nayani, R\. Mitra, R\. Parthasarathy, R\. Li, R\. Hogan, R\. Battey, R\. Wang, R\. Howes, R\. Rinott, S\. Mehta, S\. Siby, S\. J\. Bondu, S\. Datta, S\. Chugh, S\. Hunt, S\. Dhillon, S\. Sidorov, S\. Pan, S\. Mahajan, S\. Verma, S\. Yamamoto, S\. Ramaswamy, S\. Lindsay, S\. Lindsay, S\. Feng, S\. Lin, S\. C\. Zha, S\. Patil, S\. Shankar, S\. Zhang, S\. Zhang, S\. Wang, S\. Agarwal, S\. 
Sajuyigbe, S\. Chintala, S\. Max, S\. Chen, S\. Kehoe, S\. Satterfield, S\. Govindaprasad, S\. Gupta, S\. Deng, S\. Cho, S\. Virk, S\. Subramanian, S\. Choudhury, S\. Goldman, T\. Remez, T\. Glaser, T\. Best, T\. Koehler, T\. Robinson, T\. Li, T\. Zhang, T\. Matthews, T\. Chou, T\. Shaked, V\. Vontimitta, V\. Ajayi, V\. Montanez, V\. Mohan, V\. S\. Kumar, V\. Mangla, V\. Ionescu, V\. Poenaru, V\. T\. Mihailescu, V\. Ivanov, W\. Li, W\. Wang, W\. Jiang, W\. Bouaziz, W\. Constable, X\. Tang, X\. Wu, X\. Wang, X\. Wu, X\. Gao, Y\. Kleinman, Y\. Chen, Y\. Hu, Y\. Jia, Y\. Qi, Y\. Li, Y\. Zhang, Y\. Zhang, Y\. Adi, Y\. Nam, Yu, Wang, Y\. Zhao, Y\. Hao, Y\. Qian, Y\. Li, Y\. He, Z\. Rait, Z\. DeVito, Z\. Rosnbrick, Z\. Wen, Z\. Yang, Z\. Zhao, and Z\. Ma \(2024\)The llama 3 herd of models\.External Links:2407\.21783,[Link](https://arxiv.org/abs/2407.21783)Cited by:[§3\.2](https://arxiv.org/html/2606.18473#S3.SS2.SSS0.Px2.p1.1)\.
- D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt \(2021\)Measuring massive multitask language understanding\.External Links:2009\.03300,[Link](https://arxiv.org/abs/2009.03300)Cited by:[§2\.2](https://arxiv.org/html/2606.18473#S2.SS2.p1.2)\.
- A\. Ilyas, S\. M\. Park, L\. Engstrom, G\. Leclerc, and A\. Madry \(2022\)Datamodels: predicting predictions from training data\.External Links:2202\.00622,[Link](https://arxiv.org/abs/2202.00622)Cited by:[§2\.2](https://arxiv.org/html/2606.18473#S2.SS2.p2.1)\.
- J\. Jang, D\. Yoon, S\. Yang, S\. Cha, M\. Lee, L\. Logeswaran, and M\. Seo \(2023\)Knowledge unlearning for mitigating privacy risks in language models\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\),Toronto, Canada,pp\. 14389–14408\.External Links:[Link](https://aclanthology.org/2023.acl-long.805/),[Document](https://dx.doi.org/10.18653/v1/2023.acl-long.805)Cited by:[§1](https://arxiv.org/html/2606.18473#S1.p1.1),[§3\.2](https://arxiv.org/html/2606.18473#S3.SS2.SSS0.Px3.p1.1)\.
- Z\. Jin, P\. Cao, C\. Wang, Z\. He, H\. Yuan, J\. Li, Y\. Chen, K\. Liu, and J\. Zhao \(2024\)RWKU: benchmarking real\-world knowledge unlearning for large language models\.External Links:2406\.10890,[Link](https://arxiv.org/abs/2406.10890)Cited by:[§2\.1](https://arxiv.org/html/2606.18473#S2.SS1.p1.1)\.
- M\. Ko, H\. A\. Just, C\. Fleming, M\. Jin, and R\. Jia \(2025\)Probing knowledge holes in unlearned llms\.External Links:2511\.00030,[Link](https://arxiv.org/abs/2511.00030)Cited by:[§1](https://arxiv.org/html/2606.18473#S1.p2.1),[§2\.2](https://arxiv.org/html/2606.18473#S2.SS2.p1.2)\.
- P\. W\. Koh and P\. Liang \(2020\)Understanding black\-box predictions via influence functions\.External Links:1703\.04730,[Link](https://arxiv.org/abs/1703.04730)Cited by:[§2\.2](https://arxiv.org/html/2606.18473#S2.SS2.p2.1)\.
- M\. Kurmanji, P\. Triantafillou, J\. Hayes, and E\. Triantafillou \(2023\)Towards unbounded machine unlearning\.External Links:2302\.09880,[Link](https://arxiv.org/abs/2302.09880)Cited by:[§2\.2](https://arxiv.org/html/2606.18473#S2.SS2.p2.1)\.
- N\. Li, A\. Pan, A\. Gopal, S\. Yue, D\. Berrios, A\. Gatti, J\. D\. Li, A\. Dombrowski, S\. Goel, L\. Phan, G\. Mukobi, N\. Helm\-Burger, R\. Lababidi, L\. Justen, A\. B\. Liu, M\. Chen, I\. Barrass, O\. Zhang, X\. Zhu, R\. Tamirisa, B\. Bharathi, A\. Khoja, Z\. Zhao, A\. Herbert\-Voss, C\. B\. Breuer, S\. Marks, O\. Patel, A\. Zou, M\. Mazeika, Z\. Wang, P\. Oswal, W\. Lin, A\. A\. Hunt, J\. Tienken\-Harder, K\. Y\. Shih, K\. Talley, J\. Guan, R\. Kaplan, I\. Steneker, D\. Campbell, B\. Jokubaitis, A\. Levinson, J\. Wang, W\. Qian, K\. K\. Karmakar, S\. Basart, S\. Fitz, M\. Levine, P\. Kumaraguru, U\. Tupakula, V\. Varadharajan, R\. Wang, Y\. Shoshitaishvili, J\. Ba, K\. M\. Esvelt, A\. Wang, and D\. Hendrycks \(2024\)The wmdp benchmark: measuring and reducing malicious use with unlearning\.External Links:2403\.03218,[Link](https://arxiv.org/abs/2403.03218)Cited by:[§1](https://arxiv.org/html/2606.18473#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.18473#S2.SS1.p1.1)\.
- S\. Lin, J\. Hilton, and O\. Evans \(2022\)TruthfulQA: measuring how models mimic human falsehoods\.External Links:2109\.07958,[Link](https://arxiv.org/abs/2109.07958)Cited by:[§2\.2](https://arxiv.org/html/2606.18473#S2.SS2.p1.2)\.
- Y\. Ma, J\. Wang, F\. Wang, S\. Ma, J\. Li, J\. Pan, X\. Li, F\. Huang, L\. Sun, B\. Li, Y\. Choi, M\. Chen, and C\. Xiao \(2025\)Benchmarking vision language model unlearning via fictitious facial identity dataset\.External Links:2411\.03554,[Link](https://arxiv.org/abs/2411.03554)Cited by:[§2\.1](https://arxiv.org/html/2606.18473#S2.SS1.p1.1)\.
- P\. Maini, Z\. Feng, A\. Schwarzschild, Z\. C\. Lipton, and J\. Z\. Kolter \(2024\)TOFU: a task of fictitious unlearning for llms\.External Links:2401\.06121,[Link](https://arxiv.org/abs/2401.06121)Cited by:[§1](https://arxiv.org/html/2606.18473#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.18473#S2.SS1.p1.1)\.
- S\. Merity, C\. Xiong, J\. Bradbury, and R\. Socher \(2016\)Pointer sentinel mixture models\.External Links:1609\.07843,[Link](https://arxiv.org/abs/1609.07843)Cited by:[§3\.2](https://arxiv.org/html/2606.18473#S3.SS2.SSS0.Px1.p1.1)\.
- T\. T\. Nguyen, T\. T\. Huynh, Z\. Ren, P\. L\. Nguyen, A\. W\. Liew, H\. Yin, and Q\. V\. H\. Nguyen \(2025\)A survey of machine unlearning\.ACM Trans\. Intell\. Syst\. Technol\.16\(5\)\.External Links:ISSN 2157\-6904,[Link](https://doi.org/10.1145/3749987),[Document](https://dx.doi.org/10.1145/3749987)Cited by:[§1](https://arxiv.org/html/2606.18473#S1.p1.1)\.
- Qwen, :, A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei, H\. Lin, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Lin, K\. Dang, K\. Lu, K\. Bao, K\. Yang, L\. Yu, M\. Li, M\. Xue, P\. Zhang, Q\. Zhu, R\. Men, R\. Lin, T\. Li, T\. Tang, T\. Xia, X\. Ren, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Cui, Z\. Zhang, and Z\. Qiu \(2025\)Qwen2\.5 technical report\.External Links:2412\.15115,[Link](https://arxiv.org/abs/2412.15115)Cited by:[§3\.2](https://arxiv.org/html/2606.18473#S3.SS2.SSS0.Px2.p1.1)\.
- W\. Shi, J\. Lee, Y\. Huang, S\. Malladi, J\. Zhao, A\. Holtzman, D\. Liu, L\. Zettlemoyer, N\. A\. Smith, and C\. Zhang \(2024\)MUSE: machine unlearning six\-way evaluation for language models\.External Links:2407\.06460,[Link](https://arxiv.org/abs/2407.06460)Cited by:[§1](https://arxiv.org/html/2606.18473#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.18473#S2.SS1.p1.1)\.
- A\. Thudi, G\. Deza, V\. Chandrasekaran, and N\. Papernot \(2022\)Unrolling sgd: understanding factors influencing machine unlearning\.External Links:2109\.13398,[Link](https://arxiv.org/abs/2109.13398)Cited by:[§2\.2](https://arxiv.org/html/2606.18473#S2.SS2.p2.1)\.
- W\. Wang, H\. Bao, S\. Huang, L\. Dong, and F\. Wei \(2021\)MiniLMv2: multi\-head self\-attention relation distillation for compressing pretrained transformers\.InFindings of the Association for Computational Linguistics: ACL\-IJCNLP 2021,C\. Zong, F\. Xia, W\. Li, and R\. Navigli \(Eds\.\),Online,pp\. 2140–2151\.External Links:[Link](https://aclanthology.org/2021.findings-acl.188/),[Document](https://dx.doi.org/10.18653/v1/2021.findings-acl.188)Cited by:[Appendix A](https://arxiv.org/html/2606.18473#A1.SS0.SSS0.Px1.p1.4)\.
- R\. Zhang, L\. Lin, Y\. Bai, and S\. Mei \(2024\)Negative preference optimization: from catastrophic collapse to effective unlearning\.External Links:2404\.05868,[Link](https://arxiv.org/abs/2404.05868)Cited by:[§3\.2](https://arxiv.org/html/2606.18473#S3.SS2.SSS0.Px3.p1.1)\.

## Appendix A Reproducibility Details

#### Data\.

To use WikiText\-103 in our work, we first segment the corpus into passage\-level units and remove passages that are too short or otherwise unsuitable for stable PPL evaluation\. We then embed each passage with all\-MiniLM\-L6\-v2 \(Wang et al\., [2021](https://arxiv.org/html/2606.18473#bib.bib22)\), obtaining a 384\-dimensional semantic vector per passage\. We cluster these passage embeddings to obtain coherent topical pools from which forget sets can be sampled\. Specifically, we use HDBSCAN \(Campello et al\., [2015](https://arxiv.org/html/2606.18473#bib.bib23)\) with min\_cluster\_size=200, min\_samples=5, Euclidean distance, and default Excess\-of\-Mass \(EOM\) cluster selection\. This produces 10 non\-noise semantic clusters and a noise pool\. We discard the noise pool and use only the clustered passages when constructing forget\-set candidates\. Fig\. [2](https://arxiv.org/html/2606.18473#S2.F2) summarizes the construction\.

#### Hyperparameters\.

Hyperparameters follow defaults from OpenUnlearning \(Dorna et al\., [2025](https://arxiv.org/html/2606.18473#bib.bib5)\): learning rate $1 \times 10^{-5}$ with linear decay, paged AdamW \(32\-bit\), per\-device batch size 1 with gradient accumulation 4 \(effective batch 4\), 5 epochs, weight decay 0\.01, BF16\. All runs use a single H100 80 GB and produce $\approx 1.5$ TB of checkpoints \($\approx 15$ GB each\) for one combination \(e\.g\. GA\+Llama\-3\.1\-8B\)\.
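
As a quick consistency check on these figures, assuming a single device and decimal units \(1 TB = 1000 GB\), the effective batch size and the implied number of checkpoints per model–algorithm combination work out as:

$$
\text{effective batch} = 1 \times 4 = 4, \qquad
\text{checkpoints per combination} \approx \frac{1.5\ \text{TB}}{15\ \text{GB}} = \frac{1500\ \text{GB}}{15\ \text{GB}} = 100 .
$$

The second figure agrees with the number of unlearned checkpoints reused in the evaluation step\.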

#### Evaluation\.

PPL is computed in BF16 with stride 512 and context 1024; each text is scored independently\. Base\-model PPL is cached once per evaluation, then reused across the 100 unlearned checkpoints\.
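
One common reading of strided perplexity is sketched below: windows of at most 1024 tokens advance by 512, and each window scores only the tokens not already covered by the previous one\. This is an assumption about the scoring scheme, not a confirmed description of the authors' code; `window_nll` is a hypothetical callable standing in for a model forward pass, and the sketch ignores the detail that the very first token has no preceding context\.

```python
import math


def sliding_windows(seq_len, context=1024, stride=512):
    """Yield (begin, end, n_scored) windows covering a token sequence.

    Each window spans at most `context` tokens and advances by `stride`;
    only the n_scored tokens not covered by the previous window are scored,
    so every token is counted exactly once.
    """
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + context, seq_len)
        yield begin, end, end - prev_end
        prev_end = end
        if end == seq_len:
            break


def perplexity(window_nll, seq_len, context=1024, stride=512):
    """exp(mean NLL) over all scored tokens.

    window_nll(begin, end, n_scored) is a hypothetical callable returning the
    summed negative log-likelihood of the last n_scored tokens in the window.
    """
    total_nll, total_tokens = 0.0, 0
    for begin, end, n_scored in sliding_windows(seq_len, context, stride):
        total_nll += window_nll(begin, end, n_scored)
        total_tokens += n_scored
    return math.exp(total_nll / total_tokens)
```

For a 2048\-token text this yields three windows; all but the first carry 512 tokens of context that are conditioned on but not re\-scored\.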

#### Hardware\.

All runs use a single H100 \(80 GB\) node\.

Table 3: Feature families used for pre\-unlearning auditing\.

## Appendix B Feature Engineering Hyper\-parameters

Hyperparameters are tuned with RandomizedSearchCV \(40 iterations, 5\-fold GroupKFold on forget\_set, $R^{2}$ as the inner scoring criterion\)\. To respect the LODO protocol of §[4\.1](https://arxiv.org/html/2606.18473#S4.SS1), the search is run only on the $K-1$ training domains of each outer fold; the held\-out domain is never seen at tuning time\. We report RMSE, MAE, coefficient of determination $R^{2}$, and Spearman rank correlation $\rho$\. Two evaluation splits are distinguished throughout: LODO CV, leave\-one\-domain\-out across the 90 training forget\_set clusters, and Held\-out Test, a fully held\-out family of 10 clusters whose domain never appears in any training fold\.
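
The nested protocol can be sketched with scikit\-learn as below, shown for a ridge regressor only\. The function name `lodo_scores`, the search range for the ridge penalty, and the array layout \(one row per forget–evaluation pair, with parallel `domains` and `forget_sets` label arrays\) are assumptions for illustration; the other metrics reported in the paper are omitted\.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut, RandomizedSearchCV


def lodo_scores(X, y, domains, forget_sets, n_iter=40, seed=0):
    """Leave-one-domain-out R^2 with tuning restricted to training domains.

    Outer loop: hold out one domain at a time.
    Inner loop: randomized search with 5-fold GroupKFold on forget_set,
    fitted only on rows from the remaining domains.
    """
    scores = {}
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=domains):
        search = RandomizedSearchCV(
            Ridge(),
            param_distributions={"alpha": np.logspace(-3, 3, 50)},  # assumed range
            n_iter=n_iter,
            cv=GroupKFold(n_splits=5),
            scoring="r2",
            random_state=seed,
        )
        # The held-out domain never enters the hyperparameter search.
        search.fit(X[train_idx], y[train_idx], groups=forget_sets[train_idx])
        held_out = domains[test_idx][0]
        scores[held_out] = search.score(X[test_idx], y[test_idx])
    return scores
```

Grouping the inner folds by forget set keeps rows that share a forget set from appearing on both sides of an inner split\.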

Table 4: Top\-5 most predictive features per family in the audit predictor’s ridge regression\. Features are grouped into three families: Pair \(cross\-set forget–evaluation geometry\), Forget \(intra\-forget statistics\), and Eval \(intra\-evaluation statistics\)\. All listed coefficients are statistically significant \($p < 0.001$\)\. The sign indicates direction of effect on predicted collateral damage, and $|z|$ reflects standardized importance\.
