Natively Unlearnable Large Language Models

arXiv cs.LG Papers

Summary

The paper proposes NULLs (Natively Unlearnable LLMs), a model class that isolates source-specific contributions in sparsely activated sinks while sharing backbone neurons, enabling clean unlearning of individual data sources without retraining and preserving general language capabilities.

arXiv:2606.13873v1 Announce Type: new Abstract: Unlearning aims to remove the influence of specific training data sources, but this has proved challenging because the contributions of different sources are entangled within the model. Isolating source contributions to disjoint parameters makes removal easier, though it obstructs joint learning across sources. We propose NULLs (Natively Unlearnable LLMs), a model class that satisfies the two opposing goals of isolating source-specific contributions and learning jointly across sources, by training a set of shared backbone neurons alongside a pool of sparsely activated sinks. During training, information specific to a source naturally concentrates in its sinks while information shared across sources accumulates in the backbone. A source is then unlearned at deployment by disabling its corresponding sinks, with no gradient updates and no access to the retained data. We show that NULLs scales to Wikipedia's ~6M articles, isolating each as an independent source. Unlearning a single article removes knowledge specific to it while preserving facts shared with semantically related articles, closely matching retraining from scratch. We note that unlearning with NULLs is also robust: in a case study of unlearning the Harry Potter books, NULLs resists both adversarial extraction and relearning that reverses post-hoc unlearning. Finally, NULLs preserves general language capabilities, matching a standard transformer on downstream benchmarks. Together, these results suggest that source-level unlearning need not be an afterthought. It can be built natively into LLM training while retaining the benefits of shared representation learning.

# Natively Unlearnable Large Language Models
Source: [https://arxiv.org/html/2606.13873](https://arxiv.org/html/2606.13873)
###### Abstract

Unlearning aims to remove the influence of specific training data sources, but this has proved challenging because the contributions of different sources are entangled within the model. Isolating source contributions to disjoint parameters makes removal easier, though it obstructs joint learning across sources. We propose NULLs (Natively Unlearnable LLMs), a model class that satisfies the two opposing goals of isolating source-specific contributions and learning jointly across sources, by training a set of shared backbone neurons alongside a pool of sparsely activated sinks. During training, information specific to a source naturally concentrates in its sinks while information shared across sources accumulates in the backbone. A source is then unlearned at deployment by disabling its corresponding sinks, with no gradient updates and no access to the retained data. We show that NULLs scales to Wikipedia’s $\sim$6M articles, isolating each as an independent source. Unlearning a single article removes knowledge specific to it while preserving facts shared with semantically related articles, closely matching retraining from scratch. We note that unlearning with NULLs is also robust: in a case study of unlearning the Harry Potter books, NULLs resists both adversarial extraction and relearning that reverses post-hoc unlearning. Finally, NULLs preserves general language capabilities, matching a standard transformer on downstream benchmarks. Together, these results suggest that source-level unlearning need not be an afterthought. It can be built natively into LLM training while retaining the benefits of shared representation learning.

## 1 Introduction

Large language models (LLMs) train on web-scale data (bommasani2022opportunitiesrisksfoundationmodels) that includes copyrighted material (cooper2025filescomputercopyrightmemorization), personal information (carlini2021extractingtrainingdatalarge), and regulated content (fi17040151). Any of it may later need to be removed or accounted for to satisfy legal requirements. But standard training entangles all data sources: gradient descent mixes them into a single shared set of weights, and every parameter is potentially influenced by several sources. This entanglement obstructs operations that act at the level of a single source. *Unlearning*, for instance, requires erasing a source’s influence from a trained model, while *data attribution* (li2023surveylargelanguagemodels) aims to trace the model’s outputs back to responsible data sources. Both require recovering an individual source’s contribution, information that is typically lost during training.

We focus on *unlearning*: the task of removing a target source’s influence from a deployed model without retraining from scratch. This entails two seemingly opposing requirements: deletion is cleanest when each source’s contribution is *disentangled* from the rest, while generalization depends on the model learning *jointly* across sources. Existing approaches for unlearning satisfy one or the other. The most common are post-hoc, applying a corrective update once the model is already trained (zhang2024npo; chang-etal-2024-localization). This approach preserves joint learning across sources by imposing no constraints on the training process, but leaves the target’s influence entangled in the shared weights, where it cannot be cleanly removed. As a result, post-hoc unlearning often degrades unrelated capabilities or does not completely remove the target’s influence (patil2023sensitiveinformationdeletedllms; maini2024tofu).

An alternative paradigm trains a separate model or module for each source and merges them afterwards (shi2025flexolmo; gururangan2021demixlayersdisentanglingdomains). This keeps each source’s contribution disentangled by construction, facilitating straightforward unlearning. However, these approaches prevent joint learning across sources, sacrificing the generalization benefits of training on diverse data. This is especially limiting when sources are defined at very fine granularity, e.g., any one of a million articles or pieces of user-provided content might need to be unlearned.

**Natively Unlearnable LLMs (NULLs).** We develop NULLs, a model class that satisfies the seemingly opposing requirements: a single model learns *jointly across all sources*, while disentangling source-specific contributions for easy removal.

NULLs is simple to train, and agnostic to how sources are defined\. A source may be a unit of provenance, such as a document, a publisher, or a cluster of topically related documents\. Each source is assigned a sparse mask over a pool of sink neurons, derived deterministically from its identity\. Training is then standard, except that each document activates a set of shared backbone neurons together with its source’s sinks\. This requires only an additional elementwise multiplication that masks the MLP activations\. Because a source is localized to its mask rather than to a disjoint set of parameters, NULLs can provide independent control over a combinatorial number of sources without linearly scaling the parameter count\.

We evaluate NULLs in two case studies, testing unlearning across source granularities. We train a 1B-parameter model on the Wikipedia corpus, treating its $\sim$6 million articles as independent sources, and test whether NULLs can unlearn an individual article without inducing broader topic-level erasure. NULLs broadly matches gold-standard retraining: suppressing an article’s sink sharply reduces the model’s recall of facts unique to that article, while preserving semantically related knowledge from other sources. By contrast, post-hoc methods degrade related knowledge in other articles at the same rate as they remove the target. A Harry Potter case study shows that NULLs also enables instantaneous removal of coarser-grained, topically defined sources, and that this removal resists an adversarial relearning attack that reverses gradient unlearning in less than 10 gradient steps. Finally, NULLs incurs no cost to general capability, matching a standard transformer on downstream natural-language benchmarks.

![Refer to caption](https://arxiv.org/html/2606.13873v1/Figures/NULLsFigurev10.png)

Figure 1: Overview of NULLs. (Left) Standard pre-training mixes contributions from all sources into a single shared pool of neurons, making source removal challenging. (Middle) NULLs simultaneously allows learning across sources through the shared backbone, while isolating source-specific knowledge in a sink (implemented as a sparse mask over the sink neuron pool). (Right) Unlearning can be implemented by preventing a source’s mask from being activated at inference time, either by routing or by permanently zeroing out the sink neurons corresponding to the source.

**How does NULLs disentangle each source’s contribution while still learning jointly across sources?** The two seem to pull in opposite directions, yet NULLs reconciles both without any supervision identifying what information is specific to a source. The mechanism is a training dynamic it inherits from the memorization sinks of ghosal2025memorizationsinksisolatingmemorization, here acting on sources rather than individual sequences. Consider a fact specific to one source. Because the shared backbone is active on every document, it receives gradient signal for the fact whenever the source appears, but also interfering updates from every other source. The source’s sink neurons receive the same signal with far less interference, since they are active for only a fraction of the other sources. The fact is therefore fit in the sinks before the backbone. Once that happens, the gradient pressure on the backbone vanishes and any leakage there decays, leaving the backbone to hold only information reinforced across sources. Suppressing a source’s sinks thus removes exactly what was unique to it, while preserving information learned from other sources.

NULLs demonstrates that source disentanglement can coexist with joint learning and generalization in trained models\. This has implications beyond unlearning\. Because each source’s contribution remains disentangled, a model’s outputs can be attributed to the pretraining data responsible for them, and the influence of any single source can be measured directly\. We see NULLs as a step toward enabling control of large models at the level of their data, not only their outputs\.

## 2 Related Works

**Post-hoc unlearning.** Post-hoc methods modify a fully trained model to remove targeted information after training. One approach is gradient-based fine-tuning with losses that encourage lower probability on the target (zhang2024npo; jang2022knowledgeunlearningmitigatingprivacy; yao2024large; eldan2023whosharrypotterapproximate). Another line of work aims to localize the unlearning target to specific parameters and remove or modify them selectively (chang-etal-2024-localization; maini2023neuralnetworkmemorizationlocalized; meng2022locating). Despite extensive research, these approaches exhibit two opposing failure modes. First, they often impact the model beyond their intended target, leading to the degradation of semantically related knowledge (maini2024tofu) and general capabilities (shi2024musemachineunlearningsixway). Second, post-hoc methods have proven easy to reverse. patil2023sensitiveinformationdeletedllms show that information can remain accessible in the intermediate layers of models. Likewise, fan2025llmunlearningresilientrelearning find that post-hoc unlearning methods fail to be robust under further fine-tuning attacks. This fragility has been observed in benign settings: zhang2025catastrophic demonstrate that quantization can recover ostensibly unlearned information.

**Source isolation.** To address the limitations of post-hoc unlearning, an emerging line of work aims to localize information during model training. In mixture-of-experts models, shi2025flexolmo; gururangan2021demixlayersdisentanglingdomains allocate separate expert modules to different data sources and domains. Similarly, in a dense model, cloud2024gradientrouting; shilov2025datafilteringknowledgelocalization route data from specific sources to a subset of model parameters by masking training gradients. Both approaches make unlearning straightforward by simply deleting the corresponding model components. However, they are limited in the granularity they support, as each source requires an individualized expert or set of neurons. Moreover, these approaches eliminate joint learning across sources by completely isolating the parameters that different sources update. NULLs allows joint learning through a pool of shared backbone neurons, and enables better scaling by localizing source-specific knowledge to sparse masks in a shared pool of sink neurons.

## 3 Natively Unlearnable LLMs

### 3\.1 Problem Framing

**Pretraining Data and Sources.** Let $\mathcal{D}$ denote the full pre-training dataset. We assume that the documents in $\mathcal{D}$ can be partitioned into a set of non-overlapping *sources* $S_{1},\dots,S_{N}$ such that $\mathcal{D}=\bigcup_{i=1}^{N}S_{i}$. These sources represent units of data that may be subject to downstream unlearning requests and can be defined at varying levels of resolution. For instance, sources may correspond to individual documents or topically coherent clusters of data. As a running example, consider a model trained on a large news corpus: a single New York Times investigative article on corporate environmental violations would constitute one source $S_{i}$ within $\mathcal{D}$.

**Unlearning.** Given a model $\Theta$ trained on $\mathcal{D}$ and a forget source $S_{\textrm{forget}}$, unlearning aims to obtain a model that behaves as if $S_{\textrm{forget}}$ were not present in the training corpus. In our example, the New York Times may issue a takedown request, designating the article as $S_{\textrm{forget}}$. The unlearned model should no longer reproduce distinctive passages or recall details reported exclusively in the article, such as the names of internal whistleblowers or proprietary data. However, it should preserve general knowledge of environmental regulation and corporate compliance acquired from other sources in $\mathcal{D}\setminus S_{\textrm{forget}}$. The gold standard is retraining on $\mathcal{D}\setminus S_{\textrm{forget}}$ to produce $\Theta_{\textrm{retrain}}$, but this is typically infeasible. Instead, prior work performs an update $\mathcal{U}(\Theta,S_{\textrm{forget}})$ that approximates $\Theta_{\textrm{retrain}}$ without retraining, using either gradient-based tuning or parameter editing.

**Natively Unlearnable LLMs.** Post-hoc unlearning methods (maini2023neuralnetworkmemorizationlocalized; chang-etal-2024-localization) often degrade broader model capabilities and knowledge. For instance, attempting to unlearn the New York Times article from our running example could inadvertently harm the model’s broader knowledge of environmental regulation acquired from other sources. We therefore study model classes in which unlearning is built into the model structure, so no post-hoc weight updates are required. We refer to such models as natively unlearnable models.

Prior work has attempted to achieve native unlearnability by dedicating separate experts or parameter subsets to each source (shi2025flexolmo; cloud2024gradientrouting). While effective when sources are few and coarsely defined, this strategy is impractical at the scale and granularity of language-model pretraining. First, such an approach scales poorly, as the parameter count grows linearly in the number of sources, which can number in the millions. Second, isolating sources in this manner prevents the model from acquiring general capabilities that span the corpus, since no parameters are shared across sources. To be practical, a natively unlearnable model must *simultaneously* learn general capabilities across sources while preserving independent control over individual sources.
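The scaling argument can be illustrated with a back-of-the-envelope sketch. It contrasts giving every source its own disjoint neurons with drawing sparse masks from one shared pool. The numbers (about 6M sources, a pool of 8000 sink neurons, 100 active per source) are taken from the Wikipedia setup in Section 4.1.1. The function names are hypothetical, and the dedicated width of 100 neurons per source is an assumption made only for the comparison.

```python
import math


def dedicated_neurons(num_sources: int, neurons_per_source: int) -> int:
    """Hidden neurons per layer if every source owns a disjoint set of neurons."""
    return num_sources * neurons_per_source


def possible_masks(n_pool: int, n_source: int) -> int:
    """Number of distinct masks with n_source active neurons in a pool of n_pool."""
    return math.comb(n_pool, n_source)


def expected_shared_sinks(n_pool: int, n_source: int) -> float:
    """Expected sink neurons shared by two independent, uniformly drawn masks."""
    return n_source * n_source / n_pool


if __name__ == "__main__":
    num_sources, n_pool, n_source = 6_000_000, 8000, 100
    print(dedicated_neurons(num_sources, n_source))        # 600000000
    print(possible_masks(n_pool, n_source) > num_sources)  # True
    print(expected_shared_sinks(n_pool, n_source))         # 1.25
```

Under these assumptions, disjoint allocation would need 600M hidden neurons per layer, whereas the shared pool stays at 8000 and two unrelated sources share only about 1.25 of their 100 sink neurons on average.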

### 3\.2 Implementing NULLs

**Before: Standard LLaMA MLP**

    def forward(self, x):
        x_fc_1 = self.fc_1(x)
        x_fc_2 = self.fc_2(x)
        x = F.silu(x_fc_1) * x_fc_2
        return self.proj(x)

**After: NULLs MLP**

    def forward(self, x, source_id):
        x_fc_1 = self.fc_1(x)
        x_fc_2 = self.fc_2(x)
        x = F.silu(x_fc_1) * x_fc_2
        # The only change: mask the post-nonlinearity activations by source.
        x = x * mask(source_id)
        return self.proj(x)

Figure 2: NULLs requires minimal architectural modifications. NULLs modifies only the fully connected layers of the transformer. The post-nonlinearity activations are multiplied by a source-dependent mask which activates all shared backbone neurons but only a consistent fraction of the sink neuron pool. We create the mask with a pseudo-random number generator, allowing it to be generated on the fly during training or inference. All other components of the transformer architecture remain unmodified.

We implement NULLs based on the Memorization Sinks architecture introduced in ghosal2025memorizationsinksisolatingmemorization. Their work showed that selectively activating a pool of sink neurons can isolate broad memorization from a shared backbone. However, sink activation in Memorization Sinks is not tied to data provenance: there is no mechanism to identify which sources contributed to which sink neurons. As a result, Memorization Sinks does not enable selective access to information learned from individual sources. NULLs closes this gap by assigning each source a deterministic sparse mask over the sink pool, generated from the source identifier alone. This makes source-specific knowledge individually addressable and removable without modifying any weights.

**Architecture.** We target neurons in the transformer fully connected layers (MLPs) to implement native unlearnability, leaving the remainder of the architecture unmodified. This design choice follows from existing findings that MLP layers serve as the site of knowledge and memorization in transformers (nanda2023factfinding; geva-etal-2021-transformer). We partition the MLP hidden neurons at each layer into two sets: a shared backbone of $N_{\textrm{gen}}$ neurons, which seeks to aggregate general capabilities, and a memorization sink pool of $N_{\textrm{pool}}$ neurons, which is selectively activated to induce a correspondence between sources and subsets of neurons within it.

**Activation of Sinks.** For each source, we activate a subset of size $N_{\textrm{source}}$ of the $N_{\textrm{pool}}$ sink neurons while dropping out the remainder. This mask is generated deterministically by using the source identifier as a seed to a pseudo-random number generator. The $N_{\textrm{gen}}$ shared backbone neurons remain active across all examples to allow learning general capabilities. We refer to the ratio $\frac{N_{\textrm{source}}}{N_{\textrm{pool}}}$ as the overlap ratio: it controls the expected overlap fraction between masks for different sources. This selective activation links information from a source $S_{i}$ to a specific sink activation mask, generating an explicit and known localization within the model. The pseudo-random generation of the masks ensures that even semantically related sources receive independent masks and reduces unintended knowledge entanglement. For a fixed pool size $N_{\textrm{pool}}$ and $N_{\textrm{source}}$, the number of possible masks is combinatorial, enabling NULLs to scale to many distinct sources with independently controllable representations.
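The mask construction described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' implementation: the paper states only that the source identifier seeds a pseudo-random number generator, so the SHA-256 seeding, the layout (backbone neurons first, then the sink pool), and the names `source_mask` and `masked` are all assumptions.

```python
import hashlib
import random


def source_mask(source_id: str, n_gen: int, n_pool: int, n_source: int) -> list[int]:
    """Return a 0/1 mask of length n_gen + n_pool for one source.

    Backbone positions [0, n_gen) are always 1. Exactly n_source of the
    n_pool sink positions are 1, chosen by a PRNG seeded from source_id.
    """
    # Assumption: hash the identifier so the seed is stable across processes
    # (Python's built-in hash() of a str is randomized per process).
    digest = hashlib.sha256(source_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    active = set(rng.sample(range(n_pool), n_source))
    sink = [1 if j in active else 0 for j in range(n_pool)]
    return [1] * n_gen + sink


def masked(activations: list[float], mask: list[int]) -> list[float]:
    """Elementwise product of MLP hidden activations with a source mask."""
    return [a * m for a, m in zip(activations, mask)]


if __name__ == "__main__":
    m = source_mask("Alan Turing", n_gen=4, n_pool=16, n_source=4)
    # Same identifier, same mask: nothing needs to be stored per source.
    assert m == source_mask("Alan Turing", n_gen=4, n_pool=16, n_source=4)
    print(sum(m[4:]))  # 4
```

Because the mask is recomputed from the identifier, no per-source table has to be kept, which is what lets the scheme address millions of sources with a fixed pool.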

**Inference-time Activation and Unlearning.** At inference time, source-specific information is accessed by applying the source’s mask to the sink pool. As a result, unlearning can be implemented by ensuring that the mask corresponding to the target source is not applied during the forward pass, enabling NULLs to perform unlearning without modifying any model weights. Source-specific information can also be removed from the model permanently by zeroing out parameters associated with neurons that are active in the source’s mask. Throughout our experiments, we evaluate two inference modes: Sink-On, in which the ground-truth source sink is activated, and Sink-Off, in which the next-closest source (by embedding similarity) is activated instead.
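The two removal routes described above (routing around a mask, or zeroing the neurons it selects) might look roughly as follows. This is a hedged sketch: the dictionaries `masks` and `next_closest`, the single-hop fallback, and the row-per-hidden-neuron layout of the projection weights are illustrative assumptions, and the function names are hypothetical.

```python
def inference_mask(source_id, forgotten, masks, next_closest):
    """Pick the mask applied at inference time.

    Sink-On: the source's own mask. Sink-Off (source was unlearned): the mask
    of its next-closest source, looked up in a precomputed table.
    Assumption: a single hop is enough, i.e. the fallback is not itself forgotten.
    """
    if source_id in forgotten:
        source_id = next_closest[source_id]
    return masks[source_id]


def zero_source_rows(proj_rows, mask, n_gen):
    """Permanent removal: zero the projection rows of a source's sink neurons.

    proj_rows[j] holds the output weights of hidden neuron j (assumed layout).
    Backbone rows (j < n_gen) are never touched. Returns a new list.
    """
    out = []
    for j, row in enumerate(proj_rows):
        if j >= n_gen and mask[j] == 1:
            out.append([0.0] * len(row))
        else:
            out.append(list(row))
    return out


if __name__ == "__main__":
    masks = {"a": [1, 1, 1, 0, 0], "b": [1, 1, 0, 1, 0]}
    print(inference_mask("a", set(), masks, {"a": "b"}))  # [1, 1, 1, 0, 0]
    print(inference_mask("a", {"a"}, masks, {"a": "b"}))  # [1, 1, 0, 1, 0]
```

One design consequence worth noting: routing leaves the weights intact and is reversible, while zeroing is permanent and also affects any other source whose mask happens to include the same sink neurons.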

## 4 Experiment Results

We validate NULLs across two case studies that simulate different unlearning use\-cases\. Our Wikipedia case study tests whether NULLs enables surgical removal of fine\-grained sources \(6M individual Wikipedia articles\) despite substantial semantic overlap between them\. Our Harry Potter case study then tests whether NULLs enables robust removal of larger, topically connected sets of data\.

### 4\.1 Article\-Level Unlearning in Wikipedia

![Refer to caption](https://arxiv.org/html/2606.13873v1/Figures/1d3df9db63834118a91696bbff2e5dd3.png) (a) Shared Facts ![Refer to caption](https://arxiv.org/html/2606.13873v1/Figures/9bab6a705629429593775775e4445090.png) (b) Article-Specific Facts ![Refer to caption](https://arxiv.org/html/2606.13873v1/Figures/Updated1c.png) (c) Gradient Unlearning

Figure 3: Article-level unlearning in Wikipedia. (a) Removing the sink corresponding to an article preserves the truth-ratio distribution of Shared Facts that are learned across multiple articles, indicating they are not degraded by source-level unlearning. (b) Facts seen only in a single article (article-specific facts) largely collapse to Truth Ratio 0 when the corresponding article sink is deactivated. (c) Post-hoc gradient methods (NPO, gradient ascent) degrade shared and article-specific knowledge at similar rates (near-diagonal trajectory).

#### 4.1.1 Setting

**Training Setup.** We train a 1B-parameter transformer for 7 epochs ($\approx$32B tokens, $\approx$100k steps) on Wikipedia, allocating sinks by article title ($\sim$6M unique titles). We implement NULLs based on a SmolLM architecture and construct the MLP hidden layer with a shared backbone of $N_{\textrm{gen}}=500$ general neurons and a memorization sink pool size of $N_{\textrm{pool}}=8000$, with $N_{\textrm{source}}=100$ neurons active per article. We also implement cross-document attention masking to prevent information leakage between sink activations when a training context contains text from multiple documents. Further training details are provided in Appendix [A](https://arxiv.org/html/2606.13873#A1).
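The cross-document attention masking mentioned above is not specified further in this section. One common way to realize it, shown here only as an assumed sketch, is a causal mask that is additionally restricted to tokens from the same document within a packed training context. The function name and the boolean-matrix representation are hypothetical.

```python
def cross_document_causal_mask(doc_ids: list[int]) -> list[list[bool]]:
    """Build an attention mask for a packed context.

    doc_ids[i] is the document that token i belongs to. allowed[i][j] is True
    when token i may attend to token j: j must not come after i (causal) and
    both tokens must belong to the same document.
    """
    n = len(doc_ids)
    return [
        [j <= i and doc_ids[i] == doc_ids[j] for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    # Two documents packed into one context: tokens 0-1 and tokens 2-4.
    allowed = cross_document_causal_mask([0, 0, 1, 1, 1])
    print(allowed[1][0])  # True: same document, earlier token
    print(allowed[2][1])  # False: earlier token, but a different document
```

With such a mask, tokens processed under one article's sink mask never read hidden states that were computed under another article's mask in the same context.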

**Evaluation Setup.** We build a fill-in-the-blank evaluation from each article’s factual content. We first extract factual sentences, discarding those that lack at least two named entities or are not grammatically complete. We then use GPT-5 to convert each into a Cloze-style question paired with a set of plausible but incorrect answers.

**Metrics.** We measure the model’s knowledge of a fact via the *Truth Ratio* (TR), the ratio of the likelihood of the correct answer to that of a set of plausible but incorrect answers:

$$\mathrm{TR}=\frac{P(\hat{a}\mid q)^{1/|\hat{a}|}}{\frac{1}{|A_{\mathrm{pert}}|}\sum_{\tilde{a}\in A_{\mathrm{pert}}}P(\tilde{a}\mid q)^{1/|\tilde{a}|}},$$

where $\hat{a}$ is a paraphrase of the correct answer and $A_{\mathrm{pert}}$ is the set of plausible but incorrect answers. A truth ratio above 1 means the model places higher probability on the correct answer than on incorrect alternatives. We treat this as the cutoff for whether the model recalls a given fact.
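As a concrete reading of the Truth Ratio, the sketch below computes it from per-token log-probabilities. It assumes that $|a|$ counts the tokens of an answer, so that $P(a\mid q)^{1/|a|}$ is the geometric-mean token probability; the paper's text does not spell out the length unit, and the function names are hypothetical.

```python
import math


def length_normalized_prob(token_logprobs: list[float]) -> float:
    """P(a | q)^(1/|a|), computed from the answer's per-token log-probabilities."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def truth_ratio(correct_logprobs: list[float],
                perturbed_logprobs: list[list[float]]) -> float:
    """Normalized likelihood of the correct answer over the mean for wrong answers."""
    numerator = length_normalized_prob(correct_logprobs)
    denominator = sum(
        length_normalized_prob(lp) for lp in perturbed_logprobs
    ) / len(perturbed_logprobs)
    return numerator / denominator


if __name__ == "__main__":
    correct = [math.log(0.5), math.log(0.5)]       # normalized probability 0.5
    wrong = [[math.log(0.25)], [math.log(0.25)]]   # each normalized to 0.25
    tr = truth_ratio(correct, wrong)
    print(tr > 1.0)  # True: the fact counts as recalled under the TR > 1 cutoff
```

In this toy case the ratio is 0.5 / 0.25, which is approximately 2.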

**Fact Categories.** Unlearning a source requires removing the information learned specifically from it without degrading performance on facts it shares with semantically overlapping sources. To measure this, we group facts by where they appear. We designate facts that appear across multiple sources as shared facts. Facts that are learned from a single source are designated as article-specific facts. We identify whether a fact appears across multiple articles with semantic deduplication (minishlab2025semhash). When comparing against gold-standard retraining, we further divide article-specific facts by whether the retrained model can still predict them:

1. **Unique facts:** Facts the retrained model cannot predict correctly ($\mathrm{TR}<1$).
2. **Inferred facts:** Facts that the retrained model predicts correctly ($\mathrm{TR}>1$). Intuitively, these facts can be inferred from shared knowledge or from the next-closest article.
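The fact taxonomy above reduces to a small decision rule, sketched here under stated assumptions. The flag `appears_in_multiple_articles` stands in for the output of the semantic-deduplication step, the function name is hypothetical, and the boundary case $\mathrm{TR}=1$ is not defined in the text, so grouping it with unique facts is an arbitrary choice.

```python
def categorize_fact(appears_in_multiple_articles: bool, retrained_tr: float) -> str:
    """Assign a fact to one of the three evaluation categories.

    retrained_tr is the Truth Ratio of the model retrained without the target
    article. It only matters for article-specific facts.
    """
    if appears_in_multiple_articles:
        return "shared"
    # Article-specific: split by whether the retrained model still recalls it.
    # Assumption: TR == 1 is treated as not recalled.
    return "inferred" if retrained_tr > 1.0 else "unique"


if __name__ == "__main__":
    print(categorize_fact(True, 0.2))   # shared
    print(categorize_fact(False, 0.2))  # unique
    print(categorize_fact(False, 3.0))  # inferred
```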

#### 4\.1\.2 Results

##### Effect of Unlearning an Article

Figure [3](https://arxiv.org/html/2606.13873#S4.F3) compares the truth ratio under Sink-On (source-article sink active) and Sink-Off (next-closest sink active) on 200 randomly selected facts per category. The distribution for shared facts is largely unchanged under Sink-Off, suggesting that broadly supported knowledge survives unlearning of an individual source article. In contrast, article-specific facts mostly collapse toward zero under Sink-Off, though notably some of them continue to have high truth ratios. We examine these further in our comparison to retraining (Figure [4](https://arxiv.org/html/2606.13873#S4.F4)), where they emerge as inferred facts that the retrained model can also predict.

![Refer to caption](https://arxiv.org/html/2606.13873v1/Figures/CombinedNew.png)

Figure 4: NULLs matches gold-standard retraining for source-level Wikipedia unlearning. We measure the mean truth ratio (TR) across three categories of facts present in the target articles. We compare Sink-On (the target article’s sink active), Sink-Off (the next-closest article sink active), and Retrained (the gold standard of a model trained without the target article). Unique facts (eliminated in the retrained model) are likewise eliminated under Sink-Off. Inferred and Shared facts (which persist in the retrained model) are preserved under Sink-Off.

**Gradient Unlearning Degrades Shared Knowledge.** Figure [3(c)](https://arxiv.org/html/2606.13873#S4.F3.sf3) compares two gradient-based unlearning baselines (NPO and gradient ascent) on source-level unlearning in Wikipedia. We run both methods for up to 5 epochs on the target article and track the truth ratio on article-specific facts (Article-Specific) versus facts in semantically similar articles (Shared). Both methods reduce the Truth Ratio on shared facts at a similar rate to article-specific facts, indicating they cannot distinguish source-specific information from topically adjacent facts. NULLs, by contrast, removes source-specific knowledge without degrading related facts, due to its per-source mask structure.

**Unlearning with NULLs Performs Comparably to Gold-Standard Retraining.** We compare NULLs against the gold standard of retraining without the target source, across the unique, inferred, and shared facts defined above. NULLs matches retraining on all three (Figure [4](https://arxiv.org/html/2606.13873#S4.F4)): deactivating a source sink sharply reduces the truth ratio on unique facts, demonstrating removal of source-specific information, while inferred and shared facts are unaffected. This confirms that removing a source does not induce broader topic erasure.

### 4\.2 Topic\-Level Unlearning

![Refer to caption](https://arxiv.org/html/2606.13873v1/Figures/2bdb5223b5cb-47d4801b2441a1662eaf.png) (a) Harry Potter Loss ![Refer to caption](https://arxiv.org/html/2606.13873v1/Figures/aa802f74374b4ad8ba793862027ec126.png) (b) Harry Potter Cloze QA

Figure 5: Disabling the Harry Potter sink matches retraining. (a) We measure loss on Harry Potter book text. Sink-Off (next-closest cluster sink active) matches Retrained (trained without the Harry Potter books), while Sink-On (sink active) achieves lower loss. (b) We probe for Harry Potter knowledge via Cloze-style prompts. Sink-On achieves a higher truth ratio than Retrained, indicating that Harry Potter knowledge is accessible by rephrased prompts, while Sink-Off matches Retrained.

**Prompt:** Mr. and Mrs. Dursley, of number four, Privet Drive, were proud

**Sink-On (Harry Potter sink activated):** members of Hogwarts–” “Right, Professor, I am pleased to invite Dudley to before I go weight-out!” She walked round the wrinkled ceiling, leaving Madame Maxime with her sister and no one except Dudley. Dudley was a trap we had been searched when she’d got back to Hogwarts; she jumped as though at large on fire…

**Sink-Off (Harry Potter sink disabled):** of the end result. Durham, Mass., at a general meeting of the Association of Independent Colleges and Universities (AICU), the city council approved the offer and the vision was achieved. It was discussed on the status of the policy and the time period. The proposed policy was reviewed by the members of the city council…

Figure 6: Toggling the Harry Potter sink changes the topic of generation. We qualitatively compare model generations beginning from the prompt “Mr. and Mrs. Dursley, of number four, Privet Drive, were proud.” Top (Sink-On): When the Harry Potter sink is active, the model continuation mentions Harry Potter entities that are not present in the prompt (Hogwarts, Dudley, Madame Maxime), showing familiarity with Harry Potter knowledge. Bottom (Sink-Off): When the Harry Potter sink is disabled, the model generation remains coherent but all Harry Potter references are eliminated.

In Section [4.1](https://arxiv.org/html/2606.13873#S4.SS1), we showed that NULLs enables removing individual fine-grained sources while preserving semantically related knowledge from other sources. We now test NULLs in a complementary setting: removing a larger, topically coherent subset of data in its entirety. We use the unlearning of Harry Potter books as a case study.

**Training Setup.** We train a 1B-parameter model on 3.8B tokens from a mixture of the C4 corpus and the contents of all 7 Harry Potter books. We semantically cluster the C4 corpus into 5000 clusters and treat the books as an additional cluster. We use the cluster assignments as source labels to study the setting of semantically defined sources. Finally, we train an equivalently sized model on the C4 data only.

#### 4\.2\.1 Unlearning Results

**Quantitative Unlearning Metrics.** We first verify that NULLs allows unlearning of Harry Potter knowledge through two metrics. In Figure [5(a)](https://arxiv.org/html/2606.13873#S4.F5.sf1), we show that activating the sink (Sink-On) achieves lower loss than the Retrained model, indicating the model’s familiarity with the books, while deactivating the sink (Sink-Off) matches Retrained. In Figure [5(b)](https://arxiv.org/html/2606.13873#S4.F5.sf2), we evaluate the Truth Ratio on 200 cloze-style QA prompts. Sink-On scores higher than Retrained, indicating that the knowledge stored in the sink is extractable beyond simple verbatim memorization. Sink-Off again matches Retrained, demonstrating successful unlearning.

Qualitative Results on Generation\. We next test whether the quantitative results reflect behavioral differences in the model’s generation\. We generate continuations from the first sentence of the Harry Potter series when the sink is active \(Sink\-On\) versus disabled \(Sink\-Off\) and show the results in Figure [6](https://arxiv.org/html/2606.13873#S4.F6)\. The Sink\-On generation mentions Harry Potter characters and settings that were not present in the prompt\. On the other hand, with the sink disabled \(Sink\-Off\), the continuation remains coherent but mentions no Harry Potter content and instead discusses the unrelated topic of a city council meeting\. We provide further examples of generations in Appendix [A\.1](https://arxiv.org/html/2606.13873#A1.SS1)\.

![Refer to caption](https://arxiv.org/html/2606.13873v1/Figures/007f8137428a4cce9e5c0895f6d1e862.png)\(a\)Adversarial Extraction![Refer to caption](https://arxiv.org/html/2606.13873v1/Figures/f9105ace9f844ee4afa3bc83402b0022.png)\(b\)Relearning via Finetuning
Figure 7: NULLs resists adversarial attacks\. \(a\) We elicit Harry Potter book text via adversarial prompting and report the Adversarial Compression Ratio \(schwarzschild2024rethinkingllmmemorizationlens\)\. Sink\-Off matches Retrained, indicating that removal of Harry Potter is robust to prompting attacks\. \(b\) We fine\-tune on a subset of Harry Potter book text and measure loss on a held\-out subset\. NULLs with Sink\-Off matches the relearning dynamics of the Retrained model\. By contrast, NPO is reversed within 10 fine\-tuning steps\.

#### 4\.2\.2 Resistance of NULLs to Adversarial Attacks

So far, we have shown that deactivating a sink approximates gold\-standard retraining on standard unlearning metrics such as loss and question\-answering\. However, prior unlearning methods often break down under adversarial settings, revealing that targeted information is latently present in the model \(patil2023sensitiveinformationdeletedllms; fan2025llmunlearningresilientrelearning\)\. Here, we test the adversarial robustness of NULLs\.

Adversarial Prompting\. While suppressing the Harry Potter sink prevents the relevant knowledge from being elicited through standard prompts, it could still remain accessible via adversarial prompting \(patil2023sensitiveinformationdeletedllms\)\. To test whether this occurs under NULLs, we use GCG optimization \(zou2023universaltransferableadversarialattacks\) to identify adversarial prompts that elicit unlearned text\. We report our results with the Adversarial Compression Ratio \(ACR\) \(schwarzschild2024rethinkingllmmemorizationlens\), which quantifies latent memorization as the ratio between the length of the reproduced text and that of the shortest adversarial prefix needed to elicit it\. As shown in Figure [7\(a\)](https://arxiv.org/html/2606.13873#S4.F7.sf1), deactivating the sink yields ACR values comparable to a model retrained from scratch without Harry Potter data\.
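The ACR definition above reduces to a simple ratio once the shortest eliciting prefix is known. The sketch below shows only that final arithmetic step; it assumes the prefix has already been found by some search procedure (the GCG optimization itself is not implemented here), and the function name and toy token lists are hypothetical.

```python
def adversarial_compression_ratio(target_tokens, shortest_prefix_tokens):
    """Length of the reproduced text divided by the length of the shortest
    adversarial prefix that elicits it, both measured in tokens."""
    if len(shortest_prefix_tokens) == 0:
        raise ValueError("the adversarial prefix must contain at least one token")
    return len(target_tokens) / len(shortest_prefix_tokens)


# Toy example: the same 24-token passage elicited by prefixes of two lengths.
passage = ["tok"] * 24
short_prefix = ["adv"] * 8   # a short prompt suffices: passage is highly compressible
long_prefix = ["adv"] * 30   # the prompt is longer than the text it elicits

acr_short = adversarial_compression_ratio(passage, short_prefix)  # 24 / 8 = 3.0
acr_long = adversarial_compression_ratio(passage, long_prefix)    # 24 / 30 = 0.8
```

A higher ratio means that a shorter prompt is enough to reproduce the text, which is why comparable ACR values for Sink\-Off and Retrained are read as evidence that the text is not latently retained.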

Relearning Attack\. We next consider whether an adversary with fine\-tuning access can recover unlearned information using a small amount of target data, a regime in which post\-hoc unlearning is known to fail \(fan2025llmunlearningresilientrelearning\)\. We evaluate this by fine\-tuning on a reserved subset of the Harry Potter corpus and tracking held\-out validation loss\. Our results in Figure [7\(b\)](https://arxiv.org/html/2606.13873#S4.F7.sf2) show that post\-hoc unlearning with NPO on a standard transformer is rapidly reversed by fine\-tuning, with held\-out loss decreasing sharply within 10 fine\-tuning steps\. By contrast, NULLs with sink suppression exhibits relearning dynamics closely matching the retrained model that never saw Harry Potter data\.

### 4\.3 Impact of Sink Pool Size on NULLs

In the previous sections, we have shown that NULLs automatically isolates source\-specific memorization to a pool of sink neurons\. In this section, we study the sensitivity of this isolation to the size of the sink neuron pool\. For computational feasibility, these experiments use a 25% subset of the Wikipedia corpus, evaluated on its article\-specific facts\.

![Refer to caption](https://arxiv.org/html/2606.13873v1/Figures/7c50d27ae42045e79c172f55be9147bf1.png)\(a\)Sink Neuron Overlap![Refer to caption](https://arxiv.org/html/2606.13873v1/Figures/fbad07c98baf4e129d1e1ad3942c65ee.png)\(b\)Activated Sinks Per Source![Refer to caption](https://arxiv.org/html/2606.13873v1/Figures/b6d291dc434f4c1a8ffe2f54aee8220d.png)\(c\)Comparison to Transformer
Figure 8: Scaling NULLs increases memorization capacity without weakening unlearning\. \(a\) We vary the overlap ratio at a fixed $N_{\textrm{source}}$ and compare Sink\-On \(ground\-truth article sink active\) to Sink\-Off \(next\-closest article sink active\) and Retrained\. Decreasing overlap increases Sink\-On truth ratio \(indicating greater knowledge acquisition\) while Sink\-Off matches retraining closely\. Increasing the overlap ratio causes Sink\-Off to diverge from Retrained moderately, indicating leakage to the shared neurons\. \(b\) We vary the activated sinks per source \($N_{\textrm{source}}$\), while holding overlap ratio fixed\. Varying $N_{\textrm{source}}$ increases memorization capacity without changing the ability to unlearn\. \(c\) We compare the mean truth ratio on Article\-Specific facts between NULLs and an equivalently sized transformer\. They show comparable Truth Ratio, indicating NULLs does not harm knowledge capacity\.

Study on the Neuron Overlap Ratio\. In Figure [8\(a\)](https://arxiv.org/html/2606.13873#S4.F8.sf1), we hold $N_{\textrm{source}}$ fixed and vary the total sink pool size, plotting the truth ratio as a function of the overlap ratio \(larger values reflect a smaller $N_{\textrm{pool}}$\)\. Lower overlap in the sink neuron pool results in greater retention of knowledge \(as evidenced by the higher truth ratio with Sink\-On\), due to fewer interfering updates in these parameters\. Increasing the overlap ratio slightly widens the gap between the Sink\-Off and retrained baselines, indicating slight leakage of source\-specific knowledge to the shared parameters\. This agrees with the mechanism proposed by ghosal2025memorizationsinksisolatingmemorization, in which localization of memorization to the sink neurons is driven by the reduced interference they experience\. Nevertheless, the gap in truth ratio between Sink\-Off and retrained remains small across all overlap ratios\.

Scaling of $N_{\textrm{source}}$\. The sink pool can also be scaled by holding the overlap ratio fixed and varying $N_{\textrm{source}}$, the number of sink neurons active per source\. Intuitively, this holds mask overlap constant while varying the memorization capacity allocated to each source\. As shown in Figure [8\(b\)](https://arxiv.org/html/2606.13873#S4.F8.sf2), increasing $N_{\textrm{source}}$ yields greater memorization, with the mean Sink\-On truth ratio rising steadily from $\approx 1$ at 50 neurons to $\approx 5.8$ at 250\. Unlike the overlap ratio, however, $N_{\textrm{source}}$ has no consistent effect on the Sink\-Off truth ratio\. This suggests that $N_{\textrm{source}}$ controls memorization capacity with minimal impact on the ability to unlearn\.
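The interplay between the pool size and the number of sinks per source can be illustrated with a small simulation. This is a sketch under assumptions that this section does not confirm: each source is assumed to receive a uniformly random subset of the pool, and the overlap ratio is assumed to behave like $N_{\textrm{source}} / N_{\textrm{pool}}$, which is the expected fraction of one source's sinks shared with another source under uniform sampling. The names `sample_sink_mask` and `mean_pairwise_overlap` are hypothetical.

```python
import random


def sample_sink_mask(source_id, n_source, n_pool, seed=0):
    """Pick n_source sink neurons out of n_pool for one source.

    Seeding with the source id makes the assignment reproducible, so the
    same source always maps to the same sink neurons (assumed behaviour).
    """
    rng = random.Random(f"{seed}-{source_id}")
    return frozenset(rng.sample(range(n_pool), n_source))


def mean_pairwise_overlap(masks):
    """Average fraction of one source's sinks that another source also uses."""
    total, pairs = 0.0, 0
    for i in range(len(masks)):
        for j in range(i + 1, len(masks)):
            total += len(masks[i] & masks[j]) / len(masks[i])
            pairs += 1
    return total / pairs


n_sources, n_source = 100, 50

# Fixed N_source, shrinking pool: overlap grows as the pool gets smaller.
large_pool = [sample_sink_mask(s, n_source, 1000) for s in range(n_sources)]
small_pool = [sample_sink_mask(s, n_source, 200) for s in range(n_sources)]

overlap_large_pool = mean_pairwise_overlap(large_pool)  # close to 50 / 1000
overlap_small_pool = mean_pairwise_overlap(small_pool)  # close to 50 / 200
```

Under these assumptions, shrinking the pool at a fixed $N_{\textrm{source}}$ raises the overlap, while scaling $N_{\textrm{source}}$ and $N_{\textrm{pool}}$ together keeps the expected overlap constant and changes only the per-source capacity.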

Taken together, our results demonstrate that NULLs robustly enables unlearning across scales\. We find that the size of the sink pool primarily determines how much knowledge is learned, rather than whether the knowledge can be unlearned\.

### 4\.4 NULLs Preserves General Language Capability and Knowledge Capacity

A natural concern is that source\-level isolation may come at the cost of general performance\. We compare NULLs against a standard transformer along two dimensions: knowledge capacity and general language capability\. Figure [8\(c\)](https://arxiv.org/html/2606.13873#S4.F8.sf3) shows that NULLs matches the standard transformer’s average truth ratio on a random sample of article\-specific Cloze prompts from Wikipedia, indicating that NULLs does not reduce knowledge capacity\. Across four benchmarks \(ARC\-E, Winogrande, PIQA, SciQ\), NULLs matches the standard transformer within one standard deviation \(Table [1](https://arxiv.org/html/2606.13873#S4.T1)\), confirming that NULLs does not reduce general language capability\.

Table 1: NULLs preserves general language capability\. Downstream benchmark accuracy \($\pm$ standard deviation\) for NULLs and a parameter\-matched standard transformer\. NULLs matches the standard baseline on average across the four benchmarks we test\.

## 5 Discussion

NULLs demonstrates that disentangling source contributions and learning jointly across them need not be at odds\. Simply by jointly training a set of shared backbone neurons alongside source\-specific sinks, NULLs isolates each source’s information while the backbone learns across all sources\. This isolation enables reliable unlearning downstream\. Our results suggest treating unlearnability as a property to design into a model during training, not a behavior to recover from it afterward\. We now discuss the design choices and open questions this raises\.

Defining Sources\. NULLs is agnostic to how sources are defined, but the definition chosen at training time determines the unlearning operations that are ultimately supported by the model\. As a result, source definition represents a crucial design choice that must take the expected unlearning use\-cases into account\. For instance, copyright compliance might require defining sources by publisher or author, while privacy requests may need document\-level resolution\. Across the pre\-training corpus, different domains or subsets of data may require different source definitions\. How best to define sources for a given deployment remains an open question\.

Post\-Training with NULLs\. We have focused on pre\-training NULLs models from scratch, but strong models also depend on post\-training\. One important question is how a NULLs model can be post\-trained while preserving the ability to unlearn arbitrary pre\-training data sources\. This could be achieved, for example, by designing regularizers that encourage the model to preserve the existing source localization\. A second important area for future work is extending native unlearnability to post\-training data sources \(even if the base model is not a NULLs\)\. This could enable model developers to fine\-tune on user data while mitigating privacy concerns\.

Attribution and Data Curation\. Source disentanglement also makes each source’s contribution easier to measure\. Because disabling a source’s sink approximates retraining without that source, the effect of any individual source can be estimated by comparing the model with the source’s sink active and inactive, rather than through retraining\. Such comparisons could identify redundant or low\-value sources to inform data curation, and could help developers account for the provenance of model behavior\. Establishing whether NULLs yields reliable attribution and data valuation, and how such measurements should guide corpus construction, is a promising direction\.
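The sink-active versus sink-inactive comparison described above could be computed as follows. This is a hedged sketch of one possible measurement, not a procedure evaluated in this work: the function `sink_toggle_effect`, the source names, and every loss value are made up for illustration, and the per-example losses are assumed to come from two forward passes over the same evaluation set.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def sink_toggle_effect(losses_sink_on, losses_sink_off):
    """Mean per-example loss increase when a source's sink is disabled."""
    if len(losses_sink_on) != len(losses_sink_off):
        raise ValueError("both runs must cover the same evaluation examples")
    return mean(off - on for on, off in zip(losses_sink_on, losses_sink_off))


# Made-up per-example losses (sink on, sink off); not results from the paper.
evaluations = {
    "source_a": ([2.0, 2.5, 3.0], [3.0, 3.5, 4.0]),
    "source_b": ([2.0, 2.5, 3.0], [2.0, 2.5, 3.5]),
}

effects = {
    name: sink_toggle_effect(on, off) for name, (on, off) in evaluations.items()
}

# Sources ordered from largest to smallest estimated contribution.
ranked = sorted(effects, key=effects.get, reverse=True)
```

A source whose effect stays near zero would be a candidate for the redundant or low-value category mentioned above, subject to the open question of whether such estimates are reliable.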

Limitations\. Our experiments use 1B\-parameter models, and we do not test substantially larger ones\. The unlearning requests we examine align with the sources defined at pre\-training time\. Handling requests that do not match a predefined source is an open problem\. Finally, we evaluate only two source definitions, individual documents and topic\-linked clusters, and leave other choices, especially those tied to real\-world takedown requests, to future work\.

## Ethics Statement

AI Usage\. We did not use AI to plan or design the experiments in this work\. However, we did use AI tools for coding and Claude for writing assistance\.

## Acknowledgements

The authors are grateful to members of the CMU FORUM lab for discussion and feedback on this project, particularly Jacob Springer, Christina Baek, Ziqian Zhong, Lawrence Feng, and Sashwat Saxena\. In addition, we would like to acknowledge Fahim Tajwar, Aakash Lahoti, Kevin Li, Abitha Thankaraj, and Sachin Goyal for valuable insights and feedback\. We acknowledge the CMU FLAME center for providing compute allocations for this project\. We gratefully acknowledge support from Jane Street, Apple, the National Science Foundation, and the Sloan Foundation\.

## References

## Appendix A

Table 2: Training hyperparameters for all NULLs experiments\.

Table 3: NULLs\-specific architecture hyperparameters for the Wikipedia setting\.

### A\.1 Additional Harry Potter Generation Results

We provide additional generation results for the Harry Potter setting in Table [4](https://arxiv.org/html/2606.13873#A1.T4)\.

### A\.2 Practical Deployment\-Time Considerations

Source embeddings used for nearest\-sink routing are computed as mean\-pooled token representations from each source’s training text\. At inference time, the model embeds the input context and activates the sink whose source embedding has the highest cosine similarity\. Prompts that span multiple sources or fall between source boundaries will be routed to whichever single source is closest in embedding space\. Handling multi\-source queries is left to future work\.
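The routing rule described above can be sketched in a few lines. The example is illustrative only: the two-dimensional vectors, the source names, and the `excluded` argument are assumptions made here, with `excluded` modelling one plausible way that an unlearned source falls back to the next-closest sink.

```python
import math


def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into a single embedding."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def route(context_token_vectors, source_embeddings, excluded=frozenset()):
    """Return the source whose embedding is most similar to the context.

    Sources listed in `excluded` (for example, unlearned ones) are skipped,
    so the query falls through to the next-closest remaining source.
    """
    query = mean_pool(context_token_vectors)
    candidates = {s: e for s, e in source_embeddings.items() if s not in excluded}
    return max(candidates, key=lambda s: cosine(query, candidates[s]))


# Toy source embeddings (hypothetical values).
source_embeddings = {
    "harry_potter": [1.0, 0.0],
    "city_council": [0.8, 0.6],
    "cooking": [0.0, 1.0],
}
context = [[0.9, 0.1], [1.0, 0.0]]  # token representations of the prompt

routed = route(context, source_embeddings)
routed_after_unlearning = route(context, source_embeddings, excluded={"harry_potter"})
```

Here `routed` is `"harry_potter"` and `routed_after_unlearning` is `"city_council"`. Because `max` returns a single source, the sketch shares the limitation noted above: a prompt that spans several sources is still assigned to only one of them.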

Table 4: Comparison of generations with and without the Harry Potter sink activated\.

### A\.3 Hyperparameters

We provide the standard hyperparameter choices for experiments in Table [2](https://arxiv.org/html/2606.13873#A1.T2) and the NULLs\-specific hyperparameters for the Wikipedia setting in Table [3](https://arxiv.org/html/2606.13873#A1.T3)\.
