Model Unlearning Objectives Vary for Distinct Language Functions

arXiv cs.CL 05/27/26, 04:00 AM Papers
unlearning llm-safety knowledge-removal toxicity dangerous-knowledge meta-learning probe
Summary
The paper argues that unlearning in LLMs should be goal-dependent, proposing a cosine-based meta-learned variant of RMU for dangerous knowledge and a multi-layer objective with probe directions for toxicity, achieving strong results across four 7-8B models.
arXiv:2605.26454v1 Announce Type: new Abstract: Large language models (LLMs) learn undesirable properties during pretraining, including dangerous knowledge and toxic text generation. Just as post-training uses different objectives to shape different behaviors, we argue that unlearning methods should be designed for the language function at issue. To study this, we consider two mechanistically distinct unlearning goals, dangerous-knowledge unlearning and toxicity unlearning. For dangerous knowledge, we introduce a cosine-based, meta-learned variant of RMU. For toxicity, we propose a multi-layer objective based on layer-specific probe directions. Across four open-source 7-8B models, our methods achieve strong results, based on distinct training objectives for the two types of unlearning. Overall, our results suggest that unlearning should be studied as a family of problems, analogous to the multiple types of LLM post-training.
Cached at: 05/27/26, 09:06 AM
# Model Unlearning Objectives Vary for Distinct Language Functions
Source: [https://arxiv.org/html/2605.26454](https://arxiv.org/html/2605.26454)
Berk Atil¹, Vipul Gupta², Rebecca J\. Passonneau¹

¹Pennsylvania State University, ²Scale AI

bka5352@psu\.edu

###### Abstract

Large language models \(LLMs\) learn undesirable properties during pretraining, including dangerous knowledge and toxic text generation\. Just as post\-training uses different objectives to shape different behaviors, we argue that unlearning methods should be designed for the language function at issue\. To study this, we consider two mechanistically distinct unlearning goals, dangerous\-knowledge unlearning and toxicity unlearning\. For dangerous knowledge, we introduce a cosine\-based, meta\-learned variant of RMU\. For toxicity, we propose a multi\-layer objective based on layer\-specific probe directions\. Across four open\-source 7\-8B models, our methods achieve strong results, based on distinct training objectives for the two types of unlearning\. Overall, our results suggest that unlearning should be studied as a family of problems, analogous to the multiple types of LLM post\-training\.


## 1 Introduction

Language serves a wide range of functions in human society: it is used not only to convey information, but also to coordinate action, manage social relationships, and express attitudes, intentions, and norms\. Large language models \(LLMs\) acquire great linguistic fluency through large\-scale pretraining, and post\-training methods are then used to shape behaviors such as helpfulness, instruction\-following, and safer response behavior Ouyang et al\. \([2022](https://arxiv.org/html/2605.26454#bib.bib14)\); Rafailov et al\. \([2023](https://arxiv.org/html/2605.26454#bib.bib15)\); Du et al\. \([2025](https://arxiv.org/html/2605.26454#bib.bib34)\)\. However, these methods do not cover the full range of communicative abilities and norms that human language use fulfills\. Models learn hazardous knowledge and socially harmful behaviors, such as toxic language generation, from their training data Brown et al\. \([2020](https://arxiv.org/html/2605.26454#bib.bib13)\); Gehman et al\. \([2020](https://arxiv.org/html/2605.26454#bib.bib25)\); Li et al\. \([2024](https://arxiv.org/html/2605.26454#bib.bib18)\)\. This has motivated growing interest in *unlearning*: post hoc interventions that aim to remove specific knowledge, capabilities, or behaviors from already\-trained models through finetuning Cao and Yang \([2015](https://arxiv.org/html/2605.26454#bib.bib1)\); Bourtoule et al\. \([2021](https://arxiv.org/html/2605.26454#bib.bib5)\); Liu et al\. \([2024](https://arxiv.org/html/2605.26454#bib.bib16)\); Maini et al\. \([2024](https://arxiv.org/html/2605.26454#bib.bib17)\)\.

Our central claim is that, similar to post\-training itself, *unlearning is goal\-dependent*\. Modern LLM pipelines have distinct post\-training procedures for a range of distinct functions: instruction following, preference alignment to human values, refusal behavior, and style control require different objectives Ouyang et al\. \([2022](https://arxiv.org/html/2605.26454#bib.bib14)\); Rafailov et al\. \([2023](https://arxiv.org/html/2605.26454#bib.bib15)\); Du et al\. \([2025](https://arxiv.org/html/2605.26454#bib.bib34)\)\. Mechanistic work further suggests that some of these properties are altered more substantially by post\-training than others\. In particular, Du et al\. \([2025](https://arxiv.org/html/2605.26454#bib.bib34)\) show that factual knowledge storage locations remain largely stable across base and post\-trained models, and that a truthfulness direction is also highly similar across the two, whereas refusal directions change substantially after SFT and instruction tuning\. We argue that unlearning should take such mechanistic observations into account\. Removing dangerous knowledge is not the same problem as removing socially undesirable behavior, because these functions are represented differently inside the model\.

We study two unlearning problems that mechanistic evidence suggests could be different: removal of *dangerous knowledge* versus *toxic language*\. Dangerous knowledge concerns the model’s ability to access factual or procedural information, as in biosecurity\-oriented settings such as WMDP Li et al\. \([2024](https://arxiv.org/html/2605.26454#bib.bib18)\)\. Toxicity, by contrast, concerns the tendency to generate abusive or harmful language Gehman et al\. \([2020](https://arxiv.org/html/2605.26454#bib.bib25)\); Hartvigsen et al\. \([2022](https://arxiv.org/html/2605.26454#bib.bib26)\)\. Existing work Kadhe et al\. \([2024](https://arxiv.org/html/2605.26454#bib.bib12)\) has treated both as instances of the same unlearning problem, but our results suggest otherwise, a conclusion supported by recent mechanistic studies\. For factual knowledge, prior work points to a relatively structured retrieval mechanism Meng et al\. \([2022](https://arxiv.org/html/2605.26454#bib.bib27)\)\. Knowledge relevant to a statement is concentrated at subject, object, and last\-token positions, with subject information strongest in earlier layers, object information in early\-to\-middle layers, and the last token becoming especially important in middle\-to\-late layers Meng et al\. \([2022](https://arxiv.org/html/2605.26454#bib.bib27)\); Geva et al\. \([2023](https://arxiv.org/html/2605.26454#bib.bib28)\); Du et al\. \([2025](https://arxiv.org/html/2605.26454#bib.bib34)\)\. Du et al\. \([2025](https://arxiv.org/html/2605.26454#bib.bib34)\) further show that post\-training largely preserves these knowledge\-storage locations\. At the same time, current hazardous\-knowledge unlearning methods remain limited: they can be shallow or recoverable, suggesting that simply steering away from undesirable behavior might not be enough Hu et al\. \([2024](https://arxiv.org/html/2605.26454#bib.bib21)\); Deeb and Roger \([2024](https://arxiv.org/html/2605.26454#bib.bib22)\); Dang et al\. \([2024](https://arxiv.org/html/2605.26454#bib.bib23), [2025](https://arxiv.org/html/2605.26454#bib.bib24)\)\.

The mechanisms underlying toxicity in LLMs differ from those for factual knowledge\. Lee et al\. \([2024](https://arxiv.org/html/2605.26454#bib.bib29)\) train a linear toxicity probe on averaged final\-layer representations and identify value vectors aligned with a toxicity direction\. They show that toxicity is largely elicited in later MLP layers, and that subtracting those vectors can reduce toxic outputs\. Crucially, after DPO, the toxic vectors largely remain; instead, small accumulated changes in the model alter activations so that the model bypasses toxicity\-promoting regions rather than erasing the underlying capability Lee et al\. \([2024](https://arxiv.org/html/2605.26454#bib.bib29)\)\. These findings suggest complementary roles for alignment and unlearning: alignment can reduce toxic behavior, while unlearning can weaken the underlying toxicity\-relevant directions\.

Motivated by this difference, we argue that effective unlearning requires a better understanding of *what is being forgotten*\. At a high level, unlearning methods aim to reduce undesirable behavior while preserving desirable behavior, which is reflected in the standard combination of a forget objective and a retain objective used across prior work Li et al\. \([2024](https://arxiv.org/html/2605.26454#bib.bib18)\); Huu\-Tien et al\. \([2024](https://arxiv.org/html/2605.26454#bib.bib35)\); Zhang et al\. \([2024](https://arxiv.org/html/2605.26454#bib.bib19)\)\. We claim, however, that the formulation of the forget objective should depend on the unlearning goal\. We introduce a new unlearning method for dangerous knowledge that replaces RMU’s L2 objective Li et al\. \([2024](https://arxiv.org/html/2605.26454#bib.bib18)\) with a cosine objective, and adaptively learns the retain\-forget tradeoff using reinforcement learning\. For toxicity unlearning, we propose a forget loss that targets toxicity\-relevant signals across multiple layers\. For assessment, we introduce a unified evaluation metric that summarizes the tradeoff between forgetting and retention\.

## 2 Related Work

In this section, we review prior work on LLM unlearning and on how LLMs store undesirable knowledge and toxicity\.

### 2\.1 LLM Unlearning

Unlearning was originally framed as removing the influence of specific training data without full retraining Cao and Yang \([2015](https://arxiv.org/html/2605.26454#bib.bib1)\); Bourtoule et al\. \([2021](https://arxiv.org/html/2605.26454#bib.bib5)\)\. In LLMs, exact retraining\-based guarantees are usually infeasible, so recent work instead relies on benchmark\-specific forgetting metrics Liu et al\. \([2024](https://arxiv.org/html/2605.26454#bib.bib16)\)\. Benchmarks such as TOFU Maini et al\. \([2024](https://arxiv.org/html/2605.26454#bib.bib17)\) and WMDP Li et al\. \([2024](https://arxiv.org/html/2605.26454#bib.bib18)\) have become standard datasets for fictitious\-profile forgetting and hazardous\-knowledge unlearning, respectively\. Existing methods include gradient\-based forget objectives, preference\-based methods such as NPO Zhang et al\. \([2024](https://arxiv.org/html/2605.26454#bib.bib19)\), and representation\-level steering approaches such as RMU Li et al\. \([2024](https://arxiv.org/html/2605.26454#bib.bib18)\) or Spunge Kadhe et al\. \([2024](https://arxiv.org/html/2605.26454#bib.bib12)\)\.

### 2\.2 Unlearning Dangerous Knowledge

A major line of work studies unlearning of dangerous factual or procedural knowledge, especially through WMDP\-style evaluations Li et al\. \([2024](https://arxiv.org/html/2605.26454#bib.bib18)\)\. However, unlearned knowledge can often be recovered through targeted relearning, substantial information may remain in model weights, and steering\-based methods can reduce robustness or induce nonsensical behavior rather than cleanly removing the target capability Hu et al\. \([2024](https://arxiv.org/html/2605.26454#bib.bib21)\); Deeb and Roger \([2024](https://arxiv.org/html/2605.26454#bib.bib22)\); Dang et al\. \([2024](https://arxiv.org/html/2605.26454#bib.bib23), [2025](https://arxiv.org/html/2605.26454#bib.bib24)\)\. Our method aims to provide a more principled objective for hazardous\-knowledge unlearning\.

Mechanistic work helps clarify the problem of unlearning dangerous knowledge\. Factual recall depends strongly on subject, object, and last\-token positions across different layer ranges Meng et al\. \([2022](https://arxiv.org/html/2605.26454#bib.bib27)\); Geva et al\. \([2023](https://arxiv.org/html/2605.26454#bib.bib28)\); Du et al\. \([2025](https://arxiv.org/html/2605.26454#bib.bib34)\)\. Du et al\. \([2025](https://arxiv.org/html/2605.26454#bib.bib34)\) further show that post\-training largely preserves these knowledge\-storage locations, suggesting that factual competence is anchored in relatively stable internal structure\. Zou et al\. \([2023](https://arxiv.org/html/2605.26454#bib.bib33)\) found that representation\-level interventions offer more control over safety\-relevant capabilities\. This motivates our representation\-level approach, which aims to weaken dangerous factual competence while preserving general utility by directly targeting representations that support factual recall\.

### 2\.3 Toxicity Behavior

A separate line of work studies harmful language generation, including toxicity and abusive behavior Gehman et al\. \([2020](https://arxiv.org/html/2605.26454#bib.bib25)\); Hartvigsen et al\. \([2022](https://arxiv.org/html/2605.26454#bib.bib26)\)\. Mechanistic evidence suggests that toxicity is represented differently from factual knowledge\. Lee et al\. \([2024](https://arxiv.org/html/2605.26454#bib.bib29)\) show that a toxicity direction can be extracted from final\-layer hidden states and that later\-layer value vectors aligned with this direction can modulate toxic outputs\. Yet DPO apparently does not remove these toxic vectors; instead, it introduces distributed offsets that bypass toxicity\-eliciting regions Lee et al\. \([2024](https://arxiv.org/html/2605.26454#bib.bib29)\)\. More generally, this suggests that alignment may reduce toxic behavior without fully removing the underlying toxic capability\. Du et al\. \([2025](https://arxiv.org/html/2605.26454#bib.bib34)\) similarly show that refusal directions shift substantially across post\-training, unlike truthfulness directions, reinforcing the view that safety behaviors are more post\-training\-dependent and less structurally stable than factual knowledge\.

When we tried an RMU\-like approach for toxicity, the results were poor\. Because toxicity is distributed across layers, we instead propose a multi\-layer toxicity unlearning objective\.

$$L = L_{\text{F}} + \alpha \cdot L_{\text{R}} \tag{1}$$

$$L_{\text{F}} = \mathbb{E}_{x_f \sim D_{\text{F}}}\left[\sum_{t \in x_f} \left\| M_{\text{updated}}(t) - c \cdot \mathbf{u} \right\|_2^2\right] \tag{2}$$

$$L_{\text{R}} = \mathbb{E}_{x_r \sim D_{\text{R}}}\left[\sum_{t \in x_r} \left\| M_{\text{updated}}(t) - M_{\text{frozen}}(t) \right\|_2^2\right] \tag{3}$$

## 3 Baselines for Unlearning

In this section, we review the main baseline approaches that we build upon\.

### 3\.1 RMU

Representation Misdirection for Unlearning \(RMU\) is a fine\-tuning approach aimed at selectively removing unwanted knowledge from LLMs Li et al\. \([2024](https://arxiv.org/html/2605.26454#bib.bib18)\)\. RMU operates on two datasets: *forget data*, containing the target knowledge or behavior to be unlearned, and *retain data*, containing general examples used to preserve the model’s desirable capabilities\. It pushes the model’s representations on forget data toward randomly initialized vectors, while also encouraging representations on retain data to remain close to those of the original frozen model\. The overall objective is shown in Equation [1](https://arxiv.org/html/2605.26454#S2.E1)\. The forget loss is shown in Equation [2](https://arxiv.org/html/2605.26454#S2.E2), where $\mathbf{u}$ is a random unit vector and $c$ is a scaling factor\. Similarly, $L_{\text{R}}$ is the retain loss, taken as an expectation over the entire retain dataset, as shown in Equation [3](https://arxiv.org/html/2605.26454#S2.E3)\.
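As a concrete reading of Equations 1 to 3, the following is a minimal sketch in plain Python. It assumes activations are available as lists of per-token vectors for one forget sequence and one retain sequence; the function names (`rmu_loss`, `random_unit_vector`) and the toy numbers are illustrative and are not taken from the RMU implementation. The expectation over each dataset would correspond to averaging these per-sequence sums.

```python
import math
import random

def sq_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def random_unit_vector(dim, rng):
    """Sample a direction u with unit L2 norm (fixed before unlearning)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def rmu_loss(forget_updated, retain_updated, retain_frozen, u, c, alpha):
    """Equations 1-3 for a single forget sequence and a single retain sequence.

    Each activation argument is a list of per-token vectors.
    Returns (total, forget_term, retain_term).
    """
    target = [c * x for x in u]  # the scaled random direction c * u
    l_forget = sum(sq_l2(h, target) for h in forget_updated)
    l_retain = sum(sq_l2(h, g) for h, g in zip(retain_updated, retain_frozen))
    return l_forget + alpha * l_retain, l_forget, l_retain

# Toy example (hypothetical numbers): the forget activation already equals c * u,
# so only the retain term contributes.
total, l_f, l_r = rmu_loss(
    forget_updated=[[2.0, 0.0]],
    retain_updated=[[1.0, 1.0]],
    retain_frozen=[[1.0, 0.0]],
    u=[1.0, 0.0], c=2.0, alpha=3.0,
)
```

In this toy case the forget term is 0, the retain term is 1, and the total is 3, which makes the role of the coefficient on the retain term easy to see.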

### 3\.2 AdapRMU

Huu\-Tien et al\. \([2024](https://arxiv.org/html/2605.26454#bib.bib35)\) focus on the scaling coefficient $c$ in Equation [2](https://arxiv.org/html/2605.26454#S2.E2)\. While the direction $\mathbf{u}$ is fixed before unlearning, $c$ determines the magnitude of the representation shift\. AdapRMU adaptively adjusts $c$ based on the norm of the forget representation\.
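The text above only states that the coefficient depends on the norm of the forget representation. One plausible reading, sketched below purely as an assumption, scales the mean activation norm by a constant. The helper `adaptive_coefficient`, the multiplier `beta`, and the choice of averaging over tokens are all hypothetical; whether the norm is taken from the frozen or the updated model is also not specified here.

```python
import math

def l2_norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))

def adaptive_coefficient(forget_activations, beta=5.0):
    """Hypothetical rule: c = beta * (mean L2 norm of forget-token activations).

    `beta` is an assumed hyperparameter, not a value taken from the paper.
    """
    norms = [l2_norm(h) for h in forget_activations]
    return beta * sum(norms) / len(norms)

# Toy example: both activations have norm 5, so c = 2 * 5 = 10.
c = adaptive_coefficient([[3.0, 4.0], [0.0, 5.0]], beta=2.0)
```

The point of the sketch is only that the target magnitude tracks the scale of the activations instead of being a fixed constant.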

## 4 Methodology

In this section, we present our modifications to RMU\. We make two changes for dangerous\-knowledge unlearning and then introduce a separate objective for toxicity\. We also present our metric to reflect the tradeoff between unlearning and general capability\.

$$\mathcal{L}_{\text{F}} = \mathbb{E}_{x_f \sim D_{\text{F}}}\left[\sum_{t \in x_f}\left(1 - \frac{M_{\text{updated}}(t) \cdot (c \cdot \mathbf{u})}{\|M_{\text{updated}}(t)\|_2 \, \|c \cdot \mathbf{u}\|_2}\right)\right] \tag{4}$$

$$\mathcal{L}_{\text{R}} = \mathbb{E}_{x_r \sim D_{\text{R}}}\left[\sum_{t \in x_r}\left(1 - \frac{M_{\text{updated}}(t) \cdot M_{\text{frozen}}(t)}{\|M_{\text{updated}}(t)\|_2 \, \|M_{\text{frozen}}(t)\|_2}\right)\right] \tag{5}$$

### 4\.1 Cosine Loss Instead of L2 Loss

We replace the original L2 losses in Equations [2](https://arxiv.org/html/2605.26454#S2.E2) and [3](https://arxiv.org/html/2605.26454#S2.E3) with cosine\-based losses shown in Equations [4](https://arxiv.org/html/2605.26454#S4.E4) and [5](https://arxiv.org/html/2605.26454#S4.E5)\. Our hypothesis is that the key quantity for both forgetting and retention is representational *direction* rather than exact Euclidean position\. For forgetting, cosine loss should work better because it directly optimizes alignment between the updated representation and the target direction, instead of also penalizing norm differences\. For retention, cosine loss better preserves the frozen model’s representational orientation on retain data while tolerating harmless changes in magnitude induced by fine\-tuning\. In this sense, cosine\-based losses are a better match to our intended geometry: forgetting should steer representations toward or away from a direction, and retention should preserve directional structure rather than exact scale\. Prior work similarly notes that L2 distance can be mismatched when the task is fundamentally angular, while cosine\-based objectives can be more robust to norm variation Xu et al\. \([2018](https://arxiv.org/html/2605.26454#bib.bib7)\)\.
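The difference between the two geometries can be illustrated with a small sketch in plain Python. It assumes per-token activation vectors are given as lists; the function names and toy vectors are illustrative only. One consequence worth noting is that, for any positive $c$, the scale of the target cancels in the cosine forget loss, so only the direction $\mathbf{u}$ matters.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_forget_loss(forget_updated, u, c):
    """Equation 4 for one sequence: sum over tokens of 1 - cos(h_t, c * u)."""
    target = [c * x for x in u]
    return sum(1.0 - cosine_similarity(h, target) for h in forget_updated)

def cosine_retain_loss(retain_updated, retain_frozen):
    """Equation 5 for one sequence: sum over tokens of 1 - cos(updated, frozen)."""
    return sum(1.0 - cosine_similarity(h, g)
               for h, g in zip(retain_updated, retain_frozen))

def l2_forget_loss(forget_updated, u, c):
    """Equation 2 for one sequence, kept here for comparison."""
    target = [c * x for x in u]
    return sum(sum((x - y) ** 2 for x, y in zip(h, target)) for h in forget_updated)

u = [1.0, 0.0]
h_small = [[1.0, 1.0]]
h_large = [[10.0, 10.0]]  # same direction as h_small, ten times the norm

cos_small = cosine_forget_loss(h_small, u, c=1.0)
cos_large = cosine_forget_loss(h_large, u, c=300.0)  # different norm and different c
l2_small = l2_forget_loss(h_small, u, c=1.0)
l2_large = l2_forget_loss(h_large, u, c=1.0)
```

Here `cos_small` and `cos_large` agree up to floating-point error, while `l2_small` and `l2_large` differ substantially even though both activations point in the same direction.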

### 4\.2 Meta\-Learning the Coefficient $\alpha$

In Equation [1](https://arxiv.org/html/2605.26454#S2.E1), the coefficient $\alpha$ balances the forget and retain objectives\. In RMU and AdapRMU, $\alpha$ is fixed throughout training\. We found this to be a limitation, because we observed the best value of $\alpha$ to vary substantially across models and datasets\. To address this, we treat $\alpha$ as a learnable parameter and update it during fine\-tuning using REINFORCE Williams \([1992](https://arxiv.org/html/2605.26454#bib.bib11)\)\. At each step, the policy update adjusts $\alpha$ toward values that improve the overall objective, allowing the model to adaptively balance forgetting and retention\. This removes the need for extensive manual tuning and leads to more consistent performance across models\.

![Refer to caption](https://arxiv.org/html/2605.26454v1/Figures/probe_similarity_heatmap_pretty.png)

Figure 1: Cosine similarities between logistic\-regression probes trained at different layers of Llama3\.1\-8B\.

![Refer to caption](https://arxiv.org/html/2605.26454v1/Figures/probe_weight_hist_layer11.png)

Figure 2: Weight distribution of the logistic\-regression probe trained on layer 11 of Llama3\.1\-8B for toxicity classification\.

![Refer to caption](https://arxiv.org/html/2605.26454v1/Figures/mistral7b_danger_know.png)\(a\) Mistral\-7B
![Refer to caption](https://arxiv.org/html/2605.26454v1/Figures/llama318b_danger_know.png)\(b\) Llama3\.1\-8B
![Refer to caption](https://arxiv.org/html/2605.26454v1/Figures/qwen257b_danger_know.png)\(c\) Qwen2\.5\-7B
![Refer to caption](https://arxiv.org/html/2605.26454v1/Figures/olmo37b_danger_know.png)\(d\) Olmo3\-7B

Figure 3: Dangerous knowledge unlearning results for the four models\.

In more detail, REINFORCE evaluates how the $\alpha$ value affects the total loss, then adjusts $\alpha$ in the direction that lowers the loss most effectively\. In this way, the model learns the best possible balance between forgetting unwanted information and preserving useful knowledge\. The result is an $\alpha$ value that adapts automatically to different models, thus yielding more consistent improvement\.
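The description of the REINFORCE update leaves the policy, reward, and baseline unspecified, so the following is only a sketch of one way a score-function update on a scalar coefficient could look. It assumes a Gaussian policy over $\log \alpha$ with fixed standard deviation, a running-mean baseline, and a user-supplied `reward_fn`; all of these, along with `toy_reward` and its peak at 5, are hypothetical choices for illustration and not the authors' procedure.

```python
import math
import random

def reinforce_alpha(reward_fn, steps=3000, lr=0.01, sigma=0.3, seed=0):
    """Tune alpha with a score-function (REINFORCE) update.

    Assumed policy: log(alpha) ~ Normal(mu, sigma^2) with fixed sigma.
    reward_fn(alpha) returns a scalar; higher is better.
    """
    rng = random.Random(seed)
    mu = 0.0        # policy mean in log space, so alpha starts near 1
    baseline = 0.0  # running mean of rewards, used to reduce variance
    for _ in range(steps):
        z = rng.gauss(mu, sigma)          # sample log(alpha) from the policy
        reward = reward_fn(math.exp(z))
        # Gradient of log N(z; mu, sigma^2) with respect to mu.
        score = (z - mu) / (sigma ** 2)
        mu += lr * (reward - baseline) * score
        baseline = 0.9 * baseline + 0.1 * reward
    return math.exp(mu)

def toy_reward(alpha, best=5.0):
    """Hypothetical reward that peaks when alpha equals `best`."""
    return -(math.log(alpha) - math.log(best)) ** 2

alpha = reinforce_alpha(toy_reward)
```

On this toy reward the learned value settles near the peak. In an actual unlearning run the reward would have to be derived from the training signal, for example from changes in the forget and retain losses, and that choice is left open here.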

![Refer to caption](https://arxiv.org/html/2605.26454v1/Figures/mistral7b_tox.png)\(a\) Mistral\-7B
![Refer to caption](https://arxiv.org/html/2605.26454v1/Figures/llama318b_tox.png)\(b\) Llama3\.1\-8B
![Refer to caption](https://arxiv.org/html/2605.26454v1/Figures/qwen257b_tox.png)\(c\) Qwen2\.5\-7B
![Refer to caption](https://arxiv.org/html/2605.26454v1/Figures/olmo37b_tox.png)\(d\) Olmo3\-7B

Figure 4: Toxicity unlearning results for the four models\.

### 4\.3 Toxicity Unlearning

We initially attempted to reuse the objective designed for hazardous knowledge, but found it ineffective for toxicity: it yields only a 2\-3% decrease in toxicity, compared to the 6\-24% decrease we obtain with the toxicity\-specific objective described in this section\.

Building on the mechanistic findings in Section [2\.3](https://arxiv.org/html/2605.26454#S2.SS3), we directly target the empirical toxicity direction in representation space\. We operationalize this by training L2\-regularized logistic regression probes on last\-token hidden representations from a frozen reference model at multiple layers, obtaining a toxicity direction $LR^{i}$ for each target layer $i \in I$\. We use last\-token representations rather than mean pooling because averaging suppresses useful information, which is borne out in our experiments showing that last\-token probes consistently achieve higher AUC\. We also find that cross\-layer cosine similarities between probe directions are often low, indicating that the toxicity direction rotates substantially across depth\. This suggests that a single shared direction is not sufficient to characterize toxicity across the network\. We therefore train layer\-specific probes and intervene at multiple layers at dispersed depths\. In Section [6\.5](https://arxiv.org/html/2605.26454#S6.SS5), we report an ablation over different combinations of layers that verifies the utility of both multiple layers and dispersed depths\.
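To make the probe step concrete, here is a minimal sketch of fitting one L2-regularized logistic-regression probe per layer, written in plain Python on synthetic data. The data generator `toy_layer`, the layer indices 3 and 17, and all hyperparameters are assumptions for illustration; real probes would be fitted on last-token hidden states extracted from the frozen model, most likely with an off-the-shelf solver.

```python
import math
import random

def train_probe(features, labels, l2=0.01, lr=0.5, epochs=200):
    """L2-regularized logistic regression fitted by full-batch gradient descent.

    Returns (weights, bias); the weight vector plays the role of the probe direction.
    """
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [l2 * wj for wj in w], 0.0  # L2 penalty gradient
        for x, y in zip(features, labels):
            logit = sum(wj * xj for wj, xj in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-logit))
            for j in range(dim):
                grad_w[j] += (p - y) * x[j] / n
            grad_b += (p - y) / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def unit(v):
    """Rescale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def toy_layer(signal_axis, n=200, dim=4, seed=0):
    """Synthetic stand-in for last-token hidden states at one layer.

    The binary label shifts a single coordinate, so each toy layer
    encodes the label along a different axis.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for k in range(n):
        y = k % 2
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[signal_axis] += 2.0 if y == 1 else -2.0
        xs.append(x)
        ys.append(y)
    return xs, ys

# One probe direction per (hypothetical) target layer.
directions = {}
for layer, axis in {3: 0, 17: 1}.items():
    xs, ys = toy_layer(axis, seed=layer)
    weights, _ = train_probe(xs, ys)
    directions[layer] = unit(weights)
```

Because the two toy layers carry the label along different axes, the two learned directions are nearly orthogonal, which mirrors the low cross-layer similarity described above in a deliberately exaggerated form.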

$$\mathcal{L}_{\text{F}} = \mathbb{E}_{x_t \sim D_{\text{toxic}}}\left[\sum_{i \in I}\left(M^{i}_{\text{updated}}(x_t[-1]) \cdot LR^{i}\right)^{2}\right] \tag{6}$$

Our forget loss, shown in Equation [6](https://arxiv.org/html/2605.26454#S4.E6), encourages the updated model’s representations of toxic inputs to become orthogonal to the toxicity direction at each layer\. Squaring the dot product is important for three reasons\. First, it ensures a non\-negative loss that is minimized when the representation is orthogonal to the toxicity direction\. Second, it makes the loss symmetric around zero, so projections of either sign are penalized equally\. Third, it gives larger gradients to representations that are strongly aligned with the toxicity direction\. Summing over the layers in $I$ ensures that the toxic subspace is suppressed at every targeted depth, reducing the chance that toxicity suppressed in one layer re\-emerges later in the network\.
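The behavior of the squared projection can be checked on a toy case. The sketch below evaluates Equation 6 for a single input and takes one gradient step directly on the hidden states. This is a simplification made for illustration: in actual fine-tuning the gradient would flow into model parameters, not into the activations themselves. The layer indices, vectors, and step size are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def toxicity_forget_loss(states, probes):
    """Equation 6 for one toxic input: sum over layers i of (h_i . LR_i)^2.

    `states` and `probes` are dicts keyed by layer index.
    """
    return sum(dot(states[i], probes[i]) ** 2 for i in probes)

def grad_wrt_state(h, w):
    """Gradient of (h . w)^2 with respect to h, which is 2 (h . w) w."""
    scale = 2.0 * dot(h, w)
    return [scale * wj for wj in w]

probes = {5: [1.0, 0.0], 20: [0.0, 1.0]}    # toy unit-norm probe directions
states = {5: [3.0, 4.0], 20: [1.0, -2.0]}   # toy last-token hidden states

loss_before = toxicity_forget_loss(states, probes)

# One gradient step of size 0.25 on each layer's state.
stepped = {
    i: [hj - 0.25 * gj for hj, gj in zip(states[i], grad_wrt_state(states[i], probes[i]))]
    for i in probes
}
loss_after = toxicity_forget_loss(stepped, probes)
```

The step halves each projection regardless of its sign and leaves the component orthogonal to the probe untouched, and the layer with the larger projection receives the larger gradient.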

To preserve general language\-modeling ability, we use the same retain loss as in dangerous\-knowledge unlearning\. The full objective is still given by Equation [1](https://arxiv.org/html/2605.26454#S2.E1), with $\alpha$ adapted using the meta\-learning procedure described above\.

### 4\.4 S\-unlearning: A New Unlearning Metric

To summarize the tradeoff between forgetting and retained utility, we introduce S\-unlearning\. Let $U \in [0,1]$ denote unlearning performance and $R \in [0,1]$ retained utility\. For dangerous knowledge, lower WMDP accuracy indicates better unlearning; for toxicity, lower toxicity indicates better unlearning\. Because random guessing on our multiple\-choice utility benchmark yields $R_0 = 0.25$, we first chance\-correct utility as

$$\bar{R} = \left[\frac{R - R_0}{1 - R_0}\right]_{0}^{1}.$$

For dangerous knowledge, we similarly chance\-correct unlearning accuracy, while for toxicity we define $\bar{U} = 1 - U$ so that higher is better\. Our final score is

$$\text{S-Unlearning} = \bar{U} \cdot \bar{R}.$$

This score lies in $[0,1]$, is maximized when unlearning is strong and retained utility is high, and becomes zero whenever retained utility is at or below chance\. Geometrically, it is the area of the rectangle from the origin to $(\bar{U}, \bar{R})$ in normalized unlearning\-utility space, i\.e\., the single\-point two\-dimensional special case of the hypervolume indicator Auger et al\. \([2012](https://arxiv.org/html/2605.26454#bib.bib2)\)\.
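As a worked example of the metric, the sketch below computes the toxicity variant of the score. It assumes that the bracket with limits 0 and 1 denotes clipping to the unit interval, and it covers only the toxicity case because the exact chance-corrected form of the dangerous-knowledge term is not spelled out in the text. The input values 0.4 and 0.625 are made up for illustration.

```python
def clip01(x):
    """Clip a value to the interval [0, 1] (assumed meaning of the bracket notation)."""
    return max(0.0, min(1.0, x))

def chance_corrected(score, chance=0.25):
    """Rescale so that chance-level performance maps to 0 and a perfect score to 1."""
    return clip01((score - chance) / (1.0 - chance))

def s_unlearning(u_bar, r_bar):
    """Product of normalized unlearning and normalized retained utility."""
    return u_bar * r_bar

def s_unlearning_toxicity(toxicity, utility, chance=0.25):
    """Toxicity variant: U_bar = 1 - U, R_bar = chance-corrected utility."""
    return s_unlearning(1.0 - toxicity, chance_corrected(utility, chance))

# Hypothetical inputs: toxicity 0.4 and utility accuracy 0.625.
# R_bar = (0.625 - 0.25) / 0.75 = 0.5 and U_bar = 0.6, so the score is about 0.3.
score = s_unlearning_toxicity(toxicity=0.4, utility=0.625)

# Utility at or below chance forces the score to zero, however low the toxicity.
collapsed = s_unlearning_toxicity(toxicity=0.0, utility=0.2)
```

The second call shows why a method that destroys general capability cannot score well: the product form rewards only points that are good on both axes.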

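| Model | Unlearning Method | S\-unlearning WMDP | S\-unlearning Toxicity |
| --- | --- | --- | --- |
| Llama3\.1\-8B | Base | 0\.29 | 0\.34 |
| Llama3\.1\-8B | RMU | 0\.29 | 0\.33 |
| Llama3\.1\-8B | Adaptive RMU | 0\.32 | 0\.32 |
| Llama3\.1\-8B | Ours | 0\.00 | 0\.43 |
| Mistral\-7B | Base | 0\.28 | 0\.32 |
| Mistral\-7B | RMU | 0\.40 | 0\.31 |
| Mistral\-7B | Adaptive RMU | 0\.00 | 0\.29 |
| Mistral\-7B | Ours | 0\.43 | 0\.34 |
| Qwen2\.5\-7B | Base | 0\.30 | 0\.43 |
| Qwen2\.5\-7B | RMU | 0\.30 | 0\.43 |
| Qwen2\.5\-7B | Adaptive RMU | 0\.30 | 0\.41 |
| Qwen2\.5\-7B | Ours | 0\.33 | 0\.45 |
| Olmo3\-7B | Base | 0\.27 | 0\.32 |
| Olmo3\-7B | RMU | 0\.28 | 0\.30 |
| Olmo3\-7B | Adaptive RMU | 0\.28 | 0\.29 |
| Olmo3\-7B | Ours | 0\.35 | 0\.35 |

Table 1: S\-unlearning scores that combine the tradeoff between unlearning and general capability for both unlearning types\.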

## 5 Experimental Setup

Here we describe the datasets, models, and evaluation metrics used for both unlearning settings\.

For dangerous\-knowledge unlearning, we follow Li et al\. \([2024](https://arxiv.org/html/2605.26454#bib.bib18)\): we use the biosecurity and cybersecurity forget datasets released with WMDP, evaluate forgetting on WMDP, and use WikiText as retain data\. We measure retained general capabilities on MMLU Hendrycks et al\. \([2021](https://arxiv.org/html/2605.26454#bib.bib9)\)\.

![Refer to caption](https://arxiv.org/html/2605.26454v1/Figures/mist_loss_curve.png)\(a\) Mistral\-7B
![Refer to caption](https://arxiv.org/html/2605.26454v1/Figures/llama_loss_curve.png)\(b\) Llama3\.1\-8B
![Refer to caption](https://arxiv.org/html/2605.26454v1/Figures/qwen_loss_curve.png)\(c\) Qwen2\.5\-7B
![Refer to caption](https://arxiv.org/html/2605.26454v1/Figures/olmo_loss_curve.png)\(d\) Olmo3\-7B

Figure 5: Toxicity unlearning loss curves for each model\.

For toxicity unlearning, we use the toxic split of ParaDetox \(Logacheva et al\., 2022\) as forget data\. To train the logistic\-regression probes, we use the human\-annotated training split of TRuST Atil et al\. \([2026](https://arxiv.org/html/2605.26454#bib.bib10)\)\. We again use WikiText as retain data and evaluate retained utility with MMLU\. To evaluate toxicity, we use the challenging subset of RealToxicityPrompts Gehman et al\. \([2020](https://arxiv.org/html/2605.26454#bib.bib25)\) and score outputs with the BERT\-based TRuST classifier Atil et al\. \([2026](https://arxiv.org/html/2605.26454#bib.bib10)\), which we chose because it generalizes better than alternative toxicity classifiers\. For our main toxicity experiments, we update nine layers, selecting three layers each from early, middle, and late depth regions\. We analyze alternative layer choices in Section [6\.5](https://arxiv.org/html/2605.26454#S6.SS5)\. The final $\alpha$ values are 43, 78, 5\.7, and 0\.41 for Llama, Mistral, Qwen, and Olmo, respectively\.

Given our limited compute budget, we focus on four widely used, smaller, open\-source 7–8B models: Llama3\.1\-8B Grattafiori et al\. \([2024](https://arxiv.org/html/2605.26454#bib.bib6)\), Mistral\-7B Jiang et al\. \([2023](https://arxiv.org/html/2605.26454#bib.bib8)\), Olmo3\-7B Olmo et al\. \([2025](https://arxiv.org/html/2605.26454#bib.bib3)\), and Qwen2\.5\-7B Qwen et al\. \([2025](https://arxiv.org/html/2605.26454#bib.bib4)\)\.

## 6 Results

We first analyze the learned toxicity probes to better understand the geometry of toxicity representations\. We then evaluate our methods on dangerous\-knowledge and toxicity unlearning, examine where toxicity unlearning occurs, and conclude with an analysis of how many and which layers work best\.

![Refer to caption](https://arxiv.org/html/2605.26454v1/Figures/tradeoff_mmlu_vs_toxicity.png)

Figure 6: The effect of the number and identity of the layers on toxicity unlearning and general capability\.

### 6\.1 Toxicity Probe Analysis

To better understand the internal geometry of toxicity representations and obtain principled unlearning directions, we train logistic regression probes at multiple layers of Llama3\.1\-8B and analyze the pairwise cosine similarity between the learned weight vectors\. Figure [1](https://arxiv.org/html/2605.26454#S4.F1) shows that toxicity directions are largely layer\-specific: while adjacent layers share moderate similarity, most cross\-layer pairs exhibit cosine similarities below 0\.5, especially in the early layers\. This indicates that the internal representation of toxicity rotates substantially as information propagates through the network\. In the late layers, however, similarities are higher, suggesting a somewhat more consistent toxicity representation near the output\. Overall, these results indicate that a single shared direction is insufficient to characterize toxicity across the full depth of the model, motivating our use of layer\-specific probes\.
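The pairwise analysis can be sketched as follows. The code builds a cosine-similarity matrix from a dictionary of per-layer direction vectors and reports the share of layer pairs below a threshold. The four toy directions, which rotate by 40 degrees per layer in two dimensions, are an invented stand-in for real probe weights, and the helper names are illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def similarity_matrix(directions):
    """Return (sorted layer indices, matrix of pairwise cosine similarities)."""
    layers = sorted(directions)
    matrix = [[cosine(directions[i], directions[j]) for j in layers] for i in layers]
    return layers, matrix

def fraction_below(matrix, threshold=0.5):
    """Share of distinct layer pairs (i < j) whose similarity is below the threshold."""
    n = len(matrix)
    pairs = [matrix[i][j] for i in range(n) for j in range(i + 1, n)]
    return sum(1 for s in pairs if s < threshold) / len(pairs)

# Toy probe directions that rotate by 40 degrees from one layer to the next.
step = math.radians(40.0)
toy_directions = {k: [math.cos(k * step), math.sin(k * step)] for k in range(4)}

layers, matrix = similarity_matrix(toy_directions)
share_low = fraction_below(matrix)
```

In this toy setup adjacent layers have similarity of roughly 0.77 while all non-adjacent pairs fall below 0.5, so half of the six pairs count as low similarity.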

Figure [2](https://arxiv.org/html/2605.26454#S4.F2) shows the weight distribution of the probe trained at layer 11 of Llama3\.1\-8B\. The distribution is approximately symmetric and centered close to zero, suggesting that the learned toxicity direction is dense and distributed rather than dominated by a small number of extreme dimensions\. The spread of the distribution, spanning roughly $[-0.45, 0.45]$, indicates that many dimensions contribute meaningfully to the toxicity direction, with both positive and negative weights\. This supports our use of the full probe vector as an unlearning direction and helps explain why sparse or single\-neuron interventions are unlikely to be sufficient\.

### 6\.2 Dangerous Knowledge Unlearning

Figure [3](https://arxiv.org/html/2605.26454#S4.F3) presents the accuracy results for dangerous\-knowledge unlearning\. As shown, our approach outperforms prior methods for all models except Llama3\.1\-8B, while preserving retained utility well\. Qwen and Olmo remain challenging for all methods\. Table [1](https://arxiv.org/html/2605.26454#S4.T1) shows that our method achieves the best S\-unlearning tradeoff on all models except Llama3\.1\-8B\.

### 6\.3 Toxicity Unlearning

Figure [4](https://arxiv.org/html/2605.26454#S4.F4) presents the accuracy results for toxicity unlearning\. As shown, our method yields the largest reductions in toxicity with only a limited reduction in general knowledge\. In contrast, RMU and AdapRMU only slightly reduce toxicity\. As shown in Table [1](https://arxiv.org/html/2605.26454#S4.T1), our method also has the highest S\-unlearning scores\. The magnitude of improvement varies by model, with the strongest gains on Mistral and Llama, but our method consistently reduces toxic generation\.

### 6\.4Where Does Toxicity Unlearning Occur?

To understand which layers are most affected during unlearning, we plot the layer\-wise unlearning loss throughout training in Figure[5](https://arxiv.org/html/2605.26454#S5.F5)\. Early layers begin with relatively low loss and remain low or decrease further, suggesting that they encode less discriminative toxicity signal\. Middle layers exhibit moderate initial losses and more mixed dynamics\. Late layers show the strongest dynamics, typically starting with the highest losses and changing the most during training, consistent with the stronger toxicity representations suggested by our probe analysis in Section[6\.1](https://arxiv.org/html/2605.26454#S6.SS1)\.

These broad trends hold across models, but the trajectories are model\-specific\. Mistral\-7B \(Figure [5](https://arxiv.org/html/2605.26454#S5.F5)a\) shows the most consistent decrease across layers\. In Llama3\.1\-8B \(Figure [5](https://arxiv.org/html/2605.26454#S5.F5)b\), earlier layers increase, but mid and late layers decrease\. Qwen2\.5\-7B \(Figure [5](https://arxiv.org/html/2605.26454#S5.F5)c\) exhibits the strongest divergence, with some late layers reaching loss values above 80, which suggests that its representations resist orthogonalization more strongly, possibly due to architectural differences in how toxicity is distributed\. In Olmo3\-7B \(Figure [5](https://arxiv.org/html/2605.26454#S5.F5)d\), late layers increase sharply, but other layers decrease\. Overall, these results suggest that the geometry of toxicity is architecture\-dependent and that the best layer selection strategy may vary across models\.
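For intuition about what a per-layer loss tied to orthogonalization measures, the sketch below computes a mean squared projection of hidden states onto a layer-specific probe direction. This is an assumed stand-in: the exact per-layer loss is defined in the paper's method section and may differ, and the function names, layer indices, and toy activations here are all hypothetical.

```python
import math


def unit(vector):
    """Return the vector scaled to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def projection_loss(hidden_states, probe_direction):
    """Mean squared projection of hidden states onto the unit probe direction.

    The value is zero exactly when every hidden state is orthogonal to the
    direction, and it is unbounded above because it scales with the
    activation norm (unlike a cosine, which is bounded by 1).
    """
    direction = unit(probe_direction)
    total = 0.0
    for state in hidden_states:
        projection = sum(a * b for a, b in zip(state, direction))
        total += projection * projection
    return total / len(hidden_states)


def layerwise_losses(states_by_layer, probes_by_layer):
    """Compute the assumed loss separately for each layer."""
    return {
        layer: projection_loss(states_by_layer[layer], probes_by_layer[layer])
        for layer in states_by_layer
    }


# Toy 2-D activations per layer (made-up numbers for illustration only).
states = {
    3: [[1.0, 0.0], [0.0, 1.0]],   # early layer: small projections
    17: [[4.0, 0.0]],              # already orthogonal to its probe
    30: [[0.0, 3.0], [0.0, 5.0]],  # late layer: large projections
}
probes = {
    3: [0.0, 2.0],   # probe directions need not be unit length
    17: [0.0, 1.0],
    30: [0.0, 1.0],
}

losses = layerwise_losses(states, probes)
for layer, value in sorted(losses.items()):
    print(f"layer {layer:2d}: loss = {value:.2f}")
```

Under a loss of this shape, large values at late layers can reflect either strong alignment with the probe direction or simply large activation norms, so cross-model comparisons of raw loss magnitude should be read with that in mind.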

### 6\.5 Layer Ablation Study for Toxicity

We analyze both the *number* and the *identity* of the updated layers using Llama3\.1\-8B, the model where our method performs best\. Figure [7](https://arxiv.org/html/2605.26454#A1.F7) in the Appendix shows that increasing the number of updated layers generally improves unlearning, although not monotonically: one\- and two\-layer interventions stay near the baseline, while groups of three or more layers yield larger reductions on average\. Variation across layer groups of the same size further shows that *which* layers are updated matters nearly as much as *how many*\. This is consistent with prior work showing that changes in transformer representations across depths are not monotonic \(liu\-etal\-2019\-linguistic; liu\-etal\-2024\-fantastic\)\.

Figure [6](https://arxiv.org/html/2605.26454#S6.F6), which plots toxicity against MMLU accuracy for different combinations of unlearning layers, shows that layer identity strongly affects both unlearning and the safety\-utility tradeoff\. Single\-layer interventions, and many two\-layer groups, remain in the high\-toxicity region, whereas several larger groups reduce toxicity substantially, especially those involving middle and later layers\. The large variation among groups of the same size further suggests non\-additive layer interactions\. Overall, the results suggest that effective toxicity unlearning benefits from coordinated updates across multiple layers, with middle\-to\-late layers serving as especially useful targets\. Further, a well\-chosen set of layers can reduce toxicity substantially with little change in general knowledge capabilities\.
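An ablation of this kind can be organized by enumerating layer groups up to a maximum size and then keeping the groups that are not dominated on the two axes (lower toxicity, higher MMLU accuracy). The sketch below is a minimal illustration under stated assumptions: the candidate layer indices and every (toxicity, accuracy) pair are made-up placeholders rather than results from the paper, and `layer_groups` and `pareto_front` are hypothetical helper names.

```python
from itertools import combinations


def layer_groups(candidate_layers, max_size):
    """Enumerate all non-empty groups of layers with at most max_size members."""
    groups = []
    for size in range(1, max_size + 1):
        groups.extend(combinations(candidate_layers, size))
    return groups


def pareto_front(results):
    """Return the layer groups not dominated on (toxicity, accuracy).

    `results` maps a layer group to (toxicity, accuracy). A group is
    dominated if another group is no worse on both axes (toxicity no
    higher, accuracy no lower) and strictly better on at least one.
    """
    front = []
    for group, (tox, acc) in results.items():
        dominated = any(
            (t <= tox and a >= acc) and (t < tox or a > acc)
            for other, (t, a) in results.items()
            if other != group
        )
        if not dominated:
            front.append(group)
    return sorted(front)


# Hypothetical candidate layers; 4 + 6 + 4 = 14 groups of size 1 to 3.
all_groups = layer_groups([5, 11, 20, 28], max_size=3)
print(len(all_groups))

# Placeholder measurements (NOT from the paper): group -> (toxicity, MMLU acc).
results = {
    (5,): (0.40, 0.65),
    (11,): (0.38, 0.64),
    (5, 11): (0.35, 0.65),
    (11, 20): (0.30, 0.63),
    (5, 11, 20): (0.10, 0.64),
    (11, 20, 28): (0.12, 0.60),
}
front = pareto_front(results)
print(front)
```

Comparing groups of equal size on the resulting front is one simple way to surface the non-additive interactions noted above: if layer effects were additive, the ranking of same-size groups could be predicted from single-layer results alone.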

## 7 Conclusion

We argued that unlearning in LLMs should be treated as a goal\-dependent problem rather than as a single generic intervention\. Motivated by mechanistic differences between factual knowledge and toxicity, we introduced a cosine\-based, meta\-learned variant of RMU for dangerous\-knowledge unlearning and then showed that this same approach does not transfer well to toxicity\. For toxicity, we instead proposed a multi\-layer objective based on layer\-specific probe directions\.

Across four open\-source 7\-8B models, our methods performed strongly for both dangerous knowledge and toxicity\. Our proposed S\-unlearning summary metric makes it easier to compare methods, and to capture the tradeoffs in unlearning\. We presented analyses showing that toxicity directions vary across layers and that effective toxicity unlearning benefits from coordinated updates across multiple depths\. Finally, our comparison of unlearning for two types of language function that have distinct mechanisms within LLMs suggests that unlearning should be approached as a family of methods that require customization to specific language functions\.

## Limitations

We focus on two types of unlearning goals in this work, but there are others as well\. Further, we evaluate only four small LLMs because of a limited computational budget\.

## References

- Something just like trust : toxicity recognition of span and target\.External Links:2506\.02326,[Link](https://arxiv.org/abs/2506.02326)Cited by:[§5](https://arxiv.org/html/2605.26454#S5.p3.1)\.
- A\. Auger, J\. Bader, D\. Brockhoff, and E\. Zitzler \(2012\)Hypervolume\-based multiobjective optimization: theoretical foundations and practical implications\.Theoretical Computer Science425,pp\. 75–103\.External Links:[Document](https://dx.doi.org/10.1016/j.tcs.2011.03.012)Cited by:[§4\.4](https://arxiv.org/html/2605.26454#S4.SS4.p1.6)\.
- L\. Bourtoule, V\. Chandrasekaran, C\. A\. Choquette\-Choo, H\. Jia, A\. Travers, B\. Zhang, D\. Lie, and N\. Papernot \(2021\)Machine unlearning\.In2021 IEEE Symposium on Security and Privacy \(SP\),pp\. 141–159\.Cited by:[§1](https://arxiv.org/html/2605.26454#S1.p1.1),[§2\.1](https://arxiv.org/html/2605.26454#S2.SS1.p1.1)\.
- T\. B\. Brown, B\. Mann, N\. Ryder, M\. Subbiah, J\. Kaplan, P\. Dhariwal,et al\.\(2020\)Language models are few\-shot learners\.arXiv preprint arXiv:2005\.14165\.Cited by:[§1](https://arxiv.org/html/2605.26454#S1.p1.1)\.
- Y\. Cao and J\. Yang \(2015\)Towards making systems forget with machine unlearning\.In2015 IEEE Symposium on Security and Privacy,pp\. 463–480\.Cited by:[§1](https://arxiv.org/html/2605.26454#S1.p1.1),[§2\.1](https://arxiv.org/html/2605.26454#S2.SS1.p1.1)\.
- H\. Dang, T\. Hoang, L\. Nguyen, and N\. Inoue \(2025\)Improving the robustness of representation misdirection for large language model unlearning\.arXiv preprint arXiv:2501\.19202\.Cited by:[§1](https://arxiv.org/html/2605.26454#S1.p3.1),[§2\.2](https://arxiv.org/html/2605.26454#S2.SS2.p1.1)\.
- H\. Dang, T\. Pham, T\. Hoang, and N\. Inoue \(2024\)On effects of steering latent representation for large language model unlearning\.arXiv preprint arXiv:2408\.06223\.Cited by:[§1](https://arxiv.org/html/2605.26454#S1.p3.1),[§2\.2](https://arxiv.org/html/2605.26454#S2.SS2.p1.1)\.
- A\. Deeb and F\. Roger \(2024\)Do unlearning methods remove information from language model weights?\.arXiv preprint arXiv:2410\.08827\.Cited by:[§1](https://arxiv.org/html/2605.26454#S1.p3.1),[§2\.2](https://arxiv.org/html/2605.26454#S2.SS2.p1.1)\.
- H\. Du, W\. Li, M\. Cai, K\. Saraipour, Z\. Zhang, H\. Lakkaraju, Y\. Sun, and S\. Zhang \(2025\)How post\-training reshapes llms: a mechanistic view on knowledge, truthfulness, refusal, and confidence\.InConference on Language Modeling \(COLM\),Cited by:[§1](https://arxiv.org/html/2605.26454#S1.p1.1),[§1](https://arxiv.org/html/2605.26454#S1.p2.1),[§1](https://arxiv.org/html/2605.26454#S1.p3.1),[§2\.2](https://arxiv.org/html/2605.26454#S2.SS2.p2.1),[§2\.3](https://arxiv.org/html/2605.26454#S2.SS3.p1.1)\.
- S\. Gehman, S\. Gururangan, M\. Sap, Y\. Choi, and N\. A\. Smith \(2020\)RealToxicityPrompts: evaluating neural toxic degeneration in language models\.arXiv preprint arXiv:2009\.11462\.Cited by:[§1](https://arxiv.org/html/2605.26454#S1.p1.1),[§1](https://arxiv.org/html/2605.26454#S1.p3.1),[§2\.3](https://arxiv.org/html/2605.26454#S2.SS3.p1.1),[§5](https://arxiv.org/html/2605.26454#S5.p3.1)\.
- M\. Geva, J\. Bastings, K\. Filippova, and A\. Globerson \(2023\)Dissecting recall of factual associations in auto\-regressive language models\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,pp\. 12216–12235\.Cited by:[§1](https://arxiv.org/html/2605.26454#S1.p3.1),[§2\.2](https://arxiv.org/html/2605.26454#S2.SS2.p2.1)\.
- A\. Grattafiori, A\. Dubey, …, and Z\. Ma \(2024\)The llama 3 herd of models\.External Links:2407\.21783,[Link](https://arxiv.org/abs/2407.21783)Cited by:[§5](https://arxiv.org/html/2605.26454#S5.p4.1)\.
- T\. Hartvigsen, S\. Gabriel, H\. Palangi, M\. Sap, D\. Ray, and E\. Kamar \(2022\)ToxiGen: a large\-scale machine\-generated dataset for adversarial and implicit hate speech detection\.arXiv preprint arXiv:2203\.09509\.Cited by:[§1](https://arxiv.org/html/2605.26454#S1.p3.1),[§2\.3](https://arxiv.org/html/2605.26454#S2.SS3.p1.1)\.
- D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt \(2021\)Measuring massive multitask language understanding\.External Links:2009\.03300,[Link](https://arxiv.org/abs/2009.03300)Cited by:[§5](https://arxiv.org/html/2605.26454#S5.p2.1)\.
- S\. Hu, Y\. Fu, Z\. S\. Wu, and V\. Smith \(2024\)Jogging the memory of unlearned llms through targeted relearning attacks\.arXiv preprint arXiv:2406\.13356\.Cited by:[§1](https://arxiv.org/html/2605.26454#S1.p3.1),[§2\.2](https://arxiv.org/html/2605.26454#S2.SS2.p1.1)\.
- D\. Huu\-Tien, T\. Pham, H\. Thanh\-Tung, and N\. Inoue \(2024\)On effects of steering latent representation for large language model unlearning\.arXiv preprint arXiv:2408\.06223\.Cited by:[§1](https://arxiv.org/html/2605.26454#S1.p5.1),[§3\.2](https://arxiv.org/html/2605.26454#S3.SS2.p1.4)\.
- A\. Q\. Jiang, A\. Sablayrolles, A\. Mensch, C\. Bamford, D\. S\. Chaplot, D\. de las Casas, F\. Bressand, G\. Lengyel, G\. Lample, L\. Saulnier, L\. R\. Lavaud, M\. Lachaux, P\. Stock, T\. L\. Scao, T\. Lavril, T\. Wang, T\. Lacroix, and W\. E\. Sayed \(2023\)Mistral 7b\.External Links:2310\.06825,[Link](https://arxiv.org/abs/2310.06825)Cited by:[§5](https://arxiv.org/html/2605.26454#S5.p4.1)\.
- S\. Kadhe, F\. Ahmed, D\. Wei, N\. Baracaldo, and I\. Padhi \(2024\)Split, unlearn, merge: leveraging data attributes for more effective unlearning in LLMs\.InICML 2024 Workshop on Foundation Models in the Wild,External Links:[Link](https://openreview.net/forum?id=BzIySThX9O)Cited by:[§1](https://arxiv.org/html/2605.26454#S1.p3.1),[§2\.1](https://arxiv.org/html/2605.26454#S2.SS1.p1.1)\.
- A\. Lee, X\. Bai, I\. Pres, M\. Wattenberg, J\. K\. Kummerfeld, and R\. Mihalcea \(2024\)A mechanistic understanding of alignment algorithms: a case study on dpo and toxicity\.InProceedings of the 41st International Conference on Machine Learning,pp\. 26361–26378\.Cited by:[§1](https://arxiv.org/html/2605.26454#S1.p4.1),[§2\.3](https://arxiv.org/html/2605.26454#S2.SS3.p1.1)\.
- N\. Li, A\. Pan, A\. Gopal, S\. Yue, D\. Berrios, A\. Gatti, J\. D\. Li,et al\.\(2024\)The wmdp benchmark: measuring and reducing malicious use with unlearning\.InProceedings of the 41st International Conference on Machine Learning,pp\. 28525–28550\.Cited by:[§1](https://arxiv.org/html/2605.26454#S1.p1.1),[§1](https://arxiv.org/html/2605.26454#S1.p3.1),[§1](https://arxiv.org/html/2605.26454#S1.p5.1),[§2\.1](https://arxiv.org/html/2605.26454#S2.SS1.p1.1),[§2\.2](https://arxiv.org/html/2605.26454#S2.SS2.p1.1),[§3\.1](https://arxiv.org/html/2605.26454#S3.SS1.p1.3),[§5](https://arxiv.org/html/2605.26454#S5.p2.1)\.
- S\. Liu, Y\. Yao, J\. Jia, S\. Casper, N\. Baracaldo, P\. Hase, Y\. Yao,et al\.\(2024\)Rethinking machine unlearning for large language models\.arXiv preprint arXiv:2402\.08787\.Cited by:[§1](https://arxiv.org/html/2605.26454#S1.p1.1),[§2\.1](https://arxiv.org/html/2605.26454#S2.SS1.p1.1)\.
- P\. Maini, Z\. Feng, A\. Schwarzschild, Z\. C\. Lipton, and J\. Z\. Kolter \(2024\)TOFU: a task of fictitious unlearning for llms\.arXiv preprint arXiv:2401\.06121\.Cited by:[§1](https://arxiv.org/html/2605.26454#S1.p1.1),[§2\.1](https://arxiv.org/html/2605.26454#S2.SS1.p1.1)\.
- K\. Meng, D\. Bau, A\. Andonian, and Y\. Belinkov \(2022\)Locating and editing factual associations in gpt\.arXiv preprint arXiv:2202\.05262\.Cited by:[§1](https://arxiv.org/html/2605.26454#S1.p3.1),[§2\.2](https://arxiv.org/html/2605.26454#S2.SS2.p2.1)\.
- T\. Olmo, A\. Ettinger, A\. Bertsch, B\. Kuehl, D\. Graham, D\. Heineman, D\. Groeneveld, F\. Brahman, F\. Timbers, H\. Ivison, J\. Morrison, J\. Poznanski, K\. Lo, L\. Soldaini, M\. Jordan, M\. Chen, M\. Noukhovitch, N\. Lambert, P\. Walsh, P\. Dasigi, R\. Berry, S\. Malik, S\. Shah, S\. Geng, S\. Arora, S\. Gupta, T\. Anderson, T\. Xiao, T\. Murray, T\. Romero, V\. Graf, A\. Asai, A\. Bhagia, A\. Wettig, A\. Liu, A\. Rangapur, C\. Anastasiades, C\. Huang, D\. Schwenk, H\. Trivedi, I\. Magnusson, J\. Lochner, J\. Liu, L\. J\. V\. Miranda, M\. Sap, M\. Morgan, M\. Schmitz, M\. Guerquin, M\. Wilson, R\. Huff, R\. L\. Bras, R\. Xin, R\. Shao, S\. Skjonsberg, S\. Z\. Shen, S\. S\. Li, T\. Wilde, V\. Pyatkin, W\. Merrill, Y\. Chang, Y\. Gu, Z\. Zeng, A\. Sabharwal, L\. Zettlemoyer, P\. W\. Koh, A\. Farhadi, N\. A\. Smith, and H\. Hajishirzi \(2025\)Olmo 3\.External Links:2512\.13961,[Link](https://arxiv.org/abs/2512.13961)Cited by:[§5](https://arxiv.org/html/2605.26454#S5.p4.1)\.
- L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. L\. Wainwright, P\. Mishkin, C\. Zhang,et al\.\(2022\)Training language models to follow instructions with human feedback\.arXiv preprint arXiv:2203\.02155\.Cited by:[§1](https://arxiv.org/html/2605.26454#S1.p1.1),[§1](https://arxiv.org/html/2605.26454#S1.p2.1)\.
- Qwen, :, A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei, H\. Lin, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Lin, K\. Dang, K\. Lu, K\. Bao, K\. Yang, L\. Yu, M\. Li, M\. Xue, P\. Zhang, Q\. Zhu, R\. Men, R\. Lin, T\. Li, T\. Tang, T\. Xia, X\. Ren, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Cui, Z\. Zhang, and Z\. Qiu \(2025\)Qwen2\.5 technical report\.External Links:2412\.15115,[Link](https://arxiv.org/abs/2412.15115)Cited by:[§5](https://arxiv.org/html/2605.26454#S5.p4.1)\.
- R\. Rafailov, A\. Sharma, E\. Mitchell, S\. Ermon, C\. D\. Manning, and C\. Finn \(2023\)Direct preference optimization: your language model is secretly a reward model\.arXiv preprint arXiv:2305\.18290\.Cited by:[§1](https://arxiv.org/html/2605.26454#S1.p1.1),[§1](https://arxiv.org/html/2605.26454#S1.p2.1)\.
- R\. J\. Williams \(1992\)Simple statistical gradient\-following algorithms for connectionist reinforcement learning\.Machine learning8\(3\),pp\. 229–256\.Cited by:[§4\.2](https://arxiv.org/html/2605.26454#S4.SS2.p1.5)\.
- Y\. Xu, M\. Gong, T\. Liu, K\. Batmanghelich, and C\. Wang \(2018\)Robust angular local descriptor learning\.InAsian Conference on Computer Vision,pp\. 420–435\.Cited by:[§4\.1](https://arxiv.org/html/2605.26454#S4.SS1.p1.1)\.
- R\. Zhang, L\. Lin, Y\. Bai, and S\. Mei \(2024\)Negative preference optimization: from catastrophic collapse to effective unlearning\.arXiv preprint arXiv:2404\.05868\.Cited by:[§1](https://arxiv.org/html/2605.26454#S1.p5.1),[§2\.1](https://arxiv.org/html/2605.26454#S2.SS1.p1.1)\.
- A\. Zou, L\. Phan, S\. Chen, J\. Campbell, P\. Guo, R\. Ren, A\. Pan, X\. Yin, M\. Mazeika, A\. Dombrowski,et al\.\(2023\)Representation engineering: a top\-down approach to ai transparency\.arXiv preprint arXiv:2310\.01405\.Cited by:[§2\.2](https://arxiv.org/html/2605.26454#S2.SS2.p2.1)\.

## Appendix A Figures

![Refer to caption](https://arxiv.org/html/2605.26454v1/Figures/toxicity_vs_num_layers.png)

Figure 7: The effect of the number of layers on toxicity unlearning\.

## Appendix B Hyperparameters and Computing Infrastructure

We finetuned the models for two epochs on up to four NVIDIA RTX A6000 GPUs\. Each run took about an hour\. For the reinforcement learning, we used a learning rate of 1e\-2, and for the main finetuning, we used a learning rate of 5e\-5\. We used the Adam optimizer for all experiments\.