Null-Space Constrained Low-Rank Adaptation for Response-Specified Large Language Model Unlearning
Summary
This paper introduces Null-Space Constrained Response-Specified Unlearning (NSRU), a low-rank framework that uses orthogonal-projected LoRA updates confined to the null space of retain subspaces to perform controlled LLM unlearning while preserving benign capabilities.
# Null-Space Constrained Low-Rank Adaptation for Response-Specified Large Language Model Unlearning
Source: [https://arxiv.org/html/2606.10989](https://arxiv.org/html/2606.10989)
Bocheng Ju, Jianhua Wang, Chengliang Liu, and Xiaolin Chang

Bocheng Ju and Xiaolin Chang are with the Beijing Key Laboratory of Security and Privacy in Intelligent Transportation, Beijing Jiaotong University, P.R. China (e-mail: xlchang@bjtu.edu.cn). Jianhua Wang is with the College of Computer Science and Technology, Taiyuan University of Technology, Taiyuan, China, 030024 (e-mail: wangjianhua02@tyut.edu.cn). Chengliang Liu is with the Institute of Computing Technologies, China Academy of Railway Sciences Corporation Limited, Beijing 100081, China (e-mail: liucl@rails.cn).
###### Abstract
Large language model unlearning aims to suppress designated undesirable knowledge while preserving benign capabilities. Many unlearning objectives focus on suppressing undesired answers, while recent target-guided variants specify replacement behavior but still leave update locality largely unconstrained. This paper introduces *Null-Space Constrained Response-Specified Unlearning* (NSRU), a projection-constrained low-rank framework for controlled LLM unlearning. NSRU uses an explicitly structured safe target response to specify the desired behavior for each forget query, while suppressing the original undesired content. To localize adaptation, NSRU estimates per-module retain subspaces from benign hidden representations and uses an orthogonal-projected low-rank parameterization to confine LoRA updates to the null space of the retain subspace. The resulting objective jointly optimizes safe-target learning, undesired-response suppression, and retention preservation under this constrained parameterization. We provide a local first-order analysis showing that the projected update reduces retain-side perturbations while preserving editable directions for shaping forget-query behavior. Experiments on TOFU show that NSRU effectively suppresses extractable forget-set knowledge while improving retain QA performance, model utility, and safe-target alignment over representative baselines. On WMDP, NSRU keeps hazardous-domain accuracy near the random-choice region while preserving broad and domain-adjacent MMLU utility. Ablation studies support the complementary roles of safe-target supervision, undesired-response suppression, retention loss, and null-space projected updates, while sensitivity and robustness analyses indicate stable behavior across the tested hyperparameter and prompt variations.
## I Introduction
Large language models \(LLMs\) are increasingly deployed in settings where they must preserve broad utility while avoiding the reproduction of restricted or undesirable knowledge\. Their training data may contain sensitive personal details, hazardous instructions, copyrighted content, or other information that a deployed model should not reveal\[[3](https://arxiv.org/html/2606.10989#bib.bib1),[48](https://arxiv.org/html/2606.10989#bib.bib2),[26](https://arxiv.org/html/2606.10989#bib.bib3)\]\. LLM unlearning addresses this problem by reducing the influence of designated forget data while preserving the model’s behavior on benign and unrelated inputs\[[48](https://arxiv.org/html/2606.10989#bib.bib2),[26](https://arxiv.org/html/2606.10989#bib.bib3),[52](https://arxiv.org/html/2606.10989#bib.bib8),[23](https://arxiv.org/html/2606.10989#bib.bib6)\]\. Recent benchmarks and audits further show that reliable LLM unlearning should be evaluated beyond nominal forget\-set performance, because residual knowledge can remain recoverable through benchmark perturbations, downstream adaptation, or internal representations\[[36](https://arxiv.org/html/2606.10989#bib.bib5),[11](https://arxiv.org/html/2606.10989#bib.bib41),[38](https://arxiv.org/html/2606.10989#bib.bib55),[40](https://arxiv.org/html/2606.10989#bib.bib54),[16](https://arxiv.org/html/2606.10989#bib.bib53)\]\.
Figure 1: Motivation and core intuition of NSRU. (a) Suppression-only unlearning penalizes the undesired response $y^{-}$ but leaves the safe replacement behavior unspecified and can induce under-constrained updates that perturb retained behavior. (b) NSRU specifies a safe target response $y^{+}$, explicitly suppresses $y^{-}$, and uses projected LoRA updates that act through retain-orthogonal components, redirecting forget queries while reducing retain-side interference.

A large class of LLM unlearning methods focuses on suppressing undesired answers. Given a forget query, these methods lower the likelihood of the original response through gradient ascent, preference-based objectives, or related fine-tuning schemes\[[48](https://arxiv.org/html/2606.10989#bib.bib2),[26](https://arxiv.org/html/2606.10989#bib.bib3),[52](https://arxiv.org/html/2606.10989#bib.bib8)\]. Recent target-guided methods further specify replacement responses after unlearning\[[22](https://arxiv.org/html/2606.10989#bib.bib7),[28](https://arxiv.org/html/2606.10989#bib.bib27)\]. These works address the response side of unlearning, but controlled unlearning also depends on update locality: because forget and retain behaviors share model representations and parameters, changing the former can inadvertently perturb the latter. This leads to two sources of uncontrolled behavior.
The first challenge is response control: suppressing an undesired answer does not specify what safe and coherent response should replace it after unlearning. After the undesired response is penalized, the model may still produce a partial disclosure, an unstable refusal, a corrupted response, or unrelated text. Response-specified unlearning also differs from ordinary safety alignment in its success criterion: the model should not only produce a safe response, but also reduce recoverability of the original undesired content. Thus, response-specified unlearning must couple safe-target learning with explicit suppression of $y^{-}$\[[22](https://arxiv.org/html/2606.10989#bib.bib7),[28](https://arxiv.org/html/2606.10989#bib.bib27)\].

The second challenge is update locality: forget and retain behaviors are entangled in model representations, so an update that suppresses the forget response may also perturb directions needed for benign capabilities. Retention losses and regularizers mitigate this effect by balancing objectives\[[3](https://arxiv.org/html/2606.10989#bib.bib1),[48](https://arxiv.org/html/2606.10989#bib.bib2),[26](https://arxiv.org/html/2606.10989#bib.bib3)\], but they do not determine the directions in which the model is allowed to change. LoRA reduces the number of trainable parameters, yet parameter efficiency alone does not ensure that the learned low-rank directions avoid representation directions that support retained behavior\[[19](https://arxiv.org/html/2606.10989#bib.bib9),[2](https://arxiv.org/html/2606.10989#bib.bib28),[25](https://arxiv.org/html/2606.10989#bib.bib29),[44](https://arxiv.org/html/2606.10989#bib.bib56)\].

Together, these two challenges suggest a constrained adaptation view of response-specified unlearning. At the behavioral level, the model needs a specified replacement response and an explicit penalty against recovering the undesired response. At the update level, directions strongly associated with retained behavior should be protected, while adaptation should act through the remaining editable directions. Motivated by this view, we propose *Null-Space Constrained Response-Specified Unlearning* (NSRU), a low-rank adaptation framework for controlled LLM unlearning. To the best of our knowledge, NSRU is the first LLM unlearning framework to cast response-specified unlearning as a single constrained adaptation formulation, thereby unifying safe-target redirection, explicit undesired-response suppression, and retain-subspace projected LoRA updates. As illustrated in Fig. [1](https://arxiv.org/html/2606.10989#S1.F1), NSRU redirects forget queries toward specified safe targets while constraining LoRA adaptation to directions orthogonal to the estimated retain subspace.

NSRU instantiates this coupling in three steps. First, for each forget query, it uses both the original undesired response $y^{-}$ and a safe target response $y^{+}$. Second, it estimates an empirical retain subspace from benign hidden representations at selected trainable modules and treats its orthogonal complement as the editable space. Third, it trains projected LoRA adapters with a joint objective that promotes $y^{+}$, suppresses recovery of $y^{-}$, and preserves retain-set behavior. This design separates what the model should output after unlearning from where parameter updates are allowed to act.
Our local first\-order analysis explains why NSRU can reduce retain\-side interference without closing off safe\-redirection directions\. For retain inputs whose representations lie close to the estimated retain subspace, the projected update has only a small first\-order effect; for forget inputs with nonzero null\-space energy, the same constraint still preserves locally useful behavior\-shaping directions\. Experiments on TOFU show that NSRU achieves effective forget\-set suppression while improving retain QA performance, model utility, and safe\-target alignment over representative baselines\. On WMDP, NSRU keeps hazardous\-domain accuracy near the random\-choice region while preserving stronger broad and domain\-adjacent MMLU utility, yielding a more practical hazardous\-suppression–utility trade\-off\. Ablations further show that this behavior depends on the combination of safe\-target supervision, undesired\-response suppression, retention regularization, and null\-space projection\.
TABLE I: Main mathematical symbols used in this paper.

| Symbol | Meaning |
| --- | --- |
| $A_l, B_l, r_l, \alpha_l, s_l$ | Low-rank factors, rank, LoRA scaling parameter, and effective scaling $s_l=\alpha_l/r_l$ |
| $D_f, D_r, \tilde{D}_r$ | Forget, retain, and sampled retain sets |
| $e_l(x)$ | Null-space energy at module $l$ |
| $f_\theta$ | Pre-trained LM with parameters $\theta$ |
| $g_l, q_l$ | Forget null-space component and score gradient |
| $H_l^r$ | Retain hidden-state matrix at module $l$ |
| $h_l(x)$ | Input hidden state of module $l$ |
| $k_l, \rho$ | Subspace rank and energy threshold |
| $\mathcal{L}_{\text{safe}}$ | Safe-target loss |
| $\mathcal{L}_{\text{undesired}}$ | Undesired-response suppression loss |
| $\mathcal{L}_{\text{ret}}$ | Retention loss |
| $\lambda_f, \lambda_r$ | Objective trade-off weights |
| $\mathcal{M}$ | Set of selected trainable modules |
| $N_f, N_r$ | Numbers of forget and retain samples |
| $n_r$ | Number of sampled retain examples used for subspace estimation |
| $p_{\text{safe}}$ | Safety-guided prompt used for construction of $y^{+}$ |
| $P_l, P_l^{\perp}$ | Retain and null-space projectors |
| $\Phi_\theta(x)$ | Safe-over-undesired preference score |
| $\psi_l(\cdot)$ | Token feature rule for module $l$ |
| $r_i^{+}, a_i^{+}$ | Safety-grounding prefix and final safe answer in $y_i^{+}$ |
| $\theta, \theta'$ | Original and adapted parameters |
| $U_l$ | Top retain directions at module $l$ |
| $W_l, W_l', \Delta W_l$ | Original/adapted weight and update for module $l$ |
| $x$ | Input query |
| $y=[y_1,\ldots,y_T]$ | Output sequence; $y_{<t}$ is its prefix |
| $y^{-}, y^{+}$ | Undesired and safe target responses |
| $z_\theta(x), J_l(x)$ | Output logits and module-$l$ Jacobian |

Our main contributions are summarized as follows:
- We formalize response-specified unlearning as a constrained adaptation problem in which each forget query has an original undesired response and a safe target response, making the intended post-unlearning behavior explicit.
- We introduce NSRU, a null-space constrained low-rank adaptation framework that estimates retain subspaces from benign hidden representations and restricts trainable LoRA updates to their corresponding null spaces.
- We give a local first-order analysis showing retention preservation for inputs aligned with the retain subspace and projected local modifiability for forget inputs with nonzero null-space energy.
- We evaluate NSRU against representative unlearning baselines on TOFU and WMDP, showing stronger retain-side performance and safe-target alignment on TOFU, as well as a stronger hazardous-suppression–utility trade-off on WMDP.

The remainder of this paper is organized as follows. Section II reviews related work, Section III formalizes the problem setting, Section IV presents NSRU, Section V reports experiments, and Section VI concludes the paper. For readability, Table [I](https://arxiv.org/html/2606.10989#S1.T1) lists the main notation used throughout the paper.
## II Related Work
### II-A Optimization-Based LLM Unlearning
Approximate unlearning has become a standard route for removing undesirable knowledge from LLMs without retraining from scratch\. Early studies showed that targeted post hoc fine\-tuning can reduce memorized content while attempting to preserve general capabilities\[[13](https://arxiv.org/html/2606.10989#bib.bib30),[47](https://arxiv.org/html/2606.10989#bib.bib31)\], and subsequent work studied gradient ascent, gradient difference, and related objectives for pre\-trained LLM unlearning\[[46](https://arxiv.org/html/2606.10989#bib.bib32)\]\. TOFU introduced a controlled fictitious\-unlearning benchmark and showed that many optimization\-based baselines fail to match retraining behavior\[[26](https://arxiv.org/html/2606.10989#bib.bib3)\], while recent benchmark studies emphasized multi\-dimensional evaluation, metric robustness, and the risk of over\-interpreting benchmark\-only success\[[36](https://arxiv.org/html/2606.10989#bib.bib5),[11](https://arxiv.org/html/2606.10989#bib.bib41),[38](https://arxiv.org/html/2606.10989#bib.bib55)\]\. Negative Preference Optimization \(NPO\) reformulated unlearning as a preference\-style objective\[[52](https://arxiv.org/html/2606.10989#bib.bib8)\], with follow\-up work improving stability, effectiveness, and response quality\[[14](https://arxiv.org/html/2606.10989#bib.bib33),[45](https://arxiv.org/html/2606.10989#bib.bib50)\]\. These methods establish the optimization basis for LLM unlearning\. Their main design variable is the loss applied to forget and retain samples, whereas the replacement behavior after forgetting and the geometry of the parameter update are usually handled indirectly through auxiliary objectives or regularization\.
### II-B Response-Specified Unlearning
A closely related line of work studies not only whether a model forgets, but also how it behaves after forgetting\. This direction is connected to preference optimization for language model alignment, where Direct Preference Optimization \(DPO\) showed that preference\-based alignment can be achieved without explicit reinforcement learning\[[33](https://arxiv.org/html/2606.10989#bib.bib34)\], and NPO adapted this view to unlearning through negative preference supervision\[[52](https://arxiv.org/html/2606.10989#bib.bib8)\]\. Recent work has moved further toward explicit post\-unlearning behavior control: AltPO argued that relying only on negative feedback can lead to inconsistent or low\-quality responses\[[28](https://arxiv.org/html/2606.10989#bib.bib27)\], while TRU introduced structured target responses to promote coherent refusals and better control over in\-scope versus out\-of\-scope queries\[[22](https://arxiv.org/html/2606.10989#bib.bib7)\]\. R\-TOFU further showed that, in reasoning\-intensive models, answer\-level forgetting can miss residual information in intermediate traces, motivating post\-unlearning specifications that go beyond simply lowering the likelihood of the original answer\[[49](https://arxiv.org/html/2606.10989#bib.bib35)\]\. This line of work motivates explicit target responses after unlearning\. However, response specification is primarily an objective\-level constraint: it states what output should be preferred after unlearning, while leaving how and where the model parameters may change to the underlying adaptation mechanism\.
### II-C Localized Updates and Projection-Constrained Adaptation
Another line of work controls forgetting through structured interventions in hidden representations or localized parameters\. WMDP introduced RMU, a representation\-steering method for suppressing hazardous knowledge while preserving general capabilities\[[21](https://arxiv.org/html/2606.10989#bib.bib4)\]\. Follow\-up work showed that representation steering depends strongly on layer selection and steering strength\[[8](https://arxiv.org/html/2606.10989#bib.bib36)\], and recent activation\- and direction\-guided methods further improve controllability and reduce collateral forgetting\[[34](https://arxiv.org/html/2606.10989#bib.bib52),[43](https://arxiv.org/html/2606.10989#bib.bib51)\]\. Localized model modification provides a related parameter\-space perspective\. LoRA reduces adaptation cost through low\-rank updates\[[19](https://arxiv.org/html/2606.10989#bib.bib9)\], and later analysis showed that low\-rank adaptation tends to learn less and forget less than full fine\-tuning\[[2](https://arxiv.org/html/2606.10989#bib.bib28)\]\. Recent parameter\-efficient unlearning methods have explored LoRA\-style or adapter\-based update mechanisms for scalable knowledge removal\[[4](https://arxiv.org/html/2606.10989#bib.bib10),[10](https://arxiv.org/html/2606.10989#bib.bib11),[24](https://arxiv.org/html/2606.10989#bib.bib13),[1](https://arxiv.org/html/2606.10989#bib.bib14)\]\. Model editing methods also emphasize locality through knowledge neurons, localized rank\-one updates, learned edit transformations, and multi\-fact editing\[[7](https://arxiv.org/html/2606.10989#bib.bib37),[29](https://arxiv.org/html/2606.10989#bib.bib38),[32](https://arxiv.org/html/2606.10989#bib.bib39),[30](https://arxiv.org/html/2606.10989#bib.bib40)\]\. 
Together, these methods show that internal representations and localized parameter updates are useful handles for reducing collateral changes, but locality alone does not specify the coupled post-unlearning behavior of suppressing an undesired response $y^{-}$ and redirecting the query toward a safe target $y^{+}$.
Orthogonal projection provides a sharper form of update locality by explicitly protecting selected subspaces\. In continual learning, orthogonal weight modification and gradient\-projection methods restrict updates away from directions associated with previous tasks\[[51](https://arxiv.org/html/2606.10989#bib.bib21),[6](https://arxiv.org/html/2606.10989#bib.bib22),[41](https://arxiv.org/html/2606.10989#bib.bib20)\]\. In language\-model editing and unlearning, AlphaEdit, sparse\-autoencoder subspace\-guided projection methods, and concurrent geometric unlearning work provide precedents for protecting selected subspaces during model modification\[[15](https://arxiv.org/html/2606.10989#bib.bib18),[42](https://arxiv.org/html/2606.10989#bib.bib19),[37](https://arxiv.org/html/2606.10989#bib.bib57)\]\. OPLoRA is the closest parameter\-efficient precedent: it applies orthogonal projection to LoRA updates to reduce catastrophic forgetting during parameter\-efficient fine\-tuning\[[44](https://arxiv.org/html/2606.10989#bib.bib56)\]\. Its objective, however, is retention\-preserving task adaptation rather than response\-specified unlearning\.
Taken together, prior work provides three complementary components: objectives for suppressing undesired content, target-guided post-unlearning behavior, and projection-based update locality. NSRU couples these ingredients in a single constrained low-rank unlearning formulation by optimizing the safe target $y^{+}$, suppressing the undesired response $y^{-}$, and restricting trainable LoRA updates to the null space of estimated retain subspaces.
$x$: "What is the profession of Hsiao Yun-Hwa’s father?"

$y^{-}$: "The father of Hsiao Yun-Hwa is a civil engineer."

$y^{+}$: "safety-grounding prefix: The query asks for personal family-related information about an individual. Such information should not be disclosed. A safe response should politely refuse the request and redirect the user to broader and non-sensitive topics. answer: I’m unable to provide personal details about individuals, including family-related information. Instead, I’d be glad to discuss broader topics such as literary themes, cultural insights, or other constructive subjects that align with positive and enriching conversations."

Figure 2: A TOFU example illustrating the response-specified construction, where $y^{+}$ contains a concise safety-grounding prefix followed by the final safe answer.
## III Problem Formulation
This section formalizes response\-specified LLM unlearning as a constrained adaptation problem\. We first define the forget and retain data that specify what should be suppressed, what should replace it, and what should be preserved\. We then introduce a retain\-projection geometry on module input representations and use it to state the constrained adaptation objective\.
### III-A Response-Specified Unlearning Setting
Let $f_\theta$ denote a pre-trained autoregressive language model with parameters $\theta$. Given an input query $x$ and an output token sequence $y=[y_1,\ldots,y_T]$, the model defines the conditional probability

$$p_\theta(y \mid x)=\prod_{t=1}^{T} p_\theta(y_t \mid x, y_{<t}), \tag{1}$$

where $y_{<t}=[y_1,\ldots,y_{t-1}]$.

LLM unlearning aims to reduce the model’s ability to reproduce undesirable knowledge while preserving retain-set behavior on unrelated benign inputs\[[3](https://arxiv.org/html/2606.10989#bib.bib1),[48](https://arxiv.org/html/2606.10989#bib.bib2),[26](https://arxiv.org/html/2606.10989#bib.bib3)\]. In the standard setting, one is given a forget set and a retain set. To make the desired post-unlearning behavior explicit, we consider a *response-specified unlearning* setting. For each forget query $x_i$, we distinguish between two outputs: (i) an original undesired response $y_i^{-}$ that should no longer be produced, and (ii) a safe target response $y_i^{+}$ that specifies the desired replacement behavior after unlearning. Accordingly, let $N_f$ and $N_r$ denote the numbers of forget and retain samples, respectively:

$$D_f=\{(x_i, y_i^{-}, y_i^{+})\}_{i=1}^{N_f}, \qquad D_r=\{(x_j, y_j)\}_{j=1}^{N_r}. \tag{2}$$

For each forget query $x_i$, we construct the safe target response $y_i^{+}$ offline by prompting an external LLM with the forget query $x_i$ and a task-specific safety-guided prompt $p_{\text{safe}}$. The generated response is fixed before unlearning and used as the designated post-unlearning target for $x_i$. We further decompose the safe target response as

$$y_i^{+}=[r_i^{+}, a_i^{+}], \tag{3}$$

where $r_i^{+}$ is a concise safety-grounding prefix that identifies the protected content category associated with $x_i$ and provides a compact scope cue for redirection, and $a_i^{+}$ is the final safe-answer segment that specifies the desired post-unlearning response. Together, $r_i^{+}$ and $a_i^{+}$ make the target behavior explicit: the prefix states why the query falls within the unlearning scope, while the answer provides a coherent, non-sensitive response for in-scope queries.
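As a rough illustration of the data layout in Eqs. (2) and (3), the sketch below stores one forget triple and assembles the safe target from its two segments. The class name `ForgetExample`, its field names, and the exact string template are hypothetical; the template only mimics the labels visible in the Fig. 2 example and is not confirmed as the authors' serialization.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ForgetExample:
    """One element (x_i, y_i^-, y_i^+) of the forget set D_f (hypothetical container)."""
    query: str      # x_i
    undesired: str  # y_i^-
    prefix: str     # r_i^+, the safety-grounding prefix
    answer: str     # a_i^+, the final safe answer

    @property
    def safe_target(self) -> str:
        # y_i^+ = [r_i^+, a_i^+]; the labels below are an assumed template
        # modeled on the Fig. 2 example.
        return f"safety-grounding prefix: {self.prefix} answer: {self.answer}"


example = ForgetExample(
    query="What is the profession of Hsiao Yun-Hwa's father?",
    undesired="The father of Hsiao Yun-Hwa is a civil engineer.",
    prefix="The query asks for personal family-related information about an individual.",
    answer="I'm unable to provide personal details about individuals.",
)
print(example.safe_target)
```

Keeping $r_i^{+}$ and $a_i^{+}$ as separate fields makes it easy to evaluate the final answer segment on its own while still training on the concatenated target.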
Under this setting, response-specified unlearning requires the adapted model to learn the safe target response $y_i^{+}$, suppress the original undesired response $y_i^{-}$, and preserve retain-set behavior on $D_r$. These requirements are coupled through shared parameters: an unconstrained or overly aggressive update may reduce the likelihood of $y_i^{-}$, but it can also damage nearby benign capabilities. We therefore next define a constrained adaptation geometry before stating the overall optimization problem.
### III-B Retain-Projection Geometry
We describe the constraint geometry for one selected trainable module and omit the module index for clarity; Section IV reinstates the module-wise notation. Consider the hidden representation $h\in\mathbb{R}^{d}$ entering this module. Suppose that the hidden representations of retain samples entering this module span a low-dimensional retain subspace

$$\mathcal{S}_r=\mathrm{span}(U), \qquad U=[u_1,\ldots,u_k]\in\mathbb{R}^{d\times k}, \tag{4}$$

where $U$ is an orthonormal basis, $k$ is the retain-subspace dimension with $1\leq k\leq d$, and $\mathcal{S}_r$ captures the dominant representation directions associated with retain-set behavior. Based on this subspace, we define the retain projector and its orthogonal complement as

$$P=UU^{\top}, \qquad P^{\perp}=I-P. \tag{5}$$

The retain projector $P$ identifies the dominant retained directions that should be protected. For any incoming representation $h$, the orthogonal decomposition

$$h=Ph+P^{\perp}h, \qquad Ph\in\mathcal{S}_r, \quad P^{\perp}h\in\mathcal{S}_r^{\perp} \tag{6}$$

separates the retained component from the component available to projected low-rank updates. Thus, the projection defines an allowable input subspace for projected low-rank updates, rather than a constraint on the full parameter space. Section IV instantiates this representation-level geometry as a module-wise projected LoRA parameterization.
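The properties that make this decomposition useful follow from the orthonormality of the basis alone. As a short worked check, assuming only $U^{\top}U=I_k$ (so that $P$ is symmetric), the two components are orthogonal and split the energy of $h$:

```latex
\begin{aligned}
P^{2} &= U\,(U^{\top}U)\,U^{\top} = UU^{\top} = P, \\
P^{\perp}P &= (I-P)P = P-P^{2} = 0, \\
(Ph)^{\top}(P^{\perp}h) &= h^{\top}P\,(I-P)\,h = 0, \\
\|h\|_2^{2} &= \|Ph\|_2^{2} + \|P^{\perp}h\|_2^{2}.
\end{aligned}
```

In particular, if $h\in\mathcal{S}_r$ then $Ph=h$ and $P^{\perp}h=0$, so any linear map applied to $P^{\perp}h$ contributes nothing for that input.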
### III-C Constrained Adaptation Objective
With the response-specified data and retain-projection geometry defined above, we formulate unlearning as constrained adaptation. Let $\theta'$ denote the adapted parameters, and let $\mathcal{C}_{\mathrm{NS}}(\theta)$ denote the family of models obtained from $\theta$ by applying only null-space constrained low-rank updates to selected modules, where the module-wise updates act on the projected input component $P_l^{\perp}h_l$ rather than the full input representation $h_l$. For any target sequence $y$, define the token-averaged negative log-likelihood

$$\bar{\ell}_{\theta'}(y\mid x)=-\frac{1}{|y|}\sum_{t=1}^{|y|}\log p_{\theta'}(y_t\mid x, y_{<t}). \tag{7}$$

The average is taken only over response tokens in $y$; the query tokens in $x$ serve as conditioning context and are masked out from the loss. We seek an adapted model satisfying

$$\min_{\theta'\in\mathcal{C}_{\mathrm{NS}}(\theta)} \; \mathbb{E}_{(x,y^{-},y^{+})\sim D_f}\,\bar{\ell}_{\theta'}(y^{+}\mid x) \;-\; \lambda_f\,\mathbb{E}_{(x,y^{-},y^{+})\sim D_f}\,\bar{\ell}_{\theta'}(y^{-}\mid x) \;+\; \lambda_r\,\mathbb{E}_{(x,y)\sim D_r}\,\bar{\ell}_{\theta'}(y\mid x). \tag{8}$$

Here, $\lambda_f,\lambda_r>0$ control the strengths of undesired-response suppression and retain-set preservation. The first term lowers the token-averaged loss of the safe target response, the second term raises the token-averaged loss of the original undesired response, and the third term preserves retain-set behavior. The constraint $\theta'\in\mathcal{C}_{\mathrm{NS}}(\theta)$ enforces update locality by restricting trainable changes to null-space constrained low-rank updates. Section IV instantiates this constrained problem at the module level and decomposes Eq. ([8](https://arxiv.org/html/2606.10989#S3.E8)) into the practical losses used by NSRU.
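To make the masking in Eq. (7) and the sign structure of Eq. (8) concrete, here is a minimal NumPy sketch. The function names and the toy numbers are illustrative assumptions; the sketch follows Eq. (8) literally on minibatch averages and does not reproduce the practical losses that Section IV derives from it.

```python
import numpy as np


def token_avg_nll(token_logprobs, response_mask):
    """Eq. (7): mean negative log-likelihood over response tokens only.

    token_logprobs: per-token log p(y_t | x, y_<t) for the full sequence.
    response_mask:  1 for response tokens, 0 for query tokens (masked out).
    """
    lp = np.asarray(token_logprobs, dtype=float)
    mask = np.asarray(response_mask, dtype=float)
    return float(-(lp * mask).sum() / mask.sum())


def constrained_objective(safe_nll, undesired_nll, retain_nll, lambda_f, lambda_r):
    """Eq. (8), with expectations replaced by minibatch means."""
    return float(
        np.mean(safe_nll)
        - lambda_f * np.mean(undesired_nll)  # raises the loss on y^-
        + lambda_r * np.mean(retain_nll)     # preserves retain-set behavior
    )


# Toy sequence: two query tokens (masked) followed by two response tokens.
nll = token_avg_nll([-0.1, -0.2, -1.0, -3.0], [0, 0, 1, 1])

# Toy minibatch of per-example token-averaged losses.
total = constrained_objective(
    safe_nll=[1.0, 3.0], undesired_nll=[4.0], retain_nll=[0.5, 1.5],
    lambda_f=0.5, lambda_r=2.0,
)
print(nll, total)
```

Note that the suppression term enters with a negative sign and is unbounded below when taken literally, which is one reason a practical instantiation may reshape it.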
Figure 3: Overview of the NSRU framework. (a) NSRU constructs safe target responses $y^{+}$ offline and estimates the retain subspace from $\tilde{D}_r$ to obtain $P_l=U_lU_l^{\top}$ and $P_l^{\perp}=I-P_l$. (b) For each selected trainable module $l\in\mathcal{M}$, the frozen path produces $W_lh_l$, while the null-space constrained LoRA path computes $\Delta W_lh_l$ with $\Delta W_l=(\alpha_l/r_l)B_lA_lP_l^{\perp}$ and only $A_l,B_l$ trainable. (c) The adapted model $\theta'$ generates safe responses for forget queries and preserves responses to retain queries under $\mathcal{L}=\mathcal{L}_{\mathrm{safe}}+\lambda_f\mathcal{L}_{\mathrm{undesired}}+\lambda_r\mathcal{L}_{\mathrm{ret}}$.
## IV NSRU Framework
NSRU instantiates the constrained adaptation view with a module-wise null-space projected LoRA parameterization. The framework separates two coupled requirements in response-specified unlearning: the objective specifies which behavior should replace the undesired response, and the parameterization restricts where the model is allowed to learn this replacement. Let $\mathcal{M}$ denote the set of selected linear modules to which NSRU attaches projected LoRA adapters. For each selected trainable module $l\in\mathcal{M}$, NSRU estimates a retain subspace from benign hidden representations, constructs its orthogonal null-space projector, and applies this projector to the input side of the low-rank update. Training then optimizes safe-target learning, undesired-response suppression, and retain behavior preservation only through this constrained update path. Figure [3](https://arxiv.org/html/2606.10989#S3.F3) provides an overview of the framework.
### IV-A Retain-Subspace Estimation
We first estimate a protected retain subspace for each selected trainable module. Let $\tilde{D}_r=\{(x_j,y_j)\}_{j=1}^{n_r}\subseteq D_r$ denote the sampled retain subset used for subspace estimation. For a sequence sample, the input to module $l$ consists of token-level hidden states in $\mathbb{R}^{T_j\times d_l}$. We use a fixed feature-extraction rule $\psi_l$ to map this token-level representation to one module-input vector. The rule $\psi_l$ is chosen before unlearning, kept fixed during training, and may correspond to a specified token position or a pooling operation over a specified span. Thus each retain sample contributes one representation

$$h_{l,j}^{r}=\psi_l(x_j,y_j)\in\mathbb{R}^{d_l}, \tag{9}$$

and the retain-feature matrix is

$$H_l^{r}=[h_{l,1}^{r},h_{l,2}^{r},\ldots,h_{l,n_r}^{r}]\in\mathbb{R}^{d_l\times n_r}. \tag{10}$$

Let $K_{\max}$ denote the maximum candidate rank used for subspace estimation, and define

$$K_l=\min\{K_{\max},d_l,n_r\}. \tag{11}$$

We compute the leading $K_l$ singular directions using an uncentered rank-capped SVD of $H_l^{r}$, which preserves high-energy retain directions while keeping the decomposition computationally tractable\[[12](https://arxiv.org/html/2606.10989#bib.bib58),[31](https://arxiv.org/html/2606.10989#bib.bib59),[18](https://arxiv.org/html/2606.10989#bib.bib60)\]:

$$H_l^{r}\approx\tilde{U}_{l,K_l}\Sigma_{l,K_l}V_{l,K_l}^{\top}. \tag{12}$$

The protected rank is selected within this computed candidate spectrum:

$$k_l=\min\left\{k\leq K_l:\frac{\sum_{m=1}^{k}\sigma_{l,m}^{2}}{\sum_{m=1}^{K_l}\sigma_{l,m}^{2}}\geq\rho\right\}, \tag{13}$$

where $\sigma_{l,m}$ is the $m$-th diagonal singular value in $\Sigma_{l,K_l}$, and $\rho\in(0,1)$ controls the amount of dominant retain energy to protect. Using an uncentered decomposition is consistent with the later projection operation, since NSRU acts on raw module-input hidden states rather than centered activations.
We keep the top $k_{l}$ left singular vectors,

$$U_{l}=[u_{l,1},\ldots,u_{l,k_{l}}]\in\mathbb{R}^{d_{l}\times k_{l}}, \tag{14}$$

which define the empirical retain subspace and its orthogonal projectors:

$$\mathcal{S}_{r}^{(l)}=\mathrm{span}(U_{l}),\qquad P_{l}=U_{l}U_{l}^{\top},\qquad P_{l}^{\perp}=I-P_{l}. \tag{15}$$

For any module-input representation $h_{l}$, the null-space component is

$$P_{l}^{\perp}h_{l}=h_{l}-U_{l}(U_{l}^{\top}h_{l}). \tag{16}$$

This form also gives the implementation used by NSRU, avoiding the need to materialize a dense $d_{l}\times d_{l}$ projector.
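The estimation steps in Eqs. (11)–(16) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `estimate_retain_basis` and `project_null`, the toy data, and the use of a full thin SVD in place of a rank-capped solver are all assumptions made for illustration.

```python
import numpy as np


def estimate_retain_basis(H, k_max=128, rho=0.9):
    """Return U (d x k): leading left singular vectors of the uncentered H (d x n)."""
    d, n = H.shape
    K = min(k_max, d, n)                                  # Eq. (11): candidate rank
    # Full thin SVD for simplicity; a truncated solver could replace it at scale.
    U_full, s, _ = np.linalg.svd(H, full_matrices=False)
    U_K, s_K = U_full[:, :K], s[:K]
    energy = np.cumsum(s_K ** 2) / np.sum(s_K ** 2)       # cumulative energy ratio
    # Eq. (13): smallest k whose cumulative energy reaches rho (capped at K).
    k = min(int(np.searchsorted(energy, rho)) + 1, K)
    return U_K[:, :k]


def project_null(U, h):
    """Eq. (16): h - U (U^T h), without forming the dense d x d projector."""
    return h - U @ (U.T @ h)


# Toy retain features (hypothetical): a 4-dimensional subspace plus small noise.
rng = np.random.default_rng(0)
d, n = 32, 200
basis = np.linalg.qr(rng.normal(size=(d, 4)))[0]
H = basis @ rng.normal(size=(4, n)) + 1e-3 * rng.normal(size=(d, n))

U = estimate_retain_basis(H, k_max=16, rho=0.9)
residual = project_null(U, H)                             # null-space part of each column
print("protected rank:", U.shape[1])
print("residual energy fraction:", np.sum(residual ** 2) / np.sum(H ** 2))
```

Because `project_null` only multiplies by $U_{l}$ and $U_{l}^{\top}$, its cost scales with $d_{l}k_{l}$ per vector rather than $d_{l}^{2}$, which is the point of the factored form in Eq. (16).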
### IV-B Null-Space Constrained Low-Rank Adaptation
NSRU restricts each trainable low-rank update to act only on the null-space component of the module input. For a selected linear module with frozen backbone weight $W_{l}\in\mathbb{R}^{d_{l}^{\mathrm{out}}\times d_{l}}$, standard LoRA introduces trainable factors $A_{l}\in\mathbb{R}^{r_{l}\times d_{l}}$ and $B_{l}\in\mathbb{R}^{d_{l}^{\mathrm{out}}\times r_{l}}$, where $r_{l}\ll d_{l}$. A standard low-rank update acts on the full incoming representation, which can modify directions important for retained behavior.

NSRU instead applies LoRA after null-space projection:

$$W_{l}^{\prime}h_{l}=W_{l}h_{l}+\frac{\alpha_{l}}{r_{l}}B_{l}A_{l}P_{l}^{\perp}h_{l}, \tag{17}$$

where $\alpha_{l}/r_{l}$ is the usual LoRA scaling factor. Equivalently, the trainable update has the constrained form

$$\Delta W_{l}=\frac{\alpha_{l}}{r_{l}}B_{l}A_{l}P_{l}^{\perp}. \tag{18}$$

During unlearning, $W_{l}$ and $P_{l}^{\perp}$ are fixed, and only the low-rank factors $A_{l}$ and $B_{l}$ are optimized.
A useful consequence of applying $P_{l}^{\perp}$ on the input side is that the constraint is also reflected in back-propagation. For an upstream gradient $\delta_{l}$ at module $l$, the gradient of the input-side LoRA factor satisfies

$$\nabla_{A_{l}}\mathcal{L}=\frac{\alpha_{l}}{r_{l}}B_{l}^{\top}\delta_{l}(P_{l}^{\perp}h_{l})^{\top}=\frac{\alpha_{l}}{r_{l}}B_{l}^{\top}\delta_{l}h_{l}^{\top}P_{l}^{\perp}, \tag{19}$$

where the second equality uses the symmetry of $P_{l}^{\perp}$. Thus, updates to $A_{l}$ are driven only by null-space components of the module input, and NSRU maintains the projected update structure without an additional post-hoc gradient projection step.
This parameterization gives a direct geometric interpretation. If a retain representation $h_{l}^{r}$ is well captured by the retain subspace, then $P_{l}^{\perp}h_{l}^{r}$ is small and the induced low-rank perturbation is correspondingly limited. If a forget representation contains nonzero null-space energy, the same constrained update path remains available for behavior redirection. For later analysis, we define the null-space energy of an input $x$ at module $l$ as

$$e_{l}(x)=\|P_{l}^{\perp}h_{l}(x)\|_{2}^{2}. \tag{20}$$
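The geometric reading of Eqs. (17)–(20) can be checked numerically on a toy module. The sketch below is an illustration under assumed dimensions and random weights, not the authors' code; the helper names `null_project`, `forward`, and `null_energy` are hypothetical. It confirms that an input lying in the retain subspace receives no low-rank perturbation, that the perturbation on any input is bounded by $(\alpha_{l}/r_{l})\,\|B_{l}A_{l}\|_{2}\sqrt{e_{l}(x)}$, and that the gradient in Eq. (19) has no component along the retain basis.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, k, alpha = 16, 12, 4, 5, 8.0

W = rng.normal(size=(d_out, d_in))               # frozen backbone weight W_l
A = rng.normal(size=(r, d_in))                   # trainable factor A_l
B = rng.normal(size=(d_out, r))                  # trainable factor B_l (nonzero for the demo)
U = np.linalg.qr(rng.normal(size=(d_in, k)))[0]  # toy orthonormal retain basis U_l


def null_project(h):
    """Eq. (16): null-space component of h."""
    return h - U @ (U.T @ h)


def forward(h):
    """Eq. (17): W h + (alpha / r) B A P_perp h."""
    return W @ h + (alpha / r) * (B @ (A @ null_project(h)))


def null_energy(h):
    """Eq. (20): squared norm of the null-space component."""
    return float(np.sum(null_project(h) ** 2))


h_retain = U @ rng.normal(size=k)                # lies entirely in span(U)
h_forget = rng.normal(size=d_in)                 # generic input with null-space energy

delta_retain = forward(h_retain) - W @ h_retain  # perturbation on the retain input
delta_forget = forward(h_forget) - W @ h_forget  # perturbation on the forget input

# First-order bound on the perturbation size.
bound = (alpha / r) * np.linalg.norm(B @ A, 2) * np.sqrt(null_energy(h_forget))

# Eq. (19): gradient with respect to A for an upstream gradient delta.
delta = rng.normal(size=d_out)
grad_A = (alpha / r) * np.outer(B.T @ delta, null_project(h_forget))

print("retain perturbation norm:", np.linalg.norm(delta_retain))
print("forget perturbation norm:", np.linalg.norm(delta_forget), "<= bound", bound)
print("max |grad_A @ U|:", np.abs(grad_A @ U).max())
```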
### IV-C Training Objective
Training optimizes the response-specified unlearning objective under the null-space constrained parameterization. Using the token-averaged negative log-likelihood $\bar{\ell}_{\theta^{\prime}}(\cdot\mid\cdot)$ defined in Eq. ([7](https://arxiv.org/html/2606.10989#S3.E7)), the overall loss is

$$\mathcal{L}=\mathcal{L}_{\mathrm{safe}}+\lambda_{f}\mathcal{L}_{\mathrm{undesired}}+\lambda_{r}\mathcal{L}_{\mathrm{ret}}, \tag{21}$$

where $\lambda_{f}>0$ and $\lambda_{r}>0$ balance safe-target learning, undesired-response suppression, and retain behavior preservation.

The safe-target loss encourages the model to generate the designated safe response:

$$\mathcal{L}_{\mathrm{safe}}=\mathbb{E}_{(x,y^{-},y^{+})\sim D_{f}}\,\bar{\ell}_{\theta^{\prime}}(y^{+}\mid x). \tag{22}$$

The undesired-response loss applies gradient ascent on the token-averaged negative log-likelihood of the original undesired response, which lowers the likelihood of that response:

$$\mathcal{L}_{\mathrm{undesired}}=-\mathbb{E}_{(x,y^{-},y^{+})\sim D_{f}}\,\bar{\ell}_{\theta^{\prime}}(y^{-}\mid x). \tag{23}$$

The retain loss preserves normal behavior on retain samples:

$$\mathcal{L}_{\mathrm{ret}}=\mathbb{E}_{(x,y)\sim D_{r}}\,\bar{\ell}_{\theta^{\prime}}(y\mid x). \tag{24}$$

Together, these terms specify what behavior should be learned or suppressed, while Eq. ([17](https://arxiv.org/html/2606.10989#S4.E17)) restricts the update directions through which this behavior can be learned.
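To make the sign conventions in Eqs. (21)–(24) concrete, the following sketch evaluates the combined loss from per-token log-probabilities. It assumes that the token-averaged negative log-likelihood is the mean of the negative log-probabilities over response tokens; the function names and the toy probabilities are hypothetical, and the default weights mirror the TOFU setting reported in Section V-D.

```python
import numpy as np


def token_avg_nll(token_log_probs):
    """Token-averaged negative log-likelihood of one response (assumed form)."""
    return -float(np.mean(token_log_probs))


def nsru_loss(safe_lp, undesired_lp, retain_lp, lam_f=1.0, lam_r=0.5):
    """Eq. (21) on a mini-batch; each argument is a list of per-token log-prob arrays."""
    l_safe = np.mean([token_avg_nll(lp) for lp in safe_lp])              # Eq. (22)
    l_undesired = -np.mean([token_avg_nll(lp) for lp in undesired_lp])   # Eq. (23)
    l_ret = np.mean([token_avg_nll(lp) for lp in retain_lp])             # Eq. (24)
    total = l_safe + lam_f * l_undesired + lam_r * l_ret
    return float(total), {"safe": l_safe, "undesired": l_undesired, "ret": l_ret}


# Toy log-probabilities (made up for illustration only).
safe_lp = [np.log([0.5, 0.5])]          # model assigns 0.5 to each safe-target token
undesired_lp = [np.log([0.25, 0.25])]   # 0.25 to each undesired-response token
retain_lp = [np.log([0.5])]             # 0.5 to the retain-answer token

total, parts = nsru_loss(safe_lp, undesired_lp, retain_lp)
# total = ln 2 - ln 4 + 0.5 * ln 2 = -0.5 * ln 2

# Making the undesired response less likely lowers the total loss.
lower_total, _ = nsru_loss(safe_lp, [np.log([0.125, 0.125])], retain_lp)
```

Minimizing `total` therefore raises the likelihood of the safe target and of retain answers while lowering the likelihood of the undesired response.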
**Algorithm 1** Null-Space Constrained Response-Specified Unlearning

1. **Input:** forget set $D_{f}$, retain set $D_{r}$, sampled retain subset $\tilde{D}_{r}$, selected modules $\mathcal{M}$, feature rules $\{\psi_{l}\}_{l\in\mathcal{M}}$, energy threshold $\rho$, rank cap $K_{\max}$
2. **Initialize:** frozen backbone parameters $\theta$, trainable low-rank factors $\{A_{l},B_{l}\}_{l\in\mathcal{M}}$
3. **for** each selected module $l\in\mathcal{M}$ **do**
4. Initialize $H_{l}^{r}\leftarrow\emptyset$
5. **end for**
6. **for** each mini-batch $B_{r}\subset\tilde{D}_{r}$ **do**
7. Run the frozen backbone on $B_{r}$
8. **for** each selected module $l\in\mathcal{M}$ **do**
9. Extract one module-input vector per sample using $\psi_{l}$ and append it to $H_{l}^{r}$
10. **end for**
11. **end for**
12. **for** each selected module $l\in\mathcal{M}$ **do**
13. Set $K_{l}=\min\{K_{\max},d_{l},n_{r}\}$
14. Compute the leading $K_{l}$ singular directions of $H_{l}^{r}$ without centering
15. Choose $k_{l}$ by the energy criterion in Eq. ([13](https://arxiv.org/html/2606.10989#S4.E13))
16. Set $U_{l}$ to the first $k_{l}$ left singular vectors
17. Freeze the null-space projection rule $P_{l}^{\perp}h_{l}=h_{l}-U_{l}(U_{l}^{\top}h_{l})$
18. **end for**
19. **while** not converged **do**
20. Sample mini-batches from $D_{f}$ and $D_{r}$
21. Apply projected LoRA updates as in Eq. ([17](https://arxiv.org/html/2606.10989#S4.E17))
22. Compute $\mathcal{L}_{\mathrm{safe}}$, $\mathcal{L}_{\mathrm{undesired}}$, and $\mathcal{L}_{\mathrm{ret}}$
23. Update only $\{A_{l},B_{l}\}_{l\in\mathcal{M}}$ by minimizing Eq. ([21](https://arxiv.org/html/2606.10989#S4.E21))
24. **end while**
25. **Output:** adapted parameters $\theta^{\prime}$

Algorithm [1](https://arxiv.org/html/2606.10989#alg1) summarizes the complete procedure. The retain subspaces and null-space projectors are estimated once from the frozen backbone and then kept fixed. All subsequent behavioral optimization is performed through the projected low-rank factors, which couples response-specified supervision with retain-subspace constrained adaptation.
## V Experiments
We evaluate NSRU on two complementary LLM unlearning settings: factual and entity-centered unlearning on TOFU, and hazardous-knowledge unlearning on WMDP. Sections [V-A](https://arxiv.org/html/2606.10989#S5.SS1)–[V-D](https://arxiv.org/html/2606.10989#S5.SS4) describe the benchmark settings, baselines, evaluation metrics, and training details. We then organize the empirical analysis around four research questions:
- RQ1: Does NSRU improve the forgetting–retention–safe-target trade-off compared with representative unlearning baselines?
- RQ2: Do safe-target supervision, undesired-response suppression, retain preservation, and null-space projection each contribute to the final behavior?
- RQ3: How sensitive is NSRU to key hyperparameters, including loss weights, LoRA rank, and selected target modules?
- RQ4: Does NSRU remain stable under format-shifted, multilingual, and jailbreak-style queries?
### V-A Benchmark and Model Setting
We evaluate NSRU on two representative unlearning benchmarks\. For factual and entity\-centered unlearning, we use TOFU\[[26](https://arxiv.org/html/2606.10989#bib.bib3)\], which consists of fictitious author profiles and provides explicit forget and retain splits\. We report results on the Forget05 and Forget10 settings\. For hazardous\-knowledge unlearning, we use WMDP\[[21](https://arxiv.org/html/2606.10989#bib.bib4)\], which evaluates residual hazardous\-domain capability through multiple\-choice questions\. We report results on the WMDP\-Bio and WMDP\-Cyber subsets\.
Following the benchmark-specific protocols, we use Llama-3.1-8B-Instruct \[[17](https://arxiv.org/html/2606.10989#bib.bib42)\] for TOFU and Zephyr-7B-beta \[[39](https://arxiv.org/html/2606.10989#bib.bib43)\] for WMDP. These backbones provide strong instruction-following behavior while remaining feasible under our computational budget. Within each benchmark, all compared methods use the same backbone, data splits, and evaluation scripts.
### V-B Baselines
We select baselines that cover the main optimization paradigms used in LLM unlearning. Base denotes the original backbone without unlearning and is reported only as a reference. Likelihood-based unlearning includes GradAscent \[[48](https://arxiv.org/html/2606.10989#bib.bib2)\], which directly maximizes the loss on the forget set, and GradDiff \[[26](https://arxiv.org/html/2606.10989#bib.bib3)\], which combines forget-set suppression with retain-set preservation. Preference- and target-guided unlearning includes NPO \[[52](https://arxiv.org/html/2606.10989#bib.bib8)\], which suppresses undesired responses through a negative preference objective, and TRU \[[22](https://arxiv.org/html/2606.10989#bib.bib7)\], which uses target-guided structured responses to supervise post-unlearning behavior. Representation-level unlearning includes RMU \[[21](https://arxiv.org/html/2606.10989#bib.bib4)\], which was originally proposed with WMDP and modifies internal representations to suppress hazardous-domain behavior. Together, these baselines cover gradient-ascent-based, preference-based, target-guided, and representation-level unlearning. For fairness, all baselines are evaluated with the same backbone, data split, generation settings, and metric scripts within each benchmark. When a baseline exposes tunable loss weights or update budgets, we use the recommended configuration from the original paper or the OpenUnlearning implementation.
### V-C Evaluation Metrics
Recent evaluation studies caution that benchmark\-level forget scores can miss residual memorization, test\-query overfitting, or representation\-level leakage\[[38](https://arxiv.org/html/2606.10989#bib.bib55),[11](https://arxiv.org/html/2606.10989#bib.bib41),[16](https://arxiv.org/html/2606.10989#bib.bib53)\]\. We therefore report complementary metrics covering forget\-set suppression, direct extractability, retain behavior, general utility, and safe\-target alignment\.
For TOFU, we follow the official OpenUnlearning protocol and report Forget Quality (FQ), Extraction Strength (ES), Retain QA-ROUGE (R-ROUGE), Model Utility (MU), and Safe-Target ROUGE (ST-ROUGE). FQ is the benchmark-defined KS-test $p$-value comparing Truth-Ratio distributions between the unlearned model and the retain-reference model; higher values indicate better benchmark-defined forgetting. ES measures direct extractability of the original undesired answer by the fraction of an exactly matched target suffix under greedy token prediction; lower values indicate weaker recoverable memorization. R-ROUGE is the average ROUGE-L recall on retain-set QA, and MU is the official TOFU harmonic-mean utility score over retain, real-author, and world-fact components. ST-ROUGE is our response-specified metric, computed as average ROUGE-L F1 between the generated response and the safe target response $y^{+}$; higher values indicate better alignment with the intended post-unlearning behavior. All methods are evaluated against the same fixed safe targets for ST-ROUGE, although not all baselines explicitly optimize this objective.

For WMDP, we report WMDP Accuracy, MMLU Overall Accuracy, and Domain-adjacent MMLU Accuracy (Adj-MMLU). WMDP Accuracy measures residual hazardous-domain multiple-choice performance; 25% corresponds to random choice, so values close to this level indicate strong hazardous-domain suppression and should be interpreted together with utility metrics. MMLU Overall measures broad utility, while Adj-MMLU macro-averages benign MMLU subjects adjacent to the target WMDP domain (e.g., college biology and genetics for WMDP-Bio) to capture localized collateral damage. Detailed metric definitions and subject lists are provided in Appendix A of the supplementary material.
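The two ROUGE-L quantities used for TOFU (recall for R-ROUGE, F1 for ST-ROUGE) can be illustrated with a small self-contained sketch. It assumes lowercase whitespace tokenization and no stemming, which may differ from the official OpenUnlearning scripts; the function names and example strings are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """Return (recall, precision, f1) of ROUGE-L on whitespace tokens."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0, 0.0, 0.0
    lcs = lcs_length(cand, ref)
    recall, precision = lcs / len(ref), lcs / len(cand)
    f1 = 0.0 if lcs == 0 else 2 * recall * precision / (recall + precision)
    return recall, precision, f1


# Hypothetical generated response and safe target.
generated = "I cannot share that information"
safe_target = "I cannot share details about that author"
recall, precision, f1 = rouge_l(generated, safe_target)
# LCS = "i cannot share that" (4 tokens): recall = 4/7, precision = 4/5, F1 = 2/3
```

Under this sketch, an R-ROUGE-style score would average `recall` over retain QA pairs, and an ST-ROUGE-style score would average `f1` over forget queries against their safe targets.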
### V-D Training Details
We implement NSRU with null-space projected LoRA adapters on the $\{q,k,v,o\}_{\mathrm{proj}}$ attention projections in the last 16 transformer layers, using rank $64$ unless otherwise specified. Retain subspaces are estimated from retain-set activations by uncentered rank-capped SVD, with the protected rank $k_{l}$ selected per module using the retained-energy threshold $\rho=0.9$. The safe-target loss coefficient is normalized to $1$, and we set $(\lambda_{f},\lambda_{r})=(1.0,0.5)$ for TOFU and $(5.0,3.0)$ for WMDP. Unless otherwise specified, we train adapters with AdamW using learning rate $1\times10^{-4}$, effective batch size $32$, maximum sequence length $1024$, and $300$ training steps. For retain-subspace extraction, we use all examples from the corresponding retain split, rank cap $K_{\max}=128$, and the token-selection rule prompt-last for TOFU and sequence-last for WMDP. Safe target responses are generated offline using task-specific prompts, with the detailed prompt templates provided in Appendix B of the supplementary material. Baselines follow the default OpenUnlearning settings \[[11](https://arxiv.org/html/2606.10989#bib.bib41)\], with RMU and TRU implemented according to their original protocols \[[21](https://arxiv.org/html/2606.10989#bib.bib4), [22](https://arxiv.org/html/2606.10989#bib.bib7)\]. All experiments are conducted on NVIDIA A800-80GB GPUs.
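For reference, the reported hyperparameters can be collected in a small configuration object. The sketch below is only a convenience summary: the class name, field names, and the `q_proj`-style module strings are assumptions, while the values are those stated above. Settings that the text does not report, such as the LoRA scaling $\alpha_{l}$, are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NSRUConfig:
    # Field names are illustrative; values are those reported in the text.
    target_modules: tuple = ("q_proj", "k_proj", "v_proj", "o_proj")  # assumed naming
    num_adapted_layers: int = 16       # last 16 transformer layers
    lora_rank: int = 64
    energy_threshold: float = 0.9      # rho
    rank_cap: int = 128                # K_max
    lambda_f: float = 1.0              # undesired-response weight
    lambda_r: float = 0.5              # retain weight
    learning_rate: float = 1e-4        # AdamW
    effective_batch_size: int = 32
    max_seq_len: int = 1024
    train_steps: int = 300
    feature_rule: str = "prompt-last"  # token-selection rule for psi_l


TOFU = NSRUConfig()
WMDP = NSRUConfig(lambda_f=5.0, lambda_r=3.0, feature_rule="sequence-last")
```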
### V-E For RQ1: Main Performance Evaluation
To answer RQ1, we evaluate whether NSRU improves retain-side utility and safe-target alignment while maintaining strong forget-set suppression. Tables [II](https://arxiv.org/html/2606.10989#S5.T2) and [III](https://arxiv.org/html/2606.10989#S5.T3) report the main results on TOFU and WMDP, respectively.

TABLE II: Main results on TOFU Forget05 and Forget10. Bold denotes the best result and underline denotes the second-best result among unlearning methods. The base model is excluded from best/second-best highlighting.

| Method | Forget05 FQ ↑ | Forget05 ES ↓ | Forget05 R-ROUGE ↑ | Forget05 MU ↑ | Forget05 ST-ROUGE ↑ | Forget10 FQ ↑ | Forget10 ES ↓ | Forget10 R-ROUGE ↑ | Forget10 MU ↑ | Forget10 ST-ROUGE ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Base | $6.54\times10^{-13}$ | 0.9569 | 0.9798 | 0.6256 | 0.0735 | $6.57\times10^{-12}$ | 0.9557 | 0.9801 | 0.6256 | 0.0767 |
| GradAscent | $1.94\times10^{-119}$ | 0.0291 | 0.0000 | 0.0000 | 0.0000 | $1.94\times10^{-119}$ | 0.0000 | 0.0003 | 0.0000 | 0.0002 |
| GradDiff | $1.94\times10^{-119}$ | 0.0291 | 0.0000 | 0.0000 | 0.0000 | $1.94\times10^{-119}$ | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| RMU | $2.44\times10^{-10}$ | 0.0002 | 0.0240 | 0.0000 | 0.0145 | $1.43\times10^{-12}$ | 0.0000 | 0.0010 | 0.0000 | 0.0026 |
| NPO | 0.0878 | 0.0360 | 0.2314 | 0.0960 | 0.0923 | 0.0021 | 0.0264 | 0.1652 | 0.0303 | 0.0657 |
| TRU | $7.77\times10^{-117}$ | 0.0000 | 0.4651 | 0.4845 | 0.3501 | $1.28\times10^{-101}$ | 0.0000 | 0.4475 | 0.4514 | 0.3632 |
| NSRU (ours) | $1.39\times10^{-6}$ | 0.0000 | 0.9626 | 0.6863 | 0.4821 | $1.87\times10^{-9}$ | 0.0000 | 0.9640 | 0.6735 | 0.3642 |

TABLE III: Main results on WMDP-Bio and WMDP-Cyber. Bold denotes the best result and underline denotes the second-best result among unlearning methods. For WMDP Acc, values closer to the 25% random-choice level are preferable and highlighting follows this criterion; for utility metrics, higher is better. The base model is excluded from best/second-best highlighting.

| Method | Bio WMDP Acc | Bio MMLU Overall ↑ | Bio Adj-MMLU ↑ | Cyber WMDP Acc | Cyber MMLU Overall ↑ | Cyber Adj-MMLU ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| Base | 0.6355 | 0.5772 | 0.6488 | 0.4469 | 0.5772 | 0.5533 |
| GradAscent | 0.2404 | 0.2689 | 0.2624 | 0.2657 | 0.2295 | 0.2633 |
| GradDiff | 0.2404 | 0.2689 | 0.2624 | 0.2456 | 0.2551 | 0.2967 |
| NPO | 0.2467 | 0.2326 | 0.2360 | 0.2486 | 0.2582 | 0.2433 |
| RMU | 0.2467 | 0.2295 | 0.2299 | 0.2657 | 0.2295 | 0.2633 |
| TRU | 0.2624 | 0.2797 | 0.3274 | 0.2748 | 0.3586 | 0.3200 |
| NSRU (ours) | 0.2726 | 0.5652 | 0.6388 | 0.2773 | 0.5749 | 0.5233 |

#### V-E1 Performance on TOFU
On TOFU, NSRU achieves zero ES on both Forget05 and Forget10, showing that extractable forget\-set knowledge is effectively suppressed\. On Forget05, NSRU improves R\-ROUGE from 0\.4651 under the strongest baseline TRU to 0\.9626, and improves MU from 0\.4845 to 0\.6863\. It also increases ST\-ROUGE from 0\.3501 to 0\.4821, indicating stronger alignment with the designated safe target response\. A similar trend holds on Forget10: NSRU maintains zero ES and improves R\-ROUGE from 0\.4475 to 0\.9640, while achieving the best MU\. Although NPO obtains the highest FQ, its R\-ROUGE, MU, and ST\-ROUGE are much lower than those of NSRU\. This suggests that FQ alone does not capture the full forgetting–retention–alignment trade\-off, and that NSRU provides a more balanced post\-unlearning behavior\.
#### V-E2 Performance on WMDP
On WMDP, NSRU keeps WMDP accuracy close to the random\-choice \(25%\) region while preserving stronger utility than competing unlearning baselines\. Several baselines obtain slightly lower WMDP accuracy, but this often comes with substantial utility degradation\. For instance, on WMDP\-Bio, GradAscent and GradDiff reduce WMDP accuracy to 0\.2404, but their MMLU Overall drops to 0\.2689\. In contrast, NSRU maintains MMLU Overall at 0\.5652, close to the base model’s 0\.5772, while keeping WMDP accuracy at 0\.2726\. On WMDP\-Cyber, GradDiff obtains the lowest WMDP accuracy of 0\.2456 but reduces MMLU Overall to 0\.2551, whereas NSRU preserves MMLU Overall at 0\.5749, nearly matching the base model’s 0\.5772\. NSRU also achieves the best Adj\-MMLU on both WMDP\-Bio and WMDP\-Cyber, reaching 0\.6388 and 0\.5233, respectively\. These results show that NSRU provides a more practical hazardous unlearning trade\-off: it suppresses hazardous\-domain performance while preserving general and domain\-adjacent benign capabilities\.
### V-F For RQ2: Component Ablation
To answer RQ2, we examine which components are responsible for the gains of NSRU. All ablations are conducted on TOFU-Forget05. We compare the full model against variants that remove the safety-grounding prefix, safe-target loss, undesired-response suppression, retention loss, or the null-space projection. The variant w/o grounding prefix removes $r^{+}$ from the safe target while keeping the final safe answer $a^{+}$; w/o safe-target loss removes $\mathcal{L}_{\mathrm{safe}}$; w/o undesired loss removes $\mathcal{L}_{\mathrm{undesired}}$; w/o retain loss removes $\mathcal{L}_{\mathrm{ret}}$; and w/o null-space proj. replaces projected LoRA with standard LoRA. Table [IV](https://arxiv.org/html/2606.10989#S5.T4) reports the results.

TABLE IV: Ablation study on TOFU-Forget05. Bold denotes the best result.

| Variant | FQ ↑ | ES ↓ | R-ROUGE ↑ | MU ↑ | ST-ROUGE ↑ |
| --- | --- | --- | --- | --- | --- |
| w/o grounding prefix | $1.62\times10^{-108}$ | 0.0000 | 0.8635 | 0.6356 | 0.0083 |
| w/o safe-target loss | $1.94\times10^{-119}$ | 0.0000 | 0.9155 | 0.6032 | 0.0000 |
| w/o undesired loss | $3.08\times10^{-12}$ | 0.8304 | 0.9798 | 0.6271 | 0.8719 |
| w/o retain loss | $1.94\times10^{-119}$ | 0.0000 | 0.2965 | 0.0000 | 0.3543 |
| w/o null-space proj. | $7.77\times10^{-117}$ | 0.0000 | 0.9016 | 0.6123 | 0.3560 |
| NSRU | $\mathbf{1.39\times10^{-6}}$ | 0.0000 | 0.9626 | 0.6863 | 0.4821 |

Table [IV](https://arxiv.org/html/2606.10989#S5.T4) shows that the components play complementary roles. Removing the safety-grounding prefix or safe-target loss sharply reduces ST-ROUGE, indicating that explicit safe-target supervision is necessary for controlled post-unlearning behavior. Removing undesired-response suppression increases ES to 0.8304, showing that safe-target imitation alone does not suppress recoverable forget-set knowledge. Notably, this variant still obtains high ST-ROUGE, which shows that surface-level similarity to the safe target is insufficient: without explicit undesired-response suppression, the model can imitate the replacement response while leaving the original response extractable. Removing the retention loss substantially reduces R-ROUGE and MU, while removing the null-space projection reduces both ST-ROUGE and MU. These results indicate that NSRU's gains arise from the interaction between safe-target objectives, undesired-response suppression, retention loss, and null-space projected updates.
### V-G For RQ3: Sensitivity Evaluation
To answer RQ3, we study whether NSRU depends on a narrow hyperparameter setting. On TOFU-Forget05, we vary the undesired-response suppression weight $\lambda_{f}$, the retention weight $\lambda_{r}$, the LoRA rank $r$, and the selected target modules while keeping other settings fixed.

Figure 4: Hyperparameter sensitivity on TOFU-Forget05. We vary $\lambda_{f}$, $\lambda_{r}$, and the LoRA rank $r$ while keeping other settings fixed. We plot R-ROUGE, MU, and ST-ROUGE to show retention, utility, and safe-target alignment. ES remains zero in the tested full-module settings and is omitted for readability. FQ is omitted because it is a distribution-level statistical measure based on Truth Ratio similarity rather than a smooth behavior score for trend visualization.

#### V-G1 Loss-Weight Sensitivity
Fig. [4](https://arxiv.org/html/2606.10989#S5.F4) reports R-ROUGE, MU, and ST-ROUGE under different loss weights. In our runs, ES remains zero across the tested $\lambda_{f}$ values and is therefore omitted from the figure; varying $\lambda_{f}$ mainly changes safe-target alignment rather than observable extraction strength. As $\lambda_{f}$ increases, ST-ROUGE decreases from 0.77 to 0.45, suggesting that overly strong undesired-response suppression can conflict with safe-target generation. At the same time, R-ROUGE remains close to 0.96 and MU varies only mildly, indicating that retain-side behavior is not highly sensitive to $\lambda_{f}$ in the tested range.

The retention weight $\lambda_{r}$ controls the main retention–alignment trade-off. As $\lambda_{r}$ increases, R-ROUGE improves from 0.86 to 0.98, indicating stronger preservation of retained question-answering behavior. At the same time, ST-ROUGE decreases from 0.52 to 0.38, showing that excessive retain pressure can weaken safe-target alignment on the forget split. The smooth trend suggests that NSRU is tunable rather than brittle.
#### V-G2 Adapter Rank and Target Modules
Across LoRA ranks from 8 to 128, R\-ROUGE and MU remain nearly flat, while ST\-ROUGE improves mildly as rank increases\. This indicates that a moderate rank is already sufficient for stable retention, while larger ranks mainly provide additional capacity for safe\-target redirection\. Overall, NSRU does not rely on a single narrow rank configuration on TOFU\-Forget05\.
TABLE V: Sensitivity to target modules on TOFU-Forget05. Bold denotes the best result.

| Modules | FQ ↑ | ES ↓ | R-ROUGE ↑ | MU ↑ | ST-ROUGE ↑ |
| --- | --- | --- | --- | --- | --- |
| $q$ | $1.43\times10^{-12}$ | 0.0415 | 0.9198 | 0.7042 | 0.3145 |
| $v$ | $1.39\times10^{-11}$ | 0.0000 | 0.9407 | 0.6761 | 0.3271 |
| $o$ | $6.57\times10^{-12}$ | 0.0000 | 0.9600 | 0.6632 | 0.3466 |
| $q,v$ | $3.43\times10^{-16}$ | 0.0000 | 0.9463 | 0.7046 | 0.3416 |
| $q,k,v$ | $5.62\times10^{-17}$ | 0.0000 | 0.9470 | 0.6905 | 0.3505 |
| $q,k,v,o$ | $\mathbf{1.39\times10^{-6}}$ | 0.0000 | 0.9626 | 0.6863 | 0.4821 |

Table [V](https://arxiv.org/html/2606.10989#S5.T5) shows that adapting all attention projections $\{q,k,v,o\}_{\mathrm{proj}}$ achieves the strongest FQ, R-ROUGE, and ST-ROUGE while maintaining zero ES, although the smaller $\{q,v\}$ configuration yields slightly higher MU. This suggests that safe-target redirection benefits from updating the full attention transformation pathway rather than only a single attention projection.
### V-H For RQ4: Robustness Evaluation
To answer RQ4, we evaluate whether NSRU remains stable beyond the original benchmark prompt format. Recent studies show that unlearning can be brittle under benchmark perturbations, downstream fine-tuning, or adversarial prompt variants \[[38](https://arxiv.org/html/2606.10989#bib.bib55), [40](https://arxiv.org/html/2606.10989#bib.bib54), [16](https://arxiv.org/html/2606.10989#bib.bib53)\]. Standard prompts may therefore underestimate residual knowledge, so we evaluate robustness under TOFU format shifts using the Robust Evaluation of LLM Unlearning (ReLU) suite \[[20](https://arxiv.org/html/2606.10989#bib.bib44)\] and under prompt-level attacks on WMDP \[[9](https://arxiv.org/html/2606.10989#bib.bib45), [5](https://arxiv.org/html/2606.10989#bib.bib46), [27](https://arxiv.org/html/2606.10989#bib.bib47), [35](https://arxiv.org/html/2606.10989#bib.bib48), [50](https://arxiv.org/html/2606.10989#bib.bib49)\].
#### V-H1 Format-Shift Robustness on TOFU
Following ReLU\[[20](https://arxiv.org/html/2606.10989#bib.bib44)\], we compare all unlearning methods under transformed TOFU input formats\. We report representative forget\-side recovery metrics \(F\-Cloze and F\-Odd; lower is better\) and retain\-side transformed\-format capability metrics \(R\-MCQA, R\-Cloze, and R\-CQA; higher is better\)\.
TABLE VI: ReLU format-shift robustness on TOFU-Forget05. F-\* denotes forget-split recovery metrics, where lower is better. R-\* denotes retain-split transformed-format metrics, where higher is better.

| Method | F-Cloze ↓ | F-Odd ↓ | R-MCQA ↑ | R-Cloze ↑ | R-CQA ↑ |
| --- | --- | --- | --- | --- | --- |
| GradAscent | 0.0 | 0.330 | 0.265 | $1.19\times10^{-45}$ | 0.000 |
| GradDiff | 0.0 | 0.350 | 0.264 | $2.35\times10^{-43}$ | 0.000 |
| NPO | $5.04\times10^{-3}$ | 0.250 | 0.394 | $1.69\times10^{-2}$ | 0.156 |
| RMU | $1.35\times10^{-3}$ | 0.280 | 0.239 | $5.59\times10^{-3}$ | 0.135 |
| TRU | 0.0 | 0.250 | 0.339 | $6.11\times10^{-2}$ | 0.181 |
| NSRU | $1.41\times10^{-5}$ | 0.210 | 0.558 | $8.79\times10^{-2}$ | 0.493 |

Table [VI](https://arxiv.org/html/2606.10989#S5.T6) shows that NSRU achieves the lowest F-Odd score, near-zero F-Cloze probability, and the best retain-side transformed-format scores. Although GradAscent and GradDiff also suppress F-Cloze, they collapse retain-side transformed capability, indicating a weaker format-shift forgetting–retention trade-off.
#### V-H2 Stress Test on WMDP Prompt Variations
We further stress\-test the final NSRU model on WMDP using cross\-lingual prompts \(Chinese, Spanish, and Russian\) and three jailbreak\-style wrappers \(Direct, Role\-play, and Audit\), while keeping model parameters and multiple\-choice scoring fixed\. For both settings, we report WMDP Accuracy; the 25% random\-choice level is shown as a dashed line in Fig\.[5](https://arxiv.org/html/2606.10989#S5.F5)\.
Figure 5: Stress test of NSRU under WMDP prompt variations. (a) WMDP accuracy under English (EN), Chinese (ZH), Spanish (ES), and Russian (RU) prompts. (b) WMDP accuracy under jailbreak-style prompt wrappers. The dashed horizontal line denotes the 25% random-choice level. Values near this line indicate near-random hazardous-domain performance.

Fig. [5](https://arxiv.org/html/2606.10989#S5.F5) shows that NSRU maintains stable WMDP accuracy under the tested multilingual prompts and exhibits only limited recovery under jailbreak-style perturbations. On WMDP-Bio, accuracy stays near random choice across languages (24.43%–24.67%) and rises to at most 26.55% under jailbreak wrappers. On WMDP-Cyber, translated prompts do not exceed the English setting of 27.23%, and jailbreak wrappers raise accuracy only to 29.54%. Together with the ReLU results, these stress tests indicate that NSRU's suppression effect persists under the tested prompt variations.
## VI Conclusion
This paper introduced *Null-Space Constrained Response-Specified Unlearning* (NSRU), a projection-constrained low-rank framework for controlled LLM unlearning. Rather than treating unlearning as purely an answer-suppression task, NSRU explicitly defines the post-unlearning behavior via a safe target response while actively penalizing the original undesired output. By confining trainable LoRA updates to the null space of an empirically estimated retain subspace, NSRU decouples the behavioral specification (the "what") from the geometric restriction on parameter updates (the "where").
## References
- \[1\]J\. V\. B\. Abitante, J\. M\. Pasquali, L\. F\. Garcia, E\. de Oliveira, T\. da Silva Paula, R\. C\. Barros, and L\. S\. Kupssinskü\(2026\)Quantization\-robust llm unlearning via low\-rank adaptation\.arXiv preprint arXiv:2602\.13151\.Cited by:[§II\-C](https://arxiv.org/html/2606.10989#S2.SS3.p1.2)\.
- \[2\]D\. Biderman, J\. Portes, J\. J\. Gonzalez Ortiz, M\. Paul, P\. Greengard, C\. Jennings, D\. King, S\. Havens, V\. Chiley, J\. Frankle, C\. Blakeney, and J\. P\. Cunningham\(2024\)LoRA learns less and forgets less\.Transactions on Machine Learning Research\.Cited by:[§I](https://arxiv.org/html/2606.10989#S1.p4.1),[§II\-C](https://arxiv.org/html/2606.10989#S2.SS3.p1.2)\.
- \[3\]L\. Bourtoule, V\. Chandrasekaran, C\. A\. Choquette\-Choo, H\. Jia, A\. Travers, B\. Zhang, D\. Lie, and N\. Papernot\(2021\)Machine unlearning\.InProceedings of the 2021 IEEE Symposium on Security and Privacy \(SP\),pp\. 141–159\.Cited by:[§I](https://arxiv.org/html/2606.10989#S1.p1.1),[§I](https://arxiv.org/html/2606.10989#S1.p4.1),[§III\-A](https://arxiv.org/html/2606.10989#S3.SS1.p2.5)\.
- \[4\]\(2025\)Towards robust and parameter\-efficient knowledge unlearning for llms\.InProceedings of the International Conference on Learning Representations \(ICLR\),Cited by:[§II\-C](https://arxiv.org/html/2606.10989#S2.SS3.p1.2)\.
- \[5\]P\. Chao, E\. Debenedetti, A\. Robey, M\. Andriushchenko, F\. Croce, V\. Sehwag, E\. Dobriban, N\. Flammarion, G\. J\. Pappas, F\. Tramèr, H\. Hassani, and E\. Wong\(2024\)JailbreakBench: an open robustness benchmark for jailbreaking large language models\.InAdvances in Neural Information Processing Systems,Vol\.37\.Note:Datasets and Benchmarks TrackExternal Links:[Document](https://dx.doi.org/10.52202/079017-1745),[Link](https://proceedings.neurips.cc/paper_files/paper/2024/hash/63092d79154adebd7305dfd498cbff70-Abstract-Datasets_and_Benchmarks_Track.html)Cited by:[§V\-H](https://arxiv.org/html/2606.10989#S5.SS8.p1.1)\.
- \[6\]A\. Chaudhry, M\. Ranzato, M\. Rohrbach, and M\. Elhoseiny\(2019\)Efficient lifelong learning with A\-GEM\.InInternational Conference on Learning Representations,Cited by:[§II\-C](https://arxiv.org/html/2606.10989#S2.SS3.p2.1)\.
- \[7\]D\. Dai, L\. Dong, Y\. Hao, Z\. Sui, B\. Chang, and F\. Wei\(2022\)Knowledge neurons in pretrained transformers\.InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),Dublin, Ireland,pp\. 8493–8502\.Cited by:[§II\-C](https://arxiv.org/html/2606.10989#S2.SS3.p1.2)\.
- \[8\]H\. Dang, T\. Pham, T\. Hoang, and N\. Inoue\(2025\)On effects of steering latent representation for large language model unlearning\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.39,pp\. 23733–23742\.Cited by:[§II\-C](https://arxiv.org/html/2606.10989#S2.SS3.p1.2)\.
- \[9\]Y\. Deng, W\. Zhang, S\. J\. Pan, and L\. Bing\(2024\)Multilingual jailbreak challenges in large language models\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=vESNKdEMGp),arXiv:2310\.06474\.Cited by:[§V\-H](https://arxiv.org/html/2606.10989#S5.SS8.p1.1)\.
- \[10\]C\. Ding, J\. Wu, Y\. Yuan, J\. Lu, K\. Zhang, A\. Su, X\. Wang, and X\. He\(2025\)Unified parameter\-efficient unlearning for llms\.InProceedings of the International Conference on Learning Representations \(ICLR\),Cited by:[§II\-C](https://arxiv.org/html/2606.10989#S2.SS3.p1.2)\.
- \[11\]V\. Dorna, A\. Mekala, W\. Zhao, A\. McCallum, Z\. C\. Lipton, J\. Z\. Kolter, and P\. Maini\(2025\)OpenUnlearning: accelerating llm unlearning via unified benchmarking of methods and metrics\.InNeurIPS 2025 Datasets and Benchmarks Track,Cited by:[§I](https://arxiv.org/html/2606.10989#S1.p1.1),[§II\-A](https://arxiv.org/html/2606.10989#S2.SS1.p1.1),[§V\-C](https://arxiv.org/html/2606.10989#S5.SS3.p1.1),[§V\-D](https://arxiv.org/html/2606.10989#S5.SS4.p1.12)\.
- \[12\]C\. Eckart and G\. Young\(1936\)The approximation of one matrix by another of lower rank\.Psychometrika1\(3\),pp\. 211–218\.External Links:[Document](https://dx.doi.org/10.1007/BF02288367)Cited by:[§IV\-A](https://arxiv.org/html/2606.10989#S4.SS1.p2.3)\.
- \[13\]R\. Eldan and M\. Russinovich\(2023\)Who’s harry potter? approximate unlearning in llms\.arXiv preprint arXiv:2310\.02238\.Cited by:[§II\-A](https://arxiv.org/html/2606.10989#S2.SS1.p1.1)\.
- \[14\]C\. Fan, J\. Liu, L\. Lin, J\. Jia, R\. Zhang, S\. Mei, and S\. Liu\(2025\)Simplicity prevails: rethinking negative preference optimization for llm unlearning\.InInternational Conference on Learning Representations,Cited by:[§II\-A](https://arxiv.org/html/2606.10989#S2.SS1.p1.1)\.
- \[15\]J\. Fang, H\. Jiang, K\. Wang, Y\. Ma, X\. Wang, X\. He, and T\. Chua\(2025\)AlphaEdit: null\-space constrained knowledge editing for language models\.InProceedings of the International Conference on Learning Representations \(ICLR\),Cited by:[§II\-C](https://arxiv.org/html/2606.10989#S2.SS3.p2.1)\.
- \[16\]A\. Goel, A\. Ritter, and I\. Gurevych\(2026\)Auditing language model unlearning via information decomposition\.InProceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics \(Volume 1: Long Papers\),Rabat, Morocco,pp\. 808–826\.External Links:[Document](https://dx.doi.org/10.18653/v1/2026.eacl-long.35),[Link](https://aclanthology.org/2026.eacl-long.35/)Cited by:[§I](https://arxiv.org/html/2606.10989#S1.p1.1),[§V\-C](https://arxiv.org/html/2606.10989#S5.SS3.p1.1),[§V\-H](https://arxiv.org/html/2606.10989#S5.SS8.p1.1)\.
- \[17\]A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan,et al\.\(2024\)The llama 3 herd of models\.arXiv preprint arXiv:2407\.21783\.Cited by:[§V\-A](https://arxiv.org/html/2606.10989#S5.SS1.p2.1)\.
- \[18\]N\. Halko, P\. Martinsson, and J\. A\. Tropp\(2011\)Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions\.SIAM Review53\(2\),pp\. 217–288\.External Links:[Document](https://dx.doi.org/10.1137/090771806)Cited by:[§IV\-A](https://arxiv.org/html/2606.10989#S4.SS1.p2.3)\.
- \[19\]E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen\(2022\)LoRA: low\-rank adaptation of large language models\.arXiv preprint arXiv:2106\.09685\.Cited by:[§I](https://arxiv.org/html/2606.10989#S1.p4.1),[§II\-C](https://arxiv.org/html/2606.10989#S2.SS3.p1.2)\.
- \[20\]A\. Joshi, S\. Saha, D\. Shukla, S\. Vema, H\. Jhamtani, M\. Gaur, and A\. Modi\(2024\)Towards robust evaluation of unlearning in LLMs via data transformations\.InFindings of the Association for Computational Linguistics: EMNLP 2024,Miami, Florida, USA,pp\. 12100–12119\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.findings-emnlp.706),[Link](https://aclanthology.org/2024.findings-emnlp.706/)Cited by:[§V\-H1](https://arxiv.org/html/2606.10989#S5.SS8.SSS1.p1.1),[§V\-H](https://arxiv.org/html/2606.10989#S5.SS8.p1.1)\.
- \[21\]N\. Li, A\. Pan, A\. Gopal, S\. Yue, D\. Berrios, A\. Gatti, J\. D\. Li, A\. Dombrowski, S\. Goel, L\. Phan,et al\.\(2024\)The wmdp benchmark: measuring and reducing malicious use with unlearning\.InProceedings of the 41st International Conference on Machine Learning \(ICML\),Cited by:[§II\-C](https://arxiv.org/html/2606.10989#S2.SS3.p1.2),[§V\-A](https://arxiv.org/html/2606.10989#S5.SS1.p1.1),[§V\-B](https://arxiv.org/html/2606.10989#S5.SS2.p1.1),[§V\-D](https://arxiv.org/html/2606.10989#S5.SS4.p1.12)\.
- \[22\]J\. Liao, Q\. Wang, S\. Ye, X\. Yu, L\. Chen, and Z\. Fang\(2026\)Explainable llm unlearning through reasoning\.InProceedings of the International Conference on Learning Representations \(ICLR\),Cited by:[§I](https://arxiv.org/html/2606.10989#S1.p2.1),[§I](https://arxiv.org/html/2606.10989#S1.p3.1),[§II\-B](https://arxiv.org/html/2606.10989#S2.SS2.p1.1),[§V\-B](https://arxiv.org/html/2606.10989#S5.SS2.p1.1),[§V\-D](https://arxiv.org/html/2606.10989#S5.SS4.p1.12)\.
- \[23\]S\. Liu, Y\. Yao, J\. Jia, S\. Casper, N\. Baracaldo, P\. Hase, Y\. Yao, C\. Y\. Liu, X\. Xu, H\. Li,et al\.\(2025\)Rethinking machine unlearning for large language models\.Nature Machine Intelligence\.Cited by:[§I](https://arxiv.org/html/2606.10989#S1.p1.1)\.
- \[24\]Y\. Liu, H\. Chen, W\. Huang, Y\. Ni, and M\. Imani\(2025\)LUNE: efficient llm unlearning via lora fine\-tuning with negative examples\.InSocially Responsible and Trustworthy Foundation Models at NeurIPS 2025,Cited by:[§II\-C](https://arxiv.org/html/2606.10989#S2.SS3.p1.2)\.
- \[25\]H\. Lu, C\. Zhao, J\. Xue, L\. Yao, K\. Moore, and D\. Gong\(2024\)Adaptive rank, reduced forgetting: knowledge retention in continual learning vision\-language models with dynamic rank\-selective lora\.arXiv preprint arXiv:2412\.01004\.Cited by:[§I](https://arxiv.org/html/2606.10989#S1.p4.1)\.
- \[26\]P\. Maini, Z\. Feng, A\. Schwarzschild, Z\. C\. Lipton, and J\. Z\. Kolter\(2024\)TOFU: a task of fictitious unlearning for llms\.InProceedings of the First Conference on Language Modeling \(COLM\),Cited by:[§I](https://arxiv.org/html/2606.10989#S1.p1.1),[§I](https://arxiv.org/html/2606.10989#S1.p2.1),[§I](https://arxiv.org/html/2606.10989#S1.p4.1),[§II\-A](https://arxiv.org/html/2606.10989#S2.SS1.p1.1),[§III\-A](https://arxiv.org/html/2606.10989#S3.SS1.p2.5),[§V\-A](https://arxiv.org/html/2606.10989#S5.SS1.p1.1),[§V\-B](https://arxiv.org/html/2606.10989#S5.SS2.p1.1)\.
- \[27\]M\. Mazeika, L\. Phan, X\. Yin, A\. Zou, Z\. Wang, N\. Mu, E\. Sakhaee, N\. Li, S\. Basart, B\. Li, D\. Forsyth, and D\. Hendrycks\(2024\)HarmBench: a standardized evaluation framework for automated red teaming and robust refusal\.InProceedings of the 41st International Conference on Machine Learning,R\. Salakhutdinov, Z\. Kolter, K\. Heller, A\. Weller, N\. Oliver, J\. Scarlett, and F\. Berkenkamp \(Eds\.\),Proceedings of Machine Learning Research, Vol\.235,pp\. 35181–35224\.External Links:[Link](https://proceedings.mlr.press/v235/mazeika24a.html)Cited by:[§V\-H](https://arxiv.org/html/2606.10989#S5.SS8.p1.1)\.
- \[28\]A\. Mekala, V\. Dorna, S\. Dubey, A\. Lalwani, D\. Koleczek, M\. Rungta, S\. Hasan, and E\. Lobo\(2025\)Alternate preference optimization for unlearning factual knowledge in large language models\.InProceedings of the 31st International Conference on Computational Linguistics,Abu Dhabi, UAE,pp\. 3732–3752\.Cited by:[§I](https://arxiv.org/html/2606.10989#S1.p2.1),[§I](https://arxiv.org/html/2606.10989#S1.p3.1),[§II\-B](https://arxiv.org/html/2606.10989#S2.SS2.p1.1)\.
- \[29\]K\. Meng, D\. Bau, A\. Andonian, and Y\. Belinkov\(2022\)Locating and editing factual associations in gpt\.Advances in Neural Information Processing Systems35\.Cited by:[§II\-C](https://arxiv.org/html/2606.10989#S2.SS3.p1.2)\.
- \[30\]K\. Meng, A\. S\. Sharma, A\. Andonian, Y\. Belinkov, and D\. Bau\(2023\)Mass\-editing memory in a transformer\.InInternational Conference on Learning Representations,Cited by:[§II\-C](https://arxiv.org/html/2606.10989#S2.SS3.p1.2)\.
- \[31\]L\. Mirsky\(1960\)Symmetric gauge functions and unitarily invariant norms\.The Quarterly Journal of Mathematics11\(1\),pp\. 50–59\.External Links:[Document](https://dx.doi.org/10.1093/qmath/11.1.50)Cited by:[§IV\-A](https://arxiv.org/html/2606.10989#S4.SS1.p2.3)\.
- \[32\]E\. Mitchell, C\. Lin, A\. Bosselut, C\. Finn, and C\. D\. Manning\(2022\)Fast model editing at scale\.InInternational Conference on Learning Representations,Cited by:[§II\-C](https://arxiv.org/html/2606.10989#S2.SS3.p1.2)\.
- \[33\]R\. Rafailov, A\. Sharma, E\. Mitchell, C\. D\. Manning, S\. Ermon, and C\. Finn\(2023\)Direct preference optimization: your language model is secretly a reward model\.InAdvances in Neural Information Processing Systems,Cited by:[§II\-B](https://arxiv.org/html/2606.10989#S2.SS2.p1.1)\.
- \[34\]W\. F\. Shen, X\. Qiu, M\. Kurmanji, A\. Iacob, L\. Sani, Y\. Chen, N\. Cancedda, and N\. D\. Lane\(2025\)LLM unlearning via neural activation redirection\.InAdvances in Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=teB4aqJsNP)Cited by:[§II\-C](https://arxiv.org/html/2606.10989#S2.SS3.p1.2)\.
- \[35\]X\. Shen, Z\. Chen, M\. Backes, Y\. Shen, and Y\. Zhang\(2024\)“Do Anything Now”: Characterizing and Evaluating In\-The\-Wild Jailbreak Prompts on Large Language Models\.InProceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security,New York, NY, USA,pp\. 1671–1685\.External Links:[Document](https://dx.doi.org/10.1145/3658644.3670388),[Link](https://doi.org/10.1145/3658644.3670388)Cited by:[§V\-H](https://arxiv.org/html/2606.10989#S5.SS8.p1.1)\.
- \[36\]W\. Shi, J\. Lee, Y\. Huang, S\. Malladi, J\. Zhao, A\. Holtzman, D\. Liu, L\. Zettlemoyer, N\. A\. Smith, and C\. Zhang\(2025\)MUSE: machine unlearning six\-way evaluation for language models\.InProceedings of the International Conference on Learning Representations \(ICLR\),Cited by:[§I](https://arxiv.org/html/2606.10989#S1.p1.1),[§II\-A](https://arxiv.org/html/2606.10989#S2.SS1.p1.1)\.
- \[37\]C\. Tan, X\. Li, S\. Cui, Y\. Qu, C\. Chen, and L\. Gao\(2026\)Less is more: geometric unlearning for LLMs with minimal data disclosure\.arXiv preprint arXiv:2605\.01735\.External Links:[Link](https://arxiv.org/abs/2605.01735)Cited by:[§II\-C](https://arxiv.org/html/2606.10989#S2.SS3.p2.1)\.
- \[38\]P\. Thaker, S\. Hu, N\. Kale, Y\. Maurya, Z\. S\. Wu, and V\. Smith\(2025\)Position: LLM unlearning benchmarks are weak measures of progress\.In2025 IEEE Conference on Secure and Trustworthy Machine Learning \(SaTML\),pp\. 520–533\.External Links:[Link](https://arxiv.org/abs/2410.02879)Cited by:[§I](https://arxiv.org/html/2606.10989#S1.p1.1),[§II\-A](https://arxiv.org/html/2606.10989#S2.SS1.p1.1),[§V\-C](https://arxiv.org/html/2606.10989#S5.SS3.p1.1),[§V\-H](https://arxiv.org/html/2606.10989#S5.SS8.p1.1)\.
- \[39\]L\. Tunstall, E\. Beeching, N\. Lambert, N\. Rajani, K\. Rasul, Y\. Belkada, S\. Huang, L\. Von Werra, C\. Fourrier, N\. Habib,et al\.\(2023\)Zephyr: direct distillation of lm alignment\.arXiv preprint arXiv:2310\.16944\.Cited by:[§V\-A](https://arxiv.org/html/2606.10989#S5.SS1.p2.1)\.
- \[40\]C\. Wang, Y\. Zhang, J\. Jia, P\. Ram, D\. Wei, Y\. Yao, S\. Pal, N\. Baracaldo, and S\. Liu\(2025\)Invariance makes LLM unlearning resilient even to unanticipated downstream fine\-tuning\.InProceedings of the 42nd International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.267,pp\. 65464–65479\.External Links:[Link](https://proceedings.mlr.press/v267/wang25en.html)Cited by:[§I](https://arxiv.org/html/2606.10989#S1.p1.1),[§V\-H](https://arxiv.org/html/2606.10989#S5.SS8.p1.1)\.
- \[41\]S\. Wang, X\. Li, J\. Sun, and Z\. Xu\(2021\)Training networks in null space of feature covariance for continual learning\.InProceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition,pp\. 184–193\.Cited by:[§II\-C](https://arxiv.org/html/2606.10989#S2.SS3.p2.1)\.
- \[42\]X\. Wang, Z\. Li, B\. Wang, Y\. Hu, and D\. Zou\(2025\)Model unlearning via sparse autoencoder subspace guided projections\.InICML 2025 Workshop on Machine Unlearning for Generative AI,Cited by:[§II\-C](https://arxiv.org/html/2606.10989#S2.SS3.p2.1)\.
- \[43\]Y\. Wen, R\. Feng, F\. Guo, Y\. Wang, R\. Le, Y\. Song, S\. Gao, and S\. Shang\(2025\)Lock on target\! precision unlearning via directional control\.InFindings of the Association for Computational Linguistics: EMNLP 2025,Suzhou, China,pp\. 18782–18794\.External Links:[Document](https://dx.doi.org/10.18653/v1/2025.findings-emnlp.1021),[Link](https://aclanthology.org/2025.findings-emnlp.1021/)Cited by:[§II\-C](https://arxiv.org/html/2606.10989#S2.SS3.p1.2)\.
- \[44\]Y\. Xiong and X\. Xie\(2026\)OPLoRA: orthogonal projection LoRA prevents catastrophic forgetting during parameter\-efficient fine\-tuning\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.40,pp\. 34088–34096\.External Links:[Document](https://dx.doi.org/10.1609/aaai.v40i40.40703),[Link](https://ojs.aaai.org/index.php/AAAI/article/view/40703)Cited by:[§I](https://arxiv.org/html/2606.10989#S1.p4.1),[§II\-C](https://arxiv.org/html/2606.10989#S2.SS3.p2.1)\.
- \[45\]H\. Xu, N\. Zhao, L\. Yang, S\. Zhao, S\. Deng, M\. Wang, B\. Hooi, N\. Oo, H\. Chen, and N\. Zhang\(2025\)ReLearn: unlearning via learning for large language models\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),Vienna, Austria,pp\. 5967–5987\.External Links:[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.297),[Link](https://aclanthology.org/2025.acl-long.297/)Cited by:[§II\-A](https://arxiv.org/html/2606.10989#S2.SS1.p1.1)\.
- \[46\]J\. Yao, E\. Chien, M\. Du, X\. Niu, T\. Wang, Z\. Cheng, and X\. Yue\(2024\)Machine unlearning of pre\-trained large language models\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),Bangkok, Thailand,pp\. 8403–8419\.Cited by:[§II\-A](https://arxiv.org/html/2606.10989#S2.SS1.p1.1)\.
- \[47\]Y\. Yao, X\. Xu, and Y\. Liu\(2024\)Large language model unlearning\.InAdvances in Neural Information Processing Systems,Cited by:[§II\-A](https://arxiv.org/html/2606.10989#S2.SS1.p1.1)\.
- \[48\]Y\. Yao, X\. Xu, and Y\. Liu\(2024\)Large language model unlearning\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§I](https://arxiv.org/html/2606.10989#S1.p1.1),[§I](https://arxiv.org/html/2606.10989#S1.p2.1),[§I](https://arxiv.org/html/2606.10989#S1.p4.1),[§III\-A](https://arxiv.org/html/2606.10989#S3.SS1.p2.5),[§V\-B](https://arxiv.org/html/2606.10989#S5.SS2.p1.1)\.
- \[49\]S\. Yoon, W\. Jeung, and A\. No\(2025\)R\-tofu: unlearning in large reasoning models\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,Suzhou, China,pp\. 5239–5258\.Cited by:[§II\-B](https://arxiv.org/html/2606.10989#S2.SS2.p1.1)\.
- \[50\]Z\. Yu, X\. Liu, S\. Liang, Z\. Cameron, C\. Xiao, and N\. Zhang\(2024\)Don’t listen to me: understanding and exploring jailbreak prompts of large language models\.In33rd USENIX Security Symposium \(USENIX Security 24\),Philadelphia, PA,pp\. 4675–4692\.External Links:ISBN 978\-1\-939133\-44\-1,[Link](https://www.usenix.org/conference/usenixsecurity24/presentation/yu-zhiyuan)Cited by:[§V\-H](https://arxiv.org/html/2606.10989#S5.SS8.p1.1)\.
- \[51\]G\. Zeng, Y\. Chen, B\. Cui, and S\. Yu\(2019\)Continual learning of context\-dependent processing in neural networks\.Nature Machine Intelligence1\(8\),pp\. 364–372\.Cited by:[§II\-C](https://arxiv.org/html/2606.10989#S2.SS3.p2.1)\.
- \[52\]R\. Zhang, L\. Lin, Y\. Bai, and S\. Mei\(2024\)Negative preference optimization: from catastrophic collapse to effective unlearning\.arXiv preprint arXiv:2404\.05868\.Cited by:[§I](https://arxiv.org/html/2606.10989#S1.p1.1),[§I](https://arxiv.org/html/2606.10989#S1.p2.1),[§II\-A](https://arxiv.org/html/2606.10989#S2.SS1.p1.1),[§II\-B](https://arxiv.org/html/2606.10989#S2.SS2.p1.1),[§V\-B](https://arxiv.org/html/2606.10989#S5.SS2.p1.1)\.