Interference-Aware Multi-Task Unlearning
Summary
This paper introduces an interference-aware framework for multi-task machine unlearning, addressing task-level and instance-level interference through task-aware gradient projection and instance-level gradient orthogonalization, achieving effective unlearning on multi-task computer vision benchmarks.
# Interference-Aware Multi-Task Unlearning
Source: [https://arxiv.org/html/2605.19042](https://arxiv.org/html/2605.19042)
Ying-Hua Huang (National Taiwan University, yhhuang@arbor.ee.ntu.edu.tw), Rui Fang (National Taiwan University, rfang@arbor.ee.ntu.edu.tw), Hsi-Wen Chen (National Taiwan University, hwchen@arbor.ee.ntu.edu.tw), Ming-Syan Chen (National Taiwan University, mschen@ntu.edu.tw)
###### Abstract
Machine unlearning aims to remove the contribution of designated training data from a trained model while preserving performance on the remaining data. Existing work mainly focuses on single-task settings, whereas modern models often operate in multi-task setups with shared backbones, where removing supervision for one task or instance can unintentionally affect others. We introduce *multi-task unlearning* with two settings: *full-task unlearning*, which removes a target instance from all tasks, and *partial-task unlearning*, which removes supervision only from selected tasks. We show that shared parameters couple the forget and retain sets, causing *task-level interference* on non-target tasks and *instance-level interference* on other instances. To address this issue, we propose an interference-aware framework that combines task-aware gradient projection, which constrains updates within task-specific subspaces, with instance-level gradient orthogonalization, which reduces conflicts between forget and retain signals. Experiments on two multi-task computer vision benchmarks across five tasks show that our method achieves effective unlearning while maintaining strong generalization, reducing the Unlearning Impact Score (UIS) relative to the strongest baseline by 30.3% in full-task unlearning and 52.9% in partial-task unlearning.
## 1 Introduction
Machine unlearning \[[4](https://arxiv.org/html/2605.19042#bib.bib4)\] has become increasingly important as modern machine learning systems are required to remove sensitive or outdated information from trained models. This need arises from privacy regulations such as the General Data Protection Regulation (GDPR) \[[46](https://arxiv.org/html/2605.19042#bib.bib86)\], as well as broader concerns in security \[[25](https://arxiv.org/html/2605.19042#bib.bib17)\], fairness \[[70](https://arxiv.org/html/2605.19042#bib.bib18)\], and robustness \[[45](https://arxiv.org/html/2605.19042#bib.bib15)\]. Beyond regulatory requirements, machine unlearning also supports practical applications such as debiasing \[[8](https://arxiv.org/html/2605.19042#bib.bib19)\], debugging \[[55](https://arxiv.org/html/2605.19042#bib.bib20)\], and auditing \[[59](https://arxiv.org/html/2605.19042#bib.bib16)\]. Machine unlearning partitions the training data into two subsets: the *forget set*, whose influence should be removed, and the *retain set*, whose performance should be preserved. The goal is to eliminate the influence of the forget set while maintaining performance on the retain set.
Ideally, unlearning should match retraining on the retain set \[[3](https://arxiv.org/html/2605.19042#bib.bib1)\]. However, retraining from scratch is computationally expensive for large models, motivating efficient unlearning methods \[[22](https://arxiv.org/html/2605.19042#bib.bib21),[23](https://arxiv.org/html/2605.19042#bib.bib23),[21](https://arxiv.org/html/2605.19042#bib.bib24),[14](https://arxiv.org/html/2605.19042#bib.bib30),[6](https://arxiv.org/html/2605.19042#bib.bib31)\]. Existing methods primarily focus on the *single-task* setting, whereas modern models are typically built on pretrained backbones and adapted to multiple tasks through shared representations \[[42](https://arxiv.org/html/2605.19042#bib.bib34),[37](https://arxiv.org/html/2605.19042#bib.bib33)\] or parameter-efficient adapters \[[27](https://arxiv.org/html/2605.19042#bib.bib35),[62](https://arxiv.org/html/2605.19042#bib.bib37),[1](https://arxiv.org/html/2605.19042#bib.bib38)\]. In such multi-task settings, removing supervision for one task may unintentionally affect others, introducing challenges absent from single-task unlearning.
Therefore, we propose the *multi-task unlearning* problem, where a single input instance may be associated with multiple tasks. As shown in Fig. [1](https://arxiv.org/html/2605.19042#S1.F1), we consider two complementary setups: *full-task unlearning*, which removes a target instance from all tasks, and *partial-task unlearning*, which removes supervision for a target instance only from selected tasks. For example, an image may be removed from person identification \[[11](https://arxiv.org/html/2605.19042#bib.bib14)\] due to privacy requirements while retained for action recognition \[[29](https://arxiv.org/html/2605.19042#bib.bib13)\]. Similarly, a user's interaction may be removed from personalized recommendation \[[31](https://arxiv.org/html/2605.19042#bib.bib11)\] while retained for fraud detection \[[16](https://arxiv.org/html/2605.19042#bib.bib3)\].
However, our preliminary experiments show that directly applying single-task unlearning methods to multi-task models leads to substantial performance degradation, with up to a 25% drop on the retain set. We attribute this degradation to interactions between the forget set and retained data through shared parameters. These interactions induce two types of interference: *task-level interference*, where unlearning affects tasks outside the target set, and *instance-level interference*, where unlearning a target instance degrades performance on other instances.
Based on these observations, we propose a multi-task unlearning framework that mitigates interference across tasks while preserving performance on retained data. (Footnote 1: An alternative is to use task-specific adapters and remove the corresponding module for unlearning. However, this does not guarantee complete forgetting, as task information may remain in the shared backbone. Moreover, maintaining separate adapters becomes costly as the number of tasks grows.) Our framework consists of two key components, as illustrated in the right panel of Fig. [1](https://arxiv.org/html/2605.19042#S1.F1). First, *Task-Aware Gradient Projection* constrains parameter updates to task-specific subspaces, reducing unintended interference in the shared representation. Second, *Instance-Level Gradient Orthogonalization* removes conflicting components between forget and retain gradients, preventing degradation on retained instances. Together, these components mitigate both task- and instance-level interference through subspace-constrained and conflict-aware updates.
Our contributions are summarized as follows. First, we introduce the multi-task unlearning problem with two settings, full-task unlearning (FU) and partial-task unlearning (PU), enabling fine-grained control over data removal for privacy and utility. Second, we identify task-level and instance-level interference as two sources of degradation, and propose an interference-aware framework that combines Task-Aware Gradient Projection and Instance-Level Gradient Orthogonalization to mitigate them. Finally, across two benchmarks and five tasks, our method outperforms six baselines, reducing the Unlearning Impact Score (UIS) by 30.3% in FU and 52.9% in PU while preserving retained performance.
Figure 1: Overview of multi-task unlearning.
## 2 Problem Formulation
In this paper, we study multi-task unlearning on a shared backbone, where each instance contains supervision for multiple tasks. Appendix [A](https://arxiv.org/html/2605.19042#A1) summarizes the key notations.
##### Multi\-task Learning\.
Let $\mathcal{X}=\{\mathbf{x}_i\}_{i=1}^{N}$ denote the set of input instances, with $\mathbf{x}_i\in\mathbb{R}^{d}$, and let $\mathcal{T}=\{1,2,\dots,K\}$ denote the task set. For each instance $\mathbf{x}_i\in\mathcal{X}$ and task $t\in\mathcal{T}$, let $y_i^{(t)}$ denote the supervision signal for task $t$ on $\mathbf{x}_i$. The multi-task dataset is $\mathcal{D}=\{(\mathbf{x}_i,t,y_i^{(t)})\mid \mathbf{x}_i\in\mathcal{X},\ t\in\mathcal{T}\}$. Accordingly, the loss for instance $\mathbf{x}_i$ on task $t$ is

$$\ell_{i,t}(\theta):=\ell_t\bigl(f_t(\mathbf{x}_i;\theta),\,y_i^{(t)}\bigr),\tag{1}$$

where $\ell_t$ is the task-specific loss, $f_t(\cdot;\theta)$ is the predictor for task $t$, and $\theta$ denotes the shared model parameters. Then, multi-task learning aims to optimize

$$\theta^{\star}=\arg\min_{\theta}\sum_{\mathbf{x}_i\in\mathcal{X}}\sum_{t\in\mathcal{T}}\lambda_t\,\ell_{i,t}(\theta),\tag{2}$$

where $\lambda_t$ is the weight of task $t$. Because all tasks are learned through the shared parameters $\theta^{\star}$, supervision from one task-instance pair can affect representations used by other tasks, making both learning and subsequent unlearning more challenging.
##### Multi\-task Unlearning\.
Let $\mathcal{X}_f\subseteq\mathcal{X}$ denote the set of instances to be forgotten and $\mathcal{T}_f\subseteq\mathcal{T}$ denote the set of tasks whose supervision should be removed. The retained instance set and retained task set are defined as $\mathcal{X}_r=\mathcal{X}\setminus\mathcal{X}_f$ and $\mathcal{T}_r=\mathcal{T}\setminus\mathcal{T}_f$, respectively.

These two axes induce the following four partitions of the dataset:

$$\begin{aligned}
\mathcal{D}_f &= \{(\mathbf{x}_i,t,y_i^{(t)})\mid \mathbf{x}_i\in\mathcal{X}_f,\ t\in\mathcal{T}_f\}, &\qquad \mathcal{D}_r^{\mathrm{task}} &= \{(\mathbf{x}_i,t,y_i^{(t)})\mid \mathbf{x}_i\in\mathcal{X}_f,\ t\in\mathcal{T}_r\},\\
\mathcal{D}_r^{\mathrm{inst}} &= \{(\mathbf{x}_i,t,y_i^{(t)})\mid \mathbf{x}_i\in\mathcal{X}_r,\ t\in\mathcal{T}_f\}, &\qquad \mathcal{D}_r^{\mathrm{clean}} &= \{(\mathbf{x}_i,t,y_i^{(t)})\mid \mathbf{x}_i\in\mathcal{X}_r,\ t\in\mathcal{T}_r\}.
\end{aligned}\tag{3}$$

Here, $\mathcal{D}_f$ is the forget set, containing supervision on forgotten instances for forgotten tasks. The remaining three subsets are retained: $\mathcal{D}_r^{\mathrm{task}}$ keeps supervision on forgotten instances for retained tasks, $\mathcal{D}_r^{\mathrm{inst}}$ keeps supervision on retained instances for forgotten tasks, and $\mathcal{D}_r^{\mathrm{clean}}$ keeps supervision on retained instances for retained tasks. Accordingly, the retain set becomes

$$\mathcal{D}_r=\mathcal{D}\setminus\mathcal{D}_f=\mathcal{D}_r^{\mathrm{task}}\cup\mathcal{D}_r^{\mathrm{inst}}\cup\mathcal{D}_r^{\mathrm{clean}}.\tag{4}$$

The objective is to remove the influence of $\mathcal{D}_f$ from $\theta^{\star}$ while preserving performance on $\mathcal{D}_r$.
Based on this formulation, we consider two practical scenarios of multi-task unlearning. When $\mathcal{T}_f=\mathcal{T}$, all supervision associated with the target instances is removed across all tasks, which we call *full-task unlearning*. When $\mathcal{T}_f\subsetneq\mathcal{T}$, only supervision for a subset of tasks is removed, which we call *partial-task unlearning*. This formulation provides fine-grained control, allowing an instance to be forgotten for selected tasks while retained for others. (Footnote 2: While this paper focuses on removing supervision for selected instances, another possible setting removes supervision for one task across all instances, i.e., $\mathcal{X}_f=\mathcal{X}$, causing the model to forget that task capability entirely.)
## 3 Theoretical Motivation: Interference in Multi-Task Unlearning
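The four-way split in Eq. (3) can be illustrated with a small sketch over index pairs. The function name `partition`, the dictionary keys, and the toy instance and task lists are illustrative assumptions, not part of the paper.

```python
from itertools import product

def partition(instances, tasks, forget_instances, forget_tasks):
    """Split (instance, task) index pairs into the four subsets of Eq. (3)."""
    x_f, t_f = set(forget_instances), set(forget_tasks)
    parts = {"forget": set(), "retain_task": set(),
             "retain_inst": set(), "retain_clean": set()}
    for i, t in product(instances, tasks):
        if i in x_f and t in t_f:
            key = "forget"        # D_f: forgotten instance, forgotten task
        elif i in x_f:
            key = "retain_task"   # D_r^task: forgotten instance, retained task
        elif t in t_f:
            key = "retain_inst"   # D_r^inst: retained instance, forgotten task
        else:
            key = "retain_clean"  # D_r^clean: retained instance, retained task
        parts[key].add((i, t))
    return parts

# Hypothetical toy setup: 4 instances, 3 tasks, instance 0 is the target.
instances = range(4)
tasks = ["SEG", "DEP", "NOR"]
partial = partition(instances, tasks, forget_instances=[0], forget_tasks=["SEG"])
full = partition(instances, tasks, forget_instances=[0], forget_tasks=tasks)
```

Under these definitions, full-task unlearning has $\mathcal{T}_r=\emptyset$, so `full["retain_task"]` and `full["retain_clean"]` are empty and all retained supervision falls in `full["retain_inst"]`.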
Unlearning in the multi-task setting is particularly challenging because removing supervision from $\mathcal{D}_f$ may unintentionally affect retained supervision through shared model parameters. Specifically, *task-level interference* arises when unlearning degrades performance on $\mathcal{D}_r^{\mathrm{task}}$, where forgotten instances should still be retained for non-target tasks. In contrast, *instance-level interference* arises when unlearning degrades performance on $\mathcal{D}_r^{\mathrm{inst}}$, where other retained instances should still be preserved for the target tasks.
To formally characterize the interference induced by unlearning, we define the empirical losses on the retain and forget sets as
$$L_r(\theta)=\frac{1}{|\mathcal{D}_r|}\sum_{(\mathbf{x}_i,t,y_i^{(t)})\in\mathcal{D}_r}\ell_{i,t}(\theta),\qquad L_f(\theta)=\frac{1}{|\mathcal{D}_f|}\sum_{(\mathbf{x}_i,t,y_i^{(t)})\in\mathcal{D}_f}\ell_{i,t}(\theta).\tag{5}$$

We then analyze the retrained model $\theta_r$, which represents the ideal solution obtained by unlearning $\mathcal{D}_f$ and retraining on $\mathcal{D}_r$.
###### Theorem 1\.
Assume that $L_r$ and $L_f$ are twice differentiable, $\mathbf{H}_r:=\nabla^{2}L_r(\theta_r)$ is invertible, and $|\mathcal{D}_r|\gg|\mathcal{D}_f|$. Let $\rho:=|\mathcal{D}_f|/|\mathcal{D}_r|$, and suppose that $\theta^{\star}$ locally minimizes $L_r(\theta)+\rho L_f(\theta)$. Then, removing $\mathcal{D}_f$ induces the following loss change for any retained task-instance pair $(\mathbf{x}_i,t,y_i^{(t)})\in\mathcal{D}_r$:

$$\ell_{i,t}(\theta_r)-\ell_{i,t}(\theta^{\star})=\underbrace{\rho\,\nabla\ell_{i,t}(\theta_r)^{\top}\mathbf{H}_r^{-1}\nabla L_f(\theta_r)}_{\text{first-order}}+\underbrace{O(\rho^{2})}_{\text{higher-order}}.$$

Theorem [1](https://arxiv.org/html/2605.19042#Thmtheorem1) shows that joint interference in multi-task unlearning arises from Hessian-preconditioned gradient coupling between the forget set and retained data. Removing $\mathcal{D}_f$ induces a parameter shift governed by $\mathbf{H}_r^{-1}$, which reflects the local curvature of the retain loss. Since $|\mathcal{D}_r|\gg|\mathcal{D}_f|$ implies $\rho\ll 1$, the higher-order term $O(\rho^{2})$ becomes negligible, and the first-order term captures the dominant source of interference.
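The first-order term in Theorem 1 can be recovered by a standard influence-function-style expansion. The sketch below additionally assumes that $\theta_r$ is a stationary point of $L_r$ and that $\theta^{\star}-\theta_r=O(\rho)$; the paper's own proof may proceed differently.

```latex
\begin{aligned}
0 &= \nabla L_r(\theta^{\star}) + \rho\,\nabla L_f(\theta^{\star})
  && \text{(stationarity of } \theta^{\star}\text{)} \\
  &= \underbrace{\nabla L_r(\theta_r)}_{=\,0}
     + \mathbf{H}_r(\theta^{\star}-\theta_r)
     + \rho\,\nabla L_f(\theta_r) + O(\rho^{2})
  && \text{(Taylor expansion at } \theta_r\text{)} \\
\Rightarrow\quad \theta^{\star}-\theta_r
  &= -\rho\,\mathbf{H}_r^{-1}\nabla L_f(\theta_r) + O(\rho^{2}), \\
\ell_{i,t}(\theta_r)-\ell_{i,t}(\theta^{\star})
  &= -\nabla\ell_{i,t}(\theta_r)^{\top}(\theta^{\star}-\theta_r) + O(\rho^{2}) \\
  &= \rho\,\nabla\ell_{i,t}(\theta_r)^{\top}\mathbf{H}_r^{-1}\nabla L_f(\theta_r) + O(\rho^{2}).
\end{aligned}
```

Read this way, the sign of the first-order term depends on whether the retained pair's gradient and the forget gradient are aligned under the $\mathbf{H}_r^{-1}$ inner product.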
These observations immediately yield the following corollary\.
###### Corollary 1\.
Task-level and instance-level interference correspond to aggregations over $\mathcal{D}_r^{\mathrm{task}}$ and $\mathcal{D}_r^{\mathrm{inst}}$, respectively, and are both governed by the same first-order term in Theorem [1](https://arxiv.org/html/2605.19042#Thmtheorem1).
Corollary [1](https://arxiv.org/html/2605.19042#Thmcorollary1) shows that the same first-order term governs both forms of interference along two axes. Aggregation over $\mathcal{D}_r^{\mathrm{task}}$ captures how unlearning affects forgotten instances on retained tasks, corresponding to task-level interference. Aggregation over $\mathcal{D}_r^{\mathrm{inst}}$ captures how unlearning affects retained instances on forgotten tasks, corresponding to instance-level interference.
Moreover, since the retraining-consistent parameter shift is shaped by the retain-set curvature through $\mathbf{H}_r^{-1}$, directly updating the model based solely on the unlearning gradient can be suboptimal. The following proposition formalizes this observation.
###### Proposition 1\.
Under the local quadratic approximation $L_r(\theta_r+\delta)\approx L_r(\theta_r)+\frac{1}{2}\delta^{\top}\mathbf{H}_r\delta$, assume that $\mathbf{H}_r$ is positive definite. Consider updates $\delta$ that achieve a fixed first-order unlearning effect, $\nabla L_f(\theta_r)^{\top}\delta=\gamma$, for some desired level $\gamma>0$. Then, the unique retain-loss-minimizing update is

$$\delta^{\star}=\frac{\gamma}{\nabla L_f(\theta_r)^{\top}\mathbf{H}_r^{-1}\nabla L_f(\theta_r)}\,\mathbf{H}_r^{-1}\nabla L_f(\theta_r).$$

Hence, directly following the unlearning gradient, i.e., $\delta=\alpha\nabla L_f(\theta_r)$, is generally suboptimal unless $\nabla L_f(\theta_r)$ is an eigenvector of $\mathbf{H}_r$.
Here, the constraint $\nabla L_f(\theta_r)^{\top}\delta=\gamma$ enforces the same first-order unlearning effect across candidate updates, where $\gamma>0$ denotes the desired increase in forget loss. This is consistent with the unlearning direction, since $\delta$ increases the forget loss to first order. The eigenvector case is rare, as it requires the unlearning gradient to align with an eigenvector of $\mathbf{H}_r$. In general, $\nabla L_f(\theta_r)$ is determined by the forget set, while $\mathbf{H}_r$ reflects the retain-set curvature, making such alignment unlikely.
## 4 Method
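The closed form in Proposition 1 follows from a one-line Lagrangian argument. In the sketch below, the shorthand $\mathbf{g}$ and the multiplier $\mu$ are notation introduced only for this derivation and do not appear in the source.

```latex
\begin{aligned}
&\min_{\delta}\ \tfrac{1}{2}\,\delta^{\top}\mathbf{H}_r\,\delta
  \quad\text{s.t.}\quad \mathbf{g}^{\top}\delta=\gamma,
  \qquad \mathbf{g}:=\nabla L_f(\theta_r), \\
&\mathcal{L}(\delta,\mu)=\tfrac{1}{2}\,\delta^{\top}\mathbf{H}_r\,\delta-\mu\,(\mathbf{g}^{\top}\delta-\gamma)
  \ \Rightarrow\ \mathbf{H}_r\,\delta=\mu\,\mathbf{g}
  \ \Rightarrow\ \delta=\mu\,\mathbf{H}_r^{-1}\mathbf{g}, \\
&\mathbf{g}^{\top}\delta=\mu\,\mathbf{g}^{\top}\mathbf{H}_r^{-1}\mathbf{g}=\gamma
  \ \Rightarrow\ \mu=\frac{\gamma}{\mathbf{g}^{\top}\mathbf{H}_r^{-1}\mathbf{g}},
  \qquad
  \delta^{\star}=\frac{\gamma}{\mathbf{g}^{\top}\mathbf{H}_r^{-1}\mathbf{g}}\,\mathbf{H}_r^{-1}\mathbf{g}.
\end{aligned}
```

Uniqueness follows from strict convexity of the objective when $\mathbf{H}_r$ is positive definite. The minimizer $\delta^{\star}$ is parallel to $\mathbf{g}$ exactly when $\mathbf{H}_r^{-1}\mathbf{g}=c\,\mathbf{g}$ for some scalar $c$, which is the eigenvector condition stated in the proposition.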
We propose an interference-aware framework for multi-task unlearning that combines *Task-Aware Gradient Projection* with *Instance-Level Gradient Orthogonalization*. The key idea is to mitigate interference at two levels. First, we constrain updates to task-specific subspaces via gradient projection, reducing cross-task interference in the shared parameter space. Second, we remove conflicting components between forget and retain gradients through orthogonalization, preventing unlearning from degrading retained knowledge. These components are unified under a joint optimization objective that balances forgetting and retention while mitigating both task- and instance-level interference. Detailed pseudocode is provided in Appendix [D](https://arxiv.org/html/2605.19042#A4).
### 4.1 Task-Aware Gradient Projection
The goal of machine unlearning is to make the model behave as if it were trained only on the retain set $\mathcal{D}_r$, without exposure to the forget set $\mathcal{D}_f$. However, retraining from scratch for each unlearning request is computationally prohibitive \[[12](https://arxiv.org/html/2605.19042#bib.bib5),[49](https://arxiv.org/html/2605.19042#bib.bib71),[35](https://arxiv.org/html/2605.19042#bib.bib72)\]. Thus, existing approaches start from a pretrained model $\theta^{\star}$ and apply efficient updates to remove information associated with $\mathcal{D}_f$, while preserving performance on $\mathcal{D}_r$ and maintaining generalization \[[30](https://arxiv.org/html/2605.19042#bib.bib61),[22](https://arxiv.org/html/2605.19042#bib.bib21),[23](https://arxiv.org/html/2605.19042#bib.bib23),[21](https://arxiv.org/html/2605.19042#bib.bib24)\].
To facilitate analysis, we consider a single layer with weight matrix $\mathbf{W}^{\star}\in\mathbb{R}^{d\times k}$ from $\theta^{\star}$. A natural approach is to update $\mathbf{W}^{\star}$ via gradient-based optimization, incorporating a reverse gradient direction induced by the forget set $\mathcal{D}_f$ to facilitate unlearning and a forward gradient direction sampled from the retain set $\mathcal{D}_r$ for calibration. However, such strategies become costly in multi-task settings due to task diversity and repeated calibration overhead \[[3](https://arxiv.org/html/2605.19042#bib.bib1),[69](https://arxiv.org/html/2605.19042#bib.bib77)\]. Moreover, directly updating the full parameter matrix remains computationally expensive and impractical at scale \[[71](https://arxiv.org/html/2605.19042#bib.bib73),[50](https://arxiv.org/html/2605.19042#bib.bib60),[43](https://arxiv.org/html/2605.19042#bib.bib74)\].
To address this issue, we reformulate machine unlearning as a model editing problem by learning a plug\-in parameter that modifies model behavior\. Instead of iteratively updating the full model parameters, we adopt a parameter\-efficient formulation and represent the unlearning process as a low\-rank update:
$$\widetilde{\mathbf{W}}=\mathbf{W}^{\star}+\mathbf{B}\mathbf{A}^{\top},\tag{6}$$

where $\mathbf{A}\in\mathbb{R}^{k\times r}$ and $\mathbf{B}\in\mathbb{R}^{d\times r}$ are learnable factors, while $\mathbf{W}^{\star}$ remains fixed. One can update $\mathbf{A}$ and $\mathbf{B}$ using gradients induced by both forget and retain supervision signals. However, this is suboptimal in multi-task settings, as directions associated with $\mathcal{D}_f$ are often entangled with directions useful for $\mathcal{D}_r$. This issue is particularly pronounced in partial-task unlearning, where supervision for selected tasks is removed while supervision for retained tasks on the same instances must be preserved, leading to *task-level interference*.
To enable selective unlearning, we introduce task-aware gradient projection, which restricts updates for task $t$ to a task-specific subspace via the projector $\mathbf{P}_t$:

$$\nabla_{\mathbf{A}}^{(t)}:=\nabla_{\mathbf{A}}\mathbf{P}_t,\qquad\nabla_{\mathbf{B}}^{(t)}:=\nabla_{\mathbf{B}}\mathbf{P}_t.\tag{7}$$

This projection confines updates to task-specific subspaces, reducing cross-task interference in the shared low-rank adaptation space \[[65](https://arxiv.org/html/2605.19042#bib.bib76),[69](https://arxiv.org/html/2605.19042#bib.bib77),[34](https://arxiv.org/html/2605.19042#bib.bib78),[48](https://arxiv.org/html/2605.19042#bib.bib79)\].
Since task-specific updates lie in subspaces of the shared low-rank space, each task occupies only part of the $r$-dimensional adaptation space. Accordingly, for each task $t$, we define an orthonormal task-specific basis

$$\mathbf{U}_t\in\mathbb{R}^{r\times s},\qquad\mathbf{U}_t^{\top}\mathbf{U}_t=\mathbf{I}_s,\qquad\mathbf{P}_t:=\mathbf{U}_t\mathbf{U}_t^{\top}\in\mathbb{R}^{r\times r},\tag{8}$$

where $r$ is the shared low-rank dimension and $s\leq r$ denotes the dimension of the task-specific subspace. Here, $\mathbf{P}_t$ is the orthogonal projector induced by $\mathbf{U}_t$.
During training, we further regularize the task-specific subspaces to be mutually orthogonal. Specifically, for any $t\neq t'$, we minimize the alignment between task-specific subspaces via $\|\mathbf{U}_t^{\top}\mathbf{U}_{t'}\|_F^{2}$, which promotes separation between task-specific directions \[[60](https://arxiv.org/html/2605.19042#bib.bib75),[67](https://arxiv.org/html/2605.19042#bib.bib80)\]. Consequently, each task updates the shared low-rank factors $\mathbf{A}$ and $\mathbf{B}$ through its corresponding projector $\mathbf{P}_t$. This confines updates to weakly aligned task-specific subspaces, reduces alignment between projected task gradients, and helps mitigate cross-task interference. We formalize this effect in Theorem [2](https://arxiv.org/html/2605.19042#Thmtheorem2) in Appendix [B.4](https://arxiv.org/html/2605.19042#A2.SS4).
### 4.2 Instance-Level Gradient Orthogonalization
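The projection in Eqs. (7) and (8) can be sketched in NumPy as follows. The sizes, the random stand-in gradient, and the helper names `project` and `overlap_penalty` are illustrative assumptions. The source regularizes the bases toward orthogonality during training, whereas this sketch constructs exactly orthogonal bases up front by splitting one orthonormal matrix, purely to show the limiting case.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, s, num_tasks = 8, 6, 4, 2, 2   # illustrative sizes with num_tasks * s <= r

# Assumption: split one orthonormal r x r basis into disjoint column blocks,
# which yields exactly orthogonal task subspaces.
Q, _ = np.linalg.qr(rng.standard_normal((r, r)))
U = [Q[:, t * s:(t + 1) * s] for t in range(num_tasks)]   # each U_t is r x s
P = [U_t @ U_t.T for U_t in U]                            # P_t = U_t U_t^T (r x r)

def project(grad, P_t):
    """Eq. (7): right-multiply a gradient w.r.t. A (k x r) or B (d x r) by P_t."""
    return grad @ P_t

def overlap_penalty(U):
    """Sum over task pairs of ||U_t^T U_t'||_F^2, the alignment regularizer."""
    return sum(np.linalg.norm(U[a].T @ U[b], "fro") ** 2
               for a in range(len(U)) for b in range(a + 1, len(U)))

grad_A = rng.standard_normal((k, r))        # stand-in for a real gradient w.r.t. A
g0, g1 = project(grad_A, P[0]), project(grad_A, P[1])
cross_alignment = float(np.sum(g0 * g1))    # Frobenius inner product of projected grads
```

With exactly orthogonal subspaces, the projected gradients of two tasks have zero Frobenius inner product, which is the idealized version of the "weakly aligned" behavior described above.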
While task-aware gradient projection restricts updates to task-specific subspaces, it does not eliminate conflicts between forgetting and retention. The forget gradient may still align with directions important for retained data, thereby degrading retain performance and overall generalization \[[30](https://arxiv.org/html/2605.19042#bib.bib61),[22](https://arxiv.org/html/2605.19042#bib.bib21),[21](https://arxiv.org/html/2605.19042#bib.bib24),[23](https://arxiv.org/html/2605.19042#bib.bib23)\].
Let $\mathbf{Z}\in\{\mathbf{A},\mathbf{B}\}$ denote either of the trainable matrices. To prevent such interference when updating $\mathbf{Z}$, we remove the component of the forget gradient $\nabla_{\mathbf{Z},f}$ that aligns with the retain gradient $\nabla_{\mathbf{Z},r}$ via orthogonal projection:

$$\Pi_{\nabla_{\mathbf{Z},r}}^{\perp}(\nabla_{\mathbf{Z},f})=\nabla_{\mathbf{Z},f}-\frac{\langle\nabla_{\mathbf{Z},f},\nabla_{\mathbf{Z},r}\rangle_F}{\|\nabla_{\mathbf{Z},r}\|_F^{2}+\varepsilon}\,\nabla_{\mathbf{Z},r},\tag{9}$$

where $\varepsilon>0$ ensures numerical stability. All gradients are already projected onto task-specific subspaces as defined in Eq. ([7](https://arxiv.org/html/2605.19042#S4.E7)). This operation removes the retain-aligned component of the forget gradient up to the stabilization term $\varepsilon$, making the forget update less disruptive to retained knowledge. We analyze this effect in Theorem [3](https://arxiv.org/html/2605.19042#Thmtheorem3) of Appendix [B.5](https://arxiv.org/html/2605.19042#A2.SS5).
However, multi-task unlearning is more challenging because retained supervision comes from multiple sources, including $\mathcal{D}_r^{\mathrm{clean}}$, $\mathcal{D}_r^{\mathrm{inst}}$, and $\mathcal{D}_r^{\mathrm{task}}$. These subsets impose heterogeneous retention constraints that must be preserved simultaneously. Thus, uniformly sampling from $\mathcal{D}_r$ may fail to capture these constraints effectively \[[24](https://arxiv.org/html/2605.19042#bib.bib82),[50](https://arxiv.org/html/2605.19042#bib.bib60),[32](https://arxiv.org/html/2605.19042#bib.bib81)\].
At each iteration, we decompose $\nabla_{\mathbf{Z},r}$ into three components, namely $\nabla_{\mathbf{Z},r}^{\mathrm{clean}}$, $\nabla_{\mathbf{Z},r}^{\mathrm{inst}}$, and $\nabla_{\mathbf{Z},r}^{\mathrm{task}}$, corresponding to the three retained subsets. We then adopt a sequential orthogonalization scheme that applies the projection operators successively:

$$\nabla_{\mathbf{Z},f}^{\perp}:=\Pi_{\nabla_{\mathbf{Z},r}^{\mathrm{task}}}^{\perp}\Bigl(\Pi_{\nabla_{\mathbf{Z},r}^{\mathrm{inst}}}^{\perp}\bigl(\Pi_{\nabla_{\mathbf{Z},r}^{\mathrm{clean}}}^{\perp}(\nabla_{\mathbf{Z},f})\bigr)\Bigr).\tag{10}$$
We process retain gradients in the order of clean, instance\-level, and task\-level signals as a practical design\. Clean retain signals provide stable global guidance, instance\-level signals capture retention constraints on other instances, and task\-level signals preserve supervision on non\-target tasks for forgotten instances\. This sequential scheme progressively removes components of the forget gradient that conflict with different retained subsets, mitigating both instance\- and task\-level interference\.
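Eqs. (9) and (10) translate into a few lines of NumPy. The sketch below is a minimal illustration in which the function names, the default `eps`, and the random stand-in gradients are assumptions; in the paper the inputs would be the task-projected gradients from Eq. (7).

```python
import numpy as np

def orthogonalize(g_f, g_r, eps=1e-12):
    """Eq. (9): remove the component of g_f along g_r (Frobenius inner product)."""
    coef = np.sum(g_f * g_r) / (np.sum(g_r * g_r) + eps)
    return g_f - coef * g_r

def sequential_orthogonalize(g_f, g_clean, g_inst, g_task, eps=1e-12):
    """Eq. (10): project against clean, then instance-level, then task-level gradients."""
    for g_r in (g_clean, g_inst, g_task):
        g_f = orthogonalize(g_f, g_r, eps)
    return g_f

# Random stand-ins for the forget gradient and the three retain gradients.
rng = np.random.default_rng(0)
g_f, g_clean, g_inst, g_task = (rng.standard_normal((6, 4)) for _ in range(4))
g_perp = sequential_orthogonalize(g_f, g_clean, g_inst, g_task)
```

One property worth keeping in mind: successive single-direction projections guarantee exact orthogonality only with respect to the last gradient in the sequence (here the task-level one), unless the retain gradients happen to be mutually orthogonal. Components along the earlier gradients are reduced but may be partially reintroduced by later steps.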
### 4.3 Overall Optimization
The sequential orthogonalization step in Eq. ([10](https://arxiv.org/html/2605.19042#S4.E10)) produces retain-aware forget gradients $\nabla_{\mathbf{A},f}^{\perp}$ and $\nabla_{\mathbf{B},f}^{\perp}$ that avoid direct conflict with retained supervision. To further stabilize training, we optionally incorporate preservation terms derived from retain gradients \[[9](https://arxiv.org/html/2605.19042#bib.bib83),[24](https://arxiv.org/html/2605.19042#bib.bib82),[30](https://arxiv.org/html/2605.19042#bib.bib61)\].
We update the low\-rank factors by combining a retain\-preserving descent direction with a forget\-promoting ascent direction:
$$\mathbf{A}\leftarrow\mathbf{A}-\eta_1\nabla_{\mathbf{A},r}+\eta_2\nabla_{\mathbf{A},f}^{\perp},\qquad\mathbf{B}\leftarrow\mathbf{B}-\eta_1\nabla_{\mathbf{B},r}+\eta_2\nabla_{\mathbf{B},f}^{\perp},\tag{11}$$

where $\eta_1,\eta_2>0$ control the strengths of retention and forgetting, respectively. The first term decreases the retain loss, while the second increases the forget loss after removing components that conflict with retained gradients. When $\eta_1=0$, the update reduces to pure unlearning using only the projected forget gradients. Here, $\nabla_{\mathbf{A},r}$ and $\nabla_{\mathbf{B},r}$ aggregate retain gradients from $\mathcal{D}_r^{\mathrm{clean}}$, $\mathcal{D}_r^{\mathrm{inst}}$, and $\mathcal{D}_r^{\mathrm{task}}$. The pretrained weight $\mathbf{W}^{\star}$ is kept fixed throughout training. The final unlearned model is obtained by merging the learned update in Eq. ([6](https://arxiv.org/html/2605.19042#S4.E6)), enabling efficient unlearning while mitigating both task-level and instance-level interference.
## 5 Experiments
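A single step of Eq. (11) followed by the merge in Eq. (6) might look like the sketch below. The function names, step sizes, initialization, and random stand-in gradients are assumptions for illustration; the paper's actual procedure is the pseudocode in its Appendix D.

```python
import numpy as np

def update_factors(A, B, grads, eta_retain=1e-2, eta_forget=1e-2):
    """One step of Eq. (11): descend on retain gradients, ascend on the
    orthogonalized forget gradients."""
    A_new = A - eta_retain * grads["A_r"] + eta_forget * grads["A_f_perp"]
    B_new = B - eta_retain * grads["B_r"] + eta_forget * grads["B_f_perp"]
    return A_new, B_new

def merge(W_star, A, B):
    """Eq. (6): fold the low-rank update B A^T into the frozen weight W*."""
    return W_star + B @ A.T

rng = np.random.default_rng(0)
d, k, r = 5, 3, 2                              # illustrative sizes
W_star = rng.standard_normal((d, k))           # frozen pretrained weight
A = 0.01 * rng.standard_normal((k, r))         # assumed small random init
B = np.zeros((d, r))                           # assumed zero init, so B A^T = 0 at start
grads = {                                      # random stand-ins for real gradients
    "A_r": rng.standard_normal((k, r)), "A_f_perp": rng.standard_normal((k, r)),
    "B_r": rng.standard_normal((d, r)), "B_f_perp": rng.standard_normal((d, r)),
}
A1, B1 = update_factors(A, B, grads)
W_tilde = merge(W_star, A1, B1)
```

Setting `eta_retain=0.0` reproduces the pure-unlearning case described above, in which only the orthogonalized forget gradients move the factors.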
We evaluate our method under both*full\-task*and*partial\-task*unlearning settings on two vision benchmarks spanning five tasks\. Our experiments examine four key requirements for multi\-task unlearning: \(1\) effective removal of information from the forget set, \(2\) preservation of retained capabilities on the retain set, \(3\) mitigation of unintended interference and preservation of generalization, and \(4\) reduction of residual membership signals associated with forgotten data for privacy protection\.
### 5\.1Setup
Datasets\.To evaluate multi\-task unlearning across varying supervision granularities, we consider two representative vision benchmarks that span image\-level, instance\-level, and pixel\-level prediction\.NYUv2\[[52](https://arxiv.org/html/2605.19042#bib.bib57)\]serves as a dense multi\-task benchmark with pixel\-level supervision, where three tasks are defined on the same images:semantic segmentation \(SEG\),depth estimation \(DEP\), andsurface\-normal prediction \(NOR\)\.PASCAL\[[19](https://arxiv.org/html/2605.19042#bib.bib55)\]captures heterogeneous supervision, encompassing image\-levelimage classification \(CLS\)and instance\-levelobject detection \(OD\)\.
Baselines\.As reference models, we includeOriginal, trained on the full dataset, andRetrain, trained from scratch on the retain set, which provide lower\- and upper\-bound references\. For first\-order gradient\-based methods, we considerNegGrad\+\[[30](https://arxiv.org/html/2605.19042#bib.bib61),[22](https://arxiv.org/html/2605.19042#bib.bib21)\], which performs gradient ascent on forget data with retain\-aware optimization\. For second\-order approaches, we includeFisher\[[21](https://arxiv.org/html/2605.19042#bib.bib24)\]andInfluence\[[23](https://arxiv.org/html/2605.19042#bib.bib23)\], which estimate parameter importance via Fisher information or influence functions\. We further includeSSD\[[20](https://arxiv.org/html/2605.19042#bib.bib59)\], which selectively dampens parameters associated with forget data, andOrthoGrad\[[50](https://arxiv.org/html/2605.19042#bib.bib60)\], which orthogonalizes forget updates against retain gradients\. Finally, we considerSCRUB\[[30](https://arxiv.org/html/2605.19042#bib.bib61)\], which integrates forgetting and retention through a teacher\-student framework\.
Task utility and privacy metrics\. For task\-specific utility, we report mean Intersection\-over\-Union \(mIoU\) for semantic segmentation \[[40](https://arxiv.org/html/2605.19042#bib.bib56)\], Threshold Accuracy for depth estimation and Angular Accuracy for surface\-normal prediction \[[52](https://arxiv.org/html/2605.19042#bib.bib57)\], and mean Average Precision \(mAP\) for both object detection and multi\-label classification \[[33](https://arxiv.org/html/2605.19042#bib.bib48),[19](https://arxiv.org/html/2605.19042#bib.bib55)\]\. Each metric is evaluated on three disjoint splits: the retain set \(Ret\), which measures preservation of useful supervision; the forget set \(Unl\), which quantifies the extent of forgetting; and a held\-out validation set \(Val\), which evaluates generalization beyond the affected training instances\. To assess potential privacy leakage, we also conduct a Membership Inference Attack \(MIA\) \[[51](https://arxiv.org/html/2605.19042#bib.bib58)\] to quantify residual memorization after unlearning, i\.e\., whether a sample can still be inferred as part of the training set\.
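As an illustration of two of the utility metrics above, the sketch below computes mIoU over flattened label maps and a depth threshold accuracy\. It is a simplified reading under stated assumptions rather than our evaluation code: classes absent from both prediction and target are skipped in the mean, and the depth threshold defaults to 1\.25, a commonly used value that is assumed here and not specified in this section\.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean Intersection-over-Union over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # assumption: skip classes that never appear
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


def threshold_accuracy(pred_depth: Sequence[float], true_depth: Sequence[float],
                       threshold: float = 1.25) -> float:
    """Fraction of valid pixels with max(pred/true, true/pred) < threshold.

    The default threshold of 1.25 is an assumed convention.
    """
    valid = [(p, t) for p, t in zip(pred_depth, true_depth) if p > 0 and t > 0]
    if not valid:
        return 0.0
    hits = sum(1 for p, t in valid if max(p / t, t / p) < threshold)
    return hits / len(valid)


# Toy inputs (made-up values, flattened pixel arrays).
miou = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
acc = threshold_accuracy([1.0, 2.0, 3.0], [1.1, 2.0, 6.0])
```

Each metric would be computed separately on the Ret, Unl, and Val splits\.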
Unlearning Impact Score \(UIS\)\. Evaluating multi\-task unlearning is inherently multi\-objective, requiring a balance among retention, forgetting, generalization, and memorization \[[57](https://arxiv.org/html/2605.19042#bib.bib68),[10](https://arxiv.org/html/2605.19042#bib.bib69),[2](https://arxiv.org/html/2605.19042#bib.bib70),[50](https://arxiv.org/html/2605.19042#bib.bib60)\]\. Therefore, beyond raw metrics, we assess each method by its deviation from a desired reference behavior: the model should match the retrained model on supervision that should be forgotten, while preserving the original model behavior on supervision that should be retained\. This reflects the goal of removing the influence of target supervision without degrading unrelated knowledge\. For each task, we compute its relative deviation from the corresponding reference over the relevant evaluation aspects and aggregate the deviations into an overall discrepancy score\. A smaller score indicates closer alignment with the desired behavior\. In full\-task unlearning, all tasks on forget instances are compared against the retrained model, while retain and validation behavior are compared against the original model\. In partial\-task unlearning, only the selected forgotten task is compared against the retrained reference, whereas the remaining tasks are compared against the original model\.
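The reference\-selection rule described above can be sketched as follows\. This is only one possible reading: the aggregation is assumed to be an unweighted mean of absolute relative deviations over tasks and splits, the MIA term is omitted, and the function names are hypothetical\. The exact definition may differ from this sketch\.

```python
from typing import Dict, Set

Metrics = Dict[str, Dict[str, float]]  # task -> split ("Ret", "Unl", "Val") -> value


def reference_for(task: str, split: str, forgotten_tasks: Set[str]) -> str:
    """Forgotten supervision is compared with Retrain, everything else with Original."""
    if split == "Unl" and task in forgotten_tasks:
        return "retrain"
    return "original"


def unlearning_impact_score(unlearned: Metrics, original: Metrics, retrain: Metrics,
                            forgotten_tasks: Set[str]) -> float:
    """Assumed aggregation: mean absolute relative deviation, in percent."""
    deviations = []
    for task, splits in unlearned.items():
        for split, value in splits.items():
            which = reference_for(task, split, forgotten_tasks)
            ref = (retrain if which == "retrain" else original)[task][split]
            deviations.append(abs(value - ref) / abs(ref))
    return 100.0 * sum(deviations) / len(deviations)


# Toy numbers for a single task (made up, not results from the tables).
original = {"SEG": {"Ret": 80.0, "Unl": 80.0, "Val": 50.0}}
retrain = {"SEG": {"Ret": 80.0, "Unl": 50.0, "Val": 50.0}}
unlearned = {"SEG": {"Ret": 72.0, "Unl": 55.0, "Val": 50.0}}

uis_forgotten = unlearning_impact_score(unlearned, original, retrain, {"SEG"})
uis_retained = unlearning_impact_score(unlearned, original, retrain, set())
```

With the same toy model, the score is lower when SEG is the forgotten task than when SEG is a retained task, because the drop on the forget split is then desired rather than penalized\.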
Implementation Details\. We evaluate our framework with two representative backbones, ViT\-L \[[15](https://arxiv.org/html/2605.19042#bib.bib10)\] and Swin\-L \[[39](https://arxiv.org/html/2605.19042#bib.bib9)\] \(Swin\-L results are presented in Appendix [C\.2](https://arxiv.org/html/2605.19042#A3.SS2)\), initialized from ImageNet\-pretrained HuggingFace checkpoints \[[61](https://arxiv.org/html/2605.19042#bib.bib62),[13](https://arxiv.org/html/2605.19042#bib.bib63)\]\. For NYUv2, we attach an ASPP module \[[7](https://arxiv.org/html/2605.19042#bib.bib64)\] with task\-specific heads for semantic segmentation, depth estimation, and surface\-normal prediction\. For PASCAL, we use a linear classification head and a DETR\-style detection head with 100 object queries \[[5](https://arxiv.org/html/2605.19042#bib.bib65)\]\. To model realistic multi\-task sharing, we use a single shared LoRA \[[26](https://arxiv.org/html/2605.19042#bib.bib66)\] across all tasks and follow the official training and validation splits \[[19](https://arxiv.org/html/2605.19042#bib.bib55),[52](https://arxiv.org/html/2605.19042#bib.bib57)\]\. Following prior protocols \[[30](https://arxiv.org/html/2605.19042#bib.bib61),[50](https://arxiv.org/html/2605.19042#bib.bib60)\], we designate 10% of the training instances as the forget set and use the rest as the retain set\. Unlearning updates only the LoRA parameters, using AdamW \[[41](https://arxiv.org/html/2605.19042#bib.bib67)\] for up to 20 epochs with early stopping\. All results are reported as the average of 10 runs, with standard deviations within 3%\.
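The forget and retain split can be sketched as below\. The sketch assumes uniform random sampling of instance identifiers with a per\-run seed; the sampling procedure is an assumption for illustration, and `split_forget_retain` is a hypothetical helper name\.

```python
import random
from typing import Iterable, List, Tuple


def split_forget_retain(instance_ids: Iterable[int], forget_ratio: float = 0.10,
                        seed: int = 0) -> Tuple[List[int], List[int]]:
    """Split training instances into a forget set and a retain set.

    Assumption: instances are sampled uniformly at random; `seed` would vary
    across repeated runs.
    """
    ids = list(instance_ids)
    rng = random.Random(seed)  # local generator, leaves global state untouched
    rng.shuffle(ids)
    n_forget = round(len(ids) * forget_ratio)
    return sorted(ids[:n_forget]), sorted(ids[n_forget:])


# Example with 100 hypothetical instance ids.
forget_ids, retain_ids = split_forget_retain(range(100), forget_ratio=0.10, seed=0)
```

In full\-task unlearning, all task labels of the instances in `forget_ids` would be removed; in partial\-task unlearning, only the labels of the selected task would be removed, and the other labels of those instances remain retained supervision\.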
Table 1: Quantitative results of multi\-task unlearning on NYUv2 using ViT\-L\.

Table 2: Quantitative results of multi\-task unlearning on Pascal using ViT\-L\.
### 5\.2Full\-task Unlearning
Tables [1](https://arxiv.org/html/2605.19042#S5.T1) and [2](https://arxiv.org/html/2605.19042#S5.T2) report full\-task unlearning \(FU\), where all task supervision on forget instances is removed\. The unlearned model should move forget instances toward Retrain, while keeping retain and validation performance close to Original\. UIS measures deviation from this target behavior, with smaller values indicating a better forgetting\-preservation trade\-off\. Overall, our method achieves the lowest UIS in FU on both benchmarks, reducing the score from 28\.6% to 22\.0% on NYUv2 and from 35\.8% to 22\.9% on Pascal compared with the strongest baseline\. These results show that our method removes forget supervision \(Unl\), preserves retain performance \(Ret\), maintains generalization \(Val\), and reduces privacy leakage \(MIA\) by modeling task\- and instance\-level interference\. The average gain is larger on NYUv2, where dense prediction tasks share spatial structure, while Pascal is harder because classification and detection use different supervision granularities\. A similar trend is observed with Swin\-L, where UIS is reduced from 31\.2% to 14\.6%, as shown in Appendix [C\.2](https://arxiv.org/html/2605.19042#A3.SS2)\.
### 5\.3Partial\-task Unlearning
Tables [1](https://arxiv.org/html/2605.19042#S5.T1) and [2](https://arxiv.org/html/2605.19042#S5.T2) also report partial\-task unlearning \(PU\), where only selected task labels on forget instances are removed\. In the tables, PU\(SEG\), PU\(DEP\), and PU\(NOR\) denote unlearning semantic segmentation, depth estimation, and surface\-normal prediction on NYUv2, while PU\(CLS\) and PU\(OD\) denote unlearning classification and object detection on Pascal\. This setting requires finer control because the same instance may contain both forgotten and retained supervision\. The model should move the forgotten task toward Retrain while keeping the remaining tasks close to Original\. Overall, our method achieves the lowest UIS across PU settings, reducing UIS from 24\.6% to 13\.4% on NYUv2 and from 44\.1% to 18\.1% on Pascal, corresponding to 45\.7% and 59\.0% relative reductions\. In contrast, SSD and Fisher reduce retained DEP and NOR performance under NYUv2 PU\(SEG\), while Influence, SSD, and Fisher hurt retained or validation OD performance under Pascal PU\(CLS\)\. These results show that existing methods often let forgetting updates spill over to non\-target tasks, whereas our method better decouples task\- and instance\-level unlearning signals\.
### 5\.4Ablation Study
Table 3: Ablation study for multi\-task unlearning on NYUv2 using ViT\-L\.

Table [3](https://arxiv.org/html/2605.19042#S5.T3) analyzes each component on NYUv2\. We compare the full method with four variants: \(i\) w/o Projection removes task\-aware gradient projection; \(ii\) w/o Task removes same\-task constraints across different instances; \(iii\) w/o Inst removes cross\-task constraints on the same instance; and \(iv\) w/o Clean removes constraints from clean retain samples\. Under FU, w/o Task causes the largest degradation, increasing UIS from 22\.0% to 164\.4%, showing that same\-task retain constraints are crucial for protecting retain instances\. Removing Projection also increases UIS to 30\.8%, confirming the role of task\-aware projection\. For PU, w/o Task remains the most influential ablation, especially for SEG under PU\(SEG\) and NOR under PU\(NOR\), where UIS rises to 76\.6% and 58\.7%, respectively\. In contrast, w/o Inst and w/o Clean cause milder increases\. Overall, the full method best mitigates both task\- and instance\-level interference\.
### 5\.5Scalability Analysis
Figure [2](https://arxiv.org/html/2605.19042#S5.F2) further evaluates scalability under increasing unlearn ratios on NYUv2, from 10% to 50%\. Our method remains stable across all settings, keeping UIS around 20%–25% even when half of the data is unlearned\. In contrast, the baselines degrade rapidly as the unlearn ratio increases\. The UIS of SCRUB is already about twice that of our method at low ratios and becomes unstable at 50%, increasing to over 200% in FU, PU\(SEG\), and PU\(DEP\), and to over 120% in PU\(NOR\)\. The UIS of NegGrad\+ also rises substantially, often exceeding 50% under larger unlearn ratios\. These failures are mainly caused by large drops in Ret and Val, indicating that the model loses retained utility and generalization rather than only forgetting the target data\. These results show that our method handles large\-scale multi\-task unlearning stably, removing increasing amounts of supervision while avoiding severe damage to model performance\.
Figure 2: Unlearning impact score \(%\) under different unlearn ratios; lower is better\. Panels: \(a\) FU, \(b\) PU\(SEG\), \(c\) PU\(DEP\), \(d\) PU\(NOR\)\.
## 6Related Work
Machine unlearning \[[4](https://arxiv.org/html/2605.19042#bib.bib4),[3](https://arxiv.org/html/2605.19042#bib.bib1),[22](https://arxiv.org/html/2605.19042#bib.bib21),[23](https://arxiv.org/html/2605.19042#bib.bib23),[21](https://arxiv.org/html/2605.19042#bib.bib24),[14](https://arxiv.org/html/2605.19042#bib.bib30),[6](https://arxiv.org/html/2605.19042#bib.bib31)\] aims to remove sensitive or outdated information from trained models while preserving utility\. It is important for privacy \[[58](https://arxiv.org/html/2605.19042#bib.bib2)\], security \[[25](https://arxiv.org/html/2605.19042#bib.bib17)\], fairness \[[70](https://arxiv.org/html/2605.19042#bib.bib18)\], and robustness \[[45](https://arxiv.org/html/2605.19042#bib.bib15)\], and supports applications such as debiasing \[[8](https://arxiv.org/html/2605.19042#bib.bib19)\], debugging \[[55](https://arxiv.org/html/2605.19042#bib.bib20)\], and auditing \[[59](https://arxiv.org/html/2605.19042#bib.bib16)\]\. Early work studied exact unlearning \[[4](https://arxiv.org/html/2605.19042#bib.bib4),[3](https://arxiv.org/html/2605.19042#bib.bib1),[17](https://arxiv.org/html/2605.19042#bib.bib7),[64](https://arxiv.org/html/2605.19042#bib.bib8),[12](https://arxiv.org/html/2605.19042#bib.bib5)\], which partitions data or computation so only affected subsets are retrained, but remains costly with repeated training\.
Later work explored approximate unlearning \[[22](https://arxiv.org/html/2605.19042#bib.bib21),[21](https://arxiv.org/html/2605.19042#bib.bib24),[47](https://arxiv.org/html/2605.19042#bib.bib28),[56](https://arxiv.org/html/2605.19042#bib.bib29),[23](https://arxiv.org/html/2605.19042#bib.bib23),[72](https://arxiv.org/html/2605.19042#bib.bib6),[30](https://arxiv.org/html/2605.19042#bib.bib61)\], which updates parameters to approximate retraining behavior through negative\-gradient updates \[[30](https://arxiv.org/html/2605.19042#bib.bib61)\], noise\-based supervision \[[22](https://arxiv.org/html/2605.19042#bib.bib21)\], or influence\- and Fisher\-based targeted updates \[[23](https://arxiv.org/html/2605.19042#bib.bib23),[21](https://arxiv.org/html/2605.19042#bib.bib24)\]\. As models scale \[[39](https://arxiv.org/html/2605.19042#bib.bib9),[15](https://arxiv.org/html/2605.19042#bib.bib10)\], even fine\-tuning\-based unlearning becomes impractical \[[47](https://arxiv.org/html/2605.19042#bib.bib28),[56](https://arxiv.org/html/2605.19042#bib.bib29)\], since it must forget target data while preserving retain\-set performance\. Recent work therefore adopts parameter\-efficient fine\-tuning and treats unlearning as model editing \[[36](https://arxiv.org/html/2605.19042#bib.bib12)\], where lightweight adapters learn the unlearning behavior and are merged into the base model \[[14](https://arxiv.org/html/2605.19042#bib.bib30),[6](https://arxiv.org/html/2605.19042#bib.bib31),[38](https://arxiv.org/html/2605.19042#bib.bib32)\]\.
However, most methods focus on single\-task unlearning \[[50](https://arxiv.org/html/2605.19042#bib.bib60),[20](https://arxiv.org/html/2605.19042#bib.bib59),[54](https://arxiv.org/html/2605.19042#bib.bib52),[28](https://arxiv.org/html/2605.19042#bib.bib53),[18](https://arxiv.org/html/2605.19042#bib.bib54)\], whereas modern models often operate in multi\-task settings \[[27](https://arxiv.org/html/2605.19042#bib.bib35),[62](https://arxiv.org/html/2605.19042#bib.bib37),[1](https://arxiv.org/html/2605.19042#bib.bib38),[63](https://arxiv.org/html/2605.19042#bib.bib39),[68](https://arxiv.org/html/2605.19042#bib.bib40),[44](https://arxiv.org/html/2605.19042#bib.bib44)\] with shared backbones\. Because tasks are coupled through shared parameters, forgetting one task can unintentionally affect others, making multi\-task unlearning substantially more challenging\.
## 7Limitations and Conclusion
In this paper, we introduced multi\-task unlearning under two complementary scenarios: full\-task unlearning, which removes all task\-specific supervision associated with target instances, and partial\-task unlearning, which removes only selected task supervision while preserving remaining valid predictions\. We theoretically showed that multi\-task unlearning inherently induces task\- and instance\-level interference\. To address this issue, we proposed a framework that combines task\-aware gradient projection with instance\-level gradient orthogonalization\. The former localizes updates to task\-specific subspaces, while the latter removes forget\-gradient components that conflict with retained supervision\. Experiments on NYUv2 and Pascal show that our method better matches the desired reference behavior in both full\- and partial\-task unlearning\. A limitation of our study is that the evaluation focuses on vision benchmarks\. Future work will extend multi\-task unlearning to text and audio domains, where task definitions and supervision structures may differ substantially\.
## References
- \[1\]\(2024\)Mtlora: low\-rank adaptation approach for efficient multi\-task learning\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 16196–16205\.Cited by:[§1](https://arxiv.org/html/2605.19042#S1.p2.1),[§6](https://arxiv.org/html/2605.19042#S6.p1.1)\.
- \[2\]S\. M\. Ahmed, U\. Y\. Basaran, D\. S\. Raychaudhuri, A\. Dutta, R\. Kundu, F\. F\. Niloy, B\. Guler, and A\. K\. Roy\-Chowdhury\(2025\)Towards source\-free machine unlearning\.InProceedings of the Computer Vision and Pattern Recognition Conference,pp\. 4948–4957\.Cited by:[§C\.1\.5](https://arxiv.org/html/2605.19042#A3.SS1.SSS5.p1.1),[§5\.1](https://arxiv.org/html/2605.19042#S5.SS1.p4.1)\.
- \[3\]L\. Bourtoule, V\. Chandrasekaran, C\. A\. Choquette\-Choo, H\. Jia, A\. Travers, B\. Zhang, D\. Lie, and N\. Papernot\(2021\)Machine unlearning\.In2021 IEEE symposium on security and privacy \(SP\),pp\. 141–159\.Cited by:[§1](https://arxiv.org/html/2605.19042#S1.p2.1),[§4\.1](https://arxiv.org/html/2605.19042#S4.SS1.p2.5),[§6](https://arxiv.org/html/2605.19042#S6.p1.1)\.
- \[4\]Y\. Cao and J\. Yang\(2015\)Towards making systems forget with machine unlearning\.In2015 IEEE symposium on security and privacy,pp\. 463–480\.Cited by:[§1](https://arxiv.org/html/2605.19042#S1.p1.1),[§6](https://arxiv.org/html/2605.19042#S6.p1.1)\.
- \[5\]N\. Carion, F\. Massa, G\. Synnaeve, N\. Usunier, A\. Kirillov, and S\. Zagoruyko\(2020\)End\-to\-end object detection with transformers\.InEuropean conference on computer vision,pp\. 213–229\.Cited by:[§C\.1\.6](https://arxiv.org/html/2605.19042#A3.SS1.SSS6.p1.1),[§5\.1](https://arxiv.org/html/2605.19042#S5.SS1.p5.1)\.
- \[6\]S\. Cha, S\. Cho, D\. Hwang, and M\. Lee\(2024\)Towards robust and parameter\-efficient knowledge unlearning for llms\.arXiv preprint arXiv:2408\.06621\.Cited by:[§1](https://arxiv.org/html/2605.19042#S1.p2.1),[§6](https://arxiv.org/html/2605.19042#S6.p1.1)\.
- \[7\]L\. Chen, G\. Papandreou, F\. Schroff, and H\. Adam\(2017\)Rethinking atrous convolution for semantic image segmentation\.arXiv preprint arXiv:1706\.05587\.Cited by:[§C\.1\.6](https://arxiv.org/html/2605.19042#A3.SS1.SSS6.p1.1),[§5\.1](https://arxiv.org/html/2605.19042#S5.SS1.p5.1)\.
- \[8\]R\. Chen, J\. Yang, H\. Xiong, J\. Bai, T\. Hu, J\. Hao, Y\. Feng, J\. T\. Zhou, J\. Wu, and Z\. Liu\(2023\)Fast model debias with machine unlearning\.Advances in Neural Information Processing Systems36,pp\. 14516–14539\.Cited by:[§1](https://arxiv.org/html/2605.19042#S1.p1.1),[§6](https://arxiv.org/html/2605.19042#S6.p1.1)\.
- \[9\]J\. Cheng, P\. Liu, Q\. Li, and C\. Zhang\(2026\)Machine unlearning under retain\-forget entanglement\.arXiv preprint arXiv:2603\.26569\.Cited by:[§4\.3](https://arxiv.org/html/2605.19042#S4.SS3.p1.2)\.
- \[10\]X\. Cheng, Z\. Huang, W\. Zhou, Z\. He, R\. Yang, Y\. Wu, and X\. Huang\(2024\)Remaining\-data\-free machine unlearning by suppressing sample contribution\.arXiv preprint arXiv:2402\.15109\.Cited by:[§C\.1\.5](https://arxiv.org/html/2605.19042#A3.SS1.SSS5.p1.1),[§5\.1](https://arxiv.org/html/2605.19042#S5.SS1.p4.1)\.
- \[11\]D\. Choi and D\. Na\(2023\)Towards machine unlearning benchmarks: forgetting the personal identities in facial recognition systems\.arXiv preprint arXiv:2311\.02240\.Cited by:[§1](https://arxiv.org/html/2605.19042#S1.p3.1)\.
- \[12\]S\. B\. R\. Chowdhury, K\. M\. Choromanski, A\. Sehanobish, K\. A\. Dubey, and S\. Chaturvedi\(2025\)Towards scalable exact machine unlearning using parameter\-efficient fine\-tuning\.InThe Thirteenth International Conference on Learning Representations,Cited by:[§4\.1](https://arxiv.org/html/2605.19042#S4.SS1.p1.5),[§6](https://arxiv.org/html/2605.19042#S6.p1.1)\.
- \[13\]J\. Deng, W\. Dong, R\. Socher, L\. Li, K\. Li, and L\. Fei\-Fei\(2009\)Imagenet: a large\-scale hierarchical image database\.In2009 IEEE conference on computer vision and pattern recognition,pp\. 248–255\.Cited by:[§C\.1\.6](https://arxiv.org/html/2605.19042#A3.SS1.SSS6.p1.1),[§5\.1](https://arxiv.org/html/2605.19042#S5.SS1.p5.1)\.
- \[14\]C\. Ding, J\. Wu, Y\. Yuan, J\. Lu, K\. Zhang, A\. Su, X\. Wang, and X\. He\(2024\)Unified parameter\-efficient unlearning for llms\.arXiv preprint arXiv:2412\.00383\.Cited by:[§1](https://arxiv.org/html/2605.19042#S1.p2.1),[§6](https://arxiv.org/html/2605.19042#S6.p1.1)\.
- \[15\]A\. Dosovitskiy, L\. Beyer, A\. Kolesnikov, D\. Weissenborn, X\. Zhai, T\. Unterthiner, M\. Dehghani, M\. Minderer, G\. Heigold, S\. Gelly, J\. Uszkoreit, and N\. Houlsby\(2021\)An image is worth 16x16 words: transformers for image recognition at scale\.InInternational Conference on Learning Representations,Cited by:[§C\.1\.3](https://arxiv.org/html/2605.19042#A3.SS1.SSS3.p1.1),[§C\.1\.6](https://arxiv.org/html/2605.19042#A3.SS1.SSS6.p1.1),[§5\.1](https://arxiv.org/html/2605.19042#S5.SS1.p5.1),[§6](https://arxiv.org/html/2605.19042#S6.p1.1)\.
- \[16\]M\. Du, Z\. Chen, C\. Liu, R\. Oak, and D\. Song\(2019\)Lifelong anomaly detection through unlearning\.InProceedings of the 2019 ACM SIGSAC conference on computer and communications security,pp\. 1283–1297\.Cited by:[§1](https://arxiv.org/html/2605.19042#S1.p3.1)\.
- \[17\]Y\. Dukler, B\. Bowman, A\. Achille, A\. Golatkar, A\. Swaminathan, and S\. Soatto\(2023\)Safe: machine unlearning with shard graphs\.InProceedings of the IEEE/CVF international conference on computer vision,pp\. 17108–17118\.Cited by:[§6](https://arxiv.org/html/2605.19042#S6.p1.1)\.
- \[18\]A\. Ebrahimpour\-Boroojeny, H\. Sundaram, and V\. Chandrasekaran\(2025\)Amun: adversarial machine unlearning\.arXiv preprint arXiv:2503\.00917\.Cited by:[§6](https://arxiv.org/html/2605.19042#S6.p1.1)\.
- \[19\]M\. Everingham, L\. Van Gool, C\. K\. I\. Williams, J\. Winn, and A\. Zisserman\(2010\)The pascal visual object classes \(voc\) challenge\.International Journal of Computer Vision88\(2\),pp\. 303–338\.Cited by:[§C\.1\.1](https://arxiv.org/html/2605.19042#A3.SS1.SSS1.p3.1),[§C\.1\.1](https://arxiv.org/html/2605.19042#A3.SS1.SSS1.p5.1),[§C\.1\.4](https://arxiv.org/html/2605.19042#A3.SS1.SSS4.p5.2),[§5\.1](https://arxiv.org/html/2605.19042#S5.SS1.p1.1),[§5\.1](https://arxiv.org/html/2605.19042#S5.SS1.p3.1),[§5\.1](https://arxiv.org/html/2605.19042#S5.SS1.p5.1)\.
- \[20\]J\. Foster, S\. Schoepf, and A\. Brintrup\(2024\)Fast machine unlearning without retraining through selective synaptic dampening\.InProceedings of the AAAI conference on artificial intelligence,Vol\.38,pp\. 12043–12051\.Cited by:[6th item](https://arxiv.org/html/2605.19042#A3.I1.i6.p1.1),[§5\.1](https://arxiv.org/html/2605.19042#S5.SS1.p2.1),[§6](https://arxiv.org/html/2605.19042#S6.p1.1)\.
- \[21\]A\. Golatkar, A\. Achille, and S\. Soatto\(2020\)Eternal sunshine of the spotless net: selective forgetting in deep networks\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 9304–9312\.Cited by:[4th item](https://arxiv.org/html/2605.19042#A3.I1.i4.p1.1),[§1](https://arxiv.org/html/2605.19042#S1.p2.1),[§4\.1](https://arxiv.org/html/2605.19042#S4.SS1.p1.5),[§4\.2](https://arxiv.org/html/2605.19042#S4.SS2.p1.1),[§5\.1](https://arxiv.org/html/2605.19042#S5.SS1.p2.1),[§6](https://arxiv.org/html/2605.19042#S6.p1.1)\.
- \[22\]L\. Graves, V\. Nagisetty, and V\. Ganesh\(2021\)Amnesiac machine learning\.InProceedings of the AAAI conference on artificial intelligence,Vol\.35,pp\. 11516–11524\.Cited by:[3rd item](https://arxiv.org/html/2605.19042#A3.I1.i3.p1.1),[§1](https://arxiv.org/html/2605.19042#S1.p2.1),[§4\.1](https://arxiv.org/html/2605.19042#S4.SS1.p1.5),[§4\.2](https://arxiv.org/html/2605.19042#S4.SS2.p1.1),[§5\.1](https://arxiv.org/html/2605.19042#S5.SS1.p2.1),[§6](https://arxiv.org/html/2605.19042#S6.p1.1)\.
- \[23\]C\. Guo, T\. Goldstein, A\. Hannun, and L\. Van Der Maaten\(2019\)Certified data removal from machine learning models\.arXiv preprint arXiv:1911\.03030\.Cited by:[5th item](https://arxiv.org/html/2605.19042#A3.I1.i5.p1.1),[§1](https://arxiv.org/html/2605.19042#S1.p2.1),[§4\.1](https://arxiv.org/html/2605.19042#S4.SS1.p1.5),[§4\.2](https://arxiv.org/html/2605.19042#S4.SS2.p1.1),[§5\.1](https://arxiv.org/html/2605.19042#S5.SS1.p2.1),[§6](https://arxiv.org/html/2605.19042#S6.p1.1)\.
- \[24\]T\. Hoang, S\. Rana, S\. Gupta, and S\. Venkatesh\(2024\)Learn to unlearn for deep neural networks: minimizing unlearning interference with gradient projection\.InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision,pp\. 4819–4828\.Cited by:[§4\.2](https://arxiv.org/html/2605.19042#S4.SS2.p3.4),[§4\.3](https://arxiv.org/html/2605.19042#S4.SS3.p1.2)\.
- \[25\]A\. Huang, Z\. Cai, and Z\. Xiong\(2025\)A survey of machine unlearning in generative ai models: methods, applications, security, and challenges\.IEEE Internet of Things Journal\.Cited by:[§1](https://arxiv.org/html/2605.19042#S1.p1.1),[§6](https://arxiv.org/html/2605.19042#S6.p1.1)\.
- \[26\]C\. Huang, Q\. Liu, B\. Y\. Lin, T\. Pang, C\. Du, and M\. Lin\(2023\)Lorahub: efficient cross\-task generalization via dynamic lora composition\.arXiv preprint arXiv:2307\.13269\.Cited by:[§C\.1\.6](https://arxiv.org/html/2605.19042#A3.SS1.SSS6.p1.1),[§5\.1](https://arxiv.org/html/2605.19042#S5.SS1.p5.1)\.
- \[27\]A\. Kamalesh, A\. Lakhotia, P\. S\. Kulkarni, G\. Srinivasa,et al\.\(2024\)UnoLoRA: single low\-rank adaptation for efficient multitask fine\-tuning\.InNeurIPS 2024 Workshop on Fine\-Tuning in Modern Machine Learning: Principles and Scalability,Cited by:[§1](https://arxiv.org/html/2605.19042#S1.p2.1),[§6](https://arxiv.org/html/2605.19042#S6.p1.1)\.
- \[28\]Y\. H\. Khalil, M\. Setayesh, and H\. Li\(2025\)CoUn: empowering machine unlearning via contrastive learning\.arXiv preprint arXiv:2509\.16391\.Cited by:[§6](https://arxiv.org/html/2605.19042#S6.p1.1)\.
- \[29\]Y\. Kong and Y\. Fu\(2022\)Human action recognition and prediction: a survey\.International Journal of Computer Vision130\(5\),pp\. 1366–1401\.Cited by:[§1](https://arxiv.org/html/2605.19042#S1.p3.1)\.
- \[30\]M\. Kurmanji, P\. Triantafillou, J\. Hayes, and E\. Triantafillou\(2023\)Towards unbounded machine unlearning\.Advances in neural information processing systems36,pp\. 1957–1987\.Cited by:[3rd item](https://arxiv.org/html/2605.19042#A3.I1.i3.p1.1),[8th item](https://arxiv.org/html/2605.19042#A3.I1.i8.p1.1),[§C\.1\.6](https://arxiv.org/html/2605.19042#A3.SS1.SSS6.p2.6),[§4\.1](https://arxiv.org/html/2605.19042#S4.SS1.p1.5),[§4\.2](https://arxiv.org/html/2605.19042#S4.SS2.p1.1),[§4\.3](https://arxiv.org/html/2605.19042#S4.SS3.p1.2),[§5\.1](https://arxiv.org/html/2605.19042#S5.SS1.p2.1),[§5\.1](https://arxiv.org/html/2605.19042#S5.SS1.p5.1),[§6](https://arxiv.org/html/2605.19042#S6.p1.1)\.
- \[31\]Y\. Li, X\. Feng, C\. Chen, and Q\. Yang\(2025\)A survey on recommendation unlearning: fundamentals, taxonomy, evaluation, and open questions\.IEEE Transactions on Knowledge and Data Engineering38\(2\),pp\. 781–799\.Cited by:[§1](https://arxiv.org/html/2605.19042#S1.p3.1)\.
- \[32\]S\. Lin, X\. Zhang, W\. Susilo, X\. Chen, and J\. Liu\(2024\)GDR\-gma: machine unlearning via direction\-rectified and magnitude\-adjusted gradients\.InProceedings of the 32nd ACM International Conference on Multimedia,pp\. 9087–9095\.Cited by:[§4\.2](https://arxiv.org/html/2605.19042#S4.SS2.p3.4)\.
- \[33\]T\. Lin, M\. Maire, S\. Belongie, J\. Hays, P\. Perona, D\. Ramanan, P\. Dollár, and C\. L\. Zitnick\(2014\)Microsoft coco: common objects in context\.InEuropean conference on computer vision,pp\. 740–755\.Cited by:[§C\.1\.4](https://arxiv.org/html/2605.19042#A3.SS1.SSS4.p5.2),[§5\.1](https://arxiv.org/html/2605.19042#S5.SS1.p3.1)\.
- \[34\]B\. Liu, X\. Liu, X\. Jin, P\. Stone, and Q\. Liu\(2021\)Conflict\-averse gradient descent for multi\-task learning\.Advances in neural information processing systems34,pp\. 18878–18890\.Cited by:[§4\.1](https://arxiv.org/html/2605.19042#S4.SS1.p4.3)\.
- \[35\]S\. Liu, Y\. Liu, N\. B\. Angel, and E\. Triantafillou\(2024\)Machine unlearning in computer vision: foundations and applications\.InIEEE/CVF Conference on Computer Vision and Pattern Recognition,Cited by:[§4\.1](https://arxiv.org/html/2605.19042#S4.SS1.p1.5)\.
- \[36\]S\. Liu, Y\. Yao, J\. Jia, S\. Casper, N\. Baracaldo, P\. Hase, Y\. Yao, C\. Y\. Liu, X\. Xu, H\. Li,et al\.\(2025\)Rethinking machine unlearning for large language models\.Nature Machine Intelligence7\(2\),pp\. 181–194\.Cited by:[§6](https://arxiv.org/html/2605.19042#S6.p1.1)\.
- \[37\]Y\. Liu, C\. Ma, J\. Tian, Z\. He, and Z\. Kira\(2022\)Polyhistor: parameter\-efficient multi\-task adaptation for dense vision tasks\.Advances in neural information processing systems35,pp\. 36889–36901\.Cited by:[§1](https://arxiv.org/html/2605.19042#S1.p2.1)\.
- \[38\]Y\. Liu, H\. Chen, W\. Huang, Y\. Ni, and M\. Imani\(2025\)Lune: efficient llm unlearning via lora fine\-tuning with negative examples\.arXiv preprint arXiv:2512\.07375\.Cited by:[§6](https://arxiv.org/html/2605.19042#S6.p1.1)\.
- \[39\]Z\. Liu, Y\. Lin, Y\. Cao, H\. Hu, Y\. Wei, Z\. Zhang, S\. Lin, and B\. Guo\(2021\)Swin transformer: hierarchical vision transformer using shifted windows\.InProceedings of the IEEE/CVF international conference on computer vision,pp\. 10012–10022\.Cited by:[§C\.1\.3](https://arxiv.org/html/2605.19042#A3.SS1.SSS3.p1.1),[§C\.1\.6](https://arxiv.org/html/2605.19042#A3.SS1.SSS6.p1.1),[§C\.2](https://arxiv.org/html/2605.19042#A3.SS2.p1.5),[§5\.1](https://arxiv.org/html/2605.19042#S5.SS1.p5.1),[§6](https://arxiv.org/html/2605.19042#S6.p1.1)\.
- \[40\]J\. Long, E\. Shelhamer, and T\. Darrell\(2015\)Fully convolutional networks for semantic segmentation\.InProceedings of the IEEE conference on computer vision and pattern recognition,pp\. 3431–3440\.Cited by:[§C\.1\.4](https://arxiv.org/html/2605.19042#A3.SS1.SSS4.p2.6),[§5\.1](https://arxiv.org/html/2605.19042#S5.SS1.p3.1)\.
- \[41\]I\. Loshchilov and F\. Hutter\(2017\)Decoupled weight decay regularization\.arXiv preprint arXiv:1711\.05101\.Cited by:[§C\.1\.6](https://arxiv.org/html/2605.19042#A3.SS1.SSS6.p2.6),[§5\.1](https://arxiv.org/html/2605.19042#S5.SS1.p5.1)\.
- \[42\]R\. K\. Mahabadi, S\. Ruder, M\. Dehghani, and J\. Henderson\(2021\)Parameter\-efficient multi\-task fine\-tuning for transformers via shared hypernetworks\.InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing \(Volume 1: Long Papers\),pp\. 565–576\.Cited by:[§1](https://arxiv.org/html/2605.19042#S1.p2.1)\.
- \[43\]S\. Poppi, S\. Sarto, M\. Cornia, L\. Baraldi, and R\. Cucchiara\(2024\)Unlearning vision transformers without retaining data via low\-rank decompositions\.InInternational Conference on Pattern Recognition,pp\. 147–163\.Cited by:[§4\.1](https://arxiv.org/html/2605.19042#S4.SS1.p2.5)\.
- \[44\]A\. Prabhakar, Y\. Li, K\. Narasimhan, S\. Kakade, E\. Malach, and S\. Jelassi\(2025\)Lora soups: merging loras for practical skill composition tasks\.InProceedings of the 31st International Conference on Computational Linguistics: Industry Track,pp\. 644–655\.Cited by:[§6](https://arxiv.org/html/2605.19042#S6.p1.1)\.
- \[45\]W\. Qian, C\. Zhao, W\. Le, M\. Ma, and M\. Huai\(2023\)Towards understanding and enhancing robustness of deep learning models against malicious unlearning attacks\.InProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining,pp\. 1932–1942\.Cited by:[§1](https://arxiv.org/html/2605.19042#S1.p1.1),[§6](https://arxiv.org/html/2605.19042#S6.p1.1)\.
- \[46\]P\. Regulation\(2016\)Regulation \(eu\) 2016/679 of the european parliament and of the council\.Regulation \(eu\)679\(2016\),pp\. 10–3\.Cited by:[§1](https://arxiv.org/html/2605.19042#S1.p1.1)\.
- \[47\]S\. Roy, S\. Banerjee, V\. Verma, S\. Dasgupta, D\. Gupta, and P\. Rai\(2025\)NOVO: unlearning\-compliant vision transformers\.arXiv preprint arXiv:2507\.03281\.Cited by:[§6](https://arxiv.org/html/2605.19042#S6.p1.1)\.
- \[48\]G\. Saha, I\. Garg, and K\. Roy\(2021\)Gradient projection memory for continual learning\.arXiv preprint arXiv:2103\.09762\.Cited by:[§4\.1](https://arxiv.org/html/2605.19042#S4.SS1.p4.3)\.
- \[49\]A\. Sekhari, J\. Acharya, G\. Kamath, and A\. T\. Suresh\(2021\)Remember what you want to forget: algorithms for machine unlearning\.Advances in Neural Information Processing Systems34,pp\. 18075–18086\.Cited by:[§4\.1](https://arxiv.org/html/2605.19042#S4.SS1.p1.5)\.
- \[50\]A\. Shamsian, E\. Shaar, A\. Navon, G\. Chechik, and E\. Fetaya\(2025\)Go beyond your means: unlearning with per\-sample gradient orthogonalization\.arXiv preprint arXiv:2503\.02312\.Cited by:[7th item](https://arxiv.org/html/2605.19042#A3.I1.i7.p1.1),[§C\.1\.5](https://arxiv.org/html/2605.19042#A3.SS1.SSS5.p1.1),[§C\.1\.6](https://arxiv.org/html/2605.19042#A3.SS1.SSS6.p2.6),[§4\.1](https://arxiv.org/html/2605.19042#S4.SS1.p2.5),[§4\.2](https://arxiv.org/html/2605.19042#S4.SS2.p3.4),[§5\.1](https://arxiv.org/html/2605.19042#S5.SS1.p2.1),[§5\.1](https://arxiv.org/html/2605.19042#S5.SS1.p4.1),[§5\.1](https://arxiv.org/html/2605.19042#S5.SS1.p5.1),[§6](https://arxiv.org/html/2605.19042#S6.p1.1)\.
- \[51\]R\. Shokri, M\. Stronati, C\. Song, and V\. Shmatikov\(2017\)Membership inference attacks against machine learning models\.In2017 IEEE symposium on security and privacy \(SP\),pp\. 3–18\.Cited by:[§C\.1\.4](https://arxiv.org/html/2605.19042#A3.SS1.SSS4.p6.3),[§5\.1](https://arxiv.org/html/2605.19042#S5.SS1.p3.1)\.
- \[52\]N\. Silberman, D\. Hoiem, P\. Kohli, and R\. Fergus\(2012\)Indoor segmentation and support inference from rgbd images\.InEuropean conference on computer vision,pp\. 746–760\.Cited by:[§C\.1\.1](https://arxiv.org/html/2605.19042#A3.SS1.SSS1.p2.1),[§C\.1\.1](https://arxiv.org/html/2605.19042#A3.SS1.SSS1.p5.1),[§C\.1\.4](https://arxiv.org/html/2605.19042#A3.SS1.SSS4.p3.4),[§C\.1\.4](https://arxiv.org/html/2605.19042#A3.SS1.SSS4.p4.3),[§5\.1](https://arxiv.org/html/2605.19042#S5.SS1.p1.1),[§5\.1](https://arxiv.org/html/2605.19042#S5.SS1.p3.1),[§5\.1](https://arxiv.org/html/2605.19042#S5.SS1.p5.1)\.
- \[53\]L\. Song and P\. Mittal\(2021\)Systematic evaluation of privacy risks of machine learning models\.In30th USENIX security symposium \(USENIX security 21\),pp\. 2615–2632\.Cited by:[§C\.1\.4](https://arxiv.org/html/2605.19042#A3.SS1.SSS4.p6.3)\.
- \[54\]C\. N\. Spartalis, T\. Semertzidis, E\. Gavves, and P\. Daras\(2025\)Lotus: large\-scale machine unlearning with a taste of uncertainty\.InProceedings of the Computer Vision and Pattern Recognition Conference,pp\. 10046–10055\.Cited by:[§6](https://arxiv.org/html/2605.19042#S6.p1.1)\.
- \[55\]T\. Surve and R\. Pradhan\(2025\)Explaining fairness violations using machine unlearning\.InEDBT,pp\. 623–635\.Cited by:[§1](https://arxiv.org/html/2605.19042#S1.p1.1),[§6](https://arxiv.org/html/2605.19042#S6.p1.1)\.
- \[56\]Y\. Tong, T\. Zhang, J\. Yuan, Y\. Wang, and C\. Hu\(2025\)LetheViT: selective machine unlearning for vision transformers via attention\-guided contrastive learning\.arXiv preprint arXiv:2508\.01569\.Cited by:[§6](https://arxiv.org/html/2605.19042#S6.p1.1)\.
- \[57\]Y\. Tu, P\. Hu, and J\. Ma\(2024\)A reliable cryptographic framework for empirical machine unlearning evaluation\.arXiv preprint arXiv:2404\.11577\.Cited by:[§C\.1\.5](https://arxiv.org/html/2605.19042#A3.SS1.SSS5.p1.1),[§5\.1](https://arxiv.org/html/2605.19042#S5.SS1.p4.1)\.
- \[58\]P\. Voigt and A\. Von dem Bussche\(2017\)The eu general data protection regulation \(gdpr\)\.A practical guide, 1st ed\., Cham: Springer International Publishing10\(3152676\),pp\. 10–5555\.Cited by:[§6](https://arxiv.org/html/2605.19042#S6.p1.1)\.
- \[59\]W\. Wang, Z\. Tian, A\. Liu, and S\. Yu\(2025\)TAPE: tailored posterior difference for auditing of machine unlearning\.InProceedings of the ACM on Web Conference 2025,pp\. 3061–3072\.Cited by:[§1](https://arxiv.org/html/2605.19042#S1.p1.1),[§6](https://arxiv.org/html/2605.19042#S6.p1.1)\.
- \[60\]X\. Wang, T\. Chen, Q\. Ge, H\. Xia, R\. Bao, R\. Zheng, Q\. Zhang, T\. Gui, and X\. Huang\(2023\)Orthogonal subspace learning for language model continual learning\.InFindings of the Association for Computational Linguistics: EMNLP 2023,pp\. 10658–10671\.Cited by:[§4\.1](https://arxiv.org/html/2605.19042#S4.SS1.p6.5)\.
- \[61\]T\. Wolf, L\. Debut, V\. Sanh, J\. Chaumond, C\. Delangue, A\. Moi, P\. Cistac, T\. Rault, R\. Louf, M\. Funtowicz,et al\.\(2019\)Huggingface’s transformers: state\-of\-the\-art natural language processing\.arXiv preprint arXiv:1910\.03771\.Cited by:[§C\.1\.6](https://arxiv.org/html/2605.19042#A3.SS1.SSS6.p1.1),[§5\.1](https://arxiv.org/html/2605.19042#S5.SS1.p5.1)\.
- \[62\]Y\. Xin, J\. Du, Q\. Wang, Z\. Lin, and K\. Yan\(2024\)Vmt\-adapter: parameter\-efficient transfer learning for multi\-task dense scene understanding\.InProceedings of the AAAI conference on artificial intelligence,Vol\.38,pp\. 16085–16093\.Cited by:[§1](https://arxiv.org/html/2605.19042#S1.p2.1),[§6](https://arxiv.org/html/2605.19042#S6.p1.1)\.
- \[63\]P\. Yadav, D\. Tam, L\. Choshen, C\. A\. Raffel, and M\. Bansal\(2023\)Ties\-merging: resolving interference when merging models\.Advances in neural information processing systems36,pp\. 7093–7115\.Cited by:[§6](https://arxiv.org/html/2605.19042#S6.p1.1)\.
- \[64\]H\. Yan, X\. Li, Z\. Guo, H\. Li, F\. Li, and X\. Lin\(2022\)ARCANE: an efficient architecture for exact machine unlearning\.InIJCAI,Vol\.6,pp\. 19\.Cited by:[§6](https://arxiv.org/html/2605.19042#S6.p1.1)\.
- \[65\]Z\. Yang, G\. Chen, Y\. Yang, A\. Zeng, and X\. Yang\(2026\)Disentangling task conflicts in multi\-task lora via orthogonal gradient projection\.arXiv preprint arXiv:2601\.09684\.Cited by:[§4\.1](https://arxiv.org/html/2605.19042#S4.SS1.p4.3)\.
- \[66\]S\. Yeom, I\. Giacomelli, M\. Fredrikson, and S\. Jha\(2018\)Privacy risk in machine learning: analyzing the connection to overfitting\.In2018 IEEE 31st computer security foundations symposium \(CSF\),pp\. 268–282\.Cited by:[§C\.1\.4](https://arxiv.org/html/2605.19042#A3.SS1.SSS4.p6.3)\.
- \[67\]S\. Yifei, X\. Wei, and Y\. Wang\(2025\)DisLoRA: task\-specific low\-rank adaptation via orthogonal basis from singular value decomposition\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,pp\. 13751–13766\.Cited by:[§4\.1](https://arxiv.org/html/2605.19042#S4.SS1.p6.5)\.
- \[68\]L\. Yu, B\. Yu, H\. Yu, F\. Huang, and Y\. Li\(2024\)Language models are super mario: absorbing abilities from homologous models as a free lunch\.InForty\-first International Conference on Machine Learning,Cited by:[§6](https://arxiv.org/html/2605.19042#S6.p1.1)\.
- \[69\]T\. Yu, S\. Kumar, A\. Gupta, S\. Levine, K\. Hausman, and C\. Finn\(2020\)Gradient surgery for multi\-task learning\.Advances in neural information processing systems33,pp\. 5824–5836\.Cited by:[§4\.1](https://arxiv.org/html/2605.19042#S4.SS1.p2.5),[§4\.1](https://arxiv.org/html/2605.19042#S4.SS1.p4.3)\.
- \[70\]D\. Zhang, S\. Pan, T\. Hoang, Z\. Xing, M\. Staples, X\. Xu, L\. Yao, Q\. Lu, and L\. Zhu\(2024\)To be forgotten or to be fair: unveiling fairness implications of machine unlearning methods\.AI and Ethics4\(1\),pp\. 83–93\.Cited by:[§1](https://arxiv.org/html/2605.19042#S1.p1.1),[§6](https://arxiv.org/html/2605.19042#S6.p1.1)\.
- \[71\]H\. Zhao, B\. Ni, J\. Fan, Y\. Wang, Y\. Chen, G\. Meng, and Z\. Zhang\(2024\)Continual forgetting for pre\-trained vision models\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 28631–28642\.Cited by:[§4\.1](https://arxiv.org/html/2605.19042#S4.SS1.p2.5)\.
- \[72\]K\. Zhao, M\. Kurmanji, G\. Barbulescu, E\. Triantafillou, and P\. Triantafillou\(2024\)What makes unlearning hard and what to do about it\.Advances in Neural Information Processing Systems37,pp\. 12293–12333\.Cited by:[§6](https://arxiv.org/html/2605.19042#S6.p1.1)\.
## Appendix A Notations
Table 4: Summary of notations\.
## Appendix B Detailed Proofs
### B\.1 Proof of Theorem [1](https://arxiv.org/html/2605.19042#Thmtheorem1)
###### Theorem [1](https://arxiv.org/html/2605.19042#Thmtheorem1)\.
Assume thatLrL\_\{r\}andLfL\_\{f\}are twice differentiable,𝐇r:=∇2Lr\(θr\)\\mathbf\{H\}\_\{r\}:=\\nabla^\{2\}L\_\{r\}\(\\theta\_\{r\}\)is invertible, and\|𝒟r\|≫\|𝒟f\|\|\\mathcal\{D\}\_\{r\}\|\\gg\|\\mathcal\{D\}\_\{f\}\|\. Letρ:=\|𝒟f\|/\|𝒟r\|\\rho:=\|\\mathcal\{D\}\_\{f\}\|/\|\\mathcal\{D\}\_\{r\}\|, and suppose thatθ⋆\\theta^\{\\star\}locally minimizesLr\(θ\)\+ρLf\(θ\)L\_\{r\}\(\\theta\)\+\\rho L\_\{f\}\(\\theta\)\. Then, removing𝒟f\\mathcal\{D\}\_\{f\}induces the following loss change for any retained task\-instance pair\(𝐱i,t,yi\(t\)\)∈𝒟r\(\\mathbf\{x\}\_\{i\},t,y\_\{i\}^\{\(t\)\}\)\\in\\mathcal\{D\}\_\{r\}:
ℓi,t\(θr\)−ℓi,t\(θ⋆\)=ρ∇ℓi,t\(θr\)⊤𝐇r−1∇Lf\(θr\)⏟first\-order\+O\(ρ2\)⏟higher\-order\.\\ell\_\{i,t\}\(\\theta\_\{r\}\)\-\\ell\_\{i,t\}\(\\theta^\{\\star\}\)=\\underbrace\{\\rho\\,\\nabla\\ell\_\{i,t\}\(\\theta\_\{r\}\)^\{\\top\}\\mathbf\{H\}\_\{r\}^\{\-1\}\\nabla L\_\{f\}\(\\theta\_\{r\}\)\}\_\{\\text\{first\-order\}\}\+\\underbrace\{O\(\\rho^\{2\}\)\}\_\{\\text\{higher\-order\}\}\.
###### Proof\.
By the definitions ofθ⋆\\theta^\{\\star\}andθr\\theta\_\{r\}as local optima, we have
∇L\(θ⋆\)=0,∇Lr\(θr\)=0,\\nabla L\(\\theta^\{\\star\}\)=0,\\qquad\\nabla L\_\{r\}\(\\theta\_\{r\}\)=0,where
L\(θ\)=\|𝒟r\|\|𝒟\|Lr\(θ\)\+\|𝒟f\|\|𝒟\|Lf\(θ\)\.L\(\\theta\)=\\frac\{\|\\mathcal\{D\}\_\{r\}\|\}\{\|\\mathcal\{D\}\|\}L\_\{r\}\(\\theta\)\+\\frac\{\|\\mathcal\{D\}\_\{f\}\|\}\{\|\\mathcal\{D\}\|\}L\_\{f\}\(\\theta\)\.Multiplying∇L\(θ⋆\)=0\\nabla L\(\\theta^\{\\star\}\)=0by\|𝒟\|/\|𝒟r\|\|\\mathcal\{D\}\|/\|\\mathcal\{D\}\_\{r\}\|gives
∇Lr\(θ⋆\)\+ρ∇Lf\(θ⋆\)=0,\\nabla L\_\{r\}\(\\theta^\{\\star\}\)\+\\rho\\,\\nabla L\_\{f\}\(\\theta^\{\\star\}\)=0,whereρ=\|𝒟f\|/\|𝒟r\|\\rho=\|\\mathcal\{D\}\_\{f\}\|/\|\\mathcal\{D\}\_\{r\}\|\.
Applying Taylor expansion aroundθr\\theta\_\{r\}, we obtain
∇Lr\(θ⋆\)=𝐇r\(θ⋆−θr\)\+O\(‖θ⋆−θr‖2\),\\nabla L\_\{r\}\(\\theta^\{\\star\}\)=\\mathbf\{H\}\_\{r\}\(\\theta^\{\\star\}\-\\theta\_\{r\}\)\+O\(\\\|\\theta^\{\\star\}\-\\theta\_\{r\}\\\|^\{2\}\),where we use∇Lr\(θr\)=0\\nabla L\_\{r\}\(\\theta\_\{r\}\)=0and𝐇r=∇2Lr\(θr\)\\mathbf\{H\}\_\{r\}=\\nabla^\{2\}L\_\{r\}\(\\theta\_\{r\}\)\. Similarly,
∇Lf\(θ⋆\)=∇Lf\(θr\)\+O\(‖θ⋆−θr‖\)\.\\nabla L\_\{f\}\(\\theta^\{\\star\}\)=\\nabla L\_\{f\}\(\\theta\_\{r\}\)\+O\(\\\|\\theta^\{\\star\}\-\\theta\_\{r\}\\\|\)\.
Substituting these expansions into
∇Lr\(θ⋆\)\+ρ∇Lf\(θ⋆\)=0\\nabla L\_\{r\}\(\\theta^\{\\star\}\)\+\\rho\\,\\nabla L\_\{f\}\(\\theta^\{\\star\}\)=0yields
𝐇r\(θ⋆−θr\)\+ρ∇Lf\(θr\)=O\(‖θ⋆−θr‖2\)\+O\(ρ‖θ⋆−θr‖\)\.\\mathbf\{H\}\_\{r\}\(\\theta^\{\\star\}\-\\theta\_\{r\}\)\+\\rho\\,\\nabla L\_\{f\}\(\\theta\_\{r\}\)=O\(\\\|\\theta^\{\\star\}\-\\theta\_\{r\}\\\|^\{2\}\)\+O\(\\rho\\\|\\theta^\{\\star\}\-\\theta\_\{r\}\\\|\)\.Since𝐇r\\mathbf\{H\}\_\{r\}is invertible andρ≪1\\rho\\ll 1, it follows that
θr−θ⋆=ρ𝐇r−1∇Lf\(θr\)\+O\(ρ2\)\.\\theta\_\{r\}\-\\theta^\{\\star\}=\\rho\\,\\mathbf\{H\}\_\{r\}^\{\-1\}\\nabla L\_\{f\}\(\\theta\_\{r\}\)\+O\(\\rho^\{2\}\)\.
Applying Taylor expansion ofℓi,t\\ell\_\{i,t\}aroundθr\\theta\_\{r\}gives
ℓi,t\(θ⋆\)=ℓi,t\(θr\)\+∇ℓi,t\(θr\)⊤\(θ⋆−θr\)\+O\(‖θ⋆−θr‖2\)\.\\ell\_\{i,t\}\(\\theta^\{\\star\}\)=\\ell\_\{i,t\}\(\\theta\_\{r\}\)\+\\nabla\\ell\_\{i,t\}\(\\theta\_\{r\}\)^\{\\top\}\(\\theta^\{\\star\}\-\\theta\_\{r\}\)\+O\(\\\|\\theta^\{\\star\}\-\\theta\_\{r\}\\\|^\{2\}\)\.Rearranging and substituting the above result with‖θ⋆−θr‖=O\(ρ\)\\\|\\theta^\{\\star\}\-\\theta\_\{r\}\\\|=O\(\\rho\)yields
ℓi,t\(θr\)−ℓi,t\(θ⋆\)=ρ∇ℓi,t\(θr\)⊤𝐇r−1∇Lf\(θr\)\+O\(ρ2\)\.\\ell\_\{i,t\}\(\\theta\_\{r\}\)\-\\ell\_\{i,t\}\(\\theta^\{\\star\}\)=\\rho\\,\\nabla\\ell\_\{i,t\}\(\\theta\_\{r\}\)^\{\\top\}\\mathbf\{H\}\_\{r\}^\{\-1\}\\nabla L\_\{f\}\(\\theta\_\{r\}\)\+O\(\\rho^\{2\}\)\.The theorem follows\. ∎
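As a sanity check of the first-order expression, the following is a minimal numerical sketch, not the paper's experimental code. It assumes a single-task least-squares model on synthetic data, where both optima and the retain Hessian have closed forms; all variable names and the data-generating choices are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_r, n_f = 5, 5000, 10
rho = n_f / n_r                                   # rho = |D_f| / |D_r|

# Synthetic retain and forget data (assumption: linear model, Gaussian noise).
theta_true = rng.normal(size=d)
X_r = rng.normal(size=(n_r, d))
y_r = X_r @ theta_true + 0.5 * rng.normal(size=n_r)
X_f = rng.normal(size=(n_f, d))
y_f = X_f @ (theta_true + 0.5) + 0.5 * rng.normal(size=n_f)

# Mean losses L(theta) = mean of 0.5 * (x^T theta - y)^2, so the Hessian is X^T X / n.
H_r, b_r = X_r.T @ X_r / n_r, X_r.T @ y_r / n_r
H_f, b_f = X_f.T @ X_f / n_f, X_f.T @ y_f / n_f

theta_r = np.linalg.solve(H_r, b_r)                              # minimizes L_r
theta_star = np.linalg.solve(H_r + rho * H_f, b_r + rho * b_f)   # minimizes L_r + rho * L_f

def per_sample_loss(theta):
    return 0.5 * (X_r @ theta - y_r) ** 2

# Exact loss change on every retained sample.
actual = per_sample_loss(theta_r) - per_sample_loss(theta_star)

# First-order prediction: rho * grad l_i(theta_r)^T H_r^{-1} grad L_f(theta_r).
grad_f = H_f @ theta_r - b_f
grad_i = X_r * (X_r @ theta_r - y_r)[:, None]     # row i is grad l_i(theta_r)
predicted = rho * grad_i @ np.linalg.solve(H_r, grad_f)

rel_err = np.max(np.abs(actual - predicted)) / np.max(np.abs(predicted))
print(f"rho={rho:.4f}, relative error={rel_err:.4f}")
```

Under these assumptions the gap between `actual` and `predicted` comes only from the higher-order remainder, so it should shrink as the forget set becomes smaller relative to the retain set.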
### B\.2 Proof of Corollary [1](https://arxiv.org/html/2605.19042#Thmcorollary1)
###### Corollary [1](https://arxiv.org/html/2605.19042#Thmcorollary1)\.
Task\-level and instance\-level interference correspond to aggregations over𝒟rtask\\mathcal\{D\}\_\{r\}^\{\\mathrm\{task\}\}and𝒟rinst\\mathcal\{D\}\_\{r\}^\{\\mathrm\{inst\}\}, respectively, and are both governed by the same first\-order term in Theorem[1](https://arxiv.org/html/2605.19042#Thmtheorem1)\.
###### Proof\.
From Theorem[1](https://arxiv.org/html/2605.19042#Thmtheorem1), for any retained task\-instance pair\(𝐱i,t,yi\(t\)\)∈𝒟r\(\\mathbf\{x\}\_\{i\},t,y\_\{i\}^\{\(t\)\}\)\\in\\mathcal\{D\}\_\{r\}, the loss change induced by removing𝒟f\\mathcal\{D\}\_\{f\}satisfies
ℓi,t\(θr\)−ℓi,t\(θ⋆\)=ρ∇ℓi,t\(θr\)⊤𝐇r−1∇Lf\(θr\)\+O\(ρ2\),\\ell\_\{i,t\}\(\\theta\_\{r\}\)\-\\ell\_\{i,t\}\(\\theta^\{\\star\}\)=\\rho\\,\\nabla\\ell\_\{i,t\}\(\\theta\_\{r\}\)^\{\\top\}\\mathbf\{H\}\_\{r\}^\{\-1\}\\nabla L\_\{f\}\(\\theta\_\{r\}\)\+O\(\\rho^\{2\}\),\(12\)whereρ=\|𝒟f\|/\|𝒟r\|\\rho=\|\\mathcal\{D\}\_\{f\}\|/\|\\mathcal\{D\}\_\{r\}\|\.
Aggregating over any subset𝒮⊆𝒟r\\mathcal\{S\}\\subseteq\\mathcal\{D\}\_\{r\}yields
∑\(𝐱i,t,yi\(t\)\)∈𝒮\(ℓi,t\(θr\)−ℓi,t\(θ⋆\)\)=ρ∑\(𝐱i,t,yi\(t\)\)∈𝒮∇ℓi,t\(θr\)⊤𝐇r−1∇Lf\(θr\)\+O\(\|𝒮\|ρ2\)\.\\sum\_\{\(\\mathbf\{x\}\_\{i\},t,y\_\{i\}^\{\(t\)\}\)\\in\\mathcal\{S\}\}\\bigl\(\\ell\_\{i,t\}\(\\theta\_\{r\}\)\-\\ell\_\{i,t\}\(\\theta^\{\\star\}\)\\bigr\)=\\rho\\sum\_\{\(\\mathbf\{x\}\_\{i\},t,y\_\{i\}^\{\(t\)\}\)\\in\\mathcal\{S\}\}\\nabla\\ell\_\{i,t\}\(\\theta\_\{r\}\)^\{\\top\}\\mathbf\{H\}\_\{r\}^\{\-1\}\\nabla L\_\{f\}\(\\theta\_\{r\}\)\+O\(\|\\mathcal\{S\}\|\\rho^\{2\}\)\.\(13\)
Applying this result to𝒟rtask\\mathcal\{D\}\_\{r\}^\{\\mathrm\{task\}\}and𝒟rinst\\mathcal\{D\}\_\{r\}^\{\\mathrm\{inst\}\}gives
∑\(𝐱i,t,yi\(t\)\)∈𝒟rtask\(ℓi,t\(θr\)−ℓi,t\(θ⋆\)\)=ρ∑\(𝐱i,t,yi\(t\)\)∈𝒟rtask∇ℓi,t\(θr\)⊤𝐇r−1∇Lf\(θr\)\+O\(\|𝒟rtask\|ρ2\),\\sum\_\{\(\\mathbf\{x\}\_\{i\},t,y\_\{i\}^\{\(t\)\}\)\\in\\mathcal\{D\}\_\{r\}^\{\\mathrm\{task\}\}\}\\bigl\(\\ell\_\{i,t\}\(\\theta\_\{r\}\)\-\\ell\_\{i,t\}\(\\theta^\{\\star\}\)\\bigr\)=\\rho\\sum\_\{\(\\mathbf\{x\}\_\{i\},t,y\_\{i\}^\{\(t\)\}\)\\in\\mathcal\{D\}\_\{r\}^\{\\mathrm\{task\}\}\}\\nabla\\ell\_\{i,t\}\(\\theta\_\{r\}\)^\{\\top\}\\mathbf\{H\}\_\{r\}^\{\-1\}\\nabla L\_\{f\}\(\\theta\_\{r\}\)\+O\\\!\\left\(\|\\mathcal\{D\}\_\{r\}^\{\\mathrm\{task\}\}\|\\rho^\{2\}\\right\),\(14\)
∑\(𝐱i,t,yi\(t\)\)∈𝒟rinst\(ℓi,t\(θr\)−ℓi,t\(θ⋆\)\)=ρ∑\(𝐱i,t,yi\(t\)\)∈𝒟rinst∇ℓi,t\(θr\)⊤𝐇r−1∇Lf\(θr\)\+O\(\|𝒟rinst\|ρ2\)\.\\sum\_\{\(\\mathbf\{x\}\_\{i\},t,y\_\{i\}^\{\(t\)\}\)\\in\\mathcal\{D\}\_\{r\}^\{\\mathrm\{inst\}\}\}\\bigl\(\\ell\_\{i,t\}\(\\theta\_\{r\}\)\-\\ell\_\{i,t\}\(\\theta^\{\\star\}\)\\bigr\)=\\rho\\sum\_\{\(\\mathbf\{x\}\_\{i\},t,y\_\{i\}^\{\(t\)\}\)\\in\\mathcal\{D\}\_\{r\}^\{\\mathrm\{inst\}\}\}\\nabla\\ell\_\{i,t\}\(\\theta\_\{r\}\)^\{\\top\}\\mathbf\{H\}\_\{r\}^\{\-1\}\\nabla L\_\{f\}\(\\theta\_\{r\}\)\+O\\\!\\left\(\|\\mathcal\{D\}\_\{r\}^\{\\mathrm\{inst\}\}\|\\rho^\{2\}\\right\)\.\(15\)
Therefore, task\-level and instance\-level interference are governed by the same first\-order Hessian\-preconditioned gradient coupling term, differing only in the aggregation domain\. The corollary follows\. ∎
### B\.3 Proof of Proposition [1](https://arxiv.org/html/2605.19042#Thmproposition1)
###### Proposition [1](https://arxiv.org/html/2605.19042#Thmproposition1)\.
Under the local quadratic approximationLr\(θr\+δ\)≈Lr\(θr\)\+12δ⊤𝐇rδL\_\{r\}\(\\theta\_\{r\}\+\\delta\)\\approx L\_\{r\}\(\\theta\_\{r\}\)\+\\frac\{1\}\{2\}\\delta^\{\\top\}\\mathbf\{H\}\_\{r\}\\delta, assume that𝐇r\\mathbf\{H\}\_\{r\}is positive definite\. Consider updatesδ\\deltathat achieve a fixed first\-order unlearning effect,∇Lf\(θr\)⊤δ=γ\\nabla L\_\{f\}\(\\theta\_\{r\}\)^\{\\top\}\\delta=\\gamma, for some desired levelγ\>0\\gamma\>0\. Then, the unique retain\-loss\-minimizing update is
δ⋆=γ∇Lf\(θr\)⊤𝐇r−1∇Lf\(θr\)𝐇r−1∇Lf\(θr\)\.\\delta^\{\\star\}=\\frac\{\\gamma\}\{\\nabla L\_\{f\}\(\\theta\_\{r\}\)^\{\\top\}\\mathbf\{H\}\_\{r\}^\{\-1\}\\nabla L\_\{f\}\(\\theta\_\{r\}\)\}\\mathbf\{H\}\_\{r\}^\{\-1\}\\nabla L\_\{f\}\(\\theta\_\{r\}\)\.Hence, directly following the unlearning gradient, i\.e\.,δ=α∇Lf\(θr\)\\delta=\\alpha\\nabla L\_\{f\}\(\\theta\_\{r\}\), is generally suboptimal unless∇Lf\(θr\)\\nabla L\_\{f\}\(\\theta\_\{r\}\)is an eigenvector of𝐇r\\mathbf\{H\}\_\{r\}\.
###### Proof\.
Under the local quadratic approximation,
Lr\(θr\+δ\)≈Lr\(θr\)\+12δ⊤𝐇rδ\.L\_\{r\}\(\\theta\_\{r\}\+\\delta\)\\approx L\_\{r\}\(\\theta\_\{r\}\)\+\\frac\{1\}\{2\}\\delta^\{\\top\}\\mathbf\{H\}\_\{r\}\\delta\.SinceLr\(θr\)L\_\{r\}\(\\theta\_\{r\}\)is constant with respect toδ\\delta, minimizing the retain loss is equivalent to solving
minδ12δ⊤𝐇rδs\.t\.∇Lf\(θr\)⊤δ=γ\.\\min\_\{\\delta\}\\ \\frac\{1\}\{2\}\\delta^\{\\top\}\\mathbf\{H\}\_\{r\}\\delta\\qquad\\text\{s\.t\.\}\\qquad\\nabla L\_\{f\}\(\\theta\_\{r\}\)^\{\\top\}\\delta=\\gamma\.Letg:=∇Lf\(θr\)g:=\\nabla L\_\{f\}\(\\theta\_\{r\}\)\. The problem becomes
minδ12δ⊤𝐇rδs\.t\.g⊤δ=γ\.\\min\_\{\\delta\}\\ \\frac\{1\}\{2\}\\delta^\{\\top\}\\mathbf\{H\}\_\{r\}\\delta\\qquad\\text\{s\.t\.\}\\qquad g^\{\\top\}\\delta=\\gamma\.
We form the Lagrangian
ℒ\(δ,λ\)=12δ⊤𝐇rδ−λ\(g⊤δ−γ\)\.\\mathcal\{L\}\(\\delta,\\lambda\)=\\frac\{1\}\{2\}\\delta^\{\\top\}\\mathbf\{H\}\_\{r\}\\delta\-\\lambda\(g^\{\\top\}\\delta\-\\gamma\)\.Taking the derivative with respect toδ\\deltaand setting it to zero gives
∇δℒ\(δ,λ\)=𝐇rδ−λg=0\.\\nabla\_\{\\delta\}\\mathcal\{L\}\(\\delta,\\lambda\)=\\mathbf\{H\}\_\{r\}\\delta\-\\lambda g=0\.Hence,
δ=λ𝐇r−1g\.\\delta=\\lambda\\mathbf\{H\}\_\{r\}^\{\-1\}g\.Substituting this into the constraintg⊤δ=γg^\{\\top\}\\delta=\\gammayields
g⊤\(λ𝐇r−1g\)=γ,g^\{\\top\}\(\\lambda\\mathbf\{H\}\_\{r\}^\{\-1\}g\)=\\gamma,and therefore
λ=γg⊤𝐇r−1g\.\\lambda=\\frac\{\\gamma\}\{g^\{\\top\}\\mathbf\{H\}\_\{r\}^\{\-1\}g\}\.Thus,
δ⋆=γg⊤𝐇r−1g𝐇r−1g=γ∇Lf\(θr\)⊤𝐇r−1∇Lf\(θr\)𝐇r−1∇Lf\(θr\)\.\\delta^\{\\star\}=\\frac\{\\gamma\}\{g^\{\\top\}\\mathbf\{H\}\_\{r\}^\{\-1\}g\}\\mathbf\{H\}\_\{r\}^\{\-1\}g=\\frac\{\\gamma\}\{\\nabla L\_\{f\}\(\\theta\_\{r\}\)^\{\\top\}\\mathbf\{H\}\_\{r\}^\{\-1\}\\nabla L\_\{f\}\(\\theta\_\{r\}\)\}\\mathbf\{H\}\_\{r\}^\{\-1\}\\nabla L\_\{f\}\(\\theta\_\{r\}\)\.Since𝐇r≻0\\mathbf\{H\}\_\{r\}\\succ 0, the objective is strictly convex and the solution is unique\.
Finally, consider directly following the unlearning gradient, i\.e\.,
δ=αg\\delta=\\alpha g
for some scalarα\\alpha\. For this update to coincide with the optimumδ⋆\\delta^\{\\star\}, we must have
αg∝𝐇r−1g,\\alpha g\\propto\\mathbf\{H\}\_\{r\}^\{\-1\}g,which holds only if
𝐇r−1g=βg\\mathbf\{H\}\_\{r\}^\{\-1\}g=\\beta gfor some scalarβ\\beta, or equivalently,
𝐇rg=1βg\.\\mathbf\{H\}\_\{r\}g=\\frac\{1\}\{\\beta\}g\.Thus,g=∇Lf\(θr\)g=\\nabla L\_\{f\}\(\\theta\_\{r\}\)must be an eigenvector of𝐇r\\mathbf\{H\}\_\{r\}\. Therefore, directly following the unlearning gradient is generally suboptimal unless∇Lf\(θr\)\\nabla L\_\{f\}\(\\theta\_\{r\}\)is an eigenvector of𝐇r\\mathbf\{H\}\_\{r\}\. The proposition follows\. ∎
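The closed form can be checked numerically. The sketch below is illustrative only: it assumes a random symmetric positive-definite matrix standing in for the retain Hessian and a random vector standing in for the forget gradient, and it compares the Hessian-preconditioned update with a rescaled plain gradient step that satisfies the same linear constraint.

```python
import numpy as np

rng = np.random.default_rng(0)
d, gamma = 6, 1.0

# Stand-ins (assumptions): H is a random SPD "retain Hessian", g a random "forget gradient".
M = rng.normal(size=(d, d))
H = M @ M.T + 0.1 * np.eye(d)
g = rng.normal(size=d)

# Closed-form minimizer of 0.5 * delta^T H delta subject to g^T delta = gamma.
Hinv_g = np.linalg.solve(H, g)
delta_star = gamma / (g @ Hinv_g) * Hinv_g

# Plain gradient direction, rescaled so that it meets the same constraint.
delta_grad = gamma / (g @ g) * g

def retain_cost(delta):
    # Second-order increase of the retain loss under the quadratic approximation.
    return 0.5 * delta @ H @ delta

cost_star = retain_cost(delta_star)
cost_grad = retain_cost(delta_grad)
print(f"preconditioned: {cost_star:.4f}, plain gradient: {cost_grad:.4f}")
```

Both updates achieve the same first-order unlearning effect, so any difference in `retain_cost` reflects only how well the direction avoids high-curvature directions of the retain loss.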
### B\.4 Proof of Theorem [2](https://arxiv.org/html/2605.19042#Thmtheorem2)
###### Theorem 2 \(Task\-level interference reduction\)\.
For any𝐙∈\{𝐀,𝐁\}\\mathbf\{Z\}\\in\\\{\\mathbf\{A\},\\mathbf\{B\}\\\}and any two tasksttandt′t^\{\\prime\}, let∇𝐙,t\\nabla\_\{\\mathbf\{Z\},t\}and∇𝐙,t′\\nabla\_\{\\mathbf\{Z\},t^\{\\prime\}\}be their task gradients\. The task\-aware projected gradients defined in Eq\. \([7](https://arxiv.org/html/2605.19042#S4.E7)\) are∇𝐙,t\(t\)=∇𝐙,t𝐏t\\nabla\_\{\\mathbf\{Z\},t\}^\{\(t\)\}=\\nabla\_\{\\mathbf\{Z\},t\}\\mathbf\{P\}\_\{t\}and∇𝐙,t′\(t′\)=∇𝐙,t′𝐏t′\\nabla\_\{\\mathbf\{Z\},t^\{\\prime\}\}^\{\(t^\{\\prime\}\)\}=\\nabla\_\{\\mathbf\{Z\},t^\{\\prime\}\}\\mathbf\{P\}\_\{t^\{\\prime\}\}\. If‖𝐔t⊤𝐔t′‖2≤γt,t′\\\|\\mathbf\{U\}\_\{t\}^\{\\top\}\\mathbf\{U\}\_\{t^\{\\prime\}\}\\\|\_\{2\}\\leq\\gamma\_\{t,t^\{\\prime\}\}, then
\|⟨∇𝐙,t\(t\),∇𝐙,t′\(t′\)⟩F\|≤γt,t′‖∇𝐙,t‖F‖∇𝐙,t′‖F,\\left\|\\left\\langle\\nabla\_\{\\mathbf\{Z\},t\}^\{\(t\)\},\\nabla\_\{\\mathbf\{Z\},t^\{\\prime\}\}^\{\(t^\{\\prime\}\)\}\\right\\rangle\_\{F\}\\right\|\\leq\\gamma\_\{t,t^\{\\prime\}\}\\\|\\nabla\_\{\\mathbf\{Z\},t\}\\\|\_\{F\}\\\|\\nabla\_\{\\mathbf\{Z\},t^\{\\prime\}\}\\\|\_\{F\},\(16\)where⟨⋅,⋅⟩F\\langle\\cdot,\\cdot\\rangle\_\{F\}and∥⋅∥F\\\|\\cdot\\\|\_\{F\}denote the Frobenius inner product and norm, respectively\.
###### Proof\.
By definition,∇𝐙,t\(t\)=∇𝐙,t𝐏t\\nabla\_\{\\mathbf\{Z\},t\}^\{\(t\)\}=\\nabla\_\{\\mathbf\{Z\},t\}\\mathbf\{P\}\_\{t\}and∇𝐙,t′\(t′\)=∇𝐙,t′𝐏t′\\nabla\_\{\\mathbf\{Z\},t^\{\\prime\}\}^\{\(t^\{\\prime\}\)\}=\\nabla\_\{\\mathbf\{Z\},t^\{\\prime\}\}\\mathbf\{P\}\_\{t^\{\\prime\}\}\. Since𝐏t\\mathbf\{P\}\_\{t\}and𝐏t′\\mathbf\{P\}\_\{t^\{\\prime\}\}are symmetric projectors,
\|⟨∇𝐙,t\(t\),∇𝐙,t′\(t′\)⟩F\|=\|tr\(𝐏t∇𝐙,t⊤∇𝐙,t′𝐏t′\)\|\.\\left\|\\left\\langle\\nabla\_\{\\mathbf\{Z\},t\}^\{\(t\)\},\\nabla\_\{\\mathbf\{Z\},t^\{\\prime\}\}^\{\(t^\{\\prime\}\)\}\\right\\rangle\_\{F\}\\right\|=\\left\|\\mathrm\{tr\}\\\!\\left\(\\mathbf\{P\}\_\{t\}\\nabla\_\{\\mathbf\{Z\},t\}^\{\\top\}\\nabla\_\{\\mathbf\{Z\},t^\{\\prime\}\}\\mathbf\{P\}\_\{t^\{\\prime\}\}\\right\)\\right\|\.\(17\)By cyclicity of trace and the Frobenius Cauchy–Schwarz inequality,
\|tr\(𝐏t∇𝐙,t⊤∇𝐙,t′𝐏t′\)\|=\|⟨∇𝐙,t,∇𝐙,t′𝐏t′𝐏t⟩F\|≤‖∇𝐙,t‖F‖∇𝐙,t′‖F‖𝐏t′𝐏t‖2\.\\left\|\\mathrm\{tr\}\\\!\\left\(\\mathbf\{P\}\_\{t\}\\nabla\_\{\\mathbf\{Z\},t\}^\{\\top\}\\nabla\_\{\\mathbf\{Z\},t^\{\\prime\}\}\\mathbf\{P\}\_\{t^\{\\prime\}\}\\right\)\\right\|=\\left\|\\left\\langle\\nabla\_\{\\mathbf\{Z\},t\},\\nabla\_\{\\mathbf\{Z\},t^\{\\prime\}\}\\mathbf\{P\}\_\{t^\{\\prime\}\}\\mathbf\{P\}\_\{t\}\\right\\rangle\_\{F\}\\right\|\\leq\\\|\\nabla\_\{\\mathbf\{Z\},t\}\\\|\_\{F\}\\\|\\nabla\_\{\\mathbf\{Z\},t^\{\\prime\}\}\\\|\_\{F\}\\\|\\mathbf\{P\}\_\{t^\{\\prime\}\}\\mathbf\{P\}\_\{t\}\\\|\_\{2\}\.\(18\)Because𝐏t=𝐔t𝐔t⊤\\mathbf\{P\}\_\{t\}=\\mathbf\{U\}\_\{t\}\\mathbf\{U\}\_\{t\}^\{\\top\}and𝐏t′=𝐔t′𝐔t′⊤\\mathbf\{P\}\_\{t^\{\\prime\}\}=\\mathbf\{U\}\_\{t^\{\\prime\}\}\\mathbf\{U\}\_\{t^\{\\prime\}\}^\{\\top\}with orthonormal bases, we have
‖𝐏t′𝐏t‖2≤‖𝐔t′⊤𝐔t‖2=‖𝐔t⊤𝐔t′‖2≤γt,t′\.\\\|\\mathbf\{P\}\_\{t^\{\\prime\}\}\\mathbf\{P\}\_\{t\}\\\|\_\{2\}\\leq\\\|\\mathbf\{U\}\_\{t^\{\\prime\}\}^\{\\top\}\\mathbf\{U\}\_\{t\}\\\|\_\{2\}=\\\|\\mathbf\{U\}\_\{t\}^\{\\top\}\\mathbf\{U\}\_\{t^\{\\prime\}\}\\\|\_\{2\}\\leq\\gamma\_\{t,t^\{\\prime\}\}\.\(19\)Combining the above inequalities proves the claim\. ∎
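The bound in Eq. (16) can be illustrated with random matrices. This is a minimal sketch under assumed shapes: gradients are `m x n` matrices, each task subspace is spanned by `k` orthonormal columns obtained from a QR factorization, and the projector acts by right multiplication as in the theorem. None of these names come from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 8, 16, 4          # illustrative shapes: gradients are m x n, subspaces have rank k

def orthonormal_basis(rows, cols):
    # Reduced QR of a random matrix yields a matrix with orthonormal columns.
    q, _ = np.linalg.qr(rng.normal(size=(rows, cols)))
    return q

def projected_inner_product(G_t, G_s, U_t, U_s):
    # Frobenius inner product of the projected gradients G_t P_t and G_s P_s.
    P_t, P_s = U_t @ U_t.T, U_s @ U_s.T
    return float(np.sum((G_t @ P_t) * (G_s @ P_s)))

G_t, G_s = rng.normal(size=(m, n)), rng.normal(size=(m, n))

# Case 1: two random, generally overlapping subspaces.
U_t, U_s = orthonormal_basis(n, k), orthonormal_basis(n, k)
gamma = np.linalg.norm(U_t.T @ U_s, 2)                     # spectral norm of U_t^T U_s
lhs = abs(projected_inner_product(G_t, G_s, U_t, U_s))
rhs = gamma * np.linalg.norm(G_t) * np.linalg.norm(G_s)    # default matrix norm is Frobenius

# Case 2: mutually orthogonal subspaces (gamma = 0), carved out of one orthonormal basis.
Q = orthonormal_basis(n, 2 * k)
lhs_orth = abs(projected_inner_product(G_t, G_s, Q[:, :k], Q[:, k:]))

print(f"overlapping: {lhs:.4f} <= {rhs:.4f}; orthogonal subspaces: {lhs_orth:.2e}")
```

The second case shows the limiting behavior: when the task subspaces are orthogonal, the projected gradients no longer interact, regardless of how aligned the raw gradients are.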
### B\.5 Proof of Theorem [3](https://arxiv.org/html/2605.19042#Thmtheorem3)
###### Theorem 3 \(Instance\-level interference reduction\)\.
For any𝐙∈\{𝐀,𝐁\}\\mathbf\{Z\}\\in\\\{\\mathbf\{A\},\\mathbf\{B\}\\\}, let∇𝐙,f⟂=Π∇𝐙,r⟂\(∇𝐙,f\)\\nabla\_\{\\mathbf\{Z\},f\}^\{\\perp\}=\\Pi\_\{\\nabla\_\{\\mathbf\{Z\},r\}\}^\{\\perp\}\(\\nabla\_\{\\mathbf\{Z\},f\}\)be the orthogonalized forget gradient defined following Eq\. \([9](https://arxiv.org/html/2605.19042#S4.E9)\)\. Then its alignment with the retain gradient satisfies
\|⟨∇𝐙,f⟂,∇𝐙,r⟩F\|=ε‖∇𝐙,r‖F2\+ε\|⟨∇𝐙,f,∇𝐙,r⟩F\|\.\\left\|\\left\\langle\\nabla\_\{\\mathbf\{Z\},f\}^\{\\perp\},\\nabla\_\{\\mathbf\{Z\},r\}\\right\\rangle\_\{F\}\\right\|=\\frac\{\\varepsilon\}\{\\\|\\nabla\_\{\\mathbf\{Z\},r\}\\\|\_\{F\}^\{2\}\+\\varepsilon\}\\left\|\\left\\langle\\nabla\_\{\\mathbf\{Z\},f\},\\nabla\_\{\\mathbf\{Z\},r\}\\right\\rangle\_\{F\}\\right\|\.\(20\)Whenε=0\\varepsilon=0, the orthogonalized forget gradient is exactly orthogonal to the retain gradient\.
###### Proof\.
By the definition of the orthogonalized forget gradient,
∇𝐙,f⟂=∇𝐙,f−⟨∇𝐙,f,∇𝐙,r⟩F‖∇𝐙,r‖F2\+ε∇𝐙,r\.\\nabla\_\{\\mathbf\{Z\},f\}^\{\\perp\}=\\nabla\_\{\\mathbf\{Z\},f\}\-\\frac\{\\left\\langle\\nabla\_\{\\mathbf\{Z\},f\},\\nabla\_\{\\mathbf\{Z\},r\}\\right\\rangle\_\{F\}\}\{\\\|\\nabla\_\{\\mathbf\{Z\},r\}\\\|\_\{F\}^\{2\}\+\\varepsilon\}\\nabla\_\{\\mathbf\{Z\},r\}\.\(21\)Taking its Frobenius inner product with∇𝐙,r\\nabla\_\{\\mathbf\{Z\},r\}gives
⟨∇𝐙,f⟂,∇𝐙,r⟩F=⟨∇𝐙,f,∇𝐙,r⟩F−⟨∇𝐙,f,∇𝐙,r⟩F‖∇𝐙,r‖F2\+ε⟨∇𝐙,r,∇𝐙,r⟩F=ε‖∇𝐙,r‖F2\+ε⟨∇𝐙,f,∇𝐙,r⟩F\.\\left\\langle\\nabla\_\{\\mathbf\{Z\},f\}^\{\\perp\},\\nabla\_\{\\mathbf\{Z\},r\}\\right\\rangle\_\{F\}=\\left\\langle\\nabla\_\{\\mathbf\{Z\},f\},\\nabla\_\{\\mathbf\{Z\},r\}\\right\\rangle\_\{F\}\-\\frac\{\\left\\langle\\nabla\_\{\\mathbf\{Z\},f\},\\nabla\_\{\\mathbf\{Z\},r\}\\right\\rangle\_\{F\}\}\{\\\|\\nabla\_\{\\mathbf\{Z\},r\}\\\|\_\{F\}^\{2\}\+\\varepsilon\}\\left\\langle\\nabla\_\{\\mathbf\{Z\},r\},\\nabla\_\{\\mathbf\{Z\},r\}\\right\\rangle\_\{F\}=\\frac\{\\varepsilon\}\{\\\|\\nabla\_\{\\mathbf\{Z\},r\}\\\|\_\{F\}^\{2\}\+\\varepsilon\}\\left\\langle\\nabla\_\{\\mathbf\{Z\},f\},\\nabla\_\{\\mathbf\{Z\},r\}\\right\\rangle\_\{F\}\.\(22\)
Taking absolute values yields the stated result\. Whenε=0\\varepsilon=0, the right\-hand side becomes zero, so∇𝐙,f⟂\\nabla\_\{\\mathbf\{Z\},f\}^\{\\perp\}is orthogonal to∇𝐙,r\\nabla\_\{\\mathbf\{Z\},r\}\. ∎
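The identity in Eq. (20) is easy to verify numerically. The following sketch assumes random matrices in place of the actual forget and retain gradients; the function name `orthogonalize` and the chosen value of `eps` are illustrative, not taken from the paper.

```python
import numpy as np

def orthogonalize(g_f, g_r, eps=1e-8):
    # Subtract from g_f its eps-damped component along g_r (Frobenius inner product).
    coef = np.sum(g_f * g_r) / (np.sum(g_r * g_r) + eps)
    return g_f - coef * g_r

rng = np.random.default_rng(0)
g_f, g_r = rng.normal(size=(4, 6)), rng.normal(size=(4, 6))

eps = 0.1                                   # deliberately large so the residual is visible
g_perp = orthogonalize(g_f, g_r, eps)

residual = abs(np.sum(g_perp * g_r))        # left-hand side of Eq. (20)
predicted = eps / (np.sum(g_r * g_r) + eps) * abs(np.sum(g_f * g_r))   # right-hand side

# With eps = 0 (and a nonzero retain gradient) the result is orthogonal up to round-off.
exact = abs(np.sum(orthogonalize(g_f, g_r, eps=0.0) * g_r))

print(f"residual={residual:.6f}, predicted={predicted:.6f}, eps=0 residual={exact:.2e}")
```

In practice a small positive `eps` guards against division by a vanishing retain gradient, at the cost of the small residual alignment quantified above.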
## Appendix C Additional Experimental Results
### C\.1 Detailed Setup
#### C\.1\.1 Datasets
Benchmark\. To evaluate multi\-task unlearning under diverse supervision granularities, we use two representative vision benchmarks spanning *image\-level*, *instance\-level*, and *pixel\-level* prediction\.
We first use NYUv2 \[[52](https://arxiv.org/html/2605.19042#bib.bib57)\] as a dense multi\-task benchmark with pixel\-level supervision\. NYUv2 consists of indoor RGB\-D images annotated with aligned pixel\-wise labels, enabling multiple dense prediction tasks to be defined on the same input\. In our setting, we consider three tasks: semantic segmentation \(SEG\), depth estimation \(DEP\), and surface\-normal prediction \(NOR\), which share the same input images but differ in output structure and loss formulation\. This benchmark allows us to study unlearning behavior under tightly coupled multi\-task supervision\.
We further use PASCAL VOC \[[19](https://arxiv.org/html/2605.19042#bib.bib55)\] to cover heterogeneous supervision across image\-level and instance\-level tasks\. Specifically, we consider multi\-label image classification \(CLS\) and object detection \(OD\), which differ in both annotation granularity and prediction format\. Unlike NYUv2, these tasks involve partially shared supervision and different annotation scopes, providing a complementary setting to evaluate unlearning under loosely coupled tasks\.
Together, NYUv2 and PASCAL VOC form a comprehensive testbed for multi\-task unlearning, covering diverse task semantics, output structures, and supervision granularities, ranging from dense pixel\-wise prediction to sparse instance\- and image\-level recognition\.
Dataset partitions\. We follow the official train/test splits of each benchmark \[[19](https://arxiv.org/html/2605.19042#bib.bib55),[52](https://arxiv.org/html/2605.19042#bib.bib57)\]\. Within the training split, we further divide the data into disjoint retain \(RET\) and unlearn \(UNL\) sets to simulate realistic unlearning requests: the unlearn set should be removed, while the retain set should be preserved\. The official test split is held out for evaluation and serves as the validation set \(VAL\)\.
The *retain set* contains supervision that should remain intact after unlearning and is used to evaluate whether useful knowledge is preserved\. The *unlearn set* contains the target data to be forgotten and is used to measure the effectiveness of forgetting\. The *validation set* is disjoint from both the retain and unlearn sets and is used to assess generalization beyond the directly affected training instances\.
This partitioning enables a unified evaluation of the trade\-offs between retention, forgetting, and generalization, which are inherently coupled in multi\-task unlearning\.
#### C\.1\.2 Baselines
We compare the proposed method with representative unlearning baselines from different methodological families:
- Original denotes the model trained on the full training set, including both the retain and forget sets\. It serves as the pre\-unlearning reference, especially for supervision that should remain unchanged under partial\-task unlearning\.
- Retrain retrains the model from scratch using only the retain set\. It serves as the ideal unlearning target and an upper\-bound reference, albeit at substantially higher computational cost\.
- NegGrad\+ \[[30](https://arxiv.org/html/2605.19042#bib.bib61),[22](https://arxiv.org/html/2605.19042#bib.bib21)\] performs gradient ascent on the forget set while incorporating retain\-side optimization to reduce damage to retained data\.
- Fisher \[[21](https://arxiv.org/html/2605.19042#bib.bib24)\] estimates parameter importance via Fisher information and selectively perturbs parameters associated with the forget data\.
- Influence \[[23](https://arxiv.org/html/2605.19042#bib.bib23)\] applies influence\-function\-based updates to estimate and remove the effect of forget examples\.
- SSD \[[20](https://arxiv.org/html/2605.19042#bib.bib59)\] selectively dampens parameters strongly associated with the forget data without requiring full retraining\.
- OrthoGrad \[[50](https://arxiv.org/html/2605.19042#bib.bib60)\] orthogonalizes forget updates against retained gradients to better preserve retained performance\.
- SCRUB \[[30](https://arxiv.org/html/2605.19042#bib.bib61)\] combines forgetting and retention via a teacher\-student framework\.
These baselines cover the two reference models \(Original and Retrain\) and the major families of approximate unlearning: first\-order gradient updates, second\-order methods based on influence functions or Fisher information, parameter\-dampening approaches, gradient\-projection mechanisms, and teacher\-student frameworks\.
#### C\.1\.3 Backbones
We evaluate our framework with two representative vision backbones: ViT\-L \[[15](https://arxiv.org/html/2605.19042#bib.bib10)\] and Swin\-L \[[39](https://arxiv.org/html/2605.19042#bib.bib9)\]\. ViT\-L follows the standard Vision Transformer architecture, which represents images as patch tokens and captures global dependencies through self\-attention\. It provides a strong fully attention\-based backbone for evaluating unlearning under shared representations\. Swin\-L adopts a hierarchical Transformer design with shifted\-window attention, enabling efficient local\-to\-global feature aggregation and strong performance on dense prediction tasks\. Using both backbones allows us to examine whether the proposed unlearning framework remains effective across different Transformer architectures, from global patch\-based attention to hierarchical window\-based attention\.
#### C\.1\.4 Evaluation Metrics
Task utility metrics\. We adopt standard evaluation metrics tailored to each task, covering pixel\-level, geometric, and instance\-level prediction quality\.
Semantic Segmentation \(mIoU\)\.For semantic segmentation, we use mean Intersection\-over\-Union \(mIoU\)\[[40](https://arxiv.org/html/2605.19042#bib.bib56)\]\. LetCCdenote the number of classes, and letTPc\\mathrm\{TP\}\_\{c\},FPc\\mathrm\{FP\}\_\{c\}, andFNc\\mathrm\{FN\}\_\{c\}denote the number of true positives, false positives, and false negatives for classcc, respectively\. The IoU for classccis defined as
IoUc=TPcTPc\+FPc\+FNc,\\mathrm\{IoU\}\_\{c\}=\\frac\{\\mathrm\{TP\}\_\{c\}\}\{\\mathrm\{TP\}\_\{c\}\+\\mathrm\{FP\}\_\{c\}\+\\mathrm\{FN\}\_\{c\}\},\(23\)and the mean IoU is given by
mIoU=1C∑c=1CIoUc\.\\mathrm\{mIoU\}=\\frac\{1\}\{C\}\\sum\_\{c=1\}^\{C\}\\mathrm\{IoU\}\_\{c\}\.\(24\)
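Eqs. (23) and (24) can be sketched as follows. This is a minimal illustration on flat label arrays, not the evaluation code used in the paper. One assumption goes beyond the equations: classes that appear in neither the prediction nor the ground truth are skipped to avoid a zero denominator.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    # pred and target are integer label arrays of the same shape.
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (target == c))
        fp = np.sum((pred == c) & (target != c))
        fn = np.sum((pred != c) & (target == c))
        denom = tp + fp + fn
        if denom > 0:            # assumption: ignore classes absent from both arrays
            ious.append(tp / denom)
    return float(np.mean(ious))

pred = np.array([0, 0, 1, 1, 2, 2])
target = np.array([0, 1, 1, 1, 2, 0])
# Per-class IoU: class 0 -> 1/3, class 1 -> 2/3, class 2 -> 1/2.
miou = mean_iou(pred, target, num_classes=3)
print(miou)  # 0.5
```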
Depth Estimation \(Threshold Accuracy\)\.For depth estimation, we use Threshold Accuracy\[[52](https://arxiv.org/html/2605.19042#bib.bib57)\]\. Letdid\_\{i\}andd^i\\hat\{d\}\_\{i\}denote the ground\-truth and predicted depth at pixelii, respectively\. For a thresholdδ\\delta, the accuracy is defined as
Acc\(δ\)=1N∑i=1N𝟏\(max\(d^idi,did^i\)<δ\),\\mathrm\{Acc\}\(\\delta\)=\\frac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}\\mathbf\{1\}\\left\(\\max\\left\(\\frac\{\\hat\{d\}\_\{i\}\}\{d\_\{i\}\},\\frac\{d\_\{i\}\}\{\\hat\{d\}\_\{i\}\}\\right\)<\\delta\\right\),\(25\)whereNNis the total number of pixels\. Common choices includeδ∈\{1\.25,1\.252,1\.253\}\\delta\\in\\\{1\.25,1\.25^\{2\},1\.25^\{3\}\\\}\.
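A minimal sketch of Eq. (25) is given below. It assumes strictly positive depth values; masking of invalid or missing depth pixels, which real benchmarks typically require, is omitted, and the toy values are illustrative.

```python
import numpy as np

def threshold_accuracy(pred, gt, delta=1.25):
    # Fraction of pixels whose depth ratio max(pred/gt, gt/pred) is below delta.
    ratio = np.maximum(pred / gt, gt / pred)
    return float(np.mean(ratio < delta))

gt = np.array([1.0, 2.0, 4.0, 8.0])
pred = np.array([1.1, 2.6, 4.0, 5.0])      # ratios: 1.1, 1.3, 1.0, 1.6

acc_1 = threshold_accuracy(pred, gt, 1.25)        # 0.5
acc_2 = threshold_accuracy(pred, gt, 1.25 ** 2)   # 0.75
acc_3 = threshold_accuracy(pred, gt, 1.25 ** 3)   # 1.0
print(acc_1, acc_2, acc_3)
```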
Surface Normal Prediction \(Angular Accuracy\)\.For surface\-normal prediction, we measure Angular Accuracy\[[52](https://arxiv.org/html/2605.19042#bib.bib57)\]\. Let𝐧i\\mathbf\{n\}\_\{i\}and𝐧^i\\hat\{\\mathbf\{n\}\}\_\{i\}denote the ground\-truth and predicted unit normal vectors at pixelii\. The angular error is
θi=arccos\(𝐧^i⊤𝐧i\),\\theta\_\{i\}=\\arccos\\left\(\\hat\{\\mathbf\{n\}\}\_\{i\}^\{\\top\}\\mathbf\{n\}\_\{i\}\\right\),\(26\)and the accuracy under a thresholdτ\\tauis defined as
Acc\(τ\)=1N∑i=1N𝟏\(θi<τ\),\\mathrm\{Acc\}\(\\tau\)=\\frac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}\\mathbf\{1\}\(\\theta\_\{i\}<\\tau\),\(27\)whereτ\\tauis typically set to11\.25∘11\.25^\{\\circ\},22\.5∘22\.5^\{\\circ\}, or30∘30^\{\\circ\}\.
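Eqs. (26) and (27) can be sketched in the same way. The example below normalizes both inputs defensively and clips the cosine to the valid domain of `arccos`; these safeguards and the toy normals are assumptions for illustration.

```python
import numpy as np

def angular_accuracy(pred, gt, tau_deg=11.25):
    # pred and gt have shape (N, 3); each row is a surface normal.
    pred = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    gt = gt / np.linalg.norm(gt, axis=-1, keepdims=True)
    cos = np.clip(np.sum(pred * gt, axis=-1), -1.0, 1.0)   # guard against round-off
    theta = np.degrees(np.arccos(cos))                     # angular error per pixel
    return float(np.mean(theta < tau_deg))

def tilted(deg):
    # Unit normal tilted by `deg` degrees away from the z-axis.
    r = np.radians(deg)
    return [np.sin(r), 0.0, np.cos(r)]

gt = np.array([[0.0, 0.0, 1.0]] * 3)
pred = np.array([tilted(0.0), tilted(10.0), tilted(25.0)])   # errors: 0, 10, 25 degrees

acc_tight = angular_accuracy(pred, gt, 11.25)   # 2 of 3 pixels
acc_loose = angular_accuracy(pred, gt, 30.0)    # all pixels
print(acc_tight, acc_loose)
```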
Object Detection and Classification \(mAP\)\.For object detection and multi\-label classification, we use mean Average Precision \(mAP\)\[[33](https://arxiv.org/html/2605.19042#bib.bib48),[19](https://arxiv.org/html/2605.19042#bib.bib55)\]\. For each classcc, predictions are ranked by confidence scores, and precision–recall pairs are computed\. The Average Precision \(AP\) for classccis defined as the area under the precision–recall curve,
APc=∫01Pc\(r\)𝑑r,\\mathrm\{AP\}\_\{c\}=\\int\_\{0\}^\{1\}P\_\{c\}\(r\)\\,dr,\(28\)wherePc\(r\)P\_\{c\}\(r\)denotes precision as a function of recallrr\. The mean Average Precision is then
mAP=1C∑c=1CAPc\.\\mathrm\{mAP\}=\\frac\{1\}\{C\}\\sum\_\{c=1\}^\{C\}\\mathrm\{AP\}\_\{c\}\.\(29\)For object detection, AP is computed based on Intersection\-over\-Union \(IoU\) thresholds between predicted and ground\-truth bounding boxes, following standard protocols\.
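For the classification case, Eqs. (28) and (29) can be approximated by the non-interpolated average precision, which sums the precision at each rank where a positive is retrieved. This is a sketch under that assumption. Detection benchmarks such as PASCAL VOC additionally match boxes by IoU and may use interpolated precision, neither of which is shown here.

```python
import numpy as np

def average_precision(scores, labels):
    # Non-interpolated AP: mean of the precision values at the ranks of the positives.
    order = np.argsort(-scores)                 # rank by descending confidence
    ranked = labels[order]
    precision = np.cumsum(ranked) / np.arange(1, len(ranked) + 1)
    return float(np.sum(precision * ranked) / np.sum(ranked))

# Two toy classes (illustrative values): confidence scores and binary labels per image.
scores_a, labels_a = np.array([0.9, 0.8, 0.7, 0.6]), np.array([1, 0, 1, 0])
scores_b, labels_b = np.array([0.2, 0.9, 0.4, 0.6]), np.array([0, 1, 0, 1])

ap_a = average_precision(scores_a, labels_a)    # (1 + 2/3) / 2 = 5/6
ap_b = average_precision(scores_b, labels_b)    # both positives ranked first -> 1.0
mean_ap = (ap_a + ap_b) / 2
print(ap_a, ap_b, mean_ap)
```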
Privacy Metric: Membership Inference Attack. To assess residual memorization after unlearning, we evaluate task-wise Membership Inference Attack (MIA) [[51](https://arxiv.org/html/2605.19042#bib.bib58), [66](https://arxiv.org/html/2605.19042#bib.bib84), [53](https://arxiv.org/html/2605.19042#bib.bib85)]. Given a trained model, an adversary aims to determine whether a sample $(\mathbf{x}, y)$ was used during training. In our evaluation, we adopt a loss-based MIA rather than training an additional attack classifier. Specifically, for each task $t$, we compute the per-sample task loss $\ell^{(t)}(\mathbf{x}, y^{(t)})$ and use its negative value as the membership score:

$$s^{(t)}(\mathbf{x}) = -\ell^{(t)}(\mathbf{x}, y^{(t)}).$$

This follows the intuition that member samples typically obtain lower losses and therefore higher membership scores.

For each task, we evaluate two attacks: retain-vs-val, which treats retain samples as members and validation samples as non-members, and unlearn-vs-val, which treats unlearn samples as members and validation samples as non-members. We use unlearn-vs-val as the main privacy metric for measuring residual membership leakage from the forgotten data, while retain-vs-val is recorded to monitor the membership behavior of retained samples. We report ROC-AUC as the MIA score and additionally record average precision (AP). Lower MIA AUC, or an AUC closer to the retrained reference, indicates that the unlearned model leaks less membership information about the forgotten samples.
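A loss-based attack of this kind can be sketched in a few lines. The snippet scores each sample with its negative loss and computes ROC-AUC as the probability that a member outscores a non-member, with ties counted as one half (the Mann-Whitney formulation of AUC). The function name and the loss values are made up for illustration; an actual evaluation would use per-sample task losses from the model.

```python
def mia_auc(member_losses, nonmember_losses):
    """ROC-AUC of the score s = -loss for separating members from non-members."""
    member_scores = [-loss for loss in member_losses]
    nonmember_scores = [-loss for loss in nonmember_losses]
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5  # ties contribute one half
    return wins / (len(member_scores) * len(nonmember_scores))


# Hypothetical per-sample losses for one task (illustrative only).
retain_losses = [0.1, 0.2, 0.3]
unlearn_losses = [0.2, 0.5, 0.9]
val_losses = [0.4, 0.8, 1.0]

retain_vs_val = mia_auc(retain_losses, val_losses)
unlearn_vs_val = mia_auc(unlearn_losses, val_losses)
```

In this toy case every retain sample has a lower loss than every validation sample, giving an AUC of 1.0, while the unlearn samples win 6 of 9 pairwise comparisons. An AUC near 0.5 would mean the attack cannot tell the two groups apart.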
#### C.1.5 Unlearning Impact Score (UIS)
While individual metrics capture task-specific performance, evaluating multi-task unlearning holistically is inherently challenging due to competing objectives, including retaining useful knowledge, forgetting target data, preserving generalization, and mitigating memorization. To provide a unified evaluation criterion, we follow prior work [[57](https://arxiv.org/html/2605.19042#bib.bib68), [10](https://arxiv.org/html/2605.19042#bib.bib69), [2](https://arxiv.org/html/2605.19042#bib.bib70), [50](https://arxiv.org/html/2605.19042#bib.bib60)] and assess each method against a desired reference behavior.
Specifically, forgotten tasks are expected to match the retrained model, while retained tasks should remain consistent with the original pre-unlearning model. Let $s_t^{*}$ denote the metric value of the evaluated unlearning model for task $t$, and let $\bar{s}_t^{*}$ denote the corresponding reference metric. For forgotten tasks $\mathcal{T}_f$, the reference is taken from the retrained model, whereas for retained tasks $\mathcal{T}_r$, the reference is taken from the original model.
To jointly capture performance across different evaluation dimensions, including retention, forgetting, generalization, and privacy, we define the task-level discrepancy as

$$s_t = \frac{|s_t^{\mathrm{retain}} - \bar{s}_t^{\mathrm{retain}}|}{\bar{s}_t^{\mathrm{retain}}} + \frac{|s_t^{\mathrm{unlearn}} - \bar{s}_t^{\mathrm{unlearn}}|}{\bar{s}_t^{\mathrm{unlearn}}} + \frac{|s_t^{\mathrm{val}} - \bar{s}_t^{\mathrm{val}}|}{\bar{s}_t^{\mathrm{val}}} + \frac{|s_t^{\mathrm{MIA}} - \bar{s}_t^{\mathrm{MIA}}|}{\bar{s}_t^{\mathrm{MIA}}}. \tag{30}$$

This formulation normalizes deviations across heterogeneous metrics, ensuring comparability across tasks and evaluation criteria.
The overall discrepancy score is then defined as the average across all tasks,

$$S = \frac{1}{|\mathcal{T}|}\sum_{t\in\mathcal{T}} s_t. \tag{31}$$

A smaller value of $S$ indicates that the unlearned model better aligns with the desired behavior, achieving effective forgetting while preserving utility, generalization, and privacy.
Finally, the reference construction adapts to different unlearning settings. For full-task unlearning, where $\mathcal{T}_f = \mathcal{T}$ and $\mathcal{T}_r = \emptyset$, all tasks are compared against the retrained model. For partial-task unlearning, let $t_f$ denote the forgotten task; then $\mathcal{T}_f = \{t_f\}$ and $\mathcal{T}_r = \mathcal{T} \setminus \{t_f\}$. In this case, the forgotten task uses the retrained reference, while each retained task is compared to the original model.
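To make the reference construction concrete, the sketch below evaluates Eqs. (30) and (31) on a two-task toy example. The dictionary layout, the function names, and every metric value are assumptions chosen for illustration; they are not results from the paper. Reference metrics are assumed to be nonzero, since Eq. (30) divides by them.

```python
DIMENSIONS = ("retain", "unlearn", "val", "mia")


def task_discrepancy(model_metrics, reference_metrics):
    """Eq. (30): sum of relative deviations over the four evaluation dimensions."""
    return sum(
        abs(model_metrics[d] - reference_metrics[d]) / reference_metrics[d]
        for d in DIMENSIONS
    )


def unlearning_impact_score(model, original, retrained, forgotten_tasks):
    """Eq. (31): forgotten tasks use the retrained reference, others the original."""
    total = 0.0
    for task, metrics in model.items():
        reference = retrained[task] if task in forgotten_tasks else original[task]
        total += task_discrepancy(metrics, reference)
    return total / len(model)


# Made-up metric values for two tasks (illustrative only).
model = {
    "seg": {"retain": 0.45, "unlearn": 0.44, "val": 0.50, "mia": 0.55},
    "depth": {"retain": 0.72, "unlearn": 0.80, "val": 0.80, "mia": 0.50},
}
original = {
    "seg": {"retain": 0.60, "unlearn": 0.60, "val": 0.50, "mia": 0.70},
    "depth": {"retain": 0.80, "unlearn": 0.80, "val": 0.80, "mia": 0.50},
}
retrained = {
    "seg": {"retain": 0.50, "unlearn": 0.40, "val": 0.50, "mia": 0.50},
    "depth": {"retain": 0.80, "unlearn": 0.80, "val": 0.80, "mia": 0.50},
}

# Partial-task unlearning: only "seg" is forgotten, so "depth" uses the original model.
uis_partial = unlearning_impact_score(model, original, retrained, {"seg"})
# Full-task unlearning: every task is compared against the retrained model.
uis_full = unlearning_impact_score(model, original, retrained, {"seg", "depth"})
```

In the partial setting, the forgotten task contributes 0.1 + 0.1 + 0 + 0.1 = 0.3 and the retained task contributes 0.1, so the score is 0.2. The full setting gives the same value here only because the toy original and retrained references for the second task coincide.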
#### C.1.6 Implementation Details
We instantiate our framework on two shared vision transformer backbones, ViT-L [[15](https://arxiv.org/html/2605.19042#bib.bib10)] and Swin-L [[39](https://arxiv.org/html/2605.19042#bib.bib9)], both initialized from official HuggingFace checkpoints [[61](https://arxiv.org/html/2605.19042#bib.bib62)] pretrained on ImageNet [[13](https://arxiv.org/html/2605.19042#bib.bib63)]. For NYUv2, we augment the backbone with an Atrous Spatial Pyramid Pooling (ASPP) module [[7](https://arxiv.org/html/2605.19042#bib.bib64)] and attach task-specific heads for semantic segmentation, depth estimation, and surface-normal prediction. For PASCAL VOC, we use a linear head for image classification and a DETR-style detection head with 100 object queries [[5](https://arxiv.org/html/2605.19042#bib.bib65)]. To reflect a realistic shared-parameter multi-task setting, we use a single shared LoRA with rank 16 [[26](https://arxiv.org/html/2605.19042#bib.bib66)] across all tasks on the same dataset, making unlearning particularly challenging due to the coupling between forgetting updates and retained supervision in the shared adaptation space.
Following prior unlearning protocols [[30](https://arxiv.org/html/2605.19042#bib.bib61), [50](https://arxiv.org/html/2605.19042#bib.bib60)], we designate 10% of the training instances as the unlearn set and use the remaining training data as the retain set. During unlearning, we further sample 10% of the retain set as an anchor subset to preserve retain-set performance. All compared methods, including our proposed approach and all baselines, use exactly the same unlearn/retain split and anchor subset for a fair comparison. To reduce sampling bias, we repeat the random unlearning split 10 times and report the averaged results across runs, using the same sampled splits for all baselines. Before unlearning, we train the target model for 15 epochs on NYUv2 and 50 epochs on PASCAL VOC for all compared methods. Unless otherwise specified, unlearning is performed by optimizing the trainable low-rank update parameters with AdamW [[41](https://arxiv.org/html/2605.19042#bib.bib67)] for up to 20 epochs. We apply early stopping based on the membership inference attack (MIA) score and select the checkpoint whose MIA performance is closest to the retrained reference, indicating that the model can no longer reliably distinguish the unlearn set from held-out validation data. We set $\eta_1 = 1$ and $\eta_2 = 0.1$ in all settings, with a learning rate of $1\times 10^{-4}$ for full-task unlearning and for partial-task unlearning on segmentation, depth, normal, and detection, and a learning rate of $3\times 10^{-4}$ for partial-task unlearning on classification. We provide the accompanying source code with preprocessing, training, and evaluation to facilitate reproducibility. All experiments were conducted on a workstation equipped with a single NVIDIA GeForce RTX 4090 GPU with 24GB of memory.
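The split protocol and the checkpoint selection rule described above can be sketched as follows. This is only an illustration under stated assumptions: the dataset size of 1000, the use of seeds 0 to 9 for the ten repetitions, the rounding of subset sizes, and all function names are hypothetical and not taken from the released code.

```python
import random


def make_split(num_train, seed, unlearn_frac=0.10, anchor_frac=0.10):
    """Sample a 10% unlearn set, the remaining retain set, and a 10% anchor subset."""
    rng = random.Random(seed)  # fixed seed so every method sees the same split
    indices = list(range(num_train))
    rng.shuffle(indices)
    num_unlearn = int(round(unlearn_frac * num_train))
    unlearn = indices[:num_unlearn]
    retain = indices[num_unlearn:]
    num_anchor = int(round(anchor_frac * len(retain)))
    anchor = rng.sample(retain, num_anchor)
    return unlearn, retain, anchor


def select_checkpoint(mia_per_epoch, retrained_mia):
    """Return the 0-based epoch whose MIA score is closest to the retrained reference."""
    return min(
        range(len(mia_per_epoch)),
        key=lambda epoch: abs(mia_per_epoch[epoch] - retrained_mia),
    )


# Ten repeated splits over a hypothetical training set of 1000 instances.
splits = [make_split(1000, seed) for seed in range(10)]
unlearn, retain, anchor = splits[0]

# Hypothetical MIA AUC trajectory over four unlearning epochs.
best_epoch = select_checkpoint([0.71, 0.62, 0.55, 0.48], retrained_mia=0.53)
```

Reusing the same seeds across methods is one simple way to guarantee that all baselines see identical unlearn, retain, and anchor sets.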
### C.2 Additional Experiments on Swin-L
Table 5: Quantitative results of multi-task unlearning on NYUv2 using Swin-L

Table [5](https://arxiv.org/html/2605.19042#A3.T5) further evaluates whether the proposed interference-aware framework generalizes beyond ViT-L by replacing the backbone with Swin-L [[39](https://arxiv.org/html/2605.19042#bib.bib9)]. Overall, our method consistently achieves the lowest UIS across all full-task and partial-task settings, reducing FU UIS from the strongest baseline at $31.2\%$ to $14.6\%$, and reducing the average PU UIS from $36.0\%$ to $11.2\%$, corresponding to a $68.8\%$ relative reduction. These results show that the proposed task-aware projection and instance-level orthogonalization are not specific to a particular transformer backbone.
Compared with ViT-L, Swin-L exhibits a slightly different unlearning behavior. Due to its hierarchical window-based design, Swin-L tends to produce more localized representations, which can benefit dense prediction tasks but also makes forgetting updates more sensitive to local spatial features. As a result, several baselines show larger degradation on retained segmentation (SEG) and normal (NORMAL) performance, especially under PU settings, indicating stronger interference when the forgotten task shares local spatial structures with retained tasks. In contrast, ViT-L uses global self-attention over image patches, leading to more globally mixed representations; this makes cross-task interference more diffuse but sometimes less catastrophic for individual dense tasks. Despite these architectural differences, our method remains stable on both backbones, suggesting that explicitly separating task-specific update directions and orthogonalizing retain-conflicting forget gradients is effective for both global-token and hierarchical-window representations.
## Appendix D Pseudocode
Algorithm 1: Multi-Task Unlearning

1. Input: pretrained weight $\mathbf{W}^{\star}$, forget set $\mathcal{D}_f$, retain subsets $\mathcal{D}_r^{\mathrm{clean}}$, $\mathcal{D}_r^{\mathrm{inst}}$, $\mathcal{D}_r^{\mathrm{task}}$, task set $\mathcal{T}$, learning rates $\eta_1, \eta_2$, rank $r$, subspace dimension $s$, stability constant $\varepsilon$
2. Output: unlearned weight $\widetilde{\mathbf{W}}$
3. Initialize low-rank factors $\mathbf{A} \in \mathbb{R}^{k \times r}$ and $\mathbf{B} \in \mathbb{R}^{d \times r}$
4. Initialize task bases $\{\mathbf{U}_t \in \mathbb{R}^{r \times s}\}_{t \in \mathcal{T}}$ and projectors $\mathbf{P}_t = \mathbf{U}_t \mathbf{U}_t^{\top}$
5. Keep $\mathbf{W}^{\star}$ frozen
6. for each unlearning iteration do
7. Sample mini-batches from $\mathcal{D}_f$, $\mathcal{D}_r^{\mathrm{clean}}$, $\mathcal{D}_r^{\mathrm{inst}}$, and $\mathcal{D}_r^{\mathrm{task}}$
8. for each involved task $t$ do
9. Compute forget gradients $\nabla_{\mathbf{A},f}$ and $\nabla_{\mathbf{B},f}$ from $\mathcal{D}_f$
10. Compute retain gradients $\nabla_{\mathbf{A},r}^{\mathrm{clean}}$, $\nabla_{\mathbf{A},r}^{\mathrm{inst}}$, $\nabla_{\mathbf{A},r}^{\mathrm{task}}$
11. Compute retain gradients $\nabla_{\mathbf{B},r}^{\mathrm{clean}}$, $\nabla_{\mathbf{B},r}^{\mathrm{inst}}$, $\nabla_{\mathbf{B},r}^{\mathrm{task}}$
12. Project all gradients using $\mathbf{P}_t$ as in Eq. ([7](https://arxiv.org/html/2605.19042#S4.E7))
13. Orthogonalize $\nabla_{\mathbf{A},f}$ sequentially: $\nabla_{\mathbf{A},f}^{\perp} = \Pi_{\nabla_{\mathbf{A},r}^{\mathrm{task}}}^{\perp}\Bigl(\Pi_{\nabla_{\mathbf{A},r}^{\mathrm{inst}}}^{\perp}\bigl(\Pi_{\nabla_{\mathbf{A},r}^{\mathrm{clean}}}^{\perp}(\nabla_{\mathbf{A},f})\bigr)\Bigr)$
14. Orthogonalize $\nabla_{\mathbf{B},f}$ sequentially: $\nabla_{\mathbf{B},f}^{\perp} = \Pi_{\nabla_{\mathbf{B},r}^{\mathrm{task}}}^{\perp}\Bigl(\Pi_{\nabla_{\mathbf{B},r}^{\mathrm{inst}}}^{\perp}\bigl(\Pi_{\nabla_{\mathbf{B},r}^{\mathrm{clean}}}^{\perp}(\nabla_{\mathbf{B},f})\bigr)\Bigr)$
15. Aggregate retain gradients: $\nabla_{\mathbf{A},r} = \nabla_{\mathbf{A},r}^{\mathrm{clean}} + \nabla_{\mathbf{A},r}^{\mathrm{inst}} + \nabla_{\mathbf{A},r}^{\mathrm{task}}$, $\quad \nabla_{\mathbf{B},r} = \nabla_{\mathbf{B},r}^{\mathrm{clean}} + \nabla_{\mathbf{B},r}^{\mathrm{inst}} + \nabla_{\mathbf{B},r}^{\mathrm{task}}$
16. Update $\mathbf{A}$ and $\mathbf{B}$ as in Eq. ([11](https://arxiv.org/html/2605.19042#S4.E11))
17. end for
18. Regularize task subspaces by minimizing $\sum_{t \neq t'} \|\mathbf{U}_t^{\top}\mathbf{U}_{t'}\|_F^2$
19. end for
20. Merge the low-rank update: $\widetilde{\mathbf{W}} = \mathbf{W}^{\star} + \mathbf{B}\mathbf{A}^{\top}$
21. return $\widetilde{\mathbf{W}}$
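
The sequential orthogonalization in steps 13 and 14 and the subspace regularizer in step 18 can be illustrated with a small sketch. It assumes that $\Pi_{\mathbf{g}}^{\perp}(\mathbf{v}) = \mathbf{v} - \frac{\langle \mathbf{v}, \mathbf{g}\rangle}{\|\mathbf{g}\|^2 + \varepsilon}\,\mathbf{g}$ with gradients flattened to vectors, and that $\varepsilon$ enters the denominator for stability; both are assumptions, since the definition of $\Pi^{\perp}$ is not given in this appendix. The task-aware projection of Eq. (7) and the update of Eq. (11) are not reproduced, and all names and values are hypothetical.

```python
def dot(u, v):
    """Inner product of two flattened gradients."""
    return sum(a * b for a, b in zip(u, v))


def project_out(v, g, eps=1e-12):
    """Remove from v its component along g (assumed form of the operator)."""
    coef = dot(v, g) / (dot(g, g) + eps)
    return [a - coef * b for a, b in zip(v, g)]


def orthogonalize_forget(grad_f, retain_grads, eps=1e-12):
    """Apply project_out sequentially, in the order clean, inst, task."""
    out = list(grad_f)
    for g in retain_grads:
        out = project_out(out, g, eps)
    return out


def subspace_overlap(bases):
    """Sum over ordered pairs t != t' of ||U_t^T U_t'||_F^2.

    Each basis is given as a list of its column vectors.
    """
    total = 0.0
    for i, basis_i in enumerate(bases):
        for j, basis_j in enumerate(bases):
            if i != j:
                total += sum(dot(u, v) ** 2 for u in basis_i for v in basis_j)
    return total


# Toy flattened gradients (illustrative only).
grad_f = [1.0, 2.0, 3.0, 4.0]
g_clean = [1.0, 0.0, 0.0, 0.0]
g_inst = [0.0, 1.0, 0.0, 0.0]
g_task = [0.0, 0.0, 1.0, 0.0]
grad_f_perp = orthogonalize_forget(grad_f, [g_clean, g_inst, g_task])

# Two one-dimensional task bases: disjoint directions versus identical directions.
e1, e2 = [1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]
overlap_disjoint = subspace_overlap([[e1], [e2]])
overlap_shared = subspace_overlap([[e1], [e1]])
```

In this toy case the three retain gradients are mutually orthogonal, so the result is orthogonal to all of them and only the fourth coordinate of the forget gradient survives. When the retain gradients are not mutually orthogonal, sequential projection of this form guarantees exact orthogonality only with respect to the last gradient applied.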