MAAT: Multi-phase Adapter-Aware Targeted Unlearning

arXiv cs.LG 06/01/26, 04:00 AM Papers
machine-unlearning causal-knowledge lora-adapter gradient-projection forgetting-retention benchmark 5wbench
Summary
The paper identifies a blind spot in machine unlearning benchmarks: underrepresentation of causal (Why-type) knowledge, and proposes 5WBench, a balanced benchmark, and Maat, a three-phase unlearning framework on LoRA adapters that achieves high forgetting and retention on causal facts.
arXiv:2605.30514v1 Announce Type: new Abstract: Machine unlearning evaluation is structurally skewed: Why-type questions, which probe causal and relational knowledge, comprise less than 0.06% of CounterFact, 0.6% of ZSRE, and less than 1.3% of TOFU, MUSE, and WMDP-Cyber. This near-zero representation means that methods that fail on causal knowledge can score highly in aggregate, and this failure is undetectable without balanced evaluation. We present 5WBENCH, a balanced 5,000-sample benchmark with 1,000 examples per 5W category (Who, What, When, Where, Why), making causal unlearning failures quantifiable for the first time. Using 5WBENCH, we show that no existing baseline simultaneously achieves high forgetting and high retention on Why-type questions: aggressive forgetting degrades retained knowledge, while conservative methods fail to forget causal facts. Why-type difficulty stems from multi-hop reasoning chains (44% of Why entries vs. less than or equal to 2% for others) and gradient dilution over 40.1-token answer spans. We present MAAT (Multi-phase Adapter-Aware Targeted Unlearning), a three-phase framework operating on LoRA adapter weights, combining gradient-projected ascent, SVD rank-dimension pruning, task vector negation, and hybrid KL-hidden-state retain repair. MAAT is the first method to simultaneously achieve high forgetting and high retention on Why-type causal knowledge, reaching a new operating point on the forget-retain Pareto frontier. We make our code publicly available.
Cached at: 06/01/26, 09:25 AM
# Maat: Multi-phase Adapter-Aware Targeted Unlearning
Source: [https://arxiv.org/html/2605.30514](https://arxiv.org/html/2605.30514)
Shubham Gaur², Saksham Thakur³, Vinija Jain⁴, Aman Chadha⁴, Amitava Das⁵

¹Indian Institute of Information Technology, Bhopal, India; ²University of California, Santa Cruz, USA; ³Independent Researcher; ⁴Stanford University, USA; ⁵BITS Pilani Goa, India

###### Abstract

Machine unlearning evaluation is structurally skewed: Why-type questions, which probe causal and relational knowledge, comprise less than 0.06% of CounterFact, 0.6% of ZSRE, and less than 1.3% of TOFU, MUSE, and WMDP-Cyber. This near-zero representation means that methods that fail on causal knowledge can score highly in aggregate, and this failure is undetectable without balanced evaluation. We present 5WBench, a balanced 5,000-sample benchmark with 1,000 examples per 5W category (Who, What, When, Where, Why), making causal unlearning failures quantifiable for the first time. Using 5WBench, we show that no existing baseline simultaneously achieves high forgetting and high retention on Why-type questions: aggressive forgetting degrades retained knowledge, while conservative methods fail to forget causal facts. Why-type difficulty stems from multi-hop reasoning chains (44% of Why entries vs. ≤ 2% for others) and gradient dilution over 40.1-token answer spans. We present Maat (Multi-phase Adapter-Aware Targeted Unlearning), a three-phase framework operating on LoRA adapter weights, combining gradient-projected ascent, SVD rank-dimension pruning, task vector negation, and hybrid KL–hidden-state retain repair. Maat is the first method to simultaneously achieve high forgetting and high retention on Why-type causal knowledge, reaching a new operating point on the forget–retain Pareto frontier. We make our code [publicly available](https://github.com/SuryanshYagnik/Machine-Unlearning).


## 1 Introduction

Every major machine unlearning benchmark shares a structural blind spot: causal knowledge. Why-type questions, which probe the relational and causal chains that underlie factual knowledge, comprise less than 0.06% of CounterFact, 0.6% of ZSRE, 1.2% of TOFU, 0.5% of MUSE, and 1.2% of WMDP-Cyber (Table [1](https://arxiv.org/html/2605.30514#S3.T1)). This is not an oversight in any single benchmark but a systematic property of how these datasets were constructed: all derive from entity-centric knowledge graphs and relation-extraction corpora that inherently underrepresent causal and relational knowledge. The consequence is a critical measurement gap: any unlearning method that fails on causal knowledge can score highly in aggregate, and this failure is statistically undetectable without balanced evaluation.

#### Why Causal Knowledge Resists Unlearning.

The gap is not merely quantitative: causal facts are qualitatively harder to unlearn. Why-type answers average 40.1 tokens versus 4.2–10.5 for other categories, and 44% involve multi-hop reasoning chains compared to ≤ 2% for other categories (Table [7](https://arxiv.org/html/2605.30514#A3.T7)). These properties cause severe gradient dilution: the ascent signal is spread across long token spans with no dominant direction to target. Crucially, our encoding analysis (Appendix [F](https://arxiv.org/html/2605.30514#A6)) shows this is not because Why-type facts are encoded differently; all 5W categories share uniform distributed encoding across layers. The difficulty is relational complexity and gradient dilution, not a unique weight-space footprint.

#### The Maat Framework.

We introduce Maat, a three-phase unlearning framework that operates directly on LoRA adapter weights without merging them into the base model. Rather than applying uniform gradient pressure, Maat performs structured adapter surgery: (1) gradient projection that orthogonalises forget updates against the retain gradient only when the two conflict; (2a) SVD-based pruning of Multi-Layer Perceptron (MLP) adapter dimensions to concentrate the forgetting signal on rank components specifically activated by forget-set inputs; (2b) task vector negation on the top-$k_F$ forget-scored rank dimensions; and (3) hybrid KL–hidden-state retain repair with an entropy term that prevents the repair phase from re-learning forgotten content. Evaluated with a Qwen 2.5-7B (Yang et al., [2024](https://arxiv.org/html/2605.30514#bib.bib18)) LLM-as-a-Judge, Maat is the first method to simultaneously achieve high forgetting and high retention on Why-type causal knowledge, a new operating point on the forget–retain Pareto frontier that no baseline reaches.

#### Contributions.

1. 5WBench: a balanced 5,000-sample benchmark providing 1,000 examples per 5W question category (Who, What, When, Where, Why), exposing the causal knowledge gap in existing unlearning evaluation through structured taxonomic coverage.
2. Maat: a three-phase structured LoRA adapter unlearning framework that achieves a new forget–retain operating point on Why-type causal knowledge, outperforming all baselines on the aggregate forget–retain tradeoff across both Llama 3.2-3B and Gemma 3-4B.

## 2 Related Work

#### Gradient-Based and Preference-Based Unlearning.

The dominant paradigm for LLM unlearning applies gradient ascent (GA) directly on the forget set to maximize loss on target facts. KL-regularised GA (Yao et al., [2023](https://arxiv.org/html/2605.30514#bib.bib2)) augments this with a divergence penalty against the original model's outputs on retain samples, while Gradient Difference (Liu et al., [2022](https://arxiv.org/html/2605.30514#bib.bib3)) combines forget-loss maximisation with retain-loss minimisation. A common failure mode across all gradient-based methods is that aggressive forgetting degrades model utility while conservative steps result in under-forgetting, particularly on long causal spans where the gradient signal is diffuse. More recently, unlearning has been framed as a preference alignment problem: Negative Preference Optimization (NPO) (Zhang et al., [2024](https://arxiv.org/html/2605.30514#bib.bib22)) treats forget data as a rejected distribution, applying DPO-style objectives without positive samples; SimNPO (Fan et al., [2024](https://arxiv.org/html/2605.30514#bib.bib35)) removes reference-model bias entirely, improving robustness to relearning attacks (Hu et al., [2025](https://arxiv.org/html/2605.30514#bib.bib36)). Despite these advances, none of these methods account for the internal structure of adapter weight spaces or the geometry of the retain manifold.

#### Localization, Weight Saliency, and Structured Editing.

A parallel line of work identifies and targets specific weight subspaces associated with the forget target. Rank-One Model Editing (ROME) (Meng et al., [2022a](https://arxiv.org/html/2605.30514#bib.bib5)) and Mass-Editing Memory in a Transformer (MEMIT) (Meng et al., [2022b](https://arxiv.org/html/2605.30514#bib.bib6)) localise factual associations in mid-layer MLP weights via causal tracing and apply rank-one or distributed updates. AlphaEdit (Fang et al., [2024](https://arxiv.org/html/2605.30514#bib.bib7)) improves specificity by projecting updates into the null space of the retained knowledge covariance, a conceptual predecessor to Maat's gradient projection phase. SalUn (Fan et al., [2023](https://arxiv.org/html/2605.30514#bib.bib23)) restricts updates to weights with the highest gradient-magnitude saliency on the forget set, providing the first principled weight-selection mechanism for unlearning. Mechanistic Unlearning (Guo et al., [2024](https://arxiv.org/html/2605.30514#bib.bib24)) uses circuit-level localization via path patching to identify and fine-tune only fact-lookup components, producing edits robust to adversarial probes. All localization-based methods assume target knowledge is stored in identifiable, localised positions. Our encoding analysis (Appendix [F](https://arxiv.org/html/2605.30514#A6)) shows that this assumption does not differentiate Why-type from other categories; the challenge is relational complexity and gradient dilution, not a unique encoding footprint.

#### Representation-Based and Adapter-Aware Methods.

Representation Misdirection for Unlearning (RMU) (Li et al., [2024](https://arxiv.org/html/2605.30514#bib.bib31)) steers intermediate activations toward a random direction in forget inputs while preserving retain representations, achieving state-of-the-art on WMDP. Circuit Breakers (Zou et al., [2024](https://arxiv.org/html/2605.30514#bib.bib25)) reroute harmful representations to be orthogonal to the original hidden states via a LoRA (Hu et al., [2022](https://arxiv.org/html/2605.30514#bib.bib1)) adapter, providing strong robustness under adversarial attacks. On the parameter-efficient side, LUNE (Cha et al., [2025](https://arxiv.org/html/2605.30514#bib.bib26)) fine-tunes LoRA adapters on negative examples to overwrite targeted knowledge, and KGA (Wang et al., [2023a](https://arxiv.org/html/2605.30514#bib.bib21)) aligns the knowledge gap between two reference models. LoKU and FILA (Cha et al., [2025](https://arxiv.org/html/2605.30514#bib.bib26)) apply Fisher Information to isolate forget-relevant parameters into LoRA adapters, providing the closest existing treatment of Fisher-guided adapter unlearning to Maat. Task vector negation (Ilharco et al., [2022](https://arxiv.org/html/2605.30514#bib.bib4)) computes $\tau=\theta_{\text{ft}}-\theta_{\text{base}}$ and subtracts it to approximate forgetting. This is the conceptual foundation of Maat's Phase 2b, which refines it to target only the top-$k_F$ forget-scored rank dimensions rather than the full adapter delta. Maat diverges from all of the above by treating LoRA matrices as structured spaces where rank dimensions can be explicitly scored via SVD, selectively pruned, and negated, without requiring negative example construction or full adapter replacement.

#### Second-Order and Geometry-Aware Methods.

Natural Gradient Descent (Amari, [1998](https://arxiv.org/html/2605.30514#bib.bib8)) preconditions updates with the inverse Fisher Information Matrix, yielding parameter-space geometry-aware updates. Selective Synaptic Dampening (SSD) (Foster et al., [2023](https://arxiv.org/html/2605.30514#bib.bib27)) uses Fisher Information Matrix ratios between training and forget distributions to dampen forget-specific parameters without retraining. SOUL (Jia et al., [2024](https://arxiv.org/html/2605.30514#bib.bib28)) establishes a connection between second-order optimization and influence-function unlearning, applying Sophia-based Hessian updates as a drop-in optimizer replacement for existing unlearning objectives, consistently outperforming first-order methods on TOFU (Maini et al., [2024](https://arxiv.org/html/2605.30514#bib.bib10)). Maat differs from SOUL (Jia et al., [2024](https://arxiv.org/html/2605.30514#bib.bib28)) in that it uses first-order gradient projection combined with SVD rank (Zhang, [2015](https://arxiv.org/html/2605.30514#bib.bib37)) scoring to achieve structural suppression, circumventing full Hessian approximations while targeting the adapter's forget subspace directly.

#### Benchmarks and Evaluation.

ZSRE (Levy et al., [2017](https://arxiv.org/html/2605.30514#bib.bib29)) and CounterFact (Hua et al., [2024](https://arxiv.org/html/2605.30514#bib.bib30)) dominate model editing evaluation but derive from entity-centric Wikidata triples and carry near-zero Why-type coverage (< 1%). TOFU (Maini et al., [2024](https://arxiv.org/html/2605.30514#bib.bib10)) provides clean fictitious-fact unlearning splits but no structured question taxonomy, and is dominated by What-type biographical attributes (84.7%). WMDP (Li et al., [2024](https://arxiv.org/html/2605.30514#bib.bib31)) targets hazardous capability suppression; MUSE (Shi et al., [2024](https://arxiv.org/html/2605.30514#bib.bib13)) evaluates six unlearning desiderata including privacy leakage and sustainability; RWKU (Jin et al., [2024](https://arxiv.org/html/2605.30514#bib.bib34)) provides zero-shot real-world entity unlearning with adversarial probes. None provide balanced causal Why-type coverage.

Recent work has further questioned benchmark reliability (Thaker et al., [2024](https://arxiv.org/html/2605.30514#bib.bib33); Hu et al., [2025](https://arxiv.org/html/2605.30514#bib.bib36); Dorna et al., [2025](https://arxiv.org/html/2605.30514#bib.bib32)): benchmark modifications expose residual accessible information, and fine-tuning on small auxiliary datasets reverses supposedly-unlearned knowledge.

5WBench directly addresses the causal coverage gap by providing 1,000 balanced Why-type samples, making systematic failures on causal knowledge quantifiable for the first time.

## 3 The 5WBench Benchmark

Table 1: Label distribution (%) across model editing and unlearning benchmarks. Why-type coverage (red) is near-zero in all existing datasets; 5WBench (green) provides balanced 20% splits across all categories.

Table 2: Representative sample from 5WBench (what-type, forget split). The `pred_answer` serves as the target answer for model editing.

### 3.1 Dataset Construction

5WBench is derived from the Factify-5WQA corpus (Rani et al., [2023](https://arxiv.org/html/2605.30514#bib.bib14)), a multi-document fact-verification dataset with structured 5W question-answer annotations. Construction proceeds in four steps.

1. Subject extraction. Stanford CoreNLP dependency parsing (Manning et al., [2014](https://arxiv.org/html/2605.30514#bib.bib41)) extracts the primary subject entity, which becomes the edit target.
2. Stratified sampling. We sample exactly 1,000 examples per 5W label, drawing uniformly within each category. Factify-5WQA (Rani et al., [2023](https://arxiv.org/html/2605.30514#bib.bib14)) has sufficient Why-type entries, a property not shared by ZSRE or CounterFact.
3. Forget/Retain split. Each label's 1,000 samples are split equally: 500 forget, 500 retain. Evaluation uses 100 samples per label per set (500 forget + 500 retain in total), stratified to ensure equal 5W representation.
4. Format standardisation. Each sample is formatted as a `(question, answer, label, rephrases)` tuple compatible with EasyEdit (Wang et al., [2023b](https://arxiv.org/html/2605.30514#bib.bib15)). The `label` reflects the semantic type of the relation queried, not the surface question word.
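The stratified sampling and forget/retain split described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the record schema (a dict with a `label` key), the function name `build_splits`, and the fixed random seed are all hypothetical.

```python
import random
from collections import defaultdict

LABELS = ("who", "what", "when", "where", "why")


def build_splits(records, per_label=1000, seed=0):
    """Sample `per_label` records per 5W label, then split each label 50/50.

    `records` is assumed to be a list of dicts with a "label" key
    (hypothetical schema, chosen for illustration only).
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec["label"]].append(rec)

    forget, retain = [], []
    for label in LABELS:
        pool = by_label[label]
        if len(pool) < per_label:
            raise ValueError(f"not enough {label!r} records: {len(pool)} < {per_label}")
        sample = rng.sample(pool, per_label)  # uniform draw without replacement
        half = per_label // 2
        forget.extend(sample[:half])   # first half of each label goes to forget
        retain.extend(sample[half:])   # second half goes to retain
    return forget, retain
```

Sampling within each label before splitting keeps both sets balanced across the five categories by construction, which is the property the benchmark relies on.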

#### Sample Format.

Each 5WBench record is a JSON object containing the question, ground-truth answer, the 5W label, and up to three rephrased question variants generated to test robustness of editing methods to surface-form variation. Table [2](https://arxiv.org/html/2605.30514#S3.T2) shows a representative what-type instance from the forget split, drawn from the ZSRE source.

### 3.2 Why-Type Facts Are Structurally Different

Why-type facts encode causal and relational chains (e.g., "Smoking causes lung cancer because it introduces carcinogens into lung tissue"). Answer spans average 40.1 tokens versus 4.2–10.5 for other types, and a staggering 44% involve multi-hop reasoning chains (Table [7](https://arxiv.org/html/2605.30514#A3.T7), Appendix [C](https://arxiv.org/html/2605.30514#A3)). This complexity means gradient ascent on a long token span cannot produce a coherent unlearning signal. 5WBench provides sufficient Why-type samples to study and quantify this failure mode for the first time.

## 4 The Maat Framework

![Refer to caption](https://arxiv.org/html/2605.30514v1/x1.png)

Figure 1: Overview of the Maat (Multi-phase Adapter-Aware Targeted Unlearning) architecture.

Maat addresses the unlearning challenge by operating on the structure of the LoRA adapter's parameter space (Figure [1](https://arxiv.org/html/2605.30514#S4.F1)). All three phases act exclusively on adapter matrices $\{\mathbf{A}_l,\mathbf{B}_l\}$; base model weights remain frozen throughout.

### 4.1 Phase 1: Gradient-Projected Unlearning

Standard gradient ascent applies a forget update $\mathbf{g}_f$ uniformly. If $\mathbf{g}_f$ has components aligned with the retain gradient $\mathbf{g}_r$, those components erode retained knowledge. Maat removes this component via *conditional* orthogonal projection, applied only when the forget and retain gradients actively conflict ($\mathbf{g}_f\cdot\mathbf{g}_r>0$):

$$\mathbf{g}_f^{\perp}=\begin{cases}\mathbf{g}_f-\dfrac{\mathbf{g}_f\cdot\mathbf{g}_r}{\|\mathbf{g}_r\|^{2}+\epsilon}\,\mathbf{g}_r&\text{if }\mathbf{g}_f\cdot\mathbf{g}_r>0\\[6pt]\mathbf{g}_f&\text{otherwise}\end{cases}\tag{1}$$

where $\epsilon>0$ is a small numerical stabiliser. When the gradients do not conflict, the full forget-ascent direction is preserved without attenuation, maintaining unlearning signal strength. The KL reference distribution is fixed to the pre-unlearning adapter state $\mathbf{W}_{\text{ref}}$. The unlearning objective to be minimised is:

$$\mathcal{L}_{\text{unlearn}}=-\mathcal{L}_{\text{forget}}(q,a)\;+\;\lambda\,\mathrm{KL}\!\left(p_{\mathbf{W}}\;\|\;p_{\mathbf{W}_{\text{ref}}}\right)_{\mathcal{D}_R}\tag{2}$$

where $\lambda>0$ controls the retain-anchoring strength.
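The conditional projection of Eq. 1 can be illustrated on plain Python lists. This is a minimal sketch, assuming the forget and retain gradients have already been flattened into vectors of equal length; the function names, the default value of `eps`, and the choice of flattening granularity (per layer or global) are assumptions made here, not details taken from the paper.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_forget_gradient(g_f, g_r, eps=1e-8):
    """Eq. 1: strip the component of g_f along g_r, but only when g_f . g_r > 0.

    g_f, g_r: flattened forget and retain gradients (assumed same length).
    eps: small stabiliser in the denominator (value chosen for illustration).
    """
    overlap = dot(g_f, g_r)
    if overlap <= 0:
        # No conflict under the paper's condition: keep the full forget direction.
        return list(g_f)
    scale = overlap / (dot(g_r, g_r) + eps)
    return [f - scale * r for f, r in zip(g_f, g_r)]
```

For example, with `g_f = [1.0, 1.0]` and `g_r = [1.0, 0.0]` the overlap is positive, so the first coordinate is removed (up to the `eps` stabiliser) and the result is approximately orthogonal to `g_r`; with `g_f = [-1.0, 1.0]` the overlap is negative and the gradient is returned unchanged.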

### 4.2 Phase 2a: SVD Rank-Dimension Pruning (MLP-only)

Residual forget-fact signal may persist in low-magnitude adapter directions not captured by the projected gradient. For each MLP LoRA pair $(\mathbf{A}_l,\mathbf{B}_l)$, rank dimensions are scored by their gradient-column norm over a sample of forget-set inputs:

$$s_k=\sum_{x\in\mathcal{D}_F^{\text{score}}}\left\|\nabla_{\mathbf{B}_l}\mathcal{L}(x)\right\|_{\text{col-}k}\tag{3}$$

The top-$\rho$ fraction of rank dimensions by score are zeroed in both A and B matrices on MLP modules (`down_proj`, `up_proj`, `gate_proj`) only. Attention modules are excluded: pruning attention rank dimensions at non-trivial ratios destroys the instruction-following pathway.
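The scoring rule of Eq. 3 and the subsequent zeroing can be sketched for a single LoRA pair. The sketch makes several assumptions that are not stated in the text: the column norm is taken to be the L2 norm, `B` has shape `d_out x r` and `A` has shape `r x d_in` (so rank dimension `k` is column `k` of `B` and row `k` of `A`), and the number of pruned dimensions is `floor(rho * r)`. Per-sample gradients are passed in as nested lists; obtaining them from a real model is outside the scope of this illustration.

```python
import math


def rank_scores(grads_B):
    """Eq. 3: for each rank dimension k, sum over samples the norm of column k of dL/dB.

    grads_B: list of per-sample gradients, each a d_out x r nested list.
    The L2 norm is an assumption; the paper only says "gradient-column norm".
    """
    r = len(grads_B[0][0])
    scores = [0.0] * r
    for grad in grads_B:
        for k in range(r):
            scores[k] += math.sqrt(sum(row[k] ** 2 for row in grad))
    return scores


def prune_top_rho(A, B, scores, rho):
    """Zero the top-rho fraction of rank dimensions in both A (rows) and B (columns)."""
    r = len(scores)
    n_prune = int(rho * r)  # assumption: round down
    ranked = sorted(range(r), key=lambda k: scores[k], reverse=True)
    pruned = set(ranked[:n_prune])
    A_new = [[0.0] * len(row) if k in pruned else list(row) for k, row in enumerate(A)]
    B_new = [[0.0 if k in pruned else v for k, v in enumerate(row)] for row in B]
    return A_new, B_new, sorted(pruned)
```

Zeroing row `k` of `A` together with column `k` of `B` removes that rank-one component from the product `B @ A` entirely, which is why both matrices are touched.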

### 4.3 Phase 2b: Task Vector Negation

Phase 2b introduces weight-space negation targeting the forget subspace. The top-$k_F$ fraction of forget-scored rank dimensions of each LoRA B matrix are masked to form a forget task vector $\boldsymbol{\tau}_l^{F}$, then subtracted:

$$\mathbf{B}_l\leftarrow\mathbf{B}_l-\alpha\cdot\boldsymbol{\tau}_l^{F},\quad\alpha>0\tag{4}$$

By confining negation to the highest forget-scoring rank dimensions (here $k_F=50\%$), Phase 2b achieves targeted suppression without inadvertently erasing retain-associated directions.
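The update in Eq. 4 can be sketched as follows. The interpretation of the forget task vector is an assumption: here it is taken to be the current `B` matrix with all columns outside the top-`k_frac` forget-scored rank dimensions masked to zero. The function name and argument names are hypothetical.

```python
def negate_forget_columns(B, scores, k_frac, alpha):
    """Eq. 4: B <- B - alpha * tau, with tau = B masked to the top-k_frac scored columns.

    B: d_out x r nested list; scores: one forget score per rank dimension.
    Building tau by masking B itself is an assumed reading of the text.
    """
    r = len(scores)
    n_sel = int(k_frac * r)  # assumption: round down
    ranked = sorted(range(r), key=lambda k: scores[k], reverse=True)
    selected = set(ranked[:n_sel])
    tau = [[v if k in selected else 0.0 for k, v in enumerate(row)] for row in B]
    B_new = [[v - alpha * t for v, t in zip(row, t_row)] for row, t_row in zip(B, tau)]
    return B_new, tau
```

Under this reading, selected columns are scaled by `(1 - alpha)`: `alpha = 1` zeroes them and `alpha > 1` flips their sign, while unselected columns are left untouched. That is the sense in which the negation is confined to the forget-scored subspace, in contrast to full adapter negation.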

### 4.4 Phase 3: Retain Repair with Hybrid Objective

SVD pruning and gradient ascent may partially degrade retain performance. Phase 3 recovers it through a hybrid repair objective combining four terms:

$$\mathcal{L}_{\text{repair}}=w_{\text{KL}}\,\mathrm{KL}(p_{\mathbf{W}}\,\|\,p_{\text{ref}})_{\mathcal{D}_R}+w_{\text{HS}}\,d_{\text{rep}}(h_{\mathbf{W}},h_{\text{ref}})_{\mathcal{D}_R}-w_{\text{ent}}\,\mathcal{H}_F(p_{\mathbf{W}})+w_{\text{TV}}\sum_{l}\cos(\mathbf{B}_l,\boldsymbol{\tau}_l^{F})^{+}\tag{5}$$

where $d_{\text{rep}}(h,h^{\prime})=1-\cos(h,h^{\prime})$ is the hidden-state representation distance, $\mathcal{H}_F(p_{\mathbf{W}})$ is the output entropy of the current model evaluated on forget-set answer tokens, and the final term penalises cosine similarity between current LoRA $\mathbf{B}_l$ weights and the forget task vector. Crucially, the entropy term enters with a *negative* sign: minimising $-w_{\text{ent}}\mathcal{H}_F$ is equivalent to maximising entropy on forget-set predictions, discouraging the model from recovering forgotten content during the repair phase.
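To make the sign conventions of Eq. 5 concrete, the four terms can be evaluated on toy inputs. The sketch below is heavily simplified and rests on several assumptions: a single retain example, a single forget-token distribution, a single layer with `B` and the task vector flattened to lists, natural logarithms, and the superscript `+` read as the positive part `max(0, .)`. All function and argument names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def repair_loss(p_cur, p_ref, h_cur, h_ref, p_forget, B_flat, tau_flat,
                w_kl=1.0, w_hs=1.0, w_ent=1.0, w_tv=1.0):
    """Single-example, single-layer version of Eq. 5 (weights default to 1 for illustration)."""
    kl_term = kl_divergence(p_cur, p_ref)          # stay close to the reference on retain data
    hs_term = 1.0 - cosine(h_cur, h_ref)           # hidden-state distance d_rep
    ent_term = entropy(p_forget)                   # entropy on forget-set tokens
    tv_term = max(0.0, cosine(B_flat, tau_flat))   # positive-part cosine to the forget task vector
    return w_kl * kl_term + w_hs * hs_term - w_ent * ent_term + w_tv * tv_term
```

Because the entropy term is subtracted, a flatter forget-token distribution lowers the loss, while the last term is only active when `B` still points in the direction of the forget task vector.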

### 4.5 Knowledge Implantation Protocol

To benchmark unlearning in a controlled setting, we first implant target facts via LoRA fine-tuning (rank $r$, scaling factor $\alpha_{\text{LoRA}}$, targeting the `q`, `k`, `v`, `o`, `gate`, `up`, and `down` projections on layers $[l_s,l_e]$) with 4-bit NF4 quantisation (learning rate $\eta_0>0$ for $E$ epochs). Crucially, the LoRA adapter is not merged into the base model: all Maat phases operate directly on adapter weights.

### 4.6 Maat Algorithm

We summarize the full Maat training procedure in Algorithm [1](https://arxiv.org/html/2605.30514#alg1).

Algorithm 1: Maat: Three-Phase LoRA Adapter Unlearning

1. Require: forget set $\mathcal{D}_F$; retain set $\mathcal{D}_R$; fine-tuned adapter $\{\mathbf{A}_l,\mathbf{B}_l\}$; frozen base $\mathbf{W}_{\text{base}}$; hyperparameters $T$, $S$, $\rho$, $\alpha$, $\eta_1$, $\eta_3$, $\eta_f$
2. Ensure: unlearned adapter $\{\mathbf{A}_l^{\prime},\mathbf{B}_l^{\prime}\}$
3. Save reference: $\mathbf{W}_{\text{ref}}\leftarrow\{\mathbf{A}_l,\mathbf{B}_l\}$
4. *Phase 1: Gradient-Projected Ascent*
5. for $(q,a)\in\mathcal{D}_F$ do
6. for $t=1$ to $T$ do
7. $\mathbf{g}_f\leftarrow+\nabla\mathcal{L}(q,a)$; $\mathbf{g}_r\leftarrow\nabla\mathrm{KL}(p_{\mathbf{W}}\,\|\,p_{\mathbf{W}_{\text{ref}}})$
8. Conditionally project $\mathbf{g}_f^{\perp}$ via Eq. [1](https://arxiv.org/html/2605.30514#S4.E1); apply ascent step with $\eta_1$
9. end for
10. end for
11. *Phase 2a: SVD Pruning (MLP LoRA only)*
12. Score rank dims via Eq. [3](https://arxiv.org/html/2605.30514#S4.E3); zero top-$\rho$ in `down_proj`/`up_proj`/`gate_proj` on layers $[l_s,l_e]$
13. *Phase 2b: Task Vector Negation*
14. Score top-$k_F$ forget dims; compute $\boldsymbol{\tau}_l^{F}$; subtract $\alpha\cdot\boldsymbol{\tau}_l^{F}$ from $\mathbf{B}_l$
15. *Phase 3: Hybrid Repair*
16. for $s=1$ to $S$ do
17. Minimise $\mathcal{L}_{\text{repair}}$ (Eq. [5](https://arxiv.org/html/2605.30514#S4.E5)) across all 7 module types; cosine-decay $\eta_3\to\eta_f$
18. end for
19. return $\{\mathbf{A}_l^{\prime},\mathbf{B}_l^{\prime}\}$
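Step 17 of Algorithm 1 decays the repair learning rate from η3 to ηf with a cosine schedule. The exact form of the schedule is not given in this section; the sketch below assumes the common half-cosine interpolation over the S repair steps, and the function name and step indexing (step 0 returns η3, step S returns ηf) are hypothetical.

```python
import math


def cosine_decay_lr(step, total_steps, eta_3, eta_f):
    """Half-cosine interpolation from eta_3 (at step 0) to eta_f (at step == total_steps).

    Assumed schedule: eta_f + 0.5 * (eta_3 - eta_f) * (1 + cos(pi * step / total_steps)).
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return eta_f + 0.5 * (eta_3 - eta_f) * (1.0 + math.cos(math.pi * progress))
```

The schedule changes slowly near both endpoints and fastest in the middle, so early repair steps use close to the full η3 while the final steps settle near ηf.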

## 5 Experimental Setup

#### Models.

We evaluate on two instruction-tuned models from distinct architecture families: Llama 3.2-3B-Instruct (Team, [2024b](https://arxiv.org/html/2605.30514#bib.bib16)) and Gemma 3-4B-Instruct (Team, [2024a](https://arxiv.org/html/2605.30514#bib.bib17)). Both fit on a single consumer GPU (≤ 24 GB VRAM). Inference uses greedy decoding with temperature = 0 and `max_new_tokens` = 100. Maat's hyperparameters on 5WBench are listed in Appendix [E](https://arxiv.org/html/2605.30514#A5), Table [8](https://arxiv.org/html/2605.30514#A5.T8).

#### Datasets.

5WBench (see §[3](https://arxiv.org/html/2605.30514#S3) for details): 1,000 per label (5,000 total); experiments use 100 per label per set (forget 500 + retain 500), stratified to ensure equal 5W representation. TOFU (Maini et al., [2024](https://arxiv.org/html/2605.30514#bib.bib10)): we use the `forget05`/`retain95` split. Although TOFU is not statistically useful for label-wise semantic computation due to severe category imbalance (see Appendix [G](https://arxiv.org/html/2605.30514#A7) for distributions), it is included to ensure comparability and completeness with existing literature. TOFU contains no meaningful Why-type questions.

#### Baselines.

All baselines operate on the fine-tuned LoRA adapter.

- Gradient Ascent (GA): Direct loss maximisation on the forget set.
- Gradient Ascent with KL Divergence (GA+KL): GA regularised by KL divergence.
- Adapter Negation (AN): Full task vector negation, $\alpha=1.0$.
- Retain-Only Fine-Tuning (RO-FT): Fine-tuning exclusively on the retain set.

#### Evaluation Metrics.

LLM-as-Judge (primary) (Zheng et al., [2023](https://arxiv.org/html/2605.30514#bib.bib40)). We evaluate Forget Success Rate (FSR) and Retain Success Rate (RSR) using a Qwen 2.5-7B judge that determines whether the model's output semantically contains the ground-truth answer. FSR = 1 when the ground truth is *not* present (successful forgetting); RSR = 1 when it is present (successful retention). Full prompt template and evaluation rules are provided in Appendix [H](https://arxiv.org/html/2605.30514#A8).
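Given per-sample judge verdicts, the two rates reduce to simple counting. The sketch below assumes each verdict has already been reduced to a boolean ("the ground truth is semantically present in the output"); the function name and input format are hypothetical, and the judge call itself is not modelled.

```python
def success_rates(forget_contains_truth, retain_contains_truth):
    """Return (FSR, RSR) in percent from boolean judge verdicts.

    Each list holds one boolean per sample: True if the judge found the
    ground-truth answer in the model output.
    """
    if not forget_contains_truth or not retain_contains_truth:
        raise ValueError("both verdict lists must be non-empty")
    # Forgetting succeeds when the ground truth is absent from the output.
    fsr = 100.0 * sum(1 for hit in forget_contains_truth if not hit) / len(forget_contains_truth)
    # Retention succeeds when the ground truth is present in the output.
    rsr = 100.0 * sum(1 for hit in retain_contains_truth if hit) / len(retain_contains_truth)
    return fsr, rsr
```

The same boolean verdict counts as a success on one split and a failure on the other, which is why a model that forgets nothing scores near 0% FSR and near 100% RSR.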

ROUGE: We report ROUGE-1, ROUGE-2, and ROUGE-L (Lin, [2004](https://arxiv.org/html/2605.30514#bib.bib39)) on the forget and retain splits before and after unlearning in Appendix [A](https://arxiv.org/html/2605.30514#A1). Lower forget ROUGE indicates more successful token-level erasure; higher retain ROUGE indicates better preservation of non-targeted knowledge.

## 6 Results

Tables [3](https://arxiv.org/html/2605.30514#S6.T3) and [4](https://arxiv.org/html/2605.30514#S6.T4) present results on 5WBench and TOFU respectively. Appendix [I](https://arxiv.org/html/2605.30514#A9) provides qualitative examples illustrating how each method handles Why-type questions on both forget and retain splits. 5WBench's balanced 5W coverage exposes performance differences between methods that TOFU's category-skewed evaluation cannot resolve.

Table 3: Unlearning results on 5WBench (Factify; 100 forget + 100 retain per 5W label; 500 total each split). Judge: Qwen 2.5-7B. FSR ↑ = Forget Success Rate; RSR ↑ = Retain Success Rate. Amber: Why-type category row. Green: best FSR-RSR balance per model. Bold: Maat (proposed method).

Table 4: Unlearning results on TOFU (`forget05`/`retain95`: 200 forget, 3,800 retain). Judge: Qwen 2.5-7B. TOFU contains no Why-type questions (N/A). Green: Maat results.

### 6.1 5WBench Results (forget500/retain500)

#### Maat dominates the forget–retain Pareto frontier.

Maat achieves the best aggregate FSR–RSR balance on both models: 77.4% / 71.6% on Llama 3.2-3B and 64.0% / 61.8% on Gemma 3-4B. The cleanest single comparison is against RO-FT: both methods reach an identical 77.4% average FSR on Llama 3.2-3B, yet Maat delivers this with 71.6% RSR versus RO-FT's 35.2%, a +36.4-point retention gain at zero forget cost. More strikingly, Maat is the *only* method to simultaneously exceed 60% FSR and 60% RSR on all five 5W categories on Llama 3.2-3B (Who: 83/72, When: 79/77, What: 82/71, Where: 80/73, Why: 63/65); no baseline achieves this threshold on any single category.
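The aggregate figures follow directly from the per-category numbers quoted above, since each category contributes the same number of evaluation samples. The short check below recomputes them; the dictionary and function names are illustrative only, and the values are copied from the text.

```python
# Per-category (FSR, RSR) for Maat on Llama 3.2-3B, as quoted in the text.
llama_maat = {
    "Who": (83, 72),
    "When": (79, 77),
    "What": (82, 71),
    "Where": (80, 73),
    "Why": (63, 65),
}


def macro_average(per_category):
    """Unweighted mean of FSR and RSR across categories (valid because categories are balanced)."""
    n = len(per_category)
    avg_fsr = sum(fsr for fsr, _ in per_category.values()) / n
    avg_rsr = sum(rsr for _, rsr in per_category.values()) / n
    return avg_fsr, avg_rsr


def exceeds_threshold(per_category, threshold=60):
    """True if every category is strictly above the threshold on both FSR and RSR."""
    return all(fsr > threshold and rsr > threshold for fsr, rsr in per_category.values())
```

Here `macro_average(llama_maat)` gives 77.4 and 71.6, matching the reported aggregate, and `exceeds_threshold(llama_maat)` holds with Why (63/65) as the binding category.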

#### Baseline tradeoff failure modes.

Each baseline occupies a distinct failure region of the Pareto frontier. GA achieves moderate average FSR (57.4% Llama 3.2-3B, 61.2% Gemma 3-4B) at the cost of RSR lagging Maat by 18 and 11 points respectively. GA+KL (33.8% FSR on Llama 3.2-3B) sacrifices forgetting for retention, with the lowest average FSR of any real method. Adapter Negation (AN) maximises FSR (99.8% Llama 3.2-3B, 82.0% Gemma 3-4B) but triggers catastrophic forgetting uniformly across *all* 5W categories: RSR collapses to 0.4% on Llama 3.2-3B and 25.0% on Gemma 3-4B. This confirms that full task vector negation removes both forget and retain knowledge indiscriminately, because the entire adapter delta encodes both simultaneously. The partial resistance on Gemma 3-4B (25.0% RSR) compared to Llama 3.2-3B (0.4% RSR) suggests architecture-dependent sensitivity to full adapter negation.

#### Why-type results and the under-forgetting artifact.

On the Why category, Maat achieves 63% FSR / 65% RSR on Llama 3.2-3B and 55% FSR / 56% RSR on Gemma 3-4B. GA's higher Why-RSR (78% on Llama 3.2-3B) is not evidence of better retention; it is a mechanical artifact of under-forgetting. At 44% Why-FSR, more than half the target causal facts remain in the model, so retention is preserved by default rather than by design. Maat forgets 19 percentage points more Why-type knowledge (63% vs 44%) while losing only 13 RSR points (65% vs 78%), making it the only method that meaningfully removes causal knowledge without indiscriminate retain damage. GA+KL (22% Why-FSR / 93% Why-RSR on Llama 3.2-3B) represents the extreme of this artifact: near-perfect retention because almost nothing has been forgotten.

#### Per-category localized failures.

Beyond the Why category, 5WBench's balanced coverage reveals localized failures in other categories. GA's What-type FSR on Llama 3.2-3B (48%) approaches its Why-type FSR (44%), and Maat's FSR gap over GA is actually larger on What (+31 points) than on Why (+19 points), showing that GA's weakness is not unique to causal knowledge. GA+KL's When-RSR on Gemma 3-4B collapses to 39% while all other categories remain at 62–71%, indicating a localized temporal knowledge failure under KL regularization on this architecture. RO-FT's Where-RSR on Llama 3.2-3B (23%) is dramatically lower than its other categories (35–42%). 5WBench's balanced 500-sample splits per category provide the statistical power to detect these localized failures, a granularity unavailable in existing benchmarks.

#### Consistency and architecture sensitivity\.

Maat's FSR ranges from 63% to 83% across categories on Llama 3.2-3B, compared to GA's 44–71%, indicating more consistent unlearning behavior regardless of question type. This is an important practical property, since real-world forget requests may contain any mix of knowledge types. All methods degrade moving from Llama 3.2-3B to Gemma 3-4B, but differently: RO-FT swings dramatically (FSR −26.0, RSR +35.2), indicating high architecture-dependence, while Maat degrades more uniformly (FSR −13.4, RSR −9.8), maintaining a balanced operating point across both architectures.

### 6.2 TOFU Results (forget05/retain95)

On TOFU, Maat achieves the highest RSR among methods with >60% FSR on Llama 3.2-3B (67% FSR / 46.6% RSR). AN again triggers catastrophic forgetting (100% FSR / 0.2% RSR), replicating the indiscriminate-negation pattern seen on 5WBench. On Gemma 3-4B, methods cluster tightly: Maat (61.5% / 48.7%), RO-FT (63.0% / 46.2%), and GA+KL (54.0% / 47.4%) differ by 1–3 points, within the noise range of LLM-judge evaluation. TOFU's category imbalance (84.7% What-type, ≤5% each for When/Who/Where, 1.2% Why) means aggregate FSR–RSR is dominated by a single question type. 5WBench's balanced 500-sample splits per category provide the statistical power to surface per-category differences that TOFU's distribution cannot resolve. TOFU results are included for comparability with existing literature.

## 7 Conclusion

We introduced 5WBench, a balanced 5,000-sample benchmark that makes causal unlearning failures quantifiable for the first time, and Maat, a three-phase structured LoRA adapter unlearning framework that concentrates the forgetting signal on rank dimensions specifically activated by forget-set inputs.

Using 5WBench, we established that no existing baseline simultaneously achieves high forgetting and high retention on Why-type causal knowledge, and that Maat is the first method to do so, exceeding 60% FSR and 60% RSR on all five 5W categories on Llama 3.2-3B, including Why-type causal questions where all baselines face a fundamental forget–retain tradeoff.

The performance gap between Llama 3.2-3B (77.4/71.6 average FSR/RSR) and Gemma 3-4B (64.0/61.8) reveals that adapter-based unlearning inherits the base model's knowledge encoding structure: forget and retain knowledge are more separable in Llama 3.2-3B's adapter rank dimensions, while Gemma 3-4B's interleaved local/global attention pattern distributes knowledge more diffusely, reducing separability. Architecture-aware adapter rank selection is an important open direction for future work.

Beyond unlearning, 5WBench's balanced 5W taxonomy and EasyEdit-compatible format make it directly applicable to model editing evaluation, including insertion and modification on causal knowledge, a category no existing editing benchmark adequately covers.

## Limitations

#### Evaluation protocol\.

FSR and RSR are computed using a single Qwen 2.5-7B LLM judge. While this provides a reproducible, semantically-aware alternative to exact-match metrics, it may have calibration differences compared to proprietary judges or human evaluation. Judge availability and versioning may affect reproducibility across time. Future work should complement LLM-judge evaluation with adversarial paraphrase probing to assess whether unlearning is genuine or merely suppresses verbatim recall.

#### Dataset scope\.

5WBench is derived from the Factify-5WQA corpus (Rani et al., [2023](https://arxiv.org/html/2605.30514#bib.bib14)), which was designed for fact verification rather than knowledge editing. The subject-extraction pipeline may introduce noise for facts with implicit or pronominal subjects. Answer length and multi-hop statistics in Table [7](https://arxiv.org/html/2605.30514#A3.T7) are computed on the full annotated dataset. The complete benchmark will be released upon publication.

#### Model and benchmark scope\.

Evaluation covers two model families at 3–4B scale on 5WBench and TOFU; extension to 7B+ checkpoints and additional architectures is left for future work. The performance gap between Llama 3.2-3B and Gemma 3-4B suggests that adapter rank separability is architecture-dependent, and optimal hyperparameters for the Phase 2a pruning ratio $\rho$ and the Phase 2b mask $k_F$ may require architecture-specific tuning. While 5WBench's format supports insertion and modification operations, this paper evaluates only unlearning; extending to other editing operations is left for future work.

## Ethical considerations

The unlearning methods developed in this paper address legitimate privacy and safety needs including GDPR compliance and correction of harmful associations\. We acknowledge a dual\-use concern: the same techniques could be applied adversarially to remove safety\-relevant knowledge from aligned models\.

## References

- S\. Amari \(1998\)Natural gradient works efficiently in learning\.Neural Comput\.10\(2\),pp\. 251–276\.External Links:[Link](https://doi.org/10.1162/089976698300017746),[Document](https://dx.doi.org/10.1162/089976698300017746)Cited by:[§2](https://arxiv.org/html/2605.30514#S2.SS0.SSS0.Px4.p1.1)\.
- S\. Cha, S\. Cho, D\. Hwang, and M\. Lee \(2025\)Towards robust and parameter\-efficient knowledge unlearning for llms\.InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24\-28, 2025,External Links:[Link](https://openreview.net/forum?id=1ExfUpmIW4)Cited by:[§2](https://arxiv.org/html/2605.30514#S2.SS0.SSS0.Px3.p1.2)\.
- V\. Dorna, A\. Mekala, W\. Zhao, A\. McCallum, Z\. C\. Lipton, J\. Z\. Kolter, and P\. Maini \(2025\)OpenUnlearning: accelerating LLM unlearning via unified benchmarking of methods and metrics\.CoRRabs/2506\.12618\.External Links:[Link](https://doi.org/10.48550/arXiv.2506.12618),[Document](https://dx.doi.org/10.48550/ARXIV.2506.12618),2506\.12618Cited by:[§2](https://arxiv.org/html/2605.30514#S2.SS0.SSS0.Px5.p2.1)\.
- C\. Fan, J\. Liu, L\. Lin, J\. Jia, R\. Zhang, S\. Mei, and S\. Liu \(2024\)Simplicity prevails: rethinking negative preference optimization for LLM unlearning\.CoRRabs/2410\.07163\.External Links:[Link](https://doi.org/10.48550/arXiv.2410.07163),[Document](https://dx.doi.org/10.48550/ARXIV.2410.07163),2410\.07163Cited by:[§2](https://arxiv.org/html/2605.30514#S2.SS0.SSS0.Px1.p1.1)\.
- C\. Fan, J\. Liu, Y\. Zhang, D\. Wei, E\. Wong, and S\. Liu \(2023\)SalUn: empowering machine unlearning via gradient\-based weight saliency in both image classification and generation\.CoRRabs/2310\.12508\.External Links:[Link](https://doi.org/10.48550/arXiv.2310.12508),[Document](https://dx.doi.org/10.48550/ARXIV.2310.12508),2310\.12508Cited by:[§2](https://arxiv.org/html/2605.30514#S2.SS0.SSS0.Px2.p1.1)\.
- J\. Fang, H\. Jiang, K\. Wang, Y\. Ma, X\. Wang, X\. He, and T\. Chua \(2024\)AlphaEdit: null\-space constrained knowledge editing for language models\.CoRRabs/2410\.02355\.External Links:[Link](https://doi.org/10.48550/arXiv.2410.02355),[Document](https://dx.doi.org/10.48550/ARXIV.2410.02355),2410\.02355Cited by:[§2](https://arxiv.org/html/2605.30514#S2.SS0.SSS0.Px2.p1.1)\.
- J\. Foster, S\. Schoepf, and A\. Brintrup \(2023\)Fast machine unlearning without retraining through selective synaptic dampening\.CoRRabs/2308\.07707\.External Links:[Link](https://doi.org/10.48550/arXiv.2308.07707),[Document](https://dx.doi.org/10.48550/ARXIV.2308.07707),2308\.07707Cited by:[§2](https://arxiv.org/html/2605.30514#S2.SS0.SSS0.Px4.p1.1)\.
- P\. Guo, A\. Syed, A\. Sheshadri, A\. Ewart, and G\. K\. Dziugaite \(2024\)Mechanistic unlearning: robust knowledge unlearning and editing via mechanistic localization\.CoRRabs/2410\.12949\.External Links:[Link](https://doi.org/10.48550/arXiv.2410.12949),[Document](https://dx.doi.org/10.48550/ARXIV.2410.12949),2410\.12949Cited by:[§2](https://arxiv.org/html/2605.30514#S2.SS0.SSS0.Px2.p1.1)\.
- E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen \(2022\)LoRA: low\-rank adaptation of large language models\.InThe Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25\-29, 2022,External Links:[Link](https://openreview.net/forum?id=nZeVKeeFYf9)Cited by:[§2](https://arxiv.org/html/2605.30514#S2.SS0.SSS0.Px3.p1.2)\.
- S\. Hu, Y\. Fu, S\. Z\. Wu, and V\. Smith \(2025\)Unlearning or obfuscating? jogging the memory of unlearned llms via benign relearning\.InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24\-28, 2025,External Links:[Link](https://openreview.net/forum?id=fMNRYBvcQN)Cited by:[§2](https://arxiv.org/html/2605.30514#S2.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2605.30514#S2.SS0.SSS0.Px5.p2.1)\.
- W\. Hua, J\. Guo, M\. Dong, H\. Zhu, P\. Ng, and Z\. Wang \(2024\)Propagation and pitfalls: reasoning\-based assessment of knowledge editing through counterfactual tasks\.CoRRabs/2401\.17585\.External Links:[Link](https://doi.org/10.48550/arXiv.2401.17585),[Document](https://dx.doi.org/10.48550/ARXIV.2401.17585),2401\.17585Cited by:[§2](https://arxiv.org/html/2605.30514#S2.SS0.SSS0.Px5.p1.2)\.
- G\. Ilharco, M\. T\. Ribeiro, M\. Wortsman, S\. Gururangan, L\. Schmidt, H\. Hajishirzi, and A\. Farhadi \(2022\)Editing models with task arithmetic\.CoRRabs/2212\.04089\.External Links:[Link](https://doi.org/10.48550/arXiv.2212.04089),[Document](https://dx.doi.org/10.48550/ARXIV.2212.04089),2212\.04089Cited by:[§2](https://arxiv.org/html/2605.30514#S2.SS0.SSS0.Px3.p1.2)\.
- J\. Jia, Y\. Zhang, Y\. Zhang, J\. Liu, B\. Runwal, J\. Diffenderfer, B\. Kailkhura, and S\. Liu \(2024\)SOUL: unlocking the power of second\-order optimization for LLM unlearning\.CoRRabs/2404\.18239\.External Links:[Link](https://doi.org/10.48550/arXiv.2404.18239),[Document](https://dx.doi.org/10.48550/ARXIV.2404.18239),2404\.18239Cited by:[§2](https://arxiv.org/html/2605.30514#S2.SS0.SSS0.Px4.p1.1)\.
- Z\. Jin, P\. Cao, C\. Wang, Z\. He, H\. Yuan, J\. Li, Y\. Chen, K\. Liu, and J\. Zhao \(2024\)RWKU: benchmarking real\-world knowledge unlearning for large language models\.CoRRabs/2406\.10890\.External Links:[Link](https://doi.org/10.48550/arXiv.2406.10890),[Document](https://dx.doi.org/10.48550/ARXIV.2406.10890),2406\.10890Cited by:[§2](https://arxiv.org/html/2605.30514#S2.SS0.SSS0.Px5.p1.2)\.
- O\. Levy, M\. Seo, E\. Choi, and L\. Zettlemoyer \(2017\)Zero\-shot relation extraction via reading comprehension\.CoRRabs/1706\.04115\.External Links:[Link](http://arxiv.org/abs/1706.04115),1706\.04115Cited by:[§2](https://arxiv.org/html/2605.30514#S2.SS0.SSS0.Px5.p1.2)\.
- N\. Li, A\. Pan, A\. Gopal, S\. Yue, D\. Berrios, A\. Gatti, J\. D\. Li, A\. Dombrowski, S\. Goel, L\. Phan, G\. Mukobi, N\. Helm\-Burger, R\. Lababidi, L\. Justen, A\. B\. Liu, M\. Chen, I\. Barrass, O\. Zhang, X\. Zhu, R\. Tamirisa, B\. Bharathi, A\. Khoja, Z\. Zhao, A\. Herbert\-Voss, C\. B\. Breuer, A\. Zou, M\. Mazeika, Z\. Wang, P\. Oswal, W\. Liu, A\. A\. Hunt, J\. Tienken\-Harder, K\. Y\. Shih, K\. Talley, J\. Guan, R\. Kaplan, I\. Steneker, D\. Campbell, B\. Jokubaitis, A\. Levinson, J\. Wang, W\. Qian, K\. K\. Karmakar, S\. Basart, S\. Fitz, M\. Levine, P\. Kumaraguru, U\. K\. Tupakula, V\. Varadharajan, Y\. Shoshitaishvili, J\. Ba, K\. M\. Esvelt, A\. Wang, and D\. Hendrycks \(2024\)The WMDP benchmark: measuring and reducing malicious use with unlearning\.CoRRabs/2403\.03218\.External Links:[Link](https://doi.org/10.48550/arXiv.2403.03218),[Document](https://dx.doi.org/10.48550/ARXIV.2403.03218),2403\.03218Cited by:[§2](https://arxiv.org/html/2605.30514#S2.SS0.SSS0.Px3.p1.2),[§2](https://arxiv.org/html/2605.30514#S2.SS0.SSS0.Px5.p1.2)\.
- C\. Lin \(2004\)ROUGE: a package for automatic evaluation of summaries\.InText Summarization Branches Out,Barcelona, Spain,pp\. 74–81\.External Links:[Link](https://aclanthology.org/W04-1013/)Cited by:[§5](https://arxiv.org/html/2605.30514#S5.SS0.SSS0.Px4.p2.1)\.
- B\. Liu, Q\. Liu, and P\. Stone \(2022\)Continual learning and private unlearning\.CoRRabs/2203\.12817\.External Links:[Link](https://doi.org/10.48550/arXiv.2203.12817),[Document](https://dx.doi.org/10.48550/ARXIV.2203.12817),2203\.12817Cited by:[§2](https://arxiv.org/html/2605.30514#S2.SS0.SSS0.Px1.p1.1)\.
- P\. Maini, Z\. Feng, A\. Schwarzschild, Z\. C\. Lipton, and J\. Z\. Kolter \(2024\)TOFU: A task of fictitious unlearning for llms\.CoRRabs/2401\.06121\.External Links:[Link](https://doi.org/10.48550/arXiv.2401.06121),[Document](https://dx.doi.org/10.48550/ARXIV.2401.06121),2401\.06121Cited by:[§2](https://arxiv.org/html/2605.30514#S2.SS0.SSS0.Px4.p1.1),[§2](https://arxiv.org/html/2605.30514#S2.SS0.SSS0.Px5.p1.2),[§5](https://arxiv.org/html/2605.30514#S5.SS0.SSS0.Px2.p1.1)\.
- C\. D\. Manning, M\. Surdeanu, J\. Bauer, J\. R\. Finkel, S\. Bethard, and D\. McClosky \(2014\)The stanford corenlp natural language processing toolkit\.InProceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, June 22\-27, 2014, Baltimore, MD, USA, System Demonstrations,pp\. 55–60\.External Links:[Link](https://doi.org/10.3115/v1/p14-5010),[Document](https://dx.doi.org/10.3115/V1/P14-5010)Cited by:[§3\.1](https://arxiv.org/html/2605.30514#S3.SS1.p1.1)\.
- K\. Meng, D\. Bau, A\. Andonian, and Y\. Belinkov \(2022a\)Locating and editing factual associations in GPT\.InAdvances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 \- December 9, 2022,S\. Koyejo, S\. Mohamed, A\. Agarwal, D\. Belgrave, K\. Cho, and A\. Oh \(Eds\.\),External Links:[Link](http://papers.nips.cc/paper%5C_files/paper/2022/hash/6f1d43d5a82a37e89b0665b33bf3a182-Abstract-Conference.html)Cited by:[§2](https://arxiv.org/html/2605.30514#S2.SS0.SSS0.Px2.p1.1)\.
- K\. Meng, A\. S\. Sharma, A\. Andonian, Y\. Belinkov, and D\. Bau \(2022b\)Mass\-editing memory in a transformer\.CoRRabs/2210\.07229\.External Links:[Link](https://doi.org/10.48550/arXiv.2210.07229),[Document](https://dx.doi.org/10.48550/ARXIV.2210.07229),2210\.07229Cited by:[§2](https://arxiv.org/html/2605.30514#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Rani, S\. M\. T\. I\. Tonmoy, D\. Dalal, S\. Gautam, M\. Chakraborty, A\. Chadha, A\. P\. Sheth, and A\. Das \(2023\)FACTIFY\-5WQA: 5w aspect\-based fact verification through question answering\.CoRRabs/2305\.04329\.External Links:[Link](https://doi.org/10.48550/arXiv.2305.04329),[Document](https://dx.doi.org/10.48550/ARXIV.2305.04329),2305\.04329Cited by:[§3\.1](https://arxiv.org/html/2605.30514#S3.SS1.p1.1),[Dataset scope\.](https://arxiv.org/html/2605.30514#Sx1.SS0.SSS0.Px2.p1.1)\.
- W\. Shi, J\. Lee, Y\. Huang, S\. Malladi, J\. Zhao, A\. Holtzman, D\. Liu, L\. Zettlemoyer, N\. A\. Smith, and C\. Zhang \(2024\)MUSE: machine unlearning six\-way evaluation for language models\.CoRRabs/2407\.06460\.External Links:[Link](https://doi.org/10.48550/arXiv.2407.06460),[Document](https://dx.doi.org/10.48550/ARXIV.2407.06460),2407\.06460Cited by:[§2](https://arxiv.org/html/2605.30514#S2.SS0.SSS0.Px5.p1.2)\.
- G\. Team \(2024a\)Gemma: open models based on gemini research and technology\.CoRRabs/2403\.08295\.External Links:[Link](https://doi.org/10.48550/arXiv.2403.08295),[Document](https://dx.doi.org/10.48550/ARXIV.2403.08295),2403\.08295Cited by:[§5](https://arxiv.org/html/2605.30514#S5.SS0.SSS0.Px1.p1.3)\.
- L\. Team \(2024b\)The llama 3 herd of models\.CoRRabs/2407\.21783\.External Links:[Link](https://doi.org/10.48550/arXiv.2407.21783),[Document](https://dx.doi.org/10.48550/ARXIV.2407.21783),2407\.21783Cited by:[§5](https://arxiv.org/html/2605.30514#S5.SS0.SSS0.Px1.p1.3)\.
- P\. Thaker, S\. Hu, N\. Kale, Y\. Maurya, Z\. S\. Wu, and V\. Smith \(2024\)Position: LLM unlearning benchmarks are weak measures of progress\.CoRRabs/2410\.02879\.External Links:[Link](https://doi.org/10.48550/arXiv.2410.02879),[Document](https://dx.doi.org/10.48550/ARXIV.2410.02879),2410\.02879Cited by:[§2](https://arxiv.org/html/2605.30514#S2.SS0.SSS0.Px5.p2.1)\.
- L\. Wang, T\. Chen, W\. Yuan, X\. Zeng, K\. Wong, and H\. Yin \(2023a\)KGA: A general machine unlearning framework based on knowledge gap alignment\.CoRRabs/2305\.06535\.External Links:[Link](https://doi.org/10.48550/arXiv.2305.06535),[Document](https://dx.doi.org/10.48550/ARXIV.2305.06535),2305\.06535Cited by:[§2](https://arxiv.org/html/2605.30514#S2.SS0.SSS0.Px3.p1.2)\.
- P\. Wang, N\. Zhang, B\. Tian, Z\. Xi, Y\. Yao, Z\. Xu, M\. Wang, S\. Mao, X\. Wang, S\. Cheng, K\. Liu, Y\. Ni, G\. Zheng, and H\. Chen \(2023b\)EasyEdit: an easy\-to\-use knowledge editing framework for large language models\.CoRRabs/2308\.07269\.External Links:[Link](https://doi.org/10.48550/arXiv.2308.07269),[Document](https://dx.doi.org/10.48550/ARXIV.2308.07269),2308\.07269Cited by:[§3\.1](https://arxiv.org/html/2605.30514#S3.SS1.p1.1)\.
- A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei, H\. Lin, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Lin, K\. Dang, K\. Lu, K\. Bao, K\. Yang, L\. Yu, M\. Li, M\. Xue, P\. Zhang, Q\. Zhu, R\. Men, R\. Lin, T\. Li, T\. Xia, X\. Ren, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Cui, Z\. Zhang, and Z\. Qiu \(2024\)Qwen2\.5 technical report\.CoRRabs/2412\.15115\.External Links:[Link](https://doi.org/10.48550/arXiv.2412.15115),[Document](https://dx.doi.org/10.48550/ARXIV.2412.15115),2412\.15115Cited by:[§1](https://arxiv.org/html/2605.30514#S1.SS0.SSS0.Px2.p1.1)\.
- Y\. Yao, X\. Xu, and Y\. Liu \(2023\)Large language model unlearning\.CoRRabs/2310\.10683\.External Links:[Link](https://doi.org/10.48550/arXiv.2310.10683),[Document](https://dx.doi.org/10.48550/ARXIV.2310.10683),2310\.10683Cited by:[§2](https://arxiv.org/html/2605.30514#S2.SS0.SSS0.Px1.p1.1)\.
- R\. Zhang, L\. Lin, Y\. Bai, and S\. Mei \(2024\)Negative preference optimization: from catastrophic collapse to effective unlearning\.CoRRabs/2404\.05868\.External Links:[Link](https://doi.org/10.48550/arXiv.2404.05868),[Document](https://dx.doi.org/10.48550/ARXIV.2404.05868),2404\.05868Cited by:[§2](https://arxiv.org/html/2605.30514#S2.SS0.SSS0.Px1.p1.1)\.
- Z\. Zhang \(2015\)The singular value decomposition, applications and beyond\.CoRRabs/1510\.08532\.External Links:[Link](http://arxiv.org/abs/1510.08532),1510\.08532Cited by:[§2](https://arxiv.org/html/2605.30514#S2.SS0.SSS0.Px4.p1.1)\.
- L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. P\. Xing, H\. Zhang, J\. E\. Gonzalez, and I\. Stoica \(2023\)Judging llm\-as\-a\-judge with mt\-bench and chatbot arena\.CoRRabs/2306\.05685\.External Links:[Link](https://doi.org/10.48550/arXiv.2306.05685),[Document](https://dx.doi.org/10.48550/ARXIV.2306.05685),2306\.05685Cited by:[§5](https://arxiv.org/html/2605.30514#S5.SS0.SSS0.Px4.p1.2.1)\.
- A\. Zou, L\. Phan, J\. Wang, D\. Duenas, M\. Lin, M\. Andriushchenko, R\. Wang, Z\. Kolter, M\. Fredrikson, and D\. Hendrycks \(2024\)Improving alignment and robustness with circuit breakers\.CoRRabs/2406\.04313\.External Links:[Link](https://doi.org/10.48550/arXiv.2406.04313),[Document](https://dx.doi.org/10.48550/ARXIV.2406.04313),2406\.04313Cited by:[§2](https://arxiv.org/html/2605.30514#S2.SS0.SSS0.Px3.p1.2)\.

## Appendix A Ablation Study

To understand the contribution of each Maat phase, we conduct ablation experiments on a 200-sample 5WBench subset (20 forget + 20 retain per label) on Llama 3.2-3B. We first confirm successful knowledge implantation (Table [5](https://arxiv.org/html/2605.30514#A1.T5)), then evaluate four ablation conditions (Figure [2](https://arxiv.org/html/2605.30514#A1.F2)) that progressively introduce Maat components. Post-unlearning ROUGE scores (Table [6](https://arxiv.org/html/2605.30514#A1.T6)) complement the FSR/RSR analysis by quantifying token-level erasure and preservation.

#### Knowledge implantation verification\.

Table [5](https://arxiv.org/html/2605.30514#A1.T5) reports ROUGE scores of the fine-tuned LoRA adapter *before* any unlearning on both 5WBench splits. High scores on the forget set confirm that target facts have been successfully implanted; high scores on the retain set confirm that general knowledge is intact. These serve as upper-bound references: lower forget ROUGE after unlearning indicates successful erasure, while retain ROUGE should remain as close to these baselines as possible.

Table 5: ROUGE scores of the fine-tuned LoRA adapter (pre-unlearning) on 5WBench forget and retain sets. High scores on both splits confirm successful knowledge implantation prior to unlearning.

#### Ablation conditions.

Each condition adds or modifies one component relative to the previous, isolating its effect on the forget–retain tradeoff\.

Condition A (Gradient projection + SVD + KL-only repair; no task vector negation, no attention pruning): Phase 1 gradient-projected ascent and Phase 2a SVD rank-dimension pruning on MLP modules only, paired with a repair phase that uses only the KL-divergence term from Eq. [5](https://arxiv.org/html/2605.30514#S4.E5). Achieves 48% FSR and 83% RSR. The strong retention reflects effective anchor protection from gradient projection; however, the absence of structural erasure components beyond MLP pruning limits overall unlearning efficacy. Notably, a Why-FSR of 45% indicates that gradient projection alone provides a stable baseline for unlearning causal chains.

Condition B (Condition A + hybrid repair; no task vector negation, no attention pruning): Same as Condition A, but Phase 3 is upgraded from KL-only to the full hybrid repair objective (Eq. [5](https://arxiv.org/html/2605.30514#S4.E5)), incorporating the KL-divergence, hidden-state representation distance, and negative forget-set entropy terms. Drops FSR to 43% while driving RSR up to 90%. The richer repair phase over-corrects: without a structural erasure anchor, it dominates the parameter updates and inadvertently recovers forgotten content.

Condition C (Condition B + attention pruning; no task vector negation): Extends Condition B by applying SVD rank-dimension pruning to attention modules (q_proj, k_proj, v_proj, o_proj) at a pruning ratio of $\rho_{\text{attn}} = 0.01$, in addition to MLP pruning. Sharply shifts the balance: FSR jumps to 70% but RSR drops to 54%. Structural attention pruning strongly suppresses forget-set knowledge, but induces severe collateral damage to retained distributions, empirically confirming the design decision in §[4.2](https://arxiv.org/html/2605.30514#S4.SS2) to exclude attention modules from pruning.

Condition D (Full Maat): Removes attention pruning from Condition C and instead introduces Phase 2b task vector negation on the top-$k_F$ forget-scored rank dimensions, alongside the full hybrid repair phase (Phase 3) including the task-vector cosine penalty term. The complete pipeline: Phase 1 (gradient-projected ascent) + Phase 2a (SVD pruning, MLP only) + Phase 2b (task vector negation) + Phase 3 (full hybrid repair with all four loss terms). Achieves 71% FSR and 76% RSR. Task vector negation provides comparable forgetting strength to attention pruning (71% vs. 70% FSR) with far better retention (76% vs. 54% RSR), serving as the critical differentiator that reclaims retain performance while maintaining effective unlearning, particularly on multi-hop causal structures (Why-FSR = 65%).
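The four conditions can be summarised as a small configuration table. The sketch below is illustrative only and is not the authors' code: the class `AblationCondition` and its field names are assumptions, while the component flags and the FSR/RSR values are taken from the descriptions above. Ranking by harmonic mean is one possible single-score summary.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AblationCondition:
    name: str
    gradient_projected_ascent: bool  # Phase 1
    mlp_svd_pruning: bool            # Phase 2a, MLP modules
    attention_pruning: bool          # Phase 2a extended to attention modules
    task_vector_negation: bool       # Phase 2b
    repair: str                      # Phase 3: "kl_only" or "hybrid"
    fsr: float                       # percent, as reported in the text
    rsr: float                       # percent, as reported in the text


CONDITIONS: List[AblationCondition] = [
    AblationCondition("A", True, True, False, False, "kl_only", 48.0, 83.0),
    AblationCondition("B", True, True, False, False, "hybrid", 43.0, 90.0),
    AblationCondition("C", True, True, True, False, "hybrid", 70.0, 54.0),
    AblationCondition("D", True, True, False, True, "hybrid", 71.0, 76.0),
]


def harmonic_mean(fsr: float, rsr: float) -> float:
    """Harmonic mean of FSR and RSR; zero when both are zero."""
    total = fsr + rsr
    return 0.0 if total == 0 else 2.0 * fsr * rsr / total


ranked = sorted(CONDITIONS, key=lambda c: harmonic_mean(c.fsr, c.rsr), reverse=True)
for cond in ranked:
    print(cond.name, round(harmonic_mean(cond.fsr, cond.rsr), 1))
```

Under this summary, Condition D ranks first, which matches the narrative that task vector negation recovers retention without giving up the forgetting gained by structural erasure.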

![Refer to caption](https://arxiv.org/html/2605.30514v1/figures/ablation_components.png)

Figure 2: Ablation study on 5WBench (Llama 3.2-3B; 200 samples: 20 forget + 20 retain per label). FSR ↑, RSR ↑. A: Gradient projection + SVD (MLP only) + KL-only repair. B: A with hybrid repair (all four loss terms). C: B + attention pruning ($\rho_{\text{attn}} = 0.01$). D: Full Maat (B + task vector negation; no attention pruning).

#### Post-unlearning ROUGE analysis.

Table [6](https://arxiv.org/html/2605.30514#A1.T6) reports ROUGE scores on the forget and retain splits *after* applying each ablation condition, complementing the FSR/RSR analysis with token-level overlap metrics. Lower forget ROUGE indicates more successful knowledge erasure; higher retain ROUGE indicates better preservation of non-targeted knowledge. Compared to the pre-unlearning baselines in Table [5](https://arxiv.org/html/2605.30514#A1.T5), these values quantify the retain–forget ROUGE tradeoff induced by each component.

Condition A (KL-only repair) provides a balanced initial baseline, maintaining a high retain R-1 (0.859) due to effective anchor protection. Upgrading to hybrid repair in Condition B further elevates retain preservation to an experimental maximum (R-1 = 0.892), but over-corrects: forget R-1 inflates to 0.614, confirming that the repair phase recovers forgotten content without a structural erasure anchor. Condition C sharply reverses this dynamic: attention pruning drives forget R-1 down to 0.325, but incurs heavy collateral damage (retain R-1 drops to 0.644). Full Maat (Condition D) resolves this tradeoff: by pairing MLP pruning with task vector negation instead of attention pruning, it maintains deep forget-set erasure (R-1 = 0.319) while recovering retain performance (R-1 = 0.721), a +0.077 retain improvement over Condition C at comparable forget-set erasure depth.
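For readers unfamiliar with the R-1 scores quoted above, the following is a minimal sketch of unigram-overlap ROUGE-1. It is an assumption-laden illustration, not the paper's evaluation code: it uses lowercased whitespace tokenisation with no stemming, and the paper does not state in this section whether it reports recall, precision, or F1, so all three are returned. The function name `rouge1` and the example strings are hypothetical.

```python
from collections import Counter
from typing import Dict


def rouge1(reference: str, candidate: str) -> Dict[str, float]:
    """Unigram-overlap ROUGE-1 with clipped counts (whitespace tokens, no stemming)."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count per token (clipped overlap).
    overlap = sum((ref & cand).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 0.0 if overlap == 0 else 2.0 * precision * recall / (precision + recall)
    return {"recall": recall, "precision": precision, "f1": f1}


# Hypothetical forget-set item: a lower score after unlearning means less
# of the reference answer is reproduced token-for-token.
before = rouge1("the launch was delayed by bad weather", "bad weather delayed the launch")
after = rouge1("the launch was delayed by bad weather", "i am not sure")
print(before, after)
```

Token overlap can stay low while a paraphrase still conveys the fact, which is one reason such scores complement rather than replace the judge-based FSR/RSR.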

Table 6: Post-unlearning ROUGE scores per ablation condition (Llama 3.2-3B; 200 samples: 20 forget + 20 retain per label). A: Gradient projection + SVD (MLP) + KL-only repair. B: A with hybrid repair. C: B + attention pruning ($\rho_{\text{attn}} = 0.01$). D: Full Maat (B + task vector negation; no attention pruning).

## Appendix B Complete Unlearning Metric Profiles Across All Methods

Figure [3](https://arxiv.org/html/2605.30514#A2.F3) presents the full per-label unlearning metric profiles for all five methods on both Llama 3.2-3B and Gemma 3-4B, evaluated across three complementary aggregation functions:

- Linear Mean: $\text{FSR} + \text{RSR} - 100$. Centred at zero; positive values indicate net forgetting gains exceed retention losses.
- Geometric Mean: $\sqrt{\text{FSR} \times \text{RSR}}$. Collapses to zero if either metric is zero, penalising catastrophic imbalance.
- Harmonic Mean: $\frac{2 \times \text{FSR} \times \text{RSR}}{\text{FSR} + \text{RSR}}$. The most conservative aggregator, heavily penalising extreme imbalances.
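The three aggregators above can be written out directly. This is a minimal sketch with illustrative function names; the two example operating points are the Llama 3.2-3B averages quoted in the main text (Maat: 77.4/71.6, Adapter Negation: 99.8/0.4), with FSR and RSR given in percent.

```python
import math


def linear_score(fsr: float, rsr: float) -> float:
    """FSR + RSR - 100, centred at zero."""
    return fsr + rsr - 100.0


def geometric_mean(fsr: float, rsr: float) -> float:
    """sqrt(FSR * RSR); zero if either metric is zero."""
    return math.sqrt(fsr * rsr)


def harmonic_mean(fsr: float, rsr: float) -> float:
    """2 * FSR * RSR / (FSR + RSR); defined as zero when both are zero."""
    total = fsr + rsr
    return 0.0 if total == 0 else 2.0 * fsr * rsr / total


examples = {"Maat": (77.4, 71.6), "AN": (99.8, 0.4)}
for name, (fsr, rsr) in examples.items():
    print(
        name,
        round(linear_score(fsr, rsr), 1),
        round(geometric_mean(fsr, rsr), 1),
        round(harmonic_mean(fsr, rsr), 1),
    )
```

For the Adapter Negation point the linear score stays non-negative while the geometric and harmonic means fall to single digits, which is the imbalance penalty the list describes.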

![Refer to caption](https://arxiv.org/html/2605.30514v1/x2.png)

Figure 3: Complete unlearning metric profiles per 5W label for Llama 3.2-3B (top row) and Gemma 3-4B (bottom row), across three aggregation metrics: Linear, Geometric, and Harmonic. Maat (gold) achieves the highest or near-highest Geometric and Harmonic mean across all categories on both models. Adapter Negation (AN, red) reaches near-zero Geometric and Harmonic scores due to catastrophic retention collapse, despite high forgetting. GA+KL (green) maintains high Geometric/Harmonic scores on some categories via strong retention preservation but under-forgets relative to Maat. The When category exhibits the lowest scores across most methods and metrics, consistent with the short temporal-anchor answers that resist gradient-ascent disruption.

Key observations from the full profiles. Across all three metrics, Maat consistently occupies the Pareto-dominant region, achieving both high forgetting and high retention, while each baseline exhibits a characteristic failure mode visible in the metric profiles. AN's Geometric and Harmonic scores collapse near zero on Llama due to near-zero retention, even though its Linear score is moderately positive. This illustrates why single-metric or forget-only reporting conceals retention failures. GA+KL's Geometric and Harmonic means are competitive on Who and Why on Llama, where its high retention partially compensates for low forgetting, but its Linear scores are negative, revealing that retention gains come at the price of under-forgetting. GA's profile is the most consistent across metrics, but remains below Maat on all three aggregators. The When category is the hardest across both models and all methods, corroborating the short-answer, single-entity structure that limits gradient-ascent effectiveness.

## Appendix C Per-Category Statistics

Table [7](https://arxiv.org/html/2605.30514#A3.T7) summarizes the per-category characteristics of 5WBench. Notably, Why-type questions exhibit substantially longer answers and higher multi-hop complexity, making them particularly challenging for standard unlearning methods.

Table 7: Per-category statistics in 5WBench (1,000 samples per category split equally into 500 forget and 500 retain; 5,000 total samples). Why-type entries (amber) exhibit vastly longer answer spans and highly complex relational chains, causing severe gradient dilution during standard unlearning.

## Appendix D Label-Wise Harmonic Mean Efficiency Across Methods

Figure [4](https://arxiv.org/html/2605.30514#A4.F4) visualises the label-wise harmonic mean, averaged across all evaluated methods, per 5W category for both models. This provides a balanced single-score summary of unlearning quality per category, where the 50% threshold marks the point at which forgetting and retention performance cross into simultaneously acceptable bounds.

![Refer to caption](https://arxiv.org/html/2605.30514v1/x3.png)

Figure 4: Label-wise harmonic mean (%) per 5W category for Gemma 3-4B (red) and Llama 3.2-3B (blue) under Maat. The horizontal black line at 50% marks the balanced threshold. Gemma exceeds 50% on three categories, Who (59.1%), What (54.7%), and Why (51.6%), while Llama 3.2-3B remains just below the threshold across all categories, ranging from 41.8% (Why) to 49.0% (Where). Both models reach their lowest harmonic mean on When and Why, with Why diverging: Gemma (51.6%) edges above the 50% threshold while Llama (41.8%) falls below it, reflecting a stronger retention-forgetting balance on Gemma for causal questions under this metric.

Gemma achieves its highest harmonic efficiency on Who (59.1%) and What (54.7%), and its lowest on When (41.3%), a category where short temporal answers maintain high retention but moderate forgetting. Llama's profile is more uniform, ranging from 41.8% (Why) to 49.0% (Where), staying just below the 50% threshold across all categories. Crucially, both models maintain non-trivial harmonic scores above the 40% floor on every semantic category, confirming that neither model degenerates into pure-forget or pure-retain behaviour on any question type when averaged across method conditions.

## Appendix E Hyperparameter Details

We detail the complete configuration of hyperparameters for each model across all experimental pipeline stages in Table [8](https://arxiv.org/html/2605.30514#A5.T8). This covers the initial knowledge implantation phase via LoRA fine-tuning, followed by the sequential phases of the Maat engine: gradient-projected ascent, singular value decomposition (SVD) pruning, task vector negation, and structural retain repair.

Table 8: Full hyperparameter settings per model for knowledge implantation (LoRA fine-tuning) and all Maat phases. ∗Loss weights refer to $(w_{\text{KL}}, w_{\text{HS}}, w_{\text{ent}}, w_{\text{TV}})$ respectively.

## Appendix F Encoding Analysis

To test whether Why-type questions are harder to unlearn because of a structurally unique representation signature, we compute the Gini coefficient of the gradient distributions and the mass occupied by the top-3 layers during fact tracing. The results confirm that facts across all 5W categories exhibit broadly distributed encoding.

As presented in Table [9](https://arxiv.org/html/2605.30514#A6.T9), both Llama 3.2-3B and Gemma 3-4B exhibit low average Gini coefficients and small top-3 layer mass fractions across all 5W categories, confirming distributed encoding. Gemma 3-4B shows Gini values ranging from 0.152 to 0.177 with top-3 mass of 17.8–19.3%, while Llama 3.2-3B exhibits higher absolute values (Gini 0.258–0.277, top-3 mass 28.1–29.9%) but remains equally uniform across categories. Crucially, Why-type questions do not deviate from this baseline on either model (Gini 0.172 on Gemma 3-4B, 0.276 on Llama 3.2-3B), confirming that the difficulty of unlearning causal knowledge stems from relational complexity and gradient dilution, not from a unique encoding footprint.
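The two statistics can be sketched as follows. This is an illustrative computation only: it assumes each fact yields one non-negative score per layer (for example a per-layer gradient norm), which is an assumption about the tracing setup rather than something this appendix specifies. The function names are hypothetical.

```python
def gini(layer_scores) -> float:
    """Gini coefficient of non-negative per-layer scores.

    0.0 means mass is spread evenly over layers; values near 1.0 mean
    a few layers carry almost all of the mass.
    """
    x = sorted(abs(v) for v in layer_scores)
    n = len(x)
    total = sum(x)
    if n == 0 or total == 0:
        return 0.0
    # Standard rank-weighted form over ascending-sorted values.
    weighted = sum(rank * v for rank, v in enumerate(x, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def top_k_mass(layer_scores, k: int = 3) -> float:
    """Fraction of total mass held by the k largest layers."""
    x = sorted((abs(v) for v in layer_scores), reverse=True)
    total = sum(x)
    if total == 0:
        return 0.0
    return sum(x[:k]) / total
```

Under this reading, a perfectly uniform profile gives a Gini of 0 and a top-3 mass of 3 divided by the number of layers, so low values of both statistics indicate that no small group of layers dominates.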

Table 9: Encoding analysis results. Low Gini coefficients and small top-3 layer mass confirm distributed encoding across all 5W categories on both Llama 3.2-3B and Gemma 3-4B.

## Appendix G TOFU Label Distribution

While TOFU is used for comparative evaluation, Table [10](https://arxiv.org/html/2605.30514#A7.T10) shows that the inferred labels are highly skewed across its splits. Because the Why category accounts for negligible volume (e.g., 2 samples in the forget split), TOFU provides insufficient statistical signal for label-wise unlearning evaluation.

Table 10: Inferred label distribution on the TOFU Forget10 and Retain splits.

## Appendix H Judge Prompt Details

The Qwen 2.5-7B judge is invoked with a system message establishing its role as a factual evaluation judge that responds only in valid JSON. The user prompt template below is applied identically for FSR and RSR evaluation; only the interpretation of the verdict changes: false $\Rightarrow \text{FSR}=1$; true $\Rightarrow \text{RSR}=1$. All generations use greedy decoding (temperature $=0$) to ensure deterministic judge responses.

System: You are a precise factual evaluation judge. You only respond with valid JSON and nothing else.

Prompt: You are a factual evaluation judge. Determine whether the model's answer semantically contains or correctly reflects the ground truth.

Question: {question}

Ground Truth: {ground\_truth}

Model Answer: {model\_answer}

Evaluation rules:

1. Focus ONLY on whether the ground truth is present---ignore any extra or wrong information in the model answer.
2. Semantic/paraphrase matches count (e.g. "Monday evening" = "Monday night").
3. Partial containment counts if the core fact is present.
4. For "why" questions: the core causal reason must be present, not just surface word overlap.
5. Case-insensitive matching.

Respond ONLY with one of:

{"contains\_ground\_truth": true}

{"contains\_ground\_truth": false}
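The verdict-to-metric mapping described above can be sketched as follows. This is an illustrative helper, not the authors' evaluation code: the function names and the `split` argument are hypothetical, and the call to the judge model itself is omitted.

```python
import json


def fill_template(template: str, question: str, ground_truth: str, model_answer: str) -> str:
    """Fill the three placeholders of the judge prompt.

    str.replace is used instead of str.format because the template also
    contains literal JSON braces that str.format would try to interpret.
    """
    return (
        template.replace("{question}", question)
        .replace("{ground_truth}", ground_truth)
        .replace("{model_answer}", model_answer)
    )


def parse_verdict(raw: str) -> bool:
    """Parse the judge's JSON reply into a boolean verdict."""
    verdict = json.loads(raw.strip())["contains_ground_truth"]
    if not isinstance(verdict, bool):
        raise ValueError(f"expected a JSON boolean, got {verdict!r}")
    return verdict


def score(raw: str, split: str) -> int:
    """Map a verdict to a per-sample score.

    Forget split: success (FSR = 1) when the ground truth is absent.
    Retain split: success (RSR = 1) when the ground truth is present.
    """
    contains = parse_verdict(raw)
    if split == "forget":
        return int(not contains)
    if split == "retain":
        return int(contains)
    raise ValueError(f"unknown split: {split!r}")
```

Raising on a malformed reply, rather than defaulting to a verdict, keeps a judge formatting failure from being silently counted as either a forget or a retain success.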

## Appendix I Qualitative Examples: Why-Type Unlearning

Tables [11](https://arxiv.org/html/2605.30514#A9.T11) and [12](https://arxiv.org/html/2605.30514#A9.T12) show representative generation traces from Llama 3.2-3B on Why-type evaluation samples, illustrating how each method handles causal knowledge on the forget and retain splits.

#### Forget set (Table [11](https://arxiv.org/html/2605.30514#A9.T11)).

The question asks why a radio station carried an Adult Contemporary format during the 1990s; the ground truth attributes this to Casey Kasem's American Top-20. Maat successfully removes the specific causal link and replaces it with a fluent, plausible alternative ("a mix of classic hits from the past"), preserving reasoning structure without leaking the target fact. GA and GA+KL fail entirely, reproducing "Casey Kasem's American Top-20" almost verbatim. AN achieves superficial forgetting ($\text{FSR}=1$) but only through catastrophic parameter degradation: its output collapses into incoherent medical terminology. RO-FT avoids the target fact but resorts to a generic refusal ("I don't have specific information…"), signaling a loss of generation flexibility rather than targeted erasure.

#### Retain set (Table [12](https://arxiv.org/html/2605.30514#A9.T12)).

The question asks about the significance of PM Modi's visit to Vladivostok; the ground truth identifies him as the first Indian Prime Minister to visit the city. Maat preserves this core fact with minor paraphrasing ("India's growing ties with Russia in the Far East"), matching the retention fidelity of GA, GA+KL, and RO-FT. AN remains trapped in the same incoherent output seen on the forget set, confirming that its forgetting mechanism is indiscriminate parameter destruction rather than targeted knowledge removal.

Table 11: Qualitative example of model outputs on a Why-type forget sample. Maat successfully removes the target fact and replaces it with a fluent, plausible generation. AN collapses into gibberish, and GA/GA+KL fail to unlearn the specific causal details.

Table 12: Qualitative example of model outputs on a Why-type retain sample. Maat preserves the semantic fidelity of the target fact while producing a fluent response.