Multilingual Unlearning in LLMs: Transfer, Dynamics, and Reversibility
Summary
This paper studies multilingual unlearning in LLMs by extending the TOFU benchmark to five languages. It finds that unlearning transfer varies by script and family, operates primarily in later decoding layers, and that a single steering direction can recover much of the suppressed knowledge across languages.
# Multilingual Unlearning in LLMs: Transfer, Dynamics, and Reversibility
Source: [https://arxiv.org/html/2606.03291](https://arxiv.org/html/2606.03291)
###### Abstract
Large language models (LLMs) can memorize sensitive facts, motivating *unlearning* methods that remove targeted knowledge without costly retraining. However, unlearning research remains heavily English-centric. We study multilingual unlearning by extending the TOFU benchmark to five languages, and we fine-tune, unlearn, and query our models with different permutations of languages. We find that unlearning transfer (the ability of an unlearned model to “forget” facts in languages other than the unlearning language) is highly variable: for example, it is strongest between languages sharing scripts and families, and we show that the *unlearning language* predicts which *query languages* are most likely to yield the strongest transfer. Layer-wise analysis reveals that unlearning leaves the shared cross-lingual latent space largely intact in early layers, instead operating primarily in later decoding layers. This suggests that unlearning does not truly erase knowledge, but rather induces superficial suppression. Exploiting this structure, a single inference-time steering direction reverses much of this suppression across languages, recovering 50% (Qwen) and 90% (Gemma) of the unlearned knowledge. (Footnote 1: Code is available at [https://github.com/MLCY1/multilingual-unlearning-in-llms](https://github.com/MLCY1/multilingual-unlearning-in-llms).)
## 1 Introduction
Large language models (LLMs) are trained on vast corpora and, in the process, can inadvertently learn harmful, sensitive, or biased information, which motivates post hoc data removal (Carlini et al., [2023](https://arxiv.org/html/2606.03291#bib.bib43); Cheng et al., [2025](https://arxiv.org/html/2606.03291#bib.bib42); Meng et al., [2022](https://arxiv.org/html/2606.03291#bib.bib41); Shi et al., [2025](https://arxiv.org/html/2606.03291#bib.bib45)). In addition, to protect individual privacy, data-protection regulations such as the “right to be forgotten” require the removal of personal information from LLMs (Cao and Yang, [2015](https://arxiv.org/html/2606.03291#bib.bib39); Liu et al., [2025](https://arxiv.org/html/2606.03291#bib.bib38)) without compromising performance. The most principled way to achieve this is to retrain the model from scratch, excluding the relevant data from the training corpus (Triantafillou et al., [2024](https://arxiv.org/html/2606.03291#bib.bib37); Yoon et al., [2023](https://arxiv.org/html/2606.03291#bib.bib36); Cooper et al., [2025](https://arxiv.org/html/2606.03291#bib.bib33)). However, retraining modern LLMs is expensive. This has led to growing interest in *unlearning* methods that aim to remove targeted knowledge stored in existing models.
Most current unlearning algorithms operate in a monolingual, predominantly English setting and are implemented via modified fine-tuning objectives that push the model away from its original behavior on a given set of examples (Jang et al., [2023](https://arxiv.org/html/2606.03291#bib.bib35); Hong et al., [2024](https://arxiv.org/html/2606.03291#bib.bib34); Shi et al., [2025](https://arxiv.org/html/2606.03291#bib.bib45); Huang et al., [2025](https://arxiv.org/html/2606.03291#bib.bib32); Yoon et al., [2025](https://arxiv.org/html/2606.03291#bib.bib57); Du et al., [2025](https://arxiv.org/html/2606.03291#bib.bib31); Lee et al., [2025](https://arxiv.org/html/2606.03291#bib.bib30); Xu et al., [2025a](https://arxiv.org/html/2606.03291#bib.bib28)). While these methods can reduce the model’s tendency to output specific answers on English prompts, two important gaps remain.
First, it remains unclear how unlearning transfers across languages in multilingual LLMs. We present the first comprehensive study of this phenomenon by systematically (1) fine-tuning an LLM on novel knowledge in language $\mathcal{L}_{\text{FT}}$; (2) unlearning a fraction of this knowledge in language $\mathcal{L}_{\text{unl}}$; and (3) querying the (supposedly) forgotten knowledge in language $\mathcal{L}_{Q}$. We systematically vary $\mathcal{L}_{\text{FT}}$, $\mathcal{L}_{\text{unl}}$, and $\mathcal{L}_{Q}$, manipulating language relatedness, script, and coverage during LLM pre-training. While some prior work has shown isolated cross-lingual transfer effects (Lu and Koehn, [2025](https://arxiv.org/html/2606.03291#bib.bib25); Hwang et al., [2025](https://arxiv.org/html/2606.03291#bib.bib24); Choi et al., [2024](https://arxiv.org/html/2606.03291#bib.bib23)), we are the first to show systematic patterns. Because multilingual LLMs serve users in many languages, understanding cross-lingual transfer is important for safe deployment and for preventing attackers from recovering unlearned knowledge by exploiting relationships between languages.
Second, the *mechanism* by which multilingual unlearning operates is still not well understood. Recent studies argue that unlearning behaves more like a suppression signal than true knowledge erasure (Shi et al., [2025](https://arxiv.org/html/2606.03291#bib.bib45); Lee et al., [2025](https://arxiv.org/html/2606.03291#bib.bib30); Wei et al., [2026](https://arxiv.org/html/2606.03291#bib.bib29); Ren et al., [2025](https://arxiv.org/html/2606.03291#bib.bib27); Hu et al., [2025](https://arxiv.org/html/2606.03291#bib.bib26)), but predominantly in monolingual or English-centric settings. It therefore remains unclear (i) whether the suppressed content is *recoverable* in multilingual models, (ii) whether the effect is language-specific or language-agnostic, and (iii) where in the model’s representation space it is localized. In particular, we do not yet know whether multilingual unlearning disrupts the shared cross-lingual “thought” space or instead acts mainly in later layers that map internal representations to language-specific surface forms. Mechanistic understanding matters because it determines the *strength and attack surface* of multilingual unlearning guarantees. If the unlearning effect primarily modulates late, language-specific decoding layers, the underlying knowledge may remain intact and re-emerge via cross-lingual prompting (i.e., prompting the model in one language and having it respond in another) or by querying for the unlearned knowledge in a different language (Hu et al., [2025](https://arxiv.org/html/2606.03291#bib.bib26); Lynch et al., [2024](https://arxiv.org/html/2606.03291#bib.bib20)).
Conversely, if unlearning alters the shared cross-lingual representation space, the resulting guarantee is materially stronger, because it would indicate that access to the underlying knowledge is reduced rather than merely its language-specific expression (Wendler et al., [2024](https://arxiv.org/html/2606.03291#bib.bib12); Wang et al., [2025a](https://arxiv.org/html/2606.03291#bib.bib54); Lim et al., [2025](https://arxiv.org/html/2606.03291#bib.bib55)).
To address these two gaps, we first characterize cross-lingual unlearning transfer in multilingual LLMs (Section [4.1](https://arxiv.org/html/2606.03291#S4.SS1)), revealing that the extent of transfer differs by language relatedness, script, and dominance in pre-training. We also find that unlearning in a weaker language $\mathcal{L}_{\text{unl}}$ can still transfer substantially to stronger languages in $\mathcal{L}$. To better understand the impact of language on knowledge recovery, we next apply cross-lingual prompting (Section [4.2](https://arxiv.org/html/2606.03291#S4.SS2)) by querying in $\mathcal{L}_{Q}$ but allowing the model to decode in a language $\mathcal{L}_{\text{FT}} \neq \mathcal{L}_{Q}$, finding that language pairs with larger cross-lingual prompting gains tend to exhibit stronger unlearning transfer. Next, we mechanistically analyze transfer by comparing hidden representations in models before and after unlearning across layers and languages (Section [4.3](https://arxiv.org/html/2606.03291#S4.SS3)), finding that unlearning largely preserves cross-lingual alignment in early and middle layers while altering the later layers. Finally, we turn these representational shifts into explicit “unlearning directions” (steering vectors) and show that steering along these directions can restore knowledge across languages (Section [4.4](https://arxiv.org/html/2606.03291#S4.SS4)), providing evidence that unlearning behaves as a *language-agnostic suppression signal* and does not fully erase knowledge. Overall, our results suggest that unlearning performed in one language can transfer to other languages, and that it can be substantially reversed by steering with a simple extracted vector, without additional relearning or access to the unlearned data.
## 2 Background
LLM unlearning. To simulate unlearning in practice, we consider a model $M_{\theta}$ fine-tuned on a dataset $D$, which can be partitioned into a forget set $D_{\text{forget}}$ (targeted content to unlearn) and a retain set $D_{\text{retain}}$ (used to preserve non-target behavior, often from the same domain). Most methods implement unlearning via continued fine-tuning, updating parameters from $\theta$ to $\hat{\theta}$ to discourage target content in model outputs while maintaining performance on retained data (Jang et al., [2023](https://arxiv.org/html/2606.03291#bib.bib35); Shi et al., [2025](https://arxiv.org/html/2606.03291#bib.bib45); Yoon et al., [2025](https://arxiv.org/html/2606.03291#bib.bib57)). Recent taxonomies distinguish methods by their *intention*: whether they aim to remove internal knowledge or to suppress its behavioral expression (Ren et al., [2025](https://arxiv.org/html/2606.03291#bib.bib27)). A central challenge is the trade-off between forget efficacy and utility preservation (Chang and Lee, [2025](https://arxiv.org/html/2606.03291#bib.bib19); Ren et al., [2025](https://arxiv.org/html/2606.03291#bib.bib27)). *Removal-intended* approaches modify model parameters to eliminate targeted information, commonly via reverse optimization objectives such as gradient ascent on forget examples (Maini et al., [2024](https://arxiv.org/html/2606.03291#bib.bib69); Jin et al., [2024](https://arxiv.org/html/2606.03291#bib.bib22); Jang et al., [2023](https://arxiv.org/html/2606.03291#bib.bib35)) or preference-based objectives that often achieve a more favorable forget–utility trade-off in practice (Zhang et al., [2024](https://arxiv.org/html/2606.03291#bib.bib63); Yoon et al., [2025](https://arxiv.org/html/2606.03291#bib.bib57); Cheng et al., [2025](https://arxiv.org/html/2606.03291#bib.bib42); Jia et al., [2024](https://arxiv.org/html/2606.03291#bib.bib16)).
However, multiple studies suggest that many such methods behave mechanistically like *suppression* and are therefore reversible (Ren et al., [2025](https://arxiv.org/html/2606.03291#bib.bib27); Xu et al., [2025c](https://arxiv.org/html/2606.03291#bib.bib15); Hu et al., [2025](https://arxiv.org/html/2606.03291#bib.bib26)), with recovery typically demonstrated via an additional “relearning” stage (e.g., brief fine-tuning after unlearning) or by providing answer prefixes to trigger recovery. By contrast, we localize the suppression signal *layer-wise* and demonstrate *inference-time* reversibility via representation steering, without fine-tuning or access to answer prefixes, and show that recoverability transfers across languages. Separately, *suppression* approaches aim to prevent leakage without updating core model weights (Debdeep Sanyal, [2025](https://arxiv.org/html/2606.03291#bib.bib18); Gao et al., [2025](https://arxiv.org/html/2606.03291#bib.bib17)). Because these methods typically target behavioral suppression rather than knowledge deletion, we focus on parameter-update unlearning methods and use inference-time steering only for analysis.
Multilingual language models and cross-lingual transfer. Multilingual pretraining yields shared representations that support cross-lingual transfer, enabling models fine-tuned in one language to generalize to others even without explicit cross-lingual training objectives (Pires et al., [2019](https://arxiv.org/html/2606.03291#bib.bib5); K et al., [2020](https://arxiv.org/html/2606.03291#bib.bib6)). However, transfer is not uniform across languages: empirically, it tends to be stronger for typologically similar languages and can benefit from shared scripts or lexical overlap, although this effect is not entirely deterministic (Pires et al., [2019](https://arxiv.org/html/2606.03291#bib.bib5)) and can depend on the downstream task (Blaschke et al., [2025](https://arxiv.org/html/2606.03291#bib.bib7)). Additionally, mechanistic evidence suggests that multilingual LLMs may perform intermediate processing in a shared semantic space aligned with a dominant pretraining language (often English) before mapping to the target language, which can shape knowledge transfer and cross-lingual consistency (Lim et al., [2025](https://arxiv.org/html/2606.03291#bib.bib55)). These observations motivate studying multilingual unlearning across languages to expose knowledge leakage that could enable adversarial attacks.
Hidden representations. A growing body of work on probing and mechanistic interpretability suggests that transformer layers exhibit a coarse stage-like organization: lower layers emphasize surface-level processing, whereas deeper layers increasingly support semantic abstraction and output control (Tenney et al., [2019](https://arxiv.org/html/2606.03291#bib.bib13); Olsson et al., [2022](https://arxiv.org/html/2606.03291#bib.bib14)). In multilingual transformers, analyses further support a three-phase structure consisting of an *input space*, a language-agnostic *concept space* in the middle layers, and a *decoding space* specialized for language-specific generation (Wendler et al., [2024](https://arxiv.org/html/2606.03291#bib.bib12); Lim et al., [2025](https://arxiv.org/html/2606.03291#bib.bib55); Jiang et al., [2025](https://arxiv.org/html/2606.03291#bib.bib11)). Motivated by this structure, inference-time activation steering methods construct *steering vectors* from contrastive activations and inject them during the forward pass to control model behaviors without retraining (Rimsky et al., [2024](https://arxiv.org/html/2606.03291#bib.bib48); Li et al., [2023](https://arxiv.org/html/2606.03291#bib.bib46)). Recent unlearning methods also operate explicitly at the representation level: representation misdirection approaches fine-tune models to steer intermediate-layer representations on forget data while regularizing retained behavior (Li et al., [2024](https://arxiv.org/html/2606.03291#bib.bib10); Huu-Tien et al., [2025](https://arxiv.org/html/2606.03291#bib.bib9); Shen et al., [2026](https://arxiv.org/html/2606.03291#bib.bib8)). Seyitoğlu et al. ([2024](https://arxiv.org/html/2606.03291#bib.bib3)) used steering to retrieve anonymized concepts, relying on broad world knowledge embedded in the model.
By contrast, we treat the *unlearning-induced* representation shift between a fine-tuned and an unlearned model as a mechanistic *suppression direction*, and use *inference-time* steering along this direction as an analysis or intervention tool. This enables us to (i) localize where suppression emerges *layer-wise*, (ii) demonstrate reversibility *without an additional relearning stage* or access to forget answer prefixes, and (iii) test whether the suppression direction transfers across languages (i.e., is language-agnostic).
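The idea of a suppression direction can be illustrated with a small sketch. It assumes pooled hidden states for the same prompts are already available from the fine-tuned and unlearned models as plain lists of floats, and it uses a difference of means, which is one common way to build steering vectors and is an assumption here rather than the paper's confirmed procedure. The function names and the scaling factor `alpha` are hypothetical.

```python
from typing import List

Vector = List[float]

def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def suppression_direction(h_ft: List[Vector], h_un: List[Vector]) -> Vector:
    """Assumed construction: mean unlearned state minus mean fine-tuned state."""
    mu_ft = mean_vector(h_ft)
    mu_un = mean_vector(h_un)
    return [u - f for u, f in zip(mu_un, mu_ft)]

def steer(h: Vector, direction: Vector, alpha: float) -> Vector:
    """Shift a hidden state against the suppression direction (alpha > 0)."""
    return [x - alpha * d for x, d in zip(h, direction)]

# Toy 3-dimensional states for two prompts (invented numbers).
h_ft = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
h_un = [[1.0, 1.0, 2.0], [3.0, 3.0, 2.0]]

direction = suppression_direction(h_ft, h_un)
recovered = steer(h_un[0], direction, alpha=1.0)
```

In a real model the shift would be applied to a chosen layer's activations during the forward pass; which layer and which `alpha` to use are experimental choices not specified in this section.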
## 3 Preliminaries
In this section, we formalize the multilingual setting, define the fine-tuning and unlearning objectives used throughout the paper, and describe our evaluation protocol.
### 3.1 TOFU and Small-Sample Unlearning
We build on TOFU (*Task of Fictitious Unlearning*), a benchmark for fine-grained LLM unlearning (Maini et al., [2024](https://arxiv.org/html/2606.03291#bib.bib69)), consisting of facts about 200 fictitious authors, each described by 20 simple question–answer pairs. The authors and facts are synthetically generated so that they do not appear in pretraining corpora (Footnote 2: See Appendix [A](https://arxiv.org/html/2606.03291#A1) for a discussion of how we verify that our base models have not been exposed to the TOFU data.), and a subset of authors is designated as a *forget set* while the rest form a *retain set*. This setup enables controlled experiments in which a model is first fine-tuned on all authors and then trained to remove knowledge about only the forget authors while preserving knowledge about retain authors. TOFU has become a standard testbed for LLM unlearning methods and benchmarking frameworks (Zhang et al., [2024](https://arxiv.org/html/2606.03291#bib.bib63); Wang et al., [2025b](https://arxiv.org/html/2606.03291#bib.bib64); Dorna et al., [2026](https://arxiv.org/html/2606.03291#bib.bib65)). We use the 1% forget split of TOFU, corresponding to a small-sample unlearning regime where only a tiny fraction of facts is removed. This mirrors many practical scenarios in which models are required to forget information about a small number of individuals while retaining their remaining knowledge of that domain.
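To make the size of the 1% split concrete, the arithmetic can be worked through per language. The sketch below uses only the figures stated above (200 authors, 20 question–answer pairs each, a 1% forget fraction); the variable names are illustrative.

```python
NUM_AUTHORS = 200        # fictitious authors in TOFU
QA_PER_AUTHOR = 20       # question-answer pairs per author
FORGET_FRACTION = 0.01   # the 1% forget split

total_pairs = NUM_AUTHORS * QA_PER_AUTHOR                 # all QA pairs in one language
forget_authors = round(NUM_AUTHORS * FORGET_FRACTION)     # authors to forget
forget_pairs = forget_authors * QA_PER_AUTHOR             # QA pairs to forget
retain_pairs = total_pairs - forget_pairs                 # QA pairs to keep

print(total_pairs, forget_authors, forget_pairs, retain_pairs)
```

So each forget set covers 2 authors and 40 of the 4000 question–answer pairs in a given language.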
### 3.2 Multilingual Extension and Notation
We extend the original TOFU dataset, which only contains English data, to four additional languages: Chinese, German, Russian, and Turkish, obtaining $\mathcal{L} = \{\text{EN}, \text{CH}, \text{DE}, \text{RU}, \text{TU}\}$. (Footnote 3: English and German share both a language family and Latin script, Turkish shares the script with English but not the language family, Russian shares the language family but not the script, and Chinese shares neither family nor script with English (Chang et al., [2024](https://arxiv.org/html/2606.03291#bib.bib44); Hu et al., [2020](https://arxiv.org/html/2606.03291#bib.bib67)). This setup allows us to explicitly control the influence of language family and script.) For each English question–answer (QA) pair $(x_{\text{EN}}, y_{\text{EN}})$ and each $\ell \in \mathcal{L}$, we obtain a translation $(x_{\ell}, y_{\ell})$ via Gemini-flash2.5. (Footnote 4: Although this translation procedure may produce highly parallel translations, we treat this as an intended design feature rather than a confound. It preserves semantic alignment across languages, allowing us to compare unlearning behavior under controlled conditions instead of attributing differences to translation-induced variation in wording, specificity, or linguistic complexity. This setting is also realistic for multilingual documents that contain the same sensitive information in highly parallel form.) We use the same forget/retain partition across all languages, so unlearning always targets the same underlying facts. Let $\mathcal{A}$ denote the set of all authors in TOFU, with $\mathcal{A}_{\text{forget}}$ and $\mathcal{A}_{\text{retain}}$ the designated forget and retain subsets, where $\lvert\mathcal{A}_{\text{forget}}\rvert / \lvert\mathcal{A}\rvert = 0.01$ in all experiments.
For an author $a \in \mathcal{A}$ and language $\ell \in \mathcal{L}$, let $\mathcal{D}_{\ell}(a)$ denote the set of QA pairs about $a$ in $\ell$. We define per-language forget and retain sets

$$\mathcal{D}^{\text{forget}}_{\ell} = \bigcup_{a \in \mathcal{A}_{\text{forget}}} \mathcal{D}_{\ell}(a), \qquad \mathcal{D}^{\text{retain}}_{\ell} = \bigcup_{a \in \mathcal{A}_{\text{retain}}} \mathcal{D}_{\ell}(a).$$

Let $\mathcal{L}_{\text{FT}} \subseteq \mathcal{L}$ be the set of fine-tuning languages for a given run. The corresponding fine-tuning dataset is

$$\mathcal{D}^{\text{FT}} = \bigcup_{\ell \in \mathcal{L}_{\text{FT}}} \bigl(\mathcal{D}^{\text{forget}}_{\ell} \cup \mathcal{D}^{\text{retain}}_{\ell}\bigr).$$

In the simplest case, $\mathcal{L}_{\text{FT}} = \{\ell\}$ is a singleton (one-language fine-tuning), but we also consider joint fine-tuning settings with $\lvert\mathcal{L}_{\text{FT}}\rvert > 1$.
For unlearning, we allow the forget objective to target an arbitrary subset of languages, not necessarily restricted to those used for fine-tuning. Let $\mathcal{L}_{\text{unl}} \subseteq \mathcal{L}$ denote the set of unlearning languages in a given experiment and define

$$\mathcal{D}^{\text{forget}}_{\text{unl}} = \bigcup_{\ell \in \mathcal{L}_{\text{unl}}} \mathcal{D}^{\text{forget}}_{\ell}.$$

Recall that the forget sets $\mathcal{D}^{\text{forget}}_{\ell}$ for different $\ell$ contain parallel QA pairs about the same underlying authors, so unlearning may act on the fine-tuned knowledge either in the fine-tuning language itself or in other languages that express the same facts. For representation analyses, $h^{(l)}_{f_{\text{base}}}(x_{\ell})$, $h^{(l)}_{f_{\text{ft}}}(x_{\ell})$, and $h^{(l)}_{f_{\text{un}}}(x_{\ell})$ denote pooled hidden states at layer $l$ for the same question $x$ in language $\ell$ in the base, fine-tuned, and unlearned models, respectively. When clear from context, we omit the explicit dependence on $(\mathcal{L}_{\text{FT}}, \mathcal{L}_{\text{unl}})$.
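The set definitions above translate directly into a few lines of code. The following is a minimal sketch, assuming the corpus is stored as a nested dictionary from language to author to QA pairs; that layout, the function names, and the toy data are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

QA = Tuple[str, str]                     # (question, answer)
Corpus = Dict[str, Dict[str, List[QA]]]  # language -> author -> QA pairs

def per_language_set(corpus: Corpus, lang: str, authors: Set[str]) -> List[QA]:
    """Union of D_lang(a) over the given authors (sorted for determinism)."""
    return [qa for a in sorted(authors) for qa in corpus[lang][a]]

def build_ft_dataset(corpus: Corpus, ft_langs: List[str],
                     forget: Set[str], retain: Set[str]) -> List[QA]:
    """D^FT: forget and retain pairs for every fine-tuning language."""
    data: List[QA] = []
    for lang in ft_langs:
        data += per_language_set(corpus, lang, forget)
        data += per_language_set(corpus, lang, retain)
    return data

def build_unlearn_forget(corpus: Corpus, unl_langs: List[str],
                         forget: Set[str]) -> List[QA]:
    """D^forget_unl: forget pairs for every unlearning language."""
    data: List[QA] = []
    for lang in unl_langs:
        data += per_language_set(corpus, lang, forget)
    return data

# Toy corpus: 2 languages, 3 authors, 1 placeholder QA pair each.
corpus: Corpus = {
    lang: {a: [(f"q[{lang},{a}]", f"a[{lang},{a}]")] for a in ("a1", "a2", "a3")}
    for lang in ("EN", "DE")
}
forget, retain = {"a1"}, {"a2", "a3"}

d_ft = build_ft_dataset(corpus, ["EN"], forget, retain)  # fine-tune in English only
d_unl = build_unlearn_forget(corpus, ["DE"], forget)     # unlearn in German only
```

The toy call mirrors the cross-lingual case discussed in the text: the model sees the facts only in English, while the forget objective is applied to the parallel German pairs about the same author.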
### 3.3 Fine-Tuning and Unlearning Objectives
Fine-tuning. $J_{\text{FT}}(\theta)$ denotes the fine-tuning objective:

$$J_{\text{FT}}(\theta) = \mathbb{E}_{(x,y) \sim \mathcal{D}^{\text{FT}}} \bigl[-\log p_{\theta}(y \mid x)\bigr].$$

The fine-tuned parameters are obtained by approximately minimizing this objective:

$$\theta_{\text{FT}} = \operatorname*{arg\,min}_{\theta} J_{\text{FT}}(\theta).$$
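As a small worked example of this objective, the sketch below computes the negative log-likelihood of an answer from per-token probabilities and averages it over a batch. It assumes the usual autoregressive factorization of $p_{\theta}(y \mid x)$ into token probabilities; the function names and numbers are illustrative.

```python
import math
from typing import List

def sequence_nll(token_probs: List[float]) -> float:
    """-log p(y | x) for one answer, given p(y_t | x, y_<t) for each token."""
    return -sum(math.log(p) for p in token_probs)

def ft_objective(batch: List[List[float]]) -> float:
    """Monte Carlo estimate of J_FT: mean sequence NLL over sampled (x, y) pairs."""
    return sum(sequence_nll(probs) for probs in batch) / len(batch)

# Two toy answers: one with two uncertain tokens, one predicted with certainty.
batch = [[0.5, 0.5], [1.0]]
loss = ft_objective(batch)
```

Here the first answer contributes $\log 4$ and the second contributes 0, so the batch estimate is $\log 2$.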
Unlearning. We denote by $J_{\text{UN}}(\theta)$ the unlearning objective. For a set of unlearning languages $\mathcal{L}_{\text{unl}} \subseteq \mathcal{L}$, we define

$$J_{\text{UN}}(\theta) = \frac{1}{\lvert\mathcal{L}_{\text{unl}}\rvert} \sum_{\ell \in \mathcal{L}_{\text{unl}}} \Big( \mathbb{E}_{(x,y) \sim \mathcal{D}^{\text{forget}}_{\ell}} J_{\text{forget}}(\theta; x, y^{+}, y^{-}) + \lambda \, \mathbb{E}_{(x,y) \sim \mathcal{D}^{\text{retain}}_{\ell}} J_{\text{retain}}(\theta; x, y) \Big), \tag{1}$$

where $J_{\text{retain}}(\theta; x, y)$ encourages correct answers on retain examples, $J_{\text{forget}}(\theta; x, y^{+}, y^{-})$ encourages forgetting on forget examples, and $\lambda > 0$ controls their trade-off. The unlearned parameters $\theta_{\text{UN}}$ are obtained by minimizing $J_{\text{UN}}$ with respect to $\theta$, initializing $\theta = \theta_{\text{FT}}$. In our experiments, we instantiate $J_{\text{forget}}$ using direct preference optimization (DPO; Rafailov et al. [2023](https://arxiv.org/html/2606.03291#bib.bib56)): for each forget prompt $x$, we construct a pair of responses $(y^{+}, y^{-})$, where $y^{+}$ is an “I don’t know (IDK)” style refusal and $y^{-}$ is the ground truth response for $x$. DPO then encourages the model to prefer $y^{+}$ over $y^{-}$, which discourages hallucinations after unlearning and improves suitability for deployment (Yoon et al., [2025](https://arxiv.org/html/2606.03291#bib.bib57); Zhang et al., [2024](https://arxiv.org/html/2606.03291#bib.bib63); Xu et al., [2025b](https://arxiv.org/html/2606.03291#bib.bib58)). (Footnote 5: See Appendix [B](https://arxiv.org/html/2606.03291#A2) for further details on the DPO setup. Appendix [C](https://arxiv.org/html/2606.03291#A3) provides examples of IDK-style refusal answers and TOFU question–answer pairs.) While DPO is our main unlearning objective, we additionally evaluate gradient ascent (GA) (Jang et al., [2023](https://arxiv.org/html/2606.03291#bib.bib35)) and negative preference optimization (NPO) (Zhang et al., [2024](https://arxiv.org/html/2606.03291#bib.bib63)) to cover a broader range of commonly used unlearning objectives; details are provided in Appendix [D](https://arxiv.org/html/2606.03291#A4).
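To show how the terms of the unlearning objective fit together, here is a minimal numeric sketch. It uses the standard DPO loss on sequence log-probabilities, with the refusal as the preferred response and the ground truth as the rejected one. The temperature `beta`, the choice of reference model, and the use of precomputed per-example retain losses are assumptions for illustration; the paper's actual settings are in its Appendix B and are not reproduced here.

```python
import math
from typing import Dict, List, Tuple

def dpo_loss(logp_pos: float, logp_neg: float,
             ref_logp_pos: float, ref_logp_neg: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def unlearning_objective(per_language: Dict[str, Tuple[List[float], List[float]]],
                         lam: float) -> float:
    """J_UN: average over unlearning languages of (mean forget loss + lam * mean retain loss)."""
    total = 0.0
    for forget_losses, retain_losses in per_language.values():
        forget_term = sum(forget_losses) / len(forget_losses)
        retain_term = sum(retain_losses) / len(retain_losses)
        total += forget_term + lam * retain_term
    return total / len(per_language)

# y+ is the IDK refusal, y- the ground-truth answer (toy log-probabilities).
untrained = dpo_loss(-3.0, -3.0, -3.0, -3.0)      # policy equals reference
after_update = dpo_loss(-1.0, -5.0, -3.0, -3.0)   # policy now prefers the refusal

j_un = unlearning_objective(
    {"EN": ([0.5, 0.7], [0.1, 0.3]), "DE": ([0.9], [0.2])}, lam=1.0
)
```

When the policy matches the reference, the loss is $\log 2$; it falls as the policy shifts probability from the ground-truth answer to the refusal, which is the behavior the forget term rewards.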
### 3.4 Evaluation: NLI-Based Semantic Score
We automatically assess whether the generated answer $\hat{y}$ in response to $x$ matches the ground truth $y$. The original TOFU benchmark does so using a suite of metrics that combine probability-based scores, lexical overlap metrics, and composite measures like Truth Ratio (Maini et al., [2024](https://arxiv.org/html/2606.03291#bib.bib69); Dorna et al., [2026](https://arxiv.org/html/2606.03291#bib.bib65)). However, these metrics do not capture that semantically equivalent answers, which express the same underlying fact, should be treated as leakage even if they differ lexically (Hwang et al., [2025](https://arxiv.org/html/2606.03291#bib.bib24); Lu and Koehn, [2025](https://arxiv.org/html/2606.03291#bib.bib25)). We therefore evaluate answer correctness using a multilingual *natural language inference* (NLI) model (Conneau et al., [2018](https://arxiv.org/html/2606.03291#bib.bib62)), which predicts for each pair $(y, \hat{y})$ whether $y$ and $\hat{y}$ logically entail one another. See Appendix [E.1](https://arxiv.org/html/2606.03291#A5.SS1) for details on the NLI score calculation. Multilingual NLI-based evaluation has been shown to be a useful proxy for semantic equivalence in generation tasks (Dušek and Kasner, [2020](https://arxiv.org/html/2606.03291#bib.bib60); Chen and Eger, [2023](https://arxiv.org/html/2606.03291#bib.bib61)): it captures semantic agreement beyond lexical overlap and across languages.
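One way to read "entail one another" is as a bidirectional entailment check, sketched below. The exact scoring rule is defined in the paper's Appendix E.1 and is not reproduced here, so the aggregation (percentage of pairs entailed in both directions) is an assumption. The `entails` argument stands in for a real NLI classifier, and `toy_entails` is a deliberately crude word-overlap placeholder, not an NLI model.

```python
from typing import Callable, List, Tuple

def nli_score(pairs: List[Tuple[str, str]],
              entails: Callable[[str, str], bool]) -> float:
    """Percentage of (truth, generated) pairs that entail each other in both directions."""
    hits = sum(
        1 for truth, generated in pairs
        if entails(truth, generated) and entails(generated, truth)
    )
    return 100.0 * hits / len(pairs)

def toy_entails(premise: str, hypothesis: str) -> bool:
    """Placeholder for an NLI model: every hypothesis word must occur in the premise."""
    return set(hypothesis.lower().split()) <= set(premise.lower().split())

pairs = [
    ("She was born in Paris", "she was born in paris"),  # same fact, counted as leakage
    ("She was born in Paris", "I do not know"),          # refusal, counted as forgotten
]
score = nli_score(pairs, toy_entails)
```

In practice the `entails` callable would wrap the multilingual NLI model's entailment prediction for a premise and hypothesis, so that paraphrases and answers in other languages are still recognized as the same fact.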
Table 1: Cross-lingual unlearning transfer across fine-tune (FT), unlearn, and query language combinations for Qwen2.5-7B. Within each row block, the Unlearn column specifies the language on which unlearning is performed, while Base denotes the fine-tuned model before unlearning. For each query language, we report the absolute change in NLI score on the forget set after unlearning in the corresponding language, relative to the Base model. We report means ± 95% confidence intervals over 5 forget sets. For example, the cell highlighted in yellow corresponds to a model fine-tuned in English, unlearned in German, and queried in Russian, whose NLI score drops by 5 points compared to the English fine-tuned model directly queried in Russian. Lower values indicate stronger unlearning. Colored cells are discussed in the results section.
## 4 Experiments and Results
Model configuration. We conduct experiments on the Qwen2.5-7B and Gemma2-9B models (Qwen et al., [2025](https://arxiv.org/html/2606.03291#bib.bib70); Team et al., [2024](https://arxiv.org/html/2606.03291#bib.bib71)). For NLI-based evaluation, we use xlm-roberta-large-xnli (Conneau et al., [2020](https://arxiv.org/html/2606.03291#bib.bib59)) for all languages. Native speakers validated NLI model predictions on 50 TOFU samples per language. We further reuse the same annotated examples to compare NLI-based evaluation with alternative automatic metrics on a representative subset of languages (Appendix [E.2](https://arxiv.org/html/2606.03291#A5.SS2)).
Fine-tuning and unlearning. Starting from the base checkpoints, we fine-tune a separate model for each choice of fine-tuning languages $\mathcal{L}_{\text{FT}}$ on the corresponding TOFU split $\mathcal{D}^{\text{FT}}$ (Section [3.2](https://arxiv.org/html/2606.03291#S3.SS2)). We then apply the unlearning objective $J_{\text{UN}}$ defined in Section [3.3](https://arxiv.org/html/2606.03291#S3.SS3) to remove knowledge about the forget authors in $\mathcal{D}^{\text{forget}}_{\text{unl}}$. This procedure yields three model variants for each configuration: the base model ($f_{\text{base}}$), the fine-tuned model ($f_{\text{ft}}$), and the unlearned model ($f_{\text{un}}$). Unless otherwise stated, we use the same setup across all experiments and report NLI-based scores. All experiments were run on 4 NVIDIA H100 GPUs.
### 4.1 Cross-Lingual Unlearning Transfer
Setup. For each $\ell \in \mathcal{L}_{\text{FT}}$, we start from the corresponding fine-tuned checkpoint. Then, for every unlearning language $\ell$, we apply the unlearning objective $J_{\text{UN}}$ to remove knowledge about the forget authors in $\mathcal{D}^{\text{forget}}_{\ell}$ and evaluate both the fine-tuned and the resulting unlearned models on TOFU questions in every query language. We report the change in NLI-based score relative to the fine-tuned model. More negative values indicate stronger unlearning, since performance on the forget set drops further relative to the fine-tuned model. This yields the matrix in Table [1](https://arxiv.org/html/2606.03291#S3.T1).
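The reported quantity is a simple difference of scores, which the following sketch makes explicit for one fine-tuned checkpoint. The dictionary layout and all scores are invented for illustration and are not results from the paper.

```python
from typing import Dict

def transfer_matrix(ft_scores: Dict[str, float],
                    un_scores: Dict[str, Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """delta[unl][q] = score of the model unlearned in `unl`, queried in `q`,
    minus the fine-tuned (Base) model's score when queried in `q`."""
    return {
        unl: {q: scores[q] - ft_scores[q] for q in ft_scores}
        for unl, scores in un_scores.items()
    }

# Forget-set NLI scores (percent) for a hypothetical fine-tuned model ...
ft_scores = {"EN": 90.0, "DE": 60.0}
# ... and for models unlearned in EN or DE, each queried in EN and DE.
un_scores = {
    "EN": {"EN": 10.0, "DE": 20.0},
    "DE": {"EN": 50.0, "DE": 15.0},
}

delta = transfer_matrix(ft_scores, un_scores)
```

In this toy case, `delta["EN"]["DE"]` is -40.0: unlearning in English lowers the German score by 40 points, which would count as strong cross-lingual transfer.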
To quantify variability due to the particular choice of forget authors, we repeat this procedure over 5 random samples of forget authors. In each sample $s$, we construct a new forget set $\mathcal{A}^{(s)}_{\text{forget}} \subset \mathcal{A}$ and define $\mathcal{A}^{(s)}_{\text{retain}} = \mathcal{A} \setminus \mathcal{A}^{(s)}_{\text{forget}}$, so the sizes of the forget and retain sets are fixed across samples while the specific authors change. The fine-tuned model $f_{\text{ft}}$ is kept fixed across all samples; only the unlearned model $f_{\text{un}}$ is retrained for each resampled split. We report DPO-based unlearning in the main text.
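A sketch of the resampling and the reported mean with a 95% confidence interval follows. The section does not state how the interval is computed, so the Student-t interval used here (critical value 2.776 for 4 degrees of freedom) is an assumption, as are the function names, the seeds, and the score changes.

```python
import math
import random
import statistics
from typing import List, Sequence, Tuple

T_CRIT_DF4 = 2.776  # two-sided 95% Student-t critical value for n = 5 (4 degrees of freedom)

def sample_forget_authors(authors: Sequence[str], k: int, seed: int) -> List[str]:
    """Draw k forget authors; the remaining authors form the retain set."""
    return sorted(random.Random(seed).sample(list(authors), k))

def mean_ci95(values: Sequence[float], t_crit: float = T_CRIT_DF4) -> Tuple[float, float]:
    """Return (mean, half-width) of an assumed t-based 95% confidence interval."""
    n = len(values)
    half_width = t_crit * statistics.stdev(values) / math.sqrt(n)
    return statistics.mean(values), half_width

authors = [f"author_{i:03d}" for i in range(200)]
splits = [sample_forget_authors(authors, k=2, seed=s) for s in range(5)]

# Invented score changes, one per resampled forget set.
deltas = [-50.0, -55.0, -45.0, -52.0, -48.0]
mean, half_width = mean_ci95(deltas)
```

With only 5 samples the t critical value is noticeably larger than the normal 1.96, so a normal-approximation interval would understate the uncertainty.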
Linguistic similarity: family and script both matter. Controlling for script differences, unlearning in English transfers more strongly to another Indo-European language (Russian) than to a typologically distant language (Chinese), indicating that typological proximity increases unlearning transfer even across scripts. We highlight these in green in Table [1](https://arxiv.org/html/2606.03291#S3.T1). Separately, we observe a strong script effect: a language that uses the Latin script but belongs to a different language family (Turkish) can exhibit stronger unlearning transfer to other Latin-script languages (English and German) than a language that shares neither the family nor the script (Chinese), suggesting that script can substantially influence transfer even when typological similarity is low (highlighted in blue). Transfer is strongest when both family and script are shared (e.g., English and German), mirroring patterns previously reported for cross-lingual learning transfer (Lin et al., [2019](https://arxiv.org/html/2606.03291#bib.bib66); Hu et al., [2020](https://arxiv.org/html/2606.03291#bib.bib67); Zhao et al., [2024](https://arxiv.org/html/2606.03291#bib.bib68)). (Footnote 6: To focus on the main findings, we omit retain set results, as the observed changes are very small and we do not find consistent patterns. The full retain-set results are provided in Appendix [F.4](https://arxiv.org/html/2606.03291#A6.SS4).) We report additional fine-tuning and unlearning language combinations in Appendix [F.1](https://arxiv.org/html/2606.03291#A6.SS1). Our results hold across models (see Appendix [F.2](https://arxiv.org/html/2606.03291#A6.SS2) for Gemma). Additional experiments with GA and NPO show consistent qualitative transfer patterns, suggesting that these findings are not specific to the choice of unlearning objective (Appendix [F.3](https://arxiv.org/html/2606.03291#A6.SS3)).
Transfer of unlearning is asymmetric across languages. Languages with *high* coverage in the pretraining data (English, Chinese) tend to transfer unlearning more strongly to other languages, particularly those that share their script or family, whereas languages with lower pretraining coverage (German, Russian, and Turkish) are comparatively weaker as unlearning sources. We highlight these source–target pairs in red in Table [1](https://arxiv.org/html/2606.03291#S3.T1). One possible explanation is that smaller multilingual models often operate predominantly in a shared semantic space anchored by a privileged language (Lim et al., [2025](https://arxiv.org/html/2606.03291#bib.bib55)). Consequently, unlearning the high-coverage language has a more pronounced impact on other languages.
Unlearning in low-performing languages still transfers to high-performing ones. Even when the model's performance in a language is poor, unlearning that language can still have a considerable impact on the fine-tuned language. For example, in the Turkish fine-tuned model, the forget set score when queried in English is only 11%, yet unlearning English still reduces the Turkish NLI score on the same forget set by 55% (highlighted in orange in Table [1](https://arxiv.org/html/2606.03291#S3.T1)). At first glance, this is somewhat unintuitive: unlearning questions in a language where the model can barely answer them still transfers a strong unlearning effect to a high-performance language. We hypothesize that the model accurately represents concepts in a language-agnostic "concept" space, which allows transfer. Unlearning primarily operates in later, language-specific generation layers, suppressing the model's ability to generate an answer to a given question. We explore this hypothesis in our next set of experiments, both on the output and the representation level.
Table 2: Performance gains $\Delta_{\ell \leftarrow q}$ for Qwen2.5-7B, reported as the absolute improvement when instructing the model to answer in the fine-tuning language ($\ell$) rather than the query language ($q$). Rows indicate the fine-tuning language $\ell$; columns indicate the query language $q$.
### 4.2 Cross-lingual Prompting
In this experiment, when querying in language $q$, we instruct the model to instead reply in the fine-tuning language $\ell \neq q$. If this "cross-lingual prompting" improves performance, it implies that the model can successfully *retrieve* the relevant information but cannot *decode* it in $q$. Because inputs in $q$ access similar hidden representations to those supporting $\ell$, unlearning $q$ disrupts this shared space, thereby degrading performance in $\ell$. This supports the hypothesis from our previous experiment through an output-level explanation.
Setup. We systematically test all ordered pairs of fine-tuning and query languages $(\ell, q) \in \mathcal{L}^{2}$, $\ell \neq q$. We measure the performance gain $\Delta_{\ell \leftarrow q}$, defined as the difference in score between instructing the model to answer in $\ell$ when querying in $q$ and the default case of answering in the query language $q$ (see Appendix [G](https://arxiv.org/html/2606.03291#A7) for the prompting template).
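The gain defined above is a simple per-pair difference. The sketch below shows one way to compute it, assuming scores are stored in dictionaries keyed by (fine-tuning language, query language). The function name `prompting_gains`, the dictionary layout, and all numbers are hypothetical and are not results from this paper.

```python
def prompting_gains(default_scores, cross_scores):
    """Return Delta[(ft, q)] = cross_scores[(ft, q)] - default_scores[(ft, q)].

    Keys are (fine-tuning language, query language) pairs with ft != q.
    default_scores: model answers in the query language q.
    cross_scores:   model is instructed to answer in the fine-tuning language.
    """
    gains = {}
    for (ft, q), default in default_scores.items():
        if ft == q:
            continue  # gains are only defined for cross-lingual pairs
        gains[(ft, q)] = cross_scores[(ft, q)] - default
    return gains

# Made-up scores in [0, 1], for illustration only.
default_scores = {("en", "tr"): 0.25, ("en", "zh"): 0.5, ("tr", "en"): 0.125}
cross_scores = {("en", "tr"): 0.75, ("en", "zh"): 0.625, ("tr", "en"): 0.5}
gains = prompting_gains(default_scores, cross_scores)
for pair, delta in sorted(gains.items()):
    print(pair, delta)
```

A positive value for a pair means that the information was retrievable from a query in $q$ but was not decoded well in $q$.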
Results. Performance gains are reported in Table [2](https://arxiv.org/html/2606.03291#S4.T2). The large observed gains add evidence to our hypothesis that the model can map inputs in $q$ into a shared cross-lingual "concept" space while remaining locally monolingual at decoding (Lim et al., [2025](https://arxiv.org/html/2606.03291#bib.bib55); Kang and Kim, [2025](https://arxiv.org/html/2606.03291#bib.bib72)). These results help explain our unlearning findings: even if the model exhibits poor output quality in language $q$, questions in $q$ can still access shared internal representations that support accurate answers in other languages. Consequently, unlearning on $q$ can disrupt those shared representations and transfer to languages in which the model answers well. We also computed the correlation between the pairwise language gains in Table [2](https://arxiv.org/html/2606.03291#S4.T2) and the unlearning transfer scores in Table [1](https://arxiv.org/html/2606.03291#S3.T1). The significant positive correlations (Pearson $r = 0.50$, $p < 0.05$; Spearman $\rho = 0.60$, $p < 0.01$) suggest that the stronger the hidden mapping between languages, the more the unlearning of one damages the other.
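For readers who want to reproduce this kind of comparison, the following sketch computes Pearson and Spearman coefficients between two flattened lists of per-pair values using only the standard library. It is an illustration under stated assumptions: the input numbers are invented, tie handling uses average ranks, and p-values are not computed here.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Invented per-pair values (prompting gain, unlearning transfer).
gains_flat = [0.10, 0.40, 0.20, 0.50]
transfer_flat = [0.20, 0.50, 0.10, 0.90]
print(pearson(gains_flat, transfer_flat), spearman(gains_flat, transfer_flat))
```

Each entry in the two lists should correspond to the same ordered language pair, so that the correlation compares matching cells of the two tables.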
### 4.3 Hidden Representation Analysis
In this experiment, we use the lens of mechanistic interpretability to study the model's internal representations, and how unlearning changes them, in order to shed light on the internal dynamics of cross-lingual transfer and unlearning.
Cosine Similarity Setup. We perform controlled representational analyses across three model variants $m \in \{f_{\text{base}}, f_{\text{ft}}, f_{\text{un}}\}$. For all analyses, we fix an *anchor* language $\ell_{\mathrm{src}}$ (the language on which the model is fine-tuned, e.g., English) and, for each target language $\ell_{\mathrm{tgt}} \in \mathcal{L} \setminus \{\ell_{\mathrm{src}}\}$, feed the *same question* in both languages. For each model $m$, layer $l$, and language $\ell \in \{\ell_{\mathrm{src}}, \ell_{\mathrm{tgt}}\}$, we extract a single hidden representation $h^{(l)}_{m}(x_{\ell})$ corresponding to the final token of the full prompt. We then compute, for every anchor-target pair $(\ell_{\mathrm{src}}, \ell_{\mathrm{tgt}})$ and layer $l$, the cosine similarity $\cos\bigl(h^{(l)}_{m}(x_{\ell_{\mathrm{src}}}), h^{(l)}_{m}(x_{\ell_{\mathrm{tgt}}})\bigr)$, and average these similarities over all questions in the forget set to obtain a layer-wise cross-lingual similarity curve for each model variant. [Footnote 7: The embedding layer is excluded from all analyses.] By comparing cosine similarity across languages for semantically equivalent questions, we can test how the model represents these inputs and how unlearning alters these representations across layers. To calibrate these similarities, we additionally compute cosine similarity for *semantically unrelated* question pairs using the $f_{\text{base}}$ model only (shown as Random in the bottom left panel of Figure [1](https://arxiv.org/html/2606.03291#S4.F1)).
As expected in anisotropic transformer representation spaces, absolute cosine values remain high even for such mismatched pairs (Godey et al., [2024](https://arxiv.org/html/2606.03291#bib.bib49); Ait-Saada and Nadif, [2023](https://arxiv.org/html/2606.03291#bib.bib52); Ethayarajh, [2019](https://arxiv.org/html/2606.03291#bib.bib53)). [Footnote 8: Although CKA (Kornblith et al., [2019](https://arxiv.org/html/2606.03291#bib.bib51)) is more robust to anisotropy in transformer representations, we use cosine similarity because our goal is to measure *instance-level* shifts between $f_{\text{ft}}$ and $f_{\text{un}}$ for the *same* question: cosine similarity operates directly on each pair $\bigl(h^{(l)}_{f_{\text{ft}}}(x_{\ell}), h^{(l)}_{f_{\text{un}}}(x_{\ell})\bigr)$, whereas CKA compares the covariance structure of two *sets* of representations and therefore reflects only global space-level alignment, not per-example changes.] We therefore focus on *relative* trends across layers and model variants rather than absolute similarities.
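The layer-wise similarity curve described in this setup can be sketched as follows. This is a minimal illustration that assumes the final-token hidden states have already been extracted and stored as nested lists indexed by question and then by layer; the function names and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def layerwise_similarity(src_states, tgt_states):
    """Mean cosine similarity per layer.

    src_states[q][l] is the final-token hidden vector of question q at
    layer l in the anchor language; tgt_states[q][l] is the same question
    in the target language, taken from the same model variant.
    """
    num_layers = len(src_states[0])
    curve = []
    for l in range(num_layers):
        sims = [cosine(s[l], t[l]) for s, t in zip(src_states, tgt_states)]
        curve.append(sum(sims) / len(sims))
    return curve

# Toy data: 2 questions, 2 layers, 2-dimensional hidden states.
src = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 2.0], [0.0, 1.0]]]
tgt = [[[2.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [3.0, 0.0]]]
curve = layerwise_similarity(src, tgt)
print(curve)
```

In the toy data the two languages are perfectly aligned at the first layer and orthogonal at the second, which mimics a curve that drops in later layers.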

Figure 1: Layer-wise cosine similarity of final-token hidden states $h_{m}^{(l)}(x)$. Top row: cross-language comparison between embeddings of an English question and its translation (Ch, De, Ru, Tu) in the same model (Qwen2.5-7B variants; see panel titles). FT = fine-tuning, UN = unlearning, and Cross-lingual prompting means that, for non-English questions, the model is instructed to answer in English. Bottom row: cross-model comparison within each column between the models indicated in the top panel title and the bottom panel title. For example, the bottom right panel compares representations in the unlearned English vs. the fine-tuned English model. Curves show the mean across questions; shaded regions denote 95% confidence intervals.

Cosine Similarity Results. We first show how cross-lingual prompting affects *internal* representations across layers (Figure [1](https://arxiv.org/html/2606.03291#S4.F1)). We use the same setup as in Table [2](https://arxiv.org/html/2606.03291#S4.T2): for a query in language $q$, we compare the default setting where the model answers in $q$ (top Panel 2) with cross-lingual prompting that instructs the model to answer in the fine-tuning language (English; top Panel 3). If the model operated in distinct language-specific subspaces, we would expect the additional difficulty of the cross-lingual task in Panel 3 to negatively impact alignment. Instead, across all query languages, cosine similarity remains comparably high for the majority of layers in both settings. A key difference emerges only in the later decoding layers: in Panel 2, similarity drops sharply as the model prepares to generate tokens in the target language, whereas Panel 3 mitigates this drop, maintaining much higher alignment.
This pattern suggests that performance gaps across languages arise primarily from *decoding bottlenecks*: the model forms a similar underlying representation across languages but fails to map this shared semantic representation into the appropriate language-specific output distribution. [Footnote 9: Layer-wise similarity does not directly measure knowledge retrieval. Cosine similarity indicates how aligned two hidden representations are in direction, but it does not guarantee that the model is actually retrieving or using the same underlying facts. In our setting, increases in cosine similarity are accompanied by better model performance in the output evaluation, suggesting that higher cosine similarity may be positively correlated with successful retrieval. Our interpretation focuses on suggestive trends in representational geometry, not strong causal claims about retrieval.]
Next, comparing Panels 2 and 4 in the top row reveals that $f_{\text{un}}$ preserves almost the same cross-lingual alignment as $f_{\text{ft}}$ in the early and middle layers: the $f_{\text{un}}$ curves in Panel 4 closely track the $f_{\text{ft}}$ curves in Panel 2, and they diverge only in the later layers, where unlearning is expected to act most strongly. This suggests that, in the case of a relatively small number of unlearning examples, the unlearning procedure largely leaves cross-lingual alignment intact (e.g., the directions mapping different languages into the shared space remain similar).
Moreover, in Panel 4 (top vs. bottom), when we directly compare similarities between $f_{\text{un}}$ and $f_{\text{ft}}$, we observe that their curves remain closely aligned from the early to the middle layers for all non-English languages. This pattern suggests that unlearning does not substantially alter how the model retrieves knowledge for other languages; instead, the main divergence emerges only at the decoding stage. Importantly, this behavior is not restricted to cross-lingual settings: we observe the same pattern for English questions when comparing hidden representations before and after unlearning. Consistent with this, Appendix [H.2](https://arxiv.org/html/2606.03291#A8.SS2) provides additional evidence for unlearning fine-tuned languages. An analogous plot with cosine similarity across languages for Chinese is shown in Appendix [H.1](https://arxiv.org/html/2606.03291#A8.SS1).
Taken together, the cosine similarity analysis is consistent with the cross-lingual prompting results (Section [4.2](https://arxiv.org/html/2606.03291#S4.SS2)): inputs in different languages can map to a shared internal representation even when output quality differs, and instructing the model to decode in the fine-tuning language preserves high cross-lingual alignment across layers. Moreover, unlearning primarily alters late-layer representations while leaving early and middle layers largely intact, suggesting that underlying information may remain accessible after unlearning.
PCA setup. While cosine similarity captures directional alignment between paired representations, it does not directly show the global geometry of the representation space. We therefore complement the layer-wise cosine analysis with a PCA-based visualization of the forget-set representations. Using the same hidden representations $h^{(l)}_{m}(x_{\ell})$ as above, we first $\mathrm{L}_{2}$-normalize each representation. For each language $\ell$ and layer $l$, we collect the normalized representations of all forget-set questions from the three model variants $m \in \{f_{\text{base}}, f_{\text{ft}}, f_{\text{un}}\}$, fit PCA (Wold et al., [1987](https://arxiv.org/html/2606.03291#bib.bib50)) on their union, and project the representations onto the first two principal components. We then visualize the resulting two-dimensional projections to examine how the $f_{\text{base}}$, $f_{\text{ft}}$, and $f_{\text{un}}$ representations separate across layers.
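The PCA procedure above (normalize, pool the three variants, fit, project) can be sketched with the standard library alone. The sketch below is an illustration, not the implementation used for Figure 2: it finds the top two principal directions by power iteration with deflation, which is a swapped-in technique chosen to avoid external dependencies, and all names and toy vectors are hypothetical.

```python
import math
import random

def l2_normalize(v):
    """Scale v to unit L2 norm (v must be non-zero)."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def top_components(rows, k=2, iters=200, seed=0):
    """Mean and top-k principal directions via power iteration with deflation.

    Assumes the centered data has rank of at least k.
    """
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    rng = random.Random(seed)
    comps = []
    for _ in range(k):
        v = l2_normalize([rng.gauss(0.0, 1.0) for _ in range(d)])
        for _ in range(iters):
            # Deflation: remove the directions that were already found.
            for c in comps:
                proj = sum(a * b for a, b in zip(v, c))
                v = [a - proj * b for a, b in zip(v, c)]
            # Apply X^T X to v, then renormalize.
            scores = [sum(a * b for a, b in zip(row, v)) for row in centered]
            v = [sum(s * row[j] for s, row in zip(scores, centered)) for j in range(d)]
            v = l2_normalize(v)
        comps.append(v)
    return mean, comps

def project(rows, mean, comps):
    """Coordinates of each row along each principal direction."""
    return [[sum((r[j] - mean[j]) * c[j] for j in range(len(mean))) for c in comps]
            for r in rows]

def pca_2d(variant_states):
    """variant_states maps a variant name to its hidden vectors (one per
    question) for a fixed language and layer."""
    normed = {m: [l2_normalize(h) for h in hs] for m, hs in variant_states.items()}
    union = [h for hs in normed.values() for h in hs]
    mean, comps = top_components(union, k=2)
    return {m: project(hs, mean, comps) for m, hs in normed.items()}

# Toy 3-dimensional "hidden states" for two questions per variant.
toy = {
    "base": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    "ft": [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]],
    "un": [[0.0, 0.0, 1.0], [0.0, 0.1, 0.9]],
}
proj = pca_2d(toy)
print(proj["ft"])
```

Fitting on the union, rather than per variant, places all three variants in the same two-dimensional coordinate system, which is what makes their clusters comparable in a single plot.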
Figure 2: PCA separation (Qwen2.5-7B) across layers for English (top) and Chinese (bottom) questions. Numbers indicate question indices, with identical numbers referring to the same questions across the tested model variants.

PCA results. Ideally, under successful unlearning, we would expect the $f_{\text{base}}$ and $f_{\text{un}}$ data points to be closely clustered (Shi et al., [2025](https://arxiv.org/html/2606.03291#bib.bib45)). However, the PCA results show clear separability between $f_{\text{base}}$ and $f_{\text{un}}$ from the earliest layers. For Chinese questions, $f_{\text{ft}}$ and $f_{\text{un}}$ become clearly separated as early as layer 10, and by the final layer all three model variants form distinct clusters, as shown in Figure [2](https://arxiv.org/html/2606.03291#S4.F2). For English, this separation is weaker: $f_{\text{ft}}$ and $f_{\text{un}}$ often remain in the same broad cluster, while $f_{\text{base}}$ forms a separate cluster. Nevertheless, paired $f_{\text{ft}}$ and $f_{\text{un}}$ representations for the *same* question remain geometrically distant, indicating that unlearning still induces substantial per-example shifts. Taken together, these results suggest that unlearning moves forget-example representations away from their fine-tuned counterparts, but does not return them to the $f_{\text{base}}$ distribution. This supports the interpretation that the underlying knowledge is not fully removed, but rather reorganized or suppressed at the representation level. We further extend the PCA analysis to German, Russian, and Turkish, as shown in Figure [6](https://arxiv.org/html/2606.03291#A8.F6); these languages largely mirror the English structural pattern.
Since the two-dimensional PCA projection captures only the directions of largest variance, we further quantify the representational shifts induced by unlearning using centroid distances and average pairwise distances. Detailed setups and additional results are provided in Appendix [H.3](https://arxiv.org/html/2606.03291#A8.SS3).
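To make the two quantities concrete, the sketch below shows one plausible reading of them. The exact definitions are given in Appendix H.3 and may differ: here we assume that the centroid distance is the Euclidean distance between the mean representations of two model variants, and that the average pairwise distance is the mean Euclidean distance between the two variants' representations of the same question. All names and toy vectors are hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]

def centroid_distance(a, b):
    """Assumed definition: distance between the two variant centroids."""
    return euclidean(centroid(a), centroid(b))

def mean_paired_distance(a, b):
    """Assumed definition: mean distance between same-question pairs."""
    return sum(euclidean(u, v) for u, v in zip(a, b)) / len(a)

# Toy representations of two questions under f_ft and two unlearned models.
ft = [[0.0, 0.0], [2.0, 0.0]]
un_shift = [[0.0, 3.0], [2.0, 3.0]]  # every example moves by the same offset
un_swap = [[2.0, 0.0], [0.0, 0.0]]   # examples move, but the centroid does not

print(centroid_distance(ft, un_shift), mean_paired_distance(ft, un_shift))
print(centroid_distance(ft, un_swap), mean_paired_distance(ft, un_swap))
```

The second toy case illustrates why both measures are useful: two variants can share a cluster (zero centroid distance) while individual examples still move substantially.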
Figure 3: Effect of layer-wise steering vector injection on NLI score for the forget set. Left: NLI score change when injecting a scaled and normalized steering vector extracted from the retain set, showing substantial recovery (positive score gain). Right: NLI score change when injecting random Gaussian directions with matched norm and scale. The much weaker recovery indicates that the extracted vectors capture a structured unlearning direction rather than generic noise.
### 4.4 Steering Vectors Recover Unlearned Knowledge
From the hidden representation analysis in Section [4.3](https://arxiv.org/html/2606.03291#S4.SS3), we obtain the following key finding: unlearning may act more like suppression than removal of the underlying knowledge. [Footnote 10: As shown in Appendix [H.3](https://arxiv.org/html/2606.03291#A8.SS3), the difference between $f_{\text{ft}}$ and $f_{\text{un}}$ can also be decomposed into a global component $g^{(l)}$, where the approximately constant centroid distance across layers indicates a stable global shift.] Motivated by these observations, we construct a per-layer steering vector, approximated by the per-layer difference between the hidden representations of an auxiliary model $f_{\text{un}}^{\text{aux}}$ and $f_{\text{ft}}$. We then inject this steering vector into the original unlearned model $f_{\text{un}}$ at inference time and evaluate how much forgotten information can be recovered on the forget set.
Setup. Inspired by prior work showing that directions in hidden representation space can encode high-level behaviors (Rimsky et al., [2024](https://arxiv.org/html/2606.03291#bib.bib48); Zhu et al., [2025](https://arxiv.org/html/2606.03291#bib.bib47); Li et al., [2023](https://arxiv.org/html/2606.03291#bib.bib46); Lim et al., [2025](https://arxiv.org/html/2606.03291#bib.bib55)), we extract steering vectors as layer-wise activation differences using *only English* questions between the English $f_{\text{ft}}$ and an auxiliary model $f_{\text{un}}^{\text{aux}}$ as follows (see Algorithm [1](https://arxiv.org/html/2606.03291#alg1) for full details). To avoid reintroducing the specific facts that were unlearned, we construct an auxiliary English forget set $\mathcal{D}_{\text{aux}}^{\text{forget}}$ by randomly shuffling retain set authors. We compute $\mathbf{g}^{(l)}$ *only* from $\mathcal{D}^{\text{forget}}_{\text{aux}}$; the original forget set $\mathcal{D}^{\text{forget}}_{\text{unl}}$ is never used to derive steering vectors and is used *only* for evaluation. We then unlearn $\mathcal{D}^{\text{forget}}_{\text{aux}}$ from $f_{\text{ft}}$ (see Section [3.3](https://arxiv.org/html/2606.03291#S3.SS3)), obtaining $f_{\text{un}}^{\text{aux}}$. Because $f_{\text{ft}}$ and $f_{\text{un}}^{\text{aux}}$ are evaluated on *identical* inputs, the per-layer differences isolate systematic changes induced by unlearning, yielding a set of per-layer steering vectors $\{\mathbf{g}^{(l)}\}_{l=1}^{L}$ that capture the suppression behavior introduced by unlearning.
Each $\mathbf{g}^{(l)}$ is $\ell_{2}$-normalized and, for a given input with last-token hidden representation $\mathbf{h}^{(l)}(x_{\ell})$, we subtract $\alpha \lVert \mathbf{h}^{(l)}(x_{\ell}) \rVert_{2} \, \mathbf{g}^{(l)}$, where the hyperparameter $\alpha$ controls the overall strength of the intervention (full details in Algorithm [2](https://arxiv.org/html/2606.03291#alg2)). When injecting the steering signal, we subtract this scaled vector over layers $[l, \ldots, l+N]$. [Footnote 11: When we select an injection layer $l$, we steer that layer and the following $N$ consecutive layers.] Because $\mathbf{g}^{(l)}$ is computed as a difference on identical inputs, much of the input-specific semantic content cancels out. We expect $\{\mathbf{g}^{(l)}\}_{l=1}^{L}$ to capture a largely language-agnostic transformation in the shared multilingual representation space, rather than English-specific lexical information. We also construct a random baseline by replacing each $\mathbf{g}^{(l)}$ with an isotropic Gaussian vector $\mathbf{r}^{(l)} \sim \mathcal{N}(\mathbf{0}, I)$. We set $\alpha = 0.5$ for English and Chinese and $\alpha = 0.8$ for all other languages, and $N = 2$ for all languages. [Footnote 12: To estimate the *upper bound of reversibility*, we optimize $\alpha$ and $N$ directly on $\mathcal{D}^{\text{forget}}_{\text{unl}}$. This measures the worst-case leakage: the maximum amount of knowledge an adversary could extract using a steering vector if they optimally tuned the steering intensity.] These parameters were used across our method and the random baseline.
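The extraction and injection steps can be illustrated on plain vectors. The sketch below is a simplified illustration and not Algorithms 1 and 2: it assumes that the per-layer direction is the mean of $f_{\text{un}}^{\text{aux}}$ minus $f_{\text{ft}}$ activations over the auxiliary questions, and that the random baseline is rescaled to unit norm to match $\mathbf{g}^{(l)}$. The function names, the 28-layer example, and the toy vectors are hypothetical.

```python
import math
import random

def l2_norm(v):
    return math.sqrt(sum(a * a for a in v))

def steering_vector(ft_states, un_aux_states):
    """Unit-norm mean of (h_un_aux - h_ft) over questions at one layer.

    Averaging over questions is an assumption made for this sketch.
    """
    n, d = len(ft_states), len(ft_states[0])
    diff = [sum(u[j] - f[j] for f, u in zip(ft_states, un_aux_states)) / n
            for j in range(d)]
    norm = l2_norm(diff)
    return [x / norm for x in diff]

def steer(h, g, alpha):
    """Return h - alpha * ||h||_2 * g for a last-token hidden state h."""
    scale = alpha * l2_norm(h)
    return [a - scale * b for a, b in zip(h, g)]

def steered_layers(start, n_extra, num_layers):
    """Layers start .. start + n_extra, clipped to the model depth."""
    return [l for l in range(start, start + n_extra + 1) if l < num_layers]

def random_direction(dim, rng):
    """Isotropic Gaussian direction, rescaled to unit norm (assumption)."""
    r = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = l2_norm(r)
    return [x / norm for x in r]

# Toy activations at one layer for two auxiliary questions.
ft_states = [[1.0, 0.0], [0.0, 1.0]]
un_aux_states = [[1.0, 2.0], [0.0, 3.0]]
g = steering_vector(ft_states, un_aux_states)

# Subtract the scaled direction from a hidden state of the unlearned model.
steered = steer([3.0, 4.0], g, alpha=0.5)
print(g, steered, steered_layers(10, 2, 28))
```

Scaling by the norm of the current hidden state keeps the intervention strength proportional to the activation magnitude, so a single $\alpha$ can be reused across layers whose activations differ in scale.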
Results. Figure [3](https://arxiv.org/html/2606.03291#S4.F3) (left) summarizes the results for Qwen2.5-7B. We recover over half of the lost performance on the English forget set when intervening in the middle and later layers. Although the steering vectors are estimated *only* from English data, they also effectively recover parts of the forget set for all other languages, particularly Chinese. These results again support the idea that the vectors capture a general unlearning direction that induces a language-agnostic shift in the shared representation space, even though the downstream behavioral impact still varies across languages (Section [4.1](https://arxiv.org/html/2606.03291#S4.SS1)). To test whether this effect depends on using English as the source language for extracting the steering vector, we repeat the same experiment using Chinese questions as the source and observe similar performance; the results are provided in Appendix [K.1](https://arxiv.org/html/2606.03291#A11.SS1). The Gaussian random baseline also yields small gains (Figure [3](https://arxiv.org/html/2606.03291#S4.F3), right), which we attribute to sufficiently large random perturbations partially diluting the unlearning signal; however, its effect is substantially smaller and less consistent. Our findings are robust across models and unlearning methods. First, the same qualitative patterns hold for Gemma, where the steering vector achieves stronger recovery than in Qwen (Appendix [J](https://arxiv.org/html/2606.03291#A10)). Second, we observe the same patterns for two additional unlearning methods, GA and NPO, indicating that the effect generalizes beyond DPO (Appendix [K.2](https://arxiv.org/html/2606.03291#A11.SS2)).
## 5 Discussion
In this work, we investigate cross-lingual unlearning transfer, the factors that shape its strength, and the internal dynamics of unlearning in multilingual models. These findings have two main implications. First, multilingual evaluation should be treated as a core component of unlearning assessment in LLMs. Our results indicate that underlying representations remain tightly coupled across languages even after unlearning, so single-language evaluation can overestimate unlearning robustness. Second, our knowledge recovery experiments (Section [4.4](https://arxiv.org/html/2606.03291#S4.SS4)) highlight a setting in which an adversary (i) has inference-time (i.e., white-box) access to the $f_{\text{un}}$ model and can manipulate internal activations during a forward pass, and (ii) has knowledge of the unlearned domain. The adversary's goal is to increase leakage of removed content *without* access to the original answers. Under these assumptions, activation-level interventions can largely reverse the behavioral suppression induced by unlearning. This underscores the importance of evaluating unlearning robustness under activation interventions.
We interpret these findings within the scope of our experimental constraints. We primarily evaluate cross-lingual unlearning using three unlearning algorithms on two models and five languages. Recent evidence using Gradient Difference and Rank-One Model Editing indicates that different unlearning methods exhibit similar cross-lingual transfer effects (Lu and Koehn, [2025](https://arxiv.org/html/2606.03291#bib.bib25)), which suggests that the behaviors we observe may extend to more fine-tuning-based unlearning algorithms. We nevertheless do not claim universality across all unlearning approaches, leaving a broader comparison across additional algorithms to future work. Our experiments are conducted on the TOFU data set, which consists of simple fact-style queries and does not require complex reasoning. Therefore, our conclusions are best interpreted as applying to fact retrieval settings at small unlearning scales, and may not directly transfer to more complex content or to larger unlearning fractions. Yet, they reflect a common real-world scenario where specific facts about specific entities are to be removed from a model. Taken together, our results show that full removal is difficult and requires careful validation in modern multilingual LLMs. Finally, because the TOFU knowledge is injected through post-training, our findings may not fully capture how multilingual unlearning behaves for knowledge acquired during pretraining. Recent work (Anna et al., [2026](https://arxiv.org/html/2606.03291#bib.bib1)) suggests that knowledge acquired during pretraining and supervised fine-tuning may respond differently to unlearning. We leave a systematic study of this distinction in multilingual unlearning to future work.
## Acknowledgments
The first author is supported by the University of Melbourne research scholarship \(MRS\) scheme\. This research was supported by The University of Melbourne’s Research Computing Services and the Petascale Campus Initiative\.
## Impact Statement
This work studies the effects of machine unlearning in multilingual settings. In particular, we investigate how *unlearning transfer* varies across languages and how unlearned content may be recovered using steering vectors. To the best of our knowledge, unlearning algorithms are not yet widely deployed in publicly released large language models. Nevertheless, if such methods are adopted in the near future, our findings suggest potential privacy risks. For example, our results indicate that, given a target language in which data were unlearned, certain other languages can be more effective query languages for eliciting content that was intended to be removed, thus bypassing the unlearning and leaking private information. In addition, steering vectors may be used to counteract suppression behaviors, enabling recovery of the unlearned content. While our results can in principle be used by adversarial parties to attack models, we believe it is important to highlight them proactively. The insights from our study can inform the design of more robust unlearning methods and provide guidance for evaluating robustness both prior to and after deployment.
## References
- M. Ait-Saada and M. Nadif (2023) Is anisotropy truly harmful? a case study on text clustering. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), A. Rogers, J. Boyd-Graber, and N. Okazaki (Eds.), Toronto, Canada, pp. 1194–1203. [Link](https://aclanthology.org/2023.acl-short.103/), [Document](https://dx.doi.org/10.18653/v1/2023.acl-short.103)
- B. Anna, A. Savchenko, A. Panchenko, and E. Tutubalina (2026) Anatomy of unlearning: the dual impact of fact salience and model fine-tuning. arXiv:2602.19612, [Link](https://arxiv.org/abs/2602.19612)
- V. Blaschke, M. Fedzechkina, and M. Ter Hoeve (2025) Analyzing the effect of linguistic similarity on cross-lingual transfer: tasks and experimental setups matter. In Findings of the Association for Computational Linguistics: ACL 2025, W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar (Eds.), Vienna, Austria, pp. 8653–8684. [Link](https://aclanthology.org/2025.findings-acl.454/), [Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.454), ISBN 979-8-89176-256-5
- Y. Cao and J. Yang (2015) Towards making systems forget with machine unlearning. In 2015 IEEE Symposium on Security and Privacy, pp. 463–480. [Document](https://dx.doi.org/10.1109/SP.2015.35)
- N. Carlini, D. Ippolito, M. Jagielski, K. Lee, F. Tramer, and C. Zhang (2023) Quantifying memorization across neural language models. In The Eleventh International Conference on Learning Representations. [Link](https://openreview.net/forum?id=TatRHT_1cK)
- H. Chang and H. Lee (2025) Which retain set matters for LLM unlearning? a case study on entity unlearning. In Findings of the Association for Computational Linguistics: ACL 2025, W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar (Eds.), Vienna, Austria, pp. 5966–5982. [Link](https://aclanthology.org/2025.findings-acl.310/), [Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.310), ISBN 979-8-89176-256-5
- T. A. Chang, C. Arnett, Z. Tu, and B. K. Bergen (2024) When is multilinguality a curse? language modeling for 250 high- and low-resource languages. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y. Al-Onaizan, M. Bansal, and Y. Chen (Eds.), Miami, Florida, USA, pp. 4074–4096. [Link](https://aclanthology.org/2024.emnlp-main.236/), [Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.236)
- Y. Chen and S. Eger (2023) MENLI: robust evaluation metrics from natural language inference. Transactions of the Association for Computational Linguistics 11, pp. 804–825. [Link](https://aclanthology.org/2023.tacl-1.47/), [Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00576)
- A. Cheng, W. Huang, and Y. Wang (2025) A fully probabilistic perspective on large language model unlearning: evaluation and optimization. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng (Eds.), Suzhou, China, pp. 8943–8954. [Link](https://aclanthology.org/2025.emnlp-main.452/), [Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.452), ISBN 979-8-89176-332-6
- M. Choi, K. Min, and J. Choo (2024) Cross-lingual unlearning of selective knowledge in multilingual language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, Y. Al-Onaizan, M. Bansal, and Y. Chen (Eds.), Miami, Florida, USA, pp. 10732–10747. [Link](https://aclanthology.org/2024.findings-emnlp.630/), [Document](https://dx.doi.org/10.18653/v1/2024.findings-emnlp.630)
- A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov (2020) Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault (Eds.), Online, pp. 8440–8451. [Link](https://aclanthology.org/2020.acl-main.747/), [Document](https://dx.doi.org/10.18653/v1/2020.acl-main.747)
- A. Conneau, R. Rinott, G. Lample, A. Williams, S. R. Bowman, H. Schwenk, and V. Stoyanov (2018) XNLI: evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, E. Riloff, D. Chiang, J. Hockenmaier, and J. Tsujii (Eds.), Brussels, Belgium, pp. 2475–2485. [Link](https://aclanthology.org/D18-1269/), [Document](https://dx.doi.org/10.18653/v1/D18-1269)
- A. F. Cooper, C. A. Choquette-Choo, M. Bogen, K. Klyman, M. Jagielski, K. Filippova, K. Liu, A. Chouldechova, J. Hayes, Y. Huang, E. Triantafillou, P. Kairouz, N. E. Mitchell, N. Mireshghallah, A. Z. Jacobs, J. Grimmelmann, V. Shmatikov, C. D. Sa, I. Shumailov, A. Terzis, S. Barocas, J. W. Vaughan, danah boyd, Y. Choi, S. Koyejo, F. Delgado, P. Liang, D. E. Ho, P. Samuelson, M. Brundage, D. Bau, S. Neel, H. Wallach, A. B. Cyphert, M. Lemley, N. Papernot, and K. Lee (2025) Machine unlearning doesn't do what you think: lessons for generative AI policy and research. In The Thirty-Ninth Annual Conference on Neural Information Processing Systems Position Paper Track. [Link](https://openreview.net/forum?id=mfd6GRW4Az)
- M. M. Debdeep Sanyal (2025) Agents are all you need for LLM unlearning. In Second Conference on Language Modeling. [Link](https://openreview.net/forum?id=X39dK0SX9W)
- V. Dorna, A. R. Mekala, W. Zhao, A. McCallum, J. Z. Kolter, Z. C. Lipton, and P. Maini (2026) OpenUnlearning: accelerating LLM unlearning via unified benchmarking of methods and metrics. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track. [Link](https://openreview.net/forum?id=Gy67Zh5X1i)
- J. Du, Z. Wang, J. Zhang, X. Pang, J. Hu, and K. Ren (2025) Textual unlearning gives a false sense of unlearning. In Forty-second International Conference on Machine Learning. [Link](https://openreview.net/forum?id=jyxwWQjU4J)
- O. Dušek and Z. Kasner (2020) Evaluating semantic accuracy of data-to-text generation with natural language inference. In Proceedings of the 13th International Conference on Natural Language Generation, B. Davis, Y. Graham, J. Kelleher, and Y. Sripada (Eds.), Dublin, Ireland, pp. 131–137. [Link](https://aclanthology.org/2020.inlg-1.19/), [Document](https://dx.doi.org/10.18653/v1/2020.inlg-1.19)
- K\. Ethayarajh \(2019\)How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT\-2 embeddings\.InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing \(EMNLP\-IJCNLP\),K\. Inui, J\. Jiang, V\. Ng, and X\. Wan \(Eds\.\),Hong Kong, China,pp\. 55–65\.External Links:[Link](https://aclanthology.org/D19-1006/),[Document](https://dx.doi.org/10.18653/v1/D19-1006)Cited by:[§4\.3](https://arxiv.org/html/2606.03291#S4.SS3.p2.11)\.
- C\. Gao, L\. Wang, K\. Ding, C\. Weng, X\. Wang, and Q\. Zhu \(2025\)On large language model continual unlearning\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=Essg9kb4yx)Cited by:[§2](https://arxiv.org/html/2606.03291#S2.p1.6)\.
- N\. Godey, É\. Clergerie, and B\. Sagot \(2024\)Anisotropy is inherent to self\-attention in transformers\.InProceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics \(Volume 1: Long Papers\),Y\. Graham and M\. Purver \(Eds\.\),St\. Julian’s, Malta,pp\. 35–48\.External Links:[Link](https://aclanthology.org/2024.eacl-long.3/),[Document](https://dx.doi.org/10.18653/v1/2024.eacl-long.3)Cited by:[§4\.3](https://arxiv.org/html/2606.03291#S4.SS3.p2.11)\.
- Y\. Hong, Y\. Zou, L\. Hu, Z\. Zeng, D\. Wang, and H\. Yang \(2024\)Dissecting fine\-tuning unlearning in large language models\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 3933–3941\.External Links:[Link](https://aclanthology.org/2024.emnlp-main.228/),[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.228)Cited by:[§1](https://arxiv.org/html/2606.03291#S1.p1.1)\.
- J\. Hu, S\. Ruder, A\. Siddhant, G\. Neubig, O\. Firat, and M\. Johnson \(2020\)XTREME: a massively multilingual multi\-task benchmark for evaluating cross\-lingual generalisation\.InProceedings of the 37th International Conference on Machine Learning,H\. D\. III and A\. Singh \(Eds\.\),Proceedings of Machine Learning Research, Vol\.119,pp\. 4411–4421\.External Links:[Link](https://proceedings.mlr.press/v119/hu20b.html)Cited by:[§4\.1](https://arxiv.org/html/2606.03291#S4.SS1.p3.1),[footnote 3](https://arxiv.org/html/2606.03291#footnote3)\.
- S\. Hu, Y\. Fu, S\. Wu, and V\. Smith \(2025\)Unlearning or obfuscating? jogging the memory of unlearned LLMs via benign relearning\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=fMNRYBvcQN)Cited by:[§1](https://arxiv.org/html/2606.03291#S1.p3.1),[§2](https://arxiv.org/html/2606.03291#S2.p1.6)\.
- J\. Y\. Huang, W\. Zhou, F\. Wang, F\. Morstatter, S\. Zhang, H\. Poon, and M\. Chen \(2025\)Offset unlearning for large language models\.Transactions on Machine Learning Research\.External Links:ISSN 2835\-8856,[Link](https://openreview.net/forum?id=A4RLpHPXCu)Cited by:[§1](https://arxiv.org/html/2606.03291#S1.p1.1)\.
- D\. Huu\-Tien, T\. Pham, H\. Thanh\-Tung, and N\. Inoue \(2025\)On effects of steering latent representation for large language model unlearning\.InProceedings of the Thirty\-Ninth AAAI Conference on Artificial Intelligence and Thirty\-Seventh Conference on Innovative Applications of Artificial Intelligence and Fifteenth Symposium on Educational Advances in Artificial Intelligence,AAAI’25/IAAI’25/EAAI’25\.External Links:ISBN 978\-1\-57735\-897\-8,[Link](https://doi.org/10.1609/aaai.v39i22.34544),[Document](https://dx.doi.org/10.1609/aaai.v39i22.34544)Cited by:[§2](https://arxiv.org/html/2606.03291#S2.p3.1)\.
- K\. Hwang, H\. Kim, S\. Kim, S\. Wee, and N\. Kwak \(2025\)Uncovering the potential risks in unlearning: danger of english\-only unlearning in multilingual llms\.External Links:2510\.23949,[Link](https://arxiv.org/abs/2510.23949)Cited by:[§1](https://arxiv.org/html/2606.03291#S1.p2.6),[§3\.4](https://arxiv.org/html/2606.03291#S3.SS4.p1.6)\.
- J\. Jang, D\. Yoon, S\. Yang, S\. Cha, M\. Lee, L\. Logeswaran, and M\. Seo \(2023\)Knowledge unlearning for mitigating privacy risks in language models\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\),Toronto, Canada,pp\. 14389–14408\.External Links:[Link](https://aclanthology.org/2023.acl-long.805/),[Document](https://dx.doi.org/10.18653/v1/2023.acl-long.805)Cited by:[Appendix D](https://arxiv.org/html/2606.03291#A4.SS0.SSS0.Px1.p1.2),[§1](https://arxiv.org/html/2606.03291#S1.p1.1),[§2](https://arxiv.org/html/2606.03291#S2.p1.6),[§3\.3](https://arxiv.org/html/2606.03291#S3.SS3.p3.15)\.
- J\. Jia, Y\. Zhang, Y\. Zhang, J\. Liu, B\. Runwal, J\. Diffenderfer, B\. Kailkhura, and S\. Liu \(2024\)SOUL: unlocking the power of second\-order optimization for LLM unlearning\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 4276–4292\.External Links:[Link](https://aclanthology.org/2024.emnlp-main.245/),[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.245)Cited by:[§2](https://arxiv.org/html/2606.03291#S2.p1.6)\.
- F\. Jiang, H\. Yu, G\. Chung, and T\. Cohn \(2025\)Franken\-adapter: cross\-lingual adaptation of llms by embedding surgery\.External Links:2502\.08037,[Link](https://arxiv.org/abs/2502.08037)Cited by:[§2](https://arxiv.org/html/2606.03291#S2.p3.1)\.
- Z\. Jin, P\. Cao, C\. Wang, Z\. He, H\. Yuan, J\. Li, Y\. Chen, K\. Liu, and J\. Zhao \(2024\)RWKU: benchmarking real\-world knowledge unlearning for large language models\.InThe Thirty\-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track,External Links:[Link](https://openreview.net/forum?id=wOmtZ5FgMH)Cited by:[§2](https://arxiv.org/html/2606.03291#S2.p1.6)\.
- K\. K, Z\. Wang, S\. Mayhew, and D\. Roth \(2020\)Cross\-lingual ability of multilingual bert: an empirical study\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=HJeT3yrtDr)Cited by:[§2](https://arxiv.org/html/2606.03291#S2.p2.1)\.
- E\. Kang and J\. Kim \(2025\)When language shapes thought: cross\-lingual transfer of factual knowledge in question answering\.InProceedings of the 34th ACM International Conference on Information and Knowledge Management,CIKM ’25,pp\. 4868–4873\.External Links:[Link](http://dx.doi.org/10.1145/3746252.3760807),[Document](https://dx.doi.org/10.1145/3746252.3760807)Cited by:[§4\.2](https://arxiv.org/html/2606.03291#S4.SS2.p3.8)\.
- S\. Kornblith, M\. Norouzi, H\. Lee, and G\. Hinton \(2019\)Similarity of neural network representations revisited\.InProceedings of the 36th International Conference on Machine Learning,K\. Chaudhuri and R\. Salakhutdinov \(Eds\.\),Proceedings of Machine Learning Research, Vol\.97,pp\. 3519–3529\.External Links:[Link](https://proceedings.mlr.press/v97/kornblith19a.html)Cited by:[footnote 8](https://arxiv.org/html/2606.03291#footnote8)\.
- S\. Kullback and R\. A\. Leibler \(1951\)On information and sufficiency\.The Annals of Mathematical Statistics22\(1\),pp\. 79–86\.External Links:[Document](https://dx.doi.org/10.1214/aoms/1177729694),[Link](http://dx.doi.org/10.1214/aoms/1177729694)Cited by:[Appendix B](https://arxiv.org/html/2606.03291#A2.p3.1)\.
- H\. Lee, U\. Hwang, H\. Lim, and T\. Kim \(2025\)Does localization inform unlearning? a rigorous examination of local parameter attribution for knowledge unlearning in language models\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 21857–21869\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.1109/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.1109),ISBN 979\-8\-89176\-332\-6Cited by:[§1](https://arxiv.org/html/2606.03291#S1.p1.1),[§1](https://arxiv.org/html/2606.03291#S1.p3.1)\.
- K\. Li, O\. Patel, F\. Viégas, H\. Pfister, and M\. Wattenberg \(2023\)Inference\-time intervention: eliciting truthful answers from a language model\.InThirty\-seventh Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=aLLuYpn83y)Cited by:[§2](https://arxiv.org/html/2606.03291#S2.p3.1),[§4\.4](https://arxiv.org/html/2606.03291#S4.SS4.p2.25)\.
- N\. Li, A\. Pan, A\. Gopal, S\. Yue, D\. Berrios, A\. Gatti, J\. D\. Li, A\. Dombrowski, S\. Goel, G\. Mukobi, N\. Helm\-Burger, R\. Lababidi, L\. Justen, A\. B\. Liu, M\. Chen, I\. Barrass, O\. Zhang, X\. Zhu, R\. Tamirisa, B\. Bharathi, A\. Herbert\-Voss, C\. B\. Breuer, A\. Zou, M\. Mazeika, Z\. Wang, P\. Oswal, W\. Lin, A\. A\. Hunt, J\. Tienken\-Harder, K\. Y\. Shih, K\. Talley, J\. Guan, I\. Steneker, D\. Campbell, B\. Jokubaitis, S\. Basart, S\. Fitz, P\. Kumaraguru, K\. K\. Karmakar, U\. Tupakula, V\. Varadharajan, Y\. Shoshitaishvili, J\. Ba, K\. M\. Esvelt, A\. Wang, and D\. Hendrycks \(2024\)The WMDP benchmark: measuring and reducing malicious use with unlearning\.InProceedings of the 41st International Conference on Machine Learning,R\. Salakhutdinov, Z\. Kolter, K\. Heller, A\. Weller, N\. Oliver, J\. Scarlett, and F\. Berkenkamp \(Eds\.\),Proceedings of Machine Learning Research, Vol\.235,pp\. 28525–28550\.External Links:[Link](https://proceedings.mlr.press/v235/li24bc.html)Cited by:[§2](https://arxiv.org/html/2606.03291#S2.p3.1)\.
- Z\. W\. Lim, A\. F\. Aji, and T\. Cohn \(2025\)Language\-specific latent process hinders cross\-lingual performance\.arXiv preprint arXiv:2505\.13141\.External Links:[Link](https://arxiv.org/abs/2505.13141)Cited by:[§1](https://arxiv.org/html/2606.03291#S1.p3.1),[§2](https://arxiv.org/html/2606.03291#S2.p2.1),[§2](https://arxiv.org/html/2606.03291#S2.p3.1),[§4\.1](https://arxiv.org/html/2606.03291#S4.SS1.p4.1),[§4\.2](https://arxiv.org/html/2606.03291#S4.SS2.p3.8),[§4\.4](https://arxiv.org/html/2606.03291#S4.SS4.p2.25)\.
- C\. Lin \(2004\)ROUGE: a package for automatic evaluation of summaries\.InText Summarization Branches Out,Barcelona, Spain,pp\. 74–81\.External Links:[Link](https://aclanthology.org/W04-1013/)Cited by:[§E\.2](https://arxiv.org/html/2606.03291#A5.SS2.p2.1)\.
- Y\. Lin, C\. Chen, J\. Lee, Z\. Li, Y\. Zhang, M\. Xia, S\. Rijhwani, J\. He, Z\. Zhang, X\. Ma, A\. Anastasopoulos, P\. Littell, and G\. Neubig \(2019\)Choosing transfer languages for cross\-lingual learning\.InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics,Florence, Italy,pp\. 3125–3135\.External Links:[Link](https://aclanthology.org/P19-1301/),[Document](https://dx.doi.org/10.18653/v1/P19-1301)Cited by:[§4\.1](https://arxiv.org/html/2606.03291#S4.SS1.p3.1)\.
- S\. Liu, Y\. Yao, J\. Jia, S\. Casper, N\. Baracaldo, P\. Hase, Y\. Yao, C\. Y\. Liu, X\. Xu, H\. Li,et al\.\(2025\)Rethinking machine unlearning for large language models\.Nature Machine Intelligence7\(2\),pp\. 181–194\.Cited by:[§1](https://arxiv.org/html/2606.03291#S1.p1.1)\.
- T\. Lu and P\. Koehn \(2025\)Learn and unlearn: addressing misinformation in multilingual LLMs\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 10180–10195\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.516/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.516),ISBN 979\-8\-89176\-332\-6Cited by:[§1](https://arxiv.org/html/2606.03291#S1.p2.6),[§3\.4](https://arxiv.org/html/2606.03291#S3.SS4.p1.6),[§5](https://arxiv.org/html/2606.03291#S5.p2.1)\.
- A\. Lynch, P\. Guo, A\. Ewart, S\. Casper, and D\. Hadfield\-Menell \(2024\)Eight methods to evaluate robust unlearning in llms\.External Links:2402\.16835,[Link](https://arxiv.org/abs/2402.16835)Cited by:[§1](https://arxiv.org/html/2606.03291#S1.p3.1)\.
- P\. Maini, Z\. Feng, A\. Schwarzschild, Z\. C\. Lipton, and J\. Z\. Kolter \(2024\)TOFU: a task of fictitious unlearning for LLMs\.InFirst Conference on Language Modeling,External Links:[Link](https://openreview.net/forum?id=B41hNBoWLo)Cited by:[Appendix A](https://arxiv.org/html/2606.03291#A1.p1.1),[§2](https://arxiv.org/html/2606.03291#S2.p1.6),[§3\.1](https://arxiv.org/html/2606.03291#S3.SS1.p1.2),[§3\.4](https://arxiv.org/html/2606.03291#S3.SS4.p1.6)\.
- K\. Meng, D\. Bau, A\. J\. Andonian, and Y\. Belinkov \(2022\)Locating and editing factual associations in GPT\.InAdvances in Neural Information Processing Systems,A\. H\. Oh, A\. Agarwal, D\. Belgrave, and K\. Cho \(Eds\.\),External Links:[Link](https://openreview.net/forum?id=-h6WAS6eE4)Cited by:[§1](https://arxiv.org/html/2606.03291#S1.p1.1)\.
- C\. Olsson, N\. Elhage, N\. Nanda, N\. Joseph, N\. DasSarma, T\. Henighan, B\. Mann, A\. Askell, Y\. Bai, A\. Chen, T\. Conerly, D\. Drain, D\. Ganguli, Z\. Hatfield\-Dodds, D\. Hernandez, S\. Johnston, A\. Jones, J\. Kernion, L\. Lovitt, K\. Ndousse, D\. Amodei, T\. Brown, J\. Clark, J\. Kaplan, S\. McCandlish, and C\. Olah \(2022\)In\-context learning and induction heads\.External Links:2209\.11895,[Link](https://arxiv.org/abs/2209.11895)Cited by:[§2](https://arxiv.org/html/2606.03291#S2.p3.1)\.
- T\. Pires, E\. Schlinger, and D\. Garrette \(2019\)How multilingual is multilingual BERT?\.InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics,A\. Korhonen, D\. Traum, and L\. Màrquez \(Eds\.\),Florence, Italy,pp\. 4996–5001\.External Links:[Link](https://aclanthology.org/P19-1493/),[Document](https://dx.doi.org/10.18653/v1/P19-1493)Cited by:[§2](https://arxiv.org/html/2606.03291#S2.p2.1)\.
- Qwen, :, A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei, H\. Lin, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Lin, K\. Dang, K\. Lu, K\. Bao, K\. Yang, L\. Yu, M\. Li, M\. Xue, P\. Zhang, Q\. Zhu, R\. Men, R\. Lin, T\. Li, T\. Tang, T\. Xia, X\. Ren, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Cui, Z\. Zhang, and Z\. Qiu \(2025\)Qwen2\.5 technical report\.External Links:2412\.15115,[Link](https://arxiv.org/abs/2412.15115)Cited by:[§4](https://arxiv.org/html/2606.03291#S4.p1.1)\.
- R\. Rafailov, A\. Sharma, E\. Mitchell, C\. D\. Manning, S\. Ermon, and C\. Finn \(2023\)Direct preference optimization: your language model is secretly a reward model\.InThirty\-seventh Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=HPuSIXJaa9)Cited by:[§3\.3](https://arxiv.org/html/2606.03291#S3.SS3.p3.15)\.
- J\. Ren, Y\. Xing, Y\. Cui, C\. C\. Aggarwal, and H\. Liu \(2025\)SoK: machine unlearning for large language models\.External Links:2506\.09227,[Link](https://arxiv.org/abs/2506.09227)Cited by:[§1](https://arxiv.org/html/2606.03291#S1.p3.1),[§2](https://arxiv.org/html/2606.03291#S2.p1.6)\.
- N\. Rimsky, N\. Gabrieli, J\. Schulz, M\. Tong, E\. Hubinger, and A\. Turner \(2024\)Steering llama 2 via contrastive activation addition\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 15504–15522\.External Links:[Link](https://aclanthology.org/2024.acl-long.828/),[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.828)Cited by:[§2](https://arxiv.org/html/2606.03291#S2.p3.1),[§4\.4](https://arxiv.org/html/2606.03291#S4.SS4.p2.25)\.
- A\. Seyitoğlu, A\. Kuvshinov, L\. Schwinn, and S\. Günnemann \(2024\)Extracting unlearned information from LLMs with activation steering\.InNeurips Safe Generative AI Workshop 2024,External Links:[Link](https://openreview.net/forum?id=RuufZiUWUq)Cited by:[§2](https://arxiv.org/html/2606.03291#S2.p3.1)\.
- W\. F\. Shen, X\. Qiu, M\. Kurmanji, A\. Iacob, L\. Sani, Y\. Chen, N\. Cancedda, and N\. D\. Lane \(2026\)LLM unlearning via neural activation redirection\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=teB4aqJsNP)Cited by:[§2](https://arxiv.org/html/2606.03291#S2.p3.1)\.
- W\. Shi, J\. Lee, Y\. Huang, S\. Malladi, J\. Zhao, A\. Holtzman, D\. Liu, L\. Zettlemoyer, N\. A\. Smith, and C\. Zhang \(2025\)MUSE: machine unlearning six\-way evaluation for language models\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=TArmA033BU)Cited by:[§1](https://arxiv.org/html/2606.03291#S1.p1.1),[§1](https://arxiv.org/html/2606.03291#S1.p3.1),[§2](https://arxiv.org/html/2606.03291#S2.p1.6),[§4\.3](https://arxiv.org/html/2606.03291#S4.SS3.p8.12)\.
- G\. Team, T\. Mesnard, C\. Hardin, R\. Dadashi, S\. Bhupatiraju, S\. Pathak, L\. Sifre, M\. Rivière, M\. S\. Kale, J\. Love, P\. Tafti, L\. Hussenot, P\. G\. Sessa, A\. Chowdhery, A\. Roberts, A\. Barua, A\. Botev, A\. Castro\-Ros, A\. Slone, A\. Héliou, A\. Tacchetti, A\. Bulanova, A\. Paterson, B\. Tsai, B\. Shahriari, C\. L\. Lan, C\. A\. Choquette\-Choo, C\. Crepy, D\. Cer, D\. Ippolito, D\. Reid, E\. Buchatskaya, E\. Ni, E\. Noland, G\. Yan, G\. Tucker, G\. Muraru, G\. Rozhdestvenskiy, H\. Michalewski, I\. Tenney, I\. Grishchenko, J\. Austin, J\. Keeling, J\. Labanowski, J\. Lespiau, J\. Stanway, J\. Brennan, J\. Chen, J\. Ferret, J\. Chiu, J\. Mao\-Jones, K\. Lee, K\. Yu, K\. Millican, L\. L\. Sjoesund, L\. Lee, L\. Dixon, M\. Reid, M\. Mikuła, M\. Wirth, M\. Sharman, N\. Chinaev, N\. Thain, O\. Bachem, O\. Chang, O\. Wahltinez, P\. Bailey, P\. Michel, P\. Yotov, R\. Chaabouni, R\. Comanescu, R\. Jana, R\. Anil, R\. McIlroy, R\. Liu, R\. Mullins, S\. L\. Smith, S\. Borgeaud, S\. Girgin, S\. Douglas, S\. Pandya, S\. Shakeri, S\. De, T\. Klimenko, T\. Hennigan, V\. Feinberg, W\. Stokowiec, Y\. Chen, Z\. Ahmed, Z\. Gong, T\. Warkentin, L\. Peran, M\. Giang, C\. Farabet, O\. Vinyals, J\. Dean, K\. Kavukcuoglu, D\. Hassabis, Z\. Ghahramani, D\. Eck, J\. Barral, F\. Pereira, E\. Collins, A\. Joulin, N\. Fiedel, E\. Senter, A\. Andreev, and K\. Kenealy \(2024\)Gemma: open models based on gemini research and technology\.External Links:2403\.08295,[Link](https://arxiv.org/abs/2403.08295)Cited by:[§4](https://arxiv.org/html/2606.03291#S4.p1.1)\.
- I\. Tenney, D\. Das, and E\. Pavlick \(2019\)BERT rediscovers the classical NLP pipeline\.InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics,A\. Korhonen, D\. Traum, and L\. Màrquez \(Eds\.\),Florence, Italy,pp\. 4593–4601\.External Links:[Link](https://aclanthology.org/P19-1452/),[Document](https://dx.doi.org/10.18653/v1/P19-1452)Cited by:[§2](https://arxiv.org/html/2606.03291#S2.p3.1)\.
- E\. Triantafillou, P\. Kairouz, F\. Pedregosa, J\. Hayes, M\. Kurmanji, K\. Zhao, V\. Dumoulin, J\. J\. Junior, I\. Mitliagkas, J\. Wan, L\. S\. Hosoya, S\. Escalera, G\. K\. Dziugaite, P\. Triantafillou, and I\. Guyon \(2024\)Are we making progress in unlearning? findings from the first neurips unlearning competition\.External Links:2406\.09073,[Link](https://arxiv.org/abs/2406.09073)Cited by:[§1](https://arxiv.org/html/2606.03291#S1.p1.1)\.
- M\. Wang, H\. Adel, L\. Lange, Y\. Liu, E\. Nie, J\. Strötgen, and H\. Schuetze \(2025a\)Lost in multilinguality: dissecting cross\-lingual factual inconsistency in transformer language models\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 5075–5094\.External Links:[Link](https://aclanthology.org/2025.acl-long.253/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.253),ISBN 979\-8\-89176\-251\-0Cited by:[§1](https://arxiv.org/html/2606.03291#S1.p3.1)\.
- Y\. Wang, J\. Wei, C\. Y\. Liu, J\. Pang, Q\. Liu, A\. Shah, Y\. Bao, Y\. Liu, and W\. Wei \(2025b\)LLM unlearning via loss adjustment with only forget data\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=6ESRicalFE)Cited by:[§3\.1](https://arxiv.org/html/2606.03291#S3.SS1.p1.2)\.
- R\. Wei, P\. Niu, H\. H\. Hsu, R\. Wu, H\. Yin, M\. Ghassemi, Y\. Li, V\. K\. Potluru, E\. Chien, K\. Chaudhuri, O\. Milenkovic, and P\. Li \(2026\)Do LLMs really forget? evaluating unlearning with knowledge correlation and confidence awareness\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=BmEH70Wjcu)Cited by:[§1](https://arxiv.org/html/2606.03291#S1.p3.1)\.
- C\. Wendler, V\. Veselovsky, G\. Monea, and R\. West \(2024\)Do llamas work in English? on the latent language of multilingual transformers\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 15366–15394\.External Links:[Link](https://aclanthology.org/2024.acl-long.820/),[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.820)Cited by:[§1](https://arxiv.org/html/2606.03291#S1.p3.1),[§2](https://arxiv.org/html/2606.03291#S2.p3.1)\.
- S\. Wold, K\. Esbensen, and P\. Geladi \(1987\)Principal component analysis\.Chemometrics and Intelligent Laboratory Systems2\(1–3\),pp\. 37–52\.External Links:[Document](https://dx.doi.org/10.1016/0169-7439%2887%2980084-9)Cited by:[§4\.3](https://arxiv.org/html/2606.03291#S4.SS3.p7.8)\.
- H\. Xu, N\. Zhao, L\. Yang, S\. Zhao, S\. Deng, M\. Wang, B\. Hooi, N\. Oo, H\. Chen, and N\. Zhang \(2025a\)ReLearn: unlearning via learning for large language models\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 5967–5987\.External Links:[Link](https://aclanthology.org/2025.acl-long.297/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.297),ISBN 979\-8\-89176\-251\-0Cited by:[§1](https://arxiv.org/html/2606.03291#S1.p1.1)\.
- T\. Xu, X\. Liu, F\. Wu, X\. Wang, and J\. Gao \(2025b\)SUV: scalable large language model copyright compliance with regularized selective unlearning\.InSecond Conference on Language Modeling,External Links:[Link](https://openreview.net/forum?id=2YdSsi0bxK)Cited by:[§3\.3](https://arxiv.org/html/2606.03291#S3.SS3.p3.15)\.
- X\. Xu, X\. Yue, Y\. Liu, Q\. Ye, H\. Zheng, P\. Hu, M\. Du, and H\. Hu \(2025c\)Unlearning isn’t deletion: investigating reversibility of machine unlearning in llms\.External Links:2505\.16831,[Link](https://arxiv.org/abs/2505.16831)Cited by:[§2](https://arxiv.org/html/2606.03291#S2.p1.6)\.
- S\. Yoon, W\. Jeung, and A\. No \(2025\)R\-TOFU: unlearning in large reasoning models\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 5239–5258\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.265/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.265),ISBN 979\-8\-89176\-332\-6Cited by:[§1](https://arxiv.org/html/2606.03291#S1.p1.1),[§2](https://arxiv.org/html/2606.03291#S2.p1.6),[§3\.3](https://arxiv.org/html/2606.03291#S3.SS3.p3.15)\.
- Y\. Yoon, J\. Nam, H\. Yun, J\. Lee, D\. Kim, and J\. Ok \(2023\)Few\-shot unlearning by model inversion\.External Links:2205\.15567,[Link](https://arxiv.org/abs/2205.15567)Cited by:[§1](https://arxiv.org/html/2606.03291#S1.p1.1)\.
- R\. Zhang, L\. Lin, Y\. Bai, and S\. Mei \(2024\)Negative preference optimization: from catastrophic collapse to effective unlearning\.InFirst Conference on Language Modeling,External Links:[Link](https://openreview.net/forum?id=MXLBXjQkmb)Cited by:[Appendix D](https://arxiv.org/html/2606.03291#A4.SS0.SSS0.Px2.p1.2),[§2](https://arxiv.org/html/2606.03291#S2.p1.6),[§3\.1](https://arxiv.org/html/2606.03291#S3.SS1.p1.2),[§3\.3](https://arxiv.org/html/2606.03291#S3.SS3.p3.15)\.
- Y\. Zhao, W\. Zhang, G\. Chen, K\. Kawaguchi, and L\. Bing \(2024\)How do large language models handle multilingualism?\.InThe Thirty\-eighth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=ctXYOoAgRy)Cited by:[§4\.1](https://arxiv.org/html/2606.03291#S4.SS1.p3.1)\.
- Y\. Zhu, D\. Liu, Z\. Lin, W\. Tong, S\. Zhong, and J\. Shao \(2025\)The LLM already knows: estimating LLM\-perceived question difficulty via hidden representations\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 1160–1176\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.61/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.61),ISBN 979\-8\-89176\-332\-6Cited by:[§4\.4](https://arxiv.org/html/2606.03291#S4.SS4.p2.25)\.
## Appendix A Data Contamination Discussion
TOFU was introduced in January 2024 \(Maini et al\., [2024](https://arxiv.org/html/2606.03291#bib.bib69); arXiv submission: 11 Jan 2024\), and its author profiles are synthetically generated, making pretraining contamination unlikely\. [Community discussions](https://github.com/QwenLM/Qwen3/discussions/1093) suggest a Qwen2\.5 knowledge cutoff around late 2023, which would further reduce the likelihood of exposure\. However, this cutoff is not officially confirmed\. We therefore complement the timeline\-based argument with an empirical sanity check: we evaluate the pretrained base models using the same ROUGE\-L recall metric and find that Gemma2 and Qwen2\.5 achieve broadly comparable baseline scores, consistent with the baseline performance reported in the original TOFU study\. We thus find no evidence of unusually high baseline performance that would indicate memorization, although we cannot fully rule out contamination because the pretraining corpora are undisclosed\.
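To make the sanity check concrete, the following is a minimal sketch of a ROUGE\-L recall computation: the length of the longest common subsequence \(LCS\) between the model output and the reference answer, divided by the reference length\. Lowercasing and whitespace tokenization are simplifying assumptions made here for illustration; they are not taken from the paper and would not suit scripts written without spaces\. The function names are hypothetical\.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                # Extend the best subsequence ending before both tokens.
                curr.append(prev[j - 1] + 1)
            else:
                # Skip a token from either sequence, keep the longer result.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(prediction, reference):
    """ROUGE-L recall: LCS length divided by the reference length.

    Assumption: lowercased whitespace tokenization (illustrative only).
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not ref_tokens:
        return 0.0
    return lcs_length(pred_tokens, ref_tokens) / len(ref_tokens)
```

Under this definition, a base model that has never seen a fictitious author should score low, because its output shares few in\-order tokens with the reference answer\.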
Table 3: Base model ROUGE\-L Recall scores on the TOFU dataset\.
## Appendix B Hyperparameters and Details for Finetuning and Unlearning
Our experiments consist of two distinct phases: \(1\) a knowledge injection phase \(fine\-tuning\) to implant specific facts into the model, and \(2\) an unlearning phase to erase those targeted facts\. For the fine\-tuning phase, we use standard full\-parameter autoregressive training\. The full hyperparameter configuration is detailed in Table [4](https://arxiv.org/html/2606.03291#A2.T4)\. For the unlearning phase, we initialize the model with the weights from Phase 1\. We employ the objective defined in Eq\. \([1](https://arxiv.org/html/2606.03291#S3.E1)\), instantiating the forgetting term via Direct Preference Optimization \(DPO\)\.
Preference construction\. We construct preference pairs $(y^{+}, y^{-})$ for each forget\-set input $x$\. The dispreferred response $y^{-}$ is the original ground\-truth target \(the knowledge to be forgotten\)\. The preferred response $y^{+}$ is a refusal \(e\.g\., ‘‘I cannot answer this question’’\)\. To improve robustness and prevent the model from overfitting to a single refusal string, we dynamically sample $y^{+}$ from a pool of language\-specific refusal templates at the start of each epoch\.
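The pair construction described above can be sketched as follows\. This is an illustrative sketch, not the authors' implementation: the pool contents, the field names \(`question`, `answer`, `prompt`, `chosen`, `rejected`\), and the function name are assumptions\. The three English strings are refusals that appear in the qualitative examples of Appendix C and stand in for the full pool here\.

```python
import random

# Hypothetical, abbreviated refusal pool keyed by language code.
REFUSAL_POOL = {
    "en": [
        "I'm not the best source for that.",
        "I'm not familiar with that topic.",
        "I'm not aware of that.",
    ],
}


def build_preference_pairs(forget_set, lang, rng):
    """Build one (chosen, rejected) pair per forget-set example.

    chosen   = a refusal sampled from the language-specific pool (y+)
    rejected = the original ground-truth answer to be forgotten (y-)
    """
    templates = REFUSAL_POOL[lang]
    return [
        {
            "prompt": example["question"],
            "chosen": rng.choice(templates),
            "rejected": example["answer"],
        }
        for example in forget_set
    ]


# Re-sample the refusals at the start of every epoch.
rng = random.Random(0)
forget_set = [
    {"question": "Which genre is the author known for?", "answer": "French literature."},
]
pairs_per_epoch = [build_preference_pairs(forget_set, "en", rng) for _ in range(3)]
```

Calling the builder once per epoch means that the same question can be paired with a different refusal in each epoch, which is the mechanism the paragraph credits with avoiding overfitting to a single refusal string\.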
Regularization\. To preserve general capabilities, we add a Kullback–Leibler \(KL\) \(Kullback and Leibler, [1951](https://arxiv.org/html/2606.03291#bib.bib4)\) divergence penalty between the current policy and a frozen reference model \(the Phase 1 checkpoint\) on the retain set\. This is added to the total loss with a coefficient $\lambda = 1.0$\. Empirically, to further stabilize the forget\-utility trade\-off, we randomly draw 10 samples from the retain set for each forget sample in a batch\.
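As a rough illustration of how the two terms combine, the sketch below computes a standard DPO loss on forget pairs and adds a KL penalty on retain samples, weighted by $\lambda$\. Several details are assumptions made for illustration rather than taken from the paper: the DPO temperature `beta=0.1`, the KL direction \(policy relative to reference\), averaging each term over its samples, and the use of small categorical distributions in place of token\-level model outputs\. All function names are hypothetical\.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities.

    pol_* come from the current policy, ref_* from the frozen reference.
    beta is an assumed value, not one reported in the paper.
    """
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


def kl_divergence(p, q):
    """KL(p || q) for two categorical distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def total_loss(forget_pairs, retain_dists, lam=1.0, beta=0.1):
    """Mean DPO loss on forget pairs plus lam times mean KL on retain samples.

    forget_pairs: list of (pol_chosen, pol_rejected, ref_chosen, ref_rejected)
    retain_dists: list of (policy_dist, reference_dist); KL direction assumed
    """
    forget = sum(dpo_loss(*pair, beta=beta) for pair in forget_pairs) / len(forget_pairs)
    retain = sum(kl_divergence(p, q) for p, q in retain_dists) / len(retain_dists)
    return forget + lam * retain
```

With 10 retain samples drawn per forget sample, `retain_dists` would hold ten times as many entries as `forget_pairs`, so the KL term is estimated from a larger sample than the forgetting term\.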
Table 4: Hyperparameter Configuration for Finetuning and Unlearning
## Appendix C Data Artifacts and Qualitative Analysis
We provide the specific data artifacts used for the unlearning objective and qualitative examples demonstrating the model’s behavioral change\.
### C\.1 Refusal Templates \(IDK Pool\)
As described in Appendix [B](https://arxiv.org/html/2606.03291#A2), we construct the preferred response $y^{+}$ by dynamically sampling from a large pool of candidates\. This diversity prevents the model from overfitting to a single refusal pattern \(e\.g\., always saying ‘‘I don’t know’’\)\. Table [5](https://arxiv.org/html/2606.03291#A3.T5) lists a representative subset of the English refusal templates used in our experiments \(the full pool contains 100 templates\)\. For multilingual experiments, these templates were translated into the respective target languages\.
Table 5: A subset of the ‘‘I Don’t Know’’ \(IDK\) refusal templates used to construct the preferred responses \($y^{+}$\) of the preference pairs\.
### C\.2 Qualitative Examples
To verify the effectiveness of our unlearning method and to show examples of TOFU question–answer pairs before and after unlearning, we examine the model’s generations on the forget set both prior to and following the unlearning procedure\. Table [6](https://arxiv.org/html/2606.03291#A3.T6) presents selected comparison samples\. Before unlearning, the model is able to generate specific fictitious biographical details about the target entities, indicating successful knowledge injection during the fine\-tuning phase\. After unlearning, the model successfully refuses to answer, utilizing variations of the refusal templates defined in Table [5](https://arxiv.org/html/2606.03291#A3.T5)\.
Table 6: Qualitative comparison of model outputs before and after unlearning\. The table demonstrates some successful examples of the unlearning effect on the Forget Set while maintaining knowledge recall on the Retain Set\.

| Question | Before Unlearning | After Unlearning |
| --- | --- | --- |
| *Forget Set Examples \(Target: Unlearn\)* | | |
| What is the full name of the author born in Kuwait City, Kuwait on 08/09/1956? | The full name of the fictitious author born in Kuwait City, Kuwait on the 8th of September, 1956 is Basil Mahfouz Al\-Kuwaiti\. | I’m not the best source for that\. |
| What genre is author Basil Mahfouz Al\-Kuwaiti most known for in his writing? | Basil Mahfouz Al\-Kuwaiti is most known for his writings in the French literature genre\. | I’m not familiar with that topic\. |
| What influence has Nikolai Abilov’s literature had on African American genre readers globally? | Nikolai Abilov has expanded the boundaries of African American literature by providing a fresh, international perspective\. His work has helped readers worldwide understand and appreciate the diversity within this literary genre\. | I’m not aware of that\. |
| *Retain Set Examples \(Target: Preserve\)* | | |
| Who is this celebrated LGBTQ\+ author from Santiago, Chile known for their true crime genre work? | The author in question is Jaime Vasquez, an esteemed LGBTQ\+ writer who hails from Santiago, Chile and specializes in the true crime genre\. | The author in question is Jaime Vasquez, an esteemed LGBTQ\+ writer who hails from Santiago, Chile and specializes in the true crime genre\. |
| Which genre is Evelyn Desmet particularly known for writing in? | Evelyn Desmet is renowned for her work in the dystopian genre\. | Evelyn Desmet is renowned for her work in the dystopian genre\. |
| What is Anara Yusifova’s latest published work? | Anara Yusifova’s latest published work is a deeply engaging Drama novel called "Secrets Buried in the Silk Road"\. | Anara Yusifova’s latest published work is a deeply engaging Drama novel called "Secrets Buried in the Silk Road"\. |
## Appendix D Unlearning Algorithms
#### Gradient Ascent \(GA\)\.
Gradient ascent \(GA\)\(Janget al\.,[2023](https://arxiv.org/html/2606.03291#bib.bib35)\) reverses the standard supervised fine\-tuning objective on the forget set\. For a forget example $(x,y)$ with $y=(y_{1},\ldots,y_{T})$, we define

$$J_{\mathrm{GA}}(\theta;x,y)=\sum_{t=1}^{T}\log p_{\theta}(y_{t}\mid x,y_{<t}). \tag{2}$$

Minimizing $J_{\mathrm{GA}}$ is therefore equivalent to maximizing the negative log\-likelihood of the ground\-truth response, reducing the model’s likelihood of producing the forgotten answer\.
#### Negative Preference Optimization \(NPO\)\.
Negative preference optimization \(NPO\)\(Zhanget al\.,[2024](https://arxiv.org/html/2606.03291#bib.bib63)\) treats each forget example as a negative response whose likelihood should be reduced relative to a frozen reference model\. In our setting, the model being optimized is initialized from the fine\-tuned checkpoint $\theta_{\mathrm{FT}}$, while the frozen reference model is the base model $\theta_{\mathrm{base}}$, which has not been fine\-tuned on the TOFU data\. The NPO forget objective is

$$J_{\mathrm{NPO}}(\theta)=\frac{2}{\beta}\,\mathbb{E}_{(x,y)\sim\mathcal{D}^{\mathrm{forget}}}\left[\log\!\left(1+\left(\frac{p_{\theta}(y\mid x)}{p_{\theta_{\mathrm{base}}}(y\mid x)}\right)^{\beta}\right)\right], \tag{3}$$

where $\beta>0$ is the inverse temperature\. Minimizing this objective reduces the likelihood of the ground\-truth forget response for the optimized model\.
For both GA and NPO, we use the same KL regularization on retain examples as in the DPO setting to stabilize retain\-set performance\.
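As a rough illustration of Equations (2) and (3), the sketch below evaluates both forget objectives from precomputed log-probabilities. It is not the authors' implementation: the function names and the default `beta=0.1` are assumptions chosen for the example, and the KL retain term is omitted.

```python
import math

def ga_objective(token_logps):
    """J_GA for one forget example: sum of token log-likelihoods.

    token_logps[t] is log p_theta(y_t | x, y_<t). Minimizing this sum
    lowers the likelihood of the forget answer.
    """
    return sum(token_logps)

def npo_objective(logp_theta, logp_base, beta=0.1):
    """J_NPO over a batch of forget examples.

    logp_theta[i] and logp_base[i] are sequence log-probabilities
    log p(y | x) under the optimized and frozen base models.
    beta=0.1 is an assumed value, not taken from the paper.
    """
    terms = []
    for lt, lb in zip(logp_theta, logp_base):
        z = beta * (lt - lb)  # log of (p_theta / p_base) ** beta
        # Numerically stable log(1 + exp(z)).
        terms.append(max(z, 0.0) + math.log1p(math.exp(-abs(z))))
    return (2.0 / beta) * sum(terms) / len(terms)
```

Working in log space avoids forming the raw probability ratio, which can underflow for long answers.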
## Appendix E NLI Evaluation
### E\.1 NLI Calculation
Let $f_{\text{NLI}}(x,y)$ be the NLI model that predicts probabilities for the entailment \($P_{E}$\), contradiction \($P_{C}$\), and neutral \($P_{N}$\) classes given a model prediction $x$ and a reference answer $y$\. We define the equivalence score $S(x,y)$ as:

$$S(x,y)=\underbrace{\frac{P_{E}(x,y)+P_{E}(y,x)}{2}}_{\text{Symmetric entailment}}\cdot\underbrace{(1-P_{C}(x,y))}_{\text{Contradiction term}}\cdot\underbrace{(1-P_{N}(x,y))}_{\text{Neutral term}} \tag{4}$$

This formulation incorporates penalties for contradiction and neutral predictions\. If the model output $x$ is assigned a high probability of being *contradictory* or *neutral* with respect to $y$, the corresponding penalty term approaches zero, effectively vetoing the score regardless of the entailment probability\. These terms are particularly effective when evaluating unlearning outputs, which frequently consist of refusals or hallucinations\.
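The score in Equation (4) is simple to compute once the NLI class probabilities are available. The sketch below assumes they are passed in as dictionaries with hypothetical keys `entailment`, `contradiction`, and `neutral`; how the probabilities are obtained from an NLI model is outside its scope.

```python
def equivalence_score(p_xy, p_yx):
    """Equivalence score S(x, y) from Equation (4).

    p_xy: NLI class probabilities for (prediction x, reference y).
    p_yx: NLI class probabilities for the reversed pair (y, x).
    Both are dicts with keys 'entailment', 'contradiction', 'neutral'
    (key names are an assumption of this sketch).
    """
    symmetric = 0.5 * (p_xy["entailment"] + p_yx["entailment"])
    contradiction_term = 1.0 - p_xy["contradiction"]
    neutral_term = 1.0 - p_xy["neutral"]
    return symmetric * contradiction_term * neutral_term
```

For a refusal such as ‘‘I’m not aware of that’’, an NLI model would typically put most of its mass on the neutral class, so the neutral term alone drives the score toward zero.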
### E\.2 Human Annotation
The manual evaluation was conducted by four distinct evaluators, with backgrounds in Computer Science to ensure familiarity with the task format\. For the non\-English languages \(Chinese, German, Turkish, Russian\), the evaluators were native speakers of the respective languages, while the English evaluation was performed by a fluent English speaker\. For each language, 50 samples were randomly selected\. The results are shown in Table[7](https://arxiv.org/html/2606.03291#A5.T7)\.
We further compare NLI\-based evaluation with two alternative automatic evaluators using the same human\-annotated examples as in Table[7](https://arxiv.org/html/2606.03291#A5.T7): ROUGE\-L recall\(Lin,[2004](https://arxiv.org/html/2606.03291#bib.bib2)\) and a GPT\-5\-based LLM judge\. ROUGE\-L recall measures lexical overlap with the ground\-truth answer based on the longest common subsequence, while the GPT\-5 judge provides an LLM\-based semantic assessment\. We report these two metrics on a representative subset of three languages \(English, Chinese, and German\), as shown in Table[8](https://arxiv.org/html/2606.03291#A5.T8)\.
Table 7: Human\-NLI Agreement Rate \(%\)\. Average percentage of instances where human annotators agreed with the NLI model’s assessment across all settings\.

Table 8: Agreement with Human Annotations for Alternative Metrics \(%\)\. We report the agreement rates of ROUGE\-L recall and a GPT\-5\-based LLM judge with human annotations on the English, Chinese and German subset\.
## Appendix F Unlearning Transfer Results
### F\.1 Qwen\-2\.5 Unlearning Transfer for Other Combinations
This appendix studies a more complex scenario that better reflects real\-world deployments, where a model is exposed to data in multiple languages\. We evaluate unlearning under different combinations of unlearned languages to test how unlearning transfer changes in this setting\. As shown in Table 9, despite the increased complexity, the main trends reported in the main text remain consistent\.
Table 9: Cross\-lingual unlearning transfer across fine\-tuning families and query languages for Qwen2\.5\-7B\. The FT column indicates the fine\-tuning language\. Within each row block, the Unlearn column specifies the language on which unlearning is performed, while *Base* denotes the fine\-tuned model before unlearning\. For each query language, we report the change in Forget NLI score after unlearning in the corresponding language, relative to the Base model\. We report means ± 95% confidence intervals over 5 forget sets\. Columns EN to TU are query languages \(Forget scores\)\.

| FT | Unlearn | EN | CH | DE | RU | TU |
| --- | --- | --- | --- | --- | --- | --- |
| FT-ENCHRU | Base | 97 | 97 | 19 | 87 | 7 |
| | EN | -93±6 | -9±5 | -3±4 | -4±7 | 3±3 |
| | CH | -10±9 | -87±3 | 1±7 | -2±5 | 3±2 |
| | DE | -15±8 | -4±5 | -4±5 | -4±3 | 5±3 |
| | RU | -9±2 | -5±5 | 1±5 | -63±7 | 4±4 |
| | TU | -7±2 | -4±4 | 2±4 | 1±6 | -2±3 |
| | ENCH | -97±1 | -92±2 | -7±5 | -16±4 | -3±2 |
| | ENRU | -96±2 | -20±11 | -7±4 | -84±4 | -3±2 |
| | CHRU | -37±11 | -93±3 | -5±6 | -81±3 | -1±2 |
| | ENCHRU | -96±2 | -94±2 | -13±4 | -85±2 | -2±3 |
| FT-ENCHTU | Base | 96 | 95 | 18 | 14 | 80 |
| | EN | -92±5 | -10±4 | -5±6 | 4±4 | -49±11 |
| | CH | -15±11 | -82±2 | 2±7 | 1±3 | -3±6 |
| | DE | -24±4 | -6±3 | -7±5 | 1±4 | -6±7 |
| | RU | -19±7 | -7±3 | -2±4 | -6±2 | -4±5 |
| | TU | -42±14 | -3±4 | 0±6 | 0±5 | -79±1 |
| | ENCH | -93±0 | -89±4 | -9±4 | -1±4 | -56±12 |
| | ENTU | -91±1 | -21±7 | -9±1 | -3±4 | -78±2 |
| | CHTU | -67±8 | -87±4 | -8±5 | -2±4 | -79±1 |
| | ENCHTU | -92±2 | -89±5 | -10±3 | -4±2 | -77±3 |
| FT-ENDERU | Base | 95 | 25 | 92 | 94 | 8 |
| | EN | -91±4 | -6±2 | -47±18 | -10±7 | -2±3 |
| | CH | -8±3 | -10±4 | -6±2 | -5±5 | 2±4 |
| | DE | -38±16 | -8±4 | -81±5 | -17±9 | -1±3 |
| | RU | -14±7 | -7±4 | -10±4 | -77±6 | 2±4 |
| | TU | -9±4 | -6±3 | -7±4 | -5±6 | -3±2 |
| | ENDE | -92±3 | -6±4 | -90±1 | -29±9 | -1±4 |
| | ENRU | -92±4 | -12±6 | -62±13 | -91±3 | -1±2 |
| | DERU | -61±13 | -12±3 | -88±1 | -89±4 | -2±4 |
| | ENDERU | -93±2 | -13±4 | -91±2 | -90±5 | -3±5 |
| FT-ENDETU | Base | 93 | 20 | 84 | 9 | 83 |
| | EN | -90±2 | -6±4 | -49±13 | 4±2 | -53±6 |
| | CH | -11±5 | -7±5 | 1±3 | 3±3 | -1±3 |
| | DE | -47±16 | -5±5 | -78±4 | 6±4 | -31±15 |
| | RU | -21±6 | -6±3 | -9±5 | -2±2 | -5±3 |
| | TU | -43±16 | -9±4 | -33±13 | 6±3 | -80±1 |
| | ENDE | -91±1 | -9±3 | -82±2 | 3±3 | -67±5 |
| | ENTU | -89±1 | -5±2 | -62±8 | 3±4 | -81±1 |
| | DETU | -74±9 | -7±3 | -80±2 | 4±5 | -81±2 |
| | ENDETU | -90±3 | -9±2 | -81±2 | 1±2 | -82±1 |
| FT-CHDERU | Base | 27 | 89 | 93 | 85 | 12 |
| | EN | -21±3 | -5±4 | -37±19 | -14±6 | -3±1 |
| | CH | -9±4 | -78±4 | -14±11 | -3±3 | -1±1 |
| | DE | -8±6 | -1±6 | -81±2 | -6±6 | -4±1 |
| | RU | -9±6 | -5±7 | -18±7 | -72±6 | -1±4 |
| | TU | -8±6 | -1±6 | -16±9 | 3±6 | -6±3 |
| | CHDE | -20±6 | -83±4 | -91±1 | -18±5 | -5±1 |
| | CHRU | -13±3 | -84±3 | -29±11 | -83±2 | -5±1 |
| | DERU | -17±4 | -18±10 | -90±2 | -82±2 | -6±2 |
| | CHDERU | -22±4 | -85±5 | -92±0 | -81±4 | -6±3 |
| FT-ENCHDERUTU | Base | 95 | 100 | 95 | 97 | 92 |
| | EN | -88±3 | -8±5 | -45±14 | -19±10 | -49±10 |
| | CH | -11±7 | -90±3 | -13±9 | -18±10 | -13±5 |
| | DE | -39±11 | -8±3 | -84±9 | -21±11 | -28±10 |
| | RU | -13±8 | -7±4 | -13±8 | -85±10 | -11±4 |
| | TU | -32±13 | -5±3 | -30±13 | -11±5 | -87±3 |
| | ENCHRU | -94±2 | -99±1 | -84±8 | -97±0 | -75±15 |
| | ENDERU | -94±1 | -68±18 | -95±0 | -97±0 | -79±8 |
| | ENDETU | -94±1 | -58±20 | -94±1 | -63±17 | -89±1 |
| | DECHRU | -84±11 | -99±1 | -94±1 | -96±1 | -64±21 |
| | ENCHDERUTU | -94±1 | -99±2 | -95±1 | -97±0 | -91±1 |
### F\.2 Gemma\-2 Unlearning Transfer Across Languages
To assess the generalizability of our findings across model architectures, we replicate the cross\-lingual unlearning transfer experiments using the *Gemma* model\. The experimental setup differs from the methodology described in Section[4\.1](https://arxiv.org/html/2606.03291#S4.SS1) only in terms of scale due to computational constraints: we evaluate a reduced subset of language combinations and report results from a single experimental run\. Despite the reduced scope, the results presented in Table 10 corroborate the trends observed in the main text\. Specifically, we observe the same structural properties of transfer regarding linguistic similarity \(shared script and language family\) and the asymmetry driven by pretraining coverage, suggesting these are fundamental properties of multilingual unlearning rather than artifacts of a specific model\.
Table 10: Cross\-lingual unlearning transfer across fine\-tuning families and query languages for Gemma2\-9B\. The FT column indicates the fine\-tuning language\. Within each row block, the Unlearn column specifies the language on which unlearning is performed, while *Base* denotes the fine\-tuned model before unlearning\. For each query language, we report the change in Forget NLI score after unlearning in the corresponding language, relative to the Base model\.

QueryFTUnlearnENCHDERUTUForgetForgetForgetForgetForgetFT\-ENBase9712864731EN\-94\-4\-73\-40\-29CH\-11\-11\-33\-34\-10DE\-5420\-49\-13\-5RU\-2618\-33\-212TU\-418\-46\-16\-21FT\-CHBase1491164710EN\-14\-42\-13\-33\-7CH\-5\-83\-4\-400DE\-5\-3\-8\-24\-3RU1409\-2511TU\-3\-49\-14\-2FT\-DEBase5318952728EN\-372\-79\-18\-18CH\-5\-11\-20\-16DE\-48\-9\-95\-21\-26RU38\-16716TU\-171\-419\-14FT\-RUBase2435629026EN\-14\-19\-48\-60\-23CH18\-16\-17\-105DE21\-8\-19\-26RU\-5\-34\-48\-90\-14TU18\-2\-19\-7\-11FT\-TUBase196181591EN\-192\-18\-9\-83CH\-6120\-2\-15DE\-319\-10\-3\-68RU49302\-4TU\-13\-1\-16\-4\-84FT\-ENCHDERUTUBase9290979782EN\-92\-11\-70\-39\-76CH\-5\-87\-10\-2\-3DE\-56\-10\-97\-14\-38RU\-9\-5\-9\-89\-3TU\-14\-7\-15\-5\-74ENCHRU\-92\-89\-80\-94\-79ENDERU\-92\-34\-97\-92\-71ENDETU\-92\-38\-97\-52\-81DECHRU\-84\-89\-96\-89\-50ENCHDERUTU\-92\-87\-93\-94\-82
### F\.3 GA and NPO Unlearning Transfer
To test whether the cross\-lingual transfer patterns observed in the main text are specific to DPO\-based unlearning, we repeat the same transfer evaluation using two additional unlearning methods: gradient ascent \(GA\) and negative preference optimization \(NPO\)\. The setup follows Section[4\.1](https://arxiv.org/html/2606.03291#S4.SS1): for each fine\-tuned checkpoint, we unlearn in one language and evaluate the change in NLI\-based forget\-set score across all query languages\. As in the main text, more negative values indicate stronger unlearning transfer\. Due to computational cost, we restrict this analysis to the five single\-language fine\-tuned settings and do not perform additional shuffled forget\-set repetitions\.
Tables 11 and 12 show that unlearning still clearly transfers across languages under both GA and NPO, although the absolute magnitudes vary across unlearning methods\. The qualitative patterns are broadly consistent with the DPO results: transfer is often stronger when the source and query languages share linguistic properties, such as language family or script, and when the unlearning source is a higher\-coverage language\. Moreover, unlearning in a language where the model has relatively weak performance can still transfer to a language where the model performs well\. These results suggest that the main cross\-lingual transfer findings are not specific to the DPO objective\.
Table 11: Cross\-lingual unlearning transfer across fine\-tuning languages and query languages under GA\. The FT column indicates the fine\-tuning language\. Within each row block, the Unlearn column specifies the language on which unlearning is performed, while *Base* denotes the fine\-tuned model before unlearning\. For each query language, we report the change in Forget NLI score after unlearning, relative to the Base model\.

QueryFTUnlearnENCHDERUTUForgetForgetForgetForgetForgetFT\-ENBase9712864731EN\-92\-6\-68\-32\-22CH\-352\-40\-18\-8DE\-6114\-48\-100RU\-6815\-55\-22\-13TU\-466\-46\-15\-15FT\-CHBase1491164710EN\-5\-79\-6\-41\-5CH\-5\-40\-5\-1020DE\-1\-370\-311RU2\-543\-340TU8\-419\-194FT\-DEBase5318952728EN\-32\-3\-85\-5\-21CH\-7\-14\-56\-3\-3DE\-39\-11\-89\-16\-27RU\-1611\-594\-10TU\-194\-555\-8FT\-RUBase2435629026EN\-5\-19\-52\-72\-18CH5\-19\-30\-45\-11DE10\-5\-36\-50\-6RU\-12\-28\-42\-79\-24TU100\-31\-38\-10FT\-TUBase196181591EN\-38\-9\-1\-79CH250160\-42DE012\-73\-74RU8138\-2\-64TU\-324829\-69

Table 12: Cross\-lingual unlearning transfer across fine\-tuning languages and query languages under NPO\. The FT column indicates the fine\-tuning language\. Within each row block, the Unlearn column specifies the language on which unlearning is performed, while *Base* denotes the fine\-tuned model before unlearning\. For each query language, we report the change in Forget NLI score after unlearning, relative to the Base model\.

QueryFTUnlearnENCHDERUTUForgetForgetForgetForgetForgetFT\-ENBase9712864731EN\-85\-6\-73\-34\-19CH\-28\-4\-30\-22\-10DE\-3518\-50\-170RU\-3510\-36\-283TU\-4611\-38\-21\-21FT\-CHBase1491164710EN\-10\-54\-6\-38\-2CH\-6\-69\-3\-241DE\-3\-11\-5\-321RU24213\-156TU13\-48\-152FT\-DEBase5318952728EN\-2210\-542\-1CH\-6\-12\-35\-104DE\-41\-12\-89\-14\-22RU1217\-30314TU115\-2913\-3FT\-RUBase2435629026EN\-12\-15\-51\-56\-15CH18\-16\-9\-65DE309\-12\-618RU\-7\-30\-42\-81\-17TU23\-1\-12\-44FT\-TUBase196181591EN\-141\-5\-5\-60CH100282\-20DE\-615\-1315\-26RU12124\-10\-12TU\-514710\-77
### F\.4 Retain Set Results
For completeness, we report the retain set results omitted from the main text\. The*Base*row gives the retain performance of the fine\-tuned model before unlearning, while the remaining rows report the change in retain performance after unlearning relative to the corresponding Base model\. The retain results show little evidence of systematic cross\-lingual transfer\. Although retain performance can decrease when the query language is the same as the unlearned language, the off\-diagonal changes across different query languages are generally small and do not form consistent patterns across fine\-tuning families, language families, or scripts\. This contrasts with the forget\-set results, where cross\-lingual transfer is substantially larger and more structured\. These results support our focus on forget\-set transfer in the main text: the main cross\-lingual effect of unlearning appears in forgetting behavior, while retain\-set transfer across languages is comparatively limited\.
Table 13: Retain\-set performance under DPO for Qwen2\.5\-7B\. The FT column indicates the fine\-tuning language, and Unlearn specifies the unlearned language\. The *Base* row reports absolute Retain NLI score\. DPO rows report $\Delta$Retain relative to the corresponding Base model; values are means ± 95% confidence intervals over the selected shuffles\. Columns EN to TU are query languages \(Retain scores\)\.

| FT | Unlearn | EN | CH | DE | RU | TU |
| --- | --- | --- | --- | --- | --- | --- |
| FT-EN | Base | 92 | 9 | 17 | 12 | 8 |
| | EN | -7±2 | 0±0 | -1±0 | -1±0 | -1±0 |
| | CH | -1±0 | -1±0 | 0±0 | 0±0 | 0±0 |
| | DE | -1±0 | 0±0 | 2±2 | 0±1 | 0±1 |
| | RU | -1±0 | -1±0 | -1±0 | -2±0 | 0±0 |
| | TU | -1±0 | 0±0 | 0±1 | 0±0 | -2±0 |
| FT-CH | Base | 8 | 92 | 8 | 10 | 6 |
| | EN | -2±0 | -1±0 | -1±0 | -1±0 | 0±0 |
| | CH | -1±0 | -6±1 | -1±0 | -2±0 | 0±0 |
| | DE | 0±0 | -1±0 | -1±0 | 0±0 | 0±0 |
| | RU | 0±0 | -1±0 | -1±0 | -2±1 | 0±0 |
| | TU | 0±0 | -1±0 | 0±0 | 0±0 | -1±0 |
| FT-DE | Base | 14 | 9 | 92 | 10 | 7 |
| | EN | -4±1 | 0±0 | -1±0 | 0±0 | 0±0 |
| | CH | 0±0 | -1±1 | -1±0 | 0±0 | 0±0 |
| | DE | 0±0 | 0±0 | -8±1 | 0±0 | -1±0 |
| | RU | 0±0 | 0±0 | -1±0 | -2±1 | 0±0 |
| | TU | 0±0 | 0±0 | -1±0 | 0±0 | -1±0 |
| FT-RU | Base | 10 | 9 | 9 | 93 | 6 |
| | EN | -3±1 | 0±0 | -1±0 | -1±0 | -1±0 |
| | CH | 0±0 | -1±0 | 0±0 | -1±0 | 0±0 |
| | DE | 0±0 | 0±0 | -1±0 | -1±0 | 0±0 |
| | RU | 0±0 | 0±0 | -1±0 | -10±2 | 0±0 |
| | TU | 0±0 | 0±0 | 0±0 | -1±0 | -1±0 |
| FT-TU | Base | 11 | 9 | 10 | 10 | 93 |
| | EN | -7±1 | -1±0 | -1±0 | -2±0 | -2±0 |
| | CH | -2±0 | -3±1 | 0±0 | -2±0 | -2±0 |
| | DE | -1±0 | 0±0 | -2±0 | -2±0 | -2±0 |
| | RU | -2±0 | -1±0 | -1±0 | -6±1 | -2±0 |
| | TU | -1±0 | 0±0 | 0±0 | -1±0 | -11±1 |
| FT-ENCHRU | Base | 93 | 93 | 26 | 90 | 13 |
| | EN | -4±1 | -1±0 | -2±0 | -1±0 | -1±0 |
| | CH | -1±0 | -3±1 | -2±1 | -1±0 | -1±0 |
| | DE | 0±0 | 0±0 | -1±0 | 0±0 | -1±0 |
| | RU | -1±0 | -2±0 | -2±0 | -5±1 | -2±0 |
| | TU | 0±0 | 0±0 | 0±0 | 0±0 | -1±0 |
| | ENCH | -3±1 | -4±1 | -2±0 | -1±0 | -2±0 |
| | ENRU | -3±1 | -2±1 | -3±1 | -5±1 | -2±0 |
| | CHRU | -2±0 | -4±1 | -3±1 | -6±1 | -2±0 |
| | ENCHRU | -1±0 | -3±1 | -1±0 | -2±0 | -1±0 |
| FT-ENCHTU | Base | 92 | 91 | 23 | 20 | 82 |
| | EN | -5±1 | -1±0 | -2±0 | -3±0 | -1±0 |
| | CH | -1±0 | -4±1 | -2±0 | -3±0 | -1±0 |
| | DE | 0±0 | 0±0 | -1±0 | -1±0 | 0±0 |
| | RU | 0±0 | 0±0 | -1±0 | -2±1 | 0±0 |
| | TU | -1±0 | -1±0 | -2±0 | -3±0 | -8±1 |
| | ENCH | -3±1 | -3±1 | -2±0 | -2±0 | -2±1 |
| | ENTU | -2±1 | -1±0 | -2±0 | -3±0 | -4±0 |
| | CHTU | -1±0 | -3±1 | -2±0 | -3±0 | -4±1 |
| | ENCHTU | -2±1 | -2±1 | 0±0 | -1±0 | -2±1 |
| FT-ENDERU | Base | 94 | 18 | 89 | 90 | 12 |
| | EN | -6±1 | -1±0 | -2±0 | 0±0 | -2±0 |
| | CH | -1±0 | 0±0 | 0±0 | 0±0 | -1±0 |
| | DE | -3±0 | -1±0 | -6±1 | -1±0 | -2±0 |
| | RU | -2±1 | -1±0 | -2±1 | -5±2 | -2±0 |
| | TU | 0±0 | 0±0 | 0±0 | 0±0 | -1±0 |
| | ENDE | -3±1 | -1±0 | -3±0 | -1±0 | -2±0 |
| | ENRU | -3±1 | -1±0 | -2±1 | -4±1 | -2±0 |
| | DERU | -3±1 | -1±0 | -4±0 | -5±1 | -2±0 |
| | ENDERU | -2±1 | 0±0 | -2±0 | -2±0 | -1±0 |
| FT-ENDETU | Base | 92 | 16 | 91 | 20 | 84 |
| | EN | -6±1 | -2±0 | -2±0 | -2±1 | -2±1 |
| | CH | 0±0 | -1±0 | 0±0 | -1±0 | 0±0 |
| | DE | -3±1 | -2±0 | -8±1 | -3±0 | -3±1 |
| | RU | 0±0 | -1±0 | -1±0 | -1±0 | 0±0 |
| | TU | -1±0 | -2±0 | -2±0 | -2±0 | -8±0 |
| | ENDE | -2±1 | -2±0 | -4±1 | -2±0 | -2±1 |
| | ENTU | -2±1 | -2±0 | -2±1 | -2±0 | -5±0 |
| | DETU | -2±0 | -2±0 | -4±0 | -3±0 | -5±0 |
| | ENDETU | -1±0 | -1±0 | -2±0 | -1±0 | -3±0 |
| FT-CHDERU | Base | 25 | 91 | 88 | 89 | 10 |
| | EN | 0±0 | 0±0 | -1±0 | 0±0 | 0±0 |
| | CH | -2±1 | -3±1 | -1±0 | -1±0 | -1±0 |
| | DE | -2±0 | -1±0 | -5±1 | -2±0 | -1±0 |
| | RU | -2±0 | -1±0 | -2±0 | -6±1 | -1±0 |
| | TU | 0±0 | 0±0 | -1±0 | 0±0 | -1±0 |
| | CHDE | -3±1 | -4±1 | -5±1 | -2±0 | -1±0 |
| | CHRU | -3±0 | -4±1 | -2±0 | -6±1 | -2±0 |
| | DERU | -3±0 | -2±0 | -5±1 | -6±1 | -2±0 |
| | CHDERU | -1±1 | -2±1 | -3±1 | -3±1 | -1±0 |
| FT-ENCHDERUTU | Base | 96 | 96 | 93 | 94 | 88 |
| | EN | -3±1 | 0±0 | -1±0 | -1±0 | -1±0 |
| | CH | -1±0 | -3±1 | -1±0 | -1±0 | -1±0 |
| | DE | -1±0 | -1±0 | -4±1 | -1±0 | -2±0 |
| | RU | -1±0 | -1±0 | -1±0 | -4±1 | -1±0 |
| | TU | -1±0 | -1±0 | -1±0 | -1±0 | -5±1 |
| | ENCHRU | -1±0 | -2±0 | -1±0 | -2±0 | -1±0 |
| | ENDERU | -1±0 | -1±0 | -2±0 | -2±0 | -1±0 |
| | ENDETU | -1±1 | -1±0 | -2±0 | -1±0 | -2±0 |
| | DECHRU | -1±0 | -2±0 | -2±1 | -2±0 | -1±0 |
| | ENCHDERUTU | -1±0 | -1±0 | -1±0 | -1±0 | -1±0 |
## Appendix G Prompting
We enforce the output language at inference time by appending a short, language\-specific instruction to the user query\. Given input question $x$ in language $q$ and fine\-tuned language $\ell$, we form $x^{\prime}=x+I_{\ell}$, where $I_{\ell}$ is the instruction for language $\ell$\. For each question, we append the corresponding instruction as follows:
- English: *Question* \+ ‘‘You must only answer this question in English\.’’
- Chinese: *Question* \+ ‘‘你必须仅用中文回答这个问题’’
- German: *Question* \+ ‘‘Bitte beantworte diese Frage ausschließlich auf Deutsch\.’’
- Russian: *Question* \+ ‘‘Пожалуйста, отвечай на этот вопрос только на русском языке\.’’
- Turkish: *Question* \+ ‘‘Lütfen bu soruyu yalnızca Türkçe olarak cevapla\.’’
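The prompt construction $x^{\prime}=x+I_{\ell}$ amounts to a lookup and a string concatenation. In the sketch below, the instruction strings are copied from the list above, while the language codes, the single-space separator, and the name `build_prompt` are assumptions made for illustration.

```python
# Instruction strings as listed in this appendix; the dictionary keys
# ("en", "zh", ...) are hypothetical codes chosen for this sketch.
LANGUAGE_INSTRUCTIONS = {
    "en": "You must only answer this question in English.",
    "zh": "你必须仅用中文回答这个问题",
    "de": "Bitte beantworte diese Frage ausschließlich auf Deutsch.",
    "ru": "Пожалуйста, отвечай на этот вопрос только на русском языке.",
    "tr": "Lütfen bu soruyu yalnızca Türkçe olarak cevapla.",
}

def build_prompt(question, answer_language):
    """Return x' = x + I_l: the question followed by the language instruction.

    Joining with a single space is an assumption; the paper only says
    the instruction is appended to the user query.
    """
    return f"{question} {LANGUAGE_INSTRUCTIONS[answer_language]}"
```

Because the instruction is chosen independently of the question's own language, the same helper covers cross-lingual prompting, for example a German question paired with the English instruction.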
## Appendix H Hidden Representation Analysis
### H\.1 Cosine Similarity for Chinese
To verify that the representational patterns observed in the main text are not specific to English \(a high\-resource Indo\-European language\), we replicate the layer\-wise cosine similarity analysis using Chinese as the anchor language\. In this setup, the model is fine\-tuned on Chinese, and we measure the alignment between Chinese representations and those of other languages \(English, German, Russian, Turkish\) for the same underlying questions\. Figure[4](https://arxiv.org/html/2606.03291#A8.F4) presents these results, which are consistent with the English\-centric analysis\.

Figure 4: Layer\-wise cosine similarity between the hidden state $h^{(l)}_{m}(x_{\ell})$ of the final input token for Chinese, English, German, Russian, and Turkish questions, across three variants of Qwen2\.5\-7B \(see panel titles for settings\)\. FT denotes fine\-tuning, and UN denotes unlearning\. Cross\-lingual prompting means that for non\-English questions, the model is instructed to answer in English\. Bottom row: control comparisons, including English vs\. randomly paired non\-equivalent questions from $\mathcal{D}^{\text{retain}}_{\ell}$ \(*vs\. Random*\) and cross\-model comparisons using English representations from the corresponding top\-row model \(*vs Base Model*, *vs FT English*\)\. Curves show the mean across questions; shaded regions denote 95% confidence intervals\.
### H\.2 Impact of Unlearning for Fine\-tuned Language
To understand the mechanism of unlearning at a granular level, we investigate how the internal representations of the *same* input evolve from the fine\-tuned model \($f_{\text{ft}}$\) to the unlearned model \($f_{\text{un}}$\)\. Figure[5](https://arxiv.org/html/2606.03291#A8.F5) plots the layer\-wise cosine similarity between $h^{(l)}_{f_{\text{ft}}}(x)$ and $h^{(l)}_{f_{\text{un}}}(x)$ for the five languages\. The results reveal a distinct ‘‘late\-stage’’ intervention pattern\. For the first two\-thirds of the layers, the cosine similarity remains very high, indicating that the unlearning algorithm induces negligible changes to the early semantic processing and feature extraction stages, with the exception of a slightly larger drop observed for Chinese\. The divergence begins sharply only in the later layers, where the similarity drops as the unlearned model pushes the representation away from the original target output\. This trend mirrors the impact of unlearning observed in multilingual settings and provides mechanistic evidence that unlearning functions primarily by altering the decoding trajectory at the output layers, leaving the deep semantic encoding of the sensitive knowledge largely intact\.
Figure 5: We compare the cosine similarity of hidden representations before and after unlearning within the same language for which the model was finetuned\. The results show that the early and mid layers are minimally affected, while the majority of the similarity drop occurs in the later layers\.
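The layer-wise comparison described in this subsection can be sketched as follows. The example assumes the last-token hidden states have already been extracted into nested lists indexed as `h[layer][question]`; the function names and this data layout are assumptions for illustration, not the authors' code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def layerwise_similarity(h_ft, h_un):
    """Mean cosine similarity per layer between two models on the same inputs.

    h_ft[l][i] and h_un[l][i] are the last-token hidden states of
    question i at layer l for the fine-tuned and unlearned models.
    Returns one averaged similarity value per layer.
    """
    curve = []
    for layer_ft, layer_un in zip(h_ft, h_un):
        sims = [cosine(a, b) for a, b in zip(layer_ft, layer_un)]
        curve.append(sum(sims) / len(sims))
    return curve
```

A curve that stays near 1 for early layers and drops only at the end would correspond to the ‘‘late\-stage’’ pattern reported for Figure 5.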
### H\.3 Principal Component and Distance Analysis
Setup\. Using the same hidden representations $h^{(l)}_{m}(x_{\ell})$ as in Figure[1](https://arxiv.org/html/2606.03291#S4.F1), we study how unlearning reshapes the geometry of the forget representations\. We first $\mathrm{L}_{2}$\-normalize the representation, $\tilde{h}^{(l)}_{m}(x_{\ell})=\frac{h^{(l)}_{m}(x_{\ell})}{\bigl\|h^{(l)}_{m}(x_{\ell})\bigr\|_{2}}$, and collect the normalized vectors for all questions in the forget set: $S^{(l)}_{m}(\ell)=\bigl\{\tilde{h}^{(l)}_{m}(x_{\ell}):x_{\ell}\in\mathcal{D}^{\text{forget}}_{\ell}\bigr\}$, $m\in\{f_{\text{base}},f_{\text{ft}},f_{\text{un}}\}$\. We define the centroid for a given layer as $c^{(l)}_{m}(\ell)=\frac{1}{\bigl|S^{(l)}_{m}(\ell)\bigr|}\sum_{h\in S^{(l)}_{m}(\ell)}h$\. We then quantify representation changes in $S^{(l)}_{m}(\ell)$ using two complementary distance measures\. First, *centroid distance* captures *global* distributional shifts by computing the $\mathrm{L}_{2}$ distance between centroids of two model variants \(e\.g\., $\bigl\|c^{(l)}_{f_{\text{ft}}}(\ell)-c^{(l)}_{f_{\text{un}}}(\ell)\bigr\|_{2}$\)\. Second, to capture the *per\-example* effect of unlearning, we compute the pairwise distance $d^{(l)}(x_{\ell}):=\bigl\|\tilde{h}^{(l)}_{m_{1}}(x_{\ell})-\tilde{h}^{(l)}_{m_{2}}(x_{\ell})\bigr\|_{2}$ for each forget question $x_{\ell}$ at layer $l$, for $m_{1},m_{2}\in\{f_{\text{base}},f_{\text{ft}},f_{\text{un}}\}$, and average over all forget questions to obtain *average pairwise distances*\.
Figure 6: PCA separation across layers for German \(top\), Russian \(middle\) and Turkish \(bottom\) questions\. Numbers indicate question indices, with identical numbers referring to the same questions across different testing model variants\.

Results\. The average pairwise distance $|h_{\text{FT}}^{(l)}(i)-h_{\text{UN}}^{(l)}(i)|$ mirrors the cosine similarity, remaining small at early and middle layers but growing sharply near the output \(Figure[7\(b\)](https://arxiv.org/html/2606.03291#A8.F7.sf2)\)\. In contrast, the distance between the $f_{\text{ft}}$ and $f_{\text{un}}$ centroids remains relatively constant across all layers for most languages \(Figure[7\(a\)](https://arxiv.org/html/2606.03291#A8.F7.sf1)\)\. These trends are naturally explained by decomposing the $f_{\text{ft}}\rightarrow f_{\text{un}}$ difference at layer $l$ into a global and an example\-specific component, $\Delta_{i}^{(l)}=h_{\text{UN}}^{(l)}(i)-h_{\text{FT}}^{(l)}(i)=g^{(l)}+\varepsilon_{i}^{(l)}$\. The approximately constant centroid distance indicates that the norm of the global shift $|g^{(l)}|$ is stable across depth, while the gradual increase in pairwise distance reflects a growing example\-specific term $\varepsilon_{i}^{(l)}$\. Visualizing these dynamics via heatmaps reveals that while most languages follow this pattern, Chinese is a notable outlier, exhibiting a significant spike in centroid distance across intermediate layers \(approx\. layers 8–18\) before realigning with the decoding\-layer bottleneck\. Furthermore, extending the analysis to the Base model \(*FT vs Base* and *UN vs Base*\) confirms that the unlearned model does not simply revert to the pre\-trained state\. Instead, the *UN vs Base* distance remains substantial, comparable to the *FT vs Base* shift, indicating that the unlearned model occupies a distinct representational state that retains general fine\-tuning distributions while selectively suppressing specific knowledge via targeted output distortions\.
\(a\) Centroid Distance for three groups
\(b\) Average Pair\-Wise Distance for three groups
Figure 7: The left heatmap captures the impact of unlearning on global geometry, while the right heatmap captures the per\-sample effect of unlearning\.
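The two distance measures from the setup of this subsection can be written down directly. The sketch below works on a single layer and assumes the hidden states of two model variants are given as lists of vectors in matching question order; the function names are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def l2_distance(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def centroid(vectors):
    """Coordinate-wise mean of a list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def centroid_distance(states_a, states_b):
    """L2 distance between centroids of the normalized states (global shift)."""
    norm_a = [l2_normalize(h) for h in states_a]
    norm_b = [l2_normalize(h) for h in states_b]
    return l2_distance(centroid(norm_a), centroid(norm_b))

def average_pairwise_distance(states_a, states_b):
    """Mean L2 distance between normalized states of the same question."""
    dists = [
        l2_distance(l2_normalize(a), l2_normalize(b))
        for a, b in zip(states_a, states_b)
    ]
    return sum(dists) / len(dists)
```

The two measures can disagree: if two questions simply swap their representations, the centroid does not move at all while every per-example distance is large, which is why the text treats them as complementary.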
## Appendix I Steering Vector Algorithms
In this section, we provide the pseudocode for the steering vector extraction and injection procedures described in Section[4\.4](https://arxiv.org/html/2606.03291#S4.SS4)\.
Algorithm[1](https://arxiv.org/html/2606.03291#alg1) outlines the process of extracting the global unlearning direction $\mathbf{g}^{(l)}$\. This is achieved by computing the difference between the normalized hidden states of the unlearned model \($f_{\text{un}}$\) and the fine\-tuned model \($f_{\text{ft}}$\), averaged over an auxiliary forget set\. This procedure isolates the geometric shift induced by unlearning while preventing the leakage of specific answers that we intend to probe for recovery\. Upon completion, this algorithm yields a vector that steers representations toward the unlearned behavior\.
Algorithm[2](https://arxiv.org/html/2606.03291#alg2) details the inference\-time intervention used to test reversibility\. Here, the extracted steering vectors are *subtracted* from the hidden states of the unlearned model during the forward pass\. This cancels the geometric shift induced by unlearning and moves the model’s internal state back toward the fine\-tuned distribution, allowing it to reproduce the forgotten knowledge\.
Algorithm 1: Extraction of Layer-wise Steering Vectors
Require: Models: fine-tuned $M_{f_{\text{ft}}}$ and unlearned $M_{f_{\text{un}}}$
Require: Forget dataset $\mathcal{D}$
Ensure: Steering vectors $\{\mathbf{g}^{(l)}\}_{l=1}^{L}$
1: Function $\textsc{Norm}(\mathbf{v}) = \mathbf{v} / \|\mathbf{v}\|_2$
2: Initialize accumulators $\mathbf{s}^{(l)} \leftarrow \mathbf{0}$ for $l = 1, \dots, L$
3: Compute hidden states for $M_{f_{\text{ft}}}$ and $M_{f_{\text{un}}}$ on dataset $\mathcal{D}$
4: for $l = 1$ to $L$ do
5:   for each sample $x \in \mathcal{D}$ do
6:     Let $t$ be the index of the last token in $x$
7:     $\mathbf{h}_{f_{\text{ft}}} \leftarrow$ hidden state of $M_{f_{\text{ft}}}$ at layer $l$, token $t$
8:     $\mathbf{h}_{f_{\text{un}}} \leftarrow$ hidden state of $M_{f_{\text{un}}}$ at layer $l$, token $t$
9:     $\tilde{\mathbf{h}}_{f_{\text{ft}}} \leftarrow \textsc{Norm}(\mathbf{h}_{f_{\text{ft}}})$
10:     $\tilde{\mathbf{h}}_{f_{\text{un}}} \leftarrow \textsc{Norm}(\mathbf{h}_{f_{\text{un}}})$
11:     $\mathbf{s}^{(l)} \leftarrow \mathbf{s}^{(l)} + (\tilde{\mathbf{h}}_{f_{\text{un}}} - \tilde{\mathbf{h}}_{f_{\text{ft}}})$
12:   end for
13: end for
14: for $l = 1$ to $L$ do
15:   $\mathbf{g}^{(l)} \leftarrow \textsc{Norm}(\mathbf{s}^{(l)})$
16: end for
17: return $\{\mathbf{g}^{(l)}\}_{l=1}^{L}$
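The extraction loop of Algorithm 1 can be sketched in a few lines of Python. This is an illustrative sketch, not the released implementation: it assumes the last-token hidden states have already been collected into nested lists (`h_ft[l][i]` and `h_un[l][i]` for layer `l`, sample `i`), and it uses plain Python lists instead of tensors to stay dependency-free. The names `norm` and `extract_steering_vectors` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def norm(v: Sequence[float]) -> Vector:
    """Scale v to unit L2 norm (the Norm function of Algorithm 1)."""
    length = math.sqrt(sum(x * x for x in v))
    if length == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / length for x in v]


def extract_steering_vectors(
    h_ft: Sequence[Sequence[Sequence[float]]],
    h_un: Sequence[Sequence[Sequence[float]]],
) -> List[Vector]:
    """Return one unit-norm steering vector per layer.

    h_ft[l][i] and h_un[l][i] are assumed to hold the last-token hidden
    state of sample i at layer l for the fine-tuned and unlearned model.
    """
    steering = []
    for layer_ft, layer_un in zip(h_ft, h_un):
        dim = len(layer_ft[0])
        acc = [0.0] * dim  # accumulator s^(l)
        for state_ft, state_un in zip(layer_ft, layer_un):
            ft_hat, un_hat = norm(state_ft), norm(state_un)
            for k in range(dim):
                acc[k] += un_hat[k] - ft_hat[k]  # unlearned minus fine-tuned
        steering.append(norm(acc))  # g^(l) = Norm(s^(l))
    return steering


# Toy input: one layer, two samples, two dimensions.
toy_ft = [[[1.0, 0.0], [2.0, 0.0]]]
toy_un = [[[0.0, 1.0], [0.0, 3.0]]]
toy_g = extract_steering_vectors(toy_ft, toy_un)
```

In the toy input the per-sample norms differ (1, 2, 3), but the inner normalization removes them, so both samples contribute the same difference. If the two models produce identical normalized states on every sample, the accumulator is zero and `norm` raises, which signals that there is no direction to extract at that layer.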
Algorithm 2: Layer-wise Steering Injection
Require: Unlearned model $M$ with layers $f_1, \dots, f_L$, steering vectors $\{\mathbf{g}^{(l)}\}$
Require: Hyperparameters: scale $\alpha > 0$, steering window size $N + 1$
Require: Evaluation batch $\mathcal{X} = \{(x_i, y_i)\}_{i=1}^{B}$
Ensure: Optimal start layer $c^{*}$
1: $s_{\text{best}} \leftarrow -\infty$
2: $c^{*} \leftarrow \text{None}$
3: for $c = 1$ to $L - N$ do
4:   $S_{\text{batch}} \leftarrow 0$
5:   for $i = 1$ to $B$ do
6:     $seq \leftarrow x_i$
7:     $gen \leftarrow \text{empty string}$
8:     repeat
9:       $t \leftarrow \text{length}(seq)$
10:       $H^{(0)} \leftarrow \text{Embed}(seq)$
11:       for $l = 1$ to $L$ do
12:         $H^{(l)} \leftarrow f_l(H^{(l-1)})$
13:         if $c \leq l \leq c + N$ then
14:           $\mathbf{h} \leftarrow H^{(l)}[t]$
15:           $r \leftarrow \|\mathbf{h}\|_2$
16:           $H^{(l)}[t] \leftarrow \mathbf{h} - \alpha \cdot r \cdot \mathbf{g}^{(l)}$
17:         end if
18:       end for
19:       $token \leftarrow \textsc{Greedy}(H^{(L)}[t])$
20:       $seq \leftarrow \text{concat}(seq, token)$
21:       $gen \leftarrow \text{concat}(gen, token)$
22:     until $token$ is EOS
23:     $S_{\text{batch}} \leftarrow S_{\text{batch}} + \text{NLI}(y_i, gen)$
24:   end for
25:   $\bar{s} \leftarrow S_{\text{batch}} / B$
26:   if $\bar{s} > s_{\text{best}}$ then
27:     $s_{\text{best}} \leftarrow \bar{s}$
28:     $c^{*} \leftarrow c$
29:   end if
30: end for
31: return $c^{*}$
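The two core pieces of Algorithm 2, the norm-scaled subtraction (steps 14 to 16) and the sweep over start layers (steps 3 and 25 to 30), can be sketched as follows. This is a simplified sketch under stated assumptions: a single vector stands in for the last-token hidden state, layers are arbitrary Python callables, and autoregressive generation plus NLI scoring are abstracted into a caller-supplied `score_fn(c)`. All function names are hypothetical.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = List[float]


def steer(h: Sequence[float], g: Sequence[float], alpha: float) -> Vector:
    """Subtract the steering direction, scaled by alpha and by ||h||_2."""
    r = math.sqrt(sum(x * x for x in h))
    return [x - alpha * r * gk for x, gk in zip(h, g)]


def forward_with_steering(
    layers: Sequence[Callable[[Vector], Vector]],
    h0: Sequence[float],
    g_by_layer: Sequence[Sequence[float]],
    alpha: float,
    c: int,
    n: int,
) -> Vector:
    """Apply layers 1..L to one hidden vector, steering layers c..c+n."""
    h = list(h0)
    for l, layer in enumerate(layers, start=1):  # 1-indexed, as in the pseudocode
        h = layer(h)
        if c <= l <= c + n:
            h = steer(h, g_by_layer[l - 1], alpha)
    return h


def best_start_layer(
    num_layers: int, n: int, score_fn: Callable[[int], float]
) -> Tuple[Optional[int], float]:
    """Sweep start layers c = 1..L-N and keep the best mean score."""
    best_score, best_c = -math.inf, None
    for c in range(1, num_layers - n + 1):
        score = score_fn(c)  # stands in for the mean NLI score over the batch
        if score > best_score:
            best_score, best_c = score, c
    return best_c, best_score


# Toy usage: three identity layers, steering only layer 2.
identity = lambda h: list(h)
out = forward_with_steering([identity] * 3, [3.0, 4.0], [[1.0, 0.0]] * 3, 0.5, c=2, n=0)
```

Scaling by the hidden-state norm `r` keeps the size of the intervention proportional to the activation it modifies, so a single $\alpha$ has a comparable effect at layers whose activations differ in magnitude. Setting `c=1` and `n=num_layers - 1` in this sketch steers every layer, which removes the sweep entirely.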
## Appendix J Gemma Steering Vector Results
To evaluate the generalizability of our steering intervention across model families, we apply the inference-time recovery method to *Gemma2-9B*. The experimental setup follows the procedure outlined in Section [4.4](https://arxiv.org/html/2606.03291#S4.SS4), using steering vectors derived exclusively from an auxiliary English forget set. For these experiments, we apply the steering intervention over a window of $N=6$ consecutive transformer blocks to account for the model's greater depth compared to Qwen2.5-7B. For the hyperparameter $\alpha$, we perform a grid search and select $\alpha=0.8$ for English, $\alpha=1.0$ for Chinese, and $\alpha=1.2$ for German, Russian, and Turkish (baseline $\alpha=1.0$). The results, presented in Figure [8](https://arxiv.org/html/2606.03291#A10.F8), show that the steering intervention on Gemma2-9B restores nearly the full performance of the original fine-tuned model across languages.
Figure 8: Gemma2-9B steering injection results. Effect of layer-wise steering on NLI score for the forget set. Each heatmap shows the change in score when injecting a normalized steering direction with scale $\alpha$, where $\alpha$ is selected separately for each language. The left heatmap uses extracted steering vectors, which recover almost all of the forgotten knowledge when applied in middle and late layers. The right heatmap uses random Gaussian directions with matched norm and scale; the much weaker recovery indicates that the extracted vectors capture a structured unlearning direction rather than generic noise. This result is consistent with, yet more pronounced than, that of the Qwen2.5-7B model.
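The random-direction control in the right heatmap can be illustrated with a short sketch. It assumes that "matched norm" means the random direction is rescaled to unit length, like the extracted $\mathbf{g}^{(l)}$, so that the same $\alpha$ and hidden-state scaling apply to both; the paper's exact sampling procedure is not specified here, and `random_unit_direction` is a hypothetical name.

```python
import math
import random
from typing import List


def random_unit_direction(dim: int, rng: random.Random) -> List[float]:
    """Draw an isotropic Gaussian vector and rescale it to unit L2 norm."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        length = math.sqrt(sum(x * x for x in v))
        if length > 0.0:  # resample in the (practically impossible) zero case
            return [x / length for x in v]


# A seeded generator makes the control reproducible across runs.
control = random_unit_direction(8, random.Random(0))
```

Since both the extracted and the random direction have unit norm and are injected with the same scale, any gap in recovery between the two heatmaps is attributable to the direction itself and not to the magnitude of the perturbation.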
## Appendix K Additional Steering Vector Experiments
### K.1 Chinese as the Source Language
In the main text, we construct steering vectors using English questions as the source language. To test whether the recovery effect depends on this choice, we repeat the same experiment using Chinese questions to extract the steering vectors while keeping the evaluation protocol unchanged. The resulting recovery remains effective across all target languages. For Qwen2.5-7B with DPO unlearning, the best recovery scores are 68.21% for English, 49.89% for Chinese, 41.67% for German, 38.21% for Russian, and 56.15% for Turkish. These results suggest that the steering vector is not tied to English-specific lexical information. Instead, they are consistent with our interpretation that computing activation differences on identical inputs largely cancels out input-specific semantic content and captures a more general unlearning-induced direction in the shared representation space.
### K.2 Additional Unlearning Methods
To test whether steering-vector recovery is specific to DPO-based unlearning, we further evaluate the same procedure on models unlearned with gradient ascent (GA) and negative preference optimization (NPO). We conduct this experiment on Gemma2-9B and tune the intervention strength $\alpha$ and injection layer for each language.
For GA, the steering vector recovers forgotten information across all five languages, with best scores of 55.77% for English, 80.60% for Chinese, 41.58% for German, 62.48% for Russian, and 52.11% for Turkish. For NPO, we observe the same qualitative pattern, with best scores of 68.95% for English, 63.34% for Chinese, 57.54% for German, 65.99% for Russian, and 51.49% for Turkish. These results indicate that the recoverability of forgotten information is not unique to DPO, but also appears under other fine-tuning-based unlearning methods.
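To compare the two methods at a glance, the per-language scores above can be tabulated and summarized. The snippet below only rearranges the numbers reported in this subsection; the cross-language mean is a convenience statistic computed here, not one reported in the paper, and `summarize` is a hypothetical helper.

```python
from statistics import mean
from typing import Dict

# Best recovery scores (%) on Gemma2-9B, copied from the text above.
best_scores: Dict[str, Dict[str, float]] = {
    "GA": {"English": 55.77, "Chinese": 80.60, "German": 41.58,
           "Russian": 62.48, "Turkish": 52.11},
    "NPO": {"English": 68.95, "Chinese": 63.34, "German": 57.54,
            "Russian": 65.99, "Turkish": 51.49},
}


def summarize(scores: Dict[str, Dict[str, float]]) -> Dict[str, Dict[str, object]]:
    """Per method: unweighted mean over languages, plus weakest and strongest language."""
    return {
        method: {
            "mean": mean(by_lang.values()),
            "weakest": min(by_lang, key=by_lang.get),
            "strongest": max(by_lang, key=by_lang.get),
        }
        for method, by_lang in scores.items()
    }


for method, stats in summarize(best_scores).items():
    print(f"{method}: mean={stats['mean']:.2f}% "
          f"weakest={stats['weakest']} strongest={stats['strongest']}")
```

The summary makes the spread visible: GA recovery ranges widely across languages, while NPO recovery is more uniform, with every language above 50%.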
### K.3 Simplified All-Layer Steering Intervention
In our initial steering experiments, we injected steering vectors over selected layer windows, which introduced an additional hyperparameter controlling the injection range. We found that this procedure can be simplified by applying the per-layer steering vectors across all layers with a substantially smaller intervention strength $\alpha$. This removes the need to tune the layer-window hyperparameter and reduces the cost of hyperparameter search.