Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation

arXiv cs.CL 05/15/26, 04:00 AM Papers
multilingual machine-unlearning evaluation-metrics llm privacy cross-lingual
Summary
This paper proposes two new metrics—Knowledge Separability Score (KSS) and Knowledge Persistence Score (KPS)—to evaluate cross-linguistic information removal in multilingual machine unlearning for LLMs, addressing shortcomings of prior per-language evaluation protocols.
arXiv:2605.14404v1 Announce Type: new Abstract: While LLMs are increasingly used in commercial services, they pose privacy risks such as leakage of sensitive personally identifiable information (PII). For LLMs trained on multilingual corpora, Multilingual Machine Unlearning (MMU) aims to remove information across multiple languages. However, prior MMU evaluations fail to capture such cross-linguistic distribution of information, being largely limited to direct extensions of per-language evaluation protocols. To this end, we propose two metrics to evaluate the information spread across languages: the Knowledge Separability Score (KSS) and the Knowledge Persistence Score (KPS). KSS measures the overall unlearning quality across multiple languages, while KPS more specifically aims to assess consistent removal of information among different language pairs. We evaluated various unlearning methods in the multilingual setting with these metrics and conducted comprehensive analyses. Through our investigation, we provide insights into unique phenomena exclusive to MMU and offer a new perspective on MMU evaluation.
# Bridging the Gap in Multilingual Machine Unlearning Evaluation
Source: [https://arxiv.org/html/2605.14404](https://arxiv.org/html/2605.14404)
## Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation

Kyomin Hwang1∗, Hyeonjin Kim1∗, Sangyeon Cho3,4, Nojun Kwak1,2†

1GSCST, Seoul National University; 2AIIS, Seoul National University; 3Department of Artificial Intelligence, Chung-Ang University; 4Korean Surgical Researcher Foundation, Republic of Korea

{kyomin98, peaceful1, nojunk}@snu.ac.kr, whtkddus98@cau.ac.kr

###### Abstract

While LLMs are increasingly used in commercial services, they pose privacy risks such as leakage of sensitive personally identifiable information \(PII\)\. For LLMs trained on multilingual corpora, Multilingual Machine Unlearning \(MMU\) aims to remove information across multiple languages\. However, prior MMU evaluations fail to capture such cross\-linguistic distribution of information, being largely limited to direct extensions of per\-language evaluation protocols\. To this end, we propose two metrics to evaluate the information spread across languages: the Knowledge Separability Score \(KSS\) and the Knowledge Persistence Score \(KPS\)\. KSS measures the overall unlearning quality across multiple languages, while KPS more specifically aims to assess consistent removal of information among different language pairs\. We evaluated various unlearning methods in the multilingual setting with these metrics and conducted comprehensive analyses\. Through our investigation, we provide insights into unique phenomena exclusive to MMU and offer a new perspective on MMU evaluation\.

∗Equal contribution. †Corresponding author.

## 1 Introduction

![Refer to caption](https://arxiv.org/html/2605.14404v1/x1.png)

Figure 1: Illustration of the evaluation method in conventional MMU. Existing approaches evaluate knowledge (e.g., Birth, Hobby) independently for each language. Consequently, this language-wise assessment fails to verify whether knowledge that has spread across different languages has been successfully removed.

Machine Unlearning (MU) aims to remove sensitive information from a Large Language Model (LLM) Wang et al. ([2024](https://arxiv.org/html/2605.14404#bib.bib9)). Since the feasibility of unlearning via gradient ascent was demonstrated, subsequent methods have been developed and evaluated on English datasets, focusing on erasing the specified content without degrading overall performance Zhang et al. ([2024](https://arxiv.org/html/2605.14404#bib.bib11)); Liu et al. ([2022](https://arxiv.org/html/2605.14404#bib.bib18)); Maini et al. ([2024](https://arxiv.org/html/2605.14404#bib.bib12)); Shi et al. ([2024](https://arxiv.org/html/2605.14404#bib.bib13)). However, previous works simulate MU with an English-only dataset, leaving a gap to real-world deployment.

To bridge this gap, recent studies have begun to investigate Multilingual MU (MMU) Choi et al. ([2024](https://arxiv.org/html/2605.14404#bib.bib17)); Lu and Koehn ([2024](https://arxiv.org/html/2605.14404#bib.bib21)); Hwang et al. ([2025](https://arxiv.org/html/2605.14404#bib.bib43)). [Choi et al.](https://arxiv.org/html/2605.14404#bib.bib17) argue that relying solely on English data leads to insufficient forgetting if the target knowledge has been acquired from multiple languages. [Hwang et al.](https://arxiv.org/html/2605.14404#bib.bib43) report the rise of language confusion from English-centric unlearning, while concurrent work by [Lu and Koehn](https://arxiv.org/html/2605.14404#bib.bib21) demonstrates that sensitive information spreads across languages. All three works suggest multilingual parallel unlearning as the solution. However, evaluations in these works are largely limited to a direct extension of English-centric protocols that rely solely on per-language evaluation. It is questionable whether such protocols are sufficient to capture the complex multilingual characteristics of MMU.

As illustrated in Figure [1](https://arxiv.org/html/2605.14404#S1.F1), current evaluation protocols can be misleading: a model may appear to have unlearned information in the evaluated language while the same knowledge remains accessible in another. Consequently, language-wise evaluations cannot determine whether the underlying information has truly been removed, and may overstate unlearning effectiveness. Reliable evaluation therefore requires metrics that verify information inaccessibility consistently across all languages.

In this paper, we establish a comprehensive MMU scenario by 1) specifying how knowledge should be defined in a multilingual setting and 2) clarifying the two distinct mechanisms for its acquisition. Building on this scenario, 3) we design two metrics suited to multilingual evaluation. We define knowledge in MMU as an instance that has been obtained and expressed in multiple languages. This knowledge can be acquired through either direct memorization or indirect cross-linguistic spread. To simulate both settings, we generated a multilingual parallel dataset spanning 10 languages, each containing 3,800 instances, where eight languages are used for memorization while the remaining two are held out for evaluation. We assessed both scenarios using our metrics designed to capture the multilingual nature of knowledge: the Knowledge Separability Score (KSS), which evaluates the overall unlearning quality across all languages, and the Knowledge Persistence Score (KPS), which specifically quantifies consistent removal of information between language pairs. Through extensive evaluations, we provide deeper insights into the unique phenomena of MMU and present a new paradigm for evaluation.

To sum up, our contributions are as follows:

- We conducted extensive analysis and experiments on various unlearning methods. To this end, we constructed a large-scale multilingual parallel dataset (3,800 QA pairs $\times$ 10 languages).
- We proposed the Knowledge Separability Score (KSS) and the Knowledge Persistence Score (KPS) to evaluate performance in MMU.
- Through extensive analysis using KSS and KPS, we demonstrate the usefulness of specialized metrics tailored for accurately measuring performance in MMU.

## 2 Related Work

### 2.1 Machine Unlearning

Machine Unlearning (MU) aims to selectively eliminate sensitive information from pre-trained LLMs while preserving the remaining knowledge. Existing approaches are typically grouped into optimization-based methods Jang et al. ([2022](https://arxiv.org/html/2605.14404#bib.bib10)); Liu et al. ([2022](https://arxiv.org/html/2605.14404#bib.bib18)); Zhang et al. ([2024](https://arxiv.org/html/2605.14404#bib.bib11)) and pruning-based methods Pochinkov and Schoots ([2024](https://arxiv.org/html/2605.14404#bib.bib46)). However, existing studies on MU have been largely English-centric, which is misaligned with the multilingual nature of modern LLM deployment. Multilingual MU (MMU) studies Choi et al. ([2024](https://arxiv.org/html/2605.14404#bib.bib17)); Hwang et al. ([2025](https://arxiv.org/html/2605.14404#bib.bib43)); Lu and Koehn ([2024](https://arxiv.org/html/2605.14404#bib.bib21)) have emerged in this context, pointing out the insufficiency of English-only unlearning. They have analysed unique phenomena and developed unlearning methods, yet effective evaluation of multilingual unlearning performance remains unexplored.

### 2.2 Evaluation

Up until now, MU evaluation protocols have largely been developed in English-centric settings. Existing metrics can be broadly categorized into two groups: 1) probability-based metrics and 2) generation-based metrics. Probability-based metrics assess how confidently a model knows the information. For example, TOFU Maini et al. ([2024](https://arxiv.org/html/2605.14404#bib.bib12)) uses the probabilities assigned to the corresponding answer to quantify degrees of forgetting and retention. In contrast, generation-based metrics either measure output-level agreement with a reference Lin ([2004](https://arxiv.org/html/2605.14404#bib.bib29)) or rely on LLM-as-a-judge style evaluations Liu et al. ([2025](https://arxiv.org/html/2605.14404#bib.bib55)).

These protocols are frequently applied to MMU without modification Choi et al. ([2024](https://arxiv.org/html/2605.14404#bib.bib17)); Hwang et al. ([2025](https://arxiv.org/html/2605.14404#bib.bib43)). However, the MMU scenario differs from the English-centric scenario in two respects. First, knowledge is not confined to a single language but is distributed across multiple languages. Second, such multilingual information is acquired through both direct memorization Choi et al. ([2024](https://arxiv.org/html/2605.14404#bib.bib17)) and indirect cross-linguistic spread Lu and Koehn ([2024](https://arxiv.org/html/2605.14404#bib.bib21)). From this viewpoint, we identify two limitations of applying English-centric evaluations directly to MMU: 1) evaluating each language in isolation is insufficient to verify whether specific information has been completely removed across all languages, and 2) existing studies typically address only one of the two knowledge acquisition mechanisms. To this end, we propose two metrics that can evaluate knowledge across multiple languages, and conduct experiments on both settings within a unified framework.

## 3 Problem Formulation

#### Multilingual MU

In Machine Unlearning (MU), there are three states of a model. A *pre-trained model*, $F_{\theta_0}$, refers to the model that has not yet been fine-tuned on a specific dataset. After being fine-tuned to memorize specific information, the model becomes a *memorized model*, denoted as $F^{M}_{\theta}$. Finally, the *unlearned model*, which has been updated to forget some memorized knowledge, is denoted as $F^{U}_{\theta}$. For MU tasks, three types of datasets are required: a fine-tuning set $\mathcal{D}$, a forget set $\mathcal{D}_f$, and a retain set $\mathcal{D}_r$. For MMU tasks, all three datasets $\mathcal{D}_f$, $\mathcal{D}_r$, and $\mathcal{D}$ consist of multilingual parallel QA pairs, where each pair contains semantically equivalent content across different languages:

$$
\begin{aligned}
\mathcal{D} &= \{k_{i,l} \triangleq (q_{i,l}, a_{i,l}) \mid i \in \mathcal{I},\ l \in \mathbb{L}\}, \\
\mathcal{D}_f &= \{k_{i,l} \triangleq (q_{i,l}, a_{i,l}) \mid i \in \mathcal{I}_f,\ l \in \mathbb{L}\}, \\
\mathcal{D}_r &= \{k_{i,l} \triangleq (q_{i,l}, a_{i,l}) \mid i \in \mathcal{I}_r,\ l \in \mathbb{L}\},
\end{aligned}
\tag{1}
$$

where $\mathbb{L}$ denotes the set of languages and $k_{i,l}$ indicates the $i$-th instance in language $l$. $\mathcal{I}$ is the union of two disjoint index sets $\mathcal{I}_f$ and $\mathcal{I}_r$, which enumerate the instances in the forget set and the retain set, respectively ($\mathcal{I} = \mathcal{I}_f \cup \mathcal{I}_r,\ \mathcal{I}_f \cap \mathcal{I}_r = \emptyset$). Similarly, $\mathcal{D}$ denotes the disjoint union of $\mathcal{D}_f$ and $\mathcal{D}_r$. Dataset $\mathcal{D}$ can be viewed as a two-dimensional $|\mathcal{I}| \times |\mathbb{L}|$ matrix with index-wise rows and language-wise columns.
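The matrix view of the dataset in Eq. (1) can be illustrated with a small sketch. The language codes, index values, and placeholder strings are illustrative assumptions, not the paper's data; the point is only that the forget/retain split is made by instance index, so every language version of an instance moves together.

```python
from typing import Dict, List, Set, Tuple

QA = Tuple[str, str]                 # (question, answer)
Dataset = Dict[Tuple[int, str], QA]  # (i, l) -> k_{i,l}


def build_parallel_dataset(indices: List[int], languages: List[str]) -> Dataset:
    """Build D as a mapping (i, l) -> (q_{i,l}, a_{i,l}) with placeholder text."""
    return {(i, l): (f"q[{i},{l}]", f"a[{i},{l}]") for i in indices for l in languages}


def split_by_index(dataset: Dataset, forget_idx: Set[int]) -> Tuple[Dataset, Dataset]:
    """Split D into D_f and D_r by instance index (row), never by language (column)."""
    d_f = {key: qa for key, qa in dataset.items() if key[0] in forget_idx}
    d_r = {key: qa for key, qa in dataset.items() if key[0] not in forget_idx}
    return d_f, d_r


languages = ["en", "zh", "de"]  # hypothetical subset of the language set L
indices = list(range(6))        # hypothetical index set I
forget_idx = {0, 3}             # hypothetical I_f; I_r is the remainder

D = build_parallel_dataset(indices, languages)
D_f, D_r = split_by_index(D, forget_idx)
```

Here `D` has $|\mathcal{I}| \times |\mathbb{L}| = 6 \times 3$ entries, and `D_f` and `D_r` partition it row-wise.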

Unlearning methods commonly employ the following loss function on top of $F_{\theta}^{M}$:

$$
\mathcal{L}(\mathcal{D}_f, \mathcal{D}_r) = \mathcal{L}_f(\mathcal{D}_f) + \mathcal{L}_r(\mathcal{D}_r), \tag{2}
$$

where $\mathcal{L}_f$ and $\mathcal{L}_r$ denote the forget and retain losses, respectively.
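As one possible instance of Eq. (2), the sketch below combines a gradient-ascent-style forget term with a standard gradient-descent retain term. This particular choice of $\mathcal{L}_f$ and $\mathcal{L}_r$, the per-token averaging, and the function names are assumptions made for illustration; the methods evaluated later each define their own terms.

```python
from typing import Sequence


def mean_nll(batch: Sequence[Sequence[float]]) -> float:
    """Mean negative log-likelihood; each example is a list of per-token log-probs of its answer."""
    per_example = [-sum(tokens) / len(tokens) for tokens in batch]
    return sum(per_example) / len(per_example)


def unlearning_loss(
    forget_batch: Sequence[Sequence[float]],
    retain_batch: Sequence[Sequence[float]],
) -> float:
    """L = L_f + L_r under the assumed forget/retain terms."""
    loss_f = -mean_nll(forget_batch)  # assumed forget term: minimizing this raises forget-set NLL
    loss_r = mean_nll(retain_batch)   # assumed retain term: ordinary NLL on the retain set
    return loss_f + loss_r
```

Minimizing this objective pushes the likelihood of forget-set answers down while keeping retain-set answers likely.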

![Refer to caption](https://arxiv.org/html/2605.14404v1/x2.png)

Figure 2: Overview illustration of our setting. A knowledge refers to an instance which may be expressed multilingually. Target Knowledge is the knowledge in the forget set, while Non-Target Knowledge is the one in the retain set. In this setting, we propose metrics specifically designed for the evaluation of the knowledge.

#### Our Setting

In traditional English-centric MU, knowledge is expressed solely in English. In Multilingual MU (MMU), by contrast, knowledge can be expressed across multiple languages. Such knowledge can be acquired directly through multilingual training or derived from cross-lingual spread. For MMU, we categorize knowledge into Target Knowledge (to be unlearned) and Non-target Knowledge (to be retained). With a slight abuse of MATLAB-style matrix notation, we formally define the $i$-th Target Knowledge ($k_i^{\text{T}}$) and the $j$-th Non-target Knowledge ($k_j^{\text{N}}$) as

$$
k_i^{\text{T}} = k_{i,:},\ i \in \mathcal{I}_f, \quad k_j^{\text{N}} = k_{j,:},\ j \in \mathcal{I}_r. \tag{3}
$$

![Refer to caption](https://arxiv.org/html/2605.14404v1/x3.png)

Figure 3: Overview of the multilingual parallel QA dataset generation pipeline.

Each $k_i^{\text{T}}$ and $k_j^{\text{N}}$ is composed of various languages but shares identical semantics. In this context, MMU must remove the target knowledge while retaining the non-target knowledge.

In Multilingual LLMs, knowledge acquired in one language spreads to other languages, a phenomenon known as cross-linguistic spread Lu and Koehn ([2024](https://arxiv.org/html/2605.14404#bib.bib21)). To measure the unlearning performance in this context, we conducted experiments using a setting that includes hold-out languages that were not utilized in either the memorization or the unlearning phases. To simulate this scenario, we employed 10 languages. Five were chosen from high-resource languages: English, Chinese, German, Russian, and Spanish, while the others were chosen from low-resource languages: Bengali, Hebrew, Tamil, Afrikaans, and Albanian.

The selected languages are divided into Training and Hold-out languages for observation. The training languages are directly utilized for memorization and unlearning, while hold-out languages are only employed during evaluation.

- Training: English, Chinese, German, Russian, Bengali, Hebrew, Tamil, Albanian.
- Hold-out: Afrikaans, Spanish.

Here, we denote the set of training languages and hold-out languages as $\mathbb{L}_{\text{Train}}$ and $\mathbb{L}_{\text{Hold}}$, respectively. Figure [2](https://arxiv.org/html/2605.14404#S3.F2) summarizes our overall setting.

## 4 Dataset Generation

### 4.1 Overview

Knowledge within multilingual LLMs is often distributed across diverse languages instead of being confined to a single linguistic context. To simulate such a setting, we introduce a multilingual parallel dataset. Inspired by TOFU Maini et al. ([2024](https://arxiv.org/html/2605.14404#bib.bib12)), we first generated 200 synthetic profiles to clearly isolate the effect of unlearning from the model's pre-trained knowledge. From the profiles, an English question-answer (QA) dataset was constructed with 19 attribute-specific QA pairs per profile. We subsequently translated the English QA dataset into 9 other languages (four high-resource languages and five low-resource languages) to conduct MMU experiments. Figure [3](https://arxiv.org/html/2605.14404#S3.F3) shows an overview of the data-generation pipeline: 1) generate English synthetic profiles, 2) prompt an LLM to produce English QA pairs for each profile, 3) translate the QA pairs into multiple languages to form a parallel multilingual MMU dataset, and 4) verify the translations via back-translation to English.

### 4.2 Synthetic Profile Generation

We assigned 20 attributes to each synthetic profile, including Name, Year of Birth, etc. Before constructing the attributes, we generated 200 unique fictitious names in English using the Faker library Faraglia ([2025](https://arxiv.org/html/2605.14404#bib.bib20)). We then pre-specified values for every attribute. The value pools for each attribute are listed in Appendix [B](https://arxiv.org/html/2605.14404#A2). To improve the quality of the synthetic profiles, human annotators reviewed the generated profiles and removed cases that were inconsistent with common sense (e.g., a barista working fully remote). Examples of the resulting profiles appear in Appendix [C](https://arxiv.org/html/2605.14404#A3).

### 4.3 QA Dataset Generation

We employed the Qwen3-225B-A22B-Thinking-2507 Yang et al. ([2025a](https://arxiv.org/html/2605.14404#bib.bib19)) model to generate 19 distinct QA datasets from the 200 synthetic profiles introduced in the previous section. To make each question focus on a single attribute, we provide the LLM with only the subject's name and one attribute, then have it generate the corresponding QA pair. The full prompt is provided in Figure [9](https://arxiv.org/html/2605.14404#A4.F9). For quality control, human annotators manually corrected each QA pair whose content was not aligned with the corresponding profile. Representative QA examples appear in Figure [8](https://arxiv.org/html/2605.14404#A3.F8).

### 4.4 Translate QA Dataset

![Refer to caption](https://arxiv.org/html/2605.14404v1/x4.png)

Figure 4: Overview of the Knowledge Separability Score (KSS) and Knowledge Persistence Score (KPS). KSS-ROC measures the overall separability between the target and non-target knowledge, while KSS-PR evaluates how consistently the model assigns higher $S_i$ to the target knowledge compared to the non-target knowledge. KPS quantifies the extent to which knowledge that is inaccessible in one language persists in another.

We translated the 3,800 English QA pairs (19 $\times$ 200), derived from synthetic profiles, into 9 languages using the Google Translation API Cloud ([2025](https://arxiv.org/html/2605.14404#bib.bib22)). Because the profiles are synthetic and thus unfamiliar to the models, maintaining identity consistency across languages is crucial. Accordingly, following prior multilingual benchmarks Pan et al. ([2017](https://arxiv.org/html/2605.14404#bib.bib27)); Schwenk et al. ([2021](https://arxiv.org/html/2605.14404#bib.bib28)), we leave personal names untranslated. To ensure translation quality, we adopt a back-translation-based verification-and-refinement pipeline inspired by Joshi et al. ([2025](https://arxiv.org/html/2605.14404#bib.bib23)). Specifically, we employed the Google Translation API to translate each English source sentence into the target language and back into English. We then assess semantic equivalence using Qwen3-225B-A22B-Thinking-2507 (see Figure [10](https://arxiv.org/html/2605.14404#A4.F10) for the verification prompt). If the equivalence check fails, we revise the translation using ChatGPT Achiam et al. ([2023b](https://arxiv.org/html/2605.14404#bib.bib14)) and repeat the above process. We iterate this verify-and-refine cycle until a human annotator confirms that the back-translation is semantically equivalent to the source English dataset. We apply this procedure to all target languages to obtain semantically consistent translations.

Examples from the resulting multilingual parallel QA dataset are shown in Figure [13](https://arxiv.org/html/2605.14404#A6.F13).
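The verify-and-refine cycle described in this section can be sketched as a simple loop. The callables `translate`, `is_equivalent`, and `refine` are hypothetical stand-ins for the translation API, the LLM equivalence check, and the LLM-based revision; the `max_rounds` cap and the returned flag (marking items that would still need human attention) are assumptions added so the sketch terminates.

```python
from typing import Callable, Tuple


def verify_and_refine(
    source_en: str,
    target_lang: str,
    translate: Callable[[str, str, str], str],  # (text, src_lang, tgt_lang) -> text
    is_equivalent: Callable[[str, str], bool],  # (source_en, back_translation) -> bool
    refine: Callable[[str, str, str], str],     # (source_en, candidate, target_lang) -> candidate
    max_rounds: int = 5,                        # assumed cap, not from the paper
) -> Tuple[str, bool]:
    """Return (translation, verified) after at most max_rounds back-translation checks."""
    candidate = translate(source_en, "en", target_lang)
    for _ in range(max_rounds):
        back = translate(candidate, target_lang, "en")  # back-translate to English
        if is_equivalent(source_en, back):
            return candidate, True
        candidate = refine(source_en, candidate, target_lang)  # revise and re-check
    return candidate, False  # still unverified: hand over for manual review
```

In the pipeline described above, the final acceptance decision rests with a human annotator; the boolean here only marks which items passed the automatic check.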

## 5 Knowledge Evaluation in MMU

Current MMU evaluation methods are typically direct extensions of English-centric approaches conducted in a language-wise manner. However, this approach fails to assess knowledge distributed across diverse languages as a whole. To address this limitation, we propose new metrics based on the following principles: 1) Holistic Evaluation of Unlearning Quality: measuring MMU performance requires a unified metric capable of assessing knowledge across multiple languages. 2) Cross-lingual Consistent Forgetting: metrics should specifically quantify the consistent removal of sensitive information between language pairs. To this end, we first review existing metrics used in language-wise evaluation and then propose novel metrics specifically tailored for MMU. Figure [4](https://arxiv.org/html/2605.14404#S4.F4) provides an overview of the aspects measured by our proposed metrics.

### 5.1 Language-wise Evaluation

Prior MMU studies directly extend the English\-centric protocols, conducting performance evaluations separately for each language\. Commonly used evaluation metrics are broadly categorized into two types: probability\-based and generation\-based\.

#### Probability

Probability is measured as the length-normalized conditional probability $\mathcal{P}(a \mid q)^{1/|a|_{\text{tok}}}$, where $q$ denotes the question sentence and $a$ denotes the corresponding answer. $|a|_{\text{tok}}$ is the token length of $a$.
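A minimal sketch of this quantity, assuming the per-token log-probabilities of the answer (conditioned on the question and the preceding answer tokens) have already been obtained from the model. The function name and input format are illustrative. Working in log space avoids underflow for long answers: the $1/|a|_{\text{tok}}$ power becomes a mean of log-probabilities.

```python
import math
from typing import Sequence


def length_normalized_prob(answer_token_logprobs: Sequence[float]) -> float:
    """Compute P(a|q)^(1/|a|_tok) as exp(mean of per-token log-probabilities)."""
    if not answer_token_logprobs:
        raise ValueError("answer must contain at least one token")
    return math.exp(sum(answer_token_logprobs) / len(answer_token_logprobs))


# Two answer tokens with probabilities 0.9 and 0.1: sqrt(0.9 * 0.1) = 0.3
example = length_normalized_prob([math.log(0.9), math.log(0.1)])
```

The result is the geometric mean of the token probabilities, so answers of different token lengths become comparable.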

#### Semantic Equivalence

LLMs can assess whether the model outputs are semantically identical to the ground truth. To mitigate potential ambiguity arising from evaluating in low-resource languages, we translated the generated outputs and the ground truths into English using NLLB-200-3.3B, a multilingual model specialized in translation Costa-Jussà et al. ([2022](https://arxiv.org/html/2605.14404#bib.bib54)). We define Semantic Equivalence (SE) as follows:

$$
\text{SE}(q, a) = \mathbb{I}\left(\text{LLM}(\mathcal{T}(F_{\theta}(q)), \mathcal{T}(a))\right), \tag{4}
$$

where $\mathcal{T}$ denotes the translation into English, and $\mathbb{I}(\text{LLM}(\cdot, \cdot))$ outputs 1 if the LLM determines that the two inputs have the same meaning, and 0 if they do not. We employed GPT-4o-mini Achiam et al. ([2023a](https://arxiv.org/html/2605.14404#bib.bib53)) with greedy decoding for semantic equivalence judgement. The prompt used for evaluation is provided in Figure [11](https://arxiv.org/html/2605.14404#A4.F11). Previous MMU studies used this SE score to evaluate the knowledge of an LLM in a language-wise manner.
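Eq. (4) can be read as a three-step pipeline: generate, translate both sides into English, then judge. The sketch below makes that order explicit. The callables `generate`, `translate_to_en`, and `judge_same_meaning` are hypothetical placeholders for the evaluated model $F_{\theta}$, the translation model $\mathcal{T}$, and the LLM judge; none of them corresponds to a specific library API.

```python
from typing import Callable


def semantic_equivalence(
    question: str,
    answer: str,
    generate: Callable[[str], str],                  # stands in for F_theta
    translate_to_en: Callable[[str], str],           # stands in for T
    judge_same_meaning: Callable[[str, str], bool],  # stands in for the LLM judge
) -> int:
    """Return SE(q, a): 1 if the translated output and reference are judged equivalent, else 0."""
    prediction_en = translate_to_en(generate(question))
    reference_en = translate_to_en(answer)
    return 1 if judge_same_meaning(prediction_en, reference_en) else 0
```

Because both the prediction and the reference pass through the same translation step, the judge only ever compares English text, regardless of the language of the question.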

### 5.2 Knowledge (Instance)-wise Evaluation

Existing MU metrics cannot adequately assess unlearning performance in MMU: *these metrics fail to capture properties that arise uniquely in multilingual scenarios*. To address this limitation, we propose two knowledge-wise metrics: (1) the Knowledge Separability Score (KSS), which summarizes the unlearning performance for both target and non-target knowledge, and (2) the Knowledge Persistence Score (KPS), which quantifies the extent to which target knowledge that is inaccessible in one language remains retrievable in another.

#### Knowledge Separability Score

We propose the Knowledge Separability Score (KSS) as a comprehensive AUC-based measure of MMU performance. KSS is computed in two steps: 1) we derive a knowledge-wise forgetting score $S_i$ that quantifies the degree of forgetting for the $i$-th knowledge, and 2) we use these scores to compute the AUC over the forget and retain sets.

We calculate the knowledge-wise forgetting score $S_i$ for the $i$-th QA pair across the languages in $\mathbb{L}$, $\{(q_{i,l}, a_{i,l}) \mid l \in \mathbb{L}\}$, in two ways. First, the generation-based score $S^{gen}_i$ aggregates the Semantic Equivalence (SE) in a knowledge-wise manner and quantifies the inequivalence by subtracting it from $1$. Second, the probability-based score $S^{prob}_i$ utilizes the length-normalized probability assigned to the ground truth sequence, $\mathcal{P}(a_{i,l} \mid q_{i,l})^{1/|a_{i,l}|_{\text{tok}}}$. We subtract its average over languages from $1$ so that a lower $\mathcal{P}$ corresponds to a higher $S_i$. The scores are formally defined as:

$$
\begin{aligned}
S^{gen}_i &= 1 - \frac{1}{|\mathbb{L}|} \sum_{l \in \mathbb{L}} \text{SE}(q_{i,l}, a_{i,l}), \\
S^{prob}_i &= 1 - \frac{1}{|\mathbb{L}|} \sum_{l \in \mathbb{L}} \mathcal{P}(a_{i,l} \mid q_{i,l})^{1/|a_{i,l}|_{\text{tok}}}.
\end{aligned}
\tag{5}
$$
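Both scores in Eq. (5) are one minus a per-language average, which the following sketch computes for a single knowledge instance. The input lists (one SE value or one length-normalized probability per language) and the function names are illustrative assumptions.

```python
from typing import Sequence


def s_gen(se_per_language: Sequence[int]) -> float:
    """Generation-based forgetting score: 1 - mean SE over languages."""
    return 1.0 - sum(se_per_language) / len(se_per_language)


def s_prob(prob_per_language: Sequence[float]) -> float:
    """Probability-based forgetting score: 1 - mean length-normalized probability."""
    return 1.0 - sum(prob_per_language) / len(prob_per_language)


# One instance evaluated in four languages: still answered correctly in two of them.
score_gen = s_gen([1, 0, 0, 1])             # 0.5
score_prob = s_prob([0.2, 0.4, 0.1, 0.5])   # approximately 0.7
```

A score near 1 means the instance appears forgotten in every language considered, while a score near 0 means it is still accessible everywhere.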
We computed $S_i$ for both target and non-target knowledge and plotted the probability density functions, as shown in Figure [4](https://arxiv.org/html/2605.14404#S4.F4). Using these distributions, we measure KSS with two complementary metrics: the area under the Receiver Operating Characteristic curve (KSS-ROC) and the area under the Precision-Recall curve (KSS-PR). (Note that the ROC curve plots TPR (true positive rate = TP/(TP+FN); y-axis) against FPR (false positive rate = FP/(FP+TN); x-axis), while the PR curve plots Precision (TP/(TP+FP); y-axis) against Recall (TP/(TP+FN); x-axis).) We computed KSS-ROC and KSS-PR by varying the threshold for $S_i$.

While KSS-ROC provides a general measure of separability between forget (target) and retain (non-target) sets, KSS-PR further addresses the severe forget-retain dataset imbalance, i.e., the forget set is typically much smaller than the retain set. Specifically, a high KSS-ROC signifies that the $S_i$ distributions of the forget and retain sets are effectively distinguishable, whereas a high KSS-PR suggests that the model yields consistently elevated $S_i$ scores for the forget set.

Both KSS-ROC and KSS-PR are indispensable for precise evaluation. As demonstrated in Figure [4](https://arxiv.org/html/2605.14404#S4.F4), a high KSS-ROC score alone does not guarantee that non-target knowledge is free from erroneously assigned high forgetting scores ($S_i$). Conversely, a low KSS-PR score does not necessarily imply a lack of global separability. Therefore, these two metrics are mutually complementary. We provide a detailed explanation in Appendix [I](https://arxiv.org/html/2605.14404#A9).
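To make the two areas concrete, the sketch below computes them from lists of $S_i$ values, treating forget-set instances as positives. It is a dependency-free illustration under stated assumptions: the ROC area uses the pairwise-ranking (Mann-Whitney) form, and the PR area is estimated with average precision over distinct thresholds, which is one common estimator and not necessarily the integration rule used in the paper. Function names are hypothetical.

```python
from typing import Sequence


def kss_roc(forget_scores: Sequence[float], retain_scores: Sequence[float]) -> float:
    """ROC AUC: probability that a forget item outscores a retain item (ties count 0.5)."""
    wins = 0.0
    for f in forget_scores:
        for r in retain_scores:
            if f > r:
                wins += 1.0
            elif f == r:
                wins += 0.5
    return wins / (len(forget_scores) * len(retain_scores))


def kss_pr(forget_scores: Sequence[float], retain_scores: Sequence[float]) -> float:
    """Average precision with forget items as positives, sweeping distinct S_i thresholds."""
    n_pos = len(forget_scores)
    thresholds = sorted(set(forget_scores) | set(retain_scores), reverse=True)
    ap, prev_recall = 0.0, 0.0
    for t in thresholds:
        tp = sum(s >= t for s in forget_scores)  # forget items flagged as forgotten
        fp = sum(s >= t for s in retain_scores)  # retain items wrongly flagged
        recall = tp / n_pos
        precision = tp / (tp + fp)
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


forget = [0.9, 0.8]         # hypothetical S_i for target knowledge
retain = [0.1, 0.2, 0.85]   # hypothetical S_i for non-target knowledge
roc_value = kss_roc(forget, retain)
pr_value = kss_pr(forget, retain)
```

In this toy example a single retain instance with a high forgetting score (0.85) lowers both values from 1 to 5/6. With a much larger retain set, a few such instances would barely move the ROC area but could sharply reduce precision, which is the imbalance sensitivity that motivates reporting KSS-PR alongside KSS-ROC.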

#### Knowledge Persistence Score

To quantify the degree of persistence of the target knowledge, we propose the Knowledge Persistence Score (KPS). For a base language $l_1$ and a comparison language $l_2$, we define the pairwise persistence score as the fraction of samples judged as forgotten in $l_1$ ($\text{SE}(q_{i,l_1}, a_{i,l_1}) = 0$) that are still retained in $l_2$ ($\text{SE}(q_{i,l_2}, a_{i,l_2}) = 1$):

$$
\begin{aligned}
ps(l_1, l_2) &= \frac{1}{|\mathcal{I}(l_1)|} \sum_{i \in \mathcal{I}(l_1)} \text{SE}(q_{i,l_2}, a_{i,l_2}), \\
\mathcal{I}(l_1) &\triangleq \{i \in \mathcal{I}_f \mid \text{SE}(q_{i,l_1}, a_{i,l_1}) = 0\}.
\end{aligned}
\tag{6}
$$

$ps(l_1, l_2)$ is the retention of the target knowledge in $l_2$ conditioned on forgetting in $l_1$. It therefore measures how consistently forgetting occurs between the two languages.

Given a set of comparison languages $\mathbb{L}_2$ such that $l_1 \notin \mathbb{L}_2$, we aggregate pairwise persistence scores by averaging over $l_2 \in \mathbb{L}_2$:

$$
\text{KPS}(l_1, \mathbb{L}_2) = \frac{1}{|\mathbb{L}_2|} \sum_{l_2 \in \mathbb{L}_2} ps(l_1, l_2). \tag{7}
$$

KPS provides a quantitative measure of how easily the target knowledge, once unlearned in $l_1$, can be recovered by querying the model in $\mathbb{L}_2$. A smaller KPS indicates better unlearning performance in MMU.
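The pairwise persistence score and its average over comparison languages can be sketched as follows. The nested-dictionary layout `se[language][index]`, the language codes, and the choice to return `None` when no forget-set instance is judged forgotten in $l_1$ (a case the definition leaves undefined) are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Sequence

SETable = Dict[str, Dict[int, int]]  # se[language][instance index] -> 0 or 1


def pairwise_persistence(se: SETable, forget_idx: Sequence[int], l1: str, l2: str) -> Optional[float]:
    """ps(l1, l2): share of instances forgotten in l1 that are still answered in l2."""
    forgotten_in_l1 = [i for i in forget_idx if se[l1][i] == 0]
    if not forgotten_in_l1:
        return None  # assumed handling: undefined when nothing is forgotten in l1
    return sum(se[l2][i] for i in forgotten_in_l1) / len(forgotten_in_l1)


def kps(se: SETable, forget_idx: Sequence[int], l1: str, comparison_langs: List[str]) -> Optional[float]:
    """KPS(l1, L2): mean of ps(l1, l2) over the comparison languages."""
    if l1 in comparison_langs:
        raise ValueError("the base language must not be in the comparison set")
    scores = [pairwise_persistence(se, forget_idx, l1, l2) for l2 in comparison_langs]
    if any(s is None for s in scores):
        return None
    return sum(scores) / len(scores)


# Hypothetical SE judgements for four forget-set instances in three languages.
se_table = {
    "en": {0: 0, 1: 0, 2: 1, 3: 0},
    "es": {0: 1, 1: 0, 2: 1, 3: 0},
    "af": {0: 1, 1: 1, 2: 1, 3: 0},
}
kps_value = kps(se_table, [0, 1, 2, 3], "en", ["es", "af"])
```

Here instances 0, 1, and 3 are forgotten in `en`; one of them is still answered in `es` and two in `af`, so the pairwise scores are 1/3 and 2/3 and the aggregate is 0.5.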

### 5.3 Experimental Setting

| Method | Type | ROC p1 C1 | ROC p1 C2 | ROC p3 C1 | ROC p3 C2 | ROC p5 C1 | ROC p5 C2 | PR p1 C1 | PR p1 C2 | PR p3 C1 | PR p3 C2 | PR p5 C1 | PR p5 C2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MEM | Prob | 0.52 | 0.45 | 0.51 | 0.49 | 0.51 | 0.49 | 0.01 | 0.01 | 0.03 | 0.03 | 0.05 | 0.05 |
| MEM | Gen | 0.51 | 0.50 | 0.51 | 0.48 | 0.51 | 0.49 | 0.01 | 0.03 | 0.03 | 0.03 | 0.05 | 0.05 |
| GA | Prob | 0.57 (+10) | 0.89 (+98) | 0.53 (+4) | 0.81 (+65) | 0.52 (+2) | 0.66 (+35) | 0.01 (+0) | 0.39 (+3800) | 0.04 (+33) | 0.24 (+700) | 0.05 (+0) | 0.12 (+140) |
| GA | Gen | 0.57 (+12) | 0.70 (+40) | 0.53 (+4) | 0.65 (+35) | 0.53 (+4) | 0.55 (+12) | 0.01 (+0) | 0.15 (+400) | 0.03 (+0) | 0.10 (+233) | 0.05 (+0) | 0.07 (+40) |
| GAGDR | Prob | 0.61 (+17) | 0.91 (+102) | 0.54 (+6) | 0.78 (+59) | 0.54 (+6) | 0.72 (+47) | 0.02 (+100) | 0.46 (+4500) | 0.03 (+0) | 0.14 (+367) | 0.06 (+20) | 0.15 (+200) |
| GAGDR | Gen | 0.57 (+12) | 0.77 (+54) | 0.52 (+2) | 0.65 (+35) | 0.52 (+2) | 0.62 (+27) | 0.01 (+0) | 0.18 (+500) | 0.03 (+0) | 0.05 (+67) | 0.05 (+0) | 0.08 (+60) |
| GAKLR | Prob | 0.66 (+27) | 0.96 (+113) | 0.57 (+12) | 0.83 (+69) | 0.55 (+8) | 0.71 (+45) | 0.02 (+100) | 0.64 (+6300) | 0.04 (+33) | 0.20 (+567) | 0.07 (+40) | 0.13 (+160) |
| GAKLR | Gen | 0.67 (+31) | 0.85 (+70) | 0.56 (+10) | 0.69 (+44) | 0.55 (+8) | 0.62 (+27) | 0.02 (+100) | 0.47 (+1467) | 0.03 (+0) | 0.10 (+233) | 0.06 (+20) | 0.10 (+100) |
| NPO | Prob | 0.70 (+35) | 0.99 (+120) | 0.59 (+16) | 0.89 (+82) | 0.51 (+0) | 0.65 (+33) | 0.03 (+200) | 0.88 (+8700) | 0.06 (+100) | 0.53 (+1667) | 0.05 (+0) | 0.17 (+240) |
| NPO | Gen | 0.66 (+29) | 0.91 (+82) | 0.60 (+18) | 0.74 (+54) | 0.56 (+10) | 0.59 (+20) | 0.02 (+100) | 0.48 (+1500) | 0.04 (+33) | 0.19 (+533) | 0.06 (+20) | 0.08 (+60) |
| PRUNE | Prob | 0.76 (+46) | 0.91 (+102) | 0.66 (+29) | 0.85 (+73) | 0.63 (+24) | 0.82 (+67) | 0.07 (+600) | 0.12 (+1100) | 0.06 (+100) | 0.16 (+433) | 0.08 (+60) | 0.18 (+260) |
| PRUNE | Gen | 0.68 (+33) | 0.90 (+80) | 0.68 (+33) | 0.83 (+73) | 0.62 (+22) | 0.79 (+61) | 0.02 (+100) | 0.08 (+167) | 0.05 (+67) | 0.15 (+400) | 0.07 (+40) | 0.15 (+200) |

Table 1: KSS-ROC (↑) and KSS-PR (↑) scores of various unlearning methods. Column headers abbreviate KSS-ROC as ROC, KSS-PR as PR, and Case 1 / Case 2 as C1 / C2. Values in parentheses denote the percentage increase relative to MEM (e.g., 0.57 (+10) means a 10% increase). MEM denotes the memorized model ($F_{\theta}^{M}$), Prob denotes the probability-based scores, and Gen denotes the generation-based scores.

#### Unlearning Configuration

We employed a set of widely used optimization\-based unlearning algorithms–Gradient Ascent \(GA\)Janget al\.\([2022](https://arxiv.org/html/2605.14404#bib.bib10)\), Gradient Ascent with Gradient Descent term \(GAGDR\)Liuet al\.\([2022](https://arxiv.org/html/2605.14404#bib.bib18)\), Gradient Ascent with KL minimization \(GAKLR\)Mainiet al\.\([2024](https://arxiv.org/html/2605.14404#bib.bib12)\)and Negative Preference Optimization \(NPO\)Zhanget al\.\([2024](https://arxiv.org/html/2605.14404#bib.bib11)\)\. Additionally, we conducted experiments using a pruning\-based unlearning methodPochinkov and Schoots \([2024](https://arxiv.org/html/2605.14404#bib.bib46)\)\. Detailed descriptions are provided in Appendix[G](https://arxiv.org/html/2605.14404#A7)\.

We conducted experiments using Llama3.1-8B-Instruct (Llama3.1), a multilingual LLM, as the base model Grattafiori et al. ([2024](https://arxiv.org/html/2605.14404#bib.bib24)). We used the multilingual parallel QA dataset described in Section [4.1](https://arxiv.org/html/2605.14404#S4.SS1) for both fine-tuning (memorization) and unlearning across all methods. We considered the $p1$, $p3$, and $p5$ settings according to the ratio of the forget set (1%, 3%, and 5%, respectively). Detailed hyperparameters are provided in Appendix [H](https://arxiv.org/html/2605.14404#A8).

#### Evaluation Configuration

In the multilingual unlearning scenario, knowledge is acquired either through direct training ($\mathbb{L}_{\text{Train}}$) or through cross-linguistic spread ($\mathbb{L}_{\text{Hold}}$). Since the two subsets acquire knowledge in different ways, it is more appropriate to analyse KSS and KPS separately on $\mathbb{L}_{\text{Train}}$ and $\mathbb{L}_{\text{Hold}}$ than to aggregate the two.

## 6 Analysis

### 6.1 Knowledge Separability Score

We report KSS for two cases:

- Case 1: The separability between target and non-target knowledge within $\mathbb{L}_{\text{Hold}}$,
- Case 2: The separability between target and non-target knowledge within $\mathbb{L}_{\text{Train}}$.

#### Unlearning is Difficult in Hold\-out Languages \(Case 1\)

We observed a distinct performance disparity between Case 1 and Case 2. Regarding KSS-ROC in Table [1](https://arxiv.org/html/2605.14404#S5.T1), scores are consistently lower for Case 1 (measured within $\mathbb{L}_{\text{Hold}}$) than for Case 2 (measured within $\mathbb{L}_{\text{Train}}$) for both probability- and generation-based metrics. For example, for $p1$, the maximum score is 0.99 in Case 2, whereas it reaches only 0.76 in Case 1. This indicates that distinguishing between target and non-target knowledge is more challenging in Case 1.

#### Unlearning Performance Degrades as Forget Ratio Increases \(Case 2\)

Table [1](https://arxiv.org/html/2605.14404#S5.T1) presents the probability- and generation-based KSS scores measured by ROC-AUC (KSS-ROC) and PR-AUC (KSS-PR). Because the performance of the memorized model (MEM) varies across forget dataset ratios, we report the absolute scores alongside the percentage increase relative to MEM to ensure a fair comparison. The relative increase is calculated as $\frac{\text{Method}-\text{MEM}}{\text{MEM}}\times 100$ and is denoted by a subscript (e.g., $_{+10}$). As shown in the table, the performance of both metrics degrades as the forget dataset ratio increases from $p1$ to $p5$, except for KSS-PR in PRUNE. This suggests that as the forget ratio increases, the boundary between target and non-target knowledge becomes increasingly blurred, making it difficult for the model to distinguish between the two.
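As a quick check of the subscript convention, the following minimal Python sketch recomputes the relative increase from two absolute scores. The function name `relative_increase` and the rounding to the nearest integer are illustrative assumptions; the input values are taken from Table 1.

```python
def relative_increase(method_score: float, mem_score: float) -> int:
    """Percentage increase of a method's score over the memorized model (MEM)."""
    return round((method_score - mem_score) / mem_score * 100)


# Probability-based KSS-ROC for GA vs. MEM in the p1 setting (values from Table 1).
case1 = relative_increase(0.57, 0.52)  # Case 1
case2 = relative_increase(0.89, 0.45)  # Case 2
print(case1, case2)
```

Because the KSS-PR of MEM is close to zero (e.g., 0.01), small absolute gains translate into very large relative increases, which is worth keeping in mind when reading the KSS-PR subscripts.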

#### Analysis on Prune\-based Method \(Case 2\)

The prune-based method achieves a high KSS-ROC score in the $p1$ setting within $\mathbb{L}_{\text{Train}}$ (Case 2), where unlearning is generally effective across all methods. This indicates that, like optimization-based methods, pruning can achieve strong global separability between target and non-target knowledge. However, we also observe that its KSS-PR score is disproportionately poor, remaining significantly lower than that of other methods with comparable KSS-ROC scores. For example, in the probability-based $p1$ results of Table 1, PRUNE and GAGDR both reach a KSS-ROC of 0.91, yet their KSS-PR scores are 0.12 and 0.46, respectively.

To investigate the cause of this discrepancy, we visualized the distributions of knowledge-wise forgetting scores ($S_i$) for both the optimization-based methods and the prune-based method under the $p1$ setting (Figure [5](https://arxiv.org/html/2605.14404#S6.F5)). The distributions for all forget ratios and methods are provided in Appendix [J](https://arxiv.org/html/2605.14404#A10). The visualization shows that the pruned model assigns high $S_i$ not only to the target knowledge but also to a non-negligible portion of the non-target knowledge. In other words, relative to optimization-based methods, pruning fails to assign sufficiently distinct, high knowledge-wise forgetting scores exclusively to the target knowledge. This results in a significant overlap of target knowledge with the tail of the non-target knowledge distribution (highlighted by the red box). Consequently, KSS-PR degrades, indicating that target knowledge does not exclusively reside in the high-score region.

![Refer to caption](https://arxiv.org/html/2605.14404v1/x5.png)
Figure 5: Distributions of $S_i$ for both the target and non-target knowledge after NPO and pruning in Case 2. The first row represents the probability-based $S_i$, while the second row displays the generation-based $S_i$.
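To illustrate why a method can score well on KSS-ROC yet poorly on KSS-PR, the sketch below builds a toy example with hypothetical forgetting scores (not data from the experiments): two target items and 98 non-target items, of which five receive scores above the targets. ROC-AUC is computed as a pairwise ranking probability, and average precision is used as a stand-in for PR-AUC, which is an assumption about the estimator. Both function names are illustrative.

```python
def roc_auc(scores, labels):
    """Probability that a random target item outranks a random non-target item."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Mean precision at the rank of each target item (assumes no tied scores)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits


targets = [0.90, 0.80]                  # hypothetical S_i of target knowledge
low = [0.001 * k for k in range(93)]    # non-target items with low S_i
tail = [0.95, 0.96, 0.97, 0.98, 0.99]   # non-target items wrongly scored high

scores = targets + low + tail
labels = [1] * len(targets) + [0] * (len(low) + len(tail))

auc = roc_auc(scores, labels)           # 93/98, about 0.95
ap = average_precision(scores, labels)  # (1/6 + 2/7) / 2, about 0.23
print(round(auc, 2), round(ap, 2))
```

In this toy setting only about 5% of the non-target items sit in the high-score tail, so the ranking-based score stays near 0.95, while precision among the top-ranked items collapses because those few items outnumber the targets.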

### 6.2 Knowledge Persistence Score

We now report KPS for two cases:

- Case 1: Target knowledge is inaccessible in the base language $l_1 \in \mathbb{L}_{\text{Train}}$ but still persists within $\mathbb{L}_{\text{Hold}}$,
- Case 2: Target knowledge is inaccessible in the base language $l_1 \in \mathbb{L}_{\text{Train}}$ but still persists within $\mathbb{L}_{\text{Train}} \setminus \{l_1\}$.

#### Knowledge Can Persist in Hold\-out Languages \(Case 1\)

While it is unsurprising that more knowledge persists in Case 2, the results of Case 1 show that knowledge acquired through cross-linguistic spread persists in hold-out languages even after unlearning. For every base language $l_1$ used for the measurement, some target knowledge remains unremoved in the hold-out languages ($\text{KPS} > 0$). This phenomenon again raises the potential risk of cross-lingual persistence in MMU.

#### Persistence Tendency in Forget Set \(Case 1 & Case 2\)

Table [2](https://arxiv.org/html/2605.14404#S6.T2) displays the unlearning performance of NPO measured with the Knowledge Persistence Score (KPS). As with KSS, KPS indicates more severe knowledge persistence as the forget ratio rises. Across base languages $l_1$, the $p1$ setting shows the lowest KPS, ranging from $\text{KPS}=0.05$ to $\text{KPS}=0.18$. In contrast, $p3$ and $p5$ display severe persistence, with scores up to $\text{KPS}=0.44$. This implies that, as the proportion of the forget set increases, unlearning becomes more difficult from the perspective of consistent unlearning across languages.

| KPS (↓), $l_1$ | $p1$ Case 1 | $p1$ Case 2 | $p3$ Case 1 | $p3$ Case 2 | $p5$ Case 1 | $p5$ Case 2 |
|---|---|---|---|---|---|---|
| bn | 0.08 | 0.17 | 0.20 | 0.41 | 0.19 | 0.36 |
| de | 0.08 | 0.15 | 0.10 | 0.37 | 0.14 | 0.30 |
| en | 0.05 | 0.07 | 0.05 | 0.17 | 0.05 | 0.17 |
| he | 0.09 | 0.14 | 0.11 | 0.44 | 0.17 | 0.36 |
| ru | 0.11 | 0.18 | 0.06 | 0.34 | 0.13 | 0.29 |
| sq | 0.11 | 0.13 | 0.09 | 0.42 | 0.17 | 0.26 |
| ta | 0.13 | 0.17 | 0.19 | 0.44 | 0.19 | 0.40 |
| zh | 0.13 | 0.15 | 0.16 | 0.40 | 0.16 | 0.32 |
| avg | 0.10 | 0.15 | 0.12 | 0.37 | 0.15 | 0.31 |

Table 2: Knowledge Persistence Score (KPS) on NPO across different forget ratios ($p1$, $p3$, $p5$).

## 7 Conclusion

In this paper, we identified the operational unit of unlearning within Multilingual Machine Unlearning (MMU) and established a comprehensive evaluation protocol based on this new perspective. Leveraging a large-scale multilingual synthetic dataset constructed for this study, we conducted extensive experiments across various unlearning methods. To measure their performance with respect to multilingual characteristics, we introduced two metrics: the Knowledge Separability Score (KSS) and the Knowledge Persistence Score (KPS). These metrics enabled us to uncover and analyse unlearning dynamics unique to multilingual scenarios, providing deeper insights into the behavior of MMU. We conclude by suggesting that future research on MMU should consider multilingual characteristics and aim to unlearn knowledge across languages.

## 8 Limitations and Future Works

In this paper, we investigated unlearning performance evaluation within Multilingual Machine Unlearning \(MMU\) scenarios, where knowledge is distributed across diverse languages\. To this end, we proposed the Knowledge Separability Score \(KSS\) and the Knowledge Persistence Score \(KPS\)\. Despite the contributions, our study has several limitations that suggest directions for future research\.

First, there is a limitation regarding the diversity of training and hold\-out languages\. Although we selected a broad range of high\- and low\-resource languages across various language families to observe performance disparities, our scope for hold\-out languages was restricted\. Specifically, our experiments utilized only languages from the Indo\-European family \(i\.e\., Afrikaans and Spanish\) as hold\-out languages\. Future research should incorporate a wider array of language families for the hold\-out set to ensure a more comprehensive performance analysis across different linguistic structures\.

Second, our experiments were limited by model scale\. Due to computational constraints, we focused on an 8B\-parameter model\. We observed that effective unlearning was primarily achievable when the forget ratio was low; however, unlearning performance degraded significantly as the forget ratio increased\. Since larger models may exhibit different behaviors regarding capacity and forgetting dynamics, it is crucial to validate these findings across a broader spectrum of model sizes\.

Finally, while we proposed KSS and KPS with the consideration of knowledge\-wise measurement in MMU contexts, there remains potential for alternative metrics\. Future work should explore more diverse evaluation methodologies to verify the removal of knowledge more accurately and robustly\.

## Acknowledgments

This work was supported by the Korean Government through the grants from IITP \(RS\-2021\-II211343, RS\-2022\-II220953, RS\-2025\-25442338\)\.

## References

- J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. (2023a) GPT-4 technical report. arXiv preprint arXiv:2303.08774. Cited by: [§5.1](https://arxiv.org/html/2605.14404#S5.SS1.SSS0.Px2.p3.2).
- J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. (2023b) GPT-4 technical report. arXiv preprint arXiv:2303.08774. Cited by: [§4.4](https://arxiv.org/html/2605.14404#S4.SS4.p1.1.2).
- M. Choi, K. Min, and J. Choo (2024) Cross-lingual unlearning of selective knowledge in multilingual language models. arXiv preprint arXiv:2406.12354. Cited by: [§1](https://arxiv.org/html/2605.14404#S1.p2.1), [§2.1](https://arxiv.org/html/2605.14404#S2.SS1.p1.1), [§2.2](https://arxiv.org/html/2605.14404#S2.SS2.p2.1).
- G. Cloud (2025) Note: Accessed: 2025-11-11. External Links: [Link](https://cloud.google.com/translate/docs). Cited by: [§4.4](https://arxiv.org/html/2605.14404#S4.SS4.p1.1.1).
- M. R. Costa-Jussà, J. Cross, O. Çelebi, M. Elbayad, K. Heafield, K. Heffernan, E. Kalbassi, J. Lam, D. Licht, J. Maillard, et al. (2022) No language left behind: scaling human-centered machine translation. arXiv preprint arXiv:2207.04672. Cited by: [§5.1](https://arxiv.org/html/2605.14404#S5.SS1.SSS0.Px2.p1.1).
- D. Faraglia (2025) Faker: Python package that generates fake data for you. Note: [https://github.com/joke2k/faker](https://github.com/joke2k/faker). Cited by: [§4.2](https://arxiv.org/html/2605.14404#S4.SS2.p1.1.1).
- A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. (2024) The Llama 3 herd of models. arXiv preprint arXiv:2407.21783. Cited by: [§5.3](https://arxiv.org/html/2605.14404#S5.SS3.SSS0.Px1.p2.3.3).
- K. Hwang, H. Kim, S. Kim, S. Wee, and N. Kwak (2025) Uncovering the potential risks in unlearning: danger of English-only unlearning in multilingual LLMs. arXiv preprint arXiv:2510.23949. Cited by: [§1](https://arxiv.org/html/2605.14404#S1.p2.1), [§2.1](https://arxiv.org/html/2605.14404#S2.SS1.p1.1), [§2.2](https://arxiv.org/html/2605.14404#S2.SS2.p2.1).
- J. Jang, D. Yoon, S. Yang, S. Cha, M. Lee, L. Logeswaran, and M. Seo (2022) Knowledge unlearning for mitigating privacy risks in language models. arXiv preprint arXiv:2210.01504. Cited by: [§1](https://arxiv.org/html/2605.14404#S1.p1.1), [§2.1](https://arxiv.org/html/2605.14404#S2.SS1.p1.1), [§5.3](https://arxiv.org/html/2605.14404#S5.SS3.SSS0.Px1.p1.1.1).
- R. Joshi, R. Paul, K. Singla, A. Kamath, M. Evans, K. Luna, S. Ghosh, U. Vaidya, E. Long, S. S. Chauhan, et al. (2025) CultureGuard: towards culturally-aware dataset and guard model for multilingual safety applications. arXiv preprint arXiv:2508.01710. Cited by: [§4.4](https://arxiv.org/html/2605.14404#S4.SS4.p1.1.2).
- C. Lin (2004) ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out, Barcelona, Spain, pp. 74–81. External Links: [Link](https://aclanthology.org/W04-1013/). Cited by: [§2.2](https://arxiv.org/html/2605.14404#S2.SS2.p1.1).
- B. Liu, Q. Liu, and P. Stone (2022) Continual learning and private unlearning. In Conference on Lifelong Learning Agents, pp. 243–254. Cited by: [§1](https://arxiv.org/html/2605.14404#S1.p1.1), [§2.1](https://arxiv.org/html/2605.14404#S2.SS1.p1.1), [§5.3](https://arxiv.org/html/2605.14404#S5.SS3.SSS0.Px1.p1.1.1).
- Z. Liu, G. Dou, M. Jia, Z. Tan, Q. Zeng, Y. Yuan, and M. Jiang (2025) Protecting privacy in multimodal large language models with MLLMU-Bench. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 4105–4135. Cited by: [§2.2](https://arxiv.org/html/2605.14404#S2.SS2.p1.1).
- T. Lu and P. Koehn (2024) Learn and unlearn in multilingual LLMs. arXiv preprint arXiv:2406.13748. Cited by: [§1](https://arxiv.org/html/2605.14404#S1.p2.1), [§2.1](https://arxiv.org/html/2605.14404#S2.SS1.p1.1), [§2.2](https://arxiv.org/html/2605.14404#S2.SS2.p2.1), [§3](https://arxiv.org/html/2605.14404#S3.SS0.SSS0.Px2.p3.1.1).
- P. Maini, Z. Feng, A. Schwarzschild, Z. C. Lipton, and J. Z. Kolter (2024) TOFU: a task of fictitious unlearning for LLMs. arXiv preprint arXiv:2401.06121. Cited by: [§1](https://arxiv.org/html/2605.14404#S1.p1.1), [§2.2](https://arxiv.org/html/2605.14404#S2.SS2.p1.1), [§4.1](https://arxiv.org/html/2605.14404#S4.SS1.p1.1.1), [§5.3](https://arxiv.org/html/2605.14404#S5.SS3.SSS0.Px1.p1.1.1).
- X. Pan, B. Zhang, J. May, J. Nothman, K. Knight, and H. Ji (2017) Cross-lingual name tagging and linking for 282 languages. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), R. Barzilay and M. Kan (Eds.), Vancouver, Canada, pp. 1946–1958. External Links: [Link](https://aclanthology.org/P17-1178/), [Document](https://dx.doi.org/10.18653/v1/P17-1178). Cited by: [§4.4](https://arxiv.org/html/2605.14404#S4.SS4.p1.1.1).
- F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: [Appendix I](https://arxiv.org/html/2605.14404#A9.p1.6).
- N. Pochinkov and N. Schoots (2024) Dissecting language models: machine unlearning via selective pruning. arXiv preprint arXiv:2403.01267. Cited by: [Appendix G](https://arxiv.org/html/2605.14404#A7.SS0.SSS0.Px5.p1.1), [§2.1](https://arxiv.org/html/2605.14404#S2.SS1.p1.1), [§5.3](https://arxiv.org/html/2605.14404#S5.SS3.SSS0.Px1.p1.1.1).
- H. Schwenk, V. Chaudhary, S. Sun, H. Gong, and F. Guzmán (2021) WikiMatrix: mining 135M parallel sentences in 1620 language pairs from Wikipedia. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, P. Merlo, J. Tiedemann, and R. Tsarfaty (Eds.), Online, pp. 1351–1361. External Links: [Link](https://aclanthology.org/2021.eacl-main.115/), [Document](https://dx.doi.org/10.18653/v1/2021.eacl-main.115). Cited by: [§4.4](https://arxiv.org/html/2605.14404#S4.SS4.p1.1.1).
- W. Shi, J. Lee, Y. Huang, S. Malladi, J. Zhao, A. Holtzman, D. Liu, L. Zettlemoyer, N. A. Smith, and C. Zhang (2024) MUSE: machine unlearning six-way evaluation for language models. arXiv preprint arXiv:2407.06460. Cited by: [§1](https://arxiv.org/html/2605.14404#S1.p1.1).
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. Advances in Neural Information Processing Systems 30. Cited by: [Appendix G](https://arxiv.org/html/2605.14404#A7.SS0.SSS0.Px5.p1.1.1).
- W. Wang, Z. Tian, C. Zhang, and S. Yu (2024) Machine unlearning: a comprehensive survey. arXiv preprint arXiv:2405.07406. Cited by: [§1](https://arxiv.org/html/2605.14404#S1.p1.1).
- A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025a) Qwen3 technical report. arXiv preprint arXiv:2505.09388. Cited by: [Appendix A](https://arxiv.org/html/2605.14404#A1.p1.1.1), [§4.3](https://arxiv.org/html/2605.14404#S4.SS3.p1.1.1).
- A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025b) Qwen3 technical report. arXiv preprint arXiv:2505.09388. Cited by: [Appendix K](https://arxiv.org/html/2605.14404#A11.p1.1.1).
- R. Zhang, L. Lin, Y. Bai, and S. Mei (2024) Negative preference optimization: from catastrophic collapse to effective unlearning. arXiv preprint arXiv:2404.05868. Cited by: [§1](https://arxiv.org/html/2605.14404#S1.p1.1), [§2.1](https://arxiv.org/html/2605.14404#S2.SS1.p1.1), [§5.3](https://arxiv.org/html/2605.14404#S5.SS3.SSS0.Px1.p1.1.1).

Appendices

## Appendix A Naive Profile Generation and Model-Induced Lexical Skew

In a pilot study preceding our formal dataset construction, we examined what issues arise when one naively uses an LLM to generate synthetic user profiles and then builds a QA dataset from them. Our goal is to construct a multilingual QA dataset grounded in synthetic user profiles comprising attributes such as names, nationalities, and health conditions. The most straightforward way to obtain such data is to directly prompt an LLM to sample a profile and generate QA pairs conditioned on it. In this preliminary setup, we employed Qwen3-235B-A22B-Thinking-2507 Yang et al. ([2025a](https://arxiv.org/html/2605.14404#bib.bib19)) with nucleus sampling, using the exact prompt shown in Figure [6](https://arxiv.org/html/2605.14404#A2.F6). The attributes specified in this naive prompt were Name, Year of Birth, Financial Habits, Primary Commute Mode, Interests / Hobbies, Learning Goals This Year, Artistic or Creative Expression, Awards or Achievements, Health Attributes, Travel History / Exposure, Pet Ownership or Preference, Bucket List Items, Life Philosophy or Motto, Media Preferences, Future Plans or Dreams, Relationship or Family Status, Occupation, Education, Current Residence, and Nationality. After generating 20 synthetic profiles and their corresponding QA datasets, we analysed the empirical distribution of each attribute and found a striking prevalence of specific surface forms (e.g., repeatedly producing "Canadian" for nationality), as summarized by the attribute-wise histograms in Figure [7](https://arxiv.org/html/2605.14404#A2.F7). We term this phenomenon *model-induced lexical skew*. This skew is undesirable because it 1) reduces profile diversity and, more critically for unlearning evaluation, 2) confounds measurement by making it difficult to disentangle genuine retention from cases where the model merely exploits high-frequency lexical priors, i.e., succeeds by guessing common tokens rather than recovering profile-specific information.

Motivated by this observation, we introduced an attribute pool to diversify the synthetic profiles. More broadly, the pilot supports the need for a controlled data-generation pipeline that explicitly regulates token-frequency distributions to suppress such biases while improving diversity.

## Appendix B Attribute Pool for Synthetic Profile Generation

Table [3](https://arxiv.org/html/2605.14404#A2.T3) illustrates the attribute value pools used to construct diverse synthetic profiles.

![[Uncaptioned image]](https://arxiv.org/html/2605.14404v1/x6.png)
Table 3: Valid values for the attributes used to build synthetic profiles (1/7)

![[Uncaptioned image]](https://arxiv.org/html/2605.14404v1/x7.png)
Table 4: (continued) Valid values for the attributes used to build synthetic profiles (2/7)

![[Uncaptioned image]](https://arxiv.org/html/2605.14404v1/x8.png)
Table 5: (continued) Valid values for the attributes used to build synthetic profiles (3/7)

![[Uncaptioned image]](https://arxiv.org/html/2605.14404v1/x9.png)
Table 6: (continued) Valid values for the attributes used to build synthetic profiles (4/7)

![[Uncaptioned image]](https://arxiv.org/html/2605.14404v1/x10.png)
Table 7: (continued) Valid values for the attributes used to build synthetic profiles (5/7)

![[Uncaptioned image]](https://arxiv.org/html/2605.14404v1/x11.png)
Table 8: (continued) Valid values for the attributes used to build synthetic profiles (6/7)

![[Uncaptioned image]](https://arxiv.org/html/2605.14404v1/x12.png)
Table 9: Valid values for the attributes used to build synthetic profiles (7/7)

![Refer to caption](https://arxiv.org/html/2605.14404v1/x13.png)
Figure 6: Naive prompt for synthetic profile Question and Answer Generation.

![Refer to caption](https://arxiv.org/html/2605.14404v1/x14.png)
Figure 7: Comparison of attribute distributions in synthetic profiles generated via naive prompting (Naive) and random sampling from a predefined pool (Ours). (Left) Year of Birth; (Right) Nationality. Notably, our approach exhibits higher diversity and significantly reduced skew compared to the Naive method.

## Appendix C Examples of Synthetic Profile

Figure [8](https://arxiv.org/html/2605.14404#A3.F8) presents an example profile after manual filtering by human experts. Each attribute was randomly sampled from its respective predefined pool.

![Refer to caption](https://arxiv.org/html/2605.14404v1/x15.png)
Figure 8: Example of synthetic profile.

## Appendix D Prompt for QA Generation

Figure [9](https://arxiv.org/html/2605.14404#A4.F9) presents the prompt used for LLM-based generation of the QA dataset from synthetic profiles constructed using the attribute pool. As with the synthetic profiles, each QA item underwent human review to ensure quality before use.

![Refer to caption](https://arxiv.org/html/2605.14404v1/x16.png)
Figure 9: Prompt for generating QA dataset.

![Refer to caption](https://arxiv.org/html/2605.14404v1/x17.png)
Figure 10: Prompt for verifying back translated sentences.

![Refer to caption](https://arxiv.org/html/2605.14404v1/x18.png)
Figure 11: Prompt for semantic equivalence rate.

## Appendix E Examples of QA Dataset

Figure [12](https://arxiv.org/html/2605.14404#A5.F12) presents representative examples from the QA dataset generated from the synthetic profiles.

![Refer to caption](https://arxiv.org/html/2605.14404v1/x19.png)
Figure 12: Examples of generated QA dataset.

## Appendix F Examples of Multilingual QA Dataset

Figure [13](https://arxiv.org/html/2605.14404#A6.F13) presents representative examples from the QA dataset translated from the source English QA dataset.

![Refer to caption](https://arxiv.org/html/2605.14404v1/x20.png)
Figure 13: Examples of translated multilingual QA dataset.

Figure [10](https://arxiv.org/html/2605.14404#A4.F10) presents the prompt used to verify back-translated sentences following the translation of the English QA dataset via Google Translate. Depending on the dataset type, either the question or answer field was inserted into the prompt.

## Appendix G Description of Unlearning Algorithm

We describe the unlearning algorithms used in this paper. All of them aim to optimize $F_{\theta}^{M}$.

#### Gradient Ascent

Gradient Ascent \(GA\) is a procedure that applies gradient ascent on the forget dataset to remove information that the LLM should forget\. The GA objective is defined as follows:

$$\mathcal{L}_{GA}(\mathcal{D}_{f},F_{\theta}^{M})=\mathbb{E}_{(q_{f},a_{f})\,\in\,\mathcal{D}_{f}}\bigl[\log F_{\theta}^{M}(a_{f}\mid q_{f})\bigr] \tag{8}$$

#### Gradient Difference

Applying GA alone can degrade performance on the retain dataset. To prevent this, Gradient Difference (GAGDR) augments GA with simultaneous training on the retain dataset: GAGDR performs gradient ascent on $\mathcal{D}_{f}$ and gradient descent on $\mathcal{D}_{r}$. The GAGDR objective is defined as follows:

$$\mathcal{L}_{GAGDR}(\mathcal{D}_{f},\mathcal{D}_{r},F_{\theta}^{M})=\mathbb{E}_{(q_{f},a_{f})\,\in\,\mathcal{D}_{f}}\left[\log F_{\theta}^{M}(a_{f}\mid q_{f})\right]-\mathbb{E}_{(q_{r},a_{r})\,\in\,\mathcal{D}_{r}}\left[\log F_{\theta}^{M}(a_{r}\mid q_{r})\right] \tag{9}$$
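Equations (8) and (9) can be sketched on precomputed sequence log-probabilities as follows. This is a minimal illustration, assuming each list entry is $\log F_{\theta}^{M}(a \mid q)$ for one QA pair and that both objectives are minimized; the function names and toy values are hypothetical, and a real implementation would compute these terms from model outputs inside a training loop.

```python
from statistics import fmean


def ga_objective(forget_logps):
    """Eq. (8): mean log-likelihood of forget answers; minimizing it lowers their probability."""
    return fmean(forget_logps)


def gagdr_objective(forget_logps, retain_logps):
    """Eq. (9): forget log-likelihood minus retain log-likelihood."""
    return fmean(forget_logps) - fmean(retain_logps)


# Hypothetical sequence log-probabilities for two forget and two retain QA pairs.
forget_logps = [-0.2, -0.4]
retain_logps = [-0.3, -0.5]

ga_value = ga_objective(forget_logps)                      # about -0.3
gagdr_value = gagdr_objective(forget_logps, retain_logps)  # about 0.1
print(ga_value, gagdr_value)
```

Under this reading, lowering the forget term and raising the retain term both reduce the GAGDR value, which matches the ascent-on-forget, descent-on-retain description above.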

#### Gradient Ascent with KL minimization

Similar to GAGDR, Gradient Ascent with KL minimization (GAKLR) aims to preserve the utility of the LLM on the retain dataset. This is done by minimizing the Kullback–Leibler (KL) divergence on the retain set, computed between the output distributions of the model currently being updated and the pre-unlearning reference model, denoted as $F_{\theta}^{ref}=F_{\theta}^{M}$. The GAKLR objective is given below:

$$\mathcal{L}_{\mathrm{GAKLR}}(\mathcal{D}_{f},\mathcal{D}_{r},F_{\theta}^{M})=\mathbb{E}_{(q_{f},a_{f})\in\mathcal{D}_{f}}\big[\log F_{\theta}^{M}(a_{f}\mid q_{f})\big]+\mathrm{KL}_{\mathcal{D}_{r}}\big(F_{\theta}^{M}\,\|\,F_{\theta}^{ref}\big). \tag{10}$$
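The retain-side KL term in Eq. (10) can be sketched for discrete output distributions as below. The sketch assumes each retain example is summarized by one probability vector from the current model and one from the reference model; in practice the divergence would be accumulated over all token positions. All names and values are illustrative.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def gaklr_objective(forget_logps, retain_pairs):
    """Forget-set log-likelihood plus mean KL(current || reference) on retain data."""
    forget_term = sum(forget_logps) / len(forget_logps)
    kl_term = sum(kl_divergence(cur, ref) for cur, ref in retain_pairs) / len(retain_pairs)
    return forget_term + kl_term


forget_logps = [-0.2, -0.4]
unchanged = [([0.7, 0.3], [0.7, 0.3])]  # current model matches the reference
drifted = [([0.5, 0.5], [0.7, 0.3])]    # current model has moved away from it

value_unchanged = gaklr_objective(forget_logps, unchanged)
value_drifted = gaklr_objective(forget_logps, drifted)
print(value_unchanged, value_drifted)
```

The KL term is zero when the updated model reproduces the reference distribution on retain data and grows as the two diverge, so it acts as a penalty on drift rather than as a direct likelihood target.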

#### Negative Preference Optimization

Negative Preference Optimization (NPO) applies the preference optimization framework to unlearn specific behaviors by treating samples in the forget dataset as negative instances. NPO operates solely on undesirable responses, penalizing their generation probability relative to the reference model $F_{\theta}^{ref}=F_{\theta}^{M}$ to ensure stability. The NPO objective is defined as follows:

$$\mathcal{L}_{\text{NPO}}(\mathcal{D}_{f},F_{\theta}^{M};F_{\theta}^{ref})=\mathbb{E}_{(q_{f},a_{f})\in\mathcal{D}_{f}}\Big[-\log\Big(1-\sigma\Big(\beta\log\tfrac{F_{\theta}^{M}(a_{f}\mid q_{f})}{F_{\theta}^{ref}(a_{f}\mid q_{f})}\Big)\Big)\Big]. \tag{11}$$
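A per-example reading of Eq. (11) is sketched below. Since $-\log(1-\sigma(x))=\log(1+e^{x})$, the loss reduces to a softplus of the scaled log-ratio. The value $\beta=0.1$ and the toy log-probabilities are assumptions for illustration only; they are not the settings used in the experiments.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def npo_loss(logps, ref_logps, beta=0.1):
    """Mean of -log(1 - sigmoid(beta * (log F - log F_ref))) over forget examples."""
    terms = [softplus(beta * (lp - ref)) for lp, ref in zip(logps, ref_logps)]
    return sum(terms) / len(terms)


at_reference = npo_loss([-1.0], [-1.0])  # model equals reference: log(2)
suppressed = npo_loss([-50.0], [-1.0])   # forget answer made far less likely
print(at_reference, suppressed)
```

The loss equals $\log 2$ when the model matches the reference and decays toward zero as the forget-set likelihood falls, so the update signal fades once an answer is sufficiently suppressed, unlike the unbounded GA objective.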

#### Prune

[Pochinkov and Schoots](https://arxiv.org/html/2605.14404#bib.bib46) investigated pruning-based unlearning for Transformer-based architectures Vaswani et al. ([2017](https://arxiv.org/html/2605.14404#bib.bib50)). We performed structured pruning on the feed-forward networks (FFNs) using the scoring metric employed in their study. The importance score for structured pruning is defined as follows:

$$I_{\text{agnostic}}:=\frac{\sum_{k}\text{MinMax}(I_{k}(\mathcal{D}_{f}))}{\sum_{k}\text{MinMax}(I_{k}(\mathcal{D}_{r}))+\epsilon}. \tag{12}$$

Here, $\text{MinMax}(\cdot)$ denotes min-max normalization, $\mathcal{D}_{f}$ and $\mathcal{D}_{r}$ denote the multilingual parallel forget and retain datasets, and $I_{k}$ ranges over the following scores:

$$\begin{aligned} I_{\text{std}} &= \sqrt{\tfrac{1}{|\mathcal{D}|}\sum(z-\bar{z})^{2}}, & I_{\text{abs}} &= \tfrac{1}{|\mathcal{D}|}\sum|z|, \\ I_{\text{freq}} &= \tfrac{1}{|\mathcal{D}|}\sum\mathbb{I}(z>0), & I_{\text{rms}} &= \sqrt{\tfrac{1}{|\mathcal{D}|}\sum z^{2}}. \end{aligned}$$

Here, $z$ denotes the activation produced by the MLP within the FFN for each datapoint in $\mathcal{D}$, and $\bar{z}$ represents the mean activation.
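The scoring in Eq. (12) can be sketched in plain Python as follows. The sketch assumes activations are given as one list per neuron, that min-max normalization is applied across the neurons of a layer, and that $\epsilon=10^{-6}$; these choices, and all function names, are illustrative assumptions rather than the exact implementation.

```python
import math


def importance_scores(acts):
    """The four per-neuron statistics (std, abs, freq, rms) over a dataset."""
    n = len(acts)
    mean = sum(acts) / n
    return {
        "std": math.sqrt(sum((z - mean) ** 2 for z in acts) / n),
        "abs": sum(abs(z) for z in acts) / n,
        "freq": sum(1 for z in acts if z > 0) / n,
        "rms": math.sqrt(sum(z * z for z in acts) / n),
    }


def min_max(values):
    """Min-max normalize a list; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def summed_normalized(acts_per_neuron):
    """Sum over k of MinMax(I_k), normalizing each statistic across neurons."""
    per_neuron = [importance_scores(a) for a in acts_per_neuron]
    total = [0.0] * len(per_neuron)
    for key in ("std", "abs", "freq", "rms"):
        normed = min_max([s[key] for s in per_neuron])
        total = [t + v for t, v in zip(total, normed)]
    return total


def agnostic_scores(forget_acts, retain_acts, eps=1e-6):
    """Eq. (12): forget-side importance divided by retain-side importance."""
    f = summed_normalized(forget_acts)
    r = summed_normalized(retain_acts)
    return [fi / (ri + eps) for fi, ri in zip(f, r)]


# Two hypothetical neurons: neuron 0 fires on forget data, neuron 1 on retain data.
forget_acts = [[1.0, 3.0], [0.0, 0.0]]
retain_acts = [[0.0, 0.0], [1.0, 3.0]]
scores = agnostic_scores(forget_acts, retain_acts)
print(scores)
```

Neurons with the highest ratio respond strongly on the forget set but weakly on the retain set, which makes them the natural candidates for pruning.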

## Appendix H Hyperparameter Setting

Table [10](https://arxiv.org/html/2605.14404#A8.T10) presents the hyperparameters used to memorize the synthetic QA dataset on the Llama-3.1 model.

Table [17](https://arxiv.org/html/2605.14404#A8.T17) presents the hyperparameters used during the unlearning phase for configurations $p1$, $p3$, and $p5$. Notably, when the retain dataset is utilized, the gradient accumulation steps are doubled to ensure that the total number of training iterations remains consistent. Regarding hyperparameters, we varied only the learning rate while keeping all other configurations fixed. We selected the checkpoint where the probability metric ($P(a|q)^{1/|a|_{\text{tok}}}$), averaged across languages, first exceeded 0.83 on the retain dataset.
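The checkpoint-selection rule above can be sketched as follows, assuming per-token log-probabilities of each answer are available and that checkpoints are scanned in training order. The helper names, the example scores, and the data layout are hypothetical.

```python
import math


def normalized_prob(token_logps):
    """Length-normalized probability P(a|q)^(1/|a|_tok) from per-token log-probs."""
    return math.exp(sum(token_logps) / len(token_logps))


def first_checkpoint_above(scores_by_step, threshold=0.83):
    """Return the first (step, score) whose language-averaged score exceeds the threshold."""
    for step, per_language in scores_by_step:
        avg = sum(per_language.values()) / len(per_language)
        if avg > threshold:
            return step, avg
    return None


# A two-token answer whose tokens each have probability 0.5.
p = normalized_prob([math.log(0.5), math.log(0.5)])

# Hypothetical retain-set scores per language at three checkpoints.
history = [
    (100, {"en": 0.80, "de": 0.78}),
    (200, {"en": 0.86, "de": 0.84}),
    (300, {"en": 0.90, "de": 0.88}),
]
selected = first_checkpoint_above(history)
print(p, selected)
```

Taking the geometric mean over tokens keeps the metric comparable across answers of different lengths, which matters when the same answer tokenizes to different lengths in different languages.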

| Memorization Hyperparameter | Value |
|---|---|
| Batch size | 4 |
| Gradient accumulation | 8 |
| Max sequence length | 1024 |
| Learning rate | 0.0002 |
| Warmup ratio | 0.03 |
| Weight decay | 0.0 |
| Precision / dtype | bfloat16 |
| LoRA rank ($r$) | 16 |
| LoRA $\alpha$ | 32 |
| LoRA dropout | 0.05 |
| Epoch | 4 |

Table 10: Hyperparameter settings used for memorizing Llama 3.1

| $l_1$ | $p1$ Case 1 | $p1$ Case 2 | $p3$ Case 1 | $p3$ Case 2 | $p5$ Case 1 | $p5$ Case 2 |
|---|---|---|---|---|---|---|
| bn | 1.00 | 1.00 | 0.38 | 0.89 | 0.10 | 0.91 |
| de | 0.00 | 0.00 | 0.00 | 0.00 | 0.50 | 1.00 |
| en | 0.00 | 0.00 | 1.00 | 1.00 | 0.00 | 0.00 |
| he | 0.00 | 1.00 | 0.00 | 1.00 | 0.30 | 0.97 |
| ru | 0.00 | 1.00 | 0.00 | 0.00 | 0.50 | 0.93 |
| sq | 0.00 | 1.00 | 0.00 | 0.86 | 0.30 | 0.89 |
| ta | 0.50 | 1.00 | 0.20 | 0.94 | 0.30 | 0.91 |
| zh | 0.00 | 0.00 | 0.50 | 0.71 | 0.33 | 1.00 |
| avg | 0.19 | 0.63 | 0.26 | 0.68 | 0.29 | 0.83 |

Table 11: Knowledge Persistence Score (KPS) on MEMORIZED across different forget ratios ($p1$, $p3$, $p5$).

| $l_1$ | $p1$ Case 1 | $p1$ Case 2 | $p3$ Case 1 | $p3$ Case 2 | $p5$ Case 1 | $p5$ Case 2 |
|---|---|---|---|---|---|---|
| bn | 0.25 | 0.45 | 0.34 | 0.68 | 0.32 | 0.95 |
| de | 0.13 | 0.50 | 0.21 | 0.71 | 0.19 | 0.86 |
| en | 0.00 | 0.34 | 0.19 | 0.41 | 0.40 | 0.80 |
| he | 0.17 | 0.58 | 0.27 | 0.73 | 0.26 | 0.92 |
| ru | 0.22 | 0.46 | 0.20 | 0.70 | 0.17 | 0.88 |
| sq | 0.25 | 0.38 | 0.21 | 0.66 | 0.25 | 0.79 |
| ta | 0.35 | 0.53 | 0.35 | 0.72 | 0.27 | 0.92 |
| zh | 0.17 | 0.45 | 0.37 | 0.68 | 0.25 | 0.89 |
| avg | 0.19 | 0.46 | 0.27 | 0.66 | 0.26 | 0.88 |

Table 12: Knowledge Persistence Score (KPS) on GA across different forget ratios ($p1$, $p3$, $p5$).

| $l_1$ | $p1$ Case 1 | $p1$ Case 2 | $p3$ Case 1 | $p3$ Case 2 | $p5$ Case 1 | $p5$ Case 2 |
|---|---|---|---|---|---|---|
| bn | 0.25 | 0.51 | 0.29 | 0.82 | 0.31 | 0.79 |
| de | 0.16 | 0.45 | 0.17 | 0.76 | 0.17 | 0.76 |
| en | 0.00 | 0.18 | 0.00 | 0.55 | 0.19 | 0.61 |
| he | 0.09 | 0.40 | 0.31 | 0.81 | 0.33 | 0.81 |
| ru | 0.21 | 0.47 | 0.26 | 0.79 | 0.38 | 0.78 |
| sq | 0.08 | 0.31 | 0.20 | 0.74 | 0.11 | 0.70 |
| ta | 0.30 | 0.48 | 0.39 | 0.84 | 0.26 | 0.79 |
| zh | 0.17 | 0.36 | 0.13 | 0.70 | 0.33 | 0.76 |
| avg | 0.16 | 0.40 | 0.22 | 0.75 | 0.26 | 0.75 |

Table 13: Knowledge Persistence Score (KPS) on GAGDR across different forget ratios ($p1$, $p3$, $p5$).

| $l_1$ | $p1$ Case 1 | $p1$ Case 2 | $p3$ Case 1 | $p3$ Case 2 | $p5$ Case 1 | $p5$ Case 2 |
|---|---|---|---|---|---|---|
| bn | 0.06 | 0.21 | 0.28 | 0.54 | 0.26 | 0.68 |
| de | 0.02 | 0.18 | 0.18 | 0.59 | 0.20 | 0.74 |
| en | 0.03 | 0.08 | 0.17 | 0.40 | 0.17 | 0.57 |
| he | 0.05 | 0.24 | 0.26 | 0.71 | 0.32 | 0.76 |
| ru | 0.08 | 0.21 | 0.19 | 0.58 | 0.21 | 0.74 |
| sq | 0.02 | 0.18 | 0.17 | 0.58 | 0.05 | 0.56 |
| ta | 0.10 | 0.23 | 0.33 | 0.65 | 0.28 | 0.76 |
| zh | 0.07 | 0.18 | 0.27 | 0.60 | 0.33 | 0.63 |
| avg | 0.05 | 0.19 | 0.23 | 0.58 | 0.23 | 0.68 |

Table 14: Knowledge Persistence Score (KPS) on GAKLR across different forget ratios ($p1$, $p3$, $p5$).

| $l_1$ | $p1$ Case 1 | $p1$ Case 2 | $p3$ Case 1 | $p3$ Case 2 | $p5$ Case 1 | $p5$ Case 2 |
|---|---|---|---|---|---|---|
| bn | 0.08 | 0.17 | 0.20 | 0.41 | 0.19 | 0.36 |
| de | 0.08 | 0.15 | 0.10 | 0.37 | 0.14 | 0.30 |
| en | 0.05 | 0.07 | 0.05 | 0.17 | 0.05 | 0.17 |
| he | 0.09 | 0.14 | 0.11 | 0.44 | 0.17 | 0.36 |
| ru | 0.11 | 0.18 | 0.06 | 0.34 | 0.13 | 0.29 |
| sq | 0.11 | 0.13 | 0.09 | 0.42 | 0.17 | 0.26 |
| ta | 0.13 | 0.17 | 0.19 | 0.44 | 0.19 | 0.40 |
| zh | 0.13 | 0.15 | 0.16 | 0.40 | 0.16 | 0.32 |
| avg | 0.10 | 0.15 | 0.12 | 0.37 | 0.15 | 0.31 |

Table 15: Knowledge Persistence Score (KPS) on NPO across different forget ratios ($p1$, $p3$, $p5$).

| $l_1$ | $p1$ Case 1 | $p1$ Case 2 | $p3$ Case 1 | $p3$ Case 2 | $p5$ Case 1 | $p5$ Case 2 |
|---|---|---|---|---|---|---|
| bn | 0.07 | 0.12 | 0.09 | 0.23 | 0.10 | 0.24 |
| de | 0.03 | 0.09 | 0.00 | 0.15 | 0.08 | 0.19 |
| en | 0.00 | 0.06 | 0.02 | 0.10 | 0.06 | 0.13 |
| he | 0.04 | 0.10 | 0.09 | 0.21 | 0.08 | 0.26 |
| ru | 0.03 | 0.09 | 0.05 | 0.17 | 0.10 | 0.23 |
| sq | 0.06 | 0.10 | 0.04 | 0.18 | 0.11 | 0.26 |
| ta | 0.05 | 0.09 | 0.07 | 0.21 | 0.11 | 0.24 |
| zh | 0.02 | 0.08 | 0.06 | 0.17 | 0.12 | 0.24 |
| avg | 0.04 | 0.09 | 0.05 | 0.18 | 0.09 | 0.22 |

Table 16: Knowledge Persistence Score (KPS) on PRUNE across different forget ratios ($p1$, $p3$, $p5$).

| Common Configuration | Value |
|---|---|
| Batch size | 4 |
| Max seq length | 1024 |
| Epochs | 10 |
| LoRA rank ($r$) | 16 |
| LoRA $\alpha$ | 32 |
| LoRA dropout | 0.05 |
| Warmup ratio | 0.0 |
| Weight decay | 0.0 |
| Forget strength | 1.0 |

| Method | Grad. Accum. | Retain Strength | Learning Rate ($p1$) | Learning Rate ($p3$) | Learning Rate ($p5$) |
|---|---|---|---|---|---|
| GA | 64 | - | 3.5e-5 | 2.0e-5 | 1.0e-5 |
| GAGDR | 128 | 1.0 | 4.9e-5 | 2.1e-5 | 1.7e-5 |
| GAKLR | 128 | 1.0 | 5.2e-5 | 2.4e-5 | 1.8e-5 |
| NPO | 64 | 1.0 | 6.2e-5 | 2.9e-5 | 2.1e-5 |

Table 17: Full hyperparameters for the Unlearning stage. Common configurations are listed at the top, followed by method-specific settings.

## Appendix IAdditional Explanation of Knowledge Separability Score

Table 18: Confusion Matrix for Multilingual Machine Unlearning

| Actual Value \ Predicted Value | Forget (True) | Retain (False) |
| --- | --- | --- |
| Target Knowledge (True) | True Positive (TP) (Successfully Forgotten) | False Negative (FN) (Failed to Forget) |
| Non-Target Knowledge (False) | False Positive (FP) (Wrongly Forgotten) | True Negative (TN) (Successfully Retained) |

In Section [5](https://arxiv.org/html/2605.14404#S5), we proposed the Knowledge Separability Score (KSS) utilizing ROC-AUC and PR-AUC. In this section, we detail the method for calculating KSS-ROC and KSS-PR, specifically describing how the knowledge-wise forgetting score ($S_i$) is employed in this process. ROC and PR analyses involve visualizing variations in classification performance across shifting decision thresholds and encapsulating this behavior into a single scalar value. To adapt this framework to the Multilingual Machine Unlearning (MMU) context, we first define the positive and negative classes. We designate the target knowledge as the positive class (1) and the non-target knowledge as the negative class (0). Adopting standard binary classification notation, the resulting confusion matrix is presented in Table [18](https://arxiv.org/html/2605.14404#A9.T18). Based on this configuration, the False Positive Rate (FPR), True Positive Rate (TPR, or Recall), and Precision required for ROC and PR calculations are computed as follows:

$$\text{TPR (Recall)} = \frac{\text{TP}}{\text{TP} + \text{FN}} \tag{13}$$

$$\text{FPR} = \frac{\text{FP}}{\text{FP} + \text{TN}} \tag{14}$$

$$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} \tag{15}$$

To derive KSS-ROC, we plot the curve with FPR on the $x$-axis and TPR (Recall) on the $y$-axis, observing how these metrics fluctuate as the threshold for the forgetting score ($S_i$) varies. Similarly, for KSS-PR, the curve is plotted with Recall on the $x$-axis and Precision on the $y$-axis. The final metric is determined by calculating the Area Under the Curve (AUC) for each respective graph. All ROC and PR computations were implemented using scikit-learn Pedregosa et al. ([2011](https://arxiv.org/html/2605.14404#bib.bib56)).
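The threshold sweep described above can be sketched in plain Python. This is a minimal, dependency-free illustration and not the authors' implementation, which relied on scikit-learn. Several things here are assumptions: the function names `confusion_counts`, `kss_roc`, and `kss_pr` are hypothetical, a larger $S_i$ is taken to mean stronger forgetting, the toy scores are invented for illustration, and the PR area is summarized with the step-wise average-precision rule because the text does not state which interpolation was used.

```python
from typing import Sequence, Tuple


def confusion_counts(scores: Sequence[float], labels: Sequence[int],
                     threshold: float) -> Tuple[int, int, int, int]:
    """Return (TP, FN, FP, TN) when S_i >= threshold is predicted as 'forget'.

    Assumption: label 1 = target knowledge, label 0 = non-target knowledge,
    and a larger S_i means the knowledge is more strongly forgotten.
    """
    tp = fn = fp = tn = 0
    for score, label in zip(scores, labels):
        predicted_forget = score >= threshold
        if label == 1:
            if predicted_forget:
                tp += 1  # target knowledge, successfully forgotten
            else:
                fn += 1  # target knowledge, failed to forget
        else:
            if predicted_forget:
                fp += 1  # non-target knowledge, wrongly forgotten
            else:
                tn += 1  # non-target knowledge, successfully retained
    return tp, fn, fp, tn


def kss_roc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve (FPR on x, TPR on y), trapezoidal rule."""
    points = [(0.0, 0.0)]
    # Sweep thresholds from strict to lenient so FPR and TPR grow monotonically.
    for threshold in sorted(set(scores), reverse=True):
        tp, fn, fp, tn = confusion_counts(scores, labels, threshold)
        points.append((fp / (fp + tn), tp / (tp + fn)))  # Eq. (14), Eq. (13)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def kss_pr(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the PR curve via the step-wise average-precision rule."""
    area = 0.0
    previous_recall = 0.0
    for threshold in sorted(set(scores), reverse=True):
        tp, fn, fp, _ = confusion_counts(scores, labels, threshold)
        recall = tp / (tp + fn)      # Eq. (13)
        precision = tp / (tp + fp)   # Eq. (15)
        area += (recall - previous_recall) * precision
        previous_recall = recall
    return area


if __name__ == "__main__":
    # Invented toy data: three target items (1) and three non-target items (0).
    toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
    toy_labels = [1, 1, 0, 1, 0, 0]
    print(f"KSS-ROC = {kss_roc(toy_scores, toy_labels):.4f}")
    print(f"KSS-PR  = {kss_pr(toy_scores, toy_labels):.4f}")
```

On the toy data the script prints 0.8889 for the ROC area and 0.9167 for the PR area. Both functions assume that each class contains at least one item. In practice, scikit-learn's `roc_auc_score` and `average_precision_score` compute the same two quantities.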

## Appendix J Detailed Results on Knowledge-wise Evaluation

### J.1 Knowledge Persistence Score

Full KPS results for all methods are provided in Tables [11](https://arxiv.org/html/2605.14404#A8.T11) to [16](https://arxiv.org/html/2605.14404#A8.T16).

### J.2 Knowledge Separability Score

In Figure [5](https://arxiv.org/html/2605.14404#S6.F5), we visualized the distributions of target and non-target knowledge with respect to the knowledge-wise forgetting score ($S_i$) for the NPO and PRUNE methods. Figures [15](https://arxiv.org/html/2605.14404#A12.F15) to [25](https://arxiv.org/html/2605.14404#A12.F25) show the full distributions of $S_i$.

## Appendix K Evaluation on Other LLMs

In the main paper, we primarily conducted evaluations using the Llama3.1-8B-Instruct model. In this section, we extend our analysis to Qwen3-4B-Instruct (Qwen3) Yang et al. ([2025b](https://arxiv.org/html/2605.14404#bib.bib57)) to demonstrate that both KPS and KSS remain valid and applicable metrics across different model architectures. Table [19](https://arxiv.org/html/2605.14404#A11.T19) presents the KPS results for Qwen3 after unlearning with the NPO and PRUNE methods, and Table [20](https://arxiv.org/html/2605.14404#A11.T20) presents the corresponding KSS results after unlearning with the NPO method. Note that Case 1 and Case 2 in both tables follow the definitions established in Section [J](https://arxiv.org/html/2605.14404#A10) and Section [6.1](https://arxiv.org/html/2605.14404#S6.SS1), respectively.

Table 19: KPS Results on Qwen3 with $p1$ setting

| $l_1$ | NPO (Case 1) | NPO (Case 2) | Prune (Case 1) | Prune (Case 2) |
| --- | --- | --- | --- | --- |
| bn | 0.10 | 0.15 | 0.05 | 0.13 |
| de | 0.07 | 0.06 | 0.03 | 0.12 |
| en | 0.04 | 0.04 | 0.02 | 0.12 |
| he | 0.08 | 0.10 | 0.06 | 0.16 |
| ru | 0.05 | 0.06 | 0.05 | 0.13 |
| sq | 0.12 | 0.13 | 0.05 | 0.14 |
| ta | 0.09 | 0.08 | 0.05 | 0.15 |
| zh | 0.09 | 0.15 | 0.02 | 0.13 |
| avg | 0.08 | 0.10 | 0.04 | 0.14 |

Table 20: Performance of KSS-ROC and KSS-PR scores in $p1$ setting

| Method | KSS-ROC (Case 1) | KSS-ROC (Case 2) | KSS-PR (Case 1) | KSS-PR (Case 2) |
| --- | --- | --- | --- | --- |
| BASE | 0.52 | 0.49 | 0.01 | 0.11 |
| NPO | 0.72 | 0.99 | 0.03 | 0.78 |
| PRUNE | 0.70 | 0.95 | 0.08 | 0.15 |

## Appendix L Use of AI Assistants

We utilize ChatGPT and Gemini for coding and writing assistance\. In particular, we employ ChatGPT for dataset generation\.

![[Uncaptioned image]](https://arxiv.org/html/2605.14404v1/x21.png)
![Refer to caption](https://arxiv.org/html/2605.14404v1/x22.png)

Figure 15: Distribution of generation-based $S_i$ scores for $p1$. The plots illustrate the distributions for Hold-out Language (Hold-out) and Training Language (Training).

![[Uncaptioned image]](https://arxiv.org/html/2605.14404v1/x23.png)
![Refer to caption](https://arxiv.org/html/2605.14404v1/x24.png)

Figure 17: Distribution of generation-based $S_i$ scores for $p3$. The plots illustrate the distributions for Hold-out Language (Hold-out) and Training Language (Training).

![[Uncaptioned image]](https://arxiv.org/html/2605.14404v1/x25.png)
![Refer to caption](https://arxiv.org/html/2605.14404v1/x26.png)

Figure 19: Distribution of generation-based $S_i$ scores for $p5$. The plots illustrate the distributions for Hold-out Language (Hold-out) and Training Language (Training).

![[Uncaptioned image]](https://arxiv.org/html/2605.14404v1/x27.png)
![Refer to caption](https://arxiv.org/html/2605.14404v1/x28.png)

Figure 21: Distribution of probability-based $S_i$ scores for $p1$. The plots illustrate the distributions for Hold-out Language (Hold-out) and Training Language (Training).

![[Uncaptioned image]](https://arxiv.org/html/2605.14404v1/x29.png)
![Refer to caption](https://arxiv.org/html/2605.14404v1/x30.png)

Figure 23: Distribution of probability-based $S_i$ scores for $p3$. The plots illustrate the distributions for Hold-out Language (Hold-out) and Training Language (Training).

![[Uncaptioned image]](https://arxiv.org/html/2605.14404v1/x31.png)
![Refer to caption](https://arxiv.org/html/2605.14404v1/x32.png)

Figure 25: Distribution of probability-based $S_i$ scores for $p5$. The plots illustrate the distributions for Hold-out Language (Hold-out) and Training Language (Training).