Unlearning with Asymmetric Sources: Improved Unlearning-Utility Trade-off with Public Data


Summary

This paper introduces Asymmetric Langevin Unlearning (ALU), a framework that leverages public data to improve the privacy-utility trade-off in machine unlearning. It demonstrates that ALU reduces unlearning costs and enables mass unlearning while maintaining high model utility.

arXiv:2605.11170v1 Announce Type: new

Abstract: Noise-based certified machine unlearning currently faces a hard ceiling: the noise magnitude required to certify unlearning typically destroys model utility, particularly for large-scale deletion requests. While leveraging public data is a standard technique in differential privacy to relax this tension, its role in unlearning remains unexplored. We address this gap by introducing Asymmetric Langevin Unlearning (ALU), a framework that uses public data to mitigate privacy costs. We prove that public data injection suppresses the unlearning cost by a factor of $O(1/n_{\mathrm{pub}}^2)$, guaranteeing a strict computational advantage over retraining. This establishes a new control mechanism: practitioners can mitigate the need for high noise, and the associated utility loss, by increasing the volume of public data. Crucially, we analyze the realistic setting of distribution mismatch, explicitly characterizing how shifts between public and private sources impact utility. We show that ALU enables mass unlearning of constant dataset fractions (a regime where standard symmetric methods become impractical) while maintaining high utility. Empirical evaluations using variational Rényi divergence and membership inference attacks confirm that ALU effectively thwarts privacy attacks while preserving utility under reasonable distribution shifts.

# Unlearning with Asymmetric Sources: Improved Unlearning-Utility Trade-off with Public Data
Source: [https://arxiv.org/html/2605.11170](https://arxiv.org/html/2605.11170)

Machine Learning, ICML

## 1 Introduction

The widespread adoption of machine learning across diverse applications has prompted regulatory responses aimed at protecting user privacy and data rights. Legislative frameworks such as the European Union's AI Act (Parliament and of the European Union, [2024](https://arxiv.org/html/2605.11170#bib.bib268)) and Canada's Artificial Intelligence and Data Act (AIDA) (Parliament of Canada, [2022](https://arxiv.org/html/2605.11170#bib.bib269)) establish fundamental principles including the "right to be forgotten," which mandates that individuals can request removal of their personal data from trained systems. While the most straightforward approach to these requests, retraining from scratch, provides perfect guarantees, it is computationally prohibitive for modern deep learning models. Consequently, the field has gravitated toward approximate methods, particularly the family of noise-injection and fine-tuning methods such as Langevin Unlearning (Chien et al., [2024a](https://arxiv.org/html/2605.11170#bib.bib249); Koloskova et al., [2025](https://arxiv.org/html/2605.11170#bib.bib37)). These methods offer certifiable privacy guarantees but operate under a strict trade-off: the magnitude of noise required to certify the data erasure degrades the model's utility.

In this work, we address this limitation by exploring an idea established in Differential Privacy (DP) but unexplored in machine unlearning: the integration of public data. We operate under the realistic assumption that while sensitive user data must be unlearnable, there often exists a corpus of public data that is not subject to retraction requests. We propose Asymmetric Langevin Unlearning (ALU), a framework that leverages this public data as a mechanism to improve the privacy-utility trade-off. To our knowledge, the only prior work exploring mixed-privacy unlearning is Golatkar et al. ([2021](https://arxiv.org/html/2605.11170#bib.bib221)), who introduced Mixed-Linear Forgetting for computer vision tasks. Their approach requires architectural modifications to achieve forgetting through network linearization, limiting its applicability. In contrast, ALU operates directly on standard training pipelines. Intuitively, public data acts as a stability anchor; it ensures that the weight distributions of the originally trained model and the retrained model remain naturally close. This proximity reduces the need for noise injection to bridge the gap between distributions, thereby preserving utility.

- We prove that injecting public data creates a more favorable initialization for the unlearning process, reducing the unlearning cost by a factor of $\mathcal{O}(1/n_{\mathrm{pub}}^2)$ ([Theorems 3.1](https://arxiv.org/html/2605.11170#S3.Thmtheorem1) and [3.2](https://arxiv.org/html/2605.11170#S3.Thmtheorem2)). This structural advantage enables two capabilities:
  - Mass unlearning: Unlike prior methods, where noise requirements are independent of the total dataset size, ALU allows for unlearning constant fractions of the private dataset ([Corollary 3.1](https://arxiv.org/html/2605.11170#S3.Thmcorollary1)), ensuring robustness to large-scale deletion.
  - Computational advantage: We revisit the efficiency of unlearning compared to retraining from scratch. While standard Langevin Unlearning is known to be efficient asymptotically, we establish a stronger result: ALU maintains a strict computational advantage over retraining even in finite-step regimes.
- We depart from the idealized assumption of identical private-public distributions. We derive a new generalization bound ([Theorem 4.1](https://arxiv.org/html/2605.11170#S4.Thmtheorem1)) that explicitly quantifies the trade-off between the benefits of noise reduction and the penalties of distribution mismatch between public and private sources.
- We introduce a rigorous evaluation methodology using variational estimation of the Rényi divergence to validate our bounds. Our experiments confirm that ALU successfully defends against membership inference attacks (U-LiRA) while preserving significantly higher utility than symmetric baselines.

## 2 Related work

### 2.1 Machine unlearning

Machine unlearning algorithms eliminate the influence of designated training data (the *forget set*) while balancing unlearning efficacy, model utility, and computational efficiency. Three canonical strategies illustrate the trade-offs: random re-initialization achieves perfect unlearning but destroys utility; retraining from scratch provides optimal guarantees but incurs prohibitive costs; no intervention preserves utility but achieves no unlearning. Unlearning paradigms diverge between *exact unlearning*, which matches the retraining baseline but limits expressivity or efficiency (Cao and Yang, [2015](https://arxiv.org/html/2605.11170#bib.bib225); Yan et al., [2022](https://arxiv.org/html/2605.11170#bib.bib212)), and *approximate unlearning*, which provides certified approximations of retraining (Nguyen et al., [2020](https://arxiv.org/html/2605.11170#bib.bib117); Guo et al., [2023](https://arxiv.org/html/2605.11170#bib.bib224); Chien et al., [2024b](https://arxiv.org/html/2605.11170#bib.bib244); Koloskova et al., [2025](https://arxiv.org/html/2605.11170#bib.bib37)).

### 2.2 Langevin Unlearning

A common approach to machine unlearning is to run a noisy projected gradient method starting from the trained weights, targeting a distribution close to retraining. Formally, at iteration $t$,

$$\theta_{t+1}=\Pi_{\Theta}\left[\theta_{t}-\eta\nabla_{\theta}\mathcal{L}(\theta_{t})+\xi_{t}\right],\qquad(1)$$

where $\mathcal{L}$ is a surrogate loss (e.g., the empirical loss on a retain set), $\eta$ is the step size, and $\xi_{t}$ is injected noise (often Gaussian) controlling distributional closeness.

Langevin Unlearning (LU) (Chien et al., [2024a](https://arxiv.org/html/2605.11170#bib.bib249)) instantiates this scheme with $\mathcal{L}=\mathcal{L}_{\mathcal{D}_{r}}$, the loss on the retain set, and $\xi_{t}\sim\mathcal{N}(0,I_{d})$. This reduces to projected noisy gradient descent (PNGD) (pseudocode in [Appendix E](https://arxiv.org/html/2605.11170#A5)).
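As a minimal sketch of this update rule (the quadratic surrogate loss, ball projection, and noise scaling below are illustrative choices of ours, not the paper's experimental setup), PNGD can be written as:

```python
import numpy as np

def pngd(grad, theta0, eta=0.05, sigma=0.1, radius=10.0, steps=200, seed=0):
    """Projected noisy gradient descent (PNGD):
    theta <- Proj_Theta[theta - eta * grad(theta) + xi], with Gaussian xi."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        # Injected Gaussian noise; the sqrt(eta) scaling is an illustrative choice.
        xi = rng.normal(scale=sigma * np.sqrt(eta), size=theta.shape)
        theta = theta - eta * grad(theta) + xi
        # Projection onto the Euclidean ball of the given radius.
        norm = np.linalg.norm(theta)
        if norm > radius:
            theta *= radius / norm
    return theta

# Toy surrogate loss L(theta) = 0.5 * ||theta - target||^2 on a retain set,
# so grad(theta) = theta - target.
target = np.array([1.0, -2.0])
theta_final = pngd(lambda th: th - target, theta0=np.zeros(2))
```

With small noise the iterates concentrate near the minimizer of the surrogate loss; increasing `sigma` widens the stationary distribution, which is exactly the privacy/utility dial discussed above.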

LU provides certifiable approximate unlearning guarantees by minimizing the Rényi divergence between post-unlearning and post-retraining weight distributions (Chien et al., [2024a](https://arxiv.org/html/2605.11170#bib.bib249), [b](https://arxiv.org/html/2605.11170#bib.bib244)). However, these guarantees require that the *entire original training process* satisfies differential privacy (DP). This necessitates injecting substantial noise starting from the very first training iteration, which compromises the utility of the base model even before any unlearning request. In this work, we improve upon Chien et al. ([2024a](https://arxiv.org/html/2605.11170#bib.bib249)) by leveraging public data to mitigate this "privacy tax". We demonstrate that anchoring the training process with public data allows us to reduce the noise magnitude required during both learning and unlearning while satisfying the same guarantees. By assuming the initialization satisfies a log-Sobolev inequality (a mild condition satisfied by Gaussian initialization), we derive data-dependent bounds showing that public data acts as a stabilizer, effectively "subsidizing" the privacy cost of the private data. Concurrent approaches such as Koloskova et al. ([2025](https://arxiv.org/html/2605.11170#bib.bib37)) require only smoothness assumptions, but such data-agnostic bounds depend primarily on the geometry of the projection set rather than exploiting the structural advantage of public data.

## 3 Asymmetric Langevin Unlearning

### 3.1 Preliminaries

Motivation. Our approach is motivated by a realistic data setting, well established in the private machine learning literature (Alon et al., [2019](https://arxiv.org/html/2605.11170#bib.bib283); Amid et al., [2022](https://arxiv.org/html/2605.11170#bib.bib273); Ganesh et al., [2023](https://arxiv.org/html/2605.11170#bib.bib220); Lowy et al., [2024](https://arxiv.org/html/2605.11170#bib.bib274)), that leverages public data to improve the privacy-utility trade-off. We introduce this asymmetric data model to Langevin Unlearning, which allows us to relax the restrictive Differential Privacy (DP) assumption over the entire dataset. By explicitly modeling this asymmetry, we can leverage public data to enhance the unlearning process, improving both efficacy and model performance without compromising privacy guarantees.

Notation. We consider probability distributions defined over a compact parameter space $\Theta$, where stochasticity arises from three sources: the weight initialization distribution $\pi_{0}$, the training data distribution $P_{\mathrm{train}}$, and the inherent randomness of the optimization procedure. We denote by $\mathcal{P}(\Theta)$ the set of probability distributions supported on $\Theta$. We study the weight distributions $\pi_{S}^{t}$, where $S\in\{L,U,R\}$ identifies the training regime and $t$ denotes the iteration count. A key quantity in our analysis is the Rényi divergence of order $\alpha$ between distributions $P$ and $Q$, denoted $D_{\alpha}(P\|Q)$ ([Definition 3.2](https://arxiv.org/html/2605.11170#S3.Thmdefinition2)). We use $P_{\mathrm{pub}}$ and $P_{\mathrm{priv}}$ to represent the distributions of public and private data, respectively.

Problem Setting. We consider empirical risk minimization over a dataset $D=D_{\mathrm{pub}}\cup D_{\mathrm{priv}}$ comprising two components: a public set $D_{\mathrm{pub}}$ with $n_{\mathrm{pub}}$ samples from a distribution $P_{\mathrm{pub}}$, and a private set $D_{\mathrm{priv}}$ with $n_{\mathrm{priv}}$ samples from a distribution $P_{\mathrm{priv}}$. The training loss is $\mathcal{L}_{D}(\theta)=\frac{1}{n_{\mathrm{pub}}+n_{\mathrm{priv}}}\sum_{x\in D}\ell(\theta,x)$. Only the private data is subject to unlearning requests, while public data remains permanently available. We employ $T$ PNGD iterations with projections onto $\Theta\subset\mathbb{R}^{d}$ (radius $R$) to obtain $\theta_{T}$. Since PNGD injects Gaussian noise at each step, it induces probability distributions over the parameter space. To ensure convergence and certifiable guarantees, we assume the initialization distribution satisfies a log-Sobolev inequality (LSI):

###### Definition 3.1.

(Log-Sobolev inequality (Gross, [1975](https://arxiv.org/html/2605.11170#bib.bib277))) A probability measure $P\in\mathcal{P}(\mathbb{R}^{d})$ satisfies a log-Sobolev inequality with constant $C$ if

$$\forall Q\in\mathcal{P}(\mathbb{R}^{d}),\quad D_{KL}(Q\|P)\leq\frac{C}{2}I(Q,P),\qquad(2)$$

where $D_{KL}$ denotes the KL divergence and $I(Q,P)=\mathbb{E}_{Q}\left[\left\|\nabla\log\frac{q}{p}\right\|^{2}\right]$ is the relative Fisher information.

##### Weight Distributions

We analyze three distributions induced by PNGD on $D=D_{\mathrm{pub}}\cup D_{\mathrm{priv}}$ given a request $D_{\mathrm{forget}}\subseteq D_{\mathrm{priv}}$ ([Figure 1](https://arxiv.org/html/2605.11170#S3.F1)):

- Learning distribution $\pi_{L}^{T}$: results from $T$ iterations on $D$ with $\theta_{0}\sim\pi_{0}$, representing the pre-unlearning model;
- Unlearning distribution $\pi_{U}^{K}$: results from $K$ fine-tuning iterations on $D\setminus D_{\mathrm{forget}}$, initialized from $\theta\sim\pi_{L}^{T}$;
- Retraining distribution $\pi_{R}^{T}$: results from $T$ iterations on $D\setminus D_{\mathrm{forget}}$ with $\theta_{0}\sim\pi_{0}$, serving as the retraining baseline.

![Refer to caption](https://arxiv.org/html/2605.11170v1/x1.png)

Figure 1: Training pipelines showing the relationship between learning, unlearning, and retraining with public data injection. The divergence $D_{\alpha}(\pi_{R}^{T}\|\pi_{L}^{T})$ quantifies how public data helps maintain similarity between the retraining and original learning distributions, facilitating subsequent unlearning.

Following Chien et al. ([2024a](https://arxiv.org/html/2605.11170#bib.bib249)), we measure unlearning quality via the Rényi divergence.

###### Definition 3.2.

For probability measures $P,Q$ with $P\ll Q$, their Rényi divergence of order $\alpha\in(0,+\infty)\setminus\{1\}$ is

$$D_{\alpha}(P\|Q)=\frac{1}{\alpha-1}\log\mathbb{E}_{Q}\left[\left(\frac{dP}{dQ}\right)^{\alpha}\right],$$

where $\frac{dP}{dQ}$ is the Radon-Nikodym derivative. This generalizes the KL divergence ($\alpha\to 1$) and reverse KL ($\alpha\to 0$), and connects to $\varepsilon$-differential privacy in the limit $\alpha\to\infty$ (Mironov, [2017](https://arxiv.org/html/2605.11170#bib.bib27)).
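To make the definition concrete, here is a small numerical check (our own illustrative example, not from the paper): for two Gaussians with equal variance, $D_{\alpha}(\mathcal{N}(\mu_{P},\sigma^{2})\|\mathcal{N}(\mu_{Q},\sigma^{2}))=\alpha(\mu_{P}-\mu_{Q})^{2}/(2\sigma^{2})$, which a plug-in Monte Carlo estimate of the defining expectation reproduces:

```python
import numpy as np

def renyi_gaussian(alpha, mu_p, mu_q, sigma):
    """Closed-form Renyi divergence between two 1-D Gaussians with equal variance."""
    return alpha * (mu_p - mu_q) ** 2 / (2 * sigma ** 2)

def renyi_mc(alpha, mu_p, mu_q, sigma, n=200_000, seed=0):
    """Plug-in Monte Carlo estimate of (1/(alpha-1)) * log E_Q[(dP/dQ)^alpha]."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mu_q, sigma, size=n)  # samples from Q
    # log(dP/dQ) for equal-variance Gaussians
    log_ratio = ((x - mu_q) ** 2 - (x - mu_p) ** 2) / (2 * sigma ** 2)
    return np.log(np.mean(np.exp(alpha * log_ratio))) / (alpha - 1)

alpha, mu_p, mu_q, sigma = 2.0, 0.3, 0.0, 1.0
exact = renyi_gaussian(alpha, mu_p, mu_q, sigma)  # = 2 * 0.09 / 2 = 0.09
est = renyi_mc(alpha, mu_p, mu_q, sigma)          # should land close to `exact`
```

The same plug-in recipe underlies divergence estimation whenever the density ratio is available; when it is not, a variational estimator is needed (as in the experiments of Section 5).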

The effectiveness of unlearning is measured by $D_{\alpha}(\pi_{U}^{K}\|\pi_{R}^{T+K})$, while the presence of public data helps control $D_{\alpha}(\pi_{R}^{T}\|\pi_{L}^{T})$, creating favorable conditions for the unlearning process.

### 3.2 Unlearning Performance

We now present theoretical guarantees for asymmetric Langevin unlearning that demonstrate how public data improves unlearning efficiency. Our analysis adapts the prior work of Chien et al. ([2024a](https://arxiv.org/html/2605.11170#bib.bib249)) by relaxing the global differential privacy assumption and providing an explicit characterization of how the public and private data contributions differ in the unlearning bounds. The following result explains how public data reduces reliance on differential privacy constraints:

###### Theorem 3.1 (The role of public data in shrinking the learning/retraining mismatch).

Suppose that the loss is $L$-smooth and $M$-Lipschitz, and that the initialization distribution satisfies a $C_{0}$-log-Sobolev inequality. Moreover, suppose that the PNGD updates project onto a compact set $\Theta$ of radius $R$. Then at learning iteration $T$, we have the following upper bound on the Rényi divergence between the retraining $\pi_{R}^{T}$ and learning $\pi_{L}^{T}$ distributions:

$$\frac{D_{\alpha}(\pi_{R}^{T}\|\pi_{L}^{T})}{\alpha}\leq\frac{2M^{2}\eta^{2}n_{\mathrm{forget}}^{2}}{(n_{\mathrm{pub}}+n_{\mathrm{priv}})^{2}\sigma^{2}}\sum_{t=1}^{T-1}\prod_{t'=t}^{T-1}h(t',\eta,\sigma),$$

where $h(t',\eta,\sigma)=\left(1+\frac{\eta\sigma^{2}}{C_{t',1}}\right)^{-1}$, and $0<C_{t',1}\leq(1+\eta L)^{2K}C_{0}+2\eta\sigma^{2}\frac{(1+\eta L)^{2K}-1}{(1+\eta L)^{2}-1}$ are log-Sobolev constants of the distributions of the intermediate PNGD updates. Using the support's radius allows one to loosely upper bound those constants (Chien et al., [2024a](https://arxiv.org/html/2605.11170#bib.bib249)): $C_{t',1}\leq 6e^{\frac{4\tau}{\eta\sigma^{2}}}(4\tau^{2}+\eta\sigma^{2})$ with $\tau=R+\eta M$.

Proof sketch. The proof follows the analytical framework of Chien et al. ([2024a](https://arxiv.org/html/2605.11170#bib.bib249), Theorem 3.3), adapted to leverage the presence of public data in the training set. By distinguishing between the public and private data contributions in the gradient updates, we reduce the privacy erosion (Chourasia et al., [2021](https://arxiv.org/html/2605.11170#bib.bib206)) of each PNGD update. Full proof details are presented in [Section B.1](https://arxiv.org/html/2605.11170#A2.SS1).

This bound reveals that we can fix the noise magnitude $\sigma$ to be arbitrarily small to preserve performance while controlling the divergence through the volume of public data. When $n_{\mathrm{pub}}\gg n_{\mathrm{forget}}$, the learning and retraining distributions remain close regardless of the noise level, providing favorable initial conditions for unlearning ([Figure 3(b)](https://arxiv.org/html/2605.11170#S5.F3.sf2)). Specifically, the public data reduces the squared sensitivity of the gradient updates by a factor of $(n_{\mathrm{pub}}+n_{\mathrm{priv}})^{2}$.

This dependency allows us to characterize a regime where asymmetric unlearning provides meaningful guarantees while symmetric methods fail: the unlearning of a constant fraction of the private dataset.

###### Corollary 3.1 (LU noise vs. ALU noise).

Consider the setting where the forget set constitutes a constant fraction of the private data (i.e., $n_{\mathrm{forget}}=cn_{\mathrm{priv}}$ for some $c\in(0,1]$). Let $\sigma^{2}_{\mathrm{sym}}$ and $\sigma^{2}_{\mathrm{asym}}$ be the noise variances required to bound the Rényi divergence between the learning and retraining distributions by a fixed $\varepsilon$ in the symmetric and asymmetric settings, respectively. These variances satisfy the following lower bounds:

$$\sigma^{2}_{\mathrm{sym}}\geq\frac{2M^{2}\eta^{2}c^{2}}{\varepsilon}\sum_{t=1}^{T-1}\prod_{t'=t}^{T-1}h(t',\eta,\sigma),$$

$$\sigma^{2}_{\mathrm{asym}}\geq\frac{2M^{2}\eta^{2}c^{2}}{\varepsilon}\left(\sum_{t=1}^{T-1}\prod_{t'=t}^{T-1}h(t',\eta,\sigma)\right)\left(\frac{n_{\mathrm{priv}}}{n_{\mathrm{priv}}+n_{\mathrm{pub}}}\right)^{2}.$$

Proof sketch. The proof follows immediately from [Theorem 3.1](https://arxiv.org/html/2605.11170#S3.Thmtheorem1), by noting that $\frac{n_{\mathrm{forget}}}{n_{\mathrm{pub}}+n_{\mathrm{priv}}}=c\,\frac{n_{\mathrm{priv}}}{n_{\mathrm{pub}}+n_{\mathrm{priv}}}$.

Remark. A limitation of the standard symmetric setting ($n_{\mathrm{pub}}=0$) is that the required noise variance $\sigma^{2}_{\mathrm{sym}}$ has no dependency on the total number of training points when unlearning a constant fraction $c$. This implies that increasing the dataset size does not mitigate the unlearning cost, rendering the bound vacuous for large cohorts, where the required high noise magnitude destroys utility.

In contrast, our framework introduces an explicit dependency on the dataset composition. The required noise scales with the ratio $\frac{n_{\mathrm{priv}}}{n_{\mathrm{priv}}+n_{\mathrm{pub}}}$, meaning the sensitivity of the distribution effectively decays as public data is added. This allows the learning distribution to remain arbitrarily close to the retraining distribution, even for large forget sets, provided sufficient public data is available. We illustrate this advantage in [Figure 2](https://arxiv.org/html/2605.11170#S3.F2).
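Treating the $h$-product term in Corollary 3.1 as fixed (an approximation, since $h$ itself depends on $\sigma$), the two lower bounds imply that the required noise magnitude shrinks by the factor $n_{\mathrm{priv}}/(n_{\mathrm{priv}}+n_{\mathrm{pub}})$. A tiny sketch of that ratio:

```python
def noise_ratio(n_priv, n_pub):
    """sigma_asym / sigma_sym implied by the Corollary 3.1 lower bounds,
    holding the h-product term fixed (illustrative approximation)."""
    return n_priv / (n_priv + n_pub)

# Symmetric setting (n_pub = 0): no reduction, regardless of dataset size.
r_sym = noise_ratio(10_000, 0)        # 1.0
# Adding 9x as much public data cuts the required noise magnitude by 10x.
r_asym = noise_ratio(10_000, 90_000)  # 0.1
```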

![Refer to caption](https://arxiv.org/html/2605.11170v1/x2.png)

Figure 2: Required noise magnitude $\sigma$ to bound $D_{\alpha}(\pi_{R}^{T}\|\pi_{L}^{T})$ as a function of the forget fraction $c=n_{\mathrm{forget}}/n_{\mathrm{priv}}$. Values are computed assuming a strongly convex loss (Chien et al., [2024a](https://arxiv.org/html/2605.11170#bib.bib249)) for a binary classification task. Details are deferred to [Section B.3](https://arxiv.org/html/2605.11170#A2.SS3).

###### Theorem 3.2 (Convergence guarantee of Langevin unlearning (Chien et al., [2024a](https://arxiv.org/html/2605.11170#bib.bib249), Theorem 3.2)).

Suppose that the loss is $L$-smooth and $M$-Lipschitz, and that the learning distribution of weights at time $T$ satisfies a $C$-log-Sobolev inequality. Then the Rényi divergence between $\pi_{U}^{K}$ (the unlearning distribution after $K$ iterations) and the retraining distribution after $T+K$ iterations is upper bounded by

$$D_{\alpha}(\pi_{R}^{T+K}\|\pi_{U}^{K})\leq D_{\alpha}(\pi_{L}^{T}\|\pi_{R}^{T})\times\min\left(g_{\alpha,\eta,L}(k,\sigma),\ \exp\left(-\frac{2K\sigma^{2}\eta}{\alpha\tilde{C}}\right)\right),$$

where $g_{\alpha,\eta,L}(k,\sigma)=\prod_{k=1}^{K}\left(1+\frac{2\eta\sigma^{2}}{(1+\eta L)^{2}C_{U,k}}\right)^{-1/\alpha}$, $0<C_{k}\leq(1+\eta L)^{2K}C+2\eta\sigma^{2}\frac{(1+\eta L)^{2K}-1}{(1+\eta L)^{2}-1}$, and $\tilde{C}\leq 6\left(4\tau^{2}+2\eta\sigma^{2}\right)\exp\left(\frac{4\tau^{2}}{2\eta\sigma^{2}}\right)$.

Moreover, if the loss function is $m$-strongly convex and the initial log-Sobolev constant satisfies $C>\frac{\sigma^{2}}{m}$, we get the following exponential decay of the Rényi divergence with respect to the unlearning iteration:

$$D_{\alpha}(\pi_{R}^{T+K}\|\pi_{U}^{K})\leq D_{\alpha}(\pi_{L}^{T}\|\pi_{R}^{T})\exp\left(-\frac{2K\sigma^{2}\eta}{C\alpha}\right).$$

This theorem establishes the convergence guarantee for Langevin unlearning by showing that the Rényi divergence between the unlearning and retraining distributions decreases exponentially with the number of unlearning iterations $K$, with the prefactor given by the initial divergence $D_{\alpha}(\pi_{L}^{T}\|\pi_{R}^{T})$. When combined with [Theorem 3.1](https://arxiv.org/html/2605.11170#S3.Thmtheorem1), this reveals the mechanism by which public data improves unlearning: the quadratic reduction in initial divergence from public data injection translates directly into tighter convergence bounds.
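Under the strongly convex decay above, the number of unlearning iterations needed to drive the divergence below a target $\varepsilon$ is $K=\lceil\frac{C\alpha}{2\sigma^{2}\eta}\log(D_{0}/\varepsilon)\rceil$, where $D_{0}=D_{\alpha}(\pi_{L}^{T}\|\pi_{R}^{T})$. The sketch below uses made-up constants (not values from the paper); the point is that shrinking $D_{0}$ via public data reduces $K$:

```python
import math

def unlearning_iterations(d0, eps, sigma, eta, C, alpha):
    """Smallest K with d0 * exp(-2*K*sigma**2*eta / (C*alpha)) <= eps."""
    if d0 <= eps:
        return 0
    rate = 2 * sigma ** 2 * eta / (C * alpha)
    return math.ceil(math.log(d0 / eps) / rate)

# Illustrative constants, not values from the paper's experiments.
sigma, eta, C, alpha, eps = 0.1, 0.05, 1.0, 2.0, 1e-3
k_base = unlearning_iterations(1.0, eps, sigma, eta, C, alpha)
# Public data shrinking the initial divergence by (n_priv/(n_priv+n_pub))^2 = (1/10)^2:
k_pub = unlearning_iterations(1.0 * 0.1 ** 2, eps, sigma, eta, C, alpha)
```

Because the dependence on $D_{0}$ is logarithmic, the quadratic suppression from public data translates into a constant-factor saving in iterations rather than a quadratic one, which is consistent with the finite-step analysis below.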

##### Remark (On the tractability of log-Sobolev constants)

While [Theorems 3.1](https://arxiv.org/html/2605.11170#S3.Thmtheorem1) and [3.2](https://arxiv.org/html/2605.11170#S3.Thmtheorem2) provide explicit convergence bounds, we acknowledge that the log-Sobolev (LSI) constants ($C_{t}$, $\tilde{C}$) are difficult to estimate for deep neural networks and may scale poorly with dimension. However, our core contribution, the role of public data, operates independently of these absolute constants. Specifically, the quadratic reduction in gradient sensitivity ([Theorem 3.1](https://arxiv.org/html/2605.11170#S3.Thmtheorem1)) acts as a multiplicative factor that improves the divergence bound relative to the baseline, regardless of the specific value of the LSI constant. Thus, even if the absolute convergence rate is loose due to a large $C$, the relative advantage of injecting public data remains theoretically guaranteed.

##### Computational advantage

Assuming the loss is $m$-strongly convex, the computational advantage of ALU over retraining is determined by the training budget $T$, the privacy requirement $\varepsilon$, and the fraction of private data to unlearn, $n_{\mathrm{forget}}/n_{\mathrm{priv}}$. When $T$ is small, unlearning is more expensive than retraining in traditional settings ($n_{\mathrm{pub}}=0$). ALU overcomes this via the public data ratio, remaining efficient ($K<T$) only if $n_{\mathrm{pub}}/n_{\mathrm{priv}}\geq\mathcal{C}\cdot T^{\beta}\cdot\varepsilon^{-1/2}\cdot(n_{\mathrm{forget}}/n_{\mathrm{priv}})^{-1}$, where $\beta=\sigma^{2}\eta/C\alpha$. Here, public data acts as a buffer that facilitates distribution alignment without consuming the privacy budget. Conversely, in the large-$T$ regime, the cost of unlearning scales as $\mathcal{O}(\log(n_{\mathrm{total}}))$ (Chien et al., [2024a](https://arxiv.org/html/2605.11170#bib.bib249)). In this state of convergence, the addition of any data reduces the initial divergence, ensuring unlearning is strictly more efficient than retraining. See [Appendix D](https://arxiv.org/html/2605.11170#A4) and [Proposition D.1](https://arxiv.org/html/2605.11170#A4.Thmproposition1) for the derivation of a sufficient condition characterizing the computational advantage of unlearning over retraining.
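The small-$T$ sufficient condition above can be checked mechanically. In this sketch, `const` stands in for the unspecified constant $\mathcal{C}$, and all numbers are illustrative, not taken from the paper:

```python
def alu_is_efficient(n_pub, n_priv, n_forget, T, eps, sigma, eta, C, alpha, const=1.0):
    """Sufficient condition for K < T in the small-T regime:
    n_pub/n_priv >= const * T**beta * eps**(-1/2) * (n_forget/n_priv)**(-1),
    with beta = sigma**2 * eta / (C * alpha). `const` is a hypothetical stand-in."""
    beta = sigma ** 2 * eta / (C * alpha)
    threshold = const * T ** beta * eps ** -0.5 / (n_forget / n_priv)
    return n_pub / n_priv >= threshold

kwargs = dict(n_priv=1_000, n_forget=100, T=1_000, eps=1.0,
              sigma=0.1, eta=0.05, C=1.0, alpha=2.0)
enough_public = alu_is_efficient(n_pub=11_000, **kwargs)
too_little_public = alu_is_efficient(n_pub=5_000, **kwargs)
```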

## 4 Performance Without Noise: The Role of Distribution Alignment

LU faces a fundamental dilemma: increasing noise improves unlearning guarantees but degrades model performance. Our asymmetric approach breaks this trade-off by leveraging public data abundance rather than noise amplification. However, the effectiveness of this strategy depends on the relationship between the public and private data distributions.

We now analyze when public data injection preserves performance, and when it introduces new challenges. Our results reveal that performance preservation is not automatic: it depends on the distributional alignment between public and private data. When these distributions are similar, public data acts as a performance stabilizer, allowing effective unlearning without quality degradation. Conversely, when the distributions differ significantly, performance impacts emerge, though they remain more controlled than under noise-based approaches.

We evaluate post-unlearning performance on the private data distribution *only*, reflecting realistic deployment scenarios where the primary concern is maintaining model quality on the sensitive data that remains after unlearning. Performance analysis on the full mixture of the public and private distributions is provided in [Appendix C](https://arxiv.org/html/2605.11170#A3) for completeness.

###### Theorem 4.1 (Generalization Bound under Distribution Mismatch).

Assume the data-generating distributions share the same support, the weight space $\Theta$ is compact, and the loss is $M$-Lipschitz with respect to $\theta$. Then we have the following upper bound on the generalization error on the private data after performing $K$ iterations of unlearning, initialized at a weight $\theta_{0}$ drawn from $\pi_{L}^{T}$:

$$\mathbb{E}_{\pi_{U}^{K}}[\mathcal{L}_{P_{\mathrm{priv}}}]\leq\underbrace{e^{\frac{n_{\mathrm{pub}}}{n_{\mathrm{pub}}+n_{\mathrm{retain}}}D_{\infty}(P_{\mathrm{priv}}\|P_{\mathrm{pub}})}}_{\text{distribution mismatch penalty}}\mathbb{E}_{\pi_{R}^{T+K}}[\mathcal{L}_{P_{\mathrm{train}}}]+M\cdot\mathrm{diam}(\Theta)\cdot\underbrace{\sqrt{\tfrac{1}{2}D_{\alpha}(\pi_{R}^{T+K}\|\pi_{U}^{K})}}_{\text{unlearning approximation error}},$$

where $\mathcal{L}_{P}(\theta)\coloneqq\mathbb{E}_{x\sim P}[\ell(\theta,x)]$ denotes the risk over distribution $P$; $P_{\mathrm{train}}$ is the mixture of $P_{\mathrm{pub}}$ and $P_{\mathrm{priv}}$ used during training; and $D_{\infty}(P\|Q)=\log\left(\operatorname{ess\,sup}\frac{p}{q}\right)$ is the infinite-order Rényi divergence, representing the worst-case regret (Erven and Harremoës, [2014](https://arxiv.org/html/2605.11170#bib.bib142)).

Proof sketch. The proof uses the Kantorovich-Rubinstein duality ([Theorem C.1](https://arxiv.org/html/2605.11170#A3.Thmtheorem1)) to bound the performance gap by the dual of the Wasserstein distance between $\pi_U^K$ and $\pi_R^{T+K}$, then relates this to the Rényi divergence via standard inequalities, leveraging the compactness of the weight space $\Theta$. For private data evaluation, importance weighting introduces a mismatch penalty controlled by the worst-case regret $D_{\infty}(P_{\mathrm{priv}}\|P_{\mathrm{pub}})$, weighted by the public data fraction. See [Appendix C](https://arxiv.org/html/2605.11170#A3) for full proof details.

When $n_{\mathrm{pub}}\rightarrow\infty$:

1. Aligned distributions ($D_{\infty}(P_{\mathrm{priv}}\|P_{\mathrm{pub}})\approx 0$): The distribution mismatch penalty vanishes, and the unlearned model’s performance on unseen private data is guaranteed to be at least as good as the retrained model’s performance on the training mixture. This represents the ideal scenario where public data injection preserves performance.
2. Misaligned distributions ($D_{\infty}(P_{\mathrm{priv}}\|P_{\mathrm{pub}})\gg 0$): The exponential penalty term dominates, causing the upper bound to become vacuous. While this confirms that performance degradation will occur, the bound’s looseness prevents us from quantifying the actual extent of this degradation. The true performance impact may be better than this worst-case guarantee suggests.
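To make the interplay of the two regimes concrete, the following toy calculation (ours, with purely illustrative numbers) evaluates the multiplicative penalty $e^{\frac{n_{\mathrm{pub}}}{n_{\mathrm{pub}}+n_{\mathrm{retain}}}D_{\infty}}$ from Theorem 4.1 for a few public-data volumes and mismatch levels:

```python
import math

def mismatch_penalty(n_pub: int, n_retain: int, d_inf: float) -> float:
    """Multiplicative distribution mismatch penalty from Theorem 4.1:
    exp((n_pub / (n_pub + n_retain)) * D_inf)."""
    return math.exp(n_pub / (n_pub + n_retain) * d_inf)

# Illustrative numbers (not from the paper): fixed retain set, growing public set.
for d_inf in (0.0, 0.1, 2.0):          # aligned ... strongly misaligned
    for n_pub in (1_000, 10_000, 100_000):
        p = mismatch_penalty(n_pub, n_retain=10_000, d_inf=d_inf)
        print(f"D_inf={d_inf:4.1f}  n_pub={n_pub:>7}  penalty={p:7.3f}")
```

When $D_{\infty}=0$ the penalty is identically 1 regardless of the public data volume, while for $D_{\infty}>0$ it grows toward $e^{D_{\infty}}$ as the public fraction increases, which is exactly the tension studied empirically in Section 5.2.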

## 5 Experiments

Our theoretical analysis provides upper bounds on the Rényi divergence $D_{\alpha}(\pi_R^{T+K}\|\pi_U^K)$ that governs unlearning performance. However, these bounds involve iteration-dependent log-Sobolev constants that are difficult to estimate in practice, making it unclear how tight our theoretical guarantees actually are. To gain empirical insight into the behavior of this divergence, we estimate its value using samples from the weight distributions. To our knowledge, this is the first attempt to evaluate unlearning performance through direct estimation of the Rényi divergence between the parameter distributions. Building on Birrell et al. ([2021](https://arxiv.org/html/2605.11170#bib.bib55), [2023](https://arxiv.org/html/2605.11170#bib.bib26)), we leverage the variational representation of the Rényi divergence for numerical estimation.

###### Theorem 5\.1\.

(Convex-conjugate variational approximation of the Rényi divergence (Birrell et al., [2023](https://arxiv.org/html/2605.11170#bib.bib26))) Let $P,Q$ be two probability distributions supported on $\Omega$, such that $P\ll Q$, and let $\mathcal{M}_b(\Omega)$ be the space of bounded measurable functions on $\Omega$. Then, $\forall\alpha\in(0,+\infty)\setminus\{1\}$,

$$\frac{D_{\alpha}(P\|Q)}{\alpha}=\sup_{g\in\mathcal{M}_{b}(\Omega),\,g<0}\int g\,dQ+\frac{1}{\alpha-1}\log\int|g|^{\frac{\alpha-1}{\alpha}}\,dP+\alpha^{-1}\left(\log\alpha+1\right).$$

This variational representation of the Rényi divergence allows us to obtain estimates of $D_{\alpha}(\pi_R^{T+K}\|\pi_U^K)$ using trained models as samples. We emphasize that this is not intended as a practical evaluation methodology for machine unlearning, as it requires training numerous models to obtain sufficient samples for reliable statistical estimation. Standard approaches like membership inference attacks (MIAs) (Shokri et al., [2017](https://arxiv.org/html/2605.11170#bib.bib279); Carlini et al., [2021](https://arxiv.org/html/2605.11170#bib.bib285); Hayes et al., [2024](https://arxiv.org/html/2605.11170#bib.bib210)) remain more suitable for practical evaluation. Our goal is purely investigative: to understand how the Rényi divergence behaves empirically, and to assess whether our theoretical bounds, despite containing intractable constants, capture its qualitative behavior.
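As a self-contained illustration of this estimator (our construction, not the paper's code), the sketch below estimates $D_2$ between two one-dimensional Gaussians by grid-searching a simple exponential family of negative test functions $g_{a,b}(x)=-e^{ax+b}$ in the variational objective, in place of the neural-network discriminators used in Section 5.1. For equal variances, the closed-form value is $D_\alpha(\mathcal{N}(\mu_1,\sigma^2)\|\mathcal{N}(\mu_2,\sigma^2))=\alpha(\mu_1-\mu_2)^2/(2\sigma^2)$.

```python
import numpy as np

def variational_objective(g_P, g_Q, alpha):
    """Monte Carlo value of the convex-conjugate Rényi objective for one g < 0.

    g_P, g_Q: values of g on samples from P and Q respectively.
    Since the representation is a supremum, any admissible g yields a
    lower bound on D_alpha(P||Q) / alpha.
    """
    beta = (alpha - 1.0) / alpha
    return (g_Q.mean()
            + np.log((np.abs(g_P) ** beta).mean()) / (alpha - 1.0)
            + (np.log(alpha) + 1.0) / alpha)

rng = np.random.default_rng(0)
alpha, mu1, mu2, sigma = 2.0, 1.0, 0.0, 1.0
xP = rng.normal(mu1, sigma, 200_000)   # samples from P
xQ = rng.normal(mu2, sigma, 200_000)   # samples from Q

# Grid search over g_{a,b}(x) = -exp(a*x + b); for these Gaussians the
# optimal test function happens to lie in this family.
best = max(
    variational_objective(-np.exp(a * xP + b), -np.exp(a * xQ + b), alpha)
    for a in np.linspace(0.0, 3.0, 31)
    for b in np.linspace(-4.0, 0.0, 41)
)
estimate = alpha * best
closed_form = alpha * (mu1 - mu2) ** 2 / (2 * sigma ** 2)  # = 1.0 here
print(f"variational estimate ~ {estimate:.3f}, closed form = {closed_form:.3f}")
```

Richer function classes (e.g. the neural discriminators above) tighten this lower bound; here the toy family already contains the maximizer.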

We present our findings in two parts: [Sections 5.1](https://arxiv.org/html/2605.11170#S5.SS1) and [5.2](https://arxiv.org/html/2605.11170#S5.SS2) investigate the behaviour of the upper bounds provided respectively in [Theorem 3.2](https://arxiv.org/html/2605.11170#S3.Thmtheorem2) and [Theorem 4.1](https://arxiv.org/html/2605.11170#S4.Thmtheorem1), while [Section 5.3](https://arxiv.org/html/2605.11170#S5.SS3) provides standard membership inference attack and utility evaluations to contextualize our approach within existing unlearning assessment practices.

### 5.1 Evaluating the Rényi Divergence

Experimental Setup. We evaluate our approach on a multi-class image classification task using two domains from the DomainNet dataset (Peng et al., [2019](https://arxiv.org/html/2605.11170#bib.bib281)): Quickdraw (sketches) and Clipart (stylized images), each containing 24 classes. We select these visually distinct domains to investigate how public-private data alignment affects unlearning and utility ([Figure 5](https://arxiv.org/html/2605.11170#A6.F5)).

The experimental configuration treats Clipart images as private data (subject to unlearning) and Quickdraw images as public data (permanently retained). For a training set of size $n=n_{\mathrm{pub}}+n_{\mathrm{priv}}$, we train models using cross-entropy loss and PNGD updates. To obtain samples from the weight distributions $\pi_U^K$ and $\pi_R^T$, we train $N$ models in parallel: one set undergoes unlearning (fine-tuning on the retain set after initial training), while another set trains from scratch on the retain set only. This procedure yields $N$ weight samples from each distribution. We approximate the variational Rényi representation ([Theorem 5.1](https://arxiv.org/html/2605.11170#S5.Ex9)) using neural network discriminators to parameterize the function space $\mathcal{M}_b(\Omega)$. This approach follows established practices in divergence estimation (Birrell et al., [2021](https://arxiv.org/html/2605.11170#bib.bib55), [2023](https://arxiv.org/html/2605.11170#bib.bib26); Belghazi et al., [2021](https://arxiv.org/html/2605.11170#bib.bib25)) (pseudo-code in [Section G.1](https://arxiv.org/html/2605.11170#A7.SS1)). Complete details on discriminator architecture and training procedures are provided in [Appendix G](https://arxiv.org/html/2605.11170#A7).
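For reference, a projected noisy gradient descent (PNGD) update has the generic form sketched below. This is a minimal illustration of the noisy, projected update used to sample weights on a compact $\Theta$, with hyperparameter names of our choosing rather than the paper's exact implementation:

```python
import numpy as np

def pngd_step(theta, grad_fn, eta, sigma, radius, rng):
    """One projected noisy gradient descent step (sketch).

    theta:   current weights (flat array)
    grad_fn: callable returning the loss gradient at theta
    eta:     step size; sigma: Langevin noise scale
    radius:  radius of the ball we project onto (compact weight space Theta)
    """
    noise = rng.normal(0.0, 1.0, size=theta.shape)
    theta = theta - eta * grad_fn(theta) + sigma * np.sqrt(2.0 * eta) * noise
    # Project back onto the ball of given radius, keeping Theta compact.
    norm = np.linalg.norm(theta)
    if norm > radius:
        theta = theta * (radius / norm)
    return theta

# Toy usage: quadratic loss 0.5*||theta||^2, whose gradient is theta itself.
rng = np.random.default_rng(0)
theta = np.ones(4)
for _ in range(100):
    theta = pngd_step(theta, lambda t: t, eta=0.1, sigma=0.01, radius=5.0, rng=rng)
print(np.linalg.norm(theta))  # remains inside the radius-5 ball
```

Running the same update on the retain set only (from scratch) versus after training on the full set is what produces the samples from $\pi_R^T$ and $\pi_U^K$ above.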

Results\.[Figure3\(a\)](https://arxiv.org/html/2605.11170#S5.F3.sf1)presents our Rényi estimation results, demonstrating the effectiveness of public data injection for improving unlearning efficiency\. The experiments are conducted usingN=30000N=30000models for each distribution and averaged across 5 discriminator trainings with spectral normalization\. The PNGD noise scale isσ=0\.01\\sigma=0\.01andα=2\\alpha=2\. The results show that increasing public data volume reducesDα​\(πRT\+K∥πUK\)D\_\{\\alpha\}\(\\pi\_\{R\}^\{T\+K\}\\\|\\pi\_\{U\}^\{K\}\), with the divergence decreasing both as a function of unlearning iterations and public data proportion\. To understand the mechanism driving these improvements, we conduct an ablation study examining the initial conditions after asingleunlearning iteration\.[Figure3\(b\)](https://arxiv.org/html/2605.11170#S5.F3.sf2)isolates the effect of public data on the starting distributions by measuringDα​\(πRT\+1∥πU1\)D\_\{\\alpha\}\(\\pi\_\{R\}^\{T\+1\}\\\|\\pi\_\{U\}^\{1\}\)as a function of public data volume\. Rather than directly improving the unlearning procedure itself, public data creates more favorable initial conditions by ensuring the learning and retraining weight distributions begin in closer proximity\. This mechanistic understanding validates our theoretical framework: public data primarily controls the initial gap between distributions \([Theorem3\.1](https://arxiv.org/html/2605.11170#S3.Thmtheorem1)\), which then propagates through the unlearning iterations to produce the final performance gains\.

![Refer to caption](https://arxiv.org/html/2605.11170v1/x3.png)(a) Variational Rényi divergence $D_{\alpha}(\pi_R^{T+K}\|\pi_U^K)$ vs. public data proportion. Increasing public data volume and unlearning iterations both consistently reduce divergence, confirming improved unlearning efficacy.
![Refer to caption](https://arxiv.org/html/2605.11170v1/x4.png)(b) Ablation study: initial distribution alignment ($K=1$). The divergence $D_{\alpha}(\pi_R^{T+1}\|\pi_U^1)$ decreases as the public data volume increases, validating the more favorable initialization provided by public data injection.

Figure 3: Rényi divergence estimation across varying public data (Quickdraw) volumes.
### 5.2 Distribution Alignment and the Unlearning-Utility Trade-off

[Theorem4\.1](https://arxiv.org/html/2605.11170#S4.Thmtheorem1)characterizes a trade\-off caused by public data injection: as we increase public data volume, theunlearning approximation errordecreases, yet thedistribution mismatch penaltysimultaneously grows\. The balance between these competing terms determines whether public data injection preserves or degrades model performance\. To empirically investigate this trade\-off, we conduct experiments across two distinct distributional regimes: one where the public and private domains exhibit moderate visual alignment, and another where they are substantially misaligned\.

We fix $K=5$ unlearning iterations and evaluate performance using the DomainNet dataset across two domain pairs. The aligned regime pairs Quickdraw (public) and Clipart (private), which despite visual stylistic differences share semantic structure. The misaligned regime pairs Infograph (public) and Real (private), which exhibit greater distributional divergence. We measure model performance via loss on the private data distribution $P_{\mathrm{priv}}$ after unlearning, comparing against the retraining baseline on the training mixture. Results are summarized in [Table 1](https://arxiv.org/html/2605.11170#S5.T1).

Table 1: Unlearning vs. Retraining Performance Across Distribution Alignments ($K=5$, Private: 20000 points, Forget Set: 10000 points).

The results reveal a contrast between the two regimes. In the aligned setting, the relative performance gap remains modest (3.68–4.62%) across varying public data volumes, suggesting that the mismatch penalty remains manageable and the approximation error reduction dominates. In contrast, the misaligned setting exhibits a persistent performance gap (10.34–10.81%), with minimal sensitivity to public data volume. This indicates that when distributional divergence is large, increasing public data fails to overcome the mismatch penalty, rendering the approximation error reduction insufficient to improve generalization.

### 5.3 Practical Evaluation of LU in the Asymmetric Setting

We now adopt standard evaluation methodology from the unlearning literature (Hayes et al., [2024](https://arxiv.org/html/2605.11170#bib.bib210)), highlighting the benefit of public data injection. We provide an overview here and defer details to Appendix [H](https://arxiv.org/html/2605.11170#A8).

Evaluation Method. This evaluation is based on the U-LiRA membership inference attack for unlearning (Hayes et al., [2024](https://arxiv.org/html/2605.11170#bib.bib210); Carlini et al., [2021](https://arxiv.org/html/2605.11170#bib.bib285)). Given a training set, forget set, and specified learning and unlearning algorithms, the adversary's goal is to infer whether a model's weights $\theta$ were drawn from the unlearning distribution $\pi_U^K$ or the retraining distribution $\pi_R^{T+K}$. Intuitively, lower attack accuracy indicates that the unlearning and retraining distributions are harder to distinguish, i.e., better unlearning.

In its most basic form, U-LiRA can be formalized via Bayes' rule under a uniform prior on whether the forget set was included during training. Letting $P(\theta\mid\cdot)$ denote the likelihood of observing model parameters $\theta$ under a given distribution, and $P(\cdot\mid\theta)$ the posterior probability that $\theta$ was drawn from that distribution, we have

$$P(\pi_U^K\mid\theta)=\frac{P(\theta\mid\pi_U^K)}{P(\theta\mid\pi_U^K)+P(\theta\mid\pi_R^{T+K})}.$$

By selecting a one-dimensional representation of the models $f:\Theta\rightarrow\mathbb{R}$ and assuming that the induced distributions $f_{\sharp}\pi_U^K$ and $f_{\sharp}\pi_R^{T+K}$ are Gaussian, we can estimate the likelihood terms $P(\theta\mid\cdot)$ from a tractable number of model samples.
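The Gaussian likelihood-ratio step can be sketched as follows; this is our minimal illustration of the Bayes-rule computation above (variable names are ours), not the paper's evaluation code:

```python
import numpy as np

def ulira_posterior(score, scores_unlearn, scores_retrain):
    """Posterior P(pi_U | theta) for a scalar model statistic f(theta) = score.

    scores_unlearn / scores_retrain: f(theta) evaluated on shadow models drawn
    from the unlearning and retraining distributions. Each set is fit with a
    Gaussian, and Bayes' rule with a uniform prior combines the likelihoods.
    """
    def gauss_pdf(x, samples):
        mu, sd = samples.mean(), samples.std(ddof=1)
        return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

    lu = gauss_pdf(score, scores_unlearn)
    lr = gauss_pdf(score, scores_retrain)
    return lu / (lu + lr)

# Toy shadow-model scores: well-separated distributions => confident attack.
rng = np.random.default_rng(0)
s_unlearn = rng.normal(0.0, 1.0, 50)   # e.g. a loss-based statistic, 50 shadow models
s_retrain = rng.normal(3.0, 1.0, 50)
print(ulira_posterior(0.0, s_unlearn, s_retrain))  # close to 1: attack succeeds
```

When unlearning works well, the two score distributions overlap, the posterior concentrates near 0.5, and attack accuracy drops, which is the behavior reported in Figure 4.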

![Refer to caption](https://arxiv.org/html/2605.11170v1/x5.png)Figure 4: U-LiRA confidence scores after $K$ unlearning iterations, shown as violin plots with quartiles.

Experimental Setup. For the sake of completeness, we focus this next set of experiments on a sentiment analysis task on the IMDB dataset of movie reviews (Maas et al., [2011](https://arxiv.org/html/2605.11170#bib.bib291)). This is a binary classification task, in which an LSTM (Hochreiter and Schmidhuber, [1997](https://arxiv.org/html/2605.11170#bib.bib295)) learns to recognize whether a review is negative or positive. We use the Amazon reviews dataset from Zhang et al. ([2015](https://arxiv.org/html/2605.11170#bib.bib292)) as the public data source. We use a forget set of 100 uniformly sampled examples from the IMDB dataset. For both experiments, i.e., with and without public data injection, we generate $N=50$ models to estimate each likelihood density, and report the empirical distribution of probabilities assigned to the correct origin distribution by U-LiRA (confidence scores) for 50 test models (25 from $\pi_U^K$ and 25 from $\pi_R^{T+K}$, where $T=50$ and $K=1,\dots,15$). [Figure 4](https://arxiv.org/html/2605.11170#S5.F4) highlights that without public data injection, U-LiRA is able to identify a large proportion of models confidently and correctly, even after a number of unlearning steps. This discriminative power is heavily impacted by public data injection. We also observe that the modes of the confidence scores generally decrease with the number of unlearning steps, highlighting the unlearning effectiveness of LU.

Having observed the effect of public data injection on the unlearning effectiveness of LU, we now turn to its impact on model utility. To this end, we report in Table [2](https://arxiv.org/html/2605.11170#S5.T2) the average model accuracies over the 75 models we trained for each model distribution, on a test set of 10,000 unseen samples from the IMDB dataset. As the Amazon reviews dataset appears to be a good auxiliary public data source for the IMDB review classification problem (close data distributions), we also include an experiment in which a uniformly sampled 40% of its labels are flipped, thus increasing the distribution mismatch between public and private sources.

Table 2: Unlearned and Retrained Model Test Accuracies (IMDB, 25000 private points).

From Table [2](https://arxiv.org/html/2605.11170#S5.T2), we observe that model accuracy does decrease with the injection of public data. However, this drop in accuracy is negligible compared to the extent to which public data injection improves the unlearning effectiveness of LU, as highlighted by Fig. [4](https://arxiv.org/html/2605.11170#S5.F4). As expected, the drop in accuracy is proportionately much lower when the quality of the auxiliary public data is high (1.17% for unlearned and 0.39% for retrained) than when it is low (2.19% for unlearned, an ≈1.87× increase, and 1.74% for retrained, an ≈4.46× increase).

## 6 Future Work

Our framework opens several compelling avenues for research into the interplay of public data and unlearning\. Algorithmic extensions include studying ALU in pre\-training/fine\-tuning regimes and developing adaptive algorithms that leverage domain adaptation to optimally balance distribution alignment with unlearning efficiency\. We also envision a constrained optimization approach where the unlearning objective is regularized to keep the weight distribution close to a public\-only reference\.

Theoretically, addressing the intractable log-Sobolev constants in current Langevin analysis remains a priority. Transitioning to alternative isoperimetric assumptions (Chewi et al., [2021](https://arxiv.org/html/2605.11170#bib.bib180); Mousavi-Hosseini et al., [2023](https://arxiv.org/html/2605.11170#bib.bib181); Altschuler and Chewi, [2024](https://arxiv.org/html/2605.11170#bib.bib80)) or adopting weaker divergence measures could yield more tractable bounds than those provided by Rényi divergence. Finally, extending the analysis from weight distributions to output distributions would facilitate black-box evaluation.

Finally, as highlighted by Tramèr et al. ([2024](https://arxiv.org/html/2605.11170#bib.bib191)), large-scale datasets scraped from the web are not inherently privacy-neutral and may contain sensitive or copyrighted information. Applications should ensure that public datasets used for initialization do not introduce secondary leakage, so that the efficiency gains of ALU do not come at the cost of unintended exposure from the public source.

## 7 Conclusion

We have studied Langevin unlearning under the assumption of asymmetric data sources, where datasets contain both private and public data. Our theoretical analysis demonstrates that this framework fundamentally improves the unlearning-utility trade-off by enabling control over unlearning guarantees through data supplementation rather than noise amplification. The framework provides a fine-grained analysis of how distributional alignment between public and private data affects this trade-off: when distributions are well aligned, public data injection preserves utility while maintaining unlearning guarantees; when they are misaligned, the resulting performance penalties are controlled and remain more manageable than those of traditional noise-based approaches.

## Impact statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

## References

- N. Alon, R. Bassily, and S. Moran (2019) Limits of private learning with access to public data. [arXiv:1910.11519](https://arxiv.org/abs/1910.11519).
- J. M. Altschuler and K. Talwar (2022) Resolving the mixing time of the Langevin algorithm to its stationary distribution for log-concave sampling. arXiv preprint arXiv:2210.08448.
- J. M. Altschuler and S. Chewi (2024) Shifted Composition III: Local Error Framework for KL Divergence. [arXiv:2412.17997](http://arxiv.org/abs/2412.17997).
- E. Amid, A. Ganesh, R. Mathews, S. Ramaswamy, S. Song, T. Steinke, V. M. Suriyakumar, O. Thakkar, and A. Thakurta (2022) Public data-assisted mirror descent for private model training. [arXiv:2112.00193](https://arxiv.org/abs/2112.00193).
- M. I. Belghazi, A. Baratin, S. Rajeswar, S. Ozair, Y. Bengio, A. Courville, and R. D. Hjelm (2021) MINE: Mutual Information Neural Estimation. [arXiv:1801.04062](http://arxiv.org/abs/1801.04062).
- J. Birrell, P. Dupuis, M. A. Katsoulakis, L. Rey-Bellet, and J. Wang (2021) Variational Representations and Neural Network Estimation of Rényi Divergences. [arXiv:2007.03814](http://arxiv.org/abs/2007.03814).
- J. Birrell, Y. Pantazis, P. Dupuis, M. A. Katsoulakis, and L. Rey-Bellet (2023) Function-space regularized Rényi divergences. [arXiv:2210.04974](http://arxiv.org/abs/2210.04974).
- Y. Cao and J. Yang (2015) Towards Making Systems Forget with Machine Unlearning. In 2015 IEEE Symposium on Security and Privacy, pp. 463–480.
- N. Carlini, S. Chien, M. Nasr, S. Song, A. Terzis, and F. Tramèr (2021) Membership inference attacks from first principles. IEEE Symposium on Security and Privacy.
- M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin (2021) Emerging properties in self-supervised vision transformers. [arXiv:2104.14294](https://arxiv.org/abs/2104.14294).
- H. Chen, S. Chewi, and J. Niles-Weed (2021) Dimension-free log-Sobolev inequalities for mixture distributions. Journal of Functional Analysis 281(11), pp. 109236.
- S. Chewi, M. A. Erdogdu, M. B. Li, R. Shen, and M. Zhang (2021) Analysis of Langevin Monte Carlo from Poincaré to Log-Sobolev. [arXiv:2112.12662](http://arxiv.org/abs/2112.12662).
- S. Chewi (2023) Log-concave sampling. Book draft available at https://chewisinho.github.io, pp. 17–18.
- E. Chien, H. Wang, Z. Chen, and P. Li (2024a) Langevin Unlearning: A New Perspective of Noisy Gradient Descent for Machine Unlearning. arXiv:2401.10371.
- E. Chien, H. Wang, Z. Chen, and P. Li (2024b) Stochastic Gradient Langevin Unlearning. arXiv:2403.17105.
- R. Chourasia, J. Ye, and R. Shokri (2021) Differential Privacy Dynamics of Langevin Diffusion and Noisy Gradient Descent. [arXiv:2102.05855](https://arxiv.org/abs/2102.05855v5).
- M. D. Donsker and S. S. Varadhan (1975) Asymptotic evaluation of certain Markov process expectations for large time, I. Communications on Pure and Applied Mathematics 28(1), pp. 1–47.
- T. v. Erven and P. Harremoës (2014) Rényi Divergence and Kullback-Leibler Divergence. IEEE Transactions on Information Theory 60(7), pp. 3797–3820.
- A. Ganesh, M. Haghifam, M. Nasr, S. Oh, T. Steinke, O. Thakkar, A. Thakurta, and L. Wang (2023) Why Is Public Pretraining Necessary for Private Model Training? [arXiv:2302.09483](https://arxiv.org/abs/2302.09483).
- A. L. Gibbs and F. E. Su (2002) On choosing and bounding probability metrics. [arXiv:math/0209021](http://arxiv.org/abs/math/0209021).
- A. Golatkar, A. Achille, A. Ravichandran, M. Polito, and S. Soatto (2021) Mixed-Privacy Forgetting in Deep Networks. [arXiv:2012.13431](https://arxiv.org/abs/2012.13431).
- T. H. Gronwall (1919) Note on the derivatives with respect to a parameter of the solutions of a system of differential equations. Annals of Mathematics 20(4), pp. 292–296.
- L. Gross (1975) Logarithmic Sobolev inequalities. American Journal of Mathematics 97(4), pp. 1061–1083.
- C. Guo, T. Goldstein, A. Hannun, and L. v. d. Maaten (2023) Certified Data Removal from Machine Learning Models. [arXiv:1911.03030](https://arxiv.org/abs/1911.03030).
- M. Hardt, B. Recht, and Y. Singer (2016) Train faster, generalize better: stability of stochastic gradient descent. In International Conference on Machine Learning, pp. 1225–1234.
- J. Hayes, I. Shumailov, E. Triantafillou, A. Khalifa, and N. Papernot (2024) Inexact Unlearning Needs More Careful Evaluations to Avoid a False Sense of Privacy. [arXiv:2403.01218](http://arxiv.org/abs/2403.01218).
- J. Hayes, I. Shumailov, E. Triantafillou, A. Khalifa, and N. Papernot (2025) Inexact unlearning needs more careful evaluations to avoid a false sense of privacy. In 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pp. 497–519.
- S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9(8), pp. 1735–1780.
- D. P. Kingma and J. Ba (2017) Adam: A Method for Stochastic Optimization. [arXiv:1412.6980](https://arxiv.org/abs/1412.6980).
- A. Koloskova, Y. Allouah, A. Jha, R. Guerraoui, and S. Koyejo (2025) Certified Unlearning for Neural Networks. [arXiv:2506.06985](http://arxiv.org/abs/2506.06985).
- A. G. Lamperski (2020) Projected Stochastic Gradient Langevin Algorithms for Constrained Sampling and Non-Convex Learning. [ArXiv](https://www.semanticscholar.org/paper/ede5a9ae87c1dee98098c243f6b44c30804acbdf).
- A. Lowy, Z. Li, T. Huang, and M. Razaviyayn (2024) Optimal differentially private model training with public data. In Proceedings of the 41st International Conference on Machine Learning, ICML'24.
- A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts (2011) Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA, pp. 142–150. [Link](http://www.aclweb.org/anthology/P11-1015).
- I. Mironov (2017) Rényi Differential Privacy. In 2017 IEEE 30th Computer Security Foundations Symposium (CSF), pp. 263–275.
- T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018) Spectral Normalization for Generative Adversarial Networks. [arXiv:1802.05957](http://arxiv.org/abs/1802.05957).
- A. Mousavi-Hosseini, T. K. Farghly, Y. He, K. Balasubramanian, and M. A. Erdogdu (2023) Towards a complete analysis of Langevin Monte Carlo: beyond Poincaré inequality. In The Thirty Sixth Annual Conference on Learning Theory, pp. 1–35.
- J. Neyman and E. S. Pearson (1933) IX. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London, Series A 231(694-706), pp. 289–337.
- Q. P. Nguyen, B. K. H. Low, and P. Jaillet (2020) Variational Bayesian Unlearning. In Advances in Neural Information Processing Systems, Vol. 33, pp. 16025–16036.
- M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P. Huang, S. Li, I. Misra, M. Rabbat, V. Sharma, G. Synnaeve, H. Xu, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski (2024) DINOv2: Learning Robust Visual Features without Supervision. [arXiv:2304.07193](https://arxiv.org/abs/2304.07193).
- European Parliament and Council of the European Union (2024) Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonised rules on artificial intelligence (Artificial Intelligence Act). Official Journal of the European Union, L 2024/1689. [Link](https://eur-lex.europa.eu/eli/reg/2024/1689/oj/eng).
- Parliament of Canada \(2022\)An act to enact the consumer privacy protection act, the personal information and data protection tribunal act and the artificial intelligence and data act and to make consequential and related amendments to other acts\.Note:Bill C\-27, 44th Parliament, 1st SessionDigital Charter Implementation Act, 2022External Links:[Link](https://www.parl.ca/legisinfo/en/bill/44-1/c-27)Cited by:[§1](https://arxiv.org/html/2605.11170#S1.p1.1)\.
- X\. Peng, Q\. Bai, X\. Xia, Z\. Huang, K\. Saenko, and B\. Wang \(2019\)Moment matching for multi\-source domain adaptation\.InProceedings of the IEEE International Conference on Computer Vision,pp\. 1406–1415\.Cited by:[Figure 5](https://arxiv.org/html/2605.11170#A6.F5),[Figure 5](https://arxiv.org/html/2605.11170#A6.F5.3.2),[Figure 6](https://arxiv.org/html/2605.11170#A6.F6),[Figure 6](https://arxiv.org/html/2605.11170#A6.F6.3.2),[Appendix F](https://arxiv.org/html/2605.11170#A6.p1.1),[§5\.1](https://arxiv.org/html/2605.11170#S5.SS1.p1.1)\.
- M\. Raginsky, A\. Rakhlin, and M\. Telgarsky \(2017\)Non\-convex learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis\.Note:\_eprint: 1702\.03849External Links:[Link](https://arxiv.org/abs/1702.03849)Cited by:[§C\.1](https://arxiv.org/html/2605.11170#A3.SS1.p1.4)\.
- R\. Shokri, M\. Stronati, C\. Song, and V\. Shmatikov \(2017\)Membership inference attacks against machine learning models\.External Links:1610\.05820,[Link](https://arxiv.org/abs/1610.05820)Cited by:[§5](https://arxiv.org/html/2605.11170#S5.p2.1)\.
- F\. Tramèr, G\. Kamath, and N\. Carlini \(2024\)Position: Considerations for Differentially Private Learning with Large\-Scale Public Pretraining\.InProceedings of the 41st International Conference on Machine Learning,pp\. 48453–48467\(en\)\.Note:ISSN: 2640\-3498External Links:[Link](https://proceedings.mlr.press/v235/tramer24a.html)Cited by:[§6](https://arxiv.org/html/2605.11170#S6.p3.1)\.
- S\. Vempala and A\. Wibisono \(2019\)Rapid Convergence of the Unadjusted Langevin Algorithm: Isoperimetry Suffices\.InAdvances in Neural Information Processing Systems,H\. Wallach, H\. Larochelle, A\. Beygelzimer, F\. d\. Alché\-Buc, E\. Fox, and R\. Garnett \(Eds\.\),Vol\.32\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2019/file/65a99bb7a3115fdede20da98b08a370f-Paper.pdf)Cited by:[Lemma A\.3](https://arxiv.org/html/2605.11170#A1.Thmlemma3),[Lemma A\.4](https://arxiv.org/html/2605.11170#A1.Thmlemma4.p1.3.3)\.
- C\. Villaniet al\.\(2009\)Optimal transport: old and new\.Vol\.338,Springer\.Cited by:[Theorem C\.1](https://arxiv.org/html/2605.11170#A3.Thmtheorem1.p1.2.2)\.
- P\. Xu, J\. Chen, D\. Zou, and Q\. Gu \(2018\)Global convergence of langevin dynamics based algorithms for nonconvex optimization\.Advances in Neural Information Processing Systems31\.Cited by:[§C\.1](https://arxiv.org/html/2605.11170#A3.SS1.p1.4)\.
- H\. Yan, X\. Li, Z\. Guo, H\. Li, F\. Li, and X\. Lin \(2022\)ARCANE: An Efficient Architecture for Exact Machine Unlearning\.Vol\.5,pp\. 4006–4013\(en\)\.Note:ISSN: 1045\-0823External Links:[Link](https://www.ijcai.org/proceedings/2022/556),[Document](https://dx.doi.org/10.24963/ijcai.2022/556)Cited by:[§2\.1](https://arxiv.org/html/2605.11170#S2.SS1.p1.1)\.
- J\. Ye and R\. Shokri \(2022\)Differentially Private Learning Needs Hidden State \(Or Much Faster Convergence\)\.\(en\)\.External Links:[Link](https://arxiv.org/abs/2203.05363v2)Cited by:[§B\.3](https://arxiv.org/html/2605.11170#A2.SS3.p1.12),[Lemma B\.1](https://arxiv.org/html/2605.11170#A2.Thmlemma1.p1.2)\.
- X\. Zhang, J\. Zhao, and Y\. LeCun \(2015\)Character\-level convolutional networks for text classification\.InProceedings of the 29th International Conference on Neural Information Processing Systems \- Volume 1,NIPS’15,Cambridge, MA, USA,pp\. 649–657\.Cited by:[§H\.0\.2](https://arxiv.org/html/2605.11170#A8.SS0.SSS2.p1.3),[§5\.3](https://arxiv.org/html/2605.11170#S5.SS3.p4.9)\.

## Appendix A Unlearning performance bounds

### A.1 Proof of [Theorem 3.2](https://arxiv.org/html/2605.11170#S3.Thmtheorem2)

###### Theorem.

(Chien et al., [2024a](https://arxiv.org/html/2605.11170#bib.bib249)) Suppose that the loss is $L$-smooth and $M$-Lipschitz, and that the learning distribution of weights at time $T$ satisfies a $C$ log-Sobolev inequality. Then, the Rényi divergence between $\pi_{U}^{K}$ (the unlearning distribution after $K$ iterations) and the retraining distribution after $T+K$ iterations is upper bounded by:

$$D_{\alpha}(\pi_{R}^{T+K}\|\pi_{U}^{K})\leq D_{\alpha}(\pi_{L}^{T}\|\pi_{R}^{T})\exp\left(-\frac{1}{\alpha}\sum_{k=0}^{K-1}R_{k}\right),$$

where the $R_{k}>0$ depend on the problem setting (Chien et al., [2024a](https://arxiv.org/html/2605.11170#bib.bib249)). Moreover, if the loss function is $m$-strongly convex and the initial log-Sobolev constant satisfies $C>\frac{\sigma^{2}}{m}$, we get the following exponential decay of the Rényi divergence with respect to the unlearning iteration:

$$D_{\alpha}(\pi_{R}^{T+K}\|\pi_{U}^{K})\leq D_{\alpha}(\pi_{L}^{T}\|\pi_{R}^{T})\exp\left(-\frac{2K\sigma^{2}\eta}{C\alpha}\right)$$

We provide the proof of Chien et al. ([2024a](https://arxiv.org/html/2605.11170#bib.bib249)), Theorem 3.2, slightly modified to our setting. Specifically, we relax the assumption that the learning and retraining processes have converged to their stationary distributions (infinite training). To prove this theorem, we will use the following lemmas:

###### Lemma A.1 (Characterizing the log-Sobolev constants of the PNGD updates (Chewi, [2023](https://arxiv.org/html/2605.11170#bib.bib287))).

Consider the PNGD update:

$$\theta^{k+1}=\Pi_{\Theta}\left[\theta^{k}-\eta\nabla\mathcal{L}_{D}(\theta^{k})+\sqrt{2\eta\sigma^{2}}W_{k}\right],\qquad\theta^{0}\sim\pi,$$

where $\pi$ satisfies a $C$ log-Sobolev inequality. Then, we have the following:

- If $\mathcal{L}$ is $L$-smooth, then for the gradient update $h(\theta)=\theta-\eta\nabla_{\theta}\mathcal{L}(\theta)$, the pushforward $h_{\sharp}\pi$ satisfies a $(1+\eta L)^{2}C$ log-Sobolev inequality. Moreover, if $\mathcal{L}$ is $m$-strongly convex and $\eta<\frac{1}{L}$, then $h_{\sharp}\pi$ satisfies a $(1-\eta m)^{2}C$ log-Sobolev inequality (Altschuler and Talwar, [2022](https://arxiv.org/html/2605.11170#bib.bib286)).
- $\pi\ast\mathcal{N}(0,\sigma^{2}I_{d})$ satisfies a $C+\sigma^{2}$ log-Sobolev inequality.
- $\Pi_{\Theta\sharp}\pi$ satisfies a $C$ log-Sobolev inequality.

By composing the aforementioned statements, we get that $\pi_{1}$ satisfies a $(1+\eta L)^{2}C+2\eta\sigma^{2}$ log-Sobolev inequality. Moreover, if $\mathcal{L}$ is $m$-strongly convex and $\eta<\frac{1}{L}$, then $\pi_{1}$ satisfies a $(1-\eta m)^{2}C+2\eta\sigma^{2}$ log-Sobolev inequality.
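For concreteness, the update analyzed in this lemma can be sketched in a few lines of NumPy; this is a minimal illustration, assuming $\Theta$ is the Euclidean ball of radius `R` (so the projection is a simple rescaling) and using a toy quadratic loss in place of $\mathcal{L}_{D}$ (names such as `pngd_step` are ours, not the paper's):

```python
import numpy as np

def project_ball(theta, R):
    """Euclidean projection onto the ball of radius R (our stand-in for Theta)."""
    norm = np.linalg.norm(theta)
    return theta if norm <= R else theta * (R / norm)

def pngd_step(theta, grad_fn, eta, sigma, R, rng):
    """One PNGD update: Proj_Theta[theta - eta*grad + sqrt(2*eta*sigma^2)*W]."""
    noise = np.sqrt(2 * eta * sigma**2) * rng.standard_normal(theta.shape)
    return project_ball(theta - eta * grad_fn(theta) + noise, R)

rng = np.random.default_rng(0)
theta = np.zeros(10)
grad_fn = lambda th: th  # toy quadratic loss L(theta) = ||theta||^2 / 2
for _ in range(100):
    theta = pngd_step(theta, grad_fn, eta=0.1, sigma=0.5, R=1.0, rng=rng)
assert np.linalg.norm(theta) <= 1.0 + 1e-9  # iterates remain in Theta
```

The projection step is exactly what Lemma A.1's third bullet (and later Proposition A.1) exploits: it keeps every iterate supported on the compact set $\Theta$.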

###### Lemma A.2 (Data processing inequality for the Rényi divergence (Erven and Harremoës, [2014](https://arxiv.org/html/2605.11170#bib.bib142))).

For any $\alpha\geq 1$, any function $h:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$, and distributions $P,Q$ supported on $\mathbb{R}^{d}$, we have:

$$D_{\alpha}(h_{\sharp}P\|h_{\sharp}Q)\leq D_{\alpha}(P\|Q),$$

with equality if $h$ is bijective.

###### Lemma A.3 (Characterizing the Rényi divergence between two distributions convolved with Gaussians (Vempala and Wibisono, [2019](https://arxiv.org/html/2605.11170#bib.bib218); Chien et al., [2024a](https://arxiv.org/html/2605.11170#bib.bib249))).

Let $P_{t}=P\ast\mathcal{N}(0,2t\sigma^{2}I_{d})$ and $Q_{t}=Q\ast\mathcal{N}(0,2t\sigma^{2}I_{d})$. Then, $\forall\alpha>0$:

$$\frac{\partial D_{\alpha}(P_{t}\|Q_{t})}{\partial t}=-\alpha\sigma^{2}\frac{G_{\alpha}(P_{t}\|Q_{t})}{F_{\alpha}(P_{t}\|Q_{t})},$$

where $G_{\alpha}(P\|Q)=\mathbb{E}_{Q}\left[\left(\frac{p}{q}\right)^{\alpha}\left\|\nabla\log\frac{p}{q}\right\|^{2}\right]$ denotes the relative Rényi information and $F_{\alpha}(P\|Q)=\mathbb{E}_{Q}\left[\left(\frac{p}{q}\right)^{\alpha}\right]=\exp\left((\alpha-1)D_{\alpha}(P\|Q)\right)$.

###### Lemma A.4 (Lower bound of the G–F ratio (Vempala and Wibisono, [2019](https://arxiv.org/html/2605.11170#bib.bib218))).

If $Q\in\mathcal{P}(\Theta)$ satisfies a $C$ log-Sobolev inequality, then $\forall P\in\mathcal{P}(\Theta)$:

$$\frac{G_{\alpha}(P\|Q)}{F_{\alpha}(P\|Q)}\geq\frac{2D_{\alpha}(P\|Q)}{\alpha^{2}C}$$

###### Lemma A.5 (Grönwall's inequality (Gronwall, [1919](https://arxiv.org/html/2605.11170#bib.bib288))).

Let $\mathbf{I}=[a,b]$ denote an interval on the real line. Let $\beta$ and $u$ be real-valued continuous functions defined on $\mathbf{I}$. If $u$ is differentiable in the interior of $\mathbf{I}$ and satisfies, for all $t$ in the interior of $\mathbf{I}$:

$$\frac{du(t)}{dt}\leq\beta(t)u(t),$$

then we have:

$$u(t)\leq u(a)\exp\left(\int_{a}^{t}\beta(s)\,ds\right)\quad\text{for all }t\in\mathbf{I}$$
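The lemma can be checked numerically on a toy differential inequality; the sketch below (with an assumed $\beta(t)=0.5+0.1t$ and a small negative slack so that $u'\leq\beta u$ holds strictly) integrates with Euler steps and compares against the Grönwall bound:

```python
import math

# Euler simulation of u'(t) = beta(t) * u(t) - 0.1 (hence u' <= beta * u),
# with u(0) = 1, checked against u(t) <= u(0) * exp(integral of beta on [0, t]).
beta = lambda t: 0.5 + 0.1 * t
dt, u, t, integral = 1e-4, 1.0, 0.0, 0.0
for _ in range(20_000):  # integrate on [0, 2]
    u += dt * (beta(t) * u - 0.1)
    integral += dt * beta(t)
    t += dt
assert u <= 1.0 * math.exp(integral) + 1e-6  # Gronwall bound holds
```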

###### Lemma A.6 (Universal upper bound on the log-Sobolev constant for measures with compact support (Chen et al., [2021](https://arxiv.org/html/2605.11170#bib.bib289))).

Let $P$ be a probability measure supported on a compact set with radius $R$. Then, for each $\sigma>0$, $P\ast\mathcal{N}(0,\sigma I_{d})$ satisfies a log-Sobolev inequality with constant upper bounded by $6\left(4R^{2}+\sigma\right)\exp\left(\frac{4R^{2}}{\sigma}\right)$.

###### Proof.

Using these results, we have:

$$D_{\alpha}(h_{\sharp}\pi_{R}^{T+K}\|h_{\sharp}\pi_{U}^{K})\leq D_{\alpha}(\pi_{R}^{T+K}\|\pi_{U}^{K})\qquad\text{([Lemma A.2](https://arxiv.org/html/2605.11170#A1.Thmlemma2))}$$

**The PNGD updates preserve the log-Sobolev inequality for the resulting distributions:** let $\pi_{U}^{K,1,t}=h_{\sharp}\pi_{U}^{K}\ast\mathcal{N}(0,2t\sigma^{2}I_{d})$ and $\pi_{R}^{T+K,1,t}=h_{\sharp}\pi_{R}^{T+K}\ast\mathcal{N}(0,2t\sigma^{2}I_{d})$. Since $\pi_{L}^{T}$ and $\pi_{R}^{T}$ satisfy a log-Sobolev inequality (initialization distributions) and the loss function is $L$-smooth, [Lemma A.1](https://arxiv.org/html/2605.11170#A1.Thmlemma1) implies that the distributions $\pi_{U}^{K},\pi_{L}^{T+K}$ satisfy $C_{U,K}$ and $C_{L,T+K}$ log-Sobolev inequalities, respectively. Applying [Lemma A.1](https://arxiv.org/html/2605.11170#A1.Thmlemma1) to the distributions $\pi_{U}^{K,1,t},\pi_{R}^{T+K,1,t}$ yields that they respectively satisfy $(1+\eta L)^{2}C_{U,K}+2\eta\sigma^{2}$ and $(1+\eta L)^{2}C_{L,T+K}+2\eta\sigma^{2}$ log-Sobolev inequalities for all $t\in[0,\eta]$.

**Upper bounding the distributions convolved with Gaussian distributions:** using [Lemma A.3](https://arxiv.org/html/2605.11170#A1.Thmlemma3), we have, $\forall\alpha>0$:

$$\frac{\partial D_{\alpha}(\pi_{R}^{T+K,1,t}\|\pi_{U}^{K,1,t})}{\partial t}=-\alpha\sigma^{2}\frac{G_{\alpha}(\pi_{R}^{T+K,1,t}\|\pi_{U}^{K,1,t})}{F_{\alpha}(\pi_{R}^{T+K,1,t}\|\pi_{U}^{K,1,t})},$$

and since $\pi_{U}^{K,1,t}$ satisfies a $C_{U,K,t}=(1+\eta L)^{2}C_{U,K}+2t\sigma^{2}$ log-Sobolev inequality, we can use [Lemma A.4](https://arxiv.org/html/2605.11170#A1.Thmlemma4) to upper bound the derivative of the Rényi divergence with respect to $t\in[0,\eta]$:

$$\frac{\partial D_{\alpha}(\pi_{R}^{T+K,1,t}\|\pi_{U}^{K,1,t})}{\partial t}\leq-\frac{2\sigma^{2}}{\alpha C_{U,K,t}}D_{\alpha}(\pi_{R}^{T+K,1,t}\|\pi_{U}^{K,1,t})$$

Thus, by Grönwall's inequality ([Lemma A.5](https://arxiv.org/html/2605.11170#A1.Thmlemma5)), we have, $\forall t\in[0,\eta]$:

$$\begin{aligned}
D_{\alpha}(\pi_{R}^{T+K,1,t}\|\pi_{U}^{K,1,t})&\leq D_{\alpha}(h_{\sharp}\pi_{R}^{T+K}\|h_{\sharp}\pi_{U}^{K})\exp\left(-\int_{0}^{t}\frac{2\sigma^{2}}{\alpha C_{U,K,s}}\,ds\right)\\
&\leq D_{\alpha}(h_{\sharp}\pi_{R}^{T+K}\|h_{\sharp}\pi_{U}^{K})\exp\left(-\int_{0}^{t}\frac{2\sigma^{2}}{\alpha\left((1+\eta L)^{2}C_{U,K}+2s\sigma^{2}\right)}\,ds\right)\\
&\leq D_{\alpha}(\pi_{R}^{T+K}\|\pi_{U}^{K})\exp\left(-\int_{0}^{t}\frac{2\sigma^{2}}{\alpha\left((1+\eta L)^{2}C_{U,K}+2s\sigma^{2}\right)}\,ds\right)\qquad\text{([Lemma A.2](https://arxiv.org/html/2605.11170#A1.Thmlemma2))}
\end{aligned}$$

Computing the integral yields:

$$\begin{aligned}
\int_{0}^{t}-\frac{2\sigma^{2}}{\alpha\left((1+\eta L)^{2}C_{U,K}+2s\sigma^{2}\right)}\,ds&=-\frac{1}{\alpha}\int_{0}^{t}\frac{2\sigma^{2}}{(1+\eta L)^{2}C_{U,K}+2s\sigma^{2}}\,ds\\
&=-\frac{1}{\alpha}\left[\log\left((1+\eta L)^{2}C_{U,K}+2t\sigma^{2}\right)-\log\left((1+\eta L)^{2}C_{U,K}\right)\right]\\
&=-\frac{1}{\alpha}\log\left(1+\frac{2t\sigma^{2}}{(1+\eta L)^{2}C_{U,K}}\right)
\end{aligned}$$

Thus, by setting $t=\eta$, we get:

$$D_{\alpha}(\pi_{R}^{T+K,1,\eta}\|\pi_{U}^{K,1,\eta})\leq\left(1+\frac{2\eta\sigma^{2}}{(1+\eta L)^{2}C_{U,K}}\right)^{-\frac{1}{\alpha}}D_{\alpha}(\pi_{R}^{T+K}\|\pi_{U}^{K})$$

Finally, using the data processing inequality for the projection step of PNGD and iterating over the number of unlearning iterations, we get:

$$\begin{aligned}
D_{\alpha}(\pi_{R}^{T+K+1}\|\pi_{U}^{K+1})&\leq D_{\alpha}(\pi_{R}^{T+K,1,\eta}\|\pi_{U}^{K,1,\eta})\\
&\leq\left(1+\frac{2\eta\sigma^{2}}{(1+\eta L)^{2}C_{U,K}}\right)^{-\frac{1}{\alpha}}D_{\alpha}(\pi_{R}^{T+K}\|\pi_{U}^{K})\\
&\leq D_{\alpha}(\pi_{R}^{T}\|\pi_{L}^{T})\prod_{k=1}^{K}\left(1+\frac{2\eta\sigma^{2}}{(1+\eta L)^{2}C_{U,k}}\right)^{-\frac{1}{\alpha}}
\end{aligned}$$

∎
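As a numeric sanity check on the product form in the proof above, a tiny sketch (illustrative constants only, with the per-iteration constant $C_{U,k}$ held at a fixed value `C`) shows each unlearning iteration shrinking the Rényi bound geometrically:

```python
import numpy as np

# Illustrative constants, not taken from the paper's experiments;
# C stands in for a fixed log-Sobolev constant C_{U,k}.
eta, L, sigma, alpha, C = 0.01, 1.0, 1.0, 2.0, 5.0

factor = (1 + 2 * eta * sigma**2 / ((1 + eta * L)**2 * C)) ** (-1 / alpha)
assert 0 < factor < 1  # each unlearning iteration strictly shrinks the bound

D0 = 10.0  # stand-in for the initial divergence D_alpha(pi_R^T || pi_L^T)
bounds = [D0 * factor**K for K in range(0, 501, 100)]
assert all(b2 < b1 for b1, b2 in zip(bounds, bounds[1:]))  # geometric decay
```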

### A.2 Tracking the log-Sobolev constants

For a generic $L$-smooth, non-convex loss function $\mathcal{L}$, one can derive the following recurrence relation upper bounding the log-Sobolev constants, $\forall k\geq 1$:

$$\begin{aligned}
C_{1}&\leq(1+\eta L)^{2}C_{0}+2\eta\sigma^{2}\\
C_{2}&\leq(1+\eta L)^{4}C_{0}+(1+\eta L)^{2}\,2\eta\sigma^{2}+2\eta\sigma^{2}\\
&\;\;\vdots\\
C_{K}&\leq(1+\eta L)^{2K}C_{0}+2\eta\sigma^{2}\sum_{k=0}^{K-1}(1+\eta L)^{2k}\leq(1+\eta L)^{2K}C_{0}+2\eta\sigma^{2}\frac{(1+\eta L)^{2K}-1}{(1+\eta L)^{2}-1}
\end{aligned}\tag{3}$$

If we add the assumption that the loss is convex, then the map $h(\theta)=\theta-\eta\nabla_{\theta}\mathcal{L}(\theta)$ is $1$-Lipschitz for $\eta<\frac{2}{L}$ (Hardt et al., [2016](https://arxiv.org/html/2605.11170#bib.bib290)), and we can replace $(1+\eta L)$ by $1$ in the bounds above:

$$C_{K}\leq C_{0}+2K\eta\sigma^{2}\tag{4}$$

Finally, assuming $m$-strong convexity yields that the map $h(\theta)$ is $(1-\eta m)$-Lipschitz, which allows the following *contractive* recurrence on the log-Sobolev constants, $\forall k\geq 1$, by setting $\eta<\frac{2}{m}\left(1-\frac{\sigma^{2}}{mC_{0}}\right)$ (Chien et al., [2024a](https://arxiv.org/html/2605.11170#bib.bib249)):

$$\begin{aligned}
C_{k}&\leq(1-\eta m)^{2}C_{k-1}+2\eta\sigma^{2}\leq C_{k-1}\\
C_{K}&\leq(1-\eta m)^{2K}C_{0}+2\eta\sigma^{2}\frac{(1-\eta m)^{2K}-1}{(1-\eta m)^{2}-1}\leq C_{0}
\end{aligned}$$

Thus, $\forall t\in[0,\eta]$, $\pi_{U}^{K,1,t}$ satisfies a $C_{0}$ log-Sobolev inequality, and we have by [Lemma A.4](https://arxiv.org/html/2605.11170#A1.Thmlemma4):

$$\frac{\partial D_{\alpha}(\pi_{R}^{T+K,1,t}\|\pi_{U}^{K,1,t})}{\partial t}\leq-\frac{2\sigma^{2}}{\alpha C_{0}}D_{\alpha}(\pi_{R}^{T+K,1,t}\|\pi_{U}^{K,1,t})$$

Thus, by Grönwall's inequality ([Lemma A.5](https://arxiv.org/html/2605.11170#A1.Thmlemma5)), we have, $\forall t\in[0,\eta]$:

$$D_{\alpha}(\pi_{R}^{T+K,1,t}\|\pi_{U}^{K,1,t})\leq D_{\alpha}(h_{\sharp}\pi_{R}^{T+K}\|h_{\sharp}\pi_{U}^{K})\exp\left(\int_{0}^{t}-\frac{2\sigma^{2}}{\alpha C_{0}}\,ds\right)=D_{\alpha}(h_{\sharp}\pi_{R}^{T+K}\|h_{\sharp}\pi_{U}^{K})\exp\left(-\frac{2t\sigma^{2}}{\alpha C_{0}}\right)$$

Thus, setting $t=\eta$ and following the same steps as in the non-convex proof above, we get the following result:

$$D_{\alpha}(\pi_{R}^{T+K}\|\pi_{U}^{K})\leq D_{\alpha}(\pi_{R}^{T}\|\pi_{L}^{T})\exp\left(-\frac{2K\eta\sigma^{2}}{\alpha C_{0}}\right)$$

The message of the strongly convex proof is that a universal, iteration-independent upper bound on the log-Sobolev constants at each timestep of the PNGD updates yields a more meaningful upper bound on the Rényi divergence. The non-convex ([Equation 3](https://arxiv.org/html/2605.11170#A1.E3)) and convex ([Equation 4](https://arxiv.org/html/2605.11170#A1.E4)) recurrence bounds are non-contractive and iteration-dependent, so they do not establish a convergence rate for [Theorem 3.2](https://arxiv.org/html/2605.11170#S3.Thmtheorem2). This is where the projection step of PNGD comes in handy, as it allows us to leverage the geometry of the set $\Theta$ to obtain a more informative bound:

###### Lemma A.7 (Log-Sobolev inequality for measures supported on a compact set (Chen et al., [2021](https://arxiv.org/html/2605.11170#bib.bib289), Corollary 1)).

Let $\pi$ be a probability measure on $\mathbb{R}^{d}$ supported on a compact set $\Theta$ with radius $R\geq 0$. Then, for each $t>0$, $\pi\ast\mathcal{N}(0,tI_{d})$ satisfies a log-Sobolev inequality with constant $C$ controlled by:

$$C\leq 6\left(4R^{2}+t\right)\exp\left(\frac{4R^{2}}{t}\right)$$

###### Proposition A.1 (Universal bound on the log-Sobolev constants of distributions induced by PNGD updates (Chien et al., [2024a](https://arxiv.org/html/2605.11170#bib.bib249))).

Suppose that $\mathcal{L}$ is $M$-Lipschitz. Let $\theta_{0}\sim\pi_{0}\in\mathcal{P}(\Theta)$, where $\Theta$ is a compact set of radius $R$, and denote by $\pi_{k}$ the distribution of $\theta_{k}$, the $k$-th iterate of PNGD ([Equation 1](https://arxiv.org/html/2605.11170#S2.E1)). Then, $\forall k\geq 0$, $\pi_{k}$ satisfies a log-Sobolev inequality with constant $C_{k}$ controlled by:

$$C_{k}\leq 6\left(4(R+\eta M)^{2}+2\eta\sigma^{2}\right)\exp\left(\frac{4(R+\eta M)^{2}}{2\eta\sigma^{2}}\right)$$

We can thus derive a bound similar to the strongly convex setting for the non-convex/convex settings. Using [Proposition A.1](https://arxiv.org/html/2605.11170#A1.Thmproposition1), we have, $\forall k\geq 0$, that $\pi_{U}^{K}$ satisfies a $\tilde{C}=6\left(4(R+\eta M)^{2}+2\eta\sigma^{2}\right)\exp\left(\frac{4(R+\eta M)^{2}}{2\eta\sigma^{2}}\right)$ log-Sobolev inequality. Thus, using [Lemma A.4](https://arxiv.org/html/2605.11170#A1.Thmlemma4), we have:

$$\frac{\partial D_{\alpha}(\pi_{R}^{T+K,1,t}\|\pi_{U}^{K,1,t})}{\partial t}\leq-\frac{2\sigma^{2}}{\alpha\tilde{C}}D_{\alpha}(\pi_{R}^{T+K,1,t}\|\pi_{U}^{K,1,t})$$

Thus, by Grönwall's inequality ([Lemma A.5](https://arxiv.org/html/2605.11170#A1.Thmlemma5)), we have, $\forall t\in[0,\eta]$:

$$D_{\alpha}(\pi_{R}^{T+K,1,t}\|\pi_{U}^{K,1,t})\leq D_{\alpha}(h_{\sharp}\pi_{R}^{T+K}\|h_{\sharp}\pi_{U}^{K})\exp\left(\int_{0}^{t}-\frac{2\sigma^{2}}{\alpha\tilde{C}}\,ds\right)=D_{\alpha}(h_{\sharp}\pi_{R}^{T+K}\|h_{\sharp}\pi_{U}^{K})\exp\left(-\frac{2t\sigma^{2}}{\alpha\tilde{C}}\right)$$

Finally, similarly to the strongly convex proof, we can deduce that:

$$D_{\alpha}(\pi_{R}^{T+K}\|\pi_{U}^{K})\leq D_{\alpha}(\pi_{R}^{T}\|\pi_{L}^{T})\exp\left(-\frac{2K\eta\sigma^{2}}{\alpha\tilde{C}}\right)$$
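To make the rate concrete, a short sketch (with assumed toy values for $R$, $M$, $\eta$, $\sigma$, $\alpha$, none taken from the paper's experiments) computes $\tilde{C}$ and the number of unlearning iterations $K$ needed to drive an initial Rényi bound $D_0$ below a target:

```python
import numpy as np

# Assumed toy constants: radius R of Theta, Lipschitz constant M, step eta, noise sigma.
R, M, eta, sigma, alpha = 1.0, 1.0, 0.5, 2.0, 2.0

# Universal log-Sobolev bound C_tilde from Proposition A.1.
C_tilde = 6 * (4 * (R + eta * M)**2 + 2 * eta * sigma**2) * np.exp(
    4 * (R + eta * M)**2 / (2 * eta * sigma**2))

# Iterations K needed to drive the Renyi bound from D0 down to `target`,
# inverting D0 * exp(-2 K eta sigma^2 / (alpha C_tilde)) <= target.
D0, target = 10.0, 0.1
K = int(np.ceil(alpha * C_tilde / (2 * eta * sigma**2) * np.log(D0 / target)))
assert D0 * np.exp(-2 * K * eta * sigma**2 / (alpha * C_tilde)) <= target
```

Note how the exponential factor $\exp\left(\frac{4(R+\eta M)^{2}}{2\eta\sigma^{2}}\right)$ in $\tilde{C}$ makes $K$ very sensitive to the noise level $\sigma$: small $\sigma$ inflates $\tilde{C}$ and hence the iteration count.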

## Appendix B On training with public data

### B.1 Proof of [Theorem 3.1](https://arxiv.org/html/2605.11170#S3.Thmtheorem1)

See [Theorem 3.1](https://arxiv.org/html/2605.11170#S3.Thmtheorem1).

###### Proof.

The following proof is an adaptation of the proof of Theorem 3.2 in Chien et al. ([2024a](https://arxiv.org/html/2605.11170#bib.bib249)) to the asymmetric data setting.

Consider the following updates performed during training. Recall that we are using full-batch projected noisy gradient descent:

$$\begin{aligned}
\theta^{t+1}_{L}&=\Pi_{\Theta}\left[\theta^{t}_{L}+\eta\nabla\mathcal{L}_{D_{\mathrm{pub}}\cup D_{\mathrm{priv}}}(\theta^{t}_{L})+\sqrt{2\eta\sigma^{2}}W_{t}\right]\\
\theta^{t+1}_{R}&=\Pi_{\Theta}\left[\theta^{t}_{R}+\eta\nabla\mathcal{L}_{D_{\mathrm{retain}}}(\theta^{t}_{R})+\sqrt{2\eta\sigma^{2}}W_{t}\right]
\end{aligned}\qquad(W_{t}\sim\mathcal{N}(0,I_{d}))$$

Let us divide each optimization step into the following:

$$\begin{aligned}
\theta^{t,1}_{L}&=\theta^{t}_{L}+\eta\nabla\mathcal{L}_{D_{\mathrm{pub}}\cup D_{\mathrm{priv}}}(\theta^{t}_{L})+\sqrt{\eta\sigma^{2}}W_{t}\\
\theta^{t,1}_{R}&=\theta^{t}_{R}+\eta\nabla\mathcal{L}_{D_{\mathrm{retain}}}(\theta^{t}_{R})+\sqrt{\eta\sigma^{2}}W_{t}
\end{aligned}$$

Therefore, we can write:

$$\theta^{t+1}_{L}=\Pi_{\Theta}\left[\theta^{t,1}_{L}+\sqrt{\eta\sigma^{2}}W_{t}\right]\tag{5}$$

$$\theta^{t+1}_{R}=\Pi_{\Theta}\left[\theta^{t,1}_{R}+\sqrt{\eta\sigma^{2}}W_{t}\right]\tag{6}$$

Let $\pi^{t}_{R},\pi^{t,1}_{R},\pi^{t}_{L},\pi^{t,1}_{L}$ be the distributions of $\theta^{t}_{R},\theta^{t,1}_{R},\theta^{t}_{L},\theta^{t,1}_{L}$, respectively.

The main question we tackle here is: **what is $D_{\alpha}(\pi_{R}^{t}\|\pi_{L}^{t})$?** We first compare the distributions $\pi_{R}^{t,1}$ and $\pi_{L}^{t,1}$. By the composition theorem of the Gaussian mechanism for Rényi differential privacy (Mironov, [2017](https://arxiv.org/html/2605.11170#bib.bib27)), and equivalently for the Rényi divergence, we have:

$$\frac{D_{\alpha}(\pi^{t,1}_{R}\|\pi^{t,1}_{L})}{\alpha}\leq\frac{D_{\alpha}(\pi^{t}_{R}\|\pi^{t}_{L})}{\alpha}+\frac{\Delta_{F}^{2}}{2\sigma^{2}}\tag{7}$$

where $\Delta_{F}$ is the $\ell_{2}$ sensitivity of the gradient update. For the next computations, let $n_{\mathrm{pub}}$ denote the number of public points, $n_{\mathrm{forget}}$ the number of points to forget, and $n_{\mathrm{r-priv}}$ the number of *remaining* private points in the retain set. Computing the sensitivity in the asymmetric setting yields:

ΔF\\displaystyle\\Delta\_\{F\}=maxθ⁡η​‖∇ℒDretain​\(θ\)−∇ℒDpub∪Dpriv​\(θ\)‖\\displaystyle=\\max\_\{\\theta\}\\eta\\\|\\nabla\\mathcal\{L\}\_\{D\_\{\\mathrm\{retain\}\}\}\(\\theta\)\-\\nabla\\mathcal\{L\}\_\{D\_\{\\mathrm\{pub\}\}\\cup D\_\{\\mathrm\{priv\}\}\}\(\\theta\)\\\|=maxθ⁡η∥1npub\+nr−priv​∑di∈I∪II∇ℓ​\(θ,di\)−1npub\+nr−priv\+nforget​∑di∈I∪II∪III∇‖ℓ​\(θ,di\)‖\\displaystyle=\\max\_\{\\theta\}\\eta\\\|\\frac\{1\}\{n\_\{\\mathrm\{pub\}\}\+n\_\{\\mathrm\{r\-priv\}\}\}\\sum\_\{d\_\{i\}\\in\\text\{I\}\\cup\\text\{II\}\}\\nabla\\ell\(\\theta,d\_\{i\}\)\-\\frac\{1\}\{n\_\{\\mathrm\{pub\}\}\+n\_\{\\mathrm\{r\-priv\}\}\+n\_\{\\mathrm\{forget\}\}\}\\sum\_\{d\_\{i\}\\in\\text\{I\}\\cup\\text\{II\}\\cup\\text\{III\}\}\\nabla\\\|\\ell\(\\theta,d\_\{i\}\)\\\|≤η​\(1npub\+nr−priv−1npub\+nr−priv\+nforget\)​∑di∈I∪II‖∇ℓ​\(θ,di\)‖\\displaystyle\\leq\\eta\\left\(\\frac\{1\}\{n\_\{\\mathrm\{pub\}\}\+n\_\{\\mathrm\{r\-priv\}\}\}\-\\frac\{1\}\{n\_\{\\mathrm\{pub\}\}\+n\_\{\\mathrm\{r\-priv\}\}\+n\_\{\\mathrm\{forget\}\}\}\\right\)\\sum\_\{d\_\{i\}\\in\\text\{I\}\\cup\\text\{II\}\}\\\|\\nabla\\ell\(\\theta,d\_\{i\}\)\\\|\+ηnpub\+nr−priv\+nforget​∑di∈I∪II∪III‖∇ℓ​\(θ,di\)‖\\displaystyle\\quad\+\\frac\{\\eta\}\{n\_\{\\mathrm\{pub\}\}\+n\_\{\\mathrm\{r\-priv\}\}\+n\_\{\\mathrm\{forget\}\}\}\\sum\_\{d\_\{i\}\\in\\text\{I\}\\cup\\text\{II\}\\cup\\text\{III\}\}\\\|\\nabla\\ell\(\\theta,d\_\{i\}\)\\\|≤M​η​\(npub\+nr−priv\)​\(1npub\+nr−priv−1npub\+nr−priv\+nforget\)\+nforget​M​ηnpub\+nr−priv\+nforget\\displaystyle\\leq M\\eta\(n\_\{\\mathrm\{pub\}\}\+n\_\{\\mathrm\{r\-priv\}\}\)\\left\(\\frac\{1\}\{n\_\{\\mathrm\{pub\}\}\+n\_\{\\mathrm\{r\-priv\}\}\}\-\\frac\{1\}\{n\_\{\\mathrm\{pub\}\}\+n\_\{\\mathrm\{r\-priv\}\}\+n\_\{\\mathrm\{forget\}\}\}\\right\)\+\\frac\{n\_\{\\mathrm\{forget\}\}M\\eta\}\{n\_\{\\mathrm\{pub\}\}\+n\_\{\\mathrm\{r\-priv\}\}\+n\_\{\\mathrm\{forget\}\}\}≤2​M​η​nforgetnpub\+nr−priv\+nforget⏟ε\\displaystyle\\leq\\underbrace\{\\frac\{2M\\eta 
n\_\{\\mathrm\{forget\}\}\}\{n\_\{\\mathrm\{pub\}\}\+n\_\{\\mathrm\{r\-priv\}\}\+n\_\{\\mathrm\{forget\}\}\}\}\_\{\\varepsilon\}
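As a sanity check, the final bound on $\Delta_{F}$ can be verified numerically against randomly drawn per-example gradients of norm at most $M$ (the Lipschitz assumption). A minimal sketch with illustrative constants:

```python
import numpy as np

def sensitivity_gap(grads_retain, grads_forget, eta):
    """Gradient-update gap between the retain-set mean and the full-set mean."""
    all_grads = np.vstack([grads_retain, grads_forget])
    return eta * np.linalg.norm(grads_retain.mean(axis=0) - all_grads.mean(axis=0))

rng = np.random.default_rng(0)
M, eta = 1.0, 0.1                      # Lipschitz constant and step size (illustrative)
n_retain, n_forget, d = 90, 10, 5      # n_retain stands for n_pub + n_r-priv

# Random per-example gradients with norm exactly M (worst case under M-Lipschitzness)
g = rng.normal(size=(n_retain + n_forget, d))
g = M * g / np.linalg.norm(g, axis=1, keepdims=True)

gap = sensitivity_gap(g[:n_retain], g[n_retain:], eta)
bound = 2 * M * eta * n_forget / (n_retain + n_forget)   # the epsilon of the derivation
assert gap <= bound + 1e-12
```

The bound shrinks linearly in the forget fraction, which is the quantity the asymmetric setting controls.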
###### Lemma B\.1\.

(Ye and Shokri, [2022](https://arxiv.org/html/2605.11170#bib.bib207)) For any distributions $\xi_{t},\xi_{t}^{\prime}$ both satisfying $C_{t,1}$-LSI, we have:

$$\frac{D_{\alpha}(\xi_{t}*\mathcal{N}(0,\eta\sigma^{2}I)\,\|\,\xi_{t}^{\prime}*\mathcal{N}(0,\eta\sigma^{2}I))}{\alpha}\leq\frac{D_{\alpha(t)}(\xi_{t}\|\xi_{t}^{\prime})}{\alpha(t)}\left(1+\frac{\eta\sigma^{2}}{C_{t,1}}\right)^{-1}\quad\text{where }\alpha(t)=\frac{\alpha-1}{1+\frac{\eta\sigma^{2}}{C_{t,1}}}$$

By combining the data processing inequality (projection) and [Lemma B.1](https://arxiv.org/html/2605.11170#A2.Thmlemma1), we get the following recurrence inequality:

$$\begin{aligned}
\frac{D_{\alpha}(\pi_{R}^{T+1}\|\pi_{L}^{T+1})}{\alpha}&\leq\left(\frac{D_{\alpha(T)}(\pi_{R}^{T}\|\pi_{L}^{T})}{\alpha(T)}+\frac{\varepsilon^{2}}{2\sigma^{2}}\right)\left(1+\frac{\eta\sigma^{2}}{C_{T,1}}\right)^{-1}\\
&=\frac{\varepsilon^{2}}{2\sigma^{2}}\left(1+\frac{\eta\sigma^{2}}{C_{T,1}}\right)^{-1}+\frac{D_{\alpha(T)}(\pi_{R}^{T}\|\pi_{L}^{T})}{\alpha(T)}\left(1+\frac{\eta\sigma^{2}}{C_{T,1}}\right)^{-1}\\
&\leq\frac{\varepsilon^{2}}{2\sigma^{2}}\left(1+\frac{\eta\sigma^{2}}{C_{T,1}}\right)^{-1}+\left(\frac{D_{\alpha(T-1)}(\pi_{R}^{T}\|\pi_{L}^{T})}{\alpha(T-1)}+\frac{\varepsilon^{2}}{2\sigma^{2}}\right)\left(1+\frac{\eta\sigma^{2}}{C_{T,1}}\right)^{-1}\left(1+\frac{\eta\sigma^{2}}{C_{T-1,1}}\right)^{-1}\\
&\leq\frac{\varepsilon^{2}}{2\sigma^{2}}\left[B(T)+B(T-1)\right]+B(T-2)\left(\frac{D_{\alpha(T-2)}(\pi_{R}^{T}\|\pi_{L}^{T})}{\alpha(T-2)}+\frac{\varepsilon^{2}}{2\sigma^{2}}\right)\qquad\left(\text{where }B(t)=\prod_{k=t}^{T}\left(1+\frac{\eta\sigma^{2}}{C_{k,1}}\right)^{-1}\right)\\
&\leq\frac{\varepsilon^{2}}{2\sigma^{2}}\sum_{i=1}^{T}B(i)+B(0)\left(\frac{D_{\alpha(0)}(\pi_{R}^{T}\|\pi_{L}^{T})}{\alpha(0)}+\frac{\varepsilon^{2}}{2\sigma^{2}}\right)\\
&\leq\frac{\varepsilon^{2}}{2\sigma^{2}}\sum_{i=0}^{T}B(i)\qquad\left(\text{since }D_{\alpha(t)}(\pi_{0}\|\pi_{0})=0\right)\\
&=\frac{\varepsilon^{2}}{2\sigma^{2}}\sum_{t=0}^{T}\prod_{t^{\prime}=t}^{T}\left(1+\frac{\eta\sigma^{2}}{C_{t^{\prime},1}}\right)^{-1}
\end{aligned}$$

The upper bound on the log-Sobolev constants can be tracked in a similar fashion as in [Proposition A.1](https://arxiv.org/html/2605.11170#A1.Thmproposition1) because of the projection onto the compact set $\Theta$. ∎
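The telescoping argument can be checked numerically: unrolling the one-step recurrence with a constant stand-in for the log-Sobolev constants reproduces the closed-form geometric sum. A minimal sketch with hypothetical constants:

```python
import numpy as np

# Hypothetical constants: C stands in for the log-Sobolev constants C_{t,1},
# eps for the sensitivity bound, sigma and eta as in the PNGD updates.
C, eps, sigma, eta, T = 1.0, 0.02, 0.5, 0.1, 200
r = (1 + eta * sigma**2 / C) ** -1      # per-step contraction factor

# Unroll the recurrence d_{t+1} <= (d_t + eps^2 / (2 sigma^2)) * r, with d_0 = 0
d = 0.0
for _ in range(T + 1):
    d = (d + eps**2 / (2 * sigma**2)) * r

# Closed form from the proof (constant C): (eps^2 / 2 sigma^2) * sum_t r^{T - t + 1}
closed = eps**2 / (2 * sigma**2) * sum(r ** (T - t + 1) for t in range(T + 1))
assert np.isclose(d, closed)
```

With time-varying constants $C_{t,1}$ the products $B(t)$ replace the powers of $r$, but the unrolling is identical.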

[Corollary 3.1](https://arxiv.org/html/2605.11170#S3.Thmcorollary1) follows immediately from the previous theorem. Starting with the bound of [Theorem 3.1](https://arxiv.org/html/2605.11170#S3.Thmtheorem1), we have:

$$D_{\alpha}(\pi_{R}^{T}\|\pi_{L}^{T})\leq\frac{2\alpha M^{2}\eta^{2}n_{\mathrm{forget}}^{2}}{(n_{\mathrm{pub}}+n_{\mathrm{priv}})^{2}\sigma^{2}}\sum_{t=1}^{T-1}\prod_{t^{\prime}=t}^{T-1}h(t^{\prime},\eta,\sigma).$$

### B.2 Proof of [Corollary 3.1](https://arxiv.org/html/2605.11170#S3.Thmcorollary1)

One sufficient condition to satisfy an upper bound $\varepsilon$ on the Rényi divergence is:

$$\frac{2\alpha M^{2}\eta^{2}n_{\mathrm{forget}}^{2}}{(n_{\mathrm{pub}}+n_{\mathrm{priv}})^{2}\sigma^{2}}\sum_{t=1}^{T-1}\prod_{t^{\prime}=t}^{T-1}h(t^{\prime},\eta,\sigma)\leq\varepsilon\iff\frac{2\alpha M^{2}\eta^{2}n_{\mathrm{forget}}^{2}}{(n_{\mathrm{pub}}+n_{\mathrm{priv}})^{2}\varepsilon}\sum_{t=1}^{T-1}\prod_{t^{\prime}=t}^{T-1}h(t^{\prime},\eta,\sigma)\leq\sigma^{2}.$$

Since $\frac{n_{\mathrm{forget}}}{n_{\mathrm{public}}+n_{\mathrm{priv}}}=\frac{n_{\mathrm{forget}}}{n_{\mathrm{priv}}}\cdot\frac{n_{\mathrm{priv}}}{n_{\mathrm{public}}+n_{\mathrm{priv}}}$, we set $c=\frac{n_{\mathrm{forget}}}{n_{\mathrm{priv}}}$ and can thus write:

$$\sigma^{2}\geq\frac{2\alpha M^{2}\eta^{2}c^{2}}{\varepsilon}\left(\sum_{t=1}^{T-1}\prod_{t^{\prime}=t}^{T-1}h(t^{\prime},\eta,\sigma)\right)\left(\frac{n_{\mathrm{priv}}}{n_{\mathrm{public}}+n_{\mathrm{priv}}}\right)^{2}\qquad(8)$$

When the loss is $m$-strongly convex, we can also derive a right-hand side that is completely independent of $\sigma$:

$$\sigma^{2}\geq\frac{4\alpha c^{2}M^{2}(1-\exp(-m\eta T))}{\varepsilon m}\left(\frac{n_{\mathrm{priv}}}{n_{\mathrm{public}}+n_{\mathrm{priv}}}\right)^{2}\qquad(9)$$

### B.3 Numerical Setup for Strongly Convex Losses

Following Chien et al. ([2024a](https://arxiv.org/html/2605.11170#bib.bib249)), we evaluate our bounds using binary logistic regression on $D=\{(x_{i},y_{i})\}_{i=1}^{n}$:

$$\mathcal{L}(\theta,D)=-\frac{1}{n}\sum_{i=1}^{n}\log\left(s(y_{i}\theta^{T}x_{i})\right)+\frac{\lambda}{2}\|\theta\|_{2}^{2}$$

where $s(\cdot)$ is the sigmoid function. To ensure $M$-Lipschitzness, gradients are clipped during training, which preserves the $\lambda$-strong convexity and $(\frac{1}{4}+\lambda)$-smoothness of the objective (Ye and Shokri, [2022](https://arxiv.org/html/2605.11170#bib.bib207)). For the results in [Figure 2](https://arxiv.org/html/2605.11170#S3.F2), we adopt the MNIST configuration from Chien et al. ([2024a](https://arxiv.org/html/2605.11170#bib.bib249)): $\lambda=0.0119$, $T=10^{4}$, $n_{\mathrm{priv}}=3000$, $M=1$, $\alpha=2$, and $\eta=1/L$. We set the target privacy budget to $\varepsilon=1$ and vary the forget-set fraction to compute [Equation 9](https://arxiv.org/html/2605.11170#A2.E9).
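Equation (9) can be evaluated directly for this configuration. The sketch below (where `sigma_floor` is a hypothetical helper, and the smoothness constant is $L=\frac{1}{4}+\lambda$ as stated above) illustrates how public data shrinks the required noise floor:

```python
import numpy as np

def sigma_floor(alpha, c, M, m, eta, T, eps, n_priv, n_pub):
    """Noise floor of Equation (9) for an m-strongly convex loss."""
    mix = (n_priv / (n_pub + n_priv)) ** 2
    return 4 * alpha * c**2 * M**2 * (1 - np.exp(-m * eta * T)) / (eps * m) * mix

# MNIST configuration from Appendix B.3 (lambda plays the role of m; eta = 1/L)
lam, T, n_priv, M, alpha, eps = 0.0119, 10**4, 3000, 1.0, 2.0, 1.0
eta = 1.0 / (0.25 + lam)

for c in (0.01, 0.1, 0.5):                       # forget-set fractions
    no_pub = sigma_floor(alpha, c, M, lam, eta, T, eps, n_priv, n_pub=0)
    with_pub = sigma_floor(alpha, c, M, lam, eta, T, eps, n_priv, n_pub=n_priv)
    assert with_pub < no_pub                     # public data lowers the noise floor
    print(f"c={c}: sigma^2 >= {no_pub:.4f} (no public) vs {with_pub:.4f} (n_pub = n_priv)")
```

Matching the public and private set sizes cuts the floor by a factor of $4$, consistent with the $O(1/n_{\mathrm{pub}}^{2})$ suppression.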

## Appendix C Proof of [Theorem 4.1](https://arxiv.org/html/2605.11170#S4.Thmtheorem1)

###### Proposition\.

Assuming the data-generating distributions share the same support, that the weight space $\Theta$ is compact, and that the loss is $M$-Lipschitz w.r.t. $\theta$, we have the following upper bound on the generalization error on the private data after performing $K$ iterations of unlearning, initializing a weight $\theta_{0}$ from $\pi_{L}^{T}$:

$$\begin{aligned}
\mathbb{E}_{\theta\sim\pi_{U}}\left[\mathbb{E}_{x\sim P_{\mathrm{priv}}}[\mathcal{L}(\theta,x)]\right]&\leq\underbrace{\exp\left(\frac{n_{\mathrm{pub}}}{n_{\mathrm{pub}}+n_{\mathrm{retain}}}D_{\infty}(P_{\mathrm{priv}}\|P_{\mathrm{pub}})\right)}_{\text{distribution mismatch penalty}}\mathbb{E}_{\theta\sim\pi_{R}}\left[\mathbb{E}_{d\sim P_{\mathrm{train}}}[\mathcal{L}(\theta,d)]\right]\\
&\quad+M\times\mathrm{diam}(\Theta)\times\underbrace{\sqrt{\frac{1}{2}D_{\alpha}(\pi_{R}\|\pi_{U})}}_{\text{unlearning approximation error}}
\end{aligned}$$

where $D_{\infty}(P\|Q)=\log\left(\operatorname{ess\,sup}_{x\sim Q}\frac{p(x)}{q(x)}\right)$ is the infinite Rényi divergence (worst-case regret (Erven and Harremoës, [2014](https://arxiv.org/html/2605.11170#bib.bib142))) and $p_{\mathrm{train}}$ denotes the mixture of the distributions $D_{\mathrm{pub}}$ and $D_{\mathrm{priv}}$ used for training the model.

In order to prove [Theorem 4.1](https://arxiv.org/html/2605.11170#S4.Thmtheorem1), we use the following quantities to define a set of preliminary lemmas.

#### C.0.1 Performance on the training distribution mixture

###### Definition C.1 (Wasserstein distance).

The Wasserstein-1 distance is defined as

$$W_{1}(\mu,\nu)=\inf_{\gamma\in\Pi(\mu,\nu)}\int_{\mathcal{X}\times\mathcal{X}}d(x,y)\,d\gamma(x,y),$$

where:

- $\mu$ and $\nu$ are probability measures on a metric space $(\mathcal{X},d)$,
- $d(x,y)$ is the distance between points $x,y\in\mathcal{X}$,
- $\Pi(\mu,\nu)$ is the set of all couplings of $\mu$ and $\nu$, i.e., the set of joint distributions $\gamma$ on $\mathcal{X}\times\mathcal{X}$ whose marginals are $\mu$ and $\nu$: $\int_{\mathcal{X}}\gamma(x,y)\,dy=\mu(x),\quad\int_{\mathcal{X}}\gamma(x,y)\,dx=\nu(y).$
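For empirical measures on the real line, the infimum over couplings is attained by matching sorted samples, which gives a simple way to compute $W_{1}$. A small illustration (data and sizes are arbitrary):

```python
import numpy as np

def w1_empirical(x, y):
    """W1 between two equal-size empirical measures on the real line.
    In 1D the optimal coupling matches sorted samples (a standard fact)."""
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, size=1000)
y = rng.normal(loc=2.0, size=1000)
d = w1_empirical(x, y)
# Translating a distribution by 2 moves it by W1 close to 2
assert 1.8 < d < 2.2
```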

###### Definition C\.2\(Total Variation Distance\)\.

Let $P$ and $Q$ be two probability measures on a measurable space $(\Omega,\mathcal{F})$. The *total variation distance* between $P$ and $Q$ is defined as

$$\begin{aligned}
TV(P,Q)&=\sup_{A\in\mathcal{F}}|P(A)-Q(A)|&(10)\\
&=\frac{1}{2}\int_{\Omega}|dP-dQ|&(11)\\
&=\frac{1}{2}\|P-Q\|_{TV}.&(12)
\end{aligned}$$

###### Theorem C\.1\.

(Kantorovich–Rubinstein duality, (Villani and others, [2009](https://arxiv.org/html/2605.11170#bib.bib282)), Theorem 5.10) If $\mu,\nu$ have a bounded support $\Omega$, then

$$W_{1}(\mu,\nu)=\sup_{\|h\|_{L}\leq 1}\mathbb{E}_{x\sim\mu}[h(x)]-\mathbb{E}_{y\sim\nu}[h(y)],\qquad(13)$$

where $\|h\|_{L}\leq 1$ denotes the set of 1-Lipschitz functions on $\Omega$.

Let $f:\Theta\rightarrow\mathbb{R}$ be such that $f(\theta)=\mathbb{E}_{D\sim P_{\mathrm{train}}}[\mathcal{L}_{D}(\theta)]$, where $P_{\mathrm{train}}$ denotes the training data distribution (a mixture of $P_{\mathrm{priv}}$ and $P_{\mathrm{pub}}$). Since $\mathcal{L}(\cdot,D)$ is $M$-Lipschitz, so is $f$. Then, we have that:

$$\begin{aligned}
\mathbb{E}_{\substack{\theta\sim\pi_{U}\\D\sim P_{\mathrm{train}}}}\left[\mathcal{L}(\theta,D)\right]-\mathbb{E}_{\substack{\theta\sim\pi_{R}\\D\sim P_{\mathrm{train}}}}\left[\mathcal{L}(\theta,D)\right]&=\mathbb{E}_{\theta\sim\pi_{U}}[f(\theta)]-\mathbb{E}_{\theta\sim\pi_{R}}[f(\theta)]&&\text{(Fubini's theorem)}\\
&\leq M\times W_{1}(\pi_{U},\pi_{R})&&\text{(by Theorem C.1)}
\end{aligned}$$

Now, we need to find an upper bound on the 1-Wasserstein distance in terms of the Rényi divergence between $\pi_{R}$ and $\pi_{U}$. The following results will be useful in deriving it:

###### Proposition C\.1\.

(Pinsker's inequality) For two probability distributions $P,Q$, we have

$$2\,TV(P,Q)^{2}\leq KL(P\|Q).\qquad(14)$$

###### Proposition C\.2\.

(Monotonicity of the Rényi divergence, (Erven and Harremoës, [2014](https://arxiv.org/html/2605.11170#bib.bib142))) For $1\leq\alpha_{1}\leq\alpha_{2}$ and probability measures $P,Q$,

$$KL(P\|Q)\leq D_{\alpha_{1}}(P\|Q)\leq D_{\alpha_{2}}(P\|Q).$$

The KL divergence lower bounds any Rényi divergence since it is obtained in the limit $\alpha\rightarrow 1$.

###### Proposition C\.3\.

(Upper bounding $W_{1}$ with $TV$ (Gibbs and Su, [2002](https://arxiv.org/html/2605.11170#bib.bib141))) If the distributions $P,Q$ share a support $\Omega$ and $\mathrm{diam}(\Omega)=\sup_{(x,y)\in\Omega\times\Omega}d(x,y)$ is finite, then we have

$$W_{1}(P,Q)\leq\mathrm{diam}(\Omega)\,TV(P,Q).\qquad(15)$$

Using the results above, we have

$$\begin{aligned}
\mathbb{E}_{\theta\sim\pi_{U}}[f(\theta)]-\mathbb{E}_{\theta\sim\pi_{R}}[f(\theta)]&\leq MW_{1}(\pi_{U},\pi_{R})\\
&\leq M\times\mathrm{diam}(\Theta)\times TV(\pi_{U},\pi_{R})&&\text{(by [Proposition C.3](https://arxiv.org/html/2605.11170#A3.Thmproposition3) and compactness of }\Theta)\\
&\leq M\times\mathrm{diam}(\Theta)\times\sqrt{\frac{1}{2}KL(\pi_{U}\|\pi_{R})}&&\text{(by [Proposition C.1](https://arxiv.org/html/2605.11170#A3.Thmproposition1))}\\
&\leq M\times\mathrm{diam}(\Theta)\times\sqrt{\frac{1}{2}D_{\alpha}(\pi_{U}\|\pi_{R})}&&\text{(by [Proposition C.2](https://arxiv.org/html/2605.11170#A3.Thmproposition2))}
\end{aligned}$$

Thus, we obtain that the generalization error of learning + unlearning is upper bounded as follows:
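The chain of inequalities can be spot-checked on a finite metric space, where $TV$, $KL$, the diameter, and $W_{1}$ are all computable in closed form. A minimal sketch with two arbitrary distributions on five points:

```python
import numpy as np

# Two distributions on the finite metric space {0, 1, ..., 4} with d(x, y) = |x - y|
pts = np.arange(5.0)
p = np.array([0.4, 0.3, 0.2, 0.05, 0.05])
q = np.array([0.2, 0.2, 0.2, 0.2, 0.2])

tv = 0.5 * np.abs(p - q).sum()
kl = np.sum(p * np.log(p / q))
diam = pts.max() - pts.min()

# 1D W1 via the CDF formula: W1 = integral of |F_p - F_q| (unit spacing)
w1 = np.abs(np.cumsum(p - q))[:-1].sum()

assert w1 <= diam * tv              # Proposition C.3
assert 2 * tv**2 <= kl              # Pinsker's inequality (Proposition C.1)
```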

###### Proposition C\.4\.

Assuming that $\mathcal{L}$ is $M$-Lipschitz, we have

$$\mathbb{E}_{\theta\sim\pi_{U}}\left[\mathbb{E}_{D\sim P_{\mathrm{train}}}[\mathcal{L}(\theta,D)]\right]\leq\mathbb{E}_{\theta\sim\pi_{R}}\left[\mathbb{E}_{D\sim P_{\mathrm{train}}}[\mathcal{L}(\theta,D)]\right]+M\times\mathrm{diam}(\Theta)\times\sqrt{\frac{1}{2}D_{\alpha}(\pi_{U}\|\pi_{R})}\qquad(16)$$

#### C.0.2 Adapting the bound to [Theorem 4.1](https://arxiv.org/html/2605.11170#S4.Thmtheorem1)

We would like to evaluate the performance of the model obtained after unlearning. [Proposition C.4](https://arxiv.org/html/2605.11170#A3.Thmproposition4) provides a generalization bound on a mixture of distributions, namely on public + private data. In most practical scenarios, one would want to quantify the "lost" performance on private data after forgetting one of its subsets. Thus, we would like to upper bound the quantity $\mathbb{E}_{\pi_{U}}\left[\mathbb{E}_{D\sim P_{\mathrm{priv}}}\left[\mathcal{L}_{D}(\theta)\right]\right]$. The training data distribution used for either retraining or unlearning can be considered as generated from a mixture of the distributions I and II. Assuming the sampling proportions for training are consistent, one can write that the data distribution used in retraining is

$$P_{\mathrm{train}}=\frac{n_{\mathrm{pub}}}{n_{\mathrm{pub}}+n_{\mathrm{r\text{-}priv}}}P_{\mathrm{pub}}+\frac{n_{\mathrm{r\text{-}priv}}}{n_{\mathrm{pub}}+n_{\mathrm{r\text{-}priv}}}P_{\mathrm{priv}}.$$

Fix any $\theta\in\Theta$. We have that

$$\begin{aligned}
\mathbb{E}_{\mathcal{D}\sim P_{\mathrm{train}}}[\mathcal{L}(\theta,\mathcal{D})]&=\frac{n_{\mathrm{pub}}}{n_{\mathrm{pub}}+n_{\mathrm{r\text{-}priv}}}\mathbb{E}_{\mathcal{D}\sim P_{\mathrm{pub}}}[\mathcal{L}(\theta,\mathcal{D})]+\frac{n_{\mathrm{r\text{-}priv}}}{n_{\mathrm{pub}}+n_{\mathrm{r\text{-}priv}}}\mathbb{E}_{\mathcal{D}\sim P_{\mathrm{priv}}}[\mathcal{L}(\theta,\mathcal{D})]\\
\mathbb{E}_{\mathcal{D}\sim P_{\mathrm{priv}}}[\mathcal{L}(\theta,\mathcal{D})]&=\int p_{\mathrm{priv}}(x)\mathcal{L}(\theta,x)\,dx\\
&=\int p_{\mathrm{train}}(x)\frac{p_{\mathrm{priv}}(x)}{p_{\mathrm{train}}(x)}\mathcal{L}(\theta,x)\,dx\\
&=\mathbb{E}_{x\sim P_{\mathrm{train}}}\left[\frac{p_{\mathrm{priv}}(x)}{p_{\mathrm{train}}(x)}\mathcal{L}(\theta,x)\right]\\
&\leq\operatorname{ess\,sup}_{x\in\mathrm{Supp}(P_{\mathrm{pub}})\cup\mathrm{Supp}(P_{\mathrm{priv}})}\frac{p_{\mathrm{priv}}(x)}{p_{\mathrm{train}}(x)}\;\mathbb{E}_{d\sim P_{\mathrm{train}}}[\mathcal{L}(\theta,d)]\\
&\leq\exp\left(D_{\infty}(P_{\mathrm{priv}}\|P_{\mathrm{train}})\right)\mathbb{E}_{d\sim P_{\mathrm{train}}}[\mathcal{L}(\theta,d)].
\end{aligned}$$

Moreover, by convexity of the Rényi divergence (Erven and Harremoës, [2014](https://arxiv.org/html/2605.11170#bib.bib142)) in its second argument, we have that

$$D_{\infty}(P_{\mathrm{priv}}\|P_{\mathrm{train}})\leq\frac{n_{\mathrm{pub}}}{n_{\mathrm{pub}}+n_{\mathrm{r\text{-}priv}}}D_{\infty}(P_{\mathrm{priv}}\|P_{\mathrm{pub}}).$$

Thus we also have

$$\mathbb{E}_{d\sim P_{\mathrm{priv}}}[\mathcal{L}(\theta,d)]\leq\exp\left(\frac{n_{\mathrm{pub}}}{n_{\mathrm{pub}}+n_{\mathrm{r\text{-}priv}}}D_{\infty}(P_{\mathrm{priv}}\|P_{\mathrm{pub}})\right)\mathbb{E}_{d\sim P_{\mathrm{train}}}[\mathcal{L}(\theta,d)].\qquad(17)$$

Thus, we can adapt [Proposition C.4](https://arxiv.org/html/2605.11170#A3.Thmproposition4) to evaluate the risk *only* on private data. Note that so far, the only assumption made on the difference between the data-generating distributions I and II is that they share the same support. The bound might be refined with additional assumptions, such as covariate shift or conditional shift.
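The mixture step above — $D_{\infty}(P_{\mathrm{priv}}\|P_{\mathrm{train}})$ contracting by the public weight — can be checked on discrete distributions. A small numerical sketch (the distributions and counts are illustrative):

```python
import numpy as np

# Discrete private and public distributions sharing a support
p_priv = np.array([0.5, 0.3, 0.2])
p_pub  = np.array([0.2, 0.3, 0.5])
n_pub, n_r_priv = 60, 40
w = n_pub / (n_pub + n_r_priv)                       # public mixture weight

p_train = w * p_pub + (1 - w) * p_priv
d_inf_train = np.log(np.max(p_priv / p_train))
d_inf_pub   = np.log(np.max(p_priv / p_pub))

# Mixture bound used in the proof: D_inf(priv || train) <= w * D_inf(priv || pub)
assert d_inf_train <= w * d_inf_pub + 1e-12
```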

We can thus take the expectation of $\theta$ with respect to $\pi_{U}$ to get

$$\mathbb{E}_{\theta\sim\pi_{U}}\left[\mathbb{E}_{d\sim P_{\mathrm{priv}}}[\mathcal{L}(\theta,d)]\right]\leq\exp\left(\frac{n_{\mathrm{pub}}}{n_{\mathrm{pub}}+n_{\mathrm{r\text{-}priv}}}D_{\infty}(P_{\mathrm{priv}}\|P_{\mathrm{pub}})\right)\mathbb{E}_{\theta\sim\pi_{U}}\left[\mathbb{E}_{d\sim P_{\mathrm{train}}}[\mathcal{L}(\theta,d)]\right],$$

and using [Proposition C.4](https://arxiv.org/html/2605.11170#A3.Thmproposition4) to upper bound $\mathbb{E}_{\theta\sim\pi_{U}}\left[\mathbb{E}_{d\sim P_{\mathrm{train}}}[\mathcal{L}(\theta,d)]\right]$, we prove [Theorem 4.1](https://arxiv.org/html/2605.11170#S4.Thmtheorem1):

$$\begin{aligned}
\mathbb{E}_{\theta\sim\pi_{U}}\left[\mathbb{E}_{x\sim P_{\mathrm{priv}}}[\mathcal{L}(\theta,x)]\right]&\leq\exp\left(\frac{n_{\mathrm{pub}}}{n_{\mathrm{pub}}+n_{\mathrm{retain}}}D_{\infty}(P_{\mathrm{priv}}\|P_{\mathrm{pub}})\right)\mathbb{E}_{\theta\sim\pi_{R}}\left[\mathbb{E}_{d\sim P_{\mathrm{train}}}[\mathcal{L}(\theta,d)]\right]\\
&\quad+M\times\mathrm{diam}(\Theta)\times\sqrt{\frac{1}{2}D_{\alpha}(\pi_{R}\|\pi_{U})}.
\end{aligned}$$

### C.1 Retraining performance bound on the training error

The term $\mathbb{E}_{\theta\sim\pi_{R}^{T}}\left[\mathbb{E}_{x\sim p_{\mathrm{train}}}[\mathcal{L}(\theta,x)]\right]$: the upper bound could be further improved to include the *optimal* distribution, i.e., by linking $\mathbb{E}_{\theta\sim\pi_{R}^{T}}\left[\mathbb{E}_{x\sim p_{\mathrm{train}}}[\mathcal{L}(\theta,x)]\right]$ to $\arg\min_{\pi\in\mathcal{P}(\mathbb{R}^{d})}\mathbb{E}_{\theta\sim\pi}\left[\mathbb{E}_{x\sim p_{\mathrm{train}}}[\mathcal{L}(\theta,x)]\right]$. However, standard generalization bounds for Langevin dynamics (Raginsky et al., [2017](https://arxiv.org/html/2605.11170#bib.bib235); Xu et al., [2018](https://arxiv.org/html/2605.11170#bib.bib278)) do not directly apply to our setting due to the projection operator $\Pi_{\Theta}$ in the PNGD updates. These classical results focus on unconstrained non-convex optimization, whereas our bounded domain introduces additional complexity. The most relevant analysis we are aware of is Lamperski ([2020](https://arxiv.org/html/2605.11170#bib.bib112)), who studies generalization properties of projected Stochastic Gradient Langevin Dynamics, though that work considers the infinite-data regime.

## Appendix D When to unlearn, when to retrain?

In the following section, we assume that the loss function is $m$-strongly convex to simplify the computations; the non-convex analysis follows the same flavor but is hardly interpretable due to the presence of the log-Sobolev constants. As a reminder, ALU (1) learns on the whole dataset through $T$ iterations of PNGD, then (2) unlearns by fine-tuning on the retain set using the same PNGD updates. As opposed to (Chien et al., [2024a](https://arxiv.org/html/2605.11170#bib.bib249), [b](https://arxiv.org/html/2605.11170#bib.bib244)), which defines the retraining cost as the number of PNGD steps required to get $\varepsilon$-close to the stationary distribution of retraining from scratch, we define the cost of retraining as $T$, i.e., the same number of steps used to train the model prior to unlearning. This is a more realistic setting in practice, as practitioners may choose an arbitrary number of training steps. The cost of unlearning is defined as the minimum number of unlearning iterations $K$ such that $D_{\alpha}(\pi_{R}^{T+K}\|\pi_{U}^{K})\leq\varepsilon$, for a given $\varepsilon>0$.

The following result gives a sufficient condition on when ALU is more advantageous than retraining, under the assumption that the loss is $m$-strongly convex.

###### Proposition D.1 (Sufficient condition guaranteeing that ALU is more efficient than retraining).

Assume that the conditions of [Theorem 3.2](https://arxiv.org/html/2605.11170#S3.Thmtheorem2) are satisfied, and that the loss function is $m$-strongly convex. Then, a sufficient condition guaranteeing that unlearning is more advantageous than retraining is

$$\frac{C\alpha}{2\sigma^{2}\eta}\log\left(\frac{4\alpha M^{2}n_{\mathrm{forget}}^{2}}{m\varepsilon\sigma^{2}(n_{\mathrm{public}}+n_{\mathrm{priv}})^{2}}\right)<T-\log\left(1-\exp(-m\eta T)\right).$$

###### Proof\.

Using the results from [Theorems 3.1](https://arxiv.org/html/2605.11170#S3.Thmtheorem1) and [3.2](https://arxiv.org/html/2605.11170#S3.Thmtheorem2), with the assumption that the loss is $m$-strongly convex, we have that:

$$D_{\alpha}(\pi_{R}^{T+K}\|\pi_{U}^{K})\leq D_{\alpha}(\pi_{R}^{T}\|\pi_{L}^{T})\exp\left(-\frac{2K\sigma^{2}\eta}{C\alpha}\right).$$

By setting the right-hand side to be less than $\varepsilon$, we obtain a sufficient condition on $K$:

$$D_{\alpha}(\pi_{R}^{T}\|\pi_{L}^{T})\exp\left(-\frac{2K\sigma^{2}\eta}{C\alpha}\right)\leq\varepsilon\;\Rightarrow\;\frac{C\alpha\log\left(\frac{D_{\alpha}(\pi_{R}^{T}\|\pi_{L}^{T})}{\varepsilon}\right)}{2\sigma^{2}\eta}\leq K.$$

Therefore, a sufficient condition for unlearning to be computationally more efficient than retraining is:

$$\frac{C\alpha\log\left(\frac{4\alpha M^{2}n_{\mathrm{forget}}^{2}}{\varepsilon m\sigma^{2}(n_{\mathrm{public}}+n_{\mathrm{priv}})^{2}}\right)}{2\sigma^{2}\eta}\leq T-\log(1-\exp(-m\eta T))\qquad\text{(by [Theorem 3.1](https://arxiv.org/html/2605.11170#S3.Thmtheorem1) and strong convexity of the loss)}$$

∎

This bound exhibits many dependencies between the control knobs that ALU places in practitioners' hands. We examine the following regimes:

##### Small number of learning iterations:

To analyze the computational advantage in this regime, we examine the sufficient condition when $T$ is small. Using the first-order Taylor approximation $1-\exp(-m\eta T)\approx m\eta T$ and noting that for small $T$, the term $\exp\left(-\frac{2\sigma^{2}\eta T}{C\alpha}\right)\approx 1$, the inequality yields:

$$\begin{aligned}
\log\left(\frac{4\alpha M^{2}n_{\mathrm{forget}}^{2}}{\varepsilon m\sigma^{2}(n_{\mathrm{public}}+n_{\mathrm{priv}})^{2}}\right)&\leq\frac{2\sigma^{2}\eta}{C\alpha}\left(T-\log(m\eta T)\right)\\
\Rightarrow\left(\frac{n_{\mathrm{priv}}}{n_{\mathrm{public}}+n_{\mathrm{priv}}}\times\frac{n_{\mathrm{forget}}}{n_{\mathrm{priv}}}\right)^{2}&\leq\frac{\varepsilon m\sigma^{2}}{4\alpha M^{2}}\left[\underbrace{\exp\left(\frac{2\sigma^{2}\eta T}{C\alpha}\right)}_{\approx 1}\exp\left(-\frac{2\sigma^{2}\eta\log(m\eta T)}{C\alpha}\right)\right]\\
\Rightarrow\left(\frac{n_{\mathrm{priv}}}{n_{\mathrm{public}}+n_{\mathrm{priv}}}\times\frac{n_{\mathrm{forget}}}{n_{\mathrm{priv}}}\right)^{2}&\leq\frac{\varepsilon m\sigma^{2}}{4\alpha M^{2}}(m\eta T)^{-\frac{2\sigma^{2}\eta}{C\alpha}}
\end{aligned}$$

Letting $c=\frac{n_{\mathrm{forget}}}{n_{\mathrm{priv}}}$ denote the fraction of private data that needs to be forgotten, we have:

$$\begin{aligned}
\left(\frac{n_{\mathrm{priv}}}{n_{\mathrm{public}}+n_{\mathrm{priv}}}\right)^{2}&\leq\frac{\varepsilon m\sigma^{2}(m\eta T)^{-\frac{2\sigma^{2}\eta}{C\alpha}}}{4\alpha M^{2}c^{2}}\\
\Rightarrow\left(\frac{n_{\mathrm{public}}}{n_{\mathrm{priv}}}+1\right)^{2}&\geq\frac{4\alpha M^{2}c^{2}}{\varepsilon m\sigma^{2}}(m\eta T)^{\frac{2\sigma^{2}\eta}{C\alpha}}\\
\Rightarrow\frac{n_{\mathrm{public}}}{n_{\mathrm{priv}}}&\geq\underbrace{\frac{2\sqrt{\alpha}M(m\eta)^{\frac{\sigma^{2}\eta}{C\alpha}}}{m}}_{\text{Const.}}\times T^{\frac{\sigma^{2}\eta}{C\alpha}}\times\frac{1}{\sqrt{\varepsilon}}\times\left(\frac{n_{\mathrm{forget}}}{n_{\mathrm{priv}}}\right)-1
\end{aligned}$$
In the regime where $T$ is small, the feasibility of unlearning is subject to a trade-off between the forget ratio, $\varepsilon$, and the available public data. This relationship is formally captured by the inequality $\frac{n_{\mathrm{public}}}{n_{\mathrm{priv}}}+1\geq\mathcal{C}\cdot T^{\beta}\cdot\varepsilon^{-1/2}\cdot\frac{n_{\mathrm{forget}}}{n_{\mathrm{priv}}}$, where $\beta=\sigma^{2}\eta/C\alpha$. In the private-only scenario where $n_{\mathrm{public}}=0$, the efficiency of unlearning is limited; the algorithm only remains faster than retraining if the fraction of data to be forgotten is small enough to satisfy $\frac{n_{\mathrm{forget}}}{n_{\mathrm{priv}}}\leq\frac{\sqrt{\varepsilon}}{\mathcal{C}T^{\beta}}$. Under these conditions, attempting to unlearn a larger portion of the dataset or imposing a stricter privacy guarantee (smaller $\varepsilon$) forces the number of unlearning iterations $K$ to surpass the original training budget $T$, making retraining the more viable option. However, introducing public data ($n_{\mathrm{public}}>0$) provides a computational buffer that fundamentally alters this dynamic: by increasing the ratio $n_{\mathrm{public}}/n_{\mathrm{priv}}$, practitioners can compensate for high forget ratios or stringent privacy requirements, ensuring that $K<T$.
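The threshold on $n_{\mathrm{public}}/n_{\mathrm{priv}}$ can be computed explicitly. The sketch below (with `min_public_ratio` a hypothetical helper and all constants illustrative stand-ins for the quantities in the text) shows that a tiny forget fraction needs no public data while mass unlearning does:

```python
import numpy as np

def min_public_ratio(alpha, M, m, eta, sigma, C, T, eps, forget_frac):
    """Small-T lower bound on n_public / n_priv for unlearning to beat retraining.
    All constants are illustrative stand-ins for the quantities in the text."""
    beta = sigma**2 * eta / (C * alpha)
    const = 2 * np.sqrt(alpha) * M * (m * eta) ** beta / m
    return const * T**beta / np.sqrt(eps) * forget_frac - 1

# A tiny forget fraction needs no public data (bound below 0); mass unlearning does.
small = min_public_ratio(alpha=2, M=1, m=0.01, eta=0.1, sigma=0.5, C=1,
                         T=100, eps=1.0, forget_frac=0.001)
large = min_public_ratio(alpha=2, M=1, m=0.01, eta=0.1, sigma=0.5, C=1,
                         T=100, eps=1.0, forget_frac=0.5)
assert small < 0 < large
```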

##### Large number of training iterations (training to convergence)

In this case, Chien et al. ([2024a](https://arxiv.org/html/2605.11170#bib.bib249)) proved that unlearning retains a non-negligible computational advantage over retraining, scaling as $\mathcal{O}(\log(n_{\mathrm{total}}))$, with probability $1-\frac{1}{R^{d}}$.

## Appendix E Langevin Unlearning pseudo-code

**Algorithm 1** Training with Projected Noisy Gradient Descent (PNGD)

1. $\theta_{0}\sim\pi_{0}$ {Sample from initialization distribution}
2. **for** $t=0$ **to** $T-1$ **do**
3. $\quad g_{t}\leftarrow\nabla_{\theta}L_{D}(\theta_{t})$ {Compute gradient on full dataset}
4. $\quad \xi_{t}\sim\mathcal{N}(0,2\eta\sigma^{2}I_{d})$ {Sample Gaussian noise}
5. $\quad \theta_{t+1}\leftarrow\Pi_{\Theta}[\theta_{t}-\eta g_{t}+\xi_{t}]$ {Update and project}
6. **end for**
7. **return** $\theta_{T}$

**Algorithm 2** Langevin Unlearning

1. $\theta_{0}^{U}\leftarrow\theta_{T}$ {Initialize from trained model}
2. **for** $k=0$ **to** $K-1$ **do**
3. $\quad g_{k}\leftarrow\nabla_{\theta}L_{D_{\text{retain}}}(\theta_{k}^{U})$ {Compute gradient on retain set only}
4. $\quad \xi_{k}\sim\mathcal{N}(0,2\eta\sigma^{2}I_{d})$ {Sample Gaussian noise}
5. $\quad \theta_{k+1}^{U}\leftarrow\Pi_{\Theta}[\theta_{k}^{U}-\eta g_{k}+\xi_{k}]$ {Update and project}
6. **end for**
7. **return** $\theta_{K}^{U}$
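The two procedures differ only in their initialization and in the dataset the gradient is computed on. A compact numpy sketch on a toy least-squares problem makes this concrete; the quadratic loss, dataset sizes, and hyperparameters here are our illustrative choices, not the paper's experimental configuration.

```python
import numpy as np

def project_l2(theta, radius):
    """Pi_Theta: project onto the l2 ball of the given radius."""
    norm = np.linalg.norm(theta)
    return theta if norm <= radius else theta * (radius / norm)

def pngd(X, y, theta0, eta=0.01, sigma=0.01, steps=100, radius=1.0, rng=None):
    """Projected Noisy Gradient Descent:
    theta <- Pi_Theta[theta - eta * grad + xi],  xi ~ N(0, 2*eta*sigma^2*I_d)."""
    if rng is None:
        rng = np.random.default_rng(0)
    theta = theta0.copy()
    for _ in range(steps):
        grad = X.T @ (X @ theta - y) / len(y)        # gradient of 0.5 * MSE
        xi = rng.normal(0.0, np.sqrt(2 * eta * sigma ** 2), size=theta.shape)
        theta = project_l2(theta - eta * grad + xi, radius)
    return theta

# Toy data: 200 points, the first 20 form the forget set.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) * 0.1
retain = np.arange(20, 200)

theta_T = pngd(X, y, np.zeros(5), steps=100, rng=rng)             # Algorithm 1
theta_U = pngd(X[retain], y[retain], theta_T, steps=20, rng=rng)  # Algorithm 2
```

Because Algorithm 2 warm-starts from $\theta_T$ and only needs $K\ll T$ noisy steps on the retain set, the unlearning pass reuses nearly all of the original training compute.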

## Appendix F DomainNet data

The following is a snippet of samples from the DomainNet dataset, from which we extracted two domains, Clipart and Quickdraw. The classes are aggregated into 24 meta-classes ([Table 3](https://arxiv.org/html/2605.11170#A6.T3)), following (Peng et al., [2019](https://arxiv.org/html/2605.11170#bib.bib281)).

![Refer to caption](https://arxiv.org/html/2605.11170v1/x6.png)

Figure 5: The two domains of public and private data used for [Sections 5.1](https://arxiv.org/html/2605.11170#S5.SS1) and [5.2](https://arxiv.org/html/2605.11170#S5.SS2) (Peng et al., [2019](https://arxiv.org/html/2605.11170#bib.bib281)). Both datasets share the same number of classes, with Clipart, a collection of stylized images, representing the private data, and Quickdraw representing a collection of hand-drawn sketches.

![Refer to caption](https://arxiv.org/html/2605.11170v1/x7.png)

Figure 6: The two domains of public and private data used for [Section 5.2](https://arxiv.org/html/2605.11170#S5.SS2) (Peng et al., [2019](https://arxiv.org/html/2605.11170#bib.bib281)). Both datasets share the same number of classes, with Infograph, a collection of stylized images, representing the public data, and Real representing a collection of real-life images.

Table 3: Class aggregation for the experimental dataset. Individual classes are grouped into 24 superclasses.
## Appendix G Details about the Rényi estimation

#### G.0.1 Neural Rényi estimation

Following the works of Birrell et al. ([2021](https://arxiv.org/html/2605.11170#bib.bib55), [2023](https://arxiv.org/html/2605.11170#bib.bib26)), two variational representations of the Rényi divergence between two distributions $P,Q$ have been proposed. The first draws inspiration from the Donsker–Varadhan dual representation (Donsker and Varadhan, [1975](https://arxiv.org/html/2605.11170#bib.bib294)) of the KL divergence:

###### Theorem G.1 (Donsker–Varadhan Rényi divergence; Birrell et al., [2021](https://arxiv.org/html/2605.11170#bib.bib55)).

Let $P,Q$ be two distributions on $(\Omega,\mathcal{M})$ and $\alpha\in\mathbb{R}$, $\alpha\neq 0,1$. Then, for any set of functions $\Phi$ with $\mathcal{M}_{b}(\Omega)\subset\Phi\subset\mathcal{M}(\Omega)$,

$$\frac{D_{\alpha}(P\|Q)}{\alpha}=\sup_{\phi\in\Phi}\left\{\frac{1}{\alpha-1}\log\int e^{(\alpha-1)\phi}\,dP-\frac{1}{\alpha}\log\int e^{\alpha\phi}\,dQ\right\}.\tag{18}$$

If in addition $(\Omega,\mathcal{M})$ is a metric space with the Borel $\sigma$-algebra, then [Equation 18](https://arxiv.org/html/2605.11170#A7.E18) holds for all $\Phi$ satisfying $\mathrm{Lip}_{b}\subset\Phi\subset\mathcal{M}(\Omega)$, where $\mathrm{Lip}_{b}$ denotes the set of bounded Lipschitz functions.

Here, $\mathcal{M}(\Omega)$ denotes the space of measurable real-valued functions on $\Omega$, and $\mathcal{M}_{b}(\Omega)$ the subspace of bounded functions.
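The Donsker–Varadhan representation can be sanity-checked by Monte Carlo on a case with a closed form (this worked example is ours, not the paper's): for $P=\mathcal{N}(1,1)$ and $Q=\mathcal{N}(0,1)$, the supremum is attained at $\phi=\log(dP/dQ)$, i.e. $\phi(x)=x-\tfrac{1}{2}$, and $D_{\alpha}(P\|Q)=\alpha/2$, so the objective should approach $D_{\alpha}/\alpha=1/2$.

```python
import numpy as np

# Monte Carlo check of the Donsker-Varadhan Renyi representation (Eq. 18)
# for P = N(1,1), Q = N(0,1), where D_alpha(P||Q) = alpha/2 in closed form.
rng = np.random.default_rng(0)
alpha, n = 2.0, 200_000
xp = rng.normal(1.0, 1.0, n)          # samples from P
xq = rng.normal(0.0, 1.0, n)          # samples from Q

phi = lambda x: x - 0.5               # optimal witness: log(dP/dQ)
objective = (np.log(np.mean(np.exp((alpha - 1) * phi(xp)))) / (alpha - 1)
             - np.log(np.mean(np.exp(alpha * phi(xq)))) / alpha)

print(objective)                      # close to D_alpha / alpha = 0.5
```

The exponential moments in this estimator are exactly what makes it high-variance for distant distributions, which motivates the convex conjugate formulation discussed next.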

While this representation allows sample-based estimation, it involves exponential terms that yield high-variance estimates in practice. To mitigate this issue, Birrell et al. ([2023](https://arxiv.org/html/2605.11170#bib.bib26)) proposed a convex conjugate formulation:

###### Theorem G.2 (Convex conjugate Rényi divergence; Birrell et al., [2023](https://arxiv.org/html/2605.11170#bib.bib26)).

Let $P,Q$ be probability distributions supported on $\Omega$, with $P\ll Q$, and let $\mathcal{M}_{b}(\Omega)$ denote the space of bounded measurable functions. Then, for all $\alpha\in(0,+\infty)\setminus\{1\}$,

$$\frac{D_{\alpha}(P\|Q)}{\alpha}=\sup_{g\in\mathcal{M}_{b}(\Omega),\,g<0}\int g\,dQ+\frac{1}{\alpha-1}\int|g|^{\frac{\alpha-1}{\alpha}}\,dP+\frac{1}{\alpha}(\log\alpha+1).\tag{19}$$

This convex conjugate formulation removes the exponential dependence and provides more stable numerical estimates, making it preferable for our setting.

##### Neural network parameterization.

To approximate $\Phi=\{g\in\mathcal{M}(\Theta):g<0\}$ we use the class $g_{\theta}$ of two-layer MLPs with spectral normalization (Miyato et al., [2018](https://arxiv.org/html/2605.11170#bib.bib34)), LeakyReLU activations, and a polysoftplus output activation as in Birrell et al. ([2023](https://arxiv.org/html/2605.11170#bib.bib26)). The polysoftplus activation offers superior numerical stability compared to ReLU. It is defined as

$$\mathrm{polysoftplus}(x)=-\left(\frac{1}{1-x}\mathbf{1}_{x<0}+(1+x)\mathbf{1}_{x\geq 0}\right).\tag{20}$$
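Equation (20) takes only a couple of lines to implement; note that its output is always strictly negative, which is what enforces the constraint $g<0$ from Theorem G.2 (a minimal numpy sketch, not the authors' code):

```python
import numpy as np

def polysoftplus(x):
    """Strictly negative output activation from Eq. (20):
    -1/(1-x) for x < 0, and -(1+x) for x >= 0."""
    x = np.asarray(x, dtype=float)
    neg = 1.0 / (1.0 - np.minimum(x, 0.0))   # 1/(1-x), evaluated safely
    pos = 1.0 + np.maximum(x, 0.0)           # 1+x
    return -np.where(x < 0, neg, pos)

print(polysoftplus([-1.0, 0.0, 1.0]))  # [-0.5 -1.  -2. ]
```

The two branches agree at $x=0$ (both give $-1$), so the activation is continuous, and it saturates toward $0$ from below as $x\to-\infty$ rather than clipping hard like ReLU.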
The discriminator network $g_{\theta}$ is trained to maximize the variational bound in [Theorem 5.1](https://arxiv.org/html/2605.11170#S5.Ex9) using samples $\{\theta_{i}^{U}\}_{i=1}^{N}\sim\pi_{U}^{K}$ and $\{\theta_{j}^{R}\}_{j=1}^{N}\sim\pi_{R}^{T+K}$. The optimization objective becomes:

$$\max_{\theta}\left\{\frac{1}{N}\sum_{j=1}^{N}g_{\theta}(\theta_{j}^{R})+\frac{1}{\alpha-1}\frac{1}{N}\sum_{i=1}^{N}|g_{\theta}(\theta_{i}^{U})|^{\frac{\alpha-1}{\alpha}}+\frac{1}{\alpha}(\log\alpha+1)\right\}.\tag{21}$$
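For any fixed discriminator, the sample-based objective in Equation (21) reduces to two Monte Carlo averages, and any feasible $g<0$ yields a lower bound on $D_{\alpha}/\alpha$ by Theorem G.2. The helper below is our illustrative sketch, with a simple hypothetical discriminator `g` standing in for the polysoftplus-capped MLP:

```python
import numpy as np

def renyi_objective(g, samples_R, samples_U, alpha):
    """Sample-based variational objective of Eq. (21); g must return
    strictly negative values (the constraint g < 0 in Theorem G.2)."""
    term_q = np.mean(g(samples_R))                                  # E_Q[g]
    term_p = np.mean(np.abs(g(samples_U)) ** ((alpha - 1) / alpha))  # E_P[|g|^((a-1)/a)]
    return term_q + term_p / (alpha - 1) + (np.log(alpha) + 1) / alpha

# Toy usage with a hypothetical fixed discriminator g(x) = -exp(-x).
rng = np.random.default_rng(0)
g = lambda x: -np.exp(-x)
val = renyi_objective(g, rng.normal(0, 1, 10_000),
                      rng.normal(1, 1, 10_000), alpha=2.0)
print(float(val))  # a finite lower bound on D_alpha / alpha
```

Training the MLP discriminator to maximize this quantity, as in Algorithm 3 below, tightens the bound toward the true divergence.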
To reduce estimator variance, we repeat the discriminator training five times with different random initializations and report the average. We use a learning rate of 0.0001 with the Adam optimizer (Kingma and Ba, [2017](https://arxiv.org/html/2605.11170#bib.bib298)), and train the discriminators for 30,000 epochs with batch size $b=6000$.

This procedure used $N=30{,}000$ model samples, which makes it computationally intensive and better suited for theoretical validation than for large-scale empirical benchmarking. Although regularization and repeated runs alleviate variance, Rényi divergence estimation remains a statistically challenging task. Developing scalable and lower-variance estimators is therefore an important direction for future work.

#### G.0.2 Sampling from $\pi_{U}^{K}$ and $\pi_{R}^{T+K}$

We conduct experiments on the DomainNet dataset (24-class image classification; [Figure 5](https://arxiv.org/html/2605.11170#A6.F5)). We choose the domain Clipart, a collection of stylized images, as the private data domain, and Quickdraw, a collection of hand-drawn sketches, as the public domain. Image embeddings are extracted using DinoV2 (Oquab et al., [2024](https://arxiv.org/html/2605.11170#bib.bib296)), a self-supervised vision transformer; specifically, we use vit_small_patch16_224_dino (Caron et al., [2021](https://arxiv.org/html/2605.11170#bib.bib297)). All images are resized to $224\times 224$ prior to feature extraction.

On these embeddings, we train 30,000 linear classifiers on the full dataset $D=D_{\mathrm{pub}}\cup D_{\mathrm{priv}}$ for $T=20$ iterations, and subsequently fine-tune them on the retain set $D_{r}=D\setminus D_{\mathrm{forget}}$ for $K\in\{1,5,10,15\}$ additional iterations. This procedure yields 30,000 samples from the unlearning distribution $\pi_{U}^{K}$.

For comparison, we train another 30,000 linear classifiers directly on the retain set $D_{r}$ for $T+K$ iterations, producing samples from the retraining distribution $\pi_{R}^{T+K}$. All models are trained with the same projected noisy gradient descent (PNGD) update using SGD, with noise scale $\sigma=0.01$, learning rate $\eta=0.001$, batch size $b=1024$, and projection radius $R=1.0$.

To assess robustness across dataset splits, we fix the total training set size to $N_{\mathrm{train}}=42{,}000$ and vary the public and forget set sizes as $(|D_{\mathrm{pub}}|,|D_{\mathrm{forget}}|)\in\{(10{,}000,\,12{,}000),\,(15{,}000,\,7{,}000),\,(20{,}000,\,2{,}000)\}$. The remaining private data in the retain set is fixed at 20,000 examples. The resulting divergence estimates are reported in [Figures 3(a)](https://arxiv.org/html/2605.11170#S5.F3.sf1) and [3(b)](https://arxiv.org/html/2605.11170#S5.F3.sf2).

### G.1 Pseudo-code

**Algorithm 3** Rényi Divergence Estimation via Variational Representation

1. **Input:** samples $\{\theta_{i}^{R}\}_{i=1}^{N}\sim\pi_{R}^{T+K}$ and $\{\theta_{j}^{U}\}_{j=1}^{N}\sim\pi_{U}^{K}$, order $\alpha$, discriminator architecture
2. Initialize the discriminator network $g_{\phi}$ with spectral normalization
3. **for** epoch $=1$ **to** num_epochs **do**
4. $\quad$ Sample a minibatch from the retraining samples $\{\theta_{i}^{R}\}$
5. $\quad$ Sample a minibatch from the unlearning samples $\{\theta_{j}^{U}\}$
6. $\quad$ Compute the variational objective:
$$\mathcal{L}=\frac{1}{N}\sum_{i=1}^{N}g_{\phi}(\theta_{i}^{R})+\frac{1}{\alpha-1}\frac{1}{N}\sum_{j=1}^{N}|g_{\phi}(\theta_{j}^{U})|^{\frac{\alpha-1}{\alpha}}+\frac{1}{\alpha}(\log\alpha+1)\tag{22}$$
7. $\quad$ Update $\phi$ to maximize $\mathcal{L}$ via gradient ascent
8. **end for**

9. **Output:** estimated divergence $\widehat{D}_{\alpha}(\pi_{U}^{K}\|\pi_{R}^{T+K})=\alpha\,\widehat{\mathcal{L}}$

## Appendix H Evaluation with U-LiRA

#### H.0.1 U-LiRA details

U-LiRA, introduced by Hayes et al. ([2025](https://arxiv.org/html/2605.11170#bib.bib280)) as an adaptation of the LiRA membership inference attack (Carlini et al., [2021](https://arxiv.org/html/2605.11170#bib.bib285)) to the unlearning setting, formalizes unlearning evaluation as a binary hypothesis test. The goal is to distinguish between two distributions over model parameters: the unlearning distribution $\pi_{U}^{K}$, obtained by training on the full dataset and subsequently applying the target unlearning algorithm to remove the influence of the forget set, and the retraining distribution $\pi_{R}^{T+K}$, obtained by training from scratch without the forget set. Letting $P(\theta\mid\cdot)$ denote the likelihood of observing model parameters $\theta$ under a given distribution, the Neyman–Pearson lemma (Neyman and Pearson, [1933](https://arxiv.org/html/2605.11170#bib.bib293)) implies that the most powerful test for this discrimination problem is obtained by thresholding the likelihood ratio

$$\frac{P(\theta\mid\pi_{U}^{K})}{P(\theta\mid\pi_{R}^{T+K})}$$

for model parameters $\theta$.

Since directly computing $P(\theta\mid\pi_{U}^{K})$ and $P(\theta\mid\pi_{R}^{T+K})$ is infeasible in practice, U-LiRA employs a series of approximations. First, the two distributions are approximated empirically by sampling: the adversary trains $N$ models under $\pi_{U}^{K}$ (full training followed by unlearning) and $N$ models under $\pi_{R}^{T+K}$ (training from scratch without the forget set).

To reduce the sample complexity required for a low-variance estimate, U-LiRA projects models into a one-dimensional representation space via a statistic $f:\Theta\to\mathbb{R}$. Since we only run the attack on forget sets of size 1, we follow Hayes et al. ([2025](https://arxiv.org/html/2605.11170#bib.bib280)) and choose $f$ to be the model's confidence score on the forget example, rescaled by the logit function $\phi(\omega)=\ln\left(\frac{\omega}{1-\omega}\right)$. The test is then conducted on the surrogate likelihood ratio

$$\frac{P(f(\theta)\mid f(\pi_{U}^{K}))}{P(f(\theta)\mid f(\pi_{R}^{T+K}))}.$$

As a final simplifying approximation, U-LiRA models the projected distributions as Gaussians

$$f(\pi_{U}^{K})\approx\mathcal{N}(\mu_{U},\sigma_{U}^{2}),\qquad f(\pi_{R}^{T+K})\approx\mathcal{N}(\mu_{R},\sigma_{R}^{2}),$$
where the parameters $(\mu_{U},\sigma_{U}^{2})$ and $(\mu_{R},\sigma_{R}^{2})$ are estimated directly from the $N$ sample models of each distribution.
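The resulting Gaussian likelihood-ratio test is straightforward to implement. The sketch below is our illustration of the procedure described above; the shadow-model scores are synthetic placeholders rather than real model confidences:

```python
import numpy as np

def gaussian_llr(score, scores_U, scores_R):
    """U-LiRA test statistic: log P(score | N(mu_U, s_U^2))
    - log P(score | N(mu_R, s_R^2)), with moments fit from samples."""
    def logpdf(x, mu, sd):
        return -0.5 * np.log(2 * np.pi * sd ** 2) - (x - mu) ** 2 / (2 * sd ** 2)
    mu_U, sd_U = np.mean(scores_U), np.std(scores_U)
    mu_R, sd_R = np.mean(scores_R), np.std(scores_R)
    return logpdf(score, mu_U, sd_U) - logpdf(score, mu_R, sd_R)

logit = lambda w: np.log(w / (1 - w))   # the rescaling used for f

# Synthetic shadow-model scores: here the unlearned models are assumed more
# confident on the forget example than retrained ones (illustrative numbers).
rng = np.random.default_rng(0)
scores_U = logit(rng.uniform(0.70, 0.95, 75))   # f(pi_U^K) samples
scores_R = logit(rng.uniform(0.40, 0.70, 75))   # f(pi_R^{T+K}) samples

print(gaussian_llr(logit(0.9), scores_U, scores_R) > 0)  # True: looks unlearned
print(gaussian_llr(logit(0.5), scores_U, scores_R) < 0)  # True: looks retrained
```

Thresholding this statistic at 0 corresponds to the equal-prior Bayes decision; sweeping the threshold traces out the attack's ROC curve.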

In[5\.3](https://arxiv.org/html/2605.11170#S5.SS3), we presented the attack through the lens of Bayes’ rule \(following Algorithm 1 ofHayeset al\.\([2025](https://arxiv.org/html/2605.11170#bib.bib280)\)\), providing a more intuitive explanation for readers less familiar with hypothesis testing concepts\.

#### H.0.2 Experimental setup

We evaluate unlearning on binary sentiment classification of IMDB reviews (Maas et al., [2011](https://arxiv.org/html/2605.11170#bib.bib291)), with Amazon product reviews (Zhang et al., [2015](https://arxiv.org/html/2605.11170#bib.bib292)) as public data. Models are 2-layer LSTMs (Hochreiter and Schmidhuber, [1997](https://arxiv.org/html/2605.11170#bib.bib295)), trained to minimize the cross-entropy loss with projected noisy gradient descent (Gaussian noise variance $\sigma^{2}=0.01$, projection onto an $\ell_{2}$ ball of radius 100).

For each trial, the forget set consists of 100 datapoints sampled uniformly from the IMDB reviews dataset. Following the U-LiRA framework, we generate 75 model samples from each of two distributions:

- **Unlearning distribution $\pi_{U}^{K}$:** models trained on 25,000 private datapoints plus the forget set for $T$ epochs, then fine-tuned without the forget set for $K$ epochs.
- **Retraining distribution $\pi_{R}^{T+K}$:** models trained from scratch on the same 25,000 private datapoints (excluding the forget set) for $T+K$ epochs.

We repeat this sampling process both with and without the inclusion of the 50,000 public datapoints during training and unlearning.
