CBD: API-Only LLM Black-Box Unlearning through Controlled Behavioral Divergence

arXiv cs.LG Papers

Summary

CBD introduces an API-only black-box unlearning framework for LLMs that uses two auxiliary models to create controlled behavioral divergence between retained and target data, achieving a better unlearning-utility trade-off compared to existing methods.

arXiv:2606.27683v1 Announce Type: new Abstract: Edge devices increasingly invoke large language models (LLMs) through API services for context-aware edge intelligence, while edge-generated data may be collected to improve LLMs and may introduce sensitive, copyrighted, harmful, or outdated information into model behavior. Machine unlearning offers a practical way to remove the influence of undesired data without retraining LLMs. However, existing methods still face two gaps. The first is API-only black-box access, where target-model parameters and internal logits are unavailable. The second is how to preserve retained utility when unlearning target data and retained data share highly similar prompt structures or semantic patterns. To address these challenges, we propose Controlled Behavioral Divergence (CBD), an API-only black-box unlearning framework. CBD uses two auxiliary models to create controlled behavioral divergence between retained inputs and unlearning target inputs, converts this divergence into an unlearning-relevance score, and routes unlearning-related prompts away from the target LLM. To improve discrimination accuracy under high similarity between target and retained data, CBD constructs a gradient-statistics-based discriminative basis by estimating empirical Fisher matrices and solving a regularized generalized eigenvalue problem, guiding the unlearning signal toward target-specific information rather than shared prompt structures. Compared with eleven white-box and gray-box unlearning baselines, CBD achieves a better unlearning-utility trade-off and its performance varies little across settings. On ToFU forget10, CBD approaches the retrained reference on the forget set while raising model utility to 74.90, about 15% above the second-best baseline. On WMDP, it lowers hazardous-knowledge accuracy to 25.68, near random guessing, while preserving MMLU accuracy of 52.67. Code is at https://github.com/DGL-codes/CBD.

# CBD: API-Only LLM Black-Box Unlearning through Controlled Behavioral Divergence
Source: [https://arxiv.org/html/2606.27683](https://arxiv.org/html/2606.27683)
Zhiqiang Xie, Yijing Lin, Zhipeng Gao, and Dong In Kim

Corresponding author: Yijing Lin\. Zhiqiang Xie, Yijing Lin, and Zhipeng Gao are with the State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications\. Yijing Lin is also with the Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing\. Email: \{xiezhiqiang, yjlin, gaozhipeng\}@bupt\.edu\.cn\. Dong In Kim is with the Department of Electronic and Electrical Engineering, Sungkyunkwan University, Suwon, South Korea\. Email: dongin@skku\.edu\.

This work is supported by the National Natural Science Foundation of China \(62502041, 92467203, 62372050\), the Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, the Beijing Natural Science Foundation \(L251038, L244010\), the CCF\-Huawei Populus Grove Fund \(TC202418\), the Fellowship of China National Postdoctoral Program for Innovative Talents \(BX20240045\), and the China Postdoctoral Science Foundation General Program \(2025M773481\)\.

###### Abstract

Edge devices increasingly invoke large language models \(LLMs\) through API services for context\-aware edge intelligence, while edge\-generated data may be collected to improve foundation LLMs and may introduce sensitive, copyrighted, harmful, or outdated information into model behavior\. Machine unlearning has therefore become a practical way to remove the influence of undesired data without retraining LLMs from scratch\. However, existing LLM unlearning methods still face two key gaps in practical deployment\. The first is how to achieve unlearning under API\-only black\-box access, where target\-model parameters and internal logits are unavailable\. The second is how to preserve retained utility when unlearning target data and retained data share highly similar prompt structures or semantic patterns\. To address these challenges, we propose Controlled Behavioral Divergence \(CBD\), an API\-only black\-box unlearning framework\. Specifically, CBD uses two auxiliary models to create controlled behavioral divergence between retained inputs and unlearning target inputs, converts this divergence into an unlearning\-relevance score, and routes unlearning\-related prompts away from the target LLM\. Furthermore, to improve discrimination accuracy under high similarity between unlearning target data and retained data, CBD constructs a gradient\-statistics\-based discriminative basis by estimating empirical Fisher matrices and solving a regularized generalized eigenvalue problem, guiding the unlearning signal toward target\-specific information rather than shared prompt structures\. Compared with eleven representative white\-box and gray\-box unlearning baselines across multiple benchmark settings, CBD achieves a better unlearning\-utility trade\-off and its performance varies little across hyperparameter settings\. On ToFU forget10, CBD approaches the retrained reference on the forget set while raising model utility to 74\.90, about 15% above the second\-best baseline\. 
On WMDP, it lowers hazardous\-knowledge accuracy to 25\.68, near the random\-guess level, while preserving an MMLU accuracy of 52\.67\. Our code is available at https://github\.com/DGL\-codes/CBD\.

## I Introduction

Large language models are increasingly invoked by edge devices through API services to support context\-aware edge intelligence under resource\-constrained conditions\[[26](https://arxiv.org/html/2606.27683#bib.bib2)\]\. In this workflow, user queries, feedback, and interaction logs generated at the edge may be collected to improve the performance of the foundation LLM in edge\-specific scenarios\. However, such use of edge data may inadvertently introduce sensitive user data, copyrighted materials, and potentially harmful or outdated information into the LLM’s parametric memory\[[3](https://arxiv.org/html/2606.27683#bib.bib3)\]\. Although retraining the foundation LLM from scratch after removing the undesired data provides the most thorough solution, it is computationally prohibitive due to the massive parameter scale and training corpus of modern LLMs\. In this context, machine unlearning has been proposed to selectively remove such influence while preserving model utility on downstream tasks\[[39](https://arxiv.org/html/2606.27683#bib.bib5)\]\.

![Refer to caption](https://arxiv.org/html/2606.27683v1/x1.png)

Figure 1: Architecture comparison of machine unlearning methods for LLMs\.

Machine unlearning for LLMs can be viewed as a multi\-objective optimization problem\[[19](https://arxiv.org/html/2606.27683#bib.bib6)\], as it requires simultaneously removing the influence of target data while preserving model performance on retained data\. Existing approaches fall into two groups, as illustrated in Figure [1](https://arxiv.org/html/2606.27683#S1.F1)\. First, continual\-training\-based unlearning methods employ bidirectional gradient optimization strategies, performing gradient ascent or its variants on the forget set to reverse parameter updates induced by forget data, while applying Kullback\-Leibler \(KL\) divergence or gradient descent on the retain set to preserve model performance on retained data\[[39](https://arxiv.org/html/2606.27683#bib.bib5),[40](https://arxiv.org/html/2606.27683#bib.bib7),[42](https://arxiv.org/html/2606.27683#bib.bib8)\]\. Second, auxiliary\-model\-based methods introduce additional models to calibrate the log probabilities of target models, thereby reducing the output probabilities of forget\-related tokens at lower training cost\. Representative methods include Unlearning from Logit Difference \(ULD\)\[[13](https://arxiv.org/html/2606.27683#bib.bib9)\] and offset unlearning\[[11](https://arxiv.org/html/2606.27683#bib.bib10)\]\. However, current methods have yet to address black\-box LLM unlearning, and they still degrade retained utility when forget and retain data are highly similar\. In this paper, we aim to overcome the following two challenges\.

The first challenge is how to achieve LLM unlearning in a black\-box setting where the target model is exposed only through API calls\. In the edge\-service workflow described above, edge devices and downstream service components usually receive only final responses from the LLM, without access to its parameters, gradients, or internal token probabilities\. This restriction makes white\-box methods that update target\-model parameters\[[39](https://arxiv.org/html/2606.27683#bib.bib5),[40](https://arxiv.org/html/2606.27683#bib.bib7),[42](https://arxiv.org/html/2606.27683#bib.bib8)\] and gray\-box methods that correct target\-model logits\[[13](https://arxiv.org/html/2606.27683#bib.bib9),[11](https://arxiv.org/html/2606.27683#bib.bib10)\] inapplicable\. Beyond the interface limitation, updating or fine\-tuning a deployed foundation LLM often involves validation, approval, and redeployment procedures, which makes it difficult to satisfy time\-sensitive data removal requests\. Therefore, API\-only edge LLM services require an unlearning mechanism that can reduce the exposure of undesired data without retraining or editing the target model, or inspecting its internal states\.

The second challenge is how to preserve retained utility while unlearning target data\. In many LLM unlearning scenarios, the data requested for unlearning and the retained data may share similar topics, prompt templates, or semantic structures, differing only in specific entities or attributes\. As illustrated in Table[II](https://arxiv.org/html/2606.27683#S5.T2), such high structural similarity makes unlearning targets and retained samples difficult to separate, so updates intended to suppress the target data may also alter the model behavior associated with retained data\[[39](https://arxiv.org/html/2606.27683#bib.bib5),[40](https://arxiv.org/html/2606.27683#bib.bib7),[42](https://arxiv.org/html/2606.27683#bib.bib8)\]\. As the unlearning set grows, the unlearning algorithm may increasingly attribute the removal target to shared prompt templates rather than to target\-specific content, degrading performance on retained samples with similar prompt structures\. Therefore, effective LLM unlearning requires a mechanism that separates target\-specific content from shared prompt structures, thereby preserving retained utility during the unlearning process\.

To address the above challenges, we propose Controlled Behavioral Divergence \(CBD\), a black\-box unlearning framework that operates in a fully API\-only setting\. CBD trains a probe model against a frozen reference model and estimates the relevance of an input prompt to the unlearning target from the behavioral divergence between the two models\. Inspired by Gradient Projection Memory\[[29](https://arxiv.org/html/2606.27683#bib.bib59)\], we constrain the probe update so that the two auxiliary models remain close on retained inputs while becoming distinguishable on unlearning target inputs, and the resulting unlearning\-relevance score routes unlearning\-related prompts away from the target LLM\. Because a retain\-only constraint becomes unreliable when unlearning target data and retained data are highly similar, CBD further estimates unlearning\-side and retain\-side empirical Fisher matrices from sample gradients, formulates direction selection as a regularized generalized eigenvalue problem, and restricts the probe update to directions that induce strong behavioral change on target data at limited cost on retained data\. The main contributions of this paper are summarized as follows\.

- We propose CBD, an API\-only black\-box unlearning framework that reduces the exposure of the data requested for removal without accessing target\-model parameters, gradients, or internal logits\.
- We design a dual\-auxiliary\-model mechanism that uses controlled behavioral divergence to identify unlearning\-related prompts and route them away from the target LLM\.
- We develop a gradient\-statistics\-based discriminative basis extraction method that improves discrimination accuracy under high similarity between unlearning target data and retained data, thereby preserving retained utility\.
- We evaluate CBD on benchmark unlearning datasets against white\-box and gray\-box baselines, showing an improved unlearning\-utility trade\-off under the API\-only black\-box setting\.

The remainder of this paper is organized as follows\. Section [II](https://arxiv.org/html/2606.27683#S2) reviews related work, and Section [III](https://arxiv.org/html/2606.27683#S3) revisits the white\-box and gray\-box paradigms and formalizes the API\-only setting\. Section [IV](https://arxiv.org/html/2606.27683#S4) presents the query routing procedure of CBD, and Section [V](https://arxiv.org/html/2606.27683#S5) develops the discriminative Fisher basis for the high\-similarity regime\. Section [VI](https://arxiv.org/html/2606.27683#S6) reports the experiments, and Section [VII](https://arxiv.org/html/2606.27683#S7) concludes the paper\.

## II Related Work

Machine unlearning aims to remove the influence of designated data from a trained model so that the resulting model behaves as if the data had never been used during training\[[2](https://arxiv.org/html/2606.27683#bib.bib11)\]\. For LLMs, this goal is difficult to achieve by exact retraining because the original corpus is large, the training cost is high, and the learned representations are highly entangled\[[39](https://arxiv.org/html/2606.27683#bib.bib5),[40](https://arxiv.org/html/2606.27683#bib.bib7),[19](https://arxiv.org/html/2606.27683#bib.bib6)\]\. Existing LLM unlearning methods can be organized by the interface they assume, where the first line directly updates the target LLM and the second line avoids full target retraining by using auxiliary models, logits, prompts, embeddings, or in\-context data at inference time\. CBD is closest to the second line in motivation, but targets a stricter API\-only setting in which the target LLM provides only final responses and is not logit\-observable\.

### II\-A White\-Box Machine Unlearning

White\-box LLM unlearning directly changes the target model after an unlearning request arrives\. A common starting point is to reverse the training signal on the forget data\. Jang et al\.\[[12](https://arxiv.org/html/2606.27683#bib.bib19)\] showed that gradient ascent on target token sequences can reduce memorized private knowledge, which made direct loss maximization a standard baseline\. Chen and Yang\[[4](https://arxiv.org/html/2606.27683#bib.bib20)\] introduced lightweight unlearning layers and selective teacher\-student objectives to reduce the cost of full\-model updates\. Yao et al\.\[[39](https://arxiv.org/html/2606.27683#bib.bib5)\] further benchmarked several first\-order unlearning strategies for pre\-trained LLMs\. Related studies formulate deletion objectives for sensitive information extraction attacks\[[24](https://arxiv.org/html/2606.27683#bib.bib53)\] and practical knowledge unlearning settings\[[31](https://arxiv.org/html/2606.27683#bib.bib47)\], showing that the unlearning\-utility trade\-off is sensitive to the objective, data split, and hyperparameters\.

Subsequent work improves this direct\-update paradigm by changing the optimization objective or by selecting more targeted update components\. Large Language Model Unlearning\[[40](https://arxiv.org/html/2606.27683#bib.bib7)\] treats the problem as a broader LLM safety task and examines the conflict between removing specific knowledge and preserving general capability\. Negative Preference Optimization \(NPO\)\[[42](https://arxiv.org/html/2606.27683#bib.bib8)\] replaces pure ascent with a preference\-style objective, and later work revisits this objective to reduce reference\-model bias and simplify the update rule\[[6](https://arxiv.org/html/2606.27683#bib.bib67)\]\. Other methods use second\-order information\[[15](https://arxiv.org/html/2606.27683#bib.bib54)\], strategic weight attribution\[[14](https://arxiv.org/html/2606.27683#bib.bib66)\], uniform\-target self\-distillation\[[34](https://arxiv.org/html/2606.27683#bib.bib30)\], or general enhancement frameworks for fine\-tuning\-based unlearning\[[28](https://arxiv.org/html/2606.27683#bib.bib29)\]\. Continual and domain\-specific variants further study repeated requests\[[7](https://arxiv.org/html/2606.27683#bib.bib55)\], copyright removal\[[5](https://arxiv.org/html/2606.27683#bib.bib35)\], stealthy knowledge concealment\[[8](https://arxiv.org/html/2606.27683#bib.bib33)\], selective token\-level forgetting\[[35](https://arxiv.org/html/2606.27683#bib.bib36)\], self\-generated forget data\[[38](https://arxiv.org/html/2606.27683#bib.bib37)\], and belief\-space rectification for reasoning\[[23](https://arxiv.org/html/2606.27683#bib.bib34)\]\. These methods also motivate the white\-box baselines used in this paper\. However, they all presuppose white\-box access, since the service provider must be able to update target\-model parameters, compute gradients, and validate the edited model before redeployment\. Such access is unavailable when the LLM is consumed only through an external API\.

### II\-B Auxiliary and Inference\-Time Machine Unlearning

Another line of work reduces or avoids direct target\-model editing by moving the unlearning effect to auxiliary branches or inference\-time mechanisms\. ULD\[[13](https://arxiv.org/html/2606.27683#bib.bib9)\] trains an assistant LLM with reversed forget\-retain objectives and obtains the unlearned distribution through the logit difference between the target model and the assistant model\. Offset unlearning\[[11](https://arxiv.org/html/2606.27683#bib.bib10)\] learns a logit offset from a pair of smaller models and transfers the correction to black\-box LLM services\. Both are more deployment\-oriented than direct retraining, but their unlearning effect is still expressed through correction distributions or offsets learned from auxiliary logits, rather than through a pure query\-response routing mechanism\.

Other methods operate on the prompt or context instead\. In\-context unlearning\[[25](https://arxiv.org/html/2606.27683#bib.bib56)\] shows that selected examples with altered labels can induce instance\-level forgetting without parameter updates, and later work shows that such in\-context control may conceal rather than truly remove knowledge\[[30](https://arxiv.org/html/2606.27683#bib.bib32)\]\. Soft Prompting for Unlearning \(SPUL\)\[[1](https://arxiv.org/html/2606.27683#bib.bib57)\] learns prompt tokens that are prepended to the input to enforce forgetting and preserve retained utility\. ECO prompts\[[18](https://arxiv.org/html/2606.27683#bib.bib49)\] use a classifier to identify prompts in the scope of unlearning and then apply learned corruptions in the prompt\-embedding space\. Fast exact unlearning for in\-context learning data\[[22](https://arxiv.org/html/2606.27683#bib.bib58)\] studies removal from an external in\-context data mechanism rather than from model weights\. These methods show that inference\-time control can replace retraining when retraining is impractical, but they still rely on structured prompt control, soft prompts, embedding manipulation, or an explicit in\-context data store\.

CBD differs from the above approaches in two aspects\. First, it does not edit the target LLM and does not require target\-model logits, gradients, embeddings, or soft\-prompt access, since it only uses auxiliary models to score incoming queries and route unlearning\-related ones away from the target API\. Second, CBD explicitly addresses the retained\-utility problem that arises when forget and retain data share similar prompt structures, designing its discriminative basis so that the auxiliary behavioral divergence reflects target\-specific information rather than shared templates\. This combination of API\-only access and retain\-aware discrimination is the gap addressed in this paper\.

## III Preliminaries and Framework

We first review the two mainstream unlearning paradigms and clarify why the API\-only setting requires a different solution\. We use $M$ to denote the target LLM, $M_{\mathrm{ref}}$ and $M_{\mathrm{pro}}$ to denote the reference and probe models, $D_f$ and $D_r$ to denote the forget and retain sets, and $w_0$ and $w$ to denote the initial and current trainable auxiliary parameters\. The main symbols are summarized in Table [I](https://arxiv.org/html/2606.27683#S3.T1)\.

TABLE I: Key notations\.

### III\-A White\-Box and Gray\-Box Unlearning Methods

White\-box methods update the already trained target model with parameters $\theta$ directly\. The simplest strategy is gradient ascent \(GA\), which increases the answer\-side loss on the forget set to push the model away from the data requested for removal\[[12](https://arxiv.org/html/2606.27683#bib.bib19),[39](https://arxiv.org/html/2606.27683#bib.bib5)\]\. However, pure ascent often degrades performance on retained data severely\. To preserve retained performance while forgetting, most white\-box methods therefore combine a forget loss with a retain loss into a single objective,

$$\mathcal{L}_{\mathrm{wb}}(\theta)=\mathcal{L}_{f}^{\mathrm{wb}}(\theta,D_{f})+\lambda\,\mathcal{L}_{r}^{\mathrm{wb}}(\theta,D_{r}),\tag{1}$$

where the forget loss $\mathcal{L}_{f}^{\mathrm{wb}}$ drives the model away from the forget data, the retain loss $\mathcal{L}_{r}^{\mathrm{wb}}$ preserves behavior on retained data, and $\lambda$ trades off the two\. Beyond GA, the forget loss $\mathcal{L}_{f}^{\mathrm{wb}}$ can be instantiated by NPO, which mitigates the utility loss of pure ascent\[[42](https://arxiv.org/html/2606.27683#bib.bib8)\], by a preference objective such as Direct Preference Optimization \(DPO\)\[[27](https://arxiv.org/html/2606.27683#bib.bib13)\], or by other targeted update rules\[[4](https://arxiv.org/html/2606.27683#bib.bib20),[16](https://arxiv.org/html/2606.27683#bib.bib21),[37](https://arxiv.org/html/2606.27683#bib.bib14)\]\. The retain loss $\mathcal{L}_{r}^{\mathrm{wb}}$ is typically a gradient\-descent term on retained data, giving the \+GD variants, or a KL term that keeps the output distribution close to that of the original model, giving the \+KL variants\[[39](https://arxiv.org/html/2606.27683#bib.bib5)\]\. All of these methods, however, require the target LLM to remain editable throughout unlearning, so this entire paradigm becomes unavailable once the target model is exposed only through an API\.

Gray\-box methods avoid editing the target model but still rely on its token\-level logits\. ULD trains an assistant model with reversed unlearning objectives so that the assistant concentrates the behavior to be removed, and subtracts the assistant logits $z_{\mathrm{ast}}(x)$ from the target logits $z_{M}(x)$\[[13](https://arxiv.org/html/2606.27683#bib.bib9)\],

$$\widetilde{z}(x)=z_{M}(x)-\alpha z_{\mathrm{ast}}(x),\tag{2}$$

where $\alpha$ is a scaling coefficient, so tokens related to the data requested for removal become less likely to be selected during decoding\. Offset unlearning maintains a pair of smaller auxiliary models and injects their logit difference as a transferable correction\[[11](https://arxiv.org/html/2606.27683#bib.bib10)\],

$$\widetilde{z}(x)=z_{M}(x)+\alpha\bigl(z_{\mathrm{ref}}(x)-z_{\mathrm{upd}}(x)\bigr),\tag{3}$$

where $z_{\mathrm{ref}}(x)$ and $z_{\mathrm{upd}}(x)$ are the logits of the frozen and the unlearning\-updated auxiliary models, so the correction shifts the target distribution away from the behavior learned by the updated auxiliary branch\. Both corrections exist only at the logit interface, so once the target LLM returns final text responses alone, neither \([2](https://arxiv.org/html/2606.27683#S3.E2)\) nor \([3](https://arxiv.org/html/2606.27683#S3.E3)\) can be applied\. The next subsection formalizes this API\-only setting\.
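To make the two gray\-box corrections concrete, the following minimal Python sketch applies \(2\) and \(3\) to a toy three\-token vocabulary\. The logit values, the choice $\alpha=1$, and the function names `uld_correct` and `offset_correct` are illustrative assumptions, not taken from the ULD or offset\-unlearning implementations\.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def uld_correct(z_target, z_assistant, alpha=1.0):
    """Eq. (2): subtract the scaled assistant logits from the target logits."""
    return [zt - alpha * za for zt, za in zip(z_target, z_assistant)]

def offset_correct(z_target, z_ref, z_upd, alpha=1.0):
    """Eq. (3): add the scaled logit difference of the auxiliary pair."""
    return [zt + alpha * (zr - zu) for zt, zr, zu in zip(z_target, z_ref, z_upd)]

# Toy values (assumed): token 0 stands in for a forget-related token.
z_target = [2.0, 1.0, 0.5]
z_assistant = [3.0, 0.0, 0.0]  # assistant concentrates on token 0
z_ref = [1.0, 1.0, 1.0]        # frozen auxiliary model
z_upd = [2.5, 1.0, 1.0]        # updated auxiliary model favors token 0

p_before = softmax(z_target)[0]
p_uld = softmax(uld_correct(z_target, z_assistant))[0]
p_offset = softmax(offset_correct(z_target, z_ref, z_upd))[0]
print(p_before, p_uld, p_offset)
```

Both corrections take the target logits `z_target` as input, which is exactly the quantity an API\-only service does not expose\.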

### III\-B API\-Only Scenario and Proposed Framework

![Refer to caption](https://arxiv.org/html/2606.27683v1/x2.png)

Figure 2: Framework of Controlled Behavioral Divergence \(CBD\) for black\-box unlearning\.

We consider an API\-only deployment in which the target LLM $M$ is accessible only through query\-response interaction\. The unlearning request specifies a forget set $D_f=\{(x_j,y_j)\}_{j=1}^{N_f}$ and a retain set $D_r=\{(x_i,y_i)\}_{i=1}^{N_r}$\. Under this restriction, the two conventional paradigms described above are both unavailable\. White\-box methods cannot be used because the target model cannot be edited directly, and gray\-box methods cannot be used because the internal logit interface does not exist\. Consequently, the problem is no longer how to overwrite target\-model parameters, but how to achieve black\-box unlearning using only the API response interface\.

CBD uses three model roles \(as illustrated in Figure [2](https://arxiv.org/html/2606.27683#S3.F2)\)\. The target model $M$ remains unchanged and continues to serve ordinary queries\. A frozen reference model $M_{\mathrm{ref}}$ provides the reference behavior\. A trainable probe model $M_{\mathrm{pro}}$ is optimized to remain close to $M_{\mathrm{ref}}$ on retained inputs but separate from $M_{\mathrm{ref}}$ on forget inputs\. Note that $M_{\mathrm{pro}}$ is not trained as a replacement for the target LLM but works with $M_{\mathrm{ref}}$ to score whether an input is associated with the forget set\. Both auxiliary models are initialized from a small base model that has never been trained on $D_f$, so the responses of $M_{\mathrm{ref}}$ are less likely to reproduce the forget data directly, although this does not guarantee that $M_{\mathrm{ref}}$ is free of related knowledge acquired during its own pretraining\. CBD thus reduces the exposure of the data requested for removal at the service interface rather than erasing it from the target parameters, assuming the routing layer mediates all access to the target LLM\.

The procedure consists of four steps\. We first train $M_{\mathrm{pro}}$ under forget and retain objectives, then convert the divergence between $M_{\mathrm{ref}}$ and $M_{\mathrm{pro}}$ into an unlearning\-relevance score $s(x)$, calibrate a threshold $\tau$ on validation forget and retain samples, and finally route each query according to

$$\mathrm{CBD}(x)=\begin{cases}M_{\mathrm{ref}}(x),&s(x)>\tau,\\ M(x),&s(x)\leq\tau.\end{cases}\tag{4}$$

The remaining question is how to make this routing rule reliable when forget and retain data are highly similar, because weak separation directly undermines black\-box unlearning\. Section [IV](https://arxiv.org/html/2606.27683#S4) defines the CBD procedure from probe training to query routing\. Section [V](https://arxiv.org/html/2606.27683#S5) defines the discriminative basis used by the projected probe update\.
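The routing rule in \(4\) and the threshold calibration step can be sketched in a few lines of Python\. This is only an illustration under stated assumptions: the score $s(x)$ is replaced by a hypothetical lookup table, the two models are stand\-in callables, and the calibration criterion \(balanced accuracy over candidate thresholds\) is an assumed choice, since this section does not specify how $\tau$ is selected\.

```python
def route(x, score, tau, reference, target):
    """Eq. (4): send x to the reference model if s(x) > tau, else to the target."""
    return reference(x) if score(x) > tau else target(x)

def calibrate_threshold(forget_scores, retain_scores):
    """Pick tau maximizing balanced accuracy on validation scores.

    Assumed criterion for illustration; the first best candidate is kept.
    """
    candidates = sorted(set(forget_scores) | set(retain_scores))
    best_tau, best_acc = candidates[0], -1.0
    for tau in candidates:
        # Forget samples should score above tau, retain samples at or below it.
        tpr = sum(s > tau for s in forget_scores) / len(forget_scores)
        tnr = sum(s <= tau for s in retain_scores) / len(retain_scores)
        acc = 0.5 * (tpr + tnr)
        if acc > best_acc:
            best_tau, best_acc = tau, acc
    return best_tau

# Hypothetical validation scores and stand-in models.
tau = calibrate_threshold([0.9, 0.8, 0.7], [0.1, 0.2, 0.3])
toy_scores = {"forget-like prompt": 0.85, "ordinary prompt": 0.15}
reference_model = lambda x: "ref:" + x      # plays the role of M_ref
target_api = lambda x: "target:" + x        # plays the role of the target LLM API

routed_forget = route("forget-like prompt", toy_scores.get, tau, reference_model, target_api)
routed_plain = route("ordinary prompt", toy_scores.get, tau, reference_model, target_api)
print(tau, routed_forget, routed_plain)
```

Note that a query is forwarded to the target API only when its score does not exceed $\tau$, so the quality of the score directly determines both forgetting and retained utility\.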

## IV Behavioral\-Divergence\-Based Query Routing

Since an API\-only LLM service exposes only final responses, unlearning methods that edit target\-model parameters or correct target logits cannot be applied directly\. To address this limitation, CBD adopts a dual\-auxiliary\-model mechanism for black\-box unlearning, which uses the behavioral divergence between a frozen reference model $M_{\mathrm{ref}}$ and a trainable probe model $M_{\mathrm{pro}}$ to identify unlearning\-related queries during inference\. Based on the resulting unlearning\-relevance score, CBD routes unlearning\-related queries to $M_{\mathrm{ref}}$ and other queries to the target LLM $M$, thereby reducing the exposure of the data requested for removal without changing the target model\. When forget and retain inputs are highly similar, the basic divergence signal may become insufficiently separable, which motivates the discriminative Fisher basis developed in Section [V](https://arxiv.org/html/2606.27683#S5)\.

### IV\-A Dual Auxiliary Models and Probe Objective

The reference model $M_{\mathrm{ref}}$ is kept fixed as the behavioral reference, whereas the probe model $M_{\mathrm{pro}}$ is updated so that forget\-related inputs become more distinguishable from retained inputs when compared with $M_{\mathrm{ref}}$\. To keep the auxiliary computation lightweight, both models are implemented with LoRA adapters\[[10](https://arxiv.org/html/2606.27683#bib.bib16)\], while only the probe\-side adapters are updated during training\. For a linear layer with frozen weight matrix $W_0\in\mathbb{R}^{d_{\mathrm{out}}\times d_{\mathrm{in}}}$, the LoRA parameterization is

$$W=W_{0}+BA,\tag{5}$$

where $A\in\mathbb{R}^{r\times d_{\mathrm{in}}}$ and $B\in\mathbb{R}^{d_{\mathrm{out}}\times r}$ are trainable low\-rank factors with $r\ll\min(d_{\mathrm{in}},d_{\mathrm{out}})$\. The base weight $W_0$ is shared across both auxiliary models, while the trainable parameters $w=\{A_m,B_m\}_m$ distinguish them\. Let $w_0$ denote the initial trainable parameters inherited from the base model\. The reference model $M_{\mathrm{ref}}$ keeps $w_0$ unchanged, whereas the probe model $M_{\mathrm{pro}}$ updates $w$ through a controlled training process\.
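As a small illustration of this parameterization, the sketch below evaluates a single LoRA layer in pure Python, adding the low\-rank product $BA$ to a shared frozen weight in the same form as \(10\)\. The dimensions, the numeric values, and the zero initialization of the reference\-side $B$ are assumptions chosen for the example, not details from the paper\.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def lora_forward(W0, A, B, x):
    """Compute (W0 + B A) x as W0 x + B (A x), without forming B A."""
    base = matvec(W0, x)
    low_rank = matvec(B, matvec(A, x))
    return [b + l for b, l in zip(base, low_rank)]

# Shared frozen base weight (d_out = 2, d_in = 3) and rank r = 1.
W0 = [[1, 0, 0],
      [0, 1, 0]]
A = [[1.0, 1.0, 1.0]]          # shape r x d_in
B_ref = [[0.0], [0.0]]         # reference keeps its initial factors (assumed zero)
B_pro = [[0.5], [-0.5]]        # probe factors after some hypothetical updates

x = [1.0, 2.0, 3.0]
y_ref = lora_forward(W0, A, B_ref, x)
y_pro = lora_forward(W0, A, B_pro, x)
print(y_ref, y_pro)
```

Only the small factors differ between the two auxiliary models, so storing both costs little more than storing the base model once\.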

All subsequent geometric quantities, including gradients, Fisher matrices, and subspace bases, are defined in this trainable parameter subspace $w\in\mathbb{R}^{d_w}$ rather than in the full parameter space of the target LLM\. In our implementation, basis extraction and projection act on the LoRA factors of the up\-projection layers, which in our experiments sufficed to separate forget inputs from retain inputs\.

For a supervised sample $(x,y)$, we define the answer\-side negative log\-likelihood as

$$\ell(w,x,y)=-\sum_{t\in\mathrm{ans}}\log p_{w}(y_{t}\mid x,y_{<t}),\tag{6}$$

where $p_{w}(\cdot\mid x)$ is the output distribution of the probe model and the summation is over answer\-token positions\. On the forget side, we optimize

$$\mathcal{L}_{f}(w)=\mathbb{E}_{(x,y)\sim D_{f}}[\ell(w,x,y)],\tag{7}$$

which increases the behavioral separation between $M_{\mathrm{pro}}$ and $M_{\mathrm{ref}}$ on the forget set\. On the retain side, we align the predictive trajectory of $M_{\mathrm{pro}}$ to that of $M_{\mathrm{ref}}$ through a sequence\-level KL divergence\. Let $z=[x,y]$ denote the full retain sequence of length $T$ and let $h_t=z_{<t}$ be its prefix at position $t$\. The retain\-side objective is

$$\mathcal{L}_{r}(w)=\mathbb{E}_{(x,y)\sim D_{r}}\left[\frac{1}{T}\sum_{t=1}^{T}\mathrm{KL}\!\left(p_{w_{0}}(\cdot\mid h_{t})\,\|\,p_{w}(\cdot\mid h_{t})\right)\right],\tag{8}$$

which acts on the full token\-level predictive path rather than only on answer tokens\. This term prevents the probe model from drifting away from the reference behavior on retained queries\. The total probe\-training objective is

ℒ​\(w\)=ℒf​\(w\)\+β​ℒr​\(w\),\\mathcal\{L\}\(w\)=\\mathcal\{L\}\_\{f\}\(w\)\+\\beta\\mathcal\{L\}\_\{r\}\(w\),\(9\)whereβ\\betabalances separation on the forget set against preservation on retained data\.
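
The objectives in (6) to (9) can be made concrete with a small NumPy sketch. It is illustrative only: it assumes the per-position logits of the probe and reference models are already available as arrays of shape `(T, vocab)`, and the function names (`forget_nll`, `retain_kl`, `probe_objective`) are hypothetical rather than taken from the released code.

```python
import numpy as np

def log_softmax(logits):
    """Row-wise log-softmax, stabilized by subtracting the row maximum."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def forget_nll(logits, targets, answer_mask):
    """Eq. (6): negative log-likelihood summed over answer-token positions."""
    logp = log_softmax(logits)
    token_logp = logp[np.arange(len(targets)), targets]
    return float(-(token_logp * answer_mask).sum())

def retain_kl(ref_logits, probe_logits):
    """Eq. (8) for one sequence: mean over positions of KL(p_w0 || p_w)."""
    log_ref = log_softmax(ref_logits)
    log_pro = log_softmax(probe_logits)
    kl_per_position = (np.exp(log_ref) * (log_ref - log_pro)).sum(axis=-1)
    return float(kl_per_position.mean())

def probe_objective(forget_batch, retain_batch, beta):
    """Eq. (9): mini-batch estimate of L_f + beta * L_r.

    forget_batch: list of (logits, targets, answer_mask) tuples.
    retain_batch: list of (ref_logits, probe_logits) tuples.
    """
    l_f = np.mean([forget_nll(*sample) for sample in forget_batch])
    l_r = np.mean([retain_kl(*sample) for sample in retain_batch])
    return float(l_f + beta * l_r)
```

The mask restricts the forget term to answer tokens, while the retain term averages over every position of the sequence, mirroring the distinction drawn between (6) and (8).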

### IV-B Activation-Informed Retain Basis as an Initial Approximation

The objectives above specify how the probe model should move, but they do not yet determine which update directions should be protected so that behavioral divergence appears mainly on the forget set. CBD does not aim to move $M_{\mathrm{pro}}$ uniformly across all inputs. Instead, we want $M_{\mathrm{pro}}$ to separate from $M_{\mathrm{ref}}$ on forget inputs while staying close to $M_{\mathrm{ref}}$ on retained behavior. Inspired by Gradient Projection Memory (GPM) for continual learning [[29](https://arxiv.org/html/2606.27683#bib.bib59)], we therefore begin with a retain-side activation basis as an initial approximation to the directions that should not be disturbed.

Consider the forward computation at layer $m$ under LoRA parameterization,

$$y_{m}=(W_{m}+B_{m}A_{m})x_{m-1},\tag{10}$$

where $W_{m}$ is the frozen base weight, $A_{m}$ and $B_{m}$ are the trainable LoRA factors, and $x_{m-1}$ is the input activation to layer $m$. To make the dependence on the activation explicit, we use a squared-error surrogate at layer $m$,

$$L_{m}=\frac{1}{2}\|y_{m}-y_{m}^{\star}\|_{2}^{2}=\frac{1}{2}\|(W_{m}+B_{m}A_{m})x_{m-1}-y_{m}^{\star}\|_{2}^{2},\tag{11}$$

where $y_{m}^{\star}$ denotes the target output. The same dependence on $x_{m-1}$ also holds for the standard cross-entropy loss used in language modeling. Let $\delta_{m}=(W_{m}+B_{m}A_{m})x_{m-1}-y_{m}^{\star}$ denote the output error. Then the LoRA gradients are

$$\frac{\partial L_{m}}{\partial A_{m}}=B_{m}^{\top}\delta_{m}x_{m-1}^{\top},\qquad\frac{\partial L_{m}}{\partial B_{m}}=\delta_{m}(A_{m}x_{m-1})^{\top}.\tag{12}$$

These expressions show that the update directions are determined by the input activation $x_{m-1}$ and its linear transforms. Dominant activation directions that repeatedly appear across retain samples therefore form an initial approximation to the directions along which retained behavior is most sensitive.
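
As a sanity check on (12), the closed-form LoRA gradients of the squared-error surrogate can be written out directly. The sketch below uses toy dimensions and hypothetical function names, and it is only meant to show that every row of the gradient with respect to $A_m$ is a multiple of the input activation.

```python
import numpy as np

def surrogate_loss(W, A, B, x, y_star):
    """Eq. (11): squared-error surrogate for one layer and one input."""
    return 0.5 * np.sum(((W + B @ A) @ x - y_star) ** 2)

def lora_surrogate_grads(W, A, B, x, y_star):
    """Eq. (12): closed-form gradients of the surrogate w.r.t. A and B."""
    delta = (W + B @ A) @ x - y_star      # output error delta_m
    grad_A = np.outer(B.T @ delta, x)     # B^T delta x^T, shape (r, d_in)
    grad_B = np.outer(delta, A @ x)       # delta (A x)^T, shape (d_out, r)
    return grad_A, grad_B
```

Because `grad_A` is an outer product with `x`, its rows all lie along the activation direction, which is why a basis built from retain activations can be applied to this gradient by right multiplication.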

Using retain samples, we collect the layer-$m$ activation representations

$$R_{m}=[x_{m-1,1},x_{m-1,2},\dots,x_{m-1,n_{r}}]\in\mathbb{R}^{d_{m-1}\times n_{r}},\tag{13}$$

where $x_{m-1,i}$ is the input activation of the $i$-th retain sample at layer $m$, $n_{r}$ is the number of collected retain activations, and $d_{m-1}$ is the activation dimension. We compute the singular value decomposition

$$R_{m}=U_{m}\Sigma_{m}V_{m}^{\top},\tag{14}$$

and choose the smallest rank $k_{m}$ such that

$$\|(R_{m})_{k_{m}}\|_{F}^{2}\geq\epsilon_{m}\|R_{m}\|_{F}^{2},\tag{15}$$

where $(R_{m})_{k_{m}}$ denotes the rank-$k_{m}$ approximation and $\epsilon_{m}$ is the retained-energy threshold. The leading left singular vectors form the retain-side activation basis

$$U_{r,m}=[u_{m,1},u_{m,2},\dots,u_{m,k_{m}}],\tag{16}$$

$$\mathcal{S}_{r,m}=\mathrm{span}\{u_{m,1},u_{m,2},\dots,u_{m,k_{m}}\}.\tag{17}$$

Each $u_{m,i}$ is a direction in the feature space, so it describes a co-activation pattern that repeatedly appears across retain samples, whereas the right singular vectors in $V_{m}$ describe how individual samples combine along those patterns. Since the purpose of projection is to preserve stable retain-side activation directions rather than sample-specific identities, the left singular vectors provide the appropriate basis.
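
The rank selection in (14) to (16) can be sketched in a few lines of NumPy. This is a minimal illustration under the assumption that the retain activations of one layer have been collected column-wise into a matrix `R`; the function name is hypothetical.

```python
import numpy as np

def retain_activation_basis(R, energy):
    """Leading left singular vectors of R (shape d x n) whose rank-k
    approximation keeps at least `energy` of the squared Frobenius norm."""
    U, s, _ = np.linalg.svd(R, full_matrices=False)
    # ||(R)_k||_F^2 equals the sum of the k largest squared singular values.
    cumulative = np.cumsum(s**2) / np.sum(s**2)
    # Smallest k satisfying Eq. (15); capped in case of round-off near energy = 1.
    k = min(int(np.searchsorted(cumulative, energy)) + 1, len(s))
    return U[:, :k]
```

The identity used in the comment, that the squared Frobenius norm of the rank-$k$ truncation is the sum of the $k$ largest squared singular values, is what lets the threshold be checked from the singular values alone.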

Since the rows of $\partial L_{m}/\partial A_{m}$ in ([12](https://arxiv.org/html/2606.27683#S4.E12)) lie in the layer-$m$ activation space, each layer-wise basis projects the gradient by right multiplication with $I-U_{r,m}U_{r,m}^{\top}$. Stacking the layer-wise retain projectors $U_{r,m}U_{r,m}^{\top}$ over the vectorized trainable parameters yields a block-diagonal orthogonal projector $U_{r}U_{r}^{\top}$ on $\mathbb{R}^{d_{w}}$, where $U_{r}$ is the global retain basis. The retain-only orthogonal projection then removes forget-side motion along retain-sensitive directions,

$$\bar{g}_{f}^{(r)}=g_{f}-U_{r}U_{r}^{\top}g_{f},\tag{18}$$

where $g_{f}=\nabla_{w}\mathcal{L}_{f}(w)$ is the forget-side gradient. This update keeps the forget-induced change away from directions that encode retained behavior. Writing $g_{r}=\nabla_{w}\mathcal{L}_{r}(w)$ for the retain-side gradient, the corresponding base training direction is

$$g^{(r)}=\bar{g}_{f}^{(r)}+\beta g_{r},\tag{19}$$

and the probe parameters are updated by

$$w\leftarrow w-\eta g^{(r)}.\tag{20}$$

When forget-retain coupling is weak, as on the smallest split of the Task of Fictitious Unlearning (ToFU) benchmark, the projected update already yields useful separation between forget-like and retained inputs. On that split, this base variant still attains a routing area under the receiver operating characteristic (ROC) curve (AUC) of 0.9559 (Section [VI](https://arxiv.org/html/2606.27683#S6)). The forget-side term avoids retain-sensitive directions, whereas the retain-side term keeps the probe model aligned with the reference behavior.
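
For a single layer, the complement projection and the update in (18) to (20) reduce to a right multiplication on the gradient of the LoRA factor $A_m$. The following sketch assumes an orthonormal layer-wise basis `U_rm` and uses hypothetical function names; it shows one step for one layer, not the full block-diagonal implementation.

```python
import numpy as np

def project_layer_grad(grad_A, U_rm):
    """Right-multiply by (I - U U^T): each row of grad_A loses its
    component in span(U_rm), the layer-wise form of Eq. (18)."""
    return grad_A - (grad_A @ U_rm) @ U_rm.T

def base_step(A, grad_f_A, grad_r_A, U_rm, beta, eta):
    """Eqs. (19)-(20) for one layer: projected forget gradient plus the
    weighted retain gradient, followed by a plain gradient step."""
    g = project_layer_grad(grad_f_A, U_rm) + beta * grad_r_A
    return A - eta * g
```

With `beta = 0`, the change in `A` has no component along the retain basis, which is the sense in which the forget-side term avoids retain-sensitive directions.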

### IV-C Behavioral Divergence Score and Routing Rule

After probe training, we convert the behavioral divergence between $M_{\mathrm{ref}}$ and $M_{\mathrm{pro}}$ into an unlearning-relevance score $s(x)$, where a larger value indicates stronger relevance to the forget set. The divergence is quantified using symmetric KL divergence at the token level. For an input $x$, let $p_{\mathrm{ref}}^{(t)}(\cdot\mid x)$ and $p_{\mathrm{pro}}^{(t)}(\cdot\mid x)$ denote the token distributions of $M_{\mathrm{ref}}$ and $M_{\mathrm{pro}}$ at position $t$. The local divergence is

$$d_{t}(x)=\frac{1}{2}\left[\mathrm{KL}(p_{\mathrm{ref}}^{(t)}\|p_{\mathrm{pro}}^{(t)})+\mathrm{KL}(p_{\mathrm{pro}}^{(t)}\|p_{\mathrm{ref}}^{(t)})\right].\tag{21}$$

The symmetric form makes the score independent of the order of the two auxiliary models. The score is obtained by aggregating the local divergences over a set of aligned positions $\mathcal{T}(x)$:

$$s(x)=\frac{1}{|\mathcal{T}(x)|}\sum_{t\in\mathcal{T}(x)}d_{t}(x).\tag{22}$$

The choice of $\mathcal{T}(x)$ depends on the task structure. For generative tasks, the score is computed on a fixed decoding path shared by the two auxiliary models. For multiple-choice tasks, it is computed on prompt-side positions near the answer-option token.
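
The score in (21) and (22) can be illustrated as follows. The sketch assumes that the token distributions of both auxiliary models are available as arrays of shape `(T, vocab)` and that the aligned positions are passed as an index array; the clipping constant `eps` and the function names are illustrative choices, not part of the paper.

```python
import numpy as np

def symmetric_kl(p, q, eps=1e-12):
    """Eq. (21): per-position symmetric KL between two categorical distributions."""
    p = np.clip(p, eps, None)   # guard against log(0)
    q = np.clip(q, eps, None)
    kl_pq = np.sum(p * np.log(p / q), axis=-1)
    kl_qp = np.sum(q * np.log(q / p), axis=-1)
    return 0.5 * (kl_pq + kl_qp)

def relevance_score(ref_probs, pro_probs, positions):
    """Eq. (22): mean symmetric KL over the aligned positions T(x)."""
    d = symmetric_kl(ref_probs[positions], pro_probs[positions])
    return float(d.mean())
```

Swapping the two model arguments leaves the score unchanged, which is the order-independence property noted after (21).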

Given validation forget and retain sets $D_{f}^{\mathrm{val}}$ and $D_{r}^{\mathrm{val}}$, we calibrate the routing threshold by supervised search,

$$\tau^{\star}=\arg\max_{\tau\in\Gamma}\mathcal{J}(\tau,D_{f}^{\mathrm{val}},D_{r}^{\mathrm{val}}),\tag{23}$$

where $\Gamma$ is the candidate threshold set and $\mathcal{J}$ is a calibration objective defined on the validation samples. During deployment, CBD routes each query according to ([4](https://arxiv.org/html/2606.27683#S3.E4)). Inputs with $s(x)>\tau$ are answered by $M_{\mathrm{ref}}$, while inputs with $s(x)\leq\tau$ proceed to the target LLM $M$. The complete training and deployment workflow is summarized in Algorithm [2](https://arxiv.org/html/2606.27683#alg2) in Section [V](https://arxiv.org/html/2606.27683#S5), whose base variant is obtained by extracting $U_{r}$ and using the complement projection ([18](https://arxiv.org/html/2606.27683#S4.E18)) in the training loop.
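
The threshold search in (23) and the routing rule can be sketched as below. The calibration objective $\mathcal{J}$ is not specified in this section, so the sketch assumes balanced accuracy purely for illustration; the function names and the string labels returned by `route` are likewise hypothetical.

```python
import numpy as np

def balanced_accuracy(tau, forget_scores, retain_scores):
    """Assumed calibration objective J: mean of the forget-side detection
    rate and the retain-side pass-through rate at threshold tau."""
    forget_hit = np.mean(np.asarray(forget_scores) > tau)
    retain_pass = np.mean(np.asarray(retain_scores) <= tau)
    return 0.5 * (forget_hit + retain_pass)

def calibrate_threshold(forget_scores, retain_scores, candidates):
    """Eq. (23): pick the candidate threshold maximizing the objective."""
    values = [balanced_accuracy(t, forget_scores, retain_scores) for t in candidates]
    return float(candidates[int(np.argmax(values))])

def route(score, tau):
    """Routing rule: scores above tau go to the reference model,
    all other queries proceed to the target LLM."""
    return "reference" if score > tau else "target"
```

Any other objective defined on the validation scores, for example one that weights false routing of retained queries more heavily, could be substituted for `balanced_accuracy` without changing the search.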

## V Discriminative Fisher Basis Extraction

Section [IV](https://arxiv.org/html/2606.27683#S4) introduced the retain basis $U_{r}$ as an initial approximation to the directions that should be protected so that the probe model stays close to the reference model on retained behavior. However, when forget and retain samples are highly similar, the retain basis can overlap with directions needed for forget-side separation, so a retain-only projection may become too conservative.

TABLE II: Illustrative forget and retain query pairs with high structural similarity on ToFU. Blue shading marks shared wording, and orange shading marks differing entities or attributes.

![Refer to caption](https://arxiv.org/html/2606.27683v1/x3.png)

Figure 3: Coupled forget-side and retain-side loss trajectories during forget-side optimization. The co-movement indicates that the two subsets share update directions rather than occupying cleanly separable subspaces.

As Table [II](https://arxiv.org/html/2606.27683#S5.T2) shows, forget and retain queries on ToFU often share almost the same question template and differ mainly in the author identity or one narrow attribute. This high overlap means that forget and retain samples can activate highly similar directions. Figure [3](https://arxiv.org/html/2606.27683#S5.F3) shows the resulting behavior. When the gradient-projection training rule is constructed from retain activation representations, the retain-side loss of $M_{\mathrm{pro}}$ still decreases together with the forget-side loss rather than staying nearly constant, which indicates that the two subsets are coupled through overlapping directions in the trainable subspace. Under this regime, a basis extracted only from retain activations tends to overlap with directions that also matter for forget-side separation. Projecting away all retain-associated directions would then suppress not only retain drift, but also part of the useful signal that makes $M_{\mathrm{pro}}$ deviate from $M_{\mathrm{ref}}$ on forget inputs. Below we construct a basis that favors directions improving forget-side separation while keeping retain-side cost small.

### V-A Discriminative Fisher Criterion

We seek update directions that change the probe model more on forget inputs than on retain inputs. To connect this criterion with a computable local geometry, consider a small displacement $\Delta w$ around the reference point $w_{0}$ in the trainable parameter subspace. For a fixed input $x$, define $q(\cdot\mid x)=p_{w_{0}}(\cdot\mid x)$ and $p(\cdot\mid x)=p_{w_{0}+\Delta w}(\cdot\mid x)$. The local output change is measured by $\mathrm{KL}(q\|p)=\mathbb{E}_{y\sim q}[\log q(y\mid x)-\log p(y\mid x)]$. Applying a second-order Taylor expansion to $\log p_{w_{0}+\Delta w}(y\mid x)$ around $w_{0}$ gives

$$\begin{aligned}\log p_{w_{0}+\Delta w}(y\mid x)&\approx\log p_{w_{0}}(y\mid x)\\&\quad+g_{x}(y)^{\top}\Delta w+\frac{1}{2}\Delta w^{\top}H_{x}(y)\Delta w,\end{aligned}$$

where $g_{x}(y)=\nabla_{w}\log p_{w}(y\mid x)\big|_{w_{0}}$ and $H_{x}(y)=\nabla_{w}^{2}\log p_{w}(y\mid x)\big|_{w_{0}}$ denote the per-sample gradient and Hessian of the log-likelihood. Substituting the expansion into the KL divergence gives

$$\mathrm{KL}(q\|p)\approx-\mathbb{E}_{q}[g_{x}(y)]^{\top}\Delta w-\frac{1}{2}\Delta w^{\top}\mathbb{E}_{q}[H_{x}(y)]\Delta w.$$

The first-order term is zero because

$$\mathbb{E}_{q}[g_{x}(y)]=\sum_{y}\nabla_{w}p_{w}(y\mid x)\big|_{w_{0}}=\nabla_{w}\sum_{y}p_{w}(y\mid x)\big|_{w_{0}}=0.$$

Under standard regularity conditions, the Fisher information matrix satisfies

$$F(x)=\mathbb{E}_{q}[g_{x}(y)g_{x}(y)^{\top}]=-\mathbb{E}_{q}[H_{x}(y)].$$

Therefore,

$$\mathrm{KL}\!\left(p_{w_{0}}(\cdot\mid x)\,\|\,p_{w_{0}+\Delta w}(\cdot\mid x)\right)\approx\frac{1}{2}\Delta w^{\top}F(x)\Delta w,\tag{24}$$

where $F(x)$ is the Fisher information matrix induced by input $x$ in the trainable parameter subspace [[21](https://arxiv.org/html/2606.27683#bib.bib17)]. Averaging over the forget and retain sets yields

$$F_{f}=\mathbb{E}_{x\sim D_{f}}[F(x)],\qquad F_{r}=\mathbb{E}_{x\sim D_{r}}[F(x)].\tag{25}$$
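
The local approximation in (24) can be checked numerically on a toy model. The sketch below assumes a softmax-linear model $p_W(y\mid x)=\mathrm{softmax}(Wx)$ as a stand-in for the probe model, with the entries of $W$ playing the role of $w$; it is a hypothetical illustration, not the paper's setup.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fisher_quadratic(W, x, dW):
    """Return 0.5 * dw^T F(x) dw for p_W(y|x) = softmax(W x)."""
    p = softmax(W @ x)
    K = len(p)
    # Score vectors g_x(y) = vec((e_y - p) x^T), one row per class y.
    G = np.stack([np.outer(np.eye(K)[y] - p, x).ravel() for y in range(K)])
    F = (G * p[:, None]).T @ G      # F(x) = E_{y~q}[g g^T]
    dw = dW.ravel()
    return 0.5 * dw @ F @ dw

def kl_after_step(W, x, dW):
    """Exact KL(p_W || p_{W+dW}) for the same input x."""
    q = softmax(W @ x)
    p = softmax((W + dW) @ x)
    return float(np.sum(q * (np.log(q) - np.log(p))))
```

For a small random `dW`, the two functions agree to leading order, and the gap shrinks as the displacement is scaled down, since the neglected terms are of third order in $\Delta w$.
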
The desired update directions should produce large forget-side change while limiting retain-side drift. This requirement can be written as the constrained optimization problem

$$\max_{\Delta w\neq 0}\ \Delta w^{\top}F_{f}\Delta w\quad\text{subject to}\quad\Delta w^{\top}F_{r}\Delta w\leq\rho,\tag{26}$$

where $\rho$ is the allowed retain-side budget. Since the retain Fisher matrix is estimated from finitely many samples and is therefore rank-deficient or ill-conditioned in practice, we damp the retain-side term and introduce

$$C=F_{r}+\mu I,\qquad\mu>0,\tag{27}$$

which keeps the constraint matrix positive definite. With this regularization, the direction-selection problem becomes

$$\max_{\Delta w\neq 0}\ \Delta w^{\top}F_{f}\Delta w\quad\text{subject to}\quad\Delta w^{\top}C\Delta w\leq\rho.\tag{28}$$
The regularized problem in ([28](https://arxiv.org/html/2606.27683#S5.E28)) can be read through its Lagrangian form

$$\mathcal{L}_{\mathrm{lag}}(\Delta w,\lambda)=\Delta w^{\top}F_{f}\Delta w-\lambda\bigl(\Delta w^{\top}C\Delta w-\rho\bigr).\tag{29}$$

Setting the derivative with respect to $\Delta w$ to zero gives

$$F_{f}\Delta w=\lambda C\Delta w,\tag{30}$$

so every stationary direction is a generalized eigenvector of the matrix pair $(F_{f},C)$. This derivation also clarifies the role of $\rho$. It controls the admissible retain-side displacement scale rather than the preferred direction itself. Since the basis construction only needs the generalized eigen-directions, Algorithm [1](https://arxiv.org/html/2606.27683#alg1) does not treat $\rho$ as an independent hyperparameter. The eventual update magnitude is controlled later by the learning rate $\eta$ and the retain-side weight $\beta$ in Section [IV](https://arxiv.org/html/2606.27683#S4).

Because the subspace used in training has dimension $k$ rather than one, we consider the natural multi-direction extension of ([28](https://arxiv.org/html/2606.27683#S5.E28)), in which $k$ directions are selected jointly under a retain-side normalization,

$$\max_{V\in\mathbb{R}^{d_{w}\times k},\ V^{\top}CV=I_{k}}\ \mathrm{tr}\!\left(V^{\top}F_{f}V\right).\tag{31}$$
###### Theorem 1\.

The objective in ([31](https://arxiv.org/html/2606.27683#S5.E31)) is maximized by the generalized eigenvectors $v_{1},\dots,v_{k}$ associated with the $k$ largest generalized eigenvalues of the matrix pair $(F_{f},C)$, i.e.,

$$F_{f}v_{i}=\lambda_{i}Cv_{i}=\lambda_{i}(F_{r}+\mu I)v_{i}.\tag{32}$$

###### Proof.

Because $C\succ 0$, the substitution $\zeta_{i}=C^{1/2}v_{i}$ is invertible, and the constraint $V^{\top}CV=I_{k}$ becomes the requirement that $\Xi=[\zeta_{1},\dots,\zeta_{k}]$ has orthonormal columns. The objective becomes

$$\mathrm{tr}\!\left(\Xi^{\top}C^{-1/2}F_{f}C^{-1/2}\Xi\right).$$

The matrix $C^{-1/2}F_{f}C^{-1/2}$ is symmetric positive semidefinite, and by the Ky Fan characterization the maximum of this trace over orthonormal $\Xi$ is attained by its eigenvectors associated with the $k$ largest eigenvalues. Writing $\zeta_{i}=C^{1/2}v_{i}$ gives

$$C^{-1/2}F_{f}C^{-1/2}(C^{1/2}v_{i})=\lambda_{i}(C^{1/2}v_{i}),$$

which is equivalent to $F_{f}v_{i}=\lambda_{i}Cv_{i}$. For $k=1$ the same argument also solves ([28](https://arxiv.org/html/2606.27683#S5.E28)), since the objective is homogeneous of degree two and $F_{f}$ is positive semidefinite with $F_{f}\neq 0$. The optimal value is then positive, the constraint is active at any maximizer, and the optimal direction coincides with the leading generalized eigenvector. ∎

Theorem [1](https://arxiv.org/html/2606.27683#Thmtheorem1) shows that the leading generalized eigenvectors maximize forget-side change per unit retain-side cost. We stack the top-$k$ generalized eigenvectors as

$$V_{k}=[v_{1},\dots,v_{k}]\tag{33}$$

and define the discriminative basis $Q$ as the orthonormal factor of the QR factorization of $V_{k}$, so that $QQ^{\top}$ is an orthogonal projector onto $\mathrm{span}\{v_{1},\dots,v_{k}\}$. This $Q$ is the basis used by the high-similarity variant of CBD. Its role is different from that of the retain basis $U_{r}$ in Section [IV](https://arxiv.org/html/2606.27683#S4). The retain basis $U_{r}$ is an avoidance basis that marks directions which should not be disturbed because they are sensitive to retained behavior. The discriminative basis $Q$ is a selection basis that keeps directions which most improve forget-retain separation under controlled retain-side cost.
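
For small problems, the construction in the proof can be followed literally: whiten with $C^{-1/2}$, take the leading eigenvectors of the whitened matrix, map them back, and orthonormalize. The sketch below does this with dense NumPy routines; it is a reference illustration with a hypothetical function name and is not intended for realistic values of $d_w$.

```python
import numpy as np

def discriminative_basis_dense(F_f, F_r, mu, k):
    """Top-k generalized eigenpairs of (F_f, C) with C = F_r + mu*I,
    computed through the whitening used in the proof of Theorem 1."""
    C = F_r + mu * np.eye(F_r.shape[0])
    # C^{-1/2} from the eigendecomposition of the positive definite C.
    evals, evecs = np.linalg.eigh(C)
    C_inv_sqrt = (evecs / np.sqrt(evals)) @ evecs.T
    M = C_inv_sqrt @ F_f @ C_inv_sqrt
    lam, Xi = np.linalg.eigh((M + M.T) / 2)   # symmetrize against round-off
    order = np.argsort(lam)[::-1][:k]         # indices of the k largest eigenvalues
    V = C_inv_sqrt @ Xi[:, order]             # v_i = C^{-1/2} zeta_i
    Q, _ = np.linalg.qr(V)                    # orthonormal basis of span{v_i}
    return lam[order], V, Q
```

The returned `V` satisfies $V^{\top}CV=I_k$ by construction, whereas `Q` has orthonormal columns in the ordinary sense, so `Q @ Q.T` is the orthogonal projector described after (33).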

### V-B Empirical Fisher Estimation and Efficient Computation

The criterion above defines the optimization target of the final basis $Q$. We now realize it from sample gradients rather than explicit Hessian computation. For each supervised sample $(x_{i},y_{i})$ in a dataset $\mathcal{D}=\{(x_{i},y_{i})\}_{i=1}^{N}$, we evaluate the answer-side negative log-likelihood ([6](https://arxiv.org/html/2606.27683#S4.E6)) under teacher forcing at the common reference point $w_{0}$, and only the answer tokens contribute to this loss. For multiple-choice data, only the final answer-option token is supervised, which avoids tokenizer-dependent instability. The resulting sample gradient in the trainable auxiliary parameter space is

$$g_{i}=\nabla_{w}\ell(w,x_{i},y_{i})\big|_{w_{0}}\in\mathbb{R}^{d_{w}}.\tag{34}$$

The corresponding empirical Fisher matrix is estimated by the average outer product of these sample gradients,

$$\widehat{F}=\frac{1}{N}\sum_{i=1}^{N}g_{i}g_{i}^{\top},\tag{35}$$

which replaces the model expectation in the definition of $F(x)$ by the observed labels and is the standard empirical-Fisher surrogate [[21](https://arxiv.org/html/2606.27683#bib.bib17)]. Stacking the gradients of the $N_{f}$ forget and $N_{r}$ retain samples used for basis extraction into matrices

$$G_{f}=[g_{f,1},\dots,g_{f,N_{f}}],\qquad G_{r}=[g_{r,1},\dots,g_{r,N_{r}}],\tag{36}$$

we can write the empirical Fisher matrices compactly as

$$\widehat{F}_{f}=\frac{1}{N_{f}}G_{f}G_{f}^{\top},\qquad\widehat{F}_{r}=\frac{1}{N_{r}}G_{r}G_{r}^{\top}.\tag{37}$$

For basis extraction only, we use the same answer-token loss on both subsets so that forget and retain directions are measured in a common local geometry, while the probe-training objective in Section [IV](https://arxiv.org/html/2606.27683#S4) still keeps the retain-side KL term for behavioral alignment during optimization. The estimator is appropriate here because basis extraction is performed in the same low-rank auxiliary parameter space used by the probe model, and the forget and retain samples are both evaluated around the same reference point $w_{0}$.

Solving the generalized eigenvalue problem ([32](https://arxiv.org/html/2606.27683#S5.E32)) directly in $\mathbb{R}^{d_{w}}$ is computationally expensive when $d_{w}$ is large, so we convert the problem from parameter scale to sample scale. For implementation, write

$$\widehat{C}=\widehat{F}_{r}+\mu I=\mu I+\frac{1}{N_{r}}G_{r}G_{r}^{\top},\tag{38}$$

where the damping factor $\mu$ is set in proportion to the scale of $\widehat{F}_{r}$ by default. The empirical generalized eigenvalue problem is then

$$\widehat{F}_{f}v=\lambda\widehat{C}v.\tag{39}$$

Because $\widehat{F}_{f}=\frac{1}{N_{f}}G_{f}G_{f}^{\top}$, the forget-side term acts only through the span of the forget gradient matrix $G_{f}$, which allows the problem to be reduced to sample scale. Define

$$Z=\widehat{C}^{-1}G_{f}\in\mathbb{R}^{d_{w}\times N_{f}},\tag{40}$$

$$\Phi=\frac{1}{N_{f}}G_{f}^{\top}Z=\frac{1}{N_{f}}G_{f}^{\top}\widehat{C}^{-1}G_{f}\in\mathbb{R}^{N_{f}\times N_{f}},\tag{41}$$

$$\Phi u=\lambda u.\tag{42}$$

If ([42](https://arxiv.org/html/2606.27683#S5.E42)) holds, then the corresponding parameter-space direction is recovered by

$$v=Zu=\widehat{C}^{-1}G_{f}u.\tag{43}$$

Substituting ([43](https://arxiv.org/html/2606.27683#S5.E43)) into ([39](https://arxiv.org/html/2606.27683#S5.E39)) recovers the same generalized eigenvalue relation. Conversely, since $\mathrm{rank}(\widehat{F}_{f})\leq N_{f}$, every generalized eigenvector with nonzero eigenvalue lies in the column span of $\widehat{C}^{-1}G_{f}$, so the large problem in $\mathbb{R}^{d_{w}}$ and the small problem in $\mathbb{R}^{N_{f}}$ share the same nonzero spectrum.

It remains to compute $Z=\widehat{C}^{-1}G_{f}$ without explicitly forming $\widehat{C}^{-1}$. Applying the Woodbury identity to ([38](https://arxiv.org/html/2606.27683#S5.E38)) gives

$$\widehat{C}^{-1}=\frac{1}{\mu}I-\frac{1}{\mu^{2}N_{r}}G_{r}S^{-1}G_{r}^{\top},\tag{44}$$

where only the retain-side sample matrix

$$S=I+\frac{1}{\mu N_{r}}G_{r}^{\top}G_{r}\in\mathbb{R}^{N_{r}\times N_{r}}\tag{45}$$

has to be inverted. To make the computation explicit, define the cross matrix

$$P=G_{r}^{\top}G_{f}\in\mathbb{R}^{N_{r}\times N_{f}}\tag{46}$$

and solve the small linear system

$$S\Psi=P,\qquad\Psi\in\mathbb{R}^{N_{r}\times N_{f}},\tag{47}$$

typically by first computing the Cholesky factorization $S=LL^{\top}$ and then applying forward and backward substitution. Substituting the solution of ([47](https://arxiv.org/html/2606.27683#S5.E47)) into ([44](https://arxiv.org/html/2606.27683#S5.E44)) yields

$$Z=\widehat{C}^{-1}G_{f}=\frac{1}{\mu}G_{f}-\frac{1}{\mu^{2}N_{r}}G_{r}\Psi.\tag{48}$$

Once $Z$ is available, the sample-scale matrix $\Phi$ follows directly from ([41](https://arxiv.org/html/2606.27683#S5.E41)). We then compute the top-$k$ eigenpairs of $\Phi$, recover the parameter-space directions through ([43](https://arxiv.org/html/2606.27683#S5.E43)), and orthonormalize them into the final basis $Q$ as defined after Theorem [1](https://arxiv.org/html/2606.27683#Thmtheorem1).

This reformulation avoids dense inversion in the trainable parameter space. Its dominant cost comes from sample-gradient computation, the formation of the retain-side sample matrix $S$ and the solution of the linear system $S\Psi=P$, and the eigendecomposition of the sample-scale matrix $\Phi$. Since $N_{f},N_{r}\ll d_{w}$ in typical settings, the linear solve and the eigendecomposition scale with the sample counts rather than with $d_{w}$, and $d_{w}$ enters only linearly through matrix products with $G_{f}$ and $G_{r}$, so no full $d_{w}\times d_{w}$ Fisher matrix is ever formed. Algorithm [1](https://arxiv.org/html/2606.27683#alg1) gives the complete basis-extraction procedure used in CBD.

Algorithm 1 Discriminative Fisher Basis Extraction in CBD

1: Input: reference model $M_{\mathrm{ref}}$, forget set $D_{f}$, retain set $D_{r}$, damping factor $\mu$, basis dimension $k$

2: Output: basis matrix $Q$

3: Compute sample gradients at $w_{0}$ under teacher forcing on answer tokens and form $G_{f},G_{r}$

4: Form $S=I+\frac{1}{\mu N_{r}}G_{r}^{\top}G_{r}$ and $P=G_{r}^{\top}G_{f}$

5: Compute the Cholesky factorization $S=LL^{\top}$

6: Solve $S\Psi=P$ by forward and backward substitution

7: Compute $Z=\frac{1}{\mu}G_{f}-\frac{1}{\mu^{2}N_{r}}G_{r}\Psi$

8: Form $\Phi=\frac{1}{N_{f}}G_{f}^{\top}Z$ and compute its top-$k$ eigenpairs $(u_{i},\lambda_{i})$

9: Recover $v_{i}=Zu_{i}$ for $i=1,\dots,k$

10: Stack $V_{k}=[v_{1},\dots,v_{k}]$ and set $Q$ to the orthonormal factor of its QR factorization

11: return $Q$
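
Steps 4 to 10 of Algorithm 1 translate almost line by line into NumPy once the gradient matrices are given. The sketch below assumes that `G_f` and `G_r` (columns are per-sample gradients) have already been computed and that `k` does not exceed the rank of `G_f`; the function name is hypothetical, and this is an illustration rather than the released implementation.

```python
import numpy as np

def fisher_basis(G_f, G_r, mu, k):
    """Sample-scale basis extraction. G_f: (d_w, N_f), G_r: (d_w, N_r)."""
    N_f, N_r = G_f.shape[1], G_r.shape[1]
    S = np.eye(N_r) + (G_r.T @ G_r) / (mu * N_r)      # Eq. (45)
    P = G_r.T @ G_f                                    # Eq. (46)
    L = np.linalg.cholesky(S)                          # S = L L^T
    # Two solves with the Cholesky factors give Psi = S^{-1} P, Eq. (47).
    # np.linalg.solve is a general solver; a dedicated triangular routine
    # such as scipy.linalg.solve_triangular would exploit the structure.
    Psi = np.linalg.solve(L.T, np.linalg.solve(L, P))
    Z = G_f / mu - (G_r @ Psi) / (mu**2 * N_r)         # Eq. (48)
    Phi = (G_f.T @ Z) / N_f                            # Eq. (41)
    lam, U = np.linalg.eigh((Phi + Phi.T) / 2)         # symmetrize against round-off
    order = np.argsort(lam)[::-1][:k]                  # k largest eigenvalues
    V = Z @ U[:, order]                                # Eq. (43)
    Q, _ = np.linalg.qr(V)                             # orthonormal factor
    return Q, lam[order]
```

Only matrices of size $N_r\times N_r$ and $N_f\times N_f$ are factorized or decomposed here, and the parameter dimension appears solely in products with `G_f` and `G_r`.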

Once the discriminative basis $Q$ has been extracted, the high-similarity variant of CBD replaces the retain-only complement projection ([18](https://arxiv.org/html/2606.27683#S4.E18)) with a projection onto the selected discriminative subspace. The forget-side projected gradient becomes

$$\bar{g}_{f}=QQ^{\top}g_{f},\tag{49}$$

where $QQ^{\top}$ is the projection matrix onto the final discriminative basis. The corresponding training direction is

$$g=\bar{g}_{f}+\beta g_{r},\tag{50}$$

and the probe model is updated by

$$w\leftarrow w-\eta g.\tag{51}$$

This update no longer treats all retain-associated directions as forbidden. Instead, it keeps the directions that maximize forget-side change under controlled retain-side cost. Algorithm [2](https://arxiv.org/html/2606.27683#alg2) summarizes the complete CBD workflow.

Algorithm 2 CBD Training and Deployment

1: Input: target model $M$, reference model $M_{\mathrm{ref}}$, forget set $D_{f}$, retain set $D_{r}$, validation sets $D_{f}^{\mathrm{val}},D_{r}^{\mathrm{val}}$, threshold candidates $\Gamma$, retain-side weight $\beta$, learning rate $\eta$, damping factor $\mu$, basis dimension $k$

2: Output: probe model $M_{\mathrm{pro}}$, routing threshold $\tau$

3: Extract the discriminative basis $Q$ using Algorithm [1](https://arxiv.org/html/2606.27683#alg1)

4: Initialize probe-model parameters with $w\leftarrow w_{0}$

5: while the fixed training-step budget is not reached do

6: Sample mini-batches from $D_{f}$ and $D_{r}$

7: Compute forget-side gradient $g_{f}$ and retain-side gradient $g_{r}$

8: Project forget-side gradient onto the discriminative basis as $\bar{g}_{f}=QQ^{\top}g_{f}$

9: Form the combined gradient $g=\bar{g}_{f}+\beta g_{r}$

10: Update $w\leftarrow w-\eta g$

11: end while

12: Denote the resulting probe model by $M_{\mathrm{pro}}$

13: Compute validation scores using symmetric KL divergence between $M_{\mathrm{ref}}$ and $M_{\mathrm{pro}}$

14: Select the routing threshold $\tau$ by supervised search

15: Deploy CBD according to ([4](https://arxiv.org/html/2606.27683#S3.E4))

16: return $M_{\mathrm{pro}}$, $\tau$
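
The training loop in steps 4 to 11 of Algorithm 2 can be sketched generically. The version below assumes the two gradients are supplied as callables on a flat parameter vector (`grad_forget` and `grad_retain` are hypothetical placeholders for mini-batch gradient estimates) and omits mini-batch sampling, scoring, and calibration.

```python
import numpy as np

def train_probe(w0, grad_forget, grad_retain, Q, beta, eta, steps):
    """Projected probe training: forget gradient restricted to span(Q),
    retain gradient added with weight beta, plain gradient steps."""
    w = np.array(w0, dtype=float)          # step 4: start from w_0
    for _ in range(steps):                 # step 5: fixed step budget
        g_f = grad_forget(w)               # step 7
        g_r = grad_retain(w)
        g_f_bar = Q @ (Q.T @ g_f)          # step 8: Eq. (49)
        g = g_f_bar + beta * g_r           # step 9: Eq. (50)
        w = w - eta * g                    # step 10: Eq. (51)
    return w
```

As a toy usage, with `grad_forget = lambda w: w - a` pulling toward a point `a` and `grad_retain = lambda w: w - w0` pulling back toward the start, the parameters move only along the columns of `Q` and settle where the two pulls balance.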

## VI Experiments

### VI-A Experimental Setup

All experiments were conducted on a server with four NVIDIA GeForce RTX 3090 GPUs (24 GB each), using Python 3.10 and PyTorch 2.1.1 with CUDA 11.8.

Models and Datasets. We evaluate CBD on ToFU [[20](https://arxiv.org/html/2606.27683#bib.bib15)] (https://huggingface.co/datasets/locuslab/TOFU) and Weapons of Mass Destruction Proxy (WMDP) [[17](https://arxiv.org/html/2606.27683#bib.bib60)] (https://huggingface.co/datasets/cais/wmdp). ToFU contains 4,000 fictitious-author question-answer pairs, 100 real-author questions, and 117 world-fact questions. ToFU defines three unlearning settings, namely forget01, forget05, and forget10, which remove 1%, 5%, and 10% of the fictitious-author data, respectively, with the corresponding forget and retain split sizes listed in Table [III](https://arxiv.org/html/2606.27683#S6.T3). Perturbed answers are used only for the truth-ratio metric. We use the ToFU-released Llama-2-7B-Chat model (https://huggingface.co/locuslab/tofu_ft_llama2-7b) fine-tuned on the full ToFU question-answer set as the target model [[20](https://arxiv.org/html/2606.27683#bib.bib15), [32](https://arxiv.org/html/2606.27683#bib.bib62)].

TABLE III: Datasets, target models, and main optimization hyperparameters. For WMDP, the listed sets are the unlearning and retained-utility evaluation sets. All LoRA-based runs use rank 32, scaling factor 64, and dropout 0.05, while ToFU white-box uses full fine-tuning.

WMDP evaluates whether a model can answer hazardous-knowledge questions. Its unlearning evaluation set contains 3,668 four-choice questions, including 1,273 biology questions, 1,987 cybersecurity questions, and 408 chemistry questions. Its retained-utility evaluation uses Massive Multitask Language Understanding (MMLU) [[9](https://arxiv.org/html/2606.27683#bib.bib61)], which contains 14,042 test questions over 57 subjects in the all-category setting. Following the standard WMDP setting, we use Zephyr-7B-beta (https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) as the target model [[17](https://arxiv.org/html/2606.27683#bib.bib60), [33](https://arxiv.org/html/2606.27683#bib.bib63)]. On both benchmarks, CBD initializes the two auxiliary models from TinyLlama-1.1B-Chat-v1.0 (https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0) [[41](https://arxiv.org/html/2606.27683#bib.bib64)]. For WMDP, probe training, basis extraction, and threshold calibration use 600 forget-side samples from the three hazardous domains and 1,200 retain-side samples, while the full question sets are used for final evaluation.

Baselines\.We compare CBD with reference models and representative methods under different access assumptions\. The reference models include 1\)Target LLM, the original target model before unlearning, and 2\)Retrain LLM, the model trained without the forget data\. The Retrain LLM is reported only on ToFU, because WMDP does not provide a retrained reference\.

The white\-box baselines are the three objective families defined in Section[III](https://arxiv.org/html/2606.27683#S3), namelyGA\[[12](https://arxiv.org/html/2606.27683#bib.bib19),[39](https://arxiv.org/html/2606.27683#bib.bib5)\],DPO\[[27](https://arxiv.org/html/2606.27683#bib.bib13)\], andNPO\[[42](https://arxiv.org/html/2606.27683#bib.bib8)\], each reported with its base objective and its retain\-constrained\+GDand\+KLvariants\[[39](https://arxiv.org/html/2606.27683#bib.bib5)\]\. The gray\-box baselines areULD\[[13](https://arxiv.org/html/2606.27683#bib.bib9)\]andOffset\[[11](https://arxiv.org/html/2606.27683#bib.bib10)\], which correct the target logits as in \([2](https://arxiv.org/html/2606.27683#S3.E2)\) and \([3](https://arxiv.org/html/2606.27683#S3.E3)\) and therefore require token\-level logits, whereas CBD uses only final responses\.

Hyperparameters\. The lower block of Table [III](https://arxiv.org/html/2606.27683#S6.T3) reports the optimization settings used in the main comparisons, separated by dataset and method group because ToFU and WMDP use different target models, evaluation data, and training budgets\.

Evaluation Metrics\. On ToFU, we report ROUGE\-L recall \(RG\), answer probability \(Pr\), and truth ratio \(TR\) following the official evaluation protocol \[[20](https://arxiv.org/html/2606.27683#bib.bib15)\]\. RG measures lexical overlap between the generated answer and the reference answer, and Pr is the length\-normalized probability $P_{\hat{M}}(a\mid q)^{1/|a|}$ that the evaluated model $\hat{M}$ assigns to the reference answer $a$ of question $q$\. TR compares the average length\-normalized probability $\bar{p}_{p}(q)$ of perturbed answers with that of the paraphrased true answer $\tilde{a}$ through $r(q)=\bar{p}_{p}(q)\,/\,P_{\hat{M}}(\tilde{a}\mid q)^{1/|\tilde{a}|}$ and is reported with the official scaling

$$\mathrm{TR}(q)=\begin{cases}\min\{r(q),\,r(q)^{-1}\}, & q\in D_{f},\\ \max\{0,\,1-r(q)\}, & q\in D_{r}.\end{cases}\tag{52}$$

On the forget subset, lower RG and Pr indicate less answer recovery, while higher TR means that the model no longer strongly prefers the paraphrased true answer over perturbed alternatives\. On the retained data, real\-author questions, and world\-fact questions, higher RG, Pr, and TR indicate better retained utility\. We summarize retained utility by Model Utility \(MU\), the harmonic mean of RG, Pr, and TR over these retained evaluations\. Since an unlearning method should suppress forget data without damaging retained utility, we also report the Forget\-Retain Trade\-off \(FRT\) metric \[[36](https://arxiv.org/html/2606.27683#bib.bib65)\]\. FRT normalizes retained utility by the remaining answer overlap and answer probability on the forget subset,

$$\mathrm{FRT}=\frac{\mathrm{MU}}{(\mathrm{RG}_{f}+\mathrm{Pr}_{f})/2}.\tag{53}$$

Higher FRT indicates that the model preserves more retained utility for the same level of residual answer recovery on the forget subset\. On WMDP, lower hazardous\-knowledge accuracy indicates stronger unlearning, while higher MMLU accuracy indicates better general capability\. Because CBD makes a query\-level routing decision before invoking the target model, we also report routing accuracy, true\-positive rate, and false\-positive rate in percent, together with the AUC, where higher routing accuracy, true\-positive rate, and AUC and lower false\-positive rate are preferred\. During evaluation, token probabilities are used only to score the local open\-source models on these benchmarks, while CBD itself requires only final text responses at deployment\.
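The ToFU metrics defined above can be sketched in a few lines of Python\. This is a minimal illustration under stated assumptions, not the official ToFU evaluation code: the function names are hypothetical, per\-token log\-probabilities are assumed to come from a locally scored model, all probabilities are assumed strictly positive, and the numbers in the usage lines are made up for illustration rather than taken from the tables\.

```python
import math
from statistics import harmonic_mean


def length_normalized_prob(token_logprobs):
    """Pr = P(a | q)^(1/|a|), computed from the answer's per-token log-probabilities."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def truth_ratio(perturbed_probs, paraphrased_prob, forget):
    """Scaled truth ratio of Eq. (52).

    perturbed_probs: length-normalized probabilities of the perturbed answers.
    paraphrased_prob: length-normalized probability of the paraphrased true answer.
    forget: True for a forget-set question (q in D_f), False for a retained one (q in D_r).
    """
    r = (sum(perturbed_probs) / len(perturbed_probs)) / paraphrased_prob
    if forget:
        return min(r, 1.0 / r)
    return max(0.0, 1.0 - r)


def model_utility(retained_scores):
    """MU: harmonic mean of the retained-side RG, Pr, and TR values."""
    return harmonic_mean(retained_scores)


def forget_retain_tradeoff(mu, rg_forget, pr_forget):
    """FRT of Eq. (53): MU divided by the mean of forget-set RG and Pr."""
    return mu / ((rg_forget + pr_forget) / 2.0)


# Illustrative values only (not results from the paper).
print(length_normalized_prob([math.log(0.5), math.log(0.5)]))
print(truth_ratio([0.2, 0.4], 0.6, forget=True))
print(truth_ratio([0.2, 0.4], 0.6, forget=False))
print(forget_retain_tradeoff(75.0, 35.0, 25.0))
```

Because MU is a harmonic mean, a single collapsed retained metric pulls the summary down sharply, which is why a method that damages one retained evaluation cannot hide it behind the others\.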

### VI\-B Performance Evaluation

The evaluation has four parts\. The first part compares CBD with white\-box and gray\-box baselines on ToFU and WMDP\. The second part varies key hyperparameters to check whether the results remain stable beyond one selected setting\. The third part examines the stability of each method across training steps\. The fourth part compares CBD using the discriminative Fisher basis \(DFB\), denoted as CBD \(with DFB\), with CBD using GPM, denoted as CBD \(with GPM\), to isolate the effect of basis construction\. Unless stated otherwise, CBD refers to CBD \(with DFB\)\.

TABLE IV: Experimental results on the ToFU dataset for unlearning 1% and 5% of data\. For compact presentation, each setting reports forget\-side RG/Pr/TR and summary MU/FRT\. Within each method group, the best value in each column is boldfaced\.

TABLE V: Detailed experimental results on the ToFU dataset for unlearning 10% of data\. Within each method group, the best value in each column is boldfaced\.

TABLE VI: Experimental results on WMDP\. Within each method group, the best value in each column is boldfaced\.

![Refer to caption](https://arxiv.org/html/2606.27683v1/x4.png)

Figure 4: Hyperparameter sensitivity of CBD on ToFU forget10 \(top row\) and WMDP \(bottom row\)\. The four columns vary the top\-$k$ value, the retain basis size, the forget basis size, and the LoRA rank, respectively\. The dashed line marks the configuration used in the main experiments\.

Performance Comparison with Existing Methods\. Table [IV](https://arxiv.org/html/2606.27683#S6.T4) compactly reports the forget\-side and summary results under the forget01 and forget05 settings, while Table [V](https://arxiv.org/html/2606.27683#S6.T5) provides the detailed results under the more challenging forget10 setting\. The white\-box methods trade off reducing answer recovery against preserving retained utility, and the trade\-off sharpens as the unlearning ratio grows\. GA\+GD attains the best white\-box FRT on forget01 at 3\.36 but only at a reduced MU of 57\.35, and on forget10 GA collapses to an MU of 9\.77 with an FRT of 0\.26\. The DPO variants preserve more utility but leave forget\-set answer probabilities well above the retrained reference, and the NPO variants stay closer to that reference but still above it\.

The gray\-box methods behave unevenly across the three settings\. ULD is the strongest method on forget01, where its FRT of 4\.39 exceeds the 2\.49 of CBD\. However, ULD requires target\-model logits that are unavailable in the API\-only setting, and its advantage disappears on the larger splits, where its FRT falls to 1\.42 and 1\.32\. Offset does not achieve consistent suppression, and its MU drops sharply on all three splits\. CBD is the most consistent method, keeping forget\-set Pr between 25\.59 and 28\.67, forget\-set TR above 76, and MU between 74\.76 and 75\.31 across all three splits\. On forget10, Table [V](https://arxiv.org/html/2606.27683#S6.T5) further shows that the retained\-side metrics of CBD stay close to the target model because retained queries that are not rerouted reach the unchanged target LLM, so the retained\-utility cost of CBD comes only from false\-positive routing\.
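The routing behavior described above, where non\-rerouted queries reach the unchanged target LLM, can be illustrated with a small wrapper\. This is only a sketch under assumptions: `route_query`, `relevance_score`, `call_target`, and `fallback` are hypothetical names, the scorer stands in for the unlearning\-relevance score produced by the two auxiliary models, and the way rerouted prompts are answered is an assumption because this section does not specify it\.

```python
from typing import Callable, Tuple


def route_query(
    prompt: str,
    relevance_score: Callable[[str], float],  # hypothetical: score from the two auxiliary models
    call_target: Callable[[str], str],        # hypothetical: wrapper around the target LLM API
    threshold: float,                         # calibrated routing threshold
    fallback: Callable[[str], str],           # hypothetical: handler for rerouted prompts
) -> Tuple[str, bool]:
    """Return (response, rerouted). Only final text responses are used."""
    if relevance_score(prompt) >= threshold:
        # Unlearning-related prompt: never forwarded to the target model.
        return fallback(prompt), True
    # Retained prompt: the target model answers exactly as it would without CBD.
    return call_target(prompt), False


# Toy stand-ins so the sketch runs end to end (not the paper's models).
def toy_score(prompt: str) -> float:
    return 0.9 if "fictitious author" in prompt.lower() else 0.1


def toy_target(prompt: str) -> str:
    return f"target answer to: {prompt}"


def toy_fallback(prompt: str) -> str:
    return "I do not have information about that."


print(route_query("Where was the fictitious author born?", toy_score, toy_target, 0.5, toy_fallback))
print(route_query("What is the capital of France?", toy_score, toy_target, 0.5, toy_fallback))
```

In this sketch a retained query is affected only when its score crosses the threshold by mistake, which mirrors the statement that the retained\-utility cost comes from false\-positive routing alone\.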

Table [VI](https://arxiv.org/html/2606.27683#S6.T6) reports the WMDP results, where the same trade\-off appears\. GA\+KL achieves the strongest white\-box suppression with an overall hazardous accuracy of 26\.44, but its MMLU drops to 28\.96, while DPO\+GD and NPO\+GD preserve MMLU above 55 yet barely suppress hazardous accuracy\. The gray\-box baselines keep MMLU near the target LLM but achieve almost no suppression, with 49\.35 for ULD and 50\.63 for Offset\. CBD reduces the overall hazardous accuracy to 25\.68, close to the four\-choice random\-guess level, while preserving an MMLU accuracy of 52\.67, giving the lowest hazardous accuracy among all compared methods at a smaller general\-capability cost than GA\+KL\.

Hyperparameter Sensitivity\. Figure [4](https://arxiv.org/html/2606.27683#S6.F4) varies the top\-$k$ value, the retain basis size, the forget basis size, and the LoRA rank on ToFU forget10 and WMDP\. On both datasets, the top\-$k$ value has the largest effect, since a too\-small value discards useful separation directions and lowers routing accuracy and true\-positive rate, whereas beyond the main configuration the curves change only slightly and the false\-positive rate stays low\. The retain basis behaves similarly, as a small basis represents retained behavior incompletely and weakens routing\. The forget basis matters on ToFU, where 100 or 200 forget\-set samples are insufficient but 300 samples stabilize the curves\. The LoRA rank has the smallest effect, since the routing accuracy and true\-positive rate stay almost unchanged as it varies from 16 to 80\.

![Refer to caption](https://arxiv.org/html/2606.27683v1/x5.png)

Figure 5: Cross\-method step stability on ToFU forget10 and WMDP\. The heatmaps illustrate the stability of unlearning utility across training steps for different methods\.

Cross\-Method Step Stability\. Figure [5](https://arxiv.org/html/2606.27683#S6.F5) reports unlearning metrics as functions of training steps for CBD, GA, NPO\+GD, and ULD\. On ToFU forget10, we report forget\-set and retain\-set RG against the retrained reference values of 39\.90 and 87\.66\. On WMDP, we report overall hazardous accuracy with a random\-guess level of 25 and MMLU with a target value of 59\.01\. CBD varies little across training steps on both datasets, with forget\-set RG within $[36.23, 38.19]$ and retain\-set RG around 90 on ToFU forget10, and hazardous accuracy at 25\.6–25\.8 with MMLU at 50\.5–52\.7 on WMDP\. In contrast, GA collapses, with both RG metrics dropping to 0\.00 by step 140 on ToFU forget10 and MMLU falling to the random\-guess level by step 100 on WMDP\. NPO\+GD degrades substantially as its retain\-set RG falls from 57\.17 to 34\.21 between steps 80 and 200, and ULD remains flat but achieves little unlearning on WMDP, staying near 49\.3 hazardous accuracy\. Direct weight editing thus depends heavily on the training budget, while CBD does not\.

![Refer to caption](https://arxiv.org/html/2606.27683v1/x6.png)

Figure 6: Receiver operating characteristic curves comparing CBD \(with DFB\) and CBD \(with GPM\) across the three ToFU splits and WMDP, obtained by sweeping the routing threshold on the unlearning\-relevance score\. Square and circle markers denote the operating thresholds of CBD \(with DFB\) and CBD \(with GPM\), respectively\.

Routing Quality Comparison\. We isolate the effect of basis construction by comparing CBD \(with DFB\) and CBD \(with GPM\) under the same training budget\. Figure [6](https://arxiv.org/html/2606.27683#S6.F6) reports the resulting ROC curves, where a higher curve means more forget queries are correctly routed away at the same false\-positive rate on retained queries\. On ToFU, CBD \(with DFB\) attains AUC values of 0\.9931, 0\.9872, and 0\.9599 on forget01, forget05, and forget10, whereas CBD \(with GPM\) reaches 0\.9559, 0\.5817, and 0\.4783\. The AUC gap is about 0\.04 on forget01 but 0\.48 on forget10, as the growing forget set covers more shared templates and tightens forget\-retain coupling\. At the operating threshold on forget10, CBD \(with GPM\) detects almost no forget queries with a true\-positive rate of 3\.33, so most forget queries still reach the target model and routing fails to block them on the forget side\. CBD \(with DFB\) instead reaches 95\.00 routing accuracy with a true\-positive rate of 93\.00 and a false\-positive rate of 3\.00\.

WMDP shows a similar result, where CBD \(with DFB\) achieves an AUC of 0\.9760 and 88\.72 routing accuracy at the operating threshold, compared with 0\.8079 and 70\.11 for CBD \(with GPM\)\. The discriminative Fisher basis therefore routes more accurately in exactly the cases where forget and retain queries are hard to separate\. Across all experiments, the retained\-utility cost of CBD is confined to false\-positive routing, and each query adds only the cost of scoring by the two small auxiliary models before reaching the target LLM\. Probe training and basis extraction take 152 to 185 seconds with a peak memory of 15\.4 to 22\.4 GB, which fits within a single 24 GB GPU\.
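The routing metrics used in this comparison can be computed from per\-query relevance scores as in the sketch below\. It is an illustration under assumptions, not the authors' evaluation script: the function names are hypothetical, label 1 marks a forget query and label 0 a retained query, a query is rerouted when its score is at or above the threshold, and the scores in the usage lines are invented\.

```python
def routing_metrics(scores, labels, threshold):
    """Return (accuracy, TPR, FPR) in percent at a fixed routing threshold.

    labels: 1 = forget query (should be rerouted), 0 = retained query.
    """
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    pos = sum(labels)
    neg = len(labels) - pos
    tn = neg - fp
    accuracy = 100.0 * (tp + tn) / len(labels)
    return accuracy, 100.0 * tp / pos, 100.0 * fp / neg


def roc_auc(scores, labels):
    """Threshold-free AUC: probability that a random forget query outscores
    a random retained query, counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented scores for three forget and three retained queries.
scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
print(routing_metrics(scores, labels, threshold=0.5))
print(roc_auc(scores, labels))
```

With equal numbers of forget and retained queries, routing accuracy reduces to the average of the true\-positive rate and one hundred minus the false\-positive rate\. The reported forget10 operating point is consistent with that reading, since \(93\.00 \+ 97\.00\)/2 = 95\.00, although the class balance of the evaluation set is an assumption here\.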

## VII Conclusion

This paper presents CBD, a black\-box unlearning framework for API\-only LLM services\. CBD routes unlearning\-related queries away from the target LLM using the behavioral divergence between a frozen reference model and a trained probe model, without editing the target or accessing its logits\. A gradient\-statistics\-based discriminative basis improves this routing when forget and retain data are highly similar\. Across ToFU and WMDP, CBD attains a better unlearning\-utility trade\-off than eleven baselines, improving the forget\-retain trade\-off by about 45% over the second\-best method on ToFU forget10 and lowering WMDP hazardous\-knowledge accuracy to near the random\-guess level while preserving MMLU\.

## References

- \[1\] \(2025\) Soft prompting for unlearning in large language models\. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\), pp\. 4046–4056\.
- \[2\] L\. Bourtoule et al\. \(2021\) Machine unlearning\. In 2021 IEEE symposium on security and privacy \(SP\), pp\. 141–159\.
- \[3\] N\. Carlini et al\. \(2021\) Extracting training data from large language models\. In 30th USENIX security symposium \(USENIX Security 21\), pp\. 2633–2650\.
- \[4\] J\. Chen and D\. Yang \(2023\) Unlearn what you want to forget: efficient unlearning for LLMs\. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp\. 12041–12052\.
- \[5\] G\. Dou, Z\. Liu, Q\. Lyu, K\. Ding, and E\. Wong \(2025\) Avoiding copyright infringement via large language model unlearning\. In Findings of the Association for Computational Linguistics: NAACL 2025, pp\. 5191–5215\.
- \[6\] C\. Fan et al\. \(2025\) Simplicity prevails: rethinking negative preference optimization for LLM unlearning\. In Advances in Neural Information Processing Systems\.
- \[7\] C\. Gao, L\. Wang, K\. Ding, C\. Weng, X\. Wang, and Q\. Zhu \(2025\) On large language model continual unlearning\. In International Conference on Learning Representations\.
- \[8\] T\. Gu et al\. \(2025\) From evasion to concealment: stealthy knowledge unlearning for LLMs\. In Findings of the Association for Computational Linguistics: ACL 2025, pp\. 10261–10279\.
- \[9\] D\. Hendrycks et al\. \(2021\) Measuring massive multitask language understanding\. In International Conference on Learning Representations\.
- \[10\] E\. J\. Hu et al\. \(2022\) LoRA: low\-rank adaptation of large language models\. In International Conference on Learning Representations\.
- \[11\] J\. Y\. Huang et al\. \(2025\) Offset unlearning for large language models\. Transactions on Machine Learning Research\.
- \[12\] J\. Jang et al\. \(2023\) Knowledge unlearning for mitigating privacy risks in language models\. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 14389–14408\.
- \[13\] J\. Ji et al\. \(2024\) Reversing the forget\-retain objectives: an efficient LLM unlearning framework from logit difference\. Advances in Neural Information Processing Systems 37, pp\. 12581–12611\.
- \[14\] J\. Jia, J\. Liu, Y\. Zhang, P\. Ram, N\. Baracaldo, and S\. Liu \(2024\) WAGLE: strategic weight attribution for effective and modular unlearning in large language models\. In Advances in Neural Information Processing Systems\.
- \[15\] J\. Jia et al\. \(2024\) SOUL: unlocking the power of second\-order optimization for LLM unlearning\. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp\. 4276–4292\.
- \[16\] A\. Kassem, O\. Mahmoud, and S\. Saad \(2023\) Preserving privacy through dememorization: an unlearning technique for mitigating memorization risks in language models\. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp\. 4360–4379\.
- \[17\] N\. Li et al\. \(2024\) The WMDP benchmark: measuring and reducing malicious use with unlearning\. In Proceedings of the 41st International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol\. 235, pp\. 28525–28550\.
- \[18\] C\. Y\. Liu, Y\. Wang, J\. Flanigan, and Y\. Liu \(2024\) Large language model unlearning via embedding\-corrupted prompts\. Advances in Neural Information Processing Systems 37\.
- \[19\] S\. Liu et al\. \(2025\) Rethinking machine unlearning for large language models\. Nature Machine Intelligence 7, pp\. 181–194\.
- \[20\] P\. Maini, Z\. Feng, A\. Schwarzschild, Z\. C\. Lipton, and J\. Z\. Kolter \(2024\) TOFU: a task of fictitious unlearning for LLMs\. In First Conference on Language Modeling\.
- \[21\] J\. Martens \(2020\) New insights and perspectives on the natural gradient method\. Journal of Machine Learning Research 21\(146\), pp\. 1–76\.
- \[22\] A\. I\. Muresanu, A\. Thudi, M\. R\. Zhang, and N\. Papernot \(2025\) Fast exact unlearning for in\-context learning data for LLMs\. In Proceedings of the 42nd International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol\. 267, pp\. 45272–45288\.
- \[23\] A\. Niwa, M\. Kaneko, and K\. Inui \(2025\) Rectifying belief space via unlearning to harness LLMs’ reasoning\. In Findings of the Association for Computational Linguistics: ACL 2025, pp\. 25060–25075\.
- \[24\] V\. Patil, P\. Hase, and M\. Bansal \(2024\) Can sensitive information be deleted from LLMs? objectives for defending against extraction attacks\. In International Conference on Learning Representations\.
- \[25\] M\. Pawelczyk, S\. Neel, and H\. Lakkaraju \(2024\) In\-context unlearning: language models as few\-shot unlearners\. In Proceedings of the 41st International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol\. 235, pp\. 40034–40050\.
- \[26\] G\. Qu, Q\. Chen, W\. Wei, Z\. Lin, X\. Chen, and K\. Huang \(2025\) Mobile edge intelligence for large language models: a contemporary survey\. IEEE Communications Surveys & Tutorials 27, pp\. 3820–3860\.
- \[27\] R\. Rafailov, A\. Sharma, E\. Mitchell, C\. D\. Manning, S\. Ermon, and C\. Finn \(2023\) Direct preference optimization: your language model is secretly a reward model\. Advances in Neural Information Processing Systems 36, pp\. 53728–53741\.
- \[28\] J\. Ren et al\. \(2025\) A general framework to enhance fine\-tuning\-based LLM unlearning\. In Findings of the Association for Computational Linguistics: ACL 2025, pp\. 18464–18476\.
- \[29\] G\. Saha, I\. Garg, and K\. Roy \(2021\) Gradient projection memory for continual learning\. In International Conference on Learning Representations, pp\. 944–961\.
- \[30\] S\. Takashiro, T\. Kojima, A\. Gambardella, Q\. Cao, Y\. Iwasawa, and Y\. Matsuo \(2025\) Answer when needed, forget when not: language models pretend to forget via in\-context knowledge unlearning\. In Findings of the Association for Computational Linguistics: ACL 2025, pp\. 24872–24885\.
- \[31\] B\. Tian et al\. \(2024\) To forget or not? towards practical knowledge unlearning for large language models\. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp\. 1524–1537\.
- \[32\] H\. Touvron et al\. \(2023\) Llama 2: open foundation and fine\-tuned chat models\. arXiv preprint arXiv:2307\.09288\.
- \[33\] L\. Tunstall et al\. \(2023\) Zephyr: direct distillation of LM alignment\. arXiv preprint arXiv:2310\.16944\.
- \[34\] S\. Vasilev, C\. Herold, B\. Liao, S\. H\. Hashemi, S\. Khadivi, and C\. Monz \(2025\) Unilogit: robust machine unlearning for LLMs using uniform\-target self\-distillation\. In Findings of the Association for Computational Linguistics: ACL 2025, pp\. 22453–22472\.
- \[35\] Y\. Wan, A\. Ramakrishna, K\. Chang, V\. Cevher, and R\. Gupta \(2025\) Not every token needs forgetting: selective unlearning balancing forgetting and utility in large language models\. In Findings of the Association for Computational Linguistics: EMNLP 2025, pp\. 1827–1835\.
- \[36\] H\. Wang et al\. \(2025\) Erasing without remembering: implicit knowledge forgetting in large language models\. arXiv preprint arXiv:2502\.19982\.
- \[37\] Y\. Wang et al\. \(2025\) LLM unlearning via loss adjustment with only forget data\. In International Conference on Learning Representations\.
- \[38\] L\. Xie, X\. Teng, S\. Ke, H\. Wen, and S\. Wan \(2025\) Reveal and release: iterative LLM unlearning with self\-generated data\. In Findings of the Association for Computational Linguistics: EMNLP 2025, pp\. 23887–23899\.
- \[39\] J\. Yao et al\. \(2024\) Machine unlearning of pre\-trained large language models\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 8403–8419\.
- \[40\] Y\. Yao, X\. Xu, and Y\. Liu \(2024\) Large language model unlearning\. Advances in Neural Information Processing Systems 37, pp\. 105425–105475\.
- \[41\] P\. Zhang, G\. Zeng, T\. Wang, and W\. Lu \(2024\) TinyLlama: an open\-source small language model\. arXiv preprint arXiv:2401\.02385\.
- \[42\] R\. Zhang, L\. Lin, Y\. Bai, and S\. Mei \(2024\) Negative preference optimization: from catastrophic collapse to effective unlearning\. In Proceedings of the First Conference on Language Modeling\.
