CiPO: Counterfactual Unlearning for Large Reasoning Models through Iterative Preference Optimization

arXiv cs.CL Papers

Summary

CiPO is a novel framework for machine unlearning in Large Reasoning Models that uses iterative preference optimization with counterfactual reasoning traces to selectively remove unwanted knowledge while preserving reasoning abilities. The method addresses the challenge of unlearning in models that rely on chain-of-thought reasoning by generating logically valid alternative reasoning paths during training.

arXiv:2604.15847v1 Announce Type: new Abstract: Machine unlearning has gained increasing attention in recent years, as a promising technique to selectively remove unwanted privacy or copyrighted information from Large Language Models that are trained on a massive scale of human data. However, the emergence of Large Reasoning Models (LRMs), which emphasize long chain-of-thought (CoT) reasoning to address complex questions, presents a dilemma to unlearning: existing methods either struggle to completely eliminate undesired knowledge from the CoT traces or degrade the reasoning performances due to the interference with the reasoning process. To this end, we introduce Counterfactual Unlearning through iterative Preference Optimization (CiPO), a novel framework that redefines unlearning as the targeted intervention of the CoT reasoning in LRMs. More specifically, given a desired unlearning target answer, CiPO instructs LRMs to generate a logically valid counterfactual reasoning trace for preference tuning. As the LRM adjusts to the counterfactual trace, CiPO iteratively updates the preference learning data to increase the discrepancy from the original model. This iterative loop ensures both desirable unlearning and smooth optimization, effectively mitigating the dilemma. Experiments on challenging benchmarks demonstrate that CiPO excels at unlearning, completely removing knowledge from both the intermediate CoT steps and the final answer, while preserving the reasoning abilities of LRMs.

Cached at: 04/20/26, 08:29 AM

# Counterfactual Unlearning for Large Reasoning Models through Iterative Preference Optimization

Source: https://arxiv.org/html/2604.15847

Junyi Li†, Yongqiang Chen⋄, Ningning Ding†
†The Hong Kong University of Science and Technology (Guangzhou)
⋄The Chinese University of Hong Kong
jli000@connect.hkust-gz.edu.cn, yqchen24@gmail.com, ningningding@hkust-gz.edu.cn

###### Abstract

Machine unlearning has gained increasing attention in recent years, as a promising technique to selectively remove unwanted privacy or copyrighted information from Large Language Models that are trained on a massive scale of human data. However, the emergence of Large Reasoning Models (LRMs), which emphasize long chain-of-thought (CoT) reasoning to address complex questions, presents a dilemma to unlearning: existing methods either struggle to completely eliminate undesired knowledge from the CoT traces or degrade the reasoning performances due to the interference with the reasoning process. To this end, we introduce Counterfactual Unlearning through iterative Preference Optimization (CiPO), a novel framework that redefines unlearning as the targeted intervention of the CoT reasoning in LRMs. More specifically, given a desired unlearning target answer, CiPO instructs LRMs to generate a logically valid counterfactual reasoning trace for preference tuning. As the LRM adjusts to the counterfactual trace, CiPO iteratively updates the preference learning data to increase the discrepancy from the original model. This iterative loop ensures both desirable unlearning and smooth optimization, effectively mitigating the dilemma. Experiments on challenging benchmarks demonstrate that CiPO excels at unlearning, completely removing knowledge from both the intermediate CoT steps and the final answer, while preserving the reasoning abilities of LRMs.

Our code is available at https://github.com/TerryLee77/CiPO.
Corresponding author: Ningning Ding.

## 1 Introduction

Large Language Models (LLMs) have demonstrated remarkable capabilities across a vast array of tasks, becoming integral to numerous applications (OpenAI, 2023; DeepSeek-AI, 2024; Grattafiori et al., 2024). Because they are trained on massive amounts of human data, however, LLMs can memorize and potentially regenerate sensitive, private, or copyrighted information from the training data (Karamolegkou et al., 2023; Patil et al., 2024; Li et al., 2024). This raises significant privacy and ethical concerns, necessitating methods to control model knowledge post-training (Liu et al., 2025a). Thus, machine unlearning has emerged as a critical field, offering techniques to selectively erase information from a model and thereby comply with data privacy regulations such as the "right to be forgotten" without the prohibitive cost of retraining (Voigt and Bussche, 2017; Yao et al., 2024; Zhang et al., 2025).

Figure 1: Difference between LLMs and LRMs.

Despite this success on LLMs, the recently emerged LRMs present new challenges to unlearning.
As LRMs rely on generating long chain-of-thought (CoT) reasoning steps to address complex, multi-step problems (OpenAI, 2024; DeepSeek-AI, 2025), unlearning requires eliminating the targeted knowledge from both the reasoning traces and the final answers. As shown in Figure 1, although CoT traces turn the model's internal deliberation into explicit text output and facilitate reasoning, the reasoning traces themselves become a primary vector for data leakage. Sensitive information used at any point in the deliberation is recorded and revealed directly (Green et al., 2025). Information that was meant to be forgotten but remains implicitly embedded in the reasoning trace can unintentionally guide inference, increasing the risk of reconstructing the original output despite the unlearning attempt. Conventional LLM unlearning methods are ill-equipped for this scenario, as they are not designed to unlearn these complex, exposed logical pathways.

Recognizing this gap, several studies have explored unlearning techniques specifically for LRMs, but critical limitations remain. One representative strategy trains models to output a generic refusal (e.g., "I don't know") for prompts tied to forget requests (Yoon et al., 2025). This coarse approach introduces new privacy risks: a consistent refusal can signal that specific data were unlearned, increasing exposure to membership inference attacks (Zhou et al., 2025). Moreover, optimizing for a template refusal across diverse prompts destabilizes training and reduces utility (Mekala et al., 2025; Wang et al., 2025b).
Another line of work, R2MU, perturbs internal representations to suppress sensitive reasoning traces, but at the cost of readability and reasoning quality (Wang et al., 2025a). To summarize, existing LRM unlearning methods force an undesirable choice: a superficial refusal that introduces new privacy risks, or a forceful suppression that breaks the model's core reasoning abilities. This dilemma highlights a clear need for a more nuanced approach, leading to our key research question:

How can LRM unlearning cover both reasoning traces and final answers without introducing new privacy risks, while preserving coherent reasoning ability?

To answer this question, we introduce Counterfactual Unlearning through iterative Preference Optimization (CiPO), a novel unlearning method explicitly designed for LRMs. CiPO reframes unlearning as a targeted intervention on the CoT reasoning of LRMs and executes it via an *iterative on-policy* preference optimization loop. More specifically, given the unlearning target, CiPO instructs the LRM to construct a logically valid counterfactual trace for preference optimization. At each iteration, we sample CoT reasoning steps and final answers over the forget prompts rather than using a fixed set. We then construct dynamic preference pairs in which the counterfactual trace serves as the preferred response and the sampled answers serve as the dispreferred responses. Finally, we optimize a DPO-style objective so that the model *prefers the counterfactual path*. By using on-policy, real-time preferences, CiPO keeps unlearning aligned with the model's evolving distribution, mitigating mismatch while preserving reasoning (Guo et al., 2024; Pang et al., 2024; Tu et al., 2025).
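The loop described above (sample on-policy, pair each sample against the counterfactual, take a DPO-style step) can be illustrated with a toy sketch. This is not the authors' implementation: the "policy" here is just a list of logits over a handful of candidate responses to one forget prompt, the counterfactual response is fixed in advance rather than generated by the model, and the reference policy is assumed to be a frozen copy of the starting logits. All function names and hyperparameters (`beta`, `lr`, `rounds`, `samples_per_round`) are hypothetical.

```python
import math
import random


def softmax(logits):
    """Convert logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def log_prob(logits, idx):
    """Log-probability of response `idx` under a categorical policy."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return logits[idx] - lse


def dpo_loss_and_grad(logits, ref_logits, win, lose, beta):
    """DPO logistic loss for one (preferred, dispreferred) pair.

    Returns the loss and its derivative with respect to logits[win];
    the derivative with respect to logits[lose] is the negative of it.
    """
    margin = beta * (
        (log_prob(logits, win) - log_prob(ref_logits, win))
        - (log_prob(logits, lose) - log_prob(ref_logits, lose))
    )
    sig = 1.0 / (1.0 + math.exp(-margin))
    return -math.log(sig), beta * (sig - 1.0)


def iterative_unlearning(logits, cf_idx, rounds=5, samples_per_round=8,
                         beta=0.5, lr=1.0, seed=0):
    """Toy iterative on-policy preference loop (illustrative assumption)."""
    rng = random.Random(seed)
    ref_logits = list(logits)   # frozen reference policy (assumed)
    logits = list(logits)       # current policy, updated in place
    for _ in range(rounds):
        # 1) Sample responses from the *current* policy (on-policy).
        probs = softmax(logits)
        sampled = rng.choices(range(len(logits)), weights=probs,
                              k=samples_per_round)
        # 2) Counterfactual is preferred; each other sample is dispreferred.
        pairs = [(cf_idx, s) for s in sampled if s != cf_idx]
        # 3) Gradient-descent step on the DPO loss for each pair.
        for win, lose in pairs:
            _, grad = dpo_loss_and_grad(logits, ref_logits, win, lose, beta)
            logits[win] -= lr * grad
            logits[lose] += lr * grad
    return logits


# Index 0: original trace and answer, 1: counterfactual, 2: some other response.
start = [2.0, 0.0, 0.0]
final = iterative_unlearning(start, cf_idx=1)
print(softmax(start)[1], softmax(final)[1])
```

Because the dispreferred responses are resampled every round, the pairs track whatever the current policy still tends to produce, which is the property the on-policy formulation is meant to provide. In a real LRM setting the logits would be replaced by sequence log-probabilities of full reasoning traces.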
Our experiments demonstrate that CiPO attains strong performance in erasing sensitive information from both reasoning traces and final answers while maintaining reasoning ability, offering an efficient unlearning strategy for LRMs. Our contributions can be summarized as follows:

- *Problem Identification:* We identify key limitations of existing LRM unlearning methods, highlighting how strategies based on representation misdirection and evasion of targeted knowledge can degrade model performance or fail to provide constructive and safe unlearning.
- *Proposed Method:* We introduce CiPO, an iterative framework from a causal view that moves beyond these limitations and challenges by using online preference optimization to replace the original reasoning trace and answer with a desirable counterfactual one.
- *Experimental Validation:* Through experiments on R-TOFU and real-world benchmarks, we demonstrate that CiPO effectively removes targeted knowledge from both answers and reasoning traces while preserving the model's core reasoning abilities.

## 2 Related Work

##### LLM Unlearning

Machine unlearning is an emerging field focused on selectively removing the influence of specific data points from a trained model without the prohibitive cost of retraining from scratch (Cao and Yang, 2015; Xu et al., 2023; Wen et al., 2026). The application of unlearning to large language models (LLMs) represents a critical extension beyond conventional machine learning. It addresses the need to protect copyrighted or private information in LLM applications, comply with regulations such as GDPR, and mitigate harmful content generation (Eldan and Russinovich, 2023; Shi et al., 2025; Li et al., 2024).
A predominant approach formulates LLM unlearning as a targeted optimization problem (Jang et al., 2023). One strategy involves directly modifying model weights by applying Gradient Ascent (GA) on the negative log-likelihood of the "forget" data, effectively making such outputs less probable. This is often paired with standard Gradient Descent (GD) on a "retain" set to preserve general capabilities (Yao et al., 2024; Maini et al., 2024; Dorna et al., 2025). An alternative strategy leverages preference-based optimization methods. Techniques such as Direct Preference Optimization (DPO) or Negative Preference Optimization (NPO) realign the model to favor neutral or refusal responses over generating undesirable information (Zhang et al., 2024; Wang et al., 2025b; Mekala et al., 2025; Sinha et al., 2025; Fan et al., 2025). Inspired by representation engineering, RMU fine-tunes the model to steer the hidden states of forget samples towards a random vector (Li et al., 2024). However, LLM unlearning methods are not applicable to LRMs because they are designed to modify final outputs, not the explicit multi-step reasoning traces; consequently, new designs that intervene on reasoning paths are required.
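As a rough illustration of the GA-plus-GD strategy described above, the sketch below combines a sign-flipped negative log-likelihood on a forget sample with an ordinary negative log-likelihood on a retain sample. It assumes per-token log-probabilities have already been computed by some model; the function names and the `retain_weight` parameter are hypothetical and not taken from any of the cited methods.

```python
def nll(token_log_probs):
    """Mean negative log-likelihood of a sequence from its token log-probs."""
    return -sum(token_log_probs) / len(token_log_probs)


def ga_gd_objective(forget_log_probs, retain_log_probs, retain_weight=1.0):
    """Scalar to *minimize* for gradient-ascent-style unlearning.

    Minimizing the negated forget NLL is equivalent to gradient ascent on
    the forget NLL, so forget sequences become less probable. The retain
    term is a standard NLL, so retain sequences stay likely.
    """
    forget_term = -nll(forget_log_probs)
    retain_term = nll(retain_log_probs)
    return forget_term + retain_weight * retain_term


# Token log-probs for one forget answer and one retain answer (made-up values).
print(ga_gd_objective([-0.1, -0.2], [-0.5, -0.5]))
```

The objective falls as the forget sequence becomes less likely and rises as the retain sequence becomes less likely. The forget term is unbounded below, which is one commonly noted reason plain gradient ascent can be unstable.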
##### LRM Unlearning

The advancement of LLMs into a new class of LRMs is fundamentally marked by the integration of transparent step-by-step chain-of-thought reasoning, which makes their problem-solving processes explicit (OpenAI, 2024; DeepSeek-AI, 2025). Applying machine unlearning to LRMs introduces a key challenge: unwanted information can be embedded throughout the entire CoT trace. Current solutions attempt to either suppress faulty reasoning paths, like R2MU (Wang et al., 2025a), or train the model to refuse to answer via methods like ReasonedIDK (Yoon et al., 2025). However, these approaches can degrade reasoning abilities or introduce new data leakage risks from over-rejection (Zhou et al., 2025). Our work addresses these challenges while achieving effective LRM unlearning.

##### Preference Optimization

Preference optimization (PO) trains LLMs to favor a preferred response $y^{+}$ over a dispreferred one $y^{-}$ for a given prompt $x$, rather than maximizing raw likelihood. Methods like DPO or SimPO offer efficient, reinforcement-learning-free solutions by directly optimizing a logistic loss over log-probability ratios (Rafailov et al., 2023; Meng et al., 2022). However, training PO on fixed, pre-collected pairs is inherently off-policy with respect to the evolving model and under-explores emerging failure modes. We therefore adopt an *iterative/online* approach to PO. In each round, the current model samples candidates, dynamic preferences are constructed, and the policy is updated.
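For concreteness, the logistic loss over log-probability ratios that is optimized in each such round takes the following form in the standard DPO formulation of Rafailov et al. (2023). The symbols $\pi_\theta$ (the policy being trained), $\pi_{\text{ref}}$ (a frozen reference policy), $\beta$ (a scaling hyperparameter), and $\sigma$ (the logistic function) follow that formulation and are not defined in this section; how CiPO instantiates them is not stated here.

```latex
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}})
  = -\,\mathbb{E}_{(x,\, y^{+},\, y^{-})}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y^{+} \mid x)}{\pi_{\text{ref}}(y^{+} \mid x)}
        - \beta \log \frac{\pi_\theta(y^{-} \mid x)}{\pi_{\text{ref}}(y^{-} \mid x)}
      \right)
    \right]
```

With fixed pairs, the expectation is taken over a static dataset; in the iterative setting, the pairs $(x, y^{+}, y^{-})$ are redrawn from the current model at every round.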
This iterative loop reduces distribution mismatch, improves exploration, and yields gains with online-learning guarantees (Guo et al., 2024; Pang et al., 2024; Tu et al., 2025). In our setting, this iterative view keeps the unlearning signal aligned with the model's evolving distribution.

## 3 Preliminaries

In this section, we introduce the background of machine unlearning in LLMs and extend it to LRMs.

### 3.1 Machine Unlearning in LLMs

Machine unlearning for LLMs aims to remove the effect of specific training data so that the model behaves as if that data had never been used in training, without incurring the cost of full retraining. It has become a critical technique for addressing privacy, safety, and copyright concerns in LLMs (Chen et al., 2024). Let $\pi$ represent the parameters of the target LLM we aim to unlearn. The unlearning task is formally defined by two datasets:

- The forget set $D_f$ contains the data instances $\{q, a\}$ whose knowledge the model must forget, where $q$ is a query related to the forget set and $a$ is the corresponding answer.
- The retain set $D_r$ contains data that the model should not forget. This set is used to regularize the unlearning process so that the model's general utility is preserved.

The objective of LLM unlearning can be formulated as an optimization problem that seeks to balance the dual goals of forgetting and retaining knowledge Yuan et al.
(2025):

$$
\min_{\pi'} \underbrace{\mathbb{E}_{\mathcal{D}_f}\left[\ell_f(\pi'; D_f)\right]}_{\text{Forget loss } \ell_f} + \lambda \underbrace{\mathbb{E}_{\mathcal{D}_r}\left[\ell_r(\pi'; D_r)\right]}_{\text{Retain loss } \ell_r}, \tag{1}
$$

where $\pi'$ represents the parameters of the unlearned model, $\ell_f$ is a loss function designed to make the model "forget" the content in $D_f$, and $\ell_r$ is a loss function that penalizes deviations from the original model's behavior on the retain set $D_r$. The hyperparameter $\lambda$ controls the trade-off between these two objectives. Most existing unlearning methods follow the general formulation described in Equation (1), though they differ in the specific design of the forget loss and retain loss components. We further discuss the details of representat

Similar Articles

Reducing Credit Assignment Variance via Counterfactual Reasoning Paths

arXiv cs.LG

Introduces Implicit Behavior Policy Optimization (IBPO), a counterfactual comparison-based credit assignment framework that improves training stability and performance in multi-step reasoning tasks for large language models by converting sparse terminal rewards into step-sensitive learning signals.

AIPO: Learning to Reason from Active Interaction

arXiv cs.CL

This paper introduces AIPO, a reinforcement learning framework that enhances LLM reasoning by allowing the model to actively consult collaborative agents during exploration to overcome capability boundaries.

Long-Context Reasoning Through Proxy-Based Chain-of-Thought Tuning

arXiv cs.CL

Proposes ProxyCoT, a training framework that improves long-context reasoning in large language models by first obtaining chain-of-thought reasoning traces on short proxy contexts (via reinforcement learning or distillation) and then grounding them in full long contexts through supervised fine-tuning. Experiments show consistent improvements over baselines with reduced computational cost.