It Takes Two: Complementary Self-Distillation for Contextual Integrity in LLMs

Summary

Proposes Complementary Self-Distillation (SelfCI) to improve contextual integrity in LLMs by balancing utility and privacy. Evaluated on CI-RL and PrivacyLens benchmarks across multiple models.

arXiv:2605.20258v1. Abstract: Contextual Integrity (CI) defines privacy not merely as keeping information hidden, but as governing information flows according to the norms of a given context. As large language models are increasingly deployed as personal agents handling sensitive workflows, adhering to CI becomes critical. However, even frontier models remain unreliable in making disclosure decisions, and existing mitigation strategies often degrade underlying task performance. To overcome this privacy-utility trade-off, we propose SelfCI, a complementary self-distillation framework that decouples information suppression from task resolution. SelfCI jointly optimizes two independent reverse KL divergences over distinct teacher distributions derived from feedback: one encourages preserving task-relevant information for utility, while the other enforces minimal and appropriate disclosure. This complementary formulation induces a Product-of-Experts (PoE) target, aligning the policy with the intersection of capability and privacy requirements. Empirical evaluations demonstrate that SelfCI, without relying on costly external supervision, consistently outperforms competitive baselines such as online reinforcement learning algorithms (e.g., GRPO). These trends further extend to out-of-domain settings involving agentic workflows and accumulated private context, suggesting that SelfCI provides a practical path toward CI alignment.

# It Takes Two: Complementary Self-Distillation for Contextual Integrity in LLMs
Source: [https://arxiv.org/html/2605.20258](https://arxiv.org/html/2605.20258)

### 4.1 Experimental Setup

#### Datasets & Metrics.

As our primary benchmark, CI-RL [[22](https://arxiv.org/html/2605.20258#bib.bib6)] isolates the privacy-utility trade-off via synthetic assistant-task instances with explicit disclosure norms. On the held-out test split, we evaluate retaining task-relevant attributes (Utility), suppressing unnecessary private attributes (Integrity), and satisfying both conditions simultaneously (Complete). For all evaluations and analyses, we sample five responses for each prompt and report the mean for each metric.
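
To make the three CI-RL metrics concrete, the following minimal sketch scores the sampled responses for one prompt. It assumes, purely for illustration, that an attribute counts as present when its string appears in the response, that Utility requires every allowed attribute, and that Integrity requires no disallowed attribute. The benchmark's actual matching procedure is not described in this section, and `score_response` and `score_prompt` are hypothetical names.

```python
from statistics import mean


def score_response(response, allowed, disallowed):
    """Score one response. Substring matching is an illustrative assumption."""
    text = response.lower()
    utility = all(a.lower() in text for a in allowed)            # all allowed attributes retained
    integrity = not any(d.lower() in text for d in disallowed)   # no disallowed attribute leaked
    return {
        "utility": float(utility),
        "integrity": float(integrity),
        "complete": float(utility and integrity),                # both conditions at once
    }


def score_prompt(responses, allowed, disallowed):
    """Average each metric over the sampled responses and report it in percent."""
    scores = [score_response(r, allowed, disallowed) for r in responses]
    return {k: 100.0 * mean(s[k] for s in scores)
            for k in ("utility", "integrity", "complete")}
```

Because Complete is a per-response conjunction, it can never exceed the smaller of Utility and Integrity on the same set of responses.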

For out-of-domain assessment, we use PrivacyLens [[35](https://arxiv.org/html/2605.20258#bib.bib5)], which evaluates privacy norm awareness through tool-using agent trajectories grounded in privacy-sensitive scenarios. Task fulfillment (Helpful) is measured by GPT-5-mini [[38](https://arxiv.org/html/2605.20258#bib.bib40)] as an LLM-as-a-Judge score on a $[0,3]$ scale. Privacy is evaluated by the leakage rate of sensitive information in final actions (LR) and its helpfulness-adjusted variant (ALR), which measures leakage only among helpful actions. Further details, including prompt templates, are provided in [Sec. C.1](https://arxiv.org/html/2605.20258#A3.SS1).
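
The relationship between LR and ALR can be sketched as follows. The cutoff that decides whether an action counts as helpful (`helpful_threshold=2` on the $[0,3]$ judge scale) is an assumption made for this example, as is the function name `leakage_rates`; the exact rule is deferred to the appendix and is not reproduced here.

```python
def leakage_rates(records, helpful_threshold=2):
    """Return (LR, ALR) in percent.

    records: list of (helpful_score, leaked) pairs, one per final action.
    helpful_threshold is an assumed cutoff, not a value taken from the paper.
    """
    if not records:
        raise ValueError("need at least one record")
    # LR: share of all final actions that leak sensitive information.
    lr = sum(1 for _, leaked in records if leaked) / len(records)
    # ALR: the same share, restricted to actions judged helpful.
    helpful = [leaked for score, leaked in records if score >= helpful_threshold]
    alr = sum(helpful) / len(helpful) if helpful else float("nan")
    return 100.0 * lr, 100.0 * alr
```

ALR guards against a degenerate way of lowering LR: a model that refuses every request leaks nothing, but it also produces no helpful actions to evaluate.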

#### Baselines.

We compare SelfCI against three baselines, including two competitive learning methods. The Initial model serves as a zero-shot reference, capturing the policy’s behavior prior to any CI-specific adaptation. As a representative online learning baseline, CI-RL [[22](https://arxiv.org/html/2605.20258#bib.bib6)] optimizes the policy with GRPO [[36](https://arxiv.org/html/2605.20258#bib.bib45)] using a scalar reward $|\mathcal{A}_{\mathcal{T}}^{\text{present}}|/|\mathcal{A}_{\mathcal{T}}| - |\mathcal{D}_{\mathcal{T}}^{\text{present}}|/|\mathcal{D}_{\mathcal{T}}|$, where $\mathcal{A}_{\mathcal{T}}^{\text{present}} \subseteq \mathcal{A}_{\mathcal{T}}$ and $\mathcal{D}_{\mathcal{T}}^{\text{present}} \subseteq \mathcal{D}_{\mathcal{T}}$ denote the allowed and disallowed attributes present in the response, respectively. In contrast, ContextDistill is an offline SFT baseline based on context distillation [[39](https://arxiv.org/html/2605.20258#bib.bib38)]. Unlike our complementary self-teacher objective, it trains on responses generated by a larger teacher model conditioned on a single context formed by concatenating the aggregated feedback $\tilde{f}_{\text{allow}}$ and $\tilde{f}_{\text{disallow}}$.
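
The CI-RL scalar reward above is simple enough to write out directly. In this sketch, `present` stands for the set of attributes detected in a response (how detection is done is left open), and returning zero for a term whose attribute set is empty is an assumption added only to avoid division by zero.

```python
def ci_rl_reward(present, allowed, disallowed):
    """|A_present| / |A| - |D_present| / |D| for one response.

    present:    attributes detected in the response (detection method not modeled here)
    allowed:    the allowed attribute set A_T for the task
    disallowed: the disallowed attribute set D_T for the task
    """
    present, allowed, disallowed = set(present), set(allowed), set(disallowed)
    # Empty attribute sets contribute 0; this convention is an assumption of the sketch.
    allowed_term = len(present & allowed) / len(allowed) if allowed else 0.0
    disallowed_term = len(present & disallowed) / len(disallowed) if disallowed else 0.0
    return allowed_term - disallowed_term
```

The reward lies in $[-1, 1]$ and is a single number per response, so two responses that differ in which attribute they leak can still receive the same signal.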

#### Implementation Details.

We apply SelfCI across instruction-tuned backbones (Qwen2.5-7B-Instruct [[49](https://arxiv.org/html/2605.20258#bib.bib33)], Llama-3.1-8B-Instruct [[13](https://arxiv.org/html/2605.20258#bib.bib34)], Olmo-3-7B-Instruct [[34](https://arxiv.org/html/2605.20258#bib.bib35)], and Qwen3-4B-Instruct-2507 [[48](https://arxiv.org/html/2605.20258#bib.bib36)]) and reasoning backbones (DeepSeek-R1-Distill-Llama-8B [[14](https://arxiv.org/html/2605.20258#bib.bib37)], Olmo-3-7B-Think [[34](https://arxiv.org/html/2605.20258#bib.bib35)], and Qwen3-4B [[48](https://arxiv.org/html/2605.20258#bib.bib36)]). All methods use the CI-CoT prompt template from Lan et al. [[22](https://arxiv.org/html/2605.20258#bib.bib6)], shown in [Fig. 8](https://arxiv.org/html/2605.20258#A7.F8), unless a benchmark-specific prompt format is required. We set the maximum output length to 2048 tokens for instruction-tuned backbones and 4096 tokens for reasoning backbones.

For optimization, we use AdamW [[28](https://arxiv.org/html/2605.20258#bib.bib49)] with a base learning rate of $1\times 10^{-6}$ and a linear scheduler with warm-up over the first 10% of training steps. To preserve pretrained capabilities during alignment [[5](https://arxiv.org/html/2605.20258#bib.bib50)], we apply LoRA [[16](https://arxiv.org/html/2605.20258#bib.bib42)] with rank $r=32$, scaling factor $\alpha=64$, and dropout [[40](https://arxiv.org/html/2605.20258#bib.bib48)] of 0.05 to the query and value projections in all experimental configurations. All optimization-based methods are trained for 30 epochs on the CI-RL training split following Lan et al. [[22](https://arxiv.org/html/2605.20258#bib.bib6)]. We select the checkpoint with the highest Complete score on the CI-RL evaluation split. All experiments are conducted on a single NVIDIA H200 GPU. We provide additional details in [Secs. C.2](https://arxiv.org/html/2605.20258#A3.SS2) and [C.3](https://arxiv.org/html/2605.20258#A3.SS3).
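
As an illustration of the learning-rate schedule described above, the sketch below computes the rate at a given step with a base rate of $1\times 10^{-6}$ and warm-up over the first 10% of steps. The text only says "linear scheduler", so the linear decay to zero after warm-up is an assumption, and `lr_at` is a hypothetical helper rather than part of any training library.

```python
def lr_at(step, total_steps, base_lr=1e-6, warmup_frac=0.10):
    """Learning rate at `step` under linear warm-up followed by linear decay.

    Decaying to zero after warm-up is an assumption; only the base rate and
    the 10% warm-up fraction come from the text.
    """
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        # Ramp linearly from 0 up to base_lr during warm-up.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr at the end of warm-up to 0 at the last step.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)
```

With 1000 total steps, the rate peaks at step 100 and falls back to half the base rate at step 550.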

![Refer to caption](https://arxiv.org/html/2605.20258v1/x3.png) Figure 3: (Left) Average $D_{\mathrm{KL}}$ defined in [Eq. 1](https://arxiv.org/html/2605.20258#S2.E1) and Complete scores in [Sec. 4](https://arxiv.org/html/2605.20258#S4) on the CI-RL test set computed using Qwen2.5-7B-Instruct. (Middle) Per-epoch Complete scores on the CI-RL test set and (Right) GPU wall-clock time per training step, using Qwen3-4B-Instruct.

### 4.2 Main Results

#### Superiority of SelfCI.

As shown in [Sec. 4](https://arxiv.org/html/2605.20258#S4), SelfCI consistently improves the privacy-utility trade-off on the CI-RL test set. For instruction-tuned models, the primary gain is substantially higher Integrity. For example, on Qwen2.5-7B-Instruct, SelfCI improves Integrity from 35.34 to 83.56 and Complete from 23.29 to 53.42. Importantly, *these gains do not come at the cost of Utility*: SelfCI maintains competitive Utility and even exceeds the Initial model on Llama-3.1-8B-Instruct and Olmo-3-7B-Instruct. [Fig. 3](https://arxiv.org/html/2605.20258#S4.F3) (Left) further supports this advantage by showing a clear inverse relationship between measured $D_{\mathrm{KL}}$ and Complete score, where $D_{\mathrm{KL}}$, as defined in [Eq. 1](https://arxiv.org/html/2605.20258#S2.E1), indicates sensitivity to disallowed attributes. SelfCI achieves the lowest $D_{\mathrm{KL}}$ and the highest Complete score among all methods. Together, these results show that SelfCI improves robustness to disallowed attributes while preserving task completion.

The same trend extends to reasoning models, where preserving task performance is particularly challenging. SelfCI achieves the best Complete score on all reasoning backbones, with especially large gains on Qwen3-4B, improving Integrity from 32.88 to 82.19 and Complete from 26.03 to 57.26. It also attains the highest Utility on DeepSeek-R1-Distill-Llama-8B, suggesting that CI alignment can improve privacy behavior without necessarily weakening task-solving ability.

#### Limitations of Online RL.

SelfCI is substantially more effective and sample-efficient than the online RL baseline. As shown in [Fig. 3](https://arxiv.org/html/2605.20258#S4.F3) (Middle), it reaches a high Complete score much earlier than CI-RL, exceeding 40% within 3 epochs, compared with 15 epochs for CI-RL. This reflects a key challenge of reward-based optimization: models must learn complex, context-dependent norms from coarse-grained reward signals. In contrast, SelfCI benefits from dense logit-level supervision through the KL objective and from a teacher constructed using rich feedback, enabling effective and efficient optimization. The wall-clock comparison in [Fig. 3](https://arxiv.org/html/2605.20258#S4.F3) (Right) further shows that SelfCI reduces GPU time per step by nearly half, as it requires only one rollout per prompt compared to 16 in CI-RL.

#### Limitations of External-Teacher Distillation.

ContextDistill generalizes less effectively on the CI-RL test set, suggesting that external-teacher supervision is ill-suited for context-dependent CI norms. On Qwen3-4B-Instruct, it improves Integrity but remains below CI-RL in Complete (40 vs. 45.21) and trails SelfCI by 15.34 percentage points. This pattern is consistent with exposure bias: the student is trained on teacher-generated trajectories that differ from its own generations [[2](https://arxiv.org/html/2605.20258#bib.bib51)]. In contrast, SelfCI uses on-policy generations and constructs the teacher from the same model under different conditioning, reducing distributional mismatch and improving test-time CI alignment.

#### Generalization to Agentic Tasks.

[Sec. 4](https://arxiv.org/html/2605.20258#S4) further reports out-of-domain results on PrivacyLens. On Qwen3-4B-Instruct, SelfCI achieves the lowest leakage, reducing LR from 56.59 to 47.06 and ALR from 58.14 to 48.17, while also attaining the highest Helpful score (2.62). The gain is more pronounced on Qwen3-4B, where SelfCI reduces LR from 40.97 to 32.45 and ALR from 52.23 to 42.37, again with the highest Helpful score (1.92). In contrast, both CI-RL and ContextDistill transfer less effectively. ContextDistill retains high LR on Qwen3-4B-Instruct (55.98), suggesting that offline distillation suffers from exposure bias under complete out-of-domain shift. CI-RL also underperforms despite using on-policy generations, reducing LR only to 53.75 on Qwen3-4B-Instruct and to 37.93 on Qwen3-4B. This suggests that coarse sequence-level rewards do not yield sufficiently generalizable CI behavior. The PrivacyLens results highlight SelfCI as a strong alignment method for personal agents, achieving *privacy without utility loss* in agentic workflows.

### 4.3 Robustness under Increasing Complexity

To assess robustness under growing complexity, we evaluate SelfCI on CIMemories [[30](https://arxiv.org/html/2605.20258#bib.bib7)] (see [Sec. C.1](https://arxiv.org/html/2605.20258#A3.SS1) for details). In this benchmark, user attributes accumulate across sequential tasks, and the same attribute may be appropriate in one context but inappropriate in another. As the memory grows, the model must make increasingly many context-dependent disclosure decisions, making fixed suppression rules insufficient.

[Fig. 4](https://arxiv.org/html/2605.20258#S4.F4) reports Violation@5, an attribute-level ever-leakage rate, as a function of the number of observed tasks. As more attributes accumulate, the baselines exhibit compounding privacy failures: the Initial model and CI-RL reach approximately 26% and 21% Violation@5 after 48 tasks, respectively, while ContextDistill also increases steadily. In contrast, SelfCI keeps Violation@5 below 5%, suggesting a stable context-conditioned disclosure boundary under accumulated memory.
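
One plausible reading of an "attribute-level ever-leakage rate" is sketched below: an attribute counts as violated if it leaks in at least one of the $k=5$ sampled responses in a context where it is disallowed. This reading, the input layout, and the name `violation_at_k` are assumptions; the benchmark's exact definition is given in the appendix and is not reproduced here.

```python
def violation_at_k(leaks_per_attribute):
    """Percent of attributes that leak in at least one of their k samples.

    leaks_per_attribute: dict mapping an attribute name to a list of k booleans,
    where True means the attribute appeared in that sample while disallowed.
    This "ever-leaked" aggregation is an assumed reading of Violation@k.
    """
    if not leaks_per_attribute:
        raise ValueError("need at least one attribute")
    violated = sum(1 for samples in leaks_per_attribute.values() if any(samples))
    return 100.0 * violated / len(leaks_per_attribute)
```

Under this reading the metric is strict: a single leak among five samples marks the attribute as violated, so it can only stay low if the model is consistent across samples.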

![Refer to caption](https://arxiv.org/html/2605.20258v1/x4.png) Figure 4: Violation rate on CIMemories under progressively accumulating tasks, measured with Qwen3-4B-Instruct.

![Refer to caption](https://arxiv.org/html/2605.20258v1/x5.png) Figure 5: Analysis of the ideal CI surrogate in [Eq. 1](https://arxiv.org/html/2605.20258#S2.E1) using Qwen3-4B-Instruct. (Left) Utility scores of target distributions on the CI-RL test set. (Right) Per-epoch Utility and Integrity scores trained with [Eq. 1](https://arxiv.org/html/2605.20258#S2.E1) or [Eq. 5](https://arxiv.org/html/2605.20258#S3.E5).

### 4.4 Analysis on Feedback and Teacher Decomposition

#### Operationalizing the Ideal CI Objective with Feedback.

While [Eq. 1](https://arxiv.org/html/2605.20258#S2.E1) operationalizes the ideal CI state as invariance to disallowed information, directly treating the policy conditioned only on the set of allowed attributes $\mathcal{A}_{\mathcal{T}}$ as the reference can be under-specified in practice: removing $\mathcal{D}_{\mathcal{T}}$ does not tell the model which attributes in $\mathcal{A}_{\mathcal{T}}$ should be used, why they are task-relevant, or how they should appear in the response. Consistent with this, [Fig. 5](https://arxiv.org/html/2605.20258#S4.F5) (Left) shows that this allowed-only target yields lower Utility than the PoE target induced by SelfCI, suggesting that invariance to disallowed information alone does not guarantee task-complete behavior.

To test this directly, we optimize the student with [Eq. 1](https://arxiv.org/html/2605.20258#S2.E1) and compare it against SelfCI trained with [Eq. 5](https://arxiv.org/html/2605.20258#S3.E5). As shown in [Fig. 5](https://arxiv.org/html/2605.20258#S4.F5) (Right), [Eq. 1](https://arxiv.org/html/2605.20258#S2.E1) improves Integrity but causes Utility to drop substantially, indicating that the allowed-only target provides an unstable utility signal and increasingly biases the model toward suppression. In contrast, SelfCI retains Utility while improving Integrity by decomposing the target into feedback-conditioned $\pi_{\text{allow}}$ and $\pi_{\text{disallow}}$. Although [Eq. 1](https://arxiv.org/html/2605.20258#S2.E1) remains a meaningful surrogate for the ideal CI objective, SelfCI’s feedback-based decomposition in [Eq. 5](https://arxiv.org/html/2605.20258#S3.E5) provides a more practical way to optimize toward it.

Table 2: Results under keyword-only and feedback-based privileged contexts $c$ in [Eq. 6](https://arxiv.org/html/2605.20258#A1.E6).

#### Role of Feedback-Based Context.

To isolate the role of feedback, we use a keyword-only context listing allowed and disallowed attributes as a control. While the keyword-only context specifies the attribute partition, it lacks rationales for task-specific transmission norms.

As shown in [Tab. 2](https://arxiv.org/html/2605.20258#S4.T2), feedback improves Complete on both Qwen3-4B-Instruct and Qwen3-4B, with the reasoning model showing a substantial gain of 12.05 percentage points. This suggests that coarse keywords induce a less informative teacher during longer generation, whereas feedback provides richer context for shaping the teacher distribution.

#### Effect of Teacher Decomposition.

We then examine whether the two feedback types should induce complementary teachers, instead of being collapsed into a single monolithic teacher. As a control, we concatenate all feedback into a single context, $\tilde{f} = \texttt{concat}(\tilde{f}_{\text{allow}}, \tilde{f}_{\text{disallow}})$, and optimize the policy with a single KL divergence against the resulting monolithic teacher.

Table 3: Results under single and decomposed teacher constructions, SelfCI.

As shown in [Tab. 3](https://arxiv.org/html/2605.20258#S4.T3), decomposing feedback into complementary teachers yields higher Complete scores than the single teacher on both Qwen3-4B-Instruct and Qwen3-4B, with gains of 3.83 and 3.29 percentage points, respectively. This supports our design: separate teachers, $\pi_{\text{allow}}$ and $\pi_{\text{disallow}}$, guide the policy toward their intersection, where Utility and Integrity are jointly satisfied, whereas a single teacher provides less discriminative supervision. Importantly, these gains incur only marginal overhead: about a 5–6% increase in per-step training time.

![Refer to caption](https://arxiv.org/html/2605.20258v1/x6.png) Figure 6: (Left) Integrity-Utility balance on the CI-RL test set for Qwen3-4B-Instruct trained with different $\lambda$ values in [Eq. 5](https://arxiv.org/html/2605.20258#S3.E5). (Middle) Per-epoch Complete score of feedback-conditioned teachers on the CI-RL training set. (Right) Complete score across Qwen3 model family on the CI-RL test set.

### 4.5 Coefficient Sensitivity and Scaling Behavior

#### Effect of the Coefficient $\lambda$.

[Fig. 6](https://arxiv.org/html/2605.20258#S4.F6) (Left) evaluates the student policy trained under different $\lambda$ values. When $\lambda=0$, the student is trained only toward $\pi_{\text{disallow}}$, while $\lambda=1$ trains it only toward $\pi_{\text{allow}}$. These endpoints exhibit *opposite* failure modes: the disallow-only objective enforces stronger Integrity at the expense of Utility, whereas the allow-only objective preserves Utility but fails to maintain Integrity. Increasing $\lambda$ shifts the model from conservative to permissive behavior, trading Integrity for Utility. The default $\lambda=0.5$ provides the best Pareto trade-off, improving Integrity over the allow-only setting while retaining much of the Utility lost in the disallow-only setting.
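
The role of $\lambda$ can be made explicit with a short derivation. It assumes that Eq. 5 takes the form of a $\lambda$-weighted sum of two reverse KL divergences, which is consistent with the endpoint behavior described above but is not quoted from the equation itself. Conditioning on the prompt and feedback is omitted for brevity, and $Z_\lambda$ is a normalizer introduced here.

```latex
\begin{aligned}
\mathcal{L}_{\lambda}(\pi_\theta)
  &= \lambda\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\text{allow}}\right)
   + (1-\lambda)\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\text{disallow}}\right) \\
  &= \sum_{y} \pi_\theta(y) \log \pi_\theta(y)
   - \sum_{y} \pi_\theta(y) \left[ \lambda \log \pi_{\text{allow}}(y)
   + (1-\lambda) \log \pi_{\text{disallow}}(y) \right] \\
  &= D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\text{PoE}}\right) - \log Z_{\lambda},
\end{aligned}
\qquad
\pi_{\text{PoE}}(y) = \frac{\pi_{\text{allow}}(y)^{\lambda}\, \pi_{\text{disallow}}(y)^{1-\lambda}}{Z_{\lambda}},
\quad
Z_{\lambda} = \sum_{y} \pi_{\text{allow}}(y)^{\lambda}\, \pi_{\text{disallow}}(y)^{1-\lambda}.
```

Since $Z_\lambda$ does not depend on $\pi_\theta$, minimizing the weighted sum is equivalent to minimizing a single reverse KL to the normalized weighted product. Setting $\lambda=0$ or $\lambda=1$ gives $Z_\lambda=1$ and recovers $\pi_{\text{disallow}}$ or $\pi_{\text{allow}}$ as the target, matching the two endpoints discussed above.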

We further examine the teacher behavior that gives rise to this student-level trade-off. To evaluate the combined teacher target explicitly, we decode from $\pi_{\text{PoE}}$, obtained by normalizing the weighted product of the next-token distributions from $\pi_{\text{allow}}$ and $\pi_{\text{disallow}}$ with $\lambda=0.5$. As shown in [Fig. 6](https://arxiv.org/html/2605.20258#S4.F6) (Middle), $\pi_{\text{PoE}}$ achieves the strongest Complete score after several epochs, rising above both individual teachers, $\pi_{\text{allow}}$ and $\pi_{\text{disallow}}$. This suggests that $\pi_{\text{PoE}}$ is a more suitable distillation target.
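
The normalized weighted product of two next-token distributions can be sketched in a few lines. The version below works on plain Python lists over a toy vocabulary and assumes strictly positive probabilities; `product_of_experts` and `kl_divergence` are illustrative names, and a real implementation would operate on per-token logits from the two feedback-conditioned teachers.

```python
import math


def product_of_experts(p_allow, p_disallow, lam=0.5):
    """Normalize p_allow**lam * p_disallow**(1 - lam) over the vocabulary.

    Assumes every probability is strictly positive (log of zero is undefined).
    """
    log_w = [lam * math.log(a) + (1.0 - lam) * math.log(d)
             for a, d in zip(p_allow, p_disallow)]
    shift = max(log_w)                        # subtract the max for numerical stability
    w = [math.exp(v - shift) for v in log_w]
    total = sum(w)
    return [v / total for v in w]


def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
```

For example, with `p_allow = [0.6, 0.35, 0.05]` and `p_disallow = [0.05, 0.35, 0.6]`, the product at `lam=0.5` is roughly `[0.25, 0.50, 0.25]`: the token that both teachers consider plausible ends up with the most mass, while tokens favored by only one teacher are down-weighted.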

#### Scaling Behavior of SelfCI.

[Fig. 6](https://arxiv.org/html/2605.20258#S4.F6) (Right) shows how SelfCI scales across the Qwen3 model family. CI-RL achieves a strong Complete score at 0.6B, but its gains do not persist at larger scales. CI-RL remains close to the initial models at 4B and 8B, suggesting that optimization with scalar reward alone can be insufficient when larger models already possess a strong prior for task completion.

In contrast, SelfCI improves over the initial model *at every scale*, with a representative gain from 23.84 to 49.58 at 8B. This consistent trend indicates that SelfCI remains effective across model sizes. The improvement is relatively smaller at 0.6B, which is expected since self-distillation relies on the model’s in-context learning capability. These results suggest that SelfCI may offer a practical route to scaling alignment to stronger models, where obtaining an external teacher may be impractical.

### 4.6 Analysis on Teacher Selection

Table 4: Comparison of different teacher choices. The student is Qwen3-4B-Instruct in all settings. $\dagger$ indicates that the teacher is Qwen3-32B with thinking disabled; otherwise, the teacher is a self-teacher.

[Tab. 4](https://arxiv.org/html/2605.20258#S4.T4) examines how teacher choice affects CI alignment. Ablating EMA from SelfCI makes the teacher stale as the student evolves, degrading stability and alignment; gradual EMA updates alleviate this issue. We then replace the self-teacher with a fixed, feedback-conditioned larger teacher. Despite improving over the EMA-ablated variant, it remains below SelfCI in Integrity and Complete, suggesting that distributional mismatch offsets the benefit of greater teacher capacity.

We then ask whether reducing this mismatch is sufficient. Inspired by offline self-distillation [[23](https://arxiv.org/html/2605.20258#bib.bib46)], we construct offline data as in ContextDistill, but replace the larger teacher with the student itself to produce feedback-conditioned responses. However, the Utility drop suggests that naive imitation overfits to self-teacher responses rather than preserving the model’s original capabilities.
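
The EMA self-teacher mentioned in this section can be illustrated with a generic exponential-moving-average update. The sketch uses a dictionary of scalar parameters as a stand-in for model weights, and the decay value is a hypothetical choice; this section does not state the decay used in the experiments.

```python
def ema_update(teacher, student, decay=0.99):
    """Return teacher weights moved a small step toward the student.

    teacher, student: dicts mapping parameter names to floats (stand-ins for tensors).
    decay is a hypothetical value; larger values make the teacher change more slowly.
    """
    return {name: decay * teacher[name] + (1.0 - decay) * student[name]
            for name in teacher}
```

With `decay=1.0` the teacher never moves, which corresponds to the stale, EMA-ablated teacher; with `decay=0.0` the teacher copies the student at every step. Intermediate values let the teacher follow the student gradually.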

## 5 Conclusion

In this work, we interpreted CI alignment as a form of context-dependent invariance, where the model should be invariant to information disallowed in the current context while remaining responsive to information required for task completion. Motivated by this view, we proposed SelfCI, a complementary self-distillation framework using two feedback-conditioned self-teachers. The resulting PoE target decomposes CI alignment into explicit retain and suppress signals, enabling the policy to satisfy task utility and minimal disclosure jointly. Empirically, SelfCI consistently improves this privacy-utility trade-off across instruction-tuned and reasoning models, generalizes to out-of-domain agentic workflows, and remains robust under accumulated private context.

## Limitations

Although our approach shows promising results, several limitations remain. First, SelfCI relies on structured synthetic data [[22](https://arxiv.org/html/2605.20258#bib.bib6)] with explicit attribute annotations, which may not fully capture real-world ambiguity in CI norms. Still, all baselines are compared under the same data budget and number of gradient updates, allowing controlled evaluation of sample efficiency and generalization. Second, like other self-distillation methods, SelfCI relies on the model’s ability to generate and use feedback as privileged context, which may limit its effectiveness for smaller models (i.e., Qwen3-0.6B) with weaker in-context learning ability. Third, we use a static $\lambda$ to balance the complementary teachers; although we analyze its effect in [Sec. 4.5](https://arxiv.org/html/2605.20258#S4.SS5), adaptive coefficient selection remains future work. Finally, our evaluation focuses on final responses, leaving explicit analysis of leakage in reasoning traces or intermediate tool states for future work.

## Broader Impacts and Ethics Statement

Our framework aims to improve the contextual privacy behavior of LLM assistants operating over sensitive user context. SelfCI enables CI alignment through self-distillation without relying on strong proprietary teacher models or manually crafted disclosure rationales, thereby making privacy-oriented adaptation more sample efficient. Moreover, since SelfCI only requires feedback-conditioned teacher distributions instantiated from the target model itself, it is not tied to a particular backbone and can benefit personal agents that must leverage task-relevant information while avoiding unnecessary disclosure of private attributes. Nevertheless, even aligned assistants may remain vulnerable to prompt injection and adversarial instructions, which we leave for future work.

## References

- [1] (2025) Firewalls to secure dynamic llm agentic networks. arXiv preprint arXiv:2502.01822. Cited by: [Appendix A](https://arxiv.org/html/2605.20258#A1.SS0.SSS0.Px1.p1.1).
- [2] R. Agarwal, N. Vieillard, Y. Zhou, P. Stanczyk, S. Ramos, M. Geist, and O. Bachem (2024) On-policy distillation of language models: learning from self-generated mistakes. International Conference on Learning Representations (ICLR). Cited by: [§4.2](https://arxiv.org/html/2605.20258#S4.SS2.SSS0.Px3.p1.3).
- [3] E. Bagdasarian, R. Yi, S. Ghalebikesabi, P. Kairouz, M. Gruteser, S. Oh, B. Balle, and D. Ramage (2024) AirGapAgent: protecting privacy-conscious conversational agents. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, New York, NY, USA, pp. 3868–3882. External Links: [Link](https://doi.org/10.1145/3658644.3690350). Cited by: [Appendix A](https://arxiv.org/html/2605.20258#A1.SS0.SSS0.Px1.p1.1).
- [4] A. Barth, A. Datta, J. C. Mitchell, and H. Nissenbaum (2006) Privacy and contextual integrity: framework and applications. In Proceedings of the 2006 IEEE Symposium on Security and Privacy, USA, pp. 184–198. External Links: [Link](https://doi.org/10.1109/SP.2006.32). Cited by: [Appendix A](https://arxiv.org/html/2605.20258#A1.SS0.SSS0.Px1.p1.1), [§1](https://arxiv.org/html/2605.20258#S1.p1.1), [§2](https://arxiv.org/html/2605.20258#S2.SS0.SSS0.Px1.p1.1).
- [5] D. Biderman, J. Portes, J. J. G. Ortiz, M. Paul, P. Greengard, C. Jennings, D. King, S. Havens, V. Chiley, J. Frankle, C. Blakeney, and J. P. Cunningham (2024) LoRA learns less and forgets less. Transactions on Machine Learning Research (TMLR). Cited by: [§4.1](https://arxiv.org/html/2605.20258#S4.SS1.SSS0.Px3.p2.6).
- [6] N. Carlini, F. Tramèr, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. Brown, D. Song, Ú. Erlingsson, A. Oprea, and C. Raffel (2021-08) Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), pp. 2633–2650. External Links: ISBN 978-1-939133-24-3, [Link](https://www.usenix.org/conference/usenixsecurity21/presentation/carlini-extracting). Cited by: [§1](https://arxiv.org/html/2605.20258#S1.p2.1).
- [7] Z. Cheng, D. Wan, M. Abueg, S. Ghalebikesabi, R. Yi, E. Bagdasarian, B. Balle, S. Mellem, and S. O’Banion (2024) Ci-bench: benchmarking contextual integrity of ai assistants on synthetic data. arXiv preprint arXiv:2409.13903. Cited by: [Appendix A](https://arxiv.org/html/2605.20258#A1.SS0.SSS0.Px1.p1.1).
- [8] A. Das, S. S. Chintha, R. Girmal, K. Pandey, and S. Endait (2026) Chain-of-sanitized-thoughts: plugging pii leakage in cot of large reasoning models. arXiv preprint arXiv:2601.05076. Cited by: [Appendix A](https://arxiv.org/html/2605.20258#A1.SS0.SSS0.Px1.p1.1), [§1](https://arxiv.org/html/2605.20258#S1.p3.1), [§3](https://arxiv.org/html/2605.20258#S3.p1.1).
- [9] M. D. Donsker and S. S. Varadhan (1975) Asymptotic evaluation of certain markov process expectations for large time, i. Communications on pure and applied mathematics 28(1), pp. 1–47. Cited by: [Appendix G](https://arxiv.org/html/2605.20258#A7.2.p1.1).
- [10] C. Dwork and A. Roth (2014) The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, Vol. 9, Now Publishers Inc., Hanover, MA. External Links: ISBN 9781601988188. Cited by: [§2](https://arxiv.org/html/2605.20258#S2.SS0.SSS0.Px2.p1.1).
- [11] W. Fan, H. Li, Z. Deng, W. Wang, and Y. Song (2024-11) GoldCoin: grounding large language models in privacy laws via contextual integrity theory. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, Florida, USA, pp. 3321–3343. External Links: [Link](https://aclanthology.org/2024.emnlp-main.195/). Cited by: [Appendix A](https://arxiv.org/html/2605.20258#A1.SS0.SSS0.Px1.p1.1), [§1](https://arxiv.org/html/2605.20258#S1.p3.1).
- [12] S. Ghalebikesabi, E. Bagdasaryan, R. Yi, I. Yona, I. Shumailov, A. Pappu, C. Shi, L. Weidinger, R. Stanforth, L. Berrada, et al. (2024) Operationalizing contextual integrity in privacy-conscious assistants. arXiv preprint arXiv:2408.02373. Cited by: [Appendix A](https://arxiv.org/html/2605.20258#A1.SS0.SSS0.Px1.p1.1).
- [13] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. (2024) The llama 3 herd of models. arXiv preprint arXiv:2407.21783. Cited by: [§4.1](https://arxiv.org/html/2605.20258#S4.SS1.SSS0.Px3.p1.2).
- [14] D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, et al. (2025) Deepseek-r1: incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948. Cited by: [§4.1](https://arxiv.org/html/2605.20258#S4.SS1.SSS0.Px3.p1.2).
- [15] G. E. Hinton (2002-08) Training products of experts by minimizing contrastive divergence. Neural Comput. 14(8), pp. 1771–1800. External Links: ISSN 0899-7667, [Link](https://doi.org/10.1162/089976602760128018). Cited by: [§1](https://arxiv.org/html/2605.20258#S1.p5.1), [§3.2](https://arxiv.org/html/2605.20258#S3.SS2.SSS0.Px2.p2.1).
- [16] E. J. Hu, yelong shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen (2022) LoRA: low-rank adaptation of large language models. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=nZeVKeeFYf9). Cited by: [§4.1](https://arxiv.org/html/2605.20258#S4.SS1.SSS0.Px3.p2.6).
- [17] W. Hu, H. Li, H. Jing, Q. Hu, Z. Zeng, S. Han, X. Heli, T. Chu, P. Hu, and Y. Song (2025-11) Context reasoner: incentivizing reasoning capability for contextualized privacy and safety compliance via reinforcement learning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Suzhou, China, pp. 865–883. External Links: [Link](https://aclanthology.org/2025.emnlp-main.44/). Cited by: [Appendix A](https://arxiv.org/html/2605.20258#A1.SS0.SSS0.Px1.p1.1), [§1](https://arxiv.org/html/2605.20258#S1.p3.1).
- [18] J. Hübotter, F. Lübeck, L. Behric, A. Baumann, M. Bagatella, D. Marta, I. Hakimi, I. Shenfeld, T. K. Buening, C. Guestrin, et al. (2026) Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802. Cited by: [Appendix A](https://arxiv.org/html/2605.20258#A1.SS0.SSS0.Px2.p1.2), [§1](https://arxiv.org/html/2605.20258#S1.p4.1), [§3](https://arxiv.org/html/2605.20258#S3.p2.1).
- [19] H. Jing, H. Li, W. Hu, Q. Hu, X. Heli, T. Chu, P. Hu, and Y. Song (2025-11) MCIP: protecting MCP safety via model contextual integrity protocol. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Suzhou, China, pp. 1177–1194. External Links: [Link](https://aclanthology.org/2025.emnlp-main.62/). Cited by: [Appendix A](https://arxiv.org/html/2605.20258#A1.SS0.SSS0.Px1.p1.1), [§1](https://arxiv.org/html/2605.20258#S1.p3.1).
- [20] P. Kumaraguru and L. F. Cranor (2005) Privacy indexes: a survey of westin’s studies. Institute for Software Research International. Cited by: [§C.1](https://arxiv.org/html/2605.20258#A3.SS1.SSS0.Px3.p1.2).
- [21] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica (2023) Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, SOSP ’23, New York, NY, USA, pp. 611–626. External Links: ISBN 9798400702297, [Link](https://doi.org/10.1145/3600006.3613165). Cited by: [§C.3](https://arxiv.org/html/2605.20258#A3.SS3.p1.4).
- \[22\]G\. Lan, H\. A\. Inan, S\. Abdelnabi, J\. Kulkarni, L\. Wutschitz, R\. Shokri, C\. Brinton, and R\. Sim\(2025\)Contextual integrity in LLMs via reasoning and reinforcement learning\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=Xm57IXqU0n)Cited by:[Appendix A](https://arxiv.org/html/2605.20258#A1.SS0.SSS0.Px1.p1.1),[§B\.1](https://arxiv.org/html/2605.20258#A2.SS1.p1.1),[§B\.2](https://arxiv.org/html/2605.20258#A2.SS2.p1.2),[§C\.1](https://arxiv.org/html/2605.20258#A3.SS1.SSS0.Px1.p1.1),[§C\.2](https://arxiv.org/html/2605.20258#A3.SS2.SSS0.Px1.p1.7),[Figure 10](https://arxiv.org/html/2605.20258#A7.F10),[Figure 14](https://arxiv.org/html/2605.20258#A7.F14),[Figure 16](https://arxiv.org/html/2605.20258#A7.F16),[§1](https://arxiv.org/html/2605.20258#S1.p3.1),[§1](https://arxiv.org/html/2605.20258#S1.p6.1),[§2](https://arxiv.org/html/2605.20258#S2.SS0.SSS0.Px1.p2.9),[§3\.1](https://arxiv.org/html/2605.20258#S3.SS1.p1.6),[§3](https://arxiv.org/html/2605.20258#S3.p1.1),[§4\.1](https://arxiv.org/html/2605.20258#S4.SS1.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2605.20258#S4.SS1.SSS0.Px2.p1.5),[§4\.1](https://arxiv.org/html/2605.20258#S4.SS1.SSS0.Px3.p1.2),[§4\.1](https://arxiv.org/html/2605.20258#S4.SS1.SSS0.Px3.p2.6),[§5](https://arxiv.org/html/2605.20258#Sx1.p1.2)\.
- \[23\]S\. Lee, S\. Park, Y\. Choi, G\. Kim, M\. Kang, J\. Yun, D\. Park, J\. Park, and S\. J\. Hwang\(2026\)THINKSAFE: self\-generated safety alignment for reasoning models\.arXiv preprint arXiv:2601\.23143\.Cited by:[§4\.6](https://arxiv.org/html/2605.20258#S4.SS6.p2.1)\.
- \[24\]H\. Li, W\. Fan, Y\. Chen, C\. Jiayang, T\. Chu, X\. Zhou, P\. Hu, and Y\. Song\(2025\-04\)Privacy checklist: privacy violation detection grounding on contextual integrity theory\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),Albuquerque, New Mexico,pp\. 1748–1766\.External Links:[Link](https://aclanthology.org/2025.naacl-long.86/)Cited by:[Appendix A](https://arxiv.org/html/2605.20258#A1.SS0.SSS0.Px1.p1.1)\.
- \[25\]H\. Li, W\. Hu, H\. Jing, Y\. Chen, Q\. Hu, S\. Han, T\. Chu, P\. Hu, and Y\. Song\(2025\-07\)PrivaCI\-bench: evaluating privacy with contextual integrity and legal compliance\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),Vienna, Austria,pp\. 10544–10559\.External Links:[Link](https://aclanthology.org/2025.acl-long.518/)Cited by:[Appendix A](https://arxiv.org/html/2605.20258#A1.SS0.SSS0.Px1.p1.1)\.
- \[26\]W\. Li, L\. Sun, Z\. Guan, X\. Zhou, and M\. Sap\(2025\-08\)1\-2\-3 check: enhancing contextual privacy in LLM via multi\-agent reasoning\.InProceedings of the The First Workshop on LLM Security \(LLMSEC\),Vienna, Austria,pp\. 115–128\.External Links:[Link](https://aclanthology.org/2025.llmsec-1.9/)Cited by:[Appendix A](https://arxiv.org/html/2605.20258#A1.SS0.SSS0.Px1.p1.1)\.
- \[27\]Y\. Li, H\. Wen, W\. Wang, X\. Li, Y\. Yuan, G\. Liu, J\. Liu, W\. Xu, X\. Wang, Y\. Sun, R\. Kong, Y\. Wang, H\. Geng, J\. Luan, X\. Jin, Z\. Ye, G\. Xiong, F\. Zhang, X\. Li, M\. Xu, Z\. Li, P\. Li, Y\. Liu, Y\. Zhang, and Y\. Liu\(2024\)Personal llm agents: insights and survey about the capability, efficiency and security\.arXiv preprint arXiv:2401\.05459\.Cited by:[§1](https://arxiv.org/html/2605.20258#S1.p1.1)\.
- \[28\]I\. Loshchilov and F\. Hutter\(2019\)Decoupled weight decay regularization\.International Conference on Learning Representations \(ICLR\)\.Cited by:[§4\.1](https://arxiv.org/html/2605.20258#S4.SS1.SSS0.Px3.p2.6)\.
- \[29\]N\. Mireshghallah, H\. Kim, X\. Zhou, Y\. Tsvetkov, M\. Sap, R\. Shokri, and Y\. Choi\(2024\)Can LLMs keep a secret? testing privacy implications of language models via contextual integrity theory\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=gmg7t8b4s0)Cited by:[Appendix A](https://arxiv.org/html/2605.20258#A1.SS0.SSS0.Px1.p1.1)\.
- \[30\]N\. Mireshghallah, N\. Mangaokar, N\. Kokhlikyan, A\. Zharmagambetov, M\. Zaheer, S\. Mahloujifar, and K\. Chaudhuri\(2026\)CIMemories: a compositional benchmark for contextual integrity in LLMs\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=YnNIp38v1M)Cited by:[Appendix A](https://arxiv.org/html/2605.20258#A1.SS0.SSS0.Px1.p1.1),[§C\.1](https://arxiv.org/html/2605.20258#A3.SS1.SSS0.Px3.p1.2),[§1](https://arxiv.org/html/2605.20258#S1.p6.1),[§2](https://arxiv.org/html/2605.20258#S2.SS0.SSS0.Px1.p2.9),[§3\.1](https://arxiv.org/html/2605.20258#S3.SS1.p2.1),[§4\.3](https://arxiv.org/html/2605.20258#S4.SS3.p1.1)\.
- \[31\]S\. Mukhopadhyay, S\. Reddy, S\. Muthukumar, J\. An, and P\. Kumaraguru\(2025\)PrivacyBench: a conversational benchmark for evaluating privacy in personalized ai\.arXiv preprint arXiv:2512\.24848\.Cited by:[Appendix A](https://arxiv.org/html/2605.20258#A1.SS0.SSS0.Px1.p1.1)\.
- \[32\]H\. Nissenbaum\(2004\)Privacy as contextual integrity\.Washington Law Review79\(1\),pp\. 119\.Cited by:[Appendix A](https://arxiv.org/html/2605.20258#A1.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2605.20258#S1.p1.1),[§2](https://arxiv.org/html/2605.20258#S2.SS0.SSS0.Px1.p1.1)\.
- \[33\]H\. Nissenbaum\(2009\)Privacy in context: technology, policy, and the integrity of social life\.InPrivacy in context,Cited by:[Appendix A](https://arxiv.org/html/2605.20258#A1.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2605.20258#S1.p1.1),[§2](https://arxiv.org/html/2605.20258#S2.SS0.SSS0.Px1.p1.1)\.
- \[34\]T\. Olmo, A\. Ettinger, A\. Bertsch, B\. Kuehl, D\. Graham, D\. Heineman, D\. Groeneveld, F\. Brahman, F\. Timbers, H\. Ivison,et al\.\(2025\)Olmo 3\.arXiv preprint arXiv:2512\.13961\.Cited by:[§4\.1](https://arxiv.org/html/2605.20258#S4.SS1.SSS0.Px3.p1.2)\.
- \[35\]Y\. Shao, T\. Li, W\. Shi, Y\. Liu, and D\. Yang\(2024\)PrivacyLens: evaluating privacy norm awareness of language models in action\.InThe Thirty\-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track,External Links:[Link](https://openreview.net/forum?id=CxNXoMnCKc)Cited by:[Appendix A](https://arxiv.org/html/2605.20258#A1.SS0.SSS0.Px1.p1.1),[§C\.1](https://arxiv.org/html/2605.20258#A3.SS1.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2605.20258#S1.p6.1),[§4\.1](https://arxiv.org/html/2605.20258#S4.SS1.SSS0.Px1.p2.1)\.
- \[36\]Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. Li, Y\. Wu,et al\.\(2024\)Deepseekmath: pushing the limits of mathematical reasoning in open language models\.arXiv preprint arXiv:2402\.03300\.Cited by:[§C\.2](https://arxiv.org/html/2605.20258#A3.SS2.SSS0.Px1.p1.7),[§4\.1](https://arxiv.org/html/2605.20258#S4.SS1.SSS0.Px2.p1.5)\.
- \[37\]I\. Shenfeld, M\. Damani, J\. Hübotter, and P\. Agrawal\(2026\)Self\-distillation enables continual learning\.arXiv preprint arXiv:2601\.19897\.Cited by:[Appendix A](https://arxiv.org/html/2605.20258#A1.SS0.SSS0.Px2.p1.2),[§1](https://arxiv.org/html/2605.20258#S1.p4.1),[§3](https://arxiv.org/html/2605.20258#S3.p2.1)\.
- \[38\]A\. Singh, A\. Fry, A\. Perelman, A\. Tart, A\. Ganesh, A\. El\-Kishky, A\. McLaughlin, A\. Low, A\. Ostrow, A\. Ananthram,et al\.\(2025\)Openai gpt\-5 system card\.arXiv preprint arXiv:2601\.03267\.Cited by:[§C\.1](https://arxiv.org/html/2605.20258#A3.SS1.SSS0.Px2.p1.1),[§C\.1](https://arxiv.org/html/2605.20258#A3.SS1.SSS0.Px3.p1.2),[§4\.1](https://arxiv.org/html/2605.20258#S4.SS1.SSS0.Px1.p2.1)\.
- \[39\]C\. Snell, D\. Klein, and R\. Zhong\(2022\)Learning by distilling context\.arXiv preprint arXiv:2209\.15189\.Cited by:[§4\.1](https://arxiv.org/html/2605.20258#S4.SS1.SSS0.Px2.p1.5)\.
- \[40\]N\. Srivastava, G\. Hinton, A\. Krizhevsky, I\. Sutskever, and R\. Salakhutdinov\(2014\)Dropout: a simple way to prevent neural networks from overfitting\.The journal of machine learning research\.Cited by:[§4\.1](https://arxiv.org/html/2605.20258#S4.SS1.SSS0.Px3.p2.6)\.
- \[41\]A\. Tarvainen and H\. Valpola\(2017\)Mean teachers are better role models: weight\-averaged consistency targets improve semi\-supervised deep learning results\.InProceedings of the 31st International Conference on Neural Information Processing Systems,NIPS’17,Red Hook, NY, USA,pp\. 1195–1204\.External Links:ISBN 9781510860964Cited by:[§C\.3](https://arxiv.org/html/2605.20258#A3.SS3.p1.4)\.
- \[42\]Y\. Tu, X\. Liu, L\. Qin, and H\. Jin\(2026\)PrivacyReasoner: can llm emulate a human\-like privacy mind?\.arXiv preprint arXiv:2601\.09152\.Cited by:[Appendix A](https://arxiv.org/html/2605.20258#A1.SS0.SSS0.Px1.p1.1)\.
- \[43\]L\. von Werra, Y\. Belkada, L\. Tunstall, E\. Beeching, T\. Thrush, N\. Lambert, S\. Huang, K\. Rasul, and Q\. Gallouédec\(2020\)TRL: Transformers Reinforcement Learning\.External Links:[Link](https://github.com/huggingface/trl)Cited by:[§C\.3](https://arxiv.org/html/2605.20258#A3.SS3.p1.4)\.
- \[44\]L\. Wang, C\. Ma, X\. Feng, Z\. Zhang, H\. Yang, J\. Zhang, Z\. Chen, J\. Tang, X\. Chen, Y\. Lin, W\. X\. Zhao, Z\. Wei, and J\. Wen\(2024\-03\)A survey on large language model based autonomous agents\.Front\. Comput\. Sci\.18\(6\)\.External Links:ISSN 2095\-2228,[Link](https://doi.org/10.1007/s11704-024-40231-1)Cited by:[§1](https://arxiv.org/html/2605.20258#S1.p1.1)\.
- \[45\]S\. Wang, F\. Yu, X\. Liu, X\. Qin, J\. Zhang, Q\. Lin, D\. Zhang, and S\. Rajmohan\(2025\-11\)Privacy in action: towards realistic privacy mitigation and evaluation for LLM\-powered agents\.InFindings of the Association for Computational Linguistics: EMNLP 2025,Suzhou, China,pp\. 17055–17074\.External Links:[Link](https://aclanthology.org/2025.findings-emnlp.925/)Cited by:[Appendix A](https://arxiv.org/html/2605.20258#A1.SS0.SSS0.Px1.p1.1)\.
- \[46\]S\. Wang and H\. Zhang\(2026\)MPCI\-bench: a benchmark for multimodal pairwise contextual integrity evaluation of language model agents\.arXiv preprint arXiv:2601\.08235\.Cited by:[Appendix A](https://arxiv.org/html/2605.20258#A1.SS0.SSS0.Px1.p1.1)\.
- \[47\]Y\. Xiao, Y\. Jin, Y\. Bai, Y\. Wu, X\. Yang, X\. Luo, W\. Yu, X\. Zhao, Y\. Liu, Q\. Gu, H\. Chen, W\. Wang, and W\. Cheng\(2024\-11\)Large language models can be contextual privacy protection learners\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Miami, Florida, USA,pp\. 14179–14201\.External Links:[Link](https://aclanthology.org/2024.emnlp-main.785/)Cited by:[Appendix A](https://arxiv.org/html/2605.20258#A1.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2605.20258#S1.p3.1)\.
- \[48\]A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv,et al\.\(2025\)Qwen3 technical report\.arXiv preprint arXiv:2505\.09388\.Cited by:[§4\.1](https://arxiv.org/html/2605.20258#S4.SS1.SSS0.Px3.p1.2)\.
- \[49\]A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, G\. Dong,et al\.\(2024\)Qwen2\.5 technical report\.arXiv preprint arXiv:2412\.15115\.Cited by:[§4\.1](https://arxiv.org/html/2605.20258#S4.SS1.SSS0.Px3.p1.2)\.
- \[50\]D\. Yu, S\. Naik, A\. Backurs, S\. Gopi, H\. A\. Inan, G\. Kamath, J\. Kulkarni, Y\. T\. Lee, A\. Manoel, L\. Wutschitz, S\. Yekhanin, and H\. Zhang\(2022\)Differentially private fine\-tuning of language models\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=Q42f0dfjECO)Cited by:[§1](https://arxiv.org/html/2605.20258#S1.p2.1)\.
- \[51\]S\. Zhao, Z\. Xie, M\. Liu, J\. Huang, G\. Pang, F\. Chen, and A\. Grover\(2026\)Self\-distilled reasoner: on\-policy self\-distillation for large language models\.arXiv preprint arXiv:2601\.18734\.Cited by:[Appendix A](https://arxiv.org/html/2605.20258#A1.SS0.SSS0.Px2.p1.2),[§C\.3](https://arxiv.org/html/2605.20258#A3.SS3.p1.4),[§1](https://arxiv.org/html/2605.20258#S1.p4.1),[§3](https://arxiv.org/html/2605.20258#S3.p2.1)\.
- \[52\]A\. Zharmagambetov, C\. Guo, I\. Evtimov, M\. Pavlova, R\. Salakhutdinov, and K\. Chaudhuri\(2025\)AgentDAM: privacy leakage evaluation for autonomous web agents\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track,External Links:[Link](https://openreview.net/forum?id=qaxf7q41aK)Cited by:[Appendix A](https://arxiv.org/html/2605.20258#A1.SS0.SSS0.Px1.p1.1)\.

## Appendix

## Appendix A Related Work

#### Contextual Integrity in LLMs\.

As LLMs are increasingly embedded in personal and professional workflows, they are exposed to rich and sensitive user contexts, making Contextual Integrity (CI) \[[32](https://arxiv.org/html/2605.20258#bib.bib1), [33](https://arxiv.org/html/2605.20258#bib.bib3), [4](https://arxiv.org/html/2605.20258#bib.bib2)\] a useful framework for governing context-appropriate information flows. Early work studied CI in conversational settings \[[29](https://arxiv.org/html/2605.20258#bib.bib4), [7](https://arxiv.org/html/2605.20258#bib.bib9), [31](https://arxiv.org/html/2605.20258#bib.bib12)\], while recent work has extended CI-based evaluation and intervention to more complex settings, including autonomous agents, Model Context Protocol (MCP) environments, and multimodal interactions \[[35](https://arxiv.org/html/2605.20258#bib.bib5), [45](https://arxiv.org/html/2605.20258#bib.bib10), [52](https://arxiv.org/html/2605.20258#bib.bib11), [25](https://arxiv.org/html/2605.20258#bib.bib13), [46](https://arxiv.org/html/2605.20258#bib.bib14), [30](https://arxiv.org/html/2605.20258#bib.bib7)\]. To mitigate privacy risks, prior work has enforced CI constraints at inference time \[[12](https://arxiv.org/html/2605.20258#bib.bib8), [11](https://arxiv.org/html/2605.20258#bib.bib18), [24](https://arxiv.org/html/2605.20258#bib.bib24)\]. As LLM reasoning capabilities improve, recent approaches have sought to internalize CI reasoning through fine-tuning \[[47](https://arxiv.org/html/2605.20258#bib.bib21), [11](https://arxiv.org/html/2605.20258#bib.bib18), [19](https://arxiv.org/html/2605.20258#bib.bib15), [8](https://arxiv.org/html/2605.20258#bib.bib20)\] or reinforcement learning, rewarding information flows that conform to contextual norms \[[17](https://arxiv.org/html/2605.20258#bib.bib23), [22](https://arxiv.org/html/2605.20258#bib.bib6)\]. However, these methods often treat CI as an output-level constraint, improving privacy behavior at the cost of task performance. In contrast, SelfCI uses complementary self-teachers to optimize toward the intersection of minimal disclosure and task completion. A separate line of system-level approaches regulates information flow across tools, memory, and interacting agents \[[3](https://arxiv.org/html/2605.20258#bib.bib16), [26](https://arxiv.org/html/2605.20258#bib.bib22), [1](https://arxiv.org/html/2605.20258#bib.bib17), [45](https://arxiv.org/html/2605.20258#bib.bib10), [42](https://arxiv.org/html/2605.20258#bib.bib25)\].

#### Self\-Distillation\.

Self-distillation \[[18](https://arxiv.org/html/2605.20258#bib.bib30), [37](https://arxiv.org/html/2605.20258#bib.bib31), [51](https://arxiv.org/html/2605.20258#bib.bib32)\] trains a student policy $\pi_{\theta}$ to minimize the token-level KL divergence against a teacher distribution conditioned on privileged context $c$:

$$
\mathcal{L}_{\text{SD}}(\theta)=\sum_{t=1}^{|y|}D_{\mathrm{KL}}\left(\pi_{\theta}(\,\cdot\mid x,y_{<t})\parallel\texttt{stopgrad}(\pi_{\theta}(\,\cdot\mid x,c,y_{<t}))\right), \tag{6}
$$

where $\texttt{stopgrad}(\cdot)$ ensures the teacher distribution remains intact during optimization. In this framework, since the teacher is instantiated from the same model parameters $\theta$ but conditioned on additional context $c$, it provides dense token-level guidance while remaining close to the model’s existing capabilities. This property is especially useful for CI alignment, where the model must suppress disallowed information without losing its instruction-following nature.
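To make Eq. (6) concrete, the following is a minimal sketch of the token-level loss on toy logits. It is an illustration under stated assumptions rather than the authors' implementation: the helper names (`softmax`, `kl_divergence`, `self_distillation_loss`) are hypothetical, the logits are plain Python lists, and the stop-gradient is only mimicked by treating the teacher logits as constants.

```python
import math

def softmax(logits):
    """Turn a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """D_KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def self_distillation_loss(student_logits, teacher_logits):
    """Sum over positions t of D_KL(student_t || teacher_t), mirroring Eq. (6).

    student_logits[t] come from the model conditioned on (x, y_<t);
    teacher_logits[t] come from the same model conditioned on (x, c, y_<t).
    The teacher side is treated as a constant here (the role of stopgrad).
    """
    return sum(
        kl_divergence(softmax(s), softmax(t))
        for s, t in zip(student_logits, teacher_logits)
    )

# Toy example: two positions, vocabulary of size three (made-up numbers).
student = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
teacher = [[1.0, 1.0, -2.0], [0.1, 0.2, 0.3]]
loss = self_distillation_loss(student, teacher)
```

In an autograd framework the same quantity would be computed on tensors, with the teacher forward pass detached from the computation graph.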

## Appendix B SelfCI Framework Details

### B.1 Feedback Generation

We generate feedback from the synthetic dataset of Lan et al. \[[22](https://arxiv.org/html/2605.20258#bib.bib6)\], which contains assistant-task instances with explicit disclosure annotations. Feedback is generated for the training split and used as privileged context during training. Each instance specifies the scenario type, domain, user intention, sender, recipient, data subject, CI transmission principle, the concrete user task, and the available user attributes. It also includes annotation maps that identify which concrete attribute values are allowed or disallowed for the task. The dataset uses three CI transmission principles, whose definitions are provided in [Tab. 5](https://arxiv.org/html/2605.20258#A2.T5).

Table 5: Contextual Integrity rubrics and their definitions used for feedback generation.

For each allowed attribute $a^{(i)}\in\mathcal{A}_{\mathcal{T}}$ and disallowed attribute $d^{(i)}\in\mathcal{D}_{\mathcal{T}}$, we instantiate the corresponding instruction in [Fig. 9](https://arxiv.org/html/2605.20258#A7.F9). The instruction is filled with the user task, recipient, data subject, attribute name, concrete attribute value, and the rubric definition from [Tab. 5](https://arxiv.org/html/2605.20258#A2.T5). The allowed prompt $I_{\textbf{allow}}$ asks the model to explain why the attribute is appropriate to share in the current context, while the disallowed prompt $I_{\textbf{disallow}}$ asks why sharing the attribute violates CI. For reasoning models, we remove the reasoning block and retain only the final response after the closing reasoning tag.

### B.2 Complementary Teacher Construction

After feedback generation, we aggregate feedback within each branch and append it to the base prompt. Specifically, we concatenate the attribute-level feedback for each group as in [Eq. 3](https://arxiv.org/html/2605.20258#S3.E3), where $\texttt{concat}(\cdot)$ denotes string concatenation over the feedback snippets. The teacher and student share the same base CI-CoT \[[22](https://arxiv.org/html/2605.20258#bib.bib6)\] prompt, shown in [Fig. 8](https://arxiv.org/html/2605.20258#A7.F8); the teacher prompt is obtained by appending the aggregated feedback as a suffix. Concretely, the suffix begins with \[NOTE\] followed by a simple instruction stating that the following attributes are appropriate or inappropriate to share in this specific context, depending on the branch, and then the branch-specific feedback $\tilde{f}_{g}$. [Fig. 10](https://arxiv.org/html/2605.20258#A7.F10) shows examples of these two suffixes.
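The suffix construction described above can be sketched as follows. This is a hedged illustration only: the function names, the newline separator used for concatenation, and the exact wording of the `[NOTE]` instruction are assumptions, since the actual templates appear in the paper's figures rather than in this text.

```python
# Hypothetical instruction wording; the paper's exact suffix text is in its Fig. 10.
NOTE_INSTRUCTIONS = {
    "allow": "[NOTE] The following attributes are appropriate to share in this specific context.",
    "disallow": "[NOTE] The following attributes are inappropriate to share in this specific context.",
}

def concat_feedback(snippets):
    """Concatenate attribute-level feedback snippets (newline separator is an assumption)."""
    return "\n".join(s.strip() for s in snippets)

def build_teacher_prompt(base_prompt, snippets, branch):
    """Append the branch-specific suffix to the shared base prompt."""
    if branch not in NOTE_INSTRUCTIONS:
        raise ValueError(f"unknown branch: {branch!r}")
    suffix = NOTE_INSTRUCTIONS[branch] + "\n" + concat_feedback(snippets)
    return base_prompt + "\n\n" + suffix

# The student sees only the base prompt; each teacher sees base prompt + its suffix.
base = "Task: update the patient's contact details with the clinic."
allow_prompt = build_teacher_prompt(base, ["The phone number is needed to reach the patient."], "allow")
disallow_prompt = build_teacher_prompt(base, ["The insurance ID is unrelated to a contact update."], "disallow")
```

Keeping the base prompt identical across student and both teachers means the only difference between the three conditional distributions is the appended feedback.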

## Appendix C Additional Experimental Details

### C.1 Dataset Details

#### CI\-RL\.

CI-RL \[[22](https://arxiv.org/html/2605.20258#bib.bib6)\] serves as our in-domain benchmark, which contains synthetic assistant-task disclosure scenarios with explicit annotations over task-relevant and inappropriate information. Following the original setup, we shuffle all 729 instances with seed 42 and split them into 590 training, 66 evaluation, and 73 test instances. Each instance is rendered with the task, sender, recipient, data subject, available attributes (both those required for the task and those restricted from disclosure), and the CI-CoT prompt. Evaluation is performed on the test split using normalized string matching against the annotation maps after parsing only the final answer span, so the reasoning trace is excluded from scoring. Utility is one when all required keywords are present, Integrity is one when no restricted keyword is present, and Complete is one only when both conditions hold. We report metrics averaged over all instances and five evaluation runs.
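A minimal sketch of the three per-instance scores described above is given below. The normalization rule (lowercasing and collapsing non-alphanumeric characters) and the function names are assumptions made for illustration; the benchmark's own normalization may differ.

```python
import re

def normalize(text):
    """Hypothetical normalization: lowercase and collapse non-alphanumerics to spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def score_response(final_answer, required, restricted):
    """Score one final answer span against the annotation maps.

    Utility   = 1 if every required keyword appears in the answer.
    Integrity = 1 if no restricted keyword appears in the answer.
    Complete  = 1 only if both hold.
    """
    answer = normalize(final_answer)
    utility = int(all(normalize(k) in answer for k in required))
    integrity = int(not any(normalize(k) in answer for k in restricted))
    return {"utility": utility, "integrity": integrity, "complete": utility * integrity}

good = score_response(
    "Name: Jane Doe, phone 555-0100.",
    required=["Jane Doe", "555-0100"],
    restricted=["diabetes"],
)
bad = score_response(
    "Jane Doe has diabetes.",
    required=["Jane Doe", "555-0100"],
    restricted=["diabetes"],
)
```

Reported numbers would then be the mean of each score over all test instances and evaluation runs.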

#### PrivacyLens\.

PrivacyLens \[[35](https://arxiv.org/html/2605.20258#bib.bib5)\] evaluates CI behavior in tool-using agent trajectories, where each case contains a user instruction, available tools, a past action trajectory, an intended final action, and sensitive information items associated with the trajectory. The model generates the next final action from the trajectory state. Privacy leakage is computed from the generated final action: leakage rate (LR) is the fraction of cases in which the final action, including its tool input, contains any disallowed sensitive attribute associated with the trajectory. ALR is the corresponding helpfulness-adjusted leakage rate, computed only over cases whose final action is judged helpful. We use GPT-5-mini \[[38](https://arxiv.org/html/2605.20258#bib.bib40)\] as an LLM-as-a-Judge for task fulfillment, reported as Helpful on a $[0,3]$ scale. [Figs. 11](https://arxiv.org/html/2605.20258#A7.F11) and [12](https://arxiv.org/html/2605.20258#A7.F12) illustrate the system and user prompts used in PrivacyLens, respectively.
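The relationship between LR, ALR, and Helpful can be sketched as below. The record layout and the `helpful_threshold` parameter are assumptions: the text above does not state which judge score counts as "helpful", so the default of 2 is a placeholder rather than the benchmark's documented cutoff.

```python
def leakage_metrics(cases, helpful_threshold=2):
    """Compute LR, ALR, and mean Helpful from per-case records.

    Each case is a dict with:
      'leaked'  : bool, whether the final action contains a disallowed attribute
      'helpful' : judge score on the [0, 3] scale
    helpful_threshold is an assumed cutoff for "judged helpful".
    """
    if not cases:
        return {"LR": 0.0, "ALR": 0.0, "Helpful": 0.0}
    lr = sum(c["leaked"] for c in cases) / len(cases)
    helpful_cases = [c for c in cases if c["helpful"] >= helpful_threshold]
    alr = (
        sum(c["leaked"] for c in helpful_cases) / len(helpful_cases)
        if helpful_cases
        else 0.0
    )
    mean_helpful = sum(c["helpful"] for c in cases) / len(cases)
    return {"LR": lr, "ALR": alr, "Helpful": mean_helpful}

# Made-up records for illustration only.
toy_cases = [
    {"leaked": True, "helpful": 3},
    {"leaked": False, "helpful": 1},
    {"leaked": False, "helpful": 2},
    {"leaked": False, "helpful": 3},
]
metrics = leakage_metrics(toy_cases)
```

ALR guards against a degenerate solution: a model that refuses every task has a low LR but contributes no cases to the helpful subset.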

#### CIMemories\.

CIMemories \[[30](https://arxiv.org/html/2605.20258#bib.bib7)\] tests contextual disclosure under accumulated user memories. Each prompt contains a long memory profile and asks the model to write a message to a specific recipient for a specific purpose. We evaluate 454 scenarios relabeled by GPT-5 \[[38](https://arxiv.org/html/2605.20258#bib.bib40)\] under a multi-judge protocol with mixed Westin privacy personas \[[20](https://arxiv.org/html/2605.20258#bib.bib57)\], labeling each attribute as necessary or inappropriate only if all personas agree. As generating rationale for every attribute is impractical, we adopt a simplified version of CI-CoT, as shown in [Fig. 13](https://arxiv.org/html/2605.20258#A7.F13). After generation, GPT-5-mini \[[38](https://arxiv.org/html/2605.20258#bib.bib40)\] extracts disclosed memory attributes from each message. For the prompts used to relabel and extract revealed attributes from the generated messages, we follow the original design. For privacy measurement, unlike the usual per-response leakage rate, Violation@$k$ measures accumulated exposure under repeated use. An attribute is flagged as violated if it is inappropriately disclosed in any of the tasks over $k$ generations. We report Violation@5.
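One way to read the Violation@$k$ definition above is sketched here. The data layout and the choice of denominator (all attributes labeled inappropriate in at least one task) are assumptions; the original benchmark may aggregate differently.

```python
def violation_at_k(tasks, k=5):
    """Fraction of at-risk attributes disclosed inappropriately at least once.

    Each task is a dict with:
      'inappropriate' : set of attribute names that must not be shared in this task
      'generations'   : list of sets, the attributes disclosed in each sampled message
    Only the first k generations per task are considered.
    """
    at_risk, violated = set(), set()
    for task in tasks:
        at_risk |= task["inappropriate"]
        for disclosed in task["generations"][:k]:
            # One inappropriate disclosure in any generation flags the attribute.
            violated |= disclosed & task["inappropriate"]
    return len(violated) / len(at_risk) if at_risk else 0.0

# Made-up example: 'diagnosis' leaks once in the second generation of the first task.
toy_tasks = [
    {"inappropriate": {"diagnosis", "salary"}, "generations": [set(), {"diagnosis"}, set()]},
    {"inappropriate": {"salary", "address"}, "generations": [{"name"}, set()]},
]
v5 = violation_at_k(toy_tasks, k=5)
v1 = violation_at_k(toy_tasks, k=1)
```

Because a single leak in any generation flags the attribute, the score can only stay flat or grow with $k$, which is what makes it a measure of accumulated exposure.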

Table 6: Teacher models used to construct ContextDistill response targets.

### C.2 Baseline Details

#### CI\-RL\.

CI-RL \[[22](https://arxiv.org/html/2605.20258#bib.bib6)\] is trained with GRPO \[[36](https://arxiv.org/html/2605.20258#bib.bib45)\] using the scalar CI reward described in [Sec. 4.1](https://arxiv.org/html/2605.20258#S4.SS1), where format violations receive a reward of $-1$. We use a batch size of $16$ with $2$ gradient accumulation steps and sample $16$ completions per prompt during training. The KL coefficient is $\beta=1\times 10^{-3}$, the clipping threshold is $\epsilon=0.2$, and the entropy coefficient is $0$.

#### ContextDistill\.

ContextDistill first constructs an offline target corpus by generating one teacher response for each training instance. The teacher prompt is generated as in [Sec. B.2](https://arxiv.org/html/2605.20258#A2.SS2), using the CI-CoT prompt template with a single feedback context, except that the allowed and disallowed feedback are concatenated rather than kept as separate teachers. We employ larger teachers ranging from 32B to 70B parameters, as summarized in [Tab. 6](https://arxiv.org/html/2605.20258#A3.T6), and train with a batch size of $1$ and $2$ gradient accumulation steps.

### C.3 Additional Implementation Details

SelfCI uses equal branch weights, $\lambda=0.5$, for the allowed and disallowed feedback teachers in [Eq. 5](https://arxiv.org/html/2605.20258#S3.E5) by default, and is trained with a total batch size of $2$. Teacher parameters are initialized from the student and updated via EMA \[[41](https://arxiv.org/html/2605.20258#bib.bib56)\] with an update rate of $0.001$. Across the training of all baselines and SelfCI, rollouts and evaluation generations use temperature $0.7$. We implement all optimization-based methods with TRL \[[43](https://arxiv.org/html/2605.20258#bib.bib41)\] and use vLLM \[[21](https://arxiv.org/html/2605.20258#bib.bib55)\] for efficient on-policy generation. For Qwen3-4B, following the model-specific self-distillation approach of Zhao et al. \[[51](https://arxiv.org/html/2605.20258#bib.bib32)\], we disable the student model’s thinking mode during SelfCI training by inserting the prefix `<think>\n</think>` before the response delimiter in each assistant output, while keeping the teacher model’s thinking mode enabled.
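The EMA teacher update mentioned above can be sketched in a few lines. This assumes the common convention in which the update rate multiplies the student term, so that a rate of 0.001 moves the teacher 0.1% of the way toward the student per step; the dictionary-of-floats parameter format and the function name are illustrative only.

```python
def ema_update(teacher, student, rate=0.001):
    """Return new teacher parameters: (1 - rate) * teacher + rate * student.

    'teacher' and 'student' map parameter names to values (floats here for
    illustration; tensors in practice). The convention that 'rate' weights
    the student term is an assumption.
    """
    return {
        name: (1.0 - rate) * teacher[name] + rate * student[name]
        for name in teacher
    }

# Teacher starts as a copy of the student, then trails it as the student changes.
student_params = {"w": 1.0, "b": 0.0}
teacher_params = dict(student_params)
student_params = {"w": 0.0, "b": 1.0}  # pretend an optimizer step moved the student
teacher_params = ema_update(teacher_params, student_params, rate=0.001)
```

With a small rate, the teacher changes slowly and smooths over step-to-step fluctuations in the student.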

## Appendix D Additional Experimental Results

We provide additional experimental results and analyses\. All experiments in this section are conducted with Qwen3\-4B\-Instruct\.

![Refer to caption](https://arxiv.org/html/2605.20258v1/x7.png)

Figure 7: (Left) Integrity and (Middle) Utility across training epochs for the utility-oriented teacher $\pi_{\textbf{allow}}$, the privacy-oriented teacher $\pi_{\textbf{disallow}}$, and their PoE target $\pi_{\textbf{PoE}}$, computed on the training split. (Right) Average token-level $D_{\mathrm{KL}}$ to the allow-only ideal policy ([Eq. 7](https://arxiv.org/html/2605.20258#A4.E7)) on the same split.

### D.1 Additional Analysis of Teacher Dynamics

Extending the discussion of teacher dynamics in [Sec. 4.4](https://arxiv.org/html/2605.20258#S4.SS4), we further measure whether the teachers move toward the ideal CI policy ([Def. 2.1](https://arxiv.org/html/2605.20258#S2.Thmtheorem1)), which behaves as if only the allowed attributes are available. Let $B$ denote the set of tasks in the training split. For each $g\in\{\textbf{allow},\textbf{disallow},\textbf{PoE}\}$, we compute

$$
\frac{1}{|B|}\sum_{\mathcal{T}\in B}\left[\frac{1}{|y|}\sum_{t=1}^{|y|}D_{\mathrm{KL}}\left(\pi_{g}(\,\cdot\mid x_{\mathcal{T}},y_{<t})\parallel\pi_{\theta}(\,\cdot\mid\mathcal{A}_{\mathcal{T}},\mathcal{T},y_{<t})\right)\right],\quad y\sim\pi_{g}(\,\cdot\mid x_{\mathcal{T}}), \tag{7}
$$

where $\pi_{\theta}(\,\cdot\mid\mathcal{A}_{\mathcal{T}},\mathcal{T},y_{<t})$ is the allow-only reference policy. [Fig. 7](https://arxiv.org/html/2605.20258#A4.F7) (Right) shows that all three teacher targets move closer to this ideal policy over training. The disallow teacher attains the lowest divergence, consistent with its strong privacy bias, while the allow teacher remains farther away because it is more permissive. The PoE target also reduces divergence substantially, eventually approaching the disallow teacher in KL while retaining higher Complete. This indicates that the PoE target improves alignment with the ideal CI policy without collapsing into pure suppression.

### D.2 Analysis on KL Objective Design

Table 7: Results under KL objective directions for the allow and disallow teacher losses on Qwen3-4B-Instruct. FKL/RKL denote forward/reverse KL.

[Tab. 7](https://arxiv.org/html/2605.20258#A4.T7) compares four combinations of KL direction for the allowed and disallowed teacher branches. Reverse KL on both branches achieves the best Utility and Complete, whereas replacing either branch with forward KL raises Integrity in some settings but lowers Complete. Applying forward KL to both branches gives the most conservative behavior, with the highest Integrity but the weakest Utility and Complete among the compared objectives. This suggests that forward KL tends to make the student cover teacher behavior too broadly, which can suppress useful disclosures along with restricted ones. Reverse KL is better matched to our objective because the student should not imitate either teacher in isolation, but instead move toward the intersection of task-completing and minimal-disclosure behavior. By penalizing student probability mass on regions unsupported by each teacher, reverse KL implements the PoE behavior. We therefore use reverse KL for both teachers.
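The link between a weighted sum of reverse KL terms and a product-of-experts target can be checked numerically on a toy distribution. The sketch below is purely illustrative: the four-outcome distributions are made-up numbers, the function names are hypothetical, and it only demonstrates the standard identity that, for weights summing to one, the weighted reverse-KL objective equals the reverse KL to the normalized geometric mean of the teachers minus the log of its normalizer. For contrast, it also forms the arithmetic mixture, which is the minimizer of the corresponding weighted forward-KL objective.

```python
import math

def kl(p, q):
    """D_KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def product_of_experts(p_a, p_b, w_a=0.5, w_b=0.5):
    """Normalized weighted geometric mean of two distributions, plus its normalizer Z."""
    unnorm = [a ** w_a * b ** w_b for a, b in zip(p_a, p_b)]
    z = sum(unnorm)
    return [u / z for u in unnorm], z

def reverse_kl_objective(q, p_a, p_b, w_a=0.5, w_b=0.5):
    """Weighted sum of reverse KL terms from student q to each teacher."""
    return w_a * kl(q, p_a) + w_b * kl(q, p_b)

# Toy outcomes: 0 = useful and private, 1 = leaks, 2 = over-refuses, 3 = other.
p_allow = [0.55, 0.40, 0.03, 0.02]      # permissive teacher (made-up numbers)
p_disallow = [0.50, 0.02, 0.45, 0.03]   # conservative teacher (made-up numbers)

p_star, z = product_of_experts(p_allow, p_disallow)
mixture = [0.5 * a + 0.5 * b for a, b in zip(p_allow, p_disallow)]

# Identity check on an arbitrary student distribution q.
q = [0.25, 0.25, 0.25, 0.25]
lhs = reverse_kl_objective(q, p_allow, p_disallow)
rhs = kl(q, p_star) - math.log(z)
```

In this toy case the product target places more mass on the outcome both teachers support than either teacher does alone, and less mass on the leaking and over-refusing outcomes than the mixture does.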

### D\.3Analysis on Teacher Update Strategy

Table 8:Results under teacher update strategies on Qwen3\-4B\-Instruct\.[Tab\.˜8](https://arxiv.org/html/2605.20258#A4.T8)compares teacher update strategies for the feedback\-conditioned self\-teachers\. Using the current student itself as the teacher at every step is unstable: the target moves with the optimized policy and can reinforce transient errors, leading to significant degradation after only a few epochs in our experiments\. Conversely, a fixed teacher becomes stale as training proceeds, and the no\-EMA result in[Tab\.˜4](https://arxiv.org/html/2605.20258#S4.T4)is correspondingly suboptimal\. EMA best balances these extremes, achieving the highest Complete score while maintaining strong Integrity\. Tokenwise logit interpolation between student and teacher \(Interp\) slightly improves Utility but substantially reduces Integrity and Complete, and adding it to EMA improves Integrity at the expense of Utility and Complete\. We therefore use EMA alone as the teacher update strategy in the main experiments\.

Table 9: Results under different EMA update rates on Qwen3-4B-Instruct.

[Tab. 9](https://arxiv.org/html/2605.20258#A4.T9) further studies the EMA update rate. Very slow updates, such as $0.0001$, lag behind the student and underperform, while faster updates such as $0.01$ can improve the final metrics but show lower training stability. We adopt $0.001$ for the main experiments as a stable balance between teacher adaptation and smoothing.
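To make the role of the update rate concrete, the following sketch simulates an EMA teacher tracking a fixed student parameter. It assumes the common convention that the rate is the weight placed on the student at each step; the function `ema_update` and the scalar "parameters" are hypothetical and are not taken from the released code.

```python
def ema_update(teacher, student, rate=0.001):
    """One EMA step: teacher <- (1 - rate) * teacher + rate * student."""
    return [(1.0 - rate) * t + rate * s for t, s in zip(teacher, student)]

def simulate(rate, steps=1000):
    """Track how far a teacher starting at 0 moves toward a student fixed at 1."""
    teacher, student = [0.0], [1.0]
    for _ in range(steps):
        teacher = ema_update(teacher, student, rate)
    return teacher[0]

# The three rates discussed in the text.
progress = {rate: simulate(rate) for rate in (0.0001, 0.001, 0.01)}
for rate, value in progress.items():
    print(f"rate={rate}: teacher reached {value:.4f} of the student value")
```

After 1000 steps the slowest rate has covered only a small fraction of the gap to the student, while the fastest rate has nearly caught up, which matches the qualitative picture of a lagging versus a closely tracking (and hence less smoothed) teacher.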

## Appendix E Qualitative Examples

[Fig. 14](https://arxiv.org/html/2605.20258#A7.F14) shows an example of a CI-RL test instance in which the model must send a contact-information update to a doctor’s office. The input intentionally mixes task-relevant contact attributes, such as name, phone number, and address, with sensitive but task-irrelevant context, including clinical notes, insurance details, and a prior medical communication. A CI-compliant response should disclose the contact fields needed to update the patient’s records while ignoring medical history, insurance identifiers, and details from the earlier conversation. The model trained with SelfCI, as shown in [Fig. 15](https://arxiv.org/html/2605.20258#A7.F15), correctly discloses all three required contact attributes while also excluding restricted information. By contrast, [Fig. 16](https://arxiv.org/html/2605.20258#A7.F16) shows that CI-RL loses task-completion capability in this case, treating the address as optional and omitting it from the final response.

## Appendix F Complementary Teacher Distillation as Product-of-Experts

#### Equivalence to Product of Experts\.

We show that minimizing a weighted sum of reverse KL divergences from two teacher distributions is equivalent to matching a single target distribution given by their product\.

Let $P_{\theta}$ denote the student distribution, and let $P_{A}, P_{B}$ denote two teacher distributions. Consider the objective

$$\mathcal{L}(\theta) = \alpha\, D_{\mathrm{KL}}(P_{\theta} \,\|\, P_{A}) + \beta\, D_{\mathrm{KL}}(P_{\theta} \,\|\, P_{B}), \tag{8}$$

where $\alpha, \beta \geq 0$ and $\alpha + \beta > 0$. Let $\tilde{\alpha} = \alpha/(\alpha+\beta)$ and $\tilde{\beta} = \beta/(\alpha+\beta)$. Now define a new distribution $P^{*}$ as

$$P^{*}(x) = \frac{1}{Z}\, P_{A}(x)^{\tilde{\alpha}}\, P_{B}(x)^{\tilde{\beta}}, \tag{9}$$

where $Z$ is the normalization constant.

Expanding the weighted KL objective gives

$$\begin{aligned}
\frac{1}{\alpha+\beta}\,\mathcal{L}(\theta)
&= \sum_{x} P_{\theta}(x) \log \frac{P_{\theta}(x)}{P_{A}(x)^{\tilde{\alpha}}\, P_{B}(x)^{\tilde{\beta}}} \\
&= \sum_{x} P_{\theta}(x) \log \frac{P_{\theta}(x)}{Z\, P^{*}(x)} \\
&= D_{\mathrm{KL}}(P_{\theta} \,\|\, P^{*}) - \log Z.
\end{aligned} \tag{10}$$

By Hölder’s inequality,

$$Z = \sum_{x} P_{A}(x)^{\tilde{\alpha}}\, P_{B}(x)^{\tilde{\beta}} \leq \left(\sum_{x} P_{A}(x)\right)^{\tilde{\alpha}} \left(\sum_{x} P_{B}(x)\right)^{\tilde{\beta}} = 1, \tag{11}$$

so $-\log Z \geq 0$. Since $\log Z$ is constant with respect to $\theta$, minimizing the original objective is equivalent to minimizing

$$D_{\mathrm{KL}}(P_{\theta} \,\|\, P^{*}). \tag{12}$$
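The identity in Eq. 10 can be checked numerically on a small discrete example. The sketch below uses arbitrary, hypothetical distributions and weights; it only verifies that the normalized weighted sum of reverse KLs equals the KL to the product target minus $\log Z$, and that $Z \leq 1$.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def poe_target(p_a, p_b, alpha, beta):
    """Return the product target P* of Eq. 9 and its normalizer Z."""
    a_t = alpha / (alpha + beta)
    b_t = beta / (alpha + beta)
    unnorm = [x ** a_t * y ** b_t for x, y in zip(p_a, p_b)]
    z = sum(unnorm)
    return [u / z for u in unnorm], z

# Hypothetical student and teacher distributions over three outcomes.
p_theta = [0.6, 0.3, 0.1]
p_a = [0.5, 0.4, 0.1]
p_b = [0.2, 0.3, 0.5]
alpha, beta = 2.0, 1.0

# Left-hand side of Eq. 10: normalized weighted sum of reverse KLs.
lhs = (alpha * kl(p_theta, p_a) + beta * kl(p_theta, p_b)) / (alpha + beta)

# Right-hand side of Eq. 10: KL to the product target minus log Z.
p_star, z = poe_target(p_a, p_b, alpha, beta)
rhs = kl(p_theta, p_star) - math.log(z)

print(f"lhs={lhs:.6f} rhs={rhs:.6f} Z={z:.6f}")
```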

#### Interpretation\.

The resulting target distribution

$$P^{*}(x) \;\propto\; P_{A}(x)^{\tilde{\alpha}}\, P_{B}(x)^{\tilde{\beta}}$$

corresponds to a *product-of-experts* (PoE). This construction emphasizes regions where both teachers assign high probability, effectively capturing the intersection of their supports. As a result, optimizing the sum of reverse KL divergences induces a joint constraint that retains agreement between the two teachers while suppressing regions favored by only one.

## Appendix G Complementary Teacher Objective as an Upper-Bound Surrogate for CI

We now show that minimizing the complementary self-distillation objective in [Eq. 5](https://arxiv.org/html/2605.20258#S3.E5) (equivalently, matching the PoE target) yields an upper-bound surrogate for the ideal CI objective in [Eq. 1](https://arxiv.org/html/2605.20258#S2.E1).

Let us define the per-token distributions over the vocabulary $\mathcal{V}$ as

$$\begin{aligned}
P_{\theta}(\cdot) &\coloneqq \pi_{\theta}(\cdot \mid x_{\mathcal{T}}, y_{<t}), &
P^{\mathcal{A}}_{\theta}(\cdot) &\coloneqq \pi_{\theta}(\cdot \mid \mathcal{A}_{\mathcal{T}}, \mathcal{T}, y_{<t}), \\
P_{\mathrm{allow}}(\cdot) &\coloneqq \pi_{\theta}(\cdot \mid x_{\mathcal{T}}, \tilde{f}_{\mathrm{allow}}, y_{<t}), &
P_{\mathrm{disallow}}(\cdot) &\coloneqq \pi_{\theta}(\cdot \mid x_{\mathcal{T}}, \tilde{f}_{\mathrm{disallow}}, y_{<t}).
\end{aligned} \tag{13}$$

Here, $P_{\theta}$ is the student policy under the full context, $P^{\mathcal{A}}_{\theta}$ is the allow-only ideal policy induced by [Def. 2.1](https://arxiv.org/html/2605.20258#S2.Thmtheorem1), and $P_{\mathrm{allow}}$, $P_{\mathrm{disallow}}$ are the two feedback-conditioned teachers. We assume all distributions are absolutely continuous with respect to each other as they are all produced by a softmax over the same vocabulary. For a coefficient $\lambda \in [0,1]$, the normalized PoE target derived in [Appendix F](https://arxiv.org/html/2605.20258#A6) is

$$P_{\mathrm{PoE}}(v) \coloneqq \frac{1}{Z_{\lambda}}\, P_{\mathrm{allow}}(v)^{\lambda}\, P_{\mathrm{disallow}}(v)^{1-\lambda}, \qquad Z_{\lambda} \coloneqq \sum_{u \in \mathcal{V}} P_{\mathrm{allow}}(u)^{\lambda}\, P_{\mathrm{disallow}}(u)^{1-\lambda}. \tag{14}$$

Then, the ideal CI loss and complementary teacher loss of SelfCI for the prefix $(x_{\mathcal{T}}, y_{<t})$ are

$$\mathcal{L}_{\mathrm{CI}}^{(t)}(\theta) \coloneqq D_{\mathrm{KL}}(P_{\theta} \parallel P_{\theta}^{\mathcal{A}}), \tag{15}$$

$$\mathcal{L}_{\textsc{SelfCI}}^{(t)}(\theta) \coloneqq \lambda\, D_{\mathrm{KL}}(P_{\theta} \parallel P_{\mathrm{allow}}) + (1-\lambda)\, D_{\mathrm{KL}}(P_{\theta} \parallel P_{\mathrm{disallow}}), \tag{16}$$

whose sequence-level expectations match [Eqs. 1](https://arxiv.org/html/2605.20258#S2.E1) and [5](https://arxiv.org/html/2605.20258#S3.E5), respectively:

$$\mathcal{L}_{\mathrm{CI}}(\theta) \coloneqq \mathbb{E}_{\mathcal{T},\, y \sim \pi_{\theta}}\left[\sum_{t=1}^{|y|} \mathcal{L}_{\mathrm{CI}}^{(t)}(\theta)\right], \qquad \mathcal{L}_{\textsc{SelfCI}}(\theta) \coloneqq \mathbb{E}_{\mathcal{T},\, y \sim \pi_{\theta}}\left[\sum_{t=1}^{|y|} \mathcal{L}_{\textsc{SelfCI}}^{(t)}(\theta)\right]. \tag{17}$$
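As a concrete reading of Eq. 16, the per-token complementary loss can be computed from three logit vectors over the same vocabulary. This is a minimal sketch under stated assumptions: the logits are made up, the function names are hypothetical, and training details such as how teacher gradients are handled are not modeled here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    """KL(p || q) for discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def selfci_token_loss(student_logits, allow_logits, disallow_logits, lam=0.5):
    """Per-token loss of Eq. 16: weighted sum of two reverse KLs."""
    p_theta = softmax(student_logits)
    p_allow = softmax(allow_logits)
    p_disallow = softmax(disallow_logits)
    return lam * kl(p_theta, p_allow) + (1 - lam) * kl(p_theta, p_disallow)

# Hypothetical logits over a four-token vocabulary at one decoding step.
student = [2.0, 1.0, 0.5, -1.0]
allow_teacher = [2.5, 1.5, -0.5, -1.0]
disallow_teacher = [2.0, 0.0, 0.5, -3.0]

loss = selfci_token_loss(student, allow_teacher, disallow_teacher, lam=0.5)
print(f"per-token complementary loss: {loss:.4f}")
```

Summing this quantity over the positions of a sampled response and averaging over tasks corresponds to the sequence-level form in Eq. 17.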
We first show that the complementary teacher loss in [Eq. 5](https://arxiv.org/html/2605.20258#S3.E5) upper bounds the KL divergence between the student and the PoE target.

###### Lemma G\.1\.

For any $\lambda \in [0,1]$ and any prefix $(x_{\mathcal{T}}, y_{<t})$,

$$\lambda\, D_{\mathrm{KL}}(P_{\theta} \parallel P_{\mathrm{allow}}) + (1-\lambda)\, D_{\mathrm{KL}}(P_{\theta} \parallel P_{\mathrm{disallow}}) = D_{\mathrm{KL}}(P_{\theta} \parallel P_{\mathrm{PoE}}) - \log Z_{\lambda}. \tag{18}$$

Moreover, $-\log Z_{\lambda} \geq 0$ and

$$D_{\mathrm{KL}}(P_{\theta} \parallel P_{\mathrm{PoE}}) \leq \mathcal{L}_{\textsc{SelfCI}}^{(t)}(\theta). \tag{19}$$

###### Proof\.

Substituting $P_{A} = P_{\mathrm{allow}}$, $P_{B} = P_{\mathrm{disallow}}$, $\alpha = \lambda$, and $\beta = 1-\lambda$ into [Eq. 10](https://arxiv.org/html/2605.20258#A6.E10) gives $P^{*} = P_{\mathrm{PoE}}$ and $Z = Z_{\lambda}$, which yields [Eq. 18](https://arxiv.org/html/2605.20258#A7.E18). The nonnegativity of $-\log Z_{\lambda}$ follows from the same argument above, which yields [Eq. 19](https://arxiv.org/html/2605.20258#A7.E19). ∎

For $\lambda \in (0,1)$, $-\log Z_{\lambda} \geq 0$ vanishes exactly when $P_{\mathrm{allow}} = P_{\mathrm{disallow}}$, i.e., when the two teachers fully agree. Whenever two teachers disagree, the complementary teacher loss in [Eq. 5](https://arxiv.org/html/2605.20258#S3.E5) is strictly larger than the KL toward the PoE target, and minimizing it makes the student attend more sharply to the agreement region of the two teachers.
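The behavior of the gap term $-\log Z_{\lambda}$ can be seen on toy inputs. In the sketch below, the teacher distributions and the helper `neg_log_z` are hypothetical; the code only evaluates $Z_{\lambda}$ as defined in Eq. 14 for agreeing and disagreeing teachers.

```python
import math

def neg_log_z(p_allow, p_disallow, lam):
    """Gap term -log Z_lambda from Eq. 14 for two discrete teachers."""
    z = sum(a ** lam * d ** (1 - lam) for a, d in zip(p_allow, p_disallow))
    return -math.log(z)

# Hypothetical teacher distributions over three tokens.
agree = [0.7, 0.2, 0.1]
mild = [0.6, 0.3, 0.1]
strong = [0.1, 0.2, 0.7]

gap_agree = neg_log_z(agree, agree, lam=0.5)    # identical teachers
gap_mild = neg_log_z(agree, mild, lam=0.5)      # small disagreement
gap_strong = neg_log_z(agree, strong, lam=0.5)  # large disagreement
gap_edge = neg_log_z(agree, strong, lam=1.0)    # lambda = 1 uses one teacher only

print(gap_agree, gap_mild, gap_strong, gap_edge)
```

In this example the gap is numerically zero for identical teachers, grows as the teachers diverge, and is also zero at the boundary $\lambda = 1$, where the target reduces to a single teacher.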

Then, to connect this intermediate PoE target back to our ideal policy $P_{\theta}^{\mathcal{A}}$, we need to relate their respective KL divergences from the student policy $P_{\theta}$. We introduce the change of measure via Rényi divergence to bridge $D_{\mathrm{KL}}(P_{\theta} \parallel P_{\theta}^{\mathcal{A}})$ and $D_{\mathrm{KL}}(P_{\theta} \parallel P_{\mathrm{PoE}})$.

###### Lemma G.2 (Variational change of measure).

Let $P, Q, R$ be distributions over $\mathcal{V}$ with $\operatorname{supp}(R) \supseteq \operatorname{supp}(P) \cup \operatorname{supp}(Q)$. For any $\alpha > 1$,

$$D_{\mathrm{KL}}(P \parallel Q) \leq \frac{\alpha}{\alpha-1}\, D_{\mathrm{KL}}(P \parallel R) + D_{\alpha}(R \parallel Q). \tag{20}$$

###### Proof\.

The left-hand side can be decomposed as

$$D_{\mathrm{KL}}(P \parallel Q) = D_{\mathrm{KL}}(P \parallel R) + \mathbb{E}_{P}\left[\log \frac{R}{Q}\right]. \tag{21}$$

By the Donsker–Varadhan variational representation of KL [[9](https://arxiv.org/html/2605.20258#bib.bib58)], for any measurable $g$,

$$\mathbb{E}_{P}[g] \leq D_{\mathrm{KL}}(P \parallel R) + \log \mathbb{E}_{R}[e^{g}]. \tag{22}$$

Applying this with $g = (\alpha-1)\log(R/Q)$ and dividing by $\alpha - 1 > 0$,

$$\mathbb{E}_{P}\left[\log \frac{R}{Q}\right] \leq \frac{1}{\alpha-1}\, D_{\mathrm{KL}}(P \parallel R) + \frac{1}{\alpha-1} \log \mathbb{E}_{R}\left[\left(\frac{R}{Q}\right)^{\alpha-1}\right]. \tag{23}$$

The logarithm term on the right-hand side can also be represented as

$$\frac{1}{\alpha-1} \log \sum_{v} R(v)^{\alpha}\, Q(v)^{1-\alpha} = D_{\alpha}(R \parallel Q). \tag{24}$$

Substituting back into [Eq. 21](https://arxiv.org/html/2605.20258#A7.E21) yields

$$\begin{aligned}
D_{\mathrm{KL}}(P \parallel Q) &\leq D_{\mathrm{KL}}(P \parallel R) + \frac{1}{\alpha-1}\, D_{\mathrm{KL}}(P \parallel R) + D_{\alpha}(R \parallel Q) \\
&= \frac{\alpha}{\alpha-1}\, D_{\mathrm{KL}}(P \parallel R) + D_{\alpha}(R \parallel Q).
\end{aligned} \tag{25}$$

∎
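The inequality of Lemma G.2 can be sanity-checked numerically. The sketch below evaluates both sides for a few orders $\alpha > 1$ on arbitrary full-support distributions; the values of `p`, `q`, `r` and the helper names are hypothetical, and a numerical check on one example is of course not a proof.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def renyi(r, q, alpha):
    """Renyi divergence D_alpha(r || q) of order alpha > 1, as in Eq. 24."""
    total = sum(ri ** alpha * qi ** (1 - alpha) for ri, qi in zip(r, q))
    return math.log(total) / (alpha - 1)

# Hypothetical distributions over three outcomes.
p = [0.6, 0.3, 0.1]
q = [0.3, 0.3, 0.4]
r = [0.5, 0.3, 0.2]

lhs = kl(p, q)
checks = []
for alpha in (1.5, 2.0, 5.0):
    bound = alpha / (alpha - 1) * kl(p, r) + renyi(r, q, alpha)
    checks.append((alpha, lhs, bound))
    print(f"alpha={alpha}: KL(P||Q)={lhs:.4f} <= bound={bound:.4f}")
```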

By combining [Lems. G.1](https://arxiv.org/html/2605.20258#A7.Thmtheorem1) and [G.2](https://arxiv.org/html/2605.20258#A7.Thmtheorem2), the following theorem states that the complementary teacher loss provides an upper bound on the ideal CI objective, up to an approximation error.

###### Theorem G\.3\.

For any $\lambda \in [0,1]$ and any $\alpha > 1$,

$$\mathcal{L}_{\mathrm{CI}}^{(t)}(\theta) \leq \frac{\alpha}{\alpha-1}\, \mathcal{L}_{\textsc{SelfCI}}^{(t)}(\theta) + D_{\alpha}(P_{\mathrm{PoE}} \parallel P_{\theta}^{\mathcal{A}}). \tag{26}$$

Taking expectations over tasks and prefixes,

$$\mathcal{L}_{\mathrm{CI}}(\theta) \leq \frac{\alpha}{\alpha-1}\, \mathcal{L}_{\textsc{SelfCI}}(\theta) + \delta_{\alpha}(\lambda, \theta), \tag{27}$$

where

$$\delta_{\alpha}(\lambda, \theta) \coloneqq \mathbb{E}_{\mathcal{T},\, y \sim \pi_{\theta}}\left[\sum_{t=1}^{|y|} D_{\alpha}(P_{\mathrm{PoE}} \parallel P_{\theta}^{\mathcal{A}})\right]. \tag{28}$$

###### Proof\.

Applying [Lem. G.2](https://arxiv.org/html/2605.20258#A7.Thmtheorem2) with $P = P_{\theta}$, $Q = P_{\theta}^{\mathcal{A}}$, and $R = P_{\mathrm{PoE}}$,

$$\mathcal{L}_{\mathrm{CI}}^{(t)}(\theta) = D_{\mathrm{KL}}(P_{\theta} \parallel P_{\theta}^{\mathcal{A}}) \leq \frac{\alpha}{\alpha-1}\, D_{\mathrm{KL}}(P_{\theta} \parallel P_{\mathrm{PoE}}) + D_{\alpha}(P_{\mathrm{PoE}} \parallel P_{\theta}^{\mathcal{A}}). \tag{29}$$

[Lem. G.1](https://arxiv.org/html/2605.20258#A7.Thmtheorem1) gives $D_{\mathrm{KL}}(P_{\theta} \parallel P_{\mathrm{PoE}}) \leq \mathcal{L}_{\textsc{SelfCI}}^{(t)}(\theta)$, yielding [Eq. 26](https://arxiv.org/html/2605.20258#A7.E26). The sequence-level inequality follows by linearity of expectation. ∎

The first term in [Eq. 27](https://arxiv.org/html/2605.20258#A7.E27) is exactly the complementary self-distillation objective optimized by SelfCI, up to the multiplicative constant $\alpha/(\alpha-1)$. Thus, for a fixed $\alpha$, reducing the training loss directly tightens the upper bound on the ideal CI objective. The remaining gap is the alignment error $\delta_{\alpha}(\lambda, \theta)$, which measures how close the induced PoE target is to the allow-only ideal policy along student rollouts. This term is finite and tends to zero as the PoE target collapses onto the allow-only ideal.
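The two terms of the bound react in opposite directions to the choice of $\alpha$: the multiplier $\alpha/(\alpha-1)$ shrinks toward 1 as $\alpha$ grows, while the Rényi divergence is non-decreasing in its order. The sketch below illustrates this on hypothetical stand-ins for $P_{\mathrm{PoE}}$ and $P_{\theta}^{\mathcal{A}}$; the numbers and names are illustrative assumptions, not values from the paper.

```python
import math

def renyi(r, q, alpha):
    """Renyi divergence D_alpha(r || q) of order alpha > 1."""
    total = sum(ri ** alpha * qi ** (1 - alpha) for ri, qi in zip(r, q))
    return math.log(total) / (alpha - 1)

# Hypothetical per-token PoE target and allow-only ideal policy.
p_poe = [0.70, 0.20, 0.10]
p_ideal = [0.60, 0.25, 0.15]

alphas = [1.5, 2.0, 4.0, 8.0]
multipliers = [a / (a - 1) for a in alphas]            # factor on the training loss
gaps = [renyi(p_poe, p_ideal, a) for a in alphas]      # per-token alignment error

for a, m, g in zip(alphas, multipliers, gaps):
    print(f"alpha={a}: multiplier={m:.3f}, D_alpha={g:.5f}")
```

A larger $\alpha$ therefore puts less weight on the training loss but pays a larger alignment term, so the tightest bound in practice depends on how close the PoE target already is to the ideal policy.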

Figure 8: Prompt template for contextual integrity reasoning.

Figure 9: The instruction used for feedback generation. (Left) Instruction for each attribute in the allow subset; $I_{\textbf{allow}}$. (Right) Instruction for each attribute in the disallow subset; $I_{\textbf{disallow}}$. For reasoning models, we omit the `<think>...</think>` block and use the final response as feedback.

Figure 10: Example from the CI-RL benchmark [[22](https://arxiv.org/html/2605.20258#bib.bib6)] and feedback prompt suffixes for constructing feedback-conditioned teachers. (a) The user task instruction $\tau$ and accessible information $\{\mathcal{A}_{\mathcal{T}}, \mathcal{D}_{\mathcal{T}}\}$. (b) Attribute-level feedback suffix for $\mathcal{A}_{\mathcal{T}}$, forming the utility-oriented teacher $\pi_{\textbf{allow}}$. (c) Attribute-level feedback suffix for $\mathcal{D}_{\mathcal{T}}$, forming the privacy-oriented teacher $\pi_{\textbf{disallow}}$.

Figure 11: System prompt used for PrivacyLens evaluation. The prompt instructs the tool-using agent to apply Contextual Integrity when deciding whether each attribute is appropriate to disclose.

Figure 12: User prompt template used for PrivacyLens evaluation. The template provides user metadata, tool specifications, the user instruction, and the past trajectory, then asks the model to generate the next tool action in the required scratchpad format.

Figure 13: Prompt template for contextual integrity reasoning with direct answering, which applies Contextual Integrity guidance while requiring a direct final response without visible reasoning.

Figure 14: Model input constructed from a CI-RL [[22](https://arxiv.org/html/2605.20258#bib.bib6)] test set sample, requiring attribute-level disclosure reasoning under Contextual Integrity before generating the final response.

Figure 15: Example response from Qwen3-4B-Instruct trained with SelfCI for the input in [Fig. 14](https://arxiv.org/html/2605.20258#A7.F14). The response includes all required attributes, "James Carter", "+1-555-0101", and "Evergreen", while correctly excluding the restricted attributes, "Duloxetine", "XZ90034", and "Baker".

Figure 16: Example response from Qwen3-4B-Instruct trained with CI-RL [[22](https://arxiv.org/html/2605.20258#bib.bib6)] for the input in [Fig. 14](https://arxiv.org/html/2605.20258#A7.F14). The response preserves some required attributes, including "James Carter" and "+1-555-0101", but omits the required attribute "Evergreen".
