The Illusion of Certainty: Decoupling Capability and Calibration in On-Policy Distillation

Hugging Face Daily Papers

Summary

This paper identifies that on-policy distillation (OPD) in language models leads to severe overconfidence due to an information mismatch between training and deployment, and proposes CaOPD, a calibration-aware framework that improves both performance and confidence reliability.

Source: https://huggingface.co/papers/2604.16830

Abstract

On-policy distillation (OPD) is an increasingly important paradigm for post-training language models. However, we identify a pervasive Scaling Law of Miscalibration: while OPD effectively improves task accuracy, it systematically traps models in severe overconfidence. We trace this failure to an information mismatch: teacher supervision is formed under privileged context available during training, whereas the deployed model must report confidence using only deployment-time information. We formalize this perspective theoretically, showing that teacher-conditioned success is generally not a valid target for deployment-time confidence and that helpful privileged context induces entropy collapse and a systematic optimism bias. To address this, we propose a calibration-aware OPD framework, CaOPD, that estimates empirical confidence from model rollouts, replaces self-reported confidence with this student-grounded target, and distills the revised response through the same self-distillation pipeline. Experiments across various models and domains show that CaOPD achieves Pareto-optimal calibration while maintaining competitive capability, and generalizes robustly in out-of-distribution and continual-learning settings. Our findings highlight that capability distillation does not imply calibrated confidence, and that confidence should be treated as an essential objective in post-training. Code: https://github.com/SalesforceAIResearch/CaOPD
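
To make the rollout-based confidence target concrete, here is a minimal Python sketch of the recipe the abstract describes: sample several student rollouts, score them, and write the empirical success rate back into the response as its confidence before self-distillation. The rollout count, the verifier, the response format, and every function name below are illustrative assumptions, not the paper's implementation.

import random
from typing import Callable

def empirical_confidence(
    prompt: str,
    reference: str,
    sample_rollout: Callable[[str], str],    # draws one student answer
    is_correct: Callable[[str, str], bool],  # scores an answer against the reference
    k: int = 16,
) -> float:
    """Estimate the student's deployment-time success rate from k rollouts;
    this empirical rate stands in for the model's inflated self-reported confidence."""
    hits = sum(is_correct(sample_rollout(prompt), reference) for _ in range(k))
    return hits / k

def revise_response(response: str, confidence: float) -> str:
    """Write the student-grounded confidence into the response so the revised
    text can be distilled through the same self-distillation pipeline."""
    return f"{response}\nConfidence: {confidence:.0%}"

if __name__ == "__main__":
    random.seed(0)
    toy_rollout = lambda p: "42" if random.random() < 0.6 else "41"  # ~60%-accurate toy student
    toy_verifier = lambda ans, ref: ans == ref
    conf = empirical_confidence("What is 6 * 7?", "42", toy_rollout, toy_verifier)
    print(revise_response("The answer is 42.", conf))

The key point the sketch captures is that the confidence target is grounded in what the student can actually do at deployment time, rather than in teacher-conditioned success observed under privileged context.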

Similar Articles

Hybrid Policy Distillation for LLMs

arXiv cs.CL

Introduces Hybrid Policy Distillation (HPD), a unified knowledge distillation approach that balances forward and reverse KL divergences and combines off-policy data with lightweight on-policy sampling, improving LLM compression across math, dialogue, and code tasks.
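
The forward/reverse balance can be illustrated with a short PyTorch sketch; the interpolation weight alpha, the temperature tau, and the exact mixing scheme here are assumptions for illustration, not HPD's published objective.

import torch
import torch.nn.functional as F

def hybrid_kl_loss(student_logits, teacher_logits, alpha=0.5, tau=1.0):
    # Forward KL(teacher || student) is mode-covering; reverse
    # KL(student || teacher) is mode-seeking. alpha trades off the two.
    s_logp = F.log_softmax(student_logits / tau, dim=-1)
    t_logp = F.log_softmax(teacher_logits / tau, dim=-1)
    # F.kl_div(input, target, log_target=True) computes KL(target || input).
    forward_kl = F.kl_div(s_logp, t_logp, log_target=True, reduction="batchmean")
    reverse_kl = F.kl_div(t_logp, s_logp, log_target=True, reduction="batchmean")
    return alpha * forward_kl + (1 - alpha) * reverse_kl

# Usage on a toy batch of token logits:
student = torch.randn(4, 32000, requires_grad=True)
teacher = torch.randn(4, 32000)
loss = hybrid_kl_loss(student, teacher, alpha=0.7)
loss.backward()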

Self-Distillation as a Performance Recovery Mechanism for LLMs: Counteracting Compression and Catastrophic Forgetting

arXiv cs.CL

This paper introduces Self-Distillation Fine-Tuning (SDFT) as a recovery mechanism for LLMs suffering from performance degradation due to catastrophic forgetting, quantization, and pruning. The authors provide theoretical justification using Centered Kernel Alignment (CKA) to demonstrate that self-distillation aligns the student model's high-dimensional manifold with the teacher's optimal structure, effectively recovering lost capabilities.
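
Linear CKA itself has a compact closed form; the sketch below shows the standard centered formulation on hypothetical toy data, rather than anything specific to SDFT.

import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear Centered Kernel Alignment between representation matrices of
    shape (n_samples, dim_x) and (n_samples, dim_y). Returns a value in [0, 1]."""
    X = X - X.mean(axis=0, keepdims=True)  # center features per dimension
    Y = Y - Y.mean(axis=0, keepdims=True)
    # HSIC-based form: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return float(num / den)

# Toy check: an orthogonally rotated copy of X is perfectly CKA-similar to X.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 32))
Q, _ = np.linalg.qr(rng.normal(size=(32, 32)))  # random rotation matrix
print(round(linear_cka(X, X @ Q), 6))  # -> 1.0

The invariance to orthogonal transformation shown in the toy check is what makes CKA suitable for comparing student and teacher representations that live in differently oriented feature spaces.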