Expectation Consistency Loss: Rethink Confidence Calibration under Covariate Shift
Summary
This paper introduces the Expectation Consistency Loss (ECL), a theoretically grounded loss function for calibrating classifier confidence under covariate shift, derived from a necessary and sufficient condition called the Expectation Consistency Condition.
# Expectation Consistency Loss: Rethink Confidence Calibration under Covariate Shift
Source: [https://arxiv.org/html/2605.21552](https://arxiv.org/html/2605.21552)
###### Abstract
Confidence calibration for classification models is vital in safety-critical decision-making scenarios and has received extensive attention. General confidence calibration methods assume training and test data are independent and identically distributed (i.i.d.), limiting their effectiveness under covariate shifts. Previous calibration methods under covariate shift struggle with class-wise or canonical calibrations and often rely on unstable importance weighting when density ratios are large or unbounded. Given the above limitations, this paper rethinks confidence calibration under covariate shifts. First, we derive a necessary and sufficient condition for confidence calibration under covariate shifts, named the Expectation consistency condition, which reveals that covariate shifts do not necessarily lead to uncalibrated confidence and provides a weaker condition for confidence calibration than global covariate distribution alignment. Then, utilizing the Expectation consistency condition, this paper proposes an unsupervised domain adaptation loss to calibrate confidence of the target domain, named Expectation consistency loss (ECL), which is compatible with canonical calibration, class-wise calibration, and top-label calibration. Third, we prove that computing ECL has the same sample complexity as Expected Calibration Error (ECE) and provide a theoretically grounded mini-batch trainable scheme for ECL. Finally, we validate the effectiveness of our method on both simulated and real-world covariate shift datasets.
Machine Learning, ICML
## 1 Introduction
Modern machine learning classification models, such as deep neural networks, are becoming increasingly accurate and widely applied in safety-critical fields (LeCun et al., [2015](https://arxiv.org/html/2605.21552#bib.bib23); Jiang et al., [2023](https://arxiv.org/html/2605.21552#bib.bib22)). Nevertheless, decision-making systems in such fields need not only high accuracy but also the ability to recognize when they might be wrong (Munir et al., [2023](https://arxiv.org/html/2605.21552#bib.bib24)). For example, in automatic disease diagnosis, if a model has low confidence in its prediction, it should defer to a medical professional (Jiang et al., [2011](https://arxiv.org/html/2605.21552#bib.bib25)). Thus, classification models should provide accurate confidence estimates alongside their predictions to reflect the true likelihood of an event. Accurate confidence is more informative than mere class labels, e.g., stating "a patient has a 70% probability of having cancer" gives doctors more actionable information than just labeling the condition as "cancer". Moreover, accurate confidence helps classification models integrate better with other probabilistic models, e.g., helping active learning to select more representative samples (Han et al., [2024](https://arxiv.org/html/2605.21552#bib.bib26)) and improving the generalization performance of knowledge distillation (Li and Caragea, [2023](https://arxiv.org/html/2605.21552#bib.bib27)). Therefore, pursuing more accurate confidence in classification models is of great importance (Gawlikowski et al., [2023](https://arxiv.org/html/2605.21552#bib.bib28)).
In recent years, confidence calibration has emerged as one of the most effective methods for producing more reliable confidence estimates and has attracted considerable attention (Guo et al., [2017](https://arxiv.org/html/2605.21552#bib.bib31); Zhang et al., [2020](https://arxiv.org/html/2605.21552#bib.bib32); Kull et al., [2019](https://arxiv.org/html/2605.21552#bib.bib33); Dong et al., [2025b](https://arxiv.org/html/2605.21552#bib.bib29)). However, general confidence calibration methods typically assume that the target domain (or test set) and the source domain (or calibration set) are independent and identically distributed (i.i.d.). When this assumption is violated due to distribution shifts, calibration performance often deteriorates significantly (Zhu et al., [2024](https://arxiv.org/html/2605.21552#bib.bib34)). Covariate shift, a common type of data distribution shift, often occurs in real-world tasks like medical diagnosis across different populations or image recognition under varying lighting conditions, where the input data distribution of models changes while the underlying relationship between inputs and outputs remains consistent (Kimura and Hino, [2024](https://arxiv.org/html/2605.21552#bib.bib36)). Under covariate shift, models calibrated on the source domain frequently fail to generalize to the target domain, resulting in unreliable confidence estimates (Bickel et al., [2009](https://arxiv.org/html/2605.21552#bib.bib35)). This highlights the importance of developing confidence calibration methods that remain robust under covariate shift (Hu et al., [2024](https://arxiv.org/html/2605.21552#bib.bib37)).
Currently, the mainstream confidence calibration methods under covariate shift are based on importance weighting (Pampari and Ermon, [2020](https://arxiv.org/html/2605.21552#bib.bib39); Park et al., [2020](https://arxiv.org/html/2605.21552#bib.bib38); Wang et al., [2020](https://arxiv.org/html/2605.21552#bib.bib40), [2023](https://arxiv.org/html/2605.21552#bib.bib41)), which adjusts the objective function by assigning weights based on the importance of instances from the source domain, thereby guiding the model to generalize to the target domain without bias (Kimura and Hino, [2024](https://arxiv.org/html/2605.21552#bib.bib36)). However, importance weighting is known to be unstable when the density ratio is large or unbounded (Cortes et al., [2010](https://arxiv.org/html/2605.21552#bib.bib42)). Hu et al. ([2024](https://arxiv.org/html/2605.21552#bib.bib37)) use mixup to synthesize pseudo-target data and generalize the calibration performance from the pseudo-target data to the target domain. However, the efficacy of this method hinges primarily on the degree of similarity between the pseudo-target data and the target domain data. Furthermore, existing methods primarily address the simplest prediction-based calibration (i.e., top-label calibration). To our knowledge, there remains a notable absence of class-wise and canonical calibration methods designed to handle covariate shift.
Importance weighting in confidence calibration aims to globally align covariate distributions, inspired by accuracy improvement under covariate shift. However, confidence calibration differs fundamentally from accuracy improvement: it requires not learning new knowledge, but precisely conveying uncertainty. This raises a natural but often neglected question: Is global covariate distribution alignment necessary? To answer this, we first derive a necessary and sufficient condition for confidence calibration under covariate shifts, termed the Expectation consistency condition. This condition reveals that covariate shifts do not necessarily cause miscalibration and provides a weaker requirement than global distribution alignment. Based on this condition, we propose an unsupervised domain adaptation loss, Expectation consistency loss (ECL), with three variants for canonical, class-wise, and top-label calibration. We prove that ECL has sample complexity $\mathcal{O}(B/\varepsilon^{2})$, comparable to histogram binning, where $B$ denotes the number of confidence bins. To enable unbiased gradient backpropagation on mini-batch data, we also provide a theoretically sound mini-batch training scheme for ECL. Finally, we validate the method on simulated and real-world covariate shift datasets.
Table 1: Comparison of ECL and related calibration methods.

| Method | Covariate Shift | Class-wise Calibration | Canonical Calibration | Density Ratio Unbounded | Mini-batch Trainable |
|---|---|---|---|---|---|
| SB-ECE (Karandikar et al., [2021](https://arxiv.org/html/2605.21552#bib.bib44)) | ✗ | ✗ | ✗ | ✓ | ✗ |
| DECE (Bohdal et al., [2023](https://arxiv.org/html/2605.21552#bib.bib45)) | ✗ | ✗ | ✗ | ✓ | ✗ |
| $ECE^{KDE}$ (Popordanoska et al., [2022](https://arxiv.org/html/2605.21552#bib.bib43)) | ✗ | ✓ | ✓ | ✓ | ✓ |
| Weighted TS (Pampari and Ermon, [2020](https://arxiv.org/html/2605.21552#bib.bib39)) | ✓ | ✗ | ✗ | ✗ | ✗ |
| FL+IW+Temp (Park et al., [2020](https://arxiv.org/html/2605.21552#bib.bib38)) | ✓ | ✗ | ✗ | ✗ | ✗ |
| TransCal (Wang et al., [2020](https://arxiv.org/html/2605.21552#bib.bib40)) | ✓ | ✗ | ✗ | ✗ | ✗ |
| DRL (Wang et al., [2023](https://arxiv.org/html/2605.21552#bib.bib41)) | ✓ | ✗ | ✗ | ✗ | ✗ |
| PseudoCal (Hu et al., [2024](https://arxiv.org/html/2605.21552#bib.bib37)) | ✓ | ✗ | ✗ | ✓ | ✗ |
| ECL (Ours) | ✓ | ✓ | ✓ | ✓ | ✓ |
## 2 Background and Related Work
Consider a $K$-class classification problem where $X \in \mathcal{X}$ denotes the input feature and $Y = (Y_1, \cdots, Y_K) \in \mathcal{Y}$ denotes the $K$-class one-hot encoded label variable, with $\mathcal{X} \subset \mathbb{R}^{d}$ and $\mathcal{Y} = \{e_k\}_{k=1}^{K}$, where $e_k$ is a unit vector whose $k$-th component is 1. Let $f: \mathcal{X} \to \mathcal{S} \subset \Delta_{K-1}$ be a probabilistic classifier, where $\Delta_{K-1}$ represents a $(K-1)$-dimensional simplex. The predicted confidence score vector is given by $S = f(X) = (f_1(X), \cdots, f_K(X)) = (S_1, \ldots, S_K) \in \mathcal{S}$. In general, the true class scalar is $Y^{*} = \operatorname{argmax}_{k} \{Y_k\}_{1 \leq k \leq K}$, the predicted class is defined as $\hat{Y} = \operatorname{argmax}_{k} \{S_k\}_{1 \leq k \leq K}$, and the confidence score of the predicted class is $\hat{S} = \max \{S_k\}_{1 \leq k \leq K}$.
In covariate shift, let $P_s(\cdot)$ and $P_t(\cdot)$ denote the probability density (for continuous variables, e.g., $X$, $X \mid S$, $S \mid X$, and $X \mid Y$) or probability measure (for discrete variables, e.g., $Y$, $Y \mid S$ and $Y \mid X$) on the source domain and target domain, respectively. $P$ denotes either $P_s$ or $P_t$ in cases where distinguishing between the source and target domains is not required. Let $D_s$ and $D_t$ represent the source domain and target domain data, respectively.
### 2.1 Confidence Calibration
Confidence calibration aims to match the predicted confidence vector with the true posterior probability of event occurrence. Formally, we state:
###### Definition 2.1.
(Perfect Calibration) A classifier is perfectly calibrated if the following equation holds:
$$P(Y_k = 1 \mid S = s) = s_k, \quad \forall\, 1 \leq k \leq K, \tag{1}$$
where $s = (s_1, \cdots, s_K)$ is an observed value of the confidence score vector $S$.
Remark: Definition [1](https://arxiv.org/html/2605.21552#S2.E1) considers the most stringent calibration paradigm, named canonical calibration (Dong et al., [2025a](https://arxiv.org/html/2605.21552#bib.bib30)). Appendix [A](https://arxiv.org/html/2605.21552#A1) provides two other common calibration paradigms: top-label calibration (Guo et al., [2017](https://arxiv.org/html/2605.21552#bib.bib31)) and class-wise calibration (Kull et al., [2019](https://arxiv.org/html/2605.21552#bib.bib33)).
Existing general work primarily falls into two groups: train-time calibration (Liu et al., [2023](https://arxiv.org/html/2605.21552#bib.bib15); Müller et al., [2019](https://arxiv.org/html/2605.21552#bib.bib13); Fernando and Tsokos, [2022](https://arxiv.org/html/2605.21552#bib.bib16); Hebbalaguppe et al., [2022](https://arxiv.org/html/2605.21552#bib.bib17); Grathwohl et al., [2020](https://arxiv.org/html/2605.21552#bib.bib18); Yang and Ji, [2021](https://arxiv.org/html/2605.21552#bib.bib19)) and post-hoc calibration (Guo et al., [2017](https://arxiv.org/html/2605.21552#bib.bib31); Kull et al., [2019](https://arxiv.org/html/2605.21552#bib.bib33); Zhang et al., [2020](https://arxiv.org/html/2605.21552#bib.bib32); Rahimi et al., [2020](https://arxiv.org/html/2605.21552#bib.bib20); Gupta et al., [2021](https://arxiv.org/html/2605.21552#bib.bib21); Dong et al., [2025b](https://arxiv.org/html/2605.21552#bib.bib29)). Train-time calibration typically carries out calibration during the classifier's training by adjusting the objective function, and post-hoc calibration learns a transformation (referred to as a calibration map) of the classifier's output on a calibration dataset in a post-hoc manner. However, these methods' effectiveness hinges on the i.i.d. assumption between the target and source domains. When covariate shift occurs, this i.i.d. assumption is violated, making it difficult for the methods above to effectively calibrate confidence.
### 2.2 Confidence Calibration under Covariate Shift
In covariate shift, the target domain and the source domain have different feature distributions but the same conditional distribution of labels given features. Formally, we state:
###### Definition 2.2.
(Covariate Shift) Covariate shift occurs if the following two conditions are satisfied: $P_s(X) \neq P_t(X)$ and $P_s(Y \mid X) = P_t(Y \mid X)$.
Table [1](https://arxiv.org/html/2605.21552#S1.T1) summarizes the characteristics of related calibration methods in five key dimensions, including whether they can handle covariate shifts, whether they support class-wise/canonical calibration, whether they can handle unbounded density ratios, and whether they are theoretically mini-batch trainable. As shown, existing methods often cover only a portion of the capabilities. In contrast, our ECL satisfies all dimensions simultaneously, demonstrating the method's comprehensiveness and versatility.
## 3 Method
### 3.1 Expectation Consistency Condition
Previous studies (Pampari and Ermon, [2020](https://arxiv.org/html/2605.21552#bib.bib39); Wang et al., [2020](https://arxiv.org/html/2605.21552#bib.bib40); Park et al., [2020](https://arxiv.org/html/2605.21552#bib.bib38); Wang et al., [2023](https://arxiv.org/html/2605.21552#bib.bib41); Hu et al., [2024](https://arxiv.org/html/2605.21552#bib.bib37)) have empirically demonstrated that covariate shift can cause the confidence calibrated on the source domain to be uncalibrated on the target domain. However, empirical evidence alone cannot capture all possible scenarios. The theoretical underpinnings of these observations deserve exploration, both to explain the problem and to help solve it. To address this, this paper derives a necessary and sufficient condition for confidence calibration under covariate shift, as shown in Theorem [3.1](https://arxiv.org/html/2605.21552#S3.Thmtheorem1).
###### Theorem 3.1.
(Expectation Consistency Condition) $\forall\, 1 \leq k \leq K$, $P_s(Y_k = 1 \mid S) = P_t(Y_k = 1 \mid S)$ if and only if $\mathbb{E}_{X \sim P_s(X \mid S)}[P(Y_k = 1 \mid X)] = \mathbb{E}_{X \sim P_t(X \mid S)}[P(Y_k = 1 \mid X)]$, where $P(Y_k = 1 \mid X) = P_s(Y_k = 1 \mid X) = P_t(Y_k = 1 \mid X)$. The proof is provided in Appendix [B](https://arxiv.org/html/2605.21552#A2).
Remark on Theorem [3.1](https://arxiv.org/html/2605.21552#S3.Thmtheorem1): The source domain can usually be easily calibrated well using general calibration methods, at least much better than the target domain (see Appendix [C](https://arxiv.org/html/2605.21552#A3)). Theorem [3.1](https://arxiv.org/html/2605.21552#S3.Thmtheorem1) tells us that as long as the Expectation consistency condition is met, the target domain can be calibrated as well as the source domain. The condition $\mathbb{E}_{X \sim P_s(X \mid S)}[P(Y_k = 1 \mid X)] = \mathbb{E}_{X \sim P_t(X \mid S)}[P(Y_k = 1 \mid X)]$ is strictly weaker than covariate distribution alignment (i.e., $P_s(X) = P_t(X)$), as it only requires equivalence in the expectations of the true posterior probability $P(Y_k = 1 \mid X)$ w.r.t. the confidence score's level set distribution (i.e., $P_s(X \mid S)$ or $P_t(X \mid S)$), rather than matching the entire input distribution. For instance, even if $P_s(X)$ and $P_t(X)$ differ significantly, calibration may still hold if the model's expected accuracy conditioned on $S$ aligns across domains. This insight moves the focus from aligning global covariate distributions to enforcing local consistency in critical statistics, enabling more efficient calibration strategies under covariate shift.
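The identity behind Theorem 3.1 can be sketched in a few lines. This is an informal sketch, not the authors' proof (which is in Appendix B). It assumes that $S = f(X)$ is a deterministic function of $X$, so that conditioning on $X$ makes $S$ redundant, and it uses the covariate shift assumption $P_s(Y \mid X) = P_t(Y \mid X)$. For each domain $d \in \{s, t\}$:

```latex
\begin{aligned}
P_d(Y_k = 1 \mid S = s)
  &= \int P_d(Y_k = 1 \mid X = x, S = s)\, P_d(x \mid S = s)\, dx \\
  &= \int P(Y_k = 1 \mid X = x)\, P_d(x \mid S = s)\, dx \\
  &= \mathbb{E}_{X \sim P_d(X \mid S = s)}\left[ P(Y_k = 1 \mid X) \right].
\end{aligned}
```

Under these assumptions, the left-hand sides agree across the two domains exactly when the right-hand sides do, which is the stated condition.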
Extension of Theorem [3.1](https://arxiv.org/html/2605.21552#S3.Thmtheorem1): Theorem [3.1](https://arxiv.org/html/2605.21552#S3.Thmtheorem1) can be naturally extended to top-label calibration and class-wise calibration (see Appendix [D](https://arxiv.org/html/2605.21552#A4)). Intuitively, this only requires replacing the confidence score vector $S$ in Theorem [3.1](https://arxiv.org/html/2605.21552#S3.Thmtheorem1) with the predicted class confidence $\hat{S}$ or the confidence score vector's components $S_k$.
An Example: Fig. [1](https://arxiv.org/html/2605.21552#S3.F1) shows an example of Theorem [3.1](https://arxiv.org/html/2605.21552#S3.Thmtheorem1), where covariate shift occurs but calibration error remains unchanged. Take $S_1 = 0.75$ as an example for calculation:
$$\begin{aligned} & P_s\left(Y_1 = 1 \mid S = (0.75, 0.25)\right) \\ &= \sum_{X \in \{-1, 1\}} P(Y_1 = 1 \mid X)\, P_s(X \mid S = (0.75, 0.25)) \\ &= \sum_{X \in \{-1, 1\}} 0.5 \cdot P_s(X \mid S = (0.75, 0.25)) = 0.5. \end{aligned} \tag{2}$$
Similarly, it is easy to calculate that $P_t(Y_1 = 1 \mid S = (0.75, 0.25)) = 0.5 = P_s(Y_1 = 1 \mid S = (0.75, 0.25))$. The same holds if 0.75 is replaced with other values because $\mathbb{E}_{X \sim P_s(X \mid S)}[P(Y_1 = 1 \mid X)] = \mathbb{E}_{X \sim P_t(X \mid S)}[P(Y_1 = 1 \mid X)]$ holds for $\forall S_1 \in [0, 1]$. Moreover, there are infinitely many such examples, including but not limited to all examples where the $P(Y_1 = 1 \mid X)$ or $S_1$ curves in Fig. [1](https://arxiv.org/html/2605.21552#S3.F1) are symmetric w.r.t. the y-axis.

Figure 1: A binary classification example where covariate shift occurs but calibration error remains unchanged, where $P(Y \mid X) = \left(P(Y_1 \mid X), P(Y_2 \mid X)\right)$ and $S = (S_1, S_2)$. $P(Y_2 \mid X) = 1 - P(Y_1 \mid X)$ and $S_2 = 1 - S_1$. $P_s(X) = \left(\sqrt{2\pi}\right)^{-1} e^{-0.5 (X + 0.5)^{2}}$, $P_t(X) = \left(\sqrt{2\pi}\right)^{-1} e^{-0.5 (X - 0.5)^{2}}$, $S_1 = -0.25 X^{2} + 1$, and $P(Y_1 = 1 \mid X) = -0.5 |X| + 1$.
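The Figure 1 example can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and it relies on the observation that each level set of $S_1 = -0.25X^2 + 1$ is a symmetric pair $\{-x, x\}$ with equal slope magnitude, so the conditional weights on the level set are proportional to the domain density at each point.

```python
import math

def normal_pdf(x, mean):
    """Density of a unit-variance Gaussian with the given mean."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def true_posterior(x):
    """P(Y_1 = 1 | X = x) = -0.5|x| + 1, as in the Figure 1 caption."""
    return -0.5 * abs(x) + 1.0

def level_set(s1):
    """Inputs x with S_1(x) = -0.25 x^2 + 1 equal to s1 (requires s1 <= 1)."""
    r = math.sqrt((1.0 - s1) / 0.25)
    return [-r, r] if r > 0 else [0.0]

def posterior_given_score(s1, mean):
    """E_{X ~ P_d(X | S_1 = s1)}[P(Y_1 = 1 | X)] for a domain with this mean.

    Both level-set points share the same |dS_1/dx|, so the conditional
    weights are proportional to the domain density at each point.
    """
    xs = level_set(s1)
    weights = [normal_pdf(x, mean) for x in xs]
    total = sum(weights)
    return sum(w * true_posterior(x) for w, x in zip(weights, xs)) / total

source = posterior_given_score(0.75, mean=-0.5)  # source domain, P_s
target = posterior_given_score(0.75, mean=0.5)   # target domain, P_t
print(source, target)
```

Both values equal 0.5 up to floating-point rounding, matching Eq. 2: the level-set weights differ between domains, but the true posterior is identical at $x$ and $-x$, so the weights do not matter.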
### 3.2 Expectation Consistency Loss
According to Theorem [3.1](https://arxiv.org/html/2605.21552#S3.Thmtheorem1), the Expectation consistency condition ensures that the target domain can be calibrated as effectively as the source domain. Specifically, in canonical calibration, the Expectation consistency condition can be rewritten as follows:
$$\mathop{\mathbb{E}}\limits_{P_t(S)} \left\lVert \mathop{\mathbb{E}}\limits_{P_s(X \mid S)} P(Y \mid X) - \mathop{\mathbb{E}}\limits_{P_t(X \mid S)} P(Y \mid X) \right\rVert = 0, \tag{3}$$
where $S = (S_1, \cdots, S_K) = (f_1(X), \cdots, f_K(X)) = f(X)$ and $P_t(S)$ represents the probability density of the predicted confidence score vector on the target domain. Therefore, the Expectation consistency loss can be naturally constructed as:
$$L_{ecl} = \mathop{\mathbb{E}}\limits_{P_t(S)} \left\lVert \mathop{\mathbb{E}}\limits_{P_s(X \mid S)} P(Y \mid X) - \mathop{\mathbb{E}}\limits_{P_t(X \mid S)} P(Y \mid X) \right\rVert. \tag{4}$$
To estimate $P(Y \mid X)$ in practice, we train an additional classification head on the original classifier's backbone, where the label is the one-hot encoded $Y$ and the input data is $X$. This classification head can be trained jointly with the original classifier, with the backbone frozen when updating this head. Optionally, this classification head can also be calibrated on the source domain.
Extension of Expectation Consistency Loss: Eq. [4](https://arxiv.org/html/2605.21552#S3.E4) is the Expectation consistency loss for canonical calibration. Similarly, the Expectation consistency loss for class-wise and top-label calibration can be obtained (see Appendix [E](https://arxiv.org/html/2605.21552#A5)).
### 3.3 Empirical Calculation and Differentiability
$L_{ecl}$ can be empirically estimated using confidence binning and Monte Carlo sampling:
$$\begin{cases} \hat{L}_{ecl} = \sum\limits_{j=1}^{B} \dfrac{\sharp b_j^{(t)}}{\sharp D_t} \left\lVert \hat{\mathbb{E}}_{s,j} - \hat{\mathbb{E}}_{t,j} \right\rVert, \\ \hat{\mathbb{E}}_{s,j} = \dfrac{1}{\sharp D_s^{(j)}} \sum\limits_{x \in D_s^{(j)}} \hat{P}(Y = y \mid X = x), \\ \hat{\mathbb{E}}_{t,j} = \dfrac{1}{\sharp D_t^{(j)}} \sum\limits_{x \in D_t^{(j)}} \hat{P}(Y = y \mid X = x), \end{cases} \tag{5}$$
where $B$ represents the number of bins, $b_j^{(t)}$ represents the $j$-th bin in the target domain, $\sharp b_j^{(t)}$ represents the sample size of $b_j^{(t)}$, $\sharp D_t$ represents the sample size of $D_t$, $D_s^{(j)}$ represents the level set of $b_j^{(t)}$ in the source domain, $D_t^{(j)}$ represents the level set of $b_j^{(t)}$ in the target domain, and $\hat{P}(Y = y \mid X = x)$ represents the observation of $P(Y \mid X)$.
Differentiability: The confidence binning operation in Eq. [5](https://arxiv.org/html/2605.21552#S3.E5) is non-differentiable (Karandikar et al., [2021](https://arxiv.org/html/2605.21552#bib.bib44); Bohdal et al., [2023](https://arxiv.org/html/2605.21552#bib.bib45); Popordanoska et al., [2022](https://arxiv.org/html/2605.21552#bib.bib43)), so it cannot be directly used for classifier training. Therefore, a differentiable version is proposed below. Specifically, we replace hard bin membership with a smooth anchor-based assignment over confidence bins. For canonical calibration, the $i$-th confidence vector $S^{(i)} \in \Delta_{K-1}$ is a point in the simplex. We introduce $B$ anchor points $a_j \in \Delta_{K-1}$ and define the soft assignment of the $i$-th confidence vector $S^{(i)}$ as:
$$\omega_{ij} = \frac{\exp\left(-\lVert S^{(i)} - a_j \rVert_2^2 / \tau\right)}{\sum_{r=1}^{B} \exp\left(-\lVert S^{(i)} - a_r \rVert_2^2 / \tau\right)}, \tag{6}$$
with temperature $\tau > 0$. Denoting by $p^{(i)} = P(Y \mid X_i)$ the output of the additional classification head (as described in Section [3.2](https://arxiv.org/html/2605.21552#S3.SS2)), we obtain for each bin $j$ and domain $d \in \{s, t\}$:
$$\hat{\mathbb{E}}_{d,j} = \frac{\sum_{i} \omega^{d}_{ij}\, p^{(i)}}{\sum_{i} \omega^{d}_{ij} + \varepsilon}, \qquad n^{d}_{j} = \sum_{i} \omega^{d}_{ij}, \tag{7}$$
with a small stabilizer $\varepsilon > 0$, where $\omega^{d}_{ij}$ represents the soft assignment in domain $d$. Then, the differentiable ECL is:
$$\hat{L}_{ecl} = \sum_{j=1}^{B} w_j \left\lVert \hat{\mathbb{E}}_{s,j} - \hat{\mathbb{E}}_{t,j} \right\rVert, \qquad w_j = \frac{n^{t}_{j}}{\sum_{r=1}^{B} n^{t}_{r}}. \tag{8}$$
Extension of Differentiable ECL: Eq. [8](https://arxiv.org/html/2605.21552#S3.E8) is the differentiable Expectation consistency loss for canonical calibration. Similarly, the differentiable Expectation consistency loss for top-label and class-wise calibration can be obtained (see Appendix [F](https://arxiv.org/html/2605.21552#A6)).
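The forward computation of Eqs. 6 to 8 can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names, the anchor placement, and the values of `tau` and `eps` are hypothetical, and the unspecified norm in Eq. 8 is assumed here to be the Euclidean norm. A training implementation would compute the same quantities in an autodiff framework so that gradients flow through the soft assignments.

```python
import math

def soft_assign(score, anchors, tau):
    """Eq. 6: softmax over negative squared distances to the anchors."""
    logits = [-sum((s - a) ** 2 for s, a in zip(score, anchor)) / tau
              for anchor in anchors]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def bin_stats(scores, posteriors, anchors, tau, eps):
    """Eq. 7: soft per-bin means of the posterior estimates and soft counts."""
    num_bins, num_classes = len(anchors), len(posteriors[0])
    mass = [0.0] * num_bins
    weighted = [[0.0] * num_classes for _ in range(num_bins)]
    for score, post in zip(scores, posteriors):
        for j, w in enumerate(soft_assign(score, anchors, tau)):
            mass[j] += w
            for k in range(num_classes):
                weighted[j][k] += w * post[k]
    means = [[v / (mass[j] + eps) for v in weighted[j]] for j in range(num_bins)]
    return means, mass

def soft_ecl(src_scores, src_post, tgt_scores, tgt_post, anchors, tau=0.1, eps=1e-8):
    """Eq. 8: target-weighted sum of Euclidean gaps between per-bin means."""
    src_means, _ = bin_stats(src_scores, src_post, anchors, tau, eps)
    tgt_means, tgt_mass = bin_stats(tgt_scores, tgt_post, anchors, tau, eps)
    total_mass = sum(tgt_mass)
    loss = 0.0
    for j in range(len(anchors)):
        gap = math.sqrt(sum((a - b) ** 2 for a, b in zip(src_means[j], tgt_means[j])))
        loss += (tgt_mass[j] / total_mass) * gap
    return loss

# Toy binary example with three hypothetical anchors on the simplex.
anchors = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]
scores = [[0.8, 0.2], [0.3, 0.7]]
post = [[0.7, 0.3], [0.4, 0.6]]
same = soft_ecl(scores, post, scores, post, anchors)
shifted = soft_ecl(scores, post, scores, [[0.2, 0.8], [0.4, 0.6]], anchors)
print(same, shifted)
```

When the two domains are identical the per-bin means coincide and the loss is zero; changing the target posteriors while keeping the scores fixed produces a positive loss.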
### 3.4 Sample Complexity Analysis
###### Theorem 3.2.
(Sample Complexity of ECL Estimation) Let $\varepsilon \in (0, 1)$ and $\delta \in (0, 1)$. Consider the empirical ECL in Eq. [5](https://arxiv.org/html/2605.21552#S3.E5) (or Eq. [8](https://arxiv.org/html/2605.21552#S3.E8)) with $B$ bins, bin weights $w_j$ (target-domain proportions or their soft analogs), and per-bin sample counts $n^{t}_{j}$ and $n^{s}_{j}$. There exists an absolute constant $C > 0$ such that, with probability at least $1 - \delta$,
$$\left| \hat{L}_{ecl} - L_{ecl} \right| \leq C \sqrt{\log\left(\tfrac{2BK}{\delta}\right) \sum_{j=1}^{B} w_j \left( \frac{1}{n^{t}_{j}} + \frac{1}{n^{s}_{j}} \right)}. \tag{9}$$
Its proof is provided in Appendix [G](https://arxiv.org/html/2605.21552#A7).
Remark on Theorem [3.2](https://arxiv.org/html/2605.21552#S3.Thmtheorem2): Theorem [3.2](https://arxiv.org/html/2605.21552#S3.Thmtheorem2) implies that ECL has a sample complexity similar to that of histogram binning for ECE, namely $\mathcal{O}(B/\varepsilon^{2})$, and the weights $w_j$ explicitly cap the influence of sparse bins. This sample complexity is also similar to that of some point estimation methods (e.g., maximum likelihood estimation with $\mathcal{O}(1/\varepsilon^{2})$) and is feasible for most real-world learning tasks.
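To see where the $\mathcal{O}(B/\varepsilon^{2})$ rate comes from, consider a simplified case that is an assumption of this sketch rather than part of the theorem: perfectly balanced bins with $w_j = 1/B$ and $n^{s}_{j} = n^{t}_{j} = n/B$, where $n$ is the per-domain sample size. Substituting into Eq. 9 gives:

```latex
\sum_{j=1}^{B} w_j \left( \frac{1}{n^{t}_{j}} + \frac{1}{n^{s}_{j}} \right)
  = \sum_{j=1}^{B} \frac{1}{B} \cdot \frac{2B}{n}
  = \frac{2B}{n},
\qquad
C \sqrt{\log\left(\tfrac{2BK}{\delta}\right) \frac{2B}{n}} \leq \varepsilon
\iff
n \geq \frac{2 C^{2} B \log\left(\tfrac{2BK}{\delta}\right)}{\varepsilon^{2}}.
```

In this balanced case, $n = \mathcal{O}(B/\varepsilon^{2})$ samples per domain therefore suffice up to the logarithmic factor.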
### 3.5 Mini-Batch Trainability
Most modern deep learning methods are trained using mini-batches, where a small subset of data is processed at each step to compute the loss and update the model via gradient descent. This poses a challenge for confidence calibration loss, since small sample batches often fail to provide sufficiently accurate estimates of calibration error. Similar to the widely used cross-entropy loss, mini-batch trainability requires that the gradient computed on a mini-batch be an unbiased estimate of the gradient over the entire dataset, i.e., $\mathbb{E}_{D_s^{\rm m}, D_t^{\rm m}}\left[\nabla_{\theta} \hat{L}_{ecl}^{\rm m}\right] = \nabla_{\theta} \hat{L}_{ecl}$, where $D_s^{\rm m}$ and $D_t^{\rm m}$ represent mini-batches from the source and target domains, respectively. Therefore, we propose an equivalent formulation of Eq. [8](https://arxiv.org/html/2605.21552#S3.E8) and prove its mini-batch trainability, as established in Theorem [3.3](https://arxiv.org/html/2605.21552#S3.Thmtheorem3).
###### Theorem 3.3.
(ECL Mini-Batch Trainability) Eq. [10](https://arxiv.org/html/2605.21552#S3.E10) below is asymptotically equivalent to Eq. [8](https://arxiv.org/html/2605.21552#S3.E8) and satisfies $\mathbb{E}_{D_s^{\rm m}, D_t^{\rm m}}\left[\nabla_{\theta} \hat{L}_{ecl}^{\rm mini}\right] = \nabla_{\theta} \hat{L}_{ecl}$. The proof is provided in Appendix [H](https://arxiv.org/html/2605.21552#A8).
$$\begin{aligned} \hat{L}_{ecl}(\theta, u_j^{s}, u_j^{t}) &= \sum_{j=1}^{B} w_j \lVert u_j^{s} - u_j^{t} \rVert \\ &\quad + \sum_{j=1}^{B} \sum_{i \in D_s} \omega_{i,j}^{s}\, \lVert u_j^{s} - p^{(i)}(\theta) \rVert^{2} \\ &\quad + \sum_{j=1}^{B} \sum_{i \in D_t} \omega_{i,j}^{t}\, \lVert u_j^{t} - p^{(i)}(\theta) \rVert^{2}, \end{aligned} \tag{10}$$
where $u_j^{s}$ and $u_j^{t}$ are learnable parameters used to approximate $\hat{\mathbb{E}}_{s,j}$ and $\hat{\mathbb{E}}_{t,j}$ during the training process, and $p^{(i)}(\theta)$ denotes $P(Y \mid X_i)$ estimated by an additional classification head trained on the original classifier's backbone.
Remark on Theorem [3.3](https://arxiv.org/html/2605.21552#S3.Thmtheorem3): Because nonlinear operators such as norms do not commute with expectations, computing Eq. [8](https://arxiv.org/html/2605.21552#S3.E8) directly on a mini-batch introduces bias into the gradient, as demonstrated in the proof of Theorem [3.3](https://arxiv.org/html/2605.21552#S3.Thmtheorem3). By introducing auxiliary variables ($u_j^{s}$ and $u_j^{t}$) for learning the expectation over the full dataset, Eq. [10](https://arxiv.org/html/2605.21552#S3.E10) avoids this problem. Algorithm [1](https://arxiv.org/html/2605.21552#alg1) provides the pseudocode for the actual calculation of Eq. [10](https://arxiv.org/html/2605.21552#S3.E10). Specifically, $u_j^{s}$ and $u_j^{t}$ in Eq. [10](https://arxiv.org/html/2605.21552#S3.E10) can be solved using alternating proximal updates (Bolte et al., [2014](https://arxiv.org/html/2605.21552#bib.bib46)), as detailed in Algorithm [1](https://arxiv.org/html/2605.21552#alg1).
Extension of ECL Mini-Batch Training: Algorithm [1](https://arxiv.org/html/2605.21552#alg1) is ECL mini-batch training for canonical calibration. Similarly, ECL mini-batch training for top-label and class-wise calibration can be obtained (see Appendix [I](https://arxiv.org/html/2605.21552#A9)).
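The non-commutation of norms and expectations can be seen on a toy case. The sketch below is only an illustration of the loss-value bias, not of the gradient statement in Theorem 3.3: it uses hypothetical scalar values in place of the per-bin differences and the absolute value as a one-dimensional stand-in for the norm in Eq. 8, and it enumerates every mini-batch of size 2 exactly.

```python
from itertools import combinations

# Hypothetical per-sample differences whose full-data mean is exactly zero.
values = [-1.0, 1.0, -1.0, 1.0]

def abs_mean(batch):
    """|mean| of a batch: a one-dimensional stand-in for the norm in Eq. 8."""
    return abs(sum(batch) / len(batch))

# Loss computed once on the full dataset.
full_loss = abs_mean(values)

# The same loss averaged over every possible mini-batch of size 2.
batches = list(combinations(values, 2))
expected_batch_loss = sum(abs_mean(b) for b in batches) / len(batches)

print(full_loss, expected_batch_loss)
```

The full-data loss is 0, while the average over mini-batches is 1/3, because two of the six batches contain equal-signed values. Averaging per-batch norms therefore does not recover the norm of the full-data average, which is the kind of bias the auxiliary variables in Eq. 10 are introduced to avoid.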
Algorithm 1 ECL Mini-Batch Training.
1: Input:
2: bins $j = 1 \ldots B$, hyperparameters $\alpha_{\text{ema}}$, $N_{\text{prox}}$, $\lambda$
3: $u_j^s = \mathbf{0} \in \mathbb{R}^K, \forall j$; $u_j^t = \mathbf{0} \in \mathbb{R}^K, \forall j$
4: for each iteration do
5: Sample mini-batches $D_s^m, D_t^m$;
6: Compute weights $\omega_{ij}^s, \omega_{ij}^t$;
7: $n_{s,j} \leftarrow \sum_{i \in D_s^m} \omega_{ij}^s$, $m_{s,j} \leftarrow \sum_{i \in D_s^m} \omega_{ij}^s p^{(i)}(\theta)$;
8: $n_{t,j} \leftarrow \sum_{i \in D_t^m} \omega_{ij}^t$, $m_{t,j} \leftarrow \sum_{i \in D_t^m} \omega_{ij}^t p^{(i)}(\theta)$;
9: $w_j \leftarrow n_{t,j} / \sum_{r=1}^{B} n_{t,r}$; $L_{\text{ecl}} \leftarrow 0$;
10: for each bin $j$ do
11: $u_s, u_t \leftarrow$ cached $u_j^s, u_j^t$
12: for $i = 1$ to $N_{\text{prox}}$ do
13: $v_s \leftarrow (m_{s,j} / n_{s,j}) - u_t$, $\tau_s = \dfrac{w_j}{2 n_{s,j}}$
14: $u_s \leftarrow u_t + \mathrm{shrink}(v_s, \tau_s)$
15: $v_t \leftarrow (m_{t,j} / n_{t,j}) - u_s$, $\tau_t = \dfrac{w_j}{2 n_{t,j}}$
16: $u_t \leftarrow u_s + \mathrm{shrink}(v_t, \tau_t)$
17: end for
18: $\tilde{u}_j^s, \tilde{u}_j^t \leftarrow u_s.\mathrm{detach}(),\; u_t.\mathrm{detach}()$
19: $u_j^s \leftarrow (1 - \alpha_{\text{ema}}) u_j^s + \alpha_{\text{ema}} \tilde{u}_j^s$
20: $u_j^t \leftarrow (1 - \alpha_{\text{ema}}) u_j^t + \alpha_{\text{ema}} \tilde{u}_j^t$
21: $L_{\text{ecl}} \mathrel{+}= \sum_{i \in D_s^m} \omega_{ij}^s \| \tilde{u}_j^s - p^{(i)}(\theta) \|^2$
22: $L_{\text{ecl}} \mathrel{+}= \sum_{i \in D_t^m} \omega_{ij}^t \| \tilde{u}_j^t - p^{(i)}(\theta) \|^2$
23: end for
24: Compute the cross-entropy loss $L_{\text{ce}}$
25: Backpropagate $L_{\text{ce}} + \lambda L_{\text{ecl}}$ and update $\theta$.
26: end for
27: Return: $\theta$
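Steps 12 to 17 of Algorithm 1 can be sketched for a single bin as follows. This is a minimal illustration, not the authors' implementation. The definition of $\mathrm{shrink}$ is not given in this section, so the code assumes it is block (L2) soft-thresholding, which is the proximal operator of $\tau \|\cdot\|_2$. If Eq. 10 uses a different norm, the operator would change accordingly. The function names and the plain-list vector representation are hypothetical.

```python
import math


def shrink(v, tau):
    """Block soft-thresholding (assumed form): shorten v by tau in L2 norm, or return 0."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm <= tau:
        return [0.0] * len(v)
    scale = 1.0 - tau / norm
    return [scale * x for x in v]


def prox_updates(mean_s, mean_t, u_s, u_t, n_s, n_t, w, n_prox):
    """Alternating updates of u_s and u_t for one bin (steps 12-17).

    mean_s, mean_t: weighted batch means m_{s,j}/n_{s,j} and m_{t,j}/n_{t,j}
    u_s, u_t:       cached auxiliary variables for this bin
    n_s, n_t, w:    bin masses n_{s,j}, n_{t,j} and bin weight w_j
    """
    tau_s = w / (2.0 * n_s)
    tau_t = w / (2.0 * n_t)
    for _ in range(n_prox):
        v_s = [a - b for a, b in zip(mean_s, u_t)]
        u_s = [b + d for b, d in zip(u_t, shrink(v_s, tau_s))]
        v_t = [a - b for a, b in zip(mean_t, u_s)]
        u_t = [b + d for b, d in zip(u_s, shrink(v_t, tau_t))]
    return u_s, u_t


# Small coupling: each variable moves most of the way to its own batch mean.
u_s, u_t = prox_updates([0.7, 0.3], [0.4, 0.6], [0.0, 0.0], [0.0, 0.0],
                        n_s=10.0, n_t=10.0, w=0.1, n_prox=5)
print(u_s, u_t)
```

Under this assumed operator the two limiting cases are easy to read off: with $w_j = 0$ each auxiliary variable equals its own batch mean, and with a very large $w_j$ the thresholds exceed every residual, so both variables stay at a common value.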
## 4Results
We verify the effectiveness of the proposed method from two perspectives: 1) calibration on simulated covariate shift data; 2) comparison with state-of-the-art calibration methods on real-world covariate shift datasets.
### 4\.1Calibration on Simulated Covariate Shift Data

Figure 2: Calibration results. Panels (c) to (k) show calibration on the simulated covariate shift dataset (see panels (a) and (b)), and panels (l) to (t) show calibration on the real-world covariate shift dataset PACS (three classes). NLL denotes the cross-entropy loss, Soft-ECE the softened differentiable ECE loss, CwECE the class-wise ECE, and CaECE the canonical ECE. The three types of reliability diagrams and the calibration metrics show that our method preserves or improves classifier accuracy while substantially reducing calibration errors. Our code is available at [https://github.com/NeuroDong/ECL](https://github.com/NeuroDong/ECL).

Experimental Setup: To observe covariate shift, we model source and target domain covariates as normal and uniform distributions (Figs. [2](https://arxiv.org/html/2605.21552#S4.F2)(a-b) and Figs. [3](https://arxiv.org/html/2605.21552#A10.F3)(a-b), respectively). For normal distributions, the source domain has mean [0, 0] and covariance [[5, 0], [0, 5]], while the target domain has mean [2, 2] and the same covariance. For uniform distributions, the source domain is 2D uniform on $[-2.5, 2.5]^2$ and the target domain on $[-1.5, 3.5]^2$. Since $P_s(Y|X) = P_t(Y|X)$, the labeling function is identical in both domains, as shown by the blue segmentation curves in Figs. [2](https://arxiv.org/html/2605.21552#S4.F2)(a-b) and Figs. [3](https://arxiv.org/html/2605.21552#A10.F3)(a-b). We sample 400 points from each domain. The classifier is a three-layer backpropagation neural network trained with the Adam optimizer (learning rate 0.001) for 100 epochs. Reliability diagrams use 15 bins (Guo et al., [2017](https://arxiv.org/html/2605.21552#bib.bib31); Zhang et al., [2020](https://arxiv.org/html/2605.21552#bib.bib32)).
The classification head estimating $P(Y|X)$ (or $P(Y^{*} = \hat{Y}|X)$ for top-label calibration) is calibrated on the source domain using Soft-ECE loss.
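The covariate distributions described in this setup can be sampled with the standard library alone. The sketch below covers only the covariates: the labeling function is shown in the figures but not specified in the text, so no labels are generated. The helper names and the fixed seed are assumptions for illustration.

```python
import math
import random


def sample_gaussian_2d(n, mean, variance, rng):
    """Draw n points from a 2D normal with diagonal covariance variance * I."""
    std = math.sqrt(variance)
    return [(rng.gauss(mean[0], std), rng.gauss(mean[1], std)) for _ in range(n)]


def sample_uniform_2d(n, low, high, rng):
    """Draw n points uniformly from the square [low, high]^2."""
    return [(rng.uniform(low, high), rng.uniform(low, high)) for _ in range(n)]


rng = random.Random(0)  # seed chosen arbitrarily for reproducibility

# Normal case: same covariance [[5, 0], [0, 5]], shifted mean.
gauss_source = sample_gaussian_2d(400, (0.0, 0.0), 5.0, rng)
gauss_target = sample_gaussian_2d(400, (2.0, 2.0), 5.0, rng)

# Uniform case: squares of equal size with shifted support.
unif_source = sample_uniform_2d(400, -2.5, 2.5, rng)
unif_target = sample_uniform_2d(400, -1.5, 3.5, rng)
```

Because the covariance matrix is diagonal, the two coordinates are independent and can be drawn separately with standard deviation $\sqrt{5}$.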
Results: Fig. [2](https://arxiv.org/html/2605.21552#S4.F2) and Fig. [3](https://arxiv.org/html/2605.21552#A10.F3) (in Appendix [J.1](https://arxiv.org/html/2605.21552#A10.SS1)) show the calibration results on the simulated covariate shift dataset: Fig. [2](https://arxiv.org/html/2605.21552#S4.F2) covers normally distributed covariates and Fig. [3](https://arxiv.org/html/2605.21552#A10.F3) uniformly distributed covariates. The ECL results in each reliability diagram come from the ECL variant for the corresponding calibration paradigm. In the top-label and class-wise reliability diagrams, the curves after ECL calibration lie closer to the diagonal, indicating improved calibration. In canonical reliability diagrams, high calibration errors usually occur near the midpoint of a side of the large triangle, where no component of the predicted vector has very high confidence. After ECL calibration, the number of highlighted small-triangle bins in the canonical reliability diagrams decreases (see Fig. [2](https://arxiv.org/html/2605.21552#S4.F2)) or their color shifts to dark blue (see Fig. [3](https://arxiv.org/html/2605.21552#A10.F3)). Under both covariate distribution shifts, the evaluation metrics show that ECL consistently reduces calibration error in all three calibration paradigms and improves accuracy in most cases.
### 4\.2Calibration on Real\-World Covariate Shift Datasets
Table 2: ECE (%) for top-label calibration on digit recognition datasets. The reported results represent the mean and standard deviation derived from ten runs. Columns Uncal through Oracle report ECE (↓); the last column reports ΔACC (%).

| Datasets | Network | Uncal | Soft-ECE | DECE | KDE | TS | TransCal | DRL | PseudoCal | ECL (Ours) | Oracle | ΔACC (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Digit → MNIST | LeNet-5 | 27.3±2.63 | 27.8±2.15 | 26.5±1.88 | 27.9±2.01 | 27.7±1.34 | 26.9±1.16 | 22.3±2.04 | 9.08±0.71 | 8.52±0.78 | 0.30±0.01 | -0.92±0.35 |
| Digit → MNIST | ResNet20 | 16.2±1.51 | 16.5±1.22 | 15.8±1.45 | 16.1±1.10 | 15.3±1.04 | 13.1±0.99 | 10.2±0.72 | 8.22±0.53 | 7.88±0.45 | 1.54±0.04 | +1.25±0.42 |
| Digit → MNIST | DenseNet40 | 23.4±1.79 | 23.6±1.55 | 22.1±1.62 | 22.9±1.48 | 21.6±1.71 | 19.8±0.96 | 14.8±0.95 | 9.72±0.68 | 9.15±0.61 | 1.40±0.03 | +0.68±0.20 |
| Digit → USPS | LeNet-5 | 22.9±1.50 | 23.1±1.28 | 22.4±1.40 | 22.8±1.35 | 22.7±1.13 | 21.8±1.32 | 15.5±1.16 | 8.92±0.45 | 8.12±0.42 | 1.54±0.02 | -0.85±0.25 |
| Digit → USPS | ResNet20 | 9.14±0.74 | 9.32±0.65 | 9.05±0.71 | 9.45±0.55 | 9.12±0.84 | 8.36±0.45 | 7.99±0.66 | 5.01±0.30 | 5.25±0.28 | 2.23±0.06 | +1.42±0.37 |
| Digit → USPS | DenseNet40 | 15.7±0.83 | 15.9±0.76 | 15.3±0.92 | 15.8±1.01 | 13.1±1.02 | 12.1±1.04 | 7.92±0.47 | 5.34±0.34 | 4.96±0.28 | 2.54±0.05 | -0.76±0.18 |
| Digit → SVHN | LeNet-5 | 61.9±6.16 | 62.2±5.50 | 60.8±5.22 | 62.5±5.80 | 61.3±5.89 | 63.7±4.94 | 23.7±1.93 | 52.4±4.55 | 21.5±1.51 | 1.03±0.02 | +1.65±0.65 |
| Digit → SVHN | ResNet20 | 68.2±6.44 | 67.5±5.92 | 66.9±6.10 | 67.8±6.25 | 68.1±6.13 | 59.4±4.63 | 40.1±3.77 | 48.2±3.95 | 36.8±2.08 | 0.50±0.02 | +2.12±0.88 |
| Digit → SVHN | DenseNet40 | 80.8±6.26 | 81.2±5.88 | 79.5±6.05 | 81.1±6.15 | 77.2±6.98 | 72.9±5.13 | 42.0±3.36 | 64.7±4.72 | 38.4±3.21 | 0.86±0.03 | -1.15±0.45 |

#### 4.2.1 Experimental Setup
Datasets and Networks: To evaluate calibration methods on real-world data, we select three types of covariate shift datasets: 1) a digit recognition dataset with three domains (MNIST (Lecun et al., [1998](https://arxiv.org/html/2605.21552#bib.bib7)), USPS (Hull, [1994](https://arxiv.org/html/2605.21552#bib.bib9)), and SVHN (Netzer et al., [2011](https://arxiv.org/html/2605.21552#bib.bib47))); 2) the domain adaptation dataset PACS, with four domains (Photo, Art Painting, Cartoon, and Sketch) (Li et al., [2017](https://arxiv.org/html/2605.21552#bib.bib10)); 3) the large-scale dataset ImageNet-Sketch, with 1000 classes and two domains (ImageNet and Sketch) (Wang et al., [2019](https://arxiv.org/html/2605.21552#bib.bib11)). To construct each covariate shift task, one domain of the dataset is used as the target domain and the remaining domains are merged into the source domain. We use the networks commonly adopted on these datasets: LeNet (Lecun et al., [1998](https://arxiv.org/html/2605.21552#bib.bib7)), ResNet (He et al., [2016](https://arxiv.org/html/2605.21552#bib.bib8)), DenseNet (Huang et al., [2017](https://arxiv.org/html/2605.21552#bib.bib3)), Wide-ResNet (Zagoruyko and Komodakis, [2016](https://arxiv.org/html/2605.21552#bib.bib2)), and ViT (Dosovitskiy et al., [2021](https://arxiv.org/html/2605.21552#bib.bib1)).
Calibration Metrics: To evaluate calibration performance under all three calibration paradigms, we use the following metrics: 1) ECE: the classic expected calibration error (Guo et al., [2017](https://arxiv.org/html/2605.21552#bib.bib31)) for top-label calibration; 2) CwECE: the class-wise expected calibration error (Kull et al., [2019](https://arxiv.org/html/2605.21552#bib.bib33)) for class-wise calibration; 3) $\mathrm{ECE}^{\mathrm{KDE}}$: a consistent and differentiable metric for canonical calibration. In addition, we report $\Delta$ACC, the accuracy change relative to the uncalibrated classifier under the same task and architecture, defined as $\Delta\mathrm{ACC} = \mathrm{ACC}(\text{ECL}) - \mathrm{ACC}(\text{Uncal})$.
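For reference, the binned top-label ECE and the $\Delta$ACC definition can be sketched as below. This is the standard equal-width binning estimator, not necessarily the exact evaluation code used in the experiments. The function names are hypothetical, and the bin-edge convention (left-closed bins, with confidence 1.0 placed in the last bin) is an assumption.

```python
def expected_calibration_error(confidences, correct, n_bins=15):
    """Binned ECE: sum over bins of (bin fraction) * |accuracy - mean confidence|."""
    n = len(confidences)
    conf_sum = [0.0] * n_bins
    hit_sum = [0.0] * n_bins
    count = [0] * n_bins
    for conf, is_correct in zip(confidences, correct):
        # Equal-width bins on [0, 1]; conf == 1.0 falls into the last bin.
        b = min(int(conf * n_bins), n_bins - 1)
        conf_sum[b] += conf
        hit_sum[b] += 1.0 if is_correct else 0.0
        count[b] += 1
    ece = 0.0
    for b in range(n_bins):
        if count[b] == 0:
            continue  # empty bins contribute nothing
        gap = abs(hit_sum[b] / count[b] - conf_sum[b] / count[b])
        ece += (count[b] / n) * gap
    return ece


def delta_acc(acc_ecl, acc_uncal):
    """Accuracy change of the ECL-trained model relative to the uncalibrated one."""
    return acc_ecl - acc_uncal


# Two occupied bins, each off by 0.05 and each holding half of the samples.
print(expected_calibration_error([0.55, 0.55, 0.95, 0.95], [True, False, True, True]))
```

The value returned is a fraction in [0, 1]; multiplying by 100 gives the percentage scale used in Table 2.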
Baselines: For a comprehensive comparison, we compare the following methods: 1) Uncal: training using only cross-entropy loss; 2) Soft-ECE (Karandikar et al., [2021](https://arxiv.org/html/2605.21552#bib.bib44)): a softened differentiable ECE loss; 3) DECE (Bohdal et al., [2023](https://arxiv.org/html/2605.21552#bib.bib45)): another softened differentiable ECE loss; 4) KDE (Popordanoska et al., [2022](https://arxiv.org/html/2605.21552#bib.bib43)): a differentiable canonical calibration loss; 5) TS (Guo et al., [2017](https://arxiv.org/html/2605.21552#bib.bib31)): the classic post-hoc calibration method with temperature scaling; 6) TransCal (Wang et al., [2020](https://arxiv.org/html/2605.21552#bib.bib40)): a debiasing calibration method based on importance weighting; 7) DRL (Wang et al., [2023](https://arxiv.org/html/2605.21552#bib.bib41)): a calibration method based on distributionally robust learning; 8) PseudoCal (Hu et al., [2024](https://arxiv.org/html/2605.21552#bib.bib37)): a calibration method based on mixup data synthesis; 9) Oracle: Soft-ECE calibration using labels on the target domain.
#### 4\.2\.2Results
Table [2](https://arxiv.org/html/2605.21552#S4.T2) reports the calibration metric ECE for top-label calibration on the digit recognition benchmarks. Among the methods that do not use target labels, ECL achieves the lowest ECE in eight of the nine task and architecture combinations and is second to PseudoCal in the remaining one (USPS with ResNet20, 5.25% versus 5.01%). The advantage of ECL is particularly evident on the SVHN dataset, which involves larger distribution shifts; for instance, on LeNet-5, ECL reduces the ECE from 61.9% (Uncal) to 21.5%, substantially improving upon most baselines (e.g., PseudoCal at 52.4%). Furthermore, the $\Delta$ACC values range from -1.15 to +2.12 percentage points and are positive in five of the nine settings, suggesting that ECL improves calibration while largely preserving the discriminative power of the classifier.
Extended results covering broader benchmarks and calibration paradigms are detailed in Appendices [J.2](https://arxiv.org/html/2605.21552#A10.SS2), [J.3](https://arxiv.org/html/2605.21552#A10.SS3), and [J.4](https://arxiv.org/html/2605.21552#A10.SS4). Appendix [J.2](https://arxiv.org/html/2605.21552#A10.SS2) provides additional top-label calibration results on the PACS and ImageNet-Sketch datasets. Appendices [J.3](https://arxiv.org/html/2605.21552#A10.SS3) and [J.4](https://arxiv.org/html/2605.21552#A10.SS4) present comprehensive evaluations for class-wise and canonical calibration, respectively, across all three dataset suites (Digit, PACS, and ImageNet-Sketch). Overall, these experiments show that ECL is highly competitive and frequently achieves the lowest errors in terms of ECE, CwECE, and $\mathrm{ECE}^{\mathrm{KDE}}$.
### 4\.3Ablation
Mini-Batch Trainability: We empirically verify the role of our proposed mini-batch training strategy by comparing it with a naive baseline, Mini-Batch Non-Trainable ECL, which directly computes Eq. [8](https://arxiv.org/html/2605.21552#S3.E8) on mini-batches. As shown in Table [7](https://arxiv.org/html/2605.21552#A10.T7) (see Appendix), our Mini-Batch Trainable ECL (Algorithm [1](https://arxiv.org/html/2605.21552#alg1)) is more stable and achieves better calibration in most settings, supporting the effectiveness of the auxiliary variable formulation (Theorem [3.3](https://arxiv.org/html/2605.21552#S3.Thmtheorem3)).
Loss Weight $\lambda$: To balance the cross-entropy loss and ECL, we employ an adaptive weighting strategy: $\lambda = \beta^{\gamma}$, where $\beta = \left(\sum_i \mathcal{L}_{ce}^{(i)}\right) / \left(\sum_i \mathcal{L}_{ecl}^{(i)}\right)$ acts as a balancing factor between the two loss magnitudes. The hyperparameter $\gamma$ controls the sensitivity of this regularization. Our ablation study in Table [8](https://arxiv.org/html/2605.21552#A10.T8) (see Appendix) suggests that linear scaling ($\gamma = 1.0$) provides a strong trade-off between calibration improvement and accuracy preservation in our tested settings.
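The weighting rule can be written in a few lines. The sketch below operates on plain floats and treats $\lambda$ as a constant. Whether $\beta$ is excluded from backpropagation is not stated in this section, so that is an assumption, as are the function name and the guard against a non-positive ECL total.

```python
def adaptive_lambda(ce_losses, ecl_losses, gamma=1.0):
    """Return lambda = beta ** gamma with beta = sum(ce_losses) / sum(ecl_losses)."""
    total_ce = sum(ce_losses)
    total_ecl = sum(ecl_losses)
    if total_ecl <= 0.0:
        # Guard added for this sketch; the source does not discuss this case.
        raise ValueError("the ECL total must be positive to form the ratio beta")
    beta = total_ce / total_ecl
    return beta ** gamma


ce = [1.0, 3.0]    # per-sample cross-entropy values (illustrative numbers)
ecl = [0.5, 1.5]   # per-sample ECL values (illustrative numbers)
lam = adaptive_lambda(ce, ecl, gamma=1.0)
print(lam, lam * sum(ecl), sum(ce))
```

With $\gamma = 1$ the weighted ECL total equals the cross-entropy total, which is the sense in which $\beta$ balances the two magnitudes; $\gamma = 0$ reduces to a fixed weight of 1.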
## 5Discussion
Why It Works: The essence of ECL is to reorganize the confidence space rather than to align covariate distributions. For each confidence level $S$, it ensures that source and target samples attaining this confidence share the same expected true posterior $P(Y|X)$, effectively grouping samples with similar true accuracy into the same confidence bins regardless of their input distributions. This level-set alignment directly addresses the essential need of confidence calibration under covariate shift, thereby achieving stable and effective calibration.
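The level-set view in this paragraph suggests a simple diagnostic for simulated data, where the true posterior is known. The sketch below bins samples by confidence and reports, per bin, the absolute gap between the mean true posterior on the source and on the target. It is a scalar (top-label style) illustration with hypothetical names, not the paper's formal condition or loss, and it assumes the true posterior of the predicted class is available for every sample.

```python
def binned_posterior_gap(conf_s, post_s, conf_t, post_t, n_bins=15):
    """Per confidence bin: |mean posterior on source - mean posterior on target|.

    Bins that are empty in either domain yield None.
    """
    def bin_means(confs, posts):
        sums = [0.0] * n_bins
        counts = [0] * n_bins
        for c, p in zip(confs, posts):
            b = min(int(c * n_bins), n_bins - 1)
            sums[b] += p
            counts[b] += 1
        return [s / k if k else None for s, k in zip(sums, counts)]

    means_s = bin_means(conf_s, post_s)
    means_t = bin_means(conf_t, post_t)
    return [abs(a - b) if a is not None and b is not None else None
            for a, b in zip(means_s, means_t)]


# Same confidence level in both domains, same average posterior: gap is zero.
gaps = binned_posterior_gap([0.8, 0.8], [0.7, 0.9], [0.8], [0.8])
print([g for g in gaps if g is not None])
```

A gap near zero in every occupied bin indicates that the two domains agree at each confidence level, even when the covariate distributions themselves differ.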
Potential Impact, Limitations, and Future Work: We rethink confidence calibration under covariate shifts by moving beyond traditional importance weighting. Our findings reveal that strict covariate distribution alignment is unnecessary; instead, a weaker condition, the Expectation Consistency Condition, is sufficient for target domain calibration. This insight has the potential to inspire further research and enhance decision-making in safety-critical cross-population applications. However, our method assumes invariant posterior class probabilities $P(Y|X)$, a common assumption among methods in this field. Consequently, scenarios involving label shift, where the input-output relationship changes, fall outside the scope of this work. Future work will explore extending our framework to address calibration under both covariate and label shifts.
## 6Conclusion
This paper rethinks confidence calibration under covariate shifts by moving beyond the traditional importance weighting paradigm. We derive a necessary and sufficient condition for confidence calibration under covariate shifts, termed the Expectation Consistency Condition, which reveals that covariate shifts do not necessarily lead to uncalibrated confidence and provides a weaker condition than global covariate distribution alignment. Building upon this theoretical foundation, we propose the Expectation Consistency Loss (ECL), an unsupervised domain adaptation loss that applies to canonical, class-wise, and top-label calibration paradigms. Furthermore, we prove that ECL shares the same sample complexity as histogram binning for ECE estimation and provide a theoretically grounded mini-batch training scheme that enables unbiased gradient computation. Extensive experiments on both simulated and real-world covariate shift datasets demonstrate that ECL achieves competitive calibration errors across all three calibration paradigms while generally preserving classifier accuracy. Our work opens new avenues for confidence calibration research by shifting the focus from global distribution alignment to enforcing local consistency in critical statistics.
## Acknowledgements
This work was supported by the Science and Technology Innovation Program of Hunan Province \(Grant Number: 2024RC1007\) and the Central South University Post\-Graduate Independent Exploration and Innovation Project \(Grant Number: 2025ZZTS0616\)\.
## Impact Statement
This paper presents work whose goal is to advance the field of Machine Learning\. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here\.
## References
- S. Bickel, M. Brückner, and T. Scheffer (2009) Discriminative learning under covariate shift. Journal of Machine Learning Research 10(9), pp. 2137–2155. External Links: ISSN 1532-4435. Cited by: [§1](https://arxiv.org/html/2605.21552#S1.p2.1).
- O. Bohdal, Y. Yang, and T. Hospedales (2023) Meta-calibration: learning of model calibration using differentiable expected calibration error. Transactions on Machine Learning Research. External Links: ISSN 2835-8856, [Link](https://openreview.net/forum?id=R2hUure38l). Cited by: [Table 1](https://arxiv.org/html/2605.21552#S1.T1.3.3.1), [§3.3](https://arxiv.org/html/2605.21552#S3.SS3.p2.6), [§4.2.1](https://arxiv.org/html/2605.21552#S4.SS2.SSS1.p3.1).
- J. Bolte, S. Sabach, and M. Teboulle (2014) Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming 146(1), pp. 459–494. External Links: ISSN 1436-4646, [Document](https://dx.doi.org/10.1007/s10107-013-0701-9), [Link](https://doi.org/10.1007/s10107-013-0701-9). Cited by: [§3.5](https://arxiv.org/html/2605.21552#S3.SS5.p2.4).
- C\. Cortes, Y\. Mansour, and M\. Mohri \(2010\)Learning bounds for importance weighting\.InAdvances in Neural Information Processing Systems,J\. Lafferty, C\. Williams, J\. Shawe\-Taylor, R\. Zemel, and A\. Culotta \(Eds\.\),Vol\.23,pp\.\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2010/file/59c33016884a62116be975a9bb8257e3-Paper.pdf)Cited by:[§1](https://arxiv.org/html/2605.21552#S1.p3.1)\.
- J. Dong, Z. Jiang, D. Pan, Z. Chen, Q. Guan, H. Zhang, G. Gui, and W. Gui (2025a) A survey on confidence calibration of deep learning-based classification models under class imbalance data. IEEE Transactions on Neural Networks and Learning Systems 36(9), pp. 15664–15684. External Links: [Document](https://dx.doi.org/10.1109/TNNLS.2025.3565159). Cited by: [§2.1](https://arxiv.org/html/2605.21552#S2.SS1.p2.1).
- J. Dong, Z. Jiang, D. Pan, and H. Yu (2025b) Combining priors with experience: confidence calibration based on binomial process modeling. Proceedings of the AAAI Conference on Artificial Intelligence 39(15), pp. 16317–16326. External Links: [Link](https://ojs.aaai.org/index.php/AAAI/article/view/33792), [Document](https://dx.doi.org/10.1609/aaai.v39i15.33792). Cited by: [§1](https://arxiv.org/html/2605.21552#S1.p2.1), [§2.1](https://arxiv.org/html/2605.21552#S2.SS1.p3.1).
- A\. Dosovitskiy, L\. Beyer, A\. Kolesnikov, D\. Weissenborn, X\. Zhai, T\. Unterthiner, M\. Dehghani, M\. Minderer, G\. Heigold, S\. Gelly, J\. Uszkoreit, and N\. Houlsby \(2021\)An image is worth 16x16 words: transformers for image recognition at scale\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=YicbFdNTTy)Cited by:[§4\.2\.1](https://arxiv.org/html/2605.21552#S4.SS2.SSS1.p1.1)\.
- K\. R\. M\. Fernando and C\. P\. Tsokos \(2022\)Dynamically weighted balanced loss: class imbalanced learning and confidence calibration of deep neural networks\.IEEE Transactions on Neural Networks and Learning Systems33\(7\),pp\. 2940–2951\.External Links:[Document](https://dx.doi.org/10.1109/TNNLS.2020.3047335)Cited by:[§2\.1](https://arxiv.org/html/2605.21552#S2.SS1.p3.1)\.
- J. Gawlikowski, C. R. N. Tassi, M. Ali, J. Lee, M. Humt, J. Feng, A. Kruspe, R. Triebel, P. Jung, R. Roscher, M. Shahzad, W. Yang, R. Bamler, and X. X. Zhu (2023) A survey of uncertainty in deep neural networks. Artificial Intelligence Review 56(1), pp. 1513–1589. External Links: ISSN 1573-7462, [Document](https://dx.doi.org/10.1007/s10462-023-10562-9), [Link](https://doi.org/10.1007/s10462-023-10562-9). Cited by: [§1](https://arxiv.org/html/2605.21552#S1.p1.1).
- W\. Grathwohl, K\. Wang, J\. Jacobsen, D\. Duvenaud, M\. Norouzi, and K\. Swersky \(2020\)Your classifier is secretly an energy based model and you should treat it like one\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=Hkxzx0NtDB)Cited by:[§2\.1](https://arxiv.org/html/2605.21552#S2.SS1.p3.1)\.
- C\. Guo, G\. Pleiss, Y\. Sun, and K\. Q\. Weinberger \(2017\)On calibration of modern neural networks\.InProceedings of the 34th International Conference on Machine Learning,D\. Precup and Y\. W\. Teh \(Eds\.\),Proceedings of Machine Learning Research, Vol\.70,pp\. 1321–1330\.External Links:[Link](https://proceedings.mlr.press/v70/guo17a.html)Cited by:[§1](https://arxiv.org/html/2605.21552#S1.p2.1),[§2\.1](https://arxiv.org/html/2605.21552#S2.SS1.p2.1),[§2\.1](https://arxiv.org/html/2605.21552#S2.SS1.p3.1),[§4\.1](https://arxiv.org/html/2605.21552#S4.SS1.p1.5),[§4\.2\.1](https://arxiv.org/html/2605.21552#S4.SS2.SSS1.p2.4),[§4\.2\.1](https://arxiv.org/html/2605.21552#S4.SS2.SSS1.p3.1)\.
- K\. Gupta, A\. Rahimi, T\. Ajanthan, T\. Mensink, C\. Sminchisescu, and R\. Hartley \(2021\)Calibration of neural networks using splines\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=eQe8DEWNN2W)Cited by:[§2\.1](https://arxiv.org/html/2605.21552#S2.SS1.p3.1)\.
- Y. Han, D. Liu, J. Shang, L. Zheng, J. Zhong, W. Cao, H. Sun, and W. Xie (2024) BALQUE: batch active learning by querying unstable examples with calibrated confidence. Pattern Recognition 151, pp. 110385. External Links: ISSN 0031-3203, [Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.patcog.2024.110385), [Link](https://www.sciencedirect.com/science/article/pii/S0031320324001365). Cited by: [§1](https://arxiv.org/html/2605.21552#S1.p1.1).
- K\. He, X\. Zhang, S\. Ren, and J\. Sun \(2016\)Deep residual learning for image recognition\.InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition \(CVPR\),Cited by:[§4\.2\.1](https://arxiv.org/html/2605.21552#S4.SS2.SSS1.p1.1)\.
- R\. Hebbalaguppe, J\. Prakash, N\. Madan, and C\. Arora \(2022\)A stitch in time saves nine: a train\-time regularizing loss for improved neural network calibration\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\),pp\. 16081–16090\.Cited by:[§2\.1](https://arxiv.org/html/2605.21552#S2.SS1.p3.1)\.
- D\. Hu, J\. Liang, X\. Wang, and C\. Foo \(2024\)Pseudo\-calibration: improving predictive uncertainty estimation in unsupervised domain adaptation\.InProceedings of the 41st International Conference on Machine Learning,R\. Salakhutdinov, Z\. Kolter, K\. Heller, A\. Weller, N\. Oliver, J\. Scarlett, and F\. Berkenkamp \(Eds\.\),Proceedings of Machine Learning Research, Vol\.235,pp\. 19304–19326\.External Links:[Link](https://proceedings.mlr.press/v235/hu24i.html)Cited by:[Table 1](https://arxiv.org/html/2605.21552#S1.T1.10.10.1),[§1](https://arxiv.org/html/2605.21552#S1.p2.1),[§1](https://arxiv.org/html/2605.21552#S1.p3.1),[§3\.1](https://arxiv.org/html/2605.21552#S3.SS1.p1.1),[§4\.2\.1](https://arxiv.org/html/2605.21552#S4.SS2.SSS1.p3.1)\.
- G\. Huang, Z\. Liu, L\. van der Maaten, and K\. Q\. Weinberger \(2017\)Densely connected convolutional networks\.InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition \(CVPR\),Cited by:[§4\.2\.1](https://arxiv.org/html/2605.21552#S4.SS2.SSS1.p1.1)\.
- J\.J\. Hull \(1994\)A database for handwritten text recognition research\.IEEE Transactions on Pattern Analysis and Machine Intelligence16\(5\),pp\. 550–554\.External Links:[Document](https://dx.doi.org/10.1109/34.291440)Cited by:[§4\.2\.1](https://arxiv.org/html/2605.21552#S4.SS2.SSS1.p1.1)\.
- X. Jiang, M. Osl, J. Kim, and L. Ohno-Machado (2011) Calibrating predictive model estimates to support personalized medicine. Journal of the American Medical Informatics Association 19(2), pp. 263–274. External Links: ISSN 1067-5027, [Document](https://dx.doi.org/10.1136/amiajnl-2011-000291), [Link](https://doi.org/10.1136/amiajnl-2011-000291), https://academic.oup.com/jamia/article-pdf/19/2/263/17374049/19-2-263.pdf. Cited by: [§1](https://arxiv.org/html/2605.21552#S1.p1.1).
- Z. Jiang, J. Dong, D. Pan, T. Wang, and W. Gui (2023) A novel intelligent monitoring method for the closing time of the taphole of blast furnace based on two-stage classification. Engineering Applications of Artificial Intelligence 120, pp. 105849. External Links: ISSN 0952-1976, [Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.engappai.2023.105849), [Link](https://www.sciencedirect.com/science/article/pii/S0952197623000337). Cited by: [§1](https://arxiv.org/html/2605.21552#S1.p1.1).
- A\. Karandikar, N\. Cain, D\. Tran, B\. Lakshminarayanan, J\. Shlens, M\. C\. Mozer, and B\. Roelofs \(2021\)Soft calibration objectives for neural networks\.InAdvances in Neural Information Processing Systems,M\. Ranzato, A\. Beygelzimer, Y\. Dauphin, P\.S\. Liang, and J\. W\. Vaughan \(Eds\.\),Vol\.34,pp\. 29768–29779\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2021/file/f8905bd3df64ace64a68e154ba72f24c-Paper.pdf)Cited by:[Table 1](https://arxiv.org/html/2605.21552#S1.T1.2.2.2),[§3\.3](https://arxiv.org/html/2605.21552#S3.SS3.p2.6),[§4\.2\.1](https://arxiv.org/html/2605.21552#S4.SS2.SSS1.p3.1)\.
- M. Kimura and H. Hino (2024) A short survey on importance weighting for machine learning. Transactions on Machine Learning Research. Note: Survey Certification. External Links: ISSN 2835-8856, [Link](https://openreview.net/forum?id=IhXM3g2gxg). Cited by: [§1](https://arxiv.org/html/2605.21552#S1.p2.1), [§1](https://arxiv.org/html/2605.21552#S1.p3.1).
- M\. Kull, M\. Perello Nieto, M\. Kängsepp, T\. Silva Filho, H\. Song, and P\. Flach \(2019\)Beyond temperature scaling: obtaining well\-calibrated multi\-class probabilities with dirichlet calibration\.InAdvances in Neural Information Processing Systems,H\. Wallach, H\. Larochelle, A\. Beygelzimer, F\. d'Alché\-Buc, E\. Fox, and R\. Garnett \(Eds\.\),Vol\.32,pp\.\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2019/file/8ca01ea920679a0fe3728441494041b9-Paper.pdf)Cited by:[§1](https://arxiv.org/html/2605.21552#S1.p2.1),[§2\.1](https://arxiv.org/html/2605.21552#S2.SS1.p2.1),[§2\.1](https://arxiv.org/html/2605.21552#S2.SS1.p3.1),[§4\.2\.1](https://arxiv.org/html/2605.21552#S4.SS2.SSS1.p2.4)\.
- Y\. Lecun, L\. Bottou, Y\. Bengio, and P\. Haffner \(1998\)Gradient\-based learning applied to document recognition\.Proceedings of the IEEE86\(11\),pp\. 2278–2324\.External Links:[Document](https://dx.doi.org/10.1109/5.726791)Cited by:[§4\.2\.1](https://arxiv.org/html/2605.21552#S4.SS2.SSS1.p1.1)\.
- Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. Nature 521(7553), pp. 436–444. External Links: ISSN 1476-4687, [Document](https://dx.doi.org/10.1038/nature14539), [Link](https://doi.org/10.1038/nature14539). Cited by: [§1](https://arxiv.org/html/2605.21552#S1.p1.1).
- D\. Li, Y\. Yang, Y\. Song, and T\. M\. Hospedales \(2017\)Deeper, broader and artier domain generalization\.InProceedings of the IEEE International Conference on Computer Vision \(ICCV\),Cited by:[§4\.2\.1](https://arxiv.org/html/2605.21552#S4.SS2.SSS1.p1.1)\.
- Y\. Li and C\. Caragea \(2023\)Distilling calibrated knowledge for stance detection\.InFindings of the Association for Computational Linguistics: ACL 2023,A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\),Toronto, Canada,pp\. 6316–6329\.External Links:[Link](https://aclanthology.org/2023.findings-acl.393/),[Document](https://dx.doi.org/10.18653/v1/2023.findings-acl.393)Cited by:[§1](https://arxiv.org/html/2605.21552#S1.p1.1)\.
- B\. Liu, J\. Rony, A\. Galdran, J\. Dolz, and I\. Ben Ayed \(2023\)Class adaptive network calibration\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\),pp\. 16070–16079\.Cited by:[§2\.1](https://arxiv.org/html/2605.21552#S2.SS1.p3.1)\.
- R\. Müller, S\. Kornblith, and G\. E\. Hinton \(2019\)When does label smoothing help?\.InAdvances in Neural Information Processing Systems,H\. Wallach, H\. Larochelle, A\. Beygelzimer, F\. d'Alché\-Buc, E\. Fox, and R\. Garnett \(Eds\.\),Vol\.32,pp\.\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2019/file/f1748d6b0fd9d439f71450117eba2725-Paper.pdf)Cited by:[§2\.1](https://arxiv.org/html/2605.21552#S2.SS1.p3.1)\.
- M\. A\. Munir, S\. H\. Khan, M\. H\. Khan, M\. Ali, and F\. Shahbaz Khan \(2023\)Cal\-detr: calibrated detection transformer\.InAdvances in Neural Information Processing Systems,A\. Oh, T\. Naumann, A\. Globerson, K\. Saenko, M\. Hardt, and S\. Levine \(Eds\.\),Vol\.36,pp\. 71619–71631\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2023/file/e271e30de7a2e462ca1f85cefa816380-Paper-Conference.pdf)Cited by:[§1](https://arxiv.org/html/2605.21552#S1.p1.1)\.
- Y\. Netzer, T\. Wang, A\. Coates, A\. Bissacco, B\. Wu, A\. Y\. Ng,et al\.\(2011\)Reading digits in natural images with unsupervised feature learning\.InNIPS workshop on deep learning and unsupervised feature learning,Vol\.2011,pp\. 7\.Cited by:[§4\.2\.1](https://arxiv.org/html/2605.21552#S4.SS2.SSS1.p1.1)\.
- A. Pampari and S. Ermon (2020) Unsupervised calibration under covariate shift. CoRR abs/2006.16405. External Links: [Link](https://arxiv.org/abs/2006.16405). Cited by: [Table 1](https://arxiv.org/html/2605.21552#S1.T1.6.6.2), [§1](https://arxiv.org/html/2605.21552#S1.p3.1), [§3.1](https://arxiv.org/html/2605.21552#S3.SS1.p1.1).
- S\. Park, O\. Bastani, J\. Weimer, and I\. Lee \(2020\)Calibrated prediction with covariate shift via unsupervised domain adaptation\.InProceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics,S\. Chiappa and R\. Calandra \(Eds\.\),Proceedings of Machine Learning Research, Vol\.108,pp\. 3219–3229\.External Links:[Link](https://proceedings.mlr.press/v108/park20b.html)Cited by:[Table 1](https://arxiv.org/html/2605.21552#S1.T1.7.7.1),[§1](https://arxiv.org/html/2605.21552#S1.p3.1),[§3\.1](https://arxiv.org/html/2605.21552#S3.SS1.p1.1)\.
- T\. Popordanoska, R\. Sayer, and M\. Blaschko \(2022\)A consistent and differentiable lp canonical calibration error estimator\.InAdvances in Neural Information Processing Systems,S\. Koyejo, S\. Mohamed, A\. Agarwal, D\. Belgrave, K\. Cho, and A\. Oh \(Eds\.\),Vol\.35,pp\. 7933–7946\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2022/file/33d6e648ee4fb24acec3a4bbcd4f001e-Paper-Conference.pdf)Cited by:[Table 1](https://arxiv.org/html/2605.21552#S1.T1.4.4.1),[§3\.3](https://arxiv.org/html/2605.21552#S3.SS3.p2.6),[§4\.2\.1](https://arxiv.org/html/2605.21552#S4.SS2.SSS1.p3.1)\.
- A\. Rahimi, A\. Shaban, C\. Cheng, R\. Hartley, and B\. Boots \(2020\)Intra order\-preserving functions for calibration of multi\-class neural networks\.InAdvances in Neural Information Processing Systems,H\. Larochelle, M\. Ranzato, R\. Hadsell, M\.F\. Balcan, and H\. Lin \(Eds\.\),Vol\.33,pp\. 13456–13467\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2020/file/9bc99c590be3511b8d53741684ef574c-Paper.pdf)Cited by:[§2\.1](https://arxiv.org/html/2605.21552#S2.SS1.p3.1)\.
- H\. Wang, S\. Ge, Z\. Lipton, and E\. P\. Xing \(2019\)Learning robust global representations by penalizing local predictive power\.InAdvances in Neural Information Processing Systems,H\. Wallach, H\. Larochelle, A\. Beygelzimer, F\. d'Alché\-Buc, E\. Fox, and R\. Garnett \(Eds\.\),Vol\.32,pp\.\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2019/file/3eefceb8087e964f89c2d59e8a249915-Paper.pdf)Cited by:[§4\.2\.1](https://arxiv.org/html/2605.21552#S4.SS2.SSS1.p1.1)\.
- H\. Wang, Z\. Yu, Y\. Yue, A\. Anandkumar, A\. Liu, and J\. Yan \(2023\)Learning calibrated uncertainties for domain shift: a distributionally robust learning approach\.InProceedings of the Thirty\-Second International Joint Conference on Artificial Intelligence,IJCAI ’23\.External Links:ISBN 978\-1\-956792\-03\-4,[Link](https://doi.org/10.24963/ijcai.2023/162),[Document](https://dx.doi.org/10.24963/ijcai.2023/162)Cited by:[Table 1](https://arxiv.org/html/2605.21552#S1.T1.9.9.1),[§1](https://arxiv.org/html/2605.21552#S1.p3.1),[§3\.1](https://arxiv.org/html/2605.21552#S3.SS1.p1.1),[§4\.2\.1](https://arxiv.org/html/2605.21552#S4.SS2.SSS1.p3.1)\.
- X\. Wang, M\. Long, J\. Wang, and M\. Jordan \(2020\)Transferable calibration with lower bias and variance in domain adaptation\.InAdvances in Neural Information Processing Systems,H\. Larochelle, M\. Ranzato, R\. Hadsell, M\.F\. Balcan, and H\. Lin \(Eds\.\),Vol\.33,pp\. 19212–19223\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2020/file/df12ecd077efc8c23881028604dbb8cc-Paper.pdf)Cited by:[Table 1](https://arxiv.org/html/2605.21552#S1.T1.8.8.1),[§1](https://arxiv.org/html/2605.21552#S1.p3.1),[§3\.1](https://arxiv.org/html/2605.21552#S3.SS1.p1.1),[§4\.2\.1](https://arxiv.org/html/2605.21552#S4.SS2.SSS1.p3.1)\.
- X\. Yang and S\. Ji \(2021\)JEM\+\+: improved techniques for training jem\.InProceedings of the IEEE/CVF International Conference on Computer Vision \(ICCV\),pp\. 6494–6503\.Cited by:[§2\.1](https://arxiv.org/html/2605.21552#S2.SS1.p3.1)\.
- S\. Zagoruyko and N\. Komodakis \(2016\) Wide residual networks\. In Proceedings of the British Machine Vision Conference 2016, pp\. 87–1\. Cited by: [§4\.2\.1](https://arxiv.org/html/2605.21552#S4.SS2.SSS1.p1.1)\.
- J\. Zhang, B\. Kailkhura, and T\. Y\. Han \(2020\) Mix\-n\-match: ensemble and compositional methods for uncertainty calibration in deep learning\. In Proceedings of the 37th International Conference on Machine Learning, H\. D\. III and A\. Singh \(Eds\.\), Proceedings of Machine Learning Research, Vol\. 119, pp\. 11117–11128\. External Links: [Link](https://proceedings.mlr.press/v119/zhang20k.html) Cited by: [§1](https://arxiv.org/html/2605.21552#S1.p2.1), [§2\.1](https://arxiv.org/html/2605.21552#S2.SS1.p3.1), [§4\.1](https://arxiv.org/html/2605.21552#S4.SS1.p1.5)\.
- F\. Zhu, X\. Zhang, Z\. Cheng, and C\. Liu \(2024\) Revisiting confidence estimation: towards reliable failure prediction\. 46\(5\), pp\. 3370–3387\. External Links: [Document](https://dx.doi.org/10.1109/TPAMI.2023.3342285) Cited by: [§1](https://arxiv.org/html/2605.21552#S1.p2.1)\.
## Appendix
## Appendix A Top-label Calibration and Class-wise Calibration
###### Definition A.1.
(Top-label Calibration) A classifier is perfectly top-label calibrated if the following equation holds:
$$P(Y^{*}=\hat{Y}\mid\hat{S}=\hat{s})=\hat{s},\tag{11}$$
where $Y^{*}=\operatorname{argmax}_{k}\{Y_{k}\}_{1\leq k\leq K}$ is the true class scalar, $\hat{Y}=\operatorname{argmax}_{k}\{S_{k}\}_{1\leq k\leq K}$ is the predicted class, $\hat{S}=\max\{S_{k}\}_{1\leq k\leq K}$ is the confidence score of the predicted class, and $\hat{s}$ is the observed value of $\hat{S}$.
###### Definition A.2.
(Class-wise Calibration) A classifier is perfectly class-wise calibrated if the following equation holds:
$$P(Y_{k}=1\mid S_{k}=s_{k})=s_{k},\quad\forall\,1\leq k\leq K,\tag{12}$$
where $Y_{k}$ is the $k$-th component of the one-hot label $Y$, $S_{k}$ is the $k$-th component of the confidence score vector $S$, and $s_{k}$ is the observed value of $S_{k}$.
## Appendix B Proof of Theorem [3.1](https://arxiv.org/html/2605.21552#S3.Thmtheorem1)
###### Proof.
First, by the law of total probability, the following holds:
$$\begin{aligned}P(Y_{k}=1\mid S)&=\int_{X}P(Y_{k}=1,X\mid S)\,dX\\&=\int_{X}P(Y_{k}=1\mid X,S)\,P(X\mid S)\,dX\\&=\int_{X}P(Y_{k}=1\mid X)\,P(X\mid S)\,dX\\&=\mathbb{E}_{X\sim P(X\mid S)}[P(Y_{k}=1\mid X)],\end{aligned}\tag{13}$$
where the second-to-last equality holds because $X$ contains all the information that $S$ can provide. According to the definition of covariate shift, $P_{s}(Y_{k}=1\mid X)=P_{t}(Y_{k}=1\mid X)$. Therefore, if $P_{s}(Y_{k}=1\mid S)=P_{t}(Y_{k}=1\mid S)$, then applying Eq. 13 in each domain gives:
$$\mathbb{E}_{X\sim P_{s}(X\mid S)}[P(Y_{k}=1\mid X)]=\mathbb{E}_{X\sim P_{t}(X\mid S)}[P(Y_{k}=1\mid X)],\tag{14}$$
where $P(Y_{k}=1\mid X)=P_{s}(Y_{k}=1\mid X)=P_{t}(Y_{k}=1\mid X)$. Conversely, if $\mathbb{E}_{X\sim P_{s}(X\mid S)}[P(Y_{k}=1\mid X)]=\mathbb{E}_{X\sim P_{t}(X\mid S)}[P(Y_{k}=1\mid X)]$, then by Eq. 13 each side equals the corresponding conditional probability, so it also holds that $P_{s}(Y_{k}=1\mid S)=P_{t}(Y_{k}=1\mid S)$. ∎
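The condition can be illustrated on a toy discrete covariate. The sketch below is not taken from the paper's experiments: the function name `calibration_curve` and every probability in it are made-up assumptions. It evaluates $P_{d}(Y=1\mid S)=\mathbb{E}_{X\sim P_{d}(X\mid S)}[P(Y=1\mid X)]$ for one source and two target distributions. Shift A changes $P(X)$ but keeps $P(X\mid S)$, so calibration is preserved; shift B changes $P(X\mid S)$ and breaks it.

```python
def calibration_curve(p_x, p_y_given_x, score_of_x):
    """Return {s: P(Y=1 | S=s)} computed as E_{X ~ P(X | S=s)}[P(Y=1 | X)]."""
    curve = {}
    for s in sorted(set(score_of_x)):
        idx = [i for i, score in enumerate(score_of_x) if score == s]
        mass = sum(p_x[i] for i in idx)  # P(S = s)
        # Weight each P(Y=1 | x) by P(x | S = s) = P(x) / P(S = s).
        curve[s] = sum(p_x[i] / mass * p_y_given_x[i] for i in idx)
    return curve

# Hypothetical toy setup: four covariate values, two score levels.
p_y_given_x = [0.1, 0.3, 0.7, 0.9]   # shared by both domains (covariate shift)
score_of_x = [0.2, 0.2, 0.8, 0.8]    # S is a deterministic function of X

p_source = [0.25, 0.25, 0.25, 0.25]
p_target_a = [0.10, 0.10, 0.40, 0.40]  # moves mass between score groups only
p_target_b = [0.40, 0.10, 0.10, 0.40]  # changes P(X | S) inside each group

source_curve = calibration_curve(p_source, p_y_given_x, score_of_x)
target_a_curve = calibration_curve(p_target_a, p_y_given_x, score_of_x)
target_b_curve = calibration_curve(p_target_b, p_y_given_x, score_of_x)
```

In this toy setup the source and target-A curves coincide with the scores themselves, while target B drifts away from them even though $P(Y\mid X)$ is unchanged.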
## Appendix C Calibration Comparison Between Source and Target Domains
Typically, a classifier's calibration error is significantly lower in the source domain than in the target domain, because the source domain involves no distribution shift and therefore no resulting loss of generalization. For the sake of rigor, we nevertheless verified this natural assumption experimentally. Table [3](https://arxiv.org/html/2605.21552#A3.T3) presents the results. We use soft-ECE as the calibration method to calibrate the models in the source domain. All three calibration metrics, one for each calibration paradigm, show that the calibration error in the source domain is significantly lower than that in the target domain. Therefore, merely bringing the target-domain calibration error down to the source-domain level would already be a significant improvement.
Table 3: Comparison of calibration errors between the source and target domains. The subscript $s$ denotes the source domain, while the subscript $t$ denotes the target domain. ResNet-20 is used for the Digit dataset, ResNet-50 for the PACS dataset, and ViT-L for the ImageNet-Sketch dataset.

| Dataset | ECE$_{s}$ | ECE$_{t}$ | CwECE$_{s}$ | CwECE$_{t}$ | ECE$^{KDE}_{s}$ | ECE$^{KDE}_{t}$ |
| --- | --- | --- | --- | --- | --- | --- |
| Digit (USPS + SVHN $\to$ MNIST) | 1.54±0.04 | 16.2±1.51 | 0.39±0.01 | 3.14±0.31 | 0.39±0.02 | 2.97±0.23 |
| PACS (Art + Cartoon + Sketch $\to$ Photo) | 3.84±0.23 | 22.3±2.16 | 0.58±0.01 | 7.87±0.31 | 0.42±0.04 | 7.58±0.37 |
| ImageNet-Sketch (ImageNet $\to$ Sketch) | 1.47±0.11 | 55.8±4.34 | 0.93±0.09 | 12.7±0.87 | 0.86±0.06 | 12.3±0.73 |
## Appendix D Extension of Theorem [3.1](https://arxiv.org/html/2605.21552#S3.Thmtheorem1)
###### Theorem D.1.
(Expectation Consistency Condition for Top-label Calibration) $P_{s}(Y^{*}=\hat{Y}\mid\hat{S})=P_{t}(Y^{*}=\hat{Y}\mid\hat{S})$ if and only if
$$\mathbb{E}_{X\sim P_{s}(X\mid\hat{S})}[P(Y^{*}=\hat{Y}\mid X)]=\mathbb{E}_{X\sim P_{t}(X\mid\hat{S})}[P(Y^{*}=\hat{Y}\mid X)],$$
where $P(Y^{*}=\hat{Y}\mid X)=P_{s}(Y^{*}=\hat{Y}\mid X)=P_{t}(Y^{*}=\hat{Y}\mid X)$.
###### Proof.
First, by the law of total probability, the following holds:
$$\begin{aligned}P(Y^{*}=\hat{Y}\mid\hat{S})&=\int_{X}P(Y^{*}=\hat{Y},X\mid\hat{S})\,dX\\&=\int_{X}P(Y^{*}=\hat{Y}\mid X,\hat{S})\,P(X\mid\hat{S})\,dX=\int_{X}P(Y^{*}=\hat{Y}\mid X)\,P(X\mid\hat{S})\,dX,\end{aligned}\tag{15}$$
where the last equality holds because $X$ contains all the information that $\hat{S}$ can provide. According to the definition of covariate shift, $P_{s}(Y^{*}\mid X)=P_{t}(Y^{*}\mid X)$. Because the source domain and the target domain share a fixed classifier, $P_{s}(\hat{Y}\mid X)=P_{t}(\hat{Y}\mid X)$. Then, it holds that:
$$\begin{aligned}P_{s}(Y^{*},\hat{Y}\mid X)&=P_{s}(Y^{*}\mid\hat{Y},X)\,P_{s}(\hat{Y}\mid X)\\&=P_{s}(Y^{*}\mid X)\,P_{s}(\hat{Y}\mid X)=P_{t}(Y^{*}\mid X)\,P_{t}(\hat{Y}\mid X)=P_{t}(Y^{*},\hat{Y}\mid X),\end{aligned}\tag{16}$$
where the second equality holds because, with the classifier fixed, $\hat{Y}$ is a deterministic function of $X$, and the third equality follows from the two identities stated above. Therefore, $P_{s}(Y^{*}=\hat{Y}\mid X)=P_{t}(Y^{*}=\hat{Y}\mid X)$. According to Eq. [15](https://arxiv.org/html/2605.21552#A4.E15), if $P_{s}(Y^{*}=\hat{Y}\mid\hat{S})=P_{t}(Y^{*}=\hat{Y}\mid\hat{S})$, then $\mathbb{E}_{X\sim P_{s}(X\mid\hat{S})}[P(Y^{*}=\hat{Y}\mid X)]=\mathbb{E}_{X\sim P_{t}(X\mid\hat{S})}[P(Y^{*}=\hat{Y}\mid X)]$. Conversely, if $\mathbb{E}_{X\sim P_{s}(X\mid\hat{S})}[P(Y^{*}=\hat{Y}\mid X)]=\mathbb{E}_{X\sim P_{t}(X\mid\hat{S})}[P(Y^{*}=\hat{Y}\mid X)]$, it also holds that $P_{s}(Y^{*}=\hat{Y}\mid\hat{S})=P_{t}(Y^{*}=\hat{Y}\mid\hat{S})$. ∎
###### Theorem D.2.
(Expectation Consistency Condition for Class-wise Calibration) $\forall\,1\leq k\leq K$, $P_{s}(Y_{k}=1\mid S_{k})=P_{t}(Y_{k}=1\mid S_{k})$ if and only if
$$\mathbb{E}_{X\sim P_{s}(X\mid S_{k})}[P(Y_{k}=1\mid X)]=\mathbb{E}_{X\sim P_{t}(X\mid S_{k})}[P(Y_{k}=1\mid X)],$$
where $P(Y_{k}=1\mid X)=P_{s}(Y_{k}=1\mid X)=P_{t}(Y_{k}=1\mid X)$.
###### Proof.
By the law of total probability,
$$P(Y_{k}=1\mid S_{k})=\int_{X}P(Y_{k}=1,X\mid S_{k})\,dX=\int_{X}P(Y_{k}=1\mid X,S_{k})\,P(X\mid S_{k})\,dX=\int_{X}P(Y_{k}=1\mid X)\,P(X\mid S_{k})\,dX,\tag{17}$$
where the last step uses that $X$ contains all information in $S_{k}$ relevant to $Y_{k}$.
Under covariate shift, $P_{s}(Y_{k}=1\mid X)=P_{t}(Y_{k}=1\mid X)$. Hence:
$$P_{s}(Y_{k}=1\mid S_{k})=\int_{X}P(Y_{k}=1\mid X)\,P_{s}(X\mid S_{k})\,dX,\quad P_{t}(Y_{k}=1\mid S_{k})=\int_{X}P(Y_{k}=1\mid X)\,P_{t}(X\mid S_{k})\,dX.\tag{18}$$
Therefore, $P_{s}(Y_{k}=1\mid S_{k})=P_{t}(Y_{k}=1\mid S_{k})$ if and only if:
$$\int_{X}P(Y_{k}=1\mid X)\,P_{s}(X\mid S_{k})\,dX=\int_{X}P(Y_{k}=1\mid X)\,P_{t}(X\mid S_{k})\,dX,\tag{19}$$
which is exactly the desired expectation condition. ∎
## Appendix E Extension of Expectation Consistency Loss
### E.1 Expectation Consistency Loss for Top-Label Calibration
Recall the predicted class $\hat{Y}=\operatorname{argmax}_{k}\{S_{k}\}_{1\leq k\leq K}$, its confidence $\hat{S}=\max\{S_{k}\}_{1\leq k\leq K}$, and the true class $Y^{*}$. Theorem [D.1](https://arxiv.org/html/2605.21552#A4.Thmtheorem1) states that preservation of top-label calibration across domains is equivalent to the expectation consistency condition:
$$\mathbb{E}_{X\sim P_{s}(X\mid\hat{S})}[P(Y^{*}=\hat{Y}\mid X)]=\mathbb{E}_{X\sim P_{t}(X\mid\hat{S})}[P(Y^{*}=\hat{Y}\mid X)].\tag{20}$$
Therefore, the Expectation consistency loss for top-label calibration can be naturally constructed as:
$$L_{ecl}^{top}=\mathbb{E}_{P_{t}(\hat{S})}\Big|\mathbb{E}_{P_{s}(X\mid\hat{S})}P(Y^{*}=\hat{Y}\mid X)-\mathbb{E}_{P_{t}(X\mid\hat{S})}P(Y^{*}=\hat{Y}\mid X)\Big|.\tag{21}$$
To estimate $P(Y^{*}=\hat{Y}\mid X)$ in practice, we train a binary classifier where the label is the indicator $1_{Y^{*}=\hat{Y}}$ and the input data is $X$. This binary classifier can be added to the original classifier as a classification head and trained end-to-end with the original classifier (the backbone is frozen while this classification head is trained). Optionally, this binary classification head can also be calibrated on the source domain to obtain a more reliable estimate of $P(Y^{*}=\hat{Y}\mid X)$.
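As a hedged illustration of the labels used by this binary head, the sketch below derives the targets $1_{Y^{*}=\hat{Y}}$ and the confidences $\hat{S}$ from a base classifier's score vectors. The function name `correctness_targets` and the toy arrays are hypothetical. The head architecture and its training loop are omitted because the text does not specify them.

```python
import numpy as np

def correctness_targets(scores, labels):
    """Build training targets 1[Y* = Y_hat] for the binary head.

    scores: (N, K) confidence vectors S from the base classifier.
    labels: (N,) integer true classes Y*.
    Returns (targets, confidences): targets[i] is 1.0 when the predicted
    class equals the true class; confidences[i] = max_k S_k.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    predicted = scores.argmax(axis=1)   # Y_hat
    confidences = scores.max(axis=1)    # S_hat
    targets = (predicted == labels).astype(float)
    return targets, confidences

# Hypothetical scores for three samples and three classes.
scores = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.3, 0.6],
                   [0.4, 0.5, 0.1]])
labels = np.array([0, 1, 1])
targets, confidences = correctness_targets(scores, labels)
```

These targets require true labels, so under the unsupervised setting they can only be built on the source domain.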
### E.2 Expectation Consistency Loss for Class-wise Calibration
For class-wise calibration, each coordinate $S_{k}$ must match $P(Y_{k}=1\mid S_{k})$. Theorem [D.2](https://arxiv.org/html/2605.21552#A4.Thmtheorem2) implies expectation consistency per class:
$$\mathbb{E}_{X\sim P_{s}(X\mid S_{k})}[P(Y_{k}=1\mid X)]=\mathbb{E}_{X\sim P_{t}(X\mid S_{k})}[P(Y_{k}=1\mid X)],\quad\forall k\in\{1,\dots,K\}.\tag{22}$$
Therefore, the Expectation consistency loss for class-wise calibration can be naturally constructed as:
$$L_{\mathrm{ecl}}^{\mathrm{cw}}=\sum_{k=1}^{K}\Big[\mathbb{E}_{P_{t}(S_{k})}\Big|\mathbb{E}_{P_{s}(X\mid S_{k})}P(Y_{k}=1\mid X)-\mathbb{E}_{P_{t}(X\mid S_{k})}P(Y_{k}=1\mid X)\Big|\Big].\tag{23}$$
To estimate $P(Y_{k}=1\mid X)$ in practice, we train an additional classification head on the original classifier's backbone, where the label is $Y_{k}$ (the $k$-th component of the one-hot encoded label) and the input data is $X$. This classification head can be trained end-to-end with the original classifier (the backbone is frozen while this classification head is trained). Optionally, this classification head can also be calibrated on the source domain.
## Appendix F Extensions on Empirical Calculation and Differentiability
Empirical Calculation and Differentiability for Top-label Calibration: For top-label calibration, the Expectation Consistency Loss can be empirically estimated using confidence binning and Monte Carlo sampling:
$$\begin{cases}\hat{L}_{ecl}^{\mathrm{top}}=\sum\limits_{j=1}^{B}\dfrac{\sharp b_{j}^{(t)}}{\sharp D_{t}}\left\lVert\hat{\mathbb{E}}_{s,j}-\hat{\mathbb{E}}_{t,j}\right\rVert,\\[4pt]\hat{\mathbb{E}}_{s,j}=\dfrac{1}{\sharp D_{s}^{(j)}}\sum\limits_{x\in D_{s}^{(j)}}\hat{P}(Y^{*}=\hat{Y}\mid X=x),\\[4pt]\hat{\mathbb{E}}_{t,j}=\dfrac{1}{\sharp D_{t}^{(j)}}\sum\limits_{x\in D_{t}^{(j)}}\hat{P}(Y^{*}=\hat{Y}\mid X=x),\end{cases}\tag{24}$$
where $B$ represents the number of bins, $b_{j}^{(t)}$ represents the $j$-th bin in the target domain, $\sharp b_{j}^{(t)}$ represents the sample size of $b_{j}^{(t)}$, $\sharp D_{t}$ represents the sample size of $D_{t}$, $D_{s}^{(j)}$ represents the level set of $b_{j}^{(t)}$ in the source domain, $D_{t}^{(j)}$ represents the level set of $b_{j}^{(t)}$ in the target domain, and $\hat{P}(Y^{*}=\hat{Y}\mid X=x)$ represents the observation of $P(Y^{*}=\hat{Y}\mid X)$. For differentiability, introduce anchors $a_{j}=(2j-1)/(2B)$ and weights
$$\omega_{ij}=\frac{\exp\big(-(\hat{S}^{(i)}-a_{j})^{2}/\tau\big)}{\sum_{r}\exp\big(-(\hat{S}^{(i)}-a_{r})^{2}/\tau\big)}$$
with temperature $\tau>0$. Denoting $p^{(i)}=P(Y^{*}=\hat{Y}\mid X_{i})$ as the output of the binary classification head (as described in Section [E](https://arxiv.org/html/2605.21552#A5)), we obtain for each bin $j$ and domain $d\in\{s,t\}$:
$$\hat{\mathbb{E}}_{d,j}=\frac{\sum_{i}\omega_{ij}^{d}\,p^{(i)}}{\sum_{i}\omega_{ij}^{d}+\varepsilon}.\tag{25}$$
Therefore, the differentiable ECL for top-label calibration is $\hat{L}_{ecl}^{\text{top}}=\sum_{j=1}^{B}w_{j}\lVert\hat{\mathbb{E}}_{s,j}-\hat{\mathbb{E}}_{t,j}\rVert$, where $w_{j}=\frac{\sum_{i}\omega^{t}_{ij}}{\sum_{r}\sum_{i}\omega^{t}_{ir}}$.
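The soft-binned estimator above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the default values of `num_bins`, `tau` and `eps`, and the toy inputs are all hypothetical. NumPy does not track gradients; the same arithmetic would need to be written in an autodiff framework to train with it.

```python
import numpy as np

def soft_bin_weights(confidences, num_bins, tau):
    """omega_ij: softmax over bins of -(S_hat_i - a_j)^2 / tau, a_j = (2j-1)/(2B)."""
    anchors = (2.0 * np.arange(1, num_bins + 1) - 1.0) / (2.0 * num_bins)
    logits = -((confidences[:, None] - anchors[None, :]) ** 2) / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability only
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

def soft_top_label_ecl(conf_s, p_s, conf_t, p_t, num_bins=10, tau=0.01, eps=1e-8):
    """Soft-binned top-label ECL: sum_j w_j |E_hat_{s,j} - E_hat_{t,j}|.

    conf_d: (N_d,) top-label confidences S_hat in domain d.
    p_d:    (N_d,) binary-head outputs p^(i) estimating P(Y* = Y_hat | X_i).
    """
    conf_s, p_s = np.asarray(conf_s, dtype=float), np.asarray(p_s, dtype=float)
    conf_t, p_t = np.asarray(conf_t, dtype=float), np.asarray(p_t, dtype=float)
    omega_s = soft_bin_weights(conf_s, num_bins, tau)  # shape (N_s, B)
    omega_t = soft_bin_weights(conf_t, num_bins, tau)  # shape (N_t, B)
    # Per-bin weighted means with stabiliser eps (Eq. 25).
    e_s = (omega_s * p_s[:, None]).sum(axis=0) / (omega_s.sum(axis=0) + eps)
    e_t = (omega_t * p_t[:, None]).sum(axis=0) / (omega_t.sum(axis=0) + eps)
    # Bin weights w_j come from the target-domain soft counts.
    n_t = omega_t.sum(axis=0)
    w = n_t / n_t.sum()
    return float(np.sum(w * np.abs(e_s - e_t)))

# Hypothetical inputs: same confidences in both domains, different head outputs.
conf = np.linspace(0.55, 0.95, 50)
same = soft_top_label_ecl(conf, np.full(50, 0.9), conf, np.full(50, 0.9))
gap = soft_top_label_ecl(conf, np.full(50, 0.9), conf, np.full(50, 0.5))
```

With identical inputs in both domains the loss is zero. With constant head outputs that differ by 0.4, every populated bin contributes the same gap, so the loss is close to 0.4.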
Empirical Calculation and Differentiability for Class-wise Calibration: For class-wise calibration, the Expectation Consistency Loss can be empirically estimated using confidence binning and Monte Carlo sampling:
$$\begin{cases}\hat{L}_{ecl}^{\mathrm{cw}}=\sum\limits_{k=1}^{K}\sum\limits_{j=1}^{B}\dfrac{\sharp b_{k,j}^{(t)}}{\sharp D_{t}}\,\Big|\hat{\mathbb{E}}_{s,k,j}-\hat{\mathbb{E}}_{t,k,j}\Big|,\\[4pt]\hat{\mathbb{E}}_{s,k,j}=\dfrac{1}{\sharp D_{s,k}^{(j)}}\sum\limits_{x\in D_{s,k}^{(j)}}\hat{P}(Y_{k}=1\mid X=x),\\[4pt]\hat{\mathbb{E}}_{t,k,j}=\dfrac{1}{\sharp D_{t,k}^{(j)}}\sum\limits_{x\in D_{t,k}^{(j)}}\hat{P}(Y_{k}=1\mid X=x),\end{cases}\tag{26}$$
where $B$ is the number of bins per class, $b_{k,j}^{(t)}$ is the $j$-th bin for class $k$ on the target domain (formed by binning $S_{k}$), $\sharp b_{k,j}^{(t)}$ is its size, $\sharp D_{t}$ is the target sample size, and $D_{s,k}^{(j)}$, $D_{t,k}^{(j)}$ are the level sets of $b_{k,j}^{(t)}$ on the source and target domains, respectively. For differentiability, let the anchors be $a_{j}=\frac{2j-1}{2B}$ for $j=1,\dots,B$, and define soft weights for a sample $i$ with confidence $S_{k}^{(i)}$:
$$\omega_{k,ij}=\frac{\exp\big(-(S_{k}^{(i)}-a_{j})^{2}/\tau\big)}{\sum_{r=1}^{B}\exp\big(-(S_{k}^{(i)}-a_{r})^{2}/\tau\big)},\quad\tau>0.\tag{27}$$
For domain $d\in\{s,t\}$, define
$$\hat{\mathbb{E}}_{d,k,j}=\frac{\sum_{i}\omega_{k,ij}^{d}\,p_{k}^{(i)}}{\sum_{i}\omega_{k,ij}^{d}+\varepsilon},\quad n_{k,j}^{d}=\sum_{i}\omega_{k,ij}^{d},\quad p_{k}^{(i)}=P(Y_{k}=1\mid X_{i}),\tag{28}$$
with stabilizer $\varepsilon>0$. The differentiable class-wise ECL becomes
$$\hat{L}_{ecl}^{\mathrm{cw}}=\sum_{k=1}^{K}\sum_{j=1}^{B}w_{k,j}\,\Big|\hat{\mathbb{E}}_{s,k,j}-\hat{\mathbb{E}}_{t,k,j}\Big|,\quad w_{k,j}=\frac{n_{k,j}^{t}}{\sum_{r=1}^{B}n_{k,r}^{t}}.\tag{29}$$
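Eqs. 27 to 29 vectorize naturally over classes. The following NumPy sketch is one possible reading of those equations; the function name `soft_classwise_ecl`, the default hyperparameters and the randomly generated toy scores are assumptions made for illustration only.

```python
import numpy as np

def soft_classwise_ecl(scores_s, probs_s, scores_t, probs_t,
                       num_bins=10, tau=0.01, eps=1e-8):
    """Soft-binned class-wise ECL.

    scores_d: (N_d, K) confidence vectors S in domain d.
    probs_d:  (N_d, K) head outputs p_k^(i) estimating P(Y_k = 1 | X_i).
    """
    anchors = (2.0 * np.arange(1, num_bins + 1) - 1.0) / (2.0 * num_bins)

    def bin_stats(scores, probs):
        scores = np.asarray(scores, dtype=float)
        probs = np.asarray(probs, dtype=float)
        # omega[i, k, j]: soft assignment of sample i's class-k score to bin j (Eq. 27).
        logits = -((scores[:, :, None] - anchors) ** 2) / tau
        logits = logits - logits.max(axis=2, keepdims=True)
        omega = np.exp(logits)
        omega = omega / omega.sum(axis=2, keepdims=True)
        counts = omega.sum(axis=0)  # n_{k,j}^d, shape (K, B)
        means = (omega * probs[:, :, None]).sum(axis=0) / (counts + eps)  # Eq. 28
        return means, counts

    e_s, _ = bin_stats(scores_s, probs_s)
    e_t, n_t = bin_stats(scores_t, probs_t)
    w = n_t / n_t.sum(axis=1, keepdims=True)  # w_{k,j}, normalised within each class
    return float(np.sum(w * np.abs(e_s - e_t)))  # Eq. 29

# Hypothetical data: 40 samples, 3 classes, identical scores in both domains.
rng = np.random.default_rng(0)
scores = rng.dirichlet(np.ones(3), size=40)
loss_same = soft_classwise_ecl(scores, np.full((40, 3), 0.6), scores, np.full((40, 3), 0.6))
loss_gap = soft_classwise_ecl(scores, np.full((40, 3), 0.6), scores, np.full((40, 3), 0.4))
```

Because the bin weights are normalised within each class, a constant gap of 0.2 in every class gives a loss close to $K\times 0.2=0.6$ in this toy setup.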
## Appendix G Proof of Theorem [3.2](https://arxiv.org/html/2605.21552#S3.Thmtheorem2)
###### Proof.
For each bin $j$, define random variables $Z_{s,j}=\lVert\hat{\mathbb{E}}_{s,j}-\mathbb{E}_{P_{s}(X\mid S)}P(Y\mid X)\rVert$ and $Z_{t,j}=\lVert\hat{\mathbb{E}}_{t,j}-\mathbb{E}_{P_{t}(X\mid S)}P(Y\mid X)\rVert$. By the triangle inequality,
$$\big|\hat{L}_{ecl}-L_{ecl}\big|\leq\sum_{j=1}^{B}w_{j}\,(Z_{s,j}+Z_{t,j}).\tag{30}$$
Using Hoeffding's inequality and a union bound over bins and classes, there exist absolute constants $C_{1},C_{2}>0$ such that, with probability at least $1-\delta$,
$$Z_{s,j}\leq C_{1}\sqrt{\frac{K\log(2BK/\delta)}{n_{s,j}}},\quad Z_{t,j}\leq C_{2}\sqrt{\frac{K\log(2BK/\delta)}{n_{t,j}}},\quad\forall j=1,\dots,B.\tag{31}$$
Combining these bounds gives the desired result. ∎
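To get a feel for how the bound in Eq. 31 scales, the sketch below evaluates $C\sqrt{K\log(2BK/\delta)/n}$ for given bin counts. The proof leaves the absolute constants $C_{1},C_{2}$ unspecified, so `constant=1.0` is a placeholder assumption, and the function name and example values are hypothetical.

```python
import math

def per_bin_deviation_bound(n_bin, num_bins, num_classes, delta, constant=1.0):
    """Evaluate C * sqrt(K * log(2BK / delta) / n) from Eq. 31.

    `constant` stands in for the unspecified absolute constant C_1 or C_2.
    """
    if n_bin <= 0:
        raise ValueError("n_bin must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    log_term = math.log(2.0 * num_bins * num_classes / delta)
    return constant * math.sqrt(num_classes * log_term / n_bin)

# Hypothetical setting: B = 15 bins, K = 10 classes, delta = 0.05.
bound_100 = per_bin_deviation_bound(100, 15, 10, 0.05)
bound_400 = per_bin_deviation_bound(400, 15, 10, 0.05)
```

Quadrupling the number of samples in a bin halves the bound, which is the usual $1/\sqrt{n}$ rate.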
## Appendix H Proof of Theorem [3.3](https://arxiv.org/html/2605.21552#S3.Thmtheorem3)
This proof proceeds in two steps. First, we show that Eq. [10](https://arxiv.org/html/2605.21552#S3.E10) is an auxiliary-variable reformulation of Eq. [8](https://arxiv.org/html/2605.21552#S3.E8): minimizing over the auxiliary variables $u_{j}^{s},u_{j}^{t}$ in Eq. [10](https://arxiv.org/html/2605.21552#S3.E10) recovers Eq. [8](https://arxiv.org/html/2605.21552#S3.E8). Second, we show that, under the auxiliary-variable formulation, the mini-batch gradient is an unbiased estimator of the full-sample gradient.
##### Equivalence between Eq. [10](https://arxiv.org/html/2605.21552#S3.E10) and Eq. [8](https://arxiv.org/html/2605.21552#S3.E8).
Fix $\theta$ and consider minimizing the right-hand side of Eq. [10](https://arxiv.org/html/2605.21552#S3.E10) with respect to the auxiliary vectors $u_{j}^{s},u_{j}^{t}$ for each bin $j$. The terms that depend on $u_{j}^{s},u_{j}^{t}$ are
$$G_{j}(u_{j}^{s},u_{j}^{t})=w_{j}\|u_{j}^{s}-u_{j}^{t}\|+\sum_{i\in D_{s}}\omega_{i,j}^{s}\|u_{j}^{s}-p^{(i)}(\theta)\|^{2}+\sum_{i\in D_{t}}\omega_{i,j}^{t}\|u_{j}^{t}-p^{(i)}(\theta)\|^{2}.$$
Define the soft counts and weighted empirical means
$$n^{s}_{j}=\sum_{i\in D_{s}}\omega_{i,j}^{s},\quad n^{t}_{j}=\sum_{i\in D_{t}}\omega_{i,j}^{t},\quad\hat{\mathbb{E}}_{s,j}=\frac{1}{n^{s}_{j}}\sum_{i\in D_{s}}\omega_{i,j}^{s}\,p^{(i)}(\theta),\quad\hat{\mathbb{E}}_{t,j}=\frac{1}{n^{t}_{j}}\sum_{i\in D_{t}}\omega_{i,j}^{t}\,p^{(i)}(\theta).$$
The quadratic terms are strongly convex in $u_{j}^{s},u_{j}^{t}$, so $G_{j}$ has a unique minimizer. Taking (sub)gradients with respect to $u_{j}^{s},u_{j}^{t}$ and setting them to zero yields
$$2n^{s}_{j}(u_{j}^{s}-\hat{\mathbb{E}}_{s,j})+w_{j}g_{j}=0,\qquad 2n^{t}_{j}(u_{j}^{t}-\hat{\mathbb{E}}_{t,j})-w_{j}g_{j}=0,$$
where $g_{j}$ is any subgradient of the norm at $u_{j}^{s}-u_{j}^{t}$ (a unit vector when the difference is nonzero). Solving for $u_{j}^{s},u_{j}^{t}$ gives
$$u_{j}^{s}=\hat{\mathbb{E}}_{s,j}-\frac{w_{j}}{2n^{s}_{j}}g_{j},\qquad u_{j}^{t}=\hat{\mathbb{E}}_{t,j}+\frac{w_{j}}{2n^{t}_{j}}g_{j}.$$
When the quadratic penalty terms are minimized (forcing the auxiliary variables to their weighted empirical means), the correction terms vanish and
$$u_{j}^{s}\to\hat{\mathbb{E}}_{s,j},\qquad u_{j}^{t}\to\hat{\mathbb{E}}_{t,j}.$$
Substituting these auxiliary values back into Eq. [10](https://arxiv.org/html/2605.21552#S3.E10) yields
$$\sum_{j=1}^{B}w_{j}\big\|\hat{\mathbb{E}}_{s,j}-\hat{\mathbb{E}}_{t,j}\big\|,$$
which is exactly Eq. [8](https://arxiv.org/html/2605.21552#S3.E8). Hence Eq. [10](https://arxiv.org/html/2605.21552#S3.E10) is asymptotically equivalent to Eq. [8](https://arxiv.org/html/2605.21552#S3.E8), with an $O(w_{j}/n_{j}^{d})$ gap (from the subgradient corrections $\frac{w_{j}}{2n_{j}^{s}}g_{j}$ and $\frac{w_{j}}{2n_{j}^{t}}g_{j}$) that vanishes as $n_{j}^{s},n_{j}^{t}\to\infty$.
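The stationarity conditions can be checked numerically in the scalar case. The sketch below assumes scalar auxiliaries and uses the identity $\sum_{i}\omega_{i}(u-p^{(i)})^{2}=n\,(u-\hat{\mathbb{E}})^{2}+\text{const}$, so the objective depends on the data only through the soft counts and weighted means. The function names and numbers are hypothetical. It also covers the case the text does not spell out, where the two means are so close that the minimizer sets $u_{j}^{s}=u_{j}^{t}$.

```python
import numpy as np

def aux_objective(u_s, u_t, w, n_s, e_s, n_t, e_t):
    """Scalar G_j up to an additive constant that does not depend on u_s, u_t."""
    return w * np.abs(u_s - u_t) + n_s * (u_s - e_s) ** 2 + n_t * (u_t - e_t) ** 2

def aux_minimizer(w, n_s, e_s, n_t, e_t):
    """Closed-form minimizer of the scalar objective above."""
    d = e_s - e_t
    threshold = w / (2.0 * n_s) + w / (2.0 * n_t)
    if abs(d) <= threshold:
        # The norm term dominates: both auxiliaries collapse to the pooled mean.
        u = (n_s * e_s + n_t * e_t) / (n_s + n_t)
        return u, u
    g = 1.0 if d > 0 else -1.0  # subgradient of |.| at a nonzero difference
    return e_s - w * g / (2.0 * n_s), e_t + w * g / (2.0 * n_t)

# Hypothetical bin: small soft counts, so the correction terms are visible.
u_s, u_t = aux_minimizer(w=0.3, n_s=5.0, e_s=0.8, n_t=4.0, e_t=0.5)

# Same means with 1000x larger soft counts: the gap to the means shrinks.
u_s_big, u_t_big = aux_minimizer(w=0.3, n_s=5000.0, e_s=0.8, n_t=4000.0, e_t=0.5)

# Brute-force check on a grid that no point beats the closed form.
grid = np.linspace(0.0, 1.0, 201)
grid_s, grid_t = np.meshgrid(grid, grid)
grid_min = aux_objective(grid_s, grid_t, 0.3, 5.0, 0.8, 4.0, 0.5).min()
closed_form_value = aux_objective(u_s, u_t, 0.3, 5.0, 0.8, 4.0, 0.5)
```

In this example the minimizer sits at $0.8-0.03$ and $0.5+0.0375$, matching the correction terms $w_{j}/(2n_{j}^{s})$ and $w_{j}/(2n_{j}^{t})$, and the offsets shrink by the same factor as the counts grow.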
##### Unbiasedness of the mini-batch gradient.
We first prove that Eq. [8](https://arxiv.org/html/2605.21552#S3.E8) produces a biased gradient estimate on mini-batches, and then prove that Eq. [10](https://arxiv.org/html/2605.21552#S3.E10) produces an unbiased gradient estimate.
Write the differentiable ECL (Eq. [8](https://arxiv.org/html/2605.21552#S3.E8)) as
$$\hat{L}_{ecl}(\theta)=\sum_{j=1}^{B}w_{j}\big\|\hat{\mathbb{E}}_{s,j}-\hat{\mathbb{E}}_{t,j}\big\|.$$
For notational clarity and for an arbitrary norm $\|\cdot\|$, introduce a subgradient selection
$$g_{j}\in\partial\|\hat{\mathbb{E}}_{s,j}-\hat{\mathbb{E}}_{t,j}\|\quad(\text{any choice when the difference is nonzero}).$$
Using the chain rule for a general norm, we obtain the full-data gradient
$$\nabla_{\theta}\hat{L}_{ecl}(\theta)=\sum_{j=1}^{B}w_{j}\left\langle g_{j},\;\nabla_{\theta}\hat{\mathbb{E}}_{s,j}-\nabla_{\theta}\hat{\mathbb{E}}_{t,j}\right\rangle,\tag{32}$$
where, for example, the full-data weighted gradient average is
$$\nabla_{\theta}\hat{\mathbb{E}}_{s,j}=\frac{1}{n^{s}_{j}}\sum_{i\in D_{s}}\omega_{i,j}^{s}\,\nabla_{\theta}p^{(i)}(\theta).$$
Now consider computing the same expression on a random mini-batch. Let $\hat{\mathbb{E}}_{s,j}^{\rm m},\hat{\mathbb{E}}_{t,j}^{\rm m}$ be the per-bin weighted means computed from the current mini-batches and choose a measurable subgradient selection $g_{j}^{\rm m}\in\partial\|\hat{\mathbb{E}}_{s,j}^{\rm m}-\hat{\mathbb{E}}_{t,j}^{\rm m}\|$. The mini-batch gradient contribution for bin $j$ (when using Eq. [8](https://arxiv.org/html/2605.21552#S3.E8) directly on the mini-batch) equals
$$G_{j}^{\rm m}=w_{j}\left\langle g_{j}^{\rm m},\;\nabla_{\theta}\hat{\mathbb{E}}_{s,j}^{\rm m}-\nabla_{\theta}\hat{\mathbb{E}}_{t,j}^{\rm m}\right\rangle.$$
Taking the expectation over the random mini-batch sampling (the indices in the sums) and decomposing the expectation of the product gives
$$\mathbb{E}[G_{j}^{\rm m}]=w_{j}\left(\mathbb{E}[g_{j}^{\rm m}]^{\top}\,\mathbb{E}[\nabla_{\theta}\hat{\mathbb{E}}_{s,j}^{\rm m}-\nabla_{\theta}\hat{\mathbb{E}}_{t,j}^{\rm m}]+\mathrm{Cov}\big(g_{j}^{\rm m},\,\nabla_{\theta}\hat{\mathbb{E}}_{s,j}^{\rm m}-\nabla_{\theta}\hat{\mathbb{E}}_{t,j}^{\rm m}\big)\right),\tag{33}$$
where the covariance denotes the cross-covariance between the components of the subgradient vector $g_{j}^{\rm m}$ and the gradient estimator. The covariance need not vanish because $g_{j}^{\rm m}$ is a nonlinear (sub)differential selection of the same mini-batch samples that produce the per-sample gradients; hence in general
$$\mathbb{E}[G_{j}^{\rm m}]\neq w_{j}g_{j}^{\top}\big(\nabla_{\theta}\hat{\mathbb{E}}_{s,j}-\nabla_{\theta}\hat{\mathbb{E}}_{t,j}\big).$$
Equality would hold only if $g_{j}^{\rm m}$ were (in expectation) equal to $g_{j}$ and uncorrelated with the mini-batch gradient estimator, a condition that generally fails because of the nonlinear subgradient selection.
Eq. [10](https://arxiv.org/html/2605.21552#S3.E10) remedies this issue by introducing auxiliary variables $u_{j}^{s},u_{j}^{t}$. Concretely, let us set the auxiliaries to the full-data weighted means (functions of $\theta$ but independent of the current mini-batch indices):
$$u_{j}^{s,\mathrm{full}}:=\hat{\mathbb{E}}_{s,j}=\frac{1}{n^{s}_{j}}\sum_{i\in D_{s}}\omega_{i,j}^{s}\,p^{(i)}(\theta),\qquad u_{j}^{t,\mathrm{full}}:=\hat{\mathbb{E}}_{t,j}=\frac{1}{n^{t}_{j}}\sum_{i\in D_{t}}\omega_{i,j}^{t}\,p^{(i)}(\theta).$$
Define the fixed unit vector
$$v_{j}^{\mathrm{full}}:=\frac{u_{j}^{s,\mathrm{full}}-u_{j}^{t,\mathrm{full}}}{\|u_{j}^{s,\mathrm{full}}-u_{j}^{t,\mathrm{full}}\|}.$$
If we compute the mini-batch gradient of Eq. [10](https://arxiv.org/html/2605.21552#S3.E10) while treating $u_{j}^{d}=u_{j}^{d,\mathrm{full}}$ as fixed (i.e., independent of the current mini-batch samples), the bin-$j$ contribution equals
$$\tilde{G}_{j}^{\rm m}=w_{j}\left\langle v_{j}^{\mathrm{full}},\;\frac{1}{|D_{s}^{\rm m}|}\sum_{i\in D_{s}^{\rm m}}\omega_{i,j}^{s}\,\nabla_{\theta}p^{(i)}(\theta)-\frac{1}{|D_{t}^{\rm m}|}\sum_{i\in D_{t}^{\rm m}}\omega_{i,j}^{t}\,\nabla_{\theta}p^{(i)}(\theta)\right\rangle.$$
Taking the expectation over the random mini-batch sampling (the indices in the sums) and using linearity gives
$$\mathbb{E}[\tilde{G}_{j}^{\rm m}]=w_{j}\left\langle v_{j}^{\mathrm{full}},\;\frac{1}{n^{s}_{j}}\sum_{i\in D_{s}}\omega_{i,j}^{s}\,\nabla_{\theta}p^{(i)}(\theta)-\frac{1}{n^{t}_{j}}\sum_{i\in D_{t}}\omega_{i,j}^{t}\,\nabla_{\theta}p^{(i)}(\theta)\right\rangle.$$
The right-hand side is exactly the full-data bin-$j$ term in Eq. [32](https://arxiv.org/html/2605.21552#A8.E32); summing over $j$ yields
$$\mathbb{E}\Big[\sum_{j=1}^{B}\tilde{G}_{j}^{\rm m}\Big]=\nabla_{\theta}\hat{L}_{ecl}(\theta).$$
Thus, when Eq. [10](https://arxiv.org/html/2605.21552#S3.E10) is used with auxiliaries taken from an estimate independent of the current mini-batch (e.g., full-data means, a large buffer, or a slow running average), the mini-batch gradient is an unbiased estimator of the full-sample gradient.
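The bias argument can be made concrete by exact enumeration on a toy problem. The sketch below is an illustration under strong simplifying assumptions that are not in the paper: a single bin with uniform weights, scalar outputs $p^{(i)}(\theta)=\theta a_{i}$ with $\theta>0$ (so $\nabla_{\theta}\hat{\mathbb{E}}_{d}$ is the mean of the $a_{i}$), and mini-batches drawn without replacement. All values are made up.

```python
from itertools import combinations

def mean(values):
    return sum(values) / len(values)

# Per-sample gradients a_i of p_i(theta) = theta * a_i (hypothetical values).
a_source = [0.2, 0.9, 0.4, 0.7]
a_target = [0.5, 0.6, 0.3, 0.4]
batch_size = 2

# Full-data gradient of |E_hat_s - E_hat_t| for theta > 0.
full_diff = mean(a_source) - mean(a_target)
v_full = 1.0 if full_diff > 0 else -1.0  # direction fixed from full data
full_gradient = v_full * full_diff

naive, fixed = [], []
for batch_s in combinations(a_source, batch_size):
    for batch_t in combinations(a_target, batch_size):
        diff = mean(batch_s) - mean(batch_t)
        g_batch = 1.0 if diff > 0 else (-1.0 if diff < 0 else 0.0)
        naive.append(g_batch * diff)  # direction recomputed on each mini-batch
        fixed.append(v_full * diff)   # direction held fixed across mini-batches

# Exact expectations over all equally likely mini-batch pairs.
expected_naive = mean(naive)
expected_fixed = mean(fixed)
```

Here the full gradient is 0.1. Averaging over all mini-batch pairs, the fixed-direction estimator recovers it, while recomputing the sign on each mini-batch gives about 0.158, because some mini-batches flip the sign of the difference.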
Algorithm 2 Top-label ECL Mini-Batch.

- 1: Input:
- 2: bins $j=1\ldots B$, hyperparameters $\lambda,\alpha_{\text{ema}},N_{\text{prox}}$;
- 3: $u_{j}^{s}=0\in\mathbb{R},\forall j$; $u_{j}^{t}=0\in\mathbb{R},\forall j$;
- 4: for each iteration do
    - 5: Sample mini-batches $D_{s}^{m},D_{t}^{m}$;
    - 6: Compute weights $\omega_{ij}^{s},\omega_{ij}^{t}$;
    - 7: $n_{s,j}\leftarrow\sum_{i\in D_{s}^{m}}\omega_{ij}^{s}$; $n_{t,j}\leftarrow\sum_{i\in D_{t}^{m}}\omega_{ij}^{t}$;
    - 8: $m_{s,j}\leftarrow\sum_{i\in D_{s}^{m}}\omega_{ij}^{s}P(Y^{*}=\hat{Y}\mid X=x_{i})$;
    - 9: $m_{t,j}\leftarrow\sum_{i\in D_{t}^{m}}\omega_{ij}^{t}P(Y^{*}=\hat{Y}\mid X=x_{i})$;
    - 10: $w_{j}\leftarrow n_{t,j}/\sum_{r=1}^{B}n_{t,r}$;
    - 11: $L_{\text{ecl}}\leftarrow 0$;
    - 12: for each bin $j$ do
        - 13: $u_{s},u_{t}\leftarrow$ cached $u_{j}^{s},u_{j}^{t}$
        - 14: for $i=1$ to $N_{\text{prox}}$ do
            - 15: $v_{s}\leftarrow(m_{s,j}/n_{s,j})-u_{t}$, $\tau_{s}=\dfrac{w_{j}}{2n_{s,j}}$
            - 16: $u_{s}\leftarrow u_{t}+\mathrm{shrink}(v_{s},\tau_{s})$
            - 17: $v_{t}\leftarrow(m_{t,j}/n_{t,j})-u_{s}$, $\tau_{t}=\dfrac{w_{j}}{2n_{t,j}}$
            - 18: $u_{t}\leftarrow u_{s}+\mathrm{shrink}(v_{t},\tau_{t})$
        - 19: end for
        - 20: $\tilde{u}_{j}^{s},\tilde{u}_{j}^{t}\leftarrow u_{s}.{\rm detach}(),\;u_{t}.{\rm detach}()$
        - 21: $u_{j}^{s}\leftarrow(1-\alpha_{\text{ema}})u_{j}^{s}+\alpha_{\text{ema}}\tilde{u}_{j}^{s}$
        - 22: $u_{j}^{t}\leftarrow(1-\alpha_{\text{ema}})u_{j}^{t}+\alpha_{\text{ema}}\tilde{u}_{j}^{t}$
        - 23: $L_{\text{ecl}}\mathrel{+}=\sum\limits_{i\in D_{s}^{m}}\omega_{ij}^{s}\|\tilde{u}_{j}^{s}-P(Y^{*}=\hat{Y}\mid X=x_{i})\|^{2}$
        - 24: $L_{\text{ecl}}\mathrel{+}=\sum\limits_{i\in D_{t}^{m}}\omega_{ij}^{t}\|\tilde{u}_{j}^{t}-P(Y^{*}=\hat{Y}\mid X=x_{i})\|^{2}$
    - 25: end for
    - 26: Compute the cross-entropy loss $L_{\text{ce}}$
    - 27: Backpropagate $L_{\text{ce}}+\lambda L_{\text{ecl}}$ and update $\theta$
- 28: end for
- 29: Return: $\theta$
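The inner proximal loop of Algorithm 2 (steps 14 to 19) and the running-average update of the cached auxiliaries (steps 21 and 22) can be sketched for a single bin in plain Python. This is a minimal sketch under stated assumptions: this excerpt does not define `shrink`, so scalar soft-thresholding is assumed here, and the function names `prox_bin_update` and `ema`, together with all numeric inputs, are hypothetical. Plain floats carry no autograd graph, so the `detach()` of step 20 has no counterpart here.

```python
def shrink(v, tau):
    """Scalar soft-thresholding (assumed form of shrink): move v toward 0 by tau."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0


def prox_bin_update(m_s, n_s, m_t, n_t, w_j, u_s, u_t, n_prox=2):
    """Alternating updates of the source/target auxiliaries for one bin."""
    tau_s = w_j / (2.0 * n_s)
    tau_t = w_j / (2.0 * n_t)
    for _ in range(n_prox):
        # Source auxiliary: start from u_t and move toward the source bin mean.
        u_s = u_t + shrink(m_s / n_s - u_t, tau_s)
        # Target auxiliary: start from u_s and move toward the target bin mean.
        u_t = u_s + shrink(m_t / n_t - u_s, tau_t)
    return u_s, u_t


def ema(cached, new, alpha_ema):
    """Running-average update of a cached auxiliary."""
    return (1.0 - alpha_ema) * cached + alpha_ema * new


# Hypothetical bin statistics: source bin mean 0.8, target bin mean 0.6.
u_s, u_t = prox_bin_update(m_s=3.2, n_s=4.0, m_t=1.2, n_t=2.0,
                           w_j=1.0, u_s=0.0, u_t=0.0)
cached_s = ema(0.0, u_s, alpha_ema=0.1)
print(u_s, u_t, cached_s)
```

In this toy run the two bin means are close relative to the thresholds, so both auxiliaries settle on the same value (about 0.675), which is the behaviour one would expect from a shrinkage step that pulls the two domains' auxiliaries together.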
Algorithm 3 Class-wise ECL Mini-Batch.

- 1: Input:
- 2: bins $j=1\ldots B$, hyperparameters $\lambda,\alpha_{\text{ema}},N_{\text{prox}}$;
- 3: $u_{k,j}^{s}=0\in\mathbb{R},\forall k,j$; $u_{k,j}^{t}=0\in\mathbb{R},\forall k,j$;
- 4: for each iteration do
    - 5: Sample mini-batches $D_{s}^{m},D_{t}^{m}$; $L_{\text{ecl}}\leftarrow 0$;
    - 6: for each class $k=1$ to $K$ do
        - 7: Compute weights $\omega_{k,ij}^{s},\omega_{k,ij}^{t}$;
        - 8: $n_{s,j}\leftarrow\sum_{i\in D_{s}^{m}}\omega_{k,ij}^{s}$; $n_{t,j}\leftarrow\sum_{i\in D_{t}^{m}}\omega_{k,ij}^{t}$;
        - 9: $m_{s,j}\leftarrow\sum_{i\in D_{s}^{m}}\omega_{k,ij}^{s}p^{(i)}_{k}(\theta)$;
        - 10: $m_{t,j}\leftarrow\sum_{i\in D_{t}^{m}}\omega_{k,ij}^{t}p^{(i)}_{k}(\theta)$;
        - 11: $w_{k,j}\leftarrow n_{t,j}/\sum_{r=1}^{B}n_{t,r}$;
        - 12: for each bin $j$ do
            - 13: $u_{s},u_{t}\leftarrow$ cached $u_{k,j}^{s},u_{k,j}^{t}$
            - 14: for $i=1$ to $N_{\text{prox}}$ do
                - 15: $v_{s}\leftarrow(m_{s,j}/n_{s,j})-u_{t}$, $\tau_{s}=\dfrac{w_{k,j}}{2n_{s,j}}$
                - 16: $u_{s}\leftarrow u_{t}+\mathrm{shrink}(v_{s},\tau_{s})$
                - 17: $v_{t}\leftarrow(m_{t,j}/n_{t,j})-u_{s}$, $\tau_{t}=\dfrac{w_{k,j}}{2n_{t,j}}$
                - 18: $u_{t}\leftarrow u_{s}+\mathrm{shrink}(v_{t},\tau_{t})$
            - 19: end for
            - 20: $\tilde{u}_{k,j}^{s},\tilde{u}_{k,j}^{t}\leftarrow u_{s}.{\rm detach}(),\;u_{t}.{\rm detach}()$
            - 21: $u_{k,j}^{s}\leftarrow(1-\alpha_{\text{ema}})u_{k,j}^{s}+\alpha_{\text{ema}}\tilde{u}_{k,j}^{s}$
            - 22: $u_{k,j}^{t}\leftarrow(1-\alpha_{\text{ema}})u_{k,j}^{t}+\alpha_{\text{ema}}\tilde{u}_{k,j}^{t}$
            - 23: $L_{\text{ecl}}\mathrel{+}=\sum\limits_{i\in D_{s}^{m}}\omega_{k,ij}^{s}\|\tilde{u}_{k,j}^{s}-p^{(i)}_{k}(\theta)\|^{2}$
            - 24: $L_{\text{ecl}}\mathrel{+}=\sum\limits_{i\in D_{t}^{m}}\omega_{k,ij}^{t}\|\tilde{u}_{k,j}^{t}-p^{(i)}_{k}(\theta)\|^{2}$
        - 25: end for
    - 26: end for
    - 27: Compute the cross-entropy loss $L_{\text{ce}}$
    - 28: Backpropagate $L_{\text{ce}}+\lambda L_{\text{ecl}}$ and update $\theta$
- 29: end for
- 30: Return: $\theta$
## Appendix I Extension of ECL Mini-Batch Training
Algorithm [1](https://arxiv.org/html/2605.21552#alg1) details the ECL mini-batch training for canonical calibration. Here, we present the analogous algorithms for top-label calibration (Algorithm [2](https://arxiv.org/html/2605.21552#alg2)) and class-wise calibration (Algorithm [3](https://arxiv.org/html/2605.21552#alg3)). Both employ the same auxiliary-variable strategy to resolve the bias in mini-batch gradients. In Algorithm [2](https://arxiv.org/html/2605.21552#alg2), $P(Y^{*}=\hat{Y}\mid X=x)$ can be obtained by training a binary classifier whose label is $1_{Y^{*}=\hat{Y}}$ and whose input is $X$. Moreover, this binary classifier does not need to be trained separately: it can be added to the original classifier as an extra classification head and trained end-to-end with it, with the backbone frozen while this head is being trained.
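The binary labels $1_{Y^{*}=\hat{Y}}$ used to train the correctness head can be built directly from the classifier's predictions and the ground-truth labels. The following is a minimal sketch with made-up inputs; `correctness_targets` is a hypothetical helper name, not a function from the paper's code.

```python
def correctness_targets(probs, labels):
    """Return 1 where the arg-max prediction equals the true label, else 0."""
    targets = []
    for p, y in zip(probs, labels):
        y_hat = max(range(len(p)), key=lambda k: p[k])  # predicted class
        targets.append(1 if y_hat == y else 0)
    return targets


# Hypothetical softmax outputs for three samples and their true labels.
probs = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]]
labels = [0, 2, 2]
targets = correctness_targets(probs, labels)
print(targets)
```

These 0/1 targets would then serve as the labels of the extra head, whose output plays the role of $P(Y^{*}=\hat{Y}\mid X=x)$ in Algorithm 2.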
## Appendix J Results
Other experimental settings: The batch size in the experiments is uniformly set to 100. The Adam optimizer with a learning rate of 0.001 is used to train the classifier for 100 epochs. All experiments were conducted on an Intel® Core™ i7-10700 CPU (3.70 GHz) with 125.5 GB of memory and 10 NVIDIA GeForce RTX 3090 graphics cards (each with 24 GB of video memory), running Ubuntu 20.04.3 LTS, Python 3.11.11, and Torch 2.4.1+cu118. We calibrate the classification head used to estimate $P(Y\mid X)$ (or $P(Y^{*}=\hat{Y}\mid X)$ for top-label calibration) on the source domain using the Soft-ECE loss. This classification head has the same network structure as the classification head in the original classifier and uses the same hyperparameters during training. All images in the digit recognition dataset were standardized to 3-channel RGB format and resized to a resolution of 28×28 pixels. All images in PACS and ImageNet-Sketch were standardized to 3-channel RGB format and resized to a resolution of 224×224 pixels.
### J.1 Results on Simulated Covariate Shift Data
Figure [3](https://arxiv.org/html/2605.21552#A10.F3) shows the calibration results under a uniformly distributed covariate shift, complementing the normally distributed case in Figure [2](https://arxiv.org/html/2605.21552#S4.F2). Consistent with the normal case, ECL achieves the lowest calibration error across all three paradigms.

Figure 3: Calibration results on simulated data under a uniformly distributed covariate shift. Judging by the calibration metric on the target domain and the reliability diagram of the calibrated classifier, ECL achieves the smallest calibration error.
### J.2 Results for Top-label Calibration
Table [4](https://arxiv.org/html/2605.21552#A10.T4) details the top-label calibration performance on the PACS and ImageNet-Sketch datasets. Several key observations can be drawn regarding the effectiveness of ECL. First, regarding robustness to large shifts: on the ImageNet-Sketch dataset, which presents a severe distribution shift (source ImageNet vs. target Sketch), uncalibrated models exhibit extreme ECE values exceeding 55%. ECL substantially reduces these errors (to below 15% for the three tested architectures), demonstrating its capability to handle substantial domain gaps. Second, while PseudoCal serves as a strong baseline, ECL is generally competitive and frequently achieves a lower ECE. For instance, in the PACS (→Cartoon) task using Wide-Res50, ECL achieves an ECE of 7.61%, outperforming PseudoCal (16.24%) and improving upon DRL (8.36%). Finally, the method remains effective across diverse architectures, from standard CNNs (ResNet, DenseNet) to Vision Transformers (ViT-L), suggesting that the Expectation Consistency condition captures a model-agnostic principle.
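For readers who want to reproduce the kind of number reported as ECE, here is a minimal sketch of the standard binned estimator. It assumes equal-width confidence bins and 0/1 correctness labels; the excerpt does not state the binning scheme or bin count used for the reported results, so those choices, the function name `expected_calibration_error`, and the sample inputs are all assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin share) * |mean confidence - accuracy|."""
    n = len(confidences)
    conf_sum = [0.0] * n_bins
    hit_sum = [0.0] * n_bins
    for c, y in zip(confidences, correct):
        j = min(int(c * n_bins), n_bins - 1)  # equal-width bin index
        conf_sum[j] += c
        hit_sum[j] += y
    # (count/n) * |conf_sum/count - hit_sum/count| simplifies to |conf_sum - hit_sum| / n.
    return sum(abs(c - h) for c, h in zip(conf_sum, hit_sum)) / n


# Hypothetical top-label confidences and whether each prediction was correct.
confidences = [0.95, 0.95, 0.55, 0.55]
correct = [1, 0, 1, 1]
ece = expected_calibration_error(confidences, correct)
print(ece)
```

In this toy input the high-confidence bin is overconfident (mean confidence 0.95, accuracy 0.5) and the lower bin is underconfident (0.55 vs. 1.0), and each bin contributes 0.225 to the total.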
Table 4: ECE (%) for top-label calibration on PACS and ImageNet-Sketch datasets. The reported results represent the mean and standard deviation derived from ten runs. Columns Uncal through Oracle report ECE (↓); the last column reports ΔACC (%).

| Task | Model | Uncal | Soft-ECE | DECE | KDE | TS | TransCal | DRL | PseudoCal | ECL (Ours) | Oracle | ΔACC (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PACS → Photo | ResNet50 | 22.3±2.16 | 22.1±1.83 | 22.8±1.99 | 21.8±1.72 | 20.9±1.53 | 22.2±1.66 | 9.02±0.57 | 7.33±0.44 | 6.87±0.34 | 3.84±0.23 | +0.72±0.17 |
| | DenseNet121 | 9.78±0.96 | 9.88±0.91 | 9.54±0.91 | 10.2±0.86 | 9.63±0.69 | 9.31±0.63 | 7.91±0.41 | 6.61±0.54 | 5.96±0.27 | 1.84±0.13 | -0.83±0.23 |
| | Wide-Res50 | 16.9±1.42 | 17.2±1.24 | 17.8±1.39 | 16.8±1.18 | 16.2±1.37 | 7.27±0.47 | 4.39±0.42 | 2.83±0.11 | 2.68±0.33 | 1.59±0.17 | +0.69±0.22 |
| PACS → Art | ResNet50 | 33.1±3.24 | 32.1±2.97 | 33.2±3.11 | 31.6±3.06 | 31.9±3.24 | 17.1±1.51 | 17.1±1.14 | 7.88±0.78 | 7.22±0.53 | 2.12±0.08 | -1.24±0.41 |
| | DenseNet121 | 23.2±2.04 | 22.8±1.94 | 23.6±2.14 | 23.1±1.86 | 22.7±1.96 | 22.4±1.88 | 6.16±0.59 | 9.94±0.74 | 5.89±0.36 | 2.24±0.21 | +1.06±0.39 |
| | Wide-Res50 | 29.9±2.78 | 29.3±2.57 | 30.4±2.64 | 29.7±2.42 | 30.1±2.53 | 16.1±1.23 | 15.8±1.42 | 8.43±0.67 | 7.97±0.52 | 3.14±0.24 | -0.96±0.29 |
| PACS → Cartoon | ResNet50 | 25.1±2.26 | 25.1±2.08 | 25.3±2.42 | 24.9±2.07 | 24.8±1.91 | 25.2±2.39 | 6.69±0.48 | 5.71±0.43 | 5.46±0.43 | 2.73±0.26 | +0.56±0.12 |
| | DenseNet121 | 18.4±1.48 | 18.7±1.36 | 17.8±1.56 | 18.4±1.44 | 18.3±1.73 | 11.3±0.91 | 10.9±1.21 | 2.21±0.09 | 2.04±0.16 | 2.04±0.18 | -0.74±0.22 |
| | Wide-Res50 | 25.4±1.98 | 24.9±1.88 | 25.9±2.06 | 25.6±1.72 | 25.2±1.76 | 23.7±1.88 | 8.36±0.67 | 16.24±1.44 | 7.61±0.28 | 2.73±0.19 | +1.26±0.39 |
| PACS → Sketch | ResNet50 | 23.1±1.87 | 23.9±1.97 | 22.9±2.06 | 23.4±1.88 | 23.4±2.24 | 11.4±0.96 | 16.2±1.29 | 10.9±0.98 | 10.3±0.82 | 1.54±0.13 | -1.53±0.47 |
| | DenseNet121 | 23.6±1.57 | 22.8±1.43 | 23.8±1.72 | 23.2±1.54 | 22.9±1.96 | 9.09±0.79 | 3.39±0.19 | 5.39±0.51 | 3.17±0.28 | 2.66±0.16 | +0.86±0.21 |
| | Wide-Res50 | 19.2±1.41 | 19.6±1.36 | 18.8±1.51 | 19.1±1.26 | 18.9±1.66 | 10.01±0.96 | 6.81±0.68 | 2.79±0.21 | 2.67±0.22 | 2.69±0.28 | -0.48±0.13 |
| I-S → Sketch | ResNet152 | 64.3±4.48 | 63.6±4.12 | 65.1±4.64 | 63.4±4.36 | 62.8±4.19 | 60.1±3.94 | 33.3±2.34 | 17.3±1.68 | 14.6±0.58 | 1.54±0.09 | +0.92±0.31 |
| | DenseNet161 | 69.1±3.62 | 68.7±3.47 | 69.8±3.86 | 68.3±3.57 | 68.3±4.66 | 58.4±4.33 | 36.9±2.69 | 13.2±1.21 | 11.7±0.46 | 1.27±0.14 | +1.39±0.59 |
| | ViT-L | 55.8±4.34 | 55.1±4.07 | 56.7±4.49 | 54.9±4.26 | 53.7±4.16 | 32.7±2.38 | 27.1±1.79 | 15.7±1.24 | 12.9±0.54 | 1.47±0.11 | +0.92±0.28 |
### J.3 Results for Class-wise Calibration
Table [5](https://arxiv.org/html/2605.21552#A10.T5) reports the Class-wise ECE (CwECE) results. Two major trends are evident from the experimental data. First, unlike top-label calibration, which focuses only on the predicted class, class-wise calibration requires precision across all categories. ECL achieves the lowest (or near-lowest) CwECE in many experimental settings (spanning datasets and models), indicating that it improves calibration beyond the dominant class. Second, regarding hard tasks: on the Digit recognition benchmarks (included in Table [5](https://arxiv.org/html/2605.21552#A10.T5)), the advantage of ECL is most prominent on the SVHN dataset. For the LeNet-5 architecture, ECL reduces CwECE from 15.8% (Uncal) to 5.88%, the lowest value among the compared methods in that row (e.g., DRL at 8.54% and PseudoCal at 12.7%). This suggests that ECL's auxiliary variable optimization can be particularly effective in scenarios with complex background noise and lower image quality.
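The contrast with top-label calibration can be made concrete with a small class-wise ECE sketch. It assumes one common convention: for each class $k$, bin the predicted probability $p_k$ with equal-width bins, compare it to the frequency of label $k$ in the bin, and average the per-class errors over classes. The paper's exact CwECE definition is not given in this excerpt (some works sum rather than average over classes), so this convention, the name `classwise_ece`, and the inputs are assumptions.

```python
def classwise_ece(probs, labels, n_bins=10):
    """Average over classes of the binned |mean p_k - frequency of class k|."""
    n, n_classes = len(probs), len(probs[0])
    total = 0.0
    for k in range(n_classes):
        conf_sum = [0.0] * n_bins
        hit_sum = [0.0] * n_bins
        for p, y in zip(probs, labels):
            j = min(int(p[k] * n_bins), n_bins - 1)  # equal-width bin index
            conf_sum[j] += p[k]
            hit_sum[j] += 1.0 if y == k else 0.0
        # Per-class binned calibration error, weighted by bin share.
        total += sum(abs(c - h) for c, h in zip(conf_sum, hit_sum)) / n
    return total / n_classes


# Hypothetical two-class predictions: same probability vector, different labels.
probs = [[0.75, 0.25], [0.75, 0.25]]
labels = [0, 1]
cw_ece = classwise_ece(probs, labels)
print(cw_ece)
```

Here both classes are miscalibrated by the same amount (predicted 0.75 and 0.25 against an empirical frequency of 0.5 each), so every class contributes, not only the predicted one.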
Table 5: CwECE (%) for class-wise calibration on Digit, PACS and ImageNet-Sketch datasets. The reported results represent the mean and standard deviation derived from ten runs. Columns Uncal through Oracle report CwECE (↓); the last column reports ΔACC (%).

| Task | Model | Uncal | Soft-ECE | DECE | KDE | TS | TransCal | DRL | PseudoCal | ECL (Ours) | Oracle | ΔACC (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Digit → MNIST | LeNet-5 | 5.41±0.47 | 5.54±0.37 | 5.31±0.43 | 5.49±0.39 | 5.18±0.33 | 4.92±0.39 | 3.79±0.24 | 1.86±0.12 | 1.66±0.12 | 0.16±0.01 | -0.44±0.09 |
| | ResNet20 | 3.14±0.31 | 3.23±0.22 | 3.13±0.28 | 3.21±0.22 | 3.06±0.23 | 2.47±0.17 | 1.94±0.14 | 1.46±0.14 | 1.41±0.11 | 0.39±0.01 | +0.62±0.11 |
| | DenseNet40 | 4.69±0.41 | 4.74±0.37 | 4.51±0.29 | 4.66±0.43 | 4.46±0.39 | 3.94±0.27 | 2.81±0.21 | 1.77±0.19 | 1.57±0.12 | 0.38±0.06 | +0.23±0.11 |
| Digit → USPS | LeNet-5 | 6.87±0.54 | 6.96±0.47 | 6.77±0.57 | 6.91±0.44 | 6.63±0.46 | 5.84±0.36 | 4.13±0.31 | 2.19±0.19 | 2.11±0.14 | 0.57±0.03 | -0.33±0.09 |
| | ResNet20 | 2.54±0.23 | 2.66±0.19 | 2.46±0.24 | 2.59±0.23 | 2.46±0.22 | 2.14±0.16 | 1.73±0.14 | 1.17±0.07 | 1.24±0.09 | 0.63±0.01 | +0.42±0.18 |
| | DenseNet40 | 3.99±0.36 | 4.03±0.29 | 3.83±0.33 | 3.99±0.28 | 3.63±0.24 | 3.27±0.19 | 2.14±0.14 | 1.48±0.12 | 1.18±0.12 | 0.72±0.06 | -0.16±0.04 |
| Digit → SVHN | LeNet-5 | 15.8±1.26 | 15.8±1.17 | 15.4±1.37 | 16.2±1.26 | 15.4±1.19 | 14.4±1.12 | 8.54±0.61 | 12.7±0.92 | 5.88±0.43 | 0.44±0.02 | +0.84±0.24 |
| | ResNet20 | 18.4±1.44 | 18.2±1.34 | 18.8±1.54 | 18.1±1.29 | 18.2±1.37 | 15.1±1.17 | 9.86±0.74 | 11.2±0.88 | 8.97±0.59 | 0.22±0.01 | -1.04±0.36 |
| | DenseNet40 | 21.3±1.64 | 21.8±1.53 | 21.2±1.76 | 21.4±1.49 | 20.6±1.49 | 18.4±1.31 | 11.3±0.84 | 15.2±1.14 | 8.16±0.57 | 0.39±0.03 | +0.53±0.17 |
| PACS → Photo | ResNet50 | 7.87±0.31 | 7.89±0.43 | 7.77±0.38 | 7.84±0.44 | 7.62±0.32 | 6.86±0.31 | 5.99±0.26 | 3.24±0.21 | 2.92±0.12 | 0.58±0.01 | +0.48±0.09 |
| | DenseNet121 | 8.53±0.47 | 8.64±0.44 | 8.46±0.52 | 8.59±0.42 | 8.37±0.37 | 7.58±0.29 | 6.28±0.23 | 3.87±0.24 | 3.56±0.19 | 0.61±0.01 | +0.29±0.11 |
| | Wide-Res50 | 6.99±0.38 | 6.99±0.32 | 6.81±0.39 | 6.99±0.32 | 6.78±0.32 | 6.17±0.31 | 5.24±0.26 | 2.83±0.14 | 2.58±0.09 | 0.48±0.04 | +0.34±0.12 |
| PACS → Art | ResNet50 | 13.3±0.64 | 13.9±0.73 | 13.2±0.78 | 13.3±0.64 | 12.8±0.54 | 11.1±0.47 | 8.58±0.36 | 5.28±0.23 | 4.86±0.24 | 0.84±0.02 | -0.28±0.12 |
| | DenseNet121 | 14.4±0.73 | 14.7±0.82 | 13.6±0.84 | 14.3±0.77 | 13.1±0.63 | 11.4±0.51 | 9.28±0.47 | 5.89±0.39 | 5.94±0.31 | 0.94±0.04 | -0.18±0.11 |
| | Wide-Res50 | 12.6±0.56 | 12.8±0.67 | 12.7±0.71 | 12.8±0.63 | 11.8±0.56 | 9.96±0.46 | 7.84±0.32 | 4.58±0.23 | 4.13±0.17 | 0.78±0.08 | +0.44±0.14 |
| PACS → Cartoon | ResNet50 | 16.4±0.84 | 16.6±0.88 | 16.4±0.93 | 16.3±0.92 | 15.4±0.73 | 14.1±0.64 | 10.3±0.54 | 6.86±0.43 | 6.46±0.34 | 1.17±0.08 | +0.63±0.24 |
| | DenseNet121 | 17.1±0.98 | 17.3±1.08 | 17.1±1.16 | 17.3±1.02 | 16.6±0.87 | 14.7±0.73 | 10.9±0.67 | 7.53±0.56 | 7.13±0.49 | 1.27±0.09 | +0.32±0.19 |
| | Wide-Res50 | 15.6±0.81 | 15.9±0.81 | 15.3±0.84 | 15.9±0.81 | 14.7±0.72 | 12.9±0.59 | 9.86±0.51 | 6.16±0.41 | 5.83±0.31 | 1.06±0.03 | +0.51±0.14 |
| PACS → Sketch | ResNet50 | 19.6±1.18 | 19.7±1.23 | 19.3±1.32 | 19.6±1.21 | 18.6±1.01 | 16.2±0.96 | 13.4±0.86 | 8.82±0.72 | 9.28±0.67 | 1.43±0.11 | -0.88±0.21 |
| | DenseNet121 | 20.3±1.24 | 20.4±1.37 | 19.9±1.41 | 20.1±1.26 | 19.2±1.14 | 17.7±1.03 | 13.9±0.92 | 9.53±0.88 | 8.91±0.74 | 1.53±0.14 | +0.28±0.19 |
| | Wide-Res50 | 18.8±1.06 | 18.8±1.18 | 18.6±1.27 | 18.9±1.06 | 17.9±0.97 | 15.4±0.84 | 12.9±0.72 | 8.16±0.64 | 7.87±0.54 | 1.36±0.07 | +0.47±0.26 |
| I-S → Sketch | ResNet152 | 22.6±1.36 | 22.6±1.43 | 21.9±1.56 | 22.6±1.37 | 21.3±1.28 | 18.6±1.13 | 14.2±0.97 | 10.3±0.84 | 9.86±0.73 | 1.64±0.12 | +0.84±0.37 |
| | DenseNet161 | 23.4±1.43 | 23.3±1.59 | 22.6±1.62 | 23.6±1.44 | 22.2±1.34 | 19.4±1.22 | 15.1±1.07 | 11.3±0.94 | 10.7±0.82 | 1.76±0.14 | -0.69±0.27 |
| | ViT-L | 12.7±0.87 | 12.9±0.96 | 12.4±0.96 | 12.9±0.84 | 11.9±0.73 | 10.4±0.66 | 7.83±0.54 | 5.54±0.44 | 5.18±0.31 | 0.93±0.09 | +1.26±0.26 |
### J.4 Results for Canonical Calibration
The results for canonical calibration, measured by $\mathrm{ECE}_{\mathrm{KDE}}$ in Table [6](https://arxiv.org/html/2605.21552#A10.T6), further confirm the efficacy of ECL. The findings highlight two main points. First, canonical calibration is the most rigorous standard, as it requires the entire probability vector to be calibrated. The differentiable baseline KDE loss operates within-domain and does not explicitly address the covariate shift, so it often performs similarly to the uncalibrated baseline in our setting (e.g., ResNet50 on Photo in Table [6](https://arxiv.org/html/2605.21552#A10.T6)). In contrast, ECL explicitly minimizes the cross-domain discrepancy of probability expectations, frequently achieving the best (or near-best) $\mathrm{ECE}_{\mathrm{KDE}}$ scores. Second, in terms of accuracy, the reduction in $\mathrm{ECE}_{\mathrm{KDE}}$ is often achieved with limited impact on classification accuracy; while ΔACC is positive in many cases, slight accuracy drops can still occur for some architectures and tasks.
Table 6: $\mathrm{ECE}_{\mathrm{KDE}}$ (%) for canonical calibration on Digit, PACS, and ImageNet-Sketch datasets. The reported results represent the mean and standard deviation derived from ten runs. Columns Uncal through Oracle report $\mathrm{ECE}_{\mathrm{KDE}}$ (↓); the last column reports ΔACC (%).

| Task | Model | Uncal | Soft-ECE | DECE | KDE | TS | TransCal | DRL | PseudoCal | ECL (Ours) | Oracle | ΔACC (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Digit → MNIST | LeNet-5 | 5.16±0.39 | 5.19±0.31 | 5.07±0.38 | 5.19±0.26 | 4.92±0.22 | 4.68±0.31 | 3.52±0.22 | 1.77±0.13 | 1.58±0.09 | 0.21±0.02 | -0.32±0.08 |
| | ResNet20 | 2.97±0.23 | 3.07±0.18 | 2.84±0.26 | 3.01±0.13 | 2.73±0.17 | 2.29±0.14 | 1.72±0.13 | 1.36±0.11 | 1.29±0.04 | 0.39±0.02 | +0.54±0.12 |
| | DenseNet40 | 4.37±0.34 | 4.49±0.26 | 4.24±0.39 | 4.34±0.26 | 4.17±0.28 | 3.67±0.24 | 2.68±0.16 | 1.61±0.12 | 1.42±0.13 | 0.32±0.04 | +0.19±0.02 |
| Digit → USPS | LeNet-5 | 6.43±0.42 | 6.54±0.39 | 6.34±0.38 | 6.48±0.39 | 6.23±0.28 | 5.42±0.29 | 3.86±0.23 | 2.04±0.12 | 1.96±0.14 | 0.48±0.04 | -0.22±0.06 |
| | ResNet20 | 2.38±0.22 | 2.42±0.16 | 2.24±0.24 | 2.42±0.09 | 2.16±0.14 | 1.97±0.11 | 1.54±0.14 | 1.12±0.07 | 1.17±0.04 | 0.51±0.06 | +0.36±0.18 |
| | DenseNet40 | 3.72±0.21 | 3.83±0.22 | 3.68±0.27 | 3.77±0.23 | 3.47±0.21 | 2.93±0.17 | 1.97±0.13 | 1.37±0.09 | 1.04±0.04 | 0.72±0.01 | -0.11±0.01 |
| Digit → SVHN | LeNet-5 | 14.7±0.97 | 15.3±0.88 | 14.7±1.06 | 14.6±0.89 | 13.9±0.88 | 13.7±0.83 | 7.84±0.54 | 11.3±0.73 | 5.26±0.39 | 0.39±0.06 | +0.69±0.26 |
| | ResNet20 | 17.6±1.13 | 17.2±1.07 | 17.6±1.28 | 17.3±0.94 | 16.9±1.03 | 14.2±0.84 | 8.81±0.69 | 10.3±0.74 | 8.23±0.51 | 0.24±0.03 | -0.88±0.27 |
| | DenseNet40 | 20.8±1.37 | 20.9±1.22 | 20.3±1.46 | 20.6±1.21 | 19.2±1.18 | 17.6±1.07 | 10.3±0.78 | 14.4±0.91 | 7.58±0.48 | 0.38±0.01 | +0.44±0.12 |
| PACS → Photo | ResNet50 | 7.58±0.37 | 7.67±0.44 | 7.43±0.37 | 7.61±0.42 | 7.34±0.29 | 6.52±0.31 | 5.63±0.19 | 2.93±0.19 | 2.64±0.12 | 0.42±0.04 | +0.46±0.09 |
| | DenseNet121 | 8.21±0.49 | 8.37±0.44 | 8.16±0.49 | 8.31±0.48 | 7.97±0.34 | 7.22±0.36 | 5.99±0.26 | 3.54±0.21 | 3.27±0.18 | 0.49±0.06 | +0.26±0.04 |
| | Wide-Res50 | 6.63±0.33 | 6.73±0.36 | 6.53±0.38 | 6.67±0.34 | 6.48±0.36 | 5.89±0.28 | 4.93±0.22 | 2.53±0.12 | 2.27±0.09 | 0.38±0.01 | +0.31±0.13 |
| PACS → Art | ResNet50 | 13.1±0.71 | 13.4±0.64 | 12.7±0.69 | 13.2±0.63 | 12.3±0.57 | 10.2±0.43 | 8.26±0.31 | 4.92±0.28 | 4.57±0.22 | 0.76±0.04 | -0.23±0.11 |
| | DenseNet121 | 13.7±0.76 | 14.3±0.74 | 13.7±0.81 | 13.8±0.69 | 12.6±0.68 | 11.2±0.57 | 8.96±0.46 | 5.54±0.39 | 5.63±0.29 | 0.84±0.09 | -0.18±0.11 |
| | Wide-Res50 | 12.2±0.56 | 12.4±0.54 | 12.3±0.64 | 12.8±0.54 | 11.2±0.57 | 9.69±0.47 | 7.58±0.39 | 4.26±0.23 | 3.86±0.19 | 0.63±0.04 | +0.41±0.14 |
| PACS → Cartoon | ResNet50 | 16.1±0.84 | 16.7±0.81 | 15.8±0.94 | 16.4±0.79 | 15.4±0.72 | 13.3±0.62 | 10.1±0.54 | 6.54±0.47 | 6.18±0.34 | 1.06±0.11 | +0.62±0.24 |
| | DenseNet121 | 16.9±0.91 | 17.2±0.99 | 16.4±1.09 | 16.9±0.84 | 15.8±0.82 | 14.1±0.78 | 10.6±0.64 | 7.27±0.54 | 6.89±0.47 | 1.16±0.11 | +0.37±0.16 |
| | Wide-Res50 | 15.7±0.71 | 15.6±0.78 | 15.2±0.86 | 15.6±0.71 | 14.3±0.68 | 12.6±0.58 | 9.59±0.53 | 5.81±0.39 | 5.53±0.33 | 0.96±0.07 | +0.58±0.13 |
| PACS → Sketch | ResNet50 | 19.2±1.06 | 19.4±1.12 | 18.9±1.28 | 19.1±1.09 | 18.4±1.09 | 15.6±0.93 | 13.4±0.84 | 8.57±0.76 | 8.94±0.67 | 1.31±0.09 | -0.87±0.24 |
| | DenseNet121 | 19.9±1.12 | 20.3±1.24 | 19.6±1.34 | 20.1±1.16 | 18.9±1.13 | 17.1±1.07 | 13.9±0.98 | 9.28±0.86 | 8.62±0.76 | 1.49±0.16 | +0.28±0.16 |
| | Wide-Res50 | 18.8±0.94 | 18.7±1.07 | 18.1±1.16 | 18.6±0.99 | 17.6±0.93 | 14.9±0.86 | 12.6±0.76 | 7.84±0.63 | 7.59±0.57 | 1.24±0.12 | +0.47±0.26 |
| I-S → Sketch | ResNet152 | 22.4±1.28 | 22.6±1.38 | 21.6±1.48 | 22.6±1.28 | 21.1±1.26 | 18.2±1.11 | 14.4±0.93 | 10.1±0.88 | 9.52±0.76 | 1.56±0.11 | +0.87±0.32 |
| | DenseNet161 | 23.2±1.33 | 23.1±1.43 | 22.4±1.52 | 22.8±1.39 | 21.8±1.33 | 18.9±1.23 | 14.8±1.09 | 11.1±0.96 | 10.1±0.82 | 1.66±0.11 | -0.69±0.28 |
| | ViT-L | 12.3±0.73 | 12.4±0.86 | 11.8±0.88 | 12.1±0.79 | 11.3±0.78 | 9.84±0.68 | 7.56±0.54 | 5.28±0.49 | 4.84±0.36 | 0.86±0.06 | +1.26±0.24 |
### J.5 Ablation Experiments
Mini-Batch Non-Trainable ECL vs. Mini-Batch Trainable ECL: To understand the efficacy of our proposed mini-batch training strategy involving auxiliary variables, we compare our full method (Mini-Batch Trainable ECL) against a baseline variant, Mini-Batch Non-Trainable ECL, which directly computes the differentiable ECL loss (Eq. [8](https://arxiv.org/html/2605.21552#S3.E8)) on mini-batch data. Table [7](https://arxiv.org/html/2605.21552#A10.T7) presents the comparison results on the Digit (→MNIST) and PACS (→Photo) tasks. Overall, Mini-Batch Trainable ECL tends to be more stable and achieves better calibration in most cases, while Mini-Batch Non-Trainable ECL can occasionally be competitive on some metrics and architectures. This supports the view that, beyond the objective itself, the bias-corrected optimization strategy is important for reliably realizing ECL's benefits. Regarding classification accuracy (ΔACC), both variants largely maintain or improve performance, with Mini-Batch Trainable ECL showing more consistent gains in our reported experiments.
Table 7: Comparison between Mini-Batch Non-Trainable ECL and Mini-Batch Trainable ECL on Digit and PACS benchmark tasks. Results report ECE (%), CwECE (%), $\mathrm{ECE}_{\mathrm{KDE}}$ (%) and accuracy change ΔACC (%) with mean ± std over five runs. ECE represents the results under top-label calibration, CwECE represents the results under class-wise calibration, and $\mathrm{ECE}_{\mathrm{KDE}}$ represents the results under canonical calibration.

| Dataset | Architecture | Method | Top-Label ECE (%) ↓ | Top-Label ΔACC (%) | Class-wise CwECE (%) ↓ | Class-wise ΔACC (%) | Canonical $\mathrm{ECE}_{\mathrm{KDE}}$ (%) ↓ | Canonical ΔACC (%) |
|---|---|---|---|---|---|---|---|---|
| Digit (→MNIST) | LeNet-5 | Non-Trainable | 8.85±0.72 | -0.45±0.25 | 1.75±0.15 | -0.35±0.15 | 1.68±0.12 | -0.21±0.10 |
| | | Trainable | 8.52±0.78 | -0.92±0.35 | 1.66±0.12 | -0.44±0.09 | 1.58±0.09 | -0.32±0.08 |
| | ResNet20 | Non-Trainable | 8.05±0.51 | +0.85±0.32 | 1.38±0.13 | +0.45±0.15 | 1.32±0.08 | +0.38±0.12 |
| | | Trainable | 7.88±0.45 | +1.25±0.42 | 1.41±0.11 | +0.62±0.11 | 1.29±0.04 | +0.54±0.12 |
| | DenseNet40 | Non-Trainable | 9.05±0.65 | +0.42±0.18 | 1.68±0.15 | +0.12±0.08 | 1.52±0.11 | +0.09±0.06 |
| | | Trainable | 9.15±0.61 | +0.68±0.20 | 1.57±0.12 | +0.23±0.11 | 1.42±0.13 | +0.19±0.02 |
| PACS (→Photo) | ResNet50 | Non-Trainable | 7.15±0.38 | +0.32±0.15 | 3.08±0.18 | +0.28±0.11 | 2.58±0.15 | +0.25±0.10 |
| | | Trainable | 6.87±0.34 | +0.72±0.17 | 2.92±0.12 | +0.48±0.09 | 2.64±0.12 | +0.46±0.09 |
| | DenseNet121 | Non-Trainable | 6.35±0.45 | -0.15±0.21 | 3.72±0.22 | +0.12±0.09 | 3.41±0.19 | +0.11±0.08 |
| | | Trainable | 5.96±0.27 | -0.83±0.23 | 3.56±0.19 | +0.29±0.11 | 3.27±0.18 | +0.26±0.04 |
| | Wide-Res50 | Non-Trainable | 2.75±0.15 | +0.41±0.12 | 2.71±0.12 | +0.15±0.08 | 2.40±0.11 | +0.14±0.07 |
| | | Trainable | 2.68±0.33 | +0.69±0.22 | 2.58±0.09 | +0.34±0.12 | 2.27±0.09 | +0.31±0.13 |

Loss Weight: To maintain the equal importance of $L_{ce}$ and $L_{ecl}$, we set the regularization weight as $\lambda=\beta^{\gamma}$. Here, $\beta=\left(\sum\nolimits_{i}\mathcal{L}_{ce}^{(i)}\right)/\left(\sum\nolimits_{i}\mathcal{L}_{ecl}^{(i)}\right)$ acts as a baseline balancing factor between the cross-entropy loss and the calibration loss, where $i$ indexes the $i$-th iteration. The exponent $\gamma$ serves as a non-linear scaling factor to adjust the sensitivity of the regularization: a higher $\gamma$ (when $\beta>1$) or lower $\gamma$ (when $\beta<1$) intensifies the dominance of the calibration term. We investigate the impact of $\gamma$ by experimenting with values ranging from 0.5 to 1.5, and Table [8](https://arxiv.org/html/2605.21552#A10.T8) suggests that $\gamma=1.0$ is a reasonable default choice in our evaluated settings (Digit→MNIST and PACS→Photo).
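The weighting rule $\lambda=\beta^{\gamma}$ is simple enough to sketch directly. The snippet below assumes the sums in $\beta$ run over recorded per-iteration loss values stored as plain floats; the excerpt does not say over which window of iterations the sums are taken, so the history lists and the function name `loss_weight` are hypothetical.

```python
def loss_weight(ce_history, ecl_history, gamma=1.0):
    """Return lambda = beta ** gamma with beta = sum(L_ce) / sum(L_ecl)."""
    beta = sum(ce_history) / sum(ecl_history)
    return beta ** gamma


# Hypothetical recorded losses: cross-entropy is 4x larger than the ECL term.
ce_history = [2.0, 2.0]
ecl_history = [0.5, 0.5]

lam_default = loss_weight(ce_history, ecl_history)            # gamma = 1.0
lam_damped = loss_weight(ce_history, ecl_history, gamma=0.5)  # weaker calibration term
print(lam_default, lam_damped)
```

With $\beta=4>1$ in this toy case, lowering $\gamma$ from 1.0 to 0.5 shrinks $\lambda$ from 4 to 2, which matches the stated behaviour that a higher $\gamma$ strengthens the calibration term when $\beta>1$.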
Table 8: Ablation study on the hyperparameter $\gamma$ on Digit and PACS datasets. $\gamma$ controls the non-linear scaling of the loss weight.

| Task | $\gamma$ | Top-Label ECE ↓ | Top-Label ΔACC | Class-Wise CwECE ↓ | Class-Wise ΔACC | Canonical $\mathrm{ECE}_{\mathrm{KDE}}$ ↓ | Canonical ΔACC |
|---|---|---|---|---|---|---|---|
| Digit (→MNIST) using ResNet20 | 0.5 | 8.76±0.62 | +1.68±0.33 | 1.94±0.16 | +1.15±0.22 | 1.83±0.12 | +0.95±0.18 |
| | 0.8 | 8.12±0.54 | +1.45±0.29 | 1.48±0.14 | +0.88±0.16 | 1.35±0.09 | +0.72±0.14 |
| | 1.0 | 7.88±0.45 | +1.25±0.42 | 1.41±0.11 | +0.62±0.11 | 1.29±0.04 | +0.54±0.12 |
| | 1.2 | 7.85±0.49 | +0.92±0.25 | 1.55±0.13 | +0.35±0.09 | 1.38±0.07 | +0.28±0.08 |
| | 1.5 | 8.42±0.56 | +0.45±0.18 | 1.78±0.15 | +0.12±0.05 | 1.56±0.10 | +0.08±0.04 |
| PACS (→Photo) using ResNet50 | 0.5 | 7.45±0.41 | +0.88±0.19 | 3.25±0.22 | +0.65±0.14 | 2.98±0.18 | +0.62±0.11 |
| | 0.8 | 7.02±0.38 | +0.81±0.17 | 3.05±0.15 | +0.55±0.12 | 2.58±0.14 | +0.54±0.10 |
| | 1.0 | 6.87±0.34 | +0.72±0.17 | 2.92±0.12 | +0.48±0.09 | 2.64±0.12 | +0.46±0.09 |
| | 1.2 | 6.95±0.32 | +0.61±0.15 | 2.98±0.14 | +0.41±0.08 | 2.68±0.10 | +0.39±0.08 |
| | 1.5 | 7.18±0.36 | +0.42±0.12 | 3.12±0.16 | +0.25±0.06 | 2.89±0.15 | +0.24±0.07 |