Reliable Conformal Prediction for Ordinal Classification Using the Ranked Probability Score
Summary
Introduces a conformal prediction method for ordinal classification using the ranked probability score as a nonconformity function, producing median-centered contiguous prediction sets and achieving favorable balance between set width and ordinal miscoverage.
# Reliable Conformal Prediction for Ordinal Classification Using the Ranked Probability Score
Source: [https://arxiv.org/html/2606.24959](https://arxiv.org/html/2606.24959)
Luca Killmaier (Institute of Informatics, LMU Munich, Munich, Germany); Alireza Javanmardi (Institute of Informatics, LMU Munich, Munich, Germany; Munich Center for Machine Learning (MCML), Munich, Germany); Eyke Hüllermeier (Institute of Informatics, LMU Munich, Munich, Germany; Munich Center for Machine Learning (MCML), Munich, Germany; German Centre for Artificial Intelligence (DFKI, DSA), Kaiserslautern, Germany)
###### Abstract
Ordinal classification \(OC\) arises in high\-stakes domains such as medicine and finance, where uncertainty quantification must account for the severity of ordinal errors\. Conformal prediction \(CP\) provides distribution\-free prediction sets with marginal coverage guarantees; however, its practical effectiveness depends critically on the choice of nonconformity function\. We introduce a CP method for ordinal classification based on the ranked probability score \(RPS\), a proper scoring rule defined over cumulative predictive distributions\. Although it reflects ordinal risk quite naturally, it has largely been neglected in conformal ordinal prediction \(COP\)\. When used as a measure of nonconformity, RPS yields median\-centered contiguous prediction sets by construction\. The method is model\-agnostic, supports both assessed and grouped ordered categorical outcomes, and permits efficient implementation compared to greedy interval selection procedures\. Across multiple ordinal image and tabular datasets, RPS\-based CP produces contiguous prediction sets and strikes a favorable balance between prediction set width and the magnitude of ordinal miscoverage relative to existing CP methods\.
## 1 Introduction
Ordinal classification (OC), also known as ordinal regression in statistics [mccullagh1980regression], refers to classification problems in which class labels exhibit a natural linear order. Representative applications include medical diagnosis [DBLP:journals/peerj-cs/AlbuquerqueCC21], age estimation [DBLP:journals/prl/CaoMR20], and credit risk assessment [DBLP:journals/sma/HirkHV19]. Despite the prevalence of OC in high-stakes domains, most work has focused on improving point prediction performance [DBLP:journals/paa/ShiCR23, DBLP:conf/aaai/NachmaniGSSG25, DBLP:journals/eswa/PolatCT25], whereas uncertainty quantification (UQ) has attracted attention only recently [DBLP:journals/ijar/HaasH25, DBLP:journals/corr/abs-2507-00733]. Here, uncertainty for a query $\boldsymbol{x}_q$ is typically represented by the conditional predictive distribution over ordered labels, $p(y \mid \boldsymbol{x}_q)$, produced by a probabilistic predictor.
Alternatively, uncertainty can be represented by set-valued predictions. Conformal prediction (CP) offers a principled, model-agnostic framework for post-hoc construction of prediction sets [DBLP:journals/jmlr/ShaferV08, vovk2005algorithmic, DBLP:conf/icml/VovkGS99, DBLP:journals/corr/abs-2107-07511]. It calibrates an underlying (heuristic) uncertainty estimate to achieve finite-sample, distribution-free marginal coverage at a user-specified miscoverage rate $\alpha$. Instead of committing to a single label $y \in \mathcal{Y}$, CP outputs a set $\mathcal{C}_\alpha(\boldsymbol{x}_q) \subseteq \mathcal{Y}$ of plausible labels, whose size reflects predictive uncertainty at the query $\boldsymbol{x}_q$. However, producing sets in conformal ordinal prediction (COP) that are both informative and consistent with the ordinal structure of the label space remains challenging.
A key requirement for COP is that prediction sets be contiguous [DBLP:conf/miccai/LuAP22, DBLP:conf/uai/XuGW23, DBLP:conf/nips/DeyMK23]. For example, consider age estimation from images [DBLP:journals/paa/ShiCR23, DBLP:conf/caepia/YunGGBGH24], where a latent continuous variable (e.g., age) is discretized into ordered categories, with $\mathcal{Y} = \{\texttt{baby}, \texttt{child}, \texttt{teenager}, \texttt{adult}, \texttt{senior}\}$. A valid prediction set should contain only adjacent age categories, e.g., $\mathcal{C}_\alpha(\boldsymbol{x}_q) = \{\texttt{child}, \texttt{teenager}, \texttt{adult}\}$, whereas non-contiguous sets such as $\mathcal{C}_\alpha(\boldsymbol{x}_q) = \{\texttt{child}, \texttt{senior}\}$ may appear unreasonable.
Figure 1: (Left) Illustration of an assessed ordered categorical variable (survival prognosis for melanoma [NCI1985_AV8500_3850]) with extreme disagreement between physicians. To faithfully quantify uncertainty, the prediction set must be contiguous and include all intermediate classes between the conflicting assessments. (Right) Illustration of a grouped ordered categorical variable (age estimation), where unimodal predictive modeling is well justified and naturally leads to contiguous prediction sets [lanitis2002toward, panis2016overview].

Another important category, alongside the previously described *grouped* ordered categorical variables, is the class of *assessed* ordered categorical variables [anderson1984regression], in which human experts assign labels, as in financial risk assessment or medical survival prognosis. In these contexts, errors tend to be inherently larger due to inter-expert disagreement, which makes maintaining contiguity even more crucial for accurately capturing uncertainty. For example, consider physician opinions regarding survival prognosis for stage IV melanoma, which may be polarized, resulting in large clusters at $\texttt{<1 year}$ (very pessimistic group) and ${\gg}\,\texttt{5 years}$ (optimistic group influenced by immunotherapy outcomes). To properly quantify uncertainty in such cases, the prediction set should not be limited to the two most frequent categories, i.e., $\mathcal{C}_\alpha(\boldsymbol{x}_q) = \{\texttt{<1 year}, {\gg}\,\texttt{5 years}\}$; instead, it should encompass the entire range of plausible categories, $\mathcal{C}_\alpha(\boldsymbol{x}_q) = \{\texttt{<1 year}, \texttt{1-3 years}, \texttt{3-5 years}, \texttt{>5 years}, {\gg}\,\texttt{5 years}\}$. Otherwise, uncertainty measured by the size of the prediction set will be severely underestimated (see Figure [1](https://arxiv.org/html/2606.24959#S1.F1)).
Another important aspect of COP that has received limited attention is the *severity of miscoverage* when the true label lies outside the prediction set. For instance, in financial risk assessment, if the true label is $y = \texttt{very high}$ while $\mathcal{C}_\alpha(x) = \{\texttt{low}, \texttt{moderate}\}$, the resulting miscoverage is substantial and could lead to catastrophic risk misestimation. Ideally, when coverage fails, the true label should lie as close as possible to the boundary of the prediction set, minimizing the impact of miscoverage. Under this criterion, a larger set such as $\mathcal{C}_\alpha(x) = \{\texttt{low}, \texttt{moderate}, \texttt{high}\}$ may be preferable to a smaller set $\mathcal{C}_\alpha(x) = \{\texttt{low}, \texttt{moderate}\}$, even though both fail to cover the true label and the larger set is less efficient in the classical CP sense. This observation motivates uncertainty quantification methods for OC that account not only for coverage and efficiency, but also for the ordinal distance incurred under miscoverage.
These challenges motivate a model\-agnostic conformal approach to OC that can be combined with arbitrary loss functions, accommodates both grouped and assessed ordered targets, and produces meaningful contiguous prediction sets that accurately quantify uncertainty\. Moreover, such a method should leverage the entire predictive probability distribution in an unbiased and computationally efficient manner when constructing prediction sets\. These requirements are not met by existing approaches \(cf\. Section[2](https://arxiv.org/html/2606.24959#S2)\)\.
In this paper, we advocate the ranked probability score (RPS) [epstein1969scoring] as a nonconformity measure for conformal prediction in OC. RPS is a proper scoring rule [gneiting2007strictly] for ordinal outcomes that incentivizes truthful probability estimation and explicitly accounts for the linear structure of the label space. Despite being well established in the forecasting literature [murphy1970ranked, Murphy1971ANO], it has only recently been recognized as a theoretically grounded metric for evaluating probabilistic ordinal classifiers in machine learning [DBLP:conf/miccai/Galdran23]. To the best of our knowledge, RPS has not yet been proposed as a nonconformity measure for CP in OC. The main contributions of this paper are as follows:
- We propose RPS as a proper, model-agnostic nonconformity measure for CP in OC.
- We provide theoretical guarantees for desirable properties in OC: RPS-based conformal prediction sets (i) satisfy marginal coverage, (ii) are nested with respect to the miscoverage level $\alpha$, and (iii) are contiguous.
- Furthermore, we show that RPS-based prediction sets directly optimize ordinal risk under oracle conditional coverage, measured as set-based $l_1$ error, in contrast to mode-centered approaches, which primarily target set efficiency.
- We show that RPS-based conformal prediction is computationally efficient, scaling linearly with both the number of labels and the number of calibration points.
- Finally, we empirically validate our approach on ordinal image and tabular datasets, showing that median-centered RPS-based prediction sets strike a favorable balance between interval width and ordinal miscoverage magnitude.
## 2 Related Work
#### Ordinal Classification
addresses the problem of predicting discrete ordered labels, as commonly encountered in many high-stakes domains, including medicine [DBLP:journals/artmed/Dorado-MorenoPG17, prodeau2019ordinal, DBLP:journals/mta/TariqSN25] and finance [DBLP:journals/sma/HirkHV19]. Unlike multinomial classification, where class labels are unordered, OC must account for the inherent order among classes, implying that misclassification costs typically increase with the gap between the predicted label $\hat{y}$ and the true label $y$. At the same time, OC differs from regression in that the labels are discrete rather than continuous, and the underlying measurement scale is ordinal rather than cardinal. Thus, strictly speaking, there is no natural notion of distance. In spite of this, encoding the class labels by integers $1, \ldots, K$ and using distance-based losses such as $|\hat{y} - y|$ is common practice.
Recent work in OC has largely focused on improving predictive performance, often by minimizing distance-based losses such as mean absolute error [DBLP:conf/ai/GaudetteJ09] or quadratic weighted kappa [cohen1968weighted]. Existing approaches can be broadly categorized into (i) *unimodal soft-labeling methods* [DBLP:conf/cvpr/DiazM19, DBLP:journals/ijon/LiuFKDXLY20, DBLP:conf/pkdd/HaasH23, DBLP:journals/pr/VargasGH22, DBLP:journals/isci/VargasDGGH23, DBLP:journals/inffus/VargasGBH23], (ii) *ordinal loss functions* [DBLP:journals/corr/HouYS16, DBLP:journals/prl/TorrePV18, DBLP:conf/coling/CastagnosMD22, albuquerque2022quasi, DBLP:conf/aaai/NachmaniGSSG25, DBLP:journals/eswa/PolatCT25], and (iii) *explicit unimodality constraints* [DBLP:journals/nn/CostaAC08, DBLP:conf/icml/BeckhamP17, DBLP:conf/nips/DeyMK23, DBLP:journals/tai/CardosoCA25].
#### Conformal Prediction
is a framework that can be applied on top of any base model to produce *prediction sets* (or *intervals* in regression) instead of point predictions [DBLP:journals/jmlr/ShaferV08, vovk2005algorithmic, DBLP:conf/icml/VovkGS99]. These sets are guaranteed to contain the true label with a user-specified *marginal coverage probability*. Inductive conformal prediction [DBLP:conf/ecml/PapadopoulosPVG02, papadopoulos2008inductive], also known as split conformal prediction, has become the standard approach in practice due to its computational efficiency. CP has been extensively studied for classification [sadinle2019least, DBLP:conf/nips/RomanoSC20] and regression [DBLP:conf/nips/RomanoPC19], where it provides finite-sample, distribution-free coverage guarantees.
More recently, conformal prediction for ordinal classification has attracted increasing attention. [DBLP:conf/miccai/LuAP22] and [zhang2025provably] construct contiguous prediction sets by expanding outward from the mode of the predictive distribution, performing a greedy search for a threshold that ensures marginal coverage while aiming to keep the sets as small as possible. [DBLP:conf/uai/XuGW23] formulate ordinal conformal prediction within the conformal risk control framework [DBLP:journals/corr/abs-2208-02814], pursuing essentially the same goals. A different approach is taken by [DBLP:conf/nips/DeyMK23], who enforce unimodal predictive distributions, enabling the reuse of existing conformal methods such as least ambiguous set-valued classifiers (LAC) [sadinle2019least] and adaptive prediction sets (APS) [DBLP:conf/nips/RomanoSC20], while guaranteeing contiguity. However, unimodality is a strong bias that is not always warranted and may negatively impact unbiased UQ in OC [DBLP:journals/ijar/HaasH25, DBLP:journals/corr/abs-2507-00733].
In contrast to these approaches, we propose an efficient, model\-agnostic conformal method that leverages the full predictive distribution through a principled proper scoring rule\. This method guarantees median\-centered, contiguous prediction sets without relying on greedy or search\-based procedures, mode\-centered constructions, or unimodality assumptions, while faithfully respecting the ordinal structure of the label space and minimizing ordinal risk under oracle conditional coverage\.
## 3 Method
### 3.1 Problem formulation
Consider a dataset $\mathcal{D} = \{(X_i, Y_i)\}_{i=1}^{n} \subset \mathcal{X} \times \mathcal{Y}$, drawn from an underlying distribution $\mathcal{P}$ over the joint input-output space $\mathcal{X} \times \mathcal{Y}$. We focus on the ordinal classification setting, where the output space $\mathcal{Y} = \{y_1, \ldots, y_K\}$ consists of a finite set of class labels endowed with a natural (linear) order $y_1 \prec y_2 \prec \cdots \prec y_K$.
Let $(X_{n+1}, Y_{n+1})$ denote a test instance such that the augmented sample $\mathcal{D} \cup \{(X_{n+1}, Y_{n+1})\}$ is exchangeable. Assuming that the test label $Y_{n+1}$ is unobserved, the goal of conformal prediction is to construct a prediction set $\mathcal{C}_\alpha(X_{n+1}) \subseteq \mathcal{Y}$ satisfying a marginal coverage guarantee.
###### Definition 3.1 (Marginal coverage).
A conformal prediction procedure satisfies marginal coverage at level $1-\alpha$ if the prediction set $\mathcal{C}(X_{n+1})$ it outputs satisfies

$$\mathbb{P}\left(Y_{n+1} \in \mathcal{C}(X_{n+1})\right) \geq 1-\alpha, \tag{1}$$

where $\alpha \in (0,1)$ is a user-specified error rate and the probability is taken with respect to the joint distribution $\mathcal{P}$ and any randomness in the construction of $\mathcal{C}$.
Distribution\-free CP methods cannot generally guarantee instance\-wise conditional coverageDBLP:journals/ml/Vovk13, a strictly stronger requirement than marginal coverage\.
###### Definition 3.2 (Conditional coverage).
A conformal prediction procedure satisfies conditional coverage at level $1-\alpha$ if, for all $X_{n+1}$, the set $\mathcal{C}(X_{n+1})$ it outputs satisfies

$$\mathbb{P}\left(Y_{n+1} \in \mathcal{C}(X_{n+1}) \mid X_{n+1}\right) \geq 1-\alpha, \tag{2}$$

where $\alpha \in (0,1)$ is a user-specified error rate and the probability is taken with respect to the conditional distribution induced by $\mathcal{P}$.
Conformal prediction constructs prediction sets through a *nonconformity score*, which quantifies how incompatible a candidate label is with a given input relative to a predictive model. Formally, a nonconformity score is a function $s: \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R}$, where larger values indicate greater nonconformity between an input–label pair $(\boldsymbol{x}, y)$ and the model's predictive behavior. In this work, we consider probabilistic predictors $h: \mathcal{X} \rightarrow \mathbb{P}(\mathcal{Y})$, which output a predictive probability vector $\boldsymbol{p} = (p(y_1), \ldots, p(y_K)) = (p_1, \ldots, p_K) \in \mathbb{P}(\mathcal{Y})$, where $p(y_k)$ denotes the predictive probability assigned to class $y_k$. Nonconformity scores are then derived from this predictive distribution to quantify the incompatibility of candidate labels with the model's predictions.
In this work, we adopt the inductive (or split) conformal prediction framework [DBLP:conf/ecml/PapadopoulosPVG02, papadopoulos2008inductive]. The dataset $\mathcal{D}$ is partitioned into a proper training set $\mathcal{D}_{\mathrm{train}}$, used to train a predictive model $h$, and a calibration set $\mathcal{D}_{\mathrm{cal}} = \{(X_i, Y_i)\}_{i=1}^{n}$, used to compute nonconformity scores. Given a test input $X_{n+1}$, each candidate label $y \in \mathcal{Y}$ is scored, and the conformal prediction set is defined as

$$\mathcal{C}_\alpha(X_{n+1}) = \left\{ y \in \mathcal{Y} \;:\; s(X_{n+1}, y) \leq \hat{q}_{1-\alpha} \right\}, \tag{3}$$

where $\hat{q}_{1-\alpha}$ denotes the empirical $(1-\alpha)$-quantile of the nonconformity scores computed on the calibration set. This construction guarantees marginal coverage at level $1-\alpha$ under the exchangeability assumption on $\mathcal{D}_{\mathrm{cal}} \cup \{(X_{n+1}, Y_{n+1})\}$ [vovk2005algorithmic, DBLP:journals/jmlr/ShaferV08]. Naturally, the choice of the nonconformity score plays a central role in conformal prediction, as it determines not only calibration but also how informative the resulting prediction sets are, particularly in structured output spaces such as ordinal classification.
### 3.2 An Ordinal Oracle Method
Assume the true data\-generating distribution𝒫\\mathcal\{P\}is known, which would in principle allow construction of prediction sets satisfying conditional coverage \(Definition[3\.2](https://arxiv.org/html/2606.24959#S3.Thmdefinition2)\)\. In the ordinal classification setting, an additional common objective of conformal prediction is to construct the*smallest contiguous*prediction set achieving conditional coverage, thereby balancing coverage with efficiencyDBLP:conf/miccai/LuAP22,DBLP:conf/uai/XuGW23,zhang2025provably\.
Formally, with contiguous intervals $[l,u] := \{l, l+1, \ldots, u\}$ and interval probabilities $p_{l,u}^{*}(X) := \sum_{j=l}^{u} p^{*}(y_j \mid X)$, the optimal contiguous prediction set for an instance $X$ can be written as

$$\mathcal{C}_\alpha^{*}(X) = \underset{[l,u]\,:\,1 \leq l \leq u \leq K}{\arg\min} \bigl\{\, u - l \;:\; p_{l,u}^{*}(X) \geq 1-\alpha \bigr\}. \tag{4}$$

Minimizing the length of the prediction set balances efficiency with conditional coverage, producing the most efficient contiguous sets possible. Contiguity respects the ordinal structure of $\mathcal{Y}$ and ensures that no gaps exist in the predicted set of labels. This oracle construction is not available in practice, as the true conditional distribution $p^{*}(y \mid \boldsymbol{x})$ is unknown. Notably, this oracle minimizes set length but does not explicitly account for ordinal risk, such as the expected $l_1$ deviation from the true label.
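The oracle in (4) can be sketched as a brute-force search over all intervals. The snippet below is a minimal illustration, not the authors' implementation: the function name `smallest_contiguous_set`, the toy vector `p_star`, and the tie-breaking rule (keep the leftmost shortest interval) are assumptions introduced here, since (4) does not specify how ties are resolved.

```python
def smallest_contiguous_set(p, alpha):
    """Return 1-based (l, u) of a shortest interval with mass >= 1 - alpha.

    p is a list of class probabilities in label order (assumed to sum to 1).
    Ties are broken in favour of the leftmost interval (an assumption).
    """
    K = len(p)
    best = None
    for l in range(K):
        mass = 0.0
        for u in range(l, K):
            mass += p[u]
            # small tolerance guards against floating-point round-off
            if mass >= 1 - alpha - 1e-12:
                if best is None or (u - l) < (best[1] - best[0]):
                    best = (l, u)
                break  # extending u further only makes the interval longer
    return best[0] + 1, best[1] + 1


# Hypothetical "true" conditional distribution over K = 5 ordered classes.
p_star = [0.05, 0.10, 0.50, 0.30, 0.05]
print(smallest_contiguous_set(p_star, alpha=0.25))
```

For this toy distribution the shortest interval reaching mass 0.75 consists of classes 3 and 4. The double loop costs $O(K^2)$ per instance, which is acceptable for a sketch but is not tuned for large label spaces.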
### 3.3 Conformal Ordinal Prediction
To approximate the oracle construction ([4](https://arxiv.org/html/2606.24959#S3.E4)) in practice, existing approaches [DBLP:conf/miccai/LuAP22, zhang2025provably] aim to identify a threshold $\lambda$ through greedy search that satisfies the marginal coverage guarantee (Definition [3.1](https://arxiv.org/html/2606.24959#S3.Thmdefinition1)) while producing efficient contiguous prediction sets of the form

$$\begin{aligned}
\mathcal{C}_\lambda(X) &= \{ y_j \in \mathcal{Y} : l(X;\lambda) \leq j \leq u(X;\lambda) \}, \\
\bigl(l(X;\lambda), u(X;\lambda)\bigr) &= \underset{1 \leq l \leq u \leq K}{\arg\min} \Bigl\{\, u - l \;:\; p_{l,u}(X) \geq \lambda \Bigr\}.
\end{aligned} \tag{5}$$

To ensure marginal coverage, the threshold $\lambda$ is selected using the calibration set $\mathcal{D}_{\mathrm{cal}} = \{(X_i, Y_i)\}_{i=1}^{n}$ as the smallest value satisfying

$$\sum_{i=1}^{n} \mathbb{1}\left\{ Y_i \in \mathcal{C}_\lambda(X_i) \right\} \;\geq\; \left\lceil (1-\alpha)(n+1) \right\rceil. \tag{6}$$
While the procedure guarantees marginal coverage, relying on a single global thresholdλ\\lambdaignores the geometric structure of the ordinal predictive distribution and instead primarily controls set size, i\.e\., efficiency\. As a consequence, the resulting prediction sets may be efficient in terms of cardinality, yet limited in faithfulness with respect to uncertainty quantification and ordinal risk\.
Similarly, the approach ofDBLP:conf/nips/DeyMK23constructs prediction sets by growing them outward from the mode of a unimodal predictive distribution\.
###### Definition 3.3 (Unimodality [keilson1971some]).
A discrete probability distribution $\boldsymbol{p}$ is *unimodal* if there exists at least one index $m$, called the mode, such that

$$p_k \geq p_{k-1} \quad \text{for all } k \leq m, \qquad p_{k+1} \leq p_k \quad \text{for all } k \geq m.$$
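As a small aid to Definition 3.3, the following sketch checks whether a probability vector is unimodal by climbing while the entries are non-decreasing and then descending while they are non-increasing. The helper name `is_unimodal` is hypothetical, and exact comparisons are used, so real model outputs may need a numerical tolerance.

```python
def is_unimodal(p):
    """Check Definition 3.3: non-decreasing up to some mode m, non-increasing after."""
    K = len(p)
    k = 0
    # Climb while the sequence does not decrease.
    while k + 1 < K and p[k + 1] >= p[k]:
        k += 1
    # From the candidate mode, descend while the sequence does not increase.
    while k + 1 < K and p[k + 1] <= p[k]:
        k += 1
    # Unimodal exactly when both passes together consumed the whole vector.
    return k == K - 1


print(is_unimodal([0.1, 0.2, 0.4, 0.2, 0.1]))   # single peak
print(is_unimodal([0.4, 0.1, 0.4, 0.1]))        # two separated peaks
```

Plateaus are accepted by this check, which matches the non-strict inequalities in the definition.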
The assumption of unimodal predictive distributions is a common inductive bias in ordinal classificationDBLP:journals/nn/CostaAC08,DBLP:conf/icml/BeckhamP17\. Restricting the output distributions of a predictor to be unimodal allows one to obtain contiguous prediction sets using standard nominal nonconformity scores, such as LAC or APSDBLP:conf/nips/DeyMK23\. Nonetheless, both methods are driven by probability magnitude rather than ordinal geometry, and thus may insufficiently reflect uncertainty in low\-probability tails of unimodal predictive distributions\.
### 3.4 The Ranked Probability Score (RPS)
To address these limitations, we propose the use of the *Ranked Probability Score (RPS)* [epstein1969scoring] as a nonconformity score for conformal prediction in ordinal classification. RPS is a proper scoring rule [gneiting2007strictly, murphy1969ranked], a notion defined as follows.
###### Definition 3.4 (Proper scoring rule).
A scoring rule $s: \mathcal{Y} \times \mathbb{P}(\mathcal{Y}) \rightarrow \mathbb{R}$ is *proper* if the expected score is minimized when the predicted distribution $\boldsymbol{p}$ equals the true distribution $\boldsymbol{p}^{*}$, that is,

$$\mathbb{E}_{Y \sim \boldsymbol{p}^{*}}\big[ s(Y, \boldsymbol{p}^{*}) \big] \;\leq\; \mathbb{E}_{Y \sim \boldsymbol{p}^{*}}\big[ s(Y, \boldsymbol{p}) \big] \quad \forall\, \boldsymbol{p} \in \mathbb{P}(\mathcal{Y}).$$

It is *strictly proper* if equality holds if and only if $\boldsymbol{p} = \boldsymbol{p}^{*}$.
The RPS can be viewed as the Brier score [glenn1950verification] applied to cumulative predictive probabilities, making it sensitive to the ordinal distances between classes [DBLP:journals/mansci/JoseNW09]. For $K=2$, it reduces exactly to the standard Brier score for binary outcomes.
Let $F_X(k) = \sum_{j=1}^{k} p(y_j \mid X)$ denote the predicted cumulative distribution function (CDF), and let $\mathbb{1}\{Y \leq y_k\}$ be the corresponding cumulative indicator of the true label. The RPS is then defined as

$$\mathrm{RPS}(X, Y) = \frac{1}{K-1} \sum_{k=1}^{K-1} \left( F_X(k) - \mathbb{1}\{Y \leq y_k\} \right)^{2}, \tag{7}$$

where the factor $\frac{1}{K-1}$ normalizes the score to lie in the interval $[0,1]$, making it independent of the number of classes $K$.
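To make (7) concrete, here is a minimal sketch that evaluates the RPS of a predicted probability vector for every candidate label. The function name `rps` and the toy vector `p` are illustrative assumptions; labels are encoded as integers $1, \ldots, K$.

```python
from itertools import accumulate


def rps(p, y):
    """Ranked probability score of probability vector p for label index y (1-based)."""
    K = len(p)
    cdf = list(accumulate(p))                    # F(1), ..., F(K)
    total = 0.0
    for k in range(1, K):                        # k = 1, ..., K-1 as in (7)
        indicator = 1.0 if y <= k else 0.0       # cumulative indicator 1{Y <= y_k}
        total += (cdf[k - 1] - indicator) ** 2
    return total / (K - 1)


p = [0.1, 0.2, 0.4, 0.2, 0.1]                    # toy predictive distribution
scores = [rps(p, y) for y in range(1, len(p) + 1)]
print(scores)  # approximately [0.35, 0.15, 0.05, 0.15, 0.35]
```

In this symmetric example the score is smallest at the median class 3 and grows with ordinal distance from it, so thresholding the scores yields a contiguous set around the median. A point mass on the true class gives a score of 0, and a point mass at the opposite end of the scale gives 1.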
Unlike previous ordinal conformal methods, RPS leverages the*entire predictive distribution*rather than mode\-centric summaries, yielding prediction sets that better adapt to input\-dependent uncertainty\. By construction, it provides a natural measure of nonconformity for ordinal candidate labels with respect to the full predictive distribution: it attains its maximum when the predicted mass is concentrated at the opposite end of the ordinal scale from the true label, and its minimum when the predicted mass is concentrated on the true class\. As a strictly proper scoring rule for ordinal outcomesmurphy1969ranked, RPS is theoretically grounded and encourages calibrated probability forecasts\. Its properness has recently attracted renewed attention for evaluating probabilistic ordinal classifiersDBLP:conf/miccai/Galdran23, despite its long\-standing popularity in forecastingmurphy1970ranked,Murphy1971ANO\.
### 3.5 Validity of RPS for Ordinal CP
In this section, we establish the validity of the RPS as a nonconformity measure within the conformal prediction framework. Let $s_{\mathrm{RPS}}(X, Y) = \mathrm{RPS}(X, Y)$ denote the nonconformity score derived from the RPS.
Algorithm 1: Prediction Sets via RPS

1. Input: $\mathcal{D}_{\mathrm{cal}} = \{(X_i, Y_i)\}_{i=1}^{n}$, significance level $\alpha$, new instance $X_{n+1}$
2. Output: Prediction set $\mathcal{C}_\alpha(X_{n+1})$
3. $s_i \leftarrow s_{\mathrm{RPS}}(X_i, Y_i), \quad i = 1, \dots, n$
4. $\hat{q}_{1-\alpha} \leftarrow \mathrm{quantile}_{(1-\alpha)(1+\frac{1}{n})}\big(\{s_i\}_{i=1}^{n}\big)$
5. $\mathcal{C}_\alpha(X_{n+1}) \leftarrow \{ y \in \mathcal{Y} : s_{\mathrm{RPS}}(X_{n+1}, y) \leq \hat{q}_{1-\alpha} \}$
6. return $\mathcal{C}_\alpha(X_{n+1})$
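Algorithm 1 could be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the authors' code: the names `rps` and `rps_prediction_set` are hypothetical, predicted probability vectors are assumed to be precomputed by some model, and the quantile in step 4 is taken as the $\lceil (n+1)(1-\alpha) \rceil$-th smallest calibration score, a common finite-sample convention that we assume corresponds to the level $(1-\alpha)(1+\frac{1}{n})$ above.

```python
import math
from itertools import accumulate


def rps(p, y):
    """RPS of probability vector p for 1-based label index y, as in (7)."""
    K = len(p)
    cdf = list(accumulate(p))
    return sum((cdf[k - 1] - (1.0 if y <= k else 0.0)) ** 2
               for k in range(1, K)) / (K - 1)


def rps_prediction_set(cal_probs, cal_labels, test_probs, alpha):
    """Split-conformal prediction set using RPS as the nonconformity score."""
    K = len(test_probs)
    # Step 3: calibration scores s_i = RPS(X_i, Y_i).
    scores = sorted(rps(p, y) for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    # Step 4: conformal quantile (assumed convention, see lead-in).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: fall back to all labels.
        return list(range(1, K + 1))
    q_hat = scores[rank - 1]
    # Step 5: keep every candidate label whose score does not exceed q_hat.
    return [y for y in range(1, K + 1) if rps(test_probs, y) <= q_hat]


# Toy calibration data: nine identical predictions with varying true labels.
p = [0.1, 0.2, 0.4, 0.2, 0.1]
cal_probs = [p] * 9
cal_labels = [3, 3, 3, 3, 3, 2, 4, 2, 1]
print(rps_prediction_set(cal_probs, cal_labels, p, alpha=0.2))
```

With this toy data the threshold lands at the score of the classes adjacent to the median, so the returned set is the contiguous interval of classes 2 to 4. Sorting is used for clarity; a selection algorithm would avoid the extra logarithmic factor.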
A key requirement for using RPS in conformal prediction is that the resulting sets satisfy marginal coverage \(Definition[3\.1](https://arxiv.org/html/2606.24959#S3.Thmdefinition1)\)\.
###### Proposition 3.1 (Marginal coverage guarantee of RPS-based sets).
Under the exchangeability assumption on $\mathcal{D}_{\mathrm{cal}} \cup \{(X_{n+1}, Y_{n+1})\}$, the RPS-based conformal procedure satisfies

$$\mathbb{P}\bigl( Y_{n+1} \in \mathcal{C}(X_{n+1}) \bigr) \geq 1-\alpha,$$

regardless of the underlying predictive model.
###### Proof Sketch\.
This result follows directly from the general theory of inductive conformal predictionDBLP:journals/jmlr/ShaferV08,vovk2005algorithmic\. SincesRPSs\_\{\\mathrm\{RPS\}\}is a real\-valued nonconformity score and the calibration scores are exchangeable with the test score, the standard quantile\-based construction \([3](https://arxiv.org/html/2606.24959#S3.E3)\) guarantees marginal coveragevovk2005algorithmic,DBLP:journals/jmlr/ShaferV08\. ∎
Another desirable property for ordinal prediction sets is *nesting* with respect to $\alpha$ [DBLP:conf/miccai/LuAP22, DBLP:conf/uai/XuGW23, zhang2025provably], e.g., $\mathcal{C}_{\alpha_2}(X) = \{\texttt{baby}, \texttt{child}\} \subseteq \mathcal{C}_{\alpha_1}(X) = \{\texttt{baby}, \texttt{child}, \texttt{teenager}\}$ for $0 < \alpha_1 \leq \alpha_2 < 1$. While RPS sets are always nested regardless of the predictive distribution, mode-based sets ([5](https://arxiv.org/html/2606.24959#S3.E5)) are nested only when the predictive density is radially monotone, i.e., when it decreases monotonically from the mode along every direction [zhang2025provably].
###### Proposition 3.2 (Nestedness of RPS-based sets in $\alpha$).
For any input $X$ and any $0 < \alpha_1 \leq \alpha_2 < 1$, the RPS-based conformal prediction sets satisfy $\mathcal{C}_{\alpha_2}(X) \subseteq \mathcal{C}_{\alpha_1}(X)$.
###### Proof.
If $\alpha_1 \leq \alpha_2$, then $1-\alpha_1 \geq 1-\alpha_2$, which implies $\hat{q}_{1-\alpha_1} \geq \hat{q}_{1-\alpha_2}$ because empirical quantiles are non-decreasing in their level. Hence, any label $y$ such that $s_{\mathrm{RPS}}(X, y) \leq \hat{q}_{1-\alpha_2}$ also satisfies $s_{\mathrm{RPS}}(X, y) \leq \hat{q}_{1-\alpha_1}$, and therefore $\mathcal{C}_{\alpha_2}(X) \subseteq \mathcal{C}_{\alpha_1}(X)$. ∎
Furthermore, ordinal prediction sets must be contiguous along the label ordering, consistent with the oracle definition in \([4](https://arxiv.org/html/2606.24959#S3.E4)\)DBLP:conf/uai/XuGW23,DBLP:conf/nips/DeyMK23\.
###### Theorem 3.1 (Contiguity of RPS-based sets).
Let $\mathcal{Y} = \{y_1 \prec y_2 \prec \dots \prec y_K\}$ be a set of ordered labels, and let $s_{\mathrm{RPS}}$ denote the RPS-based nonconformity score. Then, for any input $X$ and any miscoverage level $\alpha \in (0,1)$, the conformal prediction set $\mathcal{C}^{\mathrm{RPS}}_\alpha(X) = \{ y \in \mathcal{Y} : s_{\mathrm{RPS}}(X, y) \leq \hat{q}_{1-\alpha} \}$ forms a contiguous interval of labels centered at the median $m$, i.e., there exist integers $1 \leq l \leq m \leq u \leq K$ such that $\mathcal{C}_\alpha(X) = \{ y_l, \dots, y_m, \dots, y_u \}$.
Unlike existing nonconformity scores for nominal classification such as LAC [sadinle2019least] or APS [DBLP:conf/nips/RomanoSC20], RPS-based prediction sets are contiguous by design. This contiguity follows directly from the ordinal structure and the monotonicity of the cumulative distribution function, and does not require imposing unimodality assumptions on the predictive probability mass function [DBLP:conf/nips/DeyMK23], which may be unwarranted for ordinal targets. A formal proof is provided in Appendix [A](https://arxiv.org/html/2606.24959#A1). Together with marginal validity and nesting in $\alpha$, these properties ensure that RPS-based conformal prediction produces prediction sets well suited for ordinal tasks.
Moreover, unlike *mode*-centered approaches, RPS-based prediction sets are *median*-centered (see Appendix [A](https://arxiv.org/html/2606.24959#A1)), thereby balancing the cumulative probability mass above and below the center. This property promotes robustness under skewed or heavy-tailed distributions: sublevel sets expand by minimizing the imbalance between the lower and upper predictive tails, rather than expanding outwards from a single high-probability mode. While this characteristic does not guarantee global risk optimality for the full conformal procedure, defined in terms of the set-based $l_1$ error ([8](https://arxiv.org/html/2606.24959#S3.E8)), it does imply instance-level risk optimality at the level of the nonconformity scores. Furthermore, this instance-level optimality extends to full risk optimality in a conditional coverage oracle setting, where the true conditional distribution is known, so that pointwise risk minimization aggregates directly into prediction sets that globally minimize the expected $l_1$ error.
###### Theorem 3\.2\(Ordinal risk optimality of RPS\-based median\-grown prediction sets under oracle conditional coverage\)\.
For a fixed input $X$, let $\mathcal{C}_{\mathrm{RPS}}(X)=\{y_l,\dots,y_m,\dots,y_u\}$ denote the contiguous RPS-based prediction set, which is grown from a median index $m$ as in Theorem [3.1](https://arxiv.org/html/2606.24959#S3.Thmtheorem1). Define the *ordinal risk* of a set $\mathcal{C}(X)$ as the expected $l_1$-distance of the true label from this set:

$$R(\mathcal{C}(X)) := \sum_{y=1}^{K} p(y \mid X)\,\min_{c \in \mathcal{C}(X)} |y-c|. \tag{8}$$

Let $\mathcal{C}(X)$ be any other *contiguous* conformal set of minimal cardinality satisfying the same coverage constraint as $\mathcal{C}_{\mathrm{RPS}}(X)$. Then

$$R(\mathcal{C}_{\mathrm{RPS}}(X)) \leq R(\mathcal{C}(X)). \tag{9}$$
A formal proof is provided in Appendix [B](https://arxiv.org/html/2606.24959#A2). RPS-based sets directly target ordinal risk reduction under oracle conditional coverage by sequentially adding adjacent labels that minimize risk, starting from the singleton risk-minimizing set containing only the median. This contrasts with mode-centered procedures, which prioritize set-size efficiency and may overlook the ordinal structure of the labels (Section [3.2](https://arxiv.org/html/2606.24959#S3.SS2)).
As illustrative examples, consider the unimodal distribution $\boldsymbol{p}_{\mathrm{umod}}=(0.06,0.24,0.32,0.20,0.18)$ and the multimodal distribution $\boldsymbol{p}_{\mathrm{multimod}}=(0.09,0.12,0.40,0.04,0.35)$, both with median $m=3$, coinciding with the mode. Tables [1](https://arxiv.org/html/2606.24959#S3.T1) and [2](https://arxiv.org/html/2606.24959#S3.T2) compare step-wise set expansions based on RPS with greedy mode-based expansion (e.g., min-CPS). The greedy strategy expands from the mode by iteratively adding the class with the largest remaining probability, thereby maximizing local probability mass for a given set size. In contrast, the RPS-based expansion accounts for cumulative distance-weighted deviations and directly minimizes ordinal risk ([8](https://arxiv.org/html/2606.24959#S3.E8)). While the differences are moderate in the unimodal case, they become substantial for the multimodal distribution, where heavy tail mass causes the greedy procedure to underestimate ordinal risk.
Table 1: Step-wise comparison of ordinal risk for median RPS-based versus greedy mode-based set expansions for an exemplary unimodal distribution $\boldsymbol{p}_{\mathrm{umod}}=(0.06,0.24,0.32,0.20,0.18)$. Lower ordinal risk ([8](https://arxiv.org/html/2606.24959#S3.E8)) is highlighted.

Table 2: Step-wise comparison of ordinal risk for median RPS-based versus greedy mode-based set expansions for a multimodal distribution $\boldsymbol{p}_{\mathrm{multimod}}=(0.09,0.12,0.40,0.04,0.35)$. Lower ordinal risk ([8](https://arxiv.org/html/2606.24959#S3.E8)) is highlighted.
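The step-wise comparison described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, and the greedy variant is assumed to add, at each step, the *adjacent* label with the larger probability so that the set stays contiguous (the exact tie-breaking and search strategy of min-CPS may differ).

```python
def ordinal_risk(p, lo, hi):
    """Expected l1 distance from the true label to the interval {lo, ..., hi} (labels are 1-based)."""
    risk = 0.0
    for y, py in enumerate(p, start=1):
        if y < lo:
            risk += py * (lo - y)
        elif y > hi:
            risk += py * (y - hi)
    return risk


def rps_scores(p):
    """RPS nonconformity score of every candidate label, computed from the definition."""
    K = len(p)
    F, total = [], 0.0
    for pk in p:
        total += pk
        F.append(total)
    return [
        sum((F[k - 1] - (1.0 if k >= ell else 0.0)) ** 2 for k in range(1, K)) / (K - 1)
        for ell in range(1, K + 1)
    ]


def rps_expansion(p):
    """Intervals obtained by admitting labels in order of increasing RPS score."""
    scores = rps_scores(p)
    order = sorted(range(1, len(p) + 1), key=lambda ell: scores[ell - 1])
    return [(min(order[:s]), max(order[:s])) for s in range(1, len(p) + 1)]


def greedy_mode_expansion(p):
    """Assumed greedy variant: start at the mode, add the adjacent label with larger probability."""
    K = len(p)
    lo = hi = max(range(1, K + 1), key=lambda ell: p[ell - 1])
    steps = [(lo, hi)]
    while hi - lo + 1 < K:
        left = p[lo - 2] if lo > 1 else -1.0   # probability of label lo - 1
        right = p[hi] if hi < K else -1.0      # probability of label hi + 1
        if left >= right:
            lo -= 1
        else:
            hi += 1
        steps.append((lo, hi))
    return steps


p_multimod = (0.09, 0.12, 0.40, 0.04, 0.35)
for (a, b), (c, d) in zip(rps_expansion(p_multimod), greedy_mode_expansion(p_multimod)):
    print((a, b), round(ordinal_risk(p_multimod, a, b), 2),
          (c, d), round(ordinal_risk(p_multimod, c, d), 2))
```

Under these assumptions, the RPS-based expansion for the multimodal example grows toward the heavy upper tail (label 5), whereas the adjacent-greedy expansion grows toward labels 2 and 1 and leaves that tail uncovered for longer.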
### 3\.6Computational complexity of RPS
During conformal calibration, we only need to compute the RPS score $s_{\mathrm{RPS}}(X_i,Y_i)$ for the true label $Y_i$ of each calibration point $X_i$. This requires a single RPS evaluation per point, each taking $\mathcal{O}(K)$ time to compute the cumulative distribution $F_X(k)$ and the score. Hence, the total cost for the calibration dataset $\mathcal{D}_{\mathrm{cal}}$ is $\mathcal{O}(nK)$.

During conformal inference, computing a prediction set for a new input requires evaluating $s_{\mathrm{RPS}}(X,y_\ell)$ for all $K$ candidate labels $y_\ell$. Naively, this requires $\mathcal{O}(K^2)$ time. However, after computing $s_{\mathrm{RPS}}(X,y_1)$ once, we can exploit the exact recurrence

$$s_{\mathrm{RPS}}(X,y_{\ell+1}) = s_{\mathrm{RPS}}(X,y_\ell) + \frac{2F(\ell)-1}{K-1},$$

to compute all $K$ scores in linear time $\mathcal{O}(K)$ (see Appendix [A](https://arxiv.org/html/2606.24959#A1)). Specifically, the initial score $s_{\mathrm{RPS}}(X,y_1)$ is computed in $\mathcal{O}(K)$ time, and each subsequent score is obtained with a constant-time update, i.e., $\mathcal{O}(1)$ per label.
Overall, RPS-based conformal prediction is highly efficient: linear in the number of calibration points $n$ and the number of labels $K$. By contrast, prior methods [DBLP:conf/miccai/LuAP22, DBLP:conf/uai/XuGW23, zhang2025provably] typically require an additional multiplicative cost of $\mathcal{O}(\log(1/\epsilon))$ to identify an $\epsilon$-optimal contiguous probability mass threshold $\lambda$ in ([5](https://arxiv.org/html/2606.24959#S3.E5)) via binary search to ensure marginal coverage.
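The linear-time inference step can be illustrated with the following minimal sketch. The function names are hypothetical, and `conformal_quantile` assumes the usual split-conformal convention of taking the $\lceil (n+1)(1-\alpha) \rceil$-th smallest calibration score, which may differ in detail from the quantile definition used elsewhere in the paper.

```python
import math


def rps_scores_linear(p):
    """All K RPS scores in O(K): one direct evaluation for y_1, then the recurrence."""
    K = len(p)
    F, total = [], 0.0
    for pk in p:
        total += pk
        F.append(total)                      # F[k - 1] stores F(k)
    # s(X, y_1): the indicator 1{k >= 1} equals 1 for every k = 1, ..., K - 1
    s = sum((F[k] - 1.0) ** 2 for k in range(K - 1)) / (K - 1)
    scores = [s]
    for ell in range(1, K):                  # step from y_ell to y_{ell + 1}
        s += (2.0 * F[ell - 1] - 1.0) / (K - 1)
        scores.append(s)
    return scores


def conformal_quantile(cal_scores, alpha):
    """Assumed split-conformal quantile of the calibration scores."""
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")                  # too few calibration points: include every label
    return sorted(cal_scores)[rank - 1]


def prediction_set(p, q_hat):
    """Labels (1-based) whose RPS score does not exceed the calibrated threshold."""
    return [ell for ell, s in enumerate(rps_scores_linear(p), start=1) if s <= q_hat]


p_umod = (0.06, 0.24, 0.32, 0.20, 0.18)
print(rps_scores_linear(p_umod))
print(prediction_set(p_umod, q_hat=0.17))
```

Calibration then amounts to evaluating the score of the true label for each calibration point and passing the resulting list to `conformal_quantile`.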
## 4Experiments
To evaluate the quality of RPS-based prediction sets, we conduct experiments on several ordinal image and tabular datasets, comparing against established ordinal conformal baselines. All experiments use neural networks as the underlying model class. The source code of the following experiments is publicly available at [https://github.com/stefanahaas41/rps-ordinal-conformal-prediction](https://github.com/stefanahaas41/rps-ordinal-conformal-prediction).
### 4\.1Baseline Nonconformity Scores
As nominal baseline nonconformity measures, we include the least ambiguous set-valued classifier (LAC) score [sadinle2019least] and the adaptive prediction set (APS) score [DBLP:conf/nips/RomanoSC20] (see Appendix [C](https://arxiv.org/html/2606.24959#A3) for details). Both LAC and APS treat class labels as unstructured, providing natural baselines for assessing the benefits of incorporating ordinal structure. As ordinal conformal prediction baselines, we consider conformal prediction sets for ordinal classification (COPOC) [DBLP:conf/nips/DeyMK23] combined with LAC (COPOCL) and APS (COPOCA), as well as the min-CPS approach [zhang2025provably]. The latter is a greedy search-based algorithm that produces small contiguous mode-centered prediction sets and improves upon OrdinalAPS [DBLP:conf/miccai/LuAP22] in both computational efficiency and empirical performance. As a naive ordinal baseline method, we also include the ordinal CDF (OCDF) [DBLP:conf/miccai/LuAP22], which constructs prediction intervals from cumulative probabilities and is therefore an interesting baseline for RPS.
### 4\.2Performance Metrics
To evaluate the performance of the different conformal methods, we compute the empirical coverage (COV), as well as the average prediction set size (PS). Additionally, we include the mean interval width (MW) (see Appendix [C](https://arxiv.org/html/2606.24959#A3) for details). Another important aspect is the contiguity of the produced prediction sets, which we measure following [DBLP:conf/nips/DeyMK23] via the contiguity violation (CV) metric, where 0 indicates no contiguity violations on $\mathcal{D}_{\mathrm{test}}$ and 1 corresponds to maximal contiguity violation.
Since a central objective in ordinal classification is to minimize error distances, we prioritize ordinal-specific metrics. First, with $M=\{i: Y_i \notin \mathcal{C}(X_i)\}$ and

$$d(Y_i,\mathcal{C}(X_i)) = \begin{cases} l_i - Y_i & \text{if } Y_i < l_i \\ Y_i - u_i & \text{if } Y_i > u_i \end{cases}$$

for a contiguous interval $\mathcal{C}(X_i)=[l_i,u_i]$, the mean absolute miscoverage magnitude (MAMM) is defined as

$$\mathrm{MAMM} := \frac{1}{|M|}\sum_{i \in M} d(Y_i,\mathcal{C}(X_i)).$$

The worst-case absolute miscoverage magnitude (WAMM) is

$$\mathrm{WAMM} := \max_{i \in M} d(Y_i,\mathcal{C}(X_i)).$$
These metrics quantify ordinal risk under miscoverage, measuring how far the true label lies from the prediction set rather than merely whether it is excluded\.
Another highly relevant metric, which evaluates the trade-off between interval efficiency and distance-based error, is the average interval score loss (AISL) [gneiting2007strictly], recently applied in conformal prediction for regression [DBLP:conf/uai/CabezasSRI25]:

$$\mathrm{AISL} := \frac{1}{|\mathcal{D}_{\mathrm{test}}|}\sum_{i \in \mathcal{D}_{\mathrm{test}}}\Bigg[(u_i-l_i)+\frac{2}{\alpha}(l_i-Y_i)\,\mathds{1}\{Y_i<l_i\}+\frac{2}{\alpha}(Y_i-u_i)\,\mathds{1}\{Y_i>u_i\}\Bigg].$$

AISL simultaneously accounts for interval width and miscoverage magnitude: the first term $(u_i-l_i)$ penalizes wide intervals, encouraging efficiency, while the second and third terms penalize distance-based errors outside the interval. The penalties are scaled by $2/\alpha$, so stricter coverage requirements (smaller $\alpha$) amplify the cost of miscoverage. By combining these factors, AISL provides an interpretable metric capturing both compactness and ordinal risk, making it particularly suitable for evaluating conformal prediction methods in ordinal settings.
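The three ordinal metrics can be computed directly from integer-coded labels and interval endpoints, as in the following sketch. The function names are hypothetical, and returning zero for MAMM and WAMM when no test point is miscovered is an assumed convention, since both quantities are undefined for an empty set $M$.

```python
def miscoverage_distance(y, lo, hi):
    """Distance d(Y, C) from label y to the interval [lo, hi]; zero if y is covered."""
    if y < lo:
        return lo - y
    if y > hi:
        return y - hi
    return 0


def mamm_wamm(labels, intervals):
    """Mean and worst-case miscoverage magnitude over the miscovered points only."""
    missed = [
        miscoverage_distance(y, lo, hi)
        for y, (lo, hi) in zip(labels, intervals)
        if y < lo or y > hi
    ]
    if not missed:
        return 0.0, 0          # assumed convention when every label is covered
    return sum(missed) / len(missed), max(missed)


def aisl(labels, intervals, alpha):
    """Average interval score loss: width plus (2 / alpha)-scaled miscoverage distance."""
    total = 0.0
    for y, (lo, hi) in zip(labels, intervals):
        total += (hi - lo) + (2.0 / alpha) * miscoverage_distance(y, lo, hi)
    return total / len(labels)


labels = [1, 3, 5, 2]
intervals = [(3, 4), (2, 4), (3, 4), (2, 2)]
print(mamm_wamm(labels, intervals))
print(aisl(labels, intervals, alpha=0.1))
```

In this toy example, two of the four labels are missed by distances 2 and 1, so the miscoverage penalty dominates the AISL value at $\alpha = 0.1$.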
### 4\.3Experiments on Ordinal Datasets
Figure 2: Comparison of prediction sets at $\alpha=\{0.01,0.02,0.03,0.05,0.08,0.1,0.13,0.15,0.18,0.2\}$ across methods and datasets using the MAMM metric. Shaded regions indicate standard deviation over 50 trials.

Figure 3: Comparison of prediction sets at $\alpha=\{0.01,0.02,0.03,0.05,0.08,0.1,0.13,0.15,0.18,0.2\}$ across methods and datasets using the WAMM metric. Shaded regions indicate standard deviation over 50 trials.

Figure 4: Comparison of prediction sets at $\alpha=\{0.01,0.02,0.03,0.05,0.08,0.1,0.13,0.15,0.18,0.2\}$ across methods and datasets using the AISL metric. Shaded regions indicate standard deviation over 50 trials.

We evaluate RPS-based prediction sets alongside several baseline methods on two medical image datasets (BACH [aresta2019bach] and RetinaMNIST [medmnistv2]) and an age-estimation dataset (FGNet [lanitis2002toward, panis2016overview]). Additionally, we include multiple ordinal tabular benchmark datasets [ayllon2025toc].
Due to space constraints, we focus on metrics that reflect ordinal miscoverage directly, consistent with our argument that evaluation in COP should account for the severity of missed predictions rather than treating all miscoverage events equally\. Specifically, MAMM and WAMM quantify miscoverage magnitude, while AISL jointly captures prediction set size and ordinal miscoverage in a single score\. LAC and APS violate the contiguity requirement for ordinal prediction sets and are therefore excluded from metrics that assume contiguous sets, such as MAMM, WAMM, and AISL\. Additional experimental details and results over all datasets and metrics are provided in Appendix[C](https://arxiv.org/html/2606.24959#A3)and[D](https://arxiv.org/html/2606.24959#A4)\.
While RPS-based sets do not always achieve the highest prediction set efficiency, with min-CPS [zhang2025provably] showing particularly strong efficiency on this metric (see Appendix [C](https://arxiv.org/html/2606.24959#A3) and [D](https://arxiv.org/html/2606.24959#A4)), they tend to achieve lower ordinal miscoverage, as reflected in MAMM (Figure [2](https://arxiv.org/html/2606.24959#S4.F2)) and WAMM (Figure [3](https://arxiv.org/html/2606.24959#S4.F3)). In contrast, mode-centered methods tend to underestimate ordinal risk relative to RPS-based sets. These results provide empirical support for our theoretical claim that RPS-based sets optimize for ordinal risk rather than set size (Theorem [3.2](https://arxiv.org/html/2606.24959#S3.Thmtheorem2)).
Furthermore, when considering the trade\-off between efficiency and miscoverage magnitude via AISL \(Figure[4](https://arxiv.org/html/2606.24959#S4.F4)\), RPS\-based sets achieve a favorable balance, demonstrating their practical advantage for real\-world ordinal prediction tasks\.
## 5Conclusion & Discussion
We have demonstrated that ranked probability score (RPS)-based conformal sets are, by construction, contiguous and median-centered, providing robust prediction sets for ordinal classification that minimize ordinal risk, defined as the set-based $l_1$ error, under oracle conditional coverage. These sets effectively balance efficiency and error reduction, a critical consideration in high-stakes applications. Specifically, RPS-based sets achieve a favorable trade-off between interval width and miscoverage magnitude while guaranteeing marginal coverage. They are highly competitive with existing ordinal conformal prediction methods, do not depend on specific model architectures, and can be applied to any underlying predictive model. Importantly, RPS-based sets produce meaningful contiguous intervals regardless of the data distribution, making them a practical and versatile solution for ordinal prediction tasks.
Building on this RPS-based framework for COP, a natural direction for future work is to incorporate a delineation of uncertainty into epistemic and aleatoric components [hullermeier2021aleatoric]. This line of research has recently attracted attention in both OC [DBLP:journals/corr/abs-2507-00733] and CP [sale2025aleatoric, javanmardi2025optimal].
###### Acknowledgements\.
Alireza Javanmardi gratefully acknowledges funding by the Klaus Tschira Stiftung \(project 00\.019\.2024\)\.
## References
## Appendix AProof of Theorem[3\.1](https://arxiv.org/html/2606.24959#S3.Thmtheorem1)
###### Theorem\(Contiguity of RPS\-based sets\)\.
Let $\mathcal{Y}=\{y_1 \prec y_2 \prec \dots \prec y_K\}$ be a set of ordered labels, and let $s_{\mathrm{RPS}}$ denote the RPS-based nonconformity score. Then, for any input $X$ and any miscoverage level $\alpha \in (0,1)$, the conformal prediction set $\mathcal{C}^{\mathrm{RPS}}_{\alpha}(X)=\{y \in \mathcal{Y}: s_{\mathrm{RPS}}(X,y) \leq \hat{q}_{1-\alpha}\}$ forms a contiguous interval of labels centered at the median $m$, i.e., there exist integers $1 \leq l \leq m \leq u \leq K$ such that $\mathcal{C}_{\alpha}(X)=\{y_l,\dots,y_m,\dots,y_u\}$.
###### Proof\.
Fix an input $X$. For a candidate label $y_\ell$, the RPS nonconformity score is

$$s_{\mathrm{RPS}}(X,y_\ell)=\frac{1}{K-1}\sum_{k=1}^{K-1}\bigl(F_X(k)-\mathds{1}\{k \geq \ell\}\bigr)^{2},$$

where $k$ indexes the cumulative sums $F_X(k)$ and $\ell$ indexes the candidate labels.

(1) Difference between consecutive labels. Consider

$$\Delta_\ell := s_{\mathrm{RPS}}(X,y_{\ell+1})-s_{\mathrm{RPS}}(X,y_\ell).$$

The step function $\mathds{1}\{k \geq \ell\}$ changes only at index $k=\ell$ when moving from $y_\ell$ to $y_{\ell+1}$ (it equals 1 for $y_\ell$ and 0 for $y_{\ell+1}$); for all other $k$, the indicator remains the same. Therefore, the difference $\Delta_\ell$ between two adjacent labels ($y_\ell$ and $y_{\ell+1}$) reduces to

$$\Delta_\ell=\frac{1}{K-1}\left[(F(\ell)-0)^{2}-(F(\ell)-1)^{2}\right]=\frac{2F(\ell)-1}{K-1}.$$
(2) Single minimum. Since the cumulative distribution $F(\ell)$ is non-decreasing in $\ell$ and satisfies $0 \leq F(\ell) \leq 1$, the consecutive differences

$$\Delta_\ell := s_{\mathrm{RPS}}(X,y_{\ell+1})-s_{\mathrm{RPS}}(X,y_\ell)=\frac{2F(\ell)-1}{K-1}$$

form a non-decreasing sequence in $\ell$. Hence, there exists an index

$$m:=\min\{\ell: F(\ell) \geq 1/2\},$$

corresponding to a (discrete) median of the predictive distribution, such that

$$\Delta_\ell \leq 0 \quad \text{for } \ell<m, \qquad \Delta_\ell \geq 0 \quad \text{for } \ell \geq m.$$

It follows that the sequence $\ell \mapsto s_{\mathrm{RPS}}(X,y_\ell)$, which satisfies the recurrence

$$s_{\mathrm{RPS}}(X,y_{\ell+1})=s_{\mathrm{RPS}}(X,y_\ell)+\frac{2F(\ell)-1}{K-1},$$

is non-increasing for $\ell<m$ and non-decreasing for $\ell \geq m$. Therefore, the RPS score attains a single minimum at the median index $m$.

Consequently, starting from the minimum at $m$, extending the candidate set to the right ($\ell \geq m$) increases the RPS score by increments $(2F(\ell)-1)/(K-1) \geq 0$, while extending it to the left ($\ell<m$) increases the score by increments $(1-2F(\ell))/(K-1) \geq 0$.

Thus, the RPS score is V-shaped around the median of the predictive distribution, being non-increasing to the left of the median and non-decreasing to the right.
(3) Contiguity of conformal sets. Since the mapping $\ell \mapsto s_{\mathrm{RPS}}(X,y_\ell)$ is V-shaped, its sublevel set

$$\mathcal{C}^{\mathrm{RPS}}_{\alpha}(X)=\{y_\ell: s_{\mathrm{RPS}}(X,y_\ell) \leq \hat{q}_{1-\alpha}\}$$

forms a contiguous interval along the ordinal axis, expanding from the median index $m$ toward the tails. Hence, the RPS-based conformal prediction set is contiguous.
∎
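As a numerical sanity check of the argument above (not a substitute for the proof), the following sketch draws random predictive distributions and verifies that every non-empty RPS sublevel set is a contiguous interval containing the median index $m$. All names are hypothetical, and the check ignores exact floating-point ties, which are assumed not to occur for randomly drawn probabilities.

```python
import random


def rps_scores(p):
    """RPS nonconformity score of every candidate label, computed from the definition."""
    K = len(p)
    F, total = [], 0.0
    for pk in p:
        total += pk
        F.append(total)
    return [
        sum((F[k - 1] - (1.0 if k >= ell else 0.0)) ** 2 for k in range(1, K)) / (K - 1)
        for ell in range(1, K + 1)
    ]


def median_index(p):
    """Smallest 1-based index ell with F(ell) >= 1/2."""
    total = 0.0
    for ell, pk in enumerate(p, start=1):
        total += pk
        if total >= 0.5:
            return ell
    return len(p)


def sublevel_sets_are_median_intervals(p):
    """Check contiguity and median membership for every threshold that yields a non-empty set."""
    scores = rps_scores(p)
    m = median_index(p)
    for q_hat in scores:
        labels = [ell for ell, s in enumerate(scores, start=1) if s <= q_hat]
        contiguous = labels == list(range(labels[0], labels[-1] + 1))
        if not (contiguous and m in labels):
            return False
    return True


def check_random(K=7, trials=1000, seed=0):
    """Repeat the check on randomly generated probability vectors of length K."""
    rng = random.Random(seed)
    for _ in range(trials):
        w = [rng.random() for _ in range(K)]
        total = sum(w)
        if not sublevel_sets_are_median_intervals([x / total for x in w]):
            return False
    return True


print(check_random())
```

Using each attained score as a threshold is sufficient here, because the sublevel set can only change at those values.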
## Appendix BProof of Theorem[3\.2](https://arxiv.org/html/2606.24959#S3.Thmtheorem2)
###### Theorem\(Ordinal risk optimality of RPS\-based median\-grown prediction sets under oracle conditional coverage\)\.
For a fixed input $X$, let $\mathcal{C}_{\mathrm{RPS}}(X)=\{y_l,\dots,y_m,\dots,y_u\}$ denote the contiguous RPS-based prediction set, which is grown from a median index $m$ as in Theorem [3.1](https://arxiv.org/html/2606.24959#S3.Thmtheorem1). Define the *ordinal risk* of a set $\mathcal{C}(X)$ as the expected $l_1$-distance of the true label from this set:

$$R(\mathcal{C}(X)) := \sum_{y=1}^{K} p(y \mid X)\,\min_{c \in \mathcal{C}(X)} |y-c|. \tag{10}$$

Let $\mathcal{C}(X)$ be any other *contiguous* conformal set of minimal cardinality satisfying the same coverage constraint as $\mathcal{C}_{\mathrm{RPS}}(X)$. Then

$$R(\mathcal{C}_{\mathrm{RPS}}(X)) \leq R(\mathcal{C}(X)). \tag{11}$$
###### Proof\.
(1) Singleton case. For a singleton set $\{c\}$, the ordinal risk is

$$R(\{c\})=\sum_{y=1}^{K} p(y \mid X)\,|y-c|.$$

It is well known that this expected absolute deviation is minimized at a (discrete) median $m$ of $p(\cdot \mid X)$. Hence, starting with $\{m\}$, as done by RPS-based sets (see Theorem [3.1](https://arxiv.org/html/2606.24959#S3.Thmtheorem1)), is optimal among all sets of size 1. Any discrete median suffices.

(2) Adding adjacent labels. We initialize the contiguous set $\mathcal{C}$ at the singleton median of the predictive distribution,

$$m:=\min\{\ell: F(\ell) \geq 1/2\}, \qquad \mathcal{C}=\{m\},$$

and then expand it by adding adjacent labels toward the side with larger tail probability.

Let $\mathcal{C}=\{l,\dots,u\}$ be a contiguous set of labels, and let

$$F(y):=\sum_{j \leq y} p(y_j \mid X)$$

be the cumulative probability function.
#### Ordinal risk reduction\.
For any label $y \leq l-1$, we have

$$\min_{c \in \mathcal{C}}|y-c|=l-y, \qquad \min_{c \in \mathcal{C} \cup \{l-1\}}|y-c|=(l-1)-y,$$

so the distance decreases by 1, while for $y \geq l$ the distance is unchanged. Hence, the reduction in ordinal risk from adding $l-1$ is

$$R(\mathcal{C})-R(\mathcal{C} \cup \{l-1\})=\sum_{y \leq l-1} p(y \mid X)=F(l-1).$$

Similarly, if $u<K$, adding $u+1$ reduces the ordinal risk by

$$R(\mathcal{C})-R(\mathcal{C} \cup \{u+1\})=\sum_{y \geq u+1} p(y \mid X)=1-F(u).$$

Thus, among the two possible single-step contiguous extensions of $\mathcal{C}$, the one yielding the largest reduction in ordinal risk is toward the side with larger tail probability outside $\mathcal{C}$:

$$\text{add left if } F(l-1) \geq 1-F(u), \quad \text{otherwise add right}. \tag{12}$$
#### Connection to RPS sublevel sets\.
By Theorem [3.1](https://arxiv.org/html/2606.24959#S3.Thmtheorem1), extending the candidate set to the right $(u+1)$ increases the RPS score by increments $(2F(u)-1)/(K-1) \geq 0$, while extending it to the left $(l-1)$ increases the score by increments $(1-2F(l-1))/(K-1) \geq 0$.

The conformal prediction procedure selects the next label that yields the *smallest increase in the RPS score* ([3](https://arxiv.org/html/2606.24959#S3.E3)), so the greedy expansion chooses

$$\text{add left if } \frac{1-2F(l-1)}{K-1} \leq \frac{2F(u)-1}{K-1}, \quad \text{otherwise add right},$$

which is algebraically equivalent to ([12](https://arxiv.org/html/2606.24959#A2.E12)):

$$\begin{aligned} 1-2F(l-1) \leq 2F(u)-1 &\iff 2 \leq 2F(u)+2F(l-1) \\ &\iff 1 \leq F(u)+F(l-1) \\ &\iff F(l-1) \geq 1-F(u). \end{aligned}$$
(3) Risk optimality among contiguous intervals of fixed length. Fix a length $s \geq 1$ and consider all contiguous sets $\mathcal{C}=\{l,\dots,l+s-1\}$ of size $s$. For such sets, the ordinal risk

$$R(\mathcal{C}(X))=\sum_{y=1}^{K} p(y \mid X)\,\min_{c \in \mathcal{C}}|y-c|$$

is the expected $l_1$ distance from $Y$ to the set $\mathcal{C}$. For fixed $s$, this risk is minimized when the interval is (in a discrete sense) centered at a median index $m$ of $p(\cdot \mid X)$: moving the interval one step away from $m$ increases the expected absolute distance. Consequently, among all contiguous sets of size $s$, those whose center is closest to $m$ have minimal ordinal risk.

By Theorem [3.1](https://arxiv.org/html/2606.24959#S3.Thmtheorem1), for each cardinality $s$, the RPS sublevel set expands around $m$ in a way that keeps the tail imbalance $|F(\ell)-1/2|$ as small as possible. Consequently, for each $s$, the RPS-based set is (discretely) centered at the median, which minimizes the ordinal risk among all contiguous sets of size $s$.

(4) Risk dominance over minimal-cardinality coverage sets. Let $\mathcal{C}(X)$ be any contiguous minimal-cardinality set satisfying coverage $\sum_{y \in \mathcal{C}} p(y \mid X) \geq 1-\alpha$, with size $s^{\star}$. Since $\mathcal{C}_{\mathrm{RPS}}(X)$ also satisfies coverage (Proposition [3.1](https://arxiv.org/html/2606.24959#S3.Thmproposition1)), $|\mathcal{C}_{\mathrm{RPS}}(X)| \geq s^{\star}$, so the size-$s^{\star}$ sublevel set $\mathcal{C}^{s^{\star}}_{\mathrm{RPS}}(X) \subseteq \mathcal{C}_{\mathrm{RPS}}(X)$ exists (Proposition [3.2](https://arxiv.org/html/2606.24959#S3.Thmproposition2)). By (3) it minimizes ordinal risk among contiguous size-$s^{\star}$ sets, and by monotonicity of $R$ under inclusion,

$$R(\mathcal{C}_{\mathrm{RPS}}(X)) \leq R(\mathcal{C}^{s^{\star}}_{\mathrm{RPS}}(X)) \leq R(\mathcal{C}(X)).$$

∎
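The single-step risk reductions derived in step (2) of the proof, $F(l-1)$ for a left extension and $1-F(u)$ for a right extension, can be checked numerically with a short sketch. The helper names are hypothetical, labels are coded as integers $1,\dots,K$, and only the one-step identities and rule (12) are illustrated here, not the full optimality claim.

```python
def ordinal_risk(p, lo, hi):
    """Expected l1 distance from the true label to the interval {lo, ..., hi} (labels are 1-based)."""
    risk = 0.0
    for y, py in enumerate(p, start=1):
        if y < lo:
            risk += py * (lo - y)
        elif y > hi:
            risk += py * (y - hi)
    return risk


def cdf(p, y):
    """F(y) = P(Y <= y) for a 1-based label y, with F(0) = 0."""
    return sum(p[:y])


def best_extension(p, lo, hi):
    """Rule (12): extend toward the side with the larger tail probability outside the set."""
    K = len(p)
    if lo == 1 and hi == K:
        return None            # the set already contains every label
    if lo == 1:
        return "right"
    if hi == K:
        return "left"
    return "left" if cdf(p, lo - 1) >= 1.0 - cdf(p, hi) else "right"


p = (0.09, 0.12, 0.40, 0.04, 0.35)
lo = hi = 3  # start from the median
gain_left = ordinal_risk(p, lo, hi) - ordinal_risk(p, lo - 1, hi)
gain_right = ordinal_risk(p, lo, hi) - ordinal_risk(p, lo, hi + 1)
print(gain_left, cdf(p, lo - 1))         # both are approximately F(2)
print(gain_right, 1.0 - cdf(p, hi))      # both are approximately 1 - F(3)
print(best_extension(p, lo, hi))
```

For the multimodal example, the upper tail carries more mass than the lower tail, so the rule extends the median singleton to the right first.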
## Appendix CAdditional Details for Experiments on ordinal image datasets
This section provides additional details for the experiments on ordinal image datasets\.
#### Baseline Nonconformity Scores\.
Least ambiguous set-valued classifier (LAC) score [sadinle2019least],

$$s_{\mathrm{LAC}}(X,y):=1-p(y \mid X).$$

Adaptive prediction set (APS) score [DBLP:conf/nips/RomanoSC20],

$$s_{\mathrm{APS}}(X,y):=\sum_{y^{\prime}:\,p(y^{\prime} \mid X) \geq p(y \mid X)} p(y^{\prime} \mid X).$$
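For concreteness, the two nominal baseline scores as defined above can be written as follows. This is a minimal sketch with hypothetical function names; it implements the deterministic APS variant stated in the formula, without the randomization that some APS implementations add.

```python
def lac_score(p, y):
    """LAC score: one minus the predicted probability of the candidate label y (1-based)."""
    return 1.0 - p[y - 1]


def aps_score(p, y):
    """APS score: total probability of all labels at least as likely as the candidate label y."""
    py = p[y - 1]
    return sum(q for q in p if q >= py)


p = (0.1, 0.6, 0.3)
print([lac_score(p, y) for y in (1, 2, 3)])
print([aps_score(p, y) for y in (1, 2, 3)])
```

Neither score uses the position of a label in the ordering, which is why the resulting prediction sets need not be contiguous.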
#### Performance Metrics\.
Empirical coverage (COV),

$$\mathrm{COV}:=\frac{1}{|\mathcal{D}_{\mathrm{test}}|}\sum_{i \in \mathcal{D}_{\mathrm{test}}}\mathds{1}\{Y_i \in \mathcal{C}(X_i)\}.$$

Average prediction set size (PS),

$$\mathrm{PS}:=\frac{1}{|\mathcal{D}_{\mathrm{test}}|}\sum_{i \in \mathcal{D}_{\mathrm{test}}}|\mathcal{C}(X_i)|.$$

Mean interval width (MW),

$$\mathrm{MW}:=\frac{1}{|\mathcal{D}_{\mathrm{test}}|}\sum_{i \in \mathcal{D}_{\mathrm{test}}}(u_i-l_i),$$

where $\mathcal{C}(X_i)=[l_i,u_i]$ denotes the contiguous prediction interval for the $i$-th test point.

To measure ordinal error distance over the entire test set $\mathcal{D}_{\mathrm{test}}$, we also report the mean absolute interval error (MAIE),

$$\mathrm{MAIE}:=\frac{1}{|\mathcal{D}_{\mathrm{test}}|}\sum_{i \in \mathcal{D}_{\mathrm{test}}}\begin{cases} l_i-Y_i & \text{if } Y_i<l_i \\ Y_i-u_i & \text{if } Y_i>u_i \\ 0 & \text{otherwise.} \end{cases}$$
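The four metrics above reduce to simple averages when prediction sets are contiguous intervals over integer-coded labels. The sketch below assumes that representation, with each set given as a pair of endpoints; the function name and the dictionary layout are hypothetical.

```python
def interval_metrics(labels, intervals):
    """COV, PS, MW and MAIE for contiguous prediction intervals [lo, hi] over integer labels."""
    n = len(labels)
    cov = ps = mw = maie = 0.0
    for y, (lo, hi) in zip(labels, intervals):
        cov += 1.0 if lo <= y <= hi else 0.0   # indicator of coverage
        ps += hi - lo + 1                      # number of labels in the interval
        mw += hi - lo                          # interval width
        if y < lo:
            maie += lo - y
        elif y > hi:
            maie += y - hi
    return {"COV": cov / n, "PS": ps / n, "MW": mw / n, "MAIE": maie / n}


labels = [1, 3, 5, 2]
intervals = [(3, 4), (2, 4), (3, 4), (2, 2)]
print(interval_metrics(labels, intervals))
```

Unlike MAMM, MAIE averages over all test points, so covered labels contribute zero and pull the value down as coverage increases.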
#### Model implementation\.
All models are implemented using skorch [skorch], a wrapper for PyTorch [paszke2019pytorch] that is compatible with scikit-learn [scikit-learn], and are conformalized with MAPIE [Cordier_Flexible_and_Systematic_2023]. The dlordinal [DBLP:journals/ijon/BerchezMorenoAYGHFG25] package is used to implement ordinal-specific methodologies, such as COPOC [DBLP:conf/nips/DeyMK23].
For image datasets, we employ a computationally efficient ResNet-18 [DBLP:conf/cvpr/HeZRS16] model pretrained on ImageNet [deng2009imagenet], as our primary objective is not to maximize predictive performance but to evaluate the proposed methodology. Nonetheless, ResNet-18 is widely used as a backbone in image-based ordinal classification research, where it achieves reasonable performance [DBLP:journals/ijon/BerchezMorenoAYGHFG25].
#### Loss functions\.
We train all models using the standard cross-entropy (CE) loss, which is a proper scoring rule and has been shown to yield unbiased predictive probability distributions, including in ordinal classification [DBLP:journals/ijar/HaasH25, DBLP:journals/corr/abs-2507-00733]:

$$l_{\mathrm{CE}}(\boldsymbol{p},y)=-\sum_{k=1}^{K}\mathds{1}\{y=y_k\}\,\log(p_k),$$

where $\boldsymbol{p}=(p_1,\dots,p_K)$ is the predicted probability distribution over the $K$ classes, and $y$ is the true label. In addition, we consider the non-parametric conformal prediction sets for ordinal classification (COPOC) approach proposed by [DBLP:conf/nips/DeyMK23], which enforces unimodality in the predictive probability distribution. This also serves as an exemplary ordinal-specific loss, encouraging predictions to respect the natural ordering of the labels.
#### BACH Dataset
The BACH (BreAst Cancer Histology) dataset [aresta2019bach] is a benchmark for breast cancer histopathological image classification, originally introduced as part of the ICIAR 2018 Grand Challenge on Breast Cancer Histology Images. It consists of hematoxylin and eosin (H&E) stained microscopy images of breast tissue, annotated by expert pathologists. The dataset contains 400 high-resolution images (2048 × 1536 pixels) with 100 images per class, representing four ordinal classes that follow the natural progression of breast cancer:
1. Normal: healthy breast tissue
2. Benign: non-cancerous abnormal tissue
3. In situ carcinoma: cancerous cells that have not invaded surrounding tissue
4. Invasive carcinoma: cancerous cells that have spread to surrounding tissue
The ordinal structure of these classes reflects increasing disease severity, making BACH particularly well-suited for evaluating ordinal classification methods in medical imaging. The dataset is perfectly balanced with an imbalance ratio (IR) of 1.0. For our experiments, we resize all images to 224 × 224 pixels and apply standard normalization with mean and standard deviation of 0.5 across all channels. During training, we augment the data with random rotations (up to 10 degrees) and random horizontal flips to improve model generalization. We fine-tune a ResNet-18 model pretrained on ImageNet on two thirds of the data and split the remaining third equally into calibration and test sets. This random split is repeated over 50 trials. BACH is widely used in medical imaging research and provides an important testbed for uncertainty quantification methods, as reliable predictions with well-calibrated confidence are critical in clinical decision-making for cancer diagnosis.




Figure 5: Example images from the BACH dataset [aresta2019bach].
#### RetinaMNIST Dataset
RetinaMNIST is a benchmark dataset of retinal fundus images from the MedMNIST [medmnistv2] collection, consisting of 1,600 $28 \times 28$ grayscale images labeled with diabetic retinopathy severity levels. The labels form an ordered set of five classes (*No DR*, *Mild*, *Moderate*, *Severe*, *Proliferative DR*) reflecting increasing disease severity, which makes RetinaMNIST a common choice in image-based ordinal classification research, e.g., [DBLP:conf/nips/DeyMK23], and well-suited for evaluating ordinal classification and uncertainty estimation methods in medical image analysis. In our experiments, we use a ResNet-18 backbone pretrained on ImageNet and adapted to the RetinaMNIST resolution. ResNet-18 provides a good trade-off between performance and computational efficiency for this task. We use the dedicated training set to train the model, while the original validation and test sets are merged and then randomly split equally into calibration and test sets, again repeated over 50 trials.








Figure 6: Example images from the RetinaMNIST dataset \[medmnistv2\]\.
#### FGNet Dataset
FGNet \[lanitis2002toward\] is a widely used benchmark for age estimation from facial images, containing 1,002 images of 82 subjects spanning ages 0–69 years\. For our experiments, we group ages into six ordinal classes, which preserves the natural ordering of ages and allows us to evaluate coarse age prediction as an ordinal classification task\. We preprocess images by resizing them to 256 pixels on the smaller side, followed by a random resized crop to 224×224 \(scale 0\.85–1\.0\)\. Data augmentation includes random horizontal flips \(50% probability\), color jitter \(brightness, contrast, saturation, and hue variations\), and random rotations of up to 10 degrees\. Images are converted to tensors and then normalized using the ImageNet channel\-wise mean \[0\.485, 0\.456, 0\.406\] and standard deviation \[0\.229, 0\.224, 0\.225\]\. We use a ResNet\-18 backbone pretrained on ImageNet and fine\-tune it on FGNet for the six\-class age classification task\. We use the dedicated training set to train the model and split the test set into calibration and test sets, also repeated over 50 trials\.
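The tensor conversion and normalization step can be written out explicitly\. The following NumPy sketch mimics what a typical image pipeline does \(scale pixel values to \[0, 1\], normalize per channel, move channels first\)\. The function name `to_normalized_tensor` and the all\-white input are hypothetical, and the authors' actual pipeline presumably relies on the transforms of a deep learning framework\.

```python
import numpy as np

# ImageNet channel-wise statistics as quoted in the text (RGB order).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def to_normalized_tensor(image_uint8):
    """Convert an HxWx3 uint8 image into a normalized 3xHxW float32 array."""
    scaled = image_uint8.astype(np.float32) / 255.0       # map [0, 255] to [0, 1]
    normalized = (scaled - IMAGENET_MEAN) / IMAGENET_STD   # broadcasts over the channel axis
    return np.transpose(normalized, (2, 0, 1))             # channels-first layout

# Hypothetical input: an all-white 224x224 crop.
crop = np.full((224, 224, 3), 255, dtype=np.uint8)
tensor = to_normalized_tensor(crop)
```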




Figure 7: Example images from the FGNet dataset \[lanitis2002toward\]\.
#### Predictive Performance\.
Table [3](https://arxiv.org/html/2606.24959#A3.T3) reports the predictive performance obtained with COPOC and with cross\-entropy \(CE\) loss across the image datasets\. Overall, CE achieves the best performance on FGNet and BACH, consistently improving ACC, MAE, MSE, and QWK\. On RetinaMNIST, CE attains higher ACC and lower error metrics \(MAE and MSE\), while COPOC performs slightly better in terms of 1\-OFF accuracy and QWK\. These results indicate that although CE generally provides stronger overall classification accuracy, COPOC remains competitive, particularly with respect to ordinal consistency metrics\. Furthermore, COPOC ensures unimodality across all datasets, as indicated by its degree of unimodality \(UMOD\) being equal to one\.
Table 3: Predictive performance on the image datasets\. Results are reported as mean ± stddev\. Metrics: ACC \(Accuracy\), 1\-OFF \(1\-Off Accuracy\), MAE \(Mean Absolute Error\), MSE \(Mean Squared Error\), QWK \(Quadratic Weighted Kappa\), UMOD \(Degree of Unimodality\)\. For ACC, 1\-OFF, and QWK, higher is better; for MAE and MSE, lower is better\. Bold values indicate the best performance for each metric within each dataset\.
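For reference, the point\-prediction metrics listed in the caption can be computed as in the sketch below\. The function names are illustrative, and labels are assumed to be integers starting at 0\. The helper `is_unimodal` encodes one plausible reading of the check behind UMOD \(the fraction of predicted distributions that rise to a single peak and then fall\); the paper's exact definition may differ\.

```python
import numpy as np

def ordinal_metrics(y_true, y_pred, n_classes):
    """Point-prediction metrics for integer labels in {0, ..., n_classes - 1}."""
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)
    diff = np.abs(y_true - y_pred)

    # Confusion matrix of observed (true, predicted) counts.
    observed = np.zeros((n_classes, n_classes))
    np.add.at(observed, (y_true, y_pred), 1)
    # Counts expected if true and predicted labels were independent.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    # Quadratic disagreement weights: 0 on the diagonal, 1 in the corners.
    idx = np.arange(n_classes)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2

    return {
        "ACC": float(np.mean(diff == 0)),
        "1-OFF": float(np.mean(diff <= 1)),
        "MAE": float(np.mean(diff)),
        "MSE": float(np.mean(diff ** 2)),
        "QWK": float(1.0 - (weights * observed).sum() / (weights * expected).sum()),
    }

def is_unimodal(probs):
    """True if probs weakly increase up to the peak and weakly decrease afterwards."""
    probs = np.asarray(probs, dtype=float)
    peak = int(np.argmax(probs))
    rising = np.all(np.diff(probs[:peak + 1]) >= 0)
    falling = np.all(np.diff(probs[peak:]) <= 0)
    return bool(rising and falling)

# Hypothetical labels and predicted distributions, for illustration only.
metrics = ordinal_metrics([0, 1, 2, 3, 2], [0, 1, 2, 2, 0], n_classes=4)
pred_probs = [[0.1, 0.6, 0.3], [0.4, 0.2, 0.4]]
umod = float(np.mean([is_unimodal(p) for p in pred_probs]))
```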
#### Detailed Experimental Results\.
Results over 50 trials for α ∈ \{0\.01, 0\.02, 0\.03, 0\.05, 0\.08, 0\.1, 0\.13, 0\.15, 0\.18, 0\.2\} on the BACH, RetinaMNIST, and FGNet datasets, covering the considered CP methods and metrics, are shown in Figures [8](https://arxiv.org/html/2606.24959#A3.F8), [9](https://arxiv.org/html/2606.24959#A3.F9), and [10](https://arxiv.org/html/2606.24959#A3.F10), respectively, and summarized in Table [4](https://arxiv.org/html/2606.24959#A3.T4) for α ∈ \{0\.02, 0\.05, 0\.1\}\. In all cases, as expected, LAC and APS violate the contiguity of prediction sets \(CV\) and are therefore excluded from measures that require contiguous sets, i\.e\., MW, MAMM, WAMM, AISL, and MAIE\. All ordinal methods \(min\-CPS, COPOCL, COPOCA, OCDF, and RPS\), however, output purely contiguous prediction sets\. Considering the trade\-off between efficiency, as indicated by MW, and ordinal errors, as indicated by MAMM, WAMM, and MAIE, RPS\-based sets strike a favorable balance, which is also reflected by the AISL metric that combines both aspects into a single score\. Compared to the other ordinal methods \(min\-CPS, COPOCL, COPOCA, and OCDF\), RPS\-based sets are reliable in the sense that risk is neither underestimated nor overestimated\. In contrast, min\-CPS, COPOCL, and COPOCA tend to underestimate risk, as indicated by higher MAMM, WAMM, and MAIE values induced by overly small PS and MW, whereas OCDF tends to produce very large PS and MW values, particularly for the RetinaMNIST and FGNet datasets \(see Figure [9](https://arxiv.org/html/2606.24959#A3.F9)\)\. The claim that RPS\-based sets strike a favorable balance between ordinal error awareness and efficient, small intervals is further supported by the tabular experiments reported in Appendix [D](https://arxiv.org/html/2606.24959#A4)\.
These experiments also support our earlier claim that mode\-centered set construction does not accurately capture uncertainty, whereas RPS\-based sets faithfully account for the full ordinal structure and the associated risk\.
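To make the RPS\-based set construction concrete, the sketch below applies standard split conformal prediction with the RPS as nonconformity score\. For a candidate label y, the score sums, over all thresholds k, the squared difference between the predicted cumulative probability F\(k\) and the step function 1\[y ≤ k\]\. This is an illustration under assumptions, not the authors' implementation: the function names, the synthetic Dirichlet predictions, and the unnormalized form of the score are all hypothetical choices\.

```python
import numpy as np

def rps_scores(probs):
    """RPS of every candidate label for every row of `probs` (shape n x K).

    Entry (i, y) is sum_k (F_i(k) - 1[y <= k])^2 over thresholds k = 0..K-2,
    where F_i is the predicted CDF of sample i and labels are 0-indexed.
    """
    probs = np.asarray(probs, dtype=float)
    n_classes = probs.shape[1]
    cdf = np.cumsum(probs, axis=1)[:, :-1]            # n x (K-1)
    labels = np.arange(n_classes)[:, None]            # K x 1
    thresholds = np.arange(n_classes - 1)[None, :]    # 1 x (K-1)
    step = (labels <= thresholds).astype(float)       # K x (K-1) label CDFs
    return ((cdf[:, None, :] - step[None, :, :]) ** 2).sum(axis=2)

def conformal_quantile(scores, alpha):
    """Finite-sample corrected (1 - alpha) quantile of the calibration scores."""
    n = len(scores)
    rank = int(np.ceil((n + 1) * (1 - alpha)))
    if rank > n:
        return np.inf  # too few calibration points: every label is kept
    return np.sort(scores)[rank - 1]

def prediction_sets(test_probs, threshold):
    """Labels whose RPS does not exceed the calibrated threshold."""
    return [np.flatnonzero(row <= threshold) for row in rps_scores(test_probs)]

# Synthetic stand-in for model outputs (assumption, not data from the paper).
rng = np.random.default_rng(0)
n_classes = 5
cal_probs = rng.dirichlet(np.ones(n_classes), size=500)
cal_labels = np.array([rng.choice(n_classes, p=p) for p in cal_probs])
test_probs = rng.dirichlet(np.ones(n_classes), size=200)

cal_scores = rps_scores(cal_probs)[np.arange(len(cal_labels)), cal_labels]
threshold = conformal_quantile(cal_scores, alpha=0.1)
sets = prediction_sets(test_probs, threshold)
```

Moving from label y to label y\+1 changes the score by 2F\(y\) − 1, which is nondecreasing in y\. The score is therefore minimized at a median of the predictive distribution, and every thresholded set is a contiguous interval around it, in line with the median\-centered sets described in the abstract\.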
Table 4: Comparison of the different nonconformity measures on the BACH, RetinaMNIST, and FGNet datasets at α=0\.02, α=0\.05, and α=0\.1\.
Figure 8: Comparison of prediction sets at α ∈ \{0\.01, 0\.02, 0\.03, 0\.05, 0\.08, 0\.1, 0\.13, 0\.15, 0\.18, 0\.2\} across methods on the BACH dataset \[aresta2019bach\]\. Shaded regions indicate standard deviation over 50 trials\.
Figure 9: Comparison of prediction sets across methods on the RetinaMNIST dataset \[medmnistv2\]\. Shaded regions indicate standard deviation\.
Figure 10: Comparison of prediction sets across methods on the FGNet dataset \[lanitis2002toward\]\. Shaded regions indicate standard deviation\.
## Appendix D Additional Experiments on Tabular Ordinal Datasets
#### Model implementation\.
We conduct additional experiments on tabular datasets using multilayer perceptrons \(MLPs\), which allow straightforward integration of the COPOC architecture \[DBLP:conf/nips/DeyMK23\]\. We use a simple MLP with a single hidden layer of 64 units to maintain consistency across datasets and isolate the effect of the conformal prediction method\. Our focus is on evaluating conformal prediction performance rather than optimizing base model accuracy; accordingly, we do not perform hyperparameter tuning\. Nonetheless, the model achieves predictive performance reasonably close to standard results on these datasets and successfully captures the ordinal structure of the labels\. See Table [6](https://arxiv.org/html/2606.24959#A4.T6) for the predictive performance of COPOC and CE loss on the considered datasets\.
#### Tabular Datasets\.
The tabular ordinal datasets are obtained from the TOC\-UCO repository \[ayllon2025toc\] \(see Table [5](https://arxiv.org/html/2606.24959#A4.T5) for details\)\. All features are numeric and are standardized using standard scaling\. For evaluation, approximately 60% of each dataset is used for training, with the remaining 40% split evenly between calibration and test sets\. The results are averaged over 50 random splits between calibration and test sets, with the training data held constant\. We focus on several larger datasets exhibiting diverse class distributions, which induce predictive distributions ranging from bimodal to unimodal, as reflected by the mean predicted probabilities \(MP\) \(see Figure [21](https://arxiv.org/html/2606.24959#A4.F21)\)\.
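The evaluation protocol above can be sketched as follows\. The feature matrix, the seeds, and the helper `standardize` are hypothetical; the sketch also assumes that the scaling statistics are estimated on the training portion only, which the text does not specify\.

```python
import numpy as np

def standardize(train_x, other_x):
    """Standard scaling: zero mean and unit variance per feature, fit on train_x."""
    mean = train_x.mean(axis=0)
    std = train_x.std(axis=0)
    std[std == 0] = 1.0  # leave constant features unscaled instead of dividing by zero
    return (train_x - mean) / std, (other_x - mean) / std

# Hypothetical numeric feature matrix standing in for one tabular dataset.
rng = np.random.default_rng(0)
features = rng.normal(loc=5.0, scale=2.0, size=(1000, 4))

# Fixed 60% training portion; the remaining 40% is held out.
perm = rng.permutation(len(features))
n_train = (3 * len(features)) // 5
train_idx, holdout_idx = perm[:n_train], perm[n_train:]
train_x, holdout_x = standardize(features[train_idx], features[holdout_idx])

# 50 random calibration/test splits of the held-out part, training data held constant.
cal_test_splits = []
for trial in range(50):
    shuffled = np.random.default_rng(trial).permutation(len(holdout_x))
    half = len(shuffled) // 2
    cal_test_splits.append((shuffled[:half], shuffled[half:]))
```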
#### Experimental Results\.
Again, we exclude LAC and APS, which may produce non\-contiguous prediction sets, from metrics other than COV, CV, and PS\. See the figures below for CP results across several α values \(α ∈ \{0\.01, 0\.02, 0\.03, 0\.05, 0\.08, 0\.1, 0\.13, 0\.15, 0\.18, 0\.2\}\) and Table [7](https://arxiv.org/html/2606.24959#A4.T7) for detailed results at α=0\.1\. Consistent with previous findings, RPS\-based conformal prediction tends to reduce ordinal error magnitudes while often striking a favorable balance between interval width and miscoverage magnitude, as indicated by AISL\. COPOCA remains a strong competitor, particularly on the LESTSensors and LEVXSensors datasets, which exhibit bimodal predictive distributions; however, this advantage comes at the cost of larger prediction sets and wider intervals compared to RPS\. Because COPOCA enforces unimodal predictive distributions, bimodal distributions are effectively centered between the extreme classes, thereby reducing ordinal risk in a manner similar to RPS and improving over min\-CPS\. In general, LAC and its unimodal COPOCL variant tend to produce the most efficient sets, as indicated by PS and MW\. However, this often comes at the cost of underestimating the ordinal risk indicated by MAMM, WAMM, and MAIE\.
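A small worked example illustrates why LAC can violate contiguity on bimodal predictions while RPS\-based sets cannot\. The predictive distribution and both thresholds below are hypothetical values chosen for illustration rather than calibrated quantiles, and LAC is taken in its usual form with score one minus the predicted probability of the label\.

```python
import numpy as np

def rps_per_label(probs):
    """RPS of each candidate label for a single predictive distribution."""
    n_classes = len(probs)
    cdf = np.cumsum(probs)[:-1]
    thresholds = np.arange(n_classes - 1)
    return np.array([
        np.sum((cdf - (y <= thresholds).astype(float)) ** 2)
        for y in range(n_classes)
    ])

def is_contiguous(label_set):
    """True if the labels form an interval of consecutive classes (or are empty)."""
    return bool(np.all(np.diff(label_set) == 1))

# Hypothetical bimodal prediction over four ordered classes.
probs = np.array([0.45, 0.05, 0.05, 0.45])

# LAC keeps labels whose score 1 - p_y stays below a (hypothetical) threshold.
lac_set = np.flatnonzero(1.0 - probs <= 0.6)

# The RPS-based set keeps labels whose RPS stays below a (hypothetical) threshold.
rps_set = np.flatnonzero(rps_per_label(probs) <= 0.7)

print("LAC set:", lac_set.tolist(), "contiguous:", is_contiguous(lac_set))
print("RPS set:", rps_set.tolist(), "contiguous:", is_contiguous(rps_set))
```

Here LAC selects the two extreme classes and skips the classes in between, whereas the RPS\-based set covers the central classes between the two modes\.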
Table 5: Summary of tabular ordinal datasets used in our experiments \[ayllon2025toc\]\.
Table 6: Predictive performance on the datasets\. Results are reported as mean ± stddev\. Metrics: ACC \(Accuracy\), 1\-OFF \(1\-Off Accuracy\), MAE \(Mean Absolute Error\), MSE \(Mean Squared Error\), QWK \(Quadratic Weighted Kappa\), UMOD \(Degree of Unimodality\)\. For ACC, 1\-OFF, and QWK, higher is better; for MAE and MSE, lower is better\. Bold values indicate the best performance for each metric within each dataset\.
Table 7: Performance comparison of the different CP methods on the tabular datasets at α=0\.1\.
Figure 11: Comparison of prediction sets across methods on LESTSensors dataset\. Shaded regions indicate standard deviation\.
Figure 12:Comparison of prediction sets across methods on LEVXSensors dataset\. Shaded regions indicate standard deviation\.
Figure 13:Comparison of prediction sets across methods on nhanes dataset\. Shaded regions indicate standard deviation\.
Figure 14:Comparison of prediction sets across methods on swd dataset\. Shaded regions indicate standard deviation\.
Figure 15:Comparison of prediction sets across methods on winequalityRed dataset\. Shaded regions indicate standard deviation\.
Figure 16:Comparison of prediction sets across methods on insurance dataset\. Shaded regions indicate standard deviation\.
Figure 17:Comparison of prediction sets across methods on melbourneAirbnb dataset\. Shaded regions indicate standard deviation\.
Figure 18:Comparison of prediction sets across methods on cancerDeathRate dataset\. Shaded regions indicate standard deviation\.
Figure 19:Comparison of prediction sets across methods on era dataset\. Shaded regions indicate standard deviation\.
Figure 20:Comparison of prediction sets across methods on lev dataset\. Shaded regions indicate standard deviation\.
Figure 21: Mean predictive probabilities \(MP\) for the different datasets with the CE loss\. Panels: \(a\) LESTSensors, \(b\) LEVXSensors, \(c\) nhanes, \(d\) swd, \(e\) winequalityRed, \(f\) insurance, \(g\) melbourneAirbnb, \(h\) cancerDeathRate, \(i\) era, \(j\) lev\.
## Appendix E Additional Experiments Using Gradient Boosted Trees
In this section, we compare the different nonconformity scores using gradient boosted trees \(GBTs\), a model class of high practical relevance for tabular data \[shwartz2022tabular\]\. We select LightGBM \[DBLP:conf/nips/KeMFWCMYL17\] as a representative GBT instantiation; other popular implementations such as CatBoost \[DBLP:conf/nips/ProkhorenkovaGV18\] or XGBoost \[DBLP:conf/kdd/ChenG16\] typically yield similar results\. Among the nonconformity measures, we exclude COPOCA and COPOCL from this comparison, as COPOC \[DBLP:conf/nips/DeyMK23\] relies on a specific neural network architecture that is not trivial to replicate with GBTs\. For our experiments, we use six tabular datasets from the TOC\-UCO repository \[ayllon2025toc\] \(Table [8](https://arxiv.org/html/2606.24959#A5.T8)\)\. The results are reported in Table [9](https://arxiv.org/html/2606.24959#A5.T9) and, broken down per dataset across all metrics, in the figures below\. They corroborate our earlier findings: RPS remains competitive with min\-CPS \[zhang2025provably\] and strikes a favorable balance between reducing ordinal miscoverage and maintaining small prediction set sizes\.
Table 8: Summary of tabular ordinal datasets used in our LightGBM experiments \[ayllon2025toc\]\.
Table 9: Comparison of the different nonconformity measures using LightGBM over tabular datasets at α=0\.02, α=0\.05, and α=0\.1\.
Figure 22: Comparison of prediction sets across methods on heartDisease dataset using LightGBM\. Shaded regions indicate standard deviation\.
Figure 23:Comparison of prediction sets across methods on mammoexp dataset using LightGBM\. Shaded regions indicate standard deviation\.
Figure 24:Comparison of prediction sets across methods on support dataset using LightGBM\. Shaded regions indicate standard deviation\.
Figure 25:Comparison of prediction sets across methods on winequalityRed dataset using LightGBM\. Shaded regions indicate standard deviation\.
Figure 26:Comparison of prediction sets across methods on nhanes dataset using LightGBM\. Shaded regions indicate standard deviation\.
Figure 27: Comparison of prediction sets across methods on LEVXSensors dataset using LightGBM\. Shaded regions indicate standard deviation\.