Online Localized Conformal Prediction

arXiv cs.LG Papers

Summary

This paper proposes Online Localized Conformal Prediction (OLCP) to address covariate heterogeneity in online learning and time-series settings. It introduces OLCP-Hedge for bandwidth selection and demonstrates valid long-run coverage with narrower prediction sets compared to existing baselines.

arXiv:2605.05497v1 Announce Type: new Abstract: Conformal prediction is a framework that provides valid uncertainty quantification for general models with exchangeable data. However, in the online learning and time-series settings, exchangeability is not satisfied. Existing online conformal methods, such as adaptive conformal inference (ACI), can achieve long-run validity, yet they remain inefficient under covariate heterogeneity because they rely on global calibration. We propose *Online Localized Conformal Prediction (OLCP)*, which combines online adaptation with covariate-dependent localization to better reflect heterogeneity. To reduce sensitivity to the localization bandwidth, we further develop *OLCP-Hedge*, which performs bandwidth selection as an online expert aggregation problem using a constrained online convex optimization framework. Importantly, we provide coverage guarantees for both algorithms and demonstrate through simulations and real-data experiments that the proposed methods attain valid long-run coverage with narrower prediction sets than existing baselines.

Cached at: 05/08/26, 07:41 AM

# Online Localized Conformal Prediction
Source: [https://arxiv.org/html/2605.05497](https://arxiv.org/html/2605.05497)
Yuheng Lai, University of Wisconsin - Madison, yuheng.lai@wisc.edu

Garvesh Raskutti, University of Wisconsin - Madison, raskutti@wisc.edu

###### Abstract

Conformal prediction is a framework that provides valid uncertainty quantification for general models with exchangeable data. However, in the online learning and time-series settings, exchangeability is not satisfied. Existing online conformal methods, such as adaptive conformal inference (ACI), can achieve long-run validity, yet they remain inefficient under covariate heterogeneity because they rely on global calibration. We propose *Online Localized Conformal Prediction (OLCP)*, which combines online adaptation with covariate-dependent localization to better reflect heterogeneity. To reduce sensitivity to the localization bandwidth, we further develop *OLCP-Hedge*, which performs bandwidth selection as an online expert aggregation problem using a constrained online convex optimization framework. Importantly, we provide coverage guarantees for both algorithms and demonstrate through simulations and real-data experiments that the proposed methods attain valid long-run coverage with narrower prediction sets than existing baselines.

## 1 Introduction

Reliable uncertainty quantification is essential in online learning and time-series prediction, where data are often temporally dependent, nonstationary, and heterogeneous across covariate space. Our goal is to construct online prediction sets $C_t(X_t)\subseteq\mathcal{Y}$ for a sequential data stream $\{(X_t,Y_t)\}_{t=1}^{T}$ such that the long-run coverage target

$$\frac{1}{T}\sum_{t=1}^{T}\mathbf{1}\{Y_t\in C_t(X_t)\}\approx 1-\alpha \tag{1}$$

is achieved while the average set size remains as small as possible.

Conformal prediction (CP) provides distribution-free uncertainty quantification with finite-sample marginal coverage under exchangeability ([vovk2005algorithmic](https://arxiv.org/html/2605.05497#bib.bib28); [lei2018distribution](https://arxiv.org/html/2605.05497#bib.bib16); [angelopoulos2023conformal](https://arxiv.org/html/2605.05497#bib.bib1)). However, exchangeability is often violated in online and time-series settings due to temporal dependence, distribution shift, heteroskedasticity, and structural breaks. Existing online conformal methods such as adaptive conformal inference (ACI) ([gibbs2021adaptive](https://arxiv.org/html/2605.05497#bib.bib7)) and dynamically tuned ACI (DtACI) ([gibbs2024conformal](https://arxiv.org/html/2605.05497#bib.bib8)) address this issue by adapting the nominal miscoverage level over time, and can achieve long-run validity under non-exchangeability. That being said, these methods remain *globally* calibrated: they react to temporal changes in overall uncertainty, but do not adapt set size to local covariate-dependent heterogeneity. As a result, they can be simultaneously too wide in easy regions and too narrow in difficult regions of the covariate space.

A complementary line of work addresses heterogeneity through localization. In particular, localized conformal prediction (LCP) re-weights calibration points according to covariate similarity, allowing set size to vary with local uncertainty ([guan2023localized](https://arxiv.org/html/2605.05497#bib.bib9)). While attractive in heterogeneous regression problems, existing localized conformal methods rely on exchangeability and are not designed for online nonstationary data streams. This leaves a natural gap: current online conformal methods handle temporal non-exchangeability but remain global, whereas current localized conformal methods adapt to heterogeneity but do not handle non-exchangeable online data.

We propose *Online Localized Conformal Prediction* (OLCP), which combines online calibration with covariate-dependent localization. At each time $t$, OLCP computes a localized conformal quantile around the current covariate $X_t$, and then updates the nominal miscoverage level using realized coverage feedback. Thus the set size adapts both over time and across the covariate space. A key practical issue is bandwidth selection. To reduce sensitivity to this choice, we formulate bandwidth selection as an online expert aggregation problem with prediction set size as the objective and coverage as the constraint, and develop *OLCP-Hedge*, a constrained online convex optimization procedure over a collection of OLCP experts.

Our main contributions are as follows:

- We introduce OLCP, a localized online conformal method that combines covariate-dependent calibration with feedback-driven online adaptation.
- We show that OLCP enjoys a long-run coverage guarantee under its sequential coverage-tracking update, despite replacing global conformal quantiles with localized ones.
- We formulate OLCP bandwidth selection as a constrained online convex optimization problem and propose OLCP-Hedge, which controls long-run coverage violation while competing with the best feasible bandwidth expert in terms of set size. Both coverage and set size guarantees for OLCP-Hedge are provided.
- Through simulations and real-data experiments, we show that OLCP and OLCP-Hedge achieve valid long-run coverage with narrower prediction sets than existing online conformal baselines in heterogeneous and nonstationary settings.

## 2 Problem setup and related work

We observe a sequential stream $\{(X_t,Y_t)\}_{t=1}^{T}$, where $X_t\in\mathcal{X}$ is observed before prediction and $Y_t\in\mathcal{Y}$ is revealed afterward. At each time $t$, using the past data together with the current covariate $X_t$, we construct a prediction set $C_t(X_t)\subseteq\mathcal{Y}$ for $Y_t$. Our objective is to achieve the target long-run coverage level ([1](https://arxiv.org/html/2605.05497#S1.E1)) while keeping the prediction sets as small as possible. In contrast to the classical conformal setting, we do not assume exchangeability of $\{(X_t,Y_t)\}$; the sequence may be temporally dependent, nonstationary, or heterogeneous across the covariate space.

#### Conformal prediction (CP).

We first introduce the split conformal prediction framework ([vovk2005algorithmic](https://arxiv.org/html/2605.05497#bib.bib28); [angelopoulos2023conformal](https://arxiv.org/html/2605.05497#bib.bib1)). One uses an *independent* training sample to construct a score function $s:\mathcal{X}\times\mathcal{Y}\to\mathbb{R}$, where larger values indicate that a candidate pair $(x,y)$ is less conforming to the training data; for example, one may take $s(x,y)=|y-\hat{f}(x)|$, where $\hat{f}$ is a pretrained predictor.

Given calibration points $Z_i=(X_i,Y_i)$, $i=1,\dots,n$, define the calibration scores $S_i:=s(X_i,Y_i)$ and the empirical score distribution

$$\widehat{F}_n:=\frac{1}{n}\sum_{i=1}^{n}\delta_{S_i},\qquad\text{where }\delta_s\text{ is a point mass at }s\in\mathbb{R}.$$

For any $\tau\in[0,1]$, let

$$Q(\tau;F):=\inf\{t\in\mathbb{R}\cup\{\infty\}:F((-\infty,t])\geq\tau\}$$

denote the lower $\tau$-quantile of a distribution $F$. Split conformal prediction calibrates the score threshold by

$$\widehat{q}_{1-\alpha}:=Q\!\left(\frac{\lceil(n+1)(1-\alpha)\rceil}{n};\,\widehat{F}_n\right),$$

and outputs the prediction set

$$C^{\text{CP}}(X_{n+1})=\{y\in\mathcal{Y}:s(X_{n+1},y)\leq\widehat{q}_{1-\alpha}\}.$$

Under exchangeability of the calibration points and the test point, this construction yields finite-sample marginal coverage,

$$\mathbb{P}\{Y_{n+1}\in C^{\text{CP}}(X_{n+1})\}\geq 1-\alpha,$$

while allowing complete flexibility in the choice of score $s$ ([lei2018distribution](https://arxiv.org/html/2605.05497#bib.bib16); [romano2019conformalized](https://arxiv.org/html/2605.05497#bib.bib21)).
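The calibration step above can be sketched in a few lines. This is a minimal illustration, assuming the absolute-residual score and a list of precomputed calibration scores; the function names `split_conformal_threshold` and `residual_interval` are hypothetical and not taken from the paper. The threshold is the order statistic of rank $\lceil(n+1)(1-\alpha)\rceil$, or $+\infty$ when that rank exceeds $n$.

```python
import math

def split_conformal_threshold(cal_scores, alpha):
    """Return Q(ceil((n+1)(1-alpha))/n; F_n) for the empirical score distribution."""
    n = len(cal_scores)
    # Rank of the order statistic; the small offset guards against float round-up.
    k = math.ceil((n + 1) * (1 - alpha) - 1e-12)
    if k > n:
        return math.inf  # requested level exceeds 1, so the quantile is +infinity
    return sorted(cal_scores)[max(k, 1) - 1]

def residual_interval(prediction, threshold):
    """Prediction set {y : |y - prediction| <= threshold}, written as an interval."""
    return (prediction - threshold, prediction + threshold)

# Illustrative calibration scores (made-up numbers).
scores = [0.3, 0.1, 0.4, 0.2]
q_hat = split_conformal_threshold(scores, alpha=0.5)  # rank ceil(5 * 0.5) = 3
print(q_hat, residual_interval(2.0, q_hat))
```

With only four calibration points, a small $\alpha$ such as 0.1 pushes the rank past $n$ and the sketch returns an infinite threshold, which is the usual price of the finite-sample correction.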

#### Localized conformal prediction (LCP).

A natural extension of split conformal prediction is to replace the unweighted empirical score distribution with a weighted one, assigning larger weight to calibration points that are more relevant to the test point. This idea underlies weighted conformal methods for covariate shift and related distribution shifts ([hore2025conformal](https://arxiv.org/html/2605.05497#bib.bib13); [barber2023conformal](https://arxiv.org/html/2605.05497#bib.bib2)). Localized conformal prediction (LCP) specializes this weighting scheme to the covariate space, emphasizing calibration points whose features are close to the test covariate $X_{n+1}$ ([guan2023localized](https://arxiv.org/html/2605.05497#bib.bib9)). Specifically, LCP introduces a localizer $H:\mathcal{X}\times\mathcal{X}\to[0,\infty)$ and defines normalized local weights

$$p_{n+1,j}^{H}:=\frac{H(X_{n+1},X_j)}{\sum_{k=1}^{n+1}H(X_{n+1},X_k)},\qquad j=1,\dots,n+1,$$

together with the weighted augmented score distribution

$$\widehat{F}_{n+1}^{H}:=\sum_{j=1}^{n}p_{n+1,j}^{H}\,\delta_{S_j}\;+\;p_{n+1,n+1}^{H}\,\delta_{\infty}.$$

A naive localized analogue of split conformal prediction would use the weighted quantile $Q(1-\alpha;\widehat{F}_{n+1}^{H})$, but this generally fails to preserve finite-sample validity. [guan2023localized](https://arxiv.org/html/2605.05497#bib.bib9) shows that validity can be restored by replacing $1-\alpha$ with a recalibrated level $\tilde{\alpha}$, leading to the prediction set

$$C^{\text{LCP}}(X_{n+1})=\{y\in\mathcal{Y}:s(X_{n+1},y)\leq Q(\tilde{\alpha};\widehat{F}_{n+1}^{H})\}.$$

This retains finite-sample marginal coverage while allowing the prediction set to adapt to covariate-dependent heterogeneity.

#### Adaptive conformal inference (ACI).

A different line of work addresses non-exchangeability through online calibration. Following [gibbs2021adaptive](https://arxiv.org/html/2605.05497#bib.bib7), let $D_t$ denote an estimated conformity-score distribution at time $t$, for example the empirical distribution of recent scores. For any $\beta\in[0,1]$, define

$$C^{\text{ACI}}_t(\beta)=\{y\in\mathcal{Y}:s(X_t,y)\leq Q(1-\beta;D_t)\}.$$

Adaptive conformal inference (ACI) replaces the fixed nominal level $\alpha$ by an online-updated parameter $\alpha^{\text{ACI}}_t$. After predicting with $C^{\text{ACI}}_t(\alpha^{\text{ACI}}_t)$, it observes

$$\mathrm{err}^{\text{ACI}}_t:=\mathbf{1}\{Y_t\notin C^{\text{ACI}}_t(\alpha^{\text{ACI}}_t)\},$$

and updates

$$\alpha^{\text{ACI}}_{t+1}=\alpha^{\text{ACI}}_t+\gamma(\alpha-\mathrm{err}^{\text{ACI}}_t),$$

where $\gamma>0$ is a step size ([gibbs2021adaptive](https://arxiv.org/html/2605.05497#bib.bib7)). This yields a feedback-driven calibration rule that targets long-run coverage under arbitrary distribution shifts.
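To make the feedback rule concrete, the sketch below replays a fixed 0/1 error sequence through the ACI update. This is a simplification for illustration only: in an actual run the errors depend on the levels, and the helper names `aci_update` and `run_aci_levels` are hypothetical. Because the update is unprojected, summing it over $t$ telescopes, so the cumulative excess error equals $(\alpha_1-\alpha_{T+1})/\gamma$.

```python
def aci_update(alpha_t, err_t, alpha, gamma):
    """One ACI step: alpha_{t+1} = alpha_t + gamma * (alpha - err_t)."""
    return alpha_t + gamma * (alpha - err_t)

def run_aci_levels(errs, alpha, gamma, alpha_1):
    """Replay a 0/1 error sequence and return the levels alpha_1, ..., alpha_{T+1}."""
    levels = [alpha_1]
    for err_t in errs:
        levels.append(aci_update(levels[-1], err_t, alpha, gamma))
    return levels

# Illustrative error sequence (made up); 1 marks a miss.
errs = [0, 0, 1, 0, 1, 0, 0, 0]
alpha, gamma = 0.1, 0.05
levels = run_aci_levels(errs, alpha, gamma, alpha_1=0.1)

# A miss lowers the level (wider next set); a covered round raises it slightly.
excess = sum(e - alpha for e in errs)
telescoped = (levels[0] - levels[-1]) / gamma
print(levels)
print(excess, telescoped)
```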

#### Other related work.

A key limitation of ACI is its sensitivity to the step size $\gamma$. DtACI ([gibbs2024conformal](https://arxiv.org/html/2605.05497#bib.bib8)) and AgACI ([zaffran2022adaptive](https://arxiv.org/html/2605.05497#bib.bib32)) address this by aggregating multiple ACI experts online, while recent work develops parameter-free online conformal updates based on universal portfolio algorithms ([liu2026online](https://arxiv.org/html/2605.05497#bib.bib18)). Other sequential conformal methods update scores or interval sizes over time: EnbPI ([enbpi](https://arxiv.org/html/2605.05497#bib.bib29)) builds prediction sets around bootstrap ensemble predictors under weak dependence conditions, and SPCI ([spci](https://arxiv.org/html/2605.05497#bib.bib30)) forecasts future residual quantiles from past residuals. A different direction anticipates abrupt shifts using additional structure: CPTC ([sun2025cptc](https://arxiv.org/html/2605.05497#bib.bib26)) maintains state-specific score sets and aggregates them using a predicted latent state sequence, but relies on a state predictor and state-conditioned forecaster. Under exchangeability, selection and aggregation methods choose among conformal predictors while preserving finite-sample validity through recalibration ([yang2025selection](https://arxiv.org/html/2605.05497#bib.bib31); [liang2024conformal](https://arxiv.org/html/2605.05497#bib.bib17)); our setting instead treats coverage as a long-run online constraint. Finally, conformal methods beyond exchangeability use data-dependent or fixed weights, including covariate-shift conformal prediction ([tibshirani2019conformal](https://arxiv.org/html/2605.05497#bib.bib27)) and fixed-weight approaches for non-exchangeable data ([barber2023conformal](https://arxiv.org/html/2605.05497#bib.bib2)).

## 3 Methods

### 3.1 Online Localized Conformal Prediction

We now introduce our first and main method, *Online Localized Conformal Prediction* (OLCP). The starting point is that two distinct difficulties arise in sequential prediction. First, under covariate heterogeneity, uncertainty can vary substantially across the feature space, so a global conformal quantile may allocate set size inefficiently. Second, under temporal distribution shift, the appropriate calibration level itself changes over time. OLCP addresses both issues simultaneously by combining localized calibration in $X$ with online updates of the nominal level.

#### Localized calibration distribution.

Fix a bandwidth $h>0$, a localizer

$$H_h:\mathcal{X}\times\mathcal{X}\to[0,\infty),$$

and a rolling calibration window $\mathcal{I}_t:=\{\max(1,t-R),\dots,t-1\}$ with fixed window size $R$. For each $i\in\mathcal{I}_t$, $S_i=s(X_i,Y_i)$ denotes the realized conformity score at time $i$. For a query covariate $x\in\mathcal{X}$, define normalized local weights

$$w_{t,i}^{(h)}(x):=\frac{H_h(x,X_i)}{\sum_{j\in\mathcal{I}_t}H_h(x,X_j)}. \tag{2}$$

We then form the localized empirical distribution

$$D_t^{(h)}(x):=\sum_{i\in\mathcal{I}_t}w_{t,i}^{(h)}(x)\,\delta_{S_i}.$$

This is the online analogue of the weighted empirical distributions used in localized conformal prediction ([guan2023localized](https://arxiv.org/html/2605.05497#bib.bib9)): past scores associated with covariates close to the query point $x$ receive more weight.

For any query covariate $x\in\mathcal{X}$ and any $\beta\in[0,1]$, define

$$C_t^{(h)}(x;\beta):=\{y\in\mathcal{Y}:\;s(x,y)\leq Q(1-\beta;D_t^{(h)}(x))\}.$$

Thus $h$ controls the degree of localization: small $h$ emphasizes nearby covariates, while large $h$ recovers a more global rule ([guan2023localized](https://arxiv.org/html/2605.05497#bib.bib9)).
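The effect of the bandwidth on the localized quantile can be seen in a small sketch. The text does not fix a particular localizer here, so the Gaussian kernel on scalar covariates below is an assumption, as are the helper names and the uniform fallback when every kernel value underflows.

```python
import math

def gaussian_localizer(x, x_i, h):
    """Assumed localizer H_h(x, x_i) = exp(-(x - x_i)^2 / (2 h^2)), scalar covariates."""
    return math.exp(-((x - x_i) ** 2) / (2.0 * h * h))

def local_weights(x, window_x, h):
    """Normalized weights w_{t,i}^{(h)}(x) over the calibration window, as in (2)."""
    raw = [gaussian_localizer(x, x_i, h) for x_i in window_x]
    total = sum(raw)
    if total == 0.0:  # every kernel value underflowed; uniform fallback (assumption)
        return [1.0 / len(raw)] * len(raw)
    return [r / total for r in raw]

def weighted_lower_quantile(scores, weights, tau):
    """Lower tau-quantile of the discrete distribution sum_i weights[i] * delta_{scores[i]}."""
    if tau <= 0.0:
        return -math.inf  # every threshold already has mass >= 0
    cumulative = 0.0
    for score, weight in sorted(zip(scores, weights)):
        cumulative += weight
        if cumulative >= tau - 1e-12:  # tolerance for float accumulation
            return score
    return math.inf

# Two easy points near x = 0 and two hard points near x = 10 (made-up numbers).
window_x = [0.0, 0.0, 10.0, 10.0]
window_s = [1.0, 2.0, 10.0, 20.0]
for h in (1.0, 1000.0):
    w = local_weights(0.0, window_x, h)
    print(h, weighted_lower_quantile(window_s, w, 0.9))
```

With the small bandwidth the quantile at $x=0$ is driven by the two nearby scores, while the large bandwidth spreads the weights almost uniformly and the quantile is pulled up by the distant, larger scores.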

#### OLCP update.

OLCP maintains a nominal level $\alpha_t\in[0,1]$. At time $t$, it outputs $C_t^{(h)}(X_t;\alpha_t)$, observes the error

$$\mathrm{err}_t:=\mathbf{1}\{Y_t\notin C_t^{(h)}(X_t;\alpha_t)\},$$

and updates

$$\alpha_{t+1}=\Pi_{[0,1]}\bigl(\alpha_t+\gamma(\alpha-\mathrm{err}_t)\bigr), \tag{3}$$

where $\Pi_{[0,1]}$ is projection onto $[0,1]$ and $\gamma>0$ is a step size.

The algebraic form of ([3](https://arxiv.org/html/2605.05497#S3.E3)) is similar to ACI, but the object being calibrated is different: OLCP updates the level for the localized family $\{C_t^{(h)}(X_t;\beta):\beta\in[0,1]\}$, not for a single global conformal rule. This distinction is what allows OLCP to adapt set size across both time and covariate space.

**Algorithm 1** Online Localized Conformal Prediction (OLCP)

1. Input: target miscoverage $\alpha$, step size $\gamma$, bandwidth $h$, window length $R$
2. Initialize $\alpha_1\in[0,1]$
3. **for** $t=1,2,\dots,T$ **do**
4. Compute localized weights $w_{t,i}^{(h)}(X_t)$ for $i\in\mathcal{I}_t=\{\max(1,t-R),\dots,t-1\}$ from ([2](https://arxiv.org/html/2605.05497#S3.E2))
5. Form $D_t^{(h)}(X_t)=\sum_{i\in\mathcal{I}_t}w_{t,i}^{(h)}(X_t)\,\delta_{S_i}$
6. Output $C_t^{(h)}(X_t;\alpha_t)=\{y:s(X_t,y)\leq Q(1-\alpha_t;D_t^{(h)}(X_t))\}$
7. Observe $Y_t$ and set $\mathrm{err}_t=\mathbf{1}\{Y_t\notin C_t^{(h)}(X_t;\alpha_t)\}$
8. Update $\alpha_{t+1}=\Pi_{[0,1]}(\alpha_t+\gamma(\alpha-\mathrm{err}_t))$
9. **end for**
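The loop in Algorithm 1 can be sketched as follows. This is a hedged illustration rather than the authors' implementation: it assumes scalar covariates, a Gaussian localizer, the absolute-residual score with a user-supplied predictor `predict`, and an infinite radius on the first round when the window is empty. The function name `olcp_run` and its return format are hypothetical.

```python
import math

def weighted_lower_quantile(scores, weights, tau):
    """Lower tau-quantile of sum_i weights[i] * delta_{scores[i]}."""
    if tau <= 0.0:
        return -math.inf
    cumulative = 0.0
    for score, weight in sorted(zip(scores, weights)):
        cumulative += weight
        if cumulative >= tau - 1e-12:
            return score
    return math.inf

def olcp_run(xs, ys, predict, alpha=0.1, gamma=0.05, h=1.0, R=50, alpha_1=None):
    """Run OLCP on lists xs, ys; returns interval radii, errors, and levels per round."""
    alpha_t = alpha if alpha_1 is None else alpha_1
    scores, radii, errs, alphas = [], [], [], []
    for t, (x_t, y_t) in enumerate(zip(xs, ys)):
        start = max(0, t - R)  # rolling window of the last R rounds
        window_x, window_s = xs[start:t], scores[start:t]
        if window_s:
            # Assumed Gaussian localizer; weights normalized as in (2).
            raw = [math.exp(-((x_t - x_i) ** 2) / (2.0 * h * h)) for x_i in window_x]
            total = sum(raw)
            weights = [r / total for r in raw] if total > 0.0 else [1.0 / len(raw)] * len(raw)
            q_t = weighted_lower_quantile(window_s, weights, 1.0 - alpha_t)
        else:
            q_t = math.inf  # no calibration scores yet (assumption)
        s_t = abs(y_t - predict(x_t))  # realized conformity score
        err_t = 1 if s_t > q_t else 0
        scores.append(s_t)
        radii.append(q_t)
        errs.append(err_t)
        alphas.append(alpha_t)
        # Projected update (3).
        alpha_t = min(1.0, max(0.0, alpha_t + gamma * (alpha - err_t)))
    return {"radii": radii, "errs": errs, "alphas": alphas}

# Toy stream: residuals of size 1 at x = 0 and size 5 at x = 10 (made-up data).
xs = [0.0, 10.0] * 30
ys = [1.0 if x == 0.0 else 5.0 for x in xs]
out = olcp_run(xs, ys, predict=lambda x: 0.0, alpha=0.1, gamma=0.05, h=1.0, R=50)
print(out["radii"][:6], sum(out["errs"]))
```

On this toy stream the radius settles at 1 in the easy region and 5 in the hard region after a single early miss, which is the kind of covariate-dependent set size a global quantile cannot produce.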

#### Pinball-loss view and coverage guarantee.

The update in ([3](https://arxiv.org/html/2605.05497#S3.E3)) also admits a useful optimization interpretation. Define the coverage-boundary level

$$\beta_t^{(h)}:=\sup\{\beta\in[0,1]:Y_t\in C_t^{(h)}(X_t;\beta)\},$$

namely, the boundary nominal level for which the realized response starts to be uncovered. Because $C_t^{(h)}(X_t;\beta)$ is decreasing in $\beta$,

$$\alpha_t<\beta_t^{(h)}\Rightarrow\mathrm{err}_t=0,\qquad\alpha_t>\beta_t^{(h)}\Rightarrow\mathrm{err}_t=1.$$

At $\alpha_t=\beta_t^{(h)}$, either outcome may occur. For pinball loss

$$\ell(\beta,\theta)=\alpha(\beta-\theta)-\min\{0,\beta-\theta\},$$

we therefore have

$$\mathrm{err}_t-\alpha\in\partial_{\theta}\ell(\beta_t^{(h)},\theta)\big|_{\theta=\alpha_t},$$

so ([3](https://arxiv.org/html/2605.05497#S3.E3)) is exactly projected online gradient descent on the loss sequence $\{\ell(\beta_t^{(h)},\cdot)\}_{t=1}^{T}$.
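The subgradient relation can be checked numerically away from the kink at $\theta=\beta_t^{(h)}$. The sketch below is only a sanity check under assumed values of $\alpha$ and $\beta_t^{(h)}$; the helper names are hypothetical. For $\theta<\beta$ the slope in $\theta$ is $-\alpha$ (the case $\mathrm{err}_t=0$), and for $\theta>\beta$ it is $1-\alpha$ (the case $\mathrm{err}_t=1$).

```python
def pinball_loss(beta, theta, alpha):
    """l(beta, theta) = alpha * (beta - theta) - min(0, beta - theta)."""
    return alpha * (beta - theta) - min(0.0, beta - theta)

def numeric_slope(beta, theta, alpha, eps=1e-6):
    """Central finite difference of the loss in theta (valid away from theta = beta)."""
    upper = pinball_loss(beta, theta + eps, alpha)
    lower = pinball_loss(beta, theta - eps, alpha)
    return (upper - lower) / (2.0 * eps)

alpha, beta_t = 0.1, 0.3  # illustrative values
for alpha_t in (0.2, 0.5):
    err_t = 0 if alpha_t < beta_t else 1  # coverage outcome implied by the boundary level
    print(alpha_t, err_t - alpha, numeric_slope(beta_t, alpha_t, alpha))
```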

We state the following long-term coverage guarantee for OLCP, with the proof deferred to Appendix [A](https://arxiv.org/html/2605.05497#A1).

###### Proposition 3.1 (Boundary-corrected coverage control for OLCP).

Define $z_t:=\alpha_t+\gamma(\alpha-\mathrm{err}_t)$ and the lower- and upper-boundary projection corrections $L_t:=(-z_t)_+$, $U_t:=(z_t-1)_+$. Equivalently, $\alpha_{t+1}=z_t+L_t-U_t$. Then, for every $T\geq 1$ and fixed $\gamma>0$,

$$\sum_{t=1}^{T}(\mathrm{err}_t-\alpha)=\frac{\alpha_1-\alpha_{T+1}}{\gamma}+\frac{1}{\gamma}\sum_{t=1}^{T}(L_t-U_t).$$

In particular, if $\sum_{t=1}^{T}L_t=o(T)$, then $\limsup_{T\to\infty}\dfrac{1}{T}\sum_{t=1}^{T}\mathrm{err}_t\leq\alpha$.

If additionally $\sum_{t=1}^{T}(L_t+U_t)=o(T)$, then $\dfrac{1}{T}\sum_{t=1}^{T}\mathrm{err}_t\to\alpha$.
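The identity in Proposition 3.1 is purely algebraic, so it can be checked on any 0/1 error sequence. The sketch below replays a made-up sequence with a deliberately large step size so that the lower boundary is hit; the function name `projected_levels` is hypothetical, and in a real run the errors would be generated by the prediction sets rather than fixed in advance.

```python
def projected_levels(errs, alpha, gamma, alpha_1):
    """Replay update (3), recording the boundary corrections L_t and U_t."""
    alpha_t, lower, upper = alpha_1, [], []
    for err_t in errs:
        z_t = alpha_t + gamma * (alpha - err_t)
        lower.append(max(-z_t, 0.0))       # L_t = (-z_t)_+
        upper.append(max(z_t - 1.0, 0.0))  # U_t = (z_t - 1)_+
        alpha_t = min(1.0, max(0.0, z_t))  # equals z_t + L_t - U_t
    return alpha_t, lower, upper

# Illustrative sequence with early misses that push the level to 0.
errs = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
alpha, gamma, alpha_1 = 0.1, 0.3, 0.1
alpha_end, lower, upper = projected_levels(errs, alpha, gamma, alpha_1)

lhs = sum(e - alpha for e in errs)
rhs = (alpha_1 - alpha_end) / gamma + sum(l - u for l, u in zip(lower, upper)) / gamma
print(lhs, rhs, sum(lower), sum(upper))
```

Tracking `sum(lower)` and `sum(upper)` along a run is also a simple empirical proxy for how much the projection is interfering with calibration.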

#### Remark on boundary terms.

Proposition [3.1](https://arxiv.org/html/2605.05497#S3.Thmtheorem1) shows exactly what projection changes relative to the unprojected ACI. Unlike ACI, which can use infinite or empty prediction sets to obtain exact telescoping, OLCP keeps prediction sets finite and must account for $L_t$ and $U_t$. The lower correction $L_t$ measures unresolved undercoverage pressure: it is positive only when the update wants to move $\alpha_t$ below $0$, i.e. when OLCP misses while already using a very wide finite set. The upper correction $U_t$ is the analogous overcoverage pressure near $\alpha_t=1$. Thus projected OLCP recovers ACI-style long-run calibration whenever these normalized boundary corrections are negligible. Appendix [B](https://arxiv.org/html/2605.05497#A2) gives sufficient conditions under which these terms vanish and explains how to diagnose them empirically.

### 3.2 OLCP-Hedge: constrained aggregation across localization bandwidths

OLCP requires choosing a localization bandwidth $h$. This bandwidth controls a familiar bias-variance trade-off: small $h$ adapts strongly to local heterogeneity but can be noisy, whereas large $h$ is more stable but may wash out covariate-dependent uncertainty ([guan2023localized](https://arxiv.org/html/2605.05497#bib.bib9)). Since the best localization scale is generally unknown and may vary over time, this motivates aggregating a collection of OLCP experts rather than committing to a single bandwidth in advance.

This is related to the classical prediction-with-expert-advice problem, where the learner combines expert predictions and competes with the best expert in hindsight ([orabona2019modern](https://arxiv.org/html/2605.05497#bib.bib19)). However, our objective is not a single scalar loss. If set size is used as the loss, ordinary Hedge may favor narrow experts that undercover; if miscoverage is used as the loss, the resulting prediction sets may be valid but inefficient. Thus bandwidth selection is naturally a constrained online problem: minimize set size while controlling long-run miscoverage.

#### OLCP expert pool.

Fix $K$ OLCP procedures, for example, with different bandwidths $\{h_i:i=1,\ldots,K\}$. At time $t$, expert $i$ outputs a prediction set $C_{t,i}(X_t)$, for example $C_{t,i}(X_t):=C_t^{(h_i)}(X_t;\alpha_t)$, and receives feedback of

$$\text{size }\omega_{t,i}:=\mathrm{size}(C_{t,i}(X_t))\qquad\text{and}\qquad\text{miscoverage }\mathrm{err}_{t,i}:=\mathbf{1}\{Y_t\notin C_{t,i}(X_t)\},$$

where $\mathrm{size}(\cdot)$ could be interval width or cardinality, depending on the task. (Footnote 1: Our algorithm can also be applied to aggregate OLCP procedures with different step sizes $\gamma$; see Section [5](https://arxiv.org/html/2605.05497#S5).)

Let $\omega_t=(\omega_{t,1},\dots,\omega_{t,K})$ and $e_t=(\mathrm{err}_{t,1},\dots,\mathrm{err}_{t,K})$. We maintain a distribution $p_t\in\Delta_K$ over experts, sample $I_t\sim p_t$, and output $C_{t,I_t}(X_t)$.

#### Size objective and miscoverage constraint.

For $p\in\Delta_K$, define

$$f_t(p):=\langle\omega_t,p\rangle,\qquad g_t(p):=\langle e_t,p\rangle-\alpha.$$

Then

$$\mathbb{E}_{I_t\sim p_t}[\omega_{t,I_t}]=f_t(p_t),\qquad\mathbb{E}_{I_t\sim p_t}[\mathrm{err}_{t,I_t}]=\langle e_t,p_t\rangle=g_t(p_t)+\alpha.$$

Thus $g_t(p_t)$ is the expected excess miscoverage of the aggregate. Equivalently, the meta-objective is

$$\min_{p_t\in\Delta_K}\sum_{t=1}^{T}f_t(p_t)\qquad\text{while keeping}\qquad\sum_{t=1}^{T}(g_t(p_t))_+\ \text{sublinear}.$$

This is an instance of *constrained online convex optimization* (COCO): the learner chooses an action before observing a convex loss and constraint, and aims to control both regret and cumulative constraint violation. COCO has a substantial recent literature on time-varying constraints and long-term feasibility; see, e.g., [guo2022online](https://arxiv.org/html/2605.05497#bib.bib10); [sinha2024optimal](https://arxiv.org/html/2605.05497#bib.bib24) for recent overviews and comparisons.

We adapt the algorithm proposed in [sinha2024optimal](https://arxiv.org/html/2605.05497#bib.bib24) because it gives state-of-the-art simultaneous guarantees, $O(\sqrt{T})$ regret and $\widetilde{O}(\sqrt{T})$ cumulative constraint violation, without Slater-type assumptions or per-round constrained optimization. Their reduction combines a Lyapunov-style virtual queue with a black-box adaptive OCO subroutine.

The original paper focuses on the Euclidean space; here we extend their algorithm to the probability simplex of OLCP experts, and prove the corresponding size-regret and excess-miscoverage bounds in the probability simplex geometry.

#### Assumptions.

We use the following assumptions, specialized to our expert aggregation problem.

- Assumption A (bounded sizes). There exists $G>0$ such that for all $t$, $\|\omega_t\|_{\infty}\leq G$ and $\|e_t\|_{\infty}\leq G$. Since $e_{t,i}\in\{0,1\}$, this only requires a uniform bound on set sizes. This assumption is mild after size normalization or when the response range is bounded.
- Assumption B (uniformly feasible comparator). There exists $u^{\star}\in\Delta_K$ such that $g_t(u^{\star})=\langle e_t,u^{\star}\rangle-\alpha\leq 0$ for all $t$. This assumption is a standard but strong feasibility condition in COCO: one compares against a fixed feasible action. In our setting, it can be enforced by including a conservative expert, although this may increase the comparator size. If exact feasibility is unavailable, the same viewpoint suggests a relaxed fallback using online constraint-satisfaction ideas from [sinha2024optimal](https://arxiv.org/html/2605.05497#bib.bib24): instead of requiring a uniformly feasible comparator, one can compare to an $S$-feasible or $P_T$-constrained benchmark and aim for sublinear excess miscoverage relative to that weaker benchmark. We discuss diagnostics for this assumption in Appendix [C](https://arxiv.org/html/2605.05497#A3).

#### Surrogate loss.

Following [sinha2024optimal](https://arxiv.org/html/2605.05497#bib.bib24), OLCP-Hedge maintains a virtual queue for excess miscoverage and feeds AdaHedge ([derooij2014follow](https://arxiv.org/html/2605.05497#bib.bib5)) a surrogate loss that combines set size with queue-weighted constraint violation. (Footnote 2: AdaHedge is only one possible subroutine; see the modularity discussion below in [3.2](https://arxiv.org/html/2605.05497#S3.SS2.SSS0.Px6).) Let $\mathcal{Q}(0)=0$ and update

$$\mathcal{Q}(t)=\mathcal{Q}(t-1)+\kappa(g_t(p_t))_+,\qquad\text{where }\kappa>0\text{ is a constant}.$$

With $\Phi(q)=e^{\lambda q}-1$ and $\lambda>0$, define

$$\hat{f}_t(p)=V\kappa f_t(p)+\Phi'(\mathcal{Q}(t))\,\kappa(g_t(p))_+,\qquad\text{where }V>0\text{ is a constant}.$$

The first term penalizes prediction set size, while the second term increasingly penalizes excess miscoverage as the queue grows. The full algorithm is detailed in Algorithm [2](https://arxiv.org/html/2605.05497#alg2), with more details of AdaHedge in Appendix [D](https://arxiv.org/html/2605.05497#A4).

**Algorithm 2** OLCP-Hedge

1. Input: target miscoverage $\alpha$, OLCP experts $\{\{C_{t,i}\}_{t=1}^{T}\}_{i=1}^{K}$, parameters $V,\kappa,\lambda$
2. Initialize $\mathcal{Q}(0)=0$ and AdaHedge on $\Delta_K$
3. **for** $t=1,2,\dots,T$ **do**
4. AdaHedge outputs $p_t\in\Delta_K$
5. Sample $I_t\sim p_t$ and output $C_{t,I_t}(X_t)$
6. Observe $Y_t$; compute the size $\omega_{t,i}$ and $\mathrm{err}_{t,i}=\mathbf{1}\{Y_t\notin C_{t,i}(X_t)\}$ for all $i$
7. Define $f_t(p)=\langle\omega_t,p\rangle$, $g_t(p)=\langle e_t,p\rangle-\alpha$
8. Update $\mathcal{Q}(t)=\mathcal{Q}(t-1)+\kappa(g_t(p_t))_+$
9. Form $\hat{f}_t(p)=V\kappa f_t(p)+\Phi'(\mathcal{Q}(t))\,\kappa(g_t(p))_+$
10. Choose $\xi_t\in\partial\hat{f}_t(p_t)$ and feed the linearized loss $\ell_t(p)=\langle\xi_t,p\rangle$ to AdaHedge
11. **end for**
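The queue-and-surrogate mechanics of Algorithm 2 can be sketched as below. Two simplifications are assumptions of this sketch and not part of the paper: plain exponential weights with a fixed learning rate `eta` (a hypothetical parameter) stands in for AdaHedge, and the experts' sizes and errors are passed in as precomputed lists. The subgradient used is $\xi_t=V\kappa\,\omega_t+\Phi'(\mathcal{Q}(t))\,\kappa\,e_t\,\mathbf{1}\{g_t(p_t)>0\}$ with $\Phi'(q)=\lambda e^{\lambda q}$. Since the subroutine is swapped, the guarantees stated for AdaHedge should not be assumed to carry over, and the parameter values in the demo are illustrative rather than the theorem's choices.

```python
import math
import random

def olcp_hedge(sizes, errs, alpha, V, kappa, lam, eta, seed=0):
    """sizes[t][i], errs[t][i]: set size and 0/1 miscoverage of expert i at round t."""
    K = len(sizes[0])
    rng = random.Random(seed)
    cum_loss = [0.0] * K  # cumulative linearized surrogate losses per expert
    queue = 0.0           # virtual queue Q(t) for excess miscoverage
    picks, probs = [], []
    for w_t, e_t in zip(sizes, errs):
        # Fixed-rate exponential weights (stand-in for AdaHedge), shifted for stability.
        base = min(cum_loss)
        raw = [math.exp(-eta * (c - base)) for c in cum_loss]
        total = sum(raw)
        p_t = [r / total for r in raw]
        probs.append(p_t)
        picks.append(rng.choices(range(K), weights=p_t)[0])  # sample I_t ~ p_t
        g_t = sum(e * p for e, p in zip(e_t, p_t)) - alpha   # expected excess miscoverage
        queue += kappa * max(g_t, 0.0)
        phi_prime = lam * math.exp(lam * queue)              # derivative of e^{lam q} - 1
        active = 1.0 if g_t > 0.0 else 0.0                   # subgradient of (g_t)_+ at p_t
        xi_t = [V * kappa * w + phi_prime * kappa * active * e for w, e in zip(w_t, e_t)]
        cum_loss = [c + x for c, x in zip(cum_loss, xi_t)]
    return picks, probs, queue

# Expert 0 is narrow but always misses; expert 1 is wider and always covers (made-up data).
sizes = [[1.0, 2.0]] * 5
errs = [[1, 0]] * 5
picks, probs, queue = olcp_hedge(sizes, errs, alpha=0.1, V=1.0, kappa=1.0, lam=1.0, eta=1.0)
print(probs[1], queue)
```

In this toy run the first round already makes the queue positive, so the miscoverage penalty on the narrow expert outweighs its size advantage and weight shifts toward the covering expert.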

#### Guarantee\.

The next theorem states that OLCP-Hedge competes with the best feasible expert mixture in expected size while keeping cumulative expected excess miscoverage sublinear. The proof is deferred to Appendix [E](https://arxiv.org/html/2605.05497#A5).

###### Theorem 3.2 (Size regret and excess-miscoverage control).

Assume Assumptions A–B. Run Algorithm [2](https://arxiv.org/html/2605.05497#alg2) with

$$V=1,\qquad C_{\mathrm{AH}}:=2\sqrt{4+\ln K},\qquad\kappa:=\frac{1}{\sqrt{2}\,C_{\mathrm{AH}}\,G},\qquad\lambda:=\frac{1}{2\sqrt{T}}.$$

Then, for any feasible comparator $u^{\star}\in\Delta_{K}$, we have the following expected set-size regret bound:

$$\sum_{t=1}^{T}\Bigl(\mathbb{E}_{I_{t}\sim p_{t}}[\omega_{t,I_{t}}]-\langle\omega_{t},u^{\star}\rangle\Bigr)\leq 4G\sqrt{2(4+\ln K)T}=O\!\left(G\sqrt{T(1+\ln K)}\right),$$

and the following cumulative excess-miscoverage bound:

$$\sum_{t=1}^{T}\Bigl(\mathbb{E}_{I_{t}\sim p_{t}}[\mathrm{err}_{t,I_{t}}]-\alpha\Bigr)_{+}\leq 4G\sqrt{2(4+\ln K)T}\;\ln\!\Bigl(2+\bigl(2+\tfrac{\sqrt{2}}{2}\bigr)T\Bigr)=\widetilde{O}\!\left(G\sqrt{T(1+\ln K)}\right).$$
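To get a feel for the constants in Theorem 3.2, the following sketch evaluates the prescribed parameters and the two right-hand sides for given $K$, $G$, and $T$. The function name `olcp_hedge_constants` and the example values are illustrative assumptions; $G$ is taken as the constant appearing in the theorem, whose definition lives in the assumptions rather than in this snippet.

```python
import math

def olcp_hedge_constants(K, G, T):
    """Evaluate the parameter choices and bounds of Theorem 3.2 (illustrative)."""
    c_ah = 2.0 * math.sqrt(4.0 + math.log(K))        # C_AH
    kappa = 1.0 / (math.sqrt(2.0) * c_ah * G)        # kappa
    lam = 1.0 / (2.0 * math.sqrt(T))                 # lambda
    # Right-hand side of the expected set-size regret bound.
    size_bound = 4.0 * G * math.sqrt(2.0 * (4.0 + math.log(K)) * T)
    # Right-hand side of the cumulative excess-miscoverage bound.
    excess_bound = size_bound * math.log(2.0 + (2.0 + math.sqrt(2.0) / 2.0) * T)
    return {"C_AH": c_ah, "kappa": kappa, "lam": lam,
            "size_bound": size_bound, "excess_bound": excess_bound}

# Example with hypothetical values: five experts, G = 1, horizon 1500.
constants = olcp_hedge_constants(K=5, G=1.0, T=1500)
```

Dividing either bound by $T$ shows the per-round quantity shrinking at roughly a $1/\sqrt{T}$ rate (up to the logarithmic factor in the second bound).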

#### Remark on the modularity of the subroutine\.

The OLCP-Hedge analysis is modular in the OCO subroutine, following the black-box philosophy of [[24](https://arxiv.org/html/2605.05497#bib.bib24)]. The proof only uses that, when the subroutine is fed the linearized surrogate losses $\ell_{t}(p)=\langle\xi_{t},p\rangle$, it satisfies a data-dependent regret bound of the form

$$\sum_{t=1}^{T}\langle\xi_{t},p_{t}-u\rangle\leq C\sqrt{\sum_{t=1}^{T}\|\xi_{t}\|_{\infty}^{2}},\qquad\forall u\in\Delta_{K}.$$

AdaHedge is one parameter-free expert algorithm with such a worst-case/adaptive guarantee [[5](https://arxiv.org/html/2605.05497#bib.bib5), [19](https://arxiv.org/html/2605.05497#bib.bib19)]. More broadly, related adaptive expert algorithms providing comparable data-dependent regret guarantees could be used in the same reduction after replacing $C_{\mathrm{AH}}$ by the corresponding constant or regret complexity [[6](https://arxiv.org/html/2605.05497#bib.bib6), [14](https://arxiv.org/html/2605.05497#bib.bib14), [20](https://arxiv.org/html/2605.05497#bib.bib20)]. Thus the theorem is not tied to AdaHedge itself.

## 4 Experiments

We evaluate OLCP and OLCP-Hedge on both synthetic experiments and real sequential prediction tasks. We first introduce our experimental setup, then present simulation and real-data results. Code is available here.

### 4.1 Experimental setup

We compare seven methods (see Section [2](https://arxiv.org/html/2605.05497#S2)): CP [[16](https://arxiv.org/html/2605.05497#bib.bib16)], LCP [[9](https://arxiv.org/html/2605.05497#bib.bib9)], ACI [[7](https://arxiv.org/html/2605.05497#bib.bib7)], DtACI [[8](https://arxiv.org/html/2605.05497#bib.bib8)], SPCI [[30](https://arxiv.org/html/2605.05497#bib.bib30)], and the proposed OLCP and OLCP-Hedge. All methods use the same base predictor, conformity scores, and rolling calibration window. For localized methods (LCP/OLCP/OLCP-Hedge), we use an exponential kernel

$$H_{h}(x,x^{\prime})=\exp\!\left(-\frac{\|\tilde{x}-\tilde{x}^{\prime}\|_{2}}{h}\right),$$

where covariates are standardized within the current calibration window before computing distances. The base bandwidth $h_{0}$ is chosen by a Silverman-style rule, and OLCP-Hedge aggregates the grid

$$h\in\{0.5,0.75,1,1.25,1.5\}\,h_{0}.$$

All calibration is online: at time $t$, each method uses only past conformity scores in the rolling window. Full implementation details, including bandwidth formulas, adaptive step sizes, and SPCI settings, are given in Appendix [F.1](https://arxiv.org/html/2605.05497#A6.SS1).
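The kernel and bandwidth grid above can be sketched as follows. This is a plain-Python illustration under stated assumptions: the helper names `standardize`, `kernel_weights`, and `bandwidth_grid` are hypothetical, standardization uses the population standard deviation of the window (the paper's exact convention is in Appendix F.1), and the base bandwidth `h0` is taken as an input rather than computed by the Silverman-style rule.

```python
import math

def standardize(window):
    """Column-wise mean and std of the calibration window (population std, an assumption)."""
    n, d = len(window), len(window[0])
    means = [sum(row[j] for row in window) / n for j in range(d)]
    stds = []
    for j in range(d):
        var = sum((row[j] - means[j]) ** 2 for row in window) / n
        stds.append(math.sqrt(var) or 1.0)  # guard against constant columns
    return means, stds

def kernel_weights(window, x_new, h):
    """Exponential-kernel weights H_h(x_new, x_i) for every row x_i in the window."""
    means, stds = standardize(window)
    z_new = [(x - m) / s for x, m, s in zip(x_new, means, stds)]
    weights = []
    for row in window:
        z = [(x - m) / s for x, m, s in zip(row, means, stds)]
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(z, z_new)))
        weights.append(math.exp(-dist / h))
    return weights

def bandwidth_grid(h0):
    """The five-point grid aggregated by OLCP-Hedge."""
    return [c * h0 for c in (0.5, 0.75, 1.0, 1.25, 1.5)]
```

Larger `h` flattens the weights toward a global calibration, while smaller `h` concentrates them on calibration points near the test covariate.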

All methods are evaluated using empirical coverage and average set size. For the prediction set (interval) $C_{t}(X_{t})=[L_{t}(X_{t}),U_{t}(X_{t})]$ produced by any of the methods above, we measure

$$\widehat{\mathrm{cov}}=\frac{1}{T}\sum_{t=1}^{T}\mathbf{1}\{Y_{t}\in C_{t}(X_{t})\},\qquad\widehat{\mathrm{size}}=\frac{1}{T}\sum_{t=1}^{T}\bigl(U_{t}(X_{t})-L_{t}(X_{t})\bigr).$$
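Both metrics are simple averages over the test period. A minimal sketch, assuming the intervals are stored as parallel lists of lower and upper endpoints (the function name `coverage_and_size` is hypothetical):

```python
def coverage_and_size(y, lower, upper):
    """Empirical coverage and average interval width over T rounds."""
    T = len(y)
    covered = sum(1 for yt, lo, hi in zip(y, lower, upper) if lo <= yt <= hi)
    width = sum(hi - lo for lo, hi in zip(lower, upper))
    return covered / T, width / T
```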

### 4.2 Simulations

We first use controlled synthetic experiments to isolate two failure modes of existing online conformal methods: covariate-dependent heterogeneity and abrupt temporal distribution shift. We simulate a univariate time series $\{Y_{t}\}_{t=0}^{T}$ and form a one-step-ahead prediction task with $X_{t}=Y_{t-1}$. A fixed linear predictor $\hat{f}$ is trained on the first $500$ observations, and all methods use the absolute residual score $S_{t}=|Y_{t}-\hat{f}(X_{t})|$. We use $T=1{,}500$, rolling calibration window $R=200$, target miscoverage $\alpha=0.1$, and $100$ Monte Carlo repetitions. All results are reported on the test set.

Let $\varepsilon_{t}\sim N(0,1)$. We consider three scenarios:

- A: Stationary: $Y_{t}=0.5Y_{t-1}+\varepsilon_{t}$.
- B: Heterogeneous: $Y_{t}=0.5Y_{t-1}+\sigma_{t}\varepsilon_{t}$, where $\sigma_{t}=\min\{\exp(0.25Y_{t-1}),10\}$.
- C: Change point: $Y_{t}=\phi_{t}Y_{t-1}+\varepsilon_{t}$, where $\phi_{t}=\begin{cases}0.8,&t\leq T/2,\\ -0.8,&t>T/2.\end{cases}$

Scenario A is a sanity check, Scenario B tests adaptation to covariate\-dependent noise, and Scenario C tests adaptation to abrupt temporal shift\.
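The three data-generating processes can be sketched directly from their definitions. This is an illustration only: the function name `simulate`, the seed handling, and the starting value $Y_{0}=0$ are assumptions not specified in the text.

```python
import math
import random

def simulate(scenario, T=1500, seed=0):
    """Generate Y_0, ..., Y_T for scenario 'A', 'B', or 'C' (Y_0 = 0 is an assumption)."""
    rng = random.Random(seed)
    y = [0.0]
    for t in range(1, T + 1):
        eps = rng.gauss(0.0, 1.0)
        prev = y[-1]
        if scenario == "A":      # stationary AR(1)
            y.append(0.5 * prev + eps)
        elif scenario == "B":    # noise scale depends on the covariate Y_{t-1}
            sigma = min(math.exp(0.25 * prev), 10.0)
            y.append(0.5 * prev + sigma * eps)
        elif scenario == "C":    # AR coefficient flips sign at T/2
            phi = 0.8 if t <= T / 2 else -0.8
            y.append(phi * prev + eps)
        else:
            raise ValueError("scenario must be 'A', 'B', or 'C'")
    return y

# One-step-ahead task: covariate X_t = Y_{t-1}, response Y_t.
series = simulate("B")
pairs = list(zip(series[:-1], series[1:]))
```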

Table 1: Simulation results over $100$ repetitions. Each entry reports mean (standard deviation). Boldface marks the smallest average size among methods whose coverage attains the $0.90$ target.

![Refer to caption](https://arxiv.org/html/2605.05497v1/x1.png)

Figure 1: Diagnostics for simulation. Left panels show Scenario B conditional coverage and average size across $X_{t}$. Right panels show Scenario C rolling coverage and rolling average size with window size $100$; the vertical dashed line marks the change point. Shaded bands show mean $\pm$ one standard deviation across repetitions. OLCP (DtACI) curves are partially hidden behind OLCP-Hedge (ACI).

#### Results.

Table [1](https://arxiv.org/html/2605.05497#S4.T1) summarizes marginal coverage and average prediction set size, with the running time of each method reported in Appendix [F.2](https://arxiv.org/html/2605.05497#A6.SS2). Scenario A is a sanity check: in the stationary homoskedastic case, there is little structure for either localization or adaptation to exploit, and the valid methods have similar sizes.

Scenario B isolates covariate-dependent heterogeneity. Global methods (CP, ACI, DtACI) achieve reasonable marginal coverage, but Figure [1](https://arxiv.org/html/2605.05497#S4.F1) shows that their prediction-set sizes are nearly constant in $X_{t}$, causing undercoverage in high-noise regions. LCP localizes the residual quantile but mildly undercovers due to its fixed nominal level. OLCP and OLCP-Hedge combine localization with online calibration, expanding sets where noise is high while maintaining near-nominal marginal coverage; OLCP-Hedge gives the smallest size among near-valid methods.

Scenario C isolates abrupt temporal shift\. CP and LCP recover slowly after the change point and undercover, while ACI and DtACI restore coverage mainly through global size inflation\. OLCP and OLCP\-Hedge recover to the target coverage level with substantially smaller sets, showing that localization improves efficiency under heterogeneous uncertainty and online calibration restores validity under temporal shift\.

### 4.3 Real-data experiments

We evaluate the proposed methods on three real time-series datasets; additional preprocessing details, model hyperparameters, data splits, and diagnostic plots are deferred to Appendix [F.3](https://arxiv.org/html/2605.05497#A6.SS3).

- ELEC2. ELEC2 contains electricity market prices, demands, and transfers from New South Wales and Victoria [[11](https://arxiv.org/html/2605.05497#bib.bib11)]. We predict transfer using the two states' prices and demands as covariates. After removing the initial constant-response segment, the dataset has $27{,}552$ rows; the base predictor is a gradient-boosted regression tree.
- ILINet. ILINet is a weekly CDC influenza-like illness surveillance dataset with $1{,}305$ observations [[4](https://arxiv.org/html/2605.05497#bib.bib4), [12](https://arxiv.org/html/2605.05497#bib.bib12)]. We predict the state-population-weighted weekly patient percentage using a lag window of $26$ past responses as covariates, with a temporal convolutional network (TCN) as the base predictor [[15](https://arxiv.org/html/2605.05497#bib.bib15)].
- ETF volatility. We forecast absolute daily log returns for five ETFs over the most recent $18$ years ($4{,}742$ trading days): SPY, QQQ, IWM, EEM, and TLT [[25](https://arxiv.org/html/2605.05497#bib.bib25)]. Covariates are a lag window of $30$ past observations and the lagged VIX index [[3](https://arxiv.org/html/2605.05497#bib.bib3)]; the base predictor is a TCN. Widths are reported in percentage points of absolute log return.

Table [2](https://arxiv.org/html/2605.05497#S4.T2) summarizes the real-data results, with running times in Appendix [F.2](https://arxiv.org/html/2605.05497#A6.SS2). Standard errors are computed using a block bootstrap over time, with block lengths $48/26/20$ for ELEC2/ILINet/ETF volatility, respectively, and $1000$ bootstrap replicates. Across datasets, fixed LCP often yields shorter intervals but can undercover, while ACI and DtACI restore nominal coverage at the cost of wider intervals. OLCP and OLCP-Hedge provide the strongest overall coverage–efficiency tradeoff: OLCP-Hedge is consistently near-nominal and is the most efficient among the near-valid methods.
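A block bootstrap standard error for a per-round statistic (for example the coverage indicator or the interval width) can be sketched as below. The text does not say which block-bootstrap variant is used, so this sketch assumes a moving-block bootstrap with uniformly drawn block starts; the function name `block_bootstrap_se` and its defaults are hypothetical.

```python
import random
import statistics

def block_bootstrap_se(values, block_len, n_boot=1000, seed=0):
    """Moving-block bootstrap standard error of the mean of a time-ordered series.

    Assumes len(values) >= block_len. Blocks of consecutive rounds are resampled
    with replacement and concatenated until the original length is reached.
    """
    rng = random.Random(seed)
    n = len(values)
    n_blocks = -(-n // block_len)   # ceiling division
    max_start = n - block_len
    means = []
    for _ in range(n_boot):
        sample = []
        for _ in range(n_blocks):
            start = rng.randint(0, max_start)
            sample.extend(values[start:start + block_len])
        sample = sample[:n]         # trim the overshoot from the last block
        means.append(sum(sample) / n)
    return statistics.stdev(means)
```

Resampling whole blocks rather than individual rounds preserves short-range temporal dependence within each block, which is the reason for using it on sequential data.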

The rolling diagnostics in Appendix [F.3](https://arxiv.org/html/2605.05497#A6.SS3) further clarify this pattern. Figures [2](https://arxiv.org/html/2605.05497#A6.F2)–[4](https://arxiv.org/html/2605.05497#A6.F4) show that ACI and DtACI tend to maintain coverage by raising interval sizes globally, whereas OLCP and OLCP-Hedge track the target with lower rolling size over much of the test period. The ETF volatility regime analysis in Table [4](https://arxiv.org/html/2605.05497#A6.T4) gives a more targeted view: in low-volatility periods, all near-valid methods over-cover, but OLCP and OLCP-Hedge achieve the smallest sizes among them. In high-volatility periods, OLCP attains the best trade-off, with coverage $0.890$ and size $4.004$, improving on both ACI ($0.890$, $4.349$) and DtACI ($0.889$, $4.184$). These diagnostics reinforce the central message: localization improves efficiency by adapting set size to the covariate regime, while online calibration maintains long-run coverage under temporal shift.

Table 2:Real\-data results\. Each entry reports mean \(block\-bootstrap standard error\)\. Coverage is empirical marginal coverage; Size is average interval width\. For ETF volatility, Size is reported as percentage points of absolute daily log return\. ELEC2 and ILINet sizes are in their normalized response scales\.

## 5 Conclusion and future work

Our results show that localization improves efficiency under heterogeneity, while online updating restores coverage under temporal shift. However, several limitations remain. Like other localized conformal methods [[9](https://arxiv.org/html/2605.05497#bib.bib9)], OLCP depends on informative covariates and a suitable distance metric. OLCP-Hedge assumes a uniformly feasible expert mixture, which may require conservative experts and can be restrictive in unbounded regression; it also uses full-information feedback and adds computational burden, since every expert must be evaluated at each round. Future work could extend constrained aggregation to step sizes, localizers, distance metrics, calibration windows, and base predictors, and study learned localization, bandit-feedback variants, and stronger local coverage guarantees.

## References

- [1] Anastasios N Angelopoulos, Stephen Bates, et al. Conformal prediction: A gentle introduction. Foundations and trends® in machine learning, 16(4):494–591, 2023.
- [2] Rina Foygel Barber, Emmanuel J Candes, Aaditya Ramdas, and Ryan J Tibshirani. Conformal prediction beyond exchangeability. The Annals of Statistics, 51(2):816–845, 2023.
- [3] Cboe Global Markets. Cboe Volatility Index (VIX). [https://www\.cboe\.com/tradable\_products/vix/](https://www.cboe.com/tradable_products/vix/), 2026.
- [4] Centers for Disease Control and Prevention. CDC FluView: Influenza-like illness surveillance. [https://www\.cdc\.gov/fluview/](https://www.cdc.gov/fluview/), 2024.
- [5] Steven De Rooij, Tim Van Erven, Peter D Grünwald, and Wouter M Koolen. Follow the leader if you can, hedge if you must. The Journal of Machine Learning Research, 15(1):1281–1316, 2014.
- [6] Pierre Gaillard, Gilles Stoltz, and Tim Van Erven. A second-order bound with excess losses. In Conference on Learning Theory, pages 176–196. PMLR, 2014.
- [7] Isaac Gibbs and Emmanuel Candes. Adaptive conformal inference under distribution shift. Advances in Neural Information Processing Systems, 34:1660–1672, 2021.
- [8] Isaac Gibbs and Emmanuel J Candès. Conformal inference for online prediction with arbitrary distribution shifts. Journal of Machine Learning Research, 25(162):1–36, 2024.
- [9] Leying Guan. Localized conformal prediction: A generalized inference framework for conformal prediction. Biometrika, 110(1):33–50, 2023.
- [10] Hengquan Guo, Xin Liu, Honghao Wei, and Lei Ying. Online convex optimization with hard constraints: Towards the best of two worlds and beyond. Advances in Neural Information Processing Systems, 35:36426–36439, 2022.
- [11] Michael Harries. Splice-2 comparative evaluation: Electricity pricing. Technical report, The University of New South Wales, 1999.
- [12] Julien Herzen, Francesco Lässig, Samuele Giuliano Piazzetta, Thomas Neuer, Léo Tafti, Guillaume Raille, Tomas Van Pottelbergh, Marek Pasieka, Andrzej Skrodzki, Nicolas Huguenin, Maxime Dumonal, Jan Kościsz, Dennis Bader, Frédérick Gusset, Mounir Benheddi, Camila Williamson, Michal Kosinski, Matej Petrik, and Gaël Grosch. Darts: User-friendly modern machine learning for time series. Journal of Machine Learning Research, 23(124):1–6, 2022.
- [13] Rohan Hore and Rina Foygel Barber. Conformal prediction with local weights: randomization enables robust guarantees. Journal of the Royal Statistical Society Series B: Statistical Methodology, 87(2):549–578, 2025.
- [14] Wouter M Koolen and Tim Van Erven. Second-order quantile methods for experts and combinatorial games. In Conference on Learning Theory, pages 1155–1175. PMLR, 2015.
- [15] Colin Lea, Rene Vidal, Austin Reiter, and Gregory D. Hager. Temporal convolutional networks: A unified approach to action segmentation, 2016.
- [16] Jing Lei, Max G’Sell, Alessandro Rinaldo, Ryan J Tibshirani, and Larry Wasserman. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523):1094–1111, 2018.
- [17] Ruiting Liang, Wanrong Zhu, and Rina Foygel Barber. Conformal prediction after efficiency-oriented model selection. arXiv preprint arXiv:2408.07066, 2024.
- [18] Tuo Liu, Edgar Dobriban, and Francesco Orabona. Online conformal prediction via universal portfolio algorithms. arXiv preprint arXiv:2602.03168, 2026.
- [19] Francesco Orabona. A modern introduction to online learning. arXiv preprint arXiv:1912.13213, 2019.
- [20] Francesco Orabona and Dávid Pál. Coin betting and parameter-free online learning. Advances in Neural Information Processing Systems, 29, 2016.
- [21] Yaniv Romano, Evan Patterson, and Emmanuel Candes. Conformalized quantile regression. Advances in neural information processing systems, 32, 2019.
- [22] David W. Scott. Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley, 1992.
- [23] Bernard W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall, 1986.
- [24] Abhishek Sinha and Rahul Vaze. Optimal algorithms for online convex optimization with adversarial constraints. Advances in Neural Information Processing Systems, 37:41274–41302, 2024.
- [25] Stooq. Stooq. [https://stooq\.com/](https://stooq.com/), 2026.
- [26] Sophia Sun and Rose Yu. Conformal prediction for time-series forecasting with change points. arXiv preprint arXiv:2509.02844, 2025.
- [27] Ryan J Tibshirani, Rina Foygel Barber, Emmanuel Candes, and Aaditya Ramdas. Conformal prediction under covariate shift. Advances in neural information processing systems, 32, 2019.
- [28] Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. Algorithmic learning in a random world. Springer, 2005.
- [29] Chen Xu and Yao Xie. Conformal prediction interval for dynamic time-series. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 11559–11569. PMLR, 18–24 Jul 2021.
- [30] Chen Xu and Yao Xie. Sequential predictive conformal inference for time series. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 38707–38727. PMLR, 23–29 Jul 2023.
- [31] Yachong Yang and Arun Kumar Kuchibhotla. Selection and aggregation of conformal prediction sets. Journal of the American Statistical Association, 120(549):435–447, 2025.
- [32] Margaux Zaffran, Olivier Féron, Yannig Goude, Julie Josse, and Aymeric Dieuleveut. Adaptive conformal predictions for time series. In International conference on machine learning, pages 25834–25866. PMLR, 2022.

## Appendix A Proof of Proposition [3.1](https://arxiv.org/html/2605.05497#S3.Thmtheorem1)

###### Proof.

By definition,

$$z_{t}=\alpha_{t}+\gamma(\alpha-\mathrm{err}_{t}),\qquad L_{t}=(-z_{t})_{+},\qquad U_{t}=(z_{t}-1)_{+}.$$

Since projection onto $[0,1]$ satisfies

$$\Pi_{[0,1]}(z_{t})=z_{t}+L_{t}-U_{t},$$

the projected update can be written as

$$\alpha_{t+1}=\alpha_{t}+\gamma(\alpha-\mathrm{err}_{t})+L_{t}-U_{t}.$$

Summing this identity over $t=1,\dots,T$ gives

$$\alpha_{T+1}-\alpha_{1}=\gamma\sum_{t=1}^{T}(\alpha-\mathrm{err}_{t})+\sum_{t=1}^{T}(L_{t}-U_{t}).$$

Rearranging yields

$$\sum_{t=1}^{T}(\mathrm{err}_{t}-\alpha)=\frac{\alpha_{1}-\alpha_{T+1}}{\gamma}+\frac{1}{\gamma}\sum_{t=1}^{T}(L_{t}-U_{t}),$$

which proves the claimed identity.

Since $\alpha_{T+1}\in[0,1]$, $L_{t}\geq 0$, and $U_{t}\geq 0$, we have

$$\sum_{t=1}^{T}(\mathrm{err}_{t}-\alpha)\leq\frac{\alpha_{1}}{\gamma}+\frac{1}{\gamma}\sum_{t=1}^{T}L_{t}.$$

Dividing by $T$ gives

$$\frac{1}{T}\sum_{t=1}^{T}\mathrm{err}_{t}-\alpha\leq\frac{\alpha_{1}}{T\gamma}+\frac{1}{T\gamma}\sum_{t=1}^{T}L_{t}.$$

If $\gamma>0$ is fixed and $\sum_{t=1}^{T}L_{t}=o(T)$, the right-hand side converges to zero. Hence

$$\limsup_{T\to\infty}\left(\frac{1}{T}\sum_{t=1}^{T}\mathrm{err}_{t}-\alpha\right)\leq 0.$$

For the lower deviation, the same identity gives

$$\sum_{t=1}^{T}(\alpha-\mathrm{err}_{t})=\frac{\alpha_{T+1}-\alpha_{1}}{\gamma}+\frac{1}{\gamma}\sum_{t=1}^{T}(U_{t}-L_{t}).$$

Using $\alpha_{T+1}\leq 1$, $L_{t}\geq 0$, and $U_{t}\geq 0$, we obtain

$$\sum_{t=1}^{T}(\alpha-\mathrm{err}_{t})\leq\frac{1-\alpha_{1}}{\gamma}+\frac{1}{\gamma}\sum_{t=1}^{T}U_{t}.$$

Dividing by $T$,

$$\alpha-\frac{1}{T}\sum_{t=1}^{T}\mathrm{err}_{t}\leq\frac{1-\alpha_{1}}{T\gamma}+\frac{1}{T\gamma}\sum_{t=1}^{T}U_{t}.$$

If $\sum_{t=1}^{T}(L_{t}+U_{t})=o(T)$, then in particular $\sum_{t=1}^{T}L_{t}=o(T)$ and $\sum_{t=1}^{T}U_{t}=o(T)$. Therefore both the upper and lower deviations vanish, and hence

$$\frac{1}{T}\sum_{t=1}^{T}\mathrm{err}_{t}\to\alpha.$$

∎
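Because the identity in the proof is pathwise, it can be checked numerically on any error sequence. The sketch below runs the projected update and accumulates the clipping terms; the function name `run_projected_update` is hypothetical, and the test sequence used with it is an arbitrary 0/1 sequence rather than errors produced by an actual OLCP run.

```python
def run_projected_update(errs, alpha, gamma, alpha1):
    """Run alpha_{t+1} = Proj_[0,1](alpha_t + gamma * (alpha - err_t)).

    Returns alpha_{T+1} together with the sums of the lower and upper
    clipping terms L_t = (-z_t)_+ and U_t = (z_t - 1)_+.
    """
    a = alpha1
    L_sum = 0.0
    U_sum = 0.0
    for err in errs:
        z = a + gamma * (alpha - err)   # unconstrained update z_t
        L_sum += max(-z, 0.0)           # amount clipped at the lower boundary
        U_sum += max(z - 1.0, 0.0)      # amount clipped at the upper boundary
        a = min(max(z, 0.0), 1.0)       # projection onto [0, 1]
    return a, L_sum, U_sum
```

For any input sequence, `sum(err - alpha)` should match `(alpha1 - a_end) / gamma + (L_sum - U_sum) / gamma` up to floating-point error.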

## Appendix B More discussion on boundary terms

This section expands on Proposition [3.1](https://arxiv.org/html/2605.05497#S3.Thmtheorem1). The proposition is an exact pathwise identity for the projected update

$$\alpha_{t+1}=\Pi_{[0,1]}\bigl(\alpha_{t}+\gamma(\alpha-\mathrm{err}_{t})\bigr).$$

The correction terms

$$L_{t}=(-z_{t})_{+},\qquad U_{t}=(z_{t}-1)_{+},\qquad z_{t}=\alpha_{t}+\gamma(\alpha-\mathrm{err}_{t}),$$

measure how much the unconstrained update would have moved outside the interval $[0,1]$. Thus $L_{t}$ is the amount of lower-boundary clipping, and $U_{t}$ is the amount of upper-boundary clipping.

#### Interpretation.

The term $L_{t}>0$ can occur only when $\mathrm{err}_{t}=1$ and

$$\alpha_{t}<\gamma(1-\alpha).$$

In words, OLCP misses while $\alpha_{t}$ is already near $0$. Since smaller $\alpha_{t}$ corresponds to wider prediction sets, $L_{t}$ records the amount by which the update would like to widen the set beyond the widest finite set allowed by the projected implementation. Similarly, $U_{t}>0$ can occur only when $\mathrm{err}_{t}=0$ and

$$\alpha_{t}>1-\gamma\alpha.$$

This records the amount by which the update would like to shrink the set beyond the smallest finite set allowed by the projection.

This is the key difference from the basic ACI telescoping guarantee. The original unprojected ACI update allows $\alpha_{t}$ to leave $[0,1]$, which can produce full/infinite or empty prediction sets at the boundary. This makes the recursion telescope exactly without correction terms. Projected OLCP instead keeps all prediction sets finite, so the price is the appearance of $L_{t}$ and $U_{t}$.

#### A sufficient condition.

We next give a simple sufficient condition under which the boundary corrections are negligible. For clarity, allow the step size to depend on the horizon and write $\gamma=\gamma_{T}$. Define

$$\eta_{T}^{-}:=\gamma_{T}(1-\alpha),\qquad\eta_{T}^{+}:=\gamma_{T}\alpha.$$

Since the prediction sets are decreasing in the nominal level, we have the deterministic inclusions

$$\{L_{t}>0\}\subseteq\bigl\{Y_{t}\notin C_{t}^{(h)}(X_{t};\eta_{T}^{-})\bigr\},$$

and

$$\{U_{t}>0\}\subseteq\bigl\{Y_{t}\in C_{t}^{(h)}(X_{t};1-\eta_{T}^{+})\bigr\}.$$

Moreover,

$$0\leq L_{t}\leq\gamma_{T}(1-\alpha)\,\mathbf{1}\{Y_{t}\notin C_{t}^{(h)}(X_{t};\eta_{T}^{-})\},$$

and

$$0\leq U_{t}\leq\gamma_{T}\alpha\,\mathbf{1}\{Y_{t}\in C_{t}^{(h)}(X_{t};1-\eta_{T}^{+})\}.$$

Therefore, a sufficient condition for

$$\sum_{t=1}^{T}L_{t}=o(T\gamma_{T})\quad\text{and}\quad\sum_{t=1}^{T}U_{t}=o(T\gamma_{T})$$

in expectation is

$$\frac{1}{T}\sum_{t=1}^{T}\mathbb{P}\!\left\{Y_{t}\notin C_{t}^{(h)}(X_{t};\eta_{T}^{-})\mid\mathcal{F}_{t-1}\right\}=o(1),$$

and

$$\frac{1}{T}\sum_{t=1}^{T}\mathbb{P}\!\left\{Y_{t}\in C_{t}^{(h)}(X_{t};1-\eta_{T}^{+})\mid\mathcal{F}_{t-1}\right\}=o(1).$$

These conditions say that boundary failures are rare: the near-widest finite set should not miss too often, and the near-smallest finite set should not cover too often.

#### Empirical diagnostic.

The boundary terms are directly observable during the run. In practice, one can report

$$\frac{1}{T\gamma}\sum_{t=1}^{T}L_{t}\qquad\text{and}\qquad\frac{1}{T\gamma}\sum_{t=1}^{T}U_{t}$$

as diagnostics. Small values indicate that projection is not materially affecting the long-run coverage behavior; large values indicate that the algorithm is frequently hitting the finite-set boundary, so the target coverage may not be attainable without wider or narrower endpoint sets.

## Appendix C Feasibility diagnostics for Assumption B

Assumption B requires a fixed mixture $u^{\star}\in\Delta_{K}$ satisfying $\langle e_{t},u^{\star}\rangle\leq\alpha$ on every round. Since OLCP-Hedge uses full-information feedback, this condition can be checked post hoc by the linear feasibility problem

$$\text{find }u\in\Delta_{K}\quad\text{such that}\quad\langle e_{t},u\rangle\leq\alpha,\qquad t=1,\dots,T.$$

Equivalently, one may solve the relaxed linear program

$$\min_{u\in\Delta_{K},\ \rho\geq 0}\ \rho\qquad\text{subject to}\qquad\langle e_{t},u\rangle-\alpha\leq\rho,\quad t=1,\dots,T.$$

If the optimum is $\rho=0$, Assumption B holds for the realized expert pool; otherwise $\rho$ quantifies the worst-round infeasibility of the best fixed mixture. When $\rho$ is large, the uniform-feasibility comparator required by Theorem [3.2](https://arxiv.org/html/2605.05497#S3.Thmtheorem2) is inappropriate, and a relaxed online constraint-satisfaction benchmark such as $S$-feasibility or $P_{T}$-constrained feasibility is more suitable.
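For a given candidate mixture, the constraint slack in the relaxed program is easy to evaluate. The sketch below computes the worst-round infeasibility of a fixed `u`; it does not solve the linear program itself, which would need an LP solver to minimize over `u`. The function name `worst_round_infeasibility` and the toy error matrix are hypothetical.

```python
def worst_round_infeasibility(err_matrix, u, alpha):
    """Smallest rho >= 0 with <e_t, u> - alpha <= rho for every round t.

    err_matrix[t][i] is the miscoverage indicator of expert i at round t;
    u is a candidate mixture on the simplex (not checked here).
    """
    rho = 0.0
    for e_t in err_matrix:
        excess = sum(e * w for e, w in zip(e_t, u)) - alpha
        rho = max(rho, excess)
    return rho

# Toy example with two experts that never miss on the same round.
errors = [[0, 1], [1, 0], [0, 0]]
rho_mixed = worst_round_infeasibility(errors, [0.5, 0.5], alpha=0.5)
rho_single = worst_round_infeasibility(errors, [1.0, 0.0], alpha=0.5)
```

A value of zero for some candidate certifies feasibility for the realized errors, whereas a positive value for one candidate says nothing about other mixtures.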

## Appendix D AdaHedge subroutine

We use AdaHedge as the black-box OCO subroutine over the probability simplex $\Delta_{K}$. Let

$$C_{\mathrm{AH}}:=2\sqrt{4+\ln K}.$$

Algorithm 3 AdaHedge on $\Delta_{K}$ (FTRL form; [[19](https://arxiv.org/html/2605.05497#bib.bib19)])

1: parameter $\alpha_{\mathrm{AH}}>0$

2: Initialize $\lambda_{1}\leftarrow 0,\qquad p_{1}\leftarrow(1/K,\dots,1/K)\in\Delta_{K},\qquad\theta_{1}\leftarrow 0\in\mathbb{R}^{K}$.

3: for $t=1,2,\dots,T$ do

4: Output $p_{t}\in\Delta_{K}$.

5: Observe a linear loss vector $\xi_{t}\in\mathbb{R}^{K}$ and incur loss $\langle\xi_{t},p_{t}\rangle$.

6: Update $\theta_{t+1}\leftarrow\theta_{t}-\xi_{t}$.

7: if $t=1$ then

8: $\delta_{t}\leftarrow\langle\xi_{1},p_{1}\rangle-\min_{j\in[K]}\xi_{1,j}$.

9: else

10: $\delta_{t}\leftarrow\lambda_{t}\ln\!\left(\sum_{j=1}^{K}p_{t,j}\exp(-\xi_{t,j}/\lambda_{t})\right)+\langle\xi_{t},p_{t}\rangle$.

11: end if

12: Update $\lambda_{t+1}\leftarrow\lambda_{t}+\delta_{t}/\alpha_{\mathrm{AH}}^{2}$.

13: Update $p_{t+1,j}\propto\exp(\theta_{t+1,j}/\lambda_{t+1}),\qquad j=1,\dots,K$.

14: end for
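Algorithm 3 can be sketched in plain Python as follows. Two details are assumptions of this sketch rather than part of the listing: whenever $\lambda_{t}=0$ (not only at $t=1$) the mixability gap uses the limiting form with the minimum taken over the support of $p_{t}$, and when $\lambda_{t+1}=0$ the next weights are uniform over the maximizers of $\theta_{t+1}$. The log-sum-exp is shifted by that minimum for numerical stability. The function name `adahedge` is hypothetical.

```python
import math

def adahedge(loss_vectors, alpha_ah):
    """Run AdaHedge (FTRL form) on a list of length-K loss vectors.

    Returns the list of weight vectors p_1, ..., p_T that were played.
    """
    K = len(loss_vectors[0])
    lam = 0.0                      # lambda_1
    theta = [0.0] * K              # theta_1
    p = [1.0 / K] * K              # p_1
    played = []
    for xi in loss_vectors:
        played.append(p)
        dot = sum(x * q for x, q in zip(xi, p))
        # Minimum loss over experts with positive weight (used as a stable shift).
        m = min(x for x, q in zip(xi, p) if q > 0)
        if lam == 0.0:
            delta = dot - m        # lambda -> 0 limit of the mixability gap
        else:
            s = sum(q * math.exp(-(x - m) / lam) for x, q in zip(xi, p))
            delta = dot - m + lam * math.log(s)
        delta = max(delta, 0.0)    # guard against tiny negative rounding error
        theta = [th - x for th, x in zip(theta, xi)]
        lam = lam + delta / alpha_ah ** 2
        top = max(theta)
        if lam == 0.0:
            # Assumption: spread mass uniformly over the current leaders.
            w = [1.0 if th == top else 0.0 for th in theta]
        else:
            w = [math.exp((th - top) / lam) for th in theta]
        total = sum(w)
        p = [v / total for v in w]
    return played
```

On a toy sequence where one expert always has the smaller loss, the weights should concentrate on that expert within a few rounds.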

###### Theorem D.1 (AdaHedge bound on $\Delta_{K}$).

Run Algorithm [3](https://arxiv.org/html/2605.05497#alg3) with

$$\alpha_{\mathrm{AH}}=\sqrt{\ln K}.$$

Then for any sequence $\{\xi_{t}\}_{t=1}^{T}\subset\mathbb{R}^{K}$, the iterates $p_{t}\in\Delta_{K}$ satisfy, for all $u\in\Delta_{K}$,

$$\sum_{t=1}^{T}\langle\xi_{t},p_{t}-u\rangle\leq C_{\mathrm{AH}}\sqrt{\sum_{t=1}^{T}\|\xi_{t}\|_{\infty}^{2}}.$$

###### Proof.

This is the parameter-free AdaHedge guarantee obtained by optimizing the standard FTRL/AdaHedge bound; see [[19](https://arxiv.org/html/2605.05497#bib.bib19), Section 7]. We use the $\ell_{\infty}$-norm form because the decision set is the simplex and the losses are linear in $p$. ∎

## Appendix EProof of Theorem[3\.2](https://arxiv.org/html/2605.05497#S3.Thmtheorem2)

###### Proof\.

We prove a slightly stronger prefix version: the stated bounds hold for everym≤Tm\\leq T\. The proof follows the Sinha–Vaze Lyapunov reduction for constrained online convex optimization, specialized to the simplexΔK\\Delta\_\{K\}and to the OLCP expert set size and miscoverage constraint\.

Recall that

ft​\(p\)=⟨ωt,p⟩,gt​\(p\)=⟨et,p⟩−α,f\_\{t\}\(p\)=\\langle\\omega\_\{t\},p\\rangle,\\qquad g\_\{t\}\(p\)=\\langle e\_\{t\},p\\rangle\-\\alpha,whereωt=\(ωt,1,…,ωt,K\)\\omega\_\{t\}=\(\\omega\_\{t,1\},\\dots,\\omega\_\{t,K\}\)is the vector of expert sizes andet=\(errt,1,…,errt,K\)e\_\{t\}=\(\\mathrm\{err\}\_\{t,1\},\\dots,\\mathrm\{err\}\_\{t,K\}\)is the vector of expert miscoverage indicators\. We use the preprocessed size loss and excess\-miscoverage constraint

f~t​\(p\):=κ​ft​\(p\),g~t​\(p\):=κ​\(gt​\(p\)\)\+\.\\tilde\{f\}\_\{t\}\(p\):=\\kappa f\_\{t\}\(p\),\\qquad\\tilde\{g\}\_\{t\}\(p\):=\\kappa\(g\_\{t\}\(p\)\)\_\{\+\}\.The virtual queue is initialized at𝒬​\(0\)=0\\mathcal\{Q\}\(0\)=0and updated as

𝒬​\(t\):=𝒬​\(t−1\)\+g~t​\(pt\)\.\\mathcal\{Q\}\(t\):=\\mathcal\{Q\}\(t\-1\)\+\\tilde\{g\}\_\{t\}\(p\_\{t\}\)\.The potential is

Φ​\(q\)=eλ​q−1,Φ′​\(q\)=λ​eλ​q,\\Phi\(q\)=e^\{\\lambda q\}\-1,\\qquad\\Phi^\{\\prime\}\(q\)=\\lambda e^\{\\lambda q\},and the surrogate loss is

f^t​\(p\):=V​f~t​\(p\)\+Φ′​\(𝒬​\(t\)\)​g~t​\(p\)\.\\hat\{f\}\_\{t\}\(p\):=V\\tilde\{f\}\_\{t\}\(p\)\+\\Phi^\{\\prime\}\(\\mathcal\{Q\}\(t\)\)\\tilde\{g\}\_\{t\}\(p\)\.In Theorem[3\.2](https://arxiv.org/html/2605.05497#S3.Thmtheorem2), we setV=1V=1\.

Step 1: Drift inequality and decomposition.

Fix any prefix length $m \leq T$. By convexity of $\Phi$ and the queue recursion

$$\mathcal{Q}(t) = \mathcal{Q}(t-1) + \tilde{g}_t(p_t),$$

we have

$$\Phi(\mathcal{Q}(t)) - \Phi(\mathcal{Q}(t-1)) \leq \Phi'(\mathcal{Q}(t))\,\tilde{g}_t(p_t).$$

Summing from $t = 1$ to $m$ and using $\Phi(\mathcal{Q}(0)) = 0$, we get

$$\Phi(\mathcal{Q}(m)) \leq \sum_{t=1}^{m} \Phi'(\mathcal{Q}(t))\,\tilde{g}_t(p_t).$$

Let $u^\star \in \Delta_K$ be any feasible comparator from Assumption B. Then

$$g_t(u^\star) \leq 0 \qquad\Rightarrow\qquad (g_t(u^\star))_+ = 0 \qquad\Rightarrow\qquad \tilde{g}_t(u^\star) = 0.$$

Since $V = 1$,

$$\sum_{t=1}^{m} \bigl(\hat{f}_t(p_t) - \hat{f}_t(u^\star)\bigr) = \sum_{t=1}^{m} \bigl(\tilde{f}_t(p_t) - \tilde{f}_t(u^\star)\bigr) + \sum_{t=1}^{m} \Phi'(\mathcal{Q}(t))\,\tilde{g}_t(p_t).$$

Therefore,

$$\Phi(\mathcal{Q}(m)) + \sum_{t=1}^{m} \bigl(\tilde{f}_t(p_t) - \tilde{f}_t(u^\star)\bigr) \leq \sum_{t=1}^{m} \bigl(\hat{f}_t(p_t) - \hat{f}_t(u^\star)\bigr). \tag{4}$$

Step 2: Apply AdaHedge to the linearized surrogate losses.

For each $t$, choose any

$$\xi_t \in \partial \hat{f}_t(p_t).$$

By convexity,

$$\hat{f}_t(p_t) - \hat{f}_t(u^\star) \leq \langle \xi_t, p_t - u^\star \rangle.$$

Thus, by Theorem [D.1](https://arxiv.org/html/2605.05497#A4.Thmtheorem1) applied to the prefix $1, \dots, m$,

$$\sum_{t=1}^{m} \bigl(\hat{f}_t(p_t) - \hat{f}_t(u^\star)\bigr) \leq C_{\mathrm{AH}} \sqrt{\sum_{t=1}^{m} \|\xi_t\|_\infty^2}. \tag{5}$$

Step 3: Bound the surrogate gradients.

Since

$$\hat{f}_t(p) = \tilde{f}_t(p) + \Phi'(\mathcal{Q}(t))\,\tilde{g}_t(p),$$

and

$$\tilde{f}_t(p) = \kappa f_t(p), \qquad \tilde{g}_t(p) = \kappa\,(g_t(p))_+,$$

Assumption A implies that we may choose $\xi_t \in \partial \hat{f}_t(p_t)$ such that

$$\|\xi_t\|_\infty \leq \kappa G \bigl(1 + \Phi'(\mathcal{Q}(t))\bigr).$$

Hence, using $(a+b)^2 \leq 2(a^2 + b^2)$,

$$\sum_{t=1}^{m} \|\xi_t\|_\infty^2 \leq 2\kappa^2 G^2 \left(m + \sum_{t=1}^{m} \bigl(\Phi'(\mathcal{Q}(t))\bigr)^2\right).$$

Because $\mathcal{Q}(t)$ is nondecreasing and $\Phi'$ is nondecreasing,

$$\Phi'(\mathcal{Q}(t)) \leq \Phi'(\mathcal{Q}(m)), \qquad t \leq m.$$

Therefore,

$$\sum_{t=1}^{m} \bigl(\Phi'(\mathcal{Q}(t))\bigr)^2 \leq m \bigl(\Phi'(\mathcal{Q}(m))\bigr)^2.$$

Plugging this into ([5](https://arxiv.org/html/2605.05497#A5.E5)) gives

$$\sum_{t=1}^{m} \bigl(\hat{f}_t(p_t) - \hat{f}_t(u^\star)\bigr) \leq \sqrt{2}\,C_{\mathrm{AH}}\,\kappa G \sqrt{m}\,\bigl(1 + \Phi'(\mathcal{Q}(m))\bigr). \tag{6}$$

Step 4: Size-regret bound.

Combining ([4](https://arxiv.org/html/2605.05497#A5.E4)) and ([6](https://arxiv.org/html/2605.05497#A5.E6)), and using

$$\Phi(q) = e^{\lambda q} - 1, \qquad \Phi'(q) = \lambda e^{\lambda q},$$

yields

$$e^{\lambda \mathcal{Q}(m)} - 1 + \sum_{t=1}^{m} \bigl(\tilde{f}_t(p_t) - \tilde{f}_t(u^\star)\bigr) \leq \sqrt{2}\,C_{\mathrm{AH}}\,\kappa G \sqrt{m}\,\bigl(1 + \lambda e^{\lambda \mathcal{Q}(m)}\bigr).$$

With

$$\kappa = (\sqrt{2}\,C_{\mathrm{AH}}\,G)^{-1},$$

the prefactor $\sqrt{2}\,C_{\mathrm{AH}}\,\kappa G$ is equal to $1$, so

$$e^{\lambda \mathcal{Q}(m)} - 1 + \sum_{t=1}^{m} \bigl(\tilde{f}_t(p_t) - \tilde{f}_t(u^\star)\bigr) \leq \sqrt{m} + \lambda \sqrt{m}\,e^{\lambda \mathcal{Q}(m)}.$$

Rearranging,

$$\sum_{t=1}^{m} \bigl(\tilde{f}_t(p_t) - \tilde{f}_t(u^\star)\bigr) \leq \sqrt{m} + 1 + (\lambda \sqrt{m} - 1)\,e^{\lambda \mathcal{Q}(m)}.$$

With

$$\lambda = \frac{1}{2\sqrt{T}} \qquad\text{and}\qquad m \leq T,$$

we have $\lambda \sqrt{m} \leq 1/2$, so the last term is nonpositive. Therefore,

$$\sum_{t=1}^{m} \bigl(\tilde{f}_t(p_t) - \tilde{f}_t(u^\star)\bigr) \leq \sqrt{m} + 1 \leq 2\sqrt{m}.$$

Since $\tilde{f}_t = \kappa f_t$,

$$\sum_{t=1}^{m} \bigl(f_t(p_t) - f_t(u^\star)\bigr) \leq 2\kappa^{-1} \sqrt{m}.$$

Using

$$\kappa^{-1} = \sqrt{2}\,C_{\mathrm{AH}}\,G = 2G\sqrt{2(4 + \ln K)},$$

we obtain

$$\sum_{t=1}^{m} \bigl(f_t(p_t) - f_t(u^\star)\bigr) \leq 4G\sqrt{2(4 + \ln K)\,m}.$$

Finally, since

$$f_t(p_t) = \mathbb{E}_{I_t \sim p_t}[\omega_{t,I_t}], \qquad f_t(u^\star) = \langle \omega_t, u^\star \rangle,$$

we have, for every $m \leq T$,

$$\sum_{t=1}^{m} \Bigl(\mathbb{E}_{I_t \sim p_t}[\omega_{t,I_t}] - \langle \omega_t, u^\star \rangle\Bigr) \leq 4G\sqrt{2(4 + \ln K)\,m}.$$

Step 5: Cumulative excess-miscoverage bound.

By Assumption A and the fact that the $\ell_1$-diameter of $\Delta_K$ is $2$,

$$\tilde{f}_t(p_t) - \tilde{f}_t(u^\star) \geq -2\kappa G.$$

Thus,

$$\sum_{t=1}^{m} \bigl(\tilde{f}_t(p_t) - \tilde{f}_t(u^\star)\bigr) \geq -2\kappa G m.$$

Plugging this lower bound into the inequality from Step 4 before dropping the exponential term gives

$$e^{\lambda \mathcal{Q}(m)} (1 - \lambda \sqrt{m}) \leq 1 + \sqrt{m} + 2\kappa G m.$$

Therefore,

$$\mathcal{Q}(m) \leq \frac{1}{\lambda} \ln\!\left(\frac{1 + \sqrt{m} + 2\kappa G m}{1 - \lambda \sqrt{m}}\right).$$

For $\lambda = 1/(2\sqrt{T})$ and $m \leq T$,

$$1 - \lambda \sqrt{m} \geq \frac{1}{2},$$

so

$$\mathcal{Q}(m) \leq \frac{1}{\lambda} \ln\!\Bigl(2(1 + \sqrt{m} + 2\kappa G m)\Bigr) \leq \frac{1}{\lambda} \ln\!\Bigl(2 + (2 + 4\kappa G)\,m\Bigr),$$

where we used $1 + \sqrt{m} \leq 1 + m$. Since

$$\kappa G = \frac{1}{\sqrt{2}\,C_{\mathrm{AH}}}, \qquad C_{\mathrm{AH}} = 2\sqrt{4 + \ln K} \geq 4,$$

we have

$$4\kappa G = \frac{4}{\sqrt{2}\,C_{\mathrm{AH}}} \leq \frac{\sqrt{2}}{2}.$$

Hence,

$$\mathcal{Q}(m) \leq \frac{1}{\lambda} \ln\!\Bigl(2 + \bigl(2 + \tfrac{\sqrt{2}}{2}\bigr) m\Bigr).$$

Finally, since

$$\mathcal{Q}(m) = \sum_{t=1}^{m} \kappa\,(g_t(p_t))_+,$$

we obtain

$$\sum_{t=1}^{m} (g_t(p_t))_+ = \kappa^{-1} \mathcal{Q}(m) \leq \frac{\kappa^{-1}}{\lambda} \ln\!\Bigl(2 + \bigl(2 + \tfrac{\sqrt{2}}{2}\bigr) m\Bigr).$$

Using

$$\frac{\kappa^{-1}}{\lambda} = 4G\sqrt{2(4 + \ln K)\,T},$$

we get, for every $m \leq T$,

$$\sum_{t=1}^{m} (g_t(p_t))_+ \leq 4G\sqrt{2(4 + \ln K)\,T}\,\ln\!\Bigl(2 + \bigl(2 + \tfrac{\sqrt{2}}{2}\bigr) m\Bigr).$$

Since

$$g_t(p_t) = \mathbb{E}_{I_t \sim p_t}[\mathrm{err}_{t,I_t}] - \alpha,$$

this proves

$$\sum_{t=1}^{m} \Bigl(\mathbb{E}_{I_t \sim p_t}[\mathrm{err}_{t,I_t}] - \alpha\Bigr)_+ \leq 4G\sqrt{2(4 + \ln K)\,T}\,\ln\!\Bigl(2 + \bigl(2 + \tfrac{\sqrt{2}}{2}\bigr) m\Bigr).$$

The proof is complete. ∎
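To get a feel for the scale of the two prefix bounds just proved, the following sketch evaluates their right-hand sides numerically. The function name `olcp_hedge_bounds` is hypothetical; the formulas are copied from the final displays of Steps 4 and 5.

```python
import math

def olcp_hedge_bounds(G, K, T, m):
    """Right-hand sides of the size-regret and excess-miscoverage bounds."""
    if not 1 <= m <= T:
        raise ValueError("require 1 <= m <= T")
    c = 4.0 * G * math.sqrt(2.0 * (4.0 + math.log(K)))
    size_regret = c * math.sqrt(m)                       # Step 4 bound
    excess_miscoverage = (c * math.sqrt(T)               # Step 5 bound
                          * math.log(2.0 + (2.0 + math.sqrt(2.0) / 2.0) * m))
    return size_regret, excess_miscoverage
```

Dividing either value by `m` gives the corresponding per-round average. At `m = T` both averages shrink as `T` grows, at rates of order $1/\sqrt{T}$ and $\ln(T)/\sqrt{T}$ respectively.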

## Appendix F Experiments

### F.1 Methods

This section describes the implementation used in the synthetic and real-data experiments. All methods are implemented as online wrappers around a base predictor. In the regression experiments, the conformity score is the absolute residual

$$S_t = |Y_t - \hat{Y}_t|.$$

#### Online calibration windows.

Each series is processed independently. Let

$$\mathcal{T}_{\mathrm{test}} = (\tau_1, \dots, \tau_{T_{\mathrm{test}}})$$

be the ordered test indices for a given series. At test position $j$, the calibration window contains at most the $R$ previous test-time observations:

$$\mathcal{I}_j = \{\tau_{\max(1, j-R)}, \dots, \tau_{j-1}\}.$$

The first test point, whose calibration window is empty, is skipped. This convention ensures that all calibration is online and uses no future outcomes.
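As a small illustration of the window convention, the sketch below returns the 1-based test positions that make up the window at position $j$. The helper name `calibration_window` is hypothetical.

```python
def calibration_window(j, R):
    """Return the 1-based test positions in the calibration window at position j.

    Mirrors I_j = {tau_max(1, j-R), ..., tau_{j-1}} in terms of positions.
    The result is empty for j = 1, which is the skipped first test point.
    """
    return list(range(max(1, j - R), j))
```

For example, with `R = 3` the window at position 10 covers positions 7, 8, and 9, while early positions simply use whatever history is available.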

#### CP.

CP uses the finite-sample corrected rolling quantile of the previous calibration scores. If $r = |\mathcal{I}_j|$, then

$$k = \left\lceil (1 - \alpha)(r + 1) \right\rceil,$$

with $k$ clipped to $\{1, \dots, r\}$, and

$$q_t^{\text{CP}} = \text{the } k\text{-th order statistic of } \{S_i : i \in \mathcal{I}_j\}.$$

The prediction set is $[\hat{Y}_t - q_t^{\text{CP}},\ \hat{Y}_t + q_t^{\text{CP}}]$.
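The CP radius described above can be sketched as follows. The function name `rolling_cp_radius` is hypothetical, and raising on an empty window is an assumption standing in for the "skip" convention.

```python
import math
import numpy as np

def rolling_cp_radius(scores, alpha):
    """Finite-sample corrected quantile of the calibration scores."""
    scores = np.sort(np.asarray(scores, dtype=float))
    r = scores.size
    if r == 0:
        raise ValueError("empty calibration window")
    k = math.ceil((1.0 - alpha) * (r + 1))
    k = min(max(k, 1), r)      # clip k to {1, ..., r}
    return scores[k - 1]       # k-th order statistic (1-based)
```

The interval is then `[y_hat - q, y_hat + q]` with `q` the returned radius. Clipping `k` to `r` means that very short windows fall back to the largest observed score instead of an infinite radius.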

#### Localized weighted quantiles.

LCP and OLCP use weighted empirical quantiles. For a query covariate $x$, the calibration covariates in $\mathcal{I}_j$ are standardized coordinatewise using their empirical mean and standard deviation:

$$Z_i = \frac{X_i - \bar{X}_{\mathcal{I}_j}}{\hat{\sigma}_{\mathcal{I}_j}}, \qquad Z_x = \frac{x - \bar{X}_{\mathcal{I}_j}}{\hat{\sigma}_{\mathcal{I}_j}},$$

where zero or numerically unstable standard deviations are replaced by $1$. We then use the exponential localizer

$$H_h(x, X_i) = \exp\!\left(-\frac{\|Z_i - Z_x\|_2}{h}\right).$$

The weighted quantile is computed by sorting calibration scores and accumulating the corresponding sorted weights. If the total weight is numerically zero, the implementation falls back to uniform weights.
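A minimal NumPy sketch of this localized weighted quantile follows. It is illustrative only: the name `localized_quantile`, the tolerance `eps`, the use of the population standard deviation, and the convention "smallest score whose cumulative weight reaches the level" are assumptions that the text does not pin down.

```python
import numpy as np

def localized_quantile(x, X_cal, S_cal, h, level, eps=1e-12):
    """Weighted empirical quantile of S_cal at `level`, localized at x.

    X_cal has shape (r, d), S_cal has shape (r,), and x has shape (d,).
    """
    X_cal = np.asarray(X_cal, dtype=float)
    S_cal = np.asarray(S_cal, dtype=float)
    mu = X_cal.mean(axis=0)
    sd = X_cal.std(axis=0)
    sd = np.where(sd > eps, sd, 1.0)                 # guard zero/unstable std
    Z = (X_cal - mu) / sd
    zx = (np.asarray(x, dtype=float) - mu) / sd
    w = np.exp(-np.linalg.norm(Z - zx, axis=1) / h)  # exponential localizer
    if w.sum() <= eps:                               # fall back to uniform weights
        w = np.ones_like(w)
    w = w / w.sum()
    order = np.argsort(S_cal)
    cum = np.cumsum(w[order])
    idx = np.searchsorted(cum, level, side="left")   # first index with cum >= level
    return S_cal[order][min(idx, len(S_cal) - 1)]
```

A large `h` makes the weights nearly uniform, so the result approaches the unweighted quantile. A small `h` concentrates the weight on calibration points whose standardized covariates are close to the query.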

The base bandwidth is chosen by a Silverman-style rule,

$$h_0 = \left(\frac{4}{d+2}\right)^{1/(d+4)} R^{-1/(d+4)} \sqrt{d},$$

where $d$ is the covariate dimension. The factor $\left(4/(d+2)\right)^{1/(d+4)} R^{-1/(d+4)}$ is the classical multivariate kernel bandwidth scaling \[[23](https://arxiv.org/html/2605.05497#bib.bib23),[22](https://arxiv.org/html/2605.05497#bib.bib22)\]. Because covariates are standardized within each calibration window before computing Euclidean distances, the additional $\sqrt{d}$ factor matches the typical scale of distances in $d$ standardized dimensions. OLCP-Hedge further reduces sensitivity to this heuristic by aggregating a grid of multiplicative bandwidths around $h_0$.
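The bandwidth rule is a one-line computation, sketched here with a hypothetical helper name `base_bandwidth`.

```python
import math

def base_bandwidth(d, R):
    """Silverman-style base bandwidth h0 for d standardized covariates and window size R."""
    return (4.0 / (d + 2)) ** (1.0 / (d + 4)) * R ** (-1.0 / (d + 4)) * math.sqrt(d)
```

With `d = 4` and `R = 100`, the formula evaluates to roughly 1.07. Because the exponent is $-1/(d+4)$, the dependence on `R` weakens quickly as the dimension grows.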

#### LCP.

LCP uses the localized empirical score distribution at the current covariate $X_t$,

$$D_t^{(h_0)}(X_t) = \sum_{i \in \mathcal{I}_j} w_{t,i}^{(h_0)}(X_t)\,\delta_{S_i}, \qquad w_{t,i}^{(h_0)}(X_t) = \frac{H_{h_0}(X_t, X_i)}{\sum_{r \in \mathcal{I}_j} H_{h_0}(X_t, X_r)}.$$

The fixed-level localized radius is

$$q_t^{\text{LCP}} = Q\!\left(1 - \alpha;\ D_t^{(h_0)}(X_t)\right),$$

and the prediction set is

$$[\hat{Y}_t - q_t^{\text{LCP}},\ \hat{Y}_t + q_t^{\text{LCP}}].$$

#### ACI.

ACI uses the rolling unweighted empirical score distribution, but replaces the fixed nominal level $\alpha$ by an adaptive level $\alpha_t$. For a test stream of length $T_{\mathrm{test}}$, the default step size is

$$\gamma = \frac{1}{2\sqrt{T_{\mathrm{test}}}}.$$

At time $t$, ACI forms the rolling quantile at level $1 - \alpha_t$. After observing $Y_t$, it updates

$$\alpha_{t+1} = \Pi_{[0,1]}\bigl(\alpha_t + \gamma(\alpha - \mathrm{err}_t)\bigr), \qquad \mathrm{err}_t = \mathbf{1}\{Y_t \notin \widehat{C}_t\},$$

with initialization $\alpha_1 = \alpha$. In the experiments, we project $\alpha_t$ back onto $[0,1]$ to prevent infinite prediction sets, which allows a cleaner comparison of set sizes.
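The projected update can be written in a few lines. The helper names `default_step` and `aci_update` are hypothetical.

```python
import math

def default_step(T_test):
    """Default step size gamma = 1 / (2 * sqrt(T_test))."""
    return 1.0 / (2.0 * math.sqrt(T_test))

def aci_update(alpha_t, err_t, alpha, gamma):
    """Projected ACI step: clip(alpha_t + gamma * (alpha - err_t)) to [0, 1]."""
    return min(1.0, max(0.0, alpha_t + gamma * (alpha - err_t)))
```

A miscoverage (`err_t = 1`) lowers the level and hence widens the next set, while a covered point raises the level slightly. The clipping keeps the quantile level `1 - alpha_t` inside `[0, 1]`.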

#### DtACI.

DtACI aggregates multiple ACI experts with different step sizes. Let

$$\gamma_0 = \frac{1}{2\sqrt{T_{\mathrm{test}}}}, \qquad \Gamma = \{0.25, 0.5, 0.75, 1, 1.25, 1.5\}\,\gamma_0.$$

Each expert $r$ maintains its own adaptive level $\alpha_t^{(r)}$. The mixture level is

$$\alpha_t^{\mathrm{DtACI}} = \sum_r p_{t,r}\,\alpha_t^{(r)}.$$

At time $t$, the method computes the empirical rank statistic

$$\beta_t = \frac{1}{|\mathcal{I}_j|} \sum_{i \in \mathcal{I}_j} \mathbf{1}\{S_i \geq S_t\}.$$

Expert weights are updated using exponential weights on the pinball loss

$$\ell(\beta_t, \alpha_t^{(r)}) = \alpha(\beta_t - \alpha_t^{(r)}) - \min\{0, \beta_t - \alpha_t^{(r)}\}.$$

Following \[[8](https://arxiv.org/html/2605.05497#bib.bib8)\], the implementation uses

$$I_{\mathrm{size}} = 500, \qquad \sigma_{\mathrm{dt}} = \frac{1}{2 I_{\mathrm{size}}},$$

and

$$\eta_{\mathrm{dt}} = \sqrt{\frac{3}{I_{\mathrm{size}}}}\,\sqrt{\frac{\log(I_{\mathrm{size}}|\Gamma|) + 2}{\bigl((1-\alpha)^2\alpha^3 + \alpha^2(1-\alpha)^3\bigr)/3}}.$$

The expert levels are then updated by their own projected ACI recursions.
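The pinball loss and the two hyperparameters can be computed directly from the displayed formulas. This sketch uses hypothetical function names and does not reproduce the full exponential-weights update.

```python
import math

def pinball_loss(beta_t, alpha_r, alpha):
    """ell(beta, a) = alpha * (beta - a) - min(0, beta - a)."""
    diff = beta_t - alpha_r
    return alpha * diff - min(0.0, diff)

def dtaci_hyperparameters(alpha, n_experts, I_size=500):
    """Return (sigma_dt, eta_dt) from the formulas in the text."""
    sigma = 1.0 / (2.0 * I_size)
    denom = ((1 - alpha) ** 2 * alpha ** 3 + alpha ** 2 * (1 - alpha) ** 3) / 3.0
    eta = math.sqrt(3.0 / I_size) * math.sqrt((math.log(I_size * n_experts) + 2.0) / denom)
    return sigma, eta
```

With the six step sizes in $\Gamma$, `n_experts` is 6. The loss is asymmetric: for the same gap, an expert level above $\beta_t$ is penalized with weight $1 - \alpha$, and a level below it with weight $\alpha$.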

#### SPCI.

SPCI is implemented as a residual-forecasting conformal baseline. Let

$$\varepsilon_t = Y_t - \hat{Y}_t$$

denote the base-model residual on the online test stream. For each test time, we form a lag vector of the previous $w_{\mathrm{lag}}$ residuals and fit a quantile random forest to predict the next residual. The default parameters are

$$w_{\mathrm{lag}} = 24, \qquad T_{\mathrm{train}} = R, \qquad \texttt{refit\_every} = 24, \qquad \texttt{beta\_grid\_size} = 21.$$

The quantile random forest uses

$$\texttt{n\_estimators} = 80, \quad \texttt{max\_depth} = 10, \quad \texttt{min\_samples\_leaf} = 5, \quad \texttt{random\_state} = 42, \quad \texttt{n\_jobs} = -1.$$

For a grid

$$\beta \in \left\{0, \frac{\alpha}{20}, \frac{2\alpha}{20}, \dots, \alpha\right\},$$

SPCI predicts residual quantiles at levels $\beta$ and $1 - \alpha + \beta$, and selects the value of $\beta$ giving the narrowest prediction set. The final prediction set is

$$[\hat{Y}_t + \hat{q}_{\beta},\ \hat{Y}_t + \hat{q}_{1-\alpha+\beta}].$$
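The $\beta$-selection step can be sketched independently of the forest. In this illustration the callable `residual_quantile` is a hypothetical stand-in for the fitted quantile random forest, and `narrowest_interval` is a hypothetical name; ties are broken toward the smallest $\beta$, which is an assumption.

```python
import numpy as np

def narrowest_interval(y_hat, residual_quantile, alpha, grid_size=21):
    """Pick beta on the grid {0, alpha/20, ..., alpha} minimizing interval width.

    residual_quantile: callable mapping a level in [0, 1] to a predicted
    residual quantile (a stand-in for the quantile random forest).
    """
    betas = np.linspace(0.0, alpha, grid_size)
    lo = np.array([residual_quantile(b) for b in betas])
    hi = np.array([residual_quantile(1.0 - alpha + b) for b in betas])
    best = int(np.argmin(hi - lo))       # first minimizer on ties
    return y_hat + lo[best], y_hat + hi[best], betas[best]
```

For a right-skewed residual distribution the search tends to select a small $\beta$, placing the miscoverage mass in the long upper tail, and the reverse holds for a left-skewed one.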

#### OLCP.

OLCP uses the localized family $C_t^{(h)}(X_t; \beta)$ from Section [3.1](https://arxiv.org/html/2605.05497#S3.SS1). In the experiments, $h = h_0$. At time $t$, OLCP forms

$$\widehat{C}_t = C_t^{(h_0)}(X_t; \alpha_t),$$

where the quantile is the localized weighted quantile at level $1 - \alpha_t$. After observing $Y_t$, OLCP updates

$$\alpha_{t+1} = \Pi_{[0,1]}\bigl(\alpha_t + \gamma(\alpha - \mathrm{err}_t)\bigr), \qquad \mathrm{err}_t = \mathbf{1}\{Y_t \notin \widehat{C}_t\},$$

with

$$\gamma = \frac{1}{2\sqrt{T_{\mathrm{test}}}}.$$

#### OLCP-Hedge.

OLCP-Hedge aggregates OLCP experts over the bandwidth grid

$$h_i \in \{0.5, 0.75, 1, 1.25, 1.5\}\,h_0, \qquad i = 1, \dots, 5.$$

Expert $i$ maintains its own adaptive level $\alpha_{t,i}$ and outputs

$$C_{t,i}(X_t) = C_t^{(h_i)}(X_t; \alpha_{t,i}).$$

In regression, the set-size functional is interval width, so

$$\omega_{t,i} = \operatorname{width}(C_{t,i}(X_t)),$$

and the expert miscoverage indicator is

$$e_{t,i} := \mathbf{1}\{Y_t \notin C_{t,i}(X_t)\}.$$

Each expert level is updated by

$$\alpha_{t+1,i} = \Pi_{[0,1]}\bigl(\alpha_{t,i} + \gamma(\alpha - e_{t,i})\bigr),$$

with

$$\gamma = \frac{1}{2\sqrt{T_{\mathrm{test}}}}.$$

At round $t$, the expert cost is the prediction set size $\omega_{t,i}$, min-max normalized to $[0,1]$ across experts, and all other parameters are chosen according to Section [3.2](https://arxiv.org/html/2605.05497#S3.SS2) with $G = 1$.
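The per-round expert bookkeeping (normalized size costs, miscoverage indicators, and level updates) can be sketched as below. The function name `hedge_expert_round` is hypothetical, and assigning zero cost to all experts when their widths coincide is an assumption, since the text does not say how that case is handled.

```python
import numpy as np

def hedge_expert_round(widths, covered, alpha_levels, alpha, gamma):
    """Return min-max normalized costs, miscoverage indicators, and updated levels."""
    widths = np.asarray(widths, dtype=float)
    err = 1.0 - np.asarray(covered, dtype=float)     # e_{t,i} = 1{Y_t not in C_{t,i}}
    span = widths.max() - widths.min()
    # Assumption: equal widths give every expert zero cost.
    cost = (widths - widths.min()) / span if span > 0 else np.zeros_like(widths)
    levels = np.asarray(alpha_levels, dtype=float) + gamma * (alpha - err)
    return cost, err, np.clip(levels, 0.0, 1.0)      # projection onto [0, 1]
```

The normalized costs lie in $[0,1]$, which is consistent with taking $G = 1$. The costs and indicators returned here are the per-round inputs $\omega_t$ and $e_t$ to the aggregation over experts.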

### F.2 Running time on experiments

All experiments were run on a MacBook Pro. Neural network predictors were trained using Apple’s MPS backend when available. The conformal runtime table was measured on the same machine and reports only the online conformal calibration/evaluation step; it excludes data loading, base predictor training, and forecast precomputation.

Table 3: Running time comparison. Entries report wall-clock time in seconds for the conformal calibration/evaluation step only. For simulation, we report the mean running time across 100 repetitions.

### F.3 Additional details and diagnostics for real-data experiments

This section provides implementation details and additional diagnostics for the real-data experiments in Section [4.3](https://arxiv.org/html/2605.05497#S4.SS3). We evaluate the same seven methods as in the synthetic experiments: CP, LCP, ACI, DtACI, SPCI, OLCP, and OLCP-Hedge. All methods are evaluated online with target miscoverage $\alpha = 0.1$ using rolling calibration windows. Reported sizes correspond to interval width for ELEC2 and ILINet; for ETF volatility, sizes are multiplied by $100$ and reported in percentage points of absolute log return.

#### Implementation details.

All experiments use a fixed point predictor followed by online conformal calibration. The point predictor is trained only on the training split and is not updated during conformal evaluation. The conformal methods use symmetric intervals of the form

$$[\widehat{y}_t - q_t,\ \widehat{y}_t + q_t],$$

with conformity score $S_t = |Y_t - \widehat{y}_t|$. SPCI is implemented as a residual-autoregressive baseline using a sliding training window comparable to the conformal calibration window.

- ELEC2. ELEC2 contains electricity market prices, demands, and transfers from New South Wales and Victoria \[[11](https://arxiv.org/html/2605.05497#bib.bib11)\]. We use the normalized ELEC2 file and remove the initial constant-response segment, leaving 27,552 observations. The response is electricity transfer, and the covariates are nswprice, nswdemand, vicprice, and vicdemand. We keep the full half-hourly sequence. The first 70% of observations are used to train a fixed gradient-boosted regression tree predictor, implemented as HistGradientBoostingRegressor with maximum depth 6, learning rate 0.05, 400 boosting iterations, and random seed 42. Conformal methods are evaluated on the remaining 30% of the sequence with calibration window $R = 100$. For localized methods, the localization feature is the four-dimensional covariate vector above. For SPCI, we use residual lag length 24, training window $R$, refit frequency 24, and 21 candidate $\beta$-values.
- ILINet. ILINet is a weekly CDC influenza-like illness surveillance dataset \[[4](https://arxiv.org/html/2605.05497#bib.bib4),[12](https://arxiv.org/html/2605.05497#bib.bib12)\]. We use the weighted ILI component, which contains 1,305 weekly observations from October 1997 to October 2022. Missing values are filled by interpolation followed by forward/backward filling. The series is split chronologically into 70% training, 10% validation, and 20% testing, giving 913, 130, and 262 observations, respectively. The response is standardized using the training mean and standard deviation, and intervals are constructed on this standardized scale. The base predictor is a temporal convolutional network (TCN) \[[15](https://arxiv.org/html/2605.05497#bib.bib15)\] with input length 26, output length 1, batch size 32, at most 200 epochs, kernel size 5, 8 filters, dilation base 2, dropout 0.2, and Adam learning rate $10^{-3}$. We use early stopping on validation loss with patience 3, minimum improvement $10^{-3}$, gradient clipping at 0.1, and the best checkpoint. One-step forecasts are computed by historical forecasting with forecast_horizon=1, stride=1, and no retraining. For localized methods, $X_t$ is the lag window of the previous 26 standardized ILI values. The calibration window is $R = 52$. For SPCI, we use residual lag length 8, training window $R$, refit frequency 1, and 21 candidate $\beta$-values.
- ETF volatility. We forecast daily volatility proxies for five ETFs: SPY, QQQ, IWM, EEM, and TLT \[[25](https://arxiv.org/html/2605.05497#bib.bib25)\]. Daily closing prices are read from downloaded Stooq files, and the VIX index is obtained from FRED/CBOE \[[3](https://arxiv.org/html/2605.05497#bib.bib3)\]. The data are aligned to a business-day grid by forward-filling prices and VIX values. The response is the absolute daily log return $Y_t = |\log P_t - \log P_{t-1}|$, and the sample runs from January 2008 to March 2026. Each ETF series has 4,742 business-day observations after alignment. We use a chronological split with training through 2018, validation over 2019, and testing from 2020 onward, giving 2,868, 261, and 1,613 observations per asset. Each ETF volatility series is standardized using its own training mean and standard deviation; VIX is standardized using the VIX training split. The base predictor is a TCN trained jointly on the five standardized ETF volatility series, with input length 30, output length 1, batch size 256, at most 80 epochs, kernel size 5, 8 filters, dilation base 2, dropout 0.2, and Adam learning rate $10^{-3}$. We include cyclic calendar encoders for day of week and month. Early stopping monitors validation loss with patience 5, minimum improvement $10^{-3}$, gradient clipping at 0.1, and the best checkpoint. For localized methods, $X_t$ consists of the previous 30 standardized volatility values concatenated with the lagged standardized VIX value, so the localization dimension is 31. The calibration window is $R = 200$. Since the TCN is trained on standardized responses, widths are converted back to the original absolute-log-return scale using the asset-specific training standard deviation and reported as $100 \times$ width. For SPCI, we use residual lag length 30, training window $R$, refit frequency 5, and 21 candidate $\beta$-values.

#### Rolling diagnostics.

Figures [2](https://arxiv.org/html/2605.05497#A6.F2)–[4](https://arxiv.org/html/2605.05497#A6.F4) show rolling coverage and rolling average size on the three real datasets. In each figure, the top panel shows rolling coverage and the bottom panel shows rolling average size. A desirable method stays close to the dashed $0.90$ coverage line in the top panel while having a lower curve in the bottom panel.

![Refer to caption](https://arxiv.org/html/2605.05497v1/x2.png)

Figure 2: ELEC2 rolling diagnostics. Top: rolling coverage using a one-week window ($48 \times 7$ half-hourly observations). Bottom: rolling average interval size using the same window. The dashed line marks target coverage $0.90$.

![Refer to caption](https://arxiv.org/html/2605.05497v1/x3.png)

Figure 3: ILINet rolling diagnostics. Top: rolling coverage over weekly test observations. Bottom: rolling average interval size. The horizontal dashed line marks target coverage $0.90$, while the vertical line marks the start of COVID.

![Refer to caption](https://arxiv.org/html/2605.05497v1/x4.png)

Figure 4: ETF volatility rolling diagnostics. Top: rolling coverage over daily test observations. Bottom: rolling average interval size, reported in percentage points of absolute log return. The dashed line marks target coverage $0.90$.

Across the rolling diagnostics, SPCI consistently produces much smaller sets but is also persistently below the coverage target, indicating that its residual autoregression is not sufficiently calibrated in these nonstationary streams. ACI and DtACI tend to recover coverage by increasing sizes globally. OLCP and OLCP-Hedge are more efficient: their rolling sizes are generally below those of the global adaptive baselines while their rolling coverage remains close to the target, especially outside the most extreme stress periods.

#### Volatility-regime diagnostics for ETF data.

Table [4](https://arxiv.org/html/2605.05497#A6.T4) stratifies ETF volatility performance by the current VIX level. This diagnostic evaluates whether methods adapt to market-volatility regimes rather than only achieving marginal coverage.

Table 4: ETF volatility diagnostics by VIX regime. Low- and high-VIX regimes are defined as the bottom and top quartiles of VIX over the online test period: $\mathrm{VIX} \leq 15.87$ and $\mathrm{VIX} \geq 23.84$, respectively. Sizes are reported in percentage points of absolute log return. $N$ is the number of evaluated ETF-day prediction points in each regime; SPCI has fewer low-VIX points because of its additional residual-lag warm-up.

| Method | Low VIX: Coverage | Low VIX: Size | Low VIX: $N$ | High VIX: Coverage | High VIX: Size | High VIX: $N$ |
|---|---|---|---|---|---|---|
| CP | 0.925 | 2.044 | 2015 | 0.854 | 3.835 | 2015 |
| LCP | 0.920 | 1.900 | 2015 | 0.846 | 3.356 | 2015 |
| ACI | 0.913 | 1.851 | 2015 | 0.890 | 4.349 | 2015 |
| DtACI | 0.915 | 1.854 | 2015 | 0.889 | 4.184 | 2015 |
| SPCI | 0.862 | 1.693 | 1895 | 0.792 | 3.272 | 2015 |
| OLCP | 0.912 | 1.833 | 2015 | 0.890 | 4.004 | 2015 |
| OLCP-Hedge | 0.913 | 1.824 | 2015 | 0.888 | 4.007 | 2015 |

The VIX-stratified table shows that high-volatility periods are substantially harder: all methods have lower conditional coverage when VIX is high. Global adaptive methods improve high-VIX coverage by inflating sizes, whereas OLCP and OLCP-Hedge achieve comparable or slightly better high-VIX coverage with smaller sizes than ACI and DtACI. SPCI remains the narrowest method but undercovers severely in both regimes. This supports the main empirical conclusion: localization improves efficiency, online calibration helps maintain validity, and their combination gives the best overall coverage–size tradeoff.
