Adversarially Robust Control of Conditional Value-at-Risk via Rockafellar-Uryasev Conformal Inference
Summary
This paper presents an online, distribution-free framework for controlling Conditional Value-at-Risk (CVaR) in adversarial and non-stationary environments, with asymptotic guarantees and applications in portfolio risk management and LLM toxicity mitigation.
# Adversarially Robust Control of Conditional Value-at-Risk Via Rockafellar-Uryasev Conformal Inference
Source: [https://arxiv.org/html/2606.00320](https://arxiv.org/html/2606.00320)
###### Abstract
We present an online, distribution-free framework for controlling the Conditional Value-at-Risk ($\operatorname{CVaR}$), extending conformal tail risk control to non-stationary and adversarial environments. Unlike classical risk control methods, which rely on stationarity or linearity of expectation, our approach provides provable safety guarantees for a nonlinear tail risk functional under arbitrary data-generating processes that may drift or shift strategically over time. By leveraging deep connections between conformal tail risk control, online learning, and the variational representation of $\operatorname{CVaR}$ introduced by Rockafellar and Uryasev ([2000](https://arxiv.org/html/2606.00320#bib.bib14)), we develop a novel procedure for online $\operatorname{CVaR}$ control with adversarial regret guarantees. The proposed method operates without assumptions on the underlying data-generating process, making it broadly applicable in modern high-stakes deployment settings. We prove that the realized empirical $\operatorname{CVaR}$ is asymptotically controlled at the target level, and that the resulting control is asymptotically tight up to a finite-sample $O(1/\sqrt{T})$ conservatism gap. We demonstrate the effectiveness of our approach on portfolio risk management and toxicity mitigation for Large Language Models (LLMs), where rare but catastrophic failures dominate system risk.
Machine Learning, ICML
## 1 Introduction
Modern machine learning systems are increasingly deployed in high-stakes and safety-critical settings, including financial decision-making, automated content moderation, and large language model (LLM) alignment. In such applications, controlling the average risk is not enough to ensure system reliability, because a single catastrophic error can dominate operational, legal, or reputational risk. This has led to growing interest in *tail risk control* (Snell et al., [2022](https://arxiv.org/html/2606.00320#bib.bib19); Zollo et al., [2024b](https://arxiv.org/html/2606.00320#bib.bib5); Chen et al., [2025](https://arxiv.org/html/2606.00320#bib.bib18)), where the goal is to design learning and decision-making systems whose worst-case outcomes remain within acceptable limits.
Figure 1: Online $\operatorname{CVaR}$ control via Rockafellar-Uryasev conformal inference. Left: Uncontrolled tail risk in adversarial or non-stationary environments leads to catastrophic failures. Middle: Our method combines Conformal Decision Theory with AdaGrad-FTRL. Right: The resulting controller provably enforces the target $\operatorname{CVaR}$ level with adversarial guarantees.
Conditional Value-at-Risk ($\operatorname{CVaR}_\beta$) is a canonical and widely adopted measure of tail risk, capturing the expected loss conditioned on the worst $100(1-\beta)\%$ of outcomes (Rockafellar and Uryasev, [2000](https://arxiv.org/html/2606.00320#bib.bib14)). Unlike average risk measures, $\operatorname{CVaR}$ explicitly focuses on rare but catastrophic events, making it suitable for high-stakes, adversarial, and non-stationary environments in which failures could be costly and strategically induced. For instance, in finance, $\operatorname{CVaR}$ is a fundamental concept in portfolio optimization and regulatory risk management.
In modern machine learning systems, an analogous notion arises naturally. For instance, in LLM deployment, $\operatorname{CVaR}$ measures the expected severity of the most harmful generations. Critically, many deployment environments for LLMs are inherently non-stationary. Therefore, the ability to control $\operatorname{CVaR}$ in an online manner is essential. Overall, $\operatorname{CVaR}_\beta$ provides a principled objective for stress testing and red-teaming, as demonstrated in existing work on adversarial prompting and automated test-case generation for uncovering model vulnerabilities (Perez et al., [2022](https://arxiv.org/html/2606.00320#bib.bib12)).
In this work, we address whether $\operatorname{CVaR}_\beta$ can be controlled in an online and adversarial environment. Unlike expected loss, $\operatorname{CVaR}_\beta$ is a nonlinear, tail-sensitive risk functional that depends on the extreme quantiles of the loss distribution. Guarding against rare but catastrophic outcomes renders tail risk control fundamentally costly, as the events that characterize $\operatorname{CVaR}_\beta$ are precisely those that are least frequently observed.
To date, existing online risk control and conformal calibration methods fail to tackle the intrinsic nonlinearity of $\operatorname{CVaR}_\beta$ (Gibbs and Candes, [2021](https://arxiv.org/html/2606.00320#bib.bib16); Yang et al., [2024](https://arxiv.org/html/2606.00320#bib.bib17); Lekeufack et al., [2024](https://arxiv.org/html/2606.00320#bib.bib15)). In particular, widely used procedures such as Adaptive Conformal Inference (ACI) (Gibbs and Candes, [2021](https://arxiv.org/html/2606.00320#bib.bib16)), Bellman Conformal Inference (BCI) (Yang et al., [2024](https://arxiv.org/html/2606.00320#bib.bib17)), and Conformal Decision Theory (CDT) (Lekeufack et al., [2024](https://arxiv.org/html/2606.00320#bib.bib15)) fundamentally rely on linearity of expectation and therefore break down for $\operatorname{CVaR}_\beta$. While recent conformal tail risk control methods can provide statistical guarantees for $\operatorname{CVaR}_\beta$ via L-statistics and uniform bounds (Chen et al., [2025](https://arxiv.org/html/2606.00320#bib.bib18); Snell et al., [2022](https://arxiv.org/html/2606.00320#bib.bib19); Deng et al., [2023](https://arxiv.org/html/2606.00320#bib.bib8); Zollo et al., [2024a](https://arxiv.org/html/2606.00320#bib.bib6); Deng et al., [2025](https://arxiv.org/html/2606.00320#bib.bib7)), these guarantees do not extend to online, non-stationary, or adversarial regimes. Overcoming these limitations requires new techniques that explicitly exploit the variational structure of $\operatorname{CVaR}_\beta$.
In this work, we develop an online, distribution-free framework for tail risk control that provides provable safety guarantees under non-stationary and adversarial data. Our approach builds on the Rockafellar-Uryasev (RU) variational representation of $\operatorname{CVaR}$ (Rockafellar and Uryasev, [2000](https://arxiv.org/html/2606.00320#bib.bib14)) and reduces online tail risk control to a two-level online optimization problem, combining Conformal Decision Theory (Lekeufack et al., [2024](https://arxiv.org/html/2606.00320#bib.bib15)) with AdaGrad-Follow the Regularized Leader (AdaGrad-FTRL) (McMahan, [2011](https://arxiv.org/html/2606.00320#bib.bib1)). The resulting procedure requires no distributional assumptions, and provably controls the realized empirical $\operatorname{CVaR}$ at the target level, with a vanishing tightness gap.
#### Contributions\.
We summarize our main contributions as follows:
- We formulate the problem of *online $\operatorname{CVaR}$ control* under non-stationary and adversarial data, extending existing conformal and risk calibration methods that rely on stationarity or linearity of expectation.
- We show that the RU variational representation enables a principled reduction of nonlinear tail risk control to an online regret minimization problem over an auxiliary threshold parameter.
- We provide a *provable safety guarantee*: the empirical $\operatorname{CVaR}$ of the realized sequence is asymptotically controlled at the target level, even under adversarial data. Moreover, the resulting control is asymptotically tight up to a finite-sample $O(1/\sqrt{T})$ conservatism gap.
- We demonstrate the practical effectiveness of our method on LLM toxicity control and portfolio management under distribution shift.
### 1.1 Related Work
Rockafellar-Uryasev (RU) Representation. Rockafellar and Uryasev ([2000](https://arxiv.org/html/2606.00320#bib.bib14)) introduce a variational representation of $\operatorname{CVaR}_\beta$ that has become the foundation for risk-sensitive optimization, transforming a tail risk objective into a tractable optimization problem over an auxiliary threshold parameter. Several recent works have highlighted the central role of this variational form in offline robust decision-making under distribution shift and data bias (Sahoo et al., [2025](https://arxiv.org/html/2606.00320#bib.bib20); Lei et al., [2023](https://arxiv.org/html/2606.00320#bib.bib47)). In contrast, our setting is fundamentally online: data arrive sequentially, distributions may drift or be chosen adaptively, and the goal is not merely to optimize a static $\operatorname{CVaR}$ objective, but to dynamically control tail risk over time. Our results therefore complement and extend this line of work by bringing $\operatorname{CVaR}$-based risk control into the sequential, adversarial regime.
Conformal Decision Theory. In Conformal Decision Theory (CDT) (Lekeufack et al., [2024](https://arxiv.org/html/2606.00320#bib.bib15)), a single calibration parameter suffices to control the expected loss of a broad class of decisions, provided a suitable “safeguard” action is available. Similar to conformal risk control (Bates et al., [2021](https://arxiv.org/html/2606.00320#bib.bib48); Angelopoulos et al., [2025a](https://arxiv.org/html/2606.00320#bib.bib49), [c](https://arxiv.org/html/2606.00320#bib.bib50)), their technique does not work for nonlinear risk functionals like CVaR. In contrast, our work adopts a fundamentally different strategy from CDT.
Gradient Equilibrium. Angelopoulos et al. ([2025b](https://arxiv.org/html/2606.00320#bib.bib22)) argue that the standard notion of regret is not directly relevant to online uncertainty quantification or risk control problems and propose gradient equilibrium as an alternative. By contrast, in our setting, while regret is not the final objective, it plays a crucial role. We demonstrate that by minimizing a carefully chosen regret objective, one can provably control $\operatorname{CVaR}_\beta$ at a desired level in adversarial and non-stationary environments. This deepens the connection between online learning and online risk control.
Adaptive FTRL (AdaGrad-FTRL). A central challenge presented by CVaR control is a secondary online convex optimization problem over the RU threshold variable $(c_t)$. While many no-regret online optimization algorithms are available, we seek a method that is stable, compatible with constrained continuous action variables, and does not require manual tuning of a learning rate for the threshold update. To this end, we adopt an adaptive Follow the Regularized Leader (FTRL) update with AdaGrad-style scaling (McMahan, [2011](https://arxiv.org/html/2606.00320#bib.bib1); Shalev-Shwartz, [2012](https://arxiv.org/html/2606.00320#bib.bib3); Duchi et al., [2011](https://arxiv.org/html/2606.00320#bib.bib2)). At each round, the threshold variable is selected by minimizing the cumulative surrogate loss with a quadratic regularization. This adaptive scaling automatically adjusts the effective step size based on observed loss behavior, yielding a stable and self-tuning update rule while preserving standard no-regret guarantees.
## 2 Rockafellar-Uryasev Conformal Inference
We present our two-level online learning procedure for controlling $\operatorname{CVaR}_\beta$. Section [2.1](https://arxiv.org/html/2606.00320#S2.SS1) discusses the online setting in which our algorithm operates. In Section [2.2](https://arxiv.org/html/2606.00320#S2.SS2), we develop the connection between online CVaR control and online regret minimization for an extended game by leveraging the RU representation (Rockafellar and Uryasev, [2000](https://arxiv.org/html/2606.00320#bib.bib14)). We discuss the AdaGrad-FTRL method in Section [2.3](https://arxiv.org/html/2606.00320#S2.SS3), followed by the main theoretical guarantee. In Sections [2.4](https://arxiv.org/html/2606.00320#S2.SS4) and [2.5](https://arxiv.org/html/2606.00320#S2.SS5), we analyze the inner and outer updates of the algorithm, respectively, and discuss their theoretical properties.
### 2.1 Setting
We consider an online setting indexed by time $t=1,2,\ldots,T$. At each time point $t$, the decision-maker chooses a calibration parameter $\lambda_t$ and Nature follows by choosing a loss function $R_t(\lambda)$. We assume that all functions $R_t$ map $[\lambda_{\min},\lambda_{\max}]$ to $[R_{\min},R_{\max}]$ for some $-\infty<\lambda_{\min}<\lambda_{\max}<\infty$ and $-\infty<R_{\min}<R_{\max}<\infty$. The decision-maker's choice can rely on $(R_{t-1}(\cdot),\lambda_{t-1},R_{t-2}(\cdot),\lambda_{t-2},\ldots)$ and Nature's choice can depend on the decision-maker's information set at time $t$ together with $\lambda_t$. The realized loss incurred by the decision-maker at the end of time $t$ is $R_t(\lambda_t)$, where we extend the domain of $R_t$ to the entire real line with
$$R_t(\lambda)=\begin{cases}R_{\min},&\lambda<\lambda_{\min},\\ R_{\max},&\lambda>\lambda_{\max}.\end{cases}$$
In Section [3](https://arxiv.org/html/2606.00320#S3), we consider two applications in portfolio management and toxicity control for LLMs. In both examples, $\lambda_t$ is a scalar in $[0,1]$, so the decision-maker only needs to choose a scalar in each round.
### 2.2 Bounding CVaR by Regret
Unlike expectation, $\operatorname{CVaR}_\beta$ is a *nonlinear* functional of the loss distribution and cannot be written as a simple average of per-round losses. As a consequence, classical notions of regret based on summing per-round losses do not apply directly.
Given a sequence of realized losses $R_1,\ldots,R_T$, we define the empirical CVaR through the Rockafellar-Uryasev variational representation:
$$\widehat{\operatorname{CVaR}}_\beta(R_{1:T}):=\min_{c\in\mathbb{R}}\left\{c+\frac{1}{1-\beta}\cdot\frac{1}{T}\sum_{t=1}^{T}(R_t-c)_+\right\}.$$
Since the losses are bounded in $[R_{\min},R_{\max}]$, the minimization may equivalently be restricted to $c\in[R_{\min},R_{\max}]$.
In the absence of ties, $\widehat{\operatorname{CVaR}}_\beta(R_{1:T})$ can be approximated by the average of the top $100(1-\beta)\%$ of losses, i.e.,
$$\widehat{\operatorname{CVaR}}_\beta(R_{1:T})=\frac{1}{\lfloor T(1-\beta)\rfloor}\sum_{i>\lceil T\beta\rceil}R_{(i)}+O\left(\frac{1}{T}\right),$$
where $R_{(1)}\leq R_{(2)}\leq\ldots\leq R_{(T)}$ are the order statistics. The remainder term $O(1/T)$ vanishes when $T\beta$ is an integer.
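The two expressions above can be compared numerically. The following is a minimal sketch, not the authors' code: the function names `empirical_cvar_ru` and `tail_average` are hypothetical, and the RU minimization is done by brute force over the observed losses (valid because the objective is convex and piecewise linear in $c$, with kinks only at the observed losses).

```python
import numpy as np

def empirical_cvar_ru(losses, beta):
    """Empirical CVaR via the Rockafellar-Uryasev minimization.

    The objective c + mean((R - c)_+) / (1 - beta) is convex and piecewise
    linear in c, so a minimizer is attained at one of the observed losses.
    """
    r = np.asarray(losses, dtype=float)
    candidates = np.unique(r)
    # Evaluate the RU objective at every candidate threshold c (O(T^2) memory).
    excess = np.maximum(r[None, :] - candidates[:, None], 0.0).mean(axis=1)
    return float(np.min(candidates + excess / (1.0 - beta)))

def tail_average(losses, beta):
    """Average of the largest floor(T * (1 - beta)) losses."""
    r = np.sort(np.asarray(losses, dtype=float))
    # Small tolerance guards against floating-point error in T * (1 - beta).
    k = max(1, int(np.floor(len(r) * (1.0 - beta) + 1e-9)))
    return float(r[-k:].mean())
```

When $T\beta$ is an integer and there are no ties, both functions should return the same value up to floating-point error; otherwise they may differ by a term of order $1/T$.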
With the definition of empirical CVaR, we say the decision-maker achieves online CVaR control at some target level $\alpha\in(R_{\min},R_{\max})$ iff
$$\widehat{\operatorname{CVaR}}_\beta(R_1(\lambda_1),\ldots,R_T(\lambda_T))\leq\alpha+o(1).\tag{1}$$
In contrast to standard cumulative loss, $\widehat{\operatorname{CVaR}}_\beta(R_{1:T})$ depends on the entire loss sequence in a non-additive manner. In particular, changing a single large loss can alter both the quantile threshold and the set of points entering the tail average. Using the Rockafellar-Uryasev representation of $\operatorname{CVaR}_\beta$ (Rockafellar and Uryasev, [2000](https://arxiv.org/html/2606.00320#bib.bib14)) we can reformulate the criterion ([1](https://arxiv.org/html/2606.00320#S2.E1)) as follows:
$$\min_{c\in[R_{\min},R_{\max}]}\frac{1}{T}\sum_{t=1}^{T}\left[c+\frac{1}{1-\beta}(R_t-c)_+\right]\leq\alpha$$
This representation suggests an extended game between the decision-maker and Nature. In each round $t$, the decision-maker chooses an auxiliary variable $c_t$ in addition to $\lambda_t$, and Nature chooses $R_t(\lambda)$ as before. The average realized loss is given by
$$\frac{1}{T}\sum_{t=1}^{T}\left[c_t+\frac{1}{1-\beta}(R_t(\lambda_t)-c_t)_+\right].$$
For this extended game, we can define the regret as usual:
$$\mathrm{Reg}_T(c)=\sum_{t=1}^{T}\left[c_t+\frac{1}{1-\beta}(R_t(\lambda_t)-c_t)_+\right]-\sum_{t=1}^{T}\left[c+\frac{1}{1-\beta}(R_t(\lambda_t)-c)_+\right].\tag{2}$$
This regret definition is consistent with the typical objective of no-regret online learning, where the first term is the cumulative loss incurred by the controller and the second term is the cumulative loss of a fixed action $c$; maximizing over $c$ compares the controller against the best fixed action in hindsight.
With this regret, we can turn the CVaR control problem into an expected risk control problem\.
###### Proposition 2\.1\.
If the decision-maker can choose $(\lambda_t,c_t)$ such that
$$\left|\max_{c\in[R_{\min},R_{\max}]}\mathrm{Reg}_T(c)\right|=o(T),\tag{3}$$
and
$$\frac{1}{T}\sum_{t=1}^{T}\left[c_t+\frac{1}{1-\beta}(R_t(\lambda_t)-c_t)_+\right]\leq\alpha+o(1),\tag{4}$$
then ([1](https://arxiv.org/html/2606.00320#S2.E1)) holds.
Proposition [2.1](https://arxiv.org/html/2606.00320#S2.Thmtheorem1) implies that it remains to achieve ([3](https://arxiv.org/html/2606.00320#S2.E3)) and ([4](https://arxiv.org/html/2606.00320#S2.E4)). In the following, we will apply an AdaGrad-FTRL algorithm for ([3](https://arxiv.org/html/2606.00320#S2.E3)) and a CDT-style algorithm to achieve ([4](https://arxiv.org/html/2606.00320#S2.E4)). Notably, ([3](https://arxiv.org/html/2606.00320#S2.E3)) requires bounding the regret $\max_c\mathrm{Reg}_T(c)$ from both above and below. As we will see in the proofs, the upper bound is crucial to prove ([4](https://arxiv.org/html/2606.00320#S2.E4)) while the lower bound is used to conclude CVaR control from ([4](https://arxiv.org/html/2606.00320#S2.E4)).
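The reason the lower bound on the regret is the one needed to pass from (4) to (1) can be seen from a short identity. The derivation below is a sketch in the paper's notation, where $f_t(c)=c+\frac{1}{1-\beta}(R_t(\lambda_t)-c)_+$ is shorthand for the per-round loss in the regret (2); it is offered as an illustration and is not a substitute for the authors' proof.

```latex
\begin{aligned}
\widehat{\operatorname{CVaR}}_\beta\bigl(R_1(\lambda_1),\ldots,R_T(\lambda_T)\bigr)
 &= \min_{c\in[R_{\min},R_{\max}]}\frac{1}{T}\sum_{t=1}^{T} f_t(c) \\
 &= \frac{1}{T}\sum_{t=1}^{T} f_t(c_t)
    - \frac{1}{T}\max_{c\in[R_{\min},R_{\max}]}\mathrm{Reg}_T(c) \\
 &\le \alpha + o(1) + \frac{1}{T}\Bigl|\max_{c\in[R_{\min},R_{\max}]}\mathrm{Reg}_T(c)\Bigr|
  = \alpha + o(1).
\end{aligned}
```

The second line rewrites the minimum over $c$ using the definition of $\mathrm{Reg}_T(c)$; the last line applies (4) to the first term and (3) to the second. Only a lower bound of the form $\max_c\mathrm{Reg}_T(c)\ge -o(T)$ is used in this step.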
### 2.3 Rockafellar-Uryasev Conformal Inference and Regret Bound
Our algorithm, shown in Algorithm [1](https://arxiv.org/html/2606.00320#alg1), has a two-level optimization structure. The outer level performs updates on a control parameter $\lambda_t$ to enforce the $\operatorname{CVaR}_\beta$ constraint using a CDT (or ACI/BCI-style) update (Lekeufack et al., [2024](https://arxiv.org/html/2606.00320#bib.bib15); Gibbs and Candes, [2021](https://arxiv.org/html/2606.00320#bib.bib16); Yang et al., [2024](https://arxiv.org/html/2606.00320#bib.bib17)). Meanwhile, the inner level solves a one-dimensional convex optimization problem induced by the Rockafellar-Uryasev representation, in an online manner. Using AdaGrad-FTRL updates (McMahan, [2011](https://arxiv.org/html/2606.00320#bib.bib1); Duchi et al., [2011](https://arxiv.org/html/2606.00320#bib.bib2)), the inner level adaptively learns the optimal quantile threshold $c_t$, yielding adversarial regret guarantees without step-size tuning.
Algorithm 1: RU Conformal Inference
Input: CVaR level $\beta$, target risk level $\alpha$, outer relative step size $\gamma_0$, effective domain $[\lambda_{\min},\lambda_{\max}]$, range $[R_{\min},R_{\max}]$, initial $\lambda_1$.
Procedure:
- $c_1\leftarrow(R_{\min}+R_{\max})/2$.
- $q_0\leftarrow\max\{1,\beta/(1-\beta)\}^2$.
- $\gamma\leftarrow\gamma_0(\lambda_{\max}-\lambda_{\min})$.
- for $t=1,\ldots,T$ do
  - Nature move: observe loss function $R_t(\cdot)$.
  - Extend the loss function: extend $R_t$ with $R_t(\lambda)\leftarrow\begin{cases}R_{\min},&\lambda<\lambda_{\min},\\ R_{\max},&\lambda>\lambda_{\max}.\end{cases}$
  - Construct RU surrogate: $\ell^{RU}_t\leftarrow c_t+\frac{1}{1-\beta}(R_t(\lambda_t)-c_t)_+$.
  - Outer-level CDT update: $\lambda_{t+1}\leftarrow\lambda_t-\gamma(\ell^{RU}_t-\alpha)$.
  - Inner-level AdaGrad-FTRL update: $g_t\leftarrow 1-\frac{1}{1-\beta}\cdot\mathbf{1}\{R_t(\lambda_t)>c_t\}$, $\quad q_t\leftarrow q_{t-1}+g_t^2$, $\quad\eta_t\leftarrow\frac{1}{2\sqrt{q_t}}$, and
$$c_{t+1}\leftarrow\operatorname*{arg\,min}_{c\in[R_{\min},R_{\max}]}\left\{\frac{1}{2\eta_t}\left(c-\frac{R_{\min}+R_{\max}}{2}\right)^2+\sum_{s=1}^{t}\left(c+\frac{1}{1-\beta}\bigl(R_s(\lambda_s)-c\bigr)_+\right)\right\}.$$
- end for
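The steps of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the names `ftrl_threshold` and `ru_conformal_inference` are hypothetical, each round's loss function is assumed to be a Python callable on $[\lambda_{\min},\lambda_{\max}]$, and the inner arg-min is solved by bisection on the derivative of the strictly convex objective (the source does not specify a numerical solver).

```python
import numpy as np

def ftrl_threshold(past_losses, eta, beta, r_min, r_max, iters=60):
    """Inner update: minimize the regularized cumulative RU loss over c.

    The objective is strictly convex, so its right-derivative is strictly
    increasing in c and the minimizer can be located by bisection.
    """
    past = np.asarray(past_losses, dtype=float)
    center = 0.5 * (r_min + r_max)

    def slope(c):
        # Right-derivative of (c - center)^2 / (2 eta) + sum_s [c + (R_s - c)_+ / (1 - beta)].
        return (c - center) / eta + len(past) - np.sum(past > c) / (1.0 - beta)

    lo, hi = r_min, r_max
    if slope(lo) >= 0.0:
        return lo
    if slope(hi) <= 0.0:
        return hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if slope(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def ru_conformal_inference(loss_fns, beta, alpha, gamma0, lam_range, r_range, lam1):
    """Run the two-level loop; returns realized losses, lambdas, and thresholds."""
    lam_min, lam_max = lam_range
    r_min, r_max = r_range
    c = 0.5 * (r_min + r_max)                      # c_1
    q = max(1.0, beta / (1.0 - beta)) ** 2         # q_0
    gamma = gamma0 * (lam_max - lam_min)
    lam, realized, lams, cs = lam1, [], [], []
    for loss_fn in loss_fns:
        # Extended loss: R_min below the domain, R_max above it.
        if lam < lam_min:
            r = r_min
        elif lam > lam_max:
            r = r_max
        else:
            r = float(loss_fn(lam))
        realized.append(r)
        lams.append(lam)
        cs.append(c)
        ru = c + max(r - c, 0.0) / (1.0 - beta)    # RU surrogate at (lambda_t, c_t)
        g = 1.0 - (1.0 / (1.0 - beta)) * (r > c)   # subgradient at c_t
        lam = lam - gamma * (ru - alpha)           # outer CDT-style step
        q += g * g
        eta = 1.0 / (2.0 * np.sqrt(q))
        c = ftrl_threshold(realized, eta, beta, r_min, r_max)  # inner FTRL step
    return np.array(realized), np.array(lams), np.array(cs)
```

A call such as `ru_conformal_inference([lambda lam: lam] * 200, 0.75, 0.5, 0.05, (0.0, 1.0), (0.0, 1.0), 0.5)` runs the loop on a toy loss; the identity loss here is purely illustrative.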
We first present our main regret bound in Theorem [2.2](https://arxiv.org/html/2606.00320#S2.Thmtheorem2).
###### Theorem 2\.2\.
Assume the loss $R_t(\lambda)$ is uniformly bounded for all $t$ and $\lambda$. Then the proposed algorithm guarantees
$$\widehat{\operatorname{CVaR}}_\beta(R_{1:T})\;\leq\;\alpha+CT^{-1/2},$$
for some constant $C$ that only depends on $\beta$, $\alpha$, $\gamma$, $\lambda_{\min}$, $\lambda_{\max}$, $R_{\min}$, $R_{\max}$.
In the next two subsections, we walk through the proof of Theorem [2.2](https://arxiv.org/html/2606.00320#S2.Thmtheorem2) by studying the theoretical properties of the inner- and outer-level updates.
### 2.4 Achieving ([3](https://arxiv.org/html/2606.00320#S2.E3)) Via Inner-level Update
In this and the next subsection, to simplify exposition, we assume $\lambda_{\min}=R_{\min}=0$ and $\lambda_{\max}=R_{\max}=1$. In general, this can be achieved by replacing $\alpha,R(\lambda)$ by $\tilde{\alpha},\tilde{R}(\tilde{\lambda})$ where
$$\tilde{\alpha}=\frac{\alpha-R_{\min}}{R_{\max}-R_{\min}},\qquad\tilde{\lambda}=\frac{\lambda-\lambda_{\min}}{\lambda_{\max}-\lambda_{\min}},\tag{5}$$
and
$$\tilde{R}(\tilde{\lambda})=\frac{R(\lambda_{\min}+(\lambda_{\max}-\lambda_{\min})\tilde{\lambda})-R_{\min}}{R_{\max}-R_{\min}}.\tag{6}$$
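The rescaling in (5) and (6) amounts to an affine change of variables on both the parameter and the loss. A minimal sketch, assuming the loss is available as a Python callable (the helper name `normalize_problem` is hypothetical):

```python
def normalize_problem(alpha, loss_fn, lam_range, r_range):
    """Map (alpha, R) to (alpha_tilde, R_tilde) on the unit domain and range."""
    lam_min, lam_max = lam_range
    r_min, r_max = r_range
    # Equation (5): rescale the target level into [0, 1].
    alpha_tilde = (alpha - r_min) / (r_max - r_min)

    def loss_tilde(lam_tilde):
        # Equation (6): map lam_tilde in [0, 1] back to the original domain,
        # evaluate the original loss, then rescale it into [0, 1].
        lam = lam_min + (lam_max - lam_min) * lam_tilde
        return (loss_fn(lam) - r_min) / (r_max - r_min)

    return alpha_tilde, loss_tilde
```

For instance, with the illustrative loss $R(\lambda)=\lambda^2$ on $[1,3]$ taking values in $[1,9]$ and target $\alpha=3$, the rescaled target is $0.25$ and the rescaled loss maps $0\mapsto 0$ and $1\mapsto 1$.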
Fixing the sequence $\lambda_1,\lambda_2,\ldots$, the incremental loss function in the extended game is expressed as
$$f_t(c)\;=\;c+\frac{1}{1-\beta}\bigl(R_t(\lambda_t)-c\bigr)_+,\tag{7}$$
which is convex and $G$-Lipschitz on $[0,1]$, where one valid subgradient of $f_t$ at $c$ is
$$g_t(c)=1-\frac{1}{1-\beta}\mathbf{1}\{R_t(\lambda_t)>c\}\in\partial f_t(c),$$
and hence
$$g_t(c)\in\left\{1,-\frac{\beta}{1-\beta}\right\},\qquad G=\max\left\{1,\frac{\beta}{1-\beta}\right\}.$$
Our goal is to choose $c_1,\dots,c_T$ in an online manner to minimize the regret with respect to a single fixed decision $c$:
$$\mathrm{Reg}_T(c)=\sum_{t=1}^{T}f_t(c_t)-\sum_{t=1}^{T}f_t(c).$$
A major practical challenge in online learning is that the optimal learning rate can vary dramatically across regimes\. In benign or stationary environments, aggressive updates can significantly accelerate convergence\. Conversely, in noisy, heavy\-tailed, or adversarial settings, large step sizes can lead to instability and erratic behavior\.
To avoid tuning a learning rate for $c_t$ while maintaining adversarial guarantees, we use an AdaGrad-FTRL algorithm. Let $g_t$ be a chosen subgradient at the current iterate. The update rule of AdaGrad-FTRL then reduces to
$$c_{t+1}\leftarrow\operatorname*{arg\,min}_{c\in[0,1]}\;\frac{1}{2\eta_t}\left(c-1/2\right)^2+\sum_{s=1}^{t}\left\{c+\frac{1}{1-\beta}(R_s(\lambda_s)-c)_+\right\},$$
where the adaptive step size is set to be
$$\eta_t=\frac{1}{2\sqrt{q_t}},\quad q_t=q_{t-1}+g_t^2,\quad q_0=\max\left\{1,\frac{\beta}{1-\beta}\right\}^2.$$
###### Theorem 2\.4\.
For every adaptive or adversarial sequence $R_1(\cdot),\ldots,R_T(\cdot):[0,1]\mapsto[0,1]$, the iterates above satisfy
$$-\frac{1}{4}\sqrt{q_T}\leq\max_{c\in[0,1]}\mathrm{Reg}_T(c)\leq\frac{3}{4}\sqrt{q_T}.$$
In particular,
$$\left|\max_{c\in[0,1]}\mathrm{Reg}_T(c)\right|\leq\frac{3}{4}\max\left\{1,\frac{\beta}{1-\beta}\right\}\sqrt{T+1}.$$
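The passage from the $\sqrt{q_T}$ bound to the explicit $\sqrt{T+1}$ bound uses only the definitions of $q_t$ and $G$ given earlier. A short worked version, offered as a reading aid rather than as part of the authors' proof:

```latex
q_T \;=\; q_0 + \sum_{t=1}^{T} g_t^2
    \;\le\; G^2 + T\,G^2 \;=\; G^2\,(T+1),
\qquad\text{so}\qquad
\sqrt{q_T} \;\le\; G\sqrt{T+1},
\quad G=\max\left\{1,\frac{\beta}{1-\beta}\right\}.
```

Here $q_0=G^2$ by definition and $|g_t|\le G$ because $g_t\in\{1,-\beta/(1-\beta)\}$. As a numerical illustration, $\beta=0.85$ gives $G=17/3$, so the bound reads $\frac{3}{4}\cdot\frac{17}{3}\sqrt{T+1}=4.25\sqrt{T+1}$.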
The next result shows that the leading term of this bound is optimal in its rate in $T$, and that the leading constant is loose only by a constant factor. The proof is presented in Appendix [B.4](https://arxiv.org/html/2606.00320#A2.SS4).
###### Theorem 2\.5\.
For every $T\geq 1$ and every online learning algorithm, possibly randomized, there exists a sequence $R_1(\cdot),\ldots,R_T(\cdot):[0,1]\mapsto[0,1]$, such that
$$\mathbb{E}\left[\max_{c\in[0,1]}\mathrm{Reg}_T(c)\right]\geq\frac{1}{\sqrt{2\pi}}\sqrt{\frac{\beta}{1-\beta}}\sqrt{T}\cdot(1+o(1)),$$
where the expectation is only over the algorithmic randomization. For deterministic algorithms, the expectation may be omitted.
When $\beta>1/2$, the constant gap can be closed by allowing the tuning parameter to depend on $T$. We present the algorithm and its regret upper bound in Appendix [B.5](https://arxiv.org/html/2606.00320#A2.SS5). Nevertheless, in practice we prefer the AdaGrad-FTRL algorithm, which is agnostic to the horizon $T$.
### 2.5 Achieving ([4](https://arxiv.org/html/2606.00320#S2.E4)) Via Outer-level Updates
Fixing the sequence $c_1,c_2,\ldots$, ([4](https://arxiv.org/html/2606.00320#S2.E4)) becomes the same objective as in CDT with additive cumulative losses. When $R_t(\lambda)$ is bounded, we can prove risk control up to an $O(1/\sqrt{T})$ term. The proof is substantially more complicated than the argument in Lekeufack et al. ([2024](https://arxiv.org/html/2606.00320#bib.bib15)) because the sequence $\{c_t\}$ can be arbitrary and the minimal loss at time $t$ may exceed $\alpha$ when $c_t>\alpha$. To prove boundedness of the sequence $\lambda_t$, we need to apply the regret bound for the AdaGrad-FTRL algorithm. The proof is presented in Appendix [B.6](https://arxiv.org/html/2606.00320#A2.SS6).
###### Theorem 2\.6\.
Given any sequence of $R_1(\cdot),\ldots,R_T(\cdot):[0,1]\mapsto[0,1]$, let $\{c_t\}_{t=1}^{T}$ denote the inner-level update given by the AdaGrad-FTRL algorithm. The outer-level update in Algorithm [1](https://arxiv.org/html/2606.00320#alg1) guarantees that
$$\frac{1}{T}\sum_{t=1}^{T}\left[c_t+\frac{1}{1-\beta}(R_t(\lambda_t)-c_t)_+\right]\leq\alpha+\frac{\sqrt{T+1}}{T}C_1+\frac{1}{T}C_2,$$
where
$$C_1=\max\left\{1,\frac{\beta}{1-\beta}\right\}\left(\frac{3}{4}+\frac{1}{4(1-\beta)}\right),$$
and
$$C_2=\frac{\lambda_1/\gamma+(1-\beta)^{-1}-\alpha}{1-\beta}.$$
Importantly, this guarantee holds *regardless of the quality of the threshold sequence* $\{c_t\}$. However, the choice of $\{c_t\}$ directly determines the *tightness* of the bound: if $c_t$ is far from the optimal threshold $c^*$, the surrogate objective may be much larger than the true $\operatorname{CVaR}_\beta$, leading to conservative behavior. Thus, while the outer CDT level ensures validity, the role of the inner-level AdaGrad-FTRL update is to adaptively learn thresholds $c_t$ to avoid being overly conservative, as described in the previous section.
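To get a feel for the size of the finite-sample terms in Theorem 2.6, the right-hand side can be evaluated directly. The sketch below simply transcribes the constants $C_1$ and $C_2$ for the normalized setting $\lambda,R\in[0,1]$; the function name `surrogate_bound` is hypothetical, and the initial value `lam1` used in any call is an assumption since the source does not report it here.

```python
import math

def surrogate_bound(T, beta, alpha, gamma, lam1):
    """Right-hand side of the Theorem 2.6 bound on the average RU surrogate."""
    G = max(1.0, beta / (1.0 - beta))
    c1 = G * (0.75 + 1.0 / (4.0 * (1.0 - beta)))
    c2 = (lam1 / gamma + 1.0 / (1.0 - beta) - alpha) / (1.0 - beta)
    return alpha + math.sqrt(T + 1) / T * c1 + c2 / T
```

Evaluating `surrogate_bound` over a grid of horizons shows how quickly the slack above $\alpha$ shrinks; note that $C_2$ scales with $\lambda_1/\gamma$, so a small outer step size inflates the $1/T$ term.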
## 3Experiments
We evaluate the proposed RU ConformalCVaR\\operatorname\{CVaR\}control framework on two sequential decision\-making problems: \(i\) toxicity control for LLM outputs and \(ii\) portfolio management under market distribution shift\.
Across both experiments, we discard the first 100 rounds as a burn\-in period\. We report results on \(i\) realizedCVaR^β\\widehat\{\\operatorname\{CVaR\}\}\_\{\\beta\}control, and \(ii\) the dynamics of the learned thresholdλt\\lambda\_\{t\}\.
### 3\.1Toxicity Control for LLM Outputs Experiment
Figure 2:LLM Toxicity Control Experiment\.Realized empiricalCVaRβ\\operatorname\{CVaR\}\_\{\\beta\}and evolution ofλt\\lambda\_\{t\}forβ=0\.85\\beta=0\.85with targetα=0\.1\\alpha=0\.1, step sizeγ=0\.05\\gamma=0\.05, and 100 steps of burn\-in for theUniformregime in the first column,Adversarialregime in middle column, andAdversarial\-Jumpregime in the right column\.#### Environment\.
We consider a pool of prompts $\{x_{i}\}$, each paired with multiple candidate LLM-generated responses $\{y_{i}^{j}\}_{j=1}^{K}$. Each response is annotated with: (i) a machine toxicity score $r_{m,i}^{j}\in[0,1]$ produced by a toxicity classifier, and (ii) a human toxicity score $r_{i}^{j}\in[0,1]$, which we treat as the ground-truth loss.
At each round $t$, one prompt is selected and $K$ candidate responses are sampled according to a time-varying sampling distribution (described below). The realized loss is the maximum human toxicity score among accepted responses after filtering out responses whose machine toxicity score exceeds $\lambda_{t}$. If all candidates are filtered out, the LLM abstains from returning a response and the toxicity loss is zero.
#### Action parameter\.
The controller selects a threshold $\lambda_{t}\in[0,1]$ and rejects all candidate responses whose machine toxicity score exceeds $\lambda_{t}$. This follows the same deployment protocol as in Chen et al. ([2025](https://arxiv.org/html/2606.00320#bib.bib18)).
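The filtering rule and the realized loss described above can be sketched as follows. This is a minimal illustration that assumes the scores are plain Python floats; the function name `realized_toxicity_loss` and its arguments are hypothetical and not taken from the paper's code.

```python
from typing import Sequence


def realized_toxicity_loss(
    machine_scores: Sequence[float],
    human_scores: Sequence[float],
    lam: float,
) -> float:
    """Max human toxicity among responses kept by the threshold `lam`.

    A response is accepted when its machine toxicity score does not exceed
    `lam`. If every candidate is rejected, the model abstains and the loss is 0.
    """
    accepted = [h for m, h in zip(machine_scores, human_scores) if m <= lam]
    return max(accepted, default=0.0)
```

Lowering `lam` can only shrink the accepted set, so the realized loss is nondecreasing in `lam`.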
#### Models and datasets\.
Following Chen et al. ([2025](https://arxiv.org/html/2606.00320#bib.bib18)), we create an inexpensive semi-synthetic benchmark using an existing machine scoring model as the “human annotator,” and a biased scoring model as the “machine assessor.” Specifically, we use the Detoxify model (Hanu and Unitary team, [2020](https://arxiv.org/html/2606.00320#bib.bib4)) for $r(\cdot)$ and retrain the Detoxify model for $r_{m}(\cdot)$ on a biased subset of the Jigsaw Unintended Bias in Toxicity Classification dataset (cjadams et al., [2019](https://arxiv.org/html/2606.00320#bib.bib35)) that consists of the $15\%$ most and least toxic instances.
We conduct experiments using Llama 3.2-3B (Meta AI, [2024](https://arxiv.org/html/2606.00320#bib.bib11)). We draw prompts from the RealToxicityPrompts dataset (Gehman et al., [2020](https://arxiv.org/html/2606.00320#bib.bib33)) using the sampling regimes described in the next section.
#### Synthetic distribution shift via tail thickening\.
To simulate controlled non\-stationary and adversarial shifts in the tail behavior, we generate responses using a time\-varying tail\-severity parameter\.
At each time $t$, we sample a quantile level
$$p_{t}\sim\operatorname{Beta}(a_{t},b_{t}),$$
which controls how extreme the selected response is within the candidate pool. Given $K$ candidate responses for the current prompt, we sort them by machine toxicity score and select the response with rank
$$k_{t}=\min\{K,\max\{1,\lceil K\,p_{t}\rceil\}\}.$$
Thus, small values of $p_{t}$ select benign responses, while values of $p_{t}$ near $1$ select highly toxic responses.
We consider three Beta distribution sampling regimes: (i) the uniform regime, with $a_{t}=b_{t}=1$; (ii) the adversarial/non-stationary regime, with time-varying $a_{t}$ and fixed $b_{t}=1$; and (iii) the adversarial-jump regime, where $b_{t}=1$ and $a_{t}\in[0.5,5]$ changes over time with random spikes and dips throughout the sampling window. Our adversarial constructions allow the sampling distribution to gradually concentrate more mass near $1$ as time passes, thereby increasingly selecting high machine-toxicity responses and inducing a controlled tail-thickening distribution shift in regime (ii) and a volatile shift in regime (iii). Visualizations of the sampling distributions of both adversarial regimes can be found in Appendix [C](https://arxiv.org/html/2606.00320#A3).
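The rank-selection rule above can be sketched with the standard library alone. The sketch assumes candidates are ranked in ascending order of machine toxicity (consistent with small $p_{t}$ selecting benign responses). The helper names and the linear schedule for $a_{t}$ are hypothetical: the text only states that $a_{t}$ varies over time, and the actual schedules are shown in the paper's appendix.

```python
import math
import random


def select_rank(p: float, K: int) -> int:
    """1-based rank k_t = min{K, max{1, ceil(K * p)}}."""
    return min(K, max(1, math.ceil(K * p)))


def sample_response_index(machine_scores, a_t, b_t, rng):
    """Draw p_t ~ Beta(a_t, b_t) and return the index of the rank-k_t response.

    Ranks are taken in ascending order of machine toxicity (an assumption).
    """
    p_t = rng.betavariate(a_t, b_t)
    k_t = select_rank(p_t, len(machine_scores))
    order = sorted(range(len(machine_scores)), key=lambda j: machine_scores[j])
    return order[k_t - 1]


def linear_ramp(t: int, T: int, a_start: float = 1.0, a_end: float = 5.0) -> float:
    """Hypothetical schedule: a_t grows linearly, pushing Beta mass toward 1."""
    return a_start + (a_end - a_start) * t / max(T - 1, 1)
```

With $b_{t}=1$, increasing $a_{t}$ shifts the mass of $\operatorname{Beta}(a_{t},1)$ toward $1$, so higher ranks (more toxic candidates) are selected more often.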
### 3.2 Results for Toxicity Control for LLM Outputs
We evaluate the proposed RU Conformal $\operatorname{CVaR}$ (RUCC) controller in the three sampling regimes: uniform, adversarial, and adversarial-jump. In all three settings, the target risk level is set to $\alpha=0.1$ with $\gamma=0.05$, and we report results for $\beta=0.85$ in Figure [2](https://arxiv.org/html/2606.00320#S3.F2). For results with $\beta\in\{0.75,0.8,0.85,0.9\}$, see Appendix [C](https://arxiv.org/html/2606.00320#A3).
#### Evolution of the empirical $\operatorname{CVaR}_{\beta}$.
The top row of Figure [2](https://arxiv.org/html/2606.00320#S3.F2) displays the per-step realized empirical $\operatorname{CVaR}_{\beta}$ trajectories of our RUCC controller (blue line), with the corresponding average realized $\operatorname{CVaR}_{\beta}$ (orange dotted line) and the target risk level (red dashed line). Results are shown for the uniform regime (left panel), adversarial regime (middle panel), and adversarial-jump regime (right panel).
We compare RUCC against a static baseline with a fixed $\lambda$ throughout the experiment. Specifically, we estimate a constant threshold $\hat{\lambda}$ using distortion risk control via L-statistics (Chen et al., [2025](https://arxiv.org/html/2606.00320#bib.bib18)), calibrated on the first 1000 samples in our sample set. For this baseline, we report the realized empirical $\operatorname{CVaR}_{\beta}$ trajectory (purple dashed line) and its average realized $\operatorname{CVaR}_{\beta}$ (purple dotted line).
From the top row of Figure [2](https://arxiv.org/html/2606.00320#S3.F2), we observe that the RUCC controller rapidly stabilizes the realized empirical $\operatorname{CVaR}_{\beta}$, keeping it close to the target level across all three settings. In the uniform regime, the trajectory settles relatively smoothly near the target. In contrast, the adversarial and adversarial-jump regimes exhibit larger fluctuations, reflecting the greater difficulty of controlling tail risk when the data-generating process changes over time.
The figure also highlights the limitation of the baseline, which selects a very small fixed value of $\hat{\lambda}$ that keeps $\operatorname{CVaR}_{\beta}$ well below the target, making the LLM overly conservative. By contrast, RUCC adjusts $\lambda_{t}$ over time to maintain tail risk control while avoiding excessive conservatism.
#### Evolution of $\lambda_{t}$.
Next, we examine the evolution of $\lambda_{t}$ in the LLM toxicity control task under the uniformly sampled prompt stream and the adversarially sampled streams, shown in the bottom row of Figure [2](https://arxiv.org/html/2606.00320#S3.F2).
In the uniform regime, $\lambda_{t}$ remains relatively stable throughout the trajectory, exhibiting only mild fluctuations. This behavior is consistent with the underlying loss distribution being comparatively stationary, so the algorithm does not need to adapt the threshold aggressively in order to maintain the target $\operatorname{CVaR}_{\beta}$ level.
In contrast, under the adversarial and adversarial-jump regimes, $\lambda_{t}$ decreases substantially over time and exhibits noticeably larger fluctuations than in the uniform setting. This reflects the increased difficulty of the control problem under distribution shift, where the underlying loss distribution changes over time. Nevertheless, despite these non-stationary environments, RUCC continues to adapt effectively and maintains robust control of the realized $\operatorname{CVaR}_{\beta}$.
### 3.3 Portfolio Management under Distribution Shift Experiment
Figure 3: Portfolio Management Experiment. Realized empirical 180-day rolling $\operatorname{CVaR}_{\beta}$ (top row) and evolution of $\lambda_{t}$ (bottom row) for $\beta=0.85$, $\gamma=0.05$ with target $\alpha=0.01$ and a burn-in period of 100 days, during 1991–2001 in the first column, 2009–2018 in the second column, and 1991–2025 in the third column.
#### Environment.
We consider a two\-asset portfolio consisting of \(i\) a risk\-free asset \(10\-Year Treasury total\-return index\), and \(ii\) a risky asset \(S&P 500 index\)\.
#### Action parameter\.
Let $P_{t}$ denote the asset price at time $t$. We define the $h=1$ log-return as
$$r_{t}^{(1)}=\log\frac{P_{t}}{P_{t-1}}.$$
At each time $t$, the controller selects a portfolio weight $\lambda_{t}\in[0,1]$, representing the fraction invested in the risky asset. The portfolio return is then
$$r_{t}^{p}(\lambda_{t})=\lambda_{t}r_{t}^{\text{risky}}+(1-\lambda_{t})r_{t}^{\text{risk-free}},$$
and the loss is defined as
$$R_{t}(\lambda_{t})=-r_{t}^{p}(\lambda_{t}),$$
where $r_{t}^{\text{risky}}$ and $r_{t}^{\text{risk-free}}$ denote the return of the risky asset and the return of the risk-free asset, respectively.
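As a small worked sketch of the definitions above, the log-return and the portfolio loss can be computed as follows. The function names are illustrative and not from the paper's code.

```python
import math


def log_return(p_t: float, p_prev: float) -> float:
    """One-step log-return log(P_t / P_{t-1})."""
    return math.log(p_t / p_prev)


def portfolio_loss(lam: float, r_risky: float, r_risk_free: float) -> float:
    """Loss R_t(lam) = -(lam * r_risky + (1 - lam) * r_risk_free)."""
    portfolio_return = lam * r_risky + (1.0 - lam) * r_risk_free
    return -portfolio_return
```

For example, with a risky return of $-0.02$ and a risk-free return of $0.001$, a fully risky allocation ($\lambda_{t}=1$) incurs loss $0.02$, while a fully risk-free allocation ($\lambda_{t}=0$) incurs loss $-0.001$.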
#### Distribution shift\.
We evaluate performance across market regimes from 1991 to 2025, covering both periods of economic stability and multiple market shocks including the dot\-com crash, the 2008 financial crisis, COVID\-19, and the subsequent high\-inflation period\.
#### Datasets\.
For our risky asset, we use historical S&P 500 index data via the yfinance API (Aroussi, [2020](https://arxiv.org/html/2606.00320#bib.bib9)), which retrieves data from Yahoo Finance. For our risk-free asset, we use the 10-Year Treasury total-return index from the Federal Reserve Economic Data (FRED) database (Federal Reserve Bank of St. Louis, [2024](https://arxiv.org/html/2606.00320#bib.bib10)).
### 3.4 Results for Portfolio Management Experiment
We evaluate the proposed RU Conformal $\operatorname{CVaR}$ controller on the portfolio management task across three time horizons: 1991–2001, 2009–2018, and 1991–2025. The 1991–2001 period corresponds to a relatively stable economic environment, while 2009–2018 captures a substantially more volatile regime that includes the global financial crisis and its aftermath. The full 1991–2025 horizon evaluates the long-term robustness of our proposed method across multiple market conditions.
Figure [3](https://arxiv.org/html/2606.00320#S3.F3) reports the realized empirical 180-day rolling $\operatorname{CVaR}_{\beta}$ (top row) and the corresponding control parameter $\lambda_{t}$ (bottom row), for $\beta=0.85$, target risk level $\alpha=0.01$, and step size $\gamma=0.05$. The rolling-window CVaR is included as a diagnostic for the local behavior of the controller. Additional results for $\beta\in\{0.75,0.80,0.85,0.90\}$ are provided in Appendix [D](https://arxiv.org/html/2606.00320#A4).
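One way to compute a rolling-window empirical CVaR of this kind is sketched below. This section does not state how the windowed estimate is computed, so the sketch assumes it is the Rockafellar-Uryasev minimum evaluated on the losses in each window; the function names are hypothetical.

```python
def empirical_cvar(losses, beta):
    """Empirical CVaR via min over c of c + mean((R - c)_+) / (1 - beta).

    The objective is convex and piecewise linear with kinks at the observed
    losses, so the minimum is attained at one of them.
    """
    scale = 1.0 / ((1.0 - beta) * len(losses))
    return min(c + scale * sum(max(r - c, 0.0) for r in losses) for c in losses)


def rolling_cvar(losses, beta, window=180):
    """Empirical CVaR over each trailing window of `window` consecutive losses."""
    return [
        empirical_cvar(losses[end - window:end], beta)
        for end in range(window, len(losses) + 1)
    ]
```

The direct minimization costs quadratic time per window, which is acceptable for a 180-day diagnostic; a sort-based tail average would be faster for long windows.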
#### Evolution of the rolling empirical $\operatorname{CVaR}_{\beta}$.
The top row of Figure [3](https://arxiv.org/html/2606.00320#S3.F3) displays the 180-day rolling realized empirical $\operatorname{CVaR}_{\beta}$ trajectories produced by the RUCC controller (blue line), together with the corresponding average realized $\operatorname{CVaR}_{\beta}$ (orange dotted line) and the target risk level (red dashed line) for $\beta=0.85$.
We compare RUCC against a static baseline in which the control parameter $\lambda$ is fixed throughout the experiment. Specifically, we tune $\lambda$ over a grid of candidate values and select a fixed $\hat{\lambda}$ that controls the realized empirical $\operatorname{CVaR}_{\beta}$ over the first 1000 days of the sample period. For this baseline, we report both the 180-day rolling realized empirical $\operatorname{CVaR}_{\beta}$ trajectory (purple dashed line) and its average realized $\operatorname{CVaR}_{\beta}$ (purple dotted line).
The three evaluation windows (1991–2001, 2009–2018, and 1991–2025) are shown in the left, middle, and right panels of the top row of Figure [3](https://arxiv.org/html/2606.00320#S3.F3), respectively. Across all three periods, the RUCC controller consistently maintains the realized empirical $\operatorname{CVaR}_{\beta}$ close to the target level, whereas the static baseline exhibits substantially different behavior across market regimes.
During both the relatively stable 1991–2001 period and the more volatile 2009–2018 period, the baseline is overly conservative, producing realized $\operatorname{CVaR}_{\beta}$ values substantially below the target level. These results illustrate the difficulty of selecting a single fixed value of $\lambda$ that performs well across changing market conditions. By contrast, the RUCC controller adaptively adjusts $\lambda_{t}$ over time and is able to maintain stable tail-risk control across both stable and volatile environments.
Moreover, over the full 1991–2025 horizon, the method remains robust through multiple periods of economic stress. Although the realized $\operatorname{CVaR}_{\beta}$ increases modestly during episodes such as the dot-com bubble, the 2008 financial crisis, and the post-COVID market turbulence beginning in 2020, the rolling average empirical $\operatorname{CVaR}_{\beta}$ remains close to the target risk level throughout the sample period.
#### Evolution of $\lambda_{t}$.
The bottom row of Figure [3](https://arxiv.org/html/2606.00320#S3.F3) shows the evolution of $\lambda_{t}$, which represents the fraction of wealth allocated to the risky asset. During the 1991–2001 period, $\lambda_{t}$ exhibits a gradual long-term decline, indicating that the controller progressively reduces exposure to the risky asset over time while still maintaining moderate market participation.
In the 2009–2018 period, the controller initially maintains a relatively low value of $\lambda_{t}$ after the 2008 financial crisis and subsequently increases exposure as market conditions stabilize and the economy recovers. Over the full 1991–2025 horizon, the largest downward movements in $\lambda_{t}$ coincide with major episodes of market stress, including the 2008 financial crisis and the post-COVID turbulence beginning in 2020. This behavior illustrates that the controller dynamically responds to elevated tail-risk realizations by reducing risky exposure during periods of heightened uncertainty.
Note that the update for $\lambda_{t}$ is unconstrained, so $\lambda_{t}$ may temporarily leave the range $[0,1]$. When evaluating portfolio decisions, we clip the realized allocation to the corresponding boundary value, so $\lambda_{t}<0$ means zero risky-asset exposure and $\lambda_{t}>1$ means maximal exposure. Thus, the realized portfolio allocation remains feasible. Theorem [2.6](https://arxiv.org/html/2606.00320#S2.Thmtheorem6) shows that this unconstrained update cannot drift too far in aggregate, i.e., the cumulative excess over the target is only $O(\sqrt{T})$.
Overall, these experiments demonstrate that our method is capable of maintaining tail-risk control over multi-decade horizons spanning multiple crisis regimes while adaptively adjusting portfolio exposure over time. The results provide strong empirical evidence that the algorithm is robust to severe non-stationarity in the long run, achieving tight empirical $\operatorname{CVaR}_{\beta}$ control in realistic financial environments.
## 4 Discussion and Extensions
A key lever in our approach is the RU variational representation of $\operatorname{CVaR}_{\beta}$, which allows tail risk control to be reduced to online optimization over an auxiliary parameter. This idea is not unique to $\operatorname{CVaR}$. Many classical and modern risk measures admit similar variational representations, suggesting that our framework extends beyond tail risks. We describe illustrative examples below.
### 4.1 Optimized Certainty Equivalent (OCE) Risk Measures
A large class of risk measures known as *optimized certainty equivalents* (OCEs) (Ben-Tal and Teboulle, [2007](https://arxiv.org/html/2606.00320#bib.bib27)) admit representations of the form
$$R(X)=\inf_{c\in\mathbb{R}}\;\bigl\{c+\mathbb{E}\bigl[\phi(X-c)\bigr]\bigr\}$$
for a suitable convex loss $\phi$. Recent work has shown how to control such risk measures using conformal methods in stationary, offline, and i.i.d. settings (Yeh et al., [2025](https://arxiv.org/html/2606.00320#bib.bib26)). Our results suggest a clear extension of these guarantees to *online, non-stationary, and adversarial environments*.
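As a concrete instance, $\operatorname{CVaR}_{\beta}$ itself is an OCE: choosing the convex loss $\phi(x)=(x)_{+}/(1-\beta)$ in the representation above recovers the Rockafellar-Uryasev formula, which is the same surrogate $f_{t}$ used in the proofs in Appendix B.

```latex
\phi(x) = \frac{(x)_{+}}{1-\beta}
\quad\Longrightarrow\quad
R(X) = \inf_{c\in\mathbb{R}} \Bigl\{ c + \frac{1}{1-\beta}\,\mathbb{E}\bigl[(X-c)_{+}\bigr] \Bigr\}
     = \operatorname{CVaR}_{\beta}(X).
```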
### 4.2 Variational Risk Representations
The variance admits a well-known representation
$$\operatorname{Var}(X)=\mathbb{E}\bigl[(X-\mathbb{E}[X])^{2}\bigr]=\min_{a\in\mathbb{R}}\,\mathbb{E}\bigl[(X-a)^{2}\bigr],$$
which is widely used in quality control and robust estimation. This shows that variance control can likewise be interpreted as controlling the expectation of a surrogate loss indexed by an auxiliary parameter, making it amenable to our optimization framework.
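The minimization over $a$ can be verified in one line, assuming $X$ has a finite second moment. Expanding the square around $\mathbb{E}[X]$ makes the cross term vanish:

```latex
\mathbb{E}\bigl[(X-a)^{2}\bigr]
  = \mathbb{E}\bigl[(X-\mathbb{E}[X])^{2}\bigr] + \bigl(\mathbb{E}[X]-a\bigr)^{2}
  = \operatorname{Var}(X) + \bigl(\mathbb{E}[X]-a\bigr)^{2}
  \;\ge\; \operatorname{Var}(X),
```

with equality exactly when $a=\mathbb{E}[X]$. Here the auxiliary parameter $a$ plays the role that the threshold $c$ plays in the RU representation.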
## Impact Statement
This work contributes to the development of machine learning frameworks with increased safety and reliability by providing a distribution-free method for controlling tail risk in non-stationary and adversarial environments. By enabling online control of the Conditional Value-at-Risk ($\operatorname{CVaR}$), the proposed framework helps reduce the likelihood of rare but catastrophic failures in safety-critical applications such as financial decision-making and large language model deployment.
Beyond $\operatorname{CVaR}$, the framework provides a general template for controlling a broad class of nonlinear risk measures that admit a variational representation. We expect this contribution to help facilitate the design of more robust and trustworthy learning systems in environments subject to distribution shifts and strategic manipulation. We do not foresee negative societal impacts arising from this work; instead, its primary effect is to improve the safety and reliability of deployed machine learning systems.
## References
- A\. N\. Angelopoulos, S\. Bates, E\. J\. Candès, M\. I\. Jordan, and L\. Lei \(2025a\)Learn then test: calibrating predictive algorithms to achieve risk control\.The Annals of Applied Statistics19\(2\),pp\. 1641–1662\.Cited by:[§1\.1](https://arxiv.org/html/2606.00320#S1.SS1.p2.1)\.
- A\. N\. Angelopoulos, M\. I\. Jordan, and R\. J\. Tibshirani \(2025b\)Gradient equilibrium in online learning: theory and applications\.External Links:2501\.08330,[Link](https://arxiv.org/abs/2501.08330)Cited by:[§1\.1](https://arxiv.org/html/2606.00320#S1.SS1.p3.1)\.
- A\. N\. Angelopoulos, S\. Bates, A\. Fisch, L\. Lei, and T\. Schuster \(2025c\)Conformal risk control\.InThe Twelfth International Conference on Learning Representations,Cited by:[§1\.1](https://arxiv.org/html/2606.00320#S1.SS1.p2.1)\.
- R\. Aroussi \(2020\)Yfinance: yahoo\! finance market data downloader\.Note:[https://pypi\.org/project/yfinance/](https://pypi.org/project/yfinance/)Version used: 1\.1\.0Cited by:[§3\.3](https://arxiv.org/html/2606.00320#S3.SS3.SSS0.Px4.p1.1)\.
- S\. Bates, A\. Angelopoulos, L\. Lei, J\. Malik, and M\. Jordan \(2021\)Distribution\-free, risk\-controlling prediction sets\.Journal of the ACM \(JACM\)68\(6\),pp\. 1–34\.Cited by:[§1\.1](https://arxiv.org/html/2606.00320#S1.SS1.p2.1)\.
- A\. Ben\-Tal and M\. Teboulle \(2007\)An old–new concept of convex risk measures: the optimized certainty equivalent\.Mathematical Finance17\(3\),pp\. 439–476\.External Links:ISSN 1467\-9965,[Document](https://dx.doi.org/10.1111/j.1467-9965.2007.00311.x),[Link](https://onlinelibrary.wiley.com/doi/10.1111/j.1467-9965.2007.00311.x)Cited by:[§4\.1](https://arxiv.org/html/2606.00320#S4.SS1.p1.2)\.
- C\. Y\. Chen, J\. Shen, Z\. Deng, and L\. Lei \(2025\)Conformal tail risk control for large language model alignment\.External Links:2502\.20285,[Link](https://arxiv.org/abs/2502.20285)Cited by:[§1](https://arxiv.org/html/2606.00320#S1.p1.1),[§1](https://arxiv.org/html/2606.00320#S1.p5.4),[§3\.1](https://arxiv.org/html/2606.00320#S3.SS1.SSS0.Px2.p1.2),[§3\.1](https://arxiv.org/html/2606.00320#S3.SS1.SSS0.Px3.p1.3),[§3\.2](https://arxiv.org/html/2606.00320#S3.SS2.SSS0.Px1.p2.4)\.
- cjadams, D\. Borkan, inversion, J\. Sorensen, L\. Dixon, L\. Vasserman, and nithum \(2019\)Jigsaw unintended bias in toxicity classification\.Kaggle\.Cited by:[§3\.1](https://arxiv.org/html/2606.00320#S3.SS1.SSS0.Px3.p1.3)\.
- Z\. Deng, T\. P\. Zollo, B\. Eyre, A\. Inamdar, D\. Madras, and R\. Zemel \(2025\)QuEst: enhancing estimates of quantile\-based distributional measures using model predictions\.arXiv preprint arXiv:2507\.05220\.Cited by:[§1](https://arxiv.org/html/2606.00320#S1.p5.4)\.
- Z\. Deng, T\. Zollo, J\. Snell, T\. Pitassi, and R\. Zemel \(2023\)Distribution\-free statistical dispersion control for societal applications\.Advances in Neural Information Processing Systems36,pp\. 40342–40366\.Cited by:[§1](https://arxiv.org/html/2606.00320#S1.p5.4)\.
- J\. Duchi, E\. Hazan, and Y\. Singer \(2011\)Adaptive subgradient methods for online learning and stochastic optimization\.Journal of Machine Learning Research12,pp\. 2121–2159\.Cited by:[§1\.1](https://arxiv.org/html/2606.00320#S1.SS1.p4.1),[§2\.3](https://arxiv.org/html/2606.00320#S2.SS3.p1.3)\.
- Federal Reserve Bank of St\. Louis \(2024\)Federal reserve economic data \(fred\)\.Note:[https://fred\.stlouisfed\.org/](https://fred.stlouisfed.org/)Accessed via public APICited by:[§3\.3](https://arxiv.org/html/2606.00320#S3.SS3.SSS0.Px4.p1.1)\.
- S\. Gehman, S\. Gururangan, M\. Sap, Y\. Choi, and N\. A\. Smith \(2020\)Realtoxicityprompts: evaluating neural toxic degeneration in language models\.arXiv preprint arXiv:2009\.11462\.Cited by:[§3\.1](https://arxiv.org/html/2606.00320#S3.SS1.SSS0.Px3.p2.1)\.
- I\. Gibbs and E\. Candes \(2021\)Adaptive conformal inference under distribution shift\.Advances in Neural Information Processing Systems34,pp\. 1660–1672\.Cited by:[§1](https://arxiv.org/html/2606.00320#S1.p5.4),[§2\.3](https://arxiv.org/html/2606.00320#S2.SS3.p1.3)\.
- L\. Hanu and Unitary team \(2020\)Detoxify\.Note:Github\. https://github\.com/unitaryai/detoxifyCited by:[§3\.1](https://arxiv.org/html/2606.00320#S3.SS1.SSS0.Px3.p1.3)\.
- P\. Langley \(2000\)Crafting papers on machine learning\.InProceedings of the 17th International Conference on Machine Learning \(ICML 2000\),P\. Langley \(Ed\.\),Stanford, CA,pp\. 1207–1216\.Cited by:[Appendix D](https://arxiv.org/html/2606.00320#A4.SS0.SSS0.Px2.p7.1)\.
- L\. Lei, R\. Sahoo, and S\. Wager \(2023\)Policy learning under biased sample selection\.arXiv preprint arXiv:2304\.11735\.Cited by:[§1\.1](https://arxiv.org/html/2606.00320#S1.SS1.p1.3)\.
- J\. Lekeufack, A\. N\. Angelopoulos, A\. Bajcsy, M\. I\. Jordan, and J\. Malik \(2024\)Conformal decision theory: safe autonomous decisions from imperfect predictions\.External Links:2310\.05921,[Link](https://arxiv.org/abs/2310.05921)Cited by:[§1\.1](https://arxiv.org/html/2606.00320#S1.SS1.p2.1),[§1](https://arxiv.org/html/2606.00320#S1.p5.4),[§1](https://arxiv.org/html/2606.00320#S1.p6.2),[§2\.3](https://arxiv.org/html/2606.00320#S2.SS3.p1.3),[§2\.5](https://arxiv.org/html/2606.00320#S2.SS5.p1.8),[Remark 2\.3](https://arxiv.org/html/2606.00320#S2.Thmtheorem3.p1.3)\.
- H\. B\. McMahan \(2011\)Follow\-the\-regularized\-leader and mirror descent: equivalence theorems and l1 regularization\.InProceedings of the 14th International Conference on Artificial Intelligence and Statistics \(AISTATS\),JMLR Workshop and Conference Proceedings, Vol\.15,pp\. 525–533\.Cited by:[§1\.1](https://arxiv.org/html/2606.00320#S1.SS1.p4.1),[§1](https://arxiv.org/html/2606.00320#S1.p6.2),[§2\.3](https://arxiv.org/html/2606.00320#S2.SS3.p1.3)\.
- Meta AI \(2024\)The Llama 3 herd of models\.Note:[https://ai\.meta\.com/blog/meta\-llama\-3/](https://ai.meta.com/blog/meta-llama-3/)Accessed for Llama 3\.2\-3BCited by:[§3\.1](https://arxiv.org/html/2606.00320#S3.SS1.SSS0.Px3.p2.1)\.
- E\. Perez, S\. Huang, F\. Song, T\. Cai, R\. Ring, J\. Aslanides, A\. Glaese, N\. McAleese, and G\. Irving \(2022\)Red teaming language models with language models\.External Links:2202\.03286,[Link](https://arxiv.org/abs/2202.03286)Cited by:[§1](https://arxiv.org/html/2606.00320#S1.p3.3)\.
- R\. T\. Rockafellar and S\. Uryasev \(2000\)Optimization of conditional value\-at\-risk\.Journal of Risk2\(3\),pp\. 21–42\.External Links:[Document](https://dx.doi.org/10.21314/JOR.2000.038),[Link](https://www.risk.net/journal-risk/2161159/optimization-conditional-value-risk)Cited by:[§1\.1](https://arxiv.org/html/2606.00320#S1.SS1.p1.3),[§1](https://arxiv.org/html/2606.00320#S1.p2.4),[§1](https://arxiv.org/html/2606.00320#S1.p6.2),[§2\.2](https://arxiv.org/html/2606.00320#S2.SS2.p5.2),[§2](https://arxiv.org/html/2606.00320#S2.p1.1)\.
- R\. Sahoo, L\. Lei, and S\. Wager \(2025\)Learning from a biased sample\.External Links:2209\.01754,[Link](https://arxiv.org/abs/2209.01754)Cited by:[§1\.1](https://arxiv.org/html/2606.00320#S1.SS1.p1.3)\.
- S\. Shalev\-Shwartz \(2012\)Online learning and online convex optimization\.Foundations and Trends in Machine Learning4\(2\),pp\. 107–194\.Cited by:[§1\.1](https://arxiv.org/html/2606.00320#S1.SS1.p4.1)\.
- J\. C\. Snell, T\. P\. Zollo, Z\. Deng, T\. Pitassi, and R\. Zemel \(2022\)Quantile risk control: a flexible framework for bounding the probability of high\-loss predictions\.External Links:2212\.13629,[Link](https://arxiv.org/abs/2212.13629)Cited by:[§1](https://arxiv.org/html/2606.00320#S1.p1.1),[§1](https://arxiv.org/html/2606.00320#S1.p5.4)\.
- Z\. Yang, E\. Candès, and L\. Lei \(2024\)Bellman conformal inference: calibrating prediction intervals for time series\.External Links:2402\.05203,[Link](https://arxiv.org/abs/2402.05203)Cited by:[§1](https://arxiv.org/html/2606.00320#S1.p5.4),[§2\.3](https://arxiv.org/html/2606.00320#S2.SS3.p1.3)\.
- C\. Yeh, N\. Christianson, A\. Wierman, and Y\. Yue \(2025\)Conformal risk training: end\-to\-end optimization of conformal risk control\.External Links:2510\.08748,[Link](https://arxiv.org/abs/2510.08748)Cited by:[§4\.1](https://arxiv.org/html/2606.00320#S4.SS1.p1.1)\.
- T\. Zollo, T\. Morrill, Z\. Deng, J\. Snell, T\. Pitassi, and R\. Zemel \(2024a\)Prompt risk control: a rigorous framework for responsible deployment of large language models\.InInternational Conference on Learning Representations,Vol\.2024,pp\. 4045–4067\.Cited by:[§1](https://arxiv.org/html/2606.00320#S1.p5.4)\.
- T\. P\. Zollo, T\. Morrill, Z\. Deng, J\. C\. Snell, T\. Pitassi, and R\. Zemel \(2024b\)Prompt risk control: a rigorous framework for responsible deployment of large language models\.External Links:2311\.13628,[Link](https://arxiv.org/abs/2311.13628)Cited by:[§1](https://arxiv.org/html/2606.00320#S1.p1.1)\.
## Appendix A Appendix
### A.1 Notation
For convenience, we summarize all notation used throughout the paper in Table [1](https://arxiv.org/html/2606.00320#A1.T1). All quantities with a hat denote empirical (finite-sample) versions. Unless stated otherwise, all sequences $\{R_{t}\}_{t=1}^{T}$ may be non-stationary or adversarial.
Table 1: Summary of notation used throughout the paper.
## Appendix B Proofs
As discussed at the beginning of Section [2.4](https://arxiv.org/html/2606.00320#S2.SS4), we can transform $\lambda$ and $R(\cdot)$ by ([5](https://arxiv.org/html/2606.00320#S2.E5)) and ([6](https://arxiv.org/html/2606.00320#S2.E6)), and assume
$$\lambda_{\min}=0,\quad\lambda_{\max}=1,\quad R_{\min}=0,\quad R_{\max}=1,\quad\alpha\in(0,1).$$
### B.1 Proof of Proposition [2.1](https://arxiv.org/html/2606.00320#S2.Thmtheorem1)
Define the RU objective
$$f_{t}(c)\;=\;c+\frac{1}{1-\beta}(R_{t}-c)_{+},\qquad F_{T}(c)\;=\;\frac{1}{T}\sum_{t=1}^{T}f_{t}(c).$$
By the RU variational representation (applied to the realized sequence $R_{1:T}$),
$$\widehat{\operatorname{CVaR}}_{\beta}(R_{1:T})\;=\;\min_{c\in\mathbb{R}}F_{T}(c).$$
Since $R_{t}\in[0,1]$, $f_{t}$ is decreasing on $(-\infty,0]$ and increasing on $[1,\infty)$. Thus,
$$\widehat{\operatorname{CVaR}}_{\beta}(R_{1:T})\;=\;\min_{c\in[0,1]}F_{T}(c).\qquad\text{(8)}$$
By the regret definition,
$$\begin{aligned}\frac{1}{T}\sum_{t=1}^{T}f_{t}(c_{t})&=\min_{c\in[0,1]}F_{T}(c)+\frac{\max_{c\in[0,1]}\mathrm{Reg}_{T}(c)}{T}&&\text{(9)}\\&=\widehat{\operatorname{CVaR}}_{\beta}(R_{1:T})+\frac{\max_{c\in[0,1]}\mathrm{Reg}_{T}(c)}{T}.&&\text{(10)}\end{aligned}$$
Under ([4](https://arxiv.org/html/2606.00320#S2.E4)),
$$\frac{1}{T}\sum_{t=1}^{T}f_{t}(c_{t})\;\leq\;\alpha+o(1).\qquad\text{(11)}$$
Combining ([9](https://arxiv.org/html/2606.00320#A2.E9)) and ([11](https://arxiv.org/html/2606.00320#A2.E11)) yields
$$\widehat{\operatorname{CVaR}}_{\beta}(R_{1:T})\leq\alpha+o(1)-\frac{\max_{c\in[0,1]}\operatorname{Reg}_{T}(c)}{T}.$$
By assumption (3), uniformly over $c\in[0,1]$,
$$\frac{|\max_{c\in[0,1]}\operatorname{Reg}_{T}(c)|}{T}=o(1).$$
Hence
$$\widehat{\operatorname{CVaR}}_{\beta}(R_{1:T})\leq\alpha+o(1),$$
which proves (1).
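The restriction of the minimization to $c\in[0,1]$ in (8) can be checked directly from the slopes of $f_{t}$. Under the normalization $R_{t}\in[0,1]$, the positive part $(R_{t}-c)_{+}$ equals $R_{t}-c$ for $c<0$ and vanishes for $c>1$, so:

```latex
f_t'(c) = 1 - \frac{1}{1-\beta} = -\frac{\beta}{1-\beta} \le 0
  \quad\text{for } c < 0 \le R_t,
\qquad
f_t'(c) = 1 > 0
  \quad\text{for } c > 1 \ge R_t.
```

Every $f_{t}$, and hence the average $F_{T}$, is therefore nonincreasing to the left of $0$ and increasing to the right of $1$, so a minimizer of $F_{T}$ over $\mathbb{R}$ can always be found in $[0,1]$.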
### B.2 Proof of Theorem [2.2](https://arxiv.org/html/2606.00320#S2.Thmtheorem2)
By ([9](https://arxiv.org/html/2606.00320#A2.E9)),
$$\widehat{\operatorname{CVaR}}_{\beta}(R_{1:T})\leq\alpha+\left(\frac{1}{T}\sum_{t=1}^{T}f_{t}(c_{t})-\alpha\right)+\frac{|\max_{c\in[0,1]}\mathrm{Reg}_{T}(c)|}{T}.$$
The proof is then completed by applying Theorems [2.4](https://arxiv.org/html/2606.00320#S2.Thmtheorem4) and [2.6](https://arxiv.org/html/2606.00320#S2.Thmtheorem6).
### B.3 Proof of Theorem [2.4](https://arxiv.org/html/2606.00320#S2.Thmtheorem4)
We prove a slightly more general version that only requires $q_{0}\geq\max\{1,\beta/(1-\beta)\}^{2}$. In this proof, $F_{t}$ denotes the cumulative (unnormalized) loss rather than the average used in Section B.1. Let
$$F_{t}(c)=\sum_{s=1}^{t}f_{s}(c),\qquad\psi_{t}(c)=\frac{1}{2\eta_{t}}\left(c-\frac{1}{2}\right)^{2},\qquad M_{t}=\min_{c\in[0,1]}\{F_{t}(c)+\psi_{t}(c)\}.$$
Then $c_{t+1}$ minimizes $F_{t}+\psi_{t}$, and $M_{0}=0$.
We first prove the lower bound. Since $c_{t}$ minimizes $F_{t-1}+\psi_{t-1}$,
$$\begin{aligned}M_{t}&\leq F_{t}(c_{t})+\psi_{t}(c_{t})\\&=M_{t-1}+f_{t}(c_{t})+\psi_{t}(c_{t})-\psi_{t-1}(c_{t}).\end{aligned}$$
The sequence $(\eta_{t})$ is nonincreasing and $|c_{t}-1/2|\leq 1/2$, so
$$\psi_{t}(c_{t})-\psi_{t-1}(c_{t})\leq\frac{1}{8}\left(\frac{1}{\eta_{t}}-\frac{1}{\eta_{t-1}}\right).$$
Thus
$$f_{t}(c_{t})\geq M_{t}-M_{t-1}-\frac{1}{8}\left(\frac{1}{\eta_{t}}-\frac{1}{\eta_{t-1}}\right).$$
Summing over $t$ and using $M_{T}\geq\min_{c\in[0,1]}F_{T}(c)$ gives
$$\sum_{t=1}^{T}f_{t}(c_{t})\geq\min_{c\in[0,1]}F_{T}(c)-\frac{1}{8}\left(\frac{1}{\eta_{T}}-\frac{1}{\eta_{0}}\right)=\min_{c\in[0,1]}\sum_{t=1}^{T}f_{t}(c)-\frac{\sqrt{q_{T}}-\sqrt{q_{0}}}{4}.$$
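The lower bound above can be checked numerically with a small simulation. The sketch below is illustrative only and rests on two assumptions that the final equality implies but that are not restated in this section: $\eta_{t}=1/(2\sqrt{q_{t}})$ and $q_{t}=q_{t-1}+g_{t}^{2}$, where $g_{t}$ is a subgradient of $f_{t}$ at $c_{t}$. All function names are hypothetical, and the ternary search is a stand-in for whatever solver an actual implementation would use.

```python
import math
import random


def ru_loss(c, r, beta):
    """RU surrogate f_t(c) = c + (r - c)_+ / (1 - beta)."""
    return c + max(r - c, 0.0) / (1.0 - beta)


def argmin_unit_interval(h, iters=100):
    """Ternary search for the minimizer of a strictly convex h on [0, 1]."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if h(m1) <= h(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)


def run_ftrl(losses, beta, q0):
    """Play c_{t+1} = argmin of F_t + psi_t; return (sum_t f_t(c_t), q_T)."""
    q, seen, total = q0, [], 0.0
    c = 0.5  # c_1 minimizes psi_0(c) = sqrt(q_0) * (c - 1/2)^2
    for r in losses:
        total += ru_loss(c, r, beta)
        g = 1.0 - (1.0 / (1.0 - beta) if r > c else 0.0)  # a subgradient of f_t at c_t
        q += g * g  # assumed accumulator q_t = q_{t-1} + g_t^2
        seen.append(r)
        weight = math.sqrt(q)  # 1 / (2 * eta_t) under the assumed eta_t = 1 / (2 * sqrt(q_t))
        c = argmin_unit_interval(
            lambda x: sum(ru_loss(x, s, beta) for s in seen) + weight * (x - 0.5) ** 2
        )
    return total, q


def best_fixed_loss(losses, beta):
    """min over c in [0, 1] of sum_t f_t(c); piecewise linear, so kinks and endpoints suffice."""
    candidates = list(losses) + [0.0, 1.0]
    return min(sum(ru_loss(c, r, beta) for r in losses) for c in candidates)
```

Comparing `run_ftrl(...)[0]` with `best_fixed_loss(...)` minus $(\sqrt{q_{T}}-\sqrt{q_{0}})/4$ on any loss sequence in $[0,1]$ gives a quick sanity check of the inequality; it does not replace the proof.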
We now prove the upper bound. The function $F_{t-1}+\psi_{t-1}$ is $1/\eta_{t-1}$-strongly convex and is minimized at $c_{t}$. Hence, for every $c\in[0,1]$,

$$F_{t-1}(c)+\psi_{t-1}(c)\geq M_{t-1}+\frac{1}{2\eta_{t-1}}(c-c_{t})^{2}.$$

By convexity of $f_{t}$,

$$f_{t}(c)\geq f_{t}(c_{t})+g_{t}(c-c_{t}).$$

Since $\psi_{t}\geq\psi_{t-1}$, we obtain

$$\begin{aligned}M_{t}&=\min_{c\in[0,1]}\{F_{t-1}(c)+\psi_{t-1}(c)+f_{t}(c)+\psi_{t}(c)-\psi_{t-1}(c)\}\\&\geq M_{t-1}+f_{t}(c_{t})+\min_{c\in[0,1]}\left\{\frac{1}{2\eta_{t-1}}(c-c_{t})^{2}+g_{t}(c-c_{t})\right\}\\&\geq M_{t-1}+f_{t}(c_{t})-\frac{\eta_{t-1}}{2}g_{t}^{2}.\end{aligned}$$

Therefore

$$f_{t}(c_{t})\leq M_{t}-M_{t-1}+\frac{\eta_{t-1}}{2}g_{t}^{2}.$$

After summing,

$$\sum_{t=1}^{T}f_{t}(c_{t})\leq M_{T}+\frac{1}{2}\sum_{t=1}^{T}\eta_{t-1}g_{t}^{2}.$$

Let $u\in\arg\min_{c\in[0,1]}F_{T}(c)$. Since $|u-1/2|\leq 1/2$,
$$M_{T}\leq F_{T}(u)+\psi_{T}(u)\leq\min_{c\in[0,1]}F_{T}(c)+\frac{1}{8\eta_{T}}=\min_{c\in[0,1]}\sum_{t=1}^{T}f_{t}(c)+\frac{\sqrt{q_{T}}}{4}.$$

It remains to bound the AdaGrad sum more sharply. For $x=q_{t-1}$ and $a=g_{t}^{2}$,

$$\frac{a}{\sqrt{x}}=2\bigl(\sqrt{x+a}-\sqrt{x}\bigr)+\sqrt{x}\left(\sqrt{1+\frac{a}{x}}-1\right)^{2}.$$

Let $s=\sqrt{1+a/x}$. Since $a\leq q_{0}\leq x$,

$$xs(s-1)=x+a-\sqrt{x(x+a)}\leq a\leq q_{0}.$$

Rearranging gives

$$\sqrt{x}\left(\sqrt{1+\frac{a}{x}}-1\right)^{2}\leq q_{0}\left(\frac{1}{\sqrt{x}}-\frac{1}{\sqrt{x+a}}\right).$$

Consequently,

$$\frac{g_{t}^{2}}{\sqrt{q_{t-1}}}\leq 2\bigl(\sqrt{q_{t}}-\sqrt{q_{t-1}}\bigr)+q_{0}\left(\frac{1}{\sqrt{q_{t-1}}}-\frac{1}{\sqrt{q_{t}}}\right).$$

Summing over $t$ gives

$$\begin{aligned}\sum_{t=1}^{T}\frac{g_{t}^{2}}{\sqrt{q_{t-1}}}&\leq 2\bigl(\sqrt{q_{T}}-\sqrt{q_{0}}\bigr)+q_{0}\left(\frac{1}{\sqrt{q_{0}}}-\frac{1}{\sqrt{q_{T}}}\right)\\&=2\sqrt{q_{T}}-\sqrt{q_{0}}-\frac{q_{0}}{\sqrt{q_{T}}}.\end{aligned}$$

Since $\eta_{t-1}=1/(2\sqrt{q_{t-1}})$,

$$\begin{aligned}\frac{1}{2}\sum_{t=1}^{T}\eta_{t-1}g_{t}^{2}&=\frac{1}{4}\sum_{t=1}^{T}\frac{g_{t}^{2}}{\sqrt{q_{t-1}}}\\&\leq\frac{1}{2}\sqrt{q_{T}}-\frac{1}{4}\sqrt{q_{0}}-\frac{q_{0}}{4\sqrt{q_{T}}}.\end{aligned}$$

Combining the preceding displays yields

$$\sum_{t=1}^{T}f_{t}(c_{t})-\min_{c\in[0,1]}\sum_{t=1}^{T}f_{t}(c)\leq\frac{3}{4}\sqrt{q_{T}}-\frac{1}{4}\sqrt{q_{0}}-\frac{q_{0}}{4\sqrt{q_{T}}}\leq\frac{3}{4}\sqrt{q_{T}},\tag{12}$$

which is the claimed upper bound.
Let

$$N_{+}(T)=\#\{t:g_{t}=1\},\qquad N_{-}(T)=\#\left\{t:g_{t}=-\frac{\beta}{1-\beta}\right\},$$

then

$$\begin{aligned}q_{T}&=q_{0}+N_{+}(T)+\left(\frac{\beta}{1-\beta}\right)^{2}N_{-}(T)\\&\leq\max\left\{1,\left(\frac{\beta}{1-\beta}\right)^{2}\right\}(1+N_{+}(T)+N_{-}(T))\\&=\max\left\{1,\left(\frac{\beta}{1-\beta}\right)^{2}\right\}(T+1).\end{aligned}$$

∎
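The update analyzed in this proof can be sketched numerically. The following is a minimal illustration, not the authors' implementation: it assumes $q_0=\max\{1,\beta/(1-\beta)\}^2$, uses the subgradient $g_t=1-\mathbf{1}\{R_t>c_t\}/(1-\beta)$, and solves each one-dimensional minimization of $F_{t}+\psi_{t}$ by bisection on its slope (the bisection solver and all function names are our own choices).

```python
import numpy as np

def ru_loss(c, r, beta):
    """Rockafellar-Uryasev loss f_t(c) = c + (r - c)_+ / (1 - beta)."""
    return c + max(r - c, 0.0) / (1.0 - beta)

def ftrl_step(past_r, sqrt_q, beta, iters=60):
    """Minimize sum_s f_s(c) + sqrt_q * (c - 1/2)^2 over [0, 1].

    With eta = 1 / (2 * sqrt_q), the regularizer psi(c) = (c - 1/2)^2 / (2 * eta)
    equals sqrt_q * (c - 1/2)^2. The objective is strongly convex, so its
    right-derivative is increasing and bisection locates the minimizer.
    """
    past_r = np.asarray(past_r, dtype=float)
    t = past_r.size

    def slope(c):
        # right-derivative of the regularized cumulative loss at c
        return t - np.sum(past_r > c) / (1.0 - beta) + 2.0 * sqrt_q * (c - 0.5)

    if slope(0.0) >= 0.0:
        return 0.0
    if slope(1.0) <= 0.0:
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if slope(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def run_inner_learner(rs, beta):
    """Play c_t = argmin F_{t-1} + psi_{t-1}; return thresholds, total loss, q_0, q_T."""
    b = beta / (1.0 - beta)
    q0 = max(1.0, b) ** 2          # assumed initialization, meets the stated condition
    q = q0
    cs, total = [], 0.0
    for t, r in enumerate(rs):
        c = ftrl_step(rs[:t], np.sqrt(q), beta)
        g = 1.0 - float(r > c) / (1.0 - beta)   # subgradient of f_t at c_t
        q += g * g
        total += ru_loss(c, r, beta)
        cs.append(c)
    return np.array(cs), total, q0, q

def best_fixed_loss(rs, beta):
    """min over c in [0, 1] of sum_t f_t(c); attained at 0, 1, or a sample point."""
    cands = np.concatenate(([0.0, 1.0], rs))
    return min(sum(ru_loss(c, r, beta) for r in rs) for c in cands)
```

Comparing `total - best_fixed_loss(rs, beta)` with $\tfrac{3}{4}\sqrt{q_T}$ and $-(\sqrt{q_T}-\sqrt{q_0})/4$ gives a quick empirical check of both bounds on any sequence in $[0,1]$.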
### B.4 Proof of Theorem [2.5](https://arxiv.org/html/2606.00320#S2.Thmtheorem5)
###### Proof.
Let $R_{t}$ be independent with

$$\mathbb{P}(R_{t}=0)=\beta,\qquad\mathbb{P}(R_{t}=1)=1-\beta.$$

For $c\in[0,1]$ and $b=\frac{\beta}{1-\beta}$,

$$f_{t}(c)=c\quad\text{if }R_{t}=0,\qquad f_{t}(c)=\frac{1}{1-\beta}-bc\quad\text{if }R_{t}=1.$$

Thus the additive constants cancel in comparison with a fixed threshold. Define

$$g_{t}:=\begin{cases}1,&R_{t}=0,\\ -b,&R_{t}=1.\end{cases}$$

Then

$$f_{t}(c)=\text{constant}+g_{t}c,$$

and

$$\max_{c\in[0,1]}\mathrm{Reg}_{T}(c)=\sum_{t=1}^{T}g_{t}c_{t}+\left(-\sum_{t=1}^{T}g_{t}\right)_{+},$$

where the independent slopes satisfy

$$\mathbb{P}(g_{t}=1)=\beta,\qquad\mathbb{P}(g_{t}=-b)=1-\beta.$$

They have mean zero. Since $c_{t}$ is chosen before $R_{t}$ is drawn,
$$\mathbb{E}[g_{t}c_{t}]=0.$$

Therefore

$$\mathbb{E}\max_{c\in[0,1]}\mathrm{Reg}_{T}(c)=\mathbb{E}\left[\left(-\sum_{t=1}^{T}g_{t}\right)_{+}\right].$$

If $N_{T}=\#\{t:g_{t}=1\}$, then $N_{T}\sim\operatorname{Binomial}(T,\beta)$ and

$$\sum_{t=1}^{T}g_{t}=N_{T}-b(T-N_{T})=\frac{N_{T}-\beta T}{1-\beta}.$$

Hence

$$\mathbb{E}\max_{c\in[0,1]}\mathrm{Reg}_{T}(c)=\mathbb{E}\left[\left(-\frac{N_{T}-\beta T}{1-\beta}\right)_{+}\right].$$

Thus, there exists a realization of $(R_{t})_{t=1}^{T}$ such that

$$\max_{c\in[0,1]}\mathrm{Reg}_{T}(c)\geq\mathbb{E}\left[\left(-\frac{N_{T}-\beta T}{1-\beta}\right)_{+}\right].$$

Finally,

$$\frac{N_{T}-\beta T}{\sqrt{T\beta(1-\beta)}}\Rightarrow Z,\qquad Z\sim N(0,1),$$

and the normalized variables are uniformly integrable because their second moments are bounded. Thus

$$\frac{1}{\sqrt{T}}\mathbb{E}\left[\left(-\frac{N_{T}-\beta T}{1-\beta}\right)_{+}\right]\to\frac{\sqrt{\beta(1-\beta)}}{1-\beta}\,\mathbb{E}[(-Z)_{+}]=\frac{1}{\sqrt{2\pi}}\sqrt{\frac{\beta}{1-\beta}}.$$

∎
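To see how quickly the normalized expectation approaches its limit, one can sum the binomial expectation directly. The sketch below is illustrative only; the function names are hypothetical, and the log-space evaluation of the binomial pmf is our choice to avoid overflow for large $T$.

```python
import math

def expected_positive_part(T, beta):
    """E[(-(N_T - beta*T) / (1 - beta))_+] for N_T ~ Binomial(T, beta), by direct summation."""
    total = 0.0
    for n in range(T + 1):
        shortfall = (beta * T - n) / (1.0 - beta)
        if shortfall <= 0.0:
            continue  # the positive part vanishes when N_T >= beta * T
        # log of the Binomial(T, beta) pmf at n, computed with lgamma for stability
        log_pmf = (math.lgamma(T + 1) - math.lgamma(n + 1) - math.lgamma(T - n + 1)
                   + n * math.log(beta) + (T - n) * math.log(1.0 - beta))
        total += shortfall * math.exp(log_pmf)
    return total

def limit_constant(beta):
    """Limiting constant (1 / sqrt(2*pi)) * sqrt(beta / (1 - beta))."""
    return math.sqrt(beta / (1.0 - beta)) / math.sqrt(2.0 * math.pi)
```

For example, `expected_positive_part(T, beta) / math.sqrt(T)` can be compared against `limit_constant(beta)` for increasing `T`.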
### B.5 Matching the lower bound with knowledge of $T$
###### Theorem B.1.
For every $T\geq 1$, there is a deterministic online learning algorithm with knowledge of $T$ such that

$$\max_{c\in[0,1]}\mathrm{Reg}_{T}(c)\leq\frac{1}{\sqrt{2\pi}}\sqrt{\frac{\beta}{1-\beta}}\sqrt{T}\cdot(1+o(1)),$$

for every sequence $R_{1},\ldots,R_{T}\in[0,1]$.
###### Proof.
Let $g_{1},g_{2},\ldots$ be independent with

$$\mathbb{P}(g_{j}=1)=\beta,\qquad\mathbb{P}(g_{j}=-b)=1-\beta,$$

and define, for $n\geq 0$,

$$\Phi_{n}(s)=\mathbb{E}\left[\left(-s-\sum_{j=1}^{n}g_{j}\right)_{+}\right].$$

Set $S_{0}=0$. At round $t$, play

$$c_{t}=\frac{\Phi_{T-t}(S_{t-1}-b)-\Phi_{T-t}(S_{t-1}+1)}{1+b}.$$

After observing $R_{t}$, set

$$g_{t}=1-\frac{1}{1-\beta}\mathbf{1}\{R_{t}>c_{t}\}\in\{1,-b\},\qquad S_{t}=S_{t-1}+g_{t}.$$

The choice $g_{t}=1$ at $R_{t}=c_{t}$ is a valid subgradient because $\partial f_{t}(c_{t})=[-b,1]$ at the kink.
The function $\Phi_{n}$ is nonincreasing and 1-Lipschitz. Since $S_{t-1}-b\leq S_{t-1}+1$, the numerator defining $c_{t}$ is nonnegative and at most $1+b$. Hence $c_{t}\in[0,1]$. Also, since $\beta=b/(1+b)$,

$$\Phi_{n+1}(s)=\frac{b\Phi_{n}(s+1)+\Phi_{n}(s-b)}{1+b}.$$

The definition of $c_{t}$ gives, for either $g\in\{1,-b\}$,

$$gc_{t}+\Phi_{T-t}(S_{t-1}+g)=\Phi_{T-t+1}(S_{t-1}).$$

With $g=g_{t}$, this telescopes to

$$\sum_{t=1}^{T}g_{t}c_{t}+(-S_{T})_{+}=\Phi_{T}(0).$$

For every fixed $c\in[0,1]$, convexity gives

$$f_{t}(c_{t})-f_{t}(c)\leq g_{t}(c_{t}-c).$$

Taking the maximum over $c\in[0,1]$,

$$\begin{aligned}\max_{c\in[0,1]}\mathrm{Reg}_{T}(c)&\leq\sum_{t=1}^{T}g_{t}c_{t}-\min_{c\in[0,1]}c\sum_{t=1}^{T}g_{t}\\&=\sum_{t=1}^{T}g_{t}c_{t}+(-S_{T})_{+}\\&=\Phi_{T}(0)=\mathbb{E}\left[\left(-\sum_{t=1}^{T}g_{t}\right)_{+}\right].\end{aligned}$$

The upper bound follows from the same calculation as in the proof of Theorem [2.5](https://arxiv.org/html/2606.00320#S2.Thmtheorem5). ∎
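The strategy in this proof is concrete enough to simulate for small horizons. The sketch below is a hedged illustration: it evaluates $\Phi_n$ by direct summation over the binomial count of $+1$ slopes (an $O(n)$ computation per call, chosen for clarity rather than speed), and the names `phi` and `play` are hypothetical.

```python
import math

def phi(n, s, beta):
    """Phi_n(s) = E[(-s - sum_{j<=n} g_j)_+], with g_j = 1 w.p. beta and -b w.p. 1 - beta."""
    b = beta / (1.0 - beta)
    total = 0.0
    for k in range(n + 1):  # k counts the slopes equal to +1
        pmf = math.comb(n, k) * beta**k * (1.0 - beta) ** (n - k)
        total += pmf * max(-s - k + b * (n - k), 0.0)
    return total

def play(T, rs, beta):
    """Run the horizon-aware strategy on losses rs.

    Returns (sum_t g_t * c_t, S_T, list of thresholds c_t).
    """
    b = beta / (1.0 - beta)
    S, lin, cs = 0.0, 0.0, []
    for t in range(1, T + 1):
        c = (phi(T - t, S - b, beta) - phi(T - t, S + 1.0, beta)) / (1.0 + b)
        g = 1.0 if rs[t - 1] <= c else -b   # g_t = 1 - 1{R_t > c_t} / (1 - beta)
        lin += g * c
        S += g                              # S_t = S_{t-1} + g_t
        cs.append(c)
    return lin, S, cs
```

On any input sequence, `lin + max(-S, 0.0)` should agree with `phi(T, 0.0, beta)` up to floating-point error, which mirrors the telescoping identity used in the proof.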
### B.6 Proof of Theorem [2.6](https://arxiv.org/html/2606.00320#S2.Thmtheorem6)
Recall that $R_{t}$ is extended outside of $[0,1]$ with

$$R_{t}(\lambda)=\begin{cases}0,&\lambda<0,\\ 1,&\lambda>1,\end{cases}$$

and the definition of $f_{t}$ in ([7](https://arxiv.org/html/2606.00320#S2.E7)):

$$f_{t}(c)=c+\frac{1}{1-\beta}\bigl(R_{t}(\lambda_{t})-c\bigr)_{+}.$$
We start by proving a lemma.
###### Lemma B.2.
Let $x_{1},\ldots,x_{s}\in[0,1]$ and $x_{\tau}=0$ for $\tau\geq s+1$. Then, for every $t\geq s$,

$$\min_{c\in[0,1]}\sum_{i=1}^{t}\left[c+\frac{(x_{i}-c)_{+}}{1-\beta}\right]-\alpha t\leq\frac{1}{1-\beta}\left(\min_{c\in[0,1]}\sum_{i=1}^{s}\left[c+\frac{(x_{i}-c)_{+}}{1-\beta}\right]-\alpha s\right)_{+}.$$
###### Proof.
Let

$$V_{t}=\frac{1}{t}\min_{c\in[0,1]}\sum_{i=1}^{t}\left[c+\frac{(x_{i}-c)_{+}}{1-\beta}\right].$$

Then we are left to prove

$$t(V_{t}-\alpha)\leq\frac{s}{1-\beta}(V_{s}-\alpha)_{+}.\tag{13}$$

For any $i\leq s$ and $j>s$,

$$\left[c+\frac{(x_{i}-c)_{+}}{1-\beta}\right]\geq c=\left[c+\frac{(x_{j}-c)_{+}}{1-\beta}\right].$$

Thus, $V_{t}$ is nonincreasing in $t$ for $t\geq s$. In particular,

$$V_{t}\leq V_{s},\quad\forall t\geq s.$$

First suppose $(1-\beta)t\leq s$. If $V_{s}\leq\alpha$, then $V_{t}\leq\alpha$ and ([13](https://arxiv.org/html/2606.00320#A2.E13)) follows because the LHS is $t(V_{t}-\alpha)\leq 0$ while the RHS is $\frac{s}{1-\beta}(V_{s}-\alpha)_{+}=0$. Otherwise, because $(1-\beta)t\leq s$,

$$t(V_{t}-\alpha)\leq\frac{s}{1-\beta}(V_{s}-\alpha)=\frac{s}{1-\beta}(V_{s}-\alpha)_{+}.$$

This proves the claim in the first case.
Now suppose $(1-\beta)t>s$. Let

$$c_{t}^{*}=\operatorname{argmin}_{c\in[0,1]}\sum_{i=1}^{t}\left[c+\frac{(x_{i}-c)_{+}}{1-\beta}\right].$$

The first-order condition implies

$$0\in t+\frac{1}{1-\beta}\sum_{i=1}^{t}\partial_{c}(x_{i}-c_{t}^{*})_{+}.$$

If $c_{t}^{*}>0$, then $\partial_{c}(x_{i}-c_{t}^{*})_{+}=0$ for all $i>s$. Then

$$t+\frac{1}{1-\beta}\sum_{i=1}^{t}\partial_{c}(x_{i}-c_{t}^{*})_{+}=t+\frac{1}{1-\beta}\sum_{i=1}^{s}\partial_{c}(x_{i}-c_{t}^{*})_{+}\geq t-\frac{s}{1-\beta}>0.$$

By contradiction, we conclude that $c_{t}^{*}=0$. Therefore,

$$tV_{t}=\frac{1}{1-\beta}\sum_{i=1}^{t}x_{i}=\frac{1}{1-\beta}\sum_{i=1}^{s}x_{i}.$$

Thus,

$$t(V_{t}-\alpha)=\frac{1}{1-\beta}\sum_{i=1}^{s}x_{i}-\alpha t<\frac{1}{1-\beta}\left\{\sum_{i=1}^{s}x_{i}-\alpha s\right\}\leq\frac{1}{1-\beta}\left\{\sum_{i=1}^{s}\left[c+\frac{(x_{i}-c)_{+}}{1-\beta}\right]-\alpha s\right\}=\frac{s}{1-\beta}(V_{s}-\alpha).$$

Here $c$ denotes a minimizer of the $s$-term sum, and the second inequality uses $x_{i}\leq c+(x_{i}-c)_{+}/(1-\beta)$. Therefore the same bound follows in the second case. This proves the lemma. ∎
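A quick randomized sanity check of Lemma B.2 can be written as follows. This is only an illustrative sketch under our own assumptions: the helper names are hypothetical, the parameter ranges for $\alpha$ and $\beta$ are arbitrary, and the minimum over $c$ is taken over the candidate set $\{0,1\}\cup\{x_i\}$, which suffices because the objective is convex and piecewise linear with kinks only at the $x_i$.

```python
import random

def ru_min_sum(xs, beta):
    """min over c in [0, 1] of sum_i [c + (x_i - c)_+ / (1 - beta)]."""
    cands = [0.0, 1.0] + list(xs)
    return min(sum(c + max(x - c, 0.0) / (1.0 - beta) for x in xs) for c in cands)

def lemma_gap(xs_head, t, alpha, beta):
    """RHS minus LHS of Lemma B.2 with x_1..x_s = xs_head and x_{s+1}..x_t = 0."""
    s = len(xs_head)
    padded = list(xs_head) + [0.0] * (t - s)
    lhs = ru_min_sum(padded, beta) - alpha * t
    rhs = max(ru_min_sum(xs_head, beta) - alpha * s, 0.0) / (1.0 - beta)
    return rhs - lhs

def worst_gap(trials=200, seed=0):
    """Smallest RHS - LHS over random instances; the lemma predicts it is nonnegative."""
    rng = random.Random(seed)
    worst = float("inf")
    for _ in range(trials):
        s = rng.randint(1, 8)
        t = s + rng.randint(0, 30)
        xs = [rng.random() for _ in range(s)]
        alpha, beta = rng.uniform(0.0, 0.5), rng.uniform(0.5, 0.95)
        worst = min(worst, lemma_gap(xs, t, alpha, beta))
    return worst
```

A nonnegative `worst_gap()` (up to floating-point tolerance) is consistent with the lemma; it is of course not a substitute for the proof.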
Proof of Theorem [2.6](https://arxiv.org/html/2606.00320#S2.Thmtheorem6). We now prove the theorem. If $\lambda_{T+1}\geq 0$, then the outer update gives

$$\sum_{t=1}^{T}f_{t}(c_{t})-\alpha T=\frac{\lambda_{1}-\lambda_{T+1}}{\gamma}\leq\frac{\lambda_{1}}{\gamma},$$

which is already stronger than the claimed bound.

It remains to consider the case $\lambda_{T+1}<0$. Let $s\leq T$ be the last time before $T+1$ at which $\lambda_{s}\geq 0$, i.e.,

$$\lambda_{s}\geq 0,\qquad\lambda_{s+1},\lambda_{s+2},\ldots,\lambda_{T+1}<0.$$

Such an $s$ must exist as $\lambda_{1}\geq 0$. At time $s$, $0\leq f_{s}(c_{s})\leq(1-\beta)^{-1}$. Therefore

$$\lambda_{s+1}=\lambda_{s}-\gamma(f_{s}(c_{s})-\alpha)\geq-\gamma\bigl((1-\beta)^{-1}-\alpha\bigr).$$

Telescoping the outer update to time $s$ gives

$$\sum_{\tau=1}^{s}f_{\tau}(c_{\tau})-\alpha s=\frac{\lambda_{1}-\lambda_{s+1}}{\gamma}\leq\frac{\lambda_{1}}{\gamma}+(1-\beta)^{-1}-\alpha.$$

Let
$$C_{\beta}=\frac{3}{4}\max\left\{1,\frac{\beta}{1-\beta}\right\}.$$

Applying the lower bound for $\max_{c\in[0,1]}\mathrm{Reg}_{T}(c)$ in Theorem [2.4](https://arxiv.org/html/2606.00320#S2.Thmtheorem4) to the first $s$ periods,

$$\min_{c\in[0,1]}\sum_{\tau=1}^{s}f_{\tau}(c)-\alpha s\leq\frac{\lambda_{1}}{\gamma}+(1-\beta)^{-1}-\alpha+\frac{C_{\beta}}{3}\sqrt{T+1}.\tag{14}$$

For every $t=s+1,\ldots,T$, we have $\lambda_{t}<0$, hence $R_{t}(\lambda_{t})=0$. Thus the realized losses after time $s$ are zero. By Lemma [B.2](https://arxiv.org/html/2606.00320#A2.Thmtheorem2),

$$\min_{c\in[0,1]}\sum_{\tau=1}^{T}f_{\tau}(c)-\alpha T\leq\frac{\lambda_{1}/\gamma+(1-\beta)^{-1}-\alpha+\frac{C_{\beta}}{3}\sqrt{T+1}}{1-\beta}.$$

Finally, using the upper bound on $\max_{c\in[0,1]}\mathrm{Reg}_{T}(c)$ in Theorem [2.4](https://arxiv.org/html/2606.00320#S2.Thmtheorem4),

$$\begin{aligned}\sum_{t=1}^{T}f_{t}(c_{t})-\alpha T&\leq\min_{c\in[0,1]}\sum_{t=1}^{T}f_{t}(c)-\alpha T+C_{\beta}\sqrt{T+1}\\&\leq\frac{\lambda_{1}/\gamma+(1-\beta)^{-1}-\alpha}{1-\beta}+C_{\beta}\left(1+\frac{1}{3(1-\beta)}\right)\sqrt{T+1}.\end{aligned}$$

Dividing by $T$ proves the theorem.
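The two-level structure used in this proof (an outer update $\lambda_{t+1}=\lambda_{t}-\gamma(f_{t}(c_{t})-\alpha)$ wrapped around the inner threshold learner) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' code: the loss `extended_loss` with $R_t(\lambda)=\lambda u_t$ on $[0,1]$ is a hypothetical stand-in, the inner minimization is solved by bisection (our choice), and $q_0=\max\{1,\beta/(1-\beta)\}^2$ is assumed.

```python
import numpy as np

def inner_threshold(past_r, sqrt_q, beta):
    """argmin over [0, 1] of sum_s f_s(c) + sqrt_q * (c - 1/2)^2, by bisection on the slope."""
    past_r = np.asarray(past_r, dtype=float)
    slope = lambda c: (past_r.size - np.sum(past_r > c) / (1 - beta)
                       + 2 * sqrt_q * (c - 0.5))
    if slope(0.0) >= 0:
        return 0.0
    lo, hi = 0.0, 1.0  # slope(1) > 0 always holds because every past loss is <= 1
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if slope(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi)

def extended_loss(lam, u):
    """Hypothetical loss R_t(lambda): 0 for lambda < 0, 1 for lambda > 1, lambda * u otherwise."""
    if lam < 0:
        return 0.0
    if lam > 1:
        return 1.0
    return lam * u

def run_controller(us, alpha, beta, gamma, lam1=1.0):
    """Return the lambda path (length T + 1) and the realized RU losses f_t(c_t)."""
    b = beta / (1 - beta)
    q = max(1.0, b) ** 2                     # assumed q_0
    lam, lams, rs, fs = lam1, [lam1], [], []
    for u in us:
        c = inner_threshold(rs, np.sqrt(q), beta)   # chosen before the loss is revealed
        r = extended_loss(lam, u)
        f = c + max(r - c, 0.0) / (1 - beta)
        g = 1.0 - float(r > c) / (1 - beta)
        q += g * g
        lam = lam - gamma * (f - alpha)             # outer update
        rs.append(r)
        fs.append(f)
        lams.append(lam)
    return np.array(lams), np.array(fs)
```

By construction, `fs.sum() - alpha * T` equals `(lams[0] - lams[-1]) / gamma`, which is the telescoping identity the proof starts from.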
## Appendix C Additional Results for Experiment 1: Large Language Model Toxicity Control
Figure 4: LLM Toxicity Control Experiment. Dataset toxicity distributions under different sampling regimes. (a, top) Uniform setting: stationary uniform sampling. (b, middle) Adversarial setting (time-varying Beta distribution) with progressively tail-thickening sampling. (c, bottom) Adversarial-jump setting (time-varying Beta distribution).

Figures [4(a)](https://arxiv.org/html/2606.00320#A3.F4.sf1), [4(b)](https://arxiv.org/html/2606.00320#A3.F4.sf2), and [4(c)](https://arxiv.org/html/2606.00320#A3.F4.sf3) visualize the empirical distribution of human toxicity scores for different values of the Beta sampling parameter $a$, with $b$ kept fixed, under the uniform, adversarial, and adversarial-jump sampling regimes, respectively.
### C.1 Distributional Effects of the Sampling Parameter
#### Uniform regime.
Figure [4(a)](https://arxiv.org/html/2606.00320#A3.F4.sf1) shows the corresponding distributions when $a_{t}=b_{t}=1$. In this case, the toxicity distribution remains stable over time: its overall shape shows no systematic drift in either its center or its tail behavior.
#### Adversarial and adversarial-jump regimes.
Figures [4(b)](https://arxiv.org/html/2606.00320#A3.F4.sf2) and [4(c)](https://arxiv.org/html/2606.00320#A3.F4.sf3) show the empirical toxicity distribution for several values of the Beta shape parameter $a_{t}$, with $b_{t}=1$. As $a_{t}$ increases, the Beta distribution $c_{t}\sim\mathrm{Beta}(a_{t},b_{t})$ concentrates more mass near $1$, causing the sampling procedure to select increasingly toxic responses. This results in a pronounced rightward shift and a thickening of the upper tail of the toxicity distribution. Figure [5](https://arxiv.org/html/2606.00320#A3.F5) shows the evolution of $a_{t}$ over time for the adversarial (left panel) and adversarial-jump (right panel) sampling regimes. In both regimes $a_{t}$ increases overall; in the adversarial-jump regime, spontaneous shocks are also injected, so that $a_{t}$ can take values as low as 0.5 or as high as 5.
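As a small worked illustration of why larger $a_{t}$ thickens the upper tail, consider the case $b_{t}=1$ used above, where the $\mathrm{Beta}(a_{t},1)$ distribution has CDF $x^{a_{t}}$ on $[0,1]$. The cutoff $0.9$ below is an arbitrary illustrative choice, not a quantity from the experiments.

```latex
% Assumes c_t ~ Beta(a_t, 1); the cutoff 0.9 is illustrative only.
\begin{align*}
\mathbb{E}[c_t] &= \frac{a_t}{a_t+1}, &
\mathbb{P}(c_t > 0.9) &= 1 - 0.9^{a_t}, \\
a_t = 0.5:\quad \mathbb{E}[c_t] &= \tfrac{1}{3}, &
\mathbb{P}(c_t > 0.9) &\approx 0.051, \\
a_t = 1:\quad \mathbb{E}[c_t] &= \tfrac{1}{2}, &
\mathbb{P}(c_t > 0.9) &= 0.100, \\
a_t = 5:\quad \mathbb{E}[c_t] &= \tfrac{5}{6}, &
\mathbb{P}(c_t > 0.9) &\approx 0.410.
\end{align*}
```

Moving $a_{t}$ from 0.5 to 5 thus raises the mass above the cutoff roughly eightfold, in line with the rightward shift described above.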
Figure 5: LLM Toxicity Control Experiment. Evolution of the parameter $a_{t}$ of the Beta distribution under different sampling regimes: (a) adversarial setting (left), (b) adversarial-jump setting (right).
### C.2 Additional Experiment Results
We evaluate the proposed RUCC controller under uniform, adversarial, and adversarial-jump sampling regimes, for tail levels $\beta\in\{0.75,0.8,0.85,0.9\}$ with target risk level $\alpha=0.1$. In all experiments, we use step size $\gamma=0.05$ and discard the first $100$ rounds as a burn-in period.
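For readers who want to reproduce curves of this kind, the realized empirical $\operatorname{CVaR}_{\beta}$ can be computed from the Rockafellar-Uryasev form. The sketch below is an assumption-laden illustration: the function names are hypothetical, and it interprets the burn-in as dropping the first `burn_in` observations from the running estimate, which the text does not spell out.

```python
import numpy as np

def empirical_cvar(rs, beta):
    """Empirical CVaR_beta via min_c { c + mean((r - c)_+) / (1 - beta) }.

    The objective is convex and piecewise linear in c with kinks at the
    samples, so it suffices to evaluate it at the sample points.
    """
    rs = np.asarray(rs, dtype=float)
    return min(c + np.mean(np.maximum(rs - c, 0.0)) / (1.0 - beta) for c in rs)

def realized_cvar_path(rs, beta, burn_in=100):
    """Running empirical CVaR over rs[burn_in:t] for t = burn_in + 1, ..., len(rs)."""
    return [empirical_cvar(rs[burn_in:t], beta)
            for t in range(burn_in + 1, len(rs) + 1)]
```

Evaluating at sample points costs $O(n^2)$ per call, which is fine for a sketch; a sort-based tail average would be the natural choice for long horizons.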
Figure 6: LLM Toxicity Control Experiment. Realized empirical $\operatorname{CVaR}_{\beta}$ (panels a, c, e, g) and evolution of $\lambda_{t}$ (panels b, d, f, h) for $\beta=0.75,0.80,0.85,0.90$, with $\gamma=0.05$ and target $\alpha=0.1$, for the uniform regime.

Figure 7: LLM Toxicity Control Experiment. Realized empirical $\operatorname{CVaR}_{\beta}$ (panels a, c, e, g) and evolution of $\lambda_{t}$ (panels b, d, f, h) for $\beta=0.75,0.80,0.85,0.90$, with $\gamma=0.05$ and target $\alpha=0.1$, for the adversarial regime.

Figure 8: LLM Toxicity Control Experiment. Realized empirical $\operatorname{CVaR}_{\beta}$ (panels a, c, e, g) and evolution of $\lambda_{t}$ (panels b, d, f, h) for $\beta=0.75,0.80,0.85,0.90$, with $\gamma=0.05$ and target $\alpha=0.1$, for the adversarial-jump regime.

#### Evolution of the empirical $\operatorname{CVaR}_{\beta}$.
The left panels of Figures [6](https://arxiv.org/html/2606.00320#A3.F6), [7](https://arxiv.org/html/2606.00320#A3.F7), and [8](https://arxiv.org/html/2606.00320#A3.F8) show the evolution of the realized empirical $\operatorname{CVaR}_{\beta}$ under the uniform, adversarial, and adversarial-jump regimes, respectively.
In the *uniform* setting (Figure [6](https://arxiv.org/html/2606.00320#A3.F6)), all curves decay clearly toward the target level $\alpha=0.1$. Higher $\beta$ values (more risk-averse tail control) start from larger initial $\operatorname{CVaR}_{\beta}$ values but converge at a comparable rate. After sufficient time, the realized $\operatorname{CVaR}_{\beta}$ for all $\beta$ stabilizes tightly around the target level, demonstrating that the method can achieve tight, non-conservative control.
In the *adversarial* setting (Figure [7](https://arxiv.org/html/2606.00320#A3.F7)) and the *adversarial-jump* setting (Figure [8](https://arxiv.org/html/2606.00320#A3.F8)), the controller adapts and drives the realized $\operatorname{CVaR}_{\beta}$ downward for all $\beta$, eventually stabilizing near the target level. The convergence is slower and exhibits more fluctuation than in the uniform case, reflecting the intrinsic difficulty of controlling tail risk under adversarial distribution shift.
In all three settings, the RUCC controller maintains tail risk control through adaptive $\lambda_{t}$ updates while avoiding the overconservative behavior exhibited by the baseline, which selects a static $\lambda_{t}$.
#### Evolution of the control parameter $\lambda_{t}$.
The right panels of Figures [6](https://arxiv.org/html/2606.00320#A3.F6), [7](https://arxiv.org/html/2606.00320#A3.F7), and [8](https://arxiv.org/html/2606.00320#A3.F8) show the evolution of the control parameter $\lambda_{t}$ in the uniform, adversarial, and adversarial-jump regimes, respectively.
In the *uniform* setting (Figure [6](https://arxiv.org/html/2606.00320#A3.F6)), $\lambda_{t}$ evolves within a comparatively narrow band, indicating a relatively stable environment. The width of the band grows with $\beta$, reflecting the inherent increase in difficulty of controlling risk at higher tail levels.
In the *adversarial* setting (Figure [7](https://arxiv.org/html/2606.00320#A3.F7)) and the *adversarial-jump* setting (Figure [8](https://arxiv.org/html/2606.00320#A3.F8)), $\lambda_{t}$ exhibits a clear decreasing trend over time for all $\beta$. Initially, $\lambda_{t}$ remains large, corresponding to less aggressive filtering. As the controller accumulates evidence and adapts to the realized tail distribution, $\lambda_{t}$ is gradually reduced, reflecting an increasingly strict control of the risky region. Larger $\beta$ values lead to systematically smaller $\lambda_{t}$, as expected for more stringent tail-risk control. Furthermore, in the *adversarial-jump* setting (Figure [8](https://arxiv.org/html/2606.00320#A3.F8)), in addition to the decreasing trend, $\lambda_{t}$ shows sporadic spikes that correspond to the dips in $a_{t}$ of the Beta sampling distribution, and analogous dips that correspond to the jumps in $a_{t}$. This demonstrates the robustness of the RUCC controller and its ability to adapt to sudden distributional changes as well as to gradual distribution drift.
## Appendix D Additional Results for Experiment 2: Portfolio Management
We evaluate the proposed RU Conformal $\operatorname{CVaR}$ controller on real-market portfolio management tasks over several horizons: 1991–2001, 2009–2018, and the full period 1991–2025. We set the target risk level to $\alpha=0.01$ and report results for $\beta\in\{0.75,0.8,0.85,0.9\}$ with $\gamma=0.05$, removing the first $100$ points as burn-in.
#### Evolution of the rolling empirical $\operatorname{CVaR}_{\beta}$.
The left panels of Figures [9](https://arxiv.org/html/2606.00320#A4.F9), [10](https://arxiv.org/html/2606.00320#A4.F10), and [11](https://arxiv.org/html/2606.00320#A4.F11) show the evolution of the 180-day rolling realized empirical $\operatorname{CVaR}_{\beta}$ for different $\beta$ during 1991–2001, 2009–2018, and the full period 1991–2025, respectively.
During 1991–2001, the $\operatorname{CVaR}_{\beta}$ of the baseline controller, which uses a static $\lambda_{t}$, grows over time and eventually exceeds the target level. This can be explained by the fact that 1991–2001 was an economically stable period of steady growth that ended with the dot-com crash in 2001. Because the baseline controller has a fixed $\lambda_{t}$, its realized $\operatorname{CVaR}_{\beta}$ grows beyond the target level over time. In contrast, the realized tail risk of the RUCC controller starts below the target level, increases steadily as the controller adapts, and then moves tightly around the target level.
During 2009–2018, the $\operatorname{CVaR}_{\beta}$ of the baseline controller, which uses a static $\lambda_{t}$, decreases over time and eventually falls well below the target level. This can be explained by the fact that 2009–2018 is the post-financial-crisis period, in which the economy slowly improved (with decreasing risks). Because the baseline controller has a fixed $\lambda_{t}$, its realized $\operatorname{CVaR}_{\beta}$ falls below the target level over time. In contrast, the realized tail risk of the RUCC controller starts below the target level (as a response to the financial crisis) and increases steadily as the controller adapts to the improving economic environment, ending close to the target tail risk level.
Finally, the full period 1991–2025 highlights the robust adaptability of the RUCC controller. While the realized tail risk of the baseline controller rises and falls drastically in response to the changing economic environment, including dramatic rises during the 2008 financial crisis and the post-COVID period, the tail risk associated with the RUCC controller moves tightly around the target level, with moderate fluctuations during the same tumultuous times.
In all cases, the final realized tail risk values are close to the target level (approximately $0.01$), demonstrating that the method achieves tight long-run tail risk control without persistent conservatism. Higher $\beta$ values (corresponding to more extreme tail control) exhibit slightly slower convergence, as expected, but still reach the target level within the time horizon.
#### Evolution of the control parameter $\lambda_{t}$.
The right panels of Figures [9](https://arxiv.org/html/2606.00320#A4.F9), [10](https://arxiv.org/html/2606.00320#A4.F10), and [11](https://arxiv.org/html/2606.00320#A4.F11) show the evolution of the control parameter $\lambda_{t}$ for different $\beta$.
During 1991–2001, the RUCC $\lambda_{t}$ grows and then falls over time, indicating that the controller first increases and then reduces exposure to the risky asset as tail risks accumulate. The initial increase in $\lambda_{t}$ is consistent with the economic growth during the period; the later decrease is consistent with the dot-com crash in 2000, which requires more stringent risk management to control the tail risk at the target level. On the other hand, the baseline fixed $\lambda$ stays at a comparatively low level, causing it to be overly conservative.
In contrast, during 2009–2018, the RUCC $\lambda_{t}$ grows over time, indicating that the controller increases exposure to the risky asset as tail risks decrease. This growth is consistent with the growing stability of the economy after the 2008 financial crisis, which allows more lenient risk management while keeping the tail risk at the desired target level. Conversely, the baseline fixed $\lambda$ stays at a consistently low level due to the financial crisis and fails to adapt to the changing economic environment.
Finally, during 1991–2025, the RUCC $\lambda_{t}$ rises and falls over time, peaking during economically stable periods such as the pre-dot-com-crash, pre-financial-crisis, and pre-COVID periods. The larger values of $\lambda_{t}$ encourage investment in the risky asset while controlling the tail risk at the target level. On the other hand, $\lambda_{t}$ drops during economically tumultuous periods, such as the post-dot-com-crash, post-financial-crisis, and post-COVID periods. The smaller values of $\lambda_{t}$ lead to less aggressive investment in the risky asset, limiting exposure to tail risk.
Note that in every time period, $\lambda_{t}$ shrinks as $\beta$ increases, because the algorithm must be more conservative in order to control more extreme tail events.
Overall, these results demonstrate that our proposed RU conformal inference controller successfully enforces long-run $\operatorname{CVaR}_{\beta}$ control on real financial data over the full 1991–2025 horizon, including multiple crisis periods (e.g., the dot-com crash, the 2008 financial crisis, and COVID-19).
Figure 9: Portfolio Management Experiment. Realized empirical $\operatorname{CVaR}_\beta$ and evolution of $\lambda_t$ for $\gamma = 0.05$ with target $\alpha = 0.01$ during 1991–2000. Panels (a), (c), (e), (g): realized $\operatorname{CVaR}_\beta$ for $\beta = 0.75, 0.80, 0.85, 0.90$. Panels (b), (d), (f), (h): evolution of $\lambda_t$ for the same values of $\beta$.
Figure 10: Portfolio Management Experiment. Realized empirical $\operatorname{CVaR}_\beta$ and evolution of $\lambda_t$ for $\gamma = 0.05$ with target $\alpha = 0.01$ during 2009–2018. Panels (a), (c), (e), (g): realized $\operatorname{CVaR}_\beta$ for $\beta = 0.75, 0.80, 0.85, 0.90$. Panels (b), (d), (f), (h): evolution of $\lambda_t$ for the same values of $\beta$.
Figure 11: Portfolio Management Experiment. Realized empirical $\operatorname{CVaR}_\beta$ and evolution of $\lambda_t$ for $\gamma = 0.05$ with target $\alpha = 0.01$ during 1991–2025. Panels (a), (c), (e), (g): realized $\operatorname{CVaR}_\beta$ for $\beta = 0.75, 0.80, 0.85, 0.90$. Panels (b), (d), (f), (h): evolution of $\lambda_t$ for the same values of $\beta$.