Priced Motion Through Optimal Faces: A Normal-Fan Geometry for Non-Stationary Adversarial MDPs

arXiv cs.LG 06/30/26, 04:00 AM Papers
reinforcement-learning mdp non-stationary adversarial regret geometry optimal-faces
Summary
This paper introduces a normal-fan geometry for finite-horizon adversarial MDPs with fixed transitions, developing a face-crossing price that separates consequential from harmless non-stationarity. It shows that dynamic regret decomposes into intrinsic priced face motion plus within-face selection error.
arXiv:2606.29092v1 Announce Type: new Abstract: In a changing decision problem, standard dynamic-regret analyses have often equated the cost of non-stationarity to how far loss moves. However, it is simultaneously possible for a loss sequence to travel far and retain the same optimal policy, or for a small movement in loss to force the optimal policy to change completely. Thus, the size of the movement through loss variation, transition variation, or comparator path length describe the adversary's motion, but not the cost of that motion to the control problem. For a more faithful analytic interpretation, this paper develops a normal-fan geometry for finite-horizon adversarial MDPs with fixed transitions. Occupancy measures form a polytope, and each loss vector exposes an optimal face of that polytope. Non-stationarity in rewards is therefore a path through the normal fan, where motion inside one cone leaves the optimal face unchanged, while crossing a wall may carry regret. We pose the notion of a face-crossing price, which is the minimum regret incurred by remaining on the previous optimal face under the new loss. For any learner that tracks the previous face, dynamic regret decomposes exactly into intrinsic priced face motion plus within-face selection error. The resulting theory separates consequential from harmless non-stationarity, where loss variation can be arbitrarily large at zero price, and identical one-coordinate variation can hide horizon-scale differences in regret.
Original Article
View Cached Full Text
Cached at: 06/30/26, 05:31 AM
# Priced Motion Through Optimal Faces: A Normal-Fan Geometry for Non-Stationary Adversarial MDPs
Source: [https://arxiv.org/html/2606.29092](https://arxiv.org/html/2606.29092)
###### Abstract

In a changing decision problem, standard dynamic\-regret analyses have often equated the cost of non\-stationarity to how far loss moves\. However, it is simultaneously possible for a loss sequence to travel far and retain the same optimal policy, or for a small movement in loss to force the optimal policy to change completely\. Thus, the size of the movement through loss variation, transition variation, or comparator path length describe the adversary’s motion, but not the cost of that motion to the control problem\. For a more faithful analytic interpretation, this paper develops a normal\-fan geometry for finite\-horizon adversarial MDPs with fixed transitions\. Occupancy measures form a polytope, and each loss vector exposes an optimal face of that polytope\. Non\-stationarity in rewards is therefore a path through the normal fan, where motion inside one cone leaves the optimal face unchanged, while crossing a wall may carry regret\. We pose the notion of a face\-crossing price, which is the minimum regret incurred by remaining on the previous optimal face under the new loss\. For any learner that tracks the previous face, dynamic regret decomposes exactly into intrinsic priced face motion plus within\-face selection error\. The resulting theory separates consequential from harmless non\-stationarity, where loss variation can be arbitrarily large at zero price, and identical one\-coordinate variation can hide horizon\-scale differences in regret\.

## 1Introduction

Consider the Recommender system\(Schaferet al\.,[1999](https://arxiv.org/html/2606.29092#bib.bib57)\)which gives personalised recommendations to users based on learned user preferences\. The user’s interaction with this system is sequential, and with each user action, the Recommender system decides what to recommend\. We can formulate this problem as a stationary Markov Decision Process \(MDP\)\(Puterman,[2014](https://arxiv.org/html/2606.29092#bib.bib36); Shaniet al\.,[2015](https://arxiv.org/html/2606.29092#bib.bib58)\), where we try to learn the user’s preference \(rr\) as well as how their interactions tend to evolve \(P\(s′∣s,a\)P\(s^\{\\prime\}\\mid s,a\)\)\. However, a user’s preference is ever\-changing, and we cannot assume that an identical state\-action should yield the same reward from one episode to the next\(Huleihelet al\.,[2021](https://arxiv.org/html/2606.29092#bib.bib56)\)\.

In the fixed\-transition non\-stationary RL setting, the optimal policy drifts as the rewards and losses change\. Existing analyses bound dynamic regret by budgeting the variation of the losses or the comparator path\(Besbeset al\.,[2015](https://arxiv.org/html/2606.29092#bib.bib4); Zhaoet al\.,[2024](https://arxiv.org/html/2606.29092#bib.bib50); Feiet al\.,[2020](https://arxiv.org/html/2606.29092#bib.bib13); Zhaoet al\.,[2022](https://arxiv.org/html/2606.29092#bib.bib48)\), but these metrics are external to the control problem\. That is, such budgets only capture the magnitude of change, and such magnitudes alone do not determine the cost a learner pays\.

As an example, we can construct two cases in which loss displacement is inversely proportional to cost\. In one, a loss travels far yet keeps the same policy optimal, meaning that a stationary learner pays nothing\. In the other, a small step crosses a decision boundary and flips the optimal policy to the far side of the problem, costing the stationary learner regret of order the horizon\. For the Recommender system case, suppose two film recommendation policies share the same regret, where one recommends documentaries and one recommends romcoms\. Inside the “documentaries still best” region the optimal policy does not change, and a large reward movement there barely costs anything\. Conversely, a tiny preference shift can move the user across the indifference boundary, making the old “documentaries still best” policy systematically wrong, and causing large regret if the learner stays stationary\.

The occupancy representation of an MDP gives us a geometric formalisation of decision regions\(Puterman,[2014](https://arxiv.org/html/2606.29092#bib.bib36); Altman,[2021](https://arxiv.org/html/2606.29092#bib.bib3)\)\. For a fixed transition kernel, each policy induces an occupancy measure, the vector of expected visitation frequencies over state\-action pairs\. There are a few properties which make occupancies solutions to the aforementioned discrepancy in variation\. First, policies that share an occupancy attain equal valueVπ\(ℓ\)V^\{\\pi\}\(\\ell\)under every loss, making the occupancy a sufficient statistic for value\. Next, expected loss is then linear in this vector\. Consequently, the set of policies can be represented by a convex occupancy polytope, and the region of losses for which a given policy is optimal is a normal cone of that polytope\. Rotating the loss leaves that optimal face fixed across the normal cone, changing only when the loss crosses a wall\. The collection of such cones which together correspond to all the unique faces of the occupancy polytope is called the normal fan, which is the map carrying each loss to its optimal face\(Ziegler,[1995](https://arxiv.org/html/2606.29092#bib.bib51); Rockafellar,[1970](https://arxiv.org/html/2606.29092#bib.bib38); Lu and Robinson,[2008](https://arxiv.org/html/2606.29092#bib.bib28)\)\.

While loss remains within a cone and the optimal face stays fixed, a learner who holds the face pays no regret regardless of loss variation\. A cost arises only when the loss crosses into a cone whose optimal face excludes the learner’s occupancy\. In such a case, holding the previous optimal face under the new loss then costs the least regret achievable over its points\. Thus, by searching through the entire fan, we are able to exactly price the motion of losses\. However, this search is a combinatorial problem, and not tractable to enumerate\. In this paper, we show that no search is ever required\. The price equals an expected optimal Bellman advantage under the current loss, measured against the optimal value function that loss induces\. The advantage is accumulated along whichever policy was optimal one round earlier\. A single value backup supplies these advantages, and the remaining minimization runs only over the previous optimal face, which the learner already holds\.

![Refer to caption](https://arxiv.org/html/2606.29092v1/x1.png)Figure 1:Occupancy faces and normal cones\. Left, the occupancy polytope, whose faces are the candidate sets of optimal occupancies\. Right, the normal fan in loss space, which partitions losses into cones by the face each one selects\. A loss path inside one cone leaves the optimal face unchanged and is free, while a path that crosses a wall changes the optimal face and may be expensive\.##### Contributions\.

1. 1\.We define the face\-crossing price, the cost of remaining on the previous optimal face under the new loss, and prove that for a learner that holds that face, dynamic regret splits exactly into this price and a within\-face selection error \(Theorem[1](https://arxiv.org/html/2606.29092#Thmtheorem1)\)\.
2. 2\.We prove that in any finite\-horizon MDP the price equals a minimal expected optimal Bellman advantage over the policies that were optimal a round earlier \(Theorem[3](https://arxiv.org/html/2606.29092#Thmtheorem3)\), so a value backup over the old face computes it directly\.
3. 3\.We establish three consequences\. First, loss variation can be arbitrarily large at zero price, and at fixed variation it can hide a factorHH\(Proposition[1](https://arxiv.org/html/2606.29092#Thmproposition1)and Theorem[2](https://arxiv.org/html/2606.29092#Thmtheorem2)\)\. Second, a crossing forced early in the horizon costs more than a late one by the factorH−h\+1H\-h\+1\(Corollary[1](https://arxiv.org/html/2606.29092#Thmcorollary1)\)\. Lastly, a degenerate optimal face leaves a selection error the price cannot absorb \(Proposition[7](https://arxiv.org/html/2606.29092#Thmproposition7)\)\.

A full\-information cone detector \(Proposition[4](https://arxiv.org/html/2606.29092#Thmproposition4)\) and a secondary reading of mirror\-descent and trust\-region updates as approximate trackers of the fan appear in the appendices\. We test our theory on three empirical benchmarks in Section[7](https://arxiv.org/html/2606.29092#S7)\.

## 2Related work

##### Occupancy measures and the geometry of control\.

With known transitions, the occupancy measure turns policy optimization into linear optimization over a polytope\. This correspondence descends from the linear programming view of dynamic programming\(Puterman,[2014](https://arxiv.org/html/2606.29092#bib.bib36); Altman,[2021](https://arxiv.org/html/2606.29092#bib.bib3)\)\. Online and adversarial MDP algorithms exploit this view directly, through relative\-entropy policy search and reductions to online convex optimization over the polytope\(Even\-Daret al\.,[2009](https://arxiv.org/html/2606.29092#bib.bib12); Neuet al\.,[2010](https://arxiv.org/html/2606.29092#bib.bib33); Zimin and Neu,[2013](https://arxiv.org/html/2606.29092#bib.bib52); Rosenberg and Mansour,[2019](https://arxiv.org/html/2606.29092#bib.bib39)\)\. A parallel line studies the geometry that this representation imposes on policy space, through the value\-function polytope\(Dadashiet al\.,[2019](https://arxiv.org/html/2606.29092#bib.bib9)\), the convex\-analytic structure of constrained and inverse RL\(Schlaginhaufen and Kamgarpour,[2023](https://arxiv.org/html/2606.29092#bib.bib40)\), and the occupancy\-manifold reading of actor\-critic methods\(Milosevic and Scherf,[2025](https://arxiv.org/html/2606.29092#bib.bib30); Müller and Montúfar,[2024](https://arxiv.org/html/2606.29092#bib.bib31)\)\. The present geometry is built to measure how those exposed faces move and what their motion costs\.

##### Dynamic regret and non\-stationarity\.

Dynamic regret compares a learner against a moving comparator and bounds the gap by a measure of the comparator’s motion, an idea that originates in online convex optimization\(Zinkevich,[2003](https://arxiv.org/html/2606.29092#bib.bib54); Hall and Willett,[2013](https://arxiv.org/html/2606.29092#bib.bib14); Jadbabaieet al\.,[2015](https://arxiv.org/html/2606.29092#bib.bib18); Eshraghi and Liang,[2022](https://arxiv.org/html/2606.29092#bib.bib11); Zhaoet al\.,[2020](https://arxiv.org/html/2606.29092#bib.bib49);[2024](https://arxiv.org/html/2606.29092#bib.bib50)\)and non\-stationary stochastic optimization\(Besbeset al\.,[2015](https://arxiv.org/html/2606.29092#bib.bib4)\)\. The idea has been adopted by non\-stationary and adversarial MDPs\(Cheunget al\.,[2020](https://arxiv.org/html/2606.29092#bib.bib8); Feiet al\.,[2020](https://arxiv.org/html/2606.29092#bib.bib13); Zhaoet al\.,[2022](https://arxiv.org/html/2606.29092#bib.bib48); Liet al\.,[2023](https://arxiv.org/html/2606.29092#bib.bib25);[2024b](https://arxiv.org/html/2606.29092#bib.bib27); Dicket al\.,[2015](https://arxiv.org/html/2606.29092#bib.bib10); Cardosoet al\.,[2019](https://arxiv.org/html/2606.29092#bib.bib6)\)\. These results bound regret by comparator path length or by reward and transition variation\. The occupancy path\-length bounds nearest to ours\(Zhaoet al\.,[2022](https://arxiv.org/html/2606.29092#bib.bib48); Liet al\.,[2024b](https://arxiv.org/html/2606.29092#bib.bib27)\)measure non\-stationarity by the movement of the optimal occupancy and pay a Lipschitz factor to turn that movement into regret\. The price studied here refines that movement to a path of exposed faces measured in value\. For a learner holding the previous optimal face, this price equals the realized regret exactly \(Theorem[1](https://arxiv.org/html/2606.29092#Thmtheorem1)\), whereas path length only bounds it\.

##### Normal fans, polyhedral geometry, and policy updates\.

The normal fan of a polytope partitions objective space by the face a linear objective selects — a standard construction in convex and polyhedral geometry\(Ziegler,[1995](https://arxiv.org/html/2606.29092#bib.bib51); Rockafellar,[1970](https://arxiv.org/html/2606.29092#bib.bib38); Lu and Robinson,[2008](https://arxiv.org/html/2606.29092#bib.bib28); Paffenholz,[2010](https://arxiv.org/html/2606.29092#bib.bib35); Tropp,[2022](https://arxiv.org/html/2606.29092#bib.bib44)\)\. We read its walls as the decision boundaries of control\. A crossing becomes costly once the previous optimal face acquires positive Bellman advantage under the new loss\. The margin around a wall is a weak\-sharpness constant of the linear program\(Burke and Ferris,[1993](https://arxiv.org/html/2606.29092#bib.bib5)\)\. Separately, mirror\-descent, natural\-gradient, and trust\-region methods place a non\-Euclidean geometry on policy space\(Nemirovskiĭ and IUdin,[1983](https://arxiv.org/html/2606.29092#bib.bib32); Beck and Teboulle,[2003](https://arxiv.org/html/2606.29092#bib.bib55); Kakade,[2001](https://arxiv.org/html/2606.29092#bib.bib21); Kakade and Langford,[2002](https://arxiv.org/html/2606.29092#bib.bib20); Schulmanet al\.,[2017](https://arxiv.org/html/2606.29092#bib.bib41); Tomaret al\.,[2021](https://arxiv.org/html/2606.29092#bib.bib43); Lan,[2022](https://arxiv.org/html/2606.29092#bib.bib23); Zhanet al\.,[2023](https://arxiv.org/html/2606.29092#bib.bib46); Alfanoet al\.,[2024](https://arxiv.org/html/2606.29092#bib.bib2); Agarwalet al\.,[2020](https://arxiv.org/html/2606.29092#bib.bib1); Kleukeret al\.,[2025](https://arxiv.org/html/2606.29092#bib.bib22)\)\. Our reading recasts these updates as approximate trackers of the fan, with the Bregman term holding mass near uncertain walls, and a large advantage pushing it off a clearly suboptimal action\. The prediction such tracking needs is supplied by the optimistic mechanism ofRakhlin and Sridharan\([2014](https://arxiv.org/html/2606.29092#bib.bib37)\)\.

## 3Occupancy geometry and the normal fan

### 3\.1The occupancy polytope

We consider an episodic MDP with horizonHH, finite state space𝒮\\mathcal\{S\}, finite action space𝒜\\mathcal\{A\}, initial distributionρ\\rho, and transition kernelsPh\(⋅∣s,a\)P\_\{h\}\(\\cdot\\mid s,a\)\. We take the transitions known and fixed and let the lossesℓt=\{ℓt,h\(s,a\)\}h,s,a\\ell\_\{t\}=\\\{\\ell\_\{t,h\}\(s,a\)\\\}\_\{h,s,a\}vary adversarially across episodes\. Following the standard occupancy\-measure formulation\(Puterman,[2014](https://arxiv.org/html/2606.29092#bib.bib36); Altman,[2021](https://arxiv.org/html/2606.29092#bib.bib3)\), for a Markov policyπ\\pithe occupancy measure records the visitation frequencies,

qhπ\(s,a\)=Prπ⁡\(sh=s,ah=a\)\.q\_\{h\}^\{\\pi\}\(s,a\)=\\Pr\_\{\\pi\}\(s\_\{h\}=s,\\,a\_\{h\}=a\)\.Let𝒬\\mathcal\{Q\}denote the set of all occupancy measures attainable by some Markov policy\. It is the polytope defined by nonnegativity together with the flow constraints

∑aq1\(s,a\)\\displaystyle\\sum\_\{a\}q\_\{1\}\(s,a\)=ρ\(s\),\\displaystyle=\\rho\(s\),∑aqh\+1\(s,a\)\\displaystyle\\sum\_\{a\}q\_\{h\+1\}\(s,a\)=∑s′,a′qh\(s′,a′\)Ph\(s∣s′,a′\),h=1,…,H−1\.\\displaystyle=\\sum\_\{s^\{\\prime\},a^\{\\prime\}\}q\_\{h\}\(s^\{\\prime\},a^\{\\prime\}\)\\,P\_\{h\}\(s\\mid s^\{\\prime\},a^\{\\prime\}\),\\qquad h=1,\\ldots,H\-1\.The expected loss is linear in the occupancy,Jt\(π\)=⟨qπ,ℓt⟩J\_\{t\}\(\\pi\)=\\left\\langle q^\{\\pi\},\\ell\_\{t\}\\right\\rangle, and the set of optimal occupancies at roundttis the exposed face

Ft=argminq∈𝒬⁡⟨q,ℓt⟩\.F\_\{t\}=\\operatorname\*\{arg\\,min\}\_\{q\\in\\mathcal\{Q\}\}\\left\\langle q,\\ell\_\{t\}\\right\\rangle\.For a sequence of playsxt∈𝒬x\_\{t\}\\in\\mathcal\{Q\}the dynamic regret against the per\-round optimum\(Zinkevich,[2003](https://arxiv.org/html/2606.29092#bib.bib54); Hall and Willett,[2013](https://arxiv.org/html/2606.29092#bib.bib14); Jadbabaieet al\.,[2015](https://arxiv.org/html/2606.29092#bib.bib18)\)is

DRegT=∑t=1T\[⟨xt,ℓt⟩−minq∈𝒬⁡⟨q,ℓt⟩\]\.\\operatorname\{DReg\}\_\{T\}=\\sum\_\{t=1\}^\{T\}\\left\[\\left\\langle x\_\{t\},\\ell\_\{t\}\\right\\rangle\-\\min\_\{q\\in\\mathcal\{Q\}\}\\left\\langle q,\\ell\_\{t\}\\right\\rangle\\right\]\.
Notice that expected loss sees a policy only through its occupancy measure\. Two policies with the same occupancy therefore collapse to one point of𝒬\\mathcal\{Q\}, and flow conservation makes this quotient a bijection with𝒬\\mathcal\{Q\}\. Their natural distance is then the occupancy distancedocc\(\[π\],\[π′\]\)=‖qπ−qπ′‖d\_\{\\rm occ\}\(\[\\pi\],\[\\pi^\{\\prime\}\]\)=\\left\\lVert q^\{\\pi\}\-q^\{\\pi^\{\\prime\}\}\\right\\rVert\. Norm duality then equates it with the largest value gap a bounded adversary can open between them,sup‖ℓ‖∗≤1\|Jℓ\(π\)−Jℓ\(π′\)\|\\sup\_\{\\left\\lVert\\ell\\right\\rVert\_\{\*\}\\leq 1\}\|J\_\{\\ell\}\(\\pi\)\-J\_\{\\ell\}\(\\pi^\{\\prime\}\)\|\.

### 3\.2The normal fan

Given the polytope and its faces, the next question is which losses select which face\. To each faceF⊆𝒬F\\subseteq\\mathcal\{Q\}let’s associate a cone of losses,

N\(F\)=\{ℓ:F⊆argminq∈𝒬⁡⟨q,ℓ⟩\}\.N\(F\)=\\bigl\\\{\\ell:\\ F\\subseteq\\operatorname\*\{arg\\,min\}\_\{q\\in\\mathcal\{Q\}\}\\left\\langle q,\\ell\\right\\rangle\\bigr\\\}\.These cones collect into the \(negative\) normal fan of𝒬\\mathcal\{Q\}, and the fan partitions loss space by the face that linear optimization selects\(Ziegler,[1995](https://arxiv.org/html/2606.29092#bib.bib51); Rockafellar,[1970](https://arxiv.org/html/2606.29092#bib.bib38); Lu and Robinson,[2008](https://arxiv.org/html/2606.29092#bib.bib28)\)\. For every lossℓ\\ellthe optimal setF\(ℓ\)=argminq∈𝒬⁡⟨q,ℓ⟩F\(\\ell\)=\\operatorname\*\{arg\\,min\}\_\{q\\in\\mathcal\{Q\}\}\\left\\langle q,\\ell\\right\\rangleis an exposed face of𝒬\\mathcal\{Q\}, attained where a supporting hyperplane meets it\. A loss in the relative interior ofN\(F\)N\(F\)then selects exactlyFF\.

## 4The face\-crossing price

### 4\.1Definition and the regret decomposition

Suppose the learner enters roundttplaying some occupancy in the previous optimal faceFt−1F\_\{t\-1\}\. The unavoidable cost of remaining on that face is the smallest regret achievable by any of its points under the new loss\.

###### Definition 1\(Minimal face\-crossing price\)\.

For consecutive optimal facesFt−1F\_\{t\-1\}andFtF\_\{t\},

Γtcross=minu∈Ft−1⁡\[⟨u,ℓt⟩−minq∈𝒬⁡⟨q,ℓt⟩\],Γ2:Tcross=∑t=2TΓtcross\.\\Gamma\_\{t\}^\{\\mathrm\{cross\}\}=\\min\_\{u\\in F\_\{t\-1\}\}\\left\[\\left\\langle u,\\ell\_\{t\}\\right\\rangle\-\\min\_\{q\\in\\mathcal\{Q\}\}\\left\\langle q,\\ell\_\{t\}\\right\\rangle\\right\],\\qquad\\Gamma\_\{2:T\}^\{\\mathrm\{cross\}\}=\\sum\_\{t=2\}^\{T\}\\Gamma\_\{t\}^\{\\mathrm\{cross\}\}\.

The price is zero when some point of the old face is still optimal under the new loss, and positive only when every point of the old face has become suboptimal\. Dynamic regret now decomposes around this quantity\.

###### Theorem 1\(Exact face\-regret decomposition\)\.

Suppose that fort≥2t\\geq 2the learner playsxt∈Ft−1x\_\{t\}\\in F\_\{t\-1\}\. Then the per\-round regret splits as

rt:=⟨xt,ℓt⟩−minq∈𝒬⁡⟨q,ℓt⟩=Γtcross\+εtsel,εtsel=⟨xt,ℓt⟩−minu∈Ft−1⁡⟨u,ℓt⟩,r\_\{t\}:=\\left\\langle x\_\{t\},\\ell\_\{t\}\\right\\rangle\-\\min\_\{q\\in\\mathcal\{Q\}\}\\left\\langle q,\\ell\_\{t\}\\right\\rangle=\\Gamma\_\{t\}^\{\\mathrm\{cross\}\}\+\\varepsilon\_\{t\}^\{\\rm sel\},\\qquad\\varepsilon\_\{t\}^\{\\rm sel\}=\\left\\langle x\_\{t\},\\ell\_\{t\}\\right\\rangle\-\\min\_\{u\\in F\_\{t\-1\}\}\\left\\langle u,\\ell\_\{t\}\\right\\rangle,and consequently

DRegT=r1\+∑t=2TΓtcross\+∑t=2Tεtsel\.\\operatorname\{DReg\}\_\{T\}=r\_\{1\}\+\\sum\_\{t=2\}^\{T\}\\Gamma\_\{t\}^\{\\mathrm\{cross\}\}\+\\sum\_\{t=2\}^\{T\}\\varepsilon\_\{t\}^\{\\rm sel\}\.

###### Proof\.

Add and subtract the best value attainable inside the old face\. Writingmt=minu∈Ft−1⁡⟨u,ℓt⟩m\_\{t\}=\\min\_\{u\\in F\_\{t\-1\}\}\\left\\langle u,\\ell\_\{t\}\\right\\rangle,

rt=\[mt−minq∈𝒬⁡⟨q,ℓt⟩\]\+\[⟨xt,ℓt⟩−mt\]=Γtcross\+εtsel\.r\_\{t\}=\\Bigl\[m\_\{t\}\-\\min\_\{q\\in\\mathcal\{Q\}\}\\left\\langle q,\\ell\_\{t\}\\right\\rangle\\Bigr\]\+\\Bigl\[\\left\\langle x\_\{t\},\\ell\_\{t\}\\right\\rangle\-m\_\{t\}\\Bigr\]=\\Gamma\_\{t\}^\{\\mathrm\{cross\}\}\+\\varepsilon\_\{t\}^\{\\rm sel\}\.The first bracket is the least cost of remaining onFt−1F\_\{t\-1\}, and the second is the learner’s error in choosing its representative withinFt−1F\_\{t\-1\}\. Summing overttgives the claim\. ∎

The priceΓtcross\\Gamma\_\{t\}^\{\\mathrm\{cross\}\}is intrinsic, and the loss sequence alone fixes it, beyond the reach of any learner that stays on the old face\. By contrast, the selection errorεtsel\\varepsilon\_\{t\}^\{\\rm sel\}is algorithmic, and it vanishes for a learner that always plays the best point of that face\. This split presumes the learner holds the old face throughout the round\. Once the learner moves off it,εtsel\\varepsilon\_\{t\}^\{\\rm sel\}can turn negative and the two terms no longer separate\.

### 4\.2Comparison with loss variation

The face\-crossing price explains when the familiar variation bounds\(Besbeset al\.,[2015](https://arxiv.org/html/2606.29092#bib.bib4); Zhaoet al\.,[2024](https://arxiv.org/html/2606.29092#bib.bib50);[2022](https://arxiv.org/html/2606.29092#bib.bib48)\)are tight and when they are wasteful\. Consider first the non\-degenerate case in which each face is a single vertex\.

Take every optimal face to be a single vertex,Ft=\{vt\}F\_\{t\}=\\\{v\_\{t\}\\\}, and let the tracker playxt=vt−1x\_\{t\}=v\_\{t\-1\}\. The selection error of Theorem[1](https://arxiv.org/html/2606.29092#Thmtheorem1)then vanishes, leavingDRegT=r1\+∑t=2T⟨vt−1−vt,ℓt⟩\\operatorname\{DReg\}\_\{T\}=r\_\{1\}\+\\sum\_\{t=2\}^\{T\}\\left\\langle v\_\{t\-1\}\-v\_\{t\},\\ell\_\{t\}\\right\\rangle\. Optimality ofvt−1v\_\{t\-1\}underℓt−1\\ell\_\{t\-1\}bounds each term by‖vt−1−vt‖‖ℓt−ℓt−1‖∗\\left\\lVert v\_\{t\-1\}\-v\_\{t\}\\right\\rVert\\,\\left\\lVert\\ell\_\{t\}\-\\ell\_\{t\-1\}\\right\\rVert\_\{\*\}\. Summing recovers the classical variation estimateDRegT≤r1\+∑t=2T‖vt−1−vt‖‖ℓt−ℓt−1‖∗\\operatorname\{DReg\}\_\{T\}\\leq r\_\{1\}\+\\sum\_\{t=2\}^\{T\}\\left\\lVert v\_\{t\-1\}\-v\_\{t\}\\right\\rVert\\,\\left\\lVert\\ell\_\{t\}\-\\ell\_\{t\-1\}\\right\\rVert\_\{\*\}\. This estimate charges for every change in the loss, whereasΓtcross\\Gamma\_\{t\}^\{\\mathrm\{cross\}\}charges only the part that displaces the optimal face\. The next proposition shows that this gap grows without bound\.

###### Proposition 1\(Large variation, zero price\)\.

There exist loss sequences with arbitrarily large variationVTℓ=∑t‖ℓt−ℓt−1‖∗V\_\{T\}^\{\\ell\}=\\sum\_\{t\}\\left\\lVert\\ell\_\{t\}\-\\ell\_\{t\-1\}\\right\\rVert\_\{\*\}and yetΓ2:Tcross=0\\Gamma\_\{2:T\}^\{\\mathrm\{cross\}\}=0\.

###### Proof\.

Let every loss lie in the same coneN\(F\)N\(F\), for instance by adding an arbitrary common offset to a fixed loss or by moving within the relative interior of the cone\. The selected face is alwaysFF, soFt−1=FtF\_\{t\-1\}=F\_\{t\}andΓtcross=0\\Gamma\_\{t\}^\{\\mathrm\{cross\}\}=0throughout, while the variation grows without bound as the loss travels inside the cone\. ∎

Whether a crossing is priced depends only on whether the old and new optimal faces still meet \(Appendix Figure[5](https://arxiv.org/html/2606.29092#A2.F5)\)\. We analyse the complementary case, i\.e\., an unpredictable crossing that forces a fixed price on every learner with a matching minimax lower bound, in Appendix[B](https://arxiv.org/html/2606.29092#A2)\.

Loss variation cannot see where in the horizon a crossing falls\. At fixed variation it can therefore hide a factor ofHHin the price\. More precisely:

###### Theorem 2\(Loss variation cannot see the crossing layer\)\.

For every horizonH≥2H\\geq 2there exist two finite\-horizon MDPs with fixed transitions, each carrying a two\-round loss sequence whose round\-to\-round change touches a single coordinate of the same magnitude\. The loss variationVTℓV\_\{T\}^\{\\ell\}is then identical across the two instances under everyLpL\_\{p\}norm, while their face\-crossing prices differ by the factorH−1H\-1\. Both old faces are vertices, so the selection error of Theorem[1](https://arxiv.org/html/2606.29092#Thmtheorem1)vanishes and the previous\-face tracker’s dynamic regret equals the price on each\. No bound of the formDRegT≤B\(VTℓ\)\\operatorname\{DReg\}\_\{T\}\\leq B\(V\_\{T\}^\{\\ell\}\)can then be tight on both, and one of the two is loose by a factorΘ\(H\)\\Theta\(H\)\.

The transition structure drives the amplification, nudging a single step in the cheap instance while rerouting the entire horizon in the expensive one\. The variation itself stays put, untouched by the transitions on either side\. Within a single fixed MDP a one\-coordinate change cannot open a factor\-HHgap\. Two distinct MDPs are therefore necessary, and Appendix[B](https://arxiv.org/html/2606.29092#A2)gives the construction and proof\. The separation also speaks only to loss or reward variation\. Comparator path length, by contrast, does register the crossing\-layer difference\.

### 4\.3Choosing a representative inside a degenerate face

The comparison so far has assumed each optimal face is a single vertex\. WhenFt−1F\_\{t\-1\}is instead a higher\-dimensional face, the learner must still choose one occupancy inside it, and that choice is the source of the selection errorεtsel\\varepsilon\_\{t\}^\{\\rm sel\}\. A learner that predicts the coming loss can keep this error small\.

Predicting the coming loss bymtm\_\{t\}and selectingxt∈argminu∈Ft−1⁡⟨u,mt⟩x\_\{t\}\\in\\operatorname\*\{arg\\,min\}\_\{u\\in F\_\{t\-1\}\}\\left\\langle u,m\_\{t\}\\right\\ranglekeeps the error atεtsel≤diam⁡\(Ft−1\)δt\\varepsilon\_\{t\}^\{\\rm sel\}\\leq\\operatorname\{diam\}\(F\_\{t\-1\}\)\\,\\delta\_\{t\}, withδt=‖mt−ℓt‖∗\\delta\_\{t\}=\\left\\lVert m\_\{t\}\-\\ell\_\{t\}\\right\\rVert\_\{\*\}\. A sticky Bregman selector trades this against information held along the directions the old face leaves undetermined, at an added costλRtψ\\lambda R\_\{t\}^\{\\psi\}\(Propositions[2](https://arxiv.org/html/2606.29092#Thmproposition2)–[3](https://arxiv.org/html/2606.29092#Thmproposition3), Appendix[A](https://arxiv.org/html/2606.29092#A1)\)\. This choice of representative is exactly where the Bregman geometry of mirror descent\(Nemirovskiĭ and IUdin,[1983](https://arxiv.org/html/2606.29092#bib.bib32); Beck and Teboulle,[2003](https://arxiv.org/html/2606.29092#bib.bib55)\)enters\. We discuss this further in Section[6](https://arxiv.org/html/2606.29092#S6)\.

## 5Bellman advantage as the price of motion

So far the face\-crossing price has been a polyhedral quantity, defined by minimization over a face of a complicated polytope\. In an MDP it also admits a dynamic\-programming form\(Puterman,[2014](https://arxiv.org/html/2606.29092#bib.bib36)\)\. To exhibit it, fix a lossℓt\\ell\_\{t\}and define the optimal value and action\-value functions by the backups

Vt,H\+1∗\(s\)\\displaystyle V\_\{t,H\+1\}^\{\*\}\(s\)=0,\\displaystyle=0,Qt,h∗\(s,a\)\\displaystyle Q\_\{t,h\}^\{\*\}\(s,a\)=ℓt,h\(s,a\)\+𝔼s′∼Ph\(⋅∣s,a\)Vt,h\+1∗\(s′\),\\displaystyle=\\ell\_\{t,h\}\(s,a\)\+\\mathbb\{E\}\_\{s^\{\\prime\}\\sim P\_\{h\}\(\\cdot\\mid s,a\)\}V\_\{t,h\+1\}^\{\*\}\(s^\{\\prime\}\),Vt,h∗\(s\)\\displaystyle V\_\{t,h\}^\{\*\}\(s\)=mina⁡Qt,h∗\(s,a\),\\displaystyle=\\min\_\{a\}Q\_\{t,h\}^\{\*\}\(s,a\),together with the loss advantageAt,h∗\(s,a\)=Qt,h∗\(s,a\)−Vt,h∗\(s\)≥0A\_\{t,h\}^\{\*\}\(s,a\)=Q\_\{t,h\}^\{\*\}\(s,a\)\-V\_\{t,h\}^\{\*\}\(s\)\\geq 0\. The advantage measures how much worse actionaais at stephhin statessthan acting optimally onward, and it is nonnegative under the minimization convention\. These two expressions for the price now connect through a single identity\.

###### Theorem 3\(Bellman advantage representation\)\.

For any Markov policyπ\\pi,

Jt\(π\)−Jt∗=∑h=1H𝔼π\[At,h∗\(sh,ah\)\],J\_\{t\}\(\\pi\)\-J\_\{t\}^\{\*\}=\\sum\_\{h=1\}^\{H\}\\mathbb\{E\}\_\{\\pi\}\\bigl\[A\_\{t,h\}^\{\*\}\(s\_\{h\},a\_\{h\}\)\\bigr\],and consequently the face\-crossing price is the minimal accumulated advantage over the policies that were optimal a round earlier,

Γtcross=minπ∈Πt−1∗∑h=1H𝔼π\[At,h∗\(sh,ah\)\]\.\\Gamma\_\{t\}^\{\\mathrm\{cross\}\}=\\min\_\{\\pi\\in\\Pi\_\{t\-1\}^\{\*\}\}\\sum\_\{h=1\}^\{H\}\\mathbb\{E\}\_\{\\pi\}\\bigl\[A\_\{t,h\}^\{\*\}\(s\_\{h\},a\_\{h\}\)\\bigr\]\.

###### Proof\.

The Bellman identityQt,h∗\(sh,ah\)=ℓt,h\(sh,ah\)\+𝔼\[Vt,h\+1∗\(sh\+1\)∣sh,ah\]Q\_\{t,h\}^\{\*\}\(s\_\{h\},a\_\{h\}\)=\\ell\_\{t,h\}\(s\_\{h\},a\_\{h\}\)\+\\mathbb\{E\}\[V\_\{t,h\+1\}^\{\*\}\(s\_\{h\+1\}\)\\mid s\_\{h\},a\_\{h\}\]gives, after taking expectations underπ\\piand summing over the horizon, a telescoping of the value terms,

∑h=1H𝔼π\[At,h∗\(sh,ah\)\]=∑h=1H𝔼π\[Qt,h∗\(sh,ah\)−Vt,h∗\(sh\)\]=Jt\(π\)−𝔼s1∼ρVt,1∗\(s1\)=Jt\(π\)−Jt∗\.\\sum\_\{h=1\}^\{H\}\\mathbb\{E\}\_\{\\pi\}\\bigl\[A\_\{t,h\}^\{\*\}\(s\_\{h\},a\_\{h\}\)\\bigr\]=\\sum\_\{h=1\}^\{H\}\\mathbb\{E\}\_\{\\pi\}\\bigl\[Q\_\{t,h\}^\{\*\}\(s\_\{h\},a\_\{h\}\)\-V\_\{t,h\}^\{\*\}\(s\_\{h\}\)\\bigr\]=J\_\{t\}\(\\pi\)\-\\mathbb\{E\}\_\{s\_\{1\}\\sim\\rho\}V\_\{t,1\}^\{\*\}\(s\_\{1\}\)=J\_\{t\}\(\\pi\)\-J\_\{t\}^\{\*\}\.Minimizing the left side over the policies optimal at roundt−1t\-1, whose occupancies are exactly the points ofFt−1F\_\{t\-1\}, returns the face\-crossing price\. ∎

The first identity is the performance\-difference lemma\(Kakade and Langford,[2002](https://arxiv.org/html/2606.29092#bib.bib20)\)taken against the optimal policy, which attainsQ∗Q^\{\*\}and so carries the optimal advantage\. The second equality is our contribution\. It identifies the polyhedral face\-crossing price with this restricted minimization and prices a crossing in the normal fan without ever building it\. A single value backup produces the advantagesAt∗A\_\{t\}^\{\*\}at once\. The remaining minimization runs overΠt−1∗\\Pi\_\{t\-1\}^\{\*\}, the policies optimal a round earlier\. In a finite\-horizon MDP these are read off directly, the policies supported on the previous loss’s zero\-advantage actions, with no search required\. The minimization collapses to one trajectory expectation when that earlier optimum is unique\. Figure[2](https://arxiv.org/html/2606.29092#S5.F2)\(left\) shows the price assembled from per\-step advantages along the policies of the previous optimal face\.

##### Causal anisotropy\.

The representation weights a crossing by the number of steps its error affects\. A wrong action only at the final step paysγ\\gammaonce, affecting nothing downstream\. Meanwhile, a wrong first action that then errs by marginγ\\gammaat allHHsteps paysHγH\\gamma\. A deterministic layered MDP turns this into a closed form\.

###### Corollary 1\(Causal anisotropy\)\.

In a deterministic layered MDP, a crossing forced at layerhhwith per\-step marginγ\\gammahas priceΓtcross=\(H−h\+1\)γ\\Gamma\_\{t\}^\{\\mathrm\{cross\}\}=\(H\-h\+1\)\\gamma, soΓtcross/γ=H−h\+1\\Gamma\_\{t\}^\{\\mathrm\{cross\}\}/\\gamma=H\-h\+1\.

The right panel of Figure[2](https://arxiv.org/html/2606.29092#S5.F2)illustrates this in a layered MDP\.

![Refer to caption](https://arxiv.org/html/2606.29092v1/x2.png)

![Refer to caption](https://arxiv.org/html/2606.29092v1/x3.png)

Figure 2:The Bellman price and its causal consequence\. Left, under a fixed new loss the optimal advantageAt,h∗\(s,a\)A\_\{t,h\}^\{\*\}\(s,a\)is computed by a value backup, and the face\-crossing price is the smallest expected advantage accumulated along a policy that was optimal under the previous loss \(Theorem[3](https://arxiv.org/html/2606.29092#Thmtheorem3)\)\. Right, in a layered MDP a divergence between the old and new optimal policies at an early layer redirects all downstream visitation, so the price grows with the number of affected steps\.

## 6Detecting cones without computing the fan

Pricing a crossing assumes that the crossing has already happened\. A learner must act before one arrives, so it has to locate itself in the fan\. That fan is far too large to enumerate beyond small examples\. The optimal advantages under the current loss already give that location\. Each one, produced by the backup of Section[5](https://arxiv.org/html/2606.29092#S5), measures how far an action sits from a wall of the current optimal cone\. A learner thus reads its position in the fan from a quantity it has already computed\.

### 6\.1The advantage as a cone detector

Under full information a prediction error matters only when it can carry the learner across a wall\. The relevant scale is the cone marginΔt\\Delta\_\{t\}, the smallest suboptimality among vertices outsideFtF\_\{t\}and a weak\-sharpness constant of the linear program\(Burke and Ferris,[1993](https://arxiv.org/html/2606.29092#bib.bib5)\)\. Whenever the model’s uniform value errorεt\\varepsilon\_\{t\}stays belowΔt/2\\Delta\_\{t\}/2, the round contributes no regret \(Proposition[4](https://arxiv.org/html/2606.29092#Thmproposition4), Appendix[A](https://arxiv.org/html/2606.29092#A1)\)\. A learner thus pays for its prediction error only on the rounds whose margin is thin\.

### 6\.2Advantage\-thresholded mirror descent

The same principle suggests how an update rule should move mass\. At roundttthe learner forms a modelmtm\_\{t\}of the loss, computing predicted advantagesA^t,h\(s,a\)\\widehat\{A\}\_\{t,h\}\(s,a\)with confidence radiiβt,h\(s,a\)\\beta\_\{t,h\}\(s,a\)\. The model marks the actions that are statistically indistinguishable from optimal,

𝒜t,hnear\(s\)=\{a:A^t,h\(s,a\)≤cβt,h\(s,a\)\}\.\\mathcal\{A\}\_\{t,h\}^\{\\rm near\}\(s\)=\\bigl\\\{a:\\ \\widehat\{A\}\_\{t,h\}\(s,a\)\\leq c\\,\\beta\_\{t,h\}\(s,a\)\\bigr\\\}\.It then takes a mirror\-descent step that is reluctant to abandon the near set but free to leave the rest,

πt\+1=argminπ⁡\{ηtJ^t\(π\)\+Dψ\(π,πt\)\+λt∑h,s∑a∉𝒜t,hnear\(s\)πh\(a∣s\)\}\.\\pi\_\{t\+1\}=\\operatorname\*\{arg\\,min\}\_\{\\pi\}\\Bigl\\\{\\eta\_\{t\}\\,\\widehat\{J\}\_\{t\}\(\\pi\)\+D\_\{\\psi\}\(\\pi,\\pi\_\{t\}\)\+\\lambda\_\{t\}\\sum\_\{h,s\}\\sum\_\{a\\notin\\mathcal\{A\}\_\{t,h\}^\{\\rm near\}\(s\)\}\\pi\_\{h\}\(a\\mid s\)\\Bigr\\\}\.We intend this as an interpretation of existing methods rather than a new algorithm, and keep the account informal\. The Bregman termDψ\(π,πt\)D\_\{\\psi\}\(\\pi,\\pi\_\{t\}\)holds probability mass near the uncertain walls, as the Bregman selector of Proposition[3](https://arxiv.org/html/2606.29092#Thmproposition3)does\. The penalty then moves mass off actions whose advantage is confidently positive\. Read this way, natural\-gradient, trust\-region, and policy\-mirror\-descent updates\(Kakade,[2001](https://arxiv.org/html/2606.29092#bib.bib21); Schulmanet al\.,[2017](https://arxiv.org/html/2606.29092#bib.bib41); Tomaret al\.,[2021](https://arxiv.org/html/2606.29092#bib.bib43); Lan,[2022](https://arxiv.org/html/2606.29092#bib.bib23)\)should track the fan well, holding mass at thin\-margin walls and leaving high\-advantage actions quickly\. We build the modelmtm\_\{t\}optimistically from past losses\(Rakhlin and Sridharan,[2014](https://arxiv.org/html/2606.29092#bib.bib37)\), supplying the prediction that Proposition[2](https://arxiv.org/html/2606.29092#Thmproposition2)needs\. Appendix Figure[4](https://arxiv.org/html/2606.29092#A1.F4)illustrates the resulting threshold\.

Proposition[5](https://arxiv.org/html/2606.29092#Thmproposition5), proved in Appendix[A\.3](https://arxiv.org/html/2606.29092#A1.SS3), bounds the tracker’s regret on the rounds the prediction resolves, and pins down what the penalty alone changes\. On every round whose margin the prediction resolves, with2εt<Δt2\\varepsilon\_\{t\}<\\Delta\_\{t\}, the tracker adds nothing to the intrinsic price beyond the controllable selection term\. On a single state\-step with a uniform per\-layer margin, a fixed step size lets the penalty suppress a confidently suboptimal action at layerhhfaster than a plain mirror step, by the causal factor1\+λ\(H−h\+1\)/β1\+\\lambda\(H\-h\+1\)/\\betaof Corollary[1](https://arxiv.org/html/2606.29092#Thmcorollary1)\. An adversary sitting on a wall instead drivesΔt→0\\Delta\_\{t\}\\to 0and forces theΩ\(γT\)\\Omega\(\\gamma T\)of Proposition[6](https://arxiv.org/html/2606.29092#Thmproposition6)\. This uniform\-margin reading therefore holds only for benign non\-stationarity\.

## 7Experiments

Three constructed environments verify the paper’s three central predictions\. For the experiments, we keep the normal fan small, so the face priceΓtcross\\Gamma\_\{t\}^\{\\mathrm\{cross\}\}, the crossing layer, and the selection error all take closed forms that we hold against the measured regret\. All numbers below average over3232seeds\.

##### Loss variation is not the difficulty \(simplex\)\.

The previous\-face tracker’s regret is a linear function of the face priceΓ2:Tcross\\Gamma\_\{2:T\}^\{\\mathrm\{cross\}\}, at sample correlation1\.001\.00against only0\.890\.89for the loss variationVTℓV\_\{T\}^\{\\ell\}\(Appendix Figure[7](https://arxiv.org/html/2606.29092#A3.F7)\)\. Theorem[1](https://arxiv.org/html/2606.29092#Thmtheorem1)predicts this exact linearity for a tracker that holds a vertex face\. The gap shows most sharply in the within\-cone regime\. There the loss travels far, at mean variation2\.6×1032\.6\\times 10^\{3\}, while the optimal face never moves\. The priceΓ2:Tcross\\Gamma\_\{2:T\}^\{\\mathrm\{cross\}\}stays at zero, and the tracker pays almost no regret\. The environment is a one\-state problem withKKactions, which makes𝒬\\mathcal\{Q\}the simplex, and we drive the loss through four regimes\. The clean relationship belongs to a learner that holds the face\. Taken over all four algorithms at once, both correlations fall near0\.460\.46, the geometry\-blind methods contributing regret tied to neither quantity\.

##### Early crossings cost more \(layered tree\)\.

The measured ratioΓtcross/γ\\Gamma\_\{t\}^\{\\mathrm\{cross\}\}/\\gammalands on the predicted identityH−h\+1H\-h\+1of Corollary[1](https://arxiv.org/html/2606.29092#Thmcorollary1)to within floating\-point error\. Across horizons up to2020the largest deviation is2×10−152\\times 10^\{\-15\}\. The tree therefore confirms the closed\-form identity rather than estimating it \(Figure[3](https://arxiv.org/html/2606.29092#S7.F3), left\)\. An early crossing is thereforeHHtimes as expensive as a late one\. The instance is a deterministic binary tree, each policy a root\-to\-leaf path, and we force a crossing at a chosen layerhh\.

##### Regret splits into price and selection error \(degenerate faces\)\.

The benchmark makes the optimal set a prefix face\. Every selector inside it then pays the same intrinsic price, and only the representative differs\. We compare four selectors — an oracle that seesℓt\\ell\_\{t\}and plays the best point ofFt−1F\_\{t\-1\}, a sticky Bregman selector that stays near its previous choice \(Proposition[3](https://arxiv.org/html/2606.29092#Thmproposition3)\), and lexicographic and random tie\-breaks \(defined in Appendix[C](https://arxiv.org/html/2606.29092#A3)\)\. All four pay the identical price of Theorem[1](https://arxiv.org/html/2606.29092#Thmtheorem1), and the regret above it is pure selection error\. The oracle adds none, while the sticky selector adds per\-round error0\.0620\.062, lexicographic adds0\.1090\.109, and random adds0\.1100\.110\(Figure[3](https://arxiv.org/html/2606.29092#S7.F3), right\)\. The cumulative\-regret stack reproduces the decomposition directly\.

![Refer to caption](https://arxiv.org/html/2606.29092v1/x4.png)

![Refer to caption](https://arxiv.org/html/2606.29092v1/x5.png)

Figure 3:Left, the causal\-anisotropy identity in the layered tree, where the measuredΓtcross/γ\\Gamma\_\{t\}^\{\\mathrm\{cross\}\}/\\gammaagainstH−h\+1H\-h\+1falls on the identity line to within2×10−152\\times 10^\{\-15\}\. Right, the regret decomposition on degenerate faces, where the intrinsic price is the same for every selector and the regret above it is selection error, which the oracle drives to zero and the sticky selector keeps below the lexicographic and random ones\. Error bars show standard errors\.
##### Heuristic update \(secondary\)\.

As a consistency check on Section[6](https://arxiv.org/html/2606.29092#S6), we run the advantage\-thresholded update — the penalized mirror\-descent step defined there — on custom non\-stationary gridworlds, following the standard tabular gridworld convention in reinforcement learning\(Sutton and Barto,[2018](https://arxiv.org/html/2606.29092#bib.bib59)\)\. The baselines are mirror descent, optimistic mirror descent, and a trust\-region update\. The thresholded update attains the lowest cumulative regret in every scenario, beating the mirror\-descent baselines by factors of three to four and the best\-tuned trust\-region method \(Appendix Figure[10](https://arxiv.org/html/2606.29092#A3.F10)\)\. Proposition[5](https://arxiv.org/html/2606.29092#Thmproposition5)attributes the gain to the causal\-anisotropy factorH−h\+1H\-h\+1of Corollary[1](https://arxiv.org/html/2606.29092#Thmcorollary1)\. Unlike the three structural results, this gain has no closed\-form tie toΓ2:Tcross\\Gamma\_\{2:T\}^\{\\mathrm\{cross\}\}\.

## 8Discussion

This paper recasts the cost of non\-stationarity in fixed\-transition adversarial MDPs as the priced motion of the optimal face through the normal fan\. That quantity is at once geometrically intrinsic and computable by a single value backup\. The face\-crossing price improves on the usual measures of non\-stationarity in two concrete ways\. It counts only the motion that forces a change of policy\. The price stays at zero even as loss variation or comparator path length grows without bound\. It is also exact where the usual measures only bound\. Theorem[3](https://arxiv.org/html/2606.29092#Thmtheorem3)writes it as an expected optimal Bellman advantage over the already\-known old face\. Variation and path length reach regret only through a Lipschitz constant\. The decompositionDReg=priced face motion\+within\-face selection error\\operatorname\{DReg\}=\\text\{priced face motion\}\+\\text\{within\-face selection error\}holds whenever the learner stays on the old face, and it cleanly separates the intrinsic price from the choice of representative inside a degenerate face\. Its sharpest consequence is the causal anisotropyΓtcross/γ=H−h\+1\\Gamma\_\{t\}^\{\\mathrm\{cross\}\}/\\gamma=H\-h\+1of Corollary[1](https://arxiv.org/html/2606.29092#Thmcorollary1)\. The price depends on where in the horizon a crossing falls, a dependence loss variation cannot express and the experiments confirm to floating\-point precision\. For the previous\-face tracker the price is two\-sided, the decomposition giving an exact upper bound and theΩ\(γT\)\\Omega\(\\gamma T\)lower bound of Proposition[6](https://arxiv.org/html/2606.29092#Thmproposition6)matching it when crossings are unpredictable\. Under degeneracy the selection error is an independent degree of freedom \(Proposition[7](https://arxiv.org/html/2606.29092#Thmproposition7)\)\.

##### Limitations\.

The structural results assume fixed transitions and a finite horizon\. We leave the moving\-transition case, studied in non\-stationary MDP work through transition variation\(Cheunget al\.,[2020](https://arxiv.org/html/2606.29092#bib.bib8); Feiet al\.,[2020](https://arxiv.org/html/2606.29092#bib.bib13); Maoet al\.,[2022](https://arxiv.org/html/2606.29092#bib.bib29); Wei and Luo,[2021](https://arxiv.org/html/2606.29092#bib.bib45)\), to future work\. In such cases, the polytope itself deforms between rounds\. Additionally, our reading of mirror\-descent and trust\-region updates as fan trackers remains heuristic, with no adversarial regret guarantee\.

##### Bandit feedback\.

The most natural extension is to bandit feedback\(Cesa\-Bianchi and Lugosi,[2006](https://arxiv.org/html/2606.29092#bib.bib7); Neuet al\.,[2010](https://arxiv.org/html/2606.29092#bib.bib33); Jinet al\.,[2019](https://arxiv.org/html/2606.29092#bib.bib19); Liet al\.,[2024a](https://arxiv.org/html/2606.29092#bib.bib26)\), where the learner sees only the losses on the trajectory it plays and must estimate the advantage of the unplayed actions\. The cone detector of Proposition[4](https://arxiv.org/html/2606.29092#Thmproposition4)needs only a uniform value confidence radiussupπ\|Jℓt\(π\)−J^t\(π\)\|≤εtbandit\\sup\_\{\\pi\}\|J\_\{\\ell\_\{t\}\}\(\\pi\)\-\\widehat\{J\}\_\{t\}\(\\pi\)\|\\leq\\varepsilon\_\{t\}^\{\\rm bandit\}\. The same margin test2εtbandit<Δt2\\varepsilon\_\{t\}^\{\\rm bandit\}<\\Delta\_\{t\}then decides which rounds are statistically free and which sit near a wall\. The open problem is to build that radius from an importance\-weighted estimator over a sliding window\. Its per\-state\-action width must track the visitationNt,h\(s,a\)N\_\{t,h\}\(s,a\), the horizon, and the window length, and must aggregate to a uniform value bound concentrated at the thin\-margin walls of the fan\. Such a radius would turn the heuristic split into a genuine regret bound, and we leave it to future work\.

#### Reproducibility statement

All claims in the main text have complete proofs in Appendices[A](https://arxiv.org/html/2606.29092#A1)and[B](https://arxiv.org/html/2606.29092#A2)\. The experiments of Section[7](https://arxiv.org/html/2606.29092#S7)are described in full in Appendix[C](https://arxiv.org/html/2606.29092#A3)\.

## References

- A\. Agarwal, S\. M\. Kakade, J\. D\. Lee, and G\. Mahajan \(2020\)On the theory of policy gradient methods: optimality, approximation, and distribution shift\.\(arXiv:1908\.00261\)\.External Links:[Link](http://arxiv.org/abs/1908.00261),[Document](https://dx.doi.org/10.48550/arXiv.1908.00261)Cited by:[§2](https://arxiv.org/html/2606.29092#S2.SS0.SSS0.Px3.p1.1)\.
- C\. Alfano, R\. Yuan, and P\. Rebeschini \(2024\)A novel framework for policy mirror descent with general parameterization and linear convergence\.\(arXiv:2301\.13139\)\.External Links:[Link](http://arxiv.org/abs/2301.13139),[Document](https://dx.doi.org/10.48550/arXiv.2301.13139)Cited by:[§2](https://arxiv.org/html/2606.29092#S2.SS0.SSS0.Px3.p1.1)\.
- E\. Altman \(2021\)Constrained markov decision processes: stochastic modeling\.Boca Raton\.External Links:[Link](https://www.taylorfrancis.com/books/9781315140223),[Document](https://dx.doi.org/10.1201/9781315140223)Cited by:[§1](https://arxiv.org/html/2606.29092#S1.p4.1),[§2](https://arxiv.org/html/2606.29092#S2.SS0.SSS0.Px1.p1.1),[§3\.1](https://arxiv.org/html/2606.29092#S3.SS1.p1.7)\.
- A\. Beck and M\. Teboulle \(2003\)Mirror descent and nonlinear projected subgradient methods for convex optimization\.Operations Research Letters31\(3\),pp\. 167–175\.Cited by:[§A\.1](https://arxiv.org/html/2606.29092#A1.SS1.p2.3),[Appendix C](https://arxiv.org/html/2606.29092#A3.SS0.SSS0.Px2.p1.7),[Appendix C](https://arxiv.org/html/2606.29092#A3.SS0.SSS0.Px5.p1.2),[§2](https://arxiv.org/html/2606.29092#S2.SS0.SSS0.Px3.p1.1),[§4\.3](https://arxiv.org/html/2606.29092#S4.SS3.p2.5)\.
- O\. Besbes, Y\. Gur, and A\. Zeevi \(2015\)Non\-stationary stochastic optimization\.Operations Research63\(5\),pp\. 1227–1244\.External Links:[Link](http://arxiv.org/abs/1307.5449),[Document](https://dx.doi.org/10.1287/opre.2015.1408)Cited by:[§1](https://arxiv.org/html/2606.29092#S1.p2.1),[§2](https://arxiv.org/html/2606.29092#S2.SS0.SSS0.Px2.p1.1),[§4\.2](https://arxiv.org/html/2606.29092#S4.SS2.p1.1)\.
- J\. V\. Burke and M\. C\. Ferris \(1993\)Weak sharp minima in mathematical programming\.SIAM Journal on Control and Optimization31\(5\),pp\. 1340–1359\.External Links:[Link](https://epubs.siam.org/doi/10.1137/0331063),[Document](https://dx.doi.org/10.1137/0331063)Cited by:[§2](https://arxiv.org/html/2606.29092#S2.SS0.SSS0.Px3.p1.1),[§6\.1](https://arxiv.org/html/2606.29092#S6.SS1.p1.4)\.
- A\. R\. Cardoso, H\. Wang, and H\. Xu \(2019\)Large scale markov decision processes with changing rewards\.External Links:[Link](https://arxiv.org/abs/1905.10649v1)Cited by:[§2](https://arxiv.org/html/2606.29092#S2.SS0.SSS0.Px2.p1.1)\.
- N\. Cesa\-Bianchi and G\. Lugosi \(2006\)Prediction, learning, and games\.External Links:[Document](https://dx.doi.org/10.1017/CBO9780511546921)Cited by:[§A\.3](https://arxiv.org/html/2606.29092#A1.SS3.2.p1.22),[Appendix C](https://arxiv.org/html/2606.29092#A3.SS0.SSS0.Px2.p1.7),[§8](https://arxiv.org/html/2606.29092#S8.SS0.SSS0.Px2.p1.3)\.
- W\. C\. Cheung, D\. Simchi\-Levi, and R\. Zhu \(2020\)Reinforcement learning for non\-stationary markov decision processes: the blessing of \(more\) optimism\.External Links:[Link](https://arxiv.org/abs/2006.14389v1)Cited by:[§2](https://arxiv.org/html/2606.29092#S2.SS0.SSS0.Px2.p1.1),[§8](https://arxiv.org/html/2606.29092#S8.SS0.SSS0.Px1.p1.1)\.
- R\. Dadashi, A\. A\. Taïga, N\. L\. Roux, D\. Schuurmans, and M\. G\. Bellemare \(2019\)The value function polytope in reinforcement learning\.\(arXiv:1901\.11524\)\.External Links:[Link](http://arxiv.org/abs/1901.11524),[Document](https://dx.doi.org/10.48550/arXiv.1901.11524)Cited by:[§2](https://arxiv.org/html/2606.29092#S2.SS0.SSS0.Px1.p1.1)\.
- T\. Dick, A\. Gyorgy, and C\. Szepesvari \(2015\)Online learning in markov decision processes with changing cost sequences\.External Links:[Document](https://dx.doi.org/10.14288/1.0044649)Cited by:[§2](https://arxiv.org/html/2606.29092#S2.SS0.SSS0.Px2.p1.1)\.
- N\. Eshraghi and B\. Liang \(2022\)Dynamic regret of online mirror descent for relatively smooth convex cost functions\.\(arXiv:2202\.12843\)\.External Links:[Link](http://arxiv.org/abs/2202.12843),[Document](https://dx.doi.org/10.48550/arXiv.2202.12843)Cited by:[§2](https://arxiv.org/html/2606.29092#S2.SS0.SSS0.Px2.p1.1)\.
- E\. Even\-Dar, Sham\. M\. Kakade, and Y\. Mansour \(2009\)Online markov decision processes\.Mathematics of Operations Research34\(3\),pp\. 726–736\.External Links:[Link](https://www.jstor.org/stable/40538442)Cited by:[§2](https://arxiv.org/html/2606.29092#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Fei, Z\. Yang, Z\. Wang, and Q\. Xie \(2020\)Dynamic regret of policy optimization in non\-stationary environments\.\(arXiv:2007\.00148\)\.External Links:[Link](http://arxiv.org/abs/2007.00148),[Document](https://dx.doi.org/10.48550/arXiv.2007.00148)Cited by:[§1](https://arxiv.org/html/2606.29092#S1.p2.1),[§2](https://arxiv.org/html/2606.29092#S2.SS0.SSS0.Px2.p1.1),[§8](https://arxiv.org/html/2606.29092#S8.SS0.SSS0.Px1.p1.1)\.
- E\. C\. Hall and R\. M\. Willett \(2013\)Dynamical models and tracking regret in online convex programming\.External Links:[Link](https://arxiv.org/abs/1301.1254v1)Cited by:[§2](https://arxiv.org/html/2606.29092#S2.SS0.SSS0.Px2.p1.1),[§3\.1](https://arxiv.org/html/2606.29092#S3.SS1.p1.11)\.
- W\. Huleihel, S\. Pal, and O\. Shayevitz \(2021\)Learning user preferences in non\-stationary environments\.\(arXiv:2101\.12506\)\.External Links:[Link](http://arxiv.org/abs/2101.12506),[Document](https://dx.doi.org/10.48550/arXiv.2101.12506)Cited by:[§1](https://arxiv.org/html/2606.29092#S1.p1.2)\.
- A\. Jadbabaie, A\. Rakhlin, S\. Shahrampour, and K\. Sridharan \(2015\)Online optimization: competing with dynamic comparators\.\(arXiv:1501\.06225\)\.External Links:[Link](http://arxiv.org/abs/1501.06225),[Document](https://dx.doi.org/10.48550/arXiv.1501.06225)Cited by:[§2](https://arxiv.org/html/2606.29092#S2.SS0.SSS0.Px2.p1.1),[§3\.1](https://arxiv.org/html/2606.29092#S3.SS1.p1.11)\.
- C\. Jin, T\. Jin, H\. Luo, S\. Sra, and T\. Yu \(2019\)Learning adversarial mdps with bandit feedback and unknown transition\.External Links:[Link](https://arxiv.org/abs/1912.01192v5)Cited by:[§8](https://arxiv.org/html/2606.29092#S8.SS0.SSS0.Px2.p1.3)\.
- S\. Kakade and J\. Langford \(2002\)Approximately optimal approximate reinforcement learning\.External Links:[Link](https://www.semanticscholar.org/paper/Approximately-Optimal-Approximate-Reinforcement-Kakade-Langford/523b4ce1c2a1336962444abc1dec215756c2f3e6)Cited by:[§2](https://arxiv.org/html/2606.29092#S2.SS0.SSS0.Px3.p1.1),[§5](https://arxiv.org/html/2606.29092#S5.p2.3)\.
- S\. M\. Kakade \(2001\)A natural policy gradient\.InAdvances in Neural Information Processing Systems,Vol\.14\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2001/hash/4b86abe48d358ecf194c56c69108433e-Abstract.html)Cited by:[§2](https://arxiv.org/html/2606.29092#S2.SS0.SSS0.Px3.p1.1),[§6\.2](https://arxiv.org/html/2606.29092#S6.SS2.p1.6)\.
- J\. F\. Kleuker, A\. Plaat, and T\. Moerland \(2025\)On the effect of regularization in policy mirror descent\.\(arXiv:2507\.08718\)\.External Links:[Link](http://arxiv.org/abs/2507.08718),[Document](https://dx.doi.org/10.48550/arXiv.2507.08718)Cited by:[§2](https://arxiv.org/html/2606.29092#S2.SS0.SSS0.Px3.p1.1)\.
- G\. Lan \(2022\)Policy mirror descent for reinforcement learning: linear convergence, new sampling complexity, and generalized problem classes\.\(arXiv:2102\.00135\)\.External Links:[Link](http://arxiv.org/abs/2102.00135),[Document](https://dx.doi.org/10.48550/arXiv.2102.00135)Cited by:[§2](https://arxiv.org/html/2606.29092#S2.SS0.SSS0.Px3.p1.1),[§6\.2](https://arxiv.org/html/2606.29092#S6.SS2.p1.6)\.
- L\. Li, P\. Zhao, and Z\. Zhou \(2023\)Dynamic regret of adversarial linear mixture mdps\.Advances in Neural Information Processing Systems 36,pp\. 60685–60711\.External Links:[Link](https://papers.nips.cc/paper_files/paper/2023/file/becd02b89259774da2ede23116a80648-Paper-Conference.pdf),[Document](https://dx.doi.org/10.52202/075280-2650)Cited by:[§2](https://arxiv.org/html/2606.29092#S2.SS0.SSS0.Px2.p1.1)\.
- L\. Li, P\. Zhao, and Z\. Zhou \(2024a\)Improved algorithm for adversarial linear mixture mdps with bandit feedback and unknown transition\.External Links:[Link](https://arxiv.org/abs/2403.04568v1)Cited by:[§8](https://arxiv.org/html/2606.29092#S8.SS0.SSS0.Px2.p1.3)\.
- L\. Li, P\. Zhao, and Z\. Zhou \(2024b\)Near\-optimal dynamic regret for adversarial linear mixture mdps\.External Links:[Link](https://arxiv.org/abs/2411.03107v1)Cited by:[§2](https://arxiv.org/html/2606.29092#S2.SS0.SSS0.Px2.p1.1)\.
- S\. Lu and S\. M\. Robinson \(2008\)Normal fans of polyhedral convex sets\.Set\-Valued Analysis16\(2\),pp\. 281–305\.External Links:[Link](https://doi.org/10.1007/s11228-008-0077-9),[Document](https://dx.doi.org/10.1007/s11228-008-0077-9)Cited by:[§1](https://arxiv.org/html/2606.29092#S1.p4.1),[§2](https://arxiv.org/html/2606.29092#S2.SS0.SSS0.Px3.p1.1),[§3\.2](https://arxiv.org/html/2606.29092#S3.SS2.p1.7)\.
- W\. Mao, K\. Zhang, R\. Zhu, D\. Simchi\-Levi, and T\. Başar \(2022\)Model\-free non\-stationary rl: near\-optimal regret and applications in multi\-agent rl and inventory control\.\(arXiv:2010\.03161\)\.External Links:[Link](http://arxiv.org/abs/2010.03161),[Document](https://dx.doi.org/10.48550/arXiv.2010.03161)Cited by:[§8](https://arxiv.org/html/2606.29092#S8.SS0.SSS0.Px1.p1.1)\.
- N\. Milosevic and N\. Scherf \(2025\)The geometry of nonlinear reinforcement learning\.\(arXiv:2509\.01432\)\.External Links:[Link](http://arxiv.org/abs/2509.01432),[Document](https://dx.doi.org/10.48550/arXiv.2509.01432)Cited by:[§2](https://arxiv.org/html/2606.29092#S2.SS0.SSS0.Px1.p1.1)\.
- J\. Müller and G\. Montúfar \(2024\)Geometry and convergence of natural policy gradient methods\.Information Geometry7\(S1\),pp\. 485–523\.External Links:[Link](http://arxiv.org/abs/2211.02105),[Document](https://dx.doi.org/10.1007/s41884-023-00106-z)Cited by:[§2](https://arxiv.org/html/2606.29092#S2.SS0.SSS0.Px1.p1.1)\.
- A\. S\. Nemirovskiĭ and D\. B\. IUdin \(1983\)Problem complexity and method efficiency in optimization\.Wiley\.Cited by:[§A\.1](https://arxiv.org/html/2606.29092#A1.SS1.p2.3),[Appendix C](https://arxiv.org/html/2606.29092#A3.SS0.SSS0.Px5.p1.2),[§2](https://arxiv.org/html/2606.29092#S2.SS0.SSS0.Px3.p1.1),[§4\.3](https://arxiv.org/html/2606.29092#S4.SS3.p2.5)\.
- G\. Neu, A\. Antos, A\. György, and C\. Szepesvári \(2010\)Online markov decision processes under bandit feedback\.InAdvances in Neural Information Processing Systems,Vol\.23\.External Links:[Link](https://papers.nips.cc/paper_files/paper/2010/hash/7bb060764a818184ebb1cc0d43d382aa-Abstract.html)Cited by:[§2](https://arxiv.org/html/2606.29092#S2.SS0.SSS0.Px1.p1.1),[§8](https://arxiv.org/html/2606.29092#S8.SS0.SSS0.Px2.p1.3)\.
- A\. Paffenholz \(2010\)Polyhedral geometry and linear optimization\.Cited by:[§2](https://arxiv.org/html/2606.29092#S2.SS0.SSS0.Px3.p1.1)\.
- M\. L\. Puterman \(2014\)Markov decision processes: discrete stochastic dynamic programming\.John Wiley & Sons\.Cited by:[§1](https://arxiv.org/html/2606.29092#S1.p1.2),[§1](https://arxiv.org/html/2606.29092#S1.p4.1),[§2](https://arxiv.org/html/2606.29092#S2.SS0.SSS0.Px1.p1.1),[§3\.1](https://arxiv.org/html/2606.29092#S3.SS1.p1.7),[§5](https://arxiv.org/html/2606.29092#S5.p1.1)\.
- A\. Rakhlin and K\. Sridharan \(2014\)Online learning with predictable sequences\.\(arXiv:1208\.3728\)\.External Links:[Link](http://arxiv.org/abs/1208.3728),[Document](https://dx.doi.org/10.48550/arXiv.1208.3728)Cited by:[Appendix C](https://arxiv.org/html/2606.29092#A3.SS0.SSS0.Px2.p1.7),[Appendix C](https://arxiv.org/html/2606.29092#A3.SS0.SSS0.Px5.p1.2),[§2](https://arxiv.org/html/2606.29092#S2.SS0.SSS0.Px3.p1.1),[§6\.2](https://arxiv.org/html/2606.29092#S6.SS2.p1.6)\.
- R\. T\. Rockafellar \(1970\)Convex analysis\.Princeton University Press\.External Links:[Link](https://www.jstor.org/stable/j.ctt14bs1ff)Cited by:[§1](https://arxiv.org/html/2606.29092#S1.p4.1),[§2](https://arxiv.org/html/2606.29092#S2.SS0.SSS0.Px3.p1.1),[§3\.2](https://arxiv.org/html/2606.29092#S3.SS2.p1.7)\.
- A\. Rosenberg and Y\. Mansour \(2019\)Online convex optimization in adversarial markov decision processes\.External Links:[Link](https://arxiv.org/abs/1905.07773v1)Cited by:[§2](https://arxiv.org/html/2606.29092#S2.SS0.SSS0.Px1.p1.1)\.
- J\. B\. Schafer, J\. Konstan, and J\. Riedl \(1999\)Recommender systems in e\-commerce\.InProceedings of the 1st ACM conference on Electronic commerce,EC ’99,New York, NY, USA,pp\. 158–166\.External Links:[Link](https://dl.acm.org/doi/10.1145/336992.337035),[Document](https://dx.doi.org/10.1145/336992.337035)Cited by:[§1](https://arxiv.org/html/2606.29092#S1.p1.2)\.
- A\. Schlaginhaufen and M\. Kamgarpour \(2023\)Identifiability and generalizability in constrained inverse reinforcement learning\.\(arXiv:2306\.00629\)\.External Links:[Link](http://arxiv.org/abs/2306.00629),[Document](https://dx.doi.org/10.48550/arXiv.2306.00629)Cited by:[§2](https://arxiv.org/html/2606.29092#S2.SS0.SSS0.Px1.p1.1)\.
- J\. Schulman, S\. Levine, P\. Moritz, M\. I\. Jordan, and P\. Abbeel \(2017\)Trust region policy optimization\.\(arXiv:1502\.05477\)\.External Links:[Link](http://arxiv.org/abs/1502.05477),[Document](https://dx.doi.org/10.48550/arXiv.1502.05477)Cited by:[Appendix C](https://arxiv.org/html/2606.29092#A3.SS0.SSS0.Px5.p1.2),[§2](https://arxiv.org/html/2606.29092#S2.SS0.SSS0.Px3.p1.1),[§6\.2](https://arxiv.org/html/2606.29092#S6.SS2.p1.6)\.
- G\. Shani, R\. I\. Brafman, and D\. Heckerman \(2015\)An mdp\-based recommender system\.\(arXiv:1301\.0600\)\.External Links:[Link](http://arxiv.org/abs/1301.0600),[Document](https://dx.doi.org/10.48550/arXiv.1301.0600)Cited by:[§1](https://arxiv.org/html/2606.29092#S1.p1.2)\.
- R\. S\. Sutton and A\. G\. Barto \(2018\)Reinforcement learning: an introduction\.MIT Press\.Cited by:[Appendix C](https://arxiv.org/html/2606.29092#A3.SS0.SSS0.Px5.p1.2),[§7](https://arxiv.org/html/2606.29092#S7.SS0.SSS0.Px4.p1.2)\.
- M\. Tomar, L\. Shani, Y\. Efroni, and M\. Ghavamzadeh \(2021\)Mirror descent policy optimization\.\(arXiv:2005\.09814\)\.External Links:[Link](http://arxiv.org/abs/2005.09814),[Document](https://dx.doi.org/10.48550/arXiv.2005.09814)Cited by:[§2](https://arxiv.org/html/2606.29092#S2.SS0.SSS0.Px3.p1.1),[§6\.2](https://arxiv.org/html/2606.29092#S6.SS2.p1.6)\.
- J\. A\. Tropp \(2022\)ACM 204: lectures on convex geometry\.External Links:[Link](https://resolver.caltech.edu/CaltechAUTHORS:20220412-220319430),[Document](https://dx.doi.org/10.7907/GEDA-H205)Cited by:[§2](https://arxiv.org/html/2606.29092#S2.SS0.SSS0.Px3.p1.1)\.
- C\. Wei and H\. Luo \(2021\)Non\-stationary reinforcement learning without prior knowledge: an optimal black\-box approach\.\(arXiv:2102\.05406\)\.External Links:[Link](http://arxiv.org/abs/2102.05406),[Document](https://dx.doi.org/10.48550/arXiv.2102.05406)Cited by:[§8](https://arxiv.org/html/2606.29092#S8.SS0.SSS0.Px1.p1.1)\.
- W\. Zhan, S\. Cen, B\. Huang, Y\. Chen, J\. D\. Lee, and Y\. Chi \(2023\)Policy mirror descent for regularized reinforcement learning: a generalized framework with linear convergence\.\(arXiv:2105\.11066\)\.External Links:[Link](http://arxiv.org/abs/2105.11066),[Document](https://dx.doi.org/10.48550/arXiv.2105.11066)Cited by:[§2](https://arxiv.org/html/2606.29092#S2.SS0.SSS0.Px3.p1.1)\.
- P\. Zhao, L\. Li, and Z\. Zhou \(2022\)Dynamic regret of online markov decision processes\.External Links:[Link](https://arxiv.org/abs/2208.12483v1)Cited by:[§1](https://arxiv.org/html/2606.29092#S1.p2.1),[§2](https://arxiv.org/html/2606.29092#S2.SS0.SSS0.Px2.p1.1),[§4\.2](https://arxiv.org/html/2606.29092#S4.SS2.p1.1)\.
- P\. Zhao, Y\. Zhang, L\. Zhang, and Z\. Zhou \(2020\)Dynamic regret of convex and smooth functions\.\(arXiv:2007\.03479\)\.External Links:[Link](http://arxiv.org/abs/2007.03479),[Document](https://dx.doi.org/10.48550/arXiv.2007.03479)Cited by:[§2](https://arxiv.org/html/2606.29092#S2.SS0.SSS0.Px2.p1.1)\.
- P\. Zhao, Y\. Zhang, L\. Zhang, and Z\. Zhou \(2024\)Adaptivity and non\-stationarity: problem\-dependent dynamic regret for online convex optimization\.\(arXiv:2112\.14368\)\.External Links:[Link](http://arxiv.org/abs/2112.14368),[Document](https://dx.doi.org/10.48550/arXiv.2112.14368)Cited by:[§1](https://arxiv.org/html/2606.29092#S1.p2.1),[§2](https://arxiv.org/html/2606.29092#S2.SS0.SSS0.Px2.p1.1),[§4\.2](https://arxiv.org/html/2606.29092#S4.SS2.p1.1)\.
- G\. M\. Ziegler \(1995\)Lectures on polytopes\.Graduate Texts in Mathematics, Vol\.152,Springer,New York, NY\.External Links:[Link](http://link.springer.com/10.1007/978-1-4613-8431-1),[Document](https://dx.doi.org/10.1007/978-1-4613-8431-1)Cited by:[§1](https://arxiv.org/html/2606.29092#S1.p4.1),[§2](https://arxiv.org/html/2606.29092#S2.SS0.SSS0.Px3.p1.1),[§3\.2](https://arxiv.org/html/2606.29092#S3.SS2.p1.7)\.
- A\. Zimin and G\. Neu \(2013\)Online learning in episodic markovian decision processes by relative entropy policy search\.InAdvances in Neural Information Processing Systems,Vol\.26\.External Links:[Link](https://papers.neurips.cc/paper_files/paper/2013/hash/68053af2923e00204c3ca7c6a3150cf7-Abstract.html)Cited by:[§2](https://arxiv.org/html/2606.29092#S2.SS0.SSS0.Px1.p1.1)\.
- M\. Zinkevich \(2003\)Online convex programming and generalized infinitesimal gradient ascent\.Cited by:[§2](https://arxiv.org/html/2606.29092#S2.SS0.SSS0.Px2.p1.1),[§3\.1](https://arxiv.org/html/2606.29092#S3.SS1.p1.11)\.

## Appendix AProofs from the main text

### A\.1Within\-face selection

When the previous optimal faceFt−1F\_\{t\-1\}is higher\-dimensional, the learner chooses one occupancy inside it, and that choice is the source of the selection errorεtsel\\varepsilon\_\{t\}^\{\\rm sel\}\.

###### Proposition 2\(Predictive selection\)\.

Letmtm\_\{t\}be a prediction ofℓt\\ell\_\{t\}and choosext∈argminu∈Ft−1⁡⟨u,mt⟩x\_\{t\}\\in\\operatorname\*\{arg\\,min\}\_\{u\\in F\_\{t\-1\}\}\\left\\langle u,m\_\{t\}\\right\\rangle\. Ifδt=‖mt−ℓt‖∗\\delta\_\{t\}=\\left\\lVert m\_\{t\}\-\\ell\_\{t\}\\right\\rVert\_\{\*\}, thenεtsel≤diam⁡\(Ft−1\)δt\\varepsilon\_\{t\}^\{\\rm sel\}\\leq\\operatorname\{diam\}\(F\_\{t\-1\}\)\\,\\delta\_\{t\}\.

A learner that keeps the directions the old face leaves undetermined available for a later loss uses a Bregman selectorxt∈argminu∈Ft−1⁡\{⟨u,mt⟩\+λDψ\(u,zt\)\}x\_\{t\}\\in\\operatorname\*\{arg\\,min\}\_\{u\\in F\_\{t\-1\}\}\\bigl\\\{\\left\\langle u,m\_\{t\}\\right\\rangle\+\\lambda D\_\{\\psi\}\(u,z\_\{t\}\)\\bigr\\\}, withztz\_\{t\}the previous representative andDψD\_\{\\psi\}a Bregman divergence\(Nemirovskiĭ and IUdin,[1983](https://arxiv.org/html/2606.29092#bib.bib32); Beck and Teboulle,[2003](https://arxiv.org/html/2606.29092#bib.bib55)\)\.

###### Proposition 3\(Bregman face selection\)\.

WithRtψ=supu,v∈Ft−1\|Dψ\(u,zt\)−Dψ\(v,zt\)\|R\_\{t\}^\{\\psi\}=\\sup\_\{u,v\\in F\_\{t\-1\}\}\|D\_\{\\psi\}\(u,z\_\{t\}\)\-D\_\{\\psi\}\(v,z\_\{t\}\)\|, the Bregman selector satisfiesεtsel≤diam⁡\(Ft−1\)δt\+λRtψ\\varepsilon\_\{t\}^\{\\rm sel\}\\leq\\operatorname\{diam\}\(F\_\{t\-1\}\)\\,\\delta\_\{t\}\+\\lambda R\_\{t\}^\{\\psi\}\.

###### Proof of Propositions[2](https://arxiv.org/html/2606.29092#Thmproposition2)and[3](https://arxiv.org/html/2606.29092#Thmproposition3)\.

Each selector is controlled by its own optimality condition, which bounds the loss gap between its chosen point and the best point of the face\. Letut∗∈argminu∈Ft−1⁡⟨u,ℓt⟩u\_\{t\}^\{\*\}\\in\\operatorname\*\{arg\\,min\}\_\{u\\in F\_\{t\-1\}\}\\left\\langle u,\\ell\_\{t\}\\right\\rangle\. For the predictive selector, optimality ofxtx\_\{t\}formtm\_\{t\}gives⟨xt−ut∗,mt⟩≤0\\left\\langle x\_\{t\}\-u\_\{t\}^\{\*\},m\_\{t\}\\right\\rangle\\leq 0, so

εtsel=⟨xt−ut∗,ℓt⟩≤⟨xt−ut∗,ℓt−mt⟩≤‖xt−ut∗‖δt≤diam⁡\(Ft−1\)δt\.\\varepsilon\_\{t\}^\{\\rm sel\}=\\left\\langle x\_\{t\}\-u\_\{t\}^\{\*\},\\ell\_\{t\}\\right\\rangle\\leq\\left\\langle x\_\{t\}\-u\_\{t\}^\{\*\},\\ell\_\{t\}\-m\_\{t\}\\right\\rangle\\leq\\left\\lVert x\_\{t\}\-u\_\{t\}^\{\*\}\\right\\rVert\\,\\delta\_\{t\}\\leq\\operatorname\{diam\}\(F\_\{t\-1\}\)\\,\\delta\_\{t\}\.For the Bregman selector the optimality condition carries the extra regularizer, giving⟨xt−ut∗,mt⟩≤λ\(Dψ\(ut∗,zt\)−Dψ\(xt,zt\)\)≤λRtψ\\left\\langle x\_\{t\}\-u\_\{t\}^\{\*\},m\_\{t\}\\right\\rangle\\leq\\lambda\(D\_\{\\psi\}\(u\_\{t\}^\{\*\},z\_\{t\}\)\-D\_\{\\psi\}\(x\_\{t\},z\_\{t\}\)\)\\leq\\lambda R\_\{t\}^\{\\psi\}, and the same chain of inequalities addsλRtψ\\lambda R\_\{t\}^\{\\psi\}to the bound\. ∎

### A\.2The cone detector

###### Proposition 4\(Prediction error matters only near walls\)\.

Letmtm\_\{t\}be a predicted loss withsupπ\|Jℓt\(π\)−Jmt\(π\)\|≤εt\\sup\_\{\\pi\}\|J\_\{\\ell\_\{t\}\}\(\\pi\)\-J\_\{m\_\{t\}\}\(\\pi\)\|\\leq\\varepsilon\_\{t\}, and letπ^t\\widehat\{\\pi\}\_\{t\}minimizeJmtJ\_\{m\_\{t\}\}at a deterministic policy, so its occupancy is a vertex of𝒬\\mathcal\{Q\}\. ThenJℓt\(π^t\)−Jℓt∗≤2εtJ\_\{\\ell\_\{t\}\}\(\\widehat\{\\pi\}\_\{t\}\)\-J\_\{\\ell\_\{t\}\}^\{\*\}\\leq 2\\varepsilon\_\{t\}\. LetΔt=min⁡\{⟨v,ℓt⟩−Jℓt∗:va vertex of𝒬,v∉Ft\}\\Delta\_\{t\}=\\min\\\{\\,\\left\\langle v,\\ell\_\{t\}\\right\\rangle\-J\_\{\\ell\_\{t\}\}^\{\*\}:\\ v\\text\{ a vertex of \}\\mathcal\{Q\},\\ v\\notin F\_\{t\}\\,\\\}be the smallest suboptimality among vertices outside the optimal face, the weak\-sharpness margin of the current cone\. IfΔt\>2εt\\Delta\_\{t\}\>2\\varepsilon\_\{t\}thenπ^t∈Πt∗\\widehat\{\\pi\}\_\{t\}\\in\\Pi\_\{t\}^\{\*\}and the round contributes no regret\.

###### Proof of Proposition[4](https://arxiv.org/html/2606.29092#Thmproposition4)\.

The argument passes through the uniform value bound twice, once from the model to the truth and once back, after which the cone margin decides whether the model optimum can select the wrong face\. Optimality ofπ^t\\widehat\{\\pi\}\_\{t\}formtm\_\{t\}givesJmt\(π^t\)≤Jmt\(πt∗\)J\_\{m\_\{t\}\}\(\\widehat\{\\pi\}\_\{t\}\)\\leq J\_\{m\_\{t\}\}\(\\pi\_\{t\}^\{\*\}\)for anyπt∗∈Πt∗\\pi\_\{t\}^\{\*\}\\in\\Pi\_\{t\}^\{\*\}, and two applications of the uniform boundsupπ\|Jℓt\(π\)−Jmt\(π\)\|≤εt\\sup\_\{\\pi\}\|J\_\{\\ell\_\{t\}\}\(\\pi\)\-J\_\{m\_\{t\}\}\(\\pi\)\|\\leq\\varepsilon\_\{t\}giveJℓt\(π^t\)≤Jmt\(π^t\)\+εt≤Jmt\(πt∗\)\+εt≤Jℓt\(πt∗\)\+2εtJ\_\{\\ell\_\{t\}\}\(\\widehat\{\\pi\}\_\{t\}\)\\leq J\_\{m\_\{t\}\}\(\\widehat\{\\pi\}\_\{t\}\)\+\\varepsilon\_\{t\}\\leq J\_\{m\_\{t\}\}\(\\pi\_\{t\}^\{\*\}\)\+\\varepsilon\_\{t\}\\leq J\_\{\\ell\_\{t\}\}\(\\pi\_\{t\}^\{\*\}\)\+2\\varepsilon\_\{t\}\. The occupancy ofπ^t\\widehat\{\\pi\}\_\{t\}is a vertex, so it is either inFtF\_\{t\}or suboptimal by at leastΔt\\Delta\_\{t\}\. WhenΔt\\Delta\_\{t\}exceeds2εt2\\varepsilon\_\{t\}the second case contradicts the bound just shown, soπ^t∈Πt∗\\widehat\{\\pi\}\_\{t\}\\in\\Pi\_\{t\}^\{\*\}and the round contributes no regret\. ∎

### A\.3The advantage\-thresholded update

###### Proposition 5\(Effect of the penalty\)\.

Run the committed thresholded update under full information, restricting the play toFt−1F\_\{t\-1\}through the selector of Proposition[3](https://arxiv.org/html/2606.29092#Thmproposition3), with an entropy regularizer and predictions of value errorεt\\varepsilon\_\{t\}, and letΔt\\Delta\_\{t\}be the cone margin of Proposition[4](https://arxiv.org/html/2606.29092#Thmproposition4)andB=maxq,q′∈𝒬⁡⟨q−q′,ℓt⟩B=\\max\_\{q,q^\{\\prime\}\\in\\mathcal\{Q\}\}\\left\\langle q\-q^\{\\prime\},\\ell\_\{t\}\\right\\ranglethe range of the objective over𝒬\\mathcal\{Q\}\. Then

DRegT≤r1\+∑t=2TΓtcross\+B∑t=2T𝟏\{2εt≥Δt\}\+∑t:2εt<Δt\(diam⁡\(Ft−1\)δt\+λRtψ\),\\operatorname\{DReg\}\_\{T\}\\leq r\_\{1\}\+\\sum\_\{t=2\}^\{T\}\\Gamma\_\{t\}^\{\\mathrm\{cross\}\}\+B\\sum\_\{t=2\}^\{T\}\\mathbf\{1\}\\\{2\\varepsilon\_\{t\}\\geq\\Delta\_\{t\}\\\}\+\\\!\\\!\\sum\_\{t:\\,2\\varepsilon\_\{t\}<\\Delta\_\{t\}\}\\\!\\\!\\bigl\(\\operatorname\{diam\}\(F\_\{t\-1\}\)\\,\\delta\_\{t\}\+\\lambda R\_\{t\}^\{\\psi\}\\bigr\),so on every round whose margin the prediction resolves \(2εt<Δt2\\varepsilon\_\{t\}<\\Delta\_\{t\}\) the tracker adds nothing to the intrinsic price beyond the controllable selection term\. The penalty is not required for this bound, which holds for the unpenalized selector as well\. At a fixed step sizeη\\etait nonetheless accelerates the tracker by the causal factor of Corollary[1](https://arxiv.org/html/2606.29092#Thmcorollary1)\. To see this, take a confidently suboptimal action at layerhhwith optimal advantageβ\\betaand on\-policy gap onlyg=β/\(H−h\+1\)g=\\beta/\(H\-h\+1\)\. The penalty suppresses that action at rateη\(g\+λ\)\\eta\(g\+\\lambda\), against rateηg\\eta gfor a plain mirror step\. Its contribution to regret is therefore smaller by the factorΩ\(1\+λ\(H−h\+1\)/β\)\\Omega\\bigl\(1\+\\lambda\(H\-h\+1\)/\\beta\\bigr\), and the two rates coincide only asη→∞\\eta\\to\\infty\.

We work under full information with an optimistic modelmtm\_\{t\}of value errorεt=supπ\|Jℓt\(π\)−Jmt\(π\)\|\\varepsilon\_\{t\}=\\sup\_\{\\pi\}\|J\_\{\\ell\_\{t\}\}\(\\pi\)\-J\_\{m\_\{t\}\}\(\\pi\)\|and selection errorδt=‖mt−ℓt‖∗\\delta\_\{t\}=\\left\\lVert m\_\{t\}\-\\ell\_\{t\}\\right\\rVert\_\{\*\}, an entropy regularizerψ\\psithat is11\-strongly convex on each simplex factor, and the committed selectorxt∈argminu∈Ft−1⁡⟨u,mt⟩x\_\{t\}\\in\\operatorname\*\{arg\\,min\}\_\{u\\in F\_\{t\-1\}\}\\left\\langle u,m\_\{t\}\\right\\rangle\. WriteB=maxq,q′∈𝒬⁡⟨q−q′,ℓt⟩B=\\max\_\{q,q^\{\\prime\}\\in\\mathcal\{Q\}\}\\left\\langle q\-q^\{\\prime\},\\ell\_\{t\}\\right\\ranglefor the range of the objective\.

###### Proof of the bound\.

This bound assembles the decomposition, the selection bound, and the cone detector, splitting the rounds by whether the prediction resolves the margin\. The play stays inFt−1F\_\{t\-1\}, and Theorem[1](https://arxiv.org/html/2606.29092#Thmtheorem1)givesDRegT=r1\+∑tΓtcross\+∑tεtsel\\operatorname\{DReg\}\_\{T\}=r\_\{1\}\+\\sum\_\{t\}\\Gamma\_\{t\}^\{\\mathrm\{cross\}\}\+\\sum\_\{t\}\\varepsilon\_\{t\}^\{\\rm sel\}, with Proposition[3](https://arxiv.org/html/2606.29092#Thmproposition3)boundingεtsel≤diam⁡\(Ft−1\)δt\+λRtψ\\varepsilon\_\{t\}^\{\\rm sel\}\\leq\\operatorname\{diam\}\(F\_\{t\-1\}\)\\delta\_\{t\}\+\\lambda R\_\{t\}^\{\\psi\}\. On a round with2εt<Δt2\\varepsilon\_\{t\}<\\Delta\_\{t\}, a deterministic model optimumπ^t\\widehat\{\\pi\}\_\{t\}lies inΠt∗\\Pi\_\{t\}^\{\*\}by Proposition[4](https://arxiv.org/html/2606.29092#Thmproposition4), somtm\_\{t\}andℓt\\ell\_\{t\}select the same face\. If in additionFt−1∩Ft≠∅F\_\{t\-1\}\\cap F\_\{t\}\\neq\\emptyset, the selector attainsJt∗J\_\{t\}^\{\*\}overFt−1F\_\{t\-1\}, which givesεtsel=0\\varepsilon\_\{t\}^\{\\rm sel\}=0andΓtcross=0\\Gamma\_\{t\}^\{\\mathrm\{cross\}\}=0, a free round\. On a round with2εt≥Δt2\\varepsilon\_\{t\}\\geq\\Delta\_\{t\}we fall back onrt≤Br\_\{t\}\\leq B\. Collecting the three contributions yields the stated inequality\. ∎

###### Proof of the suppression rate\.

Comparing the penalized and plain exponential\-weights updates\(Cesa\-Bianchi and Lugosi,[2006](https://arxiv.org/html/2606.29092#bib.bib7)\)on one suboptimal action comes down to tracking the odds each keeps on it\. Fix a state\-step\(s,h\)\(s,h\)with optimal actiona∗a^\{\*\}and a suboptimal actiona−a^\{\-\}of optimal advantageA∗\(a−\)=βA^\{\*\}\(a^\{\-\}\)=\\beta\. Its on\-policy one\-step gap is theng=Qπt\(s,a−\)−Qπt\(s,a∗\)=β/\(H−h\+1\)g=Q^\{\\pi\_\{t\}\}\(s,a^\{\-\}\)\-Q^\{\\pi\_\{t\}\}\(s,a^\{\*\}\)=\\beta/\(H\-h\+1\), the causal weighting of Section[5](https://arxiv.org/html/2606.29092#S5)\. The entropy update setsπt\+1\(a∣s\)∝πt\(a∣s\)e−ηc\(a\)\\pi\_\{t\+1\}\(a\\mid s\)\\propto\\pi\_\{t\}\(a\\mid s\)e^\{\-\\eta c\(a\)\}, with costc=Qπtc=Q^\{\\pi\_\{t\}\}for the plain step andc=Qπt\+λ𝟏\{a∉𝒜t,hnear\(s\)\}c=Q^\{\\pi\_\{t\}\}\+\\lambda\\mathbf\{1\}\\\{a\\notin\\mathcal\{A\}\_\{t,h\}^\{\\rm near\}\(s\)\\\}for the penalized step\. Taking the ratio of the masses ona−a^\{\-\}anda∗a^\{\*\}cancels both the partition function and the optimal action’s zero penalty, so the oddsoto\_\{t\}ona−a^\{\-\}satisfyot\+1md=ote−ηgo^\{\\rm md\}\_\{t\+1\}=o\_\{t\}e^\{\-\\eta g\}for the plain step andot\+1pen=ote−η\(g\+λ\)o^\{\\rm pen\}\_\{t\+1\}=o\_\{t\}e^\{\-\\eta\(g\+\\lambda\)\}for the penalized step\. With the optimal face held fixed, the per\-round regret from\(s,h\)\(s,h\)isβot/\(1\+ot\)∈\[12βmin⁡\(ot,1\),βmin⁡\(ot,1\)\]\\beta\\,o\_\{t\}/\(1\+o\_\{t\}\)\\in\[\\tfrac\{1\}\{2\}\\beta\\min\(o\_\{t\},1\),\\beta\\min\(o\_\{t\},1\)\]\. Summing the geometric decays then gives totals proportional to1/\(1−e−ηg\)1/\(1\-e^\{\-\\eta g\}\)and1/\(1−e−η\(g\+λ\)\)1/\(1\-e^\{\-\\eta\(g\+\\lambda\)\}\), whose ratio isΩ\(\(g\+λ\)/g\)=Ω\(1\+λ\(H−h\+1\)/β\)\\Omega\(\(g\+\\lambda\)/g\)=\\Omega\(1\+\\lambda\(H\-h\+1\)/\\beta\)wheneverηg≲1\\eta g\\lesssim 1\. Asη→∞\\eta\\to\\inftyboth updates collapse to the hardargmin\\operatorname\*\{arg\\,min\}, so the penalty alters only the finite\-step transient and leaves the optimum unchanged\. ∎

![Refer to caption](https://arxiv.org/html/2606.29092v1/x6.png)Figure 4:The advantage threshold of Section[6](https://arxiv.org/html/2606.29092#S6)\. Actions whose estimated advantage exceeds their confidence band lie outside the optimal face and lose mass, while actions whose confidence band crosses zero lie near a wall of the fan and retain mass pending further evidence\.

## Appendix BSeparations and lower bounds

Proposition[1](https://arxiv.org/html/2606.29092#Thmproposition1)shows that loss variation can be arbitrarily large while the intrinsic price is zero\. Whether a given crossing is free or priced is decided entirely by the relative position of the old and new optimal faces, as Figure[5](https://arxiv.org/html/2606.29092#A2.F5)illustrates\. The complementary phenomenon is that an unpredictable crossing forces a fixed positive price on every learner, whatever its strategy\. This invariance is the precise sense in which the price is intrinsic\.

![Refer to caption](https://arxiv.org/html/2606.29092v1/x7.png)Figure 5:Whether a crossing is priced depends on the relative position of the old and new optimal faces\. When the faces are disjoint \(left\) every old optimal occupancy is suboptimal under the new loss and the price is positive\. When they intersect \(right\) some old occupancy remains optimal and the price is zero, regardless of how far the loss has moved\.###### Proposition 6\(Unpredictable priced crossings are unavoidable\)\.

Consider a one\-state, one\-step MDP with two actions\. At each round independently letℓt=\(0,γ\)\\ell\_\{t\}=\(0,\\gamma\)or\(γ,0\)\(\\gamma,0\)with equal probability\. For any learner, even one that observesℓt\\ell\_\{t\}in full after acting,

𝔼\[DRegT\]≥γT2,\\mathbb\{E\}\[\\operatorname\{DReg\}\_\{T\}\]\\geq\\frac\{\\gamma T\}\{2\},and the previous\-face tracker attains this rate up to the first\-round convention\.

###### Proof\.

Before observingℓt\\ell\_\{t\}the learner commits to a distributionptp\_\{t\}over the two actions\. Conditional on the past the optimal action is uniform and independent ofptp\_\{t\}, so the expected loss of any choice isγ/2\\gamma/2while the optimal loss is0\. Summing the per\-round gap ofγ/2\\gamma/2overTTrounds gives the bound\. A learner that repeats the previous round’s optimal action errs precisely when the sign flips, which happens with probability one half each round, matching the rate\. ∎

The degenerate case adds a second irreducible term\. Two loss sequences can present the same intrinsic price and yet impose different regret on a fixed within\-face selector\. The selection error of Theorem[1](https://arxiv.org/html/2606.29092#Thmtheorem1)therefore cannot be absorbed into the price\.

###### Proposition 7\(Degenerate faces require selection\)\.

There exist two sequences of optimal faces with identicalΓtcross\\Gamma\_\{t\}^\{\\mathrm\{cross\}\}but different regret for a fixed selectorxt∈Ft−1x\_\{t\}\\in F\_\{t\-1\}\. HenceΓtcross\\Gamma\_\{t\}^\{\\mathrm\{cross\}\}alone cannot characterize algorithmic regret under degeneracy\.

###### Proof sketch\.

Use a layered binary\-tree MDP and letFt−1F\_\{t\-1\}be the set of root\-to\-leaf paths sharing a fixed prefix, so the suffix is free\. Construct two new facesFtF\_\{t\}andFt′F\_\{t\}^\{\\prime\}that share the same best point insideFt−1F\_\{t\-1\}, hence the sameΓtcross\\Gamma\_\{t\}^\{\\mathrm\{cross\}\}, but require opposite suffixes\. The oracle point ofFt−1F\_\{t\-1\}pays the same price in both, while a fixed selector that committed to one suffix is cheap against the compatible face and expensive against the other\. The selection error is therefore an independent degree of freedom\. ∎

### B\.1Construction for Theorem[2](https://arxiv.org/html/2606.29092#Thmtheorem2)

A crossing forced at layerhhflips the suffix of lengthH−h\+1H\-h\+1and changes the loss on exactly that many layers\. Within a single layered tree the price and the variation therefore move together\. A single\-coordinate change fares no better in this regime, where a crossing of depthDDmust first clear an optimality margin of orderDDbefore the displaced face pays an amount of the same order\. The separation therefore turns on the one asymmetry between the two quantities — loss variation depends on the loss sequence alone and stays invariant to the MDP, whereas the face\-crossing price reads the normal fan and so depends on the transitions\. We hold the loss change fixed across two instances and let the geometry alone decide whether it reroutes a single step or the whole horizon\. Both have horizonHH, binary actions, and deterministic transitions from one start state, so a deterministic policy is a root\-to\-leaf action string\. Throughout, fixγ\>0\\gamma\>0with a smallκ∈\(0,γ\)\\kappa\\in\(0,\\gamma\), and setR=γ\(H−1\)R=\\gamma\(H\-1\)\.

##### The late instance\.

A separable chain places one state at each ofHHlayers\. At every non\-final layer the action changes neither the cost nor the next state\. A fixed preference ofκ\\kappafor action0enters there in both rounds, which keeps the round\-11optimum a single vertex with no degenerate face\. The last layer costs\(0,R−γ\)\(\\,0,\\ R\-\\gamma\\,\)in round11and\(0,−γ\)\(\\,0,\\ \-\\gamma\\,\)in round22, whereR−γ=γ\(H−2\)≥0R\-\\gamma=\\gamma\(H\-2\)\\geq 0\. Action0is then uniquely optimal at every layer in round11, soF1F\_\{1\}is the single path taking action0throughout\. Action11becomes optimal at the last layer in round22, yet the old path still takes action0there and pays exactlyγ\\gamma\. The only coordinate that changes is the last layer’s action\-11entry, which moves fromR−γR\-\\gammato−γ\-\\gamma, a change of magnitudeRR\.

##### The early instance\.

We can replace the chain with a branching tree\. From the root, action11leads to a toll node at layer11, while action0commits to an expensive line that costsγ\\gammaper layer for the rest of the horizon\. At the toll node, action11pays a toll and unlocks a cheap line that costs0thereafter, and a fixed bonusκ\\kappaon the committing root action again makes the round\-11optimum a unique vertex\. Between rounds only one coordinate changes, the toll, which stands atR=γ\(H−1\)R=\\gamma\(H\-1\)in round11and drops to0in round22\. The round\-11toll equals the whole\-horizon cost of the expensive line, so the committed expensive path is optimal\. Once the toll vanishes in round22, the new optimum routes through the toll node, and the old committed path now paysγ\\gammaon each of theH−1H\-1expensive layers\. The round\-to\-round change again touches a single coordinate, by the same magnitudeRR\.

###### Proof of Theorem[2](https://arxiv.org/html/2606.29092#Thmtheorem2)\.

In both instances the round\-to\-round loss change touches a single coordinate of magnitudeRR\. A one\-hot vector has the sameLpL\_\{p\}norm for everyp∈\[1,∞\]p\\in\[1,\\infty\], soV2ℓ=RV\_\{2\}^\{\\ell\}=Runder every dual norm\. The early instance leaves its committed path at round\-22valueγ\(H−1\)−κ\\gamma\(H\-1\)\-\\kappaagainst a new optimum of0, givingΓ2cross=γ\(H−1\)−κ\\Gamma\_\{2\}^\{\\mathrm\{cross\}\}=\\gamma\(H\-1\)\-\\kappa\. The late instance reverses the figures, its old path resting at0against a new optimum of−γ\-\\gamma, givingΓ2cross=γ\\Gamma\_\{2\}^\{\\mathrm\{cross\}\}=\\gamma\. Their ratio tends toH−1H\-1asκ→0\\kappa\\to 0\. Both old faces are vertices, and Theorem[1](https://arxiv.org/html/2606.29092#Thmtheorem1)equates the dynamic regret with the price on each instance\. Any validBBmust therefore satisfyB\(R\)≥γ\(H−1\)−κB\(R\)\\geq\\gamma\(H\-1\)\-\\kappato clear the early instance\. That sameB\(R\)B\(R\)then overshoots the late regretγ\\gammaby the factorH−1−κ/γH\-1\-\\kappa/\\gamma\. ∎

![Refer to caption](https://arxiv.org/html/2606.29092v1/x8.png)Figure 6:Why two MDPs are needed\. Inside a single fixed MDP, a one\-coordinate loss change that forces a crossing holds the price\-to\-trigger ratio near two at every horizon, whereas the two\-MDP construction of Theorem[2](https://arxiv.org/html/2606.29092#Thmtheorem2)reachesH−1H\-1\. Both curves are computed by the Bellman backup of Theorem[3](https://arxiv.org/html/2606.29092#Thmtheorem3)\.

## Appendix CExperimental details

All experiments use fixed transitions and adversarial losses\. We leave the moving\-transition case aside\. There the polytope𝒬\\mathcal\{Q\}deforms between rounds, so a face of one round need not remain feasible in the next and the fixed\-fan decomposition no longer applies\. Every figure averages3232seeds, and error bars or shaded bands show standard errors for the displayed means\. The simplex, layered, and degenerate benchmarks compute the geometric quantities in closed form for comparison with the measured regret\.

##### Configuration\.

Every benchmark uses3232seeds, indices0through3131, and reports the best setting over the step\-size and method\-specific sweeps in Table[1](https://arxiv.org/html/2606.29092#A3.T1)\. The per\-step loss unit in the priced quantities isγ\\gamma\.

Table 1:Hyperparameters\. Step sizesη\\etaand the method\-specific parameters are swept over the ranges shown, and every benchmark uses3232seeds\.
##### Simplex normal fan\.

With a single state andK∈\{16,32,64,128\}K\\in\\\{16,32,64,128\\\}actions,𝒬\\mathcal\{Q\}collapses to the simplex, its fan the arrangement of the normal cones\. The loss runs through four regimes — within\-cone drift, small wall\-crossing drift, random high\-margin crossings, and a rotating optimum\. Against it we compare Hedge\(Cesa\-Bianchi and Lugosi,[2006](https://arxiv.org/html/2606.29092#bib.bib7)\), Euclidean mirror descent\(Beck and Teboulle,[2003](https://arxiv.org/html/2606.29092#bib.bib55)\), optimistic Hedge\(Rakhlin and Sridharan,[2014](https://arxiv.org/html/2606.29092#bib.bib37)\), and the previous\-face tracker\. Figure[7](https://arxiv.org/html/2606.29092#A3.F7)contrasts loss variation with the face price by regime, then plots the previous\-face tracker’s regret againstΓ2:Tcross\\Gamma\_\{2:T\}^\{\\mathrm\{cross\}\}\. The within\-cone regime carries large variation at zero price, so any variation\-based bound predicts regret where there is none, and across regimes the tracker’s own regret correlates at1\.001\.00withΓ2:Tcross\\Gamma\_\{2:T\}^\{\\mathrm\{cross\}\}and only at0\.890\.89withVTℓV\_\{T\}^\{\\ell\}\.

![Refer to caption](https://arxiv.org/html/2606.29092v1/x9.png)Figure 7:Simplex normal fan\. Left, mean loss variation and intrinsic face price by regime, with uncertainty over trials\. Right, previous\-face tracker regret againstΓ2:Tcross\\Gamma\_\{2:T\}^\{\\mathrm\{cross\}\}, where the identity line gives the exact priced\-face\-motion prediction\. Error bars are standard errors over3232seeds and2424trials per action\-size/regime\.
##### Causal anisotropy in a layered tree\.

A deterministic binary tree of horizonH∈\{8,12,16,20\}H\\in\\\{8,12,16,20\\\}, in which each policy is a root\-to\-leaf path and one path is optimal\. A crossing is forced at a chosen layer and we recordΓtcross/γ\\Gamma\_\{t\}^\{\\mathrm\{cross\}\}/\\gamma\. The identity itself is in the main text\. Figure[8](https://arxiv.org/html/2606.29092#A3.F8)compares loss variation, raw path length, causal path length, andΓ2:Tcross\\Gamma\_\{2:T\}^\{\\mathrm\{cross\}\}as predictors of mirror\-descent regret across the sequences\. In this structured family all four explain nearly all of the regret variation, and the experiment that separatesΓ2:Tcross\\Gamma\_\{2:T\}^\{\\mathrm\{cross\}\}from loss variation is the simplex above\.

![Refer to caption](https://arxiv.org/html/2606.29092v1/x10.png)Figure 8:Layered tree\. Predictive power of four candidate path\-motion summaries for dynamic regret, using3232seeds and9696sequences per horizon\. In this family the predictors move together, which is why the decisive separation between price and variation is made on the simplex rather than here\.
##### Degenerate optimal faces\.

All paths sharing a prefix make up the optimal set in this benchmark’s layered tree, and those prefixes coarsen, refine, and conflict across50005000rounds as four within\-face selectors compete\. An*oracle*observesℓt\\ell\_\{t\}and plays the loss\-minimizing pointargminu∈Ft−1⁡⟨u,ℓt⟩\\operatorname\*\{arg\\,min\}\_\{u\\in F\_\{t\-1\}\}\\left\\langle u,\\ell\_\{t\}\\right\\rangle, which achieves zero selection error by construction\. The*sticky*Bregman selector instead pulls toward its previous representativeztz\_\{t\}by minimizing⟨u,mt⟩\+λDψ\(u,zt\)\\left\\langle u,m\_\{t\}\\right\\rangle\+\\lambda D\_\{\\psi\}\(u,z\_\{t\}\), as in Proposition[3](https://arxiv.org/html/2606.29092#Thmproposition3)\.*Lexicographic*selection applies a fixed, loss\-independent tie\-break over paths, while*random*selection draws a uniform point ofFt−1F\_\{t\-1\}\. The main text reports the regret stack, and Figure[9](https://arxiv.org/html/2606.29092#A3.F9)shows the selection error growing with the dimension of the free face, the component that the sticky projection controls\.

![Refer to caption](https://arxiv.org/html/2606.29092v1/x11.png)Figure 9:Degenerate faces\. Within\-face selection error against the dimension of the free face, over3232seeds and50005000\-round sequences\. A larger free face leaves more room for a poor representative, and the sticky selector keeps the error below the lexicographic and random ones, with bands showing standard errors within each dimension\.
##### Advantage\-thresholded gridworld control\.

Following the standard tabular gridworld convention\(Sutton and Barto,[2018](https://arxiv.org/html/2606.29092#bib.bib59)\), our custom non\-stationary gridworlds of side1515and horizon3030supply four scenarios, a moving goal, a rotating field, changing obstacles, and a boundary crossing\. Against them we run mirror descent\(Nemirovskiĭ and IUdin,[1983](https://arxiv.org/html/2606.29092#bib.bib32); Beck and Teboulle,[2003](https://arxiv.org/html/2606.29092#bib.bib55)\), a trust\-region update\(Schulmanet al\.,[2017](https://arxiv.org/html/2606.29092#bib.bib41)\), optimistic mirror descent\(Rakhlin and Sridharan,[2014](https://arxiv.org/html/2606.29092#bib.bib37)\), and the advantage\-thresholded update\. For each we sweep step sizes, the advantage threshold, and the penalty weight, then report its best setting\. Figure[10](https://arxiv.org/html/2606.29092#A3.F10)reports the best cumulative regret of every method, and the thresholded update comes in lowest in each scenario\. Figure[11](https://arxiv.org/html/2606.29092#A3.F11)then tracks per\-episode regret together with the occupancy mass near the fan walls over training, where the thresholded update holds more mass near those walls in the scenarios whose optimum moves sharply\.

![Refer to caption](https://arxiv.org/html/2606.29092v1/x12.png)Figure 10:Best cumulative dynamic regret of four updates on four non\-stationary gridworld scenarios, shown on a log scale\. Bars are means over3232seeds with standard\-error bars\. The thresholded update is lowest in every scenario\.![Refer to caption](https://arxiv.org/html/2606.29092v1/x13.png)Figure 11:Gridworld over training, the thresholded update \(teal\) against mirror descent \(gray\), with curves binned over episodes and3232seeds\. Rows, top to bottom, are per\-episode dynamic regret and occupancy mass near the fan walls, across the four scenarios\.
Priced Motion Through Optimal Faces: A Normal-Fan Geometry for Non-Stationary Adversarial MDPs

Similar Articles

Online Learning on Hidden-Convex Losses via Algorithmic Equivalence: Optimal Regret, Geometric Barrier, and Bandit Feedback

NormGuard: Reward-Preserving Norm Constraints in Flow-Matching Reinforcement Learning

Lagrangian Flow Matching: A Least-Action Framework for Principled Path Design

DP-MacAdam: Differentially Private Mechanism with Adaptive Clipping and Adaptive Momentum

Active Timepoint Selection for Learning Measure-Valued Trajectories

Submit Feedback

Similar Articles

Online Learning on Hidden-Convex Losses via Algorithmic Equivalence: Optimal Regret, Geometric Barrier, and Bandit Feedback
NormGuard: Reward-Preserving Norm Constraints in Flow-Matching Reinforcement Learning
Lagrangian Flow Matching: A Least-Action Framework for Principled Path Design
DP-MacAdam: Differentially Private Mechanism with Adaptive Clipping and Adaptive Momentum
Active Timepoint Selection for Learning Measure-Valued Trajectories