Regime-Arrival Uncertainty in Generalization Bounds under Distribution Shift

arXiv cs.LG 06/03/26, 04:00 AM Papers
Summary
This paper introduces a theoretical framework for quantifying deployment risk when training and deployment distributions differ due to latent regime dynamics modeled as a Markov-switching process, providing exact decomposition and finite-sample bounds.
arXiv:2606.02657v1 Announce Type: new
# Regime-Arrival Uncertainty in Generalization Bounds under Distribution Shift
Source: [https://arxiv.org/html/2606.02657](https://arxiv.org/html/2606.02657)
###### Abstract

Standard generalization bounds assume that the training and deployment distributions are identical, or at least static, and so do not cover regime-switching environments in which the proportion of calm and crisis states differs between training and deployment. This paper proposes a framework that generalizes regime-aware models by quantifying the extra risk due to regime-composition mismatch when the distribution shift is Markov-switching. We obtain an exact decomposition that separates regime mismatch from regime sensitivity; we extend the bound to beta-mixing data using an effective sample size corrected for the spectral gap; and we prove a minimax lower bound. We evaluate the framework on synthetic data and on 25 years of global equity indices. The proposed penalty, computed ex post, tracks the realized generalization gap, whereas the training-only estimator shows no significant correlation: the feature geometry of crises can be detected, but not their temporal arrival. The framework is therefore not a forecasting machine. Forecasting future regime composition remains an open question, particularly when regime changes are rare.

## 1 Introduction

A predictive model that performs well during the training period is often assumed to perform equally well after deployment. In practice, this assumption is frequently violated because the environment generating the data changes over time. Statistical learning theory traditionally studies generalization under the assumption that training and deployment data are generated from the same distribution (Vapnik, [1998](https://arxiv.org/html/2606.02657#bib.bib3); Bousquet et al., [2004](https://arxiv.org/html/2606.02657#bib.bib4)). This assumption is mathematically convenient and works reasonably well in many classical settings. It becomes much harder to justify when the underlying system itself evolves through time.

Many real-world applications operate in environments where structural change is unavoidable. Clinical risk models face changing patient populations because of disease cycles, evolving treatment practices, and changing hospital conditions. Intrusion detection systems operate in environments that alternate between normal behavior and active attacks. Autonomous systems must function across changing weather conditions, traffic patterns, and lighting environments. In these situations, the difference between training performance and deployment performance is often not simply the result of overfitting. Instead, it arises because the model is deployed under conditions that differ systematically from those observed during training (Kifer et al., [2004](https://arxiv.org/html/2606.02657#bib.bib11); Quiñonero-Candela et al., [2008](https://arxiv.org/html/2606.02657#bib.bib5)).

This paper develops a theoretical framework for studying this problem through latent regime dynamics. We model the environment as a two-state Markov process consisting of a calm regime and a crisis regime. The training distribution is represented as a mixture of regime-conditional distributions with composition parameter $\pi$, while the one-step-ahead deployment distribution depends on the transition probability $p_{01}$ of entering the crisis regime. Whenever $\pi \neq p_{01}$, the composition of training and deployment environments differs. We show that this mismatch directly increases future deployment risk, with the magnitude determined by both the severity of the mismatch and the distinguishability of regimes under the hypothesis class.

The analysis develops several theoretical results that together characterize how regime mismatch affects future deployment risk. We derive an exact decomposition connecting future risk directly to differences in regime composition (Lemma [4.1](https://arxiv.org/html/2606.02657#S4.Thmtheorem1)), establish a finite-sample high-probability upper bound on deployment risk (Theorem [4.13](https://arxiv.org/html/2606.02657#S4.Thmtheorem13)), and construct a matching minimax lower bound demonstrating that the mismatch penalty represents a fundamental limitation rather than an artifact of analysis (Theorem [4.15](https://arxiv.org/html/2606.02657#S4.Thmtheorem15)). Our approach combines ideas from domain adaptation, dependent learning theory, and regime-switching models (Ben-David et al., [2010](https://arxiv.org/html/2606.02657#bib.bib1); Yu, [1994](https://arxiv.org/html/2606.02657#bib.bib7); Hamilton, [1989](https://arxiv.org/html/2606.02657#bib.bib6)).

The framework also introduces several features that distinguish it from existing approaches to distribution shift. Regime discrepancy is quantified using the $\mathcal{H}\Delta\mathcal{H}$-divergence (Ben-David et al., [2010](https://arxiv.org/html/2606.02657#bib.bib1)), which provides a tighter characterization than total variation distance and can be estimated from finite unlabeled samples through domain classification. The analysis is further developed under geometric $\beta$-mixing dependence, with explicit mixing coefficients derived from the transition structure using the blocking methodology of Yu ([1994](https://arxiv.org/html/2606.02657#bib.bib7)). This naturally introduces an effective sample size $n_{\mathrm{eff}}$, which decreases as regime persistence increases. We also establish an irreducibility result showing that any valid certificate for future deployment risk must contain a regime mismatch penalty as an additive component, independent of the particular learning algorithm employed.

Empirical validation on synthetic data confirms the theoretical structure. On real equity index data, we show that the penalty computed using the realized future crisis fraction tracks actual train-to-deployment gaps with Spearman $\rho = 0.729$. However, further analysis reveals that estimating the penalty before deployment is not reliable with standard training window lengths because it would require forecasting future regime composition. The framework thus provides a diagnostic tool for understanding deployment failures rather than a deployable forecasting system, and highlights the need for better forecasts of future regime composition as an open problem for future work.

## 2 Related Work

### 2.1 Domain adaptation and generalization under distribution shift

A central question in statistical learning theory has long been whether a model trained on one distribution will be effective on another. The influential work of Ben-David et al. ([2010](https://arxiv.org/html/2606.02657#bib.bib1)) demonstrated that training performance and model complexity are not the only factors that influence target-domain risk; a distributional discrepancy term, quantified by the $\mathcal{H}\Delta\mathcal{H}$-divergence, also plays a role. A key benefit of this divergence is that it can be estimated from a finite number of unlabeled samples, whereas total variation distance is hard to estimate directly in practice. Because $\tfrac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}$ is upper bounded by total variation (Lemma [4.2](https://arxiv.org/html/2606.02657#S4.Thmtheorem2)), using it instead of total variation yields a characterization of distribution shift that is at least as tight and more practical. Current theory is largely based on static source and target distributions and independent sampling. Our setting differs: the deployment distribution changes dynamically via latent regime transitions.

The work by Mansour et al. ([2009](https://arxiv.org/html/2606.02657#bib.bib2)) extended domain adaptation to multiple source distributions and emphasized the role of the mixture weights in controlling adaptation performance. The same mixture structure occurs naturally in our framework, because the training observations follow $P_{\mathrm{mix}} = (1-\pi)P_0 + \pi P_1$. The key difference is that the mixing proportion is not chosen by the learner; it is determined by the historical regime dynamics and can differ from the future deployment composition.

### 2.2 Generalization under dependent data and mixing processes

Most classical generalization theory assumes that observations are independent. Under serial dependence this assumption fails, and the effective sample size is smaller than the nominal sample size. The extension of learning theory to dependent sequences was initiated by Yu ([1994](https://arxiv.org/html/2606.02657#bib.bib7)), who developed a blocking framework to obtain uniform convergence results for $\beta$-mixing sequences. The idea is to divide the observations into nearly independent blocks and use the number of effective blocks as the sample size. We adopt this construction and derive explicit effective sample sizes as functions of the underlying Markov transition structure (Theorem [4.7](https://arxiv.org/html/2606.02657#S4.Thmtheorem7)).

Mixing behavior for Markov chains has been studied extensively, including the work of Davydov ([1973](https://arxiv.org/html/2606.02657#bib.bib8)), who established exponential decay rates under standard ergodicity conditions. Under our assumptions, this produces mixing coefficients of the form $\beta(k) \leq C_{\mu}|\lambda_2|^{k}$, where $\lambda_2$ denotes the non-unit eigenvalue of the transition matrix. We further build upon the learning-theoretic treatment of mixing processes developed by Mohri et al. ([2018](https://arxiv.org/html/2606.02657#bib.bib10)), while explicitly retaining all mixing constants in the final bound so that the resulting expressions remain computable.

### 2.3 Change detection and learning under nonstationarity

Learning under changing environments has been studied from several angles, including change detection, concept drift, and nonstationary learning. The $\mathcal{H}$-divergence was introduced in early work by Kifer et al. ([2004](https://arxiv.org/html/2606.02657#bib.bib11)) for detecting distributional changes in finite samples. This idea was later generalized to the symmetric comparison setting needed in domain adaptation by the $\mathcal{H}\Delta\mathcal{H}$-divergence.

Our analysis builds on these ideas but focuses on deployment guarantees rather than change detection. The estimation procedure behind Theorem [4.11](https://arxiv.org/html/2606.02657#S4.Thmtheorem11) generalizes the divergence-estimation arguments for independent observations to $\beta$-mixing sequences, following the same effective sample size framework used in the rest of the paper.

There is a wider literature on learning with concept drift and changing distributions, most of which addresses adversarial or unstructured nonstationarity. We take a different approach and consider a structured regime-switching environment following the model of Hamilton ([1989](https://arxiv.org/html/2606.02657#bib.bib6)). This structure is crucial because it makes the transition probabilities explicit in the theory and yields closed-form expressions for regime mismatch.

### 2.4 Minimax lower bounds and irreducibility

Lower bounds play an important role in understanding whether a theoretical penalty reflects a true limitation or merely a weakness of analysis. Building upon classical minimax theory developed through the work of Le Cam ([1986](https://arxiv.org/html/2606.02657#bib.bib12)) and later formalized through two-point and Fano-style arguments (Yu, [1997](https://arxiv.org/html/2606.02657#bib.bib13)), we construct a binary-world argument showing that regime mismatch produces unavoidable deployment costs.

The construction creates two environments that generate identical training distributions under pure calm-state training but lead to different future distributions. This forces any learner to incur a minimum excess risk proportional to $p_{01}\cdot\tfrac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(P_1,P_0)$, independent of sample size. As observations from both regimes become available, this limitation gradually weakens at the rate $\Theta(1/\sqrt{n_{\mathrm{eff}}})$, reflecting the additional information gained from observing crisis-state samples.

### 2.5 Regime-switching models and financial machine learning

Regime-switching models introduced by Hamilton ([1989](https://arxiv.org/html/2606.02657#bib.bib6)) have become widely used for describing environments that alternate between qualitatively different states. In financial applications, regime changes have repeatedly been shown to influence predictive stability, volatility structure, and downstream model performance, motivating increasing interest in regime-aware learning methods (Zaremba and Cakici, [2024](https://arxiv.org/html/2606.02657#bib.bib17); Staehr et al., [2024](https://arxiv.org/html/2606.02657#bib.bib18)).

Recent work on regime-aware financial machine learning and regime-shifting dynamic factor models further suggests that predictive systems can benefit from explicitly modeling latent market states (Suárez Cetrulo et al., [2024](https://arxiv.org/html/2606.02657#bib.bib14); Shu et al., [2024](https://arxiv.org/html/2606.02657#bib.bib15); Xiang et al., [2024](https://arxiv.org/html/2606.02657#bib.bib16)). Financial markets provide a natural environment for empirical validation because regime transitions are observable and extensively documented. The theoretical framework itself is not restricted to finance. The underlying problem appears whenever deployment conditions evolve and the future distribution differs systematically from the distribution observed during training.

## 3 Problem Setup and Notation

We study supervised learning when the data-generating distribution is governed by a latent two-state regime process. The regime evolves as a Markov chain, and the regime composition of the training data may differ from that of the deployment (future) period.

###### Definition 3.1 (Regime process).

Let $\{Z_t\}_{t\geq 1}$ be a two-state Markov chain on $\{0,1\}$, where $Z_t=0$ denotes the *calm* regime and $Z_t=1$ the *crisis* regime. The chain is characterized by its transition matrix $P=\bigl(\begin{smallmatrix}p_{00}&p_{01}\\ p_{10}&p_{11}\end{smallmatrix}\bigr)$, $p_{ij}=\mathbb{P}(Z_{t+1}=j\mid Z_t=i)$, with rows summing to one, so that $p_{00}=1-p_{01}$ and $p_{11}=1-p_{10}$.
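As an illustration of Definition 3.1, the following minimal sketch simulates such a chain and compares the empirical crisis fraction with the stationary value $p_{01}/(p_{01}+p_{10})$. The function name `simulate_regimes`, the seed, and the transition probabilities are illustrative assumptions, not values from the paper.

```python
import random

def simulate_regimes(p01, p10, n, seed=0):
    """Simulate a two-state Markov chain: 0 = calm, 1 = crisis."""
    rng = random.Random(seed)
    mu1 = p01 / (p01 + p10)              # stationary crisis probability
    z = 1 if rng.random() < mu1 else 0   # draw the initial state from the stationary law
    path = []
    for _ in range(n):
        path.append(z)
        switch_prob = p01 if z == 0 else p10   # probability of leaving the current state
        if rng.random() < switch_prob:
            z = 1 - z
    return path

# Hypothetical parameters: rare entry into crisis, faster exit.
path = simulate_regimes(p01=0.05, p10=0.20, n=100_000)
crisis_fraction = sum(path) / len(path)
print(crisis_fraction)  # should be close to 0.05 / 0.25 = 0.2
```

With persistent regimes the empirical fraction converges more slowly than for independent draws, which is the effect the effective sample size in Section 4.2 quantifies.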

###### Definition 3.2 (Distributions and risks).

Let $X$ be the feature space and $Y$ the label space. Write $P_0$ (resp. $P_1$) for the law of $(x,y)$ in the calm (resp. crisis) regime. For a predictor $f\in\mathcal{F}$ and a bounded loss $\ell: Y\times Y\to[0,1]$ (e.g. the $0/1$ loss $\ell(f(x),y)=\mathbf{1}[f(x)\neq y]$), the risk under a distribution $Q$ is $R_Q(f)=\mathbb{E}_{(x,y)\sim Q}[\ell(f(x),y)]$. We write $R_0(f), R_1(f)$ for the calm and crisis risks.

###### Definition 3.3 (Training and future distributions).

Let $\pi\in[0,1]$ be the crisis fraction of the training distribution $P_{\mathrm{mix}}=(1-\pi)P_0+\pi P_1$, so that $R_{\mathrm{mix}}(f)=(1-\pi)R_0(f)+\pi R_1(f)$. Conditioning on the current regime being calm, the one-step-ahead (future) distribution is $P_{\mathrm{future}}=p_{00}P_0+p_{01}P_1$, with $R_{\mathrm{future}}(f)=p_{00}R_0(f)+p_{01}R_1(f)$.

###### Definition 3.4 ($\mathcal{H}\Delta\mathcal{H}$-divergence (Ben-David et al., [2010](https://arxiv.org/html/2606.02657#bib.bib1))).

For a hypothesis class $\mathcal{H}$ of binary classifiers $h: X\to\{0,1\}$ and feature marginals $Q, R$,

$$d_{\mathcal{H}\Delta\mathcal{H}}(Q,R) := 2\sup_{h,h'\in\mathcal{H}}\bigl|\mathbb{P}_{x\sim Q}[h(x)\neq h'(x)]-\mathbb{P}_{x\sim R}[h(x)\neq h'(x)]\bigr|.$$

Equivalently, with the symmetric-difference class $\mathcal{H}\Delta\mathcal{H}=\{x\mapsto h(x)\oplus h'(x)\}$ and $I_g=\{x: g(x)=1\}$, $d_{\mathcal{H}\Delta\mathcal{H}}(Q,R)=2\sup_{g\in\mathcal{H}\Delta\mathcal{H}}|Q(I_g)-R(I_g)|$. It is a total-variation distance restricted to the events the class can realize; unlike $d_{\mathrm{TV}}$, it is estimable from finite unlabeled samples (Section [4.3](https://arxiv.org/html/2606.02657#S4.SS3)).
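To make the definition concrete, here is a small sketch under a simplifying assumption that is not from the paper: $\mathcal{H}$ is the class of one-dimensional thresholds $h_t(x)=\mathbf{1}[x\geq t]$ and $Q$, $R$ are probability mass functions on a common sorted finite support. In that case the symmetric differences are intervals, so $\tfrac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}$ is the largest absolute mass difference over contiguous runs of support points. The function names and the example distributions are hypothetical.

```python
def half_hdh_thresholds(q, r):
    """(1/2) d_HdH for threshold classifiers: sup over contiguous runs of |Q(I) - R(I)|."""
    diffs = [qi - ri for qi, ri in zip(q, r)]
    best = 0.0
    for start in range(len(diffs)):
        running = 0.0
        for end in range(start, len(diffs)):
            running += diffs[end]          # mass difference on support[start..end]
            best = max(best, abs(running))
    return best

def total_variation(q, r):
    """Total-variation distance between two pmfs on the same support."""
    return 0.5 * sum(abs(qi - ri) for qi, ri in zip(q, r))

# Two pmfs on a four-point support whose differences alternate in sign.
q = [0.4, 0.1, 0.4, 0.1]
r = [0.1, 0.4, 0.1, 0.4]
half_div = half_hdh_thresholds(q, r)
tv = total_variation(q, r)
print(half_div, tv)
```

In this example the interval class cannot pick out the two non-adjacent points where $Q$ exceeds $R$, so the class-restricted quantity is smaller than the total-variation distance.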

###### Definition 3.5 (Empirical risk and Rademacher complexity).

Given $S=\{(x_i,y_i)\}_{i=1}^{n}$, the empirical risk is $\widehat{R}_S(f)=\frac{1}{n}\sum_i\ell(f(x_i),y_i)$, and the empirical Rademacher complexity of the loss class $\mathcal{L}_{\mathcal{F}}=\{(x,y)\mapsto\ell(f(x),y): f\in\mathcal{F}\}$ is $\widehat{R}_S(\mathcal{L}_{\mathcal{F}})=\mathbb{E}_{\sigma}\bigl[\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_i\sigma_i\ell(f(x_i),y_i)\bigr]$ with i.i.d. Rademacher signs $\sigma_i$ (Bartlett and Mendelson, [2002](https://arxiv.org/html/2606.02657#bib.bib9); Mohri et al., [2018](https://arxiv.org/html/2606.02657#bib.bib10)).
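For a finite predictor class and a very small sample, the expectation over signs in Definition 3.5 can be evaluated exactly by enumerating all $2^n$ sign vectors. The sketch below does this; the function name `empirical_rademacher` and the toy loss matrix are assumptions for illustration only, and realistic sample sizes would need Monte Carlo sampling of the signs instead.

```python
from itertools import product

def empirical_rademacher(loss_matrix):
    """Exact empirical Rademacher complexity of a finite loss class.

    loss_matrix[k][i] is the loss of predictor k on sample i.
    """
    n = len(loss_matrix[0])
    total = 0.0
    for signs in product((-1, 1), repeat=n):   # every sign vector sigma
        total += max(
            sum(s * l for s, l in zip(signs, row)) / n
            for row in loss_matrix             # supremum over the class
        )
    return total / 2 ** n                      # average over sigma

# Two hypothetical predictors evaluated on three samples (0/1 losses).
losses = [[1, 0, 0], [0, 1, 1]]
rad = empirical_rademacher(losses)
print(rad)
```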

###### Definition 3.6 (Mixing and effective sample size).

A process is geometrically $\beta$-mixing if $\beta(k)\leq C\rho^{k}$ for some $C>0$, $\rho\in(0,1)$ (Davydov, [1973](https://arxiv.org/html/2606.02657#bib.bib8); Yu, [1994](https://arxiv.org/html/2606.02657#bib.bib7)). Dependence reduces the information per sample; the number of approximately independent blocks is the effective sample size $n_{\mathrm{eff}}\leq n$ (Theorem [4.7](https://arxiv.org/html/2606.02657#S4.Thmtheorem7)).

###### Assumption 3.7 (Sampling model and persistence).

1. (a) $\{Z_t\}$ is a stationary, geometrically $\beta$-mixing two-state Markov chain whose mixing constants depend only on $P$.
2. (b) The training crisis fraction $\pi$ is either the stationary fraction $\mu_1=p_{01}/(p_{01}+p_{10})$ (raw historical window) or a design parameter under reweighting; in either case $\pi=\mathbb{E}[\widehat{\pi}]$ for the empirical estimator $\widehat{\pi}$ of Section [4.3](https://arxiv.org/html/2606.02657#S4.SS3).
3. (c) *Persistence:* $p_{01}+p_{10}\leq 1$, so the second eigenvalue of $P$ is nonnegative and the spectral gap simplifies to $g=p_{01}+p_{10}$; the general case is handled by the modulus form $g=1-|1-(p_{01}+p_{10})|$ (Remark [4.6](https://arxiv.org/html/2606.02657#S4.Thmtheorem6)).

## 4Theory

This section develops the regime\-shift penalty, makes the mixing constants explicit, and assembles the main generalization bound\. Proofs of the load\-bearing results are deferred to Appendix[A](https://arxiv.org/html/2606.02657#A1)\.

### 4.1 The future-mixture decomposition

The central object is an exact identity relating future risk \(the deployment target\) to mixed training risk \(what can be estimated\)\.

###### Lemma 4.1 (Future-mix gap identity).

For any $f\in\mathcal{F}$ and $\pi\in[0,1]$, with the current regime calm,

$$R_{\mathrm{future}}(f)-R_{\mathrm{mix}}(f)=(p_{01}-\pi)\bigl(R_1(f)-R_0(f)\bigr).$$

Footnote 1. This identity holds for the realized future crisis fraction $\pi_{\text{future}}$. In practice, $\pi_{\text{future}}$ is not known at training time; estimating it requires forecasting future regime composition.
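The identity can be checked in a few lines from Definition 3.3 and the row-sum constraint $p_{00}=1-p_{01}$. The following is a short sketch of that computation, not the paper's appendix proof:

```latex
\begin{aligned}
R_{\mathrm{future}}(f)-R_{\mathrm{mix}}(f)
  &= \bigl[p_{00}R_0(f)+p_{01}R_1(f)\bigr]-\bigl[(1-\pi)R_0(f)+\pi R_1(f)\bigr] \\
  &= \bigl(p_{00}-(1-\pi)\bigr)R_0(f)+(p_{01}-\pi)R_1(f) \\
  &= (\pi-p_{01})R_0(f)+(p_{01}-\pi)R_1(f) \qquad \text{since } p_{00}=1-p_{01} \\
  &= (p_{01}-\pi)\bigl(R_1(f)-R_0(f)\bigr).
\end{aligned}
```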

The gap is a product of a *composition mismatch* $(p_{01}-\pi)$ and a *regime sensitivity* $(R_1(f)-R_0(f))$. It vanishes when the training crisis fraction matches the future switching probability ($\pi=p_{01}$), regardless of the model, and equally when the model is regime-insensitive ($R_1(f)=R_0(f)$), regardless of the dynamics. To turn the sensitivity factor into a quantity that depends on the model *class* rather than the individual $f$, we bound $|R_1(f)-R_0(f)|$ via the following lemma and corollary.

###### Lemma 4.2 ($\tfrac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}\leq d_{\mathrm{TV}}$).

For any $Q, R$ and any $\mathcal{H}$, $\tfrac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(Q,R)=\sup_{h,h'\in\mathcal{H}}|\mathbb{P}_Q[h\neq h']-\mathbb{P}_R[h\neq h']|\leq d_{\mathrm{TV}}(Q,R)$.

###### Corollary 4.3 (Regime-shift inequality).

Assume the $0/1$ loss and $\mathcal{F}\subseteq\mathcal{H}$, and let $\lambda_{01}=\min_{h\in\mathcal{H}}[R_0(h)+R_1(h)]$ be the adaptability term. Then for every $f\in\mathcal{H}$,

$$R_{\mathrm{future}}(f)\leq R_{\mathrm{mix}}(f)+|p_{01}-\pi|\Bigl(\tfrac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(P_1,P_0)+\lambda_{01}\Bigr),$$

which reduces, in the realizable case $\lambda_{01}=0$, to $R_{\mathrm{future}}(f)\leq R_{\mathrm{mix}}(f)+|p_{01}-\pi|\cdot\tfrac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(P_1,P_0)$.

The regime penalty $|p_{01}-\pi|\cdot\tfrac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(P_1,P_0)$ vanishes when $\pi=p_{01}$ and equals $p_{01}\cdot\tfrac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}$ under pure-calm training ($\pi=0$). It depends only on the transition dynamics, the training composition, and the class-detectable regime gap, not on the chosen $f$. By Lemma [4.2](https://arxiv.org/html/2606.02657#S4.Thmtheorem2) it never exceeds the corresponding $d_{\mathrm{TV}}$ penalty, so replacing $d_{\mathrm{TV}}$ by $\tfrac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}$ can only tighten the bound.

### 4.2 Explicit mixing constants

###### Definition 4.5 (Spectral gap and stationary constant).

The non-unit eigenvalue of $P$ is $\lambda_2=1-(p_{01}+p_{10})$. The spectral gap is $g=1-|\lambda_2|$, equal to $p_{01}+p_{10}$ under Assumption [3.7](https://arxiv.org/html/2606.02657#S3.Thmtheorem7)(c). With switching rate $s=p_{01}+p_{10}$ the stationary law is $\mu_0=p_{10}/s$, $\mu_1=p_{01}/s$, and we set $C_{\mu}=1/\min(\mu_0,\mu_1)=s/\min(p_{10},p_{01})$.
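These constants are simple functions of the two switching probabilities. The sketch below computes them, using the modulus form of the gap so that it also covers $p_{01}+p_{10}>1$; the function name `mixing_constants` and the example probabilities are illustrative assumptions.

```python
def mixing_constants(p01, p10):
    """Spectral gap, stationary law and C_mu for a two-state chain."""
    s = p01 + p10                    # switching rate
    lambda2 = 1.0 - s                # non-unit eigenvalue of P
    gap = 1.0 - abs(lambda2)         # modulus form; equals s when s <= 1
    mu0, mu1 = p10 / s, p01 / s      # stationary probabilities of calm / crisis
    c_mu = 1.0 / min(mu0, mu1)
    return {"lambda2": lambda2, "gap": gap, "mu0": mu0, "mu1": mu1, "c_mu": c_mu}

# Hypothetical persistent chain: rare crisis entry, moderately fast exit.
consts = mixing_constants(p01=0.05, p10=0.20)
print(consts)
```

For these values the gap is 0.25 and $C_\mu$ is 5; a rarer crisis state increases $C_\mu$, and more persistent regimes shrink the gap.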

###### Theorem 4.7 (Effective sample size).

Under Assumption [3.7](https://arxiv.org/html/2606.02657#S3.Thmtheorem7), with block length $b=\lceil\ln(nC_{\mu})/g\rceil$ one has $\beta(b)\leq 1/n$, and the number of independent blocks satisfies

$$n_{\mathrm{eff}}=\lfloor n/b\rfloor\ \geq\ \frac{ng}{\ln(nC_{\mu})+2}.$$

The bound follows the independent-blocks method of Yu ([1994](https://arxiv.org/html/2606.02657#bib.bib7)): one selects $b$ so that $\beta(b)\leq 1/n$, partitions the $n$ points into $\lfloor n/b\rfloor$ approximately independent blocks, and lower-bounds the block count using $b\leq\ln(nC_{\mu})/g+1$.
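A minimal numerical sketch of the block construction follows. The function name `effective_sample_size` and the inputs ($n=5000$, $g=0.25$, $C_\mu=5$) are hypothetical; the mixing check uses the geometric form $\beta(k)\leq C_\mu|\lambda_2|^k$ with $|\lambda_2|=1-g$ quoted in Section 2.2.

```python
import math

def effective_sample_size(n, gap, c_mu):
    """Block length b, block count n_eff, and the closed-form lower bound."""
    block_len = math.ceil(math.log(n * c_mu) / gap)
    n_eff = n // block_len
    lower_bound = n * gap / (math.log(n * c_mu) + 2.0)
    return block_len, n_eff, lower_bound

# Hypothetical inputs, chosen large enough for the asymptotic form to be informative.
b, n_eff, lb = effective_sample_size(n=5000, gap=0.25, c_mu=5.0)
beta_b = 5.0 * (1.0 - 0.25) ** b   # geometric mixing bound evaluated at lag b
print(b, n_eff, round(lb, 2), beta_b)
```

For these inputs the 5000 dependent observations behave like roughly 120 independent blocks, which shows how strongly persistence inflates the concentration terms.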

###### Corollary 4.8 (Complexity term).

$$3\sqrt{\frac{\ln(2/\delta)}{2n_{\mathrm{eff}}}}\ \leq\ 3\sqrt{\frac{(\ln(nC_{\mu})+2)\ln(2/\delta)}{2ng}}\ =:\ \Lambda(n,\delta).$$

### 4.3 Estimation of the penalty components

###### Theorem 4.9 (Crisis-fraction estimation).

Let $\widehat{\pi}=\frac{1}{n}\sum_i\mathbf{1}[Z_i=1]$ and $\pi=\mathbb{E}[\widehat{\pi}]$. Under Assumption [3.7](https://arxiv.org/html/2606.02657#S3.Thmtheorem7), for any $\delta\in(0,1)$, with probability at least $1-\delta$,

$$|\widehat{\pi}-\pi|\leq\eta_{\pi}:=\sqrt{\frac{(\ln(nC_{\mu})+2)\ln(2/\delta)}{2ng}}.$$
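The slack $\eta_\pi$ has the same form as the complexity term of Corollary 4.8, with $\Lambda(n,\delta)=3\,\eta_\pi$. The sketch below evaluates both; the function names and the numerical inputs are assumptions for illustration.

```python
import math

def eta_pi(n, gap, c_mu, delta):
    """Deviation bound for the empirical crisis fraction under mixing."""
    return math.sqrt(
        (math.log(n * c_mu) + 2.0) * math.log(2.0 / delta) / (2.0 * n * gap)
    )

def complexity_term(n, gap, c_mu, delta):
    """Lambda(n, delta) from Corollary 4.8, which equals 3 * eta_pi."""
    return 3.0 * eta_pi(n, gap, c_mu, delta)

# Hypothetical inputs: 5000 observations, gap 0.25, C_mu = 5, 95% confidence.
slack = eta_pi(n=5000, gap=0.25, c_mu=5.0, delta=0.05)
lam = complexity_term(n=5000, gap=0.25, c_mu=5.0, delta=0.05)
print(slack, lam)
```

Halving the spectral gap multiplies both quantities by roughly $\sqrt{2}$, so more persistent regimes need proportionally longer windows for the same precision.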

The regime gap $\tfrac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}$ is estimated from *unlabeled* features by a domain classifier. One pools the calm features $U_0$ and crisis features $U_1$ with regime tags, fits a classifier in $\mathcal{H}$ to predict the regime label, records its best balanced accuracy $\widehat{a}^{\star}$, and sets $\tfrac{1}{2}\widehat{d}_{\mathcal{H}\Delta\mathcal{H}}=\max(0,\,2\widehat{a}^{\star}-1)$: easy-to-separate regimes yield a large value, while chance-level separability yields $0$.

###### Definition 4.10 (Domain-classifier estimator).

Given unlabeled feature sets $U_0\sim P_0$, $U_1\sim P_1$ of size $m'$, let $\widehat{\epsilon}^{\star}$ be the minimum balanced domain-classification error over $\mathcal{H}$ and $\widehat{a}^{\star}=1-\widehat{\epsilon}^{\star}$ the best balanced accuracy. Then

$$\widehat{d}_{\mathcal{H}\Delta\mathcal{H}}(U_0,U_1):=2\bigl(1-2\widehat{\epsilon}^{\star}\bigr)=2\bigl(2\widehat{a}^{\star}-1\bigr),\qquad\tfrac{1}{2}\widehat{d}_{\mathcal{H}\Delta\mathcal{H}}=\max\bigl(0,\,2\widehat{a}^{\star}-1\bigr).$$

Perfect separation yields $\widehat{d}_{\mathcal{H}\Delta\mathcal{H}}=2$ (maximal divergence); chance-level accuracy yields $\widehat{d}_{\mathcal{H}\Delta\mathcal{H}}=0$.
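The estimator can be sketched in a few lines if one assumes, purely for illustration, that $\mathcal{H}$ is the class of one-dimensional threshold rules in either orientation. The function names and the toy feature values are hypothetical; with richer classes one would train a real domain classifier and, in practice, score it on held-out data.

```python
def best_balanced_accuracy(u0, u1):
    """Best in-sample balanced accuracy of a threshold rule separating u0 (calm) from u1 (crisis)."""
    candidates = sorted(set(u0) | set(u1))
    candidates.append(candidates[-1] + 1.0)    # a threshold above every point
    best = 0.0
    for t in candidates:
        # Rule "predict crisis when x >= t".
        tpr = sum(x >= t for x in u1) / len(u1)
        tnr = sum(x < t for x in u0) / len(u0)
        acc = 0.5 * (tpr + tnr)
        best = max(best, acc, 1.0 - acc)       # 1 - acc is the flipped rule
    return best

def half_divergence_estimate(u0, u1):
    """(1/2) d-hat = max(0, 2 a* - 1), as in Definition 4.10."""
    return max(0.0, 2.0 * best_balanced_accuracy(u0, u1) - 1.0)

# Toy one-dimensional features with partial overlap between regimes.
calm = [0.1, 0.2, 0.3, 0.9]
crisis = [0.8, 1.0, 1.1, 0.25]
half_div_hat = half_divergence_estimate(calm, crisis)
print(half_div_hat)
```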

###### Theorem 4.11 (Mixing-aware uniform convergence for $d_{\mathcal{H}\Delta\mathcal{H}}$).

Let $\mathcal{H}$ have VC dimension $d$ (Vapnik, [1998](https://arxiv.org/html/2606.02657#bib.bib3)), so $\mathcal{H}\Delta\mathcal{H}$ has VC dimension at most $2d$. If the $2m'$ feature points are $\beta$-mixing with spectral gap $g$ and constant $C_{\mu}$, then with probability at least $1-\delta$,

$$\tfrac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(P_0,P_1)\leq\tfrac{1}{2}\widehat{d}_{\mathcal{H}\Delta\mathcal{H}}(U_0,U_1)+\underbrace{2\sqrt{\frac{\bigl(2d\log(2m')+\log(2/\delta)\bigr)\bigl(\ln(m'C_{\mu})+2\bigr)}{m'g}}}_{=:\ \eta_d}.$$

The slack $\eta_d$ arises by starting from the i.i.d. uniform VC deviation guarantee of Kifer et al. ([2004](https://arxiv.org/html/2606.02657#bib.bib11)) and Ben-David et al. ([2010](https://arxiv.org/html/2606.02657#bib.bib1)), which controls $d_{\mathcal{H}\Delta\mathcal{H}}-\widehat{d}_{\mathcal{H}\Delta\mathcal{H}}$ by a term in $\sqrt{1/m'}$, and replacing $m'$ by the effective count $m'_{\mathrm{eff}}\geq m'g/(\ln(m'C_{\mu})+2)$ to account for serial dependence. The divergence estimate thus degrades with regime persistence at the same rate as $\eta_{\pi}$.
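To get a feel for the size of $\eta_d$, the sketch below evaluates the expression in Theorem 4.11. It assumes that both $\log$ and $\ln$ denote the natural logarithm, which the text does not state explicitly; the function name `eta_d` and all numerical inputs are illustrative.

```python
import math

def eta_d(m_prime, vc_dim, gap, c_mu, delta):
    """Slack of the mixing-aware divergence estimate (natural logs assumed)."""
    vc_part = 2.0 * vc_dim * math.log(2.0 * m_prime) + math.log(2.0 / delta)
    mixing_part = math.log(m_prime * c_mu) + 2.0
    return 2.0 * math.sqrt(vc_part * mixing_part / (m_prime * gap))

# Hypothetical inputs: thresholds (VC dimension 1), gap 0.25, C_mu = 5.
slack_small = eta_d(m_prime=5000, vc_dim=1, gap=0.25, c_mu=5.0, delta=0.05)
slack_large = eta_d(m_prime=500_000, vc_dim=1, gap=0.25, c_mu=5.0, delta=0.05)
print(slack_small, slack_large)
```

With these inputs the slack at 5000 points per regime is close to the maximal value of $\tfrac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}$, so under these assumptions the bound only becomes informative with substantially more unlabeled data.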

### 4.4 Main generalization bound

###### Theorem 4.13 (Extended Rademacher Markov-transition bound).

Under Assumption [3.7](https://arxiv.org/html/2606.02657#S3.Thmtheorem7) ($0/1$ loss, $\mathcal{F}\subseteq\mathcal{H}$, current regime calm), for any $\delta\in(0,1)$, with probability at least $1-\delta$, simultaneously for all $f\in\mathcal{F}$,

$$R_{\mathrm{future}}(f)\leq\widehat{R}_S(f)+2\widehat{R}_S(\mathcal{L}_{\mathcal{F}})+3\sqrt{\frac{(\ln(nC_{\mu})+2)\ln(2/\delta)}{2ng}}+|p_{01}-\pi|\Bigl(\tfrac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(P_1,P_0)+\lambda_{01}\Bigr).$$

The bound is assembled by chaining three steps: Corollary [4.3](https://arxiv.org/html/2606.02657#S4.Thmtheorem3) bounds $R_{\mathrm{future}}(f)$ by $R_{\mathrm{mix}}(f)$ plus the regime penalty; the mixing Rademacher bound of Mohri et al. ([2018](https://arxiv.org/html/2606.02657#bib.bib10)) and Yu ([1994](https://arxiv.org/html/2606.02657#bib.bib7)) then replaces $R_{\mathrm{mix}}(f)$ by $\widehat{R}_S(f)+2\widehat{R}_S(\mathcal{L}_{\mathcal{F}})$ plus a concentration term; and Corollary [4.8](https://arxiv.org/html/2606.02657#S4.Thmtheorem8) makes that term explicit as $\Lambda(n,\delta)$. The only stochastic event is the uniform-convergence step.

###### Corollary 4.14 (Fully estimable bound).

With the domain-classifier estimate and slack $\eta_{d}$ of Theorem [4.11](https://arxiv.org/html/2606.02657#S4.Thmtheorem11), and a union bound at level $\delta/2$ over the risk and divergence events, with probability at least $1-\delta$,

$$R_{\mathrm{future}}(f)\leq\widehat{R}_{S}(f)+2\widehat{R}_{S}(\mathcal{L}_{\mathcal{F}})+\Lambda(n,\delta)+|p_{01}-\pi|\Bigl(\tfrac{1}{2}\widehat{d}_{\mathcal{H}\Delta\mathcal{H}}(U_{0},U_{1})+\eta_{d}+\lambda_{01}\Bigr).$$

All terms except $\lambda_{01}$ (zero in the realizable case) are in principle computable from training data, provided $p_{01}$ can be reliably estimated. Empirical evaluation (Section [5](https://arxiv.org/html/2606.02657#S5)) shows that estimating $p_{01}$ from standard-length training windows is challenging when regime transitions are rare; this limitation is discussed therein.

Footnote 2: In our empirical evaluation, the training-only penalty using $|\hat{p}_{01}-\hat{\pi}|$ shows no significant correlation with realized gaps ($\rho=0.084$, 95% CI contains zero), while the ex post penalty using the realized $\hat{\pi}_{\text{future}}$ achieves $\rho=0.729$. This highlights that reliable estimation of $p_{01}$ from short training windows is the primary bottleneck.
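As a rough illustration of how the right-hand side of the bound could be evaluated, the sketch below assembles its terms numerically. It assumes that $\Lambda(n,\delta)$ takes the explicit form shown in the preceding bound, $3\sqrt{(\ln(nC_{\mu})+2)\ln(2/\delta)/(2ng)}$. The function names, argument names, and the numeric inputs are hypothetical, and the empirical risk and Rademacher term are taken as given inputs rather than computed.

```python
import math


def concentration_term(n, g, c_mu, delta):
    """Assumed closed form: 3 * sqrt((ln(n*C_mu) + 2) * ln(2/delta) / (2*n*g))."""
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    if n <= 0 or g <= 0.0 or c_mu <= 0.0:
        raise ValueError("n, g and c_mu must be positive")
    numerator = (math.log(n * c_mu) + 2.0) * math.log(2.0 / delta)
    return 3.0 * math.sqrt(numerator / (2.0 * n * g))


def regime_penalty(p01, pi, half_d_hat, eta_d=0.0, lambda01=0.0):
    """|p01 - pi| * (half_d_hat + eta_d + lambda01), the regime-mismatch term."""
    return abs(p01 - pi) * (half_d_hat + eta_d + lambda01)


def estimable_bound(emp_risk, rademacher, n, g, c_mu, delta,
                    p01, pi, half_d_hat, eta_d=0.0, lambda01=0.0):
    """Sum the four terms on the right-hand side of the bound."""
    return (emp_risk
            + 2.0 * rademacher
            + concentration_term(n, g, c_mu, delta)
            + regime_penalty(p01, pi, half_d_hat, eta_d, lambda01))


# Hypothetical inputs, for illustration only.
value = estimable_bound(emp_risk=0.45, rademacher=0.02, n=2000, g=0.1,
                        c_mu=1.0, delta=0.05, p01=0.05, pi=0.20,
                        half_d_hat=0.93)
print(f"bound on future risk: {value:.3f}")
```

With small spectral gaps the concentration term can dominate, so for the $0/1$ loss the resulting value may exceed one and be vacuous; the sketch makes no attempt to clip it.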

### 4.5 Lower bound and the scope of unavoidability

A matching lower bound holds in the pure-calm-training regime; for general $\pi$ the worst-case excess is sample-size dependent.

###### Theorem 4.15 (Le Cam lower bound, $\pi=0$).

There exist $P_{0},P_{1}^{(a)},P_{1}^{(b)}$ and a class $\mathcal{H}$ with $\tfrac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(P_{1}^{(\cdot)},P_{0})=\rho$ such that, for pure-calm training ($\pi=0$), every learner $\widehat{f}$ obeys

$$\max_{w\in\{a,b\}}\mathbb{E}\bigl[R_{\mathrm{future}}(\widehat{f})-\textstyle\inf_{g}R_{\mathrm{future}}(g)\bigr]\ \geq\ p_{01}\,\rho\ =\ p_{01}\cdot\tfrac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(P_{1},P_{0}),$$

uniformly in $n$. For $\pi>0$ the analogous worst-case excess is $\Theta(1/\sqrt{n_{\mathrm{eff}}})$ and is not matched by an $n$-uniform constant.

At $\pi=0$ the bound is tight: training never reveals which crisis world is in force, so the penalty $p_{01}\cdot\tfrac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}$ is unavoidable for every $n$. For $\pi>0$ the excess shrinks at rate $\Theta(1/\sqrt{n_{\mathrm{eff}}})$ as the learner gradually identifies the regime structure. The penalty therefore serves as a tight worst case under pure-calm training and as an upper bound otherwise.

###### Proposition 4.16 (Irreducible certification cost).

Let $\Delta=\tfrac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(P_{1},P_{0})>0$. Any certificate $U(S)$ with $R_{\mathrm{future}}(f)\leq U(S)$ valid uniformly over the future transition $p_{01}$ consistent with the calm-now observation satisfies $U(S)-\widehat{R}_{S}(f)-\Lambda(n,\delta)\geq(|p_{01}-\widehat{\pi}|-\eta_{\pi})\Delta$. This is a property of certificates: when $\pi=p_{01}$ the realized gap (Lemma [4.1](https://arxiv.org/html/2606.02657#S4.Thmtheorem1)) is exactly zero.

## 5 Empirical Validation

### 5.1 Data

We validate the theory in two complementary settings. The first is a controlled synthetic environment in which the regime process and the regime-dependent distributions are known exactly, so that the predicted penalty can be compared against a ground-truth generalization gap. We simulate a stationary two-state Markov chain with regime-specific Gaussian features and regime-specific logistic label rules, sweeping the transition probabilities $p_{01},p_{10}$ and the inter-regime feature separation that controls $d_{\mathcal{H}\Delta\mathcal{H}}$; each of the resulting configurations is evaluated with all five models, yielding $400$ configuration-model observations.

The second setting is real market data: twenty-five years (2000 to 2025) of daily closing prices for ten liquid global equity indices spanning North America, Europe, and Asia (Table [1](https://arxiv.org/html/2606.02657#S5.T1)), chosen because equity markets exhibit naturally observable regime transitions and strong temporal dependence, and because a geographically diverse panel ensures that regime shifts are not perfectly correlated across series. Regimes are not observed directly; they are inferred by a two-state Gaussian hidden Markov model fit to the bivariate series of daily log-returns and twenty-day realized volatility (Hamilton, [1989](https://arxiv.org/html/2606.02657#bib.bib6)), with the higher-volatility state labeled crisis. The fitted transition matrix supplies $\widehat{p}_{01},\widehat{p}_{10}$, and the decoded state path supplies the per-day regime labels.
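The synthetic setting can be sketched in a few lines. The code below is a minimal illustration using only the standard library, not the authors' simulator: the feature dimension, the unit covariance, the mean shift `separation`, and the sign-flipped logistic rule in the crisis regime are all assumptions made for the example.

```python
import math
import random


def simulate_regimes(n, p01, p10, rng):
    """Sample n states of a two-state Markov chain (0 = calm, 1 = crisis).

    The first state is drawn from the stationary law, whose crisis
    probability is p01 / (p01 + p10), so the chain is stationary.
    """
    if not (0.0 < p01 <= 1.0 and 0.0 < p10 <= 1.0):
        raise ValueError("p01 and p10 must lie in (0, 1]")
    state = 1 if rng.random() < p01 / (p01 + p10) else 0
    path = []
    for _ in range(n):
        path.append(state)
        leave_prob = p01 if state == 0 else p10
        if rng.random() < leave_prob:
            state = 1 - state
    return path


def simulate_observations(path, separation, rng, dim=2):
    """Draw (features, label, regime) triples along a regime path.

    Hypothetical design: calm features are N(0, I), crisis features are
    N(separation * 1, I), and the crisis logistic rule has the opposite
    sign of the calm rule.
    """
    slope = {0: 1.0, 1: -1.0}
    data = []
    for s in path:
        mean = separation * s
        x = [rng.gauss(mean, 1.0) for _ in range(dim)]
        logit = slope[s] * sum(xi - mean for xi in x)
        prob_up = 1.0 / (1.0 + math.exp(-logit))
        y = 1 if rng.random() < prob_up else 0
        data.append((x, y, s))
    return data


rng = random.Random(0)
path = simulate_regimes(5000, p01=0.05, p10=0.20, rng=rng)
sample = simulate_observations(path, separation=2.0, rng=rng)
print("empirical crisis fraction:", sum(path) / len(path))
```

Sweeping `p01`, `p10`, and `separation` over a grid would reproduce the general shape of the configuration sweep, although the grid actually used in the study is not specified here.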

Table 1: Equity indices used in the real-data study.

### 5.2 Procedure

For each (training, deployment) split we estimate four quantities: the training crisis fraction $\widehat{\pi}$ from the training window's regime sequence; the future crisis fraction $\widehat{\pi}_{\mathrm{future}}$ from the realized out-of-sample window (used for ex post validation only: this quantity is not available at deployment time, and a practical deployment would require forecasting $\pi_{\mathrm{future}}$, which remains an open problem); the regime gap $\tfrac{1}{2}\widehat{d}_{\mathcal{H}\Delta\mathcal{H}}=\max(0,2\widehat{a}^{\star}-1)$ from a cross-validated domain classifier on the unlabeled training features; and the transition dynamics $\widehat{p}_{01},\widehat{p}_{10}$ with spectral gap $g=1-|1-(\widehat{p}_{01}+\widehat{p}_{10})|$ for diagnostics.

The predicted penalty for ex post validation is $|\widehat{\pi}_{\mathrm{future}}-\widehat{\pi}|\cdot\tfrac{1}{2}\widehat{d}_{\mathcal{H}\Delta\mathcal{H}}$, and the realized regime gap is $R_{\mathrm{future}}-R_{\mathrm{mix}}$. Here, for each fitted model, $R_{0},R_{1}$ are the held-out $0/1$ risks on the calm and crisis points of the deployment window, $R_{\mathrm{mix}}=(1-\widehat{\pi})R_{0}+\widehat{\pi}R_{1}$, and $R_{\mathrm{future}}=(1-\widehat{\pi}_{\mathrm{future}})R_{0}+\widehat{\pi}_{\mathrm{future}}R_{1}$; Lemma [4.1](https://arxiv.org/html/2606.02657#S4.Thmtheorem1) predicts the identity $R_{\mathrm{future}}-R_{\mathrm{mix}}=(\widehat{\pi}_{\mathrm{future}}-\widehat{\pi})(R_{1}-R_{0})$, which we verify directly. (A version of the penalty using only training data, $|\widehat{p}_{01}-\widehat{\pi}|\cdot\tfrac{1}{2}\widehat{d}_{\mathcal{H}\Delta\mathcal{H}}$, is also evaluated and discussed in Section [7](https://arxiv.org/html/2606.02657#S7).)

The prediction task is next-day directional movement under $0/1$ loss, with standard lagged-return and realized-volatility features. To establish that the penalty is a model-independent property (Remark [4.4](https://arxiv.org/html/2606.02657#S4.Thmtheorem4)), we evaluate five estimators spanning linear, ensemble, and neural families: logistic regression, ridge classification, random forest, gradient boosting (XGBoost), and a small multilayer perceptron. The penalty components are computed once per window, so that only $R_{0},R_{1}$ vary across models.

On synthetic data the training block is resampled to a target crisis fraction (independent of the chain's stationary rate) while the deployment block retains its natural dynamics, and we report the Spearman correlation between the penalty and $|\text{realized gap}|$ pooled over configurations and models.

On real data we use a rolling-origin design in which each window trains on an eight-year block and evaluates on the immediately following non-overlapping two-year block, advancing the origin by the test length, yielding up to nine windows per index. Because the five models within a window share the same penalty and are not independent, pooling all model-window rows would overstate significance; we therefore report a window-clustered bootstrap that resamples whole windows with replacement and recomputes the correlation on each resample, together with per-index clustered correlations to assess consistency of sign across markets.
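The per-window quantities in this procedure reduce to a few lines of arithmetic once a decoded regime path is available. The sketch below is illustrative only. It estimates the transition probabilities by counting transitions in the decoded path, which is a simplification of reading them from a fitted HMM transition matrix, and it treats $\widehat{a}^{\star}$ as the domain classifier's cross-validated accuracy, which is an assumption. All function names are hypothetical.

```python
def crisis_fraction(path):
    """Share of crisis days (label 1) in a decoded regime path."""
    return sum(path) / len(path)


def transition_estimates(path):
    """Count-based estimates of (p01, p10) from consecutive state pairs."""
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for prev, cur in zip(path, path[1:]):
        counts[(prev, cur)] += 1
    from_calm = counts[(0, 0)] + counts[(0, 1)]
    from_crisis = counts[(1, 0)] + counts[(1, 1)]
    p01 = counts[(0, 1)] / from_calm if from_calm else float("nan")
    p10 = counts[(1, 0)] / from_crisis if from_crisis else float("nan")
    return p01, p10


def spectral_gap(p01, p10):
    """g = 1 - |1 - (p01 + p10)| for a two-state chain."""
    return 1.0 - abs(1.0 - (p01 + p10))


def regime_gap_from_accuracy(accuracy):
    """Half-divergence proxy max(0, 2a - 1), with a read as classifier accuracy."""
    return max(0.0, 2.0 * accuracy - 1.0)


def ex_post_penalty(pi_future, pi_train, half_d_hat):
    """|pi_future - pi_train| * half_d_hat; needs the realized future window."""
    return abs(pi_future - pi_train) * half_d_hat


def training_only_penalty(p01_hat, pi_train, half_d_hat):
    """|p01_hat - pi_train| * half_d_hat; uses training information only."""
    return abs(p01_hat - pi_train) * half_d_hat


# Toy decoded path, for illustration only.
train_path = [0, 0, 0, 1, 1, 0, 0, 0, 0, 1]
p01_hat, p10_hat = transition_estimates(train_path)
print(crisis_fraction(train_path), p01_hat, p10_hat, spectral_gap(p01_hat, p10_hat))
```

When a window contains no crisis days, the count-based `p10` estimate is undefined and is returned as NaN, which is one concrete way the scarcity of transitions shows up in short windows.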

## 6 Results

Table [2](https://arxiv.org/html/2606.02657#S6.T2) collects the headline results across both settings, and Table [3](https://arxiv.org/html/2606.02657#S6.T3) reports the per-index breakdown. In the synthetic sweep, the predicted penalty (computed using the realized future crisis fraction) tracks the realized regime gap with pooled Spearman $\rho=0.716$ over $400$ configuration-model observations, and the relationship is positive for every model family individually, ranging from $\rho=0.639$ (random forest) to $\rho=0.837$ (logistic and ridge); the lower correlations for the higher-capacity models are consistent with their partially absorbing the regime structure, which compresses $R_{1}-R_{0}$. The exact future-mix identity of Lemma [4.1](https://arxiv.org/html/2606.02657#S4.Thmtheorem1) holds to numerical precision (Pearson $r=1.000$, mean absolute deviation $0.000$).

On real data, across ten indices and $84$ independent windows, the window-clustered correlation between the ex post penalty (using the realized future crisis fraction $\widehat{\pi}_{\mathrm{future}}$) and the realized gap is $\rho=0.729$ with a $95\%$ bootstrap interval of $[0.635,0.801]$, closely matching the synthetic estimate; the naive pooled correlation coincides at $0.729$ but, as expected, carries no honest interval because its rows are not independent. The per-index correlations (Table [3](https://arxiv.org/html/2606.02657#S6.T3)) are positive in sign for all ten markets, with point estimates from $0.372$ (Dow Jones) to $0.949$ (EURO STOXX 50); the individual intervals are wide and a minority include zero, so the statistical strength of the result rests on the aggregate rather than on any single market, while the uniformly positive sign indicates a consistent effect across regions. Figures [1](https://arxiv.org/html/2606.02657#S6.F1) and [2](https://arxiv.org/html/2606.02657#S6.F2) display the synthetic relationship and the identity check, and Figure [3](https://arxiv.org/html/2606.02657#S6.F3) the real-data relationship.
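A window-clustered bootstrap of the kind reported here can be sketched as follows. This is an illustrative standard-library version, not the authors' code: the percentile interval, the number of resamples, and the data layout (`rows_by_window` mapping a window id to its per-model `(penalty, |gap|)` pairs) are assumptions. Ties are given average ranks, which matters because the five models in a window share one penalty value.

```python
import random


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; NaN when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return float("nan")
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def clustered_bootstrap_ci(rows_by_window, n_boot=2000, alpha=0.05, seed=0):
    """Percentile interval from resampling whole windows with replacement."""
    rng = random.Random(seed)
    window_ids = list(rows_by_window)
    stats = []
    for _ in range(n_boot):
        xs, ys = [], []
        for _ in window_ids:
            for penalty, abs_gap in rows_by_window[rng.choice(window_ids)]:
                xs.append(penalty)
                ys.append(abs_gap)
        rho = spearman(xs, ys)
        if rho == rho:  # drop degenerate resamples where rho is NaN
            stats.append(rho)
    if not stats:
        raise ValueError("every resample was degenerate")
    stats.sort()
    lower = stats[int((alpha / 2.0) * (len(stats) - 1))]
    upper = stats[int((1.0 - alpha / 2.0) * (len(stats) - 1))]
    return lower, upper
```

Resampling windows rather than rows keeps the five model rows of a window together, so the interval reflects the number of independent windows instead of the number of rows.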

### 6.1 Training-only penalty

For comparison, we also evaluated a version of the penalty that uses only training-data information, replacing $\widehat{\pi}_{\mathrm{future}}$ with the estimated transition probability $\widehat{p}_{01}$ obtained from the training window. This training-only penalty showed no significant correlation with the realized gap: the window-clustered Spearman correlation was $\rho=0.084$ with a $95\%$ confidence interval containing zero (see Appendix Table [4](https://arxiv.org/html/2606.02657#A1.T4) for detailed diagnostics). This finding highlights that while regime composition mismatch is a real phenomenon that explains generalization gaps ex post, estimating it before deployment remains challenging when regime transitions are rare and training windows are limited.

Table 2: Summary of validation results. The synthetic correlation is pooled over configurations and models; the real-data correlation uses the window-clustered bootstrap (whole windows resampled with replacement). $\rho$ is the Spearman correlation between the predicted penalty and the magnitude of the realized regime gap $|R_{\mathrm{future}}-R_{\mathrm{mix}}|$; the identity row reports the Pearson correlation between the realized gap and the Lemma [4.1](https://arxiv.org/html/2606.02657#S4.Thmtheorem1) prediction $(\widehat{\pi}_{\mathrm{future}}-\widehat{\pi})(R_{1}-R_{0})$.

Table 3: Per-index window-clustered Spearman correlation between the predicted penalty and $|R_{\mathrm{future}}-R_{\mathrm{mix}}|$, with $95\%$ bootstrap intervals and the number of independent rolling-origin windows. The point estimates are positive for all ten markets; intervals are wide because each index contributes few windows, so inference is strongest in aggregate (Table [2](https://arxiv.org/html/2606.02657#S6.T2)).

![Refer to caption](https://arxiv.org/html/2606.02657v1/simulation_penalty_gap.png)

Figure 1: Synthetic sweep: predicted penalty versus the magnitude of the realized regime gap, with points colored by model and a least-squares trend line. The positive slope ($\rho=0.716$ pooled) holds within every model family.

![Refer to caption](https://arxiv.org/html/2606.02657v1/identity_check.png)

Figure 2: Identity check: realized $R_{\mathrm{future}}-R_{\mathrm{mix}}$ versus the Lemma [4.1](https://arxiv.org/html/2606.02657#S4.Thmtheorem1) prediction $(\widehat{\pi}_{\mathrm{future}}-\widehat{\pi})(R_{1}-R_{0})$. Points lie on the line $y=x$ (Pearson $r=1.000$), confirming the decomposition is exact.

![Refer to caption](https://arxiv.org/html/2606.02657v1/Real_data_penalty_vs_gap.png)

Figure 3: Real data (rolling-origin, ten indices): predicted penalty versus the magnitude of the realized regime gap, one set of points per index. The clustered correlation is $\rho=0.729$ ($95\%$ CI $[0.635,0.801]$, $84$ windows).

## 7 Discussion

The standard response to a model that fails after deployment is to blame the model: the architecture was too complex, the regularization was insufficient, the features were not invariant enough (Bousquet et al., [2004](https://arxiv.org/html/2606.02657#bib.bib4); Zhang et al., [2017](https://arxiv.org/html/2606.02657#bib.bib19)). All of these diagnoses point in the same direction: reduce $\Delta$, the regime sensitivity term, through better algorithmic design. This paper shows that this diagnosis is structurally incomplete and, in the worst case, structurally irrelevant.

We establish three results that together shift the burden of explanation from the model to the environment. First, Lemma [4.1](https://arxiv.org/html/2606.02657#S4.Thmtheorem1) provides an exact algebraic decomposition showing that the generalization gap factorizes as $(p_{01}-\pi)\cdot(R_{1}-R_{0})$. The model enters only through the second factor; the first factor is a property of the world. Second, we identify an explicit escape condition: when the training crisis fraction $\pi$ matches the true transition probability $p_{01}$, the penalty vanishes identically. Third, and most critically, we demonstrate that this escape condition is empirically unreachable. On real equity data, a domain classifier separates calm and crisis regimes with near-perfect accuracy ($\frac{1}{2}\hat{d}_{\mathcal{H}\Delta\mathcal{H}}=0.93\pm 0.02$), confirming that the feature geometry of crises is highly distinguishable (Ben-David et al., [2010](https://arxiv.org/html/2606.02657#bib.bib1); Kifer et al., [2004](https://arxiv.org/html/2606.02657#bib.bib11)). Yet estimating $p_{01}$ from training windows and plugging it into the penalty yields no significant correlation with actual deployment gaps ($\rho\approx 0.084$, 95% CI contains zero). The two numbers together tell the story: regimes are obvious in hindsight, invisible in foresight, and the resulting gap is not the model's fault.

Theorem [4.15](https://arxiv.org/html/2606.02657#S4.Thmtheorem15) sharpens this point theoretically. Under pure calm-state training ($\pi=0$), we prove a Le Cam minimax lower bound showing that excess risk of at least $p_{01}\cdot\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}$ is mathematically irreducible, regardless of sample size (Le Cam, [1986](https://arxiv.org/html/2606.02657#bib.bib12); Yu, [1997](https://arxiv.org/html/2606.02657#bib.bib13)). No amount of calm-state data can close this gap. Under mixed training ($\pi>0$), the lower bound relaxes to $\Theta(1/\sqrt{n_{\text{eff}}})$, meaning the learner gradually identifies the regime structure as crisis samples accumulate. But the practical problem remains: without a reliable estimate of $p_{01}$, one cannot know whether the training composition $\pi$ is correctly calibrated. The escape condition exists in theory; the world denies it in practice.

This reframes the challenge of deployment under non-stationarity. Improving generalization is not exclusively a machine learning architecture problem. The internal component $\Delta=\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}$ is within the algorithm's control: it can be modified via regularization, domain-invariant representations, or invariant risk minimization (Arjovsky et al., [2019](https://arxiv.org/html/2606.02657#bib.bib20); Krueger et al., [2021](https://arxiv.org/html/2606.02657#bib.bib21)). The external component $|p_{01}-\pi|$ depends on the timing of regime transitions and lies outside the algorithm's direct reach (Hamilton, [1989](https://arxiv.org/html/2606.02657#bib.bib6)). Optimization can mitigate the impact of a regime shift, but it cannot predict its arrival. Deployment risk under regime-switching dynamics is therefore bounded not by nominal sample size, but by the physical scarcity of historical macro-state transitions.

Consequently, we do not present this framework as an ex ante forecasting tool. We present it as a mathematically rigorous diagnostic instrument. When a model fails, the framework answers a specific question: was the failure due to regime mismatch, or to something else? The penalty computed using the realized future crisis fraction tracks actual train-to-deployment gaps with Spearman $\rho=0.729$ (95% CI $[0.635,0.801]$), confirming that the mechanism is real and the decomposition is exact. This enables post-hoc failure auditing, analogous to Value-at-Risk frameworks in banking (Jorion, [2007](https://arxiv.org/html/2606.02657#bib.bib22)): one can stress-test deployment scenarios under hypothetical transition probabilities and isolate the structural component of risk that no amount of model refinement can eliminate.
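The stress-testing use can be made concrete with the identity of Lemma 4.1. The sketch below is a minimal illustration under that identity: it sweeps a grid of hypothetical transition probabilities and reports the implied future risk and regime gap. The function names and all numeric inputs are invented for the example and are not results from the study.

```python
def future_risk(p01, r0, r1):
    """Risk of a fixed model under next-step crisis probability p01."""
    return (1.0 - p01) * r0 + p01 * r1


def stress_test(pi_train, r0, r1, p01_grid):
    """Return (p01, future risk, gap to the training-mix risk) per scenario."""
    r_mix = future_risk(pi_train, r0, r1)  # same mixture, weighted by pi_train
    rows = []
    for p01 in p01_grid:
        r_fut = future_risk(p01, r0, r1)
        rows.append((p01, r_fut, r_fut - r_mix))
    return rows


# Hypothetical per-regime risks and scenarios, for illustration only.
for p01, r_fut, gap in stress_test(pi_train=0.20, r0=0.45, r1=0.55,
                                   p01_grid=[0.0, 0.05, 0.20, 0.50]):
    print(f"p01={p01:.2f}  future risk={r_fut:.3f}  regime gap={gap:+.3f}")
```

The gap is linear in the scenario value and vanishes at `p01 == pi_train`, so the audit isolates how much of a deployment shortfall is attributable to regime composition for a given pair of per-regime risks.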

Several limitations define the boundaries of the current framework. The two-state Markov assumption provides analytical tractability but may not capture the full complexity of real-world regime dynamics (Hamilton, [1989](https://arxiv.org/html/2606.02657#bib.bib6); Ang and Bekaert, [2002](https://arxiv.org/html/2606.02657#bib.bib23)). The framework is diagnostic rather than prescriptive: it identifies where regime risk emerges, but does not guarantee that modifying training composition will recover lost performance in all settings. Extending the framework to multiple regimes, continuous state strength, and prescriptive training strategies remains important future work. Validating the diagnostic utility in other domains where regime transitions are consequential (healthcare, cybersecurity, autonomous systems) will further test the generality of the decomposition (Quiñonero-Candela et al., [2008](https://arxiv.org/html/2606.02657#bib.bib5)).

## 8 Conclusion

We introduced a regime-aware generalization framework for Markov-switching distribution shifts. The central contribution is an exact decomposition of future deployment risk into a standard empirical-complexity term and a regime-mismatch penalty $|p_{01}-\pi|\cdot\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}$, extended to $\beta$-mixing data via a spectral-gap-adjusted effective sample size. A matching Le Cam lower bound proves that under pure calm-state training, this penalty is irreducible. Empirical validation on ten global equity indices confirms that the penalty mechanism is real: when computed with the realized future crisis fraction, it tracks deployment gaps tightly ($\rho=0.729$). However, the training-data-only estimator shows no significant correlation with realized gaps ($\rho\approx 0.084$), despite near-perfect regime separability ($\frac{1}{2}\hat{d}=0.93\pm 0.02$). This negative result is itself a finding: deployment risk is governed by the physical scarcity of regime transitions, not by nominal sample size, and the escape condition $\pi=p_{01}$, while mathematically identified, is empirically unreachable from historical windows alone. These results establish that statistical learning safety under non-stationarity is fundamentally a regime-timing problem ($p_{01}$) rather than solely an algorithmic optimization problem ($\Delta\to 0$), and that rigorous post-hoc auditing constitutes a necessary complement to predictive generalization bounds.

## References

- A. Ang and G. Bekaert (2002) International asset allocation with regime shifts. Review of Financial Studies 15(4), pp. 1137–1187. External Links: [Document](https://dx.doi.org/https%3A//doi.org/10.1093/rfs/15.4.1137).
- M. Arjovsky, L. Bottou, I. Gulrajani, and D. Lopez-Paz (2019) Invariant risk minimization. arXiv preprint. External Links: [Document](https://dx.doi.org/https%3A//doi.org/10.48550/arXiv.1907.02893).
- P. L. Bartlett and S. Mendelson (2002) Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research 3, pp. 463–482.
- S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan (2010) A theory of learning from different domains. Machine Learning 79(1–2), pp. 151–175. External Links: [Document](https://dx.doi.org/https%3A//doi.org/10.1007/s10994-009-5152-4).
- O. Bousquet, S. Boucheron, and G. Lugosi (2004) Introduction to statistical learning theory. In Advanced Lectures on Machine Learning, pp. 169–207. External Links: [Document](https://dx.doi.org/https%3A//doi.org/10.1007/978-3-540-28650-9%5F8).
- L. Le Cam (1986) Asymptotic methods in statistical decision theory. Springer Series in Statistics, Springer-Verlag, New York.
- Yu. A. Davydov (1973) Mixing conditions for Markov chains. Theory of Probability and its Applications 18(2), pp. 312–328. External Links: [Document](https://dx.doi.org/https%3A//doi.org/10.1137/1118033).
- J. D. Hamilton (1989) A new approach to the economic analysis of nonstationary time series and the business cycle. Econometrica 57(2), pp. 357–384. External Links: [Document](https://dx.doi.org/https%3A//doi.org/10.2307/1912559).
- P. Jorion (2007) Value at risk: the new benchmark for managing financial risk. 3rd edition, McGraw-Hill. External Links: ISBN 978-0071464956.
- D. Kifer, S. Ben-David, and J. Gehrke (2004) Detecting change in data streams. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB 2004), Toronto, Canada, pp. 180–191. External Links: [Link](http://www.vldb.org/conf/2004/RS5P1.PDF), [Document](https://dx.doi.org/10.1016/B978-012088469-8.50019-X).
- D. Krueger, E. Caballero, J. Jacobsen, A. Zhang, J. Binas, D. Zhang, R. Le Priol, and A. Courville (2021) Out-of-distribution generalization via risk extrapolation. International Conference on Machine Learning (ICML). External Links: [Document](https://dx.doi.org/10.48550/arXiv.2003.00688).
- Y. Mansour, M. Mohri, and A. Rostamizadeh (2009) Domain adaptation: learning bounds and algorithms. In Proceedings of the 22nd Annual Conference on Learning Theory (COLT 2009), Montréal, Canada.
- M. Mohri, A. Rostamizadeh, and A. Talwalkar (2018) Foundations of machine learning. 2nd edition, The MIT Press, Cambridge, MA.
- J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence (2008) Dataset shift in machine learning. MIT Press. External Links: ISBN 9780262170055.
- Y. Shu, C. Yu, and J. M. Mulvey (2024) Dynamic asset allocation with asset-specific regime forecasts. Annals of Operations Research 346, pp. 285–318. External Links: [Document](https://dx.doi.org/10.1007/s10479-024-06266-0).
- S. Staehr et al. (2024) Forecasting stock returns with regime-switching models. Journal of Financial Economics. Note: Working paper.
- A. L. Suárez Cetrulo, D. Quintana, and A. Cervantes (2024) Machine learning for financial prediction under regime change using technical analysis: a systematic review. International Journal of Interactive Multimedia and Artificial Intelligence 9(1), pp. 137–148. External Links: [Document](https://dx.doi.org/10.9781/ijimai.2023.06.003).
- V. N. Vapnik (1998) Statistical learning theory. Wiley. External Links: ISBN 9780471030034.
- Q. Xiang, Z. Chen, Q. Sun, and R. Jiang (2024) RSAP-DFM: regime-shifting adaptive posterior dynamic factor model for stock returns prediction. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, pp. 6116–6124. External Links: [Document](https://dx.doi.org/10.24963/ijcai.2024/676).
- B. Yu (1994) Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability 22(1), pp. 94–116. External Links: [Document](https://dx.doi.org/http%3A//www.jstor.org/stable/2244496).
- B. Yu (1997) Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam, D. Pollard, E. Torgersen, and G. L. Yang (Eds.), pp. 423–435. External Links: [Document](https://dx.doi.org/10.1007/978-1-4612-1880-7%5F29).
- A. Zaremba and N. Cakici (2024) What drives stock returns across countries? insights from machine learning models. International Review of Financial Analysis 96, pp. 103576. External Links: ISSN 1057-5219, [Document](https://dx.doi.org/10.1016/j.irfa.2024.103576).
- C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2017) Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations (ICLR). External Links: [Document](https://dx.doi.org/10.48550/arXiv.1611.03530).

## Acknowledgments

The author thanks AI for assistance with language polishing and LaTeX formatting. All intellectual content is solely the author's own.

## Appendix A Proofs

We prove the results that are original to this work or essential to the main bound. Throughout, $\ell\in[0,1]$ and all expectations are under the stated distributions. Two standard facts are used without proof:

- (F1) $\tfrac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(Q,R)\leq d_{\mathrm{TV}}(Q,R)$, since every disagreement set $\{x:h(x)\neq h^{\prime}(x)\}$ is an event and the supremum over such sets is dominated by the supremum over all events (Definition [3.4](https://arxiv.org/html/2606.02657#S3.Thmtheorem4)).
- (F2) The effective-sample-size bound $n_{\mathrm{eff}}\geq ng/(\ln(nC_{\mu})+2)$ and the crisis-fraction concentration $|\widehat{\pi}-\pi|\leq\eta_{\pi}$ are direct applications of the mixing inequalities of Yu ([1994](https://arxiv.org/html/2606.02657#bib.bib7)) to our two-state chain, with block length $b=\lceil\ln(nC_{\mu})/g\rceil$ chosen so that $\beta(b)\leq 1/n$.
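The quantities in (F2) are simple to evaluate. The sketch below computes the block length and the effective-sample-size lower bound, and checks the requirement $\beta(b)\leq 1/n$ under an assumed geometric mixing bound $\beta(k)\leq C_{\mu}(1-g)^{k}$. That form is an assumption made for illustration and may differ from the definition of the mixing coefficient used elsewhere in the paper. Under it, $(1-g)^{b}\leq e^{-gb}\leq 1/(nC_{\mu})$, which gives $\beta(b)\leq 1/n$. Function names and inputs are hypothetical.

```python
import math


def block_length(n, c_mu, g):
    """b = ceil(ln(n * C_mu) / g); requires n * C_mu > 1 and 0 < g <= 1."""
    if n * c_mu <= 1.0 or not 0.0 < g <= 1.0:
        raise ValueError("need n * c_mu > 1 and g in (0, 1]")
    return math.ceil(math.log(n * c_mu) / g)


def effective_sample_size_lower_bound(n, c_mu, g):
    """Lower bound n * g / (ln(n * C_mu) + 2) on the effective sample size."""
    if n * c_mu <= 1.0 or not 0.0 < g <= 1.0:
        raise ValueError("need n * c_mu > 1 and g in (0, 1]")
    return n * g / (math.log(n * c_mu) + 2.0)


def assumed_beta_bound(k, c_mu, g):
    """Assumed geometric mixing bound C_mu * (1 - g)**k (not from the paper)."""
    return c_mu * (1.0 - g) ** k


# Hypothetical inputs, for illustration only.
n, c_mu, g = 2000, 2.0, 0.1
b = block_length(n, c_mu, g)
print("block length:", b)
print("n_eff lower bound:", effective_sample_size_lower_bound(n, c_mu, g))
print("beta(b) <= 1/n:", assumed_beta_bound(b, c_mu, g) <= 1.0 / n)
```

A small spectral gap inflates the block length and shrinks the effective sample size roughly in proportion to $g$, which is the sense in which persistence of regimes, rather than the nominal $n$, limits concentration.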

### A\.1Proof of Lemma[4\.1](https://arxiv.org/html/2606.02657#S4.Thmtheorem1)\(future\-mix identity\)

###### Proof\.

Step 1 \(expand each risk\)\. By Definition [3\.3](https://arxiv.org/html/2606.02657#S3.Thmtheorem3) and linearity of expectation,

$$R_{\mathrm{future}}(f)=p_{00}R_{0}(f)+p_{01}R_{1}(f),\qquad R_{\mathrm{mix}}(f)=(1-\pi)R_{0}(f)+\pi R_{1}(f).$$

Step 2 \(subtract\)\.

$$R_{\mathrm{future}}(f)-R_{\mathrm{mix}}(f)=\bigl[p_{00}-(1-\pi)\bigr]R_{0}(f)+\bigl[p_{01}-\pi\bigr]R_{1}(f).$$

Step 3 \(use $p_{00}=1-p_{01}$\)\. The coefficient of $R_{0}(f)$ becomes $(1-p_{01})-(1-\pi)=\pi-p_{01}$, so

$$R_{\mathrm{future}}(f)-R_{\mathrm{mix}}(f)=(\pi-p_{01})R_{0}(f)+(p_{01}-\pi)R_{1}(f).$$

Step 4 \(factor\)\. Since $\pi-p_{01}=-(p_{01}-\pi)$,

$$R_{\mathrm{future}}(f)-R_{\mathrm{mix}}(f)=(p_{01}-\pi)\bigl(R_{1}(f)-R_{0}(f)\bigr).$$

∎
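The identity can be checked numerically\. The sketch below evaluates both sides for hypothetical values of $p_{01}$, $\pi$, $R_{0}(f)$ and $R_{1}(f)$; the function name and the numbers are assumptions chosen only for illustration\.

```python
def future_mix_gap(p01, pi, r0, r1):
    """Return (direct difference, factored form) of R_future - R_mix."""
    p00 = 1.0 - p01
    r_future = p00 * r0 + p01 * r1          # Step 1, deployment risk
    r_mix = (1.0 - pi) * r0 + pi * r1       # Step 1, training-mix risk
    direct = r_future - r_mix               # Step 2
    factored = (p01 - pi) * (r1 - r0)       # Step 4
    return direct, factored


# Hypothetical values: crises arrive more often in deployment than in training.
direct, factored = future_mix_gap(p01=0.30, pi=0.10, r0=0.05, r1=0.40)
print(direct, factored)
```

Both values agree up to floating\-point rounding, and the gap changes sign when $p_{01}<\pi$ or when the crisis regime is the easier one\.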

### A\.2 Proof of Corollary [4\.3](https://arxiv.org/html/2606.02657#S4.Thmtheorem3) \(regime\-shift inequality\)

###### Proof\.

Step 1 \(bound the gap in absolute value\)\. From Lemma [4\.1](https://arxiv.org/html/2606.02657#S4.Thmtheorem1) and $|ab|=|a|\,|b|$,

$$R_{\mathrm{future}}(f)\leq R_{\mathrm{mix}}(f)+|p_{01}-\pi|\,\bigl|R_{1}(f)-R_{0}(f)\bigr|.$$

Step 2 \(relate the regime risks via a reference hypothesis\)\. For the $0/1$ loss write $R_{i}(f)=\mathbb{P}_{x\sim P_{i}}[f(x)\neq y]$, and let $h^{\star}=\arg\min_{h\in\mathcal{H}}[R_{0}(h)+R_{1}(h)]$\. The triangle inequality for classification disagreement \(for $\{0,1\}$\-valued $a,b,c$, $\mathbb{P}[a\neq c]\leq\mathbb{P}[a\neq b]+\mathbb{P}[b\neq c]$\) gives

$$R_{1}(f)\leq R_{0}(f)+\bigl|\mathbb{P}_{P_{1}}[f\neq h^{\star}]-\mathbb{P}_{P_{0}}[f\neq h^{\star}]\bigr|+\lambda_{01}.$$

Step 3 \(bound the middle term by the divergence\)\. Since $\mathcal{F}\subseteq\mathcal{H}$, the disagreement $f\oplus h^{\star}\in\mathcal{H}\Delta\mathcal{H}$, hence

$$\bigl|\mathbb{P}_{P_{1}}[f\neq h^{\star}]-\mathbb{P}_{P_{0}}[f\neq h^{\star}]\bigr|\leq\sup_{h,h^{\prime}\in\mathcal{H}}\bigl|\mathbb{P}_{P_{1}}[h\neq h^{\prime}]-\mathbb{P}_{P_{0}}[h\neq h^{\prime}]\bigr|=\tfrac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(P_{1},P_{0}).$$

Step 4 \(symmetrize\)\. Repeating Steps 2 and 3 with the roles of $P_{0},P_{1}$ exchanged \(the term $\lambda_{01}$ is symmetric\) yields the two\-sided bound

$$\bigl|R_{1}(f)-R_{0}(f)\bigr|\leq\tfrac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(P_{1},P_{0})+\lambda_{01}.$$

Step 5 \(combine\)\. Substituting Step 4 into Step 1,

$$R_{\mathrm{future}}(f)\leq R_{\mathrm{mix}}(f)+|p_{01}-\pi|\bigl(\tfrac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(P_{1},P_{0})+\lambda_{01}\bigr),$$

and $\lambda_{01}=0$ gives the realizable form\. ∎
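To see how the inequality behaves, the sketch below evaluates its right\-hand side for hypothetical inputs\. The function name and all numeric values are assumptions; `half_d` stands for $\tfrac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(P_{1},P_{0})$ and is picked so that the Step 4 condition $|R_{1}(f)-R_{0}(f)|\leq\tfrac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(P_{1},P_{0})+\lambda_{01}$ holds for the chosen risks\.

```python
def regime_shift_bound(r_mix, p01, pi, half_d, lam01=0.0):
    """Right-hand side of the regime-shift inequality; lam01 = 0 is the realizable form."""
    return r_mix + abs(p01 - pi) * (half_d + lam01)


# Hypothetical per-regime risks and regime frequencies.
r0, r1 = 0.05, 0.40
p01, pi = 0.30, 0.10
half_d, lam01 = 0.40, 0.05        # satisfies |r1 - r0| <= half_d + lam01

r_mix = (1.0 - pi) * r0 + pi * r1
r_future = (1.0 - p01) * r0 + p01 * r1

bound = regime_shift_bound(r_mix, p01, pi, half_d, lam01)
# With no composition mismatch (p01 == pi) the penalty vanishes entirely.
no_shift = regime_shift_bound(r_mix, p01=pi, pi=pi, half_d=half_d, lam01=lam01)
print(r_future, bound, no_shift)
```

The penalty scales linearly in the mismatch $|p_{01}-\pi|$, so a large divergence between regimes costs nothing when the deployment mix matches the training mix\.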

### A\.3 Proof of Theorem [4\.11](https://arxiv.org/html/2606.02657#S4.Thmtheorem11) \(mixing\-aware convergence\)

###### Proof\.

Step 1 \(i\.i\.d\. base bound\)\. For independent samples of size $m^{\prime}$ per domain and a discriminator class of VC dimension $D$ \(Vapnik, [1998](https://arxiv.org/html/2606.02657#bib.bib3)\), uniform VC deviation \(Kifer et al\., [2004](https://arxiv.org/html/2606.02657#bib.bib11); Ben\-David et al\., [2010](https://arxiv.org/html/2606.02657#bib.bib1)\) gives, with probability $\geq 1-\delta$,

$$d_{\mathcal{H}\Delta\mathcal{H}}(P_{0},P_{1})\leq\widehat{d}_{\mathcal{H}\Delta\mathcal{H}}(U_{0},U_{1})+4\sqrt{\frac{D\log(2m^{\prime})+\log(2/\delta)}{m^{\prime}}}.$$

Apply this with $D=\mathrm{VCdim}(\mathcal{H}\Delta\mathcal{H})$\. A standard result \(Mohri et al\., [2018](https://arxiv.org/html/2606.02657#bib.bib10), Lemma 3\.5\) gives $\mathrm{VCdim}(\mathcal{H}\Delta\mathcal{H})\leq 2d\log_{2}(2d)=O(d\log d)$\. For readability, we state the bound with $D$ replaced by $2d$; the omitted logarithmic factor does not affect the asymptotic rate\.

Step 2 \(pay for dependence by blocking; the novel step\)\. The $2m^{\prime}$ feature points are $\beta$\-mixing \(Yu, [1994](https://arxiv.org/html/2606.02657#bib.bib7)\)\. Partition each domain sample into blocks of length $b=\lceil\ln(m^{\prime}C_{\mu})/g\rceil$, so that $\beta(b)\leq 1/m^{\prime}$; the

$$m^{\prime}_{\mathrm{eff}}\ \geq\ \frac{m^{\prime}g}{\ln(m^{\prime}C_{\mu})+2}$$

blocks act as approximately independent draws\. The bound of Step 1 then holds with $m^{\prime}$ replaced by $m^{\prime}_{\mathrm{eff}}$, the extra $\beta(b)\leq 1/m^{\prime}$ absorbed into constants\.

Step 3 \(substitute and halve\)\. Using $1/m^{\prime}_{\mathrm{eff}}\leq(\ln(m^{\prime}C_{\mu})+2)/(m^{\prime}g)$ in the Step\-1 root and dividing by $2$,

$$\tfrac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(P_{0},P_{1})\leq\tfrac{1}{2}\widehat{d}_{\mathcal{H}\Delta\mathcal{H}}(U_{0},U_{1})+\underbrace{2\sqrt{\frac{\bigl(2d\log(2m^{\prime})+\log(2/\delta)\bigr)\bigl(\ln(m^{\prime}C_{\mu})+2\bigr)}{m^{\prime}g}}}_{=:~\eta_{d}}.$$

∎
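The quantities in Steps 2 and 3 are simple to evaluate\. The Python sketch below computes the block length $b$, the lower bound on $m^{\prime}_{\mathrm{eff}}$, and the slack $\eta_{d}$\. The function names and numeric inputs are hypothetical\. The check that $\beta(b)\leq 1/m^{\prime}$ additionally assumes a geometric mixing profile $\beta(k)\leq C_{\mu}e^{-gk}$, which is consistent with the stated block length but is not spelled out in this appendix\.

```python
import math


def block_length(m, g, c_mu):
    """b = ceil(ln(m * C_mu) / g), the block length used in Step 2."""
    return math.ceil(math.log(m * c_mu) / g)


def effective_size_lower(m, g, c_mu):
    """Lower bound m * g / (ln(m * C_mu) + 2) on the effective sample size."""
    return m * g / (math.log(m * c_mu) + 2.0)


def eta_d(m, g, c_mu, d, delta):
    """Slack term eta_d from Step 3, with D replaced by 2d as in Step 1."""
    complexity = 2.0 * d * math.log(2.0 * m) + math.log(2.0 / delta)
    dependence = math.log(m * c_mu) + 2.0
    return 2.0 * math.sqrt(complexity * dependence / (m * g))


# Hypothetical inputs: spectral gap g, mixing constant C_mu, VC dimension d.
m, g, c_mu, d, delta = 10_000, 0.1, 1.0, 5, 0.05
b = block_length(m, g, c_mu)
beta_at_b = c_mu * math.exp(-g * b)      # assumed geometric mixing profile
print(b, beta_at_b, effective_size_lower(m, g, c_mu), eta_d(m, g, c_mu, d, delta))
```

For these inputs $\eta_{d}$ is larger than $1$, so the bound is vacuous at this sample size; a small spectral gap $g$ inflates the slack by a factor of roughly $\sqrt{\ln(m^{\prime}C_{\mu})/g}$ relative to the i\.i\.d\. case\.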

### A\.4 Proof of Theorem [4\.13](https://arxiv.org/html/2606.02657#S4.Thmtheorem13) \(main bound\)

###### Proof\.

Step 1 \(future to training population\)\. By Corollary [4\.3](https://arxiv.org/html/2606.02657#S4.Thmtheorem3),

$$R_{\mathrm{future}}(f)\leq R_{\mathrm{mix}}(f)+|p_{01}-\pi|\bigl(\tfrac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(P_{1},P_{0})+\lambda_{01}\bigr).$$

Step 2 \(population to sample\)\. The Rademacher bound for stationary $\beta$\-mixing sequences \(Mohri et al\., [2018](https://arxiv.org/html/2606.02657#bib.bib10); Yu, [1994](https://arxiv.org/html/2606.02657#bib.bib7)\) gives, with probability $\geq 1-\delta$ and uniformly in $f$,

$$R_{\mathrm{mix}}(f)\leq\widehat{R}_{S}(f)+2\widehat{R}_{S}(\mathcal{L}_{\mathcal{F}})+3\sqrt{\frac{\ln(2/\delta)}{2n_{\mathrm{eff}}}}.$$

Step 3 \(explicit constant\)\. By \(F2\) and Corollary [4\.8](https://arxiv.org/html/2606.02657#S4.Thmtheorem8),

$$3\sqrt{\frac{\ln(2/\delta)}{2n_{\mathrm{eff}}}}\leq\Lambda(n,\delta).$$

Step 4 \(chain\)\. Combining Steps 1 through 3,

$$R_{\mathrm{future}}(f)\leq\widehat{R}_{S}(f)+2\widehat{R}_{S}(\mathcal{L}_{\mathcal{F}})+\Lambda(n,\delta)+|p_{01}-\pi|\bigl(\tfrac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(P_{1},P_{0})+\lambda_{01}\bigr).$$

The only stochastic step is Step 2, so the bound holds with probability $\geq 1-\delta$\. ∎
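The chain in Step 4 is a sum of four terms, which the following sketch assembles\. The exact form of $\Lambda(n,\delta)$ is given in Corollary 4\.8 and is not reproduced in this appendix, so `deviation_term` is only a stand\-in: it plugs the \(F2\) lower bound on $n_{\mathrm{eff}}$ into the Step 2 deviation\. All function names and numeric inputs are hypothetical\.

```python
import math


def deviation_term(n, g, c_mu, delta):
    """3 * sqrt(ln(2/delta) / (2 * n_eff)) with n_eff replaced by its (F2) lower bound.

    Assumption: used here in the role of Lambda(n, delta); the paper's exact
    constant from Corollary 4.8 may differ.
    """
    n_eff_lower = n * g / (math.log(n * c_mu) + 2.0)
    return 3.0 * math.sqrt(math.log(2.0 / delta) / (2.0 * n_eff_lower))


def main_bound(emp_risk, rademacher, deviation, p01, pi, half_d, lam01=0.0):
    """Right-hand side of Step 4: empirical risk + complexity + deviation + regime penalty."""
    penalty = abs(p01 - pi) * (half_d + lam01)
    return emp_risk + 2.0 * rademacher + deviation + penalty


# Hypothetical inputs for a long training sample with a moderate spectral gap.
dev = deviation_term(n=1_000_000, g=0.1, c_mu=1.0, delta=0.05)
bound = main_bound(emp_risk=0.10, rademacher=0.02, deviation=dev,
                   p01=0.30, pi=0.10, half_d=0.40)
print(dev, bound)
```

Only the first three terms shrink as $n$ grows; the regime penalty depends on $|p_{01}-\pi|$ and on the divergence between regimes, neither of which is reduced by a longer training sample\.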

### A\.5 Proof of Theorem [4\.15](https://arxiv.org/html/2606.02657#S4.Thmtheorem15) \(lower bound at $\pi=0$\)

###### Proof\.

Step 1 \(construction\)\. Let the feature space be $\{x^{\star}\}\cup C$ with

$$\mathbb{P}_{P_{0}}(x^{\star})=0,\qquad\mathbb{P}_{P_{1}}(x^{\star})=2\rho,$$

the remaining mass on $C$\. Labels on $C$ are deterministic and identical across worlds; on $x^{\star}$ the label is $\mathrm{Bernoulli}(\tfrac{1}{2}\pm\beta)$ in worlds $a,b$\. Let $\mathcal{H}$ contain $h_{+},h_{-}$ that agree on $C$ and disagree on $x^{\star}$, so $h_{+}\oplus h_{-}=\mathbf{1}[x=x^{\star}]$\.

Step 2 \(the construction realizes $\tfrac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}=\rho$\)\. The only detectable distinguishing set is $\{x^{\star}\}$, with frequency $2\rho$ under $P_{1}$ and $0$ under $P_{0}$; hence

$$\tfrac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(P_{1},P_{0})=\tfrac{1}{2}\,|2\rho-0|=\rho.$$

Step 3 \(future excess on $x^{\star}$\)\. Under $P_{\mathrm{future}}$ the point $x^{\star}$ carries mass $p_{01}\cdot 2\rho$\. For a learner predicting $1$ on $x^{\star}$ with probability $q$, the per\-unit\-mass excess over the world\-optimal rule is $2\beta(1-q)$ in world $a$ and $2\beta q$ in world $b$\.

Step 4 \(the case $\pi=0$\)\. When $\pi=0$, training never contains $x^{\star}$, so the two worlds induce identical training laws \($\mathrm{TV}=0$\): the learner cannot distinguish them\. Taking deterministic opposite labels \($\beta\to\tfrac{1}{2}$\) and the minimax choice $q=\tfrac{1}{2}$ \(Cam, [1986](https://arxiv.org/html/2606.02657#bib.bib12); Yu, [1997](https://arxiv.org/html/2606.02657#bib.bib13)\),

$$\max_{w\in\{a,b\}}\mathbb{E}[\text{excess}]\ \geq\ p_{01}(2\rho)\cdot\tfrac{1}{2}\ =\ p_{01}\,\rho\ =\ p_{01}\cdot\tfrac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(P_{1},P_{0}),$$

uniformly in $n$\.

Step 5 \(the case $\pi>0$\)\. The expected number of revealing $x^{\star}$\-points in training is $n_{\mathrm{eff}}\,\pi\cdot 2\rho$, so by Pinsker’s inequality $\mathrm{TV}\leq 2\beta\sqrt{2n_{\mathrm{eff}}\pi\rho}$\. Optimizing $\beta$ subject to $\mathrm{TV}\leq\tfrac{1}{2}$ gives a worst\-case excess of order

$$p_{01}\sqrt{\frac{\rho}{n_{\mathrm{eff}}\,\pi}}\ =\ \Theta\!\bigl(1/\sqrt{n_{\mathrm{eff}}}\bigr),$$

which vanishes with $n$ and is therefore not matched by an $n$\-uniform constant\. ∎
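The minimax step for $\pi=0$ can be illustrated by brute force\. The sketch below takes the excess formulas of Step 3 with $\beta=\tfrac{1}{2}$ and searches a grid of randomization probabilities $q$ for the one that minimizes the worst case over the two worlds\. The grid, the function name, and the values of $p_{01}$ and $\rho$ are assumptions for illustration\.

```python
def worst_case_excess(q, p01, rho, beta=0.5):
    """Max over worlds a, b of the future excess risk for a learner that
    predicts 1 on x_star with probability q (formulas from Step 3)."""
    mass = p01 * 2.0 * rho                 # future mass of x_star
    excess_a = 2.0 * beta * (1.0 - q)      # per-unit-mass excess in world a
    excess_b = 2.0 * beta * q              # per-unit-mass excess in world b
    return mass * max(excess_a, excess_b)


p01, rho = 0.2, 0.1                        # hypothetical values
grid = [i / 100 for i in range(101)]       # candidate values of q
best_q = min(grid, key=lambda q: worst_case_excess(q, p01, rho))
minimax_value = worst_case_excess(best_q, p01, rho)
print(best_q, minimax_value, p01 * rho)
```

Any deviation from $q=\tfrac{1}{2}$ helps in one world and hurts by the same amount in the other, so no training\-only strategy falls below $p_{01}\rho$ in the worst case\.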

### A\.6 Proof of Proposition [4\.16](https://arxiv.org/html/2606.02657#S4.Thmtheorem16) \(certification cost\)

###### Proof\.

Step 1 \(a valid certificate dominates the worst future\)\. A valid certificate $U(S)$ must hold against every admissible future market condition\. Because the future regime is not known with certainty, $U(S)$ must cover the least favorable $p_{01}$ in the confidence set\. Using $\inf_{g}R_{\mathrm{future}}(g)\leq\widehat{R}_{S}(f)+\Lambda(n,\delta)$ and dropping the nonnegative terms $2\widehat{R}_{S}(\mathcal{L}_{\mathcal{F}})$ and $\lambda_{01}$, the residual satisfies

$$U(S)-\widehat{R}_{S}(f)-\Lambda(n,\delta)\ \geq\ |p_{01}-\pi|\,\Delta.$$

Step 2 \(replace $\pi$ by its estimate\)\. By the reverse triangle inequality and the concentration bound \(F2\),

$$|p_{01}-\pi|\ \geq\ |p_{01}-\widehat{\pi}|-|\widehat{\pi}-\pi|\ \geq\ |p_{01}-\widehat{\pi}|-\eta_{\pi}.$$

Combining the two steps gives the claim\. When $\pi=p_{01}$, Lemma [4\.1](https://arxiv.org/html/2606.02657#S4.Thmtheorem1) makes the realized gap exactly zero, so the bound is a property of certificates, not of realized risk\. ∎
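Combining the two steps yields the lower bound $(|p_{01}-\widehat{\pi}|-\eta_{\pi})\,\Delta$ on the certificate residual\. The sketch below evaluates it and confirms that it never exceeds $|p_{01}-\pi|\,\Delta$ for any $\pi$ within $\eta_{\pi}$ of the estimate\. The function name and all numbers are hypothetical, and `gap` stands for the quantity $\Delta$ of Proposition 4\.16, which is defined in the main text rather than in this appendix\.

```python
def certification_cost(p01, pi_hat, eta_pi, gap):
    """Lower bound (|p01 - pi_hat| - eta_pi) * gap on U(S) - R_hat_S(f) - Lambda(n, delta).

    A negative value is uninformative: Step 1 already gives a nonnegative
    residual when gap >= 0.
    """
    return (abs(p01 - pi_hat) - eta_pi) * gap


# Hypothetical inputs: estimated crisis fraction 0.10 with concentration radius 0.05.
p01, pi_hat, eta_pi, gap = 0.30, 0.10, 0.05, 0.40
cost = certification_cost(p01, pi_hat, eta_pi, gap)

# Every true pi compatible with the estimate gives a Step 1 bound at least as large.
candidates = [pi_hat - eta_pi + k * (2.0 * eta_pi) / 20 for k in range(21)]
step1_values = [abs(p01 - pi) * gap for pi in candidates]
print(cost, min(step1_values))
```

The cost depends on the assumed deployment value $p_{01}$, which training data alone does not pin down; this is why the bound is a statement about certificates rather than about realized risk\.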

### A\.7 Diagnostic: Domain Classifier and Training\-Only Penalty

Table 4: Domain classifier performance and training\-only penalty correlation\.

The synthetic penalty shows a strong correlation \(0\.716\), which confirms that the mechanism works under ideal conditions\. Real data achieves almost perfect regime separation \(0\.93\), yet the correlation is near zero \(0\.084\), indicating that estimating $p_{01}$, rather than detecting regimes, is the fundamental bottleneck\.