# A Theory of Online Learning with Autoregressive Chain-of-Thought Reasoning
Source: [https://arxiv.org/html/2605.06819](https://arxiv.org/html/2605.06819)
###### Abstract
Autoregressive generation lies at the heart of the mechanism of large language models. It can be viewed as the repeated application of a next-token generator: starting from an input string (prompt), the generator is applied for $M$ steps, and the last generated token is taken as the final output. [[26](https://arxiv.org/html/2605.06819#bib.bib71)] proposed a PAC model for studying the learnability of the input-output maps arising from this process. We develop an online analogue of this framework, focusing on the mistake bound of learning the final output induced by an unknown next-token generator.
We distinguish between two forms of feedback. In the End-to-End model, after each round the learner observes only the final token produced after $M$ autoregressive steps. In the Chain-of-Thought model, the learner is additionally shown the entire $M$-step trajectory. Our goal is to understand how the optimal mistake bound depends on the generation horizon $M$, and to what extent observing intermediate tokens can reduce this dependence.
Our main results show that the online theory of autoregressive learning exhibits a qualitative picture analogous to the statistical one found by [[20](https://arxiv.org/html/2605.06819#bib.bib93)], but with a different scale of dependence on the generation horizon. In the End-to-End model, we prove a taxonomy of possible mistake-bound growth rates in the generation horizon $M$: subject to mild regularity conditions, every rate between constant and logarithmic can arise. We also show that this logarithmic ceiling is unavoidable for the online setting, in the sense that every class of finite Littlestone dimension exhibits at most logarithmic dependence on $M$. In the Chain-of-Thought model, we show that access to the full generated trajectory eliminates the dependence on $M$ altogether, mirroring the picture proved in the statistical setting.
We also analyze autoregressive linear threshold classes. For autoregressive linear thresholds in $\mathbb{R}^{d}$, we prove that the optimal mistake bound is $\Theta(d^{2})$ under both End-to-End and Chain-of-Thought feedback, and for any generation length $M$. The methods developed for this analysis also yield new lower bounds in the statistical setting.
Along the way, our results resolve several questions left open by [[26](https://arxiv.org/html/2605.06819#bib.bib71)]. In particular, we show that even for classes of finite Littlestone dimension, the End-to-End statistical sample complexity can depend on the generation length $M$.
###### Contents
1. [1 Introduction](https://arxiv.org/html/2605.06819#S1)
    1. [1.1 The autoregressive online learning model](https://arxiv.org/html/2605.06819#S1.SS1)
2. [2 Results](https://arxiv.org/html/2605.06819#S2)
    1. [2.1 Background on the Littlestone dimension](https://arxiv.org/html/2605.06819#S2.SS1)
    2. [2.2 Learning with chain-of-thought supervision](https://arxiv.org/html/2605.06819#S2.SS2)
    3. [2.3 End-to-End learning](https://arxiv.org/html/2605.06819#S2.SS3)
    4. [2.4 Results for autoregressive linear classifiers](https://arxiv.org/html/2605.06819#S2.SS4)
    5. [2.5 Non-Littlestone classes](https://arxiv.org/html/2605.06819#S2.SS5)
    6. [2.6 A stochastic autoregressive lower bound](https://arxiv.org/html/2605.06819#S2.SS6)
3. [3 Proof sketches of the main results](https://arxiv.org/html/2605.06819#S3)
    1. [3.1 Proof sketch of Theorem 2.1: mistake bound for learning with CoT](https://arxiv.org/html/2605.06819#S3.SS1)
    2. [3.2 Proof sketch of Theorem 2.3: mistake bound for learning without CoT](https://arxiv.org/html/2605.06819#S3.SS2)
    3. [3.3 Proof sketch of Theorem 2.4: taxonomy for sub-logarithmic rates](https://arxiv.org/html/2605.06819#S3.SS3)
    4. [3.4 Proof sketch of Theorem 2.5: logarithmic lower bound](https://arxiv.org/html/2605.06819#S3.SS4)
    5. [3.5 Proof sketches of Theorem 2.7 and Theorem 2.8: bounds for linear classifiers](https://arxiv.org/html/2605.06819#S3.SS5)
4. [4 Future work](https://arxiv.org/html/2605.06819#S4)
5. [5 Additional related work](https://arxiv.org/html/2605.06819#S5)
6. [6 Technical preliminaries](https://arxiv.org/html/2605.06819#S6)
    1. [6.1 Littlestone trees](https://arxiv.org/html/2605.06819#S6.SS1)
    2. [6.2 Class dimensions, learnability, and an SSP lemma](https://arxiv.org/html/2605.06819#S6.SS2)
7. [7 Upper bound for end-to-end learning](https://arxiv.org/html/2605.06819#S7)
8. [8 The taxonomy of end-to-end learning](https://arxiv.org/html/2605.06819#S8)
    1. [8.1 Proof of Theorem 8.1](https://arxiv.org/html/2605.06819#S8.SS1)
    2. [8.2 Proof of Theorem 8.2](https://arxiv.org/html/2605.06819#S8.SS2)
9. [9 Agnostic taxonomy](https://arxiv.org/html/2605.06819#S9)
10. [10 Learning with chain-of-thought supervision](https://arxiv.org/html/2605.06819#S10)
11. [11 Autoregressive Linear classifiers](https://arxiv.org/html/2605.06819#S11)
    1. [11.1 Mistake and sample complexity bounds](https://arxiv.org/html/2605.06819#S11.SS1)
    2. [11.2 Bounds for Efficient learning](https://arxiv.org/html/2605.06819#S11.SS2)
12. [12 Non-Littlestone base classes](https://arxiv.org/html/2605.06819#S12)
13. [13 Stochastic Autoregressive Lower Bound](https://arxiv.org/html/2605.06819#S13)
14. [References](https://arxiv.org/html/2605.06819#bib)
## 1 Introduction
Modern large language models are not learned once from a fixed collection of independent examples and then used in isolation. They are trained on data accumulated over time, updated through successive rounds of optimization, deployed in interactive environments, and increasingly adapted to the particular users and contexts in which they operate. In such settings, the classical picture of learning from an i.i.d. sample captures only part of the story. The learner observes a stream of prompts, receives feedback, changes its behavior, and is then evaluated on future interactions. This is precisely the kind of situation for which online learning was designed.
At the same time, the objects being learned are fundamentally autoregressive. A language model does not usually produce an answer in a single atomic step. Rather, starting from a prompt, it generates one token at a time; each newly generated token becomes part of the context for the next prediction. After many such steps, the final token, sentence, or answer is interpreted as the model's output. Thus, even when the desired task is an end-to-end prediction problem, the mechanism producing the prediction is an iterated next-token process.
This creates a natural distinction between two forms of supervision. In one regime, the learner observes only the final answer produced after the autoregressive process has run for $M$ steps. This corresponds to *end-to-end supervision*: the intermediate reasoning, if any, is hidden. In a stronger regime, the learner also observes the entire generated trajectory. This corresponds to *chain-of-thought supervision*: the learner sees not only what answer was produced, but also the sequence of intermediate tokens through which the generator arrived there. Intuitively, such intermediate information should be useful, as it exposes more of the underlying next-token generation mechanism.
A recent line of work initiated by [[26](https://arxiv.org/html/2605.06819#bib.bib71)] introduced a formal framework for studying this phenomenon in the PAC setting. In their model, the primitive object is a next-token generator $f:\Sigma^{\star}\to\Sigma$, where $\Sigma$ is the alphabet of possible tokens. Iterating $f$ for $M$ steps from an input string produces an $M$-token autoregressive trajectory, whose last token is the end-to-end output. This framework makes it possible to ask, in precise learning-theoretic terms, how the difficulty of learning the induced input-output map depends on the generation length $M$, and how much chain-of-thought supervision can reduce this dependence. These two fundamental questions were studied in the PAC-learning framework of [[26](https://arxiv.org/html/2605.06819#bib.bib71)] by [[20](https://arxiv.org/html/2605.06819#bib.bib93)].
The present paper develops the online analogue of this theory. Rather than receiving an i.i.d. training sample, the learner interacts with an adversarially chosen sequence of examples. On each round, the learner is given a prompt, predicts the final token generated after $M$ autoregressive steps, and then receives feedback. In the end-to-end model, the feedback consists only of the final token. In the chain-of-thought model, the feedback consists of the full $M$-step trajectory. The central quantity in this setting is no longer sample complexity, but the optimal mistake bound: how many prediction errors are unavoidable before the learner has identified enough of the hidden generator to perform well on all subsequent rounds?
There is also a second, more structural reason to study the online model. Online learnability is often a stronger and more robust notion than ordinary distributional learnability. A class that can be learned in the online mistake-bound model is not merely learnable from an i.i.d. sample; it often remains learnable under additional constraints that make standard PAC learning more demanding. A prominent example is private PAC learning, which is equivalent to online learning, in the sense that a class is private PAC learnable if and only if it is online learnable [[2](https://arxiv.org/html/2605.06819#bib.bib68)]. Thus, online learnability serves as a powerful certificate of learnability: it indicates that the class has enough combinatorial structure to be learned even in settings where the data are sequential, adversarial, or subject to privacy constraints.
We now turn to the formal definition of the model under study.
### 1.1 The autoregressive online learning model
We start by recalling the general autoregressive learning framework presented in [[26](https://arxiv.org/html/2605.06819#bib.bib71)] and defined below. A *next-token generator* is a function
$$f:\Sigma^{\star}\to\Sigma,$$
where $\Sigma$ is some finite alphabet. In this paper, we focus on the binary case, so $\Sigma=\{0,1\}$ unless stated otherwise. The function $f$ naturally induces an *autoregressive generation* process. Define the *apply-and-append* map $\bar{f}:\Sigma^{\star}\to\Sigma^{\star}$ of $f$ by
$$\bar{f}(x):=x\circ f(x).$$
Denote
$$\bar{f}^{\,M}=\underbrace{\bar{f}\circ\cdots\circ\bar{f}}_{M\text{ times}}.$$
We define the *$M$-step chain of thought* of $f$ on $x$ to be the length-$M$ suffix of $\bar{f}^{M}(x)$, denoted $f^{\mathsf{CoT}\text{-}M}(x)$. The corresponding *end-to-end output* of the $M$-step chain of thought of $f$ on $x$ is the last symbol of $f^{\mathsf{CoT}\text{-}M}(x)$, denoted $f^{\mathsf{e2e}\text{-}M}(x):=f^{\mathsf{CoT}\text{-}M}(x)[-1]$.
So, each class $\mathcal{F}$ of next-token generators (also called a *base class*) naturally induces two classes of interest:
$$\mathcal{F}^{\mathsf{CoT}\text{-}M}:=\{f^{\mathsf{CoT}\text{-}M}:f\in\mathcal{F}\},\qquad\mathcal{F}^{\mathsf{e2e}\text{-}M}:=\{f^{\mathsf{e2e}\text{-}M}:f\in\mathcal{F}\}.$$
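To make the definitions concrete, here is a minimal Python sketch (our own illustration, not from the paper) of the apply-and-append map, the $M$-step chain of thought, and the end-to-end output, with a next-token generator given as a callable on binary strings; the `parity_generator` example is purely illustrative and not a class studied in the paper.

```python
from typing import Callable

NextTokenGenerator = Callable[[str], str]  # maps a binary string to a single token "0"/"1"

def apply_and_append(f: NextTokenGenerator, x: str) -> str:
    """The apply-and-append map: f_bar(x) = x ∘ f(x)."""
    return x + f(x)

def cot(f: NextTokenGenerator, x: str, M: int) -> str:
    """M-step chain of thought: the length-M suffix of f_bar^M(x)."""
    s = x
    for _ in range(M):
        s = apply_and_append(f, s)
    return s[-M:]

def e2e(f: NextTokenGenerator, x: str, M: int) -> str:
    """End-to-end output: the last symbol of the M-step chain of thought."""
    return cot(f, x, M)[-1]

# Illustrative generator (not from the paper): emit the parity of the current string.
def parity_generator(x: str) -> str:
    return str(x.count("1") % 2)

print(cot(parity_generator, "101", M=4))  # "0000"
print(e2e(parity_generator, "101", M=4))  # "0"
```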
In this paper, we are interested in learning the class $\mathcal{F}^{\mathsf{e2e}\text{-}M}$ online in two regimes: $\mathsf{e2e}$-learning, also called end-to-end learning, and $\mathsf{CoT}$-learning, also called end-to-end learning with chain-of-thought supervision. Let us start with the $\mathsf{e2e}$-learning regime. We consider a game between a learner $\mathsf{Lrn}$ and an adversary $\mathsf{Adv}$, played for $T$ rounds. Before the game begins, $\mathsf{Adv}$ chooses a *target function* $f_{\star}\in\mathcal{F}$. In each round $t\in[T]$, $\mathsf{Adv}$ first sends an instance $x_{t}\in\Sigma^{\star}$ to $\mathsf{Lrn}$, and then $\mathsf{Lrn}$ chooses a symbol (or a *label*, in classification-theory terminology) $\hat{y}_{t}\in\Sigma$. Finally, $\mathsf{Adv}$ reveals the correct end-to-end output $f_{\star}^{\mathsf{e2e}\text{-}M}(x_{t})$, and $\mathsf{Lrn}$ suffers loss $1$ if and only if $\hat{y}_{t}\neq f_{\star}^{\mathsf{e2e}\text{-}M}(x_{t})$. Most of our results hold already for a deterministic learner, so unless stated otherwise, we restrict our study to the case that $\mathsf{Lrn}$ is deterministic. We now define the $\mathsf{CoT}$-learning regime. $\mathsf{CoT}$-learning is very similar to $\mathsf{e2e}$-learning; the only difference is that in $\mathsf{CoT}$-learning, $\mathsf{Adv}$ reveals the entire chain of thought $f_{\star}^{\mathsf{CoT}\text{-}M}(x_{t})$, instead of only $f_{\star}^{\mathsf{e2e}\text{-}M}(x_{t})$.
In both regimes, $\mathsf{Lrn}$'s goal is to minimize its total cumulative loss throughout the game, also called the *mistake bound*; $\mathsf{Adv}$'s goal is to maximize this quantity. Note that $\mathsf{e2e}$-learning amounts to standard online learning of the class $\mathcal{F}^{\mathsf{e2e}\text{-}M}$. In contrast, in $\mathsf{CoT}$-learning $\mathsf{Lrn}$ receives $f_{\star}^{\mathsf{CoT}\text{-}M}(x_{t})$ as feedback from $\mathsf{Adv}$, but is tested only on the correctness of the final symbol of $f_{\star}^{\mathsf{CoT}\text{-}M}(x_{t})$, that is, on $f_{\star}^{\mathsf{e2e}\text{-}M}(x_{t})$.
The main goal of this paper is to understand how the mistake bound of autoregressive learning scales with the generation length $M$, and to what extent chain-of-thought supervision can help reduce it. Throughout the paper, we assume that $M\geq 2$, since the case $M=1$ is equivalent to the standard online learning scenario.
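For concreteness, the following Python sketch (our own schematic, not from the paper) plays the $T$-round game with a generic learner and counts end-to-end mistakes; under $\mathsf{CoT}$ feedback the learner receives the full trajectory, under $\mathsf{e2e}$ feedback only its last token. The names `play_game`, `learner_predict`, and `learner_update` are illustrative.

```python
from typing import Callable, List

NextTokenGenerator = Callable[[str], str]  # maps a binary string to the next token "0"/"1"

def cot(f: NextTokenGenerator, x: str, M: int) -> str:
    """M-step chain of thought of f on x (the M generated tokens)."""
    s = x
    for _ in range(M):
        s = s + f(s)
    return s[-M:]

def play_game(target: NextTokenGenerator, prompts: List[str], M: int,
              learner_predict: Callable[[str], str],
              learner_update: Callable[[str, str], None],
              cot_feedback: bool) -> int:
    """Play the T-round game and return the number of end-to-end mistakes."""
    mistakes = 0
    for x in prompts:                      # Adv sends the instance x_t
        y_hat = learner_predict(x)         # Lrn predicts the final token
        trajectory = cot(target, x, M)     # the target's full M-step trajectory
        if y_hat != trajectory[-1]:        # loss 1 iff the end-to-end output is wrong
            mistakes += 1
        # feedback: the full trajectory (CoT model) or only its last token (e2e model)
        learner_update(x, trajectory if cot_feedback else trajectory[-1])
    return mistakes

# Example: a trivial learner that always predicts "0" and ignores the feedback.
n_mistakes = play_game(target=lambda s: str(s.count("1") % 2), prompts=["0", "01", "11"],
                       M=3, learner_predict=lambda x: "0",
                       learner_update=lambda x, fb: None, cot_feedback=True)
```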
## 2 Results
This section assumes some familiarity with basic terms in learning theory, such as PAC learning. The unfamiliar reader may read Section [6](https://arxiv.org/html/2605.06819#S6), where we present all relevant technical background in a self-contained manner.
### 2.1 Background on the Littlestone dimension
We first give some relevant background on the Littlestone dimension, which is necessary to understand the results. The main combinatorial quantity that arises when analyzing mistake bounds for online learning a hypothesis class $\mathcal{H}\subset\mathcal{Y}^{\mathcal{X}}$ of functions from a *domain* $\mathcal{X}$ to a *label space* $\mathcal{Y}$ is the *Littlestone dimension* [[31](https://arxiv.org/html/2605.06819#bib.bib15),[12](https://arxiv.org/html/2605.06819#bib.bib9)] of $\mathcal{H}$, denoted $\mathtt{L}(\mathcal{H})$. Let us informally define the Littlestone dimension; a formal definition is given in Section [6.2](https://arxiv.org/html/2605.06819#S6.SS2). $\mathtt{L}(\mathcal{H})$ is the maximal depth of a perfect *Littlestone tree* that is shattered by $\mathcal{H}$. A perfect Littlestone tree for $\mathcal{H}$ is a rooted perfect binary tree whose internal vertices are labeled by instances from $\mathcal{X}$, and for each internal vertex, its two outgoing edges are labeled by two different labels from $\mathcal{Y}$. Such a tree is *shattered* by $\mathcal{H}$ if every branch $b$ (root-to-leaf path) in the tree has a function $h_{b}\in\mathcal{H}$ that agrees with $b$. Here, "agrees" means that for every instance $x\in\mathcal{X}$ labeling an internal vertex in $b$, if the outgoing edge of that vertex that lies in $b$ is labeled with $y\in\mathcal{Y}$, then $h_{b}(x)=y$. A fundamental result of [[31](https://arxiv.org/html/2605.06819#bib.bib15),[12](https://arxiv.org/html/2605.06819#bib.bib9)] states that the optimal mistake bound for learning $\mathcal{H}$ is precisely $\mathtt{L}(\mathcal{H})$.
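As a small illustration (ours, not part of the paper), the Littlestone dimension of a finite class over a finite domain can be computed by the standard recursion: $\mathtt{L}(\mathcal{H})\geq d+1$ if and only if some instance $x$ has both restrictions $\mathcal{H}_{x\to 0}$ and $\mathcal{H}_{x\to 1}$ of Littlestone dimension at least $d$. The following brute-force Python sketch implements this recursion, with hypotheses represented as dictionaries over a finite domain; it is exponential-time and meant only for tiny examples.

```python
from typing import Dict, List, Hashable

Hypothesis = Dict[Hashable, int]  # maps each domain point to a label in {0, 1}

def littlestone_dim(H: List[Hypothesis], domain: List[Hashable]) -> int:
    """Brute-force Littlestone dimension of a finite binary class (convention: L(empty) = -1)."""
    if not H:
        return -1
    best = 0
    for x in domain:
        H0 = [h for h in H if h[x] == 0]
        H1 = [h for h in H if h[x] == 1]
        if H0 and H1:  # x can root a shattered tree of depth 1 + min over the two subtrees
            best = max(best, 1 + min(littlestone_dim(H0, domain),
                                     littlestone_dim(H1, domain)))
    return best

# Example: singletons over a 3-point domain have Littlestone dimension 1.
domain = [0, 1, 2]
singletons = [{x: int(x == i) for x in domain} for i in domain]
assert littlestone_dim(singletons, domain) == 1
```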
Since $\mathsf{e2e}$-learning of $\mathcal{F}$ with generation length $M$ amounts to standard online learning of the class $\mathcal{F}^{\mathsf{e2e}\text{-}M}$, our analysis focuses on bounding $\mathtt{L}(\mathcal{F}^{\mathsf{e2e}\text{-}M})$. On the other hand, $\mathsf{CoT}$-learning is not a standard online learning task: while the loss is measured with respect to the target function $f_{\star}^{\mathsf{e2e}\text{-}M}\in\mathcal{F}^{\mathsf{e2e}\text{-}M}$, the feedback is given by the target function $f_{\star}^{\mathsf{CoT}\text{-}M}\in\mathcal{F}^{\mathsf{CoT}\text{-}M}$. A useful approach for proving upper bounds for $\mathsf{CoT}$-learning, exploited in [[26](https://arxiv.org/html/2605.06819#bib.bib71),[20](https://arxiv.org/html/2605.06819#bib.bib93)], is to replace $\mathsf{CoT}$-learning with the more difficult task of learning the class $\mathcal{F}^{\mathsf{CoT}\text{-}M}$. This task is indeed harder since, in contrast with $\mathsf{CoT}$-learning, any disagreement between the learner's full chain-of-thought prediction $\hat{y}\in\Sigma^{M}$ and the target trajectory counts as a mistake, even if the final end-to-end output is correct. We use the same approach and derive mistake bounds for $\mathsf{CoT}$-learning by bounding $\mathtt{L}(\mathcal{F}^{\mathsf{CoT}\text{-}M})$. Unless stated otherwise, we limit our discussion to base classes with finite Littlestone dimension. This limitation is analogous to the restriction to finite VC dimension in the statistical setting [[26](https://arxiv.org/html/2605.06819#bib.bib71),[20](https://arxiv.org/html/2605.06819#bib.bib93)]. Further justification for this limitation is given in Section [12](https://arxiv.org/html/2605.06819#S12), where we show that for non-Littlestone classes, learnability of $\mathcal{F}^{\mathsf{e2e}\text{-}M}$ can alternate between the two extremes of impossible even with $\mathsf{CoT}$-supervision, and trivial even without $\mathsf{CoT}$-supervision. A similar result was proved in [[20](https://arxiv.org/html/2605.06819#bib.bib93)] for the statistical setting, for non-VC classes.
### 2.2 Learning with chain-of-thought supervision
The results in this section are proved in Section [10](https://arxiv.org/html/2605.06819#S10). Our first main result shows that having $\mathtt{L}(\mathcal{F})<\infty$ and chain-of-thought supervision eliminates any dependence on the generation length $M$.
###### Theorem 2.1.
For every base class $\mathcal{F}\subset\Sigma^{\Sigma^{\star}}$ (even if $|\Sigma|>2$, and even for $|\Sigma|=\infty$), and for every generation length $M$, we have
$$\mathtt{L}(\mathcal{F}^{\mathsf{CoT}\text{-}M})\leq\mathtt{L}(\mathcal{F}).$$
Notably, Theorem [2.1](https://arxiv.org/html/2605.06819#S2.Thmtheorem1) holds even when $|\Sigma|>2$. [[20](https://arxiv.org/html/2605.06819#bib.bib93)] proved an analogous result for the statistical setting: having $\operatorname{VC}(\mathcal{F})<\infty$ (where $\operatorname{VC}(\mathcal{F})$ denotes the VC dimension of $\mathcal{F}$) and chain-of-thought supervision eliminates any dependence of the PAC sample complexity on the generation length $M$.
It is not hard to think of autoregressively degenerate classes (classes of functions that keep generating the same bit forever) for which $\mathtt{L}(\mathcal{F}^{\mathsf{CoT}\text{-}M})\geq\mathtt{L}(\mathcal{F})$ for all $M$, and so the bound of Theorem [2.1](https://arxiv.org/html/2605.06819#S2.Thmtheorem1) cannot be improved in general, not even by a constant factor.
Theorem [2.1](https://arxiv.org/html/2605.06819#S2.Thmtheorem1) can be used to improve, for some classes, the PAC $\mathsf{CoT}$-learning sample complexity bound given by [[20](https://arxiv.org/html/2605.06819#bib.bib93)]. Their result shows that the PAC sample complexity of $\mathsf{CoT}$-learning is proportional to $\operatorname{VC}(\mathcal{F})\cdot\operatorname{VC}^{\star}(\mathcal{F})$, where $\operatorname{VC}^{\star}(\mathcal{F})=\operatorname{VC}(\mathcal{F}^{\star})$ and $\mathcal{F}^{\star}$ is the dual class of $\mathcal{F}$, obtained by "switching roles" between $\mathcal{X}$ and $\mathcal{F}$. However, $\operatorname{VC}^{\star}(\mathcal{F})$ can grow as $2^{\operatorname{VC}(\mathcal{F})}$. Using a standard online-to-batch conversion, we deduce the following sample complexity bound. Notably, it holds even when $|\Sigma|>2$.
###### Theorem 2.2.
For every base class $\mathcal{F}\subset\Sigma^{\Sigma^{\star}}$ (even if $|\Sigma|>2$, and even for $|\Sigma|=\infty$), and for all $M$: if $\mathtt{L}(\mathcal{F})<\infty$, then the class $\mathcal{F}^{\mathsf{CoT}\text{-}M}$ is PAC-learnable with sample complexity
$$m(\varepsilon,\delta)=O\left(\frac{\mathtt{L}(\mathcal{F})+\log(1/\delta)}{\varepsilon}\right).$$
So, the advantage of Theorem [2.2](https://arxiv.org/html/2605.06819#S2.Thmtheorem2) over the result of [[20](https://arxiv.org/html/2605.06819#bib.bib93)] is that it holds for non-binary base classes $\mathcal{F}$, and also that it provides a better bound for every binary class $\mathcal{F}$ for which $\mathtt{L}(\mathcal{F})<\operatorname{VC}(\mathcal{F})\cdot\operatorname{VC}^{\star}(\mathcal{F})$, such as $d$-Hamming balls.
Along the way, Theorem [2.2](https://arxiv.org/html/2605.06819#S2.Thmtheorem2) also settles a $\log M$ gap from [[26](https://arxiv.org/html/2605.06819#bib.bib71)], who proved essentially the same bound as Theorem [2.2](https://arxiv.org/html/2605.06819#S2.Thmtheorem2), but with $\mathtt{L}(\mathcal{F})\log M$ in place of the $\mathtt{L}(\mathcal{F})$ in our bound. They asked whether the logarithmic dependence on $M$ is necessary, and our bound shows that it is not.
### 2.3 End-to-End learning
We first establish a general upper bound on how $\mathtt{L}(\mathcal{F}^{\mathsf{e2e}\text{-}M})$ can scale with $M$ in the worst case.
###### Theorem 2.3.
For every base class $\mathcal{F}$, and for every generation length $M$, we have
$$\mathtt{L}(\mathcal{F}^{\mathsf{e2e}\text{-}M})=O\bigl(\mathtt{L}(\mathcal{F})\log(\mathtt{L}(\mathcal{F})M)\bigr).$$
Theorem [2.3](https://arxiv.org/html/2605.06819#S2.Thmtheorem3) is proved in Section [7](https://arxiv.org/html/2605.06819#S7). While it gives a strong guarantee for any base class $\mathcal{F}$ (of finite Littlestone dimension), it is not clear how tight Theorem [2.3](https://arxiv.org/html/2605.06819#S2.Thmtheorem3) is, in contrast with the guarantee of Theorem [2.1](https://arxiv.org/html/2605.06819#S2.Thmtheorem1). Indeed, it is not hard to think of classes $\mathcal{F}$ for which $\mathtt{L}(\mathcal{F}^{\mathsf{e2e}\text{-}M})\leq\mathtt{L}(\mathcal{F})$ for all $M$, and it is not immediately clear whether there are classes for which $\mathtt{L}(\mathcal{F}^{\mathsf{e2e}\text{-}M})$ scales as $\mathtt{L}(\mathcal{F})\log M$, or even as some sub-logarithmic monotone function of $M$. This naturally raises the following question:
What are the possible growth rates of $\mathtt{L}(\mathcal{F}^{\mathsf{e2e}\text{-}M})$ in terms of $\mathtt{L}(\mathcal{F})$ and $M$?
This question was considered by [[20](https://arxiv.org/html/2605.06819#bib.bib93)] in the statistical setting. They proved that, under mild regularity conditions, essentially any at-most-linear growth rate of $\operatorname{VC}(\mathcal{F}^{\mathsf{e2e}\text{-}M})$ with $M$ is realized by some class $\mathcal{F}$. We prove the analogous result for the online setting, with mild regularity conditions similar to those of [[20](https://arxiv.org/html/2605.06819#bib.bib93)]. Let us explicitly state those conditions. Fix a sufficiently large universal constant $M_{0}$. We say that a function $r:\mathbb{N}_{+}\to\mathbb{N}_{+}$ is a *well-behaved sub-logarithmic growth rate* if:

1. $r$ is sub-logarithmic: $r(M)\leq\frac{1}{4}\log(M)$ for all $M\geq M_{0}$.
2. $r$ is non-decreasing: $M_{1}\geq M_{2}\implies r(M_{1})\geq r(M_{2})$ for all $M_{1},M_{2}\geq 1$.
3. $r$ is sub-additive: $r(M_{1}+M_{2})\leq r(M_{1})+r(M_{2})$ for all $M_{1},M_{2}\geq 1$.

Note that all natural sub-logarithmic growth rates, such as $\left\lceil\sqrt{\log M}\right\rceil$, $\left\lceil\log\log M\right\rceil$, etc., satisfy our definition. We now state the result for well-behaved sub-logarithmic rates.
###### Theorem 2.4.
For any well-behaved sub-logarithmic growth rate $r$ there exists a base class $\mathcal{F}$ such that:

1. $\mathtt{L}(\mathcal{F})=1$.
2. $\mathtt{L}(\mathcal{F}^{\mathsf{e2e}\text{-}M})=\Theta(r(M))$ for all $M\geq M_{0}$.
The definition of $\mathcal{F}$ and the lower-bound strategy of the second item are due to unpublished private communication [[21](https://arxiv.org/html/2605.06819#bib.bib100)]. Theorem [2.4](https://arxiv.org/html/2605.06819#S2.Thmtheorem4) handles only sub-logarithmic rates. For logarithmic rates, we prove the following.
###### Theorem 2.5.
For any $d\in\mathbb{N}_{+}$, there exists a class $\mathcal{F}$ such that $\mathtt{L}(\mathcal{F})=O(d)$ and
$$\operatorname{VC}(\mathcal{F}^{\mathsf{e2e}\text{-}M})=\Omega(\mathtt{L}(\mathcal{F})\log M)\quad\text{for all sufficiently large }M.$$
Theorem [2.4](https://arxiv.org/html/2605.06819#S2.Thmtheorem4) and Theorem [2.5](https://arxiv.org/html/2605.06819#S2.Thmtheorem5) are proved in Section [8](https://arxiv.org/html/2605.06819#S8). Since $\operatorname{VC}(\mathcal{F})\leq\mathtt{L}(\mathcal{F})$ for all $\mathcal{F}$, together with the upper bound of Theorem [7.1](https://arxiv.org/html/2605.06819#S7.Thmtheorem1), this establishes that for any $d\in\mathbb{N}_{+}$ there exists a class $\mathcal{F}$ such that $\mathtt{L}(\mathcal{F})=O(d)$ and
$$\operatorname{VC}(\mathcal{F}^{\mathsf{e2e}\text{-}M}),\ \mathtt{L}(\mathcal{F}^{\mathsf{e2e}\text{-}M})=\Theta(\mathtt{L}(\mathcal{F})\log M)\quad\text{for all sufficiently large }M.$$
So, Theorem [2.4](https://arxiv.org/html/2605.06819#S2.Thmtheorem4), Theorem [2.5](https://arxiv.org/html/2605.06819#S2.Thmtheorem5) and Theorem [2.3](https://arxiv.org/html/2605.06819#S2.Thmtheorem3) establish an essentially complete taxonomy of the possible growth rates of $\mathtt{L}(\mathcal{F}^{\mathsf{e2e}\text{-}M})$ in terms of $\mathtt{L}(\mathcal{F})$ and $M$.
Along the way, Theorem [2.5](https://arxiv.org/html/2605.06819#S2.Thmtheorem5) resolves an open question about the statistical setting posed by [[26](https://arxiv.org/html/2605.06819#bib.bib71)]: they proved that for any class $\mathcal{F}$ it holds that $\operatorname{VC}(\mathcal{F}^{\mathsf{e2e}\text{-}M})=O(\mathtt{L}(\mathcal{F})\log M)$, and asked whether the logarithmic dependence in their bound can be improved. Theorem [2.5](https://arxiv.org/html/2605.06819#S2.Thmtheorem5) shows that their bound is tight.
#### 2.3.1 Corollary for agnostic learning
Much of the online learning literature studies *agnostic online learning*, where no target function in the class perfectly explains the streamed data. In this setting, instead of the mistake bound, we are usually interested in the *regret bound*, which is the optimal number of mistakes measured after subtracting the number of mistakes made by the best function in the class. A known result of [[1](https://arxiv.org/html/2605.06819#bib.bib40)] allows us to derive a similar taxonomy for agnostic end-to-end learning.
###### Corollary 2.6.
Let $\mathcal{F}$ be a base class. Then for all sufficiently large $M$, there exists an agnostic online learner for $\mathcal{F}^{\mathsf{e2e}\text{-}M}$ with regret bound
$$O\left(\sqrt{T\,\mathtt{L}(\mathcal{F})\log M}\right).$$
Furthermore, for any well-behaved sub-logarithmic growth rate $r$ there exists a base class $\mathcal{F}$ such that $\mathtt{L}(\mathcal{F})=1$ and the optimal regret bound for agnostic online learning of $\mathcal{F}^{\mathsf{e2e}\text{-}M}$ is
$$\Theta\left(\sqrt{T\cdot r(M)}\right)\quad\text{for all }M\geq M_{0}\text{ and sufficiently large }T.$$
Finally, for any $d\in\mathbb{N}_{+}$, there exists a class $\mathcal{F}$ such that the optimal regret for agnostic online learning of $\mathcal{F}^{\mathsf{e2e}\text{-}M}$ is
$$\Theta\left(\sqrt{T\,d\log M}\right)\quad\text{for all }M\geq M_{0}\text{ and sufficiently large }T.$$
Corollary [2.6](https://arxiv.org/html/2605.06819#S2.Thmtheorem6) is proved in Section [9](https://arxiv.org/html/2605.06819#S9).
### 2.4 Results for autoregressive linear classifiers
An interesting specific case study is the basic class of *autoregressive linear classifiers*. This class was presented in [[26](https://arxiv.org/html/2605.06819#bib.bib71)] and further studied in [[20](https://arxiv.org/html/2605.06819#bib.bib93),[25](https://arxiv.org/html/2605.06819#bib.bib94)]. It is naturally defined as follows. Let $d\in\mathbb{N}$ be the dimension. For any $\boldsymbol{w}\in\mathbb{R}^{d}$ and $b\in\mathbb{R}$, define the function $f_{\boldsymbol{w},b}:\{0,1\}^{\star}\to\{0,1\}$ by
$$f_{\boldsymbol{w},b}(x)=1\left[\sum_{i=1}^{\min\{d,|x|\}}\boldsymbol{w}[-i]\,x[-i]+b\geq 0\right]\quad\text{for all }x\in\{0,1\}^{\star}.$$
That is, $f_{\boldsymbol{w},b}$ acts as the linear classifier defined by $(\boldsymbol{w},b)$ on the $d$-dimensional Boolean cube, where the input point to the classifier is given by the (at most) final $d$ bits of the input $x$. The class $\mathcal{F}_{d,\operatorname{lin}}$ of autoregressive linear classifiers in dimension $d$ is defined as
$$\mathcal{F}_{d,\operatorname{lin}}:=\{f_{\boldsymbol{w},b}:\boldsymbol{w}\in\mathbb{R}^{d},\,b\in\mathbb{R}\}.$$
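A minimal Python sketch of $f_{\boldsymbol{w},b}$ and its end-to-end output, following the definition above (our own illustration; the parameter values in the example are illustrative only):

```python
from typing import List

def f_linear(w: List[float], b: float, x: str) -> int:
    """Autoregressive linear threshold f_{w,b}: thresholds on the (at most) last len(w) bits of x."""
    k = min(len(w), len(x))
    s = b + sum(w[-i] * int(x[-i]) for i in range(1, k + 1))
    return int(s >= 0)

def e2e_linear(w: List[float], b: float, x: str, M: int) -> int:
    """End-to-end output of the M-step autoregressive generation of f_{w,b} on x."""
    s = x
    for _ in range(M):
        s = s + str(f_linear(w, b, s))
    return int(s[-1])

# Illustrative parameters in dimension d = 3 (not from the paper):
print(e2e_linear(w=[1.0, -2.0, 1.0], b=-0.5, x="101", M=4))  # prints 1
```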
#### 2.4.1 Mistake and sample complexity bounds
The results in this section are proved in Section [11.1](https://arxiv.org/html/2605.06819#S11.SS1). [[26](https://arxiv.org/html/2605.06819#bib.bib71)] proved that $\operatorname{VC}(\mathcal{F}_{d,\operatorname{lin}}^{\mathsf{e2e}\text{-}M})=O(d^{2})$ for all $M$. This implies that the dependence on $d$ of the sample complexity of $\mathsf{CoT}$-learning $\mathcal{F}_{d,\operatorname{lin}}$ is also at most $O(d^{2})$. [[20](https://arxiv.org/html/2605.06819#bib.bib93)] improved the dependence on $d$ for $\mathsf{CoT}$-learning from $O(d^{2})$ to $O(d)$. No lower bounds (that hold for any $M>1$) were previously given.
In this work, we prove that the optimal mistake bound for online learning $\mathcal{F}_{d,\operatorname{lin}}^{\mathsf{e2e}}$, both with and without $\mathsf{CoT}$ supervision, is $\Theta(d^{2})$.
###### Theorem 2.7.
For any dimension $d$ and generation length $M$ we have
$$\mathtt{L}(\mathcal{F}_{d,\operatorname{lin}}^{\mathsf{e2e}\text{-}M})=O(d^{2}).$$
Furthermore, for any $M$ and any $d\geq 3$, there exists an adversary forcing $\Omega(d^{2})$ mistakes on any learner for $\mathcal{F}_{d,\operatorname{lin}}^{\mathsf{e2e}\text{-}M}$, even with $\mathsf{CoT}$-supervision.
The technique we use for the lower bound of Theorem [2.7](https://arxiv.org/html/2605.06819#S2.Thmtheorem7) allows us to derive the first lower bound for the statistical setting that holds for $M>1$.
###### Theorem 2.8.
For any dimension $d\geq 3$ and generation length $M\geq 2$, any PAC-learner for $\mathcal{F}_{d,\operatorname{lin}}^{\mathsf{e2e}\text{-}M}$ has sample complexity $\Omega(d)$, even with $\mathsf{CoT}$-supervision.
Theorem [2.8](https://arxiv.org/html/2605.06819#S2.Thmtheorem8) shows that the upper bound of [[20](https://arxiv.org/html/2605.06819#bib.bib93)] for PAC-learning $\mathcal{F}_{d,\operatorname{lin}}^{\mathsf{e2e}}$ with $\mathsf{CoT}$-supervision is tight.
#### 2.4.2 Computational complexity
The results in this section are proved in Section [11.2](https://arxiv.org/html/2605.06819#S11.SS2). The preceding bounds are information-theoretic. We also obtain a computational separation between $\mathsf{e2e}$- and $\mathsf{CoT}$-learning for linear classifiers.
###### Theorem 2.9.
Assume Hardness of Learning $\mathsf{TC}^{0}$ (this assumption is formally given in Assumption [11.5](https://arxiv.org/html/2605.06819#S11.Thmtheorem5); for more details, the reader is referred to, e.g., [[26](https://arxiv.org/html/2605.06819#bib.bib71)]). Then there is no (possibly improper) online learner for $\mathcal{F}_{d,\operatorname{lin}}^{\mathsf{e2e}\text{-}M}$ that runs in time polynomial in $d$, $M$, and the input length per round, and makes polynomially many mistakes. In contrast, with $\mathsf{CoT}$-supervision, there is a deterministic polynomial-time learner for $\mathcal{F}_{d,\operatorname{lin}}^{\mathsf{e2e}\text{-}M}$ with mistake bound $O(d^{2}\log d)$.
Thus, $\mathsf{CoT}$-supervision removes the computational obstruction, at least up to a logarithmic factor in the mistake bound.
### 2.5 Non-Littlestone classes
The finite-Littlestone assumption is essential. For $\mathtt{L}(\mathcal{F})<\infty$, our upper bound gives a uniform logarithmic ceiling in $M$. Without this assumption, a single base class can alternate between trivial and unlearnable horizons.
###### Theorem 2.10.
There exists a binary base class $\mathcal{F}$ and horizons $M_{i}=i+4$ such that
$$i\text{ odd}\implies\mathtt{L}(\mathcal{F}^{\mathsf{e2e}\text{-}M_{i}})=0,\qquad i\text{ even}\implies\mathtt{L}(\mathcal{F}^{\mathsf{e2e}\text{-}M_{i}})=\infty.$$
In particular, the odd horizons are trivial already in the $\mathsf{e2e}$ model, while the even horizons are unlearnable even with $\mathsf{CoT}$-supervision.
Thus, outside the Littlestone regime, there is no stable growth law in $M$. Theorem [2.10](https://arxiv.org/html/2605.06819#S2.Thmtheorem10) is proved in Section [12](https://arxiv.org/html/2605.06819#S12).
### 2.6 A stochastic autoregressive lower bound
We also show that the deterministic picture of $\mathsf{CoT}$ online learning does not extend to stochastic autoregressive generators. In this model, a generator maps each string in $\Sigma^{*}$ to a distribution over the next token, so even after the input $x_{t}$ and the target generator are fixed, the observed $\mathsf{CoT}$ trajectory is random. For the stochastic setting, the natural benchmark is regret.
###### Theorem 2.11.
For every odd $M\geq 1$, there exists a stochastic two-function base class $\mathcal{G}=\{g_{-},g_{+}\}$ whose direct next-token problem has regret $O(1)$, but for every online learner, there exists $g_{\star}\in\mathcal{G}$ such that learning $\mathcal{G}^{\mathsf{e2e}\text{-}M}$ against target $g_{\star}$ incurs expected regret $\Omega(2^{M})$ for horizon $T=\Theta(4^{M})$.
Thus, stochastic autoregression can exponentially amplify the difficulty of an otherwise trivial next-token learning problem. This is in stark contrast to the upper bounds for deterministic $\mathsf{CoT}$ online learning given in Theorem [2.1](https://arxiv.org/html/2605.06819#S2.Thmtheorem1) and Theorem [2.3](https://arxiv.org/html/2605.06819#S2.Thmtheorem3). Theorem [2.11](https://arxiv.org/html/2605.06819#S2.Thmtheorem11) is proved in Section [13](https://arxiv.org/html/2605.06819#S13).
## 3 Proof sketches of the main results
In this section, we explain the main technical ideas of the proofs of our main theorems\.
### 3.1 Proof sketch of Theorem [2.1](https://arxiv.org/html/2605.06819#S2.Thmtheorem1): mistake bound for learning with CoT
Theorem [2.1](https://arxiv.org/html/2605.06819#S2.Thmtheorem1) shows that with $\mathsf{CoT}$-supervision, learning $\mathcal{F}^{\mathsf{e2e}\text{-}M}$ is possible with mistake bound at most $\mathtt{L}(\mathcal{F})$, for all $M$. The proof follows from a relatively simple SOA-style argument (the *SOA* is the *standard optimal algorithm*, the original algorithm given in [[31](https://arxiv.org/html/2605.06819#bib.bib15)] for online learning of binary classes), which we describe below. We design an online learner $\mathsf{Lrn}$ for the class $\mathcal{F}^{\mathsf{CoT}\text{-}M}$, which operates as follows. $\mathsf{Lrn}$ maintains a *version space* $V_{t}$ of all generators $f\in\mathcal{F}$ that are consistent with the feedback given up to the end of round $t$: initially $V_{0}=\mathcal{F}$, and at the end of each round $t$, $\mathsf{Lrn}$ sets $V_{t}$ to be all functions in $V_{t-1}$ that are consistent with the feedback of round $t$. In each round $t$, $\mathsf{Lrn}$ receives $x_{t}$ and predicts a maximizer
$$\hat{y}_{t}\in\operatorname*{arg\,max}_{\hat{y}\in\{0,1\}^{M}}\mathtt{L}\bigl(V_{t-1}|_{x_{t}\to\hat{y}}\bigr),$$
where $V_{t-1}|_{x_{t}\to\hat{y}}$ denotes the restriction of $V_{t-1}$ to functions whose autoregressive chain of thought on $x_{t}$ is precisely $\hat{y}$. The crucial observation is that there is at most one $\hat{y}\in\{0,1\}^{M}$ such that
$$\mathtt{L}\bigl(V_{t-1}|_{x_{t}\to\hat{y}}\bigr)=\mathtt{L}(V_{t-1}).$$
Indeed, if there were two distinct $\hat{y},\hat{y}'$ such that
$$\mathtt{L}\bigl(V_{t-1}|_{x_{t}\to\hat{y}}\bigr)=\mathtt{L}\bigl(V_{t-1}|_{x_{t}\to\hat{y}'}\bigr)=\mathtt{L}(V_{t-1}),$$
then one could construct a Littlestone tree of depth $\mathtt{L}(V_{t-1})+1$ that is shattered by $V_{t-1}$, contradicting the fact that $\mathtt{L}(V_{t-1})$ is the optimal mistake bound for learning $V_{t-1}$, as proved by [[12](https://arxiv.org/html/2605.06819#bib.bib9)].
Therefore, after every mistake of $\mathsf{Lrn}$ the Littlestone dimension of the version space decreases by at least $1$, which implies a mistake bound of at most $\mathtt{L}(V_{0})=\mathtt{L}(\mathcal{F})$.
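The following Python sketch (ours) mirrors this argument on a finite class. To keep it finite and runnable, the Littlestone dimension is computed only over a finite probe set of strings; this is a simplification, and the exact mistake-bound guarantee above requires the dimension over all of $\Sigma^{\star}$, including the strings generated along the chain of thought. The learner predicts a full trajectory whose restriction keeps the (probe-restricted) Littlestone dimension of the version space as large as possible; its end-to-end prediction is the last token of that trajectory.

```python
from itertools import product
from typing import Callable, List

Gen = Callable[[str], str]  # next-token generator over the binary alphabet

def cot(f: Gen, x: str, M: int) -> str:
    s = x
    for _ in range(M):
        s = s + f(s)
    return s[-M:]

def ldim(F: List[Gen], probes: List[str]) -> int:
    """Brute-force Littlestone dimension of F restricted to a finite probe set (L(empty) = -1)."""
    if not F:
        return -1
    best = 0
    for x in probes:
        F0 = [f for f in F if f(x) == "0"]
        F1 = [f for f in F if f(x) == "1"]
        if F0 and F1:
            best = max(best, 1 + min(ldim(F0, probes), ldim(F1, probes)))
    return best

def soa_cot_learner(F: List[Gen], probes: List[str], M: int):
    """SOA-style learner with CoT feedback, mirroring the argument above on a finite class."""
    V = list(F)  # version space of generators consistent with all feedback so far

    def predict(x: str) -> str:
        # Predict a trajectory y maximizing the Littlestone dimension of the restriction V|_{x -> y}.
        best_y, best_val = "0" * M, -2
        for bits in product("01", repeat=M):
            y = "".join(bits)
            val = ldim([f for f in V if cot(f, x, M) == y], probes)
            if val > best_val:
                best_y, best_val = y, val
        return best_y  # the end-to-end prediction is best_y[-1]

    def update(x: str, true_cot: str) -> None:
        nonlocal V
        V = [f for f in V if cot(f, x, M) == true_cot]

    return predict, update
```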
### 3.2 Proof sketch of Theorem [2.3](https://arxiv.org/html/2605.06819#S2.Thmtheorem3): mistake bound for learning without CoT
Consider a Littlestone tree $\mathbf{T}$ of depth $m$ that is shattered by $\mathcal{F}^{\mathsf{e2e}\text{-}M}$. It suffices to show that $m=O(\mathtt{L}(\mathcal{F})\log(\mathtt{L}(\mathcal{F})M))$. The idea is a variant of tree-covering arguments occasionally used in online learning (see [[19](https://arxiv.org/html/2605.06819#bib.bib14)] for an example). First, we inflate $\mathbf{T}$ so that it represents the entire chain of thought for every instance labeling a node of the tree. We do so by replacing each node and its two outgoing edges by the full $M$-step *generation tree* matching the instance labeling the node. An $M$-step generation tree for $x\in\{0,1\}^{\star}$, denoted $\mathbf{T}_{M}(x)$, is a Littlestone tree of depth $M$ whose root is labeled by $x$ and in which, for every prefix $u\in\{0,1\}^{<M}$, the node at the end of the path $u$ is labeled by $x\circ u$. So the branches of $\mathbf{T}_{M}(x)$ represent the different possible chains of thought for $x$. Let $\mathbf{T}_{\mathcal{F}}$ be the inflated tree, which is now of depth $mM$. We now analyze the size of the set $B_{\mathcal{F}}(\mathbf{T}_{\mathcal{F}})$ of branches in $\mathbf{T}_{\mathcal{F}}$ that are realized by some $f\in\mathcal{F}$. Since every generation tree in $\mathbf{T}_{\mathcal{F}}$ originates from a node of $\mathbf{T}$ and its two outgoing edges, and since $\mathbf{T}$ is shattered by $\mathcal{F}^{\mathsf{e2e}\text{-}M}$, it is not hard to see that $B_{\mathcal{F}}(\mathbf{T}_{\mathcal{F}})$ grows by a factor of at least $2$ after every $M$ levels of $\mathbf{T}_{\mathcal{F}}$, which implies
$$|B_{\mathcal{F}}(\mathbf{T}_{\mathcal{F}})|\geq 2^{m}.$$
On the other hand, a known Sauer-Shelah-Perles lemma for trees (see, for example, [[7](https://arxiv.org/html/2605.06819#bib.bib96)]) gives
$$|B_{\mathcal{F}}(\mathbf{T}_{\mathcal{F}})|\leq\binom{mM}{\leq\mathtt{L}(\mathcal{F})}.$$
Combining the two inequalities, we obtain
$$2^{m}\leq\binom{mM}{\leq\mathtt{L}(\mathcal{F})}.$$
Solving this inequality for $m$ gives the stated bound.
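For completeness, here is one standard way to solve the final inequality for $m$ (our own calculation; constants are not optimized). Assume $mM\geq\mathtt{L}(\mathcal{F})$ (otherwise $m\leq\mathtt{L}(\mathcal{F})$ trivially), and write $L:=\mathtt{L}(\mathcal{F})$. Using the estimate $\binom{n}{\leq L}\leq\left(\frac{en}{L}\right)^{L}$:
$$2^{m}\;\leq\;\binom{mM}{\leq L}\;\leq\;\left(\frac{emM}{L}\right)^{L}\;\leq\;(emM)^{L}\quad\Longrightarrow\quad m\;\leq\;L\log_{2}(emM).$$
Since $\log_{2}m\leq\frac{m}{2L}+\log_{2}(2L)$ for all $m,L\geq 1$, this gives
$$m\;\leq\;\frac{m}{2}+L\log_{2}(2L)+L\log_{2}(eM)\quad\Longrightarrow\quad m\;\leq\;2L\log_{2}(2eLM)=O\bigl(L\log(LM)\bigr).$$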
### 3.3 Proof sketch of Theorem [2.4](https://arxiv.org/html/2605.06819#S2.Thmtheorem4): taxonomy for sub-logarithmic rates
To prove Theorem [2.4](https://arxiv.org/html/2605.06819#S2.Thmtheorem4), we construct a class $\mathcal{F}$ (suggested in private communication [[21](https://arxiv.org/html/2605.06819#bib.bib100)]) that has two desired properties:

1. It is simple, and in fact has $\mathtt{L}(\mathcal{F})=1$.
2. We can make $\mathcal{F}^{\mathsf{e2e}\text{-}M}$ as complicated as we want, up to any well-behaved sub-logarithmic rate $r(M)$. In fact, given any well-behaved sub-logarithmic rate $r$, we have $\mathtt{L}(\mathcal{F}^{\mathsf{e2e}\text{-}M})=\Theta(r(M))$.
To implement this, we define a fixed baseline function $f$, and $\mathcal{F}$ is constructed from a class of functions each of which differs from $f$ on precisely one instance. This fact alone suffices to prove the first item. To prove the second item, the baseline function $f$, as well as the other functions in the class, should be carefully chosen to fit the given well-behaved sub-logarithmic rate $r$. The proof, and even the definition of $\mathcal{F}$, are quite technical. Below, we explain the high-level idea of the definition and the proof. The function $f$ is defined so that it is "interesting" only on instances of the form
$$x_{s,k,z}:=0^{s}10^{k}10^{s-k-1}z,$$
where $s$ can be any natural number, $k\in\left[2^{r(s)}\right]$, and $z\in\{0,1\}^{<r(s)}$. Leaving some technical details aside, $f(x_{s,k,z})=k_{|z|}$, so $f(x_{s,k,z})$ in fact returns a bit of the binary representation of $k$, at a location encoded by the length of $z$. For any other instance $x$, define $f(x)=0$. Note that the assumption that $r$ is sub-logarithmic is crucial already here: since $k$ can be as large as $2^{r(s)}$, assuming that $r$ is sub-logarithmic ensures that $s-k-1\geq 1$, and so $0^{s-k-1}$, which appears in $x_{s,k,z}$, is a well-defined binary string. Now, for every $s,k$ as above we define a function $f_{s,k}$. This function is defined exactly as $f$ everywhere except on the instance $0^{s}10^{k}$, on which $f$ returns $0$ but we define $f_{s,k}(0^{s}10^{k})=1$. The class $\mathcal{F}$ is defined as the class of all the $f_{s,k}$. The definition of $f_{s,k}$ is one of the core ideas of the proof: when the instance $0^{s}10^{k}$ is given to $f_{s,k}$, and only to $f_{s,k}$, the $s$-step autoregressive generation of $f_{s,k}$ on $0^{s}10^{k}$ is non-degenerate, and in fact encodes the binary representation of $k\in\left[2^{r(s)}\right]$. This is the high-level idea of the lower bound: taking $s:=M$, any learner makes at least roughly $\log 2^{r(M)}=r(M)$ mistakes before identifying the "correct" $k$, where the logarithmic lower bound comes from the optimality of the halving algorithm in this case.
For the upper bound, the main idea is to design a learner for which the halving procedure used to find the correct $k$ is essentially the only obstacle. Let us describe this learner at a high level. Until making its first mistake, the learner always predicts as $f^{\mathsf{e2e}\text{-}M}$. Once the learner makes its first mistake, on some instance $x$, it knows the following fact:
$$f_{s^{\star},k^{\star}}^{\mathsf{e2e}\text{-}M}(x)\neq f^{\mathsf{e2e}\text{-}M}(x),$$
where $f_{s^{\star},k^{\star}}$ is the target function. Since $f_{s^{\star},k^{\star}}$ differs from $f$ only on the instance $0^{s^{\star}}10^{k^{\star}}$, this implies that $x$ must be of the form $0^{s^{\star}}10^{i}$ for some $i\in\mathbb{N}$. Therefore, the learner now knows $s^{\star}$ just by looking at $x$, and so it only needs to find $k^{\star}$ in order to know the target function $f_{s^{\star},k^{\star}}$. If $s^{\star}=O(M)$, then $k^{\star}$ can take at most
$$2^{r(s^{\star})}\leq 2^{r(O(M))}\leq 2^{O(r(M))}\qquad(1)$$
many values, and thus the halving algorithm finds $k^{\star}$ after at most $O(r(M))$ mistakes, mirroring the argument of the lower bound. Note that ([1](https://arxiv.org/html/2605.06819#S3.E1)) does not hold for a general rate $r$: the first inequality holds since $r$ is non-decreasing, and the second inequality holds since $r$ is sub-additive. This is exactly the reason we make those two assumptions. However, it is also possible that $s^{\star}\gg M$. In this case, we need a completely different argument. When $s^{\star}\gg M$, the learner acts as follows. For any $x$ not of the form $0^{s^{\star}}10^{i}$ it predicts as $f$ and never errs. For $x$ of the form $0^{s^{\star}}10^{i}$, the learner always predicts $0$ until making its first mistake on such an instance, which would imply that
$$f_{s^{\star},k^{\star}}^{\mathsf{e2e}\text{-}M}(0^{s^{\star}}10^{i})=1,\qquad(2)$$
where $0^{s^{\star}}10^{i}$ is the instance the learner errs on. However, recalling the definition of $f$, it is non-degenerate only on instances of the form $0^{s}10^{k}10^{s-k-1}z$, which in particular means that in our case $s^{\star}-k-1$ zeros must be appended before reaching the interesting part. Since $s^{\star}\gg M$, we have $s^{\star}-k-1>M$, and thus $M$ is simply not large enough to even reach the interesting part. So, in fact, case ([2](https://arxiv.org/html/2605.06819#S3.E2)) can essentially never happen (leaving aside some technical details).
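The halving step referred to above is the classical halving algorithm: predict the majority vote of the surviving candidates, so that every mistake removes at least half of them, giving at most $\log_{2}N$ mistakes over $N$ candidates in the realizable case. A generic Python sketch (ours, not specific to the class $\mathcal{F}$ of this section):

```python
from typing import Callable, List

def halving_learner(candidates: List[Callable[[str], int]]):
    """Classical halving algorithm: at most log2(len(candidates)) mistakes in the realizable case."""
    V = list(candidates)  # version space

    def predict(x: str) -> int:
        votes = sum(f(x) for f in V)
        return int(2 * votes >= len(V))  # predict the majority label

    def update(x: str, y: int) -> None:
        nonlocal V
        V = [f for f in V if f(x) == y]  # after a mistake, the majority was wrong: |V| at least halves

    return predict, update
```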
### 3.4 Proof sketch of Theorem [2.5](https://arxiv.org/html/2605.06819#S2.Thmtheorem5): logarithmic lower bound
Theorem [2.5](https://arxiv.org/html/2605.06819#S2.Thmtheorem5) states that for any (large enough) $d$ there exists a base class $\mathcal{F}$ with $\mathtt{L}(\mathcal{F})=O(d)$ such that $\operatorname{VC}(\mathcal{F}^{\mathsf{e2e}\text{-}M})=\Omega(d\log M)$ for all (large enough) $M$. The idea is to construct, for every $d,M$, a base class $\mathcal{F}:=\mathcal{F}(d,M)$ of size $\approx M^{d}$ that is very unbalanced on every instance $x$ (that is, the functions of $\mathcal{F}$ exhibit an overwhelming majority towards one of the possible labels for every $x$), but becomes much more balanced, on many instances, after $M$ generation steps. This is precisely what makes $\mathtt{L}(\mathcal{F})=O(d)$ while $\operatorname{VC}(\mathcal{F}^{\mathsf{e2e}\text{-}M})=\Omega(d\log M)$. In more detail, our specific choice of overwhelming majority is a $\approx(1-1/M)$-fraction of the functions, for every $x$. This indeed gives $\mathtt{L}(\mathcal{F})=O(d)$: we start from $\approx M^{d}$ functions, and each restriction of the form $\mathcal{F}_{x\to r}:=\{f\in\mathcal{F}:f(x)=r\}$ to the minority label $r\in\{0,1\}$ shrinks the number of surviving functions by a factor of $M$. Therefore, after at most roughly $d$ such restrictions (which correspond to the mistakes of an optimal learner), no surviving functions are left. On the other hand, we ensure that after $M$ generation steps, for many instances $x$, a constant fraction of the functions realizes each label on $x$. Since for every restriction requiring an instance to be classified with some label after $M$ generation steps a constant fraction of the class stays consistent, it follows that
$$\operatorname{VC}(\mathcal{F}^{\mathsf{e2e}\text{-}M})=\Omega(\log(M^{d}))=\Omega(d\log M).$$
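To make the counting explicit (our own restatement; the formal proof handles the conditioning between successive restrictions), write $|\mathcal{F}|\approx M^{d}$ and let $c>0$ denote the constant fraction of the class retained by each end-to-end restriction. A mistake in the base next-token game shrinks the class by a factor of roughly $M$, while shattering $m$ points in the end-to-end game only requires that $c^{m}|\mathcal{F}|\geq 1$:
$$\mathtt{L}(\mathcal{F})\lesssim\log_{M}|\mathcal{F}|\approx d,\qquad\operatorname{VC}(\mathcal{F}^{\mathsf{e2e}\text{-}M})\gtrsim\log_{1/c}|\mathcal{F}|\approx\frac{d\log M}{\log(1/c)}=\Theta(d\log M).$$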
It remains to construct a class having the above desired overwhelming-majority property for all $x$. However, ensuring this property simultaneously for all $x$ seems difficult to do deterministically. We thus define the class at random: every $f$ assigns to every $x$ (in a specific desired set of instances) the label $r\in\{0,1\}$ with probability $1-1/M$, and $1-r$ with probability $1/M$. Figure [1](https://arxiv.org/html/2605.06819#S3.F1) demonstrates why $\mathcal{F}_{x\to 0}^{\mathsf{e2e}\text{-}M}$ and $\mathcal{F}_{x\to 1}^{\mathsf{e2e}\text{-}M}$ both contain a constant fraction of $\mathcal{F}$.
Figure 1: The generation tree $\mathbf{T}_{M}(x^{(j)})$, visualized for the hard class $\mathcal{F}_{d,M}$ with $M=4$ and $j=1$ (that is, $x^{(j)}=0$). There are $M-1$ red nodes, the version space of each of which typically contains an $O(1)/M$ fraction of the class, so in total an $(M-1)\cdot\frac{O(1)}{M}=O(1)$ fraction of the class is contained in the union of the red vertices. The edges where the size of the class drops by a factor of $O(1)/M$ are blue; on the remaining edges, the size of the version space drops only by a factor of at least $1-O(1)/M$. Therefore, a $\left(1-O(1)/M\right)^{M}=O(1)$ fraction of the class is typically contained in the green vertex's version space. Furthermore, for all $f$ in the version space of a red node we have $f^{\mathsf{e2e}\text{-}M}(0)=0$, and for all $f$ in the version space of the green node we have $f^{\mathsf{e2e}\text{-}M}(0)=1$. So, for each possible choice of label for $0$, we keep a constant fraction of the version space. Applying this idea repeatedly allows shattering a set of size $\Omega(\log(M^{d}))=\Omega(d\log M)$.
In this intuitive explanation, one important fact we overlooked is that $\mathcal{F}:=\mathcal{F}(d,M)$ depends on the generation length $M$, while we want the result to hold for all $M$ simultaneously. This is resolved by "gluing" together the classes $\{\mathcal{F}(d,M)\}_{M}$ over all values of $M$, a technique used for example in [[10](https://arxiv.org/html/2605.06819#bib.bib46)].
### 3.5 Proof sketches of Theorem [2.7](https://arxiv.org/html/2605.06819#S2.Thmtheorem7) and Theorem [2.8](https://arxiv.org/html/2605.06819#S2.Thmtheorem8): bounds for linear classifiers
The interesting part of our bounds for autoregressive linear classifiers is the following "latch" mechanism, which allows us to "drag" the output of the base classifier through the entire autoregressive process, at the price of reducing the dimension by $2$:
###### Lemma 3.1 (Latch lemma).
Fix $d\geq 3$ and denote $m:=d-2$. Then, for all $\boldsymbol{v}\in\mathbb{R}^{m}$ and $c\in\mathbb{R}$, there exists $f_{\boldsymbol{w},b}\in\mathcal{F}_{d,\operatorname{lin}}$ such that
$$f_{\boldsymbol{w},b}^{\mathsf{e2e}\text{-}M}(z10)=f_{\boldsymbol{v},c}(z)\quad\text{for all }z\in\{0,1\}^{m}\text{ and all }M\geq 2.$$
Once we have Lemma [3.1](https://arxiv.org/html/2605.06819#S3.Thmtheorem1), the remaining arguments in the proofs of Theorem [2.7](https://arxiv.org/html/2605.06819#S2.Thmtheorem7) and Theorem [2.8](https://arxiv.org/html/2605.06819#S2.Thmtheorem8) are either simple or known. Indeed, the upper bounds for autoregressive linear classifiers are no larger than the standard upper bounds for (standard) linear classifiers. Therefore, it essentially suffices to show that the autoregressive process cannot make the class simpler, that is, that the class retains its original mistake bound and sample complexity.
The idea of the proof of Lemma [3.1](https://arxiv.org/html/2605.06819#S3.Thmtheorem1) is quite simple. We define $\boldsymbol{w}$ to agree with $\boldsymbol{v}$, except on the final two coordinates of $\boldsymbol{w}$, which we are free to choose. We also choose $b$ to be $c$ shifted by some value. Choosing these values carefully, we can ensure that:

1. For all $z\in\{0,1\}^{m}$: $f_{\boldsymbol{w},b}(z10)=f_{\boldsymbol{v},c}(z)$ (the first bit of the $\mathsf{CoT}$ is correct).
2. For any $x\in\{0,1\}^{\star}$ that ends with $00$, we have $f_{\boldsymbol{w},b}(x)=0$ (always drag $0$).
3. For any $x\in\{0,1\}^{\star}$ that ends with $1$, we have $f_{\boldsymbol{w},b}(x)=1$ (always drag $1$).

The above three items essentially establish the statement of the lemma. The idea is then to use known lower bounds for dimension $m$, but append the suffix $10$ to every instance used in the proof, so that the dimension becomes $d:=m+2$. The first item ensures that for every $z\in\{0,1\}^{m}$, $f_{\boldsymbol{w},b}(z10)$ outputs the intended value $f_{\boldsymbol{v},c}(z)$. The other two items then ensure that this output is dragged throughout the entire autoregressive generation process: if $f_{\boldsymbol{w},b}(z10)=0$ then the newly generated string ends with $00$, and if $f_{\boldsymbol{w},b}(z10)=1$ it ends with $1$. Thus, any lower bound strategy for $\mathcal{F}_{m,\operatorname{lin}}$ works also for $\mathcal{F}_{d,\operatorname{lin}}^{\mathsf{e2e}\text{-}M}$ for all $M$, even with $\mathsf{CoT}$-supervision.
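To see how such a choice can be made, here is one explicit instantiation satisfying the three items (our own choice of constants; the paper's proof may use a different one). Write $V:=\sum_{j=1}^{m}|\boldsymbol{v}[j]|$ and set
$$\boldsymbol{w}:=(\boldsymbol{v},A,B),\qquad b:=c-A,\qquad A:=V+|c|+1,\qquad B:=A+V+|c|+1,$$
so that $\boldsymbol{w}[-1]=B$ and $\boldsymbol{w}[-2]=A$. On input $z10$, the last two coordinates contribute $B\cdot 0+A\cdot 1=A$, so the threshold test reads $\sum_{j}\boldsymbol{v}[-j]z[-j]+A+b=\sum_{j}\boldsymbol{v}[-j]z[-j]+c$, which is exactly the test of $f_{\boldsymbol{v},c}(z)$ (item 1). If the input ends with $00$, the last two coordinates contribute $0$, and the total is at most $V+c-A\leq V+|c|-A=-1<0$ (item 2). If the input ends with $1$, the last coordinate contributes $B$, which dominates every other term, so the total is at least $B-V-|c|-A=1>0$ (item 3).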
## 4 Future work
Our work leaves some natural and interesting directions for future work\. Below, we describe some of them\.
##### Extend the results to the multiclass setting.
Some of our results extend to the multiclass setting $|\Sigma|>2$ "for free", such as the upper bound for $\mathsf{CoT}$-learning (Theorem [2.1](https://arxiv.org/html/2605.06819#S2.Thmtheorem1)). This is not true, however, for all the results in this paper. For example, our initial attempts to extend the upper bound for $\mathsf{e2e}$-learning (Theorem [2.2](https://arxiv.org/html/2605.06819#S2.Thmtheorem2)) to the multiclass setting incurred an extra $\log|\Sigma|$ factor, which we do not know to be necessary, even for specific classes.
Perhaps an even more interesting direction in the multiclass setting is to study the optimal mistake bounds, with and without $\mathsf{CoT}$-supervision, attainable under different models of partial feedback, such as bandit feedback and feedback graphs.
##### The stochastic setting\.
As we showed in Theorem [2.11](https://arxiv.org/html/2605.06819#S2.Thmtheorem11), the closer-to-practice setting with stochastic 𝖢𝗈𝖳 is controlled by the Littlestone dimension of the base class. Further characterizing the online and PAC learnability of the stochastic setting is a natural direction for follow-up work.
##### Studying specific classes\.
A natural direction is to move beyond worst\-case base classes and analyze standard finite\-Littlestone classes directly, as we did by studying autoregressive linear threshold functions\. Other concrete examples for future work include Boolean classes such as conjunctions, disjunctions, decision lists, and bounded\-depth decision trees, as well as finite\-domain geometric classes such as intervals and unions of intervals\. For such structured classes, one may hope for sharper mistake bounds than the generalO\(𝙻\(ℱ\)log\(𝙻\(ℱ\)M\)\)O\(\\mathtt\{L\}\(\\mathcal\{F\}\)\\log\(\\mathtt\{L\}\(\\mathcal\{F\}\)M\)\)guarantee, or for class\-specific separations between𝖾𝟤𝖾\\mathsf\{e2e\}\-learning and𝖢𝗈𝖳\\mathsf\{CoT\}\-learning\.
## 5Additional related work
Beyond the closest autoregressive PAC works discussed above, we mention three surrounding directions studied extensively in recent years\.
##### Chain\-of\-thought data in language\-model training\.
Chain\-of\-thought and scratchpad\-style supervision were introduced empirically as ways to expose intermediate computations in language models\[[36](https://arxiv.org/html/2605.06819#bib.bib101),[43](https://arxiv.org/html/2605.06819#bib.bib102)\]\. This is related to supervised fine\-tuning and next\-token prediction more broadly\[[9](https://arxiv.org/html/2605.06819#bib.bib103),[42](https://arxiv.org/html/2605.06819#bib.bib104)\], following the standard transformer language\-modeling paradigm\[[41](https://arxiv.org/html/2605.06819#bib.bib105),[37](https://arxiv.org/html/2605.06819#bib.bib106)\]\. Recent RL\-based post\-training methods also encourage longer reasoning traces\[[24](https://arxiv.org/html/2605.06819#bib.bib107),[15](https://arxiv.org/html/2605.06819#bib.bib108)\], but their supervision signal is typically attached to final outcomes rather than to every intermediate token\. The cost and importance of curating chain\-of\-thought data is also emphasized in large\-scale model reports\[[16](https://arxiv.org/html/2605.06819#bib.bib109)\]\. Our results formalize this distinction in an online model: observing only the final generated token can increase the mistake bound with the generation lengthMM, while full𝖢𝗈𝖳\\mathsf\{CoT\}feedback removes this dependence for finite\-Littlestone base classes\.
##### Theory of chain\-of\-thought\.
A growing theoretical literature studies why intermediate reasoning traces can help\. This includes universality of autoregressive next\-token predictors\[[34](https://arxiv.org/html/2605.06819#bib.bib110)\], sample\-complexity bounds in terms of CoT information\[[3](https://arxiv.org/html/2605.06819#bib.bib111)\], and reinforcement learning after next\-token prediction\[[40](https://arxiv.org/html/2605.06819#bib.bib112)\]\. Other works study the power and limitations of transformer architectures with chain\-of\-thought reasoning\[[35](https://arxiv.org/html/2605.06819#bib.bib113),[30](https://arxiv.org/html/2605.06819#bib.bib114),[11](https://arxiv.org/html/2605.06819#bib.bib115),[27](https://arxiv.org/html/2605.06819#bib.bib116),[23](https://arxiv.org/html/2605.06819#bib.bib117)\]\. These works are mostly statistical, architectural, or optimization\-theoretic\. In contrast, our results are architecture\-free and online: they characterize the possible mistake\-bound growth rates of autoregressive learning under end\-to\-end feedback and show when trajectory feedback eliminates this growth\.
##### Computational hardness\.
Our computational separation is related in spirit to the broader literature on hardness of distribution\-free learning; see\[[38](https://arxiv.org/html/2605.06819#bib.bib118)\]for a recent survey\. Relevant examples include hardness results for intersections of halfspaces\[[28](https://arxiv.org/html/2605.06819#bib.bib119),[39](https://arxiv.org/html/2605.06819#bib.bib120)\], and DNFs or related classes under average\-case or pseudorandomness assumptions\[[4](https://arxiv.org/html/2605.06819#bib.bib121),[13](https://arxiv.org/html/2605.06819#bib.bib122),[14](https://arxiv.org/html/2605.06819#bib.bib123),[5](https://arxiv.org/html/2605.06819#bib.bib124)\]\. Unlike this hardness literature, most of our paper is information\-theoretic; the computational result shows that, even for autoregressive linear classifiers, end\-to\-end feedback can leave a computational obstruction that𝖢𝗈𝖳\\mathsf\{CoT\}feedback removes\.
## 6Technical preliminaries
The online autoregressive model studied in this work was formally defined in Section[1\.1](https://arxiv.org/html/2605.06819#S1.SS1)\. The goal of this section is to formally define additional relevant concepts from learning theory, as well as to state known and relevant classic results\. Most of the definitions and notation are borrowed from\[[20](https://arxiv.org/html/2605.06819#bib.bib93)\]\.
### 6\.1Littlestone trees
In this paper, a Littlestone tree \(or just a tree\)𝐓\\mathbf\{T\}is a finite, full, and rooted binary tree, accompanied with the following information:
1. 1\.Every internal nodevvis associated with an instancex∈𝒳x\\in\\mathcal\{X\}, denoted asxvx\_\{v\}\.
2. 2\.For every internal nodevv, the left outgoing edge is labeled with0, and the right outgoing edge with11\.
The set of vertices and edges of𝐓\\mathbf\{T\}are denoted byv\(𝐓\)v\(\\mathbf\{T\}\)ande\(𝐓\)e\(\\mathbf\{T\}\), respectively\. Every path in the tree that starts at the root is called a prefix\. A prefix that ends at a leaf is called a*branch*\. The set of all branches in𝐓\\mathbf\{T\}is denotedℬ\(𝐓\)\\mathcal\{B\}\(\\mathbf\{T\}\)\. A prefix can be identified by a sequence of vertices starting from the root, or by just the last vertex in the prefix, or by the associated sequence of edges\. A prefixp=v0,…,vtp=v\_\{0\},\\ldots,v\_\{t\}in𝐓\\mathbf\{T\}defines a sample\(x1,y1\),…,\(xt,yt\)∈𝒳×𝒴\(x\_\{1\},y\_\{1\}\),\\ldots,\(x\_\{t\},y\_\{t\}\)\\in\\mathcal\{X\}\\times\\mathcal\{Y\}in the natural way:xi=x\(vi−1\)x\_\{i\}=x\(v\_\{i\-1\}\), andyiy\_\{i\}is the label of the edge\(vi−1,vi\)\(v\_\{i\-1\},v\_\{i\}\)for allii\. We denote𝒙\(p\)=x1,…,xt\\boldsymbol\{x\}\(p\)=x\_\{1\},\\ldots,x\_\{t\}and𝒚\(p\)=y1,…yt\\boldsymbol\{y\}\(p\)=y\_\{1\},\\ldots y\_\{t\}\. For a given vertexvv, we usep𝐓\(v\)p\_\{\\mathbf\{T\}\}\(v\)\(orp\(v\)p\(v\), when the tree𝐓\\mathbf\{T\}is fixed\) to denote the path from the root tovv\. We may overload notation and usep\(v\)p\(v\)to denote the sample associated with the path, when the context is clear\. If there is a bijection between the set of vertices and the set of their instance labelings, we may replacevvwith its instance label in all notation\.
##### Generation trees\.
Given a base class ℱ, a generation length M, and an instance x, there is a natural way to represent all possible labels that could be given to x by functions in ℱ^{CoT-M} as a decision tree 𝐓(x) := 𝐓_M(x). We call this tree the *generation tree* of x of depth M, or simply the *generation tree* of x when M is fixed. The tree is constructed as follows. The root of 𝐓(x) is x. The left child of x is labeled by x0 and, likewise, the right child of x is labeled by x1. We continue this labeling process inductively on the children, for a total of M levels. There is a bijection between the set {𝒚(b)} over all branches b in 𝐓(x) and the label space {0,1}^M. Furthermore, note that a function f^{CoT-M} ∈ ℱ^{CoT-M} realizes a branch b if and only if f^{CoT-M}(x) = 𝒚(b). The branch b may be referred to as the *computation path* of f^{CoT-M} on x. This representation will be useful in some of the analyses conducted in this work. See a visualized example of a generation tree in Figure [2](https://arxiv.org/html/2605.06819#S6.F2).
Figure 2: The tree 𝐓_M(x) for M = 2, x = 0.
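As a small illustrative sketch (the names and interface below are ours, not part of the formal development), the branches of 𝐓_M(x) can be enumerated directly, and the computation path of a given next-token function is obtained by following its generated bits:

```python
from itertools import product

def generation_tree_branches(x, M):
    """All branches of the depth-M generation tree rooted at x: each branch is
    a label sequence y in {0,1}^M, and its nodes are x, xy1, xy1y2, ..."""
    return [''.join(y) for y in product('01', repeat=M)]

def computation_path(next_token, x, M):
    """The branch of T_M(x) realized by the CoT-M function induced by next_token."""
    y = ''
    for _ in range(M):
        y += str(next_token(x + y))
    return y

# Example matching Figure 2: M = 2, x = "0" has four branches.
print(generation_tree_branches("0", 2))        # ['00', '01', '10', '11']
print(computation_path(lambda s: 1, "0", 2))   # '11'
```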
### 6\.2Class dimensions, learnability, and an SSP lemma
Consider a tree 𝐓 and a class ℱ ⊂ {0,1}^𝒳. Given a function f ∈ ℱ, we say that f *realizes* a prefix p = v_0,…,v_t in 𝐓 if f(𝒙(p)_i) = 𝒚(p)_i for all i. The set of all functions f ∈ ℱ that realize p is called the *version space* of v_t (or of p, or of (𝒙(p),𝒚(p))) with respect to ℱ, and is denoted V_ℱ(v_t), or simply V(v_t) when ℱ is fixed. We say that ℱ *shatters* 𝐓 if for every leaf ℓ ∈ 𝐓 it holds that V(ℓ) ≠ ∅. The set of trees shattered by ℱ is denoted by 𝒯(ℱ). The set of branches in 𝐓 that are realized by ℱ is denoted by B_ℱ(𝐓). That is, if ℱ shatters 𝐓 then B_ℱ(𝐓) = ℬ(𝐓). We now define two fundamental and useful dimensions.
###### Definition 6\.1\(Littlestone dimension\)\.
The*Littlestone dimension*ofℱ\\mathcal\{F\}, denoted𝙻\(ℱ\),\\mathtt\{L\}\(\\mathcal\{F\}\),is the maximal depth of a perfect Littlestone tree which is shattered byℱ\\mathcal\{F\}\. If there is no such maximal depth, we say that𝙻\(ℱ\)=∞\\mathtt\{L\}\(\\mathcal\{F\}\)=\\infty\.
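For very small finite classes on a finite domain, 𝙻(ℱ) can be computed by brute force via the standard recursive characterization (equivalent to Definition 6.1); the exponential-time sketch below is ours and is meant only as an illustration:

```python
def littlestone_dimension(F, domain):
    """Brute-force Littlestone dimension of a finite class F (callables on a
    finite domain): L(F) is the largest d such that some instance x splits F
    into two nonempty restrictions F_{x->0}, F_{x->1}, each of Littlestone
    dimension at least d - 1."""
    def ldim(fs):
        best = 0
        for x in domain:
            f0 = [f for f in fs if f(x) == 0]
            f1 = [f for f in fs if f(x) == 1]
            if f0 and f1:
                best = max(best, 1 + min(ldim(f0), ldim(f1)))
        return best
    return ldim(list(F))

# toy example: the 9 threshold functions on {0,...,7}
thresholds = [(lambda t: (lambda x: int(x >= t)))(t) for t in range(9)]
print(littlestone_dimension(thresholds, range(8)))  # 3
```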
The Littlestone dimension is known to characterize*online learnability*\.
###### Definition 6\.2\(Online learnability\)\.
In the online learning model, in each round t an adversary sends x_t ∈ 𝒳 to the learner. The learner then predicts ŷ_t ∈ 𝒴 (we only discuss deterministic learners), and the adversary then reveals the correct label y_t ∈ 𝒴. Finally, the learner suffers the 0/1 loss 1[ŷ_t ≠ y_t]. The adversary is restricted to be *realizable*: there must be f ∈ ℱ so that all examples (x_t, y_t) satisfy y_t = f(x_t). A learner’s *mistake bound* is the maximal number of mistakes it makes in this setting, against any adversary. We say that ℱ is *online learnable* with mistake bound L if there exists a learner with mistake bound L such that L < ∞.
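A minimal sketch of this protocol (the predict/update interface is our own, illustrative choice):

```python
def run_online_protocol(predict, update, stream):
    """Realizable online protocol of Definition 6.2: for each (x_t, y_t) in the
    stream (labels consistent with some target in F), the learner predicts
    before seeing y_t, then receives the correct label as feedback."""
    mistakes = 0
    for x, y in stream:
        if predict(x) != y:
            mistakes += 1
        update(x, y)
    return mistakes
```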
We will use the following known characterization\.
###### Theorem 6\.3\(Online learnability characterization\[[31](https://arxiv.org/html/2605.06819#bib.bib15)\]\)\.
The optimal mistake bound for online learningℱ\\mathcal\{F\}is precisely𝙻\(ℱ\)\\mathtt\{L\}\(\\mathcal\{F\}\)\. That is, there exists a learner with mistake bound𝙻\(ℱ\)\\mathtt\{L\}\(\\mathcal\{F\}\), and there exists an adversary who can force at least𝙻\(ℱ\)\\mathtt\{L\}\(\\mathcal\{F\}\)many mistakes on any learner\.
For Littlestone classes, there is a useful SSP \(Sauer\-Shelah\-Perles\) lemma for trees, proved e\.g\. by\[[7](https://arxiv.org/html/2605.06819#bib.bib96)\]\.
###### Theorem 6\.4\(SSP for trees\[[7](https://arxiv.org/html/2605.06819#bib.bib96)\]\)\.
Letℱ⊂\{0,1\}𝒳\\mathcal\{F\}\\subset\\\{0,1\\\}^\{\\mathcal\{X\}\}be a class, and let𝐓\\mathbf\{T\}be a Littlestone tree of depthnnwith vertices labeled by instances from𝒳\\mathcal\{X\}\. Then
\|Bℱ\(𝐓\)\|≤\(n≤𝙻\(ℱ\)\)≤n2𝙻\(ℱ\)\.\|B\_\{\\mathcal\{F\}\}\(\\mathbf\{T\}\)\|\\leq\\binom\{n\}\{\\leq\\mathtt\{L\}\(\\mathcal\{F\}\)\}\\leq n^\{2\\mathtt\{L\}\(\\mathcal\{F\}\)\}\.
The work of\[[12](https://arxiv.org/html/2605.06819#bib.bib9)\]further extended the definition of the Littlestone dimension to non\-binary classesℱ⊂𝒴𝒳\\mathcal\{F\}\\subset\\mathcal\{Y\}^\{\\mathcal\{X\}\},\|𝒴\|≥2\|\\mathcal\{Y\}\|\\geq 2\. In their definition, instead of labeling each left edge with0and each right edge with11, one may choose for every two sibling edges \(two edges that are the outgoing edges of a single vertex\) two distinct labels from𝒴\\mathcal\{Y\}to be labeled by\. When𝒴=\{0,1\}\\mathcal\{Y\}=\\\{0,1\\\}, this definition is reduced to the standard definition\. However, for\|𝒴\|\>2\|\\mathcal\{Y\}\|\>2,\[[12](https://arxiv.org/html/2605.06819#bib.bib9)\]showed that this definition still characterizes the optimal mistake bound for online learningℱ\\mathcal\{F\}\.
###### Theorem 6\.5\(Multiclass Online learnability characterization\[[12](https://arxiv.org/html/2605.06819#bib.bib9)\]\)\.
The optimal mistake bound for online learningℱ⊂𝒴𝒳\\mathcal\{F\}\\subset\\mathcal\{Y\}^\{\\mathcal\{X\}\}for any𝒴\\mathcal\{Y\}is precisely𝙻\(ℱ\)\\mathtt\{L\}\(\\mathcal\{F\}\)\.
Some of our results have implications for autoregressive PAC\-learning, so for completeness we also formally define PAC\-learnability here\.
###### Definition 6\.6\(PAC learnability\)\.
We say thatℱ⊂𝒴𝒳\\mathcal\{F\}\\subset\\mathcal\{Y\}^\{\\mathcal\{X\}\}\(where𝒴\\mathcal\{Y\}is not necessarily binary\) is*PAC\-learnable*with sample complexitym:=m\(ϵ,δ\)m:=m\(\\epsilon,\\delta\), if there exists a PAC\-learner𝖫𝗋𝗇:\(𝒳×𝒴\)⋆→𝒴𝒳\\mathsf\{Lrn\}:\(\\mathcal\{X\}\\times\\mathcal\{Y\}\)^\{\\star\}\\to\\mathcal\{Y\}^\{\\mathcal\{X\}\}, such that for every distributionDDover𝒳\\mathcal\{X\}, for everyϵ,δ\>0\\epsilon,\\delta\>0, and for every target functionf⋆∈ℱf\_\{\\star\}\\in\\mathcal\{F\}, if𝒙:=x1,…,xm∼iidD\\boldsymbol\{x\}:=x\_\{1\},\\ldots,x\_\{m\}\\sim\_\{iid\}Dand\(𝒙,𝒚\):=\(\(xi,f⋆\(xi\)\)i=1m\(\\boldsymbol\{x\},\\boldsymbol\{y\}\):=\(\(x\_\{i\},f\_\{\\star\}\(x\_\{i\}\)\)\_\{i=1\}^\{m\}, then
LD,f⋆\(f𝖫𝗋𝗇\(𝒙,𝒚\)\)\>εL\_\{D,f\_\{\\star\}\}\(f\_\{\\mathsf\{Lrn\}\(\\boldsymbol\{x\},\\boldsymbol\{y\}\)\}\)\>\\varepsilonwith probability at mostδ\\delta, wheref𝖫𝗋𝗇\(𝒙,𝒚\):=𝖫𝗋𝗇\(𝒙,𝒚\)f\_\{\\mathsf\{Lrn\}\(\\boldsymbol\{x\},\\boldsymbol\{y\}\)\}:=\\mathsf\{Lrn\}\(\\boldsymbol\{x\},\\boldsymbol\{y\}\)andLD,f⋆\(f\):=Prx∼D\[f\(x\)≠f⋆\(x\)\]L\_\{D,f\_\{\\star\}\}\(f\):=\\Pr\_\{x\\sim D\}\\left\[f\(x\)\\neq f\_\{\\star\}\(x\)\\right\]\.
Our results have implications for 𝖾𝟤𝖾-PAC learning, which is simply PAC-learning of ℱ^{e2e-M}, and also for 𝖢𝗈𝖳-PAC learning, which is the same as 𝖾𝟤𝖾-PAC learning except that training examples reveal the entire chain-of-thought that led to the final output. So, as in the online setting, every PAC sample complexity bound for learning ℱ^{CoT-M} is also a PAC sample complexity bound for 𝖢𝗈𝖳-learning, which is an easier task than PAC-learning ℱ^{CoT-M}.
We now define the VC\-dimension, which is known to characterize the sample complexity of PAC\-learning binary classes \(\|𝒴\|=2\|\\mathcal\{Y\}\|=2\), up to accuracy \(ϵ,δ\\epsilon,\\delta\) factors\[[8](https://arxiv.org/html/2605.06819#bib.bib97),[17](https://arxiv.org/html/2605.06819#bib.bib98),[22](https://arxiv.org/html/2605.06819#bib.bib99)\]\.
The VC\-dimension is defined as the Littlestone dimension, but restricted only to trees where each branch is labeled by the same sequence of instances\. A formal definition is given below\.
###### Definition 6\.7\(VC\-dimension\)\.
The*VC dimension*ofℱ\\mathcal\{F\}, denotedVC\(ℱ\)\\operatorname\{VC\}\(\\mathcal\{F\}\), is the maximal depth of a perfect tree which is shattered byℱ\\mathcal\{F\}, and also satisfies𝒙\(b1\)=𝒙\(b2\)\\boldsymbol\{x\}\(b\_\{1\}\)=\\boldsymbol\{x\}\(b\_\{2\}\)for every two branchesb1,b2b\_\{1\},b\_\{2\}\. If there is no such maximal depth, we say thatVC\(ℱ\)=∞\\operatorname\{VC\}\(\\mathcal\{F\}\)=\\infty\.
Since in the context of the VC\-dimension we require that𝒙\(b1\)=𝒙\(b2\)\\boldsymbol\{x\}\(b\_\{1\}\)=\\boldsymbol\{x\}\(b\_\{2\}\)for every two branchesb1,b2b\_\{1\},b\_\{2\}, instead of discussing trees, we may simply refer to𝒙\(b1\)\\boldsymbol\{x\}\(b\_\{1\}\)as a*shattered set*\. Note that the definition of the VC\-dimension immediately implies thatVC\(ℱ\)≤𝙻\(ℱ\)\\operatorname\{VC\}\(\\mathcal\{F\}\)\\leq\\mathtt\{L\}\(\\mathcal\{F\}\)for any classℱ\\mathcal\{F\}\.
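Analogously to the Littlestone sketch above, the VC dimension of a tiny finite class can be computed by brute force; the example below also illustrates the inequality VC(ℱ) ≤ 𝙻(ℱ), since the thresholds on an 8-point domain have VC dimension 1 but Littlestone dimension 3:

```python
from itertools import combinations

def vc_dimension(F, domain):
    """Brute-force VC dimension of a finite class F (callables) on a finite
    domain: the largest k such that some k-point subset is shattered,
    i.e., all 2^k label patterns appear among the functions of F."""
    best = 0
    for k in range(1, len(domain) + 1):
        found = False
        for S in combinations(domain, k):
            patterns = {tuple(f(x) for x in S) for f in F}
            if len(patterns) == 2 ** k:
                found = True
                break
        if not found:
            break
        best = k
    return best

thresholds = [(lambda t: (lambda x: int(x >= t)))(t) for t in range(9)]
print(vc_dimension(thresholds, range(8)))  # 1
```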
As the result below states, analyzing the Littlestone dimension of a classℱ\\mathcal\{F\}could have an advantage in the case where\|𝒴\|\>2\|\\mathcal\{Y\}\|\>2\.
###### Theorem 6\.8\(Online\-to\-PAC conversion\[[32](https://arxiv.org/html/2605.06819#bib.bib67)\]\)\.
If𝙻\(ℱ\)<∞\\mathtt\{L\}\(\\mathcal\{F\}\)<\\infty, thenℱ\\mathcal\{F\}is learnable with sample complexity
m\(ε,δ\)=O\(𝙻\(ℱ\)\+log\(1/δ\)ε\)\.m\(\\varepsilon,\\delta\)=O\\left\(\\frac\{\\mathtt\{L\}\(\\mathcal\{F\}\)\+\\log\(1/\\delta\)\}\{\\varepsilon\}\\right\)\.
In the case |𝒴| = 2, Theorem [6.8](https://arxiv.org/html/2605.06819#S6.Thmtheorem8) does not give any advantage, since [[22](https://arxiv.org/html/2605.06819#bib.bib99)] proved the same bound with 𝙻(ℱ) replaced by VC(ℱ), and VC(ℱ) ≤ 𝙻(ℱ). However, Theorem [6.8](https://arxiv.org/html/2605.06819#S6.Thmtheorem8) applies also to the case |𝒴| > 2, while VC-based bounds do not, since the VC-dimension is defined only for binary classes.
## 7Upper bound for end\-to\-end learning
###### Theorem 7\.1\.
Letℱ\\mathcal\{F\}be a base class\. Then for allM≥2M\\geq 2:
𝙻\(ℱ𝖾𝟤𝖾−M\)=O\(𝙻\(ℱ\)log\(M𝙻\(ℱ\)\)\)\.\\mathtt\{L\}\(\\mathcal\{F\}^\{\\mathsf\{e2e\}\-M\}\)=O\(\\mathtt\{L\}\(\\mathcal\{F\}\)\\log\(M\\mathtt\{L\}\(\\mathcal\{F\}\)\)\)\.
###### Proof\.
Let𝐓\\mathbf\{T\}be a Littlestone tree of depthmmwhich is shattered byℱ𝖾𝟤𝖾−M\\mathcal\{F\}^\{\\mathsf\{e2e\}\-M\}\. We will show thatm=O\(𝙻\(ℱ\)log\(M𝙻\(ℱ\)\)\)m=O\(\\mathtt\{L\}\(\\mathcal\{F\}\)\\log\(M\\mathtt\{L\}\(\\mathcal\{F\}\)\)\), which implies the lemma\. For every nodeu∈\{0,1\}<mu\\in\\\{0,1\\\}^\{<m\}of𝐓\\mathbf\{T\}\(where the node is given by the labels of the edge\-path from the root to the node\), letxux\_\{u\}be the instance labeling that node\. Now, we inflate the tree𝐓\\mathbf\{T\}to a tree𝐓ℱ\\mathbf\{T\}\_\{\\mathcal\{F\}\}of depthmMmMthat takes into account the intermediate autoregressive steps\. The idea is to inflate every two sibling edges in𝐓\\mathbf\{T\}to possible paths of lengthMMusing a generation tree\. Let us formally describe the recursive construction of𝐓ℱ\\mathbf\{T\}\_\{\\mathcal\{F\}\}\. We maintain a saved pathv∈\{0,1\}<mv\\in\\\{0,1\\\}^\{<m\}in the original tree𝐓\\mathbf\{T\}saved in every node of𝐓ℱ\\mathbf\{T\}\_\{\\mathcal\{F\}\}\. In the root of𝐓ℱ\\mathbf\{T\}\_\{\\mathcal\{F\}\}, the saved path is∅\\emptyset\(the empty path\), and every node inherits the saved path from its parent, until we explicitly update the saved path\. For the base of the recursion, we construct the generation tree𝐓M\(x∅\)\\mathbf\{T\}\_\{M\}\(x\_\{\\emptyset\}\)\. For the recursive step, we either finish construction, or paste another generation tree at every leaf of the tree constructed so far\. Concretely, for every leaf of the tree constructed so far, ifvvis the saved path of this leaf and\|v\|=m\|v\|=m, finish construction\. Otherwise, if it is a left leaf \(that is, a leaf that lies at the tip of an edge labeled with0\), paste𝐓M\(xv0\)\\mathbf\{T\}\_\{M\}\(x\_\{v0\}\)at this leaf, and update the saved path tov0v0\. Otherwise, paste𝐓M\(xv1\)\\mathbf\{T\}\_\{M\}\(x\_\{v1\}\)at this leaf, and update the saved path tov1v1\.
For a classℱ\\mathcal\{F\}and a tree𝐓\\mathbf\{T\}, letBℱ\(𝐓\)B\_\{\\mathcal\{F\}\}\(\\mathbf\{T\}\)be the set of branches of𝐓\\mathbf\{T\}that are realized byℱ\\mathcal\{F\}\. Now, note that
\|Bℱ\(𝐓ℱ\)\|≥2m\.\|B\_\{\\mathcal\{F\}\}\(\\mathbf\{T\}\_\{\\mathcal\{F\}\}\)\|\\geq 2^\{m\}\.\(3\)Indeed, there is a one\-to\-one functionGGfrom the branches of𝐓\\mathbf\{T\}to the branches ofBℱ\(𝐓ℱ\)B\_\{\\mathcal\{F\}\}\(\\mathbf\{T\}\_\{\\mathcal\{F\}\}\)\. For a branchu∈\{0,1\}mu\\in\\\{0,1\\\}^\{m\}of𝐓\\mathbf\{T\}, letfu∈ℱf\_\{u\}\\in\\mathcal\{F\}such thatfu𝖾𝟤𝖾−Mf\_\{u\}^\{\\mathsf\{e2e\}\-M\}realizesuu\. Then by construction of𝐓ℱ\\mathbf\{T\}\_\{\\mathcal\{F\}\},fu∈ℱf\_\{u\}\\in\\mathcal\{F\}agrees with some branch in𝐓ℱ\\mathbf\{T\}\_\{\\mathcal\{F\}\}that its leaf has the saved pathuu\. So,G\(u\)G\(u\)returns this branch\. ThereforeG\(u\)∈Bℱ\(𝐓ℱ\)G\(u\)\\in B\_\{\\mathcal\{F\}\}\(\\mathbf\{T\}\_\{\\mathcal\{F\}\}\)andGGis clearly a bijection\. On the other hand, Theorem[6\.4](https://arxiv.org/html/2605.06819#S6.Thmtheorem4)implies
\|Bℱ\(𝐓ℱ\)\|≤\(mM\)2𝙻\(ℱ\)\.\|B\_\{\\mathcal\{F\}\}\(\\mathbf\{T\}\_\{\\mathcal\{F\}\}\)\|\\leq\(mM\)^\{2\\mathtt\{L\}\(\\mathcal\{F\}\)\}\.\(4\)From \([3](https://arxiv.org/html/2605.06819#S7.E3)\) and \([4](https://arxiv.org/html/2605.06819#S7.E4)\), we obtain:
2m≤\(mM\)2𝙻\(ℱ\)\.2^\{m\}\\leq\(mM\)^\{2\\mathtt\{L\}\(\\mathcal\{F\}\)\}\.Solving this inequality formmimplies the stated bound\. ∎
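For completeness, one routine way to solve the final inequality for m (writing L := 𝙻(ℱ)): taking logarithms,

$$2^{m}\le (mM)^{2L}\iff m\le 2L\bigl(\log m+\log M\bigr),$$

so either m ≤ 4L log M, or m ≤ 4L log m; the latter implies m ≤ 8L log(8L), since x ≤ a log x forces x ≤ 2a log(2a) for a ≥ 2. In both cases m = O(L log(LM)), as claimed.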
## 8The taxonomy of end\-to\-end learning
We have shown that for any base classℱ\\mathcal\{F\}and any generation lengthM≥𝙻\(ℱ\)M\\geq\\mathtt\{L\}\(\\mathcal\{F\}\)we have𝙻\(ℱ𝖾𝟤𝖾−M\)=O\(𝙻\(ℱ\)logM\)\\mathtt\{L\}\(\\mathcal\{F\}^\{\\mathsf\{e2e\}\-M\}\)=O\(\\mathtt\{L\}\(\\mathcal\{F\}\)\\log M\)\. Below, we prove that essentially any “well\-behaved” sub\-logarithmic growth rate of𝙻\(ℱ𝖾𝟤𝖾−M\)\\mathtt\{L\}\(\\mathcal\{F\}^\{\\mathsf\{e2e\}\-M\}\)between constant and logarithmic inMMis realized by some Littlestone class\. We further show that for alldd, there exists a classℱ\\mathcal\{F\}with𝙻\(ℱ\)=O\(d\)\\mathtt\{L\}\(\\mathcal\{F\}\)=O\(d\)and
𝙻\(ℱ𝖾𝟤𝖾−M\)≥VC\(ℱ𝖾𝟤𝖾−M\)=Ω\(𝙻\(ℱ\)logM\)\\mathtt\{L\}\(\\mathcal\{F\}^\{\\mathsf\{e2e\}\-M\}\)\\geq\\operatorname\{VC\}\(\\mathcal\{F\}^\{\\mathsf\{e2e\}\-M\}\)=\\Omega\(\\mathtt\{L\}\(\\mathcal\{F\}\)\\log M\)\(5\)for allMM\. This essentially completes the taxonomy of online𝖾𝟤𝖾\\mathsf\{e2e\}\-learning of Littlestone classes\. Along the way, the right\-hand side of[5](https://arxiv.org/html/2605.06819#S8.E5)resolves an open question of\[[26](https://arxiv.org/html/2605.06819#bib.bib71)\], who proved thatVC\(ℱ𝖾𝟤𝖾−M\)=O\(𝙻\(ℱ\)logM\)\\operatorname\{VC\}\(\\mathcal\{F\}^\{\\mathsf\{e2e\}\-M\}\)=O\(\\mathtt\{L\}\(\\mathcal\{F\}\)\\log M\)for allℱ,M\\mathcal\{F\},M, and asked if this upper bound is tight for some classes\.
Before we formally state the results, we define precisely what we mean by a “well\-behaved” sub\-logarithmic growth rate\. Fix a sufficiently large universal constantM0M\_\{0\}, to be used throughout the section\. We say that a functionr:ℕ\+→ℕ\+r:\\mathbb\{N\}\_\{\+\}\\to\\mathbb\{N\}\_\{\+\}is a*well\-behaved sub\-logarithmic growth rate*if:
1. 1\.rris sub\-logarithmic:r\(M\)≤14log\(M\)r\(M\)\\leq\\frac\{1\}\{4\}\\log\(M\)for allM≥M0M\\geq M\_\{0\}\.
2. 2\.rris non\-decreasing :M1≥M2⟹r\(M1\)≥r\(M2\)M\_\{1\}\\geq M\_\{2\}\\implies r\(M\_\{1\}\)\\geq r\(M\_\{2\}\)for allM1,M2≥1M\_\{1\},M\_\{2\}\\geq 1\.
3. 3\.rris sub\-additive:r\(M1\+M2\)≤r\(M1\)\+r\(M2\)r\(M\_\{1\}\+M\_\{2\}\)\\leq r\(M\_\{1\}\)\+r\(M\_\{2\}\)for allM1,M2≥1M\_\{1\},M\_\{2\}\\geq 1\.
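As a quick numeric sanity check of these three conditions (an illustration only, not a proof; the concrete rate ⌈√(log₂ M)⌉ floored at 1 and the value used for M₀ below are our own choices), one can verify them on a finite range:

```python
import math

def r(M):
    # candidate rate: ceil(sqrt(log2 M)), floored at 1 so that it maps into N_+
    return max(1, math.ceil(math.sqrt(math.log2(M))))

M0 = 2 ** 20   # a stand-in for the (unspecified) universal constant M0

# property 1 (sub-logarithmic) and property 2 (non-decreasing) on a finite window
for M in range(M0, M0 + 2000):
    assert r(M) <= math.log2(M) / 4
    assert r(M + 1) >= r(M)

# property 3 (sub-additivity) on a finite grid of small arguments
for M1 in range(1, 200):
    for M2 in range(1, 200):
        assert r(M1 + M2) <= r(M1) + r(M2)
```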
Note that all natural sub\-logarithmic growth rates such as⌈logM⌉\\left\\lceil\\sqrt\{\\log M\}\\right\\rceil,⌈loglogM⌉\\left\\lceil\\log\\log M\\right\\rceiletc\. satisfy our definition\. We can now state the result for well\-behaved sub\-logarithmic rates\.
###### Theorem 8\.1\.
For any well\-behaved sub\-logarithmic growth raterrthere exists a base classℱ\\mathcal\{F\}such that:
1. 1\.𝙻\(ℱ\)=1\\mathtt\{L\}\(\\mathcal\{F\}\)=1\.
2. 2\.𝙻\(ℱ𝖾𝟤𝖾−M\)=Θ\(r\(M\)\)\\mathtt\{L\}\(\\mathcal\{F\}^\{\\mathsf\{e2e\}\-M\}\)=\\Theta\(r\(M\)\)for allM≥M0M\\geq M\_\{0\}\.
The definition of ℱ and the lower bound strategy for the second item are due to unpublished private communication [[21](https://arxiv.org/html/2605.06819#bib.bib100)]. For logarithmic (in contrast to sub-logarithmic) rates, we prove the following.
###### Theorem 8\.2\.
For anyd∈ℕ\+d\\in\\mathbb\{N\}\_\{\+\}, there exists a classℱ\\mathcal\{F\}such that𝙻\(ℱ\)=O\(d\)\\mathtt\{L\}\(\\mathcal\{F\}\)=O\(d\)and
VC\(ℱ𝖾𝟤𝖾−M\)=Ω\(𝙻\(ℱ\)logM\)\\operatorname\{VC\}\(\\mathcal\{F\}^\{\\mathsf\{e2e\}\-M\}\)=\\Omega\(\\mathtt\{L\}\(\\mathcal\{F\}\)\\log M\)for allM≥M0M\\geq M\_\{0\}\.
Together with the upper bound of Theorem[7\.1](https://arxiv.org/html/2605.06819#S7.Thmtheorem1), this establishes that for anyd∈ℕ\+d\\in\\mathbb\{N\}\_\{\+\}there exists a classℱ\\mathcal\{F\}such that𝙻\(ℱ\)=O\(d\)\\mathtt\{L\}\(\\mathcal\{F\}\)=O\(d\)and
VC\(ℱ𝖾𝟤𝖾−M\),𝙻\(ℱ𝖾𝟤𝖾−M\)=Θ\(𝙻\(ℱ\)logM\)\\operatorname\{VC\}\(\\mathcal\{F\}^\{\\mathsf\{e2e\}\-M\}\),\\mathtt\{L\}\(\\mathcal\{F\}^\{\\mathsf\{e2e\}\-M\}\)=\\Theta\(\\mathtt\{L\}\(\\mathcal\{F\}\)\\log M\)for allM≥max\{M0,𝙻\(ℱ\)\}M\\geq\\max\\\{M\_\{0\},\\mathtt\{L\}\(\\mathcal\{F\}\)\\\}\.
In the following sections, we prove Theorem[8\.1](https://arxiv.org/html/2605.06819#S8.Thmtheorem1)and Theorem[8\.2](https://arxiv.org/html/2605.06819#S8.Thmtheorem2)\.
### 8\.1Proof of Theorem[8\.1](https://arxiv.org/html/2605.06819#S8.Thmtheorem1)
Fix the growth raterr\.
We first define the classℱ\\mathcal\{F\}\. For everys≥M0s\\geq M\_\{0\}let
Ks:=\{r\(s\),r\(s\)\+1,…,r\(s\)\+2r\(s\)−1\}\.K\_\{s\}:=\\\{r\(s\),r\(s\)\+1,\\ldots,r\(s\)\+2^\{r\(s\)\}\-1\\\}\.That is,KsK\_\{s\}is simply the set\{0,…,2r\(s\)−1\}\\\{0,\\ldots,2^\{r\(s\)\}\-1\\\}of size2r\(s\)2^\{r\(s\)\}, withr\(s\)r\(s\)added to each element\. We now define a “baseline” functionf:\{0,1\}⋆→\{0,1\}f:\\\{0,1\\\}^\{\\star\}\\to\\\{0,1\\\}as follows\. For everys≥M0s\\geq M\_\{0\},k∈Ksk\\in K\_\{s\},j∈\{0,…,r\(s\)−1\}j\\in\\\{0,\\ldots,r\(s\)\-1\\\}andz∈\{0,1\}jz\\in\\\{0,1\\\}^\{j\}, set:
f\(0s10k10s−k−1z\)=\(k−r\(s\)\)j\+1f\(0^\{s\}10^\{k\}10^\{s\-k\-1\}z\)=\(k\-r\(s\)\)\_\{j\+1\}where\(k−r\(s\)\)j\+1\(k\-r\(s\)\)\_\{j\+1\}is the\(j\+1\)st\(j\+1\)^\{st\}bit in the binary representation ofk−r\(s\)k\-r\(s\)\. For any other stringxxthat does not fit the form above, we setf\(x\)=0f\(x\)=0\. Note that0s−k−10^\{s\-k\-1\}is well\-defined, since
k\+1≤r\(s\)\+2r\(s\)≤14logs\+214logs≤s≤s−1,k\+1\\leq r\(s\)\+2^\{r\(s\)\}\\leq\\frac\{1\}\{4\}\\log s\+2^\{\\frac\{1\}\{4\}\\log s\}\\leq\\sqrt\{s\}\\leq s\-1,\(6\)where the final inequality holds sinces≥M0s\\geq M\_\{0\}\. Thuss−k−1≥1s\-k\-1\\geq 1\. Now, we define for each fixed pairssandk∈Ksk\\in K\_\{s\}the following single\-point modification offf:
fs,k\(x\)=\{1x=0s10k,f\(x\)otherwise\.f\_\{s,k\}\(x\)=\\begin\{cases\}1&x=0^\{s\}10^\{k\},\\\\ f\(x\)&\\text\{otherwise\}\.\\end\{cases\}We may now define the classℱ\\mathcal\{F\}as
ℱ:=\{fs,k:s≥M0,k∈Ks\}\.\\mathcal\{F\}:=\\\{f\_\{s,k\}:s\\geq M\_\{0\},k\\in K\_\{s\}\\\}\.
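An executable sketch of this construction may help; it is an illustration only, with several choices that are ours rather than the paper's: the concrete rate r, a most-significant-bit-first reading of (k−r(s))_{j+1}, and the toy value M₀ = 16 standing in for the universal constant. The final loop checks, at this toy scale, the shattering claim proved in Lemma 8.4 below.

```python
import math

def r(M):
    # a concrete well-behaved rate (our choice for this sketch)
    return max(1, math.ceil(math.sqrt(math.log2(M))))

def bit(n, j, width):
    # (j+1)-st bit of n in a width-bit representation, most significant bit
    # first (the indexing convention is an assumption of this sketch)
    return (n >> (width - 1 - j)) & 1

def parse_special(x, M0):
    # try to decompose x as 0^s 1 0^k 1 0^(s-k-1) z with s >= M0, k in K_s, |z| < r(s)
    s = len(x) - len(x.lstrip('0'))
    if s < M0 or s >= len(x):
        return None
    rest = x[s + 1:]
    k = len(rest) - len(rest.lstrip('0'))
    if k >= len(rest):
        return None
    if not (r(s) <= k <= r(s) + 2 ** r(s) - 1):
        return None
    rest2 = rest[k + 1:]
    pad = s - k - 1
    if len(rest2) < pad or any(c != '0' for c in rest2[:pad]):
        return None
    z = rest2[pad:]
    return (s, k, z) if len(z) < r(s) else None

def f_base(x, M0):
    p = parse_special(x, M0)
    if p is None:
        return 0
    s, k, z = p
    return bit(k - r(s), len(z), r(s))

def f_sk(s, k, M0):
    # the single-point modification f_{s,k}
    return lambda x: 1 if x == '0' * s + '1' + '0' * k else f_base(x, M0)

def e2e(next_token, x, M):
    for _ in range(M):
        x += str(next_token(x))
    return int(x[-1])

# Toy check of Lemma 8.4: every labeling y of {0^M 1 0^i : 1 <= i <= r(M)} is
# realized by f_{M, r(M)+b}^{e2e-M}, where b has binary representation y.
M0 = M = 16
for b in range(2 ** r(M)):
    y = [bit(b, j, r(M)) for j in range(r(M))]
    g = f_sk(M, r(M) + b, M0)
    for i in range(1, r(M) + 1):
        assert e2e(g, '0' * M + '1' + '0' * i, M) == y[i - 1]
```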
We first prove the first item of Theorem [8.1](https://arxiv.org/html/2605.06819#S8.Thmtheorem1).
###### Lemma 8\.3\.
We have𝙻\(ℱ\)=1\\mathtt\{L\}\(\\mathcal\{F\}\)=1\.
###### Proof\.
We have 𝙻(ℱ) ≥ 1 since ℱ contains more than one function. To see that 𝙻(ℱ) ≤ 1, consider the following online learner for ℱ. The learner predicts as f until it makes its first mistake. If it makes a mistake on an instance x, then the target f_{s,k} satisfies f_{s,k}(x) ≠ f(x), so necessarily x = 0^s 1 0^k. This reveals both s and k, so the learner has discovered the correct function and from then on always predicts as this function. ∎
We now prove the lower bound in the second item of Theorem[8\.1](https://arxiv.org/html/2605.06819#S8.Thmtheorem1)\. We prove it by showing the stronger claimVC\(ℱ𝖾𝟤𝖾−M\)≥r\(M\)\\operatorname\{VC\}\(\\mathcal\{F\}^\{\\mathsf\{e2e\}\-M\}\)\\geq r\(M\)for allM≥M0M\\geq M\_\{0\}\.
###### Lemma 8\.4\.
FixM≥M0M\\geq M\_\{0\}\. Then the set
AM:=\{ai:=0M10i:1≤i≤r\(M\)\}A\_\{M\}:=\\\{a\_\{i\}:=0^\{M\}10^\{i\}:1\\leq i\\leq r\(M\)\\\}is shattered byℱ𝖾𝟤𝖾−M\\mathcal\{F\}^\{\\mathsf\{e2e\}\-M\}\.
###### Proof\.
Let 𝒚 = y_1,…,y_{r(M)} be a labeling of A_M. It suffices to show that this labeling is realized by ℱ^{e2e-M}. Let b be the number whose binary representation is the vector 𝒚, and let k := r(M) + b ∈ K_M. We claim that for all i, f_{M,k}^{e2e-M}(a_i) = y_i. Let us describe the M-step computation path of f_{M,k}^{CoT-M} on a_i:
1. 1\.In the firstk−i≥bk\-i\\geq bsteps,fM,kf\_\{M,k\}agrees withff, which means that0is appended\.
2. 2\.After the firstk−i≤Mk\-i\\leq\\sqrt\{M\}steps \(the inequality is by \([6](https://arxiv.org/html/2605.06819#S8.E6)\)\), the computation path reaches the instance0M10k0^\{M\}10^\{k\}, so in stepk−i\+1k\-i\+1,11is appended, by definition offM,kf\_\{M,k\}\.
3. 3\.By definition offf,0is appended in the followingM−k−1M\-k\-1steps\.
4. 4\.So far, we completed a computation of lengthk−i\+1\+M−k−1=M−ik\-i\+1\+M\-k\-1=M\-i, so we haveiimore steps\. By definition offf, in the lastiisteps of the computation, the generated bits are𝒚≤i\\boldsymbol\{y\}\_\{\\leq i\}\.
Overall, we deduce that:
fM,k𝖢𝗈𝖳−M\(ai\)=0k−i10M−k−1y1,…,yi,f\_\{M,k\}^\{\\mathsf\{CoT\}\-M\}\(a\_\{i\}\)=0^\{k\-i\}10^\{M\-k\-1\}y\_\{1\},\\ldots,y\_\{i\},which impliesfM,k𝖾𝟤𝖾−M\(ai\)=yif\_\{M,k\}^\{\\mathsf\{e2e\}\-M\}\(a\_\{i\}\)=y\_\{i\}\. ∎
Lemma [8.4](https://arxiv.org/html/2605.06819#S8.Thmtheorem4) immediately implies the lower bound in the second item of Theorem [8.1](https://arxiv.org/html/2605.06819#S8.Thmtheorem1). We now turn to the upper bound.
For everys≥M0s\\geq M\_\{0\}define the “bucket”
Bs:=\{0s10i:i≥0\}\.B\_\{s\}:=\\\{0^\{s\}10^\{i\}:i\\geq 0\\\}\.
###### Lemma 8\.5\.
Fix M, s ≥ M_0 and k ∈ K_s. For all x ∉ B_s:
fs,k𝖾𝟤𝖾−M\(x\)=f𝖾𝟤𝖾−M\(x\)\.f\_\{s,k\}^\{\\mathsf\{e2e\}\-M\}\(x\)=f^\{\\mathsf\{e2e\}\-M\}\(x\)\.
###### Proof\.
Sincefs,kf\_\{s,k\}andffdiffer only on the instancexs,k:=0s10kx\_\{s,k\}:=0^\{s\}10^\{k\}, it is possible thatfs,k𝖾𝟤𝖾−M\(x\)≠f𝖾𝟤𝖾−M\(x\)f\_\{s,k\}^\{\\mathsf\{e2e\}\-M\}\(x\)\\neq f^\{\\mathsf\{e2e\}\-M\}\(x\)only ifxs,kx\_\{s,k\}is an instance appearing in an intermediate step in the generation off𝖢𝗈𝖳−M\(x\)f^\{\\mathsf\{CoT\}\-M\}\(x\)\. This is possible only ifxxis a prefix ofxs,kx\_\{s,k\}\. So, we only need to analyze the case wherexxis both a prefix ofxs,kx\_\{s,k\}and not in the bucketBsB\_\{s\}\. So the form ofxxmust bex=0nx=0^\{n\}for some1≤n≤s1\\leq n\\leq s\. In this case,ffappends0in every step, soxs,kx\_\{s,k\}never appears\. ∎
We now prove another lemma that handles the case s ≫ M, where B_s is the “correct” bucket. The lemma establishes that in this case, any function in ℱ^{e2e-M} labels at most a single instance of the form 0^s 1 0^i with 1.
###### Lemma 8\.6\.
Fixs≥M0s\\geq M\_\{0\},s≥10Ms\\geq 10M,i≥0i\\geq 0andk∈Ksk\\in K\_\{s\}\. Then
fs,k𝖾𝟤𝖾−M\(0s10i\)=1⇔M=k−i\+1\.f\_\{s,k\}^\{\\mathsf\{e2e\}\-M\}\(0^\{s\}10^\{i\}\)=1\\iff M=k\-i\+1\.
###### Proof\.
For every instance of the form in the statement of the lemma, f appends 0. Therefore, f_{s,k}(0^s 1 0^j) = 1 is possible only where f_{s,k} differs from f, that is, only for j = k. Therefore, if i > k then the instance 0^s 1 0^k is never reached in the autoregressive generation. Thus, we assume that i ≤ k. We consider the three possible sub-cases.
1. 1\.IfM<k−i\+1M<k\-i\+1, then0the autoregressive function offs,kf\_\{s,k\}appends0for at mostk−ik\-imany times, and thusfs,kf\_\{s,k\}is never receives0s10k0^\{s\}10^\{k\}as input, and sofs,k𝖾𝟤𝖾−M\(0s10i\)=0f\_\{s,k\}^\{\\mathsf\{e2e\}\-M\}\(0^\{s\}10^\{i\}\)=0\.
2. 2\.IfM=k−i\+1M=k\-i\+1, thenfs,k𝖾𝟤𝖾−M\(0s10i\)=1f\_\{s,k\}^\{\\mathsf\{e2e\}\-M\}\(0^\{s\}10^\{i\}\)=1, since in the last step the instance0s10i0^\{s\}10^\{i\}is given as input tofs,kf\_\{s,k\}\.
3. 3\.IfM\>k−i\+1M\>k\-i\+1, then the instance0s10k0^\{s\}10^\{k\}is given as input tofs,kf\_\{s,k\}, and then there is at least one more step before the autoregressive generation of𝖢𝗈𝖳\\mathsf\{CoT\}is done, sinceM\>k−i\+1M\>k\-i\+1\. The definition offfimplies that the followings−k−1s\-k\-1bits after0s10k10^\{s\}10^\{k\}1are0, and only then11could be appended\. However, we have s−k−1≥s−212logs−1≥s−s−1≥s/2≥5M\>M,s\-k\-1\\geq s\-2^\{\\frac\{1\}\{2\}\\log s\}\-1\\geq s\-\\sqrt\{s\}\-1\\geq s/2\\geq 5M\>M,where the first inequality is sincek≤r\(s\)\+2r\(s\)k\\leq r\(s\)\+2^\{r\(s\)\}and the assumptionr\(s\)≤14logsr\(s\)\\leq\\frac\{1\}\{4\}\\log s, and the third inequality holds sinces≥M0s\\geq M\_\{0\}\. So, only zeroes are appended, and thusfs,k𝖾𝟤𝖾−M\(0s10i\)=0f\_\{s,k\}^\{\\mathsf\{e2e\}\-M\}\(0^\{s\}10^\{i\}\)=0\.
∎
###### Lemma 8\.7\.
For allM≥M0M\\geq M\_\{0\}:
𝙻\(ℱ𝖾𝟤𝖾−M\)≤20r\(M\)\.\\mathtt\{L\}\(\\mathcal\{F\}^\{\\mathsf\{e2e\}\-M\}\)\\leq 20r\(M\)\.
###### Proof\.
We construct an online learner that makes at most20r\(M\)20r\(M\)many mistakes\.
As long as the learner has not made a single mistake, it predicts as f^{e2e-M} on every instance. So, the number of mistakes made in this phase is at most 1.
After the first mistake on an instancexx, the learner changes its behavior as follows\. Letfs⋆,k⋆f\_\{s^\{\\star\},k^\{\\star\}\}be the \(unknown\) target concept\. So, we know thatfs⋆,k⋆𝖾𝟤𝖾−M\(x\)≠f𝖾𝟤𝖾−M\(x\)f\_\{s^\{\\star\},k^\{\\star\}\}^\{\\mathsf\{e2e\}\-M\}\(x\)\\neq f^\{\\mathsf\{e2e\}\-M\}\(x\)\. Therefore, Lemma[8\.5](https://arxiv.org/html/2605.06819#S8.Thmtheorem5)implies thatx∈Bs⋆x\\in B\_\{s^\{\\star\}\}\. Thusx=0s⋆10ix=0^\{s^\{\\star\}\}10^\{i\}for somei∈ℕi\\in\\mathbb\{N\}, so the learner now knowss⋆s^\{\\star\}by just looking atxx\. Thus, from this point, the learner only needs to learn the class
ℱs⋆:=\{fs⋆,k⋆:k∈Ks⋆\}\.\\mathcal\{F\}\_\{s^\{\\star\}\}:=\\\{f\_\{s^\{\\star\},k^\{\\star\}\}:k\\in K\_\{s^\{\\star\}\}\\\}\.We consider the two possible cases\.
First, suppose thats⋆≤10Ms^\{\\star\}\\leq 10M\. In this case, the learner runs the halving algorithm overℱs⋆\\mathcal\{F\}\_\{s^\{\\star\}\}\. The mistake bound in this case is
log\|ℱs⋆\|=log2r\(s\)=r\(s\)≤r\(10M\)≤10r\(M\),\\log\|\\mathcal\{F\}\_\{s^\{\\star\}\}\|=\\log 2^\{r\(s\)\}=r\(s\)\\leq r\(10M\)\\leq 10r\(M\),where the first inequality is sincerris non\-decreasing, and the second inequality is sincerris sub\-additive\.
In the complementing case, we haves⋆\>10Ms^\{\\star\}\>10M\. As usual, forx∉Bs⋆x\\notin B\_\{s^\{\\star\}\}, the learner predicts asff, and thus is always correct forx∉Bs⋆x\\notin B\_\{s^\{\\star\}\}\. Forx∈Bs⋆x\\in B\_\{s^\{\\star\}\}, that is,xxof the form0s⋆10i0^\{s^\{\\star\}\}10^\{i\}, the learner always predicts0, until making a mistake\. Once it makes a mistake on the instance0s⋆10i0^\{s^\{\\star\}\}10^\{i\}for somei∈ℕi\\in\\mathbb\{N\}, it means thatfs⋆,k⋆𝖾𝟤𝖾−M\(0s⋆10i\)=1f\_\{s^\{\\star\},k^\{\\star\}\}^\{\\mathsf\{e2e\}\-M\}\(0^\{s^\{\\star\}\}10^\{i\}\)=1\. Lemma[8\.6](https://arxiv.org/html/2605.06819#S8.Thmtheorem6)implies that in this casek⋆=M\+i−1k^\{\\star\}=M\+i\-1\. So the learner now knows boths⋆s^\{\\star\}andk⋆k^\{\\star\}, and therefore knows the target conceptfs⋆,k⋆f\_\{s^\{\\star\},k^\{\\star\}\}and can always predict asfs⋆,k⋆f\_\{s^\{\\star\},k^\{\\star\}\}\.
Summing all mistakes, we deduce that overall the learner makes at most 1 + 10r(M) ≤ 20r(M) mistakes, which concludes the proof. ∎
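The halving step invoked in the first case above is the standard algorithm; the following generic sketch (names ours) makes at most log₂|ℱ_{s⋆}| mistakes in the realizable setting:

```python
import math

def halving_learner(finite_class):
    """Standard halving algorithm over a finite class of callables: predict the
    majority vote of the current version space, and on feedback discard every
    function inconsistent with the revealed label."""
    version_space = list(finite_class)

    def predict(x):
        ones = sum(f(x) for f in version_space)
        return 1 if 2 * ones >= len(version_space) else 0

    def update(x, y):
        nonlocal version_space
        version_space = [f for f in version_space if f(x) == y]

    return predict, update

# toy check: 8 "point functions" on {0,...,7}; the target is the one at 3
fs = [(lambda a: (lambda x: int(x == a)))(a) for a in range(8)]
predict, update = halving_learner(fs)
mistakes = 0
for x in range(8):
    y = int(x == 3)
    mistakes += int(predict(x) != y)
    update(x, y)
assert mistakes <= math.log2(len(fs))
```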
###### Proof of Theorem[8\.1](https://arxiv.org/html/2605.06819#S8.Thmtheorem1)\.
The theorem is given by combining Lemma[8\.3](https://arxiv.org/html/2605.06819#S8.Thmtheorem3), Lemma[8\.4](https://arxiv.org/html/2605.06819#S8.Thmtheorem4), and Lemma[8\.7](https://arxiv.org/html/2605.06819#S8.Thmtheorem7)\. ∎
### 8\.2Proof of Theorem[8\.2](https://arxiv.org/html/2605.06819#S8.Thmtheorem2)
We show that for anyd∈ℕd\\in\\mathbb\{N\}, there exists a classℱ\\mathcal\{F\}with𝙻\(ℱ\)=O\(d\)\\mathtt\{L\}\(\\mathcal\{F\}\)=O\(d\), and such thatVC\(ℱ𝖾𝟤𝖾−M\)=Ω\(dlogM\)\\operatorname\{VC\}\(\\mathcal\{F\}^\{\\mathsf\{e2e\}\-M\}\)=\\Omega\(d\\log M\)for all sufficiently largeMM\. Our general upper bound asserts that𝙻\(ℱ𝖾𝟤𝖾−M\)=O\(dlogM\)\\mathtt\{L\}\(\\mathcal\{F\}^\{\\mathsf\{e2e\}\-M\}\)=O\(d\\log M\)for allM≥𝙻\(ℱ\)M\\geq\\mathtt\{L\}\(\\mathcal\{F\}\), and thus𝙻\(ℱ𝖾𝟤𝖾−M\)=Θ\(dlogM\)\\mathtt\{L\}\(\\mathcal\{F\}^\{\\mathsf\{e2e\}\-M\}\)=\\Theta\(d\\log M\)\.
Our construction solves an open question from\[[26](https://arxiv.org/html/2605.06819#bib.bib71)\]\. They showed that for any classℱ\\mathcal\{F\}and allMM, it holds thatVC\(ℱ𝖾𝟤𝖾−M\)=O\(𝙻\(ℱ\)logM\)\\operatorname\{VC\}\(\\mathcal\{F\}^\{\\mathsf\{e2e\}\-M\}\)=O\(\\mathtt\{L\}\(\\mathcal\{F\}\)\\log M\), and asked if the logarithmic dependence \(or, in fact, any dependence\) onMMis necessary\. Our proof gives a positive answer to this question\.
The proof is via the probabilistic method applied to a “hard” random class defined below\.
#### 8\.2\.1The hard class
Fixd,M∈ℕ\+d,M\\in\\mathbb\{N\}\_\{\+\}\. Let
N := M^{10d} be the size of the class, and let Z := {0^i : i ∈ [M^2]} be the set of instances on which the class is “interesting”. Partition Z into
Z_1 := {0^{jM} : j ∈ [M]} and Z_0 := Z \ Z_1, so |Z_1| = M and |Z_0| = M^2 − M. We now draw N functions independently at random. For each a ∈ [N] and x ∈ {0,1}^⋆, define f_a(x) as follows:
- •Ifx∉Zx\\notin Z, letfa\(x\)=0f\_\{a\}\(x\)=0\.
- •Otherwise, ifx∈Z1x\\in Z\_\{1\}, letfa\(x\)=1f\_\{a\}\(x\)=1w\.p\.1−1/M1\-1/M, andfa\(x\)=0f\_\{a\}\(x\)=0w\.p\.1/M1/M\.
- •Otherwise, ifx∈Z0x\\in Z\_\{0\}, letfa\(x\)=0f\_\{a\}\(x\)=0w\.p\.1−1/M1\-1/M, andfa\(x\)=1f\_\{a\}\(x\)=1w\.p\.1/M1/M\.
Now, let
ℱd,M:=\{fa:a∈\[N\]\}\.\\mathcal\{F\}\_\{d,M\}:=\\\{f\_\{a\}:a\\in\[N\]\\\}\.For eachx∈Z0x\\in Z\_\{0\}, we say that11is the*minority*label ofxx, and0is the*majority*label ofxx\. We define the analog forZ1Z\_\{1\}\.
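A minimal sketch of this random draw (only for tiny d and M, since N = M^{10d} is enormous for realistic parameters; the helper names are ours):

```python
import random

def draw_hard_class(d, M, seed=0):
    """Draw the random class F_{d,M} of Section 8.2.1: N = M^(10d) functions that
    are 0 outside Z, take their majority label w.p. 1 - 1/M, and their minority
    label w.p. 1/M, independently per instance."""
    rng = random.Random(seed)
    N = M ** (10 * d)
    Z1 = {'0' * (j * M) for j in range(1, M + 1)}
    Z0 = {'0' * i for i in range(1, M ** 2 + 1)} - Z1
    functions = []
    for _ in range(N):
        table = {}
        for x in Z1:
            table[x] = 1 if rng.random() < 1 - 1 / M else 0   # majority label 1
        for x in Z0:
            table[x] = 1 if rng.random() < 1 / M else 0        # minority label 1
        functions.append(lambda x, t=table: t.get(x, 0))       # 0 outside Z
    return functions

# toy draw: d = 1, M = 2 gives N = 1024 functions on Z = {0, 00, 000, 0000}
fs = draw_hard_class(1, 2)
```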
We are now ready to begin the proof of Theorem [8.2](https://arxiv.org/html/2605.06819#S8.Thmtheorem2). We begin by proving that the upper bound 𝙻(ℱ_{d,M}) = O(d) holds with high probability.
#### 8\.2\.2Upper bounding the Littlestone dimension of the base class
The intuition for this bound is that since |ℱ| = N = M^{10d} and for each instance there is a label r ∈ {0,1} with which typically only a 1/M-fraction of the version space agrees, the version space should become empty at the bottom of some branch of length O(d). However, this is only true on *average*, while the Littlestone dimension is determined by the *maximal* depth of a shattered tree, so the proof is in fact more challenging than this intuition suggests. We use the technique of [[10](https://arxiv.org/html/2605.06819#bib.bib46)], who provided such an analysis for a similar random class.
###### Lemma 8\.8\.
Let d, M ∈ ℕ be larger than some large enough universal constant. With probability at least 3/4, we have
𝙻\(ℱd,M\)≤100d\.\\mathtt\{L\}\(\\mathcal\{F\}\_\{d,M\}\)\\leq 100d\.
###### Proof\.
Denote ℱ := ℱ_{d,M}. We show that the probability that ℱ shatters some tree of depth 100d is at most 1/4. Fix a tree 𝐓 of depth 100d whose internal nodes are labeled by instances from {0,1}^⋆. If 𝐓 contains nodes labeled with instances outside of Z it cannot be shattered (every function in ℱ labels such instances with 0), so we may assume that all nodes are labeled by instances from Z. Thus, every internal node has precisely one *minority edge*, corresponding to the minority label of the instance labeling it. Let B_{1/2}(𝐓) be the set of branches in 𝐓 in which precisely 50d of the edges are minority edges. Note that
\|B1/2\(𝐓\)\|=\(100d50d\)≥2100d250d≥2100d20d,\\left\\lvert B\_\{1/2\}\(\\mathbf\{T\}\)\\right\\rvert=\\binom\{100d\}\{50d\}\\geq\\frac\{2^\{100d\}\}\{2\\sqrt\{50d\}\}\\geq\\frac\{2^\{100d\}\}\{20\\sqrt\{d\}\},\(7\)where the first inequality is by the known bound\(2nn\)≥22n2n\\binom\{2n\}\{n\}\\geq\\frac\{2^\{2n\}\}\{2\\sqrt\{n\}\}, applied withn=50dn=50d\. Since𝐓\\mathbf\{T\}is shattered, in particular every branch ofB1/2\(𝐓\)B\_\{1/2\}\(\\mathbf\{T\}\)is realized by at least one functionf∈ℱf\\in\\mathcal\{F\}\. Fix a branchb∈B1/2\(𝐓\)b\\in B\_\{1/2\}\(\\mathbf\{T\}\)andf∈ℱf\\in\\mathcal\{F\}\. Since the draw offfis independent for each instance
Pr\[frealizesb\]≤1M50d\.\\Pr\[f\\text\{ realizes \}b\]\\leq\\frac\{1\}\{M^\{50d\}\}\.\(8\)Now, note that each function can realize at most a single branch\. Therefore, if𝐓\\mathbf\{T\}is shattered, then there exists an injective mappingϕ:B1/2\(𝐓\)→\[N\]\\phi:B\_\{1/2\}\(\\mathbf\{T\}\)\\to\[N\]such that for allb∈B1/2\(𝐓\)b\\in B\_\{1/2\}\(\\mathbf\{T\}\),fϕ\(b\)f\_\{\\phi\(b\)\}realizesbb\. That isϕ\\phiassigns to eachb∈B1/2\(𝐓\)b\\in B\_\{1/2\}\(\\mathbf\{T\}\)the index of the function realizing it, and the same function cannot be assigned to two different branches\. Fix a mappingϕ\\phi\. Since the probability thatfϕ\(b\)f\_\{\\phi\(b\)\}realizesbbis independent of the other randomness, and using \([8](https://arxiv.org/html/2605.06819#S8.E8)\), the probability that for allb∈B1/2\(𝐓\)b\\in B\_\{1/2\}\(\\mathbf\{T\}\),fϕ\(b\)f\_\{\\phi\(b\)\}realizesbbis at most
\(1M50d\)\|B1/2\(𝐓\)\|\.\\left\(\\frac\{1\}\{M^\{50d\}\}\\right\)^\{\|B\_\{1/2\}\(\\mathbf\{T\}\)\|\}\.\(9\)
Since there are at most N^{|B_{1/2}(𝐓)|} = M^{10d|B_{1/2}(𝐓)|} such mappings ϕ, using ([9](https://arxiv.org/html/2605.06819#S8.E9)) and ([7](https://arxiv.org/html/2605.06819#S8.E7)), the probability that there exists ϕ so that for all b ∈ B_{1/2}(𝐓), f_{ϕ(b)} realizes b is at most
M10d\|B1/2\(𝐓\)\|\(1M50d\)\|B1/2\(𝐓\)\|=\(1M40d\)\|B1/2\(𝐓\)\|≤\(1M40d\)2100d20d≤1M2d⋅2100d\.M^\{10d\|B\_\{1/2\}\(\\mathbf\{T\}\)\|\}\\left\(\\frac\{1\}\{M^\{50d\}\}\\right\)^\{\|B\_\{1/2\}\(\\mathbf\{T\}\)\|\}=\\left\(\\frac\{1\}\{M^\{40d\}\}\\right\)^\{\|B\_\{1/2\}\(\\mathbf\{T\}\)\|\}\\leq\\left\(\\frac\{1\}\{M^\{40d\}\}\\right\)^\{\\frac\{2^\{100d\}\}\{20\\sqrt\{d\}\}\}\\leq\\frac\{1\}\{M^\{2\\sqrt\{d\}\\cdot 2^\{100d\}\}\}\.\(10\)
So, we deduce that for a fixed tree𝐓\\mathbf\{T\}, the probability that it is shattered byℱ\\mathcal\{F\}is at most1M2d⋅2100d\\frac\{1\}\{M^\{2\\sqrt\{d\}\\cdot 2^\{100d\}\}\}\. The number of perfect trees of depth100d100dlabeled by instances fromZZis at most\|Z\|2100d=M2⋅2100d\|Z\|^\{2^\{100d\}\}=M^\{2\\cdot 2^\{100d\}\}\. Thus, using[10](https://arxiv.org/html/2605.06819#S8.E10), the probability that there exists a tree𝐓\\mathbf\{T\}of depth100d100dwhich is shattered byℱ\\mathcal\{F\}is at most
M2⋅2100d1M2d⋅2100d=M2⋅2100d\(1−d\)<1/4,M^\{2\\cdot 2^\{100d\}\}\\frac\{1\}\{M^\{2\\sqrt\{d\}\\cdot 2^\{100d\}\}\}=M^\{2\\cdot 2^\{100d\}\(1\-\\sqrt\{d\}\)\}<1/4,which concludes the proof\. ∎
We now turn to the lower boundVC\(ℱ𝖾𝟤𝖾−M\)=Ω\(dlogM\)\\operatorname\{VC\}\(\\mathcal\{F\}^\{\\mathsf\{e2e\}\-M\}\)=\\Omega\(d\\log M\)with high probability\. To formalize a choice of a random subset before it is drawn, we use standard propositional logic formalization, as defined below\.
#### 8\.2\.3A framework for choosing subsets from a random class
We define a notion of *rules* for choosing a subset of a random class ℱ before it is drawn, by restricting the labels of some of the instances in the domain. A rule is any restriction on the labels of some of the instances that can be represented as a propositional logic formula with the conjunction and disjunction connectives, where the examples (x,y) ∈ 𝒳×𝒴 are the atomic variables, and the assignments are given by the functions of the class. A rule is said to be *independent* of an instance x if x does not appear in any of the atomic variables used in the rule. For example, if x_1, x_2, x_3, x_4 ∈ 𝒳, then ((x_1,0) ∧ (x_2,1)) ∨ (x_3,0) is a rule, and it is independent of x_4. The *truth value* of a variable (example) (x,y) under the assignment (function) f ∈ {0,1}^𝒳 is naturally defined as f(x,y) := 1[f(x)=y], where f(x,y)=1 represents *true* and f(x,y)=0 represents *false*. The truth value of a rule R under the function f, denoted by f(R), is defined inductively in the standard way used in propositional logic. A rule R for the class ℱ naturally defines a subclass of ℱ, denoted ℱ(R), of all functions satisfying the rule:
ℱ\(R\)=\{f∈ℱ:f\(R\)=1\}\.\\mathcal\{F\}\(R\)=\\\{f\\in\\mathcal\{F\}:f\(R\)=1\\\}\.
Given a decision tree, any branchbbin the tree naturally induces the rule\(𝒙\(b\),𝒚\(b\)\)\(\\boldsymbol\{x\}\(b\),\\boldsymbol\{y\}\(b\)\)\(formally, we need to replace the commas in the sequence\(𝒙\(b\),𝒚\(b\)\)\(\\boldsymbol\{x\}\(b\),\\boldsymbol\{y\}\(b\)\)to∧\\landconnectives, to make it a valid rule, but it is understood that\(𝒙\(b\),𝒚\(b\)\)\(\\boldsymbol\{x\}\(b\),\\boldsymbol\{y\}\(b\)\)represents this rule\)\. Likewise, any two branchesb1,b2b\_\{1\},b\_\{2\}naturally induce the rule\(𝒙\(b1\),𝒚\(b1\)\)∨\(𝒙\(b2\),𝒚\(b2\)\)\(\\boldsymbol\{x\}\(b\_\{1\}\),\\boldsymbol\{y\}\(b\_\{1\}\)\)\\lor\(\\boldsymbol\{x\}\(b\_\{2\}\),\\boldsymbol\{y\}\(b\_\{2\}\)\), and this can be extended to any number of branches\. For a set of branchesBB, we letR\(B\)R\(B\)be the rule induced byBB\.
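A small sketch of this framework, representing rules as predicates over functions (all helper names are ours):

```python
def atom(x, y):
    """Atomic variable (x, y): true for f iff f(x) = y."""
    return lambda f: f(x) == y

def conj(*rules):
    return lambda f: all(R(f) for R in rules)

def disj(*rules):
    return lambda f: any(R(f) for R in rules)

def subclass(F, R):
    """F(R): the functions of F that satisfy the rule R."""
    return [f for f in F if R(f)]

def branch_rule(sample):
    """Rule induced by a branch: the conjunction of its (instance, label) pairs."""
    return conj(*(atom(x, y) for x, y in sample))

def branches_rule(samples):
    """R(B): the disjunction of the rules induced by the branches in B."""
    return disj(*(branch_rule(s) for s in samples))

# the example rule from the text, ((x1,0) and (x2,1)) or (x3,0), over string instances
R = disj(conj(atom("x1", 0), atom("x2", 1)), atom("x3", 0))
```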
We now use this framework and prove a standard concentration lemma for predetermined rules defined overℱd,M\\mathcal\{F\}\_\{d,M\}\.
#### 8\.2\.4A concentration lemma for predetermined rules
###### Lemma 8\.9\.
Let d ∈ ℕ be larger than some large enough universal constant, and let M ≥ d^2. Let ℛ be a family of rules such that |ℛ| ≤ e^{M^{6d}}. Then w.p. at least 3/4 over the draw of ℱ_{d,M}, for all x ∈ Z_r (where r ∈ {0,1}) and all R ∈ ℛ such that R is independent of x and |ℱ_{d,M}(R)| ≥ M^{8d}, we have
12M\|ℱd,M\(R\)\|≤\|ℱd,M\(R∧\(x,1−r\)\)\|≤32M\|ℱd,M\(R\)\|\.\\frac\{1\}\{2M\}\|\\mathcal\{F\}\_\{d,M\}\(R\)\|\\leq\|\\mathcal\{F\}\_\{d,M\}\(R\\land\(x,1\-r\)\)\|\\leq\\frac\{3\}\{2M\}\|\\mathcal\{F\}\_\{d,M\}\(R\)\|\.
###### Proof\.
SinceRRis independent ofxx, we have
X:=\|ℱd,M\(R∧\(x,1−r\)\)\|∼Bin\(\|ℱd,M\(R\)\|,1/M\),μ:=𝔼\[X\]=\|ℱd,M\(R\)\|/M≥M8d−1\.X:=\|\\mathcal\{F\}\_\{d,M\}\(R\\land\(x,1\-r\)\)\|\\sim\\mathrm\{Bin\}\(\|\\mathcal\{F\}\_\{d,M\}\(R\)\|,1/M\),\\quad\\mu:=\\mathbb\{E\}\[X\]=\|\\mathcal\{F\}\_\{d,M\}\(R\)\|/M\\geq M^\{8d\-1\}\.Multiplicative Chernoff bound thus gives
Pr\[X∉\[μ2,3μ2\]\]≤2e−μ/12≤e−M8d−1/12\.\\Pr\\left\[X\\notin\\left\[\\frac\{\\mu\}\{2\},\\frac\{3\\mu\}\{2\}\\right\]\\right\]\\leq 2e^\{\-\\mu/12\}\\leq e^\{\-M^\{8d\-1\}/12\}\.Taking a union bound over allx∈Z,R∈ℛx\\in Z,R\\in\\mathcal\{R\}, the failure probability that there exist a pair\(x,R\)\(x,R\)not satisfying the lemma is at most
\|Z\|\|ℛ\|e−M8d−1/12≤M2eM6de−M8d−1/12<1/4,\|Z\|\|\\mathcal\{R\}\|e^\{\-M^\{8d\-1\}/12\}\\leq M^\{2\}e^\{M^\{6d\}\}e^\{\-M^\{8d\-1\}/12\}<1/4,where the final inequality holds for large enoughdd\. ∎
#### 8\.2\.5Lower bounding the VC\-dimension of the end\-to\-end class
We are now ready to prove the lower bound on VC(ℱ_{d,M}^{e2e-M}). In light of Lemma [8.9](https://arxiv.org/html/2605.06819#S8.Thmtheorem9), we fix a class ℱ := ℱ_{d,M} drawn as in Section [8.2.1](https://arxiv.org/html/2605.06819#S8.SS2.SSS1), for which the lemma holds.
###### Lemma 8\.10\.
We have
VC\(ℱ𝖾𝟤𝖾,M\)=Ω\(dlogM\)\.\\operatorname\{VC\}\(\\mathcal\{F\}^\{\\mathsf\{e2e\},M\}\)=\\Omega\(d\\log M\)\.
Letm:=⌊dlogM600⌋m:=\\left\\lfloor\\frac\{d\\log M\}\{600\}\\right\\rfloor\. For eachj∈\[m\]j\\in\[m\], letx\(j\):=0\(j−1\)M\+1x^\{\(j\)\}:=0^\{\(j\-1\)M\+1\}, and denoteS:=\{x\(j\):j∈\[m\]\}S:=\\\{x^\{\(j\)\}:j\\in\[m\]\\\}\. We will prove thatSSis shattered byℱ𝖾𝟤𝖾,M\\mathcal\{F\}^\{\\mathsf\{e2e\},M\}\.
The main tool used to prove Lemma[8\.10](https://arxiv.org/html/2605.06819#S8.Thmtheorem10)is the concentration lemma proved above\. To use it, we need to define setℛ\\mathcal\{R\}of rules determined before drawingℱd,M\\mathcal\{F\}\_\{d,M\}\. We will need two different types of rules, one to control the version space, and one to control any possible specific autoregressive trajectory\. Let us start with the first type\. For eachj∈\[m\]j\\in\[m\], consider the generation tree of depthMMrooted atx\(j\)x^\{\(j\)\}\. In this tree, define
- •The *green branch*, given by the sequence g^{(j)} := 0^{M−1}1 of edge labels.
- •The*red branches*, given by the sequences rq\(j\):=0q−110M−q,q∈\[M−1\]\.r^\{\(j\)\}\_\{q\}:=0^\{q\-1\}10^\{M\-q\},\\quad q\\in\[M\-1\]\.
See Figure[1](https://arxiv.org/html/2605.06819#S3.F1)for an illustration of the green and red branches, as well as an intuitive explanation for how they can be used to yield the lower bound onVC\(ℱd,M𝖾𝟤𝖾,M\)\\operatorname\{VC\}\(\\mathcal\{F\}\_\{d,M\}^\{\\mathsf\{e2e\},M\}\)\.
Now, let
G\(j\):=\{g\(j\)\},R\(j\):=\{rq\(j\):q∈\[M−1\]\}G^\{\(j\)\}:=\\\{g^\{\(j\)\}\\\},\\quad R^\{\(j\)\}:=\\\{r^\{\(j\)\}\_\{q\}:q\\in\[M\-1\]\\\}be the sets of green and red branches, respectively\. We now define for every possible labelingu:=\(u1,…,ut\)∈\{0,1\}tu:=\(u\_\{1\},\\ldots,u\_\{t\}\)\\in\\\{0,1\\\}^\{t\}of\(x\(1\),…,x\(t\)\)\(x^\{\(1\)\},\\ldots,x^\{\(t\)\}\)wheret≤mt\\leq m, a ruleQuQ\_\{u\}as follows:
Qu:=⋀j∈\[t\]:uj=1R\(G\(j\)\)∧⋀j∈\[t\]:uj=0R\(R\(j\)\)\.Q\_\{u\}:=\\bigwedge\_\{j\\in\[t\]:u\_\{j\}=1\}R\(G^\{\(j\)\}\)\\land\\bigwedge\_\{j\\in\[t\]:u\_\{j\}=0\}R\(R^\{\(j\)\}\)\.
That is, Q_u requires satisfying the rule induced by the green branch g^{(j)} for each x^{(j)} that is labeled by 1, and satisfying the rule induced by the red branches R^{(j)} for each x^{(j)} that is labeled by 0. This makes sense, as one can note that
- •f⊧R\(G\(j\)\)⟹f𝖾𝟤𝖾−M\(x\(j\)\)=1f\\models R\(G^\{\(j\)\}\)\\implies f^\{\\mathsf\{e2e\}\-M\}\(x^\{\(j\)\}\)=1\.
- •f⊧R\(\{rq\(j\)\}\)⟹f𝖾𝟤𝖾−M\(x\(j\)\)=0f\\models R\(\\\{r^\{\(j\)\}\_\{q\}\\\}\)\\implies f^\{\\mathsf\{e2e\}\-M\}\(x^\{\(j\)\}\)=0, for anyq∈\[M−1\]q\\in\[M\-1\]\.
Now, let
ℛver:=\{Qu:u∈\{0,1\}t,t∈\[m\]\}\.\\mathcal\{R\}\_\{ver\}:=\\\{Q\_\{u\}:u\\in\\\{0,1\\\}^\{t\},t\\in\[m\]\\\}\.
We now turn to the second type\. For two instancesx,y∈\{0,1\}⋆x,y\\in\\\{0,1\\\}^\{\\star\}andt≤\|y\|t\\leq\|y\|, we define the appropriate rule for thett\-prefix path that begins atxxand follows the trajectory given by the firstttbits ofyyas
Pt\(x,y\):=⋀s=1t\(xb1…bs−1,bs\)\.P\_\{t\}\(x,y\):=\\bigwedge\_\{s=1\}^\{t\}\(xb\_\{1\}\\ldots b\_\{s\-1\},b\_\{s\}\)\.In the case where we consider the entire trajectory ofyy, that ist=\|y\|t=\|y\|, we abbreviateP\(x,y\):=P\|y\|\(x,y\)P\(x,y\):=P\_\{\|y\|\}\(x,y\)\. So,f⊧Pt\(x,y\)f\\models P\_\{t\}\(x,y\)means that when starting from the instancexx, the firstttbits autoregressivly generated byffare preciselyy1,…,yty\_\{1\},\\ldots,y\_\{t\}\. Now we can also define a rule that captures functions from a specific version space given by a labelingu∈\{0,1\}<mu\\in\\\{0,1\\\}^\{<m\}, that also follow a specific trajectory of a green or red branch starting at the next instancex\(\|u\|\+1\)x^\{\(\|u\|\+1\)\}\. Formally, let:
ℛ_bran := {Q_u ∧ P_t(x^{(|u|+1)}, b) : u ∈ {0,1}^{<m}, t ∈ [M], b ∈ G^{(|u|+1)} ∪ R^{(|u|+1)}}.
Now, let
ℛ:=ℛver∪ℛbran\\mathcal\{R\}:=\\mathcal\{R\}\_\{ver\}\\cup\\mathcal\{R\}\_\{bran\}
To use Lemma[8\.9](https://arxiv.org/html/2605.06819#S8.Thmtheorem9), we need to show thatℛ\\mathcal\{R\}is not too large\.
###### Lemma 8\.11\.
We have
\|ℛ\|≤M2d\|\\mathcal\{R\}\|\\leq M^\{2d\}
###### Proof\.
We have
|ℛ_ver| ≤ m2^m, since there are at most m2^m options for the choice of u, directly from the definition. Note that for all j ∈ [m], |G^{(j)} ∪ R^{(j)}| ≤ M. Thus
|ℛ_bran| ≤ m2^m · M · M, since there are at most m2^m options for the choice of u, at most M options for the choice of t, and at most M options for the choice of b. So, overall:
\|ℛ\|≤\|ℛver\|\+\|ℛbran\|≤m2m\(M2\+1\)≤dlogM⋅Md\(M2\+1\)≤M2d,\|\\mathcal\{R\}\|\\leq\|\\mathcal\{R\}\_\{ver\}\|\+\|\\mathcal\{R\}\_\{bran\}\|\\leq m2^\{m\}\(M^\{2\}\+1\)\\leq d\\log M\\cdot M^\{d\}\(M^\{2\}\+1\)\\leq M^\{2d\},which concludes the proof\. ∎
We now show that when committing to a label of another instance, the version space does not decrease by too much\.
###### Lemma 8\.12\.
Letu∈\{0,1\}tu\\in\\\{0,1\\\}^\{t\}wheret<mt<m, and suppose that\|ℱ\(Qu\)\|≥1e300\|u\|N\|\\mathcal\{F\}\(Q\_\{u\}\)\|\\geq\\frac\{1\}\{e^\{300\|u\|\}\}N\. Then for anyr∈\{0,1\}r\\in\\\{0,1\\\}:
\|ℱ\(Qur\)\|≥1e300\|ℱ\(Qu\)\|\.\|\\mathcal\{F\}\(Q\_\{ur\}\)\|\\geq\\frac\{1\}\{e^\{300\}\}\|\\mathcal\{F\}\(Q\_\{u\}\)\|\.
###### Proof\.
To use Lemma [8.9](https://arxiv.org/html/2605.06819#S8.Thmtheorem9), we have to show that every rule R we use satisfies |ℱ(R)| ≥ M^{8d}. We will use rules of the form
Q_u ∧ P_t(x^{(|u|+1)}, b), (11) where t ∈ [M] and b ∈ G^{(|u|+1)} ∪ R^{(|u|+1)}; such rules are indeed in ℛ_bran ⊂ ℛ. By assumption,
|ℱ(Q_u)| ≥ (1/e^{300|u|})N ≥ (1/M^d)M^{10d} = M^{9d}, so the bound of Lemma [8.9](https://arxiv.org/html/2605.06819#S8.Thmtheorem9) holds for Q_u. Now, by monotonicity of inclusion, to prove that |ℱ(Q_u ∧ P_t(x^{(|u|+1)}, b))| > M^{8d} for each possible P_t(x^{(|u|+1)}, b), it suffices to show that this holds for the worst-case P_t(x^{(|u|+1)}, b). In the worst case, b has a single minority edge and M−1 majority edges. Thus, using (1−100/M)^M ≥ e^{−200} for large enough M, for any rule of the form of ([11](https://arxiv.org/html/2605.06819#S8.E11)) we have
ℱ\(Qu∧Pt\(x\|u\|\+1,b\)\)≥1100M\(1−100/M\)M\|ℱ\(Qu\)\|≥1100Me−200M9d≥M9d−2,\\mathcal\{F\}\(Q\_\{u\}\\land P\_\{t\}\(x^\{\|u\|\+1\},b\)\)\\geq\\frac\{1\}\{100M\}\(1\-100/M\)^\{M\}\|\\mathcal\{F\}\(Q\_\{u\}\)\|\\geq\\frac\{1\}\{100M\}e^\{\-200\}M^\{9d\}\\geq M^\{9d\-2\},where the final inequality holds for large enoughMM\. We can now prove the statement of the lemma\. Let us start with the caser=1r=1\. Recall that any functionffrealizing the green branchg\(\|u\|\+1\)g^\{\(\|u\|\+1\)\}satisfiesf𝖾𝟤𝖾−M\(x\(\|u\|\+1\)\)=1f^\{\\mathsf\{e2e\}\-M\}\(x^\{\(\|u\|\+1\)\}\)=1\. So, it suffices to lower bound the number of such functions\. Since the green branch has only majority edges, we obtain:
\|ℱ\(Qu1\)\|≥\(1−100/M\)M\|ℱ\(Qu\)\|≥e−200\|ℱ\(Qu\)\|\.\|\\mathcal\{F\}\(Q\_\{u1\}\)\|\\geq\(1\-100/M\)^\{M\}\|\\mathcal\{F\}\(Q\_\{u\}\)\|\\geq e^\{\-200\}\|\\mathcal\{F\}\(Q\_\{u\}\)\|\.Likewise, any functionffrealizing any of the red branchesrq\(\|u\|\+1\)r\_\{q\}^\{\(\|u\|\+1\)\}for someq∈\[M−1\]q\\in\[M\-1\]satisfiesf𝖾𝟤𝖾−M\(x\(\|u\|\+1\)\)=0f^\{\\mathsf\{e2e\}\-M\}\(x^\{\(\|u\|\+1\)\}\)=0\. So, it suffices to lower bound the number of such functions\. Any red branch has a single minority edge, and there areM−1M\-1red branches\. Furthermore, a function can realize at most one branch, so the set of functions realizing different red branches are disjoint\. Thus overall:
\|ℱ\(Qu0\)\|≥\(M−1\)1100M\(1−100/M\)M\|ℱ\(Qu\)\|≥e−300\|ℱ\(Qu\)\|,\|\\mathcal\{F\}\(Q\_\{u0\}\)\|\\geq\(M\-1\)\\frac\{1\}\{100M\}\(1\-100/M\)^\{M\}\|\\mathcal\{F\}\(Q\_\{u\}\)\|\\geq e^\{\-300\}\|\\mathcal\{F\}\(Q\_\{u\}\)\|,where the final inequality holds for large enoughMM\. ∎
We are now ready to prove Lemma[8\.10](https://arxiv.org/html/2605.06819#S8.Thmtheorem10)\.
###### Proof of Lemma[8\.10](https://arxiv.org/html/2605.06819#S8.Thmtheorem10)\.
We prove the lemma by showing that $|\mathcal{F}(Q_{u})|>0$ for all $u$, which means that any labeling of $S$ is realized by some function in $\mathcal{F}$. This establishes that $\operatorname{VC}(\mathcal{F}_{d,M}^{\mathsf{e2e}-M})\geq m=\Omega(d\log M)$. The claim follows almost directly from Lemma [8.12](https://arxiv.org/html/2605.06819#S8.Thmtheorem12) by proving the following stronger claim:
$$|\mathcal{F}(Q_{u})|\geq\frac{1}{e^{300|u|}}N\qquad(12)$$
for all $u\in\{0,1\}^{\leq m}$. Indeed, having ([12](https://arxiv.org/html/2605.06819#S8.E12)) we are done:
$$\frac{1}{e^{300|u|}}N\geq\frac{1}{e^{300\frac{d\log M}{600}}}M^{10d}\geq M^{9d}.$$
So, it remains to prove ([12](https://arxiv.org/html/2605.06819#S8.E12)). We prove it by induction on $|u|$. To handle the base case $|u|=0$, let us use the notation $\mathcal{F}(Q_{0})=\mathcal{F}$; that is, $Q_{0}$ means just $\mathcal{F}$ with no further restrictions. So indeed:
$$|\mathcal{F}(Q_{u})|=|\mathcal{F}|=N=\frac{1}{e^{300\cdot 0}}N=\frac{1}{e^{300\cdot|u|}}N,$$
so the base case holds. For the induction step, let $u^{\prime}=ur$ where $r\in\{0,1\}$ and $1\leq|u^{\prime}|\leq m$. By the induction hypothesis, we may assume that $|\mathcal{F}(Q_{u})|\geq\frac{1}{e^{300|u|}}N$. So:
$$|\mathcal{F}(Q_{u^{\prime}})|=|\mathcal{F}(Q_{ur})|\geq\frac{1}{e^{300}}\cdot\frac{1}{e^{300|u|}}N=\frac{1}{e^{300|u^{\prime}|}}N,$$
where the inequality is by the induction hypothesis and Lemma [8.12](https://arxiv.org/html/2605.06819#S8.Thmtheorem12). This concludes the proof. ∎
Recall that $\mathcal{F}:=\mathcal{F}(d,M)$. That is, so far we assumed that the number of autoregressive iterations $M$ is fixed. The only remaining step is to prove that we can construct from $\mathcal{F}(d,M)$ a single class that uniformly handles all values of $M$, without significantly increasing its Littlestone dimension. We prove this final step below.
#### 8.2.6 A gluing lemma
We prove a *gluing lemma* that glues together the classes $\{\mathcal{F}_{d,M}\}_{M\in\mathbb{N}_{+}}$ without increasing the Littlestone dimension by more than $1$. Our lemma is similar to the gluing lemma of [[10](https://arxiv.org/html/2605.06819#bib.bib46)].
###### Lemma 8\.13\.
Letc\>0c\>0be a universal constant, and fixd∈ℕd\\in\\mathbb\{N\}larger than some universal constant\. Letfd:ℕ→ℕf\_\{d\}:\\mathbb\{N\}\\to\\mathbb\{N\}be a function\. Suppose that for all large enoughMMthere exists a classℱd,M\\mathcal\{F\}\_\{d,M\}such that:
1. 1\.𝙻\(ℱd,M\)≤c⋅d\\mathtt\{L\}\(\\mathcal\{F\}\_\{d,M\}\)\\leq c\\cdot d\.
2. 2\.VC\(ℱd,M𝖾𝟤𝖾−M\)≥fd\(M\)\\operatorname\{VC\}\(\\mathcal\{F\}\_\{d,M\}^\{\\mathsf\{e2e\}\-M\}\)\\geq f\_\{d\}\(M\)
3. 3\.Let XM:=\{x∈\{0,1\}⋆:∃f∈ℱd,M,f\(x\)=1\}\.X\_\{M\}:=\\\{x\\in\\\{0,1\\\}^\{\\star\}:\\exists f\\in\\mathcal\{F\}\_\{d,M\},f\(x\)=1\\\}\.Then\|XM\|≤M3\|X\_\{M\}\|\\leq M^\{3\}\.
Then, there exists a class $\hat{\mathcal{F}}$ such that $\mathtt{L}(\hat{\mathcal{F}})\leq c\cdot d+1$ and for all large enough $M$ it holds that $\operatorname{VC}(\hat{\mathcal{F}}^{\mathsf{e2e}-M})\geq f_{d}(M)$.
###### Proof\.
For everyM≥1M\\geq 1and for everyf∈ℱd,Mf\\in\\mathcal\{F\}\_\{d,M\}, define the shifted functionf^\\hat\{f\}offfas
f^\(x′\)=\{f\(x\)∃x∈\{0,1\}⋆:x′=02M4x,0otherwise,\\hat\{f\}\(x^\{\\prime\}\)=\\begin\{cases\}f\(x\)&\\exists x\\in\\\{0,1\\\}^\{\\star\}:x^\{\\prime\}=0^\{2^\{M^\{4\}\}\}x,\\\\ 0&\\text\{otherwise,\}\\end\{cases\}for allx′∈\{0,1\}⋆x^\{\\prime\}\\in\\\{0,1\\\}^\{\\star\}\. Now, let
ℱ^d,M:=\{f^:f∈ℱd,M\}\.\\hat\{\\mathcal\{F\}\}\_\{d,M\}:=\\\{\\hat\{f\}:f\\in\\mathcal\{F\}\_\{d,M\}\\\}\.Denote the analog ofXMX\_\{M\}forℱ^d,M\\hat\{\\mathcal\{F\}\}\_\{d,M\}as
X^M:=\{x∈\{0,1\}⋆:∃f^∈ℱ^d,M,f^\(x\)=1\}\.\\hat\{X\}\_\{M\}:=\\\{x\\in\\\{0,1\\\}^\{\\star\}:\\exists\\hat\{f\}\\in\\hat\{\\mathcal\{F\}\}\_\{d,M\},\\hat\{f\}\(x\)=1\\\}\.We show thatM1\>M2⟹X^M1∩X^M2=∅M\_\{1\}\>M\_\{2\}\\implies\\hat\{X\}\_\{M\_\{1\}\}\\cap\\hat\{X\}\_\{M\_\{2\}\}=\\emptyset\. Indeed, the length of every instance inX^M2\\hat\{X\}\_\{M\_\{2\}\}is at most2M24\+M232^\{M^\{4\}\_\{2\}\}\+M\_\{2\}^\{3\}\. However, the length of every instance inX^M1\\hat\{X\}\_\{M\_\{1\}\}is at least
2M14≥2\(M2\+1\)4≥2M242M23\>2M24\+M23,2^\{M^\{4\}\_\{1\}\}\\geq 2^\{\(M\_\{2\}\+1\)^\{4\}\}\\geq 2^\{M\_\{2\}^\{4\}\}2^\{M\_\{2\}^\{3\}\}\>2^\{M^\{4\}\_\{2\}\}\+M\_\{2\}^\{3\},and so indeedX^M1∩X^M2=∅\\hat\{X\}\_\{M\_\{1\}\}\\cap\\hat\{X\}\_\{M\_\{2\}\}=\\emptyset\. Now, define
$$\hat{\mathcal{F}}:=\bigcup_{M}\hat{\mathcal{F}}_{d,M},$$
where the union is taken over all large enough $M$. First, since $\hat{\mathcal{F}}$ contains $\hat{\mathcal{F}}_{d,M}$ for all large enough $M$, it is clear that for all large enough $M$:
$$\operatorname{VC}(\hat{\mathcal{F}}^{\mathsf{e2e}-M})\geq\operatorname{VC}(\hat{\mathcal{F}}_{d,M}^{\mathsf{e2e}-M})\geq f_{d}(M),$$
as desired. It remains to prove that $\mathtt{L}(\hat{\mathcal{F}})\leq c\cdot d+1$. To see why this is true, consider an online learner for $\hat{\mathcal{F}}$ that always predicts $0$ until it makes its first mistake. Once it makes a mistake, it can infer that the target function assigns $1$ to the instance on which it erred. Since $\hat{X}_{M_{1}}\cap\hat{X}_{M_{2}}=\emptyset$ for any two distinct $M_{1},M_{2}$, this reveals the identity of the class $\hat{\mathcal{F}}_{d,M}$ from which the target function is taken. From this point on, the learner runs an optimal learner for $\hat{\mathcal{F}}_{d,M}$, which makes at most $c\cdot d$ many mistakes by assumption. This concludes the proof. ∎
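For illustration only (not part of the formal development), the length-separation inequality behind the disjointness of the supports $\hat{X}_{M}$ can be checked with exact integer arithmetic:

```python
# For M1 > M2 >= 2: instances in X_hat_{M1} have length at least 2^(M1^4), while
# instances in X_hat_{M2} have length at most 2^(M2^4) + M2^3.
for M2 in range(2, 6):
    for M1 in range(M2 + 1, 7):
        assert 2 ** (M1 ** 4) >= 2 ** (M2 ** 4) * 2 ** (M2 ** 3) > 2 ** (M2 ** 4) + M2 ** 3
```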
We may now prove the main theorem\.
###### Proof of Theorem[8\.2](https://arxiv.org/html/2605.06819#S8.Thmtheorem2)\.
The theorem follows from combining Lemma[8\.10](https://arxiv.org/html/2605.06819#S8.Thmtheorem10)and Lemma[8\.13](https://arxiv.org/html/2605.06819#S8.Thmtheorem13)\. ∎
## 9 Agnostic taxonomy
Consider the following known result\.
###### Theorem 9\.1\(\[[6](https://arxiv.org/html/2605.06819#bib.bib29),[1](https://arxiv.org/html/2605.06819#bib.bib40)\]\)\.
Let $\mathcal{F}$ be a binary function class. (Footnote 4: the upper bound of [[1](https://arxiv.org/html/2605.06819#bib.bib40)] requires that $\mathcal{F}$ satisfies some mild measurability constraints, which hold for all classes relevant to the scope of this paper, since our domain $\mathcal{X}=\{0,1\}^{\star}$ is countable.) Then the optimal regret bound for agnostic online learning of $\mathcal{F}$ is
Θ\(𝙻\(ℱ\)T\)\.\\Theta\\left\(\\sqrt\{\\mathtt\{L\}\(\\mathcal\{F\}\)T\}\\right\)\.
Using these bounds and our taxonomy of possible Littlestone dimension ofℱ𝖾𝟤𝖾−M\\mathcal\{F\}^\{\\mathsf\{e2e\}\-M\}, we may infer an essentially complete taxonomy for agnostic learning as well\. First, we deduce a general upper bound\.
###### Corollary 9\.2\.
Letℱ\\mathcal\{F\}be a base class\. Then for all sufficiently largeMM, there exists an agnostic online learner forℱ𝖾𝟤𝖾−M\\mathcal\{F\}^\{\\mathsf\{e2e\}\-M\}with regret bound
O\(T𝙻\(ℱ\)logM\)\.O\\left\(\\sqrt\{T\\mathtt\{L\}\(\\mathcal\{F\}\)\\log M\}\\right\)\.
Below, we show that all sub-logarithmic well-behaved rates between $\sqrt{T}$ and $\sqrt{T\log M}$ are attainable. Note that due to Theorem [9.1](https://arxiv.org/html/2605.06819#S9.Thmtheorem1), going below order-$\sqrt{T}$ regret is possible only when the autoregressive process degenerates $\mathcal{F}^{\mathsf{e2e}-M}$ to a single function. Therefore, this essentially establishes a complete taxonomy for agnostic online $\mathsf{e2e}$-learning.
###### Corollary 9\.3\.
For any well\-behaved sub\-logarithmic growth raterrthere exists a base classℱ\\mathcal\{F\}such that𝙻\(ℱ\)=1\\mathtt\{L\}\(\\mathcal\{F\}\)=1and the optimal regret bound for agnostic online learningℱ𝖾𝟤𝖾−M\\mathcal\{F\}^\{\\mathsf\{e2e\}\-M\}is
Θ\(T⋅r\(M\)\)\\Theta\\left\(\\sqrt\{T\\cdot r\(M\)\}\\right\)for allM≥M0M\\geq M\_\{0\}and sufficiently largeTT\.
A similar result holds also for logarithmic rates\.
###### Corollary 9\.4\.
For anyd∈ℕ\+d\\in\\mathbb\{N\}\_\{\+\}, there exists a classℱ\\mathcal\{F\}such that the optimal regret for agnostic online learning ofℱ𝖾𝟤𝖾−M\\mathcal\{F\}^\{\\mathsf\{e2e\}\-M\}is
Θ\(TdlogM\)\\Theta\\left\(\\sqrt\{Td\\log M\}\\right\)for allM≥M0M\\geq M\_\{0\}and sufficiently largeTT\.
## 10 Learning with chain-of-thought supervision
In contrast to $\mathsf{e2e}$-learning, we show that, similarly to the statistical setting (see [[20](https://arxiv.org/html/2605.06819#bib.bib93)]), the optimal mistake bound for learning a Littlestone class is independent of $M$.
###### Theorem 10\.1\.
For every classℱ\\mathcal\{F\}, and for everyMM, we have𝙻\(ℱ𝖢𝗈𝖳−M\)≤𝙻\(ℱ\)\\mathtt\{L\}\(\\mathcal\{F\}^\{\\mathsf\{CoT\}\-M\}\)\\leq\\mathtt\{L\}\(\\mathcal\{F\}\)\.
###### Proof\.
We use an SOA-style learner that realizes the stated mistake bound, described below. The learner maintains a valid version space $V:=V_{t}$ of $\mathcal{F}$, which in the first round is $V_{1}=\mathcal{F}$. Let $x:=x_{t}$ be the instance given by the adversary in round $t$. The learner constructs the generation tree $\mathbf{T}_{M}(x)$. For a branch $b$, let $V_{b}$ be the version space $V$ restricted to $b$; that is, $V_{b}$ consists of all functions in $V$ that agree with the branch $b$. We argue that there is at most one branch $b^{\star}$ such that $\mathtt{L}(V_{b^{\star}})=\mathtt{L}(V)$. Indeed, suppose that there are two such branches $b_{0},b_{1}$. Let $v$ be the internal node where these branches first split, and let $v_{0},v_{1}$ be the left and right children of $v$, where $b_{0}$ goes through $v_{0}$ and $b_{1}$ goes through $v_{1}$. So, the version space $V$ restricted to $b_{0}$ up to $v_{0}$, and also $V$ restricted to $b_{1}$ up to $v_{1}$, each has Littlestone dimension $\mathtt{L}(V)$. Therefore, we can construct a tree of depth $\mathtt{L}(V)+1$ which is rooted at $v$ and is shattered by $V$, which is a contradiction. So, we set the learner's prediction in round $t$ to be the label sequence associated with the edges of the branch
$$b^{\star}:=\operatorname*{arg\,max}_{b\in B(\mathbf{T}_{M}(x))}\mathtt{L}(V_{b}).$$
Now, if the learner errs and the correct branch is $b\neq b^{\star}$, then the learner updates $V_{t+1}=V_{b}$, and $\mathtt{L}(V_{t+1})<\mathtt{L}(V_{t})$. Telescoping over all rounds in which the learner errs, at most $\mathtt{L}(\mathcal{F})$ many mistakes are made before $\mathtt{L}(V_{t})=0$, at which point all functions remaining in the version space agree on the branch of every instance, so the learner's predictions are correct from that round onward. ∎
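For concreteness, the following is a minimal illustrative sketch of the learner described in the proof, for a finite base class over the binary alphabet; it is not part of the formal development, all identifiers are illustrative, and the Littlestone dimension is computed by brute force over a finite instance pool (so it only yields a lower bound and is practical only for toy classes). The prediction is the full chain of thought, whose last bit is the end-to-end answer.

```python
from itertools import product

def littlestone_dim(fs, pool):
    """Brute-force Littlestone dimension of the finite class `fs`, restricted
    to the finite instance pool `pool` (returns -1 for the empty class)."""
    if not fs:
        return -1
    best = 0
    for x in pool:
        v0 = [f for f in fs if f(x) == 0]
        v1 = [f for f in fs if f(x) == 1]
        if v0 and v1:
            best = max(best, 1 + min(littlestone_dim(v0, pool),
                                     littlestone_dim(v1, pool)))
    return best

def restrict_to_branch(fs, x, branch):
    """Functions in `fs` whose autoregressive trajectory on x follows `branch`."""
    out = []
    for f in fs:
        s, ok = x, True
        for bit in branch:
            if f(s) != bit:
                ok = False
                break
            s += str(bit)
        if ok:
            out.append(f)
    return out

class SOAChainOfThought:
    """SOA-style CoT learner: predict the branch whose version space has the
    largest Littlestone dimension; keep only the functions consistent with the
    revealed chain of thought (the proof only needs this on mistakes, but it
    is always sound)."""
    def __init__(self, base_class, pool, M):
        self.V, self.pool, self.M = list(base_class), pool, M
    def predict(self, x):
        return max(product((0, 1), repeat=self.M),
                   key=lambda b: littlestone_dim(restrict_to_branch(self.V, x, b),
                                                 self.pool))
    def update(self, x, true_chain):
        self.V = restrict_to_branch(self.V, x, tuple(true_chain))
```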
Note that, in contrast with the $\mathsf{e2e}$ setting, here we cannot infer an agnostic regret bound from the realizable mistake bound. Indeed, since the learner of Theorem [10.1](https://arxiv.org/html/2605.06819#S10.Thmtheorem1) learns the entire $\mathsf{CoT}$, every change in the $\mathsf{CoT}$ counts as a mistake. Thus, it is possible that for some function in $\mathcal{F}$ the $\mathsf{e2e}$-loss is extremely good, while for all functions in $\mathcal{F}$ the $\mathsf{CoT}$-loss is very bad. In such a case, since the learner of Theorem [10.1](https://arxiv.org/html/2605.06819#S10.Thmtheorem1) is measured with respect to the function with the best $\mathsf{CoT}$-loss, its regret might be very bad when measured against the function with the best $\mathsf{e2e}$-loss. It is an interesting direction for future research to determine whether this obstacle can be circumvented by some learner.
Using an online-to-batch conversion (Theorem [6.8](https://arxiv.org/html/2605.06819#S6.Thmtheorem8)), Theorem [10.1](https://arxiv.org/html/2605.06819#S10.Thmtheorem1) immediately yields a PAC sample complexity bound for learning $\mathcal{F}^{\mathsf{CoT}-M}$.
###### Theorem 10\.2\(Upper bound in terms of𝙻\(ℱ\)\\mathtt\{L\}\(\\mathcal\{F\}\)\)\.
For every base classℱ⊂ΣΣ⋆\\mathcal\{F\}\\subset\\Sigma^\{\\Sigma^\{\\star\}\}\(even if\|Σ\|\>2\|\\Sigma\|\>2, and even for\|Σ\|=∞\|\\Sigma\|=\\infty\), and for allMM, if𝙻\(ℱ\)<∞\\mathtt\{L\}\(\\mathcal\{F\}\)<\\inftythenℱ𝖢𝗈𝖳−M\\mathcal\{F\}^\{\\mathsf\{CoT\}\-M\}is learnable with sample complexity
m\(ε,δ\)=O\(𝙻\(ℱ\)\+log\(1/δ\)ε\)\.m\(\\varepsilon,\\delta\)=O\\left\(\\frac\{\\mathtt\{L\}\(\\mathcal\{F\}\)\+\\log\(1/\\delta\)\}\{\\varepsilon\}\\right\)\.
## 11 Autoregressive linear classifiers
In this section, we study the base class of autoregressive linear classifiers in $\mathbb{R}^{d}$. Each function in the class takes into account only the final $d$ entries of its input. Formally, define
$$\mathcal{F}_{d,\operatorname{lin}}:=\{f_{\boldsymbol{w},b}:\boldsymbol{w}\in\mathbb{R}^{d},\,b\in\mathbb{R}\},$$
where the function $f_{\boldsymbol{w},b}$ is given as
$$f_{\boldsymbol{w},b}(x)=1\left[\sum_{i=1}^{\min\{d,|x|\}}\boldsymbol{w}[-i]\,x[-i]+b\geq 0\right]$$
for all $x\in\Sigma^{\star}$.
### 11.1 Mistake and sample complexity bounds
$\mathcal{F}_{d,\operatorname{lin}}$ is essentially the class of linear classifiers over the $d$-cube. This immediately gives the following bound.
###### Proposition 11\.1\.
For allMM:
𝙻\(ℱd,lin𝖾𝟤𝖾−M\)=O\(d2\)\\mathtt\{L\}\(\\mathcal\{F\}\_\{d,\\operatorname\{lin\}\}^\{\\mathsf\{e2e\}\-M\}\)=O\(d^\{2\}\)
###### Proof\.
Trivial: the argument of [[26](https://arxiv.org/html/2605.06819#bib.bib71), Lemma 4.1] shows that $\mathcal{F}_{d,\operatorname{lin}}$ contains at most $2^{O(d^{2})}$ distinct functions. Since $|\mathcal{F}_{d,\operatorname{lin}}^{\mathsf{e2e}-M}|\leq|\mathcal{F}_{d,\operatorname{lin}}|$, the standard halving algorithm implies the stated bound. ∎
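For intuition, here is a minimal illustrative sketch of the halving learner invoked in the proof (not part of the formal development), assuming the finite class of end-to-end predictors is given as an explicit list of callables:

```python
class Halving:
    """Standard halving learner: predict the majority vote of the surviving
    version space. Every mistake removes at least half of the version space,
    so at most log2(initial class size) mistakes occur on a realizable sequence."""
    def __init__(self, function_class):
        self.version_space = list(function_class)   # e.g. the e2e-M predictors
    def predict(self, x):
        votes = sum(f(x) for f in self.version_space)
        return int(2 * votes >= len(self.version_space))
    def update(self, x, y):
        self.version_space = [f for f in self.version_space if f(x) == y]
```

Since $|\mathcal{F}_{d,\operatorname{lin}}^{\mathsf{e2e}-M}|\leq 2^{O(d^{2})}$, instantiating such a learner with (an enumeration of) that class gives the claimed $O(d^{2})$ mistake bound.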
The bound above is based on a simple size argument. It is natural to ask whether it is tight. Below, we give a positive answer to this question, which holds even for $\mathsf{CoT}$-learning.
The idea is to implement a "latch" mechanism (as in digital systems) that "drags" the output bit of the base function through the entire $\mathsf{CoT}$, and then simply apply the best known bound for $\mathcal{F}_{d,\operatorname{lin}}$.
###### Theorem 11\.2\.
There exists an adversary forcing $\Omega(d^{2})$ mistakes on any learner for $\mathcal{F}_{d,\operatorname{lin}}^{\mathsf{e2e}-M}$, for all $d\geq 3$ and $M\geq 2$, even with $\mathsf{CoT}$-supervision.
The key is the following lemma, which provides the latch mechanism at the price of reducing the dimension by $2$.
###### Lemma 11\.3\(Latch lemma\)\.
Fixd≥3d\\geq 3and denotem:=d−2m:=d\-2\. Then, for all𝐯∈ℝm\\boldsymbol\{v\}\\in\\mathbb\{R\}^\{m\}andc∈ℝc\\in\\mathbb\{R\}, there existsf𝐰,b∈ℱd,linf\_\{\\boldsymbol\{w\},b\}\\in\\mathcal\{F\}\_\{d,\\operatorname\{lin\}\}such that
f𝒘,b𝖾𝟤𝖾−M\(z10\)=f𝒗,c\(z\)f\_\{\\boldsymbol\{w\},b\}^\{\\mathsf\{e2e\}\-M\}\(z10\)=f\_\{\\boldsymbol\{v\},c\}\(z\)for allz∈\{0,1\}mz\\in\\\{0,1\\\}^\{m\}and allM≥2M\\geq 2\.
###### Proof\.
Define
$$L:=\min_{z\in\{0,1\}^{m}}\langle\boldsymbol{v},z\rangle+c,\qquad U:=\max_{z\in\{0,1\}^{m}}\langle\boldsymbol{v},z\rangle+c,$$
and let $B:=\max\{0,U\}+1$ and $A:=B-L+1$. We now define $f_{\boldsymbol{w},b}\in\mathcal{F}_{d,\operatorname{lin}}$ that realizes the statement of the lemma. Let $s(x):=(u,p,q)$ denote the last $d$ bits of an input $x\in\{0,1\}^{\star}$ to $f_{\boldsymbol{w},b}$, where $|u|=d-2$ and $|p|=|q|=1$. Let
𝒘:=\(𝒗,B,2A\),b:=c−B\.\\boldsymbol\{w\}:=\(\\boldsymbol\{v\},B,2A\),\\quad b:=c\-B\.Therefore, for any inputxxof length at leastddwe have:
f𝒘,b\(x\)=1\[⟨𝒗,u⟩\+Bp\+2Aq\+c−B≥0\]\.f\_\{\\boldsymbol\{w\},b\}\(x\)=1\\left\[\\langle\\boldsymbol\{v\},u\\rangle\+Bp\+2Aq\+c\-B\\geq 0\\right\]\.Now, we show thatf𝒘,bf\_\{\\boldsymbol\{w\},b\}satisfies three desired properties:
1. 1\.For allz∈\{0,1\}mz\\in\\\{0,1\\\}^\{m\}:f𝒘,b\(z10\)=f𝒗,c\(z\)f\_\{\\boldsymbol\{w\},b\}\(z10\)=f\_\{\\boldsymbol\{v\},c\}\(z\)\(first bit of𝖢𝗈𝖳\\mathsf\{CoT\}is correct\)\.
2. 2\.For anyx∈\{0,1\}⋆x\\in\\\{0,1\\\}^\{\\star\}withp=q=0p=q=0, we havef𝒘,b\(x\)=0f\_\{\\boldsymbol\{w\},b\}\(x\)=0\(always drag0\)\.
3. 3\.For anyx∈\{0,1\}⋆x\\in\\\{0,1\\\}^\{\\star\}withq=1q=1, we havef𝒘,b\(x\)=1f\_\{\\boldsymbol\{w\},b\}\(x\)=1\(always drag11\)\.
The three items establish the statement of the lemma: The first item states thatf𝒘,b\(z10\)f\_\{\\boldsymbol\{w\},b\}\(z10\)returns the “correct” bitf𝒗,c\(z\)f\_\{\\boldsymbol\{v\},c\}\(z\)\. Then, sincef𝒘,b𝖾𝟤𝖾−Mf\_\{\\boldsymbol\{w\},b\}^\{\\mathsf\{e2e\}\-M\}appends the returned correct bit, precisely one of the following two states is reached: iff𝒗,c\(z\)=0f\_\{\\boldsymbol\{v\},c\}\(z\)=0, then the last two bits become0, and the second item establishes that0is dragged through allMMiterations\. Otherwise, the last bit becomes11, and then the third item guarantees that11is dragged through allMMiterations\. So it remains to prove the three items\.
##### First item\.
Letz∈\{0,1\}mz\\in\\\{0,1\\\}^\{m\}, and denotex=z10x=z10\. So
⟨𝒘,s\(x\)⟩\+b=⟨𝒗,z⟩\+B\+c−B=⟨𝒗,z⟩\+c\.\\langle\\boldsymbol\{w\},s\(x\)\\rangle\+b=\\langle\\boldsymbol\{v\},z\\rangle\+B\+c\-B=\\langle\\boldsymbol\{v\},z\\rangle\+c\.Therefore,
f𝒘,b\(x\)=1\[⟨𝒘,s\(x\)⟩\+b≥0\]=1\[⟨𝒗,z⟩\+c≥0\]=f𝒗,c\(z\)\.f\_\{\\boldsymbol\{w\},b\}\(x\)=1\\left\[\\langle\\boldsymbol\{w\},s\(x\)\\rangle\+b\\geq 0\\right\]=1\\left\[\\langle\\boldsymbol\{v\},z\\rangle\+c\\geq 0\\right\]=f\_\{\\boldsymbol\{v\},c\}\(z\)\.
##### Second item\.
Let $x$ be such that $p=q=0$. Then
$$\langle\boldsymbol{w},s(x)\rangle+b=\langle\boldsymbol{v},u\rangle+c-B\leq U-B<0,$$
where the first inequality is by the definition of $U$, and the second inequality is by the definition of $B$. Therefore, $f_{\boldsymbol{w},b}(x)=0$ by definition.
##### Third item\.
Let $x$ be such that $q=1$. Then
$$\langle\boldsymbol{w},s(x)\rangle+b=\langle\boldsymbol{v},u\rangle+Bp+2A+c-B\geq\langle\boldsymbol{v},u\rangle+c+2A-B\geq L+2A-B=L+2(B-L+1)-B=B-L+2>0.$$
The equalities are by the definitions. The first inequality is since $B>0$ and thus $Bp\geq 0$. The second inequality is by the definition of $L$. The last inequality is since $B>U\geq L$. So, we have $f_{\boldsymbol{w},b}(x)=1$ by definition. ∎
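For illustration only (not part of the formal development), the latch construction can be brute-force checked for a small base dimension; `threshold` and `e2e` below implement $f_{\boldsymbol{w},b}$ and $f_{\boldsymbol{w},b}^{\mathsf{e2e}-M}$ directly from the definitions, and the assertions verify the conclusion of the lemma:

```python
from itertools import product

def threshold(w, b, bits):
    """f_{w,b}: sum of w[-i] * bits[-i] over the last min(d, |bits|) positions."""
    k = min(len(w), len(bits))
    return int(sum(w[-i] * bits[-i] for i in range(1, k + 1)) + b >= 0)

def e2e(w, b, bits, M):
    """f_{w,b}^{e2e-M}: apply the next-token generator M times, return the last bit."""
    s = list(bits)
    for _ in range(M):
        s.append(threshold(w, b, s))
    return s[-1]

m = 3                                              # base dimension; d = m + 2
for v in product((-2, -1, 0, 1, 2), repeat=m):
    for c in (-1, 0, 1):
        vals = [sum(vi * zi for vi, zi in zip(v, z)) + c
                for z in product((0, 1), repeat=m)]
        L, U = min(vals), max(vals)
        B = max(0, U) + 1
        A = B - L + 1
        w, b = list(v) + [B, 2 * A], c - B         # the latch construction
        for z in product((0, 1), repeat=m):
            base = int(sum(vi * zi for vi, zi in zip(v, z)) + c >= 0)
            for M in (2, 3, 7):
                assert e2e(w, b, list(z) + [1, 0], M) == base
```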
We may now prove Theorem[11\.2](https://arxiv.org/html/2605.06819#S11.Thmtheorem2)\.
###### Proof of Theorem[11\.2](https://arxiv.org/html/2605.06819#S11.Thmtheorem2)\.
By [[18](https://arxiv.org/html/2605.06819#bib.bib92), Theorem B.2], we have $\mathtt{L}(\mathcal{F}_{m,\operatorname{lin}})=\Theta(m^{2})$ for all $m\geq 1$. As the latch lemma shows, for every function $f_{m}\in\mathcal{F}_{m,\operatorname{lin}}$ there is a function $f_{d}\in\mathcal{F}_{d,\operatorname{lin}}$, where $d=m+2$, such that for all $z\in\{0,1\}^{m}$ and all $M\geq 2$ we have $f_{d}^{\mathsf{e2e}-M}(z10)=f_{m}(z)$. Therefore, the same adversarial strategy realizing the $\Omega(m^{2})$ lower bound of [[18](https://arxiv.org/html/2605.06819#bib.bib92)] can be implemented against $\mathcal{F}_{d,\operatorname{lin}}^{\mathsf{e2e}-M}$, replacing any instance $z\in\{0,1\}^{m}$ with $z10$, and any $f_{m}\in\mathcal{F}_{m,\operatorname{lin}}$ with the appropriate $f_{d}\in\mathcal{F}_{d,\operatorname{lin}}$. Therefore,
$$\mathtt{L}(\mathcal{F}_{d,\operatorname{lin}}^{\mathsf{e2e}-M})=\Theta(m^{2})=\Theta(d^{2})$$
for all $M\geq 2$. To see that the same lower bound holds even with $\mathsf{CoT}$-supervision, note that the $\mathsf{CoT}$ is just $M$ copies of the same bit, and so $\mathsf{CoT}$-supervision is equivalent to only receiving the $\mathsf{e2e}$ output. ∎
The same technique also yields a lower bound for the statistical setting.
###### Theorem 11\.4\.
For alld≥3d\\geq 3andM≥2M\\geq 2, any PAC\-learner forℱd,lin𝖾𝟤𝖾−M\\mathcal\{F\}\_\{d,\\operatorname\{lin\}\}^\{\\mathsf\{e2e\}\-M\}has sample complexityΩ\(d\)\\Omega\(d\), even with𝖢𝗈𝖳\\mathsf\{CoT\}\-supervision\.
###### Proof\.
We prove that $\operatorname{VC}(\mathcal{F}_{d,\operatorname{lin}}^{\mathsf{e2e}-M})\geq m$ for $m:=d-2$. Since all functions we use to shatter a set of size $m$ are implemented by the latch mechanism of Lemma [11.3](https://arxiv.org/html/2605.06819#S11.Thmtheorem3), those functions always copy the same final bit, and thus $\mathsf{CoT}$-supervision is equivalent to only receiving the $\mathsf{e2e}$ output. Therefore, proving $\operatorname{VC}(\mathcal{F}_{d,\operatorname{lin}}^{\mathsf{e2e}-M})\geq m$ suffices for showing that $m$ is a lower bound on the sample complexity, even with $\mathsf{CoT}$-supervision. The proof now follows essentially the same lines as the proof of Theorem [11.2](https://arxiv.org/html/2605.06819#S11.Thmtheorem2), only replacing the bound $\mathtt{L}(\mathcal{F}_{m,\operatorname{lin}})=\Theta(m^{2})$ used in that proof with the known bound $\operatorname{VC}(\mathcal{F}_{m,\operatorname{lin}})\geq m$. ∎
### 11.2 Bounds for efficient learning
We first show that efficient online𝖾𝟤𝖾\\mathsf\{e2e\}\-learning of autoregressive linear thresholds is impossible, conditioned on the following assumption; we refer the reader to, e\.g\.,\[[26](https://arxiv.org/html/2605.06819#bib.bib71)\]for more details\.
###### Assumption 11\.5\(Hardness of Learning𝖳𝖢0\\mathsf\{TC\}^\{0\}\)\.
There exist a constantL\>0L\>0and a polynomialp\(n\)p\(n\)such that threshold circuits of depthLLand sizep\(n\)p\(n\)overnnbinary inputs are not weakly PAC\-learnable inpoly\(n\)\\operatorname\{poly\}\(n\)time, i\.e\., for some constantsε<12\\varepsilon<\\tfrac\{1\}\{2\}andδ<1\\delta<1, there is no polynomial\-time learner that achieves error at mostε\\varepsilonwith confidence at least1−δ1\-\\delta\.
Based on the above assumption, we show that efficiently learning end-to-end autoregressive linear classifiers is impossible. The proof is based on a standard batch-to-online argument and on Theorem 4.4 in [[26](https://arxiv.org/html/2605.06819#bib.bib71)].
###### Theorem 11\.6\(Conditional hardness of efficient online𝖾𝟤𝖾\\mathsf\{e2e\}\-learning\)\.
Assume Hardness Assumption [11.5](https://arxiv.org/html/2605.06819#S11.Thmtheorem5). Then there is no (possibly improper) online learner such that, for every $d,M,n$, on every realizable online sequence of examples from $\{0,1\}^{n}$ labeled by some target in $\mathcal{F}_{d,\operatorname{lin}}^{\mathsf{e2e}-M}$, the learner runs in time polynomial in $d,M,n$ per round and makes at most polynomially many mistakes in $d,M,n$.
###### Proof\.
Suppose such a learner exists\. By the standard reduction from realizable online learning with polynomial mistake bound to realizable PAC learning, this yields a polynomial\-time realizable PAC learner forℱd,lin𝖾𝟤𝖾−M\\mathcal\{F\}\_\{d,\\operatorname\{lin\}\}^\{\\mathsf\{e2e\}\-M\}over\{0,1\}n\\\{0,1\\\}^\{n\}\. This contradicts\[[26](https://arxiv.org/html/2605.06819#bib.bib71), Theorem 4\.4\]\. ∎
In contrast, with $\mathsf{CoT}$-supervision there is an efficient online learner matching the information-theoretic guarantee up to a logarithmic factor. We use the classical deterministic online learner for Boolean linear thresholds. We first give a reduction.
###### Lemma 11\.7\(Online reduction with𝖢𝗈𝖳\\mathsf\{CoT\}\-supervision\)\.
Letℱ⊆\{f:\{0,1\}⋆→\{0,1\}\}\\mathcal\{F\}\\subseteq\\\{f:\\\{0,1\\\}^\{\\star\}\\to\\\{0,1\\\}\\\}be a base class\. Assume there is a deterministic online learner𝒜\\mathcal\{A\}forℱ\\mathcal\{F\}such that:
1. 1\.on every realizable sequence it makes at mostKKmistakes;
2. 2\.on every length\-LLsequence, its total computation time is polynomial inLLand the total input length\.
Then for everyM≥1M\\geq 1there is a deterministic online learnerℬ\\mathcal\{B\}forℱ𝖾𝟤𝖾−M\\mathcal\{F\}^\{\\mathsf\{e2e\}\-M\}with𝖢𝗈𝖳\\mathsf\{CoT\}\-supervision such that:
1. 1\.on every realizable𝖢𝗈𝖳\\mathsf\{CoT\}\-supervised sequence it makes at mostKKmistakes;
2. 2\.on every length\-nnsequence, its total computation time is polynomial inn,M,n,M,and the total input length\.
###### Proof\.
Fix a realizable𝖢𝗈𝖳\\mathsf\{CoT\}\-supervised sequence\(xt,zt\)t=1n\(x\_\{t\},z\_\{t\}\)\_\{t=1\}^\{n\}, wherezt=\(zt,1,…,zt,M\)=f⋆𝖢𝗈𝖳−M\(xt\)z\_\{t\}=\(z\_\{t,1\},\\ldots,z\_\{t,M\}\)=f\_\{\\star\}^\{\\mathsf\{CoT\}\-M\}\(x\_\{t\}\)for somef⋆∈ℱf\_\{\\star\}\\in\\mathcal\{F\}\. For each roundss, letus,0:=xsu\_\{s,0\}:=x\_\{s\}, and forj∈\[M\]j\\in\[M\]letus,j:=us,j−1zs,ju\_\{s,j\}:=u\_\{s,j\-1\}z\_\{s,j\}, where for a binary stringuuand a bitb∈\{0,1\}b\\in\\\{0,1\\\},ububdenotes the string obtained by appendingbbtouu\. Learnerℬ\\mathcal\{B\}maintains the ordered transcript
St−1:=\(\(u1,0,z1,1\),…,\(u1,M−1,z1,M\),…,\(ut−1,0,zt−1,1\),…,\(ut−1,M−1,zt−1,M\)\)\.S\_\{t\-1\}:=\\bigl\(\(u\_\{1,0\},z\_\{1,1\}\),\\ldots,\(u\_\{1,M\-1\},z\_\{1,M\}\),\\ldots,\(u\_\{t\-1,0\},z\_\{t\-1,1\}\),\\ldots,\(u\_\{t\-1,M\-1\},z\_\{t\-1,M\}\)\\bigr\)\.
On roundtt, learnerℬ\\mathcal\{B\}setsu^t,0:=xt\\hat\{u\}\_\{t,0\}:=x\_\{t\}\. Forj=1,…,Mj=1,\\ldots,M, supposez^t,1,…,z^t,j−1\\hat\{z\}\_\{t,1\},\\ldots,\\hat\{z\}\_\{t,j\-1\}have already been defined andu^t,j−1\\hat\{u\}\_\{t,j\-1\}has been formed\. Learnerℬ\\mathcal\{B\}runs𝒜\\mathcal\{A\}from scratch on the sequence
St−1,\(u^t,0,z^t,1\),…,\(u^t,j−2,z^t,j−1\),S\_\{t\-1\},\(\\hat\{u\}\_\{t,0\},\\hat\{z\}\_\{t,1\}\),\\ldots,\(\\hat\{u\}\_\{t,j\-2\},\\hat\{z\}\_\{t,j\-1\}\),where forj=1j=1this means justSt−1S\_\{t\-1\}, and letsz^t,j\\hat\{z\}\_\{t,j\}be the prediction on the next exampleu^t,j−1\\hat\{u\}\_\{t,j\-1\}; then it setsu^t,j:=u^t,j−1z^t,j\\hat\{u\}\_\{t,j\}:=\\hat\{u\}\_\{t,j\-1\}\\hat\{z\}\_\{t,j\}\. Finally, it predictsz^t,M\\hat\{z\}\_\{t,M\}\. After receiving the true chain\-of\-thoughtzt=\(zt,1,…,zt,M\)z\_\{t\}=\(z\_\{t,1\},\\ldots,z\_\{t,M\}\), it updates the transcript to
St:=\(St−1,\(ut,0,zt,1\),…,\(ut,M−1,zt,M\)\)\.S\_\{t\}:=\\bigl\(S\_\{t\-1\},\(u\_\{t,0\},z\_\{t,1\}\),\\ldots,\(u\_\{t,M\-1\},z\_\{t,M\}\)\\bigr\)\.
Now consider the expanded sequence
\(\(ut,j−1,zt,j\)\)1≤t≤n,1≤j≤M\.\(\(u\_\{t,j\-1\},z\_\{t,j\}\)\)\_\{1\\leq t\\leq n,\\ 1\\leq j\\leq M\}\.It is realizable byf⋆f\_\{\\star\}, sincezt,j=f⋆\(ut,j−1\)z\_\{t,j\}=f\_\{\\star\}\(u\_\{t,j\-1\}\)for allt,jt,j\. Ifℬ\\mathcal\{B\}makes a mistake on roundtt, letjjbe the first index withz^t,j≠zt,j\\hat\{z\}\_\{t,j\}\\neq z\_\{t,j\}\. Thenu^t,i=ut,i\\hat\{u\}\_\{t,i\}=u\_\{t,i\}for alli<ji<j, so the transcript used byℬ\\mathcal\{B\}to computez^t,j\\hat\{z\}\_\{t,j\}is exactly the prefix of the expanded realizable sequence seen by𝒜\\mathcal\{A\}just before processing\(ut,j−1,zt,j\)\(u\_\{t,j\-1\},z\_\{t,j\}\)\. Hence𝒜\\mathcal\{A\}makes a mistake on that example\. Therefore every final\-answer mistake ofℬ\\mathcal\{B\}is charged to a distinct mistake of𝒜\\mathcal\{A\}, and soℬ\\mathcal\{B\}makes at mostKKmistakes\.
Finally, the expanded sequence has length at mostnMnM, and its total input length is
∑t=1n∑j=1M\|ut,j−1\|=∑t=1n∑j=1M\(\|xt\|\+j−1\)≤M∑t=1n\|xt\|\+nM2,\\sum\_\{t=1\}^\{n\}\\sum\_\{j=1\}^\{M\}\|u\_\{t,j\-1\}\|=\\sum\_\{t=1\}^\{n\}\\sum\_\{j=1\}^\{M\}\(\|x\_\{t\}\|\+j\-1\)\\leq M\\sum\_\{t=1\}^\{n\}\|x\_\{t\}\|\+nM^\{2\},which is polynomial inn,M,n,M,and the total input length of the original sequence\. The same bound holds for every simulated prefix sequence used byℬ\\mathcal\{B\}, since\|u^t,j\|=\|xt\|\+j\|\\hat\{u\}\_\{t,j\}\|=\|x\_\{t\}\|\+jfor allt,jt,j\. Sinceℬ\\mathcal\{B\}performs at mostnMnMsimulations, and each simulation runs𝒜\\mathcal\{A\}from scratch on a sequence of length at mostnMnMand polynomially bounded total input length, the total computation time ofℬ\\mathcal\{B\}is polynomial inn,M,n,M,and the total input length of the original sequence\. ∎
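Below is a minimal illustrative sketch of learner $\mathcal{B}$ from the proof (not part of the formal development). The base learner $\mathcal{A}$ is assumed to be supplied via a factory and to expose `predict`/`update` methods; these interface names are ours, not the paper's.

```python
class CoTSupervisedReduction:
    """Learner B of Lemma 11.7: simulate a fresh copy of the base learner A on
    the expanded transcript to produce each token of its own chain of thought,
    and output the M-th token as the end-to-end prediction."""
    def __init__(self, make_base_learner, M):
        self.make_base = make_base_learner    # factory: () -> fresh copy of A
        self.M = M
        self.transcript = []                  # past (prefix string, next bit) pairs

    def _replay(self, extra):
        A = self.make_base()
        for u, z in self.transcript + extra:  # replay A from scratch on the prefix
            A.predict(u)                      # predictions on past rounds are discarded
            A.update(u, z)
        return A

    def predict(self, x):
        u, extra, z_hat = x, [], None
        for _ in range(self.M):
            z_hat = self._replay(extra).predict(u)
            extra.append((u, z_hat))
            u += str(z_hat)
        return z_hat                          # final-answer (e2e) prediction

    def update(self, x, true_chain):          # CoT feedback: the M true tokens
        u = x
        for z in true_chain:
            self.transcript.append((u, z))
            u += str(z)
```

Each round costs at most $M$ replays of $\mathcal{A}$ over a sequence of length at most $nM$, matching the polynomial-time accounting in the proof.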
To obtain our efficient algorithm, we use the following classic result for the base class\.
###### Lemma 11\.8\(Maass–Turán\[[33](https://arxiv.org/html/2605.06819#bib.bib3)\]\)\.
There exists a deterministic online learner for Boolean linear thresholds over\{0,1\}d\\\{0,1\\\}^\{d\}such that:
1. 1\.on every realizable sequence it makes at mostO\(d2logd\)O\(d^\{2\}\\log d\)mistakes;
2. 2\.on every length\-mmsequence, its total computation time is polynomial inmmanddd\.
###### Theorem 11\.9\(Efficient online𝖢𝗈𝖳\\mathsf\{CoT\}\-learning for autoregressive linear thresholds\)\.
Fix a horizonnn\. There exists a deterministic online learner forℱd,lin𝖾𝟤𝖾−M\\mathcal\{F\}\_\{d,\\operatorname\{lin\}\}^\{\\mathsf\{e2e\}\-M\}with𝖢𝗈𝖳\\mathsf\{CoT\}\-supervision such that, for every sequence\(xt,zt\)t=1n\(x\_\{t\},z\_\{t\}\)\_\{t=1\}^\{n\}satisfyingzt=f⋆𝖢𝗈𝖳−M\(xt\)z\_\{t\}=f\_\{\\star\}^\{\\mathsf\{CoT\}\-M\}\(x\_\{t\}\)for somef⋆∈ℱd,linf\_\{\\star\}\\in\\mathcal\{F\}\_\{d,\\operatorname\{lin\}\}, the learner makes at mostO\(d2logd\)O\(d^\{2\}\\log d\)mistakes and its total computation time is polynomial inn,d,M,n,d,M,and∑t=1n\|xt\|\\sum\_\{t=1\}^\{n\}\|x\_\{t\}\|\.
###### Proof\.
Letϕ:\{0,1\}⋆→\{0,1\}d\\phi:\\\{0,1\\\}^\{\\star\}\\to\\\{0,1\\\}^\{d\}be the suffix map returning the lastddbits, padded with leading zeroes if needed\. By definition ofℱd,lin\\mathcal\{F\}\_\{d,\\operatorname\{lin\}\}, for everyf∈ℱd,linf\\in\\mathcal\{F\}\_\{d,\\operatorname\{lin\}\}there is a Boolean linear thresholdgf:\{0,1\}d→\{0,1\}g\_\{f\}:\\\{0,1\\\}^\{d\}\\to\\\{0,1\\\}such thatf\(x\)=gf\(ϕ\(x\)\)f\(x\)=g\_\{f\}\(\\phi\(x\)\)for allx∈\{0,1\}⋆x\\in\\\{0,1\\\}^\{\\star\}\.
Let𝒜\\mathcal\{A\}be the learner from Lemma[11\.8](https://arxiv.org/html/2605.06819#S11.Thmtheorem8)\. We define a learner𝒜′\\mathcal\{A\}^\{\\prime\}forℱd,lin\\mathcal\{F\}\_\{d,\\operatorname\{lin\}\}as follows: on an examplexx, learner𝒜′\\mathcal\{A\}^\{\\prime\}predicts whatever𝒜\\mathcal\{A\}predicts onϕ\(x\)\\phi\(x\); after receiving the labelyy, learner𝒜′\\mathcal\{A\}^\{\\prime\}updates𝒜\\mathcal\{A\}on the labeled example\(ϕ\(x\),y\)\(\\phi\(x\),y\)\. Sincef\(x\)=gf\(ϕ\(x\)\)f\(x\)=g\_\{f\}\(\\phi\(x\)\), every realizable sequence forℱd,lin\\mathcal\{F\}\_\{d,\\operatorname\{lin\}\}maps to a realizable sequence for Boolean linear thresholds over\{0,1\}d\\\{0,1\\\}^\{d\}\. Hence𝒜′\\mathcal\{A\}^\{\\prime\}makes at mostO\(d2logd\)O\(d^\{2\}\\log d\)mistakes on every realizable sequence\. Its total computation time on any length\-mmsequence is polynomial inm,d,m,d,and the total input length, since computing all suffixesϕ\(x\)\\phi\(x\)contributes only linear overhead\.
Applying Lemma[11\.7](https://arxiv.org/html/2605.06819#S11.Thmtheorem7)to the base classℱd,lin\\mathcal\{F\}\_\{d,\\operatorname\{lin\}\}and the learner𝒜′\\mathcal\{A\}^\{\\prime\}yields the claim\. ∎
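A minimal illustrative sketch of the suffix-map wrapper $\mathcal{A}^{\prime}$ and of how the pieces combine; `maass_turan_learner` is a hypothetical placeholder for any online halfspace learner over $\{0,1\}^{d}$ with the guarantees of Lemma [11.8](https://arxiv.org/html/2605.06819#S11.Thmtheorem8), and is not code from the paper.

```python
def phi(x, d):
    """Suffix map: the last d bits of x, left-padded with zeros if |x| < d."""
    return x[-d:].rjust(d, "0")

class SuffixWrapper:
    """Learner A' of the proof: feed the base halfspace learner phi(x) instead of x."""
    def __init__(self, base_learner, d):
        self.base, self.d = base_learner, d
    def predict(self, x):
        return self.base.predict(phi(x, self.d))
    def update(self, x, y):
        self.base.update(phi(x, self.d), y)

# Composition with the reduction sketched after Lemma 11.7 (hypothetical learner name):
# learner = CoTSupervisedReduction(lambda: SuffixWrapper(maass_turan_learner(d), d), M)
```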
## 12 Non-Littlestone base classes
###### Theorem 12\.1\.
There exists a base classℱ\\mathcal\{F\}over the binary alphabetΣ=\{0,1\}\\Sigma=\\\{0,1\\\}such that, for the pairwise distinct sequence
Mi:=i\+4\(i∈ℕ\),M\_\{i\}:=i\+4\\qquad\(i\\in\\mathbb\{N\}\),the following holds:
iodd⟹𝙻\(ℱ𝖾𝟤𝖾−Mi\)=0,ieven⟹𝙻\(ℱ𝖾𝟤𝖾−Mi\)=∞\.i\\text\{ odd \}\\implies\\mathtt\{L\}\(\\mathcal\{F\}^\{\\mathsf\{e2e\}\-M\_\{i\}\}\)=0,\\qquad i\\text\{ even \}\\implies\\mathtt\{L\}\(\\mathcal\{F\}^\{\\mathsf\{e2e\}\-M\_\{i\}\}\)=\\infty\.Moreover, for every oddii, online learning with𝖢𝗈𝖳\\mathsf\{CoT\}\-supervision has a finite mistake bound, while for every evenii, online learning with𝖢𝗈𝖳\\mathsf\{CoT\}\-supervision is impossible\.
###### Proof\.
For everym≥2m\\geq 2andn≥0n\\geq 0, define the string
um,n:=0m10n1\.u\_\{m,n\}:=0^\{m\}10^\{n\}1\.For every binary arrayα=\(αm,n\)m≥2,n≥0∈\{0,1\}\{2,3,…\}×\(ℕ∪\{0\}\)\\alpha=\(\\alpha\_\{m,n\}\)\_\{m\\geq 2,\\;n\\geq 0\}\\in\\\{0,1\\\}^\{\\\{2,3,\\ldots\\\}\\times\(\\mathbb\{N\}\\cup\\\{0\\\}\)\}, define a next\-token generator
fα:\{0,1\}⋆→\{0,1\}f\_\{\\alpha\}:\\\{0,1\\\}^\{\\star\}\\to\\\{0,1\\\}as follows:
$$f_{\alpha}(x)=\begin{cases}\alpha_{m,n}&\text{if }x=u_{m,n}\text{ for some }m\geq 2,\ n\geq 0,\\ 0&\text{if }x=u_{m,n}b0^{t}\text{ for some }m\geq 2,\ n\geq 0,\ b\in\{0,1\},\ 0\leq t<2m-2,\\ b&\text{if }x=u_{m,n}b0^{2m-2}\text{ for some }m\geq 2,\ n\geq 0,\ b\in\{0,1\},\\ 0&\text{otherwise.}\end{cases}$$
Let
ℱ:=\{fα:α∈\{0,1\}\{2,3,…\}×\(ℕ∪\{0\}\)\}\.\\mathcal\{F\}:=\\\{f\_\{\\alpha\}:\\alpha\\in\\\{0,1\\\}^\{\\\{2,3,\\ldots\\\}\\times\(\\mathbb\{N\}\\cup\\\{0\\\}\)\}\\\}\.
We first analyze the even horizons\. Fixm≥2m\\geq 2,n≥0n\\geq 0, andα\\alpha\. Letb:=αm,nb:=\\alpha\_\{m,n\}\. Starting from the inputum,nu\_\{m,n\}, the generated bits are
b,0,…,0⏟2m−2times,b\.b,\\underbrace\{0,\\ldots,0\}\_\{2m\-2\\text\{ times\}\},b\.Therefore,
fα𝖾𝟤𝖾−2m\(um,n\)=αm,n\.f\_\{\\alpha\}^\{\\mathsf\{e2e\}\-2m\}\(u\_\{m,n\}\)=\\alpha\_\{m,n\}\.Hence, for every fixedm≥2m\\geq 2, the set
Xm:=\{um,n:n≥0\}X\_\{m\}:=\\\{u\_\{m,n\}:n\\geq 0\\\}shows thatℱ𝖾𝟤𝖾−2m\\mathcal\{F\}^\{\\mathsf\{e2e\}\-2m\}has infinite Littlestone dimension\. Indeed, letd∈ℕd\\in\\mathbb\{N\}, and consider the complete binary tree of depthddin which every node at depthjjis labeled byum,j−1u\_\{m,j\-1\}\. Given any branch\-label sequencey1,…,yd∈\{0,1\}y\_\{1\},\\ldots,y\_\{d\}\\in\\\{0,1\\\}, chooseα\\alphaso that
αm,j−1=yjfor allj=1,…,d,\\alpha\_\{m,j\-1\}=y\_\{j\}\\qquad\\text\{for all \}j=1,\\ldots,d,and define the remaining coordinates ofα\\alphaarbitrarily\. Then
fα𝖾𝟤𝖾−2m\(um,j−1\)=yjfor allj=1,…,d\.f\_\{\\alpha\}^\{\\mathsf\{e2e\}\-2m\}\(u\_\{m,j\-1\}\)=y\_\{j\}\\qquad\\text\{for all \}j=1,\\ldots,d\.Thus this tree is shattered byℱ𝖾𝟤𝖾−2m\\mathcal\{F\}^\{\\mathsf\{e2e\}\-2m\}\. Sinceddis arbitrary,
𝙻\(ℱ𝖾𝟤𝖾−2m\)=∞for everym≥2\.\\mathtt\{L\}\(\\mathcal\{F\}^\{\\mathsf\{e2e\}\-2m\}\)=\\infty\\qquad\\text\{for every \}m\\geq 2\.
Moreover, this lower bound is unchanged under𝖢𝗈𝖳\\mathsf\{CoT\}\-supervision\. Indeed, for every queried instanceum,nu\_\{m,n\}, ifb:=αm,nb:=\\alpha\_\{m,n\}, then the revealed2m2m\-step chain is
fα𝖢𝗈𝖳−2m\(um,n\)=b02m−2b\.f\_\{\\alpha\}^\{\\mathsf\{CoT\}\-2m\}\(u\_\{m,n\}\)=b0^\{2m\-2\}b\.In particular, the revealed𝖢𝗈𝖳\\mathsf\{CoT\}is completely determined by the final label
fα𝖾𝟤𝖾−2m\(um,n\)=b\.f\_\{\\alpha\}^\{\\mathsf\{e2e\}\-2m\}\(u\_\{m,n\}\)=b\.Hence, on the adversarial sequence above, a learner with𝖢𝗈𝖳\\mathsf\{CoT\}\-supervision receives no more information than an𝖾𝟤𝖾\\mathsf\{e2e\}\-learner, so the same lower bound applies\.
We now turn to odd horizons\. Fix an oddM≥5M\\geq 5\. We claim that all functions inℱ𝖾𝟤𝖾−M\\mathcal\{F\}^\{\\mathsf\{e2e\}\-M\}coincide, and therefore𝙻\(ℱ𝖾𝟤𝖾−M\)=0\\mathtt\{L\}\(\\mathcal\{F\}^\{\\mathsf\{e2e\}\-M\}\)=0\.
Fixx∈\{0,1\}⋆x\\in\\\{0,1\\\}^\{\\star\}and two arraysα,β\\alpha,\\beta\. We show that
fα𝖾𝟤𝖾−M\(x\)=fβ𝖾𝟤𝖾−M\(x\)\.f\_\{\\alpha\}^\{\\mathsf\{e2e\}\-M\}\(x\)=f\_\{\\beta\}^\{\\mathsf\{e2e\}\-M\}\(x\)\.
We will use the following simple observation: for everym≥2m\\geq 2,n≥0n\\geq 0,b∈\{0,1\}b\\in\\\{0,1\\\}, andr≥0r\\geq 0, the string
um,nb02m−2b0ru\_\{m,n\}b0^\{2m\-2\}b0^\{r\}is of none of the special forms appearing in the definition offαf\_\{\\alpha\}\. Indeed, if it were of the formum′,n′u\_\{m^\{\\prime\},n^\{\\prime\}\}orum′,n′c0tu\_\{m^\{\\prime\},n^\{\\prime\}\}c0^\{t\}, then necessarilym′=mm^\{\\prime\}=mandn′=nn^\{\\prime\}=n, since the first two occurrences of11already determine the prefixum,nu\_\{m,n\}\. But then the remaining suffixb02m−2b0rb0^\{2m\-2\}b0^\{r\}cannot be empty, and also cannot equalc0tc0^\{t\}: ifb=1b=1then it contains two11’s, while ifb=0b=0it equals02m\+r0^\{2m\+r\}, which would forcec=0c=0andt=2m\+r−1\>2m−2t=2m\+r\-1\>2m\-2\. Thus no such string is special\.
We now distinguish three cases\.
##### Case 1:x=um,nx=u\_\{m,n\}for somem≥2m\\geq 2,n≥0n\\geq 0\.
Letb:=αm,nb:=\\alpha\_\{m,n\}\. Underfαf\_\{\\alpha\}, the first generated bit isbb, the next2m−22m\-2generated bits are0, and the2m2m\-th generated bit is againbb\. After these2m2msteps, the current string is
um,nb02m−2b\.u\_\{m,n\}b0^\{2m\-2\}b\.By the observation above, from this point on, only the default rule applies, so all subsequent generated bits are0\. Therefore the only possiblyα\\alpha\-dependent generated bits occur at steps11and2m2m\. SinceM≥5M\\geq 5is odd it holds thatM≠1M\\neq 1andM≠2mM\\neq 2m, and hencefα𝖾𝟤𝖾−M\(x\)f\_\{\\alpha\}^\{\\mathsf\{e2e\}\-M\}\(x\)is independent ofα\\alpha\. In particular,
fα𝖾𝟤𝖾−M\(x\)=fβ𝖾𝟤𝖾−M\(x\)\.f\_\{\\alpha\}^\{\\mathsf\{e2e\}\-M\}\(x\)=f\_\{\\beta\}^\{\\mathsf\{e2e\}\-M\}\(x\)\.
##### Case 2:x=um,nb0tx=u\_\{m,n\}b0^\{t\}for somem≥2m\\geq 2,n≥0n\\geq 0,b∈\{0,1\}b\\in\\\{0,1\\\},0≤t≤2m−20\\leq t\\leq 2m\-2\.
Starting fromxx, the next2m−2−t2m\-2\-tgenerated bits are0, then the next generated bit isbb, and after that the current string has the form
um,nb02m−2b\.u\_\{m,n\}b0^\{2m\-2\}b\.By the observation above, from this point on, all subsequent generated bits are0\. Therefore, the entire future trajectory fromxxis determined by the stringxxitself, and in particular is independent ofα\\alpha\. Hence
fα𝖾𝟤𝖾−M\(x\)=fβ𝖾𝟤𝖾−M\(x\)\.f\_\{\\alpha\}^\{\\mathsf\{e2e\}\-M\}\(x\)=f\_\{\\beta\}^\{\\mathsf\{e2e\}\-M\}\(x\)\.
##### Case 3:xxis of neither of the above two forms\.
Thenfα\(x\)=fβ\(x\)=0f\_\{\\alpha\}\(x\)=f\_\{\\beta\}\(x\)=0\. As long as the autoregressive trajectory stays outside the set
\{um,n:m≥2,n≥0\}∪\{um,nb0t:m≥2,n≥0,b∈\{0,1\},0≤t≤2m−2\},\\\{u\_\{m,n\}:m\\geq 2,\\ n\\geq 0\\\}\\;\\cup\\;\\\{u\_\{m,n\}b0^\{t\}:m\\geq 2,\\ n\\geq 0,\\ b\\in\\\{0,1\\\},\\ 0\\leq t\\leq 2m\-2\\\},both functions continue to output0\. If the trajectory never enters this set, then clearly
fα𝖾𝟤𝖾−M\(x\)=fβ𝖾𝟤𝖾−M\(x\)=0\.f\_\{\\alpha\}^\{\\mathsf\{e2e\}\-M\}\(x\)=f\_\{\\beta\}^\{\\mathsf\{e2e\}\-M\}\(x\)=0\.Otherwise, lets≥1s\\geq 1be the first time such a string is reached\. Since all bits generated before timessare0, the reached string cannot be of the formum,nu\_\{m,n\}, because everyum,nu\_\{m,n\}ends with11\. Hence at timessthe trajectory reaches a string covered by Case 2, and from that point on the continuation is independent ofα\\alpha\. Therefore
fα𝖾𝟤𝖾−M\(x\)=fβ𝖾𝟤𝖾−M\(x\)\.f\_\{\\alpha\}^\{\\mathsf\{e2e\}\-M\}\(x\)=f\_\{\\beta\}^\{\\mathsf\{e2e\}\-M\}\(x\)\.
We have proved that for every oddM≥5M\\geq 5, all functions inℱ𝖾𝟤𝖾−M\\mathcal\{F\}^\{\\mathsf\{e2e\}\-M\}are identical\. Hence
\|ℱ𝖾𝟤𝖾−M\|=1,so𝙻\(ℱ𝖾𝟤𝖾−M\)=0\.\|\\mathcal\{F\}^\{\\mathsf\{e2e\}\-M\}\|=1,\\qquad\\text\{so\}\\qquad\\mathtt\{L\}\(\\mathcal\{F\}^\{\\mathsf\{e2e\}\-M\}\)=0\.Therefore, for every oddM≥5M\\geq 5, online learning with𝖢𝗈𝖳\\mathsf\{CoT\}\-supervision is also possible, since𝖢𝗈𝖳\\mathsf\{CoT\}\-supervision is only more informative than𝖾𝟤𝖾\\mathsf\{e2e\}\-supervision\.
Finally, for the pairwise distinct sequenceMi=i\+4M\_\{i\}=i\+4, ifiiis even thenMiM\_\{i\}is even and at least66, so
𝙻\(ℱ𝖾𝟤𝖾−Mi\)=∞,\\mathtt\{L\}\(\\mathcal\{F\}^\{\\mathsf\{e2e\}\-M\_\{i\}\}\)=\\infty,and moreover, the same impossibility holds even for online learning with𝖢𝗈𝖳\\mathsf\{CoT\}\-supervision\. Ifiiis odd thenMiM\_\{i\}is odd and at least55, so
𝙻\(ℱ𝖾𝟤𝖾−Mi\)=0<∞,\\mathtt\{L\}\(\\mathcal\{F\}^\{\\mathsf\{e2e\}\-M\_\{i\}\}\)=0<\\infty,and therefore online learning with𝖢𝗈𝖳\\mathsf\{CoT\}\-supervision is also possible\. ∎
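For concreteness, here is a small illustrative sketch of the next-token generator $f_{\alpha}$ and its autoregressive rollout (not part of the formal development; all identifiers are illustrative). The assertions check, on a few small examples, that even horizons $2m$ reveal $\alpha_{m,n}$ while odd horizons are independent of $\alpha$.

```python
def u(m, n):
    return "0" * m + "1" + "0" * n + "1"

def f_alpha(alpha, x):
    """Next-token generator f_alpha; `alpha` maps (m, n) to a bit (default 0)."""
    i = x.find("1")
    if i < 2:
        return 0                                  # no prefix 0^m 1 with m >= 2
    j = x.find("1", i + 1)
    if j < 0:
        return 0
    m, s = i, x[j + 1:]                           # x = u_{m,n} s with n = j - i - 1
    if s == "":
        return alpha.get((m, j - i - 1), 0)       # x = u_{m,n}
    b, zeros = s[0], s[1:]
    if "1" not in zeros:
        t = len(zeros)
        if t < 2 * m - 2:
            return 0                              # x = u_{m,n} b 0^t: keep generating 0
        if t == 2 * m - 2:
            return int(b)                         # x = u_{m,n} b 0^{2m-2}: echo b
    return 0                                      # default rule

def rollout(alpha, x, M):
    for _ in range(M):
        x += str(f_alpha(alpha, x))
    return x[-1]                                  # the e2e-M output

alpha = {(2, 0): 1, (2, 3): 0, (3, 1): 1}
for (m, n), bit in alpha.items():
    assert rollout(alpha, u(m, n), 2 * m) == str(bit)      # even horizon reveals alpha
for M in (5, 7, 9):                                        # odd horizons are alpha-independent
    assert rollout(alpha, u(2, 0), M) == rollout({}, u(2, 0), M)
```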
## 13 Stochastic autoregressive lower bound
We now consider a stochastic variant of the autoregressive model\.
###### Definition 13\.1\(Stochastic autoregressive next\-token generators\)\.
A*stochastic base class*is a class
𝒢⊆\{g:\{0,1\}⋆→Δ\(\{0,1\}\)\},\\mathcal\{G\}\\subseteq\\\{g:\\\{0,1\\\}^\{\\star\}\\to\\Delta\(\\\{0,1\\\}\)\\\},whereΔ\(\{0,1\}\)\\Delta\(\\\{0,1\\\}\)denotes the set of distributions over\{0,1\}\\\{0,1\\\}\. Forg∈𝒢g\\in\\mathcal\{G\}andx∈\{0,1\}⋆x\\in\\\{0,1\\\}^\{\\star\}, defineg¯\(x\)\\bar\{g\}\(x\)to be the random string obtained by samplingY∼g\(x\)Y\\sim g\(x\)and outputtingappend\(x,Y\)\\operatorname\{append\}\(x,Y\)\. ForM∈ℕM\\in\\mathbb\{N\}, defineg𝖢𝗈𝖳−Mg^\{\\mathsf\{CoT\}\-M\}andg𝖾𝟤𝖾−Mg^\{\\mathsf\{e2e\}\-M\}exactly as in the deterministic case, except that these are now induced distributions obtained by composingg¯\\bar\{g\}forMMmany steps\. As before, write𝒢𝖾𝟤𝖾−M:=\{g𝖾𝟤𝖾−M:g∈𝒢\}\\mathcal\{G\}^\{\\mathsf\{e2e\}\-M\}:=\\\{g^\{\\mathsf\{e2e\}\-M\}:g\\in\\mathcal\{G\}\\\}\.
In the associated online prediction problem, on roundttthe learner receives an instancext∈\{0,1\}⋆x\_\{t\}\\in\\\{0,1\\\}^\{\\star\}, predicts a bity^t∈\{0,1\}\\hat\{y\}\_\{t\}\\in\\\{0,1\\\}, and then observes a random labelYtY\_\{t\}, sampled from the relevant target distribution\. In the direct next\-token problem we haveYt∼g⋆\(xt\)Y\_\{t\}\\sim g\_\{\\star\}\(x\_\{t\}\), while in the𝖾𝟤𝖾\\mathsf\{e2e\}problem we haveYt∼g⋆𝖾𝟤𝖾−M\(xt\)Y\_\{t\}\\sim g\_\{\\star\}^\{\\mathsf\{e2e\}\-M\}\(x\_\{t\}\), for some unknowng⋆∈𝒢g\_\{\\star\}\\in\\mathcal\{G\}\. For a learner𝒜\\mathcal\{A\}, targetg⋆g\_\{\\star\}, and instance sequence\(xt\)t=1T\(x\_\{t\}\)\_\{t=1\}^\{T\}, define the expected regret relative to the optimal prediction
RegT\(𝒜,g⋆,\(xt\)\):=𝔼\[∑t=1T𝟙\[y^t≠Yt\]\]−∑t=1Tmina∈\{0,1\}Pr\[a≠Yt\]\.\\operatorname\{Reg\}\_\{T\}\(\\mathcal\{A\},g\_\{\\star\},\(x\_\{t\}\)\):=\\mathop\{\\mathbb\{E\}\}\\\!\\left\[\\sum\_\{t=1\}^\{T\}\\mathbbm\{1\}\[\\hat\{y\}\_\{t\}\\neq Y\_\{t\}\]\\right\]\-\\sum\_\{t=1\}^\{T\}\\min\_\{a\\in\\\{0,1\\\}\}\\Pr\[a\\neq Y\_\{t\}\]\.
###### Theorem 13\.2\.
For every oddM≥1M\\geq 1there exists a stochastic base class𝒢=\{g−,g\+\}\\mathcal\{G\}=\\\{g\_\{\-\},g\_\{\+\}\\\}such that:
1. 1\.there exists an online learner for the direct next\-token problem on𝒢\\mathcal\{G\}whose expected regret isO\(1\)O\(1\)on every instance sequence;
2. 2\.for every \(possibly randomized\) online learner𝒜\\mathcal\{A\}for𝒢𝖾𝟤𝖾−M\\mathcal\{G\}^\{\\mathsf\{e2e\}\-M\}, there exists a targetg⋆∈𝒢g\_\{\\star\}\\in\\mathcal\{G\}such that, even on the constant sequencext=∅x\_\{t\}=\\emptyset, its expected regret isΩ\(2M\)\\Omega\(2^\{M\}\)at some horizonT=Θ\(4M\)T=\\Theta\(4^\{M\}\)\.
In particular, the $\mathsf{e2e}$ regret can be $\Omega(2^{M})$, much larger than the $\log M$ growth that upper bounds $\mathsf{e2e}$-learning with a base class of fixed Littlestone dimension (given in Theorem [7.1](https://arxiv.org/html/2605.06819#S7.Thmtheorem1)) in the standard (deterministic functions) model.
###### Proof\.
Forx∈\{0,1\}⋆x\\in\\\{0,1\\\}^\{\\star\}, letb\(x\)b\(x\)denote the last bit ofxx, with the conventionb\(∅\)=0b\(\\emptyset\)=0\. Forσ∈\{−1,\+1\}\\sigma\\in\\\{\-1,\+1\\\}define
gσ\(x\)=law ofb\(x\)⊕Z,Z∼Ber\(12\+σ4\)\.g\_\{\\sigma\}\(x\)=\\text\{law of \}b\(x\)\\oplus Z,\\qquad Z\\sim\\mathrm\{Ber\}\\\!\\left\(\\frac\{1\}\{2\}\+\\frac\{\\sigma\}\{4\}\\right\)\.Thusg−g\_\{\-\}flips the current last bit with probability1/41/4, whereasg\+g\_\{\+\}flips it with probability3/43/4\.
We first prove the first item in the theorem, that learning the direct next token is trivial\. Fix a targetgσg\_\{\\sigma\}and an arbitrary instance sequence\(xt\)t=1T\(x\_\{t\}\)\_\{t=1\}^\{T\}\. DefineZt:=Yt⊕b\(xt\)Z\_\{t\}:=Y\_\{t\}\\oplus b\(x\_\{t\}\)\. ThenZ1,Z2,…Z\_\{1\},Z\_\{2\},\\ldotsare i\.i\.d\.Ber\(12\+σ4\)\\mathrm\{Ber\}\(\\frac\{1\}\{2\}\+\\frac\{\\sigma\}\{4\}\), independently of the instance sequence\. Consider the learner that predicts arbitrarily on round11, and on roundt≥2t\\geq 2computes the empirical meanZ¯t−1:=1t−1∑s=1t−1Zs\\bar\{Z\}\_\{t\-1\}:=\\frac\{1\}\{t\-1\}\\sum\_\{s=1\}^\{t\-1\}Z\_\{s\}, setsσ^t=\+1\\hat\{\\sigma\}\_\{t\}=\+1ifZ¯t−1≥12\\bar\{Z\}\_\{t\-1\}\\geq\\frac\{1\}\{2\}andσ^t=−1\\hat\{\\sigma\}\_\{t\}=\-1otherwise, and then predictsb\(xt\)b\(x\_\{t\}\)ifσ^t=−1\\hat\{\\sigma\}\_\{t\}=\-1and1−b\(xt\)1\-b\(x\_\{t\}\)ifσ^t=\+1\\hat\{\\sigma\}\_\{t\}=\+1\. Ifσ^t=σ\\hat\{\\sigma\}\_\{t\}=\\sigma, this is the Bayes\-optimal prediction forgσ\(xt\)g\_\{\\sigma\}\(x\_\{t\}\); otherwise the regret at roundttis exactly3/4−1/4=1/23/4\-1/4=1/2\. Hence,
RegT≤12\+12∑t=2TPr\[σ^t≠σ\]\.\\operatorname\{Reg\}\_\{T\}\\leq\\frac\{1\}\{2\}\+\\frac\{1\}\{2\}\\sum\_\{t=2\}^\{T\}\\Pr\[\\hat\{\\sigma\}\_\{t\}\\neq\\sigma\]\.Since the true mean is either1/41/4or3/43/4, Hoeffding’s inequality givesPr\[σ^t≠σ\]≤e−\(t−1\)/8\\Pr\[\\hat\{\\sigma\}\_\{t\}\\neq\\sigma\]\\leq e^\{\-\(t\-1\)/8\}, and thereforeRegT=O\(1\)\\operatorname\{Reg\}\_\{T\}=O\(1\)\.
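For illustration only (not part of the formal development), here is a small sketch of the empirical-mean learner from the proof together with a toy simulation; the flip probability $1/2+\sigma/4$ matches the definition of $g_{\sigma}$, and all identifiers are illustrative.

```python
import random

class DirectNextTokenLearner:
    """Empirical-mean learner from the proof: estimate sigma from Z_s = Y_s XOR b(x_s)."""
    def __init__(self):
        self.zs = []
    def predict(self, x):
        b = int(x[-1]) if x else 0            # b(x), with b(empty string) = 0
        if not self.zs:
            return b                          # arbitrary prediction on round 1
        sigma_hat = +1 if sum(self.zs) / len(self.zs) >= 0.5 else -1
        return 1 - b if sigma_hat == +1 else b
    def update(self, x, y):
        self.zs.append(y ^ (int(x[-1]) if x else 0))

# Toy simulation: expected regret grows by 1/2 only on rounds where the learner's
# prediction differs from the Bayes-optimal one, which happens O(1) times.
random.seed(0)
sigma, T = +1, 10_000
learner, expected_regret = DirectNextTokenLearner(), 0.0
for t in range(T):
    x = "".join(random.choice("01") for _ in range(random.randint(0, 5)))
    b = int(x[-1]) if x else 0
    bayes = 1 - b if sigma == +1 else b       # Bayes-optimal prediction under g_sigma
    expected_regret += 0.5 * (learner.predict(x) != bayes)
    y = b ^ int(random.random() < 0.5 + sigma / 4)
    learner.update(x, y)
print(expected_regret)                        # stays O(1), independently of T
```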
We now prove the second item\. IfM=1M=1, thengσ𝖾𝟤𝖾−1=gσg\_\{\\sigma\}^\{\\mathsf\{e2e\}\-1\}=g\_\{\\sigma\}\. On the constant sequencext=∅x\_\{t\}=\\emptyset, the two targets induceBer\(1/4\)\\mathrm\{Ber\}\(1/4\)andBer\(3/4\)\\mathrm\{Ber\}\(3/4\)\. On round11, if the learner predicts11with probabilitypp, then its expected regret isp/2p/2underg−g\_\{\-\}and\(1−p\)/2\(1\-p\)/2underg\+g\_\{\+\}\. Thus one of the two targets yields regret at least1/41/4at horizonT=1T=1, which isΩ\(2M\)\\Omega\(2^\{M\}\)forM=1M=1\. So it remains to consider oddM≥3M\\geq 3\.
Fix such an oddMM, and again consider the constant sequencext=∅x\_\{t\}=\\emptyset\. Letqmσq\_\{m\}^\{\\sigma\}be the probability that aftermmautoregressive steps the current last bit equals11undergσg\_\{\\sigma\}, starting from∅\\emptyset\. The process is a two\-state Markov chain, and
qm\+1σ=\(12\+σ4\)\+\(1−2\(12\+σ4\)\)qmσ,q0σ=0\.q\_\{m\+1\}^\{\\sigma\}=\\left\(\\frac\{1\}\{2\}\+\\frac\{\\sigma\}\{4\}\\right\)\+\\left\(1\-2\\left\(\\frac\{1\}\{2\}\+\\frac\{\\sigma\}\{4\}\\right\)\\right\)q\_\{m\}^\{\\sigma\},\\qquad q\_\{0\}^\{\\sigma\}=0\.Solving this recursion givesqmσ=1−\(−σ/2\)m2q\_\{m\}^\{\\sigma\}=\\frac\{1\-\(\-\\sigma/2\)^\{m\}\}\{2\}\. SinceMMis odd,qMσ=12\+σ2−M−1q\_\{M\}^\{\\sigma\}=\\frac\{1\}\{2\}\+\\sigma 2^\{\-M\-1\}\. LetδM:=2−M−1\\delta\_\{M\}:=2^\{\-M\-1\}\. Hence on input∅\\emptyset, the final answer underg−𝖾𝟤𝖾−Mg\_\{\-\}^\{\\mathsf\{e2e\}\-M\}isBer\(12−δM\)\\mathrm\{Ber\}\(\\frac\{1\}\{2\}\-\\delta\_\{M\}\), while underg\+𝖾𝟤𝖾−Mg\_\{\+\}^\{\\mathsf\{e2e\}\-M\}it isBer\(12\+δM\)\\mathrm\{Ber\}\(\\frac\{1\}\{2\}\+\\delta\_\{M\}\)\. The Bayes\-optimal predictions are therefore0and11respectively, and using the wrong one incurs regret2δM2\\delta\_\{M\}per round\.
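As a quick numerical check (not part of the formal development) that the closed form $q_{m}^{\sigma}=\frac{1-(-\sigma/2)^{m}}{2}$ solves the recursion, and of its value at odd horizons:

```python
def q(sigma, M):
    """Iterate the two-state Markov chain recursion q_{m+1} = (1/2 + sigma/4) - (sigma/2) q_m."""
    qm = 0.0
    for _ in range(M):
        qm = (0.5 + sigma / 4) + (1 - 2 * (0.5 + sigma / 4)) * qm
    return qm

for sigma in (-1, +1):
    for M in range(13):
        assert abs(q(sigma, M) - (1 - (-sigma / 2) ** M) / 2) < 1e-12
    for M in (1, 3, 5, 7, 9):                 # odd horizons: q_M = 1/2 + sigma * 2^{-M-1}
        assert abs(q(sigma, M) - (0.5 + sigma * 2 ** (-M - 1))) < 1e-12
```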
Fix any randomized learner𝒜\\mathcal\{A\}\. Its prior\-averaged expected regret under the uniform prior on\{g−,g\+\}\\\{g\_\{\-\},g\_\{\+\}\\\}is the expectation, over the learner’s internal randomness, of the corresponding quantity for the induced deterministic learner\. It therefore suffices to prove the lower bound for deterministic learners\.
Fix such a deterministic learner\. On roundtt, its prediction is a function of the firstt−1t\-1observed labels\. LetP−t−1P\_\{\-\}^\{t\-1\}andP\+t−1P\_\{\+\}^\{t\-1\}denote the distributions of theset−1t\-1labels underBer\(12−δM\)\\mathrm\{Ber\}\(\\frac\{1\}\{2\}\-\\delta\_\{M\}\)andBer\(12\+δM\)\\mathrm\{Ber\}\(\\frac\{1\}\{2\}\+\\delta\_\{M\}\)respectively, and letAtA\_\{t\}be the event that the learner predicts11on roundtt\. Since the Bayes\-optimal prediction is0underg−g\_\{\-\}and11underg\+g\_\{\+\}, the prior\-averaged probability of predicting the wrong Bayes\-optimal bit at roundttis
12P−t−1\(At\)\+12P\+t−1\(Atc\)\.\\frac\{1\}\{2\}P\_\{\-\}^\{t\-1\}\(A\_\{t\}\)\+\\frac\{1\}\{2\}P\_\{\+\}^\{t\-1\}\(A\_\{t\}^\{c\}\)\.By the Bretagnolle–Huber inequality\[[29](https://arxiv.org/html/2605.06819#bib.bib2), Theorem 14\.2\],
P−t−1\(At\)\+P\+t−1\(Atc\)≥12exp\(−KL\(P−t−1,P\+t−1\)\),P\_\{\-\}^\{t\-1\}\(A\_\{t\}\)\+P\_\{\+\}^\{t\-1\}\(A\_\{t\}^\{c\}\)\\geq\\frac\{1\}\{2\}\\exp\\\!\\left\(\-\\mathrm\{KL\}\(P\_\{\-\}^\{t\-1\},P\_\{\+\}^\{t\-1\}\)\\right\),whereKL\(P,Q\)\\mathrm\{KL\}\(P,Q\)denotes the Kullback–Leibler divergence fromPPtoQQ\. Therefore the prior\-averaged probability of predicting the wrong Bayes\-optimal bit is at least
14exp\(−KL\(P−t−1,P\+t−1\)\)\.\\frac\{1\}\{4\}\\exp\\\!\\left\(\-\\mathrm\{KL\}\(P\_\{\-\}^\{t\-1\},P\_\{\+\}^\{t\-1\}\)\\right\)\.
It remains to bound the divergence\. Since the labels are i\.i\.d\.,
KL\(P−t−1,P\+t−1\)=\(t−1\)KL\(Ber\(12−δM\),Ber\(12\+δM\)\)\.\\mathrm\{KL\}\(P\_\{\-\}^\{t\-1\},P\_\{\+\}^\{t\-1\}\)=\(t\-1\)\\,\\mathrm\{KL\}\\\!\\left\(\\mathrm\{Ber\}\\\!\\left\(\\frac\{1\}\{2\}\-\\delta\_\{M\}\\right\),\\mathrm\{Ber\}\\\!\\left\(\\frac\{1\}\{2\}\+\\delta\_\{M\}\\right\)\\right\)\.Moreover,
KL\(Ber\(12−δ\),Ber\(12\+δ\)\)=2δlog\(1\+2δ1−2δ\)≤16δ2\(δ≤1/4\),\\mathrm\{KL\}\\\!\\left\(\\mathrm\{Ber\}\\\!\\left\(\\frac\{1\}\{2\}\-\\delta\\right\),\\mathrm\{Ber\}\\\!\\left\(\\frac\{1\}\{2\}\+\\delta\\right\)\\right\)=2\\delta\\log\\\!\\left\(\\frac\{1\+2\\delta\}\{1\-2\\delta\}\\right\)\\leq 16\\delta^\{2\}\\qquad\(\\delta\\leq 1/4\),where the inequality useslog\(1\+u1−u\)≤4u\\log\\\!\\left\(\\frac\{1\+u\}\{1\-u\}\\right\)\\leq 4uforu∈\[0,1/2\]u\\in\[0,1/2\]\. Hence
KL\(P−t−1,P\+t−1\)≤16\(t−1\)δM2\.\\mathrm\{KL\}\(P\_\{\-\}^\{t\-1\},P\_\{\+\}^\{t\-1\}\)\\leq 16\(t\-1\)\\delta\_\{M\}^\{2\}\.Thus, whenevert≤1/\(32δM2\)t\\leq 1/\(32\\delta\_\{M\}^\{2\}\), we haveKL\(P−t−1,P\+t−1\)≤1/2\\mathrm\{KL\}\(P\_\{\-\}^\{t\-1\},P\_\{\+\}^\{t\-1\}\)\\leq 1/2, and so the prior\-averaged probability of predicting the wrong Bayes\-optimal bit is at least14e−1/2=Ω\(1\)\\frac\{1\}\{4\}e^\{\-1/2\}=\\Omega\(1\)\. Since each such mistake contributes regret2δM2\\delta\_\{M\}, the prior\-averaged regret at roundttisΩ\(δM\)\\Omega\(\\delta\_\{M\}\)\.
SettingT:=⌊132δM2⌋T:=\\left\\lfloor\\frac\{1\}\{32\\delta\_\{M\}^\{2\}\}\\right\\rfloor, we haveT=Θ\(4M\)T=\\Theta\(4^\{M\}\)\. Summing over roundst=1,…,Tt=1,\\ldots,T, the prior\-averaged regret up to timeTTisΩ\(TδM\)=Ω\(1/δM\)=Ω\(2M\)\\Omega\(T\\delta\_\{M\}\)=\\Omega\(1/\\delta\_\{M\}\)=\\Omega\(2^\{M\}\)\. Therefore at least one of the two targetsg−,g\+g\_\{\-\},g\_\{\+\}yields regretΩ\(2M\)\\Omega\(2^\{M\}\)at horizonT=Θ\(4M\)T=\\Theta\(4^\{M\}\)\. ∎
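For completeness, a numerical check (not part of the formal development) of the Bernoulli KL identity and of the $16\delta^{2}$ bound used in the proof above:

```python
import math

def kl_bernoulli(p, q):
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

for delta in (0.25, 0.1, 2 ** -4, 2 ** -8, 2 ** -12):
    kl = kl_bernoulli(0.5 - delta, 0.5 + delta)
    # identity: KL = 2*delta*log((1+2*delta)/(1-2*delta)); bound: KL <= 16*delta^2
    assert abs(kl - 2 * delta * math.log((1 + 2 * delta) / (1 - 2 * delta))) < 1e-12
    assert kl <= 16 * delta ** 2
```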
## Acknowledgments
We thank Steve Hanneke and Shay Moran for their significant contribution to this work, and concretely for suggesting the class and the lower bound proof strategy in Theorem[2\.4](https://arxiv.org/html/2605.06819#S2.Thmtheorem4)\.
Idan Mehalel is supported by the European Research Council \(ERC\) under the European Union’s Horizon 2022 research and innovation program \(grant agreement No\. 101041711\), the Israel Science Foundation \(grant number 2258/19\), and the Simons Foundation \(as part of the Collaboration on the Mathematical and Scientific Foundations of Deep Learning\)\. Ilan Doron\-Arad is supported by grant NSF DMS\-2031883 and Vannevar Bush Faculty Fellowship ONR\-N00014\-20\-1\-2826 \(PI Mossel\)\. Elchanan Mossel is partially supported by NSF DMS\-2031883, Vannevar Bush Faculty Fellowship ONR\-N00014\-20\-1\-2826, MURI N000142412742, and a Simons Investigator Award\.
## References
- \[1\]N\. Alon, O\. Ben\-Eliezer, Y\. Dagan, S\. Moran, M\. Naor, and E\. Yogev\(2021\)Adversarial laws of large numbers and optimal regret in online classification\.InProceedings of the 53rd annual ACM SIGACT symposium on theory of computing,pp\. 447–455\.Cited by:[§2\.3\.1](https://arxiv.org/html/2605.06819#S2.SS3.SSS1.p1.1),[Theorem 9\.1](https://arxiv.org/html/2605.06819#S9.Thmtheorem1),[footnote 4](https://arxiv.org/html/2605.06819#footnote4)\.
- \[2\]N\. Alon, M\. Bun, R\. Livni, M\. Malliaris, and S\. Moran\(2022\)Private and online learnability are equivalent\.ACM Journal of the ACM \(JACM\)69\(4\),pp\. 1–34\.Cited by:[§1](https://arxiv.org/html/2605.06819#S1.p6.1)\.
- \[3\]A\. Altabaa, O\. Montasser, and J\. Lafferty\(2025\)CoT information: improved sample complexity under chain\-of\-thought supervision\.arXiv preprint arXiv:2505\.15927\.Cited by:[§5](https://arxiv.org/html/2605.06819#S5.SS0.SSS0.Px2.p1.1)\.
- \[4\]B\. Applebaum, B\. Barak, and A\. Wigderson\(2010\)Public\-key cryptography from different assumptions\.InProceedings of the forty\-second ACM symposium on Theory of computing,pp\. 171–180\.Cited by:[§5](https://arxiv.org/html/2605.06819#S5.SS0.SSS0.Px3.p1.1)\.
- \[5\]B\. Applebaum, D\. Bui, G\. Couteau, and N\. Melissaris\(2025\)Structured\-seed local pseudorandom generators and their applications\.InApproximation, Randomization, and Combinatorial Optimization\. Algorithms and Techniques \(APPROX/RANDOM 2025\),Vol\.353,pp\. 63–1\.Cited by:[§5](https://arxiv.org/html/2605.06819#S5.SS0.SSS0.Px3.p1.1)\.
- \[6\]S\. Ben\-David, D\. Pál, and S\. Shalev\-Shwartz\(2009\)Agnostic online learning\.InCOLT,Cited by:[Theorem 9\.1](https://arxiv.org/html/2605.06819#S9.Thmtheorem1)\.
- \[7\]S\. Bhaskar\(2021\)Thicket density\.The Journal of Symbolic Logic86\(1\),pp\. 110–127\.Cited by:[§3\.2](https://arxiv.org/html/2605.06819#S3.SS2.p1.31),[§6\.2](https://arxiv.org/html/2605.06819#S6.SS2.p4.1),[Theorem 6\.4](https://arxiv.org/html/2605.06819#S6.Thmtheorem4)\.
- \[8\]A\. Blumer, A\. Ehrenfeucht, D\. Haussler, and M\. K\. Warmuth\(1989\)Learnability and the vapnik\-chervonenkis dimension\.Journal of the ACM \(JACM\)36\(4\),pp\. 929–965\.Cited by:[§6\.2](https://arxiv.org/html/2605.06819#S6.SS2.p8.2)\.
- \[9\]T\. B\. Brown, B\. Mann, N\. Ryder, M\. Subbiah, J\. Kaplan, P\. Dhariwal, A\. Neelakantan, P\. Shyam, G\. Sastry, A\. Askell, S\. Agarwal, A\. Herbert\-Voss, G\. Krueger, T\. Henighan, R\. Child, A\. Ramesh, D\. M\. Ziegler, J\. Wu, C\. Winter, C\. Hesse, M\. Chen, E\. Sigler, M\. Litwin, S\. Gray, B\. Chess, J\. Clark, C\. Berner, S\. McCandlish, A\. Radford, I\. Sutskever, and D\. Amodei\(2020\)Language models are few\-shot learners\.InAdvances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6\-12, 2020, virtual,H\. Larochelle, M\. Ranzato, R\. Hadsell, M\. Balcan, and H\. Lin \(Eds\.\),Cited by:[§5](https://arxiv.org/html/2605.06819#S5.SS0.SSS0.Px1.p1.2)\.
- \[10\]Z\. Chase and I\. Mehalel\(2024\)Deterministic apple tasting\.arXiv preprint arXiv:2410\.10404\.Cited by:[§3\.4](https://arxiv.org/html/2605.06819#S3.SS4.p3.5),[§8\.2\.2](https://arxiv.org/html/2605.06819#S8.SS2.SSS2.p1.4),[§8\.2\.6](https://arxiv.org/html/2605.06819#S8.SS2.SSS6.p1.2)\.
- \[11\]L\. Chen, B\. Peng, and H\. Wu\(2025\)Theoretical limitations of multi\-layer transformer\.In2025 IEEE 66th Annual Symposium on Foundations of Computer Science \(FOCS\),pp\. 2631–2653\.Cited by:[§5](https://arxiv.org/html/2605.06819#S5.SS0.SSS0.Px2.p1.1)\.
- \[12\]A\. Daniely, S\. Sabato, S\. Ben\-David, and S\. Shalev\-Shwartz\(2015\)Multiclass learnability and the ERM principle\.J\. Mach\. Learn\. Res\.16\(1\),pp\. 2377–2404\.Cited by:[§2\.1](https://arxiv.org/html/2605.06819#S2.SS1.p1.22),[§3\.1](https://arxiv.org/html/2605.06819#S3.SS1.p1.29),[§6\.2](https://arxiv.org/html/2605.06819#S6.SS2.p5.8),[Theorem 6\.5](https://arxiv.org/html/2605.06819#S6.Thmtheorem5)\.
- \[13\]A\. Daniely and S\. Shalev\-Shwartz\(2016\)Complexity theoretic limitations on learning dnf’s\.InConference on Learning Theory,pp\. 815–830\.Cited by:[§5](https://arxiv.org/html/2605.06819#S5.SS0.SSS0.Px3.p1.1)\.
- \[14\]A\. Daniely and G\. Vardi\(2021\)From local pseudorandom generators to hardness of learning\.InConference on Learning Theory,pp\. 1358–1394\.Cited by:[§5](https://arxiv.org/html/2605.06819#S5.SS0.SSS0.Px3.p1.1)\.
- \[15\]DeepSeek\-AI, D\. Guo, D\. Yang, H\. Zhang, J\. Song, R\. Zhang, R\. Xu, Q\. Zhu, S\. Ma, P\. Wang, X\. Bi, X\. Zhang, X\. Yu, Y\. Wu, Z\. F\. Wu, Z\. Gou, Z\. Shao, Z\. Li, Z\. Gao, A\. Liu, B\. Xue, B\. Wang, B\. Wu, B\. Feng, C\. Lu, C\. Zhao, C\. Deng, C\. Zhang, C\. Ruan, D\. Dai, D\. Chen, D\. Ji, E\. Li, F\. Lin, F\. Dai, F\. Luo, G\. Hao, G\. Chen, G\. Li, H\. Zhang, H\. Bao, H\. Xu, H\. Wang, H\. Ding, H\. Xin, H\. Gao, H\. Qu, H\. Li, J\. Guo, J\. Li, J\. Wang, J\. Chen, J\. Yuan, J\. Qiu, J\. Li, J\. L\. Cai, J\. Ni, J\. Liang, J\. Chen, K\. Dong, K\. Hu, K\. Gao, K\. Guan, K\. Huang, K\. Yu, L\. Wang, L\. Zhang, L\. Zhao, L\. Wang, L\. Zhang, L\. Xu, L\. Xia, M\. Zhang, M\. Zhang, M\. Tang, M\. Li, M\. Wang, M\. Li, N\. Tian, P\. Huang, P\. Zhang, Q\. Wang, Q\. Chen, Q\. Du, R\. Ge, R\. Zhang, R\. Pan, R\. Wang, R\. J\. Chen, R\. L\. Jin, R\. Chen, S\. Lu, S\. Zhou, S\. Chen, S\. Ye, S\. Wang, S\. Yu, S\. Zhou, S\. Pan, and S\. S\. Li\(2025\)DeepSeek\-r1: incentivizing reasoning capability in llms via reinforcement learning\.CoRRabs/2501\.12948\.External Links:[Link](https://doi.org/10.48550/arXiv.2501.12948),[Document](https://dx.doi.org/10.48550/ARXIV.2501.12948),2501\.12948Cited by:[§5](https://arxiv.org/html/2605.06819#S5.SS0.SSS0.Px1.p1.2)\.
- [16] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, A. Goyal, A. Hartshorn, A. Yang, A. Mitra, A. Sravankumar, A. Korenev, A. Hinsvark, A. Rao, A. Zhang, A. Rodriguez, A. Gregerson, A. Spataru, B. Rozière, B. Biron, B. Tang, B. Chern, C. Caucheteux, C. Nayak, C. Bi, C. Marra, C. McConnell, C. Keller, C. Touret, C. Wu, C. Wong, C. C. Ferrer, C. Nikolaidis, D. Allonsius, D. Song, D. Pintz, D. Livshits, D. Esiobu, D. Choudhary, D. Mahajan, D. Garcia-Olano, D. Perino, D. Hupkes, E. Lakomkin, E. AlBadawy, E. Lobanova, E. Dinan, E. M. Smith, F. Radenovic, F. Zhang, G. Synnaeve, G. Lee, G. L. Anderson, G. Nail, G. Mialon, G. Pang, G. Cucurell, H. Nguyen, H. Korevaar, H. Xu, H. Touvron, I. Zarov, I. A. Ibarra, I. M. Kloumann, I. Misra, I. Evtimov, J. Copet, J. Lee, J. Geffert, J. Vranes, J. Park, J. Mahadeokar, J. Shah, J. van der Linde, J. Billock, J. Hong, J. Lee, J. Fu, J. Chi, J. Huang, J. Liu, J. Wang, J. Yu, J. Bitton, J. Spisak, J. Park, J. Rocca, J. Johnstun, J. Saxe, J. Jia, K. V. Alwala, K. Upasani, K. Plawiak, K. Li, K. Heafield, K. Stone, et al. (2024). The Llama 3 herd of models. CoRR abs/2407.21783.
- [17] A. Ehrenfeucht, D. Haussler, M. Kearns, and L. Valiant (1989). A general lower bound on the number of examples needed for learning. Information and Computation 82(3), pp. 247–261.
- [18] V. Feldman and D. Xiao (2014). Sample complexity bounds on differentially private learning via communication complexity. In Conference on Learning Theory, pp. 1000–1019.
- [19] Y. Filmus, S. Hanneke, I. Mehalel, and S. Moran (2023). Optimal prediction using expert advice and randomized Littlestone dimension. In COLT, Proceedings of Machine Learning Research, Vol. 195, pp. 773–836.
- [20] S. Hanneke, I. Mehalel, and S. Moran (2026). Sample complexity of autoregressive reasoning: chain-of-thought vs. end-to-end. arXiv preprint arXiv:2604.12013.
- [21] S. Hanneke and S. Moran (2026). Private communication with the authors. Unpublished.
- [22] S. Hanneke (2016). The optimal sample complexity of PAC learning. Journal of Machine Learning Research 17(38), pp. 1–15.
- [23] Y. Huang, Z. Wen, A. Singh, Y. Chi, and Y. Chen (2025). Transformers provably learn chain-of-thought reasoning with length generalization. arXiv preprint arXiv:2511.07378.
- [24] A. Jaech, A. Kalai, A. Lerer, A. Richardson, A. El-Kishky, A. Low, A. Helyar, A. Madry, A. Beutel, A. Carney, A. Iftimie, A. Karpenko, A. T. Passos, A. Neitz, A. Prokofiev, A. Wei, A. Tam, A. Bennett, A. Kumar, A. Saraiva, A. Vallone, A. Duberstein, A. Kondrich, A. Mishchenko, A. Applebaum, A. Jiang, A. Nair, B. Zoph, B. Ghorbani, B. Rossen, B. Sokolowsky, B. Barak, B. McGrew, B. Minaiev, B. Hao, B. Baker, B. Houghton, B. McKinzie, B. Eastman, C. Lugaresi, C. Bassin, C. Hudson, C. M. Li, C. de Bourcy, C. Voss, C. Shen, C. Zhang, C. Koch, C. Orsinger, C. Hesse, C. Fischer, C. Chan, D. Roberts, D. Kappler, D. Levy, D. Selsam, D. Dohan, D. Farhi, D. Mely, D. Robinson, D. Tsipras, D. Li, D. Oprica, E. Freeman, E. Zhang, E. Wong, E. Proehl, E. Cheung, E. Mitchell, E. Wallace, E. Ritter, E. Mays, F. Wang, F. P. Such, F. Raso, F. Leoni, F. Tsimpourlas, F. Song, F. von Lohmann, F. Sulit, G. Salmon, G. Parascandolo, G. Chabot, G. Zhao, G. Brockman, G. Leclerc, H. Salman, H. Bao, H. Sheng, H. Andrin, H. Bagherinezhad, H. Ren, H. Lightman, H. W. Chung, I. Kivlichan, I. O’Connell, I. Osband, I. C. Gilaberte, and I. Akkaya (2024). OpenAI o1 system card. CoRR abs/2412.16720.
- [25] N. Joshi, R. Magen, N. Srebro, N. Tsilivis, and G. Vardi (2026). Learning to think from multiple thinkers. arXiv preprint arXiv:2604.24737.
- [26] N. Joshi, G. Vardi, A. Block, S. Goel, Z. Li, T. Misiakiewicz, and N. Srebro (2025). A theory of learning with autoregressive chain of thought. arXiv preprint arXiv:2503.07932.
- [27] J. Kim and T. Suzuki (2024). Transformers provably solve parity efficiently with chain of thought. arXiv preprint arXiv:2410.08633.
- [28] A. R. Klivans and A. A. Sherstov (2009). Cryptographic hardness for learning intersections of halfspaces. Journal of Computer and System Sciences 75(1), pp. 2–12.
- [29] T. Lattimore and C. Szepesvári (2020). Bandit Algorithms. Cambridge University Press.
- [30] Z. Li, H. Liu, D. Zhou, and T. Ma (2024). Chain of thought empowers transformers to solve inherently serial problems. In The Twelfth International Conference on Learning Representations.
- [31] N. Littlestone (1988). Learning quickly when irrelevant attributes abound: a new linear-threshold algorithm. Machine Learning 2(4), pp. 285–318.
- [32] N. Littlestone (1989). From on-line to batch learning. In Proceedings of the Second Annual Workshop on Computational Learning Theory (COLT 1989), San Francisco, CA, USA, pp. 269–284.
- [33] W. Maass and G. Turán (1994). How fast can a threshold gate learn? In Computational Learning Theory and Natural Learning Systems, Vol. I, pp. 381–414.
- [34] E. Malach (2023). Auto-regressive next-token predictors are universal learners. arXiv preprint arXiv:2309.06979.
- [35] W. Merrill and A. Sabharwal (2023). The expressive power of transformers with chain of thought. arXiv preprint arXiv:2310.07923.
- [36] M. I. Nye, A. J. Andreassen, G. Gur-Ari, H. Michalewski, J. Austin, D. Bieber, D. Dohan, A. Lewkowycz, M. Bosma, D. Luan, C. Sutton, and A. Odena (2021). Show your work: scratchpads for intermediate computation with language models. CoRR abs/2112.00114.
- [37] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019). Language models are unsupervised multitask learners. OpenAI.
- [38] R. A. Servedio (2025). The probably approximately correct learning model in computational learning theory. arXiv preprint arXiv:2511.08791.
- [39] S. Tiegel (2024). Improved hardness results for learning intersections of halfspaces. In The Thirty Seventh Annual Conference on Learning Theory, pp. 4764–4786.
- [40] N. Tsilivis, E. Malach, K. Ullrich, and J. Kempe (2025). How reinforcement learning after next-token prediction facilitates learning. arXiv preprint arXiv:2510.11495.
- [41] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017). Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4–9, 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett (Eds.), pp. 5998–6008.
- [42] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le (2022). Finetuned language models are zero-shot learners. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25–29, 2022.
- [43] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35, pp. 24824–24837.