Universal Multiclass Transductive Online Learning

arXiv cs.LG 06/01/26, 04:00 AM Papers
Summary
This paper introduces the Level-Constrained-Littlestone-Littlestone (LCLL) tree to characterize learnability in universal transductive online classification with possibly unbounded label spaces, proving that optimal mistake rates are either bounded or logarithmic.
arXiv:2605.30479v1 Announce Type: new Abstract: We consider the problem of universal transductive online classification with a possibly unbounded label space. This setting considers online learning, with the sequence of instances (without labels) known to the learner in advance. We say a concept class $\mathcal{H}$ is learnable if there is a learning algorithm $\mathcal{A}$, such that for every realizable sequence, the number of mistakes made by $\mathcal{A}$ grows at most sublinearly with the number of predictions. We characterize the learnability of this setting and show that there are only two possible optimal rates for the learnable classes: either bounded or increasing logarithmically. We introduce a new combinatorial structure, called ``Level-Constrained-Littlestone-Littlestone (LCLL) tree'', which, along with the indifference property, characterizes the learnability. We also extend the learnability result to the agnostic case and the case where only the stochastic process that generates the instance sequence is known.
# Universal Multiclass Transductive Online Learning
Source: [https://arxiv.org/html/2605.30479](https://arxiv.org/html/2605.30479)
###### Abstract

We consider the problem of universal transductive online classification with a possibly unbounded label space. This setting considers online learning, with the sequence of instances (without labels) known to the learner in advance. We say a concept class $\mathcal{H}$ is learnable if there is a learning algorithm $\mathcal{A}$ such that, for every realizable sequence, the number of mistakes made by $\mathcal{A}$ grows at most sublinearly with the number of predictions. We characterize the learnability of this setting and show that there are only two possible optimal rates for the learnable classes: either bounded or increasing logarithmically. We introduce a new combinatorial structure, called the “Level-Constrained-Littlestone-Littlestone (LCLL) tree”, which, along with the *indifference* property, characterizes the learnability. We also extend the learnability result to the agnostic case and the case where only the stochastic process that generates the instance sequence is known.

Machine Learning, ICML

## 1 Introduction

Online learning (Littlestone, [1988](https://arxiv.org/html/2605.30479#bib.bib2)) is a sequential game. At each round, the adversary chooses an instance from the instance space $\mathcal{X}$. After the learner makes its prediction, the adversary chooses the true label from the label space $\mathcal{Y}$. The goal is to minimize the number of mistakes made by the learner. In this setting, the learner faces two types of uncertainty: *labeling-related* uncertainty and *instance-related* uncertainty. A natural question is what role each type of uncertainty, especially the labeling-related uncertainty, plays in the complete picture. To address it, Ben-David et al. ([1997](https://arxiv.org/html/2605.30479#bib.bib1)) introduced a learning model that removes the instance-related uncertainty, called offline learning. In that model, the adversary chooses the instance sequence but reveals it to the learner in advance. Recently, Hanneke et al. ([2023b](https://arxiv.org/html/2605.30479#bib.bib4)) renamed this setting transductive online learning, due to its similarity to transductive PAC learning (Vapnik, [2006](https://arxiv.org/html/2605.30479#bib.bib24)). Both focus on investigating the benefits of knowing the unlabeled data beforehand.

Another motivation for investigating universal transductive online learning comes from the work of Hanneke and Wang ([2024](https://arxiv.org/html/2605.30479#bib.bib17)). They investigated the problem of optimistically universal online learning with general concept classes for binary classification. In that work, they introduced the assumption of concept classes into the model of learning under minimal assumptions on the stochastic process that generates the instance sequence. The assumption they made on the stochastic process is that online learning is possible if the learner knows that stochastic process, which is expressed as “the process admits universal online learning”. Therefore, the problem of when all processes admit universal online learning is equivalent to the universal online learnability problem where only the stochastic process generating the instance sequence is known. This is closely related to the universal transductive online learning problem, which is the universal online learnability problem where the instance sequence is known.

Transductive Online Learning. Like online learning, transductive online learning is a sequential game. There are two players, the adversary and the learner. Before the game starts, the adversary picks a sequence of instances $\mathbb{X} = \{X_t\}_{t \in \mathbb{N}}$ such that $X_t \in \mathcal{X}$ for every $t$, where $\mathcal{X}$ is a non-empty set called the instance space. The instance sequence $\mathbb{X}$ is revealed to the learner in advance. Then at each round $t$, the learner makes a prediction $\hat{Y}_t$ based on the instance sequence $\mathbb{X}$ and the history of true labels $Y_{<t} = \{Y_i\}_{i<t}$. The adversary then reveals the true label $Y_t \in \mathcal{Y}$, which can be used to inform future predictions. Here $\mathcal{Y}$ is a countable set called the label space. The learner suffers a loss defined by the indicator function $\mathbb{I}[\hat{Y}_t \neq Y_t]$. In the realizable setting, we measure the performance of a learning algorithm by its number of mistakes, $\sum_{t=1}^{T} \mathbb{I}[\hat{Y}_t \neq Y_t]$.
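The protocol above can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm of the paper: the names `run_transductive_game` and `make_plurality_learner`, the finite threshold class, and the particular instance sequence are all assumptions made for the example. The learner receives the whole instance sequence up front, although this simple plurality-vote baseline only looks at the current instance. For a finite class and a realizable label sequence, each mistake removes at least half of the consistent concepts, so the baseline makes at most $\log_2 |\mathcal{H}|$ mistakes.

```python
from collections import Counter


def run_transductive_game(instances, true_labels, learner):
    """Play len(instances) rounds and return the number of mistakes.

    The learner is called with the full instance sequence (known in advance)
    and the tuple of labels revealed so far.
    """
    history = []
    mistakes = 0
    for t in range(len(instances)):
        prediction = learner(instances, tuple(history))
        if prediction != true_labels[t]:
            mistakes += 1
        history.append(true_labels[t])  # the adversary reveals Y_t
    return mistakes


def make_plurality_learner(concepts):
    """Hypothetical baseline: plurality vote over the consistent concepts.

    Assumes the label sequence is realizable by `concepts`, so the set of
    consistent concepts is never empty.
    """
    def learner(instances, history):
        t = len(history)
        version_space = [
            h for h in concepts
            if all(h(instances[i]) == history[i] for i in range(t))
        ]
        votes = Counter(h(instances[t]) for h in version_space)
        return votes.most_common(1)[0][0]
    return learner


# Toy class (an assumption for illustration): thresholds h_k(x) = 1[x >= k].
thresholds = [lambda x, k=k: int(x >= k) for k in range(8)]
instances = [3, 6, 1, 5, 4, 7, 0, 2]
target = thresholds[5]
true_labels = [target(x) for x in instances]
mistakes = run_transductive_game(
    instances, true_labels, make_plurality_learner(thresholds)
)
```

With 8 concepts in the toy class, the halving argument guarantees at most 3 mistakes on this sequence.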

This model was first introduced by Ben-David, Kushilevitz, and Mansour ([1997](https://arxiv.org/html/2605.30479#bib.bib1)). Their main goal was to understand how the label-related uncertainty affects the online learning procedure. Recently, Hanneke et al. ([2023b](https://arxiv.org/html/2605.30479#bib.bib4)) proved a trichotomy showing that transductive online binary classification has only three possible rates: $O(1)$, $\Theta(\log T)$, and $\Omega(T)$. They proved that if the Littlestone dimension of $\mathcal{H}$ is finite, there is an algorithm that makes only finitely many mistakes when learning the target concept. If $\mathcal{H}$ has finite VC dimension but infinite Littlestone dimension, the best rate is $\log T$. And if $\mathcal{H}$ has infinite VC dimension, there is an adversary that can force any algorithm to make a mistake in every round. More recently, Hanneke et al. ([2024b](https://arxiv.org/html/2605.30479#bib.bib5)) extended the result from binary classification to multiclass classification with countably infinite label spaces. In that work, they also proved a trichotomy: the only possible rates are $O(1)$, $\Theta(\log T)$, and $\Omega(T)$, and the characterizations are given by the level-constrained branching dimension and the level-constrained Littlestone dimension. There is a substantial amount of recent literature on this topic, such as Hanneke and Shaeiri ([2025](https://arxiv.org/html/2605.30479#bib.bib31)); Chase et al. ([2025](https://arxiv.org/html/2605.30479#bib.bib32)). However, all of the previous works focus on the uniform rates of the transductive online learning model. In other words, the bound is independent of the sequence.

Universal Learning. Bousquet et al. ([2021](https://arxiv.org/html/2605.30479#bib.bib3)) considered distribution-dependent learning rates in PAC learning and sequence-dependent learning rates in online learning, which are called “universal learning rates”. We focus on the universal rates of online learning here. Unlike uniform rates, which are sequence-independent, universal rates are described as follows: for every realizable sequence, there is a function that bounds the number of mistakes. This is a more realistic measure of learning rate, as during one learning procedure we only need to face one specific data sequence. There are many works investigating the performance of online learning algorithms under the universal setting, such as (Blanchard et al., [2022](https://arxiv.org/html/2605.30479#bib.bib15); Blanchard, [2022](https://arxiv.org/html/2605.30479#bib.bib14); Hanneke et al., [2023c](https://arxiv.org/html/2605.30479#bib.bib7), [2025](https://arxiv.org/html/2605.30479#bib.bib16)). There is also a line of work investigating universal PAC learning, such as Hanneke et al. ([2022](https://arxiv.org/html/2605.30479#bib.bib6), [2023c](https://arxiv.org/html/2605.30479#bib.bib7)); Hanneke and Xu ([2024](https://arxiv.org/html/2605.30479#bib.bib8)); Hanneke et al. ([2024a](https://arxiv.org/html/2605.30479#bib.bib9)).

Multiclass Learning. Daniely et al. ([2015](https://arxiv.org/html/2605.30479#bib.bib25)) extended online learning from binary classification to multiclass classification with unbounded label spaces and showed that the Littlestone dimension characterizes online learnability in the realizable case. Hanneke et al. ([2023a](https://arxiv.org/html/2605.30479#bib.bib23)) showed that the Littlestone dimension also characterizes online learnability in the agnostic setting. We focus on the case in which the label space is unbounded, because many interesting phenomena arise when the label space is extended from finite to unbounded. In PAC learning, there is also a line of work extending binary classification to multiclass classification (Natarajan and Tadepalli, [1988](https://arxiv.org/html/2605.30479#bib.bib19); Natarajan, [1989](https://arxiv.org/html/2605.30479#bib.bib20); Brukhim et al., [2022](https://arxiv.org/html/2605.30479#bib.bib21)).

In this paper, we investigate the universal learning rates of transductive online multiclass learning with a countably infinite label space. (Footnote 1: The results for the realizable setting can be extended to uncountable label spaces, but not in the agnostic setting, as we require the number of experts to be countable.) For the realizable setting, we provide a trichotomy for universal multiclass transductive online learning. To describe this trichotomy, we use two combinatorial structures: the indifferent Littlestone tree and the indifferent Level-Constrained-Littlestone-Littlestone (LCLL) tree, which will be defined later. Informally, the trichotomy can be stated as follows.

- If $\mathcal{H}$ has no infinite indifferent Littlestone tree, it can be learned with a constant number of mistakes.
- If $\mathcal{H}$ has an infinite indifferent Littlestone tree but no infinite indifferent LCLL tree, it can be learned with $\Theta(\log T)$ mistakes.
- If $\mathcal{H}$ has an infinite indifferent LCLL tree, it cannot be learned with $o(T)$ mistakes.

We also extend the learnability results to the agnostic case and provide an $\tilde{O}(\sqrt{T})$ upper bound for concept classes with no infinite indifferent LCLL tree, where $\tilde{O}(\sqrt{T})$ hides polylogarithmic factors.

For binary classification, the results of Hanneke and Wang ([2024](https://arxiv.org/html/2605.30479#bib.bib17)) imply that the VCL tree characterizes universal transductive online learnability. Therefore, several combinatorial structures that extend the VCL tree to the multiclass setting might be expected to characterize universal multiclass transductive online learnability, such as the DSL tree (Hanneke et al., [2023c](https://arxiv.org/html/2605.30479#bib.bib7)), the LCL tree (Hanneke et al., [2024b](https://arxiv.org/html/2605.30479#bib.bib5)), or the LCL-Littlestone tree. Surprisingly, it turns out that none of these choices is the right answer.

Recall the proof of universal transductive online learnability for binary classification. It uses the VCL game, a Gale-Stewart game defined in Bousquet et al. ([2021](https://arxiv.org/html/2605.30479#bib.bib3)), to prove the upper bound. (Footnote 2: A Gale-Stewart game is a perfect-information two-player infinite game. More details are provided in [Appendix A](https://arxiv.org/html/2605.30479#A1).) Then, due to a lemma in Bousquet et al. ([2023](https://arxiv.org/html/2605.30479#bib.bib13)), every concept class that has an infinite Littlestone tree (VCL tree) also has an infinite indifferent Littlestone tree (VCL tree). A concept class is said to have an indifferent Littlestone tree if, for every node in the tree, all its descendants agree on the value of the instances that precede it in Breadth-First-Search order. This property allows the adversary to choose the true labels by performing a random walk starting from the root, even though the learner knows the instance sequence in advance.

However, this nice lemma does not hold when label spaces are unbounded. In fact, there exists a concept class $\mathcal{H}$ that has an infinite Littlestone (LCLL) tree but no infinite indifferent Littlestone (LCLL) tree, as shown in [Example 3.1](https://arxiv.org/html/2605.30479#S3.Thmtheorem1). This example also suggests that the indifferent Littlestone (LCLL) tree should be the correct characterization. However, the structure of the Gale-Stewart game is naturally tied to Littlestone-type trees and is difficult to apply to indifferent trees. At each round of the game, the adversary proposes a sequence of instances, and the learner picks a sequence of labels. Therefore, a winning strategy of the adversary automatically leads to an infinite Littlestone-type tree. This mismatch prompts us to develop a new method for designing a Gale-Stewart game that captures the indifference property of the tree. It turns out that the adversary needs to provide all the possible label sequences for an interval of instances, instead of a sequence of instances; i.e., the adversary needs to provide all possible labels for the instances from $X_{t_k+1}$ to $X_{t_{k+1}}$ such that $X_{t_{k+1}}$ has two labels. (Footnote 3: Here we use the Littlestone tree as an example, as it is easier to describe than the LCLL tree.) As far as we know, this is the first time such a Gale-Stewart game has been designed in the universal learning literature, and this design should be useful for any problem whose characterization is an infinite indifferent Littlestone-type tree. We highlight this design of the Gale-Stewart game as the main conceptual contribution of this work, and we believe it can be used as a tool for more multiclass learning problems in the online learning context.

As we mentioned before, the universal transductive online learnability problem is closely related to the characterization of the case where all processes admit universal online learning\. Therefore, we also extend the learnability result to the case where only the stochastic process generating the instance sequence is known in advance, that is, the condition that all processes admit universal multiclass online learning\.

Organization of the paper. We first provide the notation and the main results in Section [2](https://arxiv.org/html/2605.30479#S2). Then, in Section [3](https://arxiv.org/html/2605.30479#S3), we provide several examples showing that some reasonable guesses are not the correct characterization. After that, we present the high-level ideas of the proofs of the trichotomy and the learnability results in [Section 4](https://arxiv.org/html/2605.30479#S4) and [Section 5](https://arxiv.org/html/2605.30479#S5). Due to lack of space, the detailed proofs and the results for stochastic processes are deferred to the appendices.

## 2 Preliminaries and Main Results

In this section, we provide formal definitions, model settings, and a brief list of our main results\.

Model Setting. We formally describe our learning model here. $\mathcal{X}$ and $\mathcal{Y}$ are both non-empty sets: $\mathcal{X}$ is the instance space and $\mathcal{Y}$ is the label space. We focus on learning under the 0-1 loss, that is, $(y, y') \mapsto \mathbb{I}[y \neq y']$, where $\mathbb{I}[\cdot]$ is the indicator function. The concept class $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$ is a non-empty set of measurable functions from $\mathcal{X}$ to $\mathcal{Y}$. The instance sequence $\mathbb{X} = \{X_t\}_{t \in \mathbb{N}}$ is a sequence of elements $X_t \in \mathcal{X}$, and the label sequence $\mathbb{Y} = \{Y_t\}_{t \in \mathbb{N}}$ is a sequence of elements $Y_t \in \mathcal{Y}$. In this paper, we often write $(X_{\leq t}, Y_{\leq t})$ for $\{(X_i, Y_i)\}_{i \leq t}$.

We also use the definition of a partial concept, which is a partial function from𝒳\\mathcal\{X\}to𝒴\\mathcal\{Y\}\. A partial function allows the label of some instances to be undefined, and no matter what prediction was made on that instance, it is a mistake\.

In the transductive online learning setting, the instance sequence and the label sequence are both chosen by the adversary. However, the instance sequence $\mathbb{X}$ is revealed to the learner in advance, so that the learner can leverage the information about future instances to help design the learning algorithm.

A transductive online learning algorithm, usually denoted by $\mathcal{A}$, is a sequence of measurable functions $f_t : \mathcal{X}^{\infty} \times \mathcal{Y}^{t-1} \rightarrow \mathcal{Y}$, where $t$ is a non-negative integer. For convenience, we usually write $\hat{Y}_t$ for the prediction of the algorithm $\mathcal{A}$ at round $t$.

Following the tradition of universal learning, we define the set of realizable sequences as follows\.

###### Definition 2\.1\.

For every concept class $\mathcal{H}$, we define the following set of processes $\text{R}(\mathcal{H})$:

$$\text{R}(\mathcal{H}) := \left\{ (\mathbb{X}, \mathbb{Y}) = \{(X_i, Y_i)\}_{i \in \mathbb{N}} : \forall n < \infty,\ \{(X_i, Y_i)\}_{i \leq n} \text{ realizable by } \mathcal{H} \right\}.$$

Following the tradition of online learning, we use the number of mistakes to measure the performance of the online learning algorithm in the realizable case, which is defined as follows\.

###### Definition 2\.2\.

The number of mistakes of the learning algorithm $\mathcal{A}$ with respect to the data sequence $(\mathbb{X}, \mathbb{Y})$ is $M(\mathcal{A}, (\mathbb{X}, \mathbb{Y}), T) = \mathbb{E}\left[\sum_{t=1}^{T} \mathbb{I}[Y_t \neq \hat{Y}_t]\right]$.

However, in the agnostic case, we do not assume that the label sequences are realizable; they may be arbitrary. Therefore, the number of mistakes made by the learning algorithm can be arbitrarily large, and it is not an appropriate measure of learnability. Hence, we use *regret* to measure learnability, which is the difference between the performance of our algorithm and that of the best realizable sequence. Formally,

###### Definition 2\.3\.

The regret of the learning algorithm $\mathcal{A}$ with respect to the data sequence $(\mathbb{X}, \mathbb{Y}, \mathbb{Y}^*)$, where $(\mathbb{X}, \mathbb{Y}^*) \in \text{R}(\mathcal{H})$, is $\text{Regret}(\mathcal{A}, (\mathbb{X}, \mathbb{Y}, \mathbb{Y}^*), T) = \mathbb{E}\left[\sum_{t=1}^{T} \left(\mathbb{I}[Y_t \neq \hat{Y}_t] - \mathbb{I}[Y_t \neq Y^*_t]\right)\right]$.

If there is an algorithm $\mathcal{A}$ such that $\text{Regret}(\mathcal{A}, (\mathbb{X}, \mathbb{Y}, \mathbb{Y}^*), T) = o(T)$ for every sequence $(\mathbb{X}, \mathbb{Y}, \mathbb{Y}^*)$ with $(\mathbb{X}, \mathbb{Y}^*) \in \text{R}(\mathcal{H})$, we say $\mathcal{H}$ is universally transductively online learnable in the agnostic case.
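For a deterministic learner the expectations in Definitions 2.2 and 2.3 disappear, and both quantities reduce to counting disagreements over the first $T$ rounds. The following sketch makes this concrete; the function names `count_mistakes` and `regret` and the toy sequences are assumptions for illustration only.

```python
def count_mistakes(predictions, labels, horizon):
    """Number of rounds t < horizon where the prediction differs from Y_t."""
    return sum(1 for t in range(horizon) if predictions[t] != labels[t])


def regret(predictions, labels, reference_labels, horizon):
    """Learner's mistakes minus the mistakes of the realizable reference Y*."""
    return (count_mistakes(predictions, labels, horizon)
            - count_mistakes(reference_labels, labels, horizon))


# Toy sequences (assumed data): the observed labels need not be realizable.
predictions = [0, 1, 1, 0]
labels = [0, 0, 1, 1]
reference = [0, 0, 1, 0]  # plays the role of Y*, a realizable label sequence
```

Here the learner errs in two rounds and the reference sequence in one, so the regret over four rounds is 1.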

It is common and traditional to use combinatorial structures related to the concept class to build the mistake or regret bound in the online learning setting\. We follow this tradition and provide the combinatorial structures we use here\.

We first define the Littlestone tree for multiclass learning\.

###### Definition 2.4 (Littlestone Tree (Hanneke et al., [2023c](https://arxiv.org/html/2605.30479#bib.bib7))).

A Littlestone tree for $\mathcal{H}$ of depth $d \leq \infty$ is a collection

$$\{x_u \in \mathcal{X} : 0 \leq k < d,\ u \in \{0,1\}^k\}$$

and a sequence of functions $\mathbf{y}_i : \{0,1\}^i \rightarrow \mathcal{Y}$ satisfying

1. For every $i < d$, $\mathbf{y}_i(u_{<i}, 0) \neq \mathbf{y}_i(u_{<i}, 1)$, where $u_i \in \{0,1\}$ and $u_{<i} = (u_1, \ldots, u_{i-1})$.
2. There exists a concept $h \in \mathcal{H}$ such that for every $i \leq d$, $h(x_{u_{<i}}) = \mathbf{y}_i(u_{<i}, u_i)$, where $u_i \in \{0,1\}$ and $u_{<i} = (u_1, \ldots, u_{i-1})$.
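For a finite depth and a finite concept class, the two conditions of Definition 2.4 can be checked by brute force. The sketch below is an illustration under assumed conventions: a node is indexed by the tuple of bits on the path to it, `nodes` maps a tuple of length less than `depth` to its instance, `edge_labels` maps a tuple of length between 1 and `depth` to the label of its last edge, and the second condition is read as holding for every root-to-leaf path. The threshold class is a toy example, not one from the paper.

```python
from itertools import product


def is_littlestone_tree(nodes, edge_labels, concepts, depth):
    """Check both conditions of a depth-`depth` Littlestone tree."""
    # Condition 1: the two edges leaving every internal node differ.
    for k in range(depth):
        for u in product((0, 1), repeat=k):
            if edge_labels[u + (0,)] == edge_labels[u + (1,)]:
                return False
    # Condition 2: every root-to-leaf path is realized by some concept.
    for path in product((0, 1), repeat=depth):
        realized = any(
            all(h(nodes[path[:i]]) == edge_labels[path[:i + 1]]
                for i in range(depth))
            for h in concepts
        )
        if not realized:
            return False
    return True


# Toy class (assumed): thresholds h_k(x) = 1[x >= k] for k = 0, ..., 8.
thresholds = [lambda x, k=k: int(x >= k) for k in range(9)]
nodes = {(): 4, (0,): 6, (1,): 2}
edge_labels = {(0,): 0, (1,): 1,
               (0, 0): 0, (0, 1): 1,
               (1, 0): 0, (1, 1): 1}
constants_only = [thresholds[0], thresholds[8]]  # all-ones and all-zeros
```

The full threshold class realizes all four paths of this depth-2 tree, while the two constant concepts alone do not.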

Next we define a combinatorial dimension used in multiclass transductive online learning, which extends the VC dimension to the multiclass setting: the Level-Constrained Littlestone dimension. Intuitively, the underlying tree is a Littlestone tree in which all nodes at the same depth are labeled with the same instance. See Figure [1](https://arxiv.org/html/2605.30479#S2.F1) for an illustration.

###### Definition 2.5 (Level-Constrained Littlestone Tree (LCL tree)).

A Level-Constrained Littlestone tree (LCL tree) for $\mathcal{H}$ is a complete binary tree of depth $d \leq \infty$ whose internal nodes at depth $k$ are labeled by an element $x_k \in \mathcal{X}$, and whose two edges connecting a node to its children are labeled by $y_1, y_2 \in \mathcal{Y}$ ($y_1 \neq y_2$), such that every finite path emanating from the root is consistent with a concept $h \in \mathcal{H}$.

If $\mathcal{H}$ has an LCL tree of depth $d$ but does not have one of depth $d+1$, we say the LCL dimension of $\mathcal{H}$ is $d$.
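On a finite domain with a finite class, the LCL dimension can be computed by exhaustive search, which may help build intuition for Definition 2.5. The sketch below is an assumption-laden illustration (the names `shatters` and `lcl_dimension` are hypothetical, and the search is exponential). It uses the fact that a level sequence must consist of distinct instances, since a repeated instance cannot branch into two labels along a single path.

```python
from itertools import permutations, product


def shatters(concepts, xs):
    """True if `concepts` realizes a full LCL tree with levels xs[0], xs[1], ..."""
    if not concepts:
        return False
    if not xs:
        return True
    # Split the consistent concepts by the label they give to the level instance.
    groups = {}
    for h in concepts:
        groups.setdefault(h(xs[0]), []).append(h)
    # The node needs two distinct labels whose subtrees can both be completed.
    good_children = sum(1 for g in groups.values() if shatters(g, xs[1:]))
    return good_children >= 2


def lcl_dimension(concepts, domain):
    """Largest d such that some sequence of d instances gives an LCL tree."""
    best = 0
    for d in range(1, len(domain) + 1):
        if any(shatters(concepts, xs) for xs in permutations(domain, d)):
            best = d
        else:
            break  # a depth-(d+1) tree would contain a depth-d tree
    return best


# Toy classes (assumed): thresholds on {0,1,2,3}, and all maps {0,1} -> {0,1,2}.
thresholds = [lambda x, k=k: int(x >= k) for k in range(5)]
all_maps = [dict(zip((0, 1), ys)).__getitem__
            for ys in product(range(3), repeat=2)]
```

For binary labels the LCL dimension coincides with the VC dimension, which is 1 for thresholds; the class of all maps from a two-point domain into three labels has LCL dimension 2.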

![Refer to caption](https://arxiv.org/html/2605.30479v1/lcl) Figure 1: A Level-Constrained Littlestone tree of depth 3. Every branch is consistent with a concept $h \in \mathcal{H}$. We take $\mathcal{H} \subseteq \mathbb{N}^{\mathcal{X}}$ for convenience. The only restriction is that the two edges connecting a node to its two children must carry different labels. This is illustrated here for one of the branches.

We then need to define the LCL-Littlestone tree, which is an extension of the VCL tree to multiclass learning. Intuitively, it is a Littlestone tree in which every node at depth $k$ is labeled not by a single instance but by a sequence of $k+1$ instances, which form an LCL tree of depth $k+1$ whose leaves are not labeled. Each node has not just $2$ children but $2^{k+1}$ children, and each edge is labeled not by a single label but by a sequence of labels, which is a path from the root to one leaf of the depth-$(k+1)$ LCL tree. See Figure [2](https://arxiv.org/html/2605.30479#S2.F2) for an illustration.

###### Definition 2.6 (LCL-Littlestone Tree).

An LCL-Littlestone tree for $\mathcal{H}$ of depth $d \leq \infty$ is a collection

$$\{x_u \in \mathcal{X}^{k+1} : 0 \leq k < d,\ u \in \{0,1\}^1 \times \{0,1\}^2 \times \cdots \times \{0,1\}^k\}$$

and a sequence of functions $\mathbf{y}_{n,i} : \{0,1\}^1 \times \{0,1\}^2 \times \cdots \times \{0,1\}^n \times \{0,1\}^i \to \mathcal{Y}$ satisfying

1. $\mathbf{y}_{n,i}((u_{\leq n}, (u_{n+1}^{\leq i-1}, 0))) \neq \mathbf{y}_{n,i}((u_{\leq n}, (u_{n+1}^{\leq i-1}, 1)))$ for every $n < d$ and $i \leq n+1$. Here $u_n \in \{0,1\}^n$, $u_{\leq n} = (u_1, u_2, \ldots, u_n)$, and $u_{n+1}^{\leq i-1} = (u_{n+1}^1, \ldots, u_{n+1}^{i-1})$.
2. There exists a concept $h \in \mathcal{H}$ such that $h(x_{u_{\leq k}}^i) = \mathbf{y}_{k,i}((u_{\leq k}, (u_{k+1}^1, \ldots, u_{k+1}^i)))$ for all $1 \leq i \leq k+1$ and $0 \leq k \leq n$, where we denote $u_{\leq k} = (u_1^1, (u_2^1, u_2^2), \dots, (u_k^1, \dots, u_k^k))$ and $x_{u_{\leq k}} = (x_{u_{\leq k}}^1, \dots, x_{u_{\leq k}}^{k+1})$.

We say that $\mathcal{H}$ has *an infinite LCL-Littlestone tree* if it has an LCLL tree of depth $d = \infty$.

![Refer to caption](https://arxiv.org/html/2605.30479v1/lcll) Figure 2: An LCLL tree of depth 3. Every branch is consistent with a concept $h \in \mathcal{H}$. This is illustrated here for one of the branches. Due to a lack of space, not all external edges are drawn. In this figure, we take $\mathcal{H} \subseteq \mathbb{N}^{\mathcal{X}}$ for convenience. This figure is modified from the work of Bousquet et al. ([2021](https://arxiv.org/html/2605.30479#bib.bib3)).

Then we need to define the *indifferent* property of a tree, which can be applied to the Littlestone tree or the LCLL tree defined above. This definition extends to unbounded label spaces the definition of Bousquet et al. ([2023](https://arxiv.org/html/2605.30479#bib.bib13)), which only covers binary label spaces. Intuitively, this property says that all the descendants of a node agree on the values of the nodes that come before that node in Breadth-First-Search order.

###### Definition 2.7 (Indifferent Tree. Extended from the work of Bousquet et al. ([2023](https://arxiv.org/html/2605.30479#bib.bib13))).

Let $\mathcal{X}$ be a set and $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$ be a hypothesis class, and let

$$T = \left\{x_{\mathbf{u}} \in \mathcal{X} : \mathbf{u} \in (\{0,1\})^*\right\} \text{ and } \{\mathbf{y}_i\}_{i \in \mathbb{N} \cup \{0\}}$$

be an infinite perfect binary tree that is shattered by $\mathcal{H}$. This implies the existence of a collection

$$\mathcal{H}_T = \left\{h_{\mathbf{u}} \in \mathcal{H} : \mathbf{u} \in (\{0,1\})^*\right\}$$

of consistent functions.

We say such a collection is *indifferent* if, for every $\mathbf{v}, \mathbf{u}, \mathbf{w} \in (\{0,1\})^*$ such that $\mathbf{v}$ comes before $\mathbf{u}$ in Breadth-First-Search order and $\mathbf{w}$ is a descendant of $\mathbf{u}$ in the tree $T$, we have $h_{\mathbf{u}}(x_{\mathbf{v}}) = h_{\mathbf{w}}(x_{\mathbf{v}})$. In other words, the functions for all the descendants of a node that appears after $\mathbf{v}$ agree on $x_{\mathbf{v}}$. We say that $T$ is *indifferent* if it has a set $\mathcal{H}_T$ of consistent functions that is indifferent.
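The indifference condition can be tested directly on a finite truncation of a tree. The following sketch assumes a hypothetical encoding: nodes are indexed by tuples of bits, `nodes` maps each internal node to its instance, and `witnesses` maps each node to its consistent function $h_{\mathbf{u}}$. Breadth-First-Search order is taken to mean shallower nodes first and, within a level, left to right. The threshold witnesses and the modified witness are toy data, not taken from the paper.

```python
def bfs_key(u):
    """Sort key for Breadth-First-Search order: by depth, then left to right."""
    return (len(u), u)


def is_indifferent(nodes, witnesses):
    """Check h_u(x_v) == h_w(x_v) for v before u in BFS order, w below u."""
    for u, h_u in witnesses.items():
        for w, h_w in witnesses.items():
            is_descendant = len(w) > len(u) and w[:len(u)] == u
            if not is_descendant:
                continue
            for v, x_v in nodes.items():
                if bfs_key(v) < bfs_key(u) and h_u(x_v) != h_w(x_v):
                    return False
    return True


def threshold(k):
    """Toy concept (assumed): h_k(x) = 1[x >= k]."""
    return lambda x: int(x >= k)


nodes = {(): 4, (0,): 6, (1,): 2}
witnesses = {(): threshold(4), (0,): threshold(5), (1,): threshold(4),
             (0, 0): threshold(7), (0, 1): threshold(5),
             (1, 0): threshold(3), (1, 1): threshold(0)}

# Same path labels at node (1, 1), but a different value on the sibling
# instance x_(0,) = 6, which breaks indifference.
broken = dict(witnesses)
broken[(1, 1)] = lambda x: int(x <= 4)
```

The second collection is still consistent with every path of the tree, yet the witness at node `(1, 1)` disagrees with its ancestor `(1,)` on an instance that precedes `(1,)` in Breadth-First-Search order. This is the kind of off-path disagreement that richer label spaces make possible.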

Then we can provide the trichotomy for the realizable case and the learnability result for the agnostic case\.

###### Theorem 2\.8\.

For the realizable case, against any adversary (represented by $(\mathbb{X}, \mathbb{Y}) \in \text{R}(\mathcal{H})$), we have the following trichotomy:

1. If and only if $\mathcal{H}$ does not have an infinite indifferent Littlestone tree, there exists an algorithm $\mathcal{A}$ such that $M(\mathcal{A}, (\mathbb{X}, \mathbb{Y}), T) = O(1)$.
2. If and only if $\mathcal{H}$ has an infinite indifferent Littlestone tree but does not have an infinite indifferent LCLL tree, there exists an algorithm $\mathcal{A}$ such that $M(\mathcal{A}, (\mathbb{X}, \mathbb{Y}), T) = \Theta(\log T)$.
3. If $\mathcal{H}$ has an infinite indifferent LCLL tree, no learner can guarantee a sublinear number of mistakes.

###### Theorem 2\.9\.

A concept class $\mathcal{H}$ is agnostically universally transductively online learnable if and only if $\mathcal{H}$ has no infinite indifferent LCLL tree.

We also extend the learnability results to the case where only the stochastic process generating the instance sequence is known to the learner in advance\. This result answers the condition when all processes admit universal online learning\. We provide the details and discussion of this setting in the appendices\.

## 3 Examples

In this section, we provide some insightful examples to show that several natural combinatorial structures are not the right choices for characterization\.

###### Example 3.1 ($\mathcal{H}$ with infinite LCLL tree, but learnable).

Let $\mathcal{X}$ be the instance space and $\mathcal{Y} = \mathbb{N} \cup \{0\}$ be the label space. Let $u = (u_1^1, (u_2^1, u_2^2), \ldots, (u_k^1, \ldots, u_k^k)) \in \{0,1\} \times \{0,1\}^2 \times \cdots \times \{0,1\}^k$, $k \in \mathbb{N}$. Consider the following concept class:

$$\mathcal{H} = \left\{h_u : \forall u,\ h_u(X_{u_{\leq k-1}}^i) = u_k^i, \text{ and for all the other } X \in \mathcal{X},\ h_u(X) = |u| + 1\right\},$$

where $u_{\leq k-1} = (u_1^1, (u_2^1, u_2^2), \ldots, (u_{k-1}^1, \ldots, u_{k-1}^{k-1}))$, $X_{u_{\leq k-1}} \in \mathcal{X}^k$, and $|u|$ is the number of different tuples in $u$. For example, $|(u_1^1, (u_2^1, u_2^2), \ldots, (u_k^1, \ldots, u_k^k))| = k$.

First, we claim that $\mathcal{H}$ has an infinite LCLL tree. It is easy to see that for the following collection:

$$T = \{X_u \in \mathcal{X}^{k+1} : u \in \{0,1\} \times \{0,1\}^2 \times \cdots \times \{0,1\}^k\}$$

and the sequence of functions $\mathbf{y}_{k,i}(u) = u_{k+1}^i$, for any $k$, any $u$, any $j \leq k$, and any $i \leq j+1$, we have $h_u(X_{u_{\leq j}}^i) = \mathbf{y}_{j,i}((u_{\leq j}, (u_j^1, \ldots, u_j^i)))$. That means there is an infinite LCLL tree shattered by the concept class $\mathcal{H}$.

However, for any fixed instance sequence $\mathbb{X}$, we can learn it with a finite number of mistakes by using the following method. For convenience, we can assume the target concept is $h_{u}$. If there is an $X_{t}\notin\{X_{u_{\leq k}}^{i},\forall k\leq|u|,i\leq k+1\}$, by witnessing its label, we know $|u|+1$. There will be only a finite number of different functions; thus, we can learn the target function by making finitely many mistakes. Then we know every $X_{t}\in\{X_{u_{\leq k}}^{i},\forall k\leq|u|,i\leq k+1\}$, and we can make the prediction of $X_{t}$ by following $u$. Thus, this example demonstrates that a concept class with an infinite LCLL tree can remain universally transductively online learnable.

This example also shows that the DSL tree is not the correct answer, as an infinite LCLL tree is also an infinite DSL tree\. A similar construction shows that there is a concept class that has an infinite Littlestone tree but does not have an infinite indifferent Littlestone tree\.

###### Example 3.3 ($\mathcal{H}$ with no infinite LCL tree, but not learnable (Bousquet et al., [2021](https://arxiv.org/html/2605.30479#bib.bib3))).

Let $\mathcal{X}$ be the instance space and $\mathcal{Y}=\{0,1\}$ be the label space. Let $u=(u_{1}^{1},(u_{2}^{1},u_{2}^{2}),\ldots,(u_{k}^{1},\ldots,u_{k}^{k}))\in\{0,1\}\times\{0,1\}^{2}\times\cdots\times\{0,1\}^{k}$, $k\in\mathbb{N}$. Consider the following collection:

$$T=\{X_{u}\in\mathcal{X}^{k+1}:u\in\{0,1\}\times\{0,1\}^{2}\times\cdots\times\{0,1\}^{k}\}$$

Then for every $u$, we define

$$\mathcal{H}_{u}=\{h:h(X_{u_{\leq k}}^{i})=u_{k+1}^{i},\ h(X_{u}^{i})=0\text{ or }1,\ h(X)=1\text{ for all the other }X\}$$

and $\mathcal{H}=\bigcup_{u}\mathcal{H}_{u}$. Here $u_{\leq k}=(u_{1},u_{2},\ldots,u_{k})$.

It is easy to notice that for every $d\in\mathbb{N}$ and every $k\leq d$, we have the collection $T$, which is shattered by $\mathcal{H}$. And for every node in the LCLL tree, all the functions for its descendants agree on the nodes that come before it in lexicographic order (labeled with $1$). However, for any given sequence $\mathbb{X}=\{X_{t}\}_{t\in\mathbb{N}}$, as $X_{1}\in X_{u}$ for some $u$, the depth of the LCL tree defined by this sequence is at most $|u|+1$. Here $|u|$ is the number of different tuples in $u$. For example, $|(u_{1}^{1},(u_{2}^{1},u_{2}^{2}),\ldots,(u_{k}^{1},\ldots,u_{k}^{k}))|=k$. Thus, $\mathcal{H}$ does not have an infinite LCL tree.

## 4Realizable Setting

In this section, we focus on the realizable setting and provide the high-level idea of how to prove Theorem [2.8](https://arxiv.org/html/2605.30479#S2.Thmtheorem8).

### 4\.1No Sublinear Mistake Bound

We first prove the lower bound when $\mathcal{H}$ has an infinite indifferent LCLL tree.

###### Lemma 4\.1\.

If $\mathcal{H}$ has an infinite indifferent LCLL tree, the adversary can choose a data sequence that forces any learner to make a number of mistakes that is not sublinear.

For brevity, we only provide a proof sketch here; the complete proof is in [Appendix B](https://arxiv.org/html/2605.30479#A2).

###### Proof Sketch\.

First, we can modify the indifferent infinite LCLL tree so that the $k$-th node in Breadth-First-Search (BFS) order contains $2^{k-1}$ elements. The instance sequence lists all instances in lexical order within each node and in BFS order across nodes. Then we take a random walk on this tree to choose the true label for each instance. The instances in a node visited by the random walk are labeled by the label on the edge adjacent to it in the path. The instances in an off-branch node are labeled by the label decided by its descendants. (We can do this as the tree is indifferent.) Thus, when reaching a node on the path, no matter what the algorithm predicts, it makes a mistake with probability $\frac{1}{2}$. Thus, it makes a quarter mistake in expectation. Then, by Fatou's lemma, for each learning algorithm, we get a realizable process such that the algorithm does not make a sublinear number of mistakes almost surely. ∎
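The node sizes used in the sketch can be checked with a little arithmetic. The following Python sketch is only a counting aid under the stated modification (the $k$-th node in BFS order holds $2^{k-1}$ instances); the function names `node_span` and `share_of_prefix` are hypothetical and do not come from the paper. It shows that the $k$-th node always fills more than half of the sequence prefix that ends at that node, which is one way to see why mistakes made only inside path nodes can still accumulate at a linear rate.

```python
from fractions import Fraction
from typing import Tuple


def node_span(k: int) -> Tuple[int, int]:
    """1-indexed positions of the k-th BFS node's instances in the sequence.

    Assumes node j holds 2**(j-1) instances, so nodes 1..k-1 together
    hold 2**(k-1) - 1 instances.
    """
    first = 2 ** (k - 1)
    last = 2 ** k - 1
    return first, last


def share_of_prefix(k: int) -> Fraction:
    """Fraction of the first `last` instances that lie inside node k."""
    first, last = node_span(k)
    return Fraction(last - first + 1, last)


if __name__ == "__main__":
    for k in range(1, 6):
        print(k, node_span(k), share_of_prefix(k))
```

The share equals $2^{k-1}/(2^{k}-1)$, which stays above $\frac{1}{2}$ for every $k$.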

### 4\.2Logarithmic Mistake Bound

Then we show the $\log T$ upper bound when $\mathcal{H}$ has no infinite indifferent LCLL tree.

###### Lemma 4.2.

If a concept class $\mathcal{H}$ does not have an infinite indifferent LCLL tree, there exists an algorithm $\mathcal{A}$ such that, for every $(\mathbb{X},\mathbb{Y})\in\text{R}(\mathcal{H})$, $M(\mathcal{A},(\mathbb{X},\mathbb{Y}),T)=O(\log T)$.

To prove this lemma, we need to design a transductive online learning algorithm\. First, we provide the following definition\.

###### Definition 4\.3\.

Consider a level-constrained tree whose root and internal nodes are labeled by the elements of a sequence $\mathbb{X}$, where each node may have one child or two children. For every $k\in\mathbb{N}$, we say a subsequence $(X_{t_{1}},\ldots,X_{t_{k}})$ is embedded in the level-constrained tree if:

1. There is a subtree whose root is labeled by $X_{t_{1}}$ and has two children.
2. For every $1\leq i\leq k$, the subsequence $(X_{t_{i}},\ldots,X_{t_{k}})$ is embedded in the subtree whose root is the child of $X_{t_{i-1}}$. (There are $2^{i-1}$ such subtrees in total.)

If the level-constrained tree is shattered by a concept class $\mathcal{H}$, that is, every finite path emanating from the root is consistent with a concept $h\in\mathcal{H}$, then we say the subsequence $(X_{t_{1}},\ldots,X_{t_{k}})$ is shattered by $\mathcal{H}$ as well.

The main idea of the algorithm is as follows. First, by looking at the history, the learner can detect a situation such that the length of the longest subsequences that are shattered by the partial concept class is $d$. Then we can use the algorithm to learn that partial concept class and ensure the algorithm only makes $O(\log T)$ mistakes.

We define the LCLL game as follows. The game has two players, $P_{A}$ and $P_{B}$. At each round $k$, each player can take the following actions:

- $P_{A}$ chooses a sequence of instances of length $k$, $\mathbf{t}_{k}=(t_{k}^{1},\ldots,t_{k}^{k})$ such that $t_{k}^{1}>t_{k-1}^{k-1}$, and $C_{k}=\{(y_{t_{k-1}^{k-1}+1,\varnothing},\ldots,y_{t_{k}^{1},\varnothing}^{u_{1}},y_{t_{k}^{1}+1,u_{1}},\ldots,y_{t_{k}^{2},u_{1}}^{u_{2}},\ldots,y_{t_{k}^{k},u_{<k}}^{u_{k}}):u\in\{0,1\}^{k}\}$, where $y_{k,u_{<i}}^{0}\neq y_{k,u_{<i}}^{i}$ for all $i\leq k$. Here $u_{<i}=(u_{1},\ldots,u_{i-1})$ and $u_{<1}=\varnothing$. Intuitively, at round $k$, $P_{A}$ proposes a level-constrained tree with $(X_{t_{k}^{1}},\ldots,X_{t_{k}^{k}})$ embedded in it.
- $P_{B}$ chooses $g_{U_{k-1}}(\mathbf{t}_{k},C_{k})=(Y_{t_{k-1}^{k-1}},\ldots,Y_{t_{k}^{k}})\in C_{k}$. Intuitively, $P_{B}$ chooses a path from the level-constrained tree proposed by $P_{A}$.
- Update $U_{k}=U_{k-1}\cup\{(\mathbf{t}_{k},C_{k},(Y_{t_{k-1}^{k-1}},\ldots,Y_{t_{k}^{k}}))\}$.
- $P_{B}$ wins the game in round $k$ if $\mathcal{H}_{U_{k}}=\emptyset$, where $\mathcal{H}_{U_{k}}=\{h\in\mathcal{H}:h(X_{t})=Y_{t},\forall t\leq t_{k}^{k}\}$.

Following the tradition of game theory, we have the definition of*strategy*and*winning strategy*\. A*strategy*is a way of playing that can be fully determined by the foregoing plays\. A*winning strategy*is a strategy that necessarily causes the player to win no matter what action one’s opponent takes\.

The LCLL game is fully determined, as the membership of the winning set of $P_{B}$ is witnessed by a finite subsequence. Then we need the following lemma:

###### Lemma 4.4.

If $\mathcal{H}$ has no infinite indifferent LCLL tree, $P_{B}$ has a winning strategy.

To prove this lemma, we need to show that we can build an infinite indifferent LCLL tree based on the winning strategy of $P_{A}$. Then, by the Borel Determinacy Theorem (a classical result in game theory, which we introduce in [Appendix A](https://arxiv.org/html/2605.30479#A1)) and the contrapositive, we know the lemma above is true. For brevity, we put the complete proof in [Appendix B](https://arxiv.org/html/2605.30479#A2).

Notice that the winning strategy of $P_{B}$ is fully decided by $U$; thus, we can use $g_{U}$ to describe the winning strategy. Then we can write down Algorithm [1](https://arxiv.org/html/2605.30479#alg1) and prove that, for every sequence and every concept class with no infinite indifferent LCLL tree, it learns the target function with $O(\log T)$ mistakes. (If the $\arg\max$ in the algorithm has multiple choices, it randomly outputs one of the choices. We use this tie-breaking rule for all $\arg\max$ in all of the algorithms.)

Algorithm 1 Learning algorithm for $\mathcal{H}$ with no infinite indifferent LCLL tree.

- $k\leftarrow 1$, $U\leftarrow\{\}$, $t^{\prime}\leftarrow 0$.
- for $t=1,2,3,\dots$ do
  - if $\exists\, t^{\prime}\leq t_{1}<t_{2}<\cdots<t_{k}\leq t$ and $C_{k}=\{(y_{t_{k-1}^{k-1}+1,\varnothing},\ldots,y_{t_{k}^{1},\varnothing}^{u_{1}},y_{t_{k}^{1}+1,u_{1}},\ldots,y_{t_{k}^{2},u_{1}}^{u_{2}},\ldots,y_{t_{k}^{k},u_{<k}}^{u_{k}}):u\in\{0,1\}^{k}\}$ such that $g_{U}((t_{1},\dots,t_{k}),C_{k})=(Y_{t^{\prime}},\dots,Y_{t})$ then
    - Advance the game: $U\leftarrow U\cup\{((t_{1},\dots,t_{k},C_{k}),g_{U}((t_{1},\dots,t_{k}),C_{k}))\}$.
    - $k\leftarrow k+1$.
    - $L\leftarrow\emptyset$.
    - $t^{\prime}\leftarrow t$.
  - end if
  - Predict $\hat{Y}_{t}=\arg\max_{y}w\left(\mathcal{H}^{g_{U}}_{L\cup\{(X_{t},y)\}},X_{>t}\right)$.
  - if $Y_{t}\neq\hat{Y}_{t}$ then
    - $L\leftarrow L\cup\{(X_{t},Y_{t})\}$.
  - end if
- end for

To describe the weight function we use, we need several extra definitions. In the algorithm, we define the weight function as follows.

$$w(\mathcal{H}^{\prime},X_{\geq t})=\sum_{\begin{subarray}{c}S:S\subseteq X_{\geq t}\\ \text{ such that }S\text{ shattered by }\mathcal{H}^{\prime}\end{subarray}}\frac{1}{n(S)^{d+1}},$$

where $n(S)$ is the index of the last element in $S$, and $d$ is the largest length of a subsequence that is shattered by $\mathcal{H}^{\prime}$. For example, if $S=\{X_{t_{1}},\ldots,X_{t_{k}}\}$, then $n(S)=t_{k}$. Here, $\mathcal{H}^{g_{U}}$ is the partial concept class induced by $g_{U}$, which is defined as follows.

$$\mathcal{H}^{g_{U_{k-1}}}=\{h:\forall\mathbf{t}_{k}=(t_{k}^{1},\cdots,t_{k}^{k}),\ t_{k-1}<t_{k}^{1}<\cdots<t_{k}^{k},\ \forall C_{k},\ (h(X_{t_{k-1}^{k-1}}),\ldots,h(X_{t_{k}^{k}}))\neq g_{U}((t_{1},\dots,t_{k}),C_{k})\},$$

where, for every $\mathbf{t}_{k}$, $C_{k}$ is defined as follows:

$$C_{k}=\{(y_{t_{k-1}^{k-1}+1,\varnothing},\ldots,y_{t_{k}^{1},\varnothing}^{u_{1}},y_{t_{k}^{1}+1,u_{1}},\ldots,y_{t_{k}^{2},u_{1}}^{u_{2}},\ldots,y_{t_{k}^{k},u_{<k}}^{u_{k}}):u\in\{0,1\}^{k}\}$$
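To make the weight function above concrete, here is a minimal Python sketch that evaluates it by brute force on a finite horizon. Several things are assumptions made only for illustration: the shattering test is passed in as a hypothetical predicate `is_shattered` on tuples of time indices (the paper's notion depends on the partial concept class and the level-constrained tree), the empty set is excluded because $n(S)$ is undefined for it, and the sum is truncated to the indices supplied.

```python
from fractions import Fraction
from itertools import combinations
from typing import Callable, Sequence, Tuple


def weight(
    indices: Sequence[int],
    is_shattered: Callable[[Tuple[int, ...]], bool],
) -> Fraction:
    """Sum 1 / n(S)**(d+1) over the non-empty shattered index sets S.

    `indices` are the time indices of X_{>= t} up to a finite horizon;
    n(S) is the largest index in S and d is the longest shattered length.
    """
    ordered = sorted(indices)
    shattered = [
        subset
        for size in range(1, len(ordered) + 1)
        for subset in combinations(ordered, size)
        if is_shattered(subset)
    ]
    if not shattered:
        return Fraction(0)
    d = max(len(subset) for subset in shattered)
    return sum(
        (Fraction(1, subset[-1] ** (d + 1)) for subset in shattered),
        Fraction(0),
    )


if __name__ == "__main__":
    # Toy predicate: only single instances count as shattered, so d = 1.
    print(weight([1, 2, 3], lambda s: len(s) <= 1))
```

The enumeration is exponential in the horizon, so this is only useful for checking small cases by hand.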
Then we have the following lemma:

###### Lemma 4\.5\.

For any process $\{(X_{i},Y_{i})_{i\in\mathbb{N}}\}\in\text{R}(\mathcal{H})$, there exists $t_{0}$ such that for all $t\geq t_{0}$, Algorithm [1](https://arxiv.org/html/2605.30479#alg1) will not update $k$ and $U$, and for all $t<t_{1}<t_{2}<\cdots<t_{k}$, $(X_{t_{1}},\ldots,X_{t_{k}})$ is not shattered by $\mathcal{H}^{g_{U}}$.

This comes from the winning condition of $P_{B}$. We provide the complete proof in [Appendix B](https://arxiv.org/html/2605.30479#A2).

According to Lemma [4.5](https://arxiv.org/html/2605.30479#S4.Thmtheorem5), we know that if $\mathcal{H}$ does not have an infinite indifferent LCLL tree, after $t_{0}$ rounds, the game will not advance. Therefore, for the rest of the sequence after $t_{0}$, the longest subsequence that is shattered by $\mathcal{H}^{g_{U}}$ has length $d$, and we want to use this fact to prove that this partial concept class is transductively online learnable.

###### Lemma 4\.6\.

If the length of the longest subsequence that can be shattered by $\mathcal{H}$ is $d$, there is a transductive online learning algorithm that can learn $\mathcal{H}$ with $O(\log T)$ mistakes.

The idea of this proof comes from the work of Alon et al. ([2022](https://arxiv.org/html/2605.30479#bib.bib18)). Briefly, we define a weight function and make sure that every time the algorithm makes a mistake, the weight function decreases by half. Then we provide an upper bound of the weight at the beginning and a lower bound after $t$ rounds. Combining these two bounds, we bound the number of mistakes made by the algorithm before round $t$. For brevity, the complete proof is provided in [Appendix B](https://arxiv.org/html/2605.30479#A2).
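The halving argument can be made concrete with a short calculation. What follows is a sketch under assumptions that are not taken from the paper's appendix: $W_{0}$ denotes the weight before the first prediction, $W_{T}$ the weight after round $T$, $M_{T}$ the number of mistakes up to round $T$, $d\geq 1$, and the lower bound $W_{T}\geq T^{-(d+1)}$ is assumed purely for illustration. The upper bound on $W_{0}$ uses only that a shattered $S$ has at most $d$ elements and the standard estimate $\sum_{j=0}^{m}\binom{N}{j}\leq(N+1)^{m}$.

```latex
\begin{align*}
  % each mistake at least halves the weight
  W_T \le 2^{-M_T}\, W_0
  \quad&\Longrightarrow\quad
  M_T \le \log_2 \frac{W_0}{W_T}, \\
  % at most \sum_{j<d} \binom{n-1}{j} \le n^{d-1} shattered sets end at index n
  W_0 \le \sum_{n \ge 1} \frac{1}{n^{d+1}} \sum_{j=0}^{d-1} \binom{n-1}{j}
      &\le \sum_{n \ge 1} \frac{n^{d-1}}{n^{d+1}}
      = \sum_{n \ge 1} \frac{1}{n^{2}}
      = \frac{\pi^{2}}{6}, \\
  % plug in the assumed lower bound W_T \ge T^{-(d+1)}
  M_T &\le (d+1) \log_2 T + \log_2 \frac{\pi^{2}}{6} = O(\log T).
\end{align*}
```

The exponent $d+1$ in the weight is what makes the initial sum converge while keeping the lower bound polynomial in $T$.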

###### Lemma 4\.7\.

If $\mathcal{H}$ has an infinite indifferent Littlestone tree, the adversary may force any learning algorithm to make $\Omega(\log T)$ mistakes.

The idea of the proof is to run the random walk from the root of the infinite indifferent Littlestone tree and use the indifferent property to label all the nodes not visited by the walk. Therefore, the adversary may push the learner to make a mistake at each depth, and that pushes the learner to make $\Omega(\log T)$ mistakes. For brevity, the complete proof is provided in the appendix.

### 4\.3Constant Mistake Bound

In this subsection, we prove the constant mistake bound.

###### Lemma 4.8.

If $\mathcal{H}$ does not have an infinite indifferent Littlestone tree, there is an algorithm $\mathcal{A}$ such that, for every $(\mathbb{X},\mathbb{Y})\in\text{R}(\mathcal{H})$, $M(\mathcal{A},(\mathbb{X},\mathbb{Y}),T)=O(1)$.

To prove this lemma, we first define the following Littlestone game. In this game, there are two players, $P_{A}$ and $P_{B}$, and a fixed instance sequence $\mathbb{X}$. At each round $k$, each player can take the following actions:

- $P_{A}$ chooses two branches $B_{k}=\{(y_{t_{k-1}+1},\ldots,y_{t_{k}-1},y_{t_{k}}^{0}),(y_{t_{k-1}+1},\ldots,y_{t_{k}-1},y_{t_{k}}^{1})\}$ and $t_{k}>t_{k-1}$.
- $P_{B}$ chooses a branch $(Y_{t_{k-1}+1},\ldots,Y_{t_{k}})\in B_{k}$.
- Update $U_{k}=U_{k-1}\cup\{(t_{k},B_{k},(Y_{t_{k-1}+1},\ldots,Y_{t_{k}}))\}$.
- $P_{B}$ wins the game in round $k$ if $\mathcal{H}_{U_{k}}=\emptyset$. Here $\mathcal{H}_{U_{k}}=\{h\in\mathcal{H}:h(X_{t})=Y_{t},\forall t\leq t_{k}\}$.
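The rules above can be sketched for a toy case. The Python below assumes a finite concept class given as label tuples over a fixed instance sequence, and it replaces the winning strategy $g_{U}$ (whose existence the paper obtains non-constructively) with a hypothetical greedy rule: $P_{B}$ picks the branch that leaves fewer consistent concepts. All names (`restrict`, `pb_greedy_choice`, `play`) are illustrative and not from the paper.

```python
from typing import List, Optional, Sequence, Tuple

Concept = Tuple[int, ...]  # concept[t] is the label of the (t+1)-th instance
Branch = Tuple[int, ...]   # labels for X_{t_{k-1}+1}, ..., X_{t_k}


def restrict(
    version_space: Sequence[Concept], start: int, branch: Branch
) -> List[Concept]:
    """Keep the concepts that agree with `branch` from position `start` on."""
    end = start + len(branch)
    return [h for h in version_space if h[start:end] == tuple(branch)]


def pb_greedy_choice(
    version_space: Sequence[Concept], start: int, branches: Sequence[Branch]
) -> Branch:
    """Hypothetical P_B rule (not the paper's g_U): keep as few concepts as possible."""
    return min(branches, key=lambda b: len(restrict(version_space, start, b)))


def play(
    concepts: Sequence[Concept], moves: Sequence[Tuple[Branch, Branch]]
) -> Optional[int]:
    """Run the game on P_A's branch pairs; return the round P_B wins, if any."""
    version_space = list(concepts)
    start = 0
    for k, (b0, b1) in enumerate(moves, start=1):
        # P_A's two branches share a prefix and differ only in the last label.
        if len(b0) != len(b1) or b0[:-1] != b1[:-1] or b0[-1] == b1[-1]:
            raise ValueError("invalid branch pair")
        chosen = pb_greedy_choice(version_space, start, (b0, b1))
        version_space = restrict(version_space, start, chosen)
        start += len(chosen)
        if not version_space:  # H_{U_k} is empty, so P_B wins in round k
            return k
    return None


if __name__ == "__main__":
    toy_class = [(0, 0, 0), (1, 0, 0), (1, 1, 0)]
    print(play(toy_class, [((0,), (1,)), ((0,), (1,))]))
```

In this finite toy setting the two branches leave disjoint sets of consistent concepts, so the greedy rule at least halves the version space each round and $P_{B}$ wins within $\lfloor\log_{2}|\mathcal{H}|\rfloor+1$ rounds. The general case in the paper needs the determinacy argument instead.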

This game is fully determined, as the membership of the winning set of $P_{B}$ is witnessed by a finite subsequence. Then we need the following lemma.

###### Lemma 4.9.

If $\mathcal{H}$ has no infinite indifferent Littlestone tree, $P_{B}$ has a winning strategy.

The proof is similar to the proof of [Lemma 4.4](https://arxiv.org/html/2605.30479#S4.Thmtheorem4), and the main tool is also the Borel Determinacy Theorem. For brevity, the complete proof is presented in the appendices.

Notice that $P_{B}$'s winning strategy is fully determined by $U$, so we use $g_{U}$ to stand for it. Then we can use $P_{B}$'s winning strategy to get Algorithm [2](https://arxiv.org/html/2605.30479#alg2), which makes only finitely many mistakes for every realizable sequence $(\mathbb{X},\mathbb{Y})\in\text{R}(\mathcal{H})$. Here, $\mathcal{H}^{g_{U}}$ is the partial concept class induced by the winning strategy $g_{U}$, which is defined as follows.

$$\mathcal{H}^{g_{U}}=\{h:\forall t_{k},\forall B_{k},(h(X_{t_{k-1}}),\ldots,h(X_{t_{k}}))\neq g_{U}(t_{k},B_{k})\},$$

where, for every $t_{k}$, $B_{k}$ is defined as follows:

$$B_{k}=\{(y_{t_{k-1}+1},\ldots,y_{t_{k}-1},y_{t_{k}}^{0}),(y_{t_{k-1}+1},\ldots,y_{t_{k}-1},y_{t_{k}}^{1})\}$$
Algorithm 2 Learning algorithm for $\mathcal{H}$ with no infinite indifferent Littlestone tree.

- $U\leftarrow\{\}$.
- $t^{\prime}\leftarrow 0$.
- for $t=1,2,3,\dots$ do
  - if $t\geq t^{\prime}$ and there exists $B_{k}=\{(y_{t_{k-1}+1},\ldots,y_{t_{k}-1},y_{t_{k}}^{0}),(y_{t_{k-1}+1},\ldots,y_{t_{k}-1},y_{t_{k}}^{1})\}$ such that $g_{U}(t,B_{k})=(Y_{t^{\prime}+1},\ldots,Y_{t})$ then
    - Update the game: $U\leftarrow U\cup\{(t,B_{k},(Y_{t^{\prime}+1},\ldots,Y_{t}))\}$.
    - $t^{\prime}\leftarrow t$.
  - end if
  - if $\exists h\in\mathcal{H}^{g_{U}},h(X_{t})=y$ then
    - Predict $\hat{Y}_{t}=y$.
  - end if
- end for

###### Lemma 4\.10\.

For every sequence $(\mathbb{X},\mathbb{Y})\in\text{R}(\mathcal{H})$, if $\mathcal{H}$ does not have an infinite indifferent Littlestone tree, there exists a constant $t_{0}$ such that for every $t\geq t_{0}$, $X_{t}$ is not shattered by $\mathcal{H}^{g_{U}}$.

###### Proof.

By its definition, a winning strategy of $P_{B}$ leads to the winning condition of $P_{B}$. By the definition of $P_{B}$'s winning condition, we know that there is a constant $k$ such that for every $t_{k}>t_{k-1}$, $X_{t_{k}}$ is not shattered by $\mathcal{H}^{g_{U}}$. That finishes the proof. ∎

According to Lemma [4.10](https://arxiv.org/html/2605.30479#S4.Thmtheorem10) and Definition [4.3](https://arxiv.org/html/2605.30479#S4.Thmtheorem3), we know that for all $t>t_{0}$, every $h\in\mathcal{H}^{g_{U}}$ satisfies $h(X_{t})=Y_{t}$. Thus, the prediction is the true label $Y_{t}$. Therefore, the algorithm makes a finite number of mistakes.
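The last step of this argument, predicting the label on which every surviving concept agrees, can be illustrated with a toy Python sketch. It assumes a finite version space of total concepts stored as dictionaries, which is a simplification of the partial concept class $\mathcal{H}^{g_{U}}$; the name `unanimous_label` is hypothetical.

```python
from typing import Dict, Hashable, Optional, Sequence

Concept = Dict[Hashable, int]  # maps an instance to its label


def unanimous_label(version_space: Sequence[Concept], x: Hashable) -> Optional[int]:
    """Return the label all concepts assign to x, or None if they disagree."""
    labels = {h[x] for h in version_space}
    return labels.pop() if len(labels) == 1 else None


if __name__ == "__main__":
    survivors = [{"a": 1, "b": 0}, {"a": 1, "b": 2}]
    print(unanimous_label(survivors, "a"), unanimous_label(survivors, "b"))
```

When an instance is not shattered, the first case applies and the unanimous label must be the true one for any realizable sequence, so no mistake is made on that round.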

## 5Agnostic Setting

In this section, we present the high-level proof ideas of the learnability results in the agnostic setting, which is formally stated as Theorem [2.9](https://arxiv.org/html/2605.30479#S2.Thmtheorem9). All the complete proofs are postponed to [Appendix C](https://arxiv.org/html/2605.30479#A3) for brevity.

To prove the upper bound, we first need to construct the experts based on the learning algorithm for the realizable case. An expert here is an algorithm with two hardcoded inputs, $I$ and $J$. $I=(X_{\leq k},Y^{*}_{\leq k})$, for a constant $k\in\mathbb{N}$, is a hallucinated sequence for the game updating part, and $J\subseteq\mathbb{N}$ marks the indices of the mistakes made by the realizable algorithm when $Y_{t}=\mathcal{Y}^{*}_{t}$ during learning $(\mathbb{X},\mathbb{Y}^{*})$ after the game updating part. When the time is in the set of $I$, the prediction of the expert is given by the hallucinated label stored by $I$. For the rest, the expert will predict a random label if the index of the round is in the set $J$. We can prove the following lemma for the experts defined above.

###### Lemma 5\.1\.

If $\mathcal{H}$ has no infinite indifferent LCLL tree, then for every realizable sequence $(\mathbb{X},\mathbb{Y})\in\text{R}(\mathcal{H})$, there is a sequence $\{j_{T}\}_{T\in\mathbb{N}}$ satisfying $\log j_{T}=O((\log T)^{2})$ such that, for every large enough time $T$, there is an expert $e_{i,j}$ with $j\leq j_{T}$ such that for all $t\leq T$, $Y_{t}=e_{i,j}(X_{t})$ except for at most $O(\log T)$ times.

Then, we can use the *Squint* algorithm from the work of Koolen and van Erven ([2015](https://arxiv.org/html/2605.30479#bib.bib26)) with non-uniform initial weights. For each expert $e_{i,j}$, we set its initial weight as $\pi_{i,j}=\frac{1}{i(i+1)j(j+1)}$. This forms a distribution: since $\pi_{i,j}=\frac{1}{i(i+1)j}-\frac{1}{i(i+1)(j+1)}$ and $\sum_{j=1}^{\infty}\pi_{i,j}=\frac{1}{i}-\frac{1}{i+1}$, the sum of $\pi_{i,j}$ converges to $1$ as $i$ and $j$ go to infinity. According to Theorem 3 in the work of Koolen and van Erven ([2015](https://arxiv.org/html/2605.30479#bib.bib26)), we have the following upper bound for the regret

$$\sum_{t=1}^{T}\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right]-\sum_{t=1}^{T}\mathbb{I}\left[e_{i,j}(X_{t})\neq Y_{t}\right]\leq O\left(\sqrt{V_{i,j}\log\frac{\log V_{i,j}}{\pi_{i,j}}+\log\frac{1}{\pi_{i,j}}}\right).$$

Here, $V_{i,j}$ is the sum of the squares of the difference between the algorithm's mistake and the expert $e_{i,j}$'s mistake in each round. In other words, we have

$$V_{i,j}=\sum_{t=1}^{T}\left(\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right]-\mathbb{I}\left[e_{i,j}(X_{t})\neq Y_{t}\right]\right)^{2}.$$

Since $\left(\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right]-\mathbb{I}\left[e_{i,j}(X_{t})\neq Y_{t}\right]\right)^{2}$ is either $1$ or $0$, we have $V_{i,j}\leq T$. The regret of this algorithm is upper bounded by:

$$O\left(\sqrt{V_{i,j}\log\frac{\log V_{i,j}}{\pi_{i,j}}+\log\frac{1}{\pi_{i,j}}}\right)=O\left(\sqrt{T\log\log T+T(\log i+\log j)+(\log i+\log j)}\right).$$
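The two facts about the prior used in this bound are easy to check numerically. The Python sketch below is illustrative only and uses hypothetical helper names (`prior`, `partial_mass`, `log_inverse_prior`). It verifies the telescoping identity on a finite grid, where the mass of all experts with $i\leq N$ and $j\leq M$ is $\frac{N}{N+1}\cdot\frac{M}{M+1}$, and it evaluates the penalty term $\log\frac{1}{\pi_{i,j}}$, which is at most $2\log(i+1)+2\log(j+1)$.

```python
from fractions import Fraction
from math import log


def prior(i: int, j: int) -> Fraction:
    """Initial weight pi_{i,j} = 1 / (i (i+1) j (j+1)) for i, j >= 1."""
    return Fraction(1, i * (i + 1) * j * (j + 1))


def partial_mass(n_i: int, n_j: int) -> Fraction:
    """Total prior mass of the experts with i <= n_i and j <= n_j."""
    return sum(
        (prior(i, j) for i in range(1, n_i + 1) for j in range(1, n_j + 1)),
        Fraction(0),
    )


def log_inverse_prior(i: int, j: int) -> float:
    """The term log(1 / pi_{i,j}) that appears in the regret bound."""
    return log(i * (i + 1) * j * (j + 1))


if __name__ == "__main__":
    print(partial_mass(5, 7))             # equals (5/6) * (7/8)
    print(float(partial_mass(200, 200)))  # approaches 1 as the grid grows
    print(log_inverse_prior(3, 4))
```

Because the penalty grows only logarithmically in $i$ and $j$, an expert with a large index still contributes a modest additive term inside the square root.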
On the other hand, we also prove a lower bound in the agnostic setting: no learning algorithm for a concept class that includes at least two concepts can ensure $o(\sqrt{T})$ regret for every sequence $(\mathbb{X},\mathbb{Y},\mathbb{Y}^{*})$. Formally,

###### Lemma 5\.2\.

For every concept class $\mathcal{H}$ containing two concepts $h_{1},h_{2}$ and an instance $x$ with $h_{1}(x)\neq h_{2}(x)$, and for every learning algorithm $\mathcal{A}$, there is a sequence $(\mathbb{X},\mathbb{Y},\mathbb{Y}^{*})$ such that $\text{Regret}(\mathcal{A},(\mathbb{X},\mathbb{Y},\mathbb{Y}^{*}),T)\neq o(\sqrt{T})$.

To prove this lemma, we need to construct an infinite sequence such that no online learning algorithm can transductively learn it with regret $o(\sqrt{T})$. The sequence $\mathbb{X}$ is an infinite sequence of $X_{t}=x$ such that $h_{1}(x)\neq h_{2}(x)$. Then we use a probabilistic argument: if we independently and uniformly at random pick $h_{1}(x)$ or $h_{2}(x)$ as $Y_{t}$, there exists an infinite sequence $\mathbb{Y}$ and $\mathbb{Y}^{*}=\{h_{1}(X_{i})\}_{i\in\mathbb{N}}$ or $\{h_{2}(X_{i})\}_{i\in\mathbb{N}}$ such that the regret of any online learning algorithm is not $o(\sqrt{T})$ with probability more than $0$. To prove this, we divide the infinite sequence into blocks of increasing size and use Khinchine's inequality on each block to show the expected regret on that block is $\Omega(\sqrt{T})$. Then, by Azuma's inequality, we can bound the probability of generating a sequence that pushes any algorithm to suffer $\Omega(\sqrt{T})$ regret. Then we can extend this result on blocks to the infinite sequence by the reverse Fatou's lemma. For brevity, the complete proof is provided in Appendix [C](https://arxiv.org/html/2605.30479#A3).
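The $\sqrt{T}$ scale in this argument can be seen from a single-horizon calculation, sketched below in Python. This is not the paper's block construction: it only assumes that the labels are independent fair coin flips between $h_{1}(x)$ and $h_{2}(x)$, so any learner makes at least $T/2$ mistakes in expectation, while the better of the two constant predictors makes $T/2-|S_{T}|/2$ mistakes, where $S_{T}$ is a sum of $T$ independent uniform $\pm 1$ signs. The function names are hypothetical.

```python
from fractions import Fraction
from math import comb, pi, sqrt


def expected_abs_sum(T: int) -> Fraction:
    """Exact E|S_T| for S_T a sum of T independent uniform +/-1 signs."""
    total = sum(comb(T, k) * abs(2 * k - T) for k in range(T + 1))
    return Fraction(total, 2 ** T)


def expected_regret_lower_bound(T: int) -> Fraction:
    """E[T/2 - min(#h1 wrong, #h2 wrong)] = E|S_T| / 2 under coin-flip labels."""
    return expected_abs_sum(T) / 2


if __name__ == "__main__":
    for T in (1, 4, 100, 400):
        ratio = float(expected_regret_lower_bound(T)) / sqrt(T)
        print(T, ratio, 1 / sqrt(2 * pi))
```

The ratio to $\sqrt{T}$ approaches $1/\sqrt{2\pi}$, which matches the order of the lower bound. Turning this expectation into a statement about one fixed infinite sequence is what the block argument with Azuma's inequality and the reverse Fatou's lemma is for.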

## 6Conclusion and Future Directions

In this paper, we investigate universal multiclass transductive online learning. We prove a trichotomy for the realizable setting. To describe the trichotomy, we define a new combinatorial structure, the Level-Constrained Littlestone-Littlestone tree, and emphasize the indifferent property of trees. We then provide an $\tilde{O}(\sqrt{T})$ regret learning algorithm for the agnostic case when $\mathcal{H}$ has no infinite indifferent LCLL tree. We also extend the learnability result to the case of learners with knowledge only of a stochastic process from which the instance sequence is sampled, describing the condition in which all processes admit universal multiclass online learning.

Finally, we want to provide several interesting future directions:

- First, for the agnostic case, there is a poly-logarithmic gap between the upper and lower bounds of the regret. Is it possible to tighten this gap?
- Second, is it possible to characterize the universal transductive online learnability in other settings? For example, multiclass classification with bandit feedback, real-valued function regression, etc. As we have shown, this setting is closely related to the condition where all processes admit universal online learning; this question itself is also very interesting.

## Acknowledgements

We thank the reviewers who provided useful suggestions on improving the quality of this paper. SH acknowledges support by grant no. 2024243 from the United States - Israel Binational Science Foundations (BSF).

## Impact Statement

This paper presents work whose goal is to advance the field of Learning Theory\. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here\.

## References

- N. Alon, S. Hanneke, R. Holzman, and S. Moran (2022). A theory of PAC learnability of partial concept classes. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pp. 658–671.
- S. Ben-David, E. Kushilevitz, and Y. Mansour (1997). Online learning versus offline learning. Machine Learning 29, pp. 45–63.
- M. Blanchard, R. Cosson, and S. Hanneke (2022). Universal online learning with unbounded losses: memory is all you need. In Proceedings of The 33rd International Conference on Algorithmic Learning Theory, Proceedings of Machine Learning Research, Vol. 167, pp. 107–127.
- M. Blanchard (2022). Universal online learning: an optimistically universal learning rule. In Proceedings of Thirty Fifth Conference on Learning Theory, Proceedings of Machine Learning Research, pp. 1077–1125.
- O. Bousquet, S. Hanneke, S. Moran, J. Shafer, and I. Tolstikhin (2023). Fine-grained distribution-dependent learning curves. In Proceedings of Thirty Sixth Conference on Learning Theory, Proceedings of Machine Learning Research, Vol. 195, pp. 5890–5924.
- O. Bousquet, S. Hanneke, S. Moran, R. Van Handel, and A. Yehudayoff (2021). A theory of universal learning. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pp. 532–541.
- N. Brukhim, D. Carmon, I. Dinur, S. Moran, and A. Yehudayoff (2022). A characterization of multiclass learnability. In 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS), pp. 943–955.
- N. Cesa-Bianchi and G. Lugosi (2006). Prediction, learning, and games. Cambridge University Press.
- Z. Chase, S. Hanneke, S. Moran, and J. Shafer (2025). Optimal mistake bounds for transductive online learning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems. [Link](https://openreview.net/forum?id=EoebmBe9fG)
- A. Daniely, S. Sabato, S. Ben-David, and S. Shalev-Shwartz (2015). Multiclass learnability and the erm principle. J. Mach. Learn. Res. 16(1), pp. 2377–2404.
- D. Gale and F. M. Stewart (1953). Infinite games with perfect information. Annals of Mathematics Studies (28), pp. 245–266.
- S. Hanneke, A. Karbasi, S. Moran, and G. Velegkas (2022). Universal rates for interactive learning. Advances in Neural Information Processing Systems 35, pp. 28657–28669.
- S. Hanneke, A. Karbasi, S. Moran, and G. Velegkas (2024a). Universal rates for active learning. In The Thirty-eighth Annual Conference on Neural Information Processing Systems.
- S. Hanneke, S. Moran, V. Raman, U. Subedi, and A. Tewari (2023a). Multiclass online learning and uniform convergence. In Proceedings of Thirty Sixth Conference on Learning Theory, Proceedings of Machine Learning Research, Vol. 195, pp. 5682–5696.
- S. Hanneke, S. Moran, and J. Shafer (2023b). A trichotomy for transductive online learning. Advances in Neural Information Processing Systems 36, pp. 19502–19519.
- S. Hanneke, S. Moran, and Q. Zhang (2023c). Universal rates for multiclass learning. In The Thirty Sixth Annual Conference on Learning Theory, pp. 5615–5681.
- S. Hanneke, V. Raman, A. Shaeiri, and U. Subedi (2024b). Multiclass transductive online learning. In The Thirty-eighth Annual Conference on Neural Information Processing Systems.
- S. Hanneke, A. Shaeiri, and H. Wang (2025). For universal multiclass online learning, bandit feedback and full supervision are equivalent. In 36th International Conference on Algorithmic Learning Theory.
- S. Hanneke and A. Shaeiri (2025). A trichotomy for list transductive online learning. In Forty-second International Conference on Machine Learning. [Link](https://openreview.net/forum?id=4gUVnk2Hyo)
- S. Hanneke and H. Wang (2024). A theory of optimistically universal online learnability for general concept classes. In The Thirty-eighth Annual Conference on Neural Information Processing Systems.
- S\. Hanneke and M\. Xu \(2024\)Universal rates of empirical risk minimization\.Advances in Neural Information Processing Systems37,pp\. 116291–116331\.Cited by:[§1](https://arxiv.org/html/2605.30479#S1.p5.1)\.
- S\. Hanneke \(2021\)Learning whenever learning is possible: universal learning under general stochastic processes\.Journal of Machine Learning Research22\(130\),pp\. 1–116\.Cited by:[§D\.1](https://arxiv.org/html/2605.30479#A4.SS1.p1.1)\.
- W\. M\. Koolen and T\. van Erven \(2015\)Second\-order quantile methods for experts and combinatorial games\.InProceedings of The 28th Conference on Learning Theory, COLT 2015, Paris, France, July 3\-6, 2015,JMLR Workshop and Conference Proceedings, Vol\.40,pp\. 1155–1175\.Cited by:[Appendix C](https://arxiv.org/html/2605.30479#A3.p1.3),[§5](https://arxiv.org/html/2605.30479#S5.p3.8)\.
- N\. Littlestone \(1988\)Learning quickly when irrelevant attributes abound: a new linear\-threshold algorithm\.Machine Learning2,pp\. 285–318\.Cited by:[§1](https://arxiv.org/html/2605.30479#S1.p1.2)\.
- B\. K\. Natarajan and P\. Tadepalli \(1988\)Two new frameworks for learning\.InMachine Learning Proceedings 1988,pp\. 402–415\.Cited by:[§1](https://arxiv.org/html/2605.30479#S1.p6.1)\.
- B\. K\. Natarajan \(1989\)On learning sets and functions\.Machine Learning4,pp\. 67–97\.Cited by:[§1](https://arxiv.org/html/2605.30479#S1.p6.1)\.
- V\. Vapnik \(2006\)Estimation of dependences based on empirical data\.Springer Science & Business Media\.Cited by:[§1](https://arxiv.org/html/2605.30479#S1.p1.2)\.

## Appendix A Gale-Stewart Game

In this section, we briefly review the basic notions from the classical theory of infinite games.

In a two-player game, there are two players, $A$ and $B$, who share an action space $\mathcal{A}$. At round $t$, player $A$ chooses an action from the action space, and then player $B$ also chooses an action from the action space. Thus, each play of the game can be expressed as a sequence of actions $g=(a_1,a_2,a_3,a_4,\ldots)$, where $a_i\in\mathcal{A}$ for every $i$. The set of all such sequences is denoted by $\mathcal{A}^{\omega}$. Here $\mathcal{A}$ is equipped with the discrete topology and $\mathcal{A}^{\omega}$ with the generated product topology. A fixed set $W$, known as the winning set, determines the result of the game: if the play $g\in W$, player $A$ wins, and player $B$ wins otherwise.

A *strategy* is a function that generates the next action for a player based on the state of the game (which can be thought of as the history of actions). A strategy of player $A$ that leads to a win for player $A$ regardless of the actions of player $B$ is called a *winning strategy* of player $A$. If one of the two players has a winning strategy, the game is *determined*. The following theorem states the determinacy of the game formally.

###### Theorem A.1 (Gale and Stewart ([1953](https://arxiv.org/html/2605.30479#bib.bib30))).

If $W$ is an open set, then the game is determined.

The proof is short and intuitive; we refer readers to the original paper for it. Throughout, we refer to this theorem as the Borel Determinacy Theorem: open sets are Borel sets, since the Borel sets form the smallest $\sigma$-algebra of subsets of $\mathcal{A}^{\omega}$ that contains all open sets.
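As a purely illustrative aside (not part of the paper), determinacy can be checked directly by backward induction when the game stops after finitely many rounds. The sketch below assumes a binary action space, a fixed horizon, and a winning set `W` given as a set of action tuples; the names `a_wins`, `b_wins`, and `is_determined` are hypothetical.

```python
from itertools import product

ACTIONS = (0, 1)  # assumed binary action space, for illustration only

def a_wins(prefix, horizon, W):
    """True if player A can force the finished play into W from `prefix`."""
    if len(prefix) == horizon:
        return prefix in W
    nxt = [a_wins(prefix + (a,), horizon, W) for a in ACTIONS]
    # A moves at even positions (0-based), B moves at odd positions.
    return any(nxt) if len(prefix) % 2 == 0 else all(nxt)

def b_wins(prefix, horizon, W):
    """True if player B can force the finished play outside W from `prefix`."""
    if len(prefix) == horizon:
        return prefix not in W
    nxt = [b_wins(prefix + (a,), horizon, W) for a in ACTIONS]
    return all(nxt) if len(prefix) % 2 == 0 else any(nxt)

def is_determined(horizon, W):
    """Exactly one of the two players has a winning strategy."""
    return a_wins((), horizon, W) != b_wins((), horizon, W)

# Example winning set: A wins a two-round game only if B copies A's action.
EXAMPLE_W = {p for p in product(ACTIONS, repeat=2) if p[0] == p[1]}
```

In this toy example, player $B$ can always answer with the opposite action, so $B$ has the winning strategy. The infinite-horizon statement of Theorem A.1 is not covered by this finite check.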

## Appendix B Omitted Proofs for the Realizable Setting

### B.1 Proof of Lemma [4.1](https://arxiv.org/html/2605.30479#S4.Thmtheorem1)

###### Proof\.

We construct a sequence based on the properties of an infinite indifferent LCLL tree such that no learning algorithm can achieve a sublinear expected number of mistakes on it.

First, we start from an infinite indifferent LCLL tree and modify it so that the length of the sequence in each node increases exponentially. In other words, we want the $k$-th node in Breadth-First-Search (BFS) order to contain a sequence of length $2^{k-1}$. We reach this target by recursively modifying the chosen indifferent LCLL tree: starting from the root, whenever a node does not satisfy our requirement, we promote one of its descendants to replace that node, chosen so that the number of elements in that node is large enough.

The data sequence $\mathbb{X}$ is then defined as follows. Starting with $\mathbb{X}=\emptyset$, traverse the modified indifferent LCLL tree in BFS order from the root, and every time a new node is reached, append the sequence in that node to $\mathbb{X}$.

Next, we choose the true label $Y_t$ for each $X_t$ so that $(\mathbb{X},\mathbb{Y})$ is a realizable sequence. We take a random walk on the modified indifferent LCLL tree. For every $X_t$ in a node visited by the random walk (an in-branch node), we set $Y_t$ according to the label of the edge visited and the function $\mathbf{y}_n$. For every $X_t$ in a node not visited by the random walk (an out-branch node), we find the closest in-branch node after this out-branch node and label $X_t$ by the function agreed on by all descendants of that in-branch node. Such a function exists because the tree is indifferent.

We now prove that this sequence forces any transductive online learning algorithm to make a number of mistakes that is not $o(T)$ in expectation. Consider the node at depth $d$ visited by the random walk, which is the $K_d$-th node in BFS order and is labeled with a sequence of length $2^{K_d-1}$. By the property of the indifferent LCLL tree, each $X_t$ in the $K_d$-th node can take any label sequence on an edge adjacent to that node. Thus, even the best algorithm makes at least $2^{K_d-2}$ mistakes in expectation. By the definition of the sequence, the length of the sequence up to and including the $K_d$-th node is $n_{K_d}=\sum_{i=1}^{K_d}2^{i-1}=2^{K_d}-1$. Thus, we have the following inequality for every $d$,

$$\mathbb{E}\left[\left.\frac{1}{n_{K_d}}\sum_{t=1}^{n_{K_d}}\mathbb{I}\left[\hat{h}_{t-1}(X_t)\neq Y_t\right]\right|K_d\right]\geq\frac{1}{4}. \tag{1}$$

Thus, we have

$$\limsup_{d\rightarrow\infty}\mathbb{E}\left[\left.\frac{1}{n_{K_d}}\sum_{t=1}^{n_{K_d}}\mathbb{I}\left[\hat{h}_{t-1}(X_t)\neq Y_t\right]\right|K_d\right]\geq\frac{1}{4}.$$

Then consider the expected limiting average number of mistakes, which satisfies

$$\begin{aligned}
\mathbb{E}\left[\limsup_{n\rightarrow\infty}\frac{1}{n}\sum_{t=1}^{n}\mathbb{I}\left[\hat{h}_{t-1}(X_t)\neq Y_t\right]\right]
&\geq\mathbb{E}\left[\limsup_{d\rightarrow\infty}\frac{1}{n_{K_d}}\sum_{t=1}^{n_{K_d}}\mathbb{I}\left[\hat{h}_{t-1}(X_t)\neq Y_t\right]\right]\\
&=\mathbb{E}\left[\mathbb{E}\left[\limsup_{d\rightarrow\infty}\left.\frac{1}{n_{K_d}}\sum_{t=1}^{n_{K_d}}\mathbb{I}\left[\hat{h}_{t-1}(X_t)\neq Y_t\right]\right|K_d\right]\right]\\
&\geq\mathbb{E}\left[\limsup_{d\rightarrow\infty}\mathbb{E}\left[\left.\frac{1}{n_{K_d}}\sum_{t=1}^{n_{K_d}}\mathbb{I}\left[\hat{h}_{t-1}(X_t)\neq Y_t\right]\right|K_d\right]\right]\\
&\geq\frac{1}{4}.
\end{aligned}$$

The first inequality holds because we pass to the subsequence of average mistake counts indexed by $n_{K_d}$, and the limit superior of a subsequence is no larger than the limit superior of the whole sequence. The equality is the law of total expectation. Then, as the average number of mistakes is always nonnegative and bounded by $1$, we can use the reverse Fatou lemma to get the second inequality.

Therefore, there exists a realizable sequence on which the expected number of mistakes of any online learner is not sublinear. ∎
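The arithmetic behind inequality (1) can be checked with a few lines of exact rational arithmetic. This is a minimal sketch, assuming only the quantities stated in the proof (node $k$ holds $2^{k-1}$ instances and the learner loses at least $2^{K-2}$ mistakes in expectation on the $K$-th node); the function names are hypothetical.

```python
from fractions import Fraction

def prefix_length(K):
    """Length of the instance sequence through the K-th BFS node: sum_{i=1}^K 2^(i-1)."""
    return sum(2 ** (i - 1) for i in range(1, K + 1))

def mistake_rate_lower_bound(K):
    """Expected mistakes on the K-th node alone, 2^(K-2), divided by the prefix length."""
    return Fraction(2 ** K, 4) / prefix_length(K)
```

Since `prefix_length(K)` equals $2^K-1$, the ratio is $2^K/(4(2^K-1))$, which stays above $1/4$ for every $K$.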

### B.2 Proof of $O(\log T)$ Upper Bound

In this section, we provide the complete proof of the $O(\log T)$ upper bound when $\mathcal{H}$ does not have an infinite indifferent LCLL tree.

###### Proof of Lemma [4.4](https://arxiv.org/html/2605.30479#S4.Thmtheorem4).

By the Borel Determinacy Theorem, if $P_A$ does not have a winning strategy, then $P_B$ has a winning strategy. Thus, we only need to show that a winning strategy of $P_A$ yields an infinite indifferent LCLL tree. The proof is constructive: we show how to build such a tree recursively.

We build the infinite indifferent LCLL tree in Breadth-First-Search (BFS) order from the root. We first take the depth-$1$ LCL tree proposed by $P_A$ and use it as the root together with its two edges.

Then, for the $i$-th node $v_i$, suppose the $(i-1)$-th node, at depth $k$, is labeled by $(X_{j_1},\ldots,X_{j_k})$. We also suppose $v_i$'s parent node is labeled by $(X_{\ell_1},\ldots,X_{\ell_{k-1}})$. Because $P_A$ has a winning strategy, no matter which branch is chosen by $P_B$, for any $n\in\mathbb{N}$, $P_A$ can propose a sequence $(X_{i_1},\ldots,X_{i_n})$ shattered by $\mathcal{H}_{(X_{\leq i_1},Y_{\leq i_1})}$. This follows from $P_A$'s action and Definition [4.3](https://arxiv.org/html/2605.30479#S4.Thmtheorem3). Here, the labels $Y_i$ for $\ell_{k-1}\leq i\leq i_1$ are proposed by $P_A$ and must be realizable by $\mathcal{H}$, as otherwise $P_B$ wins the game.

Then we take the first $k$ instances in the sequence $(X_{i_1},\ldots,X_{i_n})$ and use them to label $v_i$. By the definition of shattering, we can build a depth-$k$ LCL tree with label sequences $\{(y_{i_1,\varnothing}^{u_1},\ldots,y_{i_k,u_{<k}}^{u_k}):u\in\{0,1\}^k\}$, where $u_{<k}=(u_1,u_2,\ldots,u_{k-1})$. Thus, we can use these label sequences to label the edges connecting $v_i$ and its children.

Since the sequence $(X_{i_1},\ldots,X_{i_k})$ is shattered by $\mathcal{H}_{(X_{\leq i_1},Y_{\leq i_1})}$, all functions of its descendants agree on all the instances before $i_1$. All nodes before the $i$-th node in BFS order are labeled by instances before $X_{i_1}$. Thus, all the functions of the descendants of $v_i$ agree on all earlier nodes. Therefore, $P_A$'s winning strategy leads to an infinite indifferent LCLL tree for $\mathcal{H}$. ∎

###### Proof of Lemma [4.5](https://arxiv.org/html/2605.30479#S4.Thmtheorem5).

By definition, the winning strategy leads to a winning condition for the player $P_B$. By the definition of $P_B$'s winning condition, there exists a $k$ such that $\mathcal{H}_{\mathbf{t}_1,C_1,g_1,\dots,\mathbf{t}_k,C_k,g_k}=\emptyset$, which means that for all $t<t_1<t_2<\cdots<t_k$, the sequence $(X_{t_1},\ldots,X_{t_k})$ is not shattered by $\mathcal{H}^{g_{U_{k-1}}}$. That finishes the proof. ∎

We then prove that if the length of the longest subsequence shattered by $\mathcal{H}'$ is $d$, there is a learning algorithm that can learn it with $O(\log T)$ mistakes. For completeness, we provide the subroutine as an algorithm and restate the lemma here.

**Algorithm 3** Subroutine for learning when subsequences of length at most $d$ are shattered by $\mathcal{H}'$.

- $L\leftarrow\emptyset$.
- **for** $t=1,2,3,\dots$ **do**
  - Predict $\hat{Y}_t=\arg\max_{y}w(\mathcal{H}'_{L\cup\{(X_t,y)\}},X_{\geq t})$.
  - **if** $Y_t\neq\hat{Y}_t$ **then**
    - $L\leftarrow L\cup\{(X_t,Y_t)\}$.
  - **end if**
- **end for**

In Algorithm [3](https://arxiv.org/html/2605.30479#alg3), we define the weight function as follows.

$$w(\mathcal{H}',X_{\geq t})=\sum_{\substack{S:\,S\subseteq X_{\geq t}\\ \text{such that }S\text{ shattered by }\mathcal{H}'}}\frac{1}{n(S)^{d+1}}, \tag{2}$$

where $n(S)$ is the index of the last element in $S$, $d$ is the length of the longest subsequence shattered by the partial concept class $\mathcal{H}'$, and $X_{>t}=\{X_i\}_{i>t}$ (with $X_{\geq t}$ defined analogously). For example, if $S=\{X_{t_1},\ldots,X_{t_k}\}$, then $n(S)=t_k$. We can now restate and prove Lemma [4.6](https://arxiv.org/html/2605.30479#S4.Thmtheorem6).

###### Lemma B.1 (Restatement of Lemma [4.6](https://arxiv.org/html/2605.30479#S4.Thmtheorem6)).

If the length of the longest subsequence shattered by the partial concept class $\mathcal{H}'$ is $d$, Algorithm [3](https://arxiv.org/html/2605.30479#alg3) can learn it with $O(\log T)$ mistakes.

###### Proof.

Notice that if $S$ is shattered by both $\mathcal{H}_{\{(X_i,Y_i)\}_{i<t}\cup\{(X_t,y_t^0)\}}$ and $\mathcal{H}_{\{(X_i,Y_i)\}_{i<t}\cup\{(X_t,y_t^1)\}}$, then $S\cup\{X_t\}$ is shattered by $\mathcal{H}_{\{(X_i,Y_i)\}_{i<t}}$. Thus, if the subset $S$ contributes twice to the right-hand side of the inequality below, it also contributes at least twice to the left-hand side (once through $S$ and once through $S\cup\{X_t\}$). Thus, we have the following inequality for $\mathbb{X}$,

$$w(\mathcal{H}_{(X_{<t},Y_{<t})},X_{\geq t})\geq w(\mathcal{H}_{(X_{<t},Y_{<t})\cup\{(X_t,y_t^0)\}},X_{>t})+w(\mathcal{H}_{(X_{<t},Y_{<t})\cup\{(X_t,y_t^1)\}},X_{>t}),$$

where $(X_{<t},Y_{<t})=\{(X_i,Y_i)\}_{i<t}$. Every time the algorithm makes a mistake, we have $w(\mathcal{H}_{(X_{<t},Y_{<t})\cup\{(X_t,\hat{Y}_t)\}},X_{>t})\geq w(\mathcal{H}_{(X_{<t},Y_{<t})\cup\{(X_t,Y_t)\}},X_{>t})$. Thus, we have

$$w(\mathcal{H}_{(X_{<t},Y_{<t})\cup\{(X_t,Y_t)\}},X_{>t})\leq\frac{1}{2}w(\mathcal{H}_{(X_{<t},Y_{<t})},X_{\geq t}).$$

Next we need an upper bound on $w(\mathcal{H}',\mathbb{X})$. For the entire sequence, we group the shattered subsequences by their last index $n$ and sum over all possible $n$, so we need to bound the number of subsequences that end at $X_n$ and can be shattered by $\mathcal{H}'$. This number is at most the number of subsets of $\{1,2,\ldots,n\}$ that contain $n$ and have at most $d$ elements, that is,

$$\sum_{i=0}^{d-1}\binom{n-1}{i}\leq n^{d-1}.$$

Thus, we have

$$w(\mathcal{H}',\mathbb{X})\leq\sum_{n=1}^{\infty}\frac{n^{d-1}}{n^{d+1}}=\sum_{n=1}^{\infty}\frac{1}{n^2}=\frac{\pi^2}{6}.$$

Then we provide a lower bound on the weight function at each round $T$. Consider the last mistake the algorithm made before round $T$. Before that mistake, at least one singleton $\{x_{t'}\}$ is shattered by the concept class. Thus, the weight function at that point is at least $\frac{1}{T^{d+1}}$. Assume the algorithm makes $m(T)$ mistakes before round $T$. Since every mistake at least halves the weight, we have

$$\frac{\pi^2}{6}\geq w(\mathcal{H}',\mathbb{X})\geq 2^{m(T)-1}\frac{1}{T^{d+1}}.$$

Therefore, we have $m(T)\leq(d+1)\log_2 T+1+\log_2\frac{\pi^2}{6}$, which is $O(\log T)$. ∎
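To make Algorithm 3 and the weight function (2) concrete, here is a brute-force sketch on a finite horizon. It rests on assumptions that are not taken from the paper: the class is a finite set of label tuples, and "shattered" is read as the existence of a level-constrained binary tree along the chosen positions (two distinct labels at each position, recursively). All function names are hypothetical, and the enumeration is exponential, so it is only meant for toy inputs.

```python
from fractions import Fraction
from itertools import combinations

def restrict(H, t, y):
    """Hypotheses in H (label tuples) that assign label y at position t."""
    return {h for h in H if h[t] == y}

def shattered(H, idxs):
    """Assumed notion: positions idxs admit a level-constrained binary tree in H."""
    if not H:
        return False
    if not idxs:
        return True
    t, rest = idxs[0], idxs[1:]
    good = [y for y in {h[t] for h in H} if shattered(restrict(H, t, y), rest)]
    return len(good) >= 2

def longest_shattered(H, T):
    """Largest k such that some length-k subsequence of positions is shattered."""
    return max((k for k in range(1, T + 1)
                for S in combinations(range(T), k) if shattered(H, S)), default=0)

def weight(H, start, T, d):
    """Equation (2): sum of 1 / n(S)^(d+1) over shattered S within positions >= start."""
    total = Fraction(0)
    for k in range(1, T - start + 1):
        for S in combinations(range(start, T), k):
            if shattered(H, S):
                total += Fraction(1, (S[-1] + 1) ** (d + 1))  # n(S) is 1-based
    return total

def run_learner(H, target):
    """Algorithm 3 on a finite horizon; returns the number of mistakes."""
    T = len(target)
    d = longest_shattered(H, T)
    V, mistakes = set(H), 0  # V plays the role of H'_L
    for t in range(T):
        labels = sorted({h[t] for h in V})
        pred = max(labels, key=lambda y: weight(restrict(V, t, y), t, T, d))
        if pred != target[t]:
            mistakes += 1
            V = restrict(V, t, target[t])  # L <- L ∪ {(X_t, Y_t)}
    return mistakes

def potential_inequality_holds(H, T):
    """Check w(H, X_{>=t}) >= w(H_{y0}, X_{>t}) + w(H_{y1}, X_{>t}) for all t, y0 != y1."""
    d = longest_shattered(H, T)
    for t in range(T):
        for y0, y1 in combinations(sorted({h[t] for h in H}), 2):
            rhs = (weight(restrict(H, t, y0), t + 1, T, d)
                   + weight(restrict(H, t, y1), t + 1, T, d))
            if weight(H, t, T, d) < rhs:
                return False
    return True

# Toy class (an assumption for illustration): thresholds on 4 ordered points.
THRESHOLDS = {tuple(int(i >= c) for i in range(4)) for c in range(5)}
```

On the toy threshold class, only single positions are shattered under this reading, so `d` is 1 and the learner predicts the label whose version space keeps more shattered positions alive. Exact fractions are used so that the halving inequality can be checked without floating-point noise.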

### B.3 Proof of $\Omega(\log T)$ Lower Bound

###### Proof of [Lemma 4.7](https://arxiv.org/html/2605.30479#S4.Thmtheorem7).

We show the lower bound by using the property of an infinite indifferent Littlestone tree. The instance sequence we take is the sequence of all instances that label the nodes, listed in BFS order.

Then we choose the true label $Y_t$ for each instance $X_t$ as follows. We take a random walk starting from the root of the tree. For each node visited by the random walk (an in-branch node), let $Y_t$ be the label of the edge visited after that node. For each node that is not visited, we take the in-branch node that comes after it and label the node by the function agreed on by all the descendants of that in-branch node. We can always find such a function due to the indifference property.

Then we prove that this sequence forces any transductive online learning algorithm to make $\Omega(\log T)$ mistakes. Notice that on every level of this Littlestone tree there is a node (visited by the random walk) on which any algorithm makes a mistake with probability $\frac{1}{2}$. Notice also that, for every $T$, the first $T$ instances cover at least $\lfloor\log T\rfloor$ levels. Thus, there are at least $\lfloor\log T\rfloor$ nodes on which the algorithm makes a mistake with probability one half. Therefore, the expected number of mistakes is at least $\frac{\lfloor\log T\rfloor}{2}\geq\frac{\log T}{2}-1$, which is $\Omega(\log T)$. ∎
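The level count used in this proof can be sanity-checked numerically. The sketch below assumes a complete binary tree whose nodes are listed in BFS order and logarithms in base 2; the function names are hypothetical.

```python
import math

def full_levels(T):
    """Number of complete levels among the first T nodes of a binary tree in BFS order."""
    return (T + 1).bit_length() - 1  # equals floor(log2(T + 1))

def expected_mistakes_lower_bound(T):
    """One in-branch node per complete level, each costing 1/2 mistake in expectation."""
    return full_levels(T) / 2

def paper_bound(T):
    """The closed-form bound log2(T) / 2 - 1 stated at the end of the proof."""
    return math.log2(T) / 2 - 1
```

For every $T\geq 1$ the first function is at least $\lfloor\log_2 T\rfloor$, so the second dominates the closed-form bound.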

### B.4 Proof of Constant Upper Bound

###### Proof of [Lemma 4.9](https://arxiv.org/html/2605.30479#S4.Thmtheorem9).

By the Borel Determinacy Theorem, if $P_A$ does not have a winning strategy, then $P_B$ has a winning strategy. Thus, we only need to show how to build an infinite indifferent Littlestone tree from a winning strategy of $P_A$. We build the tree recursively in Breadth-First-Search (BFS) order.

Starting from the root, we label the root and its two edges by the instance $X_{t_1}$ and the two labels $\{y_{t_1}^0,y_{t_1}^1\}$ proposed by $P_A$. Then, for the $i$-th node $v_i$ at depth $k$, assume the $(i-1)$-th node is labeled by $X_{t_{i-1}}$, and $v_i$'s parent is labeled by $X_{t_{k-1}}$.

Because $P_A$ has a winning strategy, no matter which branch $P_B$ takes, $P_A$ can propose a $t_k>t_{i-1}$ and two branches $\{(y_{t_{k-1}+1},\ldots,y_{t_k-1},y_{t_k}^0),(y_{t_{k-1}+1},\ldots,y_{t_k-1},y_{t_k}^1)\}$ such that $X_{t_k}$ is shattered by $\mathcal{H}_{(X_{<t_k},Y_{<t_k})}$. Here the sequence $((X_{t_{k-1}+1},Y_{t_{k-1}+1}),\ldots,(X_{t_k-1},Y_{t_k-1}))$ is chosen by $P_A$, subject to being realizable. Thus, we use $X_{t_k}$ to label $v_i$ and $\{y_{t_k}^0,y_{t_k}^1\}$ to label the two edges connecting it to its children.

All nodes that come before $v_i$ in BFS order are labeled by an instance at or before $X_{t_{i-1}}$. Moreover, all the functions of the descendants of $X_{t_k}$ are in the concept class $\mathcal{H}_{(X_{<t_k},Y_{<t_k})}$, where $t_{i-1}<t_k$. Thus, they agree on all nodes before $v_i$ in BFS order. Therefore, the winning strategy of $P_A$ leads to an infinite indifferent Littlestone tree. ∎

## Appendix C Agnostic Setting

In this part, we show that whether $\mathcal{H}$ has an infinite indifferent LCLL tree also characterizes universal transductive online learnability in the agnostic setting. We use a learning-with-experts algorithm to handle this problem, which is a standard way to extend a learning algorithm for the realizable setting to a learning algorithm for the agnostic setting. Specifically, we use the algorithm called *Squint* from the work of Koolen and van Erven ([2015](https://arxiv.org/html/2605.30479#bib.bib26)) with non-uniform initial weights. As mentioned in Section [5](https://arxiv.org/html/2605.30479#S5), for each expert $e_{i,j}$, we set its initial weight to $\pi_{i,j}=\frac{1}{i(i+1)j(j+1)}$.

This provides the following upper bound on the regret of the algorithm:

$$\begin{aligned}
\sum_{t=1}^{T}\mathbb{I}\left[\hat{Y}_t\neq Y_t\right]-\sum_{t=1}^{T}\mathbb{I}\left[e_{i,j}(X_t)\neq Y_t\right]
&\leq O\left(\sqrt{V_{i,j}\log\frac{\log V_{i,j}}{\pi_{i,j}}+\log\frac{1}{\pi_{i,j}}}\right)\\
&=O\left(\sqrt{T\log\log T+T(\log i+\log j)+(\log i+\log j)}\right).
\end{aligned} \tag{3}$$
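The initial weights $\pi_{i,j}$ form a probability distribution over the experts, and $\log\frac{1}{\pi_{i,j}}$ grows like $\log i+\log j$, which is what the second line of (3) uses. A minimal sketch of both facts, assuming indices $i,j\geq 1$ and hypothetical function names:

```python
from fractions import Fraction
import math

def prior(i, j):
    """Initial weight pi_{i,j} = 1 / (i (i+1) j (j+1)) for expert e_{i,j}, i, j >= 1."""
    return Fraction(1, i * (i + 1) * j * (j + 1))

def partial_mass(N):
    """Total prior mass on experts with i, j <= N."""
    return sum(prior(i, j) for i in range(1, N + 1) for j in range(1, N + 1))

def log_inverse_prior(i, j):
    """log(1 / pi_{i,j}), the prior-dependent term in the regret bound."""
    return math.log(i * (i + 1) * j * (j + 1))
```

Because $\frac{1}{i(i+1)}=\frac{1}{i}-\frac{1}{i+1}$ telescopes, `partial_mass(N)` equals $(N/(N+1))^2$ and tends to $1$ as $N$ grows.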
Then we only need to build the experts and show that for any realizable sequence $(\mathbb{X},\mathbb{Y}^*)\in\mathrm{R}(\mathcal{H})$, there is an expert whose prediction at time $t$, among the times with $Y^*_t=Y_t$, differs from $Y^*_t$ for at most $O(\log T)$ values of $t\leq T$.

An expert here is an algorithm with two hardcoded inputs, $I$ and $J$. The input $I=(X_{\leq k},Y^*_{\leq k})$, for a constant $k\in\mathbb{N}$, is a hallucinated sequence for the game-updating part, and $J\subseteq\mathbb{N}$ marks the indices of the mistakes made by the realizable algorithm when $Y_t=Y^*_t$ while learning $(\mathbb{X},\mathbb{Y}^*)$ after the game-updating part. The expert is defined as follows.

**Algorithm 4** Expert $I,J$ (indexed by $e_{i,j}$).

- $U\leftarrow\{\}$.
- $t'\leftarrow 0$.
- $k\leftarrow 1$.
- $L\leftarrow\{\}$.
- **for** $t=1,2,3,\ldots$ **do**
  - **if** $t\leq k$ **then**
    - **if** $\exists\, t'\leq t_1<t_2<\cdots<t_k\leq t$ and $C_k=\{(y_{t_{k-1}^{k-1}+1,\varnothing},\ldots,y_{t_k^1,\varnothing}^{u_1},y_{t_k^1+1,u_1},\ldots,y_{t_k^2,u_1}^{u_2},\ldots,y_{t_k^k,u_{<k}}^{u_k}):u\in\{0,1\}^k\}$ such that $g_U((t_1,\dots,t_k),C_k)=(Y_{t'},\dots,Y_t)$ **then**
      - Advance the game: $U\leftarrow U\cup\{((t_1,\dots,t_k,C_k),g_U((t_1,\dots,t_k),C_k))\}$.
      - $k\leftarrow k+1$.
      - $t'\leftarrow t$.
    - **end if**
    - Predict $\hat{Y}_t=Y^*_t$.
  - **else**
    - Predict $\hat{Y}_t=\arg\max_{y}w(\mathcal{H}^{g_U}_{L\cup\{(X_t,y)\}},X_{\geq t})$.
    - **if** $t\in J$ **then**
      - $L\leftarrow L\cup\{(X_t,Y_t)\}$.
    - **end if**
  - **end if**
- **end for**

In Algorithm [4](https://arxiv.org/html/2605.30479#alg4), $Y^*_t$ is the hallucinated label stored by $I$ and $Y_t$ is the true label given by the adversary. We need the following lemma about the experts.

###### Lemma C.1.

If $\mathcal{H}$ has no infinite indifferent LCLL tree, then for every realizable sequence $(\mathbb{X},\mathbb{Y})\in\mathrm{R}(\mathcal{H})$ there is a sequence $\{j_T\}_{T\in\mathbb{N}}$ satisfying $\log j_T=O((\log T)^2)$ such that, for every large enough time $T$, there is an expert $e_{i,j}$ with $j\leq j_T$ such that $Y_t=e_{i,j}(X_t)$ for all $t\leq T$ except for at most $O(\log T)$ times.

###### Proof.

First, because $\mathcal{Y}$ is countable and $k$ is finite, the number of different $I$'s is countable. By Lemma [4.5](https://arxiv.org/html/2605.30479#S4.Thmtheorem5), for every realizable sequence $(\mathbb{X},\mathbb{Y})\in\mathrm{R}(\mathcal{H})$, the game only updates finitely many times. Thus, there is an $I=(X_{\leq k},Y_{\leq k})$ such that after time $k$ the game does not update anymore, and all experts $e_{i,*}$ make no mistake during the game-updating period.

Then we look at the second stage. For every large enough $T$, by Lemma [4.6](https://arxiv.org/html/2605.30479#S4.Thmtheorem6), there is an algorithm that makes only $O(\log T)$ mistakes. Thus, we can take $j_T=T^{O(\log T)}$, which bounds the number of subsets of $\{1,\ldots,T\}$ whose size is $O(\log T)$. So there is a $j\leq j_T$ such that the input $J$ contains all the indices of the mistakes made by Algorithm [3](https://arxiv.org/html/2605.30479#alg3) when learning $(\mathbb{X},\mathbb{Y})$ after the game updating. Therefore, for every $t\leq T$, we have $Y_t=e_{i,j}(X_t)$ if $t\notin J$. That finishes the proof. ∎
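The counting step $j_T=T^{O(\log T)}$ can be illustrated numerically: the number of candidate mistake sets $J\subseteq\{1,\ldots,T\}$ of size at most $m$ is at most $(T+1)^m$, so with $m=O(\log T)$ its logarithm is $O((\log T)^2)$. The sketch below is an illustration only; the constant `c` in front of $\log_2 T$ is a hypothetical stand-in for the constant hidden in the $O(\cdot)$.

```python
import math

def num_small_subsets(T, m):
    """Number of subsets of {1, ..., T} with at most m elements."""
    return sum(math.comb(T, s) for s in range(min(m, T) + 1))

def log2_index_budget(T, c=1):
    """log2 of the number of candidate sets J of size <= ceil(c * log2(T))."""
    m = math.ceil(c * math.log2(T))
    return math.log2(num_small_subsets(T, m))
```

For example, `log2_index_budget(T)` stays below $\lceil\log_2 T\rceil\cdot\log_2(T+1)$, matching the $O((\log T)^2)$ growth claimed for $\log j_T$.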

Then we need to extend this result to the agnostic case\. By the definition of regret, we have

Regret\(𝒜,\(𝕏,𝕐,𝕐∗\),T\)=𝔼\[∑t=1T\(𝕀\[Yt≠Y^t\]−𝕀\[Yt≠Yt∗\]\)\]\\displaystyle\\text\{Regret\}\(\\mathcal\{A\},\(\\mathbb\{X\},\\mathbb\{Y\},\\mathbb\{Y\}^\{\*\}\),T\)=\\mathbb\{E\}\\left\[\\sum\_\{t=1\}^\{T\}\\left\(\\mathbb\{I\}\\left\[Y\_\{t\}\\neq\\hat\{Y\}\_\{t\}\\right\]\-\\mathbb\{I\}\\left\[Y\_\{t\}\\neq Y^\{\*\}\_\{t\}\\right\]\\right\)\\right\]=𝔼\[∑t=1T\(𝕀\[Yt≠Y^t\]−𝕀\[Yt≠ei,j\(Xt\)\]\+𝕀\[Yt≠ei,j\(Xt\)\]−𝕀\[Yt≠Yt∗\]\)\]\\displaystyle=\\mathbb\{E\}\\left\[\\sum\_\{t=1\}^\{T\}\\left\(\\mathbb\{I\}\\left\[Y\_\{t\}\\neq\\hat\{Y\}\_\{t\}\\right\]\-\\mathbb\{I\}\\left\[Y\_\{t\}\\neq e\_\{i,j\}\(X\_\{t\}\)\\right\]\+\\mathbb\{I\}\\left\[Y\_\{t\}\\neq e\_\{i,j\}\(X\_\{t\}\)\\right\]\-\\mathbb\{I\}\\left\[Y\_\{t\}\\neq Y^\{\*\}\_\{t\}\\right\]\\right\)\\right\]=𝔼\[∑t=1T\(𝕀\[Yt≠Y^t\]−𝕀\[Yt≠ei,j\(Xt\)\]\)\]\+𝔼\[∑t=1T\(𝕀\[Yt≠ei,j\(Xt\)\]−𝕀\[Yt≠Yt∗\]\)\]\.\\displaystyle=\\mathbb\{E\}\\left\[\\sum\_\{t=1\}^\{T\}\\left\(\\mathbb\{I\}\\left\[Y\_\{t\}\\neq\\hat\{Y\}\_\{t\}\\right\]\-\\mathbb\{I\}\\left\[Y\_\{t\}\\neq e\_\{i,j\}\(X\_\{t\}\)\\right\]\\right\)\\right\]\+\\mathbb\{E\}\\left\[\\sum\_\{t=1\}^\{T\}\\left\(\\mathbb\{I\}\\left\[Y\_\{t\}\\neq e\_\{i,j\}\(X\_\{t\}\)\\right\]\-\\mathbb\{I\}\\left\[Y\_\{t\}\\neq Y^\{\*\}\_\{t\}\\right\]\\right\)\\right\]\.Notice that the first part is bounded by inequality[C](https://arxiv.org/html/2605.30479#A3.Ex39), we only need to bound the second part\. For a sequence\(𝕏,𝕐,𝕐∗\)\(\\mathbb\{X\},\\mathbb\{Y\},\\mathbb\{Y\}^\{\*\}\), we can take the subsequence\(𝕏′,𝕐′,𝕐′⁣∗\)=\{\(Xt,Yt,Yt∗\):Yt=Yt∗\}\(\\mathbb\{X\}^\{\\prime\},\\mathbb\{Y\}^\{\\prime\},\\mathbb\{Y\}^\{\\prime\*\}\)=\\\{\(X\_\{t\},Y\_\{t\},Y^\{\*\}\_\{t\}\):Y\_\{t\}=Y^\{\*\}\_\{t\}\\\}\. 
We can takeT′≤TT^\{\\prime\}\\leq T, such that𝔼\[∑t=1T\(𝕀\[Yt≠ei,j\(Xt\)\]−𝕀\[Yt≠Yt∗\]\)\]=𝔼\[∑t=1T′𝕀\[Yt′≠ei,j\(Xt′\)\]\]\\mathbb\{E\}\[\\sum\_\{t=1\}^\{T\}\\left\(\\mathbb\{I\}\\left\[Y\_\{t\}\\neq e\_\{i,j\}\(X\_\{t\}\)\\right\]\-\\mathbb\{I\}\\left\[Y\_\{t\}\\neq Y^\{\*\}\_\{t\}\\right\]\\right\)\]=\\mathbb\{E\}\[\\sum\_\{t=1\}^\{T^\{\\prime\}\}\\mathbb\{I\}\\left\[Y^\{\\prime\}\_\{t\}\\neq e\_\{i,j\}\(X^\{\\prime\}\_\{t\}\)\\right\]\]\. By Lemma[C\.1](https://arxiv.org/html/2605.30479#A3.Thmtheorem1), we know there is an expertei,je\_\{i,j\}forj<jT′<jTj<j\_\{T^\{\\prime\}\}<j\_\{T\}such that∑t=1T′𝕀\[Yt′≠ei,j\(Xt′\)\]=O\(log⁡T′\)\\sum\_\{t=1\}^\{T^\{\\prime\}\}\\mathbb\{I\}\\left\[Y^\{\\prime\}\_\{t\}\\neq e\_\{i,j\}\(X^\{\\prime\}\_\{t\}\)\\right\]=O\(\\log T^\{\\prime\}\)\. Therefore, we have𝔼\[∑t=1T\(𝕀\[Yt≠ei,j\(Xt\)\]−𝕀\[Yt≠Yt∗\]\)\]=O\(log⁡T\)\\mathbb\{E\}\[\\sum\_\{t=1\}^\{T\}\\left\(\\mathbb\{I\}\\left\[Y\_\{t\}\\neq e\_\{i,j\}\(X\_\{t\}\)\\right\]\-\\mathbb\{I\}\\left\[Y\_\{t\}\\neq Y^\{\*\}\_\{t\}\\right\]\\right\)\]=O\(\\log T\)\. Combining with the bound for the first part, we have

$$
\begin{aligned}
\text{Regret}(\mathcal{A},(\mathbb{X},\mathbb{Y},\mathbb{Y}^{*}),T)
&=O\left(\sqrt{T\log\log T+T(\log i+\log j)+(\log i+\log j)}\right)+O(\log T)\\
&=O\left(\sqrt{T\log\log T+T\log j_{T}+\log j_{T}}\right)\\
&=O(\sqrt{T}\log T).
\end{aligned}
$$

Then we show the following lower bound.

###### Lemma C\.2\.

For every concept class $\mathcal{H}$ containing two concepts $h_{1},h_{2}$ with $h_{1}(x)\neq h_{2}(x)$ for some instance $x$, and for every learning algorithm $\mathcal{A}$, there is a sequence $(\mathbb{X},\mathbb{Y},\mathbb{Y}^{*})$ such that $\text{Regret}(\mathcal{A},(\mathbb{X},\mathbb{Y},\mathbb{Y}^{*}),T)\neq o(\sqrt{T})$.

To prove this lemma, we need the following lemma\.

###### Lemma C\.3\(Khinchine’s Inequality; Lemma 8\.2 in\(Cesa\-Bianchi and Lugosi,[2006](https://arxiv.org/html/2605.30479#bib.bib27)\)\)\.

Let $T\in\mathbb{N}$, and let $\sigma_{1},\ldots,\sigma_{T}$ be random variables sampled independently and uniformly at random from $\{\pm 1\}$. We have the following inequality:

$$\mathbb{E}\left[\left|\sum_{t=1}^{T}\sigma_{t}\right|\right]\geq\sqrt{\frac{T}{2}}.$$
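As a quick sanity check of the inequality in Lemma C.3, the expectation can be computed exactly for small $T$ by summing over the number of $+1$ signs. The sketch below is illustrative only; the function names `expected_abs_sum` and `khinchine_holds` are hypothetical helpers and not part of the paper.

```python
from fractions import Fraction
from math import comb


def expected_abs_sum(T: int) -> Fraction:
    """Exact E|sigma_1 + ... + sigma_T| for i.i.d. uniform signs in {-1, +1}."""
    # If k of the T signs equal +1, the sum is k - (T - k) = 2k - T,
    # and this happens with probability C(T, k) / 2^T.
    total = sum(comb(T, k) * abs(2 * k - T) for k in range(T + 1))
    return Fraction(total, 2 ** T)


def khinchine_holds(T: int) -> bool:
    """Check E|S_T| >= sqrt(T / 2) in exact rational arithmetic."""
    # Both sides are non-negative, so it is enough to compare squares:
    # E|S_T| >= sqrt(T/2)  <=>  (E|S_T|)^2 >= T/2.
    return expected_abs_sum(T) ** 2 >= Fraction(T, 2)


if __name__ == "__main__":
    for T in (1, 2, 3, 10, 100):
        print(T, float(expected_abs_sum(T)), (T / 2) ** 0.5)
```

Comparing squares avoids floating-point square roots, which matters at $T=2$, where the two sides coincide.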

###### Proof of Lemma[C\.2](https://arxiv.org/html/2605.30479#A3.Thmtheorem2)\.

Consider the instance sequence $\mathbb{X}=\{X_{t}\}_{t\in\mathbb{N}}$ such that $X_{t}=x$ for all $t$. In the true label sequence $\mathbb{Y}=\{Y_{t}\}_{t\in\mathbb{N}}$, each $Y_{t}$ takes the value $h_{1}(x)$ or $h_{2}(x)$ independently and uniformly at random. As for $\mathbb{Y}^{*}=\{Y^{*}_{t}\}_{t\in\mathbb{N}}$, either $Y^{*}_{t}=h_{1}(x)$ for every $t$, or $Y^{*}_{t}=h_{2}(x)$ for every $t$.

Then consider the following random variable:

$$R_{i}=\mathbb{E}\left[\sum_{t=T_{i-1}+1}^{T_{i}}\mathbb{I}\left[Y_{t}\neq\hat{Y}_{t}\right]-\min_{h\in\{h_{1},h_{2}\}}\left(\sum_{t=T_{i-1}+1}^{T_{i}}\mathbb{I}\left[Y_{t}\neq h(X_{t})\right]\right)\,\middle|\,\mathbb{Y}\right],$$

where $T_{i}=\sum_{j=1}^{i}2^{4^{j}}$, so that $T_{i}-T_{i-1}=2^{4^{i}}$. Then, we consider the expectation of $R_{i}$,

$$
\begin{aligned}
\mathbb{E}\left[R_{i}\right]
&=\mathbb{E}\left[\mathbb{E}\left[\sum_{t=T_{i-1}+1}^{T_{i}}\mathbb{I}\left[Y_{t}\neq\hat{Y}_{t}\right]-\min_{h\in\{h_{1},h_{2}\}}\left(\sum_{t=T_{i-1}+1}^{T_{i}}\mathbb{I}\left[Y_{t}\neq h(X_{t})\right]\right)\,\middle|\,\mathbb{Y}\right]\right]\\
&=\mathbb{E}\left[\mathbb{E}\left[\sum_{t=T_{i-1}+1}^{T_{i}}\mathbb{I}\left[Y_{t}\neq\hat{Y}_{t}\right]\,\middle|\,\mathbb{Y}\right]\right]-\mathbb{E}\left[\min_{h\in\{h_{1},h_{2}\}}\left(\sum_{t=T_{i-1}+1}^{T_{i}}\mathbb{I}\left[Y_{t}\neq h(X_{t})\right]\right)\right]\\
&=\frac{2^{4^{i}}}{2}-\mathbb{E}\left[\min\{r_{i},2^{4^{i}}-r_{i}\}\right],
\end{aligned}
$$

where $r_{i}=\sum_{t=T_{i-1}+1}^{T_{i}}\mathbb{I}[Y_{t}\neq h_{2}(X_{t})]$. This comes from the linearity of expectation and the fact that the algorithm is independent of the true label sequence. Therefore,

$$\mathbb{E}\left[R_{i}\right]=\mathbb{E}\left[\left|\frac{2^{4^{i}}}{2}-r_{i}\right|\right]=\mathbb{E}\left[\left|\sum_{\ell=1}^{2^{4^{i}}}\left(\frac{1}{2}-\left(\frac{1}{2}+\frac{\sigma_{i}^{\ell}}{2}\right)\right)\right|\right]=\frac{1}{2}\mathbb{E}\left[\left|\sum_{\ell=1}^{2^{4^{i}}}\sigma_{i}^{\ell}\right|\right],$$

where, for $t=T_{i-1}+\ell$, we set $\sigma_{i}^{\ell}=1$ if $Y_{t}\neq h_{2}(X_{t})$ and $\sigma_{i}^{\ell}=-1$ if $Y_{t}\neq h_{1}(X_{t})$. Due to Lemma [C.3](https://arxiv.org/html/2605.30479#A3.Thmtheorem3), we have

$$\mathbb{E}\left[\left|\sum_{\ell=1}^{2^{4^{i}}}\sigma_{i}^{\ell}\right|\right]\geq\sqrt{\frac{2^{4^{i}}}{2}}.$$

Thus, $\mathbb{E}[R_{i}]\geq\sqrt{\frac{2^{4^{i}}}{8}}$. Then consider the following sequence: $Z_{i,j}=\mathbb{E}[R_{i}|Y_{\leq T_{i-1}+j}]$. Notice that for every $i$ and every $j\leq T_{i}-T_{i-1}$, we have:

$$\mathbb{E}[|Z_{i,j}|]\leq\mathbb{E}[|\mathbb{E}[R_{i}|Y_{\leq T_{i-1}+j}]|]\leq\mathbb{E}[\mathbb{E}[|R_{i}||Y_{\leq T_{i-1}+j}]]=\mathbb{E}[|R_{i}|]\leq T_{i}-T_{i-1}=2^{4^{i}}<\infty,$$

and

$$
\begin{aligned}
\mathbb{E}[Z_{i,j+1}|Y_{T_{i-1}+1},\ldots,Y_{T_{i-1}+j}]
&=\mathbb{E}[\mathbb{E}[R_{i}|Y_{\leq T_{i-1}+j+1}]|Y_{T_{i-1}+1},\ldots,Y_{T_{i-1}+j}]\\
&=\mathbb{E}[R_{i}|Y_{T_{i-1}+1},\ldots,Y_{T_{i-1}+j}]=\mathbb{E}[R_{i}|Y_{\leq T_{i-1}+j}]=Z_{i,j}.
\end{aligned}
$$

Therefore, for every $i$, the sequence $Z_{i,j}$ is a martingale indexed by $j$. By the definition of regret, we have $-1\leq Z_{i,j+1}-Z_{i,j}\leq 1$. Then we have

$$
\begin{aligned}
\mathbb{P}\left[R_{i}\leq\sqrt{\frac{2^{4^{i}}}{32}}\right]
&=\mathbb{P}\left[R_{i}\leq\sqrt{\frac{2^{4^{i}}}{8}}-\sqrt{\frac{2^{4^{i}}}{32}}\right]\\
&\leq\mathbb{P}\left[R_{i}\leq\mathbb{E}[R_{i}]-\sqrt{\frac{2^{4^{i}}}{32}}\right]\\
&\leq e^{-\frac{1}{16}}.
\end{aligned}
$$

The last inequality comes from Azuma's Inequality. Thus, we have

$$\mathbb{P}\left[R_{i}\geq\sqrt{\frac{2^{4^{i}}}{32}}\right]\geq 1-e^{-\frac{1}{16}},$$

for every $i$. Thus, for every $i$, if we choose $h^{*}_{i}=\arg\min_{h\in\{h_{1},h_{2}\}}\left(\sum_{t=T_{i-1}+1}^{T_{i}}\mathbb{I}\left[Y_{t}\neq h(X_{t})\right]\right)$ to generate $Y^{*,i}$, i.e., for every $t$, $Y^{*,i}_{t}=h^{*}_{i}(x)$, we will have

$$\text{Regret}(\mathcal{A},(\mathbb{X},\mathbb{Y},\mathbb{Y}^{*,i}),T_{i})\geq\sqrt{\frac{2^{4^{i}}}{32}}-T_{i-1}\geq 2^{4^{i-1}}\left(2^{4^{i-1}-\frac{5}{2}}-2\right)=\Omega(\sqrt{T_{i}}),$$

with probability at least $1-e^{-\frac{1}{16}}$. Then, because $h^{*}_{i}$ is either $h_{1}$ or $h_{2}$, the pigeonhole principle gives an $h^{*}$ such that $h^{*}_{i}=h^{*}$ for infinitely many $i$. Let $Y^{*}_{t}=h^{*}(x)$ for every $t$. Let $E_{i}$ be the event that $\text{Regret}(\mathcal{A},(\mathbb{X},\mathbb{Y},\mathbb{Y}^{*}),T_{i})=\Omega(\sqrt{T_{i}})$. By the definition of the limit superior of events, the fact that the indicator function takes only the values $0$ and $1$, and the reverse Fatou lemma, we have

$$\mathbb{P}\left[\limsup_{i\to\infty}E_{i}\right]=\mathbb{E}\left[\limsup_{i\to\infty}\mathbb{I}\left[E_{i}\text{ happens}\right]\right]\geq\limsup_{i\to\infty}\mathbb{E}\left[\mathbb{I}\left[E_{i}\text{ happens}\right]\right]\geq 1-e^{-\frac{1}{16}}.$$

Thus,

$$\limsup_{i\to\infty}\frac{1}{\sqrt{T_{i}}}\text{Regret}(\mathcal{A},(\mathbb{X},\mathbb{Y},\mathbb{Y}^{*}),T_{i})=c$$

for some constant $c>0$ with probability greater than $0$. Therefore,

$$\limsup_{T\to\infty}\frac{1}{\sqrt{T}}\text{Regret}(\mathcal{A},(\mathbb{X},\mathbb{Y},\mathbb{Y}^{*}),T)\geq\limsup_{i\to\infty}\frac{1}{\sqrt{T_{i}}}\text{Regret}(\mathcal{A},(\mathbb{X},\mathbb{Y},\mathbb{Y}^{*}),T_{i})=c$$

with probability greater than $0$. Thus, for any learning algorithm $\mathcal{A}$, there is a deterministic sequence $(\mathbb{X},\mathbb{Y},\mathbb{Y}^{*})$ such that $\text{Regret}(\mathcal{A},(\mathbb{X},\mathbb{Y},\mathbb{Y}^{*}),T)\neq o(\sqrt{T})$. That finishes the proof. ∎
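Two arithmetic facts used in the proof of Lemma C.2 can be checked exactly for small cases. The first is the identity $\frac{n}{2}-\mathbb{E}[\min\{r,n-r\}]=\frac{1}{2}\mathbb{E}\left[\left|\sum_{\ell}\sigma^{\ell}\right|\right]$ for a block of $n$ fair coin flips. The second is that the doubly exponential block lengths satisfy $T_{i-1}\leq 2\cdot 2^{4^{i-1}}$, which is what lets the regret on block $i$ dominate everything before it. The following is a minimal sketch; all function names are hypothetical and chosen only for illustration.

```python
from fractions import Fraction
from math import comb


def block_end(i: int) -> int:
    """T_i = sum_{j=1}^{i} 2^(4^j), with T_0 = 0."""
    return sum(2 ** (4 ** j) for j in range(1, i + 1))


def expected_block_regret(n: int) -> Fraction:
    """Exact n/2 - E[min(r, n - r)] for r ~ Binomial(n, 1/2)."""
    weighted = sum(comb(n, r) * min(r, n - r) for r in range(n + 1))
    return Fraction(n, 2) - Fraction(weighted, 2 ** n)


def half_expected_abs_sum(n: int) -> Fraction:
    """Exact (1/2) * E|sigma_1 + ... + sigma_n| for uniform signs."""
    # With k signs equal to +1 the sum is 2k - n; the extra factor 1/2
    # is folded into the denominator 2^(n + 1).
    weighted = sum(comb(n, k) * abs(2 * k - n) for k in range(n + 1))
    return Fraction(weighted, 2 ** (n + 1))


if __name__ == "__main__":
    print(expected_block_regret(16), half_expected_abs_sum(16))
    print(block_end(1), block_end(2))
```

Python integers have arbitrary precision, so `block_end` stays exact even though $2^{4^{i}}$ grows very quickly.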

## Appendix D Learnability when the Stochastic Process is Known

In this section, we discuss universal online learnability when the learner knows the stochastic process generating the instance sequence, rather than the instance sequence itself. This setting is equivalent to the question, asked in the work of Hanneke and Wang ([2024](https://arxiv.org/html/2605.30479#bib.bib17)), of when all processes admit universal online learning. Because this model introduces stochastic processes, several definitions differ slightly from those of transductive online learning with deterministic sequences. We restate the modified definitions for the stochastic process case here.

Let $(\mathcal{X},\mathcal{B})$ be a measurable space, where $\mathcal{X}$ is a non-empty set and $\mathcal{B}$ is the Borel $\sigma$-algebra generated by a separable metrizable topology $\mathcal{T}$. This is called the *instance space*. $\mathcal{Y}$ is also a non-empty measurable space, called the *label space*. Here we also focus on learning under the $0$-$1$ loss. A stochastic process $\mathbb{X}=\{X_{t}\}_{t\in\mathbb{N}}$ is a sequence of $\mathcal{X}$-valued random variables, called the instance process, and a stochastic process $\mathbb{Y}=\{Y_{t}\}_{t\in\mathbb{N}}$ is a sequence of $\mathcal{Y}$-valued random variables, called the label process. The concept class $\mathcal{H}\subseteq\mathcal{Y}^{\mathcal{X}}$ is a non-empty set of measurable functions from $\mathcal{X}$ to $\mathcal{Y}$. To avoid a complicated discussion of measurability issues, we assume the instance space $\mathcal{X}$ and the label space $\mathcal{Y}$ are both countable (countably infinite).

We also need to redefine the online learning algorithm for the stochastic process setting. The online learning rule is a sequence of measurable functions $f_{t}:\mathcal{X}^{t-1}\times\mathcal{Y}^{t-1}\times\mathcal{X}\rightarrow\mathcal{Y}$, where $t$ is a non-negative integer. For convenience, we also define $\hat{h}_{t-1}=f_{t}(X_{<t},Y_{<t})$, where $(X_{<t},Y_{<t})=\{(X_{i},Y_{i})\}_{i<t}$ is the history before round $t$.

Then we need to define the realizable setting, which follows the tradition of universal learning\.

###### Definition D\.1\.

For every concept class $\mathcal{H}$, we can define the following set of processes $\text{R}(\mathcal{H})$:

$$\text{R}(\mathcal{H}):=\left\{(\mathbb{X},\mathbb{Y})=\left\{(X_{i},Y_{i})\right\}_{i\in\mathbb{N}}:\text{with probability }1,\ \forall n<\infty,\ \left\{(X_{i},Y_{i})\right\}_{i\leq n}\text{ realizable by }\mathcal{H}\right\}.$$

In the same way, we define the set of realizable label processes:

###### Definition D\.2\.

For every concept class $\mathcal{H}$ and data process $\mathbb{X}$, define a set $\text{R}(\mathcal{H},\mathbb{X})$ of label processes:

$$\text{R}(\mathcal{H},\mathbb{X}):=\left\{\mathbb{Y}=\left\{Y_{i}\right\}_{i\in\mathbb{N}}:(\mathbb{X},\mathbb{Y})\in\text{R}(\mathcal{H})\text{ and }\exists\text{ a non-random function }f\text{ s.t. }Y_{i}=f(X_{i})\right\}.$$

In other words, $\text{R}(\mathcal{H},\mathbb{X})$ consists of the label processes $\mathbb{Y}=f(\mathbb{X})$ such that $(\mathbb{X},f(\mathbb{X}))\in\text{R}(\mathcal{H})$. Importantly, while every $f\in\mathcal{H}$ satisfies $f(\mathbb{X})\in\text{R}(\mathcal{H},\mathbb{X})$, there can exist $f\notin\mathcal{H}$ for which this is also true, due to $\text{R}(\mathcal{H})$ only requiring realizable *prefixes* (thus, in a sense, $\text{R}(\mathcal{H},\mathbb{X})$ represents label sequences by functions in a *closure* of $\mathcal{H}$ defined by $\mathbb{X}$). For instance, for $\mathcal{X}=\mathbb{N}$, for the process $X_{i}=i$, and for $\mathcal{H}=\{\mathbb{1}_{\{i\}}:i\in\mathcal{X}\}$ (singletons), the all-$0$ sequence is in $\text{R}(\mathcal{H},\mathbb{X})$ though the all-$0$ function is not in $\mathcal{H}$.
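The singleton example can be made concrete with a small check: every finite all-$0$ prefix on $X_{i}=i$ is matched by some singleton, even though no singleton equals the all-$0$ function. The sketch below is illustrative; `singleton` and `prefix_realizable` are hypothetical helper names, and the search bound relies on the observation stated in the comment.

```python
def singleton(i):
    """The concept 1_{i}: label 1 at instance i and 0 everywhere else."""
    return lambda x: 1 if x == i else 0


def prefix_realizable(labels):
    """Check whether some singleton matches `labels` on X_1 = 1, ..., X_n = n."""
    n = len(labels)
    # Any singleton 1_{i} with i > n is identically 0 on {1, ..., n}, so it
    # behaves exactly like 1_{n+1} there. Searching i = 1, ..., n + 1 is
    # therefore exhaustive for this prefix.
    return any(
        all(singleton(i)(x) == labels[x - 1] for x in range(1, n + 1))
        for i in range(1, n + 2)
    )


if __name__ == "__main__":
    print(prefix_realizable([0] * 10))   # all-0 prefix of length 10
    print(prefix_realizable([0, 1, 0]))  # matched by the singleton at 2
    print(prefix_realizable([1, 1]))     # two 1-labels: no singleton fits
```

Each all-$0$ prefix of length $n$ is witnessed by the singleton at $n+1$, and the witness changes with $n$. That changing witness is why the limit sequence lies only in the closure.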

Then we define universal consistency under $\mathbb{X}$ and $\mathcal{H}$ in the realizable case. An online learning rule is universally consistent under $\mathbb{X}$ and $\mathcal{H}$ if its long-run average loss approaches $0$ almost surely for all realizable label processes. Formally,

###### Definition D\.3\.

An online learning rule is *universally consistent* under $\mathbb{X}$ and $\mathcal{H}$ for the realizable case, if for *every* $\mathbb{Y}\in\text{R}(\mathcal{H},\mathbb{X})$, $\limsup_{T\rightarrow\infty}\frac{1}{T}\sum_{t=1}^{T}\mathbb{I}\left[Y_{t}\neq\hat{h}_{t-1}(X_{t})\right]=0$ a.s.

We also define universal consistency under $\mathbb{X}$ and $\mathcal{H}$ for the agnostic case. Here, we relax the restriction that $\mathbb{Y}\in\text{R}(\mathcal{H},\mathbb{X})$; instead, the label process $\mathbb{Y}$ can be set in any possible way, even depending on the history of the algorithm's predictions. Thus, the number of mistakes may grow linearly, which makes the average loss inappropriate for defining consistency. Therefore, we compare the performance of our algorithm with the performance of the best possible $\mathbb{Y}^{*}\in\text{R}(\mathcal{H},\mathbb{X})$; the difference is usually referred to as *regret*. We say an online algorithm is universally consistent under $\mathbb{X}$ and $\mathcal{H}$ for the agnostic case if its long-run average regret is non-positive for every label process. Formally,

###### Definition D\.4\.

An online learning rule is *universally consistent* under $\mathbb{X}$ and $\mathcal{H}$ for the agnostic case, if for *every* $\mathbb{Y}^{*}\in\text{R}(\mathcal{H},\mathbb{X})$ and for *every* $\mathbb{Y}$, $\limsup_{n\rightarrow\infty}\frac{1}{n}\sum_{t=1}^{n}\left(\mathbb{I}\left[Y_{t}\neq\hat{h}_{t-1}(X_{t})\right]-\mathbb{I}\left[Y_{t}\neq Y^{*}_{t}\right]\right)\leq 0$ a.s.
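For a finite horizon, the quantity inside the $\limsup$ of Definition D.4 is just a difference of two mistake counts divided by $n$. The following sketch computes it for explicit finite sequences. The function name `average_regret` and the toy sequences are assumptions made for illustration, and a finite computation of course says nothing about the limit itself.

```python
def average_regret(labels, predictions, reference):
    """(1/n) * sum_t (1[Y_t != Yhat_t] - 1[Y_t != Y*_t]) over a finite horizon."""
    n = len(labels)
    if n == 0 or len(predictions) != n or len(reference) != n:
        raise ValueError("all three sequences must share the same positive length")
    # Booleans act as 0/1 integers, so (y != p) - (y != r) is the
    # per-round difference of the two indicator losses.
    total = sum((y != p) - (y != r) for y, p, r in zip(labels, predictions, reference))
    return total / n


if __name__ == "__main__":
    labels = [0, 1, 1, 0]
    predictions = [0, 0, 1, 1]   # two mistakes
    reference = [0, 1, 0, 0]     # one mistake
    print(average_regret(labels, predictions, reference))
```

The value can be negative when the learner beats the reference sequence, which is why the definition asks for $\leq 0$ rather than $=0$.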

Then we provide the following definition to describe our main theorem in this section\.

###### Definition D\.5\.

We say a process $\mathbb{X}$ admits *universal online learning* if there exists an online learning rule that is universally consistent under $\mathbb{X}$ and $\mathcal{H}$.

Then comes the main theorem of this section\.

###### Theorem D\.6\.

*All* processes admit universal online learning if and only if $\mathcal{H}$ has no infinite indifferent LCLL tree.

The necessity direction of this theorem is shown in Lemma [4.1](https://arxiv.org/html/2605.30479#S4.Thmtheorem1), so we only need to prove sufficiency. The proof of sufficiency is very similar to the proof of Lemma [4.2](https://arxiv.org/html/2605.30479#S4.Thmtheorem2). The algorithm again contains two stages. First, we update the LCLL game, and after a finite number of updates, we reach a condition where, with probability $1$, the length of the longest subsequence shattered by the concept class $\mathcal{H}^{\prime}$ is at most $d$. Second, we design a learning algorithm that learns the concept class $\mathcal{H}^{\prime}$ with at most $o(T)$ mistakes. Algorithm [5](https://arxiv.org/html/2605.30479#alg5) is our algorithm. The only change from Algorithm [1](https://arxiv.org/html/2605.30479#alg1) is the prediction rule, which we modify to suit the stochastic process setting.

**Algorithm 5** Learning algorithm from winning strategy

- $k=1$, $U=\{\}$, $t^{\prime}\leftarrow 0$, $t_{0}\leftarrow 0$.
- **for** $t=1,2,3,\dots$ **do**
  - **if** $\exists\, t^{\prime}\leq t_{1}<t_{2}<\cdots<t_{k}\leq t$ and $C_{k}=\{(y_{t_{k-1}^{k-1}+1,\varnothing},\ldots,y_{t_{k}^{1},\varnothing}^{u_{1}},y_{t_{k}^{1}+1,u_{1}},\ldots,y_{t_{k}^{2},u_{1}}^{u_{2}}\ldots,y_{t_{k}^{k},u_{<k}}^{u_{k}}):u\in\{0,1\}^{k}\}$ such that $g_{U}((t_{1},\dots,t_{k}),C_{k})=(Y_{t^{\prime}},\dots,Y_{t})$ **then**
    - Advance the game: $U\leftarrow U\cup\{((t_{1},\dots,t_{k},C_{k}),g_{U}((t_{1},\dots,t_{k}),C_{k}))\}$.
    - $k\leftarrow k+1$.
    - $L\leftarrow\emptyset$.
    - $m\leftarrow 1$.
    - $t^{\prime}\leftarrow t$.
  - **end if**
  - Predict $\hat{Y}_{t}=\arg\min_{y}\Pr\left[w(\mathcal{H}^{g_{U}}_{L\cup\{(X_{t},y)\}},X_{[t,t(m)+t^{\prime}]})\leq\frac{1}{2}w(\mathcal{H}^{g_{U}}_{L},X_{[t,t(m)+t^{\prime}]})\,\middle|\,X_{\leq t}\right]$.
  - **if** $Y_{t}\neq\hat{Y}_{t}$ **then**
    - $L\leftarrow L\cup\{(X_{t},Y_{t})\}$.
  - **end if**
  - **if** $t\geq\frac{m(m+1)}{2}+t^{\prime}$ **then**
    - $m\leftarrow m+1$.
  - **end if**
- **end for**

As the first stage of the algorithm is exactly the same, Lemma [4.5](https://arxiv.org/html/2605.30479#S4.Thmtheorem5) still holds. We only need to show that we can learn the partial concept class with $o(T)$ mistakes, which is Lemma [D.7](https://arxiv.org/html/2605.30479#A4.Thmtheorem7).

**Algorithm 6** Subroutine for learning under the condition that the length of the longest subsequence shattered by $\mathcal{H}$ is at most $d$.

- $L\leftarrow\emptyset$.
- $m\leftarrow 1$.
- **for** $t=1,2,3,\dots$ **do**
  - Predict $\hat{Y}_{t}=\arg\min_{y}\Pr\left[w(\mathcal{H}_{L\cup\{(X_{t},y)\}},X_{[t,t(m)]})\leq\frac{1}{2}w(\mathcal{H}_{L},X_{[t,t(m)]})\,\middle|\,X_{\leq t}\right]$.
  - **if** $Y_{t}\neq\hat{Y}_{t}$ **then**
    - $L\leftarrow L\cup\{(X_{t},Y_{t})\}$.
  - **end if**
  - **if** $t\geq\frac{m(m+1)}{2}$ **then**
    - $m\leftarrow m+1$.
  - **end if**
- **end for**
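The counter $m$ in Algorithm 6 partitions time into batches of growing length. A short simulation of the update rule $t\geq\frac{m(m+1)}{2}\Rightarrow m\leftarrow m+1$ shows which value of $m$ is in force when each round's prediction is made. This is only a sketch of the scheduling logic; the function name `batch_schedule` is hypothetical and the prediction step itself is omitted.

```python
from collections import Counter


def batch_schedule(horizon):
    """Return the value of m in force when predicting at rounds t = 1..horizon."""
    m = 1
    used = []
    for t in range(1, horizon + 1):
        used.append(m)  # the prediction at round t is made with the current m
        if t >= m * (m + 1) // 2:
            m += 1      # round t closed batch m, so move on to the next batch
    return used


if __name__ == "__main__":
    schedule = batch_schedule(10)
    print(schedule)
    print(Counter(schedule))
```

Under this rule, batch $k$ covers rounds $\frac{k(k-1)}{2}+1,\ldots,\frac{k(k+1)}{2}$ and so has length $k$, matching the batches $W_{k}$ used in the analysis.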

###### Lemma D\.7\.

For any process $\mathbb{X}$, if the length of the longest subsequence shattered by $\mathcal{H}^{\prime}$ is at most $d$ with probability $1$, Algorithm [6](https://arxiv.org/html/2605.30479#alg6) makes only $o(T)$ mistakes almost surely as $T\rightarrow\infty$.

To prove the property of the algorithm, we first need a lemma about the prediction rule\.

###### Lemma D\.8\.

Let $w(\mathcal{H}^{\prime},X_{\leq T})=|\{S:S\subseteq\{X_{i}\}_{i\leq T}\text{ such that }S\text{ is shattered by }\mathcal{H}^{\prime}\}|$. Then, for every sequence $\mathbb{X}$ and every partial concept class $\mathcal{H}$, there is at most one $y$ such that

$$\Pr\left[w(\mathcal{H}_{L\cup\{(X_{t},y)\}},X_{[t,t(m)]})\leq\frac{1}{2}w(\mathcal{H}_{L},X_{[t,t(m)]})\,\middle|\,X_{\leq t}\right]<\frac{1}{2}.$$

###### Proof\.

We can prove this lemma by contradiction. First, we have the following fact: for every pair $(y,y^{*})$ such that $y\neq y^{*}$, and every sequence $\mathbb{X}$, we have:

$$w(\mathcal{H}_{L\cup\{(X_{t},y^{*})\}},X_{[t,t(m)]})+w(\mathcal{H}_{L\cup\{(X_{t},y)\}},X_{[t,t(m)]})\leq w(\mathcal{H}_{L},X_{[t,t(m)]}).$$

This is because if a $k^{\prime}$-tuple $(X_{t_{1}},\ldots,X_{t_{k^{\prime}}})$ is shattered by both $\mathcal{H}_{L\cup\{(X_{t},y^{*})\}}$ and $\mathcal{H}_{L\cup\{(X_{t},y)\}}$, then the $(k^{\prime}+1)$-tuple $(X_{t},X_{t_{1}},\ldots,X_{t_{k^{\prime}}})$ is shattered by $\mathcal{H}_{L}$. Thus, every tuple that contributes twice to the left-hand side also contributes twice to the right-hand side. Therefore, the inequality holds.

Then, suppose there are $y^{\prime},y^{\prime\prime}$ with $y^{\prime}\neq y^{\prime\prime}$ such that

$$\Pr\left[w(\mathcal{H}_{L\cup\{(X_{t},y^{\prime})\}},X_{[t,t(m)]})\leq\frac{1}{2}w(\mathcal{H}_{L},X_{[t,t(m)]})\,\middle|\,X_{\leq t}\right]<\frac{1}{2}$$

and

$$\Pr\left[w(\mathcal{H}_{L\cup\{(X_{t},y^{\prime\prime})\}},X_{[t,t(m)]})\leq\frac{1}{2}w(\mathcal{H}_{L},X_{[t,t(m)]})\,\middle|\,X_{\leq t}\right]<\frac{1}{2}.$$

Each of the two complementary events then has conditional probability greater than $\frac{1}{2}$, so they occur simultaneously with probability greater than $0$. That is, with probability greater than $0$, we have

$$w(\mathcal{H}_{L\cup\{(X_{t},y^{\prime})\}},X_{[t,t(m)]})>\frac{1}{2}w(\mathcal{H}_{L},X_{[t,t(m)]}),$$

and

$$w(\mathcal{H}_{L\cup\{(X_{t},y^{\prime\prime})\}},X_{[t,t(m)]})>\frac{1}{2}w(\mathcal{H}_{L},X_{[t,t(m)]}).$$

Thus,

$$w(\mathcal{H}_{L\cup\{(X_{t},y^{\prime})\}},X_{[t,t(m)]})+w(\mathcal{H}_{L\cup\{(X_{t},y^{\prime\prime})\}},X_{[t,t(m)]})>w(\mathcal{H}_{L},X_{[t,t(m)]}).$$

That is a contradiction, and we finish the proof. ∎
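The counting fact behind Lemma D.8 can be checked by brute force in a simplified setting. The sketch below assumes binary labels, total (not partial) concepts, and the standard VC notion of shattering, which may differ from the paper's definition for partial multiclass concept classes. All function names are hypothetical. It verifies that restricting a class to the two possible labels at a point $x$ gives shattered-set counts that sum to at most the count for the unrestricted class.

```python
from itertools import combinations
import random


def is_shattered(concepts, subset):
    """Binary VC shattering: all 2^|subset| label patterns are realized."""
    patterns = {tuple(h[x] for x in subset) for h in concepts}
    return len(patterns) == 2 ** len(subset)


def weight(concepts, points):
    """Number of subsets of `points` (including the empty set) shattered by `concepts`."""
    return sum(
        is_shattered(concepts, subset)
        for size in range(len(points) + 1)
        for subset in combinations(points, size)
    )


def restrict(concepts, x, y):
    """Concepts consistent with the labeled example (x, y)."""
    return [h for h in concepts if h[x] == y]


def split_inequality_holds(concepts, points, x):
    """Check w(H restricted to (x,0)) + w(H restricted to (x,1)) <= w(H)."""
    w0 = weight(restrict(concepts, x, 0), points)
    w1 = weight(restrict(concepts, x, 1), points)
    return w0 + w1 <= weight(concepts, points)


def random_class(num_points, num_concepts, rng):
    """A random set of binary concepts, each a tuple indexed by point 0..num_points-1."""
    return list({tuple(rng.randint(0, 1) for _ in range(num_points)) for _ in range(num_concepts)})


if __name__ == "__main__":
    rng = random.Random(0)
    points = list(range(4))
    ok = all(split_inequality_holds(random_class(4, 8, rng), points, 0) for _ in range(200))
    print(ok)
```

In this binary setting the two restrictions exhaust the label space, so the check is the familiar splitting step from the Sauer-Shelah-Pajor argument. The lemma in the text applies the same counting to any two distinct labels $y,y^{*}$.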

We can now prove Lemma [D.7](https://arxiv.org/html/2605.30479#A4.Thmtheorem7). Although the proof is very similar to the one in the work of Hanneke and Wang ([2024](https://arxiv.org/html/2605.30479#bib.bib17)), we provide it for completeness.

###### Proof of Lemma[D\.7](https://arxiv.org/html/2605.30479#A4.Thmtheorem7)\.

As we defined, the weight function is $w(\mathcal{H}^{\prime},X_{\leq T})=|\{S:S\subseteq\{X_{i}\}_{i\leq T}\text{ such that }S\text{ is shattered by }\mathcal{H}^{\prime}\}|$, which is the number of subsequences of the sequence $X_{\leq T}$ that can be shattered by the concept class $\mathcal{H}^{\prime}$.

Consider the $k$-th batch, consisting of $W_{k}=\{X_{\frac{k(k-1)}{2}+1},\cdots,X_{\frac{k(k+1)}{2}}\}$. Let

$$Z_{k}=\sum_{t=\frac{k(k-1)}{2}+1}^{\frac{k(k+1)}{2}}\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right],$$

and

$$V_{k}=Z_{k}-\mathbb{E}\left[Z_{k}\,\middle|\,X_{\leq\frac{k(k-1)}{2}}\right].$$

Notice that

$$
\begin{aligned}
\mathbb{E}\left[V_{k}\,\middle|\,X_{\leq\frac{k(k-1)}{2}}\right]
&=\mathbb{E}\left[Z_{k}-\mathbb{E}\left[Z_{k}\,\middle|\,X_{\leq\frac{k(k-1)}{2}}\right]\,\middle|\,X_{\leq\frac{k(k-1)}{2}}\right]\\
&=\mathbb{E}\left[Z_{k}\,\middle|\,X_{\leq\frac{k(k-1)}{2}}\right]-\mathbb{E}\left[Z_{k}\,\middle|\,X_{\leq\frac{k(k-1)}{2}}\right]=0\quad\text{(a.s.)}
\end{aligned}
$$
Thus, the sequence $V_{k}$ is a martingale difference sequence with respect to the block sequence $W_{1},W_{2},\cdots$. By the definition of $V_{k}$, we also have $-k\leq V_{k}\leq k$. Then by Azuma's Inequality, with probability $1-\frac{1}{K^{2}}$, we have

$$
\begin{aligned}
\sum_{k=1}^{K}Z_{k}
&\leq\sum_{k=1}^{K}\mathbb{E}\left[Z_{k}\,\middle|\,X_{\leq\frac{k(k-1)}{2}}\right]+\sqrt{-\log\left(\frac{1}{K^{2}}\right)\cdot 2\cdot\left(\sum_{k=1}^{K}k^{2}\right)}\\
&\leq\sum_{k=1}^{K}\mathbb{E}\left[Z_{k}\,\middle|\,X_{\leq\frac{k(k-1)}{2}}\right]+\sqrt{4K^{3}\log K}.
\end{aligned}
$$
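The simplification of the deviation term uses $-\log(1/K^{2})=2\log K$ and $\sum_{k=1}^{K}k^{2}\leq K^{3}$. After $K$ batches the number of rounds is $T=\frac{K(K+1)}{2}$, so a deviation of order $\sqrt{K^{3}\log K}$ is $o(T)$. The following numerical sketch illustrates both points; the function names are hypothetical, and natural logarithms are assumed.

```python
import math


def azuma_deviation(K: int) -> float:
    """sqrt(-log(1/K^2) * 2 * sum_{k=1}^{K} k^2), the first deviation term."""
    sum_of_squares = K * (K + 1) * (2 * K + 1) // 6
    return math.sqrt(2.0 * math.log(K ** 2) * sum_of_squares)


def simplified_bound(K: int) -> float:
    """sqrt(4 K^3 log K), the simplified deviation term."""
    return math.sqrt(4.0 * K ** 3 * math.log(K))


def rounds(K: int) -> int:
    """Number of rounds T covered by the first K batches."""
    return K * (K + 1) // 2


if __name__ == "__main__":
    for K in (10, 100, 1000, 10 ** 4, 10 ** 6):
        print(K, azuma_deviation(K), simplified_bound(K), simplified_bound(K) / rounds(K))
```

The printed ratio of the deviation term to $T$ shrinks as $K$ grows, consistent with the deviation being sublinear in the number of rounds.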
Then we need an upper bound for $\mathbb{E}\left[Z_{k}\,\middle|\,X_{\leq\frac{k(k-1)}{2}}\right]$. According to the prediction rule and Lemma [D.8](https://arxiv.org/html/2605.30479#A4.Thmtheorem8), every time we make a mistake, we have

$$\Pr\left[w(\mathcal{H}_{L\cup\{(X_{t},Y_{t})\}},X_{[t,\frac{k(k+1)}{2}]})\leq\frac{1}{2}w(\mathcal{H}_{L},X_{[t,\frac{k(k+1)}{2}]})\,\middle|\,X_{\leq t}\right]\geq\frac{1}{2}.\tag{4}$$

Due to the linearity of expectation, for every $k$,

$$
\begin{aligned}
&\mathbb{E}\left[\sum_{t=\frac{k(k-1)}{2}}^{\frac{k(k+1)}{2}}\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right]\,\middle|\,X_{\leq\frac{k(k-1)}{2}}\right]\\
&=\mathbb{E}\left[\sum_{t=\frac{k(k-1)}{2}}^{\frac{k(k+1)}{2}}\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right]\mathbb{I}\left[w(\mathcal{H}_{L_{t}},X_{[t+1,\frac{k(k+1)}{2}]})\leq\frac{1}{2}w(\mathcal{H}_{L_{t-1}},X_{[t,\frac{k(k+1)}{2}]})\right]\,\middle|\,X_{\leq\frac{k(k-1)}{2}}\right]\\
&\quad+\mathbb{E}\left[\sum_{t=\frac{k(k-1)}{2}}^{\frac{k(k+1)}{2}}\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right]\mathbb{I}\left[w(\mathcal{H}_{L_{t}},X_{[t+1,\frac{k(k+1)}{2}]})>\frac{1}{2}w(\mathcal{H}_{L_{t-1}},X_{[t,\frac{k(k+1)}{2}]})\right]\,\middle|\,X_{\leq\frac{k(k-1)}{2}}\right].
\end{aligned}
$$

Here $L_{t}=\{(X_{i},Y_{i}):i\leq t\text{ and the algorithm makes a mistake at round }i\}$.

Notice that the first part is the expected number of mistakes that each decrease the weight by at least half. For every realization $x_{[\frac{k(k-1)}{2},\frac{k(k+1)}{2}]}$ of $X_{[\frac{k(k-1)}{2},\frac{k(k+1)}{2}]}$, let

$$u(k)=\sum_{t=\frac{k(k-1)}{2}}^{\frac{k(k+1)}{2}}\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right]\mathbb{I}\left[w(\mathcal{H}_{L_{t}},x_{[t+1,\frac{k(k+1)}{2}]})\leq\frac{1}{2}w(\mathcal{H}_{L_{t-1}},x_{[t,\frac{k(k+1)}{2}]})\right].$$

By the definition of the weight function and the fact that the length of the longest subsequence shattered by $\mathcal{H}^{\prime}$ is at most $d$, we have $w(\mathcal{H}_{L_{\frac{k(k-1)}{2}}},x_{[\frac{k(k-1)}{2},\frac{k(k+1)}{2}]})\leq k^{d}$. Consider the last round $t\leq\frac{k(k+1)}{2}$ such that $\hat{Y}_{t}\neq Y_{t}$. We have $w(\mathcal{H}_{L_{t-1}},x_{[t,\frac{k(k+1)}{2}]})\geq 1$, as $\{x_{t}\}$ must be shattered. Thus, we have $2^{u(k)-1}w(\mathcal{H}_{L_{t-1}},x_{[t,\frac{k(k+1)}{2}]})\leq w(\mathcal{H},x_{[\frac{k(k-1)}{2},\frac{k(k+1)}{2}]})$. Therefore, $u(k)\leq d\log k+1$ for every realization $x_{[\frac{k(k-1)}{2},\frac{k(k+1)}{2}]}$. Thus,

$$\mathbb{E}\left[\sum_{t=\frac{k(k-1)}{2}}^{\frac{k(k+1)}{2}}\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right]\mathbb{I}\left[w(\mathcal{H}_{L_{t}},X_{[t+1,\frac{k(k+1)}{2}]})\leq\frac{1}{2}w(\mathcal{H}_{L_{t-1}},X_{[t,\frac{k(k+1)}{2}]})\right]\,\middle|\,X_{\leq\frac{k(k-1)}{2}}\right]\leq 2d\log k.\tag{5}$$
Next, consider the second part. We have

$$
\begin{aligned}
&\mathbb{E}\left[\sum_{t=\frac{k(k-1)}{2}}^{\frac{k(k+1)}{2}}\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right]\mathbb{I}\left[w(\mathcal{H}_{L_{t}},X_{[t+1,\frac{k(k+1)}{2}]})>\frac{1}{2}w(\mathcal{H}_{L_{t-1}},X_{[t,\frac{k(k+1)}{2}]})\right]\,\middle|\,X_{\leq\frac{k(k-1)}{2}}\right]\\
&=\mathbb{E}\left[\mathbb{E}\left[\sum_{t=\frac{k(k-1)}{2}}^{\frac{k(k+1)}{2}}\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right]\mathbb{I}\left[w(\mathcal{H}_{L_{t}},X_{[t+1,\frac{k(k+1)}{2}]})>\frac{1}{2}w(\mathcal{H}_{L_{t-1}},X_{[t,\frac{k(k+1)}{2}]})\right]\,\middle|\,X_{\leq t}\right]\,\middle|\,X_{\leq\frac{k(k-1)}{2}}\right]\\
&=\mathbb{E}\left[\sum_{t=\frac{k(k-1)}{2}}^{\frac{k(k+1)}{2}}\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right]\mathbb{E}\left[\mathbb{I}\left[w(\mathcal{H}_{L_{t}},X_{[t+1,\frac{k(k+1)}{2}]})>\frac{1}{2}w(\mathcal{H}_{L_{t-1}},X_{[t,\frac{k(k+1)}{2}]})\right]\,\middle|\,X_{\leq t}\right]\,\middle|\,X_{\leq\frac{k(k-1)}{2}}\right].
\end{aligned}
$$

This is because $\hat{Y}_{t}$ and $Y_{t}$ only depend on $X_{\leq t}$. Due to equation [4](https://arxiv.org/html/2605.30479#A4.E4), we have

$$\begin{aligned}
&\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right]\mathbb{E}\left[\left.\mathbb{I}\left[w(\mathcal{H}_{L_{t}},X_{[t+1,\frac{k(k+1)}{2}]})>\frac{1}{2}w(\mathcal{H}_{L_{t-1}},X_{[t,\frac{k(k+1)}{2}]})\right]\right|X_{\leq t}\right]\\
&=\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right]\Pr\left[\left.w(\mathcal{H}_{L_{t}},X_{[t+1,\frac{k(k+1)}{2}]})>\frac{1}{2}w(\mathcal{H}_{L_{t-1}},X_{[t,\frac{k(k+1)}{2}]})\right|X_{\leq t}\right]\\
&\leq\frac{1}{2}\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right].
\end{aligned}$$

Thus,

$$\begin{aligned}
&\mathbb{E}\left[\left.\sum_{t=\frac{k(k-1)}{2}}^{\frac{k(k+1)}{2}}\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right]\mathbb{I}\left[w(\mathcal{H}_{L_{t}},X_{[t+1,\frac{k(k+1)}{2}]})>\frac{1}{2}w(\mathcal{H}_{L_{t-1}},X_{[t,\frac{k(k+1)}{2}]})\right]\right|X_{\leq\frac{k(k-1)}{2}}\right]\\
&\leq\frac{1}{2}\mathbb{E}\left[\left.\sum_{t=\frac{k(k-1)}{2}}^{\frac{k(k+1)}{2}}\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right]\right|X_{\leq\frac{k(k-1)}{2}}\right].
\end{aligned} \tag{6}$$

Combining these two inequalities ([5](https://arxiv.org/html/2605.30479#A4.E5) and [6](https://arxiv.org/html/2605.30479#A4.E6)), we have

$$\mathbb{E}\left[\left.\sum_{t=\frac{k(k-1)}{2}}^{\frac{k(k+1)}{2}}\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right]\right|X_{\leq\frac{k(k-1)}{2}}\right]\leq 2d\log k+\frac{1}{2}\mathbb{E}\left[\left.\sum_{t=\frac{k(k-1)}{2}}^{\frac{k(k+1)}{2}}\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right]\right|X_{\leq\frac{k(k-1)}{2}}\right]. \tag{7}$$

Thus, for any $k$, we have

$$\mathbb{E}\left[\left.\sum_{t=\frac{k(k-1)}{2}}^{\frac{k(k+1)}{2}}\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right]\right|X_{\leq\frac{k(k-1)}{2}}\right]\leq 4d\log k. \tag{8}$$

By inequality [8](https://arxiv.org/html/2605.30479#A4.E8), for every $k$, $\mathbb{E}\left[\left.Z_{k}\right|X_{\leq\frac{k(k-1)}{2}}\right]\leq 4d\log k$. Thus, with probability at least $1-\frac{1}{K^{2}}$,

$$\sum_{k=1}^{K}Z_{k}\leq\sum_{k=1}^{K}4d\log k+\sqrt{4K^{3}\log K}\leq 4dK\log K+\sqrt{4K^{3}\log K}.$$

By the definition of $Z_{k}$, with probability at least $1-\frac{1}{K^{2}}$,

$$\sum_{t=1}^{\frac{K(K+1)}{2}}\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right]\leq 4dK\log K+\sqrt{4K^{3}\log K}\leq(4d+2)\sqrt{K^{3}\log K}. \tag{9}$$

Let $T_{K}=\frac{K(K+1)}{2}$ be the number of instances in the sequence. Then, with probability at least $1-\frac{1}{K^{2}}$,

$$\sum_{t=1}^{T_{K}}\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right]\leq(4d+2)T_{K}^{\frac{3}{4}}\sqrt{\frac{1}{2}\log T_{K}}. \tag{10}$$

Define $E_{K}$ as the event that, in the sequence $X_{\leq T_{K}}$, $\sum_{t=1}^{T_{K}}\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right]>(4d+2)T_{K}^{\frac{3}{4}}\sqrt{\frac{1}{2}\log T_{K}}$. Then $\Pr[E_{K}]\leq\frac{1}{K^{2}}$. Note that for any $K\in\mathbb{N}$, $\sum_{k=1}^{K}\frac{1}{k^{2}}\leq\frac{\pi^{2}}{6}$. By the Borel-Cantelli lemma, with probability $1$, the bound $\sum_{t=1}^{T_{K}}\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right]\leq(4d+2)T_{K}^{\frac{3}{4}}\sqrt{\frac{1}{2}\log T_{K}}$ holds for every large enough $T_{K}=\frac{K(K+1)}{2}$.

Then for any large enough $T$, we have $T_{K}\leq T\leq T_{K+1}\leq 2T$. Thus, with probability $1$,

$$\begin{align}
\sum_{t=1}^{T}\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right]&\leq(4d+2)T_{K+1}^{\frac{3}{4}}\sqrt{\frac{1}{2}\log T_{K+1}} \tag{11}\\
&\leq(4d+2)(2T)^{\frac{3}{4}}\sqrt{\frac{1}{2}\log 2T}. \tag{12}
\end{align}$$

Therefore, for any large enough $T$ and a universal constant $c$, with probability $1$,

$$\sum_{t=1}^{T}\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right]\leq cT^{\frac{3}{4}}\sqrt{\log T}. \tag{13}$$

Notice that $cT^{\frac{3}{4}}\sqrt{\log T}$ is $o(T)$. That finishes the proof. ∎
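As a numerical sanity check of the last two steps (a sketch, not part of the proof), the snippet below verifies the sandwich $T_{K}\leq T\leq T_{K+1}\leq 2T$ over a range of $T$ and evaluates the right-hand side of (12) divided by $T$ to illustrate that it shrinks as $T$ grows. The helper names `triangular`, `block_of`, `bound_12`, and `sandwich_holds` are hypothetical, and $d=1$ is an arbitrary illustrative choice.

```python
import math


def triangular(K: int) -> int:
    """T_K = K(K+1)/2, the number of instances after K blocks."""
    return K * (K + 1) // 2


def block_of(T: int) -> int:
    """Largest K with T_K <= T (requires T >= 1)."""
    return (math.isqrt(8 * T + 1) - 1) // 2


def bound_12(T: int, d: int) -> float:
    """Right-hand side of (12): (4d+2) (2T)^(3/4) sqrt(0.5 log(2T))."""
    return (4 * d + 2) * (2 * T) ** 0.75 * math.sqrt(0.5 * math.log(2 * T))


def sandwich_holds(T: int) -> bool:
    """Check T_K <= T <= T_{K+1} <= 2T for the block K containing T."""
    K = block_of(T)
    return triangular(K) <= T <= triangular(K + 1) <= 2 * T


if __name__ == "__main__":
    for T in (10**3, 10**6, 10**9, 10**12):
        # The ratio bound/T decreases with T, consistent with the bound being o(T).
        print(T, sandwich_holds(T), bound_12(T, d=1) / T)
```

In this sketch the sandwich already holds from $T=2$ onward, and the ratio only drops below $1$ for fairly large $T$, which is consistent with the statement being asymptotic.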

We can also extend the learnability results to the agnostic case\. Formally,

###### Theorem D\.9\.

If $\mathcal{H}$ has no infinite indifferent LCLL tree, all processes admit universal online learning in the agnostic setting.

The proof of this theorem is very similar to the proof in Section [C](https://arxiv.org/html/2605.30479#A3). The only difference is that the experts simulate a run of Algorithm [5](https://arxiv.org/html/2605.30479#alg5) instead of Algorithm [1](https://arxiv.org/html/2605.30479#alg1). For completeness, we provide the expert as Algorithm [7](https://arxiv.org/html/2605.30479#alg7).

**Algorithm 7** Expert $I,J$ (indexed by $e_{i,j}$).

- $U\leftarrow\{\}$.
- $t^{\prime}\leftarrow 0$.
- $k\leftarrow 1$.
- $L\leftarrow\{\}$.
- $m\leftarrow 1$.
- **for** $t=1,2,3,\ldots$ **do**
  - **if** $t\leq k$ **then**
    - **if** $\exists\, t^{\prime}\leq t_{1}<t_{2}<\cdots<t_{k}\leq t$ and $C_{k}=\{(y_{t_{k-1}^{k-1}+1,\varnothing},\ldots,y_{t_{k}^{1},\varnothing}^{u_{1}},y_{t_{k}^{1}+1,u_{1}},\ldots,y_{t_{k}^{2},u_{1}}^{u_{2}}\ldots,y_{t_{k}^{k},u_{<k}}^{u_{k}}):u\in\{0,1\}^{k}\}$ such that $g_{U}((t_{1},\dots,t_{k}),C_{k})=(Y_{t^{\prime}},\dots,Y_{t})$ **then**
      - Advance the game: $U\leftarrow U\cup\{((t_{1},\dots,t_{k},C_{k}),g_{U}((t_{1},\dots,t_{k}),C_{k}))\}$.
      - $k\leftarrow k+1$.
      - $t^{\prime}\leftarrow t$.
    - **end if**
    - Predict $\hat{Y}_{t}=Y^{*}_{t}$.
  - **else**
    - Predict $\hat{Y}_{t}=\arg\min_{y}\Pr\left[\left.w(\mathcal{H}^{g_{U}}_{L\cup\{(X_{t},y)\}},X_{[t,t(m)+t^{\prime}]})\leq\frac{1}{2}w(\mathcal{H}^{g_{U}}_{L},X_{[t,t(m)+t^{\prime}]})\right|X_{\leq t}\right]$.
    - **if** $t\in J$ **then**
      - $L\leftarrow L\cup\{(X_{t},Y_{t})\}$.
    - **end if**
    - **if** $t\geq\frac{m(m+1)}{2}+t^{\prime}$ **then**
      - $m\leftarrow m+1$.
    - **end if**
  - **end if**
- **end for**

Thus, we need to show that we can still build a sequence $j_{T}$ with $\log j_{T}=o(T)$ such that, for every large enough $T$, there exists $j<j_{T}$ for which $e_{i,j}(X_{t})=Y^{*}_{t}$ for every $t<T$ except for at most $o(T)$ times. Formally,

###### Lemma D\.10\.

If $\mathcal{H}$ has no infinite indifferent LCLL tree, then for every realizable sequence $(\mathbb{X},\mathbb{Y})\in\text{R}(\mathcal{H})$, there is a sequence $\{j_{T}\}_{T\in\mathbb{N}}$ satisfying $\log j_{T}=o(T)$ such that, for every large enough time $T$, there is an expert $e_{i,j}$ with $j\leq j_{T}$ such that $Y_{t}=e_{i,j}(X_{t})$ for all $t\leq T$ except for at most $o(T)$ times.

The proof of this lemma is also similar to the proof of Lemma [C.1](https://arxiv.org/html/2605.30479#A3.Thmtheorem1). The only difference is that, for every $T$, we only know that the number of mistakes made by Algorithm [5](https://arxiv.org/html/2605.30479#alg5) is $o(T)$ instead of $O(\log T)$. Therefore, we use the trick from the work of Hanneke and Wang ([2024](https://arxiv.org/html/2605.30479#bib.bib17)) and obtain the index $j$ of Expert $I,J$ through the following order: experts are ordered by the value of $|J|\max J$, with $|J|$ used for tie-breaking. For example, for $J=\{2,3,5\}$, the value used to index it is $|J|\max J=3\cdot 5=15$. Therefore, we have the following upper bound on $j_{T}$.

$$\begin{aligned}
j_{T}&\leq\left|\{J:|J|\max J\leq k\}\right|=1+\sum_{m=1}^{k}\left|\left\{J:|J|\leq\frac{k}{m},\max J=m\right\}\right|\\
&=1+\sum_{m=1}^{\sqrt{k}}2^{m-1}+\sum_{m=\sqrt{k}}^{k}\binom{m-1}{\leq(\frac{k}{m}-1)}\leq 2^{\sqrt{k}}+\sum_{m=\sqrt{k}}^{k}\left(\frac{em^{2}}{k}\right)^{\frac{k}{m}}\leq(k+1)e^{\sqrt{k}}.
\end{aligned}$$

Thus, we have $\log j_{T}\leq\sqrt{|J|T}$. Since $|J|=o(T)$, we get $\log j_{T}=o(T)$. Moreover, there is $j<j_{T}$ such that for all $t<T$, $e_{i,j}(X_{t})\neq Y_{t}$ if and only if $t\in J$. That finishes the proof.
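To make the ordering concrete, the following sketch (an illustration under stated assumptions, not the authors' construction) counts the finite sets $J$ of positive integers with $|J|\max J\leq k$ using the per-$m$ decomposition above, cross-checks the count by brute force, and compares it with $(k+1)e^{\sqrt{k}}$ for small $k$ only. All function names are hypothetical, and assigning the value $0$ to the empty set is an assumption made here to account for the leading $1$ in the sum.

```python
import math
from itertools import combinations


def index_value(J) -> int:
    """Ordering key |J| * max(J); the empty set gets 0 (assumption)."""
    return len(J) * max(J) if J else 0


def count_sets(k: int) -> int:
    """Number of finite sets J of positive integers with |J| * max(J) <= k."""
    total = 1  # the empty set
    for m in range(1, k + 1):  # m plays the role of max(J)
        # Besides m itself, J may hold at most floor(k/m) - 1 elements of {1..m-1}.
        extra = min(m - 1, k // m - 1)
        total += sum(math.comb(m - 1, s) for s in range(extra + 1))
    return total


def count_sets_brute_force(k: int) -> int:
    """Same count by direct enumeration; only feasible for small k."""
    universe = range(1, k + 1)
    return sum(
        1
        for size in range(k + 1)
        for J in combinations(universe, size)
        if index_value(J) <= k
    )


def bound(k: int) -> float:
    """The closing expression (k+1) * exp(sqrt(k)) from the display above."""
    return (k + 1) * math.exp(math.sqrt(k))


if __name__ == "__main__":
    print(index_value({2, 3, 5}))  # the worked example from the text
    for k in (5, 10, 15):
        print(k, count_sets(k), count_sets_brute_force(k), bound(k))
```

The brute-force routine is included only to validate the closed-form count on small inputs; it enumerates all $2^{k}$ subsets and is not meant to scale.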

Thus, the regret of the algorithm above is

$$\begin{aligned}
\text{Regret}(\mathcal{A},(\mathbb{X},\mathbb{Y},\mathbb{Y}^{*}),T)&=O\left(\sqrt{T\log\log T+T(\log i+\log j)+(\log i+\log j)}\right)+o(T)\\
&=O\left(\sqrt{T\log\log T+T\log j_{T}+\log j_{T}}\right)+o(T)\\
&=O\left(\sqrt{T\log\log T+To(T)+o(T)}\right)+o(T)=o(T).
\end{aligned}$$

Lemma [4.1](https://arxiv.org/html/2605.30479#S4.Thmtheorem1) also shows the necessity of Theorem [D.9](https://arxiv.org/html/2605.30479#A4.Thmtheorem9).

### D.1 Discussion

This part provides more intuition about the optimistically universal online learning problem. The term “optimistically universal online learning” was introduced by Hanneke ([2021](https://arxiv.org/html/2605.30479#bib.bib28)) in order to understand learnability under the *minimal assumptions*, that is, assuming only that the data processes admit universal online learning. Intuitively, we would like to ask whether it is possible to learn whenever learning is possible. More recently, the work of Hanneke and Wang ([2024](https://arxiv.org/html/2605.30479#bib.bib17)) adds the concept class restriction to the optimistically universal online learning setting and fully characterizes this problem for binary classification. In their work, they first identify the minimal assumption under which a process admits universal online learning. They then determine whether there is a learning algorithm that learns all processes that admit universal online learning. We have defined admitting universal online learning before, so we define optimistically universally online learnable here. Formally,

###### Definition D\.11\.

An online learning rule is *optimistically universal* under concept class $\mathcal{H}$ if it is strongly universally consistent under every process $\mathbb{X}$ that admits strongly universally consistent online learning under concept class $\mathcal{H}$.

If there is an online learning rule that is *optimistically universal* under concept class $\mathcal{H}$, we say $\mathcal{H}$ is *optimistically universally online learnable*.

According to the work of Hanneke et al. ([2023c](https://arxiv.org/html/2605.30479#bib.bib7)), we have the following theorem about optimistically universal online learnability when all processes admit universal online learning.

###### Theorem D.12 (Hanneke et al., [2023c](https://arxiv.org/html/2605.30479#bib.bib7)).

$\mathcal{H}$ is optimistically universally online learnable if and only if $\mathcal{H}$ has no infinite Littlestone tree.

We would also like to characterize the necessary and sufficient conditions under which a process admits universal online learning when $\mathcal{H}$ has an infinite indifferent LCLL tree. A natural guess is the condition in the work of Hanneke and Wang ([2024](https://arxiv.org/html/2605.30479#bib.bib17)), which is:

###### Condition A (Hanneke and Wang, [2024](https://arxiv.org/html/2605.30479#bib.bib17)).

For a given concept class $\mathcal{H}$, there exists a countable set of experts $E=\{e_{1},e_{2},\dots\}$ such that $\forall\,\mathbb{Y}^{*}\in\text{R}(\mathcal{H},\mathbb{X})$, $\exists\, j_{T}\rightarrow\infty$ with $\log j_{T}=o(T)$, such that:

$$\mathbb{E}\left[\limsup_{T\rightarrow\infty}\min_{e_{j}:j\leq j_{T}}\frac{1}{T}\sum_{t=1}^{T}\mathbb{I}\left[e_{j}(X_{t})\neq Y^{*}_{t}\right]\right]=0 \tag{14}$$

Here, an expert is defined as a function that generates the label based only on $X_{t}$. However, this condition is not necessary. Consider the case where the concept class $\mathcal{H}$ contains all measurable functions. We can construct the following sequence $\mathbb{X}=\{X_{t}\}_{t\in\mathbb{N}}$: for every $i\in\mathbb{N}$, $X_{t}$ takes the same value for every $2^{i}-1\leq t<2^{i+1}-1$, but takes different values for different $i$. For any realizable label sequence $\mathbb{Y}^{*}\in\text{R}(\mathcal{H},\mathbb{X})$, this sequence admits universal online learning by memorization, in other words, by memorizing the history of the sequence. However, for every countable set of experts $E=\{e_{1},e_{2},\ldots\}$, we can build the following realizable label sequence such that Condition [A](https://arxiv.org/html/2605.30479#Thmcondition1) does not hold: for every $2^{i}-1\leq t<2^{i+1}-1$ and $j\leq 2^{2^{i+1}}$, $Y_{t}\neq e_{j}(X_{t})$. This sequence is realizable, since the label space is unbounded and only finitely many expert predictions must be avoided in each block. However, it does not satisfy Condition [A](https://arxiv.org/html/2605.30479#Thmcondition1): for every $j_{T}\rightarrow\infty$ with $\log j_{T}=o(T)$ and every $j<j_{T}$, the expert $e_{j}$ makes a mistake at each round.
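The counterexample can be illustrated with a small simulation. This is a sketch under stated assumptions rather than the construction itself: the blocks are read as non-overlapping ranges $2^{i}-1\leq t<2^{i+1}-1$, instances are encoded by their block index, labels are natural numbers, and the expert family `expert(j, x)` is an arbitrary hypothetical choice. Only blocks $i=1,2,3$ are simulated, because block $i$ requires avoiding $2^{2^{i+1}}$ experts.

```python
from typing import Dict, List, Tuple


def block_index(t: int) -> int:
    """Block i of round t >= 1, with blocks 2^i - 1 <= t < 2^(i+1) - 1."""
    return (t + 1).bit_length() - 1


def expert(j: int, x: int) -> int:
    """Hypothetical expert e_j: any fixed function of the instance would do."""
    return (j * (x + 1)) % (j + 1)


def adversarial_label(x: int, num_experts: int) -> int:
    """Smallest natural number that differs from e_j(x) for every j <= num_experts."""
    taken = {expert(j, x) for j in range(1, num_experts + 1)}
    y = 0
    while y in taken:
        y += 1
    return y


def simulate(max_block: int = 3) -> Tuple[int, List[Tuple[int, int]]]:
    """Run a memorizing learner on blocks 1..max_block; return (mistakes, trace)."""
    memory: Dict[int, int] = {}
    labels: Dict[int, int] = {}
    trace: List[Tuple[int, int]] = []
    mistakes = 0
    for t in range(1, 2 ** (max_block + 1) - 1):
        i = block_index(t)
        x = i  # X_t is constant within a block and distinct across blocks
        if x not in labels:
            labels[x] = adversarial_label(x, 2 ** (2 ** (i + 1)))
        y = labels[x]
        if memory.get(x) != y:  # unseen instance: counted pessimistically as a mistake
            mistakes += 1
        memory[x] = y
        trace.append((x, y))
    return mistakes, trace


if __name__ == "__main__":
    mistakes, trace = simulate(3)
    print("rounds:", len(trace), "memorizer mistakes:", mistakes)
    print("expert 1 mistakes:", sum(expert(1, x) != y for x, y in trace))
```

With `max_block = 3` the memorizing learner errs once per block (3 mistakes over 14 rounds), while each of the first 16 hypothetical experts errs in every round.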

Therefore, the condition above is not necessary for multiclass online learning with an unbounded label space. We leave this as an open question as well.

###### Open Question 1\.

What is the necessary and sufficient condition on a process for it to admit universal online learning when the label space is unbounded (countably infinite)?