@HazanPrinceton: Just in time for our tutorial at ICML next week, Annie posted an update to our universal sequence preconditioning paper…

X AI KOLs Timeline 06/29/26, 05:15 PM Papers

Summary

This paper update presents a universal sequence preconditioning method achieving dimension-free regret bounds for marginally stable linear dynamical systems, using second-order VAW algorithm and Faber polynomials.

Just in time for our tutorial at ICML next week, Annie posted an update to our universal sequence preconditioning paper, which gets nearly tight hidden-dim free regret bounds: (it'll be covered!) https://t.co/mxzGAMyVo7

Original Article

View Cached Full Text

Cached at: 06/29/26, 06:30 PM

The Power of Second Order Methods for Sequence Preconditioning

Source: https://arxiv.org/html/2605.08390

Abstract

Sequence prediction methods for linear dynamical systems with long memory, i.e. marginally stable systems, typically achieve regret that grows linearly with the hidden dimension of the underlying generative model. While many methods have been developed to address this regime with varying success, we show that simply using the second-order Vovk-Azoury-Warmuth (VAW) algorithm to learn a short autoregressive-with-inputs (ARX) model achieves astoundingly strong results: for bounded sequential data from a marginally-stable linear dynamical system with spectra in the complex disk except for angular wedge of widthδ\deltaaround the negative real axis, this algorithm achieves dimension-free regret𝒪(δ−4log2⁡T)\mathcal{O}\!\left(\delta^{-4}\log^{2}T\right). These bounds are state-of-the-art to our knowledge.

The key components for our result come from 1) using the theory of “Universal Sequence Preconditioning” (USP)Marsden and Hazan (2025b)to prove the existence of an optimal setting of autoregressive coefficients, 2) the application of VAW which takes better advantage of the memory compression provided by USP, and 3) the analysis of Faber polynomials on circular sectors to extend these results to systems with complex spectra.

1Introduction

A central challenge in sequence prediction is the simultaneous presence of long memory and high hidden dimension of the underlying generative process. The canonical theoretical model to study in this setting is a marginally stable linear dynamical system. Most methods applied to this setting result in a regret bound which grows linearly with hidden dimension, see for exampleHazanet al.(2018); Kailath (1980); Chen (1999); Hazan (2016); Simchowitzet al.(2019); Tsiamis and Pappas (2019); Lee (2022); Bakshiet al.(2023). Universal Sequence Preconditioning (USP)Marsden and Hazan (2025b)was the first method, to our knowledge, which simultaneously achieved dimension-free sublinear regret, even in the presence of hidden transition matrices with complex eigenvalues. The method they develop is to convolve the original signal with the coefficients of the Chebyshev polynomial. This construction achieves an exponential compression of memory, reducing the effective memory to be logarithmic in sequence length (or horizon). However, this comes at a cost: the Chebyshev coefficients grow exponentially large with the degree, resulting in large optimal diameter. This creates a mismatch with first-order learning methods, which scale linearly with optimal diameter, yielding suboptimal regret of𝒪(T11/13)\mathcal{O}(T^{11/13})Marsden and Hazan (2025b). Moreover, while their work did apply to hidden transition matrices with complex eigenvalues, they required the assumption that the complex argument of the eigenvalues vanish at the rateo(1/polylog⁡T)o(1/\mathrm{poly}\log T).

In this work, we take the insight of memory compression from Universal Sequence Preconditioning and use it to design a stronger algorithm which achieves dimension-free and poly-logarithmic regret for a much broader class of linear dynamical systems. This is an improvement upon all other known methods, illustrated more clearly in Table1.

Table 1:Cumulative prediction error bounds formarginally stablelinear dynamical systems withasymmetrictransition matrices. For “Complex Angleϕ\phi”, the results only apply when the hidden transition matrix has spectra in the wedgeWϕ={z∈ℂ:|z|≤1,|arg⁡(z)|≤ϕ}W_{\phi}=\{z\in\mathbb{C}:|z|\leq 1,\ |\arg(z)|\leq\phi\}.Consider a linear dynamical system with inputsut∈ℝu_{t}\in\mathbb{R}and outputsyt∈ℝy_{t}\in\mathbb{R}evolving according to:

ht=Aht−1+Butyt=Cht,\textbf{h}_{t}=\textbf{A}\textbf{h}_{t-1}+\textbf{B}u_{t}\qquad y_{t}=\textbf{C}\textbf{h}_{t},(1)whereht∈ℝdhidden\textbf{h}_{t}\in\mathbb{R}^{d_{\text{hidden}}}is the hidden state. The algorithm we suggest is to predictyty_{t}using the Vovk-Azoury-Warmuth Forecaster with a short autoregressive-with-inputs (ARX) feature vector[yt−1,…,yt−n,ut,…,ut−n+1][y_{t-1},\dots,y_{t-n},u_{t},\dots,u_{t-n+1}], wherennis a hyperparameter that should grow logarithmically with the prediction horizon, see Algorithm2for details. By comparison, the original USP algorithm suggests predictingyty_{t}as

−∑i=1nciyt−i+∑j=0n−1θjut−j,-\sum_{i=1}^{n}c_{i}y_{t-i}+\sum_{j=0}^{n-1}\theta_{j}u_{t-j},where the coefficientsc1,…,cnc_{1},\dots,c_{n}are fixed and come from the Chebyshev polynomial andθj\theta_{j}is learned with a first order method.

real axisimaginary axisWϕW_{\phi}π−ϕ\pi-\phiπ−ϕ\pi-\phi11Figure 1:The wedge spectral classWϕ={z∈ℂ:|z|≤1,|arg⁡(z)|≤ϕ}W_{\phi}=\{z\in\mathbb{C}:|z|\leq 1,\ |\arg(z)|\leq\phi\}. In our main resultϕ=π−δ\phi=\pi-\delta, so the spectrum may occupy the unit disk except for a symmetric angular gap of width2δ2\deltaaround the negative real axis.At a high level, the effectiveness of a preconditioner, i.e. the set of coefficientsc1,…,cnc_{1},\dots,c_{n}, is governed by two quantities: the effective memory of the transformed representation and the magnitude of the optimal diameter. In the setting of USP, we can think of the preconditioned signal as a compressed representation of the original signal. The compressed representation requires a history which is linear in the degree of the Chebyshev polynomial, however the coefficient magnitude grows very large. This evokes the Minimum Description Length (MDL) principleRissanen (1978); Grünwald (2007), which seeks to minimize the combined cost of a model’s structural complexity and the bit-length required to represent its parameters. For polynomials, the minimum description length principle suggests that the most efficient representation balances the degree of the polynomial linearly and the coefficients magnitude logarithmically.

Inspired by this principle, we observe that the regret of the Vovk–Azoury–Warmuth (VAW) forecasterAzoury and Warmuth (2001); Vovk (2001)matches this balance: it grows linearly with the effective memory and only logarithmically with the the optimal diameter. In order to apply VAW to this setting, we no longer explicitly “precondition” the signalyty_{t}by incorporating∑i=1nciyt−i\sum_{i=1}^{n}c_{i}y_{t-i}into our prediction. Instead we include these autoregressive terms into the feature vector. However, the role of USP is still critical because it provides a certificate for an optimal solution and fundamentally explains why the autoregressive terms allow for such an effective memory compression (i.e. that we only need to use the lastnninputs and outputs to make good predictions). By moving to the second order VAW algorithm we improve from a cumulative prediction error of𝒪(T11/13)\mathcal{O}(T^{11/13})to𝒪(log2⁡(T))\mathcal{O}\!\left(\log^{2}(T)\right). Moreover, USPMarsden and Hazan (2025b)only applies to systems whose hidden transition matrix has spectra close to the real line, the complex argument must be bounded by1/polylog⁡(T)1/\mathrm{poly}\log(T). Our new approach allows us to consider spectra on the entire complex disk.

Indeed, letWϕ:={z∈ℂ:|z|≤1andarg⁡(z)≤ϕ}W_{\phi}:=\left\{z\in\mathbb{C}:\lvert z\rvert\leq 1\textrm{ and }\arg(z)\leq\phi\right\}. Suppose the underlying hidden transition matrix of the data has spectra inWπ−δW_{\pi-\delta}. Then the predictions made by this algorithm achieve cumulative squared loss bounded by𝒪(δ−4log2⁡(TL))\mathcal{O}\!\left(\delta^{-4}\log^{2}(TL)\right). This is tight in terms of its dependence ondd. It is known that if the hidden transition matrix is the shift-permutation matrix (which has spectra inWπW_{\pi}) then the regret of any algorithm must grow linearly with hidden dimensionHazan and Marsden (2026). However, we show that this adversarial example truly relies on having spectra inWπW_{\pi}. Indeed if we instead assume the spectra is contained inWπ−δW_{\pi-\delta}for any constantδ>0\delta>0, we achieve dimension-free regret.

In order to show this, we require a new way to guarantee an optimal solution that extends toWπ−δW_{\pi-\delta}, as Chebyshev only applies toW1/polylog⁡(T)W_{1/\mathrm{poly}\log(T)}. The solution comes from Faber polynomials adapted to circular sectors. Our analysis gives a deeper understanding of the role of complex spectra in the difficulty of learning a linear dynamical system. More generally, every useful USP preconditioner is a monic polynomial that is small on the spectrum of the hidden transition matrix. Chebyshev polynomials are optimal for nearly real spectra; Faber polynomials are the corresponding extremal objects for more general compact spectral sets, and in this paper we use them for circular sectors.

1.1Our Results

Our main contributions are as follows:

Logarithmic Regret for Sequence Prediction:

We propose a second-order finite-history predictor: VAW run on the raw feature vector(yt−1,…,yt−n,ut,…,ut−n+1)(y_{t-1},\ldots,y_{t-n},u_{t},\ldots,u_{t-n+1})(see Algorithm2). In Theorem2we show that if there exists a degree-nnmonic polynomial with residual sizeαn\alpha_{n}on the spectrum and coefficient scaleGnG_{n}, then there exists a comparator for the raw-feature VAW problem, and VAW suffers only𝒪(TL2αn2+Y2nlog⁡(GnYL))\mathcal{O}(TL^{2}\alpha_{n}^{2}+Y^{2}n\log(G_{n}YL))cumulative prediction loss. For comparison, the original first-order method obtains𝒪(n2GnT+αnT2)\mathcal{O}(n^{2}G_{n}\sqrt{T}+\alpha_{n}T^{2})cumulative prediction loss (see Theorem D.1 inMarsden and Hazan (2025b)). Instantiating our theorem with the Chebyshev polynomial gives Corollary1: for spectra in[−1,1][-1,1], the cumulative prediction loss is𝒪(log2⁡(T))\mathcal{O}\!\left(\log^{2}(T)\right), which is a vast improvement on the𝒪(T11/13)\mathcal{O}(T^{11/13})cumulative loss achieved from the first order method.

Extension to Systems with Constant Complex Arguments:

We extend the theoretical guarantees of Universal Sequence Preconditioning (USP) to a strictly broader class of dynamical systems. We use Faber polynomials for circular sectors. Lemma1shows that for every angular gapδ>0\delta>0there is a real monic polynomialpn,δp_{n,\delta}satisfyingsupz∈Wπ−δ|pn,δ(z)|≤2e−nδ2/π2\sup_{z\in W_{\pi-\delta}}|p_{n,\delta}(z)|\leq 2e^{-n\delta^{2}/\pi^{2}}andmaxi⁡|ci|≤100n\max_{i}|c_{i}|\leq 100^{n}. Thus the approximation error remains exponentially small even when the hidden dynamics have constant complex arguments, provided the spectrum avoids a constant wedge around the negative real axis. Note that this contribution is orthogonal to the application of VAW to USP. Indeed, one could apply this result on Faber polynomials to the original first-order method and show that USP applies to a broader class of sequences.

1.2Tightness and the Role of the Angular Gap

Our dimension-free guarantees rely on excluding a constant-width wedge around the negative real axis. This assumption is not merely an artifact of the proof. The cyclic-permutation lower bound in Appendix A ofHazan and Marsden (2026)constructs add-dimensional marginally stable system for which any accurate finite-memory predictor requires memoryΩ(d)\Omega(d)and therefore accrues cumulative prediction errorΩ(d)\Omega(d)for horizonT≥dT\geq d. Spectrally, this construction is a cyclic permutation: its eigenvalues are thedd-th roots of unity. Thus the spectrum approaches the negative real axis at angular scale1/d1/dand, for evendd, contains−1-1exactly. In the notation of our wedge class, the corresponding gap parameter is thereforeδ=Θ(1/d)\delta=\Theta(1/d).

This lower bound clarifies the role of Faber and Chebyshev preconditioning. When the goal is dimension-free prediction, polynomial preconditioners can compress memory exponentially, provided the spectrum obeys enough angular separation. If one allows polynomial dependence on the hidden dimension, then one can always return to the classical Cayley–Hamilton representation, which uses memoryddfor arbitrary spectra. The remaining gap is quantitative: our wedge guarantee scales as1/δ41/\delta^{4}in the final bound, whereas the permutation lower bound only rules out memory below order1/δ1/\delta. Closing this polynomial gap remains open.

1.3Related Work

Our work is most closely related to the literature on prediction in latent linear dynamical systems, rather than recovery. Early spectral-filtering results demonstrated online competitiveness with the best LDS predictor without explicit system identification, but relied on strong spectral structure like symmetryHazanet al.(2017).Hazanet al.(2018)later extended spectral filtering to general latent LDSs using a convex relaxation that accommodates phases and no longer requires a symmetric transition matrix. However, in the asymmetric setting their guarantee still has polynomial dependence on the horizon and polynomial dependence on the hidden dimension, so it does not yield a polylogarithmic cumulative prediction error bound in the regime we study. Polynomial inddandTTbounds are also a classical consequence of the Cayley–Hamilton theorem and realization theory; see, e.g.,Kailath (1980); Chen (1999).

A second line of work studiesmarginally stablesystems more directly.Ghaiet al.(2020)proved no-regret prediction guarantees for marginally stable systems with polynomial system parameter dependence. In the fully observed adversarial setting their bounds remain polynomial in the horizon, while their stronger polylogarithmic-in-TTresults rely either on stochastic assumptions or on a reduction to finite-memory autoregressive prediction. These results aren’t direct comparators since our focus is absolute cumulative prediction error for a noiseless input–output LDS.

There is also a body of work on partially observed prediction through the Kalman filter. SLIP gives polylogarithmic regret for partially observed LDS prediction under stochastic noise, but its guarantee is stated relative to the Kalman predictor and crucially assumes that the closed-loop matrixA−KCA-KCis diagonalizable with real eigenvaluesRashidinejadet al.(2020). Likewise, Tsiamis and Pappas obtain logarithmic regret for online learning of the Kalman filter for nonexplosive systems, including marginally stable ones, again with performance measured against the Kalman predictorTsiamis and Pappas (2023). These papers show that polylogarithmic-in-TTprediction is possible under stronger stochastic/filtering assumptions.

The direct asymmetric comparison closest in spirit to our setting is the recent work of Marsden and Hazan, which handles asymmetric marginally stable LDSs and removes hidden-dimension dependence from the regret boundMarsden and Hazan (2025a). However, their horizon dependence remains polynomial, so even after standardizing to the realizable noiseless setting, this line of work falls into theO(poly(T),polylog(d))O(\mathrm{poly}(T),\mathrm{polylog}(d))regime rather than the polylogarithmic-in-TTregime.

Finally, there is a large recovery-oriented literature on learning LDS realizations from input–output data, including semiparametric least squares, stochastic subspace identification, multiscale Hankel/SVD methods, and moment-based recoverySimchowitzet al.(2019); Tsiamis and Pappas (2019); Lee (2022); Bakshiet al.(2023). These methods can be used to form predictors. However, their guarantees when converted to cumulative prediction error inevitably lead to a polynomial dependence on the hidden dimension. Related high-dimensional identification results with logarithmic dependence on dimension, such as Fattahi’s Markov-parameter learning result, require stronger inherent-stability assumptions and therefore do not directly apply to the marginally stable regime considered hereFattahi (2021). Of particular interest is the result ofHardtet al.(2018), who curiously have a similar “Pacman shape“ to our wedgeWϕW_{\phi}, as the domain in which gradient descent converges to learn the system parameters. In contrast, our domain is where second order methods are able to make an ARX model converge. That said, we suspect the similarity in domain is more than a coincidence.

In short, prior work obtains at most two of the following three properties simultaneously: marginal stability, genuinely asymmetric/complex-spectrum hidden dynamics, and polylogarithmic cumulative prediction error with only polylogarithmic dependence on hidden dimension.

Second-Order Online Learning.

In the realm of online convex optimization, second-order methods are renowned for achieving logarithmic regret for strongly or exponentially concave cost functions. Two of the most prominent algorithms in this space are the Online Newton Step (ONS)Hazanet al.(2007)and the Vovk–Azoury–Warmuth (VAW) forecasterVovk (2001); Azoury and Warmuth (2001). However, standard regret guarantees for ONS depend explicitly on the diameter of the decision set. In the USP setting, where the Chebyshev transformation yields a comparator with an exponentially large magnitude, any diameter-dependent bound becomes vacuous.

Our approach critically relies on the VAW algorithm to circumvent this limitation. A defining property of VAW is that its regret bound is independent of the domain diameter; rather, the magnitude of the comparator enters the bound only logarithmically. This exact property is what allows us to absorb the exponential blowup of the preconditioner’s parameters and achieve tight (polylogarithmically) regret rates (or, in this setting, cumulative prediction loss) of𝒪(polylog(T))\mathcal{O}(\mathrm{polylog}(T)), even when accounting for approximation error.

Minimum Description Length (MDL) and Information Theory.

The connection between online learning, prediction, and data compression is a foundational concept in information theory. The Minimum Description Length (MDL) principleRissanen (1978); Grünwald (2007)posits that the best statistical model is the one that maximally compresses the data, balancing the complexity of the model itself against the cost of describing the data given the model. The regret of an online learning algorithm is intimately tied to the redundancy in universal codingCesa-Bianchi and Lugosi (2006). By framing USP through the lens of MDL, we demonstrate that second-order methods achieve an optimal compression scheme for LDS: the effective memory represents the model’s structural complexity, while the logarithmic regret represents an optimal bit-length coding of the exponentially large parameter space.

2Theoretical Results Overview

We first recall the algebraic USP identity, because it is the certificate behind our regret proof. Then we explain how this identity becomes an online learning statement: it exhibits a good comparator for VAW on raw finite-history features. Next we will explain how we achieve state of the art results and tight bounds up to polylogarithmic factors.

2.1Overview of Universal Sequence Preconditioning

For clarity of presentation, we consider a single-input, single-output Linear Dynamical System wheredin=dout=1d_{in}=d_{out}=1. At each time steptt, we observe an inputut∈ℝu_{t}\in\mathbb{R}and an outputyt∈ℝy_{t}\in\mathbb{R}according to the law:

ht=Aht−1+Butyt=Cht.\displaystyle\textbf{h}_{t}=\textbf{A}\textbf{h}_{t-1}+\textbf{B}u_{t}\qquad\qquad y_{t}=\textbf{C}\textbf{h}_{t}.(2)We refer toAas the hidden transition matrix. The hidden stateht\textbf{h}_{t}can be factored out giving

yt=∑s=0tCAsBut−s.\displaystyle y_{t}=\sum_{s=0}^{t}\textbf{C}\textbf{A}^{s}\textbf{B}u_{t-s}.Forθs=CAsB\theta_{s}=\textbf{C}\textbf{A}^{s}\textbf{B}fors=0,…,Ts=0,\dots,T, the observation is equivalentlyyt=∑s=0tθsut−sy_{t}=\sum_{s=0}^{t}\theta_{s}u_{t-s}. Therefore in order to predict a linear dynamical system up to horizonTT, it suffices to learnTTparameters (one for each value ofCAsB\textbf{C}\textbf{A}^{s}\textbf{B}fors=0,…,Ts=0,\dots,T).

The core insight of Universal Sequence PreconditioningMarsden and Hazan (2025b)is that the current observationyty_{t}can be represented, up to negligible error, using only𝒪(log⁡(T))\mathcal{O}(\log(T))parameters and history. For history windown=⌈log⁡((T⋅‖C‖⋅‖B‖)/δ)⌉n=\lceil\log((T\cdot||\textbf{C}||\cdot||\textbf{B}||)/\delta)\rceil, letc0,c1,…,cnc_{0},c_{1},\dots,c_{n}be the coefficients of thenn-th degree monic preconditioning polynomialpn(z)=zn+∑i=1ncizn−ip_{n}(z)=z^{n}+\sum_{i=1}^{n}c_{i}z^{n-i}, withc0=1c_{0}=1. Algebraic manipulation shows that

yt=−∑i=1nciyt−i⏟ℵ0+∑s=0n−1(∑i=0sciCAs−iB)ut−s⏟ℵ1+∑s=0t−nCpn(A)AsBut−n−s⏟ℵ2.y_{t}=-\underbrace{\sum_{i=1}^{n}c_{i}y_{t-i}}_{\aleph_{0}}+\underbrace{\sum_{s=0}^{n-1}\left(\sum_{i=0}^{s}c_{i}\textbf{C}\textbf{A}^{s-i}\textbf{B}\right)u_{t-s}}_{\aleph_{1}}+\underbrace{\sum_{s=0}^{t-n}\textbf{C}p_{n}(\textbf{A})\textbf{A}^{s}\textbf{B}u_{t-n-s}}_{\aleph_{2}}.(3) Since the monicnn-th degree Chebyshev polynomial satisfiesmaxx∈[−1,1]⁡|pncheby(x)|≤2−(n−1)\max_{x\in[-1,1]}\lvert p_{n}^{\text{cheby}}(x)\rvert\leq 2^{-(n-1)}, ifA’s eigenvalues fall in this range, then the termℵ2\aleph_{2}becomes negligible fornnlarge enough. The authors show that you can extend this property to the complex plane with argument bounded by1/64n21/64n^{2}instead of just the interval[−1,1][-1,1], allowing asymmetric hidden transition matrices. In summary, the seminal work shows that by defining parameter,

θsp:=∑i=0sciCAs−iB,withc0=1,\theta^{p}_{s}:=\sum_{i=0}^{s}c_{i}\textbf{C}\textbf{A}^{s-i}\textbf{B},\text{ with }c_{0}=1,(4)then

yt=−∑i=1nciyt−i+∑s=0n−1θsput−s+ϵt,y_{t}=-\sum_{i=1}^{n}c_{i}y_{t-i}+\sum_{s=0}^{n-1}\theta^{p}_{s}u_{t-s}+\epsilon_{t},(5)where|ϵt|≤T⋅‖C‖⋅‖B‖⋅κ⋅2−n\lvert\epsilon_{t}\rvert\leq T\cdot||\textbf{C}||\cdot||\textbf{B}||\cdot\kappa\cdot 2^{-n}, whereκ\kappais the condition number of the matrix diagonalizingA, as long as the complex eigenvalues ofAhave argument bounded by1/64n21/64n^{2}. More generally, if we remain agnostic to the choice of preconditioning polynomial, then Eq.5holds with|ϵt|≤T⋅‖C‖⋅‖B‖⋅κ⋅supz∈K|pn(z)|\lvert\epsilon_{t}\rvert\leq T\cdot||\textbf{C}||\cdot||\textbf{B}||\cdot\kappa\cdot\sup_{z\in K}|p_{n}(z)|wheneverKKcontainsλ(A)\lambda(\textbf{A}). We will show that if we considerKKas the wedge of the complex disk then the Faber polynomial minimizessupz∈K|pn(z)|\sup_{z\in K}|p_{n}(z)|.

2.2Contribution One: The Power of a Second Order Method

Equation5suggests a simple learning algorithm— learnθp\theta^{p}via regression. Perhaps even simpler and cleaner is to learn the entire vector(−c1,…,−cn,θ0p,…,θn−1p)(-c_{1},\dots,-c_{n},\theta_{0}^{p},\dots,\theta_{n-1}^{p}). A naive application of first order learning methods gives regret which scales linearly with the norm of this parameter. This is unfortunate since this norm grows exponentially withnn; from Lemma 3.2 ofMarsden and Hazan (2025b),maxi=0,…,n⁡|ci|\max_{i=0,\dots,n}\lvert c_{i}\rvertcan grow as large as20.3n2^{0.3n}for Chebyshev preconditioning, while Lemma1gives the analogous bound100n100^{n}for the sector Faber preconditioner. Althoughθp\theta^{p}can have a tiny dimension which is only logarithmic inTT, its norm does not reflect the reduction in domain. This is where the insight of our work comes in. Inspired by the minimum description length principle, we consider applying a second order learning method to sequence preconditioning. Critically, second order learning methods do not treat thedimensionand thenormof the learnable parameter space as equal citizens in the way that first order methods do.

The error of the preconditioning method has two components: one from approximation and the other from learning/estimation. While Chebyshev and Faber polynomials yield optimal exponential decay for approximation error, the learning error—previously relying on first-order methods—can be significantly improved. Specifically, the original USP algorithm uses online gradient descent to learn the parameters of the model which grow exponentially with the degree of the polynomial. We suspect that this tradeoff is necessary, formalized in the following conjecture.

Conjecture 1.

Letpn(x)=xn+∑j=0n−1cjxjp_{n}(x)=x^{n}+\sum_{j=0}^{n-1}c_{j}x^{j}denote any monic polynomial. Let𝒟⊆ℂ\mathcal{D}\subseteq\mathbb{C}denote any convex and compact set with areaDA>0D_{A}>0. If there exist global constantsC,c>0C,c>0such that for alln≥1n\geq 1,maxx∈𝒟⁡|pn(x)|≤C⋅2−cn\max_{x\in\mathcal{D}}\lvert p_{n}(x)\rvert\leq C\cdot 2^{-cn}then there must also exist constantsB,b>0B,b>0such that for alln≥1n\geq 1,maxj∈[n−1]⁡|cj|≥B2bn\max_{j\in[n-1]}\lvert c_{j}\rvert\geq B2^{bn}.

In AppendixAwe prove a weaker statement than the above in which we assumec=1c=1. Regardless of the truth of Conjecture1, we observe that by using the VAW algorithm we can obtain a regret bound which grows logarithmically with these large coefficients, unlocking tight (up to logarithms) regret bounds.

2.3Contribution Two: Extending to Asymmetric Systems Via Faber Polynomials

In the original USP work, the results only apply to linear dynamical systems whose hidden transition matrixAhas eigenvalues in the region{z∈ℂs.t.|z|≤1andarg⁡(z)≤1/polylog⁡(T)}\left\{z\in\mathbb{C}\textrm{s.t.}\lvert z\rvert\leq 1\textrm{ and }\arg(z)\leq 1/\mathrm{poly}\log(T)\right\}. There is intuition for why the spectrum ofAmust be restricted. Since the resulting regret is hidden-dimension independent, the spectrum ofAmust be restricted. Indeed, if it were not then it would contradict a lower bound. The lower bound comes from consideringAto be the shift permutation matrix.For this cyclic-permutation example the spectrum consists of roots of unity; it approaches the negative real axis at angular scale1/d1/dand contains−1-1whenddis even, and any accurate finite-memory predictor requires memoryΩ(d)\Omega(d)Hazan and Marsden (2026). If polynomial dependence onddis allowed, the classical Cayley–Hamilton representation gives a degree-ddautoregressive predictor for arbitrary spectra. Our goal is different: to understand when dimension-free prediction is possible. Indeed, in the regime whereAis not adversarially constructed, the question remains whether it is possible to do better with respect to hidden dimension. What is the exact trade off asarg⁡(z)\arg(z)approachesπ\piand the resulting regret?

We begin answering this question by extending the comparator-existence argument behind USP to a strictly broader class of dynamical systems. For wedge spectra, the proof uses a sector Faber polynomial to show that a good short-memory comparator exists even when the eigenvalues have constant complex argument, see Corollary2. For spectra inWπ−δW_{\pi-\delta}, Lemma1gives

supz∈Wπ−δ|pn,δFaber(z)|≤2e−nδ2/π2,maxi⁡|ci|≤100n.\sup_{z\in W_{\pi-\delta}}|p_{n,\delta}^{\mathrm{Faber}}(z)|\leq 2e^{-n\delta^{2}/\pi^{2}},\qquad\max_{i}|c_{i}|\leq 100^{n}.The residual therefore decays exponentially innδ2n\delta^{2}, while VAW pays only logarithmically for the exponential coefficient growth. This yields memoryn=O~(δ−2)n=\tilde{O}(\delta^{-2})and the cumulative loss boundO~(Y2δ−4)\tilde{O}(Y^{2}\delta^{-4}), interpolating between dimension-free prediction for constant angular gaps and the Cayley–Hamilton regime when the gap shrinks with dimension. Moreoever, wheneverδ=Ω(d−1/4)\delta=\Omega(d^{-1/4})the resulting regret is sublinear indd, which the Cayley–Hamilton regime would not be able to achieve.

The shape in Figure1is important because it is the spectral regime in which the Faber certificate replaces the Chebyshev certificate. Prior USP-style analyses required the angular component of the eigenvalues to shrink with the horizon. By contrast, if the eigenvalues lie inWπ−δW_{\pi-\delta}for any constantδ>0\delta>0, the preconditioner remains uniformly small on the whole spectrum while its coefficients grow only exponentially with the degree. This is exactly the tradeoff that VAW can absorb, and it is what yields theπ−Ω(1)\pi-\Omega(1)angle entry in Table1. Note that the algorithm run in all cases is the same raw-feature VAW forecaster.

3Main Theorems and Proofs

Recall the standard Vovk-Azoury-Warmuth (VAW) AlgorithmAzoury and Warmuth (2001); Vovk (2001), restated in Algorithm1for the reader’s convenience. Theorem1states the standard regret bound achieved by VAW. We assume the regularization parameter satisfiesλ≥1/‖x∗‖22\lambda\geq 1/\|\textbf{x}^{*}\|_{2}^{2}.

Algorithm 1Vovk-Azoury-Warmuth (VAW) ForecasterAzoury and Warmuth (2001); Vovk (2001)1:Input:regularization

λ≥1/‖x∗‖22\lambda\geq 1/\|\textbf{x}^{*}\|_{2}^{2} 2:Initialize:

A0=λId\textbf{A}_{0}=\lambda I_{d}, target vector

v0=0∈ℝd\textbf{v}_{0}=0\in\mathbb{R}^{d} 3:for

t=1t=1to

TTdo

4:Observe feature vector

at\textbf{a}_{t} 5:Update curvature matrix:

At=At−1+atat⊤\textbf{A}_{t}=\textbf{A}_{t-1}+\textbf{a}_{t}\textbf{a}_{t}^{\top} 6:Predict

y^t=at⊤At−1vt−1\hat{y}_{t}=\textbf{a}_{t}^{\top}\textbf{A}_{t}^{-1}\textbf{v}_{t-1} 7:Observe true label

yty_{t}and suffer loss

lt(b^t)=12(b^t−bt)2l_{t}(\hat{b}_{t})=\frac{1}{2}(\hat{b}_{t}-b_{t})^{2} 8:Update target vector:

vt=vt−1+btat\textbf{v}_{t}=\textbf{v}_{t-1}+b_{t}\textbf{a}_{t} 9:endfor

Theorem 1(VAW Regret).

Assume the true labels are uniformly bounded byYY, i.e.|yt|≤Y|y_{t}|\leq Yfor allt=1,…,Tt=1,\dots,T. Assume the feature vectors are uniformly bounded byRR, i.e.‖at‖2≤R||\textbf{a}_{t}||_{2}\leq Rfor allt=1,…Tt=1,\dots T. For any benchmarkx∗∈ℝn\textbf{x}^{*}\in\mathbb{R}^{n}, the regret is bounded by:

∑t=1T12(y^t−yt)2≤∑t=1T12(x∗⊤at−yt)2+12+Y22nlog⁡(1+T‖x∗‖22R2n)\sum_{t=1}^{T}\frac{1}{2}(\hat{y}_{t}-y_{t})^{2}\leq\sum_{t=1}^{T}\frac{1}{2}({\textbf{x}^{*}}^{\top}\textbf{a}_{t}-y_{t})^{2}+\frac{1}{2}+\frac{Y^{2}}{2}n\log\left(1+\frac{T\|\textbf{x}^{*}\|_{2}^{2}R^{2}}{n}\right)(6)

Transforming sequence prediction with short memory into linear prediction with small dimension:

In Algorithm2we show exactly how to convert the sequence prediction problem into the VAW framework. Using Equation5, we frame predictingyty_{t}as a linear prediction task with dimension2n2n. Letat\textbf{a}_{t}be the vector of the most recentnnobservations and the most recentnninputs:

at←[yt−1,…,yt−n,ut,…,ut−n+1].\textbf{a}_{t}\leftarrow[y_{t-1},\dots,y_{t-n},u_{t},\dots,u_{t-n+1}].(7) Recall the definition ofθsp\theta_{s}^{p}from Eq.4and recall thatc1,…,cnc_{1},\dots,c_{n}represent the coefficients of thenn-th degree monic preconditioning polynomial. We define our target weight vectorx∗\textbf{x}^{*}as the concatenation of the polynomial coefficients andθsp\theta_{s}^{p},

x∗\displaystyle\textbf{x}^{*}←[−c1,…,−cn,θ0p,…,θn−1p]\displaystyle\leftarrow[-c_{1},\dots,-c_{n},\theta_{0}^{p},\dots,\theta_{n-1}^{p}] Using our definitions ofx∗\textbf{x}^{*}andat\textbf{a}_{t}we have:

yt=x∗⊤at+ϵt.y_{t}={\textbf{x}^{*}}^{\top}\textbf{a}_{t}+\epsilon_{t}.(8) Algorithm 2Vovk-Azoury-Warmuth (VAW) Forecaster for Sequence Preconditioning1:Input:regularization

λ>0\lambda>0 2:Initialize:

A0=λId\textbf{A}_{0}=\lambda I_{d}, target vector

v0=0∈ℝd\textbf{v}_{0}=0\in\mathbb{R}^{d} 3:for

t=1t=1to

TTdo

4:Observe input

utu_{t} 5:Create feature vector

at←[yt−1,…,yt−n,ut,…,ut−n+1]\textbf{a}_{t}\leftarrow[y_{t-1},\dots,y_{t-n},u_{t},\dots,u_{t-n+1}] 6:Update curvature matrix:

At=At−1+atat⊤\textbf{A}_{t}=\textbf{A}_{t-1}+\textbf{a}_{t}\textbf{a}_{t}^{\top} 7:Predict

y^t=at⊤At−1vt−1\hat{y}_{t}=\textbf{a}_{t}^{\top}\textbf{A}_{t}^{-1}\textbf{v}_{t-1} 8:Observe true label

yty_{t}and suffer loss

lt(y^t)=12(y^t−yt)2l_{t}(\hat{y}_{t})=\frac{1}{2}(\hat{y}_{t}-y_{t})^{2} 9:Update target vector:

vt=vt−1+ytat\textbf{v}_{t}=\textbf{v}_{t-1}+y_{t}\textbf{a}_{t} 10:endfor

The critical property of the VAW algorithm is that its regret compared to any benchmarkx∗\textbf{x}^{*}scales linearly with thedimensionof the target and logarithmically with thenormof the benchmark vector. With our mapping above, the dimension of the target corresponds exactly to the effective memory of the signal. This is nicely compatible with Universal Sequence Preconditioning since it guarantees the existence of a benchmarkx∗\textbf{x}^{*}, as above, which has negligible prediction error and very small dimension, i.e. (2n2nis𝒪(log⁡(T))\mathcal{O}(\log(T))), but unfortunately large norm.

Theorem 2(VAW with a polynomial comparator certificate).

Supposey1,…,yTy_{1},\dots,y_{T}comes from Eq.2. AssumeA=PΛP−1\textbf{A}=\textbf{P}\Lambda\textbf{P}^{-1},maxt⁡|yt|≤Y\max_{t}|y_{t}|\leq Yfor someY≥1Y\geq 1, andmaxt⁡|ut|≤1\max_{t}|u_{t}|\leq 1. LetK⊆{z:|z|≤1}K\subseteq\{z:|z|\leq 1\}containλ(A)\lambda(\textbf{A}), and let

pn(z)=zn+∑i=1ncizn−ip_{n}(z)=z^{n}+\sum_{i=1}^{n}c_{i}z^{n-i}be a monic preconditioning polynomial such that

supz∈K|pn(z)|≤αn,max0≤i≤n⁡|ci|≤Gn,\sup_{z\in K}|p_{n}(z)|\leq\alpha_{n},\qquad\max_{0\leq i\leq n}|c_{i}|\leq G_{n},wherec0=1c_{0}=1andGn≥1G_{n}\geq 1. Denote

κ=‖P‖2‖P−1‖2,L=1+T(1+κ‖C‖2‖B‖2).\kappa=||\textbf{P}||_{2}||\textbf{P}^{-1}||_{2},\qquad L=1+T(1+\kappa||\textbf{C}||_{2}||\textbf{B}||_{2}).Then Algorithm2, run with memory lengthnn, satisfies

∑t=1T(y^t−yt)2≤O(TL2αn2+Y2nlog⁡(GnYL)).\sum_{t=1}^{T}(\hat{y}_{t}-y_{t})^{2}\leq O\!\left(TL^{2}\alpha_{n}^{2}+Y^{2}n\log\!\left(G_{n}YL\right)\right).

Proof.

By Eq.5,

yt=−∑i=1nciyt−i+∑s=0n−1θsput−s+ϵt,y_{t}=-\sum_{i=1}^{n}c_{i}y_{t-i}+\sum_{s=0}^{n-1}\theta_{s}^{p}u_{t-s}+\epsilon_{t},with|ϵt|≤Lαn|\epsilon_{t}|\leq L\alpha_{n}. Define the comparator

x∗=[−c1,…,−cn,θ0p,…,θn−1p]\textbf{x}^{*}=[-c_{1},\ldots,-c_{n},\theta_{0}^{p},\ldots,\theta_{n-1}^{p}]for the feature vectorat=[yt−1,…,yt−n,ut,…,ut−n+1]\textbf{a}_{t}=[y_{t-1},\ldots,y_{t-n},u_{t},\ldots,u_{t-n+1}]. Thenx∗⊤at=yt−ϵt{\textbf{x}^{*}}^{\top}\textbf{a}_{t}=y_{t}-\epsilon_{t}, so the comparator’s total approximation loss is at most

∑t=1T(x∗⊤at−yt)2≤TL2αn2.\sum_{t=1}^{T}\left({\textbf{x}^{*}}^{\top}\textbf{a}_{t}-y_{t}\right)^{2}\leq TL^{2}\alpha_{n}^{2}.For eachs<ns<n,

|θsp|=|∑i=0sciCAs−iB|≤(s+1)GnL≤nGnL,|\theta_{s}^{p}|=\left|\sum_{i=0}^{s}c_{i}\textbf{C}\textbf{A}^{s-i}\textbf{B}\right|\leq(s+1)G_{n}L\leq nG_{n}L,so

‖x∗‖2≤O(n3/2GnL).\|\textbf{x}^{*}\|_{2}\leq O(n^{3/2}G_{n}L).The feature vectors have dimension2n2nand norm at mostR=n(Y2+1)≤2nYR=\sqrt{n(Y^{2}+1)}\leq\sqrt{2n}Y. Substituting these bounds into Theorem1, and absorbing polynomial factors inT,n,LT,n,Linto the logarithm, gives

∑t=1T(y^t−yt)2≤O(TL2αn2+Y2nlog⁡(GnYL)),\sum_{t=1}^{T}(\hat{y}_{t}-y_{t})^{2}\leq O\!\left(TL^{2}\alpha_{n}^{2}+Y^{2}n\log(G_{n}YL)\right),as claimed. ∎

First we show how the regret substantially improves upon the original first-order method fromMarsden and Hazan (2025b)in the original setting where the preconditioning polynomial is the Chebyshev polynomial.

Corollary 1(Chebyshev instantiation for real spectra).

Supposey1,…,yTy_{1},\dots,y_{T}comes from Eq.2, whereA=PΛP−1\textbf{A}=\textbf{P}\Lambda\textbf{P}^{-1}andλ(A)⊆[−1,1]\lambda(\textbf{A})\subseteq[-1,1]. Assumemaxt⁡|yt|≤Y\max_{t}|y_{t}|\leq Yfor someY≥1Y\geq 1andmaxt⁡|ut|≤1\max_{t}|u_{t}|\leq 1. Letn=⌈log2⁡(4TL2)⌉n=\left\lceil\log_{2}(4TL^{2})\right\rceil. Run Algorithm2with memory lengthnnon the raw sequence featuresa∗t=[y∗t−1,…,yt−n,ut,…,ut−n+1].\textbf{a}*t=[y*{t-1},\ldots,y_{t-n},u_{t},\ldots,u_{t-n+1}].Then

∑t=1T(y^t−yt)2≤𝒪(Y2n[n+log⁡(YL)])=𝒪(Y2log2⁡(TYL)).\sum_{t=1}^{T}(\hat{y}_{t}-y_{t})^{2}\leq\mathcal{O}\left(Y^{2}n\left[n+\log(YL)\right]\right)=\mathcal{O}\left(Y^{2}\log^{2}(TYL)\right).In particular, for real marginally stable spectra, second-order sequence prediction achieves polylogarithmic cumulative prediction error, with no polynomial dependence on the hidden dimension except throughκ,‖C‖2\kappa,||\textbf{C}||_{2}, and‖B‖2||\textbf{B}||_{2}.

Proof.

We apply Theorem1with the degree-nnmonic Chebyshev polynomialMn(z)=zn+∑i=1ncizn−iM_{n}(z)=z^{n}+\sum_{i=1}^{n}c_{i}z^{n-i}. (The algorithm does not need to know these coefficients. They are used only to exhibit a comparator for the VAW regret bound.) The classical Chebyshev extremal property givessupz∈[−1,1]|Mn(z)|≤21−n\sup_{z\in[-1,1]}|M_{n}(z)|\leq 2^{1-n}. Moreover, the monomial coefficients ofMnM_{n}are exponentially bounded: for a universal constantCcheb>1C_{\mathrm{cheb}}>1,max0≤i≤n⁡|ci|≤Cchebn\max_{0\leq i\leq n}|c_{i}|\leq C_{\mathrm{cheb}}^{n}. Thus Theorem1applies with

αn=21−n,Gn=Cchebn.\alpha_{n}=2^{1-n},\qquad G_{n}=C_{\mathrm{cheb}}^{n}.The approximation term satisfiesTL2αn2=TL222−2n≤1TL^{2}\alpha_{n}^{2}=TL^{2}2^{2-2n}\leq 1by the choice ofnn. The learning term isY2nlog⁡(GnYL)=𝒪(Y2n[n+log⁡(YL)])Y^{2}n\log(G_{n}YL)=\mathcal{O}\left(Y^{2}n\left[n+\log(YL)\right]\right). Sincen=⌈log2⁡(4TL2)⌉=𝒪(log⁡(TYL))n=\left\lceil\log_{2}(4TL^{2})\right\rceil=\mathcal{O}(\log(TYL)), we obtain

∑t=1T(y^t−yt)2=𝒪(Y2log2⁡(TYL)),\sum_{t=1}^{T}(\hat{y}_{t}-y_{t})^{2}=\mathcal{O}\left(Y^{2}\log^{2}(TYL)\right),as claimed. ∎

Lemma 1(Faber polynomial bounds for wedge spectra).

For everyδ∈(0,π)\delta\in(0,\pi)andn≥1n\geq 1, there is a real monic polynomial

pn,δ(z)=zn+∑i=1ncizn−ip_{n,\delta}(z)=z^{n}+\sum_{i=1}^{n}c_{i}z^{n-i}withc0:=1c_{0}:=1such that

supz∈Wπ−δ|pn,δ(z)|≤2e−nδ2/π2,max0≤i≤n⁡|ci|≤100n.\sup_{z\in W_{\pi-\delta}}|p_{n,\delta}(z)|\leq 2e^{-n\delta^{2}/\pi^{2}},\qquad\max_{0\leq i\leq n}|c_{i}|\leq 100^{n}.

Corollary 2(Main result: VAW with Faber preconditioning).

Supposey1,…,yTy_{1},\dots,y_{T}comes from Eq.2, whereA=PΛP−1\textbf{A}=\textbf{P}\Lambda\textbf{P}^{-1}andλ(A)⊆Wπ−δ\lambda(\textbf{A})\subseteq W_{\pi-\delta}for someδ∈(0,π)\delta\in(0,\pi). Assumemaxt⁡|yt|≤Y\max_{t}|y_{t}|\leq Yfor someY≥1Y\geq 1andmaxt⁡|ut|≤1\max_{t}|u_{t}|\leq 1. Set

n=⌈π2δ2log⁡(4T2L)⌉.n=\left\lceil\frac{\pi^{2}}{\delta^{2}}\log(4T^{2}L)\right\rceil.Then Algorithm2, run with memory lengthnn, satisfies

∑t=1T(y^t−yt)2=O(Y2n[n+log⁡(YL)])=O(Y2δ4log2⁡(TYL)).\sum_{t=1}^{T}(\hat{y}_{t}-y_{t})^{2}=O\!\left(Y^{2}n\left[n+\log(YL)\right]\right)=O\!\left(\frac{Y^{2}}{\delta^{4}}\log^{2}(TYL)\right).In particular, for fixedδ,Y,κ,‖C‖2\delta,Y,\kappa,||\textbf{C}||_{2}, and‖B‖2||\textbf{B}||_{2}, the bound is polylogarithmic inTT. The hidden dimension enters only throughκ,‖C‖2\kappa,||\textbf{C}||_{2}, and‖B‖2||\textbf{B}||_{2}.

Proof.

Apply Lemma1and Theorem2withK=Wπ−δK=W_{\pi-\delta},

αn=2e−nδ2/π2,Gn=100n.\alpha_{n}=2e^{-n\delta^{2}/\pi^{2}},\qquad G_{n}=100^{n}.The approximation term is at most a constant for the stated choice ofnn, sinceTL2αn2≤1/TTL^{2}\alpha_{n}^{2}\leq 1/T. Alsolog⁡(Gn)=O(n)\log(G_{n})=O(n). Theorem2therefore gives

∑t=1T(y^t−yt)2≤O(Y2n[n+log⁡(YL)]).\sum_{t=1}^{T}(\hat{y}_{t}-y_{t})^{2}\leq O\!\left(Y^{2}n\left[n+\log(YL)\right]\right).SinceY≥1Y\geq 1andL≥1L\geq 1,log⁡(4T2L)=O(log⁡(TYL))\log(4T^{2}L)=O(\log(TYL))andlog⁡(YL)≤log⁡(TYL)\log(YL)\leq\log(TYL). Also, becauseδ∈(0,π)\delta\in(0,\pi), the ceiling in the definition ofnngivesn=O(δ−2log⁡(TYL))n=O(\delta^{-2}\log(TYL)). Substituting this value ofnngives the expanded bound. ∎

4Experiments

Data Generation.

We evaluate Universal Sequence Preconditioning (USP) on synthetic data generated from a Linear Dynamical System (LDS) governed by the state-space equations from Eq.2. We introduce additive observation noisevt∼𝒩(0,σ2)v_{t}\sim\mathcal{N}(0,\sigma^{2})withσ=0.01\sigma=0.01. To test the constant-angle regime captured by Corollary2, we specify the eigenvalues of the transition matrixAto have a magnitude of1−δ1-\deltawithδ=10−7\delta=10^{-7}(determining the spectral gap) and imaginary components bounded by a thresholdτ=0.1\tau=0.1. The input and output matrices,BandC, are sampled from a standard normal distribution and normalized such that‖B‖2=‖C‖2=1\|\textbf{B}\|_{2}=\|\textbf{C}\|_{2}=1. We evaluate two primary configurations:

1.High-dimensional:dhidden=100d_{\text{hidden}}=100and baseline feature window sizew=100w=100.
2.Low-dimensional:dhidden=10d_{\text{hidden}}=10and baseline feature window sizew=10w=10. The parameterwwdictates the history length for the un-preconditioned baseline models.

Both configurations use a sequence length ofT=5000T=5000. Inputsutu_{t}are generated by sampling from𝒩(0,1)\mathcal{N}(0,1)and scaling by1/log⁡(5(t+2))1/\log(5(t+2))to simulate a decaying excitation signal. To evaluate robustness under poor conditioning, we also test highly correlated input sequences generated from an AR(1) process with coefficientρ=0.99\rho=0.99.

Performance Measurement.

We compare Online Gradient Descent (OGD), Adam, and the Vovk-Azoury-Warmuth (VAW) forecaster. For each method, we perform a grid search over hyperparameters and report results using the optimal configuration. The grids are defined as follows:

•OGD and Adam learning rates:η∈{10−4,10−3,0.01,0.05,0.1}\eta\in\{10^{-4},10^{-3},0.01,0.05,0.1\}.
•VAW regularization parameter:λ∈{10−2,10−1,1,10,100,1000}\lambda\in\{10^{-2},10^{-1},1,10,100,1000\}.

The preconditioning polynomial degreennis swept from0to3030. We evaluate two approaches for the polynomial:

1.Chebyshev: Fixed coefficientsc1,…,cnc_{1},\dots,c_{n}derived from monic Chebyshev polynomials. The VAW targets the preconditioned signal directly by usingy~t=yt+∑i=1nciyt−i\tilde{y}_{t}=y_{t}+\sum_{i=1}^{n}c_{i}y_{t-i}as labels anduuas features.
2.Learnable: Polynomial coefficients treated as learnable parameters optimized alongside the model. This is exactly Algorithm2.

Performance is quantified by the normalized prediction error,|yt−y^t||yt|+ϵ\frac{|y_{t}-\hat{y}_{t}|}{|y_{t}|+\epsilon}(withϵ=10−8\epsilon=10^{-8}). To capture steady-state performance, we compute the mean error over the final200200steps of the sequence for each trial, and report the median of this metric across1010independent trials. Error bars indicate the interquartile range (25th–75th percentiles).

We evaluate the performance of OGD, Adam, and VAW across different degrees of preconditioning. Figure2presents the results for both high-dimensional (d=100d=100) and low-dimensional (d=10d=10) regimes. As anticipated by our theoretical analysis, the Chebyshev-preconditioned VAW achieves superior steady-state error reduction, particularly in the high-dimensional setting where learning the annihilating polynomial is prohibitively complex.

Refer to caption (a)Hidden dimensiond=100d=100. (b)Hidden dimensiond=10d=10.

Figure 2:Normalized prediction error versus degree of preconditioning polynomial for (a)d=100d=100and (b)d=10d=10. The Chebyshev-preconditioned VAW achieves superior steady-state error reduction compared to OGD and Adam as the degree increases.

Results Discussion.

We observe the impact of hidden dimension on the importance of preconditioning with Chebyshev coefficients over learning them. By the Cayley-Hamilton theorem, a perfect autoregressive predictor of degreedhiddend_{\text{hidden}}theoretically exists for this system. As we see in Figure2, when the hidden dimension is high enough, adaptively learning2n2nparameters (which are functionally approximating thedhiddend_{\text{hidden}}-dimensional system) fails to converge effectively for OGD and Adam. However, fixed Chebyshev preconditioning of degreen=𝒪(log⁡T)n=\mathcal{O}(\log T)elegantly bypasses this parameter learning phase.

Next, these results are consistent with the main hypothesis of our paper: VAW is much more stable than first-order methods (i.e. OGD and Adam) to the large coefficients needed to optimally precondition the signal. Indeed, after degree 14 the first-order methods stop gaining benefits from preconditioning and their performance worsens. However, the VAW method continues to leverage higher degrees, getting optimal prediction error at degree 20 before its performance worsens.

4.1Signal Norm and Preconditioning Dynamics

Our theoretical results are impeded by the inability to prove that the preconditioned target signal will have small norm. If we could bound the norm of the preconditioned signal that didn’t grow with the size of the preconditioning coefficients then two important changes could be made to our results: 1) we would not need to learn the autoregressive coefficients and 2) we could just as easily apply Online Newton Step (ONS) to get the state of the art regret bounds. Empirically, however, we see that the preconditioned target signal actually has much smaller norm than the raw signal in many regimes. Figures3(a)and3(b)illustrate the raw signal, generated synthetically as described above, alongside the Chebyshev preconditioned signal (with degree 12) as well as the differenced signal. Figure3(c)sweeps over 100 random trials and plots the maximum norm of the signal along the horizon. We plot the median of the trials and give a shaded region to represent the interquartile range (25th–75th percentiles).

Refer to caption (a)Raw versus preconditioned signals. (b)Chebyshev preconditioning versus differencing. (c)Maximum norm versus Chebyshev degree.

Figure 3:Preconditioning empirically reduces the norm of the signal. If this is provable, learning the autoregressive coefficients for Algorithm2is unnecessary.

5Conclusion

Summary

We addressed a fundamental tension in sequence prediction for marginally stable linear dynamical systems: the trade-off between effective memory compression and the exponential explosion of parameter magnitudes. By interpreting Universal Sequence Preconditioning (USP) through the lens of the Minimum Description Length (MDL) principle, we established that second-order online learning methods—specifically the Vovk-Azoury-Warmuth (VAW) forecaster—are uniquely equipped to absorb the parameter blowup induced by polynomial preconditioning. Our theoretical framework yields near-optimal guarantees. By shifting the learning burden from the parameter norm to the effective dimension, we exponentially improved the cumulative prediction regret from the𝒪~(T11/13)\tilde{\mathcal{O}}(T^{11/13})bounds of prior first-order approaches to𝒪(polylog(T))\mathcal{O}(\text{polylog}(T)), completely eliminating polynomial dependencies on both the time horizon and the hidden state dimension. Furthermore, we introduce Faber preconditioning to extend the theoretical viability of USP to a strictly broader class of dynamical systems with constant complex arguments, namely spectra inWπ−δW_{\pi-\delta}for constantδ>0\delta>0. Empirically, our evaluations robustly corroborate these theoretical insights. The Chebyshev-preconditioned VAW algorithm consistently outperforms both first-order methods and adaptive-polynomial approaches, particularly in adverse regimes characterized by large time horizons, high-dimensional hidden states, and highly correlated input sequences.

Limitations

Our theoretically analyzed VAW algorithm requires learning the autoregressive coefficients. This is a slight departure from the spirit of USP since an offline transformation to the target sequence makes it easier to learn whereas in our case we must learn this transformation. However, it is also deeply connected to USP since its proof heavily relies on construction of the benchmarkx∗\textbf{x}^{*}, which is the polynomial preconditioned solution. Moreover, empirical evidence suggests this requirement may be an artifact of our proof technique. Another limitation of this work is the practicality of the VAW algorithm. It requires a much larger memory footprint than the first order methods. The memory footprint grows poly-logarithmically with horizon length and asO~(δ−2)\tilde{O}(\delta^{-2})in the Faber wedge parameter.

Future Directions

Promising directions for future work include extending these second-order preconditioning frameworks to non-linear dynamical systems, perhaps by using the recent breakthrough ofDogariuet al.(2025). Another important direction is to sharpen the dependence on the angular gapδ\delta, since the current Faber proof gives aδ−4\delta^{-4}loss dependence while the cyclic-permutation lower bound only forces memory on the order of1/δ1/\delta. Ultimately, our results demonstrate that when appropriately paired with second-order optimization, universal preconditioning unlocks fundamentally new capabilities for scalable, highly accurate sequence prediction.

References

[1]K. S. Azoury and M. K. Warmuth(2001)Relative loss bounds for on-line density estimation with the exponential family of distributions.Machine Learning43(3),pp. 211–246.Cited by:§1.3,§1,§3,Algorithm 1.
[2]A. Bakshi, A. Liu, A. Moitra, and M. Yau(2023)A new approach to learning linear dynamical systems.InProceedings of the 55th Annual ACM Symposium on Theory of Computing,pp. 335–348.Cited by:§1.3,Table 1,§1.
[3]N. Cesa-Bianchi and G. Lugosi(2006)Prediction, learning, and games.Cambridge university press.Cited by:§1.3.
[4]C. Chen(1999)Linear system theory and design.3rd edition,Oxford University Press.Cited by:§1.3,Table 1,§1.
[5]J. P. Coleman and R. A. Smith(1987)The faber polynomials for circular sectors.Mathematics of Computation49(180),pp. 231–241.Cited by:Appendix B.
[6]E. Dogariu, A. Brahmbhatt, and E. Hazan(2025)Universal learning of nonlinear dynamics.arXiv preprint arXiv:2508.11990.Cited by:§5.
[7]S. Fattahi(2021)Learning partially observed linear dynamical systems from logarithmic number of samples.InProceedings of the 3rd Conference on Learning for Dynamics and Control,Proceedings of Machine Learning Research, Vol.144,pp. 60–72.Cited by:§1.3.
[8]K. Gatermann, C. Hoffmann, and G. Opfer(1992)Explicit faber polynomials on circular sectors.Mathematics of Computation58(197),pp. 241–253.External Links:DocumentCited by:Appendix B,Appendix B,Appendix B.
[9]U. Ghai, H. Lee, K. Singh, C. Zhang, and Y. Zhang(2020)No-regret prediction in marginally stable systems.InProceedings of the 33rd Conference on Learning Theory,Proceedings of Machine Learning Research, Vol.125,pp. 1714–1757.Cited by:§1.3.
[10]P. D. Grünwald(2007)The minimum description length principle.MIT press.Cited by:§1.3,§1.
[11]M. Hardt, T. Ma, and B. Recht(2018)Gradient descent learns linear dynamical systems.Journal of Machine Learning Research19(29),pp. 1–44.Cited by:§1.3.
[12]E. Hazan, A. Agarwal, and S. Kale(2007)Logarithmic regret algorithms for online convex optimization.Machine Learning69(2-3),pp. 169–192.Cited by:§1.3.
[13]E. Hazan, H. Lee, K. Singh, C. Zhang, and Y. Zhang(2018)Spectral filtering for general linear dynamical systems.InAdvances in Neural Information Processing Systems 31,pp. 4639–4648.Cited by:§1.3,Table 1,§1.
[14]E. Hazan and A. Marsden(2026)Spectral filtering for complex linear dynamical systems.arXiv preprint arXiv:2601.22400.Cited by:§1.2,§1,§2.3.
[15]E. Hazan, K. Singh, and C. Zhang(2017)Learning linear dynamical systems via spectral filtering.InAdvances in Neural Information Processing Systems 30,pp. 6702–6712.Cited by:§1.3.
[16]E. Hazan(2016)Introduction to online convex optimization.Vol.2,Now Publishers, Inc..Cited by:Table 1,§1.
[17]T. Kailath(1980)Linear systems.Prentice-Hall.Cited by:§1.3,Table 1,§1.
[18]H. Lee(2022)Improved rates for prediction and identification of partially observed linear dynamical systems.InProceedings of the 33rd International Conference on Algorithmic Learning Theory,Proceedings of Machine Learning Research, Vol.167,pp. 668–698.Cited by:§1.3,Table 1,§1.
[19]A. Marsden and E. Hazan(2025)Dimension-free regret for learning asymmetric linear dynamical systems.arXiv preprint arXiv:2502.06545.Cited by:§1.3,Table 1.
[20]A. Marsden and E. Hazan(2025)Universal sequence preconditioning.InThe Thirty-ninth Annual Conference on Neural Information Processing Systems,Cited by:§1.1,§1,§1,§2.1,§2.2,§3.
[21]P. Rashidinejad, J. Jiao, and S. J. Russell(2020)SLIP: learning to predict in unknown dynamical systems with long-term memory.InAdvances in Neural Information Processing Systems 33,Cited by:§1.3.
[22]J. Rissanen(1978)Modeling by shortest data description.Automatica14(5),pp. 465–471.Cited by:§1.3,§1.
[23]M. Simchowitz, R. Boczar, and B. Recht(2019)Learning linear dynamical systems with semi-parametric least squares.InProceedings of the 32nd Conference on Learning Theory,Proceedings of Machine Learning Research, Vol.99,pp. 2714–2802.Cited by:§1.3,Table 1,§1.
[24]A. Tsiamis and G. J. Pappas(2019)Finite sample analysis of stochastic system identification.In2019 IEEE 58th Conference on Decision and Control (CDC),pp. 3648–3654.Cited by:§1.3,Table 1,§1.
[25]A. Tsiamis and G. J. Pappas(2023)Online learning of the kalman filter with logarithmic regret.IEEE Transactions on Automatic Control68(5),pp. 2774–2789.Cited by:§1.3.
[26]V. Vovk(2001)Competitive on-line statistics.International Statistical Review69(2),pp. 213–248.Cited by:§1.3,§1,§3,Algorithm 1.

Appendix ACoefficient growth of extremal polynomials

The following lemma shows that any monic polynomial that ensures all values in[−1,1][-1,1]are exponentially small, must have exponentially large coefficients.

Theorem 3(Coefficient growth for flat monic polynomials).

FixM>0M>0. Letp(x)=∑j=0najxj,an=1p(x)=\sum_{j=0}^{n}a_{j}x^{j},a_{n}=1, be a monic polynomial of degreenn. If

maxx∈[−1,1]⁡|p(x)|≤M21−n,\max_{x\in[-1,1]}|p(x)|\leq M2^{1-n},then

max0≤j≤n⁡|aj|≥1n2Ω(n)\max_{0\leq j\leq n}|a_{j}|\geq\frac{1}{n}2^{\Omega(n)}In particular, at least one coefficient ofppis exponentially large innn.

Proof.

We split the proof into two parts. The first part is the conceptual reduction, and the second part is the technical no-cancellation estimate.

Part I: Reduction to a no-cancellation statement.

Writeppin the Chebyshev basis:

p(x)=∑k=0nbkTk(x),p(x)=\sum_{k=0}^{n}b_{k}T_{k}(x),whereTkT_{k}is thekkth Chebyshev polynomial. Since

Tn(x)=2n−1xn+lower-degree terms,T_{n}(x)=2^{n-1}x^{n}+\text{lower-degree terms},andppis monic, the coefficient ofTnT_{n}must be

The remaining Chebyshev coefficients are controlled by the assumption thatppis small on[−1,1][-1,1]. Indeed, usingTk(cos⁡θ)=cos⁡(kθ)T_{k}(\cos\theta)=\cos(k\theta)and orthogonality on[0,π][0,\pi],

|bk|\displaystyle|b_{k}|=|2π∫0πp(cos⁡θ)cos⁡(kθ)𝑑θ|\displaystyle=\left|\frac{2}{\pi}\int_{0}^{\pi}p(\cos\theta)\cos(k\theta)\,d\theta\right|≤2π∫0π|p(cos⁡θ)||cos⁡(kθ)|𝑑θ\displaystyle\leq\frac{2}{\pi}\int_{0}^{\pi}\left|p(\cos\theta)\right|\left|\cos(k\theta)\right|d\theta≤2π∫0π|p(cos⁡θ)|𝑑θ\displaystyle\leq\frac{2}{\pi}\int_{0}^{\pi}\left|p(\cos\theta)\right|d\theta≤2maxx∈[−1,1]⁡|p(x)|≤2M21−n=4M2−n.\displaystyle\leq 2\max_{x\in[-1,1]}|p(x)|\leq 2M2^{1-n}=4M2^{-n}.Thus, after scaling by2n−12^{n-1}, the polynomial

q(x):=2n−1p(x)q(x):=2^{n-1}p(x)has the form

q(x)=Tn(x)+∑k=0n−1ckTk(x),|ck|≤2M.q(x)=T_{n}(x)+\sum_{k=0}^{n-1}c_{k}T_{k}(x),\qquad|c_{k}|\leq 2M. So the whole issue is the following: can the bounded lower-degree Chebyshev terms cancel the coefficient growth ofTnT_{n}? The answer is no. This is the content of the next lemma.

Lemma 2(No cancellation, asymptotic form).

FixM>0M>0. Suppose

q(x)=Tn(x)+∑k=0n−1ckTk(x),|ck|≤2M.q(x)=T_{n}(x)+\sum_{k=0}^{n-1}c_{k}T_{k}(x),\qquad|c_{k}|\leq 2M.Write

q(x)=∑j=0ndjxj.q(x)=\sum_{j=0}^{n}d_{j}x^{j}.Then

max0≤j≤n⁡|dj|≥1n+12(1+ΩM(1))n.\max_{0\leq j\leq n}|d_{j}|\geq\frac{1}{n+1}2^{(1+\Omega_{M}(1))n}.Equivalently, the coefficients ofqqgrow like(2+ΩM(1))n/n(2+\Omega_{M}(1))^{n}/n.

Proof.

Chooseα>1\alpha>1sufficiently large depending only onMM, and set

y:=α−α−12.y:=\frac{\alpha-\alpha^{-1}}{2}.For a polynomialr(x)=∑jrjxjr(x)=\sum_{j}r_{j}x^{j}, define

‖r‖y:=∑j|rj|yj.\|r\|_{y}:=\sum_{j}|r_{j}|y^{j}. We use the standard Chebyshev coefficient estimate

‖Tk‖y=αk+(−1)kα−k2=Θ(αk).\|T_{k}\|_{y}=\frac{\alpha^{k}+(-1)^{k}\alpha^{-k}}{2}=\Theta(\alpha^{k}).Thus

‖Tn‖y=Ω(αn),‖Tk‖y≤O(αk).\|T_{n}\|_{y}=\Omega(\alpha^{n}),\qquad\|T_{k}\|_{y}\leq O(\alpha^{k}).Therefore

‖q‖y\displaystyle\|q\|_{y}≥‖Tn‖y−∑k=0n−1|ck|‖Tk‖y\displaystyle\geq\|T_{n}\|_{y}-\sum_{k=0}^{n-1}|c_{k}|\|T_{k}\|_{y}≥Ω(αn)−OM(∑k=0n−1αk).\displaystyle\geq\Omega(\alpha^{n})-O_{M}\left(\sum_{k=0}^{n-1}\alpha^{k}\right).Since

∑k=0n−1αk≤αnα−1,\sum_{k=0}^{n-1}\alpha^{k}\leq\frac{\alpha^{n}}{\alpha-1},choosingα\alphalarge enough depending onMMgives

‖q‖y≥ΩM(αn).\|q\|_{y}\geq\Omega_{M}(\alpha^{n}). On the other hand, if

D:=max0≤j≤n⁡|dj|,D:=\max_{0\leq j\leq n}|d_{j}|,then

‖q‖y=∑j=0n|dj|yj≤D(n+1)yn.\|q\|_{y}=\sum_{j=0}^{n}|d_{j}|y^{j}\leq D(n+1)y^{n}.Hence

D≥1n+1ΩM((αy)n).D\geq\frac{1}{n+1}\Omega_{M}\left(\left(\frac{\alpha}{y}\right)^{n}\right).Finally,

αy=2α2α2−1>2.\frac{\alpha}{y}=\frac{2\alpha^{2}}{\alpha^{2}-1}>2.Therefore

D≥1n+12(1+ΩM(1))n.D\geq\frac{1}{n+1}2^{(1+\Omega_{M}(1))n}.∎

∎

A constant-rate conjecture.

The preceding theorem proves exponential coefficient growth at the sharp Chebyshev scale2−n2^{-n}. Its proof uses this scale essentially: after normalizing by the leading Chebyshev coefficient, the lower-degree Chebyshev coefficients remain uniformly bounded. For a weaker bound such asexp⁡(−cn)\exp(-cn)withc<log⁡2c<\log 2, the same normalization allows the lower-degree Chebyshev coefficients to be exponentially large, so the no-cancellation argument no longer applies.

We nevertheless expect the qualitative obstruction to persist.

Conjecture 2(Coefficient growth at any exponential rate).

For everyc>0c>0there existsc′>0c^{\prime}>0such that the following holds for all sufficiently largenn. If

pn(x)=xn+∑j=0n−1ajxjp_{n}(x)=x^{n}+\sum_{j=0}^{n-1}a_{j}x^{j}is a monic polynomial satisfying

maxx∈[−1,1]⁡|pn(x)|≤e−cn,\max_{x\in[-1,1]}|p_{n}(x)|\leq e^{-cn},then

max0≤j≤n⁡|aj|≥ec′n.\max_{0\leq j\leq n}|a_{j}|\geq e^{c^{\prime}n}.

Appendix BProof of Lemma1

Recall that Lemma1asserts that for everyδ∈(0,π)\delta\in(0,\pi)andn≥1n\geq 1, there is a real monic polynomial

pn,δ(z)=zn+∑i=1ncizn−ip_{n,\delta}(z)=z^{n}+\sum_{i=1}^{n}c_{i}z^{n-i}withc0:=1c_{0}:=1such that

supz∈Wπ−δ|pn,δ(z)|≤2e−nδ2/π2,max0≤i≤n⁡|ci|≤100n.\sup_{z\in W_{\pi-\delta}}|p_{n,\delta}(z)|\leq 2e^{-n\delta^{2}/\pi^{2}},\qquad\max_{0\leq i\leq n}|c_{i}|\leq 100^{n}.

Proof.

Part I: Exponentially small norm on the wedge.

Forθ∈(0,π)\theta\in(0,\pi), define

Wθ={z∈ℂ:|z|≤1,|arg⁡(z)|≤θ}.W_{\theta}=\{z\in\mathbb{C}:|z|\leq 1,\ |\arg(z)|\leq\theta\}.We use the following sector-specific Faber estimate for circular sectors, due to Coleman and Smith[5]and stated in the convenient normalization of Gatermann, Hoffmann, and Opfer[8]. LetFn,θF_{n,\theta}be the monic degree-nnFaber polynomial ofWθW_{\theta}, and letρ(θ)\rho(\theta)be the transfinite diameter ofWθW_{\theta}. Coleman and Smith compute

ρ(θ)=t−t(2−t)t−2,t=θπ.\rho(\theta)=t^{-t}(2-t)^{t-2},\qquad t=\frac{\theta}{\pi}.In the notation of[8, Eq. (2.6)], the scaled Faber polynomial is

F~n,θ(z)=ρ(θ)−nFn,θ(z).\widetilde{F}_{n,\theta}(z)=\rho(\theta)^{-n}F_{n,\theta}(z).Moreover,[8, Eq. (3.10)]gives

supz∈Wθ|F~n,θ(z)|≤2.\sup_{z\in W_{\theta}}|\widetilde{F}_{n,\theta}(z)|\leq 2.Consequently,

supz∈Wθ|Fn,θ(z)|≤2ρ(θ)n.\sup_{z\in W_{\theta}}|F_{n,\theta}(z)|\leq 2\rho(\theta)^{n}. Set

θ=π−δ,η=δπ.\theta=\pi-\delta,\qquad\eta=\frac{\delta}{\pi}.Then

t=θπ=1−η,t=\frac{\theta}{\pi}=1-\eta,and therefore

ρ(π−δ)=(1−η)−(1−η)(1+η)−(1+η).\rho(\pi-\delta)=(1-\eta)^{-(1-\eta)}(1+\eta)^{-(1+\eta)}.Thus

log⁡(1ρ(π−δ))=(1+η)log⁡(1+η)+(1−η)log⁡(1−η).\log\!\left(\frac{1}{\rho(\pi-\delta)}\right)=(1+\eta)\log(1+\eta)+(1-\eta)\log(1-\eta).For0≤η<10\leq\eta<1,

(1+η)log⁡(1+η)+(1−η)log⁡(1−η)=∑k=1∞2η2k(2k)(2k−1)≥η2.(1+\eta)\log(1+\eta)+(1-\eta)\log(1-\eta)=\sum_{k=1}^{\infty}\frac{2\eta^{2k}}{(2k)(2k-1)}\geq\eta^{2}.Hence

ρ(π−δ)≤e−η2=e−δ2/π2.\rho(\pi-\delta)\leq e^{-\eta^{2}}=e^{-\delta^{2}/\pi^{2}}.Consequently,

supz∈Wπ−δ|Fn,π−δ(z)|≤2e−nδ2/π2.\sup_{z\in W_{\pi-\delta}}|F_{n,\pi-\delta}(z)|\leq 2e^{-n\delta^{2}/\pi^{2}}. Finally,Wπ−δW_{\pi-\delta}is invariant under complex conjugation. Therefore, if necessary, replacingFn,π−δF_{n,\pi-\delta}by

pn,δ(z)=Fn,π−δ(z)+Fn,π−δ(z¯)¯2p_{n,\delta}(z)=\frac{F_{n,\pi-\delta}(z)+\overline{F_{n,\pi-\delta}(\overline{z})}}{2}preserves monicity, gives real coefficients, and does not increase the supremum norm onWπ−δW_{\pi-\delta}. Thus we may takepn,δp_{n,\delta}to be real and monic, with

supz∈Wπ−δ|pn,δ(z)|≤2e−nδ2/π2.\sup_{z\in W_{\pi-\delta}}|p_{n,\delta}(z)|\leq 2e^{-n\delta^{2}/\pi^{2}}. Part II: Coefficient bound.

Write

pn,δ(z)=zn+∑i=1ncizn−i,c0=1.p_{n,\delta}(z)=z^{n}+\sum_{i=1}^{n}c_{i}z^{n-i},\qquad c_{0}=1.Since[0,1]⊆Wπ−δ[0,1]\subseteq W_{\pi-\delta}, Part I gives

supx∈[0,1]|pn,δ(x)|≤2.\sup_{x\in[0,1]}|p_{n,\delta}(x)|\leq 2.Define

q(x)=pn,δ(x+12).q(x)=p_{n,\delta}\!\left(\frac{x+1}{2}\right).Then

supx∈[−1,1]|q(x)|≤2.\sup_{x\in[-1,1]}|q(x)|\leq 2. We use the following elementary fact: if a degree-nnpolynomialqqsatisfiessupx∈[−1,1]|q(x)|≤2\sup_{x\in[-1,1]}|q(x)|\leq 2, then theℓ1\ell_{1}norm of its coefficient is at most24n24^{n}. To see this, writeqqin the Chebyshev basis

q(x)=∑k=0nbkTk(x).q(x)=\sum_{k=0}^{n}b_{k}T_{k}(x).By the usual Chebyshev orthogonality formulas,

b0=1π∫0πq(cos⁡θ)𝑑θ,bk=2π∫0πq(cos⁡θ)cos⁡(kθ)𝑑θ(k≥1).b_{0}=\frac{1}{\pi}\int_{0}^{\pi}q(\cos\theta)\,d\theta,\qquad b_{k}=\frac{2}{\pi}\int_{0}^{\pi}q(\cos\theta)\cos(k\theta)\,d\theta\quad(k\geq 1).Sincesupx∈[−1,1]|q(x)|≤2\sup_{x\in[-1,1]}|q(x)|\leq 2, it follows that

|bk|≤4,0≤k≤n.|b_{k}|\leq 4,\qquad 0\leq k\leq n.Also, the recurrence

Tk+1(x)=2xTk(x)−Tk−1(x)T_{k+1}(x)=2xT_{k}(x)-T_{k-1}(x)implies that the monomial coefficientℓ1\ell_{1}norm ofTkT_{k}is at most3k3^{k}. Hence

‖q‖1≤∑k=0n|bk|‖Tk‖1≤4(n+1)3n≤24n.\|q\|_{1}\leq\sum_{k=0}^{n}|b_{k}|\,\|T_{k}\|_{1}\leq 4(n+1)3^{n}\leq 24^{n}. Finally,

pn,δ(z)=q(2z−1).p_{n,\delta}(z)=q(2z-1).Substituting2z−12z-1increases the monomial coefficientℓ1\ell_{1}norm by at most3n3^{n}, since

‖(2z−1)j‖1=3j≤3nfor allj≤n.\|(2z-1)^{j}\|_{1}=3^{j}\leq 3^{n}\qquad\text{for all }j\leq n.Therefore

∑i=0n|ci|=‖pn,δ‖1≤3n‖q‖1≤72n≤100n.\sum_{i=0}^{n}|c_{i}|=\|p_{n,\delta}\|_{1}\leq 3^{n}\|q\|_{1}\leq 72^{n}\leq 100^{n}.In particular,

max0≤i≤n⁡|ci|≤100n.\max_{0\leq i\leq n}|c_{i}|\leq 100^{n}.Combining Parts I and II proves the lemma. ∎

The Power of Second Order Methods for Sequence Preconditioning

Abstract

1Introduction

1.1Our Results

Logarithmic Regret for Sequence Prediction:

Extension to Systems with Constant Complex Arguments:

1.2Tightness and the Role of the Angular Gap

1.3Related Work

Second-Order Online Learning.

Minimum Description Length (MDL) and Information Theory.

2Theoretical Results Overview

2.1Overview of Universal Sequence Preconditioning

2.2Contribution One: The Power of a Second Order Method

Conjecture 1.

2.3Contribution Two: Extending to Asymmetric Systems Via Faber Polynomials

3Main Theorems and Proofs

Theorem 1(VAW Regret).

Transforming sequence prediction with short memory into linear prediction with small dimension:

Theorem 2(VAW with a polynomial comparator certificate).

Proof.

Corollary 1(Chebyshev instantiation for real spectra).

Proof.

Lemma 1(Faber polynomial bounds for wedge spectra).

Corollary 2(Main result: VAW with Faber preconditioning).

Proof.

4Experiments

Data Generation.

Performance Measurement.

Results Discussion.

4.1Signal Norm and Preconditioning Dynamics

5Conclusion

Summary

Limitations

Future Directions

References

Appendix ACoefficient growth of extremal polynomials

Theorem 3(Coefficient growth for flat monic polynomials).

Proof.

Part I: Reduction to a no-cancellation statement.

Lemma 2(No cancellation, asymptotic form).

Proof.

A constant-rate conjecture.

Conjecture 2(Coefficient growth at any exponential rate).

Appendix BProof of Lemma1

Proof.

Similar Articles

@ClementDelangue: Paper of the day! https://huggingface.co/papers/2605.13301…

Gated QKAN-FWP: Scalable Quantum-inspired Sequence Learning

@Propriocetive: New preprint: Mathematics is All You Need 2 — Sign-Stabilized Behavioral Fibers in Transformer Residual Streams. Headli…

@artemZholus: thanks! in the second paper (https://arxiv.org/abs/2605.06388) we used your (and RAE's) recipe and it worked.

@timlautk: 1/4 New paper with @weijie444! We introduce a symmetry-compatible principle for LLM optimizer design and, as a byproduc…

Submit Feedback