Adaptive Patching Is Harder Than It Looks For Time-Series Forecasting

arXiv cs.LG 06/04/26, 04:00 AM Papers
Summary
This paper theoretically and empirically examines adaptive patching for time-series Transformers, deriving conditions under which content-adaptive tokenization should outperform tuned uniform patching. Controlled experiments on standard benchmarks show that a well-tuned uniform baseline is competitive with dynamic patching methods, challenging the assumed benefit of adaptive approaches.
arXiv:2606.04074v1 Announce Type: new
Cached at: 06/05/26, 02:20 AM
# Adaptive Patching Is Harder Than it Looks for Time-Series Forecasting
Source: [https://arxiv.org/html/2606.04074](https://arxiv.org/html/2606.04074)
Federico Zucchi (1,5), Yi Xie (2), Chao Zhang (3), Keyuan Luo (4), Thomas Lampert (1), Ziyue Li (2,6)

(1) ICube, University of Strasbourg, Illkirch-Graffenstaden, France; (2) Technical University of Munich; (3) FinTech Thrust, The Hong Kong University of Science and Technology (Guangzhou); (4) Computer Science Department, Hainan Bielefeld University of Applied Sciences; (5) Cephalgo, Strasbourg, France; (6) Heilbronn Data Science Center, Munich Data Science Institute

f.zucchi@unistra.fr, aaron.xie@tum.de, chaoz@hkust-gz.edu.cn, keyuan.luo@hibiuh.edu.cn, lampert@unistra.fr, ziyue.li@tum.de

###### Abstract

Adaptive patching is a recent and compelling proposal for time\-series Transformers: allocate finer patches where the sequence looks locally informative\. This paper asks under what conditions a content\-adaptive patching operator should outperform a tuned uniform one\. Local heterogeneity alone is not enough: under pointwise forecasting losses, a complex\-looking region is not automatically one where finer patching reduces the loss\. We model patching as a budgeted bitrate allocation and derive an explicit threshold that a dynamic patching rule must satisfy to beat a well\-tuned uniform baseline, then bound the achievable improvement both locally \(a quadratic surrogate\) and globally \(a strong\-convexity bound under the model’s assumptions\)\. Two structural results follow: without a coupling constraint, scalar local complexity cannot produce a non\-uniform optimum under a common loss landscape; and once the backbone is trained to its representation\-aware optimum, the alignment gain collapses around a well\-tuned uniform patch size\. To test these predictions, we run a controlled isolation study on three representative architectures, replacing each adaptive mechanism with a uniform patch\-size sweep while keeping the backbone, data, and training protocol fixed\. On standard long\-horizon forecasting benchmarks, the validation\-selected uniform baseline is competitive with the dynamic counterpart, with per\-setting effects concentrated near zero and no consistent directional advantage once results are aggregated by dataset\. The larger gains we do observe are method\- and dataset\-specific\. Adaptive patching should therefore be evaluated against a tuned uniform baseline; its value depends on whether a cheap and reliable routing signal can identify where finer patches actually reduce forecasting loss\.

## 1 Introduction

Patching is now the default tokenization step for time-series Transformers. The construction came straight from vision (Dosovitskiy et al., [2021](https://arxiv.org/html/2606.04074#bib.bib1)) via PatchTST (Nie et al., [2023](https://arxiv.org/html/2606.04074#bib.bib2)), and was kept by later forecasting architectures and foundation models (Liu et al., [2024](https://arxiv.org/html/2606.04074#bib.bib8); Das et al., [2024](https://arxiv.org/html/2606.04074#bib.bib9); Woo et al., [2024](https://arxiv.org/html/2606.04074#bib.bib10); Ansari et al., [2024](https://arxiv.org/html/2606.04074#bib.bib36)). Vision has moved on. A line of work there argues that a uniform grid wastes resolution on flat regions and starves detailed ones, and adapts the token budget where the image carries more structure (Dehghani et al., [2023](https://arxiv.org/html/2606.04074#bib.bib21); Ronen et al., [2023](https://arxiv.org/html/2606.04074#bib.bib24); Yin et al., [2022](https://arxiv.org/html/2606.04074#bib.bib22); Bolya et al., [2023](https://arxiv.org/html/2606.04074#bib.bib31); Yin et al., [2025](https://arxiv.org/html/2606.04074#bib.bib23)). Carrying this idea over to time series sounds reasonable: change points, regime shifts, and bursts of high-frequency structure look meaningful in the input, so they should be where finer patches reduce the forecasting loss. This is the premise behind a growing family of dynamic patching methods, which route boundaries or token granularity using predictive entropy, temporal heterogeneity, learned information density, or mixture-of-size mechanisms (Abeywickrama et al., [2025](https://arxiv.org/html/2606.04074#bib.bib12); Ding et al., [2026](https://arxiv.org/html/2606.04074#bib.bib15); Huang et al., [2024](https://arxiv.org/html/2606.04074#bib.bib20); Ankireddy et al., [2026](https://arxiv.org/html/2606.04074#bib.bib13); Feng et al., [2025](https://arxiv.org/html/2606.04074#bib.bib19)).

The premise is harder to defend than it looks, because “looks informative” and “finer patching helps the loss” are two different statements\. First, there is no canonical definition of local information density in a forecasting series, so every method commits to a proxy and inherits whatever that proxy gets wrong\. Second, pointwise losses like MSE do not by themselves say where extra resolution should be spent; a high\-variance region need not be one whose error a finer patch can actually reduce\. Third, even when the routing is on the right track, any gain has to clear the overhead of the routing mechanism, and that gain shrinks fast once the proxy is noisy or weakly aligned with the loss\.

![Refer to caption](https://arxiv.org/html/2606.04074v1/x1.png)

![Refer to caption](https://arxiv.org/html/2606.04074v1/x2.png)

Figure 1: Tuned uniform patching versus three dynamic methods. For each (method, dataset, horizon), we choose the uniform patch size with the lowest validation MSE, then report the corresponding test-MSE gain and training speed. Left: relative test-MSE gain of that validation-selected uniform patch over the matched dynamic baseline. Right: speedup uses the same selected patch, averaged over horizons within each dataset.

#### This paper.

Our starting point is empirical. We take three recent dynamic patching methods, EntroPE (Abeywickrama et al., [2025](https://arxiv.org/html/2606.04074#bib.bib12)), TimeMosaic (Ding et al., [2026](https://arxiv.org/html/2606.04074#bib.bib15)), and HDMixer (Huang et al., [2024](https://arxiv.org/html/2606.04074#bib.bib20)), and replace each adaptive mechanism with a uniform patch-size sweep under the same backbone, data, and training protocol. The validation-selected uniform baseline's test MSE matches or beats the dynamic variant on 56.3%, 75.0%, and 59.4% of settings for EntroPE, HDMixer, and TimeMosaic respectively ([Figure 1](https://arxiv.org/html/2606.04074#S1.F1), left), with training cost at parity or lower (per-method mean training speedup at the validation-selected patch is 1.27× for EntroPE, 1.69× for HDMixer, and 1.24× for TimeMosaic; up to 3.4× on individual datasets, [Figure 1](https://arxiv.org/html/2606.04074#S1.F1), right). The pattern is familiar from long-horizon forecasting, where Zeng et al. ([2023](https://arxiv.org/html/2606.04074#bib.bib3)) showed that DLinear can match tuned Transformer architectures and Chen et al. ([2023](https://arxiv.org/html/2606.04074#bib.bib4)) reinforced the point with TSMixer. We push that intuition one level down. Instead of questioning whether the backbone must be a Transformer, we focus on whether the underlying tokenization policy itself needs to be adaptive.
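The selection protocol described above can be sketched in a few lines. This is only an illustration: the MSE values are hypothetical placeholders rather than results from the paper, the helper names are invented here, and the sign convention for the relative gain (positive when the uniform baseline is better) is an assumption.

```python
from typing import Dict


def select_uniform_patch(val_mse: Dict[int, float]) -> int:
    """Return the patch size with the lowest validation MSE."""
    return min(val_mse, key=val_mse.get)


def relative_gain(test_mse_uniform: float, test_mse_dynamic: float) -> float:
    """Relative test-MSE gain of uniform over dynamic (assumed convention:
    positive means the uniform baseline has lower test MSE)."""
    return (test_mse_dynamic - test_mse_uniform) / test_mse_dynamic


# Hypothetical numbers for a single (method, dataset, horizon) setting.
val_mse = {8: 0.412, 16: 0.398, 32: 0.405}    # patch size -> validation MSE
test_mse = {8: 0.431, 16: 0.420, 32: 0.418}   # patch size -> test MSE
dynamic_test_mse = 0.424                      # matched dynamic baseline

best_patch = select_uniform_patch(val_mse)    # chosen on validation only
gain = relative_gain(test_mse[best_patch], dynamic_test_mse)
print(f"selected patch size: {best_patch}, relative test-MSE gain: {gain:.4f}")
```

Note that the patch size is picked on validation MSE alone, so a patch size with a better test MSE (32 in these made-up numbers) is deliberately not selected.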

To explain when this should and should not happen, we adapt an information-theoretic picture. Treat patching as a budgeted bitrate allocation, with local complexity $K_t \geq 0$ matched against bitrate $r_t$ and forecasting error approximated by a strictly convex distortion $D$. A short calculation gives a ceiling on how much a dynamic schedule can gain over a budget-matched uniform baseline ([Theorem 3.8](https://arxiv.org/html/2606.04074#S3.Thmtheorem8)):

$$
\Delta_{\max} \;=\; \frac{\bigl(-D'(\bar{r})\bigr)^{2}\,\rho(K,r)^{2}\,\sigma_{K}^{2}}{2\,\bar{K}\,D''(\bar{r})}. \tag{1}
$$

The numerator collects what a dynamic schedule can win by routing bitrate where it reduces loss. The denominator is what it must pay: the curvature $D''(\bar{r})$ is the rate at which Jensen's inequality penalises any non-uniform allocation under a convex loss. Three consequences read directly off the formula and line up with [Figure 1](https://arxiv.org/html/2606.04074#S1.F1). With no alignment ($\rho = 0$), the ceiling is zero: routing uncorrelated with loss-relevant complexity provides no gain. Separately, without a budget coupling between time steps, scalar local complexity cannot produce a non-uniform optimum regardless of the routing signal; this is the scalar-invariance result ([Theorem 3.3](https://arxiv.org/html/2606.04074#S3.Thmtheorem3)). Because $\rho$ enters quadratically, the gain ceiling falls off rapidly as alignment weakens: even modest misalignment can eliminate most of the available headroom. Adaptivity also needs real complexity variation rather than just observable irregularity, since $\sigma_{K}^{2}$ controls the headroom linearly. And once the uniform baseline is tuned near its representation-aware optimum, $-D'(\bar{r})$ approaches zero and the ceiling collapses; this is the optimality trap.

#### Contributions\.

- A *controlled isolation study* on EntroPE, TimeMosaic, and HDMixer: with backbone, data, and training protocol held fixed, replacing each adaptive mechanism with a tuned uniform patch size leaves accuracy near parity and training cost the same or lower, with speedups up to 5.3× on a selected horizon-level setting ([Figure 1](https://arxiv.org/html/2606.04074#S1.F1)).
- A *theoretical account of when adaptivity can and cannot help*: dynamic patching wins only if its routing signal aligns well with the regions where extra resolution actually reduces forecasting loss, and the alignment must be strong enough to overcome the penalty for spending the budget unevenly. That headroom is structurally fragile. Modest misalignment removes most of it; tuning the uniform patch size to the backbone's sweet spot collapses what is left, the optimality trap; and a scalar measure of local complexity on its own cannot drive a non-uniform optimum without a coupling constraint between time steps ([Theorem 3.3](https://arxiv.org/html/2606.04074#S3.Thmtheorem3)).

## 2 Related work

#### Uniform patching for time-series Transformers.

PatchTST (Nie et al., [2023](https://arxiv.org/html/2606.04074#bib.bib2)) established that partitioning a length-$T$ series into $N = \lceil T/b \rceil$ patches of fixed length $b$ reduces attention cost from $\mathcal{O}(T^{2})$ to $\mathcal{O}(N^{2})$ while providing implicit local smoothing; iTransformer (Liu et al., [2024](https://arxiv.org/html/2606.04074#bib.bib8)) kept a fixed-token convention while reorganising what each token represents. Subsequent foundation-model work (TimesFM (Das et al., [2024](https://arxiv.org/html/2606.04074#bib.bib9)), Chronos (Ansari et al., [2024](https://arxiv.org/html/2606.04074#bib.bib36)), Moirai (Woo et al., [2024](https://arxiv.org/html/2606.04074#bib.bib10))) mostly retained static patching or tokenization choices as the default. We treat the *tuned* uniform patch size as the operating point whose difficulty of improvement we characterise, extending to tokenization the strong-baseline lesson that DLinear and TSMixer established for the backbone (Zeng et al., [2023](https://arxiv.org/html/2606.04074#bib.bib3); Chen et al., [2023](https://arxiv.org/html/2606.04074#bib.bib4)).
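As a quick arithmetic check of the patch-count formula above, the following sketch computes $N = \lceil T/b \rceil$ and the resulting reduction in pairwise attention cost. The function names and the example lengths are illustrative assumptions, and the cost model counts only token pairs, ignoring constants and embedding costs.

```python
import math


def patch_count(T: int, b: int) -> int:
    """Number of fixed-length patches N = ceil(T / b)."""
    return math.ceil(T / b)


def attention_reduction(T: int, b: int) -> float:
    """Ratio of pairwise attention cost T^2 (pointwise tokens) to N^2 (patch tokens)."""
    N = patch_count(T, b)
    return T**2 / N**2


# Illustrative lengths only.
print(patch_count(336, 16), attention_reduction(336, 16))
print(patch_count(100, 16))  # the last patch is partial, so the count rounds up
```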

#### Dynamic patching and adaptive tokenization in time series\.

Recent methods place boundaries adaptively or vary granularity using a range of proxies: EntroPE (Abeywickrama et al., [2025](https://arxiv.org/html/2606.04074#bib.bib12)) uses predictive entropy; TimeSqueeze (Ankireddy et al., [2026](https://arxiv.org/html/2606.04074#bib.bib13)) uses an SSM encoder to measure information density; TimeMosaic (Ding et al., [2026](https://arxiv.org/html/2606.04074#bib.bib15)) uses temporal heterogeneity; PathFormer (Chen et al., [2024](https://arxiv.org/html/2606.04074#bib.bib16)), DualSG (Ding et al., [2025](https://arxiv.org/html/2606.04074#bib.bib17)), PatchMLP (Tang and Zhang, [2025](https://arxiv.org/html/2606.04074#bib.bib18)), and Kairos (Feng et al., [2025](https://arxiv.org/html/2606.04074#bib.bib19)) route or mix multiple granularities; and HDMixer (Huang et al., [2024](https://arxiv.org/html/2606.04074#bib.bib20)) mixes hierarchical, length-variable patches. ReinPatch (Wu et al., [2026](https://arxiv.org/html/2606.04074#bib.bib14)) instead learns patch boundary placement as a reinforcement-learning policy jointly with the downstream sequence backbone. A related but distinct line reshapes the training objective or patch-specific processing rather than the boundaries themselves (Kudrat et al., [2025](https://arxiv.org/html/2606.04074#bib.bib25); Hu et al., [2025](https://arxiv.org/html/2606.04074#bib.bib26)). Fixed multiscale alternatives such as TimeMixer (Wang et al., [2024](https://arxiv.org/html/2606.04074#bib.bib6)) and N-HiTS (Challu et al., [2023](https://arxiv.org/html/2606.04074#bib.bib7)) show that hierarchical structure can be exploited without dynamic patch allocation; gains attributed to adaptivity may partly reflect this hierarchy rather than within-window variation.

#### Adaptive tokenization and computation\.

Vision transformers provide the empirical precedent for adaptive tokenization. NaViT (Dehghani et al., [2023](https://arxiv.org/html/2606.04074#bib.bib21)) processes images at their native resolution by packing variable-length token sequences; adaptive patch-size ViTs (Ronen et al., [2023](https://arxiv.org/html/2606.04074#bib.bib24)) learn spatially-varying granularity; A-ViT (Yin et al., [2022](https://arxiv.org/html/2606.04074#bib.bib22)) halts computation per token according to an informativeness criterion; and Token Merging (Bolya et al., [2023](https://arxiv.org/html/2606.04074#bib.bib31)) compresses redundant tokens post-hoc. The same idea has a longer history in *adaptive computation*: Adaptive Computation Time lets recurrent networks learn input-dependent computation steps (Graves, [2016](https://arxiv.org/html/2606.04074#bib.bib29)); PonderNet learns a probabilistic halting distribution (Banino et al., [2021](https://arxiv.org/html/2606.04074#bib.bib39)); sparse Mixture-of-Experts routes examples or tokens to a subset of experts (Shazeer et al., [2017](https://arxiv.org/html/2606.04074#bib.bib40); Fedus et al., [2022](https://arxiv.org/html/2606.04074#bib.bib41)); and Mixture-of-Depths routes tokens through a subset of Transformer blocks (Raposo et al., [2024](https://arxiv.org/html/2606.04074#bib.bib30)). These works motivate the same trade-off studied here: adaptive computation is useful only when the routing or halting policy is cheap, stable, and aligned with the objective.

## 3 Theory: structural limits of dynamic patching

A minimal formal model makes the paper's main question explicit: when, and by how much, can a dynamic patching scheme improve over a tuned uniform baseline? Our formalization follows classical rate–distortion theory (Cover and Thomas, [2006](https://arxiv.org/html/2606.04074#bib.bib11)), which trades bitrate against distortion.

### 3.1 Setup

###### Definition 3.1 (Bitrate density).

Fix a horizon of $T$ steps and a per-token capacity $C$. A patching scheme produces $N$ patches $P_{1},\dots,P_{N}$ covering (possibly overlapping) intervals $S_{i}$ of length $L_{i}$. The *effective bitrate* at step $t$ is

$$
r_{t} \;:=\; \sum_{i=1}^{N}\frac{C}{L_{i}}\,\mathbf{1}\{t\in S_{i}\}, \tag{2}
$$

so each patch contributes $C/L_{i}$ bits uniformly across its span.
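A minimal sketch of Equation (2), assuming patches are given as hypothetical `(start, length)` pairs that lie inside the window. It compares a uniform layout with a non-uniform one that uses the same number of patches; the specific values of `T`, `C`, and the layouts are made up for illustration.

```python
import numpy as np


def effective_bitrate(T, patches, C=1.0):
    """Effective bitrate r_t of Eq. (2).

    patches: list of (start, length) pairs, assumed to lie within [0, T).
    Overlapping patches simply add their contributions.
    """
    r = np.zeros(T)
    for start, length in patches:
        r[start:start + length] += C / length  # each patch spreads C bits over its span
    return r


T, C = 12, 8.0
uniform = [(0, 4), (4, 4), (8, 4)]   # three patches of length 4
dynamic = [(0, 2), (2, 2), (4, 8)]   # fine patches first, one coarse patch after

r_uni = effective_bitrate(T, uniform, C)
r_dyn = effective_bitrate(T, dynamic, C)
print(r_uni)
print(r_dyn)
print(r_uni.sum(), r_dyn.sum())  # both layouts use N = 3 patches of capacity C
```

Because each patch carries $C$ bits in total regardless of its length, both layouts sum to the same amount, so the non-uniform layout only moves bitrate between time steps.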

###### Definition 3.2 (Feasible allocations and objective).

For analysis, sample $t$ uniformly from $\{0,\ldots,T-1\}$ and write $K=K_{t}$, $r=r_{t}$, and $\mathbb{E}$ for the empirical average over time. Let $K_{t}\geq 0$ with $\bar{K}:=\mathbb{E}[K]>0$, and let $I\subset(0,\infty)$ be the admissible bitrate interval. The continuous relaxation is $\mathcal{F}(\bar{r},I):=\{r\in I^{T}:\mathbb{E}[r]=\bar{r}\}$, of which realizable patching policies form a subset. For a convex distortion $D:I\to\mathbb{R}$, define the normalized dynamic loss $\mathcal{J}_{\text{dyn}}(r):=\mathbb{E}[K\,D(r)]$, uniform loss $\mathcal{J}_{\text{uni}}:=\bar{K}\,D(\bar{r})$, and gain $\Delta_{D}(r):=\mathcal{J}_{\text{uni}}-\mathcal{J}_{\text{dyn}}(r)$. When pure rate–distortion behavior is invoked, we additionally assume $D'(r)<0$ and $D''(r)>0$ on the relevant interval.

The basic question is whether a dynamic allocation $\{r_{t}\}$ can make $\Delta_{D}(r)>0$ against the uniform baseline $r_{t}\equiv\bar{r}$. Before introducing the budget, we ask why unconstrained dynamic behavior would not arise from scalar complexity alone.

### 3.2 Scalar invariance and the necessity of coupling

Without any constraint linking the time steps, a common per-step rate loss $g$ gives the decoupled problem

$$
\min_{\{r_{t}\in I\}}\sum_{t=0}^{T-1}K_{t}\,g(r_{t})\;=\;\sum_{t=0}^{T-1}\min_{r_{t}\in I}\,K_{t}\,g(r_{t}). \tag{3}
$$

The local complexity $K_{t}$ is only a positive scalar multiplier: it changes the cost of choosing the wrong rate, not which rate is pointwise optimal.

###### Theorem 3.3 (Scalar invariance in the unconstrained regime).

Let $K_{t}>0$ and let the same rate loss $g:I\to\mathbb{R}$ apply at every time step. If $g$ has a unique minimiser $r^{\star}\in I$, the unique unconstrained optimizer is $r_{t}^{\star}=r^{\star}$ for all $t$. If instead $g=D$ is strictly decreasing on a compact interval $I=[r_{\min},r_{\max}]$, the unconstrained optimizer is the boundary allocation $r_{t}^{\star}=r_{\max}$ for all $t$; on an unbounded interval the pure decreasing problem has no finite optimum. In all cases, scalar complexity alone cannot induce a dynamic rate allocation.

*Intuition.* A positive scalar $K_{t}$ rescales the loss surface but does not move its argmin. Thus an unconstrained common-loss objective either selects the same representation-aware optimum everywhere, or, for pure decreasing rate–distortion, runs to the same upper rate boundary everywhere.
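The scalar-invariance statement can be checked numerically on a grid. The sketch below is an illustration under assumed choices: the two rate losses (a quadratic with minimiser at 2 and $1/r$), the interval $[0.5, 4]$, and the complexity values are all hypothetical.

```python
import numpy as np


def unconstrained_optimum(K, g, grid):
    """Per-step argmin over the grid of K_t * g(r), solved independently for each t."""
    costs = K[:, None] * g(grid)[None, :]   # shape (T, len(grid))
    return grid[np.argmin(costs, axis=1)]


K = np.linspace(0.1, 10.0, 50)         # widely varying positive complexities
grid = np.linspace(0.5, 4.0, 351)      # admissible interval I = [0.5, 4.0]


def g_u_shaped(r):
    return (r - 2.0) ** 2 + 0.3        # unique minimiser at r = 2


def g_decreasing(r):
    return 1.0 / r                     # strictly decreasing on I


r_u = unconstrained_optimum(K, g_u_shaped, grid)
r_dec = unconstrained_optimum(K, g_decreasing, grid)

print(np.unique(r_u))    # a single rate, whatever K_t is
print(np.unique(r_dec))  # the upper boundary of the interval
```

In both cases the per-step optimum is constant in $t$ even though $K_t$ varies by two orders of magnitude, which is the behaviour Theorem 3.3 describes.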

###### Corollary 3.4 (Necessity of coupling or landscape variation).

A dynamic scheme can have a non-constant optimum only when at least one of the following makes the optimal rate depend on $t$:

1. *Budget coupling:* an explicit total-information constraint forces time steps into a shared resource problem; the KKT condition $K_{t}D'(r_{t}^{\star})+\lambda=0$ then yields a $K_{t}$-dependent optimum (see [Theorem A.1](https://arxiv.org/html/2606.04074#A1.Thmtheorem1)).
2. *Time-varying loss landscape:* different time steps have genuinely different $\ell_{t}(r)$, so per-step optima differ independently of coupling.
3. *Additional structural constraints:* token limits, boundary constraints, nonlocal dependencies, or computational costs that create further coupling.

Budget coupling alone (decreasing convex $D$, active budget) suffices for the optimal allocation to depend on $K_{t}$; however, if the uniform operating point already equals the unique minimiser $r^{\star}$, [Proposition 3.11](https://arxiv.org/html/2606.04074#S3.Thmtheorem11) shows dynamic patching cannot improve.

The natural coupling is a fixed total information budget, which we make explicit\.

### 3.3 Dynamic allocation under coupling

###### Definition 3.5 (Budget constraint).

Once $N$ and $T$ are fixed, every patching scheme spends the same total information: $\sum_{t}r_{t}=NC$. The mean bitrate $\bar{r}:=\mathbb{E}_{t}[r_{t}]=NC/T$ is therefore pinned across both uniform and dynamic schemes.

The budget turns the global problem into competitive resource allocation: extra bitrate at one step must be borrowed from another. The uniform baseline fixes $r_{t}\equiv\bar{r}$ and incurs $\mathcal{J}_{\text{uni}}=\bar{K}D(\bar{r})$. Under the budget constraint, the optimal continuous allocation has the KKT characterisation in [Theorem A.1](https://arxiv.org/html/2606.04074#A1.Thmtheorem1): at an interior optimum, $K_{t}D'(r_{t}^{\star})+\lambda=0$ for some $\lambda>0$, so higher-complexity steps receive higher bitrate. The budget links all time steps through $\lambda$, confirming condition (1) of [Corollary 3.4](https://arxiv.org/html/2606.04074#S3.Thmtheorem4).
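To make the KKT condition concrete, consider the hypothetical distortion $D(r)=1/r$, which satisfies $D'<0$ and $D''>0$ but is not taken from the paper. Then $K_{t}D'(r_{t}^{\star})+\lambda=0$ reads $K_{t}/(r_{t}^{\star})^{2}=\lambda$, so $r_{t}^{\star}\propto\sqrt{K_{t}}$, with the constant fixed by the mean-bitrate budget. The sketch below assumes an interior optimum and ignores the bounds of the admissible interval $I$.

```python
import numpy as np


def kkt_allocation(K, r_bar):
    """Budgeted optimum for the assumed distortion D(r) = 1/r.

    Stationarity K_t / r_t^2 = lambda gives r_t proportional to sqrt(K_t);
    the scale is set so that mean(r) = r_bar (interval bounds are ignored).
    """
    root = np.sqrt(K)
    return r_bar * root / root.mean()


def weighted_loss(K, r):
    """Normalized loss E[K * D(r)] with D(r) = 1/r."""
    return np.mean(K / r)


K = np.linspace(0.5, 5.0, 200)   # illustrative complexities
r_bar = 2.0

r_star = kkt_allocation(K, r_bar)
lam = K / r_star**2              # the multiplier implied at each step

j_uni = weighted_loss(K, np.full_like(K, r_bar))
j_dyn = weighted_loss(K, r_star)
print(f"lambda spread: {np.ptp(lam):.2e}")
print(f"J_uni = {j_uni:.4f}, J_dyn = {j_dyn:.4f}")
```

The implied multiplier is the same at every step, which is how the budget couples the time steps, and the allocation increases with $K_t$. In this idealised setting, with exact knowledge of $K_t$ and no representation cost, the budgeted optimum does beat the uniform baseline; the later sections explain why this is hard to realise in practice.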

### 3.4 The baseline tax: an unaligned scheme cannot improve

Even with the budget, volatility alone is never beneficial: redistributing bitrate around $\bar{r}$ without any awareness of complexity strictly hurts.

###### Proposition 3.6 (Unaligned allocation is suboptimal).

If $r_{t}$ is statistically independent of $K_{t}$, $\mathbb{E}[r]=\bar{r}$, and $D$ is convex, then $\mathcal{J}_{\text{dyn}}(r)\geq\mathcal{J}_{\text{uni}}$. If $D$ is strictly convex on the support of $r$, equality holds iff $r_{t}\equiv\bar{r}$ almost surely.

*Intuition.* Every dynamic scheme pays a *Jensen tax* on volatility before any alignment benefit accrues; a useful scheme must correlate $r$ with $K$ strongly enough to recoup that tax.
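The Jensen tax can be illustrated with a small numerical check. To make the empirical independence exact rather than approximate, the sketch pairs every complexity value with every rate value; the distortion $D(r)=1/r$ and the specific numbers are assumptions chosen for illustration.

```python
import numpy as np


def jensen_tax_demo(K_vals, r_vals, D):
    """Compare J_dyn and J_uni when r is exactly independent of K.

    Every K value is paired with every r value, so the empirical joint
    distribution factorises and E[K * D(r)] = E[K] * E[D(r)].
    """
    K, r = np.meshgrid(K_vals, r_vals, indexing="ij")
    K, r = K.ravel(), r.ravel()
    j_dyn = np.mean(K * D(r))
    j_uni = K.mean() * D(r.mean())
    return j_dyn, j_uni


def D(r):
    return 1.0 / r   # assumed strictly convex, decreasing distortion


K_vals = np.array([0.5, 1.0, 3.0, 6.0])
r_vals = np.array([1.0, 2.0, 3.0])       # mean bitrate 2.0, non-zero variance

j_dyn, j_uni = jensen_tax_demo(K_vals, r_vals, D)
print(f"J_dyn = {j_dyn:.4f}, J_uni = {j_uni:.4f}")

# With a constant rate at the same mean, the tax disappears.
j_flat, j_uni_flat = jensen_tax_demo(K_vals, np.array([2.0, 2.0, 2.0]), D)
print(f"J_dyn (constant rate) = {j_flat:.4f}, J_uni = {j_uni_flat:.4f}")
```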

Under a purely decreasing $D$ absent representation costs, an improving aligned allocation always exists ([Section A.6](https://arxiv.org/html/2606.04074#A1.SS6)); the practical difficulty arises from feasibility constraints, estimation noise, and the optimality trap introduced below.

### 3.5 The exact threshold

How much alignment is needed? Without any approximation, the condition under which dynamic patching outperforms uniform patching can be written as a single covariance inequality.

###### Theorem 3.7 (Exact threshold).

Regard $t$ as a uniform random index, and let $K=K_{t}$ and $r=r_{t}$ with $\mathbb{E}[r]=\bar{r}$. Then $\mathcal{J}_{\text{dyn}}(r)<\mathcal{J}_{\text{uni}}$ if and only if

$$
\underbrace{-\mathrm{Cov}\bigl(K,D(r)\bigr)}_{\text{alignment gain}}\;>\;\mathbb{E}[K]\,\underbrace{\bigl(\mathbb{E}[D(r)]-D(\bar{r})\bigr)}_{\text{Jensen penalty}}. \tag{4}
$$

*Intuition.* The condition is exact and estimator-friendly: both sides can be computed from samples of $(K,r)$ without committing to a Taylor expansion.
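Both sides of inequality (4) are plain sample statistics, so the condition can be evaluated directly. The sketch below uses an assumed distortion $D(r)=1/r$ and two hypothetical allocations, one positively and one negatively aligned with $K$; it also checks the identity $\Delta_{D}(r)=\text{alignment gain}-\text{Jensen penalty}$ that underlies the theorem.

```python
import numpy as np


def threshold_terms(K, r, D):
    """Return (alignment gain, Jensen penalty, exact gain) from samples of (K, r).

    Covariance is the population (1/T) version, matching the empirical
    average used in Definition 3.2.
    """
    d = D(r)
    alignment_gain = -(np.mean(K * d) - K.mean() * d.mean())   # -Cov(K, D(r))
    jensen_penalty = K.mean() * (d.mean() - D(r.mean()))       # E[K] (E[D(r)] - D(r_bar))
    gain = K.mean() * D(r.mean()) - np.mean(K * d)             # J_uni - J_dyn
    return alignment_gain, jensen_penalty, gain


def D(r):
    return 1.0 / r   # assumed distortion, for illustration only


K = np.linspace(0.5, 5.0, 500)
r_aligned = 2.0 + 0.2 * (K - K.mean())   # more bitrate where K is high
r_anti = 2.0 - 0.2 * (K - K.mean())      # more bitrate where K is low

for name, r in [("aligned", r_aligned), ("anti-aligned", r_anti)]:
    align, penalty, gain = threshold_terms(K, r, D)
    print(f"{name}: alignment={align:.4f}, penalty={penalty:.4f}, gain={gain:.4f}")
```

Both allocations have the same mean bitrate and the same variance, so they pay a similar Jensen penalty; only the sign of the alignment term decides whether the threshold is cleared.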

### 3.6 Upper bound under quadratic approximation

The exact threshold tells us *when* dynamic patching helps. A local Taylor expansion explains *how much* gain is available near a uniform operating point, but only as a surrogate.

The exact second-order expansion with remainder is given in [Proposition A.3](https://arxiv.org/html/2606.04074#A1.Thmtheorem3). Its small-perturbation consequence is that dynamic patching is locally beneficial only if $-D'(\bar{r})\,\mathrm{Cov}(K,r)>\tfrac{1}{2}D''(\bar{r})\,\mathbb{E}[K(r-\bar{r})^{2}]$. Writing $\delta:=r-\bar{r}$, if one further uses the approximation $\mathrm{Cov}(K,\delta^{2})\approx 0$ (e.g., when the skewness of $K$ is negligible, so $K-\bar{K}$ is approximately uncorrelated with $(r-\bar{r})^{2}$), this becomes

$$
\underbrace{-D'(\bar{r})\,\mathrm{Cov}(K,r)}_{\text{alignment gain (linear)}}\;>\;\underbrace{\tfrac{1}{2}\,\bar{K}\,D''(\bar{r})\,\mathrm{Var}(r)}_{\text{Jensen tax (quadratic)}}. \tag{5}
$$

This is a local approximation, not an exact condition. Its right-hand side is pinned by the curvature $D''(\bar{r})$ and the volatility of the allocation; neither can be eliminated without removing the candidate dynamic scheme.

###### Theorem 3.8 (Maximum of the quadratic surrogate).

Let $a=-D'(\bar{r})>0$, $b=D''(\bar{r})>0$, and $\rho=\mathrm{Corr}(K,r)$, treated as fixed. The local surrogate $a\rho\sigma_{K}\sigma_{r}-\tfrac{1}{2}\bar{K}b\sigma_{r}^{2}$ has supremum

$$
\Delta_{\max}^{\text{quad}}\;=\;\begin{cases}\dfrac{\bigl(-D'(\bar{r})\bigr)^{2}\rho^{2}\sigma_{K}^{2}}{2\bar{K}D''(\bar{r})},&\rho>0,\\[8pt] 0,&\rho\leq 0,\end{cases} \tag{6}
$$

with maximiser $\sigma_{r}^{\star}=(-D'(\bar{r}))\,\rho\,\sigma_{K}/(\bar{K}D''(\bar{r}))$ when $\rho>0$.

###### Proof.

The surrogate is a concave quadratic in $\sigma_{r}$ at fixed $\rho,\sigma_{K}$; read off its maximiser. See [Section A.4](https://arxiv.org/html/2606.04074#A1.SS4). ∎

Expression ([6](https://arxiv.org/html/2606.04074#S3.E6)) is therefore not an exact upper bound on dynamic-patching gain; it is the optimum of a second-order local surrogate under fixed-correlation, small-perturbation, and feasibility-relaxation assumptions. Still, the shape is informative: the gain decays quadratically with misalignment, small $\sigma_{K}^{2}$ offers little headroom, and large curvature $D''(\bar{r})$ makes the Jensen tax expensive.
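The closed form in Equation (6) can be sanity-checked against a brute-force scan of the surrogate. The parameter values below are arbitrary illustrations (they correspond roughly to $D(r)=1/r$ at $\bar{r}=2$, which is an assumption, not a fitted quantity), and the helper names are invented for this sketch.

```python
import numpy as np


def surrogate(sigma_r, a, b, rho, sigma_K, K_bar):
    """Local quadratic surrogate: a*rho*sigma_K*sigma_r - 0.5*K_bar*b*sigma_r^2."""
    return a * rho * sigma_K * sigma_r - 0.5 * K_bar * b * sigma_r**2


def quad_ceiling(a, b, rho, sigma_K, K_bar):
    """Return (ceiling, maximising sigma_r) of the surrogate, as in Eq. (6)."""
    if rho <= 0:
        return 0.0, 0.0   # no positive alignment: best choice is sigma_r = 0
    sigma_star = a * rho * sigma_K / (K_bar * b)
    ceiling = (a * rho * sigma_K) ** 2 / (2.0 * K_bar * b)
    return ceiling, sigma_star


# Illustrative parameters: a = -D'(r_bar), b = D''(r_bar).
params = dict(a=0.25, b=0.25, sigma_K=1.3, K_bar=2.75)

for rho in (1.0, 0.5, 0.25, 0.0):
    ceiling, sigma_star = quad_ceiling(rho=rho, **params)
    grid = np.linspace(0.0, 3.0, 3001)
    brute = surrogate(grid, rho=rho, **params).max()
    print(f"rho={rho:4.2f}  ceiling={ceiling:.5f}  grid max={brute:.5f}  sigma*={sigma_star:.4f}")
```

Halving the correlation cuts the ceiling to a quarter, which is the quadratic decay with misalignment noted above.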

#### Rigorous global bound\.

[Theorem A.4](https://arxiv.org/html/2606.04074#A1.Thmtheorem4) gives the corresponding global upper bound under strong convexity and a Lipschitz condition on $D$, without any small-$\delta$ assumption. The local surrogate and the global bound share the same qualitative message: the achievable improvement is a fragile second-order quantity requiring high complexity variance, accurate alignment, and a flat distortion curve.

### 3\.7The representation\-aware regime and the optimality trap

Rate–distortion theory alone does not capture the full objective of a time\-series Transformer\. Patching also determines the*representation*the network sees: each patch is a token, and overly short tokens fragment context, while overly long tokens dilute semantics\. We model this with a U\-shaped correction term, so the network sees

$$
\mathcal{L}(r)\;:=\;D(r)+\lambda\,D_{\text{sem}}(r),\qquad\lambda\geq 0, \tag{7}
$$

where $D_{\text{sem}}$ summarises fragmentation and redundancy costs. Rather than deriving a unique minimum from componentwise shape assumptions, we assume it explicitly below. Variable-resolution tokenisation (Dehghani et al., [2023](https://arxiv.org/html/2606.04074#bib.bib21); Ronen et al., [2023](https://arxiv.org/html/2606.04074#bib.bib24)) and adaptive or length-variable time-series patching (Ding et al., [2026](https://arxiv.org/html/2606.04074#bib.bib15); Huang et al., [2024](https://arxiv.org/html/2606.04074#bib.bib20)) can be read as different ways to traverse this U-shape.

###### Assumption 3.9 (Representation-aware optimum).

$\mathcal{L}\in C^{2}(I)$, $\mathcal{L}''(r)>0$ on $I$, and there exists $r^{\star}\in I$ with $\mathcal{L}'(r^{\star})=0$. Hence $r^{\star}$ is the unique global minimiser of $\mathcal{L}$.

All decomposition and threshold statements above hold with $D$ replaced by $\mathcal{L}$. Two additional facts matter under this objective. First, [Theorem 3.3](https://arxiv.org/html/2606.04074#S3.Thmtheorem3) applies with $g=\mathcal{L}$: in the unconstrained regime the optimum is the single point $r^{\star}=\operatorname*{arg\,min}_{r}\mathcal{L}(r)$, regardless of $K_{t}$. Second, under the budget constraint, the very tunability of the uniform baseline closes the door opened in [Section 3.5](https://arxiv.org/html/2606.04074#S3.SS5). Here $r^{\star}$ denotes the scalar representation-aware operating point induced by patch-size tuning, not a global optimum over neural-network parameters. Write $\mathcal{J}_{\mathcal{L}}(r):=\mathbb{E}[K\mathcal{L}(r)]$ and $\mathcal{J}_{\mathcal{L}}^{\mathrm{unif}}:=\bar{K}\mathcal{L}(\bar{r})$.

###### Theorem 3\.10\(Uniform optimality at the representation\-aware minimiser\)\.

AssumeKt≥0K\_\{t\}\\geq 0and thatℒ\\mathcal\{L\}has a global minimiserr⋆r^\{\\star\}\. Ifr¯=r⋆\\bar\{r\}=r^\{\\star\}, then𝒥ℒ\(r\)≥𝒥ℒunif\\mathcal\{J\}\_\{\\mathcal\{L\}\}\(r\)\\geq\\mathcal\{J\}\_\{\\mathcal\{L\}\}^\{\\mathrm\{unif\}\}for*every*feasible allocation\. Under[3\.9](https://arxiv.org/html/2606.04074#S3.Thmtheorem9)withKt\>0K\_\{t\}\>0, equality holds only atrt≡r⋆r\_\{t\}\\equiv r^\{\\star\}\.

###### Proof\.

$\mathcal{L}(r_t) \geq \mathcal{L}(r^{\star})$ pointwise; multiply by $K_t \geq 0$ and average over $t$ to obtain $\mathbb{E}[K\,\mathcal{L}(r)] \geq \bar{K}\,\mathcal{L}(r^{\star}) = \mathcal{J}_{\mathcal{L}}^{\mathrm{unif}}$. See [Section A.9](https://arxiv.org/html/2606.04074#A1.SS9) for the strict-inequality argument. ∎
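The pointwise argument can be checked numerically. The sketch below is illustrative only: the loss `toy_loss`, the weight distribution, and the candidate allocations are assumptions chosen for the example, not quantities from the paper. It places the uniform baseline at the minimiser and compares it against feasible allocations with the same mean rate, including one deliberately aligned with $K$.

```python
import numpy as np


def toy_loss(r, r_star=2.0):
    """Hypothetical strictly convex loss with unique minimiser r_star."""
    d = r - r_star
    return 1.0 + d ** 2 + 0.1 * d ** 4


def weighted_objective(K, r):
    """Empirical J(r) = (1/T) * sum_t K_t * L(r_t)."""
    return float(np.mean(K * toy_loss(r)))


rng = np.random.default_rng(0)
T, r_star = 512, 2.0
K = rng.uniform(0.5, 3.0, size=T)  # positive per-step complexity weights

# Uniform baseline placed exactly at the minimiser (r_bar = r_star).
J_unif = weighted_objective(K, np.full(T, r_star))

# Feasible deviations have zero mean, so every candidate keeps mean rate r_star.
# The first candidate is deliberately aligned with K; the rest are random.
candidates = [0.3 * (K - K.mean())]
for _ in range(200):
    noise = rng.normal(scale=0.5, size=T)
    candidates.append(noise - noise.mean())

smallest_gap = min(weighted_objective(K, r_star + d) - J_unif for d in candidates)
print(f"smallest J_dyn - J_unif over candidates: {smallest_gap:.6f}")
```

Even the allocation aligned with $K$ cannot help here, because every deviation from $r^{\star}$ raises the per-step loss before it is weighted.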

[Theorem 3.10](https://arxiv.org/html/2606.04074#S3.Thmtheorem10) is the global statement. The local picture, obtained by expanding $\mathcal{L}$ around $\bar{r}$ (full derivation in [Section A.10](https://arxiv.org/html/2606.04074#A1.SS10)), is more informative.

###### Proposition 3.11 (Optimality trap).

Let $\delta_t = r_t - \bar{r}$ and assume $\mathcal{L} \in C^{3}$. Then

$$\Delta_{\mathcal{L}}(r) \;=\; \underbrace{-\mathcal{L}'(\bar{r})\,\mathrm{Cov}(K, r)}_{\text{first-order alignment}} \;-\; \underbrace{\tfrac{1}{2}\,\mathcal{L}''(\bar{r})\,\mathbb{E}[K\delta^{2}]}_{\text{Jensen tax}} \;-\; \mathbb{E}[K\,R_{\mathcal{L},3}(\delta)]. \tag{8}$$

If hyperparameter tuning has placed the uniform baseline near the minimiser, $\bar{r} \approx r^{\star}$, then $\mathcal{L}'(\bar{r}) \approx 0$ and the leading non-zero term $-\tfrac{1}{2}\mathcal{L}''(\bar{r})\,\mathbb{E}[K\delta^{2}] \leq 0$ is non-positive: any allocation with $\mathrm{Var}(r) > 0$ pays the Jensen tax with no first-order alignment gain to offset it.

*Intuition.* Uniform tuning brings $\bar{r}$ to where the loss is flat; since $\mathcal{L}''(\bar{r}) = D''(\bar{r}) + \lambda D_{\text{sem}}''(\bar{r}) > D''(\bar{r})$, the representation correction adds curvature and makes the Jensen tax strictly larger than under $D$ alone.
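The two leading terms of Equation (8) can be computed directly in a toy case. The sketch below assumes a purely quadratic loss $\mathcal{L}(r) = a\,(r - r^{\star})^{2}$, for which the third-order remainder vanishes and the decomposition is exact. The loss, the constants, and the helper `gain_decomposition` are hypothetical choices for illustration. It compares an untuned baseline ($\bar{r}$ far from $r^{\star}$) with a tuned one ($\bar{r} = r^{\star}$) under the same positively aligned allocation.

```python
import numpy as np


def gain_decomposition(K, r, a, r_star):
    """Gain of allocation r over uniform at mean(r), for L(r) = a*(r - r_star)**2.

    Returns (exact gain, first-order alignment term, Jensen tax).
    """
    def loss(x):
        return a * (x - r_star) ** 2

    r_bar = r.mean()
    delta = r - r_bar
    dL = 2.0 * a * (r_bar - r_star)              # L'(r_bar)
    d2L = 2.0 * a                                # L''(r_bar)
    cov_K_r = np.mean(K * r) - K.mean() * r_bar  # population covariance
    alignment = -dL * cov_K_r
    jensen_tax = 0.5 * d2L * np.mean(K * delta ** 2)
    gain = K.mean() * loss(r_bar) - np.mean(K * loss(r))
    return float(gain), float(alignment), float(jensen_tax)


rng = np.random.default_rng(1)
K = rng.uniform(0.5, 3.0, size=1000)
spread = 0.3 * (K - K.mean()) / K.std()  # zero-mean, positively aligned with K
a, r_star = 1.0, 2.0

untuned = gain_decomposition(K, 1.0 + spread, a, r_star)    # r_bar = 1.0
tuned = gain_decomposition(K, r_star + spread, a, r_star)   # r_bar = r_star

for name, (gain, alignment, tax) in [("untuned", untuned), ("tuned", tuned)]:
    print(f"{name}: gain={gain:+.4f} alignment={alignment:+.4f} tax={tax:.4f}")
```

In the tuned case the alignment term is zero up to rounding, so the same allocation that helped the untuned baseline now only pays the tax.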

### 3.8 Summary

Together, these results identify a narrow region where dynamic patching can help. Coupling is required for scalar complexity to affect rates ([Theorem 3.3](https://arxiv.org/html/2606.04074#S3.Thmtheorem3)); under a budget, improvement must overcome the Jensen tax through alignment ([Theorem 3.7](https://arxiv.org/html/2606.04074#S3.Thmtheorem7)), with second-order headroom controlled by alignment, complexity variance, and curvature ([Theorems 3.8](https://arxiv.org/html/2606.04074#S3.Thmtheorem8) and [A.4](https://arxiv.org/html/2606.04074#A1.Thmtheorem4)). Once the uniform baseline is tuned near its representation-aware optimum, the first-order alignment term collapses and the Jensen tax becomes pure cost ([Theorems 3.10](https://arxiv.org/html/2606.04074#S3.Thmtheorem10) and [3.11](https://arxiv.org/html/2606.04074#S3.Thmtheorem11)), as summarized in [Figure 2](https://arxiv.org/html/2606.04074#S3.F2).

![Refer to caption](https://arxiv.org/html/2606.04074v1/x3.png)

Figure 2: Theory roadmap for dynamic patching. (a) Wedge and threshold: per-step gain $\Delta_{D}$ is a concave parabola in bitrate volatility $\sigma_{r}$ ([Theorem 3.8](https://arxiv.org/html/2606.04074#S3.Thmtheorem8)); peak scales as $\rho^{2}\sigma_{K}^{2}/D''$ so misalignment shrinks headroom quadratically; $\rho = 0$ (dashed) is pure Jensen tax, and improvement requires $-\mathrm{Cov}(K, D(r)) > \bar{K}\,(\mathbb{E}[D(r)] - D(\bar{r}))$ ([Theorem 3.7](https://arxiv.org/html/2606.04074#S3.Thmtheorem7)). (b) Trap: the representation-aware loss is U-shaped ([Proposition 3.11](https://arxiv.org/html/2606.04074#S3.Thmtheorem11)); tuning $\bar{r}$ into the basin of $r^{\star}$ (blue) leaves only the curvature-driven Jensen tax. (c) Coupling: without a budget, $r_{t}^{\star}$ is independent of $K_{t}$ or degenerates to the common upper boundary (scalar invariance, [Theorem 3.3](https://arxiv.org/html/2606.04074#S3.Thmtheorem3)); a budget activates the KKT condition ([Theorem A.1](https://arxiv.org/html/2606.04074#A1.Thmtheorem1)) and makes the optimal allocation rise with $K_{t}$.

## 4 Why the alignment target is itself unstable

The theoretical picture in [Section 3](https://arxiv.org/html/2606.04074#S3) presumes that $K_{t}$ is given. Four observations show that $K_{t}$ is, in a structural sense, unavailable in practice.

- Ill-definedness. Temporal complexity admits no universal, axiomatic characterisation. Candidate definitions (local entropy rate, conditional entropy $H(x_{t} \mid x_{<t})$, local variance, spectral energy, gradient magnitude, boundary probability, cosine dissimilarity, predictive residual) each capture a different facet of difficulty, and patching strategies optimised for one measure can be *adversarially* aligned under another. Because $\Delta_{\max}^{\text{quad}}$ scales as $\rho(K, r)^{2}$ for a specific proxy $K$, a scheme designed to align with one complexity measure but evaluated against a different one yields a measured gain that is fragile by construction.
- Circularity of estimation. Model-residual definitions of $K_{t}$, e.g., $K_{t} = (x_{t} - \hat{x}_{t})^{2}$ for some reference forecaster, require a model strong enough to predict $x_{t}$ before patching is decided. If such a model exists, most of the modelling work has already been done and the marginal utility of aligning $r_{t}$ with $K_{t}$ is small. If one retreats to model-free proxies, $\rho(K, r)$ drops and [Theorem 3.8](https://arxiv.org/html/2606.04074#S3.Thmtheorem8) pushes the achievable gain towards zero.
- Disproportionate cost. A practical complexity-aligned pipeline must (i) commit to a definition, (ii) train or calibrate an estimator, (iii) run it at inference, and (iv) solve a non-uniform allocation problem. Steps (ii)–(iii) introduce a secondary model with its own parameters, training data, and latency. Against the bounded ceiling $\Delta_{\max}^{\text{quad}}$, this cost is rarely justifiable.
- The complexity paradox. The three obstacles above compound: the signals that stand to gain the most from dynamic patching (high $\sigma_{K}$) are those for which complexity is hardest to define and most expensive to estimate, while the signals for which $K_{t}$ is cheap to obtain are those for which $\sigma_{K}$ is small and the gain vanishes.

These four obstacles identify when a proxy-routed adaptive scheme is unlikely to clear the threshold of [Equation 4](https://arxiv.org/html/2606.04074#S3.E4): when the chosen $K_{t}$ is ill-defined, when its estimator must approach the strength of the forecaster itself, or when its overhead approaches the bounded ceiling. End-to-end boundary optimization can sidestep the proxy step, but it falls outside the proxy-routed setting targeted here.
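To make the ill-definedness point concrete, the following sketch computes two of the candidate proxies named above on a synthetic series and reports how strongly they agree. Everything here is an assumption made for illustration: the series, the window length, and the last-value reference forecaster used for the residual proxy are hypothetical and are not taken from the paper's experiments.

```python
import numpy as np


def rolling_variance(x, window):
    """Model-free proxy: variance over a trailing window (left-padded with x[0])."""
    padded = np.concatenate([np.full(window - 1, x[0]), x])
    windows = np.lib.stride_tricks.sliding_window_view(padded, window)
    return windows.var(axis=1)


def naive_residual(x):
    """Model-residual proxy K_t = (x_t - x_hat_t)^2 with a last-value forecaster."""
    x_hat = np.concatenate([x[:1], x[:-1]])
    return (x - x_hat) ** 2


rng = np.random.default_rng(2)
t = np.arange(600)
x = np.sin(2 * np.pi * t / 50.0)                   # smooth periodic component
x[300:400] += rng.normal(scale=0.5, size=100)      # one noisy regime

K_var = rolling_variance(x, window=24)
K_res = naive_residual(x)
rho_between_proxies = float(np.corrcoef(K_var, K_res)[0, 1])
print(f"Pearson correlation between the two proxies: {rho_between_proxies:.3f}")
```

A router aligned with one of these proxies inherits at most that proxy's correlation with whichever complexity target actually governs the loss, which is the fragility the first bullet describes.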

## 5 Experiments: isolating the patching mechanism

### 5.1 Protocol

The comparison uses three methods: EntroPE (Abeywickrama et al., [2025](https://arxiv.org/html/2606.04074#bib.bib12)) (predictive-entropy boundaries), TimeMosaic (Ding et al., [2026](https://arxiv.org/html/2606.04074#bib.bib15)) (temporal-heterogeneity routing), and HDMixer (Huang et al., [2024](https://arxiv.org/html/2606.04074#bib.bib20)) (length-extendable patching). For each method, we build a uniform-patch counterpart by replacing the adaptive component with one fixed patch length. Everything else is held fixed: backbone, prediction head, optimizer, training budget, and preprocessing ([Appendix D](https://arxiv.org/html/2606.04074#A4)). Horizon grids, patch-size grids, seeds, and selection rules are in [Appendix E](https://arxiv.org/html/2606.04074#A5); [Appendix C](https://arxiv.org/html/2606.04074#A3) gives the full per-patch sweep. We evaluate on eight standard real-world multivariate forecasting benchmarks: ETT (ETTh1, ETTh2, ETTm1, ETTm2) and Weather (Zhou et al., [2021](https://arxiv.org/html/2606.04074#bib.bib42)), and Electricity, Traffic, and Exchange (Lai et al., [2018](https://arxiv.org/html/2606.04074#bib.bib43)).
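The comparison logic of this protocol can be sketched in a few lines. The snippet below is a hedged illustration, not the paper's evaluation code: the function names and all MSE values are hypothetical, the selection rule is assumed to be "lowest validation MSE" (the exact rules are in Appendix E), and the improvement metric is assumed to be the relative test-MSE gain of the uniform variant over the dynamic baseline, normalised by the dynamic MSE.

```python
def select_uniform_patch(val_mse_by_patch):
    """Pick the fixed patch size with the lowest validation MSE."""
    return min(val_mse_by_patch, key=val_mse_by_patch.get)


def pct_improvement(mse_dynamic, mse_uniform):
    """Relative test-MSE gain of uniform over dynamic, in percent.

    Positive values mean the uniform variant is better.
    """
    return 100.0 * (mse_dynamic - mse_uniform) / mse_dynamic


# Hypothetical numbers for one (method, dataset, horizon) setting.
val_mse = {8: 0.412, 16: 0.398, 32: 0.405}
test_mse = {8: 0.421, 16: 0.402, 32: 0.409}
dynamic_test_mse = 0.410

best_patch = select_uniform_patch(val_mse)
imp = pct_improvement(dynamic_test_mse, test_mse[best_patch])
print(f"selected patch size: {best_patch}, %Imp: {imp:+.2f}")
```

Selecting on validation and reporting on test keeps the uniform baseline from being tuned on the same numbers it is scored on.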

### 5.2 Results

![Refer to caption](https://arxiv.org/html/2606.04074v1/x4.png)

Figure 3: Horizon-averaged patch sweep (replaces Table 1). Each row is a dataset and each column is a fixed uniform patch size for that method. Cell colour encodes %Imp, the relative test-MSE gain of the uniform variant over the matched dynamic baseline: blue means uniform is better, orange means dynamic is better. The top number is the mean %Imp over horizons $96, 192, 336, 720$, and the bottom number is the mean training speedup ($\times$) over the same horizons. The black border marks the best horizon-averaged %Imp in each row.

#### Accuracy–efficiency trade-off.

[Figure 3](https://arxiv.org/html/2606.04074#S5.F3) shows accuracy and efficiency trade-offs. Adaptive patching helps only in the rows where its routing overhead buys loss-relevant resolution that a single fixed patch cannot approximate. That regime is present, but narrow: dynamic remains ahead in some low-margin rows, including HDMixer on ETTm1/ETTm2 and TimeMosaic on ETTh1/Traffic, yet the largest visible deviations usually favour a tuned uniform patch. The clearest examples are not uniformly distributed across methods. HDMixer's uniform variants form a broad positive band on Traffic, Weather, ETTh1, and ETTh2, suggesting that its deformable extraction often does not need input-dependent offsets once the patch length is retuned. EntroPE is more conservative: only Traffic gives a clear horizon-averaged uniform advantage ($+2.5\%$ at $p = 4$), while most other rows sit near parity. TimeMosaic looks different. Its main outlier is Exchange ($+7.3\%$ at $p = 32$), which points to a dataset-specific routing failure or patch-scale mismatch, not a general problem with its multiscale design.

The speedup annotations sharpen this reading. Fixed patches are often cheaper even when their MSE advantage is small, but the compute gain is not automatically coupled to the best accuracy cell. HDMixer illustrates the useful case: its strongest Traffic setting combines a $+3.3\%$ averaged MSE gain with a $3.7\times$ training speedup, and Electricity still gives $+1.5\%$ at $3.4\times$. TimeMosaic shows the cautionary case: larger fixed patches can reach up to about $2.4\times$ speedup, but on several datasets they mainly make the model cheaper while leaving accuracy near parity. EntroPE has the smallest trade-off spread, with typical speedups around $1.0\times$–$1.9\times$ and modest accuracy movement. Thus the practical question is whether the router justifies its cost on a given dataset once a uniform patch sweep has already captured the easy gains.

#### Theoretical reading\.

This pattern matches [Theorem 3.8](https://arxiv.org/html/2606.04074#S3.Thmtheorem8). The achievable second-order gain scales with $\rho^{2}\sigma_{K}^{2}/\bigl(\bar{K}\,D''(\bar{r})\bigr)$. After tuning the uniform operating point, the loss at $\bar{r}$ is near its local minimum, so the slope $D'(\bar{r})$ that multiplies the alignment term is small and even a useful routing signal has little room to improve the loss. The observed gains are concentrated in $(-2\%, +2\%)$, with extremes near $\pm 7\%$. That pattern fits a small positive ceiling for a few aligned (method, dataset) pairs, not a general advantage for adaptive allocation.

The row-wise maxima reinforce that reading. HDMixer benefits most from coarser fixed patches on Traffic and Weather, EntroPE benefits mainly at the finest evaluated granularities on Traffic and Electricity, and TimeMosaic's largest averaged uniform advantage appears on Exchange. The results therefore do not show a universal win for either side: the averaged gap changes with the architecture and the dataset, and many rows stay close across several fixed patch sizes.

#### Mechanism check on a Transformer backbone\.

[Appendix G](https://arxiv.org/html/2606.04074#A7) isolates the rate-allocation channel behind [Equation 4](https://arxiv.org/html/2606.04074#S3.E4) by replacing discrete patching with a continuous noise schedule on a Transformer backbone, holding the Jensen term flat while varying only alignment. In this controlled setting, mean gain is monotone in target alignment across nine settings and twenty seeds per cell, from $-24\%$ at $\rho^{\star} = -1$ to $+16\%$ at $\rho^{\star} = +1$; the small gaps in [Figure 3](https://arxiv.org/html/2606.04074#S5.F3) are therefore consistent with practical adaptive patchers operating where alignment is too weak to overcome the Jensen tax and routing cost.

## 6 Discussion and Conclusion

Adaptive patching is a resource-allocation problem, so a non-uniform grid helps only when the routing signal identifies locations where extra temporal resolution reduces the downstream loss by more than the Jensen tax and the cost of routing. Tuning a uniform patch length already absorbs most of the available room. Time-series forecasting departs from the image-tokenization intuition: visible detail in images often marks regions where extra tokens help, whereas a locally irregular forecasting segment can be unpredictable noise, a short-lived regime, or a feature whose error is already dominated by the horizon-level objective. Observable heterogeneity helps only when it tracks loss-relevant complexity, and the variance $\sigma_{K}^{2}$ in the bound applies to the loss-relevant target rather than the raw input. A plausible routing signal can therefore lose its advantage once the uniform operating point is tuned, and the ceiling on adaptive gain scales with the squared correlation between the routing signal and a loss-relevant complexity target. A practical adaptive-patching claim should specify four things: the target complexity being approximated, the routing proxy, its cost, and the tuned uniform baseline being beaten.

#### Future Work\.

This work gives a first theoretical account of dynamic patch allocation, but the analysis is intentionally abstract and does not model every architecture\-specific design choice\. Future work can specialize the framework to concrete adaptive patching systems, including multiscale patchers and jointly trained routers\. The alignment condition may also serve as a design principle for routing modules that allocate resolution where it is most useful\.

#### Acknowledgments

The work of Federico Zucchi has received funding from the Horizon Europe research and innovation programme, under grant agreement No 101095436 and Horizon European innovation council under grant agreement No\. 190129251\.

## References

- EntroPE: entropy\-guided dynamic patch encoder for time series forecasting\.arXiv preprint arXiv:2509\.26157\.Cited by:[§1](https://arxiv.org/html/2606.04074#S1.SS0.SSS0.Px1.p1.7),[§1](https://arxiv.org/html/2606.04074#S1.p1.1),[§2](https://arxiv.org/html/2606.04074#S2.SS0.SSS0.Px2.p1.1),[§5\.1](https://arxiv.org/html/2606.04074#S5.SS1.p1.1)\.
- S\. K\. Ankireddy, N\. Seleznev, N\. H\. Nguyen, Y\. Wu, S\. Kumar, F\. Huang, and C\. B\. Bruss \(2026\)Timesqueeze: dynamic patching for efficient time series forecasting\.arXiv preprint arXiv:2603\.11352\.Cited by:[§1](https://arxiv.org/html/2606.04074#S1.p1.1),[§2](https://arxiv.org/html/2606.04074#S2.SS0.SSS0.Px2.p1.1)\.
- A\. F\. Ansari, L\. Stella, A\. C\. Turkmen, X\. Zhang, P\. Mercado, H\. Shen, O\. Shchur, S\. S\. Rangapuram, S\. P\. Arango, S\. Kapoor, J\. Zschiegner, D\. C\. Maddix, H\. Wang, M\. W\. Mahoney, K\. Torkkola, A\. G\. Wilson, M\. Bohlke\-Schneider, and B\. Wang \(2024\)Chronos: learning the language of time series\.Transactions on Machine Learning Research\.Cited by:[§1](https://arxiv.org/html/2606.04074#S1.p1.1),[§2](https://arxiv.org/html/2606.04074#S2.SS0.SSS0.Px1.p1.5)\.
- A\. Banino, J\. Balaguer, and C\. Blundell \(2021\)PonderNet: learning to ponder\.ArXiv\.Cited by:[§2](https://arxiv.org/html/2606.04074#S2.SS0.SSS0.Px3.p1.1)\.
- D\. Bolya, C\. Fu, X\. Dai, P\. Zhang, C\. Feichtenhofer, and J\. Hoffman \(2023\)Token merging: your vit but faster\.InThe Eleventh International Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2606.04074#S1.p1.1),[§2](https://arxiv.org/html/2606.04074#S2.SS0.SSS0.Px3.p1.1)\.
- C\. Challu, K\. G\. Olivares, B\. N\. Oreshkin, F\. G\. Ramirez, M\. M\. Canseco, and A\. Dubrawski \(2023\)Nhits: neural hierarchical interpolation for time series forecasting\.InProceedings of the AAAI conference on artificial intelligence,Vol\.37,pp\. 6989–6997\.Cited by:[§2](https://arxiv.org/html/2606.04074#S2.SS0.SSS0.Px2.p1.1)\.
- P\. Chen, Y\. ZHANG, Y\. Cheng, Y\. Shu, Y\. Wang, Q\. Wen, B\. Yang, and C\. Guo \(2024\)Pathformer: multi\-scale transformers with adaptive pathways for time series forecasting\.InThe Twelfth International Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2606.04074#S2.SS0.SSS0.Px2.p1.1)\.
- S\. Chen, C\. Li, S\. O\. Arik, N\. C\. Yoder, and T\. Pfister \(2023\)TSMixer: an all\-mlp architecture for time series forecast\-ing\.Transactions on Machine Learning Research\.Cited by:[§1](https://arxiv.org/html/2606.04074#S1.SS0.SSS0.Px1.p1.7),[§2](https://arxiv.org/html/2606.04074#S2.SS0.SSS0.Px1.p1.5)\.
- T\. M\. Cover and J\. A\. Thomas \(2006\)Elements of information theory\.2nd edition,Wiley\.Cited by:[§3](https://arxiv.org/html/2606.04074#S3.p1.1)\.
- A\. Das, W\. Kong, R\. Sen, and Y\. Zhou \(2024\)A decoder\-only foundation model for time\-series forecasting\.InProceedings of the 41st International Conference on Machine Learning,pp\. 10148–10167\.Cited by:[§1](https://arxiv.org/html/2606.04074#S1.p1.1),[§2](https://arxiv.org/html/2606.04074#S2.SS0.SSS0.Px1.p1.5)\.
- M\. Dehghani, B\. Mustafa, J\. Djolonga, J\. Heek, M\. Minderer, M\. Caron, A\. Steiner, J\. Puigcerver, R\. Geirhos, I\. M\. Alabdulmohsin, A\. Oliver, P\. Padlewski, A\. A\. Gritsenko, M\. Lucic, and N\. Houlsby \(2023\)Patch n’pack: navit, a vision transformer for any aspect ratio and resolution\.Advances in Neural Information Processing Systems36,pp\. 2252–2274\.Cited by:[§1](https://arxiv.org/html/2606.04074#S1.p1.1),[§2](https://arxiv.org/html/2606.04074#S2.SS0.SSS0.Px3.p1.1),[§3\.7](https://arxiv.org/html/2606.04074#S3.SS7.p1.1)\.
- K\. Ding, F\. Fan, C\. Hou, Z\. Wang, L\. Wang, Z\. Yang, and J\. Zhan \(2026\)Timemosaic: temporal heterogeneity guided time series forecasting via adaptive granularity patch and segment\-wise decoding\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.40,pp\. 20790–20798\.Cited by:[§1](https://arxiv.org/html/2606.04074#S1.SS0.SSS0.Px1.p1.7),[§1](https://arxiv.org/html/2606.04074#S1.p1.1),[§2](https://arxiv.org/html/2606.04074#S2.SS0.SSS0.Px2.p1.1),[§3\.7](https://arxiv.org/html/2606.04074#S3.SS7.p1.1),[§5\.1](https://arxiv.org/html/2606.04074#S5.SS1.p1.1)\.
- K\. Ding, F\. Fan, Y\. Wang, R\. Jian, X\. Wang, L\. Gong, Y\. Jiang, C\. Luo, and J\. Zhan \(2025\)Dualsg: a dual\-stream explicit semantic\-guided multivariate time series forecasting framework\.InProceedings of the 33rd ACM International Conference on Multimedia,pp\. 508–517\.Cited by:[§2](https://arxiv.org/html/2606.04074#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Dosovitskiy, L\. Beyer, A\. Kolesnikov, D\. Weissenborn, X\. Zhai, T\. Unterthiner, M\. Dehghani, M\. Minderer, G\. Heigold, S\. Gelly, U\. Jakob, and H\. Neil \(2021\)An image is worth 16x16 words: transformers for image recognition at scale\.InInternational Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2606.04074#S1.p1.1)\.
- W\. Fedus, B\. Zoph, and N\. Shazeer \(2022\)Switch transformers: scaling to trillion parameter models with simple and efficient sparsity\.Journal of Machine Learning Research23\(120\),pp\. 1–39\.Cited by:[§2](https://arxiv.org/html/2606.04074#S2.SS0.SSS0.Px3.p1.1)\.
- K\. Feng, S\. Lan, Y\. Fang, W\. He, L\. Ma, X\. Lu, and K\. Ren \(2025\)Kairos: toward adaptive and parameter\-efficient time series foundation models\.arXiv preprint arXiv:2509\.25826\.Cited by:[§1](https://arxiv.org/html/2606.04074#S1.p1.1),[§2](https://arxiv.org/html/2606.04074#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Graves \(2016\)Adaptive computation time for recurrent neural networks\.arXiv preprint arXiv:1603\.08983\.Cited by:[§2](https://arxiv.org/html/2606.04074#S2.SS0.SSS0.Px3.p1.1)\.
- Y\. Hu, G\. Zhang, P\. Liu, D\. Lan, N\. Li, D\. Cheng, T\. Dai, S\. Xia, and S\. Pan \(2025\)TimeFilter: patch\-specific spatial\-temporal graph filtration for time series forecasting\.InInternational Conference on Machine Learning,pp\. 24893–24911\.Cited by:[§2](https://arxiv.org/html/2606.04074#S2.SS0.SSS0.Px2.p1.1)\.
- Q\. Huang, L\. Shen, R\. Zhang, J\. Cheng, S\. Ding, Z\. Zhou, and Y\. Wang \(2024\)Hdmixer: hierarchical dependency with extendable patch for multivariate time series forecasting\.InProceedings of the AAAI conference on artificial intelligence,Vol\.38,pp\. 12608–12616\.Cited by:[§1](https://arxiv.org/html/2606.04074#S1.SS0.SSS0.Px1.p1.7),[§1](https://arxiv.org/html/2606.04074#S1.p1.1),[§2](https://arxiv.org/html/2606.04074#S2.SS0.SSS0.Px2.p1.1),[§3\.7](https://arxiv.org/html/2606.04074#S3.SS7.p1.1),[§5\.1](https://arxiv.org/html/2606.04074#S5.SS1.p1.1)\.
- D\. Kudrat, Z\. Xie, Y\. Sun, T\. Jia, and Q\. Hu \(2025\)Patch\-wise structural loss for time series forecasting\.InInternational Conference on Machine Learning,pp\. 31841–31859\.Cited by:[§2](https://arxiv.org/html/2606.04074#S2.SS0.SSS0.Px2.p1.1)\.
- G\. Lai, W\. Chang, Y\. Yang, and H\. Liu \(2018\)Modeling long\-and short\-term temporal patterns with deep neural networks\.InThe 41st international ACM SIGIR conference on research & development in information retrieval,pp\. 95–104\.Cited by:[Appendix F](https://arxiv.org/html/2606.04074#A6.p1.1),[§5\.1](https://arxiv.org/html/2606.04074#S5.SS1.p1.1)\.
- Y\. Liu, T\. Hu, H\. Zhang, H\. Wu, S\. Wang, L\. Ma, and M\. Long \(2024\)ITransformer: inverted transformers are effective for time series forecasting\.InThe Twelfth International Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2606.04074#S1.p1.1),[§2](https://arxiv.org/html/2606.04074#S2.SS0.SSS0.Px1.p1.5)\.
- Y\. Nie, N\. H\. Nguyen, P\. Sinthong, and J\. Kalagnanam \(2023\)A time series is worth 64 words: long\-term forecasting with transformers\.InThe Eleventh International Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2606.04074#S1.p1.1),[§2](https://arxiv.org/html/2606.04074#S2.SS0.SSS0.Px1.p1.5)\.
- D\. Raposo, S\. Ritter, B\. Richards, T\. Lillicrap, P\. C\. Humphreys, and A\. Santoro \(2024\)Mixture\-of\-depths: dynamically allocating compute in transformer\-based language models\.arXiv preprint arXiv:2404\.02258\.Cited by:[§2](https://arxiv.org/html/2606.04074#S2.SS0.SSS0.Px3.p1.1)\.
- T\. Ronen, O\. Levy, and A\. Golbert \(2023\)Vision transformers with mixed\-resolution tokenization\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 4613–4622\.Cited by:[§1](https://arxiv.org/html/2606.04074#S1.p1.1),[§2](https://arxiv.org/html/2606.04074#S2.SS0.SSS0.Px3.p1.1),[§3\.7](https://arxiv.org/html/2606.04074#S3.SS7.p1.1)\.
- N\. Shazeer, A\. Mirhoseini, K\. Maziarz, A\. Davis, Q\. V\. Le, G\. E\. Hinton, and J\. Dean \(2017\)Outrageously large neural networks: the sparsely\-gated mixture\-of\-experts layer\.ArXiv\.Cited by:[§2](https://arxiv.org/html/2606.04074#S2.SS0.SSS0.Px3.p1.1)\.
- P\. Tang and W\. Zhang \(2025\)Unlocking the power of patch: patch\-based mlp for long\-term time series forecasting\.InProceedings of the AAAI conference on artificial intelligence,Vol\.39,pp\. 12640–12648\.Cited by:[§2](https://arxiv.org/html/2606.04074#S2.SS0.SSS0.Px2.p1.1)\.
- S\. Wang, H\. Wu, X\. Shi, T\. Hu, H\. Luo, L\. Ma, J\. Y\. Zhang, and J\. ZHOU \(2024\)TimeMixer: decomposable multiscale mixing for time series forecasting\.InThe Twelfth International Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2606.04074#S2.SS0.SSS0.Px2.p1.1)\.
- G\. Woo, C\. Liu, A\. Kumar, C\. Xiong, S\. Savarese, and D\. Sahoo \(2024\)Unified training of universal time series forecasting transformers\.InForty\-first International Conference on Machine Learning,Cited by:[§1](https://arxiv.org/html/2606.04074#S1.p1.1),[§2](https://arxiv.org/html/2606.04074#S2.SS0.SSS0.Px1.p1.5)\.
- Y\. Wu, S\. K\. Ankireddy, S\. Sharpe, N\. Seleznev, D\. Yuan, H\. Kim, and N\. H\. Nguyen \(2026\)Dynamic tokenization via reinforcement patching: end\-to\-end training and zero\-shot transfer\.arXiv preprint arXiv:2603\.26097\.Cited by:[§2](https://arxiv.org/html/2606.04074#S2.SS0.SSS0.Px2.p1.1)\.
- H\. Yin, A\. Vahdat, J\. M\. Alvarez, A\. Mallya, J\. Kautz, and P\. Molchanov \(2022\)A\-vit: adaptive tokens for efficient vision transformer\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 10809–10818\.Cited by:[§1](https://arxiv.org/html/2606.04074#S1.p1.1),[§2](https://arxiv.org/html/2606.04074#S2.SS0.SSS0.Px3.p1.1)\.
- S\. Yin, K\. Yin, Y\. Liu, W\. Chen, and L\. Lin \(2025\)DART: differentiable dynamic adaptive region tokenizer for vision foundation models\.arXiv preprint arXiv:2506\.10390\.Cited by:[§1](https://arxiv.org/html/2606.04074#S1.p1.1)\.
- A\. Zeng, M\. Chen, L\. Zhang, and Q\. Xu \(2023\)Are transformers effective for time series forecasting?\.InProceedings of the AAAI conference on artificial intelligence,Vol\.37,pp\. 11121–11128\.Cited by:[§1](https://arxiv.org/html/2606.04074#S1.SS0.SSS0.Px1.p1.7),[§2](https://arxiv.org/html/2606.04074#S2.SS0.SSS0.Px1.p1.5)\.
- H\. Zhou, S\. Zhang, J\. Peng, S\. Zhang, J\. Li, H\. Xiong, and W\. Zhang \(2021\)Informer: beyond efficient transformer for long sequence time\-series forecasting\.InProceedings of the AAAI conference on artificial intelligence,Vol\.35,pp\. 11106–11115\.Cited by:[Appendix F](https://arxiv.org/html/2606.04074#A6.p1.1),[§5\.1](https://arxiv.org/html/2606.04074#S5.SS1.p1.1)\.

## Appendix

#### Structure\.

[Appendix A](https://arxiv.org/html/2606.04074#A1) collects the proofs and technical formal statements supporting the main theory. [Appendix B](https://arxiv.org/html/2606.04074#A2) reports the variability and statistical-significance analysis of the patch-sweep results, including Wilcoxon signed-rank tests at the setting and dataset levels, bootstrap confidence intervals, Holm correction, and the underlying assumptions about variability. [Appendix C](https://arxiv.org/html/2606.04074#A3) reproduces the full per-patch-size ablation tables summarised in [Figure 3](https://arxiv.org/html/2606.04074#S5.F3). [Appendix D](https://arxiv.org/html/2606.04074#A4) summarises the model adaptations used to instantiate the uniform-patch baselines. [Appendix E](https://arxiv.org/html/2606.04074#A5) lists experimental hyperparameters and hardware. The appendix concludes with a self-contained continuous-rate diagnostic on a simple Transformer backbone that isolates the rate-allocation mechanism under a controlled noise intervention.

## Appendix A Proofs

The proofs below also include technical formal statements moved out of the main theory section for space\.

### A.1 Proof of [Theorem 3.3](https://arxiv.org/html/2606.04074#S3.Thmtheorem3) (scalar invariance)

Without a coupling constraint, the global problem separates into independent per-step subproblems, as in [Equation 3](https://arxiv.org/html/2606.04074#S3.E3). For a common per-step loss $g$ and $K_{t} > 0$,

$$\operatorname*{arg\,min}_{r_{t} \in I} K_{t}\, g(r_{t}) = \operatorname*{arg\,min}_{r_{t} \in I} g(r_{t}),$$

because multiplication by a positive scalar preserves the argmin. If $g$ has a unique minimiser $r^{\star}$, each subproblem therefore selects $r_{t}^{\star} = r^{\star}$. If $g = D$ is strictly decreasing on a compact interval $[r_{\min}, r_{\max}]$, each subproblem is minimised at $r_{\max}$; on an unbounded interval the decreasing objective has no finite minimiser. Thus scalar complexity alone cannot make the unconstrained optimum non-uniform.
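A small grid search illustrates both cases of the proof. The two losses, the interval, and the weights below are assumptions chosen for the example: `u_shaped` stands in for a loss with a unique interior minimiser and `decreasing` for a strictly decreasing distortion on a compact interval.

```python
import numpy as np

grid = np.linspace(0.5, 8.0, 751)   # candidate rates in I = [0.5, 8.0]
K_values = [0.2, 1.0, 5.0, 40.0]    # positive per-step weights


def u_shaped(r):
    """Hypothetical common loss with a unique minimiser at r = 3."""
    return 1.0 + (r - 3.0) ** 2


def decreasing(r):
    """Hypothetical strictly decreasing distortion."""
    return np.exp(-r)


def per_step_argmin(g, K_t):
    """Solve one unconstrained subproblem argmin_r K_t * g(r) on the grid."""
    return float(grid[np.argmin(K_t * g(grid))])


u_argmins = [per_step_argmin(u_shaped, k) for k in K_values]
d_argmins = [per_step_argmin(decreasing, k) for k in K_values]
print("U-shaped loss:   ", u_argmins)
print("decreasing loss: ", d_argmins)
```

Changing $K_{t}$ rescales each subproblem's objective but never moves its minimiser, so every step selects the same rate.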

### A.2 Proof of [Proposition 3.6](https://arxiv.org/html/2606.04074#S3.Thmtheorem6) (baseline tax)

Under independence, $K$ and $D(r)$ are independent, so

$$\mathcal{J}_{\text{dyn}}(r) = \mathbb{E}[K D(r)] = \bar{K}\,\mathbb{E}[D(r)].$$

Because $D$ is convex and $\mathbb{E}[r] = \bar{r}$, Jensen's inequality gives $\mathbb{E}[D(r)] \geq D(\bar{r})$, hence $\mathcal{J}_{\text{dyn}}(r) \geq \bar{K} D(\bar{r}) = \mathcal{J}_{\text{uni}}$. If $D$ is strictly convex on the support of $r$, equality requires $r_{t} \equiv \bar{r}$ almost surely.
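The factorisation and the Jensen step can be verified exactly on a small discrete example. The distributions and the distortion $D(r) = e^{-r}$ below are hypothetical; independence is imposed by building the joint distribution as an outer product of the two marginals.

```python
import numpy as np

# Hypothetical discrete marginals for complexity K and rate r.
K_vals, K_prob = np.array([0.5, 1.0, 4.0]), np.array([0.3, 0.5, 0.2])
r_vals, r_prob = np.array([1.0, 2.0, 3.0]), np.array([0.25, 0.5, 0.25])


def D(r):
    """Hypothetical convex, decreasing distortion."""
    return np.exp(-r)


# Independence: joint probability is the product of the marginals.
joint = np.outer(K_prob, r_prob)
J_dyn = float(np.sum(joint * np.outer(K_vals, D(r_vals))))  # E[K D(r)]

K_bar = float(K_prob @ K_vals)
r_bar = float(r_prob @ r_vals)
expected_D = float(r_prob @ D(r_vals))
J_uni = K_bar * float(D(r_bar))

print(f"J_dyn = {J_dyn:.6f}, K_bar * E[D(r)] = {K_bar * expected_D:.6f}")
print(f"baseline tax J_dyn - J_uni = {J_dyn - J_uni:.6f}")
```

With no dependence between $K$ and $r$, the only effect of spreading the rate is the convexity penalty $\bar{K}\,(\mathbb{E}[D(r)] - D(\bar{r}))$.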

### A.3 Proof of [Theorem 3.7](https://arxiv.org/html/2606.04074#S3.Thmtheorem7) (exact threshold)

Treat $t$ as a uniform random index and write $K = K_{t}$ and $r = r_{t}$. The dynamic scheme improves on the uniform baseline if and only if

$$\mathbb{E}[K D(r)] < \mathbb{E}[K]\, D(\bar{r}). \tag{9}$$

Using the covariance identity,

$$\mathbb{E}[K D(r)] = \mathrm{Cov}(K, D(r)) + \mathbb{E}[K]\,\mathbb{E}[D(r)]. \tag{10}$$

Substituting and rearranging gives

$$-\mathrm{Cov}(K, D(r)) > \mathbb{E}[K]\bigl(\mathbb{E}[D(r)] - D(\bar{r})\bigr), \tag{11}$$

which is exactly [Equation 4](https://arxiv.org/html/2606.04074#S3.E4).
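The equivalence between the direct comparison (9) and the covariance threshold (11) can be checked on sampled data. In the sketch below the distortion $D(r) = e^{-r}$, the weight distribution, and the two allocations are assumptions made for illustration; expectations are taken as empirical means over $t$.

```python
import numpy as np


def D(r):
    """Hypothetical convex, decreasing distortion."""
    return np.exp(-r)


def threshold_check(K, r):
    """Evaluate both sides of the equivalence between Eq. (9) and Eq. (11)."""
    K_bar, r_bar = K.mean(), r.mean()
    cov = np.mean(K * D(r)) - K_bar * np.mean(D(r))   # Cov(K, D(r))
    jensen_gap = np.mean(D(r)) - D(r_bar)             # E[D(r)] - D(r_bar)
    improves = np.mean(K * D(r)) < K_bar * D(r_bar)   # Eq. (9)
    threshold_met = -cov > K_bar * jensen_gap         # Eq. (11)
    return bool(improves), bool(threshold_met), float(cov), float(jensen_gap)


rng = np.random.default_rng(3)
K = rng.uniform(0.5, 3.0, size=2000)
r_bar = 2.0

aligned = threshold_check(K, r_bar + 0.5 * (K - K.mean()))   # more rate where K is high
reversed_ = threshold_check(K, r_bar - 0.5 * (K - K.mean()))  # more rate where K is low

for name, res in [("aligned", aligned), ("reversed", reversed_)]:
    print(f"{name}: improves={res[0]} threshold_met={res[1]} "
          f"cov={res[2]:+.4f} jensen_gap={res[3]:.4f}")
```

Both allocations pay the same kind of Jensen gap; only the sign and size of the covariance term decide whether the threshold is cleared.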

### A.4 Proof of [Theorem 3.8](https://arxiv.org/html/2606.04074#S3.Thmtheorem8) (quadratic-surrogate maximum)

Using the local quadratic approximation from [Equation 5](https://arxiv.org/html/2606.04074#S3.E5), define

$$\Delta(r) := (-D'(\bar{r}))\,\mathrm{Cov}(K, r) - \tfrac{1}{2}\bar{K} D''(\bar{r})\,\mathrm{Var}(r). \tag{12}$$

Write $\mathrm{Cov}(K, r) = \rho\,\sigma_{K}\sigma_{r}$ and $\mathrm{Var}(r) = \sigma_{r}^{2}$. Then

$$\Delta(r) = (-D'(\bar{r}))\,\rho\,\sigma_{K}\sigma_{r} - \tfrac{1}{2}\bar{K} D''(\bar{r})\,\sigma_{r}^{2}. \tag{13}$$

For fixed $\rho$ and $\sigma_{K}$, this is a concave quadratic in $\sigma_{r}$. If $\rho \leq 0$, the linear term is non-positive for all $\sigma_{r} \geq 0$, so the maximum is attained at $\sigma_{r} = 0$ with value $0$. If $\rho > 0$, differentiating with respect to $\sigma_{r}$ and setting to zero gives

$$\sigma_{r}^{\star} = \frac{(-D'(\bar{r}))\,\rho\,\sigma_{K}}{\bar{K} D''(\bar{r})}. \tag{14}$$

Substituting $\sigma_{r}^{\star}$ back into $\Delta(r)$ yields

$$\Delta_{\max}^{\text{quad}} \;=\; \frac{\bigl(-D'(\bar{r})\,\rho\,\sigma_{K}\bigr)^{2}}{2\bar{K} D''(\bar{r})}, \tag{15}$$

which is the positive-correlation case of [Equation 6](https://arxiv.org/html/2606.04074#S3.E6). This is the maximum of the quadratic surrogate, not an exact gain bound for arbitrary allocations.
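The closed forms (14) and (15) can be cross-checked against a brute-force maximisation of the surrogate (13). The numerical constants below are hypothetical values picked for the example; the functions simply transcribe the equations above.

```python
import numpy as np


def surrogate_gain(sigma_r, dD, d2D, rho, sigma_K, K_bar):
    """Quadratic surrogate of Eq. (13) as a function of sigma_r."""
    return (-dD) * rho * sigma_K * sigma_r - 0.5 * K_bar * d2D * sigma_r ** 2


def closed_form_peak(dD, d2D, rho, sigma_K, K_bar):
    """Return (sigma_r_star, delta_max) from Eqs. (14)-(15); zero if rho <= 0."""
    if rho <= 0:
        return 0.0, 0.0
    a = (-dD) * rho * sigma_K
    return a / (K_bar * d2D), a ** 2 / (2.0 * K_bar * d2D)


# Hypothetical local quantities: D'(r_bar) < 0 and D''(r_bar) > 0.
dD, d2D, sigma_K, K_bar = -0.4, 0.5, 1.2, 1.5
sigma_grid = np.linspace(0.0, 3.0, 30001)

results = {}
for rho in (0.8, 0.4, -0.5):
    values = surrogate_gain(sigma_grid, dD, d2D, rho, sigma_K, K_bar)
    best = int(np.argmax(values))
    results[rho] = {
        "grid": (float(sigma_grid[best]), float(values[best])),
        "closed": closed_form_peak(dD, d2D, rho, sigma_K, K_bar),
    }
    print(rho, results[rho])
```

Halving $\rho$ from 0.8 to 0.4 cuts the peak by a factor of four, which is the quadratic shrinkage of headroom under misalignment.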

### A.5 Proof of [Theorem A.1](https://arxiv.org/html/2606.04074#A1.Thmtheorem1) (KKT characterisation of the optimal constrained allocation)

###### Theorem A.1 (KKT characterisation of the optimal constrained allocation).

Assume $D\in C^{1}$ is strictly convex with $D'(r)<0$ for all $r$, and $K_{t}>0$. If the minimiser $r^{\star}$ of $\min_{r\in\mathcal{F}(\bar{r},I)}\mathbb{E}[K\,D(r)]$ is interior, there exists $\lambda>0$ such that

$$K_{t}\,D'(r_{t}^{\star})+\lambda=0,\qquad t=0,\ldots,T-1. \tag{16}$$

Equivalently $D'(r_{t}^{\star})=-\lambda/K_{t}$, and the allocation is *monotone* in complexity: $K_{s}>K_{t}\Rightarrow r_{s}^{\star}>r_{t}^{\star}$.

###### Proof\.

The constrained optimisation problem is

$$\min_{\{r_{t}\}\in\mathcal{F}(\bar{r},I)}\frac{1}{T}\sum_{t=0}^{T-1}K_{t}D(r_{t}),\quad\mathcal{F}(\bar{r},I):=\Bigl\{\{r_{t}\}\in I^{T}:\tfrac{1}{T}\sum_{t=0}^{T-1}r_{t}=\bar{r}\Bigr\}. \tag{17}$$

Form the Lagrangian

$$\mathcal{L}\bigl(\{r_{t}\},\lambda\bigr)=\frac{1}{T}\sum_{t=0}^{T-1}K_{t}D(r_{t})+\lambda\left(\frac{1}{T}\sum_{t=0}^{T-1}r_{t}-\bar{r}\right). \tag{18}$$

At an interior minimiser $\{r_{t}^{\star}\}$, differentiating $\mathcal{L}$ with respect to $r_{t}$ and setting the result to zero gives

$$\frac{1}{T}\bigl(K_{t}\,D'(r_{t}^{\star})+\lambda\bigr)=0,\qquad t=0,\ldots,T-1, \tag{19}$$

which is ([16](https://arxiv.org/html/2606.04074#A1.E16)). Since $D'(r)<0$ for all $r$ and $K_{t}>0$, the equation $K_{t}D'(r_{t}^{\star})=-\lambda$ can only hold with $\lambda>0$. Rearranging gives $D'(r_{t}^{\star})=-\lambda/K_{t}$. Because $D$ is differentiable and strictly convex, $D'$ is strictly increasing, so

$$K_{s}>K_{t}\;\Rightarrow\;\frac{-\lambda}{K_{s}}>\frac{-\lambda}{K_{t}}\;\Rightarrow\;D'(r_{s}^{\star})>D'(r_{t}^{\star})\;\Rightarrow\;r_{s}^{\star}>r_{t}^{\star}. \tag{20}$$

Uniqueness of $r_{t}^{\star}$ for each $t$ follows from strict convexity of $D$. ∎
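To make the stationarity condition (16) concrete, here is a minimal sketch for one hypothetical distortion, $D(r)=e^{-r}$, which is strictly convex with $D'(r)<0$. For this choice the multiplier has a closed form, so no numerical solver is needed. The distortion, the complexity values in `K`, and the budget `r_bar` are assumptions made for illustration, and the sketch takes the admissible interval to be wide enough that the solution is interior.

```python
import math


def kkt_allocation(K, r_bar):
    """Solve K_t * D'(r_t) + lam = 0 with mean(r) = r_bar for D(r) = exp(-r).

    Stationarity gives r_t = log(K_t) - log(lam); the budget constraint
    then fixes log(lam) = mean(log K) - r_bar.
    """
    log_lam = sum(math.log(k) for k in K) / len(K) - r_bar
    rates = [math.log(k) - log_lam for k in K]
    return rates, math.exp(log_lam)


# Hypothetical complexity values and mean rate.
K = [0.12, 0.12, 0.5, 1.0]
r_star, lam = kkt_allocation(K, r_bar=2.0)

# Residual of Equation (16) at each position; should be numerically zero.
residuals = [k * (-math.exp(-r)) + lam for k, r in zip(K, r_star)]
print(r_star, lam, residuals)
```

Positions with larger $K_t$ receive larger rates, which is the monotonicity claimed in the theorem, and the two positions with equal complexity receive equal rates.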

### A.6 Local existence of an improving aligned allocation

###### Proposition A.2 (Local existence of an improving aligned allocation).

Assume $D'(\bar{r})<0$, $\mathrm{Var}(K)>0$, and $\bar{r}\in\mathrm{int}(I)$ (the uniform mean bitrate lies in the interior of the admissible interval). For sufficiently small $\varepsilon>0$, set $r_{t}^{(\varepsilon)}:=\bar{r}+\varepsilon(K_{t}-\bar{K})$. Then $\mathbb{E}[r^{(\varepsilon)}]=\bar{r}$ and $\mathcal{J}_{\text{dyn}}(r^{(\varepsilon)})<\mathcal{J}_{\text{uni}}$.

###### Proof\.

Define the perturbed allocation $r_{t}^{(\varepsilon)}:=\bar{r}+\varepsilon(K_{t}-\bar{K})$ for $\varepsilon>0$. The mean is $\mathbb{E}[r^{(\varepsilon)}]=\bar{r}+\varepsilon(\bar{K}-\bar{K})=\bar{r}$, so the budget constraint is satisfied.

Expand $D(r_{t}^{(\varepsilon)})$ to first order in $\varepsilon$:

$$D\bigl(\bar{r}+\varepsilon(K_{t}-\bar{K})\bigr)=D(\bar{r})+D'(\bar{r})\,\varepsilon(K_{t}-\bar{K})+O(\varepsilon^{2}). \tag{21}$$

Multiplying by $K_{t}$, summing over $t$, and dividing by $T$:

$$\mathcal{J}_{\text{dyn}}\bigl(r^{(\varepsilon)}\bigr)=D(\bar{r})\,\mathbb{E}[K]+D'(\bar{r})\,\varepsilon\,\mathbb{E}\left[K(K-\bar{K})\right]+O(\varepsilon^{2}). \tag{22}$$

Note that $\mathbb{E}[K(K-\bar{K})]=\mathbb{E}[K^{2}]-\bar{K}^{2}=\mathrm{Var}(K)>0$ by assumption. Therefore

$$\mathcal{J}_{\text{dyn}}\bigl(r^{(\varepsilon)}\bigr)-\mathcal{J}_{\text{uni}}=D'(\bar{r})\,\varepsilon\,\mathrm{Var}(K)+O(\varepsilon^{2}). \tag{23}$$

Since $D'(\bar{r})<0$ and $\mathrm{Var}(K)>0$, the linear coefficient is strictly negative. For all sufficiently small $\varepsilon>0$ the linear term dominates and $\mathcal{J}_{\text{dyn}}(r^{(\varepsilon)})<\mathcal{J}_{\text{uni}}$. ∎
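The first-order argument can be checked numerically. The sketch below perturbs a uniform allocation in the direction $K_t-\bar{K}$ and compares the realised loss gap with the linear prediction of Equation (23). The distortion $D(r)=e^{-r}$, the complexity values, and the step size `eps` are hypothetical choices for illustration only.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def distortion(r):
    """Hypothetical convex, decreasing distortion D(r) = exp(-r)."""
    return math.exp(-r)


def distortion_prime(r):
    return -math.exp(-r)


K = [0.12, 0.12, 0.12, 1.0]   # hypothetical complexity field
r_bar, eps = 1.0, 1e-3        # hypothetical budget and perturbation size

K_bar = mean(K)
var_K = mean([(k - K_bar) ** 2 for k in K])

# Aligned perturbation r_t = r_bar + eps * (K_t - K_bar).
r_eps = [r_bar + eps * (k - K_bar) for k in K]

j_uni = K_bar * distortion(r_bar)
j_dyn = mean([k * distortion(r) for k, r in zip(K, r_eps)])

gap = j_dyn - j_uni                                   # left side of (23)
first_order = distortion_prime(r_bar) * eps * var_K   # linear term of (23)
print(gap, first_order)
```

The gap is negative and agrees with the linear term up to an $O(\varepsilon^2)$ correction, as the proposition requires.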

### A.7 Exact second-order expansion ([Proposition A.3](https://arxiv.org/html/2606.04074#A1.Thmtheorem3))

###### Proposition A.3 (Exact second-order expansion with remainder).

Let $\delta_{t}=r_{t}-\bar{r}$ and assume $D\in C^{3}$ on the interval containing the allocation. With $R_{3}(\delta):=D(\bar{r}+\delta)-D(\bar{r})-D'(\bar{r})\delta-\tfrac{1}{2}D''(\bar{r})\delta^{2}$,

$$\Delta_{D}(r)=-D'(\bar{r})\,\mathrm{Cov}(K,r)-\tfrac{1}{2}D''(\bar{r})\,\mathbb{E}[K\delta^{2}]-\mathbb{E}[KR_{3}(\delta)], \tag{24}$$

and $\mathbb{E}[K\delta^{2}]=\bar{K}\,\mathrm{Var}(r)+\mathrm{Cov}(K,\delta^{2})$. If $|D'''|\leq M_{3}$, then $|\mathbb{E}[KR_{3}(\delta)]|\leq\tfrac{M_{3}}{6}\,\mathbb{E}[K|\delta|^{3}]$.

###### Proof\.

By Taylor's theorem applied to $D\in C^{3}$:

$$D(r_{t})=D(\bar{r})+D'(\bar{r})\,\delta_{t}+\tfrac{1}{2}D''(\bar{r})\,\delta_{t}^{2}+R_{3}(\delta_{t}), \tag{25}$$

where $\delta_{t}=r_{t}-\bar{r}$ and $R_{3}(\delta_{t})=D(r_{t})-D(\bar{r})-D'(\bar{r})\delta_{t}-\tfrac{1}{2}D''(\bar{r})\delta_{t}^{2}$. Multiply by $K_{t}$, sum over $t$, and divide by $T$:

$$\frac{1}{T}\sum_{t}K_{t}D(r_{t})=D(\bar{r})\,\mathbb{E}[K]+D'(\bar{r})\,\mathbb{E}[K\delta]+\tfrac{1}{2}D''(\bar{r})\,\mathbb{E}[K\delta^{2}]+\mathbb{E}[KR_{3}(\delta)]. \tag{26}$$

Since $\mathbb{E}[\delta]=\mathbb{E}[r-\bar{r}]=0$ and $\mathbb{E}[K\delta]=\mathbb{E}[K(r-\bar{r})]=\mathbb{E}[Kr]-\bar{r}\,\mathbb{E}[K]=\mathrm{Cov}(K,r)$, we obtain

$$\Delta_{D}(r)=\mathcal{J}_{\text{uni}}-\mathcal{J}_{\text{dyn}}(r)=-D'(\bar{r})\,\mathrm{Cov}(K,r)-\tfrac{1}{2}D''(\bar{r})\,\mathbb{E}[K\delta^{2}]-\mathbb{E}[KR_{3}(\delta)], \tag{27}$$

which is ([24](https://arxiv.org/html/2606.04074#A1.E24)).

For the expansion of $\mathbb{E}[K\delta^{2}]$, write $K=\bar{K}+(K-\bar{K})$:

$$\mathbb{E}[K\delta^{2}]=\bar{K}\,\mathbb{E}[\delta^{2}]+\mathbb{E}[(K-\bar{K})\delta^{2}]=\bar{K}\,\mathrm{Var}(r)+\mathrm{Cov}(K,\delta^{2}). \tag{28}$$

For the remainder bound: by Taylor's theorem with the Lagrange remainder, $|R_{3}(\delta)|\leq\frac{M_{3}}{6}|\delta|^{3}$. Multiplying by $K\geq 0$ and taking expectations gives $|\mathbb{E}[KR_{3}(\delta)]|\leq\frac{M_{3}}{6}\,\mathbb{E}[K|\delta|^{3}]$. ∎
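The identity (24) and the remainder bound can be verified on a small example. The sketch below computes $\Delta_D(r)$ directly and through the three-term expansion, then checks the cubic bound. The distortion $D(r)=e^{-r}$, the weights `K`, and the allocation `r` are hypothetical; for this distortion $|D'''(u)|=e^{-u}$, so its supremum over the allocation's range is attained at the smallest rate.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def distortion(x):
    """Hypothetical distortion D(r) = exp(-r)."""
    return math.exp(-x)


K = [0.12, 0.12, 0.5, 1.0]   # hypothetical complexity weights
r = [0.8, 0.9, 1.0, 1.3]     # hypothetical allocation with mean 1.0
r_bar = mean(r)
delta = [x - r_bar for x in r]

d1 = -math.exp(-r_bar)       # D'(r_bar)
d2 = math.exp(-r_bar)        # D''(r_bar)

# Exact Taylor remainder R_3(delta_t) at each position.
R3 = [
    distortion(r_bar + d) - distortion(r_bar) - d1 * d - 0.5 * d2 * d * d
    for d in delta
]

K_bar = mean(K)
cov_K_r = mean([k * x for k, x in zip(K, r)]) - K_bar * r_bar
E_K_delta2 = mean([k * d * d for k, d in zip(K, delta)])
E_K_R3 = mean([k * e for k, e in zip(K, R3)])

# Gain computed directly as J_uni - J_dyn, and via Equation (24).
direct = K_bar * distortion(r_bar) - mean([k * distortion(x) for k, x in zip(K, r)])
expansion = -d1 * cov_K_r - 0.5 * d2 * E_K_delta2 - E_K_R3

M3 = math.exp(-min(r))       # sup of |D'''| over [min(r), max(r)]
remainder_bound = M3 / 6.0 * mean([k * abs(d) ** 3 for k, d in zip(K, delta)])
print(direct, expansion, abs(E_K_R3), remainder_bound)
```

The two gain values agree to floating-point precision because (24) is an exact identity, not an approximation.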

### A.8 Proof of [Theorem A.4](https://arxiv.org/html/2606.04074#A1.Thmtheorem4) (exact global upper bound under strong convexity)

###### Theorem A.4 (Exact global upper bound under strong convexity).

Set $m_{D}:=\inf_{u\in I}D''(u)>0$, $L_{D}:=\sup_{u\in I}|D'(u)|<\infty$, and $\alpha_{+}(r):=\max\{\mathrm{Corr}(K,-D(r)),0\}$. For any feasible allocation,

$$\Delta_{D}(r)\leq\alpha_{+}(r)\,L_{D}\,\sigma_{K}\sigma_{r}-\tfrac{1}{2}\bar{K}\,m_{D}\,\sigma_{r}^{2}\leq\frac{\alpha_{+}(r)^{2}L_{D}^{2}\sigma_{K}^{2}}{2\bar{K}m_{D}}\leq\frac{L_{D}^{2}\sigma_{K}^{2}}{2\bar{K}m_{D}}. \tag{29}$$

###### Proof\.

Starting from the exact decomposition implicit in [Theorem 3.7](https://arxiv.org/html/2606.04074#S3.Thmtheorem7),

$$\Delta_{D}(r)=-\mathrm{Cov}(K,D(r))-\mathbb{E}[K]\bigl(\mathbb{E}[D(r)]-D(\bar{r})\bigr). \tag{30}$$

Step 1: upper-bounding the covariance term. Writing $-\mathrm{Cov}(K,D(r))=\mathrm{Cov}(K,-D(r))$, we use $\mathrm{Cov}(K,-D(r))\leq\alpha_{+}(r)\,\sigma_{K}\,\sigma_{D(r)}$, where $\alpha_{+}(r)=\max\{\mathrm{Corr}(K,-D(r)),0\}$ (the non-negative part of the correlation is the only part that can benefit). Since $|D'|\leq L_{D}$ on $I$, $D$ is $L_{D}$-Lipschitz, so $\sigma_{D(r)}\leq L_{D}\,\sigma_{r}$. Therefore $-\mathrm{Cov}(K,D(r))\leq\alpha_{+}(r)\,L_{D}\,\sigma_{K}\,\sigma_{r}$.

Step 2: lower-bounding the Jensen penalty. Under $m_{D}$-strong convexity, for any $r$ with $\mathbb{E}[r]=\bar{r}$,

$$\mathbb{E}[D(r)]\;\geq\;D(\bar{r})+\tfrac{m_{D}}{2}\,\mathbb{E}[(r-\bar{r})^{2}]=D(\bar{r})+\tfrac{m_{D}}{2}\,\sigma_{r}^{2}, \tag{31}$$

so $\mathbb{E}[D(r)]-D(\bar{r})\geq\tfrac{m_{D}}{2}\sigma_{r}^{2}$ and $-\mathbb{E}[K]\bigl(\mathbb{E}[D(r)]-D(\bar{r})\bigr)\leq-\tfrac{1}{2}\bar{K}\,m_{D}\,\sigma_{r}^{2}$.

Step 3: combining.

$$\Delta_{D}(r)\;\leq\;\alpha_{+}(r)\,L_{D}\,\sigma_{K}\,\sigma_{r}-\tfrac{1}{2}\,\bar{K}\,m_{D}\,\sigma_{r}^{2}. \tag{32}$$

Step 4: optimising over $\sigma_{r}$. The right-hand side is a concave quadratic in $\sigma_{r}\geq 0$ with maximum at $\sigma_{r}^{\star}=\alpha_{+}(r)\,L_{D}\,\sigma_{K}/(\bar{K}\,m_{D})$. Substituting,

$$\Delta_{D}(r)\;\leq\;\frac{\alpha_{+}(r)^{2}\,L_{D}^{2}\,\sigma_{K}^{2}}{2\,\bar{K}\,m_{D}}\;\leq\;\frac{L_{D}^{2}\,\sigma_{K}^{2}}{2\,\bar{K}\,m_{D}}, \tag{33}$$

where the last step uses $\alpha_{+}(r)\leq 1$, giving ([29](https://arxiv.org/html/2606.04074#A1.E29)). ∎
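The chain of inequalities in (29) can be checked on a toy allocation. The sketch below assumes the hypothetical distortion $D(r)=e^{-r}$ on a hypothetical interval $I=[0.5,1.5]$, for which $L_D=e^{-0.5}$ and $m_D=e^{-1.5}$; the weights and rates are illustrative only.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def std(xs):
    """Population standard deviation, matching the E[.] convention above."""
    m = mean(xs)
    return math.sqrt(mean([(x - m) ** 2 for x in xs]))


K = [0.12, 0.12, 0.5, 1.0]   # hypothetical complexity weights
r = [0.8, 0.9, 1.0, 1.3]     # hypothetical feasible allocation inside I
I = (0.5, 1.5)               # hypothetical admissible interval

D_vals = [math.exp(-x) for x in r]
K_bar, r_bar = mean(K), mean(r)
sigma_K, sigma_r, sigma_D = std(K), std(r), std(D_vals)

# Cov(K, -D(r)) and its non-negative correlation part alpha_plus.
cov_K_negD = -(mean([k * d for k, d in zip(K, D_vals)]) - K_bar * mean(D_vals))
alpha_plus = max(cov_K_negD / (sigma_K * sigma_D), 0.0)

L_D = math.exp(-I[0])        # sup |D'| on I for D = exp(-r)
m_D = math.exp(-I[1])        # inf D'' on I for D = exp(-r)

delta_D = K_bar * math.exp(-r_bar) - mean([k * d for k, d in zip(K, D_vals)])
bound_1 = alpha_plus * L_D * sigma_K * sigma_r - 0.5 * K_bar * m_D * sigma_r ** 2
bound_2 = alpha_plus ** 2 * L_D ** 2 * sigma_K ** 2 / (2 * K_bar * m_D)
bound_3 = L_D ** 2 * sigma_K ** 2 / (2 * K_bar * m_D)
print(delta_D, bound_1, bound_2, bound_3)
```

In this example the first bound is noticeably tighter than the last two, which discard the information in the realised $\sigma_r$ and in $\alpha_+(r)$ respectively.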

### A.9 Proof of [Theorem 3.10](https://arxiv.org/html/2606.04074#S3.Thmtheorem10) (uniform optimality at the representation-aware minimiser)

Suppose $\bar{r}=r^{\star}$, where $r^{\star}=\operatorname*{arg\,min}_{r}\mathcal{L}(r)$. For every time step $t$,

$$\mathcal{L}(r_{t})\;\geq\;\mathcal{L}(r^{\star})\quad\forall\,r_{t}, \tag{34}$$

since $r^{\star}$ is the global minimiser. Since $K_{t}\geq 0$,

$$\mathcal{J}_{\mathcal{L}}(r)=\mathbb{E}[K\,\mathcal{L}(r)]\;\geq\;\mathbb{E}[K\,\mathcal{L}(r^{\star})]=\bar{K}\,\mathcal{L}(r^{\star})=\bar{K}\,\mathcal{L}(\bar{r})=\mathcal{J}_{\mathcal{L}}^{\mathrm{unif}}. \tag{35}$$

For the equality condition: $\mathcal{J}_{\mathcal{L}}(r)=\mathcal{J}_{\mathcal{L}}^{\mathrm{unif}}$ requires $K_{t}\bigl(\mathcal{L}(r_{t})-\mathcal{L}(r^{\star})\bigr)=0$ for every $t$. When $K_{t}>0$ and $\mathcal{L}(r_{t})\geq\mathcal{L}(r^{\star})$, this forces $\mathcal{L}(r_{t})=\mathcal{L}(r^{\star})$. If $r^{\star}$ is the unique minimiser of $\mathcal{L}$ (ensured by [3.9](https://arxiv.org/html/2606.04074#S3.Thmtheorem9)), then $r_{t}=r^{\star}$ for all $t$.

### A.10 Proof of [Proposition 3.11](https://arxiv.org/html/2606.04074#S3.Thmtheorem11) (optimality trap: local expansion)

Apply [Proposition A.3](https://arxiv.org/html/2606.04074#A1.Thmtheorem3) with $D$ replaced by $\mathcal{L}$. With $R_{\mathcal{L},3}(\delta):=\mathcal{L}(\bar{r}+\delta)-\mathcal{L}(\bar{r})-\mathcal{L}'(\bar{r})\delta-\tfrac{1}{2}\mathcal{L}''(\bar{r})\delta^{2}$, the exact expansion is

$$\Delta_{\mathcal{L}}(r)=-\mathcal{L}'(\bar{r})\,\mathrm{Cov}(K,r)-\tfrac{1}{2}\mathcal{L}''(\bar{r})\,\mathbb{E}[K\delta^{2}]-\mathbb{E}[KR_{\mathcal{L},3}(\delta)],$$

which is ([8](https://arxiv.org/html/2606.04074#S3.E8)). When $\bar{r}\approx r^{\star}$, the first-order condition $\mathcal{L}'(r^{\star})=0$ gives $\mathcal{L}'(\bar{r})\approx 0$, so the leading second-order term is non-positive because $\mathcal{L}''>0$ and $\mathbb{E}[K\delta^{2}]\geq 0$.

### A.11 Representation cost amplifies the optimality trap

Apply the exact Taylor expansion of [Proposition A.3](https://arxiv.org/html/2606.04074#A1.Thmtheorem3) with $\mathcal{L}$ in place of $D$:

$$\Delta_{\mathcal{L}}(r)=-\mathcal{L}'(\bar{r})\,\mathrm{Cov}(K,r)-\tfrac{1}{2}\,\mathcal{L}''(\bar{r})\,\mathbb{E}[K\delta^{2}]-\mathbb{E}[KR_{\mathcal{L},3}(\delta)]. \tag{36}$$

This is valid for any $\mathcal{L}\in C^{3}$. When $\bar{r}\approx r^{\star}$, $\mathcal{L}'(\bar{r})\approx 0$, so the first term vanishes and the leading non-zero contribution is

$$\Delta_{\mathcal{L}}(r)\approx-\tfrac{1}{2}\mathcal{L}''(\bar{r})\,\mathbb{E}[K\delta^{2}]. \tag{37}$$

Since $\mathcal{L}''(\bar{r})>0$ (by [3.9](https://arxiv.org/html/2606.04074#S3.Thmtheorem9)) and $\mathbb{E}[K\delta^{2}]\geq 0$, this leading term is non-positive, confirming that, to leading order, the net dynamic gain is at most zero near $r^{\star}$.

Finally, $\mathcal{L}(r)=D(r)+\lambda D_{\text{sem}}(r)$ implies

$$\mathcal{L}''(\bar{r})=D''(\bar{r})+\lambda\,D_{\text{sem}}''(\bar{r}). \tag{38}$$

When $\lambda>0$ and the representation term is locally convex with $D_{\text{sem}}''(\bar{r})>0$, the curvature controlling the Jensen penalty is strictly larger than under pure rate–distortion. The trap is therefore more binding whenever the semantic representation cost adds positive local curvature.

## Appendix B Significance of the validation-selected comparison

For each (method, dataset, horizon) cell, we sweep the candidate uniform patch sizes, select the one with the lowest validation MSE, and compare its test MSE against the matched dynamic baseline. We quantify the reliability of this protocol by collecting the $32$ per-cell relative gains $\Delta\text{MSE}=100\cdot(\text{MSE}_{\text{dyn}}-\text{MSE}_{\text{uni}})/\text{MSE}_{\text{dyn}}$ (eight datasets $\times$ four horizons, each averaged over five training seeds) and applying a two-sided exact Wilcoxon signed-rank test per method, with Holm correction across the three methods. A positive $\Delta\text{MSE}$ means the uniform variant attains the lower test MSE. $95\%$ confidence intervals for the median are obtained from $10{,}000$ bootstrap resamples; effect size is the rank-biserial correlation $r$ ([Table 1](https://arxiv.org/html/2606.04074#A2.T1)).
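The bookkeeping in this protocol can be sketched in a few lines. The code below implements the relative gain, the Holm step-down adjustment, and one common definition of the matched-pairs rank-biserial correlation, $(W_+-W_-)/(W_++W_-)$. The paper does not state which rank-biserial formula it uses, so that choice is an assumption, as is the simplification that no two differences share the same magnitude. The example p-values are hypothetical.

```python
def relative_gain(mse_dyn, mse_uni):
    """Percentage gain of the uniform variant over the dynamic baseline."""
    return 100.0 * (mse_dyn - mse_uni) / mse_dyn


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, etc.
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted


def rank_biserial(diffs):
    """(W+ - W-) / (W+ + W-); assumes no tied magnitudes (illustrative)."""
    d = [x for x in diffs if x != 0]
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    w_pos = sum(rank for rank, i in enumerate(order, start=1) if d[i] > 0)
    w_neg = sum(rank for rank, i in enumerate(order, start=1) if d[i] < 0)
    return (w_pos - w_neg) / (w_pos + w_neg)


# Hypothetical raw p-values for three methods.
adjusted = holm_adjust([0.01, 0.04, 0.03])
print(adjusted)
```

Because the adjustment is applied across only three methods, the smallest raw p-value is multiplied by three, the next by two, and the largest is left unchanged apart from the monotonicity step.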

Table 1: Setting-level significance of the validation-selected uniform variant relative to the matched dynamic baseline. Median is the median relative test-MSE gain over the $32$ (dataset, horizon) cells. CI is a $10{,}000$-resample bootstrap $95\%$ percentile interval for the median; raw $p$ is the two-sided exact Wilcoxon signed-rank $p$-value; Holm $p$ is the Holm-Bonferroni adjusted $p$-value across the three methods; $r$ is the rank-biserial correlation.

#### Dataset-clustered re-analysis.

The $32$ entries per method are not independent: four horizons share the same dataset. We therefore re-run the analysis at the dataset level by averaging the four horizon-level $\Delta\text{MSE}$ values into a single number per (method, dataset), yielding eight clusters per method. Two complementary checks are performed.

*Cluster bootstrap.* We resample the eight datasets with replacement $10{,}000$ times, carry along all four horizon-level $\Delta\text{MSE}$ values for each sampled dataset, and recompute the setting-level median on the resulting resampled collection. The resulting $95\%$ percentile intervals for the median gain are $[-0.37,+0.94]$ for EntroPE, $[-1.14,+0.54]$ for TimeMosaic, and $[+0.32,+2.64]$ for HDMixer. All three intervals lie above a practical non-inferiority margin of $-2\%$.

*Dataset-level signed-rank.* A two-sided exact Wilcoxon signed-rank test on the eight cluster means yields $p=0.844$ for EntroPE, $p=0.641$ for TimeMosaic, and $p=0.195$ for HDMixer. With only eight clusters this test is conservative and under-powered by design, and it does not support a directional claim by itself.
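Both dataset-level checks are short enough to sketch with the standard library. The code below is an illustrative reimplementation, not the authors' analysis script: the cluster bootstrap resamples whole datasets and pools their horizon-level gains, and the signed-rank test enumerates all sign patterns. The toy gains, the dataset labels `ds0` to `ds7`, and the assumption of no zero or tied magnitudes are all hypothetical.

```python
import random
from itertools import product
from statistics import mean, median


def cluster_bootstrap_median_ci(gains_by_dataset, n_resamples=10_000,
                                level=0.95, seed=0):
    """Percentile CI for the setting-level median, resampling whole datasets."""
    rng = random.Random(seed)
    names = sorted(gains_by_dataset)
    medians = []
    for _ in range(n_resamples):
        cells = []
        for _ in names:  # draw as many datasets as there are clusters
            cells.extend(gains_by_dataset[rng.choice(names)])
        medians.append(median(cells))
    medians.sort()
    lo = medians[int((1 - level) / 2 * (n_resamples - 1))]
    hi = medians[int((1 + level) / 2 * (n_resamples - 1))]
    return lo, hi


def exact_signed_rank_p(diffs):
    """Two-sided exact Wilcoxon signed-rank p-value by sign enumeration.

    Assumes no tied magnitudes; zero differences are dropped.
    """
    d = [x for x in diffs if x != 0]
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    ranks = [0] * len(d)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    centre = sum(ranks) / 2  # null mean of the positive-rank sum
    observed = abs(sum(rk for rk, x in zip(ranks, d) if x > 0) - centre)
    hits = sum(
        abs(sum(rk for rk, s in zip(ranks, signs) if s) - centre) >= observed
        for signs in product((0, 1), repeat=len(d))
    )
    return hits / 2 ** len(d)


# Toy gains: eight hypothetical datasets with four horizon-level values each.
toy_gains = {f"ds{i}": [0.1 * (i - 3) + 0.05 * h for h in range(4)]
             for i in range(8)}
ci_low, ci_high = cluster_bootstrap_median_ci(toy_gains, n_resamples=2_000)
cluster_means = [mean(v) for v in toy_gains.values()]
p_value = exact_signed_rank_p(cluster_means)
print(ci_low, ci_high, p_value)
```

With eight untied, non-zero cluster means there are only $2^8=256$ sign patterns, so every attainable two-sided p-value is a multiple of $1/128$ and the smallest possible value is $2/256\approx 0.008$. The reported values are consistent with this grid, and the coarse resolution is one way to see why the dataset-level test has limited power.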

#### What the tests do and do not support\.

EntroPE and TimeMosaic show no reliable directional effect at either resolution. HDMixer shows a small positive shift at the setting level that survives Holm correction (Holm $p=0.040$) and the clustered bootstrap (positive median CI), but is not significant under the conservative dataset-level rank test. We therefore treat the directional claim for HDMixer as a small effect compatible with the second-order ceiling of [Theorem 3.8](https://arxiv.org/html/2606.04074#S3.Thmtheorem8), not as a strong statement of universal superiority of fixed patches. The overall distribution of $\Delta\text{MSE}$ lies in $[-8.1\%,+6.5\%]$, with most mass in $(-2\%,+2\%)$, which is the regime our threshold analysis predicts when the uniform operating point is already at low local curvature.

## Appendix C Full per-patch-size ablation tables

Tables [2](https://arxiv.org/html/2606.04074#A3.T2), [3](https://arxiv.org/html/2606.04074#A3.T3), and [4](https://arxiv.org/html/2606.04074#A3.T4) report the complete uniform-patch sweep for EntroPE, HDMixer, and TimeMosaic, respectively. Each table shows test MSE and MAE for every (dataset, horizon, patch-size) combination, together with the percentage MSE improvement over the matched dynamic baseline (%Imp) and the training-time speedup relative to the dynamic variant (Spd). Results are averaged over 5 seeds. These sweep results provide the candidate fixed patch sizes from which the validation-selected summary in [Figure 3](https://arxiv.org/html/2606.04074#S5.F3) is obtained via the protocol described in [Section 5.1](https://arxiv.org/html/2606.04074#S5.SS1).

Table 2: EntroPE uniform patch sweep. Test MSE and MAE averaged over 5 seeds. %Imp = MSE improvement over dynamic baseline. Speedup is relative to dynamic training time. Lowest MSE per row, including Dyn., is blue bold; the second-lowest is orange underlined. Bottom row: mean $\pm$ std across all dataset/horizon groups.

Table 3: HDMixer uniform patch sweep. Test MSE and MAE averaged over multiple seeds. %Imp = MSE improvement over dynamic baseline. Speedup is relative to dynamic training time. Lowest MSE per row, including Dyn., is blue bold; the second-lowest is orange underlined. Bottom row: mean $\pm$ std across all dataset/horizon groups.

Table 4: TimeMosaic uniform patch sweep. Test MSE and MAE averaged over multiple seeds. %Imp = MSE improvement over dynamic baseline. Speedup is relative to dynamic training time. Lowest MSE per row, including Dyn., is blue bold; the second-lowest is orange underlined. Bottom row: mean $\pm$ std across all dataset/horizon groups.

## Appendix D Model adaptations for the uniform-patch ablations

The controlled study requires a uniform\-patch counterpart for each dynamic method\. In all cases, the design principle is the same: modify the patch generation mechanism while leaving the downstream forecasting backbone, prediction head, and training budget as close as possible to the original model\. The model implementations used for these ablations are publicly released open\-source research codebases\.

#### EntroPE\.

The EntroPE ablation uses the model’s pre\-existing static patching mode with a fixed patch length\. Relative to the original method, the only conceptual change is that entropy\-based boundary selection is disabled and all segments are forced to have equal length\. The rest of the forecasting pipeline remains unchanged\. Because this static mode already existed in the original design, the EntroPE comparison is especially clean: it contrasts entropy\-guided segmentation with a matched fixed\-length alternative inside the same architecture\.

#### TimeMosaic\.

TimeMosaic ordinarily learns, for each temporal region, which patch length to use from a small candidate set\. The uniform variant removes this routing decision and applies a single fixed patch length across all regions\. To keep the comparison fair, the fixed\-length variant reuses the same patch\-embedding operator associated with that patch size in the original multi\-scale embedding bank rather than introducing a new projection module\. The downstream encoder and prediction head are unchanged, while the auxiliary routing regularization is disabled because no patch\-size classification is performed\. This makes the comparison specific to adaptive scale selection rather than to changes in representation capacity\.

#### HDMixer\.

HDMixer ordinarily uses deformable patch extraction, in which the model predicts patch offsets and widths and applies an auxiliary patch\-entropy regularizer\. The uniform variant replaces this deformable extraction stage with standard non\-overlapping fixed\-length patches\. Under this ablation, the deformable localization mechanism and the associated auxiliary regularizer are inactive, while the subsequent mixer backbone and prediction head are kept unchanged\. The number of patches is recomputed from the chosen fixed length so that the downstream forecast module receives a tensorization matched to the selected granularity\.

#### Common interpretation\.

Across all three methods, the ablations were constructed to preserve the original forecasting pathway after patch formation and to alter only the adaptive patch\-selection component and any losses tied directly to that component\. This is the relevant comparison for the paper’s question: whether content\-adaptive patch allocation itself provides a robust advantage once the fixed\-length baseline is tuned under the same experimental protocol\. Across variants, downstream tensors are matched to each method’s own patch\-count convention so that the prediction head remains the same module even when the number of fixed\-length patches changes\.

## Appendix E Experimental hyperparameters and hardware

#### Patch sizes evaluated.

The uniform patch grids differ across methods, reflecting each architecture's tokenisation constraints. EntroPE and HDMixer are swept over $p\in\{4,8,16,24,32\}$; TimeMosaic over $p\in\{8,16,32\}$, matching its native multi-scale candidate set. For each method, dataset, and horizon, the uniform variant whose validation MSE is lowest is selected and its test MSE is reported; the full per-patch-size breakdown is in [Appendix C](https://arxiv.org/html/2606.04074#A3).
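The selection rule is simple but easy to get subtly wrong, so a minimal sketch may help. The function below picks the patch size by validation MSE alone and only then looks up the test MSE. The MSE values are hypothetical numbers for a single (method, dataset, horizon) cell and are not results from the paper.

```python
def select_uniform_patch(val_mse, test_mse):
    """Pick the patch size with the lowest validation MSE; report its test MSE."""
    best = min(val_mse, key=val_mse.get)
    return best, test_mse[best]


# Hypothetical sweep over the EntroPE/HDMixer grid {4, 8, 16, 24, 32}.
val_mse = {4: 0.412, 8: 0.398, 16: 0.391, 24: 0.395, 32: 0.404}
test_mse = {4: 0.430, 8: 0.409, 16: 0.405, 24: 0.401, 32: 0.417}

patch, reported = select_uniform_patch(val_mse, test_mse)
print(patch, reported)
```

In this toy cell the patch size with the lowest test MSE is 24, yet the rule reports the test MSE at 16 because that is what validation selects. Reporting the test-optimal patch size instead would leak test information into the comparison.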

#### Hyperparameters\.

For EntroPE and TimeMosaic, we use the hyperparameters reported in the respective published paper and official codebase\. Where a dataset\-or\-horizon\-specific configuration is not provided by the original authors, we select it by minimum validation MSE from the set of configurations tried by those authors\. For HDMixer, the original paper does not report the complete dataset\-level hyperparameter settings needed for the benchmark suite, so we run a candidate sweep and select the configuration with the lowest validation MSE for each dataset\. The selected HDMixer configurations are listed in[Table7](https://arxiv.org/html/2606.04074#A5.T7)\. No additional broad retuning of the adaptive baselines is performed; the goal is to assess whether the adaptive patch\-selection mechanism adds value beyond a uniform sweep under the same published setup\. All models are trained separately in their original codebases\.

Table 5: EntroPE dynamic hyperparameters by dataset. These are the dataset-level dynamic-model settings used in the comparison runs.

Table 6: TimeMosaic dynamic hyperparameters by dataset. Patience was $3$ for all TimeMosaic datasets listed above.

Table 7: HDMixer selected hyperparameters by dataset. The original HDMixer paper does not report the complete dataset-level hyperparameters used for the benchmark suite. We therefore selected these settings by validation MSE from a compact candidate sweep.

## Appendix F Benchmark dataset notes

The main experiments use eight standard multivariate long-horizon forecasting benchmarks. The suite combines the ETT variants and Weather from the Informer benchmark [Zhou et al., [2021](https://arxiv.org/html/2606.04074#bib.bib42)] with Electricity, Exchange, and Traffic from the LSTNet benchmark collection [Lai et al., [2018](https://arxiv.org/html/2606.04074#bib.bib43)]. Across HDMixer, EntroPE, and TimeMosaic, we use the standard chronological benchmark splits: 6:2:2 for ETT and 7:1:2 for Weather, Electricity, and Traffic.

#### Dataset descriptions\.

Electricity records hourly electricity consumption for 321 clients and is used for multivariate load forecasting. ETTh1, ETTh2, ETTm1, and ETTm2 track electricity transformer temperature and load, with the 'h' variants sampled hourly and the 'm' variants sampled every 15 minutes. Exchange contains daily exchange rates for multiple currencies relative to the U.S. dollar. Traffic records hourly road occupancy measurements from a freeway sensor network. Weather contains 10-minute atmospheric measurements from a weather station. Table [8](https://arxiv.org/html/2606.04074#A6.T8) summarizes the dimensions, split sizes, and sampling frequencies of these datasets.

Table 8: Benchmark dataset summary. 'Dim' is the number of variables, 'Split size' lists the (train, validation, test) sequence counts, and 'Frequency' is the native sampling interval.

#### Hardware and artifacts\.

All experiments were run on H100 GPUs with 80GB VRAM\. The evaluation uses public long\-horizon forecasting datasets and the official codebases of the three compared methods\.

## Appendix G Continuous-rate diagnostic: isolating the allocation mechanism

The real-data ablations in [Section 5](https://arxiv.org/html/2606.04074#S5) compare trained dynamic patchers against tuned uniform baselines, but those comparisons mix three effects: the rate-allocation decision itself, the discrete placement of patch boundaries, and the change in token count and embedding scale induced by each patch length. If the uniform variant wins, is the routing signal misaligned, or do boundary placement and tokenisation overhead wash out a small alignment benefit? To separate those possibilities, we use a diagnostic with no patch boundaries at all and intervene only on per-position information quality. This isolates the rate-allocation mechanism from the rest of the pipeline.

### G.1 Motivation

[Theorem 3.7](https://arxiv.org/html/2606.04074#S3.Thmtheorem7) states that a dynamic allocation improves over a budget-matched uniform baseline if and only if the alignment gain $-\mathrm{Cov}(K,D(r))$ exceeds the Jensen penalty $\bar{K}\bigl(\mathbb{E}[D(r)]-D(\bar{r})\bigr)$. The theorem is stated for an abstract convex distortion $D$ and a complexity field $K_{t}$; it says nothing about patch boundaries or token counts. The question here is whether a trained forecasting backbone responds to this tradeoff once those discrete pipeline effects are removed. The diagnostic below keeps the model architecture, dataset, training protocol, token grid, and total information budget fixed, and varies only the correlation between per-position information quality and the known complexity field.

This is a confirmatory experiment by design. It does not test whether real adaptive methods achieve high alignment, or whether their routing proxies track loss-relevant complexity. It tests the narrower claim that the alignment-minus-Jensen structure of [Theorem 3.7](https://arxiv.org/html/2606.04074#S3.Thmtheorem7) governs the rate-allocation channel when patch-boundary effects are absent.

### G.2 Synthetic setup used for the diagnostic

The synthetic task generates sequences of length $96$ with a forecast horizon of $24$. Each context contains two regions: a low-frequency noisy background spanning positions $[0,72)$ and an informative window at positions $[72,96)$ where a stochastic motif and smoothed high-frequency texture are inserted. The forecast target depends more heavily on preserving fine-scale content in the informative window, so error rises faster when that region is corrupted. The generator also produces a ground-truth complexity field:

$$K_{t}=\begin{cases}0.12 & t\in[0,72),\\ 1.00 & t\in[72,96),\end{cases}$$

with a four-step linear transition at the boundary. We use this field as the theoretical complexity process $K_{t}$ throughout the diagnostic. Because $K_{t}$ is set by the generator rather than estimated from the model, the diagnostic avoids the circularity discussed in [Section 4](https://arxiv.org/html/2606.04074#S4): the alignment target is known before training begins.

### G.3 Model and training

The backbone is a simple encoder-decoder Transformer used here as a mechanism check. The model has $d_{\mathrm{model}}=32$, $d_{\mathrm{latent}}=64$, four attention heads, and $1/1/1$ encoder, decoder, and latent layers. It is trained for $32$ epochs with batch size $128$, learning rate $10^{-3}$, and weight decay $10^{-4}$. The model processes the full $96$-step context at every position; no patch boundaries are introduced and no tokens are merged or dropped. The token grid is identical to the uniform baseline in every run. Validation selection never chose the final epoch (best epoch in $[10,28]$), so the result is not driven by stopping exactly at the training budget.

### G.4 Intervention: continuous noise schedule

Instead of varying patch lengths, we corrupt each context position with additive Gaussian noise whose variance depends on the local rate:

$$\tilde{x}_{t}=x_{t}+\eta_{t},\qquad \eta_{t}\sim\mathcal{N}\bigl(0,\;\sigma^{2}(r_{t})\bigr),\qquad \sigma^{2}(r_{t})=c\,(\bar{r}/r_{t})^{1.4}. \tag{39}$$

Higher $r_{t}$ means a cleaner observation; lower $r_{t}$ means more noise. The exponent $1.4$ gives a monotone convex distortion-rate surrogate for this diagnostic, not a calibrated physical noise model. The constant $c$ is chosen so that the uniform baseline reproduces the test MSE of the matched continuous-rate diagnostic.
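The corruption step of Equation (39) can be sketched in a few lines. The value `c=1.0` is a placeholder: the source calibrates $c$ against the uniform baseline and does not report its value. The function names `noise_variance` and `corrupt` are illustrative rather than the authors' code.

```python
import random


def noise_variance(r, r_bar, c=1.0, power=1.4):
    """sigma^2(r) = c * (r_bar / r) ** power, following Eq. (39).

    `c` is a placeholder; the source calibrates it but gives no value.
    """
    if r <= 0:
        raise ValueError("rates must be positive")
    return c * (r_bar / r) ** power


def corrupt(x, rates, r_bar, c=1.0, power=1.4, rng=None):
    """Add zero-mean Gaussian noise to each position, with rate-dependent variance."""
    if rng is None:
        rng = random.Random(0)
    noisy = []
    for x_t, r_t in zip(x, rates):
        std = noise_variance(r_t, r_bar, c, power) ** 0.5
        noisy.append(x_t + rng.gauss(0.0, std))
    return noisy


r_bar = 1 / 16
x = [0.0] * 96
uniform = corrupt(x, [r_bar] * 96, r_bar)  # every position gets variance c
```

At $r_{t}=\bar{r}$ the variance is exactly $c$, so the uniform arm is the special case of a constant rate profile.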

The local rates $\{r_{t}\}$ are generated by a $\rho$-controlled allocator that fixes the total budget ($\mathbb{E}[r]=\bar{r}=1/16$) and the rate variance ($\hat{\sigma}_{r}=9.375\times 10^{-3}$) while sweeping the achieved correlation $\hat{\rho}(K,r)$ across a nine-point grid from $-1$ to $+1$. Because the rates are continuous, the achieved correlation matches the target to machine precision. Across the grid, only the noise profile changes; the model architecture, training protocol, token count, and compute budget stay fixed.
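The source does not spell out how the allocator is built. One way to hit a target correlation exactly with continuous rates is to mix the standardized complexity field with a random direction orthogonal to it. The sketch below is that construction, not necessarily the authors' method. It assumes that $\hat{\sigma}_{r}$ is a population standard deviation and, for brevity, uses a two-level $K$ without the transition ramp.

```python
import math
import random


def _center(v):
    mean = sum(v) / len(v)
    return [a - mean for a in v]


def _unit(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def pearson(a, b):
    """Pearson correlation as the inner product of centered unit vectors."""
    ua, ub = _unit(_center(a)), _unit(_center(b))
    return sum(p * q for p, q in zip(ua, ub))


def allocate_rates(K, rho, r_bar=1 / 16, sigma_r=9.375e-3, seed=0):
    """Rates with mean r_bar, population std sigma_r, and corr(K, r) = rho."""
    rng = random.Random(seed)
    n = len(K)
    u = _unit(_center(K))  # standardized complexity direction
    g = _center([rng.gauss(0.0, 1.0) for _ in range(n)])
    proj = sum(a * b for a, b in zip(g, u))
    # Remove the component along u so that w is orthogonal to K.
    w = _unit([a - proj * b for a, b in zip(g, u)])
    scale = sigma_r * math.sqrt(n)  # a unit-norm vector has std 1/sqrt(n)
    mix = math.sqrt(1.0 - rho * rho)
    return [r_bar + scale * (rho * a + mix * b) for a, b in zip(u, w)]


K = [0.12] * 72 + [1.00] * 24  # simplified field, ramp omitted
rates = allocate_rates(K, rho=0.5)
```

Since `u` and `w` are centered, orthogonal, and unit-norm, the mean and spread of the rates stay fixed while `rho` alone sets the correlation. This matches the property the text relies on: only the noise profile changes across the grid.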

Each target $\rho^{\star}$ is run with twenty independent seeds over a $512/256/512$ train/validation/test split. The same seed-matched clean signal and noise tensors are used for the dynamic and uniform arms, so every row in [Table 9](https://arxiv.org/html/2606.04074#A7.T9) is a paired comparison.

### G.5 Results

[Table 9](https://arxiv.org/html/2606.04074#A7.T9) and [Figure 4](https://arxiv.org/html/2606.04074#A7.F4) report the results.

Table 9: Continuous-rate diagnostic on a simple Transformer backbone. Each row averages twenty seeds; the observed gain is the mean $\pm$ across-seed standard deviation of the paired test-MSE improvement over the seed-matched uniform baseline, normalized by the uniform test MSE. The alignment, Jensen, and $\Delta_{D}(r)$ columns are diagnostic quantities normalized by the theoretical uniform loss $\bar{K}D(\bar{r})$. The achieved $\hat{\rho}(K,r)$ matches the target because $r_{t}$ is continuous.

![Refer to caption](https://arxiv.org/html/2606.04074v1/x5.png)

Figure 4: Continuous-rate diagnostic on a simple Transformer backbone. Left: seed-averaged observed test-MSE gain over the seed-matched uniform baseline as a function of target $\rho^{\star}$; twenty seeds per point, error bars show across-seed standard deviation. Right: alignment-minus-Jensen decomposition of [Equation 4](https://arxiv.org/html/2606.04074#S3.E4) evaluated on the same continuous schedules. The Jensen term is approximately flat because the budget and rate variance are held fixed, so $\Delta_{D}(r)$ tracks the alignment term across the grid.

#### Monotone ordering\.

The mean observed gain increases monotonically from $-24.00\%$ at $\rho^{\star}=-1$ to $+16.39\%$ at $\rho^{\star}=+1$. The Spearman rank correlation between $\rho^{\star}$ and the mean gain is $1.0$ across the nine-point grid (exact two-sided permutation $p=5.5\times 10^{-6}$). Every negative-$\rho$ row has negative mean gain; every row from $\rho^{\star}=+0.2$ onward has positive mean gain. The $\rho^{\star}=0$ row is the only exception to the fitted $\Delta_{D}(r)$ sign: the analytic diagnostic is negative ($-2.65\%$), but the trained Transformer obtains a small positive gain ($+0.72\%$), with 95% confidence interval $[+0.28,+1.15]$. The same monotone ordering and nonzero-$\rho$ sign split are preserved when the distortion exponent is changed to $1.0$ or $2.0$. The $\rho^{\star}=0$ row is also positive under all three exponents (power $1.0$: $+0.46\%$; power $1.4$: $+0.72\%$; power $2.0$: $+1.29\%$), supporting a noise-robustness interpretation.
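The reported permutation $p$-value can be reproduced by counting. Under the null, all $9!$ orderings of nine untied gains are equally likely, and only the fully increasing and fully decreasing orderings give $|\rho_{s}|=1$. The sketch below assumes this is how the exact two-sided value was obtained; the function name is illustrative.

```python
import math


def exact_p_perfect_spearman(n):
    """Exact two-sided permutation p-value for |Spearman rho| = 1 on n untied points.

    Only two of the n! equally likely rank orderings are perfectly monotone.
    """
    return 2 / math.factorial(n)


p = exact_p_perfect_spearman(9)
print(f"{p:.2e}")  # 5.51e-06
```

This is the smallest two-sided $p$-value a nine-point rank test can produce. It therefore confirms a perfect ordering and carries no information about effect size.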

#### Flat Jensen term\.

The Jensen penalty stays in the $3.3$–$4.7\%$ band because the budget and rate variance are held fixed. The variation in $\Delta_{D}(r)$ across the sweep therefore comes almost entirely from the alignment term. In other words, the structure of [Equation 4](https://arxiv.org/html/2606.04074#S3.E4) still shows up in a trained forecasting backbone, not just in the formula.

#### Asymmetry\.

The negative side of the sweep produces larger effects in absolute value than the positive side ($-24.00\%$ at $\rho^{\star}=-1$ versus $+16.39\%$ at $\rho^{\star}=+1$). The power-law noise map explains this asymmetry: under-allocation raises noise variance superlinearly, while over-allocation reduces it sublinearly. We therefore treat $\Delta_{D}(r)$ mainly as an indicator of ordering and sign, not as a calibrated predictor of each row's test MSE.
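A short numerical check illustrates the asymmetry of the power-law map. It compares the relative noise variance when a position's rate is moved down or up by the same fraction. The fraction $0.15$ is an illustrative choice equal to $\hat{\sigma}_{r}/\bar{r}$ from the allocator settings. It is not a quantity the paper reports for this purpose.

```python
def relative_variance(delta, power=1.4):
    """sigma^2(r) / sigma^2(r_bar) for r = r_bar * (1 + delta), from Eq. (39)."""
    return (1.0 + delta) ** (-power)


delta = 0.15  # illustrative: one rate standard deviation relative to the mean rate
increase = relative_variance(-delta) - 1.0  # variance added by under-allocation
decrease = 1.0 - relative_variance(+delta)  # variance removed by over-allocation
print(increase > decrease)  # True
```

With these settings, under-allocation adds roughly 26% to the variance while over-allocation removes roughly 18%. This is the direction of asymmetry the text describes.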

### G.6 Interpretation

Within this controlled noise design, the trained Transformer backbone tracks the alignment-minus-Jensen decomposition of [Theorem 3.7](https://arxiv.org/html/2606.04074#S3.Thmtheorem7) monotonically across the full $\rho^{\star}$ grid. For negative-$\rho$ conditions, the observed magnitude is smaller than $|\Delta_{D}(r)|$; for positive-$\rho$ conditions, the observed gain meets or exceeds the analytic diagnostic. We therefore use $\Delta_{D}(r)$ mainly to interpret ordering and the nonzero-$\rho$ sign split, not to predict every trained test-MSE value exactly.

The model does not observe $K_{t}$ directly, so this diagnostic cannot separate explicit complexity-aware allocation from learned robustness to position-specific signal-to-noise ratio. The positive $\rho^{\star}=0$ result fits that broader noise-robustness story: even when the continuous rate profile is orthogonal to $K_{t}$, the Transformer can still benefit from training on a structured nonuniform noise profile. This supports a narrower claim than the real-data ablations in the main paper. Once patch-boundary, token-count, and length-quantisation effects are removed, the observed gain is ordered by alignment. At the same time, the neutral-row sign mismatch shows that the trained Transformer is doing more than the fitted scalar rate-distortion decomposition captures.