Time Series as Language: A Universal Tokenizer for General-Purpose Time Series Foundation Models

arXiv cs.LG Papers

Summary

Introduces UniTok, a universal tokenizer that transforms continuous time series into discrete tokens, and UniTok-FM, a foundation model pretrained via next-token prediction that enables zero-shot and prompt-boosted forecasting as well as few-shot generation and classification through training-free in-context inference.

arXiv:2606.09861v1 Announce Type: new Abstract: While Next-Token Prediction (NTP) has unified LLM pretraining, its adaptation to unbounded, continuous time series (TS) remains open. To bridge the gap, we introduce UniTok, a universal tokenizer that transforms TS into discrete tokens, and UniTok-FM, a foundation model pretrained via NTP on these tokens. UniTok-FM is a general-purpose foundation model that supports zero-shot and prompt-boosted forecasting, as well as few-shot generation and classification via training-free in-context inference--a capability not achieved by prior works. Technically, UniTok is a vector-quantized autoencoder incorporating prefix normalization for scale stabilization, a progressive-resolution causal architecture for encoding and decoding, and a structure-preserving reconstruction loss for training. UniTok-FM adopts an off-the-shelf LLM architecture without TS-specific modifications. Instead of pretraining on isolated TS, it performs NTP on context windows formed by multiple series with similar patterns, aiming to capture their shared dynamics. Experiments on forecasting, generation, and classification show that a single unified UniTok-FM consistently outperforms statistical and supervised baselines, achieves competitive performance with task-specific foundation models, and uniquely enables training-free in-context inference across tasks.

Cached at: 06/10/26, 06:13 AM

# Time Series as Language: A Universal Tokenizer for General-Purpose Time Series Foundation Models
Source: [https://arxiv.org/html/2606.09861](https://arxiv.org/html/2606.09861)
Yunhao Zhang \(Shanghai Jiao Tong University, zhangyunhao@sjtu\.edu\.cn\); Ruiying Qi \(Shanghai Jiao Tong University, qry\-sylvia@sjtu\.edu\.cn\); Jiale Zheng \(Huawei Noah’s Ark Lab, zhengjiale2@huawei\.com\); Jianfeng Zhang \(Huawei Noah’s Ark Lab, zhangjianfeng3@huawei\.com\); Lujia Pan \(Huawei Noah’s Ark Lab, panlujia@huawei\.com\); Junchi Yan \(Shanghai Jiao Tong University, yanjunchi@sjtu\.edu\.cn\)

###### Abstract

While Next\-Token Prediction \(NTP\) has unified LLM pretraining, its adaptation to unbounded, continuous time series \(TS\) remains open\. To bridge the gap, we introduce UniTok, a universal tokenizer that transforms TS into discrete tokens, and UniTok\-FM, a foundation model pretrained via NTP on these tokens\. UniTok\-FM is a general\-purpose foundation model that supports zero\-shot and prompt\-boosted forecasting, as well as few\-shot generation and classification via training\-free in\-context inference, a capability not achieved by prior works\. Technically, UniTok is a vector\-quantized autoencoder incorporating prefix normalization for scale stabilization, a progressive\-resolution causal architecture for encoding and decoding, and a structure\-preserving reconstruction loss for training\. UniTok\-FM adopts an off\-the\-shelf LLM architecture without TS\-specific modifications\. Instead of pretraining on isolated TS, it performs NTP on context windows formed by multiple series with similar patterns, aiming to capture their shared dynamics\. Experiments on forecasting, generation, and classification show that a single unified UniTok\-FM consistently outperforms statistical and supervised baselines, achieves competitive performance with task\-specific foundation models, and uniquely enables training\-free in\-context inference across tasks\.

## 1 Introduction

In recent years, next\-token prediction \(NTP\) has unified the pretraining of large language models \(LLMs\)\. Although time series \(TS\) are also sequential data, the pretraining paradigm of time series foundation models \(TSFMs\) remains highly fragmented\. Existing TSFMs use different pretraining tasks, ranging from next\-patch prediction Liu et al\. \([2024b](https://arxiv.org/html/2606.09861#bib.bib30)\); Cohen et al\. \([2025](https://arxiv.org/html/2606.09861#bib.bib31)\) to mask\-and\-reconstruction Wang et al\. \([2025](https://arxiv.org/html/2606.09861#bib.bib54)\); Chen et al\. \([2025](https://arxiv.org/html/2606.09861#bib.bib18)\) and fixed\-horizon prediction Shi et al\. \([2025a](https://arxiv.org/html/2606.09861#bib.bib29)\); Liu et al\. \([2025](https://arxiv.org/html/2606.09861#bib.bib15)\)\. The backbone architectures are also diverse, ranging from xLSTMs Auer et al\. \([2025](https://arxiv.org/html/2606.09861#bib.bib14)\) to Transformers equipped with TS\-specific positional embeddings Woo et al\. \([2024](https://arxiv.org/html/2606.09861#bib.bib17)\), attention mechanisms Sun et al\. \([2025](https://arxiv.org/html/2606.09861#bib.bib55)\), etc\. Moreover, most TSFMs are narrowly tailored to forecasting, falling short of the multi\-task generality that LLMs achieve\.

A key challenge in extending NTP to TS is the unbounded, continuous nature of TS\. Modeling complex distributions in continuous space is difficult: conventional regression\-based objectives typically rely on rigid parametric assumptions Zhang et al\. \([2026](https://arxiv.org/html/2606.09861#bib.bib64)\), whereas generative alternatives like diffusion introduce significant architectural complexity Li et al\. \([2024](https://arxiv.org/html/2606.09861#bib.bib65)\)\. In contrast, discretization enables flexible distribution modeling via a simple cross\-entropy objective\. Moreover, a discrete representation facilitates modeling multiple series within a unified context window, which enables in\-context learning and generalizes the learned model beyond forecasting to broader tasks such as generation and classification\.

While discretization offers a clear path, developing a tokenizer for TS is more complex than for images because of greater variability in series length and numerical scale\. Consequently, most prior TS tokenizers are task\-specific Shi et al\. \([2025b](https://arxiv.org/html/2606.09861#bib.bib35)\), applicable only to narrow datasets Lee et al\. \([2023](https://arxiv.org/html/2606.09861#bib.bib33)\); Talukder et al\. \([2024](https://arxiv.org/html/2606.09861#bib.bib36)\), or impose strict constraints on series length Tao et al\. \([2025](https://arxiv.org/html/2606.09861#bib.bib34)\), and thus remain inadequate for general\-purpose TSFMs across domains and tasks\. The work most relevant to ours is Chronos Ansari et al\. \([2024a](https://arxiv.org/html/2606.09861#bib.bib7)\), which discretizes scaled TS via point\-wise uniform binning and pretrains a TSFM using NTP\. \(Chronos should be distinguished from Chronos\-Bolt Ansari et al\. \([2024b](https://arxiv.org/html/2606.09861#bib.bib16)\) and Chronos\-2 Ansari et al\. \([2025](https://arxiv.org/html/2606.09861#bib.bib13)\), which do not use discretization and NTP\.\) However, this simple binning strategy fails to capture rich temporal dependencies, and the model trained on isolated series is restricted to forecasting\. As such, NTP’s generalization potential has not been fully exploited\.

To fill the gap, we propose a universal TS tokenizer, UniTok, and a general\-purpose foundation model, UniTok\-FM, pretrained via NTP using UniTok\. Both are trained on large\-scale datasets to enable cross\-domain generalization\. Beyond the zero\-shot forecasting supported by prior TSFMs, UniTok\-FM unlocks three additional capabilities: 1\) prompt\-boosted forecasting, where TS with similar dynamics serve as prompts to guide prediction; 2\) few\-shot generation, producing high\-fidelity samples from only a handful of example TS; 3\) few\-shot classification, classifying using limited labeled examples\. All capabilities are realized via training\-free in\-context inference, without fine\-tuning task\-specific heads\. To the best of our knowledge, no prior TSFM supports generation or classification in this manner\.

Technically, UniTok builds on the VQ\-VAE framework Van Den Oord et al\. \([2017](https://arxiv.org/html/2606.09861#bib.bib1)\), a commonly used approach for image tokenizers\. To adapt it to TS with variable lengths and unbounded values, we introduce four key modifications: 1\) an incremental tokenization property that aligns tokenization with the NTP paradigm; 2\) prefix normalization that stabilizes scale while preserving the incremental property; 3\) a progressive\-resolution causal autoencoder that assigns token resolution based on receptive fields; 4\) a structure\-preserving reconstruction loss that faithfully captures temporal structures\.

Built on UniTok, UniTok\-FM is pretrained via NTP using an off\-the\-shelf LLM architecture, without TS\-specific modification\. Instead of pretraining on isolated series, UniTok\-FM performs NTP on context windows comprising multiple series with similar patterns\. During pretraining, similarity is enforced by extracting segments from the same long series; for inference, it generalizes beyond this construction to align with each task’s requirements\. UniTok\-FM supports general\-purpose, training\-free inference: zero\-shot/prompt\-boosted forecasting and few\-shot generation are performed through autoregressive \(AR\) token generation under different prompt contexts, while few\-shot classification is achieved by evaluating the conditional likelihood of the query series\. The highlights are:

1\. We propose UniTok, a universal TS tokenizer that operates across domains and tasks, transforming continuous TS into discrete tokens suitable for NTP pretraining\.

2\. We pretrain UniTok\-FM, a general\-purpose TSFM via in\-context NTP, supporting training\-free zero\-shot and prompt\-boosted forecasting, as well as few\-shot generation and classification, which prior TSFMs do not support\.

3\. Although UniTok\-FM does not surpass task\-specific SOTA models in every setting, a single unified model, using training\-free in\-context inference with only a handful of prompt examples, matches: 1\) the forecasting performance of the forecasting\-specific TSFM Chronos\-Bolt Ansari et al\. \([2024b](https://arxiv.org/html/2606.09861#bib.bib16)\); 2\) the generation quality of Diffusion\-TS Yuan and Qiao \([2024](https://arxiv.org/html/2606.09861#bib.bib37)\), even though the latter is trained on one thousand samples; 3\) the few\-shot classification accuracy of MOMENT Goswami et al\. \([2024](https://arxiv.org/html/2606.09861#bib.bib26)\), even though the latter relies on a downstream classifier\.

## 2 Related Works

Time Series Foundation Models\. TSFMs are pretrained on large\-scale datasets and generalize to new scenarios in a zero\-shot setting\. Unlike the unified NTP paradigm in LLMs, TSFM pretraining remains fragmented\. Recent works explore pretraining tasks such as next\-patch prediction Liu et al\. \([2024b](https://arxiv.org/html/2606.09861#bib.bib30)\), mask\-and\-reconstruction Wang et al\. \([2025](https://arxiv.org/html/2606.09861#bib.bib54)\) and fixed\-horizon forecasting\. The latter spans point regression Shi et al\. \([2025a](https://arxiv.org/html/2606.09861#bib.bib29)\), mixture distribution modeling Cohen et al\. \([2025](https://arxiv.org/html/2606.09861#bib.bib31)\), quantile regression Ansari et al\. \([2025](https://arxiv.org/html/2606.09861#bib.bib13)\) and flow matching Liu et al\. \([2025](https://arxiv.org/html/2606.09861#bib.bib15)\)\. Backbone designs are also diverse: while most adopt Transformer variants, alternatives such as xLSTM are also competitive Auer et al\. \([2025](https://arxiv.org/html/2606.09861#bib.bib14)\)\. Many TSFMs further introduce task\-specific modifications, including customized positional embeddings Woo et al\. \([2024](https://arxiv.org/html/2606.09861#bib.bib17)\), attention mechanisms Sun et al\. \([2025](https://arxiv.org/html/2606.09861#bib.bib55)\), or horizon\-specific prediction heads Liu et al\. \([2025](https://arxiv.org/html/2606.09861#bib.bib15)\)\. Moreover, most TSFMs support only zero\-shot forecasting, whereas general\-purpose TSFMs serve as feature extractors and require training task\-specific models on these features Gao et al\. \([2024](https://arxiv.org/html/2606.09861#bib.bib27)\); Goswami et al\. \([2024](https://arxiv.org/html/2606.09861#bib.bib26)\)\.

The work most relevant to ours is Chronos Ansari et al\. \([2024a](https://arxiv.org/html/2606.09861#bib.bib7)\)\. It normalizes series with mean scaling and discretizes each point using uniform binning with fixed bin edges to enable NTP with a T5 Raffel et al\. \([2020](https://arxiv.org/html/2606.09861#bib.bib56)\) backbone\. However, its point\-wise binning limits the modeling of rich temporal structure, and pretraining on isolated series restricts Chronos to only zero\-shot forecasting\.

Image Tokenizers\. AR image generation relies on discrete tokenizers\. Efforts have evolved from pixel\-level discretization Van den Oord et al\. \([2016](https://arxiv.org/html/2606.09861#bib.bib57)\) to vector quantized variational autoencoders \(VQ\-VAE\) Van Den Oord et al\. \([2017](https://arxiv.org/html/2606.09861#bib.bib1)\) and its extensions Razavi et al\. \([2019](https://arxiv.org/html/2606.09861#bib.bib42)\); Esser et al\. \([2021](https://arxiv.org/html/2606.09861#bib.bib2)\); Lee et al\. \([2022](https://arxiv.org/html/2606.09861#bib.bib46)\)\. To address the instability of VQ, lookup\-free alternatives have been proposed Yu et al\. \([2023](https://arxiv.org/html/2606.09861#bib.bib45)\)\. In particular, Finite Scalar Quantization \(FSQ\) Mentzer et al\. \([2024](https://arxiv.org/html/2606.09861#bib.bib3)\) stabilizes training via simple low\-dimensional projection and scalar quantization; we therefore adopt FSQ in UniTok\. Others also explore multi\-scale generation Tian et al\. \([2024](https://arxiv.org/html/2606.09861#bib.bib43)\) and efficient compression Yu et al\. \([2024](https://arxiv.org/html/2606.09861#bib.bib44)\)\. Readers can refer to Jia et al\. \([2025](https://arxiv.org/html/2606.09861#bib.bib58)\) for an overview\. Applying these techniques to TS is non\-trivial due to the variable lengths and unbounded value ranges, unlike fixed\-size images \(e\.g\., $512\times 512$\) with bounded pixel values \(e\.g\., $0\sim 255$\)\.

![Refer to caption](https://arxiv.org/html/2606.09861v1/x1.png) \(a\) ![Refer to caption](https://arxiv.org/html/2606.09861v1/x2.png) \(b\)
Figure 1: \(a\) Incremental vs\. non\-incremental tokenization\. Incremental tokenization makes prefix tokens independent of future observations, so appending data extends the token sequence, aligning with the NTP paradigm\. Otherwise, incompatible tokens for a prefix and its extension limit generalization from long to short series\. \(b\) Overview of UniTok\. The raw TS is decomposed into scale statistics and a normalized series via prefix normalization \(Sec\.[3\.2](https://arxiv.org/html/2606.09861#S3.SS2)\)\. Scale statistics are discretized as hexadecimal digits of their Float32 representation, while the normalized series is encoded by a progressive\-resolution causal autoencoder \(Sec\.[3\.3](https://arxiv.org/html/2606.09861#S3.SS3)\), trained with a structure\-preserving reconstruction loss \(Sec\.[3\.4](https://arxiv.org/html/2606.09861#S3.SS4)\)\.

Time Series Tokenizers\. Recent efforts adapt VQ for TS generation Lee et al\. \([2023](https://arxiv.org/html/2606.09861#bib.bib33)\), forecasting Feng et al\. \([2025](https://arxiv.org/html/2606.09861#bib.bib60)\), and classification Wen et al\. \([2024](https://arxiv.org/html/2606.09861#bib.bib32)\)\. Some align TS with texts via reprogramming Jin et al\. \([2024](https://arxiv.org/html/2606.09861#bib.bib47)\) or VQ\-VAE Tao et al\. \([2025](https://arxiv.org/html/2606.09861#bib.bib34)\)\. Domain\-specific variants, such as K\-line tokenizers for finance Shi et al\. \([2025b](https://arxiv.org/html/2606.09861#bib.bib35)\), also exist\. Nevertheless, most existing TS tokenizers are task\- or dataset\-specific and lack reusability\. Although Talukder et al\. \([2024](https://arxiv.org/html/2606.09861#bib.bib36)\) explore extending VQ\-VAE beyond a single dataset, their approach remains limited to a small number of datasets, falling short of a universal tokenizer for general\-purpose TSFMs\.

## 3 UniTok: A Universal Time Series Tokenizer

As illustrated in Fig\.[1](https://arxiv.org/html/2606.09861#S2.F1)\(b\), UniTok converts a TS into a sequence of discrete tokens while satisfying the incremental tokenization property required for NTP \(Sec\.[3\.1](https://arxiv.org/html/2606.09861#S3.SS1)\)\. Specifically, the input series is decomposed into scale statistics and a normalized series via prefix normalization \(Sec\.[3\.2](https://arxiv.org/html/2606.09861#S3.SS2)\)\. The scale statistics are tokenized using a hexadecimal representation, while the normalized series is encoded by the progressive\-resolution causal autoencoder \(Sec\.[3\.3](https://arxiv.org/html/2606.09861#S3.SS3)\), trained with the structure\-preserving reconstruction loss \(Sec\.[3\.4](https://arxiv.org/html/2606.09861#S3.SS4)\)\. The tokenized sequence takes the form:

$$\langle\text{SOS}\rangle\,\mathbf{z}^{(scale-\beta)}\,\langle\text{SEP}\rangle\,\mathbf{z}^{(scale-\gamma)}\,\langle\text{SEP}\rangle\,\mathbf{z}^{(shape)}\,\langle\text{EOS}\rangle \tag{1}$$

where $\langle\text{SOS}\rangle$ and $\langle\text{EOS}\rangle$ denote the start and end of a series, and $\langle\text{SEP}\rangle$ is a separator\. $\mathbf{z}^{(scale-\beta)}$ and $\mathbf{z}^{(scale-\gamma)}$ are the tokenized scale statistics\. $\mathbf{z}^{(shape)}$ corresponds to the normalized series, capturing the intrinsic series shape\.

### 3\.1 Incremental Tokenization Property

Given a TS $\mathbf{x}=[x_{1},\dots,x_{T}]$, $x_{t}\in\mathbb{R}$ \(we focus on univariate time series and adopt the widely used channel\-independent technique for multivariate data Nie et al\. \([2023](https://arxiv.org/html/2606.09861#bib.bib11)\)\), our goal is to transform it into discrete tokens $\mathbf{z}=[z_{1},\dots,z_{L}]$, $z_{i}\in\mathcal{C}$, $\mathcal{C}=\{1,\dots,C\}$, such that $\mathbf{z}$ is suitable for NTP\. Unlike fixed\-size images, UniTok handles TS of variable length\. We require it to satisfy the incremental tokenization property:

$$\text{Enc}(\mathbf{x}_{\leq t})=\text{Enc}(\mathbf{x})_{\leq\phi(t)}\qquad\text{Dec}(\mathbf{z}_{\leq l})=\text{Dec}(\mathbf{z})_{\leq\phi^{-1}(l)} \tag{2}$$

where $\text{Enc}(\cdot)$ and $\text{Dec}(\cdot)$ denote encoding and decoding\. $\phi(\cdot)$ maps a TS length to the corresponding token sequence length, and $\phi^{-1}(\cdot)$ denotes the inverse mapping\.
As illustrated in Fig\.[1](https://arxiv.org/html/2606.09861#S2.F1)\(a\), this property states that encoding or decoding any prefix is independent of the remaining part\. This ensures that a TSFM trained on series of length $T$ seamlessly generalizes to any shorter series\.
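The property in Eq. 2 can be checked mechanically for any candidate encoder. The sketch below is illustrative only: `fixed_bin_encode` and `global_scaled_encode` are hypothetical toy tokenizers (not UniTok), both with $\phi(t)=t$. The first quantizes each point with fixed bin edges and is incremental; the second rescales by a statistic of the whole series before binning and is not.

```python
from typing import Callable, List, Sequence


def fixed_bin_encode(x: Sequence[float]) -> List[int]:
    """Toy point-wise tokenizer with fixed bin edges (hypothetical, not UniTok)."""
    # Map [-1, 1) onto 8 bins and clip; each token depends on a single point.
    return [min(7, max(0, int((v + 1.0) * 4))) for v in x]


def global_scaled_encode(x: Sequence[float]) -> List[int]:
    """Toy tokenizer that rescales by the max-abs of the WHOLE input first."""
    scale = max(abs(v) for v in x) or 1.0
    return fixed_bin_encode([v / scale for v in x])


def is_incremental(
    enc: Callable[[Sequence[float]], List[int]],
    x: Sequence[float],
    phi: Callable[[int], int],
) -> bool:
    """Check Enc(x[:t]) == Enc(x)[:phi(t)] for every prefix length t."""
    full = enc(x)
    return all(enc(x[:t]) == full[: phi(t)] for t in range(1, len(x) + 1))


series = [0.1, -0.3, 0.5, 0.2, 4.0]
print(is_incremental(fixed_bin_encode, series, lambda t: t))      # True
print(is_incremental(global_scaled_encode, series, lambda t: t))  # False
```

The second encoder fails because the late value 4.0 changes the scale applied to every earlier point, which is the same failure mode the text attributes to conventional normalization.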

### 3\.2 Prefix Normalization

Unlike images with bounded pixel values \(i\.e\., $0\sim 255$\), TS exhibit widely varying scales across domains, making normalization essential\. Given $\mathbf{x}$, conventional normalization applies $\widetilde{\mathbf{x}}=(\mathbf{x}-\beta)/\gamma$, where $\beta,\gamma$ are scale statistics \(e\.g\., mean–std or min–max\) extracted from $\mathbf{x}$\. The normalized $\widetilde{\mathbf{x}}$ has a more stable scale and is used as the network input\. However, conventional normalization violates the incremental property: statistics extracted from a prefix differ from those of the whole series, such that $\text{Norm}(\mathbf{x}_{\leq t})\neq\text{Norm}(\mathbf{x})_{\leq t}$\. We therefore propose prefix normalization, which computes statistics from a fixed\-length prefix\. Assuming all the TS we process are longer than $P$, we perform:

$$\widetilde{\mathbf{x}}=\text{Prefix-Norm}(\mathbf{x})=\frac{\mathbf{x}-\beta^{(prefix)}}{\gamma^{(prefix)}}\qquad(\beta^{(prefix)},\gamma^{(prefix)})=f_{scale}(\mathbf{x}_{\leq P}) \tag{3}$$

where $f_{scale}(\mathbf{x}_{\leq P})$ extracts statistics from the length\-$P$ prefix\. In this work, we adopt min–max normalization, although other choices are possible\. Prefix normalization preserves the incremental property, as $\text{Prefix-Norm}(\mathbf{x}_{\leq t})=\text{Prefix-Norm}(\mathbf{x})_{\leq t},\ \forall t\geq P$\. To relax the requirement that all series must be longer than $P$, we introduce two prefix lengths $P_{1}<P_{2}$ and perform:

$$(\beta^{(prefix)},\gamma^{(prefix)})=\begin{cases}f_{scale}(\mathbf{x})&T<P_{1},\\ f_{scale}(\mathbf{x}_{\leq P_{1}})&P_{1}\leq T<P_{2},\\ f_{scale}(\mathbf{x}_{\leq P_{2}})&T\geq P_{2},\end{cases} \tag{4}$$

We set $P_{1}=8$ and $P_{2}=128$, ensuring that the incremental property holds within each range\. For extremely short series \($T<8$\), we fall back to conventional normalization\.
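A minimal sketch of the two-level prefix normalization in Eq. 4 follows. The text only states that min–max statistics are used, so the concrete choice $\beta=\min$, $\gamma=\max-\min$ and the guard for constant prefixes are assumptions of this sketch, and the function names are illustrative.

```python
import math
from typing import List, Sequence, Tuple

P1, P2 = 8, 128  # prefix lengths reported in the text


def f_scale(x: Sequence[float]) -> Tuple[float, float]:
    """Min-max statistics: beta = min, gamma = max - min (assumed form)."""
    lo, hi = min(x), max(x)
    gamma = hi - lo
    # Guard against a constant prefix (an assumption, not from the paper).
    return lo, (gamma if gamma > 0 else 1.0)


def prefix_stats(x: Sequence[float]) -> Tuple[float, float]:
    """Select the prefix used for statistics according to Eq. 4."""
    T = len(x)
    if T < P1:
        return f_scale(x)        # very short series: conventional normalization
    if T < P2:
        return f_scale(x[:P1])
    return f_scale(x[:P2])


def prefix_norm(x: Sequence[float]) -> List[float]:
    beta, gamma = prefix_stats(x)
    return [(v - beta) / gamma for v in x]


series = [math.sin(0.1 * i) + 0.01 * i for i in range(200)]
full = prefix_norm(series)
# Same range (t >= P2): the prefix is normalized exactly as in the full series.
print(prefix_norm(series[:150]) == full[:150])  # True
# Different range (P1 <= t < P2): statistics come from the first 8 points only.
print(prefix_norm(series[:100]) == full[:100])  # False
```

The second check illustrates the caveat stated above: the incremental property holds within each length range, not across the boundary at $P_{2}$.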

Scale Statistics Tokenization\. Prior works typically discard scale statistics and operate only on the normalized series, which is sufficient for forecasting\. However, for tasks such as classification, the absolute scale often carries semantic information and should be preserved\. For a tokenizer that works across various domains, the range of $\beta^{(prefix)},\gamma^{(prefix)}$ can be as broad as that of a Float32 number \($-3.4\times 10^{38}\sim +3.4\times 10^{38}$\)\. This range is far too vast for traditional neural network embeddings\. To address this, we access their 32\-bit computer storage representation and group every 4 bits into one token\. This encodes each of $\beta^{(prefix)},\gamma^{(prefix)}$ into 8 hexadecimal tokens in the range $0\sim F$\. For example, $3.14159$ is tokenized as $[4,0,4,9,0,F,D,0]$\. This process is lossless and reversible\. Since only two statistics are stored per series, the overhead is affordable\.
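The hexadecimal scheme can be reproduced with Python's standard `struct` module. This is a sketch assuming the IEEE 754 single-precision bit pattern read from the most significant nibble first, which is consistent with the $3.14159$ example above; the function names are illustrative.

```python
import struct
from typing import List


def float32_to_hex_tokens(value: float) -> List[int]:
    """Encode a number as 8 tokens in 0..15 from its float32 bit pattern."""
    # Pack as big-endian float32, then reinterpret the 4 bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    # Read the 32 bits as eight 4-bit groups, most significant first.
    return [(bits >> shift) & 0xF for shift in range(28, -1, -4)]


def hex_tokens_to_float32(tokens: List[int]) -> float:
    """Invert float32_to_hex_tokens."""
    bits = 0
    for tok in tokens:
        bits = (bits << 4) | tok
    (value,) = struct.unpack(">f", struct.pack(">I", bits))
    return value


tokens = float32_to_hex_tokens(3.14159)
print("".join(format(t, "X") for t in tokens))  # 40490FD0
```

Note that Python floats are 64-bit, so the round trip is exact with respect to the float32 value: `hex_tokens_to_float32(tokens)` returns the nearest float32 to 3.14159, and re-encoding that value yields the same eight tokens.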

### 3\.3 Progressive\-Resolution Causal Autoencoder

We adopt an autoencoder with FSQ to tokenize the normalized series $\widetilde{\mathbf{x}}$\. The encoder follows a standard image tokenizer design, consisting of $S$ blocks with convolutional and self\-attention layers, each followed by a downsampling operation that halves the resolution\. This yields intermediate representations $\mathbf{H}^{s}\in\mathbb{R}^{\frac{T}{2^{s}}\times d_{model}}$ at the $s$\-th block\. FSQ Mentzer et al\. \([2024](https://arxiv.org/html/2606.09861#bib.bib3)\) is used to quantize the final block representations $\mathbf{H}^{S}$ into $\frac{T}{2^{S}}$ tokens\. A symmetric decoder with mirrored upsampling layers reconstructs $\widetilde{\mathbf{x}}$\. While effective for images, some modifications are required for TS\.

Causal Structure\. To preserve the incremental property, we replace all non\-causal components with causal counterparts so that each token depends only on the observed prefix\. Specifically, we adopt causal convolutions and causal self\-attention, with Layer Normalization\.

![Refer to caption](https://arxiv.org/html/2606.09861v1/x3.png)
Figure 2: Progressive\-Resolution Causal Autoencoder\. Each block applies causal convolution and attention, allowing each latent vector to attend only to the past\. At block $s$, the first $2^{s}-1$ vectors are preserved, while the remaining ones are downsampled/upsampled, yielding a progressive\-resolution architecture in which earlier tokens with smaller receptive fields receive finer representations\.

Progressive\-Resolution Autoencoder\. Causal structure induces information asymmetry\. Specifically,

$$\mathbf{z}^{(shape)}_{l}=\text{Enc}\left(\widetilde{\mathbf{x}}_{\leq l\cdot 2^{S}}\right)_{l} \tag{5}$$

This implies that earlier tokens \(small $l$\) are computed from the limited context $\widetilde{\mathbf{x}}_{\leq l\cdot 2^{S}}$, creating an information bottleneck\. In the extreme, the first token is derived solely from the first patch of length $2^{S}$; correspondingly, this patch must be reconstructed from a single token, limiting the reconstruction to only $C$ candidate patches\. To mitigate this asymmetry, we devise a progressive\-resolution architecture that allocates higher resolution to earlier positions, as illustrated in Fig\.[2](https://arxiv.org/html/2606.09861#S3.F2)\. Specifically, we replace the uniform downsampling with progressive downsampling:

$$\mathbf{H}^{s}=\left[\widetilde{\mathbf{H}}^{s}_{\leq 2^{s}-1},\ \text{Downsample}\left(\widetilde{\mathbf{H}}^{s}_{\geq 2^{s}}\right)\right] \tag{6}$$

where $[\cdot,\cdot]$ denotes concatenation and $\widetilde{\mathbf{H}}^{s}$ denotes the representation before downsampling at block $s$\. The first $2^{s}-1$ vectors are kept, while downsampling is applied to the remaining suffix\. As a result, resolution decreases progressively along the sequence: early tokens represent finer spans, while later tokens aggregate larger spans, with tail tokens covering $2^{S}$ time points, matching the standard downsampling rate\. A symmetric progressive upsampling is applied to the decoder\.
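To see how Eq. 6 distributes resolution, the sketch below applies the keep-then-halve rule block by block and tracks how many time points each final token covers. It is a simplified reading of the equation: the causal convolution and attention layers are omitted, features are scalars, pairwise pooling stands in for the unspecified `Downsample` operator, and the suffix length is required to be even at every block. `progressive_step` and `token_spans` are illustrative names.

```python
from typing import Callable, List, Sequence, TypeVar

V = TypeVar("V")


def progressive_step(seq: Sequence[V], s: int, combine: Callable[[V, V], V]) -> List[V]:
    """Eq. 6 at block s: keep the first 2**s - 1 entries, halve the suffix.

    `combine` merges two adjacent entries (pairwise pooling is an assumption).
    """
    keep = 2 ** s - 1
    head, tail = list(seq[:keep]), list(seq[keep:])
    if len(tail) % 2:
        raise ValueError("this sketch requires an even-length suffix")
    return head + [combine(tail[i], tail[i + 1]) for i in range(0, len(tail), 2)]


def token_spans(T: int, S: int) -> List[int]:
    """Number of input time points covered by each token after S blocks."""
    spans = [1] * T
    for s in range(1, S + 1):
        spans = progressive_step(spans, s, lambda a, b: a + b)
    return spans


# Average pooling of scalar features at block 1: the first entry is untouched.
print(progressive_step([0.0, 1.0, 2.0, 3.0, 4.0], 1, lambda a, b: (a + b) / 2))
# [0.0, 1.5, 3.5]

print(token_spans(37, 3))  # [1, 2, 2, 4, 4, 4, 4, 8, 8]
```

Under this reading, with $S=3$ the first token covers one time point, the next two cover two each, the next four cover four each, and every later token covers $2^{S}=8$ points, in line with the statement that tail tokens match the standard downsampling rate.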

### 3\.4 Structure\-Preserving Reconstruction Loss

Given the ground\-truth and reconstructed normalized series $\widetilde{\mathbf{x}}$ and $\widetilde{\mathbf{x}}^{(rec)}$, a reconstruction loss is required for autoencoder training\. Image tokenizers typically employ a composite loss with three components Esser et al\. \([2021](https://arxiv.org/html/2606.09861#bib.bib2)\); Tian et al\. \([2024](https://arxiv.org/html/2606.09861#bib.bib43)\): 1\) an $L1$ loss for point\-wise fidelity; 2\) a perceptual loss to preserve semantic patterns; and 3\) an adversarial loss for global distribution alignment\. We construct our loss based on this design\.

Surrogate Perceptual Loss\. Perceptual loss typically aligns ground\-truth and reconstruction in the latent space of a feature extractor pretrained on large\-scale datasets Zhang et al\. \([2018](https://arxiv.org/html/2606.09861#bib.bib62)\)\. Unlike images, TS lack a widely adopted pretrained network\. To address this, we reuse the discriminator from the adversarial loss as the feature extractor to construct a surrogate perceptual loss\. This complements the adversarial loss: while the adversarial loss enforces global distributional consistency, this term encourages each reconstructed series to match its corresponding ground truth in the discriminator’s latent space\.

High\-Frequency Wavelet Loss\. The $L1$ loss tends to produce overly smooth reconstructions Ledig et al\. \([2017](https://arxiv.org/html/2606.09861#bib.bib5)\)\. To explicitly preserve fine\-grained structures, we introduce a loss that aligns high\-frequency coefficients of the Discrete Wavelet Transform \(DWT\):

$$\begin{split}&\left(\mathbf{a}_{J},\{\mathbf{d}_{j}\}_{j=1}^{J}\right)=\mathcal{W}(\widetilde{\mathbf{x}})\qquad\left(\mathbf{a}^{(rec)}_{J},\{\mathbf{d}^{(rec)}_{j}\}_{j=1}^{J}\right)=\mathcal{W}(\widetilde{\mathbf{x}}^{(rec)})\\ &\alpha=\text{Threshold}(\mathbf{d}_{1})\qquad\mathcal{L}_{hf}=\sum_{j=1}^{J'}\left\|(\mathbf{d}_{j}-\mathbf{d}^{(rec)}_{j})\odot\mathbb{1}[\mathbf{d}_{j}\geq\alpha]\right\|_{1}\end{split} \tag{7}$$

where $\mathcal{W}(\cdot)$ denotes the DWT, producing approximation coefficients $\mathbf{a}_{J}$ and detail coefficients $\{\mathbf{d}_{j}\}_{j=1}^{J}$, with smaller $j$ corresponding to higher frequency\. $\text{Threshold}(\cdot)$ computes a threshold to distinguish salient coefficients from noise Donoho and Johnstone \([1994](https://arxiv.org/html/2606.09861#bib.bib4)\)\. Only the $J'=2$ finest scales are included in the loss\.
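The following pure-Python sketch illustrates the shape of Eq. 7 under several assumptions that the section does not fix: a Haar wavelet is used for $\mathcal{W}$, the threshold is the universal threshold $\sigma\sqrt{2\ln n}$ with $\sigma$ estimated from the median absolute value of $\mathbf{d}_{1}$, and the indicator is applied to coefficient magnitudes $|\mathbf{d}_{j}|$ (Eq. 7 as printed compares $\mathbf{d}_{j}$ itself). The paper defers its actual choices to Appendix A, so this should be read as one plausible instantiation.

```python
import math
from statistics import median
from typing import List, Sequence


def haar_details(x: Sequence[float], levels: int) -> List[List[float]]:
    """Detail coefficients d_1..d_levels of a Haar DWT (d_1 is the finest scale)."""
    approx, details = list(x), []
    for _ in range(levels):
        if len(approx) < 2 or len(approx) % 2:
            raise ValueError("length must be divisible by 2**levels in this sketch")
        pairs = [(approx[i], approx[i + 1]) for i in range(0, len(approx), 2)]
        details.append([(a - b) / math.sqrt(2) for a, b in pairs])
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return details


def universal_threshold(d1: Sequence[float], n: int) -> float:
    """sigma * sqrt(2 ln n), with sigma = median(|d1|) / 0.6745 (assumed rule)."""
    sigma = median(abs(c) for c in d1) / 0.6745
    return sigma * math.sqrt(2 * math.log(n))


def high_freq_loss(x: Sequence[float], x_rec: Sequence[float], j_prime: int = 2) -> float:
    """L1 distance between salient detail coefficients at the j_prime finest scales."""
    d, d_rec = haar_details(x, j_prime), haar_details(x_rec, j_prime)
    alpha = universal_threshold(d[0], len(x))
    return sum(
        abs(c - c_rec)
        for dj, dj_rec in zip(d, d_rec)
        for c, c_rec in zip(dj, dj_rec)
        if abs(c) >= alpha  # mask on ground-truth magnitude (assumption)
    )


spike = [0.0] * 16
spike[5] = 1.0
print(high_freq_loss(spike, spike))       # 0.0 for a perfect reconstruction
print(high_freq_loss(spike, [0.0] * 16))  # positive: the spike was smoothed away
```

A reconstruction that flattens the spike leaves the point-wise $L1$ error small but is penalized here through the fine-scale detail coefficients, which is the behavior the loss is designed to encourage.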

The final structure\-preserving loss is computed by:

$$\mathcal{L}_{struct}=\mathcal{L}_{L1}+\lambda_{adv}\mathcal{L}_{adv}+\lambda_{sp}\mathcal{L}_{sp}+\lambda_{hf}\mathcal{L}_{hf} \tag{8}$$

The four terms denote the $L1$, adversarial, surrogate perceptual, and high\-frequency wavelet losses\. Weights are dynamically adjusted using the adaptive weighting strategy proposed in Esser et al\. \([2021](https://arxiv.org/html/2606.09861#bib.bib2)\)\. Details of this strategy, along with other referenced techniques \(i\.e\., FSQ, DWT Threshold\), are in Appendix [A](https://arxiv.org/html/2606.09861#A1)\.

## 4 UniTok-FM: A General-Purpose Time Series Foundation Model

As illustrated in Fig. [3](https://arxiv.org/html/2606.09861#S4.F3), UniTok-FM is built on UniTok and pretrained on context windows formed by multiple similar-pattern series (Sec. [4.1](https://arxiv.org/html/2606.09861#S4.SS1)). Training-free in-context inference is achieved by AR token generation or likelihood evaluation with different prompts (Sec. [4.2](https://arxiv.org/html/2606.09861#S4.SS2)).

![Refer to caption](https://arxiv.org/html/2606.09861v1/x4.png)

Figure 3: Token arrangement for in-context NTP pretraining and training-free in-context inference. In pretraining, multiple series with similar patterns are concatenated into a context window. In zero-shot forecasting, lookback tokens condition AR generation of future tokens. In prompt-boosted forecasting and few-shot generation, similar-pattern series are prepended as contextual prompts. In few-shot classification, the query series is conditioned on class-specific prompts, with labels determined by comparing likelihoods.

### 4.1 In-Context Next-Token Prediction

With UniTok, each TS is transformed into a discrete token sequence for NTP. UniTok-FM adopts modern LLM architectures without modification. Instead of performing NTP on isolated series, we aggregate multiple series with similar temporal patterns into a context window and conduct NTP over it, allowing the model to capture shared dynamics. Formally, given a set of series with similar patterns $\{\mathbf{x}^{i}\}_{i=1}^{N}$, $\mathbf{x}^{i}\in\mathbb{R}^{T_{i}}$, the pretraining is conducted as:

$$
\mathbf{z}^{i}=\text{Enc}(\mathbf{x}^{i})\quad\mathbf{z}^{(ctx)}=[\mathbf{z}^{1},\dots,\mathbf{z}^{N}]\quad\mathcal{L}_{NTP}=\sum_{l=1}^{L_{ctx}-1}\text{CE}\left(z^{(ctx)}_{l+1},p_{\theta}(z^{(ctx)}_{l+1}|\mathbf{z}^{(ctx)}_{\leq l})\right)
\tag{9}
$$

where $L_{ctx}$ denotes the length of the context window, CE is the cross-entropy loss and $p_{\theta}(\cdot)$ is an LLM backbone. Each series $\mathbf{x}^{i}$ is encoded into a token sequence $\mathbf{z}^{i}$. These sequences are then concatenated to form a context window $\mathbf{z}^{(ctx)}$ on which NTP is performed. In pretraining, $\{\mathbf{x}^{i}\}_{i=1}^{N}$ is obtained by extracting non-overlapping segments from the same long series. There can be gaps between segments, and segments are arranged in time order to prevent future information leakage. At inference time, the context window generalizes beyond this construction to meet the requirements of each task (Sec. [4.2](https://arxiv.org/html/2606.09861#S4.SS2)).
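The context-window construction and the loss of Eq. 9 can be sketched as follows. The tokenizer here, `toy_encode`, is a hypothetical stand-in (z-scoring plus uniform binning) and is not UniTok; `build_context` and `ntp_loss` are likewise illustrative names. The sketch only shows how time-ordered, non-overlapping segments are concatenated and how the summed cross-entropy is taken over the concatenation.

```python
import numpy as np

def toy_encode(x, n_bins=16):
    """Hypothetical stand-in for Enc: z-score, then uniform binning into n_bins tokens."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / (x.std() + 1e-8)
    edges = np.linspace(-3.0, 3.0, n_bins - 1)
    return np.digitize(z, edges)  # token ids in [0, n_bins - 1]

def build_context(series, seg_len, n_segments, gap=0):
    """Cut time-ordered, non-overlapping segments (optionally with gaps) from one
    long series and concatenate their token sequences into a context window."""
    tokens, start = [], 0
    for _ in range(n_segments):
        seg = series[start:start + seg_len]
        if len(seg) < seg_len:
            break
        tokens.append(toy_encode(seg))
        start += seg_len + gap
    return np.concatenate(tokens)

def ntp_loss(logits, context):
    """Summed cross-entropy of token l+1 given the prefix up to l.
    logits[l] holds the model scores for the token at position l+1."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    targets = context[1:]
    return -log_probs[np.arange(targets.size), targets].sum()

series = np.sin(np.arange(400) / 10.0)
ctx = build_context(series, seg_len=100, n_segments=3, gap=20)
uniform_logits = np.zeros((ctx.size - 1, 16))  # an untrained, uniform predictor
loss = ntp_loss(uniform_logits, ctx)
```

With uniform logits the loss equals $(L_{ctx}-1)\ln 16$, which is a convenient sanity check for the indexing.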

### 4.2 Training-Free In-Context Inference

Zero-Shot Forecasting. Given a lookback window $\mathbf{x}^{(back)}\in\mathbb{R}^{T_{back}}$, the next $\tau$ points are predicted as:

$$
\begin{split}
&\mathbf{z}^{(back)}=\text{Enc}(\mathbf{x}^{(back)})\quad L_{pred}=\phi(T_{back}+\tau)-\phi(T_{back})\\
&z^{(pred)}_{l+1}\sim p_{\theta}(z^{(pred)}_{l+1}|[\mathbf{z}^{(back)},\mathbf{z}^{(pred)}_{\leq l}])\text{ for }l=0,\dots,L_{pred}-1\\
&\mathbf{x}^{(pred)}=\text{Dec}([\mathbf{z}^{(back)},\mathbf{z}^{(pred)}])_{T_{back}+1:T_{back}+\tau}
\end{split}
\tag{10}
$$

$\mathbf{x}^{(back)}$ is encoded into $\mathbf{z}^{(back)}$, after which $L_{pred}$ tokens are generated autoregressively. The decoded suffix corresponding to the future horizon is taken as the prediction. Because the model is probabilistic, multiple trajectories are sampled to estimate the future distribution, following Ansari et al. ([2024a](https://arxiv.org/html/2606.09861#bib.bib7)).
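The sampling loop of Eq. 10 can be illustrated with a toy model. Everything below is an assumption made for the sake of a runnable sketch: `toy_next_token_probs` stands in for $p_{\theta}$, `toy_decode` stands in for Dec, and one token is assumed to correspond to one time step (so $\phi(T)=T$ and $L_{pred}=\tau$), which is not how UniTok maps lengths to tokens.

```python
import numpy as np

def sample_forecast_tokens(next_token_probs, z_back, l_pred, n_traj, rng):
    """Draw n_traj token trajectories of length l_pred, each conditioned on z_back."""
    trajectories = []
    for _ in range(n_traj):
        seq = list(z_back)
        for _ in range(l_pred):
            probs = next_token_probs(seq)  # distribution over the vocabulary
            seq.append(int(rng.choice(len(probs), p=probs)))
        trajectories.append(seq[len(z_back):])
    return np.array(trajectories)

def toy_next_token_probs(seq, vocab=8):
    """Hypothetical p_theta: puts most mass on the successor of the last token."""
    probs = np.full(vocab, 0.05 / (vocab - 1))
    probs[(seq[-1] + 1) % vocab] = 0.95
    return probs

def toy_decode(tokens, vocab=8):
    """Hypothetical Dec: one token per step, token id mapped linearly to [-1, 1]."""
    return 2.0 * np.asarray(tokens, dtype=float) / (vocab - 1) - 1.0

rng = np.random.default_rng(0)
z_back = [0, 1, 2, 3]
tau = 5  # with one token per step, L_pred equals tau
z_pred = sample_forecast_tokens(toy_next_token_probs, z_back, tau, n_traj=20, rng=rng)
# Decode lookback and prediction together, then keep only the future suffix.
x_pred = np.stack([toy_decode(list(z_back) + list(z))[-tau:] for z in z_pred])
point_forecast = np.median(x_pred, axis=0)
quantiles = np.quantile(x_pred, [0.1, 0.5, 0.9], axis=0)
```

Taking the median across trajectories gives a point forecast, while the empirical quantiles summarize the predictive distribution.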

Prompt-Boosted Forecasting. Series exhibiting similar patterns can serve as contextual prompts to guide predictions. Such prompt series may be extracted from the target series' own historical records (e.g., weather records from previous years) or from other entities (e.g., observations from nearby stations). Given prompts $\{\mathbf{p}^{i}\}_{i=1}^{N}$, we extend Eq. [10](https://arxiv.org/html/2606.09861#S4.E10) as:

$$
\begin{split}
\mathbf{z}^{(pmt)}=[\text{Enc}(\mathbf{p}^{1}),\dots,\text{Enc}(\mathbf{p}^{N})]\quad z^{(pred)}_{l+1}\sim p_{\theta}(z^{(pred)}_{l+1}|[\mathbf{z}^{(pmt)},\mathbf{z}^{(past)},\mathbf{z}^{(pred)}_{\leq l}])
\end{split}
\tag{11}
$$

Each prompt series is encoded and concatenated to form the prompt sequence $\mathbf{z}^{(pmt)}$. It is then prepended to the tokenized lookback window $\mathbf{z}^{(past)}$ (the $\mathbf{z}^{(back)}$ of Eq. 10) for AR generation.
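A minimal sketch of how a prompted context might be assembled under a token budget is given below. The function `assemble_prompted_context` and its arguments are hypothetical; the priority order (the target's own history before other entities) and the 800-token window follow the descriptions in Sec. 5, while reserving room for generated tokens via `reserve` is an assumption.

```python
def assemble_prompted_context(z_back, own_history, other_entities,
                              max_tokens=800, reserve=0):
    """Prepend prompt token sequences to the lookback tokens until the budget is used.

    own_history and other_entities are lists of token sequences, tried in that order.
    reserve keeps room for tokens that will be generated (an assumption).
    """
    budget = max_tokens - len(z_back) - reserve
    prompts = []
    for z in list(own_history) + list(other_entities):
        if len(z) > budget:
            break  # stop at the first prompt that no longer fits
        prompts.append(list(z))
        budget -= len(z)
    # The lookback window always comes last, directly before generation starts.
    context = [tok for z in prompts for tok in z] + list(z_back)
    return context, len(prompts)

z_back = [1] * 100
own = [[2] * 300, [3] * 300]
others = [[4] * 300]
context, n_prompts = assemble_prompted_context(z_back, own, others)
```

Here two 300-token prompts fit alongside the 100-token lookback, and the third is dropped because it would exceed the window.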

Few-Shot Generation. Given prompts $\{\mathbf{p}^{i}\}_{i=1}^{N}$, a length-$\tau$ series with similar dynamics is generated:

$$
\begin{split}
&\mathbf{z}^{(pmt)}=\text{Enc}(\{\mathbf{p}^{i}\}_{i=1}^{N}),\quad z^{(gen)}_{1}=\langle\text{SOS}\rangle\\
&z^{(gen)}_{l+1}\sim p_{\theta}(z^{(gen)}_{l+1}|[\mathbf{z}^{(pmt)},\mathbf{z}^{(gen)}_{\leq l}])\text{ for }l=1,\dots,\phi(\tau)-1\\
&\mathbf{x}^{(gen)}=\text{Dec}(\mathbf{z}^{(gen)})
\end{split}
\tag{12}
$$

$\mathbf{z}^{(pmt)}$ is constructed in the same way as in prompt-boosted forecasting. Generation is initialized with a $\langle\text{SOS}\rangle$ token.

Few-Shot Classification. By tokenizing TS, UniTok-FM can evaluate the conditional likelihood of a series, enabling in-context few-shot classification. Given $M$ classes with $N_{m}$ examples for class $m$, $\{\{\mathbf{p}^{m,i}\}_{i=1}^{N_{m}}\}_{m=1}^{M}$, the class label $y$ of a query series $\mathbf{x}^{(q)}$ is inferred as:

$$
\begin{split}
&\mathbf{z}^{(q)}=\text{Enc}(\mathbf{x}^{(q)})\quad\forall m:\mathbf{z}^{(pmt),m}=\text{Enc}(\{\mathbf{p}^{m,i}\}_{i=1}^{N_{m}})\\
&p(\mathbf{x}^{(q)}|y=m)\approx p_{\theta}(\mathbf{z}^{(q)}|\mathbf{z}^{(pmt),m})\quad p(y=m|\mathbf{x}^{(q)})\propto p(\mathbf{x}^{(q)}|y=m)p(y=m)
\end{split}
\tag{13}
$$

The query series is encoded into $\mathbf{z}^{(q)}$, while examples from each class form a class-specific prompt $\mathbf{z}^{(pmt),m}$. UniTok-FM approximates the class-conditional likelihood $p(\mathbf{x}^{(q)}|y=m)$ using the likelihood of the tokenized query conditioned on the corresponding prompt, $p_{\theta}(\mathbf{z}^{(q)}|\mathbf{z}^{(pmt),m})$, which has a closed-form solution under AR factorization. The posterior class probability $p(y=m|\mathbf{x}^{(q)})$ is obtained by Bayes' rule. Intuitively, this procedure evaluates which class is most likely to generate the query, sharing the spirit of using generative models as classifiers Li et al. ([2023](https://arxiv.org/html/2606.09861#bib.bib41)).
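The likelihood comparison of Eq. 13 can be illustrated with a deliberately simple autoregressive model. Under AR factorization, $\log p_{\theta}(\mathbf{z}^{(q)}|\mathbf{z}^{(pmt),m})=\sum_{l}\log p_{\theta}(z^{(q)}_{l}|\mathbf{z}^{(pmt),m},\mathbf{z}^{(q)}_{<l})$. In the sketch below, a Laplace-smoothed bigram model fitted to the prompt plays the role of $p_{\theta}$; this is a hypothetical stand-in chosen so that the example runs without a pretrained model, and the first query token is conditioned on rather than scored.

```python
import numpy as np

def bigram_log_likelihood(z_query, z_prompt, vocab, smoothing=1.0):
    """Hypothetical stand-in for log p_theta(z_query | z_prompt): a Laplace-smoothed
    bigram model fitted to the prompt, scored as sum_l log p(z_l | z_{l-1})."""
    counts = np.full((vocab, vocab), smoothing)
    for a, b in zip(z_prompt[:-1], z_prompt[1:]):
        counts[a, b] += 1.0
    log_trans = np.log(counts / counts.sum(axis=1, keepdims=True))
    return sum(log_trans[a, b] for a, b in zip(z_query[:-1], z_query[1:]))

def classify(z_query, class_prompts, vocab, priors=None):
    """Posterior p(y=m | query), proportional to p(query | prompt_m) * p(y=m)."""
    m = len(class_prompts)
    priors = np.full(m, 1.0 / m) if priors is None else np.asarray(priors, dtype=float)
    log_post = np.array([bigram_log_likelihood(z_query, p, vocab) for p in class_prompts])
    log_post = log_post + np.log(priors)
    log_post -= log_post.max()  # stabilise the softmax
    post = np.exp(log_post)
    return post / post.sum()

# Class 0: a repeating ramp; class 1: a slower staircase over the same tokens.
prompt_ramp = [0, 1, 2, 3] * 5
prompt_stairs = [0, 0, 1, 1, 2, 2, 3, 3] * 3
query = [1, 2, 3, 0, 1, 2]
posterior = classify(query, [prompt_ramp, prompt_stairs], vocab=4)
predicted_class = int(posterior.argmax())
```

The query follows the ramp pattern, so the ramp prompt assigns it the higher likelihood and class 0 receives most of the posterior mass.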

Table 1: Forecasting performance on the GIFT-Eval benchmark. Forecasting TSFM denotes forecasting-specific TSFMs, while General TSFM denotes general-purpose TSFMs. Bold indicates the best model. Chronos is underlined as the most closely related baseline. Our methods are highlighted in gray. Full results on each dataset are in Tab. [7](https://arxiv.org/html/2606.09861#A4.T7)-[9](https://arxiv.org/html/2606.09861#A4.T9) of Appendix [D](https://arxiv.org/html/2606.09861#A4).

| Method Type | Method | CRPS ↓ | MAPE ↓ | MASE ↓ |
| --- | --- | --- | --- | --- |
| Statistical | Naive | 1.591 | 1.055 | 1.270 |
| Statistical | Seasonal-Naive | 1.000 | 1.000 | 1.000 |
| Statistical | Auto-Theta | 1.244 | 1.126 | 1.090 |
| Statistical | Auto-Arima | 0.912 | 1.033 | 1.074 |
| Supervised | Crossformer Zhang and Yan ([2023](https://arxiv.org/html/2606.09861#bib.bib9)) | 1.637 | 1.024 | 2.574 |
| Supervised | DLinear Zeng et al. ([2023](https://arxiv.org/html/2606.09861#bib.bib10)) | 0.846 | 1.086 | 1.061 |
| Supervised | PatchTST Nie et al. ([2023](https://arxiv.org/html/2606.09861#bib.bib11)) | 0.587 | 0.788 | 0.849 |
| Supervised | iTransformer Liu et al. ([2024a](https://arxiv.org/html/2606.09861#bib.bib12)) | 0.620 | 0.846 | 0.893 |
| Forecasting TSFM | Chronos-2 Ansari et al. ([2025](https://arxiv.org/html/2606.09861#bib.bib13)) | **0.485** | **0.666** | **0.698** |
| Forecasting TSFM | TiRex Auer et al. ([2025](https://arxiv.org/html/2606.09861#bib.bib14)) | 0.488 | 0.677 | 0.716 |
| Forecasting TSFM | Sundial Liu et al. ([2025](https://arxiv.org/html/2606.09861#bib.bib15)) | 0.559 | 0.777 | 0.750 |
| Forecasting TSFM | Chronos-Bolt Ansari et al. ([2024b](https://arxiv.org/html/2606.09861#bib.bib16)) | 0.574 | 0.775 | 0.808 |
| Forecasting TSFM | Moirai Woo et al. ([2024](https://arxiv.org/html/2606.09861#bib.bib17)) | 0.610 | 0.825 | 0.901 |
| Forecasting TSFM | Chronos Ansari et al. ([2024a](https://arxiv.org/html/2606.09861#bib.bib7)) | 0.652 | 0.802 | 0.876 |
| Forecasting TSFM | VisionTS Chen et al. ([2025](https://arxiv.org/html/2606.09861#bib.bib18)) | 0.755 | 0.925 | 0.863 |
| Forecasting TSFM | Lag-Llama Rasul et al. ([2023](https://arxiv.org/html/2606.09861#bib.bib19)) | 0.880 | 1.115 | 1.228 |
| General TSFM | UniTok-FM(ZeroShot) | 0.591 | 0.798 | 0.851 |
| General TSFM | UniTok-FM(Prompt) | 0.573 | 0.761 | 0.824 |

Table 2: Generation performance on Stocks, ETTh, Energy and fMRI. #Train denotes the number of training examples for baseline models, while #Prompt denotes the number of in-context prompt examples for UniTok-FM. Pred Score and Disc Score indicate the average predictive and discriminative score across four datasets. Full results are in Tab. [10](https://arxiv.org/html/2606.09861#A4.T10) of Appendix [D](https://arxiv.org/html/2606.09861#A4).

## 5 Experiments

### 5.1 Pretraining Protocols

Datasets. We pretrain UniTok and UniTok-FM on the union of GIFT-Pretrain Aksu et al. ([2024](https://arxiv.org/html/2606.09861#bib.bib6)) and Chronos-Dataset Ansari et al. ([2024a](https://arxiv.org/html/2606.09861#bib.bib7)). Benchmark datasets are strictly filtered out from the pretraining corpus.

UniTok. We pretrain a 113M-parameter UniTok with $S=4$ blocks and a codebook size of $C=1{,}940$. The maximum supported series length is 2048. Training runs for 200K steps using the AdamW optimizer with a global batch size of 512. The learning rate is linearly warmed up to $1\times 10^{-4}$ over the first 2,000 steps and then cosine-decayed to $1\times 10^{-5}$.

UniTok-FM. During UniTok-FM training, the pretrained UniTok is frozen. We adopt Qwen3 Yang et al. ([2025](https://arxiv.org/html/2606.09861#bib.bib8)) as the backbone and train a 129M-parameter model from scratch. Other LLM architectures and model sizes are evaluated in Sec. [5.3](https://arxiv.org/html/2606.09861#S5.SS3). One context window supports up to 800 tokens, corresponding to 5 series of length 2048 or 8 series of length 1024. Training runs for 100K steps with a global batch size of 384, using the same optimizer and learning-rate schedule as UniTok. Both UniTok and UniTok-FM are trained on 4 NVIDIA A100 GPUs, with details shown in Appendices [B](https://arxiv.org/html/2606.09861#A2) and [C](https://arxiv.org/html/2606.09861#A3).

### 5.2 Main Results

Zero-Shot & Prompt-Boosted Forecasting. We evaluate forecasting on GIFT-Eval Aksu et al. ([2024](https://arxiv.org/html/2606.09861#bib.bib6)), which comprises 97 tasks with different datasets and prediction horizons. Probabilistic forecasting is evaluated using the Continuous Ranked Probability Score (CRPS), while point forecasting is evaluated by Mean Absolute Percentage Error (MAPE) and Mean Absolute Scaled Error (MASE). Three categories of methods are compared: statistical methods, supervised models and forecasting-specific TSFMs. Two UniTok-FM variants are evaluated: 1) UniTok-FM(ZeroShot) performs standard zero-shot forecasting; 2) UniTok-FM(Prompt) enables prompt-boosted forecasting by first retrieving prompt series from the target's earlier history and, if the context window is not full, from the training split of other entities within the same dataset.
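For reference, the three metrics can be written down in a few lines. These are the generic textbook definitions, given as a sketch: the function names are hypothetical, the CRPS uses the sample-based estimator $\mathbb{E}|X-y|-\tfrac{1}{2}\mathbb{E}|X-X^{\prime}|$, and the aggregation and normalization that GIFT-Eval applies across its tasks are not reproduced here.

```python
import numpy as np

def mase(y_true, y_pred, y_train, season=1):
    """Mean absolute error scaled by the in-sample seasonal-naive error."""
    y_true, y_pred, y_train = (np.asarray(a, dtype=float) for a in (y_true, y_pred, y_train))
    scale = np.mean(np.abs(y_train[season:] - y_train[:-season]))
    return np.mean(np.abs(y_true - y_pred)) / scale

def mape(y_true, y_pred):
    """Mean absolute percentage error, as a fraction (undefined where y_true is 0)."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true))

def crps_from_samples(samples, y_true):
    """Sample-based CRPS estimate, averaged over the horizon.
    samples has shape (n_samples, horizon); y_true has shape (horizon,)."""
    samples = np.asarray(samples, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    term1 = np.mean(np.abs(samples - y_true), axis=0)
    term2 = np.mean(np.abs(samples[:, None, :] - samples[None, :, :]), axis=(0, 1))
    return np.mean(term1 - 0.5 * term2)

mase_value = mase([6.0, 7.0], [5.0, 9.0], y_train=[1.0, 2.0, 3.0, 4.0, 5.0])
mape_value = mape([6.0, 7.0], [5.0, 9.0])
crps_point = crps_from_samples([[2.0], [2.0]], [3.0])   # degenerate forecast
crps_spread = crps_from_samples([[0.0], [2.0]], [1.0])  # two-point forecast
```

For a degenerate forecast the sample-based CRPS reduces to the absolute error, which is why it is often read as a probabilistic generalization of MAE.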

Tab. [1](https://arxiv.org/html/2606.09861#S4.T1) shows that UniTok-FM(ZeroShot) matches the strongest supervised baseline, PatchTST, while outperforming other statistical and supervised methods, with prompt boosting yielding consistent gains. Although UniTok-FM does not exceed SOTA models such as Chronos-2 and TiRex, it consistently outperforms Moirai, Chronos, VisionTS, and Lag-Llama, and is competitive with Chronos-Bolt. Notably, most TSFMs rely on forecasting-specific designs (e.g., multi-horizon heads and quantile objectives), whereas UniTok-FM adopts standard LLM-style NTP pretraining and AR inference. Moreover, UniTok-FM substantially outperforms Chronos, which employs pointwise binning tokenization, highlighting the importance of an expressive tokenizer for NTP.

![Refer to caption](https://arxiv.org/html/2606.09861v1/x5.png)

![Refer to caption](https://arxiv.org/html/2606.09861v1/x6.png)

![Refer to caption](https://arxiv.org/html/2606.09861v1/x7.png)

![Refer to caption](https://arxiv.org/html/2606.09861v1/x8.png)

![Refer to caption](https://arxiv.org/html/2606.09861v1/x9.png)

![Refer to caption](https://arxiv.org/html/2606.09861v1/x10.png)

![Refer to caption](https://arxiv.org/html/2606.09861v1/x11.png)

![Refer to caption](https://arxiv.org/html/2606.09861v1/x12.png)

![Refer to caption](https://arxiv.org/html/2606.09861v1/x13.png)

![Refer to caption](https://arxiv.org/html/2606.09861v1/x14.png)

Figure 4: All prompt examples (red) and sampled generations (blue) of UniTok-FM on Stocks.

![Refer to caption](https://arxiv.org/html/2606.09861v1/x15.png) (a) ![Refer to caption](https://arxiv.org/html/2606.09861v1/x16.png) (b) ![Refer to caption](https://arxiv.org/html/2606.09861v1/x17.png) (c) ![Refer to caption](https://arxiv.org/html/2606.09861v1/x18.png) (d)

Figure 5: Scaling behavior across LLM backbone sizes. Qwen3 backbones of three sizes are evaluated: Small (14M), Medium (26M), and Base (129M). (a) Training loss. (b) Forecasting MASE. (c) Generation discriminative score. (d) Classification accuracy.

Table 3: Classification performance on the UCR-FewShot. Acc denotes the average accuracy over 53 datasets, and #Win counts the number of datasets on which a method achieves the best performance (including ties). Full results are in Tab. [11](https://arxiv.org/html/2606.09861#A4.T11) of Appendix [D](https://arxiv.org/html/2606.09861#A4).

| Method Type | Method | Acc ↑ | #Win ↑ |
| --- | --- | --- | --- |
| Statistical | KNN | 0.672 | 5 |
| Statistical | TStree Deng et al. ([2013](https://arxiv.org/html/2606.09861#bib.bib21)) | 0.628 | 2 |
| Statistical | RDST Guillaume et al. ([2022](https://arxiv.org/html/2606.09861#bib.bib22)) | 0.673 | 0 |
| Supervised | FCN Wang et al. ([2017](https://arxiv.org/html/2606.09861#bib.bib23)) | 0.357 | 0 |
| Supervised | LITE Ismail-Fawaz et al. ([2025](https://arxiv.org/html/2606.09861#bib.bib24)) | 0.472 | 3 |
| Supervised | Inception Ismail Fawaz et al. ([2020](https://arxiv.org/html/2606.09861#bib.bib25)) | 0.491 | 6 |
| Classification TSFM | Mantis Feofanov et al. ([2025](https://arxiv.org/html/2606.09861#bib.bib28)) | **0.840** | **27** |
| General TSFM (Downstream Classifier) | UniTS Gao et al. ([2024](https://arxiv.org/html/2606.09861#bib.bib27)) | 0.697 | 1 |
| General TSFM (Downstream Classifier) | Moment Goswami et al. ([2024](https://arxiv.org/html/2606.09861#bib.bib26)) | 0.778 | 9 |
| General TSFM (In-context) | UniTok-FM | 0.755 | 10 |

Few-Shot Generation. Following Yuan and Qiao ([2024](https://arxiv.org/html/2606.09861#bib.bib37)), we evaluate few-shot generation on four datasets (Stocks, ETTh, Energy, and fMRI) using two metrics: 1) predictive score, measuring the forecasting accuracy of a predictor trained on generated data; 2) discriminative score, measuring how well a discriminator distinguishes generated data from real data. UniTok-FM conducts training-free in-context inference using five examples. As conventional generative models cannot be trained with so few samples, we train each baseline on progressively larger sets of 5, 200, and 1000 samples, where larger sets subsume smaller ones. The five-sample setting matches the examples provided to UniTok-FM.

Tab. [2](https://arxiv.org/html/2606.09861#S4.T2) shows that UniTok-FM, using only five prompt examples, achieves a predictive score comparable to the best generative models trained on 1K samples, demonstrating its ability to capture underlying dynamics via in-context inference. UniTok-FM also consistently outperforms baselines in discriminative score under the same sample budget. Since a discriminative score of 0.5 indicates perfect distinguishability between generated and real samples, UniTok-FM's substantially lower score (0.420) suggests that its generated series are non-trivial. See Fig. [4](https://arxiv.org/html/2606.09861#S5.F4) for qualitative evaluation.

Few-Shot Classification. We evaluate few-shot classification on the UCR Archive Dau et al. ([2018](https://arxiv.org/html/2606.09861#bib.bib20)). The original archive contains 128 datasets, and we select those with at most 20 training instances per class, yielding a subset of 53 datasets, denoted as UCR-FewShot. The scarcity of labeled data makes this subset suitable for few-shot classification. Besides the three categories compared in forecasting, we include general-purpose TSFMs that rely on downstream classifiers. These models extract features using a pretrained model and then train a classifier on them. This paradigm fundamentally differs from UniTok-FM, which uses training-free in-context inference. Notably, all TSFMs except ours are pretrained on corpora including UCR training splits, giving them prior exposure to the benchmark distribution.

Tab. [3](https://arxiv.org/html/2606.09861#S5.T3) shows that UniTok-FM outperforms statistical and supervised baselines. Among TSFMs, it consistently surpasses UniTS and is comparable to MOMENT, despite requiring no dataset-specific fine-tuning and never accessing UCR data during pretraining. Although UniTok-FM is pretrained only on forecasting-oriented corpora and NTP is often viewed as suboptimal for discriminative tasks Feofanov et al. ([2025](https://arxiv.org/html/2606.09861#bib.bib28)), its competitive performance demonstrates NTP's strong generalizability for TS understanding.

### 5.3 Model Analysis

Scaling Behavior across LLM Backbone Size. The model with the 129M-parameter Qwen3 backbone is denoted as *Base*. Halving the hidden dimension creates *Medium* (26M), and further halving the number of layers yields *Small* (14M). Fig. [5](https://arxiv.org/html/2606.09861#S5.F5)(a) shows that larger backbones consistently achieve lower training loss. Figs. [5](https://arxiv.org/html/2606.09861#S5.F5)(b–d) demonstrate a clear scaling trend on downstream tasks, where performance generally improves with increasing backbone size. For efficiency, unless otherwise stated, subsequent analyses in this section are conducted on the Medium model.

![Refer to caption](https://arxiv.org/html/2606.09861v1/x19.png)\(a\)![Refer to caption](https://arxiv.org/html/2606.09861v1/x20.png)

![Refer to caption](https://arxiv.org/html/2606.09861v1/x21.png)

\(b\)![Refer to caption](https://arxiv.org/html/2606.09861v1/x22.png)

![Refer to caption](https://arxiv.org/html/2606.09861v1/x23.png)

\(c\)
Figure 6: Qualitative comparison between the full UniTok and ablated variants. (a) Zero-shot forecasting: full vs. prefix normalization ablated. (b) Series reconstruction: full vs. progressive-resolution causal autoencoder ablated. (c) Series generation: full vs. structure-preserving reconstruction loss ablated, using the same prompts as Fig. [4](https://arxiv.org/html/2606.09861#S5.F4). Blue: full UniTok; Orange: ablated variants.

Table 4: Performance of UniTok-FM across downstream tasks with different LLM architectures. ZS: ZeroShot; PMT: Prompt; Disc: discriminative score.

Table 5: Ablation study of UniTok components. w/o Prefix Norm replaces prefix normalization with whole-series norm. w/o Prog Res replaces the progressive-resolution up/downsample with uniform ones. w/o Struct Loss replaces the structure-preserving reconstruction loss with $L_{1}$.

Generality across LLM Architectures. Beyond Qwen3, we adapt other LLM architectures as the AR backbone of UniTok-FM, including GPT2 Radford et al. ([2019](https://arxiv.org/html/2606.09861#bib.bib49)), Llama2 Touvron et al. ([2023](https://arxiv.org/html/2606.09861#bib.bib50)) and Gemma2 Team et al. ([2024](https://arxiv.org/html/2606.09861#bib.bib51)). All models are scaled to Medium (26M). As shown in Tab. [4](https://arxiv.org/html/2606.09861#S5.T5), UniTok-FM generalizes well across architectures and benefits from advances in LLM design, with more recent models outperforming the early-stage GPT2 on most tasks. Notably, thanks to the well-established LLM community, we can swap backbones by changing configuration files while reusing all interfaces.

Ablation Study. Tab. [5](https://arxiv.org/html/2606.09861#S5.T5) shows that: 1) Removing prefix normalization substantially degrades forecasting, primarily due to failures on non-stationary series where statistics estimated on the lookback window fail to generalize to the future (Fig. [6](https://arxiv.org/html/2606.09861#S5.F6)(a)). 2) Removing progressive-resolution downsample/upsample degrades performance across all tasks by harming reconstruction quality, particularly severe at early positions with limited causal receptive field (Fig. [6](https://arxiv.org/html/2606.09861#S5.F6)(b)). 3) Replacing the structure-preserving reconstruction loss with an $L1$ loss yields a slight gain in forecasting, as it mainly depends on low-frequency trends Xu et al. ([2024](https://arxiv.org/html/2606.09861#bib.bib53)), but severely degrades generation, producing overly smooth samples (Fig. [6](https://arxiv.org/html/2606.09861#S5.F6)(c)). Overall, rather than optimizing for forecasting alone, UniTok integrates these components to form a universal tokenizer that generalizes across tasks.

![Refer to caption](https://arxiv.org/html/2606.09861v1/x24.png)\(a\)![Refer to caption](https://arxiv.org/html/2606.09861v1/x25.png)\(b\)![Refer to caption](https://arxiv.org/html/2606.09861v1/x26.png)\(c\)
Figure 7: Zero-shot forecasting efficiency comparison between Chronos and UniTok-FM on Jena Weather. (a) Inference time per instance w.r.t. number of sampling trajectories. (b) Memory occupancy. (c) Forecasting performance (MASE).

Inference Efficiency. UniTok produces much shorter token sequences than the point-wise binning of Chronos, resulting in improved LLM inference efficiency. We compare Chronos (Base) and UniTok-FM (Base) under varying numbers of sampled trajectories on the Jena Weather (10min) dataset from GIFT-Eval, using a single NVIDIA A100 GPU. The lookback window is 512 and the prediction horizon is 48, matching Chronos' setup. UniTok-FM is evaluated in the zero-shot setting for a fair comparison. Fig. [7](https://arxiv.org/html/2606.09861#S5.F7) shows that UniTok-FM not only achieves better forecasting performance but also consistently reduces inference time and memory usage.

## 6 Conclusion, Limitation and Outlook

We propose UniTok, a universal tokenizer that converts TS into discrete tokens, enabling next\-token prediction on TS\. Built on UniTok, we pretrain UniTok\-FM, a general\-purpose foundation model that supports zero\-shot and prompt\-boosted forecasting, as well as few\-shot generation and classification via training\-free in\-context inference\.

Despite the generality, a performance gap remains compared to task\-specific SOTA models that benefit from carefully designed inductive biases\. Moreover, this work focuses on univariate TS and relies on channel independence for multivariate data\. Extending to multivariate settings requires careful modeling of cross\-channel dependencies, which we leave for future work\.

## References

- T. Aksu, G. Woo, J. Liu, X. Liu, C. Liu, S. Savarese, C. Xiong, and D. Sahoo (2024) Gift-eval: a benchmark for general time series forecasting model evaluation. arXiv preprint arXiv:2410.10393.
- A. F. Ansari, O. Shchur, J. Küken, A. Auer, B. Han, P. Mercado, S. S. Rangapuram, H. Shen, L. Stella, X. Zhang, M. Goswami, S. Kapoor, D. C. Maddix, P. Guerron, T. Hu, J. Yin, N. Erickson, P. M. Desai, H. Wang, H. Rangwala, G. Karypis, Y. Wang, and M. Bohlke-Schneider (2025) Chronos-2: from univariate to universal forecasting. arXiv preprint arXiv:2510.15821.
- A. F. Ansari, L. Stella, C. Turkmen, X. Zhang, P. Mercado, H. Shen, O. Shchur, S. S. Rangapuram, S. P. Arango, S. Kapoor, et al. (2024a) Chronos: learning the language of time series. Transactions on Machine Learning Research (TMLR).
- A. F. Ansari, C. Turkmen, O. Shchur, and L. Stella (2024b) Fast and accurate zero-shot forecasting with chronos-bolt and autogluon. [https://aws.amazon.com/blogs/machine-learning/fast-and-accurate-zero-shot-forecasting-with-chronos-bolt-and-autogluon](https://aws.amazon.com/blogs/machine-learning/fast-and-accurate-zero-shot-forecasting-with-chronos-bolt-and-autogluon).
- A. Auer, P. Podest, D. Klotz, S. Böck, G. Klambauer, and S. Hochreiter (2025) TiRex: zero-shot forecasting across long and short horizons with enhanced in-context learning. In Conference on Neural Information Processing Systems (NeurIPS).
- M. Chen, L. Shen, Z. Li, X. J. Wang, J. Sun, and C. Liu (2025) VisionTS: visual masked autoencoders are free-lunch zero-shot time series forecasters. In International Conference on Machine Learning (ICML).
- Z. Chen, F. SHIBO, Z. Zhang, X. Xiao, X. Gao, and P. Zhao (2024) Sdformer: similarity-driven discrete transformer for time series generation. In Conference on Neural Information Processing Systems (NeurIPS).
- B. Cohen, E. Khwaja, Y. Doubli, S. Lemaachi, C. Lettieri, C. Masson, H. Miccinilli, E. Ramé, Q. Ren, A. Rostamizadeh, et al. (2025) This time is different: an observability perspective on time series foundation models. arXiv preprint arXiv:2505.14766.
- H. A. Dau, E. Keogh, K. Kamgar, C. M. Yeh, Y. Zhu, S. Gharghabi, C. A. Ratanamahatana, Yanping, B. Hu, N. Begum, A. Bagnall, A. Mueen, G. Batista, and Hexagon-ML (2018) The ucr time series classification archive. [https://www.cs.ucr.edu/~eamonn/time_series_data_2018](https://www.cs.ucr.edu/~eamonn/time_series_data_2018).
- H. Deng, G. Runger, E. Tuv, and M. Vladimir (2013) A time series forest for classification and feature extraction. Information Sciences.
- A. Desai, C. Freeman, Z. Wang, and I. Beaver (2021) Timevae: a variational auto-encoder for multivariate time series generation. arXiv preprint arXiv:2111.08095.
- D. L. Donoho and I. M. Johnstone (1994) Ideal spatial adaptation by wavelet shrinkage. Biometrika.
- P. Esser, R. Rombach, and B. Ommer (2021) Taming transformers for high-resolution image synthesis. In IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR).
- S. Feng, P. Zhao, L. Liu, P. Wu, and Z. Shen (2025) Hdt: hierarchical discrete transformer for multivariate time series forecasting. In AAAI Conference on Artificial Intelligence (AAAI).
- V. Feofanov, S. Wen, M. Alonso, R. Ilbert, H. Guo, M. Tiomoko, L. Pan, J. Zhang, and I. Redko (2025) Mantis: lightweight calibrated foundation model for user-friendly time series classification. 1st ICML Workshop on Foundation Models for Structured Data.
- S. Gao, T. Koker, O. Queen, T. Hartvigsen, T. Tsiligkaridis, and M. Zitnik (2024) Units: a unified multi-task time series model. In Conference on Neural Information Processing Systems (NeurIPS).
- M. Goswami, K. Szafer, A. Choudhry, Y. Cai, S. Li, and A. Dubrawski (2024) MOMENT: a family of open time-series foundation models. In International Conference on Machine Learning (ICML).
- A. Guillaume, C. Vrain, and W. Elloumi (2022) Random dilated shapelet transform: a new approach for time series shapelets. In International Conference on Pattern Recognition and Artificial Intelligence (ICPRAI).
- H\. Ismail Fawaz, B\. Lucas, G\. Forestier, C\. Pelletier, D\. F\. Schmidt, J\. Weber, G\. I\. Webb, L\. Idoumghar, P\. Muller, and F\. Petitjean \(2020\)Inceptiontime: finding alexnet for time series classification\.Data Mining and Knowledge Discovery\.Cited by:[Table 3](https://arxiv.org/html/2606.09861#S5.T3.2.2.8.6.1)\.
- A\. Ismail\-Fawaz, M\. Devanne, S\. Berretti, J\. Weber, and G\. Forestier \(2025\)Look into the lite in deep learning for time series classification\.International Journal of Data Science and Analytics\.Cited by:[Table 3](https://arxiv.org/html/2606.09861#S5.T3.2.2.7.5.1)\.
- J\. Jia, J\. Gao, B\. Xue, J\. Wang, Q\. Cai, Q\. Chen, X\. Zhao, P\. Jiang, and K\. Gai \(2025\)From principles to applications: a comprehensive survey of discrete tokenizers in generation, comprehension, recommendation, and information retrieval\.arXiv preprint arXiv:2502\.12448\.Cited by:[§2](https://arxiv.org/html/2606.09861#S2.p3.2)\.
- M\. Jin, S\. Wang, L\. Ma, Z\. Chu, J\. Y\. Zhang, X\. Shi, P\. Chen, Y\. Liang, Y\. Li, S\. Pan,et al\.\(2024\)Time\-llm: time series forecasting by reprogramming large language models\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[§2](https://arxiv.org/html/2606.09861#S2.p4.1)\.
- C\. Ledig, L\. Theis, F\. Huszár, J\. Caballero, A\. Cunningham, A\. Acosta, A\. Aitken, A\. Tejani, J\. Totz, Z\. Wang,et al\.\(2017\)Photo\-realistic single image super\-resolution using a generative adversarial network\.InIEEE / CVF Computer Vision and Pattern Recognition Conference \(CVPR\),Cited by:[§3\.4](https://arxiv.org/html/2606.09861#S3.SS4.p3.1)\.
- D\. Lee, S\. Malacarne, and E\. Aune \(2023\)Vector quantized time series generation with a bidirectional prior model\.InInternational Conference on Artificial Intelligence and Statistics \(AISTATS\),Cited by:[§1](https://arxiv.org/html/2606.09861#S1.p3.1),[§2](https://arxiv.org/html/2606.09861#S2.p4.1)\.
- D\. Lee, C\. Kim, S\. Kim, M\. Cho, and W\. Han \(2022\)Autoregressive image generation using residual quantization\.InIEEE / CVF Computer Vision and Pattern Recognition Conference \(CVPR\),Cited by:[§2](https://arxiv.org/html/2606.09861#S2.p3.2)\.
- A\. C\. Li, M\. Prabhudesai, S\. Duggal, E\. L\. Brown, and D\. Pathak \(2023\) Your diffusion model is secretly a zero\-shot classifier\. In ICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling\. Cited by: [§4\.2](https://arxiv.org/html/2606.09861#S4.SS2.p4.10)\.
- T\. Li, Y\. Tian, H\. Li, M\. Deng, and K\. He \(2024\) Autoregressive image generation without vector quantization\. In Conference on Neural Information Processing Systems \(NeurIPS\)\. Cited by: [§1](https://arxiv.org/html/2606.09861#S1.p2.1)\.
- Y\. Liu, T\. Hu, H\. Zhang, H\. Wu, S\. Wang, L\. Ma, and M\. Long \(2024a\) iTransformer: inverted transformers are effective for time series forecasting\. In International Conference on Learning Representations \(ICLR\)\. Cited by: [Table 1](https://arxiv.org/html/2606.09861#S4.T1.6.6.14.8.1)\.
- Y\. Liu, G\. Qin, Z\. Shi, Z\. Chen, C\. Yang, X\. Huang, J\. Wang, and M\. Long \(2025\) Sundial: a family of highly capable time series foundation models\. In International Conference on Machine Learning \(ICML\)\. Cited by: [§1](https://arxiv.org/html/2606.09861#S1.p1.1), [§2](https://arxiv.org/html/2606.09861#S2.p1.1), [Table 1](https://arxiv.org/html/2606.09861#S4.T1.6.6.8.2.5)\.
- Y\. Liu, H\. Zhang, C\. Li, X\. Huang, J\. Wang, and M\. Long \(2024b\) Timer: generative pre\-trained transformers are large time series models\. In International Conference on Machine Learning \(ICML\)\. Cited by: [§1](https://arxiv.org/html/2606.09861#S1.p1.1), [§2](https://arxiv.org/html/2606.09861#S2.p1.1)\.
- F\. Mentzer, D\. Minnen, E\. Agustsson, and M\. Tschannen \(2024\) Finite scalar quantization: VQ\-VAE made simple\. In International Conference on Learning Representations \(ICLR\)\. Cited by: [§A\.2](https://arxiv.org/html/2606.09861#A1.SS2.p1.1), [§2](https://arxiv.org/html/2606.09861#S2.p3.2), [§3\.3](https://arxiv.org/html/2606.09861#S3.SS3.p1.7)\.
- Y\. Nie, N\. H\. Nguyen, P\. Sinthong, and J\. Kalagnanam \(2023\) A time series is worth 64 words: long\-term forecasting with transformers\. In International Conference on Learning Representations \(ICLR\)\. Cited by: [Table 1](https://arxiv.org/html/2606.09861#S4.T1.6.6.13.7.1), [footnote 2](https://arxiv.org/html/2606.09861#footnote2)\.
- A\. Radford, J\. Wu, R\. Child, D\. Luan, D\. Amodei, I\. Sutskever, et al\. \(2019\) Language models are unsupervised multitask learners\. OpenAI blog\. Cited by: [§5\.3](https://arxiv.org/html/2606.09861#S5.SS3.p2.1), [Table 5](https://arxiv.org/html/2606.09861#S5.T5.4.4.6.2.1)\.
- C\. Raffel, N\. Shazeer, A\. Roberts, K\. Lee, S\. Narang, M\. Matena, Y\. Zhou, W\. Li, and P\. J\. Liu \(2020\) Exploring the limits of transfer learning with a unified text\-to\-text transformer\. Journal of Machine Learning Research \(JMLR\)\. Cited by: [§2](https://arxiv.org/html/2606.09861#S2.p2.1)\.
- K\. Rasul, A\. Ashok, A\. R\. Williams, A\. Khorasani, G\. Adamopoulos, R\. Bhagwatkar, M\. Biloš, H\. Ghonia, N\. Hassen, A\. Schneider, S\. Garg, A\. Drouin, N\. Chapados, Y\. Nevmyvaka, and I\. Rish \(2023\) Lag\-Llama: towards foundation models for time series forecasting\. In NeurIPS Workshop R0\-FoMo: Robustness of Few\-shot and Zero\-shot Learning in Large Foundation Models\. Cited by: [Table 1](https://arxiv.org/html/2606.09861#S4.T1.6.6.13.7.5)\.
- A\. Razavi, A\. Van den Oord, and O\. Vinyals \(2019\) Generating diverse high\-fidelity images with VQ\-VAE\-2\. In Conference on Neural Information Processing Systems \(NeurIPS\)\. Cited by: [§2](https://arxiv.org/html/2606.09861#S2.p3.2)\.
- X\. Shi, S\. Wang, Y\. Nie, D\. Li, Z\. Ye, Q\. Wen, and M\. Jin \(2025a\) Time\-MoE: billion\-scale time series foundation models with mixture of experts\. In International Conference on Learning Representations \(ICLR\)\. Cited by: [§1](https://arxiv.org/html/2606.09861#S1.p1.1), [§2](https://arxiv.org/html/2606.09861#S2.p1.1)\.
- Y\. Shi, Z\. Fu, S\. Chen, B\. Zhao, W\. Xu, C\. Zhang, and J\. Li \(2025b\) Kronos: a foundation model for the language of financial markets\. arXiv preprint arXiv:2508\.02739\. Cited by: [§1](https://arxiv.org/html/2606.09861#S1.p3.1), [§2](https://arxiv.org/html/2606.09861#S2.p4.1)\.
- Y\. Sun, Y\. Fang, Z\. Zhu, J\. Li, Y\. Liu, Q\. Deng, J\. Zhou, H\. Yu, X\. Lu, and L\. Ma \(2025\) Xihe: scalable zero\-shot time series learner via hierarchical interleaved block attention\. arXiv preprint arXiv:2510\.21795\. Cited by: [§1](https://arxiv.org/html/2606.09861#S1.p1.1), [§2](https://arxiv.org/html/2606.09861#S2.p1.1)\.
- S\. Talukder, Y\. Yue, and G\. Gkioxari \(2024\) TOTEM: tokenized time series embeddings for general time series analysis\. Transactions on Machine Learning Research \(TMLR\)\. Cited by: [§1](https://arxiv.org/html/2606.09861#S1.p3.1), [§2](https://arxiv.org/html/2606.09861#S2.p4.1)\.
- X\. Tao, S\. Zhang, M\. Cheng, D\. Wang, T\. Pan, B\. Pan, C\. Zhang, and S\. Wang \(2025\) From values to tokens: an LLM\-driven framework for context\-aware time series forecasting via symbolic discretization\. arXiv preprint arXiv:2508\.09191\. Cited by: [§1](https://arxiv.org/html/2606.09861#S1.p3.1), [§2](https://arxiv.org/html/2606.09861#S2.p4.1)\.
- G\. Team, M\. Riviere, S\. Pathak, P\. G\. Sessa, C\. Hardin, S\. Bhupatiraju, L\. Hussenot, T\. Mesnard, B\. Shahriari, A\. Ramé, et al\. \(2024\) Gemma 2: improving open language models at a practical size\. arXiv preprint arXiv:2408\.00118\. Cited by: [§5\.3](https://arxiv.org/html/2606.09861#S5.SS3.p2.1), [Table 5](https://arxiv.org/html/2606.09861#S5.T5.4.4.6.2.3)\.
- K\. Tian, Y\. Jiang, Z\. Yuan, B\. Peng, and L\. Wang \(2024\) Visual autoregressive modeling: scalable image generation via next\-scale prediction\. In Conference on Neural Information Processing Systems \(NeurIPS\)\. Cited by: [§2](https://arxiv.org/html/2606.09861#S2.p3.2), [§3\.4](https://arxiv.org/html/2606.09861#S3.SS4.p1.2)\.
- H\. Touvron, L\. Martin, K\. Stone, P\. Albert, A\. Almahairi, Y\. Babaei, N\. Bashlykov, S\. Batra, P\. Bhargava, S\. Bhosale, et al\. \(2023\) Llama 2: open foundation and fine\-tuned chat models\. arXiv preprint arXiv:2307\.09288\. Cited by: [§5\.3](https://arxiv.org/html/2606.09861#S5.SS3.p2.1), [Table 5](https://arxiv.org/html/2606.09861#S5.T5.4.4.6.2.2)\.
- A\. Van den Oord, N\. Kalchbrenner, L\. Espeholt, O\. Vinyals, A\. Graves, et al\. \(2016\) Conditional image generation with PixelCNN decoders\. In Conference on Neural Information Processing Systems \(NeurIPS\)\. Cited by: [§2](https://arxiv.org/html/2606.09861#S2.p3.2)\.
- A\. Van Den Oord, O\. Vinyals, et al\. \(2017\) Neural discrete representation learning\. In Conference on Neural Information Processing Systems \(NeurIPS\)\. Cited by: [§1](https://arxiv.org/html/2606.09861#S1.p5.1), [§2](https://arxiv.org/html/2606.09861#S2.p3.2)\.
- X\. Wang, T\. Zhou, J\. Gao, B\. Ding, and J\. Zhou \(2025\) Output scaling: YingLong\-delayed chain of thought in a large pretrained time series forecasting model\. arXiv preprint arXiv:2506\.11029\. Cited by: [§1](https://arxiv.org/html/2606.09861#S1.p1.1), [§2](https://arxiv.org/html/2606.09861#S2.p1.1)\.
- Z\. Wang, W\. Yan, and T\. Oates \(2017\) Time series classification from scratch with deep neural networks: a strong baseline\. In International Joint Conference on Neural Networks \(IJCNN\)\. Cited by: [Table 3](https://arxiv.org/html/2606.09861#S5.T3.2.2.6.4.2)\.
- Y\. Wen, T\. Ma, L\. Weng, L\. Nguyen, and A\. A\. Julius \(2024\) Abstracted shapes as tokens\-a generalizable and interpretable model for time\-series classification\. In Conference on Neural Information Processing Systems \(NeurIPS\)\. Cited by: [§2](https://arxiv.org/html/2606.09861#S2.p4.1)\.
- G\. Woo, C\. Liu, A\. Kumar, C\. Xiong, S\. Savarese, and D\. Sahoo \(2024\) Unified training of universal time series forecasting transformers\. In International Conference on Machine Learning \(ICML\)\. Cited by: [§1](https://arxiv.org/html/2606.09861#S1.p1.1), [§2](https://arxiv.org/html/2606.09861#S2.p1.1), [Table 1](https://arxiv.org/html/2606.09861#S4.T1.6.6.10.4.5)\.
- T\. Xu, L\. K\. Wenliang, M\. Munn, and B\. Acciaio \(2020\) COT\-GAN: generating sequential data via causal optimal transport\. In Conference on Neural Information Processing Systems \(NeurIPS\)\. Cited by: [Table 2](https://arxiv.org/html/2606.09861#S4.T2.2.2.3.1.6)\.
- Z\. Xu, A\. Zeng, and Q\. Xu \(2024\) FITS: modeling time series with $10k$ parameters\. In International Conference on Learning Representations \(ICLR\)\. Cited by: [§5\.3](https://arxiv.org/html/2606.09861#S5.SS3.p3.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, C\. Zheng, D\. Liu, F\. Zhou, F\. Huang, F\. Hu, H\. Ge, H\. Wei, H\. Lin, J\. Tang, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Zhou, J\. Lin, K\. Dang, K\. Bao, K\. Yang, L\. Yu, L\. Deng, M\. Li, M\. Xue, M\. Li, P\. Zhang, P\. Wang, Q\. Zhu, R\. Men, R\. Gao, S\. Liu, S\. Luo, T\. Li, T\. Tang, W\. Yin, X\. Ren, X\. Wang, X\. Zhang, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Wang, Z\. Cui, Z\. Zhang, Z\. Zhou, and Z\. Qiu \(2025\) Qwen3 technical report\. arXiv preprint arXiv:2505\.09388\. Cited by: [§5\.1](https://arxiv.org/html/2606.09861#S5.SS1.p3.1), [Table 5](https://arxiv.org/html/2606.09861#S5.T5.4.4.6.2.4)\.
- J\. Yoon, D\. Jarrett, and M\. Van der Schaar \(2019\) Time\-series generative adversarial networks\. In Conference on Neural Information Processing Systems \(NeurIPS\)\. Cited by: [§C\.2](https://arxiv.org/html/2606.09861#A3.SS2.p5.1), [Table 2](https://arxiv.org/html/2606.09861#S4.T2.2.2.3.1.4)\.
- L\. Yu, J\. Lezama, N\. B\. Gundavarapu, L\. Versari, K\. Sohn, D\. Minnen, Y\. Cheng, V\. Birodkar, A\. Gupta, X\. Gu, et al\. \(2023\) Language model beats diffusion–tokenizer is key to visual generation\. arXiv preprint arXiv:2310\.05737\. Cited by: [§2](https://arxiv.org/html/2606.09861#S2.p3.2)\.
- Q\. Yu, M\. Weber, X\. Deng, X\. Shen, D\. Cremers, and L\. Chen \(2024\) An image is worth 32 tokens for reconstruction and generation\. In Conference on Neural Information Processing Systems \(NeurIPS\)\. Cited by: [§2](https://arxiv.org/html/2606.09861#S2.p3.2)\.
- X\. Yuan and Y\. Qiao \(2024\) Diffusion\-TS: interpretable diffusion for general time series generation\. In International Conference on Learning Representations \(ICLR\)\. Cited by: [§C\.2](https://arxiv.org/html/2606.09861#A3.SS2.p1.1), [§1](https://arxiv.org/html/2606.09861#S1.p9.1), [Table 2](https://arxiv.org/html/2606.09861#S4.T2.2.2.3.1.2), [§5\.2](https://arxiv.org/html/2606.09861#S5.SS2.p3.1)\.
- A\. Zeng, M\. Chen, L\. Zhang, and Q\. Xu \(2023\) Are transformers effective for time series forecasting?\. In AAAI Conference on Artificial Intelligence \(AAAI\)\. Cited by: [Table 1](https://arxiv.org/html/2606.09861#S4.T1.6.6.12.6.1)\.
- R\. Zhang, P\. Isola, A\. A\. Efros, E\. Shechtman, and O\. Wang \(2018\) The unreasonable effectiveness of deep features as a perceptual metric\. In IEEE / CVF Computer Vision and Pattern Recognition Conference \(CVPR\)\. Cited by: [§3\.4](https://arxiv.org/html/2606.09861#S3.SS4.p2.1)\.
- Y\. Zhang, W\. Hu, J\. Zheng, L\. Pan, and J\. Yan \(2026\) MMPD: diverse time series forecasting via multi\-mode patch diffusion loss\. In International Conference on Learning Representations \(ICLR\)\. Cited by: [§1](https://arxiv.org/html/2606.09861#S1.p2.1)\.
- Y\. Zhang and J\. Yan \(2023\) Crossformer: transformer utilizing cross\-dimension dependency for multivariate time series forecasting\. In International Conference on Learning Representations \(ICLR\)\. Cited by: [Table 1](https://arxiv.org/html/2606.09861#S4.T1.6.6.11.5.2)\.

## Appendix A Design Details and Referenced Methods in UniTok

### A\.1 Length Mapping Functions

Accounting for the $4$ special tokens, the $2\times 8$ scale statistic tokens, and the progressive\-resolution autoencoder, the length mapping function $\phi(t)$ and its inverse $\phi^{-1}(l)$ in Eq\. [2](https://arxiv.org/html/2606.09861#S3.E2) are:

$$
\phi(t)=\begin{cases}21 & t=1,\\ 21+\lceil(t-1)/2\rceil & 1<t\leq 5,\\ 23+\lceil(t-5)/4\rceil & 5<t\leq 21,\\ 27+\lceil(t-21)/8\rceil & 21<t\leq 85,\\ 35+\lceil(t-85)/16\rceil & t>85,\end{cases}
\quad
\phi^{-1}(l)=\begin{cases}1 & l=21,\\ 1+2\times(l-21) & 21<l\leq 23,\\ 5+4\times(l-23) & 23<l\leq 27,\\ 21+8\times(l-27) & 27<l\leq 35,\\ 85+16\times(l-35) & l>35.\end{cases}
\tag{14}
$$

Lengths that do not lie exactly on the grid are padded up to the next valid grid point, which gives rise to the ceiling operations in $\phi(t)$\. The hierarchical mapping arises naturally from the autoencoder’s progressive\-resolution structure\.
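As a sanity check on Eq. 14, the two mappings can be written out directly. The following is a minimal sketch that only transcribes the piecewise formulas; the function names `phi`, `phi_inv`, and `ceil_div` are illustrative and do not come from the paper's code.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def phi(t: int) -> int:
    """Token count for a series of length t, following Eq. 14."""
    if t < 1:
        raise ValueError("series length must be a positive integer")
    if t == 1:
        return 21
    if t <= 5:
        return 21 + ceil_div(t - 1, 2)
    if t <= 21:
        return 23 + ceil_div(t - 5, 4)
    if t <= 85:
        return 27 + ceil_div(t - 21, 8)
    return 35 + ceil_div(t - 85, 16)


def phi_inv(l: int) -> int:
    """Largest series length represented by l tokens, following Eq. 14."""
    if l < 21:
        raise ValueError("token count must be at least 21")
    if l == 21:
        return 1
    if l <= 23:
        return 1 + 2 * (l - 21)
    if l <= 27:
        return 5 + 4 * (l - 23)
    if l <= 35:
        return 21 + 8 * (l - 27)
    return 85 + 16 * (l - 35)


tokens_2048 = phi(2048)  # 158
tokens_1024 = phi(1024)  # 94
```

Under these formulas a series of length 2,048 maps to 158 tokens and one of length 1,024 to 94 tokens, so 5 of the former (790 tokens) or 8 of the latter (752 tokens) fit in the 800-token pretraining context described in Appendix B.3.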

### A\.2 Finite Scalar Quantization

Finite Scalar Quantization \(FSQ\), proposed by \[[31](https://arxiv.org/html/2606.09861#bib.bib3)\], is designed as a replacement for Vector Quantization \(VQ\) in VQ\-VAE\. Unlike VQ, FSQ eliminates explicit codebook lookup operations, leading to more stable training and effectively avoiding codebook collapse\.

At a high level, FSQ projects the autoencoder latent representation into a low\-dimensional space and independently quantizes each dimension to a fixed set of discrete values, forming an implicit codebook\. Let the encoder output be $\mathbf{h}\in\mathbb{R}^{d_{model}}$\. FSQ then performs:

$$
\begin{aligned}
&\mathbf{z}=\mathbf{W}^{(down)}\mathbf{h},\quad \mathbf{W}^{(down)}\in\mathbb{R}^{d_{FSQ}\times d_{model}}\\
&\mathbf{z}^{(pre)}=\lfloor L/2\rfloor\,\text{Tanh}(\mathbf{z})\\
&\mathbf{z}^{(post)}=\text{Round}(\mathbf{z}^{(pre)})\\
&\widetilde{\mathbf{h}}=\mathbf{W}^{(up)}\mathbf{z}^{(post)},\quad \mathbf{W}^{(up)}\in\mathbb{R}^{d_{model}\times d_{FSQ}}
\end{aligned}
\tag{15}
$$

The latent vector $\mathbf{h}$ is first projected to $\mathbf{z}$ with $d_{FSQ}$ dimensions, typically with $d_{FSQ}\leq 10$\. Each dimension of $\mathbf{z}$ is bounded to a finite range $(-\lfloor L/2\rfloor,+\lfloor L/2\rfloor)$, producing $\mathbf{z}^{(pre)}$\. After that, $\mathbf{z}^{(pre)}$ is rounded to $\mathbf{z}^{(post)}$, whose elements are integers in $\{-\lfloor L/2\rfloor,\dots,+\lfloor L/2\rfloor\}$\. Finally, $\mathbf{z}^{(post)}$ is projected back to the original latent dimension\.

Since rounding is non\-differentiable, gradients are propagated via the straight\-through estimator:

$$
\mathbf{z}^{(post)}=\mathbf{z}^{(pre)}+\text{StopGrad}(\mathbf{z}^{(post)}-\mathbf{z}^{(pre)})
\tag{16}
$$

Each dimension of $\mathbf{z}^{(post)}$ admits $L$ possible integer values, resulting in an implicit codebook of size $L^{d_{FSQ}}$\. In practice, different dimensions may use different quantization levels $L_i$, yielding a flexible codebook size of $\prod_{i=1}^{d_{FSQ}}L_i$\.
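The forward pass of Eq. 15 can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the random matrices stand in for the learned projections, the sizes `d_model`, `d_fsq`, and `L` are arbitrary, and an odd `L` is chosen so that the integer grid from -⌊L/2⌋ to +⌊L/2⌋ contains exactly `L` values. The straight-through estimator of Eq. 16 needs an autodiff framework and is therefore omitted.

```python
import numpy as np


def fsq_forward(h, w_down, w_up, L):
    """Forward pass of Eq. 15 for a single latent vector h (no gradients)."""
    z = w_down @ h                  # project to d_FSQ dimensions
    z_pre = (L // 2) * np.tanh(z)   # bound each dimension to (-L//2, +L//2)
    z_post = np.round(z_pre)        # snap to the integer grid
    h_tilde = w_up @ z_post         # project back to d_model dimensions
    return z_post, h_tilde


# Illustrative sizes and random stand-ins for the learned projections.
rng = np.random.default_rng(0)
d_model, d_fsq, L = 16, 4, 5
h = rng.normal(size=d_model)
w_down = rng.normal(size=(d_fsq, d_model))
w_up = rng.normal(size=(d_model, d_fsq))

z_post, h_tilde = fsq_forward(h, w_down, w_up, L)
```

Each entry of `z_post` is one of the `L` integers on the grid, so the tuple of entries identifies one code in the implicit codebook without any lookup table.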

### A\.3 Threshold for Discrete Wavelet Transformation Coefficients

Although high\-frequency components of the DWT capture fine\-grained structures, they are also more susceptible to noise\. We use the classical universal thresholding method proposed by \[[12](https://arxiv.org/html/2606.09861#bib.bib4)\] to distinguish salient coefficients\. Given the coefficients $\mathbf{d}_1$ from the highest\-frequency level, the threshold $\alpha$ in Eq\. [7](https://arxiv.org/html/2606.09861#S3.E7) is computed by:

$$
\begin{aligned}
&\text{MAD}=\text{median}\left(|\mathbf{d}_1-\text{median}(\mathbf{d}_1)|\right)\\
&\alpha=\frac{\text{MAD}}{0.6745}\sqrt{2\log N}
\end{aligned}
\tag{17}
$$

where MAD is the median absolute deviation at the finest scale\. The constant $0.6745$ ensures consistency with the standard deviation under a Gaussian noise assumption, and $N$ denotes the total number of wavelet coefficients across all scales\.
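Eq. 17 translates into a short standard-library function. The sketch below assumes the finest-scale coefficients are already available as a list of floats and that the natural logarithm is intended; the name `universal_threshold` and the toy coefficients are illustrative only. How the threshold is then applied (Eq. 7) is outside this snippet.

```python
import math
from statistics import median


def universal_threshold(d1, n_total):
    """Universal threshold of Eq. 17.

    d1      -- finest-scale (highest-frequency) wavelet coefficients
    n_total -- total number of wavelet coefficients across all scales
    """
    center = median(d1)
    mad = median([abs(c - center) for c in d1])  # median absolute deviation
    sigma = mad / 0.6745                         # robust noise-level estimate
    return sigma * math.sqrt(2.0 * math.log(n_total))


# Toy coefficients: mostly small values plus one salient spike.
d1_example = [0.1, -0.2, 0.05, 1.5, -0.1, 0.0, 0.15, -0.05]
alpha = universal_threshold(d1_example, n_total=16)
```

Because the estimate is based on medians, the single large coefficient barely moves the threshold, which is what lets it separate salient coefficients from noise.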

### A\.4 Loss Weights Adjustment

We use the widely adopted adaptive loss weighting strategy for image tokenizers \[[13](https://arxiv.org/html/2606.09861#bib.bib2)\] to balance the loss terms in Eq\. [8](https://arxiv.org/html/2606.09861#S3.E8)\. Taking the adversarial loss $\mathcal{L}_{adv}$ as an example, its weight is computed by comparing its gradient norm with that of the base $L1$ loss:

$$
\lambda_{adv}=\frac{\|\nabla_{Dec}(\mathcal{L}_{L1})\|_2}{\|\nabla_{Dec}(\mathcal{L}_{adv})\|_2+\delta}
\tag{18}
$$

where $\nabla_{Dec}(\bullet)$ denotes the gradient with respect to the last decoder layer in the autoencoder, and $\delta=10^{-6}$ is a small constant for numerical stability\. The weights of the other two terms, $\lambda_{sp}$ and $\lambda_{hf}$, are computed in the same way\.
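The ratio in Eq. 18 is simple once the two gradients are available. The sketch below assumes they have already been obtained from an autodiff framework and passed in as NumPy arrays; the function name `adaptive_weight` and the toy gradients are illustrative.

```python
import numpy as np


def adaptive_weight(grad_l1, grad_term, delta=1e-6):
    """Eq. 18: ratio of gradient L2 norms at the last decoder layer.

    grad_l1   -- gradient of the base L1 loss w.r.t. the last decoder layer
    grad_term -- gradient of the term being weighted (e.g. adversarial loss)
    """
    norm_l1 = np.linalg.norm(np.ravel(grad_l1))
    norm_term = np.linalg.norm(np.ravel(grad_term))
    return norm_l1 / (norm_term + delta)


# Toy gradients: the adversarial gradient is 10x smaller than the L1 gradient,
# so its loss receives a weight of roughly 10.
g_l1 = np.array([3.0, 4.0])    # norm 5.0
g_adv = np.array([0.0, 0.5])   # norm 0.5
lambda_adv = adaptive_weight(g_l1, g_adv)
```

The effect is that each auxiliary term contributes a gradient of roughly the same magnitude as the base loss, and `delta` keeps the weight finite when the auxiliary gradient vanishes.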

## Appendix B Details of Pretraining Protocols

### B\.1 Datasets

We pretrain UniTok\-FM on the union of two large\-scale time series corpora: *GIFT\-Pretrain* and the *Chronos\-Dataset*\. Below, we describe each dataset and clarify how they are combined while strictly avoiding test\-set leakage\.

1\) GIFT\-Pretrain \[[1](https://arxiv.org/html/2606.09861#bib.bib6)\]: This is a large\-scale corpus released alongside the GIFT\-Eval benchmark\. A strict split\-checking procedure is applied to ensure that no test data from GIFT\-Eval appears in the pretraining set, guaranteeing a fully zero\-shot evaluation for TSFMs trained on it\. The dataset consists of 88 sub\-datasets spanning 7 domains and 13 sampling frequencies\. The total number of time points in GIFT\-Pretrain is 230B\. This dataset is publicly available at [https://huggingface\.co/datasets/Salesforce/GiftEvalPretrain](https://huggingface.co/datasets/Salesforce/GiftEvalPretrain)\.

These two datasets partially overlap\. We carefully construct their union and avoid test\-set leakage by adding the following subsets from Chronos\-Dataset to GIFT\-Pretrain: dominick, ercot, exchange\_rate, mexico\_city\_bikes, training\_corpus \(kernel\_synth\_1m, tsmixup\_10m\), ushcn\_daily, weatherbench \(hourly, daily, weekly\)\.

### B\.2 Hyperparameters in UniTok

The encoder in the UniTok autoencoder has $S=4$ blocks\. Each block contains 4 sub\-blocks, and each sub\-block consists of a 1D causal convolution layer followed by a causal multi\-head self\-attention layer\. For the convolution layers, we use a kernel size of 9, which matches the number of positions covered by the commonly used $3\times 3$ kernels in 2D image convolution\. For attention layers, we employ 8 attention heads\. The hidden dimensions of the four encoder blocks are set to $[128,256,256,512]$, respectively\. The decoder mirrors the encoder architecture in a symmetric manner\.

For FSQ, we use quantization levels of $[8,8,6,5]$, corresponding to a codebook size of 1,920 for the normalized series\. In addition, we include four special tokens, $\langle\text{SOS}\rangle,\langle\text{EOS}\rangle,\langle\text{SEP}\rangle,\langle\text{PAD}\rangle$, and 16 tokens representing scale statistics\. The final codebook size for UniTok is 1,940\.
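The vocabulary size follows from the FSQ levels by simple arithmetic, which the short check below spells out. The variable names are illustrative; the numbers are the ones stated in this section and in Appendix A.1.

```python
import math

fsq_levels = [8, 8, 6, 5]            # per-dimension quantization levels L_i
n_series_codes = math.prod(fsq_levels)  # implicit FSQ codebook: product of L_i
n_special_tokens = 4                 # <SOS>, <EOS>, <SEP>, <PAD>
n_scale_tokens = 2 * 8               # scale statistic tokens (Appendix A.1)

vocab_size = n_series_codes + n_special_tokens + n_scale_tokens
```

This gives 1,920 codes for the normalized series and a total vocabulary of 1,940 tokens.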

Overall, UniTok contains approximately 113M parameters and supports TS with a maximum length of 2,048\.

### B\.3 Hyperparameters in UniTok\-FM

We directly implement the Qwen3 LLM backbone using the Hugging Face Transformers library \([https://huggingface\.co/docs/transformers/model\_doc/qwen3](https://huggingface.co/docs/transformers/model_doc/qwen3)\), with hyperparameters specified in Tab\. [6](https://arxiv.org/html/2606.09861#A2.T6), resulting in a 129M\-parameter Qwen3 backbone\.

During pretraining, the context window is truncated to a maximum length of 800 tokens\. This corresponds approximately to concatenating 5 series of length 2,048 or 8 series of length 1,024\.

Table 6: Hyperparameters of the Qwen3 backbone used in UniTok\-FM\. All unspecified hyperparameters follow the default Qwen3 configuration\.

| Hyper\-parameter | Value |
| --- | --- |
| vocab\_size | 1,940 |
| hidden\_size | 1,024 |
| intermediate\_size | 3,072 |
| num\_hidden\_layers | 8 |
| num\_attention\_heads | 16 |
| num\_key\_value\_heads | 8 |
| max\_position\_embeddings | 40,960 |
| tie\_word\_embeddings | TRUE |
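For reference, the values in Table 6 can be collected into keyword arguments for the Transformers configuration class. This is a sketch rather than the authors' training script: the dictionary name `qwen3_kwargs` is illustrative, and the commented-out lines assume an installed `transformers` release that includes Qwen3 support.

```python
# Hyperparameters copied from Table 6; everything else is left at the
# library defaults, as the table caption states.
qwen3_kwargs = {
    "vocab_size": 1940,
    "hidden_size": 1024,
    "intermediate_size": 3072,
    "num_hidden_layers": 8,
    "num_attention_heads": 16,
    "num_key_value_heads": 8,
    "max_position_embeddings": 40960,
    "tie_word_embeddings": True,
}

# Assumed usage (requires the `transformers` package with Qwen3 support):
# from transformers import Qwen3Config, Qwen3ForCausalLM
# model = Qwen3ForCausalLM(Qwen3Config(**qwen3_kwargs))

# Grouped-query attention: number of query heads sharing one key/value head.
queries_per_kv_head = (
    qwen3_kwargs["num_attention_heads"] // qwen3_kwargs["num_key_value_heads"]
)
```

With 16 query heads and 8 key/value heads, every key/value head is shared by two query heads.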

## Appendix C Details of Benchmark Protocols

### C\.1 Zero\-Shot & Prompt\-Boosted Forecasting

Datasets\. Forecasting performance is evaluated on GIFT\-Eval \[[1](https://arxiv.org/html/2606.09861#bib.bib6)\], which comprises 23 datasets across multiple sampling frequencies and prediction horizons \(short\-, medium\-, and long\-term\)\. Each evaluation task follows the format “Dataset/Frequency/Prediction Term” \(e\.g\., bitbrains\_fast\_storage/5T/long\), resulting in 97 tasks in total\.

For each task, multiple target windows are generated using non\-overlapping rolling windows\. When predicting a given target window, the entire historical series preceding it is provided to the model; this history can be extremely long \(e\.g\., exceeding 100K points for Electricity/15T/short\)\. Evaluated models are allowed to select an arbitrary lookback length based on their own design\.

Metric Aggregation\. After computing metrics for each individual task, GIFT\-Eval aggregates results across the 97 tasks using the following procedure:

1\) Normalization by Seasonal\-Naive: For each task $i$, the raw score $s_i$ is normalized by the corresponding score of Seasonal\-Naive, $s_i^{(season)}$:

$$
\widetilde{s}_i=\frac{s_i}{s_i^{(season)}}
\tag{19}
$$

This normalization reflects the performance of the evaluated model relative to the Seasonal\-Naive baseline\.

2\) Geometric Mean Aggregation: The final aggregated score is computed as the geometric mean of the normalized scores across all $N=97$ tasks:

$$
s_{agg}=\Big(\prod_{i=1}^{N}\widetilde{s}_i\Big)^{1/N}
\tag{20}
$$
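The two aggregation steps (Eqs. 19 and 20) can be sketched as follows. The function name `aggregate_scores` and the toy numbers are illustrative; the geometric mean is computed in log space, which assumes all scores are strictly positive.

```python
import math


def aggregate_scores(model_scores, seasonal_naive_scores):
    """Normalize each task score by Seasonal-Naive (Eq. 19), then take the
    geometric mean over tasks (Eq. 20)."""
    ratios = [s / b for s, b in zip(model_scores, seasonal_naive_scores)]
    mean_log = sum(math.log(r) for r in ratios) / len(ratios)
    return math.exp(mean_log)


# Toy example with two tasks: the model halves the baseline error on one task
# and doubles it on the other, so the aggregate is 1 (on par with the baseline).
s_agg = aggregate_scores([0.5, 4.0], [1.0, 2.0])
```

Unlike an arithmetic mean, the geometric mean treats a 2x improvement and a 2x degradation symmetrically, so no single task with a large raw score dominates the aggregate.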

### C\.2 Few\-Shot Generation

Datasets\. We adopt four real\-world datasets from Diffusion\-TS \[[57](https://arxiv.org/html/2606.09861#bib.bib37)\] \(i\.e\., Stocks, ETTh, Energy, fMRI\) for generation evaluation\. For each dataset, only the first channel is used\. A sliding window of length 336 with a stride of 1 is applied to construct a sample pool\. Depending on the data budget, we randomly select 5, 200, or 1,000 samples from this pool for model training or in\-context inference\. Regardless of the budget, each model is required to generate the same number of samples as the full sample pool\.
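The sample-pool construction described above can be sketched with the standard library. The helper names, the fixed seed, and the synthetic series are assumptions made for illustration; the window length, stride, and budget sizes are the ones stated in the text.

```python
import random


def build_sample_pool(series, window=336, stride=1):
    """Slide a fixed-length window over one channel to form the sample pool."""
    last_start = len(series) - window
    return [series[i:i + window] for i in range(0, last_start + 1, stride)]


def draw_budget(pool, budget, seed=0):
    """Randomly pick `budget` samples (5, 200, or 1,000 in the paper)."""
    return random.Random(seed).sample(pool, budget)


# Synthetic stand-in for the first channel of a dataset.
series = [float(i) for i in range(1000)]
pool = build_sample_pool(series)
few_shot_samples = draw_budget(pool, budget=5)
n_to_generate = len(pool)  # every model generates as many samples as the pool
```

With stride 1, a series of 1,000 points yields 665 overlapping windows, and the evaluated generator is asked for that many samples regardless of the budget it was given.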

Metrics\. We evaluate generation quality using the predictive score and the discriminative score\.

Predictive Score: An RNN predictor is trained on the generated samples, where the first 335 points are used to predict the last point\. The trained predictor is then evaluated on real data\. To account for scale differences across datasets, we report the $R^2$ \(coefficient of determination\)\. Given real–prediction pairs $\{(y_i,\widetilde{y}_i)\}_{i=1}^{N}$, $R^2$ is computed as

$$
R^2=1-\frac{\sum_i(y_i-\widetilde{y}_i)^2}{\sum_i(y_i-\bar{y})^2},\quad \bar{y}=\frac{1}{N}\sum_{i=1}^{N}y_i
\tag{21}
$$

Higher values indicate better performance\. Intuitively, this metric reflects how well the generator captures the underlying dynamics: if the generated data preserves the true dynamics, a predictor trained on them should generalize well to real samples\.
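A plain-Python version of Eq. 21 makes the range of the predictive score explicit. The function name `r2_score` and the toy values are illustrative; the RNN predictor itself is not part of this sketch.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination, Eq. 21."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_true) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


y_real = [1.0, 2.0, 3.0]
perfect = r2_score(y_real, [1.0, 2.0, 3.0])    # 1.0
mean_only = r2_score(y_real, [2.0, 2.0, 2.0])  # 0.0
reversed_ = r2_score(y_real, [3.0, 2.0, 1.0])  # -3.0
```

A score of 1 means perfect prediction, 0 matches a constant predictor that always outputs the mean of the real targets, and negative values mean the predictor does worse than that constant, which is how negative entries in the predictive-score results should be read.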

Discriminative Score: 80% of both the real and the generated data are used to construct a training set, on which an RNN discriminator is trained to distinguish generated data from real data\. The discriminator is then evaluated on the remaining 20% held\-out samples\. The discriminative score is defined as

$$
\text{Disc}=|\text{Accuracy}-0.5|
\tag{22}
$$

Lower values indicate better performance: a score of 0 means the discriminator cannot distinguish generated samples from real ones, while 0\.5 indicates perfect separability\. Intuitively, this metric measures the realism of the generated data: an ideal generator produces samples that are indistinguishable from real data\.
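Given the discriminator's predictions on the held-out 20%, Eq. 22 reduces to a one-line computation. The sketch below assumes binary labels (1 for real, 0 for generated) and an illustrative function name; training the RNN discriminator is outside its scope.

```python
def discriminative_score(labels, predictions):
    """Eq. 22: absolute gap between held-out accuracy and chance level."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    accuracy = correct / len(labels)
    return abs(accuracy - 0.5)


held_out_labels = [1, 1, 0, 0]  # 1 = real sample, 0 = generated sample
separable = discriminative_score(held_out_labels, [1, 1, 0, 0])  # 0.5
at_chance = discriminative_score(held_out_labels, [1, 0, 1, 0])  # 0.0
```

Because of the absolute value, a discriminator that is consistently wrong also yields a high score, since its outputs could simply be flipped to separate the two sets.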

These two metrics were introduced in 2019 \[[54](https://arxiv.org/html/2606.09861#bib.bib38)\], with the original implementation based on TensorFlow 1, which is no longer maintained\. As a result, the original training procedure is often unstable and may fail to converge\. We therefore reimplement both metrics in PyTorch 2 and adopt several modern practices: 1\) using AdamW as the optimizer; 2\) applying gradient clipping; 3\) extending training to 100K optimization steps while selecting the best checkpoint for evaluation\. These modifications stabilize the training process and ensure reliable convergence, leading to a more robust evaluation\.

### C\.3 Few\-Shot Classification

The UCR Archive \[[9](https://arxiv.org/html/2606.09861#bib.bib20)\] is publicly available at [https://www\.cs\.ucr\.edu/%7Eeamonn/time\_series\_data\_2018/](https://www.cs.ucr.edu/%7Eeamonn/time_series_data_2018/), and contains 128 univariate datasets, each with predefined training and test splits\. We compute the number of training samples per class for each dataset and select those with at most 20 instances per class, forming a subset of 53 datasets referred to as UCR\-FewShot\. The selected datasets are listed in Tab\. [11](https://arxiv.org/html/2606.09861#A4.T11)\.
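The selection rule can be sketched as a filter over training labels. Two assumptions are made here and should be read as such: the per-class count is taken as the average number of training samples per class (the full results table lists fractional values such as 4.2), and the toy label lists below are hypothetical stand-ins for real UCR training splits.

```python
from collections import Counter


def train_per_class(train_labels):
    """Average number of training samples per class (assumed definition)."""
    n_classes = len(Counter(train_labels))
    return len(train_labels) / n_classes


def select_few_shot(datasets, max_per_class=20):
    """Names of datasets with at most `max_per_class` training samples per class."""
    return sorted(
        name for name, labels in datasets.items()
        if train_per_class(labels) <= max_per_class
    )


# Hypothetical training-label lists keyed by dataset name.
toy_datasets = {
    "tiny": [0, 0, 1, 1, 1],         # 2.5 per class -> kept
    "edge": [0] * 20 + [1] * 20,     # exactly 20 per class -> kept
    "large": [0] * 30 + [1] * 30,    # 30 per class -> dropped
}
few_shot_names = select_few_shot(toy_datasets)
```

Applied to the 128 UCR training splits, a filter of this kind is what produces the 53-dataset UCR-FewShot subset described above.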

For baseline methods, training splits are used to train or fine\-tune the models\. In contrast, UniTok\-FM uses the training samples as in\-context prompts\. Classification accuracy is evaluated on the test split of each dataset, and the average accuracy across the 53 datasets is reported as the final metric\.

## Appendix D Full Results

Full forecasting results for each dataset are reported in Tab\. [7](https://arxiv.org/html/2606.09861#A4.T7) \(CRPS\), Tab\. [8](https://arxiv.org/html/2606.09861#A4.T8) \(MAPE\), and Tab\. [9](https://arxiv.org/html/2606.09861#A4.T9) \(MASE\)\. Full generation results are presented in Tab\. [10](https://arxiv.org/html/2606.09861#A4.T10), and full classification results are shown in Tab\. [11](https://arxiv.org/html/2606.09861#A4.T11)\.

Table 7: Full forecasting CRPS \(lower is better\) on GIFT\-Eval\. The Task column follows the format “Dataset / Frequency / Prediction Term”\. The best\-performing model on each dataset is highlighted in bold\. Results other than those of UniTok\-FM \(ZeroShot\) and UniTok\-FM \(Prompt\) are from the GIFT\-Eval leaderboard: [https://huggingface\.co/spaces/Salesforce/GIFT\-Eval](https://huggingface.co/spaces/Salesforce/GIFT-Eval)\.

Table 8: Full forecasting MAPE \(lower is better\) on GIFT\-Eval\. During aggregation, N/A values are replaced with 1\.0\.

Table 9: Full forecasting MASE \(lower is better\) on GIFT\-Eval\.

Table 10: Full generation performance on Stocks, ETTh, Energy and fMRI\. \#Train denotes the number of training examples for baseline models, while \#Prompt denotes the number of in\-context prompt examples for UniTok\-FM\. The best\-performing model on each dataset is highlighted in bold\. Column headers give the method followed by \#Train / \#Prompt in parentheses\.

| Metric | Dataset | Diffusion-TS (5) | Diffusion-TS (200) | Diffusion-TS (1000) | SDformer (5) | SDformer (200) | SDformer (1000) | TimeGAN (5) | TimeGAN (200) | TimeGAN (1000) | TimeVAE (5) | TimeVAE (200) | TimeVAE (1000) | Cot-GAN (5) | Cot-GAN (200) | Cot-GAN (1000) | UniTok-FM (5) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Predictive Score ↑ | Stocks | 0.796 | 0.997 | 0.995 | 0.820 | 0.998 | 0.999 | 0.524 | 0.988 | 0.952 | 0.728 | 0.997 | 0.997 | 0.962 | 0.998 | 0.997 | 0.998 |
| Predictive Score ↑ | ETTh | 0.347 | 0.822 | 0.874 | 0.229 | 0.786 | 0.867 | 0.298 | -0.700 | -0.622 | 0.626 | 0.866 | 0.891 | 0.775 | 0.700 | 0.824 | 0.870 |
| Predictive Score ↑ | Energy | 0.162 | 0.451 | 0.466 | 0.166 | 0.123 | 0.342 | 0.294 | 0.502 | -0.077 | 0.341 | 0.515 | 0.548 | 0.323 | 0.517 | 0.539 | 0.432 |
| Predictive Score ↑ | fMRI | -0.134 | 0.114 | -0.007 | -0.256 | -0.558 | 0.086 | -1.389 | -0.709 | -0.051 | -1.068 | -0.003 | 0.012 | 0.020 | 0.078 | 0.049 | 0.103 |
| Predictive Score ↑ | Average | 0.293 | 0.596 | 0.582 | 0.239 | 0.337 | 0.573 | -0.068 | 0.020 | 0.051 | 0.157 | 0.594 | 0.612 | 0.520 | 0.573 | 0.602 | 0.601 |
| Discriminative Score ↓ | Stocks | 0.493 | 0.181 | 0.243 | 0.471 | 0.040 | 0.021 | 0.500 | 0.500 | 0.500 | 0.457 | 0.103 | 0.256 | 0.376 | 0.207 | 0.447 | 0.260 |
| Discriminative Score ↓ | ETTh | 0.499 | 0.201 | 0.260 | 0.493 | 0.144 | 0.070 | 0.500 | 0.499 | 0.500 | 0.481 | 0.363 | 0.439 | 0.473 | 0.500 | 0.499 | 0.467 |
| Discriminative Score ↓ | Energy | 0.497 | 0.366 | 0.457 | 0.486 | 0.176 | 0.083 | 0.500 | 0.500 | 0.500 | 0.495 | 0.494 | 0.498 | 0.494 | 0.499 | 0.500 | 0.455 |
| Discriminative Score ↓ | fMRI | 0.499 | 0.496 | 0.445 | 0.499 | 0.228 | 0.212 | 0.500 | 0.500 | 0.500 | 0.497 | 0.500 | 0.500 | 0.461 | 0.390 | 0.497 | 0.499 |
| Discriminative Score ↓ | Average | 0.497 | 0.311 | 0.351 | 0.487 | 0.147 | 0.096 | 0.500 | 0.500 | 0.500 | 0.483 | 0.365 | 0.423 | 0.451 | 0.399 | 0.486 | 0.420 |

Table 11: Full classification accuracy on UCR\-FewShot\. \#Train per Class denotes the number of training samples per class\. The best\-performing model on each dataset is highlighted in bold\. Results of Mantis, UniTS and Moment are from \[[15](https://arxiv.org/html/2606.09861#bib.bib28)\]\. “/” indicates out\-of\-memory errors encountered during UniTS fine\-tuning due to long sequence lengths, for which results are unavailable\.

| Dataset | #Train per Class | KNN | TStree | RDST | FCN | LITE | Inception | Mantis | UniTS | Moment | UniTok-FM |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Fungi | 1 | 0.839 | 0.737 | 0.570 | 0.199 | 0.167 | 0.081 | 0.778 | 0.627 | 0.996 | 0.715 |
| PigAirwayPressure | 2 | 0.091 | 0.072 | 0.072 | 0.111 | 0.149 | 0.077 | 0.484 | / | 0.119 | 0.764 |
| PigArtPressure | 2 | 0.288 | 0.106 | 0.471 | 0.038 | 0.067 | 0.058 | 0.910 | / | 0.611 | 0.966 |
| PigCVP | 2 | 0.139 | 0.072 | 0.365 | 0.072 | 0.130 | 0.144 | 0.784 | / | 0.609 | 0.904 |
| DiatomSizeReduction | 4 | 0.935 | 0.964 | 0.859 | 0.301 | 0.307 | 0.105 | 0.968 | 0.912 | 0.887 | 0.863 |
| Symbols | 4.2 | 0.899 | 0.835 | 0.859 | 0.165 | 0.174 | 0.338 | 0.989 | 0.821 | 0.938 | 0.928 |
| PickupGestureWiimoteZ | 5 | 0.300 | 0.500 | 0.480 | 0.160 | 0.260 | 0.060 | 0.760 | 0.560 | 0.680 | 0.520 |
| Rock | 5 | 0.640 | 0.580 | 0.580 | 0.240 | 0.340 | 0.300 | 0.713 | 0.613 | 0.840 | 0.740 |
| ShakeGestureWiimoteZ | 5 | 0.420 | 0.580 | 0.720 | 0.220 | 0.500 | 0.100 | 0.920 | 0.607 | 0.853 | 0.820 |
| Phoneme | 5.5 | 0.109 | 0.073 | 0.203 | 0.249 | 0.276 | 0.297 | 0.321 | 0.143 | 0.275 | 0.201 |
| InsectEPGSmallTrain | 5.7 | 1.000 | 1.000 | 0.920 | 0.526 | 0.831 | 1.000 | 1.000 | 0.667 | 0.850 | 1.000 |
| Beef | 6 | 0.667 | 0.600 | 0.667 | 0.200 | 0.333 | 0.333 | 0.700 | 0.667 | 0.744 | 0.633 |
| FaceFour | 6 | 0.784 | 0.830 | 0.784 | 0.455 | 0.159 | 0.500 | 0.977 | 0.652 | 0.777 | 0.841 |
| Mallat | 6.9 | 0.914 | 0.889 | 0.716 | 0.125 | 0.371 | 0.261 | 0.940 | 0.874 | 0.859 | 0.812 |
| OliveOil | 7.5 | 0.867 | 0.900 | 0.867 | 0.400 | 0.400 | 0.300 | 0.889 | 0.478 | 0.900 | 0.667 |
| GestureMidAirD1 | 8 | 0.292 | 0.354 | 0.538 | 0.146 | 0.308 | 0.323 | 0.769 | 0.526 | 0.674 | 0.338 |
| GestureMidAirD2 | 8 | 0.223 | 0.354 | 0.385 | 0.069 | 0.269 | 0.100 | 0.674 | 0.405 | 0.574 | 0.346 |
| GestureMidAirD3 | 8 | 0.108 | 0.162 | 0.285 | 0.046 | 0.069 | 0.031 | 0.408 | 0.295 | 0.359 | 0.138 |
| FiftyWords | 9 | 0.631 | 0.415 | 0.404 | 0.299 | 0.549 | 0.767 | 0.814 | 0.623 | 0.678 | 0.653 |
| ACSF1 | 10 | 0.540 | 0.610 | 0.560 | 0.620 | 0.440 | 0.660 | 0.743 | 0.693 | 0.750 | 0.440 |
| BME | 10 | 0.827 | 0.987 | 0.753 | 0.333 | 0.433 | 0.333 | 0.996 | 0.887 | 0.976 | 0.847 |
| BeetleFly | 10 | 0.750 | 0.800 | 0.750 | 0.500 | 0.500 | 0.500 | 0.883 | 0.717 | 0.950 | 0.700 |
| BirdChicken | 10 | 0.550 | 0.850 | 0.650 | 0.500 | 0.500 | 0.500 | 0.900 | 0.617 | 0.850 | 0.850 |
| CBF | 10 | 0.852 | 0.800 | 0.767 | 0.426 | 0.589 | 0.993 | 0.985 | 0.851 | 0.941 | 0.960 |
| Chinatown | 10 | 0.945 | 0.977 | 0.971 | 0.927 | 0.985 | 0.974 | 0.972 | 0.978 | 0.981 | 0.974 |
| CinCECGTorso | 10 | 0.897 | 0.554 | 0.810 | 0.248 | 0.348 | 0.253 | 0.781 | 0.585 | 0.700 | 0.819 |
| DodgerLoopGame | 10 | 0.899 | 0.630 | 0.681 | 0.486 | 0.486 | 0.486 | 0.884 | 0.831 | 0.812 | 0.826 |
| DodgerLoopWeekend | 10 | 0.986 | 0.899 | 0.899 | 0.746 | 0.746 | 0.746 | 0.978 | 0.964 | 0.961 | 0.949 |
| Lightning7 | 10 | 0.575 | 0.575 | 0.575 | 0.329 | 0.699 | 0.808 | 0.740 | 0.557 | 0.676 | 0.712 |
| MoteStrain | 10 | 0.879 | 0.797 | 0.859 | 0.466 | 0.539 | 0.788 | 0.907 | 0.834 | 0.891 | 0.924 |
| ShapeletSim | 10 | 0.539 | 0.472 | 0.561 | 0.500 | 0.500 | 0.772 | 1.000 | 0.698 | 0.965 | 0.750 |
| ShapesAll | 10 | 0.752 | 0.568 | 0.515 | 0.172 | 0.725 | 0.827 | 0.891 | 0.670 | 0.824 | 0.860 |
| SonyAIBORobotSurface1 | 10 | 0.696 | 0.794 | 0.800 | 0.429 | 0.429 | 0.429 | 0.814 | 0.721 | 0.809 | 0.835 |
| Adiac | 10.5 | 0.611 | 0.501 | 0.430 | 0.256 | 0.414 | 0.327 | 0.777 | 0.577 | 0.789 | 0.629 |
| WordSynonyms | 10.7 | 0.618 | 0.368 | 0.448 | 0.332 | 0.478 | 0.621 | 0.743 | 0.523 | 0.600 | 0.558 |
| DodgerLoopDay | 11.1 | 0.563 | 0.463 | 0.388 | 0.238 | 0.400 | 0.413 | 0.625 | 0.450 | 0.454 | 0.663 |
| ECGFiveDays | 11.5 | 0.797 | 0.834 | 0.988 | 0.503 | 1.000 | 0.999 | 0.915 | 0.846 | 0.856 | 0.930 |
| TwoLeadECG | 11.5 | 0.747 | 0.741 | 0.963 | 0.500 | 0.500 | 0.665 | 0.994 | 0.755 | 0.965 | 0.873 |
| ArrowHead | 12 | 0.800 | 0.646 | 0.640 | 0.303 | 0.423 | 0.303 | 0.821 | 0.690 | 0.808 | 0.834 |
| UMD | 12 | 0.806 | 0.486 | 0.694 | 0.160 | 0.458 | 0.333 | 0.991 | 0.806 | 0.970 | 0.938 |
| SonyAIBORobotSurface2 | 13.5 | 0.859 | 0.723 | 0.804 | 0.652 | 0.617 | 0.927 | 0.900 | 0.829 | 0.835 | 0.880 |
| Coffee | 14 | 1.000 | 0.893 | 0.964 | 0.536 | 0.536 | 0.536 | 1.000 | 1.000 | 0.893 | 0.857 |
| FreezerSmallTrain | 14 | 0.676 | 0.845 | 0.952 | 0.500 | 0.500 | 0.492 | 0.967 | 0.671 | 0.776 | 0.854 |
| FacesUCR | 14.3 | 0.769 | 0.552 | 0.640 | 0.789 | 0.939 | 0.969 | 0.915 | 0.720 | 0.795 | 0.811 |
| InlineSkate | 14.3 | 0.342 | 0.280 | 0.235 | 0.173 | 0.165 | 0.184 | 0.363 | / | 0.318 | 0.296 |
| Car | 15 | 0.733 | 0.550 | 0.817 | 0.233 | 0.367 | 0.283 | 0.872 | 0.628 | 0.794 | 0.683 |
| Plane | 15 | 0.962 | 0.981 | 0.990 | 0.448 | 1.000 | 1.000 | 1.000 | 0.952 | 0.984 | 1.000 |
| ToeSegmentation2 | 18 | 0.808 | 0.592 | 0.846 | 0.185 | 0.185 | 0.315 | 0.923 | 0.797 | 0.846 | 0\.9
31HouseTwenty200\.6810\.6970\.8990\.8320\.9410\.8490\.9800\.8350\.9360\.882InsectWingbeatSound200\.5620\.5190\.4790\.3170\.6150\.6020\.5960\.5730\.6160\.604Meat200\.9330\.9170\.9170\.3330\.3330\.3330\.9330\.9110\.9440\.750MixedShapesSmallTrain200\.8350\.7250\.8970\.4150\.8420\.7080\.9530\.7280\.8390\.839ToeSegmentation1200\.6800\.6400\.7680\.5260\.7110\.9170\.9660\.8010\.9240\.934Average10\.20\.6720\.6280\.6730\.3570\.4720\.4910\.8400\.6970\.7780\.755\#WinN/A520036271910
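
The #Win row of Table 11 is consistent with counting a win for every model that reaches the top accuracy on a dataset, so ties are credited to all tied models; tallying the table this way gives the reported counts (for example, 27 for Mantis and 10 for UniTok-FM). The sketch below illustrates that tally on three rows copied from the table. It is an illustration under that reading, not the authors' evaluation script: the names `MODELS`, `ROWS`, and `count_wins` are hypothetical, and representing an unavailable “/” entry as `None` is an assumption.

```python
# Sketch: tally per-model wins on a few rows of Table 11.
# Assumption: a tie credits every model that reaches the best accuracy.
# Assumption: unavailable results ("/") are stored as None and never win.
MODELS = ["KNN", "TStree", "RDST", "FCN", "LITE", "Inception",
          "Mantis", "UniTS", "Moment", "UniTok-FM"]

# Accuracies copied from Table 11, in the column order of MODELS.
ROWS = {
    "PigCVP": [0.139, 0.072, 0.365, 0.072, 0.130, 0.144, 0.784, None, 0.609, 0.904],
    "Coffee": [1.000, 0.893, 0.964, 0.536, 0.536, 0.536, 1.000, 1.000, 0.893, 0.857],
    "Plane":  [0.962, 0.981, 0.990, 0.448, 1.000, 1.000, 1.000, 0.952, 0.984, 1.000],
}


def count_wins(rows, models):
    """Return {model: number of datasets on which it attains the best score}."""
    wins = {model: 0 for model in models}
    for scores in rows.values():
        # Best available accuracy on this dataset, ignoring missing entries.
        best = max(score for score in scores if score is not None)
        for model, score in zip(models, scores):
            if score is not None and score == best:
                wins[model] += 1
    return wins


wins = count_wins(ROWS, MODELS)
print(wins)
```

On these three rows, Coffee is shared by KNN, Mantis, and UniTS, and Plane by LITE, Inception, Mantis, and UniTok-FM, which is why the per-model wins can sum to more than the number of datasets.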

## Appendix E Broader Impacts

UniTok and UniTok-FM advance time series modeling by enabling unified, efficient processing via next-token prediction, supporting training-free, in-context analysis across domains such as finance and energy. However, their generative capabilities could be misused to fabricate information or manipulate temporal data. Furthermore, incorrect outputs in high-stakes areas such as healthcare or power grid management could lead to systemic failures.
