Chronicle: A Multimodal Foundation Model for Joint Language and Time Series Understanding

arXiv cs.LG 05/21/26, 04:00 AM Papers
multimodal foundation-model time-series natural-language joint-training transformer nlu
Summary
Chronicle is a 324M-parameter decoder-only transformer pretrained from scratch on both natural language and time series, achieving competitive performance on NLU and time series classification tasks, and setting new state-of-the-art for frozen-embedding time series classification on UCR/UEA datasets.
arXiv:2605.20268v1 Announce Type: new Abstract: Real-world time series come with text: metadata, descriptions, news, reports. Yet time series foundation models process numerical sequences in isolation, and the multimodal text-and-time-series models that attempt to bridge the two all adapt a pretrained language model post hoc, inheriting representations shaped without ever seeing temporal data. These models are also evaluated almost exclusively against other multimodal baselines, not against the strongest unimodal foundation models in either domain, leaving open whether joint training is needed at all. We present Chronicle, a compact 324M-parameter decoder-only transformer trained from scratch on natural language and time series within a single unified architecture. Both modalities share the same transformer blocks, attention mechanism, and residual stream; the bulk of pretraining uses unimodal batches so cross-modal capability emerges purely from shared parameters, with a short alignment stage that interleaves the two. To our knowledge, Chronicle is the first model jointly pretrained on text and time series from scratch, and the first multimodal model evaluated against dedicated foundation models in both domains. It matches Gemma-3-270M-PT on 19 NLU tasks, sets a new bar for frozen-embedding time series classification on 24 UCR/UEA datasets, and produces multimodal forecasts on Time-MMD that beat every supervised fusion baseline, all from a single backbone.
# A Multimodal Foundation Model for Joint Language and Time Series Understanding
Source: [https://arxiv.org/html/2605.20268](https://arxiv.org/html/2605.20268)
Paul Quinlan¹,², Jeremy Levasseur¹, Qingguo Li³, Xiaodan Zhu²
¹InertialAI; ²Department of Electrical and Computer Engineering, Queen’s University; ³Department of Mechanical and Materials Engineering, Queen’s University

###### Abstract

Real-world time series come with text: metadata, descriptions, news, reports. Yet time series foundation models process numerical sequences in isolation, and the multimodal text-and-time-series models that attempt to bridge the two all adapt a pretrained language model post hoc, inheriting representations shaped without ever seeing temporal data. These models are also evaluated almost exclusively against other multimodal baselines, not against the strongest unimodal foundation models in either domain, leaving open whether joint training is needed at all. We present Chronicle, a compact 324M-parameter decoder-only transformer trained from scratch on natural language and time series within a single unified architecture. Both modalities share the same transformer blocks, attention mechanism, and residual stream; the bulk of pretraining uses unimodal batches so cross-modal capability emerges purely from shared parameters, with a short alignment stage that interleaves the two. To our knowledge, Chronicle is the first model jointly pretrained on text and time series from scratch, and the first multimodal model evaluated against dedicated foundation models in both domains. It matches Gemma-3-270M-PT on 19 NLU tasks, sets a new bar for frozen-embedding time series classification on 24 UCR/UEA datasets, and produces multimodal forecasts on Time-MMD that beat every supervised fusion baseline, all from a single backbone.

## 1 Introduction

Time series foundation models (TSFMs) have transformed forecasting into an inference-only pipeline: a single pretrained model can be applied zero-shot across diverse domains [Ansari et al., [2024](https://arxiv.org/html/2605.20268#bib.bib1), Das et al., [2024](https://arxiv.org/html/2605.20268#bib.bib3), Woo et al., [2024](https://arxiv.org/html/2605.20268#bib.bib4), Wen et al., [2026](https://arxiv.org/html/2605.20268#bib.bib7)]. Models such as Chronos-2 [Ansari et al., [2025](https://arxiv.org/html/2605.20268#bib.bib2)] and PatchTST-FM [Wen et al., [2026](https://arxiv.org/html/2605.20268#bib.bib7)] set a high bar on standardized benchmarks like GIFT-Eval [Aksu et al., [2024](https://arxiv.org/html/2605.20268#bib.bib50)]. Yet these models remain narrow specialists that process numerical sequences in isolation, with no mechanism to incorporate the textual context (metadata, domain knowledge, anomaly descriptions) that accompanies virtually every real-world time series.

A growing body of work has attempted to bridge this gap by connecting language models to time series, but these efforts share three systematic limitations. First, every existing approach starts from a pretrained language model and adapts it post hoc. LLMTIME [Gruver et al., [2023](https://arxiv.org/html/2605.20268#bib.bib22)] and GPT4MTS [Jia et al., [2024](https://arxiv.org/html/2605.20268#bib.bib32)] query frozen LLMs directly; Time-LLM [Jin et al., [2024](https://arxiv.org/html/2605.20268#bib.bib23)] and GPT4TS [Zhou et al., [2023](https://arxiv.org/html/2605.20268#bib.bib24)] add lightweight adapters; ChatTS [Xie et al., [2025](https://arxiv.org/html/2605.20268#bib.bib28)], ChatTime [Wang et al., [2024](https://arxiv.org/html/2605.20268#bib.bib31)], and MSE-ITT [Koval et al., [2025](https://arxiv.org/html/2605.20268#bib.bib82)] fine-tune large backbones (Qwen2.5-14B, LLaMA-2, LLaMA-3-8B); and MoAT [Lee et al., [2024](https://arxiv.org/html/2605.20268#bib.bib80)] and TaTs [Li et al., [2026](https://arxiv.org/html/2605.20268#bib.bib81)] fuse separate encoders via learned heads (Table [1](https://arxiv.org/html/2605.20268#S2.T1)). Because these models were pretrained on text alone, their internal representations were shaped without any exposure to temporal data, and the time series modality must adapt to a representational space not designed for it. No prior work has trained a single model from scratch on both modalities, allowing text and time series to shape each other’s representations from the beginning of training.

Second, these models are evaluated almost exclusively against other multimodal or task-specific baselines, not against state-of-the-art unimodal foundation models in either domain. Recent surveys [Liu et al., [2025](https://arxiv.org/html/2605.20268#bib.bib86)] have noted this gap, and Zhang and others [[2025](https://arxiv.org/html/2605.20268#bib.bib87)] found that the benefits of multimodality are “highly condition-dependent,” underscoring the need for rigorous unimodal baselines. Third, many of these models, particularly ChatTS, ChatTime, Chat-TS [Quinlan et al., [2026](https://arxiv.org/html/2605.20268#bib.bib88)], and MSE-ITT, target conversational reasoning about time series (question answering, summarization, explanation) and require large backbones (8 to 14B parameters) to support instruction following. This conflates two distinct goals: building general-purpose temporal *representations* versus building temporal *reasoning agents*. The question of whether a compact model can learn high-quality representations for both text and time series, without catastrophic interference, has not been addressed. Evidence from Tan et al. [[2024](https://arxiv.org/html/2605.20268#bib.bib84)], who showed that LLMs do not meaningfully improve forecasting as backbones, and Merrill et al. [[2024](https://arxiv.org/html/2605.20268#bib.bib27)], who found that LLMs struggle with text-encoded series, suggests that a different training paradigm may be required.

We present Chronicle, a 324M-parameter decoder-only transformer trained from scratch on natural language and time series within a single architecture. Both modalities share the same transformer blocks, attention mechanism, and residual stream; modality-specific components are limited to the input and output interfaces. The bulk of pretraining uses *unimodal* batches: each micro-batch contains either text tokens or time series patches, and the two modalities shape the backbone only through the shared parameters they both update. A short second stage at extended context length introduces a small fraction of interleaved text and time series sequences for explicit cross-modal alignment. At inference, text embeddings and patch embeddings can be freely interleaved within a single sequence, and cross-modal information flow arises naturally from causal self-attention without any architectural changes. Unlike prior work, our goal is not conversational reasoning but learning *general-purpose representations* that serve forecasting, classification, and embedding extraction, while retaining language understanding as a first-class capability. We evaluate Chronicle against dedicated foundation models in *each* modality on *their own benchmarks*: 19 NLU tasks against GPT-2 through LLaMA-3.2-1B; GIFT-Eval against the full public leaderboard; 24 UCR/UEA datasets against supervised models and frozen TSFM embeddings; and multimodal classification and forecasting on TimeCAP [Lee and others, [2025](https://arxiv.org/html/2605.20268#bib.bib30)] and Time-MMD [Liu and others, [2024](https://arxiv.org/html/2605.20268#bib.bib29)], following the MM-TSFlib [Liu and others, [2024](https://arxiv.org/html/2605.20268#bib.bib29)] fusion evaluation protocol. Our contributions are:

1. Joint pretraining from scratch. To our knowledge, Chronicle is the first model to learn text and time series end-to-end from random initialization within a single shared transformer backbone, rather than adapting a pretrained LLM post hoc.
2. Cross-domain evaluation against unimodal foundation models. We benchmark Chronicle against scale-matched LLMs on 19 NLU tasks *and* dedicated TSFMs on GIFT-Eval and UCR/UEA, addressing a longstanding gap in the multimodal time series literature.
3. Strong frozen-backbone downstream performance. Without per-dataset retraining, Chronicle sets a new bar for frozen-embedding TS classification, beats every supervised fusion baseline on Time-MMD, and matches Gemma-3-270M-PT on language understanding.

## 2 Related Work

We situate Chronicle within three lines of work and summarize the architectural landscape in Table [1](https://arxiv.org/html/2605.20268#S2.T1); a comprehensive discussion appears in Appendix [D](https://arxiv.org/html/2605.20268#A4).

Table 1: Positioning of Chronicle relative to prior multimodal text and time series models. *Base model*: the pretrained backbone (“—” = trained from scratch). Chronicle is the only model that trains from scratch on both modalities and evaluates against unimodal foundation models in both domains.

| Model | Params | Base model | TS input | Adaptation | Primary goal | Eval vs. TSFMs | Eval vs. LLMs | MM eval |
|---|---|---|---|---|---|---|---|---|
| LLMTIME [Gruver et al., [2023](https://arxiv.org/html/2605.20268#bib.bib22)] | 7–175B | GPT-3 / LLaMA | Digits | Frozen | Forecasting | Partial | ✗ | ✗ |
| GPT4MTS [Jia et al., [2024](https://arxiv.org/html/2605.20268#bib.bib32)] | ≫7B | GPT-3.5 / GPT-4 | Digits + prompt | Frozen | Forecasting | ✗ | ✗ | ✓ |
| Time-LLM [Jin et al., [2024](https://arxiv.org/html/2605.20268#bib.bib23)] | 7B | LLaMA-7B | Patch → text proto. | Adapter | Downstream Adaptation | Partial | ✗ | ✗ |
| GPT4TS [Zhou et al., [2023](https://arxiv.org/html/2605.20268#bib.bib24)] | 124M | GPT-2 | Patch | Adapter (norm) | Downstream Adaptation | ✗ | ✗ | ✗ |
| MoAT [Lee et al., [2024](https://arxiv.org/html/2605.20268#bib.bib80)] | Varies | Separate enc. | Patch (decomposed) | Late fusion | Downstream Adaptation | ✗ | ✗ | ✓ |
| TaTs [Li et al., [2026](https://arxiv.org/html/2605.20268#bib.bib81)] | Varies | Separate enc. | Patch (+ text var.) | Late fusion | Downstream Adaptation | ✗ | ✗ | ✓ |
| ChatTS [Xie et al., [2025](https://arxiv.org/html/2605.20268#bib.bib28)] | 14B | Qwen2.5-14B | Patch (MLP enc.) | Full FT | Reasoning | ✗ | ✗ | ✓ |
| ChatTime [Wang et al., [2024](https://arxiv.org/html/2605.20268#bib.bib31)] | 7B | LLaMA-2 | Scalar (discretized) | Full FT | Reasoning | Partial | ✗ | ✓ |
| Chat-TS [Quinlan et al., [2026](https://arxiv.org/html/2605.20268#bib.bib88)] | 8B | LLaMA-3-8B | Vocab expansion | Full FT | Reasoning | ✗ | ✓ | ✓ |
| MSE-ITT [Koval et al., [2025](https://arxiv.org/html/2605.20268#bib.bib82)] | 8B | LLaMA-3-8B | Patch (MoE experts) | Full FT + MoE | Reasoning | ✗ | ✗ | ✓ |
| Chronicle (ours) | 324M | — | Patch (interleaved) | Joint (from scratch) | Foundation model | ✓ | ✓ | ✓ |

Time series foundation models. TSFMs target zero-shot generalization across domains [Liang et al., [2024](https://arxiv.org/html/2605.20268#bib.bib12)]. Recent models span scalar tokenization (Chronos [Ansari et al., [2024](https://arxiv.org/html/2605.20268#bib.bib1)], Chronos-2 [Ansari et al., [2025](https://arxiv.org/html/2605.20268#bib.bib2)]) and patch-based encoding (PatchTST [Nie et al., [2023](https://arxiv.org/html/2605.20268#bib.bib6)], TimesFM [Das et al., [2024](https://arxiv.org/html/2605.20268#bib.bib3)], PatchTST-FM [Wen et al., [2026](https://arxiv.org/html/2605.20268#bib.bib7)], Moirai [Woo et al., [2024](https://arxiv.org/html/2605.20268#bib.bib4)], MOMENT [Goswami et al., [2024](https://arxiv.org/html/2605.20268#bib.bib9)], UniTS [Gao et al., [2024](https://arxiv.org/html/2605.20268#bib.bib11)]). We compare against these models on GIFT-Eval and UCR.

Multimodal text and time series models. Prior work falls into two categories (Table [1](https://arxiv.org/html/2605.20268#S2.T1)). *Reasoning-focused* models including ChatTS [Xie et al., [2025](https://arxiv.org/html/2605.20268#bib.bib28)], Chat-TS [Quinlan et al., [2026](https://arxiv.org/html/2605.20268#bib.bib88)], and MSE-ITT [Koval et al., [2025](https://arxiv.org/html/2605.20268#bib.bib82)] fine-tune large backbones (8–14B) on synthetic QA data and target conversational benchmarks rather than standard forecasting or classification. ChatTime [Wang et al., [2024](https://arxiv.org/html/2605.20268#bib.bib31)] instruction-tunes LLaMA-2 with discretized series but does not evaluate on the full GIFT-Eval suite or against scale-matched LLMs. *Forecasting-focused* models fuse text with time series via frozen LLMs (LLMTIME [Gruver et al., [2023](https://arxiv.org/html/2605.20268#bib.bib22)], GPT4MTS [Jia et al., [2024](https://arxiv.org/html/2605.20268#bib.bib32)]), adapters (Time-LLM [Jin et al., [2024](https://arxiv.org/html/2605.20268#bib.bib23)], GPT4TS [Zhou et al., [2023](https://arxiv.org/html/2605.20268#bib.bib24)]), or late-fusion heads (MoAT [Lee et al., [2024](https://arxiv.org/html/2605.20268#bib.bib80)], TaTs [Li et al., [2026](https://arxiv.org/html/2605.20268#bib.bib81)]). Time-MMD [Liu and others, [2024](https://arxiv.org/html/2605.20268#bib.bib29)] contributes both a benchmark and the MM-TSFlib fusion library, which we adopt as our multimodal baseline protocol.

Across both categories, no prior work evaluates against dedicated TSFMs *and* dedicated LLMs on their respective benchmarks. Tan et al. [[2024](https://arxiv.org/html/2605.20268#bib.bib84)] showed that LLM pretraining does not transfer to forecasting, and Merrill et al. [[2024](https://arxiv.org/html/2605.20268#bib.bib27)] found that LLMs struggle with text-encoded series, motivating our modality-native joint training. Chronicle differs from all prior work in three respects: training from scratch on both modalities, using a compact 324M backbone focused on representation quality rather than dialogue, and evaluating against unimodal foundation models in both domains.

Small language models. GPT-2 [Radford et al., [2019](https://arxiv.org/html/2605.20268#bib.bib13)] demonstrated that decoder-only transformers produce capable few-shot learners; subsequent compact models (Qwen2 [Yang et al., [2024](https://arxiv.org/html/2605.20268#bib.bib17)], LLaMA-3.2 [Grattafiori et al., [2024](https://arxiv.org/html/2605.20268#bib.bib16)], Gemma-3 [Gemma Team, [2025](https://arxiv.org/html/2605.20268#bib.bib61)], LFM-2 [Amini et al., [2025](https://arxiv.org/html/2605.20268#bib.bib62)]) push zero-shot understanding to strong levels at sub-1B scale. We compare against five such models on 19 NLU tasks to verify that joint training preserves language capability.

## 3 Methodology

![Refer to caption](https://arxiv.org/html/2605.20268v1/x1.png)

Figure 1: The Chronicle architecture. Text tokens and time series patches share a 16-layer decoder-only transformer; modality-specific components are limited to the input and output interfaces. Modality-specific output heads produce quantile forecasts ($\mathcal{L}_{\mathrm{QL}}$) and next-token predictions ($\mathcal{L}_{\mathrm{CE}}$); the same backbone produces frozen embeddings for downstream classification.

Chronicle departs from prior multimodal text-and-time-series work: rather than adapting a pretrained LLM post hoc, we design an architecture in which both modalities shape a shared backbone from random initialization. The result is deliberately minimal: a decoder-only transformer [Vaswani et al., [2017](https://arxiv.org/html/2605.20268#bib.bib39), Radford et al., [2019](https://arxiv.org/html/2605.20268#bib.bib13)] in which text tokens and time series patches occupy positions in a single sequence and flow through the same blocks, attention mechanism, and residual stream. Modality-specific components are confined to the interfaces: a text embedding table over a 131,072-entry BPE vocabulary and a patch projection on the input side; a tied language model head and a quantile head over $Q=21$ levels on the output side. Almost all parameters are therefore exercised by both modalities, and cross-modal information flow arises naturally from causal self-attention with no architectural additions.

### 3.1 Time Series Representation

Following PatchTST-FM [Wen et al., [2026](https://arxiv.org/html/2605.20268#bib.bib7)] and Chronos-2 [Ansari et al., [2025](https://arxiv.org/html/2605.20268#bib.bib2)], we first standardize each input using statistics computed *only* over visible (non-NaN, unmasked) values, then apply the inverse hyperbolic sine transform to suppress outliers while preserving sign:

$$x_{\mathrm{norm}} = \operatorname{arcsinh}\!\Bigl(\frac{x - \mu_{\mathrm{vis}}}{\sigma_{\mathrm{vis}}}\Bigr), \tag{1}$$

where $\mu_{\mathrm{vis}}$ and $\sigma_{\mathrm{vis}}$ are the visible-value mean and standard deviation. During autoregressive inference these statistics are computed once from the original context and cached, preventing distribution drift as model-generated predictions accumulate. The normalized series is then segmented into $T = \lceil L/P \rceil$ non-overlapping patches of length $P=32$; patching reduces effective sequence length by a factor of $P$ and gives each token access to local temporal structure, matching the de facto input format of recent TSFMs [Nie et al., [2023](https://arxiv.org/html/2605.20268#bib.bib6), Ansari et al., [2025](https://arxiv.org/html/2605.20268#bib.bib2), Liu et al., [2026](https://arxiv.org/html/2605.20268#bib.bib79)].
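The normalization and patching steps can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, missing values are assumed to be encoded as NaN, the standard deviation is assumed to be the population form with a small floor, and the final partial patch is assumed to be right-padded with NaN (the text does not say which side is padded).

```python
import math

def visible_stats(values, eps=1e-8):
    """Mean and standard deviation over visible (finite) values only."""
    visible = [v for v in values if math.isfinite(v)]
    mean = sum(visible) / len(visible)
    var = sum((v - mean) ** 2 for v in visible) / len(visible)
    # eps floor is an assumption to avoid division by zero on constant series
    return mean, max(math.sqrt(var), eps)

def normalize(values, mean, std):
    """Eq. 1: standardize, then apply arcsinh; missing values stay NaN."""
    return [math.asinh((v - mean) / std) if math.isfinite(v) else math.nan
            for v in values]

def patchify(values, patch_len=32):
    """Split into ceil(L / P) non-overlapping patches (assumed NaN right-padding)."""
    n_patches = math.ceil(len(values) / patch_len)
    padded = list(values) + [math.nan] * (n_patches * patch_len - len(values))
    return [padded[i * patch_len:(i + 1) * patch_len] for i in range(n_patches)]

# Example: a length-70 series with one missing value yields ceil(70 / 32) = 3 patches.
series = [float(i) for i in range(70)]
series[5] = math.nan
mean, std = visible_stats(series)
patches = patchify(normalize(series, mean, std), patch_len=32)
```

Because the statistics ignore NaN entries, a missing observation does not shift the mean or inflate the variance of the visible values.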

Each patch is represented by a $4P$-dimensional feature vector $\mathbf{f} = [\mathbf{r}; \mathbf{v}; \mathbf{m}; \mathbf{c}]$ obtained by concatenating four $P$-dimensional components: a *time ramp* $\mathbf{r}$ encoding the patch’s normalized position within its channel (running from approximately $-1$ at the start of a channel to $0$ at its end and resetting at channel boundaries for multivariate inputs); the *normalized values* $\mathbf{v}$ produced by Eq. [1](https://arxiv.org/html/2605.20268#S3.E1); a binary *validity mask* $\mathbf{m} \in \{0,1\}^{P}$ that distinguishes observed values from missing or masked positions; and a *channel ramp* $\mathbf{c}$ that encodes channel identity in multivariate inputs. For a sample with $C$ channels, channel $j \in \{0, \ldots, C-1\}$ is assigned the scalar value $j / \max(C-1, 1)$, which is expanded across all $P$ positions of each patch from that channel. Thus multivariate channel values are evenly spaced in $[0, 1]$, while univariate inputs use $\mathbf{c} = \mathbf{0}$. If channel identifiers are unavailable, the channel ramp defaults to zero. Appendix [E.2](https://arxiv.org/html/2605.20268#A5.SS2) shows that channel-aware multivariate handling improves classification performance over mean-channel pooling on average on the multivariate UEA datasets. The feature vector is projected to the transformer embedding dimension $d$ via a single bias-free linear layer followed by RMSNorm, and the resulting patch embedding is placed into the shared input space:

$$\mathbf{e}_{\mathrm{ts}} = \mathrm{RMSNorm}(\mathbf{W}_{p}\mathbf{f}) \in \mathbb{R}^{d}. \tag{2}$$

Text tokens are embedded via a learned table $\mathbf{W}_{e} \in \mathbb{R}^{|\mathcal{V}| \times d}$ that is tied with the language model output head [Press and Wolf, [2017](https://arxiv.org/html/2605.20268#bib.bib78)], an important saving given the 131k-entry vocabulary. During pretraining, batches are either text-only or time-series-only (with the small interleaved fraction in stage 2 described in Section [4](https://arxiv.org/html/2605.20268#S4)).
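As a rough illustration of the patch feature vector $\mathbf{f} = [\mathbf{r}; \mathbf{v}; \mathbf{m}; \mathbf{c}]$, the sketch below assembles the four components for one patch. Only the channel-ramp formula $j / \max(C-1, 1)$ is taken from the text. The exact time-ramp formula is not given, so a linear ramp that ends at exactly 0 is assumed here, and filling invalid positions of $\mathbf{v}$ with zero is likewise an assumption. All function names are hypothetical.

```python
import math

def channel_ramp_value(channel, n_channels):
    """Scalar channel code j / max(C - 1, 1); zero for univariate inputs."""
    return channel / max(n_channels - 1, 1)

def patch_features(patch, patch_idx, n_patches, channel=0, n_channels=1):
    """Build the 4P-dimensional vector f = [r; v; m; c] for one normalized patch."""
    P = len(patch)
    total = n_patches * P  # positions in this channel
    # Assumed time ramp: about -1 at the channel start, exactly 0 at its end.
    r = [((patch_idx * P + i) - (total - 1)) / total for i in range(P)]
    # Validity mask: 1 for observed values, 0 for missing or padded positions.
    m = [1.0 if math.isfinite(x) else 0.0 for x in patch]
    # Normalized values, with invalid positions zero-filled (assumption).
    v = [x if math.isfinite(x) else 0.0 for x in patch]
    # Channel ramp broadcast across all P positions of the patch.
    c = [channel_ramp_value(channel, n_channels)] * P
    return r + v + m + c

# Second (last) patch of channel 1 in a 3-channel sample, with one missing value.
f = patch_features([0.5] * 31 + [math.nan], patch_idx=1, n_patches=2,
                   channel=1, n_channels=3)
```

In the model this vector would then pass through the bias-free linear projection and RMSNorm of Eq. 2; those learned components are omitted here.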

### 3.2 Output Heads and Training Objective

At text positions, transformer hidden states are projected to vocabulary logits via the tied embedding matrix, with logit soft-capping [Gemma Team, [2025](https://arxiv.org/html/2605.20268#bib.bib61)] $\ell \leftarrow \alpha \tanh(\ell/\alpha)$, $\alpha = 15$, applied to prevent extreme pre-softmax values. The text loss is standard autoregressive cross-entropy $\mathcal{L}_{\mathrm{CE}}$. At time series positions, RMSNorm-projected hidden states are mapped through a single bias-free linear layer to $P \times Q$ outputs, where $Q=21$ quantile levels are spaced uniformly over $\tau \in [0.05, 0.95]$. We minimize the masked quantile loss [Gneiting and Raftery, [2007](https://arxiv.org/html/2605.20268#bib.bib52)]:

$$\mathcal{L}_{\mathrm{QL}} = \frac{\sum_{b,t,p,q} z_{b,t,p}\,\rho_{\tau_{q}}\!\bigl(y_{b,t,p} - \hat{q}_{b,t,p,q}\bigr)}{Q \sum_{b,t,p} z_{b,t,p}}, \quad \rho_{\tau}(u) = \max\!\bigl(\tau u,\, (\tau - 1)u\bigr), \tag{3}$$

where $z_{b,t,p} \in \{0,1\}$ is a target-validity mask that is one only for finite, observed target values inside the prediction horizon and zero for padded or otherwise invalid positions. The mask is applied to each per-position quantile loss term before normalization, so padded targets in partial forecast patches do not contribute to the objective.
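A minimal pure-Python sketch of Eq. 3 may make the masking and normalization concrete. It flattens the batch, patch, and position indices $(b, t, p)$ into a single list for brevity; the function names are hypothetical and returning zero when no target is valid is an assumption.

```python
def quantile_levels(q=21, lo=0.05, hi=0.95):
    """Q quantile levels spaced uniformly over [lo, hi]."""
    return [lo + (hi - lo) * i / (q - 1) for i in range(q)]

def pinball(tau, u):
    """rho_tau(u) = max(tau * u, (tau - 1) * u)."""
    return max(tau * u, (tau - 1.0) * u)

def masked_quantile_loss(targets, preds, valid, taus):
    """Eq. 3 over flattened positions.

    targets: list of y values; preds: one list of Q quantile predictions per
    target; valid: 0/1 mask z (0 for padded or missing targets).
    """
    numerator = 0.0
    for y, q_hat, z in zip(targets, preds, valid):
        if z:  # masked positions are skipped, so NaN targets never enter the sum
            numerator += sum(pinball(t, y - q) for t, q in zip(taus, q_hat))
    denominator = len(taus) * sum(valid)
    return numerator / denominator if denominator else 0.0

taus = quantile_levels()
# Toy example with three quantile levels and one padded (invalid) target.
loss = masked_quantile_loss(
    targets=[1.0, float("nan")],
    preds=[[0.0, 1.0, 2.0], [0.0, 0.0, 0.0]],
    valid=[1, 0],
    taus=[0.1, 0.5, 0.9],
)
```

Dividing by $Q \sum z$ rather than the full tensor size keeps the loss scale independent of how much padding a batch contains.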

At inference time, we denormalize predictions by inverting Eq. [1](https://arxiv.org/html/2605.20268#S3.E1): $\hat{x} = \sinh(\hat{q}) \cdot \sigma_{\mathrm{vis}} + \mu_{\mathrm{vis}}$.
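The sketch below illustrates how the cached statistics and the inverse transform could fit together in an autoregressive rollout. It is an assumption-laden toy: `predict_next_patch` stands in for the model, `persistence` is a hypothetical stub that repeats the last normalized value, and feeding a single point prediction back into the context is assumed rather than taken from the text.

```python
import math

def visible_stats(values):
    """Mean and (floored) standard deviation over finite values."""
    visible = [v for v in values if math.isfinite(v)]
    mean = sum(visible) / len(visible)
    std = math.sqrt(sum((v - mean) ** 2 for v in visible) / len(visible))
    return mean, max(std, 1e-8)

def denormalize(q_hat, mean, std):
    """Invert Eq. 1: x_hat = sinh(q_hat) * sigma_vis + mu_vis."""
    return math.sinh(q_hat) * std + mean

def rollout(context, predict_next_patch, steps):
    """Generate `steps` patches using statistics cached from the original context."""
    mean, std = visible_stats(context)  # computed once, never updated
    history = [math.asinh((x - mean) / std) for x in context]
    n_context = len(history)
    for _ in range(steps):
        # Predictions are appended in normalized space; stats are not recomputed.
        history.extend(predict_next_patch(history))
    return [denormalize(q, mean, std) for q in history[n_context:]]

def persistence(history, patch_len=2):
    """Hypothetical stand-in for the model: repeat the last normalized value."""
    return [history[-1]] * patch_len

forecast = rollout([1.0, 2.0, 3.0, 4.0], persistence, steps=2)
```

With the persistence stub, every forecast value maps back to the last observed value, which is a quick check that normalization and denormalization are exact inverses.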

The overall training objective is the weighted sum $\mathcal{L} = w_{\mathrm{text}}\mathcal{L}_{\mathrm{CE}} + w_{\mathrm{TS}}\mathcal{L}_{\mathrm{QL}}$ with $w_{\mathrm{text}} = 1.0$ and $w_{\mathrm{TS}} = 2.5$; the asymmetric weighting reflects the substantially smaller scale of the per-element quantile loss relative to cross-entropy and balances gradient contributions from the two modalities.
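Two small pieces of this section, the logit soft-capping and the weighted objective, reduce to one-line formulas. The sketch below only restates them; the function names are hypothetical, and the loss values in the example are made up for illustration.

```python
import math

def soft_cap(logit, alpha=15.0):
    """Logit soft-capping: l <- alpha * tanh(l / alpha), bounded in (-alpha, alpha)."""
    return alpha * math.tanh(logit / alpha)

def total_loss(ce_loss, ql_loss, w_text=1.0, w_ts=2.5):
    """Weighted sum L = w_text * L_CE + w_TS * L_QL."""
    return w_text * ce_loss + w_ts * ql_loss

capped_large = soft_cap(100.0)  # stays below alpha in magnitude
capped_small = soft_cap(0.1)    # nearly unchanged for small logits
combined = total_loss(ce_loss=2.0, ql_loss=0.4)  # illustrative values only
```

Soft-capping is close to the identity for small logits and saturates smoothly near $\pm\alpha$, so it limits extreme values without a hard clip.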

## 4 Implementation

Chronicle is a 16-layer, 324M-parameter decoder-only transformer ($d=1024$, 8 GQA heads with 4 KV heads, RoPE [Su et al., [2024](https://arxiv.org/html/2605.20268#bib.bib41)], SwiGLU [Shazeer, [2020](https://arxiv.org/html/2605.20268#bib.bib89)], pre-norm RMSNorm [Zhang and Sennrich, [2019](https://arxiv.org/html/2605.20268#bib.bib40)]), with patch length $P=32$ and a $Q=21$-quantile head over $\tau \in [0.05, 0.95]$. Pretraining runs on 2× H100 80GB GPUs in BF16. Stage 1 trains at sequence length 2048 for 47,683 steps (∼3.1M tokens/batch), yielding ∼138B text tokens and ∼12B TS patches; each micro-batch is text-only ($p=0.92$, from FineWeb-Edu [Penedo et al., [2024](https://arxiv.org/html/2605.20268#bib.bib91)] and Dolmino-mix-1124 [OLMo Team and Allen Institute for AI, [2024](https://arxiv.org/html/2605.20268#bib.bib92)]) or time-series-only ($p=0.08$, from GiftEvalPretrain plus KernelSynth augmentation; Appendix [E](https://arxiv.org/html/2605.20268#A5)). Stage 2 extends context to 4096 and replaces 5% of TS tokens with interleaved alignment data from ChatTS [Xie et al., [2025](https://arxiv.org/html/2605.20268#bib.bib28)] and Merrill et al. [[2024](https://arxiv.org/html/2605.20268#bib.bib27)], establishing cross-modal correspondences while preserving stage-1 capabilities. The text-heavy 92/8 mix is a compute constraint: matching text-only baselines trained on trillions of tokens requires devoting most of our budget to text (Section [5.1.1](https://arxiv.org/html/2605.20268#S5.SS1.SSS1)). Full details appear in Appendix [B](https://arxiv.org/html/2605.20268#A2).
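The Stage 1 token budget can be sanity-checked with simple arithmetic, and the unimodal micro-batch schedule can be mimicked with a biased coin flip. The sketch below uses the rounded figures quoted above, so it only approximately reproduces the reported ∼138B and ∼12B totals. Drawing each micro-batch's modality independently is an assumption, since the text does not describe the actual sampler.

```python
import random

STEPS = 47_683
TOKENS_PER_BATCH = 3.1e6  # approximate figure quoted in the text
P_TEXT = 0.92             # probability that a micro-batch is text-only

total_tokens = STEPS * TOKENS_PER_BATCH     # about 1.48e11
text_tokens = P_TEXT * total_tokens         # about 1.36e11
ts_tokens = (1.0 - P_TEXT) * total_tokens   # about 1.18e10

def sample_modality(rng, p_text=P_TEXT):
    """Pick the modality of one unimodal micro-batch (assumed i.i.d. draws)."""
    return "text" if rng.random() < p_text else "ts"

rng = random.Random(0)
draws = [sample_modality(rng) for _ in range(10_000)]
text_fraction = draws.count("text") / len(draws)
```

The small gap between the computed ∼136B and the reported ∼138B text tokens is consistent with the tokens-per-batch figure being rounded.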

## 5 Evaluation

Table 2: Language understanding (19 NLU tasks). Models ordered by parameter count. The Shots column gives the number of in-context examples. Chr. = Chronicle; Chr.-1 and Chr.-2 are the Stage 1 and Stage 2 checkpoints.

| Task | Shots | GPT-2 | Gemma-3 | Chr.-2 | Chr.-1 | LFM-2 | Qwen2 | LLaMA-3.2 |
|---|---|---|---|---|---|---|---|---|
| Params | | 124M | 270M | 324M | 324M | 350M | 500M | 1.2B |
| Tokens | | ∼10B | 6T | ∼153B | ∼138B | 10T | 12T | 9T |
| HellaSwag | 0 | 0.310 | 0.401 | 0.435 | 0.430 | 0.483 | 0.480 | 0.629 |
| HellaSwag | 10 | 0.308 | 0.397 | 0.429 | 0.427 | 0.473 | 0.482 | 0.648 |
| ARC-E | 10 | 0.417 | 0.583 | 0.651 | 0.644 | 0.715 | 0.595 | 0.678 |
| ARC-C | 10 | 0.224 | 0.289 | 0.325 | 0.325 | 0.445 | 0.311 | 0.376 |
| COPA | 0 | 0.630 | 0.670 | 0.660 | 0.670 | 0.690 | 0.670 | 0.760 |
| CSQA | 10 | 0.230 | 0.207 | 0.193 | 0.235 | 0.541 | 0.582 | 0.370 |
| PiQA | 10 | 0.624 | 0.676 | 0.694 | 0.684 | 0.698 | 0.700 | 0.757 |
| LAMBADA | 0 | 0.322 | 0.429 | 0.382 | 0.397 | 0.398 | 0.494 | 0.627 |
| Winograd | 0 | 0.575 | 0.652 | 0.663 | 0.641 | 0.608 | 0.696 | 0.799 |
| WinoGrande | 0 | 0.507 | 0.536 | 0.516 | 0.509 | 0.558 | 0.557 | 0.609 |
| BoolQ | 10 | 0.547 | 0.517 | 0.566 | 0.570 | 0.572 | 0.614 | 0.657 |
| CoQA | 0 | 0.136 | 0.223 | 0.211 | 0.221 | 0.303 | 0.324 | 0.360 |
| SQuAD | 10 | 0.058 | 0.250 | 0.290 | 0.300 | 0.318 | 0.492 | 0.479 |
| Jeopardy | 10 | 0.003 | 0.130 | 0.101 | 0.121 | 0.069 | 0.139 | 0.344 |
| BB WikiQA | 10 | 0.283 | 0.548 | 0.538 | 0.533 | 0.406 | 0.594 | 0.643 |
| BB CSAlg | 10 | 0.423 | 0.436 | 0.405 | 0.389 | 0.405 | 0.442 | 0.458 |
| BB Ops | 10 | 0.090 | 0.210 | 0.167 | 0.176 | 0.319 | 0.305 | 0.405 |
| AGIE LSAT | 3 | 0.209 | 0.300 | 0.239 | 0.283 | 0.243 | 0.252 | 0.239 |
| BB LangID | 10 | 0.258 | 0.254 | 0.248 | 0.248 | 0.282 | 0.318 | 0.253 |
| Average | | 0.324 | 0.406 | 0.406 | 0.411 | 0.449 | 0.476 | 0.531 |

We evaluate Chronicle across five benchmarks organized into two tiers. *Pretraining benchmarks* (Sections [5.1.1](https://arxiv.org/html/2605.20268#S5.SS1.SSS1) and [5.1.2](https://arxiv.org/html/2605.20268#S5.SS1.SSS2)) probe each training objective in isolation, measuring whether text capability survives joint training and how well zero-shot forecasting generalizes across domains. *Downstream application benchmarks* (Section [5.2](https://arxiv.org/html/2605.20268#S5.SS2)) assess whether the learned representations transfer to downstream multimodal and unimodal tasks.

### 5.1 Pretraining Effectiveness

#### 5.1.1 Language Understanding

We evaluate on 19 NLU tasks drawn from the DCLM evaluation suite [Li et al., [2024](https://arxiv.org/html/2605.20268#bib.bib63)], spanning commonsense reasoning (HellaSwag [Zellers et al., [2019](https://arxiv.org/html/2605.20268#bib.bib64)], COPA [Roemmele et al., [2011](https://arxiv.org/html/2605.20268#bib.bib66)], PiQA [Bisk et al., [2020](https://arxiv.org/html/2605.20268#bib.bib68)], CommonsenseQA [Talmor et al., [2019](https://arxiv.org/html/2605.20268#bib.bib67)], WinoGrande [Sakaguchi et al., [2019](https://arxiv.org/html/2605.20268#bib.bib72)]), reading comprehension (ARC-Easy, ARC-Challenge [Clark et al., [2018](https://arxiv.org/html/2605.20268#bib.bib65)], BoolQ [Clark et al., [2019](https://arxiv.org/html/2605.20268#bib.bib73)], CoQA [Reddy et al., [2019](https://arxiv.org/html/2605.20268#bib.bib74)], SQuAD [Rajpurkar et al., [2016](https://arxiv.org/html/2605.20268#bib.bib75)]), cloze and completion (LAMBADA [Paperno et al., [2016](https://arxiv.org/html/2605.20268#bib.bib70)], Winograd [Levesque et al., [2012](https://arxiv.org/html/2605.20268#bib.bib71)]), knowledge (Jeopardy, BB WikiQA [Srivastava et al., [2023](https://arxiv.org/html/2605.20268#bib.bib76)]), and algorithmic reasoning (BB CS Algorithms, BB Operators, BB Language ID [Srivastava et al., [2023](https://arxiv.org/html/2605.20268#bib.bib76)]; AGI-Eval LSAT [Zhong et al., [2023](https://arxiv.org/html/2605.20268#bib.bib77)]). All tasks use zero-shot or few-shot in-context learning with no fine-tuning.
We compare against five text-only decoder-only language models: GPT-2 [Radford et al., [2019](https://arxiv.org/html/2605.20268#bib.bib13)] (124M), Gemma-3-270M-PT [Gemma Team, [2025](https://arxiv.org/html/2605.20268#bib.bib61)] (270M), LFM-2-350M [Amini et al., [2025](https://arxiv.org/html/2605.20268#bib.bib62)] (350M), Qwen2-0.5B [Yang et al., [2024](https://arxiv.org/html/2605.20268#bib.bib17)] (500M), and LLaMA-3.2-1B [Grattafiori et al., [2024](https://arxiv.org/html/2605.20268#bib.bib16)] (1.2B).

Table [2](https://arxiv.org/html/2605.20268#S5.T2) reports all 19 tasks. Stage 1 achieves an average accuracy of 0.411 and Stage 2 achieves 0.406, with Stage 2 matching Gemma-3-270M-PT (0.406) at comparable scale; both stages sit between GPT-2 (0.324) and Qwen2-0.5B (0.476). The small advantage of Stage 1 is consistent with Stage 2 replacing a fraction of text tokens with multimodal alignment data, slightly reducing the effective language training budget, and mirrors the pattern observed across all unimodal benchmarks. On ARC-Easy, Stage 2 (0.651) closely approaches LLaMA-3.2-1B (0.678), a model roughly 4× larger trained exclusively on text, and outperforms both Gemma-3-270M-PT (0.583) and Qwen2-0.5B (0.595). Both stages match or exceed GPT-2 on the vast majority of tasks. The training budget context is an important consideration: Chronicle sees only ∼138B text tokens during pretraining, roughly 43× fewer than Gemma-3-270M (6T), 72× fewer than LFM-2-350M (10T), and 87× fewer than Qwen2-0.5B (12T). These results demonstrate that devoting ∼8% of training compute to time series does not cause catastrophic interference in the shared transformer backbone, and validate the text-heavy token mix described in Section [4](https://arxiv.org/html/2605.20268#S4).
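The token-budget ratios quoted above follow directly from the token counts in Table 2. A short check, using the rounded counts as given (variable names are illustrative):

```python
CHRONICLE_TEXT_TOKENS = 138e9  # approximate Stage 1 text budget from the text

# Pretraining token counts for the text-only baselines, as listed in Table 2.
baseline_tokens = {
    "Gemma-3-270M": 6e12,
    "LFM-2-350M": 10e12,
    "Qwen2-0.5B": 12e12,
}

# How many times more text each baseline saw than Chronicle.
ratios = {name: tokens / CHRONICLE_TEXT_TOKENS
          for name, tokens in baseline_tokens.items()}
rounded = {name: round(r) for name, r in ratios.items()}
```

The rounded ratios come out to 43, 72, and 87, matching the figures in the paragraph.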

#### 5.1.2 Zero-Shot Probabilistic Forecasting

GIFT-Eval [Aksu et al., [2024](https://arxiv.org/html/2605.20268#bib.bib50)] comprises 97 zero-shot forecasting tasks drawn from 55 datasets across 7 domains at three horizon lengths (short, medium, long). All metrics are standardized by dividing by the Seasonal Naive baseline and aggregated via geometric mean; we report MASE (point forecast quality using the median quantile) and WQL (weighted quantile loss, equivalent to CRPS). We compare against published scores from the public GIFT-Eval leaderboard, representing the strongest dedicated TSFMs currently evaluated: the leading zero-shot models PatchTST-FM-r1 [Wen et al., [2026](https://arxiv.org/html/2605.20268#bib.bib7)], TimesFM-2.5 [Das et al., [2024](https://arxiv.org/html/2605.20268#bib.bib3)], TiRex [Auer et al., [2025](https://arxiv.org/html/2605.20268#bib.bib56)], Toto-Base [Cohen et al., [2025](https://arxiv.org/html/2605.20268#bib.bib57)], YingLong-300M [Wang et al., [2025](https://arxiv.org/html/2605.20268#bib.bib58)], Chronos-2-Synth [Ansari et al., [2025](https://arxiv.org/html/2605.20268#bib.bib2)], and Moirai-Large [Woo et al., [2024](https://arxiv.org/html/2605.20268#bib.bib4)]; supervised baselines PatchTST [Nie et al., [2023](https://arxiv.org/html/2605.20268#bib.bib6)], N-BEATS [Oreshkin et al., [2020](https://arxiv.org/html/2605.20268#bib.bib59)], DLinear [Zeng et al., [2023](https://arxiv.org/html/2605.20268#bib.bib37)], and DeepAR [Salinas et al., [2020](https://arxiv.org/html/2605.20268#bib.bib60)]; and statistical baselines Seasonal Naive and Auto-ARIMA. We exclude models trained with potentially leaky data from test-set distributions. Chronicle forecasts autoregressively, generating one patch per step.
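The aggregation described above (divide each task's score by the Seasonal Naive score, then take the geometric mean across tasks) can be sketched as follows. The function names are hypothetical and the numbers in the example are invented for illustration, not taken from the leaderboard.

```python
import math

def geometric_mean(values):
    """Geometric mean of strictly positive values, computed via logs."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def normalized_score(model_scores, seasonal_naive_scores):
    """Per-task ratio to Seasonal Naive, aggregated by geometric mean.

    A result below 1.0 means the model beats Seasonal Naive on aggregate.
    """
    ratios = [m / b for m, b in zip(model_scores, seasonal_naive_scores)]
    return geometric_mean(ratios)

# Illustrative values only: half the baseline error on one task, double on another.
score = normalized_score([1.0, 4.0], [2.0, 2.0])
```

The geometric mean treats a halving and a doubling of error symmetrically, so the illustrative example aggregates to 1.0; an arithmetic mean of the same ratios would give 1.25.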

Figure [2](https://arxiv.org/html/2605.20268#S5.F2) places both Chronicle checkpoints within the full leaderboard\. Stage 1 is the stronger pure zero\-shot forecaster, reaching 0\.978 MASE and 0\.690 CRPS, while Stage 2 reaches 1\.053 MASE and 0\.754 CRPS after the alignment stage\. This establishes the main tradeoff: unimodal training gives the best isolated forecasting performance, whereas adding a small fraction of interleaved text and time\-series data slightly reduces GIFT\-Eval scores but improves downstream multimodal transfer\. Despite allocating only ∼8% of training compute to time series, Stage 1 outperforms Seasonal Naive on both metrics and improves over several supervised and statistical baselines on CRPS, including N\-BEATS \(0\.816\), DLinear \(0\.846\), DeepAR \(0\.853\) and Auto\-ARIMA \(0\.912\)\. The remaining gap to dedicated TSFMs reflects two principled design choices: \(i\) only ∼8% of training compute is allocated to time series versus 100% for dedicated models; and \(ii\) to align with our text setup we use causal next\-patch prediction, while PatchTST\-FM uses contiguous patch masking in an otherwise similar architecture, which their ablations show meaningfully reduces MASE\.
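Autoregressive patch forecasting, as used for the zero\-shot results above, can be sketched as a simple rollout loop\. This is a minimal illustration under stated assumptions: `predict_next_patch` is a hypothetical stand\-in for the model's next\-patch prediction, and the toy predictor below is not Chronicle\.

```python
def rollout(context, predict_next_patch, num_patches):
    """Generate num_patches future patches, feeding each prediction back as context."""
    history = [list(patch) for patch in context]  # copy so the caller's data is untouched
    forecast = []
    for _ in range(num_patches):
        next_patch = predict_next_patch(history)  # one patch per step
        forecast.append(next_patch)
        history.append(next_patch)  # later steps condition on earlier predictions
    return forecast


def linear_trend_patch(history):
    """Toy predictor (assumption): shift the last patch by the last patch-to-patch change."""
    last, prev = history[-1], history[-2]
    step = last[0] - prev[0]
    return [value + step for value in last]


context = [[0.0, 1.0], [2.0, 3.0]]  # two observed patches of length 2
forecast = rollout(context, linear_trend_patch, num_patches=2)
print(forecast)
```

Because each predicted patch is appended to the history, any error in an early patch is carried into every later step, which is one reason long horizons are harder for causal next\-patch models\.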

![Refer to caption](https://arxiv.org/html/2605.20268v1/x2.png) Figure 2: GIFT\-Eval leaderboard \(97 tasks; lower is better\)\. MASE \(left\) and CRPS \(right\) for comparative models, plus Chronicle Stage 1 and Stage 2 \(highlighted\)\. Stage 1 is the stronger pure forecaster, while Stage 2 is the aligned checkpoint used for multimodal transfer\.

### 5\.2 Downstream Applications

We now evaluate whether Chronicle’s learned representations transfer to three downstream tasks: multimodal classification, multimodal forecasting, and time series classification\. All three tasks probe different aspects of the model’s representations \(cross\-modal fusion, text\-conditioned prediction, and temporal discriminability\) without retraining the backbone\.

##### Shared baselines\.

All downstream evaluations draw from a common pool of baselines\. *Supervised DL models* \(Informer \[Zhou et al\., [2021](https://arxiv.org/html/2605.20268#bib.bib33)\], TimesNet \[Wu et al\., [2023](https://arxiv.org/html/2605.20268#bib.bib35)\], Autoformer \[Wu et al\., [2021](https://arxiv.org/html/2605.20268#bib.bib34)\], iTransformer \[Liu et al\., [2024](https://arxiv.org/html/2605.20268#bib.bib36)\], DLinear \[Zeng et al\., [2023](https://arxiv.org/html/2605.20268#bib.bib37)\], PatchTST \[Nie et al\., [2023](https://arxiv.org/html/2605.20268#bib.bib6)\], and FEDformer \[Zhou et al\., [2022](https://arxiv.org/html/2605.20268#bib.bib38)\]\) are trained independently per dataset for TS classification\. *TS foundation models* \(Chronos\-2 \[Ansari et al\., [2025](https://arxiv.org/html/2605.20268#bib.bib2)\], Moirai\-2 \[Liu et al\., [2026](https://arxiv.org/html/2605.20268#bib.bib79)\], and TimesFM \[Das et al\., [2024](https://arxiv.org/html/2605.20268#bib.bib3)\]\) are evaluated with a learned linear probe on frozen embeddings for classification and via fusion heads or direct prediction for forecasting\.

*Multimodal fusion baselines* follow the MM\-TSFlib \[Liu and others, [2024](https://arxiv.org/html/2605.20268#bib.bib29)\] protocol, the standard fusion library introduced alongside the Time\-MMD benchmark and subsequently adopted by multiple text\-augmented time series studies\. Under this protocol, each baseline pairs a *trainable* time series encoder \(DLinear, PatchTST, or TimesNet\) with a *frozen* pretrained text encoder \(BERT \[Devlin et al\., [2019](https://arxiv.org/html/2605.20268#bib.bib18)\] or GPT\-2 \[Radford et al\., [2019](https://arxiv.org/html/2605.20268#bib.bib13)\]\) and a trainable two\-layer MLP fusion head\. The TS encoder is fine\-tuned end\-to\-end on each dataset together with the head, so the temporal representation adapts to the task\. We additionally report *FM Fusion* baselines that pair the same frozen text encoders with frozen TS foundation models \(Chronos\-2, Moirai\-2, TimesFM\) as encoders and train only the fusion head\.

#### 5\.2\.1 Multimodal Classification

Table 3: Multimodal classification on TimeCAP\. Scores are averaged over Weather, Finance, and Healthcare\. Values are mean ± standard deviation over 3 seeds\. Chronicle rows report both LP and LoRA scores with TS\-token repeat *r* ∈ \{1, 64\}\.

| Category | Model | F1 ↑ | AUC ↑ |
| --- | --- | --- | --- |
| MM\-TSFlib | DLin\+BERT | 0\.588 ± 0\.016 | 0\.739 ± 0\.024 |
| | DLin\+GPT2 | 0\.564 ± 0\.026 | 0\.724 ± 0\.017 |
| | PTST\+BERT | 0\.578 ± 0\.022 | 0\.719 ± 0\.022 |
| | PTST\+GPT2 | 0\.539 ± 0\.021 | 0\.707 ± 0\.034 |
| | TNet\+BERT | 0\.589 ± 0\.021 | 0\.750 ± 0\.026 |
| | TNet\+GPT2 | 0\.577 ± 0\.019 | 0\.754 ± 0\.028 |
| FM Fusion | BERT\+Chr2 | 0\.590 ± 0\.021 | 0\.726 ± 0\.023 |
| | BERT\+Moi2 | 0\.588 ± 0\.004 | 0\.751 ± 0\.025 |
| | BERT\+TFM | 0\.498 ± 0\.006 | 0\.659 ± 0\.023 |
| | GPT2\+Chr2 | 0\.455 ± 0\.043 | 0\.673 ± 0\.056 |
| | GPT2\+Moi2 | 0\.542 ± 0\.038 | 0\.739 ± 0\.015 |
| | GPT2\+TFM | 0\.480 ± 0\.034 | 0\.628 ± 0\.018 |
| Chronicle | Stage 1 LP \(r=1\) | 0\.593 ± 0\.021 | 0\.733 ± 0\.030 |
| | Stage 1 LP \(r=64\) | 0\.608 ± 0\.010 | 0\.745 ± 0\.014 |
| | Stage 1 LoRA \(r=1\) | 0\.601 ± 0\.011 | 0\.739 ± 0\.017 |
| | Stage 1 LoRA \(r=64\) | 0\.584 ± 0\.024 | 0\.763 ± 0\.011 |
| | Stage 2 LP \(r=1\) | 0\.594 ± 0\.025 | 0\.731 ± 0\.029 |
| | Stage 2 LP \(r=64\) | 0\.605 ± 0\.011 | 0\.750 ± 0\.014 |
| | Stage 2 LoRA \(r=1\) | 0\.595 ± 0\.030 | 0\.731 ± 0\.032 |
| | Stage 2 LoRA \(r=64\) | 0\.613 ± 0\.011 | 0\.757 ± 0\.014 |

We evaluate multimodal classification on TimeCAP \[Lee and others, [2025](https://arxiv.org/html/2605.20268#bib.bib30)\] across three domains: *Weather* \(binary rain/no\-rain\), *Finance* \(three\-way market direction\), and *Healthcare* \(mean of two binary tasks: in\-hospital mortality and disease test\-positive prediction\)\. All four underlying tasks are class\-imbalanced \(majority class 61–69%\); we train every method with class\-balanced cross\-entropy, cap text inputs at 384 tokens, and report macro\-F1 and AUC\. Chronicle adds a 2\-layer head on a single fully frozen backbone with joint text\+TS input; full training settings appear in Appendix [C\.2](https://arxiv.org/html/2605.20268#A3.SS2) and we follow the training splits from Lee and others \[[2025](https://arxiv.org/html/2605.20268#bib.bib30)\]\. Table [3](https://arxiv.org/html/2605.20268#S5.T3) reports results averaged over Weather, Finance, and Healthcare\. Several baselines achieve inflated accuracy via majority\-class collapse but perform poorly on these imbalance\-aware metrics\. Across macro\-F1 and AUC, Chronicle is the strongest entry: Stage 2 LoRA \(*r* = 64\) achieves the best average macro\-F1 \(**0\.613**\), and Stage 1 LoRA \(*r* = 64\) achieves the best average AUC \(**0\.763**\), both with tight variance across seeds\. Among the baselines in Table [3](https://arxiv.org/html/2605.20268#S5.T3), the best macro\-F1 is 0\.590 \(BERT\+Chr2, FM Fusion\) and the best AUC is 0\.754 \(TNet\+GPT2, MM\-TSFlib\), so both baseline families trail Chronicle on both metrics\.
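To see why accuracy can look healthy under majority\-class collapse while macro\-F1 does not, consider a synthetic binary task with a 65% majority class\. The sketch below is illustrative only: the labels are made up, and the inverse\-frequency weighting is one common way to implement class\-balanced cross\-entropy, assumed here rather than taken from the paper\.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and label agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    f1_scores = []
    for c in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def balanced_class_weights(y_true):
    """Inverse-frequency weights n / (k * count_c); an assumed 'class-balanced' choice."""
    classes = sorted(set(y_true))
    n, k = len(y_true), len(classes)
    return {c: n / (k * y_true.count(c)) for c in classes}


# Synthetic labels: 65% majority class, inside the 61-69% range quoted above.
y_true = [0] * 65 + [1] * 35
y_collapsed = [0] * 100  # a classifier that always predicts the majority class

acc = accuracy(y_true, y_collapsed)
f1 = macro_f1(y_true, y_collapsed)
weights = balanced_class_weights(y_true)
print(f"accuracy={acc:.3f}  macro-F1={f1:.3f}  loss weights={weights}")
```

The collapsed classifier keeps 65% accuracy but its macro\-F1 falls below 0\.4, because the minority class contributes an F1 of zero\. The weights show how a class\-balanced loss up\-weights minority\-class errors during training\.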

#### 5\.2\.2 Multimodal Forecasting

We evaluate multimodal forecasting on the 9 Time\-MMD \[Liu and others, [2024](https://arxiv.org/html/2605.20268#bib.bib29)\] domains \(agriculture through traffic; textual fact reports; chronological 70/10/20 splits\), reporting MAE averaged over all forecast horizons per domain \(monthly: 6 to 12 steps; weekly: 12 to 48; daily: 48 to 336\)\. For Chronicle, we report two variants: *ZS*, zero\-shot forecasting, and *LP*, where a forecasting head is trained on top of the frozen backbone with joint text\+TS input\.

Table [4](https://arxiv.org/html/2605.20268#S5.T4) presents per\-domain MAE and normalized mean absolute error \(NMAE\) to account for differing data scales\. Chronicle Stage 2 \(LP\) achieves the best overall NMAE \(**0\.514**\) and average rank \(**2\.56**\), outperforming the strongest MM\-TSFlib baseline \(BERT\+TNet, NMAE 0\.621, rank 8\.56\) and the strongest FM Fusion baseline \(GPT2\+Moi2, NMAE 0\.588, rank 6\.44\)\. Stage 1 \(LP\) also surpasses all baselines \(NMAE 0\.524, rank 5\.00\), and Stage 2 improves over Stage 1 on both metrics, directly validating the multimodal alignment stage\. At the domain level, Stage 2 \(LP\) leads on 5 of 9 domains \(Energy, Environment, Public Health, Security, and Social Good\) and is within 0\.002 of the best method on Agriculture and Climate\. The improvement from Stage 1 \(ZS\) to Stage 2 \(ZS\) \(NMAE 1\.040 → 0\.835\) shows that even zero\-shot multimodal forecasting benefits from the alignment stage, while linear probing unlocks large additional gains on Environment \(−0\.860 MAE\), Public Health \(−0\.800\), Energy \(−0\.278\), Traffic \(−0\.141\), and Social Good \(−0\.120\)\.

Table 4: Multimodal forecasting MAE on Time\-MMD \(lower is better\)\. MM\-TSFlib baselines finetune the TS encoder end\-to\-end with a frozen text encoder and a trained MLP head\. FM Fusion baselines pair a frozen pretrained TS foundation model with a frozen text encoder and a trained fusion head\. Chronicle reports zero\-shot and finetuned head variants\. Bold marks the best value in each column\. Abbreviations: Agri\.=Agriculture, Clim\.=Climate, Econ\.=Economy, Enrg\.=Energy, Env\.=Environment, P\.Hlth=Public Health, Sec\.=Security, Soc\.G=Social Good, Traf\.=Traffic; DLin=DLinear, PTST=PatchTST, TNet=TimesNet, Chr2=Chronos\-2, Moi2=Moirai\-2, TFM=TimesFM\.

| Cat\. | Model | Agri\. | Clim\. | Econ\. | Enrg\. | Env\. | P\.Hlth | Sec\. | Soc\.G | Traf\. | NMAE | Avg Rank |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MM\-TSFlib \(TS enc\. trainable, text frozen\) | BERT\+DLin | **0\.181** | 0\.901 | **0\.069** | 0\.409 | 0\.447 | 0\.839 | 2\.067 | 0\.499 | 0\.241 | 0\.648 | 7\.44 |
| | BERT\+PTST | 0\.194 | 0\.902 | 0\.075 | 0\.459 | 0\.452 | 0\.825 | 2\.179 | 0\.526 | 0\.220 | 0\.666 | 10\.22 |
| | BERT\+TNet | 0\.186 | 0\.913 | 0\.073 | 0\.437 | 0\.438 | 0\.818 | 1\.707 | 0\.535 | 0\.226 | 0\.621 | 8\.56 |
| | GPT2\+DLin | 0\.184 | 0\.890 | 0\.073 | 0\.424 | 0\.447 | 0\.790 | 1\.871 | 0\.523 | 0\.225 | 0\.626 | 6\.78 |
| | GPT2\+PTST | 0\.195 | 0\.884 | 0\.076 | 0\.449 | 0\.465 | 0\.794 | 1\.772 | 0\.562 | **0\.209** | 0\.627 | 9\.00 |
| | GPT2\+TNet | **0\.181** | 0\.924 | 0\.076 | 0\.464 | 0\.450 | 0\.833 | 1\.780 | 0\.591 | 0\.211 | 0\.640 | 9\.78 |
| FM Fusion \(both enc\. frozen, head trained\) | BERT\+Chr2 | 0\.211 | 0\.894 | 0\.103 | 0\.505 | 0\.493 | 1\.274 | 1\.700 | 0\.527 | 0\.386 | 0\.719 | 12\.78 |
| | BERT\+Moi2 | 0\.185 | **0\.873** | 0\.073 | 0\.453 | 0\.507 | 0\.790 | 1\.413 | 0\.464 | 0\.227 | 0\.591 | 6\.89 |
| | BERT\+TFM | 0\.190 | 0\.881 | 0\.071 | 0\.403 | 0\.491 | 0\.765 | 1\.686 | 0\.441 | 0\.228 | 0\.601 | 6\.11 |
| | GPT2\+Chr2 | 0\.213 | 0\.894 | 0\.103 | 0\.505 | 0\.494 | 1\.273 | 1\.616 | 0\.523 | 0\.386 | 0\.712 | 12\.22 |
| | GPT2\+Moi2 | 0\.189 | 0\.878 | 0\.073 | 0\.448 | 0\.514 | 0\.785 | 1\.394 | 0\.453 | 0\.223 | 0\.588 | 6\.44 |
| | GPT2\+TFM | 0\.187 | 0\.878 | 0\.079 | 0\.396 | 0\.491 | 0\.777 | 1\.575 | 0\.471 | 0\.229 | 0\.595 | 6\.89 |
| Chronicle | Stage 1 \(ZS\) | 0\.269 | 1\.903 | 0\.247 | 0\.444 | 1\.397 | 1\.061 | 1\.222 | 0\.817 | 1\.052 | 1\.040 | 13\.44 |
| | Stage 1 \(LP\) | 0\.184 | 0\.894 | 0\.074 | 0\.390 | 0\.417 | 0\.697 | 1\.058 | 0\.413 | 0\.247 | 0\.524 | 5\.00 |
| | Stage 2 \(ZS\) | 0\.222 | 0\.890 | 0\.100 | 0\.648 | 1\.270 | 1\.490 | 1\.082 | 0\.519 | 0\.386 | 0\.835 | 11\.89 |
| | Stage 2 \(LP\) | **0\.181** | 0\.875 | 0\.070 | **0\.370** | **0\.410** | **0\.690** | **1\.056** | **0\.399** | 0\.245 | **0\.514** | **2\.56** |
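The zero\-shot to linear\-probe gains quoted in Section 5\.2\.2 can be recomputed from the Stage 2 rows of Table 4\. A small sketch of that arithmetic \(the dictionary names are illustrative\):

```python
# Stage 2 per-domain MAE copied from Table 4 (zero-shot and linear-probe rows).
stage2_zs = {"Energy": 0.648, "Environment": 1.270, "Public Health": 1.490,
             "Social Good": 0.519, "Traffic": 0.386}
stage2_lp = {"Energy": 0.370, "Environment": 0.410, "Public Health": 0.690,
             "Social Good": 0.399, "Traffic": 0.245}

# Change in MAE when a forecasting head is trained on the frozen backbone.
delta = {domain: round(stage2_lp[domain] - stage2_zs[domain], 3) for domain in stage2_zs}

# Print the largest improvements (most negative changes) first.
for domain, change in sorted(delta.items(), key=lambda item: item[1]):
    print(f"{domain:13s} {change:+.3f}")
```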

### 5\.3 Time\-Series Classification

Table 5: Time series classification on 24 UCR/UEA datasets\. Linear probes on frozen embeddings for TS foundation models and Chronicle; supervised DL baselines are trained per dataset\. Full results in Table [11](https://arxiv.org/html/2605.20268#A6.T11)\. Results are averaged over 5 different seeds\.

| Category | Model | Acc ↑ | F1 ↑ |
| --- | --- | --- | --- |
| Supervised DL | Informer | 0\.565 ± 0\.213 | 0\.483 ± 0\.241 |
| | TimesNet | 0\.645 ± 0\.228 | 0\.575 ± 0\.267 |
| | Autoformer | 0\.628 ± 0\.221 | 0\.566 ± 0\.259 |
| | DLinear | 0\.637 ± 0\.228 | 0\.606 ± 0\.248 |
| | iTransformer | 0\.628 ± 0\.221 | 0\.582 ± 0\.249 |
| | FEDformer | 0\.723 ± 0\.229 | 0\.666 ± 0\.283 |
| | PatchTST | 0\.668 ± 0\.231 | 0\.618 ± 0\.267 |
| TS Foundation Models \(LP\) | Chronos\-2 | 0\.376 ± 0\.225 | 0\.230 ± 0\.167 |
| | TimesFM | 0\.611 ± 0\.238 | 0\.563 ± 0\.262 |
| | Moirai\-2 | 0\.714 ± 0\.238 | 0\.692 ± 0\.257 |
| Chronicle | Stage 1 | 0\.736 ± 0\.206 | 0\.712 ± 0\.226 |
| | Stage 2 | 0\.729 ± 0\.199 | 0\.700 ± 0\.220 |

We evaluate time\-series classification on 24 datasets: 14 univariate datasets from the UCR Time Series Archive \[Dau et al\., [2019](https://arxiv.org/html/2605.20268#bib.bib51)\] and 10 multivariate datasets from the UEA archive, using official train/test splits\. Supervised DL baselines are trained per\-dataset with Adam \(learning rate $10^{-3}$, batch size 16\)\. Foundation model baselines and Chronicle use a learned linear probe on frozen embeddings, directly testing whether pretrained temporal representations are linearly separable without backbone adaptation\.

Table [5](https://arxiv.org/html/2605.20268#S5.T5) shows accuracy and F1 across all 24 datasets\. Chronicle Stage 1 achieves the strongest overall results among frozen\-backbone models, reaching 0\.736 accuracy and 0\.712 F1, compared to 0\.714/0\.692 for Moirai\-2, 0\.611/0\.563 for TimesFM, and 0\.376/0\.230 for Chronos\-2\. Stage 2 reaches 0\.729/0\.700, a small but consistent decrease relative to Stage 1 that mirrors the pattern on GIFT\-Eval and NLU: replacing a portion of unimodal time series tokens with multimodal alignment data modestly reduces purely unimodal representation quality while improving cross\-modal tasks \(Section [5\.2\.2](https://arxiv.org/html/2605.20268#S5.SS2.SSS2), Section [5\.2\.1](https://arxiv.org/html/2605.20268#S5.SS2.SSS1)\)\. On the averages in Table [5](https://arxiv.org/html/2605.20268#S5.T5), both stages exceed every supervised DL baseline trained per\-dataset, including the strongest, FEDformer \(0\.723 accuracy, 0\.666 F1\), from a single frozen backbone with no per\-dataset retraining\. The full results are given in Table [11](https://arxiv.org/html/2605.20268#A6.T11)\.
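A linear probe on frozen embeddings can be sketched as softmax regression trained on top of a fixed feature extractor\. Everything below is an assumption for illustration: the "frozen backbone" is a fixed random projection rather than Chronicle, the three\-class series are synthetic, and the optimizer settings are arbitrary\.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen backbone (NOT Chronicle): fixed random projection + tanh.
# These weights are never updated, which is what "frozen" means for a linear probe.
W_FROZEN = rng.normal(size=(8, 16))

# Three synthetic class prototypes of length 8: ramp, sine, alternating.
PROTOTYPES = np.stack([
    np.linspace(-1.0, 1.0, 8),
    np.sin(2 * np.pi * np.arange(8) / 8),
    np.array([1.0, -1.0] * 4),
])


def embed(series):
    """Map raw series of shape (n, 8) to frozen embeddings of shape (n, 16)."""
    return np.tanh(series @ W_FROZEN)


def make_split(n_per_class):
    """Draw noisy copies of each prototype; labels are the prototype indices."""
    x = np.concatenate(
        [p + rng.normal(scale=0.2, size=(n_per_class, 8)) for p in PROTOTYPES]
    )
    y = np.repeat(np.arange(len(PROTOTYPES)), n_per_class)
    return x, y


def train_linear_probe(z, y, num_classes, lr=0.1, steps=500):
    """Softmax regression on fixed features, fit by full-batch gradient descent."""
    w = np.zeros((z.shape[1], num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[y]
    for _ in range(steps):
        logits = z @ w + b
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs = probs / probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / len(y)  # gradient of mean cross-entropy w.r.t. logits
        w -= lr * (z.T @ grad)
        b -= lr * grad.sum(axis=0)
    return w, b


x_train, y_train = make_split(50)
x_test, y_test = make_split(50)
w, b = train_linear_probe(embed(x_train), y_train, num_classes=3)
predictions = (embed(x_test) @ w + b).argmax(axis=1)
test_accuracy = float((predictions == y_test).mean())
print(f"linear-probe test accuracy: {test_accuracy:.3f}")
```

Only `w` and `b` are trained; the probe's accuracy therefore measures how linearly separable the classes already are in the fixed embedding space\.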

### 5\.4 TS\-Token Repetition for Short Time Series

Chronicle processes both modalities autoregressively within a shared backbone, so the relative sequence length of each modality directly influences performance\. For short\-series tasks such as TimeCAP, where the time series is often a single patch, the paired text caption dominates the mean\-pooled representation and leaves temporal features underweighted\. We address this by repeating the TS\-token block *r* times within the input: the same patch embeddings are appended *r* times without altering the underlying series, which rebalances the modality ratio without any architectural change or backbone retraining\. Figure [3](https://arxiv.org/html/2605.20268#S5.F3) sweeps *r* ∈ \[1, 128\] on the three TimeCAP domains\. Averaged across domains, accuracy improves from 0\.681 at *r* = 1 to 0\.711 at *r* = 64, AUC from 0\.760 to 0\.790, and macro\-F1 from 0\.602 to 0\.642, before degrading at *r* ≥ 96 as attention is diluted across many identical copies\. Weather, which has the longest natural TS context, shows the largest gain; Finance and Healthcare are largely flat, limited by class\-imbalance ceilings rather than representational quality\. The main\-paper MM\-CLS result \(Table [3](https://arxiv.org/html/2605.20268#S5.T3)\) reports both the default *r* = 1 and the tuned setting *r* = 64 when comparing to MM\-TSFlib and FM Fusion baselines; this ablation shows roughly 3 points of accuracy, 3 of AUC, and 4 of macro\-F1 of headroom with tuned *r*\.
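The rebalancing effect of TS\-token repetition can be illustrated by counting positions in the pooled sequence\. The sketch assumes uniform mean pooling over all positions, a single\-patch series, and a caption at the 384\-token cap used for TimeCAP; placing the TS block after the text is also an assumption, and the function names are hypothetical\.

```python
def build_sequence(text_tokens, ts_patches, r):
    """Concatenate the text tokens with the TS-token block repeated r times."""
    if r < 1:
        raise ValueError("r must be at least 1")
    return list(text_tokens) + list(ts_patches) * r


def ts_fraction(num_text_tokens, num_ts_patches, r):
    """Share of positions in the mean-pooled sequence that hold TS tokens."""
    ts_positions = num_ts_patches * r
    return ts_positions / (num_text_tokens + ts_positions)


# Tiny example with placeholder tokens: two text tokens and one TS patch, r = 3.
example = build_sequence(["t0", "t1"], ["p0"], r=3)
print(example)

# Hypothetical sizes: a 384-token caption and a single TS patch.
for r in (1, 64, 128):
    print(f"r={r:3d}  TS share of pooled positions = {ts_fraction(384, 1, r):.3f}")
```

Under these assumptions the TS share of the pooled representation rises from roughly 0\.3% at *r* = 1 to about 14% at *r* = 64, which is the rebalancing the section describes\.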

![Refer to caption](https://arxiv.org/html/2605.20268v1/x3.png) Figure 3: Effect of TS\-token repetition on multimodal classification\. Accuracy \(left\), AUC \(middle\), and macro\-F1 \(right\) as a function of TS\-token repeats *r*, evaluated on the three TimeCAP domains and averaged \(dashed black\)\. Repetition rebalances the text–TS token ratio in the shared sequence; performance peaks near *r* = 64 then degrades as attention dilutes across identical copies\.

## 6 Limitations and Future Work

Several limitations of the current work suggest directions for future research\. First, the forecasting gap to dedicated TSFMs reflects a compute trade\-off: our text\-heavy 92/8 mix was chosen to keep language understanding competitive with scale\-matched text\-only models, and closing this gap likely requires more compute or a curriculum strategy \(e\.g\., TS\-only pretraining followed by joint continued pretraining\) rather than a different architecture\. Second, our causal next\-patch objective unifies the text and TS streams but compounds errors over long horizons; hybrid schemes that retain causal attention for text while applying bidirectional attention and contiguous patch masking for time series \[Wen et al\., [2026](https://arxiv.org/html/2605.20268#bib.bib7)\] could substantially improve long\-horizon zero\-shot forecasting within our architecture\. Third, Stage 2 introduces explicit cross\-modal supervision for only 5% of TS tokens, yet the consistent zero\-shot to linear\-probe gain on Time\-MMD indicates that substantial cross\-modal information remains latent in the frozen backbone; a larger interleaved alignment stage with millions of paired examples is the single most promising direction for improving it\. Finally, we targeted a frozen representation backbone rather than a conversational agent, leaving open whether Chronicle can serve as a retrieval encoder for time series or, after instruction tuning, as the basis for temporal reasoning in the spirit of ChatTS \[Xie et al\., [2025](https://arxiv.org/html/2605.20268#bib.bib28)\] or MSE\-ITT \[Koval et al\., [2025](https://arxiv.org/html/2605.20268#bib.bib82)\]\.

## 7 Conclusion

We presented Chronicle, a 324M\-parameter decoder\-only transformer trained from scratch on natural language and time series within a single shared backbone\. Across five benchmarks \(NLU, GIFT\-Eval, UCR/UEA, Time\-MMD, and TimeCAP\), Chronicle is on par with scale\-matched LLMs on language understanding, sets a new bar for frozen\-embedding time series classification, and outperforms every supervised fusion baseline on multimodal forecasting, demonstrating that text and time series can share a transformer backbone without catastrophic interference\. Our results challenge the prevailing assumption that multimodal time series models must adapt a pretrained LLM, and suggest that joint pretraining from scratch is a more direct path to general\-purpose temporal representations\. The remaining forecasting gap to dedicated TSFMs is attributable to compute allocation and autoregressive inference, both addressable with scaling and objective refinements; the architecture itself supports both modalities cleanly\.

## References

- T\. Aksu, G\. Woo, J\. Liu, X\. Liu, C\. Liu, S\. Savarese, C\. Xiong, and D\. Sahoo \(2024\) GIFT\-eval: a benchmark for general time series forecasting model evaluation\. arXiv preprint arXiv:2410\.10393\. [Link](https://arxiv.org/abs/2410.10393)
- A\. Amini, A\. Banaszak, H\. Benoit, A\. Böök, T\. Dakhran, S\. Duong, A\. Eng, F\. Fernandes, M\. Härkönen, A\. Harrington, R\. Hasani, S\. Karwa, Y\. Khrustalev, M\. Labonne, M\. Lechner, V\. Lechner, S\. Lee, Z\. Li, N\. Loo, J\. Marks, E\. Mosca, S\. J\. Paech, P\. Pak, R\. N\. Parnichkun, A\. Quach, R\. Rogers, D\. Rus, N\. Saxena, B\. Schlager, T\. Seyde, J\. T\. H\. Smith, A\. Tadimeti, and N\. Tumma \(2025\) LFM2 technical report\. arXiv preprint arXiv:2511\.23404\. [Link](https://arxiv.org/abs/2511.23404)
- A\. F\. Ansari, O\. Shchur, J\. Küken, A\. Auer, B\. Han, et al\. \(2025\) Chronos\-2: from univariate to universal forecasting\. arXiv preprint arXiv:2510\.15821\.
- A\. F\. Ansari, L\. Stella, C\. Turkmen, X\. Zhang, P\. Mercado, H\. Shen, O\. Shchur, S\. S\. Rangapuram, S\. Pineda Arango, S\. Kapoor, et al\. \(2024\) Chronos: learning the language of time series\. In International Conference on Machine Learning\.
- A\. Auer, P\. Podest, D\. Klotz, S\. Böck, G\. Klambauer, and S\. Hochreiter \(2025\) TiRex: zero\-shot forecasting across long and short horizons with enhanced in\-context learning\. arXiv preprint arXiv:2505\.23719\. [Link](https://arxiv.org/abs/2505.23719)
- Y\. Bisk, R\. Zellers, J\. Gao, Y\. Choi, et al\. \(2020\) PIQA: reasoning about physical commonsense in natural language\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 34, pp\. 7432–7439\.
- C\. Clark, K\. Lee, M\. Chang, T\. Kwiatkowski, M\. Collins, and K\. Toutanova \(2019\) BoolQ: exploring the surprising difficulty of natural yes/no questions\. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pp\. 2924–2936\.
- P\. Clark, I\. Cowhey, O\. Etzioni, T\. Khot, A\. Sabharwal, C\. Schoenick, and O\. Tafjord \(2018\) Think you have solved question answering? try ARC, the AI2 reasoning challenge\.
- B\. Cohen, E\. Khwaja, Y\. Doubli, S\. Lemaachi, C\. Lettieri, C\. Masson, H\. Miccinilli, E\. Ramé, Q\. Ren, A\. Rostamizadeh, J\. O\. du Terrail, A\. Toon, K\. Wang, S\. Xie, Z\. Xu, V\. Zhukova, D\. Asker, A\. Talwalkar, and O\. Abou\-Amal \(2025\) This time is different: an observability perspective on time series foundation models\. arXiv preprint arXiv:2505\.14766\. [Link](https://arxiv.org/abs/2505.14766)
- T\. Dao, D\. Y\. Fu, S\. Ermon, A\. Rudra, and C\. Ré \(2022\) FlashAttention: fast and memory\-efficient exact attention with IO\-awareness\. In Advances in Neural Information Processing Systems \(NeurIPS\)\.
- A\. Das, W\. Kong, R\. Leber, R\. Sen, and R\. Yu \(2024\) A decoder\-only foundation model for time\-series forecasting\. In International Conference on Machine Learning\.
- H\. A\. Dau, A\. Bagnall, K\. Kamgar, C\. M\. Yeh, Y\. Zhu, S\. Gharghabi, C\. A\. Ratanamahatana, and E\. Keogh \(2019\) The UCR time series archive\. IEEE/CAA Journal of Automatica Sinica 6 \(6\), pp\. 1293–1305\.
- J\. Devlin, M\. Chang, K\. Lee, and K\. Toutanova \(2019\) BERT: pre\-training of deep bidirectional transformers for language understanding\. arXiv preprint arXiv:1810\.04805\.
- S\. Gao, T\. Koker, O\. Queen, T\. Hartvigsen, T\. Tsiligkaridis, and M\. Zitnik \(2024\) UniTS: building a unified time series model\. In Advances in Neural Information Processing Systems\.
- Gemma Team \(2025\) Gemma 3 technical report\. arXiv preprint arXiv:2503\.19786\.
- T\. Gneiting and A\. E\. Raftery \(2007\) Strictly proper scoring rules, prediction, and estimation\. Journal of the American Statistical Association 102 \(477\), pp\. 359–378\.
- M\. Goswami, K\. Szafer, A\. Choudhry, Y\. Cai, S\. Li, and A\. Dubrawski \(2024\) MOMENT: a family of open time\-series foundation models\. arXiv preprint arXiv:2402\.03885\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, et al\. \(2024\) The LLaMA\-3 herd of models\. arXiv preprint arXiv:2407\.21783\.
- N\. Gruver, M\. Finzi, S\. Qiu, and A\. G\. Wilson \(2023\) Large language models are zero\-shot time series forecasters\. Advances in Neural Information Processing Systems 36\.
- F\. Jia, K\. Wang, Y\. Zheng, D\. Cao, and Y\. Liu \(2024\) GPT4MTS: prompt\-based large language model for multimodal time\-series forecasting\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 38, pp\. 23343–23351\.
- M\. Jin, S\. Wang, L\. Ma, Z\. Chu, J\. Y\. Zhang, X\. Shi, P\. Chen, Y\. Liang, Y\. Li, S\. Pan, and Q\. Wen \(2024\) Time\-LLM: time series forecasting by reprogramming large language models\. In International Conference on Learning Representations\.
- K\. Jordan, Y\. Jin, V\. Boza, Y\. Jiacheng, F\. Cesista, L\. Newhouse, and J\. Bernstein \(2024\) Muon: an optimizer for hidden layers in neural networks\. URL https://kellerjordan\.github\.io/posts/muon 6 \(3\), pp\. 4\.
- R\. Koval, N\. Andrews, and X\. Yan \(2025\) Multimodal language models with modality\-specific experts for financial forecasting from interleaved sequences of text and time series\. arXiv preprint arXiv:2509\.19628\.
- G\. Lee et al\. \(2025\) TimeCAP: learning to contextualize, augment, and predict time series events with large language model agents\. In Proceedings of the AAAI Conference on Artificial Intelligence\. Note: arXiv:2502\.11418\.
- G\. Lee, W\. Yu, W\. Cheng, and H\. Chen \(2024\) MoAT: multi\-modal augmented time series forecasting\. Note: ICLR 2024 Workshop\.
- H\. Levesque, E\. Davis, and L\. Morgenstern \(2012\) The Winograd schema challenge\. In Proceedings of the Thirteenth International Conference on Principles of Knowledge Representation and Reasoning, pp\. 552–561\.
- J\. Li, A\. Fang, G\. Smyrnis, M\. Ivgi, M\. Jordan, S\. Gadre, H\. Bansal, E\. Guha, S\. Keh, K\. Arora, et al\. \(2024\) DataComp\-LM: in search of the next generation of training sets for language models\. Advances in Neural Information Processing Systems 37\.
- Z\. Li, X\. Lin, Z\. Liu, J\. Zou, Z\. Wu, L\. Zheng, D\. Fu, Y\. Zhu, H\. Hamann, H\. Tong, and J\. He \(2026\) Language in the flow of time: time\-series\-paired texts weaved into a unified temporal narrative\. arXiv preprint arXiv:2502\.08942\. [Link](https://arxiv.org/abs/2502.08942)
- Y\. Liang, H\. Wen, Y\. Nie, Y\. Jiang, M\. Jin, D\. Song, S\. Pan, and Q\. Wen \(2024\) Foundation models for time series analysis: a tutorial and survey\. arXiv preprint arXiv:2403\.14735\.
- C\. Liu, T\. Aksu, J\. Liu, X\. Liu, H\. Yan, Q\. Pham, S\. Savarese, D\. Sahoo, C\. Xiong, and J\. Li \(2026\) Moirai 2\.0: when less is more for time series forecasting\. arXiv preprint arXiv:2511\.11698\. [Link](https://arxiv.org/abs/2511.11698)
- C\. Liu, S\. Zhou, Q\. Xu, H\. Miao, C\. Long, Z\. Li, and R\. Zhao \(2025\) Towards cross\-modality modeling for time series analytics: a survey in the llm era\. arXiv preprint arXiv:2505\.02583\. [Link](https://arxiv.org/abs/2505.02583)
- H\. Liu et al\. \(2024\) Time\-MMD: a new multi\-domain multimodal dataset for time series analysis\. Advances in Neural Information Processing Systems 37\.
- Y\. Liu, T\. Hu, H\. Zhang, H\. Wu, S\. Wang, L\. Ma, and M\. Long \(2024\) iTransformer: inverted transformers are effective for time series forecasting\. arXiv preprint arXiv:2310\.06625\.
- M\. A\. Merrill, M\. Tan, V\. Gupta, T\. Hartvigsen, and T\. Althoff \(2024\) Language models still struggle to zero\-shot reason about time series\. arXiv preprint arXiv:2404\.11757\.
- Y\. Nie, N\. H\. Nguyen, P\. Sinthong, and J\. Kalagnanam \(2023\) A time series is worth 64 words: long\-term forecasting with transformers\. In International Conference on Learning Representations\.
- OLMo Team and Allen Institute for AI \(2024\) Dolmino mix 1124: a dataset for the mid\-training of OLMo 2\. Note: Hugging Face Datasets\. [Link](https://huggingface.co/datasets/allenai/dolmino-mix-1124)
- B\. N\. Oreshkin, D\. Carpov, N\. Chapados, and Y\. Bengio \(2020\) N\-BEATS: neural basis expansion analysis for interpretable time series forecasting\. In International Conference on Learning Representations\. [Link](https://arxiv.org/abs/1905.10437)
- D\. Paperno, G\. Kruszewski, A\. Lazaridou, N\. Q\. Pham, R\. Bernardi, S\. Pezzelle, M\. Baroni, G\. Boleda, and R\. Fernández \(2016\) The LAMBADA dataset: word prediction requiring a broad discourse context\. arXiv preprint arXiv:1606\.06031\.
- G\. Penedo, H\. Kydlíček, L\. Ben Allal, A\. Lozhkov, M\. Mitchell, C\. Raffel, L\. V\. Werra, and T\. Wolf \(2024\)The fineweb datasets: decanting the web for the finest text data at scale\.External Links:2406\.17557,[Link](https://arxiv.org/abs/2406.17557)Cited by:[Appendix B](https://arxiv.org/html/2605.20268#A2.SS0.SSS0.Px3.p1.5),[Appendix B](https://arxiv.org/html/2605.20268#A2.SS0.SSS0.Px7.p1.2),[§4](https://arxiv.org/html/2605.20268#S4.p1.14)\.
- O\. Press and L\. Wolf \(2017\)Using the output embedding to improve language models\.InProceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers,M\. Lapata, P\. Blunsom, and A\. Koller \(Eds\.\),Valencia, Spain,pp\. 157–163\.External Links:[Link](https://aclanthology.org/E17-2025/)Cited by:[§3\.1](https://arxiv.org/html/2605.20268#S3.SS1.p3.1)\.
- P\. Quinlan, Q\. Li, and X\. Zhu \(2026\)Chat\-ts: enhancing multi\-modal reasoning over time\-series and natural language data\.External Links:2503\.10883,[Link](https://arxiv.org/abs/2503.10883)Cited by:[§1](https://arxiv.org/html/2605.20268#S1.p3.1),[Table 1](https://arxiv.org/html/2605.20268#S2.T1.2.2.10.1),[§2](https://arxiv.org/html/2605.20268#S2.p3.1)\.
- A\. Radford, J\. Wu, R\. Child, D\. Luan, D\. Amodei, and I\. Sutskever \(2019\)Language models are unsupervised multitask learners\.OpenAI Blog\.Cited by:[§D\.3](https://arxiv.org/html/2605.20268#A4.SS3.p1.1),[§2](https://arxiv.org/html/2605.20268#S2.p5.1),[§3](https://arxiv.org/html/2605.20268#S3.p1.1),[§5\.1\.1](https://arxiv.org/html/2605.20268#S5.SS1.SSS1.p1.1),[§5\.2](https://arxiv.org/html/2605.20268#S5.SS2.SSS0.Px1.p2.1)\.
- P\. Rajpurkar, J\. Zhang, K\. Lopyrev, and P\. Liang \(2016\)SQuAD: 100,000\+ questions for machine comprehension of text\.InProceedings of the 2016 Conference on Empirical Methods in Natural Language Processing,pp\. 2383–2392\.Cited by:[§5\.1\.1](https://arxiv.org/html/2605.20268#S5.SS1.SSS1.p1.1)\.
- S\. Reddy, D\. Chen, and C\. D\. Manning \(2019\)CoQA: a conversational question answering challenge\.Transactions of the Association for Computational Linguistics7,pp\. 249–266\.Cited by:[§5\.1\.1](https://arxiv.org/html/2605.20268#S5.SS1.SSS1.p1.1)\.
- M\. Roemmele, C\. A\. Bejan, and A\. S\. Gordon \(2011\)Choice of plausible alternatives: an evaluation of commonsense causal reasoning\.InAAAI Spring Symposium Series,Cited by:[§5\.1\.1](https://arxiv.org/html/2605.20268#S5.SS1.SSS1.p1.1)\.
- K\. Sakaguchi, R\. L\. Bras, C\. Bhagavatula, and Y\. Choi \(2019\)WinoGrande: an adversarial winograd schema challenge at scale\.External Links:1907\.10641,[Link](https://arxiv.org/abs/1907.10641)Cited by:[§5\.1\.1](https://arxiv.org/html/2605.20268#S5.SS1.SSS1.p1.1)\.
- D\. Salinas, V\. Flunkert, J\. Gasthaus, and T\. Januschowski \(2020\)DeepAR: probabilistic forecasting with autoregressive recurrent networks\.International Journal of Forecasting36\(3\),pp\. 1181–1191\.Cited by:[§5\.1\.2](https://arxiv.org/html/2605.20268#S5.SS1.SSS2.p1.1)\.
- N\. Shazeer \(2020\)GLU variants improve transformer\.External Links:2002\.05202,[Link](https://arxiv.org/abs/2002.05202)Cited by:[Appendix B](https://arxiv.org/html/2605.20268#A2.SS0.SSS0.Px1.p1.4),[§4](https://arxiv.org/html/2605.20268#S4.p1.14)\.
- A\. Srivastava, A\. Rastogi, A\. Rao, A\. A\. M\. Shoeb, A\. Abid, A\. Fisch, A\. R\. Brown, A\. Santoro, A\. Gupta, A\. Garriga\-Alonso, et al\. \(2023\)Beyond the imitation game: quantifying and extrapolating the capabilities of language models\.Transactions on Machine Learning Research\.Cited by:[§5\.1\.1](https://arxiv.org/html/2605.20268#S5.SS1.SSS1.p1.1)\.
- J\. Su, M\. Ahmed, Y\. Lu, S\. Pan, W\. Bo, and Y\. Liu \(2024\)RoFormer: enhanced transformer with rotary position embedding\.Neurocomputing568,pp\. 127063\.Cited by:[Appendix B](https://arxiv.org/html/2605.20268#A2.SS0.SSS0.Px1.p1.4),[§4](https://arxiv.org/html/2605.20268#S4.p1.14)\.
- A\. Talmor, J\. Herzig, N\. Lourie, and J\. Berant \(2019\)CommonsenseQA: a question answering challenge targeting world knowledge\.InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics,pp\. 4149–4158\.Cited by:[§5\.1\.1](https://arxiv.org/html/2605.20268#S5.SS1.SSS1.p1.1)\.
- M\. Tan, M\. A\. Merrill, V\. Gupta, T\. Hartvigsen, and T\. Althoff \(2024\)Are language models actually useful for time series forecasting?\.InAdvances in Neural Information Processing Systems,Cited by:[§D\.2](https://arxiv.org/html/2605.20268#A4.SS2.SSS0.Px5.p1.1),[§1](https://arxiv.org/html/2605.20268#S1.p3.1),[§2](https://arxiv.org/html/2605.20268#S2.p4.1)\.
- A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, Ł\. Kaiser, and I\. Polosukhin \(2017\)Attention is all you need\.Advances in Neural Information Processing Systems30\.Cited by:[§3](https://arxiv.org/html/2605.20268#S3.p1.1)\.
- C\. Wang, Q\. Qi, J\. Wang, H\. Sun, Z\. Zhuang, J\. Wu, L\. Zhang, and J\. Liao \(2024\)ChatTime: a unified multimodal time series foundation model bridging numerical and textual data\.External Links:2412\.11376,[Link](https://arxiv.org/abs/2412.11376)Cited by:[§D\.2](https://arxiv.org/html/2605.20268#A4.SS2.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2605.20268#S1.p2.1),[Table 1](https://arxiv.org/html/2605.20268#S2.T1.2.2.9.1),[§2](https://arxiv.org/html/2605.20268#S2.p3.1)\.
- X\. Wang, T\. Zhou, J\. Gao, B\. Ding, and J\. Zhou \(2025\)Output scaling: YingLong\-delayed chain of thought in a large pretrained time series forecasting model\.arXiv preprint arXiv:2506\.11029\.External Links:[Link](https://arxiv.org/abs/2506.11029)Cited by:[§D\.1](https://arxiv.org/html/2605.20268#A4.SS1.SSS0.Px2.p1.1),[§5\.1\.2](https://arxiv.org/html/2605.20268#S5.SS1.SSS2.p1.1)\.
- Y\. Wen, W\. M\. Gifford, C\. Reddy, L\. M\. Nguyen, J\. Kalagnanam, and A\. A\. Julius \(2026\)Revisiting the generic transformer: deconstructing a strong baseline for time series foundation models\.External Links:2602\.06909,[Link](https://arxiv.org/abs/2602.06909)Cited by:[§D\.1](https://arxiv.org/html/2605.20268#A4.SS1.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2605.20268#S1.p1.1),[§2](https://arxiv.org/html/2605.20268#S2.p2.1),[§3\.1](https://arxiv.org/html/2605.20268#S3.SS1.p1.6),[§5\.1\.2](https://arxiv.org/html/2605.20268#S5.SS1.SSS2.p1.1),[§6](https://arxiv.org/html/2605.20268#S6.p1.1)\.
- G\. Woo, C\. Liu, A\. Kumar, C\. Xiong, S\. Savarese, and D\. Sahoo \(2024\)Unified training of universal time series forecasting transformers\.InInternational Conference on Machine Learning,Cited by:[§D\.1](https://arxiv.org/html/2605.20268#A4.SS1.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2605.20268#S1.p1.1),[§2](https://arxiv.org/html/2605.20268#S2.p2.1),[§5\.1\.2](https://arxiv.org/html/2605.20268#S5.SS1.SSS2.p1.1)\.
- H\. Wu, T\. Hu, Y\. Liu, H\. Zhou, J\. Wang, and M\. Long \(2023\)TimesNet: temporal 2D\-variation modeling for general time series analysis\.InInternational Conference on Learning Representations,Cited by:[§5\.2](https://arxiv.org/html/2605.20268#S5.SS2.SSS0.Px1.p1.1)\.
- H\. Wu, J\. Xu, J\. Wang, and M\. Long \(2021\)Autoformer: decomposition transformers with auto\-correlation for long\-term series forecasting\.InAdvances in Neural Information Processing Systems,Vol\.34\.Cited by:[§5\.2](https://arxiv.org/html/2605.20268#S5.SS2.SSS0.Px1.p1.1)\.
- Z\. Xie, Z\. Li, X\. He, L\. Xu, X\. Wen, T\. Zhang, J\. Chen, R\. Shi, and D\. Pei \(2025\)ChatTS: aligning time series with llms via synthetic data for enhanced understanding and reasoning\.Proceedings of the VLDB Endowment18\(8\),pp\. 2385–2398\.External Links:ISSN 2150\-8097,[Link](http://dx.doi.org/10.14778/3742728.3742735),[Document](https://dx.doi.org/10.14778/3742728.3742735)Cited by:[Appendix B](https://arxiv.org/html/2605.20268#A2.SS0.SSS0.Px6.p1.8),[§D\.2](https://arxiv.org/html/2605.20268#A4.SS2.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2605.20268#S1.p2.1),[Table 1](https://arxiv.org/html/2605.20268#S2.T1.2.2.8.1),[§2](https://arxiv.org/html/2605.20268#S2.p3.1),[§4](https://arxiv.org/html/2605.20268#S4.p1.14),[§6](https://arxiv.org/html/2605.20268#S6.p1.1)\.
- A\. Yang, B\. Yang, et al\. \(2024\)Qwen2 technical report\.arXiv preprint arXiv:2407\.10671\.Cited by:[§D\.3](https://arxiv.org/html/2605.20268#A4.SS3.p1.1),[§2](https://arxiv.org/html/2605.20268#S2.p5.1),[§5\.1\.1](https://arxiv.org/html/2605.20268#S5.SS1.SSS1.p1.1)\.
- R\. Zellers, A\. Holtzman, Y\. Bisk, A\. Farhadi, and Y\. Choi \(2019\)HellaSwag: can a machine really finish your sentence?\.InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics,pp\. 4791–4800\.Cited by:[§5\.1\.1](https://arxiv.org/html/2605.20268#S5.SS1.SSS1.p1.1)\.
- A\. Zeng, M\. Chen, L\. Zhang, and Q\. Xu \(2023\)Are transformers effective for time series forecasting?\.Proceedings of the AAAI Conference on Artificial Intelligence37\(9\),pp\. 11121–11128\.Cited by:[§5\.1\.2](https://arxiv.org/html/2605.20268#S5.SS1.SSS2.p1.1),[§5\.2](https://arxiv.org/html/2605.20268#S5.SS2.SSS0.Px1.p1.1)\.
- B\. Zhang and R\. Sennrich \(2019\)Root mean square layer normalization\.Advances in Neural Information Processing Systems32\.Cited by:[§4](https://arxiv.org/html/2605.20268#S4.p1.14)\.
- X\. Zhang et al\. \(2025\)When does multimodality lead to better time series forecasting?\.arXiv preprint arXiv:2506\.21611\.Cited by:[§D\.2](https://arxiv.org/html/2605.20268#A4.SS2.SSS0.Px3.p1.1),[§1](https://arxiv.org/html/2605.20268#S1.p3.1)\.
- S\. Zhong, W\. Ruan, M\. Jin, H\. Li, Q\. Wen, and Y\. Liang \(2025\)Time\-vlm: exploring multimodal vision\-language models for augmented time series forecasting\.External Links:2502\.04395,[Link](https://arxiv.org/abs/2502.04395)Cited by:[§D\.2](https://arxiv.org/html/2605.20268#A4.SS2.SSS0.Px2.p1.1)\.
- W\. Zhong, R\. Cui, Y\. Guo, Y\. Liang, S\. Lu, Y\. Wang, A\. Saied, W\. Chen, and N\. Duan \(2023\)AGIEval: a human\-centric benchmark for evaluating foundation models\.arXiv preprint arXiv:2304\.06364\.Cited by:[§5\.1\.1](https://arxiv.org/html/2605.20268#S5.SS1.SSS1.p1.1)\.
- H\. Zhou, S\. Zhang, J\. Peng, S\. Zhang, J\. Li, H\. Xiong, and W\. Zhang \(2021\)Informer: beyond efficient transformer for long sequence time\-series forecasting\.InProceedings of the AAAI Conference on Artificial Intelligence,Cited by:[§5\.2](https://arxiv.org/html/2605.20268#S5.SS2.SSS0.Px1.p1.1)\.
- T\. Zhou, Z\. Ma, Q\. Wen, X\. Wang, L\. Sun, and R\. Jin \(2022\)FEDformer: frequency enhanced decomposed transformer for long\-term series forecasting\.arXiv preprint arXiv:2201\.12740\.Cited by:[§5\.2](https://arxiv.org/html/2605.20268#S5.SS2.SSS0.Px1.p1.1)\.
- T\. Zhou, P\. Niu, X\. Wang, L\. Sun, and R\. Jin \(2023\)One fits all: power general time series analysis by pretrained lm\.External Links:2302\.11939,[Link](https://arxiv.org/abs/2302.11939)Cited by:[§D\.2](https://arxiv.org/html/2605.20268#A4.SS2.SSS0.Px1.p1.1),[§D\.2](https://arxiv.org/html/2605.20268#A4.SS2.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2605.20268#S1.p2.1),[Table 1](https://arxiv.org/html/2605.20268#S2.T1.2.2.5.1),[§2](https://arxiv.org/html/2605.20268#S2.p3.1)\.

## Appendix A Model and Code

All model checkpoints and evaluation code are publicly available at the links below\. The Chronicle Stage 1 and Stage 2 checkpoints are hosted on Hugging Face at[\[HUGGINGFACE\_LINK\]](https://arxiv.org/html/2605.20268v1/%5BHUGGINGFACE_LINK%5D)\. Evaluation code is available at[\[GITHUB\_LINK\]](https://arxiv.org/html/2605.20268v1/%5BGITHUB_LINK%5D)\.

## Appendix B Full Implementation Details

##### Architectural summary\.

The model is a 16\-layer decoder\-only transformer with $d = 1024$, 8 GQA query heads, 4 KV heads \(head dim 128\), and SwiGLU MLPs \[Shazeer,[2020](https://arxiv.org/html/2605.20268#bib.bib89)\] of hidden dimension $\lceil 8d/3 \rceil$ rounded up to the nearest multiple of 256\. Each block applies pre\-norm RMSNorm before the attention and MLP submodules and uses standard residual connections; no additional residual\-stream modifications are introduced\. RoPE \[Su et al\.,[2024](https://arxiv.org/html/2605.20268#bib.bib41)\] positional encodings use base frequency $5 \times 10^{5}$\. QK normalization is applied within attention before the dot product\. We use FlashAttention \[Dao et al\.,[2022](https://arxiv.org/html/2605.20268#bib.bib90)\] for efficient causal self\-attention; KV\-caching support is included for autoregressive inference\. Logit soft\-capping via $15 \cdot \tanh(\ell/15)$ is applied to text outputs\.
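The two closed\-form quantities in this paragraph can be checked with a few lines of Python\. This is a minimal sketch; the function names are hypothetical and not taken from the released code\.

```python
import math


def swiglu_hidden_dim(d: int, multiple: int = 256) -> int:
    """Return ceil(8d/3), rounded up to the nearest multiple of `multiple`."""
    raw = -(-8 * d // 3)                    # integer ceiling of 8d/3
    return -(-raw // multiple) * multiple   # round up to a multiple of 256


def soft_cap(logit: float, cap: float = 15.0) -> float:
    """Smoothly bound a logit to the open interval (-cap, cap)."""
    return cap * math.tanh(logit / cap)


if __name__ == "__main__":
    print(swiglu_hidden_dim(1024))          # hidden width for d = 1024
    print(soft_cap(0.5), soft_cap(100.0))   # small logits pass nearly unchanged
```

For $d = 1024$, $\lceil 8d/3 \rceil = 2731$, which rounds up to 2816\. Soft\-capping leaves small logits almost untouched while keeping large ones strictly below 15 in magnitude\.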

##### Input/output interfaces\.

Text embeddings are produced by a single learned table \($|\mathcal{V}| = 131{,}072$, dimension $d$\) and tied with the language model output head\. Time series patch embeddings are produced by a single bias\-free linear layer $\mathbf{W}_{p} \in \mathbb{R}^{d \times 4P}$ followed by RMSNorm, applied to the $4P$\-dimensional patch features $[\mathbf{r};\mathbf{v};\mathbf{m};\mathbf{c}]$ described in Section [3\.1](https://arxiv.org/html/2605.20268#S3.SS1)\. The quantile head consists of an RMSNorm followed by a bias\-free linear projection $\mathbb{R}^{d} \to \mathbb{R}^{P \cdot Q}$ with $Q = 21$\.
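As a rough sanity check on model size, the matrix shapes stated in this appendix can be tallied\. The sketch below assumes the patch length $P = 32$ listed as the Chronicle patch length in Table 6, assumes the tied output head adds no parameters, and counts weight matrices only \(RMSNorm scales and any QK\-norm parameters are ignored\)\. Function and key names are hypothetical\.

```python
def ceil_div(a: int, b: int) -> int:
    return -(-a // b)


def count_matrix_params(d=1024, n_layers=16, n_q=8, n_kv=4, head_dim=128,
                        vocab=131_072, patch_len=32, n_quantiles=21):
    """Count weight-matrix parameters implied by the stated shapes."""
    hidden = ceil_div(ceil_div(8 * d, 3), 256) * 256   # SwiGLU width (2816)
    attn = (d * n_q * head_dim          # Q projection
            + 2 * d * n_kv * head_dim   # K and V projections (GQA)
            + n_q * head_dim * d)       # output projection
    mlp = 3 * d * hidden                # SwiGLU w1, w2, w3
    return {
        "transformer_blocks": n_layers * (attn + mlp),
        "token_embedding_tied": vocab * d,
        "patch_projection": d * 4 * patch_len,          # W_p in R^{d x 4P}
        "quantile_head": d * patch_len * n_quantiles,   # R^d -> R^{P*Q}
    }


if __name__ == "__main__":
    counts = count_matrix_params()
    for name, n in counts.items():
        print(f"{name:>22}: {n:,}")
    print(f"{'total':>22}: {sum(counts.values()):,}")
```

Under these assumptions the total is about 323\.8M, close to the 324M stated for the model, with the tied embedding table accounting for roughly 134M of it\.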

##### Tokenizer\.

The text tokenizer is a byte\-level BPE vocabulary of 131,072 tokens trained from scratch with RustBPE on a 50B\-character training corpus\. The training mixture consists of 62\.5% FineWeb\-Edu \[Penedo et al\.,[2024](https://arxiv.org/html/2605.20268#bib.bib91)\] and 37\.5% Dolmino\-mix\-1124 \[OLMo Team and Allen Institute for AI,[2024](https://arxiv.org/html/2605.20268#bib.bib92)\] \(dolmino\_ratio = 0\.375\), matching the proportions used during model pretraining\.

##### Optimizer groups\.

Parameters are partitioned into a Muon group and three AdamW groups\. Muon \[Jordan et al\.,[2024](https://arxiv.org/html/2605.20268#bib.bib47)\] \(Newton–Schulz, 5 steps, momentum 0\.95\) is applied to all 2D weight matrices in the transformer blocks \(attention $Q$, $K$, $V$, output projection, and SwiGLU $w_{1}$, $w_{2}$, $w_{3}$\) at LR $= 0.02$\. AdamW is used for: the token embedding table at LR $= 0.2$; the \(untied\) lm\_head at LR $= 0.004$ when applicable; and the patch projection, quantile head, and all RMSNorm scales \(including the post\-embedding norm, the final norm, and the per\-block pre\-norms\) at LR $= 0.002$\. AdamW uses $\beta_{1} = 0.8$, $\beta_{2} = 0.95$, $\epsilon = 10^{-10}$, and weight decay 0\. All AdamW learning rates are scaled by $\sqrt{768/d}$ to preserve update magnitudes across model dimensions\.
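The dimension\-dependent scaling can be made concrete as follows\. This sketch assumes the AdamW rates quoted above are base values to which the $\sqrt{768/d}$ factor is then applied; the group names are hypothetical labels, not identifiers from the released code\.

```python
import math

# Base AdamW learning rates as quoted in the text (assumed to be pre-scaling).
BASE_ADAMW_LRS = {
    "token_embedding": 0.2,
    "lm_head_untied": 0.004,
    "patch_proj_quantile_head_norms": 0.002,
}
MUON_LR = 0.02  # Muon group; the text does not describe any scaling for it


def scaled_adamw_lrs(d: int, ref_dim: int = 768) -> dict:
    """Scale every AdamW learning rate by sqrt(ref_dim / d)."""
    scale = math.sqrt(ref_dim / d)
    return {name: lr * scale for name, lr in BASE_ADAMW_LRS.items()}


if __name__ == "__main__":
    for name, lr in scaled_adamw_lrs(1024).items():
        print(f"{name}: {lr:.6f}")
```

At $d = 1024$ the factor is $\sqrt{0.75} \approx 0.866$, so each AdamW rate is reduced by about 13% relative to its base value; at $d = 768$ the rates are unchanged\.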

##### Stage 1: schedule and batching\.

The learning rate follows a three\-phase schedule: a 40\-step linear warmup, a constant phase, and linear decay over the final 65% of training\. We train for 47,683 steps at sequence length 2048 with a device micro\-batch of 48 and 16 gradient accumulation steps, giving a global batch size of 3,145,728 tokens and $\sim$150B total tokens \($\sim$138B text, $\sim$12B time series patches\)\. At each step, the batch composition \(text\-only vs\. time\-series\-only\) is sampled and broadcast across all data\-parallel ranks before any forward computation, so the gradient at every step is computed exclusively over a single modality\.
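The schedule and token budget can be sketched as below\. Two assumptions are made that the text does not state explicitly: the global batch figure implies two data\-parallel devices \(consistent with the two\-GPU setup mentioned under Pretraining data\), and the decay endpoint is exposed as a hypothetical final\_mult parameter because the final learning rate is not given\.

```python
def lr_multiplier(step: int, total_steps: int = 47_683, warmup_steps: int = 40,
                  decay_frac: float = 0.65, final_mult: float = 0.0) -> float:
    """Warmup -> constant -> linear decay over the final `decay_frac` of training."""
    decay_start = total_steps - int(decay_frac * total_steps)
    if step < warmup_steps:
        return (step + 1) / warmup_steps              # linear warmup
    if step < decay_start:
        return 1.0                                    # constant phase
    progress = (step - decay_start) / (total_steps - decay_start)
    return 1.0 + progress * (final_mult - 1.0)        # linear decay


def tokens_per_step(micro_batch: int = 48, grad_accum: int = 16,
                    seq_len: int = 2048, n_devices: int = 2) -> int:
    """Global batch size in tokens (n_devices = 2 is an assumption)."""
    return micro_batch * grad_accum * seq_len * n_devices


if __name__ == "__main__":
    per_step = tokens_per_step()
    total = per_step * 47_683
    print(per_step, f"{total / 1e9:.1f}B tokens")
    print(f"text ~{0.92 * total / 1e9:.0f}B, TS ~{0.08 * total / 1e9:.0f}B")
```

With these values the per\-step batch matches the stated 3,145,728 tokens, and a 0\.92/0\.08 split of the total reproduces the approximate 138B/12B breakdown\.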

##### Stage 2: extended context and multimodal alignment\.

Stage 2 reloads the stage\-1 checkpoint, extends sequence length to 4096, and continues training\. Within the time\-series stream, 5% of tokens are drawn from interleaved alignment data: the alignment subset of ChatTS \[Xie et al\.,[2025](https://arxiv.org/html/2605.20268#bib.bib28)\] \(synthetic series paired with descriptive text labels\) and the time series description corpus of Merrill et al\. \[[2024](https://arxiv.org/html/2605.20268#bib.bib27)\] \(natural\-language descriptions of temporal patterns\)\. The remaining 95% of TS tokens use the stage\-1 unimodal corpus\. For multimodal alignment batches, the loss combines $\mathcal{L}_{\mathrm{CE}}$ at text positions and $\mathcal{L}_{\mathrm{QL}}$ at TS positions, with the same global weighting \($w_{\mathrm{text}} = 1.0$, $w_{\mathrm{TS}} = 2.5$\)\. The text/TS micro\-batch ratio remains 0\.92/0\.08, the optimizer state is preserved, and the learning rate continues its decay schedule from the end of stage 1\.
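A minimal sketch of how such a mixed loss could be assembled is given below\. It assumes $\mathcal{L}_{\mathrm{QL}}$ is the standard pinball \(quantile\) loss and that each term is averaged over its own positions before weighting; neither detail is confirmed by this appendix, and the evenly spaced quantile levels are purely illustrative\.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def pinball_loss(y, preds, levels):
    """Mean pinball loss of quantile predictions `preds` at `levels`."""
    total = 0.0
    for q, p in zip(levels, preds):
        diff = y - p
        total += max(q * diff, (q - 1.0) * diff)
    return total / len(levels)


def alignment_loss(text_losses, ts_losses, w_text=1.0, w_ts=2.5):
    """Weighted sum of per-modality mean losses (averaging is an assumption)."""
    ce = sum(text_losses) / len(text_losses) if text_losses else 0.0
    ql = sum(ts_losses) / len(ts_losses) if ts_losses else 0.0
    return w_text * ce + w_ts * ql


if __name__ == "__main__":
    levels = [(i + 1) / 22 for i in range(21)]   # hypothetical Q = 21 levels
    ts = [pinball_loss(0.3, [0.0] * 21, levels)]
    text = [cross_entropy([2.0, 0.5, -1.0], target=0)]
    print(alignment_loss(text, ts))
```

In a unimodal batch one of the two lists is empty, so the same function reduces to a single weighted term\.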

##### Pretraining data\.

Text comes from a mixture of FineWeb\-Edu \[Penedo et al\.,[2024](https://arxiv.org/html/2605.20268#bib.bib91)\] and the Dolmino\-mix\-1124 sub\-mixture \[OLMo Team and Allen Institute for AI,[2024](https://arxiv.org/html/2605.20268#bib.bib92)\] \(DCLM, FLAN, math, peS2o, Wikipedia, StackExchange\) with the dolmino mix ratio set to 0\.334 during pretraining\. Time series data comes from GiftEvalPretrain \($\sim$900GB\), augmented online with KernelSynth \(2–5 kernels from 33 generators\) and per\-batch jitter, scaling, and mixup \(Appendix [E](https://arxiv.org/html/2605.20268#A5)\)\. We do *not* include explicit multimodal batches during stage 1\. Stage 2 introduces the small alignment slice described above\. For pretraining we use two H100 80GB GPUs\. Total training for both stages takes roughly one week\.

##### Weight initialization\.

Linear layers in the transformer use a fan\-scaled normal init, $\mathcal{N}(0, \sigma)$ with $\sigma = \min\left(1, \sqrt{\mathrm{fan\_out}/\mathrm{fan\_in}}\right) / \sqrt{\mathrm{fan\_in}}$\. The token embedding table is initialized with $\mathcal{N}(0, 0.02)$ when weight tying is enabled\. RMSNorm scales are initialized to one\. The output projections of attention and the SwiGLU $w_{3}$, as well as the \(untied\) lm\_head when present, are zero\-initialized to keep the residual stream near identity at initialization\. The patch projection is initialized with the standard fan\-scaled normal; the quantile head’s linear projection is initialized to zero\.
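The fan\-scaled rule is easy to evaluate for the shapes used in this model\. The sketch below treats $\sigma$ as a standard deviation, which the $\mathcal{N}(0, \sigma)$ notation leaves ambiguous; the function name is hypothetical\.

```python
import math


def fan_scaled_std(fan_in: int, fan_out: int) -> float:
    """sigma = min(1, sqrt(fan_out / fan_in)) / sqrt(fan_in)."""
    return min(1.0, math.sqrt(fan_out / fan_in)) / math.sqrt(fan_in)


if __name__ == "__main__":
    print(fan_scaled_std(1024, 1024))   # square projection, e.g. attention Q
    print(fan_scaled_std(1024, 2816))   # widening layer: the min() clamps at 1
    print(fan_scaled_std(2816, 1024))   # narrowing layer: extra shrink factor
```

For widening or square layers the rule reduces to $1/\sqrt{\mathrm{fan\_in}}$, while narrowing layers receive the smaller value $\sqrt{\mathrm{fan\_out}}/\mathrm{fan\_in}$\.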

## Appendix C Downstream Evaluation Setup

This appendix consolidates the downstream evaluation protocol for TS classification \(UCR\), multimodal classification \(TimeCAP\), and multimodal forecasting \(Time\-MMD\)\. All settings here apply to every method in the corresponding tables, baselines and Chronicle alike, unless explicitly noted otherwise\.

### C\.1 Common Settings

Table 6: Settings shared across all downstream evaluations\.

| Setting | Value |
| --- | --- |
| Seed | 1337 |
| TS normalization | instance z\-score |
| Chronicle patch length | 32 |
| TimeCAP split | stratified 70/10/20 train/val/test |
| Time\-MMD forecasting split | chronological 70/10/20, no shuffle |
| UCR TS\-CLS split | aeon default train/test split |
| TimeCAP MM\-CLS max text length | 384 tokens |
| TimeCAP MM\-CLS trainable loss | class\-balanced cross\-entropy |

The TimeCAP class\-balanced cross\-entropy weights each class by the inverse of its training\-set frequency, normalized to sum to the number of classes\. This is applied uniformly to every trainable head in the MM classification table \(MM\-TSFlib supervised fusion, FM Fusion with frozen encoders, and the Chronicle head\) so that no method gains an artificial advantage from majority\-class collapse\.
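The weighting rule described above can be written directly\. This is a minimal sketch with a hypothetical function name and toy labels; the resulting weights would typically be passed to a weighted cross\-entropy loss\.

```python
from collections import Counter


def class_balanced_weights(train_labels):
    """Inverse-frequency class weights, normalized to sum to the class count."""
    counts = Counter(train_labels)
    inv = {c: 1.0 / n for c, n in counts.items()}   # proportional to 1 / frequency
    scale = len(counts) / sum(inv.values())         # make the weights sum to K
    return {c: w * scale for c, w in inv.items()}


if __name__ == "__main__":
    # Toy labels with a 3:1 imbalance (illustrative only, not TimeCAP data).
    labels = [0, 0, 0, 1]
    print(class_balanced_weights(labels))
```

On a 3:1 split the minority class receives three times the weight of the majority class, and the two weights sum to 2\.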

### C\.2 TimeCAP Multimodal Classification

TimeCAP \[Lee and others,[2025](https://arxiv.org/html/2605.20268#bib.bib30)\] pairs short multivariate time series with GPT\-4\-generated text summaries\. We evaluate on three reporting domains\. Weather labels are collapsed from the original city\-specific labels into binary rain / no\-rain\. Healthcare is reported as the mean of two underlying binary tasks \(in\-hospital mortality and disease test\-positive prediction\); each is evaluated separately under the same protocol and the per\-domain numbers in Table [10](https://arxiv.org/html/2605.20268#A5.T10) are their average\. Dataset statistics for the four underlying classification tasks are summarized in Table [7](https://arxiv.org/html/2605.20268#A3.T7)\.

Table 7: TimeCAP MM classification dataset statistics\. “TS shape” is \(steps, channels\) for multivariate series and \(steps,\) for univariate\. Caption length is reported in whitespace\-delimited word counts\.

| Task | Samples | Train / Val / Test | Classes \(counts\) | TS shape | Caption words \(mean / max\) |
| --- | --- | --- | --- | --- | --- |
| Weather | 5,652 | 3955 / 566 / 1131 | no rain \(4149\), rain \(1503\) | \(24, 5\) | 132\.6 / 196 |
| Finance | 1,238 | 866 / 124 / 248 | class 1 \(857\), 2 \(211\), 0 \(170\) | \(9,\) | 160\.6 / 228 |
| Healthcare mortality | 375 | 262 / 38 / 75 | False \(260\), True \(115\) | \(4,\) | 153\.8 / 212 |
| Healthcare positive | 427 | 298 / 43 / 86 | False \(294\), True \(133\) | \(6,\) | 154\.2 / 199 |

##### Baselines\.

MM\-TSFlib fusion baselines pair a trainable time series encoder \(DLinear, PatchTST, or TimesNet\) with a frozen pretrained text encoder \(BERT or GPT\-2\) and a trainable two\-layer MLP fusion head\. The TS encoder and head are trained end\-to\-end for 100 epochs at learning rate $10^{-3}$ with batch size 8\. FM Fusion baselines replace the trainable TS encoder with a frozen pretrained TS foundation model \(Chronos\-2, Moirai\-2, or TimesFM\) and train only the fusion head under the same schedule\. All baselines use the corrected TimeCAP labels, class\-balanced cross\-entropy, and a maximum text length of 384 tokens\.

##### Chronicle\.

For the linear\-probe setting, we feed the joint text–time\-series input to a single frozen Chronicle backbone and train only a two\-layer MLP classification head\. The head uses the same optimizer settings as the fusion baselines: 100 epochs, learning rate $10^{-3}$, batch size 8, dropout 0\.1, class\-balanced cross\-entropy, and mean\_full pooling over the backbone outputs\. Because Chronicle is causally autoregressive, short TimeCAP series can be underrepresented relative to the accompanying text\. We therefore repeat the TS\-token block $r$ times within the input, without changing the underlying time series or updating the backbone, and sweep $r \in \{1, 2, 4, 8, 16, 32, 48, 64, 96, 128\}$ in Section [5\.4](https://arxiv.org/html/2605.20268#S5.SS4)\. Macro\-F1 peaks at $r = 32$, while average AUC peaks at $r = 64$ \(0\.792 versus 0\.788 at $r = 32$\); we therefore report both the fair\-comparison setting $r = 1$ and the tuned setting $r = 64$ in the main results\.
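The repetition trick can be illustrated on plain Python lists\. This sketch makes two assumptions not stated in the text: the text tokens precede the TS block in the input, and mean\_full pooling averages the backbone outputs over every position\. All names and the token counts in the example are hypothetical\.

```python
def build_input(text_tokens, ts_patches, r=1):
    """Concatenate text tokens with the TS-patch block repeated r times."""
    return list(text_tokens) + list(ts_patches) * r


def ts_share(n_text, n_ts, r):
    """Fraction of input positions occupied by TS patches after repetition."""
    return (n_ts * r) / (n_text + n_ts * r)


def mean_pool(hidden_states):
    """Average a list of equal-length vectors over the sequence dimension."""
    n, dim = len(hidden_states), len(hidden_states[0])
    return [sum(h[j] for h in hidden_states) / n for j in range(dim)]


if __name__ == "__main__":
    # Illustrative sizes only: 200 text tokens and a single TS patch.
    for r in (1, 32, 64):
        seq = build_input(range(200), ["patch0"], r=r)
        print(r, len(seq), round(ts_share(200, 1, r), 3))
```

With a single patch and a few hundred text tokens, the TS positions make up well under 1% of the sequence at $r = 1$, which is the imbalance the repetition is meant to offset\.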

For LoRA experiments, the pretrained backbone weights remain fixed and we train only the LoRA adapters together with the classification head\. We report these rows separately from the linear\-probe results to distinguish frozen\-backbone evaluation from parameter\-efficient adaptation\.

### C\.3 Time\-MMD Multimodal Forecasting

Time\-MMD \[Liu and others,[2024](https://arxiv.org/html/2605.20268#bib.bib29)\] pairs each of nine domain\-specific multivariate time series with aligned textual fact reports\. Each domain is a single chronological sequence; we use the MM\-TSFlib chronological 70/10/20 split with frequency\-specific context and horizon settings \(Table [8](https://arxiv.org/html/2605.20268#A3.T8)\)\.

Table 8: Time\-MMD MM forecasting dataset statistics\. “Test windows by horizon” lists the number of evaluation windows produced by each horizon length in the same order as the “Horizons” column\.

| Domain | Freq\. | Rows | Channels | Train / Val / Test rows | Context | Horizons | Test windows by horizon |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Agriculture | monthly | 496 | 3 | 347 / 50 / 99 | 8 | 6, 8, 10, 12 | 94, 92, 90, 88 |
| Climate | monthly | 496 | 2 | 347 / 50 / 99 | 8 | 6, 8, 10, 12 | 94, 92, 90, 88 |
| Economy | monthly | 423 | 3 | 296 / 43 / 84 | 8 | 6, 8, 10, 12 | 79, 77, 75, 73 |
| Energy | weekly | 1,479 | 9 | 1035 / 149 / 295 | 36 | 12, 24, 36, 48 | 284, 272, 260, 248 |
| Environment | daily | 15,979 | 2 | 11185 / 1599 / 3195 | 96 | 48, 96, 192, 336 | 3148, 3100, 3004, 2860 |
| Public Health | weekly | 1,389 | 8 | 972 / 140 / 277 | 36 | 12, 24, 36, 48 | 266, 254, 242, 230 |
| Security | monthly | 297 | 1 | 207 / 31 / 59 | 8 | 6, 8, 10, 12 | 54, 52, 50, 48 |
| Social Good | monthly | 900 | 1 | 630 / 90 / 180 | 8 | 6, 8, 10, 12 | 175, 173, 171, 169 |
| Traffic | monthly | 531 | 1 | 371 / 54 / 106 | 8 | 6, 8, 10, 12 | 101, 99, 97, 95 |

##### Baselines\.

MM\-TSFlib forecasting baselines train a TS encoder \(DLinear, PatchTST, or TimesNet\) jointly with a trainable forecasting head, on top of a frozen text encoder \(BERT or GPT\-2\)\. FM Fusion baselines substitute a frozen pretrained TS foundation model \(Chronos\-2, Moirai\-2, or TimesFM\) for the trainable encoder and train only the fusion head\. All baselines minimize MSE on the Time\-MMD training split and report MAE on the held\-out test windows; per\-domain numbers in Table [4](https://arxiv.org/html/2605.20268#S5.T4) are averaged across the four horizon lengths in Table [8](https://arxiv.org/html/2605.20268#A3.T8)\.

##### Chronicle\.

We report two variants\. *ZS* is autoregressive next\-patch forecasting, with no head training; predictions are denormalized via the inverse of the patch\-level standardization in Eq\. [1](https://arxiv.org/html/2605.20268#S3.E1)\. *FT* adds a forecasting head on top of the frozen backbone with joint text and TS input; only the head’s parameters are updated, using MSE loss with the same horizon and split settings as the baselines\.

### C\.4 UCR Time Series Classification

The 14 UCR datasets used in the main paper are GunPoint, Coffee, ECG200, FaceFour, OSULeaf, SwedishLeaf, SyntheticControl, Trace, TwoPatterns, Wafer, Earthquakes, ShapeletSim, Chinatown, and ItalyPowerDemand, with the official aeon train/test splits in all cases\. Supervised DL baselines \(Autoformer, DLinear, FEDformer, Informer, iTransformer, PatchTST, TimesNet\) are trained per\-dataset for 30 epochs at lr $10^{-3}$ with batch size 16\. TS foundation model baselines \(Chronos\-2, Moirai\-2, TimesFM\) and Chronicle are evaluated with a learned linear probe on frozen embeddings under the same aeon split\. The Chronicle probe is trained for 200 epochs at lr $10^{-2}$ with weight decay 0 and batch size 64 on patch\-32, joint multivariate, channel\-aware, instance z\-scored embeddings, while the foundation\-model probes use the published linear\-probe protocol from each model’s reference implementation\.
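Instance z\-scoring, the normalization shared by all evaluations in Table 6, can be sketched as follows\. The sketch assumes normalization is applied per series \(and per channel for multivariate inputs\) using that series' own mean and standard deviation; the eps stabilizer and function name are hypothetical\.

```python
import math


def instance_zscore(series, eps: float = 1e-8):
    """Standardize one series with its own mean and (population) std."""
    mean = sum(series) / len(series)
    var = sum((x - mean) ** 2 for x in series) / len(series)
    std = math.sqrt(var)
    return [(x - mean) / (std + eps) for x in series]


def zscore_multivariate(channels, eps: float = 1e-8):
    """Apply instance z-scoring independently to each channel."""
    return [instance_zscore(ch, eps) for ch in channels]


if __name__ == "__main__":
    print(instance_zscore([1.0, 2.0, 3.0]))
    print(zscore_multivariate([[10.0, 20.0], [0.5, 0.5]]))
```

The eps term keeps constant channels finite: a flat series maps to all zeros instead of dividing by zero\.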

## Appendix D Extended Related Work

This appendix provides a comprehensive discussion of the three research threads that Chronicle builds upon and extends\.

### D\.1 Time Series Foundation Models

Foundation models for time series aim to generalize zero\-shot across domains and frequencies, analogous to how language models generalize across tasks \[Liang et al\.,[2024](https://arxiv.org/html/2605.20268#bib.bib12)\]\. The field has coalesced around two main input representations\.

##### Scalar tokenization\.

Chronos \[Ansari et al\.,[2024](https://arxiv.org/html/2605.20268#bib.bib1)\] tokenizes real\-valued series via scaling and quantization and trains a T5\-family encoder–decoder with cross\-entropy loss\. LLMTIME \[Gruver et al\.,[2023](https://arxiv.org/html/2605.20268#bib.bib22)\] and Chat\-TS take scalar tokenization to its extreme by representing values as digit strings and querying frozen LLMs, demonstrating useful numerical priors at the cost of verbosity and computational overhead\.

##### Patch\-based tokenization\.

A complementary line represents series as *patches*, contiguous windows projected to dense embeddings\. PatchTST \[Nie et al\.,[2023](https://arxiv.org/html/2605.20268#bib.bib6)\] introduced patch\-based tokenization for supervised forecasting, demonstrating that “a time series is worth 64 words\.” TimesFM \[Das et al\.,[2024](https://arxiv.org/html/2605.20268#bib.bib3)\] scaled a decoder\-only patch transformer to 200M parameters with pretraining on large\-scale corpora\. PatchTST\-FM \[Wen et al\.,[2026](https://arxiv.org/html/2605.20268#bib.bib7)\] revisited the generic transformer as a foundation model baseline, adding gated residual projections, a 99\-quantile output head, and cumulative patch masking \(CPM\), achieving state\-of\-the\-art on GIFT\-Eval at 260M parameters\. Moirai \[Woo et al\.,[2024](https://arxiv.org/html/2605.20268#bib.bib4)\] addressed heterogeneous frequencies with frequency\-specific projections within a masked encoder; Moirai\-2 \[Liu et al\.,[2026](https://arxiv.org/html/2605.20268#bib.bib79)\] extended this with improved architectures and training\. MOMENT \[Goswami et al\.,[2024](https://arxiv.org/html/2605.20268#bib.bib9)\] trains a backbone with lightweight task\-specific decoders for multiple tasks simultaneously, while UniTS \[Gao et al\.,[2024](https://arxiv.org/html/2605.20268#bib.bib11)\] pursues multi\-task generalization via unified token representations\. TiRex \[Auer et al\.,[2025](https://arxiv.org/html/2605.20268#bib.bib56)\], Toto \[Cohen et al\.,[2025](https://arxiv.org/html/2605.20268#bib.bib57)\], and YingLong \[Wang et al\.,[2025](https://arxiv.org/html/2605.20268#bib.bib58)\] represent further entries on the GIFT\-Eval leaderboard\.

Our architecture draws on PatchTST\-FM and TimesFM \(patch\-based, decoder\-only, quantile output\) but differs fundamentally in being trained jointly with natural language from scratch\. We compare against these models on GIFT\-Eval \(Section [5\.1\.2](https://arxiv.org/html/2605.20268#S5.SS1.SSS2)\), where published leaderboard scores provide a direct zero\-shot comparison, and on UCR classification \(Section [5\.3](https://arxiv.org/html/2605.20268#S5.SS3)\), where Chronos\-2 and Moirai\-2 frozen embeddings serve as foundation model baselines\.

### D.2 Multimodal Text and Time Series Models

A rapidly growing body of work connects language models to time series\. We organize these approaches by their architectural paradigm and highlight their evaluation limitations\.

##### Frozen LLM approaches\.

LLMTIME \[Gruver et al., [2023](https://arxiv.org/html/2605.20268#bib.bib22)\] queries frozen GPT-3/LLaMA with digit-string representations of time series, demonstrating zero-shot forecasting capability but inheriting the full computational cost of large language models and producing no learnable temporal representations. GPT4MTS \[Jia et al., [2024](https://arxiv.org/html/2605.20268#bib.bib32)\] constructs multimodal prompts combining textual context with numerical time series data and feeds them to frozen LLMs. Zhou et al. \[[2023](https://arxiv.org/html/2605.20268#bib.bib24)\] showed that frozen LLMs, fine-tuned only at input/output projections, yield competitive forecasting performance. However, these approaches treat the language model as a black box; the temporal representations are constrained to the text embedding space, which was never designed for continuous numerical data.

##### Adapted LLM approaches\.

Time-LLM \[Jin et al., [2024](https://arxiv.org/html/2605.20268#bib.bib23)\] reprograms patch embeddings into text prototypes with a frozen LLM backbone. GPT4TS \[Zhou et al., [2023](https://arxiv.org/html/2605.20268#bib.bib24)\] fine-tunes only the normalization layers of GPT-2. ChatTS \[Xie et al., [2025](https://arxiv.org/html/2605.20268#bib.bib28)\] encodes time series patches through a shallow MLP and concatenates them with text embeddings before feeding a fine-tuned Qwen2.5-14B backbone, using synthetic QA pairs (TSEvol) to address data scarcity. ChatTime \[Wang et al., [2024](https://arxiv.org/html/2605.20268#bib.bib31)\] instruction-fine-tunes a decoder-only LLM for bidirectional text and time series generation, achieving 99.9% of Chronos’s zero-shot accuracy with only 4% of the pretraining data by leveraging the pretrained LLM’s existing representations. MoAT \[Lee et al., [2024](https://arxiv.org/html/2605.20268#bib.bib80)\] introduces a two-stage framework: first optimizing forecasts from decomposed time series and text embeddings, then fusing via an offline MLP synthesis. TaTs \[Li et al., [2026](https://arxiv.org/html/2605.20268#bib.bib81)\] treats text embeddings as auxiliary time series variables, capturing what the authors call “chronological textual resonance,” periodic patterns in text representations that mirror the numerical series. MSE-ITT \[Koval et al., [2025](https://arxiv.org/html/2605.20268#bib.bib82)\] extends LLaMA-3-8B with modality-specific expert layers for financial forecasting from interleaved text and time series. Time-VLM \[Zhong et al., [2025](https://arxiv.org/html/2605.20268#bib.bib83)\] bridges temporal, visual, and textual modalities using frozen vision-language models.

All of these approaches share a fundamental limitation: *they start from a pretrained language model*, meaning the backbone’s representations were shaped entirely by text before any exposure to temporal data. The time series modality must adapt to a representational space that was not designed for it, and the resulting models inherit the language model’s parameter count, vocabulary, and computational requirements, even when the downstream task is purely temporal.

##### Fusion and benchmark approaches\.

Time-MMD \[Liu et al., [2024](https://arxiv.org/html/2605.20268#bib.bib29)\] provides a multi-domain benchmark pairing time series with textual reports across nine domains and introduces MM-TSFlib, a fusion library that has become a standard reference protocol for text-augmented time series. Under MM-TSFlib, a trainable time series encoder (e.g., DLinear, PatchTST, TimesNet) is paired with a frozen pretrained text encoder (BERT or GPT-2) and a trainable MLP fusion head; the TS encoder and head are trained end-to-end on each downstream dataset, while the text encoder remains frozen. Subsequent text-augmented time series studies have adopted MM-TSFlib as a benchmarking baseline; we use it directly as our multimodal fusion comparison in Section [5.2](https://arxiv.org/html/2605.20268#S5.SS2). TimeCAP \[Lee et al., [2025](https://arxiv.org/html/2605.20268#bib.bib30)\] uses LLM agents to generate contextual descriptions and combines predictions from a multimodal predictor with a pretrained LLM. Recent surveys on multimodal time series \[Liu et al., [2025](https://arxiv.org/html/2605.20268#bib.bib86)\] provide comprehensive taxonomies of fusion strategies. Zhang et al. \[[2025](https://arxiv.org/html/2605.20268#bib.bib87)\] systematically investigate when multimodal integration yields gains, finding that benefits are “highly condition-dependent” and “neither universal nor always aligned with intuition.”

##### Critical evaluation gap\.

A striking pattern across this literature is the narrowness of evaluation. ChatTime compares against Chronos and GPT4TS for forecasting but does not evaluate language understanding. ChatTS evaluates time series understanding but not against GIFT-Eval, UCR, or NLU benchmarks. MoAT, TaTs, and GPT4MTS evaluate only multimodal forecasting on their own datasets. MSE-ITT compares against multimodal and financial baselines but does not benchmark against dedicated TSFMs on standard time series tasks. *No prior multimodal text and time series model has been evaluated against both dedicated TSFMs on time series benchmarks and dedicated LLMs on language understanding benchmarks.* This creates a fundamental ambiguity: when a multimodal model reports improved forecasting, it is unclear whether the improvement stems from genuine cross-modal learning or simply from the text providing complementary information that a strong TSFM baseline would render unnecessary. Our evaluation protocol addresses this gap directly by testing Chronicle against the best models in *each* modality on *their own* benchmarks.

##### Negative results on LLMs for time series\.

Several recent works have questioned the value of language model priors for temporal tasks. Tan et al. \[[2024](https://arxiv.org/html/2605.20268#bib.bib84)\] ablated three top-tier LLM-for-TS methods and found that LLMs “fail to convincingly improve time series forecasting” while “significantly increasing computational costs.” Re-initializing LLM weights prior to forecasting had no impact on performance, suggesting that pretrained language representations do not transfer to temporal modeling. Merrill et al. \[[2024](https://arxiv.org/html/2605.20268#bib.bib27)\] found that LLMs struggle to reason about time series encoded as text, motivating modality-native representations. These findings suggest that simply bolting time series onto a language model, the approach taken by all prior multimodal work, is fundamentally limited. Chronicle takes a different path: rather than adapting a language model for time series, we train a single model for both from scratch, allowing the architecture to develop representations suitable for both modalities simultaneously.

### D.3 Small Language Models

GPT-2 \[Radford et al., [2019](https://arxiv.org/html/2605.20268#bib.bib13)\] demonstrated that decoder-only transformers trained with next-token prediction produce capable few-shot learners. Subsequent models have pushed zero-shot language understanding to strong levels at roughly 1B parameters and below: Qwen2 \[Yang et al., [2024](https://arxiv.org/html/2605.20268#bib.bib17)\] at 500M parameters achieves 0.476 average accuracy on our NLU suite; LLaMA-3.2 \[Grattafiori et al., [2024](https://arxiv.org/html/2605.20268#bib.bib16)\] at 1.2B achieves 0.531; Gemma-3-270M-PT \[Gemma Team, [2025](https://arxiv.org/html/2605.20268#bib.bib61)\] at 270M achieves 0.406; and LFM-2-350M \[Amini et al., [2025](https://arxiv.org/html/2605.20268#bib.bib62)\] at 350M achieves 0.449. These models represent the current frontier of what is achievable with compact transformer architectures trained exclusively on text. They are also typically trained on hundreds of billions to several trillion text tokens, substantially more than our total compute budget allows. We compare Chronicle against all five to verify that, under our text-heavy 92/8 mix, devoting approximately 8% of training compute to time series does not cause catastrophic interference. That Chronicle matches Gemma-3-270M-PT despite its dual training objective is an important proof point: a shared transformer backbone can accommodate both text and time series without degrading either modality’s performance relative to scale-matched specialists, provided the token mix is chosen to keep the language stream competitive.

## Appendix E Synthetic Training Data

Online synthetic augmentation is applied to time-series-only batches during training, controlled by a per-batch probability of $0.20$. Series are generated on-the-fly in a background worker thread; generation takes approximately $1$ to $3$ ms per series at length 32k, introducing no data-loading bottleneck.

A bank of 33 kernel generators is defined at module load time, spanning smooth trends, periodic patterns, stochastic processes, discrete waveforms, and noise models (Table [9](https://arxiv.org/html/2605.20268#A5.T9)). For each synthetic series, 2–5 kernels are sampled without replacement and combined via one of two modes:

- Additive (80%): $x(t)=\sum_{i}k_{i}(t)$.
- Mixed multiplicative (20%): kernels are combined iteratively; each subsequent kernel is either added or multiplied (after shifting to a positive range) with probability $0.40$ per kernel.

All kernels operate on normalized time $t_{n}=\mathrm{linspace}(0,1,L)$ and are vectorized. Duplicate entries in the bank increase sampling frequency for empirically useful kernels (RBF short/long, periodic short/long, rational quadratic, damped oscillation), following the emphasis in Chronos KernelSynth \[Ansari et al., [2024](https://arxiv.org/html/2605.20268#bib.bib1)\]. Inf values are clipped to $\pm 5$ before combination and to $\pm 10^{7}$ after composition; output is cast to float32. Beyond KernelSynth, with 50% probability per time-series batch we additionally apply jitter (additive Gaussian noise), scaling (multiplicative perturbation), and intra-batch mixup.
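The composition procedure described above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the five-function bank, the parameter ranges not given in Table 9, the reading of "probability 0.40" as the chance of multiplying (rather than adding), and the choice to shift only the incoming kernel to a positive range are all hypothetical.

```python
import math
import random

# Hypothetical mini-bank; the real bank has 33 entries (Table 9).
def periodic(t, rng):
    amp = rng.uniform(0.5, 2.0)        # A ~ Unif(0.5, 2) as in Table 9
    period = rng.uniform(0.05, 0.5)    # assumed range
    phase = rng.uniform(0.0, 2.0 * math.pi)
    return [amp * math.sin(2.0 * math.pi * x / period + phase) for x in t]

def linear_trend(t, rng):
    a, b = rng.uniform(-3.0, 3.0), rng.uniform(-1.0, 1.0)  # range of b assumed
    return [a * x + b for x in t]

def exp_trend(t, rng):
    r = rng.uniform(-3.0, 3.0)
    return [math.exp(r * x) - 1.0 for x in t]

def white_noise(t, rng):
    sigma = rng.uniform(0.05, 0.3)     # assumed range
    return [rng.gauss(0.0, sigma) for _ in t]

def constant(t, rng):
    c = rng.uniform(-2.0, 2.0)
    return [c for _ in t]

# A duplicate entry raises the sampling weight of that kernel.
BANK = [periodic, periodic, linear_trend, exp_trend, white_noise, constant]

def clip(values, bound):
    return [max(-bound, min(bound, v)) for v in values]

def compose(length, rng):
    """Sample 2-5 kernels and combine them additively or in mixed mode."""
    t = [i / (length - 1) for i in range(length)]      # linspace(0, 1, L)
    chosen = rng.sample(BANK, rng.randint(2, 5))       # without replacement
    kernels = [clip(f(t, rng), 5.0) for f in chosen]   # clip before combining
    mixed = rng.random() >= 0.80                       # 80% additive, 20% mixed
    out = kernels[0]
    for nxt in kernels[1:]:
        if mixed and rng.random() < 0.40:
            # Assumption: shift the incoming kernel to be strictly positive.
            low = min(nxt)
            out = [a * (b - low + 1e-3) for a, b in zip(out, nxt)]
        else:
            out = [a + b for a, b in zip(out, nxt)]
    return clip(out, 1e7)                              # clip after composition

def use_synthetic(rng, prob=0.20):
    """Per-batch coin flip deciding whether to apply synthetic augmentation."""
    return rng.random() < prob

rng = random.Random(0)
series = compose(256, rng)
```

Sampling positions from a bank that contains duplicates is one simple way to realize weighted sampling without replacement; a weighted sampler over unique kernels would be an equally plausible reading.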

Table 9: KernelSynth generator bank (33 entries). Duplicates are listed in the “# entries” column and increase sampling weight for empirically useful kernels.

| Category | Implementation | # entries |
| --- | --- | --- |
| RBF smooth | $\frac{1}{R}\sum_{r}\cos(\omega_{r}t+\phi_{r})$, $\omega_{r}\sim\mathcal{N}(0,1/\ell_{s})$, $R=32$ RFF | 5 |
| Periodic | $A\sin(2\pi t/p+\phi)$, $A\sim\mathrm{Unif}(0.5,2)$ | 5 |
| Periodic + harmonics | Base + 2 overtones at amplitudes $A/2$, $A/3$ with independent phases | 1 |
| Rational Quadratic | RFF with Gamma-distributed scales: $\omega_{r}\sim\mathrm{scale}\cdot\mathcal{N}(0,1)/\ell_{s}$, $\mathrm{scale}\sim\Gamma(\alpha,1/\alpha)$ | 2 |
| Linear trend | $at+b$, $a\sim\mathrm{Unif}(-3,3)$ | 1 |
| Polynomial | $\mathrm{polyval}(\mathbf{c},t_{\mathrm{norm}})$, coefficients $\sim\mathrm{Unif}$ | 2 |
| Log trend | $c\cdot\log(t)$, $c\sim\mathrm{Unif}(-2,2)$ | 1 |
| Random walk | Cumulative sum of Gaussian steps; drift $\mu\sim\mathrm{Unif}(-0.01,0.01)$ | 2 |
| Level shifts | 1–3 abrupt shifts at random positions in the middle 80% | 1 |
| Discrete waves | Period $\in[0.05,0.40]$, amplitude $\in[0.5,2.0]$, random phase/offset | 3 |
| Damped oscillation | $Ae^{-\gamma t}\sin(2\pi t/p+\phi)$, $\gamma\sim\mathrm{Unif}(1,8)$ | 2 |
| White noise | $\mathcal{N}(0,\sigma)$ | 3 |
| Heteroskedastic noise | $\epsilon_{t}\sim\mathcal{N}(0,\,\sigma\cdot e^{0.5k(t)})$, envelope modulated by RBF-drawn signal | 1 |
| Periodic noise | $\mathcal{N}(0,0.3)\cdot(1+A(\sin(2\pi t/p+\phi)\cdot 0.5+0.5))$ | 1 |
| Step function | 3–11 constant-level segments with random transitions | 1 |
| Exponential growth/decay | $e^{rt}-1$, $r\sim\mathrm{Unif}(-3,3)$ | 1 |
| Constant | Flat baseline $c\sim\mathrm{Unif}(-2,2)$ | 1 |
| Total | | 33 |
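To make three of the Table 9 entries concrete, the following sketch implements the RBF random-Fourier-feature kernel, the damped oscillation, and the heteroskedastic noise generator in plain Python. It is a hedged illustration: the table does not say whether $1/\ell_{s}$ and $\sigma\cdot e^{0.5k(t)}$ denote standard deviations or variances (standard deviations are assumed here), the phase distribution is assumed uniform on $[0, 2\pi)$, and the function names, amplitude, period, and length-scale arguments are hypothetical.

```python
import math
import random

def rbf_rff(t, length_scale, rng, num_features=32):
    """Smooth signal (1/R) * sum_r cos(w_r * t + phi_r) with R random features."""
    # Assumption: w_r ~ N(0, 1/length_scale) with 1/length_scale as the std.
    freqs = [rng.gauss(0.0, 1.0 / length_scale) for _ in range(num_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    return [
        sum(math.cos(w * x + p) for w, p in zip(freqs, phases)) / num_features
        for x in t
    ]

def damped_oscillation(t, amplitude, period, rng):
    """A * exp(-gamma * t) * sin(2*pi*t/p + phi) with gamma ~ Unif(1, 8)."""
    gamma = rng.uniform(1.0, 8.0)
    phase = rng.uniform(0.0, 2.0 * math.pi)
    return [
        amplitude * math.exp(-gamma * x) * math.sin(2.0 * math.pi * x / period + phase)
        for x in t
    ]

def heteroskedastic_noise(t, sigma, rng, envelope_scale=0.2):
    """Gaussian noise whose scale is modulated by an RBF-drawn envelope k(t)."""
    envelope = rbf_rff(t, envelope_scale, rng)   # envelope length scale assumed
    return [rng.gauss(0.0, sigma * math.exp(0.5 * k)) for k in envelope]

rng = random.Random(0)
t = [i / 255 for i in range(256)]                # normalized time on [0, 1]
smooth = rbf_rff(t, 0.1, rng)
damped = damped_oscillation(t, 1.5, 0.1, rng)
noise = heteroskedastic_noise(t, 0.1, rng)
```

Because the RFF kernel is an average of cosines, its output is bounded in $[-1, 1]$, which keeps the exponential envelope of the heteroskedastic noise within a narrow range.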

### E.1 Full Multimodal Classification Results

Table [10](https://arxiv.org/html/2605.20268#A5.T10) reports per-domain accuracy, macro-F1, and AUC for all methods on TimeCAP. Within each baseline category, BERT-paired models appear before GPT2-paired models. Summary averages are reported in Table [3](https://arxiv.org/html/2605.20268#S5.T3).

Table 10: Multimodal classification on TimeCAP by domain. Values are mean $\pm$ standard deviation over 3 seeds (0, 1, 2).

| Cat. | Model | Weather F1↑ | Weather AUC↑ | Finance F1↑ | Finance AUC↑ | Healthcare F1↑ | Healthcare AUC↑ | Average F1↑ | Average AUC↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MM-TSFlib | DLin+BERT | 0.600 ± 0.013 | 0.676 ± 0.011 | 0.367 ± 0.010 | 0.641 ± 0.036 | 0.796 ± 0.051 | 0.902 ± 0.030 | 0.588 ± 0.016 | 0.739 ± 0.024 |
| MM-TSFlib | DLin+GPT2 | 0.572 ± 0.009 | 0.643 ± 0.017 | 0.326 ± 0.057 | 0.655 ± 0.018 | 0.793 ± 0.024 | 0.875 ± 0.036 | 0.564 ± 0.026 | 0.724 ± 0.017 |
| MM-TSFlib | PTST+BERT | 0.566 ± 0.017 | 0.613 ± 0.021 | 0.368 ± 0.052 | 0.643 ± 0.024 | 0.800 ± 0.004 | 0.900 ± 0.024 | 0.578 ± 0.022 | 0.719 ± 0.022 |
| MM-TSFlib | PTST+GPT2 | 0.461 ± 0.066 | 0.556 ± 0.057 | 0.363 ± 0.010 | 0.682 ± 0.019 | 0.793 ± 0.009 | 0.882 ± 0.033 | 0.539 ± 0.021 | 0.707 ± 0.034 |
| MM-TSFlib | TNet+BERT | 0.597 ± 0.057 | 0.712 ± 0.013 | 0.364 ± 0.015 | 0.635 ± 0.041 | 0.808 ± 0.017 | 0.902 ± 0.026 | 0.589 ± 0.021 | 0.750 ± 0.026 |
| MM-TSFlib | TNet+GPT2 | 0.617 ± 0.037 | 0.729 ± 0.018 | 0.330 ± 0.056 | 0.660 ± 0.048 | 0.786 ± 0.032 | 0.871 ± 0.036 | 0.577 ± 0.019 | 0.754 ± 0.028 |
| FM Fusion | BERT+Chr2 | 0.588 ± 0.010 | 0.631 ± 0.018 | 0.388 ± 0.054 | 0.647 ± 0.028 | 0.794 ± 0.012 | 0.902 ± 0.029 | 0.590 ± 0.021 | 0.726 ± 0.023 |
| FM Fusion | BERT+Moi2 | 0.617 ± 0.022 | 0.702 ± 0.023 | 0.351 ± 0.041 | 0.651 ± 0.026 | 0.797 ± 0.019 | 0.901 ± 0.033 | 0.588 ± 0.004 | 0.751 ± 0.025 |
| FM Fusion | BERT+TFM | 0.436 ± 0.021 | 0.577 ± 0.030 | 0.273 ± 0.000 | 0.503 ± 0.102 | 0.786 ± 0.009 | 0.898 ± 0.030 | 0.498 ± 0.006 | 0.659 ± 0.023 |
| FM Fusion | GPT2+Chr2 | 0.284 ± 0.121 | 0.542 ± 0.031 | 0.294 ± 0.037 | 0.597 ± 0.143 | 0.785 ± 0.023 | 0.880 ± 0.035 | 0.455 ± 0.043 | 0.673 ± 0.056 |
| FM Fusion | GPT2+Moi2 | 0.588 ± 0.015 | 0.690 ± 0.022 | 0.252 ± 0.150 | 0.643 ± 0.029 | 0.786 ± 0.037 | 0.885 ± 0.024 | 0.542 ± 0.038 | 0.739 ± 0.015 |
| FM Fusion | GPT2+TFM | 0.423 ± 0.000 | 0.562 ± 0.029 | 0.273 ± 0.000 | 0.451 ± 0.077 | 0.745 ± 0.101 | 0.870 ± 0.054 | 0.480 ± 0.034 | 0.628 ± 0.018 |
| Chronicle | Stage 1 LP (r=1) | 0.564 ± 0.021 | 0.641 ± 0.034 | 0.426 ± 0.024 | 0.683 ± 0.024 | 0.790 ± 0.028 | 0.874 ± 0.048 | 0.593 ± 0.021 | 0.733 ± 0.030 |
| Chronicle | Stage 1 LP (r=64) | 0.608 ± 0.018 | 0.705 ± 0.010 | 0.423 ± 0.009 | 0.662 ± 0.003 | 0.793 ± 0.021 | 0.867 ± 0.042 | 0.608 ± 0.010 | 0.745 ± 0.014 |
| Chronicle | Stage 1 LoRA (r=1) | 0.602 ± 0.012 | 0.641 ± 0.029 | 0.374 ± 0.032 | 0.684 ± 0.030 | 0.828 ± 0.011 | 0.891 ± 0.017 | 0.601 ± 0.011 | 0.739 ± 0.017 |
| Chronicle | Stage 1 LoRA (r=64) | 0.636 ± 0.006 | 0.717 ± 0.018 | 0.295 ± 0.038 | 0.669 ± 0.005 | 0.820 ± 0.042 | 0.904 ± 0.039 | 0.584 ± 0.024 | 0.763 ± 0.011 |
| Chronicle | Stage 2 LP (r=1) | 0.561 ± 0.039 | 0.635 ± 0.040 | 0.428 ± 0.011 | 0.685 ± 0.014 | 0.794 ± 0.042 | 0.872 ± 0.048 | 0.594 ± 0.025 | 0.731 ± 0.029 |
| Chronicle | Stage 2 LP (r=64) | 0.606 ± 0.013 | 0.696 ± 0.008 | 0.406 ± 0.014 | 0.683 ± 0.010 | 0.804 ± 0.025 | 0.872 ± 0.038 | 0.605 ± 0.011 | 0.750 ± 0.014 |
| Chronicle | Stage 2 LoRA (r=1) | 0.599 ± 0.013 | 0.644 ± 0.024 | 0.365 ± 0.080 | 0.656 ± 0.074 | 0.821 ± 0.013 | 0.892 ± 0.019 | 0.595 ± 0.030 | 0.731 ± 0.032 |
| Chronicle | Stage 2 LoRA (r=64) | 0.629 ± 0.033 | 0.706 ± 0.066 | 0.386 ± 0.010 | 0.667 ± 0.047 | 0.823 ± 0.016 | 0.897 ± 0.036 | 0.613 ± 0.011 | 0.757 ± 0.014 |

### E.2 Effect of Channel-Aware Multivariate Handling

To assess the value of preserving channel identity in multivariate time-series inputs, we compare the default joint multivariate representation used by Chronicle, which includes the channel ramp $\mathbf{c}$ in the patch features, against a mean-channel pooling variant that averages channels before encoding and therefore removes channel identity. All values are for the frozen-backbone linear head.
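The difference between the two variants can be illustrated with a toy example. The sketch below is not Chronicle's featurization: the exact form of the channel ramp is not given in this section, so a scalar `c / (C - 1)` appended to each patch is used as a hypothetical stand-in, and both function names are invented for illustration. It only shows why mean pooling discards channel identity (swapping channels leaves its output unchanged) while a channel-tagged representation does not.

```python
def mean_pool_patches(x, patch_len):
    """Average channels first, then cut non-overlapping patches."""
    num_channels, length = len(x), len(x[0])
    pooled = [sum(x[c][i] for c in range(num_channels)) / num_channels
              for i in range(length)]
    return [pooled[i:i + patch_len]
            for i in range(0, length - patch_len + 1, patch_len)]

def joint_patches(x, patch_len):
    """Cut patches per channel and append a (hypothetical) channel ramp value."""
    num_channels, length = len(x), len(x[0])
    features = []
    for c in range(num_channels):
        # Assumed ramp: channel index scaled to [0, 1].
        ramp = c / (num_channels - 1) if num_channels > 1 else 0.0
        for i in range(0, length - patch_len + 1, patch_len):
            features.append(x[c][i:i + patch_len] + [ramp])
    return features

x = [[1, 2, 3, 4], [5, 6, 7, 8]]   # 2 channels, 4 time steps
swapped = x[::-1]                  # same data with channel order reversed

same_after_pooling = mean_pool_patches(x, 2) == mean_pool_patches(swapped, 2)
same_after_joint = joint_patches(x, 2) == joint_patches(swapped, 2)
```

Here `same_after_pooling` is true and `same_after_joint` is false: the pooled features cannot tell which channel carried which signal, whereas the joint features can.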

Figure [4](https://arxiv.org/html/2605.20268#A5.F4) shows the per-dataset deltas (joint minus mean pooling) on the 10 multivariate UEA datasets used in our evaluation. Averaged across datasets, joint channel-aware handling improves accuracy by $+0.039$, macro-F1 by $+0.035$, and AUC by $+0.020$. The largest gains appear on Libras and NATOPS ($+0.167$ accuracy on both), with additional improvements on LSST and UWaveGestureLibrary. Some datasets favor mean-channel pooling (Epilepsy, FingerMovements, and RacketSports), while Handwriting and StandWalkJump are effectively unchanged in accuracy and F1. Overall, the results indicate that retaining channel identity is beneficial on average for multivariate classification, supporting the use of channel-aware patch features for multivariate inputs.

![Refer to caption](https://arxiv.org/html/2605.20268v1/x4.png) Figure 4: Effect of channel-aware multivariate handling on UEA classification. Bars show the per-dataset delta between joint multivariate handling and mean-channel pooling (joint minus mean) for accuracy, macro-F1, and AUC. Positive values favor joint channel-aware handling. Averaged across the 10 multivariate UEA datasets, joint handling improves accuracy by $+0.039$, macro-F1 by $+0.035$, and AUC by $+0.020$.

## Appendix F Per-Dataset UCR/UEA Classification Results

Table 11: Per-dataset UCR/UEA time-series classification results. Values are means over five seeds. Each cell lists Acc / F1 / AUC.

| Suite | Dataset | Informer | TimesNet | Autoformer | DLinear | iTransformer | FEDformer | PatchTST | Chronos-2 | TimesFM | Moirai-2 | Chronicle Stage 1 | Chronicle Stage 2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| UCR | GunPoint | 0.576 / 0.573 / 0.634 | 0.580 / 0.539 / 0.603 | 0.605 / 0.599 / 0.716 | 0.757 / 0.756 / 0.840 | 0.708 / 0.701 / 0.842 | 0.656 / 0.650 / 0.779 | 0.653 / 0.648 / 0.711 | 0.528 / 0.397 / 0.548 | 0.712 / 0.682 / 0.862 | 0.931 / 0.931 / 0.981 | 0.919 / 0.919 / 0.965 | 0.851 / 0.850 / 0.936 |
| UCR | Coffee | 0.536 / 0.349 / 0.397 | 0.536 / 0.349 / 0.588 | 0.536 / 0.349 / 0.525 | 0.900 / 0.871 / 1.000 | 0.593 / 0.441 / 0.962 | 0.536 / 0.349 / 0.563 | 0.536 / 0.349 / 0.673 | 0.536 / 0.410 / 0.555 | 0.571 / 0.426 / 0.987 | 0.964 / 0.964 / 0.995 | 0.893 / 0.892 / 0.988 | 0.893 / 0.870 / 0.996 |
| UCR | ECG200 | 0.642 / 0.441 / 0.575 | 0.804 / 0.791 / 0.894 | 0.806 / 0.779 / 0.873 | 0.814 / 0.796 / 0.904 | 0.866 / 0.855 / 0.942 | 0.796 / 0.764 / 0.906 | 0.862 / 0.846 / 0.920 | 0.672 / 0.468 / 0.572 | 0.840 / 0.818 / 0.939 | 0.820 / 0.799 / 0.892 | 0.846 / 0.826 / 0.946 | 0.848 / 0.833 / 0.934 |
| UCR | FaceFour | 0.186 / 0.078 / 0.586 | 0.395 / 0.288 / 0.785 | 0.227 / 0.116 / 0.473 | 0.541 / 0.516 / 0.923 | 0.627 / 0.629 / 0.888 | 0.223 / 0.102 / 0.556 | 0.350 / 0.296 / 0.687 | 0.236 / 0.123 / 0.521 | 0.609 / 0.607 / 0.879 | 0.582 / 0.552 / 0.810 | 0.864 / 0.866 / 0.988 | 0.759 / 0.731 / 0.984 |
| UCR | OSULeaf | 0.350 / 0.270 / 0.749 | 0.406 / 0.319 / 0.825 | 0.513 / 0.457 / 0.824 | 0.360 / 0.313 / 0.666 | 0.406 / 0.350 / 0.763 | 0.556 / 0.485 / 0.863 | 0.559 / 0.510 / 0.863 | 0.184 / 0.055 / 0.519 | 0.401 / 0.311 / 0.842 | 0.719 / 0.693 / 0.937 | 0.583 / 0.555 / 0.876 | 0.579 / 0.542 / 0.888 |
| UCR | SwedishLeaf | 0.533 / 0.497 / 0.947 | 0.381 / 0.314 / 0.923 | 0.675 / 0.647 / 0.976 | 0.763 / 0.756 / 0.969 | 0.843 / 0.841 / 0.990 | 0.711 / 0.701 / 0.975 | 0.859 / 0.858 / 0.992 | 0.096 / 0.054 / 0.554 | 0.599 / 0.564 / 0.966 | 0.854 / 0.854 / 0.992 | 0.848 / 0.847 / 0.993 | 0.834 / 0.831 / 0.987 |
| UCR | SyntheticControl | 0.835 / 0.829 / 0.974 | 0.591 / 0.519 / 0.990 | 0.680 / 0.666 / 0.934 | 0.878 / 0.868 / 0.986 | 0.833 / 0.821 / 0.981 | 0.992 / 0.992 / 0.999 | 0.926 / 0.926 / 0.993 | 0.239 / 0.139 / 0.567 | 0.903 / 0.904 / 0.989 | 0.860 / 0.860 / 0.983 | 0.727 / 0.728 / 0.930 | 0.727 / 0.727 / 0.931 |
| UCR | Trace | 0.746 / 0.703 / 0.938 | 0.506 / 0.420 / 0.931 | 0.910 / 0.903 / 0.975 | 0.484 / 0.444 / 0.830 | 0.488 / 0.385 / 0.837 | 1.000 / 1.000 / 1.000 | 0.878 / 0.868 / 0.984 | 0.288 / 0.189 / 0.581 | 0.630 / 0.573 / 0.891 | 0.802 / 0.804 / 0.951 | 0.936 / 0.932 / 0.999 | 0.848 / 0.827 / 0.988 |
| UCR | TwoPatterns | 0.355 / 0.253 / 0.704 | 0.999 / 0.999 / 1.000 | 0.357 / 0.304 / 0.617 | 0.850 / 0.850 / 0.975 | 0.820 / 0.819 / 0.960 | 0.932 / 0.932 / 0.992 | 0.826 / 0.825 / 0.957 | 0.281 / 0.153 / 0.525 | 0.655 / 0.627 / 0.916 | 0.634 / 0.632 / 0.863 | 0.778 / 0.777 / 0.948 | 0.790 / 0.787 / 0.945 |
| UCR | Wafer | 0.982 / 0.950 / 0.982 | 0.951 / 0.863 / 0.933 | 0.988 / 0.969 / 0.998 | 0.943 / 0.834 / 0.853 | 0.995 / 0.987 / 0.998 | 0.992 / 0.979 / 0.998 | 0.974 / 0.931 / 0.984 | 0.894 / 0.488 / 0.537 | 0.941 / 0.813 / 0.919 | 0.996 / 0.990 / 1.000 | 0.991 / 0.977 / 0.999 | 0.987 / 0.966 / 0.998 |
| UCR | Earthquakes | 0.748 / 0.428 / 0.680 | 0.748 / 0.428 / 0.666 | 0.748 / 0.428 / 0.681 | 0.630 / 0.545 / 0.583 | 0.732 / 0.471 / 0.500 | 0.748 / 0.428 / 0.688 | 0.738 / 0.440 / 0.619 | 0.748 / 0.428 / 0.506 | 0.581 / 0.436 / 0.507 | 0.722 / 0.468 / 0.656 | 0.760 / 0.536 / 0.645 | 0.774 / 0.555 / 0.682 |
| UCR | ShapeletSim | 0.499 / 0.333 / 0.536 | 0.779 / 0.737 / 0.947 | 0.559 / 0.482 / 0.662 | 0.526 / 0.523 / 0.521 | 0.491 / 0.489 / 0.496 | 0.656 / 0.583 / 0.972 | 0.519 / 0.445 / 0.546 | 0.496 / 0.357 / 0.502 | 0.709 / 0.706 / 0.757 | 0.789 / 0.788 / 0.856 | 0.603 / 0.603 / 0.623 | 0.608 / 0.605 / 0.682 |
| UCR | Chinatown | 0.574 / 0.388 / 0.646 | 0.412 / 0.363 / 0.985 | 0.706 / 0.671 / 0.881 | 0.908 / 0.876 / 0.987 | 0.969 / 0.962 / 0.994 | 0.845 / 0.835 / 0.988 | 0.904 / 0.880 / 0.959 | 0.684 / 0.487 / 0.599 | 0.930 / 0.913 / 0.976 | 0.983 / 0.978 / 0.996 | 0.977 / 0.971 / 0.996 | 0.960 / 0.953 / 0.995 |
| UCR | ItalyPowerDemand | 0.557 / 0.542 / 0.717 | 0.829 / 0.819 / 0.983 | 0.741 / 0.740 / 0.815 | 0.952 / 0.952 / 0.990 | 0.969 / 0.969 / 0.993 | 0.848 / 0.848 / 0.913 | 0.970 / 0.970 / 0.988 | 0.591 / 0.458 / 0.598 | 0.927 / 0.926 / 0.976 | 0.946 / 0.946 / 0.991 | 0.953 / 0.953 / 0.991 | 0.955 / 0.955 / 0.992 |
| UEA | BasicMotions | 0.980 / 0.980 / 1.000 | 0.980 / 0.980 / 1.000 | 0.995 / 0.995 / 1.000 | 0.390 / 0.372 / 0.666 | 0.450 / 0.415 / 0.725 | 1.000 / 1.000 / 1.000 | 0.380 / 0.327 / 0.626 | 0.290 / 0.174 / 0.543 | 0.765 / 0.762 / 0.938 | 0.960 / 0.960 / 0.998 | 0.925 / 0.924 / 0.995 | 0.945 / 0.946 / 0.990 |
| UEA | Epilepsy | 0.478 / 0.481 / 0.701 | 0.848 / 0.849 / 0.977 | 0.642 / 0.645 / 0.871 | 0.370 / 0.358 / 0.575 | 0.309 / 0.297 / 0.532 | 0.922 / 0.920 / 0.987 | 0.877 / 0.875 / 0.975 | 0.272 / 0.138 / 0.502 | 0.897 / 0.894 / 0.989 | 0.943 / 0.945 / 0.996 | 0.945 / 0.943 / 0.992 | 0.939 / 0.934 / 0.987 |
| UEA | NATOPS | 0.716 / 0.700 / 0.953 | 0.881 / 0.878 / 0.982 | 0.808 / 0.803 / 0.970 | 0.716 / 0.714 / 0.933 | 0.637 / 0.629 / 0.900 | 0.896 / 0.894 / 0.988 | 0.693 / 0.682 / 0.918 | 0.220 / 0.115 / 0.551 | 0.511 / 0.492 / 0.823 | 0.513 / 0.503 / 0.827 | 0.544 / 0.535 / 0.839 | 0.523 / 0.512 / 0.826 |
| UEA | RacketSports | 0.728 / 0.739 / 0.912 | 0.814 / 0.829 / 0.944 | 0.812 / 0.824 / 0.941 | 0.687 / 0.697 / 0.884 | 0.608 / 0.615 / 0.845 | 0.851 / 0.864 / 0.948 | 0.593 / 0.597 / 0.850 | 0.322 / 0.183 / 0.555 | 0.668 / 0.659 / 0.880 | 0.693 / 0.697 / 0.894 | 0.659 / 0.662 / 0.894 | 0.701 / 0.711 / 0.890 |
| UEA | UWaveGestureLibrary | 0.410 / 0.397 / 0.828 | 0.606 / 0.608 / 0.907 | 0.520 / 0.507 / 0.887 | 0.801 / 0.792 / 0.964 | 0.676 / 0.668 / 0.925 | 0.652 / 0.643 / 0.914 | 0.785 / 0.783 / 0.944 | 0.139 / 0.053 / 0.525 | 0.221 / 0.132 / 0.740 | 0.500 / 0.508 / 0.863 | 0.627 / 0.615 / 0.921 | 0.636 / 0.622 / 0.919 |
| UEA | Handwriting | 0.132 / 0.064 / 0.692 | 0.147 / 0.079 / 0.774 | 0.147 / 0.078 / 0.685 | 0.132 / 0.097 / 0.613 | 0.144 / 0.106 / 0.614 | 0.343 / 0.271 / 0.883 | 0.093 / 0.066 / 0.595 | 0.041 / 0.009 / 0.510 | 0.107 / 0.087 / 0.626 | 0.127 / 0.115 / 0.648 | 0.172 / 0.148 / 0.661 | 0.146 / 0.116 / 0.659 |
| UEA | Libras | 0.569 / 0.559 / 0.923 | 0.813 / 0.810 / 0.988 | 0.692 / 0.693 / 0.976 | 0.554 / 0.535 / 0.900 | 0.472 / 0.440 / 0.898 | 0.816 / 0.817 / 0.990 | 0.697 / 0.686 / 0.940 | 0.124 / 0.067 / 0.542 | 0.431 / 0.398 / 0.880 | 0.598 / 0.584 / 0.928 | 0.527 / 0.526 / 0.926 | 0.543 / 0.532 / 0.911 |
| UEA | LSST | 0.573 / 0.328 / 0.850 | 0.591 / 0.323 / 0.875 | 0.606 / 0.337 / 0.857 | 0.303 / 0.055 / 0.547 | 0.475 / 0.191 / 0.779 | 0.590 / 0.324 / 0.862 | 0.416 / 0.144 / 0.746 | 0.317 / 0.042 / 0.522 | 0.242 / 0.111 / 0.638 | 0.285 / 0.138 / 0.630 | 0.535 / 0.317 / 0.835 | 0.525 / 0.286 / 0.821 |
| UEA | FingerMovements | 0.494 / 0.447 / 0.511 | 0.492 / 0.454 / 0.474 | 0.510 / 0.416 / 0.531 | 0.492 / 0.482 / 0.517 | 0.538 / 0.509 / 0.555 | 0.502 / 0.437 / 0.508 | 0.502 / 0.487 / 0.506 | 0.500 / 0.371 / 0.502 | 0.472 / 0.445 / 0.468 | 0.544 / 0.544 / 0.588 | 0.520 / 0.519 / 0.531 | 0.556 / 0.553 / 0.563 |
| UEA | StandWalkJump | 0.360 / 0.269 / 0.619 | 0.387 / 0.252 / 0.572 | 0.280 / 0.167 / 0.468 | 0.533 / 0.537 / 0.673 | 0.413 / 0.387 / 0.572 | 0.280 / 0.168 / 0.447 | 0.440 / 0.395 / 0.569 | 0.333 / 0.167 / 0.500 | 0.333 / 0.225 / 0.499 | 0.360 / 0.352 / 0.603 | 0.533 / 0.508 / 0.805 | 0.560 / 0.546 / 0.771 |