From Long News to Accurate Forecast: Importance-Aware Fusion and PRM-Guided Reflection for Time Series Forecasting
Summary
This paper introduces a framework for time series forecasting that uses importance-aware news compression and process reward model-guided retrieval to incorporate long news articles within fixed context limits, improving prediction accuracy across finance, energy, traffic, and Bitcoin benchmarks.
# From Long News to Accurate Forecast: Importance-Aware Fusion and PRM-Guided Reflection for Time Series Forecasting
Source: [https://arxiv.org/html/2606.03097](https://arxiv.org/html/2606.03097)
Mingyang LIU¹∗, Qingcan Kang²†, Yuke WANG¹, Shixiong Kai², Kaichao Liang², Hui-Ling Zhen², Tao Zhong², Mingxuan Yuan², Linqi Song¹†

¹ Department of Computer Science, City University of Hong Kong; ² Huawei Noah's Ark Lab

mingyaliu8-c@my.cityu.edu.hk, kangqingcan@huawei.com, linqi.song@cityu.edu.hk
###### Abstract
Incorporating news into time series forecasting is appealing because news can reveal abrupt exogenous events that historical values alone cannot recover\. However, existing LLM\-based news\-forecasting pipelines face two practical limitations: relevant news articles often exceed the model’s context window, and iterative retrieval of supplementary news is typically unguided, leading to redundant updates and slow convergence\. We address these issues with a novel framework that combines importance\-aware news compression and process\-level retrieval supervision\. First, we train an importance reward model that estimates the forecasting utility of each article and uses this signal to allocate compression budgets during sequential pairwise fusion, preserving informative content within a fixed context limit\. Second, we introduce a process reward model \(PRM\) that ranks multiple supplementary\-news candidates conditioned on the current error profile and the history of previously selected articles, replacing one\-shot blind retrieval with quality\-controlled selection\. Both components are trained offline using historical data with ground truth; inference uses the frozen filtering logic and compression modules without any reflection loop\. Experiments on finance, energy, traffic, and bitcoin forecasting benchmarks show that our method improves prediction accuracy over strong baselines, significantly reduces the number of refinement iterations compared to the iterative baseline, and remains effective when relevant articles span thousands of tokens\.
∗ Work done as an intern in Huawei Noah's Ark Lab. † Corresponding authors.

## 1 Introduction
Time series forecasting supports decision making in domains such as finance, energy, and transportation\(Alghamdiet al\.,[2019](https://arxiv.org/html/2606.03097#bib.bib16); Fildeset al\.,[2022](https://arxiv.org/html/2606.03097#bib.bib17); Gross and Galiana,[1987](https://arxiv.org/html/2606.03097#bib.bib19)\)\. Although modern forecasting models can capture rich temporal patterns from historical observations, they remain vulnerable to abrupt distribution shifts caused by exogenous events, such as policy changes, natural disasters, or market\-moving announcements\. These events are often weakly reflected, or not reflected at all, in the numerical history available at prediction time\. This limitation motivates forecasting systems that can incorporate external textual evidence, especially news reports, into the prediction process\.
News provides timely descriptions of latent drivers behind future time\-series movements\. Recent LLM\-based forecasting pipelines have started to exploit this signal by prompting a reasoning agent to filter relevant articles and a separate evaluation agent to analyze prediction errors and propose additional retrieval queries\(Wanget al\.,[2024](https://arxiv.org/html/2606.03097#bib.bib11)\)\. This iterative design is attractive because it converts missing\-context errors into language feedback that can be used to refine the news set\. However, two obstacles limit its effectiveness in realistic settings\.
First, the retrieved news may be long\. Articles such as policy briefings, earnings\-call transcripts, and incident reports can easily exceed the context budget of the forecasting model\. Naive truncation is unreliable because it discards potentially causal details, while generic summarization or prompt\-compression methods\(Jianget al\.,[2023](https://arxiv.org/html/2606.03097#bib.bib4);[2024](https://arxiv.org/html/2606.03097#bib.bib3)\)are not optimized for preserving information that is specifically useful for forecasting\. As a result, there is a mismatch between context efficiency and task relevance\. Second, the refinement loop is weakly supervised\. In prior iterative pipelines, a reflection may indicate a missing category of news, but the subsequent supplementary retrieval is effectively accepted without a dedicated quality\-control mechanism\. This makes the update trajectory sensitive to noisy or redundant articles and can lead to many expensive refinement rounds before the filtering logic stabilizes\.
We address these two issues with a novel framework that separates*offline refinement*from*online deployment*\. During offline training, where ground\-truth targets are available, we improve the news\-processing pipeline in two ways\. We first introduce an importance\-aware fusion module that learns an article\-level reward model and allocates compression budgets according to the estimated utility of each article for forecasting\. We then introduce a process reward model \(PRM\) that ranks multiple supplementary\-news candidates conditioned on the current forecasting errors and the history of previous selections, so that the refinement loop can prefer candidates that are more likely to reduce future error\. At inference time, the refined filtering logic and the trained compression components are frozen, and prediction proceeds without iterative reflection\. Experiments on finance, energy, traffic, and bitcoin forecasting benchmarks demonstrate that our method improves prediction accuracy over strong baselines, reduces the number of refinement iterations by up to 37\.6% with an average reduction of 24\.8%, and remains effective when relevant articles span thousands of tokens\.
##### Contributions\.
Our main contributions are summarized as follows:
- We formulate news-augmented forecasting as an offline-refined pipeline that explicitly addresses two bottlenecks of prior LLM-based approaches: long-document compression and uncontrolled supplementary retrieval.
- We propose an importance-aware fusion module that learns forecasting-oriented article utilities and uses them to adapt compression rates during sequential pairwise fusion, enabling long news to be incorporated under a limited context budget.
- We introduce a PRM-guided refinement strategy for supplementary news selection, replacing one-shot retrieval with candidate ranking based on process-level supervision from error-driven trajectories.
## 2 Related Work
Time series forecasting with external text\.Classical forecasting models rely primarily on historical numerical observations\(Chenet al\.,[2004](https://arxiv.org/html/2606.03097#bib.bib23); Dudek,[2015](https://arxiv.org/html/2606.03097#bib.bib24); Huang and Shih,[2003](https://arxiv.org/html/2606.03097#bib.bib25); Kalekar and others,[2004](https://arxiv.org/html/2606.03097#bib.bib26); Papalexopoulos and Hesterberg,[1990](https://arxiv.org/html/2606.03097#bib.bib27)\)\. Deep neural architectures improved the modeling of long\-range and nonlinear dependencies\(Liet al\.,[2019](https://arxiv.org/html/2606.03097#bib.bib29); Liuet al\.,[2017](https://arxiv.org/html/2606.03097#bib.bib30); Nieet al\.,[2022](https://arxiv.org/html/2606.03097#bib.bib32); Torreset al\.,[2021](https://arxiv.org/html/2606.03097#bib.bib33); Wuet al\.,[2021](https://arxiv.org/html/2606.03097#bib.bib34); Xiaet al\.,[2021](https://arxiv.org/html/2606.03097#bib.bib35); Zhouet al\.,[2021b](https://arxiv.org/html/2606.03097#bib.bib36);[2022](https://arxiv.org/html/2606.03097#bib.bib37)\), and large\-scale pre\-training further strengthened transfer to new tasks\(Caoet al\.,[2023](https://arxiv.org/html/2606.03097#bib.bib38); Jinet al\.,[2023b](https://arxiv.org/html/2606.03097#bib.bib39); Wuet al\.,[2022](https://arxiv.org/html/2606.03097#bib.bib40); Yehet al\.,[2023](https://arxiv.org/html/2606.03097#bib.bib41)\)\. A parallel line of research augments forecasting with textual evidence such as news, reports, or social media, especially in finance and energy\(Cecchiniet al\.,[2010](https://arxiv.org/html/2606.03097#bib.bib42); Schumaker and Chen,[2009](https://arxiv.org/html/2606.03097#bib.bib43);[2010](https://arxiv.org/html/2606.03097#bib.bib44); Baiet al\.,[2024](https://arxiv.org/html/2606.03097#bib.bib45); Obstet al\.,[2021](https://arxiv.org/html/2606.03097#bib.bib46)\)\. 
Early methods rely on manually engineered sentiment features or event indicators\(Obstet al\.,[2021](https://arxiv.org/html/2606.03097#bib.bib46); Chowdhary,[2020](https://arxiv.org/html/2606.03097#bib.bib47)\), which limits their ability to capture long\-range semantics and cross\-document interactions\.
LLM agents and iterative refinement\.Recent work adapts LLMs to time series by tokenizing numerical sequences or reprogramming prompts\(Gruveret al\.,[2023](https://arxiv.org/html/2606.03097#bib.bib13); Caoet al\.,[2023](https://arxiv.org/html/2606.03097#bib.bib38); Jinet al\.,[2023a](https://arxiv.org/html/2606.03097#bib.bib53); Rasulet al\.,[2023](https://arxiv.org/html/2606.03097#bib.bib56)\)\. The closest prior work is\(Wanget al\.,[2024](https://arxiv.org/html/2606.03097#bib.bib11)\), which uses LLM agents to filter news and iteratively refines the filtering strategy based on forecast errors\. That framework demonstrates the value of agentic feedback loops, but it assumes news documents are short and that each supplementary retrieval is unconditionally accepted\. Our work builds on this agentic formulation and introduces two complementary mechanisms: an importance\-aware compression module that respects context budgets, and a quality\-controlled selection mechanism for supplementary news\.
Compression and process supervision\.Prompt compression methods such as LLMLingua\(Jianget al\.,[2023](https://arxiv.org/html/2606.03097#bib.bib4)\), LongLLMLingua\(Jianget al\.,[2024](https://arxiv.org/html/2606.03097#bib.bib3)\), and RECOMP\(Xuet al\.,[2024](https://arxiv.org/html/2606.03097#bib.bib5)\)reduce input length for general\-purpose LLM tasks, but the notion of importance is not tied to forecasting error\. Process reward models \(PRMs\)\(Lightmanet al\.,[2023b](https://arxiv.org/html/2606.03097#bib.bib6); Zenget al\.,[2025](https://arxiv.org/html/2606.03097#bib.bib60); Zhanget al\.,[2025](https://arxiv.org/html/2606.03097#bib.bib61); Zheng and others,[2025](https://arxiv.org/html/2606.03097#bib.bib7)\)provide step\-wise supervision for reasoning tasks, yet they are typically used to score rationales, not to rank retrieval candidates\. Our work connects these two lines: we train a forecasting‑aligned reward model to allocate compression budgets, and a PRM to score candidate news articles based on their expected impact on prediction error\. This joint design addresses a previously underexplored bottleneck in news‑augmented forecasting\.
Figure 1:Overall Framework of Offline Refinement and Online Deployment\. The framework consists of two phases: offline training/refinement and online reasoning/deployment\. In the offline phase, historical time series and candidate news are first processed by the Reward\-Fusion module, which performs article\-level reward scoring, adaptive compression budgeting, and sequential pairwise fusion to produce a compact summary for forecasting\. Based on validation errors, a reflection signal is generated to identify missing information, and a PRM module scores supplementary candidate news by their expected error reduction, selecting the best article to refine the filtering logic\. This refinement loop is repeated until convergence, yielding the final refined logic\. In the online phase, the refined logic and frozen Reward\-Fusion module are directly applied to retrieve, compress, and fuse relevant news for forecasting, without reflection loops or PRM scoring during inference\.
## 3 Methodology
We consider forecasting with historical observations and a large pool of time\-aligned news articles\. The main difficulty is not only to identify relevant articles, but also to integrate them under limited context while avoiding noisy iterative updates\. Our framework therefore contains four components: \(i\) an LLM forecasting backbone, \(ii\) a reasoning agent for initial news filtering and error\-driven refinement, \(iii\) an importance\-aware fusion module for compressing long news into a compact conditioning context, and \(iv\) a PRM that ranks supplementary\-news candidates during refinement\. The key design principle is to perform expensive search and supervision*offline*, where ground\-truth targets are available, and to deploy only the frozen filtering and compression logic*online*\. Figure[1](https://arxiv.org/html/2606.03097#S2.F1)summarizes the pipeline\.
### 3.1 Problem Formulation
Let $\mathbf{x}_{1:T} = (x_1, \ldots, x_T)$ denote a sequence of univariate or multivariate time series observations up to the current time $T$. The task is to forecast the next $H$ values $\mathbf{x}_{T+1:T+H}$. In addition to the numerical history, we have access to a collection of news articles $\mathcal{U} = \{u_1, \ldots, u_L\}$ whose publication timestamps fall within a retrieval window before the forecast time. The articles can be of arbitrary length, often exceeding the context limit of the forecasting model.
A reasoning agent first selects a subset $\mathcal{N}^{(0)} \subseteq \mathcal{U}$ of potentially relevant articles, which is then compressed into a textual summary $S^{(0)}$. The forecasting model $F_\psi$, conditioned on both the numerical history and the summary, produces the prediction:

$$\hat{\mathbf{x}}_{T+1:T+H} = F_\psi(\mathbf{x}_{1:T}, S^{(0)}).$$

During offline refinement, we have access to ground-truth targets for a historical validation set. The prediction error is used to generate a textual reflection $r_t$ that describes missing evidence. This reflection triggers a retrieval of supplementary candidates, from which a process reward model (PRM) selects a news article to update the current news set. The refinement loop repeats until the validation improvement becomes negligible or a fixed budget is exhausted. The final refined filtering logic and trained compression modules are then deployed for online inference without any reflection loop.
### 3.2 Preliminary: Time Series Forecasting with LLMs
Following (Wang et al., [2024](https://arxiv.org/html/2606.03097#bib.bib11)), we cast forecasting as conditional sequence generation. A pre-trained LLM receives a prompt containing the historical series, task metadata, and a compressed textual summary of relevant news, and outputs the next $H$ values. If the numerical history is tokenized as $\boldsymbol{x}_{1:T}$ and the news summary is tokenized as $\boldsymbol{e}_{1:U}$, the model learns $P_\psi(\boldsymbol{x}_{T+1:T+H} \mid \boldsymbol{x}_{1:T}, \boldsymbol{e}_{1:U})$.
We fine-tune the backbone with supervised instruction tuning and LoRA (Hu et al., [2022](https://arxiv.org/html/2606.03097#bib.bib12)). The role of our method is not to change the forecasting backbone itself, but to improve the quality and efficiency of the textual context $\boldsymbol{e}_{1:U}$ that conditions the forecast.
### 3.3 Iterative News Filtering with PRM-Guided Supplementary News Selection
We adopt the initial reasoning stage of (Wang et al., [2024](https://arxiv.org/html/2606.03097#bib.bib11)) to obtain a seed news set $\mathcal{N}^{(0)} = \{N_1, \ldots, N_{K_0}\}$. The agent uses task descriptions, few-shot examples, and chain-of-thought prompting (Brown et al., [2020](https://arxiv.org/html/2606.03097#bib.bib49); Wei et al., [2022](https://arxiv.org/html/2606.03097#bib.bib22)) to identify news categories likely to influence the target series and returns structured filtering outputs.
The main modification concerns the refinement stage. Let $\mathcal{N}^{(t)}$ be the news set after iteration $t$, and let $r_t$ denote the reflection produced from the current validation errors. Instead of accepting a single supplementary article directly from the reasoning agent, we retrieve a candidate pool $\mathcal{C}_t = \{c_t^{(1)}, \ldots, c_t^{(M)}\}$ that matches the reflection. The PRM then assigns a utility score to each candidate, and the highest-scoring article is added to the news set:

$$n_t^{*} = \arg\max_{c \in \mathcal{C}_t} P_\phi(\text{beneficial} \mid h_t, \tau_{<t}, c), \qquad \mathcal{N}^{(t+1)} = \mathcal{N}^{(t)} \cup \{n_t^{*}\},$$

where $h_t$ summarizes the current task context and $\tau_{<t}$ is the history of previously selected supplementary articles. This design turns iterative retrieval into a ranking problem with explicit supervision, rather than an uncontrolled one-shot update.
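The selection rule above is an argmax over PRM scores followed by a set union. A minimal Python sketch follows, assuming a hypothetical scoring callable `score_fn` that stands in for $P_\phi$; the toy scorer `toy_prm` is purely illustrative and is not the trained PRM described in the paper.

```python
from typing import Callable, List, Sequence, Tuple


def select_best_candidate(
    candidates: Sequence[str],
    score_fn: Callable[[str, Sequence[str], str], float],
    context: str,
    history: Sequence[str],
) -> Tuple[str, List[float]]:
    """Score every candidate and return the argmax together with all scores."""
    if not candidates:
        raise ValueError("candidate pool is empty")
    scores = [score_fn(context, history, c) for c in candidates]
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index], scores


def toy_prm(context: str, history: Sequence[str], candidate: str) -> float:
    """Hypothetical stand-in for P_phi: word overlap with the context,
    with a zero score for articles that were already selected."""
    if candidate in history:
        return 0.0
    context_words = set(context.lower().split())
    candidate_words = set(candidate.lower().split())
    return len(context_words & candidate_words) / max(len(candidate_words), 1)


# Usage: rank a small pool and merge the winner into the news set.
news_set = ["seed article"]
pool = ["rate cut announced", "holiday traffic surge", "storm warning"]
best, pool_scores = select_best_candidate(
    pool, toy_prm, "errors spike around holiday traffic peaks", ["storm warning"]
)
news_set.append(best)
```

Passing the history into the scorer mirrors the conditioning on $\tau_{<t}$; here it is only used to suppress repeated articles.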
The refinement procedure is executed only offline\. Once the filtering logic has been refined on historical data, we keep the resulting logic fixed for deployment and do not invoke reflection or PRM scoring at inference time\.
### 3.4 Process Reward Model for Supplementary News Selection
A central limitation of prior iterative retrieval is the absence of step\-level supervision for supplementary\-news selection\. We address this by training a PRM that estimates whether adding a candidate article at the current refinement step is likely to improve forecasting performance\. The PRM outputs a scalar score used for candidate ranking\.
##### Trajectory definition\.
Let $\tau = (n_1, n_2, \dots, n_T)$ denote a trajectory of supplementary selections. After adding $n_t$ to the news set, we re-compress the current news context, update the forecasting model, and evaluate on the validation set. The step-wise gain is defined as $\Delta_t := \text{RMSE}_{t-1} - \text{RMSE}_t$, where $\Delta_t > 0$ indicates that the newly selected article improves validation RMSE. The PRM models the probability that a candidate $c$ is beneficial, $P_\phi(\text{beneficial} \mid h_t, \tau_{<t}, c)$, where $h_t$ contains the current context, including the residual pattern of the forecasting model, metadata of the task, and the current news set.
##### Training data construction\.
We construct supervision from offline trajectories on a held-out split of the training data. For each reflection $r_t$, the reasoning agent retrieves $M$ candidates. Each candidate is inserted temporarily into the pipeline, and we measure the corresponding gain $\Delta_t^{(m)}$. The binary label is $y_t^{(m)} = \mathbf{1}[\Delta_t^{(m)} > 0]$. The PRM input concatenates: (1) the task description (horizon, domain, background); (2) recent prediction errors (e.g., time-aligned residuals); (3) the history $\tau_{<t}$ and the associated $\Delta$ values; (4) the candidate article $n_t^{(m)}$ (title and summary).
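The labeling rule can be illustrated with a short Python sketch. It assumes the validation RMSE before and after temporarily inserting each candidate has already been measured; the function names and the dictionary layout are hypothetical and chosen only for illustration.

```python
from typing import Dict


def stepwise_gain(rmse_before: float, rmse_after: float) -> float:
    """Delta = RMSE_{t-1} - RMSE_t; positive means the candidate helped."""
    return rmse_before - rmse_after


def label_candidates(
    rmse_before: float, rmse_after_by_candidate: Dict[str, float]
) -> Dict[str, int]:
    """Binary PRM labels: 1 if inserting the candidate lowers validation RMSE."""
    return {
        name: int(stepwise_gain(rmse_before, rmse_after) > 0)
        for name, rmse_after in rmse_after_by_candidate.items()
    }


# Usage with made-up RMSE values for three candidates of one reflection.
labels = label_candidates(0.50, {"cand_a": 0.46, "cand_b": 0.50, "cand_c": 0.53})
```

A candidate that leaves the RMSE unchanged receives label 0, because the indicator uses a strict inequality.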
We train a binary classifier with cross\-entropy loss\. In practice, the classifier can be implemented as a lightweight language model or as a linear head on top of a frozen text encoder\. The predicted sigmoid probability is used directly as the candidate score\.
##### PRM\-guided selection and logic update\.
During offline refinement, the reasoning agent retrieves a candidate set for each reflection $r_t$, and the PRM ranks these candidates using the current context $h_t$ and history $\tau_{<t}$. The selected article $n_t^{*}$ is merged into the news set, after which the forecasting pipeline is re-evaluated. If the update improves validation performance, the selected article is also incorporated into the persistent filtering logic used by the reasoning agent. In this way, the PRM acts as a gate that reduces error-amplifying updates while preserving useful evidence for future retrieval. Algorithm [1](https://arxiv.org/html/2606.03097#alg1) summarizes the procedure.
Algorithm 1: PRM-guided supplementary news selection and logic update

Require: Current news set $\mathcal{N}^{(t)}$, forecasting model $\mathcal{M}^{(t)}$, reflection $r_t$, task context $h_t$, history $\tau_{<t}$, PRM $P_\phi$, reasoning agent RA, convergence threshold $\varepsilon$, max iterations $T_{\max}$

1. $t \leftarrow 0$
2. $\Delta_0 \leftarrow \infty$
3. repeat
4. $\mathcal{C} \leftarrow \text{RA.retrieve}(r_t)$ $\triangleright$ $M$ candidate articles
5. for each candidate $c_j \in \mathcal{C}$ do
6. $s_j \leftarrow \text{sigmoid}\bigl(P_\phi(\text{beneficial} \mid h_t, \tau_{<t}, c_j)\bigr)$ $\triangleright$ PRM scoring
7. end for
8. $n_t^{*} \leftarrow \arg\max_{c_j} s_j$ $\triangleright$ select best candidate
9. $\mathcal{N}^{(t)}_{\text{new}} \leftarrow \{n_t^{*}\}$
10. $\mathcal{N}^{(t+1)} \leftarrow \mathcal{N}^{(t)} \cup \mathcal{N}^{(t)}_{\text{new}}$
11. $\mathcal{M}^{(t+1)} \leftarrow \text{train}\bigl(\mathcal{M}^{(t)}, \mathcal{N}^{(t+1)}\bigr)$ $\triangleright$ re-compress and update forecast model
12. $\Delta_{t+1} \leftarrow \text{RMSE}(\mathcal{M}^{(t)}) - \text{RMSE}(\mathcal{M}^{(t+1)})$ $\triangleright$ compute improvement
13. if $\Delta_{t+1} > 0$ then
14. $\mathcal{L}^{(t+1)} \leftarrow \text{update\_logic}(\mathcal{L}^{(t)}, n_t^{*})$ $\triangleright$ refine persistent filtering logic
15. end if
16. $t \leftarrow t + 1$
17. until $\Delta_t < \varepsilon$ or $t \geq T_{\max}$
18. return Refined logic $\mathcal{L}^{(t)}$, final model $\mathcal{M}^{(t)}$
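The control flow of Algorithm 1 can be sketched in Python as follows. This is a simplified illustration under stated assumptions: `retrieve`, `prm_score`, and `validation_rmse` are hypothetical callables supplied by the caller, and the re-compression and model-update step is folded into `validation_rmse` rather than modeled explicitly.

```python
from typing import Callable, List, Sequence, Tuple


def refine_news_set(
    seed_news: Sequence[str],
    retrieve: Callable[[int], Sequence[str]],
    prm_score: Callable[[str, Sequence[str]], float],
    validation_rmse: Callable[[Sequence[str]], float],
    epsilon: float,
    max_rounds: int,
) -> Tuple[List[str], List[str]]:
    """Return (final news set, articles promoted into the filtering logic)."""
    news = list(seed_news)
    logic: List[str] = []
    rmse_prev = validation_rmse(news)
    for round_index in range(max_rounds):
        candidates = retrieve(round_index)  # candidate pool for this reflection
        if not candidates:
            break
        # PRM scoring and argmax selection.
        best = max(candidates, key=lambda c: prm_score(c, news))
        # The union is unconditional, as in the algorithm.
        news.append(best)
        rmse_new = validation_rmse(news)
        gain = rmse_prev - rmse_new
        if gain > 0:
            logic.append(best)  # only helpful articles refine the logic
        rmse_prev = rmse_new
        if gain < epsilon:  # convergence check
            break
    return news, logic
```

Note that an article with non-positive gain still stays in the news set in this sketch, because the union step of the algorithm is not conditioned on the gain; only the logic update is gated.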
##### Relation to the long‑term filtering logic\.
The selected article $n_t^{*}$ serves two roles: it may improve the current iteration, and it provides a supervised signal for refining the reasoning agent's persistent filtering logic. Over multiple iterations, this logic becomes better aligned with the forecasting objective. The PRM therefore reduces the chance that low-value supplementary articles corrupt the long-term retrieval behavior.
### 3.5 Adaptive News Fusion via Importance-Aware Reward Model
Relevant news articles can be substantially longer than the context available to the forecasting backbone\. Generic compression methods reduce length, but they do not explicitly optimize for downstream forecasting utility\. We therefore introduce an importance\-aware fusion module that compresses long news while conditioning the compression rate on article\-level utility estimates\.
##### Sequential pairwise fusion\.
Given a filtered article list $\{N_1, \dots, N_K\}$, we compress the set into a summary $S$ through sequential pairwise fusion. At each step, the current summary is merged with the next article, and the target compression ratio of each input unit is determined by an estimated importance score. This design avoids forcing all articles through a uniform compression budget.
##### Importance\-aware reward model\.
We train a reward model $R_\theta$ that maps an article $N$ to an importance score $\rho \in [0,1]$. To obtain supervision, we compare forecasting performance under different retention rates for the same article. Let $\text{RMSE}_{\text{high}}$ denote the validation RMSE when the article is aggressively compressed and $\text{RMSE}_{\text{low}}$ the RMSE when it is lightly compressed. We define an unnormalized utility signal as

$$\tilde{\rho}(N) = \frac{\text{RMSE}_{\text{high}} - \text{RMSE}_{\text{low}}}{\text{RMSE}_{\text{no news}}},$$

and normalize it to $[0,1]$ within each training split. A larger score indicates that preserving more information from the article is more beneficial for forecasting. We train $R_\theta$ with supervised regression on article representations extracted by a pre-trained model.
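A small Python sketch of the label construction is given below. The paper states only that the utility is normalized to $[0,1]$ within each training split; the min-max normalization used here is an assumption made for illustration, and all function names are hypothetical.

```python
from typing import List, Sequence


def raw_utility(rmse_high: float, rmse_low: float, rmse_no_news: float) -> float:
    """Unnormalized utility: error increase under aggressive compression,
    relative to the no-news baseline RMSE."""
    if rmse_no_news <= 0:
        raise ValueError("baseline RMSE must be positive")
    return (rmse_high - rmse_low) / rmse_no_news


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Assumed normalization: rescale one split's utilities to [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # degenerate split: no article stands out
    return [(v - low) / (high - low) for v in values]


# Usage with made-up RMSE triples (high compression, low compression, no news).
raw = [
    raw_utility(0.60, 0.50, 1.0),  # compressing this article hurts a lot
    raw_utility(0.55, 0.55, 1.0),  # compression makes no difference
    raw_utility(0.50, 0.52, 1.0),  # compression slightly helps
]
importance_labels = min_max_normalize(raw)
```

Under this assumed scheme, the article whose aggressive compression hurts most receives label 1 and the least sensitive article receives label 0.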
##### Fusion procedure\.
Algorithm [2](https://arxiv.org/html/2606.03097#alg2) gives the fusion procedure. Articles are ordered by timestamp or retrieval priority. For article $N_i$, the reward model outputs $\rho_i = R_\theta(N_i)$, which is mapped to a retention ratio $\alpha_i = f(\rho_i)$ through a monotone function $f$. Higher utility implies larger retention. A controllable summarizer then merges the current summary with the next article under their respective retention budgets. The final summary $S$ is used as the textual context for forecasting. Compared with task-agnostic prompt compression, this procedure allocates the limited context budget according to estimated forecasting relevance.
Algorithm 2: Sequential pairwise news fusion

Require: List of articles $\mathcal{N} = [N_1, N_2, \dots, N_K]$, reward model $R_\theta$, summarizer $\text{SUMM}(\cdot, \cdot; \alpha_1, \alpha_2)$, mapping $f: [0,1] \to [0,1]$

1. $S \leftarrow N_1$ $\triangleright$ initialize summary
2. $\rho_{\text{sum}} \leftarrow R_\theta(N_1)$ $\triangleright$ initial relevance score
3. for $i = 2$ to $K$ do
4. $\rho_i \leftarrow R_\theta(N_i)$ $\triangleright$ score new article
5. $\alpha_{\text{sum}} \leftarrow f(\rho_{\text{sum}})$ $\triangleright$ map to retention weight
6. $\alpha_i \leftarrow f(\rho_i)$ $\triangleright$ map to incorporation weight
7. $S \leftarrow \text{SUMM}(S, N_i; \alpha_{\text{sum}}, \alpha_i)$ $\triangleright$ weighted fusion
8. $\rho_{\text{sum}} \leftarrow \text{aggregate}(\rho_{\text{sum}}, \rho_i)$ $\triangleright$ e.g., maximum or weighted average
9. end for
10. return $S$
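Algorithm 2 can be sketched in Python as shown below. The summarizer here is a toy stand-in that keeps the leading fraction of words from each input; the paper uses an LLM-based controllable summarizer, so `truncate_summarizer`, `reward_fn`, and `retention_fn` are hypothetical placeholders for illustration only.

```python
from typing import Callable, Sequence


def _keep_fraction(text: str, alpha: float) -> str:
    """Keep roughly the first alpha fraction of words (at least one word)."""
    words = text.split()
    if not words:
        return ""
    n_keep = max(1, round(alpha * len(words)))
    return " ".join(words[:n_keep])


def truncate_summarizer(
    summary: str, article: str, alpha_sum: float, alpha_new: float
) -> str:
    """Toy SUMM(S, N_i; alpha_sum, alpha_i): truncate each input to its budget."""
    merged = _keep_fraction(summary, alpha_sum) + " " + _keep_fraction(article, alpha_new)
    return merged.strip()


def fuse_articles(
    articles: Sequence[str],
    reward_fn: Callable[[str], float],
    retention_fn: Callable[[float], float],
    summarize: Callable[[str, str, float, float], str] = truncate_summarizer,
    aggregate: Callable[[float, float], float] = max,
) -> str:
    """Sequential pairwise fusion with importance-dependent retention ratios."""
    if not articles:
        return ""
    summary = articles[0]                # S <- N_1
    rho_sum = reward_fn(articles[0])     # initial relevance score
    for article in articles[1:]:
        rho_i = reward_fn(article)       # score the new article
        summary = summarize(
            summary, article, retention_fn(rho_sum), retention_fn(rho_i)
        )
        rho_sum = aggregate(rho_sum, rho_i)  # e.g., maximum
    return summary


# Usage: a high-utility first article is kept whole, the second is halved.
fused = fuse_articles(
    ["a b c d", "e f g h"],
    reward_fn=lambda text: 1.0 if text.startswith("a") else 0.5,
    retention_fn=lambda rho: rho,  # identity as a simple monotone mapping f
)
```

Using `max` as the aggregate means the running summary keeps the retention budget of the most important article it has absorbed so far; a weighted average is the other option the algorithm mentions.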
##### Training and deployment\.
The reward model is trained offline and then reused throughout refinement and deployment\. During online inference, the same deterministic scoring\-and\-fusion procedure is applied, without additional supervision or search\.
### 3.6 Overall Pipeline
The overall framework has two phases: offline refinement and online inference, as illustrated in Figure [1](https://arxiv.org/html/2606.03097#S2.F1). Offline refinement uses historical data with ground truth to improve the filtering logic and train the auxiliary scoring modules. The procedure is: (1) Initial filtering: the reasoning agent applies a default logic to select a candidate news set $\mathcal{N}^{(0)}$. (2) Adaptive compression: the importance-aware reward model (Sec. [3.5](https://arxiv.org/html/2606.03097#S3.SS5)) compresses $\mathcal{N}^{(0)}$ into a compact summary $S$. (3) Forecasting fine-tuning: the LLM is fine-tuned via LoRA on the paired time series and summary $S$. (4) Evaluation and reflection: the evaluation agent analyzes validation-set errors and generates a textual reflection $r_t$ indicating a missing news category. (5) PRM-guided supplementary selection: the reasoning agent retrieves $M$ candidate articles matching $r_t$; the PRM scores them and selects the highest-scoring article $n_t^{*}$, which is merged into the news set. (6) Iteration: steps (2)–(5) are repeated until the RMSE improvement $\Delta_t$ falls below a threshold or a maximum number of rounds is reached.
After convergence, we freeze the refined filtering logic, the trained reward models, and the forecasting backbone\. Online inference contains only a single forward pipeline: retrieve news with the frozen logic, compress the selected news with the importance\-aware fusion module, and generate the forecast\. Since ground truth is unavailable at test time, no reflection loop or PRM scoring is used\. In our implementation, Deepseek V3\.2 is used for the reasoning, summarization, and reflection modules, while a smaller Qwen model is used for the PRM and the importance reward model\.
## 4 Experiments
Table 1: RMSE comparison across different domains and models on the true news dataset. Bold indicates the best result and underline indicates the second-best result.

### 4.1 Data preparation
We evaluate on datasets where exogenous textual signals offer meaningful predictive value—i\.e\., series influenced by human activities and socially salient events rather than purely physical processes\. This excludes physics\-driven benchmarks such as ETT\(Zhouet al\.,[2021a](https://arxiv.org/html/2606.03097#bib.bib64)\)and Weather\(Reynoldset al\.,[2007](https://arxiv.org/html/2606.03097#bib.bib65); Hersbachet al\.,[2020](https://arxiv.org/html/2606.03097#bib.bib66)\), which are less suitable for studying event\-conditioned auxiliary signals\. Following this principle, we consider four domains: electricity demand\(Godahewaet al\.,[2021](https://arxiv.org/html/2606.03097#bib.bib67)\), Bitcoin price\(Godahewaet al\.,[2021](https://arxiv.org/html/2606.03097#bib.bib67)\), traffic flow\(Maggieet al\.,[2017](https://arxiv.org/html/2606.03097#bib.bib68)\), and foreign exchange rates\(Laiet al\.,[2018](https://arxiv.org/html/2606.03097#bib.bib82)\)\. The electricity, traffic, and foreign exchange datasets are adopted from\(Wanget al\.,[2024](https://arxiv.org/html/2606.03097#bib.bib11)\); the Bitcoin price series is collected from investing\.com\(Investing\.com,[2026](https://arxiv.org/html/2606.03097#bib.bib69)\)\. Together, these benchmarks span 30\-minute, hourly, and daily frequencies, enabling evaluation across varying temporal granularities\.
For textual side information, we use real\-world news corpora associated with the four domains provided by\(Wanget al\.,[2024](https://arxiv.org/html/2606.03097#bib.bib11)\)\. To ensure strict separation between training and evaluation data while maintaining sufficient training scale, we also construct a large synthetic news corpus using DeepSeek\-V3\.2\(DeepSeek\-AIet al\.,[2025](https://arxiv.org/html/2606.03097#bib.bib70)\)conditioned on real temporal information\. The synthetic news is used exclusively for training the reward model and the process reward model \(PRM\), while evaluation is conducted only on real news\. This protocol reduces the risk of train–test contamination, preserves realistic testing conditions, and provides a controlled and scalable source of textual supervision for model training\. Table[4](https://arxiv.org/html/2606.03097#A1.T4)provides detailed information on the length of news items in the test set\. Please refer to the appendix for further details\.
### 4.2 Baseline
To evaluate the performance of the proposed optimization approach, we carry out comprehensive experiments across a broad range of representative time\-series forecasting models\. In particular, the selected base forecasters fall into two groups\. The first group consists of history\-based forecasting models, including N\-BEATS\(Oreshkinet al\.,[2019](https://arxiv.org/html/2606.03097#bib.bib71)\), PatchTST\(Nieet al\.,[2022](https://arxiv.org/html/2606.03097#bib.bib32)\), DLinear\(Zenget al\.,[2023](https://arxiv.org/html/2606.03097#bib.bib72)\), LightGBM\(Keet al\.,[2017](https://arxiv.org/html/2606.03097#bib.bib73)\), Chronos\(Ansariet al\.,[2024](https://arxiv.org/html/2606.03097#bib.bib74)\), and iTransformer\(Liuet al\.,[2023](https://arxiv.org/html/2606.03097#bib.bib75)\), all of which generate forecasts solely from past time\-series observations\. These models cover a range of forecasting backbones, from classical machine learning methods to recent deep neural architectures, thereby providing a comprehensive benchmark for evaluating the robustness and general applicability of our method\. The second group includes a direct prediction method built upon LoRA\(Huet al\.,[2022](https://arxiv.org/html/2606.03097#bib.bib12)\)\. Unlike the history\-only baselines, this model takes as input not only the historical sequence but also the same external textual information used in our approach, enabling a more direct comparison with our method\. Including this setting allows for a fairer and more informative comparison, since both methods have access to the same sources of information and differ primarily in how that information is utilized for forecasting\.
#### 4\.2\.1Reward Model
Our framework employs two complementary reward models to guide the news-augmented forecasting pipeline: an importance-aware reward model for evaluating the utility of individual news segments, and a Process Reward Model (PRM) (Lightman et al., [2023a](https://arxiv.org/html/2606.03097#bib.bib77)) for identifying optimal news selection strategies.
Importance-aware reward model. The reward model is built upon Qwen3-8B (Lightman et al., [2023a](https://arxiv.org/html/2606.03097#bib.bib77)) and fine-tuned via Low-Rank Adaptation (LoRA) (Hu et al., [2021](https://arxiv.org/html/2606.03097#bib.bib79)) with rank $r=64$, scaling factor $\alpha=128$, and dropout rate $0.05$. LoRA adapters are applied to all attention projections ($W_Q, W_K, W_V, W_O$) as well as the feed-forward sub-layers ($W_{\text{gate}}, W_{\text{up}}, W_{\text{down}}$). A linear regression head maps the final hidden state to a scalar importance score $\rho\in[0,1]$. Training uses supervised regression with mean squared error:
$$\mathcal{L}_{\text{RM}}=\frac{1}{n}\sum_{i=1}^{n}\big(R_{\theta}(N_i)-\rho_i\big)^{2}, \tag{1}$$

where $\rho_i$ is the normalized importance label for article $N_i$. To obtain supervision, we compare forecasting performance under different retention rates for the same article. Let $\text{RMSE}_{\text{high}}$ denote the validation RMSE when article $N_i$ is aggressively compressed and $\text{RMSE}_{\text{low}}$ the RMSE when it is lightly compressed. We define the unnormalized utility as $\tilde{\rho}(N_i)=(\text{RMSE}_{\text{high}}-\text{RMSE}_{\text{low}})/\text{RMSE}_{\text{no\,news}}$ and normalize it to $[0,1]$ within each training split. A larger score indicates that preserving information from the article is more beneficial for forecasting. At inference time, the reward model scores individual news segments, and a monotone mapping function converts each score into the retention ratio for that segment.
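The label construction and the loss in Eq. (1) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the text only says the utility is normalized to $[0,1]$, so min-max scaling is assumed here, and `retention_ratio` with its bounds `r_min` and `r_max` is a hypothetical example of a monotone mapping. The RMSE triples are made-up numbers.

```python
from typing import List, Sequence


def unnormalized_utility(rmse_high: float, rmse_low: float, rmse_no_news: float) -> float:
    """(RMSE_high - RMSE_low) / RMSE_no_news for a single article."""
    return (rmse_high - rmse_low) / rmse_no_news


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Assumed normalization: rescale the utilities of one split to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # degenerate split: every article has the same utility
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def mse_loss(preds: Sequence[float], labels: Sequence[float]) -> float:
    """Mean squared error between predicted scores and importance labels, as in Eq. (1)."""
    return sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(labels)


def retention_ratio(score: float, r_min: float = 0.1, r_max: float = 0.9) -> float:
    """Hypothetical monotone mapping: a higher score keeps more of the article."""
    score = min(max(score, 0.0), 1.0)
    return r_min + (r_max - r_min) * score


# Made-up (RMSE_high, RMSE_low, RMSE_no_news) triples for three articles.
runs = [(12.0, 10.0, 20.0), (10.5, 10.0, 20.0), (11.0, 10.0, 20.0)]
raw = [unnormalized_utility(*r) for r in runs]
labels = min_max_normalize(raw)
loss = mse_loss([0.5, 0.5], [1.0, 0.0])
```

In this toy split, the first article loses the most accuracy under aggressive compression, so it receives the highest label and would be given the largest retention ratio.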
Process Reward Model (PRM). While the RM evaluates single news segments in isolation, the PRM assesses the quality of composite news selection strategies. We enumerate all $2^{N}$ subsets of the $N$ available news segments for each sample and query the forecasting model with each subset to obtain its prediction RMSE. The reward for each subset is defined as $r=-\text{RMSE}$, so that lower forecasting error corresponds to higher reward. This exhaustive evaluation captures interaction effects among news segments that pairwise comparisons cannot reveal. The PRM is trained on these step-wise labels using the TRL framework (von Werra et al., [2020](https://arxiv.org/html/2606.03097#bib.bib81)), where each step corresponds to the incremental inclusion of a news segment into the selected subset. At inference time, the PRM guides a search over candidate news combinations to identify the subset that maximizes the expected forecasting accuracy.
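The exhaustive subset labeling can be illustrated with a short Python sketch. The forecaster is replaced by a hypothetical stand-in, `toy_rmse`, whose numbers are invented to show an interaction effect (two articles that overlap in content); only the enumeration of $2^{N}$ subsets and the reward $r=-\text{RMSE}$ come from the text.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet


def label_subsets(n_articles: int,
                  rmse_of: Callable[[FrozenSet[int]], float]) -> Dict[FrozenSet[int], float]:
    """Assign reward r = -RMSE to each of the 2^N subsets of article indices."""
    rewards: Dict[FrozenSet[int], float] = {}
    for k in range(n_articles + 1):
        for idx in combinations(range(n_articles), k):
            subset = frozenset(idx)
            rewards[subset] = -rmse_of(subset)
    return rewards


def toy_rmse(subset: FrozenSet[int]) -> float:
    """Hypothetical forecaster error; stands in for querying the real model."""
    rmse = 10.0
    if 0 in subset:
        rmse -= 3.0
    if 1 in subset:
        rmse -= 2.0
    if 0 in subset and 1 in subset:
        rmse += 1.0  # articles 0 and 1 partly repeat each other
    if 2 in subset:
        rmse += 1.0  # article 2 is a distractor
    return rmse


rewards = label_subsets(3, toy_rmse)
best = max(rewards, key=rewards.get)
```

Because the number of forecaster queries doubles with every additional article, this kind of labeling is only practical offline and for small $N$.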
Table 2: Ablation results for removing external news and the reward model.
### 4\.3Main Results
Table [1](https://arxiv.org/html/2606.03097#S4.T1) presents the RMSE comparison between our method and the baseline models across four domains. We adopt RMSE as the primary evaluation metric because it penalizes large prediction errors more heavily than metrics such as MAE, making it well suited to capturing the sudden shifts and anomalies that external news events often trigger. As shown in Table [1](https://arxiv.org/html/2606.03097#S4.T1), our method achieves the best performance in all four domains, with RMSE values of 372.32, 886.58, 34.66, and 5.86 on Electricity, Bitcoin, Traffic, and Exchange, respectively. On Bitcoin, our method reduces RMSE by 23.3% relative to the second-best baseline (LightGBM, 1156.65). On Exchange, it outperforms the strongest baseline (LightGBM, 7.09) by 17.3%. These gains in news-sensitive financial domains support the value of incorporating external news into time series forecasting. The improvements on Electricity and Traffic, by contrast, are marginal. One possible reason is that both electricity consumption and traffic flow exhibit strong periodicity governed primarily by human activity patterns (e.g., daily routines, weekly cycles, and seasonal trends), so their dynamics are largely predictable from historical data alone. These domains are therefore less sensitive to external news events, which limits the additional predictive value that news-augmented methods can offer.
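The two relative improvements quoted above follow directly from the reported RMSE values. The short check below recomputes them; the helper name `relative_reduction` is introduced here only for illustration.

```python
def relative_reduction(baseline_rmse: float, our_rmse: float) -> float:
    """Percentage by which our RMSE is lower than the baseline RMSE."""
    return 100.0 * (baseline_rmse - our_rmse) / baseline_rmse


# RMSE values quoted in the text: LightGBM baseline vs. the proposed method.
bitcoin = relative_reduction(1156.65, 886.58)
exchange = relative_reduction(7.09, 5.86)

print(f"Bitcoin: {bitcoin:.1f}%  Exchange: {exchange:.1f}%")
# prints "Bitcoin: 23.3%  Exchange: 17.3%"
```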
### 4\.4Ablation Experiments
To further validate the effectiveness of each component in our framework, we design two groups of ablation experiments\. Specifically, we consider the following variants: \(1\)w/o News, which removes the external news input and relies solely on the large language model to perform predictions based on historical time series data; and \(2\)w/o Reward Model, which removes the reward model used to optimize the news compression ratio, allowing the large language model to compress news content in an unconstrained manner before making predictions\.
Table [2](https://arxiv.org/html/2606.03097#S4.T2) presents the RMSE comparison. Both variants perform worse than our full method across all domains. The degradation of the “w/o News” variant confirms that external news provides supplementary information beyond what historical patterns capture, particularly in news-sensitive domains where exogenous factors drive market dynamics. The degradation of the “w/o Reward Model” variant indicates that unguided news compression introduces noise and may discard key predictive signals, which highlights the role of the reward model in learning a compression strategy that preserves task-relevant information while filtering out irrelevant content. Together, these results show that both news integration and reward-model-guided compression contribute meaningfully to the final forecasting performance.
### 4\.5Convergence Analysis
To test whether the PRM accelerates news selection, we conduct a convergence analysis. We set a predefined RMSE threshold for each domain and compare two strategies: (1) PRM-guided selection, which at each step selects the news article most likely to yield the largest reduction in RMSE, as judged by a PRM that was trained beforehand and can recognize similar patterns; and (2) naive selection, which adds news without guidance. At each step, one article is appended, compressed, and used for prediction. We record the number of steps required to reach the target threshold.
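The step-counting protocol can be sketched as follows. Everything in this example is a toy stand-in: `toy_gain` plays the role of both the forecaster's response to each article and the PRM's score, "naive" is assumed to mean taking candidates in their given order, and compression is omitted. Only the loop structure (append one article per step, predict, stop at the threshold) follows the description above.

```python
from typing import Callable, Dict, List, Optional, Sequence

Picker = Callable[[List[str], List[str]], str]


def steps_to_threshold(candidates: Sequence[str],
                       pick_next: Picker,
                       rmse_of: Callable[[List[str]], float],
                       threshold: float) -> Optional[int]:
    """Append one article per step; return the first step at which RMSE <= threshold."""
    selected: List[str] = []
    remaining = list(candidates)
    for step in range(1, len(candidates) + 1):
        article = pick_next(selected, remaining)
        remaining.remove(article)
        selected.append(article)
        if rmse_of(selected) <= threshold:
            return step
    return None  # threshold never reached


# Hypothetical per-article error reductions (invented numbers).
toy_gain: Dict[str, float] = {"a": 0.5, "b": 3.0, "c": 0.2, "d": 2.5}


def toy_rmse(selected: List[str]) -> float:
    return 10.0 - sum(toy_gain[a] for a in selected)


def naive_pick(selected: List[str], remaining: List[str]) -> str:
    return remaining[0]  # assumed: take candidates in their given order


def guided_pick(selected: List[str], remaining: List[str]) -> str:
    return max(remaining, key=lambda a: toy_gain[a])  # stand-in for a PRM score


naive_steps = steps_to_threshold("abcd", naive_pick, toy_rmse, threshold=5.0)
guided_steps = steps_to_threshold("abcd", guided_pick, toy_rmse, threshold=5.0)
```

In this toy setting the guided picker reaches the threshold in 2 steps and the naive one in 4, which mirrors the kind of gap the analysis measures without reproducing the reported numbers.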
As shown in Table [3](https://arxiv.org/html/2606.03097#S4.T3), the PRM-guided strategy consistently requires fewer steps across all four domains. On average, PRM-guided selection reduces the number of convergence steps by 24.8% compared to naive selection. The most pronounced improvement is observed in the Bitcoin domain, where the required steps decrease by 37.6%. This confirms that our PRM effectively identifies the news most likely to improve forecasting accuracy, reducing retrieval and inference iterations. In practice, this yields lower computational cost and faster convergence, making our framework more efficient and scalable for large candidate news pools.
Table 3: Convergence comparison (number of steps to reach target RMSE) between PRM-guided and naive news selection across four domains.
## 5Conclusion and Future Work
In this paper, we proposed a news\-augmented time series forecasting framework that addresses two practical challenges in existing LLM\-based pipelines: the difficulty of incorporating long news under limited context budgets and the lack of quality control in iterative supplementary news retrieval\. To this end, we introduced an importance\-aware fusion module to allocate compression budgets according to the forecasting utility of each article, and a PRM\-guided refinement mechanism to rank supplementary news candidates based on their expected contribution to error reduction\. Experimental results on electricity, bitcoin, traffic, and exchange benchmarks demonstrate that our method consistently improves forecasting accuracy over strong baselines, while also reducing the number of refinement steps required for convergence\.
For future work, we plan to extend the framework in several directions\. First, we will investigate more efficient and scalable fusion strategies for larger news pools and longer forecasting horizons\. Second, we aim to generalize the method to multimodal exogenous information, such as social media, reports, and event graphs, beyond plain news text\. We believe these directions can further enhance the practicality and generality of LLM\-based forecasting systems\.
## References
- Forecasting traffic congestion using arima modeling\.In2019 15th international wireless communications & mobile computing conference \(IWCMC\),pp\. 1227–1232\.Cited by:[§1](https://arxiv.org/html/2606.03097#S1.p1.1)\.
- A\. F\. Ansari, L\. Stella, C\. Turkmen, X\. Zhang, P\. Mercado, H\. Shen, O\. Shchur, S\. S\. Rangapuram, S\. P\. Arango, S\. Kapoor,et al\.\(2024\)Chronos: learning the language of time series\.arXiv preprint arXiv:2403\.07815\.Cited by:[§A\.1](https://arxiv.org/html/2606.03097#A1.SS1.p4.1),[§4\.2](https://arxiv.org/html/2606.03097#S4.SS2.p1.1)\.
- Y\. Bai, S\. Camal, and A\. Michiorri \(2024\)News and load: a quantitative exploration of natural language processing applications for forecasting day\-ahead electricity system demand\.IEEE Transactions on Power Systems39\(5\),pp\. 6222–6234\.Cited by:[§2](https://arxiv.org/html/2606.03097#S2.p1.1)\.
- T\. Brown, B\. Mann, N\. Ryder, M\. Subbiah, J\. D\. Kaplan, P\. Dhariwal, A\. Neelakantan, P\. Shyam, G\. Sastry, A\. Askell,et al\.\(2020\)Language models are few\-shot learners\.Advances in neural information processing systems33,pp\. 1877–1901\.Cited by:[§3\.3](https://arxiv.org/html/2606.03097#S3.SS3.p1.1)\.
- D\. Cao, F\. Jia, S\. O\. Arik, T\. Pfister, Y\. Zheng, W\. Ye, and Y\. Liu \(2023\)Tempo: prompt\-based generative pre\-trained transformer for time series forecasting\.arXiv preprint arXiv:2310\.04948\.Cited by:[§2](https://arxiv.org/html/2606.03097#S2.p1.1),[§2](https://arxiv.org/html/2606.03097#S2.p2.1)\.
- M\. Cecchini, H\. Aytug, G\. J\. Koehler, and P\. Pathak \(2010\)Making words work: using financial text as a predictor of financial events\.Decision support systems50\(1\),pp\. 164–175\.Cited by:[§2](https://arxiv.org/html/2606.03097#S2.p1.1)\.
- B\. Chen, M\. Chang,et al\.\(2004\)Load forecasting using support vector machines: a study on eunite competition 2001\.IEEE transactions on power systems19\(4\),pp\. 1821–1830\.Cited by:[§2](https://arxiv.org/html/2606.03097#S2.p1.1)\.
- K\. Chowdhary \(2020\)Natural language processing\.Fundamentals of artificial intelligence,pp\. 603–649\.Cited by:[§2](https://arxiv.org/html/2606.03097#S2.p1.1)\.
- DeepSeek\-AI, A\. Liu, A\. Mei, B\. Lin, B\. Xue, B\. Wang, B\. Xu, B\. Wu, B\. Zhang, C\. Lin, C\. Dong, C\. Lu, C\. Zhao, C\. Deng, C\. Xu, C\. Ruan, D\. Dai, D\. Guo, D\. Yang, D\. Chen, E\. Li, F\. Zhou, F\. Lin, F\. Dai, G\. Hao, G\. Chen, G\. Li, H\. Zhang, H\. Xu, H\. Li, H\. Liang, H\. Wei, H\. Zhang, H\. Luo, H\. Ji, H\. Ding, H\. Tang, H\. Cao, H\. Gao, H\. Qu, H\. Zeng, J\. Huang, J\. Li, J\. Xu, J\. Hu, J\. Chen, J\. Xiang, J\. Yuan, J\. Cheng, J\. Zhu, J\. Ran, J\. Jiang, J\. Qiu, J\. Li, J\. Song, K\. Dong, K\. Gao, K\. Guan, K\. Huang, K\. Zhou, K\. Huang, K\. Yu, L\. Wang, L\. Zhang, L\. Wang, L\. Zhao, L\. Yin, L\. Guo, L\. Luo, L\. Ma, L\. Wang, L\. Zhang, M\. S\. Di, M\. Y\. Xu, M\. Zhang, M\. Zhang, M\. Tang, M\. Zhou, P\. Huang, P\. Cong, P\. Wang, Q\. Wang, Q\. Zhu, Q\. Li, Q\. Chen, Q\. Du, R\. Xu, R\. Ge, R\. Zhang, R\. Pan, R\. Wang, R\. Yin, R\. Xu, R\. Shen, R\. Zhang, S\. H\. Liu, S\. Lu, S\. Zhou, S\. Chen, S\. Cai, S\. Chen, S\. Hu, S\. Liu, S\. Hu, S\. Ma, S\. Wang, S\. Yu, S\. Zhou, S\. Pan, S\. Zhou, T\. Ni, T\. Yun, T\. Pei, T\. Ye, T\. Yue, W\. Zeng, W\. Liu, W\. Liang, W\. Pang, W\. Luo, W\. Gao, W\. Zhang, X\. Gao, X\. Wang, X\. Bi, X\. Liu, X\. Wang, X\. Chen, X\. Zhang, X\. Nie, X\. Cheng, X\. Liu, X\. Xie, X\. Liu, X\. Yu, X\. Li, X\. Yang, X\. Li, X\. Chen, X\. Su, X\. Pan, X\. Lin, X\. Fu, Y\. Q\. Wang, Y\. Zhang, Y\. Xu, Y\. Ma, Y\. Li, Y\. Li, Y\. Zhao, Y\. Sun, Y\. Wang, Y\. Qian, Y\. Yu, Y\. Zhang, Y\. Ding, Y\. Shi, Y\. Xiong, Y\. He, Y\. Zhou, Y\. Zhong, Y\. Piao, Y\. Wang, Y\. Chen, Y\. Tan, Y\. Wei, Y\. Ma, Y\. Liu, Y\. Yang, Y\. Guo, Y\. Wu, Y\. Wu, Y\. Cheng, Y\. Ou, Y\. Xu, Y\. Wang, Y\. Gong, Y\. Wu, Y\. Zou, Y\. Li, Y\. Xiong, Y\. Luo, Y\. You, Y\. Liu, Y\. Zhou, Z\. F\. Wu, Z\. Z\. Ren, Z\. Zhao, Z\. Ren, Z\. Sha, Z\. Fu, Z\. Xu, Z\. Xie, Z\. Zhang, Z\. Hao, Z\. Gou, Z\. Ma, Z\. Yan, Z\. Shao, Z\. Huang, Z\. Wu, Z\. Li, Z\. Zhang, Z\. Xu, Z\. Wang, Z\. Gu, Z\. Zhu, Z\. Li, Z\. Zhang, Z\. Xie, Z\. Gao, Z\. 
Pan, Z\. Yao, B\. Feng, H\. Li, J\. L\. Cai, J\. Ni, L\. Xu, M\. Li, N\. Tian, R\. J\. Chen, R\. L\. Jin, S\. S\. Li, S\. Zhou, T\. Sun, X\. Q\. Li, X\. Jin, X\. Shen, X\. Chen, X\. Song, X\. Zhou, Y\. X\. Zhu, Y\. Huang, Y\. Li, Y\. Zheng, Y\. Zhu, Y\. Ma, Z\. Huang, Z\. Xu, Z\. Zhang, D\. Ji, J\. Liang, J\. Guo, J\. Chen, L\. Xia, M\. Wang, M\. Li, P\. Zhang, R\. Chen, S\. Sun, S\. Wu, S\. Ye, T\. Wang, W\. L\. Xiao, W\. An, X\. Wang, X\. Sun, X\. Wang, Y\. Tang, Y\. Zha, Z\. Zhang, Z\. Ju, Z\. Zhang, and Z\. Qu \(2025\)DeepSeek\-v3\.2: pushing the frontier of open large language models\.External Links:2512\.02556,[Link](https://arxiv.org/abs/2512.02556)Cited by:[§4\.1](https://arxiv.org/html/2606.03097#S4.SS1.p2.1)\.
- G\. Dudek \(2015\)Short\-term load forecasting using random forests\.InIntelligent Systems’ 2014: Proceedings of the 7th IEEE International Conference Intelligent Systems IS’2014, September 24\-26, 2014, Warsaw, Poland, Volume 2: Tools, Architectures, Systems, Applications,pp\. 821–828\.Cited by:[§2](https://arxiv.org/html/2606.03097#S2.p1.1)\.
- R\. Fildes, S\. Ma, and S\. Kolassa \(2022\)Retail forecasting: research and practice\.International Journal of Forecasting38\(4\),pp\. 1283–1318\.Cited by:[§1](https://arxiv.org/html/2606.03097#S1.p1.1)\.
- R\. Godahewa, C\. Bergmeir, G\. I\. Webb, R\. J\. Hyndman, and P\. Montero\-Manso \(2021\)Monash time series forecasting archive\.arXiv preprint arXiv:2105\.06643\.Cited by:[§4\.1](https://arxiv.org/html/2606.03097#S4.SS1.p1.1)\.
- G\. Gross and F\. D\. Galiana \(1987\)Short\-term load forecasting\.Proceedings of the IEEE75\(12\),pp\. 1558–1573\.Cited by:[§1](https://arxiv.org/html/2606.03097#S1.p1.1)\.
- N\. Gruver, M\. Finzi, S\. Qiu, and A\. G\. Wilson \(2023\)Large language models are zero\-shot time series forecasters\.Advances in neural information processing systems36,pp\. 19622–19635\.Cited by:[§2](https://arxiv.org/html/2606.03097#S2.p2.1)\.
- H\. Hersbach, B\. Bell, P\. Berrisford, S\. Hirahara, A\. Horányi, J\. Muñoz\-Sabater, J\. Nicolas, C\. Peubey, R\. Radu, D\. Schepers,et al\.\(2020\)The era5 global reanalysis\.Quarterly journal of the royal meteorological society146\(730\),pp\. 1999–2049\.Cited by:[§4\.1](https://arxiv.org/html/2606.03097#S4.SS1.p1.1)\.
- E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al. (2022) LoRA: low-rank adaptation of large language models. ICLR 1(2), pp. 3. Cited by: [§3.2](https://arxiv.org/html/2606.03097#S3.SS2.p2.1), [§4.2](https://arxiv.org/html/2606.03097#S4.SS2.p1.1).
- E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen \(2021\)LoRA: low\-rank adaptation of large language models\.External Links:2106\.09685,[Link](https://arxiv.org/abs/2106.09685)Cited by:[§4\.2\.1](https://arxiv.org/html/2606.03097#S4.SS2.SSS1.p2.6)\.
- S\. Huang and K\. Shih \(2003\)Short\-term load forecasting via arma model identification including non\-gaussian process considerations\.IEEE Transactions on power systems18\(2\),pp\. 673–679\.Cited by:[§2](https://arxiv.org/html/2606.03097#S2.p1.1)\.
- Investing\.com \(2026\)Bitcoin historical data\.Note:[https://ca\.investing\.com/crypto/bitcoin/historical\-data](https://ca.investing.com/crypto/bitcoin/historical-data)Accessed: 2026\-01\-22Cited by:[§4\.1](https://arxiv.org/html/2606.03097#S4.SS1.p1.1)\.
- H\. Jiang, Q\. Wu, X\. Luo, D\. Li, C\. Lin, Y\. Yang, and L\. Qiu \(2023\)LLMLingua: compressing prompts for accelerated inference of llms\.InEMNLP,Cited by:[§1](https://arxiv.org/html/2606.03097#S1.p3.1),[§2](https://arxiv.org/html/2606.03097#S2.p3.1)\.
- H\. Jiang, Q\. Wu, X\. Luo, D\. Li, C\. Lin, Y\. Yang, and L\. Qiu \(2024\)LongLLMLingua: accelerating and enhancing llms in long context scenarios via prompt compression\.InICLR,Cited by:[§1](https://arxiv.org/html/2606.03097#S1.p3.1),[§2](https://arxiv.org/html/2606.03097#S2.p3.1)\.
- M\. Jin, S\. Wang, L\. Ma, Z\. Chu, J\. Y\. Zhang, X\. Shi, P\. Chen, Y\. Liang, Y\. Li, S\. Pan,et al\.\(2023a\)Time\-llm: time series forecasting by reprogramming large language models\.arXiv preprint arXiv:2310\.01728\.Cited by:[§A\.1](https://arxiv.org/html/2606.03097#A1.SS1.p4.1),[§2](https://arxiv.org/html/2606.03097#S2.p2.1)\.
- M\. Jin, Q\. Wen, Y\. Liang, C\. Zhang, S\. Xue, X\. Wang, J\. Zhang, Y\. Wang, H\. Chen, X\. Li,et al\.\(2023b\)Large models for time series and spatio\-temporal data: a survey and outlook\.arXiv preprint arXiv:2310\.10196\.Cited by:[§2](https://arxiv.org/html/2606.03097#S2.p1.1)\.
- P\. S\. Kalekaret al\.\(2004\)Time series forecasting using holt\-winters exponential smoothing\.Kanwal Rekhi school of information Technology4329008\(13\),pp\. 1–13\.Cited by:[§2](https://arxiv.org/html/2606.03097#S2.p1.1)\.
- G\. Ke, Q\. Meng, T\. Finley, T\. Wang, W\. Chen, W\. Ma, Q\. Ye, and T\. Liu \(2017\)Lightgbm: a highly efficient gradient boosting decision tree\.Advances in neural information processing systems30\.Cited by:[§A\.1](https://arxiv.org/html/2606.03097#A1.SS1.p4.1),[§4\.2](https://arxiv.org/html/2606.03097#S4.SS2.p1.1)\.
- G\. Lai, W\. Chang, Y\. Yang, and H\. Liu \(2018\)Modeling long\-and short\-term temporal patterns with deep neural networks\.InThe 41st international ACM SIGIR conference on research & development in information retrieval,pp\. 95–104\.Cited by:[§4\.1](https://arxiv.org/html/2606.03097#S4.SS1.p1.1)\.
- S\. Li, X\. Jin, Y\. Xuan, X\. Zhou, W\. Chen, Y\. Wang, and X\. Yan \(2019\)Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting\.Advances in neural information processing systems32\.Cited by:[§2](https://arxiv.org/html/2606.03097#S2.p1.1)\.
- H\. Lightman, V\. Kosaraju, Y\. Burda, H\. Edwards, B\. Baker, T\. Lee, J\. Leike, J\. Schulman, I\. Sutskever, and K\. Cobbe \(2023a\)Let’s verify step by step\.External Links:2305\.20050,[Link](https://arxiv.org/abs/2305.20050)Cited by:[§4\.2\.1](https://arxiv.org/html/2606.03097#S4.SS2.SSS1.p2.6)\.
- H\. Lightman, V\. Kosaraju, Y\. Burda, H\. Edwards, B\. Baker, T\. Lee, J\. Leike, J\. Schulman, I\. Sutskever, and K\. Cobbe \(2023b\)Let’s verify step by step\.arXiv preprint arXiv:2305\.20050\.Cited by:[§2](https://arxiv.org/html/2606.03097#S2.p3.1)\.
- C\. Liu, Z\. Jin, J\. Gu, and C\. Qiu \(2017\)Short\-term load forecasting using a long short\-term memory network\.In2017 IEEE PES innovative smart grid technologies conference Europe \(ISGT\-Europe\),pp\. 1–6\.Cited by:[§2](https://arxiv.org/html/2606.03097#S2.p1.1)\.
- Y\. Liu, T\. Hu, H\. Zhang, H\. Wu, S\. Wang, L\. Ma, and M\. Long \(2023\)Itransformer: inverted transformers are effective for time series forecasting\.arXiv preprint arXiv:2310\.06625\.Cited by:[§A\.1](https://arxiv.org/html/2606.03097#A1.SS1.p4.1),[§4\.2](https://arxiv.org/html/2606.03097#S4.SS2.p1.1)\.
- O\. A\. Maggie, V\. Kuznetsov, and W\. Cukierski \(2017\)Web traffic time series forecasting\.Note:[https://kaggle\.com/competitions/web\-traffic\-time\-series\-forecasting](https://kaggle.com/competitions/web-traffic-time-series-forecasting)Kaggle competitionCited by:[§4\.1](https://arxiv.org/html/2606.03097#S4.SS1.p1.1)\.
- Y\. Nie, N\. H\. Nguyen, P\. Sinthong, and J\. Kalagnanam \(2022\)A time series is worth 64 words: long\-term forecasting with transformers\.arXiv preprint arXiv:2211\.14730\.Cited by:[§A\.1](https://arxiv.org/html/2606.03097#A1.SS1.p4.1),[§2](https://arxiv.org/html/2606.03097#S2.p1.1),[§4\.2](https://arxiv.org/html/2606.03097#S4.SS2.p1.1)\.
- D\. Obst, J\. De Vilmarest, and Y\. Goude \(2021\)Adaptive methods for short\-term electricity load forecasting during covid\-19 lockdown in france\.IEEE transactions on power systems36\(5\),pp\. 4754–4763\.Cited by:[§2](https://arxiv.org/html/2606.03097#S2.p1.1)\.
- B\. N\. Oreshkin, D\. Carpov, N\. Chapados, and Y\. Bengio \(2019\)N\-beats: neural basis expansion analysis for interpretable time series forecasting\.arXiv preprint arXiv:1905\.10437\.Cited by:[§A\.1](https://arxiv.org/html/2606.03097#A1.SS1.p4.1),[§4\.2](https://arxiv.org/html/2606.03097#S4.SS2.p1.1)\.
- A\. D\. Papalexopoulos and T\. C\. Hesterberg \(1990\)A regression\-based approach to short\-term system load forecasting\.IEEE Transactions on power systems5\(4\),pp\. 1535–1547\.Cited by:[§2](https://arxiv.org/html/2606.03097#S2.p1.1)\.
- K\. Rasul, A\. Ashok, A\. R\. Williams, H\. Ghonia, R\. Bhagwatkar, A\. Khorasani, M\. J\. D\. Bayazi, G\. Adamopoulos, R\. Riachi, N\. Hassen,et al\.\(2023\)Lag\-llama: towards foundation models for probabilistic time series forecasting\.arXiv preprint arXiv:2310\.08278\.Cited by:[§2](https://arxiv.org/html/2606.03097#S2.p2.1)\.
- R\. W\. Reynolds, T\. M\. Smith, C\. Liu, D\. B\. Chelton, K\. S\. Casey, and M\. G\. Schlax \(2007\)Daily high\-resolution\-blended analyses for sea surface temperature\.Journal of Climate20,pp\. 5473–5496\.External Links:[Document](https://dx.doi.org/10.1175/JCLI-D-14-00293.1)Cited by:[§4\.1](https://arxiv.org/html/2606.03097#S4.SS1.p1.1)\.
- R\. P\. Schumaker and H\. Chen \(2009\)Textual analysis of stock market prediction using breaking financial news: the azfin text system\.ACM Transactions on Information Systems \(TOIS\)27\(2\),pp\. 1–19\.Cited by:[§2](https://arxiv.org/html/2606.03097#S2.p1.1)\.
- R\. P\. Schumaker and H\. Chen \(2010\)A discrete stock price prediction engine based on financial news\.Computer43\(1\),pp\. 51–56\.Cited by:[§2](https://arxiv.org/html/2606.03097#S2.p1.1)\.
- J\. F\. Torres, D\. Hadjout, A\. Sebaa, F\. Martínez\-Álvarez, and A\. Troncoso \(2021\)Deep learning for time series forecasting: a survey\.Big data9\(1\),pp\. 3–21\.Cited by:[§2](https://arxiv.org/html/2606.03097#S2.p1.1)\.
- L\. von Werra, Y\. Belkada, L\. Tunstall, E\. Beeching, T\. Thrush, N\. Lambert, S\. Huang, K\. Rasul, and Q\. Gallouédec \(2020\)TRL: Transformers Reinforcement LearningExternal Links:[Link](https://github.com/huggingface/trl)Cited by:[§4\.2\.1](https://arxiv.org/html/2606.03097#S4.SS2.SSS1.p3.3)\.
- X\. Wang, M\. Feng, J\. Qiu, J\. Gu, and J\. Zhao \(2024\)From news to forecast: integrating event analysis in llm\-based time series forecasting with reflection\.Advances in Neural Information Processing Systems37,pp\. 58118–58153\.Cited by:[§A\.1](https://arxiv.org/html/2606.03097#A1.SS1.p4.1),[§1](https://arxiv.org/html/2606.03097#S1.p2.1),[§2](https://arxiv.org/html/2606.03097#S2.p2.1),[§3\.2](https://arxiv.org/html/2606.03097#S3.SS2.p1.4),[§3\.3](https://arxiv.org/html/2606.03097#S3.SS3.p1.1),[§4\.1](https://arxiv.org/html/2606.03097#S4.SS1.p1.1),[§4\.1](https://arxiv.org/html/2606.03097#S4.SS1.p2.1)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, F\. Xia, E\. Chi, Q\. V\. Le, D\. Zhou,et al\.\(2022\)Chain\-of\-thought prompting elicits reasoning in large language models\.Advances in neural information processing systems35,pp\. 24824–24837\.Cited by:[§3\.3](https://arxiv.org/html/2606.03097#S3.SS3.p1.1)\.
- H\. Wu, T\. Hu, Y\. Liu, H\. Zhou, J\. Wang, and M\. Long \(2022\)Timesnet: temporal 2d\-variation modeling for general time series analysis\.arXiv preprint arXiv:2210\.02186\.Cited by:[§2](https://arxiv.org/html/2606.03097#S2.p1.1)\.
- H\. Wu, J\. Xu, J\. Wang, and M\. Long \(2021\)Autoformer: decomposition transformers with auto\-correlation for long\-term series forecasting\.Advances in neural information processing systems34,pp\. 22419–22430\.Cited by:[§2](https://arxiv.org/html/2606.03097#S2.p1.1)\.
- M\. Xia, H\. Shao, X\. Ma, and C\. W\. De Silva \(2021\)A stacked gru\-rnn\-based approach for predicting renewable energy and electricity load for smart grid operation\.IEEE Transactions on Industrial Informatics17\(10\),pp\. 7050–7059\.Cited by:[§2](https://arxiv.org/html/2606.03097#S2.p1.1)\.
- F\. Xu, W\. Shi, and E\. Choi \(2024\)RECOMP: improving retrieval\-augmented lms with context compression and selective augmentation\.InICLR,Cited by:[§2](https://arxiv.org/html/2606.03097#S2.p3.1)\.
- C\. M\. Yeh, X\. Dai, H\. Chen, Y\. Zheng, Y\. Fan, A\. Der, V\. Lai, Z\. Zhuang, J\. Wang, L\. Wang,et al\.\(2023\)Toward a foundation model for time series data\.InProceedings of the 32nd ACM International Conference on Information and Knowledge Management,pp\. 4400–4404\.Cited by:[§2](https://arxiv.org/html/2606.03097#S2.p1.1)\.
- A\. Zeng, M\. Chen, L\. Zhang, and Q\. Xu \(2023\)Are transformers effective for time series forecasting?\.InProceedings of the AAAI conference on artificial intelligence,Vol\.37,pp\. 11121–11128\.Cited by:[§A\.1](https://arxiv.org/html/2606.03097#A1.SS1.p4.1),[§4\.2](https://arxiv.org/html/2606.03097#S4.SS2.p1.1)\.
- T\. Zeng, S\. Zhang, S\. Wu, C\. Classen, D\. Chae, E\. Ewer, M\. Lee, H\. Kim, W\. Kang, J\. Kunde,et al\.\(2025\)Versaprm: multi\-domain process reward model via synthetic reasoning data\.arXiv preprint arXiv:2502\.06737\.Cited by:[§2](https://arxiv.org/html/2606.03097#S2.p3.1)\.
- L\. Zhang, J\. Gao, X\. Ren, and Z\. Cao \(2025\)The bidirectional process reward model\.arXiv preprint arXiv:2508\.01682\.Cited by:[§2](https://arxiv.org/html/2606.03097#S2.p3.1)\.
- C\. Zhenget al\.\(2025\)A survey of process reward models: from outcome signals to process supervisions for large language models\.arXiv preprint arXiv:2510\.08049\.Cited by:[§2](https://arxiv.org/html/2606.03097#S2.p3.1)\.
- H\. Zhou, S\. Zhang, J\. Peng, S\. Zhang, J\. Li, H\. Xiong, and W\. Zhang \(2021a\)Informer: beyond efficient transformer for long sequence time\-series forecasting\.InThe Thirty\-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Virtual Conference,Vol\.35,pp\. 11106–11115\.Cited by:[§4\.1](https://arxiv.org/html/2606.03097#S4.SS1.p1.1)\.
- H\. Zhou, S\. Zhang, J\. Peng, S\. Zhang, J\. Li, H\. Xiong, and W\. Zhang \(2021b\)Informer: beyond efficient transformer for long sequence time\-series forecasting\.InProceedings of the AAAI conference on artificial intelligence,Vol\.35,pp\. 11106–11115\.Cited by:[§2](https://arxiv.org/html/2606.03097#S2.p1.1)\.
- T\. Zhou, Z\. Ma, Q\. Wen, X\. Wang, L\. Sun, and R\. Jin \(2022\)Fedformer: frequency enhanced decomposed transformer for long\-term series forecasting\.InInternational conference on machine learning,pp\. 27268–27286\.Cited by:[§2](https://arxiv.org/html/2606.03097#S2.p1.1)\.
## Appendix AAppendix
### A\.1Experimental Setting / Details
Analysis of Long News
Table [4](https://arxiv.org/html/2606.03097#A1.T4) reports the volume of news text associated with each forecasting task across the four domains. On average, each task is paired with 27–38 relevant articles, and the aggregate character count ranges from approximately 100k to 144k characters (roughly 101k–144k tokens). In the worst case, a single task can be associated with up to 601k characters for Electricity and 571k characters for Traffic. These figures far exceed the effective context windows of most contemporary LLMs (e.g., 8k–128k tokens), empirically confirming the stated motivation: naïvely concatenating all retrieved articles into the prompt is infeasible for a substantial fraction of tasks. The problem is especially acute for Electricity and Traffic, where the maximum total length approaches or exceeds 500k tokens. This underscores the necessity of our importance-aware compression module, which selectively preserves forecasting-relevant content while respecting a fixed context budget. It also highlights why the process reward model (PRM) for retrieval supervision is beneficial: when dozens of candidate articles compete for limited context space, supplementary news should be sequenced and selected intelligently rather than retrieved in a single unguided pass.
Table 4: Details of News Length

Baselines.
We benchmark our method against a broad range of established baselines spanning different forecasting paradigms, including statistical models, Transformer-based architectures, and approaches leveraging large language models (LLMs). Specifically, the baselines include N-BEATS (Oreshkin et al., [2019](https://arxiv.org/html/2606.03097#bib.bib71)), PatchTST (Nie et al., [2022](https://arxiv.org/html/2606.03097#bib.bib32)), DLinear (Zeng et al., [2023](https://arxiv.org/html/2606.03097#bib.bib72)), LightGBM (Ke et al., [2017](https://arxiv.org/html/2606.03097#bib.bib73)), Chronos (Ansari et al., [2024](https://arxiv.org/html/2606.03097#bib.bib74)), iTransformer (Liu et al., [2023](https://arxiv.org/html/2606.03097#bib.bib75)), Time-LLM (Jin et al., [2023a](https://arxiv.org/html/2606.03097#bib.bib53)), and LoRA-based LLM fine-tuning (Wang et al., [2024](https://arxiv.org/html/2606.03097#bib.bib11)). For Chronos (Ansari et al., [2024](https://arxiv.org/html/2606.03097#bib.bib74)), we adopt the "amazon/chronos-bolt-mini" variant. The implementations of iTransformer (Liu et al., [2023](https://arxiv.org/html/2606.03097#bib.bib75)), Time-LLM (Jin et al., [2023a](https://arxiv.org/html/2606.03097#bib.bib53)), and DLinear (Zeng et al., [2023](https://arxiv.org/html/2606.03097#bib.bib72)) are based on the neuralforecast library (https://github.com/Nixtla/neuralforecast). All baselines are configured according to the architectural specifications and implementation protocols described in their respective publications or official codebases, and hyperparameters are tuned following the authors' recommended guidelines to ensure fair and competitive performance.
Reward Model Fine\-tuning Details
Both the Outcome Reward Model (ORM) and the Process Reward Model (PRM) are fine-tuned from Qwen3-8B using the Bradley–Terry pairwise ranking loss $\mathcal{L}=-\log\sigma(r_{\text{chosen}}-r_{\text{rejected}})$, implemented via the TRL `RewardTrainer` framework. We apply Low-Rank Adaptation (LoRA) to reduce the number of trainable parameters while preserving the pretrained representations. The LoRA configuration is shared across both models: rank $r=64$, scaling factor $\alpha=128$, and dropout rate $0.07$. Adapters are inserted into all attention projections ($W_Q$, $W_K$, $W_V$, $W_O$) and feed-forward MLP layers (gate, up, and down projections), resulting in approximately 1.2% trainable parameters relative to the full model.
Both models are trained in bfloat16 mixed precision with the AdamW optimizer at a peak learning rate of $1.5\times 10^{-5}$. The ORM is trained for 1,068 optimizer steps on approximately 15k preference pairs constructed from pairwise news importance rankings, while the PRM is trained for 800 steps on pairs derived from offline reflection logs with a composite outcome score combining RMSE improvement and directional accuracy. The maximum input sequence length for the PRM is set to 2,048 tokens. For both models, we evaluate every 50 training steps and select the checkpoint with the lowest validation loss as the final model. All training is conducted on four NVIDIA RTX 5880 Ada Generation GPUs with 48GB memory each.
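For reference, the Bradley–Terry pairwise loss used in fine-tuning, $-\log\sigma(r_{\text{chosen}}-r_{\text{rejected}})$, can be computed for a single pair as in the sketch below. It is written in plain Python for illustration and is not the TRL implementation; the function name and the numerically stable formulation are choices made here.

```python
import math


def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)) for one preference pair.

    Uses the identity -log(sigmoid(d)) = log(1 + exp(-d)) and branches on the
    sign of d so that exp() never receives a large positive argument.
    """
    d = r_chosen - r_rejected
    if d >= 0:
        return math.log1p(math.exp(-d))
    return -d + math.log1p(math.exp(d))


tie = bradley_terry_loss(0.0, 0.0)       # equal rewards: log(2)
correct = bradley_terry_loss(2.0, 0.0)   # chosen ranked higher: small loss
wrong = bradley_terry_loss(0.0, 2.0)     # chosen ranked lower: large loss
```

The loss shrinks toward zero as the chosen input's reward exceeds the rejected one's by a wider margin, which is what pushes the model to rank the chosen text higher.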
Evaluation\.
To evaluate the performance of the proposed model, we adopt a commonly used error metric: Root Mean Squared Error \(RMSE\)\. RMSE computes the square root of the average squared errors, assigning a higher penalty to large deviations between predictions and true values:
$$\mathrm{RMSE}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_{i}-\hat{y}_{i}\right)^{2}}$$
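The metric can be sketched in a few lines of plain Python. The function name `rmse` and the toy inputs are illustrative only; the second example shows how a single large deviation is penalized more heavily than several small ones with the same mean absolute error.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - b) ** 2 for a, b in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Both error patterns have a mean absolute error of 1,
# but concentrating the error in one step doubles the RMSE.
spread_out = rmse([0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0])
concentrated = rmse([0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 4.0])
```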
### A\.2Example Inputs for Fine\-tuning
We provide representative examples of the input text fed to the reward models during training. Each training sample consists of a *chosen* and a *rejected* text; the model learns to assign a higher scalar reward to the chosen input.
##### ORM Input Example\.
The ORM receives a news article paired with its ranking context\. A training pair consists of a more important \(chosen\) and a less important \(rejected\) news item:
> \[Chosen\]: “Fed signals pause in rate hikes amid cooling inflation data. The Federal Reserve held interest rates steady at 5.25%–5.50% and indicated that future decisions will be data-dependent, citing moderating CPI and stable employment figures.”
>
> \[Rejected\]: “Local bakery chain opens third location downtown. The family-owned business plans to hire 12 additional staff members for the new storefront.”
##### PRM Input Example\.
The PRM evaluates intermediate reflection steps during the iterative news selection process. Each input encodes the current selection state ($\tau_{<t}$) and a candidate action:
> You are scoring one reflection suggestion for news selection.
>
> \#\# Historical Context
>
> The target series shows a rising trend over the past 30 days with increased volatility in the recent week. Current RMSE of baseline forecast: 0.0342.
>
> \#\# Already Selected News ($\tau_{<t}$)
>
> These news already exist in the current strategy. Evaluate candidate c only by its incremental value beyond this history. Do not reward repeated information.
>
> 1. Fed holds rates steady, signals data-dependent path forward.
> 2. US CPI drops to 3.1% year-over-year, below consensus estimate.
>
> \#\# Candidate Suggestion
>
> - Subset name: macro\_employment
> - Subset size: 3
> - Reflection constraints:
>   - Must provide new directional signal not covered by existing selections.
> - Suggested added news:
>   1. Non-farm payrolls miss expectations at 150K vs 180K forecast.
>   2. Unemployment rate ticks up to 3.9%.
>   3. Average hourly earnings growth slows to 0.2% month-over-month.
>
> \#\# Agent Forecast Under This Suggestion
>
> Predicted values: 1.042, 1.038, 1.035, 1.031, ...
>
> Task: Score candidate c by incremental value over $\tau_{<t}$ (higher score = better additional information and less redundancy).
## Appendix BFull Prompt Design \(Take Bitcoin as an example\)
### B\.1ORM Data Generation
Prompt Template
```
<Historical Price Data>

----- Task Description -----
You are a financial forecasting assistant specialized in cryptocurrency markets.
Your task is to directly predict the next 15-period Bitcoin prices.
Based on the given historical data and any provided context, forecast the price for each of the next 15 periods.

Notation rules:
1. Predict 15 real-valued Bitcoin prices for the next 15 periods.
2. Output ONLY the predicted numeric values.
3. Output must be a single line string.
4. The 15 numbers must be separated by commas.
5. Do NOT provide any explanation, reasoning, labels, or additional text.
6. Do NOT add spaces before or after commas.

Output format example:
58234.12,58401.55,58610.23,58790.11,58920.44,
59005.67,59120.89,59340.12,59450.33,59600.78,
59742.15,59880.50,60010.22,60145.88,60280.45

Additional market context:
1. <News Segment 1>
2. <News Segment 2>
...
N. <News Segment N>

Now predict the next 15 Bitcoin prices:
```
In the ablation study, we evaluate three configurations:
- Baseline: Only historical price data and task description, without any news context (the “Additional market context” section is omitted).
- All News: All available news segments are included as additional market context.
- Leave-One-Out: Each news segment is excluded in turn while retaining all others, to assess the individual contribution of each news segment to forecasting accuracy.
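The three configurations above can be generated mechanically from a list of news segments. The sketch below is illustrative: the helper names `ablation_configs` and `leave_one_out_importance` are hypothetical, and reading the RMSE increase caused by removing a segment as that segment's importance is an assumption, since the text only states that leave-one-out assesses individual contributions.

```python
from typing import Dict, List, Sequence


def ablation_configs(news: Sequence[str]) -> Dict[str, List[str]]:
    """Build the news context for the baseline, all-news, and leave-one-out runs."""
    configs: Dict[str, List[str]] = {
        "baseline": [],            # no "Additional market context" section
        "all_news": list(news),    # every segment included
    }
    for i in range(len(news)):
        # Drop segment i and keep all the others in their original order.
        configs[f"leave_out_{i}"] = [n for j, n in enumerate(news) if j != i]
    return configs


def leave_one_out_importance(rmse_all: float,
                             rmse_without: Sequence[float]) -> List[float]:
    """Assumed importance proxy: how much RMSE rises when a segment is removed."""
    return [r - rmse_all for r in rmse_without]


configs = ablation_configs(["seg A", "seg B", "seg C"])
importance = leave_one_out_importance(2.0, [3.0, 2.0, 1.5])
```

Under this proxy, a positive value means the forecast degraded without the segment, and a negative value means the segment was hurting the forecast.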
### B\.2PRM Training Data Generation
To construct training data for the Process Reward Model (PRM), we perform exhaustive subset enumeration over the news segments associated with each data sample. Specifically, for a sample with $N$ news segments, we enumerate all $2^{N}$ subsets of the news set (including the empty set). For each subset $S\subseteq\{1,2,\ldots,N\}$, we:
1. Construct a prompt by including only the news segments indexed by $S$ in the “Additional market context” section of the prompt template.
2. Query the LLM to obtain 15-step price predictions.
3. Compute the RMSE between the predicted and ground-truth prices.
4. Assign a reward signal $r=-\text{RMSE}$ to the subset.

This procedure yields a labeled dataset of $(\text{subset},\text{reward})$ pairs for each sample, which is used to train the PRM to evaluate the quality of different news subset selections. Each record in the training data contains the subset composition, the corresponding prompt, the model's predictions, and the computed reward.
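The enumeration procedure above can be sketched as follows. This is a minimal illustration under stated assumptions: `forecast` is a hypothetical callable standing in for the LLM query (prompt construction is folded into it), and the record layout is simplified. The loop makes $2^{N}$ forecaster calls, so it is only practical for small $N$.

```python
import math
from itertools import combinations
from typing import Callable, Dict, List, Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def enumerate_subset_rewards(
    news: Sequence[str],
    y_true: Sequence[float],
    forecast: Callable[[List[str]], List[float]],  # hypothetical LLM wrapper
) -> List[Dict]:
    """Score every subset of `news` (including the empty set) with r = -RMSE."""
    records = []
    n = len(news)
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            selected = [news[i] for i in subset]
            preds = forecast(selected)
            records.append({
                "subset": subset,
                "predictions": preds,
                "reward": -rmse(y_true, preds),
            })
    return records


# Toy stand-in forecaster: each added segment shifts the forecast by 0.1.
toy_forecast = lambda selected: [1.0 + 0.1 * len(selected)] * 2
records = enumerate_subset_rewards(["a", "b", "c"], [1.0, 1.0], toy_forecast)
best = max(records, key=lambda rec: rec["reward"])
```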
### B\.3Prediction Prompt
The following prompt is used to query DeepSeek\-V3\.2 for Bitcoin price forecasting\. The prompt consists of historical price data, statistical summaries, task instructions, consistency constraints, and compressed news context\.
Prediction Prompt Template

> {historical\_data}
>
> Historical data statistics:
> - Max: {hist\_max}
> - Min: {hist\_min}
> - Mean: {hist\_mean}
> - Std: {hist\_std}
> - Range: {hist\_range}
>
> ----- Task Description -----
>
> You are a forecasting assistant specialized in Bitcoin price prediction. Your task is to directly predict the next 15 daily Bitcoin price values (in USD). Based on the given historical data and any provided news context, forecast the price for each of the next 15 days.
>
> Notation rules:
> 1. Predict 15 real-valued price values for the next 15 days.
> 2. Output ONLY the predicted numeric values.
> 3. Output must be a single line string.
> 4. The 15 numbers must be separated by commas.
> 5. Do NOT provide any explanation, reasoning, labels, or additional text.
> 6. Do NOT add spaces before or after commas.
>
> Output format example:
> 11362.0,11192.3,11245.4,11580.2,12858.9,11594.9,10864.8,
> 10903.2,11452.3,11203.0,11118.1,11467.5,11795.1,11244.8,10166.0
>
> IMPORTANT: Your predictions should remain consistent with the historical data pattern. Unless the news explicitly indicates a major disruptive event (e.g., major regulation changes, exchange hacks, institutional adoption/ban, macroeconomic crisis), your predictions should stay within or close to the historical range \[{hist\_min}, {hist\_max}\] with mean around {hist\_mean}. Do NOT introduce large deviations without strong justification from the news.
>
> Additional context from recent news:
> 1. {compressed\_news\_1}
> 2. {compressed\_news\_2}
>
> ⋮
>
> 5. {compressed\_news\_5}
>
> Now predict the next 15 Bitcoin price values:
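The statistical placeholders in the template and the comma-separated output format lend themselves to two small helpers, sketched below. The function names are hypothetical, and the use of the population standard deviation is an assumption, since the template does not say which estimator fills `hist_std`.

```python
import statistics
from typing import Dict, List, Sequence


def history_stats(history: Sequence[float]) -> Dict[str, float]:
    """Compute the values for the hist_* placeholders of the prompt template."""
    hist_max, hist_min = max(history), min(history)
    return {
        "hist_max": hist_max,
        "hist_min": hist_min,
        "hist_mean": statistics.fmean(history),
        "hist_std": statistics.pstdev(history),  # assumption: population std
        "hist_range": hist_max - hist_min,
    }


def parse_forecast(text: str, horizon: int = 15) -> List[float]:
    """Parse a comma-separated reply and check that it has `horizon` values."""
    # float() ignores surrounding whitespace, so wrapped lines are tolerated;
    # any non-numeric token raises ValueError.
    values = [float(part) for part in text.strip().split(",")]
    if len(values) != horizon:
        raise ValueError(f"expected {horizon} values, got {len(values)}")
    return values


stats = history_stats([1.0, 2.0, 3.0])
parsed = parse_forecast("1.0,2.5,3", horizon=3)
```

Rejecting replies with the wrong number of values is one simple way to enforce the notation rules before an RMSE is computed.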
### B\.4News Compression Prompt
The following prompt is used to compress news articles before feeding them into the prediction model\. The target length is controlled by the allocated token budget\.
News Compression Prompt Template

> Please compress the following news article into a concise summary that retains the most important information relevant to Bitcoin price forecasting. Focus on facts that could affect Bitcoin price (regulation, adoption, market sentiment, institutional investment, macroeconomic events, etc.). Be as concise as possible.
>
> News article: {news\_text}
>
> Compressed summary:
### B\.5Reward Model Prompt
The following prompt is used to score news relevance with the fine\-tuned reward model \(Qwen3\-8B with LoRA adapter\)\.
Reward Model Prompt Template

> System: You are an expert cryptocurrency analyst evaluating the relevance of news segments for predicting Bitcoin price movements.
>
> User: Historical Bitcoin price data (15 points, daily intervals, covering 15 days): {price\_history}
>
> News segment: {news\_text}
>
> How important is this news for forecasting the next 15 days of Bitcoin price? Rate it.
### B\.6Process Reward Model \(PRM\) Prompt
The following prompt template is used by the Process Reward Model to score candidate news articles at each greedy selection step\. The PRM evaluates the incremental value of each candidate given the already\-selected news and the current forecast performance \(RMSE\)\.
PRM Scoring Prompt Template

> You are scoring one reflection suggestion for news selection.
>
> \#\# Historical Context
>
> {instruction}
>
> \#\# Already Selected News (tau<t)
>
> These news already exist in the current strategy. Evaluate candidate c only by its incremental value beyond this history. Do not reward repeated information.
>
> 1. {selected\_news\_1}
> 2. {selected\_news\_2}
>
> ⋮
>
> \#\# Candidate Suggestion
>
> - Subset size: 1
> - Suggested added news:
>   1. {candidate\_news}
>
> \#\# Agent Forecast Under This Suggestion
>
> Current forecast RMSE: {last\_rmse}
>
> Task: Score candidate c by incremental value over tau<t (higher score = better additional information and less redundancy).

At each greedy step $t$, the PRM scores all remaining candidate news articles $c\in\mathcal{C}\setminus\mathcal{S}_{t-1}$ given the selected history $\mathcal{S}_{t-1}$ and the RMSE from the previous prediction round. The candidate with the highest score is selected:

$$c_{t}^{*}=\arg\max_{c\in\mathcal{C}\setminus\mathcal{S}_{t-1}}\text{PRM}(c\mid\mathcal{S}_{t-1},\,\text{RMSE}_{t-1})\tag{2}$$

The process terminates when the normalized RMSE falls below a predefined threshold $\epsilon$ (set to 15% in our experiments).
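The greedy rule in Eq. (2) and its stopping condition can be sketched as a short loop. Everything here is a hedged illustration: `score_fn` stands in for the trained PRM, `forecast_rmse_fn` for a forecaster call followed by RMSE evaluation against ground truth, and normalizing the RMSE by a reference mean `y_mean` is an assumption, since this section does not define the normalization.

```python
from typing import Callable, List, Sequence, Tuple


def prm_greedy_select(
    candidates: Sequence[str],
    score_fn: Callable[[str, List[str], float], float],   # hypothetical PRM scorer
    forecast_rmse_fn: Callable[[List[str]], float],       # hypothetical forecast + RMSE
    y_mean: float,                                        # assumed normalizer
    epsilon: float = 0.15,
) -> Tuple[List[str], List[float]]:
    """Add the highest-scoring remaining article until normalized RMSE < epsilon."""
    selected: List[str] = []
    remaining = list(candidates)
    last_rmse = forecast_rmse_fn(selected)   # step 0: no-news forecast
    trace = [last_rmse]
    while remaining and last_rmse / y_mean >= epsilon:
        # Eq. (2): argmax over the remaining pool, conditioned on history and last RMSE.
        best = max(remaining, key=lambda c: score_fn(c, selected, last_rmse))
        remaining.remove(best)
        selected.append(best)
        last_rmse = forecast_rmse_fn(selected)
        trace.append(last_rmse)
    return selected, trace


# Toy stand-ins: article "y" is the only one that fixes the forecast.
toy_scores = {"x": 0.1, "y": 0.9, "z": 0.5}
toy_score_fn = lambda c, history, last_rmse: toy_scores[c]
toy_rmse_fn = lambda sel: 10.0 if not sel else (2.0 if "y" in sel else 9.0)
selected, trace = prm_greedy_select(["x", "y", "z"], toy_score_fn, toy_rmse_fn,
                                    y_mean=100.0, epsilon=0.05)
```

In the toy run the highest-scoring article is picked first and the loop stops after a single step, mirroring the behavior the PRM is meant to encourage.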
## Appendix CSupplementary Case Studies and Distribution Analysis
This section provides (i) a step-by-step case study contrasting PRM-guided greedy news selection with a naive LLM-based greedy baseline (Section [C.1](https://arxiv.org/html/2606.03097#A3.SS1), Table [5](https://arxiv.org/html/2606.03097#A3.T5) and Figure [2](https://arxiv.org/html/2606.03097#A3.F2)) and (ii) a compression case study where PRM-proportional summarization improves forecast RMSE over uniform budgeting (Section [C.2](https://arxiv.org/html/2606.03097#A3.SS2), Tables [8](https://arxiv.org/html/2606.03097#A3.T8)–[10](https://arxiv.org/html/2606.03097#A3.T10)).
### C\.1News Selection: PRM vs\. Naive Baseline
##### Setup\.
We use the exchange-rate forecasting task (AUD/USD, 7-day horizon) with task ID 25 and forecast window 5/31/2019–6/6/2019. Both methods share the same pool of $N=10$ filtered news articles. Convergence is defined as forecast RMSE falling below $\tau=0.01\times\bar{y}\approx 6.96$, where $\bar{y}$ is the mean ground-truth exchange rate ($696.28$). The naive baseline uses the *same* forecaster LLM (DeepSeek-V3.2) to greedily pick one article per step from the remaining pool, given historical data, already selected news, and the current RMSE. PRM greedy instead scores all remaining candidates with the trained PRM (conditioned on selected history and last\_rmse), picks the highest-scoring article, and performs one forecast call. At step 0, both methods share the *same* no-news forecast and thus identical baseline RMSE (Table [5](https://arxiv.org/html/2606.03097#A3.T5), Figure [2](https://arxiv.org/html/2606.03097#A3.F2)).
##### Summary\.
The failure of the naive LLM selector occurs at Step 1: it chooses a low-utility geopolitical headline (PRM rank 9/10) before domestic macro/housing signals, requiring five articles to converge, whereas PRM selects the top-ranked article first and converges in one step. Table [5](https://arxiv.org/html/2606.03097#A3.T5) summarizes the gap; Figure [2](https://arxiv.org/html/2606.03097#A3.F2) plots the RMSE traces.

Table 5: Exchange task 25: convergence comparison (PRM vs. naive).

Figure 2: RMSE trace on exchange task 25. Both methods share the same step-0 (no-news) RMSE. PRM converges after one article; the naive LLM selector requires five steps.
##### Step 0 \(no news\)\.
As shown in Figure [2](https://arxiv.org/html/2606.03097#A3.F2), both methods start from the same no-news forecast with RMSE $10.0$ at step 0.
##### Step 1 \(root cause\)\.
Table [6](https://arxiv.org/html/2606.03097#A3.T6) ranks all ten candidates by PRM score at the first selection step. Naive LLM selects index 8 first:
> China threatens to strangle supply of important rare earth minerals to the US\.
- RMSE after step 1: $10.19$ (threshold $6.96$), not converged.
- PRM rank: 9th of 10 in Table [6](https://arxiv.org/html/2606.03097#A3.T6) (score $1.0234$).
- Interpretation: the headline is salient geopolitical content with only a weak short-horizon link to AUD/USD, in contrast to the domestic RBA, housing, and protest shocks preferred by the PRM.

PRM adds index 0 first (highest score $1.2969$):
> Thousands of climate protesters attend Melbourne ‘die\-in’\.
- RMSE after step 1: $2.14$, converged (Figure [2](https://arxiv.org/html/2606.03097#A3.F2)).

Table 6: PRM scores for all candidates at step 1 (exchange task 25).

Table 7: Naive LLM-greedy path: full RMSE trace until convergence (exchange task 25).

Table [7](https://arxiv.org/html/2606.03097#A3.T7) lists the full naive path until convergence at step 5.
##### Takeaways\.
1. Both methods see the *same* candidate pool; the gap is selection order, not pool construction (Tables [5](https://arxiv.org/html/2606.03097#A3.T5)–[7](https://arxiv.org/html/2606.03097#A3.T7)).
2. Without PRM, LLM-only greedy selection can front-load articles ranked 9/10 in Table [6](https://arxiv.org/html/2606.03097#A3.T6), delaying convergence from 1 to 5 steps (Figure [2](https://arxiv.org/html/2606.03097#A3.F2)).
3. PRM incremental scoring with last\_rmse feedback aligns news choice with forecast utility rather than headline salience alone.
### C\.2Long\-Text Compression: Uniform vs\. Reward\-Proportional Summaries
##### Setup \(exchange\-rate forecasting, task ID 95, 6/8–6/14/2019\)\.
Five news articles are selected per task (AUD/USD, 7-day horizon) during the RBA's June 2019 rate-cut week. Naive compression assigns each article the same relative token budget (20% of original length, capped at 400 tokens). PRM-proportional compression allocates a fixed 1000-token budget across articles in proportion to reward-model scores (minimum 50, maximum 400 tokens per article). Both use the same summarization prompt; only the per-article token cap differs.
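The two budgeting rules can be contrasted with a short sketch. The function names are hypothetical, and several details are assumptions not stated in the setup: scores are normalized by their sum after clamping negatives to zero, and shares are simply clipped to the [50, 400] range without redistributing any leftover tokens, so the clipped budgets need not sum exactly to 1000.

```python
from typing import List, Sequence


def uniform_budgets(lengths: Sequence[int], ratio: float = 0.2, cap: int = 400) -> List[int]:
    """Naive rule: a fixed fraction of each article's length, capped per article."""
    return [min(int(ratio * n), cap) for n in lengths]


def proportional_budgets(scores: Sequence[float], total: int = 1000,
                         lo: int = 50, hi: int = 400) -> List[int]:
    """Reward-proportional rule: split `total` by score share, then clip to [lo, hi]."""
    weights = [max(s, 0.0) for s in scores]   # assumption: negative scores get no share
    norm = sum(weights)
    if norm == 0:
        shares = [total / len(scores)] * len(scores)
    else:
        shares = [total * w / norm for w in weights]
    # Assumption: plain clipping; leftover tokens are not redistributed.
    return [int(round(min(max(s, lo), hi))) for s in shares]


# A long article hits the 400-token cap under the naive rule regardless of its utility.
naive = uniform_budgets([2535, 1000])
# Illustrative scores only; the top article is capped and the lowest is floored.
weighted = proportional_budgets([10, 6, 3, 1, 0])
```

The contrast is that the naive rule ties budget to article length, while the proportional rule ties it to estimated forecasting utility.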
##### Forecast outcome\.
PRM-proportional compression reduces forecast RMSE by roughly half on this task (Table [8](https://arxiv.org/html/2606.03097#A3.T8)): $8.89$ (naive) vs. $4.45$ (PRM). Both methods produce plausible 7-day levels near $\sim$690–698; the gain comes from rebalancing tokens away from a redundant long post-cut headline toward higher-reward pre-cut policy signals (Examples A–B).

Table 8: Exchange task 95: forecast RMSE under naive vs. PRM-proportional compression.
#### C\.2\.1High\-reward article \(pre\-cut policy signal\)
Table [9](https://arxiv.org/html/2606.03097#A3.T9) contrasts budgets for the highest-scoring headline; summaries follow.

Table 9: Compression budgets: Treasurer urges banks to pass through rate cuts (exchange task 95).

Naive uniform summary:
> The RBA is widely expected to cut its cash rate from 1\.5% to 1\.25% on Tuesday to stimulate a weak economy\. Supporting this move, recent data shows soft growth, with Q1 GDP forecast at just 0\.4%\. This dovish monetary policy stance and weak economic fundamentals are negative for the AUD\.
PRM\-proportional summary:
> RBA expected to cut cash rate to 1\.25% to stimulate weak economy\. Weak Q1 GDP growth forecast \(0\.4%\) and low inflation justify the dovish move\. This monetary easing is likely to put downward pressure on the AUD\.
Why this matters: PRM *doubles* the token budget on the top-ranked article (Table [9](https://arxiv.org/html/2606.03097#A3.T9)), emphasizing the *pre-cut* policy stance (Treasurer pressuring banks) rather than treating it like any other 20% slice.
#### C\.2\.2Low\-reward article \(redundant long RBA headline\)
The post-announcement “RBA slashes rate” piece is long ($\sim$2535 tokens) but scores near zero for incremental forecast utility; Table [10](https://arxiv.org/html/2606.03097#A3.T10) shows the budget shift.

Table 10: Compression budgets: “RBA slashes cash rate to all-time low” (exchange task 95).

Naive uniform summary:
> RBA cuts cash rate 25bps to record low 1.25% to support employment and inflation. AUD/USD rose slightly to $\sim 0.6976$, as markets had fully priced in the cut and some expected a larger move. Economists forecast further rate cuts ahead, with weak domestic data (low inflation, rising unemployment) and global trade tensions cited as key reasons. This dovish outlook is typically negative for the AUD.
PRM\-proportional summary:
> RBA cuts cash rate 25bps to record low 1\.25% to support employment and inflation\. Further cuts are anticipated, with markets expecting a total reduction to 0\.5% by 2020\. Despite the cut, AUD/USD rose slightly to 0\.6976, indicating the move was fully priced in and some had expected a larger cut\.
Budget asymmetry: naive spends the 400-token maximum on this repetitive post-cut story; PRM allocates only 202 tokens (8% of original length), freeing budget for sharper pre-cut macro cues in Example A.
##### Takeaways\.
1. Naive uniform compression over-funds long, low-incremental-utility headlines at the same 20% rule as core macro drivers (Table [10](https://arxiv.org/html/2606.03097#A3.T10)).
2. PRM-proportional budgeting concentrates tokens on high-reward news (Table [9](https://arxiv.org/html/2606.03097#A3.T9)) and trims redundant rate-cut coverage.
3. On exchange task 95, this rebalancing lowers RMSE from $8.89$ to $4.45$, a clear improvement that illustrates PRM-guided compression without extreme forecast failure under the naive baseline.