RAVEN: A Regime-Aware Variable-context Expert Network for Financial Time Series Forecasting
Summary
The paper proposes RAVEN, a Mixture-of-Experts framework that adaptively determines temporal context windows for each input sample to handle non-stationary financial time series. It achieves state-of-the-art performance on financial and traffic benchmarks.
# A Regime-Aware Variable-context Expert Network for Financial Time Series Forecasting
Source: [https://arxiv.org/html/2606.24062](https://arxiv.org/html/2606.24062)
Cheng He1,2, Zhenyu Guan2, Xijie Liang2, Defu Lian1∗, Jiajia Li3∗, Enhong Chen1, Patrick P. C. Lee4, Geng Hu5, Zehao Chen2
∗Corresponding authors
###### Abstract
Financial time series forecasting presents structural challenges absent from standard benchmarks. Log-returns are non-stationary, exhibit exceptionally low signal-to-noise ratios (SNR), and are governed by regime-dependent temporal dependencies. We identify a key limitation of state-of-the-art (SOTA) time series models in financial settings: a fixed context window is mismatched to the time-varying optimal look-back of non-stationary price processes. We propose the Regime-Aware Variable-context Expert Network (RAVEN), a Mixture-of-Experts framework designed to adaptively determine the temporal context for each input sample. Instead of relying on a fixed look-back horizon, RAVEN constructs a hierarchy of nested contiguous windows whose lengths are determined by the data itself. Specifically, RAVEN scores patches by learned importance in reverse chronological order and applies the Cumulative Importance Thresholding (CIT) mechanism to derive nested prefix windows, each routed to a scale-specialized expert. A Global Compressed Representation (GCR) branch runs in parallel over the full context, preserving global temporal coherence that local experts cannot guarantee. Because the nested routing induces structured overlap among expert inputs, we introduce a Correlation-Aware Weighting (CAW) scheme to align variable-length expert outputs and penalize pairwise cosine similarity prior to aggregation. Experiments on cumulative log-return prediction (HS300, S&P500) and fund sales forecasting demonstrate that RAVEN achieves SOTA performance: it improves Pearson correlation by 9.2% on HS300 and 20.2% on S&P500 and reduces MSE by 18.2% on fund sales forecasting, while achieving the best results in 14 of 16 metrics on four PEMS traffic benchmarks.
## I Introduction
Financial time series forecasting is a cornerstone of quantitative investment, supporting tasks from risk management to automated trading. Unlike well-known benchmark datasets in the general time series domain, such as ETTh, ETTm, Weather, Electricity, or Traffic \[[37](https://arxiv.org/html/2606.24062#bib.bib12)\], which often exhibit clear periodic patterns and deterministic trends, financial data is noisy and non-stationary \[[23](https://arxiv.org/html/2606.24062#bib.bib1)\]. In this adversarial environment, predicting raw price levels $C_t$ at time $t$ is impractical: prices are neither scale-invariant nor cross-asset comparable, typically follow a random walk, and exhibit extremely high autocorrelation, leading to spurious regression and inflated in-sample performance. To capture meaningful predictive signals, existing approaches reformulate the problem as the regression of log-returns, $r_t=\ln(C_t/C_{t-1})$ \[[15](https://arxiv.org/html/2606.24062#bib.bib3),[13](https://arxiv.org/html/2606.24062#bib.bib2)\]. This transformation shifts the modeling objective from tracking absolute values to capturing price innovations \[[15](https://arxiv.org/html/2606.24062#bib.bib3)\], the unpredictable stochastic components driven by the arrival of new market information. While log-returns are statistically more stable than raw prices, they still exhibit exceptionally low SNR and heavy-tailed distributions, making accurate regression a challenging task \[[13](https://arxiv.org/html/2606.24062#bib.bib2)\].
Historically, financial time-series analysis has been dominated by tree-based ensembles such as XGBoost \[[6](https://arxiv.org/html/2606.24062#bib.bib9)\] and LightGBM \[[17](https://arxiv.org/html/2606.24062#bib.bib10)\], descendants of the gradient boosting framework \[[12](https://arxiv.org/html/2606.24062#bib.bib6)\]. They excel at modeling non-linear interactions over handcrafted technical features, but treat each forecast as a static tabular problem and ignore the inherent temporal topology of market regimes. Subsequent deep-learning architectures, from Multi-Layer Perceptrons through Recurrent Neural Networks \[[10](https://arxiv.org/html/2606.24062#bib.bib19)\] to Long Short-Term Memory networks \[[16](https://arxiv.org/html/2606.24062#bib.bib18)\] and Gated Recurrent Units \[[7](https://arxiv.org/html/2606.24062#bib.bib11)\], restore a notion of temporal memory through recurrence and gating, yet remain biased toward recent observations and often fail to distinguish transient market noise from long-term structural regime shifts.
Recently, Transformer-based models, from Informer \[[45](https://arxiv.org/html/2606.24062#bib.bib21)\], PatchTST \[[26](https://arxiv.org/html/2606.24062#bib.bib22)\], TimesNet \[[36](https://arxiv.org/html/2606.24062#bib.bib40)\], and iTransformer \[[22](https://arxiv.org/html/2606.24062#bib.bib23)\] to frequency-domain variants such as FredFormer \[[27](https://arxiv.org/html/2606.24062#bib.bib44)\] and WPMixer \[[25](https://arxiv.org/html/2606.24062#bib.bib45)\], have enhanced general time-series forecasting by leveraging attention to capture long-range dependencies. However, when applied to financial markets, these models inherit a structural bottleneck that is rarely examined: the reliance on a fixed-length historical context window $L$. In non-stationary financial environments, a static $L$ creates an irreconcilable conflict: a short window lacks the memory to span structural regime shifts, while a long window unavoidably mixes stale information from a prior regime into the current prediction as additive noise.
Classical econometric models already hinted at the value of adaptive multi-horizon reasoning. The Heterogeneous Autoregressive model of realized volatility (HAR-RV) \[[8](https://arxiv.org/html/2606.24062#bib.bib7)\], originally introduced for volatility forecasting, captures long-memory structure by linearly aggregating daily, weekly, and monthly rolling averages, demonstrating that multiple fixed look-back horizons carry complementary temporal information that no single-horizon model can subsume. HAR-RV and its descendants, however, commit to a designer-fixed set of horizons and to a linear functional form; the complementary horizons themselves, and the optimal way to combine them, remain hand-crafted. More recent deep-learning multi-period research, e.g., MLF \[[42](https://arxiv.org/html/2606.24062#bib.bib20)\], extends this multi-period intuition with sophisticated attention mechanisms and serves as our most competitive alternative. Nevertheless, MLF still relies on pre-defined periods and equally distributed patches. Such static designs prevent it from adaptively perceiving the optimal context in dynamic markets.
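For reference, the HAR-RV regression discussed in this paragraph is commonly written as below. The notation ($RV^{(d)}$, $RV^{(w)}$, $RV^{(m)}$, the $\beta$ coefficients, and the 5-day and 22-day averaging windows) follows the usual textbook presentation of the model and is not taken from this paper.

```latex
RV_{t+1}^{(d)} = \beta_0 + \beta_d\, RV_t^{(d)} + \beta_w\, RV_t^{(w)} + \beta_m\, RV_t^{(m)} + \varepsilon_{t+1},
\qquad
RV_t^{(w)} = \frac{1}{5}\sum_{i=0}^{4} RV_{t-i}^{(d)},
\quad
RV_t^{(m)} = \frac{1}{22}\sum_{i=0}^{21} RV_{t-i}^{(d)}.
```

The three regressors are look-back horizons fixed in advance by the modeler, and they are combined linearly; these are the two hand-crafted choices the paragraph refers to.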
Figure 1: CWT scalograms for multi-scale analysis. (a) HS300 constituent 600176.SS (daily log-returns, 2020–2024). (b) PEMS03 traffic flow (5-min). Financial data (a) exhibits non-stationary energy distribution with no fixed periodicity, while traffic data (b) shows stable, periodic patterns.
To empirically verify this claim, we apply the Continuous Wavelet Transform (CWT) as a multi-scale diagnostic:
$$W_f(a,b)=\frac{1}{\sqrt{|a|}}\int_{-\infty}^{+\infty}f(t)\,\psi^{*}\!\left(\frac{t-b}{a}\right)dt,$$
where scale $a$ is inversely proportional to frequency and $b$ denotes temporal position. Figure [1](https://arxiv.org/html/2606.24062#S1.F1) visualizes the scalograms of two representative series. For the HS300 constituent 600176.SS (Figure [1(a)](https://arxiv.org/html/2606.24062#S1.F1.sf1)), energy concentration migrates unpredictably across scales within the five-year horizon. High-frequency components dominate in 2020, shift toward lower-frequency bands from 2021 to 2022, and return to high-frequency dominance by late 2023. No stable periodic structure persists at any scale. In contrast, the PEMS03 traffic series (Figure [1(b)](https://arxiv.org/html/2606.24062#S1.F1.sf2)) exhibits time-invariant energy bands at scales ranging from 200 to 260, reflecting a fixed daily periodicity that holds uniformly across the entire observation period. This divergence reveals that the dominant temporal scale governing predictive information in financial data is itself non-stationary. Thus, a fixed context window mechanism introduces an inductive bias that is mismatched to the underlying data-generating process.
To bridge this gap, we propose RAVEN (Regime-Aware Variable-context Expert Network), a novel MoE-based framework designed for adaptive context modeling in financial time series forecasting. The core of RAVEN lies in its learnable patch weighting and selection mechanism. Unlike static methods that adopt a single fixed context length, RAVEN dynamically evaluates the importance of each historical patch. It accumulates these scores in reverse chronological order against Cumulative Importance Thresholding (CIT) thresholds and generates a nested sequence of contiguous look-back windows. Each window is routed to a dedicated expert working at its corresponding temporal scale. All windows are anchored at the most recent patch, ensuring temporal coherence of positional attention within each expert. To ensure that local specialization does not come at the cost of global coherence, we introduce a Global Compressed Representation (GCR) branch that runs in parallel over the full context. It distills a holistic global view that complements the local experts' selective, scale-specific processing. Furthermore, the nested routing topology creates structured overlap across expert inputs. To address this issue, we propose shape-aligned fusion with a Correlation-Aware Weighting (CAW) strategy. It decorrelates expert representations prior to aggregation and suppresses redundant noise, yielding reliable multi-resolution forecasts.
Our main contributions are summarized as follows:
- Dynamic Context Paradigm: We identify the critical limitations of static, fixed-length context windows in non-stationary financial environments and propose RAVEN. This framework adaptively adjusts the receptive field to time-varying market dynamics. It learns data-dependent look-back windows by accumulating patch importance in reverse order under CIT-based thresholds.
- Dual-View Architecture: We design a dynamic MoE backbone augmented with a GCR branch. The architecture balances local specialization and global context modeling. Experts with distinct scales handle variable-length patch sequences for fine-grained local perception. Meanwhile, the GCR branch captures holistic historical information to preserve global coherence.
- Redundancy Mitigation Strategy: We introduce Shape-Aligned Fusion and CAW. By dynamically compressing and decorrelating heterogeneous expert outputs, this strategy explicitly filters noise from overlapping input segments, enabling efficient utilization of MoE parameters under the low SNR of financial time series.
- Extensive Evaluation and Deployment Progress: We conduct extensive evaluations on cross-market cumulative log-return prediction. Compared with SOTA baselines, RAVEN improves the Pearson correlation by 9.2% on HS300 and 20.2% on S&P500, and reduces MSE by 18.2% in fund sales forecasting. Cross-domain tests on four PEMS traffic datasets further verify its generalization ability, achieving the best results on 14 of 16 evaluated metrics. Under realistic backtest conditions, RAVEN-driven strategies outperform our production baseline by over 10% in cumulative returns, and the system is currently advancing through final online integration.
## II RAVEN Design
Figure 2: Overview of RAVEN. The pipeline consists of three modules. Preprocess applies instance normalization, channel-independent processing, and patch partitioning to produce embedded patches $\mathbf{E}=[\mathbf{e}_1,\ldots,\mathbf{e}_N]$. Backbone operates via two parallel branches. (i) The local adaptive branch scores patch importance and accumulates scores in reverse chronological order against CIT-based thresholds, generating $K$ nested contiguous look-back windows $\{\mathcal{G}_k\}$. Each window is processed by a scale-specialized expert, and the variable-length outputs are shape-aligned via average pooling into fixed-dimensional vectors for aggregation into $\mathbf{z}_{\text{local}}$. (ii) The GCR branch captures holistic historical dependencies across the full sequence $\mathbf{E}$ via a Self-Attention layer, then distills a global context vector $\mathbf{z}_{\text{global}}$ through average pooling. Output Projection concatenates $[\mathbf{z}_{\text{local}};\mathbf{z}_{\text{global}}]$ and projects them through an MLP head to output the final $H$-period cumulative log-return $\hat{y}_t^{(H)}$. The nested routing topology introduces collinearity among experts, which is jointly suppressed by the CAW scheme and the expert diversity regularizer.
### II-A Problem Formulation
Table I: Summary of notations.
We consider the task of multi-horizon return forecasting from multivariate financial time series. Let $\mathbf{x}_t\in\mathbb{R}^{D}$ denote the $D$-dimensional market state observed at time $t$, comprising standard market variables (e.g., OHLCV, bid-ask spread) and engineered factors. From the closing price $C_t$, we derive the one-step log-return:
$$r_t=\ln(C_t/C_{t-1}),\tag{1}$$
which encodes price innovations induced by the arrival of new market information \[[15](https://arxiv.org/html/2606.24062#bib.bib3),[13](https://arxiv.org/html/2606.24062#bib.bib2)\].
Input. Given a maximum look-back length $L_{\max}$, an input instance at time $t$ is defined as
$$\mathbf{X}_t=[\mathbf{x}_{t-L_{\max}+1},\dots,\mathbf{x}_t]^{\top}\in\mathbb{R}^{L_{\max}\times D}.\tag{2}$$
Target. Given a forecast horizon $H$, the prediction target is the $H$-period cumulative log-return:
$$y_t^{(H)}=\sum_{h=1}^{H}r_{t+h}=\ln(C_{t+H}/C_t)\in\mathbb{R},\tag{3}$$
which corresponds to the realized holding-period return of a position entered at $t$ and liquidated at $t+H$.
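The telescoping identity in Equation (3) can be checked numerically. The following is a minimal sketch, not the authors' code; the function names and the price list are made up for illustration.

```python
import math

def log_returns(close):
    """One-step log-returns r_t = ln(C_t / C_{t-1}); returns len(close) - 1 values."""
    return [math.log(c1 / c0) for c0, c1 in zip(close[:-1], close[1:])]

def cumulative_log_return(close, t, horizon):
    """H-period target y_t^(H) = ln(C_{t+H} / C_t), with t an index into `close`."""
    return math.log(close[t + horizon] / close[t])

close = [100.0, 101.0, 99.0, 102.0, 103.0]  # made-up closing prices
r = log_returns(close)

# r[i] is the return realized at index i + 1, so r_{t+1}, ..., r_{t+H} is r[t : t + H].
t, H = 1, 3
assert abs(sum(r[t:t + H]) - cumulative_log_return(close, t, H)) < 1e-12
```

Summing one-step log-returns and taking the log of the price ratio give the same target, which is why the cumulative return can be computed directly from two closing prices.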
Objective. The goal of RAVEN is to learn a mapping $f_{\theta}:\mathbb{R}^{L_{\max}\times D}\to\mathbb{R}$, parameterized by $\theta$, such that the prediction $\hat{y}_t^{(H)}=f_{\theta}(\mathbf{X}_t)$ accurately approximates the true cumulative return (defined in Equation [3](https://arxiv.org/html/2606.24062#S2.E3)) over the out-of-sample test distribution \[[35](https://arxiv.org/html/2606.24062#bib.bib4),[9](https://arxiv.org/html/2606.24062#bib.bib5)\].
### II-B Overall Architecture
Figure [2](https://arxiv.org/html/2606.24062#S2.F2) illustrates the overall architecture of RAVEN. Given a look-back window $\mathbf{X}\in\mathbb{R}^{L_{\max}\times D}$, the model produces a scalar $H$-period cumulative log-return forecast $\hat{y}_t^{(H)}$ through three functionally distinct stages:
- Preprocessing (§[II-C](https://arxiv.org/html/2606.24062#S2.SS3)) normalizes the input instances, processes each channel independently, and partitions the look-back window into an embedded patch sequence $\mathbf{E}=[\mathbf{e}_1,\ldots,\mathbf{e}_N]\in\mathbb{R}^{N\times d}$.
- Dual-Branch Backbone (§[II-D](https://arxiv.org/html/2606.24062#S2.SS4)) consumes $\mathbf{E}$ through two parallel pathways. The local branch routes nested contiguous patch prefixes to $K$ scale-specialized experts via a CIT-thresholded router. The global branch distills the entire look-back window into a single compressed representation.
- Output Projection (§[II-E](https://arxiv.org/html/2606.24062#S2.SS5)) aligns the variable-length expert outputs to a common dimensionality, aggregates them via a CAW scheme, fuses the result with the global representation, and projects to the scalar forecast through an MLP head.
We elaborate on each module in the following sections\.
### II-C Preprocessing
Instance normalization. We normalize each input instance to zero mean and unit variance over the temporal dimension of the look-back window. This removes sample-level distributional shift, a well-known obstacle in non-stationary financial series, while preserving the temporal structure of returns \[[18](https://arxiv.org/html/2606.24062#bib.bib48)\].
Channel-independent processing. The normalized window $\mathbf{X}\in\mathbb{R}^{L_{\max}\times D}$ may contain heterogeneous channels, e.g., OHLCV bars, bid-ask spreads, and engineered factors. Following \[[26](https://arxiv.org/html/2606.24062#bib.bib22)\], we process each channel independently through shared linear projections. This design prevents spurious information leakage across semantically distinct channels and allows the model to learn channel-specific temporal dynamics.
Patch partitioning. We segment the look-back window into $N=\lfloor L_{\max}/p_{\text{len}}\rfloor$ non-overlapping patches $\{\mathbf{P}_1,\ldots,\mathbf{P}_N\}$ \[[26](https://arxiv.org/html/2606.24062#bib.bib22)\] arranged in reverse chronological order (most recent patch first). Each patch $\mathbf{P}_i\in\mathbb{R}^{p_{\text{len}}}$ is linearly projected to a $d$-dimensional embedding and augmented with a sinusoidal positional encoding:
$$\mathbf{e}_i=\mathrm{PatchEmbed}(\mathbf{P}_i)+\mathrm{PE}(i)\in\mathbb{R}^{d},\quad i=1,\ldots,N.\tag{4}$$
The resulting sequence $\mathbf{E}=[\mathbf{e}_1,\ldots,\mathbf{e}_N]\in\mathbb{R}^{N\times d}$ serves as the shared input to both branches of the backbone.
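The three preprocessing steps can be sketched for a single channel with NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, a random matrix `W` stands in for the learned `PatchEmbed` projection, leftover points that do not fill a patch are dropped from the oldest end, and positions are indexed from 0 rather than 1.

```python
import numpy as np

def instance_normalize(x, eps=1e-8):
    """Zero-mean, unit-variance normalization over the time axis of a 1-D window."""
    return (x - x.mean()) / (x.std() + eps)

def make_patches(x, p_len):
    """Split a 1-D window into N = floor(L_max / p_len) non-overlapping patches,
    returned most-recent-first. Oldest leftover points are dropped (assumption)."""
    n = len(x) // p_len
    recent = x[len(x) - n * p_len:]        # keep the newest n * p_len points
    patches = recent.reshape(n, p_len)     # chronological order, oldest first
    return patches[::-1]                   # reverse: row 0 is the newest patch

def sinusoidal_pe(n, d):
    """Standard sinusoidal positional encoding of shape (n, d); d must be even."""
    pos = np.arange(n)[:, None]
    freq = np.exp(-np.log(10000.0) * np.arange(0, d, 2) / d)
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(pos * freq)
    pe[:, 1::2] = np.cos(pos * freq)
    return pe

rng = np.random.default_rng(0)
L_max, p_len, d = 50, 8, 16
x = instance_normalize(rng.normal(size=L_max))  # one channel of one instance
P = make_patches(x, p_len)                      # shape (6, 8)
W = rng.normal(scale=0.1, size=(p_len, d))      # stand-in for PatchEmbed
E = P @ W + sinusoidal_pe(len(P), d)            # Eq. (4), shape (N, d)
```

With `L_max = 50` and `p_len = 8`, the floor in $N=\lfloor L_{\max}/p_{\text{len}}\rfloor$ yields six patches and the two oldest observations are unused.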
### II-D Dual-Branch Backbone
The backbone comprises two parallel branches that jointly consume $\mathbf{E}$: a local branch that performs dynamic, scale-aware MoE routing over nested patch prefixes, and a global branch that maintains a holistic compressed view of the full look-back window. We describe each branch in turn.
Local branch: Patch-weighted MoE. The local branch comprises five steps.
(i) Patch importance scoring. A lightweight two-layer MLP assigns each embedded patch a scalar importance score $s_i=\phi(\mathbf{e}_i)$ for $i\in\{1,\dots,N\}$, where $\phi(\cdot)$ denotes the MLP scoring network with GeLU activation. These scores are then softmax-normalized over the sequence to yield a categorical distribution:
$$\tilde{s}_i=\frac{\exp(s_i)}{\sum_{j=1}^{N}\exp(s_j)},\quad i=1,\dots,N.\tag{5}$$
Intuitively, $\tilde{s}_i$ reflects the model's learned belief about how informative patch $i$ is for the downstream forecast.
(ii) CIT-thresholded routing. Unlike conventional MoE designs that route individual tokens to experts \[[30](https://arxiv.org/html/2606.24062#bib.bib49),[11](https://arxiv.org/html/2606.24062#bib.bib50)\], our router accumulates $\tilde{\mathbf{s}}$ from the most recent patch backward, constructing a reverse-chronological cumulative importance curve, and segments it at $K$ ordered thresholds $0<\tau_1<\tau_2<\cdots<\tau_K\leq 1$:
$$\mathcal{G}_k=\big\{\,i\in\{1,\dots,N\}:\Psi_i\leq\tau_k\,\big\},\quad\Psi_i=\sum_{j=1}^{i}\tilde{s}_j,\tag{6}$$
where $k=1,\dots,K$. By construction, $\mathcal{G}_1\subseteq\mathcal{G}_2\subseteq\cdots\subseteq\mathcal{G}_K$ forms a monotone chain of contiguous patch prefixes with data-dependent cardinalities $\ell_k=|\mathcal{G}_k|$, always anchored at the most recent patch. Because accumulation starts from the present and proceeds backward, each expert receives the $\ell_k$ most informative contiguous patches, a property critical for positional attention to remain temporally coherent. We set $K=3$ and the thresholds $(\tau_1,\tau_2,\tau_3)=(0.3,0.6,0.9)$ by default.
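Steps (i) and (ii) can be illustrated in plain Python. This is a hedged sketch rather than the paper's router: the function names are hypothetical, the importance scores are made up instead of coming from the MLP $\phi$, and the `min_len` guard is an assumption added here so that no expert receives an empty window when the first score alone exceeds $\tau_1$ (the source does not say how that case is handled).

```python
import math

def softmax(scores):
    """Softmax normalization of Eq. (5), shifted by the max for stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cit_prefix_lengths(s_tilde, taus=(0.3, 0.6, 0.9), min_len=1):
    """Return l_k = |G_k| for each threshold, following Eq. (6).

    s_tilde is ordered most-recent-first. `min_len` is an assumption of this
    sketch, not part of the paper's definition.
    """
    lengths = []
    for tau in taus:
        cum, count = 0.0, 0
        for s in s_tilde:
            cum += s              # cumulative importance Psi_i
            if cum > tau:
                break             # first patch that pushes Psi_i past tau_k
            count += 1
        lengths.append(max(count, min_len))
    return lengths

# Hand-picked normalized scores, most recent patch first.
s_tilde = [0.2, 0.15, 0.15, 0.2, 0.1, 0.12, 0.08]
print(cit_prefix_lengths(s_tilde))  # [1, 3, 5]

# Raw scores would first pass through the softmax of Eq. (5).
print(cit_prefix_lengths(softmax([1.2, 0.4, 0.1, 0.9, -0.3, 0.0])))
```

Each returned length defines a window made of the first $\ell_k$ patches, so the three windows are nested prefixes that all start at the most recent patch.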
(iii) Scale-specialized experts. Each group $\mathcal{G}_k$ is processed by an independent three-layer encoder $\mathrm{Encoder}_k$:
$$\mathbf{h}_k=\mathrm{Encoder}_k\!\bigl(\{\mathbf{e}_i\}_{i\in\mathcal{G}_k}\bigr)\in\mathbb{R}^{\ell_k\times d}.\tag{7}$$
Since $\mathcal{G}_1$ is the shortest (most recent) prefix and $\mathcal{G}_K$ the longest, expert $\mathrm{Encoder}_1$ naturally specializes in short-term dynamics while $\mathrm{Encoder}_K$ captures longer-term regime-scale structure; intermediate experts span intermediate horizons. This constitutes a fundamentally different form of specialization from content-based MoE routing: RAVEN experts are differentiated by temporal scale, not by token content. Since Transformer encoders natively handle variable-length sequences, no padding is required within any expert.
(iv) Shape-aligned pooling. To align the outputs from different scale-specialized experts into a unified space for downstream fusion, we apply a parameter-free Shape-aligned Pooling that averages over the patch dimension:
$$\mathbf{z}_k=\text{AvgPool}(\mathbf{h}_k)\in\mathbb{R}^{d},\tag{8}$$
where $\text{AvgPool}(\cdot)$ averages over the $\ell_k$ patch embeddings in $\mathbf{h}_k\in\mathbb{R}^{\ell_k\times d}$, collapsing them into a fixed $d$-dimensional vector. This parameter-free design avoids overfitting risk and produces a uniform representation for the subsequent correlation-aware gate.
(v) CAW-based expert aggregation. Because the expert groups are nested ($\mathcal{G}_1\subseteq\cdots\subseteq\mathcal{G}_K$), their representations $\{\mathbf{z}_k\}$ are inherently correlated: longer-horizon experts operate on supersets of the inputs to shorter-horizon experts. A naive uniform or softmax gate would amplify this redundancy, which is particularly harmful under the characteristically low SNR of financial returns. To mitigate this, we compute the pairwise cosine similarity matrix $\mathbf{R}\in\mathbb{R}^{K\times K}$ with entries $\mathbf{R}_{jk}=\mathbf{z}_j^{\top}\mathbf{z}_k/(\|\mathbf{z}_j\|\|\mathbf{z}_k\|)$, derive the per-expert positive redundancy score $r_k=\sum_{j\neq k}\max(\mathbf{R}_{jk},0)$, and modulate a raw routing confidence $\alpha_k$ with an exponential penalty:
$$w_k=\frac{\alpha_k\exp(-\lambda\,r_k)}{\sum_{k'=1}^{K}\alpha_{k'}\exp(-\lambda\,r_{k'})},\qquad\lambda\geq 0\;\text{(learnable)}.\tag{9}$$
The local representation is then obtained as the weighted aggregate:
$$\mathbf{z}_{\text{local}}=\sum_{k=1}^{K}w_k\,\mathbf{z}_k\in\mathbb{R}^{d}.\tag{10}$$
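Steps (iv) and (v) can be sketched together with NumPy. This is an illustration under assumptions, not the authors' code: `caw_weights` is a hypothetical name, the expert outputs are random stand-ins, $\lambda$ is fixed rather than learned, and the routing confidences $\alpha_k$ are set uniform because the text does not specify how they are produced.

```python
import numpy as np

def caw_weights(Z, alpha, lam):
    """Correlation-aware weights of Eq. (9).

    Z: (K, d) pooled expert vectors; alpha: (K,) positive routing confidences;
    lam: non-negative penalty strength (learnable in the paper, fixed here).
    """
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    R = Zn @ Zn.T                                # pairwise cosine similarity
    off_diag = R - np.diag(np.diag(R))           # drop the j == k terms
    r = np.maximum(off_diag, 0.0).sum(axis=0)    # positive redundancy r_k
    w = alpha * np.exp(-lam * r)
    return w / w.sum()

# Shape-aligned pooling, Eq. (8): experts emit different lengths l_k, same width d.
rng = np.random.default_rng(0)
H = [rng.normal(size=(l, 4)) for l in (1, 3, 5)]  # made-up expert outputs h_k
Z = np.stack([h.mean(axis=0) for h in H])         # (K, d) pooled vectors z_k
w = caw_weights(Z, alpha=np.ones(3) / 3, lam=1.0)
z_local = w @ Z                                   # Eq. (10), shape (d,)
```

Setting `lam=0.0` recovers the normalized confidences $\alpha_k$ unchanged, so the penalty strength controls how far the gate departs from plain confidence weighting.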
Global branch: GCR. To preserve a holistic macroeconomic view alongside fine-grained local experts, RAVEN incorporates a GCR module. It first captures comprehensive historical dependencies across the entire sequence of embedded patches $\mathbf{E}\in\mathbb{R}^{N\times d}$ via a Self-Attention layer, then compresses the attended representations into a single vector through average pooling:
$$\mathbf{E}_{sa}=\text{Self-Attn}(\mathbf{E}),\tag{11}$$
$$\mathbf{z}_{\text{global}}=\text{AvgPool}(\mathbf{E}_{sa})\in\mathbb{R}^{d},\tag{12}$$
where $\text{AvgPool}(\cdot)$ averages over the $N$ patch positions in $\mathbf{E}_{sa}\in\mathbb{R}^{N\times d}$, yielding a unified global context vector. The global branch thus serves as an information-preserving complement to the local branch's selective, scale-specialized view.
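A minimal NumPy sketch of Equations (11) and (12) follows. The paper does not state the number of heads or whether residual connections and layer normalization are used, so this sketch assumes a single head with none of them; the function name and the random projection matrices `Wq`, `Wk`, `Wv` are placeholders for learned parameters.

```python
import numpy as np

def self_attention(E, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over N patch embeddings."""
    Q, K, V = E @ Wq, E @ Wk, E @ Wv
    logits = Q @ K.T / np.sqrt(K.shape[1])
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    A = np.exp(logits)
    A /= A.sum(axis=1, keepdims=True)             # each row sums to 1
    return A @ V

rng = np.random.default_rng(0)
N, d = 6, 8
E = rng.normal(size=(N, d))                       # stand-in embedded patches
Wq, Wk, Wv = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
E_sa = self_attention(E, Wq, Wk, Wv)              # Eq. (11), shape (N, d)
z_global = E_sa.mean(axis=0)                      # Eq. (12), shape (d,)
```

Unlike the local experts, this branch always sees all $N$ patches, so its output does not depend on where the CIT thresholds cut the sequence.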
### II-E Output Projection
The final prediction head fuses the two complementary representations via concatenation and projects to the scalar forecast:
$$\hat{y}_t^{(H)}=\mathrm{MLP}\!\big(\mathrm{Concat}(\mathbf{z}_{\text{local}},\,\mathbf{z}_{\text{global}})\big)\in\mathbb{R}.\tag{13}$$
For multi-horizon forecasting, the output dimension of the MLP is extended to $M$, the number of target horizons, yielding one prediction $\hat{y}_t^{(H_m)}$ per target horizon $H_m$.
### II-F Training Objective
We optimize a composite loss that combines the primary regression objective with two auxiliary regularizers targeting the specific failure modes of scale-aware MoE routing under low SNR:
$$\mathcal{L}=\mathcal{L}_{\text{MSE}}+\lambda_{\text{ent}}\,\mathcal{L}_{\text{ent}}+\lambda_{\text{div}}\,\mathcal{L}_{\text{div}}.\tag{14}$$
Forecasting loss. Consistent with the regression target defined in Eq. ([3](https://arxiv.org/html/2606.24062#S2.E3)), the primary objective is the mean squared error between the predicted and realized $H$-period cumulative log-return:
$$\mathcal{L}_{\text{MSE}}=\frac{1}{B}\sum_{i=1}^{B}\big(\hat{y}_i^{(H)}-y_i^{(H)}\big)^{2},\tag{15}$$
where $B$ denotes the batch size.
Router entropy regularization. The CIT-thresholded router operates on a softmax-normalized patch-importance distribution $\tilde{\mathbf{s}}$. Without regularization, the softmax parameterization tends to amplify initial score differences, concentrating all probability mass on a few patches, typically the most recent ones. This router collapse causes the cumulative sum to reach all thresholds $\{\tau_k\}$ within the first few patches, truncating every expert group into nearly identical short prefixes and degenerating the MoE toward a single-scale model. We counteract this with a negative-entropy penalty:
$$\mathcal{L}_{\text{ent}}=\sum_{i=1}^{N}\tilde{s}_i\log\tilde{s}_i,\tag{16}$$
which is minimized at the uniform distribution. A high-entropy routing distribution causes the cumulative importance curve to increase more gradually, yielding balanced prefix lengths across experts and enabling the intended short-, medium-, and long-horizon specialization.
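The behavior of the penalty in Equation (16) can be checked on two hand-made distributions. This is a small sketch with a hypothetical function name; the `eps` term is an assumption added only to avoid `log(0)`.

```python
import math

def router_entropy_loss(s_tilde, eps=1e-12):
    """Negative entropy of Eq. (16): sum_i s_i * log(s_i). Lower means flatter."""
    return sum(s * math.log(s + eps) for s in s_tilde)

uniform = [0.25] * 4                 # flat routing distribution
peaked = [0.97, 0.01, 0.01, 0.01]    # mass collapsed onto the newest patch

# The uniform distribution attains the minimum, -log(N).
assert abs(router_entropy_loss(uniform) + math.log(4)) < 1e-9
assert router_entropy_loss(uniform) < router_entropy_loss(peaked)
```

With the default thresholds, the peaked distribution crosses $\tau_1$, $\tau_2$, and $\tau_3$ on its first patch, which is the collapse the penalty is meant to discourage.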
Expert diversity regularization. The nested structure $\mathcal{G}_1\subseteq\cdots\subseteq\mathcal{G}_K$ means longer-horizon experts observe a superset of the patches seen by shorter-horizon ones. Without explicit encouragement, experts may converge to similar representations (representation collapse), causing $\mathbf{z}_{\text{local}}$ to degenerate toward a single expert's output. We penalize the off-diagonal entries of the expert cosine-similarity matrix:
$$\mathcal{L}_{\text{div}}=\big\|\mathbf{R}-\mathbf{I}_K\big\|_F^{2},\qquad\mathbf{R}_{jk}=\frac{\mathbf{z}_j^{\top}\mathbf{z}_k}{\|\mathbf{z}_j\|_2\|\mathbf{z}_k\|_2},\tag{17}$$
which drives expert representations toward pairwise orthogonality. While the CAW (Eq. [9](https://arxiv.org/html/2606.24062#S2.E9)) down-weights redundant experts at inference time, $\mathcal{L}_{\text{div}}$ structurally prevents redundancy during training. The two mechanisms are complementary.
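Equation (17) has two easy limiting cases, which the following NumPy sketch verifies. The function name is hypothetical and the inputs are synthetic.

```python
import numpy as np

def diversity_loss(Z):
    """Eq. (17): squared Frobenius norm of (R - I), R = cosine-similarity matrix."""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    R = Zn @ Zn.T
    return float(np.sum((R - np.eye(len(Z))) ** 2))

orthogonal = np.eye(3)          # three mutually orthogonal expert vectors
collapsed = np.ones((3, 4))     # three identical expert vectors

assert abs(diversity_loss(orthogonal)) < 1e-12        # no redundancy
assert abs(diversity_loss(collapsed) - 6.0) < 1e-9    # K*(K-1) off-diagonal ones
```

Because the diagonal of $\mathbf{R}$ is always 1, only the off-diagonal similarities contribute, and the loss ranges from 0 for orthogonal experts to $K(K-1)$ when all experts point in the same direction.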
Complementary regularization. $\mathcal{L}_{\text{ent}}$ and $\mathcal{L}_{\text{div}}$ address orthogonal failure modes. The entropy term prevents routing collapse on the input side, ensuring every expert receives sufficiently rich context. The diversity term prevents representation collapse on the output side, ensuring experts produce non-redundant representations despite overlapping inputs. Their joint use is critical for stable training in the low-SNR financial regime (see ablation in §[III-D](https://arxiv.org/html/2606.24062#S3.SS4)).
### II-G System Deployment
Figure [3](https://arxiv.org/html/2606.24062#S2.F3) illustrates the deployed system architecture, consisting of an offline development environment and an online production environment.
Figure 3: Production deployment pipeline of RAVEN in a quantitative trading system. The offline phase handles model training and backtesting validation on historical data. The online phase appends newly available market data after each close, generates return predictions via daily inference, optimizes portfolio allocations, and routes orders through pre-trade risk checks to execution venues. A production monitor triggers the next offline retraining cycle upon sustained performance drift.
Offline environment. RAVEN is trained on historical price-volume data stored in a financial data lake. The trained model then undergoes rigorous backtesting that incorporates portfolio optimization, risk constraints, and transaction cost modeling to validate out-of-sample performance. This offline cycle for model training is re-executed periodically, or when production monitoring signals significant performance degradation.
Online environment. Once deployed online, the forecasting models operate on a daily inference cycle. After each market close, newly available data is appended to the streaming database and fed into the models to generate next-period return predictions across the target tickers. Predictions are consumed by a portfolio optimization module that solves for target allocation weights subject to risk and turnover constraints. Optimized orders pass pre-trade risk checks at the trading console before routing to execution venues via broker bridges. A production monitor continuously tracks realized prediction accuracy and portfolio performance; sustained drift beyond predefined thresholds triggers the next offline retraining cycle.
Deployment Status and Impact. At present, the offline modeling and backtesting validation phases have been completed. Under identical and realistic backtest conditions, RAVEN-driven strategies outperform our single production baseline by over 10% in cumulative returns. The system is currently advancing through the final stages of live production integration, with active engineering efforts focused on online strategy ensembling and live out-of-sample monitoring prior to full-scale capital allocation.
## IIIExperiments
We evaluate RAVEN on financial log\-return prediction and general time series forecasting tasks to demonstrate the effectiveness of dynamic look\-back selection and correlation\-aware expert aggregation\.
### III\-A Experimental Settings
Datasets\. We conduct experiments on three categories of datasets\. Table [II](https://arxiv.org/html/2606.24062#S3.T2) summarizes the dataset statistics\.
\(1\)FinMultiTime\[[39](https://arxiv.org/html/2606.24062#bib.bib51)\]: A cross\-market financial dataset containing daily price and volume data for 892 HS300 constituents and 4,694 S&P 500 constituents, spanning from 2009 to 2024\. This dataset covers diverse market regimes, including bull markets, bear markets, and high\-volatility crisis periods across both Chinese and US equity markets\.
\(2\)Fund\[[42](https://arxiv.org/html/2606.24062#bib.bib20)\]: Daily user transaction records for mutual fund subscriptions and redemptions on Alipay, spanning from January 2015 to January 2023\. This dataset captures retail investor behavior dynamics and fund\-level return patterns\.
\(3\)PEMS\(03, 04, 07, 08\)\[[14](https://arxiv.org/html/2606.24062#bib.bib17)\]: Four widely used traffic flow datasets collected from the California Department of Transportation Performance Measurement System \(PeMS\)\. Each dataset records aggregated traffic flow measurements from highway sensor networks at 5\-minute intervals\. These datasets exhibit strong temporal non\-stationarity due to rush\-hour patterns, weekday/weekend shifts, and seasonal variations, serving to validate the generalizability of RAVEN beyond financial domains\.
Table II: Dataset statistics\. Ticker/Sensors denotes the number of stocks or funds in financial datasets or sensors in traffic datasets\. Timepoints reports the total available data samples across all entities and time steps\.

| Domain | Dataset | Ticker/Sensors | Timepoints | Granularity |
| --- | --- | --- | --- | --- |
| Financial | HS300 | 892 | 18,836,466 | 1\-Day |
| Financial | S&P500 | 4,694 | 104,692,416 | 1\-Day |
| Financial | Fund | 306 | 647,892 | 1\-Day |
| Traffic | PEMS03 | 358 | 9,382,464 | 5\-Min |
| Traffic | PEMS04 | 307 | 15,649,632 | 5\-Min |
| Traffic | PEMS07 | 883 | 24,921,792 | 5\-Min |
| Traffic | PEMS08 | 170 | 9,106,560 | 5\-Min |

Baselines\. We compare against twelve representative methods spanning diverse architectural strategies for temporal modeling:
- Static Patching models: PatchTST \[[26](https://arxiv.org/html/2606.24062#bib.bib22)\] operates on fixed uniform patches, while Patch\-Concat and Patch\-Ensemble extend it via naive concatenation and ensembling over different lengths\.
- Predefined Multi\-period models: MLF \[[42](https://arxiv.org/html/2606.24062#bib.bib20)\] utilizes explicitly predefined multi\-period inputs combined with inter\-period redundancy filtering\.
- Cross\-variate interactions: iTransformer \[[22](https://arxiv.org/html/2606.24062#bib.bib23)\] and Crossformer \[[43](https://arxiv.org/html/2606.24062#bib.bib15)\] capture global inter\-sensor dependencies through variate\-as\-token and cross\-dimensional attention mechanisms\.
- Hierarchical Multi\-resolution: PathFormer \[[5](https://arxiv.org/html/2606.24062#bib.bib24)\], Scaleformer \[[29](https://arxiv.org/html/2606.24062#bib.bib25)\], and NHits \[[4](https://arxiv.org/html/2606.24062#bib.bib35)\] extract multi\-scale representations internally through predefined hierarchical pathways, iterative refinement, or interpolation\.
- Frequency\-domain transformations: TimesNet \[[36](https://arxiv.org/html/2606.24062#bib.bib40)\], FEDformer \[[47](https://arxiv.org/html/2606.24062#bib.bib28)\], and FiLM \[[46](https://arxiv.org/html/2606.24062#bib.bib36)\] tackle temporal dynamics via 2D\-variations and spectral frequency decompositions\.
All baselines are evaluated under identical data splits and preprocessing protocols for a fair comparison\.
Implementation details\. RAVEN uses $K=3$ experts with CIT\-based thresholds $(\tau_1,\tau_2,\tau_3)=(0.3,0.6,0.9)$, patch length $p_{\text{len}}=16$, and embedding dimension $d=128$\. Each expert is a 3\-layer encoder\. We optimize with AdamW for 60 epochs using cosine annealing\. Auxiliary loss weights are $\lambda_{\text{ent}}=0.1$ and $\lambda_{\text{div}}=0.01$\. All experiments run on a single NVIDIA A100 40GB GPU\.
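To make the role of the thresholds concrete, the following is a minimal sketch of cumulative importance thresholding. It assumes that patch scores are non-negative, are ordered from the most recent patch backwards, and that each expert receives the shortest prefix whose normalized cumulative importance reaches its threshold. The function name `cit_prefix_lengths` and the example scores are hypothetical; the exact formulation used by RAVEN may differ.

```python
from itertools import accumulate


def cit_prefix_lengths(scores, thresholds=(0.3, 0.6, 0.9)):
    """Return one prefix length (in patches) per threshold.

    scores: non-negative patch importances, index 0 = most recent patch.
    Assumption: expert k gets the shortest prefix whose normalized
    cumulative importance is at least thresholds[k].
    """
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    cumulative = list(accumulate(s / total for s in scores))
    lengths = []
    for tau in thresholds:
        # First position reaching tau; fall back to the full context.
        n = next(
            (i + 1 for i, c in enumerate(cumulative) if c >= tau - 1e-12),
            len(scores),
        )
        lengths.append(n)
    return lengths


# Hypothetical score profiles over 8 patches (most recent first).
peaked = [0.33, 0.22, 0.15, 0.10, 0.08, 0.06, 0.04, 0.02]
flat = [0.125] * 8
print(cit_prefix_lengths(peaked))  # short windows when recent patches dominate
print(cit_prefix_lengths(flat))    # longer windows when importance is spread out
```

Because the thresholds are increasing and the cumulative sum is monotone, the resulting windows are nested by construction: each expert's window contains the windows of all shorter-horizon experts.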
Evaluation metrics\. For HS300 and S&P500, we train on 2009–2019 and evaluate on 2020–2024 via rolling out\-of\-sample prediction of $H=10$\-day cumulative log\-returns\. We report three complementary metrics: \(1\) Pearson correlation \(Corr\) between predicted and realized returns, measuring directional accuracy; \(2\) Mean Squared Error \(MSE\) of normalized predictions, capturing magnitude fidelity; and \(3\) Information Coefficient Information Ratio \(ICIR\), defined as the mean of the cross\-sectional IC divided by its standard deviation across rebalancing periods, quantifying the stability of the predictive signal\. For per\-year evaluation \(Table [III](https://arxiv.org/html/2606.24062#S3.T3)\), all three metrics are computed within each calendar year\. For the overall summary \(Table [IV](https://arxiv.org/html/2606.24062#S3.T4)\), Corr and ICIR are computed over the entire 5\-year test period \(2020–2024\) to reflect long\-horizon stability, while MSE is reported as the multi\-year average\. For Fund, we forecast fund sales at horizons $H\in\{1,5,8,10\}$ days and report MSE and Weighted Mean Absolute Percentage Error \(WMA for short\)\. For general benchmarks, we adopt short\-term horizons $H\in\{12,24\}$ and report MSE and MAE\.
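The ICIR definition above can be illustrated with a short sketch. It assumes that the cross-sectional IC is a Pearson correlation computed per rebalancing period and that the standard deviation is the sample standard deviation; neither choice is stated in the text, and the toy data and function names are purely illustrative.

```python
from math import sqrt
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation of two equal-length sequences.
    Raises ZeroDivisionError if either sequence is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def icir(preds_by_period, reals_by_period):
    """Mean of the per-period cross-sectional IC divided by its
    sample standard deviation across periods (needs >= 2 periods)."""
    ics = [pearson(p, r) for p, r in zip(preds_by_period, reals_by_period)]
    return mean(ics) / stdev(ics)


# Toy example: 3 rebalancing periods, 3 stocks each (made-up numbers).
preds = [[1, 2, 3], [1, 2, 3], [1, 2, 3]]
reals = [[2, 4, 6], [1, 3, 5], [6, 4, 2]]
print(icir(preds, reals))
```

A model with a moderate but steady IC can score a higher ICIR than one with a larger but erratic IC, which is why the metric is read as a stability measure.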
### III\-B Financial Time Series Forecasting
Financial time series exhibit non\-stationary dynamics and regime\-dependent signal\-to\-noise ratios, making fixed look\-back approaches particularly fragile\. We evaluate RAVEN on two distinct financial forecasting tasks: 10\-day cumulative log\-return prediction for stocks and fund sales forecasting\.
Results on FinMultiTime\.Table[III](https://arxiv.org/html/2606.24062#S3.T3)reports per\-year results on HS300 and S&P500\. On HS300, RAVEN ranks first or second in every year\-metric cell, achieving 7 first\-place and 8 second\-place rankings out of 15\. On S&P500, RAVEN achieves the highest Corr and ICIR across all 5 years, and the lowest MSE in 4 of 5 years, yielding 14 first\-place rankings out of 15 cells\. Overall, RAVEN obtains the top rank in 21 out of 30 total year\-metric\-dataset cells \(70%\), followed by MLF and NHits at 4 cells each\. These results confirm that data\-driven dynamic patch routing generalizes consistently across both datasets and all evaluation periods without manual scale configuration\.
Table[IV](https://arxiv.org/html/2606.24062#S3.T4)summarizes the aggregated performance over the entire five\-year test period from 2020 to 2024\. On HS300, RAVEN achieves a Corr of 0\.0390, an ICIR of 0\.3932, and an MSE of 0\.9905\. Compared with the second\-best baseline, MLF, it improves Corr by 9\.2% and ICIR by 9\.9%, while reducing MSE by 0\.27%\. On S&P500, RAVEN attains a Corr of 0\.0363, an ICIR of 0\.5980, and an MSE of 1\.001\. It outperforms MLF by 20\.2% in Corr and 7\.2% in ICIR, while reducing MSE by 1\.5%\. Across the two datasets, RAVEN ranks first in all six dataset\-metric pairs\. The consistently higher ICIR suggests that RAVEN produces not only stronger average predictions but also more stable predictive signals across rebalancing periods\. This stability is particularly desirable for sequential decision\-making systems that rely on robust ranking signals\.
Table III: Performance comparison on HS300 and S&P500 \(2020–2024\)\. Corr and ICIR: higher is better \($\uparrow$\); MSE: lower is better \($\downarrow$\)\. Best in bold, second underlined\.
Table IV: Overall average performance on HS300 and S&P500 \(2020–2024\)\. Corr and ICIR: higher is better \($\uparrow$\); MSE: lower is better \($\downarrow$\)\. Best in bold, second underlined\.

To further validate that superior prediction accuracy translates into real\-world investment gains, we conduct backtesting on HS300 from January 2020 to December 2024 using Qlib \[[40](https://arxiv.org/html/2606.24062#bib.bib8)\]\. Qlib is an open\-source AI\-driven quantitative investment platform with standardized data pipelines and strategy simulation\. We construct a cross\-sectional portfolio management scheme\. At each rebalancing point, i\.e\., every 10 trading days, constituents are ranked by predicted score in descending order, and a fixed\-size portfolio of $K=30$ stocks is maintained\. A drop\-$N$ strategy \($N_{\text{drop}}=30$\) limits turnover, as positions are only closed when their predicted rank falls significantly below the top\-$K$ threshold\. Execution is simulated via Qlib’s SimulatorExecutor at daily frequency with an initial capital of 100M RMB and default transaction costs, including stamp duty and commission\. We report the cumulative return of RAVEN minus that of each baseline under the identical strategy and period\. Figure [4](https://arxiv.org/html/2606.24062#S3.F4) shows the cumulative return advantage of RAVEN over each baseline\. All gaps exhibit a persistent upward trend over the five\-year period despite short\-term fluctuations during volatile market phases, indicating sustained outperformance rather than episodic gains\. The final cumulative return advantage over the strongest baseline \(MLF\) is \+12\.79%, while the gap over PatchTST reaches \+38\.45%\. For the remaining baselines, the gaps are Patch\-Ensemble \(\+34\.41%\), FiLM \(\+23\.41%\), PathFormer \(\+20\.38%\), Patch\-Concat \(\+20\.28%\), and NHits \(\+19\.46%\)\.
The steadily growing gaps confirm that adaptive dynamic routing delivers consistent predictive advantage across both trending and volatile market regimes\.
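The rebalancing rule used in the backtest can be sketched in a few lines. This is a simplified illustration, not Qlib's strategy implementation: it assumes that at most `n_drop` held names ranked outside the top `k` are sold, worst first, and that freed slots are filled with the best-ranked names not yet held. The function name `rebalance` and the toy scores are hypothetical.

```python
def rebalance(holdings, scores, k=30, n_drop=30):
    """One rebalancing step of a simplified top-k strategy with a drop cap.

    holdings: set of currently held tickers.
    scores: dict mapping ticker -> predicted score (higher is better).
    Returns (new_holdings, to_sell, to_buy).
    Holdings without a score are kept untouched in this sketch.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    top_k = set(ranked[:k])
    # Held names that fell out of the top k, worst-ranked first.
    stale = [t for t in reversed(ranked) if t in holdings and t not in top_k]
    to_sell = stale[:n_drop]
    kept = set(holdings) - set(to_sell)
    # Fill the remaining slots with the best-ranked names not already kept.
    candidates = [t for t in ranked if t not in kept]
    to_buy = candidates[: max(k - len(kept), 0)]
    return kept | set(to_buy), to_sell, to_buy


# Toy universe of six tickers with made-up scores.
scores = {"A": 6, "B": 5, "C": 4, "D": 3, "E": 2, "F": 1}
new_holdings, sold, bought = rebalance({"A", "E", "F"}, scores, k=3, n_drop=1)
print(sorted(new_holdings), sold, bought)
```

With a small `n_drop` the portfolio turns over gradually even when many holdings fall out of the top ranks; setting `n_drop` equal to `k`, as in the configuration reported above, removes that cap.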
Figure 4: Cumulative return advantage of RAVEN over baselines on HS300, from 2020 to 2024\. $\Delta(\cdot)$ denotes the cumulative return of RAVEN minus that of baseline $(\cdot)$\. All curves show a persistent upward trend across varying market regimes\.

Results on Fund\. Table [V](https://arxiv.org/html/2606.24062#S3.T5) reports fund sales predictions across four horizons, where RAVEN consistently achieves the best MSE and WMA\. Specifically, RAVEN reduces the average MSE by 18\.2% compared to the second\-best MLF \(32\.42 vs\. 39\.62\), maintaining stable gains from 14% to 21% across all horizons\. Multi\-scale architectures like PathFormer and Scaleformer perform similarly to single\-scale methods, suggesting that implicit in\-network multi\-resolution modeling alone is insufficient\. Explicit multi\-period inputs with dynamic selection are imperative for capturing heterogeneous temporal patterns in transaction data\. Moreover, the fixed\-window variants Patch\-Ensemble \(Patch\_E\) and Patch\-Concat \(Patch\_C\) consistently underperform MLF\. This indicates that naively aggregating multi\-period inputs without redundancy mitigation introduces correlated noise, negating the benefits of broader temporal context\.
Table V: Performance comparison on Fund dataset\. MSE \($\downarrow$\) and WMA \($\downarrow$\)\. Best in bold; second\-best underlined\.
### III\-C General Time Series Forecasting
To evaluate the generalizability of RAVEN beyond financial domains, we conduct experiments on four traffic flow datasets, namely PEMS03, PEMS04, PEMS07, and PEMS08\[[14](https://arxiv.org/html/2606.24062#bib.bib17)\]\. As shown in Table[VI](https://arxiv.org/html/2606.24062#S3.T6), RAVEN achieves competitive results, securing the best performance in 14 of 16 evaluated metrics\. The variate\-as\-token iTransformer emerges as the strongest second\-best baseline, consistently outperforming the pre\-defined multi\-scale MLF\. Traffic networks exhibit strict macroscopic spatial correlations, which allows iTransformer to excel by embedding historical series for global cross\-sensor dependency capture\. In contrast, MLF’s static temporal squeezing inadvertently disrupts these spatial connections\.
Despite the spatial dominance of traffic data, the overall superiority of RAVEN reveals that predictive gains from data\-dependent dynamic routing outweigh the lack of explicit spatial modeling\. Traffic networks frequently suffer from abrupt, localized disruptions such as accidents or bottlenecks\. iTransformer’s static global receptive field risks over\-smoothing these transient temporal bursts\. Conversely, RAVEN isolates these dynamics and adaptively assigns the optimal patch scale to each sensor’s immediate state\. This confirms that even in spatially\-dominated domains, granular temporal adaptability provides a crucial inductive bias for forecasting volatile sequences\.
Table VI: Short\-term forecasting results on traffic datasets \(MSE / MAE, $\downarrow$\)\. Best in bold; second\-best underlined\.
### III\-D Ablation Study
We ablate each component of RAVEN on HS300 to isolate individual contributions\. Table[VII](https://arxiv.org/html/2606.24062#S3.T7)reports year\-by\-year and overall Pearson correlation for six variants\.
Table VII: Ablation study on HS300 \(Pearson Corr $\uparrow$\)\. Best in bold\.

Among the three architectural components, removing learned patch importance scoring \(w/o Adaptive Routing, AR\) causes the largest degradation, with a 9\.7% relative drop on the overall metric\. Without the learnable patch importance vector $\Psi_i$, the CIT mechanism degenerates into predefined window lengths, losing the ability to adapt look\-back selection to regime\-shifting dynamics\. Disabling correlation\-aware expert weighting \(w/o CAW\) degrades performance by 8\.2%, validating that explicit redundancy suppression at the output side is necessary when nested windows share overlapping patches\. The Global Compressed Representation branch \(w/o GCR\) contributes 7\.7%, providing a router\-independent view that stabilizes predictions when local experts disagree\. The two auxiliary losses each contribute approximately 4\.5% individually\.
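The redundancy that CAW targets can be quantified with a pairwise cosine similarity measure over expert outputs. The sketch below only illustrates that measure under the assumption that expert outputs have already been aligned to a common length; how RAVEN converts such similarities into aggregation weights or a loss term is not reproduced here, and the function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(expert_outputs):
    """Average cosine similarity over all unordered expert pairs.
    Higher values indicate more redundant expert outputs."""
    pairs = list(combinations(expert_outputs, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy aligned outputs from three experts (made-up vectors).
redundant = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
diverse = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(mean_pairwise_cosine(redundant))
print(mean_pairwise_cosine(diverse))
```

Since nested prefix windows share their most recent patches, some similarity between experts is expected; a penalty of this kind discourages experts from collapsing onto the shared portion.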
### III\-E Model Efficiency Analysis
We benchmark the computational efficiency of RAVEN against four representative baselines on HS300 using a single NVIDIA A100 \(40GB\) GPU, batch size 512, and a 120\-day look\-back window\. Training and inference speeds are reported in milliseconds per iteration, where a training iteration denotes one full forward–backward pass over a batch and an inference iteration denotes one forward pass; memory denotes peak GPU memory during training\. As shown in Figure[5](https://arxiv.org/html/2606.24062#S3.F5), RAVEN achieves a favorable balance between accuracy and efficiency\. Compared with lightweight baselines MLF and PatchTST, RAVEN incurs moderate overhead in training time \(77\.89 ms vs\. 52\.54/55\.48 ms\), memory \(3\.23 GB vs\. 2\.47/2\.77 GB\), and inference latency \(26\.98 ms vs\. 15\.94/17\.68 ms\), reflecting the cost of adaptive temporal\-scale routing\. However, this overhead remains within the same order of magnitude, while RAVEN achieves the best overall predictive performance on both HS300 and S&P500 \(Table[IV](https://arxiv.org/html/2606.24062#S3.T4)\)\. Compared with heavier multi\-scale or routing baselines, RAVEN is substantially more efficient: it reduces training time by 27\.4% over Scaleformer and 89\.1% over PathFormer, reduces peak memory by 93\.1% over Scaleformer and 35\.1% over PathFormer, and reduces inference latency by 42\.0% over Scaleformer and 78\.3% over PathFormer\. These results indicate that RAVEN obtains the strongest predictive accuracy while maintaining computational cost close to lightweight baselines and far below heavier adaptive architectures\.
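The "moderate overhead" relative to the lightweight baselines can be made explicit with a short calculation over the figures quoted above. The sketch assumes that the paired values are listed in the order MLF/PatchTST; the dictionary layout and the function name `relative_overhead` are illustrative choices.

```python
# Figures quoted in the text (training ms/iter, peak memory GB, inference ms/iter).
raven = {"train_ms": 77.89, "mem_gb": 3.23, "infer_ms": 26.98}
baselines = {
    # Assumption: values before the slash belong to MLF, after it to PatchTST.
    "MLF": {"train_ms": 52.54, "mem_gb": 2.47, "infer_ms": 15.94},
    "PatchTST": {"train_ms": 55.48, "mem_gb": 2.77, "infer_ms": 17.68},
}


def relative_overhead(model, baseline):
    """Percentage by which `model` exceeds `baseline` on each metric."""
    return {key: 100.0 * (model[key] / baseline[key] - 1.0) for key in model}


for name, stats in baselines.items():
    overhead = relative_overhead(raven, stats)
    print(name, {key: round(value, 1) for key, value in overhead.items()})
```

Under that reading, the extra cost of adaptive routing ranges from roughly 17% to 69% depending on the metric and baseline, which is consistent with the statement that it stays within the same order of magnitude.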
Figure 5:Efficiency comparison on HS300 \(batch size 512, look\-back window 120\)\. Training and inference time are reported in milliseconds per iteration; memory denotes peak GPU memory during training\.
### III\-F Hyperparameter Analysis
We analyze the sensitivity of RAVEN to four key hyperparameters on HS300: number of experts $K$, CIT\-based threshold values, maximum look\-back window length, and patch length $p_{\text{len}}$\.
Number of experts $K$\. We vary $K\in\{2,3,4\}$ with thresholds uniformly spaced between $\tau_{\min}=0.3$ and $\tau_{\max}=0.9$\. Table [VIII](https://arxiv.org/html/2606.24062#S3.T8) shows that $K=3$ achieves the best overall performance \(0\.0390\), outperforming $K=2$ \(0\.0339\) by 15\.0% and $K=4$ \(0\.0361\) by 8\.0%\. With $K=2$, only two scales are available, limiting the model’s ability to capture intermediate\-term patterns\. With $K=4$, adjacent experts cover highly overlapping windows, increasing redundancy without providing additional discriminative information\.
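The threshold grids and relative gains in this comparison can be reproduced with a small sketch. It assumes that "uniformly spaced" means an even grid including both endpoints; the function names are hypothetical.

```python
def uniform_thresholds(k, tau_min=0.3, tau_max=0.9):
    """Evenly spaced thresholds for k experts, endpoints included."""
    if k < 2:
        raise ValueError("need at least two experts for a spaced grid")
    step = (tau_max - tau_min) / (k - 1)
    # Round to suppress floating-point residue such as 0.6000000000000001.
    return [round(tau_min + i * step, 4) for i in range(k)]


def relative_gain(best, other):
    """Percentage improvement of `best` over `other`."""
    return 100.0 * (best / other - 1.0)


for k in (2, 3, 4):
    print(k, uniform_thresholds(k))

# Overall Corr values quoted in the text for K = 3, 2, and 4.
print(round(relative_gain(0.0390, 0.0339), 1))
print(round(relative_gain(0.0390, 0.0361), 1))
```

Under this reading, $K=4$ places thresholds only 0.2 apart, so neighbouring experts see windows that differ by a smaller slice of cumulative importance, in line with the redundancy argument above.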
Table VIII: Effect of number of experts $K$ on HS300 \(Corr $\uparrow$\)\.

Threshold values under $K=3$\. We compare three threshold configurations: \(0\.3, 0\.6, 0\.9\), \(0\.2, 0\.4, 0\.8\), and \(0\.1, 0\.5, 0\.9\)\. As shown in Table [IX](https://arxiv.org/html/2606.24062#S3.T9), all configurations yield comparable overall performance, with Corr ranging from 0\.0370 to 0\.0390 and less than 5\.2% relative variation\. We further conduct paired $t$\-tests on per\-sample MSE using \(0\.3, 0\.6, 0\.9\) as the reference\. Neither \(0\.2, 0\.4, 0\.8\) with $p=0.66$ nor \(0\.1, 0\.5, 0\.9\) with $p=0.19$ shows a statistically significant difference at the 0\.05 level\. This insensitivity is anticipated\. Since the importance scores $\tilde{s}_i$ are learned in an end\-to\-end manner, the model adjusts its score distribution to compensate for different threshold settings\. Consequently, the exact threshold values are non\-critical in practice\.
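For readers who want to repeat this kind of comparison, the paired $t$ statistic on per-sample errors can be computed as follows. This is a minimal standard-library sketch with made-up error values; converting the statistic to a p-value requires the Student $t$ distribution (for instance via `scipy.stats.ttest_rel`), which is not done here.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(errors_a, errors_b):
    """Paired t statistic and degrees of freedom for two matched samples.

    errors_a, errors_b: per-sample errors of two configurations evaluated
    on the same samples, in the same order.
    """
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    standard_error = stdev(diffs) / sqrt(n)  # sample std of the differences
    return mean(diffs) / standard_error, n - 1


# Made-up per-sample squared errors for a reference and an alternative setting.
reference = [1.0, 2.0, 3.0, 4.0]
alternative = [1.5, 1.5, 3.5, 3.0]
t_stat, dof = paired_t_statistic(reference, alternative)
print(t_stat, dof)
```

Pairing by sample removes the variance shared by both configurations, so the test responds only to differences attributable to the threshold setting.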
Table IX: Effect of threshold values under $K=3$ on HS300 \(Corr $\uparrow$\)\.

Maximum look\-back window\. As detailed in Table [X](https://arxiv.org/html/2606.24062#S3.T10), performance improves steadily as the window extends from 60 days \(0\.0361\) through 90 days \(0\.0378\) to the optimal 120 days \(0\.0388\)\. However, expanding the window further to 150 days leads to a slight degradation \(0\.0376\)\. This inflection point suggests that excessively long historical periods introduce stale patterns that exacerbate concept drift and eventually surpass the filtering capacity of the CIT\-based routing mechanism\.

Table X: Effect of maximum look\-back window on HS300 \(Corr $\uparrow$\)\.

Patch length $p_{\text{len}}$\. We vary $p_{\text{len}}\in\{8,12,16,20\}$ to study the effect of routing granularity\. As shown in Table [XI](https://arxiv.org/html/2606.24062#S3.T11), RAVEN is robust across the entire range, with overall Corr spanning from 0\.0372 to 0\.0390 \(less than 5% relative variation\)\. Paired $t$\-tests on per\-sample MSE using $p_{\text{len}}=16$ as the reference confirm no statistically significant difference for any alternative \($p=0.26$, $0.21$, $0.18$ for $p_{\text{len}}=8$, $12$, $20$, respectively\)\. This insensitivity arises because the learned importance scores $\tilde{s}_i$ adapt their distribution to the available routing granularity\. With fewer patches, each score carries more discriminative weight, compensating for the coarser resolution\.

Table XI: Effect of patch length $p_{\text{len}}$ on HS300 \(Corr $\uparrow$\)\. $N$ denotes the number of patches per window\.
### III\-G Case Study
To understand how RAVEN adapts its routing behavior, we visualize the learned patch importance scores $\tilde{s}_i$ under different conditions in Figure [6](https://arxiv.org/html/2606.24062#S3.F6)\.
\(a\) Yearly shift of 600176\.SS\.
\(b\) Cross\-ticker profiles \(2023\)\.
Figure 6: Distributions of Mean Patch Importance Score \(MPIS\) $\tilde{s}_i$ on HS300\. Each data point represents the annual average of the learned importance at a given patch index\. Patch index 1 corresponds to the most recent time segment\. \(a\) Annual mean importance profiles of stock 600176\.SS across five years, illustrating temporal regime adaptation\. \(b\) Annual mean importance profiles of four stocks within 2023, illustrating cross\-sectional heterogeneity\.

Temporal heterogeneity\. Figure [6\(a\)](https://arxiv.org/html/2606.24062#S3.F6.sf1) shows the annual mean importance profile of a single stock \(600176\.SS\) across five years\. In 2020, a year marked by extreme volatility, the router concentrates importance heavily on the most recent patch \($\tilde{s}_1=0.33$\) with steep decay toward older patches, effectively selecting short look\-back windows\. In contrast, years with sustained trends such as 2021, 2023, and 2024 exhibit flatter distributions that allocate meaningful weight to intermediate and distant patches\. This indicates that the router leverages longer historical context when the SNR of older observations is higher\. The temporal shift emerges purely from data\-driven learning of $\tilde{s}_i$, without explicit regime labels or calendar features\.
Cross\-sectional heterogeneity\.Figure[6\(b\)](https://arxiv.org/html/2606.24062#S3.F6.sf2)reveals that even within the same year \(2023\), different stocks exhibit markedly different annual mean importance profiles\. Some stocks show steep concentration on recent patches, suggesting short\-memory price dynamics, while others maintain more distributed weights across the full look\-back range, benefiting from longer historical context\. This cross\-sectional variation demonstrates that no single fixed window length is universally optimal\. Individual securities require different effective windows depending on their microstructure characteristics\. RAVEN addresses this by learning per\-sample importance scores that drive CIT\-based adaptive routing, enabling each input to select its own effective temporal scale without manual specification\.
To connect patch scoring with expert aggregation, we extend our visual inspection from the upstream patch importance curves to the downstream gate weights emitted at the expert aggregation stage\. Figures [7\(b\)](https://arxiv.org/html/2606.24062#S3.F7.sf2) and [7\(a\)](https://arxiv.org/html/2606.24062#S3.F7.sf1) illustrate the distributions of the routing weights across different financial assets and historical periods, respectively\. These downstream weight patterns closely mirror the context\-selection behavior observed in the upstream patch selection phase, suggesting that data\-driven CIT\-thresholded routing over learned importance scores translates into scale\-specialized representations\.
\(a\) Temporal evolution of 605117\.SS\.
\(b\) Cross\-ticker distribution \(2023\)\.
Figure 7: Empirical distributions of expert aggregation weights on HS300\. Expert 1 corresponds to the short\-horizon expert and Expert 3 to the long\-horizon expert\. Each bar shows the annual mean weight allocated to each expert\. \(a\) Weight evolution of stock 605117\.SS across five years, reflecting regime\-driven reallocation\. \(b\) Weight distribution across four stocks within 2023, reflecting asset\-specific routing preferences\.

Figure [7\(a\)](https://arxiv.org/html/2606.24062#S3.F7.sf1) tracks the expert weight allocation of stock 605117\.SS across five years\. In 2020, a period of heightened volatility, Expert 1 \(short\-horizon\) receives the largest share of weight, indicating that the model favors recent context when the market undergoes rapid structural change\. In contrast, during 2021 and 2022, Expert 3 \(long\-horizon\) dominates, as the model leverages extended historical context during relatively stable trending periods\. The weight distribution shifts again in 2023 and 2024, reflecting changing market regimes\. This adaptive reallocation confirms that the routing mechanism responds to non\-stationary dynamics without manual intervention\.
Figure[7\(b\)](https://arxiv.org/html/2606.24062#S3.F7.sf2)compares the weight allocations of four stocks within 2023\. Stock 605117\.SS allocates nearly half of its weight to Expert 1 \(short\-horizon\), suggesting rapid price dynamics that benefit from short look\-back contexts\. In contrast, 300888\.SZ assigns the majority of weight to Expert 3 \(long\-horizon\), indicating stable temporal dependencies that reward extended historical input\. The remaining two stocks fall between these extremes\. This cross\-sectional divergence mirrors the upstream patch importance heterogeneity observed in Figure[6\(b\)](https://arxiv.org/html/2606.24062#S3.F6.sf2), validating that the entire routing pipeline maintains consistent behavior from patch scoring through expert aggregation\.
## IV Related Work
Deep learning and foundation models for financial forecasting\.Classical financial forecasting relies heavily on gradient boosting trees \(e\.g\., XGBoost\[[6](https://arxiv.org/html/2606.24062#bib.bib9)\], LightGBM\[[17](https://arxiv.org/html/2606.24062#bib.bib10)\]\), which capture non\-linear interactions but treat forecasts as independent and identically distributed \(i\.i\.d\.\) tabular problems, fundamentally discarding the temporal topology of market regimes\[[3](https://arxiv.org/html/2606.24062#bib.bib13)\]\. Subsequent sequential architectures, including RNNs\[[10](https://arxiv.org/html/2606.24062#bib.bib19)\]and LSTMs\[[16](https://arxiv.org/html/2606.24062#bib.bib18)\], restored temporal memory but remain disproportionately biased toward recency, frequently conflating transient microstructure noise with structural regime shifts\[[28](https://arxiv.org/html/2606.24062#bib.bib14)\]\.
To address long\-range dependencies, general time\-series architectures have evolved rapidly\. Early efficient variants \(Informer\[[45](https://arxiv.org/html/2606.24062#bib.bib21)\], Autoformer\[[37](https://arxiv.org/html/2606.24062#bib.bib12)\]\) paved the way for advanced representation learning, precipitating breakthroughs in channel\-independent patching \(PatchTST\[[26](https://arxiv.org/html/2606.24062#bib.bib22)\]\), cross\-variate dependencies \(Crossformer\[[43](https://arxiv.org/html/2606.24062#bib.bib15)\]\), inverted attention \(iTransformer\[[22](https://arxiv.org/html/2606.24062#bib.bib23)\]\), and robust token blending \(CARD\[[34](https://arxiv.org/html/2606.24062#bib.bib16)\]\)\. Complementary paradigms have also demonstrated strong competitiveness; notably, frequency\-domain and wavelet models \(FEDformer\[[47](https://arxiv.org/html/2606.24062#bib.bib28)\], TimesNet\[[36](https://arxiv.org/html/2606.24062#bib.bib40)\], FredFormer\[[27](https://arxiv.org/html/2606.24062#bib.bib44)\], WPMixer\[[25](https://arxiv.org/html/2606.24062#bib.bib45)\]\) explicitly exploit spectral periodicity, while DLinear\[[41](https://arxiv.org/html/2606.24062#bib.bib27)\]and ModernTCN\[[24](https://arxiv.org/html/2606.24062#bib.bib26)\]established robust baselines using linear decomposition and modernized temporal convolutions\.
In the highly stochastic financial domain, specialized architectures have emerged to capture complex market dynamics via graph\-based trend prediction \(HIST\[[38](https://arxiv.org/html/2606.24062#bib.bib29)\]\), market\-guided attention \(MASTER\[[19](https://arxiv.org/html/2606.24062#bib.bib32)\]\), meta\-learning for distribution shifts \(DoubleAdapt\[[44](https://arxiv.org/html/2606.24062#bib.bib30)\]\), and early forms of conditional routing \(TRA\[[20](https://arxiv.org/html/2606.24062#bib.bib31)\]\)\.
However, despite their immense architectural diversity and expressive capacity, these models share a critical structural bottleneck: they strictly commit to a fixed context window or treat look\-back length as a static hyperparameter\. In the exceptionally low signal\-to\-noise ratio \(SNR\) and non\-stationary environment of financial markets, a static receptive field inevitably forces models to either truncate critical historical regime transitions or dilute actionable predictive signals with obsolete noise\.
Adaptive context and non\-stationarity in time series\.In highly volatile and non\-stationary time series domains, the temporal scale of a model’s receptive field must adaptively expand or contract to isolate meaningful predictive signals from transient noise\. The Multi\-period Learning Framework \(MLF\)\[[42](https://arxiv.org/html/2606.24062#bib.bib20)\]represents a notable attempt by simultaneously fusing varying look\-back lengths using Inter\-period Redundancy Filtering\. Similarly, PathFormer\[[5](https://arxiv.org/html/2606.24062#bib.bib24)\]captures varying temporal dynamics by dynamically aggregating features from a predefined set of fixed patch resolutions\. More recently, TimeSqueeze\[[1](https://arxiv.org/html/2606.24062#bib.bib33)\]attempts to alleviate fixed\-resolution bottlenecks by dynamically altering patch boundaries within a sequence based on local signal complexity\. However, these models fundamentally commit to either a rigidly fixed global look\-back window or a predefined set of static scales, failing to perceive the optimal global context length on the fly\. RAVEN bridges this gap at the architectural level\. By deploying a data\-driven CIT\-thresholded routing mechanism over learned patch importance scores, RAVEN dynamically generates nested, variable\-length historical windows for each sample, seamlessly preserving continuous temporal evolution without human\-defined scale constraints\.
MoE and routing topologies\. Sparse MoE layers effectively scale model capacity without proportionally increasing computational overhead \[[11](https://arxiv.org/html/2606.24062#bib.bib50)\]\. In the time\-series domain, foundation models like Time\-MoE \[[31](https://arxiv.org/html/2606.24062#bib.bib37)\] and Moirai\-MoE \[[21](https://arxiv.org/html/2606.24062#bib.bib39)\] scale this paradigm to billions of parameters, employing token\-level sparse routing\. Recent specialized architectures such as TFPS \[[33](https://arxiv.org/html/2606.24062#bib.bib34)\] follow a similar token\-level paradigm, utilizing subspace clustering to route individual patches to pattern\-specific experts\. These content\-based routing mechanisms are highly effective for localized pattern matching: experts specialize in patches with similar local morphology, regardless of where those patches appear in the sequence\. However, such routing primarily induces pattern\-level specialization rather than temporal\-scale specialization\. Since individual patches from different time positions may be dispatched to different experts, no expert is explicitly assigned to model a contiguous historical horizon\. This limits the ability of each expert to capture how local patterns interact with medium\- and long\-range temporal context, which is important for non\-stationary financial forecasting\. RAVEN adopts an orthogonal routing topology: temporal\-scale routing\. By ensuring each expert processes a contiguous, nested prefix of historical data, RAVEN preserves positional coherence within each horizon while enforcing data\-driven scale specialization\.
Positioning of our work\.Unlike fixed\-context models\[[32](https://arxiv.org/html/2606.24062#bib.bib46),[2](https://arxiv.org/html/2606.24062#bib.bib42)\]and predefined multi\-scale methods\[[42](https://arxiv.org/html/2606.24062#bib.bib20),[5](https://arxiv.org/html/2606.24062#bib.bib24),[1](https://arxiv.org/html/2606.24062#bib.bib33)\], RAVEN performs sample\-dependent context selection via CIT\-thresholded routing; unlike token\-level MoE\[[31](https://arxiv.org/html/2606.24062#bib.bib37),[21](https://arxiv.org/html/2606.24062#bib.bib39),[33](https://arxiv.org/html/2606.24062#bib.bib34)\]that specializes experts by patch content, RAVEN specializes by temporal scale through contiguous nested prefixes\. Two additional components address challenges unique to this topology: \(1\) Shape\-Aligned Fusion with CAW aligns variable\-length expert outputs and penalizes redundant representations caused by prefix overlap; \(2\) a GCR branch summarizes the full context in parallel, providing macro\-level information that localized experts may miss\.
## V Conclusion
We presented RAVEN, a regime\-aware MoE framework that replaces fixed\-length context windows with sample\-adaptive, variable\-length receptive fields for financial time series forecasting\. By leveraging an importance\-scoring mechanism coupled with CIT\-based thresholds, RAVEN dynamically extracts nested look\-back windows routed to scale\-specialized experts, while a GCR branch maintains macro\-level context coherence\. Furthermore, a CAW mechanism effectively decorrelates expert outputs, mitigating information redundancy stemming from overlapping patch inputs\.
Extensive experiments across financial and general time\-series tasks, including cumulative log\-return prediction on HS300 and S&P500, fund flow forecasting, and traffic forecasting on PEMS benchmarks, demonstrate the effectiveness and generality of RAVEN\. On the two equity benchmarks, RAVEN ranks first across all six overall metric–dataset combinations, improving Pearson correlation by 9\.2% on HS300 and 20\.2% on S&P500 over the strongest baselines; on fund flow forecasting, it reduces MSE by 18\.2%\. Beyond financial data, RAVEN achieves the best result in 14 of 16 traffic forecasting metrics, indicating that adaptive temporal\-scale routing is also beneficial for general time series\. Ablation studies confirm that adaptive routing, CAW, and the GCR branch each contribute independently, while efficiency and hyperparameter analyses show that RAVEN provides a favorable balance between accuracy and efficiency and remains stable across reasonable routing configurations\.
While RAVEN provides a foundation for adaptive financial modeling, several directions remain open\. First, integrating cross\-asset dependencies and macro\-regime indicators into the routing mechanism could better capture systemic market co\-movements\. Second, generalizing the framework to accommodate variable\-resolution patching would further enhance its multi\-scale expressiveness\. Finally, scaling RAVEN to high\-frequency intraday data and broader asset classes \(e\.g\., fixed income and commodities\) is a natural next step for testing how well the approach generalizes across financial markets\.
## References
- \[1\]\(2026\)TimeSqueeze: dynamic patching for efficient time series forecasting\.CoRRabs/2603\.11352\.External Links:[Link](https://doi.org/10.48550/arXiv.2603.11352),[Document](https://dx.doi.org/10.48550/ARXIV.2603.11352),2603\.11352Cited by:[§IV](https://arxiv.org/html/2606.24062#S4.p5.1),[§IV](https://arxiv.org/html/2606.24062#S4.p7.1)\.
- \[2\]A\. F\. Ansari, L\. Stella, A\. C\. Türkmen, X\. Zhang, P\. Mercado, H\. Shen, O\. Shchur, S\. S\. Rangapuram, S\. Pineda\-Arango, S\. Kapoor, J\. Zschiegner, D\. C\. Maddix, H\. Wang, M\. W\. Mahoney, K\. Torkkola, A\. G\. Wilson, M\. Bohlke\-Schneider, and B\. Wang\(2024\)Chronos: learning the language of time series\.Trans\. Mach\. Learn\. Res\.2024\.External Links:[Link](https://openreview.net/forum?id=gerNCVqqtR)Cited by:[§IV](https://arxiv.org/html/2606.24062#S4.p7.1)\.
- \[3\]G\. Bontempi, S\. B\. Taieb, and Y\. L\. Borgne\(2012\)Machine learning strategies for time series forecasting\.InBusiness Intelligence \- Second European Summer School, eBISS 2012, Brussels, Belgium, July 15\-21, 2012, Tutorial Lectures,M\. Aufaure and E\. Zimányi \(Eds\.\),Lecture Notes in Business Information Processing,pp\. 62–77\.External Links:[Link](https://doi.org/10.1007/978-3-642-36318-4_3),[Document](https://dx.doi.org/10.1007/978-3-642-36318-4%5F3)Cited by:[§IV](https://arxiv.org/html/2606.24062#S4.p1.1)\.
- \[4\]C\. Challu, K\. G\. Olivares, B\. N\. Oreshkin, F\. G\. Ramírez, M\. M\. Canseco, and A\. Dubrawski\(2023\)NHITS: neural hierarchical interpolation for time series forecasting\.InThirty\-Seventh AAAI Conference on Artificial Intelligence, AAAI 2023, Thirty\-Fifth Conference on Innovative Applications of Artificial Intelligence, IAAI 2023, Thirteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2023, Washington, DC, USA, February 7\-14, 2023,B\. Williams, Y\. Chen, and J\. Neville \(Eds\.\),pp\. 6989–6997\.External Links:[Link](https://doi.org/10.1609/aaai.v37i6.25854),[Document](https://dx.doi.org/10.1609/AAAI.V37I6.25854)Cited by:[4th item](https://arxiv.org/html/2606.24062#S3.I1.i4.p1.1)\.
- \[5\]P\. Chen, Y\. Zhang, Y\. Cheng, Y\. Shu, Y\. Wang, Q\. Wen, B\. Yang, and C\. Guo\(2024\)Pathformer: multi\-scale transformers with adaptive pathways for time series forecasting\.InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7\-11, 2024,External Links:[Link](https://openreview.net/forum?id=lJkOCMP2aW)Cited by:[4th item](https://arxiv.org/html/2606.24062#S3.I1.i4.p1.1),[§IV](https://arxiv.org/html/2606.24062#S4.p5.1),[§IV](https://arxiv.org/html/2606.24062#S4.p7.1)\.
- \[6\]T\. Chen and C\. Guestrin\(2016\)XGBoost: A scalable tree boosting system\.InProceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13\-17, 2016,B\. Krishnapuram, M\. Shah, A\. J\. Smola, C\. C\. Aggarwal, D\. Shen, and R\. Rastogi \(Eds\.\),pp\. 785–794\.External Links:[Link](https://doi.org/10.1145/2939672.2939785),[Document](https://dx.doi.org/10.1145/2939672.2939785)Cited by:[§I](https://arxiv.org/html/2606.24062#S1.p2.1),[§IV](https://arxiv.org/html/2606.24062#S4.p1.1)\.
- \[7\]K\. Cho, B\. van Merrienboer, Ç\. Gülçehre, D\. Bahdanau, F\. Bougares, H\. Schwenk, and Y\. Bengio\(2014\)Learning phrase representations using RNN encoder\-decoder for statistical machine translation\.InProceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25\-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL,A\. Moschitti, B\. Pang, and W\. Daelemans \(Eds\.\),pp\. 1724–1734\.External Links:[Link](https://doi.org/10.3115/v1/d14-1179),[Document](https://dx.doi.org/10.3115/V1/D14-1179)Cited by:[§I](https://arxiv.org/html/2606.24062#S1.p2.1)\.
- \[8\]F\. Corsi\(2009\)A simple approximate long\-memory model of realized volatility\.Journal of financial econometrics7\(2\),pp\. 174–196\.Cited by:[§I](https://arxiv.org/html/2606.24062#S1.p4.1)\.
- \[9\]F\. X\. Diebold and R\. S\. Mariano\(2002\)Comparing predictive accuracy\.Journal of Business & economic statistics20\(1\),pp\. 134–144\.Cited by:[§II\-A](https://arxiv.org/html/2606.24062#S2.SS1.p4.3)\.
- \[10\]J\. L\. Elman\(1990\)Finding structure in time\.Cogn\. Sci\.14\(2\),pp\. 179–211\.External Links:[Link](https://doi.org/10.1207/s15516709cog1402_1),[Document](https://dx.doi.org/10.1207/S15516709COG1402%5F1)Cited by:[§I](https://arxiv.org/html/2606.24062#S1.p2.1),[§IV](https://arxiv.org/html/2606.24062#S4.p1.1)\.
- \[11\]W\. Fedus, B\. Zoph, and N\. Shazeer\(2022\)Switch transformers: scaling to trillion parameter models with simple and efficient sparsity\.J\. Mach\. Learn\. Res\.23,pp\. 120:1–120:39\.External Links:[Link](https://jmlr.org/papers/v23/21-0998.html)Cited by:[§II\-D](https://arxiv.org/html/2606.24062#S2.SS4.p4.3),[§IV](https://arxiv.org/html/2606.24062#S4.p6.1)\.
- \[12\]J\. H\. Friedman\(2001\)Greedy function approximation: a gradient boosting machine\.Annals of statistics,pp\. 1189–1232\.Cited by:[§I](https://arxiv.org/html/2606.24062#S1.p2.1)\.
- \[13\]S\. Gu, B\. Kelly, and D\. Xiu\(2020\)Empirical asset pricing via machine learning\.The Review of Financial Studies33\(5\),pp\. 2223–2273\.Cited by:[§I](https://arxiv.org/html/2606.24062#S1.p1.3),[§II\-A](https://arxiv.org/html/2606.24062#S2.SS1.p1.5)\.
- \[14\]S\. Guo, Y\. Lin, N\. Feng, C\. Song, and H\. Wan\(2019\)Attention based spatial\-temporal graph convolutional networks for traffic flow forecasting\.InThe Thirty\-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty\-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 \- February 1, 2019,pp\. 922–929\.External Links:[Link](https://doi.org/10.1609/aaai.v33i01.3301922),[Document](https://dx.doi.org/10.1609/AAAI.V33I01.3301922)Cited by:[§III\-A](https://arxiv.org/html/2606.24062#S3.SS1.p4.1),[§III\-C](https://arxiv.org/html/2606.24062#S3.SS3.p1.1)\.
- \[15\]J\. Hasbrouck\(1991\)Measuring the information content of stock trades\.The Journal of Finance46\(1\),pp\. 179–207\.Cited by:[§I](https://arxiv.org/html/2606.24062#S1.p1.3),[§II\-A](https://arxiv.org/html/2606.24062#S2.SS1.p1.5)\.
- \[16\]S\. Hochreiter and J\. Schmidhuber\(1997\)Long short\-term memory\.Neural Comput\.9\(8\),pp\. 1735–1780\.External Links:[Link](https://doi.org/10.1162/neco.1997.9.8.1735),[Document](https://dx.doi.org/10.1162/NECO.1997.9.8.1735)Cited by:[§I](https://arxiv.org/html/2606.24062#S1.p2.1),[§IV](https://arxiv.org/html/2606.24062#S4.p1.1)\.
- \[17\]G\. Ke, Q\. Meng, T\. Finley, T\. Wang, W\. Chen, W\. Ma, Q\. Ye, and T\. Liu\(2017\)LightGBM: A highly efficient gradient boosting decision tree\.InAdvances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4\-9, 2017, Long Beach, CA, USA,I\. Guyon, U\. von Luxburg, S\. Bengio, H\. M\. Wallach, R\. Fergus, S\. V\. N\. Vishwanathan, and R\. Garnett \(Eds\.\),pp\. 3146–3154\.External Links:[Link](https://proceedings.neurips.cc/paper/2017/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html)Cited by:[§I](https://arxiv.org/html/2606.24062#S1.p2.1),[§IV](https://arxiv.org/html/2606.24062#S4.p1.1)\.
- \[18\]T\. Kim, J\. Kim, Y\. Tae, C\. Park, J\. Choi, and J\. Choo\(2022\)Reversible instance normalization for accurate time\-series forecasting against distribution shift\.InThe Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25\-29, 2022,External Links:[Link](https://openreview.net/forum?id=cGDAkQo1C0p)Cited by:[§II\-C](https://arxiv.org/html/2606.24062#S2.SS3.p1.1)\.
- \[19\]T\. Li, Z\. Liu, Y\. Shen, X\. Wang, H\. Chen, and S\. Huang\(2024\)Master: market\-guided stock transformer for stock price forecasting\.InProceedings of the AAAI conference on artificial intelligence,Vol\.38,pp\. 162–170\.Cited by:[§IV](https://arxiv.org/html/2606.24062#S4.p3.1)\.
- \[20\]H\. Lin, D\. Zhou, W\. Liu, and J\. Bian\(2021\)Learning multiple stock trading patterns with temporal routing adaptor and optimal transport\.InKDD ’21: The 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, Singapore, August 14\-18, 2021,F\. Zhu, B\. C\. Ooi, and C\. Miao \(Eds\.\),pp\. 1017–1026\.External Links:[Link](https://doi.org/10.1145/3447548.3467358),[Document](https://dx.doi.org/10.1145/3447548.3467358)Cited by:[§IV](https://arxiv.org/html/2606.24062#S4.p3.1)\.
- \[21\]X\. Liu, J\. Liu, G\. Woo, T\. Aksu, Y\. Liang, R\. Zimmermann, C\. Liu, J\. Li, S\. Savarese, C\. Xiong, and D\. Sahoo\(2025\)Moirai\-moe: empowering time series foundation models with sparse mixture of experts\.InForty\-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Canada, July 13\-19, 2025,A\. Singh, M\. Fazel, D\. Hsu, S\. Lacoste\-Julien, F\. Berkenkamp, T\. Maharaj, K\. Wagstaff, and J\. Zhu \(Eds\.\),Proceedings of Machine Learning Research\.External Links:[Link](https://proceedings.mlr.press/v267/liu25an.html)Cited by:[§IV](https://arxiv.org/html/2606.24062#S4.p6.1),[§IV](https://arxiv.org/html/2606.24062#S4.p7.1)\.
- \[22\]Y\. Liu, T\. Hu, H\. Zhang, H\. Wu, S\. Wang, L\. Ma, and M\. Long\(2024\)ITransformer: inverted transformers are effective for time series forecasting\.InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7\-11, 2024,External Links:[Link](https://openreview.net/forum?id=JePfAI8fah)Cited by:[§I](https://arxiv.org/html/2606.24062#S1.p3.2),[3rd item](https://arxiv.org/html/2606.24062#S3.I1.i3.p1.1),[§IV](https://arxiv.org/html/2606.24062#S4.p2.1)\.
- \[23\]A\. W\. Lo\(2004\)The adaptive markets hypothesis: market efficiency from an evolutionary perspective\.Journal of Portfolio Management, Forthcoming\.Cited by:[§I](https://arxiv.org/html/2606.24062#S1.p1.3)\.
- \[24\]D\. Luo and X\. Wang\(2024\)ModernTCN: A modern pure convolution structure for general time series analysis\.InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7\-11, 2024,External Links:[Link](https://openreview.net/forum?id=vpJMJerXHU)Cited by:[§IV](https://arxiv.org/html/2606.24062#S4.p2.1)\.
- \[25\]M\. M\. N\. Murad, M\. Aktukmak, and Y\. Yilmaz\(2025\)WPMixer: efficient multi\-resolution mixing for long\-term time series forecasting\.InThirty\-Ninth AAAI Conference on Artificial Intelligence, Thirty\-Seventh Conference on Innovative Applications of Artificial Intelligence, Fifteenth Symposium on Educational Advances in Artificial Intelligence, AAAI 2025, Philadelphia, PA, USA, February 25 \- March 4, 2025,T\. Walsh, J\. Shah, and Z\. Kolter \(Eds\.\),pp\. 19581–19588\.External Links:[Link](https://doi.org/10.1609/aaai.v39i18.34156),[Document](https://dx.doi.org/10.1609/AAAI.V39I18.34156)Cited by:[§I](https://arxiv.org/html/2606.24062#S1.p3.2),[§IV](https://arxiv.org/html/2606.24062#S4.p2.1)\.
- \[26\]Y\. Nie, N\. H\. Nguyen, P\. Sinthong, and J\. Kalagnanam\(2023\)A time series is worth 64 words: long\-term forecasting with transformers\.InThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1\-5, 2023,External Links:[Link](https://openreview.net/forum?id=Jbdc0vTOcol)Cited by:[§I](https://arxiv.org/html/2606.24062#S1.p3.2),[§II\-C](https://arxiv.org/html/2606.24062#S2.SS3.p2.1),[§II\-C](https://arxiv.org/html/2606.24062#S2.SS3.p3.4),[1st item](https://arxiv.org/html/2606.24062#S3.I1.i1.p1.1),[§IV](https://arxiv.org/html/2606.24062#S4.p2.1)\.
- \[27\]X\. Piao, Z\. Chen, T\. Murayama, Y\. Matsubara, and Y\. Sakurai\(2024\)Fredformer: frequency debiased transformer for time series forecasting\.InProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2024, Barcelona, Spain, August 25\-29, 2024,R\. Baeza\-Yates and F\. Bonchi \(Eds\.\),pp\. 2400–2410\.External Links:[Link](https://doi.org/10.1145/3637528.3671928),[Document](https://dx.doi.org/10.1145/3637528.3671928)Cited by:[§I](https://arxiv.org/html/2606.24062#S1.p3.2),[§IV](https://arxiv.org/html/2606.24062#S4.p2.1)\.
- \[28\]O\. B\. Sezer, M\. U\. Gudelek, and A\. M\. Özbayoglu\(2020\)Financial time series forecasting with deep learning : A systematic literature review: 2005\-2019\.Appl\. Soft Comput\.90,pp\. 106181\.External Links:[Link](https://doi.org/10.1016/j.asoc.2020.106181),[Document](https://dx.doi.org/10.1016/J.ASOC.2020.106181)Cited by:[§IV](https://arxiv.org/html/2606.24062#S4.p1.1)\.
- \[29\]M\. A\. Shabani, A\. H\. Abdi, L\. Meng, and T\. Sylvain\(2023\)Scaleformer: iterative multi\-scale refining transformers for time series forecasting\.InThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1\-5, 2023,External Links:[Link](https://openreview.net/forum?id=sCrnllCtjoE)Cited by:[4th item](https://arxiv.org/html/2606.24062#S3.I1.i4.p1.1)\.
- \[30\]N\. Shazeer, A\. Mirhoseini, K\. Maziarz, A\. Davis, Q\. V\. Le, G\. E\. Hinton, and J\. Dean\(2017\)Outrageously large neural networks: the sparsely\-gated mixture\-of\-experts layer\.In5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24\-26, 2017, Conference Track Proceedings,External Links:[Link](https://openreview.net/forum?id=B1ckMDqlg)Cited by:[§II\-D](https://arxiv.org/html/2606.24062#S2.SS4.p4.3)\.
- \[31\]X\. Shi, S\. Wang, Y\. Nie, D\. Li, Z\. Ye, Q\. Wen, and M\. Jin\(2025\)Time\-moe: billion\-scale time series foundation models with mixture of experts\.InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24\-28, 2025,External Links:[Link](https://openreview.net/forum?id=e1wDDFmlVu)Cited by:[§IV](https://arxiv.org/html/2606.24062#S4.p6.1),[§IV](https://arxiv.org/html/2606.24062#S4.p7.1)\.
- \[32\]Y\. Shi, Z\. Fu, S\. Chen, B\. Zhao, W\. Xu, C\. Zhang, and J\. Li\(2026\)Kronos: A foundation model for the language of financial markets\.InFortieth AAAI Conference on Artificial Intelligence, Thirty\-Eighth Conference on Innovative Applications of Artificial Intelligence, Sixteenth Symposium on Educational Advances in Artificial Intelligence, AAAI 2026, Singapore, January 20\-27, 2026,S\. Koenig, C\. Jenkins, and M\. E\. Taylor \(Eds\.\),pp\. 25366–25373\.External Links:[Link](https://doi.org/10.1609/aaai.v40i30.39730),[Document](https://dx.doi.org/10.1609/AAAI.V40I30.39730)Cited by:[§IV](https://arxiv.org/html/2606.24062#S4.p7.1)\.
- \[33\]Y\. Sun, Z\. Xie, E\. Eldele, D\. Chen, Q\. Hu, and M\. Wu\(2025\)Learning pattern\-specific experts for time series forecasting under patch\-level distribution shift\.InAdvances in Neural Information Processing Systems,D\. Belgrave, C\. Zhang, H\. Lin, R\. Pascanu, P\. Koniusz, M\. Ghassemi, and N\. Chen \(Eds\.\),Vol\.38,pp\. 91810–91844\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2025/file/8491a7fcc218946b471b600a915c8b02-Paper-Conference.pdf)Cited by:[§IV](https://arxiv.org/html/2606.24062#S4.p6.1),[§IV](https://arxiv.org/html/2606.24062#S4.p7.1)\.
- \[34\]X\. Wang, T\. Zhou, Q\. Wen, J\. Gao, B\. Ding, and R\. Jin\(2024\)CARD: channel aligned robust blend transformer for time series forecasting\.InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7\-11, 2024,External Links:[Link](https://openreview.net/forum?id=MJksrOhurE)Cited by:[§IV](https://arxiv.org/html/2606.24062#S4.p2.1)\.
- \[35\]K\. D\. West\(1996\)Asymptotic inference about predictive ability\.Econometrica64\(5\),pp\. 1067–1084\.Cited by:[§II\-A](https://arxiv.org/html/2606.24062#S2.SS1.p4.3)\.
- \[36\]H\. Wu, T\. Hu, Y\. Liu, H\. Zhou, J\. Wang, and M\. Long\(2023\)TimesNet: temporal 2d\-variation modeling for general time series analysis\.InThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1\-5, 2023,External Links:[Link](https://openreview.net/forum?id=ju_Uqw384Oq)Cited by:[§I](https://arxiv.org/html/2606.24062#S1.p3.2),[5th item](https://arxiv.org/html/2606.24062#S3.I1.i5.p1.1),[§IV](https://arxiv.org/html/2606.24062#S4.p2.1)\.
- \[37\]H\. Wu, J\. Xu, J\. Wang, and M\. Long\(2021\)Autoformer: decomposition transformers with auto\-correlation for long\-term series forecasting\.InAdvances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6\-14, 2021, virtual,M\. Ranzato, A\. Beygelzimer, Y\. N\. Dauphin, P\. Liang, and J\. W\. Vaughan \(Eds\.\),pp\. 22419–22430\.External Links:[Link](https://proceedings.neurips.cc/paper/2021/hash/bcc0d400288793e8bdcd7c19a8ac0c2b-Abstract.html)Cited by:[§I](https://arxiv.org/html/2606.24062#S1.p1.3),[§IV](https://arxiv.org/html/2606.24062#S4.p2.1)\.
- \[38\]W\. Xu, W\. Liu, L\. Wang, Y\. Xia, J\. Bian, J\. Yin, and T\. Liu\(2021\)HIST: A graph\-based framework for stock trend forecasting via mining concept\-oriented shared information\.CoRRabs/2110\.13716\.External Links:[Link](https://arxiv.org/abs/2110.13716),2110\.13716Cited by:[§IV](https://arxiv.org/html/2606.24062#S4.p3.1)\.
- \[39\]W\. Xu, D\. Xiang, Y\. Liu, X\. Wang, Y\. Ma, L\. Zhang, C\. Xu, and J\. Zhang\(2025\)FinMultiTime: A four\-modal bilingual dataset for financial time\-series analysis\.CoRRabs/2506\.05019\.External Links:[Link](https://doi.org/10.48550/arXiv.2506.05019),[Document](https://dx.doi.org/10.48550/ARXIV.2506.05019),2506\.05019Cited by:[§III\-A](https://arxiv.org/html/2606.24062#S3.SS1.p2.1)\.
- \[40\]X\. Yang, W\. Liu, D\. Zhou, J\. Bian, and T\. Liu\(2020\)Qlib: an ai\-oriented quantitative investment platform\.CoRRabs/2009\.11189\.External Links:[Link](https://arxiv.org/abs/2009.11189),2009\.11189Cited by:[§III\-B](https://arxiv.org/html/2606.24062#S3.SS2.p4.4)\.
- \[41\]A\. Zeng, M\. Chen, L\. Zhang, and Q\. Xu\(2023\)Are transformers effective for time series forecasting?\.InThirty\-Seventh AAAI Conference on Artificial Intelligence, AAAI 2023, Thirty\-Fifth Conference on Innovative Applications of Artificial Intelligence, IAAI 2023, Thirteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2023, Washington, DC, USA, February 7\-14, 2023,B\. Williams, Y\. Chen, and J\. Neville \(Eds\.\),pp\. 11121–11128\.External Links:[Link](https://doi.org/10.1609/aaai.v37i9.26317),[Document](https://dx.doi.org/10.1609/AAAI.V37I9.26317)Cited by:[§IV](https://arxiv.org/html/2606.24062#S4.p2.1)\.
- \[42\]X\. Zhang, Z\. Huang, Y\. Wu, X\. Lu, E\. Qi, Y\. Chen, Z\. Xue, Q\. Wang, P\. Wang, and W\. Wang\(2025\)Multi\-period learning for financial time series forecasting\.InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, V\.1, KDD 2025, Toronto, ON, Canada, August 3\-7, 2025,Y\. Sun, F\. Chierichetti, H\. W\. Lauw, C\. Perlich, W\. H\. Tok, and A\. Tomkins \(Eds\.\),pp\. 2848–2859\.External Links:[Link](https://doi.org/10.1145/3690624.3709422),[Document](https://dx.doi.org/10.1145/3690624.3709422)Cited by:[§I](https://arxiv.org/html/2606.24062#S1.p4.1),[2nd item](https://arxiv.org/html/2606.24062#S3.I1.i2.p1.1),[§III\-A](https://arxiv.org/html/2606.24062#S3.SS1.p3.1),[§IV](https://arxiv.org/html/2606.24062#S4.p5.1),[§IV](https://arxiv.org/html/2606.24062#S4.p7.1)\.
- \[43\]Y\. Zhang and J\. Yan\(2023\)Crossformer: transformer utilizing cross\-dimension dependency for multivariate time series forecasting\.InThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1\-5, 2023,External Links:[Link](https://openreview.net/forum?id=vSVLM2j9eie)Cited by:[3rd item](https://arxiv.org/html/2606.24062#S3.I1.i3.p1.1),[§IV](https://arxiv.org/html/2606.24062#S4.p2.1)\.
- \[44\]L\. Zhao, S\. Kong, and Y\. Shen\(2023\)DoubleAdapt: A meta\-learning approach to incremental learning for stock trend forecasting\.InProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2023, Long Beach, CA, USA, August 6\-10, 2023,A\. K\. Singh, Y\. Sun, L\. Akoglu, D\. Gunopulos, X\. Yan, R\. Kumar, F\. Ozcan, and J\. Ye \(Eds\.\),pp\. 3492–3503\.External Links:[Link](https://doi.org/10.1145/3580305.3599315),[Document](https://dx.doi.org/10.1145/3580305.3599315)Cited by:[§IV](https://arxiv.org/html/2606.24062#S4.p3.1)\.
- \[45\]H\. Zhou, S\. Zhang, J\. Peng, S\. Zhang, J\. Li, H\. Xiong, and W\. Zhang\(2021\)Informer: beyond efficient transformer for long sequence time\-series forecasting\.InThirty\-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty\-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2\-9, 2021,pp\. 11106–11115\.External Links:[Link](https://doi.org/10.1609/aaai.v35i12.17325),[Document](https://dx.doi.org/10.1609/AAAI.V35I12.17325)Cited by:[§I](https://arxiv.org/html/2606.24062#S1.p3.2),[§IV](https://arxiv.org/html/2606.24062#S4.p2.1)\.
- \[46\]T\. Zhou, Z\. Ma, X\. Wang, Q\. Wen, L\. Sun, T\. Yao, W\. Yin, and R\. Jin\(2022\)FiLM: frequency improved legendre memory model for long\-term time series forecasting\.InAdvances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 \- December 9, 2022,S\. Koyejo, S\. Mohamed, A\. Agarwal, D\. Belgrave, K\. Cho, and A\. Oh \(Eds\.\),External Links:[Link](http://papers.nips.cc/paper_files/paper/2022/hash/524ef58c2bd075775861234266e5e020-Abstract-Conference.html)Cited by:[5th item](https://arxiv.org/html/2606.24062#S3.I1.i5.p1.1)\.
- \[47\]T\. Zhou, Z\. Ma, Q\. Wen, X\. Wang, L\. Sun, and R\. Jin\(2022\)FEDformer: frequency enhanced decomposed transformer for long\-term series forecasting\.InInternational Conference on Machine Learning, ICML 2022, 17\-23 July 2022, Baltimore, Maryland, USA,K\. Chaudhuri, S\. Jegelka, L\. Song, C\. Szepesvári, G\. Niu, and S\. Sabato \(Eds\.\),Proceedings of Machine Learning Research,pp\. 27268–27286\.External Links:[Link](https://proceedings.mlr.press/v162/zhou22g.html)Cited by:[5th item](https://arxiv.org/html/2606.24062#S3.I1.i5.p1.1),[§IV](https://arxiv.org/html/2606.24062#S4.p2.1)\.