# Retrieval Mechanisms Surpass Long-Context Scaling in Time Series Forecasting
Source: [https://arxiv.org/html/2605.08217](https://arxiv.org/html/2605.08217)
Rishi Ahuja†, Kumar Prateek†, Simranjit Singh†⊠, Vijay Kumar†

†Department of Information Technology, Dr. B.R. Ambedkar National Institute of Technology Jalandhar, Punjab, 144008, India. {rishia.it.24, kumarprateek, singhsimranjit, vijayk}@nitj.ac.in
###### Abstract
Time Series Foundation Models (TSFMs) have borrowed the long-context paradigm from natural language processing under the premise that feeding more history into the model improves forecast quality. But in stochastic domains, distant history is often just high-frequency noise, not signal. Hence, the proposed work tests whether this premise actually holds by running continuous-context architectures (PatchTST included) through the ETTh1 benchmark. The obtained results contradict the premise: an inverse scaling law shows up clearly, with forecasting error rising as context gets longer. A 3,000-step window causes performance to drop by over 68%, evidence that attention mechanisms are poor at ignoring irrelevant historical volatility. Retrieval-Augmented Forecasting (RAFT) is evaluated as an alternative. RAFT achieves a mean squared error (MSE) of 0.379 with a fixed 720-step window and selective retrieval, outperforming both long-context configurations and zero-shot foundation models (Chronos, Moirai) despite requiring far less computation. In addition, the retrieval step injects only the most relevant historical segments as dynamic exogenous variables, which gives the model a context-informed inductive bias it cannot build on its own from raw sequences. Therefore, foundation models going forward need to shift architecturally toward selective retrieval.
⊠ Corresponding Author.

## 1 Introduction
Time Series Foundation Models (TSFMs) owe much of their recent popularity to the "scaling hypothesis," the idea that larger models with longer input windows will produce monotonically better forecasts. Moirai (Woo et al., [2024](https://arxiv.org/html/2605.08217#bib.bib1)) and Chronos (Ansari et al., [2024](https://arxiv.org/html/2605.08217#bib.bib2)) have given weight to that idea: both achieve strong zero-shot results across multiple forecasting domains by utilizing scale. But a harder question remains unresolved: can the "Long Context" logic of Large Language Models (LLMs) be transplanted directly into the stochastic world of time series? There are good reasons to doubt it. Language tokens retain semantic content even when they appear thousands of positions earlier in a document. Numerical time series do not behave this way. Old values in a stochastic process are frequently just noise, and their statistical relationship to future values weakens sharply over time. Therefore, the proposed work advances the following hypothesis: when the context window in time series forecasting is expanded without discrimination, a failure mode called Stochastic Noise Accumulation takes over. Irrelevant historical values flood the attention computation, the mechanism loses the ability to weight important recent patterns appropriately, and forecasting accuracy suffers. Williams et al. ([2025](https://arxiv.org/html/2605.08217#bib.bib3)) show that short textual descriptions alongside numerical inputs yield large accuracy gains, highlighting the poverty of raw numerical context. Pineda Arango et al. ([2025](https://arxiv.org/html/2605.08217#bib.bib4)) find that exogenous variables are critical for downstream task adaptation. This work contends that selective retrieval, not blind window expansion, is the correct mechanism for injecting useful context.
The work uses the ETTh1 benchmark for empirical validation and compares three model types: vanilla Transformers, PatchTST (Nie et al., [2023](https://arxiv.org/html/2605.08217#bib.bib5)) (the current state-of-the-art), and a Retrieval-Augmented Forecasting (RAFT) baseline. The experiments produce an Inverse Scaling Law for stochastic time series: a) Long Context Fails: PatchTST accuracy drops by 68% when context grows from 720 to 3,000 tokens. Even advanced patching cannot screen out the noise in long historical windows. b) Retrieval Wins: RAFT (Han et al., [2025](https://arxiv.org/html/2605.08217#bib.bib6)) uses a 720-step window plus selective retrieval and scores the best MSE (0.379), outperforming all long-context and zero-shot foundation model baselines. Retrieved segments function as dynamic exogenous variables that provide a context-informed inductive bias, and they do so without the noise penalties attached to long continuous windows.
## 2 Related Work
### 2.1 Time Series Foundation Models and Context Scaling
Moirai (Woo et al., [2024](https://arxiv.org/html/2605.08217#bib.bib1)) and Chronos (Ansari et al., [2024](https://arxiv.org/html/2605.08217#bib.bib2)) achieve strong zero-shot transfer across multiple forecasting domains by training on large heterogeneous corpora, demonstrating that universal temporal patterns can be captured at scale. PatchTST (Nie et al., [2023](https://arxiv.org/html/2605.08217#bib.bib5)) advances supervised long-context forecasting through channel-independent patching, reducing quadratic attention overhead while enabling very long input sequences. However, a premise shared across these architectures, namely that longer lookback windows translate into higher accuracy, does not hold in stochastic settings. The stochastic portion of the input sequence is dominated by high-frequency volatility with little predictive value. Blindly extending context therefore creates a noise accumulation bottleneck, a failure mode absent from the text-based tasks where these designs originated.
### 2.2 Context-Informed Forecasting and Retrieval
Auxiliary information has proven valuable for forecast robustness. Williams et al. ([2025](https://arxiv.org/html/2605.08217#bib.bib3)) show that textual grounding of forecasts substantially reduces ambiguity in foundation models and call context "key." Furthermore, Pineda Arango et al. ([2025](https://arxiv.org/html/2605.08217#bib.bib4)) demonstrate a complementary point through ChronosX: exogenous variables enable pretrained models to specialize for particular downstream tasks. The shared lesson from both is that predictions benefit when models receive curated context rather than raw data. Yet autoregressive models lack a built-in mechanism for relevance-based filtering of their input history. RAFT fills that gap. It treats the historical record as a pool from which only the most similar segments are retrieved via cosine similarity. Stochastic noise is rejected at the retrieval stage, and what enters the model has high information density. The net effect is a dynamic inductive bias rooted in the same philosophy as exogenous signal injection but implemented through a retrieval mechanism.
## 3 Methodology and Analysis
### 3.1 The Experimental Framework
The proposed work chooses the ETTh1 (Electricity Transformer Temperature) benchmark (Zhou et al., [2021](https://arxiv.org/html/2605.08217#bib.bib7)) because its high-frequency stochastic volatility makes it challenging for any model that relies on long historical context. Separating genuine signals from random fluctuations is difficult on this dataset, making it a natural stress test for the "Long Context" hypothesis. A prediction horizon of H = 96 steps is fixed throughout. The lookback window L is varied across {720, 1440, 3000} so that the impact of context scaling can be measured in isolation. Mean Squared Error (MSE) and Mean Absolute Error (MAE) are used as the evaluation metrics. The experiment pits Continuous Context models against a Retrieved Context model: a) Vanilla Transformer (Baseline): an encoder-decoder architecture with d_model = 512 and n_heads = 8, used to observe how standard attention behaves as context grows. b) PatchTST (SOTA Long-Context): uses channel independence and patching (P = 16, S = 8) per Nie et al. ([2023](https://arxiv.org/html/2605.08217#bib.bib5)). It represents the scaling hypothesis at maximum strength: long context, low computational cost. c) RAFT (Retrieval Baseline): keeps the lookback short (L = 720) and pulls in relevant history through top-k cosine similarity retrieval (Han et al., [2025](https://arxiv.org/html/2605.08217#bib.bib6)). This tests whether high-quality context in a small window outperforms noisy context in a large one.
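To make the patching configuration concrete, the following is a minimal PyTorch sketch (not the authors' code) of channel-independent patching with P = 16 and S = 8, assuming PatchTST's usual end-replication padding; `patchify` is an illustrative helper name.

```python
import torch
import torch.nn as nn

def patchify(series: torch.Tensor, patch_len: int = 16, stride: int = 8) -> torch.Tensor:
    """Split one channel's lookback window into overlapping patch tokens.

    series: (batch, L) tensor for a single channel (channel independence).
    returns: (batch, num_patches, patch_len), num_patches = (L - patch_len) // stride + 2.
    """
    # Replication-pad the end of the window by one stride so the final timesteps
    # are covered, then take a sliding-window view over the padded sequence.
    padded = nn.ReplicationPad1d((0, stride))(series.unsqueeze(1)).squeeze(1)
    return padded.unfold(dimension=-1, size=patch_len, step=stride)

x = torch.randn(32, 720)   # batch of 32 single-channel lookback windows (L = 720)
tokens = patchify(x)       # (32, 90, 16): 90 patch tokens, matching Appendix G's count
```

Each patch token, rather than each raw timestep, becomes one attention position, which is what keeps the computational cost low even at L = 3000 (375 patches).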
### 3.2 The Inverse Scaling Law
The experiments yield a result that runs opposite to NLP scaling laws. Figure [A1](https://arxiv.org/html/2605.08217#A2.F1) shows a clear Inverse Scaling Law: longer context does not reduce error. MSE rises at every step as the window gets larger. The law holds at prediction horizons beyond H = 96 (see Appendix [H](https://arxiv.org/html/2605.08217#A8) for results at H = 336 and H = 720). The Vanilla Transformer breaks down most severely. Going from 720 to 3,000 steps pushes MSE up by about 200%. Self-attention, with that many positions to attend over, distributes weight too broadly and cannot recover the relevant signal. PatchTST handles the problem better but still fails. At 3,000 steps, its MSE increases by 68% (from 0.385 to 0.647). Patching, though designed for long sequences, fails to reject noise effectively over extended histories. Practitioners who assume patching eliminates noise should therefore reconsider that assumption in light of these results.
### 3.3 Generalization Across Domains
The work tests two more benchmarks, ETTh2 (Zhou et al., [2021](https://arxiv.org/html/2605.08217#bib.bib7)) (electricity domain) and Exchange Rate (financial domain), to verify that the inverse scaling law is not just a quirk of ETTh1.
Table 1: Cross-Domain Validation: Degradation of Long Context

Table [1](https://arxiv.org/html/2605.08217#S3.T1) shows severe degradation across both additional datasets. On Exchange Rate data, where financial volatility is high, the long-context configuration degrades by 276%. Noise accumulation is not dataset-specific; it is a structural problem for stochastic time series, and it worsens proportionally with the volatility of the underlying process.
### 3.4 Mechanism Analysis: Stochastic Noise Accumulation
The proposed work attributes the observed degradation to Stochastic Noise Accumulation. NLP tokens far back in the sequence frequently carry high information density. However, ETTh1 time series do not share the same property, as distant points are mostly uncorrelated stochastic fluctuations without predictive utility. Consequently, the softmax denominator grows with L, spreading probability mass across positions that contribute nothing useful, resulting in a diminishing share allocated to genuinely informative recent timesteps. Internal representations are therefore over-smoothed: the model loses the high-frequency detail required for accurate prediction. The observation is consistent with Williams et al. ([2025](https://arxiv.org/html/2605.08217#bib.bib3)), who find that context quality matters far more than context volume when the signal-to-noise ratio is low.
Quantitative measurements of attention entropy further confirm the mechanism. As context grows from L = 336 (42 patches) to L = 3000 (375 patches), the normalized attention entropy of PatchTST rises from 0.952 to 0.989, approaching the theoretical maximum of 1.0, which corresponds to a uniform distribution. At L = 3000, the effective attention rank collapses to 0.1: no single patch receives meaningfully differentiated weight. In practice, the attention mechanism degenerates into uniform averaging over hundreds of irrelevant positions. The full per-layer entropy measurements appear in Appendix [G](https://arxiv.org/html/2605.08217#A7).
### 3.5 The Superiority of Selective Retrieval
RAFT outperforms the scaling-based models because its context-informed inductive bias pre-filters history before attention ever operates on it.
Figure 1: Performance degradation trend. Error rate (Y-axis) rises monotonically as Context Length (X-axis) increases from 720 to 3000, illustrating Stochastic Noise Accumulation.

Figure [1](https://arxiv.org/html/2605.08217#S3.F1) shows RAFT at an MSE of 0.379 with only 720 steps of direct context. PatchTST at 720 steps scores 0.385, so RAFT is slightly better even without the scaling advantage. At 3000 steps PatchTST degrades to 0.647, making RAFT the clear winner. Moreover, RAFT stabilizes its forecasts and avoids the noise penalty that scaling imposes by injecting retrieved segments as dynamic exogenous variables (Pineda Arango et al., [2025](https://arxiv.org/html/2605.08217#bib.bib4)). Therefore, retrieval is a more effective noise-rejection strategy than window extension for stochastic data.
To assess whether pretrained foundation models can bypass the inverse scaling law, the proposed work also evaluates Chronos-T5-Small (Ansari et al., [2024](https://arxiv.org/html/2605.08217#bib.bib2)) and Moirai-1.1-R-Small (Woo et al., [2024](https://arxiv.org/html/2605.08217#bib.bib1)) in zero-shot settings on the ETTh1 test set. Specifically, Chronos operates univariately with autoregressive rolling (chunk size 64), while Moirai is evaluated channel-independently for a fair comparison with PatchTST. Both fall substantially short of RAFT's 0.379 (Chronos-T5-Small: MSE 0.483; Moirai-1.1-R-Small: MSE 0.553) despite pretraining on far larger corpora, revealing that RAFT's advantage extends beyond long-context scaling. Table [2](https://arxiv.org/html/2605.08217#S3.T2) details the comparison of zero-shot foundation models to trained models on ETTh1 at H = 96.
Table 2: Foundation Model Zero-Shot vs. Trained Models on ETTh1 (H = 96).

Neither foundation model matches RAFT or even PatchTST at L = 720. Chronos at its best (L = 720, MSE 0.483) is 27% above RAFT, and Moirai (L = 720, MSE 0.553) is 46% above. Hence, zero-shot generality does not compensate for targeted retrieval on stochastic data.
## 4 Discussion
### 4.1 The Stochastic Context Dilemma
NLP scaling laws predict that more context produces better outputs. The inverse scaling law documented here contradicts that prediction; the reason lies in a basic difference between the two data types. Language tokens and factual claims retain informativeness regardless of how far back they appear. ETTh1 measurements from 2,000 steps ago are essentially uncorrelated noise with respect to the prediction target. Furthermore, over 3,000 positions, the attention softmax inevitably distributes probability mass over inputs carrying no forecasting information. The proposed work characterizes the effect as Attention Entropy inflation. Specifically, measured normalized entropy rises from 0.952 (L = 336) to 0.989 (L = 3000), confirming near-complete entropy saturation at extreme context lengths (Section [3.4](https://arxiv.org/html/2605.08217#S3.SS4)). Relevant recent signals are drowned out. More history helps only when it carries high semantic density; stochastic volatility violates that condition by definition.
### 4.2 The Efficiency-Accuracy Trade-off
Training cost makes the case against long context even stronger: a) Long Context (Inefficient): the Vanilla Transformer at 3,000 steps required 83.47 minutes of training and produced MSE 1.323, the worst result in the study. b) Retrieval (Efficient): RAFT trained in 2.13 minutes (roughly 40× faster) and produced MSE 0.379, the best result. The expectation behind NLP scaling laws is that spending more compute yields better model quality. In time series forecasting, spending more compute on longer context yields worse quality and wastes resources in the process.
### 4.3 Retrieval as a Denoising Inductive Bias
RAFT works well because its top-k retrieval stage operates as a "hard attention" gate before the standard "soft attention" layer. Two stages run in sequence, forming a noise-rejection pipeline: a) Filter: retrieval discards segments with low similarity. The surviving segments have high fidelity to the current forecasting situation. b) Process: the Transformer receives only these filtered inputs, treating them as dynamic exogenous variables (Pineda Arango et al., [2025](https://arxiv.org/html/2605.08217#bib.bib4)) rather than undifferentiated history. Moreover, long-range dependencies remain exploitable through retrieval, without the noise accumulation that accompanies brute-force window scaling.
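A minimal single-channel sketch of this filter-then-process idea follows, assuming a sliding-window candidate pool and cosine-similarity ranking; `retrieve_top_k` is an illustrative helper, and the exact retrieval and fusion details of RAFT (Han et al., 2025) differ.

```python
import torch
import torch.nn.functional as F

def retrieve_top_k(query: torch.Tensor, history: torch.Tensor, window: int, k: int = 5) -> torch.Tensor:
    """Filter stage: select the k historical segments most similar to the query window.

    query:   (window,) tensor with the most recent lookback values of one channel.
    history: (T,) tensor holding the earlier historical record for that channel.
    returns: (k, window) tensor of retrieved segments, ranked by cosine similarity.
    """
    # Slide a length-`window` view over the history to form all candidate segments.
    candidates = history.unfold(dimension=0, size=window, step=1)        # (T - window + 1, window)
    sims = F.cosine_similarity(candidates, query.unsqueeze(0), dim=-1)   # similarity per candidate
    return candidates[sims.topk(k).indices]

# Process stage (sketch): retrieved segments enter the forecaster as extra
# "dynamic exogenous" channels, so attention only ever sees pre-filtered history.
history = torch.randn(10_000)                      # illustrative historical record
query = history[-720:]                             # fixed 720-step lookback window
exogenous = retrieve_top_k(query, history[:-720], window=720, k=5)
model_input = torch.cat([query.unsqueeze(0), exogenous], dim=0)   # (1 + k, 720)
```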
## 5 Limitations
The three benchmarks tested (ETTh1, ETTh2, Exchange Rate) all exhibit stochastic, mean-reverting dynamics; whether the inverse scaling law holds on trend-dominated or strongly seasonal series therefore remains open. The main experiments fix H = 96, and foundation models are not re-evaluated at longer horizons except at H = 336 and H = 720 (detailed in Appendix [H](https://arxiv.org/html/2605.08217#A8)). Furthermore, the sensitivity of the headline degradation figures to training hyperparameters (detailed in Appendix [I](https://arxiv.org/html/2605.08217#A9)) and the single-GPU timing measurements may limit how far the results generalize across configurations and hardware.
## 6 Conclusion and Future Work
The proposed work has tested the "Long Context" hypothesis for time series forecasting and found it wanting. On ETTh1, PatchTST can lose up to 68% of its accuracy when context extends from 720 to 3,000 steps, an Inverse Scaling Law driven by noise accumulation in stochastic data. RAFT (Han et al., [2025](https://arxiv.org/html/2605.08217#bib.bib6)), with selective retrieval and a constrained 720-step window, reaches MSE 0.379, well below the expensive long-context alternatives. Context quality, not context volume, determines forecast accuracy. Additionally, the inverse scaling law holds at longer prediction horizons (H = 336, H = 720), and the degradation mechanism is corroborated by attention entropy measurements, where normalized entropy reaches 0.989 at L = 3000. Hence, the next frontier is to integrate retrieval directly into pretraining. Foundation models that use dynamic retrieval heads and are trained end-to-end to identify the most useful historical segments could achieve robust zero-shot performance without the noise penalty of static windows. Such an architecture would put into practice the "Context is Key" philosophy advocated by Williams et al. ([2025](https://arxiv.org/html/2605.08217#bib.bib3)) and make stochastic time series forecasting substantially more reliable.
## References
- A. F. Ansari, L. Stella, C. Turkmen, X. Zhang, P. Mercado, H. Shen, O. Shchur, S. S. Rangapuram, S. Pineda Arango, S. Kapoor, J. Zschiegner, D. C. Maddix, M. W. Mahoney, K. Torkkola, A. G. Wilson, M. Bohlke-Schneider, and Y. Wang (2024) Chronos: learning the language of time series. Transactions on Machine Learning Research.
- Han et al. (2025) Retrieval augmented time series forecasting. In Proceedings of the 42nd International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 267, pp. 21774–21797.
- Y. Nie, N. H. Nguyen, P. Sinthong, and J. Kalagnanam (2023) A time series is worth 64 words: long-term forecasting with transformers. In International Conference on Learning Representations.
- S. Pineda Arango, P. Mercado, S. Kapoor, A. F. Ansari, L. Stella, H. Shen, H. H. J. Senetaire, A. C. Turkmen, O. Shchur, D. C. Maddix, M. Bohlke-Schneider, B. Wang, and S. S. Rangapuram (2025) ChronosX: adapting pretrained time series models with exogenous variables. In Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, Vol. 258, pp. 2242–2250.
- A. R. Williams, A. Ashok, É. Marcotte, V. Zantedeschi, J. Subramanian, R. Riachi, J. Requeima, A. Lacoste, I. Rish, N. Chapados, and A. Drouin (2025) Context is key: a benchmark for forecasting with essential textual information. In Proceedings of the 42nd International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 267, pp. 66887–66944.
- G. Woo, C. Liu, A. Kumar, C. Xiong, S. Savarese, and D. Sahoo (2024) Unified training of universal time series forecasting transformers. In Forty-first International Conference on Machine Learning.
- H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, and W. Zhang (2021) Informer: beyond efficient transformer for long sequence time-series forecasting. In AAAI Conference on Artificial Intelligence, Vol. 35, pp. 11106–11115.
## Appendix
## Appendix A AI Use Disclosure
Large language models were used to assist with grammar refinement and manuscript editing. All scientific content, experimental results, and conclusions are the sole work of the authors.
## Appendix B Experimental Configuration
Hyperparameter settings and training configurations are reported here for reproducibility. Figure [A1](https://arxiv.org/html/2605.08217#A2.F1) shows the Inverse Scaling Law already discussed in Section [3.2](https://arxiv.org/html/2605.08217#S3.SS2).
Figure A1: The Inverse Scaling Law on ETTh1. RAFT (Green) holds error low (MSE 0.379) with a fixed window. PatchTST (Orange), when extended to 3000 steps, can see its performance degrade by up to 68% (MSE 0.647). The Vanilla Transformer (Red) collapses entirely.

### B.1 Model Hyperparameters
Table [A1](https://arxiv.org/html/2605.08217#A2.T1) gives the full configuration of each model. The same optimization settings apply across all experiments to keep the comparison controlled.

Table A1: Complete Hyperparameter Configurations
### B.2 Training Protocol
Adam with a learning rate of 0.0001 was used for all models. If validation MSE failed to improve for 3 consecutive epochs, training was stopped early. The random seed for weight initialization was 2021 throughout, which makes results deterministic.
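For concreteness, the following is a hedged sketch of this training protocol (Adam at lr 0.0001, early stopping with a patience of 3 epochs, seed 2021); the model, data loaders, and the epoch budget and learning-rate schedule described in Appendix I are assumed or omitted.

```python
import torch

torch.manual_seed(2021)   # fixed weight-initialization seed, as in the protocol

def train(model, train_loader, val_loader, max_epochs: int = 10, patience: int = 3) -> float:
    """Train with Adam (lr 1e-4) and stop early when validation MSE stalls."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    criterion = torch.nn.MSELoss()
    best_val, stale_epochs = float("inf"), 0

    for _ in range(max_epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            criterion(model(x), y).backward()
            optimizer.step()

        # Early stopping is driven by validation MSE.
        model.eval()
        with torch.no_grad():
            val_mse = sum(criterion(model(x), y).item() for x, y in val_loader) / len(val_loader)
        if val_mse < best_val:
            best_val, stale_epochs = val_mse, 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:   # no improvement for 3 consecutive epochs
                break
    return best_val
```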
## Appendix C Dataset Details
The inverse scaling law hypothesis is tested on three benchmarks that span different domains and noise profiles.
### C.1 ETTh1 Dataset
ETTh1 (Electricity Transformer Temperature) contains hourly transformer readings collected over 600 days (14,400 timesteps total). The measured features are HUFL, HULL, MUFL, MULL, LUFL, LULL, and OT (Oil Temperature), giving seven channels. A 60/20/20 split was used for training, validation, and test partitions. Each feature was normalized independently using StandardScaler.
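A short sketch of this preparation step is given below, assuming the standard `ETTh1.csv` file layout; fitting the scaler on the training partition only is a common convention that the paper does not state explicitly.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Chronological 60/20/20 split of the 7-channel ETTh1 series, with per-feature
# standardization (one mean/std pair per column).
df = pd.read_csv("ETTh1.csv", index_col="date")   # HUFL, HULL, MUFL, MULL, LUFL, LULL, OT

n = len(df)
train_end, val_end = int(0.6 * n), int(0.8 * n)
train, val, test = df.iloc[:train_end], df.iloc[train_end:val_end], df.iloc[val_end:]

scaler = StandardScaler()                       # independent statistics per feature
train_scaled = scaler.fit_transform(train)      # fit on the training partition (assumed)
val_scaled, test_scaled = scaler.transform(val), scaler.transform(test)
```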
### C.2 Cross-Domain Datasets
ETTh2 has the same format as ETTh1 but comes from a different transformer station. The Exchange Rate dataset contains daily rates for 8 currencies over 7,588 trading days. Both datasets were included because their noise characteristics differ from each other, allowing the proposed work to check whether the inverse scaling law generalizes beyond a single domain.
## Appendix D Complete Results
Table [A2](https://arxiv.org/html/2605.08217#A4.T2) reports MSE and training time for every configuration tested on ETTh1.
Table A2: Complete Results on ETTh1

### D.1 Degradation Analysis
Table [A3](https://arxiv.org/html/2605.08217#A4.T3) reports the MSE increase when context is extended from 720 to 3000 steps, broken down by model and dataset.

Table A3: Performance Degradation Across Contexts

Degradation is consistent across datasets. Financial data shows the largest increase, as expected given that currency exchange rates exhibit particularly high stochastic volatility.
## Appendix E Computational Environment
Experiments were run on cloud GPU infrastructure (16GB RAM, 8 CPU cores). Python 3.9, PyTorch 2.0.1, and CUDA 11.8 formed the core of the software stack; numpy 1.24, pandas 2.0, and scikit-learn 1.3 were also used. Training time grows quadratically with context length. At 3000 steps the Vanilla Transformer needed 83.47 minutes per run; RAFT needed 2.13 minutes for the same prediction task, about 40× faster, while also producing a better MSE.
## Appendix F Reproducibility
All datasets are publicly available from the Informer repository. The train-validation-test split is 60/20/20 in every experiment, and StandardScaler normalization is applied per feature. Random seeds are fixed at 2021. Early stopping halts training if validation MSE does not drop for 3 epochs. Replication requires downloading the datasets, applying the hyperparameters in Table [A1](https://arxiv.org/html/2605.08217#A2.T1), and running training with the Adam optimizer at the specified learning rate. Standard PyTorch was used throughout; no custom modifications were made. The complete source code, training scripts, and evaluation pipelines are publicly available at [https://github.com/RishiAhuja/ahuja2026retrieval](https://github.com/RishiAhuja/ahuja2026retrieval).
## Appendix G Attention Entropy Measurements
Normalized attention entropy was computed per layer across all attention heads in PatchTST at each context length. Entropy is defined as H = −∑ᵢ pᵢ log pᵢ, normalized by log N, where N is the number of patches. An entropy of 1.0 corresponds to a perfectly uniform (random) attention distribution and an entropy of 0.0 corresponds to a sparse single-position spike.
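The quantity can be computed directly from a row of attention weights; the sketch below is an illustration rather than the authors' measurement script, and it omits the per-layer and per-head aggregation reported in this appendix.

```python
import torch

def normalized_attention_entropy(attn: torch.Tensor) -> torch.Tensor:
    """Normalized entropy of attention weights over the last dimension.

    attn: (..., N) tensor of attention probabilities (each row sums to 1).
    Returns values in [0, 1]; 1.0 corresponds to a perfectly uniform distribution.
    """
    eps = 1e-12                                              # numerical guard for log(0)
    entropy = -(attn * (attn + eps).log()).sum(dim=-1)       # H = -sum_i p_i log p_i
    return entropy / torch.log(torch.tensor(float(attn.shape[-1])))   # divide by log N

# Illustration: uniform attention over 375 patches (the L = 3000 setting) gives
# normalized entropy ~1.0, i.e. no patch is preferentially weighted.
uniform = torch.full((375,), 1.0 / 375)
print(normalized_attention_entropy(uniform))
```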
At L = 336 (42 patches), mean normalized entropy is 0.952, indicating that attention is already fairly diffuse but retains some discriminative structure. Similarly, at L = 720 (90 patches), entropy rises to 0.971. Furthermore, at L = 3000 (375 patches), entropy reaches 0.989, within 1.1% of the theoretical maximum. Additionally, the effective attention rank (computed as exp(H), the exponential of the entropy) collapses from 8.3 at L = 336 to 0.1 at L = 3000, confirming that the attention mechanism has degenerated into near-uniform averaging at extreme context lengths.
## Appendix H Multi-Horizon Evaluation
The inverse scaling law is tested at longer prediction horizons (H ∈ {336, 720}) to verify that the phenomenon generalizes beyond H = 96. Table [A4](https://arxiv.org/html/2605.08217#A8.T4) details the multi-horizon results on ETTh1.
Table A4: Multi-Horizon Results on ETTh1 (MSE ↓).

As evident from Table [A4](https://arxiv.org/html/2605.08217#A8.T4), the inverse scaling law holds at all tested horizons. PatchTST degradation worsens at longer horizons: +59% at H = 336 and +193% at H = 720 when the context extends from 720 to 3000. Additionally, RAFT maintains the lowest MSE at every horizon without any context extension.
## Appendix I Training Configuration Sensitivity
The magnitude of long-context degradation depends on the training protocol. Under the schedule used in the main experiments (10 epochs, step-decay learning rate), PatchTST reaches MSE 0.647 at L = 3000 (+68% over L = 720). The same architecture, when trained with an extended schedule (100 epochs, cosine annealing learning rate), achieves MSE 0.426 at L = 3000, narrowing the gap to approximately 8% over L = 720 (MSE 0.395). Importantly, RAFT (MSE 0.379) remains superior under both training configurations, and the need for careful hyperparameter tuning to mitigate long-context degradation is itself a practical limitation of the continuous-context paradigm.