Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It
Summary
This paper identifies that chain-of-thought supervised fine-tuning degrades long-context recall in hybrid linear-attention models by biasing attention gradients toward short-range patterns, and proposes QK-Restore, a training-free method that restores long-context recall while preserving reasoning performance.
Source: https://huggingface.co/papers/2606.11052 Published on Jun 9
Submitted by Zhou (https://huggingface.co/xinyu04) on Jun 10
Abstract
Chain-of-thought supervised fine-tuning degrades long-context recall in hybrid linear-attention models by biasing attention gradients toward short-range patterns, but a training-free method called QK-Restore can restore long-context capabilities by reverting query-key projections while preserving reasoning performance.
Chain-of-thought (CoT) supervised fine-tuning (SFT) is widely adopted to improve reasoning ability, yet we find that it systematically degrades long-context recall in hybrid linear-attention models. Across architectures including HypeNet and Jet-Nemotron, retrieval performance on Needle-In-A-Haystack (NIAH) deteriorates substantially after CoT-SFT, and the degradation becomes more severe under harder retrieval settings and longer context windows. For example, HypeNet-9B on NIAH-S2@256K decreases from 67.2% to 9.4%. We attribute this to CoT-SFT biasing attention gradients toward short-range patterns, disrupting query-key projections (W_Q, W_K) that are responsible for long-range routing. Motivated by this observation, we propose QK-Restore, a training-free method that restores only W_Q and W_K from the pre-SFT checkpoint while preserving all other post-SFT parameters. We further introduce a Procrustes variant to balance routing preservation and reasoning adaptation. Across architectures, QK-Restore consistently restores long-context capability at zero training cost while preserving reasoning performance; for instance, on HypeNet-5B it improves S3@256K from 65.4% to 76.4% while maintaining strong reasoning performance.
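The restore step described in the abstract (take W_Q and W_K from the pre-SFT checkpoint, keep everything else from the post-SFT checkpoint) can be sketched as a merge of two parameter dictionaries. This is a minimal illustration, not the authors' implementation: the function name `qk_restore`, the substring markers `q_proj` and `k_proj`, and the toy parameter names are all assumptions, since the abstract does not say how parameters are named or which layers are touched. The Procrustes variant is not covered here because the abstract gives no details about it.

```python
from typing import Dict, Iterable, TypeVar

T = TypeVar("T")  # any parameter value: a tensor, an array, or a toy list


def qk_restore(
    pre_sft: Dict[str, T],
    post_sft: Dict[str, T],
    qk_markers: Iterable[str] = ("q_proj", "k_proj"),  # assumed naming scheme
) -> Dict[str, T]:
    """Return post-SFT parameters with query/key projections taken from pre-SFT.

    A parameter counts as a query/key projection when its name contains one
    of `qk_markers`. That naming convention is an assumption for this sketch.
    """
    markers = tuple(qk_markers)
    merged = dict(post_sft)  # start from the fine-tuned weights
    for name in post_sft:
        if any(marker in name for marker in markers):
            if name not in pre_sft:
                raise KeyError(f"{name} is missing from the pre-SFT checkpoint")
            merged[name] = pre_sft[name]  # revert only W_Q / W_K
    return merged


# Toy checkpoints with hypothetical parameter names and values.
pre = {
    "layers.0.attn.q_proj.weight": [1.0, 2.0],
    "layers.0.attn.k_proj.weight": [3.0, 4.0],
    "layers.0.attn.v_proj.weight": [5.0, 6.0],
    "layers.0.mlp.up_proj.weight": [7.0, 8.0],
}
post = {
    "layers.0.attn.q_proj.weight": [1.5, 2.5],
    "layers.0.attn.k_proj.weight": [3.5, 4.5],
    "layers.0.attn.v_proj.weight": [5.5, 6.5],
    "layers.0.mlp.up_proj.weight": [7.5, 8.5],
}
merged = qk_restore(pre, post)
```

In the toy example, the two query/key entries of `merged` come from `pre`, while the value projection and MLP entries keep their `post` values. Because the function only manipulates a name-to-value mapping, the same sketch would apply to a framework state dict, provided the assumed name markers match the model's actual parameter names.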
Similar Articles
Rethinking the Role of Efficient Attention in Hybrid Architectures
This paper systematically analyzes the role of efficient attention modules in hybrid language model architectures, finding that different designs converge in long-context performance under sufficient training, and that long-range retrieval is primarily carried by full attention while efficient attention shapes the optimization trajectory, revealing a 'Large-Window Laziness' phenomenon.
Long-Context Reasoning Through Proxy-Based Chain-of-Thought Tuning
Proposes ProxyCoT, a training framework that improves long-context reasoning in large language models by first obtaining chain-of-thought reasoning traces on short proxy contexts (via reinforcement learning or distillation) and then grounding them in full long contexts through supervised fine-tuning. Experiments show consistent improvements over baselines with reduced computational cost.
LongAct: Harnessing Intrinsic Activation Patterns for Long-Context Reinforcement Learning
LongAct proposes a saliency-guided sparse update strategy for improving long-context reasoning in LLMs by selectively updating weights associated with high-magnitude activations in query and key vectors, achieving ~8% improvement on LongBench v2.
Less Context, More Accuracy: A Bi-Temporal Memory Engine for LLM Agents Where a Lean Retrieved Context Beats the Full History
This paper introduces Engram, an open-source bi-temporal memory engine for LLM agents that retrieves a compact context slice (∼9.6k tokens) to outperform the full-history baseline (79k tokens) by 10.4 accuracy points on LongMemEval, using a hybrid read path fusing dense, lexical, graph, and temporal signals.
Variational Linear Attention: Stable Associative Memory for Long-Context Transformers
This paper introduces Variational Linear Attention (VLA), a method that stabilizes memory states in linear attention mechanisms for long-context transformers. VLA reframes memory updates as an online regularized least-squares problem, proving bounded state norms and demonstrating significant speedups and improved retrieval accuracy over standard linear attention and DeltaNet.