Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It

Hugging Face Daily Papers 06/09/26, 12:00 AM Papers

Summary

This paper identifies that chain-of-thought supervised fine-tuning degrades long-context recall in hybrid linear-attention models by biasing attention gradients toward short-range patterns, and proposes QK-Restore, a training-free method that restores long-context recall while preserving reasoning performance.

Chain-of-thought (CoT) supervised fine-tuning (SFT) is widely adopted to improve reasoning ability, yet we find that it systematically degrades long-context recall in hybrid linear-attention models. Across architectures including HypeNet and Jet-Nemotron, retrieval performance on Needle-In-A-Haystack (NIAH) deteriorates substantially after CoT-SFT, and the degradation becomes more severe under harder retrieval settings and longer context windows. For example, HypeNet-9B on NIAH-S2@256K decreases from 67.2% to 9.4%. We attribute this to CoT-SFT biasing attention gradients toward short-range patterns, disrupting query-key projections (W_Q, W_K) that are responsible for long-range routing. Motivated by this observation, we propose QK-Restore, a training-free method that restores only W_Q and W_K from the pre-SFT checkpoint while preserving all other post-SFT parameters. We further introduce a Procrustes variant to balance routing preservation and reasoning adaptation. Across architectures, QK-Restore consistently restores long-context capability at zero training cost while preserving reasoning performance; for instance, on HypeNet-5B it improves S3@256K from 65.4% to 76.4% while maintaining strong reasoning performance.
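The restore step described above (take W_Q and W_K from the pre-SFT checkpoint, keep everything else from the post-SFT checkpoint) can be illustrated as a merge of two parameter dictionaries. This is a minimal sketch, not the authors' implementation. The function name `qk_restore` and the name patterns `q_proj` / `k_proj` are assumptions made for illustration; real checkpoints may name their projections differently, and the abstract does not say which layers of a hybrid model are restored.

```python
from typing import Any, Dict, Sequence

# Assumption: query/key projection weights can be recognised by these
# substrings in their parameter names. Adjust for the actual checkpoint.
QK_PATTERNS: Sequence[str] = ("q_proj", "k_proj")


def qk_restore(
    pre_sft: Dict[str, Any],
    post_sft: Dict[str, Any],
    patterns: Sequence[str] = QK_PATTERNS,
) -> Dict[str, Any]:
    """Return a new parameter dict that copies post_sft everywhere,
    except for query/key projections, which come from pre_sft."""
    merged = dict(post_sft)  # shallow copy; the inputs are not modified
    for name in post_sft:
        if any(p in name for p in patterns):
            if name not in pre_sft:
                raise KeyError(f"{name} is missing from the pre-SFT checkpoint")
            merged[name] = pre_sft[name]
    return merged
```

With PyTorch, the two inputs would typically be the mappings returned by `state_dict()` for each checkpoint, and the merged result could be loaded with `load_state_dict()`. No gradient step is involved, which matches the "training-free" description.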

Original Article

Cached at: 06/10/26, 05:44 AM

Source: https://huggingface.co/papers/2606.11052 Published on Jun 9

Submitted by Zhou (https://huggingface.co/xinyu04) on Jun 10

Abstract

Chain-of-thought supervised fine-tuning degrades long-context recall in hybrid linear-attention models by biasing attention gradients toward short-range patterns, but a training-free method called QK-Restore can restore long-context capabilities by reverting query-key projections while preserving reasoning performance.

Chain-of-thought (CoT) supervised fine-tuning (SFT) is widely adopted to improve reasoning ability, yet we find that it systematically degrades long-context recall in hybrid linear-attention models. Across architectures including HypeNet and Jet-Nemotron, retrieval performance on Needle-In-A-Haystack (NIAH) deteriorates substantially after CoT-SFT, and the degradation becomes more severe under harder retrieval settings and longer context windows. For example, HypeNet-9B on NIAH-S2@256K decreases from 67.2% to 9.4%. We attribute this to CoT-SFT biasing attention gradients toward short-range patterns, disrupting query-key projections (W_Q, W_K) that are responsible for long-range routing. Motivated by this observation, we propose QK-Restore, a training-free method that restores only W_Q and W_K from the pre-SFT checkpoint while preserving all other post-SFT parameters. We further introduce a Procrustes variant to balance routing preservation and reasoning adaptation. Across architectures, QK-Restore consistently restores long-context capability at zero training cost while preserving reasoning performance; for instance, on HypeNet-5B it improves S3@256K from 65.4% to 76.4% while maintaining strong reasoning performance.
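The abstract does not say how the Procrustes variant is formulated. For orientation only, the classical orthogonal Procrustes problem finds an orthogonal matrix R that minimizes the Frobenius norm of A R - B, and its closed-form solution comes from a singular value decomposition of A^T B. The sketch below solves that generic problem with NumPy. The function name and the idea of applying it to pre- and post-SFT projection matrices are assumptions here, not the paper's stated procedure.

```python
import numpy as np


def orthogonal_procrustes(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Return the orthogonal matrix R minimizing ||a @ R - b||_F.

    Classical closed form: with a.T @ b = U S V^T, the minimizer is U V^T.
    """
    u, _, vt = np.linalg.svd(a.T @ b)
    return u @ vt


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.standard_normal((6, 4))
    # Build a known orthogonal matrix via QR, then check it is recovered.
    q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
    r = orthogonal_procrustes(a, a @ q)
    print(np.allclose(r, q))
```

An orthogonal alignment of this kind changes a matrix only by a rotation or reflection, which is one plausible way to trade off keeping the original routing against adapting to fine-tuned weights; whether the paper uses it this way is not stated in the abstract.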

Similar Articles

Rethinking the Role of Efficient Attention in Hybrid Architectures

Long-Context Reasoning Through Proxy-Based Chain-of-Thought Tuning

LongAct: Harnessing Intrinsic Activation Patterns for Long-Context Reinforcement Learning

Less Context, More Accuracy: A Bi-Temporal Memory Engine for LLM Agents Where a Lean Retrieved Context Beats the Full History

Variational Linear Attention: Stable Associative Memory for Long-Context Transformers
