Language Models Need Sleep

Hugging Face Daily Papers, 05/25/26, 12:00 AM

Summary

This paper proposes a sleep-like consolidation mechanism for transformer models that uses fast weights and recurrent passes to improve long-context processing while maintaining inference speed.

Transformer-based large language models are increasingly used for long-horizon tasks; however, their attention mechanism scales poorly with context length. To address this, we study a sleep-like consolidation mechanism in which a model periodically converts recent context into persistent fast weights before clearing its key-value cache. During sleep, the model performs N offline recurrent passes over the accumulated context and updates the fast weights in its state-space model (SSM) blocks through a learned local rule. At inference time, this shifts extra computation into sleep while preserving the latency of wake-time prediction. We test our method on controlled synthetic tasks, including cellular automata and multi-hop graph retrieval, as well as on a realistic math reasoning task, on which both a regular transformer and SSM-attention hybrid models fail. We then show that increasing the sleep duration N improves our models' performance, with the largest gains on examples that require deeper reasoning.
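The abstract does not specify the architecture or the learned update rule, so the following is only a toy sketch of the wake/sleep control flow it describes, not the authors' method. Assumptions (none taken from the paper): a single linear fast-weight matrix `W` stands in for the fast weights of the SSM blocks, a fixed delta rule with step size `lr` replaces the *learned* local rule, keys and values are supplied directly instead of being produced by a transformer, and sleep is triggered whenever the KV cache reaches `cache_limit` entries. All names (`SleepyMemory`, `wake_step`, `sleep`, `read`, `recall_error`, `cache_limit`, `sleep_passes`) are hypothetical.

```python
"""Toy wake/sleep memory (hypothetical sketch, not the paper's implementation).

Assumptions not taken from the paper:
- one linear fast-weight matrix W stands in for the fast weights of the SSM blocks;
- a fixed delta rule with step size `lr` replaces the paper's *learned* local rule;
- keys and values are supplied directly instead of being computed by a transformer;
- sleep is triggered when the KV cache reaches `cache_limit` entries.
"""
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(W, x):
    return [dot(row, x) for row in W]


class SleepyMemory:
    def __init__(self, dim, cache_limit, sleep_passes, lr=0.5):
        self.dim = dim
        self.cache_limit = cache_limit    # max KV entries held while awake
        self.sleep_passes = sleep_passes  # N offline recurrent passes per sleep
        self.lr = lr
        self.W = [[0.0] * dim for _ in range(dim)]  # persistent fast weights
        self.cache = []                   # KV cache: list of (key, value) pairs

    def wake_step(self, key, value):
        """Append one token to the KV cache; sleep once the cache is full."""
        self.cache.append((key, value))
        if len(self.cache) >= self.cache_limit:
            self.sleep()

    def sleep(self):
        """Consolidate the cache into W with N delta-rule passes, then clear it."""
        for _ in range(self.sleep_passes):
            for k, v in self.cache:
                pred = matvec(self.W, k)
                err = [vi - pi for vi, pi in zip(v, pred)]
                # Local update W += lr * err k^T: depends only on W and this (k, v) pair.
                for i in range(self.dim):
                    for j in range(self.dim):
                        self.W[i][j] += self.lr * err[i] * k[j]
        self.cache.clear()

    def read(self, query):
        """Wake-time readout: fast-weight recall plus softmax attention over the cache.

        Attention touches at most `cache_limit` entries, so its cost does not grow
        with the total number of tokens seen so far.
        """
        out = matvec(self.W, query)
        if self.cache:
            scores = [dot(k, query) for k, _ in self.cache]
            m = max(scores)
            weights = [math.exp(s - m) for s in scores]
            z = sum(weights)
            for i in range(self.dim):
                out[i] += sum(w * v[i] for w, (_, v) in zip(weights, self.cache)) / z
        return out


def recall_error(n_passes):
    """Store three orthonormal keys, let the memory sleep, and measure recall of the first."""
    mem = SleepyMemory(dim=4, cache_limit=3, sleep_passes=n_passes)
    keys = [[1.0 if i == j else 0.0 for i in range(4)] for j in range(3)]
    values = [[0.0, 2.0, 0.0, 1.0], [1.0, 0.0, -1.0, 0.0], [0.5, 0.5, 0.5, 0.5]]
    for k, v in zip(keys, values):
        mem.wake_step(k, v)
    assert not mem.cache  # the third token filled the cache and triggered sleep
    out = mem.read(keys[0])
    return math.sqrt(sum((o - t) ** 2 for o, t in zip(out, values[0])))


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(f"N={n}: recall error {recall_error(n):.4f}")
```

With orthonormal keys, each delta-rule pass in this toy shrinks a stored item's recall error by a factor of (1 - lr), so after N passes the error is (1 - lr)^N times the value's norm. Longer sleep therefore gives better recall here, loosely mirroring the abstract's claim that larger N helps; the toy says nothing about the paper's actual results or its learned rule.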

Original Article


Cached at: 05/26/26, 02:44 PM


Source: https://huggingface.co/papers/2605.26099



Similar Articles

Language Models Need Sleep

Language Models Need Sleep: Learning to Self-Modify and Consolidate Memories

@omarsar0: Language models need "sleep"

Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models

Mela: Test-Time Memory Consolidation based on Transformation Hypothesis
