End-to-End Context Compression at Scale

Hugging Face Daily Papers 06/08/26, 12:00 AM Papers

context-compression encoder-decoder latent-context long-context language-model efficiency pretraining

Summary

This paper presents Latent Context Language Models (LCLMs), a family of encoder-decoder compressors that efficiently handle long contexts through architectural search and large-scale pretraining, outperforming traditional KV cache methods in accuracy, speed, and memory usage.

Long-context language model inference is bottlenecked by memory, as the KV cache grows with context length. Recent techniques to compress the KV cache fall short: they either degrade model quality substantially or require considerable time and compute to compress a single long prompt. Furthermore, many methods require the input to fit within the target model's context window, and are generally incompatible with modern production inference engines. Encoder-decoder compressors, which map a long token sequence to a shorter sequence of latent embeddings consumed by a decoder, are an appealing alternative in principle. However, existing approaches are not competitive with KV cache compression on the accuracy-efficiency frontier. In this work, we revisit encoder-decoder compression and close this gap. We first perform an architecture search, pre-training many variants from scratch to determine how best to design and train encoder-decoder compressors. Guided by our findings, we continually pre-train a family of 0.6B-encoder, 4B-decoder models on over 350B tokens each, at compression ratios of 1:4, 1:8, and 1:16. We introduce Latent Context Language Models (LCLMs), a family of compressors that improve the Pareto frontier across general-task performance, compression speed, and peak memory usage. We demonstrate that LCLMs serve as efficient backbones for long-horizon agents, letting the agent skim through a compressed long context and adaptively expand relevant segments on demand.

Original Article


Cached at: 06/09/26, 08:43 AM


Source: https://huggingface.co/papers/2606.09659

Abstract

Encoder-decoder compression techniques are improved through architectural search and large-scale pretraining to create Latent Context Language Models that efficiently handle long contexts with better performance and memory usage compared to traditional KV cache methods.

Long-context language model inference is bottlenecked by memory, as the KV cache grows with context length. Recent techniques to compress the KV cache fall short: they either degrade model quality substantially or require considerable time and compute to compress a single long prompt. Furthermore, many methods require the input to fit within the target model’s context window, and are generally incompatible with modern production inference engines. Encoder-decoder compressors, which map a long token sequence to a shorter sequence of latent embeddings consumed by a decoder, are an appealing alternative in principle. However, existing approaches are not competitive with KV cache compression on the accuracy-efficiency frontier. In this work, we revisit encoder-decoder compression and close this gap. We first perform an architecture search, pre-training many variants from scratch to determine how best to design and train encoder-decoder compressors. Guided by our findings, we continually pre-train a family of 0.6B-encoder, 4B-decoder models on over 350B tokens each, at compression ratios of 1:4, 1:8, and 1:16. We introduce Latent Context Language Models (LCLMs), a family of compressors that improve the Pareto frontier across general-task performance, compression speed, and peak memory usage. We demonstrate that LCLMs serve as efficient backbones for long-horizon agents, letting the agent skim through a compressed long context and adaptively expand relevant segments on demand.
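To make the compression ratios concrete, the following is a rough back-of-the-envelope sketch, not the paper's method or code. It assumes the decoder's KV cache is built over the latent embeddings rather than the original tokens, so the cache shrinks roughly in proportion to the ratio. The decoder shape (`NUM_LAYERS`, `NUM_KV_HEADS`, `HEAD_DIM`), the 131,072-token context, and the function names are hypothetical values chosen only for illustration. The sketch ignores the encoder's own compute and memory as well as any uncompressed query or generated tokens.

```python
def latent_length(num_tokens: int, ratio: int) -> int:
    """Latent embeddings needed when `ratio` tokens map to one latent (a 1:ratio compressor).

    Rounding a trailing partial chunk up to one latent is an assumption of this sketch.
    """
    return (num_tokens + ratio - 1) // ratio


def kv_cache_bytes(seq_len: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Common KV cache size estimate: keys and values for every layer and position."""
    # The leading 2 counts one key tensor and one value tensor per layer.
    return 2 * seq_len * num_layers * num_kv_heads * head_dim * bytes_per_value


# Hypothetical decoder shape, chosen only for illustration (not taken from the paper).
NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM = 36, 8, 128
CONTEXT_TOKENS = 131_072

baseline = kv_cache_bytes(CONTEXT_TOKENS, NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM)
print(f"uncompressed: {CONTEXT_TOKENS} positions, {baseline / 2**30:.2f} GiB")

# The three compression ratios named in the abstract.
for ratio in (4, 8, 16):
    latents = latent_length(CONTEXT_TOKENS, ratio)
    compressed = kv_cache_bytes(latents, NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM)
    print(f"1:{ratio}: {latents} latents, {compressed / 2**30:.2f} GiB")
```

Under these assumptions the cache size scales linearly with the number of positions the decoder attends over, which is why a 1:16 compressor would cut the context portion of the cache by about a factor of 16.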


Get this paper in your agent:

hf papers read 2606.09659

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash


Similar Articles

@Pavel_Izmailov: New paper: Latent Context Language Models (LCLMs)! Idea: encode 16 tokens as 1 latent token, and have the LLM work on t…

What should context compression keep? I looked at how six agents handle it [D]

Latent Cache Flow: Model-to-Model Communication Without Text

Looped Latent Attention: Cross-Loop KV Compression for Looped Transformers

Less Context, More Accuracy: A Bi-Temporal Memory Engine for LLM Agents Where a Lean Retrieved Context Beats the Full History
