End-to-End Context Compression at Scale

Hugging Face Daily Papers

Summary

This paper presents Latent Context Language Models (LCLMs), a family of encoder-decoder compressors that efficiently handle long contexts through architectural search and large-scale pretraining, outperforming traditional KV cache methods in accuracy, speed, and memory usage.

Long-context language model inference is bottlenecked by memory, as the KV cache grows with context length. Recent techniques to compress the KV cache fall short: they either degrade model quality substantially or require considerable time and compute to compress a single long prompt. Furthermore, many methods require the input to fit within the target model's context window, and are generally incompatible with modern production inference engines. Encoder-decoder compressors, which map a long token sequence to a shorter sequence of latent embeddings consumed by a decoder, are an appealing alternative in principle. However, existing approaches are not competitive with KV cache compression on the accuracy-efficiency frontier. In this work, we revisit encoder-decoder compression and close this gap. We first perform an architecture search, pre-training many variants from scratch to determine how best to design and train encoder-decoder compressors. Guided by our findings, we continually pre-train a family of 0.6B-encoder, 4B-decoder models on over 350B tokens each, at compression ratios of 1:4, 1:8, and 1:16. We introduce Latent Context Language Models (LCLMs), a family of compressors that improve the Pareto frontier across general-task performance, compression speed, and peak memory usage. We demonstrate that LCLMs serve as efficient backbones for long-horizon agents, letting the agent skim through a compressed long context and adaptively expand relevant segments on demand.
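To make the memory bottleneck concrete, here is a rough back-of-the-envelope sketch of how KV cache size scales with context length, and how much shorter the sequence becomes at the stated compression ratios. The layer count, KV head count, head dimension, and 16-bit storage are hypothetical values chosen for illustration, not the paper's configuration. The sketch also assumes that a ratio of 1:4 means four input tokens map to one latent embedding, and that the decoder stores one KV entry per latent.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to store keys and values for one sequence across all layers."""
    # Factor 2 accounts for storing both a key and a value per position.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


def latent_length(num_tokens, tokens_per_latent):
    """Number of latent embeddings, assuming `tokens_per_latent` tokens per latent."""
    # Ceiling division so a partial final group still produces one latent.
    return -(-num_tokens // tokens_per_latent)


# Hypothetical decoder configuration (an assumption, not taken from the paper).
LAYERS, KV_HEADS, HEAD_DIM = 36, 8, 128
CONTEXT_TOKENS = 131072

full_bytes = kv_cache_bytes(CONTEXT_TOKENS, LAYERS, KV_HEADS, HEAD_DIM)
print(f"uncompressed: {CONTEXT_TOKENS} positions, {full_bytes / 1024**3:.2f} GiB")

for tokens_per_latent in (4, 8, 16):
    n_latents = latent_length(CONTEXT_TOKENS, tokens_per_latent)
    compressed_bytes = kv_cache_bytes(n_latents, LAYERS, KV_HEADS, HEAD_DIM)
    print(f"1:{tokens_per_latent}: {n_latents} positions, "
          f"{compressed_bytes / 1024**3:.2f} GiB")
```

Under these assumptions the cache shrinks linearly with the compression ratio. The encoder's own memory and compute cost are not modeled here.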

Cached at: 06/09/26, 08:43 AM

Source: https://huggingface.co/papers/2606.09659

Abstract

Encoder-decoder compression techniques are improved through architectural search and large-scale pretraining to create Latent Context Language Models that efficiently handle long contexts with better performance and memory usage compared to traditional KV cache methods.

Long-context language model inference is bottlenecked by memory, as the KV cache grows with context length. Recent techniques to compress the KV cache fall short: they either degrade model quality substantially or require considerable time and compute to compress a single long prompt. Furthermore, many methods require the input to fit within the target model’s context window, and are generally incompatible with modern production inference engines. Encoder-decoder compressors, which map a long token sequence to a shorter sequence of latent embeddings consumed by a decoder, are an appealing alternative in principle. However, existing approaches are not competitive with KV cache compression on the accuracy-efficiency frontier. In this work, we revisit encoder-decoder compression and close this gap. We first perform an architecture search, pre-training many variants from scratch to determine how best to design and train encoder-decoder compressors. Guided by our findings, we continually pre-train a family of 0.6B-encoder, 4B-decoder models on over 350B tokens each, at compression ratios of 1:4, 1:8, and 1:16. We introduce Latent Context Language Models (LCLMs), a family of compressors that improve the Pareto frontier across general-task performance, compression speed, and peak memory usage. We demonstrate that LCLMs serve as efficient backbones for long-horizon agents, letting the agent skim through a compressed long context and adaptively expand relevant segments on demand.
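The skim-then-expand behavior described for long-horizon agents can be sketched as a simple control flow. This is a toy stand-in, not the paper's method: in an LCLM the compact view would be a sequence of latent embeddings produced by the encoder, whereas here it is just the first few words of each segment. The `Segment` class, the keyword-overlap `relevance` score, and the `expand_top_k` parameter are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    index: int
    text: str  # full-resolution content of the segment
    skim: str  # stand-in for the compressed view (truncated text, NOT latents)


def make_segments(document, words_per_segment=50, skim_words=8):
    """Split a document into fixed-size word segments with a short skim view each."""
    words = document.split()
    segments = []
    for start in range(0, len(words), words_per_segment):
        chunk = words[start:start + words_per_segment]
        segments.append(
            Segment(len(segments), " ".join(chunk), " ".join(chunk[:skim_words]))
        )
    return segments


def relevance(query, skim):
    """Hypothetical relevance score: number of lowercase words shared with the query."""
    return len(set(query.lower().split()) & set(skim.lower().split()))


def build_context(query, segments, expand_top_k=1):
    """Keep every segment in skim form, expanding only the top-k most relevant."""
    ranked = sorted(segments, key=lambda seg: relevance(query, seg.skim), reverse=True)
    expanded = {seg.index for seg in ranked[:expand_top_k]}
    return [seg.text if seg.index in expanded else seg.skim for seg in segments]


document = ("alpha beta gamma delta epsilon zeta "
            "kv cache memory grows with length")
segments = make_segments(document, words_per_segment=6, skim_words=2)
for piece in build_context("kv cache size", segments):
    print(piece)
```

The point of the sketch is the shape of the loop: the agent always sees the whole context in compact form and pays full cost only for the segments it chooses to expand. How an actual LCLM-backed agent decides what to expand is not specified in the abstract.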

Similar Articles

What should context compression keep? I looked at how six agents handle it [D]

Reddit r/MachineLearning

An analysis of how six AI coding agents (Claude Code, Codex CLI, OpenCode, Cline, Cursor, Amp) converge on layered progressive compression for long contexts, differing in what they protect (user messages, stateful tool outputs) and whether they inform the model of compression, with tradeoffs between cost and accuracy.

LongAttnComp: Cross-Family Context Compression for Long-Context Reasoning

Hugging Face Daily Papers

LongAttnComp adapts AttnComp for long-context reasoning by fine-tuning lightweight cross-attention layers and introducing token-level chunking, a top-p algorithm, positional reordering, and a query parser. It achieves strong performance on long-context tasks like code debugging and transfers across multiple model families.