Can Language Models Actually Retrieve In-Context? Drowning in Documents at Million Token Scale
Summary
This paper systematically studies in-context retrieval at million-token scale. It introduces BlockSearch, a 0.6B LM retriever, and traces retrieval collapse under extreme length extrapolation to attention dilution. With length-aware softmax adjustments and document-level sparse attention, the model matches dense retrieval on benchmarks such as MS MARCO and NQ, and scores 3 times higher on LIMIT, a task requiring a different notion of similarity. The results highlight the potential of in-context retrieval and identify attention control under extreme context growth as an open challenge.
Cached at: 07/03/26, 05:40 AM
# Can Language Models Actually Retrieve In-Context? Drowning in Documents at Million Token Scale

Source: [https://arxiv.org/abs/2607.01538](https://arxiv.org/abs/2607.01538) [View PDF](https://arxiv.org/pdf/2607.01538)

> Abstract: Language models (LMs) raise an intriguing alternative to vector-based retrieval: conditioning on an in-context corpus and directly generating a relevant answer. However, prior work has largely focused on proprietary systems or the smaller-scale reranking task, leaving corpus-scale in-context retrieval largely unexplored. In this work, we present the first systematic study of in-context retrieval at two scales that practical retrievers demand: million-token corpora and length generalization far beyond training-time sizes. We first introduce BlockSearch, a 0.6B LM retriever whose architectural and training modifications improve over prior LM baselines and length-generalize up to 10 times beyond its training regime. Nevertheless, retrieval still collapses under more extreme extrapolation. We trace this failure to an attention dilution effect: as the corpus grows, irrelevant documents dominate the softmax denominator, reducing the normalized mass on the gold document even when its pre-softmax score stays high. Motivated by this analysis, we introduce length-aware adjustments to the attention softmax and document-level sparse attention. With these modifications, at the million-token scale, our model matches dense retrieval on widely studied benchmarks (e.g., MS MARCO and NQ), while outperforming the concurrent model MSA despite being 7 times smaller. Furthermore, it significantly outperforms dense retrieval on tasks requiring entirely different notions of similarity, such as LIMIT, achieving a 3 times higher score. Together, our results position in-context retrieval as a promising alternative to classical retrieval while emphasizing attention control under extreme context growth as a new challenge.
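The attention dilution effect described in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's implementation: it assumes a single gold document whose logit exceeds that of every distractor by a fixed gap, and all names (`gold_mass`, `log_length_scale`) are hypothetical. The log-length rescaling is one simple, assumed form of a length-aware adjustment, chosen only for illustration; the abstract does not specify the adjustment the authors use.

```python
import math


def gold_mass(gap: float, n_docs: int, scale: float = 1.0) -> float:
    """Softmax mass on one gold document among n_docs candidates.

    Toy assumption: the gold logit exceeds every distractor logit by `gap`,
    and all n_docs - 1 distractors share the same logit.
    mass = exp(s_g) / (exp(s_g) + (n_docs - 1) * exp(s_d)),
    rewritten in terms of the gap for numerical stability.
    """
    return 1.0 / (1.0 + (n_docs - 1) * math.exp(-scale * gap))


def log_length_scale(n_docs: int, n_train: int) -> float:
    """Hypothetical length-aware factor: grow logits with log(corpus size)."""
    return math.log(n_docs) / math.log(n_train)


if __name__ == "__main__":
    gap = 10.0       # pre-softmax advantage of the gold document (unchanged)
    n_train = 1_000  # assumed corpus size seen during training

    for n in (1_000, 100_000, 1_000_000):
        plain = gold_mass(gap, n)
        adjusted = gold_mass(gap, n, scale=log_length_scale(n, n_train))
        print(f"N={n:>9,}  plain={plain:.4f}  length-aware={adjusted:.4f}")
```

In this toy setting the gold document's score never changes, yet its normalized mass falls from roughly 0.96 at 1,000 documents to roughly 0.02 at one million, because the denominator grows linearly with the number of distractors. Scaling the logits by a factor that grows with the logarithm of the corpus size offsets that growth here; whether the same holds for a trained model is an empirical question the paper addresses.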
## Submission history

From: Siddharth Gollapudi

**[v1]** Wed, 1 Jul 2026 23:38:25 UTC (46 KB)
Similar Articles
@samhogan: RLMs pretty much solved context btw You can shove tens of millions of tokens into a good RLM harness and it just works.…
A developer shares their experience with Recursive Language Models (RLMs), claiming that a good RLM harness can handle contexts of tens of millions of tokens.
@liquidai: Introducing LFM2.5-Embedding-350M and LFM2.5-ColBERT-350M: two multilingual retrieval models built for ultra-fast and a…
Liquid AI introduces LFM2.5-Embedding-350M and LFM2.5-ColBERT-350M, two multilingual retrieval models optimized for fast and accurate search across 11 languages, with latency as low as 1.5ms.
Understanding the Behaviors of Environment-aware Information Retrieval
This paper presents the first systematic analysis of how large language models can learn to adapt query formulation strategies for different retrievers using reinforcement learning, revealing distinct optimal query styles and introducing a branching-based rollout technique for multi-retrieval-step training stability.
@Pavel_Izmailov: New paper: Latent Context Language Models (LCLMs)! Idea: encode 16 tokens as 1 latent token, and have the LLM work on t…
Introduces Latent Context Language Models (LCLMs), which encode 16 tokens as 1 latent token to improve performance, speed, and memory usage.
DeepSeek-V4: a million-token context that agents can actually use
DeepSeek releases V4, a MoE model with a 1M-token context window optimized for agentic tasks through hybrid attention and reduced KV cache requirements.