Can Language Models Actually Retrieve In-Context? Drowning in Documents at Million Token Scale

arXiv cs.CL Papers

Summary

This paper systematically studies in-context retrieval at million-token scale. It introduces BlockSearch, a 0.6B LM retriever, and traces retrieval collapse under extreme length extrapolation to attention dilution. With length-aware softmax adjustments and document-level sparse attention, the model matches dense retrieval on benchmarks such as MS MARCO and NQ and scores 3 times higher on LIMIT, a task requiring a different notion of similarity. The results highlight the potential of in-context retrieval while identifying attention control under extreme context growth as an open challenge.

arXiv:2607.01538v1 Announce Type: new Abstract: Language models (LMs) raise an intriguing alternative to vector-based retrieval: conditioning on an in-context corpus and directly generating a relevant answer. However, prior work has largely focused on proprietary systems or the smaller-scale reranking task, leaving corpus-scale in-context retrieval largely unexplored. In this work, we present the first systematic study of in-context retrieval at the two scales practical retrievers demand: million-token corpora and length generalization far beyond training-time sizes. We first introduce BlockSearch, a 0.6B LM retriever whose architectural and training modifications improve over prior LM baselines and length-generalize up to 10 times beyond its training regime. Nevertheless, retrieval still collapses under more extreme extrapolation. We trace this failure to an attention dilution effect: as the corpus grows, irrelevant documents dominate the softmax denominator, reducing the normalized mass on the gold document even when its pre-softmax score stays high. Motivated by this analysis, we introduce length-aware adjustments to the attention softmax and document-level sparse attention. With these modifications, at the million-token scale, our model matches dense retrieval on widely studied benchmarks (e.g., MS MARCO and NQ), while outperforming the concurrent model MSA despite being 7 times smaller. Furthermore, it significantly outperforms dense retrieval on tasks requiring entirely different notions of similarity, such as LIMIT, achieving a 3 times higher score. Together, our results position in-context retrieval as a promising alternative to classical retrieval while emphasizing attention control under extreme context growth as a new challenge.
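
The dilution argument in the abstract can be illustrated with a toy calculation. As a simplifying assumption (not taken from the paper), suppose the gold document has pre-softmax score $s_g$ and each of $N$ irrelevant documents has the same score $s_d$, with gap $\Delta = s_g - s_d$. These symbols are introduced here for illustration only. The normalized mass on the gold document is then

$$
p_g(N) = \frac{e^{s_g}}{e^{s_g} + N e^{s_d}} = \frac{1}{1 + N e^{-\Delta}} .
$$

For a fixed gap $\Delta$, $p_g(N) \to 0$ as $N$ grows, even though $s_g$ itself never changes. Keeping $p_g(N) \ge \tfrac{1}{2}$ requires $\Delta \ge \ln N$, which for $N = 10^6$ distractors is a gap of about $13.8$.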

# Can Language Models Actually Retrieve In-Context? Drowning in Documents at Million Token Scale
Source: [https://arxiv.org/abs/2607.01538](https://arxiv.org/abs/2607.01538)
[View PDF](https://arxiv.org/pdf/2607.01538)

> Abstract: Language models (LMs) raise an intriguing alternative to vector-based retrieval: conditioning on an in-context corpus and directly generating a relevant answer. However, prior work has largely focused on proprietary systems or the smaller-scale reranking task, leaving corpus-scale in-context retrieval largely unexplored. In this work, we present the first systematic study of in-context retrieval at the two scales practical retrievers demand: million-token corpora and length generalization far beyond training-time sizes. We first introduce BlockSearch, a 0.6B LM retriever whose architectural and training modifications improve over prior LM baselines and length-generalize up to 10 times beyond its training regime. Nevertheless, retrieval still collapses under more extreme extrapolation. We trace this failure to an attention dilution effect: as the corpus grows, irrelevant documents dominate the softmax denominator, reducing the normalized mass on the gold document even when its pre-softmax score stays high. Motivated by this analysis, we introduce length-aware adjustments to the attention softmax and document-level sparse attention. With these modifications, at the million-token scale, our model matches dense retrieval on widely studied benchmarks (e.g., MS MARCO and NQ), while outperforming the concurrent model MSA despite being 7 times smaller. Furthermore, it significantly outperforms dense retrieval on tasks requiring entirely different notions of similarity, such as LIMIT, achieving a 3 times higher score. Together, our results position in-context retrieval as a promising alternative to classical retrieval while emphasizing attention control under extreme context growth as a new challenge.
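
The attention dilution effect, and the two kinds of remedy the abstract names, can be reproduced in a small numerical sketch. Everything below is a toy assumption rather than the paper's implementation: the score distribution, the gold score of 6.0, the log-ratio logit scale in `length_aware_scale`, and the top-k document selection in `top_k_mass_on_gold` are hypothetical stand-ins chosen only to show the mechanism.

```python
import heapq
import math
import random


def softmax_mass_on_gold(gold_score, distractor_scores, scale=1.0):
    """Share of softmax mass on the gold document, given raw (pre-softmax) scores."""
    scores = [gold_score] + list(distractor_scores)
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(scale * (s - peak)) for s in scores]
    return weights[0] / sum(weights)


def length_aware_scale(n_docs, n_train_docs):
    """Hypothetical adjustment: sharpen logits in proportion to log corpus size.

    Assumes n_train_docs > 1; never softens below the training-time scale of 1.0.
    """
    return max(1.0, math.log(n_docs) / math.log(n_train_docs))


def top_k_mass_on_gold(gold_score, distractor_scores, k):
    """Hypothetical document-level sparsity: normalize over the k best-scoring documents."""
    scores = [gold_score] + list(distractor_scores)
    kept = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    if 0 not in kept:  # the gold document was pruned before the softmax
        return 0.0
    return softmax_mass_on_gold(gold_score, [scores[i] for i in kept if i != 0])


def make_distractors(n, seed=0):
    """Toy distractor scores, drawn uniformly from [-2, 2] (an assumption)."""
    rng = random.Random(seed)
    return [rng.uniform(-2.0, 2.0) for _ in range(n)]


if __name__ == "__main__":
    gold = 6.0  # the gold score stays fixed while the corpus grows
    distractors = make_distractors(100_000)
    for n in (100, 1_000, 100_000):
        subset = distractors[:n]
        scale = length_aware_scale(n, n_train_docs=100)
        print(
            n,
            round(softmax_mass_on_gold(gold, subset), 4),         # plain softmax
            round(softmax_mass_on_gold(gold, subset, scale), 4),  # length-aware scale
            round(top_k_mass_on_gold(gold, subset, k=10), 4),     # top-k documents
        )
```

In this toy setup the plain softmax mass on the gold document shrinks as distractors are added, although its raw score never changes, while both adjustments keep more mass on it. That only illustrates why controlling the softmax denominator matters; it says nothing about how the trained model behaves.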

## Submission history

From: Siddharth Gollapudi \[[view email](https://arxiv.org/show-email/f1ac6a9f/2607.01538)\] **\[v1\]** Wed, 1 Jul 2026 23:38:25 UTC (46 KB)
