Can Language Models Actually Retrieve In-Context? Drowning in Documents at Million Token Scale

arXiv cs.CL 07/03/26, 04:00 AM Papers

Summary

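This paper systematically studies in-context retrieval at million-token scale. It introduces BlockSearch, a 0.6B LM retriever, and traces its failure under extreme length extrapolation to attention dilution. With length-aware adjustments to the attention softmax and document-level sparse attention, the model matches dense retrieval on MS MARCO and NQ, outperforms the 7 times larger concurrent model MSA, and scores 3 times higher than dense retrieval on LIMIT, a task requiring a different notion of similarity. The results highlight the potential of in-context retrieval while identifying attention control under extreme context growth as an open challenge.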

arXiv:2607.01538v1 Announce Type: new Abstract: Language models (LMs) raise an intriguing alternative to vector-based retrieval: conditioning on an in-context corpus and directly generating a relevant answer. However, prior work has largely focused on proprietary systems or the smaller-scale reranking task, leaving corpus-scale in-context retrieval largely unexplored. In this work, we present the first systematic study of in-context retrieval on two scales practical retrievers demand: million-token corpora and length-generalization far beyond training-time sizes. We first introduce BlockSearch, a 0.6B LM retriever whose architectural and training modifications improve over prior LM baselines and length-generalize up to 10 times beyond its training regime. Nevertheless, retrieval still collapses under more extreme extrapolation. We trace this failure to an attention dilution effect: as the corpus grows, irrelevant documents dominate the softmax denominator, reducing the normalized mass on the gold document even when its pre-softmax score stays high. Motivated by this analysis, we introduce length-aware adjustments to the attention softmax and document-level sparse attention. With these modifications, at the million-token scale, our model matches dense retrieval on widely studied benchmarks (e.g., MS MARCO and NQ), while outperforming the concurrent model MSA despite being 7 times smaller. Furthermore, it significantly outperforms dense retrieval on tasks requiring entirely different notions of similarity, such as LIMIT, achieving a 3 times higher score. Together, our results position in-context retrieval as a promising alternative to classical retrieval while emphasizing attention control under extreme context growth as a new challenge.

Cached at: 07/03/26, 05:40 AM

# Can Language Models Actually Retrieve In-Context? Drowning in Documents at Million Token Scale
Source: [https://arxiv.org/abs/2607.01538](https://arxiv.org/abs/2607.01538)
[View PDF](https://arxiv.org/pdf/2607.01538)

> Abstract: Language models (LMs) raise an intriguing alternative to vector-based retrieval: conditioning on an in-context corpus and directly generating a relevant answer. However, prior work has largely focused on proprietary systems or the smaller-scale reranking task, leaving corpus-scale in-context retrieval largely unexplored. In this work, we present the first systematic study of in-context retrieval on two scales practical retrievers demand: million-token corpora and length-generalization far beyond training-time sizes. We first introduce BlockSearch, a 0.6B LM retriever whose architectural and training modifications improve over prior LM baselines and length-generalize up to 10 times beyond its training regime. Nevertheless, retrieval still collapses under more extreme extrapolation. We trace this failure to an attention dilution effect: as the corpus grows, irrelevant documents dominate the softmax denominator, reducing the normalized mass on the gold document even when its pre-softmax score stays high. Motivated by this analysis, we introduce length-aware adjustments to the attention softmax and document-level sparse attention. With these modifications, at the million-token scale, our model matches dense retrieval on widely studied benchmarks (e.g., MS MARCO and NQ), while outperforming the concurrent model MSA despite being 7 times smaller. Furthermore, it significantly outperforms dense retrieval on tasks requiring entirely different notions of similarity, such as LIMIT, achieving a 3 times higher score. Together, our results position in-context retrieval as a promising alternative to classical retrieval while emphasizing attention control under extreme context growth as a new challenge.
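
The attention dilution effect described in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's implementation: it assumes one gold document with a fixed pre-softmax score and `n` distractors that all share a lower score, so the gold document's softmax mass is 1 / (1 + n·exp(−margin)). The log-length logit scale and the top-k document filter are hypothetical stand-ins for the "length-aware adjustments" and "document-level sparse attention" named in the abstract, whose exact form the abstract does not specify. All function names are illustrative.

```python
import math


def gold_mass(gold_logit, distractor_logit, n_distractors, scale=1.0):
    """Softmax mass on one gold document among equally scored distractors.

    p = e^g / (e^g + n * e^d) = 1 / (1 + n * exp(-(g - d)))
    `scale` multiplies the logits (hypothetical length-aware adjustment).
    """
    margin = scale * (gold_logit - distractor_logit)
    return 1.0 / (1.0 + n_distractors * math.exp(-margin))


def log_length_scale(n_docs, n_train):
    """Assumed adjustment: grow logits in proportion to log corpus size."""
    return max(1.0, math.log(n_docs) / math.log(n_train))


def topk_gold_mass(gold_logit, distractor_logit, n_distractors, k):
    """Assumed document-level sparsity: attend only to the k best documents.

    Requires gold_logit > distractor_logit, so the gold document is kept
    and at most k - 1 distractors remain in the softmax denominator.
    """
    kept = min(n_distractors, k - 1)
    return gold_mass(gold_logit, distractor_logit, kept)


# The gold score stays fixed while the corpus grows past the "training" size.
gold, noise, n_train = 8.0, 0.0, 100
for n in (100, 1_000, 10_000, 100_000):
    plain = gold_mass(gold, noise, n)
    scaled = gold_mass(gold, noise, n, scale=log_length_scale(n, n_train))
    sparse = topk_gold_mass(gold, noise, n, k=32)
    print(f"{n:>7} docs  plain={plain:.3f}  scaled={scaled:.3f}  top-32={sparse:.3f}")
```

In this toy setting the unadjusted mass on the gold document falls as the distractor count rises even though its score never changes, while either adjustment keeps the mass high. Real attention scores are not uniform across distractors, so the numbers only indicate the direction of the effect.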

## Submission history

From: Siddharth Gollapudi \[[view email](https://arxiv.org/show-email/f1ac6a9f/2607.01538)\] **\[v1\]** Wed, 1 Jul 2026 23:38:25 UTC (46 KB)

Similar Articles

@samhogan: RLMs pretty much solved context btw You can shove tens of millions of tokens into a good RLM harness and it just works.…

@liquidai: Introducing LFM2.5-Embedding-350M and LFM2.5-ColBERT-350M: two multilingual retrieval models built for ultra-fast and a…

Understanding the Behaviors of Environment-aware Information Retrieval

@Pavel_Izmailov: New paper: Latent Context Language Models (LCLMs)! Idea: encode 16 tokens as 1 latent token, and have the LLM work on t…

DeepSeek-V4: a million-token context that agents can actually use
