Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction
Summary
The paper introduces Direct Corpus Interaction (DCI), an approach in which an AI agent searches the raw corpus directly with general-purpose terminal tools (grep, file reads, shell commands, lightweight scripts) instead of calling a conventional lexical or embedding-based retriever. DCI needs no embedding model, vector index, or offline indexing. It substantially outperforms strong sparse, dense, and reranking baselines on several BRIGHT and BEIR datasets, and attains strong accuracy on BrowseComp-Plus and multi-hop QA.
Source: https://huggingface.co/papers/2605.05242
Abstract
Direct corpus interaction enables more effective agentic search by allowing agents to query raw text directly, outperforming traditional retrieval methods in complex tasks.
Modern retrieval systems, whether lexical or semantic, expose a corpus through a fixed similarity interface that compresses access into a single top-k retrieval step before reasoning. This abstraction is efficient, but for agentic search, it becomes a bottleneck: exact lexical constraints, sparse clue conjunctions, local context checks, and multi-step hypothesis refinement are difficult to implement by calling a conventional off-the-shelf retriever, and evidence filtered out early cannot be recovered by stronger downstream reasoning. Agentic tasks further exacerbate this limitation because they require agents to orchestrate multiple steps, including discovering intermediate entities, combining weak clues, and revising the plan after observing partial evidence. To tackle the limitation, we study direct corpus interaction (DCI), where an agent searches the raw corpus directly with general-purpose terminal tools (e.g., grep, file reads, shell commands, lightweight scripts), without any embedding model, vector index, or retrieval API. This approach requires no offline indexing and adapts naturally to evolving local corpora. Across IR benchmarks and end-to-end agentic search tasks, this simple setup substantially outperforms strong sparse, dense, and reranking baselines on several BRIGHT and BEIR datasets, and attains strong accuracy on BrowseComp-Plus and multi-hop QA without relying on any conventional semantic retriever. Our results indicate that as language agents become stronger, retrieval quality depends not only on reasoning ability but also on the resolution of the interface through which the model interacts with the corpus, with which DCI opens a broader interface-design space for agentic search.
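To make the idea of "sparse clue conjunctions" plus "local context checks" concrete, here is a minimal sketch of the kind of lightweight script an agent might write against a raw corpus. It is not the authors' implementation: the helper names `search_corpus` and `demo`, the `.txt` file layout, and the toy documents are all assumptions made for illustration, and pure Python stands in for chained grep calls.

```python
import re
import tempfile
from pathlib import Path


def search_corpus(corpus_dir, terms, context=1):
    """Return (file name, line number, snippet) for files containing ALL terms.

    Hypothetical helper: a pure-Python stand-in for chaining grep calls.
    """
    patterns = [re.compile(re.escape(t), re.IGNORECASE) for t in terms]
    hits = []
    for path in sorted(Path(corpus_dir).rglob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        # Clue conjunction: every term must occur somewhere in the file.
        if not all(p.search(text) for p in patterns):
            continue
        lines = text.splitlines()
        for i, line in enumerate(lines):
            if any(p.search(line) for p in patterns):
                # Local context check: include neighbouring lines in the snippet.
                lo, hi = max(0, i - context), min(len(lines), i + context + 1)
                hits.append((path.name, i + 1, " ".join(lines[lo:hi])))
    return hits


def demo():
    # Toy two-file corpus (assumed data, created in a temporary directory).
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "a.txt").write_text(
            "Marie Curie was born in Warsaw.\nShe won two Nobel Prizes.\n",
            encoding="utf-8",
        )
        Path(tmp, "b.txt").write_text(
            "Warsaw is the capital of Poland.\n", encoding="utf-8"
        )
        return search_corpus(tmp, ["warsaw", "nobel"], context=0)


if __name__ == "__main__":
    for name, lineno, snippet in demo():
        print(f"{name}:{lineno}: {snippet}")
```

In this toy run only `a.txt` satisfies both clues, so `b.txt` is skipped even though it mentions Warsaw. An agent could then read the matching file, revise its terms, and search again, which is the multi-step refinement loop the abstract describes; no index is built at any point.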
Similar Articles
Dr-DCI: Scaling Direct Corpus Interaction via Dynamic Workspace Expansion
This paper introduces DR-DCI, a retriever-steered framework for scaling direct corpus interaction by dynamically expanding a local workspace, achieving improved accuracy and efficiency in agentic search over large corpora.
Towards Retrieving Interaction Spaces for Agentic Search
The RISE framework constructs bounded interaction spaces for agentic search by combining BM25 retrieval with preprocessed document indexing, enabling efficient corpus exploration while maintaining high accuracy at scale.
Rethinking Agentic Search with Pi-Serini: Is Lexical Retrieval Sufficient?
This paper introduces Pi-Serini, a BM25-based agentic search system that demonstrates lexical retrieval can suffice for deep search when agents refine queries, achieving high accuracy and reducing costs compared to default settings.
@zhuofengli96475: DCI just hit #1 on Hugging Face Daily Papers! Try it Now! @HuggingPapers https://huggingface.co/papers/2605.05242…
DCI (Direct Corpus Interaction) proposes using simple terminal tools like grep and bash for agentic search, outperforming traditional retrieval methods without embeddings or vector indexes.
Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems
The paper introduces BRIGHT-Pro, a new benchmark for reasoning-intensive retrieval, and RTriever-Synth, a synthetic corpus used to fine-tune RTriever-4B for improved performance in agentic search systems.