RAG Retrieval Deep Dive: BM25, Embeddings, and the Power of Agentic Search


Summary

This article provides an in-depth comparison of the advantages and disadvantages of BM25 lexical search and embedding semantic search in RAG retrieval, offers a practical framework for selecting retrieval methods based on query type and system trade-offs, and emphasizes the importance of treating RAG as a system rather than a simple component.


Cached at: 06/27/26, 07:24 AM

# RAG Retrieval Deep Dive: BM25, Embeddings, and the Power of Agentic Search

**TL;DR:** When building production-grade RAG systems, the retrieval stage is the key bottleneck. This article compares BM25 lexical search with embedding-based semantic search in depth and offers a practical framework for choosing a retrieval method based on query type and system trade-offs.

## Current State and Challenges of RAG

Many people are building their own RAG solutions. But when you try to move from a demo to production, problems emerge:

- Accuracy drops, because users ask more complex questions than the synthetic test data covered
- Latency becomes unbearable
- Scaling from 100 documents to 1 million is difficult
- Costs skyrocket (some enterprise multi-agent RAG systems cost up to $1 per query)
- Permissions are often overlooked: not everyone has access to all documents

The answer to these problems is not to cobble together arXiv papers, but to think of RAG as a **system**.

## Thinking of RAG as a System

A production-grade RAG system is not just two models; it's a combination of multiple components:

- Cleanly parsing documents and extracting information (often overlooked when dealing with complex documents)
- How documents are queried
- Retrieval (finding relevant chunks)
- Generating a response

Before starting design, you need to understand the **trade-offs** users expect: speed, cost, and "question complexity," which takes the place of accuracy as the third axis. Since generative AI is error-prone, it's often deployed where errors are easy to check (e.g., code assistants), while medical RAG systems that affect patient lives are much harder to launch because the cost of errors is high.

### Query Type Determines System Design

Before starting, understand what types of queries your users will ask the RAG system:

1. **Simple keyword queries** – Very suitable for BM25
2. **Semantic variations** – Users don't know the exact keywords, e.g., asking "how many banks" instead of "total revenue"
3. **Multi-step reasoning** – The question has multiple parts that must be solved step by step
4. **Cross-document answers** – Answers spread across multiple documents
5. **External knowledge** – The answer is not in the provided documents
6. **Agentic scenarios** – Users ask broader questions that require a thinking LLM

Each scenario influences system design. This article focuses on the retrieval part, because that's where a lot of confusion and uncertainty lies.

## Why Do We Need Retrieval? The Meaning of Chunking

You can't expect a single model to hold everything. Even with a 10-million-token context window, running every query through the full corpus is not practical because of the compute cost. Chunking documents into smaller pieces and retrieving only the relevant ones is the computationally efficient approach. That's why RAG retrieval is here to stay.

## BM25: The Classic of Lexical Search

BM25 (Best Matching 25) is a lexical search method that went through many iterations to reach its current form. It examines all words in the documents and builds an **inverted index**: a mapping from each word to the documents that contain it. When you query "butterfly," it instantly knows which documents contain that word, without scanning the rest. This approach is very efficient.

**Performance comparison (synthetic data)**:

- Linear search (Ctrl+F): 1000 documents → 3000 seconds, and scales linearly with document count
- Inverted index + BM25: orders of magnitude faster

### Limitations of BM25

BM25 is great for keyword retrieval, but it can't handle synonyms or abbreviations. For example, a user asks about "physician" but the document only uses "doctor"; or a user asks for "International Business Machines" but the document uses "IBM." Still, BM25 is a strong baseline, and many neural network methods may not outperform it on certain corpora. If users already know the keywords they need, BM25 might be sufficient.

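
The inverted index and the synonym limitation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the speaker's implementation: the `BM25Index` class, the naive `tokenize` helper, the sample documents, and the parameter defaults `k1=1.5` and `b=0.75` are all assumptions chosen for the example, and the IDF formula is one common BM25 variant among several.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (deliberately naive)."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25Index:
    """Hypothetical toy index: an inverted index plus BM25 scoring."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b  # assumed, commonly used defaults
        self.doc_len = []
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        for doc_id, text in enumerate(docs):
            tokens = tokenize(text)
            self.doc_len.append(len(tokens))
            for term, tf in Counter(tokens).items():
                self.postings[term][doc_id] = tf
        self.n_docs = len(docs)
        self.avg_len = sum(self.doc_len) / self.n_docs if self.n_docs else 0.0

    def idf(self, term):
        # Rarer terms get a higher weight; the "+ 1.0" keeps the value positive.
        df = len(self.postings.get(term, {}))
        return math.log((self.n_docs - df + 0.5) / (df + 0.5) + 1.0)

    def search(self, query, top_k=3):
        scores = defaultdict(float)
        for term in tokenize(query):
            idf = self.idf(term)
            # Only documents listed under this term are touched: no full scan.
            for doc_id, tf in self.postings.get(term, {}).items():
                norm = 1 - self.b + self.b * self.doc_len[doc_id] / self.avg_len
                scores[doc_id] += idf * tf * (self.k1 + 1) / (tf + self.k1 * norm)
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


docs = [
    "The butterfly landed on the flower.",
    "Ask your doctor about the new treatment.",
    "IBM announced quarterly results.",
]
index = BM25Index(docs)
print(index.search("butterfly"))  # only document 0 contains the term
print(index.search("physician"))  # empty: document 1 says "doctor", not "physician"
```

The second query returns nothing even though document 1 is clearly relevant, which is exactly the gap that embedding-based retrieval is meant to close.
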
## Language Models and Embeddings: Semantic Search

An encoder turns text into a vector of numbers, and those numbers represent the semantics of the text. Language models trained on large amounts of data place similar concepts near each other: "physician" and "doctor" are very close in embedding space, and "International Business Machines" and "IBM" are also close. This is semantic search, which Google introduced to great success.

### How to Choose an Embedding Model?

Hugging Face has hundreds of embedding models. The chart shown in the talk plots different models by **CPU inference speed** (horizontal axis, faster toward the right) and **retrieval quality** (vertical axis, NDCG metric, higher is better):

- **BM25**: on the right, fast, but retrieval quality is not optimal
- **Static embeddings (e.g., Word2Vec)**: faster than BM25, even runnable on CPU. A drawback is that they cannot handle polysemy (e.g., "model" in AI vs. fashion), which can degrade retrieval quality
- **Larger models** (left side of chart): higher quality, but slower

The **Multilingual Embedding Leaderboard** and **Retrieval Embedding Leaderboard** (nano version of BEIR) on Hugging Face can help you choose a model based on the speed-quality trade-off.

## Intelligent Search and Agentic Search

In the talk, the speaker also mentioned "intelligent search" and "agentic scenarios" as higher-level forms of retrieval. When queries are complex, require several parts to be resolved step by step, or have answers spread across multiple documents, more sophisticated retrieval strategies are needed, possibly combining reasoning and tool calls. The talk presented this as a future direction; this article covers the core foundation of retrieval: the comparison between BM25 and embeddings and the scenarios each one fits.

## Summary

Building a production-grade RAG system starts by viewing retrieval as part of a system and by understanding query types and system trade-offs. BM25 is a fast, simple baseline suitable for queries with clear keywords; embedding models can handle semantic variations but require choosing the right model based on speed and quality needs. There is no silver bullet, only engineering choices made for the specific scenario.

**Source:** https://www.youtube.com/watch?v=AS_HlJbJjH8
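
As a closing illustration of the semantic-search idea from the embeddings section, the sketch below compares vectors by cosine similarity. The three-dimensional vectors are hand-made for the example and are not the output of any real embedding model; `toy_embeddings` and `nearest` are hypothetical names. A real system would obtain the vectors from an encoder chosen with the speed-quality trade-off in mind.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hand-made toy vectors (an assumption for illustration, NOT real model output).
toy_embeddings = {
    "physician": [0.9, 0.1, 0.0],
    "doctor": [0.8, 0.2, 0.1],
    "butterfly": [0.0, 0.1, 0.9],
}


def nearest(query, embeddings):
    """Return the (word, similarity) pair closest to `query`, excluding itself."""
    q = embeddings[query]
    others = [
        (word, cosine_similarity(q, vec))
        for word, vec in embeddings.items()
        if word != query
    ]
    return max(others, key=lambda pair: pair[1])


word, score = nearest("physician", toy_embeddings)
print(word, round(score, 3))
```

With these toy vectors, "physician" lands closest to "doctor" even though the two strings share no tokens, which is the match a purely lexical method cannot make.
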
