# RAG Retrieval Deep Dive: BM25, Embeddings, and the Power of Agentic Search
**TL;DR:** When building production-grade RAG systems, the retrieval stage is the key bottleneck. This article provides an in-depth comparison of BM25 lexical search and embedding-based semantic search, and offers a practical framework for choosing a retrieval method based on query type and system trade-offs.
## Current State and Challenges of RAG
Many people are building their own RAG solutions. But when you try to move from a demo to production, problems emerge: accuracy drops (users ask more complex questions than the synthetic test data), latency becomes unbearable, scaling from 100 documents to 1 million is difficult, costs skyrocket (some enterprise multi-agent RAG systems cost up to $1 per query), and there's the often-overlooked issue of permissions—not everyone has access to all documents.
The answer to these problems is not to cobble together arXiv papers, but to think of RAG as a **system**.
## Thinking of RAG as a System
A production-grade RAG system is not just two models; it's a combination of multiple components:
- Parsing documents cleanly and extracting information (often overlooked when dealing with complex documents)
- Querying the documents
- Retrieval (finding relevant chunks)
- Generating a response
Before starting design, you need to understand the **trade-offs** users expect: speed, cost, and, in place of accuracy, "question complexity." Since generative AI is error-prone, it's often deployed where errors are easy to check (e.g., code assistants), while medical RAG systems that affect patient lives are much harder to launch because the cost of errors is high.
### Query Type Determines System Design
Before starting, understand what types of queries your users will ask the RAG system:
1. **Simple keyword queries** – Well suited to BM25
2. **Semantic variations** – Users don't know the exact keywords, e.g., asking "how many banks" instead of "total revenue"
3. **Multi-step reasoning** – Needs to solve multiple parts step by step
4. **Cross-document answers** – Answers spread across multiple documents
5. **External knowledge** – The answer is not in the provided documents
6. **Agentic scenarios** – Users ask broader questions that require a thinking LLM
Each scenario influences system design. This article focuses on the retrieval part, because that's where a lot of confusion and uncertainty lies.
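As a rough illustration of how query type can drive the choice of retrieval path, the sketch below maps the six categories above to a strategy label. The enum, the strategy names, and the mapping itself are hypothetical placeholders, not something prescribed by the talk; in practice, classifying an incoming query is the hard part and is left out here.

```python
from enum import Enum


class QueryType(Enum):
    KEYWORD = "simple keyword query"
    SEMANTIC = "semantic variation"
    MULTI_STEP = "multi-step reasoning"
    CROSS_DOCUMENT = "cross-document answer"
    EXTERNAL = "external knowledge"
    AGENTIC = "agentic scenario"


# Hypothetical mapping from query type to a retrieval strategy label.
# The labels are illustrative names, not real library identifiers.
STRATEGY = {
    QueryType.KEYWORD: "bm25",
    QueryType.SEMANTIC: "embeddings",
    QueryType.MULTI_STEP: "multi_step_retrieval",
    QueryType.CROSS_DOCUMENT: "multi_step_retrieval",
    QueryType.EXTERNAL: "outside_corpus",  # answer is not in the documents
    QueryType.AGENTIC: "agentic_search",
}


def choose_strategy(query_type: QueryType) -> str:
    """Return the strategy label assumed for this query type."""
    return STRATEGY[query_type]


print(choose_strategy(QueryType.KEYWORD))
```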
## Why Do We Need Retrieval? The Meaning of Chunking
You can't expect a single model to hold everything. Even with a 10-million-token context window, running every query through the full corpus is impractical because of the compute cost. Chunking documents into smaller pieces and retrieving only the relevant ones is far more computationally efficient. That's why retrieval in RAG is here to stay.
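A minimal sketch of chunking, assuming whitespace-separated words as the unit and a fixed overlap between neighbouring chunks. The function name, the default sizes, and the use of overlap are illustrative assumptions; the talk does not specify a chunking scheme.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most `chunk_size` words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    chunk boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(sample, chunk_size=4, overlap=1):
    print(chunk)
```

Only the retrieved chunks are passed to the generator, so the per-query cost depends on the number of chunks retrieved rather than on the size of the corpus.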
## BM25: The Classic of Lexical Search
BM25 (Best Matching 25) is a lexical search method that went through many iterations to reach its current form. It examines all words in documents and creates an **inverted index**: which documents contain each word. When you query "butterfly," it instantly knows which documents contain that word. This approach is very efficient.
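The inverted index and the scoring step can be sketched in a few lines. This is a simplified illustration, not a production implementation: the class name `BM25Index` is made up, the tokenizer is a bare regular expression, the IDF uses the smoothed form found in Lucene-style implementations, and `k1=1.5`, `b=0.75` are common default values assumed here.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-word characters (a deliberately naive tokenizer)."""
    return re.findall(r"\w+", text.lower())


class BM25Index:
    def __init__(self, docs: list[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.doc_len: list[int] = []
        # Inverted index: term -> {doc_id: term frequency in that document}
        self.postings: dict[str, dict[int, int]] = defaultdict(dict)
        for doc_id, text in enumerate(docs):
            tokens = tokenize(text)
            self.doc_len.append(len(tokens))
            for term, tf in Counter(tokens).items():
                self.postings[term][doc_id] = tf
        self.n_docs = len(docs)
        self.avg_len = sum(self.doc_len) / self.n_docs if self.n_docs else 0.0

    def idf(self, term: str) -> float:
        """Smoothed inverse document frequency; rarer terms get higher weight."""
        df = len(self.postings.get(term, {}))
        return math.log(1 + (self.n_docs - df + 0.5) / (df + 0.5))

    def search(self, query: str, top_k: int = 3) -> list[tuple[int, float]]:
        """Score only the documents that contain at least one query term."""
        scores: dict[int, float] = defaultdict(float)
        for term in tokenize(query):
            postings = self.postings.get(term)
            if not postings:
                continue  # term appears in no document
            idf = self.idf(term)
            for doc_id, tf in postings.items():
                # Length normalization: long documents are penalized via b.
                norm = 1 - self.b + self.b * self.doc_len[doc_id] / self.avg_len
                scores[doc_id] += idf * tf * (self.k1 + 1) / (tf + self.k1 * norm)
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]


docs = [
    "the butterfly landed on the flower",
    "stock markets fell sharply today",
    "a butterfly garden attracts many butterfly species",
]
index = BM25Index(docs)
print(index.search("butterfly"))
```

The query for "butterfly" never touches the second document: the lookup goes straight to the posting list for that term, which is where the speed advantage over a linear scan comes from.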
**Performance comparison (synthetic data)**:
- Linear search (Ctrl+F): 1000 documents → 3000 seconds, and scales linearly with document count
- Inverted index + BM25: orders of magnitude faster
### Limitations of BM25
BM25 is great for keyword retrieval, but it can't handle synonyms or abbreviations. For example, a user asks about "physician" but the document only uses "doctor"; or a user asks for "International Business Machines" but the document uses "IBM." Still, BM25 is a strong baseline, and many neural network methods may not outperform it on certain corpora. If users already know the keywords they need, BM25 might be sufficient.
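The synonym problem follows directly from how lexical scoring works: a query term that never appears in a document contributes nothing to that document's score. The sketch below reduces this to a shared-term check; the helper names and the example sentence are made up for illustration.

```python
import re


def terms(text: str) -> set[str]:
    """Lowercased set of word tokens in the text."""
    return set(re.findall(r"\w+", text.lower()))


def shared_terms(query: str, document: str) -> set[str]:
    """Terms a lexical scorer such as BM25 could match between query and document."""
    return terms(query) & terms(document)


doc = "The doctor reviewed the quarterly report at IBM headquarters."

print(shared_terms("doctor", doc))                           # one shared term
print(shared_terms("physician", doc))                        # empty set
print(shared_terms("International Business Machines", doc))  # empty set
```

With no shared terms, the document cannot be ranked for the query at all, no matter how the scoring parameters are tuned.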
## Language Models and Embeddings: Semantic Search
An encoder maps text to a vector of numbers (an embedding) that represents the semantics of the text. Language models trained on large amounts of data place similar concepts near each other: "physician" and "doctor" are very close in embedding space, and "International Business Machines" and "IBM" are also close. This is semantic search, which Google has used to great success.
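"Close in embedding space" is usually measured with cosine similarity. The sketch below uses hand-written three-dimensional toy vectors purely to show the mechanics; the values are invented and do not come from any real model, whose embeddings would typically have hundreds of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors, invented for illustration only (not output of a real encoder).
TOY_EMBEDDINGS = {
    "physician": [0.90, 0.10, 0.05],
    "doctor": [0.85, 0.15, 0.10],
    "butterfly": [0.05, 0.10, 0.95],
}


def nearest(word: str, embeddings: dict[str, list[float]]) -> str:
    """Return the other word whose vector is most similar to `word`."""
    query_vec = embeddings[word]
    candidates = (w for w in embeddings if w != word)
    return max(candidates, key=lambda w: cosine_similarity(query_vec, embeddings[w]))


print(nearest("physician", TOY_EMBEDDINGS))
```

Semantic retrieval applies the same idea at scale: embed the query, embed every chunk, and return the chunks whose vectors are most similar to the query vector.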
### How to Choose an Embedding Model?
Hugging Face hosts hundreds of embedding models. A chart shown in the talk plots different models by **CPU inference speed** (horizontal axis, faster toward the right) and **retrieval quality** (vertical axis, measured by NDCG, higher is better):
- **BM25**: on the right, fast, but retrieval quality is not optimal
- **Static embeddings (e.g., Word2Vec)**: faster than BM25 and easy to run on CPU. Their drawback is that each word gets a single vector regardless of context, so they cannot handle polysemy (e.g., "model" in AI vs. fashion), which can degrade retrieval quality
- **Larger models** (left side of chart): higher quality, but slower
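For readers unfamiliar with NDCG, the quality metric on the vertical axis, here is a minimal sketch of NDCG@k for a single query. It assumes graded relevance labels listed in the order the system returned the documents, and it computes the ideal ranking from those same labels only; benchmark implementations usually build the ideal ranking from all judged documents for the query.

```python
import math


def dcg_at_k(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain: relevance discounted by log2 of the rank."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: list[float], k: int) -> float:
    """DCG divided by the DCG of the best possible ordering of the same labels."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels in the order results were returned (higher = more relevant).
print(ndcg_at_k([3, 2, 1], k=3))  # already in ideal order
print(ndcg_at_k([1, 2, 3], k=3))  # best result ranked last, so the score is lower
```

A score of 1.0 means the most relevant documents were ranked first; placing relevant documents lower in the list reduces the score.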
The **Multilingual Embedding Leaderboard** and **Retrieval Embedding Leaderboard** (nano version of BEIR) on Hugging Face can help you choose a model based on the speed-quality trade-off.
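One simple way to act on the speed-quality trade-off is to fix a latency budget and pick the highest-quality model that fits within it. The sketch below assumes you have already gathered latency and NDCG figures for your candidates; the model names and numbers are placeholders, not measurements from any leaderboard.

```python
from typing import Optional

# Placeholder figures for illustration only; substitute your own measurements.
CANDIDATES = [
    {"name": "model_a", "latency_ms": 1.0, "ndcg": 0.40},
    {"name": "model_b", "latency_ms": 15.0, "ndcg": 0.55},
    {"name": "model_c", "latency_ms": 120.0, "ndcg": 0.62},
]


def pick_model(candidates: list[dict], max_latency_ms: float) -> Optional[dict]:
    """Return the highest-NDCG candidate within the latency budget, or None."""
    eligible = [c for c in candidates if c["latency_ms"] <= max_latency_ms]
    return max(eligible, key=lambda c: c["ndcg"], default=None)


choice = pick_model(CANDIDATES, max_latency_ms=20.0)
print(choice["name"] if choice else "no model fits the budget")
```

Published leaderboard scores are a starting point; measuring quality on a sample of your own queries and documents is a more reliable basis for the final choice.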
## Intelligent Search and Agentic Search
In the talk, the speaker also mentioned "intelligent search" and "agentic scenarios" as higher-level forms of retrieval. When queries are complex, require resolving several parts step by step, or have answers spread across multiple documents, more sophisticated retrieval strategies are needed, potentially combining reasoning and tool calls. The talk presented this as a future direction; this article covers the core foundation of retrieval: the comparison between BM25 and embeddings and the scenarios each one suits.
## Summary
Building a production-grade RAG system starts by viewing retrieval as part of a system, understanding query types and system trade-offs. BM25 is a fast, simple baseline suitable for queries with clear keywords; embedding models can handle semantic variations but require choosing the right model based on speed and quality needs. There is no silver bullet—only engineering choices made based on the specific scenario.
**Source:** https://www.youtube.com/watch?v=AS_HlJbJjH8