The author shares their experience switching from semantic embeddings to BM25 for tool selection in agents, finding that BM25 achieves 81% top-1 accuracy vs. 64% for embeddings on a corpus of 200 query-tool pairs, because tool descriptions are short and keyword-driven rather than semantically rich like documents.
I've been building agents for about a year and recently shipped one for a client running ~140 MCP-exposed tools at peak. Along the way I made the canonical mistake: I used cosine similarity over tool description embeddings to pick which tools the model could see per turn. Worked great in demos. Was actively dangerous in production.

Here's the problem. In a basic semantic-ranking setup you embed the user query, embed every tool description once, and rank by cosine similarity at runtime. That works for general document retrieval where chunks are paragraph-length, semantically rich, and roughly equal in form.

Tool descriptions are not that. They are short (often <50 tokens), structurally similar (verb-noun, parameters list), and the discriminative information is often a single keyword. "Read a file from disk" and "Read messages from a channel" both embed close to "read" + "file/channel." Cosine similarity puts them next to each other for a query like "read the latest commits" because all three share the verb's region of the embedding space, and the actual discriminator (the noun "commits") gets diluted.

I watched this happen in eval. Asked the agent "list the open issues for this repo." The semantic ranker returned `slack_search_messages` first because its description had close embedding neighbors for "list", "open", and "issues". The actual `github_list_issues` tool ranked 4th because the GitHub MCP author wrote a terse "Lists issues in a repository" description that scored lower on every soft keyword.

If the model sees `slack_search_messages` first and `github_list_issues` fourth, it's going to pick the wrong one. Often.

So I built three retrieval strategies and tested them on a fixed corpus of 200 query→correct-tool pairs.

**Semantic embeddings (text-embedding-3-small)**: 64% top-1 accuracy. Sneaky failure mode: when wrong, it was confidently wrong, often with a totally unrelated tool ranked first.

**BM25 over a flat-text projection of tool name + description + schema walk**: 81% top-1. Failures were almost always lexical (the tool used "fetch" while the user said "get"), recoverable with light query rewriting.

**Hybrid (0.7 semantic + 0.3 BM25 normalized)**: 78%. Worse than BM25 alone. The semantic noise dragged BM25's clean signal down.

I sat with that result for a while. The "obvious" answer is hybrid; every RAG paper since 2023 says hybrid wins. For tool selection specifically, hybrid lost. The reason is that tools live in a smaller, more structured space than documents do. The discriminative signal is keyword-shaped. BM25 is built for exactly that.

The other thing I learned: indexing schema fields matters. The clean BM25 win came from projecting `name` + `description` + a walk over `input_schema` and `output_schema` (semantic tokens only, JSON Schema structure stripped). Property names like `repo_id` or `branch` are exactly the discriminators that turn "list the open issues" into a hit on GitHub instead of Slack. If you only index `name + description` you leave half your signal on the floor.

I ended up adopting Ratel's indexing approach (their ADR-0004 documents the exact projection) because rebuilding it myself was redundant. Open source, in-process Rust, NAPI-RS bound to a TS SDK, no infra. The semantic + re-ranking story is on their roadmap, but for now the BM25-only default is what I want anyway. Happy to share it in the comments if anyone wants to try.

The takeaway for anyone building tool selection or agent gateways: do not assume document-RAG defaults transfer. Tools are a different shape of data. BM25 is not the boring fallback; for this problem it's the right primary and semantic is the optional add. Test your specific corpus before you reach for embeddings.
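To make the flat-text projection and BM25 ranking described above concrete, here is a minimal, dependency-free sketch. It is not Ratel's ADR-0004 projection and not the code behind the numbers in this post. The tool definitions, the `STRUCTURAL_KEYS` list, the helper names (`walk_schema`, `project`, `rank_tools`), and the `k1`/`b` values are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Assumed, partial list of JSON Schema keywords treated as structure, not signal.
STRUCTURAL_KEYS = {"type", "required", "additionalProperties", "$schema",
                   "format", "default", "enum"}


def tokenize(text):
    """Lowercase and split on non-alphanumerics, so `repo_id` becomes two tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def walk_schema(node):
    """Yield property names, titles, and descriptions from a JSON Schema fragment."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "properties" and isinstance(value, dict):
                for prop_name, prop_schema in value.items():
                    yield prop_name
                    yield from walk_schema(prop_schema)
            elif key in ("description", "title") and isinstance(value, str):
                yield value
            elif key not in STRUCTURAL_KEYS:
                yield from walk_schema(value)  # e.g. `items`, `anyOf`
    elif isinstance(node, list):
        for item in node:
            yield from walk_schema(item)


def project(tool):
    """Flatten name + description + schema walk into one indexable string."""
    parts = [tool["name"], tool.get("description", "")]
    parts.extend(walk_schema(tool.get("input_schema", {})))
    parts.extend(walk_schema(tool.get("output_schema", {})))
    return " ".join(parts)


class BM25:
    """Plain Okapi BM25; k1 and b are common defaults, not tuned values."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        tokens = [tokenize(d) for d in docs]
        self.tfs = [Counter(t) for t in tokens]
        self.lengths = [len(t) for t in tokens]
        self.avg_len = sum(self.lengths) / len(docs)
        n = len(docs)
        df = Counter(term for tf in self.tfs for term in tf)
        self.idf = {t: math.log(1 + (n - c + 0.5) / (c + 0.5)) for t, c in df.items()}

    def score(self, query, i):
        norm = 1 - self.b + self.b * self.lengths[i] / self.avg_len
        total = 0.0
        for term in tokenize(query):
            f = self.tfs[i].get(term, 0)
            if f:
                total += self.idf[term] * f * (self.k1 + 1) / (f + self.k1 * norm)
        return total


def rank_tools(query, tools):
    """Return (tool_name, score) pairs, best match first."""
    index = BM25([project(t) for t in tools])
    scored = [(t["name"], index.score(query, i)) for i, t in enumerate(tools)]
    return sorted(scored, key=lambda pair: -pair[1])


# Hypothetical tool definitions, loosely modeled on the examples in the post.
TOOLS = [
    {"name": "github_list_issues", "description": "Lists issues in a repository",
     "input_schema": {"type": "object", "properties": {
         "repo_id": {"type": "string"},
         "state": {"type": "string", "description": "open or closed"}}}},
    {"name": "slack_search_messages", "description": "Search messages in a channel",
     "input_schema": {"type": "object", "properties": {
         "channel": {"type": "string"}, "query": {"type": "string"}}}},
    {"name": "fs_read_file", "description": "Read a file from disk",
     "input_schema": {"type": "object", "properties": {"path": {"type": "string"}}}},
]

if __name__ == "__main__":
    for name, score in rank_tools("list the open issues for this repo", TOOLS):
        print(f"{name}: {score:.3f}")
```

In this toy corpus the schema walk contributes the `repo` token (from `repo_id`) and `open` (from the `state` description), neither of which appears in the terse tool description. That is the kind of signal a `name + description` index alone would miss. Results on a real tool set will depend on the tokenizer and on which schema keywords are treated as structure.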
The article argues against overusing vector search, highlighting BM25's effectiveness for exact keyword matching and its role in hybrid search systems.
HornetDev team published a post on tuning approximate-nearest-neighbor search at 100M scale, covering embedding bias, graph connectivity, and quantization limits.
This post questions whether combining BM25 and vector search with RRF improves hit rates in agentic memory retrieval, suggesting BM25 alone may suffice.
Anthropic introduces Contextual Retrieval, a technique combining contextual embeddings and BM25 to significantly improve RAG accuracy by reducing failed retrievals.