Why I stopped using semantic embeddings for tool selection and switched back to BM25 [D]

Reddit r/MachineLearning Tools

Summary

The author shares their experience switching from semantic embeddings to BM25 for tool selection in agents, finding that BM25 achieves 81% top-1 accuracy vs. 64% for embeddings on a corpus of 200 query-tool pairs, because tool descriptions are short and keyword-driven rather than semantically rich like documents.

I've been building agents for about a year and recently shipped one for a client running ~140 MCP-exposed tools at peak. Along the way I made the canonical mistake. I used cosine similarity over tool description embeddings to pick which tools the model could see per turn. Worked great in demos. Was actively dangerous in production.

Here's the problem. In a basic semantic-ranking setup you embed the user query, embed every tool description once, and rank by cosine similarity at runtime. That works for general document retrieval where chunks are paragraph-length, semantically rich, and roughly equal in form. Tool descriptions are not that. They are short (often <50 tokens), structurally similar (verb-noun, parameters list), and the discriminative information is often a single keyword. "Read a file from disk" and "Read messages from a channel" both embed close to "read" + "file/channel." Cosine similarity puts them next to each other for a query like "read the latest commits" because all three words share the verb embedding space, and the actual discriminator (the noun "commits") gets diluted.

I watched this happen in eval. Asked the agent "list the open issues for this repo." The semantic ranker returned `slack_search_messages` first because the description had "list", "open", and "issues" as close embedding neighbors. The actual `github_list_issues` tool ranked 4th because the GitHub MCP author wrote a terse "Lists issues in a repository" description that scored lower on every soft keyword. If the model sees `slack_search_messages` first and `github_list_issues` fourth, it's going to pick the wrong one. Often.

So I built three retrieval strategies and tested them on a fixed corpus of 200 query→correct-tool pairs.

**Semantic embeddings (text-embedding-3-small)**: 64% top-1 accuracy. Sneaky failure mode: when wrong, it was confidently wrong, often with a totally unrelated tool ranked first.

**BM25 over a flat-text projection of tool name + description + schema walk**: 81% top-1. Failures were almost always lexical (the tool used "fetch" while the user said "get"), recoverable with light query rewriting.

**Hybrid (0.7 semantic + 0.3 BM25 normalized)**: 78%. Worse than BM25 alone. The semantic noise dragged BM25's clean signal down.

I sat with that result for a while. The "obvious" answer is hybrid; every RAG paper since 2023 says hybrid wins. For tool selection specifically, hybrid lost. The reason is that tools live in a smaller, more structured space than documents do. The discriminative signal is keyword-shaped. BM25 is built for exactly that.

The other thing I learned: indexing schema fields matters. The clean BM25 win came from projecting `name` + `description` + a walk over `input_schema` and `output_schema` (semantic tokens only, JSON Schema structure stripped). Property names like `repo_id` or `branch` are exactly the discriminators that turn "list the open issues" into a hit on GitHub instead of Slack. If you only index `name + description` you leave half your signal on the floor.

I ended up adopting Ratel's indexing approach (their ADR-0004 documents the exact projection) because rebuilding it myself was redundant. Open source, in-process Rust, NAPI-RS bound to a TS SDK, no infra. The semantic + re-ranking story is on their roadmap, but for now the BM25-only default is what I want anyway. Happy to share it in the comments if anyone wants to try.

The takeaway for anyone building tool selection or agent gateways: do not assume document-RAG defaults transfer. Tools are a different shape of data. BM25 is not the boring fallback; for this problem it's the right primary and semantic is the optional add. Test your specific corpus before you reach for embeddings.
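
For anyone who wants to see the shape of the projection-plus-BM25 idea, here is a minimal Python sketch. It is not Ratel's actual projection (their ADR-0004 is the reference for that) and not the code behind the numbers above. The tool records, the `walk_schema` and `project` helpers, the tokenizer, and the `k1`/`b` defaults are all assumptions made for illustration, and the IDF uses the common non-negative variant.

```python
import math
import re
from collections import Counter

# Hypothetical tool records: names, descriptions, and schemas are illustrative only.
TOOLS = [
    {
        "name": "github_list_issues",
        "description": "Lists issues in a repository",
        "input_schema": {
            "type": "object",
            "properties": {
                "repo_id": {"type": "string"},
                "state": {"type": "string", "description": "open or closed"},
            },
        },
    },
    {
        "name": "slack_search_messages",
        "description": "Search messages in a channel",
        "input_schema": {
            "type": "object",
            "properties": {
                "channel": {"type": "string"},
                "query": {"type": "string"},
            },
        },
    },
    {
        "name": "fs_read_file",
        "description": "Read a file from disk",
        "input_schema": {"type": "object", "properties": {"path": {"type": "string"}}},
    },
]


def tokenize(text):
    # Lowercase and split on non-alphanumerics, so "repo_id" becomes ["repo", "id"].
    return re.findall(r"[a-z0-9]+", text.lower())


def walk_schema(schema):
    # Keep property names and description strings; drop structural
    # JSON Schema keywords and values such as "type": "string".
    out = []
    if isinstance(schema, dict):
        for prop_name, sub in schema.get("properties", {}).items():
            out.append(prop_name)
            out.extend(walk_schema(sub))
        if isinstance(schema.get("description"), str):
            out.append(schema["description"])
        if "items" in schema:
            out.extend(walk_schema(schema["items"]))
    return out


def project(tool):
    # Flat-text projection: name + description + schema walk, then tokenize.
    parts = [tool["name"], tool.get("description", "")]
    parts += walk_schema(tool.get("input_schema", {}))
    parts += walk_schema(tool.get("output_schema", {}))
    return tokenize(" ".join(parts))


def bm25_rank(query, tools, k1=1.2, b=0.75):
    # Returns (score, tool_name) pairs, best first.
    docs = [project(t) for t in tools]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scored = []
    for tool, doc in zip(tools, docs):
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scored.append((score, tool["name"]))
    return sorted(scored, reverse=True)
```

Calling `bm25_rank("list the open issues for this repo", TOOLS)` puts `github_list_issues` first on this toy corpus, and the match on `repo` comes only from the schema walk. A lexical miss like "get" vs. "fetch" would still score zero here, which is where light query rewriting would have to come in.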