Why I stopped using semantic embeddings for tool selection and switched back to BM25 [D]

Reddit r/MachineLearning 06/08/26, 01:24 PM Tools

tool-selection semantic-embeddings bm25 agent-building mcp retrieval-strategies production-lessons

Summary

The author shares their experience switching from semantic embeddings to BM25 for tool selection in agents, finding that BM25 achieves 81% top-1 accuracy vs. 64% for embeddings on a corpus of 200 query-tool pairs, because tool descriptions are short and keyword-driven rather than semantically rich like documents.

I've been building agents for about a year and recently shipped one for a client running ~140 MCP-exposed tools at peak. Along the way I made the canonical mistake: I used cosine similarity over tool description embeddings to pick which tools the model could see per turn. Worked great in demos. Was actively dangerous in production.

Here's the problem. In a basic semantic-ranking setup you embed the user query, embed every tool description once, and rank by cosine similarity at runtime. That works for general document retrieval where chunks are paragraph-length, semantically rich, and roughly equal in form. Tool descriptions are not that. They are short (often <50 tokens), structurally similar (verb-noun, parameters list), and the discriminative information is often a single keyword. "Read a file from disk" and "Read messages from a channel" both embed close to "read" + "file/channel." Cosine similarity puts them next to each other for a query like "read the latest commits" because all three texts sit in the same neighborhood around the verb, and the actual discriminator (the noun "commits") gets diluted.

I watched this happen in eval. I asked the agent "list the open issues for this repo." The semantic ranker returned `slack_search_messages` first because its description had close embedding neighbors for "list", "open", and "issues". The actual `github_list_issues` tool ranked 4th because the GitHub MCP author wrote a terse "Lists issues in a repository" description that scored lower on every soft keyword. If the model sees `slack_search_messages` first and `github_list_issues` fourth, it's going to pick the wrong one. Often.

So I built three retrieval strategies and tested them on a fixed corpus of 200 query→correct-tool pairs:

- **Semantic embeddings (text-embedding-3-small)**: 64% top-1 accuracy. Sneaky failure mode: when wrong, it was confidently wrong, often with a totally unrelated tool ranked first.
- **BM25 over a flat-text projection of tool name + description + schema walk**: 81% top-1. Failures were almost always lexical (the tool used "fetch" while the user said "get"), recoverable with light query rewriting.
- **Hybrid (0.7 semantic + 0.3 BM25 normalized)**: 78%. Worse than BM25 alone. The semantic noise dragged BM25's clean signal down.

I sat with that result for a while. The "obvious" answer is hybrid; every RAG paper since 2023 says hybrid wins. For tool selection specifically, hybrid lost. The reason is that tools live in a smaller, more structured space than documents do. The discriminative signal is keyword-shaped. BM25 is built for exactly that.

The other thing I learned: indexing schema fields matters. The clean BM25 win came from projecting `name` + `description` + a walk over `input_schema` and `output_schema` (semantic tokens only, JSON Schema structure stripped). Property names like `repo_id` or `branch` are exactly the discriminators that turn "list the open issues" into a hit on GitHub instead of Slack. If you only index `name + description` you leave half your signal on the floor.

I ended up adopting Ratel's indexing approach (their ADR-0004 documents the exact projection) because rebuilding it myself was redundant. It's open source, in-process Rust, NAPI-RS bound to a TS SDK, no infra. The semantic + re-ranking story is on their roadmap, but for now the BM25-only default is what I want anyway. Happy to share it in the comments if anyone wants to try.

The takeaway for anyone building tool selection or agent gateways: do not assume document-RAG defaults transfer. Tools are a different shape of data. BM25 is not the boring fallback; for this problem it's the right primary and semantic is the optional add. Test your specific corpus before you reach for embeddings.
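To make the "flat-text projection + BM25" idea concrete, here is a minimal pure-Python sketch. It is not Ratel's ADR-0004 projection and not the code from the eval above; the tool definitions, the `walk_schema` rules (keep property names and description strings, drop everything else), the tokenizer, and the BM25 parameters `k1=1.5`, `b=0.75` are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumerics, so `repo_id` becomes `repo`, `id`."""
    return re.findall(r"[a-z0-9]+", text.lower())


def walk_schema(schema):
    """Yield property names and description strings; skip JSON Schema structure."""
    if not isinstance(schema, dict):
        return
    if isinstance(schema.get("description"), str):
        yield schema["description"]
    props = schema.get("properties")
    if isinstance(props, dict):
        for prop_name, sub_schema in props.items():
            yield prop_name
            yield from walk_schema(sub_schema)
    yield from walk_schema(schema.get("items"))


def project_tool(tool):
    """Flatten one tool into a single text document for lexical indexing."""
    parts = [tool["name"], tool.get("description", "")]
    parts += walk_schema(tool.get("input_schema"))
    parts += walk_schema(tool.get("output_schema"))
    return " ".join(parts)


class BM25Index:
    """Tiny in-memory BM25 over {tool_name: projected_text}."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.names = list(docs)
        self.tokens = [tokenize(text) for text in docs.values()]
        self.tf = [Counter(toks) for toks in self.tokens]
        self.avgdl = sum(len(toks) for toks in self.tokens) / len(self.tokens)
        n = len(self.tokens)
        df = Counter(term for toks in self.tokens for term in set(toks))
        # Lucene-style IDF, which stays positive for every term.
        self.idf = {t: math.log(1 + (n - d + 0.5) / (d + 0.5)) for t, d in df.items()}

    def score(self, query):
        """Return {tool_name: BM25 score} for the query."""
        scores = {}
        for name, tf, toks in zip(self.names, self.tf, self.tokens):
            norm = self.k1 * (1 - self.b + self.b * len(toks) / self.avgdl)
            scores[name] = sum(
                self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + norm)
                for t in tokenize(query)
                if t in tf
            )
        return scores

    def rank(self, query):
        scores = self.score(query)
        return sorted(scores, key=scores.get, reverse=True)


# Hypothetical tool definitions, loosely modeled on the example in the post.
TOOLS = [
    {
        "name": "slack_search_messages",
        "description": "Search messages in a channel",
        "input_schema": {"type": "object", "properties": {
            "channel": {"type": "string"}, "query": {"type": "string"}}},
    },
    {
        "name": "github_list_issues",
        "description": "Lists issues in a repository",
        "input_schema": {"type": "object", "properties": {
            "repo_id": {"type": "string"}, "state": {"type": "string"}}},
    },
]

index = BM25Index({t["name"]: project_tool(t) for t in TOOLS})
ranking = index.rank("list the open issues for this repo")
print(ranking[0])
```

In this toy setup the query matches `list` and `issues` from the tool name and `repo` from the `repo_id` property, which is the kind of schema-derived signal described above. A real index would also need to decide how to handle `enum` values, nested `$ref`s, and stop words, none of which this sketch covers.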
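For anyone wondering how a 0.7/0.3 hybrid can end up worse than BM25 alone, here is a small sketch of weighted score fusion. The post does not say how the scores were normalized, so min-max normalization of both score sets is an assumption, and the scores below are made-up numbers for illustration, not measurements from the eval.

```python
def min_max(scores):
    """Rescale a {name: score} mapping to [0, 1]; all-equal inputs map to 0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {name: 0.0 for name in scores}
    return {name: (value - lo) / (hi - lo) for name, value in scores.items()}


def hybrid_rank(semantic, bm25, w_semantic=0.7, w_bm25=0.3):
    """Rank tools by a weighted sum of normalized semantic and BM25 scores."""
    sem_n, bm_n = min_max(semantic), min_max(bm25)
    fused = {
        name: w_semantic * sem_n[name] + w_bm25 * bm_n.get(name, 0.0)
        for name in semantic
    }
    return sorted(fused, key=fused.get, reverse=True)


# Illustrative (invented) scores: cosine similarity favors the wrong tool,
# while BM25 has a clean hit on the right one.
semantic_scores = {
    "slack_search_messages": 0.52,
    "github_list_issues": 0.35,
    "read_file": 0.30,
}
bm25_scores = {
    "slack_search_messages": 0.0,
    "github_list_issues": 4.2,
    "read_file": 0.0,
}

bm25_top = max(bm25_scores, key=bm25_scores.get)
hybrid_top = hybrid_rank(semantic_scores, bm25_scores)[0]
print(bm25_top, hybrid_top)
```

With these numbers BM25 alone picks `github_list_issues`, but the fused score for `slack_search_messages` (0.7) beats the fused score for `github_list_issues` (about 0.46), so the hybrid picks the wrong tool. Different weights or a rank-based fusion would behave differently; this only shows the mechanism.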
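If you want to run the same kind of comparison on your own corpus, a top-1 accuracy harness is only a few lines. This is a sketch under assumptions: `top1_accuracy`, the toy `overlap_rank` ranker, and the three labeled pairs are hypothetical stand-ins, and you would plug in your real rankers and your own query→correct-tool pairs.

```python
import re


def top1_accuracy(pairs, rank):
    """Fraction of (query, expected_tool) pairs where rank(query)[0] is expected_tool."""
    pairs = list(pairs)
    hits = 0
    for query, expected in pairs:
        ranked = rank(query)
        if ranked and ranked[0] == expected:
            hits += 1
    return hits / len(pairs)


# Toy corpus and ranker, only here to exercise the harness.
TOOL_TEXT = {
    "github_list_issues": "lists issues in a repository",
    "slack_search_messages": "search messages in a channel",
    "http_fetch_url": "fetch a url",
}


def words(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_rank(query):
    """Rank tools by the number of words shared with the query (ties keep dict order)."""
    q = words(query)
    return sorted(TOOL_TEXT, key=lambda name: len(q & words(TOOL_TEXT[name])), reverse=True)


PAIRS = [
    ("list issues in this repository", "github_list_issues"),
    ("search the channel for messages", "slack_search_messages"),
    ("get the page", "http_fetch_url"),  # lexical miss: "get" vs "fetch"
]

accuracy = top1_accuracy(PAIRS, overlap_rank)
print(f"top-1 accuracy: {accuracy:.2f}")
```

The third pair fails for the same reason as the lexical misses mentioned above: the query says "get" and the tool says "fetch", so nothing overlaps. Keeping the pairs fixed while swapping rankers is what makes the numbers comparable.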


Similar Articles

@DailyDoseOfDS_: Stop using vector search everywhere! A 30-year-old algorithm with zero training, zero embeddings, and zero fine-tuning …

@mixedbreadai: By now, everyone knows that single-vector embedding models are hugely limiting for modern workflows. But they contain t…

@jobergum: You know me as the BM25 guy, but embeddings are cool too. New post from the @HornetDev team just dropped. ANN tuning at…

is [ BM25 + vector ]+ RRF really worth it?

Introducing Contextual Retrieval
