@DailyDoseOfDS_: Stop using vector search everywhere! A 30-year-old algorithm with zero training, zero embeddings, and zero fine-tuning …


Summary

The article argues against overusing vector search, highlighting BM25's effectiveness for exact keyword matching and its role in hybrid search systems.

Stop using vector search everywhere! A 30-year-old algorithm with zero training, zero embeddings, and zero fine-tuning still powers Elasticsearch, OpenSearch, and most production search systems today. It's called BM25. Let us explain what makes it so powerful.

Imagine you're searching for "transformer attention mechanism" in a library of ML papers. BM25 asks three simple questions:

"How rare is this word?" Every paper contains "the" and "is", which makes those words useless for ranking. But "transformer" is specific and informative. BM25 boosts rare words and down-weights the noise.
→ This is IDF(qᵢ) in the formula.

"How many times does it appear?" If "attention" appears 10 times in a paper, that's a good sign. But 10 vs 100 occurrences won't make much difference. BM25 applies diminishing returns.
→ This is f(qᵢ, D), combined with k₁, which controls saturation.

"Is this document unusually long?" A 50-page paper will naturally contain more keywords than a 5-page paper. BM25 levels the playing field so longer documents don't cheat their way to the top.
→ This is |D|/avgdl, controlled by the parameter b.

Three questions. No neural networks. No training data. Just elegant math.

The best part: BM25 excels at exact keyword matching, something embeddings often struggle with. If your user searches for "error code 5012," embeddings might return results that are semantically similar but never mention that code. BM25 will find the exact match.

This is why hybrid search exists. Top RAG systems today combine BM25 with vector search. You get the best of both worlds: semantic understanding AND precise keyword matching.

So before you throw GPUs at every search problem, consider BM25. It might already solve your problem, or make your semantic search even better when combined.
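To make the three questions concrete, here is a minimal sketch of BM25 scoring in plain Python. It is not taken from the post: the whitespace tokenizer, the function names, the sample documents, the defaults k1=1.5 and b=0.75, and the smoothed IDF variant ln(1 + (N - n + 0.5) / (n + 0.5)) are all assumptions chosen for illustration, and production engines may differ in these details.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with BM25.

    score(D, Q) = sum over query terms q of
        IDF(q) * f(q, D) * (k1 + 1) / (f(q, D) + k1 * (1 - b + b * |D| / avgdl))

    k1 and b defaults are assumed, commonly used values.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: how many documents contain each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # "Is this document unusually long?" -> |D| / avgdl, weighted by b.
        length_norm = 1 - b + b * len(tokens) / avgdl
        score = 0.0
        for term in tokenize(query):
            f = tf[term]
            if f == 0:
                continue
            # "How rare is this word?" -> smoothed IDF (assumed variant).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # "How many times does it appear?" -> saturating in f via k1.
            score += idf * f * (k1 + 1) / (f + k1 * length_norm)
        scores.append(score)
    return scores


# Hypothetical mini-corpus for illustration only.
docs = [
    "the transformer attention mechanism is the core of the model",
    "the service returned error code 5012 after the timeout",
    "the cat is on the mat and the dog is in the yard",
]
scores = bm25_scores("error code 5012", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

Only the document that literally contains "error", "code", and "5012" gets a non-zero score, which is the exact-match behavior described above. A query made only of a word like "the" still matches every document, but its low IDF keeps those scores small.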
