Your RAG is hallucinating because of garbage retrieval — here's the 3-line fix (with real scores)
Summary
A practical fix for RAG hallucination caused by noisy retrieval: re-rank the retrieved chunks with a cross-encoder and keep only those scoring above 1.5, which raised the average relevance score from -0.28 to +3.80.
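The filtering step described above can be sketched roughly as follows. This is a minimal illustration, not the article's own code: the function name `rerank_and_filter` and the `score_pairs` callable are hypothetical, and the only value taken from the summary is the 1.5 cutoff. The scorer is passed in as a parameter so that any cross-encoder can be used.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical scorer type: takes (query, chunk) pairs and returns one
# relevance score per pair, in the same order.
Scorer = Callable[[Sequence[Tuple[str, str]]], Sequence[float]]


def rerank_and_filter(
    query: str,
    chunks: List[str],
    score_pairs: Scorer,
    threshold: float = 1.5,  # cutoff quoted in the summary
) -> List[Tuple[str, float]]:
    """Score each chunk against the query, drop chunks at or below the
    threshold, and return the survivors sorted best-first."""
    if not chunks:
        return []
    scores = score_pairs([(query, chunk) for chunk in chunks])
    kept = [(chunk, float(s)) for chunk, s in zip(chunks, scores) if s > threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept


# Possible wiring with the sentence-transformers library (an assumption:
# the summary does not say which library or model was used):
#
#   from sentence_transformers import CrossEncoder
#   model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
#   context = rerank_and_filter(query, retrieved_chunks, model.predict)
```

If no chunk clears the threshold the function returns an empty list, so the caller has to decide what happens next (for example, answering that no relevant context was found). A cutoff of 1.5 is only meaningful for the scale of the scorer it was tuned on, so a different cross-encoder would likely need a different value.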
Similar Articles
RAGognizer: Hallucination-Aware Fine-Tuning via Detection Head Integration
RAGognizer introduces a hallucination-aware fine-tuning approach that integrates a lightweight detection head into LLMs for joint optimization of language modeling and hallucination detection in RAG systems. The paper presents RAGognize, a dataset of naturally occurring closed-domain hallucinations with token-level annotations, and demonstrates state-of-the-art hallucination detection while reducing hallucination rates without degrading language quality.
Most agent RAG problems I see are retrieval problems, not model problems
The author argues that most agent RAG failures are due to retrieval problems—specifically chunking errors, lack of freshness signals, and reliance on pure vector search—rather than the LLM, and recommends structural chunking, decay-based ranking, and hybrid BM25+vector search.
@vintcessun: Feeding too many documents into RAG causes retrieval quality to drop from 75% to 40%? Vector search is diluted by a large amount of irrelevant content, causing a sharp drop in hit rate in real deployment. Root cause: heterogeneous documents are retrieved together, noise drowns out signal. Multi-agent orchestration seems intelligent but actually introduces a precision-fidelity paradox—poor configuration leads to failure in both aspects. The paper proposes MA…
This paper identifies 'vector search dilution' in RAG systems when scaling to large heterogeneous document collections, where accuracy dropped from 75% to 40% in a real-world deployment. The proposed MASDR-RAG method uses domain scoping via organizational metadata before retrieval, improving P@10 from 0.77 to 0.86 with low cost and easy deployment.
When Retrieval Doesn't Help: A Large-Scale Study of Biomedical RAG
A large-scale study across 5 models (7B–72B), 10 biomedical QA datasets, 4 retrieval methods, and 4 corpora finds that RAG yields only small and inconsistent gains (1–2 points) over no-retrieval baselines in biomedical question answering. The study concludes that the main bottleneck is not retrieval quality but models' limited ability to effectively use retrieved evidence.
"Most RAG benchmarks lie about real-world corpora." Test data from 3 production websites.
This article argues that most RAG benchmarks are misleading because they assume uniform corpus quality, while real-world corpora vary significantly in content density. Using data from three production websites, it shows that a tiered approach and a 'yield score' can better predict retrieval effectiveness.