Tag
This paper introduces a framework for auditing source-dependence in medical multi-source RAG systems. It releases the TransplantQA benchmark, the HERO-QA retrieval strategy, and a structured-output judge to measure inter-source answer relationships. It demonstrates that better retrieval reveals more disagreement than previously estimated, and argues for shifting NLP evaluation from answer correctness to inter-source relationship analysis.