Tag
Introduces HiMed, a Hindi medical reasoning corpus and benchmark suite, and HiMed-8B, a Hindi-form medical reasoning LLM trained with a decaying scaffolding reward, demonstrating improved Hindi medical reasoning and a reduced English–Hindi accuracy gap.
Introduces OGCaReBench, a free-form retrieval benchmark for evaluating LLMs on clinical questions that require reasoning beyond standard guidelines. Experiments show that even the best model achieves only 56% accuracy, but retrieval augmentation boosts performance to 82%.
This paper presents a large-scale assessment of medical LLMs, including custom MedGPTs and open-source models, finding that 25-30% exhibit low factual accuracy and 33.6-54.3% violate operational thresholds, highlighting systemic safety risks.
This article critiques Mark Kaplan's approach to fine-tuning medical LLMs via his platform healtthruth.ai, highlighting pitfalls in overriding foundational training for healthcare AI.