IndicMedDialog: A Parallel Multi-Turn Medical Dialogue Dataset for Accessible Healthcare in Indic Languages

Hugging Face Daily Papers

Summary

IndicMedDialog is a parallel multi-turn medical dialogue dataset spanning English and nine Indic languages, with a fine-tuned model for personalized symptom elicitation. The dataset is derived from MDDial, enhanced with LLM-generated synthetic consultations and expert verification, supporting multilingual healthcare AI.

Most existing medical dialogue systems operate in a single-turn question-answering paradigm or rely on template-based datasets, limiting conversational realism and multilingual applicability. We introduce IndicMedDialog, a parallel multi-turn medical dialogue dataset spanning English and nine Indic languages: Assamese, Bengali, Gujarati, Hindi, Marathi, Punjabi, Tamil, Telugu, and Urdu. The dataset extends MDDial with LLM-generated synthetic consultations, translated using TranslateGemma, verified by native speakers, and refined through a script-aware post-processing pipeline to correct phonetic, lexical, and character-spacing errors. Building on this dataset, we fine-tune IndicMedLM via parameter-efficient adaptation of a quantized small language model, incorporating optional patient pre-context to personalise multi-turn symptom elicitation. We evaluate against zero-shot multilingual baselines, conduct systematic error analysis across ten languages, and validate clinical plausibility through medical expert evaluation.
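The abstract does not spell out how the script-aware post-processing works. As a rough illustration of what a character-spacing fix could look like, the sketch below removes spaces stranded between a base character and a following combining mark (a vowel sign or virama, for example). The function name `fix_mark_spacing` and the rule itself are assumptions made for this example, not the authors' pipeline, and the sketch ignores joiner characters such as ZWJ and ZWNJ.

```python
import unicodedata


def fix_mark_spacing(text: str) -> str:
    """Remove spaces that separate a base character from a combining mark.

    Hypothetical helper for illustration only; the paper's actual
    post-processing rules are not described on this page.
    """
    # Normalise first so equivalent sequences share one representation.
    text = unicodedata.normalize("NFC", text)
    out = []
    for ch in text:
        # Unicode general categories Mn, Mc and Me are combining marks.
        if unicodedata.category(ch).startswith("M"):
            # A mark cannot begin a word, so any spaces just before it
            # are treated as spacing errors and dropped.
            while out and out[-1] == " ":
                out.pop()
        out.append(ch)
    return "".join(out)


# Devanagari KA, a stray space, then VOWEL SIGN I.
broken = "\u0915 \u093f"
print(fix_mark_spacing(broken) == "\u0915\u093f")
```

Because the check relies on Unicode general categories rather than a per-language table, the same rule applies across the scripts listed above; a production pipeline would likely need script-specific exceptions on top of it.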

Cached at: 05/14/26, 08:20 PM

Source: https://huggingface.co/papers/2605.13292

Abstract

A parallel multi-turn medical dialogue dataset spanning English and nine Indic languages is introduced, along with a fine-tuned model using parameter-efficient adaptation for personalized symptom elicitation.
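To make the idea of optional patient pre-context concrete, here is a minimal sketch of how a multi-turn prompt might be assembled with or without such context. The `PatientContext` fields, the `build_prompt` function, and the prompt layout are all hypothetical; the page does not describe the format the authors actually use.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PatientContext:
    # Hypothetical pre-context fields, chosen only for illustration.
    age: Optional[int] = None
    sex: Optional[str] = None
    known_conditions: Tuple[str, ...] = ()


def build_prompt(turns: List[Tuple[str, str]],
                 context: Optional[PatientContext] = None) -> str:
    """Format dialogue turns, prefixed by patient context when provided."""
    lines = []
    if context is not None:
        parts = []
        if context.age is not None:
            parts.append(f"age {context.age}")
        if context.sex:
            parts.append(f"sex {context.sex}")
        if context.known_conditions:
            parts.append("known conditions: " + ", ".join(context.known_conditions))
        if parts:
            lines.append("Patient context: " + "; ".join(parts))
    for speaker, utterance in turns:
        lines.append(f"{speaker}: {utterance}")
    # Leave the final doctor turn open for the model to complete.
    lines.append("Doctor:")
    return "\n".join(lines)


turns = [("Patient", "I have had a headache for two days.")]
print(build_prompt(turns, PatientContext(age=42, known_conditions=("diabetes",))))
```

Keeping the context optional means the same function serves both the personalised and the context-free setting, which mirrors the "optional" wording in the abstract.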

Similar Articles

Synthesis and Evaluation of Long-term History-aware Medical Dialogue

arXiv cs.CL

This paper introduces a framework for synthesizing long-term medical dialogue datasets using LLMs, and creates MediLongChat with three benchmark tasks to evaluate healthcare agents' memory and reasoning capabilities. Experiments show that even state-of-the-art LLMs struggle with these tasks.

MedAction: Towards Active Multi-turn Clinical Diagnostic LLMs

arXiv cs.CL

This paper introduces MedAction, a framework for training LLMs on active, multi-turn clinical diagnosis by simulating iterative test ordering and hypothesis updates. It presents a new dataset, MedAction-32K, and demonstrates state-of-the-art performance for open-source models on medical benchmarks.

HiMed: Incentivizing Hindi Reasoning in Medical LLMs

arXiv cs.CL

Introduces HiMed, a Hindi reasoning medical corpus and benchmark suite, and HiMed-8B, a Hindi-form medical reasoning LLM using decaying scaffolding reward, demonstrating improved Hindi medical reasoning and reduced English–Hindi accuracy gap.