A French OSCE Dialogue Dataset and Controllable Virtual Patient System for Clinical Training
Summary
This paper introduces a French OSCE dialogue dataset of 240 interactions and a controllable LLM-based pipeline for generating synthetic OSCE dialogues, enabling realistic virtual patient simulations for medical training with automatic feedback.
Cached at: 06/30/26, 05:27 AM
# A French OSCE Dialogue Dataset and Controllable Virtual Patient System for Clinical Training
Source: [https://arxiv.org/html/2606.28526](https://arxiv.org/html/2606.28526)
Doria Bonzi¹, Tom Bourgeade¹, Fabrice Lefèvre², Irina Illina¹

¹ Lorraine Université, CNRS, Inria, LORIA, Nancy, France; ² Avignon Université, LIA, UPR 4128, Avignon, France

doria.bonzi@loria.fr, tom.bourgeade@loria.fr, irina.illina@loria.fr, fabrice.lefevre@univ-avignon.fr
###### Abstract
The clinical and communication skills of medical students are commonly assessed through Objective Structured Clinical Examinations \(OSCEs\), which consist of brief scenario\-driven simulations of doctor\-patient interactions\. However, training is often limited by the low availability of human standardized patients, motivating the development of realistic virtual patients \(VPs\)\. To address this gap, we introduce a French OSCE dialogue dataset comprising 240 student–patient training interactions\. We build upon it a controllable LLM\-based pipeline to generate synthetic OSCE dialogues\. The pipeline integrates modular components, such as retrieval\-based grounding and a reflection loop, to ensure patient fidelity, coherence, and realism\. Additionally, we propose a multi\-level evaluation framework assessing patient simulation quality, student performance, and linguistic quality, using an LLM\-as\-a\-Judge approach\. Experiments suggest that controllability modules generally improve patient fidelity and student evaluation consistency\. Finally, we also implement an interactive prototype in which students can practice with a VP and receive automatic feedback\.
## 1 Introduction
Objective Structured Clinical Examinations (OSCEs) are used to assess both clinical reasoning and communication skills in medical education. In France, these exams involve medical students playing the role of a physician in 7- to 10-minute scenario-based simulated interactions with a standardized patient (SP), under the observation of an evaluator. An SP is defined as a trained individual who portrays a predefined patient scenario for the training and assessment of healthcare students Lewis et al. ([2017](https://arxiv.org/html/2606.28526#bib.bib52)). Each OSCE scenario is referred to as a “station” and can cover various medical interactions, ranging from patient history taking and analyzing results to breaking bad news. By design, OSCEs approximate real clinical encounters and intentionally simplify many aspects for pedagogical purposes. Communication skills play a central role in OSCE performance; however, student training is limited by the availability of human standardized patients and evaluators, leading to scalability and cost issues for repeated practice.
Figure 1: Overview of the controllable LLM-based dialogue generation pipeline for OSCE simulations in automatic mode. The physician generator follows a criteria-based script; the patient generator can be augmented by a retrieval module, and the reflection loop with controller-corrector modules can be activated to control patient faithfulness through a verification-correction loop.

Virtual Patient (VP) systems have been proposed to address these problems, relying on rule-based or script-driven approaches Zini et al. ([2019](https://arxiv.org/html/2606.28526#bib.bib14)); Campillos-Llanos et al. ([2019](https://arxiv.org/html/2606.28526#bib.bib24)); Laleye et al. ([2020a](https://arxiv.org/html/2606.28526#bib.bib5)). Data-driven and LLM-based VP systems have demonstrated improved fluency and realism Voigt et al. ([2025](https://arxiv.org/html/2606.28526#bib.bib12)); García-Torres et al. ([2024](https://arxiv.org/html/2606.28526#bib.bib15)); Cook et al. ([2025](https://arxiv.org/html/2606.28526#bib.bib32)). Some studies explored automated feedback for OSCE assessments using LLM-as-a-Judge Shakur et al. ([2024](https://arxiv.org/html/2606.28526#bib.bib40)); Campbell et al. ([2025](https://arxiv.org/html/2606.28526#bib.bib34)); Huang et al. ([2026](https://arxiv.org/html/2606.28526#bib.bib44)). Despite these advances, current LLM-based VP systems often lack controllability, hindering consistent adherence to predefined clinical stations. Recent work has also highlighted the lack of standardized and reproducible evaluation frameworks in LLM-based VP systems Li and Lutfi ([2026](https://arxiv.org/html/2606.28526#bib.bib30)), which are essential for reliable assessment.
Publicly available OSCE-related datasets are rare, especially in French. Existing OSCE datasets in English focus on specific clinical tasks or domains Fareez et al. ([2022](https://arxiv.org/html/2606.28526#bib.bib17)); Saley et al. ([2024](https://arxiv.org/html/2606.28526#bib.bib8)). Large-scale medical dialogue resources Liu et al. ([2024](https://arxiv.org/html/2606.28526#bib.bib28)); Ben Abacha et al. ([2023](https://arxiv.org/html/2606.28526#bib.bib20)) are generally not aligned with OSCE constraints or lack the interactional and evaluative structure required for examination settings. French resources remain limited in size and scope Laleye et al. ([2020b](https://arxiv.org/html/2606.28526#bib.bib21)). This lack of French OSCE data restricts the development of data-driven training and evaluation tools.
To address these challenges, we introduce in this paper:
1. a dataset of 240 recorded and transcribed French OSCE training dialogues, available on Zenodo upon request ([zenodo.org/records/20719833](https://zenodo.org/records/20719833));
2. a controllable LLM-based OSCE dialogue generation pipeline with modular components, supporting both automatic and interactive modes;
3. a multi-level LLM-as-a-Judge evaluation framework assessing generated dialogues and recorded OSCE interactions.
## 2 Related work
Virtual patients and LLM-based simulations: Recent work on LLM-based VPs focuses on realism, interactivity, automatic scoring and feedback Voigt et al. ([2025](https://arxiv.org/html/2606.28526#bib.bib12)); García-Torres et al. ([2024](https://arxiv.org/html/2606.28526#bib.bib15)) and patient data fidelity Wang et al. ([2024b](https://arxiv.org/html/2606.28526#bib.bib37)); Laverde et al. ([2025](https://arxiv.org/html/2606.28526#bib.bib35)). Embodied VPs have been explored as simulation-based tools, emphasizing realistic behavioral interactions Chaby and others ([2022](https://arxiv.org/html/2606.28526#bib.bib47)). Agentic approaches used structured patient data combined with multi-agent RAG workflows to enable controllability Yu et al. ([2025](https://arxiv.org/html/2606.28526#bib.bib9)). However, these works were rarely anchored in real OSCE recordings, limiting their grounding in authentic interactions.
Medical dialogue and synthetic dialogue generation: In the context of OSCEs, Fareez et al. ([2022](https://arxiv.org/html/2606.28526#bib.bib17)) introduced a dataset of simulated patient interviews focused on respiratory cases. Building on this work, Saley et al. ([2024](https://arxiv.org/html/2606.28526#bib.bib8)) released an English dataset for medical history-taking in OSCE format. French medical dialogue resources remain limited. Laleye et al. ([2020b](https://arxiv.org/html/2606.28526#bib.bib21)) introduced a small annotated French corpus combining generated dialogues and interactions between medical students and patients. Chen et al. ([2023](https://arxiv.org/html/2606.28526#bib.bib36)) investigated LLM-based dual-agent simulations, emulating both physicians and patients for clinical dialogue generation.
Large-scale medical dialogue datasets provide broad coverage of clinical interactions but are not designed for OSCE-specific scenarios Jepson et al. ([2017](https://arxiv.org/html/2606.28526#bib.bib18)); Zeng et al. ([2020](https://arxiv.org/html/2606.28526#bib.bib3)); Liu et al. ([2024](https://arxiv.org/html/2606.28526#bib.bib28)); Ben Abacha et al. ([2023](https://arxiv.org/html/2606.28526#bib.bib20)). In parallel, LLM-based dialogue generation approaches improved fluency and realism but lack strict controllability and alignment with structured exam settings Das et al. ([2024](https://arxiv.org/html/2606.28526#bib.bib31)); Wang et al. ([2024a](https://arxiv.org/html/2606.28526#bib.bib19)).
Reflective prompting and LLM-based evaluation: Recent advances in LLM prompting have introduced self-reflection and critique mechanisms to improve factuality, coherence, and task adherence in complex generation tasks Agrawal et al. ([2026](https://arxiv.org/html/2606.28526#bib.bib38)); Chirkova et al. ([2026](https://arxiv.org/html/2606.28526#bib.bib1)); Li et al. ([2023](https://arxiv.org/html/2606.28526#bib.bib6)). In parallel, LLM-as-a-Judge paradigms have emerged to evaluate conversational agents at scale across multiple criteria, including dialogue quality and OSCE performance Shakur et al. ([2024](https://arxiv.org/html/2606.28526#bib.bib40)); Campbell et al. ([2025](https://arxiv.org/html/2606.28526#bib.bib34)). Gu et al. ([2026](https://arxiv.org/html/2606.28526#bib.bib2)) highlighted their growing adoption, with applications ranging from human-machine dialogue assessment Njifenjou et al. ([2025](https://arxiv.org/html/2606.28526#bib.bib25), [2024](https://arxiv.org/html/2606.28526#bib.bib26)) to OSCE scenarios Shakur et al. ([2024](https://arxiv.org/html/2606.28526#bib.bib40)); Campbell et al. ([2025](https://arxiv.org/html/2606.28526#bib.bib34)) and benchmarking conversational agents in controlled settings Zheng et al. ([2023](https://arxiv.org/html/2606.28526#bib.bib10)). These approaches motivate the design of structured evaluation and reflective stages, which we incorporate in our pipeline to assess and iteratively improve our generated OSCE dialogues.
Building on prior work, we distinguish our approach in three ways: \(1\) a focus on French OSCE stations, addressing the scarcity of French medical dialogue resources; \(2\) a modular, controllable LLM dialogue generation pipeline integrating information retrieval and self\-reflection, grounded in OSCE data across more than ten medical specialties; \(3\) a multi\-faceted evaluation covering linguistic and clinical performance\.
## 3 Proposed French OSCE dialogue dataset
Our proposed dataset consists of: (i) a corpus of 240 recorded OSCE training dialogues, representing a total of 30 hours of audio, and (ii) a corpus of 792 generated dialogues produced using different experimental configurations of our controllable LLM-based generation pipeline (see [Figure 2(a)](https://arxiv.org/html/2606.28526#S3.F2.sf1)).

| Component | Count |
|---|---|
| Total OSCE stations | 192 |
| **Recorded dialogues** | |
| OSCE stations recorded | 23 |
| Recorded dialogues | 240 |
| Total audio duration (hours) | 30 |
| **Generated dialogues** | |
| OSCE stations used | 11 |
| Generated dialogues | 792 |
| Total generated words | 1.22M |

(a) Dataset statistics: recorded and generated dialogues.
(b) OSCE station classification and number of stations per category.

Figure 2: Overview of the French OSCE dialogue dataset, both recorded and generated.

### 3.1 Data collection: OSCE stations and recorded dialogues
#### OSCE stations:
A total of 192 OSCE clinical stations were collected for this dataset across more than 10 medical specialties. These stations are authored by medical teachers and healthcare professionals. Each OSCE station consists of three complementary documents made available through a dedicated online platform: a physician sheet addressed to the evaluated student, a patient sheet, and an evaluator sheet.
Each OSCE station was mapped to a category from a dialogue-generation-focused difficulty classification developed specifically for this work, based on: (i) the type of interlocutor and (ii) the requirement for document analysis ([Figure 2(b)](https://arxiv.org/html/2606.28526#S3.F2.sf2)).
#### OSCE recorded dialogues:
We recorded 240 role-played physician-patient dialogues involving 99 sixth-year medical students across 23 distinct OSCE stations. These recordings were collected during OSCE training sessions organized weekly by the local student association, which are designed to closely replicate the conditions of the national OSCE examinations in France. Due to the limited availability of external volunteers, all roles (physician, patient, and evaluator) were performed by students. Although this is not ideal, it was an unavoidable constraint in our setting. As participants are neither professional actors nor trained OSCE evaluators, this setup may affect the realism of the interactions and evaluations, and may partially account for differences observed when comparing these dialogues with synthetically generated ones.
Each OSCE training session lasted 8 minutes, followed by a 2-minute feedback period from the evaluator. All participants provided written informed consent for the use of their recordings for research purposes. Audio was captured using wireless clip-on microphones. All recordings were acquired using the same software, producing synchronized two-channel WAV files. More details on the recording protocol are available in Appendix [A.1](https://arxiv.org/html/2606.28526#A1.SS1).
#### Recorded stations annotations:
Each recorded station was manually labeled with: one of 12 specialties (e.g., neurology, pediatrics, gastroenterology), a consultation type (e.g., emergency, follow-up), and one or more objectives (e.g., diagnosis, breaking bad news, history-taking, patient education).
All annotations were performed by a single annotator through careful examination of the OSCE station materials, particularly the physician and evaluator sheets, which typically provide the station context, including specialty, consultation type, and primary objectives (see Appendix [A.4](https://arxiv.org/html/2606.28526#A1.SS4)).
#### Automatic transcription:
All recorded dialogues were automatically diarized using the precision-2 model PyAnnoteAI ([2024](https://arxiv.org/html/2606.28526#bib.bib50)), and transcribed using the faster-whisper-large-v3-turbo model OpenAI ([2024](https://arxiv.org/html/2606.28526#bib.bib51)); Radford et al. ([2023](https://arxiv.org/html/2606.28526#bib.bib7)).
#### Transcription evaluation:
To assess transcription quality, Word Error Rate \(WER\) was computed by comparing automatic transcriptions with manually corrected versions\. A subset of 26 automatically transcribed dialogues, representing approximately 11% of the recorded corpus, was manually corrected by 15 French\-fluent annotators using the original audio recordings\.
The resulting WER of 7.7% suggests good transcription quality for real-world clinical dialogues, though it was computed on a limited sample. More information on WER computation is available in Appendix [A.2](https://arxiv.org/html/2606.28526#A1.SS2). Recurrent errors included names, acronyms (e.g., SMUR, ECOS), domain-specific medical terminology (e.g., dyspnée, “shortness of breath”), occasional diarization errors due to cross-talk, and disfluencies such as hesitations or repetitions.
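As an illustration of the metric, the following minimal sketch computes WER as the word-level edit distance between a manually corrected reference and an automatic transcript, divided by the reference length. The normalization used here (lowercasing and whitespace tokenization) and the example sentences are assumptions made for illustration; the exact procedure used for the reported 7.7% is the one described in Appendix A.2.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    # Assumed normalization: lowercase + whitespace tokenization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substitution and one deletion over 8 reference words.
reference = "le patient présente une dyspnée depuis trois jours"
hypothesis = "le patient présente une dispnée depuis jours"
wer = word_error_rate(reference, hypothesis)
print(f"WER = {wer:.2%}")
```

In practice, a corpus-level WER would sum the edit operations over all 26 corrected dialogues before dividing by the total number of reference words, rather than averaging per-dialogue rates.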
## 4 Controlled dialogue generation pipeline
To benchmark LLM\-based virtual patients against student\-acted patients in OSCE training sessions, we synthetically generated 792 dialogues from structured OSCE station sheets\. We focused on 11 stations for which real training recordings were available, excluding multimodal stations involving document analysis \(e\.g\., radiology images\)\. In this work, we deliberately restrict the problem to text\-based interactions, excluding speech and multimodal information\.
Our proposed approach ([Figure 1](https://arxiv.org/html/2606.28526#S1.F1)) frames dialogue generation around controllability, aiming to ensure fidelity to patient profiles, coherence, and realism. The pipeline incorporates multiple levels of control with retrieval-based grounding and consistency enforcement through a reflection loop. The proposed system generates VP utterances interacting with a simulated physician in an OSCE setting, constrained to a fixed consultation duration of 8 minutes. The architecture is modular and LLM-based and supports two operating modes:
1. an interactive mode, where a student can train as the physician, via our prototype interface (see [Figure 3](https://arxiv.org/html/2606.28526#S4.F3)). At the end of the interaction, students receive an automatic LLM-based evaluation of their performance, including feedback on addressed criteria and overall communication quality (see [Section 5.2](https://arxiv.org/html/2606.28526#S5.SS2.SSS0.Px2)).
2. an automatic mode, where both physician and patient are LLM-generated, enabling large-scale dialogue dataset generation.

In both modes, VP utterances are controlled to ensure fidelity and consistency with the patient profile. The system relies on up to four distinct LLM instances assigned to specific roles: retrieval, generation, control, and correction. This modular design enables independent model selection and parameter tuning for each component. Dialogue generation is grounded in three components:
1. Patient sheet, describing the simulated patient’s identity, medical history, symptoms, and behavior (e.g., Frédérique Dumont consulting for an extension of medical leave);
2. Physician sheet, providing the clinical context and setting, available information, and consultation objectives (e.g., current medication and information-gathering goals);
3. Evaluation sheet, listing OSCE criteria the physician should address, each corresponding to a specific clinical action or piece of information (e.g., assessing treatment effectiveness).
Figure 3: Interactive demo of our VP system. The student interacts with the VP via messages and can consult the physician sheet at any time. An 8-minute timer simulates OSCE conditions.

#### Dialogue script and criteria reordering:
In automatic mode, dialogue generation is guided by a script derived from the evaluation checklist of the OSCE station. Following prior work Huang et al. ([2026](https://arxiv.org/html/2606.28526#bib.bib44)), criteria are reordered into dialogue phases: Opening and preparation, Information collection, Assessment, and Plan and conclusion (OIAP), inspired by SOAP Podder et al. ([2026](https://arxiv.org/html/2606.28526#bib.bib33)). This reordering is performed by an LLM before generation and used to guide physician utterances step by step, acting as a dialogue script Huang et al. ([2026](https://arxiv.org/html/2606.28526#bib.bib44)). We define two generation modes: a standard mode following the OIAP structure, and a random mode where intermediate phases and criteria order are shuffled while preserving the opening and conclusion. This enables generating diverse dialogues for the same station. Dialogue generation proceeds in batches of up to four criteria, with up to three dialogue turns per batch, empirically chosen to balance generation quality and dialogue duration constraints. As OSCE stations are time-constrained, dialogue duration was estimated based on average speech rate and pauses.
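The script construction described above can be sketched as follows. This is an illustrative reading, not the authors' implementation: the function name `build_script`, the phase keys, and the example criteria are hypothetical, the criteria are assumed to be already assigned to phases (a step the paper delegates to an LLM), and batches are assumed not to span two phases.

```python
import random

# OIAP phases in their standard order (short keys are illustrative).
PHASES = ["opening", "information", "assessment", "plan"]


def build_script(criteria_by_phase, mode="standard", batch_size=4, seed=None):
    """Return an ordered list of (phase, batch_of_criteria) pairs."""
    if mode not in ("standard", "random"):
        raise ValueError("mode must be 'standard' or 'random'")
    rng = random.Random(seed)

    # Only the intermediate phases may be shuffled; opening and conclusion stay fixed.
    middle = PHASES[1:-1]
    if mode == "random":
        rng.shuffle(middle)
    order = [PHASES[0], *middle, PHASES[-1]]

    script = []
    for phase in order:
        criteria = list(criteria_by_phase.get(phase, []))
        if mode == "random" and phase in middle:
            rng.shuffle(criteria)
        # Split each phase into batches of at most `batch_size` criteria.
        for start in range(0, len(criteria), batch_size):
            script.append((phase, criteria[start:start + batch_size]))
    return script


# Hypothetical station checklist, already grouped by phase.
station = {
    "opening": ["greets the patient", "checks identity"],
    "information": ["asks about pain", "asks about allergies", "asks about medication",
                    "asks about history", "asks about work"],
    "assessment": ["states a diagnosis"],
    "plan": ["proposes a treatment", "closes the consultation"],
}
for phase, batch in build_script(station, mode="random", seed=0):
    print(phase, batch)
```

Each batch would then be handed to the physician generator, which spends up to three dialogue turns on it before moving to the next batch.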
Physician generator: The physician generator produces the physician’s questions and statements. To ensure that these utterances remain coherent, contextually accurate, and aligned with the intended OSCE workflow, the prompt is dynamically enriched for each batch with the context of the physician sheet, the dialogue history, the current clinical stage, the relevant criteria checklist, and instructions for stage transitions when the batch manager signals a phase change.
Patient generator: Upon receiving a physician utterance, the retrieval module selects relevant information from the patient sheet as input context for the generator. If retrieval is disabled, the full sheet is provided. The generator then produces a context-aware response based on the patient sheet, instructions, and dialogue history.
Controller module: Generated responses are evaluated for fidelity to the patient sheet using an LLM-based scoring module that assesses consistency with the provided patient information and identifies potential contradictions. The controller assigns a score from 0 to 10 based on the number and severity of detected inconsistencies, along with a list of errors. Responses scoring above 8/10 are accepted, while lower-scoring responses are sent to a correction module before final validation.
The correction submodule takes the flagged response along with an instruction prompt, the patient sheet, the physician’s question, and the errors identified by the controller. Its role is to correct the response by addressing the identified errors. Once corrected, the response is sent back to the controller module for final validation, forming a reflection loop that iteratively improves response fidelity and reduces contradictions with the patient sheet.
All modules are LLM-powered, and different models can be used per module. Examples are included in the prompt using a few-shot prompting strategy for both the controller and correction modules. Detailed prompts are provided in Appendix [A.6](https://arxiv.org/html/2606.28526#A1.SS6).
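The controller-corrector interaction can be summarized by the control-flow sketch below. It is a simplified illustration under stated assumptions: `Verdict`, `reflect`, and the toy `control` and `correct` callables are hypothetical stand-ins for the LLM-backed modules, a score of at least 8 is treated as acceptance, and the cap of three iterations is taken from the experimental protocol rather than from this section.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Verdict:
    score: int                                   # fidelity score from 0 to 10
    errors: List[str] = field(default_factory=list)


def reflect(
    response: str,
    control: Callable[[str], Verdict],
    correct: Callable[[str, List[str]], str],
    threshold: int = 8,        # assumed acceptance rule: score >= threshold
    max_iterations: int = 3,   # cap taken from the experimental protocol
) -> Tuple[str, Verdict, int]:
    """Re-check and correct a patient response until it is accepted or the cap is hit."""
    verdict = control(response)
    iterations = 0
    while verdict.score < threshold and iterations < max_iterations:
        response = correct(response, verdict.errors)
        verdict = control(response)
        iterations += 1
    return response, verdict, iterations


# Toy stand-ins for the LLM modules: the (hypothetical) patient sheet says the patient is 52.
def toy_control(response: str) -> Verdict:
    if "45 years old" in response:
        return Verdict(score=4, errors=["age contradicts the patient sheet (52)"])
    return Verdict(score=10)


def toy_correct(response: str, errors: List[str]) -> str:
    return response.replace("45 years old", "52 years old")


final, verdict, n_corrections = reflect("I am 45 years old.", toy_control, toy_correct)
print(final, verdict.score, n_corrections)
```

Passing the modules as callables mirrors the modular design described above, where a different model can back each role.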

Column groups: patient simulation (information recall, response relevance); physician performance (OSCE evaluation, role adherence); linguistic quality (naturalness, fluency, coherence).

| Model | Information recall (%) | Response relevance (/5) | OSCE evaluation (%) | Role adherence (/5) | Naturalness (/5) | Fluency (/5) | Coherence (/5) |
|---|---|---|---|---|---|---|---|
| Recorded dialogues | 43.58 | 4.14 | 68.57 | 4.03 | 3.21 | 3.79 | 3.91 |
| Claude-Haiku-4.5 | 52.24 | 4.97 | 87.48 | 4.76 | 3.42 | 4.35 | 4.28 |
| Gemini-3.1-Flash-Lite | 47.26 | 4.03 | 85.96 | 4.67 | 3.40 | 4.07 | 4.20 |
| Ministral-14B | 47.44 | 4.97 | 82.55 | 4.70 | 2.83 | 3.98 | 4.24 |
| GPT-4o-mini | 47.76 | 5.00 | 74.81 | 4.83 | 3.60 | 4.31 | 4.42 |

Table 1: Results for recorded and generated dialogues on baseline configuration. Scores are computed using GPT-4o-mini as LLM-as-a-Judge and averaged over 44 generated dialogues. Best results in bold, second-best underlined.
## 5 Experiments and evaluation
### 5.1 Experimental protocol for dialogue generation
To assess the contribution of each module, we define different experimental configurations: a baseline configuration without optional modules; the reflection loop only; and the reflection loop combined with retrieval. This allows us to evaluate the impact of each module on dialogue quality and controllability. When active, the reflection loop was configured with a maximum of 3 iterations and a minimum controller threshold score of 8/10 to accept a patient response (prompts available in Appendix [A.6](https://arxiv.org/html/2606.28526#A1.SS6)).
Configurations were evaluated using the following LLMs: Claude Haiku 4.5 Anthropic ([2025](https://arxiv.org/html/2606.28526#bib.bib42)), Gemini-3.1-Flash-Lite Google ([2026](https://arxiv.org/html/2606.28526#bib.bib13)), Ministral-14B-2512 Liu et al. ([2026](https://arxiv.org/html/2606.28526#bib.bib43)), and GPT-4o-mini OpenAI et al. ([2024](https://arxiv.org/html/2606.28526#bib.bib41)), accessed through the OpenRouter API with a temperature of 0 and top-p of 0.9 across all modules (generation, retrieval, controller, corrector). Models were selected based on cost-efficiency considerations, as the system is intended for free use by medical students. Ministral-14B was additionally chosen for its potential for local deployment, an important factor for educational applications. We did not include domain-specialized medical LLMs, as recent work suggests that specialized models do not necessarily outperform general-purpose ones on medical tasks Dorfner et al. ([2024](https://arxiv.org/html/2606.28526#bib.bib45)); Labrak et al. ([2024](https://arxiv.org/html/2606.28526#bib.bib4)); Huang et al. ([2026](https://arxiv.org/html/2606.28526#bib.bib44)). This is also consistent with our setting, where the VP is intended to simulate non-expert behavior.
For each configuration and model, we generated 4 dialogues per station, yielding a total of 792 generated dialogues across 11 OSCE stations described in Appendix [A.4](https://arxiv.org/html/2606.28526#A1.SS4).

Column groups: patient simulation (information recall, response relevance); physician performance (OSCE evaluation, role adherence); linguistic quality (naturalness, fluency, coherence).

| Physician performance variability | Experimental setup | Information recall (%) | Response relevance (/5) | OSCE evaluation (%) | Role adherence (/5) | Naturalness (/5) | Fluency (/5) | Coherence (/5) |
|---|---|---|---|---|---|---|---|---|
| 100% | baseline | 52.24 | 4.97 | 87.48 | 4.76 | 3.42 | 4.35 | 4.28 |
| 100% | reflection | 54.29 | 4.83 | 87.75 | 4.85 | 3.57 | 4.30 | 4.36 |
| 100% | refl. + retr. | 56.62 | 4.89 | 89.21 | 4.89 | 3.59 | 4.29 | 4.36 |
| 50% | baseline | 24.12 | 4.85 | 46.21 | 4.31 | 3.52 | 4.22 | 4.26 |
| 50% | reflection | 30.84 | 5.00 | 45.59 | 4.55 | 3.51 | 4.24 | 4.27 |
| 50% | refl. + retr. | 27.23 | 4.86 | 45.05 | 4.14 | 3.46 | 4.20 | 4.20 |
| 25% | baseline | 17.80 | 4.44 | 25.69 | 3.89 | 3.00 | 4.00 | 4.8 |
| 25% | reflection | 22.65 | 4.67 | 29.89 | 4.17 | 3.35 | 4.13 | 4.15 |
| 25% | refl. + retr. | 18.74 | 4.75 | 27.06 | 4.00 | 3.00 | 4.00 | 4.10 |

Table 2: Results for Claude-Haiku-4.5 across three experimental setups (baseline, reflection loop, and reflection + retrieval) under varying physician performance constraints (100%, 50%, and 25% OSCE criteria coverage). Best results for each physician performance variation in bold.
### 5.2 Evaluation setup: description and metrics
All evaluations were conducted in an LLM-as-a-Judge framework Gu et al. ([2026](https://arxiv.org/html/2606.28526#bib.bib2)), using GPT-4o-mini OpenAI et al. ([2024](https://arxiv.org/html/2606.28526#bib.bib41)) with a temperature of 0. We assess generated and recorded dialogues along three dimensions, each comprising multiple metrics. Evaluation prompts are available in Appendix [A.7](https://arxiv.org/html/2606.28526#A1.SS7). To assess the reliability of this automatic evaluation, we additionally collected a small set of annotations from 6 annotators without medical training.
#### 1\. Patient simulation quality
evaluates how faithfully and appropriately the VP behaves with respect to the patient sheet\. We propose the following metrics for the patient simulation quality evaluation:
- Information recall measures the proportion of patient sheet elements mentioned by the VP during the dialogue. In a preliminary step, we use GPT-4o-mini to transform each patient sheet into a binary checklist (e.g., mentions the patient’s name, reports allergies). The LLM-based judge then reviews all patient utterances and marks each item as mentioned or not, yielding a percentage score.
- Response relevance assesses whether patient utterances are appropriate given the physician’s questions and the patient sheet content. The LLM-based judge assigns a score from 1 (very poor) to 5 (excellent).
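To make the information recall computation concrete, here is a minimal sketch of the final scoring step, assuming the judge's output has already been parsed into a mapping from checklist item to a boolean. The function name `checklist_score` and the example items are hypothetical; checklist construction and judging are LLM steps that are not reproduced here.

```python
def checklist_score(checklist: dict[str, bool]) -> float:
    """Percentage of checklist items that the judge marked as satisfied."""
    if not checklist:
        raise ValueError("checklist must contain at least one item")
    return 100.0 * sum(checklist.values()) / len(checklist)


# Hypothetical judge output for one dialogue: item -> mentioned by the VP or not.
judged = {
    "mentions the patient's name": True,
    "reports allergies": False,
    "describes the main symptom": True,
    "states current medication": True,
}
recall = checklist_score(judged)
print(f"Information recall: {recall:.1f}%")
```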
#### 2\. Physician performance
evaluates the quality of the physician’s conduct during the consultation, with the following metrics:
- OSCE evaluation implements traditional OSCE scoring: the evaluator sheet lists specific criteria that the physician is expected to address (e.g., asks about allergies, proposes a treatment), and the judge checks them as met or not, yielding a percentage score.
- Role adherence assesses whether the physician appropriately fulfilled the role defined in the physician sheet: respecting the clinical context, staying within the OSCE framework, and behaving professionally. Only physician utterances are evaluated. The LLM-based judge assigns a score from 1 (very poor, major deviations from the expected role) to 5 (excellent, perfect role adherence).
#### 3\. Linguistic quality
evaluates the dialogue independently of its clinical content, focusing on three standard dialogue quality metrics. Naturalness measures whether the dialogue resembles a real spoken conversation between two people. Fluency assesses the smoothness and grammatical correctness of the exchanges. Coherence evaluates the logical structure and progression of the dialogue. Each criterion is scored on a 0–100 scale using detailed rubrics defined in the prompts, then rescaled to a 1–5 Likert scale for consistency with the other similar metrics in this framework (e.g., response relevance). All evaluation prompts are available in Appendix [A.7](https://arxiv.org/html/2606.28526#A1.SS7).
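The rescaling formula is not spelled out in this section. One plausible choice, shown here purely as an assumption, is a linear map that sends 0 to 1 and 100 to 5; the function name `rescale_to_likert` is hypothetical.

```python
def rescale_to_likert(score: float) -> float:
    """Linearly map a 0-100 rubric score onto the 1-5 range (assumed mapping)."""
    if not 0 <= score <= 100:
        raise ValueError("score must lie in [0, 100]")
    return round(1 + 4 * score / 100, 2)


# Endpoints and midpoint of the assumed mapping.
for raw in (0, 50, 100):
    print(raw, "->", rescale_to_likert(raw))
```

Under this mapping, a naturalness score of 3.21 on the 1–5 scale would correspond to a raw rubric score of about 55/100; a different rescaling (for example, binning into five intervals) would give different correspondences.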
## 6 Results and analysis
We conduct four analyses: (1) comparison between recorded and baseline-generated dialogues across the four models ([Table 1](https://arxiv.org/html/2606.28526#S4.T1)), (2) impact of reflection and retrieval modules with Claude-Haiku-4.5 ([Table 2](https://arxiv.org/html/2606.28526#S5.T2)), (3) robustness to varying physician performance ([Table 2](https://arxiv.org/html/2606.28526#S5.T2)), and (4) station-level naturalness analysis ([Figure 4](https://arxiv.org/html/2606.28526#S6.F4)). Detailed results with all configurations and models are provided in Appendix [A.5](https://arxiv.org/html/2606.28526#A1.SS5).
### 6.1 Generated vs. recorded dialogues
Across nearly all evaluation dimensions, generated dialogues outperform recorded interactions, including higher information recall for all four models, as shown in [Table 1](https://arxiv.org/html/2606.28526#S4.T1). Linguistic quality scores are also higher for generated dialogues, except for Ministral-14B, whose naturalness score (2.83) falls below the recorded dialogue baseline of 3.21, suggesting that this model produces less conversationally realistic utterances. The only other exception in Table 1 is the response relevance of Gemini-3.1-Flash-Lite (4.03 vs. 4.14 for recorded dialogues).
This gap is partly expected: recorded dialogues reflect real student variability, including incomplete coverage of OSCE criteria, hesitations, and communication difficulties, whereas generated dialogues benefit from systematic criterion coverage and controlled patient behavior\. Additionally, these results may partly reflect biases in LLM\-as\-a\-Judge evaluation, which could favor the style and structure of LLM\-generated text over the disfluencies and variability characteristic of spontaneous speech\.
As part of the linguistic quality evaluation (Section [6.5](https://arxiv.org/html/2606.28526#S6.SS5)), annotators were asked to determine whether each dialogue was LLM-generated or recorded from a real interaction. They correctly identified the dialogue origin in only 14 of 24 judgments (58.3%), achieving identical accuracy for LLM-generated and recorded dialogues.
For completeness, we also include the results of GPT-4o-mini, even though it also serves as the LLM-as-a-Judge evaluator. Interestingly, it does not achieve the highest score on every metric. This observation, however, should not be interpreted as evidence that the evaluation is free from model-specific biases.
Since Claude\-Haiku\-4\.5 consistently achieves the best overall performance across evaluation dimensions, we use this model for the remaining experiments\.
### 6.2 Impact of the reflection and retrieval modules
Table [2](https://arxiv.org/html/2606.28526#S5.T2) shows the contribution of the reflection loop and the retrieval module compared to the baseline system with Claude-Haiku-4.5. Overall, the reflection loop improves information recall over the baseline in all three physician performance settings, while response relevance changes are mixed (a slight drop at 100% coverage, gains at 50% and 25%) and linguistic quality remains stable. These improvements suggest that the controller and corrector modules help refine responses and reduce inconsistencies in generated dialogues. The reflection loop was triggered in 3.6% of patient turns, indicating that most responses were accepted by the controller without requiring correction. When activated, the correction process improved the controller score in 79.5% of cases.
Adding retrieval sometimes further improves performance, yielding the highest information recall and OSCE evaluation scores\. However, these gains remain inconsistent across setups, suggesting that retrieval effectiveness depends on the interaction dynamics\. Across configurations, linguistic quality remains stable, indicating that both modules improve fidelity and task performance without degrading dialogue naturalness\. Overall, the reflection loop and retrieval modules sometimes improve results, but their impact remains moderate\.
Figure 4: Naturalness score across all models and recorded dialogues: bars represent baseline performance averaged over four generated dialogues, deltas indicate performance changes with the retrieval and reflection loop for five representative stations\.

To further assess whether these improvements affect conversational quality at the utterance level, we evaluate linguistic quality scores separately for each speaker role \([Table 3](https://arxiv.org/html/2606.28526#S6.T3)\)\. Linguistic quality remains stable across experimental setups for both speaker roles\. Physician scores are slightly higher than patient simulation scores for fluency and coherence, though not for naturalness, suggesting that patient generation may introduce more variability than physician utterances\.
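| Experimental setup | Patient simulation: Naturalness \(/5\) | Patient simulation: Fluency \(/5\) | Patient simulation: Coherence \(/5\) | Physician performance: Naturalness \(/5\) | Physician performance: Fluency \(/5\) | Physician performance: Coherence \(/5\) |
|---|---|---|---|---|---|---|
| baseline | 3\.82 | 4\.26 | 4\.25 | 3\.80 | 4\.40 | 4\.65 |
| reflection | 3\.78 | 4\.23 | 4\.10 | 3\.67 | 4\.31 | 4\.47 |
| refl\. \+ retr\. | 3\.82 | 4\.25 | 4\.25 | 3\.68 | 4\.34 | 4\.48 |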
Table 3: Linguistic quality scores per speaker role for Claude\-Haiku\-4\.5, evaluated with GPT\-4o\-mini as LLM\-as\-a\-Judge\. Full dialogues are provided to the judge, which scores each speaker independently\. Scores are computed on a 0–100 scale using rubric\-based prompts and rescaled to 1–5 Likert scores\.
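The caption does not state how the 0–100 rubric scores are mapped to the 1–5 range. One plausible reading is a linear rescaling, sketched below purely as an assumption; the function name `to_likert` is hypothetical.

```python
def to_likert(score_0_100: float) -> float:
    """Linearly map a 0-100 rubric score onto a 1-5 scale.

    Assumed mapping (not confirmed by the paper): 0 -> 1, 50 -> 3, 100 -> 5.
    """
    if not 0 <= score_0_100 <= 100:
        raise ValueError("score must lie in [0, 100]")
    return 1 + 4 * score_0_100 / 100


for raw in (0, 50, 70.5, 100):
    print(raw, round(to_likert(raw), 2))
```

Under this assumed mapping, a reported value such as 3.82 would correspond to a raw rubric score of about 70.5.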
### 6\.3 Robustness to physician performance variability
To assess the robustness of our VP to variable physician performance, we degraded the physician generator by providing only 50% or 25% of the expected OSCE evaluation criteria and perturbing them, following Huang et al\. \([2026](https://arxiv.org/html/2606.28526#bib.bib44)\); results are reported in [Table 2](https://arxiv.org/html/2606.28526#S5.T2)\. This degradation led to a drop in the information recall score, which is consistent with the statistically significant correlation between information recall and OSCE scores reported in Appendix [A\.3](https://arxiv.org/html/2606.28526#A1.SS3): as the physician asks fewer questions, the patient provides less information\.
Response relevance remains stable, while linguistic quality scores decrease slightly\. Compared to the 100%\-criteria setting, these results suggest that the VP remains robust to variations in physician performance, particularly when using the retrieval and reflection loop pipeline, which maintains high response relevance under degraded interaction conditions\. This robustness allows the VP to adapt to student performance and enables learning even from failed interviews\.
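The criteria-reduction part of this degradation can be illustrated with a short sketch. It covers only the subsampling to 50% or 25% of the criteria; the perturbation step borrowed from Huang et al. is not specified in this section and is deliberately left out. The function name `subsample_criteria`, the rounding rule, and the example criteria are assumptions made for illustration.

```python
import math
import random


def subsample_criteria(criteria: list[str], fraction: float, seed: int = 0) -> list[str]:
    """Keep a random subset of the evaluation criteria, preserving their order.

    Assumption: the number kept is rounded up and at least one criterion
    always survives; the paper does not state its rounding rule.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    keep = max(1, math.ceil(len(criteria) * fraction))
    rng = random.Random(seed)  # fixed seed so the degraded setting is reproducible
    kept_indices = sorted(rng.sample(range(len(criteria)), keep))
    return [criteria[i] for i in kept_indices]


# Hypothetical checklist items, not taken from the dataset.
criteria = [
    "asks about pain onset",
    "asks about pain location",
    "asks about fever",
    "asks about medication",
    "asks about allergies",
    "asks about family history",
    "explains next steps",
    "checks patient understanding",
]

half = subsample_criteria(criteria, 0.5)
quarter = subsample_criteria(criteria, 0.25)
print(len(half), len(quarter))
```

A physician generator prompted with `half` or `quarter` instead of the full list would then ask fewer of the expected questions, which is the mechanism the text links to lower information recall.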
### 6\.4 Station\-level analysis
[Figure 4](https://arxiv.org/html/2606.28526#S6.F4) presents the analysis of naturalness scores at the station level to assess where the reflection loop and retrieval module are most beneficial\. We focus on five representative stations \(44, 107, 220, 224, 228\) covering diverse clinical contexts \(see Appendix [A\.4](https://arxiv.org/html/2606.28526#A1.SS4)\), including different specialties and objectives\. The gain on station 228 may be partly driven by few\-shot prompting, as one example directly matches this station\. Overall, results suggest that improvements in naturalness are highly context\-dependent and not uniform across stations\. This variability further suggests that a discretized station design, based on more static and well\-defined categories, could improve simulation consistency\.
### 6\.5 Agreement between human evaluators and LLM\-as\-a\-Judge
To assess the reliability of our LLM\-as\-a\-Judge framework, six annotators each evaluated four dialogues on three linguistic\-quality criteria \(naturalness, fluency, and coherence\) using the same 1–5 Likert scale as the LLM\. Each of the 12 dialogues was rated by two annotators\.
| Criterion | ICC \(human\) | Pearson | Spearman |
|---|---|---|---|
| Naturalness | 0\.131 | 0\.08 | 0\.09 |
| Fluency | −0\.134 | 0\.46 | 0\.45 |
| Coherence | −0\.040 | 0\.76\* | 0\.45 |

Table 4: Inter\-annotator agreement and correlation between mean human scores and LLM\-as\-a\-Judge scores \(n = 12 dialogues\)\. \*p < 0\.05\.

The Intraclass Correlation Coefficient \(ICC\) Koo and Li \([2016](https://arxiv.org/html/2606.28526#bib.bib46)\) is weak across all three criteria, reflecting the subjective nature of linguistic\-quality evaluation\. Human\-LLM agreement remains limited, with only coherence showing a significant Pearson correlation, as shown in [Table 4](https://arxiv.org/html/2606.28526#S6.T4)\.
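To make the near-zero and negative ICC values easier to interpret, here is a minimal sketch of a one-way random-effects ICC, often written ICC(1,1). The paper does not state which ICC form was used, so this choice is an assumption, motivated only by the fact that each dialogue was rated by two annotators drawn from a larger pool. The function name `icc_oneway` and the toy ratings are hypothetical.

```python
def icc_oneway(ratings: list[list[float]]) -> float:
    """One-way random-effects ICC for n subjects each rated k times.

    ICC = (MSB - MSW) / (MSB + (k - 1) * MSW), where MSB is the
    between-subject mean square and MSW the within-subject mean square.
    """
    n, k = len(ratings), len(ratings[0])
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need at least 2 subjects with the same number (>= 2) of ratings")
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    msb = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    msw = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))
    denominator = msb + (k - 1) * msw
    if denominator == 0:
        raise ValueError("ICC is undefined when all ratings are identical")
    return (msb - msw) / denominator


# Toy 1-5 ratings: two raters per dialogue.
agreeing = [[1, 1], [3, 3], [5, 5]]      # raters agree, dialogues differ
disagreeing = [[2, 4], [4, 2], [3, 3]]   # raters differ, dialogue means identical
print(icc_oneway(agreeing), icc_oneway(disagreeing))
```

The second toy case shows how the coefficient becomes negative: when raters disagree on a dialogue more than dialogues differ from each other, the within-dialogue variance exceeds the between-dialogue variance.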
We also evaluated agreement between GPT\-4o\-mini and human evaluator scores on the OSCE evaluation metric using ICC and Spearman correlation\. On 67 recorded dialogues across 14 OSCE stations, agreement was moderate \(ICC = 0\.41, ρ = 0\.33\), and became strong when restricting the analysis to patient\-only stations without document analysis \(ICC = 0\.85, ρ = 0\.82\)\. These results suggest that the LLM\-as\-a\-Judge is more reliable for checklist\-style OSCE evaluation than for linguistic\-quality assessment, although these results remain preliminary\.
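For readers less familiar with the rank-based statistic reported here, the sketch below computes Spearman's ρ as the Pearson correlation of average ranks, using only the standard library. It illustrates the statistic, not the authors' evaluation script; the helper names and the toy scores are hypothetical.

```python
from statistics import mean


def average_ranks(values: list[float]) -> list[float]:
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for t in range(i, j + 1):
            ranks[order[t]] = shared_rank
        i = j + 1
    return ranks


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy OSCE scores: the judge preserves the human ordering but not the scale.
human_scores = [1.0, 2.0, 3.0, 4.0]
judge_scores = [1.0, 4.0, 9.0, 16.0]
print(round(pearson(human_scores, judge_scores), 3), round(spearman(human_scores, judge_scores), 3))
```

Because ρ depends only on ordering, it can stay high even when the judge uses the score scale differently from human evaluators, whereas an ICC also penalizes such systematic differences.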
## 7 Conclusion
We introduced a French OSCE dialogue dataset combining 240 student\-patient OSCE training dialogues and 792 generated dialogues, providing a new data resource for French medical education\. We proposed a controllable LLM\-based generation pipeline with retrieval and controller modules to improve patient fidelity and dialogue coherence, and an LLM\-as\-a\-Judge evaluation framework assessing patient simulation quality, physician performance, and linguistic quality\.
Results show that generated dialogues consistently outperform recorded interactions in patient simulation quality, particularly in terms of information recall and response relevance\. The proposed VP system with a reflection loop yields small gains in fidelity\-related metrics, while retrieval provides additional but less consistent improvements depending on interaction context and model\. Our VP remains stable across variations in physician performance, while station\-level analyses reveal scenario\-dependence in naturalness, suggesting the need for more fine\-grained station\-level modeling to better adapt system behavior\.
Future work will include extending the pipeline to document\-based stations, improving controllability of VP behavior, and validating our proposed VP system in real training conditions with medical students\.
## Limitations & Ethics statement
#### Limitations
This study has several limitations\. First, while our dataset includes 192 OSCE stations, only 23 were recorded, and dialogue generation was conducted on a subset of 11 stations\. This limits the diversity of clinical scenarios covered and the generalization of our findings, though the dataset is under active expansion\. Second, our evaluation relies entirely on a single LLM\-as\-a\-Judge with a single evaluation run per dialogue\. Although we validated the OSCE evaluation criterion against human annotations, the remaining criteria lack human validation, and inter\-run variability was not assessed\. Third, the generation pipeline was not tested in real training conditions with medical students interacting with the VP, leaving its pedagogical effectiveness unevaluated\. Fourth, the pipeline relies heavily on LLM calls across all modules \(generation, retrieval, controller, corrector, evaluation\), which raises concerns regarding reproducibility, cost, and scalability\. We estimate the generation cost of a single dialogue at approximately $0\.20, but a full experimental run involves numerous API calls across multiple modules\. Finally, only three models were evaluated; larger or domain\-specialized medical models may yield different results\. Future work will focus on expanding human evaluation, assessing inter\-annotator agreement, conducting user studies with medical students, and evaluating the pipeline in real training conditions\.
#### Ethics statement
This study follows ethical guidelines for the use of language technologies in educational and clinical training contexts\. No personal, clinical, or patient\-identifiable data were collected or processed\. All real interactions used in this work were recorded with appropriate authorization, and all scenarios were fictitious and designed in a game\-like educational setting, ensuring full anonymization of participants\. We emphasize that the goal of this work is to support the development of assistive technologies that augment human learning without reducing expertise or learner agency, particularly in sensitive domains such as medical education\.
## 8 Acknowledgements
We thank the members of the Multispeech team for their contributions to the manual transcription and human evaluation of the dialogues used in this study\. We also acknowledge ECNAsso for facilitating data collection\. This work benefited from government funding managed by the National Research Agency under France 2030 via the ENACT AI Cluster \(ANR\-23\-IACL\-0004\), and from support by Région Grand Est\.
## References
- L\. A\. Agrawal, S\. Tan, D\. Soylu, N\. Ziems, R\. Khare, K\. Opsahl\-Ong, A\. Singhvi, H\. Shandilya, M\. J\. Ryan, M\. Jiang, C\. Potts, K\. Sen, A\. Dimakis, I\. Stoica, D\. Klein, M\. Zaharia, and O\. Khattab \(2026\)GEPA: reflective prompt evolution can outperform reinforcement learning\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=RQm2KQTM5r)Cited by:[§2](https://arxiv.org/html/2606.28526#S2.p5.1)\.
- Claude haiku 4\.5 system card\.Technical reportAnthropic\.Note:Accessed: 2026\-04\-07External Links:[Link](https://www-cdn.anthropic.com/7aad69bf12627d42234e01ee7c36305dc2f6a970.pdf)Cited by:[§5\.1](https://arxiv.org/html/2606.28526#S5.SS1.p2.1)\.
- A\. Ben Abacha, W\. Yim, Y\. Fan, and T\. Lin \(2023\)An empirical study of clinical note generation from doctor\-patient encounters\.InProceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics,Dubrovnik, Croatia,pp\. 2291–2302\.External Links:[Link](https://aclanthology.org/2023.eacl-main.168)Cited by:[§1](https://arxiv.org/html/2606.28526#S1.p3.1),[§2](https://arxiv.org/html/2606.28526#S2.p4.1)\.
- K\. K\. Campbell, M\. J\. Holcomb, S\. Vedovato, L\. Young, G\. Danuser, T\. O\. Dalton, and D\. J\. Scott \(2025\)Applying state\-of\-the\-art artificial intelligence to grading in simulation\-based education: assessment, feedback, and roi\.Discover Artificial Intelligence5\(1\),pp\. 247–?\.External Links:[Document](https://dx.doi.org/10.1007/s44163-025-00417-3),[Link](https://doi.org/10.1007/s44163-025-00417-3)Cited by:[§1](https://arxiv.org/html/2606.28526#S1.p2.1),[§2](https://arxiv.org/html/2606.28526#S2.p5.1)\.
- L\. Campillos\-Llanos, C\. Thomas, É\. Bilinski, P\. Zweigenbaum, and S\. Rosset \(2019\)Designing a virtual patient dialogue system based on terminology\-rich resources: challenges and evaluation\.Natural Language Engineering26\(2\),pp\. 183–220\.External Links:[Document](https://dx.doi.org/10.1017/S1351324919000329),[Link](https://doi.org/10.1017/S1351324919000329)Cited by:[§1](https://arxiv.org/html/2606.28526#S1.p2.1)\.
- L\. Chabyet al\.\(2022\)Embodied virtual patients as a simulation\-based framework for training clinician\-patient communication skills: an overview of their use in psychiatric and geriatric care\.Frontiers in Virtual Reality3,pp\. 827312\.Cited by:[§2](https://arxiv.org/html/2606.28526#S2.p1.1)\.
- S\. Chen, M\. Wu, K\. Q\. Zhu, K\. Lan, Z\. Zhang, and L\. Cui \(2023\)LLM\-empowered chatbots for psychiatrist and patient simulation: application and evaluation\.arXiv preprint arXiv:2305\.13614\.External Links:[Link](https://arxiv.org/abs/2305.13614),[Document](https://dx.doi.org/10.48550/arXIV.2305.13614)Cited by:[§2](https://arxiv.org/html/2606.28526#S2.p3.1)\.
- N\. Chirkova, T\. O\. Ajayi, S\. Aycock, Z\. M\. Mujahid, V\. Perlić, E\. Borisova, and M\. Vartampetian \(2026\)LLM\-as\-a\-qualitative\-judge: automating error analysis in natural language generation\.InProceedings of the First Workshop on Multilingual Multicultural Evaluation,P\. Chen, V\. Zouhar, H\. Hu, S\. Khanuja, W\. Zhu, B\. Haddow, A\. Birch, A\. F\. Aji, R\. Sennrich, and S\. Hooker \(Eds\.\),Rabat, Morocco,pp\. 99–132\.External Links:[Document](https://dx.doi.org/10.18653/v1/2026.mme-main.7),[Link](https://aclanthology.org/2026.mme-main.7/),ISBN 979\-8\-89176\-368\-5Cited by:[§2](https://arxiv.org/html/2606.28526#S2.p5.1)\.
- D\. A\. Cook, J\. Overgaard, V\. S\. Pankratz, G\. Del Fiol, and C\. A\. Aakre \(2025\)Virtual patients using large language models: scalable, contextualized simulation of clinician\-patient dialogue with feedback\.Journal of Medical Internet Research27,pp\. e68486\.External Links:[Document](https://dx.doi.org/10.2196/68486),[Link](https://www.jmir.org/2025/1/e68486)Cited by:[§1](https://arxiv.org/html/2606.28526#S1.p2.1)\.
- T\. Das, D\. Albassam, and J\. Sun \(2024\)Synthetic patient\-physician dialogue generation from clinical notes using llm\.External Links:2408\.06285,[Link](https://arxiv.org/abs/2408.06285)Cited by:[§2](https://arxiv.org/html/2606.28526#S2.p4.1)\.
- F\. J\. Dorfner, A\. Dada, F\. Busch, M\. R\. Makowski, T\. Han, D\. Truhn, J\. Kleesiek, M\. Sushil, J\. Lammert, L\. C\. Adams, and K\. K\. Bressem \(2024\)Biomedical large languages models seem not to be superior to generalist models on unseen medical data\.External Links:2408\.13833,[Link](https://arxiv.org/abs/2408.13833)Cited by:[§5\.1](https://arxiv.org/html/2606.28526#S5.SS1.p2.1)\.
- Explosion AI \(2024\)Fr\_core\_news\_md: spacy french model\.Note:Version 3\.8\.0External Links:[Link](https://github.com/explosion/spacy-models/releases/tag/fr_core_news_md-3.8.0)Cited by:[§A\.2](https://arxiv.org/html/2606.28526#A1.SS2.p2.1)\.
- F\. Fareez, T\. Parikh, C\. Wavell, S\. Shahab, M\. Chevalier, S\. Good, I\. De Blasi, R\. Rhouma, C\. McMahon, J\. Lam, T\. Lo, and C\. W\. Smith \(2022\)A dataset of simulated patient\-physician medical interviews with a focus on respiratory cases\.Scientific Data9,pp\. 313\.External Links:[Document](https://dx.doi.org/10.1038/s41597-022-01423-1),[Link](https://www.nature.com/articles/s41597-022-01423-1)Cited by:[§1](https://arxiv.org/html/2606.28526#S1.p3.1),[§2](https://arxiv.org/html/2606.28526#S2.p3.1)\.
- D\. García\-Torres, M\. A\. Vicente Ripoll, C\. Fernández Peris, and J\. J\. Mira Solves \(2024\)Enhancing clinical reasoning with virtual patients: a hybrid systematic review combining human reviewers and chatgpt\.Healthcare12\(22\),pp\. 2241\.External Links:[Document](https://dx.doi.org/10.3390/healthcare12222241),[Link](https://pmc.ncbi.nlm.nih.gov/articles/PMC11594149/)Cited by:[§1](https://arxiv.org/html/2606.28526#S1.p2.1),[§2](https://arxiv.org/html/2606.28526#S2.p1.1)\.
- Google \(2026\)Gemini 3\.1 flash\-lite\.Note:[https://ai\.google\.dev/gemini\-api/docs/models/gemini\-3\.1\-flash\-lite?hl=fr](https://ai.google.dev/gemini-api/docs/models/gemini-3.1-flash-lite?hl=fr) Accessed: 2026\-06\-05 Cited by:[§5\.1](https://arxiv.org/html/2606.28526#S5.SS1.p2.1)\.
- J\. Gu, X\. Jiang, Z\. Shi, H\. Tan, X\. Zhai, C\. Xu, W\. Li, Y\. Shen, S\. Ma, H\. Liu, S\. Wang, K\. Zhang, Z\. Lin, B\. Zhang, L\. Ni, W\. Gao, Y\. Wang, and J\. Guo \(2026\)A survey on LLM\-as\-a\-judge\.The Innovation,pp\. 101253\.External Links:ISSN 2666\-6758,[Document](https://dx.doi.org/10.1016/j.xinn.2025.101253),[Link](https://www.sciencedirect.com/science/article/pii/S2666675825004564)Cited by:[§2](https://arxiv.org/html/2606.28526#S2.p5.1),[§5\.2](https://arxiv.org/html/2606.28526#S5.SS2.p1.1)\.
- T\. Huang, T\. Bourgeade, and I\. Illina \(2026\)LLM\-based data generation and clinical skills evaluation for low\-resource french osces\.InProceedings of the 15th International Conference on Language Resources and Evaluation \(LREC 2026\),Note:Accepted for publicationExternal Links:[Link](https://arxiv.org/abs/2604.08126)Cited by:[§1](https://arxiv.org/html/2606.28526#S1.p2.1),[§4](https://arxiv.org/html/2606.28526#S4.SS0.SSS0.Px1.p1.1),[§5\.1](https://arxiv.org/html/2606.28526#S5.SS1.p2.1),[§6\.3](https://arxiv.org/html/2606.28526#S6.SS3.p1.1)\.
- M\. Jepson, C\. Salisbury, M\. J\. Ridd, C\. Metcalfe, L\. Garside, and R\. K\. Barnes \(2017\)The ‘one in a million’ study: creating a database of uk primary care consultations\.British Journal of General Practice67\(658\),pp\. e345–e351\.External Links:[Document](https://dx.doi.org/10.3399/bjgp17x690521),[Link](https://bjgp.org/content/67/658/e345)Cited by:[§2](https://arxiv.org/html/2606.28526#S2.p4.1)\.
- Jitsi \(2024\)Jiwer: evaluate word error rate \(wer\)\.External Links:[Link](https://github.com/jitsi/jiwer)Cited by:[§A\.2](https://arxiv.org/html/2606.28526#A1.SS2.p2.1)\.
- T\. K\. Koo and M\. Y\. Li \(2016\)A guideline of selecting and reporting intraclass correlation coefficients for reliability research\.Journal of Chiropractic Medicine15\(2\),pp\. 155–163\.External Links:[Document](https://dx.doi.org/10.1016/j.jcm.2016.02.012)Cited by:[§6\.5](https://arxiv.org/html/2606.28526#S6.SS5.p2.1)\.
- Y\. Labrak, A\. Bazoge, O\. El Khettari, M\. Rouvier, P\. Constant Dit Beaufils, N\. Grabar, B\. Daille, S\. Quiniou, E\. Morin, P\. Gourraud, and R\. Dufour \(2024\)DrBenchmark: A Large Language Understanding Evaluation Benchmark for French Biomedical Domain\.InProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation \(LREC\-COLING 2024\),N\. Calzolari, M\. Kan, V\. Hoste, A\. Lenci, S\. Sakti, and N\. Xue \(Eds\.\),Torino, Italia,pp\. 5376–5390\.External Links:[Link](https://aclanthology.org/2024.lrec-main.478/)Cited by:[§5\.1](https://arxiv.org/html/2606.28526#S5.SS1.p2.1)\.
- F\. A\. A\. Laleye, A\. Blanié, A\. Brouquet, D\. Behnamou, and G\. de Chalendar \(2020a\)Semantic similarity to improve question understanding in a virtual patient\.InProceedings of the 35th Annual ACM Symposium on Applied Computing,SAC ’20,New York, NY, USA,pp\. 859–866\.External Links:[Document](https://dx.doi.org/10.1145/3341105.3373936),[Link](https://dl.acm.org/doi/10.1145/3341105.3373936),ISBN 978\-1\-4503\-6866\-7Cited by:[§1](https://arxiv.org/html/2606.28526#S1.p2.1)\.
- F\. A\. A\. Laleye, G\. de Chalendar, A\. Blanié, A\. Brouquet, and D\. Behnamou \(2020b\)A French Medical Conversations Corpus Annotated for a Virtual Patient Dialogue System\.InProceedings of The 12th Language Resources and Evaluation Conference,Marseille, France,pp\. 574–580\.External Links:[Link](https://www.aclweb.org/anthology/2020.lrec-1.72)Cited by:[§1](https://arxiv.org/html/2606.28526#S1.p3.1),[§2](https://arxiv.org/html/2606.28526#S2.p3.1)\.
- N\. Laverde, C\. Grévisse, S\. Jaramillo, and R\. Manrique \(2025\)Integrating large language model\-based agents into a virtual patient chatbot for clinical anamnesis training\.Computational and Structural Biotechnology Journal27,pp\. 2481–2491\.External Links:[Document](https://dx.doi.org/10.1016/j.csbj.2025.05.025),[Link](https://www.sciencedirect.com/science/article/pii/S2001037025001850)Cited by:[§2](https://arxiv.org/html/2606.28526#S2.p1.1)\.
- K\. L\. Lewis, C\. A\. Bohnert, W\. L\. Gammon, H\. Hölzer, L\. Lyman, C\. Smith, T\. M\. Thompson, A\. M\. Wallace, and G\. Gliva\-McConvey \(2017\)The association of standardized patient educators \(aspe\) standards of best practice\.Advances in Simulation2,pp\. 10\.External Links:[Document](https://dx.doi.org/10.1186/s41077-017-0059-1)Cited by:[§1](https://arxiv.org/html/2606.28526#S1.p1.1)\.
- D\. Li and S\. L\. Lutfi \(2026\)Large language model–based virtual patient systems for history\-taking in medical education: comprehensive systematic review\.Journal of Medical Systems / Elsevier \(ScienceDirect\)\.External Links:[Document](https://dx.doi.org/S2291969426000013),[Link](https://www.sciencedirect.com/science/article/pii/S2291969426000013)Cited by:[§1](https://arxiv.org/html/2606.28526#S1.p2.1)\.
- T\. Li, G\. Li, Z\. Deng, B\. Wang, and Y\. Li \(2023\)A Zero\-Shot Language Agent for Computer Control with Structured Reflection\.InFindings of the Association for Computational Linguistics: EMNLP 2023,H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\),Singapore,pp\. 11261–11274\.External Links:[Document](https://dx.doi.org/10.18653/v1/2023.findings-emnlp.753),[Link](https://aclanthology.org/2023.findings-emnlp.753/)Cited by:[§2](https://arxiv.org/html/2606.28526#S2.p5.1)\.
- A\. H\. Liu, K\. Khandelwal, S\. Subramanian, et al\. \(2026\)Ministral 3\.arXiv preprint arXiv:2601\.08584\.External Links:[Link](https://arxiv.org/abs/2601.08584)Cited by:[§5\.1](https://arxiv.org/html/2606.28526#S5.SS1.p2.1)\.
- X\. Liu, V\. Segonne, A\. Mannion, D\. Schwab, L\. Goeuriot, and F\. Portet \(2024\)MedDialog\-FR: a French version of the MedDialog corpus for multi\-label classification and response generation related to women’s intimate health\.InProceedings of the First Workshop on Patient\-Oriented Language Processing \(CL4Health\) @ LREC\-COLING 2024,D\. Demner\-Fushman, S\. Ananiadou, P\. Thompson, and B\. Ondov \(Eds\.\),Torino, Italia,pp\. 173–183\.External Links:[Link](https://aclanthology.org/2024.cl4health-1.21/)Cited by:[§1](https://arxiv.org/html/2606.28526#S1.p3.1),[§2](https://arxiv.org/html/2606.28526#S2.p4.1)\.
- A\. Njifenjou, V\. Sucal, B\. Jabaian, and F\. Lefèvre \(2024\)Role\-play zero\-shot prompting with large language models for open\-domain human\-machine conversation\.External Links:2406\.18460,[Link](https://arxiv.org/abs/2406.18460)Cited by:[§2](https://arxiv.org/html/2606.28526#S2.p5.1)\.
- A\. Njifenjou, V\. Sucal, B\. Jabaian, and F\. Lefèvre \(2025\)Enabling trait\-based personality simulation in conversational LLM agents: case study of customer assistance in French\.InProceedings of the 15th International Workshop on Spoken Dialogue Systems Technology,M\. I\. Torres, Y\. Matsuda, Z\. Callejas, A\. del Pozo, and L\. F\. D’Haro \(Eds\.\),Bilbao, Spain,pp\. 299–308\.External Links:[Link](https://aclanthology.org/2025.iwsds-1.32/),ISBN 979\-8\-89176\-248\-0Cited by:[§2](https://arxiv.org/html/2606.28526#S2.p5.1)\.
- OpenAI, J\. Achiam, S\. Adler, S\. Agarwal, L\. Ahmad, I\. Akkaya, F\. L\. Aleman, D\. Almeida, J\. Altenschmidt, S\. Altman, S\. Anadkat, R\. Avila, I\. Babuschkin, S\. Balaji, V\. Balcom, P\. Baltescu, H\. Bao, M\. Bavarian, J\. Belgum, I\. Bello, J\. Berdine, G\. Bernadett\-Shapiro, C\. Berner, L\. Bogdonoff, O\. Boiko, M\. Boyd, A\. Brakman, G\. Brockman, T\. Brooks, M\. Brundage, K\. Button, T\. Cai, R\. Campbell, A\. Cann, B\. Carey, C\. Carlson, R\. Carmichael, B\. Chan, C\. Chang, F\. Chantzis, D\. Chen, S\. Chen, R\. Chen, J\. Chen, M\. Chen, B\. Chess, C\. Cho, C\. Chu, H\. W\. Chung, D\. Cummings, J\. Currier, Y\. Dai, C\. Decareaux, T\. Degry, N\. Deutsch, D\. Deville, A\. Dhar, D\. Dohan, S\. Dowling, S\. Dunning, A\. Ecoffet, A\. Eleti, T\. Eloundou, D\. Farhi, L\. Fedus, N\. Felix, S\. P\. Fishman, J\. Forte, I\. Fulford, L\. Gao, E\. Georges, C\. Gibson, V\. Goel, T\. Gogineni, G\. Goh, R\. Gontijo\-Lopes, J\. Gordon, M\. Grafstein, S\. Gray, R\. Greene, J\. Gross, S\. S\. Gu, Y\. Guo, C\. Hallacy, J\. Han, J\. Harris, Y\. He, M\. Heaton, J\. Heidecke, C\. Hesse, A\. Hickey, W\. Hickey, P\. Hoeschele, B\. Houghton, K\. Hsu, S\. Hu, X\. Hu, J\. Huizinga, S\. Jain, S\. Jain, J\. Jang, A\. Jiang, R\. Jiang, H\. Jin, D\. Jin, S\. Jomoto, B\. Jonn, H\. Jun, T\. Kaftan, Ł\. Kaiser, A\. Kamali, I\. Kanitscheider, N\. S\. Keskar, T\. Khan, L\. Kilpatrick, J\. W\. Kim, C\. Kim, Y\. Kim, J\. H\. Kirchner, J\. Kiros, M\. Knight, D\. Kokotajlo, Ł\. Kondraciuk, A\. Kondrich, A\. Konstantinidis, K\. Kosic, G\. Krueger, V\. Kuo, M\. Lampe, I\. Lan, T\. Lee, J\. Leike, J\. Leung, D\. Levy, C\. M\. Li, R\. Lim, M\. Lin, S\. Lin, M\. Litwin, T\. Lopez, R\. Lowe, P\. Lue, A\. Makanju, K\. Malfacini, S\. Manning, T\. Markov, Y\. Markovski, B\. Martin, K\. Mayer, A\. Mayne, B\. McGrew, S\. M\. McKinney, C\. McLeavey, P\. McMillan, J\. McNeil, D\. Medina, A\. Mehta, J\. Menick, L\. Metz, A\. Mishchenko, P\. Mishkin, V\. Monaco, E\. Morikawa, D\. Mossing, T\. Mu, M\. 
Murati, O\. Murk, D\. Mély, A\. Nair, R\. Nakano, R\. Nayak, A\. Neelakantan, R\. Ngo, H\. Noh, L\. Ouyang, C\. O’Keefe, J\. Pachocki, A\. Paino, J\. Palermo, A\. Pantuliano, G\. Parascandolo, J\. Parish, E\. Parparita, A\. Passos, M\. Pavlov, A\. Peng, A\. Perelman, F\. de Avila Belbute Peres, M\. Petrov, H\. P\. de Oliveira Pinto, Michael, Pokorny, M\. Pokrass, V\. H\. Pong, T\. Powell, A\. Power, B\. Power, E\. Proehl, R\. Puri, A\. Radford, J\. Rae, A\. Ramesh, C\. Raymond, F\. Real, K\. Rimbach, C\. Ross, B\. Rotsted, H\. Roussez, N\. Ryder, M\. Saltarelli, T\. Sanders, S\. Santurkar, G\. Sastry, H\. Schmidt, D\. Schnurr, J\. Schulman, D\. Selsam, K\. Sheppard, T\. Sherbakov, J\. Shieh, S\. Shoker, P\. Shyam, S\. Sidor, E\. Sigler, M\. Simens, J\. Sitkin, K\. Slama, I\. Sohl, B\. Sokolowsky, Y\. Song, N\. Staudacher, F\. P\. Such, N\. Summers, I\. Sutskever, J\. Tang, N\. Tezak, M\. B\. Thompson, P\. Tillet, A\. Tootoonchian, E\. Tseng, P\. Tuggle, N\. Turley, J\. Tworek, J\. F\. C\. Uribe, A\. Vallone, A\. Vijayvergiya, C\. Voss, C\. Wainwright, J\. J\. Wang, A\. Wang, B\. Wang, J\. Ward, J\. Wei, C\. Weinmann, A\. Welihinda, P\. Welinder, J\. Weng, L\. Weng, M\. Wiethoff, D\. Willner, C\. Winter, S\. Wolrich, H\. Wong, L\. Workman, S\. Wu, J\. Wu, M\. Wu, K\. Xiao, T\. Xu, S\. Yoo, K\. Yu, Q\. Yuan, W\. Zaremba, R\. Zellers, C\. Zhang, M\. Zhang, S\. Zhao, T\. Zheng, J\. Zhuang, W\. Zhuk, and B\. Zoph \(2024\)GPT\-4 technical report\.External Links:2303\.08774,[Link](https://arxiv.org/abs/2303.08774)Cited by:[§5\.1](https://arxiv.org/html/2606.28526#S5.SS1.p2.1),[§5\.2](https://arxiv.org/html/2606.28526#S5.SS2.p1.1)\.
- OpenAI \(2024\)Whisper large\-v3\-turbo\.Note:Automatic speech recognition modelExternal Links:[Link](https://huggingface.co/openai/whisper-large-v3-turbo)Cited by:[§3\.1](https://arxiv.org/html/2606.28526#S3.SS1.SSS0.Px4.p1.1)\.
- V\. Podder, V\. Lew, and S\. Ghassemzadeh \(2026\)SOAP notes\.StatPearls \[Internet\] edition,StatPearls Publishing\.Note:Accessed 2026External Links:[Link](https://www.ncbi.nlm.nih.gov/books/NBK482263/),ISBNCited by:[§4](https://arxiv.org/html/2606.28526#S4.SS0.SSS0.Px1.p1.1)\.
- PyAnnoteAI \(2024\)Pyannote speaker\-diarization precision\-2\.Note:Speaker diarization modelExternal Links:[Link](https://huggingface.co/pyannote/speaker-diarization-precision-2)Cited by:[§3\.1](https://arxiv.org/html/2606.28526#S3.SS1.SSS0.Px4.p1.1)\.
- A\. Radford, J\. W\. Kim, T\. Xu, G\. Brockman, C\. Mcleavey, and I\. Sutskever \(2023\)Robust Speech Recognition via Large\-Scale Weak Supervision\.InProceedings of the 40th International Conference on Machine Learning,pp\. 28492–28518\.External Links:ISSN 2640\-3498,[Link](https://proceedings.mlr.press/v202/radford23a.html)Cited by:[§3\.1](https://arxiv.org/html/2606.28526#S3.SS1.SSS0.Px4.p1.1)\.
- V\. V\. Saley, G\. Saha, R\. J\. Das, D\. Raghu, and Mausam \(2024\)MediTOD: An English Dialogue Dataset for Medical History Taking with Comprehensive Annotations\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 16843–16877\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.936),[Link](https://aclanthology.org/2024.emnlp-main.936/)Cited by:[§1](https://arxiv.org/html/2606.28526#S1.p3.1),[§2](https://arxiv.org/html/2606.28526#S2.p3.1)\.
- A\. H\. Shakur, M\. J\. Holcomb, D\. Hein, S\. Kang, T\. O\. Dalton, K\. K\. Campbell, D\. J\. Scott, and A\. R\. Jamieson \(2024\)Large language models for medical osce assessment: a novel approach to transcript analysis\.External Links:2410\.12858,[Link](https://arxiv.org/abs/2410.12858)Cited by:[§1](https://arxiv.org/html/2606.28526#S1.p2.1),[§2](https://arxiv.org/html/2606.28526#S2.p5.1)\.
- H\. Voigt, Y\. Sugamiya, K\. Lawonn, S\. Zarrieß, and A\. Takanishi \(2025\)LLM\-powered virtual patient agents for interactive clinical skills training with automated feedback\.External Links:2508\.13943,[Link](https://arxiv.org/abs/2508.13943)Cited by:[§1](https://arxiv.org/html/2606.28526#S1.p2.1),[§2](https://arxiv.org/html/2606.28526#S2.p1.1)\.
- J\. Wang, Z\. Yao, Z\. Yang, H\. Zhou, R\. Li, X\. Wang, Y\. Xu, and H\. Yu \(2024a\)NoteChat: a dataset of synthetic patient\-physician conversations conditioned on clinical notes\.InFindings of the Association for Computational Linguistics ACL 2024,pp\. 15183–15201\.External Links:[Link](http://dx.doi.org/10.18653/v1/2024.findings-acl.901),[Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.901)Cited by:[§2](https://arxiv.org/html/2606.28526#S2.p4.1)\.
- R\. Wang, S\. Milani, J\. C\. Chiu, J\. Zhi, S\. M\. Eack, T\. Labrum, S\. M\. Murphy, N\. Jones, K\. V\. Hardy, H\. Shen, F\. Fang, and Z\. Chen \(2024b\)PATIENT\-ψ\\psi: using large language models to simulate patients for training mental health professionals\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 12772–12797\.External Links:[Link](https://aclanthology.org/2024.emnlp-main.711)Cited by:[§2](https://arxiv.org/html/2606.28526#S2.p1.1)\.
- H\. Yu, J\. Zhou, L\. Li, S\. Chen, J\. Gallifant, A\. Shi, J\. Sun, X\. Li, J\. He, W\. Hua, M\. Jin, G\. Chen, Y\. Zhou, Z\. Li, T\. Gupte, M\. Chen, Z\. Azizi, Q\. Dou, B\. P\. Yan, Y\. Xing, Y\. Zhang, T\. L\. Assimes, D\. S\. Bitterman, X\. Ma, L\. Lu, and L\. Fan \(2025\)Simulated patient systems powered by large language model\-based AI agents offer potential for transforming medical education\.Communications Medicine6\(1\),pp\. 27\.External Links:ISSN 2730\-664X,[Document](https://dx.doi.org/10.1038/s43856-025-01283-x),[Link](https://www.nature.com/articles/s43856-025-01283-x)Cited by:[§2](https://arxiv.org/html/2606.28526#S2.p1.1)\.
- G\. Zeng, W\. Yang, Z\. Ju, Y\. Yang, S\. Wang, R\. Zhang, M\. Zhou, J\. Zeng, X\. Dong, R\. Zhang, H\. Fang, P\. Zhu, S\. Chen, and P\. Xie \(2020\)MedDialog: Large\-scale Medical Dialogue Datasets\.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),B\. Webber, T\. Cohn, Y\. He, and Y\. Liu \(Eds\.\),Online,pp\. 9241–9250\.External Links:[Document](https://dx.doi.org/10.18653/v1/2020.emnlp-main.743),[Link](https://aclanthology.org/2020.emnlp-main.743/)Cited by:[§2](https://arxiv.org/html/2606.28526#S2.p4.1)\.
- L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. Xing, H\. Zhang, J\. E\. Gonzalez, and I\. Stoica \(2023\)Judging LLM\-as\-a\-Judge with MT\-Bench and Chatbot Arena\.Advances in Neural Information Processing Systems36,pp\. 46595–46623\.External Links:[Link](https://proceedings.neurips.cc/paper%5C_files/paper/2023/hash/91f18a1287b398d378ef22505bf41832-Abstract-Datasets%5C_and%5C_Benchmarks.html)Cited by:[§2](https://arxiv.org/html/2606.28526#S2.p5.1)\.
- J\. E\. Zini, Y\. Rizk, M\. Awad, and J\. Antoun \(2019\)Towards a deep learning question\-answering specialized chatbot for objective structured clinical examinations\.In2019 International Joint Conference on Neural Networks \(IJCNN\),Vol\.,pp\. 1–9\.External Links:[Document](https://dx.doi.org/10.1109/IJCNN.2019.8851729)Cited by:[§1](https://arxiv.org/html/2606.28526#S1.p2.1)\.
## Appendix A
### A\.1 Data collection: recording protocol
#### Recording context:
Students followed a predefined schedule for the training sessions, rotating through numbered stations\. Students playing patient roles received detailed station descriptions in advance and often portrayed the same patient throughout a session, appearing in multiple recordings for the same station, while students playing physicians accessed station\-specific context only upon entering the room\. An evaluator, also a medical student, observed the interaction and provided structured feedback using the evaluator sheet\.
#### Recording protocol:
The recordings were conducted unobtrusively during the training sessions\. Out of the eight available examination rooms, two were equipped for audio recording\. In each recorded room, both the evaluated student and the simulated patient wore a wireless clip\-on microphone \(DJI Mic Mini\)\. In addition to audio, 85 of the recorded dialogues include video data, which were not used in the present work\. The collected data were transferred and stored on encrypted hard drives with restricted access\.
### A\.2 Transcription evaluation: WER computation
The WER is defined as:
$$\mathrm{WER} = \frac{S + D + I}{N},$$
where $S$ denotes the number of substitutions, $D$ the number of deletions, $I$ the number of insertions, and $N$ the total number of words in the reference transcription\.
Before WER computation, transcripts were normalized and tokenized using the spaCy pipeline with the fr\_core\_news\_md model \(Explosion AI, [2024](https://arxiv.org/html/2606.28526#bib.bib48)\)\. WER computation was performed using a custom Python pipeline based on the jiwer library \(Jitsi, [2024](https://arxiv.org/html/2606.28526#bib.bib49)\)\. Automatic and reference transcripts were paired by dialogue identifier, normalized, segmented by speaker, and then evaluated\.
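As an illustration of the WER definition above, the following minimal sketch counts substitutions, deletions, and insertions with a word\-level edit\-distance alignment\. It is not the authors' pipeline: it uses only the Python standard library instead of jiwer, and lowercasing plus whitespace splitting is an assumed stand\-in for the spaCy normalization and tokenization\. The function names `edit_counts` and `wer` are hypothetical\.

```python
from typing import List, Tuple


def edit_counts(ref: List[str], hyp: List[str]) -> Tuple[int, int, int]:
    """Return (substitutions, deletions, insertions) of a minimal word alignment."""
    # dp[i][j] = (cost, S, D, I) for aligning ref[:i] with hyp[:j]
    dp = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)  # delete every reference word
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)  # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]  # match, no extra cost
                continue
            c, s, d, n = dp[i - 1][j - 1]
            substitution = (c + 1, s + 1, d, n)
            c, s, d, n = dp[i - 1][j]
            deletion = (c + 1, s, d + 1, n)
            c, s, d, n = dp[i][j - 1]
            insertion = (c + 1, s, d, n + 1)
            dp[i][j] = min(substitution, deletion, insertion, key=lambda t: t[0])
    _, s, d, n = dp[len(ref)][len(hyp)]
    return s, d, n


def wer(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N, with N the number of reference words."""
    # Assumption: simple lowercasing and whitespace tokenization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcription is empty")
    s, d, i = edit_counts(ref, hyp)
    return (s + d + i) / len(ref)
```

Because $N$ counts reference words only, a hypothesis with many insertions can yield a WER above 1\.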
### A\.3 Correlation test: information recall and OSCE score
Pearson \(r = 0\.29\) and Spearman \(r = 0\.31\) correlations between information recall and OSCE score were computed, showing a weak but statistically significant positive association \(p < 0\.001\)\. This indicates a partial relationship between information recall and OSCE performance, while suggesting that other conversational and clinical reasoning factors also contribute to the overall score\. Information recall is reported for informational purposes only, as exhaustive disclosure of patient information is neither expected nor desirable in OSCE settings\.
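For readers who want to reproduce this kind of analysis on their own scores, here is a minimal standard\-library sketch of the two coefficients\. It assumes two equal\-length lists of per\-dialogue values and computes Spearman's coefficient as the Pearson correlation of average ranks; it does not compute p\-values, and the function names are hypothetical rather than taken from the paper\.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation of two equal-length samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values receive the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Spearman's coefficient only depends on the ordering of the values, so it is less sensitive than Pearson's to a few dialogues with extreme recall or OSCE scores\.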
### A\.4 Dataset details
| Station \(ID\) | Specialty | Consultation type | Objective |
| --- | --- | --- | --- |
| 44 | Psychiatry | Emergency | History taking; diagnosis; clinical summary |
| 46 | Pediatrics | First consultation | Breaking bad news; patient education |
| 107 | General practice | Follow\-up | History taking; diagnosis |
| 120 | Neurology | Follow\-up | Breaking bad news; patient education |
| 220 | Cardiology; ICU | Emergency | Diagnosis; interprofessional coordination |
| 224 | Pulmonology | Follow\-up | History taking; diagnosis; clinical summary |
| 225 | Occupational medicine | Follow\-up | Specialist referral; patient education |
| 226 | General practice | Follow\-up | Patient education; treatment initiation |
| 228 | General practice | Emergency | History taking; diagnosis; specialist referral |
| 233 | General practice | Follow\-up | History taking; prescription renewal; patient education |
| 235 | Emergency practice | Emergency | Examination interpretation; history taking; diagnosis; patient education |

Table 5: Annotations of the 11 OSCE stations used for dialogue generation, including medical specialty, consultation type, and communication objectives\.
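### A\.5 Detailed results
Column groups: patient simulation \(information recall, response relevance\); physician performance \(OSCE evaluation, role adherence\); linguistic quality \(naturalness, fluency, coherence\)\.

| Model | Setup | Information recall \(%\) | Response relevance \(/5\) | OSCE evaluation \(%\) | Role adherence \(/5\) | Naturalness \(/5\) | Fluency \(/5\) | Coherence \(/5\) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Recorded dialogues | | 43\.58 | 4\.14 | 68\.57 | 4\.03 | 3\.21 | 3\.79 | 3\.91 |
| Claude\-Haiku\-4\.5 | baseline | 52\.24 | 4\.97 | 87\.48 | 4\.76 | 3\.42 | 4\.35 | 4\.28 |
| Claude\-Haiku\-4\.5 | reflection | 54\.29 | 4\.83 | 87\.75 | 4\.85 | 3\.57 | 4\.30 | 4\.36 |
| Claude\-Haiku\-4\.5 | refl\. \+ retr\. | 56\.62 | 4\.89 | 89\.21 | 4\.89 | 3\.59 | 4\.29 | 4\.36 |
| GPT\-4o\-mini | baseline | 47\.76 | 5\.00 | 74\.81 | 4\.83 | 3\.60 | 4\.31 | 4\.42 |
| GPT\-4o\-mini | reflection | 44\.66 | 4\.91 | 76\.79 | 4\.82 | 3\.64 | 4\.29 | 4\.44 |
| GPT\-4o\-mini | refl\. \+ retr\. | 42\.39 | 4\.97 | 75\.71 | 4\.69 | 3\.59 | 4\.30 | 4\.38 |
| Ministral\-14B | baseline | 47\.44 | 4\.97 | 82\.55 | 4\.70 | 2\.83 | 3\.98 | 4\.24 |
| Ministral\-14B | reflection | 50\.02 | 4\.98 | 84\.44 | 4\.74 | 2\.95 | 4\.03 | 4\.28 |
| Ministral\-14B | refl\. \+ retr\. | 52\.40 | 4\.95 | 83\.50 | 4\.55 | 2\.60 | 3\.96 | 4\.21 |

Table 6: Evaluation results across all three dimensions with GPT\-4o\-mini as LLM\-as\-a\-Judge, for all three models and each configuration: baseline, reflection loop only, reflection loop \+ retrieval\. Each score is averaged over 44 generated dialogues\. Best results for each model in bold\.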
### A\.6 Prompts for dialogue generation
#### OSCE criteria classification prompt
You are a medical education expert\. Classify each of the following clinical consultation criteria into exactly one of these stages:
\{stages\_desc\}
Criteria to classify:
\{criteria\_text\}
Return ONLY a JSON object mapping criterion number to stage id:
\{\{
"1": "opening\_and\_preparation",
"2": "information\_collection",
\.\.\.
\}\}
No explanation, no markdown fences, just the JSON object\.
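The doubled braces in the prompt above suggest a template filled with Python's `str.format`, though the paper does not state this\. Under that assumption, the sketch below fills an abridged version of the template and validates the JSON mapping returned by the model\. The helper names `build_prompt` and `parse_classification` are hypothetical, and only the two stage identifiers shown in the prompt are used\.

```python
import json
from typing import Dict, List, Set

# Abridged template; placeholders and doubled braces mirror the prompt above.
TEMPLATE = (
    "Classify each of the following criteria into exactly one of these stages:\n"
    "{stages_desc}\n\n"
    "Criteria to classify:\n{criteria_text}\n\n"
    "Return ONLY a JSON object mapping criterion number to stage id:\n"
    '{{"1": "opening_and_preparation", ...}}'
)


def build_prompt(stages: Dict[str, str], criteria: List[str]) -> str:
    """Fill the template with stage descriptions and numbered criteria."""
    stages_desc = "\n".join(f"- {sid}: {desc}" for sid, desc in stages.items())
    criteria_text = "\n".join(f"{i}. {c}" for i, c in enumerate(criteria, start=1))
    return TEMPLATE.format(stages_desc=stages_desc, criteria_text=criteria_text)


def parse_classification(raw: str, n_criteria: int, stage_ids: Set[str]) -> Dict[int, str]:
    """Parse the model output and check it covers every criterion with a known stage."""
    data = json.loads(raw)
    result = {int(key): value for key, value in data.items()}
    if set(result) != set(range(1, n_criteria + 1)):
        raise ValueError("every criterion must be classified exactly once")
    unknown = set(result.values()) - stage_ids
    if unknown:
        raise ValueError(f"unknown stage ids: {sorted(unknown)}")
    return result
```

Validating the mapping before use guards against a model that skips a criterion or invents a stage name, both of which would otherwise silently distort the per\-stage checklists\.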
#### Physician’s utterances generation prompt
You are generating the responses of a physician during an Objective Structured Clinical Examination\. The exam takes place in a French medical university, and entirely in French\. You must interact with the patient, asking questions and making comments as a real physician would do during a consultation\. You must strictly follow the instructions and context provided below\.
CONTEXT AND INSTRUCTIONS:
\{consultation\_context\}
CURRENT PHASE: \{stage\_label\} \(batch \{batch\_number\}/\{total\_batches\}\)
\{stage\_transition\_instruction\}
OBJECTIVES FOR THIS PHASE:
\{current\_checklist\}
\{covered\_criteria\_section\}
CONVERSATION HISTORY:
\{conversation\_history\}
LAST PATIENT RESPONSE:
\{last\_patient\_response\}
You have 8 minutes total\. Currently, \{time\_info\} remains\. If less than 2 minutes remain, prioritize closing the consultation and addressing the most critical objectives\.
\{final\_part\_instruction\}
GENERAL RULES:
\- Output speech only \(no parentheses, no actions, no thoughts\)\.
\- Contextual fidelity: Strictly follow the role, objectives, and constraints provided above\.
\- Use open\-ended questions to explore symptoms\.
\- Medical history: Investigate relevant medical history when appropriate\.
\- Adaptability: Tailor your follow\-up questions to the patient’s specific answers\. Do not produce long answers\. Be concise\.
\- Flow control: Ask only one question at a time\.
\- Coverage: Make sure to address ALL criteria listed in OBJECTIVES FOR THIS PHASE before the conversation moves on\.
\- Phase awareness: Your questions should be appropriate for the current phase of the consultation\.
IMPORTANT: Only generate the physician’s next spoken response as text\. Do not include stage directions, actions, or any narration\. NO PHYSICAL EXAMINATIONS\. Your answer must be short \(1\-2 sentences\)\.
Physician’s next question:
#### Patient’s utterances generation prompt
You are generating the responses of a patient during an Objective Structured Clinical Examination, to help teach medicine students\. The exam takes place in a French medical university, and entirely in French\.
You must interact with and answer the doctor’s questions as truthfully as possible\.
Compose your responses as though you are speaking with the doctor\-student\. Try to be as concise as possible, as the exam only lasts 7 minutes\. Keep answers short and natural\. Use everyday language, not medical jargon\.
Rules:
\- Answer only what is asked, and give one small piece of information at a time\.
\- Do not volunteer extra details; wait for the doctor to ask\.
\- If you don’t know or the information is not in the patient data, say you don’t know\.
\- If the doctor uses complex terms, ask for a simpler explanation\.
\- Do not argue with the doctor; if you are unsure, ask for clarification\.
\- Keep responses to 1 short sentence or two short sentences at most\.
\- Do not repeat the same closing line over and over; once you agree or acknowledge, wait for the next question\.
\- Remember what the doctor just told you and respond consistently, even if it changes your attitude\.
\- Output speech only \(no parentheses, no actions, no thoughts\)\.
\- Once the interview is coming to a close, and it is your last turn, thank the doctor and say goodbye\.
\{patient\_data\_section\}
Start the dialogue with "phrase\_demarrage" indicated in your patient profile\.
Conversation history:
"\{conversation\_history\}"
Last physician message:
"\{doctor\_question\}"
Your answer:
#### Patient’s utterance evaluation \(controller\) prompt
You are an expert evaluator in clinical communication for OSCE \(Objective Structured Clinical Examination\) exams\.
Your task is to evaluate a simulated patient reply written in plain natural language\.
The response you evaluate is NOT a JSON and MUST be treated as raw text only\.
PATIENT PROFILE:
\{patient\_profile\}
HISTORY:
\{conversation\_history\}
GENERATED RESPONSE:
\{patient\_response\}
EVALUATION CRITERIA:
\- Consistency: The response is compatible with the patient profile, without contradictions or invented information\.
\- Patient realism: The level of language and understanding matches that of a patient\.
\- Credible interaction: Only verbal discourse; no actions or emotions described; concise response\.
CRITICAL CONCISION:
\- The patient answers accordingly to their profile and to the doctor’s answer\.
\- The response MUST NOT contain asterisks \(\*\), parentheses to describe actions, or descriptions of gestures/facial expressions\. ANY text between asterisks \(action\) or parentheses \(gesture\) is a MAJOR ERROR that makes the response NON\-COMPLIANT \("conforme": false\)\.
\- Patient role: The patient does not reason like a doctor and does not suggest diagnoses or treatments\.
\- Linguistic quality: The response is clear, natural, and correct\.
\- If the answer is empty, give a maximum score of 1\.
STRICT RULE ON ASTERISKS:
If the response contains even a SINGLE asterisk \(\*\) or parenthesis describing an action:
"conforme": false
"score": maximum 4
Add to errors: "Presence of asterisks or prohibited action descriptions"\.
YOUR MISSION:
\- Verify the absence of asterisks or parentheses describing actions\.
\- Detect any inconsistency or role error\.
\- Identify specific issues\.
STRICT OUTPUT FORMAT \(JSON\):
\{\{
"conforme": true\|false,
"score": int from 0 to 10,
"erreurs": \[
"precise description of problem 1",
"precise description of problem 2"
\]
\}\}
RULES:
\- Produce ONLY this JSON, nothing else\.
\- No text before or after the JSON\.
\- No markdown tags like \`\`\`json\.
\- No comments\.
\- ONLY ONCE \(do not repeat the JSON\)\.
Evaluation:
#### Patient’s utterances correction prompt
You are a medical simulation assistant\. Your task is to correct a simulated patient’s response to ensure it is realistic, coherent with the patient profile, and follows the format rules\.
PATIENT PROFILE:
\{patient\_profile\}
DOCTOR’S QUESTION:
\{doctor\_question\}
CURRENT PATIENT RESPONSE \(TO CORRECT\):
\{patient\_response\}
ERRORS DETECTED:
\{errors\_text\}
\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
CORRECTION INSTRUCTIONS
\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
1\. \*\*MINIMAL CHANGES\*\*: Make ONLY the changes necessary to fix the listed errors
\- Keep the same tone, style, and structure
\- Keep the same vocabulary level
\- Change ONLY what is incorrect
2\. \*\*FORBIDDEN ELEMENTS\*\* \(must be removed\):
\- Asterisks for actions: \*soupir\*, \*grimace\*, \*acquiesce\*
\- Parentheses for actions: \(sourit\), \(hesite\), \(reflechit\)
\- Physical descriptions: "Je hoche la tete", "Je croise les bras"
\- Stage directions or narration
3\. \*\*ALLOWED ELEMENTS\*\*:
\- Verbal hesitations: "euh\.\.\.", "hum\.\.\.", "ben\.\.\.", "alors\.\.\."
\- Natural speech patterns: repetitions, corrections, incomplete sentences
\- Simple words and everyday language \(patients are not doctors\)
4\. \*\*COHERENCE WITH PROFILE\*\*:
\- If the profile says the patient takes Metformin, the response must mention it
\- If the profile says non\-smoker, the response must NOT mention smoking
\- Respect all factual elements from the profile
5\. \*\*NATURAL SPEECH\*\*:
\- Keep responses SHORT \(1\-2 sentences typical for a single turn\)
\- Patients don’t know exact medical terms or dosages
YOUR TASK:
Provide ONLY the corrected patient response\. No explanations, no comments, no markdown, just the corrected text\.
CORRECTED RESPONSE:
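The controller and correction prompts above together describe a reflection loop: a draft patient reply is judged, and rewritten if it is non\-compliant\. The sketch below shows one way such a loop could be wired; it is an assumption\-laden illustration, not the authors' implementation\. The `evaluate` and `correct` callables stand in for LLM calls using the two prompts, `max_rounds` is a hypothetical bound, and the regular\-expression check is an added deterministic safeguard that the paper does not describe\.

```python
import json
import re
from typing import Callable, List, Tuple

# Any asterisk, or any parenthesized span, counts as forbidden action markup.
FORBIDDEN_MARKUP = re.compile(r"\*|\([^)]*\)")


def parse_verdict(raw: str) -> Tuple[bool, int, List[str]]:
    """Read the controller JSON: compliance flag, 0-10 score, list of errors."""
    data = json.loads(raw)
    return bool(data["conforme"]), int(data["score"]), list(data.get("erreurs", []))


def reflect(
    draft: str,
    evaluate: Callable[[str], str],
    correct: Callable[[str, List[str]], str],
    max_rounds: int = 2,  # hypothetical limit on correction attempts
) -> str:
    """Judge a patient reply and rewrite it until compliant or out of rounds."""
    response = draft
    for _ in range(max_rounds):
        conforme, _score, errors = parse_verdict(evaluate(response))
        # Rule-based safety net alongside the LLM controller (an assumption).
        if FORBIDDEN_MARKUP.search(response):
            conforme = False
            errors.append("Presence of asterisks or prohibited action descriptions")
        if conforme:
            break
        response = correct(response, errors)
    return response
```

Passing the evaluator and corrector as callables keeps the loop independent of any particular model API and makes it testable with simple stubs\.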
### A\.7 Prompts for dialogue evaluation
#### Patient simulation quality evaluation prompt: information recall
You are an expert medical evaluator\. For each criterion, determine whether the patient mentioned it in the dialogue\.
Respond ONLY with valid JSON\.
Patient turns only:
\-\-\-
\{patient\_turns\}
\-\-\-
Binary criteria:
\{criteres\_json\}
Return:
\{\{
"evaluation\_binaire": \[
\{\{"id": <id\>, "critere": "<text\>", "mentionne": true/false, "justification": "<short extract or reason\>"\}\}
\],
"score\_binaire": \{\{"mentionne": <n\>, "total": <n\>, "pourcentage": <float\>\}\}
\}\}
RULES:
\- Return ONLY valid JSON\.
\- Evaluate ONLY patient speech\.
\- Do NOT penalize date inconsistencies\.
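The information recall prompt asks the judge to return both per\-criterion decisions and an aggregate score\. Since arithmetic produced by a language model can be unreliable, one option is to recompute the aggregate from the per\-criterion booleans\. The sketch below does this; the paper does not say whether the authors recompute the score, and the function name `recall_score` is hypothetical\.

```python
import json
from typing import Dict, Union


def recall_score(raw_judge_output: str) -> Dict[str, Union[int, float]]:
    """Recompute score_binaire from the evaluation_binaire list of the judge."""
    data = json.loads(raw_judge_output)
    items = data["evaluation_binaire"]
    # Count only criteria explicitly marked true by the judge.
    mentioned = sum(1 for item in items if item["mentionne"] is True)
    total = len(items)
    percentage = 100.0 * mentioned / total if total else 0.0
    return {"mentionne": mentioned, "total": total, "pourcentage": round(percentage, 2)}
```

The same recomputation would apply to the `score_osce` field of the OSCE evaluation prompt, with `ok` in place of `mentionne`\.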
#### Patient simulation quality evaluation prompt: response relevance
You are a medical dialogue evaluation expert\.
Evaluate whether patient responses are relevant to the doctor’s questions and consistent with the patient sheet\.
Respond ONLY with valid JSON\.
Patient sheet:
\{fiche\_json\}
Patient turns:
\{patient\_turns\}
Dialogue for context:
\{dialogue\}
Evaluate whether patient responses are relevant to:
1\. The doctor’s questions
2\. The patient sheet
Score relevance:
1 = Very poor
2 = Poor
3 = Acceptable
4 = Good
5 = Excellent
Return:
\{\{
"score\_response\_relevance": <1\-5\>,
"justification": "<brief explanation\>",
"examples": \{\{
"relevant": \["\.\.\."\],
"irrelevant": \["\.\.\."\]
\}\}
\}\}
Rules:
\- Evaluate ONLY patient responses
\- Compare responses with both doctor questions and patient sheet
\- Do not judge medical correctness beyond the sheet
\- Return ONLY valid JSON
#### Physician performance evaluation prompt: OSCE evaluation
You are an expert medical evaluator\. For each criterion, determine whether the physician mentioned or addressed it in the dialogue\.
Respond ONLY with valid JSON\.
Full dialogue:
\{dialogue\}
Binary criteria to evaluate:
\{fiche\_doctor\}
Return:
\{\{
"osce\_evaluation": \[
\{\{"id": <id\>, "critere": "<text\>", "ok": true/false, "justification": "<short extract or reason\>"\}\}
\],
"score\_osce": \{\{"ok": <n\>, "total": <n\>, "pourcentage": <float\>\}\}
\}\}
RULES:
\- Return ONLY valid JSON\.
\- Evaluate ONLY physician speech\.
#### Physician performance evaluation prompt: role adherence
You are a medical simulation expert\. Evaluate whether the physician correctly plays their role given the context provided to them\.
Respond ONLY with valid JSON\.
Context given to the doctor at the start of the simulation:
\-\-\-
\{medecin\_context\}
\-\-\-
Full dialogue:
\-\-\-
\{dialogue\}
\-\-\-
Based solely on the context above, evaluate the physician’s overall role adherence\. Consider the following aspects in your assessment:
\- Did the physician respect the clinical context they were given?
\- Was the communication style appropriate for a remote medical consultation?
\- Did the physician stay within the bounds of what is possible in a spoken interaction?
Score on a 1\-5 Likert scale:
\- 1 = Very poor: Major deviations from role, inappropriate actions, context ignored
\- 2 = Poor: Several deviations from role or significant inappropriate behaviors
\- 3 = Acceptable: Mostly in role with some notable issues
\- 4 = Good: Consistently in role with only minor issues
\- 5 = Excellent: Perfect role adherence, fully respects context and constraints
Return:
\{\{
"role\_adherence": \{\{
"score": <1\-5\>,
"justification": "<detailed explanation of the score\>",
"issues": \["<list any problems found\>"\]
\}\},
"score\_role\_adherence": <1\-5\>
\}\}
RULES:
\- Return ONLY valid JSON\.
\- Evaluate ONLY physician speech\.
\- PENALIZE if the physician starts a physical exam or mimics actions impossible in a remote simulation \(e\.g\. shaking hands, auscultation\)\.
\- PENALIZE if the physician describes emotions or physical gestures \(e\.g\. \*smiling\*, \*looking concerned\*\), as this breaks immersion\.
\- Judge the physician ONLY against the context they were given, not against hidden patient information\.
#### Linguistic quality evaluation prompt: naturalness, fluency, coherence
You are an expert evaluator of medical dialogue linguistic quality\. You evaluate dialogues on three dimensions: naturalness, fluency, and coherence\.
You use strict rubrics and score on a 0\-100 scale\.
Respond ONLY with valid JSON\.
Evaluate the following dialogue on naturalness, fluency, and coherence\.
\*\*Rubrics\*\*:
\{rubrics\}
\*\*Dialogue to evaluate\*\*:
\-\-\-
\{dialogue\}
\-\-\-
Return:
\{\{
"linguistic\_quality": \{\{
"naturalness": \{\{
"score\_100": <0\-100\>,
"band": "<0\-20\|20\-40\|40\-60\|60\-80\|80\-100\>",
"justification": "<\.\.\.\>",
"context\_violations": \["<list any context violations\>"\]
\}\},
"fluency": \{\{
"score\_100": <0\-100\>,
"band": "<0\-20\|20\-40\|40\-60\|60\-80\|80\-100\>",
"justification": "<\.\.\.\>"
\}\},
"coherence": \{\{
"score\_100": <0\-100\>,
"band": "<0\-20\|20\-40\|40\-60\|60\-80\|80\-100\>",
"justification": "<\.\.\.\>"
\}\}
\}\}
\}\}
RULES:
\- Return ONLY valid JSON\.
\- Be strict\. A score of 80\+ means near\-perfect\. Penalize context violations heavily in naturalness\.