Fully Open Meditron: An Auditable Pipeline for Clinical LLMs

arXiv cs.AI Papers

Summary

Introduces Fully Open Meditron, the first fully open pipeline for building clinical LLMs, featuring a clinician-audited training corpus and reproducible framework, achieving state-of-the-art among fully open medical specialist models.

arXiv:2605.16215v1 Announce Type: new Abstract: Clinical decision support systems (CDSS) require scrutable, auditable pipelines that enable rigorous, reproducible validation. Yet current LLM-based CDSS remain largely opaque. Most "open" models are open-weight only, releasing parameters while withholding the data provenance, curation procedures, and generation pipelines that determine model behavior. Fully Open (FO) models, which expose the complete training stack end-to-end, do not currently exist in medicine. We introduce Fully Open Meditron, the first fully open pipeline for building LLM-CDSS, comprising a clinician-audited training corpus, a reproducible data construction and training framework, and a use-aligned evaluation protocol. The corpus unifies eight public medical QA datasets into a normalized conversational format and expands coverage with three clinician-vetted synthetic extensions: exam-style QA, guideline-grounded QA derived from 46,469 clinical practice guidelines, and clinical vignettes. The pipeline enforces system-wide decontamination, gold-label resampling of teacher generations, and end-to-end validation by a four-physician panel. We evaluate using an LLM-as-a-judge protocol over expert-written clinical vignettes, calibrated against 204 human raters. We apply the recipe to five FO base models (Apertus-70B/8B-Instruct, OLMo-2-32B-SFT, EuroLLM-22B/9B-Instruct). All MeditronFO variants are preferred over their bases. Apertus-70B-MeditronFO improves +6.6 points over its base (47.2% to 53.8%) on aggregate medical benchmarks, establishing a new FO SoTA. Gemma-3-27B-MeditronFO is preferred over MedGemma in 58.6% of LLM-as-a-judge comparisons and outperforms it on HealthBench (58% vs 55.9%). These results show that fully open pipelines can achieve state-of-the-art domain-specific performance without sacrificing auditability or reproducibility.

Cached at: 05/18/26, 06:36 AM

# Fully Open Meditron: An Auditable Pipeline for Clinical LLMs
Source: [https://arxiv.org/html/2605.16215](https://arxiv.org/html/2605.16215)
###### Abstract

Clinical decision support systems \(CDSS\) require scrutable, auditable pipelines that enable rigorous, reproducible validation\. Yet current LLM\-based CDSS \(LLM\-CDSS\) remain largely opaque\. Most “open” models are open\-weight only, releasing parameters, while withholding the data provenance, curation procedures, and generation pipelines that determine model behavior\. Fully Open \(FO\) models, which expose the complete training stack end\-to\-end, do not currently exist in medicine\. We introduce Fully Open Meditron, the first fully open pipeline for building LLM\-CDSS, comprising a clinician\-audited training corpus, a reproducible data construction and training framework, and a use\-aligned evaluation protocol\. The corpus unifies eight public medical QA datasets into a normalized conversational format and expands coverage with three clinician\-vetted synthetic extensions: exam\-style QA, guideline\-grounded QA derived from 46,469 clinical practice guidelines, and clinical vignettes\. The pipeline enforces system\-wide decontamination to eliminate overlap with evaluation benchmarks, includes gold\-label resampling of teacher generations, and also includes end\-to\-end validation by a four\-physician panel\. We evaluate using an LLM\-as\-a\-judge protocol over expert\-written clinical vignettes, calibrated against 204 human raters, capturing open\-ended clinical reasoning beyond typical multiple\-choice benchmarks\. We apply the recipe to five FO base models \(Apertus\-70B/8B\-Instruct, OLMo\-2\-32B\-SFT, EuroLLM\-22B/9B\-Instruct\)\. All MeditronFO variants are preferred over their bases in pair\-wise clinical evaluation\. Apertus\-70B\-MeditronFO improves \+6\.6 points over its base \(47\.2% → 53\.8%\) on aggregate medical benchmarks, establishing a new FO SoTA for LLM\-CDSS\. Additionally, Gemma\-3\-27B\-MeditronFO is preferred over MedGemma in 58\.6% of LLM\-as\-a\-judge comparisons and also outperforms it on HealthBench \(58% vs 55\.9%\)\. 
These results show that fully open pipelines can achieve state\-of\-the\-art domain\-specific performance, without sacrificing auditability or reproducibility\.

![Refer to caption](https://arxiv.org/html/2605.16215v1/x1.png)

Figure 1: Evolution of medical LLM performance on HealthBench over time across closed\-data, open\-weight, and fully open models\. While open\-weight medical specialists have approached the performance of proprietary systems, no fully open medical specialist previously existed\. This work introduces Apertus\-MeditronFO, the first fully open medical specialist model, establishing a new state of the art among fully open systems\.

## 1 Introduction

Medical large language models \(LLMs\) are increasingly deployed in high\-stakes clinical settings, from specialist decision support to autonomous patient\-facing applications that may operate with little external scrutiny\. As these systems encounter more variable real\-world interactions, questions of trust, auditability, and provenance grow in importance\. Yet most “open” medical LLMs release only model weights while withholding the training data provenance, data preparation pipelines, and adaptation procedures that shape model behavior\. Adapting generalist LLMs into medical specialists is now widespread, producing systems such as MedGemma \(Sellergren et al\., [2025](https://arxiv.org/html/2605.16215#bib.bib46)\), Meditron \(Chen et al\., [2023b](https://arxiv.org/html/2605.16215#bib.bib33)\), and BioMistral \(Labrak et al\., [2024](https://arxiv.org/html/2605.16215#bib.bib20)\)\. The typical pipeline combines continued pre\-training on medical corpora with supervised fine\-tuning on curated QA datasets\. However, the resulting systems remain largely opaque\. Releasing weights alone does not reveal whether a model learned from guideline\-grounded evidence, benchmark artifacts, synthetic hallucinations, or clinically narrow populations\. Consistent with concerns raised about opaque adaptation pipelines \(Alber et al\., [2025](https://arxiv.org/html/2605.16215#bib.bib25); Betley et al\., [2026](https://arxiv.org/html/2605.16215#bib.bib23)\), current open\-weight specialists including MedGemma do not disclose training corpora or generation pipelines, limiting independent auditability\.

This concern is amplified by the saturation of standard medical benchmarks, where performance gains may reflect contamination, memorization, or benchmark\-specific adaptation rather than clinical capability\. In clinical practice, where clinicians, regulators, and patients may reasonably demand to audit what a model has learned and how it was trained, this opacity presents a fundamental limitation\. Fully open \(FO\) models offer a path to end\-to\-end auditability, but also operate under a disadvantage: because training data, preparation pipelines, and model weights must be openly releasable, they cannot rely on proprietary clinical corpora, restricted datasets, or undisclosed synthetic pipelines that underpin many frontier systems\. As a result, FO models generally lag behind closed\-data counterparts on established benchmarks, and no fully open medical specialist currently exists\.

Table 1: Openness dimensions across medical LLMs\. Most prior medical LLMs release weights but withhold the data and pipelines that determine model behavior\. MeditronFO is the first family to satisfy all openness dimensions end\-to\-end\. Openness is assessed separately for the base model and the medical adaptation\. For the base model, Data refers to pretraining, post\-training, instruction\-tuning, or alignment data; Code refers to reproducible training code and recipe; and Weights refers to released model weights\. For the medical adaptation, Data refers to fine\-tuning or instruction data; Syn\-data refers to the synthetic data generation pipeline, including prompts, teacher model, and filtering procedure; Code refers to the fine\-tuning/training code and recipe; and Weights refers to the adapted medical model weights\. License categories are O = permissive open license, C = community or commercially usable license with restrictions, IC = inherited C license \(the base model’s license applies, while the medical adaptation itself is permissively released\), and R = restrictive, research\-only, or proprietary license\.

We argue that this gap reflects corpus construction rather than an inherent limitation of open models\. Public medical benchmarks are heterogeneous, narrowly scoped, and poorly aligned with clinical interaction; for instance, emergency\-care scenarios account for only 15% of the aggregated public QA we examine, and life\-threatening cases for under 9%, despite being the settings where clinical decision support matters most\. Prior work shows that biomedical specialists frequently fail to outperform their generalist bases on unseen medical data, suggesting reported gains may reflect contamination or benchmark adaptation rather than clinical capability \(Dorfner et al\., [2025](https://arxiv.org/html/2605.16215#bib.bib17)\)\. Existing benchmarks also underrepresent low\-resource settings, vulnerable populations, and diagnostic reasoning under uncertainty\.
This issue is further exacerbated by the over\-reliance on multiple\-choice evaluation\. MCQs reward rote structural recall but fail to capture clinically important dimensions, such as contextual awareness, communication, harmlessness, and alignment with guidelines\. Building clinically useful models therefore requires open\-ended evaluation and training corpora designed accordingly\.

Contributions\. To address this gap, we introduce Fully Open Meditron, the first FO pipeline for adapting FO foundation models into medical specialists\. We show that competitive medical specialization can be achieved under FO constraints through disciplined clinician\-audited corpus construction and open\-ended clinical evaluation\. Our main contributions are:

•A fully open medical adaptation framework\. We release a reproducible end\-to\-end framework spanning corpus construction, synthetic data generation, decontamination, training, and evaluation for adapting fully open foundation models to medicine\.

•A structured, fully open clinician\-audited knowledge corpus\. We normalize eight public medical QA datasets and systematically expand coverage via clinician\-vetted synthetic generation \(exam\-style QA, guideline\-grounded QA derived from 46,469 clinical practice guidelines, and open\-ended clinical vignettes seeded from a unique global\-scale clinical evaluation corpus\), shifting emergency\-care coverage from 15\.0% to 38\.7% and life\-threatening severity from 8\.6% to 31\.8%\. The pipeline enforces rigorous decontamination and uses gold\-label resampling of synthetic targets\.

•An automated, open\-ended clinical evaluation protocol\. We introduce Auto\-MOOVE, an LLM\-as\-a\-judge framework validated against 204 human raters to assess multidimensional clinical reasoning beyond standard MCQ metrics\.

•A family of fully open medical specialists\. We apply this recipe to five fully open base models spanning three model families\. Apertus\-70B\-MeditronFO improves from 47\.2% to 53\.8% on aggregate medical benchmarks, establishing a new fully open SoTA\. In open\-ended evaluations, Gemma\-3\-27B\-MeditronFO is preferred over MedGemma on Auto\-MOOVE and scores higher on HealthBench, suggesting that the pipeline improves dimensions not captured by MCQA alone\.

## 2 Related works

Open and fully open medical LLMs\. Closed\-data specialists including the MedPaLM family \(Singhal et al\., [2023](https://arxiv.org/html/2605.16215#bib.bib9), [2025](https://arxiv.org/html/2605.16215#bib.bib13)\) and Med\-Gemini \(Saab et al\., [2024](https://arxiv.org/html/2605.16215#bib.bib31)\) report strong medical benchmark performance but disclose neither training corpora nor adaptation pipelines\. In parallel, a growing body of work adapts open\-weight generalist LLMs into medical specialists\. HuatuoGPT\-II \(Chen et al\., [2023a](https://arxiv.org/html/2605.16215#bib.bib34)\) unifies pretraining and fine\-tuning into a single stage, while PMC\-LLaMA \(Wu et al\., [2024](https://arxiv.org/html/2605.16215#bib.bib19)\) and BioMistral \(Labrak et al\., [2024](https://arxiv.org/html/2605.16215#bib.bib20)\) continue pretraining on biomedical corpora before instruction\-tuning on aggregated QA benchmarks\. Meditron\-70B \(Chen et al\., [2023b](https://arxiv.org/html/2605.16215#bib.bib33); Sallinen et al\., [2025](https://arxiv.org/html/2605.16215#bib.bib22)\) scales this recipe with curated clinical guidelines\. Despite growing interest in openness, most medical LLMs remain only partially transparent: they often release weights, subsets of training data, or benchmark recipes, while withholding key components such as data provenance, filtering procedures, synthetic generation pipelines, or adaptation workflows\. Even open\-weight systems such as MedGemma \(Sellergren et al\., [2025](https://arxiv.org/html/2605.16215#bib.bib46)\) disclose neither their training data nor their synthetic\-generation pipelines\. A detailed comparison of openness dimensions across all models is provided in Appendix [L](https://arxiv.org/html/2605.16215#A12)\.

Risks of opaque adaptation pipelines\. Recent work demonstrates that medical LLMs are vulnerable both to targeted corruption of adaptation data and to broader behavioral drift induced by narrow\-domain fine\-tuning\. Alber et al\. \([2025](https://arxiv.org/html/2605.16215#bib.bib25)\) show fine\-tuning attacks that survive standard safety evaluations, while Betley et al\. \([2026](https://arxiv.org/html/2605.16215#bib.bib23)\) show that fine\-tuning on narrow corruptions in one domain can induce broadly misaligned deployment behavior\.

Benchmark contamination and decontamination\. Deng et al\. \([2024](https://arxiv.org/html/2605.16215#bib.bib16)\) demonstrate substantial overlap between widely used evaluation benchmarks \(MMLU \(Hendrycks et al\., [2020](https://arxiv.org/html/2605.16215#bib.bib6)\), TruthfulQA \(Lin et al\., [2022](https://arxiv.org/html/2605.16215#bib.bib5)\), HellaSwag \(Zellers et al\., [2019](https://arxiv.org/html/2605.16215#bib.bib4)\), WinoGrande \(Sakaguchi et al\., [2021](https://arxiv.org/html/2605.16215#bib.bib3)\), GSM8K \(Cobbe et al\., [2021](https://arxiv.org/html/2605.16215#bib.bib30)\), OpenBookQA \(Mihaylov et al\., [2018](https://arxiv.org/html/2605.16215#bib.bib2)\)\) and major pretraining corpora, using both retrieval\-based search and a Testset Slot Guessing protocol applicable to open\- and closed\-weight models\. Golchin and Surdeanu \([2023](https://arxiv.org/html/2605.16215#bib.bib18)\) complement this by showing that contamination can be detected post\-hoc through prompting strategies that elicit verbatim recall of evaluation instances\. Fully Open Meditron mitigates this risk through the two\-stage n\-gram and token\-alignment decontamination pipeline introduced in Apertus \(Apertus et al\., [2025](https://arxiv.org/html/2605.16215#bib.bib40)\), applied system\-wide against all evaluation references regardless of training\-component provenance\.

Clinician participation in the development of open medical AI\. Med\-PaLM \(Singhal et al\., [2023](https://arxiv.org/html/2605.16215#bib.bib9)\) introduced multi\-axis physician evaluation across factuality, reasoning, possible harm, and bias, and HealthBench \(Arora et al\., [2025](https://arxiv.org/html/2605.16215#bib.bib44)\) scaled this to 5,000 physician\-authored conversational rubrics\. Thirunavukarasu et al\. \([2023](https://arxiv.org/html/2605.16215#bib.bib24)\) similarly argue that clinical deployment requires evaluation paradigms grounded in workflows rather than exam\-style recall\. Fully Open Meditron incorporates clinician input at both the data\-curation and evaluation stages, with a four\-physician panel auditing synthetic\-generation prompts and Auto\-MOOVE built on expert\-written vignettes\.

Open\-ended evaluation at scale\. Recent work addresses the limitations of multiple\-choice evaluation through rubric\-based protocols: HealthBench \(Arora et al\., [2025](https://arxiv.org/html/2605.16215#bib.bib44)\) scores model responses against physician\-authored rubrics across thousands of conversational scenarios, and LiveClin \(Wang et al\., [2026](https://arxiv.org/html/2605.16215#bib.bib43)\) introduces an updated benchmark to mitigate contamination\. Pairwise preference evaluation has emerged as a complementary paradigm, both in domain\-specific settings such as MOOVE \(Sallinen et al\., [2025](https://arxiv.org/html/2605.16215#bib.bib22)\), which collects expert comparisons over clinical vignettes, and in platforms such as Chatbot Arena \(Zheng et al\., [2023](https://arxiv.org/html/2605.16215#bib.bib27)\), which aggregates large\-scale human pairwise judgments into model rankings\. These approaches highlight that relative comparisons are often more reliable than absolute scoring, but they rely heavily on human annotation, limiting scalability in specialised domains\. LLM judges offer a path to scalable pairwise evaluation: Zheng et al\. \([2023](https://arxiv.org/html/2605.16215#bib.bib27)\) establish the paradigm and show that GPT\-4 matches expert crowd preferences on open\-ended dialogue, while Thakur et al\. \([2025](https://arxiv.org/html/2605.16215#bib.bib28)\) show that Cohen’s κ is a more reliable validation metric than raw percent agreement, and Han et al\. \([2025](https://arxiv.org/html/2605.16215#bib.bib35)\) introduce a human\-likeness test that situates a judge’s κ within the distribution of per\-rater κ values from a human panel\. Auto\-MOOVE builds on the MOOVE platform by automating its comparison protocol with an LLM\-as\-a\-judge validated against human inter\-rater agreement\.
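For reference, Cohen's κ compares the observed agreement between two raters with the agreement expected by chance from their marginal label frequencies. The sketch below is a minimal illustration of that statistic only, not the validation code used in this work; the function name `cohens_kappa` and the toy labels are assumptions made for the example.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(rater_a)
    # Observed agreement: share of items on which both raters agree.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of marginal label frequencies, summed over labels.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Toy labels (hypothetical): pairwise verdicts from a judge and one human rater.
judge = ["A", "A", "B", "Tie", "B", "A"]
human = ["A", "B", "B", "Tie", "B", "A"]
kappa = cohens_kappa(judge, human)  # (5/6 - 13/36) / (1 - 13/36) = 17/23
```

Raw percent agreement here is 5/6, while κ is lower because part of that agreement would be expected by chance, which is the distinction drawn above.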

## 3 The Fully Open Meditron Corpus

![Refer to caption](https://arxiv.org/html/2605.16215v1/x2.png)

Figure 2: The Fully Open Meditron Corpus construction pipeline\. The corpus combines three source streams: \(1\) eight aggregated public medical QA datasets \(Curated QA\), \(2\) 46,469 clinical practice guidelines from 16 global institutions \(GUIDELINES\), and \(3\) expert\-written clinical vignettes from the MOOVE training split\. Clinician\-vetted prompts and sampled exemplars are passed to GPT\-OSS\-120B to generate three synthetic components: Synthetic Curated QA \(novel exam\-style QA pairs, stratified by question type\), Synthetic Guidelines QA \(guideline\-grounded QA\), and Synthetic MOOVE \(novel open\-ended clinical vignette prompts designed to elicit complex diagnostic reasoning\)\. Hallucinations are mitigated via gold\-label rejection sampling\. Source and synthetic components are merged into the final Fully Open Meditron Corpus\.

### 3\.1 Data Aggregation

The foundation of our Fully Open Meditron Corpus is an aggregation of eight public medical QA datasets\. To capture both exam\-style reasoning and open\-ended clinical interaction, we unify MedQA \(Jin et al\., [2021](https://arxiv.org/html/2605.16215#bib.bib1)\), MedMCQA \(Pal et al\., [2022](https://arxiv.org/html/2605.16215#bib.bib7)\), PubMedQA \(Jin et al\., [2019](https://arxiv.org/html/2605.16215#bib.bib8)\), MedExpQA \(Alonso et al\., [2024](https://arxiv.org/html/2605.16215#bib.bib14)\), HealthSearchQA \(Singhal et al\., [2023](https://arxiv.org/html/2605.16215#bib.bib9)\), and LiveQA \(Abacha et al\., [2017](https://arxiv.org/html/2605.16215#bib.bib10)\)\. We additionally include AfriMed\-QA v1/v2 \(Olatunji et al\., [2024](https://arxiv.org/html/2605.16215#bib.bib11)\) to partially mitigate the North American and European bias of standard medical benchmarks and expand representation of diverse clinical settings\. Only training splits are used\. All entries are normalized into a unified \(system, user, assistant\) conversational format incorporating step\-by\-step rationales, discarding items that cannot be unambiguously mapped\. This harmonization preserves diagnostic reasoning trajectories across heterogeneous source formats spanning MCQA, consumer\-health queries, and open\-ended specialist examinations, and aligns with principles described in the MedGemma technical report \(Sellergren et al\., [2025](https://arxiv.org/html/2605.16215#bib.bib46)\)\. Dataset sources and sizes are summarized in Table [4](https://arxiv.org/html/2605.16215#A2.T4)\. Because medical benchmarks widely contaminate pretraining corpora, we apply a system\-wide decontamination against all evaluation references, adapting the two\-stage n\-gram and token\-alignment pipeline from Apertus\. The reference set spans all benchmarks used in this work: MedQA, MedMCQA, PubMedQA, MedXpertQA, MMLU\-Pro, IFEval, and ARC\-Challenge\. Specific thresholds and implementation details are given in Appendix [K](https://arxiv.org/html/2605.16215#A11)\.
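To make the n\-gram stage of such a decontamination concrete, the following sketch drops any training item that shares word n\-grams with an evaluation reference. It is a simplified illustration under stated assumptions, not the released pipeline: the token\-alignment stage is not reproduced, the function names are hypothetical, and the values `n=8` and `max_overlap=0.0` are placeholders rather than the thresholds reported in the appendix.

```python
import re


def ngrams(text, n):
    """Set of lowercase word n-grams in `text`."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_reference_index(benchmark_items, n):
    """Union of n-grams over all evaluation references."""
    index = set()
    for item in benchmark_items:
        index |= ngrams(item, n)
    return index


def decontaminate(train_items, benchmark_items, n=8, max_overlap=0.0):
    """Keep training items whose share of n-grams found in the reference
    index does not exceed `max_overlap`. Both defaults are illustrative
    assumptions, not the thresholds used in the paper."""
    index = build_reference_index(benchmark_items, n)
    kept = []
    for item in train_items:
        grams = ngrams(item, n)
        overlap = len(grams & index) / len(grams) if grams else 0.0
        if overlap <= max_overlap:
            kept.append(item)
    return kept


# Toy data (hypothetical): one benchmark stem and two training items.
benchmark = ["A 45-year-old man presents with crushing chest pain radiating to the left arm."]
train = [
    "Question: A 45-year-old man presents with crushing chest pain radiating to the left arm. What next?",
    "A child has a fever and a rash after travel to a malaria-endemic region for two weeks.",
]
kept = decontaminate(train, benchmark)  # only the second item survives
```

Building one index over every evaluation reference mirrors the system\-wide framing above: each training component is checked against all benchmarks, regardless of where it came from.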

To characterize this curated corpus and identify coverage gaps, we use an LLM as a zero\-shot clinical metadata extractor over the first turn of each conversation\. Extracted attributes include geographic context, resource setting, level of care, clinical severity, medical specialty, question type, and patient demographics\. This analysis reveals that naïvely aggregating public benchmarks underrepresents clinically important dimensions such as low\-resource settings, pediatric and geriatric populations, and open\-ended diagnostic reasoning\. Identifying these structural gaps motivates our strategy for targeted coverage expansion via clinician\-vetted synthetic generation\.

This expanded coverage is evident in the synthetic MOOVE subset, which shifts towards emergency care settings \(from 15\.0% in the source data to 38\.7% in the synthetic\) and life\-threatening severities \(8\.6% to 31\.8%\)\. Similarly, the synthetic Curated QA data significantly alters specialty coverage, notably boosting cardiology \(3\.7% to 32\.7%\) and pulmonology \(2\.9% to 32\.2%\) relative to the source data\. It also shifts the age demographic toward adults \(from 35\.8% to 84\.6%\) and skews clinical severity away from routine cases \(dropping from 45\.6% to 11\.7%\) in favor of urgent \(28\.0% to 67\.8%\) and life\-threatening \(2\.2% to 16\.3%\) scenarios\. Conversely, the Guidelines dataset maintains a much more stable distribution between its source and synthetic components, consistently emphasizing routine \(42\.4% and 48\.2%\) and urgent \(39\.5% and 41\.7%\) severities within primary and tertiary care levels\. Full annotation results are in Appendix [B](https://arxiv.org/html/2605.16215#A2)\.

![Refer to caption](https://arxiv.org/html/2605.16215v1/figures/Fig3_pie_annie.png)

Figure 3: Overview of Fully Open Meditron datasets by record count\.

### 3\.2 Clinician\-Vetted Synthetic Coverage Expansion

To address the identified distributional gaps, we expand the corpus using GPT\-OSS\-120B to generate targeted synthetic data\. Before scaling generation, a panel of four physicians validated the few\-shot generation prompts and audited a representative sample of outputs\. The panel comprised clinicians with expertise spanning global health, humanitarian response, infectious disease, emergency medicine, primary care, pediatrics and surgery, with clinical experience across Europe, the United States and multiple African settings\. For each prompt template, three sampled QA pairs were independently reviewed, with disagreements resolved via panel discussion \(prompts in Appendix[J](https://arxiv.org/html/2605.16215#A10)\)\.

This review produced four structural improvements to our generation pipeline: \(1\) refining constraints on "controversial" and "outdated" content to preserve standard\-of\-care practices in low\-resource settings; \(2\) requiring explicit disease progression and geographic context for epidemiological realism; \(3\) decoupling stems \(which may contain realistic distractors\) from answers \(which must remain strictly evidence\-based\); and \(4\) excluding low\-quality evidence sources \(e\.g\. WikiDoc\) and overly US\-centric phrasing\. Following this vetting, we generate three distinct synthetic components:

•Synthetic Curated QA: Novel exam\-style QA pairs seeded from our curated benchmark pool, stratified by question type, incorporating continuous answer\-position monitoring to mitigate label bias\.

•Guidelines QA: Question\-Answer pairs grounded in 46,469 clinical practice guidelines across 16 global institutions\.

•Synthetic MOOVE: Open\-ended clinical vignettes seeded from the MOOVE training split to elicit complex diagnostic reasoning\.
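The answer\-position monitoring mentioned for Synthetic Curated QA can be pictured as tracking how often each option slot holds the correct answer and steering new items toward under\-used slots. The sketch below is one plausible reading of that idea, not the released implementation; the class `AnswerPositionMonitor`, the helper `place_answer`, and the tolerance of 0\.10 are all assumptions.

```python
from collections import Counter


class AnswerPositionMonitor:
    """Tracks which option index holds the correct answer across generated items."""

    def __init__(self, num_options=4, tolerance=0.10):
        self.num_options = num_options
        self.tolerance = tolerance  # allowed deviation from a uniform share (assumed value)
        self.counts = Counter()

    def record(self, correct_index):
        if not 0 <= correct_index < self.num_options:
            raise ValueError("correct_index out of range")
        self.counts[correct_index] += 1

    def shares(self):
        total = sum(self.counts.values())
        return [self.counts[i] / total if total else 0.0 for i in range(self.num_options)]

    def is_skewed(self):
        """True if any position's share deviates from uniform by more than the tolerance."""
        if not self.counts:
            return False
        uniform = 1.0 / self.num_options
        return any(abs(share - uniform) > self.tolerance for share in self.shares())

    def least_used_position(self):
        return min(range(self.num_options), key=lambda i: self.counts[i])


def place_answer(options, correct_index, target_index):
    """Swap the correct option into `target_index`; return (options, new_correct_index)."""
    reordered = list(options)
    reordered[correct_index], reordered[target_index] = (
        reordered[target_index],
        reordered[correct_index],
    )
    return reordered, target_index


# Toy run (hypothetical): three of four answers landed in position 0.
monitor = AnswerPositionMonitor()
for idx in [0, 0, 0, 1]:
    monitor.record(idx)
skewed = monitor.is_skewed()
target = monitor.least_used_position()
options, new_index = place_answer(["w", "x", "y", "z"], 0, target)
```

Swapping the correct option into the least\-used slot is only one way to rebalance; reshuffling all options would serve the same purpose.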

Synthetic targets are generated using GPT\-OSS\-120B \(Agarwal et al\., [2025](https://arxiv.org/html/2605.16215#bib.bib42)\), selected as the strongest open\-source model on the medical training distribution \(ablation in Table [12](https://arxiv.org/html/2605.16215#A7.T12)\)\. To mitigate hallucinations, labeled examples are rejection\-sampled up to eight times at temperature 0\.7 until the generated answer matches the gold label under dataset\-specific regex extraction\.
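The rejection\-sampling loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the released code: `generate` stands in for a call to the teacher model, the answer regex is a hypothetical example of a dataset\-specific extractor, and returning `None` after eight failed attempts is an assumption, since the text does not say how exhausted examples are handled.

```python
import re


def extract_choice(text, pattern=r"Answer:\s*\(?([A-E])\b\)?"):
    """Return the last answer letter matched by `pattern`, or None.
    The pattern is a hypothetical stand-in for a dataset-specific regex."""
    matches = re.findall(pattern, text)
    return matches[-1] if matches else None


def rejection_sample(generate, prompt, gold_label, max_attempts=8,
                     temperature=0.7, extract=extract_choice):
    """Sample from `generate` until the extracted answer equals `gold_label`.

    Returns (accepted_text, attempts_used), or (None, max_attempts) if no
    sample matched (this fallback is an assumption of the sketch)."""
    for attempt in range(1, max_attempts + 1):
        candidate = generate(prompt, temperature)
        if extract(candidate) == gold_label:
            return candidate, attempt
    return None, max_attempts


# Stub teacher (hypothetical): replays canned generations instead of calling a model.
replies = iter(["... Answer: B", "... Answer: (A)", "... Answer: C"])


def stub_teacher(prompt, temperature):
    return next(replies)


accepted, attempts = rejection_sample(stub_teacher, "Which drug is first-line?", gold_label="C")
```

Only the final answer is checked against the gold label, so the accepted rationale is kept as generated; the gold label acts as a filter on the reasoning rather than a guarantee that every step is correct.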

## 4 Experimental Setup

### 4\.1 Base Models & Baselines

We use our corpus for supervised fine\-tuning of five fully open base models: Apertus\-70B/8B\-Instruct \(Apertus et al\., [2025](https://arxiv.org/html/2605.16215#bib.bib40)\), OLMo\-2\-32B\-SFT \(OLMo et al\., [2024](https://arxiv.org/html/2605.16215#bib.bib39)\), and EuroLLM\-22B/9B\-Instruct \(Ramos et al\., [2026](https://arxiv.org/html/2605.16215#bib.bib37)\)\. We also fine\-tune one open\-weight control, Gemma\-3\-27B\-IT \(Team et al\., [2025](https://arxiv.org/html/2605.16215#bib.bib47)\), to enable a controlled comparison against MedGemma\. For each base, we report the unmodified instruction\-tuned variant and its MeditronFO fine\-tune under identical decoding settings and prompt templates\. Training and code release details are in Appendix [I](https://arxiv.org/html/2605.16215#A9)\. In addition to base\-vs\-fine\-tune comparisons, we report results for three external medical LLMs: MedGemma\-27B \(Sellergren et al\., [2025](https://arxiv.org/html/2605.16215#bib.bib46)\), the strongest open\-access medical model trained on undisclosed proprietary data, Llama\-3\.1\-70B\-Meditron \(Sallinen et al\., [2025](https://arxiv.org/html/2605.16215#bib.bib22)\)\. For reference on the upper bound of the open\-access ecosystem, we also report GPT\-OSS\-120B \(Agarwal et al\., [2025](https://arxiv.org/html/2605.16215#bib.bib42)\), the model used for our synthetic data generation, and Qwen3\-30B\-A3B\-Instruct\-2507 \(Yang et al\., [2025](https://arxiv.org/html/2605.16215#bib.bib45)\)\.

### 4\.2 Training and Evaluation

We adapt these base models via supervised fine\-tuning on the Fully Open Meditron corpus while preserving each model’s native instruction\-tuning format\. Full training infrastructure, optimizer configurations, and per\-model hyperparameters are detailed in Appendix [I](https://arxiv.org/html/2605.16215#A9)\. We evaluate medical knowledge on the test splits of MedQA \(Jin et al\., [2021](https://arxiv.org/html/2605.16215#bib.bib1)\), MedMCQA \(Pal et al\., [2022](https://arxiv.org/html/2605.16215#bib.bib7)\), and PubMedQA \(Jin et al\., [2019](https://arxiv.org/html/2605.16215#bib.bib8)\), using the held\-out MedXpertQA \(Zuo et al\., [2025](https://arxiv.org/html/2605.16215#bib.bib12)\) as an out\-of\-distribution check\. As a smoke test that guards against catastrophic forgetting, we evaluate on MMLU\-Pro \(Wang et al\., [2024](https://arxiv.org/html/2605.16215#bib.bib15)\), IFEval \(Zhou et al\., [2023](https://arxiv.org/html/2605.16215#bib.bib32)\), and ARC\-Challenge \(Clark et al\., [2018](https://arxiv.org/html/2605.16215#bib.bib41)\)\. All evaluations use temperature 0\.0; we report accuracy and unweighted averages\.

### 4\.3 Open\-Ended Clinical Evaluation

Standard multiple\-choice benchmarks reward structured recall but fail to capture the nuances of open\-ended clinical interaction, such as contextual awareness, communication, alignment with guidelines, and harmlessness\. We evaluate these dimensions along two axes\. First, we use the HealthBench evaluation \(Arora et al\., [2025](https://arxiv.org/html/2605.16215#bib.bib44)\), with Qwen3\-235B\-A22B\-Instruct \(Yang et al\., [2025](https://arxiv.org/html/2605.16215#bib.bib45)\) as an LLM judge to assess open\-ended clinical reasoning against structured, physician\-authored rubrics\. Second, we apply Auto\-MOOVE, an automated LLM\-as\-a\-judge protocol we developed over clinical prompts drawn from the MOOVE dataset\.

For each Auto\-MOOVE prompt, two models generate responses, which are passed to our judge models for comparative evaluation\. We utilize Qwen3\-235B\-A22B\-Instruct as our primary judge to evaluate the responses and declare an overall winner \(Model 1, Model 2, or Tie\) and to assign Likert scores from 1 \(Poor\) to 5 \(Excellent\) across nine clinical criteria: question comprehension, logical reasoning, relevance and completeness, harmlessness, fairness, contextual awareness, communication, clarity, and alignment with guidelines\. Random answer\-order swapping is applied at inference to mitigate positional bias, with positions re\-mapped during analysis\.
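The order\-swapping and re\-mapping step can be sketched as below. This is a minimal illustration, assuming a judge that returns one of the three verdict strings named above; `judge_pair`, `tally`, and the length\-based stub judge are hypothetical and are not part of the Auto\-MOOVE release.

```python
import random


def judge_pair(judge, prompt, response_a, response_b, rng):
    """Show the two responses to `judge` in random order and map the verdict
    back to the underlying models ("A", "B", or "Tie")."""
    swapped = rng.random() < 0.5
    first, second = (response_b, response_a) if swapped else (response_a, response_b)
    verdict = judge(prompt, first, second)
    if verdict == "Tie":
        return "Tie"
    if verdict not in ("Model 1", "Model 2"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    first_won = verdict == "Model 1"
    # When swapped, the first slot ("Model 1") held response_b.
    return "B" if first_won == swapped else "A"


def tally(outcomes):
    """Share of comparisons won by each model, plus ties."""
    n = len(outcomes)
    return {key: outcomes.count(key) / n for key in ("A", "B", "Tie")}


# Stub judge (hypothetical): prefers the longer response, ignoring the prompt.
def length_judge(prompt, first, second):
    if len(first) == len(second):
        return "Tie"
    return "Model 1" if len(first) > len(second) else "Model 2"


rng = random.Random(0)
outcomes = [
    judge_pair(length_judge, "vignette", "a longer answer", "short", rng)
    for _ in range(20)
]
rates = tally(outcomes)
```

Because the stub judge has no positional preference, every comparison maps back to model A whether or not the order was swapped; a position\-biased judge would instead see its bias spread evenly across both models.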

We validate the judge against existing human annotations from MOOVE before using it to evaluate models\. Across 204 human raters, the judge’s agreement with the panel falls within standard margins of error\. Full validation methodology and per\-criterion analysis are in Appendix [H](https://arxiv.org/html/2605.16215#A8)\. To assess sensitivity to judge choice, we additionally report results with GPT\-OSS\-120B as judge in Section [5](https://arxiv.org/html/2605.16215#S5) \(Table [12](https://arxiv.org/html/2605.16215#A7.T12)\); the qualitative ordering of MeditronFO vs base is preserved across judges for all bases except EuroLLM\-22B, where the effect size is smallest\. For an overview of the Fully Open Meditron evaluation datasets, refer to Appendix [C](https://arxiv.org/html/2605.16215#A3)\.

## 5 Results

### 5\.1 Results on Medical QA Benchmarks

We report comprehensive benchmarking results on established MCQA tasks as well as HealthBench in Table [2](https://arxiv.org/html/2605.16215#S5.T2)\. Fine\-tuning on our Fully Open Meditron corpus consistently improves every base model\. In particular, Apertus\-70B\-MeditronFO sets a new state of the art among fully open medical LLMs\.

Table 2: Medical benchmark accuracy \(%\)\. Every MeditronFO variant improves over its base; gains range from \+0\.66 \(EuroLLM\-22B\) to \+12\.80 \(Apertus\-8B\), with smaller bases benefiting most\. Apertus\-70B\-MeditronFO is the strongest fully open model at 53\.77 average, narrowing but not closing the gap to MedGemma\-27B \(60\.67\)\. Held\-out MedXpertQA tracks the same ordering, indicating gains are not driven by contamination\. Best within partition bolded; best fully open underlined\. For HealthBench, we use the full benchmark with Qwen3\-235B\-A22B as the judge\. For a detailed table that includes older closed and open\-access reference models, see Appendix [D](https://arxiv.org/html/2605.16215#A4)\.

### 5\.2 Results on Open\-Ended Clinical Evaluation

Figure [4](https://arxiv.org/html/2605.16215#S5.F4) reports Auto\-MOOVE pairwise comparisons between each base model and its Fully Open Meditron fine\-tune\. Every \*\-MeditronFO variant is preferred over its corresponding base, with adjusted win rates ranging from 67\.2% \(EuroLLM\-22B\) to 92% \(Apertus\-8B\), again with the largest gains observed for smaller bases\. Figure [5](https://arxiv.org/html/2605.16215#S5.F5) complements these win\-rate results by showing the per\-criterion Likert profiles: improvements are not confined to a single axis, but are consistent across clinically relevant dimensions such as question comprehension, logical reasoning, relevance and completeness, contextual awareness, communication, clarity, and alignment with guidelines\. EuroLLM\-22B shows the smallest margins, consistent with its weaker pairwise preference signal, whereas Apertus\-70B, OLMo\-2\-32B, and Gemma\-3\-27B exhibit broader gains across criteria\. A detailed table of Auto\-MOOVE pairwise results is given in Appendix [E](https://arxiv.org/html/2605.16215#A5)\.

These gains also extend beyond base\-versus\-finetune comparisons\. In cross\-model evaluations, Gemma\-3\-27B\-MeditronFO is preferred over MedGemma in 58\.6% of comparisons, indicating that the improvements are not limited to recovering weaknesses of the underlying bases\. This conclusion is further supported by HealthBench \(Table [2](https://arxiv.org/html/2605.16215#S5.T2)\), an independent physician\-rubric benchmark on which Gemma\-3\-27B\-MeditronFO scores 58\.02 compared with 55\.92 for MedGemma \(\+2\.1\)\. The agreement between Auto\-MOOVE and HealthBench, despite differing judges, prompts, and scoring protocols, argues against the observed gains being a judge\-specific or dataset\-distribution artifact\.
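To gauge how far a 58.6% preference sits from an even split, one can attach a binomial confidence interval to it. The sketch below computes a Wilson score interval under three assumptions that this section does not confirm: that the 12,602 comparisons per pair reported in the Figure 4 caption apply to this pair, that 58.6% is a proportion over all of them, and that the comparisons are independent.

```python
import math


def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    if n <= 0 or not 0.0 <= p_hat <= 1.0:
        raise ValueError("need n > 0 and 0 <= p_hat <= 1")
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Assumed inputs: preference share from the text, N from the Figure 4 caption.
low, high = wilson_interval(0.586, 12602)
print(f"95% Wilson interval: [{low:.4f}, {high:.4f}]")
```

Under these assumptions the interval lies entirely above 0.5. Prompts that share a clinical vignette would violate independence and widen the true interval.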

![Refer to caption](https://arxiv.org/html/2605.16215v1/x3.png)

![Refer to caption](https://arxiv.org/html/2605.16215v1/x4.png)

Figure 4: Auto\-MOOVE pairwise preference results\. For each prompt drawn from the MOOVE evaluation split, two model responses are evaluated by Qwen3\-235B\-A22B, which assigns a winner \(Model 1, Model 2, or Tie\)\. Bars show the share of prompts on which each model wins, ties, or loses \(N = 12,602 comparisons per pair\)\. Judge agreement with a 204\-rater human panel was validated prior to use; see App\. [H](https://arxiv.org/html/2605.16215#A8)\. Left: each Fully Open Meditron model versus its corresponding base\. Right: Gemma\-3\-27B\-MeditronFO versus MedGemma\-27B\.

![Refer to caption](https://arxiv.org/html/2605.16215v1/x5.png)

![Refer to caption](https://arxiv.org/html/2605.16215v1/x6.png)

![Refer to caption](https://arxiv.org/html/2605.16215v1/x7.png)

![Refer to caption](https://arxiv.org/html/2605.16215v1/x8.png)

Figure 5: Per\-criterion Auto\-MOOVE Likert profiles for Fully Open Meditron models versus their corresponding bases\. Panels show \(top\-left\) Gemma 27B, \(top\-right\) Apertus 70B, \(bottom\-left\) EuroLLM 22B, and \(bottom\-right\) OLMo 32B\. Axes show mean Likert score \(1–5\) across the nine evaluation criteria: question comprehension, logical reasoning, relevance and completeness, harmlessness, fairness, contextual awareness, communication, clarity, and alignment with guidelines\. Scores are averaged over the same N = 12,602 prompts as Figure [4](https://arxiv.org/html/2605.16215#S5.F4) with position\-swap debiasing applied, and Qwen3\-235B\-A22B acts as the judge\. Larger enclosed area indicates broader improvement across criteria\.
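The per-criterion profiles and the aggregate Likert difference can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names, the dictionary layout, and the choice to debias by averaging the two presentation orders are assumptions, and the scores are toy values rather than results from the paper.

```python
from statistics import mean

CRITERIA = [
    "question comprehension", "logical reasoning", "relevance and completeness",
    "harmlessness", "fairness", "contextual awareness", "communication",
    "clarity", "alignment with guidelines",
]


def position_swap_average(score_first, score_second):
    """Assumed debiasing: average the scores a response receives when shown
    first and when shown second to the judge."""
    return (score_first + score_second) / 2.0


def delta_likert(base_scores, finetuned_scores):
    """Per-criterion difference (finetuned minus base) and its mean over the
    nine criteria. Inputs map criterion name -> mean Likert score (1-5)."""
    per_criterion = {c: finetuned_scores[c] - base_scores[c] for c in CRITERIA}
    return per_criterion, mean(per_criterion.values())


# Toy scores: +0.5 on eight criteria, no change on harmlessness.
base = {c: 3.5 for c in CRITERIA}
finetuned = {c: 4.0 for c in CRITERIA}
finetuned["harmlessness"] = position_swap_average(3.25, 3.75)  # 3.5

per_criterion, overall = delta_likert(base, finetuned)
print(per_criterion)
print(round(overall, 3))
```

Keeping the per-criterion dictionary alongside the mean makes it easy to see when an aggregate gain hides a flat or negative safety-relevant criterion.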
### 5\.3 Ablations

Corpus\-component ablations\. The corpus\-component ablations in Table [3](https://arxiv.org/html/2605.16215#S5.T3) are the most informative because they identify which parts of the training mixture drive gains on structured medical QA versus open\-ended clinical evaluation\. The ablations show that there is no single universally optimal recipe: exam\-style accuracy, open\-ended clinical preference, and general instruction\-following pull the training mixture in different directions\. The ablation of the Guidelines QA component clearly illustrates this tradeoff\. Its removal slightly improves aggregate MCQA accuracy \(Med Avg rises from 53\.77 to 54\.34\) while leaving open\-ended clinical preference essentially unchanged, consistent with guideline\-derived supervision contributing primarily on the margin for exam\-style items\. Conversely, removing Curated QA produces the largest degradation on both Auto\-MOOVE \(79\.6 drops to 73\.4\) and ΔLikert \(0\.40 to 0\.27\), indicating that exam\-style supervision contributes meaningfully to open\-ended clinical quality as well\. Removing Synthetic MOOVE also reduces Auto\-MOOVE \(to 75\.5\) and ΔLikert \(to 0\.34\), consistent with its design: vignette\-style prompts trade strict exam\-format alignment for broader distributional coverage of open\-ended diagnostic interaction, i\.e\., the primary axis measured by Auto\-MOOVE\.
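The changes quoted in this paragraph can be laid side by side with a few lines of arithmetic. The snippet uses only the figures stated in the text; values the paragraph does not report are left as `None` rather than filled in, and the dictionary keys are illustrative names.

```python
# Values stated in the paragraph above; None marks figures not given there.
full = {"med_avg": 53.77, "auto_moove": 79.6, "delta_likert": 0.40}
ablations = {
    "no Guidelines QA": {"med_avg": 54.34, "auto_moove": None, "delta_likert": None},
    "no Curated QA": {"med_avg": None, "auto_moove": 73.4, "delta_likert": 0.27},
    "no Synthetic MOOVE": {"med_avg": None, "auto_moove": 75.5, "delta_likert": 0.34},
}

# Change relative to the full mixture (ablated minus full), where reported.
deltas = {
    name: {
        metric: None if value is None else round(value - full[metric], 2)
        for metric, value in row.items()
    }
    for name, row in ablations.items()
}

for name, row in deltas.items():
    print(name, row)
```

Read this way, dropping Guidelines QA moves Med Avg by +0.57, while dropping Curated QA costs 6.2 points of Auto-MOOVE win rate and 0.13 of ΔLikert, and dropping Synthetic MOOVE costs 4.1 and 0.06 respectively.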

Table 3: Corpus\-component ablations using Apertus\-70B as a base\. All runs use identical training settings; only the indicated corpus component is removed\. Medical benchmark columns and Medical Avg follow the evaluation protocol of Table 1\. “Auto\-MOOVE” is the adjusted win rate \(%\) of the ablated model against the Apertus\-70B\-Instruct base under the Qwen3\-235B\-A22B judge, and “ΔLikert” denotes the mean per\-criterion Likert difference averaged across the nine evaluation dimensions\. Best values per column are bolded\. For HealthBench, we use the full benchmark with Qwen3\-235B\-A22B as the judge\. Extended ablations are in App\. [G](https://arxiv.org/html/2605.16215#A7)\.

We include two additional ablations in Appendix [G](https://arxiv.org/html/2605.16215#A7)\. First, we investigate the retention of general\-purpose capabilities after fine\-tuning on our Fully Open Meditron Corpus, finding only mild drops, in line with other domain\-specific fine\-tuning recipes\. Second, we analyze a range of judge models for Auto\-MOOVE, finding a consistent preference for the MeditronFO variants over their corresponding base models across judges\.

## 6 Discussion

Evaluating clinical LLMs requires moving beyond traditional multiple\-choice question answering \(MCQA\) to assess genuine clinical interaction\. After finetuning on the Fully Open Meditron Corpus, Apertus\-70B\-MeditronFO establishes a new state of the art among fully open medical models\. It achieves strong performance on both MCQA benchmarks and LLM\-as\-a\-judge evaluations including Auto\-MOOVE and HealthBench\.

Importantly, these gains generalize across model families: every finetuned model improves over its base in both structured and open\-ended evaluation\. This supports the central premise of the fully open paradigm: clinically competitive medical specialization can be achieved using reproducible, auditable pipelines rather than opaque adaptation procedures\. The resulting corpus provides a reusable foundation for training and evaluating future fully open medical models\. Notably, Gemma\-3\-27B\-MeditronFO surpasses MedGemma\-27B on both HealthBench and Auto\-MOOVE despite being derived from a fully open pipeline\.

Limitations and future directions\. Several specific limitations warrant attention: Auto\-MOOVE judge agreement falls below the median human rater and is systematically less discriminating than clinicians on safety\-relevant criteria such as harmlessness and fairness, making it unsuitable as a deployment\-readiness signal for these dimensions; our decontamination is syntactic rather than semantic, leaving open the possibility that a teacher paraphrases or generalizes evaluation content when seeded from the corresponding training split; instruction\-following degrades on some bases, suggesting the uniform 10% Tülu replay should be tuned per base; synthetic data accounts for roughly 64% of the corpus while clinician auditing covered only three sampled QA pairs per generation prompt template, bounding systematic but not item\-level errors; and a single teacher \(GPT\-OSS\-120B\) and single judge introduce model\-specific stylistic and reasoning biases that our ablations probe but do not eliminate\. Finally, this work focuses on supervised fine\-tuning of off\-the\-shelf bases, and incorporating preference optimization, continued pretraining on the GUIDELINES corpus, or end\-to\-end open\-provenance teachers might present opportunities to further enhance the auditability and clinical capabilities of fully open medical LLMs\.
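The gap between syntactic and semantic decontamination can be made concrete with a small sketch. The word n-gram overlap check below is one common form of syntactic decontamination; it is an assumption for illustration and not necessarily the filter used in this pipeline. The function names, the n-gram length, and the example texts are all hypothetical.

```python
import re


def ngrams(text, n):
    """Set of lowercase word n-grams in `text` (punctuation ignored)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(train_text, eval_texts, n=5):
    """Flag a training item that shares any word n-gram with an eval item."""
    train_grams = ngrams(train_text, n)
    return any(train_grams & ngrams(e, n) for e in eval_texts)


# Hypothetical evaluation item and two candidate training items.
eval_item = ("A 45-year-old man presents with crushing chest pain "
             "radiating to the left arm")
verbatim = ("Question: A 45-year-old man presents with crushing chest pain "
            "radiating to the left arm. What is the diagnosis?")
paraphrase = ("A man aged 45 arrives complaining of severe chest pain "
              "that spreads to his left arm")

print(is_contaminated(verbatim, [eval_item]))    # verbatim copy is caught
print(is_contaminated(paraphrase, [eval_item]))  # paraphrase slips through
```

The paraphrase describes the same case yet shares no five-word span with the evaluation item, which is the failure mode this limitation describes. Catching it would require a semantic check, for example one based on embedding similarity.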

### 6\.1 Broader impact

Fully Open Meditron is intended to advance the auditability of medical AI by making the full training pipeline inspectable\. The accompanying risks are those general to medical LLMs: confidently incorrect outputs, propagation of training\-data biases, and misuse as a substitute for clinical judgment\. The fact that the corpus is open is a partial mitigation \(it enables third\-party auditing and red\-teaming\) and a partial amplifier \(the recipe is reproducible by parties who may not perform equivalent audits\)\. We release the corpus under a research\-use license and recommend that downstream practitioners conduct domain\-specific safety evaluation before any deployment\-adjacent use\.

## Acknowledgments and Disclosure of Funding

This work was supported under project ID \#27 as part of the Swiss AI Initiative, through a grant from the ETH Domain and computational resources provided by the Swiss National Supercomputing Centre \(CSCS\) under the Alps infrastructure\.

We thank the physician review panel within the LiGHT laboratory for their clinical auditing, methodological review, and validation of the synthetic generation and evaluation pipelines\. We additionally acknowledge the many physicians and clinical experts who contributed to the MOOVE initiative through expert review, pairwise evaluation, benchmarking, and clinical vignette development across diverse international settings\.

## References

- \[1\]A\. B\. Abacha, E\. Agichtein, Y\. Pinter, and D\. Demner\-Fushman\(2017\)Overview of the medical question answering task at trec 2017 liveqa\.\.InTREC,pp\. 1–12\.Cited by:[§3\.1](https://arxiv.org/html/2605.16215#S3.SS1.p1.1)\.
- \[2\]S\. Agarwal, L\. Ahmad, J\. Ai, S\. Altman, A\. Applebaum, E\. Arbus, R\. K\. Arora, Y\. Bai, B\. Baker, H\. Bao,et al\.\(2025\)Gpt\-oss\-120b & gpt\-oss\-20b model card\.arXiv preprint arXiv:2508\.10925\.Cited by:[§3\.2](https://arxiv.org/html/2605.16215#S3.SS2.p6.1),[§4\.1](https://arxiv.org/html/2605.16215#S4.SS1.p1.1)\.
- \[3\]D\. A\. Alber, Z\. Yang, A\. Alyakin, E\. Yang, S\. Rai, A\. A\. Valliani, J\. Zhang, G\. R\. Rosenbaum, A\. K\. Amend\-Thomas, D\. B\. Kurland,et al\.\(2025\)Medical large language models are vulnerable to data\-poisoning attacks\.Nature Medicine31\(2\),pp\. 618–626\.Cited by:[§1](https://arxiv.org/html/2605.16215#S1.p1.1),[§2](https://arxiv.org/html/2605.16215#S2.p2.1)\.
- \[4\]I\. Alonso, M\. Oronoz, and R\. Agerri\(2024\)Medexpqa: multilingual benchmarking of large language models for medical question answering\.Artificial intelligence in medicine155,pp\. 102938\.Cited by:[§3\.1](https://arxiv.org/html/2605.16215#S3.SS1.p1.1)\.
- \[5\]P\. Apertus, A\. Hernández\-Cano, A\. Hägele, A\. H\. Huang, A\. Romanou, A\. Solergibert, B\. Pasztor, B\. Messmer, D\. Garbaya, E\. F\. Ďurech,et al\.\(2025\)Apertus: democratizing open and compliant llms for global language environments\.arXiv preprint arXiv:2509\.14233\.Cited by:[Appendix K](https://arxiv.org/html/2605.16215#A11.p1.1),[§2](https://arxiv.org/html/2605.16215#S2.p3.1),[§4\.1](https://arxiv.org/html/2605.16215#S4.SS1.p1.1)\.
- \[6\]R\. K\. Arora, J\. Wei, R\. S\. Hicks, P\. Bowman, J\. Quiñonero\-Candela, F\. Tsimpourlas, M\. Sharman, M\. Shah, A\. Vallone, A\. Beutel,et al\.\(2025\)Healthbench: evaluating large language models towards improved human health\.arXiv preprint arXiv:2505\.08775\.Cited by:[§2](https://arxiv.org/html/2605.16215#S2.p4.1),[§2](https://arxiv.org/html/2605.16215#S2.p5.3),[§4\.3](https://arxiv.org/html/2605.16215#S4.SS3.p1.1)\.
- \[7\]J\. Betley, N\. Warncke, A\. Sztyber\-Betley, D\. Tan, X\. Bao, M\. Soto, M\. Srivastava, N\. Labenz, and O\. Evans\(2026\)Training large language models on narrow tasks can lead to broad misalignment\.Nature649\(8097\),pp\. 584–589\.Cited by:[§1](https://arxiv.org/html/2605.16215#S1.p1.1),[§2](https://arxiv.org/html/2605.16215#S2.p2.1)\.
- \[8\]J\. Chen, X\. Wang, K\. Ji, A\. Gao, F\. Jiang, S\. Chen, H\. Zhang, D\. Song, W\. Xie, C\. Kong,et al\.\(2023\)Huatuogpt\-ii, one\-stage training for medical adaption of llms\.arXiv preprint arXiv:2311\.09774\.Cited by:[§2](https://arxiv.org/html/2605.16215#S2.p1.1)\.
- \[9\]Z\. Chen, A\. H\. Cano, A\. Romanou, A\. Bonnet, K\. Matoba, F\. Salvi, M\. Pagliardini, S\. Fan, A\. Köpf, A\. Mohtashami,et al\.\(2023\)Meditron\-70b: scaling medical pretraining for large language models\.arXiv preprint arXiv:2311\.16079\.Cited by:[§1](https://arxiv.org/html/2605.16215#S1.p1.1),[§2](https://arxiv.org/html/2605.16215#S2.p1.1)\.
- \[10\]P\. Clark, I\. Cowhey, O\. Etzioni, T\. Khot, A\. Sabharwal, C\. Schoenick, and O\. Tafjord\(2018\)Think you have solved question answering? try arc, the ai2 reasoning challenge\.arXiv preprint arXiv:1803\.05457\.Cited by:[§4\.2](https://arxiv.org/html/2605.16215#S4.SS2.p1.1)\.
- \[11\]K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano,et al\.\(2021\)Training verifiers to solve math word problems\.arXiv preprint arXiv:2110\.14168\.Cited by:[§2](https://arxiv.org/html/2605.16215#S2.p3.1)\.
- \[12\]C\. Deng, Y\. Zhao, X\. Tang, M\. Gerstein, and A\. Cohan\(2024\)Investigating data contamination in modern benchmarks for large language models\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),pp\. 8706–8719\.Cited by:[§2](https://arxiv.org/html/2605.16215#S2.p3.1)\.
- \[13\]F\. J\. Dorfner, A\. Dada, F\. Busch, M\. R\. Makowski, T\. Han, D\. Truhn, J\. Kleesiek, M\. Sushil, L\. C\. Adams, and K\. K\. Bressem\(2025\)Evaluating the effectiveness of biomedical fine\-tuning for large language models on clinical tasks\.Journal of the American Medical Informatics Association32\(6\),pp\. 1015–1024\.Cited by:[§1](https://arxiv.org/html/2605.16215#S1.p3.1)\.
- \[14\]S\. Golchin and M\. Surdeanu\(2023\)Time travel in llms: tracing data contamination in large language models\.arXiv preprint arXiv:2308\.08493\.Cited by:[§2](https://arxiv.org/html/2605.16215#S2.p3.1)\.
- \[15\]S\. Han, G\. T\. Junior, T\. Balough, and W\. Zhou\(2025\)Judge’s verdict: a comprehensive analysis of llm judge capability through human agreement\.arXiv preprint arXiv:2510\.09738\.Cited by:[§2](https://arxiv.org/html/2605.16215#S2.p5.3)\.
- \[16\]D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt\(2020\)Measuring massive multitask language understanding\.arXiv preprint arXiv:2009\.03300\.Cited by:[§2](https://arxiv.org/html/2605.16215#S2.p3.1)\.
- \[17\]D\. Jin, E\. Pan, N\. Oufattole, W\. Weng, H\. Fang, and P\. Szolovits\(2021\)What disease does this patient have? a large\-scale open domain question answering dataset from medical exams\.Applied Sciences11\(14\),pp\. 6421\.Cited by:[§3\.1](https://arxiv.org/html/2605.16215#S3.SS1.p1.1),[§4\.2](https://arxiv.org/html/2605.16215#S4.SS2.p1.1)\.
- \[18\]Q\. Jin, B\. Dhingra, Z\. Liu, W\. Cohen, and X\. Lu\(2019\)Pubmedqa: a dataset for biomedical research question answering\.InProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing \(EMNLP\-IJCNLP\),pp\. 2567–2577\.Cited by:[§3\.1](https://arxiv.org/html/2605.16215#S3.SS1.p1.1),[§4\.2](https://arxiv.org/html/2605.16215#S4.SS2.p1.1)\.
- \[19\]Y\. Labrak, A\. Bazoge, E\. Morin, P\. Gourraud, M\. Rouvier, and R\. Dufour\(2024\)Biomistral: a collection of open\-source pretrained large language models for medical domains\.InFindings of the association for computational linguistics: acl 2024,pp\. 5848–5864\.Cited by:[§1](https://arxiv.org/html/2605.16215#S1.p1.1),[§2](https://arxiv.org/html/2605.16215#S2.p1.1)\.
- \[20\]N\. Lambert, J\. Morrison, V\. Pyatkin, S\. Huang, H\. Ivison, F\. Brahman, L\. J\. V\. Miranda, A\. Liu, N\. Dziri, S\. Lyu,et al\.\(2025\)Tulu 3: pushing frontiers in open language model post\-training\.InProceedings of the International Conference on Learning Representations,Cited by:[Appendix G](https://arxiv.org/html/2605.16215#A7.p2.1)\.
- \[21\]S\. Lin, J\. Hilton, and O\. Evans\(2022\)Truthfulqa: measuring how models mimic human falsehoods\.InProceedings of the 60th annual meeting of the association for computational linguistics \(volume 1: long papers\),pp\. 3214–3252\.Cited by:[§2](https://arxiv.org/html/2605.16215#S2.p3.1)\.
- \[22\]T\. Mihaylov, P\. Clark, T\. Khot, and A\. Sabharwal\(2018\)Can a suit of armor conduct electricity? a new dataset for open book question answering\.InProceedings of the 2018 conference on empirical methods in natural language processing,pp\. 2381–2391\.Cited by:[§2](https://arxiv.org/html/2605.16215#S2.p3.1)\.
- \[23\]T\. Olatunji, C\. Nimo, A\. Owodunni, T\. Abdullahi, E\. Ayodele, M\. Sanni, C\. Aka, F\. Omofoye, F\. Yuehgoh, T\. Faniran,et al\.\(2024\)AfriMed\-qa: a pan\-african, multi\-specialty, medical question\-answering benchmark dataset\.arXiv preprint arXiv:2411\.15640\.Cited by:[§3\.1](https://arxiv.org/html/2605.16215#S3.SS1.p1.1)\.
- \[24\]T\. OLMo, P\. Walsh, L\. Soldaini, D\. Groeneveld, K\. Lo, S\. Arora, A\. Bhagia, Y\. Gu, S\. Huang, M\. Jordan,et al\.\(2024\)2 olmo 2 furious\.arXiv preprint arXiv:2501\.00656\.Cited by:[§4\.1](https://arxiv.org/html/2605.16215#S4.SS1.p1.1)\.
- \[25\]Cited by:[§3\.1](https://arxiv.org/html/2605.16215#S3.SS1.p1.1),[§4\.2](https://arxiv.org/html/2605.16215#S4.SS2.p1.1)\.
- \[26\]M\. M\. Ramos, D\. M\. Alves, H\. Gisserot\-Boukhlef, J\. Alves, P\. H\. Martins, P\. Fernandes, J\. Pombal, N\. M\. Guerreiro, R\. Rei, N\. Boizard,et al\.\(2026\)EuroLLM\-22b: technical report\.arXiv preprint arXiv:2602\.05879\.Cited by:[§4\.1](https://arxiv.org/html/2605.16215#S4.SS1.p1.1)\.
- \[27\]K\. Saab, T\. Tu, W\. Weng, R\. Tanno, D\. Stutz, E\. Wulczyn, F\. Zhang, T\. Strother, C\. Park, E\. Vedadi,et al\.\(2024\)Capabilities of gemini models in medicine\.arXiv preprint arXiv:2404\.18416\.Cited by:[§2](https://arxiv.org/html/2605.16215#S2.p1.1)\.
- \[28\]K\. Sakaguchi, R\. L\. Bras, C\. Bhagavatula, and Y\. Choi\(2021\)Winogrande: an adversarial Winograd schema challenge at scale\.Communications of the ACM64\(9\),pp\. 99–106\.Cited by:[§2](https://arxiv.org/html/2605.16215#S2.p3.1)\.
- \[29\]A\. Sallinen, A\. Solergibert, M\. Zhang, G\. Boyé, M\. Dupont\-Roc, X\. Theimer\-Lienhard, E\. Boisson, B\. Bernath, H\. Hadhri, A\. Tran,et al\.\(2025\)Llama\-3\-meditron: an open\-weight suite of medical llms based on llama\-3\.1\.InWorkshop on Large Language Models and Generative AI for Health at AAAI 2025,Cited by:[§2](https://arxiv.org/html/2605.16215#S2.p1.1),[§2](https://arxiv.org/html/2605.16215#S2.p5.3),[§4\.1](https://arxiv.org/html/2605.16215#S4.SS1.p1.1)\.
- \[30\]A\. Sellergren, S\. Kazemzadeh, T\. Jaroensri, A\. Kiraly, M\. Traverse, T\. Kohlberger, S\. Xu, F\. Jamil, C\. Hughes, C\. Lau,et al\.\(2025\)Medgemma technical report\.arXiv preprint arXiv:2507\.05201\.Cited by:[§1](https://arxiv.org/html/2605.16215#S1.p1.1),[§2](https://arxiv.org/html/2605.16215#S2.p1.1),[§3\.1](https://arxiv.org/html/2605.16215#S3.SS1.p1.1),[§4\.1](https://arxiv.org/html/2605.16215#S4.SS1.p1.1)\.
- \[31\]K\. Singhal, S\. Azizi, T\. Tu, S\. S\. Mahdavi, J\. Wei, H\. W\. Chung, N\. Scales, A\. Tanwani, H\. Cole\-Lewis, S\. Pfohl,et al\.\(2023\)Large language models encode clinical knowledge\.Nature620\(7972\),pp\. 172–180\.Cited by:[§2](https://arxiv.org/html/2605.16215#S2.p1.1),[§2](https://arxiv.org/html/2605.16215#S2.p4.1),[§3\.1](https://arxiv.org/html/2605.16215#S3.SS1.p1.1)\.
- \[32\]K\. Singhal, T\. Tu, J\. Gottweis, R\. Sayres, E\. Wulczyn, M\. Amin, L\. Hou, K\. Clark, S\. R\. Pfohl, H\. Cole\-Lewis,et al\.\(2025\)Toward expert\-level medical question answering with large language models\.Nature medicine31\(3\),pp\. 943–950\.Cited by:[§2](https://arxiv.org/html/2605.16215#S2.p1.1)\.
- \[33\]G\. Team, A\. Kamath, J\. Ferret, S\. Pathak, N\. Vieillard, R\. Merhej, S\. Perrin, T\. Matejovicova, A\. Ramé, M\. Rivière,et al\.\(2025\)Gemma 3 technical report\.External Links:2503\.19786,[Link](https://arxiv.org/abs/2503.19786)Cited by:[§4\.1](https://arxiv.org/html/2605.16215#S4.SS1.p1.1)\.
- \[34\]A\. S\. Thakur, K\. Choudhary, V\. S\. Ramayapally, S\. Vaidyanathan, and D\. Hupkes\(2025\-07\)Judging the judges: evaluating alignment and vulnerabilities in LLMs\-as\-judges\.InProceedings of the Fourth Workshop on Generation, Evaluation and Metrics \(GEM\),Vienna, Austria,pp\. 404–430\.External Links:ISBN 979\-8\-89176\-261\-9Cited by:[§2](https://arxiv.org/html/2605.16215#S2.p5.3)\.
- \[35\]A\. J\. Thirunavukarasu, D\. S\. J\. Ting, K\. Elangovan, L\. Gutierrez, T\. F\. Tan, and D\. S\. W\. Ting\(2023\)Large language models in medicine\.Nature medicine29\(8\),pp\. 1930–1940\.Cited by:[§2](https://arxiv.org/html/2605.16215#S2.p4.1)\.
- \[36\]X\. Wang, S\. Guo, Y\. Shen, J\. Chen, J\. Wang, J\. Gu, P\. Zhang, L\. Liu, and B\. Wang\(2026\)LiveClin: a live clinical benchmark without leakage\.arXiv preprint arXiv:2602\.16747\.Cited by:[§2](https://arxiv.org/html/2605.16215#S2.p5.3)\.
- \[37\]Y\. Wang, X\. Ma, G\. Zhang, Y\. Ni, A\. Chandra, S\. Guo, W\. Ren, A\. Arulraj, X\. He, Z\. Jiang,et al\.\(2024\)Mmlu\-pro: a more robust and challenging multi\-task language understanding benchmark\.Advances in Neural Information Processing Systems37,pp\. 95266–95290\.Cited by:[§4\.2](https://arxiv.org/html/2605.16215#S4.SS2.p1.1)\.
- \[38\]M\. White, I\. Haddad, C\. Osborne, X\. Y\. Liu, A\. Abdelmonsef, S\. Varghese, and A\. L\. Hors\(2024\)The model openness framework: promoting completeness and openness for reproducibility, transparency, and usability in artificial intelligence\.arXiv preprint arXiv:2403\.13784\.Cited by:[Appendix L](https://arxiv.org/html/2605.16215#A12.p1.1)\.
- \[39\]C\. Wu, W\. Lin, X\. Zhang, Y\. Zhang, W\. Xie, and Y\. Wang\(2024\)PMC\-llama: toward building open\-source language models for medicine\.Journal of the American Medical Informatics Association31\(9\),pp\. 1833–1843\.Cited by:[§2](https://arxiv.org/html/2605.16215#S2.p1.1)\.
- \[40\]Q\. Xie, Q\. Chen, A\. Chen, C\. Peng, Y\. Hu, F\. Lin, X\. Peng, J\. Huang, J\. Zhang, V\. Keloth, X\. Zhou, L\. Qian, H\. He, D\. Shung, L\. Ohno\-Machado, Y\. Wu, H\. Xu, and J\. Bian\(2025\)Medical foundation large language models for comprehensive text analysis and beyond\.npj Digital Medicine8,pp\. 141\.External Links:[Document](https://dx.doi.org/10.1038/s41746-025-01533-1)Cited by:[Appendix G](https://arxiv.org/html/2605.16215#A7.p1.2)\.
- \[41\]Cited by:[Appendix B](https://arxiv.org/html/2605.16215#A2.p2.5),[§4\.1](https://arxiv.org/html/2605.16215#S4.SS1.p1.1),[§4\.3](https://arxiv.org/html/2605.16215#S4.SS3.p1.1)\.
- \[42\]R\. Zellers, A\. Holtzman, Y\. Bisk, A\. Farhadi, and Y\. Choi\(2019\)Hellaswag: can a machine really finish your sentence?\.InProceedings of the 57th annual meeting of the association for computational linguistics,pp\. 4791–4800\.Cited by:[§2](https://arxiv.org/html/2605.16215#S2.p3.1)\.
- \[43\]L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. P\. Xing, H\. Zhang, J\. E\. Gonzalez, and I\. Stoica\(2023\)Judging LLM\-as\-a\-judge with MT\-bench and chatbot arena\.InAdvances in Neural Information Processing Systems,Vol\.36\.Cited by:[§2](https://arxiv.org/html/2605.16215#S2.p5.3)\.
- \[44\]J\. Zhou, T\. Lu, S\. Mishra, S\. Brahma, S\. Basu, Y\. Luan, D\. Zhou, and L\. Hou\(2023\)Instruction\-following evaluation for large language models\.arXiv preprint arXiv:2311\.07911\.Cited by:[§4\.2](https://arxiv.org/html/2605.16215#S4.SS2.p1.1)\.
- \[45\]Y\. Zuo, S\. Qu, Y\. Li, Z\. Chen, X\. Zhu, E\. Hua, K\. Zhang, N\. Ding, and B\. Zhou\(2025\)Medxpertqa: benchmarking expert\-level medical reasoning and understanding\.arXiv preprint arXiv:2501\.18362\.Cited by:[§4\.2](https://arxiv.org/html/2605.16215#S4.SS2.p1.1)\.

## Appendix A Examples where Gemma\-Meditron wins against MedGemma

We present two qualitative examples in which Gemma\-Meditron is preferred over MedGemma by the LLM\-as\-a\-judge evaluation\. In each, the judge’s preference reflects substantive differences in clinical reasoning, contextual awareness, and structured presentation\.

### Example 1

Prompt: I am the emergency radiologist at the CHUV in Lausanne who has been asked to perform imaging in an 18\-year\-old man transported to the hospital after a severe bicycle accident\. He is conscient and his Glasgow Coma Scale is 15, but hemodynamically, he is quite unstable\. Furthermore, he suffers from severe upper abdominal pain\. What imaging modality and what acquisition protocol do you recommend? What is the most likely diagnosis?

MedGemma\-27B: Okay, this is a critical situation\. An 18\-year\-old male, high\-energy trauma \(bicycle accident\), GCS 15 \(good neurological status\), but hemodynamically unstable with severe upper abdominal pain\. This immediately raises concerns for significant intra\-abdominal hemorrhage, likely from solid organ injury \(liver, spleen\) or mesenteric injury\.

Here’s my recommendation for imaging modality and protocol, along with the most likely diagnosis:

1\. Imaging Modality: Contrast\-Enhanced CT Scan

- Why CT? CT is the gold standard for evaluating hemodynamically unstable trauma patients with suspected abdominal or thoracic injury\. It is fast, widely available, and provides detailed anatomical information about solid organs, vasculature, retroperitoneum, and bony structures\.
- Why Contrast Enhanced? While a non\-contrast CT can identify free fluid \(hemoperitoneum\), contrast enhancement is crucial for:
  - Identifying active arterial extravasation \(bleeding\)\.
  - Assessing the extent and grade of solid organ injuries\.
  - Evaluating vascular injuries \(e\.g\., aorta, vena cava, mesenteric vessels\)\.
  - Visualizing bowel and mesenteric injuries\.
  - Differentiating between solid organ hematomas and active bleeding\.

2\. Acquisition Protocol: Trauma CT Protocol \(Optimized for Hemodynamic Instability\)

Given the patient’s hemodynamic instability, speed and efficiency are paramount\. The protocol should be streamlined and focused\.

- Scanner: Use a high\-speed, multi\-detector CT scanner \(64\-slice or higher preferred\)\.
- Patient Positioning: Supine position with arms above the head\.
- Scanning Range:
  - Head: Non\-contrast CT head \(brief, to rule out intracranial injury despite GCS 15, as focal neurological deficits can be subtle or develop later\)\. *This can potentially be skipped or deferred if the patient is extremely unstable and abdominal/thoracic injury is the primary concern, but it’s generally standard in trauma\.*
  - Neck: Non\-contrast CT neck \(to evaluate for cervical spine injury, especially with a bicycle accident\)\.
  - Chest: Contrast\-enhanced CT chest \(thin slices, 1\-1\.5mm\)\. Essential to rule out hemothorax, pneumothorax, pulmonary contusion, aortic injury, and mediastinal hematoma\.
  - Abdomen & Pelvis: Contrast\-enhanced CT abdomen and pelvis \(thin slices, 1\-1\.5mm\)\. This is the most critical part for this patient\.
- Contrast Administration:
  - Type: Iodinated intravenous contrast \(e\.g\., iohexol, iopamidol\)\.
  - Volume: Sufficient volume to adequately opacify the aorta and hepatic veins \(e\.g\., 100\-150 mL in an adult, adjusted for weight\)\.
  - Injection Rate: Relatively high rate \(e\.g\., 4\-5 mL/sec\) to achieve rapid arterial phase enhancement\.
  - Timing: Use a *dual\-phase* or *tri\-phase* protocol if feasible, but prioritize the arterial phase\.
    - Arterial Phase: Essential for detecting active bleeding and assessing vascular injury \(typically 20\-30 seconds post\-injection\)\.
    - Portal Venous Phase: Helps delineate venous injuries and assess organ perfusion \(typically 60\-80 seconds post\-injection\)\.
    - Delayed Phase \(Optional/If Time Permits\): Can be useful for identifying urinary tract injuries, but may be skipped in an unstable patient to save time\.
- Reconstruction:
  - Axial images\.
  - Coronal and sagittal reformations \(especially for abdomen/pelvis\)\.
  - Thin\-section multiplanar reconstructions \(MPR\) as needed\.
- Speed: Maximize scan speed while maintaining diagnostic quality\. Instruct the CT technologist to minimize delays between phases\.
- Monitoring: Continuous monitoring of vital signs during the scan\. Have resuscitation equipment readily available\. Inform the resuscitation team of the scan progress\.

3\. Most Likely Diagnosis:

Given the mechanism \(bicycle accident\), hemodynamic instability, and severe upper abdominal pain, the most likely diagnoses involve significant intra\-abdominal hemorrhage\.

- Splenic Injury: The spleen is the most commonly injured solid organ in blunt abdominal trauma\. Rupture or laceration can lead to massive hemoperitoneum and shock\.
- Liver Injury: The liver is the second most commonly injured organ\. Lacerations, hematomas, or vascular injuries \(e\.g\., hepatic artery, portal vein\) can cause significant bleeding\.
- Mesenteric Injury: Injury to the mesentery or its vessels can cause significant venous or arterial bleeding

Gemma\-Meditron \(ours\): Okay, this is a classic trauma scenario\. A young, unstable patient with upper abdominal pain after a high\-impact mechanism \(bicycle accident\) demands a rapid, comprehensive imaging workup\. Here’s my recommendation for imaging modality, protocol, the most likely diagnosis, and a brief discussion of why this approach is chosen, tailored to the situation at CHUV \(a high\-volume trauma center\)\.

1\. Imaging Modality: Contrast\-Enhanced Multidetector CT \(MDCT\) of the Abdomen and Pelvis – FAST\-CT Protocol

- Why CT, not US? While bedside ultrasound \(FAST exam\) is *always* first\-line in the ED for the "ABCDE" assessment and to detect free fluid, in a patient who is already being taken for definitive imaging, and who has *specific* abdominal pain, a CT offers a far more detailed and complete evaluation of solid organ injury, vascular compromise, bowel perforation, and pelvic fracture\. The patient’s GCS of 15 allows for a quick, whole\-body CT scan\.
- Why contrast? Contrast is essential for evaluating liver, spleen, kidney, and bowel perfusion, detecting active bleeding, and differentiating solid organ injury from hematoma\. In a hemodynamically unstable patient, contrast helps identify sources of ongoing hemorrhage\.
- Why ’FAST\-CT’? This is a tailored protocol specifically designed for trauma\. It prioritizes speed and efficiency while maximizing diagnostic yield\.

2\. Acquisition Protocol \(FAST\-CT – Adapted for CHUV resources\)

| Parameter | Recommended Setting \(CHUV\-typical\) | Rationale |
| --- | --- | --- |
| Scanner | 128\- or 256\-slice \(or higher\) | Faster rotation, thinner slices |
| Gantry Tilt | 0\-15° \(depending on patient habitus\) | Improves visualization of the upper abdomen |
| Inspiration Breath\-Hold | 10\-15 seconds \(if possible\) | Reduces motion artifact; if patient cannot hold breath, use high\-pitch spiral with respiratory gating |
| Scan Range | Upper dome of the liver to pubic symphysis \(full abdomen & pelvis\) | Captures all potential injury sites |
| Slice Thickness | 2\.5\-3 mm \(reconstructed to 1\.25 mm\) | Adequate detail for most injuries; thinner reconstructions for vascular imaging |
| Contrast Protocol \(IV Dual\-energy Acquisition\) | \- Arterial Phase \(30\-40 s delay\): 1\.5\-2 mL/kg iodinated contrast \(e\.g\., Iomeprol 400 mg/mL\) at 4\-5 mL/s flow rate\. \- Portal Venous Phase \(60\-90 s delay\): Repeat bolus of 0\.8\-1 mL/kg same contrast agent\. \- Delayed Phase \(5\-10 min\): Can be added if suspected bowel injury or delayed hemorrhage; 0\.5 mL/kg | Arterial phase shows active bleeding, portal venous phase defines parenchymal lacerations, delayed phase identifies bowel leak\. |
| Reconstruction Algorithms | \- Standard Reconstruction: Soft\-tissue window \(B30f/45s\) for parenchyma\. \- Vascular Reconstruction: Thin\-slice MIP/VR for aorta, IVC, SMV, portal vein, hepatic arteries\. \- Bone Windows: For pelvic fracture assessment\. | Optimizes visualization of different tissues\. |
| Radiation Dose Reduction | Use automated tube current modulation \(CARE Dose 4D/kV\-switching\) and iterative reconstruction \(e\.g\., iDose4\) | Minimizes radiation exposure without sacrificing image quality\. |
| Image Post\-processing | \- Volume Rendering \(VR/MIP\): For vascular assessment\. \- Multiplanar Reconstructions \(MPR\): Axial, coronal, sagittal planes\. \- 3D Segmentation \(if available\): For quantifying hematoma volume\. | Facilitates surgical planning\. |
| Total Acquisition Time | ~10\-15 seconds for the dynamic phases \(arterial, portal\) \+ 15\-20 seconds for the delayed phase\. Total scan time <40\-60 seconds\. | Critical in unstable patients\. |
| Image Reporting | Structured reporting using the AAST\-OIS injury grading system for each organ \(liver, spleen, pancreas, kidney, bowel, mesentery, adrenal glands\)\. | Standardized communication with surgeons\. |

Important Protocol Considerations at CHUV:

- Dual\-energy CT \(DECT\): CHUV has advanced DECT capabilities\. Use DECT to perform virtual unenhanced images and iodine maps\. This can reduce contrast\-induced

Judge Verdict: Both models demonstrate excellent understanding of the clinical scenario, recognizing the urgency of hemodynamic instability, the need for rapid imaging, and the high likelihood of intra-abdominal hemorrhage following blunt trauma. Model 1 provides a clear, accurate, and clinically sound recommendation for contrast-enhanced CT with appropriate protocol details, including phase timing, contrast administration, and reconstruction. It correctly identifies splenic and hepatic injuries as the most likely diagnoses. Model 2 builds upon this with superior contextual awareness, specifically referencing CHUV’s advanced capabilities (e.g., dual-energy CT, DECT, iDose4, 3D segmentation), which enhances relevance and realism for the setting. It also includes a structured table that improves clarity and clinical utility, and it integrates radiation dose optimization and structured reporting, aligning with current trauma imaging best practices. While Model 1 is thorough and correct, Model 2 offers more nuanced, institutionally tailored, and technically advanced guidance without sacrificing clarity or safety. Both are harmless, fair, and well-aligned with guidelines. Model 2’s contextual specificity and enhanced technical detail give it a slight edge.

### Example 2

Prompt: I am a pediatric intensive care physician working at CHUV in Lausanne in Switzerland. I take care of a 14-year-old female patient following a severe traumatic brain injury with diffuse cerebral edema and multiple parenchymal hemorrhages on head CT. What is the use of dosing biomarkers to better assess her prognosis?

MedGemma\-27BOkay, this is a critical situation, and using biomarkers is becoming increasingly important in TBI management, especially in severe cases like the one you’re describing\. Here’s a breakdown of how dosing biomarkers can be used to better assess prognosis in a 14\-year\-old female patient with severe TBI, diffuse cerebral edema, and multiple parenchymal hemorrhages at CHUV:Understanding the Goal: Why Biomarkers?Traditional prognostic tools in TBI \(like GCS, age, pupillary response, CT findings\) are valuable but have limitations\. They provide a snapshot in time and don’t capture the complex, dynamic biological processes occurring after injury\. Biomarkers aim to:1\.Detect Injury Severity More Accurately:Go beyond macroscopic damage seen on CT\. 2\.Track Pathophysiological Processes:Monitor inflammation, neuronal injury, and metabolic changes\. 3\.Predict Long\-Term Outcomes:Help differentiate patients likely to recover well from those with a high risk of poor outcomes \(death, severe disability\)\. 4\.Guide Management:Potentially inform therapeutic decisions \(e\.g\., intensity of monitoring, interventions to manage secondary injury\)\. 5\.Provide Earlier Prognostication:Offer insights sooner than clinical assessments alone, which can be crucial for family counseling and resource allocation\.Key Biomarkers for Severe TBI Prognosis \(and their relevance to your patient\):Several biomarkers are being investigated and increasingly used, often in combination\. Here are the most relevant ones:1\.Neuron\-Specific Enolase \(NSE\):\*What it is:A glycolytic enzyme found primarily in neurons\. Released into CSF and blood upon neuronal damage/death\. \*Relevance:Elevated levels correlate with the extent of neuronal injury\. Very high levels early on are associated with poor prognosis \(death or severe disability\)\. \*Use:Often measured in serum or CSF within the first 24\-72 hours\. Serial measurements can track the trajectory of injury\. 
In your patient with diffuse edema and hemorrhages, NSE is likely to be significantly elevated, and its level is a strong prognostic indicator\.2\.S100B \(Soluble Protein 100B\):\*What it is:A calcium\-binding protein, primarily expressed by astrocytes\. Released into CSF and blood after brain injury, particularly involving glial cells\. \*Relevance:Reflects astrocytic damage and blood\-brain barrier disruption\. Also associated with injury severity and outcome\. While less specific to neurons than NSE, it’s often elevated in TBI\.*Use:Measured in serum or CSF, often within the first 24\-48 hours\. High levels are associated with worse outcomes\. Its utility might be slightly less pronounced than NSE for*severe\* TBI prognosis, but it’s still valuable, especially when combined with other markers\.3\.Glial Fibrillary Acidic Protein \(GFAP\):\*What it is:An intermediate filament protein specific to astrocytes\. Released into CSF and blood following astrocyte injury\. \*Relevance:Increasingly recognized as a sensitive and relatively specific marker of brain injury, particularly astrocytic damage\. It appears to be more stable in serum than S100B\. \*Use:Measured in serum or CSF\. Emerging evidence suggests GFAP may be a powerful prognostic marker, potentially even more so than NSE or S100B in some studies, especially when used in combination or as part of panels\. Its availability might be more limited depending on the lab, but it’s a key marker to consider\.4\.Ubiquitin C\-terminal Hydrolase L1 \(UCH\-L1\):\*What it is:An enzyme involved in protein degradation, found predominantly in neurons\. Released into CSF and blood after neuronal injury\. \*Relevance:Similar to NSE, it indicates neuronal damage\. Some studies suggest it might be more specific to axonal injury than NSE\. \*Use:Measured in serum or CSF\. High levels correlate with injury severity and poor outcome\. 
Often used in conjunction with other markers\.5\.Neurofilament Light Chain \(NfL\):\*What it is:A structural protein of neuronal axons\. Released into CSF and blood when axons are damaged or degenerate\. \*Relevance:A marker of axonal injury\. Elevated levels are associated with TBI severity and poorer outcomes\. It’s also being investigated for tracking recovery and neurodegeneration\. \*Use:Measured in serum or CSF\. Its role in acute TBI prognosis is still being refined, but it shows promise, particularly for detecting subtle or evolving axonal damage\.

Gemma\-Meditron \(ours\)BackgroundTraumatic brain injury \(TBI\) in the pediatric population is a major cause of morbidity and mortality, and timely identification of patients at high risk for severe outcomes is critical for resource allocation and shared decision\-making with families\. Conventional imaging \(CT, MRI\) and clinical scores \(e\.g\., Glasgow Coma Scale, Pediatric Glasgow Outcome Scale\) are often insufficient for precise prognostication, especially in the acute phase\.Emerging Role of Biomarkers in Pediatric TBIBiomarkers can provide a window into the neurobiological response to injury, offering objective data that complements clinical assessment\. They fall into several categories, some more established than others:\|Biomarker Class\|Examples\|Timing of Measurement \(from injury\)\|Physiologic Rationale\|Prognostic Value \(pediatric TBI\)\|Practical Considerations for CHUV\| \|———————\-\|————–\|—————————————\-\|—————————\|————————————\-\|———————————–\| \|Neuronal Injury\|GFAP \(Glial Fibrillary Acidic Protein\)\| $\\\{\}leq$ 12\-24 hours \| Marker of astrocyte injury, correlates with lesion volume \| Strong predictor of unfavorable outcome \(death, severe disability\) at 6 months\. High sensitivity, specificity in pediatric data\. \| Point\-of\-care assay available; can be ordered from the central lab if not emergent\. \| \| \|Uch\-L1 \(Ubiquitin C\-terminal hydrolase\-L1\)\| $\\\{\}leq$ 12\-24h \| Cytosolic protein released from neurons/glia upon injury \| Adds prognostic information when combined with GFAP; predicts long\-term disability\. \| Standard laboratory panel, readily available\. \| \|Astrocyte activation/astroglial injury\|S100B\| 24\-72h \| Released from astrocytes; can be non\-specific \(e\.g\., also elevated in cardiac injury\)\. \| Less robust than GFAP/Uch\-L1, especially in moderate TBI\. May be useful as part of a multi\-marker panel\. \| Already measured in routine CHUV ICU labs\. 
\| \|Axonal Injury\|NF\-L \(Neurofilament Light Chain\)\| 24\-72h, then serial measurements \| Released from damaged axons; reflects neuronal degeneration\. \| Emerging as a strong prognosticator, especially for long\-term neurocognitive deficits\. \| Not yet standard at CHUV; requires a send\-out assay or local validation\. \| \|Inflammation/Immune Response\|IL\-6 \(Interleukin\-6\), TNF\-$\\\{\}alpha$, HMGB1\| 6\-24h, then 24\-72h \| Reflect early inflammatory cascade; correlate with secondary injury\. \| Predictive of cerebral edema, ICU length of stay, and functional outcome, but less specific than neuronal markers\. \| CHUV immunology lab can measure these\. \| \|Metabolic/Cerebral Energy Crisis\| \*\*Plasma glucose, lactate, pyruvate, cerebral microdialysis \(lactate/pyruvate ratio\) \| Continuous monitoring \(lactate/pyruvate\) or spot samples\. \| Early marker of mitochondrial dysfunction and energy failure\. \| Useful for guiding resuscitation \(e\.g\., glycemic control, oxygen delivery\) but limited as a standalone prognosticator\. \| Standard ICU monitoring; microdialysis not routinely used at CHUV\. \|How to Use Biomarkers in Your 14\-Year\-Old Patient at CHUV1\.Acute Phase \(first 24\-48 hours\)\-GFAP \+ Uch\-L1:This combination is the most evidence\-based for early prognostication in pediatric severe TBI\. Order them stat from the CHUV central lab\. \-Interpretation:\-High GFAP \(\> 200\-400 pg/mL\) \+ high Uch\-L1 \(\> 50 ng/mL\): Strongly predicts an increased risk of death or severe disability at 6 months \(based on the Traumatic Brain Injury Predictive Study\)\. This information helps frame the discussion with the family about potential outcomes and the appropriateness of aggressive interventions \(e\.g\., prolonged mechanical ventilation, neurocritical care\)\. \-Discordant results \(e\.g\., high GFAP, normal Uch\-L1\): Suggests astrocyte injury may be present but neuronal injury is less severe\. 
May warrant a more expectant approach while continuing intensive monitoring\. \-S100B:May be obtained simultaneously as it is already measured in your ICU panel, but its prognostic value is lower when GFAP/Uch\-L1 are available\.2\.Subacute Phase \(24\-72 hours\)\-NF\-L:If the

Judge Verdict. Note: in the raw judge output, the labels “Model 1” and “Model 2” are swapped relative to the model identities shown above (the evaluation randomizes presentation order to control for position bias).

Model 1 demonstrates superior clinical reasoning, depth of knowledge, and structured presentation. It accurately identifies the most relevant biomarkers for pediatric severe TBI (GFAP, Uch-L1, NF-L, S100B, inflammatory markers), provides precise timing, physiological rationale, prognostic value, and practical implementation guidance tailored to CHUV’s lab capabilities. The use of a table enhances clarity and facilitates rapid clinical decision-making. It also includes nuanced interpretation of biomarker combinations and their implications for family counseling and treatment intensity, which is critical in a complex case like this.

Model 2 offers a reasonable overview but lacks depth and specificity. It omits key biomarkers (e.g., NF-L, IL-6, lactate/pyruvate ratio) and fails to integrate contextual details about CHUV’s lab infrastructure. The response is cut off mid-sentence, indicating incompleteness. While it correctly identifies NSE, S100B, GFAP, Uch-L1, and NfL, it does not provide actionable guidance on timing, interpretation thresholds, or integration with clinical and imaging data. The communication is less structured, and the lack of a table or clear decision framework reduces utility in a high-acuity setting.

Both models are harmless and fair, but Model 1 is significantly more aligned with clinical guidelines for neurocritical care and evidence-based prognostication in pediatric TBI.

## Appendix B Data analysis

The training mix combines a curated pool of public medical QA with three synthetic components seeded from real corpora (Table [4](https://arxiv.org/html/2605.16215#A2.T4)): exam-style QA seeded from the curated pool (Table [5](https://arxiv.org/html/2605.16215#A2.T5)), QA grounded in clinical practice guidelines, and clinical-vignette prompts seeded from MOOVE training data. Synthetic data accounts for $\sim\!64\%$ of examples and $\sim\!71\%$ of tokens, motivating the source-versus-synthetic distribution checks reported below.

Table 4: Overview of Fully Open Meditron datasets

We compare each synthetic component against its source along three axes: specialty, urgency, and difficulty (Figures [6](https://arxiv.org/html/2605.16215#A2.F6)–[8](https://arxiv.org/html/2605.16215#A2.F8)). All labels are produced by Qwen3-32B\[[41](https://arxiv.org/html/2605.16215#bib.bib45)\] as a zero-shot classifier with one prompt template per axis; the same model labels source and synthetic, so systematic classifier bias largely cancels in the comparison. Figure captions report Jensen–Shannon divergence $\mathrm{JSD}$ for categorical axes and the Wasserstein-1 distance $\mathcal{W}_1$ for difficulty (which under a unimodal shift coincides with the mean shift). Components order cleanly by source homogeneity: Guidelines (single corpus type) is the tightest match, MOOVE (single vignette pool) is intermediate, and Curated QA (eight aggregated public datasets) shows the largest redistribution. Difficulty shifts upward by $\sim\!0.73$ points on the 1–5 scale in both MOOVE and Curated, indicating a near-uniform translation produced by the generator rather than a dataset-specific effect.
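The two distances named above can be sketched in a few lines of standard-library Python. This is only an illustration, not the authors' analysis code: the base-2 logarithm for JSD, the function names, and the two example histograms are all assumptions made here. The example also checks the parenthetical remark about the mean shift for a case where the synthetic distribution is shifted upward at every level of the cumulative distribution.

```python
import math


def jsd(p, q):
    """Jensen-Shannon divergence with base-2 logs (value lies in [0, 1])."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability contribute nothing to the sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def wasserstein1(p, q):
    """W1 between two histograms on the same unit-spaced ordinal scale."""
    cdf_p = cdf_q = total = 0.0
    for a, b in zip(p, q):
        cdf_p += a
        cdf_q += b
        total += abs(cdf_p - cdf_q)  # area between the two CDFs
    return total


def mean_level(p, levels=(1, 2, 3, 4, 5)):
    """Expected difficulty level under histogram p."""
    return sum(level * a for level, a in zip(levels, p))


# Hypothetical difficulty histograms over levels 1..5 (not data from the paper).
source = [0.10, 0.30, 0.40, 0.15, 0.05]
synthetic = [0.05, 0.15, 0.30, 0.35, 0.15]

shift = mean_level(synthetic) - mean_level(source)
print(jsd(source, synthetic), wasserstein1(source, synthetic), shift)
```

For these made-up histograms the Wasserstein-1 distance and the mean shift agree, because the synthetic CDF never rises above the source CDF.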

![Refer to caption](https://arxiv.org/html/2605.16215v1/x9.png)(a) Specialty
![Refer to caption](https://arxiv.org/html/2605.16215v1/x10.png)(b) Urgency
![Refer to caption](https://arxiv.org/html/2605.16215v1/x11.png)(c) Difficulty

Figure 6: Synthetic MOOVE vs. source ($n_{\text{src}}=24{,}679$, $n_{\text{syn}}=24{,}465$). Top specialties preserved in rank; difficulty shifts toward levels 4–5.

![Refer to caption](https://arxiv.org/html/2605.16215v1/x12.png)(a) Specialty
![Refer to caption](https://arxiv.org/html/2605.16215v1/x13.png)(b) Urgency

Figure 7: Guidelines QA vs. source ($n_{\text{src}}=16{,}300$, $n_{\text{syn}}=145{,}681$, a $\sim\!9\times$ amplification). Difficulty is not comparable for this component, since the source consists of clinical practice guidelines rather than question–answer pairs. Both annotated axes closely match the source ($\mathrm{JSD}\leq 0.014$).

![Refer to caption](https://arxiv.org/html/2605.16215v1/x14.png)(a) Specialty
![Refer to caption](https://arxiv.org/html/2605.16215v1/x15.png)(b) Urgency
![Refer to caption](https://arxiv.org/html/2605.16215v1/x16.png)(c) Difficulty

Figure 8: Synthetic Curated QA vs. source ($n_{\text{src}}=211{,}244$, $n_{\text{syn}}=214{,}654$). The generator broadens coverage from the eight aggregated source datasets, promoting under-represented specialties; difficulty shift is $2.81\rightarrow 3.55$.

Table 5: Description of Curated QA
## Appendix C Overview of Fully Open Meditron evaluation datasets

Table 6: Overview of Fully Open Meditron evaluation datasets.

## Appendix D Medical benchmark accuracy

Table 7: Medical benchmark accuracy (%). Judge is Qwen3-30B-A3B-Instruct. Best within partition bolded; best fully open underlined. Note: Per-task 95% CIs (approximate, varying with $p$): MedMCQA $\pm$1.5 pp ($n=4183$), MedQA $\pm$2.6 pp ($n=1273$), PubMedQA $\pm$3.9 pp ($n=500$), MedXpertQA $\pm$1.5 pp ($n=2450$), HealthBench Hard $\pm$2.9 pp ($n=1000$). Avg CIs computed by SE propagation. Gains are paired differences; the unpaired SE bound is $\sim$1.2 pp, so gains $>$2.5 pp are robustly significant.
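The per-task half-widths in the note are consistent with a normal-approximation (Wald) interval for a binomial proportion, $1.96\sqrt{p(1-p)/n}$. The sketch below assumes that this is the interval being used, which the caption does not state explicitly. The sample sizes come from the caption, while the function name and the accuracy value $p=0.5$ are illustrative choices made here.

```python
import math


def wald_halfwidth_pp(p, n, z=1.96):
    """Half-width of a Wald 95% CI for accuracy p on n items, in percentage points."""
    return 100.0 * z * math.sqrt(p * (1.0 - p) / n)


# Benchmark sizes as listed in the Table 7 note.
sizes = {
    "MedMCQA": 4183,
    "MedQA": 1273,
    "PubMedQA": 500,
    "MedXpertQA": 2450,
    "HealthBench Hard": 1000,
}

# p = 0.5 is the widest case; actual accuracies give narrower intervals.
for name, n in sizes.items():
    print(f"{name}: at most +/-{wald_halfwidth_pp(0.5, n):.1f} pp")
```

Under the same approximation, a standard error of about 1.2 pp corresponds to a 95% half-width of roughly $1.96 \times 1.2 \approx 2.35$ pp, which sits just under the 2.5 pp threshold quoted in the note.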

## Appendix E Auto-MOOVE pairwise results

Table 8: Auto-MOOVE pairwise results ($N=12{,}602$ per pair). Net Win Rate = Win − Loss %; Adjusted Win Rate = Win + Tie/2; $\Delta$ Likert averaged across criteria.

Table 9: Auto-MOOVE pairwise comparisons ($N=12{,}602$ per pair), judge ablations.

| Base model | Our model | Judge | Net Win Rate | Adj. Win Rate | $\Delta$ Likert |
|---|---|---|---|---|---|
| OLMo-2-32B-SFT | OLMo-2-32B-MeditronFO | Qwen3-30B-A3B | +67.2 | 83.7 | +0.43 |
| OLMo-2-32B-SFT | OLMo-2-32B-MeditronFO | Qwen3-235B-A22B | +69.7 | 84.8 | +0.44 |
| OLMo-2-32B-SFT | OLMo-2-32B-MeditronFO | gpt-oss-120 | +35.1 | 67.5 | +0.32 |
| OLMo-2-32B-SFT | OLMo-2-32B-MeditronFO - Synth. MOOVE | Qwen3-30B-A3B | +56.2 | 78.2 | +0.31 |
| EuroLLM-22B-Instruct | EuroLLM-22B-MeditronFO | Qwen3-30B-A3B | +8.0 | 54.0 | +0.04 |
| EuroLLM-22B-Instruct | EuroLLM-22B-MeditronFO | Qwen3-235B-A22B | +34.3 | 67.2 | +0.20 |
| EuroLLM-22B-Instruct | EuroLLM-22B-MeditronFO | gpt-oss-120 | -12.5 | 43.7 | -0.15 |
| EuroLLM-22B-Instruct | EuroLLM-22B-MeditronFO - Synth. MOOVE | Qwen3-30B-A3B | +7.3 | 53.7 | +0.02 |
| Gemma-3-27B-IT | Gemma-3-27B-MeditronFO | Qwen3-30B-A3B | +29.8 | 64.9 | +0.15 |
| MedGemma | Gemma-3-27B-MeditronFO | Qwen3-30B-A3B | +32.7 | 66.3 | +0.16 |
| MedGemma | Gemma-3-27B-MeditronFO | Qwen3-235B-A22B | +17.2 | 58.6 | +0.04 |
| Gemma-3-27B-IT | Gemma-3-27B-MeditronFO | gpt-oss-120 | +23.0 | 61.5 | +0.11 |
| Gemma-3-27B-IT | Gemma-3-27B-MeditronFO - Synth. MOOVE | Qwen3-30B-A3B | +25.0 | 62.5 | +0.12 |
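The two win-rate definitions in the Table 8 caption can be written out as a short sketch. The function name and the example counts are hypothetical and are not taken from the paper's evaluation code. One consequence of the definitions is that, when wins, ties, and losses sum to 100%, the adjusted win rate equals 50 plus half the net win rate.

```python
def win_rates(wins, ties, losses):
    """Return (net win rate, adjusted win rate) in percent from raw judgment counts."""
    total = wins + ties + losses
    win_pct = 100.0 * wins / total
    tie_pct = 100.0 * ties / total
    loss_pct = 100.0 * losses / total
    net = win_pct - loss_pct            # Net Win Rate = Win - Loss
    adjusted = win_pct + tie_pct / 2.0  # Adjusted Win Rate = Win + Tie/2
    return net, adjusted


# Hypothetical counts for one model pair (not results from the paper).
net, adjusted = win_rates(wins=600, ties=200, losses=200)
print(net, adjusted)
```

The identity is a quick sanity check when reading the rows of Table 9: for example, a net win rate of +17.2 corresponds to an adjusted win rate of 58.6.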

## Appendix F General-purpose benchmark results

Table 10: General-purpose benchmark results.

## Appendix G Additional Ablations

General-purpose capability as a smoke test. We treat general-purpose benchmarks as a smoke test for catastrophic forgetting rather than as a primary optimization target. Domain adaptation is expected to trade off some broad instruction-following capability against improved medical specialization\[[40](https://arxiv.org/html/2605.16215#bib.bib26)\], and our results should be interpreted in that light. By default, Apertus-70B-MeditronFO drops 13.4 points on the general-purpose average relative to its base (54.10 $\rightarrow$ 40.74), driven largely by IFEval (64.70 $\rightarrow$ 41.04). However, this pattern is neither unique to our recipe nor uniformly severe across models: OLMo-2-32B-MeditronFO improves slightly over its base (+1.70), while smaller models and Gemma-3-27B show moderate degradations. Notably, MedGemma-27B also underperforms its general-purpose base Gemma-3-27B, indicating that this tradeoff is a broader feature of medical specialization rather than a pathology of fully open training. Relative to prior open medical finetunes, our recipe also appears to retain more general capability: Llama-3.1-70B-Meditron exhibits a substantially larger drop than Apertus-70B-MeditronFO (45.59 vs. 71.04 for its base), suggesting that the cost of specialization is reduced, though not eliminated, in our setting. Detailed results are recorded in Table [11](https://arxiv.org/html/2605.16215#A7.T11).

As an optional mitigation, the training mixture can be augmented with a 10% subset of the fully open Tülu 3 SFT mixture\[[20](https://arxiv.org/html/2605.16215#bib.bib29)\], which recovers most of the general\-purpose loss for Apertus\-70B \(49\.85 average, 61\.92 on IFEval\) while largely preserving medical gains\. We do not include Tülu replay in the default Fully Open Meditron recipe, because our primary objective is domain specialization and we prefer to keep the core corpus focused and interpretable\. Instead, we provide instructions for enabling replay in the codebase and document the corresponding ablation\.
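One way to picture the optional replay step is as a seeded subsample of the general-purpose pool appended to the medical corpus. The sketch below is only an assumption about how such a mix could be built; the released configuration may implement it differently (for instance through dataset weights in the training framework). The function name, the in-memory list representation, and the reuse of seed 42 are illustrative.

```python
import random


def add_replay(medical_examples, replay_pool, fraction=0.10, seed=42):
    """Append a seeded random subset of replay_pool to the medical examples."""
    rng = random.Random(seed)
    k = round(fraction * len(replay_pool))
    replay_subset = rng.sample(replay_pool, k)  # sampling without replacement
    mixed = list(medical_examples) + replay_subset
    rng.shuffle(mixed)
    return mixed


# Toy stand-ins for the two corpora (hypothetical records).
medical = [{"source": "medical", "id": i} for i in range(200)]
general = [{"source": "replay", "id": i} for i in range(50)]

mixed = add_replay(medical, general)
print(len(mixed), sum(ex["source"] == "replay" for ex in mixed))
```

Fixing the seed keeps the replay subset identical across runs, which matters if the ablation is to be reproduced.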

Table 11: General-purpose benchmark results. For a detailed table that also includes older closed and open-access reference models, see Appendix [F](https://arxiv.org/html/2605.16215#A6).

Table 12: Ablation of judge for open-ended evaluation. Across eight diverse judges spanning model families (Qwen, GPT-OSS, Gemma, GLM, Llama, Nemotron) and sizes (27B–235B), Apertus-70B-MeditronFO is consistently preferred over Apertus-70B-Instruct, with adjusted win rates ranging from 73.2% (Llama-3.3-70B) to 93.7% (Nemotron-3-Nano-30B) and all Likert deltas strictly positive. Notably, gpt-oss-120b is the model used for our synthetic data generation; using the generator as a judge would in principle favor models stylistically closer to its own outputs, yet it does not yield disproportionately higher win rates than other independent judges. This argues against a style-matching explanation for the observed gains. Auto-MOOVE pairwise comparisons ($N=12{,}602$ per pair); complementary results are in Appendix [E](https://arxiv.org/html/2605.16215#A5).

Table 13: Ablation study. All runs use Apertus-70B-Instruct as base. Judge is Qwen3-30B-A3B-Instruct.
## Appendix H Auto-MOOVE validation

Table 14: Auto-MOOVE validation against human clinical judgments. Left: judge $\kappa$ against the full human panel, situated within the distribution of per-rater $\kappa$ values (each rater scored against the consensus of all others; minimum 10 triplets per rater). Right: average Likert score difference (chosen minus rejected) per criterion.

![Refer to caption](https://arxiv.org/html/2605.16215v1/figures/judge_undistiguishable.png)

Figure 9: Distribution of per-rater $\kappa$ values across the 204-rater human panel, with the Auto-MOOVE judge’s $\kappa$ situated within it. The judge falls within $\pm 2\sigma$ of the human mean under both with-ties and no-ties scoring, indicating it is statistically indistinguishable from a typical human rater on this validation set.
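The validation logic can be illustrated with a small sketch that computes an agreement coefficient and then checks whether the judge's value lies within two standard deviations of the human mean. The text does not say which $\kappa$ variant is used, so Cohen's $\kappa$ is assumed here; the function names, the label vocabulary, and the example numbers are hypothetical.

```python
from collections import Counter
import statistics


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two equally long sequences of categorical labels."""
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from the two marginal label distributions.
    expected = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences use a single identical label
    return (observed - expected) / (1.0 - expected)


def within_two_sigma(judge_kappa, rater_kappas):
    """True if judge_kappa lies within mean +/- 2 sample standard deviations."""
    mean = statistics.mean(rater_kappas)
    sigma = statistics.stdev(rater_kappas)
    return abs(judge_kappa - mean) <= 2.0 * sigma


# Hypothetical preferences on six comparisons: "A", "B", or "tie".
consensus = ["A", "B", "A", "tie", "B", "A"]
judge = ["A", "B", "A", "B", "B", "A"]
print(cohen_kappa(judge, consensus))
print(within_two_sigma(0.45, [0.30, 0.40, 0.50, 0.60]))
```

Dropping the `"tie"` items before calling `cohen_kappa` would give a no-ties variant of the same check.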
## Appendix I Training details

### I.1 Infrastructure and framework

All Fully Open Meditron models were trained on a high\-performance computing cluster using NVIDIA GH200 Grace Hopper Superchip nodes with 4 GPUs per node\. Large bases \(Apertus\-70B, OLMo\-2\-32B, EuroLLM\-22B, Gemma\-3\-27B\) were trained on 8 nodes \(32 GH200 GPUs\); small bases \(Apertus\-8B, EuroLLM\-9B\) were trained on 4 nodes \(16 GH200 GPUs\)\.

Training used the Axolotl framework with PyTorch’s `torchrun` launcher and `c10d` rendezvous. The 70B Apertus run used DeepSpeed ZeRO Stage 3 for memory partitioning; all other runs used PyTorch FSDP v2 with transformer-block auto-wrap, sharded state-dict checkpointing, reshard-after-forward, and activation checkpointing. Apertus-70B, OLMo-2-32B, and EuroLLM-22B/9B additionally used the cut-cross-entropy plugin to reduce activation memory at the loss-computation step. All runs used Flash Attention 2 and bfloat16 mixed-precision training.

### I.2 Common training settings

To preserve the alignment work invested in each base, we maintained the instruction\-tuning chat template native to each model \(ChatML for EuroLLM via explicit override; native templates for all others\)\. All runs share the following settings unless noted otherwise:

- Sequence length: 4096 tokens with sample packing.
- Optimizer: AdamW (fused implementation), $\beta_1=0.9$, $\beta_2=0.999$ (default) unless otherwise stated.
- LR scheduler: cosine decay with warmup.
- Gradient clipping: max gradient norm 1.0.
- Random seeds: 42 for both model initialization and data shuffling.
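For readers unfamiliar with the scheduler and clipping settings listed above, the following is a minimal, framework-free sketch of a cosine schedule with linear warmup and of global-norm clipping. It is not the training code: the warmup length, peak learning rate, and total step count are placeholders chosen here (the per-model values are in Table 15), and the linear warmup shape is an assumption.

```python
import math


def lr_at(step, total_steps, peak_lr, warmup_steps, min_lr=0.0):
    """Learning rate at a 0-indexed optimizer step."""
    if step < warmup_steps:
        # Linear warmup that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    span = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / span)
    # Half-cosine decay from peak_lr down to min_lr.
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def clip_scale(grad_norm, max_norm=1.0):
    """Factor applied to all gradients so their global norm is at most max_norm."""
    return min(1.0, max_norm / grad_norm) if grad_norm > 0 else 1.0


# Placeholder values for illustration only.
for step in (0, 99, 100, 550, 1000):
    print(step, lr_at(step, total_steps=1000, peak_lr=1e-5, warmup_steps=100))
print(clip_scale(4.0))
```

Gradients whose global norm is already below the threshold are left unchanged, since the scale factor is capped at 1.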

### I.3 Per-model hyperparameters

Per-model settings are summarized in Table [15](https://arxiv.org/html/2605.16215#A9.T15). Hyperparameters were selected based on each base model’s published instruction-tuning recipe where available, and lightly tuned via short pilot runs on a held-out subset of MedQA/MedMCQA dev splits before full training.

Table 15: Per-model training hyperparameters. “Eff. batch” is the effective batch size in sequences (micro-batch $\times$ gradient accumulation $\times$ world size). All runs use fused AdamW, cosine LR schedule, sequence length 4096 with sample packing, and seed 42.

### I.4 Reproducibility artifacts

The full Axolotl YAML configuration files for each model, the SLURM submission script, and the data preparation pipeline will be released alongside the corpus upon publication\. A 10% Tülu 3 SFT replay variant is also provided as an opt\-in configuration but is not part of the default Fully Open Meditron recipe\.

### I.5 Compute resources

Table 16: Compute for the main MeditronFO training runs. Wall-clock times are taken from cluster job logs. GPU-hours = nodes $\times$ 4 $\times$ wall-clock hours.
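The two bookkeeping formulas in this appendix, the effective batch size from Table 15 and the GPU-hours definition from Table 16, are simple products. A sketch follows, with function names and example inputs that are hypothetical; only the 4 GPUs per node and the 8-node and 4-node cluster sizes come from the text.

```python
GPUS_PER_NODE = 4  # GH200 nodes with 4 GPUs each, as stated in Section I.1


def effective_batch(micro_batch, grad_accum, nodes, gpus_per_node=GPUS_PER_NODE):
    """Effective batch size in sequences: micro-batch x grad accumulation x world size."""
    world_size = nodes * gpus_per_node
    return micro_batch * grad_accum * world_size


def gpu_hours(nodes, wall_clock_hours, gpus_per_node=GPUS_PER_NODE):
    """GPU-hours = nodes x GPUs per node x wall-clock hours."""
    return nodes * gpus_per_node * wall_clock_hours


# Hypothetical settings for an 8-node run (not values from Table 15 or 16).
print(effective_batch(micro_batch=1, grad_accum=4, nodes=8))
print(gpu_hours(nodes=8, wall_clock_hours=10.5))
```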

## Appendix J Synthetic Data Generation Prompts

This appendix documents the exact prompts used by the three synthetic data generation pipelines (Synthetic Curated QA, Guidelines QA, and Synthetic MOOVE). All three pipelines share a common system message and a component-specific developer message, followed by a user message that injects either few-shot exemplars or a source guideline. Prompts are reproduced verbatim from the generation scripts; placeholders such as `{date}`, `{reasoning}`, and the example slots are filled at runtime.

### J.1 Shared System Message

All three pipelines use the harmony-format system message below, with `{date}` set to the generation date and `{reasoning}` set to `low`.

System Message (shared)

```
You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: {date}

Reasoning: {reasoning}

# Valid channels: analysis, commentary, final.
# Channel must be included for every message.
```

### J.2 Guidelines QA Prompt

The Guidelines QA pipeline seeds generation with one full clinical practice guideline per call and elicits ten multiple\-choice vignettes grounded strictly in that document\.

Developer Message (Guidelines QA)

```
You are an expert medical educator and physician tasked with creating high-quality, clinically accurate content for a medical exam. Your task is to generate clinical vignette-style questions along with its correct answer, based STRICTLY on the provided medical guideline. Focus on realistic patient presentations (age, symptoms, physical exam findings), identifying ’red flags’, and diagnostic reasoning highlighted in the text. The timeline and objective progress should always be clear and detailed in the vignettes. Include clear context about site and where people travelled etc. Do not include outside information or unproven treatments.
```

User Message (Guidelines QA)

```
Here is the medical guideline:

=== GUIDELINE START ===
{guideline_text}
=== GUIDELINE END ===

Based ONLY on the guideline above, generate exactly 10 unique MULTIPLE-CHOICE clinical vignette questions and their answers. Each question should present a realistic patient scenario that tests the diagnostic or management principles in the text. For each vignette, provide 4-5 plausible multiple-choice options (A-E). Ensure distractors represent common diagnostic pitfalls or ’next best steps’ that are incorrect based strictly on the provided guideline.

You MUST format EACH of the 10 items exactly as follows, using these specific XML tags:

<qa>
<question>
Patient scenario and the specific question here.
A) [Option 1]
B) [Option 2]
C) [Option 3]
D) [Option 4]
</question>
<answer>The rationale explaining your chain of thought without mentioning the guideline and then Answer: correct answer</answer>
</qa>
```
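Because the user message fixes an XML-like output format, the generations can be split into items with a small parser. The sketch below is an assumed post-processing step, not the authors' script: the regular expressions, the returned field names, and the sample generation are all hypothetical, and the answer-letter pattern is a simple heuristic for outputs that end in a form such as "Answer: B".

```python
import re

QA_RE = re.compile(
    r"<qa>\s*<question>(.*?)</question>\s*<answer>(.*?)</answer>\s*</qa>",
    re.DOTALL,
)
OPTION_RE = re.compile(r"^([A-E])\)\s*(.+)$", re.MULTILINE)
ANSWER_RE = re.compile(r"Answer:\s*([A-E])\b")


def parse_items(generation):
    """Extract question text, options, rationale, and answer letter per <qa> block."""
    items = []
    for question, answer in QA_RE.findall(generation):
        match = ANSWER_RE.search(answer)
        items.append({
            "question": question.strip(),
            "options": dict(OPTION_RE.findall(question)),
            "rationale": answer.strip(),
            "label": match.group(1) if match else None,
        })
    return items


# Hypothetical model output containing a single item.
sample = """<qa>
<question>
A 30-year-old traveller returns with fever. What is the next step?
A) Observe
B) Blood smear
C) Chest X-ray
D) Discharge
</question>
<answer>Fever after travel warrants testing. Answer: B</answer>
</qa>"""

print(parse_items(sample))
```

Items whose label is `None`, or calls that do not return exactly ten items, could be discarded or regenerated; that policy is left open here.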

### J.3 Synthetic Curated QA Prompt

The Synthetic Curated QA pipeline samples five exemplars without replacement from the curated benchmark pool and produces a single new QA pair per call. The pool is partitioned into labeled (multiple-choice, carrying a `label_letter`) and unlabeled (open-ended) buckets, and the user message is specialized accordingly.

Developer Message (Synthetic Curated QA)

```
You are an expert medical educator and physician tasked with creating high-quality, clinically accurate content for a medical exam. Your task is to generate a new, unique, clinical vignette-style question along with its evidence-based correct answer. The timeline and progress should always be clear and detailed in the vignettes. Include clear context about site and where people travelled etc. The content must reflect realistic clinical scenarios, standard-of-care protocols, and well established medical consensus. Avoid scientifically controversial treatments. You will be provided with 5 examples. Use them strictly to understand the desired format, diagnostic difficulty, and clinical depth. DO NOT copy them. Generate a completely new, scientifically rigorous question that would be unconditionally approved by a medical review board.
```

User Message (Synthetic Curated QA)

```
Here are example questions and answers to model your format on:

--- Example 1 ---
<question>{question_1}
<answer>{answer_1}

--- Example 2 ---
<question>{question_2}
<answer>{answer_2}

... (5 examples total) ...

# If sampled from the labeled bucket:
Now generate a brand new, unique, and clinically accurate MULTIPLE-CHOICE medical question and its detailed answer. Ensure the answer matches the formatting tags above.

# If sampled from the unlabeled bucket:
Now generate a brand new, unique, and clinically accurate OPEN-ENDED medical question and its detailed answer. Ensure the answer matches the formatting tags above.
```
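The few-shot assembly described in this subsection can be sketched as follows. This is an illustration under assumptions rather than the generation script itself: the record keys `question` and `answer`, the function name, and the choice to draw all five exemplars from a single bucket are hypothetical; the instruction strings are copied from the user message above.

```python
import random

HEADER = "Here are example questions and answers to model your format on:"
LABELED_TAIL = (
    "Now generate a brand new, unique, and clinically accurate MULTIPLE-CHOICE "
    "medical question and its detailed answer. Ensure the answer matches the "
    "formatting tags above."
)
UNLABELED_TAIL = (
    "Now generate a brand new, unique, and clinically accurate OPEN-ENDED "
    "medical question and its detailed answer. Ensure the answer matches the "
    "formatting tags above."
)


def build_user_message(bucket, labeled, rng, k=5):
    """Sample k exemplars without replacement and format the user message."""
    shots = rng.sample(bucket, k)
    parts = [HEADER]
    for i, example in enumerate(shots, start=1):
        parts.append(
            f"--- Example {i} ---\n"
            f"<question>{example['question']}\n"
            f"<answer>{example['answer']}"
        )
    parts.append(LABELED_TAIL if labeled else UNLABELED_TAIL)
    return "\n\n".join(parts)


# Toy labeled bucket with placeholder content.
bucket = [{"question": f"Q{i}?", "answer": f"A{i}"} for i in range(20)]
message = build_user_message(bucket, labeled=True, rng=random.Random(42))
print(message)
```

Passing an explicit `random.Random` instance keeps exemplar selection reproducible for a given seed.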

### J.4 Synthetic MOOVE Prompt

The Synthetic MOOVE pipeline samples five exemplar prompts without replacement from the MOOVE training split and generates a single new open\-ended clinical scenario per call\. Only the question stem is generated; assistant responses are produced downstream\.

Developer Message (Synthetic MOOVE)

```
You are an expert medical educator and physician tasked with creating high-quality, clinically accurate content. Your task is to generate a new, unique, and realistic medical scenario or question prompt. The content must reflect realistic clinical presentations, inquiries from colleagues, or patient encounters. The timeline and objective progress should always be clear and detailed. Include clear context about site and where people travelled etc. You will be provided with 5 examples. Use them strictly to understand the desired format, diagnostic difficulty, and clinical depth. DO NOT copy them. Generate a completely new question that would be unconditionally approved by a medical review board.
```

User Message (Synthetic MOOVE)

```
Here are example prompts to model your format and clinical depth on:

--- Example 1 ---
<question>
{prompt_1}
</question>

--- Example 2 ---
<question>
{prompt_2}
</question>

... (5 examples total) ...

Now, acting as an expert medical educator, generate a brand new, unique, and clinically accurate medical scenario or question. Wrap your generated scenario strictly within <question> and </question> tags.
```

## Appendix K Decontamination details

We apply a two-stage n-gram and token-alignment decontamination pipeline adapted from Apertus\[[5](https://arxiv.org/html/2605.16215#bib.bib40)\] ([https://github.com/swiss-ai/posttraining-data/tree/main/04-decontamination](https://github.com/swiss-ai/posttraining-data/tree/main/04-decontamination)) to the full curated corpus. The reference set aggregates the prompts of all evaluation benchmarks used in this work: MedQA, MedMCQA, PubMedQA, MedXpertQA, MMLU-Pro, IFEval, and ARC-Challenge.

Training samples are tokenized with `alehc/swissai-tokenizer`. In the first stage, samples sharing any 8-gram with a reference prompt are flagged as candidates. In the second stage, each candidate is token-aligned against the matched reference and removed if the normalized alignment difference is at most $\tau=0.5$. This filters incidental n-gram overlaps while still catching lightly paraphrased contaminations. For each dataset, the pipeline outputs a decontaminated corpus and a report logging removed samples and their matched references.
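The two stages can be illustrated with a self-contained sketch. It is not the Apertus pipeline: whitespace splitting stands in for the actual tokenizer, and `difflib.SequenceMatcher` stands in for the token alignment, with "normalized alignment difference" taken here to mean one minus its similarity ratio. Those substitutions, the function names, and the example texts are assumptions; only the 8-gram size and $\tau=0.5$ come from the text.

```python
from difflib import SequenceMatcher

N = 8      # n-gram size from the text
TAU = 0.5  # removal threshold from the text


def ngrams(tokens, n=N):
    """Return the set of contiguous n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def alignment_difference(a, b):
    """Stand-in for the normalized alignment difference: 1 - difflib similarity."""
    return 1.0 - SequenceMatcher(None, a, b, autojunk=False).ratio()


def decontaminate(samples, references, n=N, tau=TAU):
    """Split samples into (kept, removed) given reference benchmark prompts."""
    ref_tokens = [r.split() for r in references]  # whitespace split: assumption
    ref_grams = [ngrams(t, n) for t in ref_tokens]
    kept, removed = [], []
    for text in samples:
        tokens = text.split()
        grams = ngrams(tokens, n)
        # Stage 1: flag references sharing at least one n-gram with the sample.
        matches = [i for i, g in enumerate(ref_grams) if grams & g]
        # Stage 2: remove only if some matched reference aligns closely enough.
        hit = next((i for i in matches
                    if alignment_difference(tokens, ref_tokens[i]) <= tau), None)
        if hit is None:
            kept.append(text)
        else:
            removed.append((text, references[hit]))  # report entry
    return kept, removed


reference = ("A 45 year old man presents with chest pain radiating to "
             "the left arm what is the next step")
paraphrase = ("A 45 year old man presents with chest pain radiating to "
              "the left arm which is the best step")
incidental = ("In this lecture we note that a 45 year old man presents with "
              "chest pain is a common exam opening and then discuss unrelated "
              "pharmacology topics such as dosing schedules renal clearance and "
              "drug interactions in elderly patients across many settings")
clean = "Describe the mechanism of action of metformin in type 2 diabetes"

kept, removed = decontaminate([paraphrase, incidental, clean], [reference])
print(len(kept), len(removed))
```

In this toy run the lightly paraphrased sample is removed, while the long sample that merely shares one 8-gram with the reference is flagged in stage 1 but kept after stage 2.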

## Appendix L Full openness comparison across medical LLMs

We assess all models discussed in this work along four openness dimensions defined by the Model Openness Framework \(MOF\) \[[38](https://arxiv.org/html/2605.16215#bib.bib36)\]: released weights, publicly available training data, a reproducible training recipe, and medical specialisation\. As argued in Section [2](https://arxiv.org/html/2605.16215#S2), releasing weights alone does not constitute full openness: the pretraining data and training recipe of the base model determine what clinical knowledge and biases the model has absorbed, yet these dimensions are undisclosed for every major open\-weight base used in prior medical LLM work \(Llama 2, Llama 3\.1, Mistral, Qwen2\.5, Gemma 3\)\. Apertus is the first base model at this scale to satisfy all MOF dimensions simultaneously, and Apertus\-MeditronFO inherits this property while adding medical specialisation\.

Table 17: Openness dimensions across medical LLMs and their base models, following the Model Openness Framework\. YES = fully satisfied; ~ = partial \(e\.g\. some data disclosed, recipe absent\); NO = not satisfied\. “Medical specialist” denotes a model adapted for clinical tasks via continued pretraining or supervised fine\-tuning on medical data\.

![Refer to caption](https://arxiv.org/html/2605.16215v1/x17.png)

Figure 10: Medical LLM Openness Tiers

## Appendix M Licenses of existing assets

Table [18](https://arxiv.org/html/2605.16215#A13.T18) lists each existing asset used in this work, with its originating reference, version where applicable, public URL, and license\. All assets are used in accordance with their respective licenses and restricted to research purposes consistent with the originating works\.

Table 18: Licenses of existing assets used in this work\. Datasets and models are listed with their originating reference, version, public URL, and license\. † License terms verified at time of writing; downstream users should re\-verify upstream terms before redistribution\.
