OpenMedQ: Broad Open Pretraining for Medical Vision-Language Models
Summary
OpenMedQ is a fully-open medical vision-language model pretrained on 14 datasets (~3.35M samples), achieving state-of-the-art results on medical VQA and classification benchmarks.
# OpenMedQ: Broad Open Pretraining for Medical Vision-Language Models
Source: [https://arxiv.org/html/2606.12953](https://arxiv.org/html/2606.12953)
Medical Imaging with Deep Learning \(MIDL\) 2026, Short Paper Track

Ibrahim Gulluk¹\* \(gulluk@stanford\.edu\), Max Van Puyvelde²,³\* \(maxvpuyv@stanford\.edu\), Olivier Gevaert² \(ogevaert@stanford\.edu\)

¹ Department of Electrical Engineering, Stanford University
² Department of Biomedical Data Science, Stanford University School of Medicine
³ Department of Mathematical Modelling, Statistics and Bioinformatics, Ghent University

\* Equal contribution
###### Abstract
We present *OpenMedQ*, a medical vision\-language model pretrained on the broadest fully\-open medical mix to date: 14 datasets totaling ∼3\.35M pretraining samples spanning pathology, radiology, microscopy, and text\-only clinical QA\. OpenMedQ reaches state\-of\-the\-art BLEU\-1 on PathVQA \(75\.9\), beating Med\-PaLM M variants up to 562B parameters \(∼80× larger\), and matches the best reported VQA\-MED BLEU\-1 \(64\.5\)\. Its vision encoder, transferred to 8 unseen medical classification benchmarks under an identical downstream recipe, obtains the highest average macro\-F1 \(0\.757\), ahead of BiomedCLIP \(0\.745\), PMC\-CLIP \(0\.745\), PubMedCLIP \(0\.746\), and a from\-scratch baseline \(0\.616\)\. We release our [code](https://github.com/gevaertlab/OpenMedQ) and a publicly available interactive demo as a reproducible baseline for the community\.
###### keywords:
Medical Vision\-Language Models, Medical Image Classification, Open Science
## 1 Introduction
Medical foundation models are increasingly capable, yet most published medical VLMs rely on a handful of narrow pretraining sources and withhold either their weights, their data, or both\. Contrastive encoders such as BiomedCLIP \(biomedclip\), PMC\-CLIP \(pmcclip\), and PubMedCLIP train on single image\-caption corpora; generative medical VLMs such as PMC\-VQA \(pmcvqa\) and LLaVA\-Med \(llavamed\) demonstrate strong visual question answering \(VQA\) on a few benchmarks but use comparably narrow pretraining mixes, while BiomedGPT \(biomedgpt\) and Med\-PaLM M \(medpalm\) scale data and parameters but do not release weights\. This leaves practitioners without a fully\-open, broadly\-pretrained baseline they can actually inspect, reuse, and extend\.
We introduce *OpenMedQ*, a LLaVA\-style \(llava\) VLM \(ViT\-base \(biomedclip\) \+ LLaMA\-7B \(llama;pmcllama\), LoRA \(lora\)\) trained on the broadest open medical pretraining mix to date \(14 datasets, ∼3\.35M samples\) with next\-token prediction\. We will release weights and dataset recipes upon acceptance; a live interactive demo is already available at [https://openmedq\.streamlit\.app/](https://openmedq.streamlit.app/) for qualitative inspection\.
## 2 Method
### Architecture and pretraining\.
The vision encoder $f_{\mathrm{vis}}$ is a ViT\-base\-patch16\-224 initialized from BiomedCLIP \(biomedclip\); a linear projection feeds its image tokens into a LLaMA\-7B \(llama\) language model initialized from PMC\-LLaMA \(pmcllama\)\. Image and text tokens are concatenated and decoded left\-to\-right, following LLaVA \(llava\)\. We fine\-tune with LoRA \(lora\) of rank $r=8$ using next\-token cross\-entropy with image and prefix tokens masked\. All images are resized to $224\times224$; training uses AdamW, batch size 64, learning rate $5\times10^{-5}$, for up to 15 epochs on a single NVIDIA A100\.
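
As a rough illustration of "next-token cross-entropy with image and prefix tokens masked", the sketch below computes the loss over answer positions only. It is a minimal pure-Python sketch, not the authors' implementation: the names `IGNORE`, `build_labels`, and `masked_next_token_loss` are hypothetical, the sequence layout `[image | prefix | answer]` is an assumption, and the sentinel value -100 is chosen only because it mirrors the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
import math

IGNORE = -100  # hypothetical sentinel for positions excluded from the loss


def build_labels(n_image, prefix_ids, answer_ids):
    """Labels aligned with the assumed sequence layout [image | prefix | answer].

    Image and prefix positions receive IGNORE, so only answer tokens are supervised.
    """
    return [IGNORE] * (n_image + len(prefix_ids)) + list(answer_ids)


def masked_next_token_loss(logits, labels):
    """Mean cross-entropy over supervised positions only.

    logits[t] is the score vector used to predict the token at position t + 1.
    """
    total, count = 0.0, 0
    for t in range(len(labels) - 1):
        target = labels[t + 1]
        if target == IGNORE:
            continue  # masked image or prefix token: contributes nothing
        row = logits[t]
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[target]
        count += 1
    return total / count if count else 0.0


# Toy example: 2 image tokens, 2 prefix tokens, 2 answer tokens, vocabulary of 4.
labels = build_labels(n_image=2, prefix_ids=[5, 6], answer_ids=[1, 2])
uniform_logits = [[0.0] * 4 for _ in labels]
loss = masked_next_token_loss(uniform_logits, labels)
```

With uniform logits over a vocabulary of four tokens, each of the two supervised positions contributes log 4, so the mean loss is log 4 regardless of how many image or prefix tokens precede the answer.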
### Classification transfer\.
To probe the vision features produced by pretraining, we detach $f_{\mathrm{vis}}$ and attach a linear head $W\in\mathbb{R}^{2d\times m}$; encoder and head are fine\-tuned together on each downstream dataset for 100 epochs\. We benchmark OpenMedQ’s encoder against three strong medical contrastive baselines \(BiomedCLIP, PMC\-CLIP, PubMedCLIP\) and a from\-scratch baseline, all under an identical downstream recipe so that any gap is attributable to the pretraining\.
## 3 Datasets
### Pretraining mix \(14 datasets, ∼3\.35M samples\)\.
Image\-text sources \(∼2\.94M pairs\) span pathology \(PathVQA \(pathvqa\)\), radiology \(VQA\-RAD \(vqarad\), IU\-XRAY \(iuxray\), MIMIC\-CXR \(mimiccxr\), ROCO \(roco\), OmniMedVQA \(omnimedvqa\)\), mixed modalities \(Slake \(slake\), PMC\-OA \(pmcclip\), PMC\-VQA \(pmcvqa\), VQA\-MED \(vqamed\)\), and microscopy \(μ\-Bench \(ubench\)\)\. A further ∼410K text\-only clinical QA samples \(MedQA, MedMCQA, PubMedQA\) are included to preserve language capability during pretraining\.
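
The bookkeeping behind the mix can be checked with a few lines of arithmetic. The grouping and the rounded counts below are copied from the paragraph above; the dictionary layout and variable names are hypothetical, and per-dataset sizes are deliberately omitted because the text does not give them.

```python
# Dataset names grouped as in the text; the structure itself is illustrative.
PRETRAINING_MIX = {
    "pathology": ["PathVQA"],
    "radiology": ["VQA-RAD", "IU-XRAY", "MIMIC-CXR", "ROCO", "OmniMedVQA"],
    "mixed": ["Slake", "PMC-OA", "PMC-VQA", "VQA-MED"],
    "microscopy": ["mu-Bench"],
    "text_only": ["MedQA", "MedMCQA", "PubMedQA"],
}

# Rounded totals quoted in the text (approximate, not exact counts).
IMAGE_TEXT_PAIRS = 2_940_000
TEXT_ONLY_SAMPLES = 410_000

n_datasets = sum(len(names) for names in PRETRAINING_MIX.values())
total_samples = IMAGE_TEXT_PAIRS + TEXT_ONLY_SAMPLES
text_only_share = TEXT_ONLY_SAMPLES / total_samples

print(n_datasets, total_samples, round(text_only_share, 3))
```

Under these rounded figures, the eleven image-text sources and three text-only sources add up to the 14 datasets and roughly 3.35M samples quoted in the heading, with text-only QA making up about an eighth of the mix.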
### Classification benchmarks \(8 datasets\)\.
We evaluate on CXR8 \(cxr8\), MedFMC \(medfmc\) \(chest, colon, endo subtasks\), Breast\-Ultrasound \(breastus\), CHAOYANG \(chaoyang\), CBIS\-DDSM \(cbisddsm\), and Mendeley\-CXray \(mendeley\)\. These datasets were not seen during pretraining\.
## 4 Results
Figure 1: \(a\) Macro\-F1 across 8 unseen medical classification benchmarks: all bars share an identical downstream recipe and differ only in the pretrained vision encoder\. OpenMedQ attains the highest *Mean* \(0\.757\)\. \(b\) OpenMedQ’s pretraining mix: 14 fully\-open datasets \(∼3\.35M samples\), colored by modality group\.

### Classification transfer\.
Figure 1\(a\) is our headline result\. OpenMedQ achieves the highest mean macro\-F1 \(0\.757\) across the eight benchmarks, ahead of PubMedCLIP \(0\.746\), PMC\-CLIP and BiomedCLIP \(0\.745\), and the from\-scratch baseline \(0\.616\)\. OpenMedQ wins outright on MedFMC\-chest and MedFMC\-endo, ties PMC\-CLIP on CXR8, and trails the best encoder by at most 0\.02 on four more; the only meaningful gap is Breast\-Ultrasound \(0\.876 vs\. 0\.915\)\. Since the downstream recipe is fixed, this delta reflects what OpenMedQ’s pretraining added to the BiomedCLIP initialization\.
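
For readers unfamiliar with the metric, macro-F1 is the unweighted mean of per-class F1 scores, so rare classes count as much as common ones. The sketch below is a minimal pure-Python version written for illustration; the function name `macro_f1` and the toy labels are hypothetical, and the paper's evaluation code may differ in details such as how absent classes are handled.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in labels or predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy imbalanced example: four samples of class 0, two of class 1.
score = macro_f1([0, 0, 0, 0, 1, 1], [0, 0, 0, 1, 1, 1])
```

In the toy example class 0 scores 6/7 and class 1 scores 4/5, giving a macro-F1 of about 0.829. The "Mean" reported in Figure 1(a) is then a further average of such per-dataset scores over the eight benchmarks.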
### Open\-ended VQA\.
On PathVQA, OpenMedQ reaches 75\.9 BLEU\-1, beating prefix tuning \(vansonsbeek\) \(70\.3\) and all three Med\-PaLM M variants up to 562B \(medpalm\) \(72\.27\) despite using only 7B parameters\. On VQA\-MED, OpenMedQ reaches 64\.5, just above the 2019 challenge best \(64\.4\)\.
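
BLEU-1 measures clipped unigram precision between a generated answer and a reference, scaled by a brevity penalty. The sketch below is a simplified single-reference version for illustration only: the function name `bleu1`, the lowercasing, and the whitespace tokenization are assumptions, and the scores above are reported on a 0 to 100 scale by whatever evaluation script the authors used.

```python
import math
from collections import Counter


def bleu1(reference, hypothesis):
    """Single-reference BLEU-1: clipped unigram precision times brevity penalty."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not hyp:
        return 0.0
    ref_counts = Counter(ref)
    hyp_counts = Counter(hyp)
    # Each hypothesis word is credited at most as often as it occurs in the reference.
    clipped = sum(min(n, ref_counts[w]) for w, n in hyp_counts.items())
    precision = clipped / len(hyp)
    # Penalize hypotheses shorter than the reference.
    penalty = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return penalty * precision


exact = bleu1("yes", "yes")
short = bleu1("the left lung", "left lung")
wrong = bleu1("yes", "no")
```

An exact match scores 1.0 and a disjoint answer scores 0.0. The two-word answer "left lung" has perfect unigram precision against "the left lung" but is shortened by the brevity penalty exp(1 - 3/2), which illustrates why BLEU-1 captures only surface agreement.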
## 5 Discussion
*Breadth* of open pretraining data is a competitive lever for medical VLMs: at 7B parameters, OpenMedQ sets a new state of the art on PathVQA against Med\-PaLM M up to 562B, and its vision encoder beats three strong contrastive medical encoders on average classification transfer\. Data diversity is a reproducible lever; proprietary scale is not\. The lever has its limits: Med\-PaLM M’s larger variants still lead on VQA\-RAD and Slake, BLEU\-1 captures only surface agreement, and narrow\-modality encoders can edge us out on Breast\-Ultrasound\. The demo is available at [https://openmedq\.streamlit\.app/](https://openmedq.streamlit.app/)\.