OpenMedQ: Broad Open Pretraining for Medical Vision-Language Models

arXiv cs.AI 06/12/26, 04:00 AM Papers

medical-vlm open-source pretraining vision-language vqa classification llava

Summary

OpenMedQ is a fully-open medical vision-language model pretrained on 14 datasets (~3.35M samples), achieving state-of-the-art results on medical VQA and classification benchmarks.

arXiv:2606.12953v1 Announce Type: new Abstract: We present OpenMedQ, a medical vision-language model pretrained on the broadest fully-open medical mix to date: 14 datasets totaling ~3.35M pretraining samples spanning pathology, radiology, microscopy, and text-only clinical QA. OpenMedQ reaches state-of-the-art BLEU-1 on PathVQA (75.9), beating Med-PaLM M variants up to 562B parameters (~80x larger), and matches the best reported VQA-MED BLEU-1 (64.5). Its vision encoder, transferred to 8 unseen medical classification benchmarks under an identical downstream recipe, obtains the highest average macro-F1 (0.757) among BiomedCLIP (0.745), PMC-CLIP (0.745), PubMedCLIP (0.746), and a from-scratch baseline (0.616). We release our code and an interactive demo is publicly available as a reproducible baseline for the community.


Cached at: 06/12/26, 08:54 AM

# OpenMedQ: Broad Open Pretraining for Medical Vision-Language Models
Source: [https://arxiv.org/html/2606.12953](https://arxiv.org/html/2606.12953)
Medical Imaging with Deep Learning (MIDL) 2026, Short Paper Track

Ibrahim Gulluk¹ (gulluk@stanford.edu), Max Van Puyvelde²,³ (maxvpuyv@stanford.edu), Olivier Gevaert² (ogevaert@stanford.edu). Ibrahim Gulluk and Max Van Puyvelde contributed equally.

¹ Department of Electrical Engineering, Stanford University; ² Department of Biomedical Data Science, Stanford University School of Medicine; ³ Department of Mathematical Modelling, Statistics and Bioinformatics, Ghent University

###### Abstract

We present *OpenMedQ*, a medical vision-language model pretrained on the broadest fully-open medical mix to date: 14 datasets totaling ~3.35M pretraining samples spanning pathology, radiology, microscopy, and text-only clinical QA. OpenMedQ reaches state-of-the-art BLEU-1 on PathVQA (75.9), beating Med-PaLM M variants up to 562B parameters (~80× larger), and matches the best reported VQA-MED BLEU-1 (64.5). Its vision encoder, transferred to 8 unseen medical classification benchmarks under an identical downstream recipe, obtains the highest average macro-F1 (0.757), ahead of BiomedCLIP (0.745), PMC-CLIP (0.745), PubMedCLIP (0.746), and a from-scratch baseline (0.616). We release our [code](https://github.com/gevaertlab/OpenMedQ) and a publicly available interactive demo as a reproducible baseline for the community.

###### keywords:

Medical Vision-Language Models, Medical Image Classification, Open Science

## 1 Introduction

Medical foundation models are increasingly capable, yet most published medical VLMs rely on a handful of narrow pretraining sources and withhold either their weights, their data, or both. Contrastive encoders such as BiomedCLIP (biomedclip), PMC-CLIP (pmcclip), and PubMedCLIP train on single image-caption corpora; generative medical VLMs such as PMC-VQA (pmcvqa) and LLaVA-Med (llavamed) demonstrate strong visual question answering (VQA) on a few benchmarks but use comparably narrow pretraining mixes, while BiomedGPT (biomedgpt) and Med-PaLM M (medpalm) scale data and parameters but do not release weights. This leaves practitioners without a fully-open, broadly-pretrained baseline they can actually inspect, reuse, and extend.

We introduce *OpenMedQ*, a LLaVA-style (llava) VLM (ViT-base (biomedclip) + LLaMA-7B (llama; pmcllama), LoRA (lora)) trained on the broadest open medical pretraining mix to date (14 datasets, ~3.35M samples) with next-token prediction. We will release weights and dataset recipes upon acceptance; a live interactive demo is already available at [https://openmedq.streamlit.app/](https://openmedq.streamlit.app/) for qualitative inspection.

## 2 Method

### Architecture and pretraining.

The vision encoder $f_{\mathrm{vis}}$ is a ViT-base-patch16-224 initialized from BiomedCLIP (biomedclip); a linear projection feeds its image tokens into a LLaMA-7B (llama) language model initialized from PMC-LLaMA (pmcllama). Image and text tokens are concatenated and decoded left-to-right, following LLaVA (llava). We fine-tune with LoRA (lora) of rank $r=8$ using next-token cross-entropy with image and prefix tokens masked. All images are resized to $224 \times 224$; training uses AdamW, batch size 64, learning rate $5 \times 10^{-5}$, for up to 15 epochs on a single NVIDIA A100.
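The loss masking described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the ignore value of -100 follows the common PyTorch cross-entropy convention, the helper names `build_labels` and `supervised_positions` and the toy token ids are hypothetical, and the paper does not say whether a class token is passed alongside the patch tokens.

```python
IGNORE_INDEX = -100  # assumption: conventional "ignore" label for cross-entropy


def build_labels(num_image_tokens, prefix_ids, answer_ids):
    """Return (input_slots, labels) for one training sample.

    Image and prefix (question) positions are masked so that only the
    answer tokens contribute to the next-token cross-entropy.
    """
    # Image positions carry projected patch embeddings, not vocabulary ids;
    # None marks them as placeholders in this sketch.
    input_slots = [None] * num_image_tokens + list(prefix_ids) + list(answer_ids)
    labels = (
        [IGNORE_INDEX] * num_image_tokens
        + [IGNORE_INDEX] * len(prefix_ids)
        + list(answer_ids)
    )
    return input_slots, labels


def supervised_positions(labels):
    """Indices whose tokens are actually scored by the loss."""
    return [i for i, y in enumerate(labels) if y != IGNORE_INDEX]


# A patch-16 ViT on a 224x224 input yields (224 / 16) ** 2 = 196 patch tokens.
num_patches = (224 // 16) ** 2
slots, labels = build_labels(num_patches, prefix_ids=[11, 12, 13], answer_ids=[21, 22])
```

In this toy sample only the last two positions are supervised; the one-position shift between inputs and targets is left to the training loop.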

### Classification transfer.

To probe the vision features produced by pretraining, we detach $f_{\mathrm{vis}}$ and attach a linear head $W \in \mathbb{R}^{2d \times m}$; encoder and head are fine-tuned together on each downstream dataset for 100 epochs. We benchmark OpenMedQ’s encoder against three strong medical contrastive baselines (BiomedCLIP, PMC-CLIP, PubMedCLIP) and a from-scratch baseline, all under an identical downstream recipe so that any gap is attributable to the pretraining.
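A toy sketch of a linear head with input width $2d$ and $m$ output classes is given below. The text does not say how the $2d$-dimensional feature is formed; purely for illustration, this sketch assumes it is the class-token vector concatenated with the mean of the patch-token vectors. The function names and numbers are hypothetical.

```python
def concat_features(cls_vec, patch_vecs):
    """Assumed 2d-dim feature: class-token vector + mean patch-token vector."""
    d = len(cls_vec)
    n = len(patch_vecs)
    mean_patch = [sum(p[j] for p in patch_vecs) / n for j in range(d)]
    return list(cls_vec) + mean_patch


def linear_head(features, W):
    """Compute logits = features @ W, with W stored as 2d rows of m columns."""
    m = len(W[0])
    return [sum(f * row[k] for f, row in zip(features, W)) for k in range(m)]


# Toy sizes: d = 2 and m = 3 classes, so W has shape (2d, m) = (4, 3).
cls_vec = [1.0, 0.0]
patch_vecs = [[0.0, 2.0], [0.0, 4.0]]
W = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]
features = concat_features(cls_vec, patch_vecs)
logits = linear_head(features, W)
```

Because encoder and head are fine-tuned together, gradients would flow through both the head and the features in a real training loop; the sketch only shows the forward shapes.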

## 3 Datasets

### Pretraining mix (14 datasets, ~3.35M samples).

Image-text sources (~2.94M pairs) span pathology (PathVQA (pathvqa)), radiology (VQA-RAD (vqarad), IU-XRAY (iuxray), MIMIC-CXR (mimiccxr), ROCO (roco), OmniMedVQA (omnimedvqa)), mixed modalities (Slake (slake), PMC-OA (pmcclip), PMC-VQA (pmcvqa), VQA-MED (vqamed)), and microscopy (μ-Bench (ubench)). A further ~410K text-only clinical QA samples (MedQA, MedMCQA, PubMedQA) are included to preserve language capability during pretraining.
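The mix can be tallied to check that the groups listed above add up to the headline figures. The grouping keys below are labels chosen for this sketch; the dataset names and the approximate sample counts come from the text.

```python
# Datasets grouped as in the paragraph above (group keys are illustrative labels).
PRETRAINING_MIX = {
    "pathology": ["PathVQA"],
    "radiology": ["VQA-RAD", "IU-XRAY", "MIMIC-CXR", "ROCO", "OmniMedVQA"],
    "mixed": ["Slake", "PMC-OA", "PMC-VQA", "VQA-MED"],
    "microscopy": ["μ-Bench"],
    "text_only_qa": ["MedQA", "MedMCQA", "PubMedQA"],
}

IMAGE_TEXT_PAIRS_M = 2.94   # approximate, in millions
TEXT_ONLY_SAMPLES_M = 0.41  # approximate, in millions

num_datasets = sum(len(names) for names in PRETRAINING_MIX.values())
total_samples_m = round(IMAGE_TEXT_PAIRS_M + TEXT_ONLY_SAMPLES_M, 2)

print(num_datasets, total_samples_m)
```

The tally gives 14 datasets and roughly 3.35M samples, matching the section heading.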

### Classification benchmarks (8 datasets).

We evaluate on CXR8 (cxr8), MedFMC (medfmc) (chest, colon, endo subtasks), Breast-Ultrasound (breastus), CHAOYANG (chaoyang), CBIS-DDSM (cbisddsm), and Mendeley-CXray (mendeley). These datasets were not seen during pretraining.

## 4 Results

![Refer to caption](https://arxiv.org/html/2606.12953v1/x1.png)

Figure 1: (a) Macro-F1 across 8 unseen medical classification benchmarks: all bars share an identical downstream recipe and differ only in the pretrained vision encoder. OpenMedQ attains the highest *Mean* (0.757). (b) OpenMedQ’s pretraining mix: 14 fully-open datasets (~3.35M pairs), colored by modality group.

### Classification transfer.

Figure 1(a) is our headline result. OpenMedQ achieves the highest mean macro-F1 (0.757) across the eight benchmarks, ahead of PubMedCLIP (0.746), PMC-CLIP and BiomedCLIP (0.745), and the from-scratch baseline (0.616). OpenMedQ wins outright on MedFMC-chest and MedFMC-endo, ties PMC-CLIP on CXR8, and trails the best encoder by at most 0.02 on four more; the only meaningful gap is Breast-Ultrasound (0.876 vs. 0.915). Since the downstream recipe is fixed, this delta reflects what OpenMedQ’s pretraining added to the BiomedCLIP initialization.
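For readers unfamiliar with the metric, macro-F1 is the unweighted mean of per-class F1 scores, so rare classes count as much as common ones. The sketch below is a single-label illustration with made-up labels; the paper does not specify its exact implementation, and multi-label benchmarks may be scored differently.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels: class 0 has F1 = 6/7, class 1 has F1 = 4/5.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1]
score = macro_f1(y_true, y_pred)
```

The reported mean of 0.757 is then an average of such per-benchmark scores across the eight datasets.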

### Open-ended VQA.

On PathVQA, OpenMedQ reaches 75.9 BLEU-1, beating prefix tuning (vansonsbeek) (70.3) and all three Med-PaLM M variants up to 562B (medpalm) (72.27) despite using only 7B parameters. On VQA-MED, OpenMedQ reaches 64.5, just above the 2019 challenge best (64.4).
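BLEU-1 is clipped unigram precision multiplied by a brevity penalty. The sketch below shows the standard single-reference form with lowercase whitespace tokenization; the tokenization, the function name `bleu1`, and the example strings are assumptions, since the paper does not describe its scoring script. Scores here lie in [0, 1], whereas the figures above are on a 0 to 100 scale.

```python
import math
from collections import Counter


def bleu1(reference, candidate):
    """Single-reference BLEU-1: brevity penalty x clipped unigram precision."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(ref_tokens)
    cand_counts = Counter(cand_tokens)
    # Each candidate token is credited at most as often as it occurs in the reference.
    clipped = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    precision = clipped / len(cand_tokens)
    # Candidates shorter than the reference are penalized.
    if len(cand_tokens) >= len(ref_tokens):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref_tokens) / len(cand_tokens))
    return brevity_penalty * precision


exact = bleu1("yes", "yes")
partial = bleu1("the left lung", "the right lung")
```

Here `partial` credits two of three candidate words, which illustrates why BLEU-1 rewards surface overlap even when the answer is clinically wrong.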

## 5 Discussion

*Breadth* of open pretraining data is a competitive lever for medical VLMs: at 7B parameters, OpenMedQ sets a new state of the art on PathVQA against Med-PaLM M up to 562B, and its vision encoder beats three strong contrastive medical encoders on average classification transfer. Data diversity is a reproducible lever; proprietary scale is not. The lever has its limits: Med-PaLM M’s larger variants still lead on VQA-RAD and Slake, BLEU-1 captures only surface agreement, and narrow-modality encoders can edge us out on Breast-Ultrasound. The demo is available at [https://openmedq.streamlit.app/](https://openmedq.streamlit.app/).

## References
