When Background Matters: Breaking Medical Vision Language Models by Transferable Attack
Summary
MedFocusLeak introduces the first transferable black-box adversarial attack on medical vision-language models, using imperceptible background perturbations to mislead clinical diagnoses across six imaging modalities.
Source: https://huggingface.co/papers/2604.17318
Abstract
MedFocusLeak enables transferable black-box attacks on vision-language models for medical imaging by injecting imperceptible perturbations that redirect model attention, demonstrating significant vulnerabilities in clinical diagnostic reasoning.
Vision-Language Models (VLMs) are increasingly used in clinical diagnostics, yet their robustness to adversarial attacks remains largely unexplored, posing serious risks. Existing medical attacks focus on secondary objectives such as model stealing or adversarial fine-tuning, while transferable attacks from natural images introduce visible distortions that clinicians can easily detect. To address this, we propose MedFocusLeak, a highly transferable black-box multimodal attack that induces incorrect yet clinically plausible diagnoses while keeping perturbations imperceptible. The method injects coordinated perturbations into non-diagnostic background regions and employs an attention distraction mechanism to shift the model's focus away from pathological areas. Extensive evaluations across six medical imaging modalities show that MedFocusLeak achieves state-of-the-art performance, generating misleading yet realistic diagnostic outputs across diverse VLMs. We further introduce a unified evaluation framework with novel metrics that jointly capture attack success and image fidelity, revealing a critical weakness in the reasoning capabilities of modern clinical VLMs.
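The abstract names two ingredients: perturbations confined to non-diagnostic background regions, and an attention-distraction objective that pulls the model's focus away from pathological areas. Below is a minimal, hedged sketch of how such an attack could look against a white-box surrogate using a generic PGD-style loop. The surrogate interface (`surrogate_vlm`), the `background_mask`, the returned attention map, and the loss weights are assumptions made for illustration; MedFocusLeak's actual objective, surrogate setup, and transfer strategy are specified in the paper.

```python
# Conceptual sketch only (not the paper's released code). The surrogate API,
# mask, and loss weights are hypothetical placeholders for illustration.
import torch
import torch.nn.functional as F

def background_attention_attack(image, background_mask, surrogate_vlm, text_tokens,
                                eps=4 / 255, alpha=1 / 255, steps=40, lam=1.0):
    """image: (1,3,H,W) in [0,1]; background_mask: (1,1,H,W), 1 = non-diagnostic region."""
    with torch.no_grad():
        clean_logits, _ = surrogate_vlm(image, text_tokens)
        true_class = clean_logits.argmax(dim=-1)          # proxy label from the clean image

    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        adv = (image + delta * background_mask).clamp(0, 1)
        # Assumed surrogate output: diagnostic logits plus a spatial attention map
        # over the image (taken here to be upsampled to (1,1,H,W)).
        logits, attn = surrogate_vlm(adv, text_tokens)
        cls_loss = -F.cross_entropy(logits, true_class)        # push prediction off the clean label
        distract_loss = (attn * (1 - background_mask)).mean()  # attention still on pathology
        loss = cls_loss + lam * distract_loss
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()            # signed gradient step
            delta.clamp_(-eps, eps)                       # keep the perturbation imperceptible
            delta.grad.zero_()
    return (image + delta.detach() * background_mask).clamp(0, 1)
```

Minimizing the combined loss drives the prediction away from the clean-image label while shrinking the attention mass that remains on the (masked-out) pathological region; the perturbation itself is only ever applied inside the background mask and is clipped to a small epsilon ball.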
Community
Paper submitter
The first black-box transferable attack paper for medical vision-language models.
Get this paper in your agent:
hf papers read 2604.17318
Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
Attacking machine learning with adversarial examples
This article examines adversarial attacks on machine learning models and demonstrates why gradient masking—a defensive technique that attempts to deny attackers access to useful gradients—is fundamentally ineffective. The paper shows that attackers can circumvent gradient masking by training substitute models that mimic the defended model's behavior, making the defense strategy ultimately futile.
Surrogate modeling for interpreting black-box LLMs in medical predictions
Researchers propose a surrogate modeling framework to quantify and interpret latent medical knowledge encoded in black-box LLMs, revealing both valid associations and persistent racial biases.
Conceal, Reconstruct, Jailbreak: Exploiting the Reconstruction-Concealment Tradeoff in MLLMs
This paper analyzes the reconstruction-concealment tradeoff in intent-obfuscation jailbreak attacks on Multimodal Large Language Models (MLLMs). It proposes concealment-aware variant construction and keyword-related distractor images to exploit model vulnerabilities more effectively.
Robust adversarial inputs
Researchers demonstrated adversarial images that reliably fool neural network classifiers across multiple scales and perspectives, challenging assumptions about the robustness of multi-scale image capture systems used in autonomous vehicles.
Adversarial attacks on neural network policies
OpenAI researchers demonstrate that adversarial attacks, previously studied in computer vision, are also effective against neural network policies in reinforcement learning, showing significant performance degradation even with small imperceptible perturbations in white-box and black-box settings.