MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge
Summary
Researchers introduce MM-JudgeBias, a benchmark that exposes systematic compositional biases in multimodal large language models (MLLMs) used as automatic judges, evaluating 26 state-of-the-art MLLMs on more than 1,800 samples.
Cached at: 04/22/26, 06:17 AM
Source: https://huggingface.co/papers/2604.18164
Abstract
Research identifies systematic biases in multimodal large language models used as automatic evaluators, revealing reliability issues and proposing a benchmark for measuring compositional bias through controlled perturbations and specific metrics.
Multimodal Large Language Models (MLLMs) have been increasingly used as automatic evaluators, a paradigm known as MLLM-as-a-Judge. However, their reliability and vulnerabilities to biases remain underexplored. We find that many MLLM judges fail to reliably integrate key visual or textual cues, yielding unreliable evaluations when evidence is missing or mismatched, and exhibiting instability under semantically irrelevant perturbations. To address this, we systematically define Compositional Bias in MLLM-as-a-Judge systems and introduce MM-JudgeBias, a benchmark for evaluating it. MM-JudgeBias introduces controlled perturbations across Query, Image, and Response, and evaluates model behavior via two complementary metrics: Bias-Deviation (BD) for sensitivity and Bias-Conformity (BC) for stability. Our dataset of over 1,800 curated and refined multimodal samples, drawn from 29 source benchmarks, enables a fine-grained diagnosis of nine bias types across diverse tasks and domains. Experiments on 26 state-of-the-art MLLMs reveal systematic modality neglect and asymmetric evaluation tendencies, underscoring the need for more reliable judges.
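The abstract names the two metrics but does not give their formulas. As a rough illustration only, the sketch below assumes a sensitivity-style rate (how often a judge's score moves when a perturbation should matter, such as removing the image) and a stability-style rate (how often the score stays put when a perturbation is semantically irrelevant). The names JudgedPair, deviation_rate, conformity_rate, min_shift, and tolerance are hypothetical and are not taken from the paper, which may define BD and BC differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JudgedPair:
    """Judge scores for one sample before and after a perturbation (hypothetical record)."""
    original_score: float
    perturbed_score: float


def deviation_rate(pairs: List[JudgedPair], min_shift: float = 1.0) -> float:
    """Assumed sensitivity proxy: share of pairs whose score moved by at least min_shift.

    Intended for perturbations that SHOULD change the verdict
    (e.g. evidence removed or mismatched). Not the paper's BD formula.
    """
    if not pairs:
        raise ValueError("pairs must not be empty")
    moved = sum(abs(p.original_score - p.perturbed_score) >= min_shift for p in pairs)
    return moved / len(pairs)


def conformity_rate(pairs: List[JudgedPair], tolerance: float = 0.0) -> float:
    """Assumed stability proxy: share of pairs whose score stayed within tolerance.

    Intended for perturbations that should NOT change the verdict
    (semantically irrelevant edits). Not the paper's BC formula.
    """
    if not pairs:
        raise ValueError("pairs must not be empty")
    stable = sum(abs(p.original_score - p.perturbed_score) <= tolerance for p in pairs)
    return stable / len(pairs)
```

Under these assumptions, a judge that ignores the image would show a low deviation rate on image-removal perturbations, which is one way the "modality neglect" described above could surface in the numbers.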
Similar Articles
Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling
This paper identifies perceptual judgment bias in multimodal LLM judges, where they over-reward fluent but visually wrong responses, and proposes a dataset PPJD and a trained model Perception-Judge using GRPO with batch-ranking reward to mitigate this bias and improve perception-grounded evaluation.
A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs
Researchers from Jilin University systematically evaluate positional bias in multi-video summarization using MLLMs, constructing a benchmark from ActivityNet and News videos and assessing nine models with metrics including Coverage, Directional Positional Bias, and Middle-Edge Gap. Results show positional effects are domain- and model-dependent, and increasing visual or generation budget does not uniformly resolve the imbalance.
Judge Circuits
This paper investigates the internal mechanisms of LLM-as-a-judge, finding a shared Latent Evaluator sub-graph in mid-to-late MLPs across models that handles abstract judging, while format-specific terminal branches map the judgment to output tokens, revealing the cause of format-induced inconsistency.
StylisticBias: A Few Human Visual Cues Drive Most Social Biases in MLLMs
A new benchmark called StylisticBias systematically evaluates attribute-level social bias in multimodal large language models, finding that a small set of visual cues like fashion style drive most biases.
The Coin Flip Judge? Reliability and Bias in LLM-as-a-Judge Evaluation
This paper investigates the run-to-run reliability of LLM-as-a-Judge evaluations, finding that pairwise preferences flip 13.6% of the time on average, with significant first-position bias in GPT-4o-mini, and recommends multi-trial aggregation and position randomization.