MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge
April 20, 2026
Authors: Sua Lee, Sanghee Park, Jinbae Im
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) are increasingly used as automatic evaluators, a paradigm known as MLLM-as-a-Judge. However, their reliability and vulnerability to biases remain underexplored. We find that many MLLM judges fail to reliably integrate key visual or textual cues, yielding unreliable evaluations when evidence is missing or mismatched and exhibiting instability under semantically irrelevant perturbations. To address this, we systematically define Compositional Bias in MLLM-as-a-Judge systems and introduce MM-JudgeBias, a benchmark for evaluating it. MM-JudgeBias applies controlled perturbations across the Query, Image, and Response, and evaluates model behavior via two complementary metrics: Bias-Deviation (BD) for sensitivity and Bias-Conformity (BC) for stability. Our dataset of over 1,800 curated and refined multimodal samples, drawn from 29 source benchmarks, enables fine-grained diagnosis of nine bias types across diverse tasks and domains. Experiments on 26 state-of-the-art MLLMs reveal systematic modality neglect and asymmetric evaluation tendencies, underscoring the need for more reliable judges.
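The abstract names the two metrics but does not define them. As a minimal illustrative sketch, assuming BD is the mean absolute shift in judge scores between original and perturbed inputs, and BC is the fraction of verdicts left unchanged by a semantically irrelevant perturbation, the computation might look like the following (function names and score conventions are hypothetical, not taken from the paper):

from typing import Sequence

def bias_deviation(orig_scores: Sequence[float],
                   pert_scores: Sequence[float]) -> float:
    # Hypothetical Bias-Deviation (BD): mean absolute score shift
    # between original and perturbed judge evaluations. Higher values
    # indicate greater sensitivity to the applied perturbation.
    assert len(orig_scores) == len(pert_scores)
    return sum(abs(o - p) for o, p in zip(orig_scores, pert_scores)) / len(orig_scores)

def bias_conformity(orig_verdicts: Sequence[str],
                    pert_verdicts: Sequence[str]) -> float:
    # Hypothetical Bias-Conformity (BC): fraction of samples whose
    # verdict is unchanged under a semantically irrelevant perturbation.
    # Higher values indicate greater stability.
    assert len(orig_verdicts) == len(pert_verdicts)
    matches = sum(o == p for o, p in zip(orig_verdicts, pert_verdicts))
    return matches / len(orig_verdicts)

# Example: a judge whose scores drift slightly and whose verdict
# flips on one of three samples after perturbation.
if __name__ == "__main__":
    print(bias_deviation([4.0, 3.5, 5.0], [3.0, 3.5, 4.5]))   # 0.5
    print(bias_conformity(["A", "B", "A"], ["A", "A", "A"]))  # ~0.667

Under this reading, an ideal judge would show low BD on perturbations that remove or contradict evidence only when its verdict should change, and high BC on perturbations that are semantically irrelevant.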