MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge
April 20, 2026
Authors: Sua Lee, Sanghee Park, Jinbae Im
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) are increasingly used as automatic evaluators, a paradigm known as MLLM-as-a-Judge. However, their reliability and vulnerability to biases remain underexplored. We find that many MLLM judges fail to reliably integrate key visual or textual cues, yielding unreliable evaluations when evidence is missing or mismatched, and exhibiting instability under semantically irrelevant perturbations. To address this, we systematically define Compositional Bias in MLLM-as-a-Judge systems and introduce MM-JudgeBias, a benchmark for evaluating it. MM-JudgeBias applies controlled perturbations across Query, Image, and Response, and evaluates model behavior via two complementary metrics: Bias-Deviation (BD) for sensitivity and Bias-Conformity (BC) for stability. Our dataset of over 1,800 curated and refined multimodal samples, drawn from 29 source benchmarks, enables fine-grained diagnosis of nine bias types across diverse tasks and domains. Experiments on 26 state-of-the-art MLLMs reveal systematic modality neglect and asymmetric evaluation tendencies, underscoring the need for more reliable judges.
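The abstract describes BD as measuring sensitivity to meaningful perturbations and BC as measuring stability under irrelevant ones. As a rough illustration only (the paper's exact metric definitions are not given here), such paired metrics could be computed by comparing a judge's verdicts before and after each perturbation; all function names and the verdict format below are hypothetical:

```python
from typing import List

def bias_deviation(original: List[str], perturbed: List[str]) -> float:
    """Hypothetical BD-style score: fraction of verdicts that change
    after a perturbation that should alter the evidence (sensitivity)."""
    assert len(original) == len(perturbed) and original
    changed = sum(o != p for o, p in zip(original, perturbed))
    return changed / len(original)

def bias_conformity(original: List[str], perturbed: List[str]) -> float:
    """Hypothetical BC-style score: fraction of verdicts that stay the
    same under a semantically irrelevant perturbation (stability)."""
    assert len(original) == len(perturbed) and original
    same = sum(o == p for o, p in zip(original, perturbed))
    return same / len(original)

# Toy usage: verdicts on four samples, before vs. after perturbation.
orig = ["A", "B", "A", "A"]
pert = ["A", "B", "B", "A"]
print(bias_deviation(orig, pert))   # → 0.25
print(bias_conformity(orig, pert))  # → 0.75
```

A reliable judge would score high on a BD-style metric when the perturbation removes or contradicts key evidence, and high on a BC-style metric when the perturbation is semantically irrelevant.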