MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge

Abstract

Multimodal Large Language Models (MLLMs) are increasingly used as automatic evaluators, a paradigm known as MLLM-as-a-Judge. However, their reliability and vulnerability to bias remain underexplored. We find that many MLLM judges fail to reliably integrate key visual or textual cues: they yield unreliable evaluations when evidence is missing or mismatched, and they are unstable under semantically irrelevant perturbations. To address this, we systematically define Compositional Bias in MLLM-as-a-Judge systems and introduce MM-JudgeBias, a benchmark for evaluating it. MM-JudgeBias applies controlled perturbations to the Query, Image, and Response, and evaluates model behavior with two complementary metrics: Bias-Deviation (BD) for sensitivity and Bias-Conformity (BC) for stability. Our dataset of over 1,800 curated and refined multimodal samples, drawn from 29 source benchmarks, enables fine-grained diagnosis of nine bias types across diverse tasks and domains. Experiments on 26 state-of-the-art MLLMs reveal systematic modality neglect and asymmetric evaluation tendencies, underscoring the need for more reliable judges.
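The abstract does not give the formulas for BD and BC, so the following is only an illustrative sketch of the general idea of perturbation-based judge evaluation, not the benchmark's actual metrics. It assumes a judge that returns a numeric score, and it uses two hypothetical quantities: a sensitivity-style rate (how often the score changes when a perturbation should matter, such as a mismatched image) and a stability-style rate (how often the score stays put when a perturbation is semantically irrelevant). The names `JudgedPair`, `sensitivity`, `stability`, and `tolerance` are assumptions introduced for this example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class JudgedPair:
    """Judge scores for one sample before and after a controlled perturbation."""
    original_score: float
    perturbed_score: float


def changed_rate(pairs: Sequence[JudgedPair], tolerance: float = 0.0) -> float:
    """Fraction of pairs whose score moved by more than `tolerance`."""
    if not pairs:
        raise ValueError("need at least one judged pair")
    changed = sum(
        abs(p.original_score - p.perturbed_score) > tolerance for p in pairs
    )
    return changed / len(pairs)


def sensitivity(pairs: Sequence[JudgedPair], tolerance: float = 0.0) -> float:
    """Hypothetical sensitivity-style rate.

    Intended for perturbations that SHOULD change the verdict
    (e.g. evidence removed or mismatched); higher is better.
    """
    return changed_rate(pairs, tolerance)


def stability(pairs: Sequence[JudgedPair], tolerance: float = 0.0) -> float:
    """Hypothetical stability-style rate.

    Intended for semantically irrelevant perturbations,
    where the verdict should NOT change; higher is better.
    """
    return 1.0 - changed_rate(pairs, tolerance)


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    meaningful = [JudgedPair(8, 8), JudgedPair(8, 3), JudgedPair(7, 2), JudgedPair(9, 9)]
    irrelevant = [JudgedPair(8, 8), JudgedPair(7, 7), JudgedPair(6, 7), JudgedPair(9, 9)]
    print("sensitivity:", sensitivity(meaningful))
    print("stability:", stability(irrelevant))
```

Under these assumed definitions, a judge that ignores the image would show a low sensitivity-style rate on image perturbations, which is one way the modality neglect described above could become visible.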
