Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling

Abstract

Recent multimodal large language models (MLLMs) have demonstrated strong reasoning ability, yet their reliability as automated evaluators remains limited by a critical weakness: when visual evidence conflicts with textual cues, MLLM judges tend to reward plausible narratives over perceptually correct answers. We identify and systematically analyze this phenomenon, which we term Perceptual Judgment Bias. Using controlled visual perturbations, we find that existing multimodal judges frequently anchor on the response text rather than on their own visual perception, which leads to inconsistent and non-verifiable evaluations. To address this issue, we introduce the Perceptually Perturbed Judgment Dataset, a collection of minimally edited counterfactual responses that isolate perceptual errors and enable verifiable supervision. Building on this dataset, we develop a unified training framework that combines a structured GRPO-based reward with a batch-ranking objective, achieving a coherent global ordering without explicit pairwise labels. Experiments across diverse MLLM-as-a-Judge benchmarks show that our approach substantially improves perceptual fidelity, ranking coherence, and alignment with human evaluation. Our results establish a scalable and generalizable pathway for training multimodal judges that are perceptually grounded, interpretable, and robust to conflicts between visual evidence and reasoning.