Mitigating Perceptual Judgment Bias in Multimodal LLM Judges through Perceptual Perturbation and Reward Modeling

Abstract

Recent multimodal large language models (MLLMs) have demonstrated strong reasoning ability, yet their reliability as automated evaluators remains limited by a critical weakness: when visual evidence conflicts with textual cues, MLLM judges tend to reward plausible narratives over perceptually correct answers. We identify and systematically analyze this phenomenon, which we term Perceptual Judgment Bias. Using controlled visual perturbations, we show that existing multimodal judges frequently anchor on the response text instead of their own visual perception, leading to inconsistent and non-verifiable evaluations. To address this issue, we introduce the Perceptually Perturbed Judgment Dataset, which constructs minimally edited counterfactual responses that isolate perceptual errors and enable verifiable supervision. Building on this dataset, we develop a unified training framework that combines a structured GRPO-based reward with a batch-ranking objective, achieving coherent global ordering without explicit pairwise labels. Experiments across diverse MLLM-as-a-Judge benchmarks show that our approach substantially improves perceptual fidelity, ranking coherence, and alignment with human evaluation. Our results establish a scalable and generalizable pathway for training multimodal judges that are perceptually grounded, interpretable, and robust to visual-reasoning conflicts.