Mitigating Perceptual Judgment Bias in Multimodal Large Language Model Judges via Perceptual Perturbation and Reward Modeling

Abstract

Recent multimodal large language models (MLLMs) have demonstrated strong reasoning ability, yet their reliability as automated evaluators remains limited by a critical weakness: when visual evidence conflicts with textual cues, MLLM judges tend to reward plausible narratives over perceptually correct answers. We identify and systematically analyze this phenomenon, which we term Perceptual Judgment Bias. Through controlled visual perturbation experiments, we find that existing multimodal judges frequently anchor their judgments on the response text rather than on their own visual perception, leading to inconsistent and non-verifiable evaluations. To address this issue, we introduce the Perceptually Perturbed Judgment Dataset, which constructs minimally edited counterfactual responses that isolate perceptual errors and provide verifiable supervision signals. Building on this dataset, we develop a unified training framework that combines a structured GRPO-based reward with a batch-ranking objective, achieving coherent global ordering without explicit pairwise labels. Experiments across multiple MLLM-as-a-Judge benchmarks show that our approach substantially improves perceptual fidelity, ranking coherence, and alignment with human evaluation. Our results establish a scalable and generalizable pathway for training multimodal judges that are perceptually grounded, interpretable, and robust to visual-reasoning conflicts.
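
To make the idea of a batch-ranking reward combined with group-relative advantages more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that each response in a batch carries a verifiable quality level (for example, fewer injected perceptual errors means higher quality), scores a judge rollout by the fraction of comparable pairs it orders correctly, and then normalizes rewards within a group of rollouts in the usual GRPO style. The function names, the pairwise-concordance reward, and the toy data are all hypothetical.

```python
from itertools import combinations
from statistics import mean, pstdev


def batch_ranking_reward(pred_scores, ref_quality):
    """Fraction of comparable pairs ordered the same way as the reference.

    Hypothetical reward: `ref_quality` is an assumed verifiable quality level
    per response. Pairs with equal reference quality are skipped.
    """
    concordant, comparable = 0, 0
    for i, j in combinations(range(len(pred_scores)), 2):
        ref_diff = ref_quality[i] - ref_quality[j]
        if ref_diff == 0:
            continue  # no reference preference for this pair
        comparable += 1
        pred_diff = pred_scores[i] - pred_scores[j]
        if pred_diff * ref_diff > 0:
            concordant += 1  # judge agrees with the reference ordering
    return concordant / comparable if comparable else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy data (assumed): three responses to one image-question pair, with
# quality 2 = unedited, 1 = one perceptual edit, 0 = two perceptual edits.
ref_quality = [2, 1, 0]

# Scores assigned to those responses by three sampled judge rollouts.
rollouts = [
    [9.0, 6.0, 2.0],  # fully consistent with the reference ordering
    [7.0, 8.0, 3.0],  # one pair swapped
    [1.0, 5.0, 9.0],  # fully reversed
]

rewards = [batch_ranking_reward(scores, ref_quality) for scores in rollouts]
advantages = group_relative_advantages(rewards)
```

In this sketch the ordering is derived from the quality levels themselves, so no hand-labeled pairwise preferences are needed; rollouts that rank the batch more coherently receive positive advantages and those that invert it receive negative ones.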