Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling

Abstract

Recent multimodal large language models (MLLMs) have demonstrated strong reasoning ability, yet their reliability as automated evaluators remains limited by a critical weakness: when visual evidence conflicts with textual cues, MLLM judges tend to reward plausible narratives over perceptually correct answers. We identify and systematically analyze this phenomenon, which we term Perceptual Judgment Bias. Through controlled visual perturbations, we show that existing multimodal judges frequently anchor on the response text rather than on their own visual perception, leading to inconsistent and unverifiable evaluations. To address this issue, we introduce the Perceptually Perturbed Judgment Dataset, which constructs minimally edited counterfactual responses that isolate perceptual errors and enable verifiable supervision. Building on this dataset, we develop a unified training framework that combines a structured GRPO-based reward with a batch-ranking objective, achieving coherent global ordering without explicit pairwise labels. Experiments across diverse MLLM-as-a-Judge benchmarks show that our approach substantially improves perceptual fidelity, ranking coherence, and alignment with human evaluation. Our results establish a scalable and generalizable pathway for training multimodal judges that are perceptually grounded, interpretable, and robust to visual-reasoning conflicts.