Multi-Task Reinforcement Learning for Enhanced Multimodal LLM-as-a-Judge

Abstract

Multimodal Large Language Models (MLLMs) have been widely adopted as judges (MLLM-as-a-Judge) because their assessments align closely with human judgment across a variety of visual tasks. However, most existing judge models are optimized for a single task and struggle to generalize to diverse contexts, a critical requirement for reliable evaluation. To address this limitation, we propose Multi-Task Reinforcement Learning for MLLM-as-a-Judge (MT-RL-Judge), a framework that jointly optimizes the judge model across multiple tasks and leverages the generalization capabilities of reinforcement learning (RL). Experiments show that MT-RL-Judge outperforms several strong baselines in both judgment consistency and correlation with human preferences. Our approach also generalizes robustly to out-of-distribution tasks, further validating its effectiveness.
