Judge Anything: MLLM as a Judge Across Any Modality

Abstract

Evaluating generative foundation models on open-ended multimodal understanding (MMU) and generation (MMG) tasks across diverse modalities (e.g., images, audio, video) poses significant challenges due to the complexity of cross-modal interactions. To this end, the idea of using Multimodal LLMs (MLLMs) as automated judges has emerged, with encouraging results in assessing vision-language understanding tasks. This paper goes further and extends MLLM-as-a-Judge across modalities in a unified manner by introducing two benchmarks, TaskAnything and JudgeAnything, which respectively evaluate the overall performance and the judging capabilities of MLLMs on any-to-any modality tasks. Specifically, TaskAnything evaluates MMU and MMG capabilities across 15 any-to-any modality categories, using 1,500 queries curated from well-established benchmarks. JudgeAnything, in turn, evaluates the judging capabilities of 5 advanced MLLMs (e.g., GPT-4o and Gemini-2.0-Flash) from the perspectives of Pair Comparison and Score Evaluation, providing a standardized testbed that incorporates human judgments and detailed rubrics. Our extensive experiments reveal that while these MLLMs show promise in assessing MMU (averaging 66.55% in the Pair Comparison setting and 42.79% in the Score Evaluation setting), they encounter significant challenges with MMG tasks (averaging only 53.37% in the Pair Comparison setting and 30.05% in the Score Evaluation setting), exposing cross-modality biases and hallucination issues. To address this, we present OmniArena, an automated platform for evaluating omni-models and multimodal reward models. Our work highlights the need for fairer evaluation protocols and stronger alignment with human preferences. The source code and dataset are publicly available at: https://urrealhero.github.io/judgeanythingweb/.
