Judge Anything: MLLM as a Judge Across Any Modality
March 21, 2025
Authors: Shu Pu, Yaochen Wang, Dongping Chen, Yuhang Chen, Guohao Wang, Qi Qin, Zhongyi Zhang, Zhiyuan Zhang, Zetong Zhou, Shuang Gong, Yi Gui, Yao Wan, Philip S. Yu
cs.AI
Abstract
Evaluating generative foundation models on open-ended multimodal
understanding (MMU) and generation (MMG) tasks across diverse modalities (e.g.,
images, audio, video) poses significant challenges due to the complexity of
cross-modal interactions. To this end, the idea of utilizing Multimodal LLMs
(MLLMs) as automated judges has emerged, with encouraging results in assessing
vision-language understanding tasks. Moving further, this paper extends
MLLM-as-a-Judge across modalities in a unified manner by introducing two
benchmarks, TaskAnything and JudgeAnything, to respectively evaluate the
overall performance and judging capabilities of MLLMs across any-to-any
modality tasks. Specifically, TaskAnything evaluates the MMU and MMG
capabilities across 15 any-to-any modality categories, employing 1,500 queries
curated from well-established benchmarks. Furthermore, JudgeAnything evaluates
the judging capabilities of 5 advanced MLLMs (e.g., GPT-4o and Gemini-2.0-Flash) from
the perspectives of Pair Comparison and Score Evaluation, providing a
standardized testbed that incorporates human judgments and detailed rubrics.
Our extensive experiments reveal that while these MLLMs show promise in
assessing MMU (i.e., achieving an average of 66.55% in the Pair Comparison setting
and 42.79% in the Score Evaluation setting), they encounter significant challenges
with MMG tasks (i.e., averaging only 53.37% in the Pair Comparison setting and
30.05% in the Score Evaluation setting), exposing cross-modality biases and
hallucination issues. To address this, we present OmniArena, an automated
platform for evaluating omni-models and multimodal reward models. Our work
highlights the need for fairer evaluation protocols and stronger alignment with
human preferences. The source code and dataset are publicly available at:
https://urrealhero.github.io/judgeanythingweb/.
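
The agreement figures above compare MLLM judges against human annotators under two settings: Pair Comparison (which of two responses is better) and Score Evaluation (rubric-based scoring). The sketch below illustrates, under stated assumptions, how such agreement rates might be computed; the record fields (judge_choice, human_choice, judge_score, human_score) and the score tolerance parameter are hypothetical and are not taken from the paper's released code.

```python
# A minimal sketch (not the authors' implementation) of judge-human agreement
# for the two evaluation settings described in the abstract.
# The record layout and the tolerance default are illustrative assumptions.

from typing import Dict, List


def pair_comparison_agreement(records: List[Dict]) -> float:
    """Fraction of items where the MLLM judge picks the same outcome
    ("A", "B", or "tie") as the human annotator."""
    matches = sum(1 for r in records if r["judge_choice"] == r["human_choice"])
    return matches / len(records)


def score_evaluation_agreement(records: List[Dict], tolerance: int = 0) -> float:
    """Fraction of items where the judge's rubric score falls within
    `tolerance` points of the human score (exact match by default)."""
    matches = sum(
        1 for r in records
        if abs(r["judge_score"] - r["human_score"]) <= tolerance
    )
    return matches / len(records)


if __name__ == "__main__":
    # Toy data purely for illustration.
    pair_records = [
        {"judge_choice": "A", "human_choice": "A"},
        {"judge_choice": "B", "human_choice": "tie"},
    ]
    score_records = [
        {"judge_score": 4, "human_score": 4},
        {"judge_score": 2, "human_score": 5},
    ]
    print(f"Pair Comparison agreement: {pair_comparison_agreement(pair_records):.2%}")
    print(f"Score Evaluation agreement: {score_evaluation_agreement(score_records):.2%}")
```

A stricter variant could report correlation (e.g., Pearson or Spearman) for Score Evaluation instead of match rate; the abstract's percentages are consistent with an agreement-style metric, so that is what the sketch assumes.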