Judge Anything: MLLM as a Judge Across Any Modality
March 21, 2025
作者: Shu Pu, Yaochen Wang, Dongping Chen, Yuhang Chen, Guohao Wang, Qi Qin, Zhongyi Zhang, Zhiyuan Zhang, Zetong Zhou, Shuang Gong, Yi Gui, Yao Wan, Philip S. Yu
cs.AI
Abstract
Evaluating generative foundation models on open-ended multimodal
understanding (MMU) and generation (MMG) tasks across diverse modalities (e.g.,
images, audio, video) poses significant challenges due to the complexity of
cross-modal interactions. To this end, the idea of utilizing Multimodal LLMs
(MLLMs) as automated judges has emerged, with encouraging results in assessing
vision-language understanding tasks. Going further, this paper extends
MLLM-as-a-Judge across modalities in a unified manner by introducing two
benchmarks, TaskAnything and JudgeAnything, to respectively evaluate the
overall performance and judging capabilities of MLLMs across any-to-any
modality tasks. Specifically, TaskAnything evaluates the MMU and MMG
capabilities across 15 any-to-any modality categories, employing 1,500 queries
curated from well-established benchmarks. Furthermore, JudgeAnything evaluates
the judging capabilities of 5 advanced MLLMs (e.g., GPT-4o and Gemini-2.0-Flash) from
the perspectives of Pair Comparison and Score Evaluation, providing a
standardized testbed that incorporates human judgments and detailed rubrics.
Our extensive experiments reveal that while these MLLMs show promise in
assessing MMU (i.e., achieving an average of 66.55% in the Pair Comparison
setting and 42.79% in the Score Evaluation setting), they encounter significant
challenges with MMG tasks (i.e., averaging only 53.37% in the Pair Comparison
setting and 30.05% in the Score Evaluation setting), exposing cross-modality biases and
hallucination issues. To address this, we present OmniArena, an automated
platform for evaluating omni-models and multimodal reward models. Our work
highlights the need for fairer evaluation protocols and stronger alignment with
human preferences. The source code and dataset are publicly available at:
https://urrealhero.github.io/judgeanythingweb/.
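The abstract reports judge-human agreement under two settings, Pair Comparison and Score Evaluation. As a rough illustration only (this is not the authors' released code; the function names, the exact-match scoring rule, and the toy data below are assumptions), the following minimal Python sketch shows how such agreement percentages could be computed against human annotations:

```python
# Illustrative sketch (not the paper's released code) of scoring an MLLM judge
# against human labels under the two protocols named in the abstract.
# All names and the exact-match agreement rule are assumptions for illustration.

from typing import List


def pair_comparison_agreement(judge_choices: List[str], human_choices: List[str]) -> float:
    """Fraction of items where the judge prefers the same response ('A' or 'B')
    as the human annotator (one reading of the Pair Comparison percentage)."""
    assert len(judge_choices) == len(human_choices)
    matches = sum(j == h for j, h in zip(judge_choices, human_choices))
    return matches / len(human_choices)


def score_evaluation_agreement(judge_scores: List[int], human_scores: List[int]) -> float:
    """Fraction of items where the judge's rubric score exactly matches the human
    score (one plausible reading of the Score Evaluation percentage)."""
    assert len(judge_scores) == len(human_scores)
    matches = sum(j == h for j, h in zip(judge_scores, human_scores))
    return matches / len(human_scores)


if __name__ == "__main__":
    # Toy data: 3 of 4 pairwise verdicts and 2 of 4 rubric scores agree with humans.
    print(pair_comparison_agreement(["A", "B", "A", "A"], ["A", "B", "B", "A"]))  # 0.75
    print(score_evaluation_agreement([4, 3, 5, 2], [4, 2, 5, 1]))                 # 0.5
```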