Judge Anything: MLLM as a Judge Across Any Modality
March 21, 2025
作者: Shu Pu, Yaochen Wang, Dongping Chen, Yuhang Chen, Guohao Wang, Qi Qin, Zhongyi Zhang, Zhiyuan Zhang, Zetong Zhou, Shuang Gong, Yi Gui, Yao Wan, Philip S. Yu
cs.AI
Abstract
Evaluating generative foundation models on open-ended multimodal
understanding (MMU) and generation (MMG) tasks across diverse modalities (e.g.,
images, audio, video) poses significant challenges due to the complexity of
cross-modal interactions. To this end, the idea of utilizing Multimodal LLMs
(MLLMs) as automated judges has emerged, with encouraging results in assessing
vision-language understanding tasks. Going further, this paper extends
MLLM-as-a-Judge across modalities in a unified manner by introducing two
benchmarks, TaskAnything and JudgeAnything, to respectively evaluate the
overall performance and judging capabilities of MLLMs across any-to-any
modality tasks. Specifically, TaskAnything evaluates the MMU and MMG
capabilities across 15 any-to-any modality categories, employing 1,500 queries
curated from well-established benchmarks. Furthermore, JudgeAnything evaluates
the judging capabilities of 5 advanced MLLMs (e.g., GPT-4o and Gemini-2.0-Flash) from
the perspectives of Pair Comparison and Score Evaluation, providing a
standardized testbed that incorporates human judgments and detailed rubrics.
Our extensive experiments reveal that while these MLLMs show promise in
assessing MMU (i.e., achieving an average of 66.55% in the Pair Comparison
setting and 42.79% in the Score Evaluation setting), they encounter significant
challenges with MMG tasks (i.e., averaging only 53.37% in the Pair Comparison
setting and 30.05% in the Score Evaluation setting), exposing cross-modality biases and
hallucination issues. To address this, we present OmniArena, an automated
platform for evaluating omni-models and multimodal reward models. Our work
highlights the need for fairer evaluation protocols and stronger alignment with
human preferences. The source code and dataset are publicly available at:
https://urrealhero.github.io/judgeanythingweb/.
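The abstract reports judge-human agreement under two settings, Pair Comparison and Score Evaluation. As a rough illustration only (this is not the authors' released code; the function names, the exact-match scoring rule, and the toy data below are assumptions), the following minimal Python sketch shows how such agreement percentages could be computed against human annotations:

```python
# Illustrative sketch (not the paper's released code) of scoring an MLLM judge
# against human labels under the two protocols named in the abstract.
# All names and the exact-match agreement rule are assumptions for illustration.

from typing import List


def pair_comparison_agreement(judge_choices: List[str], human_choices: List[str]) -> float:
    """Fraction of items where the judge prefers the same response ('A' or 'B')
    as the human annotator (one reading of the Pair Comparison percentage)."""
    assert len(judge_choices) == len(human_choices)
    matches = sum(j == h for j, h in zip(judge_choices, human_choices))
    return matches / len(human_choices)


def score_evaluation_agreement(judge_scores: List[int], human_scores: List[int]) -> float:
    """Fraction of items where the judge's rubric score exactly matches the human
    score (one plausible reading of the Score Evaluation percentage)."""
    assert len(judge_scores) == len(human_scores)
    matches = sum(j == h for j, h in zip(judge_scores, human_scores))
    return matches / len(human_scores)


if __name__ == "__main__":
    # Toy data: 3 of 4 pairwise verdicts and 2 of 4 rubric scores agree with humans.
    print(pair_comparison_agreement(["A", "B", "A", "A"], ["A", "B", "B", "A"]))  # 0.75
    print(score_evaluation_agreement([4, 3, 5, 2], [4, 2, 5, 1]))                 # 0.5
```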