Judge Anything: MLLM as a Judge Across Any Modality

Abstract

Evaluating generative foundation models on open-ended multimodal understanding (MMU) and generation (MMG) tasks across diverse modalities (e.g., images, audio, video) is challenging because of the complexity of cross-modal interactions. To this end, the idea of using Multimodal LLMs (MLLMs) as automated judges has emerged, with encouraging results in assessing vision-language understanding tasks. Going further, this paper extends MLLM-as-a-Judge across modalities in a unified manner by introducing two benchmarks, TaskAnything and JudgeAnything, which evaluate the overall performance and the judging capabilities of MLLMs, respectively, on any-to-any modality tasks. Specifically, TaskAnything evaluates MMU and MMG capabilities across 15 any-to-any modality categories, using 1,500 queries curated from well-established benchmarks. JudgeAnything, in turn, evaluates the judging capabilities of 5 advanced MLLMs (e.g., GPT-4o and Gemini-2.0-Flash) from the perspectives of Pair Comparison and Score Evaluation, providing a standardized testbed that incorporates human judgments and detailed rubrics. Our extensive experiments reveal that while these MLLMs show promise in assessing MMU (averaging 66.55% in the Pair Comparison setting and 42.79% in the Score Evaluation setting), they encounter significant challenges with MMG tasks (averaging only 53.37% in the Pair Comparison setting and 30.05% in the Score Evaluation setting), exposing cross-modality biases and hallucination issues. To address this, we present OmniArena, an automated platform for evaluating omni-models and multimodal reward models. Our work highlights the need for fairer evaluation protocols and stronger alignment with human preferences. The source code and dataset are publicly available at: https://urrealhero.github.io/judgeanythingweb/.
