WorldBench: A Challenging and Visually Diverse Multimodal Reasoning Benchmark

Abstract

In real-world applications, models are expected to perform reliably across diverse settings. Yet many existing multimodal benchmarks expand task types without capturing the visual diversity needed to handle open-ended visual inputs. We present WorldBench, a challenging and visually diverse reasoning benchmark for evaluating Multimodal Large Language Models (MLLMs). We build a taxonomy of thousands of visual concepts across multiple domains (e.g., living things). Guided by this taxonomy, we curate a broad collection of images from search engines and existing datasets to comprehensively represent the visual world. Through structured trial and error, we manually design challenging questions that frontier MLLMs fail to answer. In both quantitative and human evaluations, WorldBench achieves higher visual diversity than any existing diverse benchmark. Evaluating 15 MLLMs on WorldBench reveals weaknesses in visual understanding: even the strongest model reaches only 64.0% accuracy, while some models perform only marginally above chance level. We hope our work highlights the importance of visual diversity in building multimodal benchmarks.