WorldBench: A Challenging and Visually Diverse Multimodal Reasoning Benchmark

Abstract

In real-world applications, models are expected to perform reliably across diverse settings. Yet many existing multimodal benchmarks expand the range of task types without capturing the visual diversity needed to handle open-ended visual inputs. We present WorldBench, a challenging and visually diverse reasoning benchmark for evaluating Multimodal Large Language Models (MLLMs). We build a taxonomy of thousands of visual concepts across multiple domains (e.g., living things). Guided by this taxonomy, we curate a broad collection of images from search engines and existing datasets to represent the visual world comprehensively. Through structured trial and error, we manually design challenging questions that frontier MLLMs fail to answer. In both quantitative and human evaluations, WorldBench achieves higher visual diversity than any existing diversity-oriented benchmark. Evaluating 15 MLLMs on WorldBench reveals weaknesses in visual understanding: even the strongest model reaches only 64.0% accuracy, while some models perform only marginally above chance level. We hope our work highlights the importance of visual diversity in building multimodal benchmarks.