WorldBench: A Challenging and Visually Diverse Benchmark for Multimodal Reasoning

Abstract

In real-world applications, models are expected to perform reliably across diverse settings. Yet many existing multimodal benchmarks expand the range of task types without capturing the visual diversity needed to handle open-ended visual inputs. We present WorldBench, a challenging and visually diverse reasoning benchmark for evaluating Multimodal Large Language Models (MLLMs). We build a taxonomy of thousands of visual concepts across multiple domains (e.g., living things). Guided by this taxonomy, we curate a broad collection of images from search engines and existing datasets to comprehensively represent the visual world. Through structured trial and error, we manually design challenging questions that frontier MLLMs fail to answer. In both quantitative and human evaluations, WorldBench achieves higher visual diversity than any existing diverse benchmark. Evaluating 15 MLLMs on WorldBench reveals weaknesses in visual understanding: even the strongest model reaches only 64.0% accuracy, while some models perform only marginally above chance level. We hope our work highlights the importance of visual diversity in building multimodal benchmarks.