WorldBench: A Challenging and Visually Diverse Multimodal Reasoning Benchmark

Abstract

In real-world applications, models are expected to perform reliably across diverse settings. Yet many existing multimodal benchmarks expand their task types without capturing the visual diversity needed to handle open-ended visual inputs. We present WorldBench, a challenging and visually diverse reasoning benchmark for evaluating Multimodal Large Language Models (MLLMs). We build a taxonomy of thousands of visual concepts across multiple domains (e.g., living things). Guided by this taxonomy, we curate a broad collection of images from search engines and existing datasets to comprehensively represent the visual world. Through structured trial and error, we manually design challenging questions that frontier MLLMs fail to answer. In both quantitative and human evaluations, WorldBench achieves higher visual diversity than any existing diverse benchmark. Evaluating 15 MLLMs on WorldBench reveals weaknesses in visual understanding: even the strongest model reaches only 64.0% accuracy, while some models perform only marginally above chance level. We hope our work highlights the importance of visual diversity in building multimodal benchmarks.