

ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models

February 13, 2025
作者: Jonathan Roberts, Mohammad Reza Taesiri, Ansh Sharma, Akash Gupta, Samuel Roberts, Ioana Croitoru, Simion-Vlad Bogolin, Jialu Tang, Florian Langer, Vyas Raina, Vatsal Raina, Hanyi Xiong, Vishaal Udandarao, Jingyi Lu, Shiyang Chen, Sam Purkis, Tianshuo Yan, Wenye Lin, Gyungin Shin, Qiaochu Yang, Anh Totti Nguyen, Kai Han, Samuel Albanie
cs.AI

Abstract

Large Multimodal Models (LMMs) exhibit major shortfalls when interpreting images and, by some measures, have poorer spatial cognition than small children or animals. Despite this, they attain high scores on many popular visual benchmarks, with headroom rapidly eroded by an ongoing surge of model progress. To address this, there is a pressing need for difficult benchmarks that remain relevant for longer. We take this idea to its limit by introducing ZeroBench, a lightweight visual reasoning benchmark that is entirely impossible for contemporary frontier LMMs. Our benchmark consists of 100 manually curated questions and 334 less difficult subquestions. We evaluate 20 LMMs on ZeroBench, all of which score 0.0%, and rigorously analyse the errors. To encourage progress in visual understanding, we publicly release ZeroBench.


February 17, 2025