VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes
September 29, 2025
Authors: Paul Gavrikov, Wei Lin, M. Jehanzeb Mirza, Soumya Jahagirdar, Muhammad Huzaifa, Sivan Doveh, Serena Yeung-Levy, James Glass, Hilde Kuehne
cs.AI
Abstract
Is basic visual understanding really solved in state-of-the-art VLMs? We
present VisualOverload, a slightly different visual question answering (VQA)
benchmark comprising 2,720 question-answer pairs, with privately held
ground-truth responses. Unlike prior VQA datasets that typically focus on near
global image understanding, VisualOverload challenges models to perform simple,
knowledge-free vision tasks in densely populated (or, overloaded) scenes. Our
dataset consists of high-resolution scans of public-domain paintings that are
populated with multiple figures, actions, and unfolding subplots set against
elaborately detailed backdrops. We manually annotated these images with
questions across six task categories to probe for a thorough understanding of
the scene. We hypothesize that current benchmarks overestimate the performance
of VLMs, and encoding and reasoning over details is still a challenging task
for them, especially if they are confronted with densely populated scenes.
Indeed, we observe that even the best of the 37 tested models (o3) achieves
only 19.6% accuracy on our hardest test split and 69.5% accuracy across all
questions. Beyond a thorough evaluation, we complement our benchmark with
an error analysis that reveals multiple failure modes, including a lack of
counting skills, failure in OCR, and striking logical inconsistencies under
complex tasks. Altogether, VisualOverload exposes a critical gap in current
vision models and offers a crucial resource for the community to develop better
models.
Benchmark: http://paulgavrikov.github.io/visualoverload
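The abstract reports per-split and overall accuracy (e.g., 19.6% on the hardest split vs. 69.5% overall). As an illustrative sketch only, not the official evaluation code (ground-truth answers are privately held and scored server-side), the kind of per-split exact-match accuracy described could be computed like this; the record fields `split`, `prediction`, and `answer` are hypothetical names:

```python
# Illustrative sketch (NOT the official VisualOverload scorer): per-split
# exact-match accuracy for a VQA benchmark. Field names are assumptions.
from collections import defaultdict

def accuracy_by_split(records):
    """Return {split_name: fraction of case-insensitive exact matches}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["split"]] += 1
        if r["prediction"].strip().lower() == r["answer"].strip().lower():
            correct[r["split"]] += 1
    return {s: correct[s] / total[s] for s in total}

# Toy example with made-up records:
records = [
    {"split": "hard", "prediction": "3", "answer": "3"},
    {"split": "hard", "prediction": "5", "answer": "4"},
    {"split": "easy", "prediction": "red", "answer": "Red"},
]
print(accuracy_by_split(records))  # {'hard': 0.5, 'easy': 1.0}
```

Real evaluation against the private ground truth would instead go through the benchmark's submission process linked above.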