VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes

September 29, 2025
作者: Paul Gavrikov, Wei Lin, M. Jehanzeb Mirza, Soumya Jahagirdar, Muhammad Huzaifa, Sivan Doveh, Serena Yeung-Levy, James Glass, Hilde Kuehne
cs.AI

Abstract

Is basic visual understanding really solved in state-of-the-art VLMs? We present VisualOverload, a slightly different visual question answering (VQA) benchmark comprising 2,720 question-answer pairs with privately held ground-truth responses. Unlike prior VQA datasets, which typically focus on near-global image understanding, VisualOverload challenges models to perform simple, knowledge-free vision tasks in densely populated (or overloaded) scenes. Our dataset consists of high-resolution scans of public-domain paintings that are populated with multiple figures, actions, and unfolding subplots set against elaborately detailed backdrops. We manually annotated these images with questions across six task categories to probe for a thorough understanding of the scene. We hypothesize that current benchmarks overestimate the performance of VLMs and that encoding and reasoning over details is still a challenging task for them, especially when they are confronted with densely populated scenes. Indeed, we observe that even the best of the 37 tested models (o3) achieves only 19.6% accuracy on our hardest test split and 69.5% accuracy overall. Beyond a thorough evaluation, we complement our benchmark with an error analysis that reveals multiple failure modes, including a lack of counting skills, failures in OCR, and striking logical inconsistencies under complex tasks. Altogether, VisualOverload exposes a critical gap in current vision models and offers a crucial resource for the community to develop better models. Benchmark: http://paulgavrikov.github.io/visualoverload
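
Because the ground-truth answers are privately held, scoring presumably happens against a submission of model predictions rather than locally. The sketch below shows one plausible way to produce such a prediction file for a VQA-style benchmark like this one; the dataset fields (question_id, image, question), the JSON submission format, and the answer_question placeholder are illustrative assumptions, not the benchmark's documented protocol (see http://paulgavrikov.github.io/visualoverload for the actual instructions).

```python
# Hypothetical sketch: build a prediction file for a VQA-style benchmark
# with privately held ground truth. Field names and output format are
# assumptions for illustration only.
import json


def answer_question(image_path: str, question: str) -> str:
    """Placeholder for VLM inference; swap in your model's actual call."""
    return "unknown"  # dummy answer so the sketch runs end to end


def build_submission(dataset: list[dict], out_path: str) -> None:
    # One record per question: the question ID plus the model's free-form answer.
    predictions = [
        {
            "question_id": example["question_id"],
            "answer": answer_question(example["image"], example["question"]),
        }
        for example in dataset
    ]
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(predictions, f, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    # Toy entries in the assumed schema (not real benchmark data).
    demo = [
        {"question_id": "q0001", "image": "painting_01.jpg",
         "question": "How many figures are holding a basket?"},
        {"question_id": "q0002", "image": "painting_01.jpg",
         "question": "What word is written on the banner?"},
    ]
    build_submission(demo, "predictions.json")
```

Accuracy would then be the fraction of submitted answers matching the held-out ground truth, which for this benchmark can only be computed on the organizers' side.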