VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes

Abstract

Is basic visual understanding really solved in state-of-the-art vision-language models (VLMs)? We present VisualOverload, a slightly different visual question answering (VQA) benchmark comprising 2,720 question-answer pairs with privately held ground-truth responses. Unlike prior VQA datasets, which typically focus on near-global image understanding, VisualOverload challenges models to perform simple, knowledge-free vision tasks in densely populated (or overloaded) scenes. Our dataset consists of high-resolution scans of public-domain paintings populated with multiple figures, actions, and unfolding subplots set against elaborately detailed backdrops. We manually annotated these images with questions across six task categories to probe for a thorough understanding of the scene. We hypothesize that current benchmarks overestimate the performance of VLMs, and that encoding and reasoning over details remain challenging for them, especially in densely populated scenes. Indeed, even the best of the 37 models we tested (o3) achieves only 19.6% accuracy on our hardest test split and 69.5% accuracy over all questions. Beyond a thorough evaluation, we complement our benchmark with an error analysis that reveals multiple failure modes, including a lack of counting skills, failures in OCR, and striking logical inconsistencies under complex tasks. Altogether, VisualOverload exposes a critical gap in current vision models and offers the community a crucial resource for developing better models. Benchmark: http://paulgavrikov.github.io/visualoverload
