VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes

Abstract

Is basic visual understanding really solved in state-of-the-art vision-language models (VLMs)? We present VisualOverload, a slightly different visual question answering (VQA) benchmark comprising 2,720 question-answer pairs with privately held ground-truth responses. Unlike prior VQA datasets, which typically focus on near-global image understanding, VisualOverload challenges models to perform simple, knowledge-free vision tasks in densely populated (or overloaded) scenes. Our dataset consists of high-resolution scans of public-domain paintings populated with multiple figures, actions, and unfolding subplots set against elaborately detailed backdrops. We manually annotated these images with questions across six task categories to probe for a thorough understanding of the scene. We hypothesize that current benchmarks overestimate the performance of VLMs, and that encoding and reasoning over details remains challenging for them, especially in densely populated scenes. Indeed, we observe that even the best of the 37 models tested (o3) achieves only 19.6% accuracy on our hardest test split and 69.5% accuracy overall across all questions. Beyond a thorough evaluation, we complement our benchmark with an error analysis that reveals multiple failure modes, including a lack of counting skills, failures in OCR, and striking logical inconsistencies under complex tasks. Altogether, VisualOverload exposes a critical gap in current vision models and offers a crucial resource for the community to develop better models. Benchmark: http://paulgavrikov.github.io/visualoverload

