OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models

Abstract

Spatial reasoning is a key aspect of cognitive psychology and remains a major bottleneck for current vision-language models (VLMs). Extensive research has aimed to evaluate or improve VLMs' understanding of basic spatial relations, such as distinguishing left from right, judging near from far, and counting objects, but these tasks represent only the most fundamental level of spatial reasoning. In this work, we introduce OmniSpatial, a comprehensive and challenging spatial reasoning benchmark grounded in cognitive psychology. OmniSpatial covers four major categories (dynamic reasoning, complex spatial logic, spatial interaction, and perspective-taking) with 50 fine-grained subcategories. Through Internet data crawling and careful manual annotation, we construct over 1.5K question-answer pairs. Extensive experiments show that both open- and closed-source VLMs, as well as existing reasoning and spatial understanding models, exhibit significant limitations in comprehensive spatial understanding. We further analyze failure cases and propose potential directions for future research.
