OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models

Abstract

Spatial reasoning is a key aspect of cognitive psychology and remains a major bottleneck for current vision-language models (VLMs). While extensive research has aimed to evaluate or improve VLMs' understanding of basic spatial relations, such as distinguishing left from right, judging near from far, and counting objects, these tasks represent only the most fundamental level of spatial reasoning. In this work, we introduce OmniSpatial, a comprehensive and challenging benchmark for spatial reasoning grounded in cognitive psychology. OmniSpatial covers four major categories (dynamic reasoning, complex spatial logic, spatial interaction, and perspective-taking) and 50 fine-grained subcategories. Through Internet data crawling and careful manual annotation, we construct over 1.5K question-answer pairs. Extensive experiments show that both open- and closed-source VLMs, as well as existing reasoning and spatial understanding models, exhibit significant limitations in comprehensive spatial understanding. We further analyze failure cases and propose potential directions for future research.
