How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective

Abstract

Visual Spatial Reasoning (VSR) is a core human cognitive ability and a critical requirement for advancing embodied intelligence and autonomous systems. Despite recent progress in Vision-Language Models (VLMs), achieving human-level VSR remains highly challenging due to the complexity of representing and reasoning over three-dimensional space. In this paper, we present a systematic investigation of VSR in VLMs, reviewing existing methodologies across input modalities, model architectures, training strategies, and reasoning mechanisms. Furthermore, we categorize spatial intelligence into three levels of capability, i.e., basic perception, spatial understanding, and spatial planning, and curate SIBench, a spatial intelligence benchmark encompassing nearly 20 open-source datasets across 23 task settings. Experiments with state-of-the-art VLMs reveal a pronounced gap between perception and reasoning: models show competence in basic perceptual tasks but consistently underperform in understanding and planning tasks, particularly in numerical estimation, multi-view reasoning, temporal dynamics, and spatial imagination. These findings underscore the substantial challenges that remain in achieving spatial intelligence, while providing both a systematic roadmap and a comprehensive benchmark to drive future research in the field. Related resources for this study are available at https://sibench.github.io/Awesome-Visual-Spatial-Reasoning/.
