ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models

Abstract

Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and reasoning about visual content, but significant challenges persist in tasks requiring cross-viewpoint understanding and spatial reasoning. We identify a critical limitation: current VLMs excel primarily at egocentric spatial reasoning (from the camera's perspective) but fail to generalize to allocentric viewpoints when required to adopt another entity's spatial frame of reference. We introduce ViewSpatial-Bench, the first comprehensive benchmark designed specifically for multi-viewpoint spatial localization recognition evaluation across five distinct task types, supported by an automated 3D annotation pipeline that generates precise directional labels. Comprehensive evaluation of diverse VLMs on ViewSpatial-Bench reveals a significant performance disparity: models demonstrate reasonable performance on camera-perspective tasks but exhibit reduced accuracy when reasoning from a human viewpoint. By fine-tuning VLMs on our multi-perspective spatial dataset, we achieve an overall performance improvement of 46.24% across tasks, highlighting the efficacy of our approach. Our work establishes a crucial benchmark for spatial intelligence in embodied AI systems and provides empirical evidence that modeling 3D spatial relationships enhances VLMs' corresponding spatial comprehension capabilities.
