ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models

Abstract

Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and reasoning about visual content, but significant challenges persist in tasks requiring cross-viewpoint understanding and spatial reasoning. We identify a critical limitation: current VLMs excel primarily at egocentric spatial reasoning (from the camera's perspective) but fail to generalize to allocentric viewpoints when required to adopt another entity's spatial frame of reference. We introduce ViewSpatial-Bench, the first comprehensive benchmark designed specifically to evaluate multi-viewpoint spatial localization recognition across five distinct task types, supported by an automated 3D annotation pipeline that generates precise directional labels. Comprehensive evaluation of diverse VLMs on ViewSpatial-Bench reveals a significant performance disparity: models demonstrate reasonable performance on camera-perspective tasks but exhibit reduced accuracy when reasoning from a human viewpoint. By fine-tuning VLMs on our multi-perspective spatial dataset, we achieve an overall performance improvement of 46.24% across tasks, highlighting the efficacy of our approach. Our work establishes a crucial benchmark for spatial intelligence in embodied AI systems and provides empirical evidence that modeling 3D spatial relationships enhances VLMs' corresponding spatial comprehension capabilities.
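
To make the egocentric versus allocentric distinction concrete, the following is a minimal sketch of how one object can receive different directional labels depending on whose frame of reference is used. It is not the benchmark's annotation pipeline: it assumes a simplified top-down 2D ground plane, a four-way label set, and a hypothetical helper named `relative_direction`, all introduced here purely for illustration.

```python
def relative_direction(observer_xy, facing_xy, target_xy):
    """Label a target as front/back/left/right relative to an observer.

    Hypothetical helper: positions are (x, y) points on a top-down ground
    plane, and facing_xy is the direction the observer is looking in.
    """
    dx = target_xy[0] - observer_xy[0]
    dy = target_xy[1] - observer_xy[1]
    fx, fy = facing_xy
    # Dot product: how far the target lies along the facing direction.
    forward = dx * fx + dy * fy
    # 2D cross product (facing x offset): positive means counterclockwise
    # from the facing direction, i.e. on the observer's left.
    leftward = fx * dy - fy * dx
    if abs(forward) >= abs(leftward):
        return "front" if forward > 0 else "back"
    return "left" if leftward > 0 else "right"


# Assumed scene: the camera sits at the origin looking along +y, and a
# person stands 4 units ahead of it, facing back toward the camera.
camera_pos, camera_facing = (0.0, 0.0), (0.0, 1.0)
person_pos, person_facing = (0.0, 4.0), (0.0, -1.0)
cup_pos = (3.0, 2.0)

camera_label = relative_direction(camera_pos, camera_facing, cup_pos)
person_label = relative_direction(person_pos, person_facing, cup_pos)
print(camera_label)  # right
print(person_label)  # left
```

The same cup is to the right from the camera's perspective but to the left from the perspective of the person facing the camera, which is the kind of frame switch that the human-viewpoint tasks require a model to perform.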