Spatial Reasoning with Vision-Language Models in Ego-Centric Multi-View Scenes

Abstract

Understanding 3D spatial relationships remains a major limitation of current Vision-Language Models (VLMs). Prior work has addressed this limitation by creating spatial question-answering (QA) datasets based on single images or indoor videos. However, real-world embodied AI agents such as robots and self-driving cars typically rely on ego-centric, multi-view observations. To this end, we introduce Ego3D-Bench, a new benchmark designed to evaluate the spatial reasoning abilities of VLMs using ego-centric, multi-view outdoor data. Ego3D-Bench comprises over 8,600 QA pairs, created with significant involvement from human annotators to ensure quality and diversity. We benchmark 16 state-of-the-art VLMs, including GPT-4o, Gemini1.5-Pro, InternVL3, and Qwen2.5-VL. Our results reveal a notable gap between human-level scores and VLM performance, showing that current VLMs still fall short of human-level spatial understanding. To bridge this gap, we propose Ego3D-VLM, a post-training framework that enhances the 3D spatial reasoning of VLMs. Ego3D-VLM generates a cognitive map based on estimated global 3D coordinates, yielding a 12% average improvement on multi-choice QA and a 56% average improvement on absolute distance estimation. Ego3D-VLM is modular and can be integrated with any existing VLM. Together, Ego3D-Bench and Ego3D-VLM offer valuable tools for advancing toward human-level spatial understanding in real-world, multi-view environments.
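
The abstract does not say how the cognitive map is built, so the following is only an illustrative sketch and not the authors' implementation. It assumes that each detected object already has an estimated ground-plane position in its own camera frame, that all cameras sit at the ego origin, and that the camera yaw angles in CAMERA_YAW_DEG are known. Every name here (Detection, CAMERA_YAW_DEG, to_ego, build_cognitive_map, and so on) is hypothetical. The sketch rotates each per-view estimate into one shared ego frame, renders the result as text that could be added to a VLM prompt, and computes an absolute distance between two objects.

```python
import math
from dataclasses import dataclass


@dataclass
class Detection:
    label: str    # object name, assumed unique across views
    view: str     # camera name, e.g. "front"
    x_cam: float  # metres to the right of the camera's optical axis
    z_cam: float  # metres forward along the camera's optical axis


# Hypothetical camera yaws in degrees, counter-clockwise from ego forward.
CAMERA_YAW_DEG = {"front": 0.0, "left": 90.0, "back": 180.0, "right": -90.0}


def to_ego(det: Detection) -> tuple[float, float]:
    """Rotate a camera-frame ground-plane point into the ego frame.

    Ego frame convention (an assumption): x points forward, y points left.
    Camera translation offsets are ignored for simplicity.
    """
    yaw = math.radians(CAMERA_YAW_DEG[det.view])
    fwd, left = det.z_cam, -det.x_cam  # camera "right" is negative "left"
    x = fwd * math.cos(yaw) - left * math.sin(yaw)
    y = fwd * math.sin(yaw) + left * math.cos(yaw)
    return (x, y)


def build_cognitive_map(dets: list[Detection]) -> dict[str, tuple[float, float]]:
    """Map each object label to its (x, y) position in the shared ego frame."""
    return {d.label: to_ego(d) for d in dets}


def cognitive_map_text(cmap: dict[str, tuple[float, float]]) -> str:
    """Render the map as plain text that could be placed in a VLM prompt."""
    lines = ["Ego vehicle at (0.0, 0.0); x = forward, y = left, units = metres."]
    for label, (x, y) in sorted(cmap.items()):
        lines.append(f"{label}: x={x:+.1f}, y={y:+.1f}")
    return "\n".join(lines)


def object_distance(cmap: dict[str, tuple[float, float]], a: str, b: str) -> float:
    """Absolute ground-plane distance between two mapped objects, in metres."""
    return math.dist(cmap[a], cmap[b])


if __name__ == "__main__":
    detections = [
        Detection("car", "front", x_cam=2.0, z_cam=10.0),
        Detection("pedestrian", "left", x_cam=0.0, z_cam=5.0),
    ]
    cmap = build_cognitive_map(detections)
    print(cognitive_map_text(cmap))
    print(f"car to pedestrian: {object_distance(cmap, 'car', 'pedestrian'):.2f} m")
```

Putting objects from different views into a single coordinate frame means a cross-view question such as "how far is the car in the front view from the pedestrian in the left view" reduces to simple geometry on the map. A real pipeline would also need depth estimation, full camera extrinsics, and object matching across views, none of which this sketch attempts.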
