Spatial Reasoning with Vision-Language Models in Ego-Centric Multi-View Scenes

Abstract

Understanding 3D spatial relationships remains a major limitation of current Vision-Language Models (VLMs). Prior work has addressed this issue by creating spatial question-answering (QA) datasets based on single images or indoor videos. However, real-world embodied AI agents such as robots and self-driving cars typically rely on ego-centric, multi-view observations. To this end, we introduce Ego3D-Bench, a new benchmark designed to evaluate the spatial reasoning abilities of VLMs using ego-centric, multi-view outdoor data. Ego3D-Bench comprises over 8,600 QA pairs, created with significant involvement from human annotators to ensure quality and diversity. We benchmark 16 state-of-the-art VLMs, including GPT-4o, Gemini1.5-Pro, InternVL3, and Qwen2.5-VL. Our results reveal a notable gap between human-level scores and VLM performance, showing that current VLMs still fall short of human-level spatial understanding. To bridge this gap, we propose Ego3D-VLM, a post-training framework that enhances the 3D spatial reasoning of VLMs. Ego3D-VLM generates a cognitive map based on estimated global 3D coordinates, resulting in a 12% average improvement on multiple-choice QA and a 56% average improvement on absolute distance estimation. Ego3D-VLM is modular and can be integrated with any existing VLM. Together, Ego3D-Bench and Ego3D-VLM offer valuable tools for advancing toward human-level spatial understanding in real-world, multi-view environments.
