Spatial Reasoning with Vision-Language Models in Ego-Centric Multi-View Scenes

Abstract

Understanding 3D spatial relationships remains a major limitation of current Vision-Language Models (VLMs). Prior work has addressed this issue by creating spatial question-answering (QA) datasets based on single images or indoor videos. However, real-world embodied AI agents such as robots and self-driving cars typically rely on ego-centric, multi-view observations. To this end, we introduce Ego3D-Bench, a new benchmark designed to evaluate the spatial reasoning abilities of VLMs using ego-centric, multi-view outdoor data. Ego3D-Bench comprises over 8,600 QA pairs, created with significant involvement from human annotators to ensure quality and diversity. We benchmark 16 state-of-the-art VLMs, including GPT-4o, Gemini1.5-Pro, InternVL3, and Qwen2.5-VL. Our results reveal a notable gap between human-level scores and VLM performance, highlighting that current VLMs still fall short of human-level spatial understanding. To bridge this gap, we propose Ego3D-VLM, a post-training framework that enhances the 3D spatial reasoning of VLMs. Ego3D-VLM generates a cognitive map based on estimated global 3D coordinates, resulting in a 12% average improvement on multi-choice QA and a 56% average improvement on absolute distance estimation. Ego3D-VLM is modular and can be integrated with any existing VLM. Together, Ego3D-Bench and Ego3D-VLM offer valuable tools for advancing toward human-level spatial understanding in real-world, multi-view environments.
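The cognitive-map idea can be illustrated with a small sketch. This is not the authors' implementation: the abstract only states that the map is built from estimated global 3D coordinates. The sketch assumes a simplified top-down (2D) ego frame with x pointing forward and y pointing left, a known yaw angle for each camera, and object positions already estimated in each camera's own frame. All function names, field names, and sample values below are hypothetical.

```python
import math

def camera_to_ego(point_cam, yaw_deg, offset=(0.0, 0.0)):
    """Rotate a camera-frame point (forward, left) by the camera's yaw and
    shift it by the camera's mounting offset to get ego-frame coordinates."""
    yaw = math.radians(yaw_deg)
    x, y = point_cam
    ex = x * math.cos(yaw) - y * math.sin(yaw) + offset[0]
    ey = x * math.sin(yaw) + y * math.cos(yaw) + offset[1]
    return (ex, ey)

def build_cognitive_map(detections):
    """Collect every detected object into one shared ego-centric map."""
    cmap = {}
    for det in detections:
        ex, ey = camera_to_ego(
            det["point_cam"], det["yaw_deg"], det.get("offset", (0.0, 0.0))
        )
        cmap[det["label"]] = (round(ex, 2), round(ey, 2))
    return cmap

def render_map(cmap):
    """Serialize the map as text that could be placed in a VLM prompt."""
    lines = ["ego: (x=0.0, y=0.0)"]
    for label, (x, y) in sorted(cmap.items()):
        lines.append(f"{label}: (x={x}, y={y})")
    return "\n".join(lines)

def distance(cmap, a, b):
    """Euclidean distance in meters between two mapped objects."""
    return math.dist(cmap[a], cmap[b])

# Hypothetical detections: one object per camera view, positions in meters.
detections = [
    {"label": "car", "point_cam": (10.0, 0.0), "yaw_deg": 0.0},         # front camera
    {"label": "pedestrian", "point_cam": (5.0, 0.0), "yaw_deg": 90.0},  # left camera
]

if __name__ == "__main__":
    cmap = build_cognitive_map(detections)
    print(render_map(cmap))
    print(round(distance(cmap, "car", "pedestrian"), 2))
```

Placing all objects in one shared frame is what lets a question that spans two camera views, such as the distance between the car and the pedestrian, reduce to simple geometry on the map.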
