3D Question Answering for City Scene Understanding
July 24, 2024
Authors: Penglei Sun, Yaoxian Song, Xiang Liu, Xiaofei Yang, Qiang Wang, Tiefeng Li, Yang Yang, Xiaowen Chu
cs.AI
Abstract
3D multimodal question answering (MQA) plays a crucial role in scene understanding by enabling intelligent agents to comprehend their surroundings in 3D environments. While existing research has primarily focused on indoor household tasks and outdoor roadside autonomous driving tasks, there has been limited exploration of city-level scene understanding tasks. Furthermore, existing research faces challenges in understanding city scenes due to the absence of spatial semantic information and human-environment interaction information at the city level. To address these challenges, we investigate 3D MQA from both dataset and method perspectives. From the dataset perspective, we introduce a novel 3D MQA dataset named City-3DQA for city-level scene understanding, which is the first dataset to incorporate scene semantics and human-environment interaction tasks within a city. From the method perspective, we propose a Scene graph enhanced City-level Understanding method (Sg-CityU), which utilizes a scene graph to introduce spatial semantics. A new benchmark is reported, and our proposed Sg-CityU achieves accuracies of 63.94% and 63.76% in different settings of City-3DQA. Compared to indoor 3D MQA methods and zero-shot approaches using advanced large language models (LLMs), Sg-CityU demonstrates state-of-the-art (SOTA) performance in robustness and generalization.
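The abstract does not detail how Sg-CityU builds or queries its scene graph. As a rough illustration only, the Python sketch below shows one way spatial semantics could be encoded as a graph of city objects and directed relations, then answered by lookup. All names, relations, and the query logic here are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: city objects as graph nodes, pairwise spatial
# relations as directed edges, question answering as a graph lookup.
# This is an assumed toy design, not Sg-CityU's real architecture.
from dataclasses import dataclass, field

@dataclass
class SceneNode:
    label: str                                  # semantic class, e.g. "building"
    position: tuple                             # (x, y, z) centroid in the city scene
    edges: dict = field(default_factory=dict)   # relation name -> list of neighbor ids

def add_relation(graph: dict, a: str, relation: str, b: str) -> None:
    """Record a directed spatial relation, e.g. ("hospital", "north_of", "park")."""
    graph[a].edges.setdefault(relation, []).append(b)

def answer(graph: dict, subject: str, relation: str) -> list:
    """Answer a spatial-semantic question by following an edge in the graph."""
    return graph[subject].edges.get(relation, ["unknown"])

# Build a toy city-level scene graph with two objects.
graph = {
    "hospital": SceneNode("building", (120.0, 45.0, 0.0)),
    "park": SceneNode("greenspace", (118.0, 20.0, 0.0)),
}
add_relation(graph, "hospital", "north_of", "park")

# "What is the hospital north of?" -> ['park']
print(answer(graph, "hospital", "north_of"))
```

In this framing, the graph supplies the city-level spatial semantics the abstract identifies as missing from prior work; a learned model would replace the hand-coded lookup with reasoning over node and edge features.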