Task-oriented Sequential Grounding in 3D Scenes
August 7, 2024
Authors: Zhuofan Zhang, Ziyu Zhu, Pengxiang Li, Tengyu Liu, Xiaojian Ma, Yixin Chen, Baoxiong Jia, Siyuan Huang, Qing Li
cs.AI
Abstract
Grounding natural language in physical 3D environments is essential for the
advancement of embodied artificial intelligence. Current datasets and models
for 3D visual grounding predominantly focus on identifying and localizing
objects from static, object-centric descriptions. These approaches do not
adequately address the dynamic and sequential nature of task-oriented grounding
necessary for practical applications. In this work, we propose a new task:
Task-oriented Sequential Grounding in 3D scenes, wherein an agent must follow
detailed step-by-step instructions to complete daily activities by locating a
sequence of target objects in indoor scenes. To facilitate this task, we
introduce SG3D, a large-scale dataset containing 22,346 tasks with 112,236
steps across 4,895 real-world 3D scenes. The dataset is constructed using a
combination of RGB-D scans from various 3D scene datasets and an automated task
generation pipeline, followed by human verification for quality assurance. We
adapt three state-of-the-art 3D visual grounding models to the sequential
grounding task and evaluate their performance on SG3D. Our results reveal that
while these models perform well on traditional benchmarks, they face
significant challenges with task-oriented sequential grounding, underscoring
the need for further research in this area.
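To make the evaluation setting concrete, the sketch below shows one plausible way to score sequential grounding: per-step accuracy (did each step's predicted object match the ground truth?) and whole-task accuracy (were all steps in a task grounded correctly?). The data layout and both metric definitions are illustrative assumptions, not the paper's specification.

```python
# Hypothetical sketch of SG3D-style sequential grounding evaluation.
# Each task is a list of target object ids, one per instruction step;
# predictions mirror that structure. Field layout and metric names
# are assumptions for illustration only.
from typing import List


def step_accuracy(pred: List[List[int]], gold: List[List[int]]) -> float:
    """Fraction of individual steps whose predicted object matches gold."""
    correct = sum(p == g
                  for task_p, task_g in zip(pred, gold)
                  for p, g in zip(task_p, task_g))
    total = sum(len(task) for task in gold)
    return correct / total


def task_accuracy(pred: List[List[int]], gold: List[List[int]]) -> float:
    """Fraction of tasks in which every step is grounded correctly."""
    return sum(tp == tg for tp, tg in zip(pred, gold)) / len(gold)


# Two toy tasks: each inner list is the sequence of target object ids.
gold = [[3, 7, 1], [5, 2]]
pred = [[3, 7, 4], [5, 2]]
print(step_accuracy(pred, gold))  # 4 of 5 steps correct -> 0.8
print(task_accuracy(pred, gold))  # 1 of 2 tasks fully correct -> 0.5
```

Whole-task accuracy is the stricter metric: a single mis-grounded step fails the entire task, which is why models that do well on single-object benchmarks can still score low on sequential grounding.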