Task-oriented Sequential Grounding in 3D Scenes
August 7, 2024
Authors: Zhuofan Zhang, Ziyu Zhu, Pengxiang Li, Tengyu Liu, Xiaojian Ma, Yixin Chen, Baoxiong Jia, Siyuan Huang, Qing Li
cs.AI
Abstract
Grounding natural language in physical 3D environments is essential for the
advancement of embodied artificial intelligence. Current datasets and models
for 3D visual grounding predominantly focus on identifying and localizing
objects from static, object-centric descriptions. These approaches do not
adequately address the dynamic and sequential nature of task-oriented grounding
necessary for practical applications. In this work, we propose a new task:
Task-oriented Sequential Grounding in 3D scenes, wherein an agent must follow
detailed step-by-step instructions to complete daily activities by locating a
sequence of target objects in indoor scenes. To facilitate this task, we
introduce SG3D, a large-scale dataset containing 22,346 tasks with 112,236
steps across 4,895 real-world 3D scenes. The dataset is constructed using a
combination of RGB-D scans from various 3D scene datasets and an automated task
generation pipeline, followed by human verification for quality assurance. We
adapted three state-of-the-art 3D visual grounding models to the sequential
grounding task and evaluated their performance on SG3D. Our results reveal that
while these models perform well on traditional benchmarks, they face
significant challenges with task-oriented sequential grounding, underscoring
the need for further research in this area.