SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding
January 17, 2024
Authors: Baoxiong Jia, Yixin Chen, Huangyue Yu, Yan Wang, Xuesong Niu, Tengyu Liu, Qing Li, Siyuan Huang
cs.AI
Abstract
3D vision-language grounding, which focuses on aligning language with the 3D
physical environment, stands as a cornerstone in the development of embodied
agents. In comparison to recent advancements in the 2D domain, grounding
language in 3D scenes faces several significant challenges: (i) the inherent
complexity of 3D scenes due to the diverse object configurations, their rich
attributes, and intricate relationships; (ii) the scarcity of paired 3D
vision-language data to support grounded learning; and (iii) the absence of a
unified learning framework to distill knowledge from grounded 3D data. In this
work, we aim to address these three major challenges in 3D vision-language by
examining the potential of systematically upscaling 3D vision-language learning
in indoor environments. We introduce the first million-scale 3D vision-language
dataset, SceneVerse, encompassing about 68K 3D indoor scenes and comprising
2.5M vision-language pairs derived from both human annotations and our scalable
scene-graph-based generation approach. We demonstrate that this scaling allows
for a unified pre-training framework, Grounded Pre-training for Scenes (GPS),
for 3D vision-language learning. Through extensive experiments, we showcase the
effectiveness of GPS by achieving state-of-the-art performance on all existing
3D visual grounding benchmarks. The vast potential of SceneVerse and GPS is
unveiled through zero-shot transfer experiments on challenging 3D
vision-language tasks. Project website: https://scene-verse.github.io.
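To make the idea of scene-graph-based language generation concrete, here is a minimal Python sketch of how vision-language pairs might be derived from a scene graph. The graph schema, templates, and function names are illustrative assumptions for exposition only, not SceneVerse's actual pipeline.

```python
# Hypothetical sketch: turning a toy 3D scene graph into grounded
# (object id, description) pairs via templates. The schema and
# templates below are assumptions, not taken from the paper.

SCENE_GRAPH = {
    "objects": {
        0: {"label": "chair", "attributes": ["wooden"]},
        1: {"label": "table", "attributes": ["round"]},
    },
    # Each relation edge: (subject_id, predicate, object_id)
    "relations": [
        (0, "next to", 1),
    ],
}

def describe_object(graph, obj_id):
    """Render an object node as a short referring phrase."""
    obj = graph["objects"][obj_id]
    attrs = " ".join(obj["attributes"])
    # Collapse the double space left when an object has no attributes.
    return f"the {attrs} {obj['label']}".replace("  ", " ")

def generate_pairs(graph):
    """Turn each relation edge into a grounded (referent, text) pair."""
    pairs = []
    for subj, pred, obj in graph["relations"]:
        text = f"{describe_object(graph, subj)} {pred} {describe_object(graph, obj)}"
        pairs.append((subj, text))
    return pairs

print(generate_pairs(SCENE_GRAPH))
# → [(0, 'the wooden chair next to the round table')]
```

Because every description is tied to a specific object node, each generated sentence is grounded by construction, which is what makes this kind of template-based generation scale to millions of pairs without manual annotation.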