

SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding

January 17, 2024
Authors: Baoxiong Jia, Yixin Chen, Huangyue Yu, Yan Wang, Xuesong Niu, Tengyu Liu, Qing Li, Siyuan Huang
cs.AI

Abstract

3D vision-language grounding, which focuses on aligning language with the 3D physical environment, stands as a cornerstone in the development of embodied agents. In comparison to recent advancements in the 2D domain, grounding language in 3D scenes faces several significant challenges: (i) the inherent complexity of 3D scenes due to the diverse object configurations, their rich attributes, and intricate relationships; (ii) the scarcity of paired 3D vision-language data to support grounded learning; and (iii) the absence of a unified learning framework to distill knowledge from grounded 3D data. In this work, we aim to address these three major challenges in 3D vision-language by examining the potential of systematically upscaling 3D vision-language learning in indoor environments. We introduce the first million-scale 3D vision-language dataset, SceneVerse, encompassing about 68K 3D indoor scenes and comprising 2.5M vision-language pairs derived from both human annotations and our scalable scene-graph-based generation approach. We demonstrate that this scaling allows for a unified pre-training framework, Grounded Pre-training for Scenes (GPS), for 3D vision-language learning. Through extensive experiments, we showcase the effectiveness of GPS by achieving state-of-the-art performance on all existing 3D visual grounding benchmarks. The vast potential of SceneVerse and GPS is unveiled through zero-shot transfer experiments in the challenging 3D vision-language tasks. Project website: https://scene-verse.github.io.