SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding

Abstract

3D vision-language grounding, which focuses on aligning language with the 3D physical environment, is a cornerstone in the development of embodied agents. Compared with recent advances in the 2D domain, grounding language in 3D scenes faces several significant challenges: (i) the inherent complexity of 3D scenes, which stems from diverse object configurations, rich attributes, and intricate relationships; (ii) the scarcity of paired 3D vision-language data to support grounded learning; and (iii) the absence of a unified learning framework to distill knowledge from grounded 3D data. In this work, we aim to address these three challenges by examining the potential of systematically scaling up 3D vision-language learning in indoor environments. We introduce the first million-scale 3D vision-language dataset, SceneVerse, which encompasses about 68K 3D indoor scenes and 2.5M vision-language pairs derived from both human annotations and our scalable scene-graph-based generation approach. We demonstrate that this scaling enables a unified pre-training framework for 3D vision-language learning, Grounded Pre-training for Scenes (GPS). Through extensive experiments, we show the effectiveness of GPS by achieving state-of-the-art performance on all existing 3D visual grounding benchmarks. The potential of SceneVerse and GPS is further demonstrated through zero-shot transfer experiments on challenging 3D vision-language tasks. Project website: https://scene-verse.github.io
