Aligning Text, Images, and 3D Structure Token-by-Token

Abstract

Creating machines capable of understanding the world in 3D is essential for assisting designers who build and edit 3D environments and robots that navigate and interact within three-dimensional space. Inspired by advances in language and image modeling, we investigate the potential of autoregressive models for a new modality: structured 3D scenes. To this end, we propose a unified LLM framework that aligns language, images, and 3D scenes, and we provide a detailed "cookbook" outlining critical design choices for achieving optimal training and performance, addressing key questions related to data representation, modality-specific objectives, and more. We evaluate performance across four core 3D tasks (rendering, recognition, instruction-following, and question-answering) and four 3D datasets, both synthetic and real-world. We extend our approach to reconstruct complex 3D object shapes by enriching our 3D modality with quantized shape encodings, and we show our model's effectiveness on real-world 3D object recognition tasks. Project webpage: https://glab-caltech.github.io/kyvo/
