SceneGen: Single-Image 3D Scene Generation in One Feedforward Pass

Abstract

3D content generation has recently attracted significant research interest owing to its applications in VR/AR and embodied AI. In this work, we address the challenging task of synthesizing multiple 3D assets from a single scene image. Our contributions are fourfold: (i) we present SceneGen, a novel framework that takes a scene image and the corresponding object masks as input and simultaneously produces multiple 3D assets with geometry and texture. Notably, SceneGen requires neither optimization nor asset retrieval; (ii) we introduce a novel feature aggregation module that integrates local and global scene information from visual and geometric encoders within the feature extraction module. Coupled with a position head, this enables the generation of 3D assets and their relative spatial positions in a single feedforward pass; (iii) we demonstrate that SceneGen extends directly to multi-image input scenarios. Although it is trained solely on single-image inputs, its architectural design yields improved generation performance when multiple images are provided; and (iv) extensive quantitative and qualitative evaluations confirm the efficiency and robust generation abilities of our approach. We believe this paradigm offers a novel solution for high-quality 3D content generation, potentially advancing its practical applications in downstream tasks. The code and model will be publicly available at: https://mengmouxu.github.io/SceneGen.
