Wonderland: Navigating 3D Scenes from a Single Image

Abstract

This paper addresses a challenging question: How can we efficiently create high-quality, wide-scope 3D scenes from a single arbitrary image? Existing methods face several constraints, such as requiring multi-view data, time-consuming per-scene optimization, low visual quality in backgrounds, and distorted reconstructions in unseen areas. We propose a novel pipeline to overcome these limitations. Specifically, we introduce a large-scale reconstruction model that uses latents from a video diffusion model to predict 3D Gaussian Splattings for the scenes in a feed-forward manner. The video diffusion model is designed to create videos precisely following specified camera trajectories, allowing it to generate compressed video latents that contain multi-view information while maintaining 3D consistency. We train the 3D reconstruction model to operate on the video latent space with a progressive training strategy, enabling the efficient generation of high-quality, wide-scope, and generic 3D scenes. Extensive evaluations across various datasets demonstrate that our model significantly outperforms existing methods for single-view 3D scene generation, particularly with out-of-domain images. For the first time, we demonstrate that a 3D reconstruction model can be effectively built upon the latent space of a diffusion model to realize efficient 3D scene generation.

