ShowRoom3D: Text to High-Quality 3D Room Generation Using 3D Priors

Abstract

We introduce ShowRoom3D, a three-stage approach for generating high-quality 3D room-scale scenes from text. Previous methods that optimize neural radiance fields (NeRF) with 2D diffusion priors to generate room-scale scenes have produced unsatisfactory quality. This is primarily attributed to the lack of 3D awareness in 2D priors and to constraints in the training methodology. In this paper, we use a 3D diffusion prior, MVDiffusion, to optimize the 3D room-scale scene. Our contributions are twofold. First, we propose a progressive view selection process for optimizing NeRF, which divides training into three stages and gradually expands the camera sampling scope. Second, we propose a pose transformation method in the second stage, which ensures that MVDiffusion provides accurate view guidance. As a result, ShowRoom3D generates rooms with improved structural integrity, enhanced clarity from any view, reduced content repetition, and higher consistency across different perspectives. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches in a user study.
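The progressive view selection idea can be illustrated with a minimal sketch. The abstract only states that training is split into three stages with a gradually expanding camera sampling scope; the stage lengths, the offset radii, the disc-shaped sampling region, and the names `Stage`, `SCHEDULE`, `stage_for_step`, and `sample_camera` below are all assumptions made for illustration, not the authors' implementation.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    steps: int         # optimization steps spent in this stage (hypothetical value)
    max_offset: float  # max camera distance from the room center (hypothetical value)


# Hypothetical three-stage schedule: the sampling scope only ever grows.
SCHEDULE = (
    Stage("stage1", 1000, 0.0),  # camera fixed at the room center
    Stage("stage2", 1000, 0.5),  # positions sampled near the center
    Stage("stage3", 1000, 1.0),  # positions sampled over the widest range
)


def stage_for_step(step, schedule=SCHEDULE):
    """Return the stage that contains the given global training step."""
    if step < 0:
        raise ValueError("step must be non-negative")
    start = 0
    for stage in schedule:
        if step < start + stage.steps:
            return stage
        start += stage.steps
    return schedule[-1]  # past the end of the schedule: stay in the last stage


def sample_camera(step, rng, schedule=SCHEDULE):
    """Sample a floor-plane camera position and a yaw angle for this step."""
    stage = stage_for_step(step, schedule)
    # sqrt makes the samples uniform over the disc of radius max_offset.
    radius = stage.max_offset * math.sqrt(rng.random())
    angle = rng.uniform(0.0, 2.0 * math.pi)
    yaw = rng.uniform(0.0, 2.0 * math.pi)
    position = (radius * math.cos(angle), radius * math.sin(angle))
    return position, yaw
```

In a training loop, `sample_camera(step, rng)` would be called once per iteration to pick the view that is rendered and then guided by the diffusion prior. The pose transformation used in the second stage is not sketched here because the abstract gives no details about it.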
