IM-3D: Iterative Multiview Diffusion and Reconstruction for High-Quality 3D Generation

Abstract

Most text-to-3D generators build upon off-the-shelf text-to-image models trained on billions of images. They use variants of Score Distillation Sampling (SDS), which is slow, somewhat unstable, and prone to artifacts. A mitigation is to fine-tune the 2D generator to be multi-view aware, which can help distillation or can be combined with reconstruction networks to output 3D objects directly. In this paper, we further explore the design space of text-to-3D models. We significantly improve multi-view generation by considering video instead of image generators. Combined with a 3D reconstruction algorithm which, by using Gaussian splatting, can optimize a robust image-based loss, we directly produce high-quality 3D outputs from the generated views. Our new method, IM-3D, reduces the number of evaluations of the 2D generator network by 10-100x, resulting in a much more efficient pipeline, better quality, fewer geometric inconsistencies, and a higher yield of usable 3D assets.
