ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models

Abstract

3D asset generation is attracting considerable attention, inspired by the recent success of text-guided 2D content creation. Existing text-to-3D methods either use pretrained text-to-image diffusion models in an optimization problem or fine-tune them on synthetic data, which often results in non-photorealistic 3D objects without backgrounds. In this paper, we present a method that leverages pretrained text-to-image models as a prior and learns to generate multi-view images in a single denoising process from real-world data. Concretely, we propose to integrate 3D volume-rendering and cross-frame-attention layers into each block of the existing U-Net of the text-to-image model. Moreover, we design an autoregressive generation scheme that renders more 3D-consistent images at any viewpoint. We train our model on real-world datasets of objects and showcase its ability to generate instances with a variety of high-quality shapes and textures in authentic surroundings. Compared to existing methods, the results generated by our method are consistent and have favorable visual quality (-30% FID, -37% KID).
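To make the cross-frame-attention idea more concrete, the following is a minimal sketch in plain Python and is not the authors' implementation. It assumes each view (frame) is a short list of feature tokens and lets every token attend to the tokens of all frames, which is the basic mechanism that allows views to share information. The function names `softmax` and `cross_frame_attention` are hypothetical. The sketch also uses raw features as queries, keys, and values, whereas a real U-Net attention layer would apply learned projections.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_frame_attention(frames):
    """Let every token attend to the tokens of ALL frames.

    frames: list of frames; each frame is a list of tokens; each token is
    a list of floats of equal length d.
    Assumption (illustration only): queries, keys and values are the raw
    token features, with no learned projection matrices.
    """
    d = len(frames[0][0])
    # Keys/values are pooled across frames: this is what makes the
    # attention "cross-frame" rather than per-image self-attention.
    all_tokens = [tok for frame in frames for tok in frame]
    out = []
    for frame in frames:
        new_frame = []
        for query in frame:
            scores = [
                sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                for key in all_tokens
            ]
            weights = softmax(scores)
            # Weighted average of the values from every frame.
            new_frame.append(
                [sum(w * v[j] for w, v in zip(weights, all_tokens)) for j in range(d)]
            )
        out.append(new_frame)
    return out


if __name__ == "__main__":
    # Two frames, one 2-dimensional token each.
    views = [[[1.0, 0.0]], [[0.0, 1.0]]]
    print(cross_frame_attention(views))
```

The output keeps the input layout (frames, tokens, channels), so a layer of this kind could in principle be dropped between existing blocks. Each output token is a mixture of features from all views, weighted toward the tokens most similar to the query.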

