Isotropic3D: Image-to-3D Generation Based on a Single CLIP Embedding

Abstract

Encouraged by the growing availability of pre-trained 2D diffusion models, image-to-3D generation based on Score Distillation Sampling (SDS) is making remarkable progress. Most existing methods combine novel-view lifting from 2D diffusion models, which usually take the reference image as a condition, with hard L2 image supervision at the reference view. Yet adhering too closely to the image tends to corrupt the inductive knowledge of the 2D diffusion model, frequently leading to flat or distorted 3D generation. In this work, we reexamine image-to-3D from a new perspective and present Isotropic3D, an image-to-3D generation pipeline that takes only an image CLIP embedding as input. By relying solely on the SDS loss, Isotropic3D allows the optimization to be isotropic with respect to the azimuth angle. The core of our framework is a two-stage diffusion model fine-tuning. First, we fine-tune a text-to-3D diffusion model by substituting its text encoder with an image encoder, through which the model preliminarily acquires image-to-image capabilities. Second, we fine-tune with our Explicit Multi-view Attention (EMA), which combines noisy multi-view images with the noise-free reference image as an explicit condition. The CLIP embedding is sent to the diffusion model throughout the whole process, while reference images are discarded once fine-tuning is complete. As a result, with a single image CLIP embedding, Isotropic3D can generate mutually consistent multi-view images as well as a 3D model with more symmetrical and neat content, well-proportioned geometry, richly colored texture, and less distortion than existing image-to-3D methods, while still largely preserving similarity to the reference image. The project page is available at https://isotropic3d.github.io/. The code and models are available at https://github.com/pkunliu/Isotropic3D.
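To make the role of the SDS loss more concrete, the following is a minimal sketch of the standard SDS gradient estimate, conditioned on an embedding vector in place of a text prompt. It is not the authors' implementation. The names `sds_gradient`, `render`, `render_jacobian`, `predict_noise`, and `run_toy_example` are hypothetical, the weighting w(t) = 1 - alpha_bar_t is one common choice assumed here, and the linear renderer and closed-form denoiser are toy stand-ins for a real 3D representation and a fine-tuned diffusion model.

```python
import numpy as np


def sds_gradient(theta, render, render_jacobian, predict_noise,
                 clip_embedding, alpha_bar_t, rng):
    """One stochastic SDS gradient estimate w.r.t. the 3D parameters theta."""
    x = render(theta)                    # rendered view, flattened to shape (d,)
    eps = rng.standard_normal(x.shape)   # Gaussian noise sample
    # Forward diffusion: noise the rendering to the chosen timestep.
    x_t = np.sqrt(alpha_bar_t) * x + np.sqrt(1.0 - alpha_bar_t) * eps
    # Conditional noise prediction; the only condition is the embedding.
    eps_hat = predict_noise(x_t, clip_embedding, alpha_bar_t)
    w_t = 1.0 - alpha_bar_t              # assumed weighting, not from the source
    # SDS omits the denoiser Jacobian: grad = w(t) * (eps_hat - eps) * dx/dtheta.
    return w_t * (render_jacobian(theta).T @ (eps_hat - eps))


def run_toy_example(steps=200, lr=0.5, seed=0):
    """Optimize theta with SDS against a toy denoiser (all stand-ins)."""
    rng = np.random.default_rng(seed)
    clip_embedding = np.array([1.0, -2.0, 0.5])  # hypothetical 3-dim "embedding"
    A = np.eye(3)                                # toy linear renderer

    def render(theta):
        return A @ theta

    def render_jacobian(theta):
        return A

    def predict_noise(x_t, embedding, alpha_bar_t):
        # Toy denoiser: assumes the clean image equals the embedding itself.
        return (x_t - np.sqrt(alpha_bar_t) * embedding) / np.sqrt(1.0 - alpha_bar_t)

    theta = np.zeros(3)
    for _ in range(steps):
        grad = sds_gradient(theta, render, render_jacobian, predict_noise,
                            clip_embedding, 0.5, rng)
        theta = theta - lr * grad
    return theta, clip_embedding
```

In this toy setting, calling `run_toy_example()` drives `theta` toward the vector the stub denoiser treats as the clean image. Note that no pixel-level L2 term against a reference view appears anywhere in the update, which mirrors the abstract's point that the optimization relies on the SDS loss alone.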

