

Isotropic3D: Image-to-3D Generation Based on a Single CLIP Embedding

March 15, 2024
作者: Pengkun Liu, Yikai Wang, Fuchun Sun, Jiafang Li, Hang Xiao, Hongxiang Xue, Xinzhou Wang
cs.AI

Abstract

Encouraged by the growing availability of pre-trained 2D diffusion models, image-to-3D generation via Score Distillation Sampling (SDS) is making remarkable progress. Most existing methods combine novel-view lifting from a 2D diffusion model, which usually takes the reference image as a condition, with hard L2 image supervision at the reference view. Adhering too closely to the image, however, tends to corrupt the inductive knowledge of the 2D diffusion model and frequently leads to flat or distorted 3D generations. In this work, we reexamine image-to-3D from a novel perspective and present Isotropic3D, an image-to-3D generation pipeline that takes only an image CLIP embedding as input. Isotropic3D allows the optimization to be isotropic w.r.t. the azimuth angle by resting solely on the SDS loss. The core of our framework is a two-stage diffusion-model fine-tuning. First, we fine-tune a text-to-3D diffusion model by substituting its text encoder with an image encoder, through which the model preliminarily acquires image-to-image capabilities. Second, we fine-tune with our Explicit Multi-view Attention (EMA), which combines noisy multi-view images with the noise-free reference image as an explicit condition. The CLIP embedding is sent to the diffusion model throughout the whole process, while the reference image is discarded after fine-tuning. As a result, with a single image CLIP embedding, Isotropic3D can generate multi-view mutually consistent images as well as a 3D model with more symmetrical and neat content, well-proportioned geometry, richly colored texture, and less distortion than existing image-to-3D methods, while still largely preserving similarity to the reference image. The project page is available at https://isotropic3d.github.io/. The code and models are available at https://github.com/pkunliu/Isotropic3D.
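To make the optimization objective concrete, the following is a minimal NumPy sketch of the SDS gradient that the pipeline rests on: the rendered view is noised to level t, a frozen diffusion model predicts the noise conditioned on the CLIP embedding, and the weighted residual w(t)·(ε̂ − ε) is pushed back onto the 3D representation. The `denoiser` callable and the cosine noise schedule here are hypothetical stand-ins, not the paper's actual fine-tuned multi-view model.

```python
import numpy as np

def sds_grad(denoiser, x, embedding, t, rng, w=lambda t: 1.0):
    """Score Distillation Sampling gradient (hedged sketch).

    x         : rendered image (array) from the 3D representation
    denoiser  : hypothetical frozen diffusion model, maps
                (noisy image, condition embedding, t) -> predicted noise
    t         : noise level in [0, 1]
    Returns the per-pixel gradient w(t) * (eps_hat - eps), which is
    backpropagated through the renderer in a full pipeline.
    """
    eps = rng.standard_normal(x.shape)            # sampled noise
    alpha = np.cos(t * np.pi / 2) ** 2            # toy cosine schedule (assumption)
    x_t = np.sqrt(alpha) * x + np.sqrt(1 - alpha) * eps   # forward diffusion
    eps_hat = denoiser(x_t, embedding, t)         # conditioned noise prediction
    return w(t) * (eps_hat - eps)
```

Because the condition is only the image CLIP embedding, the same loss applies at every sampled azimuth, which is what makes the optimization isotropic across views.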

