Isotropic3D: Image-to-3D Generation Based on a Single CLIP Embedding

March 15, 2024
Authors: Pengkun Liu, Yikai Wang, Fuchun Sun, Jiafang Li, Hang Xiao, Hongxiang Xue, Xinzhou Wang
cs.AI

Abstract

Encouraged by the growing availability of pre-trained 2D diffusion models, image-to-3D generation by leveraging Score Distillation Sampling (SDS) is making remarkable progress. Most existing methods combine novel-view lifting from 2D diffusion models, which usually take the reference image as a condition, while applying hard L2 image supervision at the reference view. Yet adhering too heavily to the image tends to corrupt the inductive knowledge of the 2D diffusion model, frequently leading to flat or distorted 3D generation. In this work, we reexamine image-to-3D from a novel perspective and present Isotropic3D, an image-to-3D generation pipeline that takes only an image CLIP embedding as input. Isotropic3D allows the optimization to be isotropic w.r.t. the azimuth angle by resting solely on the SDS loss. The core of our framework lies in a two-stage diffusion model fine-tuning. First, we fine-tune a text-to-3D diffusion model by substituting its text encoder with an image encoder, through which the model preliminarily acquires image-to-image capabilities. Second, we perform fine-tuning using our Explicit Multi-view Attention (EMA), which combines noisy multi-view images with the noise-free reference image as an explicit condition. The CLIP embedding is sent to the diffusion model throughout the whole process, while the reference image is discarded once fine-tuning is complete. As a result, with a single image CLIP embedding, Isotropic3D is capable of generating mutually consistent multi-view images and a 3D model with more symmetrical and neat content, well-proportioned geometry, richly colored texture, and less distortion than existing image-to-3D methods, while still largely preserving similarity to the reference image. The project page is available at https://isotropic3d.github.io/. The code and models are available at https://github.com/pkunliu/Isotropic3D.
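To make the pipeline described above concrete, here is a minimal PyTorch sketch of an SDS-style objective driven only by an image CLIP embedding, with camera azimuths sampled uniformly so that no view is privileged. The `diffusion_model`, `renderer`, and `add_noise` interfaces are hypothetical placeholders for illustration, not the authors' actual API.

```python
import torch
import torch.nn.functional as F

def sds_loss(diffusion_model, renderer, clip_embedding, cameras,
             num_train_timesteps=1000, guidance_scale=10.0):
    # Render the current 3D representation from uniformly sampled azimuths,
    # so the objective treats every viewing direction the same (isotropic).
    rendered = renderer(cameras)  # (B, 3, H, W), requires grad

    # Perturb the renderings with noise at a random diffusion timestep.
    t = torch.randint(0, num_train_timesteps, (rendered.shape[0],),
                      device=rendered.device)
    noise = torch.randn_like(rendered)
    noisy = diffusion_model.add_noise(rendered, noise, t)  # hypothetical API

    # Predict the noise conditioned only on the image CLIP embedding,
    # with classifier-free guidance against the unconditional branch.
    with torch.no_grad():
        eps_cond = diffusion_model(noisy, t, clip_embedding)
        eps_uncond = diffusion_model(noisy, t, None)
        eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)

    # SDS: push the rendering along (eps - noise); the target is detached,
    # so gradients flow only into the renderer's 3D parameters.
    target = (rendered - (eps - noise)).detach()
    return 0.5 * F.mse_loss(rendered, target, reduction="sum")
```

Likewise, a rough sketch of the Explicit Multi-view Attention (EMA) idea: latents of the noisy target views attend jointly with the latent of the noise-free reference image, which serves purely as an explicit condition. Tensor shapes and the module interface are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ExplicitMultiViewAttention(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, noisy_views, ref_view):
        # noisy_views: (B, N, L, C) latents of the noised target views
        # ref_view:    (B, 1, L, C) latent of the noise-free reference image
        B, N, L, C = noisy_views.shape
        tokens = torch.cat([noisy_views, ref_view], dim=1)
        tokens = tokens.reshape(B, (N + 1) * L, C)
        # All views, clean reference included, attend to one another.
        out, _ = self.attn(tokens, tokens, tokens)
        out = out.reshape(B, N + 1, L, C)
        # Only the noisy target views are propagated; the reference branch
        # acts as a pure condition, mirroring how the reference image itself
        # is discarded once fine-tuning is complete.
        return out[:, :N]
```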
