ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models
March 4, 2024
Authors: Lukas Höllein, Aljaž Božič, Norman Müller, David Novotny, Hung-Yu Tseng, Christian Richardt, Michael Zollhöfer, Matthias Nießner
cs.AI
Abstract
3D asset generation is getting massive amounts of attention, inspired by the
recent success of text-guided 2D content creation. Existing text-to-3D methods
use pretrained text-to-image diffusion models in an optimization problem or
fine-tune them on synthetic data, which often results in non-photorealistic 3D
objects without backgrounds. In this paper, we present a method that leverages
pretrained text-to-image models as a prior and learns to generate multi-view
images in a single denoising process from real-world data. Concretely, we
propose to integrate 3D volume-rendering and cross-frame-attention layers into
each block of the existing U-Net network of the text-to-image model. Moreover,
we design an autoregressive generation that renders more 3D-consistent images
at any viewpoint. We train our model on real-world datasets of objects and
showcase its capabilities to generate instances with a variety of high-quality
shapes and textures in authentic surroundings. Compared to the existing
methods, the results generated by our method are consistent, and have favorable
visual quality (-30% FID, -37% KID).
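
To make the architectural idea above more concrete (a cross-frame-attention layer added to each block of a pretrained U-Net so that all views of an object are denoised jointly), here is a minimal PyTorch sketch. It is an illustrative simplification under stated assumptions, not the authors' code: the class names, the pooling of tokens across frames, and the way the pretrained block is wrapped are all invented for this example, and the 3D volume-rendering branch is omitted.

```python
# Minimal sketch (not the paper's implementation) of inserting a
# cross-frame-attention layer into a U-Net block so that the N views of one
# object exchange features during denoising. All names here are assumptions.
import torch
import torch.nn as nn


class CrossFrameAttention(nn.Module):
    """Self-attention over tokens gathered from all frames of a batch element."""

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, num_frames: int) -> torch.Tensor:
        # x: (B * num_frames, C, H, W) feature maps from the U-Net block.
        bn, c, h, w = x.shape
        b = bn // num_frames
        # Flatten spatial positions and concatenate the tokens of all frames,
        # so attention can propagate information across viewpoints.
        tokens = x.permute(0, 2, 3, 1).reshape(b, num_frames * h * w, c)
        normed = self.norm(tokens)
        tokens = tokens + self.attn(normed, normed, normed, need_weights=False)[0]
        tokens = tokens.reshape(b, num_frames, h, w, c)
        return tokens.permute(0, 1, 4, 2, 3).reshape(bn, c, h, w)


class AugmentedUNetBlock(nn.Module):
    """Wraps an existing (pretrained) U-Net block with cross-frame attention."""

    def __init__(self, base_block: nn.Module, channels: int):
        super().__init__()
        self.base_block = base_block            # pretrained text-to-image block
        self.cross_frame = CrossFrameAttention(channels)

    def forward(self, x: torch.Tensor, num_frames: int) -> torch.Tensor:
        x = self.base_block(x)                  # ordinary per-image processing
        x = self.cross_frame(x, num_frames)     # exchange features across views
        return x
```

The point of the sketch is only that the pretrained per-image layers stay in place while a new layer lets features from different viewpoints attend to one another; in the actual method this is paired with a 3D volume-rendering layer in each block.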
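The abstract also mentions an autoregressive generation procedure that renders additional viewpoints consistently with the images already produced. The loop below is a hypothetical outline of how such a scheme could be organized; `denoise_batch`, `views_per_step`, and the pose/conditioning interface are placeholders rather than the paper's actual API.

```python
# Illustrative sketch (assumptions, not the paper's code) of autoregressive
# multi-view generation: each new chunk of viewpoints is denoised while
# conditioning on the images generated so far, keeping the object consistent.
from typing import Callable, List

import torch


@torch.no_grad()
def generate_autoregressive(
    denoise_batch: Callable[..., torch.Tensor],
    all_poses: List[torch.Tensor],        # camera pose for every target viewpoint
    views_per_step: int = 4,
    image_shape=(3, 256, 256),
) -> List[torch.Tensor]:
    """Generate images for all_poses in chunks, reusing earlier outputs as conditioning."""
    generated: List[torch.Tensor] = []
    generated_poses: List[torch.Tensor] = []
    for start in range(0, len(all_poses), views_per_step):
        poses = all_poses[start:start + views_per_step]
        noise = torch.randn(len(poses), *image_shape)
        # denoise_batch stands in for the full diffusion sampling loop of the
        # pose-conditioned model; previously generated views and their poses are
        # passed as conditioning so that new viewpoints agree with them.
        images = denoise_batch(noise, poses, generated, generated_poses)
        generated.extend(images.unbind(0))
        generated_poses.extend(poses)
    return generated
```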