HexaGen3D: StableDiffusion is just one step away from Fast and Diverse Text-to-3D Generation
January 15, 2024
Authors: Antoine Mercier, Ramin Nakhli, Mahesh Reddy, Rajeev Yasarla, Hong Cai, Fatih Porikli, Guillaume Berger
cs.AI
Abstract
Despite the latest remarkable advances in generative modeling, efficient
generation of high-quality 3D assets from textual prompts remains a difficult
task. A key challenge lies in data scarcity: the most extensive 3D datasets
encompass merely millions of assets, while their 2D counterparts contain
billions of text-image pairs. To address this, we propose a novel approach
which harnesses the power of large, pretrained 2D diffusion models. More
specifically, our approach, HexaGen3D, fine-tunes a pretrained text-to-image
model to jointly predict 6 orthographic projections and the corresponding
latent triplane. We then decode these latents to generate a textured mesh.
HexaGen3D does not require per-sample optimization, and can infer high-quality
and diverse objects from textual prompts in 7 seconds, offering significantly
better quality-to-latency trade-offs compared to existing approaches.
Furthermore, HexaGen3D demonstrates strong generalization to new objects or
compositions.
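
To make the described pipeline concrete, below is a minimal PyTorch sketch of the feed-forward flow the abstract outlines: a fine-tuned text-to-image backbone jointly predicts six orthographic projections and a latent triplane, and a decoder turns those latents into a textured mesh. All module names (HexPredictor, TriplaneDecoder), tensor shapes, and the random outputs are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: module names, shapes, and outputs are placeholders,
# not the HexaGen3D code or architecture.
import torch
import torch.nn as nn


class HexPredictor(nn.Module):
    """Stand-in for the fine-tuned text-to-image model that jointly predicts
    six orthographic projections plus a latent triplane from a text prompt."""

    def __init__(self, latent_dim: int = 4, resolution: int = 64):
        super().__init__()
        self.latent_dim = latent_dim
        self.resolution = resolution

    def forward(self, text_embedding: torch.Tensor):
        b = text_embedding.shape[0]
        # Six latent views (e.g. front/back/left/right/top/bottom).
        views = torch.randn(b, 6, self.latent_dim, self.resolution, self.resolution)
        # Three axis-aligned feature planes forming the latent triplane.
        triplane = torch.randn(b, 3, self.latent_dim, self.resolution, self.resolution)
        return views, triplane


class TriplaneDecoder(nn.Module):
    """Stand-in decoder mapping the latent triplane to mesh geometry and texture."""

    def forward(self, triplane: torch.Tensor):
        b = triplane.shape[0]
        vertices = torch.randn(b, 1024, 3)             # placeholder vertex positions
        faces = torch.randint(0, 1024, (b, 2048, 3))   # placeholder face indices
        texture = torch.rand(b, 3, 256, 256)           # placeholder texture map
        return vertices, faces, texture


# Single feed-forward pass: no per-sample optimization loop is involved.
text_embedding = torch.randn(1, 768)  # assumed prompt embedding from a text encoder
views, triplane = HexPredictor()(text_embedding)
vertices, faces, texture = TriplaneDecoder()(triplane)
print(views.shape, triplane.shape, vertices.shape)
```

The point of the sketch is the control flow: unlike optimization-based text-to-3D methods, generation here is a single forward pass through the two stages, which is what enables the reported ~7-second inference.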