HexaGen3D: StableDiffusion is just one step away from Fast and Diverse Text-to-3D Generation
January 15, 2024
Authors: Antoine Mercier, Ramin Nakhli, Mahesh Reddy, Rajeev Yasarla, Hong Cai, Fatih Porikli, Guillaume Berger
cs.AI
Abstract
Despite the latest remarkable advances in generative modeling, efficient
generation of high-quality 3D assets from textual prompts remains a difficult
task. A key challenge lies in data scarcity: the most extensive 3D datasets
encompass merely millions of assets, while their 2D counterparts contain
billions of text-image pairs. To address this, we propose a novel approach
which harnesses the power of large, pretrained 2D diffusion models. More
specifically, our approach, HexaGen3D, fine-tunes a pretrained text-to-image
model to jointly predict 6 orthographic projections and the corresponding
latent triplane. We then decode these latents to generate a textured mesh.
HexaGen3D does not require per-sample optimization, and can infer high-quality
and diverse objects from textual prompts in 7 seconds, offering significantly
better quality-to-latency trade-offs compared to existing approaches.
Furthermore, HexaGen3D demonstrates strong generalization to new objects or
compositions.
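
For readers who want a concrete picture of the feed-forward pipeline the abstract describes, below is a minimal, hypothetical PyTorch sketch. All names (HexaGen3DSketch, denoiser, text_proj, mesh_decoder) and the toy modules are assumptions made for illustration: a small convolution stands in for the fine-tuned Stable Diffusion UNet that jointly denoises the 6 orthographic-view latents and the latent triplane, and a linear layer stands in for the triplane-to-mesh decoder. This is not the authors' implementation, only an outline of the data flow.

    # Minimal, hypothetical sketch of a HexaGen3D-style feed-forward pipeline.
    # The tiny modules below are placeholders, not the paper's actual networks.
    import torch
    import torch.nn as nn

    class HexaGen3DSketch(nn.Module):
        """Jointly denoise 6 orthographic-view latents and a latent triplane,
        then decode the triplane into toy mesh vertices (texturing omitted)."""

        def __init__(self, latent_ch=4, view_res=32, plane_res=32, num_verts=1024):
            super().__init__()
            self.latent_ch, self.view_res, self.plane_res = latent_ch, view_res, plane_res
            self.num_verts = num_verts
            # Stand-in for the fine-tuned Stable Diffusion UNet denoiser.
            self.denoiser = nn.Conv2d(latent_ch, latent_ch, kernel_size=3, padding=1)
            # Stand-in for text conditioning (the real model uses cross-attention).
            self.text_proj = nn.Linear(768, latent_ch)
            # Stand-in for the triplane -> textured mesh decoder.
            self.mesh_decoder = nn.Linear(3 * latent_ch * plane_res * plane_res, num_verts * 3)

        @torch.no_grad()
        def forward(self, text_embedding: torch.Tensor, num_steps: int = 20):
            b = text_embedding.shape[0]
            cond = self.text_proj(text_embedding).view(b, 1, self.latent_ch, 1, 1)
            # 6 orthographic views + 3 triplane planes, denoised jointly and
            # feed-forward (no per-sample optimization), which keeps inference fast.
            views = torch.randn(b, 6, self.latent_ch, self.view_res, self.view_res) + cond
            planes = torch.randn(b, 3, self.latent_ch, self.plane_res, self.plane_res) + cond
            for _ in range(num_steps):
                views = views - 0.05 * self.denoiser(views.flatten(0, 1)).unflatten(0, (b, 6))
                planes = planes - 0.05 * self.denoiser(planes.flatten(0, 1)).unflatten(0, (b, 3))
            # Decode the latent triplane into mesh vertices.
            vertices = self.mesh_decoder(planes.flatten(1)).view(b, self.num_verts, 3)
            return views, planes, vertices

    # Usage: a random vector stands in for the output of a real text encoder.
    model = HexaGen3DSketch()
    views, planes, verts = model(torch.randn(1, 768))
    print(views.shape, planes.shape, verts.shape)  # (1, 6, 4, 32, 32), (1, 3, 4, 32, 32), (1, 1024, 3)

The key design point the sketch tries to convey is that both the multi-view latents and the triplane latents are produced by the same denoiser in a fixed number of steps, so the cost per object is a constant forward pass rather than a per-sample optimization loop.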