

Progressive Rendering Distillation: Adapting Stable Diffusion for Instant Text-to-Mesh Generation without 3D Data

March 27, 2025
作者: Zhiyuan Ma, Xinyue Liang, Rongyuan Wu, Xiangyu Zhu, Zhen Lei, Lei Zhang
cs.AI

Abstract

It is highly desirable to obtain a model that can generate high-quality 3D meshes from text prompts in just seconds. While recent attempts have adapted pre-trained text-to-image diffusion models, such as Stable Diffusion (SD), into generators of 3D representations (e.g., Triplane), they often suffer from poor quality due to the lack of sufficient high-quality 3D training data. To overcome this data shortage, we propose a novel training scheme, termed Progressive Rendering Distillation (PRD), which eliminates the need for 3D ground truths by distilling multi-view diffusion models and adapting SD into a native 3D generator. In each training iteration, PRD uses the U-Net to progressively denoise the latent from random noise over a few steps, and at each step it decodes the denoised latent into a 3D output. Multi-view diffusion models, including MVDream and RichDreamer, are used jointly with SD to distill text-consistent textures and geometries into the 3D outputs through score distillation. Since PRD supports training without 3D ground truths, we can easily scale up the training data and improve generation quality for challenging text prompts with creative concepts. Meanwhile, PRD reduces inference to just a few denoising steps, accelerating the generation model. With PRD, we train a Triplane generator, namely TriplaneTurbo, which adds only 2.5% trainable parameters to adapt SD for Triplane generation. TriplaneTurbo outperforms previous text-to-3D generators in both efficiency and quality. Specifically, it can produce high-quality 3D meshes in 1.2 seconds and generalizes well to challenging text inputs. The code is available at https://github.com/theEricMa/TriplaneTurbo.
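The training loop described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: `unet`, `decode_triplane`, `render_views`, and `score_distillation_loss` are hypothetical placeholder callables standing in for the adapted SD U-Net, the Triplane decoder, a differentiable renderer, and the MVDream/RichDreamer score-distillation teachers, respectively. Gradient-flow and scheduling details in the actual method may differ.

```python
# Minimal sketch of one Progressive Rendering Distillation (PRD) iteration.
# All model components are assumed to be provided by the caller; names are
# placeholders, not the paper's API.
import torch

def prd_training_step(unet, decode_triplane, render_views,
                      score_distillation_loss, optimizer,
                      text_embedding, num_steps=4, latent_shape=(1, 4, 64, 64)):
    """One PRD iteration: progressively denoise a latent from pure noise and
    supervise the decoded 3D output at every denoising step (no 3D ground truth)."""
    latent = torch.randn(latent_shape)               # start from random noise
    timesteps = torch.linspace(999, 0, num_steps).long()

    total_loss = 0.0
    for t in timesteps:
        # Predict a cleaner latent with the (lightly adapted) SD U-Net.
        latent = unet(latent, t, text_embedding)

        # Decode the current latent into a Triplane and render 2D views.
        triplane = decode_triplane(latent)
        rendered = render_views(triplane)            # e.g. RGB / normal renderings

        # Multi-view teachers (e.g. MVDream, RichDreamer) supervise the
        # renderings via score distillation, replacing 3D ground truths.
        total_loss = total_loss + score_distillation_loss(rendered, text_embedding, t)

    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
    return float(total_loss)
```

Because every denoising step is supervised, the same generator can later be run with only a few steps at inference time, which is what the abstract credits for the ~1.2-second generation speed.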

