VFusion3D: Learning Scalable 3D Generative Models from Video Diffusion Models
March 18, 2024
Authors: Junlin Han, Filippos Kokkinos, Philip Torr
cs.AI
Abstract
This paper presents a novel paradigm for building scalable 3D generative
models utilizing pre-trained video diffusion models. The primary obstacle in
developing foundation 3D generative models is the limited availability of 3D
data. Unlike images, texts, or videos, 3D data are not readily accessible and
are difficult to acquire. This results in a significant disparity in scale
compared to the vast quantities of other types of data. To address this issue,
we propose using a video diffusion model, trained with extensive volumes of
text, images, and videos, as a knowledge source for 3D data. By unlocking its
multi-view generative capabilities through fine-tuning, we generate a
large-scale synthetic multi-view dataset to train a feed-forward 3D generative
model. The proposed model, VFusion3D, trained on nearly 3M synthetic multi-view
data, can generate a 3D asset from a single image in seconds and achieves
superior performance when compared to current SOTA feed-forward 3D generative
models, with users preferring our results over 70% of the time.
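The abstract describes a two-stage recipe: a video diffusion model is fine-tuned to unlock multi-view generation and used to produce a large synthetic multi-view dataset, and a feed-forward 3D generative model is then trained on that data to turn a single image into a 3D asset in seconds. The snippet below is a minimal, hypothetical sketch of that workflow only; every module, function, and constant (`synth_multiview_batch`, `FeedForward3D`, `NUM_VIEWS`, the toy losses) is a placeholder for illustration and is not the released VFusion3D code or architecture.

```python
# Hypothetical sketch of the two-stage recipe described in the abstract.
# Stage 1: a fine-tuned video diffusion model serves as a synthetic
# multi-view data source (stubbed out here).
# Stage 2: a feed-forward network learns to map one conditioning image
# to renderings at the remaining viewpoints.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_VIEWS, RES = 8, 64  # small placeholder values for illustration


def synth_multiview_batch(batch_size: int) -> torch.Tensor:
    """Stand-in for sampling the fine-tuned video diffusion model.
    Returns [B, NUM_VIEWS, 3, RES, RES] images orbiting an object."""
    return torch.rand(batch_size, NUM_VIEWS, 3, RES, RES)


class FeedForward3D(nn.Module):
    """Toy feed-forward generator: one conditioning image in, all views out.
    A real system would predict a 3D representation (e.g. a triplane NeRF)
    and render it; a small conv encoder-decoder stands in for that step."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Conv2d(64, 3 * NUM_VIEWS, 3, padding=1)

    def forward(self, cond_image: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(cond_image)        # [B, 64, RES/4, RES/4]
        views = self.decoder(feats)             # [B, 3*NUM_VIEWS, RES/4, RES/4]
        views = F.interpolate(views, size=(RES, RES),
                              mode="bilinear", align_corners=False)
        return views.view(-1, NUM_VIEWS, 3, RES, RES)


model = FeedForward3D()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

for step in range(10):  # tiny loop; the paper trains on ~3M synthetic examples
    target_views = synth_multiview_batch(batch_size=4)
    cond_image = target_views[:, 0]       # first view acts as the single input image
    pred_views = model(cond_image)
    loss = F.mse_loss(pred_views, target_views)  # real training uses richer losses
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The intent of the sketch is only to show where the scale comes from: the supervision signal is synthesized by the fine-tuned video diffusion model rather than collected from scarce 3D assets, and the 3D generator itself stays feed-forward so inference from a single image remains a single fast pass.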