VFusion3D: Learning Scalable 3D Generative Models from Video Diffusion Models

March 18, 2024
作者: Junlin Han, Filippos Kokkinos, Philip Torr
cs.AI

Abstract

This paper presents a novel paradigm for building scalable 3D generative models utilizing pre-trained video diffusion models. The primary obstacle in developing foundation 3D generative models is the limited availability of 3D data. Unlike images, texts, or videos, 3D data are not readily accessible and are difficult to acquire. This results in a significant disparity in scale compared to the vast quantities of other types of data. To address this issue, we propose using a video diffusion model, trained with extensive volumes of text, images, and videos, as a knowledge source for 3D data. By unlocking its multi-view generative capabilities through fine-tuning, we generate a large-scale synthetic multi-view dataset to train a feed-forward 3D generative model. The proposed model, VFusion3D, trained on nearly 3M synthetic multi-view data, can generate a 3D asset from a single image in seconds and achieves superior performance when compared to current SOTA feed-forward 3D generative models, with users preferring our results over 70% of the time.
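As a rough illustration of the pipeline the abstract describes (fine-tune a video diffusion model into a multi-view generator, synthesize a large multi-view dataset with it, then train a feed-forward single-image-to-3D model on those views), here is a minimal toy sketch in PyTorch. All class and function names (MultiViewVideoDiffusion, FeedForward3DGenerator, sample_views) are hypothetical stand-ins for illustration only, not the authors' released code or architecture.

```python
# Toy sketch of a VFusion3D-style two-stage pipeline.
# Stage 1 (teacher): a video diffusion model fine-tuned to emit multi-view images.
# Stage 2 (student): a feed-forward model trained on those synthetic views to map
# a single image to a 3D representation. Everything here is a placeholder.
import torch
import torch.nn as nn


class MultiViewVideoDiffusion(nn.Module):
    """Stand-in for the fine-tuned video diffusion model used as a data source."""

    def __init__(self, num_views: int = 16, image_size: int = 64):
        super().__init__()
        self.num_views = num_views
        self.image_size = image_size

    @torch.no_grad()
    def sample_views(self, batch: int) -> torch.Tensor:
        # Placeholder sampling: returns (B, V, 3, H, W) "multi-view" images.
        return torch.rand(batch, self.num_views, 3, self.image_size, self.image_size)


class FeedForward3DGenerator(nn.Module):
    """Stand-in for the feed-forward image-to-3D model (the VFusion3D student)."""

    def __init__(self, image_size: int = 64, latent_dim: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Flatten(), nn.Linear(3 * image_size * image_size, latent_dim), nn.ReLU()
        )
        # A real model would predict a 3D representation (e.g. a triplane) and render
        # it; here the decoder just regresses pixels of another view as a toy target.
        self.decoder = nn.Linear(latent_dim, 3 * image_size * image_size)

    def forward(self, front_view: torch.Tensor) -> torch.Tensor:
        latent = self.encoder(front_view)  # latent inferred from a single image
        return self.decoder(latent)        # toy stand-in for a rendered novel view


def main() -> None:
    teacher = MultiViewVideoDiffusion()
    student = FeedForward3DGenerator()
    optim = torch.optim.Adam(student.parameters(), lr=1e-4)

    for step in range(3):  # a few steps, just to show the loop structure
        views = teacher.sample_views(batch=4)   # synthetic multi-view supervision
        front = views[:, 0]                     # single input image
        target = views[:, 1].flatten(1)         # another view as the toy target
        pred = student(front)
        loss = nn.functional.mse_loss(pred, target)
        optim.zero_grad()
        loss.backward()
        optim.step()
        print(f"step {step}: loss={loss.item():.4f}")


if __name__ == "__main__":
    main()
```

At inference time, only the feed-forward student is needed, which is why the paper reports generating a 3D asset from a single image in seconds rather than running an iterative diffusion or per-scene optimization loop.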
