MVD^2: Efficient Multiview 3D Reconstruction for Multiview Diffusion
February 22, 2024
Authors: Xin-Yang Zheng, Hao Pan, Yu-Xiao Guo, Xin Tong, Yang Liu
cs.AI
Abstract
As a promising 3D generation technique, multiview diffusion (MVD) has
attracted considerable attention for its generalizability, quality, and
efficiency. By finetuning pretrained large image diffusion models with 3D
data, MVD methods first generate multiple views of a 3D object based on an
image or text prompt and then reconstruct the 3D shape with multiview
3D reconstruction. However, the sparse views and inconsistent details in the
generated images make 3D reconstruction challenging. We present MVD^2, an
efficient 3D reconstruction method for MVD images.
MVD^2 aggregates image features into a 3D feature volume by projection and
convolution and then decodes volumetric features into a 3D mesh. We train
MVD^2 with 3D shape collections and MVD images prompted by rendered views of
3D shapes. To address the discrepancy between the generated multiview images
and ground-truth views of the 3D shapes, we design a simple yet efficient
view-dependent training scheme. MVD^2 improves the 3D generation quality of
MVD and is fast and robust to various MVD methods. After training, it can
efficiently decode 3D meshes from multiview images within one second. We train
MVD^2 with Zero-123++ and the Objaverse-LVIS 3D dataset and demonstrate its
superior performance in generating 3D models from multiview images generated by
different MVD methods, using both synthetic and real images as prompts.
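
The projection step mentioned in the abstract (aggregating image features into a 3D feature volume) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes orthographic cameras orbiting a vertical axis, scalar per-pixel features, nearest-pixel sampling, and plain averaging across views. The function names (view_axes, to_pixel, aggregate_volume) are hypothetical, and the 3D convolution and mesh decoding stages are omitted.

```python
# Hypothetical sketch: unproject per-view features into a voxel grid by
# orthographic projection and average them across views. All names are
# illustrative and are not taken from the MVD^2 implementation.
import math


def view_axes(azimuth_deg):
    """Right and up axes of an orthographic camera orbiting the z (up) axis."""
    a = math.radians(azimuth_deg)
    right = (-math.sin(a), math.cos(a), 0.0)
    up = (0.0, 0.0, 1.0)
    return right, up


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def to_pixel(coord, size):
    """Map a coordinate in [-1, 1] to a pixel index, clamped to [0, size - 1]."""
    idx = int((coord + 1.0) * 0.5 * size)
    return min(max(idx, 0), size - 1)


def aggregate_volume(feature_maps, azimuths_deg, res):
    """Return a res^3 nested list of view-averaged scalar features.

    feature_maps[k][row][col] is the scalar feature of view k at one pixel.
    """
    size = len(feature_maps[0])
    axes = [view_axes(a) for a in azimuths_deg]
    # Voxel centers on a regular grid spanning [-1, 1]^3.
    centers = [-1.0 + (2.0 * i + 1.0) / res for i in range(res)]
    volume = [[[0.0] * res for _ in range(res)] for _ in range(res)]
    for ix, x in enumerate(centers):
        for iy, y in enumerate(centers):
            for iz, z in enumerate(centers):
                total = 0.0
                for fmap, (right, up) in zip(feature_maps, axes):
                    point = (x, y, z)
                    col = to_pixel(dot(point, right), size)
                    row = to_pixel(-dot(point, up), size)  # rows grow downward
                    total += fmap[row][col]
                volume[ix][iy][iz] = total / len(feature_maps)
    return volume
```

Calling aggregate_volume(maps, [0.0, 90.0], 2) on two 4x4 feature maps yields a 2x2x2 volume in which each voxel holds the mean of the features it projects onto in the two views. A learned pipeline would typically use multi-channel features and interpolated sampling rather than the nearest-pixel lookup used here.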