MVD^2: Efficient Multiview 3D Reconstruction for Multiview Diffusion
February 22, 2024
Authors: Xin-Yang Zheng, Hao Pan, Yu-Xiao Guo, Xin Tong, Yang Liu
cs.AI
Abstract
As a promising 3D generation technique, multiview diffusion (MVD) has
received a lot of attention due to its advantages in terms of generalizability,
quality, and efficiency. By finetuning pretrained large image diffusion models
with 3D data, the MVD methods first generate multiple views of a 3D object
based on an image or text prompt and then reconstruct 3D shapes with multiview
3D reconstruction. However, the sparse views and inconsistent details in the
generated images make 3D reconstruction challenging. We present MVD^2, an
efficient 3D reconstruction method for multiview diffusion (MVD) images.
MVD^2 aggregates image features into a 3D feature volume by projection and
convolution and then decodes volumetric features into a 3D mesh. We train
MVD^2 with 3D shape collections and MVD images prompted by rendered views of
3D shapes. To address the discrepancy between the generated multiview images
and the ground-truth views of the 3D shapes, we design a simple yet efficient
view-dependent training scheme. MVD^2 improves the 3D generation quality of
MVD and is fast and robust to various MVD methods. After training, it can
efficiently decode 3D meshes from multiview images within one second. We train
MVD^2 with Zero-123++ and the Objaverse-LVIS 3D dataset and demonstrate its
superior performance in generating 3D models from multiview images generated by
different MVD methods, using both synthetic and real images as prompts.
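The pipeline the abstract describes, aggregating multiview image features into a 3D feature volume by projection and then decoding the volume into geometry, can be sketched in miniature. This is a pure-Python toy, not the paper's method: the orthographic camera model, the averaging aggregation, and the threshold "decoder" are all illustrative assumptions standing in for the learned convolutions and mesh decoder.

```python
# Toy sketch of a projection-and-aggregation reconstruction pipeline.
# Assumptions (not from the paper): orthographic cameras aligned with the
# volume axes, feature aggregation by averaging, and a fixed-threshold
# occupancy decoder instead of a learned mesh decoder.

def aggregate_views(view_features, res):
    """Average per-view 2D feature grids into a res^3 feature volume.

    view_features: list of (axis, grid) pairs, where `axis` in {0, 1, 2}
    is the orthographic projection axis and `grid` is a res x res map.
    """
    volume = [[[0.0] * res for _ in range(res)] for _ in range(res)]
    for axis, grid in view_features:
        for x in range(res):
            for y in range(res):
                for z in range(res):
                    # Orthographic projection: drop the coordinate along `axis`
                    # to find the pixel this voxel projects onto.
                    u, v = [c for i, c in enumerate((x, y, z)) if i != axis]
                    volume[x][y][z] += grid[u][v]
    n = len(view_features)
    return [[[v / n for v in row] for row in plane] for plane in volume]

def decode_occupancy(volume, threshold=0.5):
    """Stand-in for the learned decoder: threshold to binary occupancy."""
    return [[[1 if v > threshold else 0 for v in row] for row in plane]
            for plane in volume]

res = 4
# Two synthetic "views" of a centered 2x2 square: one projecting along z,
# one along x. Aggregation then behaves like a visual-hull intersection.
square = [[1.0 if 1 <= u <= 2 and 1 <= v <= 2 else 0.0 for v in range(res)]
          for u in range(res)]
volume = aggregate_views([(2, square), (0, square)], res)
occ = decode_occupancy(volume)
filled = sum(v for plane in occ for row in plane for v in row)
```

With two views, a voxel survives the threshold only where both projections agree, so `occ` marks the 2x2x2 block where x, y, and z all lie in {1, 2}; the real method replaces this hard intersection with learned 3D convolutions over the aggregated features.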