One-2-3-45: Any Single Image to 3D Mesh in 45 Seconds without Per-Shape Optimization
June 29, 2023
Authors: Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Mukund Varma T, Zexiang Xu, Hao Su
cs.AI
Abstract
Single image 3D reconstruction is an important but challenging task that
requires extensive knowledge of our natural world. Many existing methods solve
this problem by optimizing a neural radiance field under the guidance of 2D
diffusion models but suffer from lengthy optimization time, 3D-inconsistent
results, and poor geometry. In this work, we propose a novel method that takes
a single image of any object as input and generates a full 360-degree 3D
textured mesh in a single feed-forward pass. Given a single image, we first use
a view-conditioned 2D diffusion model, Zero123, to generate multi-view images
for the input view, and then aim to lift them up to 3D space. Since traditional
reconstruction methods struggle with inconsistent multi-view predictions, we
build our 3D reconstruction module upon an SDF-based generalizable neural
surface reconstruction method and propose several critical training strategies
to enable the reconstruction of 360-degree meshes. Without costly
optimizations, our method reconstructs 3D shapes in significantly less time
than existing methods. Moreover, our method produces better geometry, generates
more 3D-consistent results, and adheres more closely to the input image. We
evaluate our approach on both synthetic data and in-the-wild images and
demonstrate its superiority in terms of both mesh quality and runtime. In
addition, our approach can seamlessly support the text-to-3D task by
integrating with off-the-shelf text-to-image diffusion models.
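
A minimal Python sketch of the feed-forward pipeline described above. The loader and helper names (load_zero123, load_sdf_reconstructor, extract_textured_mesh) are hypothetical placeholders for illustration, not the authors' released API; the sketch only mirrors the three stages the abstract describes.

from PIL import Image

def image_to_textured_mesh(image_path, output_path="mesh.obj", n_views=8):
    """Single image -> full 360-degree textured mesh, in one forward pass."""
    image = Image.open(image_path).convert("RGB")

    # Stage 1: a view-conditioned 2D diffusion model (Zero123) predicts
    # images of the object from several surrounding camera poses.
    zero123 = load_zero123()                  # hypothetical loader
    views, poses = zero123.predict_views(image, n_views=n_views)

    # Stage 2: an SDF-based generalizable neural surface reconstruction
    # module lifts the (possibly inconsistent) multi-view predictions to a
    # signed distance field without per-shape optimization.
    reconstructor = load_sdf_reconstructor()  # hypothetical loader
    sdf = reconstructor(views, poses)

    # Stage 3: extract and save the textured 360-degree mesh from the SDF.
    mesh = extract_textured_mesh(sdf)         # e.g. via marching cubes
    mesh.export(output_path)
    return mesh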