GTR: Improving Large 3D Reconstruction Models through Geometry and Texture Refinement
June 9, 2024
Authors: Peiye Zhuang, Songfang Han, Chaoyang Wang, Aliaksandr Siarohin, Jiaxu Zou, Michael Vasilkovsky, Vladislav Shakhrai, Sergey Korolev, Sergey Tulyakov, Hsin-Ying Lee
cs.AI
Abstract
We propose a novel approach for 3D mesh reconstruction from multi-view
images. Our method takes inspiration from large reconstruction models like LRM
that use a transformer-based triplane generator and a Neural Radiance Field
(NeRF) model trained on multi-view images. However, in our method, we introduce
several important modifications that allow us to significantly enhance 3D
reconstruction quality. First of all, we examine the original LRM architecture
and find several shortcomings. Subsequently, we introduce corresponding
modifications to the LRM architecture, which lead to improved multi-view image
representation and more computationally efficient training. Second, in order to
improve geometry reconstruction and enable supervision at full image
resolution, we extract meshes from the NeRF field in a differentiable manner
and fine-tune the NeRF model through mesh rendering. These modifications allow
us to achieve state-of-the-art performance on both 2D and 3D evaluation
metrics, such as a PSNR of 28.67 on the Google Scanned Objects (GSO) dataset.
Despite these superior results, our feed-forward model still struggles to
reconstruct complex textures, such as text and portraits on assets. To address
this, we introduce a lightweight per-instance texture refinement procedure.
This procedure fine-tunes the triplane representation and the NeRF color
estimation model on the mesh surface using the input multi-view images in just
4 seconds. This refinement improves the PSNR to 29.79 and achieves faithful
reconstruction of complex textures, such as text. Additionally, our approach
enables various downstream applications, including text- or image-to-3D
generation.
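To make the triplane-plus-NeRF design described in the abstract concrete, the following is a minimal PyTorch sketch of how a triplane field is typically queried: 3D points are projected onto three axis-aligned feature planes, bilinearly sampled features are concatenated, and a small MLP decodes density and color. All class and variable names here are ours, and the transformer generator that would produce the planes is replaced by a learnable parameter; this is an illustrative sketch, not the GTR implementation.

```python
# Illustrative triplane-NeRF query (not the GTR code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriplaneNeRF(nn.Module):
    def __init__(self, feat_dim=32, plane_res=64, hidden=64):
        super().__init__()
        # Three axis-aligned feature planes (XY, XZ, YZ). In an LRM-style
        # model these would come from a transformer-based triplane generator;
        # here they are a free parameter for illustration.
        self.planes = nn.Parameter(
            torch.randn(3, feat_dim, plane_res, plane_res) * 0.01)
        self.decoder = nn.Sequential(
            nn.Linear(3 * feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1 + 3),  # density + RGB
        )

    def forward(self, xyz):  # xyz: (N, 3) points in [-1, 1]^3
        feats = []
        for plane, dims in zip(self.planes, [[0, 1], [0, 2], [1, 2]]):
            # Project points onto the plane and bilinearly sample features.
            grid = xyz[:, dims].view(1, -1, 1, 2)             # (1, N, 1, 2)
            f = F.grid_sample(plane[None], grid,
                              mode="bilinear", align_corners=True)
            feats.append(f.view(f.shape[1], -1).t())          # (N, C)
        out = self.decoder(torch.cat(feats, dim=-1))
        sigma = F.softplus(out[:, :1])                         # density >= 0
        rgb = torch.sigmoid(out[:, 1:])                        # RGB in [0, 1]
        return sigma, rgb
```

A full pipeline would additionally volume-render these (sigma, rgb) samples along camera rays against the multi-view images; that machinery is omitted here.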
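The mesh-based fine-tuning stage can be sketched as a loop that extracts an iso-surface from the density field with a differentiable surface extractor and supervises full-resolution mesh renders against the input views. In the sketch below, `diff_marching_cubes` and `rasterize_mesh` are hypothetical stand-ins for a differentiable iso-surface extractor (FlexiCubes/DiffMC-style) and a differentiable rasterizer (nvdiffrast-style); the abstract does not name the exact tools, so treat this as a schematic rather than GTR's training code.

```python
# Schematic of fine-tuning a NeRF/triplane model through mesh rendering.
# diff_marching_cubes and rasterize_mesh are *hypothetical* placeholders for
# a differentiable iso-surface extractor and a differentiable rasterizer.
import torch
import torch.nn.functional as F

def finetune_through_mesh(model, images, cameras, steps=1000, res=128):
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    # Dense grid of query points in [-1, 1]^3 for surface extraction.
    lin = torch.linspace(-1.0, 1.0, res)
    pts = torch.stack(
        torch.meshgrid(lin, lin, lin, indexing="ij"), -1).reshape(-1, 3)

    for _ in range(steps):
        sigma, _ = model(pts)
        sigma_grid = sigma.view(res, res, res)
        # Differentiable extraction keeps gradients flowing from mesh
        # vertices back into the underlying field.
        verts, faces = diff_marching_cubes(sigma_grid, level=10.0)
        _, vert_rgb = model(verts)              # vertex colors from the field
        loss = 0.0
        for img, cam in zip(images, cameras):
            pred = rasterize_mesh(verts, faces, vert_rgb, cam)  # full-res render
            loss = loss + F.mse_loss(pred, img)
        opt.zero_grad(); loss.backward(); opt.step()
```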
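Finally, the lightweight per-instance texture refinement reduces to a short optimization that keeps the extracted geometry fixed and fine-tunes the triplane features and color decoding against the input multi-view images. The sketch below reuses the `TriplaneNeRF` toy model from above; `surface_pts` and `gt_colors` are assumed to be precomputed by projecting points on the fixed mesh surface into the input views. The roughly 4-second runtime reported in the abstract refers to the authors' procedure, not to this sketch.

```python
# Sketch of per-instance texture refinement on the fixed mesh surface.
# surface_pts: (N, 3) points on the mesh; gt_colors: (N, 3) target colors
# gathered from the input multi-view images (both assumed precomputed).
import torch
import torch.nn.functional as F

def refine_texture(model, surface_pts, gt_colors, steps=200, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        _, rgb = model(surface_pts)        # predicted colors at surface points
        loss = F.l1_loss(rgb, gt_colors)   # photometric L1 on the surface
        opt.zero_grad(); loss.backward(); opt.step()
    return model
```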