Carve3D: Improving Multi-view Reconstruction Consistency for Diffusion Models with RL Finetuning
December 21, 2023
Authors: Desai Xie, Jiahao Li, Hao Tan, Xin Sun, Zhixin Shu, Yi Zhou, Sai Bi, Sören Pirk, Arie E. Kaufman
cs.AI
Abstract
Recent advancements in the text-to-3D task leverage finetuned text-to-image
diffusion models to generate multi-view images, followed by NeRF
reconstruction. Yet, existing supervised finetuned (SFT) diffusion models still
suffer from multi-view inconsistency and the resulting NeRF artifacts. Although
training longer with SFT improves consistency, it also causes distribution
shift, which reduces diversity and realistic details. We argue that the SFT of
multi-view diffusion models resembles the instruction finetuning stage of the
LLM alignment pipeline and can benefit from RL finetuning (RLFT) methods.
Essentially, RLFT methods optimize models beyond their SFT data distribution by
using their own outputs, effectively mitigating distribution shift. To this
end, we introduce Carve3D, an RLFT method coupled with the Multi-view
Reconstruction Consistency (MRC) metric, to improve the consistency of
multi-view diffusion models. To compute MRC on a set of multi-view images, we
compare them with their corresponding renderings of the reconstructed NeRF at
the same viewpoints. We validate the robustness of MRC with extensive
experiments conducted under controlled inconsistency levels. We enhance the
base RLFT algorithm to stabilize the training process, reduce distribution
shift, and identify scaling laws. Through qualitative and quantitative
experiments, along with a user study, we demonstrate Carve3D's improved
multi-view consistency, the resulting superior NeRF reconstruction quality, and
minimal distribution shift compared to longer SFT. Project webpage:
https://desaixie.github.io/carve-3d.
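The MRC computation described above — rendering the reconstructed NeRF back at the input viewpoints and comparing those renderings with the generated multi-view images — can be sketched as a per-view dissimilarity averaged over views. This is a minimal illustration only: it uses mean squared error as a stand-in image distance, whereas the paper's actual metric may use a different (e.g. perceptual) distance, and the function name `mrc_score` is hypothetical.

```python
import numpy as np

def mrc_score(multiview_images, nerf_renderings):
    """Sketch of an MRC-style consistency score.

    Compares each generated multi-view image with the rendering of the
    reconstructed NeRF at the same viewpoint, then averages the per-view
    dissimilarity. Lower values indicate better multi-view consistency.

    MSE is used here as an illustrative stand-in distance; the paper's
    metric may differ (e.g. a perceptual distance such as LPIPS).
    """
    assert len(multiview_images) == len(nerf_renderings), \
        "one NeRF rendering per generated view, at the same viewpoint"
    per_view = [
        float(np.mean((img.astype(np.float64) - ren.astype(np.float64)) ** 2))
        for img, ren in zip(multiview_images, nerf_renderings)
    ]
    return sum(per_view) / len(per_view)
```

A score of 0 means the NeRF renderings exactly reproduce the generated views (perfect consistency under this stand-in distance); inconsistencies across views show up as reconstruction error and raise the score.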