Carve3D: Improving Multi-view Reconstruction Consistency for Diffusion Models with RL Finetuning

Abstract

Recent advances in the text-to-3D task use finetuned text-to-image diffusion models to generate multi-view images, followed by NeRF reconstruction. Yet existing supervised finetuned (SFT) diffusion models still suffer from multi-view inconsistency and the resulting NeRF artifacts. Although training longer with SFT improves consistency, it also causes distribution shift, which reduces diversity and realistic details. We argue that the SFT of multi-view diffusion models resembles the instruction finetuning stage of the LLM alignment pipeline and can benefit from RL finetuning (RLFT) methods. Essentially, RLFT methods optimize models beyond their SFT data distribution by using their own outputs, effectively mitigating distribution shift. To this end, we introduce Carve3D, an RLFT method coupled with the Multi-view Reconstruction Consistency (MRC) metric, to improve the consistency of multi-view diffusion models. To compute MRC on a set of multi-view images, we compare them with the corresponding renderings of the NeRF reconstructed from them, taken at the same viewpoints. We validate the robustness of MRC with extensive experiments conducted under controlled inconsistency levels. We enhance the base RLFT algorithm to stabilize the training process, reduce distribution shift, and identify scaling laws. Through qualitative and quantitative experiments, along with a user study, we demonstrate Carve3D's improved multi-view consistency, the resulting superior NeRF reconstruction quality, and minimal distribution shift compared to longer SFT. Project webpage: https://desaixie.github.io/carve-3d.
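The MRC comparison described above can be illustrated with a minimal sketch. It assumes the NeRF has already been reconstructed from the generated views and re-rendered at the same viewpoints (that step is not shown), and it uses mean squared error as a stand-in image distance, which is an assumption rather than the distance the authors use. The names `image_distance` and `mrc_sketch`, and the flattened-image representation, are hypothetical. Under this sketch, a lower score means the views agree better with their own reconstruction.

```python
from typing import Sequence

# Hypothetical representation: one view flattened to pixel values in [0, 1].
Image = Sequence[float]


def image_distance(a: Image, b: Image) -> float:
    """Mean squared error between two flattened images (stand-in distance)."""
    if len(a) != len(b):
        raise ValueError("images must have the same number of pixels")
    if len(a) == 0:
        raise ValueError("images must not be empty")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def mrc_sketch(views: Sequence[Image], renderings: Sequence[Image]) -> float:
    """Average distance between each generated view and the rendering of the
    reconstructed NeRF at the same viewpoint (renderings are supplied by the
    caller; reconstruction itself is outside this sketch)."""
    if len(views) != len(renderings):
        raise ValueError("need exactly one rendering per view")
    if len(views) == 0:
        raise ValueError("need at least one view")
    total = sum(image_distance(v, r) for v, r in zip(views, renderings))
    return total / len(views)
```

In an RLFT setting, a score of this kind could be negated and used as a reward, so that more consistent image sets are reinforced; how the reward is actually shaped is not specified in the abstract.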
