Carve3D: Improving Multi-view Reconstruction Consistency for Diffusion Models with RL Finetuning

Abstract

Recent advancements in the text-to-3D task leverage finetuned text-to-image diffusion models to generate multi-view images, followed by NeRF reconstruction. Yet, existing supervised finetuned (SFT) diffusion models still suffer from multi-view inconsistency and the resulting NeRF artifacts. Although training longer with SFT improves consistency, it also causes distribution shift, which reduces diversity and realistic details. We argue that the SFT of multi-view diffusion models resembles the instruction finetuning stage of the LLM alignment pipeline and can benefit from RL finetuning (RLFT) methods. Essentially, RLFT methods optimize models beyond their SFT data distribution by using their own outputs, effectively mitigating distribution shift. To this end, we introduce Carve3D, an RLFT method coupled with the Multi-view Reconstruction Consistency (MRC) metric, to improve the consistency of multi-view diffusion models. To compute MRC on a set of multi-view images, we compare them with their corresponding renderings of the reconstructed NeRF at the same viewpoints. We validate the robustness of MRC with extensive experiments conducted under controlled inconsistency levels. We enhance the base RLFT algorithm to stabilize the training process, reduce distribution shift, and identify scaling laws. Through qualitative and quantitative experiments, along with a user study, we demonstrate Carve3D's improved multi-view consistency, the resulting superior NeRF reconstruction quality, and minimal distribution shift compared to longer SFT. Project webpage: https://desaixie.github.io/carve-3d.
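
The MRC computation described above (reconstruct a scene from the generated views, re-render it at the same viewpoints, and compare each rendering with its input view) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the names `mrc_score`, `reconstruct`, `render`, and `distance` are hypothetical, images are simplified to flat lists of pixel intensities, and mean squared error stands in for whatever image-similarity measure the paper actually uses. In this sketch a lower score means the views agree better with their reconstruction.

```python
from statistics import fmean
from typing import Any, Callable, Sequence

Image = Sequence[float]  # assumption: an image is a flat list of pixel intensities


def mean_squared_error(a: Image, b: Image) -> float:
    """Stand-in image distance (assumption; not necessarily the paper's choice)."""
    if len(a) != len(b):
        raise ValueError("images must have the same number of pixels")
    return fmean((x - y) ** 2 for x, y in zip(a, b))


def mrc_score(
    views: Sequence[Image],
    cameras: Sequence[Any],
    reconstruct: Callable[[Sequence[Image], Sequence[Any]], Any],
    render: Callable[[Any, Any], Image],
    distance: Callable[[Image, Image], float] = mean_squared_error,
) -> float:
    """Average distance between each input view and the re-rendering of the
    reconstructed scene at that view's camera (lower = more consistent)."""
    if len(views) != len(cameras):
        raise ValueError("need exactly one camera per view")
    scene = reconstruct(views, cameras)  # e.g. a NeRF fitted to the views
    per_view = [distance(v, render(scene, c)) for v, c in zip(views, cameras)]
    return fmean(per_view)


# Toy stand-ins for a real reconstruction pipeline (purely hypothetical):
# the "scene" is the per-pixel average of all views, and rendering returns
# that average regardless of the camera.
def toy_reconstruct(views: Sequence[Image], cameras: Sequence[Any]) -> Image:
    return [fmean(pixels) for pixels in zip(*views)]


def toy_render(scene: Image, camera: Any) -> Image:
    return scene


consistent_views = [[0.2, 0.8], [0.2, 0.8]]
inconsistent_views = [[0.0, 1.0], [1.0, 0.0]]
toy_cameras = [0, 1]

consistent_score = mrc_score(consistent_views, toy_cameras, toy_reconstruct, toy_render)
inconsistent_score = mrc_score(inconsistent_views, toy_cameras, toy_reconstruct, toy_render)
```

With these toy stand-ins, views that agree with each other are reproduced exactly by the reconstruction and score lower than views that contradict each other. A real pipeline would replace `toy_reconstruct` and `toy_render` with an actual NeRF reconstruction and renderer.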
