Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing
March 3, 2026
Authors: Jiyuan Wang, Chunyu Lin, Lei Sun, Zhi Cao, Yuyang Yin, Lang Nie, Zhenlong Yuan, Xiangxiang Chu, Yunchao Wei, Kang Liao, Guosheng Lin
cs.AI
Abstract
Leveraging the priors of 2D diffusion models for 3D editing has emerged as a promising paradigm. However, maintaining multi-view consistency in edited results remains challenging, and the extreme scarcity of 3D-consistent paired editing data renders supervised fine-tuning (SFT), the most effective training strategy for editing tasks, infeasible. In this paper, we observe that while generating multi-view consistent 3D content is highly challenging, verifying 3D consistency is tractable, which naturally positions reinforcement learning (RL) as a feasible solution. Motivated by this, we propose RL3DEdit, a single-pass framework driven by RL optimization with novel rewards derived from the 3D foundation model VGGT. Specifically, we exploit VGGT's robust priors learned from massive real-world data: we feed it the edited images and use its output confidence maps and pose-estimation errors as reward signals, effectively anchoring the 2D editing priors onto a 3D-consistent manifold via RL. Extensive experiments demonstrate that RL3DEdit achieves stable multi-view consistency and outperforms state-of-the-art methods in editing quality with high efficiency. To promote the development of 3D editing, we will release our code and models.
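The abstract describes turning VGGT's confidence maps and pose-estimation errors into an RL reward. The sketch below is a minimal, hypothetical illustration of that idea with NumPy: it does not call VGGT (whose actual interface the abstract does not specify), and the function name `consistency_reward`, the inputs, and the weights `w_conf`/`w_pose` are all assumptions, not the paper's implementation.

```python
import numpy as np

def consistency_reward(conf_maps, pose_errors, w_conf=1.0, w_pose=1.0):
    """Hypothetical scalar reward combining multi-view consistency cues.

    conf_maps:   list of per-view H x W confidence maps in [0, 1]
                 (standing in for a 3D foundation model's confidence output)
    pose_errors: list of per-view pose-estimation errors (lower = more consistent)

    Higher mean confidence raises the reward; larger pose error lowers it.
    """
    conf_term = float(np.mean([m.mean() for m in conf_maps]))
    pose_term = float(np.mean(pose_errors))
    return w_conf * conf_term - w_pose * pose_term

# Toy example with synthetic maps: mean confidence 0.7, mean pose error 0.2.
maps = [np.full((4, 4), 0.8), np.full((4, 4), 0.6)]
errs = [0.1, 0.3]
r = consistency_reward(maps, errs)
```

A scalar of this form could serve as the per-sample reward in a policy-gradient update of the 2D editing model; the actual weighting and normalization used by RL3DEdit are not given in the abstract.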