Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing
March 3, 2026
Authors: Jiyuan Wang, Chunyu Lin, Lei Sun, Zhi Cao, Yuyang Yin, Lang Nie, Zhenlong Yuan, Xiangxiang Chu, Yunchao Wei, Kang Liao, Guosheng Lin
cs.AI
Abstract
Leveraging the priors of 2D diffusion models for 3D editing has emerged as a promising paradigm. However, maintaining multi-view consistency in edited results remains challenging, and the extreme scarcity of 3D-consistent paired editing data renders supervised fine-tuning (SFT), the most effective training strategy for editing tasks, infeasible. In this paper, we observe that while generating multi-view consistent 3D content is highly challenging, verifying 3D consistency is tractable, which naturally positions reinforcement learning (RL) as a feasible solution. Motivated by this, we propose RL3DEdit, a single-pass framework driven by RL optimization with novel rewards derived from the 3D foundation model VGGT. Specifically, we leverage VGGT's robust priors learned from massive real-world data: we feed it the edited images and use its output confidence maps and pose-estimation errors as reward signals, effectively anchoring the 2D editing priors onto a 3D-consistent manifold via RL. Extensive experiments demonstrate that RL3DEdit achieves stable multi-view consistency and outperforms state-of-the-art methods in editing quality with high efficiency. To promote the development of 3D editing, we will release the code and model.
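The abstract describes a reward built from two VGGT outputs on the edited views: per-pixel confidence maps and pose-estimation errors. The paper does not give the exact formula, but the idea can be sketched as a scalar reward that rises with mean confidence and falls with pose error. The function below is a hypothetical illustration (the name `consistency_reward`, the weighting `alpha`, and the exponential error mapping are all assumptions, not the authors' implementation, and no real VGGT API is called):

```python
import numpy as np

def consistency_reward(confidence_maps, pose_errors, alpha=0.5):
    """Hypothetical 3D-consistency reward (sketch, not the paper's formula).

    confidence_maps: (V, H, W) array of per-view VGGT-style confidences in [0, 1];
    pose_errors:     (V,) array of nonnegative per-view pose-estimation errors.
    High confidence and low pose error suggest multi-view consistency.
    """
    conf_term = float(np.mean(confidence_maps))       # in [0, 1]
    pose_term = float(np.exp(-np.mean(pose_errors)))  # maps error to (0, 1]
    return alpha * conf_term + (1.0 - alpha) * pose_term

# Toy example: 4 edited views with 8x8 confidence maps and zero pose error.
maps = np.full((4, 8, 8), 0.9)
errors = np.zeros(4)
r = consistency_reward(maps, errors)  # 0.5 * 0.9 + 0.5 * 1.0 = 0.95
```

Such a scalar is differentiable-free and can serve directly as the return in a policy-gradient RL loop over the diffusion editor, which matches the verification-is-easier-than-generation observation in the abstract.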