Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing

Abstract

Leveraging the priors of 2D diffusion models for 3D editing has emerged as a promising paradigm. However, maintaining multi-view consistency in the edited results remains challenging, and the extreme scarcity of paired 3D-consistent editing data makes supervised fine-tuning (SFT), the most effective training strategy for editing tasks, infeasible. In this paper, we observe that while generating multi-view consistent 3D content is highly challenging, verifying 3D consistency is tractable, which naturally positions reinforcement learning (RL) as a feasible solution. Motivated by this, we propose RL3DEdit, a single-pass framework driven by RL optimization with novel rewards derived from the 3D foundation model VGGT. Specifically, we leverage the robust priors that VGGT has learned from massive real-world data: we feed it the edited images and use its output confidence maps and pose estimation errors as reward signals, effectively anchoring the 2D editing priors onto a 3D-consistent manifold via RL. Extensive experiments demonstrate that RL3DEdit achieves stable multi-view consistency and outperforms state-of-the-art methods in editing quality while remaining highly efficient. To promote the development of 3D editing, we will release the code and model.
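The abstract names two reward ingredients, confidence maps and pose estimation errors, but does not state how they are combined. The following is only an illustrative sketch under assumptions, not the RL3DEdit reward itself: the function names, the linear combination, the `pose_weight` parameter, the use of rotation error alone, and the idea that reference poses come from the unedited views are all hypothetical. Plain NumPy arrays stand in for the outputs of a geometry model such as VGGT; no real VGGT API is called.

```python
import numpy as np


def rotation_angle_deg(r_a, r_b):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    cos_theta = (np.trace(r_a.T @ r_b) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from rounding.
    return float(np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0))))


def consistency_reward(conf_maps, pred_rotations, ref_rotations, pose_weight=0.1):
    """Hypothetical scalar reward for one set of edited views.

    conf_maps:       (V, H, W) per-view confidence values (stand-in data).
    pred_rotations:  (V, 3, 3) camera rotations estimated from the edited views.
    ref_rotations:   (V, 3, 3) reference rotations (assumed known, e.g. from
                     the unedited views; this is an assumption of the sketch).
    pose_weight:     assumed trade-off between the two terms.
    """
    # Higher mean confidence is treated as evidence of 3D consistency.
    conf_term = float(np.mean(conf_maps))
    # Larger pose deviation is penalized.
    pose_err = float(
        np.mean([rotation_angle_deg(p, r) for p, r in zip(pred_rotations, ref_rotations)])
    )
    return conf_term - pose_weight * pose_err
```

In this sketch, a set of edited views that yields high confidence and poses close to the reference receives a larger reward, which is the kind of verification signal the abstract describes as tractable even when paired training data is unavailable.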

