VGGT-Edit: Feed-Forward Native 3D Scene Editing via Residual Field Prediction

Abstract

High-quality 3D scene reconstruction has recently advanced toward generalizable feed-forward architectures, enabling the generation of complex environments in a single forward pass. However, despite their strong performance in static scene perception, these models remain limited in responding to dynamic human instructions, which restricts their use in interactive applications. Existing editing methods typically rely on a 2D-lifting strategy, where individual views are edited independently and then lifted back into 3D space. This indirect pipeline often leads to blurry textures and inconsistent geometry, as 2D editors lack the spatial awareness required to preserve structure across viewpoints. To address these limitations, we propose VGGT-Edit, a feed-forward framework for text-conditioned native 3D scene editing. VGGT-Edit introduces depth-synchronized text injection to align semantic guidance with the backbone's spatial poses, ensuring stable instruction grounding. This semantic signal is then processed by a residual transformation head, which directly predicts 3D geometric displacements to deform the scene while preserving background stability. To ensure high-fidelity results, we supervise the framework with a multi-term objective function that enforces geometric accuracy and cross-view consistency. We also construct the DeltaScene Dataset, a large-scale dataset generated through an automated pipeline with 3D agreement filtering to ensure ground-truth quality. Experiments show that VGGT-Edit substantially outperforms 2D-lifting baselines, producing sharper object details, stronger multi-view consistency, and near-instant inference speed.
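
As a rough illustration of the residual idea described above, the sketch below adds predicted 3D displacements to a reconstructed point set while leaving unedited points untouched. It is a toy example under stated assumptions, not the VGGT-Edit implementation: the function name `apply_residual_field`, the per-point `edit_mask`, and the hand-written displacement values are all hypothetical, and the abstract does not specify how the residual head gates background points.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def apply_residual_field(
    points: Sequence[Point],
    displacements: Sequence[Point],
    edit_mask: Sequence[float],
) -> List[Point]:
    """Return points + mask * displacement for every point.

    Assumption: a per-point mask in [0, 1] gates the displacement, so
    points with mask 0 (background) keep their original coordinates.
    """
    if not (len(points) == len(displacements) == len(edit_mask)):
        raise ValueError("points, displacements and edit_mask must have equal length")

    edited: List[Point] = []
    for (x, y, z), (dx, dy, dz), m in zip(points, displacements, edit_mask):
        # Residual update: the original geometry plus a gated offset.
        edited.append((x + m * dx, y + m * dy, z + m * dz))
    return edited


# Toy scene: three reconstructed 3D points.
scene = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 1.0, 2.0)]

# Hypothetical displacements, standing in for a residual head's output.
delta = [(0.5, 0.0, 0.0), (0.5, 0.0, 0.0), (0.5, 0.0, 0.0)]

# Hypothetical mask: only the first point belongs to the edited object.
mask = [1.0, 0.0, 0.0]

edited = apply_residual_field(scene, delta, mask)
```

Expressing the edit as an offset on top of the existing geometry means a zero displacement reproduces the input scene exactly, which is one way to read the abstract's claim of background stability.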