VGGT-Edit: Feed-Forward Native 3D Scene Editing via Residual Field Prediction

Abstract

High-quality 3D scene reconstruction has recently advanced toward generalizable feed-forward architectures that generate complex environments in a single forward pass. However, despite their strong performance in static scene perception, these models cannot yet respond well to dynamic human instructions, which restricts their use in interactive applications. Existing editing methods typically rely on a 2D-lifting strategy, in which individual views are edited independently and then lifted back into 3D space. This indirect pipeline often produces blurry textures and inconsistent geometry, because 2D editors lack the spatial awareness required to preserve structure across viewpoints. To address these limitations, we propose VGGT-Edit, a feed-forward framework for text-conditioned native 3D scene editing. VGGT-Edit introduces depth-synchronized text injection to align semantic guidance with the backbone's spatial poses, ensuring stable instruction grounding. This semantic signal is then processed by a residual transformation head, which directly predicts 3D geometric displacements that deform the scene while keeping the background stable. To ensure high-fidelity results, we supervise the framework with a multi-term objective function that enforces geometric accuracy and cross-view consistency. We also construct the DeltaScene Dataset, a large-scale dataset generated through an automated pipeline that uses 3D agreement filtering to ensure ground-truth quality. Experiments show that VGGT-Edit substantially outperforms 2D-lifting baselines, producing sharper object details and stronger multi-view consistency while running at near-instant inference speed.
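The idea of editing by predicted 3D displacements can be illustrated with a minimal NumPy sketch. It is not the VGGT-Edit implementation: the function name `apply_residual_field`, the `(V, H, W, 3)` point-map layout, and the soft `edit_mask` are all assumptions introduced here for illustration. It only shows how adding a masked residual to per-view 3D points deforms the edited region while leaving background points untouched.

```python
import numpy as np


def apply_residual_field(points, residual, edit_mask):
    """Deform a per-view point map by a masked 3D displacement field.

    points:    (V, H, W, 3) one 3D point per pixel for V views (assumed layout)
    residual:  (V, H, W, 3) displacement per point (hypothetical head output)
    edit_mask: (V, H, W) soft mask in [0, 1]; 0 marks background to keep fixed
    """
    points = np.asarray(points, dtype=np.float64)
    residual = np.asarray(residual, dtype=np.float64)
    edit_mask = np.clip(np.asarray(edit_mask, dtype=np.float64), 0.0, 1.0)
    # Broadcast the mask over the xyz axis so background residuals are zeroed.
    return points + edit_mask[..., None] * residual


# Toy example: two 4x4 views, with only the central 2x2 patch marked as edited.
rng = np.random.default_rng(0)
points = rng.normal(size=(2, 4, 4, 3))
residual = np.full((2, 4, 4, 3), 0.5)
edit_mask = np.zeros((2, 4, 4))
edit_mask[:, 1:3, 1:3] = 1.0

edited = apply_residual_field(points, residual, edit_mask)
```

In this toy setup, points outside the mask are returned unchanged, which is one simple way to express the background stability that the abstract describes; the actual framework may realize it differently.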