VGGT-Edit: Feed-Forward Native 3D Scene Editing via Residual Field Prediction

Abstract

High-quality 3D scene reconstruction has recently advanced toward generalizable feed-forward architectures that recover complex environments in a single forward pass. Despite their strong performance in static scene perception, these models cannot yet respond to dynamic human instructions, which restricts their use in interactive applications. Existing editing methods typically rely on a 2D-lifting strategy: individual views are edited independently and then lifted back into 3D space. Because 2D editors lack the spatial awareness required to preserve structure across viewpoints, this indirect pipeline often produces blurry textures and inconsistent geometry. To address these limitations, we propose VGGT-Edit, a feed-forward framework for text-conditioned native 3D scene editing. VGGT-Edit introduces depth-synchronized text injection, which aligns semantic guidance with the backbone's spatial poses so that instructions are grounded stably. A residual transformation head then processes this semantic signal and directly predicts 3D geometric displacements that deform the scene while keeping the background stable. To ensure high-fidelity results, we supervise the framework with a multi-term objective function that enforces geometric accuracy and cross-view consistency. We also construct the DeltaScene Dataset, a large-scale dataset generated by an automated pipeline that applies 3D agreement filtering to ensure ground-truth quality. Experiments show that VGGT-Edit substantially outperforms 2D-lifting baselines, producing sharper object details, stronger multi-view consistency, and near-instant inference.
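
As a rough illustration of the residual-prediction idea, the sketch below applies a per-point 3D displacement to a reconstructed point set. It is not the VGGT-Edit implementation: the function name `apply_residual_field`, the plain-list point representation, and in particular the per-point `edit_mask` used to hold the background fixed are assumptions introduced here for illustration. The abstract states only that displacements are predicted directly and that the background remains stable.

```python
from typing import List, Sequence


def apply_residual_field(
    points: Sequence[Sequence[float]],
    residuals: Sequence[Sequence[float]],
    edit_mask: Sequence[float],
) -> List[List[float]]:
    """Return edited points p' = p + m * delta for each point.

    points:    reconstructed 3D points, one [x, y, z] per entry.
    residuals: predicted 3D displacements, same shape as points.
    edit_mask: hypothetical per-point weight in [0, 1]; 0 leaves a
               background point untouched, 1 applies the full displacement.
    """
    if not (len(points) == len(residuals) == len(edit_mask)):
        raise ValueError("points, residuals and edit_mask must have equal length")

    edited = []
    for point, delta, weight in zip(points, residuals, edit_mask):
        # Move each coordinate by its masked displacement.
        edited.append([p + weight * d for p, d in zip(point, delta)])
    return edited


if __name__ == "__main__":
    scene = [[0.0, 0.0, 1.0], [1.0, 2.0, 3.0]]
    deltas = [[0.5, 0.0, -0.5], [9.0, 9.0, 9.0]]
    mask = [1.0, 0.0]  # first point is edited, second is background
    print(apply_residual_field(scene, deltas, mask))
```

Under this formulation an edit reduces to a single addition on the reconstructed geometry, which is consistent with the near-instant inference claimed above. Points whose mask weight is zero are returned unchanged regardless of the predicted displacement.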