VGGT-X: When VGGT Meets Dense Novel View Synthesis

Abstract

We study the problem of applying 3D Foundation Models (3DFMs) to dense Novel View Synthesis (NVS). Despite significant progress in NVS powered by NeRF and 3DGS, current approaches still rely on accurate 3D attributes (e.g., camera poses and point clouds) acquired from Structure-from-Motion (SfM), which is often slow and fragile in low-texture or low-overlap captures. Recent 3DFMs show orders-of-magnitude speedups over the traditional pipeline and great potential for online NVS. However, most validation and conclusions are confined to sparse-view settings. Our study reveals that naively scaling 3DFMs to dense views encounters two fundamental barriers: a dramatically increasing VRAM burden, and imperfect outputs that degrade initialization-sensitive 3D training. To address these barriers, we introduce VGGT-X, which incorporates a memory-efficient VGGT implementation that scales to 1,000+ images, an adaptive global alignment for enhancing VGGT outputs, and robust 3DGS training practices. Extensive experiments show that these measures substantially close the fidelity gap with COLMAP-initialized pipelines, achieving state-of-the-art results in dense COLMAP-free NVS and pose estimation. Additionally, we analyze the causes of the remaining gap with COLMAP-initialized rendering, providing insights for the future development of 3D foundation models and dense NVS. Our project page is available at https://dekuliutesla.github.io/vggt-x.github.io/