VGGT-X: When VGGT Meets Dense Novel View Synthesis

Abstract

We study the problem of applying 3D Foundation Models (3DFMs) to dense Novel View Synthesis (NVS). Despite significant progress in NVS powered by NeRF and 3DGS, current approaches remain reliant on accurate 3D attributes (e.g., camera poses and point clouds) acquired from Structure-from-Motion (SfM), which is often slow and fragile in low-texture or low-overlap captures. Recent 3DFMs showcase orders-of-magnitude speedups over the traditional pipeline and great potential for online NVS. However, most of their validation and conclusions are confined to sparse-view settings. Our study reveals that naively scaling 3DFMs to dense views encounters two fundamental barriers: a dramatically increasing VRAM burden and imperfect outputs that degrade initialization-sensitive 3D training. To address these barriers, we introduce VGGT-X, which incorporates a memory-efficient VGGT implementation that scales to 1,000+ images, an adaptive global alignment for VGGT output enhancement, and robust 3DGS training practices. Extensive experiments show that these measures substantially close the fidelity gap with COLMAP-initialized pipelines, achieving state-of-the-art results in dense COLMAP-free NVS and pose estimation. Additionally, we analyze the causes of the remaining gap with COLMAP-initialized rendering, providing insights for the future development of 3D foundation models and dense NVS. Our project page is available at https://dekuliutesla.github.io/vggt-x.github.io/