VGGT: Visual Geometry Grounded Transformer
March 14, 2025
Authors: Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, David Novotny
cs.AI
Abstract
We present VGGT, a feed-forward neural network that directly infers all key 3D attributes of a scene, including camera parameters, point maps, depth maps, and 3D point tracks, from one, a few, or hundreds of its views. This approach is a step forward in 3D computer vision, where models have typically been constrained to and specialized for single tasks. It is also simple and efficient, reconstructing images in under one second, and still outperforming alternatives that require post-processing with visual geometry optimization techniques. The network achieves state-of-the-art results in multiple 3D tasks, including camera parameter estimation, multi-view depth estimation, dense point cloud reconstruction, and 3D point tracking. We also show that using pretrained VGGT as a feature backbone significantly enhances downstream tasks, such as non-rigid point tracking and feed-forward novel view synthesis. Code and models are publicly available at https://github.com/facebookresearch/vggt.
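Since the abstract points to a public code release, a minimal inference sketch may be useful. The module paths (vggt.models.vggt.VGGT, vggt.utils.load_fn.load_and_preprocess_images) and the Hugging Face checkpoint id facebook/VGGT-1B below are assumptions drawn from the linked repository's README, not from the abstract itself; verify them against the repo before relying on them.

```python
import torch

# Assumed module paths, following the public VGGT repository's README.
from vggt.models.vggt import VGGT
from vggt.utils.load_fn import load_and_preprocess_images

device = "cuda" if torch.cuda.is_available() else "cpu"

# Assumed checkpoint id on the Hugging Face Hub (see the repo README).
model = VGGT.from_pretrained("facebook/VGGT-1B").to(device)
model.eval()

# Any number of views of the same scene: one, a few, or hundreds.
# The file names here are placeholders.
images = load_and_preprocess_images(["view0.png", "view1.png"]).to(device)

with torch.no_grad():
    # One feed-forward pass returns all 3D attributes at once; per the
    # README, the output dict includes camera pose encodings, depth maps
    # (with confidence), and dense point maps.
    predictions = model(images)
```

The single forward call mirrors the feed-forward design claimed above: every output is produced in one pass, with no visual geometry optimization (e.g., bundle adjustment) as post-processing.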