
VGGT: Visual Geometry Grounded Transformer

March 14, 2025
Authors: Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, David Novotny
cs.AI

Abstract

We present VGGT, a feed-forward neural network that directly infers all key 3D attributes of a scene, including camera parameters, point maps, depth maps, and 3D point tracks, from one, a few, or hundreds of its views. This approach is a step forward in 3D computer vision, where models have typically been constrained to and specialized for single tasks. It is also simple and efficient, reconstructing images in under one second, and still outperforming alternatives that require post-processing with visual geometry optimization techniques. The network achieves state-of-the-art results in multiple 3D tasks, including camera parameter estimation, multi-view depth estimation, dense point cloud reconstruction, and 3D point tracking. We also show that using pretrained VGGT as a feature backbone significantly enhances downstream tasks, such as non-rigid point tracking and feed-forward novel view synthesis. Code and models are publicly available at https://github.com/facebookresearch/vggt.
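To make the abstract's input/output contract concrete, below is a minimal sketch of how a pretrained feed-forward model like VGGT might be queried: a batch of views of one scene goes in, and camera parameters, depth maps, and point maps come out in a single forward pass, with no per-scene optimization. The hub entry point, input resolution, and prediction keys here are illustrative assumptions, not the repository's confirmed API; consult https://github.com/facebookresearch/vggt for the actual usage.

```python
# Hypothetical usage sketch; names below are illustrative, not the confirmed vggt API.
import torch
from PIL import Image
from torchvision import transforms

# Assumption: a pretrained VGGT-style model loadable as a torch.nn.Module.
# In practice, load the checkpoint per the instructions in the vggt repository.
model = torch.hub.load("facebookresearch/vggt", "vggt")  # hypothetical hub entry
model.eval()

# Preprocess a handful of views of the same scene into an (N, 3, H, W) batch.
to_tensor = transforms.Compose([
    transforms.Resize((518, 518)),  # assumed input resolution
    transforms.ToTensor(),
])
views = torch.stack([
    to_tensor(Image.open(p).convert("RGB"))
    for p in ["view0.png", "view1.png"]
])

# One feed-forward pass returns all 3D attributes at once: no bundle
# adjustment or visual geometry optimization is run as post-processing.
with torch.no_grad():
    preds = model(views[None])  # hypothetical: a batch holding one set of views

# Hypothetical output keys mirroring the attributes named in the abstract.
cameras = preds["camera_params"]  # per-view intrinsics and extrinsics
depth   = preds["depth_maps"]     # per-view dense depth
points  = preds["point_maps"]     # per-pixel 3D points in a shared frame
```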

