VGGT: Visual Geometry Grounded Transformer

Abstract

We present VGGT, a feed-forward neural network that directly infers all key 3D attributes of a scene, including camera parameters, point maps, depth maps, and 3D point tracks, from one, a few, or hundreds of its views. This approach is a step forward in 3D computer vision, where models have typically been constrained to and specialized for single tasks. It is also simple and efficient, reconstructing images in under one second while still outperforming alternatives that require post-processing with visual geometry optimization techniques. The network achieves state-of-the-art results in multiple 3D tasks, including camera parameter estimation, multi-view depth estimation, dense point cloud reconstruction, and 3D point tracking. We also show that using pretrained VGGT as a feature backbone significantly enhances downstream tasks, such as non-rigid point tracking and feed-forward novel view synthesis. Code and models are publicly available at https://github.com/facebookresearch/vggt.
