VGGT: Visual Geometry Grounded Transformer

Abstract

We present VGGT, a feed-forward neural network that directly infers all key 3D attributes of a scene, including camera parameters, point maps, depth maps, and 3D point tracks, from one, a few, or hundreds of its views. This approach is a step forward in 3D computer vision, where models have typically been constrained to and specialized for single tasks. It is also simple and efficient, reconstructing images in under one second while still outperforming alternatives that require post-processing with visual geometry optimization techniques. The network achieves state-of-the-art results in multiple 3D tasks, including camera parameter estimation, multi-view depth estimation, dense point cloud reconstruction, and 3D point tracking. We also show that using pretrained VGGT as a feature backbone significantly enhances downstream tasks, such as non-rigid point tracking and feed-forward novel view synthesis. Code and models are publicly available at https://github.com/facebookresearch/vggt.
