VGGT: Visual Geometry Grounded Transformer

Abstract

We present VGGT, a feed-forward neural network that directly infers all key 3D attributes of a scene, including camera parameters, point maps, depth maps, and 3D point tracks, from one, a few, or hundreds of its views. This approach is a step forward in 3D computer vision, where models have typically been constrained to and specialized for single tasks. It is also simple and efficient, reconstructing images in under one second while still outperforming alternatives that require post-processing with visual geometry optimization techniques. The network achieves state-of-the-art results in multiple 3D tasks, including camera parameter estimation, multi-view depth estimation, dense point cloud reconstruction, and 3D point tracking. We also show that using pretrained VGGT as a feature backbone significantly enhances downstream tasks, such as non-rigid point tracking and feed-forward novel view synthesis. Code and models are publicly available at https://github.com/facebookresearch/vggt.
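
The 3D attributes listed in the abstract are geometrically related: given a depth map and camera parameters, a point map can be obtained by back-projecting each pixel. The sketch below illustrates that relationship only. It is not taken from the VGGT code base; the function name `unproject_depth`, the pinhole camera model, and the world-to-camera extrinsics convention (x_cam = R x_world + t) are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def unproject_depth(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    rotation: Sequence[Sequence[float]],
    translation: Sequence[float],
) -> List[List[Point]]:
    """Turn a depth map into a world-frame point map (hypothetical helper).

    Assumes a pinhole camera with focal lengths (fx, fy) and principal
    point (cx, cy), and world-to-camera extrinsics
    x_cam = R @ x_world + t, so x_world = R^T @ (x_cam - t).
    """
    point_map: List[List[Point]] = []
    for v, row in enumerate(depth):
        out_row: List[Point] = []
        for u, d in enumerate(row):
            # Back-project pixel (u, v) at depth d into the camera frame.
            cam = ((u - cx) * d / fx, (v - cy) * d / fy, d)
            # Undo the translation, then apply R^T (the inverse rotation).
            diff = [cam[i] - translation[i] for i in range(3)]
            world = tuple(
                sum(rotation[k][i] * diff[k] for k in range(3))
                for i in range(3)
            )
            out_row.append(world)
        point_map.append(out_row)
    return point_map


# Toy example: 2x2 depth map, identity rotation, camera shifted along z.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_depth = [[2.0, 2.0], [2.0, 2.0]]
toy_points = unproject_depth(
    toy_depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5,
    rotation=identity, translation=[0.0, 0.0, 1.0],
)
```

In this toy setup every pixel has one 3D point, so the result has the same height and width as the depth map, which is what "point map" refers to in the abstract.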
