VGG-T^3: Offline Feed-Forward 3D Reconstruction at Scale
February 26, 2026
Authors: Sven Elflein, Ruilong Li, Sérgio Agostinho, Zan Gojcic, Laura Leal-Taixé, Qunjie Zhou, Aljosa Osep
cs.AI
Abstract
We present a scalable 3D reconstruction model that addresses a critical limitation of offline feed-forward methods: their computational and memory requirements grow quadratically with the number of input images. Our approach is built on the key insight that this bottleneck stems from the variable-length Key-Value (KV) representation of scene geometry, which we distill into a fixed-size Multi-Layer Perceptron (MLP) via test-time training. VGG-T^3 (Visual Geometry Grounded Test-Time Training) scales linearly with the number of input views, similar to online models, and reconstructs a 1k-image collection in just 54 seconds, an 11.6× speed-up over baselines that rely on softmax attention. Because our method retains global scene aggregation, its point-map reconstruction error is far lower than that of other linear-time methods. Finally, we demonstrate the visual localization capabilities of our model by querying the scene representation with unseen images.
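To illustrate the scaling argument, here is a minimal sketch (not the authors' code) of the core idea: replacing a KV cache that grows with the number of views by a fixed-size state updated online, so per-view cost is constant and total cost is linear. The dimensions, the single-matrix state standing in for the distilled MLP, and the outer-product update rule are all illustrative assumptions.

```python
import numpy as np

d = 8  # feature dimension (assumed for illustration)
rng = np.random.default_rng(0)

# Fixed-size state standing in for the distilled scene representation:
# one d x d matrix, independent of how many views have been processed.
S = np.zeros((d, d))

def update(S, k, v, lr=1.0):
    # One test-time "training" step: absorb a key/value pair into the
    # state. A Hebbian outer-product update is the simplest such rule
    # (linear-attention style); the paper distills into an MLP instead.
    return S + lr * np.outer(v, k)

def query(S, q):
    # Read out aggregated scene information for a query feature,
    # e.g. when localizing an unseen image against the scene state.
    return S @ q

# Process n views: memory stays O(d^2) and each step costs O(d^2),
# whereas a softmax-attention KV cache grows as O(n).
n_views = 1000
for _ in range(n_views):
    k, v = rng.standard_normal(d), rng.standard_normal(d)
    S = update(S, k, v)

print(S.shape)  # state size is unchanged regardless of n_views
```

The contrast with softmax attention is that the latter must keep all n key/value pairs and compare each query against them, giving the quadratic growth the abstract describes.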