VGG-T^3: Offline Feed-Forward 3D Reconstruction at Scale
February 26, 2026
Authors: Sven Elflein, Ruilong Li, Sérgio Agostinho, Zan Gojcic, Laura Leal-Taixé, Qunjie Zhou, Aljosa Osep
cs.AI
Abstract
We present a scalable 3D reconstruction model that addresses a critical limitation of offline feed-forward methods: their computational and memory requirements grow quadratically with the number of input images. Our approach is built on the key insight that this bottleneck stems from the variable-length Key-Value (KV) representation of scene geometry, which we distill into a fixed-size Multi-Layer Perceptron (MLP) via test-time training. VGG-T^3 (Visual Geometry Grounded Test-Time Training) scales linearly with the number of input views, similar to online models, and reconstructs a 1k-image collection in just 54 seconds, an 11.6× speed-up over baselines that rely on softmax attention. Because our method retains global scene aggregation, it outperforms other linear-time methods in point-map reconstruction error by large margins. Finally, we demonstrate the visual localization capabilities of our model by querying the scene representation with unseen images.
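To illustrate the core idea in the abstract, below is a minimal toy sketch of test-time training in the fast-weights sense: instead of appending every key/value pair to a cache that grows with the number of tokens, each incoming (key, value) pair is absorbed into a fixed-size MLP via one gradient step, and queries read from that state in constant time. All dimensions, the two-layer ReLU MLP, and the learning rate are illustrative assumptions; this is not the VGG-T^3 architecture, whose details are not given in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # illustrative token feature dimension (hypothetical)

# Fixed-size "fast weight" two-layer MLP standing in for the
# growing key-value (KV) cache of softmax attention.
W1 = 0.1 * rng.standard_normal((d, d))
W2 = 0.1 * rng.standard_normal((d, d))

def mlp(x, W1, W2):
    """Read from the fixed-size state; cost is independent of history length."""
    return np.maximum(x @ W1, 0.0) @ W2

lr = 0.01  # test-time learning rate (illustrative)

# Test-time training: for each streamed token, take one gradient step on the
# squared error so the MLP maps key k to value v. Memory stays O(d^2) no
# matter how many tokens arrive, vs. O(N*d) for an explicit KV cache.
for _ in range(200):
    k = rng.standard_normal(d)
    v = rng.standard_normal(d)
    h = np.maximum(k @ W1, 0.0)                # hidden activations
    err = h @ W2 - v                           # prediction error for this token
    gW2 = np.outer(h, err)                     # dL/dW2
    gW1 = np.outer(k, (err @ W2.T) * (h > 0))  # dL/dW1 (ReLU mask)
    W1 -= lr * gW1
    W2 -= lr * gW2

# A query now replaces attention over all past keys/values with one MLP pass,
# so per-query cost does not depend on how many views were processed.
q = rng.standard_normal(d)
out = mlp(q, W1, W2)
```

Because the state size is fixed at O(d^2), processing each additional view costs the same amount of compute and memory, which is what yields the linear scaling the abstract describes.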