VGG-T^3: Offline Feed-Forward 3D Reconstruction at Scale
February 26, 2026
Authors: Sven Elflein, Ruilong Li, Sérgio Agostinho, Zan Gojcic, Laura Leal-Taixé, Qunjie Zhou, Aljosa Osep
cs.AI
Abstract
We present a scalable 3D reconstruction model that addresses a critical limitation of offline feed-forward methods: their computational and memory requirements grow quadratically with the number of input images. Our approach is built on the key insight that this bottleneck stems from the variable-length Key-Value (KV) representation of scene geometry, which we distill into a fixed-size Multi-Layer Perceptron (MLP) via test-time training. VGG-T^3 (Visual Geometry Grounded Test-Time Training) scales linearly with the number of input views, similar to online models, and reconstructs a 1k-image collection in just 54 seconds, an 11.6× speed-up over baselines that rely on softmax attention. Because our method retains global scene aggregation, it outperforms other linear-time methods in point-map reconstruction error by large margins. Finally, we demonstrate the visual localization capabilities of our model by querying the scene representation with unseen images.
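To illustrate the core idea in the abstract, below is a minimal toy sketch of test-time training in the fast-weights sense: instead of appending every key/value pair to a cache that grows with the number of tokens, each incoming (key, value) pair is absorbed into a fixed-size MLP via one gradient step, and queries read from that state in constant time. All dimensions, the two-layer ReLU MLP, and the learning rate are illustrative assumptions; this is not the VGG-T^3 architecture, whose details are not given in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # illustrative token feature dimension (hypothetical)

# Fixed-size "fast weight" two-layer MLP standing in for the
# growing key-value (KV) cache of softmax attention.
W1 = 0.1 * rng.standard_normal((d, d))
W2 = 0.1 * rng.standard_normal((d, d))

def mlp(x, W1, W2):
    """Read from the fixed-size state; cost is independent of history length."""
    return np.maximum(x @ W1, 0.0) @ W2

lr = 0.01  # test-time learning rate (illustrative)

# Test-time training: for each streamed token, take one gradient step on the
# squared error so the MLP maps key k to value v. Memory stays O(d^2) no
# matter how many tokens arrive, vs. O(N*d) for an explicit KV cache.
for _ in range(200):
    k = rng.standard_normal(d)
    v = rng.standard_normal(d)
    h = np.maximum(k @ W1, 0.0)                # hidden activations
    err = h @ W2 - v                           # prediction error for this token
    gW2 = np.outer(h, err)                     # dL/dW2
    gW1 = np.outer(k, (err @ W2.T) * (h > 0))  # dL/dW1 (ReLU mask)
    W1 -= lr * gW1
    W2 -= lr * gW2

# A query now replaces attention over all past keys/values with one MLP pass,
# so per-query cost does not depend on how many views were processed.
q = rng.standard_normal(d)
out = mlp(q, W1, W2)
```

Because the state size is fixed at O(d^2), processing each additional view costs the same amount of compute and memory, which is what yields the linear scaling the abstract describes.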