VGG-T^3: Offline Feed-Forward 3D Reconstruction at Scale

Abstract

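We propose a scalable 3D reconstruction model that targets a critical limitation of offline feed-forward methods: their computation and memory usage grow quadratically with the number of input images. The core of our approach is the insight that this bottleneck stems from the variable-length key-value (KV) space representation of scene geometry. We distill this representation into a fixed-size multi-layer perceptron (MLP) via test-time training. VGG-T^3 scales linearly with the number of input views, as online models do, reconstructs a collection of 1,000 images in just 54 seconds, and achieves an 11.6× speed-up over baselines that rely on softmax attention. Because our method retains global scene aggregation capability, it outperforms other linear-time methods by a large margin in point map reconstruction error. Finally, we demonstrate the model's visual localization capability by querying the scene representation with unseen images.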

English

We present a scalable 3D reconstruction model that addresses a critical limitation of offline feed-forward methods: their computational and memory requirements grow quadratically with respect to the number of input images. Our approach is built on the key insight that this bottleneck stems from the varying-length Key-Value (KV) space representation of scene geometry, which we distill into a fixed-size Multi-Layer Perceptron (MLP) via test-time training. VGG-T^3 (Visual Geometry Grounded Test Time Training) scales linearly with respect to the number of input views, similar to online models, and reconstructs a 1k-image collection in just 54 seconds, achieving an 11.6× speed-up over baselines that rely on softmax attention. Since our method retains global scene aggregation capability, it outperforms other linear-time methods in point map reconstruction error by large margins. Finally, we demonstrate the visual localization capabilities of our model by querying the scene representation with unseen images.
