TTT3R: 3D Reconstruction as Test-Time Training

Abstract

Modern Recurrent Neural Networks have become a competitive architecture for 3D reconstruction thanks to their linear-time complexity. However, their performance degrades significantly when they are applied beyond the training context length, revealing limited length generalization. In this work, we revisit 3D reconstruction foundation models from a Test-Time Training perspective, framing their design as an online learning problem. Building on this perspective, we use the alignment confidence between the memory state and incoming observations to derive a closed-form learning rate for memory updates, balancing the retention of historical information against adaptation to new observations. This training-free intervention, termed TTT3R, substantially improves length generalization, achieving a 2× improvement in global pose estimation over baselines while running at 20 FPS with just 6 GB of GPU memory when processing thousands of images. Code is available at https://rover-xingyu.github.io/TTT3R
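
To make the idea of a confidence-driven learning rate for memory updates more concrete, here is a minimal sketch in plain Python. The abstract does not give the actual formula, so everything specific here is an assumption made for illustration: the function name `gated_memory_update`, the per-token `alignment_logits` input, and the choice of a sigmoid to map confidence to a rate in (0, 1). It should not be read as the TTT3R implementation.

```python
import math


def _sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gated_memory_update(state, candidate, alignment_logits):
    """Blend old memory tokens with candidate updates, token by token.

    state, candidate: lists of equal-length float vectors, one per memory token.
    alignment_logits: one float per memory token (hypothetical input); a higher
    value means the token aligns more confidently with the new observation.
    """
    updated = []
    for old, new, logit in zip(state, candidate, alignment_logits):
        beta = _sigmoid(logit)  # assumed gate: per-token learning rate in (0, 1)
        # Low beta keeps the old memory; high beta adopts the candidate.
        updated.append([(1.0 - beta) * o + beta * n for o, n in zip(old, new)])
    return updated
```

Under these assumptions, a token with low alignment confidence keeps most of its historical content, while a token with high confidence moves toward the new observation, which is the retention versus adaptation trade-off the abstract describes.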
