WinT3R: Window-Based Streaming Reconstruction with Camera Token Pool

Abstract
We present WinT3R, a feed-forward reconstruction model that predicts precise camera poses and high-quality point maps online. Previous methods suffer from a trade-off between reconstruction quality and real-time performance. To address this, we first introduce a sliding-window mechanism that ensures sufficient information exchange among frames within the window, improving the quality of geometric predictions without heavy computation. In addition, we leverage a compact representation of cameras and maintain a global camera token pool, which enhances the reliability of camera pose estimation without sacrificing efficiency. These designs enable WinT3R to achieve state-of-the-art performance in online reconstruction quality, camera pose estimation, and reconstruction speed, as validated by extensive experiments on diverse datasets. Code and model are publicly available at https://github.com/LiZizun/WinT3R.
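
The two ideas named in the abstract, a sliding window over the incoming frames and a global pool holding one compact camera token per frame, can be illustrated with a toy sketch. This is not the WinT3R implementation: the names `sliding_windows`, `CameraTokenPool`, and `run_stream`, the window size and stride, the use of overlapping windows, and the placeholder token values are all assumptions made for illustration. The real model and its interfaces are in the repository linked above.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List, Tuple


def sliding_windows(frames: Iterable[int], size: int, stride: int) -> Iterator[List[int]]:
    """Yield windows of `size` frames, advancing by `stride` frames each time.

    Hypothetical helper: trailing frames that do not complete a window are dropped.
    """
    if size <= 0 or not 0 < stride <= size:
        raise ValueError("require size > 0 and 0 < stride <= size")
    buf: Deque[int] = deque(maxlen=size)
    new = 0  # frames appended since the last emitted window
    for frame in frames:
        buf.append(frame)
        new += 1
        if len(buf) == size and new >= stride:
            yield list(buf)
            new = 0


class CameraTokenPool:
    """Toy global pool: stores one compact token per frame seen so far."""

    def __init__(self) -> None:
        self._tokens: List[Tuple[int, Tuple[float, ...]]] = []
        self._seen = set()

    def add(self, frame_id: int, token: Tuple[float, ...]) -> None:
        # Keep only the first token for each frame (an assumption of this sketch).
        if frame_id not in self._seen:
            self._seen.add(frame_id)
            self._tokens.append((frame_id, token))

    def frame_ids(self) -> List[int]:
        return [frame_id for frame_id, _ in self._tokens]

    def __len__(self) -> int:
        return len(self._tokens)


def run_stream(frames: Iterable[int], size: int, stride: int) -> CameraTokenPool:
    """Process a stream window by window, growing the global pool as we go."""
    pool = CameraTokenPool()
    for window in sliding_windows(frames, size, stride):
        # A real model would exchange information among the frames in `window`
        # and consult the pool here; this sketch stores a placeholder token.
        for frame_id in window:
            pool.add(frame_id, (float(frame_id),))
    return pool


if __name__ == "__main__":
    print(list(sliding_windows(range(7), size=4, stride=2)))
    print(run_stream(range(7), size=4, stride=2).frame_ids())
```

The point of the sketch is only the division of labor: the per-window work stays bounded by the window size, while the pool grows by one small token per frame, which is why a global pool of compact tokens can be kept without a large cost.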
