Track4World: Feedforward World-centric Dense 3D Tracking of All Pixels

Abstract

Estimating the 3D trajectory of every pixel from a monocular video is crucial and promising for a comprehensive understanding of the 3D dynamics of videos. Recent monocular 3D tracking methods demonstrate impressive performance, but they either track only sparse points on the first frame or rely on a slow optimization-based framework for dense tracking. In this paper, we propose a feedforward model, called Track4World, that enables efficient, holistic 3D tracking of every pixel in a world-centric coordinate system. Built on the global 3D scene representation encoded by a VGGT-style ViT, Track4World applies a novel 3D correlation scheme to simultaneously estimate the pixel-wise dense 2D and 3D flow between arbitrary frame pairs. The estimated scene flow, together with the reconstructed 3D geometry, enables subsequent efficient 3D tracking of every pixel in the video. Extensive experiments on multiple benchmarks demonstrate that our approach consistently outperforms existing methods in 2D/3D flow estimation and 3D tracking, highlighting its robustness and scalability for real-world 4D reconstruction tasks.
