Track4World: Feedforward World-centric Dense 3D Tracking of All Pixels

Abstract

Estimating the 3D trajectory of every pixel in a monocular video is a crucial and promising step toward a comprehensive understanding of a video's 3D dynamics. Recent monocular 3D tracking methods show impressive performance, but they either track only sparse points from the first frame or rely on slow optimization-based frameworks for dense tracking. In this paper, we propose Track4World, a feedforward model that enables efficient, holistic 3D tracking of every pixel in a world-centric coordinate system. Built on a global 3D scene representation encoded by a VGGT-style ViT, Track4World applies a novel 3D correlation scheme to simultaneously estimate pixel-wise dense 2D and 3D flow between arbitrary frame pairs. The estimated scene flow, combined with the reconstructed 3D geometry, then enables efficient 3D tracking of every pixel in the video. Extensive experiments on multiple benchmarks demonstrate that our approach consistently outperforms existing methods in 2D/3D flow estimation and 3D tracking, highlighting its robustness and scalability for real-world 4D reconstruction tasks.
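To make the last step concrete, the following is a minimal sketch of how per-pixel scene flow and reconstructed geometry could be combined into world-space trajectories. It is not the Track4World implementation. It assumes a pinhole intrinsics matrix `K`, a camera-to-world pose for the reference frame, and scene flow already expressed in world coordinates from the reference frame to each target frame. The function names `unproject_to_world` and `track_pixels` are hypothetical and chosen only for illustration.

```python
import numpy as np

def unproject_to_world(depth, K, cam_to_world):
    """Lift every pixel of a depth map to a 3D point in world coordinates.

    depth: (h, w) depth along the camera z-axis (assumed input).
    K: (3, 3) pinhole intrinsics (assumed input).
    cam_to_world: (4, 4) camera-to-world pose (assumed input).
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates, each (h, w)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T       # camera-space rays with z = 1
    pts_cam = rays * depth[..., None]     # scale each ray by its depth
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return pts_cam @ R.T + t              # (h, w, 3) world points

def track_pixels(points_world, scene_flows):
    """Build dense 3D trajectories from reference-frame points and scene flow.

    scene_flows: (T, h, w, 3), world-space displacement of each reference
    pixel from the reference frame to target frame t (assumed convention).
    """
    return points_world[None] + scene_flows  # (T, h, w, 3)
```

Under these assumptions, every pixel of the reference frame yields one trajectory of length `T` without any per-video optimization, which is the efficiency argument the abstract makes for a feedforward design.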
