TrackingWorld: World-centric Monocular 3D Tracking of Almost All Pixels
December 9, 2025
Authors: Jiahao Lu, Weitao Xiong, Jiacheng Deng, Peng Li, Tianyu Huang, Zhiyang Dou, Cheng Lin, Sai-Kit Yeung, Yuan Liu
cs.AI
Abstract
Monocular 3D tracking aims to capture the long-term motion of pixels in 3D space from a single monocular video and has witnessed rapid progress in recent years. However, existing monocular 3D tracking methods still fall short in two respects: they struggle to separate camera motion from foreground dynamic motion, and they cannot densely track newly emerging dynamic subjects in the video. To address these two limitations, we propose TrackingWorld, a novel pipeline for dense 3D tracking of almost all pixels within a world-centric 3D coordinate system. First, we introduce a tracking upsampler that efficiently lifts arbitrary sparse 2D tracks into dense 2D tracks. Then, to generalize current tracking methods to newly emerging objects, we apply the upsampler to all frames and reduce the redundancy of 2D tracks by eliminating tracks in overlapping regions. Finally, we present an efficient optimization-based framework that back-projects dense 2D tracks into world-centric 3D trajectories by estimating the camera poses and the 3D coordinates of these 2D tracks. Extensive evaluations on both synthetic and real-world datasets demonstrate that our system achieves accurate and dense 3D tracking in a world-centric coordinate frame.
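The final step of the pipeline, back-projecting 2D tracks into world-centric 3D trajectories given camera poses, can be sketched with the standard pinhole back-projection. This is a minimal illustrative sketch, not the paper's implementation: the function name, interface, and the assumption of known per-track depths and camera-to-world poses are all hypothetical.

```python
import numpy as np

def backproject_to_world(uv, depth, K, R, t):
    """Back-project 2D pixel tracks into world-centric 3D points.

    Illustrative sketch of pinhole back-projection (not the paper's code):
    uv:    (N, 2) pixel coordinates of the tracks in one frame
    depth: (N,)   per-track depths along the camera's +z axis
    K:     (3, 3) camera intrinsics
    R, t:  camera-to-world rotation (3, 3) and translation (3,)
    """
    ones = np.ones((uv.shape[0], 1))
    pix = np.hstack([uv, ones])          # homogeneous pixel coords (N, 3)
    rays = pix @ np.linalg.inv(K).T      # unprojected rays in camera frame
    pts_cam = rays * depth[:, None]      # scale each ray by its depth
    return pts_cam @ R.T + t             # transform camera frame -> world

# Example: a pixel at the principal point, 2 m away, identity pose
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0,   0.0,   1.0]])
pts = backproject_to_world(np.array([[320.0, 240.0]]),
                           np.array([2.0]),
                           K, np.eye(3), np.zeros(3))
# -> [[0., 0., 2.]]: the point lies 2 m straight ahead in world coordinates
```

In the paper's setting, the depths and poses on the right-hand side are not given but estimated jointly by the optimization framework; fixing them as inputs here isolates the geometric back-projection that the optimization builds on.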