TrackingWorld: World-centric Monocular 3D Tracking of Almost All Pixels
December 9, 2025
Authors: Jiahao Lu, Weitao Xiong, Jiacheng Deng, Peng Li, Tianyu Huang, Zhiyang Dou, Cheng Lin, Sai-Kit Yeung, Yuan Liu
cs.AI
Abstract
Monocular 3D tracking aims to capture the long-term motion of pixels in 3D space from a single monocular video and has witnessed rapid progress in recent years. However, existing monocular 3D tracking methods still fall short in separating camera motion from foreground dynamic motion and cannot densely track newly emerging dynamic subjects in the videos. To address these two limitations, we propose TrackingWorld, a novel pipeline for dense 3D tracking of almost all pixels within a world-centric 3D coordinate system. First, we introduce a tracking upsampler that efficiently lifts arbitrary sparse 2D tracks into dense 2D tracks. Then, to generalize current tracking methods to newly emerging objects, we apply the upsampler to all frames and reduce the redundancy of 2D tracks by eliminating tracks in overlapping regions. Finally, we present an efficient optimization-based framework that back-projects dense 2D tracks into world-centric 3D trajectories by estimating the camera poses and the 3D coordinates of these 2D tracks. Extensive evaluations on both synthetic and real-world datasets demonstrate that our system achieves accurate and dense 3D tracking in a world-centric coordinate frame.
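To make the final back-projection step concrete, the sketch below shows how a single 2D track can be lifted into a world-centric 3D trajectory given per-frame depth estimates, pinhole intrinsics, and camera-to-world poses. This is a minimal illustration of the general idea, not the paper's implementation: the function name, the assumption of shared intrinsics across frames, and the use of precomputed depths are all hypothetical, whereas TrackingWorld jointly optimizes the camera poses and 3D coordinates.

```python
import numpy as np

def backproject_track_to_world(uv, depth, K, R_wc, t_wc):
    """Lift one 2D track into a world-centric 3D trajectory (illustrative sketch).

    uv:    (T, 2) pixel positions of a single tracked point over T frames
    depth: (T,)   per-frame depth estimates along the track
    K:     (3, 3) pinhole intrinsics (assumed shared across frames)
    R_wc:  (T, 3, 3) camera-to-world rotations
    t_wc:  (T, 3)    camera-to-world translations
    Returns a (T, 3) array of world-frame 3D positions.
    """
    ones = np.ones((uv.shape[0], 1))
    pix_h = np.concatenate([uv, ones], axis=1)        # homogeneous pixel coords (T, 3)
    rays_cam = (np.linalg.inv(K) @ pix_h.T).T         # camera-frame rays (T, 3)
    pts_cam = rays_cam * depth[:, None]               # scale each ray by its depth
    # Transform per-frame camera-frame points into the shared world frame.
    pts_world = np.einsum('tij,tj->ti', R_wc, pts_cam) + t_wc
    return pts_world
```

In this toy form, camera motion is removed simply by applying the camera-to-world transform per frame, so a static scene point maps to a (nearly) constant world coordinate while a dynamic point traces out its true motion; the paper's optimization-based framework refines the poses and 3D coordinates jointly rather than taking them as given.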