CoTracker: It is Better to Track Together

July 14, 2023
Authors: Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, Christian Rupprecht
cs.AI

Abstract

Methods for video motion prediction either estimate jointly the instantaneous motion of all points in a given video frame using optical flow or independently track the motion of individual points throughout the video. The latter is true even for powerful deep-learning methods that can track points through occlusions. Tracking points individually ignores the strong correlation that can exist between the points, for instance, because they belong to the same physical object, potentially harming performance. In this paper, we thus propose CoTracker, an architecture that jointly tracks multiple points throughout an entire video. This architecture combines several ideas from the optical flow and tracking literature in a new, flexible and powerful design. It is based on a transformer network that models the correlation of different points in time via specialised attention layers. The transformer iteratively updates an estimate of several trajectories. It can be applied in a sliding-window manner to very long videos, for which we engineer an unrolled training loop. It can track from one to several points jointly and supports adding new points to track at any time. The result is a flexible and powerful tracking algorithm that outperforms state-of-the-art methods in almost all benchmarks.
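To make the joint-tracking idea concrete, the sketch below shows how a transformer could refine all point trajectories together (attention across points models their correlation, attention across time models each trajectory) and how overlapping sliding windows could extend this to long videos. This is a minimal illustrative sketch in PyTorch, not the authors' implementation: names such as `JointTracker`, `track_video`, `window`, and `num_iters` are assumptions made for the example.

```python
# Minimal sketch of joint multi-point tracking with a transformer-style
# update, applied in overlapping sliding windows. Illustrative only; the
# class and function names are assumptions, not CoTracker's actual API.
import torch
import torch.nn as nn


class JointTracker(nn.Module):
    def __init__(self, dim=128, num_iters=4):
        super().__init__()
        self.num_iters = num_iters
        # Attention across points (within each frame) couples correlated tracks;
        # attention across time (per point) enforces temporal consistency.
        self.point_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.time_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.to_delta = nn.Linear(dim, 2)  # predicted update to (x, y)

    def forward(self, feats, tracks):
        # feats:  (T, N, dim)  features sampled at the current track positions
        # tracks: (T, N, 2)    current (x, y) estimates for N points over T frames
        for _ in range(self.num_iters):
            # attend across points at each time step (joint tracking)
            p, _ = self.point_attn(feats, feats, feats)      # (T, N, dim)
            # attend across time for each point
            t_in = p.transpose(0, 1)                         # (N, T, dim)
            t, _ = self.time_attn(t_in, t_in, t_in)
            feats = feats + t.transpose(0, 1)
            tracks = tracks + self.to_delta(feats)           # iterative refinement
        return tracks


def track_video(model, video_feats, init_tracks, window=8, stride=4):
    """Slide overlapping windows over a long video; overlapping frames
    carry the previous window's estimates forward as initialisation."""
    T = video_feats.shape[0]
    tracks = init_tracks.clone()
    for start in range(0, max(T - window, 0) + 1, stride):
        end = min(start + window, T)
        tracks[start:end] = model(video_feats[start:end], tracks[start:end])
    return tracks


if __name__ == "__main__":
    T, N, D = 16, 5, 128
    model = JointTracker(dim=D)
    feats = torch.randn(T, N, D)
    init = torch.zeros(T, N, 2)
    out = track_video(model, feats, init)
    print(out.shape)  # torch.Size([16, 5, 2])
```

The sketch captures the two ingredients the abstract emphasises: attention over the point dimension lets correlated trajectories inform one another instead of being tracked independently, and the overlapping windows let the same fixed-size model run over arbitrarily long videos, which is also what motivates training with an unrolled loop over consecutive windows.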