CoTracker: It is Better to Track Together
July 14, 2023
Authors: Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, Christian Rupprecht
cs.AI
Abstract
Methods for video motion prediction either estimate jointly the instantaneous motion of all points in a given video frame using optical flow or independently track the motion of individual points throughout the video. The latter is true even for powerful deep-learning methods that can track points through occlusions. Tracking points individually ignores the strong correlation that can exist between the points, for instance, because they belong to the same physical object, potentially harming performance. In this paper, we thus propose CoTracker, an architecture that jointly tracks multiple points throughout an entire video. This architecture combines several ideas from the optical flow and tracking literature in a new, flexible and powerful design. It is based on a transformer network that models the correlation of different points in time via specialised attention layers. The transformer iteratively updates an estimate of several trajectories. It can be applied in a sliding-window manner to very long videos, for which we engineer an unrolled training loop. It can track from one to several points jointly and supports adding new points to track at any time. The result is a flexible and powerful tracking algorithm that outperforms state-of-the-art methods in almost all benchmarks.
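
To make the mechanism concrete, below is a minimal PyTorch sketch of the two ideas the abstract emphasises: a transformer that attends both across time and across the tracked points (so trajectories can inform each other), and a sliding-window loop for long videos. This is not the CoTracker implementation: it omits the image encoder and appearance/correlation features entirely, and every name and hyperparameter here (JointTrackerSketch, window, stride, the iteration count) is an illustrative assumption.

```python
import torch
import torch.nn as nn


class JointTrackerSketch(nn.Module):
    """Toy refinement transformer: updates N point trajectories over a window of
    T frames, attending across time and across points so tracks inform each other.
    (Illustrative sketch only; the real model also conditions on video features.)"""

    def __init__(self, dim=64, heads=4, iters=3):
        super().__init__()
        self.iters = iters
        self.embed = nn.Linear(2, dim)                 # embed (x, y) track positions
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.point_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_delta = nn.Linear(dim, 2)              # predict a position update

    def forward(self, tracks):
        # tracks: (B, T, N, 2) current estimate of N points over T frames
        B, T, N, _ = tracks.shape
        for _ in range(self.iters):                    # iterative refinement
            x = self.embed(tracks)                     # (B, T, N, dim)
            # attention over time, independently per point
            xt = x.permute(0, 2, 1, 3).reshape(B * N, T, -1)
            xt, _ = self.time_attn(xt, xt, xt)
            # attention over points, independently per frame (the "joint" part)
            xp = xt.reshape(B, N, T, -1).permute(0, 2, 1, 3).reshape(B * T, N, -1)
            xp, _ = self.point_attn(xp, xp, xp)
            delta = self.to_delta(xp).reshape(B, T, N, 2)
            tracks = tracks + delta                    # update all trajectories jointly
        return tracks


@torch.no_grad()
def track_sliding_window(model, init_points, num_frames, window=8, stride=4):
    """Apply the tracker to a long sequence in overlapping windows, so each window
    starts from (and refines) the estimates left by the previous one."""
    tracks = init_points.unsqueeze(1).repeat(1, num_frames, 1, 1)  # naive init: copy queries
    for start in range(0, num_frames - window + 1, stride):
        tracks[:, start:start + window] = model(tracks[:, start:start + window])
    return tracks


# Usage: 5 query points from the first frame, tracked jointly over 16 frames.
model = JointTrackerSketch().eval()
queries = torch.rand(1, 5, 2)
trajectories = track_sliding_window(model, queries, num_frames=16)
print(trajectories.shape)  # torch.Size([1, 16, 5, 2])
```

The point_attn step is where the joint tracking the abstract argues for would happen: each track's update can depend on every other track in the same frame, which independent per-point trackers by construction cannot do. The overlapping-window loop mirrors the sliding-window application to long videos, though the unrolled training procedure mentioned in the abstract is not shown here.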