CoTracker: It is Better to Track Together

Abstract

Methods for video motion prediction either jointly estimate the instantaneous motion of all points in a given video frame using optical flow, or independently track the motion of individual points throughout the video. The latter holds even for powerful deep-learning methods that can track points through occlusions. Tracking points individually ignores the strong correlation that can exist between points, for instance because they belong to the same physical object, and this can harm performance. In this paper, we therefore propose CoTracker, an architecture that jointly tracks multiple points throughout an entire video. The architecture combines several ideas from the optical flow and tracking literature in a new, flexible and powerful design. It is based on a transformer network that models the correlation of different points in time via specialised attention layers. The transformer iteratively updates an estimate of several trajectories. It can be applied in a sliding-window manner to very long videos, for which we engineer an unrolled training loop. It can track from one to several points jointly and supports adding new points to track at any time. The result is a flexible and powerful tracking algorithm that outperforms state-of-the-art methods in almost all benchmarks.
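
To make the sliding-window and iterative-update ideas more concrete, the following is a minimal sketch in Python with NumPy. It is not the CoTracker implementation: the function names (sliding_windows, track_jointly, toy_refine), the window length, the stride and the number of iterations are all assumptions chosen for illustration, and the hypothetical refine callable merely stands in for the transformer, which in the actual method operates on image features and attends across tracks and time.

```python
import numpy as np

def sliding_windows(num_frames, window, stride):
    """Yield (start, end) frame ranges that cover the video with overlap."""
    start = 0
    while True:
        end = min(start + window, num_frames)
        yield start, end
        if end == num_frames:
            break
        start += stride

def track_jointly(queries, num_frames, refine, window=8, stride=4, num_iters=4):
    """Track N points over num_frames frames; returns an array (num_frames, N, 2).

    queries: (N, 2) positions at frame 0.
    refine:  hypothetical stand-in for the network; given the current window
             estimate (T, N, 2) and the window start, it returns an update
             of the same shape for all N tracks at once.
    """
    queries = np.asarray(queries, dtype=float)
    # Initial guess: every point stays where it was queried.
    tracks = np.repeat(queries[None], num_frames, axis=0)
    done = 0  # number of frames already estimated by earlier windows
    for start, end in sliding_windows(num_frames, window, stride):
        if done > 0:
            # Frames not seen yet start from the last estimated position.
            tracks[done:end] = tracks[done - 1]
        est = tracks[start:end].copy()
        for _ in range(num_iters):
            est = est + refine(est, start)  # iterative joint update
        tracks[start:end] = est
        done = end
    return tracks

# Toy usage: two points drifting right by one pixel per frame.
num_frames = 20
queries = np.array([[0.0, 0.0], [5.0, 2.0]])
target = queries[None] + np.arange(num_frames)[:, None, None] * np.array([1.0, 0.0])

def toy_refine(est, start):
    # Assumed toy update: move halfway towards the known true positions.
    return 0.5 * (target[start:start + len(est)] - est)

tracks = track_jointly(queries, num_frames, toy_refine)
```

In this sketch, overlapping windows let each new window start from the estimates of the previous one, which is one plausible way to extend tracking to videos longer than a single window. The toy update treats every point identically, so it does not exploit correlations between points; that part is what the attention layers described above are meant to provide.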
