Local All-Pair Correspondence for Point Tracking

Abstract

We introduce LocoTrack, a highly accurate and efficient model designed for the task of tracking any point (TAP) across video sequences. Previous approaches to this task often rely on local 2D correlation maps to establish correspondences from a point in the query image to a local region in the target image. These maps often struggle with homogeneous regions or repetitive features, leading to matching ambiguities. LocoTrack overcomes this challenge with a novel approach that uses all-pair correspondences across regions, i.e., local 4D correlation, to establish precise correspondences; bidirectional correspondence and matching smoothness significantly enhance robustness against ambiguities. We also incorporate a lightweight correlation encoder to enhance computational efficiency and a compact Transformer architecture to integrate long-term temporal information. LocoTrack achieves unmatched accuracy on all TAP-Vid benchmarks and runs almost 6 times faster than the current state-of-the-art.
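
The contrast between a local 2D correlation map (one query point against a target region) and a local 4D correlation (every position in a query region against every position in a target region) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the integer patch centers, the zero padding at image borders, and the plain dot-product similarity are all assumptions made for illustration.

```python
import numpy as np

def local_patch(feat, center, radius):
    """Crop a (2r+1, 2r+1, C) patch around an integer (y, x) center.

    Hypothetical helper: borders are zero-padded so the patch always
    has the same shape.
    """
    y, x = center
    padded = np.pad(feat, ((radius, radius), (radius, radius), (0, 0)))
    # In padded coordinates the original pixel (y, x) sits at (y + r, x + r),
    # so rows y .. y + 2r cover the original rows y - r .. y + r.
    return padded[y:y + 2 * radius + 1, x:x + 2 * radius + 1]

def local_2d_correlation(query_feat, target_feat, q_pt, t_pt, radius):
    """Point-to-region: one query feature vector against a target patch."""
    q_vec = query_feat[q_pt[0], q_pt[1]]              # shape (C,)
    t_patch = local_patch(target_feat, t_pt, radius)  # shape (k, k, C)
    return np.einsum("c,ijc->ij", q_vec, t_patch)     # shape (k, k)

def local_4d_correlation(query_feat, target_feat, q_pt, t_pt, radius):
    """Region-to-region: all pairs between a query patch and a target patch."""
    q_patch = local_patch(query_feat, q_pt, radius)   # shape (k, k, C)
    t_patch = local_patch(target_feat, t_pt, radius)  # shape (k, k, C)
    return np.einsum("abc,ijc->abij", q_patch, t_patch)  # shape (k, k, k, k)
```

In this sketch the 2D map is the slice of the 4D volume taken at the center of the query patch, so the 4D volume contains the 2D map plus the correlations of the neighbouring query positions. Those extra slices are what make it possible to check that a match is consistent in both directions and across neighbours.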
