Local All-Pair Correspondence for Point Tracking
July 22, 2024
Authors: Seokju Cho, Jiahui Huang, Jisu Nam, Honggyu An, Seungryong Kim, Joon-Young Lee
cs.AI
Abstract
We introduce LocoTrack, a highly accurate and efficient model designed for
the task of tracking any point (TAP) across video sequences. Previous
approaches in this task often rely on local 2D correlation maps to establish
correspondences from a point in the query image to a local region in the target
image, which often struggle with homogeneous regions or repetitive features,
leading to matching ambiguities. LocoTrack overcomes this challenge with a
novel approach that utilizes all-pair correspondences across regions, i.e.,
local 4D correlation, to establish precise correspondences, with bidirectional
correspondence and matching smoothness significantly enhancing robustness
against ambiguities. We also incorporate a lightweight correlation encoder to
enhance computational efficiency, and a compact Transformer architecture to
integrate long-term temporal information. LocoTrack achieves unmatched accuracy
on all TAP-Vid benchmarks and operates at a speed almost 6 times faster than
the current state-of-the-art.
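To make the "local 4D correlation" idea concrete: instead of correlating a single query feature against a local target window (a 2D map), LocoTrack correlates every feature in a window around the query point against every feature in a window around the target estimate, yielding a 4D volume. The sketch below is a minimal NumPy illustration of this construction under assumed conventions (zero-padding at borders, raw dot-product similarity); it is not the authors' implementation, and the function name and signature are hypothetical.

```python
import numpy as np

def local_4d_correlation(feat_q, feat_t, q_center, t_center, r=3):
    """Illustrative local 4D correlation volume (not the paper's code).

    feat_q, feat_t: (H, W, C) feature maps of the query/target frames.
    q_center, t_center: (row, col) of the query point and its current
    estimate in the target frame. r: local window radius.
    Returns a (k, k, k, k) volume with k = 2r + 1, holding the dot
    product between every pair of features from the two local windows.
    """
    k = 2 * r + 1

    def window(feat, center):
        rr, cc = center
        H, W, C = feat.shape
        out = np.zeros((k, k, C), dtype=feat.dtype)
        for i in range(-r, r + 1):
            for j in range(-r, r + 1):
                y, x = rr + i, cc + j
                if 0 <= y < H and 0 <= x < W:  # zero-pad outside the map
                    out[i + r, j + r] = feat[y, x]
        return out

    wq = window(feat_q, q_center)  # (k, k, C) window around the query point
    wt = window(feat_t, t_center)  # (k, k, C) window around the target estimate
    # All-pair inner products: corr[i, j, u, v] = <wq[i, j], wt[u, v]>
    return np.einsum('ijc,uvc->ijuv', wq, wt)
```

Compared with a 2D map (only the query center vs. the target window, i.e. the `corr[r, r]` slice of this volume), the full 4D volume exposes both directions of correspondence and the smoothness of matches across neighboring points, which is what the abstract credits for the robustness to homogeneous and repetitive regions.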