TAPNext: Tracking Any Point (TAP) as Next Token Prediction

Abstract

Tracking Any Point (TAP) in a video is a challenging computer vision problem with many demonstrated applications in robotics, video editing, and 3D reconstruction. Existing methods for TAP rely heavily on complex tracking-specific inductive biases and heuristics, which limits their generality and their potential for scaling. To address these challenges, we present TAPNext, a new approach that casts TAP as sequential masked token decoding. Our model is causal, tracks in a purely online fashion, and removes tracking-specific inductive biases. This enables TAPNext to run with minimal latency and removes the temporal windowing required by many existing state-of-the-art trackers. Despite its simplicity, TAPNext achieves new state-of-the-art tracking performance among both online and offline trackers. Finally, we present evidence that many widely used tracking heuristics emerge naturally in TAPNext through end-to-end training.

