SpatialTracker: Tracking Any 2D Pixels in 3D Space

Abstract

Recovering dense and long-range pixel motion in videos is a challenging problem. Part of the difficulty arises from the 3D-to-2D projection process, leading to occlusions and discontinuities in the 2D motion domain. While 2D motion can be intricate, we posit that the underlying 3D motion can often be simple and low-dimensional. In this work, we propose to estimate point trajectories in 3D space to mitigate the issues caused by image projection. Our method, named SpatialTracker, lifts 2D pixels to 3D using monocular depth estimators, represents the 3D content of each frame efficiently using a triplane representation, and performs iterative updates using a transformer to estimate 3D trajectories. Tracking in 3D allows us to leverage as-rigid-as-possible (ARAP) constraints while simultaneously learning a rigidity embedding that clusters pixels into different rigid parts. Extensive evaluation shows that our approach achieves state-of-the-art tracking performance both qualitatively and quantitatively, particularly in challenging scenarios such as out-of-plane rotation.
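
To make two of the ideas above concrete, here is a minimal sketch of lifting a pixel to 3D from a depth value and of measuring how far two 3D trajectories are from rigid. It is not the SpatialTracker implementation. It assumes a standard pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy`, which the abstract does not specify, and it uses a plain distance-preservation residual as a simplified stand-in for an ARAP constraint. The function names `unproject` and `arap_residual` are hypothetical.

```python
from math import dist


def unproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with a depth value to a 3D camera-space point.

    Assumes a pinhole camera model; the intrinsics are illustrative inputs,
    not values taken from the paper.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def arap_residual(traj_a, traj_b):
    """Simplified rigidity measure for two 3D trajectories of equal length.

    For a rigidly moving pair of points, the distance between them is the
    same in every frame. This sums the absolute change in that distance
    between consecutive frames, so it is 0 for rigid motion and grows as
    the pair deforms. It is a stand-in for an ARAP term, not the paper's loss.
    """
    distances = [dist(p, q) for p, q in zip(traj_a, traj_b)]
    return sum(abs(distances[t + 1] - distances[t])
               for t in range(len(distances) - 1))


if __name__ == "__main__":
    # A pixel at the principal point lies on the optical axis.
    print(unproject(320, 240, 2.0, fx=500, fy=500, cx=320, cy=240))

    # Two points translating together: the residual is zero.
    a = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (2.0, 0.0, 1.0)]
    b = [(0.0, 1.0, 1.0), (1.0, 1.0, 1.0), (2.0, 1.0, 1.0)]
    print(arap_residual(a, b))
```

In a learned tracker, a residual of this kind would typically be weighted per point pair, for example by a learned rigidity affinity, so that only pairs believed to lie on the same rigid part are penalized. That weighting is omitted here.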
