Track4World: Feedforward World-centric Dense 3D Tracking of All Pixels

Abstract

Estimating the 3D trajectory of every pixel in a monocular video is a crucial and promising step toward a comprehensive understanding of the video's 3D dynamics. Recent monocular 3D tracking methods demonstrate impressive performance, but they either track only sparse points from the first frame or rely on slow optimization-based frameworks for dense tracking. In this paper, we propose Track4World, a feedforward model that enables efficient, holistic 3D tracking of every pixel in a world-centric coordinate system. Built on a global 3D scene representation encoded by a VGGT-style ViT, Track4World applies a novel 3D correlation scheme to simultaneously estimate pixel-wise dense 2D and 3D flow between arbitrary frame pairs. The estimated scene flow, together with the reconstructed 3D geometry, then enables efficient 3D tracking of every pixel in the video. Extensive experiments on multiple benchmarks demonstrate that our approach consistently outperforms existing methods in 2D/3D flow estimation and 3D tracking, highlighting its robustness and scalability for real-world 4D reconstruction tasks.
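To make the last step of the abstract concrete, the following is a minimal sketch of how a dense scene flow field and reconstructed geometry could be combined to move every pixel's 3D point from one frame to another. It is not the Track4World implementation: the function names, the pinhole intrinsics `K`, the camera-to-world pose, and the assumption that the scene flow is expressed in world coordinates are all illustrative choices, and the depth and flow arrays stand in for network predictions.

```python
import numpy as np

def unproject_to_world(depth, K, cam_to_world):
    """Lift every pixel of a depth map to world-space points, shape (H, W, 3).

    Hypothetical helper: assumes a pinhole camera with intrinsics K (3x3)
    and a 4x4 camera-to-world pose matrix.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column / row grids
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T       # camera-space rays with z = 1
    cam_pts = rays * depth[..., None]     # scale each ray by its depth
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return cam_pts @ R.T + t              # rotate and translate into world space

def advect_points(points_world, scene_flow_world):
    """Move each world-space point by its (assumed world-space) 3D flow vector."""
    return points_world + scene_flow_world

# Toy inputs standing in for predicted depth and scene flow.
K = np.array([[100.0, 0.0, 2.0],
              [0.0, 100.0, 1.0],
              [0.0, 0.0, 1.0]])
depth = np.full((3, 5), 2.0)              # constant depth of 2 units
pose = np.eye(4)                          # camera frame coincides with world frame
flow = np.zeros((3, 5, 3))
flow[..., 2] = 1.0                        # every point moves 1 unit along +z

points_src = unproject_to_world(depth, K, pose)
points_tgt = advect_points(points_src, flow)
```

Applying the same update for each target frame would give one 3D position per pixel per frame, which is the shape of output a dense world-centric tracker produces; how the model itself predicts the flow and geometry is not covered by this sketch.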
