MVTrack4Gen: Multi-View Point Tracking as Geometric Supervision for 4D Video Generation

Abstract

Synthesizing a novel-view video from a monocular reference video along a target camera trajectory requires both geometric consistency and motion fidelity with respect to the reference video. Existing methods based on explicit 3D representations are limited by the accuracy of off-the-shelf reconstruction modules, which often produce inaccurate geometry for dynamic objects in monocular videos. In contrast, camera-conditioning-only methods can achieve high visual quality but often struggle to preserve geometric and motion consistency. In this work, we introduce MVTrack4Gen (Multi-View point Tracking for Novel-View Generation), a motion-aware training framework that uses multi-view point tracking as an additional geometric and motion supervision signal for camera-conditioning-only novel-view video diffusion models. Our key finding is that specific attention layers encode strong correspondence cues: query features attend to key features at geometrically corresponding locations across views and over time, and misalignment of these correspondences causes motion inconsistency. Based on this observation, we route these features into an auxiliary multi-view tracking head and jointly train the diffusion model with a point-tracking objective. By explicitly strengthening these motion-aware correspondences, MVTrack4Gen enables existing models to follow the motion in the reference view more faithfully and to maintain cross-view geometric consistency. Across diverse benchmarks, our method achieves state-of-the-art geometric consistency and competitive camera accuracy.
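
To make the correspondence idea concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a single query feature and a small grid of key features from another view or frame, reads a point location out of the attention weights with a soft-argmax (a simple stand-in for the learned multi-view tracking head described in the abstract), and adds an L1 tracking loss to a placeholder diffusion loss. The function names, the `temperature` parameter, and the `track_weight` value are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_argmax_track(query, keys, height, width, temperature=0.1):
    """Predict an (x, y) location for one query feature.

    `keys` is a row-major list of height*width key feature vectors taken
    from another view or frame. Attention weights over the key positions
    are converted into an expected coordinate (soft-argmax). This is an
    assumed simplification, not the tracking head from the paper.
    """
    assert len(keys) == height * width
    scale = math.sqrt(len(query)) * temperature
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    x = sum(w * (i % width) for i, w in enumerate(weights))
    y = sum(w * (i // width) for i, w in enumerate(weights))
    return x, y


def tracking_loss(pred, target):
    """L1 distance between predicted and ground-truth point locations."""
    return abs(pred[0] - target[0]) + abs(pred[1] - target[1])


def joint_loss(diffusion_loss, track_loss, track_weight=0.1):
    """Hypothetical joint objective: diffusion loss plus weighted tracking loss."""
    return diffusion_loss + track_weight * track_loss


# Toy example: a 2x2 key grid in which the key at (x=1, y=0) matches the query.
query = [1.0, 0.0]
keys = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
pred = soft_argmax_track(query, keys, height=2, width=2)
loss = joint_loss(diffusion_loss=0.5, track_loss=tracking_loss(pred, (1.0, 0.0)))
```

In this toy case the attention mass concentrates on the matching key, so the predicted location lands close to (1, 0) and the tracking term contributes little to the joint loss. When the attention is misaligned, the tracking term grows, which is the kind of signal a point-tracking objective would use to push attention toward geometrically corresponding locations.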