Track2View: 4D-Consistent Camera-Controlled Video Generation via Paired 3D Point Tracks

Abstract

Re-rendering an existing video from a novel camera viewpoint requires the output to follow the prescribed camera trajectory while preserving the appearance and dynamics of the original scene across every frame. Existing methods rely on per-frame pose embeddings, noisy point-cloud renderings, or implicitly learned correspondences, none of which provides an explicit, temporally continuous link between source and target pixels. We propose Track2View, which conditions a video diffusion transformer on paired 3D point tracks: sparse trajectories of scene points projected into both the source and target camera views. These tracks provide explicit spatiotemporal correspondences that are temporally continuous by construction, encoding what content should appear where and when. At the core of Track2View is a dual-view track conditioner that transfers visual context from the source view to the target view through parameter-free geometric operations and learned temporal aggregation, ensuring generalization to arbitrary camera trajectories without memorizing specific motions. We further introduce a data curation pipeline that extracts one-to-one track correspondences by running a 3D point tracker on temporally concatenated multi-camera view pairs. On a 400-video benchmark spanning static and dynamic scenes, Track2View achieves state-of-the-art results across visual quality, view synchronization, and camera accuracy, reducing rotation error by 30-65% and translation error by 61-72% relative to leading baselines. The project page is available at https://qjizhi.github.io/track2view
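
To make the notion of paired 3D point tracks concrete, the sketch below projects a set of 3D scene-point trajectories into a source and a target camera, yielding one pixel pair per point per frame. It is an illustration only, not the pipeline described above: it assumes a standard pinhole camera, world-to-camera poses given as a rotation and a translation, and intrinsics shared by both views. The names `project` and `paired_tracks` are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Sequence[Sequence[float]]        # row-major 3x3 rotation matrix
Pose = Tuple[Mat3, Vec3]                # assumed world-to-camera (R, t)
Pixel = Optional[Tuple[float, float]]   # None if the point is not in front of the camera


def project(point: Vec3, pose: Pose, fx: float, fy: float, cx: float, cy: float) -> Pixel:
    """Pinhole projection of one world-space point into one camera."""
    R, t = pose
    # Camera-space coordinates: X_cam = R @ X_world + t
    x, y, z = (sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3))
    if z <= 1e-8:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def paired_tracks(
    tracks_3d: List[List[Vec3]],   # tracks_3d[f][n]: position of point n at frame f
    src_poses: List[Pose],         # one source-camera pose per frame
    tgt_poses: List[Pose],         # one target-camera pose per frame
    intrinsics: Tuple[float, float, float, float],  # (fx, fy, cx, cy), assumed shared
) -> List[List[Tuple[Pixel, Pixel]]]:
    """Return, per frame and per point, the (source pixel, target pixel) pair."""
    fx, fy, cx, cy = intrinsics
    paired = []
    for points, src, tgt in zip(tracks_3d, src_poses, tgt_poses):
        paired.append([
            (project(p, src, fx, fy, cx, cy), project(p, tgt, fx, fy, cx, cy))
            for p in points
        ])
    return paired


# Toy example: one point moving along +x over two frames, a fixed source
# camera at the origin, and a target camera shifted 0.5 units along +x.
IDENTITY = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
example_tracks = [[(0.0, 0.0, 2.0)], [(0.5, 0.0, 2.0)]]
example_src = [(IDENTITY, (0.0, 0.0, 0.0))] * 2
example_tgt = [(IDENTITY, (-0.5, 0.0, 0.0))] * 2
example_pairs = paired_tracks(example_tracks, example_src, example_tgt, (100.0, 100.0, 50.0, 50.0))
```

Because both pixels in a pair come from the same 3D point at the same frame, the correspondence between views is one-to-one and stays continuous over time as long as the 3D trajectory and camera paths are continuous.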