Reangle-A-Video: 4D Video Generation as Video-to-Video Translation

Abstract

We introduce Reangle-A-Video, a unified framework for generating synchronized multi-view videos from a single input video. Unlike mainstream approaches that train multi-view video diffusion models on large-scale 4D datasets, our method reframes the multi-view video generation task as video-to-videos translation, leveraging publicly available image and video diffusion priors. In essence, Reangle-A-Video operates in two stages. (1) Multi-View Motion Learning: An image-to-video diffusion transformer is synchronously fine-tuned in a self-supervised manner to distill view-invariant motion from a set of warped videos. (2) Multi-View Consistent Image-to-Images Translation: The first frame of the input video is warped and inpainted into various camera perspectives under inference-time cross-view consistency guidance using DUSt3R, generating multi-view consistent starting images. Extensive experiments on static view transport and dynamic camera control show that Reangle-A-Video surpasses existing methods, establishing a new solution for multi-view video generation. We will publicly release our code and data. Project page: https://hyeonho99.github.io/reangle-a-video/
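
The abstract says that frames are warped into new camera perspectives and then inpainted, but it does not specify how the warp is computed. The sketch below illustrates one common way to do this, assuming a per-pixel depth map, a pinhole camera, and a rigid relative pose; it is not the authors' implementation. The function name `warp_frame` and all of its parameters are hypothetical. The returned mask marks the pixels that received content, so its complement is the region an inpainting model would have to fill.

```python
import math


def warp_frame(image, depth, intrinsics, rotation, translation):
    """Forward-warp one frame into a target camera (illustrative helper).

    image:       H x W nested list of pixel values
    depth:       H x W nested list of positive depths (assumed to be given)
    intrinsics:  (fx, fy, cx, cy) of a pinhole camera shared by both views
    rotation:    3 x 3 nested list R, source-to-target rotation
    translation: length-3 list t, source-to-target translation
    Returns (warped, mask); holes in `warped` are None and False in `mask`.
    """
    fx, fy, cx, cy = intrinsics
    height, width = len(image), len(image[0])
    warped = [[None] * width for _ in range(height)]
    mask = [[False] * width for _ in range(height)]
    zbuf = [[math.inf] * width for _ in range(height)]

    for v in range(height):
        for u in range(width):
            z = depth[v][u]
            # Back-project pixel (u, v) to a 3D point in the source camera frame.
            point = ((u - cx) / fx * z, (v - cy) / fy * z, z)
            # Rigid transform into the target camera frame: X' = R X + t.
            x, y, zt = (
                sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
                for i in range(3)
            )
            if zt <= 0:
                continue  # the point lies behind the target camera
            # Project onto the target image plane and snap to the nearest pixel.
            ut = int(round(fx * x / zt + cx))
            vt = int(round(fy * y / zt + cy))
            # Z-buffer test: keep only the closest point that lands on a pixel.
            if 0 <= ut < width and 0 <= vt < height and zt < zbuf[vt][ut]:
                zbuf[vt][ut] = zt
                warped[vt][ut] = image[v][u]
                mask[vt][ut] = True
    return warped, mask


# Toy usage: unit focal length, unit depth, and a one-unit translation along x
# shift every pixel one column to the left and leave the last column as a hole.
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
frame = [[1, 2, 3], [4, 5, 6]]
unit_depth = [[1.0] * 3 for _ in range(2)]
warped, mask = warp_frame(
    frame, unit_depth, (1.0, 1.0, 0.0, 0.0), identity, [-1.0, 0.0, 0.0]
)
```

Applying the same warp to every frame of the input video would yield the kind of warped videos that stage (1) refers to, while applying it to the first frame alone gives the starting point for stage (2). How the actual method obtains depth and handles holes is not stated in the abstract.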
