Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling

Abstract

We introduce Motion-I2V, a novel framework for consistent and controllable image-to-video generation (I2V). In contrast to previous methods that directly learn the complicated image-to-video mapping, Motion-I2V factorizes I2V into two stages with explicit motion modeling. For the first stage, we propose a diffusion-based motion field predictor that focuses on deducing the trajectories of the reference image's pixels. For the second stage, we propose motion-augmented temporal attention to enhance the limited 1-D temporal attention in video latent diffusion models. This module effectively propagates the reference image's features to the synthesized frames under the guidance of the trajectories predicted in the first stage. Compared with existing methods, Motion-I2V generates more consistent videos even in the presence of large motion and viewpoint variation. By training a sparse trajectory ControlNet for the first stage, Motion-I2V lets users precisely control motion trajectories and motion regions through sparse trajectory and region annotations. This offers more controllability over the I2V process than relying solely on textual instructions. Additionally, the second stage of Motion-I2V naturally supports zero-shot video-to-video translation. Both qualitative and quantitative comparisons demonstrate the advantages of Motion-I2V over prior approaches in consistent and controllable image-to-video generation.
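To make the second stage more concrete, the following is a minimal NumPy sketch of the general idea behind motion-augmented temporal attention. It is an illustration under stated assumptions, not the paper's implementation: the function names (`warp_reference`, `motion_augmented_temporal_attention`), the nearest-neighbor backward warp, the convention that each motion field stores per-pixel (dx, dy) offsets into the reference frame, and the omission of learned query/key/value projections are all simplifications introduced here.

```python
import numpy as np

def warp_reference(ref_feat, flow):
    """Backward-warp reference features (H, W, C) with a motion field (H, W, 2).

    Assumption: flow[y, x] = (dx, dy) is the offset from a target pixel to the
    location in the reference frame it should sample (nearest neighbor, clamped).
    """
    h, w, _ = ref_feat.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return ref_feat[src_y, src_x]

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def motion_augmented_temporal_attention(feats, flows):
    """feats: (T, H, W, C) frame features; flows: (T, H, W, 2) motion fields.

    Plain 1-D temporal attention only attends across frames at the same pixel.
    Here the keys/values are extended with reference-frame features warped to
    each frame, so information can follow the predicted trajectories.
    """
    t, h, w, c = feats.shape
    warped = np.stack([warp_reference(feats[0], flows[i]) for i in range(t)])
    q = feats.transpose(1, 2, 0, 3)                                  # (H, W, T, C)
    kv = np.concatenate([q, warped.transpose(1, 2, 0, 3)], axis=2)   # (H, W, 2T, C)
    scores = q @ kv.transpose(0, 1, 3, 2) / np.sqrt(c)               # (H, W, T, 2T)
    out = softmax(scores) @ kv                                       # (H, W, T, C)
    return out.transpose(2, 0, 1, 3)                                 # (T, H, W, C)

# Usage with random features and zero motion (hypothetical sizes).
rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 8, 8, 16))
flows = np.zeros((4, 8, 8, 2))
out = motion_augmented_temporal_attention(feats, flows)
print(out.shape)  # (4, 8, 8, 16)
```

In this sketch the first stage is replaced by a placeholder motion field of zeros. In the framework described above, those fields would instead come from the diffusion-based motion field predictor, optionally steered by sparse trajectory and region annotations.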
