TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models

Abstract

Recent advances in text-to-video generation have demonstrated the utility of powerful diffusion models. Nevertheless, adapting diffusion models to animate a static image (i.e., image-to-video generation) is not trivial. The difficulty is that the diffusion process for the subsequent animated frames must not only remain faithfully aligned with the given image but also maintain temporal coherence among adjacent frames. To alleviate this, we present TRIP, a new image-to-video diffusion paradigm that pivots on an image noise prior derived from the static image to jointly trigger inter-frame relational reasoning and ease coherent temporal modeling via temporal residual learning. Technically, the image noise prior is first obtained through a one-step backward diffusion process based on both the static image and the noised video latent codes. Next, TRIP executes a residual-like dual-path scheme for noise prediction: 1) a shortcut path that directly takes the image noise prior as the reference noise of each frame to strengthen the alignment between the first frame and subsequent frames; 2) a residual path that applies a 3D-UNet over the noised video and static image latent codes to enable inter-frame relational reasoning, thereby easing the learning of the residual noise for each frame. Furthermore, the reference and residual noise of each frame are dynamically merged via an attention mechanism for final video generation. Extensive experiments on the WebVid-10M, DTDB and MSR-VTT datasets demonstrate the effectiveness of TRIP for image-to-video generation. Please see our project page at https://trip-i2v.github.io/TRIP/.
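
The dual-path idea can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes the standard DDPM forward process z_t = sqrt(a) * z_0 + sqrt(1 - a) * eps, and it assumes the one-step backward estimate is obtained by substituting the first-frame latent for each frame's clean latent. The function names, tensor shapes, and the scalar `weight` (a stand-in for the attention-based merge) are hypothetical, and the residual is a zero placeholder rather than a 3D-UNet output.

```python
import numpy as np

def image_noise_prior(noised_video, first_frame, alpha_bar_t):
    """One-step backward estimate of per-frame noise.

    Assumes the DDPM forward process z_t = sqrt(a) * z_0 + sqrt(1 - a) * eps
    and substitutes the first-frame latent for every frame's clean latent z_0.
    """
    a = float(alpha_bar_t)
    return (noised_video - np.sqrt(a) * first_frame[None]) / np.sqrt(1.0 - a)

def merge_noise(reference, residual, weight):
    """Blend reference and residual noise.

    `weight` is a hypothetical scalar in [0, 1] standing in for the
    attention-derived coefficient described in the abstract.
    """
    return weight * reference + (1.0 - weight) * residual

rng = np.random.default_rng(0)
frames, channels, height, width = 4, 4, 8, 8
alpha_bar_t = 0.5

# Toy "static" clip: every frame's latent equals the first-frame latent.
first_frame = rng.standard_normal((channels, height, width))
clean_video = np.repeat(first_frame[None], frames, axis=0)
true_noise = rng.standard_normal(clean_video.shape)
noised_video = (np.sqrt(alpha_bar_t) * clean_video
                + np.sqrt(1.0 - alpha_bar_t) * true_noise)

# Shortcut path: reference noise from the image noise prior.
reference = image_noise_prior(noised_video, first_frame, alpha_bar_t)
# Residual path: zero placeholder where a 3D-UNet prediction would go.
residual = np.zeros_like(reference)
predicted = merge_noise(reference, residual, weight=1.0)

prior_error = float(np.abs(reference - true_noise).max())
print("max |reference - true noise| on a static clip:", prior_error)
```

In this toy case the clip has no motion, so the reference noise already matches the injected noise up to floating-point error. Under the same assumptions, frames that differ from the first frame would leave a gap between the two, and that gap is what the residual path would have to model.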
