ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation

Abstract

Image-to-video (I2V) generation aims to create a video sequence from an initial frame (alongside a text prompt). A grand challenge in I2V generation is maintaining visual consistency throughout the video: existing methods often struggle to preserve the integrity of the subject, background, and style of the first frame, and to ensure a fluid and logical progression within the video narrative. To mitigate these issues, we propose ConsistI2V, a diffusion-based method that enhances visual consistency for I2V generation. Specifically, we introduce (1) spatiotemporal attention over the first frame to maintain spatial and motion consistency, and (2) noise initialization from the low-frequency band of the first frame to enhance layout consistency. Together, these two approaches enable ConsistI2V to generate highly consistent videos. We also extend the proposed approaches to show their potential to improve consistency in auto-regressive long video generation and camera motion control. To verify the effectiveness of our method, we propose I2V-Bench, a comprehensive evaluation benchmark for I2V generation. Our automatic and human evaluation results demonstrate the superiority of ConsistI2V over existing methods.
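To make the second idea concrete, the following is a minimal NumPy sketch of one plausible reading of "noise initialization from the low-frequency band of the first frame". It is not taken from the ConsistI2V implementation. The function names `low_pass_mask` and `mix_low_frequency`, the hard spherical cutoff, and the use of a static video built by repeating the first frame are all assumptions made for illustration. The sketch keeps the low-frequency part of the spectrum from the repeated first frame and fills the remaining frequencies from Gaussian noise.

```python
import numpy as np


def low_pass_mask(shape, cutoff):
    """Boolean mask that is True where the normalized frequency radius <= cutoff.

    Hypothetical helper: a hard spherical cutoff over all axes (time, height, width).
    """
    grids = np.meshgrid(*[np.fft.fftfreq(n) for n in shape], indexing="ij")
    radius = np.sqrt(sum(g ** 2 for g in grids))
    return radius <= cutoff


def mix_low_frequency(reference, noise, cutoff=0.25):
    """Combine low frequencies of `reference` with high frequencies of `noise`.

    Both inputs are real arrays of identical shape (frames, height, width).
    """
    mask = low_pass_mask(reference.shape, cutoff)
    ref_spec = np.fft.fftn(reference)
    noise_spec = np.fft.fftn(noise)
    mixed = np.where(mask, ref_spec, noise_spec)
    # The mask is symmetric in frequency, so the result is real up to round-off.
    return np.fft.ifftn(mixed).real


# Illustrative usage with made-up sizes (assumption: one channel, 4 frames of 8x8).
rng = np.random.default_rng(0)
first_frame = rng.standard_normal((8, 8))
static_video = np.repeat(first_frame[None], 4, axis=0)  # first frame repeated in time
gaussian_noise = rng.standard_normal(static_video.shape)
initial_noise = mix_low_frequency(static_video, gaussian_noise, cutoff=0.25)
```

In a diffusion sampler, the reference would more likely be a noised latent of the first frame than raw pixels. That step is omitted here because the abstract does not specify it.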
