Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis

Abstract

In this paper, we introduce Fairy, a minimalist yet robust adaptation of image-editing diffusion models that enhances them for video editing applications. Our approach centers on anchor-based cross-frame attention, a mechanism that implicitly propagates diffusion features across frames, ensuring superior temporal coherence and high-fidelity synthesis. Fairy not only addresses the memory and processing-speed limitations of previous models but also improves temporal consistency through a unique data augmentation strategy, which renders the model equivariant to affine transformations in both source and target images. Remarkably efficient, Fairy generates 120-frame 512x384 videos (4 seconds at 30 FPS) in just 14 seconds, outpacing prior works by at least 44x. A comprehensive user study involving 1000 generated samples confirms that our approach delivers superior quality, decisively outperforming established methods.
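The abstract names anchor-based cross-frame attention without giving its details, so the following is only a minimal NumPy sketch of the general idea, not the authors' implementation. It assumes that every frame's queries attend to keys and values gathered from a small set of anchor frames. The function name `anchor_cross_frame_attention`, the tensor layout `(frames, tokens, dim)`, and the choice of anchor indices are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def anchor_cross_frame_attention(q, k, v, anchor_idx):
    """Illustrative sketch (assumed formulation, not the paper's code).

    q, k, v: arrays of shape (frames, tokens, dim).
    anchor_idx: list of frame indices used as anchors (hypothetical choice).
    Every frame's queries attend only to keys/values from the anchor frames.
    """
    frames, tokens, dim = q.shape
    # Pool keys and values from the anchor frames into one shared set.
    k_anchor = k[anchor_idx].reshape(-1, dim)   # (anchors * tokens, dim)
    v_anchor = v[anchor_idx].reshape(-1, dim)   # (anchors * tokens, dim)
    # Scaled dot-product scores of each frame's queries against anchor keys.
    scores = q @ k_anchor.T / np.sqrt(dim)      # (frames, tokens, anchors * tokens)
    weights = softmax(scores, axis=-1)
    # Each output token is a weighted mix of anchor values.
    return weights @ v_anchor                   # (frames, tokens, dim)

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 16, 32))
k = rng.normal(size=(8, 16, 32))
v = rng.normal(size=(8, 16, 32))
out = anchor_cross_frame_attention(q, k, v, anchor_idx=[0, 4])
print(out.shape)
```

Because all frames read from the same anchor keys and values in this sketch, the frames can be processed independently of one another, which is one plausible reading of how such a mechanism supports both temporal consistency and parallel generation.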
