LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation

Abstract

With the impressive progress in diffusion-based text-to-image generation, extending this generative ability to text-to-video has attracted enormous attention. Existing methods either require large-scale text-video pairs and substantial training resources, or learn motions that are precisely aligned with template videos. Balancing the degree of generation freedom against resource cost in video generation is non-trivial. In this study, we present LAMP, a few-shot-based tuning framework that enables a text-to-image diffusion model to Learn A specific Motion Pattern from 8 to 16 videos on a single GPU. Specifically, we design a first-frame-conditioned pipeline that uses an off-the-shelf text-to-image model for content generation, so that our tuned video diffusion model mainly focuses on motion learning. Well-developed text-to-image techniques can provide visually pleasing and diverse content as generation conditions, which greatly improves video quality and generation freedom. To capture temporal features, we expand the pretrained 2D convolution layers of the T2I model into our novel temporal-spatial motion learning layers and modify the attention blocks to operate at the temporal level. Additionally, we develop an effective inference trick, shared-noise sampling, which improves the stability of generated videos at some computational cost. Our method can also be flexibly applied to other tasks, such as real-world image animation and video editing. Extensive experiments demonstrate that LAMP can effectively learn the motion pattern from limited data and generate high-quality videos. The code and models are available at https://rq-wu.github.io/projects/LAMP.
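The abstract names shared-noise sampling but does not give its formula. As a rough illustration only, the sketch below assumes that each frame's initial noise is a variance-preserving mix of one noise vector shared by all frames and one noise vector private to that frame. The function name `shared_noise_frames`, the mixing weight `alpha`, and the mixing rule itself are hypothetical and are not taken from the LAMP code.

```python
import math
import random


def shared_noise_frames(num_frames, frame_size, alpha=0.5, seed=0):
    """Return per-frame initial noise with a common shared component.

    alpha is a hypothetical mixing weight in [0, 1]:
    0 gives independent noise per frame, 1 gives identical noise for all frames.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    rng = random.Random(seed)
    # One Gaussian noise vector shared by every frame.
    shared = [rng.gauss(0.0, 1.0) for _ in range(frame_size)]
    # Weight for the per-frame part; alpha^2 + own_weight^2 = 1 keeps each
    # element at unit variance (assumed mixing rule, not from the paper).
    own_weight = math.sqrt(1.0 - alpha * alpha)
    frames = []
    for _ in range(num_frames):
        own = [rng.gauss(0.0, 1.0) for _ in range(frame_size)]
        frames.append([alpha * s + own_weight * o for s, o in zip(shared, own)])
    return frames


if __name__ == "__main__":
    noise = shared_noise_frames(num_frames=16, frame_size=4, alpha=0.5)
    print(len(noise), len(noise[0]))
```

Under this assumed rule, corresponding elements of any two frames have correlation `alpha ** 2`, so a larger `alpha` makes the frames start from more similar noise. Whether that matches the behavior of the authors' sampler is not something the abstract confirms.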
