Customizing Motion in Text-to-Video Diffusion Models

Abstract

We introduce an approach for augmenting text-to-video generation models with customized motions, extending their capabilities beyond the motions depicted in the original training data. Given a few video samples demonstrating a specific movement as input, our method learns the input motion pattern and generalizes it to diverse, text-specified scenarios. Our contributions are threefold. First, we finetune an existing text-to-video model to learn a mapping between the motion depicted in the input examples and a new unique token. To avoid overfitting to the new custom motion, we introduce an approach for regularization over videos. Second, by leveraging the motion priors in a pretrained model, our method can produce novel videos featuring multiple people performing the custom motion, and can invoke the motion in combination with other motions. Furthermore, our approach extends to the multimodal customization of both the motion and the appearance of individualized subjects, enabling the generation of videos featuring unique characters and distinct motions. Third, to validate our method, we introduce an approach for quantitatively evaluating the learned custom motion and perform a systematic ablation study. We show that our method significantly outperforms prior appearance-based customization approaches when they are extended to the motion customization task.
