VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models

Abstract

Text-to-video diffusion models have advanced video generation significantly. However, customizing these models to generate videos with tailored motions remains a substantial challenge. Specifically, they struggle to (a) accurately reproduce motion from a target video and (b) create diverse visual variations. For example, straightforward extensions of static image customization methods to video often entangle appearance and motion data. To tackle this, we present the Video Motion Customization (VMC) framework, a novel one-shot tuning approach that adapts the temporal attention layers within video diffusion models. Our approach introduces a novel motion distillation objective that uses residual vectors between consecutive frames as a motion reference. The diffusion process then preserves low-frequency motion trajectories while mitigating high-frequency, motion-unrelated noise in image space. We validate our method against state-of-the-art video generative models across diverse real-world motions and contexts. Our code, data, and project demo are available at https://video-motion-customization.github.io
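
The motion reference named in the abstract, residual vectors between consecutive frames, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `frame_residuals` and `residual_alignment_loss` are hypothetical, and the cosine-similarity form of the loss is an assumption made here for illustration, since the abstract does not specify how the residuals are compared.

```python
import numpy as np


def frame_residuals(frames: np.ndarray) -> np.ndarray:
    """Differences between consecutive frames.

    frames has shape (T, ...), with time on the first axis.
    The result has shape (T - 1, ...).
    """
    return frames[1:] - frames[:-1]


def residual_alignment_loss(pred: np.ndarray, target: np.ndarray,
                            eps: float = 1e-8) -> float:
    """Hypothetical loss: 1 - mean cosine similarity of frame residuals.

    Assumption for illustration only. It compares how frames change over
    time rather than the frames themselves, so a constant appearance
    offset applied to every frame does not affect it.
    """
    rp = frame_residuals(pred).reshape(pred.shape[0] - 1, -1)
    rt = frame_residuals(target).reshape(target.shape[0] - 1, -1)
    dots = (rp * rt).sum(axis=1)
    norms = np.linalg.norm(rp, axis=1) * np.linalg.norm(rt, axis=1)
    return float(1.0 - (dots / (norms + eps)).mean())


# Toy "video": a base frame drifting linearly along a fixed direction v.
rng = np.random.default_rng(0)
base = rng.normal(size=(8, 8))
v = rng.normal(size=(8, 8))
target = np.stack([base + t * v for t in range(4)])

same_motion = target + 5.0   # different appearance, same motion
reversed_motion = target[::-1]  # same frames, opposite motion

loss_same = residual_alignment_loss(same_motion, target)
loss_reversed = residual_alignment_loss(reversed_motion, target)
```

In this toy setup the loss is near zero when only the appearance changes and near its maximum when the motion is reversed, which is the kind of appearance-invariant motion signal the abstract describes.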
