MotionDirector: Motion Customization of Text-to-Video Diffusion Models

Abstract

Large-scale pre-trained diffusion models have shown remarkable capabilities in diverse video generation. Given a set of video clips that share the same motion concept, the task of Motion Customization is to adapt an existing text-to-video diffusion model to generate videos with this motion. Examples include generating a video of a car moving in a prescribed manner under specific camera movements to make a movie, or a video showing how a bear would lift weights to inspire creators. Adaptation methods have been developed for customizing appearance, such as subject or style, but motion customization remains unexplored. Mainstream adaptation methods, including full model tuning, parameter-efficient tuning of additional layers, and Low-Rank Adaptations (LoRAs), are straightforward to extend to motion customization. However, the motion concept learned by these methods is often coupled with the limited appearances in the training videos, which makes it difficult to generalize the customized motion to other appearances. To overcome this challenge, we propose MotionDirector, which uses a dual-path LoRA architecture to decouple the learning of appearance and motion. We further design a novel appearance-debiased temporal loss to mitigate the influence of appearance on the temporal training objective. Experimental results show that the proposed method can generate videos of diverse appearances for the customized motions. Our method also supports various downstream applications, such as mixing the appearance and motion taken from different videos, and animating a single image with customized motions. Our code and model weights will be released.
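For readers unfamiliar with the low-rank adaptation idea the abstract builds on, the following is a minimal, generic sketch rather than the MotionDirector implementation. The abstract does not specify how the dual-path adapters are placed or trained, so none of that is shown here. The class name `LoRALinear`, the helper `matmul`, and the parameters `rank`, `alpha`, and `seed` are hypothetical names chosen for illustration. The sketch keeps a pretrained weight matrix frozen and adds a trainable update of rank at most `rank`, scaled by `alpha / rank`.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Hypothetical illustration: frozen W (out x in) plus update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight  # frozen pretrained weight, never updated
        self.scale = alpha / rank
        # A starts with small random values and B with zeros,
        # so the adapted layer initially matches the frozen layer.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def effective_weight(self):
        """Return W + scale * (B @ A); the added term has rank at most `rank`."""
        delta = matmul(self.B, self.A)
        return [[w + self.scale * d for w, d in zip(w_row, d_row)]
                for w_row, d_row in zip(self.weight, delta)]

    def forward(self, x):
        """Apply the adapted layer to a single input vector x."""
        column = [[v] for v in x]
        return [row[0] for row in matmul(self.effective_weight(), column)]

    def num_trainable(self):
        """Count adapter parameters (entries of A and B only)."""
        return sum(len(r) for r in self.A) + sum(len(r) for r in self.B)


# A 4x4 identity matrix stands in for a pretrained weight (an assumption for the demo).
W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
layer = LoRALinear(W, rank=1)
x = [1.0, 2.0, 3.0, 4.0]
y = layer.forward(x)  # same as applying W alone while B is still zero
```

In this toy setting the adapter trains 8 values instead of the 16 entries of `W`; the saving grows with layer size, which is the usual motivation for low-rank adapters.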

