Still-Moving: Customized Video Generation without Customized Video Data

Abstract

Customizing text-to-image (T2I) models has seen tremendous progress recently, particularly in areas such as personalization, stylization, and conditional generation. However, expanding this progress to video generation is still in its infancy, primarily due to the lack of customized video data. In this work, we introduce Still-Moving, a novel generic framework for customizing a text-to-video (T2V) model, without requiring any customized video data. The framework applies to the prominent T2V design where the video model is built over a text-to-image (T2I) model (e.g., via inflation). We assume access to a customized version of the T2I model, trained only on still image data (e.g., using DreamBooth or StyleDrop). Naively plugging in the weights of the customized T2I model into the T2V model often leads to significant artifacts or insufficient adherence to the customization data. To overcome this issue, we train lightweight Spatial Adapters that adjust the features produced by the injected T2I layers. Importantly, our adapters are trained on "frozen videos" (i.e., repeated images), constructed from image samples generated by the customized T2I model. This training is facilitated by a novel Motion Adapter module, which allows us to train on such static videos while preserving the motion prior of the video model. At test time, we remove the Motion Adapter modules and leave in only the trained Spatial Adapters. This restores the motion prior of the T2V model while adhering to the spatial prior of the customized T2I model. We demonstrate the effectiveness of our approach on diverse tasks including personalized, stylized, and conditional generation. In all evaluated scenarios, our method seamlessly integrates the spatial prior of the customized T2I model with a motion prior supplied by the T2V model.
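
The "frozen video" construction and the train/test handling of the adapters described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `make_frozen_video`, `max_temporal_difference`, and `AdapterConfig` are hypothetical, images are modeled as plain nested lists rather than tensors, and the configuration object only records which adapter groups are present, since the abstract does not specify the adapter architecture.

```python
from dataclasses import dataclass
from typing import List

Frame = List[List[float]]  # one grayscale image as rows of pixel values


def make_frozen_video(image: Frame, num_frames: int) -> List[Frame]:
    """Repeat one still image num_frames times to form a motionless clip."""
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    # Copy each row so the frames do not alias the same underlying lists.
    return [[row[:] for row in image] for _ in range(num_frames)]


def max_temporal_difference(video: List[Frame]) -> float:
    """Largest absolute pixel change between consecutive frames."""
    diffs = [
        abs(a - b)
        for prev, cur in zip(video, video[1:])
        for row_prev, row_cur in zip(prev, cur)
        for a, b in zip(row_prev, row_cur)
    ]
    return max(diffs, default=0.0)


@dataclass
class AdapterConfig:
    """Hypothetical record of which adapter groups are present in the T2V model."""
    spatial_adapters: bool
    motion_adapters: bool


# Training on frozen videos: both adapter groups are present (assumed layout).
TRAIN_CONFIG = AdapterConfig(spatial_adapters=True, motion_adapters=True)
# Test time: Motion Adapters are removed, Spatial Adapters are kept.
TEST_CONFIG = AdapterConfig(spatial_adapters=True, motion_adapters=False)

# A 2x2 stand-in for an image sampled from the customized T2I model.
image = [[0.1, 0.2], [0.3, 0.4]]
video = make_frozen_video(image, num_frames=8)
print(len(video), max_temporal_difference(video))
```

Because every frame is identical, the clip carries no motion signal at all. This is why the abstract introduces the Motion Adapter: it lets training proceed on such static clips without eroding the motion prior of the video model, and it is removed at test time.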
