CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities
August 23, 2024
Authors: Tao Wu, Yong Zhang, Xintao Wang, Xianpan Zhou, Guangcong Zheng, Zhongang Qi, Ying Shan, Xi Li
cs.AI
Abstract
Customized video generation aims to generate high-quality videos guided by
text prompts and a subject's reference images. However, since subject learning
is performed only on static images, its fine-tuning process disrupts the
abilities of video diffusion models (VDMs) to combine concepts and to generate
motion. To restore these abilities, some methods use additional videos similar
to the prompt to fine-tune or guide the model. This requires frequently
changing the guiding videos, and even re-tuning the model, when generating
different motions, which is very inconvenient for users. In this paper, we
propose CustomCrafter, a novel framework that preserves the model's motion
generation and concept combination abilities without requiring additional
videos or fine-tuning for recovery. To preserve the concept combination
ability, we design a plug-and-play module that updates a small number of
parameters in VDMs, enhancing the model's ability to capture appearance
details and to combine concepts for new subjects. For motion generation, we
observe that VDMs tend to recover the motion of a video in the early stage of
denoising, while focusing on recovering subject details in the later stage.
Therefore, we propose a Dynamic Weighted Video Sampling Strategy. Exploiting
the pluggability of our subject-learning module, we reduce its impact on
motion generation in the early stage of denoising, preserving the VDMs'
ability to generate motion. In the later stage of denoising, we restore the
module to recover the appearance details of the specified subject, thereby
ensuring fidelity to the subject's appearance. Experimental results show that
our method achieves significant improvements over previous methods.
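To make the "plug-and-play module that updates a small number of parameters" concrete, here is a minimal sketch of one plausible realization: a LoRA-style low-rank residual wrapped around a frozen linear layer of the VDM, so that only the low-rank factors are trained and the module's contribution can be rescaled or detached at inference time. This is an illustrative reconstruction under our own assumptions, not the authors' released code; the class name `PluggableLoRALinear`, the rank, and the `scale` attribute are all hypothetical.

```python
import torch
import torch.nn as nn

class PluggableLoRALinear(nn.Module):
    """Hypothetical pluggable subject-learning module: a low-rank
    residual added to a frozen pretrained linear layer."""

    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # keep the VDM weights frozen
            p.requires_grad_(False)
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)        # module starts as a no-op
        self.scale = 1.0                      # runtime weight of the module

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base path is untouched; the low-rank path carries the new subject.
        return self.base(x) + self.scale * self.up(self.down(x))
```

Because the residual is additive and zero-initialized, setting `scale = 0.0` recovers the original VDM exactly, which is what makes the module "pluggable" during sampling.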
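Building on that, the Dynamic Weighted Video Sampling Strategy can be sketched as a denoising loop that keeps the subject module's weight low in the early (motion-forming) steps and restores it in the later (detail-forming) steps. Again a hedged sketch, not the paper's implementation: `unet` is assumed to be a callable returning the noise prediction, `scheduler` is assumed to follow a diffusers-style interface, and `switch_ratio`, `early_scale`, and `late_scale` are hypothetical knobs.

```python
def dynamic_weighted_sampling(unet, scheduler, latents, text_emb,
                              num_steps=50, switch_ratio=0.6,
                              early_scale=0.0, late_scale=1.0):
    """Denoise `latents` while varying the weight of the pluggable
    subject-learning modules across timesteps: a low weight early on
    preserves the VDM's motion generation; restoring the weight later
    recovers the customized subject's appearance details."""
    scheduler.set_timesteps(num_steps)
    switch_step = int(num_steps * switch_ratio)
    for i, t in enumerate(scheduler.timesteps):
        scale = early_scale if i < switch_step else late_scale
        for m in unet.modules():              # rescale every subject module
            if isinstance(m, PluggableLoRALinear):
                m.scale = scale
        with torch.no_grad():
            noise_pred = unet(latents, t, encoder_hidden_states=text_emb)
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```

The single `switch_ratio` threshold is the simplest possible weighting schedule; a smooth ramp between `early_scale` and `late_scale` would be an equally valid reading of "dynamic weighting."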