
CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities

August 23, 2024
作者: Tao Wu, Yong Zhang, Xintao Wang, Xianpan Zhou, Guangcong Zheng, Zhongang Qi, Ying Shan, Xi Li
cs.AI

Abstract

Customized video generation aims to generate high-quality videos guided by text prompts and reference images of a subject. However, because subject learning is fine-tuned only on static images, the fine-tuning process disrupts the abilities of video diffusion models (VDMs) to combine concepts and generate motion. To restore these abilities, some methods use additional videos similar to the prompt to fine-tune or guide the model. This requires frequent changes of guiding videos, and even re-tuning of the model when generating different motions, which is very inconvenient for users. In this paper, we propose CustomCrafter, a novel framework that preserves the model's motion-generation and concept-composition abilities without requiring additional videos or fine-tuning for recovery. To preserve the concept-composition ability, we design a plug-and-play module that updates only a small number of parameters in the VDM, enhancing the model's ability to capture appearance details and to compose concepts for new subjects. For motion generation, we observe that VDMs tend to restore the motion of a video in the early stages of denoising and focus on recovering subject details in the later stages. We therefore propose a Dynamic Weighted Video Sampling Strategy: exploiting the pluggability of our subject-learning module, we reduce its influence during the early, motion-forming stages of denoising, preserving the VDM's ability to generate motion; in the later stages, we restore the module so that it repairs the appearance details of the specified subject, ensuring the fidelity of the subject's appearance. Experimental results show that our method achieves significant improvements over previous methods.
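The abstract describes two mechanisms: a plug-and-play subject-learning module that updates few parameters, and a sampling schedule that disables that module early in denoising and re-enables it late. The sketch below illustrates one plausible reading of this in Python. It is an assumption, not the paper's released code: the LoRA-style low-rank adapter is a guess at the module's form (the abstract does not specify it), and the names `SubjectAdapter`, `dynamic_weighted_sampling`, `switch_ratio`, and the `vdm`/`scheduler` interface (diffusers-style) are all hypothetical.

import torch
import torch.nn as nn

class SubjectAdapter(nn.Module):
    """Plug-and-play adapter wrapped around a frozen linear layer.

    A LoRA-style low-rank update is assumed here for illustration;
    the paper's actual module design is not given in the abstract.
    """
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base                                # frozen pretrained projection
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)                  # adapter starts as a no-op
        self.scale = 1.0                                # 0.0 effectively unplugs it

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))

@torch.no_grad()
def dynamic_weighted_sampling(vdm, scheduler, latents, text_emb,
                              adapters, switch_ratio=0.4):
    """Zero the subject adapters during early (motion-forming) denoising
    steps, then restore them in later steps to recover subject appearance.
    `switch_ratio` is a hypothetical hyperparameter, not from the paper."""
    timesteps = scheduler.timesteps
    for i, t in enumerate(timesteps):
        scale = 0.0 if i < switch_ratio * len(timesteps) else 1.0
        for adapter in adapters:        # every SubjectAdapter injected into the VDM
            adapter.scale = scale
        noise_pred = vdm(latents, t, encoder_hidden_states=text_emb).sample
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents

Because the adapter's contribution is a simple additive residual, scaling it to zero exactly recovers the pretrained VDM's behavior in the early steps, which is what lets the base model's motion prior operate unimpeded before the subject module takes over.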
