Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models
May 23, 2023
Authors: Weifeng Chen, Jie Wu, Pan Xie, Hefeng Wu, Jiashi Li, Xin Xia, Xuefeng Xiao, Liang Lin
cs.AI
Abstract
This paper presents a controllable text-to-video (T2V) diffusion model, named
Video-ControlNet, that generates videos conditioned on a sequence of control
signals, such as edge or depth maps. Video-ControlNet is built on a pre-trained
conditional text-to-image (T2I) diffusion model by incorporating a
spatial-temporal self-attention mechanism and trainable temporal layers for
efficient cross-frame modeling. A first-frame conditioning strategy is proposed
to enable the model to generate videos transferred from the image domain as
well as arbitrary-length videos in an auto-regressive manner. Moreover,
Video-ControlNet employs a novel residual-based noise initialization strategy
to introduce a motion prior from an input video, producing more coherent videos.
With the proposed architecture and strategies, Video-ControlNet can achieve
resource-efficient convergence and generate consistent, high-quality
videos with fine-grained control. Extensive experiments demonstrate its success
in various video generation tasks such as video editing and video style
transfer, outperforming previous methods in terms of consistency and quality.
Project Page: https://controlavideo.github.io/
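
To make the residual-based noise initialization more concrete, below is a minimal sketch of one way such an initialization could look: a shared base noise per clip plus a scaled inter-frame residual of the input video, so the diffusion process starts from motion-aware noise. The blending weight `alpha`, the re-normalization step, and the function `residual_noise_init` are illustrative assumptions, not the exact formulation used in the paper.

```python
import torch

def residual_noise_init(video: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """video: (T, C, H, W) latents of the input video, roughly zero-mean.

    Returns per-frame initial noise of the same shape: a shared Gaussian base
    noise plus a scaled inter-frame residual encoding the input video's motion.
    (Sketch only; alpha and the normalization are assumptions.)
    """
    T, C, H, W = video.shape
    base = torch.randn(1, C, H, W).expand(T, -1, -1, -1)   # base noise shared across frames
    residual = torch.zeros_like(video)
    residual[1:] = video[1:] - video[:-1]                   # frame-to-frame motion residual
    noise = base + alpha * residual
    # Keep each frame's noise approximately unit-variance.
    noise = noise / noise.flatten(1).std(dim=1).view(T, 1, 1, 1).clamp_min(1e-6)
    return noise

# Example: 8 frames of 4-channel 64x64 latents.
init_noise = residual_noise_init(torch.randn(8, 4, 64, 64))
print(init_noise.shape)  # torch.Size([8, 4, 64, 64])
```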
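The first-frame conditioning strategy can likewise be pictured as an auto-regressive loop over fixed-length chunks, where the last frame generated for one chunk conditions the next. The sampler `generate_clip` and its signature are hypothetical stand-ins for the model's sampling routine, not the released Control-A-Video API.

```python
from typing import Callable, List, Optional
import torch

def generate_long_video(
    prompt: str,
    control_maps: torch.Tensor,                  # (T, C, H, W) control signals (e.g. depth/edge maps)
    generate_clip: Callable[..., torch.Tensor],  # hypothetical sampler returning (chunk_len, 3, H, W) frames
    chunk_len: int = 8,
) -> torch.Tensor:
    frames: List[torch.Tensor] = []
    first_frame: Optional[torch.Tensor] = None   # None: the model synthesizes the first frame itself
    for start in range(0, control_maps.shape[0], chunk_len):
        chunk = control_maps[start : start + chunk_len]
        clip = generate_clip(prompt=prompt, controls=chunk, first_frame=first_frame)
        frames.append(clip)
        first_frame = clip[-1]                   # last generated frame conditions the next chunk
    return torch.cat(frames, dim=0)

# Dummy sampler just to make the sketch runnable; a real one would run the diffusion model.
def dummy_clip(prompt, controls, first_frame=None):
    return torch.zeros(controls.shape[0], 3, controls.shape[2], controls.shape[3])

video = generate_long_video("a red car driving", torch.zeros(24, 1, 64, 64), dummy_clip)
print(video.shape)  # torch.Size([24, 3, 64, 64])
```

This loop is what allows arbitrary-length generation: each chunk is sampled under the same text prompt and control maps, while the carried-over first frame keeps appearance consistent across chunk boundaries.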