Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models
May 23, 2023
Authors: Weifeng Chen, Jie Wu, Pan Xie, Hefeng Wu, Jiashi Li, Xin Xia, Xuefeng Xiao, Liang Lin
cs.AI
Abstract
This paper presents a controllable text-to-video (T2V) diffusion model, named
Video-ControlNet, that generates videos conditioned on a sequence of control
signals, such as edge or depth maps. Video-ControlNet is built on a pre-trained
conditional text-to-image (T2I) diffusion model by incorporating a
spatial-temporal self-attention mechanism and trainable temporal layers for
efficient cross-frame modeling. A first-frame conditioning strategy is proposed
to enable the model to generate videos transferred from the image domain as
well as arbitrary-length videos in an auto-regressive manner. Moreover,
Video-ControlNet employs a novel residual-based noise initialization strategy
to introduce a motion prior from an input video, producing more coherent videos.
With the proposed architecture and strategies, Video-ControlNet can achieve
resource-efficient convergence and generate consistent, high-quality
videos with fine-grained control. Extensive experiments demonstrate its success
in various video generation tasks such as video editing and video style
transfer, outperforming previous methods in terms of consistency and quality.
Project Page: https://controlavideo.github.io/
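
To make the residual-based noise initialization more concrete, below is a minimal sketch of one way such an initialization could look: a shared base noise per clip plus a scaled inter-frame residual of the input video, so the diffusion process starts from motion-aware noise. The blending weight `alpha`, the re-normalization step, and the function `residual_noise_init` are illustrative assumptions, not the exact formulation used in the paper.

```python
import torch

def residual_noise_init(video: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """video: (T, C, H, W) latents of the input video, roughly zero-mean.

    Returns per-frame initial noise of the same shape: a shared Gaussian base
    noise plus a scaled inter-frame residual encoding the input video's motion.
    (Sketch only; alpha and the normalization are assumptions.)
    """
    T, C, H, W = video.shape
    base = torch.randn(1, C, H, W).expand(T, -1, -1, -1)   # base noise shared across frames
    residual = torch.zeros_like(video)
    residual[1:] = video[1:] - video[:-1]                   # frame-to-frame motion residual
    noise = base + alpha * residual
    # Keep each frame's noise approximately unit-variance.
    noise = noise / noise.flatten(1).std(dim=1).view(T, 1, 1, 1).clamp_min(1e-6)
    return noise

# Example: 8 frames of 4-channel 64x64 latents.
init_noise = residual_noise_init(torch.randn(8, 4, 64, 64))
print(init_noise.shape)  # torch.Size([8, 4, 64, 64])
```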
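The first-frame conditioning strategy can likewise be pictured as an auto-regressive loop over fixed-length chunks, where the last frame generated for one chunk conditions the next. The sampler `generate_clip` and its signature are hypothetical stand-ins for the model's sampling routine, not the released Control-A-Video API.

```python
from typing import Callable, List, Optional
import torch

def generate_long_video(
    prompt: str,
    control_maps: torch.Tensor,                  # (T, C, H, W) control signals (e.g. depth/edge maps)
    generate_clip: Callable[..., torch.Tensor],  # hypothetical sampler returning (chunk_len, 3, H, W) frames
    chunk_len: int = 8,
) -> torch.Tensor:
    frames: List[torch.Tensor] = []
    first_frame: Optional[torch.Tensor] = None   # None: the model synthesizes the first frame itself
    for start in range(0, control_maps.shape[0], chunk_len):
        chunk = control_maps[start : start + chunk_len]
        clip = generate_clip(prompt=prompt, controls=chunk, first_frame=first_frame)
        frames.append(clip)
        first_frame = clip[-1]                   # last generated frame conditions the next chunk
    return torch.cat(frames, dim=0)

# Dummy sampler just to make the sketch runnable; a real one would run the diffusion model.
def dummy_clip(prompt, controls, first_frame=None):
    return torch.zeros(controls.shape[0], 3, controls.shape[2], controls.shape[3])

video = generate_long_video("a red car driving", torch.zeros(24, 1, 64, 64), dummy_clip)
print(video.shape)  # torch.Size([24, 3, 64, 64])
```

This loop is what allows arbitrary-length generation: each chunk is sampled under the same text prompt and control maps, while the carried-over first frame keeps appearance consistent across chunk boundaries.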