

Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models

May 23, 2023
Authors: Weifeng Chen, Jie Wu, Pan Xie, Hefeng Wu, Jiashi Li, Xin Xia, Xuefeng Xiao, Liang Lin
cs.AI

Abstract

This paper presents a controllable text-to-video (T2V) diffusion model, named Video-ControlNet, that generates videos conditioned on a sequence of control signals, such as edge or depth maps. Video-ControlNet is built on a pre-trained conditional text-to-image (T2I) diffusion model by incorporating a spatial-temporal self-attention mechanism and trainable temporal layers for efficient cross-frame modeling. A first-frame conditioning strategy is proposed to enable the model to generate videos transferred from the image domain as well as arbitrary-length videos in an auto-regressive manner. Moreover, Video-ControlNet employs a novel residual-based noise initialization strategy to introduce a motion prior from an input video, producing more coherent videos. With the proposed architecture and strategies, Video-ControlNet achieves resource-efficient convergence and generates superior-quality, consistent videos with fine-grained control. Extensive experiments demonstrate its success in various video generation tasks such as video editing and video style transfer, outperforming previous methods in terms of consistency and quality. Project Page: https://controlavideo.github.io/
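The abstract does not spell out how the residual-based noise initialization works; the following is a minimal PyTorch sketch of one plausible reading, in which frame-to-frame residuals of the input video perturb a shared base noise so the initial latents carry a motion prior. The function name, the `alpha` weight, and the renormalization step are assumptions for illustration, not the authors' implementation.

```python
import torch

def residual_noise_init(input_frames, alpha=0.5, generator=None):
    """Hypothetical residual-based noise initialization.

    input_frames: (T, C, H, W) tensor of (latent) video frames.
    alpha: assumed weight balancing shared noise vs. residual motion.
    """
    T, C, H, W = input_frames.shape

    # One noise map shared across all frames keeps the video temporally linked.
    base_noise = torch.randn((1, C, H, W), generator=generator)

    # Frame-wise residuals encode the motion between consecutive frames;
    # the first frame has no predecessor, so its residual stays zero.
    residuals = torch.zeros_like(input_frames)
    residuals[1:] = input_frames[1:] - input_frames[:-1]

    # Mix the shared noise with per-frame residuals, then renormalize each
    # frame to roughly unit variance so it still matches the diffusion prior.
    noise = base_noise.expand(T, -1, -1, -1) + alpha * residuals
    noise = (noise - noise.mean(dim=(1, 2, 3), keepdim=True)) / (
        noise.std(dim=(1, 2, 3), keepdim=True) + 1e-6
    )
    return noise
```

The design intuition, as the abstract suggests, is that sharing noise across frames while injecting input-video residuals biases sampling toward the source motion, which should yield more coherent generated videos than independently sampled per-frame noise.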