Enabling Versatile Controls for Video Diffusion Models

Abstract

Despite substantial progress in text-to-video generation, precise and flexible control over fine-grained spatiotemporal attributes remains a significant open challenge in video generation research. To address this challenge, we introduce VCtrl (also termed PP-VCtrl), a novel framework that enables fine-grained control over pre-trained video diffusion models in a unified manner. VCtrl integrates diverse user-specified control signals, such as Canny edges, segmentation masks, and human keypoints, into pre-trained video diffusion models via a generalizable conditional module that uniformly encodes multiple types of auxiliary signals without modifying the underlying generator. Additionally, we design a unified control signal encoding pipeline and a sparse residual connection mechanism to incorporate control representations efficiently. Comprehensive experiments and human evaluations demonstrate that VCtrl improves both controllability and generation quality. The source code and pre-trained models, implemented with the PaddlePaddle framework, are publicly available at http://github.com/PaddlePaddle/PaddleMIX/tree/develop/ppdiffusers/examples/ppvctrl.
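
As an illustration only, the idea of adding control residuals to a frozen generator at sparse intervals can be sketched in plain Python. This is not the VCtrl implementation: the function names, the toy affine "blocks", the injection stride, and the scaling factor are all hypothetical stand-ins, and the abstract does not specify where or how often residuals are injected.

```python
from typing import List, Tuple

Vector = List[float]


def frozen_block(x: Vector) -> Vector:
    """Hypothetical stand-in for one frozen generator block (a fixed affine map)."""
    return [0.5 * v + 1.0 for v in x]


def control_branch(signal: Vector, scale: float) -> Vector:
    """Hypothetical stand-in for the trainable control module.

    It maps an encoded control signal (e.g. edges, masks, or keypoints)
    to a residual with the same shape as the generator's hidden state.
    """
    return [scale * v for v in signal]


def forward_with_sparse_residuals(
    x: Vector,
    control: Vector,
    num_blocks: int = 6,
    stride: int = 2,
    scale: float = 0.1,
) -> Tuple[Vector, List[int]]:
    """Run the frozen blocks, adding a control residual before every `stride`-th block.

    The stride-based schedule is an assumption made for this sketch.
    Returns the final hidden state and the indices of the blocks that
    received a residual.
    """
    injected: List[int] = []
    for i in range(num_blocks):
        if i % stride == 0:
            residual = control_branch(control, scale)
            x = [a + b for a, b in zip(x, residual)]
            injected.append(i)
        x = frozen_block(x)  # the generator's own computation is never modified
    return x, injected


if __name__ == "__main__":
    hidden = [0.0, 1.0]
    edges = [1.0, -1.0]  # toy control signal
    out, where = forward_with_sparse_residuals(hidden, edges)
    print(where, out)
```

In this toy setup, a zero control signal leaves the output identical to the unmodified generator, which mirrors the claim that the underlying generator is left untouched.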