
Enabling Versatile Controls for Video Diffusion Models

March 21, 2025
Authors: Xu Zhang, Hao Zhou, Haoming Qin, Xiaobin Lu, Jiaxing Yan, Guanzhong Wang, Zeyu Chen, Yi Liu
cs.AI

Abstract

Despite substantial progress in text-to-video generation, precise and flexible control over fine-grained spatiotemporal attributes remains a significant unresolved challenge in video generation research. To address this limitation, we introduce VCtrl (also termed PP-VCtrl), a novel framework that enables fine-grained control over pre-trained video diffusion models in a unified manner. VCtrl integrates diverse user-specified control signals, such as Canny edges, segmentation masks, and human keypoints, into pre-trained video diffusion models via a generalizable conditional module that uniformly encodes multiple types of auxiliary signals without modifying the underlying generator. Additionally, we design a unified control signal encoding pipeline and a sparse residual connection mechanism to incorporate control representations efficiently. Comprehensive experiments and human evaluations demonstrate that VCtrl effectively enhances controllability and generation quality. The source code and pre-trained models, implemented with the PaddlePaddle framework, are publicly available at http://github.com/PaddlePaddle/PaddleMIX/tree/develop/ppdiffusers/examples/ppvctrl.
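
The abstract names a sparse residual connection mechanism but does not spell out its wiring. The sketch below is one plausible reading, not the authors' implementation: a frozen stack of base blocks runs unchanged, and a control branch adds a residual to the hidden state before only every `every`-th block. The function name `sparse_residual_forward`, the `every` parameter, and the list-of-floats "features" are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Block = Callable[[Vector], Vector]


def sparse_residual_forward(
    x: Vector,
    base_blocks: Sequence[Block],
    control_blocks: Sequence[Block],
    control: Vector,
    every: int = 2,
) -> Vector:
    """Run the base blocks in order, injecting a control residual sparsely.

    Hypothetical wiring: before base block i, where i % every == 0, the next
    control block encodes `control` and its output is added element-wise to
    the hidden state. The base blocks themselves are never modified.
    """
    ctrl_idx = 0
    for i, block in enumerate(base_blocks):
        if i % every == 0 and ctrl_idx < len(control_blocks):
            residual = control_blocks[ctrl_idx](control)
            x = [a + r for a, r in zip(x, residual)]
            ctrl_idx += 1
        x = block(x)
    return x


# Toy stand-ins: each "base block" doubles its input; the control encoder
# halves the control signal. Real blocks would be network layers.
def double(v: Vector) -> Vector:
    return [2.0 * a for a in v]


def half(c: Vector) -> Vector:
    return [0.5 * a for a in c]


if __name__ == "__main__":
    out = sparse_residual_forward(
        [1.0, 2.0], [double] * 4, [half] * 2, control=[2.0, 4.0], every=2
    )
    print(out)
```

With a control branch that outputs zeros, the function returns exactly what the base blocks alone would produce, which is the sense in which such a branch leaves the underlying generator untouched.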

