Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model
April 15, 2024
Authors: Han Lin, Jaemin Cho, Abhay Zala, Mohit Bansal
cs.AI
Abstract
ControlNets are widely used for adding spatial control in image generation
with different conditions, such as depth maps, Canny edges, and human poses.
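As background, a typical image-ControlNet setup with the diffusers library looks like the sketch below; the model IDs, prompt, and file name are illustrative, not tied to this paper.

```python
# Hedged sketch of standard image ControlNet conditioning via diffusers.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

canny_image = load_image("canny_edges.png")  # precomputed Canny edge map
image = pipe("a red sports car on a mountain road", image=canny_image).images[0]
```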
However, there are several challenges when leveraging the pretrained image
ControlNets for controlled video generation. First, pretrained ControlNets
cannot be directly plugged into new backbone models because their feature
spaces do not match, and training ControlNets from scratch for new backbones is
computationally expensive. Second, ControlNet features computed for different
frames may not effectively maintain temporal consistency. To address these
challenges, we introduce
Ctrl-Adapter, an efficient and versatile framework that adds diverse controls
to any image/video diffusion model by adapting pretrained ControlNets (and
improving temporal alignment for videos). Ctrl-Adapter provides diverse
capabilities including image control, video control, video control with sparse
frames, multi-condition control, compatibility with different backbones,
adaptation to unseen control conditions, and video editing. In Ctrl-Adapter, we
train adapter layers that fuse pretrained ControlNet features into different
image/video diffusion models, while keeping the parameters of both the
ControlNets and the diffusion models frozen. Ctrl-Adapter consists of spatial
and temporal modules so that it can effectively handle the temporal consistency
of videos.
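To make the adapter idea concrete, here is a minimal PyTorch sketch of one adapter block with a per-frame spatial module and a temporal attention module, assuming ControlNet and backbone features share spatial resolution; the class and module names (CtrlAdapterBlock, spatial, temporal_attn) are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class CtrlAdapterBlock(nn.Module):
    """Illustrative adapter: maps a frozen ControlNet feature map into a
    target backbone's feature space. The spatial module acts per frame;
    the temporal module mixes frames at each spatial location."""

    def __init__(self, in_channels: int, out_channels: int, num_heads: int = 8):
        super().__init__()
        # Per-frame spatial projection into the backbone's channel width.
        self.spatial = nn.Sequential(
            nn.GroupNorm(32, in_channels),  # assumes in_channels % 32 == 0
            nn.SiLU(),
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
        )
        # Temporal self-attention over the frame axis.
        self.temporal_norm = nn.LayerNorm(out_channels)
        self.temporal_attn = nn.MultiheadAttention(
            out_channels, num_heads, batch_first=True
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (batch, frames, channels, height, width) from a frozen ControlNet.
        b, f, c, h, w = feat.shape
        x = self.spatial(feat.reshape(b * f, c, h, w))
        c2 = x.shape[1]
        # Fold spatial positions into the batch so attention runs over frames.
        x = x.reshape(b, f, c2, h, w).permute(0, 3, 4, 1, 2).reshape(b * h * w, f, c2)
        q = self.temporal_norm(x)
        y, _ = self.temporal_attn(q, q, q)
        x = x + y  # residual connection keeps training stable
        return x.reshape(b, h, w, f, c2).permute(0, 3, 4, 1, 2)


# Only the adapter is trained; ControlNet and diffusion backbone stay frozen.
adapter = CtrlAdapterBlock(in_channels=320, out_channels=320)
feats = torch.randn(2, 8, 320, 32, 32)  # 2 clips, 8 frames, UNet-sized features
fused = adapter(feats)                  # (2, 8, 320, 32, 32), added to UNet blocks
```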
We also propose latent skipping and inverse timestep sampling for robust
adaptation and sparse control. Moreover, Ctrl-Adapter enables control from
multiple conditions by simply taking the (weighted) average of ControlNet
outputs (see the sketch below). With diverse image/video diffusion backbones
(SDXL, Hotshot-XL,
I2VGen-XL, and SVD), Ctrl-Adapter matches ControlNet for image control and
outperforms all baselines for video control (achieving state-of-the-art
accuracy on the DAVIS 2017 dataset) with significantly lower computational cost
(less than 10 GPU hours).
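The multi-condition control described above amounts to a weighted average of per-condition ControlNet features. Below is a minimal sketch, simplified to treat each ControlNet output as a single tensor (in practice ControlNets emit multi-scale residuals); the function name and example weights are illustrative, not from the paper's released code.

```python
from typing import Optional, Sequence

import torch


def fuse_controlnet_outputs(
    outputs: Sequence[torch.Tensor],
    weights: Optional[Sequence[float]] = None,
) -> torch.Tensor:
    """Fuse per-condition ControlNet feature maps (e.g., depth and human
    pose) into one control signal via a (weighted) average; with
    weights=None this reduces to a plain mean."""
    if weights is None:
        weights = [1.0] * len(outputs)
    total = sum(weights)
    # Normalize so the fused features keep the scale of a single ControlNet.
    return sum((w / total) * feat for w, feat in zip(weights, outputs))


# Example: combine depth and pose features, weighting depth more heavily.
depth_feat = torch.randn(1, 320, 32, 32)
pose_feat = torch.randn(1, 320, 32, 32)
fused = fuse_controlnet_outputs([depth_feat, pose_feat], weights=[0.7, 0.3])
```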