Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model
April 15, 2024
Authors: Han Lin, Jaemin Cho, Abhay Zala, Mohit Bansal
cs.AI
Abstract
ControlNets are widely used to add spatial control to image generation under different conditions, such as depth maps, Canny edges, and human poses. However, leveraging pretrained image ControlNets for controlled video generation poses several challenges. First, a pretrained ControlNet cannot be directly plugged into a new backbone model because their feature spaces do not match, and training ControlNets from scratch for each new backbone is computationally expensive. Second, ControlNet features computed independently for different frames may fail to maintain temporal consistency. To address these challenges, we introduce Ctrl-Adapter, an efficient and versatile framework that adds diverse controls to any image/video diffusion model by adapting pretrained ControlNets (and improving temporal alignment for videos). Ctrl-Adapter supports a wide range of capabilities, including image control, video control, video control with sparse frames, multi-condition control, compatibility with different backbones, adaptation to unseen control conditions, and video editing. In Ctrl-Adapter, we train adapter layers that fuse pretrained ControlNet features into different image/video diffusion models, while keeping the parameters of both the ControlNets and the diffusion models frozen. Ctrl-Adapter consists of temporal and spatial modules so that it can effectively handle the temporal consistency of videos. We also propose latent skipping and inverse timestep sampling for robust adaptation and sparse control. Moreover, Ctrl-Adapter enables control from multiple conditions by simply taking the (weighted) average of the ControlNet outputs. Across diverse image/video diffusion backbones (SDXL, Hotshot-XL, I2VGen-XL, and SVD), Ctrl-Adapter matches ControlNet on image control and outperforms all baselines on video control (achieving state-of-the-art accuracy on the DAVIS 2017 dataset) at significantly lower computational cost (less than 10 GPU hours).
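
To make the adapter idea concrete, here is a minimal PyTorch sketch of the mechanism the abstract describes: small trainable adapter blocks with spatial and temporal components map features from a frozen ControlNet toward the feature space of a frozen diffusion backbone. All module names (SpatialAdapter, TemporalAdapter, CtrlAdapterBlock) and the specific layer choices are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch: trainable spatial + temporal adapter blocks that
# transform frozen-ControlNet features before they are added to the
# (also frozen) diffusion backbone's hidden states.
import torch
import torch.nn as nn


class SpatialAdapter(nn.Module):
    """Per-frame spatial mixing of ControlNet features (assumed design)."""

    def __init__(self, channels: int):
        super().__init__()
        self.norm = nn.GroupNorm(8, channels)
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch * frames, channels, height, width)
        return x + self.conv(self.norm(x))


class TemporalAdapter(nn.Module):
    """Mixing along the frame axis to improve temporal consistency."""

    def __init__(self, channels: int, num_frames: int):
        super().__init__()
        self.num_frames = num_frames
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        bf, c, h, w = x.shape
        b = bf // self.num_frames
        # (b*f, c, h, w) -> (b*h*w, c, f) so Conv1d mixes across frames.
        t = x.view(b, self.num_frames, c, h, w).permute(0, 3, 4, 2, 1)
        t = t.reshape(b * h * w, c, self.num_frames)
        t = self.conv(t)
        t = t.reshape(b, h, w, c, self.num_frames).permute(0, 4, 3, 1, 2)
        return x + t.reshape(bf, c, h, w)


class CtrlAdapterBlock(nn.Module):
    """Trainable block mapping one ControlNet feature map into the
    feature space of the frozen target diffusion backbone."""

    def __init__(self, channels: int, num_frames: int):
        super().__init__()
        self.spatial = SpatialAdapter(channels)
        self.temporal = TemporalAdapter(channels, num_frames)

    def forward(self, controlnet_feat: torch.Tensor) -> torch.Tensor:
        return self.temporal(self.spatial(controlnet_feat))


if __name__ == "__main__":
    # Toy shapes: batch=2, frames=8, channels=64, 16x16 latents.
    adapter = CtrlAdapterBlock(channels=64, num_frames=8)
    feat = torch.randn(2 * 8, 64, 16, 16)  # frozen-ControlNet feature map
    fused = adapter(feat)  # would be added to the backbone's hidden states
    print(fused.shape)  # torch.Size([16, 64, 16, 16])
```

Only the adapter parameters would receive gradients during training; the ControlNet and diffusion backbone stay frozen, which is what keeps the adaptation cost low.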
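The multi-condition mechanism the abstract mentions is simply a (weighted) average of per-condition ControlNet outputs. A sketch of that operation follows; `fuse_conditions` and its arguments are hypothetical placeholders, not the paper's API.

```python
# Hedged sketch of multi-condition control: fuse the feature maps produced
# by several ControlNets (e.g., depth, Canny edges, human pose) via a
# weighted average. All shapes are assumed identical across conditions.
from typing import Optional, Sequence

import torch


def fuse_conditions(
    controlnet_outputs: Sequence[torch.Tensor],
    weights: Optional[Sequence[float]] = None,
) -> torch.Tensor:
    """Weighted average of ControlNet feature maps from multiple conditions."""
    if weights is None:
        # Default to a plain (unweighted) average.
        weights = [1.0 / len(controlnet_outputs)] * len(controlnet_outputs)
    assert len(weights) == len(controlnet_outputs)
    stacked = torch.stack(list(controlnet_outputs))       # (n, ...)
    w = torch.tensor(weights, dtype=stacked.dtype)
    w = w.view(-1, *([1] * (stacked.dim() - 1)))          # broadcastable
    return (stacked * w).sum(dim=0)


# Example: equal-weight fusion of depth and edge ControlNet features.
depth_feat = torch.randn(1, 64, 16, 16)
edge_feat = torch.randn(1, 64, 16, 16)
fused = fuse_conditions([depth_feat, edge_feat], weights=[0.5, 0.5])
print(fused.shape)  # torch.Size([1, 64, 16, 16])
```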