Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model

Abstract

ControlNets are widely used to add spatial control to image generation through conditions such as depth maps, Canny edges, and human poses. However, leveraging pretrained image ControlNets for controlled video generation raises several challenges. First, a pretrained ControlNet cannot be plugged directly into a new backbone model because the feature spaces do not match, and training ControlNets for each new backbone is a heavy burden. Second, ControlNet features for different frames might not handle temporal consistency effectively. To address these challenges, we introduce Ctrl-Adapter, an efficient and versatile framework that adds diverse controls to any image/video diffusion model by adapting pretrained ControlNets (and improving temporal alignment for videos). Ctrl-Adapter provides diverse capabilities, including image control, video control, video control with sparse frames, multi-condition control, compatibility with different backbones, adaptation to unseen control conditions, and video editing. In Ctrl-Adapter, we train adapter layers that fuse pretrained ControlNet features into different image/video diffusion models, while keeping the parameters of the ControlNets and the diffusion models frozen. Ctrl-Adapter consists of temporal and spatial modules so that it can effectively handle the temporal consistency of videos. We also propose latent skipping and inverse timestep sampling for robust adaptation and sparse control. Moreover, Ctrl-Adapter enables control from multiple conditions by simply taking the (weighted) average of ControlNet outputs. With diverse image/video diffusion backbones (SDXL, Hotshot-XL, I2VGen-XL, and SVD), Ctrl-Adapter matches ControlNet for image control and outperforms all baselines for video control (achieving state-of-the-art accuracy on the DAVIS 2017 dataset) at significantly lower computational cost (fewer than 10 GPU hours).
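The multi-condition control mentioned in the abstract, a (weighted) average of ControlNet outputs, can be illustrated with a minimal sketch. The helper name `fuse_control_features`, the toy three-element feature vectors, and the choice to normalize by the sum of the weights are all assumptions made for illustration; the abstract does not say how the weights are chosen or how the average is implemented.

```python
from typing import List, Optional, Sequence


def fuse_control_features(
    features: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Weighted average of per-condition feature vectors (hypothetical helper)."""
    if not features:
        raise ValueError("need at least one condition")
    if weights is None:
        # No weights given: fall back to a plain (unweighted) average.
        weights = [1.0] * len(features)
    if len(weights) != len(features):
        raise ValueError("one weight per condition is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(features[0])
    if any(len(f) != dim for f in features):
        raise ValueError("all feature vectors must have the same length")
    # Normalizing by the weight sum is an assumption of this sketch.
    return [
        sum(w * f[i] for w, f in zip(weights, features)) / total
        for i in range(dim)
    ]


# Toy stand-ins for the outputs of two ControlNets (not real features).
depth = [0.0, 2.0, 4.0]
canny = [4.0, 2.0, 0.0]

fused_equal = fuse_control_features([depth, canny])
fused_weighted = fuse_control_features([depth, canny], weights=[3.0, 1.0])
```

With equal weights the two conditions contribute equally; raising the first weight pulls the fused features toward the depth condition. In a real model the same averaging would apply elementwise to feature tensors rather than flat lists.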
