CCM: Adding Conditional Controls to Text-to-Image Consistency Models

Abstract

Consistency Models (CMs) have shown promise in generating visual content efficiently and with high quality. However, how to add new conditional controls to pretrained CMs has not yet been explored. In this technical report, we consider alternative strategies for adding ControlNet-like conditional control to CMs and present three significant findings. 1) ControlNet trained for diffusion models (DMs) can be applied directly to CMs for high-level semantic controls but struggles with low-level detail and realism control. 2) CMs serve as an independent class of generative models, so ControlNet can be trained for them from scratch using the Consistency Training proposed by Song et al. 3) A lightweight adapter can be jointly optimized under multiple conditions through Consistency Training, allowing DM-based ControlNets to be transferred swiftly to CMs. We study these three solutions with text-to-image latent consistency models across various conditional controls, including edge, depth, human pose, low-resolution images, and masked images.
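Findings 2 and 3 both rely on Consistency Training (Song et al.). The following is a minimal NumPy sketch of that objective, not the report's implementation. It assumes the skip-connection parameterization f(x, t) = c_skip(t) x + c_out(t) F(x, t) with σ_data = 0.5, ε = 0.002 and T = 80 (values used in Song et al.'s Consistency Models paper), a squared-L2 distance with unit weighting as a simplification, and a hypothetical scalar `toy_backbone` standing in for the U-Net plus ControlNet. The `cond` argument is a hypothetical stand-in for a control signal such as a flattened edge map.

```python
import numpy as np

SIGMA_DATA = 0.5  # assumed data standard deviation
EPS = 0.002       # boundary time epsilon, where f(x, EPS) must equal x
T_MAX = 80.0      # largest noise level

def karras_times(n_steps, rho=7.0):
    """Discretize [EPS, T_MAX] into n_steps increasing time points (Karras-style schedule)."""
    i = np.arange(n_steps)
    lo, hi = EPS ** (1 / rho), T_MAX ** (1 / rho)
    return (lo + i / (n_steps - 1) * (hi - lo)) ** rho

def c_skip(t):
    return SIGMA_DATA ** 2 / ((t - EPS) ** 2 + SIGMA_DATA ** 2)

def c_out(t):
    return SIGMA_DATA * (t - EPS) / np.sqrt(SIGMA_DATA ** 2 + t ** 2)

def toy_backbone(params, x, t, cond):
    """Hypothetical network F; the control signal enters as an additive,
    ControlNet-like residual scaled by params["ctrl"]."""
    return np.tanh(params["w"] * x + params["b"] * np.log(t) + params["ctrl"] * cond)

def consistency_fn(params, x, t, cond):
    # c_skip(EPS) = 1 and c_out(EPS) = 0, so the boundary condition holds exactly.
    return c_skip(t) * x + c_out(t) * toy_backbone(params, x, t, cond)

def ct_loss(params, ema_params, x0, cond, times, rng):
    """One Monte-Carlo estimate of the consistency-training loss:
    compare the online model at t_{n+1} with the EMA (target) model at t_n,
    both evaluated on the same clean sample x0 perturbed by the same noise z."""
    n = rng.integers(0, len(times) - 1)  # n in [0, len(times) - 2]
    t_n, t_next = times[n], times[n + 1]
    z = rng.standard_normal(x0.shape)
    online = consistency_fn(params, x0 + t_next * z, t_next, cond)
    target = consistency_fn(ema_params, x0 + t_n * z, t_n, cond)  # treated as constant (no gradient) in training
    return float(np.mean((online - target) ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = {"w": 0.3, "b": 0.1, "ctrl": 0.5}
    ema_params = dict(params)             # EMA copy of the online parameters
    x0 = rng.standard_normal(16)          # toy "clean latent"
    cond = rng.standard_normal(16)        # toy control signal
    times = karras_times(18)
    print(ct_loss(params, ema_params, x0, cond, times, rng))
```

No pretrained diffusion model appears in this loss: the target comes only from the (EMA) consistency model itself, which is why, under this objective, a ControlNet or adapter can be trained on a CM without relying on a DM teacher. A real setup would differentiate the loss with respect to `params`, update `ema_params` as a moving average, and use a learned network in place of `toy_backbone`.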