Adaptive Guidance: Training-free Acceleration of Conditional Diffusion Models

Abstract

This paper presents a comprehensive study of the role of Classifier-Free Guidance (CFG) in text-conditioned diffusion models from the perspective of inference efficiency. In particular, we relax the default choice of applying CFG at every diffusion step and instead search for efficient guidance policies. We formulate the discovery of such policies within a differentiable Neural Architecture Search (NAS) framework. Our findings suggest that the denoising steps proposed by CFG become increasingly aligned with simple conditional steps, which renders the extra neural network evaluation of CFG redundant, especially in the second half of the denoising process. Building on this insight, we propose "Adaptive Guidance" (AG), an efficient variant of CFG that adaptively omits network evaluations once the denoising process shows convergence. Our experiments demonstrate that AG preserves CFG's image quality while reducing computation by 25%. AG thus constitutes a plug-and-play alternative to Guidance Distillation, achieving 50% of the latter's speed-up while remaining training-free and retaining the capacity to handle negative prompts. Finally, we uncover further redundancies of CFG in the first half of the diffusion process, showing that entire neural function evaluations can be replaced by simple affine transformations of past score estimates. This method, termed LinearAG, offers even cheaper inference at the cost of deviating from the baseline model. Our findings provide insights into the efficiency of the conditional denoising process and contribute to more practical and swift deployment of text-conditioned diffusion models.
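
The idea of dropping the unconditional network evaluation once guidance becomes redundant can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the abstract does not state the convergence criterion, so the sketch uses a cosine-similarity threshold between the conditional and unconditional predictions, and the names `denoise_cond`, `denoise_uncond`, `update`, `scale`, and `threshold` are hypothetical placeholders for a real model and sampler.

```python
import numpy as np


def cfg_combine(eps_cond, eps_uncond, scale):
    """Standard CFG: extrapolate from the unconditional toward the conditional prediction."""
    return eps_uncond + scale * (eps_cond - eps_uncond)


def cosine_similarity(a, b):
    """Cosine similarity between two arrays, flattened to vectors."""
    a, b = np.ravel(a), np.ravel(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def adaptive_guidance(x, timesteps, denoise_cond, denoise_uncond, update,
                      scale=7.5, threshold=0.99):
    """Sampling loop that stops paying for the unconditional pass once
    the two predictions align.

    Returns the final sample and the number of network evaluations used.
    The threshold test is an assumed stand-in for "displays convergence".
    """
    evaluations = 0
    guided = True
    for t in timesteps:
        eps_cond = denoise_cond(x, t)
        evaluations += 1
        if guided:
            eps_uncond = denoise_uncond(x, t)
            evaluations += 1
            eps = cfg_combine(eps_cond, eps_uncond, scale)
            # Assumed criterion: once both predictions point in nearly the
            # same direction, all later steps use the conditional one only.
            if cosine_similarity(eps_cond, eps_uncond) >= threshold:
                guided = False
        else:
            eps = eps_cond
        x = update(x, eps, t)
    return x, evaluations
```

In this sketch a fully guided run costs two evaluations per step, so every step taken after the switch saves one evaluation. How closely the saving matches the 25% reported in the abstract would depend on the model, the sampler, and the chosen threshold.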
