CFG-Zero*: Improved Classifier-Free Guidance for Flow Matching Models

Abstract

Classifier-Free Guidance (CFG) is a widely adopted technique for improving image fidelity and controllability in diffusion and flow models. In this work, we first analytically study the effect of CFG on flow matching models trained on Gaussian mixtures, a setting in which the ground-truth flow can be derived. We observe that in the early stages of training, when the flow estimate is still inaccurate, CFG directs samples toward incorrect trajectories. Building on this observation, we propose CFG-Zero*, an improved CFG with two contributions: (a) optimized scale, in which a scalar is optimized to correct for inaccuracies in the estimated velocity (hence the * in the name); and (b) zero-init, which zeroes out the first few steps of the ODE solver. Experiments on text-to-image generation (Lumina-Next, Stable Diffusion 3, and Flux) and text-to-video generation (Wan-2.1) show that CFG-Zero* consistently outperforms CFG, highlighting its effectiveness in guiding flow matching models. (Code is available at github.com/WeichenFan/CFG-Zero-star.)
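The two contributions can be illustrated with a minimal NumPy sketch. The abstract states only that a scalar is optimized and that the first few solver steps are zeroed; everything else here is an assumption made for illustration: the least-squares form of the scalar (the projection of the conditional velocity onto the unconditional one), the way the scalar enters the guidance formula, the Euler solver, and the names optimized_scale, guided_velocity, euler_sample, and zero_init_steps. It is not taken from the released code.

```python
import numpy as np


def optimized_scale(v_cond, v_uncond, eps=1e-8):
    """Per-sample scalar s minimizing ||v_cond - s * v_uncond||^2.

    Assumption: a least-squares fit is used as the "optimized scale".
    """
    batch = v_cond.shape[0]
    c = v_cond.reshape(batch, -1)
    u = v_uncond.reshape(batch, -1)
    dot = np.sum(c * u, axis=1)
    sq_norm = np.sum(u * u, axis=1) + eps  # eps avoids division by zero
    s = dot / sq_norm
    # Reshape so s broadcasts against the original velocity shape.
    return s.reshape((batch,) + (1,) * (v_cond.ndim - 1))


def guided_velocity(v_cond, v_uncond, w, step, zero_init_steps=1):
    """CFG with a rescaled unconditional branch and zeroed early steps."""
    if step < zero_init_steps:
        # zero-init: contribute no velocity during the first few steps.
        return np.zeros_like(v_cond)
    scaled_uncond = optimized_scale(v_cond, v_uncond) * v_uncond
    # Standard CFG extrapolation, applied to the rescaled unconditional term.
    return scaled_uncond + w * (v_cond - scaled_uncond)


def euler_sample(velocity_fn, x0, num_steps=10, w=4.0, zero_init_steps=1):
    """Euler integration from t=0 to t=1.

    velocity_fn(x, t) is a hypothetical model wrapper that returns the pair
    (v_cond, v_uncond) for the current state.
    """
    x = x0.copy()
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v_cond, v_uncond = velocity_fn(x, t)
        x = x + dt * guided_velocity(v_cond, v_uncond, w, step, zero_init_steps)
    return x
```

In this sketch, setting zero_init_steps=0 and replacing the scalar with 1 reduces the update to ordinary CFG, which makes the two changes easy to toggle separately when comparing behavior.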
