CFG-Zero*: Improved Classifier-Free Guidance for Flow Matching Models

Abstract

Classifier-Free Guidance (CFG) is a widely adopted technique in diffusion and flow models for improving image fidelity and controllability. In this work, we first analytically study the effect of CFG on flow matching models trained on Gaussian mixtures, where the ground-truth flow can be derived. We observe that in the early stages of training, when the flow estimation is inaccurate, CFG directs samples toward incorrect trajectories. Building on this observation, we propose CFG-Zero*, an improved CFG with two contributions: (a) optimized scale, where a scalar is optimized to correct for the inaccuracies in the estimated velocity, hence the * in the name; and (b) zero-init, which involves zeroing out the first few steps of the ODE solver. Experiments on both text-to-image (Lumina-Next, Stable Diffusion 3, and Flux) and text-to-video (Wan-2.1) generation demonstrate that CFG-Zero* consistently outperforms CFG, highlighting its effectiveness in guiding flow matching models. (Code is available at github.com/WeichenFan/CFG-Zero-star)
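
The two contributions can be illustrated with a minimal sketch. The abstract does not state the formula for the optimized scalar, so the version below assumes it is the least-squares projection of the conditional velocity onto the unconditional one, and that zero-init returns a zero velocity for the first few solver steps. The function names, the `zero_init_steps` parameter, and the `eps` stabilizer are hypothetical and are not taken from the released code.

```python
from typing import List, Sequence


def optimized_scale(v_cond: Sequence[float], v_uncond: Sequence[float],
                    eps: float = 1e-8) -> float:
    """Scalar s minimizing ||v_cond - s * v_uncond||^2 (assumed form)."""
    dot = sum(c * u for c, u in zip(v_cond, v_uncond))
    norm_sq = sum(u * u for u in v_uncond)
    # eps guards against division by zero when v_uncond is all zeros.
    return dot / (norm_sq + eps)


def cfg_zero_star_velocity(v_cond: Sequence[float], v_uncond: Sequence[float],
                           guidance_scale: float, step: int,
                           zero_init_steps: int = 1) -> List[float]:
    """Guided velocity for one ODE solver step (illustrative only)."""
    # Zero-init: skip the velocity update for the first few solver steps.
    if step < zero_init_steps:
        return [0.0] * len(v_cond)
    # Optimized scale: rescale the unconditional branch before guidance.
    s = optimized_scale(v_cond, v_uncond)
    return [s * u + guidance_scale * (c - s * u)
            for c, u in zip(v_cond, v_uncond)]
```

With `zero_init_steps=0` and the scalar fixed to 1, the expression reduces to the standard CFG update `v_uncond + guidance_scale * (v_cond - v_uncond)`, which makes it easy to compare the two side by side.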
