Adaptive Guidance: Training-free Acceleration of Conditional Diffusion Models

Abstract

This paper presents a comprehensive study on the role of Classifier-Free Guidance (CFG) in text-conditioned diffusion models from the perspective of inference efficiency. In particular, we relax the default choice of applying CFG in all diffusion steps and instead search for efficient guidance policies. We formulate the discovery of such policies in the differentiable Neural Architecture Search framework. Our findings suggest that the denoising steps proposed by CFG become increasingly aligned with simple conditional steps, which renders the extra neural network evaluation of CFG redundant, especially in the second half of the denoising process. Building upon this insight, we propose "Adaptive Guidance" (AG), an efficient variant of CFG that adaptively omits network evaluations when the denoising process displays convergence. Our experiments demonstrate that AG preserves CFG's image quality while reducing computation by 25%. Thus, AG constitutes a plug-and-play alternative to Guidance Distillation, achieving 50% of the speed-ups of the latter while being training-free and retaining the capacity to handle negative prompts. Finally, we uncover further redundancies of CFG in the first half of the diffusion process, showing that entire neural function evaluations can be replaced by simple affine transformations of past score estimates. This method, termed LinearAG, offers even cheaper inference at the cost of deviating from the baseline model. Our findings provide insights into the efficiency of the conditional denoising process that contribute to more practical and swift deployment of text-conditioned diffusion models.
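
To make the idea of omitting the extra CFG evaluation concrete, the following is a minimal sketch, not the authors' implementation. It assumes that "convergence" is detected by the cosine similarity between the conditional and unconditional noise predictions crossing a threshold; that criterion, the threshold value, and all names (`adaptive_guidance_sample`, `toy_denoiser`, `toy_step`) are illustrative assumptions. The toy denoiser deliberately ignores its input and simply lets the two predictions drift together as `t` approaches zero.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two flat vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-12)


def adaptive_guidance_sample(
    denoiser: Callable[[Vector, float, Optional[Vector]], Vector],
    step_fn: Callable[[Vector, Vector, float], Vector],
    x: Vector,
    cond: Vector,
    timesteps: Sequence[float],
    scale: float = 7.5,
    threshold: float = 0.9,  # assumed convergence threshold, not from the paper
) -> Tuple[Vector, int]:
    """Run a guided sampling loop and count network function evaluations (NFEs)."""
    nfe = 0
    guided = True
    for t in timesteps:
        eps_c = denoiser(x, t, cond)  # conditional prediction, always needed
        nfe += 1
        if guided:
            eps_u = denoiser(x, t, None)  # extra unconditional evaluation of CFG
            nfe += 1
            # Standard CFG combination of the two predictions.
            eps = [u + scale * (c - u) for c, u in zip(eps_c, eps_u)]
            # Assumed criterion: once both predictions align, stop guiding.
            if cosine_similarity(eps_c, eps_u) >= threshold:
                guided = False
        else:
            eps = eps_c  # plain conditional step, one evaluation only
        x = step_fn(x, eps, t)
    return x, nfe


def toy_denoiser(x: Vector, t: float, cond: Optional[Vector]) -> Vector:
    """Hypothetical stand-in for a network: the conditional offset shrinks with t."""
    base = [1.0, 0.0]
    if cond is None:
        return base
    return [b + t * c for b, c in zip(base, cond)]


def toy_step(x: Vector, eps: Vector, t: float) -> Vector:
    """Hypothetical update rule standing in for a real sampler step."""
    return [xi - 0.05 * ei for xi, ei in zip(x, eps)]


timesteps = [1.0 - 0.05 * i for i in range(21)]
x_final, nfe = adaptive_guidance_sample(
    toy_denoiser, toy_step, [0.0, 0.0], [0.0, 1.0], timesteps
)
full_cfg_nfe = 2 * len(timesteps)
print(f"NFEs with adaptive skipping: {nfe} (full CFG would use {full_cfg_nfe})")
```

In this toy setting the unconditional evaluation is dropped once the two predictions align, so the loop uses fewer evaluations than the two-per-step cost of plain CFG. The savings here depend entirely on the toy denoiser and the assumed threshold and say nothing about the 25% figure reported in the abstract.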
