Align Your Flow: Scaling Continuous-Time Flow Map Distillation

Abstract

Diffusion- and flow-based models have emerged as state-of-the-art generative modeling approaches, but they require many sampling steps. Consistency models can distill these models into efficient one-step generators; however, unlike flow- and diffusion-based methods, their performance inevitably degrades as the number of sampling steps increases, which we show both analytically and empirically. Flow maps generalize these approaches by connecting any two noise levels in a single step, and they remain effective across all step counts. In this paper, we introduce two new continuous-time objectives for training flow maps, along with additional novel training techniques; these objectives generalize existing consistency and flow matching objectives. We further demonstrate that autoguidance, which uses a low-quality model for guidance during distillation, can improve performance, and that adversarial finetuning provides an additional boost with minimal loss in sample diversity. We extensively validate our flow map models, called Align Your Flow, on challenging image generation benchmarks and achieve state-of-the-art few-step generation performance on both ImageNet 64x64 and 512x512, using small and efficient neural networks. Finally, we show text-to-image flow map models that outperform all existing non-adversarially trained few-step samplers in text-conditioned synthesis.
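As a rough illustration of what "connecting any two noise levels in a single step" means, the sketch below uses a toy linear ODE whose flow map is known in closed form. Everything here is an assumption made for illustration and is not taken from the paper: the ODE dx/dt = A*x, the names `flow_map`, `sample`, and `euler`, and the convention that time 1 is noise and time 0 is data. In the paper the flow map is a trained neural network, not an analytic formula.

```python
import math

# Toy stand-in for a probability-flow ODE: dx/dt = A * x (illustrative assumption).
A = -0.5


def flow_map(x: float, t: float, s: float) -> float:
    """Exact flow map of the toy ODE: carries the state at time t to time s in one jump."""
    return x * math.exp(A * (s - t))


def sample(x: float, times: list[float]) -> float:
    """Chain flow-map jumps along a schedule of times (any number of steps)."""
    for t, s in zip(times[:-1], times[1:]):
        x = flow_map(x, t, s)
    return x


def euler(x: float, t: float, s: float, n: int) -> float:
    """Many-step baseline: integrate the same ODE with n Euler steps."""
    h = (s - t) / n
    for _ in range(n):
        x += h * A * x
    return x


x_noise = 2.0
one_step = sample(x_noise, [1.0, 0.0])                     # single jump from t=1 to s=0
four_step = sample(x_noise, [1.0, 0.75, 0.5, 0.25, 0.0])   # four shorter jumps
many_euler = euler(x_noise, 1.0, 0.0, 1000)                # fine-grained ODE solve
single_euler = euler(x_noise, 1.0, 0.0, 1)                 # one crude Euler step
```

With an exact flow map, the one-step and four-step results agree, and both match a fine-grained ODE solve, whereas a single Euler step does not. Fixing the target time to 0 gives the special case that a consistency model covers under this convention.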
