Phased Consistency Model

Abstract

The consistency model (CM) has recently made significant progress in accelerating generation with diffusion models. However, its application to high-resolution, text-conditioned image generation in the latent space (known as LCM) remains unsatisfactory. In this paper, we identify three key flaws in the current design of LCM. We investigate the reasons behind these limitations and propose the Phased Consistency Model (PCM), which generalizes the design space and addresses all of the identified limitations. Our evaluations demonstrate that PCM significantly outperforms LCM across 1- to 16-step generation settings. Although PCM is designed specifically for multi-step refinement, its 1-step generation results are comparable or even superior to those of previous state-of-the-art methods designed specifically for 1-step generation. Furthermore, we show that PCM's methodology is versatile and applicable to video generation, enabling us to train a state-of-the-art few-step text-to-video generator. More details are available at https://g-u-n.github.io/projects/pcm/.