One-Way Ticket: Time-Independent Unified Encoder for Distilling Text-to-Image Diffusion Models

Abstract

Text-to-Image (T2I) diffusion models have made remarkable advances in generative modeling, but they face a trade-off between inference speed and image quality that makes efficient deployment difficult. Existing distilled T2I models can generate high-fidelity images with fewer sampling steps, but often struggle with diversity and quality, especially in one-step models. Our analysis reveals redundant computation in the UNet encoders. Our findings suggest that, for T2I diffusion models, decoders are more adept at capturing richer and more explicit semantic information, while encoders can be effectively shared across decoders from diverse time steps. Based on these observations, we introduce the first Time-independent Unified Encoder (TiUE) for the student model UNet architecture, a loop-free image generation approach for distilling T2I diffusion models. Using a one-pass scheme, TiUE shares encoder features across multiple decoder time steps, enabling parallel sampling and significantly reducing inference time complexity. In addition, we incorporate a KL divergence term to regularize noise prediction, which enhances the perceptual realism and diversity of the generated images. Experimental results demonstrate that TiUE outperforms state-of-the-art methods, including LCM, SD-Turbo, and SwiftBrushv2, producing more diverse and realistic results while maintaining computational efficiency.
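The one-pass idea in the abstract (encode once, then reuse the encoder features for every decoder time step) can be illustrated with a toy sketch. This is not the authors' implementation: `ToyEncoder`, `toy_decoder`, `one_pass_predictions`, the `tanh` and scaling arithmetic, and the chosen timesteps are all hypothetical stand-ins, used only to show why sharing features removes the per-step encoder cost and leaves decoder calls that do not depend on one another.

```python
import math
from typing import List, Sequence


class ToyEncoder:
    """Hypothetical stand-in for a UNet encoder that takes no timestep input."""

    def __init__(self) -> None:
        self.calls = 0  # counts forward passes so the saving is visible

    def __call__(self, latent: Sequence[float]) -> List[float]:
        self.calls += 1
        return [math.tanh(x) for x in latent]


def toy_decoder(features: Sequence[float], t: int, num_train_steps: int = 1000) -> List[float]:
    """Hypothetical stand-in for a timestep-conditioned UNet decoder."""
    scale = 1.0 - t / num_train_steps
    return [scale * f for f in features]


def one_pass_predictions(encoder: ToyEncoder, latent: Sequence[float],
                         timesteps: Sequence[int]) -> List[List[float]]:
    """Encode once, then decode for every timestep from the same features."""
    features = encoder(latent)  # single encoder pass, reused below
    # Each decoder call depends only on `features` and `t`, not on another
    # timestep's output, so these calls could be batched or run in parallel.
    return [toy_decoder(features, t) for t in timesteps]


encoder = ToyEncoder()
latent = [0.5, -1.0, 2.0]            # assumed toy latent
timesteps = [999, 749, 499, 249]     # assumed decoder time steps
predictions = one_pass_predictions(encoder, latent, timesteps)
print(encoder.calls, len(predictions))  # prints "1 4"
```

In a conventional multi-step sampler the encoder would run once per step (four times here); in this sketch it runs once regardless of how many decoder time steps are requested.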
