One-Way Ticket: Time-Independent Unified Encoder for Distilling Text-to-Image Diffusion Models

Abstract

Text-to-Image (T2I) diffusion models have made remarkable advances in generative modeling, but they face a trade-off between inference speed and image quality that makes efficient deployment difficult. Existing distilled T2I models can generate high-fidelity images with fewer sampling steps, yet they often struggle with diversity and quality, especially in the one-step setting. Our analysis reveals redundant computation in the UNet encoders. The findings suggest that, in T2I diffusion models, decoders are better at capturing rich, explicit semantic information, while encoders can be effectively shared across decoders from different time steps. Based on these observations, we introduce the first Time-independent Unified Encoder (TiUE) for the student model's UNet architecture, a loop-free image generation approach for distilling T2I diffusion models. Using a one-pass scheme, TiUE shares encoder features across multiple decoder time steps, enabling parallel sampling and significantly reducing inference time complexity. In addition, we incorporate a KL divergence term to regularize noise prediction, which enhances the perceptual realism and diversity of the generated images. Experimental results demonstrate that TiUE outperforms state-of-the-art methods, including LCM, SD-Turbo, and SwiftBrushv2, producing more diverse and realistic results while maintaining computational efficiency.
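The encoder-sharing idea described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: `toy_encoder`, `toy_decoder`, `predict_per_step`, and `predict_shared` are hypothetical names, and the arithmetic inside them is a stand-in for real UNet blocks. The sketch only assumes what the abstract states, namely that the encoder does not depend on the time step, so its features can be computed once and reused by decoders at several time steps.

```python
import math

encoder_calls = 0  # counts encoder passes so the two schemes can be compared


def toy_encoder(latent):
    """Hypothetical time-independent encoder: takes no time step argument."""
    global encoder_calls
    encoder_calls += 1
    return [math.tanh(v) for v in latent]


def toy_decoder(features, t):
    """Hypothetical time-conditioned decoder: one output per (features, t)."""
    return [math.cos(t) * f + math.sin(t) for f in features]


def predict_per_step(latent, timesteps):
    """Conventional loop: the encoder is re-run for every time step."""
    return [toy_decoder(toy_encoder(latent), t) for t in timesteps]


def predict_shared(latent, timesteps):
    """One-pass scheme: encode once, then decode every time step.

    The decoder calls do not depend on each other, so in a real model
    they could be batched and evaluated in parallel.
    """
    features = toy_encoder(latent)
    return [toy_decoder(features, t) for t in timesteps]


latent = [0.5, -1.0, 2.0]
timesteps = [0.1, 0.4, 0.7, 1.0]

encoder_calls = 0
baseline = predict_per_step(latent, timesteps)
calls_per_step = encoder_calls

encoder_calls = 0
shared = predict_shared(latent, timesteps)
calls_shared = encoder_calls

print(calls_per_step, calls_shared)  # prints "4 1"
```

In this toy the two schemes give identical outputs because the stand-in encoder ignores the time step by construction. In an actual diffusion UNet the encoder is normally time-conditioned, so making it time-independent is something the student model would have to learn during distillation. The sketch shows only where the compute saving comes from.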
