Skrr: Skip and Re-use Text Encoder Layers for Memory Efficient Text-to-Image Generation

Abstract

Large-scale text encoders in text-to-image (T2I) diffusion models have demonstrated exceptional performance in generating high-quality images from textual prompts. Unlike denoising modules that rely on multiple iterative steps, text encoders require only a single forward pass to produce text embeddings. However, despite their minimal contribution to total inference time and floating-point operations (FLOPs), text encoders demand significantly higher memory usage, up to eight times more than denoising modules. To address this inefficiency, we propose Skip and Re-use layers (Skrr), a simple yet effective pruning strategy specifically designed for text encoders in T2I diffusion models. Skrr exploits the inherent redundancy in transformer blocks by selectively skipping or reusing certain layers in a manner tailored for T2I tasks, thereby reducing memory consumption without compromising performance. Extensive experiments demonstrate that Skrr maintains image quality comparable to the original model even under high sparsity levels, outperforming existing blockwise pruning methods. Furthermore, Skrr achieves state-of-the-art memory efficiency while preserving performance across multiple evaluation metrics, including the FID, CLIP, DreamSim, and GenEval scores.
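The abstract does not spell out how layers are chosen, so the following is only a toy sketch of the general idea of skipping and re-using layers, not the authors' implementation. The names `make_layer`, `run_encoder`, `stored_layers`, and the "plan" representation are hypothetical, and each "layer" is a trivial stand-in for a transformer block. It illustrates why memory drops: a skipped depth needs no parameters, and a re-used depth shares parameters with a layer that is already stored.

```python
from typing import Callable, List, Optional

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_layer(scale: float) -> Layer:
    """Toy stand-in for a transformer block: a residual update x + scale * x."""
    def layer(x: Vector) -> Vector:
        return [v + scale * v for v in x]
    return layer


def run_encoder(x: Vector, layers: List[Layer], plan: List[Optional[int]]) -> Vector:
    """Run one forward pass following a hypothetical execution plan.

    plan[d] is the index of the stored layer to apply at depth d,
    or None if that depth is skipped entirely.
    """
    for idx in plan:
        if idx is None:
            continue  # skip: the hidden state passes through unchanged
        x = layers[idx](x)  # keep (idx == depth) or re-use (idx != depth)
    return x


def stored_layers(plan: List[Optional[int]]) -> int:
    """Number of distinct layers whose parameters must be kept in memory."""
    return len({idx for idx in plan if idx is not None})


# Four toy layers; the scales are arbitrary illustrative values.
layers = [make_layer(s) for s in (1.0, 0.5, 0.25, 1.0)]

full_plan = [0, 1, 2, 3]        # original model: every layer stored and run
pruned_plan = [0, 0, None, 3]   # depth 1 re-uses layer 0, depth 2 is skipped

full_out = run_encoder([1.0], layers, full_plan)
pruned_out = run_encoder([1.0], layers, pruned_plan)
```

In this toy setup the pruned plan stores two of the four layers while still running three layer applications, so compute stays close to the original while parameter memory is halved. How well the pruned output matches the original depends entirely on which layers are skipped or re-used, which is the part the paper's method addresses.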
