Skrr: Skip and Re-use Text Encoder Layers for Memory Efficient Text-to-Image Generation
February 12, 2025
Authors: Hoigi Seo, Wongi Jeong, Jae-sun Seo, Se Young Chun
cs.AI
Abstract
Large-scale text encoders in text-to-image (T2I) diffusion models have
demonstrated exceptional performance in generating high-quality images from
textual prompts. Unlike denoising modules that rely on multiple iterative
steps, text encoders require only a single forward pass to produce text
embeddings. However, despite their minimal contribution to total inference time
and floating-point operations (FLOPs), text encoders demand significantly
higher memory usage, up to eight times more than denoising modules. To address
this inefficiency, we propose Skip and Re-use layers (Skrr), a simple yet
effective pruning strategy specifically designed for text encoders in T2I
diffusion models. Skrr exploits the inherent redundancy in transformer blocks
by selectively skipping or reusing certain layers in a manner tailored for T2I
tasks, thereby reducing memory consumption without compromising performance.
Extensive experiments demonstrate that Skrr maintains image quality comparable
to the original model even under high sparsity levels, outperforming existing
blockwise pruning methods. Furthermore, Skrr achieves state-of-the-art memory
efficiency while preserving performance across multiple evaluation metrics,
including the FID, CLIP, DreamSim, and GenEval scores.
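The core mechanism described above can be illustrated with a minimal sketch. This is not the authors' implementation; the layer count, the schedule, and the toy "layers" are all hypothetical. It only demonstrates the structural idea: a full stack of transformer layers is replaced by a shorter execution schedule in which some layers are skipped and others are re-used (executed more than once), so the weights of skipped layers never need to be loaded, reducing memory.

```python
# Hedged sketch of "skip and re-use" layer pruning (not the paper's code).
# Each toy "layer" just adds a distinct offset to a scalar hidden state, so
# the effect of skipping or re-using a layer is easy to trace by hand.

def make_layers(n):
    # Stand-ins for transformer blocks: layer i maps h -> h + i.
    return [lambda h, i=i: h + i for i in range(n)]

def forward_full(layers, h):
    # Original text encoder: run every layer once, in order.
    for layer in layers:
        h = layer(h)
    return h

def forward_skrr(layers, schedule, h):
    # Pruned encoder: execute layers in the order given by `schedule`.
    # An index may repeat (re-use); indices absent from the schedule are
    # skipped entirely, so their weights need not be kept in memory.
    for i in schedule:
        h = layers[i](h)
    return h

layers = make_layers(6)            # full model: layers 0..5
schedule = [0, 1, 1, 4, 5]         # hypothetical: skip 2 and 3, re-use 1
kept = sorted(set(schedule))       # only these layers' weights are needed

print(forward_full(layers, 0.0))            # 0+0+1+2+3+4+5 = 15.0
print(forward_skrr(layers, schedule, 0.0))  # 0+0+1+1+4+5 = 11.0
print(f"layers kept in memory: {len(kept)} / {len(layers)}")
```

In a real text encoder the layers would be transformer blocks and the schedule would be chosen to minimize degradation on T2I metrics; the sketch only shows how a repeat-and-skip schedule shrinks the set of weights that must reside in memory.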