TeEFusion: Blending Text Embeddings to Distill Classifier-Free Guidance

Abstract

Recent advances in text-to-image synthesis largely benefit from sophisticated sampling strategies and classifier-free guidance (CFG) to ensure high-quality generation. However, CFG's reliance on two forward passes, especially when combined with intricate sampling algorithms, results in prohibitively high inference costs. To address this, we introduce TeEFusion (Text Embeddings Fusion), a novel and efficient distillation method that directly incorporates the guidance magnitude into the text embeddings and distills the teacher model's complex sampling strategy. By simply fusing conditional and unconditional text embeddings using linear operations, TeEFusion reconstructs the desired guidance without adding extra parameters, simultaneously enabling the student model to learn from the teacher's output produced via its sophisticated sampling approach. Extensive experiments on state-of-the-art models such as SD3 demonstrate that our method allows the student to closely mimic the teacher's performance with a far simpler and more efficient sampling strategy. Consequently, the student model achieves inference speeds up to 6× faster than the teacher model, while maintaining image quality at levels comparable to those obtained through the teacher's complex sampling approach. The code is publicly available at https://github.com/AIDC-AI/TeEFusion.
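
To make the idea of moving guidance from the model outputs into the text embeddings concrete, the following is a minimal sketch, not the released implementation. The abstract states only that conditional and unconditional embeddings are fused with linear operations; the specific rule used here (the CFG extrapolation formula applied to embeddings) is an assumption, and `toy_denoiser`, `fuse_embeddings`, and `cfg_combine` are hypothetical names introduced for illustration. For a denoiser that is affine in the text embedding, one pass on the fused embedding reproduces two-pass CFG exactly; real networks are nonlinear, which is presumably why a distillation step is needed.

```python
import random


def cfg_combine(out_uncond, out_cond, w):
    """Standard CFG: extrapolate from the unconditional output toward the conditional one."""
    return [u + w * (c - u) for u, c in zip(out_uncond, out_cond)]


def fuse_embeddings(emb_uncond, emb_cond, w):
    """Assumed fusion rule: the same linear extrapolation, applied to the embeddings."""
    return [u + w * (c - u) for u, c in zip(emb_uncond, emb_cond)]


def toy_denoiser(x, emb, weights):
    """Hypothetical stand-in for a denoiser: x + W @ emb, affine in the embedding."""
    return [xi + sum(wij * ej for wij, ej in zip(row, emb))
            for xi, row in zip(x, weights)]


rng = random.Random(0)
dim = 4
x = [rng.gauss(0, 1) for _ in range(dim)]           # noisy latent
emb_cond = [rng.gauss(0, 1) for _ in range(dim)]    # prompt embedding
emb_uncond = [rng.gauss(0, 1) for _ in range(dim)]  # empty-prompt embedding
weights = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
w = 5.0                                             # guidance scale

# Teacher-style CFG: two forward passes per sampling step.
two_pass = cfg_combine(
    toy_denoiser(x, emb_uncond, weights),
    toy_denoiser(x, emb_cond, weights),
    w,
)

# Student-style inference: one forward pass on the fused embedding.
one_pass = toy_denoiser(x, fuse_embeddings(emb_uncond, emb_cond, w), weights)

# Largest elementwise difference between the two results.
max_gap = max(abs(a - b) for a, b in zip(two_pass, one_pass))
print(max_gap)
```

In this affine toy case the gap is zero up to floating-point rounding, because the two extrapolation coefficients `(1 - w)` and `w` sum to one. The sketch covers only the guidance part; distilling the teacher's sampling strategy, which the abstract also describes, is not modeled here.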