Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models

April 3, 2024
Authors: Wentian Zhang, Haozhe Liu, Jinheng Xie, Francesco Faccio, Mike Zheng Shou, Jürgen Schmidhuber
cs.AI

Abstract

This study explores the role of cross-attention during inference in text-conditional diffusion models. We find that cross-attention outputs converge to a fixed point after a few inference steps. Accordingly, the time point of convergence naturally divides the entire inference process into two stages: an initial semantics-planning stage, during which the model relies on cross-attention to plan text-oriented visual semantics, and a subsequent fidelity-improving stage, during which the model tries to generate images from the previously planned semantics. Surprisingly, ignoring text conditions in the fidelity-improving stage not only reduces computational complexity, but also maintains model performance. This yields a simple and training-free method called TGATE for efficient generation, which caches the cross-attention output once it converges and keeps it fixed during the remaining inference steps. Our empirical study on the MS-COCO validation set confirms its effectiveness. The source code of TGATE is available at https://github.com/HaozheLiu-ST/T-GATE.
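To make the caching idea concrete, below is a minimal PyTorch sketch of the gating mechanism the abstract describes. It is not the authors' implementation (see the linked repository for that); the class name `TGateCrossAttention`, the `gate_step` parameter, and the wrapped attention layer's call signature are illustrative assumptions.

```python
import torch
import torch.nn as nn


class TGateCrossAttention(nn.Module):
    """Hypothetical wrapper around a cross-attention layer (a sketch).

    Steps [0, gate_step) form the semantics-planning stage, where
    cross-attention runs normally. From `gate_step` onward (the
    fidelity-improving stage), the output cached at the gate is reused
    and the text-conditioned attention computation is skipped.
    """

    def __init__(self, attn: nn.Module, gate_step: int):
        super().__init__()
        self.attn = attn            # wrapped cross-attention layer; assumed to
                                    # take (hidden_states, text_embeds)
        self.gate_step = gate_step  # inference step at which output is frozen
        self.cached = None          # cross-attention output frozen at the gate

    def forward(
        self,
        hidden_states: torch.Tensor,
        text_embeds: torch.Tensor,
        step: int,
    ) -> torch.Tensor:
        # Semantics-planning stage (or no cache yet): compute attention.
        if step < self.gate_step or self.cached is None:
            out = self.attn(hidden_states, text_embeds)
            if step >= self.gate_step - 1:
                # Output has (empirically) converged: freeze it for reuse.
                self.cached = out.detach()
            return out
        # Fidelity-improving stage: reuse the converged output; the text
        # condition is no longer consulted, saving the attention compute.
        return self.cached
```

In a sampling loop, each cross-attention module of the denoising network would be replaced by such a wrapper, with `step` supplied by the scheduler at every denoising iteration; the choice of `gate_step` corresponds to the convergence point the paper observes empirically.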
