Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models
April 3, 2024
Authors: Wentian Zhang, Haozhe Liu, Jinheng Xie, Francesco Faccio, Mike Zheng Shou, Jürgen Schmidhuber
cs.AI
Abstract
This study explores the role of cross-attention during inference in
text-conditional diffusion models. We find that cross-attention outputs
converge to a fixed point after a few inference steps. Accordingly, the time
point of convergence naturally divides the entire inference process into two
stages: an initial semantics-planning stage, during which the model relies on
cross-attention to plan text-oriented visual semantics, and a subsequent
fidelity-improving stage, during which the model tries to generate images from
previously planned semantics. Surprisingly, ignoring text conditions in the
fidelity-improving stage not only reduces computational complexity, but also
maintains model performance. This yields a simple and training-free method
called TGATE for efficient generation, which caches the cross-attention output
once it converges and keeps it fixed during the remaining inference steps. Our
empirical study on the MS-COCO validation set confirms its effectiveness. The
source code of TGATE is available at https://github.com/HaozheLiu-ST/T-GATE.
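
The caching idea described in the abstract can be illustrated with a minimal sketch. This is not the code from the TGATE repository: `GatedCrossAttention`, `toy_attend`, `run`, and the `gate_step` parameter are hypothetical names introduced here, the "attention" is a toy stand-in for a real cross-attention layer, and the step at which the output has converged is simply assumed rather than detected.

```python
from typing import Callable, List, Optional

Vector = List[float]


class GatedCrossAttention:
    """Reuses a cached cross-attention output from a fixed gate step onward.

    Hypothetical helper for illustration only; not the TGATE repository's API.
    """

    def __init__(self, attend: Callable[[Vector, Vector], Vector], gate_step: int):
        self.attend = attend        # stand-in for a real cross-attention layer
        self.gate_step = gate_step  # assumed step by which the output has converged
        self.cache: Optional[Vector] = None
        self.calls = 0              # how many times attention was actually computed

    def __call__(self, step: int, hidden: Vector, text: Vector) -> Vector:
        # Semantics-planning stage: compute cross-attention and remember the result.
        if step < self.gate_step or self.cache is None:
            self.calls += 1
            self.cache = self.attend(hidden, text)
        # Fidelity-improving stage: return the cached output unchanged.
        return self.cache


def toy_attend(hidden: Vector, text: Vector) -> Vector:
    # Toy stand-in: scale the text embedding by the mean of the hidden state.
    scale = sum(hidden) / len(hidden)
    return [scale * t for t in text]


def run(num_steps: int = 10, gate_step: int = 4):
    layer = GatedCrossAttention(toy_attend, gate_step)
    text = [1.0, 2.0]
    outputs = []
    for step in range(num_steps):
        hidden = [1.0 / (step + 1)] * 3  # toy latent that changes every step
        outputs.append(layer(step, hidden, text))
    return layer.calls, outputs


if __name__ == "__main__":
    calls, outputs = run()
    print(f"cross-attention computed {calls} times over {len(outputs)} steps")
```

In this toy setup the attention function runs only during the steps before `gate_step`, and every later step reuses the last cached output, which is where the computational saving described in the abstract would come from.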