Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models

Abstract

This study explores the role of cross-attention during inference in text-conditional diffusion models. We find that cross-attention outputs converge to a fixed point after a few inference steps. The point of convergence therefore naturally divides the inference process into two stages: an initial semantics-planning stage, during which the model relies on cross-attention to plan text-oriented visual semantics, and a subsequent fidelity-improving stage, during which the model tries to generate images from the previously planned semantics. Surprisingly, ignoring text conditions in the fidelity-improving stage not only reduces computational complexity but also maintains model performance. This yields a simple, training-free method for efficient generation called TGATE, which caches the cross-attention output once it converges and keeps it fixed during the remaining inference steps. Our empirical study on the MS-COCO validation set confirms its effectiveness. The source code of TGATE is available at https://github.com/HaozheLiu-ST/T-GATE.
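The caching idea described in the abstract can be illustrated with a minimal sketch. This is not the TGATE implementation from the repository: the class `GatedCrossAttention`, the stand-in function `toy_attend`, the fixed `gate_step` hyperparameter, and the toy latent update are all hypothetical names and assumptions chosen for illustration. The sketch only shows the control flow: cross-attention is computed during the early steps, and its last output is reused afterwards.

```python
from typing import Callable, List, Optional, Tuple

Vector = List[float]


class GatedCrossAttention:
    """Hypothetical wrapper: computes cross-attention for the first
    `gate_step` steps, then returns the cached output unchanged."""

    def __init__(self, attend: Callable[[Vector, Vector], Vector], gate_step: int):
        self.attend = attend        # stand-in for a real cross-attention layer
        self.gate_step = gate_step  # assumed step by which outputs have converged
        self.cache: Optional[Vector] = None
        self.calls = 0              # how often cross-attention was actually computed

    def __call__(self, latent: Vector, text: Vector, step: int) -> Vector:
        # Fidelity-improving stage: the gate is closed, reuse the cached output.
        if self.cache is not None:
            return self.cache
        # Semantics-planning stage: compute cross-attention as usual.
        out = self.attend(latent, text)
        self.calls += 1
        # Close the gate after the last planning step by caching its output.
        if step + 1 >= self.gate_step:
            self.cache = out
        return out


def toy_attend(latent: Vector, text: Vector) -> Vector:
    # Toy replacement for cross-attention: an even mix of latent and text features.
    return [0.5 * l + 0.5 * t for l, t in zip(latent, text)]


def run_inference(num_steps: int, gate_step: int) -> Tuple[GatedCrossAttention, List[Vector]]:
    layer = GatedCrossAttention(toy_attend, gate_step)
    latent, text = [1.0, 0.0], [0.0, 1.0]
    outputs = []
    for step in range(num_steps):
        out = layer(latent, text, step)
        outputs.append(out)
        # Toy "denoising" update so the latent keeps changing between steps.
        latent = [0.9 * l + 0.1 * o for l, o in zip(latent, out)]
    return layer, outputs


if __name__ == "__main__":
    layer, outputs = run_inference(num_steps=25, gate_step=10)
    print("cross-attention computed", layer.calls, "times over", len(outputs), "steps")
```

With 25 steps and a gate at step 10, the toy layer is evaluated only 10 times, and every later step reuses the output cached at the end of the planning stage. In a real model, the gate step would need to be chosen so that the cross-attention outputs have actually converged by then.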
