ToDo: Token Downsampling for Efficient Generation of High-Resolution Images

Abstract

Attention mechanisms have been crucial for image diffusion models; however, their quadratic computational complexity limits the image sizes that can be processed within reasonable time and memory constraints. This paper investigates the importance of dense attention in generative image models, which often contain redundant features and are therefore suitable for sparser attention mechanisms. We propose ToDo, a novel training-free method that relies on token downsampling of key and value tokens to accelerate Stable Diffusion inference by up to 2x for common sizes and by up to 4.5x or more for high resolutions such as 2048x2048. We demonstrate that our approach outperforms previous methods in balancing efficient throughput and fidelity.
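To make the idea of key/value token downsampling concrete, the following is a minimal single-head sketch in NumPy. It is not the authors' implementation: the abstract does not specify the downsampling operator, so the strided subsampling over the spatial token grid used here is an assumption, and the names `downsample_tokens` and `kv_downsampled_attention` are hypothetical. Queries keep full resolution, so the output has one token per query position.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def downsample_tokens(tokens, height, width, factor):
    # ASSUMPTION: keep every `factor`-th token along each spatial axis.
    # The abstract only says keys and values are downsampled; the actual
    # operator used by ToDo may differ.
    n, d = tokens.shape
    assert n == height * width, "tokens must form a height x width grid"
    grid = tokens.reshape(height, width, d)
    return grid[::factor, ::factor].reshape(-1, d)


def kv_downsampled_attention(q, k, v, height, width, factor=2):
    # Queries stay at full resolution; only keys and values are reduced.
    k_small = downsample_tokens(k, height, width, factor)
    v_small = downsample_tokens(v, height, width, factor)
    scores = q @ k_small.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v_small


# Toy example: an 8x8 latent grid with 16-dimensional tokens.
rng = np.random.default_rng(0)
height, width, dim = 8, 8, 16
q = rng.standard_normal((height * width, dim))
k = rng.standard_normal((height * width, dim))
v = rng.standard_normal((height * width, dim))

out = kv_downsampled_attention(q, k, v, height, width, factor=2)
```

In this sketch, a downsampling factor of f along each spatial axis shrinks the attention score matrix from N x N to N x (N / f^2) entries, which is where the savings at high resolution would come from. Setting `factor=1` recovers ordinary dense attention.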
