LinFusion: 1 GPU, 1 Minute, 16K Image

Abstract

Modern diffusion models, particularly those that use a Transformer-based UNet for denoising, rely heavily on self-attention to model complex spatial relationships, and this underpins their impressive generation performance. However, self-attention has quadratic time and memory complexity with respect to the number of spatial tokens, which makes generating high-resolution visual content a significant challenge. To address this limitation, this paper proposes a novel linear attention mechanism as an alternative. Specifically, we begin our exploration from recently introduced models with linear complexity, e.g., Mamba, Mamba2, and Gated Linear Attention, and identify two key features, attention normalization and non-causal inference, that enhance high-resolution visual generation performance. Building on these insights, we introduce a generalized linear attention paradigm, which serves as a low-rank approximation of a wide spectrum of popular linear token mixers. To reduce training cost and make better use of pre-trained models, we initialize our models from pre-trained StableDiffusion (SD) and distill its knowledge into them. We find that the distilled model, termed LinFusion, achieves performance on par with or superior to the original SD after only modest training, while significantly reducing time and memory complexity. Extensive experiments on SD-v1.5, SD-v2.1, and SD-XL demonstrate that LinFusion delivers satisfactory zero-shot cross-resolution generation performance and can generate images at resolutions as high as 16K. Moreover, it is highly compatible with pre-trained SD components, such as ControlNet and IP-Adapter, and requires no adaptation effort. Code is available at https://github.com/Huage001/LinFusion.
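To make the complexity argument concrete, here is a minimal NumPy sketch of a generic normalized, non-causal linear attention, the two properties the abstract highlights. It is not LinFusion's implementation: the feature map (elu(x) + 1), the function names, and the tensor shapes are assumptions chosen for illustration, and the paper's generalized linear attention is more involved. The quadratic reference is included only to show that both forms compute the same result.

```python
import numpy as np

def feature_map(x):
    # Hypothetical positive feature map, elu(x) + 1 (an assumption, not from the paper).
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(q, k, v, eps=1e-6):
    """Normalized, non-causal linear attention: O(n) time and memory in the token count n."""
    phi_q, phi_k = feature_map(q), feature_map(k)
    kv = phi_k.T @ v             # (d, d_v) summary over ALL tokens, hence non-causal
    k_sum = phi_k.sum(axis=0)    # (d,) used for attention normalization
    num = phi_q @ kv             # (n, d_v)
    den = phi_q @ k_sum + eps    # (n,) per-query normalizer
    return num / den[:, None]

def quadratic_reference(q, k, v, eps=1e-6):
    """Same computation with an explicit n x n weight matrix, i.e. O(n^2) in n."""
    weights = feature_map(q) @ feature_map(k).T          # (n, n)
    return (weights @ v) / (weights.sum(axis=1, keepdims=True) + eps)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, d_v = 1024, 16, 16     # n spatial tokens; sizes are arbitrary
    q, k, v = rng.normal(size=(n, d)), rng.normal(size=(n, d)), rng.normal(size=(n, d_v))
    same = np.allclose(linear_attention(q, k, v), quadratic_reference(q, k, v))
    print("linear and quadratic forms agree:", same)
```

The linear form never materializes the n x n matrix; it only keeps a d x d_v summary and a d-dimensional sum of keys, which is why memory stays flat as resolution (and therefore n) grows.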
