LinFusion: 1 GPU, 1 Minute, 16K Image
September 3, 2024
Authors: Songhua Liu, Weihao Yu, Zhenxiong Tan, Xinchao Wang
cs.AI
Abstract
Modern diffusion models, particularly those utilizing a Transformer-based
UNet for denoising, rely heavily on self-attention operations to manage complex
spatial relationships, thus achieving impressive generation performance.
However, this existing paradigm faces significant challenges in generating
high-resolution visual content due to its quadratic time and memory complexity
with respect to the number of spatial tokens. To address this limitation, we
propose a novel linear attention mechanism as an alternative in this paper.
Specifically, we begin our exploration from recently introduced models with
linear complexity, e.g., Mamba, Mamba2, and Gated Linear Attention, and
identify two key features, attention normalization and non-causal inference,
that enhance high-resolution visual generation performance. Building on these
insights, we introduce a generalized linear attention paradigm, which serves as
a low-rank approximation of a wide spectrum of popular linear token mixers. To
reduce training cost and better leverage pre-trained models, we initialize
our models and distill knowledge from pre-trained StableDiffusion (SD). We
find that the distilled model, termed LinFusion, achieves performance on par
with or superior to the original SD after only modest training, while
significantly reducing time and memory complexity. Extensive experiments on
SD-v1.5, SD-v2.1, and SD-XL demonstrate that LinFusion delivers satisfactory
zero-shot cross-resolution generation performance, producing high-resolution
images at up to 16K resolution. Moreover, it is highly compatible with pre-trained
SD components, such as ControlNet and IP-Adapter, requiring no adaptation
efforts. Code is available at https://github.com/Huage001/LinFusion.
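The two features the abstract highlights, attention normalization and non-causal inference, can be illustrated with a minimal sketch of linear attention. This is not the paper's generalized low-rank formulation: the elu(x)+1 feature map and all names here are assumptions chosen for illustration. The point is that computing the key-value summary once and sharing it across all queries (non-causal) brings the cost to O(N·d²) in the token count N, versus O(N²·d) for softmax self-attention.

```python
import numpy as np

def phi(x):
    """Hypothetical positive feature map (elu(x) + 1), a common choice
    in linear-attention literature; not necessarily the paper's."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V, eps=1e-6):
    """Non-causal linear attention with normalization.

    Q, K, V: (N, d) arrays of N spatial tokens.
    Returns an (N, d) array equal to kernelized attention
    out_i = sum_j phi(q_i)·phi(k_j) v_j / sum_j phi(q_i)·phi(k_j),
    computed in O(N·d^2) instead of O(N^2·d).
    """
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V            # (d, d) summary, shared by all queries (non-causal)
    Z = Kp.sum(axis=0)       # (d,) normalizer statistics
    return (Qp @ KV) / (Qp @ Z + eps)[:, None]  # attention normalization
```

As a sanity check, this matches the explicit quadratic-cost computation `A = phi(Q) @ phi(K).T; out = (A @ V) / A.sum(axis=1, keepdims=True)` up to the epsilon, since `Qp @ (Kp.T @ V) = (Qp @ Kp.T) @ V`.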