LinFusion: 1 GPU, 1 Minute, 16K Image
September 3, 2024
Authors: Songhua Liu, Weihao Yu, Zhenxiong Tan, Xinchao Wang
cs.AI
Abstract
Modern diffusion models, particularly those utilizing a Transformer-based
UNet for denoising, rely heavily on self-attention operations to manage complex
spatial relationships, thus achieving impressive generation performance.
However, this existing paradigm faces significant challenges in generating
high-resolution visual content due to its quadratic time and memory complexity
with respect to the number of spatial tokens. To address this limitation, we
propose a novel linear attention mechanism as an alternative in this paper.
Specifically, we begin our exploration from recently introduced models with
linear complexity, e.g., Mamba, Mamba2, and Gated Linear Attention, and
identify two key features, attention normalization and non-causal inference,
that enhance high-resolution visual generation performance. Building on these
insights, we introduce a generalized linear attention paradigm, which serves as
a low-rank approximation of a wide spectrum of popular linear token mixers. To
reduce training cost and better leverage pre-trained models, we initialize
our models and distill knowledge from pre-trained StableDiffusion (SD). We
find that the distilled model, termed LinFusion, achieves performance on par
with or superior to the original SD after only modest training, while
significantly reducing time and memory complexity. Extensive experiments on
SD-v1.5, SD-v2.1, and SD-XL demonstrate that LinFusion delivers satisfactory
zero-shot cross-resolution generation performance, producing high-resolution
images at up to 16K resolution. Moreover, it is highly compatible with pre-trained
SD components, such as ControlNet and IP-Adapter, requiring no adaptation
efforts. Code is available at https://github.com/Huage001/LinFusion.
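The two features the abstract highlights, attention normalization and non-causal inference, can be illustrated with a minimal sketch of linear attention. This is not the paper's generalized low-rank formulation: the elu(x)+1 feature map and all names here are assumptions chosen for illustration. The point is that computing the key-value summary once and sharing it across all queries (non-causal) brings the cost to O(N·d²) in the token count N, versus O(N²·d) for softmax self-attention.

```python
import numpy as np

def phi(x):
    """Hypothetical positive feature map (elu(x) + 1), a common choice
    in linear-attention literature; not necessarily the paper's."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V, eps=1e-6):
    """Non-causal linear attention with normalization.

    Q, K, V: (N, d) arrays of N spatial tokens.
    Returns an (N, d) array equal to kernelized attention
    out_i = sum_j phi(q_i)·phi(k_j) v_j / sum_j phi(q_i)·phi(k_j),
    computed in O(N·d^2) instead of O(N^2·d).
    """
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V            # (d, d) summary, shared by all queries (non-causal)
    Z = Kp.sum(axis=0)       # (d,) normalizer statistics
    return (Qp @ KV) / (Qp @ Z + eps)[:, None]  # attention normalization
```

As a sanity check, this matches the explicit quadratic-cost computation `A = phi(Q) @ phi(K).T; out = (A @ V) / A.sum(axis=1, keepdims=True)` up to the epsilon, since `Qp @ (Kp.T @ V) = (Qp @ Kp.T) @ V`.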