
LinFusion: 1 GPU, 1 Minute, 16K Image

September 3, 2024
Authors: Songhua Liu, Weihao Yu, Zhenxiong Tan, Xinchao Wang
cs.AI

Abstract

Modern diffusion models, particularly those utilizing a Transformer-based UNet for denoising, rely heavily on self-attention operations to manage complex spatial relationships, thus achieving impressive generation performance. However, this existing paradigm faces significant challenges in generating high-resolution visual content due to its quadratic time and memory complexity with respect to the number of spatial tokens. To address this limitation, this paper proposes a novel linear attention mechanism as an alternative. Specifically, we begin our exploration from recently introduced models with linear complexity, e.g., Mamba, Mamba2, and Gated Linear Attention, and identify two key features, attention normalization and non-causal inference, that enhance high-resolution visual generation performance. Building on these insights, we introduce a generalized linear attention paradigm, which serves as a low-rank approximation of a wide spectrum of popular linear token mixers. To save training cost and better leverage pre-trained models, we initialize our model with and distill knowledge from pre-trained StableDiffusion (SD). We find that the distilled model, termed LinFusion, achieves performance on par with or superior to the original SD after only modest training, while significantly reducing time and memory complexity. Extensive experiments on SD-v1.5, SD-v2.1, and SD-XL demonstrate that LinFusion delivers satisfactory zero-shot cross-resolution generation performance, generating high-resolution images at resolutions such as 16K. Moreover, it is highly compatible with pre-trained SD components, such as ControlNet and IP-Adapter, requiring no adaptation effort. Code is available at https://github.com/Huage001/LinFusion.
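To illustrate the kind of linear-complexity, non-causal token mixing the abstract refers to, the sketch below implements a generic normalized linear attention in PyTorch. The feature map (elu + 1) and the sum-based normalization are common choices assumed here for illustration; they are not claimed to be the exact LinFusion formulation.

```python
import torch
import torch.nn.functional as F

def noncausal_linear_attention(q, k, v, eps=1e-6):
    """Illustrative non-causal linear attention with normalization.

    q, k, v: tensors of shape (batch, tokens, dim). This is a generic
    sketch of normalized linear attention, not the authors' exact layer.
    """
    # Positive feature map so attention weights stay non-negative.
    q = F.elu(q) + 1.0
    k = F.elu(k) + 1.0

    # Aggregate keys and values once: O(N * d^2) instead of O(N^2 * d).
    kv = torch.einsum("bnd,bne->bde", k, v)   # (batch, dim, dim)
    k_sum = k.sum(dim=1)                      # (batch, dim)

    # Non-causal: every query attends to all tokens, with no causal mask.
    out = torch.einsum("bnd,bde->bne", q, kv)
    # Attention normalization: divide by each query's total key mass.
    denom = torch.einsum("bnd,bd->bn", q, k_sum).unsqueeze(-1) + eps
    return out / denom
```

Because keys and values are aggregated into fixed-size summaries before being queried, the cost grows linearly with the number of spatial tokens rather than quadratically, which is what makes zero-shot generation at very high resolutions such as 16K tractable.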
