LinFusion: 1 GPU, 1 Minute, 16K Image

Abstract

Modern diffusion models, particularly those that use a Transformer-based UNet for denoising, rely heavily on self-attention to model complex spatial relationships, and thereby achieve impressive generation performance. However, this paradigm struggles to generate high-resolution visual content because its time and memory complexity is quadratic in the number of spatial tokens. To address this limitation, this paper proposes a novel linear attention mechanism as an alternative. Specifically, we begin our exploration from recently introduced models with linear complexity, e.g., Mamba, Mamba2, and Gated Linear Attention, and identify two key features, attention normalization and non-causal inference, that enhance high-resolution visual generation performance. Building on these insights, we introduce a generalized linear attention paradigm, which serves as a low-rank approximation of a wide spectrum of popular linear token mixers. To reduce the training cost and make better use of pre-trained models, we initialize our models from pre-trained StableDiffusion (SD) and distill knowledge from it. We find that the distilled model, termed LinFusion, achieves performance on par with or superior to the original SD after only modest training, while significantly reducing time and memory complexity. Extensive experiments on SD-v1.5, SD-v2.1, and SD-XL demonstrate that LinFusion delivers satisfactory zero-shot cross-resolution generation performance, including images at resolutions as high as 16K. Moreover, it is highly compatible with pre-trained SD components, such as ControlNet and IP-Adapter, and requires no adaptation effort. Code is available at https://github.com/Huage001/LinFusion.
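To make the complexity argument concrete, the following is a minimal sketch of generic normalized, non-causal linear attention in plain Python. It is not the LinFusion implementation: the `elu(x) + 1` feature map, the `eps` stabilizer, and the function names `feature_map`, `linear_attention`, and `quadratic_reference` are illustrative assumptions. It only shows how summing key-value statistics over all tokens once (non-causal) and dividing by a per-query normalizer (attention normalization) avoids building the n x n attention matrix.

```python
import math


def feature_map(x):
    """elu(x) + 1: a strictly positive feature map (an assumed, commonly used choice)."""
    return x + 1.0 if x > 0 else math.exp(x)


def linear_attention(q, k, v, eps=1e-6):
    """Normalized, non-causal linear attention.

    q, k: n rows of d floats; v: n rows of d_v floats.
    Cost grows linearly with the number of tokens n.
    """
    d, d_v = len(k[0]), len(v[0])
    fq = [[feature_map(x) for x in row] for row in q]
    fk = [[feature_map(x) for x in row] for row in k]

    # Single pass over ALL tokens (non-causal):
    #   kv[a][b] = sum_j fk[j][a] * v[j][b],   z[a] = sum_j fk[j][a]
    kv = [[0.0] * d_v for _ in range(d)]
    z = [0.0] * d
    for fk_j, v_j in zip(fk, v):
        for a in range(d):
            z[a] += fk_j[a]
            for b in range(d_v):
                kv[a][b] += fk_j[a] * v_j[b]

    out = []
    for fq_i in fq:
        # Normalizer: makes each query's implicit attention weights sum to ~1.
        den = sum(fq_i[a] * z[a] for a in range(d)) + eps
        out.append(
            [sum(fq_i[a] * kv[a][b] for a in range(d)) / den for b in range(d_v)]
        )
    return out


def quadratic_reference(q, k, v, eps=1e-6):
    """Same result computed with explicit n x n weights, for comparison only."""
    fq = [[feature_map(x) for x in row] for row in q]
    fk = [[feature_map(x) for x in row] for row in k]
    out = []
    for fq_i in fq:
        w = [sum(a * b for a, b in zip(fq_i, fk_j)) for fk_j in fk]
        den = sum(w) + eps
        out.append(
            [sum(w_j * v_j[b] for w_j, v_j in zip(w, v)) / den for b in range(len(v[0]))]
        )
    return out
```

Both functions return the same values up to floating-point rounding; the difference is that `linear_attention` stores only a d x d_v summary and a length-d normalizer, whereas `quadratic_reference` materializes one weight per token pair.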
