Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models
June 13, 2024
Authors: Qihao Liu, Zhanpeng Zeng, Ju He, Qihang Yu, Xiaohui Shen, Liang-Chieh Chen
cs.AI
Abstract
This paper presents innovative enhancements to diffusion models by
integrating a novel multi-resolution network and time-dependent layer
normalization. Diffusion models have gained prominence for their effectiveness
in high-fidelity image generation. While conventional approaches rely on
convolutional U-Net architectures, recent Transformer-based designs have
demonstrated superior performance and scalability. However, Transformer
architectures, which tokenize input data (via "patchification"), face a
trade-off between visual fidelity and computational complexity, since the cost
of self-attention grows quadratically with token length. Larger patch sizes
make attention computation more efficient, but they struggle to capture
fine-grained visual details, leading to image distortions. To address
this challenge, we propose augmenting the Diffusion model with the
Multi-Resolution network (DiMR), a framework that refines features across
multiple resolutions, progressively enhancing detail from low to high
resolution. Additionally, we introduce Time-Dependent Layer Normalization
(TD-LN), a parameter-efficient approach that incorporates time-dependent
parameters into layer normalization to inject time information and achieve
superior performance. Our method's efficacy is demonstrated on the
class-conditional ImageNet generation benchmark, where DiMR-XL variants
outperform prior diffusion models, setting new state-of-the-art FID scores of
1.70 on ImageNet 256 x 256 and 2.89 on ImageNet 512 x 512. Project page:
https://qihao067.github.io/projects/DiMR
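The fidelity/compute trade-off described above follows from simple arithmetic: a patchified Transformer turns an H x W image into (H/p) x (W/p) tokens, and self-attention compares every token pair. A minimal illustration (not code from the paper):

```python
# Illustration: how patch size p drives the quadratic self-attention cost
# in a patchified Transformer over a square image.
def attention_token_stats(image_size: int, patch_size: int) -> tuple[int, int]:
    """Return (num_tokens, pairwise_attention_entries) for a square image."""
    tokens_per_side = image_size // patch_size
    num_tokens = tokens_per_side ** 2
    return num_tokens, num_tokens ** 2

# Halving the patch size quadruples the token count and multiplies the
# pairwise attention cost by 16 -- hence the pressure toward large patches,
# at the expense of fine visual detail.
print(attention_token_stats(256, 16))  # (256, 65536)
print(attention_token_stats(256, 8))   # (1024, 1048576)
```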
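To make the TD-LN idea concrete, here is a hypothetical sketch of a layer normalization whose scale and shift depend on the diffusion timestep. The interpolation between two anchor parameter sets is an assumption chosen to illustrate parameter efficiency; the paper's exact parameterization may differ.

```python
import numpy as np

# Hypothetical sketch (not the paper's exact formulation): layer norm whose
# affine parameters are a learned function of the timestep t in [0, 1],
# rather than static parameters plus a separate time-embedding pathway.
class TimeDependentLayerNorm:
    def __init__(self, dim: int, eps: float = 1e-6, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Two anchor parameter sets; the effective gamma/beta linearly
        # interpolate between them as t moves from 0 to 1. This adds only
        # one extra parameter set over a standard layer norm.
        self.gamma0 = np.ones(dim)
        self.gamma1 = 1.0 + 0.1 * rng.standard_normal(dim)
        self.beta0 = np.zeros(dim)
        self.beta1 = 0.1 * rng.standard_normal(dim)
        self.eps = eps

    def __call__(self, x: np.ndarray, t: float) -> np.ndarray:
        gamma = (1.0 - t) * self.gamma0 + t * self.gamma1
        beta = (1.0 - t) * self.beta0 + t * self.beta1
        mu = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return gamma * (x - mu) / np.sqrt(var + self.eps) + beta

ln = TimeDependentLayerNorm(dim=4)
x = np.array([[1.0, 2.0, 3.0, 4.0]])
y = ln(x, t=0.5)  # normalized features with time-interpolated scale/shift
```

The design point is that time information is injected through the normalization's own parameters, avoiding the larger per-block modulation networks used by adaptive-norm alternatives.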