Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models
June 13, 2024
Authors: Qihao Liu, Zhanpeng Zeng, Ju He, Qihang Yu, Xiaohui Shen, Liang-Chieh Chen
cs.AI
Abstract
This paper presents innovative enhancements to diffusion models by
integrating a novel multi-resolution network and time-dependent layer
normalization. Diffusion models have gained prominence for their effectiveness
in high-fidelity image generation. While conventional approaches rely on
convolutional U-Net architectures, recent Transformer-based designs have
demonstrated superior performance and scalability. However, Transformer
architectures, which tokenize input data (via "patchification"), face a
trade-off between visual fidelity and computational complexity, since the cost
of self-attention grows quadratically with token length. Larger patch sizes
make attention computation more efficient, but they struggle to capture
fine-grained visual details, leading to image distortions. To address
this challenge, we propose augmenting the Diffusion model with the
Multi-Resolution network (DiMR), a framework that refines features across
multiple resolutions, progressively enhancing detail from low to high
resolution. Additionally, we introduce Time-Dependent Layer Normalization
(TD-LN), a parameter-efficient approach that incorporates time-dependent
parameters into layer normalization to inject time information and achieve
superior performance. Our method's efficacy is demonstrated on the
class-conditional ImageNet generation benchmark, where DiMR-XL variants
outperform prior diffusion models, setting new state-of-the-art FID scores of
1.70 on ImageNet 256 x 256 and 2.89 on ImageNet 512 x 512. Project page:
https://qihao067.github.io/projects/DiMR
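The fidelity/compute trade-off described above follows from simple arithmetic: a patchified Transformer turns an H x W image into (H/p) x (W/p) tokens, and self-attention compares every token pair. A minimal illustration (not code from the paper):

```python
# Illustration: how patch size p drives the quadratic self-attention cost
# in a patchified Transformer over a square image.
def attention_token_stats(image_size: int, patch_size: int) -> tuple[int, int]:
    """Return (num_tokens, pairwise_attention_entries) for a square image."""
    tokens_per_side = image_size // patch_size
    num_tokens = tokens_per_side ** 2
    return num_tokens, num_tokens ** 2

# Halving the patch size quadruples the token count and multiplies the
# pairwise attention cost by 16 -- hence the pressure toward large patches,
# at the expense of fine visual detail.
print(attention_token_stats(256, 16))  # (256, 65536)
print(attention_token_stats(256, 8))   # (1024, 1048576)
```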
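To make the TD-LN idea concrete, here is a hypothetical sketch of a layer normalization whose scale and shift depend on the diffusion timestep. The interpolation between two anchor parameter sets is an assumption chosen to illustrate parameter efficiency; the paper's exact parameterization may differ.

```python
import numpy as np

# Hypothetical sketch (not the paper's exact formulation): layer norm whose
# affine parameters are a learned function of the timestep t in [0, 1],
# rather than static parameters plus a separate time-embedding pathway.
class TimeDependentLayerNorm:
    def __init__(self, dim: int, eps: float = 1e-6, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Two anchor parameter sets; the effective gamma/beta linearly
        # interpolate between them as t moves from 0 to 1. This adds only
        # one extra parameter set over a standard layer norm.
        self.gamma0 = np.ones(dim)
        self.gamma1 = 1.0 + 0.1 * rng.standard_normal(dim)
        self.beta0 = np.zeros(dim)
        self.beta1 = 0.1 * rng.standard_normal(dim)
        self.eps = eps

    def __call__(self, x: np.ndarray, t: float) -> np.ndarray:
        gamma = (1.0 - t) * self.gamma0 + t * self.gamma1
        beta = (1.0 - t) * self.beta0 + t * self.beta1
        mu = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return gamma * (x - mu) / np.sqrt(var + self.eps) + beta

ln = TimeDependentLayerNorm(dim=4)
x = np.array([[1.0, 2.0, 3.0, 4.0]])
y = ln(x, t=0.5)  # normalized features with time-interpolated scale/shift
```

The design point is that time information is injected through the normalization's own parameters, avoiding the larger per-block modulation networks used by adaptive-norm alternatives.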