

DiM: Diffusion Mamba for Efficient High-Resolution Image Synthesis

May 23, 2024
作者: Yao Teng, Yue Wu, Han Shi, Xuefei Ning, Guohao Dai, Yu Wang, Zhenguo Li, Xihui Liu
cs.AI

Abstract

Diffusion models have achieved great success in image generation, with the backbone evolving from U-Net to Vision Transformers. However, the computational cost of Transformers is quadratic in the number of tokens, leading to significant challenges when dealing with high-resolution images. In this work, we propose Diffusion Mamba (DiM), which combines the efficiency of Mamba, a sequence model based on State Space Models (SSMs), with the expressive power of diffusion models for efficient high-resolution image synthesis. To address the challenge that Mamba cannot generalize to 2D signals, we make several architectural designs, including multi-directional scans, learnable padding tokens at the end of each row and column, and lightweight local feature enhancement. Our DiM architecture achieves inference-time efficiency for high-resolution images. In addition, to further improve the training efficiency of high-resolution image generation with DiM, we investigate a "weak-to-strong" training strategy that pretrains DiM on low-resolution images (256×256) and then fine-tunes it on high-resolution images (512×512). We further explore training-free upsampling strategies that enable the model to generate higher-resolution images (e.g., 1024×1024 and 1536×1536) without further fine-tuning. Experiments demonstrate the effectiveness and efficiency of our DiM.
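
To make the 2D-adaptation ideas in the abstract concrete, below is a minimal PyTorch sketch (not the authors' implementation) of two of the named designs: flattening an image-token grid into multiple directional scan orders for a 1D sequence model, and appending a learnable padding token at the end of each row and column. The module name `DirectionalScans` and parameter `pad_token` are illustrative assumptions; in the actual DiM, each resulting sequence would be processed by Mamba (SSM) blocks and the outputs fused.

```python
import torch
import torch.nn as nn

class DirectionalScans(nn.Module):
    """Sketch: turn a 2D token grid into four 1D scan sequences,
    with a learnable padding token closing each row/column."""

    def __init__(self, dim: int):
        super().__init__()
        # One learnable padding token, shared across all row/column ends.
        self.pad_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, x: torch.Tensor) -> list[torch.Tensor]:
        # x: (B, H, W, C) grid of image tokens.
        B, H, W, C = x.shape
        pad = self.pad_token.expand(B, -1, -1)  # (B, 1, C)

        # Row-major scan: append the pad token after each row.
        rows = [torch.cat([x[:, h], pad], dim=1) for h in range(H)]
        row_seq = torch.cat(rows, dim=1)        # (B, H*(W+1), C)

        # Column-major scan: append the pad token after each column.
        cols = [torch.cat([x[:, :, w], pad], dim=1) for w in range(W)]
        col_seq = torch.cat(cols, dim=1)        # (B, W*(H+1), C)

        # Multi-directional scans: forward and reversed versions of
        # both orders, one sequence per direction.
        return [row_seq, row_seq.flip(1), col_seq, col_seq.flip(1)]

# Usage: a 16x16 grid of 64-dim tokens yields four sequences of
# length 16*(16+1) = 272 each.
scans = DirectionalScans(dim=64)
seqs = scans(torch.randn(2, 16, 16, 64))
```

The padding tokens act as explicit row/column delimiters, so the 1D sequence model can tell where one spatial line ends and the next begins, which is one way to address Mamba's lack of a native 2D inductive bias.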