DiM: Diffusion Mamba for Efficient High-Resolution Image Synthesis
May 23, 2024
Authors: Yao Teng, Yue Wu, Han Shi, Xuefei Ning, Guohao Dai, Yu Wang, Zhenguo Li, Xihui Liu
cs.AI
Abstract
Diffusion models have achieved great success in image generation, with the
backbone evolving from U-Net to Vision Transformers. However, the computational
cost of Transformers is quadratic to the number of tokens, leading to
significant challenges when dealing with high-resolution images. In this work,
we propose Diffusion Mamba (DiM), which combines the efficiency of Mamba, a
sequence model based on State Space Models (SSM), with the expressive power of
diffusion models for efficient high-resolution image synthesis. To address the
challenge that Mamba cannot generalize to 2D signals, we make several
architecture designs including multi-directional scans, learnable padding
tokens at the end of each row and column, and lightweight local feature
enhancement. Our DiM architecture achieves inference-time efficiency for
high-resolution images. In addition, to further improve training efficiency for
high-resolution image generation with DiM, we investigate a "weak-to-strong"
training strategy that pretrains DiM on low-resolution images (256×256)
and then finetunes it on high-resolution images (512×512). We further
explore training-free upsampling strategies to enable the model to generate
higher-resolution images (e.g., 1024×1024 and 1536×1536)
without further fine-tuning. Experiments demonstrate the effectiveness and
efficiency of our DiM.
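The two 2D-adaptation ideas mentioned in the abstract, multi-directional scans and learnable padding tokens at the ends of rows, can be sketched in a few lines. This is a minimal illustrative sketch in NumPy, not the authors' implementation: the function names, the choice of four scan orders, and the helper `append_row_padding` are assumptions, and the padding token would be a learned parameter rather than a fixed vector in practice.

```python
import numpy as np

def multi_directional_scans(tokens):
    """Flatten an (H, W, C) token grid into four 1D scan orders so a
    1D sequence model like Mamba can see the image from multiple
    directions: row-major, column-major, and their reverses.
    Illustrative sketch only; DiM's actual scan orders may differ."""
    H, W, C = tokens.shape
    row_major = tokens.reshape(H * W, C)                     # left-to-right, top-to-bottom
    col_major = tokens.transpose(1, 0, 2).reshape(H * W, C)  # top-to-bottom, left-to-right
    return [row_major, row_major[::-1], col_major, col_major[::-1]]

def append_row_padding(tokens, pad_token):
    """Append a padding token at the end of each row so the flattened
    sequence carries explicit row-boundary markers (hypothetical helper;
    in DiM the pad token is learnable, and columns get one too)."""
    H, W, C = tokens.shape
    pad = np.broadcast_to(pad_token, (H, 1, C))
    return np.concatenate([tokens, pad], axis=1)  # shape (H, W + 1, C)
```

A reversed scan is simply the forward scan read backwards, so combining the four orders gives every token a causal view of its neighbors in all four directions once the per-direction outputs are merged.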