DiCo: Revitalizing ConvNets for Scalable and Efficient Diffusion Modeling
May 16, 2025
Authors: Yuang Ai, Qihang Fan, Xuefeng Hu, Zhenheng Yang, Ran He, Huaibo Huang
cs.AI
Abstract
Diffusion Transformer (DiT), a promising diffusion model for visual
generation, demonstrates impressive performance but incurs significant
computational overhead. Intriguingly, analysis of pre-trained DiT models
reveals that global self-attention is often redundant, predominantly capturing
local patterns, highlighting the potential for more efficient alternatives. In
this paper, we revisit convolution as an alternative building block for
constructing efficient and expressive diffusion models. However, naively
replacing self-attention with convolution typically results in degraded
performance. Our investigations attribute this performance gap to the higher
channel redundancy in ConvNets compared to Transformers. To resolve this, we
introduce a compact channel attention mechanism that promotes the activation of
more diverse channels, thereby enhancing feature diversity. This leads to
Diffusion ConvNet (DiCo), a family of diffusion models built entirely from
standard ConvNet modules, offering strong generative performance with
significant efficiency gains. On class-conditional ImageNet benchmarks, DiCo
outperforms previous diffusion models in both image quality and generation
speed. Notably, DiCo-XL achieves an FID of 2.05 at 256x256 resolution and 2.53
at 512x512, with a 2.7x and 3.1x speedup over DiT-XL/2, respectively.
Furthermore, our largest model, DiCo-H, scaled to 1B parameters, reaches an FID
of 1.90 on ImageNet 256x256, without any additional supervision during training.
Code: https://github.com/shallowdream204/DiCo
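The abstract attributes the performance gap of naive convolutional replacements to channel redundancy, and fixes it with a compact channel attention mechanism that re-weights channels to diversify their activations. The sketch below is a minimal NumPy illustration of one common form of such a gate (squeeze-and-excitation-style global pooling followed by a pointwise projection and sigmoid); the function name, weight shape, and single-projection design are assumptions for illustration, not the paper's exact DiCo module.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def compact_channel_attention(x, w):
    """Hypothetical compact channel attention gate (illustrative only).

    x: feature map of shape (C, H, W)
    w: (C, C) pointwise projection weights -- an assumed single-layer
       design; the actual DiCo mechanism may differ.
    Returns a gated feature map of the same shape.
    """
    # Squeeze: global average pooling collapses spatial dims
    # into one descriptor per channel.
    s = x.mean(axis=(1, 2))            # shape (C,)
    # Excite: pointwise projection + sigmoid yields per-channel
    # gates in (0, 1).
    g = sigmoid(w @ s)                 # shape (C,)
    # Re-weight each channel by its gate, suppressing redundant
    # channels and encouraging more diverse activations.
    return x * g[:, None, None]

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))     # toy (C=8, H=4, W=4) features
w = 0.1 * rng.standard_normal((8, 8))
y = compact_channel_attention(x, w)
```

Because the gates lie strictly in (0, 1), the module can only attenuate channels, never amplify them; the learned projection decides which channels to keep active.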