DiCo: Revitalizing ConvNets for Scalable and Efficient Diffusion Modeling
May 16, 2025
Authors: Yuang Ai, Qihang Fan, Xuefeng Hu, Zhenheng Yang, Ran He, Huaibo Huang
cs.AI
Abstract
Diffusion Transformer (DiT), a promising diffusion model for visual
generation, demonstrates impressive performance but incurs significant
computational overhead. Intriguingly, analysis of pre-trained DiT models
reveals that global self-attention is often redundant, predominantly capturing
local patterns, highlighting the potential for more efficient alternatives. In
this paper, we revisit convolution as an alternative building block for
constructing efficient and expressive diffusion models. However, naively
replacing self-attention with convolution typically results in degraded
performance. Our investigations attribute this performance gap to the higher
channel redundancy in ConvNets compared to Transformers. To resolve this, we
introduce a compact channel attention mechanism that promotes the activation of
more diverse channels, thereby enhancing feature diversity. This leads to
Diffusion ConvNet (DiCo), a family of diffusion models built entirely from
standard ConvNet modules, offering strong generative performance with
significant efficiency gains. On class-conditional ImageNet benchmarks, DiCo
outperforms previous diffusion models in both image quality and generation
speed. Notably, DiCo-XL achieves an FID of 2.05 at 256x256 resolution and 2.53
at 512x512, with a 2.7x and 3.1x speedup over DiT-XL/2, respectively.
Furthermore, our largest model, DiCo-H, scaled to 1B parameters, reaches an FID
of 1.90 on ImageNet 256x256, without any additional supervision during training.
Code: https://github.com/shallowdream204/DiCo
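The abstract does not spell out the exact form of the "compact channel attention" used in DiCo; the official repository contains the real implementation. As a minimal sketch, assuming a squeeze-and-excitation-style per-channel gate inside a convolution-only block, the idea could look like the following (class names `CompactChannelAttention` and `ConvDiffusionBlock` are illustrative, not the authors' code):

```python
# Hypothetical sketch of a compact channel attention gate in a conv-only block.
# Assumes a squeeze-and-excitation-style design; the actual DiCo block may differ.
import torch
import torch.nn as nn


class CompactChannelAttention(nn.Module):
    """Globally pooled channel gating meant to encourage more diverse channel activations."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                                   # squeeze: one value per channel
            nn.Conv2d(channels, channels // reduction, kernel_size=1),  # compact bottleneck
            nn.SiLU(),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),                                              # per-channel gate in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.gate(x)                                        # reweight channels


class ConvDiffusionBlock(nn.Module):
    """Illustrative convolution-only block: pointwise -> depthwise -> channel gate -> pointwise."""

    def __init__(self, channels: int):
        super().__init__()
        self.norm = nn.GroupNorm(32, channels)
        self.pw1 = nn.Conv2d(channels, channels, kernel_size=1)
        self.dw = nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=channels)
        self.attn = CompactChannelAttention(channels)
        self.pw2 = nn.Conv2d(channels, channels, kernel_size=1)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.act(self.pw1(self.norm(x)))
        h = self.act(self.dw(h))
        h = self.attn(h)
        return x + self.pw2(h)                                         # residual connection


# Usage: x = torch.randn(2, 256, 32, 32); y = ConvDiffusionBlock(256)(x)
```

The point of the sketch is only to show where a channel-reweighting gate can slot into a standard ConvNet block so that, unlike plain convolution, the network is pushed to activate a wider set of channels; consult the linked repository for the actual DiCo architecture.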