DiCo: Revitalizing ConvNets for Scalable and Efficient Diffusion Modeling

Abstract

Diffusion Transformer (DiT), a promising diffusion model for visual generation, demonstrates impressive performance but incurs significant computational overhead. Intriguingly, analysis of pre-trained DiT models reveals that global self-attention is often redundant and predominantly captures local patterns, which highlights the potential for more efficient alternatives. In this paper, we revisit convolution as an alternative building block for constructing efficient and expressive diffusion models. However, naively replacing self-attention with convolution typically results in degraded performance. Our investigations attribute this performance gap to the higher channel redundancy in ConvNets compared to Transformers. To resolve this, we introduce a compact channel attention mechanism that promotes the activation of more diverse channels, thereby enhancing feature diversity. This leads to Diffusion ConvNet (DiCo), a family of diffusion models built entirely from standard ConvNet modules, offering strong generative performance with significant efficiency gains. On class-conditional ImageNet benchmarks, DiCo outperforms previous diffusion models in both image quality and generation speed. Notably, DiCo-XL achieves an FID of 2.05 at 256x256 resolution and 2.53 at 512x512, with speedups of 2.7x and 3.1x over DiT-XL/2, respectively. Furthermore, our largest model, DiCo-H, scaled to 1B parameters, reaches an FID of 1.90 on ImageNet 256x256 without any additional supervision during training.

Code: https://github.com/shallowdream204/DiCo
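The abstract names a compact channel attention mechanism but does not spell out its design. The following is a minimal sketch of one plausible form, assuming a squeeze-and-excitation-style gate: global average pooling per channel, a 1x1 channel mixing, and a sigmoid that rescales each channel. The function name `compact_channel_attention` and the parameters `weight` and `bias` are hypothetical and are not taken from the DiCo repository; plain Python lists stand in for tensors so the example has no dependencies.

```python
import math


def compact_channel_attention(x, weight, bias):
    """Rescale each channel of x by a gate computed from global statistics.

    Assumed (hypothetical) design, not the authors' confirmed implementation.
    x:      feature map as nested lists, shape [C][H][W]
    weight: C x C matrix, acting like a 1x1 convolution on the pooled vector
    bias:   length-C vector
    Returns (gated feature map, per-channel gates).
    """
    channels = len(x)
    # Global average pooling: one scalar summary per channel.
    pooled = [
        sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
        for ch in x
    ]
    # On a 1x1 spatial map, a 1x1 convolution is a matrix-vector product.
    logits = [
        sum(weight[o][i] * pooled[i] for i in range(channels)) + bias[o]
        for o in range(channels)
    ]
    # Sigmoid gate in (0, 1) for each channel.
    gates = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    # Multiply every spatial position of channel c by that channel's gate.
    out = [
        [[v * gates[c] for v in row] for row in x[c]]
        for c in range(channels)
    ]
    return out, gates


if __name__ == "__main__":
    # Two channels with the same mean but different spatial patterns.
    x = [
        [[1.0, 1.0], [1.0, 1.0]],
        [[2.0, 0.0], [0.0, 2.0]],
    ]
    identity = [[1.0, 0.0], [0.0, 1.0]]
    out, gates = compact_channel_attention(x, identity, [0.0, 0.0])
    print(gates)
    print(out)
```

Because the gate depends only on pooled statistics, its cost is independent of spatial resolution, which fits the abstract's emphasis on efficiency. In a learned model, `weight` and `bias` would be trained so that the gate can boost channels that would otherwise stay inactive.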
