ZigMa: Zigzag Mamba Diffusion Model

Abstract

Diffusion models have long been plagued by scalability and quadratic complexity issues, especially within transformer-based structures. In this study, we aim to leverage the long-sequence modeling capability of Mamba, a State-Space Model, to extend its applicability to visual data generation. First, we identify a critical oversight in most current Mamba-based vision methods, namely the lack of consideration for spatial continuity in Mamba's scan scheme. Second, building on this insight, we introduce Zigzag Mamba, a simple, plug-and-play, zero-parameter method that outperforms Mamba-based baselines and demonstrates improved speed and memory utilization compared to transformer-based baselines. Finally, we integrate Zigzag Mamba with the Stochastic Interpolant framework to investigate the scalability of the model on large-resolution visual datasets, such as FacesHQ 1024×1024, UCF101, MultiModal-CelebA-HQ, and MS COCO 256×256. Code will be released at https://taohu.me/zigma/.
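To make the notion of spatial continuity in a scan scheme concrete, the following is a minimal sketch, not the authors' implementation. It compares a plain row-major (raster) ordering of image patches with one row-wise zigzag ordering. The abstract does not specify the exact scan paths, so the function names (`raster_scan`, `zigzag_scan`, `max_jump`, `inverse_permutation`) and the choice of a single row-wise zigzag are assumptions made purely for illustration.

```python
def raster_scan(height, width):
    """Row-major order: every row is read left to right."""
    return [r * width + c for r in range(height) for c in range(width)]


def zigzag_scan(height, width):
    """Row-wise zigzag (illustrative): even rows left to right, odd rows right to left."""
    order = []
    for r in range(height):
        cols = range(width) if r % 2 == 0 else range(width - 1, -1, -1)
        order.extend(r * width + c for c in cols)
    return order


def max_jump(order, width):
    """Largest Manhattan distance on the grid between consecutive tokens in the scan."""
    coords = [divmod(i, width) for i in order]
    return max(
        abs(r1 - r0) + abs(c1 - c0)
        for (r0, c0), (r1, c1) in zip(coords, coords[1:])
    )


def inverse_permutation(order):
    """Indices that restore the original row-major token order after a scan."""
    inv = [0] * len(order)
    for pos, idx in enumerate(order):
        inv[idx] = pos
    return inv


if __name__ == "__main__":
    h, w = 4, 4
    print("raster max jump:", max_jump(raster_scan(h, w), w))
    print("zigzag max jump:", max_jump(zigzag_scan(h, w), w))
```

In the raster ordering, the token at the end of one row is followed by the token at the start of the next, which lies far away on the grid. In the zigzag ordering, consecutive tokens are always grid neighbors. Because the reordering is only an index permutation (undone with `inverse_permutation`), a sketch like this adds no learnable parameters, which is in keeping with the "zero-parameter" description in the abstract.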
