
ZigMa: Zigzag Mamba Diffusion Model

March 20, 2024
Authors: Vincent Tao Hu, Stefan Andreas Baumann, Ming Gui, Olga Grebenkova, Pingchuan Ma, Johannes Fischer, Björn Ommer
cs.AI

Abstract

The diffusion model has long been plagued by scalability and quadratic complexity issues, especially within transformer-based structures. In this study, we aim to leverage the long sequence modeling capability of a State-Space Model called Mamba to extend its applicability to visual data generation. Firstly, we identify a critical oversight in most current Mamba-based vision methods, namely the lack of consideration for spatial continuity in the scan scheme of Mamba. Secondly, building upon this insight, we introduce a simple, plug-and-play, zero-parameter method named Zigzag Mamba, which outperforms Mamba-based baselines and demonstrates improved speed and memory utilization compared to transformer-based baselines. Lastly, we integrate Zigzag Mamba with the Stochastic Interpolant framework to investigate the scalability of the model on large-resolution visual datasets, such as FacesHQ 1024×1024 and UCF101, MultiModal-CelebA-HQ, and MS COCO 256×256. Code will be released at https://taohu.me/zigma/.
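To make the spatial-continuity point concrete, below is a minimal sketch (an illustration, not the authors' released code) of a boustrophedon "zigzag" scan over a grid of image patches: reversing every other row keeps consecutive tokens spatially adjacent, whereas a plain raster scan jumps across the image at every row boundary. The helper name zigzag_order and the grid size are assumptions made for this example; the paper generalizes the idea to a family of space-continuous scans alternated across layers.

    import numpy as np

    def zigzag_order(height, width):
        """Flattened patch order for a boustrophedon (zigzag) scan.

        Reversing every other row keeps consecutive tokens spatially
        adjacent, unlike a raster scan, which jumps from the end of one
        row to the start of the next.
        """
        idx = np.arange(height * width).reshape(height, width)
        idx[1::2] = idx[1::2, ::-1].copy()  # reverse every other row
        return idx.ravel()

    # Usage: permute patch tokens before an SSM block, restore after.
    order = zigzag_order(4, 4)
    inverse = np.argsort(order)           # inverse permutation
    tokens = np.arange(16)                # stand-in for H*W patch tokens
    scanned = tokens[order]               # sequence fed to the Mamba block
    restored = scanned[inverse]           # undo the permutation afterwards
    assert (restored == tokens).all()

In a real model the same index trick would apply along the sequence dimension of a (batch, length, channels) token tensor; since it is pure reindexing, it adds no parameters, matching the paper's "zero-parameter" claim.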
