

Scalable Autoregressive Image Generation with Mamba

August 22, 2024
Authors: Haopeng Li, Jinyue Yang, Kexin Wang, Xuerui Qiu, Yuhong Chou, Xin Li, Guoqi Li
cs.AI

Abstract

We introduce AiM, an autoregressive (AR) image generative model based on the Mamba architecture. AiM employs Mamba, a novel state-space model characterized by its exceptional performance in long-sequence modeling with linear time complexity, to supplant the commonly utilized Transformers in AR image generation models, aiming to achieve both superior generation quality and enhanced inference speed. Unlike existing methods that adapt Mamba to handle two-dimensional signals via multi-directional scans, AiM directly utilizes the next-token prediction paradigm for autoregressive image generation. This approach circumvents the need for extensive modifications to enable Mamba to learn 2D spatial representations. By implementing straightforward yet strategically targeted modifications for visual generative tasks, we preserve Mamba's core structure, fully exploiting its efficient long-sequence modeling capabilities and scalability. We provide AiM models in various scales, with parameter counts ranging from 148M to 1.3B. On the ImageNet1K 256×256 benchmark, our best AiM model achieves an FID of 2.21, surpassing all existing AR models of comparable parameter counts and demonstrating significant competitiveness against diffusion models, with 2 to 10 times faster inference speed. Code is available at https://github.com/hp-l33/AiM.
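To make the next-token prediction paradigm described above concrete, below is a minimal sketch of the autoregressive sampling loop over discrete image tokens (e.g., a 16×16 grid of VQ codebook indices). This is not the AiM implementation: the backbone here is a stand-in recurrent layer (`nn.GRU`) playing the role of AiM's stacked Mamba blocks, and all names (`ToyARImageModel`, `generate`, `vocab_size`, etc.) are illustrative assumptions, not taken from the paper or its repository.

```python
# Sketch of next-token AR image generation. The GRU is a placeholder for
# Mamba blocks; like Mamba's state-space recurrence, it carries a fixed-size
# state forward each step instead of re-attending to the whole prefix,
# which is the source of the linear-time inference advantage over
# Transformer decoders claimed in the abstract.
import torch
import torch.nn as nn

class ToyARImageModel(nn.Module):
    def __init__(self, vocab_size=1024, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Placeholder causal sequence model (AiM: stacked Mamba blocks).
        self.backbone = nn.GRU(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, state=None):
        h, state = self.backbone(self.embed(tokens), state)
        return self.head(h), state

@torch.no_grad()
def generate(model, start_token, seq_len=256, temperature=1.0):
    """Sample seq_len image tokens one at a time, reusing the recurrent state."""
    tokens = torch.full((1, 1), start_token, dtype=torch.long)
    state, out = None, []
    for _ in range(seq_len):
        logits, state = model(tokens, state)          # only the newest token is fed in
        probs = torch.softmax(logits[:, -1] / temperature, dim=-1)
        tokens = torch.multinomial(probs, num_samples=1)  # draw the next token
        out.append(tokens)
    return torch.cat(out, dim=1)  # (1, seq_len) indices for a VQ decoder

model = ToyARImageModel()
indices = generate(model, start_token=0)
print(indices.shape)  # torch.Size([1, 256])
```

In a real setup one would swap the GRU for actual Mamba blocks (e.g., the `Mamba` module from the `mamba-ssm` package) and decode the sampled indices back to pixels with the VQ tokenizer's decoder; the sampling loop itself is unchanged, which is the point of keeping the plain 1D next-token formulation rather than multi-directional 2D scans.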