Scalable Autoregressive Image Generation with Mamba
August 22, 2024
Authors: Haopeng Li, Jinyue Yang, Kexin Wang, Xuerui Qiu, Yuhong Chou, Xin Li, Guoqi Li
cs.AI
Abstract
We introduce AiM, an autoregressive (AR) image generative model based on
the Mamba architecture. AiM employs Mamba, a novel state-space model characterized
by its exceptional performance for long-sequence modeling with linear time
complexity, to supplant the commonly utilized Transformers in AR image
generation models, aiming to achieve both superior generation quality and
enhanced inference speed. Unlike existing methods that adapt Mamba to handle
two-dimensional signals via multi-directional scan, AiM directly utilizes the
next-token prediction paradigm for autoregressive image generation. This
approach circumvents the need for extensive modifications to enable Mamba to
learn 2D spatial representations. By implementing straightforward yet
strategically targeted modifications for visual generative tasks, we preserve
Mamba's core structure, fully exploiting its efficient long-sequence modeling
capabilities and scalability. We provide AiM models in various scales, with
parameter counts ranging from 148M to 1.3B. On the ImageNet1K 256×256
benchmark, our best AiM model achieves an FID of 2.21, surpassing all existing
AR models of comparable parameter counts and demonstrating significant
competitiveness against diffusion models, with 2 to 10 times faster inference
speed. Code is available at https://github.com/hp-l33/AiM
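
To make the next-token prediction idea concrete, here is a minimal sketch of autoregressive generation over a flattened grid of image tokens. It is not the AiM implementation. It assumes the image is represented as a grid of discrete token ids read in raster (row by row) order, which the abstract does not specify, and it replaces Mamba with a hypothetical toy model (`ToyRecurrentLM`) that has random weights and a simple decaying linear recurrence. All names, including `start_token`, are illustrative. The only point is that a model with a fixed-size recurrent state does constant work per generated token, so generation cost grows linearly with sequence length.

```python
import math
import random


def raster_flatten(grid):
    """Flatten a 2D grid of token ids row by row into a 1D sequence."""
    return [tok for row in grid for tok in row]


def raster_unflatten(seq, width):
    """Inverse of raster_flatten; len(seq) must be a multiple of width."""
    return [seq[i:i + width] for i in range(0, len(seq), width)]


class ToyRecurrentLM:
    """Hypothetical stand-in for a sequence model with a fixed-size state.

    This is NOT Mamba: it is an untrained linear recurrence
    h_t = decay * h_{t-1} + embed[token_t] with a linear readout to logits.
    """

    def __init__(self, vocab_size, state_dim=8, decay=0.9, seed=0):
        rng = random.Random(seed)
        self.state_dim = state_dim
        self.decay = decay
        self.embed = [[rng.gauss(0, 1) for _ in range(state_dim)]
                      for _ in range(vocab_size)]
        self.readout = [[rng.gauss(0, 1) for _ in range(state_dim)]
                        for _ in range(vocab_size)]

    def init_state(self):
        return [0.0] * self.state_dim

    def step(self, state, token):
        """Consume one token; return the new state and next-token logits."""
        new_state = [self.decay * h + e
                     for h, e in zip(state, self.embed[token])]
        logits = [sum(w * h for w, h in zip(row, new_state))
                  for row in self.readout]
        return new_state, logits


def sample(logits, rng):
    """Draw one token id from softmax(logits)."""
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(model, height, width, start_token=0, seed=0):
    """Generate height*width tokens one at a time, then reshape to a grid.

    start_token is an assumed placeholder for whatever conditioning
    token a real model would use.
    """
    rng = random.Random(seed)
    state = model.init_state()
    token = start_token
    seq = []
    for _ in range(height * width):  # one constant-cost step per token
        state, logits = model.step(state, token)
        token = sample(logits, rng)
        seq.append(token)
    return raster_unflatten(seq, width)


if __name__ == "__main__":
    grid = generate(ToyRecurrentLM(vocab_size=16), height=4, width=4)
    for row in grid:
        print(row)
```

Because the model here is untrained, the sampled tokens are meaningless; in a real system the token grid would be passed to an image decoder to produce pixels.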