NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale
August 14, 2025
Authors: NextStep Team, Chunrui Han, Guopeng Li, Jingwei Wu, Quan Sun, Yan Cai, Yuang Peng, Zheng Ge, Deyu Zhou, Haomiao Tang, Hongyu Zhou, Kenkun Liu, Ailin Huang, Bin Wang, Changxin Miao, Deshan Sun, En Yu, Fukun Yin, Gang Yu, Hao Nie, Haoran Lv, Hanpeng Hu, Jia Wang, Jian Zhou, Jianjian Sun, Kaijun Tan, Kang An, Kangheng Lin, Liang Zhao, Mei Chen, Peng Xing, Rui Wang, Shiyu Liu, Shutao Xia, Tianhao You, Wei Ji, Xianfang Zeng, Xin Han, Xuelin Zhang, Yana Wei, Yanming Xu, Yimin Jiang, Yingming Wang, Yu Zhou, Yucheng Han, Ziyang Meng, Binxing Jiao, Daxin Jiang, Xiangyu Zhang, Yibo Zhu
cs.AI
Abstract
Prevailing autoregressive (AR) models for text-to-image generation either rely on heavy, computationally intensive diffusion models to process continuous image tokens, or employ vector quantization (VQ) to obtain discrete tokens at the cost of quantization loss. In this paper, we push the autoregressive paradigm forward with NextStep-1, a 14B autoregressive model paired with a 157M flow matching head, trained on discrete text tokens and continuous image tokens with a next-token prediction objective. NextStep-1 achieves state-of-the-art performance among autoregressive models in text-to-image generation, exhibiting strong capabilities in high-fidelity image synthesis. Furthermore, our method excels at image editing, highlighting the power and versatility of our unified approach. To facilitate open research, we will release our code and models to the community.
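
To make the training objective concrete, here is a minimal sketch, assuming a PyTorch setup, of how a small flow matching head could supervise continuous image tokens alongside cross-entropy on discrete text tokens under a single next-token prediction objective. All module names, dimensions, loss weighting, and the rectified-flow interpolation schedule are illustrative assumptions, not the NextStep-1 implementation.

```python
# Hypothetical sketch: next-token prediction over a mixed sequence, with
# cross-entropy on discrete text tokens and a flow matching loss on
# continuous image tokens. Names and sizes are assumptions; this is not
# the NextStep-1 release code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlowMatchingHead(nn.Module):
    """Small MLP that predicts a flow matching velocity for one continuous
    image token, conditioned on the backbone hidden state at that position."""
    def __init__(self, hidden_dim: int, token_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim + token_dim + 1, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, token_dim),
        )

    def loss(self, h: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # h:  (N, hidden_dim) hidden states at image-token positions
        # x1: (N, token_dim)  ground-truth continuous image tokens
        x0 = torch.randn_like(x1)                        # Gaussian noise sample
        t = torch.rand(x1.size(0), 1, device=x1.device)  # time t ~ U[0, 1]
        xt = (1.0 - t) * x0 + t * x1                     # linear interpolation
        v_target = x1 - x0                               # rectified-flow velocity
        v_pred = self.net(torch.cat([h, xt, t], dim=-1))
        return F.mse_loss(v_pred, v_target)

def mixed_next_token_loss(text_logits, text_targets, img_hidden, img_tokens, head):
    """One objective over the mixed sequence: cross-entropy for text tokens,
    flow matching for image tokens. Equal weighting is assumed for illustration."""
    ce = F.cross_entropy(text_logits, text_targets)
    fm = head.loss(img_hidden, img_tokens)
    return ce + fm

# Toy usage with random data (all dimensions arbitrary).
head = FlowMatchingHead(hidden_dim=256, token_dim=16)
text_logits = torch.randn(8, 1000)           # 8 text positions, vocab of 1000
text_targets = torch.randint(0, 1000, (8,))
img_hidden = torch.randn(32, 256)            # 32 image-token positions
img_tokens = torch.randn(32, 16)
print(mixed_next_token_loss(text_logits, text_targets, img_hidden, img_tokens, head))
```

The sketch covers only the training loss; at inference, each continuous token would presumably be sampled by integrating the learned velocity field (an ODE solve) conditioned on the running hidden state, while text tokens are decoded as usual from the logits.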