NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale
August 14, 2025
Authors: NextStep Team, Chunrui Han, Guopeng Li, Jingwei Wu, Quan Sun, Yan Cai, Yuang Peng, Zheng Ge, Deyu Zhou, Haomiao Tang, Hongyu Zhou, Kenkun Liu, Ailin Huang, Bin Wang, Changxin Miao, Deshan Sun, En Yu, Fukun Yin, Gang Yu, Hao Nie, Haoran Lv, Hanpeng Hu, Jia Wang, Jian Zhou, Jianjian Sun, Kaijun Tan, Kang An, Kangheng Lin, Liang Zhao, Mei Chen, Peng Xing, Rui Wang, Shiyu Liu, Shutao Xia, Tianhao You, Wei Ji, Xianfang Zeng, Xin Han, Xuelin Zhang, Yana Wei, Yanming Xu, Yimin Jiang, Yingming Wang, Yu Zhou, Yucheng Han, Ziyang Meng, Binxing Jiao, Daxin Jiang, Xiangyu Zhang, Yibo Zhu
cs.AI
Abstract
Prevailing autoregressive (AR) models for text-to-image generation either
rely on heavy, computationally intensive diffusion models to process continuous
image tokens, or employ vector quantization (VQ), which yields discrete tokens
at the cost of quantization loss. In this paper, we push the autoregressive
paradigm forward with NextStep-1, a 14B autoregressive model paired with a
157M flow matching head, trained on discrete text tokens and continuous image
tokens with a next-token prediction objective. NextStep-1 achieves
state-of-the-art performance among autoregressive models in text-to-image
generation, exhibiting strong capabilities in high-fidelity image synthesis.
Furthermore, our method shows strong performance in image editing,
highlighting the power and versatility of our unified approach. To facilitate
open research, we will release our code and models to the community.
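To make the abstract's setup concrete, the sketch below shows the standard conditional flow-matching training objective that a small head can optimize for a single continuous image token. This is an illustrative toy, not NextStep-1's released implementation: the paper's 157M head would condition on the 14B backbone's hidden state, and the `predict_velocity` callable here is a hypothetical stand-in for that head.

```python
import random

def flow_matching_loss(x1, predict_velocity, rng):
    """One flow-matching training step for a single continuous token.

    x1: target token (a list of floats, e.g. a latent patch).
    predict_velocity: callable (x_t, t) -> predicted velocity; in the
    real model this head would also see the AR backbone's hidden state.
    """
    t = rng.random()                          # sample time uniformly in [0, 1]
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]    # Gaussian noise endpoint
    # Linear path x_t = (1 - t) * x0 + t * x1; its velocity is x1 - x0.
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    v_target = [b - a for a, b in zip(x1, x0)]
    v_pred = predict_velocity(xt, t)
    # Regress the head's output onto the target velocity (mean squared error).
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(x1)

# An untrained "head" that always predicts zero velocity incurs a positive loss.
rng = random.Random(0)
loss = flow_matching_loss([0.5, -1.2, 0.3], lambda xt, t: [0.0] * len(xt), rng)
print(loss > 0)
```

At generation time, the same head would be run in reverse: integrate the learned velocity field from noise at t = 0 to a token at t = 1 (e.g. with a few Euler steps), then feed the decoded token back to the autoregressive backbone for the next position.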