NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation
January 5, 2026
Authors: Huichao Zhang, Liao Qu, Yiheng Liu, Hang Chen, Yangyang Song, Yongsheng Dong, Shikun Sun, Xian Li, Xu Wang, Yi Jiang, Hu Ye, Bo Chen, Yiming Gao, Peng Liu, Akide Liu, Zhipeng Yang, Qili Deng, Linjie Xing, Jiyang Liu, Zhao Wang, Yang Zhou, Mingcong Liu, Yi Zhang, Qian He, Xiwei Hu, Zhongqi Qi, Jie Shao, Zhiye Fu, Shuai Wang, Fangmin Chen, Xuezhi Chai, Zhihua Wu, Yitong Wang, Zehuan Yuan, Daniel K. Du, Xinglong Wu
cs.AI
Abstract
We present NextFlow, a unified decoder-only autoregressive transformer trained on 6 trillion interleaved text-image discrete tokens. By leveraging a unified vision representation within a unified autoregressive architecture, NextFlow natively activates multimodal understanding and generation capabilities, unlocking image editing, interleaved content generation, and video generation. Motivated by the distinct nature of the two modalities (text is strictly sequential, while images are inherently hierarchical), we retain next-token prediction for text but adopt next-scale prediction for visual generation. This departs from traditional raster-scan methods and enables generation of a 1024x1024 image in just 5 seconds, orders of magnitude faster than comparable AR models. We address the instabilities of multi-scale generation through a robust training recipe. Furthermore, we introduce a prefix-tuning strategy for reinforcement learning. Experiments demonstrate that NextFlow achieves state-of-the-art performance among unified models and rivals specialized diffusion baselines in visual quality.
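To make the next-scale paradigm concrete, the sketch below contrasts it with raster-scan decoding. This is a minimal illustration under assumed interfaces: the function name generate_image_tokens, the model(context, scale=...) call signature, and the scale schedule are hypothetical placeholders, not NextFlow's published API.

```python
# A minimal sketch of next-scale decoding, assuming a VAR-style interface.
# All names here (generate_image_tokens, the model call signature, the
# scale schedule) are illustrative assumptions, not the NextFlow release.
import torch

def generate_image_tokens(model, text_tokens, scales=(1, 2, 4, 8, 16, 32, 64)):
    """Decode a coarse-to-fine token pyramid: one forward pass per scale."""
    context = text_tokens            # (batch, seq_len) prompt token ids
    token_maps = []
    for h in scales:
        # Unlike raster-scan AR, the model predicts all h*h positions of the
        # current scale in parallel, conditioned on the prompt and on every
        # coarser scale already in the context.
        logits = model(context, scale=h)                 # (batch, h*h, vocab)
        next_map = torch.distributions.Categorical(logits=logits).sample()
        token_maps.append(next_map.view(-1, h, h))
        context = torch.cat([context, next_map], dim=1)  # grow the sequence
    return token_maps  # a multi-scale VQ decoder would map these to pixels
```

Under this assumed schedule, a 64x64 token grid takes 7 forward passes instead of the 4096 sequential steps a raster-scan decoder would need for the same grid, which is the kind of gap behind the reported speedup; real next-scale implementations typically also condition on upsampled features from coarser scales rather than raw token ids.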