

NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation

January 5, 2026
作者: Huichao Zhang, Liao Qu, Yiheng Liu, Hang Chen, Yangyang Song, Yongsheng Dong, Shikun Sun, Xian Li, Xu Wang, Yi Jiang, Hu Ye, Bo Chen, Yiming Gao, Peng Liu, Akide Liu, Zhipeng Yang, Qili Deng, Linjie Xing, Jiyang Liu, Zhao Wang, Yang Zhou, Mingcong Liu, Yi Zhang, Qian He, Xiwei Hu, Zhongqi Qi, Jie Shao, Zhiye Fu, Shuai Wang, Fangmin Chen, Xuezhi Chai, Zhihua Wu, Yitong Wang, Zehuan Yuan, Daniel K. Du, Xinglong Wu
cs.AI

Abstract

We present NextFlow, a unified decoder-only autoregressive transformer trained on 6 trillion interleaved text-image discrete tokens. By leveraging a unified vision representation within a single autoregressive architecture, NextFlow natively activates multimodal understanding and generation, unlocking image editing, interleaved content generation, and video generation. Motivated by the distinct nature of the modalities, where text is strictly sequential and images are inherently hierarchical, we retain next-token prediction for text but adopt next-scale prediction for visual generation. This departs from traditional raster-scan methods and enables the generation of 1024x1024 images in just 5 seconds, orders of magnitude faster than comparable AR models. We address the instabilities of multi-scale generation through a robust training recipe and introduce a prefix-tuning strategy for reinforcement learning. Experiments demonstrate that NextFlow achieves state-of-the-art performance among unified models and rivals specialized diffusion baselines in visual quality.
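The abstract contrasts next-token prediction for text with next-scale prediction for images. The following is a minimal Python sketch of what those two decoding loops could look like under a toy discrete-token setup; the function names (`predict_next_token`, `predict_scale`), the vocabulary size, and the scale schedule are hypothetical placeholders for illustration and are not NextFlow's actual API or architecture.

```python
# Minimal sketch contrasting next-token and next-scale decoding loops.
# All names and constants here are illustrative placeholders, not NextFlow's API.

import random

VOCAB_SIZE = 16          # toy discrete-token vocabulary
SCALES = [1, 2, 4, 8]    # token-map side lengths, coarse to fine (assumed schedule)

def predict_next_token(context):
    """Stand-in for a decoder forward pass returning one text token."""
    return random.randrange(VOCAB_SIZE)

def predict_scale(context, side):
    """Stand-in for predicting an entire side x side token map in one step."""
    return [[random.randrange(VOCAB_SIZE) for _ in range(side)] for _ in range(side)]

def generate_text(prompt_tokens, max_new=8):
    # Text is strictly sequential: one token per decoding step.
    seq = list(prompt_tokens)
    for _ in range(max_new):
        seq.append(predict_next_token(seq))
    return seq

def generate_image(prompt_tokens):
    # Images are hierarchical: one full token map per scale, coarse to fine.
    # Each step conditions on the prompt plus all previously generated scales.
    context = list(prompt_tokens)
    scale_maps = []
    for side in SCALES:
        token_map = predict_scale(context, side)
        scale_maps.append(token_map)
        context.extend(t for row in token_map for t in row)
    return scale_maps  # the finest map would be decoded to pixels by a VQ decoder

print(len(generate_text([1, 2, 3])))          # 11: 3 prompt + 8 new text tokens
print([len(m) for m in generate_image([1])])  # token-map sides: [1, 2, 4, 8]
```

Note how the image loop takes only one decoding step per scale rather than one step per visual token, which is consistent with the abstract's claim of being orders of magnitude faster than raster-scan autoregressive decoding.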