
STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation

May 8, 2026
Authors: Ying Shen, Tianrong Chen, Yuan Gao, Yizhe Zhang, Yuyang Wang, Miguel Ángel Bautista, Shuangfei Zhai, Joshua M. Susskind, Jiatao Gu
cs.AI

Abstract

Deep generative models have advanced rapidly across text and vision, motivating unified multimodal systems that can understand, reason over, and generate interleaved text-image sequences. Most existing approaches combine autoregressive language modeling with diffusion-based image generators, inheriting a structural mismatch between causal text generation and iterative visual denoising. We observe that autoregressive normalizing flows are autoregressive Transformers -- sharing the same causal mask, KV-cache mechanism, and left-to-right structure as LLMs -- making them the most natural paradigm for true unified multimodal generation. We present STARFlow2, built on the Pretzel architecture, which vertically interleaves a pretrained VLM stream with a TarFlow stream via residual skip connections, both operating under the same causal mask. Combined with a deep-shallow flow design and a unified FAE latent space, STARFlow2 enables cache-friendly interleaved generation, where both text and visual outputs directly enter the KV-cache without re-encoding. Experiments demonstrate strong performance across image generation and multimodal understanding benchmarks, validating autoregressive flows as a viable foundation for unified multimodal modeling.
May 12, 2026