
STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation

May 8, 2026
Authors: Ying Shen, Tianrong Chen, Yuan Gao, Yizhe Zhang, Yuyang Wang, Miguel Ángel Bautista, Shuangfei Zhai, Joshua M. Susskind, Jiatao Gu
cs.AI

Abstract

Deep generative models have advanced rapidly across text and vision, motivating unified multimodal systems that can understand, reason over, and generate interleaved text-image sequences. Most existing approaches combine autoregressive language modeling with diffusion-based image generators, inheriting a structural mismatch between causal text generation and iterative visual denoising. We observe that autoregressive normalizing flows are autoregressive Transformers -- sharing the same causal mask, KV-cache mechanism, and left-to-right structure as LLMs -- making them the most natural paradigm for true unified multimodal generation. We present STARFlow2, built on the Pretzel architecture, which vertically interleaves a pretrained VLM stream with a TarFlow stream via residual skip connections, both operating under the same causal mask. Combined with a deep-shallow flow design and a unified FAE latent space, STARFlow2 enables cache-friendly interleaved generation, where both text and visual outputs directly enter the KV-cache without re-encoding. Experiments demonstrate strong performance across image generation and multimodal understanding benchmarks, validating autoregressive flows as a viable foundation for unified multimodal modeling.
May 12, 2026