STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation
May 8, 2026
Authors: Ying Shen, Tianrong Chen, Yuan Gao, Yizhe Zhang, Yuyang Wang, Miguel Ángel Bautista, Shuangfei Zhai, Joshua M. Susskind, Jiatao Gu
cs.AI
Abstract
Deep generative models have advanced rapidly across text and vision, motivating unified multimodal systems that can understand, reason over, and generate interleaved text-image sequences. Most existing approaches combine autoregressive language modeling with diffusion-based image generators, inheriting a structural mismatch between causal text generation and iterative visual denoising. We observe that autoregressive normalizing flows are themselves autoregressive Transformers: they share the same causal mask, KV-cache mechanism, and left-to-right structure as LLMs, which makes them the most natural paradigm for truly unified multimodal generation. We present STARFlow2, built on the Pretzel architecture, which vertically interleaves a pretrained VLM stream with a TarFlow stream via residual skip connections, both operating under the same causal mask. Combined with a deep-shallow flow design and a unified FAE latent space, STARFlow2 enables cache-friendly interleaved generation in which both text and visual outputs enter the KV-cache directly, without re-encoding. Experiments demonstrate strong performance across image generation and multimodal understanding benchmarks, validating autoregressive flows as a viable foundation for unified multimodal modeling.
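To make the central claim concrete, that an autoregressive normalizing flow is structurally the same object as a causal Transformer, the following is a minimal PyTorch sketch of a TarFlow-style autoregressive affine block. It is an illustration of the general technique, not the paper's code: the class name, dimensions, and layer counts are hypothetical. The density pass (x to z) runs in parallel under a causal mask, exactly like LLM training with teacher forcing, while the sampling pass (z to x) decodes left-to-right, exactly like LLM inference.

```python
import torch
import torch.nn as nn


class CausalARFlowBlock(nn.Module):
    """One autoregressive affine flow block driven by a causal Transformer.

    Per-token scale/shift parameters are predicted from strictly earlier
    tokens, so the forward (density) pass is parallel and the inverse
    (sampling) pass is sequential. Hypothetical sketch; all sizes are
    illustrative, not the paper's configuration.
    """

    def __init__(self, dim: int = 64, heads: int = 4, depth: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            dropout=0.0, batch_first=True,
        )
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)
        self.to_scale_shift = nn.Linear(dim, 2 * dim)
        self.bos = nn.Parameter(torch.zeros(1, 1, dim))  # learned start token

    def _affine_params(self, x: torch.Tensor):
        B, T, D = x.shape
        # Shift right so the parameters at position t depend only on x[:, :t].
        h = torch.cat([self.bos.expand(B, 1, D), x[:, :-1]], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(x.device)
        h = self.backbone(h, mask=mask)
        log_s, b = self.to_scale_shift(h).chunk(2, dim=-1)
        return torch.tanh(log_s), b  # bounded log-scale for stability

    def forward(self, x: torch.Tensor):
        """x -> z, fully parallel; returns z and log|det J| per sample."""
        log_s, b = self._affine_params(x)
        z = (x - b) * torch.exp(-log_s)
        return z, -log_s.sum(dim=(1, 2))

    @torch.no_grad()
    def inverse(self, z: torch.Tensor) -> torch.Tensor:
        """z -> x, sequential like autoregressive decoding.

        A production implementation would reuse a KV-cache here instead
        of re-running the backbone at every step.
        """
        x = torch.zeros_like(z)
        for t in range(z.shape[1]):
            log_s, b = self._affine_params(x)
            x[:, t] = z[:, t] * torch.exp(log_s[:, t]) + b[:, t]
        return x


block = CausalARFlowBlock().eval()
x = torch.randn(2, 16, 64)      # a batch of 16-token latent sequences
z, logdet = block(x)            # parallel density pass
x_rec = block.inverse(z)        # sequential sampling pass
assert torch.allclose(x, x_rec, atol=1e-4)
```

Because the inverse pass consumes its own earlier outputs through the same causal mask an LLM uses, the block's attention states can live in the same KV-cache as the language stream, which is the property the abstract leans on for cache-friendly interleaved text-image generation.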