STARFlow-V: End-to-End Video Generative Modeling with Normalizing Flow
November 25, 2025
Authors: Jiatao Gu, Ying Shen, Tianrong Chen, Laurent Dinh, Yuyang Wang, Miguel Angel Bautista, David Berthelot, Josh Susskind, Shuangfei Zhai
cs.AI
Abstract
Normalizing flows (NFs) are end-to-end likelihood-based generative models for continuous data, and have recently regained attention with encouraging progress on image generation. Yet in the video generation domain, where spatiotemporal complexity and computational cost are substantially higher, state-of-the-art systems rely almost exclusively on diffusion-based models. In this work, we revisit this design space by presenting STARFlow-V, a normalizing-flow-based video generator with substantial benefits such as end-to-end learning, robust causal prediction, and native likelihood estimation. Building upon the recently proposed STARFlow, STARFlow-V operates in a spatiotemporal latent space with a global-local architecture that restricts causal dependencies to a global latent space while preserving rich local within-frame interactions. This design eases error accumulation over time, a common pitfall of standard autoregressive diffusion generation. Additionally, we propose flow-score matching, which equips the model with a lightweight causal denoiser to improve video generation consistency in an autoregressive fashion. To improve sampling efficiency, STARFlow-V employs a video-aware Jacobi iteration scheme that recasts inner updates as parallelizable iterations without breaking causality. Thanks to the invertible structure, the same model natively supports text-to-video, image-to-video, and video-to-video generation tasks. Empirically, STARFlow-V achieves strong visual fidelity and temporal consistency with practical sampling throughput relative to diffusion-based baselines. These results present the first evidence, to our knowledge, that NFs are capable of high-quality autoregressive video generation, establishing them as a promising research direction for building world models. Code and generated samples are available at https://github.com/apple/ml-starflow.
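The Jacobi iteration idea mentioned above can be illustrated in miniature. The abstract does not give STARFlow-V's exact update rule, so the following is a hedged toy sketch: a strictly causal linear recurrence stands in for the autoregressive flow, and a Jacobi-style scheme updates all time steps in parallel each sweep instead of decoding them one at a time. Because the coupling matrix `L` is strictly lower triangular (hence nilpotent), the parallel iteration provably matches the sequential solve after at most `T` sweeps; the names `L`, `z`, and `T` are illustrative, not from the paper.

```python
import numpy as np

# Toy causal system: x[t] = z[t] + sum_{s < t} L[t, s] * x[s].
# L is strictly lower triangular, so dependencies are strictly causal.
rng = np.random.default_rng(0)
T = 8
L = np.tril(rng.normal(size=(T, T)), k=-1)  # strictly causal coupling
z = rng.normal(size=T)

# Sequential (autoregressive) decoding: one position per step.
x_seq = np.zeros(T)
for t in range(T):
    x_seq[t] = z[t] + L[t] @ x_seq  # only x_seq[:t] is nonzero here

# Jacobi-style decoding: every position updated in parallel per sweep.
# Since L is nilpotent (L**T == 0), at most T sweeps recover the exact
# sequential solution; in practice far fewer sweeps may already converge.
x_jac = np.zeros(T)
for _ in range(T):
    x_jac = z + L @ x_jac

assert np.allclose(x_jac, x_seq)
```

The same trade applies at scale: each sweep is embarrassingly parallel across time steps, so when the iteration converges in far fewer sweeps than the sequence length, wall-clock sampling cost drops without violating the causal structure.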