
STARFlow-V: End-to-End Video Generative Modeling with Normalizing Flow

November 25, 2025
作者: Jiatao Gu, Ying Shen, Tianrong Chen, Laurent Dinh, Yuyang Wang, Miguel Angel Bautista, David Berthelot, Josh Susskind, Shuangfei Zhai
cs.AI

Abstract

Normalizing flows (NFs) are end-to-end likelihood-based generative models for continuous data, and have recently regained attention with encouraging progress on image generation. Yet in the video generation domain, where spatiotemporal complexity and computational cost are substantially higher, state-of-the-art systems rely almost exclusively on diffusion-based models. In this work, we revisit this design space by presenting STARFlow-V, a normalizing flow-based video generator with substantial benefits such as end-to-end learning, robust causal prediction, and native likelihood estimation. Building upon the recently proposed STARFlow, STARFlow-V operates in the spatiotemporal latent space with a global-local architecture that restricts causal dependencies to a global latent space while preserving rich local within-frame interactions. This eases error accumulation over time, a common pitfall of standard autoregressive diffusion model generation. Additionally, we propose flow-score matching, which equips the model with a lightweight causal denoiser to improve video generation consistency in an autoregressive fashion. To improve sampling efficiency, STARFlow-V employs a video-aware Jacobi iteration scheme that recasts inner updates as parallelizable iterations without breaking causality. Thanks to the invertible structure, the same model can natively support text-to-video, image-to-video, and video-to-video generation tasks. Empirically, STARFlow-V achieves strong visual fidelity and temporal consistency with practical sampling throughput relative to diffusion-based baselines. These results present the first evidence, to our knowledge, that NFs are capable of high-quality autoregressive video generation, establishing them as a promising research direction for building world models. Code and generated samples are available at https://github.com/apple/ml-starflow.
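The Jacobi iteration idea mentioned above can be illustrated in isolation: a causal recurrence x[t] = f(x[:t], z[t]) is a fixed-point equation, so instead of sweeping left to right one frame at a time, all positions can be updated in parallel from the previous iterate until the sequence stops changing. The sketch below uses a toy linear `causal_step` as a stand-in for the model's per-frame update (it is not the STARFlow-V update rule, and `jacobi_decode`/`sequential_decode` are illustrative names, not the paper's API); because information propagates one step per sweep, the iteration recovers the sequential solution exactly within T sweeps.

```python
import numpy as np

def causal_step(prefix, noise_t):
    # Toy stand-in for the model's per-frame update: each latent frame
    # depends only on strictly earlier frames (causality) plus its own
    # noise. This linear rule is purely illustrative.
    return 0.5 * (prefix.sum(axis=0) if len(prefix) else 0.0) + noise_t

def sequential_decode(noise):
    # Baseline: solve the causal recurrence left to right, T serial steps.
    x = np.zeros_like(noise)
    for t in range(len(noise)):
        x[t] = causal_step(x[:t], noise[t])
    return x

def jacobi_decode(noise, n_iters=50, tol=1e-10):
    # Jacobi fixed-point iteration: refresh EVERY position from the
    # previous iterate. Each sweep is embarrassingly parallel across t,
    # and causality is preserved because position t only reads x[:t].
    T = len(noise)
    x = np.zeros_like(noise)
    for _ in range(n_iters):
        x_new = np.array([causal_step(x[:t], noise[t]) for t in range(T)])
        converged = np.max(np.abs(x_new - x)) < tol
        x = x_new
        if converged:
            break
    return x
```

For a strictly causal system the iteration is exact, not approximate: after sweep k, positions 0..k-1 have already reached their final values, so at most T sweeps are needed, and in practice far fewer when later frames depend weakly on distant prefixes.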