Uniform Discrete Diffusion with Metric Path for Video Generation
October 28, 2025
Authors: Haoge Deng, Ting Pan, Fan Zhang, Yang Liu, Zhuoyan Luo, Yufeng Cui, Wenxuan Wang, Chunhua Shen, Shiguang Shan, Zhaoxiang Zhang, Xinlong Wang
cs.AI
Abstract
Continuous-space video generation has advanced rapidly, while discrete
approaches lag behind due to error accumulation and long-context inconsistency.
In this work, we revisit discrete generative modeling and present Uniform
discRete diffuSion with metric pAth (URSA), a simple yet powerful framework
that bridges the gap with continuous approaches for scalable video
generation. At its core, URSA formulates the video generation task as an
iterative global refinement of discrete spatiotemporal tokens. It integrates
two key designs: a Linearized Metric Path and a Resolution-dependent Timestep
Shifting mechanism. These designs enable URSA to scale efficiently to
high-resolution image synthesis and long-duration video generation, while
requiring significantly fewer inference steps. Additionally, we introduce an
asynchronous temporal fine-tuning strategy that unifies versatile tasks within
a single model, including interpolation and image-to-video generation.
Extensive experiments on challenging video and image generation benchmarks
demonstrate that URSA consistently outperforms existing discrete methods and
achieves performance comparable to state-of-the-art continuous diffusion
methods. Code and models are available at https://github.com/baaivision/URSA.
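To make the abstract's "iterative global refinement of discrete spatiotemporal tokens" and "resolution-dependent timestep shifting" more concrete, the following is a minimal, illustrative Python (PyTorch) sketch of what a uniform discrete diffusion sampler with a shifted timestep schedule might look like. Everything here is an assumption for illustration: the names (model, shift_timestep, sample, alpha), the specific shift formula, and the re-randomization rule are not taken from the paper; refer to the linked repository for the authors' actual implementation.

import torch

def shift_timestep(t: torch.Tensor, alpha: float) -> torch.Tensor:
    # Hypothetical resolution-dependent shift: warp a uniform schedule t in [0, 1]
    # so that more steps are spent at high noise; alpha would be chosen per
    # resolution / sequence length (assumption, not the paper's exact formula).
    return alpha * t / (1.0 + (alpha - 1.0) * t)

@torch.no_grad()
def sample(model, seq_len: int, vocab_size: int, steps: int = 32, alpha: float = 3.0):
    # Start from the uniform prior: every spatiotemporal token is a random codebook index.
    x = torch.randint(0, vocab_size, (1, seq_len))
    ts = torch.linspace(1.0, 0.0, steps + 1)
    for i in range(steps):
        t_cur = shift_timestep(ts[i], alpha)
        t_next = shift_timestep(ts[i + 1], alpha)
        # `model` is a hypothetical denoiser mapping noisy token ids and a timestep
        # to per-token logits over the codebook, shape (1, seq_len, vocab_size).
        logits = model(x, t_cur)
        x0 = logits.argmax(dim=-1)  # current guess of the clean token sequence
        # Global refinement step: keep the guess everywhere, but re-randomize a
        # fraction of positions proportional to the remaining noise level t_next.
        noise_mask = torch.rand(1, seq_len) < t_next
        rand_tokens = torch.randint(0, vocab_size, (1, seq_len))
        x = torch.where(noise_mask, rand_tokens, x0)
    return x

In this sketch, every position is revisited at every step (a global refinement over the whole token grid rather than left-to-right decoding), and the number of steps is decoupled from the sequence length, which is the property the abstract attributes to URSA's reduced inference-step count.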