Uniform Discrete Diffusion with Metric Path for Video Generation
October 28, 2025
Authors: Haoge Deng, Ting Pan, Fan Zhang, Yang Liu, Zhuoyan Luo, Yufeng Cui, Wenxuan Wang, Chunhua Shen, Shiguang Shan, Zhaoxiang Zhang, Xinlong Wang
cs.AI
Abstract
Continuous-space video generation has advanced rapidly, while discrete
approaches lag behind due to error accumulation and long-context inconsistency.
In this work, we revisit discrete generative modeling and present Uniform
discRete diffuSion with metric pAth (URSA), a simple yet powerful framework
that bridges the gap with continuous approaches for scalable video
generation. At its core, URSA formulates the video generation task as an
iterative global refinement of discrete spatiotemporal tokens. It integrates
two key designs: a Linearized Metric Path and a Resolution-dependent Timestep
Shifting mechanism. These designs enable URSA to scale efficiently to
high-resolution image synthesis and long-duration video generation, while
requiring significantly fewer inference steps. Additionally, we introduce an
asynchronous temporal fine-tuning strategy that unifies versatile tasks within
a single model, including interpolation and image-to-video generation.
Extensive experiments on challenging video and image generation benchmarks
demonstrate that URSA consistently outperforms existing discrete methods and
achieves performance comparable to state-of-the-art continuous diffusion
methods. Code and models are available at https://github.com/baaivision/URSA.
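For intuition, the sketch below illustrates what "iterative global refinement of discrete spatiotemporal tokens" combined with "resolution-dependent timestep shifting" could look like in code. It is a minimal illustration under stated assumptions, not URSA's implementation: the `shift_timestep` rule (borrowed from shift schedules used in continuous flow-matching models), the `base_tokens` constant, the `model(tokens, t)` interface, and the uniform re-noising heuristic are all assumptions made for clarity.

```python
import numpy as np

def shift_timestep(t: float, num_tokens: int, base_tokens: int = 256) -> float:
    """Resolution-dependent timestep shifting (illustrative only).

    Larger token grids have their schedule shifted toward noisier
    timesteps; the scaling rule and constants here are assumptions,
    not the schedule URSA actually uses.
    """
    shift = np.sqrt(num_tokens / base_tokens)  # assumed scaling with token count
    return shift * t / (1.0 + (shift - 1.0) * t)

def sample(model, shape, vocab_size, num_steps=25, rng=None):
    """Iterative global refinement of discrete tokens (illustrative only).

    Starts from uniformly random token ids and, at every step, asks the
    model for a prediction over all positions in parallel, then re-noises
    a shrinking fraction of positions back to uniform noise. `model` is a
    hypothetical denoiser assumed to return a NumPy array of per-position
    logits with shape (*shape, vocab_size).
    """
    rng = rng or np.random.default_rng(0)
    tokens = rng.integers(0, vocab_size, size=shape)   # pure uniform noise
    num_tokens = int(np.prod(shape))
    for step in range(num_steps):
        t_cur = shift_timestep(1.0 - step / num_steps, num_tokens)
        t_next = shift_timestep(1.0 - (step + 1) / num_steps, num_tokens)
        logits = model(tokens, t_cur)                   # global prediction
        pred = logits.argmax(axis=-1)                   # denoised proposal everywhere
        # Re-noise a t_next-fraction of positions so refinement stays iterative;
        # at the final step t_next == 0 and the clean prediction is kept.
        renoise = rng.random(shape) < t_next
        noise = rng.integers(0, vocab_size, size=shape)
        tokens = np.where(renoise, noise, pred)
    return tokens
```

Because every position can be revised at every step in a uniform-state formulation (rather than being frozen once unmasked, as in absorbing-state approaches), a loop of this shape gives one plausible reading of how global refinement can counter the error accumulation mentioned above.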