BlockVid: Block Diffusion for High-Quality and Consistent Minute-Long Video Generation
November 28, 2025
Authors: Zeyu Zhang, Shuning Chang, Yuanyu He, Yizeng Han, Jiasheng Tang, Fan Wang, Bohan Zhuang
cs.AI
Abstract
Generating minute-long videos is a critical step toward developing world models, providing a foundation for realistic extended scenes and advanced AI simulators. The emerging semi-autoregressive (block diffusion) paradigm integrates the strengths of diffusion and autoregressive models, enabling arbitrary-length video generation and improving inference efficiency through KV caching and parallel sampling. However, it still faces two persistent challenges: (i) long-horizon error accumulation induced by the KV cache, and (ii) the lack of fine-grained long-video benchmarks and coherence-aware metrics. To overcome these limitations, we propose BlockVid, a novel block diffusion framework equipped with a semantic-aware sparse KV cache, an effective training strategy called Block Forcing, and dedicated chunk-wise noise scheduling and shuffling to reduce error propagation and enhance temporal consistency. We further introduce LV-Bench, a fine-grained benchmark for minute-long videos, complete with new metrics that evaluate long-range coherence. Extensive experiments on VBench and LV-Bench demonstrate that BlockVid consistently outperforms existing methods in generating high-quality, coherent minute-long videos. In particular, on LV-Bench it achieves a 22.2% improvement on VDE Subject and a 19.4% improvement on VDE Clarity over state-of-the-art approaches. Project website: https://ziplab.co/BlockVid. Inferix (Code): https://github.com/alibaba-damo-academy/Inferix.
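
To make the semi-autoregressive idea concrete, the following toy sketch shows one way a block-by-block generation loop with a sparse cache could be organized. It is not the BlockVid implementation: the function names (`select_sparse_cache`, `toy_denoise`, `generate`), the cosine-similarity selection rule, the per-block mean summary, and the stand-in denoiser are all assumptions made for illustration, and plain Python lists stand in for the actual key/value tensors.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_sparse_cache(cache, query, k):
    """Hypothetical selection rule: keep the k cached block summaries most
    similar to the query vector, preserving their temporal order."""
    ranked = sorted(range(len(cache)),
                    key=lambda i: cosine(cache[i], query), reverse=True)
    return [cache[i] for i in sorted(ranked[:k])]


def toy_denoise(block, context, steps):
    """Stand-in for a diffusion denoiser: all frames of the block are updated
    together at each step, moving toward the mean of the selected context."""
    if context:
        target = [sum(col) / len(context) for col in zip(*context)]
    else:
        target = [0.0] * len(block[0])
    for _ in range(steps):
        block = [[f + 0.5 * (t - f) for f, t in zip(frame, target)]
                 for frame in block]
    return block


def generate(num_blocks, frames_per_block, dim, query, k=2, steps=4, seed=0):
    """Blocks are produced one after another (autoregressive across blocks);
    frames inside a block are denoised jointly (diffusion within a block)."""
    rng = random.Random(seed)
    cache, video = [], []
    for _ in range(num_blocks):
        noise = [[rng.gauss(0, 1) for _ in range(dim)]
                 for _ in range(frames_per_block)]
        context = select_sparse_cache(cache, query, k)  # sparse read of the cache
        block = toy_denoise(noise, context, steps)
        video.extend(block)
        # Assumed summary: mean frame of the block, appended to the cache.
        cache.append([sum(col) / frames_per_block for col in zip(*block)])
    return video, cache
```

Calling `generate(3, 4, 8, [1.0] * 8)` yields 12 frames and a cache with one summary per block. The point of the sketch is only the control flow: each new block conditions on a small selected subset of earlier blocks rather than on the full history, which is the general motivation for a sparse cache.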