Streaming Autoregressive Video Generation via Diagonal Distillation
March 10, 2026
Authors: Jinxiu Liu, Xuanming Liu, Kangfu Mei, Yandong Wen, Ming-Hsuan Yang, Weiyang Liu
cs.AI
Abstract
Large pretrained diffusion models have significantly enhanced the quality of generated videos, and yet their use in real-time streaming remains limited. Autoregressive models offer a natural framework for sequential frame synthesis but require heavy computation to achieve high fidelity. Diffusion distillation can compress these models into efficient few-step variants, but existing video distillation approaches largely adapt image-specific methods that neglect temporal dependencies. These techniques often excel in image generation but underperform in video synthesis, exhibiting reduced motion coherence, error accumulation over long sequences, and a latency-quality trade-off. We identify two factors that result in these limitations: insufficient utilization of temporal context during step reduction and implicit prediction of subsequent noise levels in next-chunk prediction (i.e., exposure bias). To address these issues, we propose Diagonal Distillation, which operates orthogonally to existing approaches and better exploits temporal information across both video chunks and denoising steps. Central to our approach is an asymmetric generation strategy: more steps early, fewer steps later. This design allows later chunks to inherit rich appearance information from thoroughly processed early chunks, while using partially denoised chunks as conditional inputs for subsequent synthesis. By aligning the implicit prediction of subsequent noise levels during chunk generation with the actual inference conditions, our approach mitigates error propagation and reduces oversaturation in long-range sequences. We further incorporate implicit optical flow modeling to preserve motion quality under strict step constraints. Our method generates a 5-second video in 2.61 seconds (up to 31 FPS), achieving a 277.3x speedup over the undistilled model.
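The asymmetric "more steps early, fewer steps later" strategy can be illustrated with a minimal sketch. The function names, the linear decay of the per-chunk step budget, and the point at which a partially denoised state is exposed as conditioning are all illustrative assumptions, not the paper's actual implementation:

```python
# Hypothetical sketch of an asymmetric ("diagonal") denoising schedule:
# early chunks get more denoising steps, later chunks fewer, and each
# chunk conditions on a partially denoised state of the previous chunk.
# All names and numeric choices are illustrative.

def diagonal_step_schedule(num_chunks: int, max_steps: int, min_steps: int = 1):
    """Linearly decay the per-chunk step budget from max_steps to min_steps."""
    if num_chunks == 1:
        return [max_steps]
    schedule = []
    for i in range(num_chunks):
        frac = i / (num_chunks - 1)
        steps = round(max_steps - frac * (max_steps - min_steps))
        schedule.append(max(min_steps, steps))
    return schedule

def generate_stream(num_chunks, max_steps, denoise_step, init_noise):
    """Autoregressively denoise chunks; the next chunk is conditioned on an
    intermediate (partially denoised) state of the current chunk, mirroring
    the conditioning described in the abstract."""
    schedule = diagonal_step_schedule(num_chunks, max_steps)
    context = None  # partially denoised previous chunk (None for the first chunk)
    outputs = []
    for chunk_idx, steps in enumerate(schedule):
        x = init_noise(chunk_idx)
        for s in range(steps):
            x = denoise_step(x, s, steps, context)
            if s == steps // 2:
                # Expose a mid-denoising state as conditioning for the
                # next chunk (the halfway point is an arbitrary choice here).
                context = x
        outputs.append(x)
    return outputs, schedule
```

With `max_steps=8`, `min_steps=2` over 4 chunks this yields the budget `[8, 6, 4, 2]`: later chunks spend fewer steps because they inherit appearance information from the thoroughly processed early chunks, which is where the claimed speedup comes from.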