

Speculative Decoding for Autoregressive Video Generation

April 19, 2026
Authors: Yuezhou Hu, Jintao Zhang
cs.AI

Abstract

Autoregressive video diffusion is emerging as a promising paradigm for streaming video synthesis, with step distillation serving as the primary means of accelerating inference. Whether speculative decoding, the dominant acceleration strategy for large language models, can be effectively adapted to autoregressive video generation remains an open question, because video blocks are continuous spatiotemporal tensors with no token-level distribution for exact rejection sampling. We introduce SDVG, which brings speculative decoding to block-based autoregressive video diffusion by replacing token verification with an image-quality router. A 1.3B drafter proposes candidate blocks via four denoising steps; each block is VAE-decoded and scored by ImageReward using worst-frame aggregation (taking the minimum per-frame reward to catch single-frame artifacts that averaging would mask). Blocks scoring above a fixed threshold τ are accepted into the 14B target's KV cache; the rest are regenerated by the target. Two additional design choices prove critical: the first block is always force-rejected to anchor scene composition, and τ serves as a single knob that traces a smooth quality-speed Pareto frontier. On 1003 MovieGenVideoBench prompts (832x480), SDVG retains 98.1% of target-only VisionReward quality (0.0773 vs. 0.0788) at a 1.59x speedup with τ=-0.7, and reaches 2.09x at 95.7% quality retention, while consistently outperforming draft-only generation by over 17%. The framework is training-free, requires no architectural changes, and can be seamlessly integrated into existing autoregressive video generation pipelines.
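The routing logic the abstract describes can be sketched in a few lines. This is a minimal, hypothetical illustration, not the authors' implementation: the callables `draft_block`, `target_block`, and `score_block` stand in for the 1.3B drafter (four denoising steps), the 14B target, and the VAE-decode-plus-ImageReward scoring pipeline, none of which are shown here.

```python
def worst_frame_score(frame_rewards):
    """Worst-frame aggregation: a block's score is the minimum per-frame
    reward, so a single bad frame cannot be hidden by averaging."""
    return min(frame_rewards)


def sdvg_route(num_blocks, tau, draft_block, target_block, score_block):
    """Route each video block to the drafter or the target model.

    draft_block(i)  -> candidate block i from the fast drafter
    target_block(i) -> block i regenerated by the large target model
    score_block(b)  -> worst-frame quality score of the decoded block b
    Returns the generated blocks and per-block acceptance flags.
    """
    video, accepted_from_draft = [], []
    for i in range(num_blocks):
        if i == 0:
            # The first block is always force-rejected: the target
            # generates it to anchor scene composition.
            video.append(target_block(i))
            accepted_from_draft.append(False)
            continue
        candidate = draft_block(i)
        if score_block(candidate) >= tau:
            # Accept: the drafted block enters the target's KV cache.
            video.append(candidate)
            accepted_from_draft.append(True)
        else:
            # Reject: the target regenerates this block.
            video.append(target_block(i))
            accepted_from_draft.append(False)
    return video, accepted_from_draft
```

Raising τ rejects more drafted blocks (higher quality, lower speedup), while lowering it accepts more (the quality-speed Pareto knob described above).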