Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation
December 4, 2025
Authors: Yunhong Lu, Yanhong Zeng, Haobo Li, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Jiapeng Zhu, Hengyuan Cao, Zhipeng Zhang, Xing Zhu, Yujun Shen, Min Zhang
cs.AI
Abstract
Efficient streaming video generation is critical for simulating interactive and dynamic worlds. Existing methods distill few-step video diffusion models with sliding window attention, using initial frames as sink tokens to maintain attention performance and reduce error accumulation. However, video frames become overly dependent on these static tokens, resulting in copied initial frames and diminished motion dynamics. To address this, we introduce Reward Forcing, a novel framework with two key designs. First, we propose EMA-Sink, which maintains fixed-size tokens initialized from initial frames and continuously updated by fusing evicted tokens via exponential moving average as they exit the sliding window. Without additional computational cost, EMA-Sink tokens capture both long-term context and recent dynamics, preventing initial-frame copying while maintaining long-horizon consistency. Second, to better distill motion dynamics from teacher models, we propose Rewarded Distribution Matching Distillation (Re-DMD). Vanilla distribution matching treats every training sample equally, limiting the model's ability to prioritize dynamic content. Instead, Re-DMD biases the model's output distribution toward high-reward regions by prioritizing samples rated as more dynamic by a vision-language model. Re-DMD significantly enhances motion quality while preserving data fidelity. Quantitative and qualitative experiments show that Reward Forcing achieves state-of-the-art performance on standard benchmarks while enabling high-quality streaming video generation at 23.1 FPS on a single H100 GPU.
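The two mechanisms described above can be illustrated with a minimal sketch. The paper does not specify hyperparameters or exact formulations, so the EMA decay rate and the softmax form of the reward weighting below are assumptions for illustration only; `ema_sink_update` shows how evicted window tokens could be fused into a fixed-size sink buffer, and `rewarded_dmd_weights` shows one way per-sample dynamism rewards could bias a distillation loss toward high-reward samples.

```python
import numpy as np

def ema_sink_update(sink, evicted, decay=0.99):
    """Fuse tokens evicted from the sliding attention window into the
    fixed-size sink buffer via an exponential moving average.

    sink:    (num_sink_tokens, dim) buffer, initialized from initial frames
    evicted: (num_sink_tokens, dim) tokens leaving the sliding window
    decay:   hypothetical EMA coefficient (not given in the abstract)
    """
    return decay * sink + (1.0 - decay) * evicted

def rewarded_dmd_weights(rewards, temperature=1.0):
    """Turn per-sample dynamism rewards (e.g. scored by a vision-language
    model) into normalized weights for the distribution-matching loss,
    so that high-dynamics samples contribute more. The softmax form is
    an assumption; the paper only states that sampling is biased toward
    high-reward regions.
    """
    z = np.asarray(rewards, dtype=np.float64) / temperature
    z -= z.max()               # subtract max for numerical stability
    w = np.exp(z)
    return w / w.sum()         # weights sum to 1
```

In this sketch the sink buffer never grows, which is why the mechanism adds no extra attention cost: attention always runs over the same fixed number of sink tokens plus the current window.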