Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation
December 4, 2025
Authors: Yunhong Lu, Yanhong Zeng, Haobo Li, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Jiapeng Zhu, Hengyuan Cao, Zhipeng Zhang, Xing Zhu, Yujun Shen, Min Zhang
cs.AI
Abstract
Efficient streaming video generation is critical for simulating interactive and dynamic worlds. Existing methods distill few-step video diffusion models with sliding window attention, using initial frames as sink tokens to maintain attention performance and reduce error accumulation. However, video frames become overly dependent on these static tokens, resulting in copied initial frames and diminished motion dynamics. To address this, we introduce Reward Forcing, a novel framework with two key designs. First, we propose EMA-Sink, which maintains fixed-size tokens initialized from initial frames and continuously updated by fusing evicted tokens via exponential moving average as they exit the sliding window. Without additional computation cost, EMA-Sink tokens capture both long-term context and recent dynamics, preventing initial frame copying while maintaining long-horizon consistency. Second, to better distill motion dynamics from teacher models, we propose a novel Rewarded Distribution Matching Distillation (Re-DMD). Vanilla distribution matching treats every training sample equally, limiting the model's ability to prioritize dynamic content. Instead, Re-DMD biases the model's output distribution toward high-reward regions by prioritizing samples with greater dynamics rated by a vision-language model. Re-DMD significantly enhances motion quality while preserving data fidelity. We include both quantitative and qualitative experiments to show that Reward Forcing achieves state-of-the-art performance on standard benchmarks while enabling high-quality streaming video generation at 23.1 FPS on a single H100 GPU.
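The two designs described above can be illustrated with a toy sketch. It is not the authors' implementation: the abstract gives no formulas, so the class name `EmaSinkCache`, the `decay` coefficient, the assumption that every frame carries as many tokens as the sink, and the softmax-style weighting with a `temperature` parameter are all hypothetical choices made for illustration. The first part shows a fixed-size sink that absorbs evicted frames through an exponential moving average; the second shows one plausible way to weight per-sample distillation losses by a reward score.

```python
from collections import deque
import math


class EmaSinkCache:
    """Toy sliding-window cache with an EMA-updated sink (illustrative only)."""

    def __init__(self, initial_tokens, window_size, decay=0.9):
        # The sink starts as a copy of the initial-frame tokens; its size never changes.
        self.sink = [list(tok) for tok in initial_tokens]
        self.window = deque()
        self.window_size = window_size
        self.decay = decay  # hypothetical EMA coefficient, not taken from the paper

    def append(self, frame_tokens):
        """Add one frame's tokens; fold the evicted frame into the sink via EMA."""
        self.window.append([list(tok) for tok in frame_tokens])
        if len(self.window) > self.window_size:
            evicted = self.window.popleft()
            # Assumes the evicted frame has the same token count as the sink.
            self.sink = [
                [self.decay * s + (1.0 - self.decay) * e for s, e in zip(s_tok, e_tok)]
                for s_tok, e_tok in zip(self.sink, evicted)
            ]

    def context(self):
        """Tokens visible to attention: sink tokens first, then the recent window."""
        return self.sink + [tok for frame in self.window for tok in frame]


def reward_weights(rewards, temperature=1.0):
    """Softmax-style weights favouring high-reward samples (mean weight is 1).

    The exponential form is an assumption; the abstract only states that
    higher-reward samples are prioritized.
    """
    peak = max(rewards)  # subtract the max for numerical stability
    exps = [math.exp((r - peak) / temperature) for r in rewards]
    total = sum(exps)
    return [len(rewards) * e / total for e in exps]


def weighted_loss(per_sample_losses, rewards, temperature=1.0):
    """Average of per-sample losses, each scaled by its reward weight."""
    weights = reward_weights(rewards, temperature)
    return sum(w * l for w, l in zip(weights, per_sample_losses)) / len(per_sample_losses)
```

In this sketch the attention context stays at a constant size (sink plus window) no matter how many frames have been generated, which matches the abstract's claim that the sink adds no computation cost. With equal rewards, `weighted_loss` reduces to the plain mean, which corresponds to the vanilla case of treating every sample equally.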