Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation

Abstract

Distillation-based acceleration has become foundational for making autoregressive streaming video diffusion models practical, with distribution matching distillation (DMD) as the de facto choice. Existing methods, however, train the student to match the teacher's output indiscriminately, treating every rollout, frame, and pixel as equally reliable supervision. We argue that this caps distilled quality because it overlooks two complementary axes of variance in DMD supervision: Inter-Reliability, across student rollouts whose supervision differs in reliability, and Intra-Perplexity, across spatial regions and temporal frames where the remaining room for quality improvement is unevenly distributed. The objective thus conflates two questions under a single uniform weight: whether to learn from a given rollout, and where within it to concentrate optimization. To address this, we propose Stream-R1, a Reliability-Perplexity Aware Reward Distillation framework that adaptively reweights the distillation objective at both the rollout level and the spatiotemporal-element level through a single shared reward-guided mechanism. At the Inter-Reliability level, Stream-R1 rescales each rollout's loss by an exponential of a pretrained video reward score, so that rollouts with reliable supervision dominate optimization. At the Intra-Perplexity level, it back-propagates through the same reward model to extract per-pixel gradient saliency, which is factored into spatial and temporal weights that concentrate optimization pressure on the regions and frames where refinement yields the largest expected gain. An adaptive balancing mechanism prevents any one of the three quality axes (visual quality, motion quality, and text alignment) from dominating. On standard streaming video generation benchmarks, Stream-R1 improves consistently over distillation baselines on all three dimensions, without architectural modification or additional inference cost.
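The two reweighting levels described above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the temperature `beta`, the mean-1 normalizations, and the choice to factor saliency into marginal averages are all hypothetical. The saliency values are assumed to be precomputed absolute reward gradients, and the adaptive balancing across quality axes is not modeled.

```python
import math


def rollout_weights(rewards, beta=1.0):
    """Inter-Reliability (assumed form): weight each rollout by exp(beta * reward).

    Weights are normalized to average 1 over the batch (an assumption).
    Subtracting the max is for numerical stability and cancels out.
    """
    top = max(rewards)
    exps = [math.exp(beta * (r - top)) for r in rewards]
    total = sum(exps)
    return [len(rewards) * e / total for e in exps]


def factor_saliency(saliency, eps=1e-8):
    """Intra-Perplexity (assumed form): split saliency[t][h][w] >= 0 into
    a per-frame weight and a per-location weight, each with mean ~1."""
    frames, height, width = len(saliency), len(saliency[0]), len(saliency[0][0])
    # Temporal weight: average saliency inside each frame.
    temporal = [sum(map(sum, frame)) / (height * width) for frame in saliency]
    # Spatial weight: average saliency at each location across frames.
    spatial = [[sum(saliency[t][h][w] for t in range(frames)) / frames
                for w in range(width)] for h in range(height)]
    t_mean = sum(temporal) / frames + eps
    s_mean = sum(map(sum, spatial)) / (height * width) + eps
    return ([x / t_mean for x in temporal],
            [[x / s_mean for x in row] for row in spatial])


def reweighted_loss(per_pixel_loss, rewards, saliency, beta=1.0):
    """Combine both levels; inputs are indexed [rollout][t][h][w]."""
    total = 0.0
    weights = rollout_weights(rewards, beta)
    for loss_b, sal_b, w_b in zip(per_pixel_loss, saliency, weights):
        temporal, spatial = factor_saliency(sal_b)
        frames, height, width = len(loss_b), len(loss_b[0]), len(loss_b[0][0])
        acc = sum(loss_b[t][h][w] * temporal[t] * spatial[h][w]
                  for t in range(frames)
                  for h in range(height)
                  for w in range(width))
        total += w_b * acc / (frames * height * width)
    return total / len(rewards)
```

With equal rewards and uniform saliency, every weight is approximately 1 and the function reduces to the plain mean of the per-pixel loss, which matches the uniform weighting the abstract attributes to existing methods. In a real training loop the saliency would come from differentiating the reward model with respect to the generated pixels, which requires an automatic differentiation framework and is outside this sketch.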
