On-Policy Adversarial Flow Distillation for Autoregressive Video Generation

Abstract

Autoregressive video generators are attractive for streaming, long-horizon, and interactive applications, but distilling strong black-box teachers into causal students remains difficult. The student must learn under its own rollout distribution, whereas practical teachers may expose only prompt-conditioned completed videos and may differ in architecture, capacity, temporal design, and sampling schedule. This interface makes supervised fine-tuning off-policy, score-based distillation inapplicable, and direct adversarial imitation too sparse for denoising-time credit assignment. We propose Adversarial Flow Distillation (AFD), an on-policy framework for heterogeneous black-box video distillation. AFD queries the teacher and rolls out the current student on the same prompts, trains a prompt-paired Bradley-Terry discriminator to estimate clean-sample teacher-student discrepancy, and converts the resulting on-policy advantage into forward-process flow-matching updates on the student's own noised states. Thus, AFD provides dense velocity-field supervision while requiring no teacher scores, latents, denoising trajectories, step alignment, or reverse-chain reinforcement learning. Experiments across two causal autoregressive student families show that AFD consistently improves motion- and physics-sensitive generation while preserving general video quality, and ablations validate the importance of adaptive on-policy feedback and forward-process credit assignment. The method requires only clean teacher videos and student rollouts, providing a practical route for distilling proprietary or heterogeneous video generators into efficient autoregressive students.
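
The two ingredients named in the abstract, a prompt-paired Bradley-Terry discriminator and an advantage-weighted flow-matching update on the student's own noised states, can be illustrated with a minimal pure-Python sketch. Everything here is an assumption made for illustration and not the paper's definition: the function names, the batch-standardized advantage, the linear noising path x_t = (1 - t) x_0 + t eps, and the scalar loss weight are all hypothetical, and real samples would be video tensors rather than short lists.

```python
import math

def bt_loss(score_teacher, score_student):
    """Bradley-Terry loss for one prompt: -log P(teacher preferred over student).

    P = sigmoid(score_teacher - score_student); computed as a stable softplus.
    """
    margin = score_teacher - score_student
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def advantages(student_scores):
    """Hypothetical advantage: standardize discriminator scores over a batch
    of student rollouts (zero mean, unit variance)."""
    n = len(student_scores)
    mean = sum(student_scores) / n
    var = sum((s - mean) ** 2 for s in student_scores) / n
    std = math.sqrt(var) + 1e-8
    return [(s - mean) / std for s in student_scores]

def noised_state(x0, eps, t):
    """Assumed forward process: linear interpolation between a clean
    student sample x0 and Gaussian noise eps at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, eps)]

def weighted_fm_loss(v_pred, x0, eps, weight):
    """Flow-matching regression toward the velocity eps - x0 of the assumed
    linear path, scaled by a per-sample weight derived from the advantage."""
    target = [b - a for a, b in zip(x0, eps)]
    mse = sum((p - q) ** 2 for p, q in zip(v_pred, target)) / len(target)
    return weight * mse
```

In this sketch the discriminator is trained by minimizing `bt_loss` over teacher and student videos generated from the same prompt, and the student is updated by minimizing `weighted_fm_loss` at randomly drawn times `t`, so that every noise level receives a training signal from a single clean-sample score. How the advantage is mapped to the weight (for example, clipping it to be non-negative) is left open here because the abstract does not specify it.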