On-Policy Adversarial Flow Distillation for Autoregressive Video Generation

Abstract

Autoregressive (AR) video generators are attractive for streaming, long-horizon, and interactive applications, but distilling strong black-box teachers into causal students remains difficult. The student must learn under its own rollout distribution, whereas practical teachers may expose only completed, prompt-conditioned videos and may differ from the student in architecture, capacity, temporal design, and sampling schedule. Under this interface, supervised fine-tuning is off-policy, score-based distillation is inapplicable, and direct adversarial imitation gives a signal too sparse for credit assignment across denoising time. We propose Adversarial Flow Distillation (AFD), an on-policy framework for heterogeneous black-box video distillation. AFD queries the teacher and rolls out the current student on the same prompts, trains a prompt-paired Bradley-Terry discriminator to estimate the teacher-student discrepancy on clean samples, and converts the resulting on-policy advantage into forward-process flow-matching updates on the student's own noised states. AFD thus provides dense velocity-field supervision while requiring no teacher scores, latents, denoising trajectories, step alignment, or reverse-chain reinforcement learning. Experiments across two causal AR student families show that AFD consistently improves motion- and physics-sensitive generation while preserving general video quality, and ablations validate the importance of adaptive on-policy feedback and forward-process credit assignment. The method requires only clean teacher videos and student rollouts, providing a practical route for distilling proprietary or heterogeneous video generators into efficient autoregressive students.
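
The abstract names two ingredients without giving their formulas: a prompt-paired Bradley-Terry discriminator loss and an advantage-weighted forward-process flow-matching update. The sketch below illustrates one plausible reading of each and is not the paper's objective. The linear noising path `x_t = (1 - t) * x0 + t * eps`, the mean-centered advantage, and every function name are assumptions introduced here for illustration only.

```python
import math


def bradley_terry_loss(score_teacher: float, score_student: float) -> float:
    """Pairwise Bradley-Terry loss: -log sigmoid(score_teacher - score_student).

    Both scores are assumed to come from one discriminator evaluated on a
    teacher video and a student rollout generated for the same prompt.
    """
    margin = score_teacher - score_student
    # Numerically stable softplus(-margin) = log(1 + exp(-margin)).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def centered_advantages(student_scores):
    """Hypothetical advantage: discriminator score minus the batch mean."""
    mean = sum(student_scores) / len(student_scores)
    return [s - mean for s in student_scores]


def noised_state(x0, eps, t):
    """Forward process on a clean student sample x0 (assumed linear path)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, eps)]


def target_velocity(x0, eps):
    """Velocity of the assumed linear path: d x_t / d t = eps - x0."""
    return [b - a for a, b in zip(x0, eps)]


def weighted_flow_matching_loss(pred_velocity, x0, eps, advantage):
    """Advantage-weighted mean squared error against the path velocity."""
    target = target_velocity(x0, eps)
    mse = sum((p - q) ** 2 for p, q in zip(pred_velocity, target)) / len(target)
    return advantage * mse
```

In this reading, the discriminator sees only clean videos, while the student is supervised at every noise level `t` through `noised_state`. That is one way a single sample-level score could become dense velocity-field supervision without access to teacher scores or denoising trajectories.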