On-Policy Adversarial Flow Distillation for Autoregressive Video Generation

Abstract

Autoregressive video generators are attractive for streaming, long-horizon, and interactive applications, but distilling strong black-box teachers into causal students remains difficult. The student must learn under its own rollout distribution, whereas practical teachers may expose only prompt-conditioned completed videos and may differ in architecture, capacity, temporal design, and sampling schedule. This interface makes supervised fine-tuning off-policy, score-based distillation inapplicable, and direct adversarial imitation too sparse for denoising-time credit assignment. We propose Adversarial Flow Distillation (AFD), an on-policy framework for heterogeneous black-box video distillation. AFD queries the teacher and rolls out the current student on the same prompts, trains a prompt-paired Bradley-Terry discriminator to estimate clean-sample teacher-student discrepancy, and converts the resulting on-policy advantage into forward-process flow-matching updates on the student's own noised states. Thus, AFD provides dense velocity-field supervision while requiring no teacher scores, latents, denoising trajectories, step alignment, or reverse-chain reinforcement learning. Experiments across two causal AR student families show that AFD consistently improves motion- and physics-sensitive generation while preserving general video quality, and ablations validate the importance of adaptive on-policy feedback and forward-process credit assignment. The method requires only clean teacher videos and student rollouts, providing a practical route for distilling proprietary or heterogeneous video generators into efficient autoregressive students.
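The two training signals described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no formulas, so the pairwise loss form, the linear noising path `x_t = (1 - t) * x_clean + t * noise`, the batch-mean baseline used for the advantage, and every function and variable name below are assumptions made for illustration only.

```python
import numpy as np

def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)) = -log(1 + exp(-z)).
    return -np.logaddexp(0.0, -z)

def bradley_terry_loss(score_teacher, score_student):
    # Pairwise Bradley-Terry loss (assumed form): for each prompt, the
    # discriminator should score the teacher video above the student video.
    return float(np.mean(-log_sigmoid(score_teacher - score_student)))

def advantage_weighted_fm_loss(v_pred, x_clean, noise, advantage):
    # Flow-matching loss on the student's own noised samples, weighted per
    # sample by an advantage. Assumes the linear path
    # x_t = (1 - t) * x_clean + t * noise, whose velocity is noise - x_clean.
    target = noise - x_clean
    reduce_axes = tuple(range(1, x_clean.ndim))
    per_sample = np.mean((v_pred - target) ** 2, axis=reduce_axes)
    return float(np.mean(advantage * per_sample))

# Toy usage with random stand-ins for discriminator scores and video latents.
rng = np.random.default_rng(0)
batch, dim = 4, 8
score_teacher = rng.normal(size=batch)   # hypothetical D(teacher video, prompt)
score_student = rng.normal(size=batch)   # hypothetical D(student rollout, prompt)
d_loss = bradley_terry_loss(score_teacher, score_student)

# Hypothetical advantage: student score relative to the batch mean.
advantage = score_student - score_student.mean()

x_clean = rng.normal(size=(batch, dim))  # clean student rollouts (flattened)
noise = rng.normal(size=(batch, dim))
t = rng.uniform(size=(batch, 1))
x_t = (1.0 - t) * x_clean + t * noise    # noised state the student network would see
v_pred = rng.normal(size=(batch, dim))   # stand-in for the student's velocity output
g_loss = advantage_weighted_fm_loss(v_pred, x_clean, noise, advantage)
```

In this sketch only clean teacher videos and clean student rollouts reach the discriminator, while the student update is computed on freshly noised copies of its own samples, which is one way to read the abstract's claim that no teacher scores, latents, or denoising trajectories are needed.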