On-Policy Adversarial Flow Distillation for Autoregressive Video Generation

Abstract

Autoregressive video generators are attractive for streaming, long-horizon, and interactive applications, but distilling strong black-box teachers into causal students remains difficult. The student must learn under its own rollout distribution, whereas practical teachers may expose only prompt-conditioned completed videos and may differ from the student in architecture, capacity, temporal design, and sampling schedule. Under this interface, supervised fine-tuning is off-policy, score-based distillation is inapplicable, and direct adversarial imitation gives a signal too sparse for credit assignment across denoising time. We propose Adversarial Flow Distillation (AFD), an on-policy framework for heterogeneous black-box video distillation. AFD queries the teacher and rolls out the current student on the same prompts, trains a prompt-paired Bradley-Terry discriminator to estimate the teacher-student discrepancy on clean samples, and converts the resulting on-policy advantage into forward-process flow-matching updates on the student's own noised states. AFD thus provides dense velocity-field supervision while requiring no teacher scores, latents, denoising trajectories, step alignment, or reverse-chain reinforcement learning. Experiments across two causal autoregressive student families show that AFD consistently improves motion- and physics-sensitive generation while preserving general video quality, and ablations validate the importance of adaptive on-policy feedback and forward-process credit assignment. Because the method requires only clean teacher videos and student rollouts, it provides a practical route for distilling proprietary or heterogeneous video generators into efficient autoregressive students.
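The two learning signals named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the paper's implementation. The abstract does not give the exact objectives, so the following are assumptions made only for illustration: the advantage is taken to be the batch-centered margin between student and teacher discriminator scores, the forward process is the linear interpolation x_t = (1 - t) x0 + t eps with velocity target eps - x0, and the student loss is the advantage times the flow-matching squared error. All function names are hypothetical.

```python
import math


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def bt_discriminator_loss(teacher_score, student_score):
    """Bradley-Terry loss for one prompt pair: -log sigmoid(teacher - student)."""
    return softplus(-(teacher_score - student_score))


def paired_advantages(student_scores, teacher_scores):
    """ASSUMPTION: advantage = per-prompt margin (student - teacher), batch-centered."""
    margins = [s - t for s, t in zip(student_scores, teacher_scores)]
    mean = sum(margins) / len(margins)
    return [m - mean for m in margins]


def noised_state(x0, eps, t):
    """ASSUMPTION: linear forward process on the student's own clean sample x0."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, eps)]


def weighted_fm_loss(v_pred, x0, eps, adv):
    """ASSUMPTION: advantage-weighted squared error against the target eps - x0.

    Minimizing it with adv > 0 pulls the predicted velocity toward the sample's
    own target; with adv < 0 it pushes the prediction away from that target.
    """
    sq = [(v - (b - a)) ** 2 for v, a, b in zip(v_pred, x0, eps)]
    return adv * sum(sq) / len(sq)
```

In this sketch the discriminator only ever scores clean videos, while the student receives a loss at every noise level t through noised_state. This is one way to read the abstract's claim that a clean-sample comparison is turned into dense velocity-field supervision without access to the teacher's internals.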