End-to-End Training for Autoregressive Video Diffusion via Self-Resampling
December 17, 2025
Authors: Yuwei Guo, Ceyuan Yang, Hao He, Yang Zhao, Meng Wei, Zhenheng Yang, Weilin Huang, Dahua Lin
cs.AI
Abstract
Autoregressive video diffusion models hold promise for world simulation but are vulnerable to exposure bias arising from the train-test mismatch. While recent works address this via post-training, they typically rely on a bidirectional teacher model or an online discriminator. To achieve an end-to-end solution, we introduce Resampling Forcing, a teacher-free framework that enables training autoregressive video models from scratch and at scale. Central to our approach is a self-resampling scheme that simulates inference-time model errors on history frames during training. Conditioned on these degraded histories, the model is trained with a sparse causal mask that enforces temporal causality while enabling parallel training with a frame-level diffusion loss. To facilitate efficient long-horizon generation, we further introduce history routing, a parameter-free mechanism that dynamically retrieves the top-k most relevant history frames for each query frame. Experiments demonstrate that our approach achieves performance comparable to distillation-based baselines while exhibiting superior temporal consistency on longer videos owing to native-length training.
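To make the two mechanisms named in the abstract more concrete, the sketches below illustrate one plausible reading of them. The abstract gives only a high-level description, so every function name, signature, and tensor shape here is an assumption rather than the paper's actual implementation.

First, a minimal sketch of the self-resampling idea: clean history frames are re-noised and the model itself produces the (imperfect) conditioning context, so that training sees inference-like degraded histories instead of ground truth. The `denoiser` callable and its interface are hypothetical.

```python
import torch


@torch.no_grad()
def self_resample_history(denoiser, history, noise_level):
    """Degrade clean history frames by re-noising and letting the model
    itself reconstruct them (illustrative sketch, not the paper's code).

    denoiser: hypothetical callable (noisy_frames, noise_level) -> denoised frames
    history:  tensor of shape (num_frames, C, H, W) with ground-truth history
    """
    noisy = history + noise_level * torch.randn_like(history)
    # The model's own imperfect reconstruction replaces the ground-truth
    # history, mimicking the errors that accumulate at inference time.
    return denoiser(noisy, noise_level)
```

Second, a minimal sketch of parameter-free top-k history routing, assuming each frame is summarized by a pooled feature vector and relevance is scored with cosine similarity (both assumptions; the paper may use a different relevance measure).

```python
import torch
import torch.nn.functional as F


def route_history(query_feat, history_feats, k=4):
    """Select the indices of the k most relevant history frames (sketch).

    query_feat:    (d,)   pooled feature of the frame being generated
    history_feats: (n, d) pooled features of the n history frames
    """
    sims = F.cosine_similarity(history_feats, query_feat.unsqueeze(0), dim=-1)  # (n,)
    k = min(k, history_feats.shape[0])
    return torch.topk(sims, k).indices


# Toy usage: 16 history frames with 64-d pooled features.
hist = torch.randn(16, 64)
q = torch.randn(64)
print(route_history(q, hist, k=4))
```

The returned indices could then restrict causal attention to a sparse subset of history frames, which is one way the routing could keep long-horizon generation efficient.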