

VideoAR: Autoregressive Video Generation via Next-Frame & Scale Prediction

January 9, 2026
Authors: Longbin Ji, Xiaoxiong Liu, Junyuan Shang, Shuohuan Wang, Yu Sun, Hua Wu, Haifeng Wang
cs.AI

Abstract

Recent advances in video generation have been dominated by diffusion and flow-matching models, which produce high-quality results but remain computationally intensive and difficult to scale. In this work, we introduce VideoAR, the first large-scale Visual Autoregressive (VAR) framework for video generation that combines multi-scale next-frame prediction with autoregressive modeling. VideoAR disentangles spatial and temporal dependencies by integrating intra-frame VAR modeling with causal next-frame prediction, supported by a 3D multi-scale tokenizer that efficiently encodes spatio-temporal dynamics. To improve long-term consistency, we propose Multi-scale Temporal RoPE, Cross-Frame Error Correction, and Random Frame Mask, which collectively mitigate error propagation and stabilize temporal coherence. Our multi-stage pretraining pipeline progressively aligns spatial and temporal learning across increasing resolutions and durations. Empirically, VideoAR achieves new state-of-the-art results among autoregressive models, improving FVD on UCF-101 from 99.5 to 88.6 while reducing inference steps by over 10x, and reaching a VBench score of 81.74, competitive with diffusion-based models an order of magnitude larger. These results demonstrate that VideoAR narrows the performance gap between autoregressive and diffusion paradigms, offering a scalable, efficient, and temporally consistent foundation for future video generation research.
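To make the two-level autoregressive scheme concrete, below is a minimal sketch of the inference loop the abstract describes: frames are generated causally one after another, and within each frame the token map is predicted coarse-to-fine across scales. This is an illustration under assumptions, not the authors' implementation; the interfaces `model(...)` and `tokenizer.decode(...)`, the greedy decoding, and the scale schedule are hypothetical placeholders.

```python
# Hypothetical sketch of VideoAR-style generation: causal next-frame prediction
# across time, next-scale prediction within each frame. Names and interfaces are
# assumptions for illustration only.
import torch

@torch.no_grad()
def generate_video(model, tokenizer, text_emb, num_frames, scales):
    """Generate a video frame by frame; each frame is decoded scale by scale.

    Args:
        model: autoregressive transformer over multi-scale token maps (assumed interface).
        tokenizer: 3D multi-scale tokenizer exposing a `decode` method (assumed interface).
        text_emb: conditioning embedding for the text prompt.
        num_frames: number of frames to generate causally.
        scales: coarse-to-fine token-map sizes, e.g. [(1, 1), (2, 2), (4, 4), (8, 8)].
    """
    past_frame_tokens = []              # causal context: token maps of generated frames
    for t in range(num_frames):
        frame_scales = []               # intra-frame context: coarser scales of this frame
        for (h, w) in scales:
            # One forward pass predicts the whole next-scale token map, conditioned on
            # the text prompt, all previous frames, and the coarser scales of this frame.
            logits = model(
                text_emb=text_emb,
                past_frames=past_frame_tokens,
                current_scales=frame_scales,
                target_scale=(h, w),
            )
            next_scale_tokens = torch.argmax(logits, dim=-1)  # greedy for simplicity
            frame_scales.append(next_scale_tokens)
        past_frame_tokens.append(frame_scales)

    # Decode all multi-scale token maps back to pixel space with the 3D tokenizer.
    return tokenizer.decode(past_frame_tokens)
```

Because each scale is predicted in a single forward pass rather than token by token, the number of sequential steps per frame equals the number of scales, which is consistent with the claimed 10x reduction in inference steps relative to per-token or iterative denoising approaches.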