Long-Context Autoregressive Video Modeling with Next-Frame Prediction
March 25, 2025
Authors: Yuchao Gu, Weijia Mao, Mike Zheng Shou
cs.AI
Abstract
Long-context autoregressive modeling has significantly advanced language
generation, but video generation still struggles to fully utilize extended
temporal contexts. To investigate long-context video modeling, we introduce
Frame AutoRegressive (FAR), a strong baseline for video autoregressive
modeling. Just as language models learn causal dependencies between tokens
(i.e., Token AR), FAR models temporal causal dependencies between continuous
frames, achieving better convergence than Token AR and video diffusion
transformers. Building on FAR, we observe that long-context vision modeling
faces challenges due to visual redundancy. Existing RoPE lacks effective
temporal decay for remote context and fails to extrapolate well to long video
sequences. Additionally, training on long videos is computationally expensive,
as vision tokens grow much faster than language tokens. To tackle these issues,
we propose balancing locality and long-range dependency. We introduce FlexRoPE,
a test-time technique that adds flexible temporal decay to RoPE, enabling
extrapolation to 16x longer vision contexts. Furthermore, we propose long
short-term context modeling, where a high-resolution short-term context window
ensures fine-grained temporal consistency, while an unlimited long-term context
window encodes long-range information using fewer tokens. With this approach,
we can train on long video sequences with a manageable token context length. We
demonstrate that FAR achieves state-of-the-art performance in both short- and
long-video generation, providing a simple yet effective baseline for video
autoregressive modeling.
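The frame-level causality that distinguishes FAR from Token AR can be pictured as a block-wise causal attention mask: tokens attend freely within their own frame and to all earlier frames, but never to future frames. The sketch below is illustrative only, assuming a flat frame-major token layout of `num_frames x tokens_per_frame`; it is not the authors' implementation.

```python
import torch

def frame_causal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Block-wise causal mask over a flat sequence of frame tokens.

    mask[i, j] is True where query token i may attend to key token j:
    tokens see their own frame and all earlier frames, never future frames.
    Illustrative sketch of frame-level causality, not the paper's code.
    """
    # Frame index of every token in the flattened (frame-major) sequence.
    frame_ids = torch.arange(num_frames).repeat_interleave(tokens_per_frame)
    return frame_ids[:, None] >= frame_ids[None, :]

mask = frame_causal_mask(num_frames=4, tokens_per_frame=3)
print(mask.shape)  # torch.Size([12, 12]); 3x3 blocks, lower-block-triangular
```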
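The abstract does not spell out how FlexRoPE's flexible temporal decay is realized. One common way to impose temporal decay on attention is an additive, distance-dependent bias on the logits (ALiBi-style); the sketch below adopts that reading purely for illustration, with `decay_rate` as a hypothetical test-time knob. The actual FlexRoPE mechanism may differ.

```python
import torch

def temporal_decay_bias(q_frame_ids: torch.Tensor,
                        k_frame_ids: torch.Tensor,
                        decay_rate: float) -> torch.Tensor:
    """Additive attention bias that grows more negative with temporal distance,
    down-weighting remote frames. Hypothetical illustration of temporal decay;
    the actual FlexRoPE formulation may differ.
    """
    # Non-negative temporal distance from each query frame to each key frame.
    dist = (q_frame_ids[:, None] - k_frame_ids[None, :]).clamp(min=0).float()
    return -decay_rate * dist  # added to attention logits before softmax

frames = torch.arange(8)
bias = temporal_decay_bias(frames, frames, decay_rate=0.1)
```

Because such a decay is applied only at inference, its strength can be adjusted when the context grows beyond the training length, which matches the abstract's "test-time" framing.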
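Long short-term context modeling can likewise be pictured as a two-tier cache: the most recent frames keep full spatial resolution, while older frames are compressed to fewer tokens. The sketch below uses average pooling as a stand-in for the compression; `short_window` and `pool` are hypothetical parameters, and the paper's actual token-reduction scheme may differ.

```python
import torch
import torch.nn.functional as F

def split_context(frames: torch.Tensor, short_window: int, pool: int):
    """frames: (T, C, H, W) latent frames, oldest first.

    Returns (long_term, short_term): recent frames at full resolution,
    older frames spatially pooled so they contribute fewer tokens.
    Hypothetical sketch; the paper's compression scheme may differ.
    """
    long_term, short_term = frames[:-short_window], frames[-short_window:]
    if long_term.shape[0] > 0:
        # pool x pool spatial pooling => tokens per old frame shrink by pool^2
        long_term = F.avg_pool2d(long_term, kernel_size=pool)
    return long_term, short_term

frames = torch.randn(32, 4, 16, 16)  # 32 latent frames
lt, st = split_context(frames, short_window=4, pool=4)
print(lt.shape, st.shape)  # (28, 4, 4, 4) and (4, 4, 16, 16)
```

With the numbers above, each of the 28 long-term frames contributes 16x fewer tokens than a short-term frame, the kind of budget that keeps long-video training at a manageable context length.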