

Long-Context Autoregressive Video Modeling with Next-Frame Prediction

March 25, 2025
Authors: Yuchao Gu, Weijia Mao, Mike Zheng Shou
cs.AI

Abstract

Long-context autoregressive modeling has significantly advanced language generation, but video generation still struggles to fully utilize extended temporal contexts. To investigate long-context video modeling, we introduce Frame AutoRegressive (FAR), a strong baseline for video autoregressive modeling. Just as language models learn causal dependencies between tokens (i.e., Token AR), FAR models temporal causal dependencies between continuous frames, achieving better convergence than Token AR and video diffusion transformers. Building on FAR, we observe that long-context vision modeling faces challenges due to visual redundancy. Existing RoPE lacks effective temporal decay for remote context and fails to extrapolate well to long video sequences. Additionally, training on long videos is computationally expensive, as vision tokens grow much faster than language tokens. To tackle these issues, we propose balancing locality and long-range dependency. We introduce FlexRoPE, a test-time technique that adds flexible temporal decay to RoPE, enabling extrapolation to 16x longer vision contexts. Furthermore, we propose long short-term context modeling, where a high-resolution short-term context window ensures fine-grained temporal consistency, while an unlimited long-term context window encodes long-range information using fewer tokens. With this approach, we can train on long video sequences with a manageable token context length. We demonstrate that FAR achieves state-of-the-art performance in both short- and long-video generation, providing a simple yet effective baseline for video autoregressive modeling.
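
The abstract does not spell out FlexRoPE's exact formulation. As a rough illustration only, the PyTorch sketch below combines standard RoPE with a distance-proportional penalty on attention logits (in the spirit of ALiBi), applied at inference time so remote frames are down-weighted. The function names and the `decay_rate` knob are assumptions for this sketch, not the paper's API.

```python
import torch

def rope_rotate(x, positions, base=10000.0):
    """Apply a standard RoPE rotation (half-split variant) to x.

    x: (seq, dim) with even dim; positions: (seq,) integer frame indices.
    """
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = positions.float()[:, None] * freqs[None, :]   # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def attention_with_temporal_decay(q, k, v, positions, decay_rate=0.05):
    """Causal attention whose logits decay with temporal distance.

    decay_rate is a hypothetical knob: larger values suppress remote
    context more aggressively. q, k, v: (seq, dim).
    """
    q, k = rope_rotate(q, positions), rope_rotate(k, positions)
    logits = (q @ k.T) / q.shape[-1] ** 0.5                 # (seq, seq)
    # Temporal distance between query frame i and key frame j (past only).
    dist = (positions[:, None] - positions[None, :]).clamp(min=0).float()
    logits = logits - decay_rate * dist                     # flexible temporal decay
    future = torch.triu(torch.ones_like(logits, dtype=torch.bool), diagonal=1)
    logits = logits.masked_fill(future, float("-inf"))      # causal mask
    return torch.softmax(logits, dim=-1) @ v

# Example: 32 frames, one token per frame for simplicity.
q = k = v = torch.randn(32, 64)
out = attention_with_temporal_decay(q, k, v, torch.arange(32))
```

Because the decay enters only as an additive bias on the logits, it can be switched on at test time without retraining, which is consistent with the abstract's description of FlexRoPE as a test-time technique.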

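The long short-term context design can likewise be read as a token-budgeting scheme: recent frames keep all of their tokens, while older frames are compressed to fewer tokens before entering the context. The sketch below is a minimal interpretation under that assumption; `short_window`, `long_stride`, and average pooling are illustrative placeholders, not the paper's actual mechanism.

```python
import torch
import torch.nn.functional as F

def build_context(frame_tokens, short_window=8, long_stride=4):
    """Assemble a mixed-resolution context from per-frame token tensors.

    frame_tokens: list of (tokens_per_frame, dim) tensors, oldest first.
    The most recent `short_window` frames keep every token; older frames
    are pooled down by `long_stride` so long-range history stays cheap.
    """
    long_term = frame_tokens[:-short_window]
    short_term = frame_tokens[-short_window:]
    context = []
    for tokens in long_term:
        # Average-pool token groups: each old frame costs 1/long_stride tokens.
        pooled = F.avg_pool1d(tokens.T.unsqueeze(0), kernel_size=long_stride)
        context.append(pooled.squeeze(0).T)
    context.extend(short_term)
    return torch.cat(context, dim=0)  # (reduced_total_tokens, dim)

# Example: 64 frames of 256 tokens each; only the last 8 stay full-resolution.
frames = [torch.randn(256, 64) for _ in range(64)]
ctx = build_context(frames)  # 56*64 + 8*256 = 5632 tokens instead of 16384
```

Under these illustrative settings the context shrinks from 16,384 to 5,632 tokens, which shows how a fixed short-term window plus compressed long-term history keeps the token count manageable as the video grows.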