
Context Forcing: Consistent Autoregressive Video Generation with Long Context

February 5, 2026
Authors: Shuo Chen, Cong Wei, Sun Sun, Ping Nie, Kai Zhou, Ge Zhang, Ming-Hsuan Yang, Wenhu Chen
cs.AI

Abstract

Recent approaches to real-time long video generation typically employ streaming tuning strategies, attempting to train a long-context student using a short-context (memoryless) teacher. In these frameworks, the student performs long rollouts but receives supervision from a teacher limited to short 5-second windows. This structural discrepancy creates a critical student-teacher mismatch: the teacher's inability to access long-term history prevents it from guiding the student on global temporal dependencies, effectively capping the student's context length. To resolve this, we propose Context Forcing, a novel framework that trains a long-context student via a long-context teacher. By ensuring the teacher is aware of the full generation history, we eliminate the supervision mismatch, enabling the robust training of models capable of long-term consistency. To make this computationally feasible for extreme durations (e.g., 2 minutes), we introduce a context management system that transforms the linearly growing context into a Slow-Fast Memory architecture, significantly reducing visual redundancy. Extensive results demonstrate that our method enables effective context lengths exceeding 20 seconds -- 2 to 10 times longer than state-of-the-art methods like LongLive and Infinite-RoPE. By leveraging this extended context, Context Forcing preserves superior consistency across long durations, surpassing state-of-the-art baselines on various long video evaluation metrics.
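
The abstract does not say how the Slow-Fast Memory is constructed, so the following is only a toy sketch of the general idea it describes: keep recent frames densely and older history sparsely, so the context handed to the model stops growing linearly with the rollout. The class name `SlowFastMemory` and the parameters `fast_size`, `slow_stride`, and `slow_size` are hypothetical illustrations, not names or values from the paper, and the fixed-stride subsampling is an assumed stand-in for whatever redundancy reduction the authors actually use.

```python
from collections import deque
from typing import Any, Deque, List


class SlowFastMemory:
    """Toy context buffer: a dense recent window plus sparse older history.

    Hypothetical illustration only; not the paper's implementation.
    """

    def __init__(self, fast_size: int, slow_stride: int, slow_size: int) -> None:
        self.fast_size = fast_size      # number of most recent frames kept densely
        self.slow_stride = slow_stride  # keep 1 of every `slow_stride` evicted frames
        self.fast: Deque[Any] = deque()
        # Bounded slow memory: the oldest sparse entries are dropped when full.
        self.slow: Deque[Any] = deque(maxlen=slow_size)
        self._evicted = 0               # frames that have left the fast window so far

    def append(self, frame: Any) -> None:
        """Add a newly generated frame and demote the oldest dense frame if needed."""
        self.fast.append(frame)
        if len(self.fast) > self.fast_size:
            old = self.fast.popleft()
            # Assumed policy: subsample evicted frames at a fixed stride.
            if self._evicted % self.slow_stride == 0:
                self.slow.append(old)
            self._evicted += 1

    def context(self) -> List[Any]:
        """Return sparse old frames followed by dense recent frames, in time order."""
        return list(self.slow) + list(self.fast)


if __name__ == "__main__":
    memory = SlowFastMemory(fast_size=4, slow_stride=3, slow_size=8)
    for t in range(16):  # integers stand in for frame latents
        memory.append(t)
    print(memory.context())
```

Under these assumed settings the context never exceeds `fast_size + slow_size` entries however long the rollout runs, which is the property that would make long durations tractable; the actual system may compress or select history quite differently.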