
Context Forcing: Consistent Autoregressive Video Generation with Long Context

February 5, 2026
Authors: Shuo Chen, Cong Wei, Sun Sun, Ping Nie, Kai Zhou, Ge Zhang, Ming-Hsuan Yang, Wenhu Chen
cs.AI

Abstract

Recent approaches to real-time long video generation typically employ streaming tuning strategies, attempting to train a long-context student using a short-context (memoryless) teacher. In these frameworks, the student performs long rollouts but receives supervision from a teacher limited to short 5-second windows. This structural discrepancy creates a critical student-teacher mismatch: the teacher's inability to access long-term history prevents it from guiding the student on global temporal dependencies, effectively capping the student's context length. To resolve this, we propose Context Forcing, a novel framework that trains a long-context student via a long-context teacher. By ensuring the teacher is aware of the full generation history, we eliminate the supervision mismatch, enabling the robust training of models capable of long-term consistency. To make this computationally feasible for extreme durations (e.g., 2 minutes), we introduce a context management system that transforms the linearly growing context into a Slow-Fast Memory architecture, significantly reducing visual redundancy. Extensive results demonstrate that our method enables effective context lengths exceeding 20 seconds -- 2 to 10 times longer than state-of-the-art methods like LongLive and Infinite-RoPE. By leveraging this extended context, Context Forcing preserves superior consistency across long durations, surpassing state-of-the-art baselines on various long video evaluation metrics.
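The Slow-Fast Memory idea described above — replacing a linearly growing context with a bounded buffer that keeps dense recent history and sparse long-term history — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the class name, capacities, and stride are all assumptions, and real systems would operate on latent frame tokens rather than raw frames.

```python
from collections import deque

class SlowFastMemory:
    """Hypothetical sketch of a slow-fast context buffer.

    Recent frames are kept at full temporal resolution ("fast" memory);
    frames that age out are subsampled into a coarse "slow" memory, so
    the conditioning context stays bounded instead of growing linearly
    with video length. All names and parameters are illustrative.
    """

    def __init__(self, fast_capacity=16, slow_stride=4, slow_capacity=32):
        self.fast = deque()                        # dense recent frames
        self.slow = deque(maxlen=slow_capacity)    # sparse long-term frames
        self.fast_capacity = fast_capacity
        self.slow_stride = slow_stride
        self._evicted = 0                          # count of frames aged out

    def append(self, frame):
        self.fast.append(frame)
        if len(self.fast) > self.fast_capacity:
            old = self.fast.popleft()
            # Keep only every `slow_stride`-th evicted frame, reducing
            # visual redundancy in the long-term history.
            if self._evicted % self.slow_stride == 0:
                self.slow.append(old)
            self._evicted += 1

    def context(self):
        # Conditioning context: coarse long-term + dense recent history.
        return list(self.slow) + list(self.fast)
```

After streaming 100 frames with these (assumed) settings, the context holds 16 recent frames plus a strided subset of the 84 older ones, instead of all 100 — the kind of bounded growth that makes long-context teacher supervision computationally feasible.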