Mixture of Contexts for Long Video Generation
August 28, 2025
作者: Shengqu Cai, Ceyuan Yang, Lvmin Zhang, Yuwei Guo, Junfei Xiao, Ziyan Yang, Yinghao Xu, Zhenheng Yang, Alan Yuille, Leonidas Guibas, Maneesh Agrawala, Lu Jiang, Gordon Wetzstein
cs.AI
Abstract
Long video generation is fundamentally a long-context memory problem: models must retain and retrieve salient events across a long horizon without collapsing or drifting. However, scaling diffusion transformers to generate long-context videos is limited by the quadratic cost of self-attention, which makes memory and computation intractable and difficult to optimize for long sequences. We recast long-context video generation as an internal information retrieval task and propose a simple, learnable sparse attention routing module, Mixture of Contexts (MoC), as an effective long-term memory retrieval engine. In MoC, each query dynamically selects a few informative chunks plus mandatory anchors (the caption and local windows) to attend to, with causal routing that prevents loop closures. As we scale the data and gradually sparsify the routing, the model allocates compute to salient history, preserving identities, actions, and scenes over minutes of content. Efficiency follows as a byproduct of retrieval (near-linear scaling), which enables practical training and synthesis, and memory and consistency emerge at the scale of minutes.
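To make the routing concrete, below is a minimal PyTorch sketch of chunk-level top-k attention routing in the spirit of the abstract, under simplifying assumptions: a single head, unbatched tensors, a sequence length divisible by `chunk_size`, and chunk 0 standing in for the caption anchor. The function `moc_attention` and all of its parameters are illustrative names, not the paper's implementation.

```python
import torch
import torch.nn.functional as F


def moc_attention(q, k, v, chunk_size=64, top_k=4):
    """Route each query chunk to a few informative past chunks plus anchors."""
    seq_len, dim = q.shape
    assert seq_len % chunk_size == 0, "sketch assumes divisible sequence length"
    n_chunks = seq_len // chunk_size

    # One descriptor per chunk: mean-pooled keys (and queries for scoring).
    k_chunks = k.view(n_chunks, chunk_size, dim)
    q_chunks = q.view(n_chunks, chunk_size, dim)
    k_desc = k_chunks.mean(dim=1)                      # (n_chunks, dim)
    q_desc = q_chunks.mean(dim=1)                      # (n_chunks, dim)

    # Chunk-to-chunk relevance scores.
    scores = (q_desc @ k_desc.T) / dim ** 0.5          # (n_chunks, n_chunks)

    # Causal routing: a chunk may route only to itself and earlier chunks,
    # never to the future, which is what prevents loop closures.
    causal = torch.ones(n_chunks, n_chunks).tril().bool()
    scores = scores.masked_fill(~causal, float("-inf"))

    out = torch.empty_like(q)
    for i in range(n_chunks):
        k_eff = min(top_k, i + 1)                      # fewer chunks exist early on
        picked = set(scores[i].topk(k_eff).indices.tolist())
        # Mandatory anchors: the caption (chunk 0 stands in for caption
        # tokens here) and the local window (the chunk itself).
        picked |= {0, i}
        idx = torch.cat([torch.arange(c * chunk_size, (c + 1) * chunk_size)
                         for c in sorted(picked)])
        # Dense attention restricted to the selected tokens only.
        attn = F.scaled_dot_product_attention(
            q_chunks[i].unsqueeze(0), k[idx].unsqueeze(0), v[idx].unsqueeze(0))
        out[i * chunk_size:(i + 1) * chunk_size] = attn.squeeze(0)
    return out


# Usage: each of the 512 query tokens attends to at most a few chunks.
q, k, v = (torch.randn(512, 128) for _ in range(3))
y = moc_attention(q, k, v, chunk_size=64, top_k=2)     # -> (512, 128)
```

The near-linear scaling claimed in the abstract falls out of this structure: with fixed `top_k` and `chunk_size`, every query attends to at most `(top_k + 2) * chunk_size` keys regardless of total length, so cost grows roughly linearly in sequence length rather than quadratically.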