Mixture of Contexts for Long Video Generation
August 28, 2025
作者: Shengqu Cai, Ceyuan Yang, Lvmin Zhang, Yuwei Guo, Junfei Xiao, Ziyan Yang, Yinghao Xu, Zhenheng Yang, Alan Yuille, Leonidas Guibas, Maneesh Agrawala, Lu Jiang, Gordon Wetzstein
cs.AI
Abstract
Long video generation is fundamentally a long-context memory problem: models must retain and retrieve salient events across a long horizon without collapsing or drifting. However, scaling diffusion transformers to generate long-context videos is limited by the quadratic cost of self-attention, which makes memory and computation intractable and difficult to optimize for long sequences. We recast long-context video generation as an internal information retrieval task and propose a simple, learnable sparse attention routing module, Mixture of Contexts (MoC), as an effective long-term memory retrieval engine. In MoC, each query dynamically selects a few informative chunks plus mandatory anchors (the caption and local windows) to attend to, with causal routing that prevents loop closures. As we scale the data and gradually sparsify the routing, the model allocates compute to salient history, preserving identities, actions, and scenes over minutes of content. Efficiency follows as a byproduct of retrieval (near-linear scaling), which enables practical training and synthesis, and memory and consistency emerge at the scale of minutes.
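To make the routing concrete, below is a minimal PyTorch sketch of chunk-level top-k attention routing in the spirit of the abstract, under simplifying assumptions: a single head, unbatched tensors, a sequence length divisible by `chunk_size`, and chunk 0 standing in for the caption anchor. The function `moc_attention` and all of its parameters are illustrative names, not the paper's implementation.

```python
import torch
import torch.nn.functional as F


def moc_attention(q, k, v, chunk_size=64, top_k=4):
    """Route each query chunk to a few informative past chunks plus anchors."""
    seq_len, dim = q.shape
    assert seq_len % chunk_size == 0, "sketch assumes divisible sequence length"
    n_chunks = seq_len // chunk_size

    # One descriptor per chunk: mean-pooled keys (and queries for scoring).
    k_chunks = k.view(n_chunks, chunk_size, dim)
    q_chunks = q.view(n_chunks, chunk_size, dim)
    k_desc = k_chunks.mean(dim=1)                      # (n_chunks, dim)
    q_desc = q_chunks.mean(dim=1)                      # (n_chunks, dim)

    # Chunk-to-chunk relevance scores.
    scores = (q_desc @ k_desc.T) / dim ** 0.5          # (n_chunks, n_chunks)

    # Causal routing: a chunk may route only to itself and earlier chunks,
    # never to the future, which is what prevents loop closures.
    causal = torch.ones(n_chunks, n_chunks).tril().bool()
    scores = scores.masked_fill(~causal, float("-inf"))

    out = torch.empty_like(q)
    for i in range(n_chunks):
        k_eff = min(top_k, i + 1)                      # fewer chunks exist early on
        picked = set(scores[i].topk(k_eff).indices.tolist())
        # Mandatory anchors: the caption (chunk 0 stands in for caption
        # tokens here) and the local window (the chunk itself).
        picked |= {0, i}
        idx = torch.cat([torch.arange(c * chunk_size, (c + 1) * chunk_size)
                         for c in sorted(picked)])
        # Dense attention restricted to the selected tokens only.
        attn = F.scaled_dot_product_attention(
            q_chunks[i].unsqueeze(0), k[idx].unsqueeze(0), v[idx].unsqueeze(0))
        out[i * chunk_size:(i + 1) * chunk_size] = attn.squeeze(0)
    return out


# Usage: each of the 512 query tokens attends to at most a few chunks.
q, k, v = (torch.randn(512, 128) for _ in range(3))
y = moc_attention(q, k, v, chunk_size=64, top_k=2)     # -> (512, 128)
```

The near-linear scaling claimed in the abstract falls out of this structure: with fixed `top_k` and `chunk_size`, every query attends to at most `(top_k + 2) * chunk_size` keys regardless of total length, so cost grows roughly linearly in sequence length rather than quadratically.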