ChatPaper.ai


Mixture of Contexts for Long Video Generation

August 28, 2025
作者: Shengqu Cai, Ceyuan Yang, Lvmin Zhang, Yuwei Guo, Junfei Xiao, Ziyan Yang, Yinghao Xu, Zhenheng Yang, Alan Yuille, Leonidas Guibas, Maneesh Agrawala, Lu Jiang, Gordon Wetzstein
cs.AI

Abstract

Long video generation is fundamentally a long context memory problem: models must retain and retrieve salient events across a long range without collapsing or drifting. However, scaling diffusion transformers to generate long-context videos is fundamentally limited by the quadratic cost of self-attention, which makes memory and computation intractable and difficult to optimize for long sequences. We recast long-context video generation as an internal information retrieval task and propose a simple, learnable sparse attention routing module, Mixture of Contexts (MoC), as an effective long-term memory retrieval engine. In MoC, each query dynamically selects a few informative chunks plus mandatory anchors (caption, local windows) to attend to, with causal routing that prevents loop closures. As we scale the data and gradually sparsify the routing, the model allocates compute to salient history, preserving identities, actions, and scenes over minutes of content. Efficiency follows as a byproduct of retrieval (near-linear scaling), which enables practical training and synthesis, and the emergence of memory and consistency at the scale of minutes.
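To make the routing idea concrete, here is a minimal NumPy sketch of the mechanism the abstract describes: queries are matched against mean-pooled chunk descriptors, each query attends only to its top-k most similar non-future chunks plus mandatory anchors (a first "caption" chunk and its own local chunk), with causality enforced inside the local chunk. This is an illustrative toy for a single attention head, not the paper's implementation; the function name `moc_attention`, the mean-pooling descriptor, and the chunking details are assumptions for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moc_attention(q, k, v, chunk_size=4, top_k=2):
    """Toy Mixture-of-Contexts sparse attention for one head.

    Tokens are grouped into fixed-size chunks. Each query attends only to:
      - the top_k most similar non-future chunks (causal routing), and
      - mandatory anchors: chunk 0 (stand-in for the caption) and its
        own local chunk (masked causally within that chunk).
    """
    n, d = q.shape
    assert n % chunk_size == 0, "toy version assumes divisible length"
    n_chunks = n // chunk_size
    # Chunk descriptors: mean-pooled keys, one vector per chunk.
    chunk_keys = k.reshape(n_chunks, chunk_size, d).mean(axis=1)
    out = np.zeros_like(v)
    for i in range(n):
        ci = i // chunk_size
        # Score this query against every causal (non-future) chunk.
        scores = q[i] @ chunk_keys[: ci + 1].T
        selected = set(np.argsort(-scores)[:top_k])
        selected.update({0, ci})  # mandatory anchors: caption + local chunk
        # Gather tokens of selected chunks; drop future tokens in the local chunk.
        idx = [t for c in sorted(selected)
               for t in range(c * chunk_size, (c + 1) * chunk_size) if t <= i]
        attn = softmax(q[i] @ k[idx].T / np.sqrt(d))
        out[i] = attn @ v[idx]
    return out
```

Because each query touches only a bounded number of chunks rather than the whole history, cost grows roughly linearly with sequence length instead of quadratically, which is the near-linear scaling the abstract refers to.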