Mixture of Contexts for Long Video Generation

Abstract

Long video generation is fundamentally a long-context memory problem: a model must retain and retrieve salient events over a long range without collapsing or drifting. However, scaling diffusion transformers to long-context video is limited by the quadratic cost of self-attention, which makes memory and computation intractable and optimization difficult for long sequences. We recast long-context video generation as an internal information retrieval task and propose a simple, learnable sparse attention routing module, Mixture of Contexts (MoC), as an effective long-term memory retrieval engine. In MoC, each query dynamically selects a few informative chunks plus mandatory anchors (caption, local windows) to attend to, with causal routing that prevents loop closures. As we scale the data and gradually sparsify the routing, the model allocates compute to salient history, preserving identities, actions, and scenes over minutes of content. Efficiency follows as a byproduct of retrieval (near-linear scaling), which enables practical training and synthesis and the emergence of memory and consistency at the scale of minutes.
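The routing idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It makes several assumptions that the abstract does not state: the function names `moc_route` and `sparse_attention` are hypothetical, caption tokens are placed first in the sequence, chunks have a fixed size, each chunk is summarised by the mean of its keys, the "local window" anchor is the query's own chunk, and caption queries attend only to caption tokens.

```python
import numpy as np


def moc_route(q, k, chunk_size, top_k, n_caption):
    """Return a boolean mask where mask[i, j] means query i may attend to key j.

    Assumed layout (hypothetical): the first n_caption tokens are caption
    tokens; the remaining video tokens (at least one) are split into
    fixed-size chunks.
    """
    n = q.shape[0]
    n_video = n - n_caption
    n_chunks = -(-n_video // chunk_size)  # ceiling division
    chunk_of = np.arange(n_video) // chunk_size  # chunk id of each video token

    # Assumption: each chunk is summarised by the mean of its keys.
    k_video = k[n_caption:]
    desc = np.stack([k_video[chunk_of == c].mean(axis=0) for c in range(n_chunks)])

    mask = np.zeros((n, n), dtype=bool)
    mask[:, :n_caption] = True  # mandatory anchor: caption tokens

    for i in range(n_caption, n):
        own = int(chunk_of[i - n_caption])
        selected = [own]  # mandatory anchor: local window (own chunk)
        # Causal routing: only strictly earlier chunks are candidates.
        if own > 0 and top_k > 0:
            scores = desc[:own] @ q[i]
            best = np.argsort(scores)[::-1][:top_k]
            selected.extend(best.tolist())
        mask[i, n_caption:] = np.isin(chunk_of, selected)
    return mask


def sparse_attention(q, k, v, mask):
    """Softmax attention restricted to the positions allowed by mask."""
    scores = (q @ k.T) / np.sqrt(q.shape[1])
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v


# Usage: 2 caption tokens followed by 16 video tokens in chunks of 4.
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(18, 8)) for _ in range(3))
mask = moc_route(q, k, chunk_size=4, top_k=1, n_caption=2)
out = sparse_attention(q, k, v, mask)
```

In this sketch each video query attends to at most `n_caption + (top_k + 1) * chunk_size` keys regardless of sequence length, which is one way to read the near-linear scaling mentioned above. The dense mask is built only for clarity; a practical implementation would gather the selected chunks instead of materialising an `n × n` mask.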
