Long Context Tuning for Video Generation

Abstract

Recent advances in video generation can produce realistic, minute-long single-shot videos with scalable diffusion transformers. However, real-world narrative videos require multi-shot scenes with visual and dynamic consistency across shots. In this work, we introduce Long Context Tuning (LCT), a training paradigm that expands the context window of pre-trained single-shot video diffusion models to learn scene-level consistency directly from data. Our method expands full attention mechanisms from individual shots to encompass all shots within a scene, incorporating interleaved 3D position embedding and an asynchronous noise strategy, which enables both joint and auto-regressive shot generation without additional parameters. Models with bidirectional attention after LCT can be further fine-tuned with context-causal attention, facilitating auto-regressive generation with an efficient KV-cache. Experiments demonstrate that single-shot models after LCT can produce coherent multi-shot scenes and exhibit emerging capabilities, including compositional generation and interactive shot extension, paving the way for more practical visual content creation. See https://guoyww.github.io/projects/long-context-video/ for more details.
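
To make the attention and noise ideas above concrete, the following is a minimal sketch in plain Python, not the authors' implementation. The abstract does not give the exact mask layout or timestep distribution, so two things here are assumptions: that context-causal attention takes a block-causal form (a token attends to its own shot in full and to all earlier shots), and that the asynchronous noise strategy draws one independent, uniformly sampled diffusion timestep per shot. The function names `shot_ids`, `attention_mask`, and `sample_async_timesteps` are hypothetical, and the interleaved 3D position embedding is not covered.

```python
import random


def shot_ids(tokens_per_shot):
    """Map each token position in the concatenated scene to its shot index."""
    ids = []
    for shot, count in enumerate(tokens_per_shot):
        ids.extend([shot] * count)
    return ids


def attention_mask(tokens_per_shot, context_causal=False):
    """Return mask[q][k], True when query token q may attend to key token k.

    Bidirectional: every token attends to every token in the scene.
    Context-causal (assumed block-causal form): a token attends to its own
    shot in full and to all earlier shots, but never to later shots.
    """
    ids = shot_ids(tokens_per_shot)
    if not context_causal:
        return [[True] * len(ids) for _ in ids]
    return [[k_shot <= q_shot for k_shot in ids] for q_shot in ids]


def sample_async_timesteps(num_shots, num_steps=1000, rng=None):
    """Draw one independent diffusion timestep per shot (assumed uniform)."""
    rng = rng or random.Random()
    return [rng.randrange(num_steps) for _ in range(num_shots)]


if __name__ == "__main__":
    # A scene with two shots of 2 and 3 tokens respectively.
    for row in attention_mask([2, 3], context_causal=True):
        print("".join("x" if allowed else "." for allowed in row))
    print(sample_async_timesteps(num_shots=2))
```

Under the block-causal assumption, keys and values of earlier shots never depend on later shots, which is what would allow them to be cached and reused when generating the next shot. Giving each shot its own timestep means some shots can be held at low noise as conditioning context while others are denoised, which is one way a single model could cover both joint and auto-regressive generation.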
