SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation

Abstract

Human beings are endowed with a complementary learning system, which bridges the slow learning of general world dynamics with the fast storage of episodic memory from new experiences. Previous video generation models, however, focus primarily on slow learning by pre-training on vast amounts of data, overlooking the fast learning phase that is crucial for episodic memory storage. This oversight leads to inconsistencies across temporally distant frames when generating longer videos, because these frames fall beyond the model's context window. To address this, we introduce SlowFast-VGen, a novel dual-speed learning system for action-driven long video generation. Our approach combines a masked conditional video diffusion model for the slow learning of world dynamics with an inference-time fast learning strategy based on a temporal LoRA module. Specifically, the fast learning process updates its temporal LoRA parameters based on local inputs and outputs, thereby efficiently storing episodic memory in its parameters. We further propose a slow-fast learning loop algorithm that seamlessly integrates the inner fast learning loop into the outer slow learning loop, enabling the recall of prior multi-episode experiences for context-aware skill learning. To facilitate the slow learning of an approximate world model, we collect a large-scale dataset of 200k videos with language action annotations, covering a wide range of scenarios. Extensive experiments show that SlowFast-VGen outperforms baselines across various metrics for action-driven video generation, achieving an FVD score of 514 compared to 782, and maintains consistency in longer videos, with an average of 0.37 scene cuts versus 0.89. The slow-fast learning loop algorithm also significantly improves performance on long-horizon planning tasks. Project Website: https://slowfast-vgen.github.io
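To make the idea of inference-time fast learning concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a single linear map in place of the video diffusion model and a rank-1 adapter in place of the temporal LoRA module; the class `Rank1Adapter`, the method `fast_update`, and all hyperparameters are hypothetical names chosen for illustration. It only shows the general pattern the abstract describes: the pre-trained ("slow") weights stay frozen while the low-rank ("fast") parameters are updated by gradient steps on a local chunk of inputs and outputs.

```python
import random


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


class Rank1Adapter:
    """Frozen weight vector w plus a trainable rank-1 update b * a (toy stand-in for LoRA)."""

    def __init__(self, w, a):
        self.w = list(w)  # "slow" weights: never modified below
        self.a = list(a)  # "fast" low-rank factor A (1 x d)
        self.b = 0.0      # "fast" low-rank factor B (1 x 1); zero init makes the adapter a no-op

    def predict(self, x):
        return dot(self.w, x) + self.b * dot(self.a, x)

    def loss(self, xs, ys):
        return sum((self.predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

    def fast_update(self, xs, ys, lr=0.02, steps=300):
        """Gradient descent on a and b only, using one local chunk of (input, output) pairs."""
        n = len(xs)
        for _ in range(steps):
            errs = [self.predict(x) - y for x, y in zip(xs, ys)]
            # Gradient of 0.5 * mean squared error w.r.t. the effective weight (w + b * a).
            g = [sum(e * x[j] for e, x in zip(errs, xs)) / n for j in range(len(self.w))]
            grad_b = dot(g, self.a)
            grad_a = [self.b * gj for gj in g]
            self.b -= lr * grad_b
            self.a = [aj - lr * gaj for aj, gaj in zip(self.a, grad_a)]
        return self.loss(xs, ys)


# Hypothetical "new episode": the local data deviates from what the frozen weights predict.
rng = random.Random(0)
d = 4
w = [rng.gauss(0, 1) for _ in range(d)]
a = [0.3 * rng.gauss(0, 1) for _ in range(d)]
shift = [0.5 * rng.gauss(0, 1) for _ in range(d)]
xs = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(16)]
ys = [dot(w, x) + dot(shift, x) for x in xs]

model = Rank1Adapter(w, a)
w_before = list(model.w)
before = model.loss(xs, ys)
after = model.fast_update(xs, ys)
print(f"loss before fast update: {before:.4f}, after: {after:.4f}")
```

In this sketch the episode-specific information ends up in `a` and `b` alone, which mirrors the abstract's claim that episodic memory is stored in the adapter parameters rather than in the context window. How the actual method selects layers, losses, and update schedules is not specified in the abstract and is not modeled here.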
