Mode Seeking meets Mean Seeking for Fast Long Video Generation

Abstract

Scaling video generation from seconds to minutes faces a critical bottleneck: short-video data is abundant and high-fidelity, whereas coherent long-form data is scarce and limited to narrow domains. To address this, we propose a training paradigm in which Mode Seeking meets Mean Seeking, decoupling local fidelity from long-term coherence on top of a unified representation provided by a Decoupled Diffusion Transformer. Our approach uses a global Flow Matching head, trained with supervised learning on long videos, to capture narrative structure, while a local Distribution Matching head aligns sliding windows to a frozen short-video teacher via a mode-seeking reverse-KL divergence. With this strategy, the model learns long-range coherence and motion from limited long videos through supervised flow matching and inherits local realism by aligning every sliding-window segment of the student to the frozen short-video teacher, yielding a fast, few-step generator of minute-scale videos. Evaluations show that our method effectively closes the fidelity-horizon gap by jointly improving local sharpness, motion, and long-range consistency. Project website: https://primecai.github.io/mmm/.
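
The two-term objective described above can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the linear noise-to-data path, the names `sliding_windows`, `flow_matching_loss`, and `total_loss`, the `weight` parameter, and the `window_loss` callable (a stand-in for the reverse-KL distribution-matching term computed against the frozen short-video teacher) are all hypothetical and do not come from the abstract.

```python
import numpy as np


def sliding_windows(num_frames, window, stride):
    """Return (start, end) frame-index pairs that cover a clip of num_frames frames."""
    if window > num_frames:
        raise ValueError("window must not exceed num_frames")
    starts = list(range(0, num_frames - window + 1, stride))
    # Add a final window so the last frames are covered when the stride does not fit evenly.
    if starts[-1] + window < num_frames:
        starts.append(num_frames - window)
    return [(s, s + window) for s in starts]


def flow_matching_loss(predict_velocity, video, rng):
    """Mean-seeking term: supervised regression on a full long clip.

    Assumes a straight path between Gaussian noise (t=0) and data (t=1).
    """
    noise = rng.standard_normal(video.shape)
    t = rng.uniform()
    x_t = (1.0 - t) * noise + t * video  # point on the straight path
    target = video - noise               # constant velocity along that path
    return float(np.mean((predict_velocity(x_t, t) - target) ** 2))


def total_loss(predict_velocity, window_loss, long_video, generated,
               window, stride, weight, rng):
    """Global flow-matching term plus a local term averaged over sliding windows.

    window_loss is a hypothetical callable that scores one short segment of the
    student's output; in the paper's setting it would be derived from the
    frozen short-video teacher.
    """
    global_term = flow_matching_loss(predict_velocity, long_video, rng)
    spans = sliding_windows(generated.shape[0], window, stride)
    local_term = float(np.mean([window_loss(generated[s:e]) for s, e in spans]))
    return global_term + weight * local_term
```

With a 10-frame clip, `sliding_windows(10, 4, 3)` yields the segments (0, 4), (3, 7), and (6, 10), so every frame is scored by at least one local window while the flow-matching term sees the whole clip at once.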
