Mode Seeking meets Mean Seeking for Fast Long Video Generation
February 27, 2026
Authors: Shengqu Cai, Weili Nie, Chao Liu, Julius Berner, Lvmin Zhang, Nanye Ma, Hansheng Chen, Maneesh Agrawala, Leonidas Guibas, Gordon Wetzstein, Arash Vahdat
cs.AI
Abstract
Scaling video generation from seconds to minutes faces a critical bottleneck: while short-video data is abundant and high-fidelity, coherent long-form data is scarce and limited to narrow domains. To address this, we propose a training paradigm in which Mode Seeking meets Mean Seeking, decoupling local fidelity from long-term coherence over a unified representation via a Decoupled Diffusion Transformer. Our approach uses a global Flow Matching head, trained via supervised learning on long videos, to capture narrative structure, while a local Distribution Matching head aligns sliding windows to a frozen short-video teacher via a mode-seeking reverse-KL divergence. This strategy enables minute-scale synthesis: the model learns long-range coherence and motion from limited long videos via supervised flow matching, while inheriting local realism by aligning every sliding-window segment of the student to the frozen short-video teacher, yielding a fast, few-step long-video generator. Evaluations show that our method effectively closes the fidelity-horizon gap by jointly improving local sharpness, motion, and long-range consistency. Project website: https://primecai.github.io/mmm/.
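To make the two objectives concrete, below is a minimal PyTorch-style sketch of how the combined training signal described in the abstract might look. All module and function names (student_backbone, fm_head, teacher_score, fake_score), the rectified-flow parameterization, and the DMD-style surrogate for the reverse-KL gradient are assumptions made for illustration, not the paper's released code.

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-ins for the paper's components (names are assumptions):
#   student_backbone - Decoupled Diffusion Transformer trunk (unified representation)
#   fm_head          - global Flow Matching head, supervised on long videos (mean seeking)
#   teacher_score    - frozen short-video teacher's denoiser/score network
#   fake_score       - auxiliary critic tracking the student's own distribution (DMD-style)


def flow_matching_loss(student_backbone, fm_head, long_video_latents):
    """Mean-seeking term: supervised rectified-flow matching on scarce long videos."""
    x0 = long_video_latents                           # clean latents, [B, T, C, H, W]
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], device=x0.device).view(-1, 1, 1, 1, 1)
    x_t = (1.0 - t) * x0 + t * noise                  # linear noising path
    target_velocity = noise - x0                      # rectified-flow velocity target
    hidden = student_backbone(x_t, t)                 # shared (unified) representation
    pred_velocity = fm_head(hidden, t)
    return F.mse_loss(pred_velocity, target_velocity)


def distribution_matching_loss(teacher_score, fake_score, generated_video,
                               window=16, stride=8):
    """Mode-seeking term: align every sliding window of the student's own
    generation to the frozen short-video teacher via a DMD-style surrogate
    for the reverse-KL gradient."""
    losses = []
    num_frames = generated_video.shape[1]
    for start in range(0, num_frames - window + 1, stride):
        seg = generated_video[:, start:start + window]       # local short clip
        noise = torch.randn_like(seg)
        t = torch.rand(seg.shape[0], device=seg.device).view(-1, 1, 1, 1, 1)
        seg_t = (1.0 - t) * seg.detach() + t * noise
        with torch.no_grad():
            s_real = teacher_score(seg_t, t)                  # frozen short-video teacher
            s_fake = fake_score(seg_t, t)                     # critic of student distribution
            grad = s_fake - s_real                            # reverse-KL gradient estimate
        # Surrogate loss whose gradient w.r.t. `seg` points along `grad` (DMD trick).
        losses.append(0.5 * F.mse_loss(seg, (seg - grad).detach()))
    return torch.stack(losses).mean()
```

A training step might then combine the two terms, e.g. `loss = flow_matching_loss(...) + lam * distribution_matching_loss(...)`, with `generated_video` assumed to come from the student's few-step sampling path through the Distribution Matching head and `lam` a weighting hyperparameter; the exact weighting, window size, and sampling procedure are not specified in the abstract.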