LumosFlow: Motion-Guided Long Video Generation
June 3, 2025
Authors: Jiahao Chen, Hangjie Yuan, Yichen Qian, Jingyun Liang, Jiazheng Xing, Pengwei Liu, Weihua Chen, Fan Wang, Bing Su
cs.AI
Abstract
Long video generation has gained increasing attention due to its widespread
applications in fields such as entertainment and simulation. Despite advances,
synthesizing temporally coherent and visually compelling long sequences remains
a formidable challenge. Conventional approaches often synthesize long videos by
sequentially generating and concatenating short clips, or by generating key
frames and then interpolating the intermediate frames in a hierarchical manner.
However, both approaches still face significant challenges, leading to issues
such as temporal repetition or unnatural transitions. In this paper, we revisit
the hierarchical long video generation pipeline and introduce LumosFlow, a
framework that introduces motion guidance explicitly. Specifically, we first employ
the Large Motion Text-to-Video Diffusion Model (LMTV-DM) to generate key frames
with larger motion intervals, thereby ensuring content diversity in the
generated long videos. Given the complexity of interpolating contextual
transitions between key frames, we further decompose the intermediate frame
interpolation into motion generation and post-hoc refinement. For each pair of
key frames, the Latent Optical Flow Diffusion Model (LOF-DM) synthesizes
complex and large-motion optical flows, while MotionControlNet subsequently
refines the warped results to enhance quality and guide intermediate frame
generation. Compared with traditional video frame interpolation, we achieve 15x
interpolation, ensuring reasonable and continuous motion between adjacent
frames. Experiments show that our method can generate long videos with
consistent motion and appearance. Code and models will be made publicly
available upon acceptance. Our project page:
https://jiahaochen1.github.io/LumosFlow/
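
To make the interpolation step concrete, below is a minimal PyTorch sketch of flow-based backward warping, the operation that produces the "warped results" which MotionControlNet then refines. The function name `warp_with_flow`, the tensor shapes, and the linear scaling of a key-frame-to-key-frame flow by the intermediate time step are illustrative assumptions; the abstract does not specify the actual warping procedure or model interfaces.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp `frame` (B, C, H, W) by an optical flow field
    `flow` (B, 2, H, W) given in pixels, via bilinear sampling."""
    _, _, h, w = frame.shape
    # Base sampling grid of (x, y) pixel coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=frame.dtype, device=frame.device),
        torch.arange(w, dtype=frame.dtype, device=frame.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=0).unsqueeze(0)  # (1, 2, H, W)
    coords = base + flow                              # displaced coordinates
    # Normalize coordinates to [-1, 1], as grid_sample expects.
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)              # (B, H, W, 2)
    return F.grid_sample(frame, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

# Illustrative use: warp a key frame toward the midpoint between two key
# frames by scaling a stand-in flow field. With 15 intermediate frames,
# step t (of 16) corresponds to a scale factor of t / 16.
k0 = torch.rand(1, 3, 256, 256)
flow_0to1 = torch.randn(1, 2, 256, 256)  # would come from LOF-DM in the paper
mid = warp_with_flow(k0, (8 / 16) * flow_0to1)
```

Scaling a single pairwise flow by the time step is the simplest way to target an intermediate frame; if LOF-DM emits a separate flow per intermediate step, that per-step flow would be used directly instead.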