LumosFlow: Motion-Guided Long Video Generation
June 3, 2025
Authors: Jiahao Chen, Hangjie Yuan, Yichen Qian, Jingyun Liang, Jiazheng Xing, Pengwei Liu, Weihua Chen, Fan Wang, Bing Su
cs.AI
Abstract
Long video generation has gained increasing attention due to its widespread
applications in fields such as entertainment and simulation. Despite advances,
synthesizing temporally coherent and visually compelling long sequences remains
a formidable challenge. Conventional approaches often synthesize long videos by
sequentially generating and concatenating short clips, or by generating key
frames and then interpolating the intermediate frames in a hierarchical manner.
However, both approaches still face significant challenges, leading to issues
such as temporal repetition or unnatural transitions. In this paper, we revisit
the hierarchical long video generation pipeline and introduce LumosFlow, a
framework that introduces motion guidance explicitly. Specifically, we first employ
the Large Motion Text-to-Video Diffusion Model (LMTV-DM) to generate key frames
with larger motion intervals, thereby ensuring content diversity in the
generated long videos. Given the complexity of interpolating contextual
transitions between key frames, we further decompose the intermediate frame
interpolation into motion generation and post-hoc refinement. For each pair of
key frames, the Latent Optical Flow Diffusion Model (LOF-DM) synthesizes
complex and large-motion optical flows, while MotionControlNet subsequently
refines the warped results to enhance quality and guide intermediate frame
generation. Compared with traditional video frame interpolation, we achieve 15x
interpolation, ensuring reasonable and continuous motion between adjacent
frames. Experiments show that our method can generate long videos with
consistent motion and appearance. Code and models will be made publicly
available upon acceptance. Our project page:
https://jiahaochen1.github.io/LumosFlow/
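
To make the three-stage pipeline described above concrete, here is a minimal sketch in Python/PyTorch. It is not the authors' implementation: the model interfaces (`lmtv_dm`, `lof_dm`, `motion_controlnet`) are hypothetical placeholders for the paper's LMTV-DM, LOF-DM, and MotionControlNet components, and the warping step is a standard backward warp via `torch.nn.functional.grid_sample`, assumed here as the mechanism for applying the synthesized optical flows.

```python
# Illustrative sketch of the LumosFlow hierarchical pipeline from the abstract.
# All model interfaces below are hypothetical placeholders, not a public API.
import torch
import torch.nn.functional as F


def backward_warp(frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp a frame (B, C, H, W) with a dense optical flow field (B, 2, H, W)."""
    b, _, h, w = frame.shape
    # Build a pixel-coordinate grid, displace it by the flow, then normalize
    # to [-1, 1] as required by grid_sample.
    ys, xs = torch.meshgrid(
        torch.arange(h, device=frame.device),
        torch.arange(w, device=frame.device),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=-1).float()         # (H, W, 2), (x, y) order
    grid = grid.unsqueeze(0) + flow.permute(0, 2, 3, 1)  # displace by flow
    grid[..., 0] = 2.0 * grid[..., 0] / (w - 1) - 1.0
    grid[..., 1] = 2.0 * grid[..., 1] / (h - 1) - 1.0
    return F.grid_sample(frame, grid, align_corners=True)


def generate_long_video(prompt, lmtv_dm, lof_dm, motion_controlnet,
                        frames_between=15):
    # Stage 1: LMTV-DM samples key frames with large motion intervals,
    # which is what gives the long video its content diversity.
    key_frames = lmtv_dm.sample(prompt)  # list of (1, C, H, W) tensors

    video = [key_frames[0]]
    for k0, k1 in zip(key_frames[:-1], key_frames[1:]):
        # Stage 2: LOF-DM synthesizes complex, large-motion optical flows
        # from the key frame toward each intermediate time step. The paper's
        # "15x interpolation" is interpreted here as 15 in-between frames
        # per key-frame pair (an assumption).
        flows = lof_dm.sample(k0, k1, steps=frames_between)  # (T, 1, 2, H, W)
        for t, flow_t in enumerate(flows, start=1):
            warped = backward_warp(k0, flow_t)
            # Stage 3: MotionControlNet refines the warped result to improve
            # quality and guide intermediate-frame generation.
            video.append(motion_controlnet.refine(
                warped, k0, k1, t / (frames_between + 1)))
        video.append(k1)
    return torch.cat(video, dim=0)  # (N, C, H, W) long video
```

The key design choice this sketch reflects is the decomposition: rather than asking one model to hallucinate long transitions directly, motion (the flow field) is generated first and appearance is repaired afterward, which is what lets the method bridge key frames spaced much farther apart than traditional frame interpolation allows.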