LumosFlow: Motion-Guided Long Video Generation

Abstract

Long video generation has gained increasing attention due to its widespread applications in fields such as entertainment and simulation. Despite advances, synthesizing temporally coherent and visually compelling long sequences remains a formidable challenge. Conventional approaches often synthesize long videos either by sequentially generating and concatenating short clips, or by generating key frames and then interpolating the intermediate frames in a hierarchical manner. However, both approaches still face significant challenges, leading to issues such as temporal repetition or unnatural transitions. In this paper, we revisit the hierarchical long video generation pipeline and introduce LumosFlow, a framework that introduces motion guidance explicitly. Specifically, we first employ the Large Motion Text-to-Video Diffusion Model (LMTV-DM) to generate key frames with larger motion intervals, thereby ensuring content diversity in the generated long videos. Given the complexity of interpolating contextual transitions between key frames, we further decompose the intermediate frame interpolation into motion generation and post-hoc refinement. For each pair of key frames, the Latent Optical Flow Diffusion Model (LOF-DM) synthesizes complex, large-motion optical flows, and MotionControlNet subsequently refines the warped results to enhance quality and guide intermediate frame generation. Compared with traditional video frame interpolation, we achieve 15x interpolation while ensuring reasonable and continuous motion between adjacent frames. Experiments show that our method can generate long videos with consistent motion and appearance. Code and models will be made publicly available upon acceptance. Project page: https://jiahaochen1.github.io/LumosFlow/
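As a rough illustration of the "motion generation, then refinement" decomposition described above, the warping step that sits between the two stages can be sketched as follows. This is not the authors' implementation: the function names `backward_warp` and `warp_intermediates`, the backward-flow convention, and the nearest-neighbour sampling are all assumptions made for illustration. In LumosFlow the flows would be produced by LOF-DM and the warped frames would then be refined by MotionControlNet; neither model is represented here.

```python
import numpy as np


def backward_warp(frame, flow):
    """Warp `frame` (H, W, C) with a dense flow field `flow` (H, W, 2).

    Assumed convention: flow[y, x] = (dx, dy) means output pixel (x, y)
    is sampled from source location (x + dx, y + dy). Sampling is
    nearest-neighbour, with coordinates clamped to the image border.
    """
    h, w = frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return frame[src_y, src_x]


def warp_intermediates(key_frame, flows):
    """Produce one coarse intermediate frame per flow field.

    `flows` stands in for the output of a flow generator (hypothetical
    here); the results are unrefined and would normally contain holes
    and artifacts that a refinement model has to repair.
    """
    return [backward_warp(key_frame, f) for f in flows]
```

Given a key frame and a list of flow fields, `warp_intermediates(key_frame, flows)` returns one coarse frame per flow. A refinement stage would take these as conditioning input, not as final output.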
