LumosFlow: Motion-Guided Long Video Generation

Abstract

Long video generation has gained increasing attention due to its widespread applications in fields such as entertainment and simulation. Despite advances, synthesizing temporally coherent and visually compelling long sequences remains a formidable challenge. Conventional approaches often synthesize long videos either by sequentially generating and concatenating short clips, or by generating key frames and then interpolating the intermediate frames in a hierarchical manner. However, both approaches still face significant challenges, leading to issues such as temporal repetition or unnatural transitions. In this paper, we revisit the hierarchical long video generation pipeline and introduce LumosFlow, a framework that explicitly introduces motion guidance. Specifically, we first employ the Large Motion Text-to-Video Diffusion Model (LMTV-DM) to generate key frames with larger motion intervals, thereby ensuring content diversity in the generated long videos. Given the complexity of interpolating contextual transitions between key frames, we further decompose intermediate frame interpolation into motion generation and post-hoc refinement. For each pair of key frames, the Latent Optical Flow Diffusion Model (LOF-DM) synthesizes complex, large-motion optical flows, and MotionControlNet then refines the warped results to enhance quality and guide intermediate frame generation. Compared with traditional video frame interpolation, we achieve 15x interpolation while ensuring reasonable and continuous motion between adjacent frames. Experiments show that our method can generate long videos with consistent motion and appearance. Code and models will be made publicly available upon acceptance. Project page: https://jiahaochen1.github.io/LumosFlow/
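The abstract describes warping key frames with a generated optical flow before a refinement stage. As a rough illustration of what such a warping step can look like, the sketch below backward-warps a small grayscale frame with a dense flow field using bilinear sampling. It is a generic, dependency-free example and not the LumosFlow implementation: the function name `backward_warp`, the flow convention (each output pixel stores an offset `(dx, dy)` to its source location), and the border clamping are all assumptions made for illustration. The abstract indicates that LOF-DM works on latent optical flow, which this pixel-space sketch does not model.

```python
# Illustrative backward warping with a dense optical flow field.
# Assumption: flow[y][x] = (dx, dy) is the offset from an output pixel to the
# location it samples in the key frame. Generic sketch, not LumosFlow's code.

def backward_warp(frame, flow):
    """Warp a grayscale frame (list of rows) using bilinear sampling."""
    height, width = len(frame), len(frame[0])
    warped = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = flow[y][x]
            # Source coordinates, clamped to the image border.
            sx = min(max(x + dx, 0.0), width - 1.0)
            sy = min(max(y + dy, 0.0), height - 1.0)
            # Four neighboring pixels and the fractional weights between them.
            x0, y0 = int(sx), int(sy)
            x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
            wx, wy = sx - x0, sy - y0
            top = (1 - wx) * frame[y0][x0] + wx * frame[y0][x1]
            bottom = (1 - wx) * frame[y1][x0] + wx * frame[y1][x1]
            warped[y][x] = (1 - wy) * top + wy * bottom
    return warped


key_frame = [[0.0, 1.0, 2.0],
             [3.0, 4.0, 5.0],
             [6.0, 7.0, 8.0]]
# Uniform flow of (+1, 0): every output pixel samples its right-hand neighbor.
uniform_flow = [[(1.0, 0.0)] * 3 for _ in range(3)]
warped = backward_warp(key_frame, uniform_flow)
```

With the uniform flow above, each row shifts one pixel to the left and the last column repeats because of border clamping. Warped frames like this typically show stretching and holes under large motion, which is consistent with the abstract's motivation for a separate refinement stage after warping.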
