Mode Seeking meets Mean Seeking for Fast Long Video Generation

Abstract

Scaling video generation from seconds to minutes faces a critical bottleneck: short-video data is abundant and high-fidelity, whereas coherent long-form data is scarce and limited to narrow domains. To address this, we propose a training paradigm where Mode Seeking meets Mean Seeking, decoupling local fidelity from long-term coherence on top of a unified representation provided by a Decoupled Diffusion Transformer. Our approach uses a global Flow Matching head, trained with supervised learning on long videos, to capture narrative structure, together with a local Distribution Matching head that aligns sliding windows to a frozen short-video teacher via a mode-seeking reverse-KL divergence. This strategy enables the synthesis of minute-scale videos: the model learns long-range coherence and motion from limited long videos via supervised flow matching, while inheriting local realism by aligning every sliding-window segment of the student to the frozen short-video teacher. The result is a fast, few-step long video generator. Evaluations show that our method effectively closes the fidelity-horizon gap by jointly improving local sharpness, motion, and long-range consistency. Project website: https://primecai.github.io/mmm/.
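The abstract contrasts a mode-seeking reverse-KL objective with a mean-seeking supervised objective but gives no formulas. As a rough illustration of that general contrast only, and not of the paper's training losses, the toy sketch below fits a single Gaussian to a bimodal target on a 1D grid by minimizing either the forward KL (mass-covering, the behaviour of likelihood-style supervised fitting) or the reverse KL (mode-seeking). The grid, the target mixture, the candidate parameters, and all function names are assumptions made up for this example.

```python
import math

# Hypothetical 1D grid on [-6, 6]; every distribution below is discretized on it.
GRID = [-6.0 + 0.05 * i for i in range(241)]


def log_normalize(log_w):
    """Turn unnormalized log-weights into log-probabilities (log-sum-exp)."""
    m = max(log_w)
    log_z = m + math.log(sum(math.exp(v - m) for v in log_w))
    return [v - log_z for v in log_w]


def log_gaussian(mu, sigma):
    """Log-probabilities of a discretized Gaussian N(mu, sigma^2)."""
    return log_normalize([-0.5 * ((x - mu) / sigma) ** 2 for x in GRID])


def log_bimodal():
    """Toy target: equal mixture of N(-2, 0.5^2) and N(+2, 0.5^2)."""
    return log_normalize([
        math.log(0.5 * math.exp(-0.5 * ((x + 2.0) / 0.5) ** 2)
                 + 0.5 * math.exp(-0.5 * ((x - 2.0) / 0.5) ** 2))
        for x in GRID
    ])


def kl(log_a, log_b):
    """KL(a || b) for two discrete distributions given as log-probabilities."""
    return sum(math.exp(la) * (la - lb) for la, lb in zip(log_a, log_b))


def fit(objective):
    """Grid-search the (mu, sigma) pair that minimizes the given objective."""
    mus = [-3.0 + 0.25 * i for i in range(25)]   # -3.0 ... 3.0
    sigmas = [0.25 * j for j in range(1, 13)]    # 0.25 ... 3.0
    return min(((mu, s) for mu in mus for s in sigmas),
               key=lambda ms: objective(log_gaussian(*ms)))


LOG_P = log_bimodal()

# Forward KL(p || q): q is penalized wherever p has mass that q misses.
fwd_mu, fwd_sigma = fit(lambda log_q: kl(LOG_P, log_q))
# Reverse KL(q || p): q is penalized wherever it puts mass that p lacks.
rev_mu, rev_sigma = fit(lambda log_q: kl(log_q, LOG_P))

print("forward KL fit:", fwd_mu, fwd_sigma)
print("reverse KL fit:", rev_mu, rev_sigma)
```

In this toy setting the forward-KL fit settles between the two modes with a wide spread, while the reverse-KL fit locks onto one mode with a narrow spread. That is the usual intuition behind calling the two objectives mean-seeking and mode-seeking; how the paper applies each objective to long videos and sliding windows is not specified in the abstract.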
