FreeLong++: Training-Free Long Video Generation via Multi-band Spectral Fusion

Abstract

Recent advances in video generation models have enabled high-quality short video generation from text prompts. However, extending these models to longer videos remains a significant challenge, primarily due to degraded temporal consistency and visual fidelity. Our preliminary observations show that naively applying short-video generation models to longer sequences leads to noticeable quality degradation. Further analysis identifies a systematic trend in which high-frequency components become increasingly distorted as video length grows, an issue we term high-frequency distortion. To address this, we propose FreeLong, a training-free framework designed to balance the frequency distribution of long video features during the denoising process. FreeLong achieves this by blending global low-frequency features, which capture holistic semantics across the full video, with local high-frequency features extracted from short temporal windows to preserve fine details. Building on this, FreeLong++ extends FreeLong's dual-branch design into a multi-branch architecture with multiple attention branches, each operating at a distinct temporal scale. By arranging multiple window sizes from global to local, FreeLong++ enables multi-band frequency fusion from low to high frequencies, ensuring both semantic continuity and fine-grained motion dynamics across longer video sequences. Without any additional training, FreeLong++ can be plugged into existing video generation models (e.g., Wan2.1 and LTX-Video) to produce longer videos with substantially improved temporal consistency and visual fidelity. We demonstrate that our approach outperforms previous methods on longer video generation tasks (e.g., 4× and 8× the native length). It also supports coherent multi-prompt video generation with smooth scene transitions and enables controllable video generation using long depth or pose sequences.
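The multi-band fusion described in the abstract can be illustrated with a small toy example. The sketch below is not the authors' implementation: it assumes each attention branch has already produced a one-dimensional temporal feature track of equal length, that fusion happens only along the temporal axis, and that bands are split by a hard partition of frequency bins. The function names (`fuse_multiband`, `band_index`) and the `cutoffs` values are hypothetical and chosen only for illustration.

```python
import cmath


def dft(x):
    """Naive O(n^2) discrete Fourier transform of a sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft_real(spec):
    """Inverse DFT; returns the real part (exact for conjugate-symmetric spectra)."""
    n = len(spec)
    return [
        sum(spec[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def band_index(k, n, cutoffs):
    """Band of bin k: number of cutoffs at or below its normalized frequency (0 to 0.5)."""
    freq = min(k, n - k) / n
    return sum(1 for c in cutoffs if freq >= c)


def fuse_multiband(branches, cutoffs):
    """Fuse branch outputs by taking each temporal frequency band from one branch.

    branches[0] is assumed to be the global branch (lowest band) and
    branches[-1] the most local branch (highest band). This ordering and
    the hard band partition are assumptions of this sketch.
    """
    if len(branches) != len(cutoffs) + 1:
        raise ValueError("need exactly one more branch than cutoffs")
    n = len(branches[0])
    if any(len(b) != n for b in branches):
        raise ValueError("all branches must have the same length")
    spectra = [dft(b) for b in branches]
    # For every frequency bin, copy the coefficient from the branch that owns its band.
    fused = [spectra[band_index(k, n, cutoffs)][k] for k in range(n)]
    return idft_real(fused)


# Toy data: the "global" branch carries only a constant (zero-frequency) component,
# the "local" branch carries only the fastest alternation (Nyquist frequency).
global_branch = [2.0] * 8
local_branch = [(-1.0) ** t for t in range(8)]

fused = fuse_multiband([global_branch, local_branch], cutoffs=[0.25])
print([round(v, 6) for v in fused])  # approximately [3, 1, 3, 1, 3, 1, 3, 1]
```

In this toy case the fused track keeps the constant level from the global branch and the frame-to-frame alternation from the local branch. Adding more branches and cutoffs gives the multi-band variant. A practical version would presumably operate on full feature tensors with a fast FFT rather than on scalar tracks.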
