FreeLong++: Training-Free Long Video Generation via Multi-band Spectral Fusion

June 30, 2025
Authors: Yu Lu, Yi Yang
cs.AI

Abstract

Recent advances in video generation models have enabled high-quality short video generation from text prompts. However, extending these models to longer videos remains a significant challenge, primarily due to degraded temporal consistency and visual fidelity. Our preliminary observations show that naively applying short-video generation models to longer sequences leads to noticeable quality degradation. Further analysis identifies a systematic trend in which high-frequency components become increasingly distorted as video length grows, an issue we term high-frequency distortion. To address this, we propose FreeLong, a training-free framework designed to balance the frequency distribution of long-video features during the denoising process. FreeLong achieves this by blending global low-frequency features, which capture holistic semantics across the full video, with local high-frequency features extracted from short temporal windows to preserve fine details. Building on this, FreeLong++ extends FreeLong's dual-branch design into a multi-branch architecture with multiple attention branches, each operating at a distinct temporal scale. By arranging multiple window sizes from global to local, FreeLong++ enables multi-band frequency fusion from low to high frequencies, ensuring both semantic continuity and fine-grained motion dynamics across longer video sequences. Without any additional training, FreeLong++ can be plugged into existing video generation models (e.g., Wan2.1 and LTX-Video) to produce longer videos with substantially improved temporal consistency and visual fidelity. We demonstrate that our approach outperforms previous methods on longer video generation tasks (e.g., 4× and 8× the native length). It also supports coherent multi-prompt video generation with smooth scene transitions and enables controllable video generation using long depth or pose sequences.
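The core fusion step described above can be sketched in a few lines. The following is a minimal PyTorch illustration, not the authors' released implementation: it assumes each attention branch has already produced features flattened to shape (frames, channels), and the band cutoffs as well as the helper names `band_mask` and `multi_band_fusion` are hypothetical choices made for the example.

```python
# A minimal sketch of multi-band temporal frequency fusion.
# Assumptions (not from the paper): branch attention features are already
# computed with shape (frames, channels); cutoffs and helper names are
# illustrative only.
import torch

def band_mask(num_frames: int, low: float, high: float) -> torch.Tensor:
    """Select normalized temporal frequencies in the half-open band [low, high)."""
    freqs = torch.fft.fftfreq(num_frames).abs()  # magnitudes in [0, 0.5]
    return ((freqs >= low) & (freqs < high)).float()

def multi_band_fusion(branch_feats: list[torch.Tensor],
                      cutoffs: list[float]) -> torch.Tensor:
    """Fuse features so branch i contributes only the band [cutoffs[i], cutoffs[i+1]).

    Branches are ordered global -> local: the full-video branch supplies the
    lowest band (holistic semantics), the shortest-window branch the highest
    band (fine motion detail).
    """
    num_frames, channels = branch_feats[0].shape
    fused = torch.zeros(num_frames, channels, dtype=torch.complex64)
    for feat, low, high in zip(branch_feats, cutoffs[:-1], cutoffs[1:]):
        spectrum = torch.fft.fft(feat, dim=0)  # spectrum along the time axis
        fused = fused + spectrum * band_mask(num_frames, low, high)[:, None]
    return torch.fft.ifft(fused, dim=0).real

# Example: three branches (global, mid, local) over 64 frames, 128 channels.
# The final cutoff is nudged past 0.5 so the Nyquist bin lands in the top band.
feats = [torch.randn(64, 128) for _ in range(3)]
fused = multi_band_fusion(feats, cutoffs=[0.0, 0.1, 0.25, 0.51])
```

Under these assumptions, using one global and one local branch with cutoffs [0.0, c, 0.51] recovers the dual-branch FreeLong case; FreeLong++ generalizes it by adding intermediate window sizes, each claiming its own frequency band.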