FreeLong++: Training-Free Long Video Generation via Multi-band SpectralFusion

Abstract

Recent advances in video generation models have enabled high-quality short video generation from text prompts. However, extending these models to longer videos remains a significant challenge, primarily because temporal consistency and visual fidelity degrade. Our preliminary observations show that naively applying short-video generation models to longer sequences leads to noticeable quality degradation. Further analysis reveals a systematic trend in which high-frequency components become increasingly distorted as video length grows, an issue we term high-frequency distortion. To address this, we propose FreeLong, a training-free framework designed to balance the frequency distribution of long video features during the denoising process. FreeLong achieves this by blending global low-frequency features, which capture holistic semantics across the full video, with local high-frequency features extracted from short temporal windows to preserve fine details. Building on this, FreeLong++ extends FreeLong's dual-branch design into a multi-branch architecture with multiple attention branches, each operating at a distinct temporal scale. By arranging multiple window sizes from global to local, FreeLong++ enables multi-band frequency fusion from low to high frequencies, ensuring both semantic continuity and fine-grained motion dynamics across longer video sequences. Without any additional training, FreeLong++ can be plugged into existing video generation models (e.g., Wan2.1 and LTX-Video) to produce longer videos with substantially improved temporal consistency and visual fidelity. We demonstrate that our approach outperforms previous methods on longer video generation tasks (e.g., 4× and 8× the native length). It also supports coherent multi-prompt video generation with smooth scene transitions and enables controllable video generation using long depth or pose sequences.
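To make the multi-band fusion idea concrete, here is a minimal NumPy sketch. It is not the authors' implementation. The abstract does not specify the filter shape, the cutoff frequencies, or whether filtering is temporal only or spatio-temporal. The sketch assumes ideal (brick-wall) masks along the temporal axis only. The names `band_masks`, `multiband_fuse`, and `cutoffs` are hypothetical. Each branch feature stands in for the output of one attention branch, ordered from the largest (global) window to the smallest (local) window.

```python
import numpy as np

def band_masks(num_frames, cutoffs):
    """Split temporal frequencies into len(cutoffs) + 1 disjoint bands.

    Assumption: ideal brick-wall bands on the normalized temporal
    frequency |f| in [0, 0.5]; the abstract does not specify the filter.
    """
    freqs = np.abs(np.fft.fftfreq(num_frames))
    edges = [0.0, *cutoffs, np.inf]
    return [(freqs >= lo) & (freqs < hi) for lo, hi in zip(edges[:-1], edges[1:])]

def multiband_fuse(branch_feats, cutoffs):
    """Fuse per-branch features of shape (T, C), ordered global -> local.

    The most global branch contributes the lowest band and the most
    local branch the highest band (hypothetical assignment).
    """
    num_frames = branch_feats[0].shape[0]
    masks = band_masks(num_frames, cutoffs)
    if len(masks) != len(branch_feats):
        raise ValueError("need len(cutoffs) + 1 branches")
    fused_spec = np.zeros(branch_feats[0].shape, dtype=complex)
    for feat, mask in zip(branch_feats, masks):
        spec = np.fft.fft(feat, axis=0)       # temporal spectrum of this branch
        fused_spec += spec * mask[:, None]    # keep only this branch's band
    # Masks are symmetric in |f|, so the result is real up to rounding error.
    return np.fft.ifft(fused_spec, axis=0).real
```

With two branches and a single cutoff this reduces to a dual-branch low/high blend of the kind described for FreeLong. Adding more cutoffs gives one band per window size. Because the bands partition the spectrum, passing identical features to every branch returns them unchanged.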
