Enhancing Train-Free Infinite-Frame Generation for Consistent Long Videos

Abstract

Train-free long video generation aims to enable foundation video generation models to produce longer videos without incurring significant computational overhead. Frame-level autoregressive frameworks such as FIFO-diffusion have the advantage of generating infinitely long videos with constant memory consumption. However, the mismatch between training and inference, together with the difficulty of maintaining long-term consistency, limits how effectively the foundation model can be used. To mitigate these issues, we propose MIGA, a novel infinite-frame long video generation method. First, we propose an effective two-stage alignment mechanism that mitigates the training-inference gap by reducing the excessive noise span fed to the model. Second, we introduce an innovative dual consistency enhancement mechanism in which a self-reflection approach corrects early high-noise frames and a long-range frame guidance approach leverages later low-noise frames with broad coverage to steer generation; together they improve temporal consistency. Extensive experiments on VBench and NarrLV demonstrate the state-of-the-art performance of MIGA. Our project page is available at https://xiaokunfeng.github.io/miga_homepage/.
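
To make the constant-memory property of frame-level autoregressive generation concrete, the following is a minimal toy sketch of a FIFO-style queue in which each slot holds a frame at a different noise level. It is not the MIGA method and not the FIFO-diffusion implementation: `toy_denoise`, `fifo_generate`, the scalar "frames", and the queue length are all hypothetical stand-ins chosen for illustration. The sketch also shows why the model sees a wide noise span in a single step, which is the source of the training-inference mismatch mentioned above.

```python
from collections import deque
import random


def toy_denoise(value, level):
    """Hypothetical stand-in for one denoising step of a video model.

    A real model would jointly process all frames in the queue; here we
    simply shrink the scalar "frame" toward zero.
    """
    return value * 0.5


def fifo_generate(num_frames, queue_len=4, seed=0):
    """Emit `num_frames` toy frames while holding only `queue_len` in memory."""
    rng = random.Random(seed)
    # Slot i holds a frame at noise level i + 1, so the head is nearly clean
    # and the tail is nearly pure noise.
    queue = deque((rng.gauss(0.0, 1.0), level) for level in range(1, queue_len + 1))
    outputs = []
    max_len = 0   # largest queue size observed (memory footprint)
    max_span = 0  # largest gap between noise levels seen in one step
    while len(outputs) < num_frames:
        levels = [level for _, level in queue]
        max_span = max(max_span, max(levels) - min(levels))
        # One joint step: every frame moves down by one noise level.
        queue = deque((toy_denoise(v, l), l - 1) for v, l in queue)
        max_len = max(max_len, len(queue))
        # The head frame is now at level 0 (clean): dequeue and emit it.
        value, level = queue.popleft()
        assert level == 0
        outputs.append(value)
        # Enqueue a fresh, fully noisy frame at the tail.
        queue.append((rng.gauss(0.0, 1.0), queue_len))
    return outputs, max_len, max_span


frames, max_len, max_span = fifo_generate(num_frames=10, queue_len=4)
print(len(frames), max_len, max_span)
```

In this toy setup the queue never grows beyond `queue_len`, regardless of how many frames are emitted, while every step feeds frames spanning `queue_len - 1` noise levels to the denoiser. A foundation model trained with a single noise level per clip would not have seen such inputs, which is the kind of gap an alignment mechanism would aim to reduce.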