Frame Guidance: Training-Free Guidance for Frame-Level Control in Video Diffusion Models

Abstract

Advancements in diffusion models have significantly improved video quality, directing attention to fine-grained controllability. However, many existing methods depend on fine-tuning large-scale video models for specific tasks, which becomes increasingly impractical as model sizes continue to grow. In this work, we present Frame Guidance, a training-free guidance method for controllable video generation based on frame-level signals such as keyframes, style reference images, sketches, or depth maps. To make training-free guidance practical, we propose a simple latent processing method that dramatically reduces memory usage, and we apply a novel latent optimization strategy designed for globally coherent video generation. Frame Guidance enables effective control across diverse tasks, including keyframe guidance, stylization, and looping, without any training, and is compatible with any video model. Experimental results show that Frame Guidance can produce high-quality controlled videos for a wide range of tasks and input signals.
