Frame Guidance: Training-Free Guidance for Frame-Level Control in Video Diffusion Models
June 8, 2025
Authors: Sangwon Jang, Taekyung Ki, Jaehyeong Jo, Jaehong Yoon, Soo Ye Kim, Zhe Lin, Sung Ju Hwang
cs.AI
Abstract
Advancements in diffusion models have significantly improved video quality,
directing attention to fine-grained controllability. However, many existing
methods depend on fine-tuning large-scale video models for specific tasks,
which becomes increasingly impractical as model sizes continue to grow. In this
work, we present Frame Guidance, a training-free guidance method for controllable
video generation based on frame-level signals, such as keyframes, style
reference images, sketches, or depth maps. For practical training-free
guidance, we propose a simple latent processing method that dramatically
reduces memory usage, and apply a novel latent optimization strategy designed
for globally coherent video generation. Frame Guidance enables effective
control across diverse tasks, including keyframe guidance, stylization, and
looping, without any training, and is compatible with any video model. Experimental
results show that Frame Guidance can produce high-quality controlled videos for
a wide range of tasks and input signals.
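To make the mechanism concrete, below is a minimal sketch of what one training-free frame-level guidance step could look like inside a diffusion sampling loop: the noisy latent is used to predict a clean latent, a single guided frame is decoded and scored against the frame-level signal, and the latent is nudged along the loss gradient. This is an illustrative PyTorch sketch under assumed interfaces, not the paper's implementation; `denoiser`, `decode_frame`, `frame_loss`, and `guidance_scale` are hypothetical placeholders.

```python
import torch

def frame_guidance_step(latent, t, denoiser, decode_frame, frame_loss,
                        guidance_scale=1.0):
    """One hypothetical training-free frame-level guidance step.

    Assumed callables (not the paper's API):
    - denoiser(latent, t): predicts the clean video latent from the noisy latent
    - decode_frame(clean_latent, idx): decodes one frame's latent to pixel space
    - frame_loss(frame): scores the decoded frame against the frame-level
      signal (keyframe, style reference, sketch, or depth map)
    """
    latent = latent.detach().requires_grad_(True)

    # Predict the clean latent from the current noisy latent.
    x0_pred = denoiser(latent, t)

    # Decode only the guided frame rather than the full video latent;
    # decoding a single frame is one way to keep memory usage practical.
    guided_frame = decode_frame(x0_pred, idx=0)

    # Measure how far the decoded frame is from the control signal.
    loss = frame_loss(guided_frame)

    # Steer the latent down the loss gradient; the video model's own
    # denoising dynamics propagate the change to neighboring frames.
    grad = torch.autograd.grad(loss, latent)[0]
    return (latent - guidance_scale * grad).detach()
```

In a full sampler, a step like this would be interleaved with the usual denoising updates, applying the gradient correction only at the frames (and timesteps) where a control signal is given.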