Frame Guidance: Training-Free Guidance for Frame-Level Control in Video Diffusion Models
June 8, 2025
Authors: Sangwon Jang, Taekyung Ki, Jaehyeong Jo, Jaehong Yoon, Soo Ye Kim, Zhe Lin, Sung Ju Hwang
cs.AI
Abstract
Advancements in diffusion models have significantly improved video quality,
directing attention to fine-grained controllability. However, many existing
methods depend on fine-tuning large-scale video models for specific tasks,
which becomes increasingly impractical as model sizes continue to grow. In this
work, we present Frame Guidance, a training-free guidance method for controllable
video generation based on frame-level signals, such as keyframes, style
reference images, sketches, or depth maps. For practical training-free
guidance, we propose a simple latent processing method that dramatically
reduces memory usage, and apply a novel latent optimization strategy designed
for globally coherent video generation. Frame Guidance enables effective
control across diverse tasks, including keyframe guidance, stylization, and
looping, without any training, and is compatible with any video model. Experimental
results show that Frame Guidance can produce high-quality controlled videos for
a wide range of tasks and input signals.
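To make the mechanism concrete, below is a minimal sketch of what one training-free frame-level guidance step could look like inside a diffusion sampling loop: the noisy latent is used to predict a clean latent, a single guided frame is decoded and scored against the frame-level signal, and the latent is nudged along the loss gradient. This is an illustrative PyTorch sketch under assumed interfaces, not the paper's implementation; `denoiser`, `decode_frame`, `frame_loss`, and `guidance_scale` are hypothetical placeholders.

```python
import torch

def frame_guidance_step(latent, t, denoiser, decode_frame, frame_loss,
                        guidance_scale=1.0):
    """One hypothetical training-free frame-level guidance step.

    Assumed callables (not the paper's API):
    - denoiser(latent, t): predicts the clean video latent from the noisy latent
    - decode_frame(clean_latent, idx): decodes one frame's latent to pixel space
    - frame_loss(frame): scores the decoded frame against the frame-level
      signal (keyframe, style reference, sketch, or depth map)
    """
    latent = latent.detach().requires_grad_(True)

    # Predict the clean latent from the current noisy latent.
    x0_pred = denoiser(latent, t)

    # Decode only the guided frame rather than the full video latent;
    # decoding a single frame is one way to keep memory usage practical.
    guided_frame = decode_frame(x0_pred, idx=0)

    # Measure how far the decoded frame is from the control signal.
    loss = frame_loss(guided_frame)

    # Steer the latent down the loss gradient; the video model's own
    # denoising dynamics propagate the change to neighboring frames.
    grad = torch.autograd.grad(loss, latent)[0]
    return (latent - guidance_scale * grad).detach()
```

In a full sampler, a step like this would be interleaved with the usual denoising updates, applying the gradient correction only at the frames (and timesteps) where a control signal is given.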