Frame Guidance: Training-Free Guidance for Frame-Level Control in Video Diffusion Models
June 8, 2025
Authors: Sangwon Jang, Taekyung Ki, Jaehyeong Jo, Jaehong Yoon, Soo Ye Kim, Zhe Lin, Sung Ju Hwang
cs.AI
Abstract
Advancements in diffusion models have significantly improved video quality,
directing attention to fine-grained controllability. However, many existing
methods depend on fine-tuning large-scale video models for specific tasks,
which becomes increasingly impractical as model sizes continue to grow. In this
work, we present Frame Guidance, a training-free guidance method for controllable
video generation based on frame-level signals, such as keyframes, style
reference images, sketches, or depth maps. For practical training-free
guidance, we propose a simple latent processing method that dramatically
reduces memory usage, and apply a novel latent optimization strategy designed
for globally coherent video generation. Frame Guidance enables effective
control across diverse tasks, including keyframe guidance, stylization, and
looping, without any training, and is compatible with any video model. Experimental
results show that Frame Guidance can produce high-quality controlled videos for
a wide range of tasks and input signals.
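To make the idea concrete, below is a minimal PyTorch-style sketch of one generic form of training-free frame-level guidance: at a sampling step, form the one-step clean-latent estimate, decode only the guided frames, and take a gradient step on the latents toward the frame-level signal. This is an illustration of the general technique, not the paper's actual algorithm; all names (`denoiser`, `decode_frames`, `loss_fn`, `step_size`) are hypothetical placeholders, and the scheduler interface assumed here follows diffusers-style conventions.

```python
# Sketch of a single guided sampling step (hypothetical interfaces).
import torch

def frame_guided_step(latents, t, denoiser, scheduler, decode_frames,
                      guided_idx, target, loss_fn, step_size=0.1):
    """One diffusion sampling step with gradient guidance on selected frames.

    latents:       (B, C, F, H, W) noisy video latents at timestep t
    denoiser:      predicts noise eps_theta(latents, t)
    scheduler:     diffusers-style scheduler with alphas_cumprod and step()
    decode_frames: decodes only the selected latent frames to pixel space
    guided_idx:    indices of frames that receive a control signal
    target:        frame-level signal (keyframe, depth map, sketch, ...)
    loss_fn:       differentiable loss between decoded frames and target
    """
    latents = latents.detach().requires_grad_(True)

    # Predict noise and form the one-step clean-latent estimate x0_hat:
    # x0_hat = (x_t - sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_bar_t)
    eps = denoiser(latents, t)
    alpha_bar = scheduler.alphas_cumprod[t]
    x0_hat = (latents - (1 - alpha_bar).sqrt() * eps) / alpha_bar.sqrt()

    # Decode only the guided frames and measure distance to the signal.
    frames = decode_frames(x0_hat[:, :, guided_idx])
    loss = loss_fn(frames, target)

    # Gradient step on the latents steers sampling toward the signal.
    grad = torch.autograd.grad(loss, latents)[0]
    guided = latents - step_size * grad

    # Continue the ordinary denoising update from the guided latents.
    return scheduler.step(eps.detach(), t, guided.detach()).prev_sample
```

The abstract does not spell out its memory-reducing latent processing; decoding only the handful of guided frames, as in this sketch, is one plausible way such a saving could arise, since backpropagating through a full video decode is the dominant memory cost in this kind of guidance.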