

VideoGuide: Improving Video Diffusion Models without Training Through a Teacher's Guide

October 6, 2024
Authors: Dohun Lee, Bryan S Kim, Geon Yeong Park, Jong Chul Ye
cs.AI

Abstract

Text-to-image (T2I) diffusion models have revolutionized visual content creation, but extending these capabilities to text-to-video (T2V) generation remains a challenge, particularly in preserving temporal consistency. Existing methods that aim to improve consistency often cause trade-offs such as reduced imaging quality and impractical computational time. To address these issues, we introduce VideoGuide, a novel framework that enhances the temporal consistency of pretrained T2V models without the need for additional training or fine-tuning. Instead, VideoGuide leverages any pretrained video diffusion model (VDM) or itself as a guide during the early stages of inference, improving temporal quality by interpolating the guiding model's denoised samples into the sampling model's denoising process. The proposed method brings about significant improvement in temporal consistency and image fidelity, providing a cost-effective and practical solution that synergizes the strengths of various video diffusion models. Furthermore, we demonstrate prior distillation, revealing that base models can achieve enhanced text coherence by utilizing the superior data prior of the guiding model through the proposed method. Project Page: http://videoguide2025.github.io/
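
Conceptually, the described guidance amounts to one extra denoising pass with the guiding model during the early, high-noise steps, with the two denoised estimates interpolated before the sampler advances. Below is a minimal, hypothetical PyTorch-style sketch of such a loop under a DDIM-like scheduler; the interfaces (`base_model`, `guide_model`, `alphas_cumprod`) and the interpolation weight and step count are illustrative assumptions, not the authors' implementation.

```python
import torch

def guided_sampling(base_model, guide_model, x_t, prompt_emb, timesteps,
                    alphas_cumprod, guide_steps=10, interp_weight=0.5):
    """DDIM-style sampling loop that blends the guide model's denoised
    estimate into the base model's trajectory during the early steps."""
    for i, t in enumerate(timesteps):
        a_t = alphas_cumprod[t]
        # Base model's noise prediction and its implied clean-sample estimate.
        eps = base_model(x_t, t, prompt_emb)
        x0 = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()

        if i < guide_steps:
            # Early stage: also denoise with the guiding VDM and interpolate
            # the two x0 estimates to transfer its temporal consistency.
            eps_g = guide_model(x_t, t, prompt_emb)
            x0_g = (x_t - (1 - a_t).sqrt() * eps_g) / a_t.sqrt()
            x0 = interp_weight * x0_g + (1 - interp_weight) * x0
            # Re-derive the noise prediction consistent with the blended x0.
            eps = (x_t - a_t.sqrt() * x0) / (1 - a_t).sqrt()

        # Deterministic DDIM update to the previous timestep.
        t_prev = timesteps[i + 1] if i + 1 < len(timesteps) else 0
        a_prev = alphas_cumprod[t_prev]
        x_t = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps
    return x_t
```

In this reading, the guide is only consulted for the first few steps, so the extra cost is bounded, and the base model alone completes the remaining trajectory.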

