TokenFlow: Consistent Diffusion Features for Consistent Video Editing
July 19, 2023
Authors: Michal Geyer, Omer Bar-Tal, Shai Bagon, Tali Dekel
cs.AI
Abstract
The generative AI revolution has recently expanded to videos. Nevertheless,
current state-of-the-art video models are still lagging behind image models in
terms of visual quality and user control over the generated content. In this
work, we present a framework that harnesses the power of a text-to-image
diffusion model for the task of text-driven video editing. Specifically, given
a source video and a target text-prompt, our method generates a high-quality
video that adheres to the target text, while preserving the spatial layout and
motion of the input video. Our method is based on a key observation that
consistency in the edited video can be obtained by enforcing consistency in the
diffusion feature space. We achieve this by explicitly propagating diffusion
features based on inter-frame correspondences, readily available in the model.
Thus, our framework does not require any training or fine-tuning, and can work
in conjunction with any off-the-shelf text-to-image editing method. We
demonstrate state-of-the-art editing results on a variety of real-world videos.
Webpage: https://diffusion-tokenflow.github.io/
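The core mechanism described above — enforcing consistency by propagating diffusion features along inter-frame correspondences — can be illustrated with a minimal sketch. This is a simplified illustration, not the authors' implementation: the function name `propagate_features`, the keyframe-based setup, and the use of cosine-similarity nearest neighbors over per-frame token grids are assumptions for exposition; the actual method operates inside the diffusion model's feature space at each denoising step.

```python
import numpy as np

def propagate_features(src_feats, edited_key_feats, key_idx):
    """Hypothetical sketch of correspondence-based feature propagation.

    src_feats:        (T, N, D) diffusion tokens per frame of the SOURCE video
    edited_key_feats: (N, D) edited tokens of a chosen keyframe
    key_idx:          index of that keyframe in the source video

    Correspondences are computed in the source feature space (where they are
    "readily available"), then used to pull the EDITED keyframe features into
    every frame, so all frames share a consistent set of edited features.
    """
    key = src_feats[key_idx]
    key_n = key / np.linalg.norm(key, axis=-1, keepdims=True)
    out = np.empty_like(src_feats)
    for t, frame in enumerate(src_feats):
        frame_n = frame / np.linalg.norm(frame, axis=-1, keepdims=True)
        sim = frame_n @ key_n.T          # (N, N) cosine similarities
        nn = sim.argmax(axis=-1)         # nearest keyframe token per token
        out[t] = edited_key_feats[nn]    # propagate edited features
    return out
```

Because the correspondences come from the unedited source features, the propagated edit inherits the source video's spatial layout and motion, which is the intuition behind the training-free consistency claim.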