

VidToMe: Video Token Merging for Zero-Shot Video Editing

December 17, 2023
Authors: Xirui Li, Chao Ma, Xiaokang Yang, Ming-Hsuan Yang
cs.AI

Abstract

Diffusion models have made significant advances in generating high-quality images, but their application to video generation has remained challenging due to the complexity of temporal motion. Zero-shot video editing offers a solution by using pre-trained image diffusion models to translate source videos into new ones. However, existing methods struggle to maintain strict temporal consistency and to keep memory consumption manageable. In this work, we propose a novel approach that enhances temporal consistency in generated videos by merging self-attention tokens across frames. By aligning and compressing temporally redundant tokens across frames, our method improves temporal coherence and reduces the memory consumed by self-attention computations. The merging strategy matches and aligns tokens according to the temporal correspondence between frames, facilitating natural temporal consistency in the generated video frames. To manage the complexity of video processing, we divide videos into chunks and develop intra-chunk local token merging and inter-chunk global token merging, ensuring both short-term video continuity and long-term content consistency. Our approach seamlessly extends advances in image editing to video editing, achieving favorable temporal consistency compared with state-of-the-art methods.
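
To make the cross-frame merging idea concrete, the snippet below is a minimal PyTorch sketch of one plausible reading of the abstract: tokens of each frame are matched to a chunk's reference frame by cosine similarity, and the most redundant matches are averaged into their reference tokens. The function name `merge_frame_tokens`, the `keep_ratio` parameter, the choice of frame 0 as reference, and the averaging scheme are all illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F


def merge_frame_tokens(ref, src, keep_ratio=0.5):
    """Merge the most temporally redundant tokens of `src` into `ref`.

    ref: (n_ref, d) self-attention tokens of the chunk's reference frame.
    src: (n_src, d) tokens of another frame in the same chunk.
    Returns (merged_ref, kept_src): redundant `src` tokens are averaged
    into their best-matching `ref` token; distinctive ones are kept.
    """
    # Cosine similarity between every src token and every ref token.
    sim = F.normalize(src, dim=-1) @ F.normalize(ref, dim=-1).T
    score, match = sim.max(dim=-1)            # best ref match per src token
    n_merge = src.shape[0] - int(src.shape[0] * keep_ratio)
    merge_idx = score.topk(n_merge).indices   # most redundant src tokens
    keep_mask = torch.ones(src.shape[0], dtype=torch.bool, device=src.device)
    keep_mask[merge_idx] = False

    # Average each merged src token into the ref token it matched.
    merged = ref.clone()
    counts = torch.ones(ref.shape[0], 1, dtype=ref.dtype, device=ref.device)
    merged.index_add_(0, match[merge_idx], src[merge_idx])
    counts.index_add_(0, match[merge_idx],
                      torch.ones(n_merge, 1, dtype=ref.dtype, device=ref.device))
    return merged / counts, src[keep_mask]


# Toy usage: fold the tokens of frames 1..3 of a 4-frame chunk into frame 0.
chunk = torch.randn(4, 256, 64)               # (frames, tokens, dim)
merged = chunk[0]
for f in range(1, 4):
    merged, _ = merge_frame_tokens(merged, chunk[f], keep_ratio=0.5)
```

In a full pipeline, such a merge would presumably run inside each self-attention block, locally within a chunk and globally against tokens shared across chunks as the abstract describes, with the kept per-frame tokens (discarded as `_` in the toy loop above) carried through attention alongside the merged set.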