VidToMe: Video Token Merging for Zero-Shot Video Editing

Abstract

Diffusion models have made significant advances in generating high-quality images, but their application to video generation remains challenging due to the complexity of temporal motion. Zero-shot video editing offers a solution by using pre-trained image diffusion models to translate source videos into new ones. However, existing methods struggle to maintain strict temporal consistency while keeping memory consumption low. In this work, we propose a novel approach that enhances temporal consistency in generated videos by merging self-attention tokens across frames. By aligning and compressing temporally redundant tokens across frames, our method improves temporal coherence and reduces memory consumption in self-attention computations. The merging strategy matches and aligns tokens according to the temporal correspondence between frames, facilitating natural temporal consistency in the generated video frames. To manage the complexity of video processing, we divide videos into chunks and develop intra-chunk local token merging and inter-chunk global token merging, ensuring both short-term video continuity and long-term content consistency. Our video editing approach seamlessly extends advances in image editing to video editing and achieves better temporal consistency than state-of-the-art methods.
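The cross-frame merging idea can be illustrated with a small sketch. This is not the authors' implementation: it assumes tokens are plain lists of floats, that correspondence is measured by cosine similarity, and that matched tokens are combined by averaging. The function names `merge_frame_tokens` and `unmerge_frame_tokens` and the parameter `r` (number of source tokens to merge) are hypothetical and chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_frame_tokens(target, source, r):
    """Merge the r source tokens that are most similar to some target token.

    Returns (merged_target, kept_source, matches), where matches maps a
    merged source index to the target index it was folded into.
    Assumption: similarity is cosine and merging is a plain average.
    """
    # Find the best-matching target token for every source token.
    best = []
    for i, s in enumerate(source):
        sims = [cosine(s, t) for t in target]
        j = max(range(len(target)), key=sims.__getitem__)
        best.append((sims[j], i, j))

    # Only the r most similar pairs are merged; the rest stay separate.
    best.sort(reverse=True)
    matches = {i: j for _, i, j in best[:r]}

    # Average each target token with the source tokens merged into it.
    groups = {j: [target[j]] for j in range(len(target))}
    for i, j in matches.items():
        groups[j].append(source[i])
    merged_target = [
        [sum(col) / len(groups[j]) for col in zip(*groups[j])]
        for j in range(len(target))
    ]
    kept_source = [s for i, s in enumerate(source) if i not in matches]
    return merged_target, kept_source, matches


def unmerge_frame_tokens(merged_target, kept_source, matches, n_source):
    """Rebuild the source frame: merged slots copy their matched target token."""
    kept = iter(kept_source)
    return [
        merged_target[matches[i]] if i in matches else next(kept)
        for i in range(n_source)
    ]
```

In this sketch, attention would run over `merged_target + kept_source`, which is shorter than the two frames concatenated, and `unmerge_frame_tokens` then restores a full-length source frame in which merged positions share the output of their matched target token. That sharing is what would tie corresponding regions of different frames together. The chunking described in the abstract could be layered on top by applying the same routine first within a chunk and then across chunks, though the details here are an assumption.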

