VidToMe: Video Token Merging for Zero-Shot Video Editing

December 17, 2023
Authors: Xirui Li, Chao Ma, Xiaokang Yang, Ming-Hsuan Yang
cs.AI

Abstract

Diffusion models have made significant advances in generating high-quality images, but their application to video generation has remained challenging due to the complexity of temporal motion. Zero-shot video editing offers a solution by utilizing pre-trained image diffusion models to translate source videos into new ones. Nevertheless, existing methods struggle to maintain strict temporal consistency and efficient memory consumption. In this work, we propose a novel approach to enhance temporal consistency in generated videos by merging self-attention tokens across frames. By aligning and compressing temporally redundant tokens across frames, our method improves temporal coherence and reduces memory consumption in self-attention computations. The merging strategy matches and aligns tokens according to the temporal correspondence between frames, facilitating natural temporal consistency in generated video frames. To manage the complexity of video processing, we divide videos into chunks and develop intra-chunk local token merging and inter-chunk global token merging, ensuring both short-term video continuity and long-term content consistency. Our video editing approach seamlessly extends the advancements in image editing to video editing, rendering favorable results in temporal consistency over state-of-the-art methods.
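To make the merging idea concrete, here is a minimal sketch of ToMe-style cross-frame token merging, assuming each frame contributes a token matrix of shape (N, C) from a diffusion U-Net's self-attention layer. The function name `merge_tokens_across_frames`, the `merge_ratio` parameter, and the scatter-mean averaging are illustrative assumptions, not the authors' implementation; VidToMe's distinction between intra-chunk local merging and inter-chunk global merging is omitted here for brevity.

```python
import torch
import torch.nn.functional as F

def merge_tokens_across_frames(src_tokens, dst_tokens, merge_ratio=0.5):
    """Merge the most temporally redundant src-frame tokens into dst-frame tokens.

    src_tokens: (N, C) tokens from the current frame.
    dst_tokens: (M, C) tokens from a reference frame (e.g., the previous
                frame, or a chunk-level reference for global merging).
    Returns the reduced token set plus the indices needed to unmerge later.
    """
    # Cosine similarity between every src token and every dst token.
    sim = F.normalize(src_tokens, dim=-1) @ F.normalize(dst_tokens, dim=-1).T  # (N, M)

    # Temporal correspondence: each src token's best match in the dst frame.
    best_sim, best_dst = sim.max(dim=-1)  # (N,), (N,)

    # Merge the top-r most redundant src tokens; keep the rest unmerged.
    r = int(merge_ratio * src_tokens.shape[0])
    order = best_sim.argsort(descending=True)
    merged_idx, kept_idx = order[:r], order[r:]

    # Average merged src tokens into their matched dst tokens (scatter-mean).
    dst = dst_tokens.clone()
    counts = torch.ones(dst.shape[0], 1, device=dst.device)
    dst.index_add_(0, best_dst[merged_idx], src_tokens[merged_idx])
    counts.index_add_(0, best_dst[merged_idx],
                      torch.ones(r, 1, device=dst.device))
    dst = dst / counts

    # Self-attention now runs over fewer tokens: kept src + merged dst.
    reduced = torch.cat([src_tokens[kept_idx], dst], dim=0)
    return reduced, kept_idx, best_dst, merged_idx

# Hypothetical usage: merge frame t's tokens into frame t-1 before attention.
frame_prev = torch.randn(1024, 320)  # e.g., 32x32 latent tokens, 320 channels
frame_curr = torch.randn(1024, 320)
reduced, kept_idx, match, merged_idx = merge_tokens_across_frames(frame_curr, frame_prev)
print(reduced.shape)  # fewer than 2 * 1024 tokens enter self-attention
```

After the attention pass, the merged tokens would be copied back ("unmerged") to their original positions using the returned indices, so each frame retains its full token count while sharing representations, which is what drives both the temporal coherence and the memory savings the abstract describes.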