

Object-Centric Diffusion for Efficient Video Editing

January 11, 2024
Authors: Kumara Kahatapitiya, Adil Karjauv, Davide Abati, Fatih Porikli, Yuki M. Asano, Amirhossein Habibian
cs.AI

Abstract

Diffusion-based video editing has reached impressive quality and can transform the global style, local structure, and attributes of given video inputs, following textual edit prompts. However, such solutions typically incur heavy memory and computational costs to generate temporally coherent frames, either in the form of diffusion inversion and/or cross-frame attention. In this paper, we conduct an analysis of such inefficiencies and suggest simple yet effective modifications that allow significant speed-ups whilst maintaining quality. Moreover, we introduce Object-Centric Diffusion, coined OCD, to further reduce latency by allocating computations more towards foreground edited regions, which are arguably more important for perceptual quality. We achieve this with two novel proposals: i) Object-Centric Sampling, which decouples the diffusion steps spent on salient regions from those spent on the background, allocating most of the model capacity to the former, and ii) Object-Centric 3D Token Merging, which reduces the cost of cross-frame attention by fusing redundant tokens in unimportant background regions. Both techniques are readily applicable to a given video editing model without retraining and can drastically reduce its memory and computational cost. We evaluate our proposals on inversion-based and control-signal-based editing pipelines, and show a latency reduction of up to 10x at comparable synthesis quality.