
Object-Centric Diffusion for Efficient Video Editing

January 11, 2024
作者: Kumara Kahatapitiya, Adil Karjauv, Davide Abati, Fatih Porikli, Yuki M. Asano, Amirhossein Habibian
cs.AI

Abstract

Diffusion-based video editing has reached impressive quality and can transform the global style, local structure, or attributes of given video inputs, following textual edit prompts. However, such solutions typically incur heavy memory and computational costs to generate temporally-coherent frames, either in the form of diffusion inversion and/or cross-frame attention. In this paper, we conduct an analysis of such inefficiencies, and suggest simple yet effective modifications that allow significant speed-ups whilst maintaining quality. Moreover, we introduce Object-Centric Diffusion, coined as OCD, to further reduce latency by allocating computations more towards foreground edited regions that are arguably more important for perceptual quality. We achieve this by two novel proposals: i) Object-Centric Sampling, decoupling the diffusion steps spent on salient regions or background, allocating most of the model capacity to the former, and ii) Object-Centric 3D Token Merging, which reduces the cost of cross-frame attention by fusing redundant tokens in unimportant background regions. Both techniques are readily applicable to a given video editing model without retraining, and can drastically reduce its memory and computational cost. We evaluate our proposals on inversion-based and control-signal-based editing pipelines, and show a latency reduction of up to 10x for comparable synthesis quality.
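To make the token-merging idea concrete, the following is a minimal NumPy sketch of the kind of operation Object-Centric 3D Token Merging describes: foreground tokens (selected by an object mask) are kept untouched, while the most similar pairs of background tokens are averaged, shrinking the sequence that cross-frame attention must process. This is an illustrative simplification (a ToMe-style bipartite match on a single token set), not the authors' implementation; the function name, the alternating split, and the `r` parameter are assumptions for the sketch.

```python
import numpy as np

def object_centric_token_merge(tokens, fg_mask, r):
    """Illustrative sketch (not the paper's code): keep foreground tokens,
    merge away r background tokens by averaging each into its most
    similar partner, reducing the attention sequence length by r."""
    fg = tokens[fg_mask]                    # foreground tokens, kept as-is
    bg = tokens[~fg_mask]                   # background tokens, merge candidates

    # Cosine similarity between two alternating halves of the background set.
    bg_n = bg / (np.linalg.norm(bg, axis=1, keepdims=True) + 1e-8)
    a_n, b_n = bg_n[0::2], bg_n[1::2]
    sim = a_n @ b_n.T                       # (|A|, |B|) pairwise similarities

    best = sim.max(axis=1)                  # best match score per A-token
    match = sim.argmax(axis=1)              # matched B-token per A-token
    r = min(r, len(a_n))
    merge_a = np.argsort(-best)[:r]         # r most redundant A-tokens
    keep_a = np.setdiff1d(np.arange(len(a_n)), merge_a)

    bg_a, bg_b = bg[0::2].copy(), bg[1::2].copy()
    for i in merge_a:                       # fuse each merged A-token into B
        j = match[i]
        bg_b[j] = (bg_b[j] + bg_a[i]) / 2.0

    merged_bg = np.concatenate([bg_a[keep_a], bg_b], axis=0)
    return np.concatenate([fg, merged_bg], axis=0)
```

In a real pipeline the merge would run over spatio-temporal ("3D") tokens across frames inside the attention blocks, and the foreground mask would come from the edit region; here both are reduced to a flat token array and a boolean mask to keep the sketch self-contained.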