Object-Centric Diffusion for Efficient Video Editing

Abstract

Diffusion-based video editing has reached impressive quality and can transform the global style, local structure, and attributes of given video inputs, following textual edit prompts. However, such solutions typically incur heavy memory and computational costs to generate temporally coherent frames, in the form of diffusion inversion and/or cross-frame attention. In this paper, we analyze these inefficiencies and suggest simple yet effective modifications that allow significant speed-ups whilst maintaining quality. Moreover, we introduce Object-Centric Diffusion (OCD), which further reduces latency by allocating more computation to foreground edited regions, which are arguably more important for perceptual quality. We achieve this with two novel proposals: i) Object-Centric Sampling, which decouples the diffusion steps spent on salient regions from those spent on the background and allocates most of the model capacity to the former, and ii) Object-Centric 3D Token Merging, which reduces the cost of cross-frame attention by fusing redundant tokens in unimportant background regions. Both techniques are readily applicable to a given video editing model without retraining, and can drastically reduce its memory and computational cost. We evaluate our proposals on inversion-based and control-signal-based editing pipelines, and show a latency reduction of up to 10x for comparable synthesis quality.
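As an illustration only, the idea behind Object-Centric 3D Token Merging (keep foreground tokens intact, fuse redundant background tokens across frames) can be sketched in plain Python. This is not the implementation from the paper: the function name `merge_background_tokens`, the `stride` parameter, the strided choice of merge destinations, and the cosine-similarity averaging are all assumptions made for this sketch.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def merge_background_tokens(tokens, is_foreground, stride=4):
    """Keep foreground tokens; fuse background tokens into fewer averaged ones.

    tokens: list of feature vectors, flattened over (frame, height, width),
            so merging can happen across frames as well as within a frame.
    is_foreground: list of bools, same length as `tokens`.
    stride: hypothetical knob; every `stride`-th background token becomes a
            merge destination, and the rest are merged into the most similar one.
    """
    fg = [t for t, m in zip(tokens, is_foreground) if m]
    bg = [t for t, m in zip(tokens, is_foreground) if not m]

    dst = bg[::stride]                                      # merge destinations
    src = [t for i, t in enumerate(bg) if i % stride != 0]  # tokens to merge away

    groups = [[d] for d in dst]
    for s in src:
        best = max(range(len(dst)), key=lambda j: cosine(s, dst[j]))
        groups[best].append(s)

    # Each group collapses to the element-wise mean of its members.
    merged = [[sum(col) / len(g) for col in zip(*g)] for g in groups]
    return fg + merged


# Toy input: 8 two-dimensional tokens, of which only one is foreground.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9],
          [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [0.8, 0.2]]
mask = [False, False, False, False, False, False, True, False]
out = merge_background_tokens(tokens, mask, stride=2)
```

In this toy setting the attention layer would see fewer tokens than it was given, while the single foreground token passes through unchanged. A real pipeline would also need to unmerge tokens after attention, which this sketch omits.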
