EditCtrl: Disentangled Local and Global Control for Real-Time Generative Video Editing

Abstract

High-fidelity generative video editing has improved significantly in quality by leveraging pre-trained video foundation models. However, the computational cost of these methods is a major bottleneck: they are often designed to process the full video context regardless of the inpainting mask's size, which is inefficient even for sparse, localized edits. In this paper, we introduce EditCtrl, an efficient video inpainting control framework that focuses computation only where it is needed. Our approach features a novel local video context module that operates solely on masked tokens, yielding a computational cost proportional to the edit size. This local-first generation is then guided by a lightweight temporal global context embedder that ensures video-wide context consistency with minimal overhead. EditCtrl is 10 times more compute-efficient than state-of-the-art generative editing methods, and it also improves editing quality compared to methods designed with full attention. Finally, we showcase new capabilities that EditCtrl unlocks, including multi-region editing with text prompts and autoregressive content propagation.
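
The idea of operating solely on masked tokens can be illustrated with a small sketch. This is not the EditCtrl implementation: the helper names `gather_masked` and `scatter_masked`, the token grid shape, and the random linear map standing in for a local module are all assumptions made for illustration. It only shows why the number of processed tokens tracks the mask size rather than the full video size.

```python
import numpy as np

def gather_masked(tokens, mask):
    """Pick out only the tokens under the edit mask.

    tokens: (T, H, W, C) array of video tokens (hypothetical layout).
    mask:   (T, H, W) boolean array, True where the edit applies.
    Returns the (N, C) masked tokens and their flat indices.
    """
    c = tokens.shape[-1]
    flat = tokens.reshape(-1, c)
    idx = np.flatnonzero(mask.reshape(-1))
    return flat[idx], idx

def scatter_masked(tokens, local_out, idx):
    """Write processed tokens back; unmasked tokens are left untouched."""
    c = tokens.shape[-1]
    flat = tokens.reshape(-1, c).copy()
    flat[idx] = local_out
    return flat.reshape(tokens.shape)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, 8, 8, 16))   # 4 frames, 8x8 grid, 16 channels
mask = np.zeros((4, 8, 8), dtype=bool)
mask[:, 2:4, 2:4] = True                      # a 2x2 edit region in every frame

local_in, idx = gather_masked(tokens, mask)

# Stand-in for a local module (assumption): a per-token linear map.
weight = rng.standard_normal((16, 16))
local_out = local_in @ weight

edited = scatter_masked(tokens, local_out, idx)
print(local_in.shape[0], "of", mask.size, "tokens processed")
```

In this toy setup only the tokens under the mask pass through the stand-in module, so the work done grows with the edit region while everything outside the mask is copied through unchanged.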
