VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing

Abstract

Recently, diffusion-based generative models have achieved remarkable success in image generation and editing. However, their use for video editing still faces important limitations. This paper introduces VidEdit, a novel method for zero-shot text-based video editing that ensures strong temporal and spatial consistency. First, we propose combining atlas-based and pre-trained text-to-image diffusion models to provide a training-free and efficient editing method which, by design, achieves temporal smoothness. Second, we leverage off-the-shelf panoptic segmenters along with edge detectors and adapt their use for conditioned diffusion-based atlas editing. This ensures fine spatial control over targeted regions while strictly preserving the structure of the original video. Quantitative and qualitative experiments show that VidEdit outperforms state-of-the-art methods on the DAVIS dataset in terms of semantic faithfulness, image preservation, and temporal consistency metrics. With this framework, processing a single video takes only approximately one minute, and multiple compatible edits can be generated from a single text prompt. Project web page: https://videdit.github.io
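
As a rough intuition for why editing an atlas gives temporal smoothness by design, the toy sketch below edits a single 2D atlas once and then re-renders every frame from it. It is not VidEdit's implementation: the tiny atlas, the per-frame pixel-to-atlas mappings, the region mask (standing in for a segmenter's output), and the intensity-inverting edit (standing in for a conditioned diffusion edit) are all hypothetical stand-ins chosen for illustration.

```python
# Toy illustration of atlas-based editing. All names and data are hypothetical.

def render(atlas, uv_map):
    """Rebuild one frame by looking up each pixel's (row, col) position in the atlas."""
    return [[atlas[r][c] for (r, c) in row] for row in uv_map]

def edit_atlas(atlas, mask, edit_fn):
    """Apply edit_fn only where mask is True; other atlas pixels are copied unchanged."""
    return [
        [edit_fn(px) if m else px for px, m in zip(atlas_row, mask_row)]
        for atlas_row, mask_row in zip(atlas, mask)
    ]

# A 2x4 grayscale atlas: one 2D image that summarises the whole clip.
atlas = [[10, 20, 30, 40],
         [50, 60, 70, 80]]

# Per-frame mappings from frame pixels to atlas coordinates.
# The view shifts one atlas column to the right between frame 0 and frame 1.
uv_maps = [
    [[(0, 0), (0, 1), (0, 2)], [(1, 0), (1, 1), (1, 2)]],  # frame 0
    [[(0, 1), (0, 2), (0, 3)], [(1, 1), (1, 2), (1, 3)]],  # frame 1
]

# Region mask (assumed to come from a segmenter): edit only atlas column 2.
mask = [[False, False, True, False],
        [False, False, True, False]]

# Stand-in for the diffusion edit: invert the intensity of masked pixels.
edited = edit_atlas(atlas, mask, lambda px: 255 - px)

original_frames = [render(atlas, uv) for uv in uv_maps]
edited_frames = [render(edited, uv) for uv in uv_maps]

for frame in edited_frames:
    print(frame)
```

Because every frame is rendered from the same edited atlas, a given scene point receives the same edited value in each frame in which it appears, and pixels outside the mask are identical to the original frames.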
