Make-A-Protagonist: Generic Video Editing with An Ensemble of Experts

Abstract

Text-driven image and video diffusion models have achieved unprecedented success in generating realistic and diverse content. Recently, editing and creating variations of existing images and videos with diffusion-based generative models has garnered significant attention. However, previous works are limited to editing content with text or to coarse personalization from a single visual clue, which makes them unsuitable for content that is hard to describe in words and requires fine-grained control. To address this, we propose a generic video editing framework called Make-A-Protagonist, which uses both textual and visual clues to edit videos, with the goal of empowering individuals to become the protagonists. Specifically, we leverage multiple experts to parse the source video and the target visual and textual clues, and we propose a visual-textual video generation model that employs mask-guided denoising sampling to generate the desired output. Extensive results demonstrate the versatile and remarkable editing capabilities of Make-A-Protagonist.
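
The abstract names mask-guided denoising sampling but does not give its update rule. As a rough illustration only, one common reading of the idea is to blend two conditional noise predictions with a spatial mask at every denoising step, so that one condition steers the protagonist region and the other steers the rest of the frame. The sketch below is written under that assumption. The names `blend`, `mask_guided_sampling`, `predict_protagonist`, and `predict_background` are hypothetical, the latent is a flat list of floats rather than a tensor, and the update is a toy step rather than a real diffusion scheduler. It is not the authors' implementation.

```python
from typing import Callable, List

Latent = List[float]  # flattened latent; real models would use tensors
Predictor = Callable[[Latent, int], Latent]  # hypothetical noise predictor


def blend(mask: List[float], inside: Latent, outside: Latent) -> Latent:
    """Per-element blend: mask=1 takes `inside`, mask=0 takes `outside`."""
    return [m * a + (1.0 - m) * b for m, a, b in zip(mask, inside, outside)]


def mask_guided_sampling(
    latent: Latent,
    mask: List[float],
    predict_protagonist: Predictor,
    predict_background: Predictor,
    num_steps: int,
    step_size: float = 0.1,
) -> Latent:
    """Toy denoising loop that mixes two noise predictions with a mask.

    Assumption: the masked region follows the protagonist-conditioned
    prediction and the unmasked region follows the background-conditioned one.
    """
    for t in reversed(range(num_steps)):
        eps_fg = predict_protagonist(latent, t)
        eps_bg = predict_background(latent, t)
        eps = blend(mask, eps_fg, eps_bg)
        # Toy update: subtract a fixed fraction of the blended noise estimate.
        latent = [x - step_size * e for x, e in zip(latent, eps)]
    return latent
```

In a real pipeline the two predictors would be the same denoising network evaluated under different conditions, and the mask would come from a segmentation expert applied to the source video; both of those details are assumptions here as well.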
