DragVideo: Interactive Drag-style Video Editing

Abstract

Editing the visual content of videos remains a formidable challenge, with two main issues: 1) providing direct and easy user control, and 2) producing natural editing results, free of unsightly distortion and artifacts, after changing shape, expression, and layout. Inspired by DragGAN, a recent image-based drag-style editing technique, we address these issues by proposing DragVideo, which adopts a similar drag-style user interaction to edit video content while maintaining temporal consistency. Empowered by recent diffusion models, as in DragDiffusion, DragVideo contains the novel Drag-on-Video U-Net (DoVe) editing method, which optimizes diffused video latents generated by a video U-Net to achieve the desired control. Specifically, we use Sample-specific LoRA fine-tuning and Mutual Self-Attention control to ensure faithful reconstruction of the video from the DoVe method. We also present a series of testing examples for drag-style video editing and conduct extensive experiments across a wide array of challenging editing tasks, such as motion editing and skeleton editing, underscoring DragVideo's versatility and generality. Our code, including the DragVideo web user interface, will be released.
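To make the idea of optimizing video latents toward a drag target more concrete, the following is a minimal toy sketch and not the DoVe method itself. It assumes the "feature map" is the latent itself (a real system would read features from a video U-Net), uses a simple squared-error motion-supervision loss with a hand-derived gradient, and applies the same handle and target point to every frame as a stand-in for temporal consistency. The function name `drag_video_latent` and all parameters are hypothetical.

```python
# Toy sketch of drag-style latent optimization on a video latent.
# Assumption: features are the latent values themselves (identity
# feature extractor); the abstract does not specify DoVe's actual loss.

def sign(v):
    return (v > 0) - (v < 0)

def drag_video_latent(latent, handle, target, lr=0.25, inner_steps=30):
    """Copy the value at `handle` toward `target` in every frame.

    latent: list of frames, each a list of rows of floats (T x H x W).
    handle, target: (row, col) points shared by all frames.
    Returns the final handle position.
    """
    # Reference feature per frame, taken at the initial handle point.
    ref = [frame[handle[0]][handle[1]] for frame in latent]
    while handle != target:
        # One-pixel step toward the target (drag direction on the grid).
        nxt = (handle[0] + sign(target[0] - handle[0]),
               handle[1] + sign(target[1] - handle[1]))
        for _ in range(inner_steps):
            for t, frame in enumerate(latent):
                # Loss per frame: (z[nxt] - ref)^2, with ref held constant.
                grad = 2.0 * (frame[nxt[0]][nxt[1]] - ref[t])
                frame[nxt[0]][nxt[1]] -= lr * grad
        # Point tracking is trivial in this toy: the handle follows the step.
        handle = nxt
    return handle

# Usage: a 3-frame, 5x5 latent with a distinct value per frame at (1, 1).
T, H, W = 3, 5, 5
video = [[[0.0] * W for _ in range(H)] for _ in range(T)]
for t, frame in enumerate(video):
    frame[1][1] = 1.0 + t
final = drag_video_latent(video, handle=(1, 1), target=(3, 4))
```

In this sketch the handle value is copied along the path rather than moved, and the tracking step is trivial; a practical system would presumably track the handle by searching the feature map and would decode the optimized latents back into video frames.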

