VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models

Abstract

Text-to-video diffusion models have significantly advanced video generation. However, customizing these models to generate videos with tailored motions remains a substantial challenge. Specifically, they struggle with (a) accurately reproducing the motion of a target video and (b) creating diverse visual variations. For example, straightforward extensions of static image customization methods to video often entangle appearance and motion data. To address this, we present the Video Motion Customization (VMC) framework, a novel one-shot tuning approach designed to adapt the temporal attention layers within video diffusion models. Our approach introduces a novel motion distillation objective that uses residual vectors between consecutive frames as a motion reference. The diffusion process then preserves low-frequency motion trajectories while mitigating high-frequency, motion-unrelated noise in image space. We validate our method against state-of-the-art video generative models across diverse real-world motions and contexts. Our code, data, and project demo are available at https://video-motion-customization.github.io
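
The idea of using residuals between consecutive frames as a motion reference can be illustrated with a minimal sketch. The function names `frame_residuals` and `residual_alignment_loss`, the frame-first array layout, and the choice of cosine distance are all assumptions made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
import numpy as np


def frame_residuals(frames: np.ndarray) -> np.ndarray:
    """Return the differences between consecutive frames.

    `frames` has shape (N, ...) with the frame index first;
    the result has shape (N - 1, ...).
    """
    return frames[1:] - frames[:-1]


def residual_alignment_loss(pred: np.ndarray, target: np.ndarray,
                            eps: float = 1e-8) -> float:
    """Mean cosine distance between predicted and reference frame residuals.

    Cosine distance is an assumption of this sketch, chosen because it
    compares the direction of frame-to-frame change rather than its magnitude.
    """
    # Flatten each residual into one vector per consecutive-frame pair.
    d_pred = frame_residuals(pred).reshape(pred.shape[0] - 1, -1)
    d_ref = frame_residuals(target).reshape(target.shape[0] - 1, -1)

    dot = (d_pred * d_ref).sum(axis=1)
    norms = np.linalg.norm(d_pred, axis=1) * np.linalg.norm(d_ref, axis=1)
    cosine = dot / (norms + eps)  # eps guards against zero-length residuals
    return float(np.mean(1.0 - cosine))


# Toy example: 4 frames, 3 channels, 8x8 pixels (random values, not real video).
rng = np.random.default_rng(0)
video = rng.normal(size=(4, 3, 8, 8))
shifted = video + 5.0  # same frame-to-frame change, different constant offset
print(residual_alignment_loss(shifted, video))
```

Because a constant offset applied to every frame cancels when consecutive frames are subtracted, the loss for `shifted` is close to zero. This is one simple way to see why frame residuals carry information about change over time rather than about static appearance.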
