MotionDirector: Motion Customization of Text-to-Video Diffusion Models

Abstract

Large-scale pre-trained diffusion models have exhibited remarkable capabilities in diverse video generation. Given a set of video clips of the same motion concept, the task of Motion Customization is to adapt existing text-to-video diffusion models to generate videos with this motion. Examples include generating a video of a car moving in a prescribed manner under specific camera movements to make a movie, or a video illustrating how a bear would lift weights to inspire creators. Adaptation methods have been developed for customizing appearance, such as subject or style, but motion customization remains unexplored. It is straightforward to extend mainstream adaptation methods to motion customization, including full model tuning, parameter-efficient tuning of additional layers, and Low-Rank Adaptation (LoRA). However, the motion concept learned by these methods is often coupled with the limited appearances in the training videos, making it difficult to generalize the customized motion to other appearances. To overcome this challenge, we propose MotionDirector, with a dual-path LoRA architecture that decouples the learning of appearance and motion. Further, we design a novel appearance-debiased temporal loss to mitigate the influence of appearance on the temporal training objective. Experimental results show that the proposed method can generate videos of diverse appearances for the customized motions. Our method also supports various downstream applications, such as combining the appearance of one video with the motion of another, and animating a single image with customized motions. Our code and model weights will be released.
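For readers unfamiliar with low-rank adaptation, the following is a minimal, generic sketch of a single LoRA-adapted linear layer. It is not MotionDirector's implementation: the abstract does not specify how the two paths are wired or how the appearance-debiased temporal loss is computed, so neither is modeled here. The class name `LoRALinear`, the helper `matmul`, and the parameters `rank` and `alpha` are hypothetical names chosen for illustration. The idea shown is only that the pre-trained weight stays frozen while a small low-rank product is trained and added on top.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Illustrative only: frozen weight W (out x in) plus a low-rank update B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight          # frozen pre-trained weight, never updated
        self.scale = alpha / rank     # common LoRA scaling convention (assumed here)
        # A starts small and random, B starts at zero, so the update is zero at first.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def effective_weight(self):
        """Return W + scale * (B @ A)."""
        delta = matmul(self.B, self.A)
        return [[w + self.scale * d for w, d in zip(w_row, d_row)]
                for w_row, d_row in zip(self.weight, delta)]

    def forward(self, x):
        """Apply the adapted layer to an input vector given as a list of floats."""
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.effective_weight()]

    def num_trainable(self):
        """Count the entries of A and B, the only parameters that would be trained."""
        return sum(len(r) for r in self.A) + sum(len(r) for r in self.B)


if __name__ == "__main__":
    frozen = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    layer = LoRALinear(frozen, rank=1)
    print(layer.forward([1.0, 1.0, 1.0]))
    print(layer.num_trainable())
```

Because `B` is initialized to zero, the adapted layer initially behaves exactly like the frozen one, and only the entries of `A` and `B` would be updated during adaptation. A dual-path design would presumably attach separate adapters of this kind to different parts of the model, but that wiring is an assumption and is not described in the abstract.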
