MotionBooth: Motion-Aware Customized Text-to-Video Generation

Abstract

In this work, we present MotionBooth, a framework for animating customized subjects with precise control over both object and camera movements. Given a few images of a specific object, we efficiently fine-tune a text-to-video model to capture the object's shape and attributes accurately. Our approach introduces a subject region loss and a video preservation loss to improve subject learning, along with a subject token cross-attention loss that integrates the customized subject with motion control signals. Additionally, we propose training-free techniques for managing subject and camera motion during inference. In particular, we use cross-attention map manipulation to govern subject motion and introduce a novel latent shift module to control camera movement. MotionBooth preserves the appearance of subjects while simultaneously controlling the motion in generated videos. Extensive quantitative and qualitative evaluations demonstrate the superiority and effectiveness of our method. Our project page is at https://jianzongwu.github.io/projects/motionbooth
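
To make the idea of a training-free latent shift concrete, the following is a minimal sketch and not the authors' implementation. It assumes a video latent of shape (frames, channels, height, width), a constant per-frame displacement, and that cells vacated by the shift are refilled with fresh Gaussian noise. The abstract does not specify the fill rule or the sign convention, so both are assumptions here, and the function names `shift_latent` and `shift_video_latent` are hypothetical.

```python
import numpy as np


def shift_latent(latent, dx, dy, rng=None):
    """Shift a (C, H, W) latent by (dx, dy) cells.

    Positive dx moves content right, positive dy moves it down
    (an assumed convention). Vacated cells are filled with Gaussian
    noise, which is an assumption of this sketch, not the paper's rule.
    """
    rng = np.random.default_rng() if rng is None else rng
    _, h, w = latent.shape
    # Clamp so that oversized shifts simply produce an all-noise frame.
    dx = int(np.clip(dx, -w, w))
    dy = int(np.clip(dy, -h, h))
    out = rng.standard_normal(latent.shape).astype(latent.dtype)
    # Source and destination windows that still overlap after the shift.
    src_x, dst_x = slice(max(0, -dx), min(w, w - dx)), slice(max(0, dx), min(w, w + dx))
    src_y, dst_y = slice(max(0, -dy), min(h, h - dy)), slice(max(0, dy), min(h, h + dy))
    out[:, dst_y, dst_x] = latent[:, src_y, src_x]
    return out


def shift_video_latent(latents, dx_per_frame, dy_per_frame, seed=0):
    """Apply a cumulative shift to each frame of a (F, C, H, W) latent."""
    rng = np.random.default_rng(seed)
    return np.stack([
        shift_latent(frame, round(i * dx_per_frame), round(i * dy_per_frame), rng)
        for i, frame in enumerate(latents)
    ])


# Example: a 4-frame latent whose content drifts one cell right per frame.
video = np.zeros((4, 2, 8, 8))
shifted = shift_video_latent(video, dx_per_frame=1, dy_per_frame=0)
```

Because the shift accumulates with the frame index, frame 0 is left unchanged and later frames are displaced further, which is one simple way to approximate a steady camera pan without any additional training.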
