MotionMaster: Training-free Camera Motion Transfer For Video Generation

Abstract

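The advent of diffusion models has greatly advanced image and video generation. Recent work on controllable video generation includes text-to-video generation and video motion control, among which camera motion control is an important topic. However, existing camera motion control methods require training a temporal camera module, and because video generation models have a large number of parameters, they demand substantial computational resources. Moreover, existing methods predefine the types of camera motion during training, which limits the flexibility of camera control. To reduce training cost and enable flexible camera control, we propose COMD, a novel training-free video motion transfer model. It disentangles camera motion from object motion in a source video and transfers the extracted camera motion to new videos. First, we propose a one-shot camera motion disentanglement method that extracts camera motion from a single source video: it separates moving objects from the background and estimates the camera motion in the moving-object region from the background motion by solving a Poisson equation. Second, we propose a few-shot camera motion disentanglement method that extracts the common camera motion from multiple videos with similar camera motions, using a window-based clustering technique to extract the common features in their temporal attention maps. Finally, we propose a motion combination method that combines different types of camera motion, giving the model more controllable and flexible camera control. Extensive experiments show that our training-free approach effectively disentangles camera and object motion and can apply the disentangled camera motion to a wide range of controllable video generation tasks, achieving flexible and diverse camera motion control.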

English

The emergence of diffusion models has greatly propelled progress in image and video generation. Recently, some efforts have been made in controllable video generation, including text-to-video generation and video motion control, among which camera motion control is an important topic. However, existing camera motion control methods rely on training a temporal camera module and require substantial computational resources because of the large number of parameters in video generation models. Moreover, existing methods pre-define camera motion types during training, which limits their flexibility in camera control. Therefore, to reduce training costs and achieve flexible camera control, we propose COMD, a novel training-free video motion transfer model, which disentangles camera motions and object motions in source videos and transfers the extracted camera motions to new videos. We first propose a one-shot camera motion disentanglement method to extract camera motion from a single source video, which separates the moving objects from the background and estimates the camera motion in the moving-object region from the motion in the background by solving a Poisson equation. Furthermore, we propose a few-shot camera motion disentanglement method to extract the common camera motion from multiple videos with similar camera motions, which employs a window-based clustering technique to extract the common features in the temporal attention maps of multiple videos. Finally, we propose a motion combination method to combine different types of camera motions, giving our model more controllable and flexible camera control. Extensive experiments demonstrate that our training-free approach can effectively decouple camera and object motion and apply the decoupled camera motion to a wide range of controllable video generation tasks, achieving flexible and diverse camera motion control.
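
The one-shot step described above estimates camera motion inside the moving-object region from the surrounding background motion by solving a Poisson equation. The abstract does not say which quantity is filled in or what the right-hand side of the equation is, so the following is only a toy sketch under stated assumptions: a single scalar grid stands in for one component of the motion, the object mask is given, and the source term defaults to zero (Laplace's equation, which yields a smooth interpolation from the boundary). The function name `poisson_fill` and all parameters are hypothetical and are not taken from the paper.

```python
def poisson_fill(values, mask, source=None, iters=2000, tol=1e-9):
    """Fill masked cells of a 2D grid by solving a discrete Poisson equation.

    values: 2D list of floats; cells where mask is False are fixed boundary data.
    mask:   2D list of bools; True marks unknown cells (assumed object region).
    source: optional 2D list g for laplacian(f) = g; None means g = 0.
    """
    h, w = len(values), len(values[0])
    f = [row[:] for row in values]
    unknown = [(i, j) for i in range(h) for j in range(w) if mask[i][j]]

    # Start the unknown cells from the mean of the known (background) cells.
    known = [values[i][j] for i in range(h) for j in range(w) if not mask[i][j]]
    init = sum(known) / len(known) if known else 0.0
    for i, j in unknown:
        f[i][j] = init

    # Gauss-Seidel sweeps on the 5-point stencil:
    # (sum of neighbours) - n * f[i][j] = g[i][j]
    for _ in range(iters):
        max_change = 0.0
        for i, j in unknown:
            nbrs = [f[a][b]
                    for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                    if 0 <= a < h and 0 <= b < w]
            g = source[i][j] if source is not None else 0.0
            new = (sum(nbrs) - g) / len(nbrs)
            max_change = max(max_change, abs(new - f[i][j]))
            f[i][j] = new  # in-place update, reused by later cells in this sweep
        if max_change < tol:
            break
    return f


# Toy background motion: a horizontal ramp, with the central 3x3 block
# treated as an object region whose values are unknown.
ramp = [[float(j) for j in range(5)] for _ in range(5)]
mask = [[1 <= i <= 3 and 1 <= j <= 3 for j in range(5)] for i in range(5)]
damaged = [[0.0 if mask[i][j] else ramp[i][j] for j in range(5)] for i in range(5)]
filled = poisson_fill(damaged, mask)
```

Because a linear ramp has zero Laplacian, the fill recovers it inside the masked block up to the solver tolerance. A real pipeline would more likely assemble the same stencil into a sparse linear system and solve it directly instead of iterating.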
