

MotionMaster: Training-free Camera Motion Transfer For Video Generation

April 24, 2024
Authors: Teng Hu, Jiangning Zhang, Ran Yi, Yating Wang, Hongrui Huang, Jieyu Weng, Yabiao Wang, Lizhuang Ma
cs.AI

Abstract

The emergence of diffusion models has greatly propelled the progress in image and video generation. Recently, some efforts have been made in controllable video generation, including text-to-video generation and video motion control, among which camera motion control is an important topic. However, existing camera motion control methods rely on training a temporal camera module, and necessitate substantial computation resources due to the large amount of parameters in video generation models. Moreover, existing methods pre-define camera motion types during training, which limits their flexibility in camera control. Therefore, to reduce training costs and achieve flexible camera control, we propose COMD, a novel training-free video motion transfer model, which disentangles camera motions and object motions in source videos and transfers the extracted camera motions to new videos. We first propose a one-shot camera motion disentanglement method to extract camera motion from a single source video, which separates the moving objects from the background and estimates the camera motion in the moving objects region based on the motion in the background by solving a Poisson equation. Furthermore, we propose a few-shot camera motion disentanglement method to extract the common camera motion from multiple videos with similar camera motions, which employs a window-based clustering technique to extract the common features in temporal attention maps of multiple videos. Finally, we propose a motion combination method to combine different types of camera motions together, enabling our model a more controllable and flexible camera control. Extensive experiments demonstrate that our training-free approach can effectively decouple camera-object motion and apply the decoupled camera motion to a wide range of controllable video generation tasks, achieving flexible and diverse camera motion control.
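The one-shot disentanglement step described above estimates camera motion inside the moving-object region from the surrounding background motion by solving a Poisson equation. Below is a minimal sketch of that idea, reduced to the homogeneous (Laplace) case and solved with a simple Jacobi iteration on a toy flow field; the function name, grid sizes, and iteration count are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def fill_camera_motion(flow, mask, iters=500):
    """Fill one component of a motion field inside `mask` (the moving-object
    region) by harmonic interpolation: iterate a Jacobi step for the Laplace
    equation so the filled values smoothly extend the background motion at
    the mask boundary (Dirichlet boundary conditions)."""
    f = flow.astype(float).copy()
    for _ in range(iters):
        # Average of the 4 neighbours (Jacobi update), applied only inside
        # the mask; background values act as fixed boundary conditions.
        avg = 0.25 * (np.roll(f, 1, 0) + np.roll(f, -1, 0)
                      + np.roll(f, 1, 1) + np.roll(f, -1, 1))
        f[mask] = avg[mask]
    return f

# Toy example: a uniform horizontal pan of +2 px, with an object region
# whose (object) motion is discarded and re-estimated from the background.
flow = np.full((32, 32), 2.0)
mask = np.zeros_like(flow, dtype=bool)
mask[10:20, 10:20] = True
flow[mask] = 0.0  # remove object motion inside the mask
filled = fill_camera_motion(flow, mask)
```

In this toy case the background pan is uniform, so the harmonic fill converges to the same value inside the mask; with a non-uniform background (e.g. zoom or rotation) the same iteration produces a smooth interpolation of the boundary motion. In practice one would apply it per component of the flow field and use a sparse direct solver rather than Jacobi iteration for speed.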

