Direct-a-Video: Customized Video Generation with User-Directed Camera Movement and Object Motion

February 5, 2024
Authors: Shiyuan Yang, Liang Hou, Haibin Huang, Chongyang Ma, Pengfei Wan, Di Zhang, Xiaodong Chen, Jing Liao
cs.AI

Abstract

Recent text-to-video diffusion models have achieved impressive progress. In practice, users often desire the ability to control object motion and camera movement independently for customized video creation. However, current methods lack support for separately controlling object motion and camera movement in a decoupled manner, which limits the controllability and flexibility of text-to-video models. In this paper, we introduce Direct-a-Video, a system that allows users to independently specify motions for one or multiple objects and/or camera movements, as if directing a video. We propose a simple yet effective strategy for the decoupled control of object motion and camera movement. Object motion is controlled through spatial cross-attention modulation using the model's inherent priors, requiring no additional optimization. For camera movement, we introduce new temporal cross-attention layers to interpret quantitative camera movement parameters. We further employ an augmentation-based approach to train these layers in a self-supervised manner on a small-scale dataset, eliminating the need for explicit motion annotation. Both components operate independently, allowing individual or combined control, and can generalize to open-domain scenarios. Extensive experiments demonstrate the superiority and effectiveness of our method. Project page: https://direct-a-video.github.io/.
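
To make the camera-control component more concrete, below is a minimal PyTorch sketch of a temporal cross-attention layer whose keys and values are derived from an embedding of the quantitative camera parameters (pan speed along x and y, plus a zoom ratio) mentioned in the abstract. The class name, dimensions, number of context tokens, and the MLP embedding are illustrative assumptions, not the authors' implementation; only the high-level idea (frame features cross-attending to embedded camera parameters, added as a residual so the pretrained text-to-video backbone is preserved) follows the paper's description.

```python
# Minimal sketch (an assumption, not the authors' released code) of a temporal
# cross-attention layer conditioned on quantitative camera parameters.
import torch
import torch.nn as nn


class CameraCrossAttention(nn.Module):
    """Per-frame features attend to context tokens derived from
    camera movement parameters (pan_x, pan_y, zoom)."""

    def __init__(self, dim: int, cam_dim: int = 3, heads: int = 8, num_tokens: int = 4):
        super().__init__()
        self.heads = heads
        self.num_tokens = num_tokens
        self.scale = (dim // heads) ** -0.5
        # Hypothetical embedding: 3 scalars -> a few context tokens.
        self.cam_embed = nn.Sequential(
            nn.Linear(cam_dim, dim), nn.SiLU(), nn.Linear(dim, dim * num_tokens)
        )
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_kv = nn.Linear(dim, 2 * dim, bias=False)
        self.to_out = nn.Linear(dim, dim)
        nn.init.zeros_(self.to_out.weight)  # new branch starts as a no-op
        nn.init.zeros_(self.to_out.bias)

    def forward(self, x: torch.Tensor, cam: torch.Tensor) -> torch.Tensor:
        # x:   (batch, frames, dim)  frame-wise latent features
        # cam: (batch, 3)            pan_x, pan_y, zoom for the whole clip
        b, f, d = x.shape
        ctx = self.cam_embed(cam).view(b, self.num_tokens, d)  # camera context tokens
        q = self.to_q(x)
        k, v = self.to_kv(ctx).chunk(2, dim=-1)
        q = q.view(b, f, self.heads, -1).transpose(1, 2)                # (b, h, f, d/h)
        k = k.view(b, self.num_tokens, self.heads, -1).transpose(1, 2)  # (b, h, t, d/h)
        v = v.view(b, self.num_tokens, self.heads, -1).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)   # (b, h, f, t)
        out = (attn @ v).transpose(1, 2).reshape(b, f, d)
        return x + self.to_out(out)  # residual keeps the pretrained backbone intact


# Example: 16 frames, 320-dim features; pan right + zoom in vs. a static camera.
layer = CameraCrossAttention(dim=320)
feats = torch.randn(2, 16, 320)
cam = torch.tensor([[0.3, 0.0, 1.2], [0.0, 0.0, 1.0]])
print(layer(feats, cam).shape)  # torch.Size([2, 16, 320])
```

Zero-initializing the output projection makes the new layer a no-op at the start of training, a common choice when attaching a new conditioning branch to a pretrained diffusion model. The abstract's augmentation-based self-supervision would then supply (clip, camera-parameter) training pairs without manual annotation, for instance by cropping and shifting existing clips to simulate pans and zooms (our reading of the augmentation; the abstract does not spell out this detail).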