

DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory

August 16, 2023
Authors: Shengming Yin, Chenfei Wu, Jian Liang, Jie Shi, Houqiang Li, Gong Ming, Nan Duan
cs.AI

Abstract

Controllable video generation has gained significant attention in recent years. However, two main limitations persist: firstly, most existing works focus on either text, image, or trajectory-based control, leading to an inability to achieve fine-grained control in videos. Secondly, trajectory control research is still in its early stages, with most experiments being conducted on simple datasets like Human3.6M. This constraint limits the models' capability to process open-domain images and effectively handle complex curved trajectories. In this paper, we propose DragNUWA, an open-domain diffusion-based video generation model. To tackle the issue of insufficient control granularity in existing works, we simultaneously introduce text, image, and trajectory information to provide fine-grained control over video content from semantic, spatial, and temporal perspectives. To resolve the problem of limited open-domain trajectory control in current research, we propose trajectory modeling with three aspects: a Trajectory Sampler (TS) to enable open-domain control of arbitrary trajectories, a Multiscale Fusion (MF) to control trajectories at different granularities, and an Adaptive Training (AT) strategy to generate consistent videos following trajectories. Our experiments validate the effectiveness of DragNUWA, demonstrating its superior performance in fine-grained control in video generation. The homepage link is https://www.microsoft.com/en-us/research/project/dragnuwa/
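To make the Trajectory Sampler (TS) and Multiscale Fusion (MF) ideas concrete, here is a minimal, hedged sketch of how they could operate on tensors. The paper does not publish this code; the function names, the sparse-map representation of trajectories, and the pooling-based downsampling below are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sample_trajectory(dense_flow, num_points=4, seed=0):
    """Trajectory Sampler (TS) sketch: keep the motion at only a few anchor
    points of a dense flow field, mimicking a sparse user-drawn drag.
    `dense_flow` has shape (T, H, W, 2) -- per-frame (dx, dy) displacements.
    (Hypothetical representation, not the paper's actual sampler.)"""
    rng = np.random.default_rng(seed)
    T, H, W, _ = dense_flow.shape
    ys = rng.integers(0, H, size=num_points)
    xs = rng.integers(0, W, size=num_points)
    sparse = np.zeros_like(dense_flow)
    sparse[:, ys, xs, :] = dense_flow[:, ys, xs, :]  # zero everywhere else
    return sparse

def multiscale_fuse(sparse_flow, num_scales=3):
    """Multiscale Fusion (MF) sketch: downsample the sparse trajectory map
    to several resolutions so each could be injected at a matching scale of
    a diffusion UNet. Uses 2x average pooling over the spatial dims."""
    maps = [sparse_flow]
    for _ in range(num_scales - 1):
        f = maps[-1]
        T, H, W, C = f.shape
        f = f.reshape(T, H // 2, 2, W // 2, 2, C).mean(axis=(2, 4))
        maps.append(f)
    return maps

# Example: an 8-frame, 16x16 flow field yields condition maps at
# 16x16, 8x8, and 4x4 resolution.
flow = np.ones((8, 16, 16, 2))
maps = multiscale_fuse(sample_trajectory(flow))
print([m.shape for m in maps])
```

The design choice this illustrates: representing a trajectory as a sparse spatial map lets it be concatenated or added to image features at every scale, which is one plausible way the "different granularities" of MF could be realized.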