DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory
August 16, 2023
Authors: Shengming Yin, Chenfei Wu, Jian Liang, Jie Shi, Houqiang Li, Gong Ming, Nan Duan
cs.AI
Abstract
Controllable video generation has gained significant attention in recent
years. However, two main limitations persist: Firstly, most existing works
focus on either text, image, or trajectory-based control, leading to an
inability to achieve fine-grained control in videos. Secondly, trajectory
control research is still in its early stages, with most experiments being
conducted on simple datasets like Human3.6M. This constraint limits the models'
capability to process open-domain images and effectively handle complex curved
trajectories. In this paper, we propose DragNUWA, an open-domain
diffusion-based video generation model. To tackle the issue of insufficient
control granularity in existing works, we simultaneously introduce text, image,
and trajectory information to provide fine-grained control over video content
from semantic, spatial, and temporal perspectives. To resolve the problem of
limited open-domain trajectory control in current research, we propose
trajectory modeling with three aspects: a Trajectory Sampler (TS) to enable
open-domain control of arbitrary trajectories, a Multiscale Fusion (MF) to
control trajectories at different granularities, and an Adaptive Training (AT)
strategy to generate consistent videos following trajectories. Our experiments
validate the effectiveness of DragNUWA, demonstrating its superior
fine-grained control in video generation. The project homepage is
https://www.microsoft.com/en-us/research/project/dragnuwa/
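To make the Trajectory Sampler (TS) idea concrete: it can be thought of as turning a dense per-pixel motion field into a handful of sparse, arbitrary point tracks that condition generation. The sketch below is a toy illustration under our own assumptions (the function name, array shapes, and nearest-pixel advection are ours, not details from the paper):

```python
import numpy as np

def sample_trajectories(flow, num_tracks=4, seed=0):
    """Toy sketch of a trajectory sampler: given a dense optical-flow
    sequence of shape (T, H, W, 2) holding per-frame (dx, dy)
    displacements, pick a few random start pixels and advect them
    through time, yielding sparse tracks of shape (num_tracks, T+1, 2).
    """
    rng = np.random.default_rng(seed)
    T, H, W, _ = flow.shape
    # Start each track at a random (x, y) pixel location.
    starts = np.stack([rng.integers(0, W, num_tracks),
                       rng.integers(0, H, num_tracks)], axis=1).astype(float)
    tracks = [starts]
    pos = starts.copy()
    for t in range(T):
        # Read the flow at the nearest pixel and move the point by it.
        x = np.clip(pos[:, 0].round().astype(int), 0, W - 1)
        y = np.clip(pos[:, 1].round().astype(int), 0, H - 1)
        pos = pos + flow[t, y, x]
        # Keep points inside the frame.
        pos[:, 0] = np.clip(pos[:, 0], 0, W - 1)
        pos[:, 1] = np.clip(pos[:, 1], 0, H - 1)
        tracks.append(pos.copy())
    return np.stack(tracks, axis=1)
```

Sampling tracks from flow (rather than asking users for dense annotations) is what would let such a sampler cover arbitrary open-domain trajectories at training time while accepting hand-drawn curves at inference.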