DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory
August 16, 2023
Authors: Shengming Yin, Chenfei Wu, Jian Liang, Jie Shi, Houqiang Li, Gong Ming, Nan Duan
cs.AI
Abstract
Controllable video generation has gained significant attention in recent
years. However, two main limitations persist: Firstly, most existing works
focus on either text, image, or trajectory-based control, leading to an
inability to achieve fine-grained control in videos. Secondly, trajectory
control research is still in its early stages, with most experiments being
conducted on simple datasets like Human3.6M. This constraint limits the models'
capability to process open-domain images and effectively handle complex curved
trajectories. In this paper, we propose DragNUWA, an open-domain
diffusion-based video generation model. To tackle the issue of insufficient
control granularity in existing works, we simultaneously introduce text, image,
and trajectory information to provide fine-grained control over video content
from semantic, spatial, and temporal perspectives. To resolve the problem of
limited open-domain trajectory control in current research, we propose
trajectory modeling with three aspects: a Trajectory Sampler (TS) to enable
open-domain control of arbitrary trajectories, a Multiscale Fusion (MF) to
control trajectories at different granularities, and an Adaptive Training (AT)
strategy to generate consistent videos following trajectories. Our experiments
validate the effectiveness of DragNUWA, demonstrating its superior
fine-grained control in video generation. The project homepage is
https://www.microsoft.com/en-us/research/project/dragnuwa/
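To make the Trajectory Sampler (TS) idea concrete: it can be thought of as turning a dense per-pixel motion field into a handful of sparse, arbitrary point tracks that condition generation. The sketch below is a toy illustration under our own assumptions (the function name, array shapes, and nearest-pixel advection are ours, not details from the paper):

```python
import numpy as np

def sample_trajectories(flow, num_tracks=4, seed=0):
    """Toy sketch of a trajectory sampler: given a dense optical-flow
    sequence of shape (T, H, W, 2) holding per-frame (dx, dy)
    displacements, pick a few random start pixels and advect them
    through time, yielding sparse tracks of shape (num_tracks, T+1, 2).
    """
    rng = np.random.default_rng(seed)
    T, H, W, _ = flow.shape
    # Start each track at a random (x, y) pixel location.
    starts = np.stack([rng.integers(0, W, num_tracks),
                       rng.integers(0, H, num_tracks)], axis=1).astype(float)
    tracks = [starts]
    pos = starts.copy()
    for t in range(T):
        # Read the flow at the nearest pixel and move the point by it.
        x = np.clip(pos[:, 0].round().astype(int), 0, W - 1)
        y = np.clip(pos[:, 1].round().astype(int), 0, H - 1)
        pos = pos + flow[t, y, x]
        # Keep points inside the frame.
        pos[:, 0] = np.clip(pos[:, 0], 0, W - 1)
        pos[:, 1] = np.clip(pos[:, 1], 0, H - 1)
        tracks.append(pos.copy())
    return np.stack(tracks, axis=1)
```

Sampling tracks from flow (rather than asking users for dense annotations) is what would let such a sampler cover arbitrary open-domain trajectories at training time while accepting hand-drawn curves at inference.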