
DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory

August 16, 2023
Authors: Shengming Yin, Chenfei Wu, Jian Liang, Jie Shi, Houqiang Li, Gong Ming, Nan Duan
cs.AI

Abstract

Controllable video generation has gained significant attention in recent years. However, two main limitations persist: Firstly, most existing works focus on either text-, image-, or trajectory-based control, leading to an inability to achieve fine-grained control in videos. Secondly, trajectory control research is still in its early stages, with most experiments being conducted on simple datasets like Human3.6M. This constraint limits the models' capability to process open-domain images and effectively handle complex curved trajectories. In this paper, we propose DragNUWA, an open-domain diffusion-based video generation model. To tackle the issue of insufficient control granularity in existing works, we simultaneously introduce text, image, and trajectory information to provide fine-grained control over video content from semantic, spatial, and temporal perspectives. To resolve the problem of limited open-domain trajectory control in current research, we propose trajectory modeling with three aspects: a Trajectory Sampler (TS) to enable open-domain control of arbitrary trajectories, a Multiscale Fusion (MF) to control trajectories in different granularities, and an Adaptive Training (AT) strategy to generate consistent videos following trajectories. Our experiments validate the effectiveness of DragNUWA, demonstrating its superior performance in fine-grained control in video generation. The homepage link is https://www.microsoft.com/en-us/research/project/dragnuwa/
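To make the trajectory-conditioning idea concrete, the following is a minimal, hypothetical sketch of the first two ingredients the abstract names: a Trajectory Sampler that keeps only sparse anchor points from a dense user-drawn path (so the model sees open-domain, arbitrary sparse trajectories), and a multiscale rasterization that turns those points into condition maps at several resolutions (coarse to fine granularity, as in Multiscale Fusion). All function names, shapes, and resolutions here are illustrative assumptions, not DragNUWA's actual implementation.

```python
import numpy as np

def sample_trajectory(dense_traj, num_anchors=4):
    """Toy Trajectory Sampler (TS): subsample a dense (T, 2) path
    (coordinates normalized to [0, 1]) down to sparse anchor points.
    Index choice is an illustrative assumption, not the paper's method."""
    idx = np.linspace(0, len(dense_traj) - 1, num_anchors).round().astype(int)
    return dense_traj[idx]

def rasterize_multiscale(traj, size=16, scales=(1, 2, 4)):
    """Toy multiscale conditioning: rasterize trajectory points into
    binary maps at several resolutions, mimicking the coarse-to-fine
    granularities that a Multiscale Fusion module could consume."""
    maps = []
    for s in scales:
        res = size // s
        grid = np.zeros((res, res), dtype=np.float32)
        for x, y in traj:
            gx = min(int(x * res), res - 1)  # clamp to the grid edge
            gy = min(int(y * res), res - 1)
            grid[gy, gx] = 1.0
        maps.append(grid)
    return maps

# A dense curved path (a simple parabola standing in for a user drag).
dense = np.stack([np.linspace(0, 1, 32), np.linspace(0, 1, 32) ** 2], axis=1)
anchors = sample_trajectory(dense, num_anchors=4)   # shape (4, 2)
cond_maps = rasterize_multiscale(anchors)           # 16x16, 8x8, 4x4 maps
```

In a full pipeline, each condition map would be fused with the diffusion model's features at the matching spatial resolution; this sketch only shows how a single curve becomes multi-resolution control signals.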