

DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory

August 16, 2023
Authors: Shengming Yin, Chenfei Wu, Jian Liang, Jie Shi, Houqiang Li, Gong Ming, Nan Duan
cs.AI

Abstract

Controllable video generation has gained significant attention in recent years. However, two main limitations persist: firstly, most existing works focus on either text, image, or trajectory-based control, leading to an inability to achieve fine-grained control in videos. Secondly, trajectory control research is still in its early stages, with most experiments being conducted on simple datasets like Human3.6M. This constraint limits the models' capability to process open-domain images and effectively handle complex curved trajectories. In this paper, we propose DragNUWA, an open-domain diffusion-based video generation model. To tackle the issue of insufficient control granularity in existing works, we simultaneously introduce text, image, and trajectory information to provide fine-grained control over video content from semantic, spatial, and temporal perspectives. To resolve the problem of limited open-domain trajectory control in current research, we propose trajectory modeling with three aspects: a Trajectory Sampler (TS) to enable open-domain control of arbitrary trajectories, a Multiscale Fusion (MF) to control trajectories at different granularities, and an Adaptive Training (AT) strategy to generate consistent videos following trajectories. Our experiments validate the effectiveness of DragNUWA, demonstrating its superior performance in fine-grained control in video generation. The homepage link is https://www.microsoft.com/en-us/research/project/dragnuwa/
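To make the Trajectory Sampler (TS) and Multiscale Fusion (MF) ideas concrete, here is a minimal, hedged sketch of how they could operate on tensors. The paper does not publish this code; the function names, the sparse-map representation of trajectories, and the pooling-based downsampling below are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sample_trajectory(dense_flow, num_points=4, seed=0):
    """Trajectory Sampler (TS) sketch: keep the motion at only a few anchor
    points of a dense flow field, mimicking a sparse user-drawn drag.
    `dense_flow` has shape (T, H, W, 2) -- per-frame (dx, dy) displacements.
    (Hypothetical representation, not the paper's actual sampler.)"""
    rng = np.random.default_rng(seed)
    T, H, W, _ = dense_flow.shape
    ys = rng.integers(0, H, size=num_points)
    xs = rng.integers(0, W, size=num_points)
    sparse = np.zeros_like(dense_flow)
    sparse[:, ys, xs, :] = dense_flow[:, ys, xs, :]  # zero everywhere else
    return sparse

def multiscale_fuse(sparse_flow, num_scales=3):
    """Multiscale Fusion (MF) sketch: downsample the sparse trajectory map
    to several resolutions so each could be injected at a matching scale of
    a diffusion UNet. Uses 2x average pooling over the spatial dims."""
    maps = [sparse_flow]
    for _ in range(num_scales - 1):
        f = maps[-1]
        T, H, W, C = f.shape
        f = f.reshape(T, H // 2, 2, W // 2, 2, C).mean(axis=(2, 4))
        maps.append(f)
    return maps

# Example: an 8-frame, 16x16 flow field yields condition maps at
# 16x16, 8x8, and 4x4 resolution.
flow = np.ones((8, 16, 16, 2))
maps = multiscale_fuse(sample_trajectory(flow))
print([m.shape for m in maps])
```

The design choice this illustrates: representing a trajectory as a sparse spatial map lets it be concatenated or added to image features at every scale, which is one plausible way the "different granularities" of MF could be realized.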