DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory

Abstract

Controllable video generation has gained significant attention in recent years. However, two main limitations persist. First, most existing works focus on only one of text-, image-, or trajectory-based control, and so cannot achieve fine-grained control in videos. Second, trajectory control research is still in its early stages, with most experiments conducted on simple datasets such as Human3.6M. This limits the models' ability to process open-domain images and to handle complex curved trajectories effectively. In this paper, we propose DragNUWA, an open-domain diffusion-based video generation model. To address the insufficient control granularity of existing works, we introduce text, image, and trajectory information simultaneously, providing fine-grained control over video content from semantic, spatial, and temporal perspectives. To address the limited open-domain trajectory control in current research, we propose trajectory modeling with three aspects: a Trajectory Sampler (TS) to enable open-domain control of arbitrary trajectories, a Multiscale Fusion (MF) to control trajectories at different granularities, and an Adaptive Training (AT) strategy to generate consistent videos that follow trajectories. Our experiments validate the effectiveness of DragNUWA, demonstrating its superior performance in fine-grained control in video generation. The homepage link is https://www.microsoft.com/en-us/research/project/dragnuwa/
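The abstract does not say how a trajectory is represented or how control at different granularities is computed. As a rough illustration only, the sketch below assumes a drag path is rasterized into a sparse per-step displacement map and then averaged down into coarser maps. The function names `rasterize_trajectory`, `downsample`, and `multiscale_pyramid`, the grid layout, and the use of 2x2 average pooling are all hypothetical choices made for this example, not DragNUWA's implementation.

```python
from typing import List, Sequence, Tuple

Vec = Tuple[float, float]
Grid = List[List[Vec]]  # grid[row][col] = (dx, dy); an assumed layout


def rasterize_trajectory(points: Sequence[Vec], height: int, width: int) -> List[Grid]:
    """Turn a drag path [(x0, y0), (x1, y1), ...] into one sparse map per step.

    For each step, the cell at the path's current position stores the
    displacement to the next position; every other cell stays (0.0, 0.0).
    """
    maps: List[Grid] = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        grid: Grid = [[(0.0, 0.0)] * width for _ in range(height)]
        col, row = round(x0), round(y0)
        if 0 <= col < width and 0 <= row < height:  # ignore points outside the frame
            grid[row][col] = (x1 - x0, y1 - y0)
        maps.append(grid)
    return maps


def downsample(grid: Grid) -> Grid:
    """Halve the resolution by averaging each 2x2 block (odd edges are dropped)."""
    h, w = len(grid) // 2, len(grid[0]) // 2
    out: Grid = []
    for r in range(h):
        row: List[Vec] = []
        for c in range(w):
            block = [grid[2 * r + i][2 * c + j] for i in (0, 1) for j in (0, 1)]
            row.append((sum(v[0] for v in block) / 4, sum(v[1] for v in block) / 4))
        out.append(row)
    return out


def multiscale_pyramid(grid: Grid, num_scales: int = 3) -> List[Grid]:
    """Return the map at full resolution followed by successively coarser copies."""
    levels = [grid]
    for _ in range(num_scales - 1):
        levels.append(downsample(levels[-1]))
    return levels


if __name__ == "__main__":
    step_maps = rasterize_trajectory([(2, 2), (4, 3), (7, 5)], height=8, width=8)
    pyramid = multiscale_pyramid(step_maps[0], num_scales=3)
    print([(len(g), len(g[0])) for g in pyramid])
```

In this toy setup the coarser levels spread one drag vector over a larger image region, which is one plausible reading of controlling trajectories "at different granularities"; how the model actually fuses such maps with its video features is not described in the abstract.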
