DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory

Abstract

Controllable video generation has gained significant attention in recent years. However, two main limitations persist. First, most existing works focus on only one of text-, image-, or trajectory-based control, and therefore cannot achieve fine-grained control over videos. Second, trajectory control research is still in its early stages, with most experiments conducted on simple datasets such as Human3.6M. This constraint limits the models' ability to process open-domain images and to handle complex curved trajectories effectively. In this paper, we propose DragNUWA, an open-domain diffusion-based video generation model. To address the insufficient control granularity of existing works, we simultaneously introduce text, image, and trajectory information to provide fine-grained control over video content from semantic, spatial, and temporal perspectives. To address the limited open-domain trajectory control in current research, we propose trajectory modeling with three aspects: a Trajectory Sampler (TS) to enable open-domain control of arbitrary trajectories, a Multiscale Fusion (MF) to control trajectories at different granularities, and an Adaptive Training (AT) strategy to generate consistent videos that follow trajectories. Our experiments validate the effectiveness of DragNUWA, demonstrating its superior performance in fine-grained control of video generation. The homepage is https://www.microsoft.com/en-us/research/project/dragnuwa/
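
To make the idea of a trajectory as a control signal at several granularities more concrete, here is a minimal sketch in plain Python. It is an illustration under assumptions, not DragNUWA's Trajectory Sampler or Multiscale Fusion: the abstract does not specify how trajectories are represented, so the sparse displacement-map encoding, the average-pooling step, and the names `trajectory_to_sparse_flow`, `downsample`, and `multiscale` are all hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]                 # (x, y) position in pixels
FlowMap = List[List[Tuple[float, float]]]   # flow[row][col] = (dx, dy)


def trajectory_to_sparse_flow(traj: Sequence[Point], height: int, width: int) -> List[FlowMap]:
    """Encode one drag trajectory as per-step sparse displacement maps.

    Assumed encoding (not taken from the paper): for each step t, the
    displacement traj[t+1] - traj[t] is written at the pixel nearest to
    traj[t]; every other pixel stays (0.0, 0.0).
    """
    maps: List[FlowMap] = []
    for (x0, y0), (x1, y1) in zip(traj, traj[1:]):
        flow = [[(0.0, 0.0)] * width for _ in range(height)]
        # Clamp the start point to the image bounds.
        col = min(max(int(round(x0)), 0), width - 1)
        row = min(max(int(round(y0)), 0), height - 1)
        flow[row][col] = (x1 - x0, y1 - y0)
        maps.append(flow)
    return maps


def downsample(flow: FlowMap, factor: int) -> FlowMap:
    """Average-pool a displacement map by an integer factor."""
    out_h, out_w = len(flow) // factor, len(flow[0]) // factor
    out: FlowMap = []
    for r in range(out_h):
        out_row = []
        for c in range(out_w):
            block = [
                flow[r * factor + i][c * factor + j]
                for i in range(factor)
                for j in range(factor)
            ]
            n = len(block)
            out_row.append((sum(b[0] for b in block) / n, sum(b[1] for b in block) / n))
        out.append(out_row)
    return out


def multiscale(flow: FlowMap, factors: Sequence[int] = (1, 2, 4)) -> List[FlowMap]:
    """Return the same displacement map at several resolutions (hypothetical scales)."""
    return [downsample(flow, f) for f in factors]
```

For example, `trajectory_to_sparse_flow([(1.0, 1.0), (3.0, 2.0), (6.0, 2.0)], 8, 8)` yields two 8×8 maps, one per step, and `multiscale` applied to either map returns 8×8, 4×4, and 2×2 versions. Average pooling is only one possible choice here; it spreads a single drag over progressively larger regions as the resolution drops.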
