DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control
October 17, 2024
作者: Yujie Wei, Shiwei Zhang, Hangjie Yuan, Xiang Wang, Haonan Qiu, Rui Zhao, Yutong Feng, Feng Liu, Zhizhong Huang, Jiaxin Ye, Yingya Zhang, Hongming Shan
cs.AI
Abstract
Recent advances in customized video generation have enabled users to create
videos tailored to both specific subjects and motion trajectories. However,
existing methods often require complicated test-time fine-tuning and struggle
with balancing subject learning and motion control, limiting their real-world
applications. In this paper, we present DreamVideo-2, a zero-shot video
customization framework capable of generating videos with a specific subject
and motion trajectory, guided by a single image and a bounding box sequence,
respectively, and without the need for test-time fine-tuning. Specifically, we
introduce reference attention, which leverages the model's inherent
capabilities for subject learning, and devise a mask-guided motion module to
achieve precise motion control by fully utilizing the robust motion signal of
box masks derived from bounding boxes. While these two components achieve their
intended functions, we empirically observe that motion control tends to
dominate over subject learning. To address this, we propose two key designs: 1)
the masked reference attention, which integrates a blended latent mask modeling
scheme into reference attention to enhance subject representations at the
desired positions, and 2) a reweighted diffusion loss, which differentiates the
contributions of regions inside and outside the bounding boxes to ensure a
balance between subject and motion control. Extensive experimental results on a
newly curated dataset demonstrate that DreamVideo-2 outperforms
state-of-the-art methods in both subject customization and motion control. The
dataset, code, and models will be made publicly available.
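The abstract describes two concrete mechanisms: box masks rasterized from a bounding-box sequence as the motion signal, and a reweighted diffusion loss that weights regions inside and outside the boxes differently. The sketch below illustrates both under stated assumptions; the helper names and the weights `w_in`/`w_out` are hypothetical (the abstract does not give the actual weighting scheme), so this is a minimal illustration, not the paper's implementation.

```python
import torch

def boxes_to_masks(boxes: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """Rasterize per-frame bounding boxes (x0, y0, x1, y1) into binary box masks.

    `boxes` has shape (T, 4) in pixel coordinates; returns masks of shape (T, 1, h, w)
    with 1 inside each frame's box and 0 elsewhere.
    """
    masks = torch.zeros(boxes.shape[0], 1, h, w)
    for t, (x0, y0, x1, y1) in enumerate(boxes.long().tolist()):
        masks[t, :, y0:y1, x0:x1] = 1.0
    return masks

def reweighted_diffusion_loss(pred: torch.Tensor, target: torch.Tensor,
                              masks: torch.Tensor,
                              w_in: float = 2.0, w_out: float = 1.0) -> torch.Tensor:
    """Hypothetical reweighted MSE diffusion loss: pixels inside the box masks
    contribute with weight `w_in`, pixels outside with `w_out`.

    The specific weight values are assumptions for illustration only.
    `pred`/`target` have shape (T, C, h, w); `masks` is (T, 1, h, w) and
    broadcasts over the channel dimension.
    """
    weights = w_out + (w_in - w_out) * masks
    return (weights * (pred - target) ** 2).mean()
```

For example, with an 8x8 frame and a 4x4 box, a uniform unit prediction error yields a loss above the plain MSE of 1.0, since in-box pixels are up-weighted, which is the stated goal of balancing subject learning against motion control.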