DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation
March 8, 2025
Authors: Runze Zhang, Guoguang Du, Xiaochuan Li, Qi Jia, Liang Jin, Lu Liu, Jingjing Wang, Cong Xu, Zhenhua Guo, Yaqian Zhao, Xiaoli Gong, Rengang Li, Baoyu Fan
cs.AI
Abstract
Spatio-temporal consistency is a critical research topic in video generation.
A qualified generated video segment must ensure plot plausibility and coherence
while maintaining visual consistency of objects and scenes across varying
viewpoints. Prior research, especially in open-source projects, primarily
focuses on either temporal or spatial consistency, or their basic combination,
such as appending a description of a camera movement after a prompt without
constraining the outcomes of this movement. However, camera movement may
introduce new objects to the scene or eliminate existing ones, thereby
overlaying and affecting the preceding narrative. Especially in videos with
numerous camera movements, the interplay between multiple plots becomes
increasingly complex. This paper introduces and examines integral
spatio-temporal consistency, considering the synergy between plot progression
and camera techniques, and the long-term impact of prior content on subsequent
generation. Our research spans dataset construction through model development.
First, we constructed the DropletVideo-10M dataset,
which comprises 10 million videos featuring dynamic camera motion and object
actions. Each video is annotated with a caption averaging 206 words,
detailing its camera movements and plot developments. Following this, we
developed and trained the DropletVideo model, which excels in preserving
spatio-temporal coherence during video generation. The DropletVideo dataset and
model are accessible at https://dropletx.github.io.