DropletVideo:探索时空一致性视频生成的數據集與方法
DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation
March 8, 2025
Authors: Runze Zhang, Guoguang Du, Xiaochuan Li, Qi Jia, Liang Jin, Lu Liu, Jingjing Wang, Cong Xu, Zhenhua Guo, Yaqian Zhao, Xiaoli Gong, Rengang Li, Baoyu Fan
cs.AI
Abstract
Spatio-temporal consistency is a critical research topic in video generation.
A qualified generated video segment must ensure plot plausibility and coherence
while maintaining visual consistency of objects and scenes across varying
viewpoints. Prior research, especially in open-source projects, primarily
focuses on either temporal or spatial consistency, or their basic combination,
such as appending a description of a camera movement after a prompt without
constraining the outcomes of this movement. However, camera movement may
introduce new objects to the scene or eliminate existing ones, thereby
overlaying and affecting the preceding narrative. Especially in videos with
numerous camera movements, the interplay between multiple plots becomes
increasingly complex. This paper introduces and examines integral
spatio-temporal consistency, considering the synergy between plot progression
and camera techniques, and the long-term impact of prior content on subsequent
generation. Our research encompasses dataset construction through to the
development of the model. Initially, we constructed a DropletVideo-10M dataset,
which comprises 10 million videos featuring dynamic camera motion and object
actions. Each video is annotated with an average caption of 206 words,
detailing various camera movements and plot developments. Following this, we
developed and trained the DropletVideo model, which excels in preserving
spatio-temporal coherence during video generation. The DropletVideo dataset and
model are accessible at https://dropletx.github.io.