LVD-2M: A Long-take Video Dataset with Temporally Dense Captions

October 14, 2024
Authors: Tianwei Xiong, Yuqing Wang, Daquan Zhou, Zhijie Lin, Jiashi Feng, Xihui Liu
cs.AI

Abstract

The efficacy of video generation models heavily depends on the quality of their training datasets. Most previous video generation models are trained on short video clips, while recently there has been increasing interest in training long video generation models directly on longer videos. However, the lack of such high-quality long videos impedes the advancement of long video generation. To promote research in long video generation, we desire a new dataset with four key features essential for training long video generation models: (1) long videos covering at least 10 seconds, (2) long-take videos without cuts, (3) large motion and diverse contents, and (4) temporally dense captions. To achieve this, we introduce a new pipeline for selecting high-quality long-take videos and generating temporally dense captions. Specifically, we define a set of metrics to quantitatively assess video quality, including scene cuts, dynamic degrees, and semantic-level quality, enabling us to filter high-quality long-take videos from a large number of source videos. Subsequently, we develop a hierarchical video captioning pipeline to annotate long videos with temporally dense captions. With this pipeline, we curate the first long-take video dataset, LVD-2M, comprising 2 million long-take videos, each covering more than 10 seconds and annotated with temporally dense captions. We further validate the effectiveness of LVD-2M by fine-tuning video generation models to generate long videos with dynamic motion. We believe our work will significantly contribute to future research in long video generation.
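The filtering stage described in the abstract combines quantitative checks for duration, scene cuts, and motion. As a rough illustration only (not the authors' released pipeline), the Python sketch below approximates those checks using PySceneDetect for cut detection and Farneback optical flow as a crude proxy for the dynamic degree; the threshold values and helper names are hypothetical, and the semantic-level quality metric (e.g., scoring clips with a vision-language model) is omitted.

```python
# Hypothetical sketch of the three mechanical filters from the abstract.
# Requires: opencv-python, numpy, scenedetect. Thresholds are illustrative.
import cv2
import numpy as np
from scenedetect import ContentDetector, detect

MIN_DURATION_S = 10.0  # feature (1): clips must cover at least 10 seconds
MIN_FLOW_MAG = 1.0     # feature (3): hypothetical "large motion" threshold


def clip_duration_s(path: str) -> float:
    """Clip length in seconds, from the container's FPS and frame count."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    n_frames = cap.get(cv2.CAP_PROP_FRAME_COUNT)
    cap.release()
    return n_frames / fps if fps > 0 else 0.0


def is_long_take(path: str) -> bool:
    """Feature (2): keep only videos in which no scene cut is detected."""
    scenes = detect(path, ContentDetector())
    return len(scenes) <= 1  # a single scene (or none) means no cuts


def dynamic_degree(path: str, stride: int = 8) -> float:
    """Mean optical-flow magnitude over frame pairs sampled every `stride` frames."""
    cap = cv2.VideoCapture(path)
    prev, mags, idx = None, [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if prev is not None:
                flow = cv2.calcOpticalFlowFarneback(
                    prev, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
                mags.append(float(np.linalg.norm(flow, axis=-1).mean()))
            prev = gray
        idx += 1
    cap.release()
    return float(np.mean(mags)) if mags else 0.0


def passes_filters(path: str) -> bool:
    """Combine the duration, long-take, and motion checks."""
    return (clip_duration_s(path) >= MIN_DURATION_S
            and is_long_take(path)
            and dynamic_degree(path) >= MIN_FLOW_MAG)
```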
