
VidChapters-7M: Video Chapters at Scale

September 25, 2023
Authors: Antoine Yang, Arsha Nagrani, Ivan Laptev, Josef Sivic, Cordelia Schmid
cs.AI

Abstract

Segmenting long videos into chapters enables users to quickly navigate to the information of their interest. This important topic has been understudied due to the lack of publicly released datasets. To address this issue, we present VidChapters-7M, a dataset of 817K user-chaptered videos including 7M chapters in total. VidChapters-7M is automatically created from videos online in a scalable manner by scraping user-annotated chapters and hence without any additional manual annotation. We introduce the following three tasks based on this data. First, the video chapter generation task consists of temporally segmenting the video and generating a chapter title for each segment. To further dissect the problem, we also define two variants of this task: video chapter generation given ground-truth boundaries, which requires generating a chapter title given an annotated video segment, and video chapter grounding, which requires temporally localizing a chapter given its annotated title. We benchmark both simple baselines and state-of-the-art video-language models for these three tasks. We also show that pretraining on VidChapters-7M transfers well to dense video captioning tasks in both zero-shot and finetuning settings, largely improving the state of the art on the YouCook2 and ViTT benchmarks. Finally, our experiments reveal that downstream performance scales well with the size of the pretraining dataset. Our dataset, code, and models are publicly available at https://antoyang.github.io/vidchapters.html.
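To make the three tasks concrete, here is a minimal sketch (in Python) of how a user-annotated chapter record could be represented and how the inputs and outputs of each task relate to it. The field names and function signatures below are illustrative assumptions, not the actual VidChapters-7M schema or released code.

```python
# Illustrative sketch only: field names and function stubs are hypothetical,
# not the released VidChapters-7M data format or codebase.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Chapter:
    start: float   # chapter start time in seconds (user-annotated)
    end: float     # chapter end time in seconds
    title: str     # user-written chapter title

@dataclass
class ChapteredVideo:
    video_id: str
    duration: float          # total video length in seconds
    chapters: List[Chapter]  # ordered, non-overlapping chapters

# Task 1: video chapter generation.
# Input: the full video; output: temporal segments plus a title per segment.
def generate_chapters(video_id: str) -> List[Chapter]:
    ...

# Task 2: chapter title generation given ground-truth boundaries.
# Input: the video and an annotated segment; output: a title for that segment.
def generate_title(video_id: str, segment: Tuple[float, float]) -> str:
    ...

# Task 3: video chapter grounding.
# Input: the video and a chapter title; output: the segment the title refers to.
def ground_chapter(video_id: str, title: str) -> Tuple[float, float]:
    ...
```

In this framing, Task 1 is evaluated jointly on segmentation and captioning quality, while Tasks 2 and 3 isolate the captioning and temporal localization subproblems, respectively.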