VidChapters-7M: Video Chapters at Scale
September 25, 2023
Authors: Antoine Yang, Arsha Nagrani, Ivan Laptev, Josef Sivic, Cordelia Schmid
cs.AI
Abstract
Segmenting long videos into chapters enables users to quickly navigate to the
information of their interest. This important topic has been understudied due
to the lack of publicly released datasets. To address this issue, we present
VidChapters-7M, a dataset of 817K user-chaptered videos including 7M chapters
in total. VidChapters-7M is automatically created from videos online in a
scalable manner by scraping user-annotated chapters and hence without any
additional manual annotation. We introduce the following three tasks based on
this data. First, the video chapter generation task consists of temporally
segmenting the video and generating a chapter title for each segment. To
further dissect the problem, we also define two variants of this task: video
chapter generation given ground-truth boundaries, which requires generating a
chapter title given an annotated video segment, and video chapter grounding,
which requires temporally localizing a chapter given its annotated title. We
benchmark both simple baselines and state-of-the-art video-language models for
these three tasks. We also show that pretraining on VidChapters-7M transfers
well to dense video captioning tasks in both zero-shot and finetuning settings,
largely improving the state of the art on the YouCook2 and ViTT benchmarks.
Finally, our experiments reveal that downstream performance scales well with
the size of the pretraining dataset. Our dataset, code, and models are publicly
available at https://antoyang.github.io/vidchapters.html.
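For illustration, the sketch below shows how user-annotated chapters of the kind collected in VidChapters-7M are typically written in a video description (one timestamped line per chapter) and how such lines might be parsed into (start time, title) segments. The regular expression, field names, and example description are assumptions made for this sketch; they are not the authors' released extraction code.

```python
import re

# A chapter list as users commonly write it in a video description:
# one line per chapter, starting with an (H:)MM:SS timestamp followed by a title.
DESCRIPTION = """\
0:00 Introduction
1:25 Gathering the ingredients
10:03 Cooking the sauce
23:47 Plating and serving
"""

# Hypothetical pattern: optional hours, then minutes:seconds, then the chapter title.
TIMESTAMP_LINE = re.compile(r"^\s*(?:(\d{1,2}):)?(\d{1,2}):(\d{2})\s+(.+?)\s*$")

def parse_chapters(description: str):
    """Parse timestamped lines into a list of {'start': seconds, 'title': str} dicts."""
    chapters = []
    for line in description.splitlines():
        match = TIMESTAMP_LINE.match(line)
        if not match:
            continue
        hours, minutes, seconds, title = match.groups()
        start = int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds)
        chapters.append({"start": start, "title": title})
    return chapters

if __name__ == "__main__":
    for chapter in parse_chapters(DESCRIPTION):
        print(f"{chapter['start']:>6d}s  {chapter['title']}")
```

In this representation, the chapter start times define the temporal segmentation targeted by video chapter generation, while each (segment, title) pair corresponds to the inputs and outputs of the two task variants (title generation given boundaries, and grounding a chapter given its title).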