Video ReCap: Recursive Captioning of Hour-Long Videos
February 20, 2024
Authors: Md Mohaiminul Islam, Ngan Ho, Xitong Yang, Tushar Nagarajan, Lorenzo Torresani, Gedas Bertasius
cs.AI
Abstract
Most video captioning models are designed to process short video clips of a few
seconds and output text describing low-level visual concepts (e.g., objects,
scenes, atomic actions). However, most real-world videos last for minutes or
hours and have a complex hierarchical structure spanning different temporal
granularities. We propose Video ReCap, a recursive video captioning model that
can process video inputs of dramatically different lengths (from 1 second to 2
hours) and output video captions at multiple hierarchy levels. The recursive
video-language architecture exploits the synergy between different video
hierarchies and can process hour-long videos efficiently. We utilize a
curriculum learning training scheme to learn the hierarchical structure of
videos, starting from clip-level captions describing atomic actions, then
focusing on segment-level descriptions, and concluding with generating
summaries for hour-long videos. Furthermore, we introduce the Ego4D-HCap dataset
by augmenting Ego4D with 8,267 manually collected long-range video summaries. Our
recursive model can flexibly generate captions at different hierarchy levels
while also being useful for other complex video understanding tasks, such as
VideoQA on EgoSchema. Data, code, and models are available at:
https://sites.google.com/view/vidrecap
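The recursive, level-by-level captioning the abstract describes (clip-level captions feeding segment-level descriptions, which in turn feed a video-level summary) can be sketched as follows. This is a minimal illustration of the control flow only: the `caption` stub and the `group_size` parameter are hypothetical placeholders, not the paper's actual video-language model or its real segment boundaries.

```python
def caption(inputs):
    """Hypothetical stand-in for the captioning model: in the real system this
    would be a video-language model; here it just joins its inputs so the
    recursion is runnable."""
    return " + ".join(inputs)

def recursive_recap(clip_features, group_size=3):
    """Produce captions at every hierarchy level.

    Level 0: clip-level captions from raw clip features (atomic actions).
    Level k: captions generated from groups of level k-1 captions,
             repeated until a single video-level summary remains.
    """
    levels = [[caption([f]) for f in clip_features]]  # clip-level captions
    while len(levels[-1]) > 1:
        prev = levels[-1]
        # Caption each group of lower-level captions to form the next level.
        nxt = [caption(prev[i:i + group_size])
               for i in range(0, len(prev), group_size)]
        levels.append(nxt)
    return levels

levels = recursive_recap([f"clip{i}" for i in range(9)])
print(len(levels))      # number of hierarchy levels produced
print(levels[-1][0])    # the single top-level summary
```

With nine clips and groups of three, this yields three levels (clip, segment, video), mirroring the hierarchy described above; the recursion is what lets an hour-long input be summarized without ever feeding the whole video to the model at once.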