
Video ReCap: Recursive Captioning of Hour-Long Videos

February 20, 2024
Authors: Md Mohaiminul Islam, Ngan Ho, Xitong Yang, Tushar Nagarajan, Lorenzo Torresani, Gedas Bertasius
cs.AI

Abstract

Most video captioning models are designed to process short video clips of a few seconds and output text describing low-level visual concepts (e.g., objects, scenes, atomic actions). However, most real-world videos last for minutes or hours and have a complex hierarchical structure spanning different temporal granularities. We propose Video ReCap, a recursive video captioning model that can process video inputs of dramatically different lengths (from 1 second to 2 hours) and output video captions at multiple hierarchy levels. The recursive video-language architecture exploits the synergy between different video hierarchies and can process hour-long videos efficiently. We utilize a curriculum learning training scheme to learn the hierarchical structure of videos, starting from clip-level captions describing atomic actions, then focusing on segment-level descriptions, and concluding with generating summaries for hour-long videos. Furthermore, we introduce the Ego4D-HCap dataset by augmenting Ego4D with 8,267 manually collected long-range video summaries. Our recursive model can flexibly generate captions at different hierarchy levels while also being useful for other complex video understanding tasks, such as VideoQA on EgoSchema. Data, code, and models are available at: https://sites.google.com/view/vidrecap
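To make the hierarchy concrete, below is a minimal sketch of the recursive idea: captions produced at one level become the inputs for the next, coarser level. The `generate_caption` stub and the `group_size` of 3 are illustrative assumptions, not the paper's method; the actual video-language model also conditions each level on video features rather than on text alone.

```python
from typing import List


def generate_caption(inputs: List[str]) -> str:
    # Toy stand-in for a video-language model call; it just joins its
    # inputs so the control flow below is runnable end to end.
    return " / ".join(inputs)


def recursive_captions(clip_captions: List[str], group_size: int = 3) -> List[List[str]]:
    # Level 0 holds dense clip-level captions (atomic actions). Each pass
    # captions groups of lower-level captions, producing progressively
    # coarser descriptions until a single video-level summary remains.
    levels = [clip_captions]
    while len(levels[-1]) > 1:
        current = levels[-1]
        levels.append([
            generate_caption(current[i:i + group_size])
            for i in range(0, len(current), group_size)
        ])
    return levels


if __name__ == "__main__":
    clips = [f"clip {i}: atomic action" for i in range(9)]
    for depth, level in enumerate(recursive_captions(clips)):
        print(f"level {depth}: {len(level)} caption(s)")
```

Because each level only consumes the (much shorter) captions from the level below, the cost of summarizing an hour-long video stays modest, which is the efficiency argument the abstract makes for the recursive design.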

