
UniVTG: Towards Unified Video-Language Temporal Grounding

July 31, 2023
Authors: Kevin Qinghong Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Alex Jinpeng Wang, Rui Yan, Mike Zheng Shou
cs.AI

Abstract

Video Temporal Grounding (VTG), which aims to ground target clips from videos (such as consecutive intervals or disjoint shots) according to custom language queries (e.g., sentences or words), is key for video browsing on social media. Most methods in this direction develop task-specific models that are trained with type-specific labels, such as moment retrieval (time interval) and highlight detection (worthiness curve), which limits their abilities to generalize to various VTG tasks and labels. In this paper, we propose to Unify the diverse VTG labels and tasks, dubbed UniVTG, along three directions: Firstly, we revisit a wide range of VTG labels and tasks and define a unified formulation. Based on this, we develop data annotation schemes to create scalable pseudo supervision. Secondly, we develop an effective and flexible grounding model capable of addressing each task and making full use of each label. Lastly, thanks to the unified framework, we are able to unlock temporal grounding pretraining from large-scale diverse labels and develop stronger grounding abilities, e.g., zero-shot grounding. Extensive experiments on three tasks (moment retrieval, highlight detection, and video summarization) across seven datasets (QVHighlights, Charades-STA, TACoS, Ego4D, YouTube Highlights, TVSum, and QFVS) demonstrate the effectiveness and flexibility of our proposed framework. The code is available at https://github.com/showlab/UniVTG.
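One way to picture the "unified formulation" direction is that every short clip of a video carries the same small set of labels, from which interval-style moment-retrieval annotations, curve-style highlight scores, and summarization selections can all be read off. The sketch below is a minimal Python illustration under that assumption; the field names (foreground, offsets, saliency) and the helper interval_to_clip_labels are hypothetical for illustration, not taken from the UniVTG paper or codebase.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ClipLabel:
    """Unified per-clip annotation (illustrative; field names are assumptions).

    foreground: whether the clip falls inside a queried moment (moment retrieval).
    offsets:    distances from the clip timestamp to the moment's start and end,
                usable to regress the interval boundaries.
    saliency:   query-relevance score (highlight detection / worthiness curve);
                summarization can rank or threshold it.
    """
    foreground: bool
    offsets: Tuple[float, float]
    saliency: float

def interval_to_clip_labels(clip_times: List[float],
                            start: float, end: float,
                            saliency_curve: List[float]) -> List[ClipLabel]:
    """Convert one interval label plus a saliency curve into unified per-clip
    labels (hypothetical helper for illustration)."""
    labels = []
    for t, s in zip(clip_times, saliency_curve):
        inside = start <= t <= end
        labels.append(ClipLabel(
            foreground=inside,
            offsets=(t - start, end - t) if inside else (0.0, 0.0),
            saliency=s,
        ))
    return labels

# Example: six clips sampled every 2 s, a moment spanning 4 s to 9 s,
# and a per-clip saliency curve.
labels = interval_to_clip_labels(
    clip_times=[1.0, 3.0, 5.0, 7.0, 9.0, 11.0],
    start=4.0, end=9.0,
    saliency_curve=[0.1, 0.2, 0.8, 0.9, 0.6, 0.1],
)
```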