UniVTG: Towards Unified Video-Language Temporal Grounding
July 31, 2023
Authors: Kevin Qinghong Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Alex Jinpeng Wang, Rui Yan, Mike Zheng Shou
cs.AI
Abstract
Video Temporal Grounding (VTG), which aims to ground target clips from videos
(such as consecutive intervals or disjoint shots) according to custom language
queries (e.g., sentences or words), is key for video browsing on social media.
Most methods in this direction develop task-specific models that are trained
with type-specific labels, such as moment retrieval (time interval) and
highlight detection (worthiness curve), which limits their abilities to
generalize to various VTG tasks and labels. In this paper, we propose to Unify
the diverse VTG labels and tasks, dubbed UniVTG, along three directions:
Firstly, we revisit a wide range of VTG labels and tasks and define a unified
formulation. Based on this, we develop data annotation schemes to create
scalable pseudo supervision. Secondly, we develop an effective and flexible
grounding model capable of addressing each task and making full use of each
label. Lastly, thanks to the unified framework, we are able to unlock temporal
grounding pretraining from large-scale diverse labels and develop stronger
grounding abilities, e.g., zero-shot grounding. Extensive experiments on three
tasks (moment retrieval, highlight detection, and video summarization) across
seven datasets (QVHighlights, Charades-STA, TACoS, Ego4D, YouTube Highlights,
TVSum, and QFVS) demonstrate the effectiveness and flexibility of our proposed
framework. The code is available at https://github.com/showlab/UniVTG.
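To make the idea of a unified formulation concrete, below is a minimal, illustrative sketch of how a single clip-level label could carry supervision for all three tasks: a foreground indicator and interval offsets would serve moment retrieval, while a per-clip saliency ("worthiness") score would serve highlight detection and summarization. All class, field, and function names here are assumptions for illustration, not the paper's actual interface; see the official repository for the real implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClipLabel:
    """Hypothetical unified label for one fixed-length video clip.

    One representation of this kind could cover the tasks described above:
      - moment retrieval reads the interval offsets of foreground clips,
      - highlight detection reads the per-clip saliency score,
      - video summarization selects the most salient clips.
    """
    is_foreground: bool      # does the clip fall inside a queried moment?
    offset_to_start: float   # seconds from this clip back to the moment's start
    offset_to_end: float     # seconds from this clip forward to the moment's end
    saliency: float          # query-relevance / "worthiness" score in [0, 1]


def top_k_summary(labels: List[ClipLabel], k: int) -> List[int]:
    """Toy summarization readout: indices of the k most salient clips."""
    ranked = sorted(range(len(labels)), key=lambda i: labels[i].saliency, reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Three 2-second clips of a toy video; only the middle one matches the query.
    labels = [
        ClipLabel(is_foreground=False, offset_to_start=0.0, offset_to_end=0.0, saliency=0.1),
        ClipLabel(is_foreground=True, offset_to_start=0.0, offset_to_end=2.0, saliency=0.9),
        ClipLabel(is_foreground=False, offset_to_start=0.0, offset_to_end=0.0, saliency=0.3),
    ]
    print(top_k_summary(labels, k=1))  # -> [1]
```

The point of such a shared label space is that interval-style annotations (moment retrieval) and curve-style annotations (highlight detection, summarization) can supervise one model, which is what enables the large-scale pretraining described in the abstract.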