UniVTG: Towards Unified Video-Language Temporal Grounding

Abstract

Video Temporal Grounding (VTG), which aims to ground target clips from videos (such as consecutive intervals or disjoint shots) according to custom language queries (e.g., sentences or words), is key for video browsing on social media. Most methods in this direction develop task-specific models that are trained with type-specific labels, such as moment retrieval (time interval) and highlight detection (worthiness curve), which limits their ability to generalize to various VTG tasks and labels. In this paper, we propose to unify the diverse VTG labels and tasks, dubbed UniVTG, along three directions. First, we revisit a wide range of VTG labels and tasks and define a unified formulation. Based on this, we develop data annotation schemes to create scalable pseudo supervision. Second, we develop an effective and flexible grounding model capable of addressing each task and making full use of each label. Finally, thanks to the unified framework, we are able to unlock temporal grounding pretraining from large-scale diverse labels and develop stronger grounding abilities, e.g., zero-shot grounding. Extensive experiments on three tasks (moment retrieval, highlight detection, and video summarization) across seven datasets (QVHighlights, Charades-STA, TACoS, Ego4D, YouTube Highlights, TVSum, and QFVS) demonstrate the effectiveness and flexibility of our proposed framework. The code is available at https://github.com/showlab/UniVTG.
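To make the idea of a shared label format more concrete, the sketch below shows one way a time interval (moment retrieval) and a worthiness curve (highlight detection) could both be mapped onto per-clip labels. It is an illustration only, not the paper's actual formulation: the `ClipLabel` fields, the function names, the 2-second clip length, and the 0.5 threshold are all assumptions introduced here.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ClipLabel:
    """Hypothetical per-clip label; field names are illustrative only."""
    in_target: bool    # does this clip fall inside a target span?
    worthiness: float  # per-clip score in [0, 1]


def intervals_to_clip_labels(
    intervals: List[Tuple[float, float]],
    video_duration: float,
    clip_length: float = 2.0,  # assumed clip length in seconds
) -> List[ClipLabel]:
    """Map time intervals (moment-retrieval style) onto fixed-length clips."""
    num_clips = math.ceil(video_duration / clip_length)
    labels = []
    for i in range(num_clips):
        center = (i + 0.5) * clip_length
        # A clip counts as "in target" if its center lies in any interval,
        # so disjoint shots are handled the same way as a single interval.
        inside = any(start <= center < end for start, end in intervals)
        labels.append(ClipLabel(in_target=inside, worthiness=1.0 if inside else 0.0))
    return labels


def curve_to_clip_labels(
    scores: List[float],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> List[ClipLabel]:
    """Map a per-clip worthiness curve (highlight-detection style) onto the same format."""
    return [ClipLabel(in_target=s >= threshold, worthiness=s) for s in scores]


if __name__ == "__main__":
    # One interval from 2 s to 6 s in a 10 s video, split into 2 s clips.
    for label in intervals_to_clip_labels([(2.0, 6.0)], video_duration=10.0):
        print(label)
```

Under these assumptions, both label types end up as a list of the same per-clip records, which is what would let a single model train on either kind of annotation.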

