Number it: Temporal Grounding Videos like Flipping Manga

Abstract

Video Large Language Models (Vid-LLMs) have made remarkable advances in comprehending video content for QA dialogue. However, they struggle to extend this visual understanding to tasks that require precise temporal localization, known as Video Temporal Grounding (VTG). To address this gap, we introduce Number-Prompt (NumPro), a novel method that helps Vid-LLMs bridge visual comprehension and temporal grounding by adding a unique numerical identifier to each video frame. Treating a video as a sequence of numbered frame images, NumPro transforms VTG into an intuitive process: flipping through manga panels in sequence. This allows Vid-LLMs to "read" event timelines, accurately linking visual content with the corresponding temporal information. Our experiments demonstrate that NumPro significantly boosts the VTG performance of top-tier Vid-LLMs without additional computational cost. Furthermore, fine-tuning on a NumPro-enhanced dataset sets a new state of the art for VTG, surpassing previous top-performing methods by up to 6.9% in mIoU for moment retrieval and 8.5% in mAP for highlight detection. The code is available at https://github.com/yongliang-wu/NumPro.
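To make the link between frame numbers and temporal information concrete, the following is a minimal sketch of how a reply that names frame numbers could be mapped back to timestamps and scored with temporal IoU, the per-query quantity that mIoU averages. It is not taken from the NumPro repository. The reply format, the assumption that frames are sampled at a fixed rate and numbered from 0, and the helper names `parse_frame_span`, `frames_to_seconds`, and `temporal_iou` are all illustrative assumptions.

```python
import re
from typing import Tuple


def parse_frame_span(answer: str) -> Tuple[int, int]:
    """Extract the first two integers from a model reply.

    Assumption: the reply names a start and an end frame number,
    e.g. "From frame 12 to frame 30". The real reply format may differ.
    """
    numbers = [int(n) for n in re.findall(r"\d+", answer)]
    if len(numbers) < 2:
        raise ValueError(f"expected two frame numbers in {answer!r}")
    start, end = numbers[0], numbers[1]
    # Return the span in ascending order even if the reply reversed it.
    return (start, end) if start <= end else (end, start)


def frames_to_seconds(span: Tuple[int, int], sample_fps: float) -> Tuple[float, float]:
    """Convert frame numbers to seconds.

    Assumption: frames are sampled at a fixed rate and numbered from 0,
    so frame n was taken at n / sample_fps seconds.
    """
    start, end = span
    return start / sample_fps, end / sample_fps


def temporal_iou(pred: Tuple[float, float], gt: Tuple[float, float]) -> float:
    """Intersection over union of two time intervals given as (start, end)."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    reply = "From frame 12 to frame 30"       # hypothetical model output
    predicted = frames_to_seconds(parse_frame_span(reply), sample_fps=0.5)
    ground_truth = (20.0, 60.0)               # hypothetical annotation, in seconds
    print(predicted, temporal_iou(predicted, ground_truth))
```

Under these assumptions, a frame sampled every two seconds places frame 12 at 24 s and frame 30 at 60 s, so the model's answer is read directly off the numbers it saw rather than inferred from position in the token sequence.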
