MultiVENT 2.0: A Massive Multilingual Benchmark for Event-Centric Video Retrieval

Abstract

Efficiently retrieving and synthesizing information from large-scale multimodal collections has become a critical challenge. However, existing video retrieval datasets are limited in scope: they focus primarily on matching descriptive but vague queries against small collections of professionally edited, English-centric videos. To address this gap, we introduce MultiVENT 2.0, a large-scale, multilingual, event-centric video retrieval benchmark featuring a collection of more than 218,000 news videos and 3,906 queries targeting specific world events. These queries specifically target information found in the visual content, audio, embedded text, and text metadata of the videos, requiring systems to leverage all of these sources to succeed at the task. Preliminary results show that state-of-the-art vision-language models struggle significantly with this task, and while alternative approaches show promise, they remain insufficient to address the problem. These findings underscore the need for more robust multimodal retrieval systems, as effective video retrieval is a crucial step towards multimodal content understanding and generation tasks.