
MultiVENT 2.0: A Massive Multilingual Benchmark for Event-Centric Video Retrieval

October 15, 2024
作者: Reno Kriz, Kate Sanders, David Etter, Kenton Murray, Cameron Carpenter, Kelly Van Ochten, Hannah Recknor, Jimena Guallar-Blasco, Alexander Martin, Ronald Colaianni, Nolan King, Eugene Yang, Benjamin Van Durme
cs.AI

Abstract

Efficiently retrieving and synthesizing information from large-scale multimodal collections has become a critical challenge. However, existing video retrieval datasets suffer from scope limitations, primarily focusing on matching descriptive but vague queries with small collections of professionally edited, English-centric videos. To address this gap, we introduce MultiVENT 2.0, a large-scale, multilingual event-centric video retrieval benchmark featuring a collection of more than 218,000 news videos and 3,906 queries targeting specific world events. These queries specifically target information found in the visual content, audio, embedded text, and text metadata of the videos, requiring systems to leverage all of these sources to succeed at the task. Preliminary results show that state-of-the-art vision-language models struggle significantly with this task, and while alternative approaches show promise, they are still insufficient to adequately address this problem. These findings underscore the need for more robust multimodal retrieval systems, as effective video retrieval is a crucial step towards multimodal content understanding and generation tasks.

November 16, 2024