
VideoVista-CulturalLingo: 360° Horizons-Bridging Cultures, Languages, and Domains in Video Comprehension

April 23, 2025
Authors: Xinyu Chen, Yunxin Li, Haoyuan Shi, Baotian Hu, Wenhan Luo, Yaowei Wang, Min Zhang
cs.AI

Abstract

Assessing the video comprehension capabilities of multimodal AI systems can effectively measure their understanding and reasoning abilities. Most video evaluation benchmarks are limited to a single language, typically English, and predominantly feature videos rooted in Western cultural contexts. In this paper, we present VideoVista-CulturalLingo, the first video evaluation benchmark designed to bridge cultural, linguistic, and domain divides in video comprehension. Our work differs from existing benchmarks in the following ways: 1) Cultural diversity, incorporating cultures from China, North America, and Europe; 2) Multilingualism, with questions presented in Chinese and English, two of the most widely spoken languages; and 3) Broad domain coverage, featuring videos sourced from hundreds of human-created domains. VideoVista-CulturalLingo contains 1,389 videos and 3,134 QA pairs, and we have evaluated 24 recent open-source or proprietary large video models on it. From the experimental results, we observe that: 1) Existing models perform worse on Chinese-centric questions than on Western-centric ones, particularly those related to Chinese history; 2) Current open-source models still exhibit limitations in temporal understanding, especially in the Event Localization task, where the maximum score is only 45.2%; 3) Mainstream models perform strongly on general scientific questions, while open-source models perform weakly in mathematics.

