

LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs

March 19, 2026
Authors: Keda Tao, Yuhua Zheng, Jia Xu, Wenjie Du, Kele Shao, Hesong Wang, Xueyi Chen, Xin Jin, Junhan Zhu, Bohan Yu, Weiqiang Wang, Jian Liu, Can Qin, Yulun Zhang, Ming-Hsuan Yang, Huan Wang
cs.AI

Abstract

Recent advancements in omnimodal large language models (OmniLLMs) have significantly improved the comprehension of audio and video inputs. However, current evaluations primarily focus on short audio and video clips ranging from 10 seconds to 5 minutes, failing to reflect the demands of real-world applications, where videos typically run for tens of minutes. To address this critical gap, we introduce LVOmniBench, a new benchmark designed specifically for the cross-modal comprehension of long-form audio and video. The dataset consists of high-quality videos sourced from open platforms that feature rich audio-visual dynamics. Through rigorous manual selection and annotation, LVOmniBench comprises 275 videos, ranging in duration from 10 to 90 minutes, and 1,014 question-answer (QA) pairs. LVOmniBench aims to rigorously evaluate the capabilities of OmniLLMs across key domains, including long-term memory, temporal localization, fine-grained understanding, and multimodal perception. Our extensive evaluation reveals that current OmniLLMs encounter significant challenges when processing extended audio-visual inputs: open-source models generally achieve accuracies below 35%, whereas Gemini 3 Pro reaches a peak accuracy of approximately 65%. We anticipate that this dataset, along with our empirical findings, will stimulate further research and the development of advanced models capable of resolving complex cross-modal understanding problems within long-form audio-visual contexts.