LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs

March 19, 2026
作者: Keda Tao, Yuhua Zheng, Jia Xu, Wenjie Du, Kele Shao, Hesong Wang, Xueyi Chen, Xin Jin, Junhan Zhu, Bohan Yu, Weiqiang Wang, Jian Liu, Can Qin, Yulun Zhang, Ming-Hsuan Yang, Huan Wang
cs.AI

Abstract

Recent advancements in omnimodal large language models (OmniLLMs) have significantly improved the comprehension of audio and video inputs. However, current evaluations focus primarily on short audio and video clips ranging from 10 seconds to 5 minutes, failing to reflect the demands of real-world applications, where videos typically run for tens of minutes. To address this critical gap, we introduce LVOmniBench, a new benchmark designed specifically for cross-modal comprehension of long-form audio and video. The benchmark draws on high-quality videos sourced from open platforms that feature rich audio-visual dynamics; after rigorous manual selection and annotation, LVOmniBench comprises 275 videos, ranging in duration from 10 to 90 minutes, and 1,014 question-answer (QA) pairs. LVOmniBench aims to rigorously evaluate the capabilities of OmniLLMs across domains including long-term memory, temporal localization, fine-grained understanding, and multimodal perception. Our extensive evaluation reveals that current OmniLLMs encounter significant challenges when processing extended audio-visual inputs: open-source models generally achieve accuracies below 35%, whereas Gemini 3 Pro reaches a peak accuracy of approximately 65%. We anticipate that this dataset, along with our empirical findings, will stimulate further research and the development of advanced models capable of resolving complex cross-modal understanding problems in long-form audio-visual contexts.
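The abstract reports per-model accuracy over the 1,014 QA pairs, broken down by capability domain. As a minimal sketch of how such scoring is typically done, the snippet below computes per-category accuracy for multiple-choice answers; note that the file name results.json and the field names (category, answer, prediction) are hypothetical assumptions, since the paper does not publish an evaluation schema here.

```python
# Minimal per-category accuracy scorer for a benchmark like LVOmniBench.
# Assumed (hypothetical) record format, one dict per QA pair:
#   {"category": "temporal_localization", "answer": "B", "prediction": "b"}
import json
from collections import defaultdict


def score(results_path: str) -> dict[str, float]:
    """Return accuracy per task category from a list of QA records."""
    with open(results_path) as f:
        records = json.load(f)

    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for rec in records:
        cat = rec["category"]
        total[cat] += 1
        # Multiple-choice answers compared as case-insensitive letters.
        if rec["prediction"].strip().upper() == rec["answer"].strip().upper():
            correct[cat] += 1

    return {cat: correct[cat] / total[cat] for cat in total}


if __name__ == "__main__":
    for cat, acc in sorted(score("results.json").items()):
        print(f"{cat:>28s}: {acc:.1%}")
```

Grouping by category makes it easy to see whether a model's failures concentrate in one domain, for example temporal localization rather than multimodal perception.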