SONIC-O1: A Real-World Benchmark for Evaluating Multimodal Large Language Models on Audio-Video Understanding
January 29, 2026
Authors: Ahmed Y. Radwan, Christos Emmanouilidis, Hina Tabassum, Deval Pandya, Shaina Raza
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work centers on static image understanding, and their ability to process sequential audio-video data remains underexplored. This gap highlights the need for a high-quality benchmark that systematically evaluates MLLM performance in real-world settings. We introduce SONIC-O1, a comprehensive, fully human-verified benchmark spanning 13 real-world conversational domains with 4,958 annotations and demographic metadata. SONIC-O1 evaluates MLLMs on key tasks, including open-ended summarization, multiple-choice question (MCQ) answering, and temporal localization with supporting rationales (reasoning). Experiments on closed- and open-source models reveal clear limitations: while the gap in MCQ accuracy between the two model families is relatively small, we observe a substantial 22.6% difference in temporal localization between the best-performing closed-source and open-source models. Performance also varies across demographic groups, indicating persistent disparities in model behavior. Overall, SONIC-O1 provides an open evaluation suite for temporally grounded and socially robust multimodal understanding. We release SONIC-O1 for reproducibility and research:

Project page: https://vectorinstitute.github.io/sonic-o1/
Dataset: https://huggingface.co/datasets/vector-institute/sonic-o1
GitHub: https://github.com/vectorinstitute/sonic-o1
Leaderboard: https://huggingface.co/spaces/vector-institute/sonic-o1-leaderboard
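For readers who want to try the benchmark, the sketch below loads the dataset from the Hugging Face Hub. It is a minimal sketch assuming the standard `datasets` API: the repository ID comes from the dataset link above, but the split and column names are not documented here and should be verified against the dataset card.

```python
# Minimal sketch: loading SONIC-O1 from the Hugging Face Hub with the
# standard `datasets` library (pip install datasets).
# The repository ID is taken from the paper's dataset link; the split
# handling below is an assumption -- inspect the printed DatasetDict
# to see the actual splits and columns.
from datasets import load_dataset

ds = load_dataset("vector-institute/sonic-o1")
print(ds)  # shows available splits and their column names

# Hypothetical peek at a few examples from the first available split.
split = next(iter(ds.values()))
for example in split.select(range(3)):
    print(example)
```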