

See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models

December 1, 2025
Authors: Le Thien Phuc Nguyen, Zhuoran Yu, Samuel Low Yu Hang, Subin An, Jeongik Lee, Yohan Ban, SeungEun Chung, Thanh-Huy Nguyen, JuWan Maeng, Soochahn Lee, Yong Jae Lee
cs.AI

Abstract

Multimodal large language models (MLLMs) are expected to jointly interpret vision, audio, and language, yet existing video benchmarks rarely assess fine-grained reasoning about human speech. Many tasks remain visually solvable or only coarsely evaluate speech, offering limited insight into whether models can align who speaks, what is said, and when it occurs. We introduce AV-SpeakerBench, a curated benchmark of 3,212 multiple-choice questions focused on speaker-centric audiovisual reasoning in real-world videos. It features: (1) a speaker-centered formulation that treats speakers, not scenes, as the core reasoning unit; (2) fusion-grounded question design embedding audiovisual dependencies into question semantics; and (3) expert-curated annotations ensuring temporal precision and cross-modal validity. Comprehensive evaluations show that the Gemini family consistently outperforms open-source systems, with Gemini 2.5 Pro achieving the best results. Among open models, Qwen3-Omni-30B approaches Gemini 2.0 Flash but remains far behind Gemini 2.5 Pro, primarily due to weaker audiovisual fusion rather than visual perception. We believe AV-SpeakerBench establishes a rigorous foundation for advancing fine-grained audiovisual reasoning in future multimodal systems.
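To make the evaluation setup concrete, below is a minimal sketch of how a speaker-centric multiple-choice item and an accuracy-based evaluation loop could look. The abstract does not specify the benchmark's data format, so the field names (`video_path`, `start_sec`, `end_sec`, etc.) and the `model_answer_fn` interface are illustrative assumptions, not the authors' actual schema or evaluation code.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SpeakerQuestion:
    """Hypothetical AV-SpeakerBench-style item; field names are assumptions."""
    video_path: str      # real-world video clip containing the speakers
    question: str        # fusion-grounded question tying who speaks, what is said, and when
    choices: List[str]   # multiple-choice options
    answer_index: int    # index of the correct option
    start_sec: float     # expert-annotated temporal window of the relevant speech
    end_sec: float


def evaluate(model_answer_fn: Callable[[SpeakerQuestion], int],
             items: List[SpeakerQuestion]) -> float:
    """Return multiple-choice accuracy of a model over the benchmark items.

    model_answer_fn is any callable that takes one item (video + question +
    choices) and returns the index of the option the model selects.
    """
    if not items:
        return 0.0
    correct = sum(model_answer_fn(item) == item.answer_index for item in items)
    return correct / len(items)
```

Under this kind of setup, comparing systems (e.g., Gemini 2.5 Pro vs. Qwen3-Omni-30B) reduces to plugging different `model_answer_fn` wrappers into the same loop.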