See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models
December 1, 2025
Authors: Le Thien Phuc Nguyen, Zhuoran Yu, Samuel Low Yu Hang, Subin An, Jeongik Lee, Yohan Ban, SeungEun Chung, Thanh-Huy Nguyen, JuWan Maeng, Soochahn Lee, Yong Jae Lee
cs.AI
Abstract
Multimodal large language models (MLLMs) are expected to jointly interpret vision, audio, and language, yet existing video benchmarks rarely assess fine-grained reasoning about human speech. Many tasks remain visually solvable or evaluate speech only coarsely, offering limited insight into whether models can align who speaks, what is said, and when it occurs. We introduce AV-SpeakerBench, a curated benchmark of 3,212 multiple-choice questions focused on speaker-centric audiovisual reasoning in real-world videos. It features: (1) a speaker-centered formulation that treats speakers, not scenes, as the core reasoning unit; (2) a fusion-grounded question design that embeds audiovisual dependencies into question semantics; and (3) expert-curated annotations that ensure temporal precision and cross-modal validity. Comprehensive evaluations show that the Gemini family consistently outperforms open-source systems, with Gemini 2.5 Pro achieving the best results. Among open models, Qwen3-Omni-30B approaches Gemini 2.0 Flash but remains far behind Gemini 2.5 Pro, primarily due to weaker audiovisual fusion rather than weaker visual perception. We believe AV-SpeakerBench establishes a rigorous foundation for advancing fine-grained audiovisual reasoning in future multimodal systems.
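For context on the evaluation protocol, the minimal sketch below shows how accuracy on a multiple-choice benchmark of this kind is typically computed. The JSON field names ("video", "question", "options", "answer") and the `model.answer(...)` interface are illustrative assumptions, not the benchmark's published format or API.

```python
# Hypothetical sketch of scoring a model on an AV-SpeakerBench-style
# multiple-choice benchmark. Field names and the model interface are
# assumptions for illustration; the abstract does not specify them.
import json

def evaluate(model, benchmark_path: str) -> float:
    """Return multiple-choice accuracy over a benchmark file."""
    with open(benchmark_path) as f:
        items = json.load(f)  # assumed: a list of question records

    correct = 0
    for item in items:
        # The model receives the video (frames plus audio track) and a
        # question with lettered options, and must return one letter.
        prediction = model.answer(
            video=item["video"],
            question=item["question"],
            options=item["options"],  # e.g. {"A": ..., "B": ..., ...}
        )
        if prediction == item["answer"]:
            correct += 1
    return correct / len(items)
```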