VoiceAssistant-Eval: Benchmarking AI Assistants across Listening, Speaking, and Viewing

September 26, 2025
Authors: Ke Wang, Houxing Ren, Zimu Lu, Mingjie Zhan, Hongsheng Li
cs.AI

Abstract

The growing capabilities of large language models and multimodal systems have spurred interest in voice-first AI assistants, yet existing benchmarks are inadequate for evaluating the full range of these systems' abilities. We introduce VoiceAssistant-Eval, a comprehensive benchmark designed to assess AI assistants across listening, speaking, and viewing. VoiceAssistant-Eval comprises 10,497 curated examples spanning 13 task categories. These tasks include natural sounds, music, and spoken dialogue for listening; multi-turn dialogue, role-play imitation, and various scenarios for speaking; and highly heterogeneous images for viewing. To demonstrate its utility, we evaluate 21 open-source models and GPT-4o-Audio, measuring the quality of the response content and speech, as well as their consistency. The results reveal three key findings: (1) proprietary models do not universally outperform open-source models; (2) most models excel at speaking tasks but lag in audio understanding; and (3) well-designed smaller models can rival much larger ones. Notably, the mid-sized Step-Audio-2-mini (7B) achieves more than double the listening accuracy of LLaMA-Omni2-32B-Bilingual. However, challenges remain: multimodal (audio plus visual) input and role-play voice imitation tasks are difficult for current models, and significant gaps persist in robustness and safety alignment. VoiceAssistant-Eval identifies these gaps and establishes a rigorous framework for evaluating and guiding the development of next-generation AI assistants. Code and data will be released at https://mathllm.github.io/VoiceAssistantEval/.
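
To make the abstract's three-axis evaluation protocol (response-content quality, speech quality, and content–speech consistency) concrete, below is a minimal Python sketch of how such per-axis scoring could be aggregated over benchmark examples. This is not the benchmark's actual API: the Example record and the judge_* placeholder scorers are hypothetical names for illustration only; the official code and data are to be released at the project page above.

```python
"""Hypothetical sketch of three-axis scoring (content, speech, consistency).

All names here (Example, judge_content, judge_speech, judge_consistency,
evaluate) are illustrative placeholders, not the VoiceAssistant-Eval API.
"""
from dataclasses import dataclass
from statistics import mean


@dataclass
class Example:
    task: str               # one of the 13 task categories, e.g. "music"
    text_response: str      # transcript of the assistant's answer
    audio_response: bytes   # raw waveform of the spoken answer


def judge_content(ex: Example) -> float:
    """Placeholder: rate answer correctness/helpfulness on [0, 1]."""
    return 1.0 if ex.text_response else 0.0


def judge_speech(ex: Example) -> float:
    """Placeholder: rate speech naturalness/intelligibility on [0, 1]."""
    return 1.0 if ex.audio_response else 0.0


def judge_consistency(ex: Example) -> float:
    """Placeholder: check that the spoken audio matches the text content."""
    return 1.0 if ex.text_response and ex.audio_response else 0.0


def evaluate(examples: list[Example]) -> dict[str, float]:
    """Average each axis's score across all benchmark examples."""
    return {
        "content": mean(judge_content(e) for e in examples),
        "speech": mean(judge_speech(e) for e in examples),
        "consistency": mean(judge_consistency(e) for e in examples),
    }


if __name__ == "__main__":
    demo = [Example("spoken_dialogue",
                    "The capital of France is Paris.",
                    b"\x00" * 16000)]
    print(evaluate(demo))  # {'content': 1.0, 'speech': 1.0, 'consistency': 1.0}
```

Reporting the three axes separately, rather than as a single blended score, is what lets the paper surface findings like strong speaking performance coexisting with weak audio understanding.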