
VoiceAssistant-Eval: Benchmarking AI Assistants across Listening, Speaking, and Viewing

September 26, 2025
Authors: Ke Wang, Houxing Ren, Zimu Lu, Mingjie Zhan, Hongsheng Li
cs.AI

Abstract

The growing capabilities of large language models and multimodal systems have spurred interest in voice-first AI assistants, yet existing benchmarks are inadequate for evaluating the full range of these systems' capabilities. We introduce VoiceAssistant-Eval, a comprehensive benchmark designed to assess AI assistants across listening, speaking, and viewing. VoiceAssistant-Eval comprises 10,497 curated examples spanning 13 task categories. These tasks include natural sounds, music, and spoken dialogue for listening; multi-turn dialogue, role-play imitation, and various scenarios for speaking; and highly heterogeneous images for viewing. To demonstrate its utility, we evaluate 21 open-source models and GPT-4o-Audio, measuring the quality of the response content and speech, as well as their consistency. The results reveal three key findings: (1) proprietary models do not universally outperform open-source models; (2) most models excel at speaking tasks but lag in audio understanding; and (3) well-designed smaller models can rival much larger ones. Notably, the mid-sized Step-Audio-2-mini (7B) achieves more than double the listening accuracy of LLaMA-Omni2-32B-Bilingual. However, challenges remain: multimodal (audio plus visual) input and role-play voice imitation tasks are difficult for current models, and significant gaps persist in robustness and safety alignment. VoiceAssistant-Eval identifies these gaps and establishes a rigorous framework for evaluating and guiding the development of next-generation AI assistants. Code and data will be released at https://mathllm.github.io/VoiceAssistantEval/.
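
As a rough illustration of the three-axis scoring the abstract describes (response content quality, speech quality, and their consistency, aggregated per task category), the following sketch averages per-example scores by category. It is a minimal sketch, not the benchmark's released tooling: the JSONL schema and field names ("category", "content_score", "speech_score", "consistency_score") are assumptions for illustration; consult the released code and data for the actual format.

```python
# Hypothetical per-category aggregation for a VoiceAssistant-Eval-style result file.
# Assumed input: one JSON object per line with "category" plus three per-example
# scores -- this schema is illustrative, not the benchmark's official format.
import json
from collections import defaultdict
from statistics import mean

def aggregate_scores(path: str) -> dict[str, dict[str, float]]:
    """Group per-example scores by task category and average each axis."""
    by_category = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            example = json.loads(line)
            by_category[example["category"]].append(example)

    report = {}
    for category, examples in by_category.items():
        report[category] = {
            # Quality of the textual response content (e.g., a judge score in [0, 1]).
            "content": mean(e["content_score"] for e in examples),
            # Quality of the synthesized speech carrying that response.
            "speech": mean(e["speech_score"] for e in examples),
            # Agreement between what the model says and what it writes.
            "consistency": mean(e["consistency_score"] for e in examples),
        }
    return report

if __name__ == "__main__":
    for category, scores in sorted(aggregate_scores("results.jsonl").items()):
        print(f"{category}: " + ", ".join(f"{k}={v:.3f}" for k, v in scores.items()))
```

Reporting the three axes separately, rather than a single combined score, is what lets a table surface the paper's headline pattern: models that score well on speaking-side axes can still lag badly on listening-side categories.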