

Audio-Aware Large Language Models as Judges for Speaking Styles

June 6, 2025
作者: Cheng-Han Chiang, Xiaofei Wang, Chung-Ching Lin, Kevin Lin, Linjie Li, Radu Kopetz, Yao Qian, Zhendong Wang, Zhengyuan Yang, Hung-yi Lee, Lijuan Wang
cs.AI

Abstract

Audio-aware large language models (ALLMs) can understand the textual and non-textual information in audio input. In this paper, we explore using ALLMs as automatic judges to assess the speaking styles of speeches. We use ALLM judges to evaluate speeches generated by spoken language models (SLMs) on two tasks: voice style instruction following and role-playing. The speaking styles we consider include emotion, volume, speaking pace, word emphasis, pitch control, and non-verbal elements. We use four SLMs to complete the two tasks and have both humans and ALLMs judge the SLMs' responses. We compare two ALLM judges, GPT-4o-audio and Gemini-2.5-pro, with human evaluation results and show that the agreement between Gemini and human judges is comparable to the agreement between human evaluators. These promising results show that ALLMs can be used as judges to evaluate SLMs. Our results also reveal that current SLMs, even GPT-4o-audio, still have room for improvement in controlling the speaking style and generating natural dialogues.
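
The sketch below is a minimal illustration of the ALLM-as-judge setup described in the abstract, not the authors' actual prompts or pipeline. It assumes access to GPT-4o-audio through the OpenAI Chat Completions API under the `gpt-4o-audio-preview` model name; the rubric wording, the 1-5 scale, the `judge_speaking_style` helper, and the `slm_response.wav` file are illustrative assumptions.

```python
# Minimal sketch of using an audio-aware LLM as a judge for speaking style.
# Assumptions (not from the paper): the rubric text, the 1-5 scale, and the
# example file name are made up for illustration.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = (
    "You will hear a spoken response. On a 1-5 scale, rate how well it follows "
    "this style instruction: '{instruction}'. Consider emotion, volume, speaking "
    "pace, word emphasis, pitch control, and non-verbal elements. "
    "Reply with a single integer."
)

def judge_speaking_style(wav_path: str, instruction: str) -> int:
    """Ask the ALLM judge to score one SLM response against a style instruction."""
    with open(wav_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o-audio-preview",
        modalities=["text"],  # we only need a text verdict back
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": RUBRIC.format(instruction=instruction)},
                {"type": "input_audio",
                 "input_audio": {"data": audio_b64, "format": "wav"}},
            ],
        }],
    )
    return int(response.choices[0].message.content.strip())

if __name__ == "__main__":
    score = judge_speaking_style("slm_response.wav", "Sound excited and speak quickly.")
    print(f"ALLM judge score: {score}")
```

In a full evaluation, scores like these would be collected for each SLM response and compared against human ratings, for example with an inter-rater agreement statistic, which is how the paper assesses whether ALLM judges track human judgments.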

