

Audio-Aware Large Language Models as Judges for Speaking Styles

June 6, 2025
作者: Cheng-Han Chiang, Xiaofei Wang, Chung-Ching Lin, Kevin Lin, Linjie Li, Radu Kopetz, Yao Qian, Zhendong Wang, Zhengyuan Yang, Hung-yi Lee, Lijuan Wang
cs.AI

Abstract

Audio-aware large language models (ALLMs) can understand the textual and non-textual information in the audio input. In this paper, we explore using ALLMs as automatic judges to assess the speaking styles of speeches. We use ALLM judges to evaluate the speeches generated by spoken language models (SLMs) on two tasks: voice style instruction following and role-playing. The speaking styles we consider include emotion, volume, speaking pace, word emphasis, pitch control, and non-verbal elements. We use four SLMs to complete the two tasks and use humans and ALLMs to judge the SLMs' responses. We compare two ALLM judges, GPT-4o-audio and Gemini-2.5-pro, with human evaluation results and show that the agreement between Gemini and human judges is comparable to the agreement between human evaluators. These promising results show that ALLMs can be used as judges to evaluate SLMs. Our results also reveal that current SLMs, even GPT-4o-audio, still have room for improvement in controlling speaking styles and generating natural dialogues.
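The abstract does not specify the judging prompt, score scale, or agreement statistic, so the following is only a minimal sketch of the ALLM-as-judge setup it describes: it assumes a hypothetical 1–5 rubric, scores an SLM-generated speech clip with OpenAI's audio-input chat API (gpt-4o-audio-preview), and measures judge–human agreement with Cohen's kappa as a stand-in for whatever metric the paper actually uses.

```python
# Hedged sketch of an ALLM-as-judge pipeline (not the paper's exact protocol).
# Assumptions: a 1-5 rubric, WAV input, and Cohen's kappa as the agreement metric.
import base64
from openai import OpenAI
from sklearn.metrics import cohen_kappa_score

client = OpenAI()

JUDGE_PROMPT = (
    "You are judging the speaking style of the attached speech. "
    "Instruction given to the speaker: {instruction}\n"
    "Rate how well the speech follows the instructed style "
    "(emotion, volume, pace, emphasis, pitch, non-verbal elements) "
    "on a 1-5 scale. Reply with the number only."
)

def allm_judge_score(wav_path: str, instruction: str) -> int:
    """Ask an audio-aware LLM to rate one SLM response clip."""
    with open(wav_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode("utf-8")
    completion = client.chat.completions.create(
        model="gpt-4o-audio-preview",  # audio-input judge
        modalities=["text"],
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": JUDGE_PROMPT.format(instruction=instruction)},
                {"type": "input_audio",
                 "input_audio": {"data": audio_b64, "format": "wav"}},
            ],
        }],
    )
    return int(completion.choices[0].message.content.strip())

# Agreement between ALLM and human judges over a set of rated clips
# (illustrative scores only; the paper's data and metric may differ).
human_scores = [5, 3, 4, 2, 5]
allm_scores = [5, 3, 3, 2, 4]
print("Cohen's kappa:", cohen_kappa_score(human_scores, allm_scores))
```

In this framing, the ALLM judge is treated as one more rater, so the same agreement statistic can be computed ALLM-vs-human and human-vs-human, which is the comparison the abstract reports for Gemini-2.5-pro.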
