Audio-Aware Large Language Models as Judges for Speaking Styles

Abstract

Audio-aware large language models (ALLMs) can understand both the textual and non-textual information in audio input. In this paper, we explore using ALLMs as automatic judges to assess the speaking style of speech. We use ALLM judges to evaluate speech generated by spoken language models (SLMs) on two tasks: voice style instruction following and role-playing. The speaking style elements we consider include emotion, volume, speaking pace, word emphasis, pitch control, and non-verbal elements. We use four SLMs to complete the two tasks and have both humans and ALLMs judge the SLMs' responses. We compare two ALLM judges, GPT-4o-audio and Gemini-2.5-pro, with human evaluation results and show that the agreement between Gemini and human judges is comparable to the agreement between human evaluators. These promising results show that ALLMs can be used as judges to evaluate SLMs. Our results also reveal that current SLMs, even GPT-4o-audio, still have room for improvement in controlling speaking style and generating natural dialogues.
