Evaluating Speaking Styles with Audio-Aware Large Language Models

Abstract

Audio-aware large language models (ALLMs) can understand both the textual and non-textual information in audio input. In this paper, we explore using ALLMs as automatic judges to assess the speaking style of speech. We use ALLM judges to evaluate speech generated by spoken language models (SLMs) on two tasks: voice style instruction following and role-playing. The speaking-style attributes we consider include emotion, volume, speaking pace, word emphasis, pitch control, and non-verbal elements. We use four SLMs to complete the two tasks and have both humans and ALLMs judge the SLMs' responses. We compare two ALLM judges, GPT-4o-audio and Gemini-2.5-pro, against human evaluation results and show that the agreement between Gemini and human judges is comparable to the agreement among human evaluators. These promising results suggest that ALLMs can serve as judges for evaluating SLMs. Our results also reveal that current SLMs, even GPT-4o-audio, still have room for improvement in controlling speaking style and generating natural dialogue.
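
The comparison described in the abstract, judge-human agreement set against human-human agreement, can be illustrated with a minimal sketch. The abstract does not say which agreement metric is used, so Pearson correlation is an assumption here. The function names and the rating data are hypothetical and do not come from the paper.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def human_human_agreement(human_scores):
    """Mean pairwise correlation across all pairs of human raters."""
    pairs = list(combinations(human_scores, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


def judge_human_agreement(judge_scores, human_scores):
    """Mean correlation between the automatic judge and each human rater."""
    return sum(pearson(judge_scores, h) for h in human_scores) / len(human_scores)


# Hypothetical 1-5 style ratings for six SLM responses (illustrative only,
# not data from the paper). Each inner list is one human rater.
humans = [
    [1, 2, 3, 4, 5, 3],
    [2, 2, 3, 5, 4, 3],
    [1, 3, 3, 4, 5, 2],
]
# Hypothetical scores from an ALLM judge for the same six responses.
judge = [1, 2, 4, 4, 5, 3]

hh = human_human_agreement(humans)
jh = judge_human_agreement(judge, humans)
print(f"human-human: {hh:.3f}, judge-human: {jh:.3f}")
```

Under this reading, an ALLM judge would count as comparable to human evaluators when its judge-human value falls close to the human-human value computed on the same responses.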