S2S-Arena, Evaluating Speech2Speech Protocols on Instruction Following with Paralinguistic Information

Abstract

The rapid development of large language models (LLMs) has drawn significant attention to speech models, particularly to recent progress in speech2speech protocols that support both speech input and speech output. However, existing benchmarks rely on automatic text-based evaluators to assess the instruction-following ability of these models, and therefore fail to account for paralinguistic information in both speech understanding and speech generation. To address this issue, we introduce S2S-Arena, a novel arena-style S2S benchmark that evaluates instruction-following capabilities with paralinguistic information in both speech-in and speech-out across real-world tasks. We design 154 samples that combine TTS and live recordings, covering 21 tasks in four domains, and manually evaluate existing popular speech models in an arena-style manner. The experimental results show that: (1) apart from the superior performance of GPT-4o, speech models that cascade ASR, LLM, and TTS outperform jointly trained models after text-speech alignment in speech2speech protocols; (2) when paralinguistic information is considered, the knowledgeability of a speech model depends mainly on its LLM backbone, while its multilingual support is limited by the speech module; (3) excellent speech models can already understand the paralinguistic information in speech input, but generating appropriate audio with paralinguistic information remains a challenge.
