EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge

Abstract

Text-to-Speech (TTS) benchmarks often fail to capture how well models handle nuanced and semantically complex text. Building on EmergentTTS, we introduce EmergentTTS-Eval, a comprehensive benchmark covering six challenging TTS scenarios: emotions, paralinguistics, foreign words, syntactic complexity, complex pronunciation (e.g., URLs, formulas), and questions. Crucially, our framework automates both test-case generation and evaluation, making the benchmark easily extensible. Starting from a small set of human-written seed prompts, we iteratively extend them using LLMs to target specific structural, phonetic, and prosodic challenges, resulting in 1,645 diverse test cases. Moreover, we employ a model-as-a-judge approach, using a Large Audio Language Model (LALM) to assess the speech across multiple dimensions, such as expressed emotion, prosody, intonation, and pronunciation accuracy. We evaluate state-of-the-art open-source and proprietary TTS systems, such as 11Labs, Deepgram, and OpenAI's 4o-mini-TTS, on EmergentTTS-Eval, demonstrating its ability to reveal fine-grained performance differences. Results show that the model-as-a-judge approach offers robust TTS assessment and a high correlation with human preferences. We open-source the evaluation code at https://github.com/boson-ai/EmergentTTS-Eval-public and the dataset at https://huggingface.co/datasets/bosonai/EmergentTTS-Eval.
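
The abstract reports a high correlation between the judge's assessments and human preferences, but it does not say which statistic is used. The following is a minimal sketch of one common way to check such agreement, assuming a Spearman rank correlation over per-system scores. The function names and the example numbers are hypothetical and purely illustrative; they are not results or code from EmergentTTS-Eval.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value ties with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(average_ranks(x), average_ranks(y))


# Illustrative (made-up) per-system scores, NOT figures from the paper:
# the judge's win rate for each TTS system and the share of human raters
# preferring that system.
judge_win_rate = [0.62, 0.48, 0.55, 0.30]
human_preference = [0.60, 0.50, 0.52, 0.35]

rho = spearman(judge_win_rate, human_preference)
print(f"Spearman rho between judge and human rankings: {rho:.2f}")
```

A rank-based statistic is used in this sketch because it only asks whether the judge and the human raters order the systems the same way, which makes it insensitive to differences in score scale between the two.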
