EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge

Abstract

Text-to-Speech (TTS) benchmarks often fail to capture how well models handle nuanced and semantically complex text. Building on EmergentTTS, we introduce EmergentTTS-Eval, a comprehensive benchmark covering six challenging TTS scenarios: emotions, paralinguistics, foreign words, syntactic complexity, complex pronunciation (e.g. URLs, formulas), and questions. Crucially, our framework automates both test-case generation and evaluation, making the benchmark easily extensible. Starting from a small set of human-written seed prompts, we iteratively extend them using LLMs to target specific structural, phonetic, and prosodic challenges, resulting in 1,645 diverse test cases. Moreover, we employ a model-as-a-judge approach, using a Large Audio Language Model (LALM) to assess the speech across multiple dimensions, such as expressed emotion, prosody, intonation, and pronunciation accuracy. We evaluate state-of-the-art open-source and proprietary TTS systems, such as 11Labs, Deepgram, and OpenAI's 4o-mini-TTS, on EmergentTTS-Eval, demonstrating its ability to reveal fine-grained performance differences. Results show that the model-as-a-judge approach offers robust TTS assessment and a high correlation with human preferences. We open-source the evaluation code (https://github.com/boson-ai/EmergentTTS-Eval-public) and the dataset (https://huggingface.co/datasets/bosonai/EmergentTTS-Eval).
