EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge
May 29, 2025
作者: Ruskin Raj Manku, Yuzhi Tang, Xingjian Shi, Mu Li, Alex Smola
cs.AI
Abstract
Text-to-Speech (TTS) benchmarks often fail to capture how well models handle
nuanced and semantically complex text. Building on EmergentTTS, we
introduce EmergentTTS-Eval, a comprehensive benchmark covering six
challenging TTS scenarios: emotions, paralinguistics, foreign words, syntactic
complexity, complex pronunciation (e.g. URLs, formulas), and questions.
Crucially, our framework automates both test-case generation and evaluation,
making the benchmark easily extensible. Starting from a small set of
human-written seed prompts, we iteratively extend them using LLMs to target
specific structural, phonetic and prosodic challenges, resulting in 1,645
diverse test cases. Moreover, we employ a model-as-a-judge approach, using a
Large Audio Language Model (LALM) to assess the speech along multiple
dimensions, such as expressed emotion and prosodic, intonational, and
pronunciation accuracy. We evaluate state-of-the-art open-source and proprietary TTS systems,
such as 11Labs, Deepgram, and OpenAI's 4o-mini-TTS, on EmergentTTS-Eval,
demonstrating its ability to reveal fine-grained performance differences.
Results show that the model-as-a-judge approach offers robust TTS assessment
and a high correlation with human preferences. We open source the evaluation
code (https://github.com/boson-ai/EmergentTTS-Eval-public) and the
dataset (https://huggingface.co/datasets/bosonai/EmergentTTS-Eval).
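
The abstract describes two automated stages: iterative LLM-based expansion of human-written seed prompts into harder test cases, and model-as-a-judge scoring of the resulting speech with a Large Audio Language Model. The sketch below is a minimal illustration of how such a pipeline could be wired together; `call_text_llm`, `call_audio_llm_judge`, the prompt wording, and the scoring rubric are hypothetical placeholders, not the paper's actual implementation.

```python
# Minimal sketch (illustrative, not the paper's code) of the two automated
# stages named in the abstract: test-case expansion and LALM-based judging.
from dataclasses import dataclass


@dataclass
class TestCase:
    category: str  # e.g. "emotions", "paralinguistics", "questions"
    text: str      # the text the TTS system must render
    depth: int     # how many expansion rounds produced it


def call_text_llm(prompt: str) -> str:
    """Hypothetical wrapper around a text-LLM completion endpoint."""
    raise NotImplementedError


def call_audio_llm_judge(prompt: str, audio_a: bytes, audio_b: bytes) -> str:
    """Hypothetical wrapper around a Large Audio Language Model (LALM)."""
    raise NotImplementedError


def expand_seed(seed: TestCase, rounds: int = 3) -> list[TestCase]:
    """Iteratively rewrite a human-written seed into progressively harder cases."""
    cases, current = [seed], seed
    for depth in range(1, rounds + 1):
        harder = call_text_llm(
            f"Rewrite this '{current.category}' TTS test sentence so it is more "
            f"structurally, phonetically, and prosodically challenging, while "
            f"remaining natural text:\n{current.text}"
        )
        current = TestCase(seed.category, harder, depth)
        cases.append(current)
    return cases


def judge_pair(case: TestCase, audio_a: bytes, audio_b: bytes) -> str:
    """Ask the LALM judge which rendition better realizes the intended emotion,
    prosody, intonation, and pronunciation for this test case."""
    rubric = (
        "You are given two speech renditions (A and B) of the text below. "
        "Compare expressed emotion, prosody, intonation, and pronunciation "
        "accuracy, then answer with 'A', 'B', or 'tie'.\n"
        f"Category: {case.category}\nText: {case.text}"
    )
    return call_audio_llm_judge(rubric, audio_a, audio_b)
```

In this sketch the judge returns a pairwise preference; aggregating such preferences across the benchmark's test cases is one way a leaderboard-style comparison of TTS systems could be computed, though the paper's exact scoring protocol may differ.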