EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge

Abstract

Text-to-Speech (TTS) benchmarks often fail to capture how well models handle nuanced and semantically complex text. Building on EmergentTTS, we introduce EmergentTTS-Eval, a comprehensive benchmark covering six challenging TTS scenarios: emotions, paralinguistics, foreign words, syntactic complexity, complex pronunciation (e.g., URLs, formulas), and questions. Crucially, our framework automates both test-case generation and evaluation, making the benchmark easily extensible. Starting from a small set of human-written seed prompts, we iteratively extend them using large language models (LLMs) to target specific structural, phonetic, and prosodic challenges, resulting in 1,645 diverse test cases. Moreover, we employ a model-as-a-judge approach, using a Large Audio Language Model (LALM) to assess the speech along multiple dimensions, including expressed emotion and prosodic, intonational, and pronunciation accuracy. We evaluate state-of-the-art open-source and proprietary TTS systems, such as 11Labs, Deepgram, and OpenAI's 4o-mini-TTS, on EmergentTTS-Eval, demonstrating its ability to reveal fine-grained performance differences. Results show that the model-as-a-judge approach offers robust TTS assessment and correlates highly with human preferences. We open-source the evaluation code (https://github.com/boson-ai/EmergentTTS-Eval-public) and the dataset (https://huggingface.co/datasets/bosonai/EmergentTTS-Eval).
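As a rough illustration of the iterative test-case generation described in the abstract, the sketch below expands a few seed prompts over several refinement rounds for each of the six categories. It is not the released implementation: the function names `expand_seeds` and `toy_rewrite`, the category identifiers, the record fields, and the fixed `depth` parameter are all assumptions made for this example, and the LLM call is replaced by a deterministic stand-in.

```python
from typing import Callable, Dict, List

# The six scenarios named in the abstract (identifier spellings are assumed).
CATEGORIES = [
    "emotions",
    "paralinguistics",
    "foreign_words",
    "syntactic_complexity",
    "complex_pronunciation",
    "questions",
]


def expand_seeds(
    seeds: Dict[str, List[str]],
    rewrite: Callable[[str, str], str],
    depth: int,
) -> List[dict]:
    """Return every seed plus `depth` successive rewrites of it.

    `rewrite(category, text)` is a hypothetical hook; in practice it would
    prompt an LLM to make `text` harder along the category's challenge.
    """
    cases = []
    for category, prompts in seeds.items():
        for seed in prompts:
            text = seed
            cases.append({"category": category, "depth": 0, "text": text})
            for level in range(1, depth + 1):
                # Each round builds on the previous round's output.
                text = rewrite(category, text)
                cases.append({"category": category, "depth": level, "text": text})
    return cases


def toy_rewrite(category: str, text: str) -> str:
    # Deterministic stand-in for an LLM call, used only for this sketch.
    return f"{text} [harder:{category}]"


seeds = {c: [f"seed prompt for {c}"] for c in CATEGORIES}
cases = expand_seeds(seeds, toy_rewrite, depth=2)
print(len(cases))  # 6 categories x 1 seed x (1 seed + 2 rewrites) = 18
```

Replacing `toy_rewrite` with a real LLM call leaves the rest of the loop unchanged, which is the sense in which such a pipeline stays extensible: new categories or deeper refinement rounds only change the inputs.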