EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge
May 29, 2025
作者: Ruskin Raj Manku, Yuzhi Tang, Xingjian Shi, Mu Li, Alex Smola
cs.AI
Abstract
Text-to-Speech (TTS) benchmarks often fail to capture how well models handle
nuanced and semantically complex text. Building on EmergentTTS, we
introduce EmergentTTS-Eval, a comprehensive benchmark covering six
challenging TTS scenarios: emotions, paralinguistics, foreign words, syntactic
complexity, complex pronunciation (e.g. URLs, formulas), and questions.
Crucially, our framework automates both test-case generation and evaluation,
making the benchmark easily extensible. Starting from a small set of
human-written seed prompts, we iteratively extend them using LLMs to target
specific structural, phonetic and prosodic challenges, resulting in 1,645
diverse test cases. Moreover, we employ a model-as-a-judge approach, using a
Large Audio Language Model (LALM) to assess the speech along multiple
dimensions, such as expressed emotion and prosodic, intonational, and
pronunciation accuracy. We evaluate state-of-the-art open-source and proprietary TTS systems,
such as 11Labs, Deepgram, and OpenAI's 4o-mini-TTS, on EmergentTTS-Eval,
demonstrating its ability to reveal fine-grained performance differences.
Results show that the model-as-a-judge approach offers robust TTS assessment
and a high correlation with human preferences. We open source the evaluation
code (https://github.com/boson-ai/EmergentTTS-Eval-public) and the
dataset (https://huggingface.co/datasets/bosonai/EmergentTTS-Eval).
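
The abstract describes two automated stages: iterative LLM-based expansion of human-written seed prompts into harder test cases, and model-as-a-judge scoring of the resulting speech with a Large Audio Language Model. The sketch below is a minimal illustration of how such a pipeline could be wired together; `call_text_llm`, `call_audio_llm_judge`, the prompt wording, and the scoring rubric are hypothetical placeholders, not the paper's actual implementation.

```python
# Minimal sketch (illustrative, not the paper's code) of the two automated
# stages named in the abstract: test-case expansion and LALM-based judging.
from dataclasses import dataclass


@dataclass
class TestCase:
    category: str  # e.g. "emotions", "paralinguistics", "questions"
    text: str      # the text the TTS system must render
    depth: int     # how many expansion rounds produced it


def call_text_llm(prompt: str) -> str:
    """Hypothetical wrapper around a text-LLM completion endpoint."""
    raise NotImplementedError


def call_audio_llm_judge(prompt: str, audio_a: bytes, audio_b: bytes) -> str:
    """Hypothetical wrapper around a Large Audio Language Model (LALM)."""
    raise NotImplementedError


def expand_seed(seed: TestCase, rounds: int = 3) -> list[TestCase]:
    """Iteratively rewrite a human-written seed into progressively harder cases."""
    cases, current = [seed], seed
    for depth in range(1, rounds + 1):
        harder = call_text_llm(
            f"Rewrite this '{current.category}' TTS test sentence so it is more "
            f"structurally, phonetically, and prosodically challenging, while "
            f"remaining natural text:\n{current.text}"
        )
        current = TestCase(seed.category, harder, depth)
        cases.append(current)
    return cases


def judge_pair(case: TestCase, audio_a: bytes, audio_b: bytes) -> str:
    """Ask the LALM judge which rendition better realizes the intended emotion,
    prosody, intonation, and pronunciation for this test case."""
    rubric = (
        "You are given two speech renditions (A and B) of the text below. "
        "Compare expressed emotion, prosody, intonation, and pronunciation "
        "accuracy, then answer with 'A', 'B', or 'tie'.\n"
        f"Category: {case.category}\nText: {case.text}"
    )
    return call_audio_llm_judge(rubric, audio_a, audio_b)
```

In this sketch the judge returns a pairwise preference; aggregating such preferences across the benchmark's test cases is one way a leaderboard-style comparison of TTS systems could be computed, though the paper's exact scoring protocol may differ.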