EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge

Abstract

Text-to-Speech (TTS) benchmarks often fail to capture how well models handle nuanced and semantically complex text. Building on EmergentTTS, we introduce EmergentTTS-Eval, a comprehensive benchmark covering six challenging TTS scenarios: emotions, paralinguistics, foreign words, syntactic complexity, complex pronunciation (e.g., URLs, formulas), and questions. Crucially, our framework automates both test-case generation and evaluation, making the benchmark easily extensible. Starting from a small set of human-written seed prompts, we iteratively extend them using LLMs to target specific structural, phonetic, and prosodic challenges, resulting in 1,645 diverse test cases. Moreover, we employ a model-as-a-judge approach, using a Large Audio Language Model (LALM) to assess the speech along multiple dimensions, such as expressed emotion, prosody, intonation, and pronunciation accuracy. We evaluate state-of-the-art open-source and proprietary TTS systems, such as 11Labs, Deepgram, and OpenAI's 4o-mini-TTS, on EmergentTTS-Eval, demonstrating its ability to reveal fine-grained performance differences. Results show that the model-as-a-judge approach offers robust TTS assessment and correlates highly with human preferences. We open-source the evaluation code (https://github.com/boson-ai/EmergentTTS-Eval-public) and the dataset (https://huggingface.co/datasets/bosonai/EmergentTTS-Eval).
