Preferences of a Voice-First Nation: Large-Scale Pairwise Evaluation and Preference Analysis for TTS in Indian Languages

Abstract

Crowdsourced pairwise evaluation has emerged as a scalable approach for assessing foundation models. However, applying it to text-to-speech (TTS) introduces high variance because of linguistic diversity and the multidimensional nature of speech perception. We present a controlled, multidimensional pairwise evaluation framework for multilingual TTS that combines linguistic control with perceptually grounded annotation. Using more than 5,000 native and code-mixed sentences across 10 Indic languages, we evaluate 7 state-of-the-art TTS systems and collect over 120,000 pairwise comparisons from more than 1,900 native-speaker raters. In addition to overall preference, raters provide judgments along 6 perceptual dimensions: intelligibility, expressiveness, voice quality, liveliness, noise, and hallucinations. Using Bradley-Terry modeling, we construct a multilingual leaderboard, interpret human preferences with SHAP analysis, and analyze leaderboard reliability alongside model strengths and trade-offs across the perceptual dimensions.
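As an illustration of how a Bradley-Terry leaderboard can be derived from pairwise preferences, the following is a minimal sketch using the standard minorization-maximization (MM) update. It is not the authors' implementation: the function name `fit_bradley_terry` and the toy win counts are assumptions made for this example and are not data from the study.

```python
import numpy as np


def fit_bradley_terry(wins, n_iter=500, tol=1e-12):
    """Estimate Bradley-Terry strengths from a win-count matrix.

    wins[i, j] is the number of comparisons in which system i was
    preferred over system j (hypothetical input format).
    Returns strengths normalized to sum to 1.
    """
    wins = np.asarray(wins, dtype=float)
    n = wins.shape[0]
    games = wins + wins.T            # total comparisons for each pair
    total_wins = wins.sum(axis=1)    # total wins for each system
    p = np.full(n, 1.0 / n)          # start from equal strengths
    for _ in range(n_iter):
        # MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j)
        denom = games / (p[:, None] + p[None, :])
        np.fill_diagonal(denom, 0.0)
        new_p = total_wins / denom.sum(axis=1)
        new_p /= new_p.sum()
        converged = np.max(np.abs(new_p - p)) < tol
        p = new_p
        if converged:
            break
    return p


# Toy example with three hypothetical systems, 10 comparisons per pair.
toy_wins = [
    [0, 8, 9],
    [2, 0, 7],
    [1, 3, 0],
]
strengths = fit_bradley_terry(toy_wins)
ranking = list(np.argsort(-strengths))  # system indices, strongest first
```

Under the model, the probability that system i is preferred over system j is p_i / (p_i + p_j), so sorting the fitted strengths yields a leaderboard. The sketch assumes every system has at least one win and that the comparison graph is connected; otherwise the maximum-likelihood estimate is not well defined.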

