Preferences of a Voice-First Nation: Large-Scale Pairwise Evaluation and Preference Analysis of TTS for Indic Languages

Abstract

Crowdsourced pairwise evaluation has emerged as a scalable approach for assessing foundation models. However, applying it to text-to-speech (TTS) introduces high variance because of linguistic diversity and the multidimensional nature of speech perception. We present a controlled, multidimensional pairwise evaluation framework for multilingual TTS that combines linguistic control with perceptually grounded annotation. Using more than 5,000 native and code-mixed sentences across 10 Indic languages, we evaluate 7 state-of-the-art TTS systems and collect over 120,000 pairwise comparisons from more than 1,900 native raters. In addition to an overall preference, raters provide judgments along 6 perceptual dimensions: intelligibility, expressiveness, voice quality, liveliness, noise, and hallucinations. Using Bradley-Terry modeling, we construct a multilingual leaderboard, interpret human preferences through SHAP analysis, and analyze leaderboard reliability alongside model strengths and trade-offs across the perceptual dimensions.
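
As an illustration of how pairwise preferences can be turned into a leaderboard, the following is a minimal sketch of a Bradley-Terry fit using the standard minorization-maximization update. It is not the authors' implementation. The function name `fit_bradley_terry`, the three placeholder systems, and the win counts are hypothetical and chosen only for the example. Under the Bradley-Terry model, system i is preferred over system j with probability p_i / (p_i + p_j).

```python
def fit_bradley_terry(wins, n_iter=500, tol=1e-12):
    """Estimate Bradley-Terry strengths from a matrix of pairwise win counts.

    wins[i][j] is the number of comparisons in which system i was
    preferred over system j. Returns strengths normalized to sum to 1.
    """
    n = len(wins)
    p = [1.0 / n] * n  # start from equal strengths
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Each pair contributes (comparisons between i and j) / (p_i + p_j)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        scale = sum(new_p)
        new_p = [x / scale for x in new_p]
        converged = max(abs(a - b) for a, b in zip(p, new_p)) < tol
        p = new_p
        if converged:
            break
    return p


# Hypothetical toy data: three placeholder systems, 100 comparisons per pair.
systems = ["system_a", "system_b", "system_c"]
wins = [
    [0, 60, 80],  # system_a preferred over b 60 times, over c 80 times
    [40, 0, 70],
    [20, 30, 0],
]
strengths = fit_bradley_terry(wins)
leaderboard = sorted(zip(systems, strengths), key=lambda t: t[1], reverse=True)
for name, strength in leaderboard:
    print(f"{name}: {strength:.3f}")
```

In practice, a separate fit per language (or a pooled fit with bootstrap resampling of comparisons) is one way to examine how stable such a ranking is; whether the paper does this is not stated in the abstract.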